[jira] [Commented] (SPARK-4118) Create python bindings for Streaming KMeans

2015-05-27 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560672#comment-14560672
 ] 

Manoj Kumar commented on SPARK-4118:


[~mengxr] Hi, can this be assigned to me? 

 Create python bindings for Streaming KMeans
 ---

 Key: SPARK-4118
 URL: https://issues.apache.org/jira/browse/SPARK-4118
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark, Streaming
Reporter: Anant Daksh Asthana
Priority: Minor

 Create Python bindings for Streaming K-means
 This is in reference to https://issues.apache.org/jira/browse/SPARK-3254
 which adds Streaming K-means functionality to MLlib.
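
For reference, a minimal sketch of the existing Scala API from SPARK-3254 that a Python binding would presumably mirror (assuming an existing SparkContext sc, as in spark-shell; the input path and parsing are illustrative):

{code}
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))

// Each input line is a whitespace-separated feature vector.
val trainingData = ssc.textFileStream("/tmp/training")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

// Two clusters, no forgetting (decayFactor = 1.0), random initial
// centers of dimension 3 with weight 0.
val model = new StreamingKMeans()
  .setK(2)
  .setDecayFactor(1.0)
  .setRandomCenters(3, 0.0)

model.trainOn(trainingData) // cluster centers update as each batch arrives
ssc.start()
{code}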



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7892) Python class in __main__ may trigger AssertionError

2015-05-27 Thread flykobe cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

flykobe cheng closed SPARK-7892.

Resolution: Duplicate

 Python class in __main__ may trigger AssertionError
 ---

 Key: SPARK-7892
 URL: https://issues.apache.org/jira/browse/SPARK-7892
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
 Environment: Linux, Python 2.7.3
 pickled by Python pickle Lib
Reporter: flykobe cheng
Priority: Minor

 Callback functions for Spark transformations and actions are pickled. If the 
 callback is an instancemethod of a class defined in the __main__ module, and 
 the class has more than one instancemethod that uses class properties or 
 classmethods, the class will be pickled twice and 'pickle.memoize'd twice, 
 triggering an AssertionError.
 Demo code:
 import logging
 import sys
 import pyspark

 class AClass(object):
     _class_var = {'classkey': 'classval', }

     def main_object_method(self, item):
         logging.warn('class var by %s: %s' % (
             sys._getframe().f_code.co_name, AClass._class_var['classkey']))

     def main_object_method2(self, item):
         logging.warn('class var by %s: %s' % (
             sys._getframe().f_code.co_name, AClass._class_var['classkey']))

 def test_main_object_method(sc):
     obj = AClass()
     res = sc.parallelize(range(4)).map(obj.main_object_method).collect()

 if __name__ == '__main__':
     cf = pyspark.SparkConf()
     cf.set('spark.cores.max', 1)
     sc = pyspark.SparkContext(appName='flykobe_demo_pickle_error', conf=cf)
     test_main_object_method(sc)
 Traceback:
   File "/home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py", line 310, in save_function_tuple
     save(f_globals)
   File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 291, in save
     f(self, obj) # Call unbound method with explicit self
   File "/home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py", line 174, in save_dict
     pickle.Pickler.save_dict(self, obj)
   File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 654, in save_dict
     self._batch_setitems(obj.iteritems())
   File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 686, in _batch_setitems
     save(v)
   File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 291, in save
     f(self, obj) # Call unbound method with explicit self
   File "/home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py", line 468, in save_global
     d),obj=obj)
   File "/home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py", line 638, in save_reduce
     self.memoize(obj)
   File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 248, in memoize
     assert id(obj) not in self.memo
 AssertionError
 Problem in Python/Lib/pickle.py:
 def memoize(self, obj):
     """Store an object in the memo."""
     if self.fast:
         return
     assert id(obj) not in self.memo
     memo_len = len(self.memo)
     self.write(self.put(memo_len))
     self.memo[id(obj)] = memo_len, obj



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7893) Complex Operators between Graphs

2015-05-27 Thread Andy Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Huang updated SPARK-7893:
--
Description: 
Currently there are 30+ operators in GraphX, but few of them consider operations between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph.

In many complex cases it will be helpful to operate between graphs directly, such as *union or join*, especially for streaming cases where a small graph merges into a big graph. Higher-level graph operators help users focus and think in terms of graphs.


As for a detailed list of complex graph operators, it can be found here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*.

  was:
Currently there are 30+ operators in GraphX, but few of them consider operations between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph.

In many complex cases it will be helpful to operate between graphs directly, such as *union or join*, especially for streaming cases where a small graph merges into a big graph. Higher-level graph operators help users focus and think in terms of graphs.


As for a detailed list of complex graph operators, it can be found here: [complex_graph_operations](http://techieme.in/complex-graph-operations/). This issue will focus on two frequently-used operators first: *union* and *join*.


 Complex Operators between Graphs
 

 Key: SPARK-7893
 URL: https://issues.apache.org/jira/browse/SPARK-7893
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang

 Currently there are 30+ operators in GraphX, but few of them consider operations between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph.
 In many complex cases it will be helpful to operate between graphs directly, such as *union or join*, especially for streaming cases where a small graph merges into a big graph. Higher-level graph operators help users focus and think in terms of graphs.
 As for a detailed list of complex graph operators, it can be found here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*.
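
For context, the one existing between-graph operator mentioned above can be used as follows (a minimal sketch; the helper name restrictTo is ours, while mask itself is the real GraphX API):

{code}
import scala.reflect.ClassTag
import org.apache.spark.graphx.Graph

// mask(other) keeps only the vertices and edges of g that also appear
// in other, preserving g's attributes.
def restrictTo[VD: ClassTag, ED: ClassTag, VD2: ClassTag, ED2: ClassTag](
    g: Graph[VD, ED], other: Graph[VD2, ED2]): Graph[VD, ED] =
  g.mask(other)
{code}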



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7697) Column with an unsigned int should be treated as long in JDBCRDD

2015-05-27 Thread Rene Treffer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560710#comment-14560710
 ] 

Rene Treffer commented on SPARK-7697:
-

I've had a similar problem, especially with unsigned bigint, since Java has no 
type for that. (It only fails if a value actually exceeds the Java long range.)

I worked around the problem by extending DriverQuirks, now JdbcDialects. The 
idea is that you can map problematic types to whatever you'd want:
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala#L395

I am mapping unsigned bigint to string in order to load it. This works, at the 
cost of some post-processing overhead (basically a UDF to map an unsigned long 
stored in a string back to a signed long).
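
For illustration, a minimal sketch of such a dialect against the JdbcDialects API (the "UNSIGNED" type-name check is an assumption about what the MySQL driver reports; verify against your driver):

{code}
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types.{DataType, MetadataBuilder, StringType}

// Map MySQL's unsigned BIGINT to StringType so out-of-range values can be
// loaded; convert back to a numeric type afterwards with a UDF.
object MySQLUnsignedDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:mysql")

  override def getCatalystType(
      sqlType: Int,
      typeName: String,
      size: Int,
      md: MetadataBuilder): Option[DataType] = {
    if (sqlType == Types.BIGINT && typeName.toUpperCase.contains("UNSIGNED")) {
      Some(StringType)
    } else {
      None // fall back to the default type mapping
    }
  }
}

JdbcDialects.registerDialect(MySQLUnsignedDialect)
{code}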

 Column with an unsigned int should be treated as long in JDBCRDD
 

 Key: SPARK-7697
 URL: https://issues.apache.org/jira/browse/SPARK-7697
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: DAITO Teppei
Assignee: Liang-Chi Hsieh
 Fix For: 1.4.0


 Columns with an unsigned numeric type in JDBC should be treated as the next 
 'larger' Java type
 in JDBCRDD#getCatalystType .
 https://github.com/apache/spark/blob/517eb37a85e0a28820bcfd5d98c50d02df6521c6/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JDBCRDD.scala#L49
 {code:title=q.sql}
 create table t1 (id int unsigned);
 insert into t1 values (4234567890);
 {code}
 {code:title=T1.scala}
 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.sql.SQLContext
 object T1 {
   def main(args: Array[String]) {
     val sc = new SparkContext(new SparkConf())
     val s = new SQLContext(sc)
     val url = "jdbc:mysql://localhost/test"
     val t1 = s.jdbc(url, "t1")
     t1.printSchema()
     t1.collect().foreach(println)
   }
 }
 {code}
 This code caused an error like the one below.
 {noformat}
 15/05/18 11:39:51 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, xxx): com.mysql.jdbc.exceptions.jdbc4.MySQLDataException: '4.23456789E9' in column '1' is outside valid range for the datatype INTEGER.
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
 at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
 at com.mysql.jdbc.Util.handleNewInstance(Util.java:377)
 at com.mysql.jdbc.Util.getInstance(Util.java:360)
 at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:963)
 at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:935)
 at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:924)
 at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:870)
 at com.mysql.jdbc.ResultSetImpl.throwRangeException(ResultSetImpl.java:7090)
 at com.mysql.jdbc.ResultSetImpl.parseIntAsDouble(ResultSetImpl.java:6364)
 at com.mysql.jdbc.ResultSetImpl.getInt(ResultSetImpl.java:2484)
 at org.apache.spark.sql.jdbc.JDBCRDD$$anon$1.getNext(JDBCRDD.scala:344)
 at org.apache.spark.sql.jdbc.JDBCRDD$$anon$1.hasNext(JDBCRDD.scala:399)
 ...
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7893) Complex Operators between Graphs

2015-05-27 Thread Andy Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Huang updated SPARK-7893:
--
Description: 
Currently there are 30+ operators in GraphX, but few of them consider operations between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph.

In many complex cases it will be helpful to operate between graphs directly, such as *union or join*, especially for streaming cases where a small graph merges into a big graph. Higher-level graph operators help users focus and think in terms of graphs.


As for a detailed list of complex graph operators, it can be found here: [complex_graph_operations](http://techieme.in/complex-graph-operations/). This issue will focus on two frequently-used operators first: *union* and *join*.

  was:
Currently there are 30+ operators in GraphX, but few of them consider operations between graphs. The only one is mask, which takes another graph as a parameter and returns a new graph.

In many complex cases it will be helpful to operate between graphs directly, such as union or join, especially for streaming cases where a small graph merges into a big graph. Higher-level graph operators help users focus and think in terms of graphs.

As for a detailed list of complex graph operators, it can be found here: complex_graph_operations. This issue will focus on two frequently-used operators first: union and join.


 Complex Operators between Graphs
 

 Key: SPARK-7893
 URL: https://issues.apache.org/jira/browse/SPARK-7893
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang

 Currently there are 30+ operators in GraphX, but few of them consider operations between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph.
 In many complex cases it will be helpful to operate between graphs directly, such as *union or join*, especially for streaming cases where a small graph merges into a big graph. Higher-level graph operators help users focus and think in terms of graphs.
 As for a detailed list of complex graph operators, it can be found here: [complex_graph_operations](http://techieme.in/complex-graph-operations/). This issue will focus on two frequently-used operators first: *union* and *join*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7894) Graph Union Operator

2015-05-27 Thread Andy Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Huang updated SPARK-7894:
--
Description: 
This operator aims to union two graphs and generate a new graph directly. The union of two graphs is the union of their vertex sets and their edge families. Vertices and edges which are included in either graph will be part of the new graph.

bq.  G ∪ H = (VG ∪ VH, EG ∪ EH).

The image below shows a union of graph G and graph H.

!union_operator.png|width=600px,align=center!

A simple interface would be:

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]


However, overlapping vertices and edges will inevitably occur at the borders between graphs. For vertices, it is natural to simply take the union and remove the duplicates. But for edges, a mergeEdges function seems more reasonable.

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED]


  was:
This operator aims to union two graphs and generate a new graph directly. The union of two graphs is the union of their vertex sets and their edge families. Vertices and edges which are included in either graph will be part of the new graph.

bq.  G ∪ H = (VG ∪ VH, EG ∪ EH).

The image below shows a union of graph G and graph H.

!union_operator.png|width=600px,align=center!

A simple interface would be:

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]


However, overlapping vertices and edges will inevitably occur at the borders between graphs. It is necessary to consider how to handle this case for both vertices and edges. For vertices, it is natural to simply take the union and remove the duplicates. But for edges, a mergeEdges function seems more reasonable.

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED]



 Graph Union Operator
 

 Key: SPARK-7894
 URL: https://issues.apache.org/jira/browse/SPARK-7894
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang
  Labels: graph, union
 Attachments: union_operator.png


 This operator aims to union two graphs and generate a new graph directly. The union of two graphs is the union of their vertex sets and their edge families. Vertices and edges which are included in either graph will be part of the new graph.
 bq.  G ∪ H = (VG ∪ VH, EG ∪ EH).
 The image below shows a union of graph G and graph H.
 !union_operator.png|width=600px,align=center!
 A simple interface would be:
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]
 However, overlapping vertices and edges will inevitably occur at the borders between graphs. For vertices, it is natural to simply take the union and remove the duplicates. But for edges, a mergeEdges function seems more reasonable.
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED]
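
A minimal sketch of how such a union could be assembled from existing GraphX and RDD primitives (the method shape follows the proposed signature above; keeping one attribute arbitrarily for duplicate vertices is our assumption, not part of the proposal):

{code}
import scala.reflect.ClassTag
import org.apache.spark.graphx.{Edge, Graph}

// Union of g and h: duplicate vertices keep one attribute; overlapping
// edges (same src/dst pair) are combined with mergeEdges.
def union[VD: ClassTag, ED: ClassTag](
    g: Graph[VD, ED],
    h: Graph[VD, ED],
    mergeEdges: (ED, ED) => ED): Graph[VD, ED] = {
  val vertices = g.vertices.union(h.vertices).reduceByKey((a, _) => a)
  val edges = g.edges.union(h.edges)
    .map(e => ((e.srcId, e.dstId), e.attr))
    .reduceByKey(mergeEdges)
    .map { case ((src, dst), attr) => Edge(src, dst, attr) }
  Graph(vertices, edges)
}
{code}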



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7893) Complex Operators between Graphs

2015-05-27 Thread Andy Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Huang updated SPARK-7893:
--
Description: 
Currently there are 30+ operators in GraphX, while few of them consider operations between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph.

In many complex cases, such as _*streaming graphs, or merging a small graph into a huge graph*_, higher-level graph operators help users focus and think in terms of graphs.


A list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*.

  was:
Currently there are 30+ operators in GraphX, while few of them consider operations between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph.

In many complex cases, such as _*streaming graphs, or merging a small graph into a big graph*_, complex operators will be helpful for operating between graphs directly. Higher-level graph operators help users focus and think in terms of graphs.


A list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*.


 Complex Operators between Graphs
 

 Key: SPARK-7893
 URL: https://issues.apache.org/jira/browse/SPARK-7893
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang
  Labels: complex, graph, join, operators, union

 Currently there are 30+ operators in GraphX, while few of them consider operations between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph.
 In many complex cases, such as _*streaming graphs, or merging a small graph into a huge graph*_, higher-level graph operators help users focus and think in terms of graphs.
 A list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7887) Remove EvaluatedType from SQL Expression

2015-05-27 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-7887.

   Resolution: Fixed
Fix Version/s: 1.5.0

 Remove EvaluatedType from SQL Expression
 

 Key: SPARK-7887
 URL: https://issues.apache.org/jira/browse/SPARK-7887
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 1.5.0


 It's not a very useful type to use. We can just remove it to simplify 
 expressions slightly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-6590) Make DataFrame.where accept a string conditionExpr

2015-05-27 Thread Adrian Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrian Wang closed SPARK-6590.
--
Resolution: Won't Fix

https://github.com/apache/spark/pull/6429#issuecomment-105788726
from Reynold.

 Make DataFrame.where accept a string conditionExpr
 --

 Key: SPARK-6590
 URL: https://issues.apache.org/jira/browse/SPARK-6590
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Yin Huai
Priority: Minor

 In our docs we say that where is an alias of filter. However, where does not 
 accept a conditionExpr given as a string.
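
To make the asymmetry concrete, a sketch against the Spark 1.3-era Scala DataFrame API (the DataFrame df and its age column are assumed for illustration):

{code}
import org.apache.spark.sql.DataFrame

// filter has a String overload; where only takes a Column (Spark 1.3).
def demo(df: DataFrame): Unit = {
  val a = df.filter("age > 21")    // compiles: filter(conditionExpr: String)
  val b = df.where(df("age") > 21) // compiles: where(condition: Column)
  // val c = df.where("age > 21")  // no such overload in 1.3; this issue asked for one
}
{code}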



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7892) Python class in __main__ may trigger AssertionError

2015-05-27 Thread flykobe cheng (JIRA)
flykobe cheng created SPARK-7892:


 Summary: Python class in __main__ may trigger AssertionError
 Key: SPARK-7892
 URL: https://issues.apache.org/jira/browse/SPARK-7892
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
 Environment: Linux, Python 2.7.3
pickled by Python pickle Lib
Reporter: flykobe cheng
Priority: Minor


Callback functions for Spark transformations and actions are pickled. If the 
callback is an instancemethod of a class defined in the __main__ module, and 
the class has more than one instancemethod that uses class properties or 
classmethods, the class will be pickled twice and 'pickle.memoize'd twice, 
triggering an AssertionError.

Demo code:
import logging
import sys
import pyspark

class AClass(object):
    _class_var = {'classkey': 'classval', }

    def main_object_method(self, item):
        logging.warn('class var by %s: %s' % (
            sys._getframe().f_code.co_name, AClass._class_var['classkey']))

    def main_object_method2(self, item):
        logging.warn('class var by %s: %s' % (
            sys._getframe().f_code.co_name, AClass._class_var['classkey']))

def test_main_object_method(sc):
    obj = AClass()
    res = sc.parallelize(range(4)).map(obj.main_object_method).collect()

if __name__ == '__main__':
    cf = pyspark.SparkConf()
    cf.set('spark.cores.max', 1)
    sc = pyspark.SparkContext(appName='flykobe_demo_pickle_error', conf=cf)
    test_main_object_method(sc)


Traceback:
  File "/home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py", line 310, in save_function_tuple
    save(f_globals)
  File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 291, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py", line 174, in save_dict
    pickle.Pickler.save_dict(self, obj)
  File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 654, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 686, in _batch_setitems
    save(v)
  File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 291, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py", line 468, in save_global
    d),obj=obj)
  File "/home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py", line 638, in save_reduce
    self.memoize(obj)
  File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 248, in memoize
    assert id(obj) not in self.memo
AssertionError


Problem in Python/Lib/pickle.py:
def memoize(self, obj):
    """Store an object in the memo."""
    if self.fast:
        return
    assert id(obj) not in self.memo
    memo_len = len(self.memo)
    self.write(self.put(memo_len))
    self.memo[id(obj)] = memo_len, obj



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7893) Complex Operators between Graphs

2015-05-27 Thread Andy Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Huang updated SPARK-7893:
--
Description: 
Currently there are 30+ operators in GraphX, but few of them consider operations between graphs. The only one is ***mask***, which takes another graph as a parameter and returns a new graph.

In many complex cases it will be helpful to operate between graphs directly, such as **union or join**, especially for streaming cases where a small graph merges into a big graph. Higher-level graph operators help users focus and think in terms of graphs.

As for a detailed list of complex graph operators, it can be found here: [complex_graph_operations](http://techieme.in/complex-graph-operations/). This issue will focus on two frequently-used operators first: **union** and **join**.

  was:
Currently there are 30+ operators in GraphX, but few of them consider operations between graphs. The only one is mask, which takes another graph as a parameter and returns a new graph.

In many complex cases it will be helpful to operate between graphs directly, such as union or join, especially for streaming cases where a small graph merges into a big graph. Higher-level graph operators help users focus and think in terms of graphs.

As for a detailed list of complex graph operators, it can be found here: complex_graph_operations. We will focus on two frequently-used operators first: union and join.


 Complex Operators between Graphs
 

 Key: SPARK-7893
 URL: https://issues.apache.org/jira/browse/SPARK-7893
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang

 Currently there are 30+ operators in GraphX, but few of them consider operations between graphs. The only one is ***mask***, which takes another graph as a parameter and returns a new graph.
 In many complex cases it will be helpful to operate between graphs directly, such as **union or join**, especially for streaming cases where a small graph merges into a big graph. Higher-level graph operators help users focus and think in terms of graphs.
 As for a detailed list of complex graph operators, it can be found here: [complex_graph_operations](http://techieme.in/complex-graph-operations/). This issue will focus on two frequently-used operators first: **union** and **join**.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7894) Graph Union Operator

2015-05-27 Thread Andy Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Huang updated SPARK-7894:
--
External issue URL: https://issues.apache.org/jira/browse/SPARK-7893

 Graph Union Operator
 

 Key: SPARK-7894
 URL: https://issues.apache.org/jira/browse/SPARK-7894
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang

 This operator aims to union two graphs and generate a new graph directly. Vertices and edges which are included in either graph will be part of the new graph.
 The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH).
 The image below shows a union of graph G and graph H.
 A simple interface would be:
   def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]
 However, overlapping vertices and edges will inevitably occur at the borders between graphs. It is necessary for the interface to consider how to handle this case for both vertices and edges.
 For vertices, it is natural to take the union and remove the duplicates. But for edges, a mergeEdges function seems more reasonable.
   def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7894) Graph Union Operator

2015-05-27 Thread Andy Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Huang updated SPARK-7894:
--
Description: 
This operator aims to union two graphs and generate a new graph directly. Vertices and edges which are included in either graph will be part of the new graph.

The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH).

The image below shows a union of graph G and graph H.

!https://issues.apache.org/jira/secure/attachment/12735570/union_operator.png|thumbnail!

A simple interface would be:

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]


However, overlapping vertices and edges will inevitably occur at the borders between graphs. It is necessary for the interface to consider how to handle this case for both vertices and edges.

For vertices, it is natural to take the union and remove the duplicates. But for edges, a mergeEdges function seems more reasonable.

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED]


  was:
This operator aims to union two graphs and generate a new graph directly. Vertices and edges which are included in either graph will be part of the new graph.

The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH).

The image below shows a union of graph G and graph H.

!https://raw.githubusercontent.com/andyyehoo/anything/master/images/union_operator.png|thumbnail!

A simple interface would be:

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]


However, overlapping vertices and edges will inevitably occur at the borders between graphs. It is necessary for the interface to consider how to handle this case for both vertices and edges.

For vertices, it is natural to take the union and remove the duplicates. But for edges, a mergeEdges function seems more reasonable.

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED]



 Graph Union Operator
 

 Key: SPARK-7894
 URL: https://issues.apache.org/jira/browse/SPARK-7894
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang
 Attachments: union_operator.png


 This operator aims to union two graphs and generate a new graph directly. Vertices and edges which are included in either graph will be part of the new graph.
 The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH).
 The image below shows a union of graph G and graph H.
 !https://issues.apache.org/jira/secure/attachment/12735570/union_operator.png|thumbnail!
 A simple interface would be:
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]
 However, overlapping vertices and edges will inevitably occur at the borders between graphs. It is necessary for the interface to consider how to handle this case for both vertices and edges.
 For vertices, it is natural to take the union and remove the duplicates. But for edges, a mergeEdges function seems more reasonable.
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7894) Graph Union Operator

2015-05-27 Thread Andy Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Huang updated SPARK-7894:
--
Description: 
This operator aims to union two graphs and generate a new graph directly. Vertices and edges which are included in either graph will be part of the new graph.

The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH).

The image below shows a union of graph G and graph H.

!union_operator.png|width=600px!

A simple interface would be:

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]


However, overlapping vertices and edges will inevitably occur at the borders between graphs. It is necessary for the interface to consider how to handle this case for both vertices and edges. For vertices, it is natural to simply take the union and remove the duplicate vertices. But for edges, a mergeEdges function seems more reasonable.

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED]


  was:
This operator aims to union two graphs and generate a new graph directly. Vertices and edges which are included in either graph will be part of the new graph.

The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH).

The image below shows a union of graph G and graph H.

!union_operator.png|thumbnail!

A simple interface would be:

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]


However, overlapping vertices and edges will inevitably occur at the borders between graphs. It is necessary for the interface to consider how to handle this case for both vertices and edges. For vertices, it is natural to simply take the union and remove the duplicate vertices. But for edges, a mergeEdges function seems more reasonable.

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED]



 Graph Union Operator
 

 Key: SPARK-7894
 URL: https://issues.apache.org/jira/browse/SPARK-7894
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang
  Labels: graph
 Attachments: union_operator.png


 This operator aims to union two graphs and generate a new graph directly. Vertices and edges which are included in either graph will be part of the new graph.
 The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH).
 The image below shows a union of graph G and graph H.
 !union_operator.png|width=600px!
 A simple interface would be:
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]
 However, overlapping vertices and edges will inevitably occur at the borders between graphs. It is necessary for the interface to consider how to handle this case for both vertices and edges. For vertices, it is natural to simply take the union and remove the duplicate vertices. But for edges, a mergeEdges function seems more reasonable.
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7893) Complex Operators between Graphs

2015-05-27 Thread Andy Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Huang updated SPARK-7893:
--
Description: 
Currently there are 30+ operators in GraphX, while few of them consider operations between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph.

In many complex cases, such as streaming graphs or merging a small graph into a big graph, complex operators will be helpful for operating between graphs directly, such as *union or join*. Higher-level graph operators help users focus and think in terms of graphs.


A list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*.

  was:
Currently there are 30+ operators in GraphX, while few of them consider operations between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph.

In many complex cases, such as streaming graphs, or small and big graphs, it will be helpful to operate between graphs directly, such as *union or join*. Higher-level graph operators help users focus and think in terms of graphs.


A list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*.


 Complex Operators between Graphs
 

 Key: SPARK-7893
 URL: https://issues.apache.org/jira/browse/SPARK-7893
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang
  Labels: complex, graph, join, operators, union

 Currently there are 30+ operators in GraphX, while few of them consider operations between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph.
 In many complex cases, such as streaming graphs or merging a small graph into a big graph, complex operators will be helpful for operating between graphs directly, such as *union or join*. Higher-level graph operators help users focus and think in terms of graphs.
 A list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7893) Complex Operators between Graphs

2015-05-27 Thread Andy Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Huang updated SPARK-7893:
--
Description: 
Currently there are 30+ operators in GraphX, while few of them consider operations between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph.

In many complex cases, such as _*streaming graphs, or merging a small graph into a huge graph*_, higher-level graph operators help users focus and think in terms of graphs. Performance optimization can be done internally and be transparent to them.


A list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*.

  was:
Currently there are 30+ operators in GraphX, while few of them consider operations between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph.

In many complex cases, such as _*streaming graphs, or merging a small graph into a huge graph*_, higher-level graph operators help users focus and think in terms of graphs. Performance optimization can be done within the operator and be transparent to them.


A list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*.


 Complex Operators between Graphs
 

 Key: SPARK-7893
 URL: https://issues.apache.org/jira/browse/SPARK-7893
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang
  Labels: complex, graph, join, operators, union

 Currently there are 30+ operators in GraphX, while few of them consider operations between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph.
 In many complex cases, such as _*streaming graphs, or merging a small graph into a huge graph*_, higher-level graph operators help users focus and think in terms of graphs. Performance optimization can be done internally and be transparent to them.
 A list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7893) Complex Operators between Graphs

2015-05-27 Thread Andy Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Huang updated SPARK-7893:
--
Description: 
Currently there are 30+ operators in GraphX, but few of them consider operations between graphs. The only one is mask, which takes another graph as a parameter and returns a new graph.

In many complex cases it will be helpful to operate between graphs directly, such as union or join, especially for streaming cases where a small graph merges into a big graph. Higher-level graph operators help users focus and think in terms of graphs.

As for a detailed list of complex graph operators, it can be found here: complex_graph_operations. This issue will focus on two frequently-used operators first: union and join.

  was:
Currently there are 30+ operators in GraphX, but few of them consider operations between graphs. The only one is ***mask***, which takes another graph as a parameter and returns a new graph.

In many complex cases it will be helpful to operate between graphs directly, such as **union or join**, especially for streaming cases where a small graph merges into a big graph. Higher-level graph operators help users focus and think in terms of graphs.

As for a detailed list of complex graph operators, it can be found here: [complex_graph_operations](http://techieme.in/complex-graph-operations/). This issue will focus on two frequently-used operators first: **union** and **join**.


 Complex Operators between Graphs
 

 Key: SPARK-7893
 URL: https://issues.apache.org/jira/browse/SPARK-7893
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang

 Currently there are 30+ operators in GraphX, but few of them consider operations between graphs. The only one is mask, which takes another graph as a parameter and returns a new graph.
 In many complex cases it will be helpful to operate between graphs directly, such as union or join, especially for streaming cases where a small graph merges into a big graph. Higher-level graph operators help users focus and think in terms of graphs.
 As for a detailed list of complex graph operators, it can be found here: complex_graph_operations. This issue will focus on two frequently-used operators first: union and join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7894) Graph Union Operator

2015-05-27 Thread Andy Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Huang updated SPARK-7894:
--
Attachment: union_operator.png

 Graph Union Operator
 

 Key: SPARK-7894
 URL: https://issues.apache.org/jira/browse/SPARK-7894
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang
 Attachments: union_operator.png


 This operator aims to union two graphs and generate a new graph directly. Vertices and edges which are included in either graph will be part of the new graph.
 The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH).
 The image below shows a union of graph G and graph H.
 !https://raw.githubusercontent.com/andyyehoo/anything/master/images/union_operator.png|thumbnail!
 A simple interface would be:
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]
 However, overlapping vertices and edges will inevitably occur at the borders between graphs. It is necessary for the interface to consider how to handle this case for both vertices and edges.
 For vertices, it is natural to take the union and remove the duplicates. But for edges, a mergeEdges function seems more reasonable.
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7894) Graph Union Operator

2015-05-27 Thread Andy Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Huang updated SPARK-7894:
--
Labels: graph union  (was: )

 Graph Union Operator
 

 Key: SPARK-7894
 URL: https://issues.apache.org/jira/browse/SPARK-7894
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang
  Labels: graph, union
 Attachments: union_operator.png


 This operator aims to union two graphs and generate a new graph directly. Vertices and edges which are included in either graph will be part of the new graph.
 The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH).
 The image below shows a union of graph G and graph H.
 !union_operator.png|thumbnail!
 A simple interface would be:
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]
 However, overlapping vertices and edges will inevitably occur at the borders between graphs. It is necessary for the interface to consider how to handle this case for both vertices and edges.
 For vertices, it is natural to take the union and remove the duplicates. But for edges, a mergeEdges function seems more reasonable.
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7894) Graph Union Operator

2015-05-27 Thread Andy Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Huang updated SPARK-7894:
--
Description: 
This operator aims to union two graphs and generate a new graph directly. Vertices and edges which are included in either graph will be part of the new graph.

The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH).

The image below shows a union of graph G and graph H.

!union_operator.png|thumbnail!

A simple interface would be:

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]


However, overlapping vertices and edges will inevitably occur at the borders between graphs. It is necessary for the interface to consider how to handle this case for both vertices and edges.

For vertices, it is natural to take the union and remove the duplicates. But for edges, a mergeEdges function seems more reasonable.

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED]


  was:
This operator aims to union two graphs and generate a new graph directly. Vertices and edges which are included in either graph will be part of the new graph.

The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH).

The image below shows a union of graph G and graph H.

!https://issues.apache.org/jira/secure/attachment/12735570/union_operator.png|thumbnail!

A simple interface would be:

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]


However, overlapping vertices and edges will inevitably occur at the borders between graphs. It is necessary for the interface to consider how to handle this case for both vertices and edges.

For vertices, it is natural to take the union and remove the duplicates. But for edges, a mergeEdges function seems more reasonable.

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED]



 Graph Union Operator
 

 Key: SPARK-7894
 URL: https://issues.apache.org/jira/browse/SPARK-7894
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang
 Attachments: union_operator.png


 This operator aims to union two graphs and generate a new graph directly. Vertices and edges which are included in either graph will be part of the new graph.
 The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH).
 The image below shows a union of graph G and graph H.
 !union_operator.png|thumbnail!
 A simple interface would be:
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]
 However, overlapping vertices and edges will inevitably occur at the borders between graphs. It is necessary for the interface to consider how to handle this case for both vertices and edges.
 For vertices, it is natural to take the union and remove the duplicates. But for edges, a mergeEdges function seems more reasonable.
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7894) Graph Union Operator

2015-05-27 Thread Andy Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Huang updated SPARK-7894:
--
Description: 
This operator aims to union two graphs and generate a new graph directly. The union of two graphs is the union of their vertex sets and their edge families. Vertices and edges which are included in either graph will be part of the new graph.

bq.  G ∪ H = (VG ∪ VH, EG ∪ EH).

The image below shows a union of graph G and graph H.

!union_operator.png|width=600px,align=center!

A simple interface would be:

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]


However, overlapping vertices and edges will inevitably occur at the borders between graphs. It is necessary to consider how to handle this case for both vertices and edges. For vertices, it is natural to simply take the union and remove the duplicates. But for edges, a mergeEdges function seems more reasonable.

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED]


  was:
This operator aims to union two graphs and generate a new graph directly. The union of two graphs is the union of their vertex sets and their edge families. Vertices and edges which are included in either graph will be part of the new graph.

bq.  G ∪ H = (VG ∪ VH, EG ∪ EH).

The image below shows a union of graph G and graph H.

!union_operator.png|width=600px,align=center!

A simple interface would be:

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]


However, overlapping vertices and edges will inevitably occur at the borders between graphs. It is necessary for the interface to consider how to handle this case for both vertices and edges. For vertices, it is natural to simply take the union and remove the duplicate vertices. But for edges, a mergeEdges function seems more reasonable.

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED]



 Graph Union Operator
 

 Key: SPARK-7894
 URL: https://issues.apache.org/jira/browse/SPARK-7894
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang
  Labels: graph, union
 Attachments: union_operator.png


 This operator aims to union two graphs and generate a new graph directly. The union of two graphs is the union of their vertex sets and their edge families. Vertices and edges which are included in either graph will be part of the new graph.
 bq.  G ∪ H = (VG ∪ VH, EG ∪ EH).
 The image below shows a union of graph G and graph H.
 !union_operator.png|width=600px,align=center!
 A simple interface would be:
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]
 However, overlapping vertices and edges will inevitably occur at the borders between graphs. It is necessary to consider how to handle this case for both vertices and edges. For vertices, it is natural to simply take the union and remove the duplicates. But for edges, a mergeEdges function seems more reasonable.
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7893) Complex Operators between Graphs

2015-05-27 Thread Andy Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Huang updated SPARK-7893:
--
Description: 
Currently there are 30+ operators in GraphX, while few of them consider 
operators between graphs. The only one is _*mask*_, which takes another graph 
as a parameter and return a new graph.

In many complex case,such as _*streaming graph, small graph merge into big 
graph*_……complex operators will be helpful to operate between graphs directly, 
such as *union or join*. Higher level operators of graphs can help user to 
focus and think in graph.


Complex graph operator list is 
here:[complex_graph_operations|http://techieme.in/complex-graph-operations/]. 
This issue will focus on two frequently-used operators first: *union* and 
*join*.

  was:
Currently there are 30+ operators in GraphX, while few of them consider operations between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph.

In many complex cases, such as streaming graphs or merging a small graph into a big graph, complex operators will be helpful for operating between graphs directly, such as *union or join*. Higher-level graph operators help users focus and think in terms of graphs.


A list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*.


 Complex Operators between Graphs
 

 Key: SPARK-7893
 URL: https://issues.apache.org/jira/browse/SPARK-7893
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang
  Labels: complex, graph, join, operators, union

 Currently there are 30+ operators in GraphX, while few of them consider operations between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph.
 In many complex cases, such as _*streaming graphs, or merging a small graph into a big graph*_, complex operators will be helpful for operating between graphs directly, such as *union or join*. Higher-level graph operators help users focus and think in terms of graphs.
 A list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7893) Complex Operators between Graphs

2015-05-27 Thread Andy Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Huang updated SPARK-7893:
--
Description: 
Currently there are 30+ operators in GraphX, while few of them consider 
operators between graphs. The only one is _*mask*_, which takes another graph 
as a parameter and return a new graph.

In many complex case,such as _*streaming graph, small graph merge into big 
graph*_,complex operators will be helpful to operate between graphs directly. 
Higher level operators of graphs can help user to focus and think in graph.


Complex graph operator list is 
here:[complex_graph_operations|http://techieme.in/complex-graph-operations/]. 
This issue will focus on two frequently-used operators first: *union* and 
*join*.

  was:
Currently there are 30+ operators in GraphX, while few of them consider 
operators between graphs. The only one is _*mask*_, which takes another graph 
as a parameter and returns a new graph.

In many complex cases, such as _*streaming graphs, or merging a small graph 
into a big graph*_, complex operators will be helpful to operate between graphs 
directly, such as *union or join*. Higher-level graph operators can help users 
to focus and think in graphs.


The complex graph operator list is 
here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. 
This issue will focus on two frequently-used operators first: *union* and 
*join*.


 Complex Operators between Graphs
 

 Key: SPARK-7893
 URL: https://issues.apache.org/jira/browse/SPARK-7893
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang
  Labels: complex, graph, join, operators, union

 Currently there are 30+ operators in GraphX, while few of them consider 
 operators between graphs. The only one is _*mask*_, which takes another graph 
 as a parameter and returns a new graph.
 In many complex cases, such as _*streaming graphs, or merging a small graph 
 into a big graph*_, complex operators will be helpful to operate between 
 graphs directly. Higher-level graph operators can help users to focus and 
 think in graphs.
 The complex graph operator list is 
 here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. 
 This issue will focus on two frequently-used operators first: *union* and 
 *join*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7165) Sort Merge Join for outer joins

2015-05-27 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7165:
---
Priority: Blocker  (was: Major)

 Sort Merge Join for outer joins
 ---

 Key: SPARK-7165
 URL: https://issues.apache.org/jira/browse/SPARK-7165
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Adrian Wang
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7891) Python class in __main__ may trigger AssertionError

2015-05-27 Thread flykobe cheng (JIRA)
flykobe cheng created SPARK-7891:


 Summary: Python class in __main__ may trigger AssertionError
 Key: SPARK-7891
 URL: https://issues.apache.org/jira/browse/SPARK-7891
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
 Environment: Linux, Python 2.7.3
pickled by Python pickle Lib
Reporter: flykobe cheng
Priority: Minor


Callback functions for Spark transformations and actions will be pickled. 
If the callback is an instancemethod of a class in the __main__ module, and the 
class has more than one instancemethod that uses class properties or 
classmethods, the class will be pickled twice, and 'pickle.memoize' called 
twice, which triggers an AssertionError.

Demo code:

import logging
import sys

import pyspark

class AClass(object):
    _class_var = {'classkey': 'classval', }

    def main_object_method(self, item):
        logging.warn("class var by %s: %s" % (
            sys._getframe().f_code.co_name, AClass._class_var['classkey']))

    def main_object_method2(self, item):
        logging.warn("class var by %s: %s" % (
            sys._getframe().f_code.co_name, AClass._class_var['classkey']))


def test_main_object_method(sc):
    obj = AClass()
    res = sc.parallelize(range(4)).map(obj.main_object_method).collect()


if __name__ == '__main__':
    cf = pyspark.SparkConf()
    cf.set('spark.cores.max', 1)

    sc = pyspark.SparkContext(appName="flykobe_demo_pickle_error", conf=cf)

    test_main_object_method(sc)


Traceback:
  File "/home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py", line 310, in save_function_tuple
    save(f_globals)
  File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 291, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py", line 174, in save_dict
    pickle.Pickler.save_dict(self, obj)
  File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 654, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 686, in _batch_setitems
    save(v)
  File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 291, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py", line 468, in save_global
    d),obj=obj)
  File "/home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py", line 638, in save_reduce
    self.memoize(obj)
  File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 248, in memoize
    assert id(obj) not in self.memo
AssertionError


Problem in Python/Lib/pickle.py:

    def memoize(self, obj):
        """Store an object in the memo."""
        if self.fast:
            return
        assert id(obj) not in self.memo
        memo_len = len(self.memo)
        self.write(self.put(memo_len))
        self.memo[id(obj)] = memo_len, obj



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7893) Complex Operators between Graphs

2015-05-27 Thread Andy Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Huang updated SPARK-7893:
--
Description: 
Currently there are 30+ operators in GraphX. But few of them consider 
operations between graphs. The only one is mask, which takes another graph as a 
parameter and returns a new graph.

In many complex cases, it would be helpful to operate between graphs directly, 
such as with union or join, especially for the streaming case or merging a 
small graph with a big graph. Higher-level graph operators can help users to 
focus and think in graphs.

The detailed list of complex graph operators can be found 
here: complex_graph_operations. We will focus on two frequently-used operators 
first: union and join.

  was:
Currently there are 30+ operators in GraphX. But few of them consider operators 
between graphs. The only one is mask, which takes another graph as a parameter 
and returns a new graph.

In many complex cases, it would be helpful to operate between graphs directly, 
such as with union or join, especially for the streaming case or merging a 
small graph with a big graph. Higher-level graph operators can help users to 
focus and think in graphs.

The detailed list of complex graph operators can be found 
here: complex_graph_operations. We will focus on two frequently-used operators 
first: union and join.


 Complex Operators between Graphs
 

 Key: SPARK-7893
 URL: https://issues.apache.org/jira/browse/SPARK-7893
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang

 Currently there are 30+ operators in GraphX. But few of them consider 
 operations between graphs. The only one is mask, which takes another graph as 
 a parameter and returns a new graph.
 In many complex cases, it would be helpful to operate between graphs directly, 
 such as with union or join, especially for the streaming case or merging a 
 small graph with a big graph. Higher-level graph operators can help users to 
 focus and think in graphs.
 The detailed list of complex graph operators can be found 
 here: complex_graph_operations. We will focus on two frequently-used operators 
 first: union and join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7894) Graph Union Operator

2015-05-27 Thread Andy Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Huang updated SPARK-7894:
--
Description: 
This operator aims to union two graphs and generate a new graph directly. 
Vertexes and edges which are included in either graph will be part of the new 
graph.

The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex 
sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH).

The below image shows a union of graph G and graph H

!union_operator.png|thumbnail!

A simple interface would be:

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]


However, overlapping vertexes and edges will inevitably happen at the borders 
of the graphs. It is necessary for the interface to consider how to handle this 
case for both Vertex and Edge. 

For vertexes, it's quite natural to take a union and remove duplicate vertexes. 
But for edges, a mergeEdges function seems to be more reasonable.

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: 
(ED, ED) => ED): Graph[VD, ED]


  was:
This operator aims to union two graphs and generate a new graph directly. 
Vertexes and edges which are included in either graph will be part of the new 
graph.

The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex 
sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH).

The below image shows a union of graph G and graph H

!union_operator.png!

A simple interface would be:

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]


However, overlapping vertexes and edges will inevitably happen at the borders 
of the graphs. It is necessary for the interface to consider how to handle this 
case for both Vertex and Edge. 

For vertexes, it's quite natural to take a union and remove duplicate vertexes. 
But for edges, a mergeEdges function seems to be more reasonable.

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: 
(ED, ED) => ED): Graph[VD, ED]



 Graph Union Operator
 

 Key: SPARK-7894
 URL: https://issues.apache.org/jira/browse/SPARK-7894
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang
  Labels: graph
 Attachments: union_operator.png


 This operator aims to union two graphs and generate a new graph directly. 
 Vertexes and edges which are included in either graph will be part of the new 
 graph.
 The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex 
 sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH).
 The below image shows a union of graph G and graph H
 !union_operator.png|thumbnail!
 A simple interface would be:
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]
 However, overlapping vertexes and edges will inevitably happen at the borders 
 of the graphs. It is necessary for the interface to consider how to handle 
 this case for both Vertex and Edge. 
 For vertexes, it's quite natural to take a union and remove duplicate 
 vertexes. But for edges, a mergeEdges function seems to be more reasonable.
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: 
 (ED, ED) => ED): Graph[VD, ED]
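
 For illustration, a minimal sketch of such a union on top of the existing 
 GraphX RDD API; the unionGraphs name and the keep-one policy for duplicate 
 vertex attributes are assumptions, not part of the proposal:

{code:scala}
import org.apache.spark.SparkContext._ // pair-RDD implicits on older Spark
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// G ∪ H = (VG ∪ VH, EG ∪ EH): vertexes deduplicated by id, parallel edges
// across the two graphs combined with the caller-supplied mergeEdges.
def unionGraphs[VD: ClassTag, ED: ClassTag](
    g: Graph[VD, ED], h: Graph[VD, ED])(
    mergeEdges: (ED, ED) => ED): Graph[VD, ED] = {
  val vertices: RDD[(VertexId, VD)] =
    g.vertices.union(h.vertices).reduceByKey((a, _) => a) // keep one attribute
  val edges: RDD[Edge[ED]] =
    g.edges.union(h.edges)
      .map(e => ((e.srcId, e.dstId), e.attr))
      .reduceByKey(mergeEdges) // merge edges sharing the same (src, dst) pair
      .map { case ((src, dst), attr) => Edge(src, dst, attr) }
  Graph(vertices, edges)
}
{code}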



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7893) Complex Operators between Graphs

2015-05-27 Thread Andy Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Huang updated SPARK-7893:
--
Description: 
Currently there are 30+ operators in GraphX, while few of them consider 
operators between graphs. The only one is _*mask*_, which takes another graph 
as a parameter and return a new graph.

In many complex cases, it would be helpful to operate between graphs directly, 
such as with *union or join*, especially for the streaming case or merging a 
small graph with a big graph. Higher-level graph operators can help users to 
focus and think in graphs.


A detailed list of complex graph operators can be found 
here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. 
This issue will focus on two frequently-used operators first: *union* and 
*join*.

  was:
Currently there are 30+ operators in GraphX. But few of them consider operators 
between graphs. The only one is _*mask*_, which takes another graph as a 
parameter and return a new graph.

In many complex cases, it would be helpful to operate between graphs directly, 
such as with *union or join*, especially for the streaming case or merging a 
small graph with a big graph. Higher-level graph operators can help users to 
focus and think in graphs.


A detailed list of complex graph operators can be found 
here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. 
This issue will focus on two frequently-used operators first: *union* and 
*join*.


 Complex Operators between Graphs
 

 Key: SPARK-7893
 URL: https://issues.apache.org/jira/browse/SPARK-7893
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang
  Labels: complex, graph, join, operators, union

 Currently there are 30+ operators in GraphX, while few of them consider 
 operators between graphs. The only one is _*mask*_, which takes another graph 
 as a parameter and return a new graph.
 In many complex cases, it would be helpful to operate between graphs directly, 
 such as with *union or join*, especially for the streaming case or merging a 
 small graph with a big graph. Higher-level graph operators can help users to 
 focus and think in graphs.
 A detailed list of complex graph operators can be found 
 here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. 
 This issue will focus on two frequently-used operators first: *union* and 
 *join*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7893) Complex Operators between Graphs

2015-05-27 Thread Andy Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Huang updated SPARK-7893:
--
Description: 
Currently there are 30+ operators in GraphX, while few of them consider 
operators between graphs. The only one is _*mask*_, which takes another graph 
as a parameter and return a new graph.

In many complex cases, it would be helpful to operate between graphs directly, 
such as with *union or join*, especially for the streaming case or merging a 
small graph with a big graph. Higher-level graph operators can help users to 
focus and think in graphs.


The complex graph operator list is 
here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. 
This issue will focus on two frequently-used operators first: *union* and 
*join*.

  was:
Currently there are 30+ operators in GraphX, while few of them consider 
operators between graphs. The only one is _*mask*_, which takes another graph 
as a parameter and return a new graph.

In many complex cases, it would be helpful to operate between graphs directly, 
such as with *union or join*, especially for the streaming case or merging a 
small graph with a big graph. Higher-level graph operators can help users to 
focus and think in graphs.


A detailed list of complex graph operators can be found 
here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. 
This issue will focus on two frequently-used operators first: *union* and 
*join*.


 Complex Operators between Graphs
 

 Key: SPARK-7893
 URL: https://issues.apache.org/jira/browse/SPARK-7893
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang
  Labels: complex, graph, join, operators, union

 Currently there are 30+ operators in GraphX, while few of them consider 
 operators between graphs. The only one is _*mask*_, which takes another graph 
 as a parameter and return a new graph.
 In many complex cases, it would be helpful to operate between graphs directly, 
 such as with *union or join*, especially for the streaming case or merging a 
 small graph with a big graph. Higher-level graph operators can help users to 
 focus and think in graphs.
 The complex graph operator list is 
 here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. 
 This issue will focus on two frequently-used operators first: *union* and 
 *join*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7891) Python class in __main__ may trigger AssertionError

2015-05-27 Thread flykobe cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

flykobe cheng updated SPARK-7891:
-
Description: 
Callback functions for Spark transformations and actions will be pickled. 
If the callback is an instancemethod of a class in the __main__ module, and the 
class has more than one instancemethod that uses class properties or 
classmethods, the class will be pickled twice, and 'pickle.memoize' called 
twice, which triggers an AssertionError.

Demo code and traceback attached.

  was:
Callback functions for Spark transformations and actions will be pickled. 
If the callback is an instancemethod of a class in the __main__ module, and the 
class has more than one instancemethod that uses class properties or 
classmethods, the class will be pickled twice, and 'pickle.memoize' called 
twice, which triggers an AssertionError.

Demo code:

import logging
import sys

import pyspark

class AClass(object):
    _class_var = {'classkey': 'classval', }

    def main_object_method(self, item):
        logging.warn("class var by %s: %s" % (
            sys._getframe().f_code.co_name, AClass._class_var['classkey']))

    def main_object_method2(self, item):
        logging.warn("class var by %s: %s" % (
            sys._getframe().f_code.co_name, AClass._class_var['classkey']))


def test_main_object_method(sc):
    obj = AClass()
    res = sc.parallelize(range(4)).map(obj.main_object_method).collect()


if __name__ == '__main__':
    cf = pyspark.SparkConf()
    cf.set('spark.cores.max', 1)

    sc = pyspark.SparkContext(appName="flykobe_demo_pickle_error", conf=cf)

    test_main_object_method(sc)


Traceback:
  File "/home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py", line 310, in save_function_tuple
    save(f_globals)
  File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 291, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py", line 174, in save_dict
    pickle.Pickler.save_dict(self, obj)
  File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 654, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 686, in _batch_setitems
    save(v)
  File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 291, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py", line 468, in save_global
    d),obj=obj)
  File "/home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py", line 638, in save_reduce
    self.memoize(obj)
  File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 248, in memoize
    assert id(obj) not in self.memo
AssertionError


Problem in Python/Lib/pickle.py:

    def memoize(self, obj):
        """Store an object in the memo."""
        if self.fast:
            return
        assert id(obj) not in self.memo
        memo_len = len(self.memo)
        self.write(self.put(memo_len))
        self.memo[id(obj)] = memo_len, obj


 Python class in __main__ may trigger AssertionError
 ---

 Key: SPARK-7891
 URL: https://issues.apache.org/jira/browse/SPARK-7891
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
 Environment: Linux, Python 2.7.3
 pickled by Python pickle Lib
Reporter: flykobe cheng
Priority: Minor
 Attachments: demo_error.log, demo_pickle_error.py


 Callback functions for Spark transformations and actions will be pickled. 
 If the callback is an instancemethod of a class in the __main__ module, and 
 the class has more than one instancemethod that uses class properties or 
 classmethods, the class will be pickled twice, and 'pickle.memoize' called 
 twice, which triggers an AssertionError.
 Demo code and traceback attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5062) Pregel use aggregateMessage instead of mapReduceTriplets function

2015-05-27 Thread shijinkui (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shijinkui updated SPARK-5062:
-
Fix Version/s: 1.3.2

 Pregel use aggregateMessage instead of mapReduceTriplets function
 -

 Key: SPARK-5062
 URL: https://issues.apache.org/jira/browse/SPARK-5062
 Project: Spark
  Issue Type: Wish
  Components: GraphX
Reporter: shijinkui
 Fix For: 1.3.2

 Attachments: graphx_aggreate_msg.jpg


 Since Spark 1.2 introduced aggregateMessages to replace mapReduceTriplets, and 
 it indeed improves performance, it's time to replace mapReduceTriplets with 
 aggregateMessages in Pregel. We can discuss it.
 I have drawn a diagram of aggregateMessages to show why it can improve the 
 performance; a minimal usage sketch follows below.
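
 For concreteness, one message-passing round written with aggregateMessages; 
 the countInDegrees name and the in-degree example are illustrative 
 assumptions, not taken from the attached diagram:

{code:scala}
import org.apache.spark.graphx._

// One aggregateMessages round: the primitive Pregel could use in place of
// mapReduceTriplets (aggregateMessages is available since Spark 1.2).
def countInDegrees[VD, ED](graph: Graph[VD, ED]): VertexRDD[Int] =
  graph.aggregateMessages[Int](
    ctx => ctx.sendToDst(1), // sendMsg: emit one message along each edge
    _ + _                    // mergeMsg: sum the messages arriving at a vertex
  )
{code}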



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7894) Graph Union Operator

2015-05-27 Thread Andy Huang (JIRA)
Andy Huang created SPARK-7894:
-

 Summary: Graph Union Operator
 Key: SPARK-7894
 URL: https://issues.apache.org/jira/browse/SPARK-7894
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang


This operator aims to union two graphs and generate a new graph directly. 
Vertexes and edges which are included in either graph will be part of the new 
graph.

The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex 
sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH).

The below image shows a union of graph G and graph H.


A simple interface would be:

def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, 
ED]


However, overlapping vertexes and edges will inevitably happen at the borders 
of the graphs. It is necessary for the interface to consider how to handle this 
case for both Vertex and Edge. 

For vertexes, it's quite natural to take a union and remove duplicate vertexes. 
But for edges, a mergeEdges function seems to be more reasonable.

def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: 
(ED, ED) => ED): Graph[VD, ED]




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7894) Graph Union Operator

2015-05-27 Thread Andy Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Huang updated SPARK-7894:
--
Description: 
This operator aims to union two graphs and generate a new graph directly. 
Vertexes and edges which are included in either graph will be part of the new 
graph.

The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex 
sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH).

The below image shows a union of graph G and graph H

!union_operator.png|thumbnail!

A simple interface would be:

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]


However, overlapping vertexes and edges will inevitably happen at the borders 
of the graphs. It is necessary for the interface to consider how to handle this 
case for both Vertex and Edge. For vertexes, it's quite natural to just take a 
union and remove those duplicate vertexes. But for edges, a mergeEdges function 
seems to be more reasonable.

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: 
(ED, ED) => ED): Graph[VD, ED]


  was:
This operator aims to union two graphs and generate a new graph directly. 
Vertexes and edges which are included in either graph will be part of the new 
graph.

The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex 
sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH).

The below image shows a union of graph G and graph H

!union_operator.png|thumbnail!

A simple interface would be:

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]


However, overlapping vertexes and edges will inevitably happen at the borders 
of the graphs. It is necessary for the interface to consider how to handle this 
case for both Vertex and Edge. 

For vertexes, it's quite natural to take a union and remove duplicate vertexes. 
But for edges, a mergeEdges function seems to be more reasonable.

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: 
(ED, ED) => ED): Graph[VD, ED]



 Graph Union Operator
 

 Key: SPARK-7894
 URL: https://issues.apache.org/jira/browse/SPARK-7894
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang
  Labels: graph
 Attachments: union_operator.png


 This operator aims to union two graphs and generate a new graph directly. 
 Vertexes and edges which are included in either graph will be part of the new 
 graph.
 The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex 
 sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH).
 The below image shows a union of graph G and graph H
 !union_operator.png|thumbnail!
 A simple interface would be:
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]
 However, overlapping vertexes and edges will inevitably happen at the borders 
 of the graphs. It is necessary for the interface to consider how to handle 
 this case for both Vertex and Edge. For vertexes, it's quite natural to just 
 take a union and remove those duplicate vertexes. But for edges, a mergeEdges 
 function seems to be more reasonable.
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: 
 (ED, ED) => ED): Graph[VD, ED]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7893) Complex Operators between Graphs

2015-05-27 Thread Andy Huang (JIRA)
Andy Huang created SPARK-7893:
-

 Summary: Complex Operators between Graphs
 Key: SPARK-7893
 URL: https://issues.apache.org/jira/browse/SPARK-7893
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang


Currently there are 30+ operators in GraphX. But few of them consider operators 
between graphs. The only one is mask, which takes another graph as a parameter 
and returns a new graph.

In many complex cases, it would be helpful to operate between graphs directly, 
such as with union or join, especially for the streaming case or merging a 
small graph with a big graph. Higher-level graph operators can help users to 
focus and think in graphs.

The detailed list of complex graph operators can be found 
here: complex_graph_operations. We will focus on two frequently-used operators 
first: union and join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7894) Graph Union Operator

2015-05-27 Thread Andy Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Huang updated SPARK-7894:
--
Labels: graph  (was: graph union)

 Graph Union Operator
 

 Key: SPARK-7894
 URL: https://issues.apache.org/jira/browse/SPARK-7894
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang
  Labels: graph
 Attachments: union_operator.png


 This operator aims to union two graphs and generate a new graph directly. 
 Vertexes and edges which are included in either graph will be part of the new 
 graph.
 The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex 
 sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH).
 The below image shows a union of graph G and graph H
 !union_operator.png|thumbnail!
 A simple interface would be:
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]
 However, overlapping vertexes and edges will inevitably happen at the borders 
 of the graphs. It is necessary for the interface to consider how to handle 
 this case for both Vertex and Edge. 
 For vertexes, it's quite natural to take a union and remove duplicate 
 vertexes. But for edges, a mergeEdges function seems to be more reasonable.
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: 
 (ED, ED) => ED): Graph[VD, ED]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7894) Graph Union Operator

2015-05-27 Thread Andy Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Huang updated SPARK-7894:
--
Description: 
This operator aims to union two graphs and generate a new graph directly. 
Vertexes and edges which are included in either graph will be part of the new 
graph.

The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex 
sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH).

The below image shows a union of graph G and graph H

!union_operator.png!

A simple interface would be:

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]


However, overlapping vertexes and edges will inevitably happen at the borders 
of the graphs. It is necessary for the interface to consider how to handle this 
case for both Vertex and Edge. 

For vertexes, it's quite natural to take a union and remove duplicate vertexes. 
But for edges, a mergeEdges function seems to be more reasonable.

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: 
(ED, ED) => ED): Graph[VD, ED]


  was:
This operator aims to union two graphs and generate a new graph directly. 
Vertexes and edges which are included in either graph will be part of the new 
graph.

The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex 
sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH).

The below image shows a union of graph G and graph H

!union_operator.png|thumbnail!

A simple interface would be:

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]


However, overlapping vertexes and edges will inevitably happen at the borders 
of the graphs. It is necessary for the interface to consider how to handle this 
case for both Vertex and Edge. 

For vertexes, it's quite natural to take a union and remove duplicate vertexes. 
But for edges, a mergeEdges function seems to be more reasonable.

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: 
(ED, ED) => ED): Graph[VD, ED]



 Graph Union Operator
 

 Key: SPARK-7894
 URL: https://issues.apache.org/jira/browse/SPARK-7894
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang
  Labels: graph
 Attachments: union_operator.png


 This operator aims to union two graphs and generate a new graph directly. 
 Vertexes and edges which are included in either graph will be part of the new 
 graph.
 The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex 
 sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH).
 The below image shows a union of graph G and graph H
 !union_operator.png!
 A simple interface would be:
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]
 However, overlapping vertexes and edges will inevitably happen at the borders 
 of the graphs. It is necessary for the interface to consider how to handle 
 this case for both Vertex and Edge. 
 For vertexes, it's quite natural to take a union and remove duplicate 
 vertexes. But for edges, a mergeEdges function seems to be more reasonable.
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: 
 (ED, ED) => ED): Graph[VD, ED]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7817) Intellij Idea cannot find symbol when import scala object

2015-05-27 Thread bofei.xiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bofei.xiao closed SPARK-7817.
-

Thanks Owen!

 Intellij Idea cannot find symbol when import scala object
 -

 Key: SPARK-7817
 URL: https://issues.apache.org/jira/browse/SPARK-7817
 Project: Spark
  Issue Type: Bug
  Components: Examples
Affects Versions: 1.3.1
 Environment: Microsoft Server 2003
 java 1.6
 maven 3.04
Reporter: bofei.xiao

 [ERROR] 
 src\main\java\org\apache\spark\exaples\streaming\JavaQueueStream.java:[33,47] 
 cannot find symbol
 symbol  : class StreamingExamples
 location: package org.apache.spark.exaples.streaming
 In fact, StreamingExamples is an object under org.apache.spark.exaples.streaming



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7042) Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x

2015-05-27 Thread Konstantin Shaposhnikov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560688#comment-14560688
 ] 

Konstantin Shaposhnikov commented on SPARK-7042:


Yes, I've just tested it locally - 2.11 Spark build works with akka 2.3.11

 Spark version of akka-actor_2.11 is not compatible with the official 
 akka-actor_2.11 2.3.x
 --

 Key: SPARK-7042
 URL: https://issues.apache.org/jira/browse/SPARK-7042
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.1
Reporter: Konstantin Shaposhnikov
Assignee: Konstantin Shaposhnikov
Priority: Minor
 Fix For: 1.5.0


 When connecting to a remote Spark cluster (that runs Spark branch-1.3 built 
 with Scala 2.11) from an application that uses akka 2.3.9 I get the following 
 error:
 {noformat}
 2015-04-22 09:01:38,924 - [WARN] - [akka.remote.ReliableDeliverySupervisor] 
 [sparkDriver-akka.actor.default-dispatcher-5] -
 Association with remote system [akka.tcp://sparkExecutor@server:59007] has 
 failed, address is now gated for [5000] ms.
 Reason is: [akka.actor.Identify; local class incompatible: stream classdesc 
 serialVersionUID = -213377755528332889, local class serialVersionUID = 1].
 {noformat}
 It looks like akka-actor_2.11 2.3.4-spark that is used by Spark has been 
 built using Scala compiler 2.11.0 that ignores SerialVersionUID annotations 
 (see https://issues.scala-lang.org/browse/SI-8549).
 The following steps can resolve the issue:
 - re-build the custom akka library that is used by Spark with the more recent 
 version of Scala compiler (e.g. 2.11.6) 
 - deploy a new version (e.g. 2.3.4.1-spark) to a maven repo
 - update version of akka used by spark (master and 1.3 branch)
 I would also suggest upgrading to the latest version of akka, 2.3.9 (or 
 2.3.10, which should be released soon).
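
 Once such a rebuilt fork is published, a downstream sbt build could pin it 
 explicitly; the version below is the suggested (not yet published) one, so 
 treat the coordinates as an assumption:

{code:scala}
// build.sbt sketch -- override the akka fork pulled in transitively by Spark
dependencyOverrides += "org.spark-project.akka" %% "akka-actor" % "2.3.4.1-spark"
{code}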



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7892) Python class in __main__ may trigger AssertionError

2015-05-27 Thread flykobe cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

flykobe cheng updated SPARK-7892:
-
Description: 
Callback functions for Spark transformations and actions will be pickled. 
If the callback is an instancemethod of a class in the __main__ module, and the 
class has more than one instancemethod that uses class properties or 
classmethods, the class will be pickled twice, and 'pickle.memoize' called 
twice, which triggers an AssertionError.

Demo code and traceback attached.

  was:
Callback functions for Spark transformations and actions will be pickled. 
If the callback is an instancemethod of a class in the __main__ module, and the 
class has more than one instancemethod that uses class properties or 
classmethods, the class will be pickled twice, and 'pickle.memoize' called 
twice, which triggers an AssertionError.

Demo code:

import logging
import sys

import pyspark

class AClass(object):
    _class_var = {'classkey': 'classval', }

    def main_object_method(self, item):
        logging.warn("class var by %s: %s" % (
            sys._getframe().f_code.co_name, AClass._class_var['classkey']))

    def main_object_method2(self, item):
        logging.warn("class var by %s: %s" % (
            sys._getframe().f_code.co_name, AClass._class_var['classkey']))


def test_main_object_method(sc):
    obj = AClass()
    res = sc.parallelize(range(4)).map(obj.main_object_method).collect()


if __name__ == '__main__':
    cf = pyspark.SparkConf()
    cf.set('spark.cores.max', 1)

    sc = pyspark.SparkContext(appName="flykobe_demo_pickle_error", conf=cf)

    test_main_object_method(sc)


Traceback:
  File "/home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py", line 310, in save_function_tuple
    save(f_globals)
  File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 291, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py", line 174, in save_dict
    pickle.Pickler.save_dict(self, obj)
  File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 654, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 686, in _batch_setitems
    save(v)
  File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 291, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py", line 468, in save_global
    d),obj=obj)
  File "/home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py", line 638, in save_reduce
    self.memoize(obj)
  File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 248, in memoize
    assert id(obj) not in self.memo
AssertionError


Problem in Python/Lib/pickle.py:

    def memoize(self, obj):
        """Store an object in the memo."""
        if self.fast:
            return
        assert id(obj) not in self.memo
        memo_len = len(self.memo)
        self.write(self.put(memo_len))
        self.memo[id(obj)] = memo_len, obj


 Python class in __main__ may trigger AssertionError
 ---

 Key: SPARK-7892
 URL: https://issues.apache.org/jira/browse/SPARK-7892
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
 Environment: Linux, Python 2.7.3
 pickled by Python pickle Lib
Reporter: flykobe cheng
Priority: Minor

 Callback functions for Spark transformations and actions will be pickled. 
 If the callback is an instancemethod of a class in the __main__ module, and 
 the class has more than one instancemethod that uses class properties or 
 classmethods, the class will be pickled twice, and 'pickle.memoize' called 
 twice, which triggers an AssertionError.
 Demo code and traceback attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7894) Graph Union Operator

2015-05-27 Thread Andy Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Huang updated SPARK-7894:
--
Description: 
This operator aims to union two graphs and generate a new graph directly. 

The union of two graphs is the union of their vertex sets and their edge 
families. Vertexes and edges which are included in either graph will be part of 
the new graph.

| G ∪ H = (VG ∪ VH, EG ∪ EH).

The below image shows a union of graph G and graph H

!union_operator.png|width=600px!

A simple interface would be:

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]


However, overlapping vertexes and edges will inevitably happen at the borders 
of the graphs. It is necessary for the interface to consider how to handle this 
case for both Vertex and Edge. For vertexes, it's quite natural to just take a 
union and remove those duplicate vertexes. But for edges, a mergeEdges function 
seems to be more reasonable.

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: 
(ED, ED) => ED): Graph[VD, ED]


  was:
This operator aims to union two graphs and generate a new graph directly. 
Vertexes and edges which are included in either graph will be part of the new 
graph.

The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex 
sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH).

The below image shows a union of graph G and graph H

!union_operator.png|width=600px!

A simple interface would be:

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]


However, overlapping vertexes and edges will inevitably happen at the borders 
of the graphs. It is necessary for the interface to consider how to handle this 
case for both Vertex and Edge. For vertexes, it's quite natural to just take a 
union and remove those duplicate vertexes. But for edges, a mergeEdges function 
seems to be more reasonable.

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: 
(ED, ED) => ED): Graph[VD, ED]



 Graph Union Operator
 

 Key: SPARK-7894
 URL: https://issues.apache.org/jira/browse/SPARK-7894
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang
  Labels: graph, union
 Attachments: union_operator.png


 This operator aims to union two graphs and generate a new graph directly. 
 The union of two graphs is the union of their vertex sets and their edge 
 families. Vertexes and edges which are included in either graph will be part 
 of the new graph.
 | G ∪ H = (VG ∪ VH, EG ∪ EH).
 The below image shows a union of graph G and graph H
 !union_operator.png|width=600px!
 A simple interface would be:
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]
 However, overlapping vertexes and edges will inevitably happen at the borders 
 of the graphs. It is necessary for the interface to consider how to handle 
 this case for both Vertex and Edge. For vertexes, it's quite natural to just 
 take a union and remove those duplicate vertexes. But for edges, a mergeEdges 
 function seems to be more reasonable.
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: 
 (ED, ED) => ED): Graph[VD, ED]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7894) Graph Union Operator

2015-05-27 Thread Andy Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Huang updated SPARK-7894:
--
Labels: graph union  (was: graph)

 Graph Union Operator
 

 Key: SPARK-7894
 URL: https://issues.apache.org/jira/browse/SPARK-7894
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang
  Labels: graph, union
 Attachments: union_operator.png


 This operator aims to union two graphs and generate a new graph directly. 
 Vertexes and edges which are included in either graph will be part of the new 
 graph.
 The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex 
 sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH).
 The below image shows a union of graph G and graph H
 !union_operator.png|width=600px!
 A simple interface would be:
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]
 However, overlapping vertexes and edges will inevitably happen at the borders 
 of the graphs. It is necessary for the interface to consider how to handle 
 this case for both Vertex and Edge. For vertexes, it's quite natural to just 
 take a union and remove those duplicate vertexes. But for edges, a mergeEdges 
 function seems to be more reasonable.
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: 
 (ED, ED) => ED): Graph[VD, ED]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7894) Graph Union Operator

2015-05-27 Thread Andy Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Huang updated SPARK-7894:
--
Description: 
This operator aims to union two graphs and generate a new graph directly. 
Vertexes and edges which are included in either graph will be part of the new 
graph.

The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex 
sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH).

The below image shows a union of graph G and graph H

!https://raw.githubusercontent.com/andyyehoo/anything/master/images/union_operator.png|thumbnail!

A simple interface would be:

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]


However, overlapping vertexes and edges will inevitably happen at the borders 
of the graphs. It is necessary for the interface to consider how to handle this 
case for both Vertex and Edge. 

For vertexes, it's quite natural to take a union and remove duplicate vertexes. 
But for edges, a mergeEdges function seems to be more reasonable.

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: 
(ED, ED) => ED): Graph[VD, ED]


  was:
This operator aims to union two graphs and generate a new graph directly. 
Vertexes and edges which are included in either graph will be part of the new 
graph.

The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex 
sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH).

The below image shows a union of graph G and graph H

!https://raw.githubusercontent.com/andyyehoo/anything/master/images/union_operator.png!

A simple interface would be:

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]


However, overlapping vertexes and edges will inevitably happen at the borders 
of the graphs. It is necessary for the interface to consider how to handle this 
case for both Vertex and Edge. 

For vertexes, it's quite natural to take a union and remove duplicate vertexes. 
But for edges, a mergeEdges function seems to be more reasonable.

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: 
(ED, ED) => ED): Graph[VD, ED]



 Graph Union Operator
 

 Key: SPARK-7894
 URL: https://issues.apache.org/jira/browse/SPARK-7894
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang

 This operator aims to union two graphs and generate a new graph directly. 
 Vertexes and edges which are included in either graph will be part of the new 
 graph.
 The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex 
 sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH).
 The below image shows a union of graph G and graph H
 !https://raw.githubusercontent.com/andyyehoo/anything/master/images/union_operator.png|thumbnail!
 A simple interface would be:
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]
 However, overlapping vertexes and edges will inevitably happen at the borders 
 of the graphs. It is necessary for the interface to consider how to handle 
 this case for both Vertex and Edge. 
 For vertexes, it's quite natural to take a union and remove duplicate 
 vertexes. But for edges, a mergeEdges function seems to be more reasonable.
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: 
 (ED, ED) => ED): Graph[VD, ED]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7894) Graph Union Operator

2015-05-27 Thread Andy Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Huang updated SPARK-7894:
--
Description: 
This operator aims to union two graphs and generate a new graph directly. 
Vertexes and edges which are included in either graph will be part of the new 
graph.

The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex 
sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH).

The below image shows a union of graph G and graph H

!https://raw.githubusercontent.com/andyyehoo/anything/master/images/union_operator.png!

A simple interface would be:

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]


However, overlapping vertexes and edges will inevitably happen at the borders 
of the graphs. It is necessary for the interface to consider how to handle this 
case for both Vertex and Edge. 

For vertexes, it's quite natural to take a union and remove duplicate vertexes. 
But for edges, a mergeEdges function seems to be more reasonable.

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: 
(ED, ED) => ED): Graph[VD, ED]


  was:
This operator aims to union two graphs and generate a new graph directly. 
Vertexes and edges which are included in either graph will be part of the new 
graph.

The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex 
sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH).

The below image shows a union of graph G and graph H

!https://raw.githubusercontent.com/andyyehoo/anything/master/images/union_operator.png

A simple interface would be:

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]


However, overlapping vertexes and edges will inevitably happen at the borders 
of the graphs. It is necessary for the interface to consider how to handle this 
case for both Vertex and Edge. 

For vertexes, it's quite natural to take a union and remove duplicate vertexes. 
But for edges, a mergeEdges function seems to be more reasonable.

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: 
(ED, ED) => ED): Graph[VD, ED]



 Graph Union Operator
 

 Key: SPARK-7894
 URL: https://issues.apache.org/jira/browse/SPARK-7894
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang

 This operator aims to union two graphs and generate a new graph directly. 
 Vertexes and edges which are included in either graph will be part of the new 
 graph.
 The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex 
 sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH).
 The below image shows a union of graph G and graph H
 !https://raw.githubusercontent.com/andyyehoo/anything/master/images/union_operator.png!
 A simple interface would be:
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]
 However, overlapping vertexes and edges will inevitably happen at the borders 
 of the graphs. It is necessary for the interface to consider how to handle 
 this case for both Vertex and Edge. 
 For vertexes, it's quite natural to take a union and remove duplicate 
 vertexes. But for edges, a mergeEdges function seems to be more reasonable.
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: 
 (ED, ED) => ED): Graph[VD, ED]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7894) Graph Union Operator

2015-05-27 Thread Andy Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Huang updated SPARK-7894:
--
Description: 
This operator aims to union two graphs and generate a new graph directly. 
Vertexes and edges which are included in either graph will be part of the new 
graph.

The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex 
sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH).

The below image shows a union of graph G and graph H

!https://raw.githubusercontent.com/andyyehoo/anything/master/images/union_operator.png

A simple interface would be:

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]


However, overlapping vertexes and edges will inevitably happen at the borders 
of the graphs. It is necessary for the interface to consider how to handle this 
case for both Vertex and Edge. 

For vertexes, it's quite natural to take a union and remove duplicate vertexes. 
But for edges, a mergeEdges function seems to be more reasonable.

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: 
(ED, ED) => ED): Graph[VD, ED]


  was:
This operator aims to union two graphs and generate a new graph directly. 
Vertexes and edges which are included in either graph will be part of the new 
graph.

The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex 
sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH).

The below image shows a union of graph G and graph H.


A simple interface would be:

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]


However, overlapping vertexes and edges will inevitably happen at the borders 
of the graphs. It is necessary for the interface to consider how to handle this 
case for both Vertex and Edge. 

For vertexes, it's quite natural to take a union and remove duplicate vertexes. 
But for edges, a mergeEdges function seems to be more reasonable.

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: 
(ED, ED) => ED): Graph[VD, ED]



 Graph Union Operator
 

 Key: SPARK-7894
 URL: https://issues.apache.org/jira/browse/SPARK-7894
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang

 This operator aims to union two graphs and generate a new graph directly. 
 Vertexes and edges which are included in either graph will be part of the new 
 graph.
 The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex 
 sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH).
 The below image shows a union of graph G and graph H
 !https://raw.githubusercontent.com/andyyehoo/anything/master/images/union_operator.png
 A simple interface would be:
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]
 However, overlapping vertexes and edges will inevitably happen at the borders 
 of the graphs. It is necessary for the interface to consider how to handle 
 this case for both Vertex and Edge. 
 For vertexes, it's quite natural to take a union and remove duplicate 
 vertexes. But for edges, a mergeEdges function seems to be more reasonable.
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: 
 (ED, ED) => ED): Graph[VD, ED]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7893) Complex Operators between Graphs

2015-05-27 Thread Andy Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Huang updated SPARK-7893:
--
Labels: complex graph operators  (was: )

 Complex Operators between Graphs
 

 Key: SPARK-7893
 URL: https://issues.apache.org/jira/browse/SPARK-7893
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang
  Labels: complex, graph, operators

 Currently there are 30+ operators in GraphX, but few of them consider 
 operators between graphs. The only one is _*mask*_, which takes another graph 
 as a parameter and returns a new graph.
 In many complex cases it will be helpful to operate between graphs directly, 
 for example with *union or join*, especially for streaming cases or when 
 merging a small graph into a big one. Higher-level graph operators can help 
 users focus and think in terms of graphs.
 A detailed list of complex graph operators can be found here: 
 [complex_graph_operations|http://techieme.in/complex-graph-operations/]. 
 This issue will focus on two frequently-used operators first: *union* and 
 *join*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7893) Complex Operators between Graphs

2015-05-27 Thread Andy Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Huang updated SPARK-7893:
--
Labels: complex graph join operators union  (was: complex graph operators)

 Complex Operators between Graphs
 

 Key: SPARK-7893
 URL: https://issues.apache.org/jira/browse/SPARK-7893
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang
  Labels: complex, graph, join, operators, union

 Currently there are 30+ operators in GraphX, but few of them consider 
 operators between graphs. The only one is _*mask*_, which takes another graph 
 as a parameter and returns a new graph.
 In many complex cases it will be helpful to operate between graphs directly, 
 for example with *union or join*, especially for streaming cases or when 
 merging a small graph into a big one. Higher-level graph operators can help 
 users focus and think in terms of graphs.
 A detailed list of complex graph operators can be found here: 
 [complex_graph_operations|http://techieme.in/complex-graph-operations/]. 
 This issue will focus on two frequently-used operators first: *union* and 
 *join*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7893) Complex Operators between Graphs

2015-05-27 Thread Andy Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Huang updated SPARK-7893:
--
Description: 
Currently there are 30+ operators in GraphX, while few of them consider 
operators between graphs. The only one is _*mask*_, which takes another graph 
as a parameter and returns a new graph.

In many complex cases, such as streaming graphs or merging a small graph into 
a big one, it will be helpful to operate between graphs directly, for example 
with *union or join*. Higher-level graph operators can help users focus and 
think in terms of graphs.


A list of complex graph operators can be found here: 
[complex_graph_operations|http://techieme.in/complex-graph-operations/]. 
This issue will focus on two frequently-used operators first: *union* and 
*join*.

  was:
Currently there are 30+ operators in GraphX, while few of them consider 
operators between graphs. The only one is _*mask*_, which takes another graph 
as a parameter and returns a new graph.

In many complex cases it will be helpful to operate between graphs directly, 
for example with *union or join*, especially for streaming cases or when 
merging a small graph into a big one. Higher-level graph operators can help 
users focus and think in terms of graphs.


A list of complex graph operators can be found here: 
[complex_graph_operations|http://techieme.in/complex-graph-operations/]. 
This issue will focus on two frequently-used operators first: *union* and 
*join*.


 Complex Operators between Graphs
 

 Key: SPARK-7893
 URL: https://issues.apache.org/jira/browse/SPARK-7893
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang
  Labels: complex, graph, join, operators, union

 Currently there are 30+ operators in GraphX, while few of them consider 
 operators between graphs. The only one is _*mask*_, which takes another graph 
 as a parameter and returns a new graph.
 In many complex cases, such as streaming graphs or merging a small graph into 
 a big one, it will be helpful to operate between graphs directly, for example 
 with *union or join*. Higher-level graph operators can help users focus and 
 think in terms of graphs.
 A list of complex graph operators can be found here: 
 [complex_graph_operations|http://techieme.in/complex-graph-operations/]. 
 This issue will focus on two frequently-used operators first: *union* and 
 *join*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7853) ClassNotFoundException for SparkSQL

2015-05-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560772#comment-14560772
 ] 

Apache Spark commented on SPARK-7853:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/6435

 ClassNotFoundException for SparkSQL
 ---

 Key: SPARK-7853
 URL: https://issues.apache.org/jira/browse/SPARK-7853
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Cheng Hao
Priority: Blocker

 Reproduce steps:
 {code}
 bin/spark-sql --jars 
 ./sql/hive/src/test/resources/hive-hcatalog-core-0.13.1.jar
 CREATE TABLE t1(a string, b string) ROW FORMAT SERDE 
 'org.apache.hive.hcatalog.data.JsonSerDe';
 {code}
 Throws an exception like:
 {noformat}
 15/05/26 00:16:33 ERROR SparkSQLDriver: Failed in [CREATE TABLE t1(a string, 
 b string) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe']
 org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution 
 Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Cannot 
 validate serde: org.apache.hive.hcatalog.data.JsonSerDe
   at 
 org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:333)
   at 
 org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:310)
   at 
 org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:139)
   at 
 org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:310)
   at 
 org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:300)
   at 
 org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:457)
   at 
 org.apache.spark.sql.hive.execution.HiveNativeCommand.run(HiveNativeCommand.scala:33)
   at 
 org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
   at 
 org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
   at 
 org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:68)
   at 
 org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
   at 
 org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
   at 
 org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:87)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:922)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:922)
   at org.apache.spark.sql.DataFrame.init(DataFrame.scala:147)
   at org.apache.spark.sql.DataFrame.init(DataFrame.scala:131)
   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:727)
   at 
 org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:57)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:283)
   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:218)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
   at 
 org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7891) Python class in __main__ may trigger AssertionError

2015-05-27 Thread flykobe cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

flykobe cheng updated SPARK-7891:
-
Attachment: demo_error.log
demo_pickle_error.py

 Python class in __main__ may trigger AssertionError
 ---

 Key: SPARK-7891
 URL: https://issues.apache.org/jira/browse/SPARK-7891
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
 Environment: Linux, Python 2.7.3
 pickled by Python pickle Lib
Reporter: flykobe cheng
Priority: Minor
 Attachments: demo_error.log, demo_pickle_error.py


 Callback functions for Spark transformations and actions will be pickled. 
 If the callback is an instance method of a class defined in the __main__ 
 module, and the class has more than one instance method that uses class 
 properties or classmethods, the class will be pickled twice and 
 'pickle.memoize'd twice, which triggers an AssertionError.
 Demo code (indentation and string quotes restored; imports added for 
 completeness):
 import logging
 import sys
 import pyspark

 class AClass(object):
     _class_var = {'classkey': 'classval', }

     def main_object_method(self, item):
         logging.warn("class var by %s: %s" % (
             sys._getframe().f_code.co_name, AClass._class_var['classkey']))

     def main_object_method2(self, item):
         logging.warn("class var by %s: %s" % (
             sys._getframe().f_code.co_name, AClass._class_var['classkey']))

 def test_main_object_method(sc):
     obj = AClass()
     res = sc.parallelize(range(4)).map(obj.main_object_method).collect()

 if __name__ == '__main__':
     cf = pyspark.SparkConf()
     cf.set('spark.cores.max', 1)
     sc = pyspark.SparkContext(appName="flykobe_demo_pickle_error", conf=cf)
     test_main_object_method(sc)
 Traceback:
   File "/home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py", line 310, in save_function_tuple
     save(f_globals)
   File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 291, in save
     f(self, obj) # Call unbound method with explicit self
   File "/home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py", line 174, in save_dict
     pickle.Pickler.save_dict(self, obj)
   File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 654, in save_dict
     self._batch_setitems(obj.iteritems())
   File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 686, in _batch_setitems
     save(v)
   File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 291, in save
     f(self, obj) # Call unbound method with explicit self
   File "/home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py", line 468, in save_global
     d),obj=obj)
   File "/home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py", line 638, in save_reduce
     self.memoize(obj)
   File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 248, in memoize
     assert id(obj) not in self.memo
 AssertionError
 Problem in Python/Lib/pickle.py:
 def memoize(self, obj):
     """Store an object in the memo."""
     if self.fast:
         return
     assert id(obj) not in self.memo
     memo_len = len(self.memo)
     self.write(self.put(memo_len))
     self.memo[id(obj)] = memo_len, obj



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7894) Graph Union Operator

2015-05-27 Thread Andy Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Huang updated SPARK-7894:
--
 External issue ID:   (was: 7893)
External issue URL:   (was: 
https://issues.apache.org/jira/browse/SPARK-7893)

 Graph Union Operator
 

 Key: SPARK-7894
 URL: https://issues.apache.org/jira/browse/SPARK-7894
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang

 This operator aims to union two graphs and generate a new graph directly. 
 Vertexes and edges that are included in either graph will be part of the new 
 graph.
 The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex 
 sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH).
 The image below shows a union of graph G and graph H.
 A simple interface would be:
   def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]
 However, overlapping vertexes and edges will inevitably occur where the two 
 graphs meet. The interface needs to consider how to handle this case for both 
 vertexes and edges. 
 For vertexes, it is quite natural to take the union and remove duplicate 
 vertexes. But for edges, a mergeEdges function seems more reasonable.
   def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: 
   (ED, ED) => ED): Graph[VD, ED]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7894) Graph Union Operator

2015-05-27 Thread Andy Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Huang updated SPARK-7894:
--
Description: 
This operator aims to union two graphs and generate a new graph directly. 
Vertexes and edges that are included in either graph will be part of the new 
graph.

The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex 
sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH).

The image below shows a union of graph G and graph H.


A simple interface would be:

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]


However, overlapping vertexes and edges will inevitably occur where the two 
graphs meet. The interface needs to consider how to handle this case for both 
vertexes and edges. 

For vertexes, it is quite natural to take the union and remove duplicate 
vertexes. But for edges, a mergeEdges function seems more reasonable.

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: 
(ED, ED) => ED): Graph[VD, ED]


  was:
This operator aims to union two graphs and generate a new graph directly. 
Vertexes and edges that are included in either graph will be part of the new 
graph.

The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex 
sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH).

The image below shows a union of graph G and graph H.


A simple interface would be:

def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]


However, overlapping vertexes and edges will inevitably occur where the two 
graphs meet. The interface needs to consider how to handle this case for both 
vertexes and edges. 

For vertexes, it is quite natural to take the union and remove duplicate 
vertexes. But for edges, a mergeEdges function seems more reasonable.

def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: 
(ED, ED) => ED): Graph[VD, ED]



 Graph Union Operator
 

 Key: SPARK-7894
 URL: https://issues.apache.org/jira/browse/SPARK-7894
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang

 This operator aims to union two graphs and generate a new graph directly. 
 Vertexes and edges that are included in either graph will be part of the new 
 graph.
 The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex 
 sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH).
 The image below shows a union of graph G and graph H.
 A simple interface would be:
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]
 However, overlapping vertexes and edges will inevitably occur where the two 
 graphs meet. The interface needs to consider how to handle this case for both 
 vertexes and edges. 
 For vertexes, it is quite natural to take the union and remove duplicate 
 vertexes. But for edges, a mergeEdges function seems more reasonable.
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: 
 (ED, ED) => ED): Graph[VD, ED]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7894) Graph Union Operator

2015-05-27 Thread Andy Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Huang updated SPARK-7894:
--
Description: 
This operator aims to union two graphs and generate a new graph directly. The 
union of two graphs is the union of their vertex sets and their edge 
families. Vertexes and edges that are included in either graph will be part 
of the new graph.

bq. G ∪ H = (VG ∪ VH, EG ∪ EH)

The image below shows a union of graph G and graph H.

!union_operator.png|width=800px!

A simple interface would be:

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]


However, overlapping vertexes and edges will inevitably occur where the two 
graphs meet. The interface needs to consider how to handle this case for both 
vertexes and edges. For vertexes, it is quite natural to just take the union 
and remove duplicate vertexes. But for edges, a mergeEdges function seems 
more reasonable.

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: 
(ED, ED) => ED): Graph[VD, ED]


  was:
This operator aims to union two graphs and generate a new graph directly. The 
union of two graphs is the union of their vertex sets and their edge 
families. Vertexes and edges that are included in either graph will be part 
of the new graph.

bq. G ∪ H = (VG ∪ VH, EG ∪ EH)

The image below shows a union of graph G and graph H.

!union_operator.png|width=600px!

A simple interface would be:

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]


However, overlapping vertexes and edges will inevitably occur where the two 
graphs meet. The interface needs to consider how to handle this case for both 
vertexes and edges. For vertexes, it is quite natural to just take the union 
and remove duplicate vertexes. But for edges, a mergeEdges function seems 
more reasonable.

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: 
(ED, ED) => ED): Graph[VD, ED]



 Graph Union Operator
 

 Key: SPARK-7894
 URL: https://issues.apache.org/jira/browse/SPARK-7894
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang
  Labels: graph, union
 Attachments: union_operator.png


 This operator aims to union two graphs and generate a new graph directly. The 
 union of two graphs is the union of their vertex sets and their edge 
 families. Vertexes and edges that are included in either graph will be part 
 of the new graph.
 bq. G ∪ H = (VG ∪ VH, EG ∪ EH)
 The image below shows a union of graph G and graph H.
 !union_operator.png|width=800px!
 A simple interface would be:
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]
 However, overlapping vertexes and edges will inevitably occur where the two 
 graphs meet. The interface needs to consider how to handle this case for both 
 vertexes and edges. For vertexes, it is quite natural to just take the union 
 and remove duplicate vertexes. But for edges, a mergeEdges function seems 
 more reasonable.
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: 
 (ED, ED) => ED): Graph[VD, ED]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7894) Graph Union Operator

2015-05-27 Thread Andy Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Huang updated SPARK-7894:
--
Description: 
This operator aims to union two graphs and generate a new graph directly. The 
union of two graphs is the union of their vertex sets and their edge 
families. Vertexes and edges that are included in either graph will be part 
of the new graph.

bq. G ∪ H = (VG ∪ VH, EG ∪ EH)

The image below shows a union of graph G and graph H.

!union_operator.png|width=600px,align=center!

A simple interface would be:

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]


However, overlapping vertexes and edges will inevitably occur where the two 
graphs meet. The interface needs to consider how to handle this case for both 
vertexes and edges. For vertexes, it is quite natural to just take the union 
and remove duplicate vertexes. But for edges, a mergeEdges function seems 
more reasonable.

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: 
(ED, ED) => ED): Graph[VD, ED]
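
For illustration, a call under this proposed signature might look like the 
following (g and h are hypothetical graphs with Int edge attributes; where 
the same edge exists in both graphs, its weights are summed):

{code}
// Hypothetical usage: g and h are Graph[String, Int] values built elsewhere.
// Overlapping edges are combined by summing their integer weights.
val merged: Graph[String, Int] = g.union(h, (w1: Int, w2: Int) => w1 + w2)
{code}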


  was:
This operator aims to union two graphs and generate a new graph directly. The 
union of two graphs is the union of their vertex sets and their edge 
families. Vertexes and edges that are included in either graph will be part 
of the new graph.

bq. G ∪ H = (VG ∪ VH, EG ∪ EH)

The image below shows a union of graph G and graph H.

!union_operator.png|width=800px!

A simple interface would be:

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]


However, overlapping vertexes and edges will inevitably occur where the two 
graphs meet. The interface needs to consider how to handle this case for both 
vertexes and edges. For vertexes, it is quite natural to just take the union 
and remove duplicate vertexes. But for edges, a mergeEdges function seems 
more reasonable.

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: 
(ED, ED) => ED): Graph[VD, ED]



 Graph Union Operator
 

 Key: SPARK-7894
 URL: https://issues.apache.org/jira/browse/SPARK-7894
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang
  Labels: graph, union
 Attachments: union_operator.png


 This operator aims to union two graphs and generate a new graph directly. The 
 union of two graphs is the union of their vertex sets and their edge 
 families. Vertexes and edges that are included in either graph will be part 
 of the new graph.
 bq. G ∪ H = (VG ∪ VH, EG ∪ EH)
 The image below shows a union of graph G and graph H.
 !union_operator.png|width=600px,align=center!
 A simple interface would be:
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]
 However, overlapping vertexes and edges will inevitably occur where the two 
 graphs meet. The interface needs to consider how to handle this case for both 
 vertexes and edges. For vertexes, it is quite natural to just take the union 
 and remove duplicate vertexes. But for edges, a mergeEdges function seems 
 more reasonable.
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: 
 (ED, ED) => ED): Graph[VD, ED]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7894) Graph Union Operator

2015-05-27 Thread Andy Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Huang updated SPARK-7894:
--
Description: 
This operator aims to union two graphs and generate a new graph directly. The 
union of two graphs is the union of their vertex sets and their edge 
families. Vertexes and edges that are included in either graph will be part 
of the new graph.

bq. G ∪ H = (VG ∪ VH, EG ∪ EH)

The image below shows a union of graph G and graph H.

!union_operator.png|width=600px!

A simple interface would be:

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]


However, overlapping vertexes and edges will inevitably occur where the two 
graphs meet. The interface needs to consider how to handle this case for both 
vertexes and edges. For vertexes, it is quite natural to just take the union 
and remove duplicate vertexes. But for edges, a mergeEdges function seems 
more reasonable.

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: 
(ED, ED) => ED): Graph[VD, ED]


  was:
This operator aims to union two graphs and generate a new graph directly. 

The union of two graphs is the union of their vertex sets and their edge 
families. Vertexes and edges that are included in either graph will be part 
of the new graph.

bq. G ∪ H = (VG ∪ VH, EG ∪ EH)

The image below shows a union of graph G and graph H.

!union_operator.png|width=600px!

A simple interface would be:

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]


However, overlapping vertexes and edges will inevitably occur where the two 
graphs meet. The interface needs to consider how to handle this case for both 
vertexes and edges. For vertexes, it is quite natural to just take the union 
and remove duplicate vertexes. But for edges, a mergeEdges function seems 
more reasonable.

bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: 
(ED, ED) => ED): Graph[VD, ED]



 Graph Union Operator
 

 Key: SPARK-7894
 URL: https://issues.apache.org/jira/browse/SPARK-7894
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang
  Labels: graph, union
 Attachments: union_operator.png


 This operator aims to union two graphs and generate a new graph directly. The 
 union of two graphs is the union of their vertex sets and their edge 
 families. Vertexes and edges that are included in either graph will be part 
 of the new graph.
 bq. G ∪ H = (VG ∪ VH, EG ∪ EH)
 The image below shows a union of graph G and graph H.
 !union_operator.png|width=600px!
 A simple interface would be:
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]
 However, overlapping vertexes and edges will inevitably occur where the two 
 graphs meet. The interface needs to consider how to handle this case for both 
 vertexes and edges. For vertexes, it is quite natural to just take the union 
 and remove duplicate vertexes. But for edges, a mergeEdges function seems 
 more reasonable.
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: 
 (ED, ED) => ED): Graph[VD, ED]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7893) Complex Operators between Graphs

2015-05-27 Thread Andy Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Huang updated SPARK-7893:
--
Description: 
Currently there are 30+ operators in GraphX, while few of them consider 
operators between graphs. The only one is _*mask*_, which takes another graph 
as a parameter and returns a new graph.

In many complex cases, such as _*streaming graphs or merging a small graph 
into a huge one*_, higher-level graph operators can help users focus and 
think in terms of graphs. Performance optimization can be done within the 
operator and remain transparent to users.


A list of complex graph operators can be found here: 
[complex_graph_operations|http://techieme.in/complex-graph-operations/]. 
This issue will focus on two frequently-used operators first: *union* and 
*join*.
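
As a starting point for discussion, a cross-graph join could look roughly like 
the sketch below. The signature is hypothetical, not an existing GraphX API: 
it keeps only the vertexes present in both graphs, combines their attributes 
with a user function, and reuses the left graph's edges.

{code}
import scala.reflect.ClassTag
import org.apache.spark.graphx._

// Sketch only: inner-join the vertex sets of two graphs and combine their
// attributes with f; edges are taken from the left graph. Edges whose
// endpoints were dropped by the join would get default (null) vertex
// attributes, which a real implementation would have to filter out.
def joinGraphs[VD: ClassTag, VD2: ClassTag, VD3: ClassTag, ED: ClassTag, ED2](
    g: Graph[VD, ED],
    other: Graph[VD2, ED2])(f: (VertexId, VD, VD2) => VD3): Graph[VD3, ED] = {
  val vertices = g.vertices.join(other.vertices)
    .map { case (id, (a, b)) => (id, f(id, a, b)) }
  Graph(vertices, g.edges)
}
{code}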

  was:
Currently there are 30+ operators in GraphX, while few of them consider 
operators between graphs. The only one is _*mask*_, which takes another graph 
as a parameter and return a new graph.

In many complex case,such as _*streaming graph, small graph merge into huge 
graph*_, higher level operators of graphs can help users to focus and think in 
graph. Performance optimization can be done within operator and be transparent.


Complex graph operator list is 
here:[complex_graph_operations|http://techieme.in/complex-graph-operations/]. 
This issue will focus on two frequently-used operators first: *union* and 
*join*.


 Complex Operators between Graphs
 

 Key: SPARK-7893
 URL: https://issues.apache.org/jira/browse/SPARK-7893
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang
  Labels: complex, graph, join, operators, union

 Currently there are 30+ operators in GraphX, while few of them consider 
 operators between graphs. The only one is _*mask*_, which takes another graph 
 as a parameter and returns a new graph.
 In many complex cases, such as _*streaming graphs or merging a small graph 
 into a huge one*_, higher-level graph operators can help users focus and 
 think in terms of graphs. Performance optimization can be done within the 
 operator and remain transparent to users.
 A list of complex graph operators can be found here: 
 [complex_graph_operations|http://techieme.in/complex-graph-operations/]. 
 This issue will focus on two frequently-used operators first: *union* and 
 *join*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7893) Complex Operators between Graphs

2015-05-27 Thread Andy Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Huang updated SPARK-7893:
--
Description: 
Currently there are 30+ operators in GraphX, while few of them consider 
operators between graphs. The only one is _*mask*_, which takes another graph 
as a parameter and returns a new graph.

In many complex cases, such as _*streaming graphs or merging a small graph 
into a huge one*_, higher-level graph operators can help users focus and 
think in terms of graphs. Performance optimization can be done within the 
operator and remain transparent.


A list of complex graph operators can be found here: 
[complex_graph_operations|http://techieme.in/complex-graph-operations/]. 
This issue will focus on two frequently-used operators first: *union* and 
*join*.

  was:
Currently there are 30+ operators in GraphX, while few of them consider 
operators between graphs. The only one is _*mask*_, which takes another graph 
as a parameter and returns a new graph.

In many complex cases, such as _*streaming graphs or merging a small graph 
into a huge one*_, higher-level graph operators can help users focus and 
think in terms of graphs. 


A list of complex graph operators can be found here: 
[complex_graph_operations|http://techieme.in/complex-graph-operations/]. 
This issue will focus on two frequently-used operators first: *union* and 
*join*.


 Complex Operators between Graphs
 

 Key: SPARK-7893
 URL: https://issues.apache.org/jira/browse/SPARK-7893
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang
  Labels: complex, graph, join, operators, union

 Currently there are 30+ operators in GraphX, while few of them consider 
 operators between graphs. The only one is _*mask*_, which takes another graph 
 as a parameter and returns a new graph.
 In many complex cases, such as _*streaming graphs or merging a small graph 
 into a huge one*_, higher-level graph operators can help users focus and 
 think in terms of graphs. Performance optimization can be done within the 
 operator and remain transparent.
 A list of complex graph operators can be found here: 
 [complex_graph_operations|http://techieme.in/complex-graph-operations/]. 
 This issue will focus on two frequently-used operators first: *union* and 
 *join*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7782) A small problem on history server webpage

2015-05-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560995#comment-14560995
 ] 

Apache Spark commented on SPARK-7782:
-

User 'zuxqoj' has created a pull request for this issue:
https://github.com/apache/spark/pull/6437

 A small problem on history server webpage
 -

 Key: SPARK-7782
 URL: https://issues.apache.org/jira/browse/SPARK-7782
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.2.1, 1.3.1
Reporter: Xia Hu
Priority: Minor
  Labels: starter

 A very small problem on the Spark history server webpage:
 we can click on the header of each column to sort the app list, for example 
 by start time or completed time. But when the down arrow is shown, the sort 
 order is actually ascending. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7782) A small problem on history server webpage

2015-05-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7782:
---

Assignee: Apache Spark

 A small problem on history server webpage
 -

 Key: SPARK-7782
 URL: https://issues.apache.org/jira/browse/SPARK-7782
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.2.1, 1.3.1
Reporter: Xia Hu
Assignee: Apache Spark
Priority: Minor
  Labels: starter

 A very small problem on the Spark history server webpage:
 we can click on the header of each column to sort the app list, for example 
 by start time or completed time. But when the down arrow is shown, the sort 
 order is actually ascending. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7897) Column with an unsigned bigint should be treated as DecimalType in JDBCRDD

2015-05-27 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-7897:
--

 Summary: Column with an unsigned bigint should be treated as 
DecimalType in JDBCRDD
 Key: SPARK-7897
 URL: https://issues.apache.org/jira/browse/SPARK-7897
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Liang-Chi Hsieh






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7896) IndexOutOfCountsException in ChainedBuffer

2015-05-27 Thread Arun Ahuja (JIRA)
Arun Ahuja created SPARK-7896:
-

 Summary: IndexOutOfCountsException in ChainedBuffer
 Key: SPARK-7896
 URL: https://issues.apache.org/jira/browse/SPARK-7896
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Arun Ahuja


I've run into this on two tasks that use the same dataset.

The dataset is a collection of strings where the most common string appears 
~200M times and the next few appear ~50M times each.

For this rdd: RDD[String], I can do rdd.map(x => (x, 1)).reduceByKey(_ + _) 
to get the counts (which is how I got the numbers above), but I hit the error 
on rdd.groupByKey().

Also, I have a second RDD of strings, rdd2: RDD[String], and I cannot do 
rdd2.leftOuterJoin(rdd) without hitting this error.

{code}
15/05/26 23:27:55 WARN scheduler.TaskSetManager: Lost task 3169.1 in stage 5.0 
(TID 4843, demeter-csmaz10-19.demeter.hpc.mssm.edu): 
java.lang.IndexOutOfBoundsException: 512
at 
scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43)
at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:47)
at 
org.apache.spark.util.collection.ChainedBuffer.write(ChainedBuffer.scala:110)
at 
org.apache.spark.util.collection.ChainedBufferOutputStream.write(ChainedBuffer.scala:141)
at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
at 
org.apache.spark.serializer.KryoSerializationStream.flush(KryoSerializer.scala:147)
at 
org.apache.spark.util.collection.PartitionedSerializedPairBuffer.insert(PartitionedSerializedPairBuffer.scala:78)
at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:219)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7897) Column with an unsigned bigint should be treated as DecimalType in JDBCRDD

2015-05-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7897:
---

Assignee: (was: Apache Spark)

 Column with an unsigned bigint should be treated as DecimalType in JDBCRDD
 --

 Key: SPARK-7897
 URL: https://issues.apache.org/jira/browse/SPARK-7897
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Liang-Chi Hsieh





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7897) Column with an unsigned bigint should be treated as DecimalType in JDBCRDD

2015-05-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561015#comment-14561015
 ] 

Apache Spark commented on SPARK-7897:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/6438

 Column with an unsigned bigint should be treated as DecimalType in JDBCRDD
 --

 Key: SPARK-7897
 URL: https://issues.apache.org/jira/browse/SPARK-7897
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Liang-Chi Hsieh





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7697) Column with an unsigned int should be treated as long in JDBCRDD

2015-05-27 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561022#comment-14561022
 ] 

Liang-Chi Hsieh commented on SPARK-7697:


[~treffer] Thanks for reporting the problem. I opened another 
[ticket|https://issues.apache.org/jira/browse/SPARK-7897] and a PR for that. I 
will use DecimalType for unsigned bigint. It would be great if you could test 
it.
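
For reference, the widening rule being discussed looks roughly like this 
sketch (illustrative only; the function name below is hypothetical and this 
is not the actual JDBCRDD#getCatalystType code):

{code}
import java.sql.Types
import org.apache.spark.sql.types._

// Unsigned columns are widened to the next larger Catalyst type so the
// full value range fits: an unsigned INT can exceed Int.MaxValue, and an
// unsigned BIGINT can exceed Long.MaxValue.
def catalystTypeFor(sqlType: Int, signed: Boolean): DataType = sqlType match {
  case Types.INTEGER if signed => IntegerType
  case Types.INTEGER           => LongType
  case Types.BIGINT if signed  => LongType
  case Types.BIGINT            => DecimalType(20, 0)
  case _                       => StringType // fallback for this sketch only
}
{code}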

 Column with an unsigned int should be treated as long in JDBCRDD
 

 Key: SPARK-7697
 URL: https://issues.apache.org/jira/browse/SPARK-7697
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: DAITO Teppei
Assignee: Liang-Chi Hsieh
 Fix For: 1.4.0


 Columns with an unsigned numeric type in JDBC should be treated as the next 
 'larger' Java type in JDBCRDD#getCatalystType.
 https://github.com/apache/spark/blob/517eb37a85e0a28820bcfd5d98c50d02df6521c6/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JDBCRDD.scala#L49
 {code:title=q.sql}
 create table t1 (id int unsigned);
 insert into t1 values (4234567890);
 {code}
 {code:title=T1.scala}
 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.sql.SQLContext

 object T1 {
   def main(args: Array[String]) {
     val sc = new SparkContext(new SparkConf())
     val s = new SQLContext(sc)
     val url = "jdbc:mysql://localhost/test"
     val t1 = s.jdbc(url, "t1")
     t1.printSchema()
     t1.collect().foreach(println)
   }
 }
 {code}
 This code caused an error like the one below.
 {noformat}
 15/05/18 11:39:51 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
 xxx): com.mysql.jdbc.exceptions.jdbc4.MySQLDataException: '4.23456789E9' in 
 column '1' is outside valid range for the datatype INTEGER.
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
 Method)
 at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
 at com.mysql.jdbc.Util.handleNewInstance(Util.java:377)
 at com.mysql.jdbc.Util.getInstance(Util.java:360)
 at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:963)
 at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:935)
 at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:924)
 at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:870)
 at 
 com.mysql.jdbc.ResultSetImpl.throwRangeException(ResultSetImpl.java:7090)
 at 
 com.mysql.jdbc.ResultSetImpl.parseIntAsDouble(ResultSetImpl.java:6364)
 at com.mysql.jdbc.ResultSetImpl.getInt(ResultSetImpl.java:2484)
 at 
 org.apache.spark.sql.jdbc.JDBCRDD$$anon$1.getNext(JDBCRDD.scala:344)
 at 
 org.apache.spark.sql.jdbc.JDBCRDD$$anon$1.hasNext(JDBCRDD.scala:399)
 ...
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7897) Column with an unsigned bigint should be treated as DecimalType in JDBCRDD

2015-05-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7897:
---

Assignee: Apache Spark

 Column with an unsigned bigint should be treated as DecimalType in JDBCRDD
 --

 Key: SPARK-7897
 URL: https://issues.apache.org/jira/browse/SPARK-7897
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Liang-Chi Hsieh
Assignee: Apache Spark





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7895) Move Kafka examples from scala-2.10/src to src

2015-05-27 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-7895:
---

 Summary: Move Kafka examples from scala-2.10/src to src
 Key: SPARK-7895
 URL: https://issues.apache.org/jira/browse/SPARK-7895
 Project: Spark
  Issue Type: Improvement
  Components: Examples, Streaming
Reporter: Shixiong Zhu


Since spark-streaming-kafka is now published for both Scala 2.10 and 2.11, we 
can move KafkaWordCount and DirectKafkaWordCount from 
examples/scala-2.10/src/ to examples/src/ so that they will appear in 
spark-examples-***-jar for Scala 2.11.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7896) IndexOutOfBoundsException in ChainedBuffer

2015-05-27 Thread Arun Ahuja (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun Ahuja updated SPARK-7896:
--
Summary: IndexOutOfBoundsException in ChainedBuffer  (was: 
IndexOutOfCountsException in ChainedBuffer)

 IndexOutOfBoundsException in ChainedBuffer
 --

 Key: SPARK-7896
 URL: https://issues.apache.org/jira/browse/SPARK-7896
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Arun Ahuja

 I've run into this on two tasks that use the same dataset.
 The dataset is a collection of strings where the most common string appears 
 ~200M times and the next few appear ~50M times each.
 For this rdd: RDD[String], I can do rdd.map(x => (x, 1)).reduceByKey(_ + _) 
 to get the counts (which is how I got the numbers above), but I hit the error 
 on rdd.groupByKey().
 Also, I have a second RDD of strings, rdd2: RDD[String], and I cannot do 
 rdd2.leftOuterJoin(rdd) without hitting this error.
 {code}
 15/05/26 23:27:55 WARN scheduler.TaskSetManager: Lost task 3169.1 in stage 
 5.0 (TID 4843, demeter-csmaz10-19.demeter.hpc.mssm.edu): 
 java.lang.IndexOutOfBoundsException: 512
 at 
 scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43)
 at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:47)
 at 
 org.apache.spark.util.collection.ChainedBuffer.write(ChainedBuffer.scala:110)
 at 
 org.apache.spark.util.collection.ChainedBufferOutputStream.write(ChainedBuffer.scala:141)
 at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
 at 
 org.apache.spark.serializer.KryoSerializationStream.flush(KryoSerializer.scala:147)
 at 
 org.apache.spark.util.collection.PartitionedSerializedPairBuffer.insert(PartitionedSerializedPairBuffer.scala:78)
 at 
 org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:219)
 at 
 org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:70)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7895) Move Kafka examples from scala-2.10/src to src

2015-05-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7895:
---

Assignee: Apache Spark

 Move Kafka examples from scala-2.10/src to src
 --

 Key: SPARK-7895
 URL: https://issues.apache.org/jira/browse/SPARK-7895
 Project: Spark
  Issue Type: Improvement
  Components: Examples, Streaming
Reporter: Shixiong Zhu
Assignee: Apache Spark

 Since spark-streaming-kafka is now published for both Scala 2.10 and 2.11, we 
 can move KafkaWordCount and DirectKafkaWordCount from 
 examples/scala-2.10/src/ to examples/src/ so that they will appear in 
 spark-examples-***-jar for Scala 2.11.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7895) Move Kafka examples from scala-2.10/src to src

2015-05-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560937#comment-14560937
 ] 

Apache Spark commented on SPARK-7895:
-

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/6436

 Move Kafka examples from scala-2.10/src to src
 --

 Key: SPARK-7895
 URL: https://issues.apache.org/jira/browse/SPARK-7895
 Project: Spark
  Issue Type: Improvement
  Components: Examples, Streaming
Reporter: Shixiong Zhu

 Since spark-streaming-kafka is now published for both Scala 2.10 and 2.11, we 
 can move KafkaWordCount and DirectKafkaWordCount from 
 examples/scala-2.10/src/ to examples/src/ so that they will appear in 
 spark-examples-***-jar for Scala 2.11.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7895) Move Kafka examples from scala-2.10/src to src

2015-05-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7895:
---

Assignee: (was: Apache Spark)

 Move Kafka examples from scala-2.10/src to src
 --

 Key: SPARK-7895
 URL: https://issues.apache.org/jira/browse/SPARK-7895
 Project: Spark
  Issue Type: Improvement
  Components: Examples, Streaming
Reporter: Shixiong Zhu

 Since spark-streaming-kafka is now published for both Scala 2.10 and 2.11, we 
 can move KafkaWordCount and DirectKafkaWordCount from 
 examples/scala-2.10/src/ to examples/src/ so that they will appear in 
 spark-examples-***-jar for Scala 2.11.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7782) A small problem on history server webpage

2015-05-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7782:
---

Assignee: (was: Apache Spark)

 A small problem on history server webpage
 -

 Key: SPARK-7782
 URL: https://issues.apache.org/jira/browse/SPARK-7782
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.2.1, 1.3.1
Reporter: Xia Hu
Priority: Minor
  Labels: starter

 A very small problem on the Spark history server webpage:
 we can click on the header of each column to sort the app list, for example 
 by start time or completed time. But when the down arrow is shown, the sort 
 order is actually ascending. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7806) spark-ec2 launch script fails for Python3

2015-05-27 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561390#comment-14561390
 ] 

Shivaram Venkataraman commented on SPARK-7806:
--

Merged https://github.com/mesos/spark-ec2/pull/117 which fixes the issue on 
spark-ec2 side.

 spark-ec2 launch script fails for Python3
 -

 Key: SPARK-7806
 URL: https://issues.apache.org/jira/browse/SPARK-7806
 Project: Spark
  Issue Type: Bug
  Components: EC2, PySpark
Affects Versions: 1.3.1
 Environment: All platforms.  
Reporter: Matthew Goodman
Priority: Minor

 Depending on the options used, the spark-ec2 script will terminate 
 ungracefully.  
 Relevant buglets include:
  - urlopen() returning bytes vs. string
  - floor division change for partition calculation
  - filter() iteration behavior change in module calculation
 I have a fixed version that I wish to contribute.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7806) spark-ec2 launch script fails for Python3

2015-05-27 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-7806.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

[~srowen] Could you add [~meawoppl] to the Developers group and assign this 
issue ? 

 spark-ec2 launch script fails for Python3
 -

 Key: SPARK-7806
 URL: https://issues.apache.org/jira/browse/SPARK-7806
 Project: Spark
  Issue Type: Bug
  Components: EC2, PySpark
Affects Versions: 1.3.1
 Environment: All platforms.  
Reporter: Matthew Goodman
Priority: Minor
 Fix For: 1.4.0


 Depending on the options used, the spark-ec2 script will terminate 
 ungracefully.  
 Relevant buglets include:
  - urlopen() returning bytes vs. string
  - floor division change for partition calculation
  - filter() iteration behavior change in module calculation
 I have a fixed version that I wish to contribute.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7550) Support setting the right schema & serde when writing to Hive metastore

2015-05-27 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-7550:

Assignee: Yin Huai

 Support setting the right schema & serde when writing to Hive metastore
 ---

 Key: SPARK-7550
 URL: https://issues.apache.org/jira/browse/SPARK-7550
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.0
Reporter: Reynold Xin
Assignee: Yin Huai

 As of 1.4, Spark SQL does not properly set the table schema and serde when 
 writing a table to Hive's metastore. It would be great to do that properly so 
 that users can use non-Spark SQL systems to read those tables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7550) Support setting the right schema & serde when writing to Hive metastore

2015-05-27 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-7550:

Assignee: Cheng Hao  (was: Yin Huai)

 Support setting the right schema & serde when writing to Hive metastore
 ---

 Key: SPARK-7550
 URL: https://issues.apache.org/jira/browse/SPARK-7550
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.0
Reporter: Reynold Xin
Assignee: Cheng Hao

 As of 1.4, Spark SQL does not properly set the table schema and serde when 
 writing a table to Hive's metastore. It would be great to do that properly so 
 that users can use non-Spark SQL systems to read those tables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7550) Support setting the right schema & serde when writing to Hive metastore

2015-05-27 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-7550:

Shepherd: Yin Huai

 Support setting the right schema & serde when writing to Hive metastore
 ---

 Key: SPARK-7550
 URL: https://issues.apache.org/jira/browse/SPARK-7550
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.0
Reporter: Reynold Xin
Assignee: Cheng Hao

 As of 1.4, Spark SQL does not properly set the table schema and serde when 
 writing a table to Hive's metastore. Would be great to do that properly so 
 users can use non-Spark SQL systems to read those tables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7550) Support setting the right schema & serde when writing to Hive metastore

2015-05-27 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561416#comment-14561416
 ] 

Yin Huai commented on SPARK-7550:
-

[~chenghao] Will you have time to take a look at it? It is related to 
SPARK-6923. I think once we store serde info using Hive's data structures, our 
CLI will work correctly. But for SPARK-6923, it also needs to handle data 
source tables that only have schema info in the table properties.

 Support setting the right schema & serde when writing to Hive metastore
 ---

 Key: SPARK-7550
 URL: https://issues.apache.org/jira/browse/SPARK-7550
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.0
Reporter: Reynold Xin
Assignee: Cheng Hao

 As of 1.4, Spark SQL does not properly set the table schema and serde when 
 writing a table to Hive's metastore. Would be great to do that properly so 
 users can use non-Spark SQL systems to read those tables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7899) PySpark sql/tests breaks pylint validation

2015-05-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561466#comment-14561466
 ] 

Apache Spark commented on SPARK-7899:
-

User 'mnazario' has created a pull request for this issue:
https://github.com/apache/spark/pull/6439

 PySpark sql/tests breaks pylint validation
 --

 Key: SPARK-7899
 URL: https://issues.apache.org/jira/browse/SPARK-7899
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Tests
Affects Versions: 1.4.0
Reporter: Michael Nazario

 The pyspark.sql.types module is dynamically named {{types}} from {{_types}}, 
 which messes up pylint validation.
 From [~justin.uang] below:
 In commit 04e44b37 (the migration to Python 3), {{pyspark/sql/types.py}} was 
 renamed to {{pyspark/sql/\_types.py}} and then some magic in 
 {{pyspark/sql/\_\_init\_\_.py}} dynamically renamed the module back to 
 {{types}}. I imagine that this is some naming conflict with Python 3, but 
 what was the error that showed up?
 The reason I'm asking about this is that it's messing with pylint, since 
 pylint cannot now statically find the module. I also tried importing the 
 package so that {{\_\_init\_\_}} would be run in an init-hook, but that 
 isn't what the discovery mechanism is using. I imagine it's probably just 
 crawling the directory structure.
 One way to work around this would be something akin to this 
 (http://stackoverflow.com/questions/9602811/how-to-tell-pylint-to-ignore-certain-imports),
  where I would have to create a fake module, but I would probably be missing 
 a ton of pylint features on users of that module, and it's pretty hacky.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7899) PySpark sql/tests breaks pylint validation

2015-05-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7899:
---

Assignee: (was: Apache Spark)

 PySpark sql/tests breaks pylint validation
 --

 Key: SPARK-7899
 URL: https://issues.apache.org/jira/browse/SPARK-7899
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Tests
Affects Versions: 1.4.0
Reporter: Michael Nazario

 The pyspark.sql.types module is dynamically named {{types}} from {{_types}}, 
 which messes up pylint validation.
 From [~justin.uang] below:
 In commit 04e44b37 (the migration to Python 3), {{pyspark/sql/types.py}} was 
 renamed to {{pyspark/sql/\_types.py}} and then some magic in 
 {{pyspark/sql/\_\_init\_\_.py}} dynamically renamed the module back to 
 {{types}}. I imagine that this is some naming conflict with Python 3, but 
 what was the error that showed up?
 The reason I'm asking about this is that it's messing with pylint, since 
 pylint cannot now statically find the module. I also tried importing the 
 package so that {{\_\_init\_\_}} would be run in an init-hook, but that 
 isn't what the discovery mechanism is using. I imagine it's probably just 
 crawling the directory structure.
 One way to work around this would be something akin to this 
 (http://stackoverflow.com/questions/9602811/how-to-tell-pylint-to-ignore-certain-imports),
  where I would have to create a fake module, but I would probably be missing 
 a ton of pylint features on users of that module, and it's pretty hacky.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7901) Attempt to request negative number of executors with dynamic allocation

2015-05-27 Thread Ryan Williams (JIRA)
Ryan Williams created SPARK-7901:


 Summary: Attempt to request negative number of executors with 
dynamic allocation
 Key: SPARK-7901
 URL: https://issues.apache.org/jira/browse/SPARK-7901
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.3.1
Reporter: Ryan Williams


I ran a {{spark-shell}} on YARN with dynamic allocation enabled; relevant 
params:

{code}
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=5 \
  --conf spark.dynamicAllocation.maxExecutors=300 \
  --conf spark.dynamicAllocation.schedulerBacklogTimeout=3 \
  --conf spark.dynamicAllocation.executorIdleTimeout=300 \
{code}

It started out with 5 executors, went up to 300 when I ran a job, and then 
killed them all back down to 5 executors after 5 minutes of idle time; all 
working as intended.

When I ran another job, it tried to request -187 executors:

{code}
15/05/27 17:41:12 ERROR util.Utils: Uncaught exception in thread 
spark-dynamic-executor-allocation-0
java.lang.IllegalArgumentException: Attempted to request a negative number of 
executor(s) -187 from the cluster manager. Please specify a positive number!
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:338)
at 
org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1137)
at 
org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:294)
at 
org.apache.spark.ExecutorAllocationManager.addOrCancelExecutorRequests(ExecutorAllocationManager.scala:263)
at 
org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:230)
at 
org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply$mcV$sp(ExecutorAllocationManager.scala:189)
at 
org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189)
at 
org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618)
at 
org.apache.spark.ExecutorAllocationManager$$anon$1.run(ExecutorAllocationManager.scala:189)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}

Now it seems like I'm stuck with 5 executors in this application as some 
internal state is corrupt.

[This dropbox 
folder|https://www.dropbox.com/sh/36slqgyll8nwxrk/AACPMc9UbKRY7SieR9bCXPJCa?dl=0]
 has the stdout from my console, including the -187 error above, as well as the 
eventlog for this application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7901) Attempt to request negative number of executors with dynamic allocation

2015-05-27 Thread Ryan Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561484#comment-14561484
 ] 

Ryan Williams commented on SPARK-7901:
--

Looks like a dupe of 
[SPARK-6954|https://issues.apache.org/jira/browse/SPARK-6954]…

 Attempt to request negative number of executors with dynamic allocation
 ---

 Key: SPARK-7901
 URL: https://issues.apache.org/jira/browse/SPARK-7901
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.3.1
Reporter: Ryan Williams

 I ran a {{spark-shell}} on YARN with dynamic allocation enabled; relevant 
 params:
 {code}
   --conf spark.dynamicAllocation.enabled=true \
   --conf spark.dynamicAllocation.minExecutors=5 \
   --conf spark.dynamicAllocation.maxExecutors=300 \
   --conf spark.dynamicAllocation.schedulerBacklogTimeout=3 \
   --conf spark.dynamicAllocation.executorIdleTimeout=300 \
 {code}
 It started out with 5 executors, went up to 300 when I ran a job, and then 
 killed them all back down to 5 executors after 5 minutes of idle time; all 
 working as intended.
 When I ran another job, it tried to request -187 executors:
 {code}
 15/05/27 17:41:12 ERROR util.Utils: Uncaught exception in thread 
 spark-dynamic-executor-allocation-0
 java.lang.IllegalArgumentException: Attempted to request a negative number of 
 executor(s) -187 from the cluster manager. Please specify a positive number!
   at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:338)
   at 
 org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1137)
   at 
 org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:294)
   at 
 org.apache.spark.ExecutorAllocationManager.addOrCancelExecutorRequests(ExecutorAllocationManager.scala:263)
   at 
 org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:230)
   at 
 org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply$mcV$sp(ExecutorAllocationManager.scala:189)
   at 
 org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189)
   at 
 org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189)
   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618)
   at 
 org.apache.spark.ExecutorAllocationManager$$anon$1.run(ExecutorAllocationManager.scala:189)
   at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
   at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
   at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 {code}
 Now it seems like I'm stuck with 5 executors in this application as some 
 internal state is corrupt.
 [This dropbox 
 folder|https://www.dropbox.com/sh/36slqgyll8nwxrk/AACPMc9UbKRY7SieR9bCXPJCa?dl=0]
  has the stdout from my console, including the -187 error above, as well as 
 the eventlog for this application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7901) Attempt to request negative number of executors with dynamic allocation

2015-05-27 Thread Ryan Williams (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Williams resolved SPARK-7901.
--
Resolution: Duplicate

 Attempt to request negative number of executors with dynamic allocation
 ---

 Key: SPARK-7901
 URL: https://issues.apache.org/jira/browse/SPARK-7901
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.3.1
Reporter: Ryan Williams

 I ran a {{spark-shell}} on YARN with dynamic allocation enabled; relevant 
 params:
 {code}
   --conf spark.dynamicAllocation.enabled=true \
   --conf spark.dynamicAllocation.minExecutors=5 \
   --conf spark.dynamicAllocation.maxExecutors=300 \
   --conf spark.dynamicAllocation.schedulerBacklogTimeout=3 \
   --conf spark.dynamicAllocation.executorIdleTimeout=300 \
 {code}
 It started out with 5 executors, went up to 300 when I ran a job, and then 
 killed them all back down to 5 executors after 5 minutes of idle time; all 
 working as intended.
 When I ran another job, it tried to request -187 executors:
 {code}
 15/05/27 17:41:12 ERROR util.Utils: Uncaught exception in thread 
 spark-dynamic-executor-allocation-0
 java.lang.IllegalArgumentException: Attempted to request a negative number of 
 executor(s) -187 from the cluster manager. Please specify a positive number!
   at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:338)
   at 
 org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1137)
   at 
 org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:294)
   at 
 org.apache.spark.ExecutorAllocationManager.addOrCancelExecutorRequests(ExecutorAllocationManager.scala:263)
   at 
 org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:230)
   at 
 org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply$mcV$sp(ExecutorAllocationManager.scala:189)
   at 
 org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189)
   at 
 org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189)
   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618)
   at 
 org.apache.spark.ExecutorAllocationManager$$anon$1.run(ExecutorAllocationManager.scala:189)
   at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
   at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
   at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 {code}
 Now it seems like I'm stuck with 5 executors in this application as some 
 internal state is corrupt.
 [This dropbox 
 folder|https://www.dropbox.com/sh/36slqgyll8nwxrk/AACPMc9UbKRY7SieR9bCXPJCa?dl=0]
  has the stdout from my console, including the -187 error above, as well as 
 the eventlog for this application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3379) Implement 'POWER' for sql

2015-05-27 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-3379:

Target Version/s:   (was: 1.4.0)

 Implement 'POWER' for sql
 -

 Key: SPARK-3379
 URL: https://issues.apache.org/jira/browse/SPARK-3379
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.2, 1.1.0
 Environment: All
Reporter: Xinyun Huang
Priority: Minor
   Original Estimate: 0h
  Remaining Estimate: 0h

 Add support for the mathematical function POWER within Spark SQL. Split off 
 from SPARK-3176.
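 A usage sketch, assuming the usual SQL semantics POWER(base, exponent); the 
 table and column names here are made up:
 {code}
 df = sqlContext.createDataFrame([(2.0, 10.0)], ["base", "exponent"])
 df.registerTempTable("t")
 sqlContext.sql("SELECT POWER(base, exponent) FROM t").collect()  # 2.0 ** 10.0 = 1024.0
 {code}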



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7896) IndexOutOfBoundsException in ChainedBuffer

2015-05-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561513#comment-14561513
 ] 

Apache Spark commented on SPARK-7896:
-

User 'sryza' has created a pull request for this issue:
https://github.com/apache/spark/pull/6440

 IndexOutOfBoundsException in ChainedBuffer
 --

 Key: SPARK-7896
 URL: https://issues.apache.org/jira/browse/SPARK-7896
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Arun Ahuja
Assignee: Sandy Ryza
Priority: Blocker

 I've run into this on two tasks that use the same dataset.
 The dataset is a collection of strings where the most common string appears 
 ~200M times and the next few appear ~50M times each.
 For this rdd: RDD[String], I can do rdd.map(x => (x, 1)).reduceByKey(_ + _) 
 to get the counts (how I got the numbers above), but I hit the error on 
 rdd.groupByKey().
 Also, I have a second RDD of strings rdd2: RDD[String] and I cannot do 
 rdd2.leftOuterJoin(rdd) without hitting this error:
 {code}
 15/05/26 23:27:55 WARN scheduler.TaskSetManager: Lost task 3169.1 in stage 
 5.0 (TID 4843, demeter-csmaz10-19.demeter.hpc.mssm.edu): 
 java.lang.IndexOutOfBoundsException: 512
 at 
 scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43)
 at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:47)
 at 
 org.apache.spark.util.collection.ChainedBuffer.write(ChainedBuffer.scala:110)
 at 
 org.apache.spark.util.collection.ChainedBufferOutputStream.write(ChainedBuffer.scala:141)
 at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
 at 
 org.apache.spark.serializer.KryoSerializationStream.flush(KryoSerializer.scala:147)
 at 
 org.apache.spark.util.collection.PartitionedSerializedPairBuffer.insert(PartitionedSerializedPairBuffer.scala:78)
 at 
 org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:219)
 at 
 org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:70)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 {code}
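 For reference, a scaled-down PySpark sketch of the two access patterns 
 described above, on a synthetically skewed dataset (the real dataset, its 
 scale, and the Scala code path are the reporter's):
 {code}
 rdd = sc.parallelize(["a"] * 200 + ["b"] * 50 + ["c"] * 50)
 # The counting pattern that works:
 counts = rdd.map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b)
 # The grouping pattern that failed at full scale:
 groups = rdd.map(lambda x: (x, 1)).groupByKey()
 {code}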



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6690) spark-sql script ends up throwing Exception when event logging is enabled.

2015-05-27 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561522#comment-14561522
 ] 

Yin Huai commented on SPARK-6690:
-

Per https://github.com/apache/spark/pull/5341, 
https://github.com/apache/spark/pull/5560 addressed this issue.

 spark-sql script ends up throwing Exception when event logging is enabled.
 --

 Key: SPARK-6690
 URL: https://issues.apache.org/jira/browse/SPARK-6690
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Kousuke Saruta
Priority: Minor

 When event logging is enabled, the spark-sql script ends up throwing an 
 Exception like the following.
 {code}
 15/04/03 13:51:49 INFO handler.ContextHandler: stopped 
 o.e.j.s.ServletContextHandler{/jobs,null}
 15/04/03 13:51:49 ERROR scheduler.LiveListenerBus: Listener 
 EventLoggingListener threw an exception
 java.lang.reflect.InvocationTargetException
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144)
   at 
 org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:144)
   at 
 org.apache.spark.scheduler.EventLoggingListener.onApplicationEnd(EventLoggingListener.scala:188)
   at 
 org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:54)
   at 
 org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
   at 
 org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
   at 
 org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:53)
   at 
 org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
   at 
 org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:79)
   at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1171)
   at 
 org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
 Caused by: java.io.IOException: Filesystem closed
   at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707)
   at 
 org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:1843)
   at 
 org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1804)
   at 
 org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:127)
   ... 17 more
 15/04/03 13:51:49 INFO ui.SparkUI: Stopped Spark web UI at 
 http://sarutak-devel:4040
 15/04/03 13:51:49 INFO scheduler.DAGScheduler: Stopping DAGScheduler
 Exception in thread "Thread-6" java.io.IOException: Filesystem closed
   at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707)
   at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1760)
   at 
 org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1124)
   at 
 org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
   at 
 org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
   at 
 org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
   at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1398)
   at 
 org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:209)
   at 
 org.apache.spark.SparkContext$$anonfun$stop$3.apply(SparkContext.scala:1408)
   at 
 org.apache.spark.SparkContext$$anonfun$stop$3.apply(SparkContext.scala:1408)
   at scala.Option.foreach(Option.scala:236)
   at org.apache.spark.SparkContext.stop(SparkContext.scala:1408)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.stop(SparkSQLEnv.scala:66)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$$anon$1.run(SparkSQLCLIDriver.scala:107)
 {code}
 This is because FileSystem#close is called by the shutdown hook registered in 
 SparkSQLCLIDriver.
 {code}
 Runtime.getRuntime.addShutdownHook(
   new Thread() {
     override def run() {
       SparkSQLEnv.stop()
     }
   }
 )
 {code}
 This issue was resolved by SPARK-3062, but I think it was brought back by 
 SPARK-2261.



--
This message was sent by Atlassian JIRA

[jira] [Resolved] (SPARK-6690) spark-sql script ends up throwing Exception when event logging is enabled.

2015-05-27 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-6690.
-
   Resolution: Fixed
Fix Version/s: 1.4.0

 spark-sql script ends up throwing Exception when event logging is enabled.
 --

 Key: SPARK-6690
 URL: https://issues.apache.org/jira/browse/SPARK-6690
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Kousuke Saruta
Priority: Minor
 Fix For: 1.4.0


 When event logging is enabled, the spark-sql script ends up throwing an 
 Exception like the following.
 {code}
 15/04/03 13:51:49 INFO handler.ContextHandler: stopped 
 o.e.j.s.ServletContextHandler{/jobs,null}
 15/04/03 13:51:49 ERROR scheduler.LiveListenerBus: Listener 
 EventLoggingListener threw an exception
 java.lang.reflect.InvocationTargetException
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144)
   at 
 org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:144)
   at 
 org.apache.spark.scheduler.EventLoggingListener.onApplicationEnd(EventLoggingListener.scala:188)
   at 
 org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:54)
   at 
 org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
   at 
 org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
   at 
 org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:53)
   at 
 org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
   at 
 org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:79)
   at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1171)
   at 
 org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
 Caused by: java.io.IOException: Filesystem closed
   at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707)
   at 
 org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:1843)
   at 
 org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1804)
   at 
 org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:127)
   ... 17 more
 15/04/03 13:51:49 INFO ui.SparkUI: Stopped Spark web UI at 
 http://sarutak-devel:4040
 15/04/03 13:51:49 INFO scheduler.DAGScheduler: Stopping DAGScheduler
 Exception in thread "Thread-6" java.io.IOException: Filesystem closed
   at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707)
   at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1760)
   at 
 org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1124)
   at 
 org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
   at 
 org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
   at 
 org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
   at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1398)
   at 
 org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:209)
   at 
 org.apache.spark.SparkContext$$anonfun$stop$3.apply(SparkContext.scala:1408)
   at 
 org.apache.spark.SparkContext$$anonfun$stop$3.apply(SparkContext.scala:1408)
   at scala.Option.foreach(Option.scala:236)
   at org.apache.spark.SparkContext.stop(SparkContext.scala:1408)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.stop(SparkSQLEnv.scala:66)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$$anon$1.run(SparkSQLCLIDriver.scala:107)
 {code}
 This is because FileSystem#close is called by the shutdown hook registered in 
 SparkSQLCLIDriver.
 {code}
 Runtime.getRuntime.addShutdownHook(
   new Thread() {
     override def run() {
       SparkSQLEnv.stop()
     }
   }
 )
 {code}
 This issue was resolved by SPARK-3062, but I think it was brought back by 
 SPARK-2261.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To 

[jira] [Assigned] (SPARK-7896) IndexOutOfBoundsException in ChainedBuffer

2015-05-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7896:
---

Assignee: Apache Spark  (was: Sandy Ryza)

 IndexOutOfBoundsException in ChainedBuffer
 --

 Key: SPARK-7896
 URL: https://issues.apache.org/jira/browse/SPARK-7896
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Arun Ahuja
Assignee: Apache Spark
Priority: Blocker

 I've run into this on two tasks that use the same dataset.
 The dataset is a collection of strings where the most common string appears 
 ~200M times and the next few appear ~50M times each.
 For this rdd: RDD[String], I can do rdd.map(x => (x, 1)).reduceByKey(_ + _) 
 to get the counts (how I got the numbers above), but I hit the error on 
 rdd.groupByKey().
 Also, I have a second RDD of strings rdd2: RDD[String] and I cannot do 
 rdd2.leftOuterJoin(rdd) without hitting this error:
 {code}
 15/05/26 23:27:55 WARN scheduler.TaskSetManager: Lost task 3169.1 in stage 
 5.0 (TID 4843, demeter-csmaz10-19.demeter.hpc.mssm.edu): 
 java.lang.IndexOutOfBoundsException: 512
 at 
 scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43)
 at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:47)
 at 
 org.apache.spark.util.collection.ChainedBuffer.write(ChainedBuffer.scala:110)
 at 
 org.apache.spark.util.collection.ChainedBufferOutputStream.write(ChainedBuffer.scala:141)
 at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
 at 
 org.apache.spark.serializer.KryoSerializationStream.flush(KryoSerializer.scala:147)
 at 
 org.apache.spark.util.collection.PartitionedSerializedPairBuffer.insert(PartitionedSerializedPairBuffer.scala:78)
 at 
 org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:219)
 at 
 org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:70)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4119) Don't rely on HIVE_DEV_HOME to find .q files

2015-05-27 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-4119:

Target Version/s:   (was: 1.4.0)

 Don't rely on HIVE_DEV_HOME to find .q files
 

 Key: SPARK-4119
 URL: https://issues.apache.org/jira/browse/SPARK-4119
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 1.1.1
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Minor

 After merging in Hive 0.13.1 support, a bunch of .q files and golden answer 
 files got updated. Unfortunately, some .q files were also updated in Hive 
 itself; for example, an ORDER BY clause was added to groupby1_limit.q as a 
 bug fix.
 With HIVE_DEV_HOME set, developers working on Hive 0.12.0 may end up with 
 false test failures, because .q files are looked up from HIVE_DEV_HOME and 
 the outdated .q files are used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4782) Add inferSchema support for RDD[Map[String, Any]]

2015-05-27 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-4782:

Target Version/s:   (was: 1.4.0)

 Add inferSchema support for RDD[Map[String, Any]]
 -

 Key: SPARK-4782
 URL: https://issues.apache.org/jira/browse/SPARK-4782
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Jianshi Huang
Priority: Minor

 The best way to convert RDD[Map[String, Any]] to SchemaRDD currently seems to 
 be converting each Map to a JSON String first and using JsonRDD.inferSchema 
 on it, which is very inefficient.
 Instead of JsonRDD, RDD[Map[String, Any]] is a better common denominator for 
 schemaless data, as adding a Map-like interface to any serialization format 
 is easy.
 So please add inferSchema support to RDD[Map[String, Any]]. *Then, for any 
 new serialization format we want to support, we just need to add a Map 
 interface wrapper to it.*
 Jianshi
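 For concreteness, the JSON round-trip workaround looks roughly like this in 
 PySpark terms (a sketch of the inefficiency being described, not a 
 recommended pattern):
 {code}
 import json
 maps = sc.parallelize([{"name": "a", "n": 1}, {"name": "b", "n": 2}])
 # Serialize every map to a JSON string just so the schema can be inferred:
 df = sqlContext.jsonRDD(maps.map(json.dumps))
 {code}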



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4119) Don't rely on HIVE_DEV_HOME to find .q files

2015-05-27 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561525#comment-14561525
 ] 

Yin Huai commented on SPARK-4119:
-

[~lian cheng] Feel free to re-target it.

 Don't rely on HIVE_DEV_HOME to find .q files
 

 Key: SPARK-4119
 URL: https://issues.apache.org/jira/browse/SPARK-4119
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 1.1.1
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Minor

 After merging in Hive 0.13.1 support, a bunch of .q files and golden answer 
 files got updated. Unfortunately, some .q files were also updated in Hive 
 itself; for example, an ORDER BY clause was added to groupby1_limit.q as a 
 bug fix.
 With HIVE_DEV_HOME set, developers working on Hive 0.12.0 may end up with 
 false test failures, because .q files are looked up from HIVE_DEV_HOME and 
 the outdated .q files are used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7902) SQL UDF doesn't support UDT in PySpark

2015-05-27 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-7902:


 Summary: SQL UDF doesn't support UDT in PySpark
 Key: SPARK-7902
 URL: https://issues.apache.org/jira/browse/SPARK-7902
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.4.0
Reporter: Xiangrui Meng


We don't convert Python SQL internal types to Python types in SQL UDF 
execution. This causes problems if the input arguments contain UDTs or the 
return type is a UDT. Right now, the raw SQL types are passed into the Python 
UDF and the return value is not converted to Python SQL types.

Here is code that reproduces the bug. (Actually, it triggers another bug first 
right now.)
{code}
from pyspark.mllib.linalg import SparseVector
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

df = sqlContext.createDataFrame([(SparseVector(2, {0: 0.0}),)], ["features"])
sz = udf(lambda s: s.size, IntegerType())
df.select(sz(df.features).alias("sz")).collect()
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6467) Override QueryPlan.missingInput when necessary and rely on it CheckAnalysis

2015-05-27 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561529#comment-14561529
 ] 

Yin Huai commented on SPARK-6467:
-

Probably {{Generate}} should override it? It seems to be the cause of some 
wrong analysis error messages (like {{abc is not in col1, col2, abc, col3}}).

 Override QueryPlan.missingInput when necessary and rely on it CheckAnalysis
 ---

 Key: SPARK-6467
 URL: https://issues.apache.org/jira/browse/SPARK-6467
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Cheng Lian
Priority: Minor

 Currently, some LogicalPlans do not override missingInput, but they should. 
 As a result, the lack of proper missingInput implementations leaks into 
 CheckAnalysis.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6467) Override QueryPlan.missingInput when necessary and rely on it CheckAnalysis

2015-05-27 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-6467:

Priority: Major  (was: Minor)

 Override QueryPlan.missingInput when necessary and rely on it CheckAnalysis
 ---

 Key: SPARK-6467
 URL: https://issues.apache.org/jira/browse/SPARK-6467
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Cheng Lian

 Currently, some LogicalPlans do not override missingInput, but they should. 
 As a result, the lack of proper missingInput implementations leaks into 
 CheckAnalysis.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7550) Support setting the right schema & serde when writing to Hive metastore

2015-05-27 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561541#comment-14561541
 ] 

Yin Huai commented on SPARK-7550:
-

I think it will also address https://issues.apache.org/jira/browse/SPARK-6413.

 Support setting the right schema & serde when writing to Hive metastore
 ---

 Key: SPARK-7550
 URL: https://issues.apache.org/jira/browse/SPARK-7550
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.0
Reporter: Reynold Xin
Assignee: Cheng Hao

 As of 1.4, Spark SQL does not properly set the table schema and serde when 
 writing a table to Hive's metastore. Would be great to do that properly so 
 users can use non-Spark SQL systems to read those tables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6413) For data source tables, we should provide better output for DESCRIBE FORMATTED

2015-05-27 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-6413:

Target Version/s: 1.5.0  (was: 1.4.0)

 For data source tables, we should provide better output for DESCRIBE FORMATTED
 --

 Key: SPARK-6413
 URL: https://issues.apache.org/jira/browse/SPARK-6413
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai
Priority: Minor

 Right now, we show Hive-specific details such as the SerDe. Users will be 
 confused when they see the output of DESCRIBE FORMATTED (a Hive native 
 command for now) and think the table is not stored in the right format, when 
 in fact it is.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4684) Add a script to run JDBC server on Windows

2015-05-27 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-4684:

Target Version/s: 1.5.0  (was: 1.4.0)

 Add a script to run JDBC server on Windows
 --

 Key: SPARK-4684
 URL: https://issues.apache.org/jira/browse/SPARK-4684
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Matei Zaharia
Assignee: Cheng Lian
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7899) PySpark sql/tests breaks pylint validation

2015-05-27 Thread Michael Nazario (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561164#comment-14561164
 ] 

Michael Nazario commented on SPARK-7899:


The problem is that pyspark/sql/types conflicts with the built-in Python 3 
types module, which causes tests to fail.

The Python documentation 
(https://docs.python.org/3/using/cmdline.html#interface-options) says that 
when calling python path/to/script.py, the directory of the script is 
automatically added to sys.path. This causes the conflict with the built-in 
Python 3 types module.

You can fix this by using -m to run the pyspark tests instead, since that 
runs a module by name from sys.path and does not add the script's directory 
to the Python path.
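A quick way to see the difference (illustrative, Python 3):
{code}
import sys
# `python pyspark/sql/tests.py` puts pyspark/sql/ at sys.path[0], so a bare
# `import types` can resolve to pyspark.sql.types instead of the stdlib module.
# `python -m pyspark.sql.tests` leaves sys.path[0] as the current directory,
# so the stdlib `types` module wins.
print(sys.path[0])
{code}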


 PySpark sql/tests breaks pylint validation
 --

 Key: SPARK-7899
 URL: https://issues.apache.org/jira/browse/SPARK-7899
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Tests
Affects Versions: 1.4.0
Reporter: Michael Nazario

 The pyspark.sql.types module is dynamically named types from _types, which 
 messes up pylint validation.
 From [~justin.uang] below:
 In commit 04e44b37 (the migration to Python 3), pyspark/sql/types.py was 
 renamed to pyspark/sql/_types.py and then some magic in 
 pyspark/sql/__init__.py dynamically renamed the module back to types. I 
 imagine that this is some naming conflict with Python 3, but what was the 
 error that showed up?
 The reason I'm asking about this is that it's messing with pylint, since 
 pylint cannot now statically find the module. I also tried importing the 
 package so that __init__ would be run in an init-hook, but that isn't what 
 the discovery mechanism is using. I imagine it's probably just crawling the 
 directory structure.
 One way to work around this would be something akin to this 
 (http://stackoverflow.com/questions/9602811/how-to-tell-pylint-to-ignore-certain-imports),
  where I would have to create a fake module, but I would probably be missing 
 a ton of pylint features on users of that module, and it's pretty hacky.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7536) Audit MLlib Python API for 1.4

2015-05-27 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561135#comment-14561135
 ] 

Yanbo Liang commented on SPARK-7536:


[~josephkb] Yes, the 3 main sub-tasks listed above are in progress. I know 
they are tied to the 1.4 release, so I will try to finish them and submit the 
completed parts as soon as possible.

 Audit MLlib Python API for 1.4
 --

 Key: SPARK-7536
 URL: https://issues.apache.org/jira/browse/SPARK-7536
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Joseph K. Bradley
Assignee: Yanbo Liang

 For new public APIs added to MLlib, we need to check the generated HTML doc 
 and compare the Scala & Python versions. We need to track:
 * Inconsistency: Do class/method/parameter names match? SPARK-7667
 * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
 be as complete as the Scala doc. SPARK-7666
 * API breaking changes: These should be very rare but are occasionally either 
 necessary (intentional) or accidental.  These must be recorded and added in 
 the Migration Guide for this release. SPARK-7665
 ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
 component, please note that as well.
 * Missing classes/methods/parameters: We should create to-do JIRAs for 
 functionality missing from Python.
 ** classification
 *** StreamingLogisticRegressionWithSGD SPARK-7633
 ** clustering
 *** GaussianMixture SPARK-6258
 *** LDA SPARK-6259
 *** Power Iteration Clustering SPARK-5962
 *** StreamingKMeans SPARK-4118 
 ** evaluation
 *** MultilabelMetrics SPARK-6094 
 ** feature
 *** ElementwiseProduct SPARK-7605
 *** PCA SPARK-7604
 ** linalg
 *** Distributed linear algebra SPARK-6100
 ** pmml.export SPARK-7638
 ** regression
 *** StreamingLinearRegressionWithSGD SPARK-4127
 ** stat
 *** KernelDensity SPARK-7639
 ** util
 *** MLUtils SPARK-6263 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7899) PySpark sql/tests breaks pylint validation

2015-05-27 Thread Michael Nazario (JIRA)
Michael Nazario created SPARK-7899:
--

 Summary: PySpark sql/tests breaks pylint validation
 Key: SPARK-7899
 URL: https://issues.apache.org/jira/browse/SPARK-7899
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Tests
Affects Versions: 1.4.0
Reporter: Michael Nazario


The pyspark.sql.types module is dynamically named types from _types, which 
messes up pylint validation.

From [~justin.uang] below:

In commit 04e44b37 (the migration to Python 3), pyspark/sql/types.py was 
renamed to pyspark/sql/_types.py and then some magic in pyspark/sql/__init__.py 
dynamically renamed the module back to types. I imagine that this is some 
naming conflict with Python 3, but what was the error that showed up?

The reason I'm asking about this is that it's messing with pylint, since 
pylint cannot now statically find the module. I also tried importing the 
package so that __init__ would be run in an init-hook, but that isn't what the 
discovery mechanism is using. I imagine it's probably just crawling the 
directory structure.

One way to work around this would be something akin to this 
(http://stackoverflow.com/questions/9602811/how-to-tell-pylint-to-ignore-certain-imports),
 where I would have to create a fake module, but I would probably be missing a 
ton of pylint features on users of that module, and it's pretty hacky.
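The renaming magic is presumably something along these lines (a sketch of the 
pattern, not the literal pyspark/sql/__init__.py code):
{code}
# pyspark/sql/__init__.py -- sketch
import sys
from pyspark.sql import _types as types   # the real code lives in _types.py
sys.modules["pyspark.sql.types"] = types  # re-expose it as .types at runtime
# pylint only crawls the filesystem, so it never sees this runtime alias.
{code}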



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7900) Reduce number of tagging calls in spark-ec2

2015-05-27 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-7900:
---

 Summary: Reduce number of tagging calls in spark-ec2
 Key: SPARK-7900
 URL: https://issues.apache.org/jira/browse/SPARK-7900
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.4.0
Reporter: Nicholas Chammas
Priority: Minor


spark-ec2 currently tags each instance with its own name:

https://github.com/apache/spark/blob/4615081d7a10b023491e25478d19b8161e030974/ec2/spark_ec2.py#L684-L692

Quite often, one of these tagging calls will fail:

{code}
Launching instances...
Launched 10 slaves in us-west-2a, regid = r-89656e83
Launched master in us-west-2a, regid = r-07646f0d
Waiting for AWS to propagate instance metadata...
Traceback (most recent call last):
  File "../spark/ec2/spark_ec2.py", line 1395, in <module>
    main()
  File "../spark/ec2/spark_ec2.py", line 1387, in main
    real_main()
  File "../spark/ec2/spark_ec2.py", line 1222, in real_main
    (master_nodes, slave_nodes) = launch_cluster(conn, opts, cluster_name)
  File "../spark/ec2/spark_ec2.py", line 667, in launch_cluster
    value='{cn}-slave-{iid}'.format(cn=cluster_name, iid=slave.id))
  File "/path/spark/ec2/lib/boto-2.34.0/boto/ec2/ec2object.py", line 80, in add_tag
    self.add_tags({key: value}, dry_run)
  File "/path/spark/ec2/lib/boto-2.34.0/boto/ec2/ec2object.py", line 97, in add_tags
    dry_run=dry_run
  File "/path/spark/ec2/lib/boto-2.34.0/boto/ec2/connection.py", line 4202, in create_tags
    return self.get_status('CreateTags', params, verb='POST')
  File "/path/spark/ec2/lib/boto-2.34.0/boto/connection.py", line 1223, in get_status
    raise self.ResponseError(response.status, response.reason, body)
boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The 
instance ID 'i-d3b72524' does not 
exist</Message></Error></Errors><RequestID>f0936ab5-4d10-46d1-a35d-cefaf8a68adc</RequestID></Response>
{code}

This is presumably a problem with AWS metadata taking time to become available 
on all the servers that spark-ec2 hits as it makes the several tagging calls.

Instead of retrying the tagging calls, we should just reduce them to 2 calls: 
one for the master, one for the slaves.
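boto's create_tags accepts a list of resource IDs, so the batching could look 
roughly like this (a sketch; conn, cluster_name, and the node lists come from 
the surrounding spark_ec2.py context, and the slaves would share one tag value 
instead of per-instance names):
{code}
conn.create_tags(
    [master.id for master in master_nodes],
    {'Name': '{cn}-master'.format(cn=cluster_name)})
conn.create_tags(
    [slave.id for slave in slave_nodes],
    {'Name': '{cn}-slave'.format(cn=cluster_name)})
{code}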



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7898) pyspark merges stderr into stdout

2015-05-27 Thread Sam Steingold (JIRA)
Sam Steingold created SPARK-7898:


 Summary: pyspark merges stderr into stdout
 Key: SPARK-7898
 URL: https://issues.apache.org/jira/browse/SPARK-7898
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.3.0
Reporter: Sam Steingold


When I type 
{code}
hadoop fs -text /foo/bar/baz.bz2 2>err 1>out
{code}

I get two non-empty files: {{err}} with 
{code}
2015-05-26 15:33:49,786 INFO  [main] bzip2.Bzip2Factory 
(Bzip2Factory.java:isNativeBzip2Loaded(70)) - Successfully loaded & initialized 
native-bzip2 library system-native
2015-05-26 15:33:49,789 INFO  [main] compress.CodecPool 
(CodecPool.java:getDecompressor(179)) - Got brand-new decompressor [.bz2]
{code}
and {{out}} with the content of the file (as expected).

When I call the same command from Python (2.6):

{code}
from subprocess import Popen
with open("out", "w") as out:
    with open("err", "w") as err:
        p = Popen(['hadoop', 'fs', '-text', "/foo/bar/baz.bz2"],
                  stdin=None, stdout=out, stderr=err)
        print p.wait()
{code}
I get the exact same (correct) behavior.

*However*, when I run the same code under *PySpark* (or using `spark-submit`), 
I get an *empty* {{err}} file and the {{out}} file starts with the log messages 
above (and then it contains the actual data).
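One quick check from inside the PySpark driver is whether file descriptors 1 
and 2 already point at the same file before Popen is even involved (a 
diagnostic sketch, Python 2):
{code}
import os
s1, s2 = os.fstat(1), os.fstat(2)
# If stderr has been redirected onto stdout, both fds describe the same file.
print (s1.st_dev, s1.st_ino) == (s2.st_dev, s2.st_ino)
{code}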



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


