[jira] [Commented] (SPARK-4118) Create python bindings for Streaming KMeans
[ https://issues.apache.org/jira/browse/SPARK-4118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560672#comment-14560672 ] Manoj Kumar commented on SPARK-4118: [~mengxr] Hi, can this be assigned to me? Create python bindings for Streaming KMeans --- Key: SPARK-4118 URL: https://issues.apache.org/jira/browse/SPARK-4118 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark, Streaming Reporter: Anant Daksh Asthana Priority: Minor Create Python bindings for Streaming K-means. This is in reference to https://issues.apache.org/jira/browse/SPARK-3254, which adds Streaming K-means functionality to MLlib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
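For reference, the Scala API added by SPARK-3254, which these Python bindings would wrap, looks roughly like the sketch below. This is a paraphrase from the MLlib docs, not authoritative; the parameter values are arbitrary and the two DStream arguments are assumed to be built elsewhere.
{code}
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.streaming.dstream.DStream

// Sketch of the Scala StreamingKMeans API that a Python binding would mirror.
def clusterStreams(training: DStream[Vector], test: DStream[Vector]): Unit = {
  val model = new StreamingKMeans()
    .setK(3)                      // number of clusters
    .setDecayFactor(1.0)          // how quickly old batches are forgotten
    .setRandomCenters(2, 0.0)     // dim = 2, initial center weight = 0.0
  model.trainOn(training)         // update cluster centers as batches arrive
  model.predictOn(test).print()   // emit nearest-center assignments per batch
}
{code}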
[jira] [Closed] (SPARK-7892) Python class in __main__ may trigger AssertionError
[ https://issues.apache.org/jira/browse/SPARK-7892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] flykobe cheng closed SPARK-7892. Resolution: Duplicate Python class in __main__ may trigger AssertionError --- Key: SPARK-7892 URL: https://issues.apache.org/jira/browse/SPARK-7892 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Environment: Linux, Python 2.7.3, pickled by the Python pickle lib Reporter: flykobe cheng Priority: Minor
Callback functions for Spark transformations and actions will be pickled. If the callback is an instance method of a class in the __main__ module, and the class has more than one instance method that uses class properties or classmethods, the class will be pickled twice, and 'pickle.memoize'd twice, which triggers an AssertionError.
Demo code:

import logging
import sys
import pyspark

class AClass(object):
    _class_var = {'classkey': 'classval', }

    def main_object_method(self, item):
        logging.warn("class var by %s: %s" % (sys._getframe().f_code.co_name, AClass._class_var['classkey']))

    def main_object_method2(self, item):
        logging.warn("class var by %s: %s" % (sys._getframe().f_code.co_name, AClass._class_var['classkey']))

def test_main_object_method(sc):
    obj = AClass()
    res = sc.parallelize(range(4)).map(obj.main_object_method).collect()

if __name__ == '__main__':
    cf = pyspark.SparkConf()
    cf.set('spark.cores.max', 1)
    sc = pyspark.SparkContext(appName='flykobe_demo_pickle_error', conf=cf)
    test_main_object_method(sc)

Traceback:

File /home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py, line 310, in save_function_tuple
    save(f_globals)
File /home/users/chengyi02/.jumbo/lib/python2.7/pickle.py, line 291, in save
    f(self, obj) # Call unbound method with explicit self
File /home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py, line 174, in save_dict
    pickle.Pickler.save_dict(self, obj)
File /home/users/chengyi02/.jumbo/lib/python2.7/pickle.py, line 654, in save_dict
    self._batch_setitems(obj.iteritems())
File /home/users/chengyi02/.jumbo/lib/python2.7/pickle.py, line 686, in _batch_setitems
    save(v)
File /home/users/chengyi02/.jumbo/lib/python2.7/pickle.py, line 291, in save
    f(self, obj) # Call unbound method with explicit self
File /home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py, line 468, in save_global
    d),obj=obj)
File /home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py, line 638, in save_reduce
    self.memoize(obj)
File /home/users/chengyi02/.jumbo/lib/python2.7/pickle.py, line 248, in memoize
    assert id(obj) not in self.memo
AssertionError

Problem in Python/Lib/pickle.py:

def memoize(self, obj):
    """Store an object in the memo."""
    if self.fast:
        return
    assert id(obj) not in self.memo
    memo_len = len(self.memo)
    self.write(self.put(memo_len))
    self.memo[id(obj)] = memo_len, obj

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7893) Complex Operators between Graphs
[ https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Huang updated SPARK-7893: -- Description: Currently there are 30+ operators in GraphX, but few of them consider operators between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases it would be helpful to operate between graphs directly, for example with *union or join*, especially for streaming cases or when combining a small graph with a big one. Higher-level graph operators can help users to focus and think in graphs. A detailed list of complex graph operators can be found here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*. was: Currently there are 30+ operators in GraphX, but few of them consider operators between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases it would be helpful to operate between graphs directly, for example with *union or join*, especially for streaming cases or when combining a small graph with a big one. Higher-level graph operators can help users to focus and think in graphs. A detailed list of complex graph operators can be found here: [complex_graph_operations](http://techieme.in/complex-graph-operations/). This issue will focus on two frequently-used operators first: *union* and *join*. Complex Operators between Graphs Key: SPARK-7893 URL: https://issues.apache.org/jira/browse/SPARK-7893 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang Currently there are 30+ operators in GraphX, but few of them consider operators between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases it would be helpful to operate between graphs directly, for example with *union or join*, especially for streaming cases or when combining a small graph with a big one. Higher-level graph operators can help users to focus and think in graphs. A detailed list of complex graph operators can be found here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7697) Column with an unsigned int should be treated as long in JDBCRDD
[ https://issues.apache.org/jira/browse/SPARK-7697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560710#comment-14560710 ] Rene Treffer commented on SPARK-7697: - I've had a similar problem, especially with unsigned bigint, for which Java has no type. (It only fails if a value actually exceeds the Java long range.) I worked around the problem by extending DriverQuirks, now JDBCDialects. The idea is that you can map problematic types to whatever you want: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala#L395 I am mapping unsigned bigint to string in order to load it. This works, at the cost of some post-processing overhead (basically a UDF to map an unsigned long stored in a string to a signed long). Column with an unsigned int should be treated as long in JDBCRDD Key: SPARK-7697 URL: https://issues.apache.org/jira/browse/SPARK-7697 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: DAITO Teppei Assignee: Liang-Chi Hsieh Fix For: 1.4.0 Columns with an unsigned numeric type in JDBC should be treated as the next 'larger' Java type in JDBCRDD#getCatalystType. https://github.com/apache/spark/blob/517eb37a85e0a28820bcfd5d98c50d02df6521c6/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JDBCRDD.scala#L49
{code:title=q.sql}
create table t1 (id int unsigned);
insert into t1 values (4234567890);
{code}
{code:title=T1.scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object T1 {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf())
    val s = new SQLContext(sc)
    val url = "jdbc:mysql://localhost/test"
    val t1 = s.jdbc(url, "t1")
    t1.printSchema()
    t1.collect().foreach(println)
  }
}
{code}
This code causes an error like the one below.
{noformat}
15/05/18 11:39:51 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, xxx): com.mysql.jdbc.exceptions.jdbc4.MySQLDataException: '4.23456789E9' in column '1' is outside valid range for the datatype INTEGER.
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:377)
at com.mysql.jdbc.Util.getInstance(Util.java:360)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:963)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:935)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:924)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:870)
at com.mysql.jdbc.ResultSetImpl.throwRangeException(ResultSetImpl.java:7090)
at com.mysql.jdbc.ResultSetImpl.parseIntAsDouble(ResultSetImpl.java:6364)
at com.mysql.jdbc.ResultSetImpl.getInt(ResultSetImpl.java:2484)
at org.apache.spark.sql.jdbc.JDBCRDD$$anon$1.getNext(JDBCRDD.scala:344)
at org.apache.spark.sql.jdbc.JDBCRDD$$anon$1.hasNext(JDBCRDD.scala:399)
...
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
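To make the dialect workaround above concrete, here is a minimal sketch assuming the JdbcDialects registry that replaced DriverQuirks (Spark 1.4+). The object name UnsignedAwareMySQLDialect is made up, and the mapping (unsigned INT to LongType, unsigned BIGINT to StringType) mirrors the comment rather than any shipped default.
{code}
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types._

// Hypothetical dialect that widens MySQL's unsigned integer columns so
// out-of-range values survive the JDBC round trip.
object UnsignedAwareMySQLDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:mysql")

  override def getCatalystType(
      sqlType: Int,
      typeName: String,
      size: Int,
      md: MetadataBuilder): Option[DataType] =
    if (typeName.toUpperCase.contains("UNSIGNED")) {
      sqlType match {
        case Types.INTEGER => Some(LongType)   // unsigned int fits in a signed long
        case Types.BIGINT  => Some(StringType) // no exact Java type; post-process with a UDF
        case _             => None             // defer to the default mapping
      }
    } else {
      None
    }
}

// Register once in the driver before reading the table.
JdbcDialects.registerDialect(UnsignedAwareMySQLDialect)
{code}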
[jira] [Updated] (SPARK-7893) Complex Operators between Graphs
[ https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Huang updated SPARK-7893: -- Description: Currently there are 30+ operators in GraphX, but few of them consider operators between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases it would be helpful to operate between graphs directly, for example with *union or join*, especially for streaming cases or when combining a small graph with a big one. Higher-level graph operators can help users to focus and think in graphs. A detailed list of complex graph operators can be found here: [complex_graph_operations](http://techieme.in/complex-graph-operations/). This issue will focus on two frequently-used operators first: *union* and *join*. was: Currently there are 30+ operators in GraphX, but few of them consider operators between graphs. The only one is mask, which takes another graph as a parameter and returns a new graph. In many complex cases it would be helpful to operate between graphs directly, for example with union or join, especially for streaming cases or when combining a small graph with a big one. Higher-level graph operators can help users to focus and think in graphs. A detailed list of complex graph operators can be found here: complex_graph_operations. This issue will focus on two frequently-used operators first: union and join. Complex Operators between Graphs Key: SPARK-7893 URL: https://issues.apache.org/jira/browse/SPARK-7893 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang Currently there are 30+ operators in GraphX, but few of them consider operators between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases it would be helpful to operate between graphs directly, for example with *union or join*, especially for streaming cases or when combining a small graph with a big one. Higher-level graph operators can help users to focus and think in graphs. A detailed list of complex graph operators can be found here: [complex_graph_operations](http://techieme.in/complex-graph-operations/). This issue will focus on two frequently-used operators first: *union* and *join*. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7894) Graph Union Operator
[ https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Huang updated SPARK-7894: -- Description: This operator aims to union two graphs and generate a new graph directly. The union of two graphs is the union of their vertex sets and their edge families. Vertexes and edges which are included in either graph will be part of the new graph. bq. G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H !union_operator.png|width=600px,align=center! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertexes and edges will inevitably overlap at the borders of the two graphs. For vertex, it's quite natural to just take the union and remove those duplicate ones. But for edges, a mergeEdges function seems to be more reasonable. bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] was: This operator aims to union two graphs and generate a new graph directly. The union of two graphs is the union of their vertex sets and their edge families. Vertexes and edges which are included in either graph will be part of the new graph. bq. G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H !union_operator.png|width=600px,align=center! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertexes and edges will inevitably overlap at the borders of the two graphs. It is necessary to consider how to handle this case for both Vertex and Edge. For vertex, it's quite natural to just take the union and remove those duplicate ones. But for edges, a mergeEdges function seems to be more reasonable. bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] Graph Union Operator Key: SPARK-7894 URL: https://issues.apache.org/jira/browse/SPARK-7894 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang Labels: graph, union Attachments: union_operator.png This operator aims to union two graphs and generate a new graph directly. The union of two graphs is the union of their vertex sets and their edge families. Vertexes and edges which are included in either graph will be part of the new graph. bq. G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H !union_operator.png|width=600px,align=center! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertexes and edges will inevitably overlap at the borders of the two graphs. For vertex, it's quite natural to just take the union and remove those duplicate ones. But for edges, a mergeEdges function seems to be more reasonable. bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
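As a rough illustration of the proposed semantics (not an actual GraphX patch; the helper name unionGraphs is made up), a naive union with edge merging can be assembled from existing GraphX and RDD operations:
{code}
import scala.reflect.ClassTag
import org.apache.spark.graphx.{Edge, Graph}

// Naive sketch: vertices are deduplicated by id (keeping an arbitrary
// attribute), and edges sharing the same (src, dst) pair are combined
// with the caller-supplied mergeEdges function.
def unionGraphs[VD: ClassTag, ED: ClassTag](
    g: Graph[VD, ED],
    h: Graph[VD, ED],
    mergeEdges: (ED, ED) => ED): Graph[VD, ED] = {
  val vertices = g.vertices.union(h.vertices).reduceByKey((a, _) => a)
  val edges = g.edges.union(h.edges)
    .map(e => ((e.srcId, e.dstId), e.attr))
    .reduceByKey(mergeEdges)
    .map { case ((src, dst), attr) => Edge(src, dst, attr) }
  Graph(vertices, edges)
}
{code}
A real implementation would presumably want to preserve partitioning and avoid the extra shuffles; this sketch only pins down the semantics being proposed.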
[jira] [Updated] (SPARK-7893) Complex Operators between Graphs
[ https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Huang updated SPARK-7893: -- Description: Currently there are 30+ operators in GraphX, while few of them consider operators between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases, such as _*streaming graphs or merging a small graph into a huge graph*_, higher-level graph operators can help users to focus and think in graphs. A list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*. was: Currently there are 30+ operators in GraphX, while few of them consider operators between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases, such as _*streaming graphs or merging a small graph into a big graph*_, complex operators that work between graphs directly will be helpful. Higher-level graph operators can help users to focus and think in graphs. A list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*. Complex Operators between Graphs Key: SPARK-7893 URL: https://issues.apache.org/jira/browse/SPARK-7893 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang Labels: complex, graph, join, operators, union Currently there are 30+ operators in GraphX, while few of them consider operators between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases, such as _*streaming graphs or merging a small graph into a huge graph*_, higher-level graph operators can help users to focus and think in graphs. A list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7887) Remove EvaluatedType from SQL Expression
[ https://issues.apache.org/jira/browse/SPARK-7887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-7887. Resolution: Fixed Fix Version/s: 1.5.0 Remove EvaluatedType from SQL Expression Key: SPARK-7887 URL: https://issues.apache.org/jira/browse/SPARK-7887 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.5.0 It's not a very useful type to use. We can just remove it to simplify expressions slightly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
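For context, the change amounts to roughly the following before/after on Catalyst's Expression base class. This is a paraphrase, not the actual diff; the Before/After class names are illustrative.
{code}
import org.apache.spark.sql.Row

// Before (paraphrased): every Expression carried an associated result type,
// which subclasses almost always instantiated as Any anyway.
abstract class ExpressionBefore {
  type EvaluatedType <: Any
  def eval(input: Row = null): EvaluatedType
}

// After (paraphrased): eval returns Any directly, dropping the type member
// and the per-subclass "type EvaluatedType = Any" boilerplate.
abstract class ExpressionAfter {
  def eval(input: Row = null): Any
}
{code}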
[jira] [Closed] (SPARK-6590) Make DataFrame.where accept a string conditionExpr
[ https://issues.apache.org/jira/browse/SPARK-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Wang closed SPARK-6590. -- Resolution: Won't Fix https://github.com/apache/spark/pull/6429#issuecomment-105788726 from Reynold. Make DataFrame.where accept a string conditionExpr -- Key: SPARK-6590 URL: https://issues.apache.org/jira/browse/SPARK-6590 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Yin Huai Priority: Minor In our docs, we say that where is an alias of filter. However, where does not support a string conditionExpr. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
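For reference, the string form is already reachable through filter, which is presumably why this was closed. A small illustration, assuming a DataFrame df with an age column:
{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// filter accepts a string expression as well as a Column; where (as of 1.3)
// only accepts a Column, so the string form goes through filter.
def adults(df: DataFrame): DataFrame =
  df.filter("age >= 21")       // string conditionExpr: supported by filter

def adultsViaColumn(df: DataFrame): DataFrame =
  df.where(col("age") >= 21)   // Column-based where: equivalent result
{code}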
[jira] [Created] (SPARK-7892) Python class in __main__ may trigger AssertionError
flykobe cheng created SPARK-7892: Summary: Python class in __main__ may trigger AssertionError Key: SPARK-7892 URL: https://issues.apache.org/jira/browse/SPARK-7892 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Environment: Linux, Python 2.7.3, pickled by the Python pickle lib Reporter: flykobe cheng Priority: Minor
Callback functions for Spark transformations and actions will be pickled. If the callback is an instance method of a class in the __main__ module, and the class has more than one instance method that uses class properties or classmethods, the class will be pickled twice, and 'pickle.memoize'd twice, which triggers an AssertionError.
Demo code:

import logging
import sys
import pyspark

class AClass(object):
    _class_var = {'classkey': 'classval', }

    def main_object_method(self, item):
        logging.warn("class var by %s: %s" % (sys._getframe().f_code.co_name, AClass._class_var['classkey']))

    def main_object_method2(self, item):
        logging.warn("class var by %s: %s" % (sys._getframe().f_code.co_name, AClass._class_var['classkey']))

def test_main_object_method(sc):
    obj = AClass()
    res = sc.parallelize(range(4)).map(obj.main_object_method).collect()

if __name__ == '__main__':
    cf = pyspark.SparkConf()
    cf.set('spark.cores.max', 1)
    sc = pyspark.SparkContext(appName='flykobe_demo_pickle_error', conf=cf)
    test_main_object_method(sc)

Traceback:

File /home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py, line 310, in save_function_tuple
    save(f_globals)
File /home/users/chengyi02/.jumbo/lib/python2.7/pickle.py, line 291, in save
    f(self, obj) # Call unbound method with explicit self
File /home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py, line 174, in save_dict
    pickle.Pickler.save_dict(self, obj)
File /home/users/chengyi02/.jumbo/lib/python2.7/pickle.py, line 654, in save_dict
    self._batch_setitems(obj.iteritems())
File /home/users/chengyi02/.jumbo/lib/python2.7/pickle.py, line 686, in _batch_setitems
    save(v)
File /home/users/chengyi02/.jumbo/lib/python2.7/pickle.py, line 291, in save
    f(self, obj) # Call unbound method with explicit self
File /home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py, line 468, in save_global
    d),obj=obj)
File /home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py, line 638, in save_reduce
    self.memoize(obj)
File /home/users/chengyi02/.jumbo/lib/python2.7/pickle.py, line 248, in memoize
    assert id(obj) not in self.memo
AssertionError

Problem in Python/Lib/pickle.py:

def memoize(self, obj):
    """Store an object in the memo."""
    if self.fast:
        return
    assert id(obj) not in self.memo
    memo_len = len(self.memo)
    self.write(self.put(memo_len))
    self.memo[id(obj)] = memo_len, obj

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7893) Complex Operators between Graphs
[ https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Huang updated SPARK-7893: -- Description: Currently there are 30+ operators in GraphX, but few of them consider operators between graphs. The only one is ***mask***, which takes another graph as a parameter and returns a new graph. In many complex cases it would be helpful to operate between graphs directly, for example with **union or join**, especially for streaming cases or when combining a small graph with a big one. Higher-level graph operators can help users to focus and think in graphs. A detailed list of complex graph operators can be found here: [complex_graph_operations](http://techieme.in/complex-graph-operations/). This issue will focus on two frequently-used operators first: **union** and **join**. was: Currently there are 30+ operators in GraphX, but few of them consider operations between graphs. The only one is mask, which takes another graph as a parameter and returns a new graph. In many complex cases it would be helpful to operate between graphs directly, for example with union or join, especially for streaming cases or when combining a small graph with a big one. Higher-level graph operators can help users to focus and think in graphs. A detailed list of complex graph operators can be found here: complex_graph_operations. We will focus on two frequently-used operators first: union and join. Complex Operators between Graphs Key: SPARK-7893 URL: https://issues.apache.org/jira/browse/SPARK-7893 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang Currently there are 30+ operators in GraphX, but few of them consider operators between graphs. The only one is ***mask***, which takes another graph as a parameter and returns a new graph. In many complex cases it would be helpful to operate between graphs directly, for example with **union or join**, especially for streaming cases or when combining a small graph with a big one. Higher-level graph operators can help users to focus and think in graphs. A detailed list of complex graph operators can be found here: [complex_graph_operations](http://techieme.in/complex-graph-operations/). This issue will focus on two frequently-used operators first: **union** and **join**. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7894) Graph Union Operator
[ https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Huang updated SPARK-7894: -- External issue URL: https://issues.apache.org/jira/browse/SPARK-7893 Graph Union Operator Key: SPARK-7894 URL: https://issues.apache.org/jira/browse/SPARK-7894 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang This operator aims to union two graphs and generate a new graph directly. Vertexes and edges which are included in either graph will be part of the new graph. The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H. A simple interface would be: def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertexes and edges will inevitably overlap at the borders of the two graphs. It is necessary for the interface to consider how to handle this case for both Vertex and Edge. For vertex, it's quite natural to take the union and remove duplicate vertexes. But for edges, a mergeEdges function seems to be more reasonable. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7894) Graph Union Operator
[ https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Huang updated SPARK-7894: -- Description: This operator aims to union two graphs and generate a new graph directly. Vertexes and edges which are included in either graph will be part of the new graph. The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H !https://issues.apache.org/jira/secure/attachment/12735570/union_operator.png|thumbnail! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertexes and edges will inevitably overlap at the borders of the two graphs. It is necessary for the interface to consider how to handle this case for both Vertex and Edge. For vertex, it's quite natural to take the union and remove duplicate vertexes. But for edges, a mergeEdges function seems to be more reasonable. bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] was: This operator aims to union two graphs and generate a new graph directly. Vertexes and edges which are included in either graph will be part of the new graph. The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H !https://raw.githubusercontent.com/andyyehoo/anything/master/images/union_operator.png|thumbnail! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertexes and edges will inevitably overlap at the borders of the two graphs. It is necessary for the interface to consider how to handle this case for both Vertex and Edge. For vertex, it's quite natural to take the union and remove duplicate vertexes. But for edges, a mergeEdges function seems to be more reasonable. bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] Graph Union Operator Key: SPARK-7894 URL: https://issues.apache.org/jira/browse/SPARK-7894 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang Attachments: union_operator.png This operator aims to union two graphs and generate a new graph directly. Vertexes and edges which are included in either graph will be part of the new graph. The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H !https://issues.apache.org/jira/secure/attachment/12735570/union_operator.png|thumbnail! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertexes and edges will inevitably overlap at the borders of the two graphs. It is necessary for the interface to consider how to handle this case for both Vertex and Edge. For vertex, it's quite natural to take the union and remove duplicate vertexes. But for edges, a mergeEdges function seems to be more reasonable. bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7894) Graph Union Operator
[ https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Huang updated SPARK-7894: -- Description: This operator aims to union two graphs and generate a new graph directly. Vertexes and edges which are included in either graph will be part of the new graph. The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H !union_operator.png|width=600px! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertexes and edges will inevitably overlap at the borders of the two graphs. It is necessary for the interface to consider how to handle this case for both Vertex and Edge. For vertex, it's quite natural to just take the union and remove those duplicate vertexes. But for edges, a mergeEdges function seems to be more reasonable. bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] was: This operator aims to union two graphs and generate a new graph directly. Vertexes and edges which are included in either graph will be part of the new graph. The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H !union_operator.png|thumbnail! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertexes and edges will inevitably overlap at the borders of the two graphs. It is necessary for the interface to consider how to handle this case for both Vertex and Edge. For vertex, it's quite natural to just take the union and remove those duplicate vertexes. But for edges, a mergeEdges function seems to be more reasonable. bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] Graph Union Operator Key: SPARK-7894 URL: https://issues.apache.org/jira/browse/SPARK-7894 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang Labels: graph Attachments: union_operator.png This operator aims to union two graphs and generate a new graph directly. Vertexes and edges which are included in either graph will be part of the new graph. The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H !union_operator.png|width=600px! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertexes and edges will inevitably overlap at the borders of the two graphs. It is necessary for the interface to consider how to handle this case for both Vertex and Edge. For vertex, it's quite natural to just take the union and remove those duplicate vertexes. But for edges, a mergeEdges function seems to be more reasonable. bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7893) Complex Operators between Graphs
[ https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Huang updated SPARK-7893: -- Description: Currently there are 30+ operators in GraphX, while few of them consider operators between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases, such as streaming graphs or merging a small graph into a big graph, complex operators such as *union or join* will be helpful for operating between graphs directly. Higher-level graph operators can help users to focus and think in graphs. A list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*. was: Currently there are 30+ operators in GraphX, while few of them consider operators between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases, such as streaming graphs or combining small and big graphs, it will be helpful to operate between graphs directly, for example with *union or join*. Higher-level graph operators can help users to focus and think in graphs. A list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*. Complex Operators between Graphs Key: SPARK-7893 URL: https://issues.apache.org/jira/browse/SPARK-7893 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang Labels: complex, graph, join, operators, union Currently there are 30+ operators in GraphX, while few of them consider operators between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases, such as streaming graphs or merging a small graph into a big graph, complex operators such as *union or join* will be helpful for operating between graphs directly. Higher-level graph operators can help users to focus and think in graphs. A list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7893) Complex Operators between Graphs
[ https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Huang updated SPARK-7893: -- Description: Currently there are 30+ operators in GraphX, while few of them consider operators between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases, such as _*streaming graphs or merging a small graph into a huge graph*_, higher-level graph operators can help users to focus and think in graphs. Performance optimizations can be done internally and remain transparent to users. A list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*. was: Currently there are 30+ operators in GraphX, while few of them consider operators between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases, such as _*streaming graphs or merging a small graph into a huge graph*_, higher-level graph operators can help users to focus and think in graphs. Performance optimizations can be done within the operator and remain transparent to users. A list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*. Complex Operators between Graphs Key: SPARK-7893 URL: https://issues.apache.org/jira/browse/SPARK-7893 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang Labels: complex, graph, join, operators, union Currently there are 30+ operators in GraphX, while few of them consider operators between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases, such as _*streaming graphs or merging a small graph into a huge graph*_, higher-level graph operators can help users to focus and think in graphs. Performance optimizations can be done internally and remain transparent to users. A list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7893) Complex Operators between Graphs
[ https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Huang updated SPARK-7893: -- Description: Currently there are 30+ operators in GraphX, but few of them consider operators between graphs. The only one is mask, which takes another graph as a parameter and returns a new graph. In many complex cases it would be helpful to operate between graphs directly, for example with union or join, especially for streaming cases or when combining a small graph with a big one. Higher-level graph operators can help users to focus and think in graphs. A detailed list of complex graph operators can be found here: complex_graph_operations. This issue will focus on two frequently-used operators first: union and join. was: Currently there are 30+ operators in GraphX, but few of them consider operators between graphs. The only one is ***mask***, which takes another graph as a parameter and returns a new graph. In many complex cases it would be helpful to operate between graphs directly, for example with **union or join**, especially for streaming cases or when combining a small graph with a big one. Higher-level graph operators can help users to focus and think in graphs. A detailed list of complex graph operators can be found here: [complex_graph_operations](http://techieme.in/complex-graph-operations/). This issue will focus on two frequently-used operators first: **union** and **join**. Complex Operators between Graphs Key: SPARK-7893 URL: https://issues.apache.org/jira/browse/SPARK-7893 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang Currently there are 30+ operators in GraphX, but few of them consider operators between graphs. The only one is mask, which takes another graph as a parameter and returns a new graph. In many complex cases it would be helpful to operate between graphs directly, for example with union or join, especially for streaming cases or when combining a small graph with a big one. Higher-level graph operators can help users to focus and think in graphs. A detailed list of complex graph operators can be found here: complex_graph_operations. This issue will focus on two frequently-used operators first: union and join. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7894) Graph Union Operator
[ https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Huang updated SPARK-7894: -- Attachment: union_operator.png Graph Union Operator Key: SPARK-7894 URL: https://issues.apache.org/jira/browse/SPARK-7894 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang Attachments: union_operator.png This operator aims to union two graphs and generate a new graph directly. Vertexes and edges which are included in either graph will be part of the new graph. The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H !https://raw.githubusercontent.com/andyyehoo/anything/master/images/union_operator.png|thumbnail! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertexes and edges will inevitably overlap at the borders of the two graphs. It is necessary for the interface to consider how to handle this case for both Vertex and Edge. For vertex, it's quite natural to take the union and remove duplicate vertexes. But for edges, a mergeEdges function seems to be more reasonable. bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7894) Graph Union Operator
[ https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Huang updated SPARK-7894: -- Labels: graph union (was: ) Graph Union Operator Key: SPARK-7894 URL: https://issues.apache.org/jira/browse/SPARK-7894 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang Labels: graph, union Attachments: union_operator.png This operator aims to union two graphs and generate a new graph directly. Vertexes and edges which are included in either graph will be part of the new graph. The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H !union_operator.png|thumbnail! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertexes and edges will inevitably overlap at the borders of the two graphs. It is necessary for the interface to consider how to handle this case for both Vertex and Edge. For vertex, it's quite natural to take the union and remove duplicate vertexes. But for edges, a mergeEdges function seems to be more reasonable. bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7894) Graph Union Operator
[ https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Huang updated SPARK-7894: -- Description: This operator aims to union two graphs and generate a new graph directly. Vertexes and edges which are included in either graph will be part of the new graph. The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H !union_operator.png|thumbnail! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertexes and edges will inevitably overlap at the borders of the two graphs. It is necessary for the interface to consider how to handle this case for both Vertex and Edge. For vertex, it's quite natural to take the union and remove duplicate vertexes. But for edges, a mergeEdges function seems to be more reasonable. bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] was: This operator aims to union two graphs and generate a new graph directly. Vertexes and edges which are included in either graph will be part of the new graph. The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H !https://issues.apache.org/jira/secure/attachment/12735570/union_operator.png|thumbnail! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertexes and edges will inevitably overlap at the borders of the two graphs. It is necessary for the interface to consider how to handle this case for both Vertex and Edge. For vertex, it's quite natural to take the union and remove duplicate vertexes. But for edges, a mergeEdges function seems to be more reasonable. bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] Graph Union Operator Key: SPARK-7894 URL: https://issues.apache.org/jira/browse/SPARK-7894 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang Attachments: union_operator.png This operator aims to union two graphs and generate a new graph directly. Vertexes and edges which are included in either graph will be part of the new graph. The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H !union_operator.png|thumbnail! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertexes and edges will inevitably overlap at the borders of the two graphs. It is necessary for the interface to consider how to handle this case for both Vertex and Edge. For vertex, it's quite natural to take the union and remove duplicate vertexes. But for edges, a mergeEdges function seems to be more reasonable. bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7894) Graph Union Operator
[ https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Huang updated SPARK-7894: -- Description: This operator aims to union two graphs and generate a new graph directly. The union of two graphs is the union of their vertex sets and their edge families. Vertexes and edges which are included in either graph will be part of the new graph. bq. G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H !union_operator.png|width=600px,align=center! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertexes and edges will inevitably overlap at the borders of the two graphs. It is necessary to consider how to handle this case for both Vertex and Edge. For vertex, it's quite natural to just take the union and remove those duplicate ones. But for edges, a mergeEdges function seems to be more reasonable. bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] was: This operator aims to union two graphs and generate a new graph directly. The union of two graphs is the union of their vertex sets and their edge families. Vertexes and edges which are included in either graph will be part of the new graph. bq. G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H !union_operator.png|width=600px,align=center! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertexes and edges will inevitably overlap at the borders of the two graphs. It is necessary for the interface to consider how to handle this case for both Vertex and Edge. For vertex, it's quite natural to just take the union and remove those duplicate vertexes. But for edges, a mergeEdges function seems to be more reasonable. bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] Graph Union Operator Key: SPARK-7894 URL: https://issues.apache.org/jira/browse/SPARK-7894 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang Labels: graph, union Attachments: union_operator.png This operator aims to union two graphs and generate a new graph directly. The union of two graphs is the union of their vertex sets and their edge families. Vertexes and edges which are included in either graph will be part of the new graph. bq. G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H !union_operator.png|width=600px,align=center! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertexes and edges will inevitably overlap at the borders of the two graphs. It is necessary to consider how to handle this case for both Vertex and Edge. For vertex, it's quite natural to just take the union and remove those duplicate ones. But for edges, a mergeEdges function seems to be more reasonable. bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7893) Complex Operators between Graphs
[ https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Huang updated SPARK-7893: -- Description: Currently there are 30+ operators in GraphX, while few of them consider operators between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases, such as _*streaming graphs or merging a small graph into a big graph*_, complex operators such as *union or join* will be helpful for operating between graphs directly. Higher-level graph operators can help users to focus and think in graphs. A list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*. was: Currently there are 30+ operators in GraphX, while few of them consider operators between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases, such as streaming graphs or merging a small graph into a big graph, complex operators such as *union or join* will be helpful for operating between graphs directly. Higher-level graph operators can help users to focus and think in graphs. A list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*. Complex Operators between Graphs Key: SPARK-7893 URL: https://issues.apache.org/jira/browse/SPARK-7893 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang Labels: complex, graph, join, operators, union Currently there are 30+ operators in GraphX, while few of them consider operators between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases, such as _*streaming graphs or merging a small graph into a big graph*_, complex operators such as *union or join* will be helpful for operating between graphs directly. Higher-level graph operators can help users to focus and think in graphs. A list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7893) Complex Operators between Graphs
[ https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Huang updated SPARK-7893: -- Description: Currently there are 30+ operators in GraphX, while few of them consider operators between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases, such as _*streaming graphs or merging a small graph into a big graph*_, complex operators will be helpful for operating between graphs directly. Higher-level graph operators can help users to focus and think in graphs. A list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*. was: Currently there are 30+ operators in GraphX, while few of them consider operators between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases, such as _*streaming graphs or merging a small graph into a big graph*_, complex operators such as *union or join* will be helpful for operating between graphs directly. Higher-level graph operators can help users to focus and think in graphs. A list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*. Complex Operators between Graphs Key: SPARK-7893 URL: https://issues.apache.org/jira/browse/SPARK-7893 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang Labels: complex, graph, join, operators, union Currently there are 30+ operators in GraphX, while few of them consider operators between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases, such as _*streaming graphs or merging a small graph into a big graph*_, complex operators will be helpful for operating between graphs directly. Higher-level graph operators can help users to focus and think in graphs. A list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7165) Sort Merge Join for outer joins
[ https://issues.apache.org/jira/browse/SPARK-7165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7165: --- Priority: Blocker (was: Major) Sort Merge Join for outer joins --- Key: SPARK-7165 URL: https://issues.apache.org/jira/browse/SPARK-7165 Project: Spark Issue Type: New Feature Components: SQL Reporter: Adrian Wang Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7891) Python class in __main__ may trigger AssertionError
flykobe cheng created SPARK-7891: Summary: Python class in __main__ may trigger AssertionError Key: SPARK-7891 URL: https://issues.apache.org/jira/browse/SPARK-7891 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Environment: Linux, Python 2.7.3, pickled by the Python pickle lib Reporter: flykobe cheng Priority: Minor
Callback functions for Spark transformations and actions will be pickled. If the callback is an instance method of a class in the __main__ module, and the class has more than one instance method that uses class properties or classmethods, the class will be pickled twice, and 'pickle.memoize'd twice, which triggers an AssertionError.
Demo code:

import logging
import sys
import pyspark

class AClass(object):
    _class_var = {'classkey': 'classval', }

    def main_object_method(self, item):
        logging.warn("class var by %s: %s" % (sys._getframe().f_code.co_name, AClass._class_var['classkey']))

    def main_object_method2(self, item):
        logging.warn("class var by %s: %s" % (sys._getframe().f_code.co_name, AClass._class_var['classkey']))

def test_main_object_method(sc):
    obj = AClass()
    res = sc.parallelize(range(4)).map(obj.main_object_method).collect()

if __name__ == '__main__':
    cf = pyspark.SparkConf()
    cf.set('spark.cores.max', 1)
    sc = pyspark.SparkContext(appName='flykobe_demo_pickle_error', conf=cf)
    test_main_object_method(sc)

Traceback:

File /home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py, line 310, in save_function_tuple
    save(f_globals)
File /home/users/chengyi02/.jumbo/lib/python2.7/pickle.py, line 291, in save
    f(self, obj) # Call unbound method with explicit self
File /home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py, line 174, in save_dict
    pickle.Pickler.save_dict(self, obj)
File /home/users/chengyi02/.jumbo/lib/python2.7/pickle.py, line 654, in save_dict
    self._batch_setitems(obj.iteritems())
File /home/users/chengyi02/.jumbo/lib/python2.7/pickle.py, line 686, in _batch_setitems
    save(v)
File /home/users/chengyi02/.jumbo/lib/python2.7/pickle.py, line 291, in save
    f(self, obj) # Call unbound method with explicit self
File /home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py, line 468, in save_global
    d),obj=obj)
File /home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py, line 638, in save_reduce
    self.memoize(obj)
File /home/users/chengyi02/.jumbo/lib/python2.7/pickle.py, line 248, in memoize
    assert id(obj) not in self.memo
AssertionError

Problem in Python/Lib/pickle.py:

def memoize(self, obj):
    """Store an object in the memo."""
    if self.fast:
        return
    assert id(obj) not in self.memo
    memo_len = len(self.memo)
    self.write(self.put(memo_len))
    self.memo[id(obj)] = memo_len, obj

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7893) Complex Operators between Graphs
[ https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Huang updated SPARK-7893: -- Description: Currently there are 30+ operators in GraphX, but few of them deal with operations between graphs. The only one is mask, which takes another graph as a parameter and returns a new graph. In many complex cases it is helpful to operate between graphs directly, such as union or join, especially for streaming cases and for combining small and big graphs. Higher-level graph operators help users focus and think in terms of graphs. A detailed list of complex graph operators can be found here: complex_graph_operations. We will focus on two frequently-used operators first: union and join. was: Currently there are 30+ operators in GraphX, but few of them deal with operators between graphs. The only one is mask, which takes another graph as a parameter and returns a new graph. In many complex cases it is helpful to operate between graphs directly, such as union or join, especially for streaming cases and for combining small and big graphs. Higher-level graph operators help users focus and think in terms of graphs. A detailed list of complex graph operators can be found here: complex_graph_operations. We will focus on two frequently-used operators first: union and join. Complex Operators between Graphs Key: SPARK-7893 URL: https://issues.apache.org/jira/browse/SPARK-7893 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang Currently there are 30+ operators in GraphX, but few of them deal with operations between graphs. The only one is mask, which takes another graph as a parameter and returns a new graph. In many complex cases it is helpful to operate between graphs directly, such as union or join, especially for streaming cases and for combining small and big graphs. Higher-level graph operators help users focus and think in terms of graphs. A detailed list of complex graph operators can be found here: complex_graph_operations. We will focus on two frequently-used operators first: union and join. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7894) Graph Union Operator
[ https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Huang updated SPARK-7894: -- Description: This operator aims to union two graphs and generate a new graph directly. Vertices and edges which are included in either graph will be part of the new graph. The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H. !union_operator.png|thumbnail! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertices and edges will inevitably overlap at the borders of the two graphs, so the interface needs to consider how to handle this case for both vertices and edges. For vertices, it is natural to take the union and remove duplicate vertices. But for edges, a mergeEdges function seems more reasonable: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] was: This operator aims to union two graphs and generate a new graph directly. Vertices and edges which are included in either graph will be part of the new graph. The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H. !union_operator.png! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertices and edges will inevitably overlap at the borders of the two graphs, so the interface needs to consider how to handle this case for both vertices and edges. For vertices, it is natural to take the union and remove duplicate vertices. But for edges, a mergeEdges function seems more reasonable: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] Graph Union Operator Key: SPARK-7894 URL: https://issues.apache.org/jira/browse/SPARK-7894 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang Labels: graph Attachments: union_operator.png This operator aims to union two graphs and generate a new graph directly. Vertices and edges which are included in either graph will be part of the new graph. The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H. !union_operator.png|thumbnail! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertices and edges will inevitably overlap at the borders of the two graphs, so the interface needs to consider how to handle this case for both vertices and edges. For vertices, it is natural to take the union and remove duplicate vertices. But for edges, a mergeEdges function seems more reasonable: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
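To make the proposed semantics concrete, here is a minimal sketch of how such a union could be written on top of the existing GraphX API. The standalone helper, the tie-breaking rule that keeps the left graph's attribute for duplicate vertices, and the keying of edges by (srcId, dstId) are illustrative assumptions, not part of the proposal itself:
{code}
import scala.reflect.ClassTag

import org.apache.spark.graphx.{Edge, Graph}

// Sketch only: union two graphs, deduplicating vertices by id and combining
// parallel edges that occur in both graphs with a caller-supplied function.
def union[VD: ClassTag, ED: ClassTag](
    g: Graph[VD, ED],
    h: Graph[VD, ED],
    mergeEdges: (ED, ED) => ED): Graph[VD, ED] = {
  // For vertex ids present in both graphs, arbitrarily keep g's attribute.
  val vertices = g.vertices.union(h.vertices).reduceByKey((a, _) => a)
  // Key edges by (src, dst) so duplicates across the two graphs get merged.
  val edges = g.edges.union(h.edges)
    .map(e => ((e.srcId, e.dstId), e.attr))
    .reduceByKey(mergeEdges)
    .map { case ((src, dst), attr) => Edge(src, dst, attr) }
  Graph(vertices, edges)
}
{code}
A real implementation would also need to decide how overlapping vertex attributes are merged (the sketch simply prefers the left graph's) and whether parallel edges within a single input graph should be merged as well.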
[jira] [Updated] (SPARK-7893) Complex Operators between Graphs
[ https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Huang updated SPARK-7893: -- Description: Currently there are 30+ operators in GraphX, while few of them deal with operators between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases it is helpful to operate between graphs directly, such as *union or join*, especially for streaming cases and for combining small and big graphs. Higher-level graph operators help users focus and think in terms of graphs. A detailed list of complex graph operators can be found here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*. was: Currently there are 30+ operators in GraphX. But few of them deal with operators between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases it is helpful to operate between graphs directly, such as *union or join*, especially for streaming cases and for combining small and big graphs. Higher-level graph operators help users focus and think in terms of graphs. A detailed list of complex graph operators can be found here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*. Complex Operators between Graphs Key: SPARK-7893 URL: https://issues.apache.org/jira/browse/SPARK-7893 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang Labels: complex, graph, join, operators, union Currently there are 30+ operators in GraphX, while few of them deal with operators between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases it is helpful to operate between graphs directly, such as *union or join*, especially for streaming cases and for combining small and big graphs. Higher-level graph operators help users focus and think in terms of graphs. A detailed list of complex graph operators can be found here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7893) Complex Operators between Graphs
[ https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Huang updated SPARK-7893: -- Description: Currently there are 30+ operators in GraphX, while few of them deal with operators between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases it is helpful to operate between graphs directly, such as *union or join*, especially for streaming cases and for combining small and big graphs. Higher-level graph operators help users focus and think in terms of graphs. A list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*. was: Currently there are 30+ operators in GraphX, while few of them deal with operators between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases it is helpful to operate between graphs directly, such as *union or join*, especially for streaming cases and for combining small and big graphs. Higher-level graph operators help users focus and think in terms of graphs. A detailed list of complex graph operators can be found here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*. Complex Operators between Graphs Key: SPARK-7893 URL: https://issues.apache.org/jira/browse/SPARK-7893 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang Labels: complex, graph, join, operators, union Currently there are 30+ operators in GraphX, while few of them deal with operators between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases it is helpful to operate between graphs directly, such as *union or join*, especially for streaming cases and for combining small and big graphs. Higher-level graph operators help users focus and think in terms of graphs. A list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7891) Python class in __main__ may trigger AssertionError
[ https://issues.apache.org/jira/browse/SPARK-7891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] flykobe cheng updated SPARK-7891: - Description: Callback functions for Spark transformations and actions will be pickled. If the callback is an instance method of a class defined in the __main__ module, and the class has more than one instance method that uses class properties or classmethods, the class will be pickled twice, 'pickle.memoize' runs twice, and an AssertionError is triggered. Demo code and traceback attached. was: Callback functions for Spark transformations and actions will be pickled. If the callback is an instance method of a class defined in the __main__ module, and the class has more than one instance method that uses class properties or classmethods, the class will be pickled twice, 'pickle.memoize' runs twice, and an AssertionError is triggered. Demo code:

import logging
import sys

import pyspark

class AClass(object):
    _class_var = {'classkey': 'classval', }

    def main_object_method(self, item):
        logging.warn("class var by %s: %s" % (
            sys._getframe().f_code.co_name, AClass._class_var['classkey']))

    def main_object_method2(self, item):
        logging.warn("class var by %s: %s" % (
            sys._getframe().f_code.co_name, AClass._class_var['classkey']))

def test_main_object_method(sc):
    obj = AClass()
    res = sc.parallelize(range(4)).map(obj.main_object_method).collect()

if __name__ == '__main__':
    cf = pyspark.SparkConf()
    cf.set('spark.cores.max', 1)
    sc = pyspark.SparkContext(appName='flykobe_demo_pickle_error', conf=cf)
    test_main_object_method(sc)

Traceback:

  File "/home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py", line 310, in save_function_tuple
    save(f_globals)
  File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 291, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py", line 174, in save_dict
    pickle.Pickler.save_dict(self, obj)
  File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 654, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 686, in _batch_setitems
    save(v)
  File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 291, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py", line 468, in save_global
    d),obj=obj)
  File "/home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py", line 638, in save_reduce
    self.memoize(obj)
  File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 248, in memoize
    assert id(obj) not in self.memo
AssertionError

Problem in Python/Lib/pickle.py:

def memoize(self, obj):
    """Store an object in the memo."""
    if self.fast:
        return
    assert id(obj) not in self.memo
    memo_len = len(self.memo)
    self.write(self.put(memo_len))
    self.memo[id(obj)] = memo_len, obj

Python class in __main__ may trigger AssertionError --- Key: SPARK-7891 URL: https://issues.apache.org/jira/browse/SPARK-7891 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Environment: Linux, Python 2.7.3, pickled by the Python pickle lib Reporter: flykobe cheng Priority: Minor Attachments: demo_error.log, demo_pickle_error.py Callback functions for Spark transformations and actions will be pickled. If the callback is an instance method of a class defined in the __main__ module, and the class has more than one instance method that uses class properties or classmethods, the class will be pickled twice, 'pickle.memoize' runs twice, and an AssertionError is triggered. Demo code and traceback attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5062) Pregel use aggregateMessage instead of mapReduceTriplets function
[ https://issues.apache.org/jira/browse/SPARK-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shijinkui updated SPARK-5062: - Fix Version/s: 1.3.2 Pregel use aggregateMessage instead of mapReduceTriplets function - Key: SPARK-5062 URL: https://issues.apache.org/jira/browse/SPARK-5062 Project: Spark Issue Type: Wish Components: GraphX Reporter: shijinkui Fix For: 1.3.2 Attachments: graphx_aggreate_msg.jpg Since Spark 1.2 introduced aggregateMessages as a replacement for mapReduceTriplets, and it indeed improves performance, it is time to replace mapReduceTriplets with aggregateMessages in Pregel. We can discuss it. I have drawn a diagram of aggregateMessages to show why it improves performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
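For context on the API difference: aggregateMessages delivers messages through an EdgeContext callback instead of returning an Iterator per triplet, which avoids per-triplet allocation and is where the speedup comes from. A minimal comparison of the two styles, assuming some existing graph: Graph[VD, ED] (the in-degree computation is only an illustration):
{code}
import org.apache.spark.graphx._

// New style (since Spark 1.2): write messages into the EdgeContext.
val inDegrees: VertexRDD[Int] =
  graph.aggregateMessages[Int](
    ctx => ctx.sendToDst(1), // one message per edge, sent to the target
    _ + _)                   // sum the messages arriving at each vertex

// Old style, still used inside Pregel, which this ticket proposes to replace.
val inDegreesOld: VertexRDD[Int] =
  graph.mapReduceTriplets[Int](t => Iterator((t.dstId, 1)), _ + _)
{code}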
[jira] [Created] (SPARK-7894) Graph Union Operator
Andy Huang created SPARK-7894: - Summary: Graph Union Operator Key: SPARK-7894 URL: https://issues.apache.org/jira/browse/SPARK-7894 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang This operator aims to union two graphs and generate a new graph directly. Vertices and edges which are included in either graph will be part of the new graph. The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H. A simple interface would be: def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertices and edges will inevitably overlap at the borders of the two graphs, so the interface needs to consider how to handle this case for both vertices and edges. For vertices, it is natural to take the union and remove duplicate vertices. But for edges, a mergeEdges function seems more reasonable: def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7894) Graph Union Operator
[ https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Huang updated SPARK-7894: -- Description: This operator aims to union two graphs and generate a new graph directly. Vertices and edges which are included in either graph will be part of the new graph. The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H. !union_operator.png|thumbnail! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertices and edges will inevitably overlap at the borders of the two graphs, so the interface needs to consider how to handle this case for both vertices and edges. For vertices, it is natural to simply take the union and remove those duplicate vertices. But for edges, a mergeEdges function seems more reasonable: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] was: This operator aims to union two graphs and generate a new graph directly. Vertices and edges which are included in either graph will be part of the new graph. The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H. !union_operator.png|thumbnail! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertices and edges will inevitably overlap at the borders of the two graphs, so the interface needs to consider how to handle this case for both vertices and edges. For vertices, it is natural to take the union and remove duplicate vertices. But for edges, a mergeEdges function seems more reasonable: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] Graph Union Operator Key: SPARK-7894 URL: https://issues.apache.org/jira/browse/SPARK-7894 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang Labels: graph Attachments: union_operator.png This operator aims to union two graphs and generate a new graph directly. Vertices and edges which are included in either graph will be part of the new graph. The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H. !union_operator.png|thumbnail! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertices and edges will inevitably overlap at the borders of the two graphs, so the interface needs to consider how to handle this case for both vertices and edges. For vertices, it is natural to simply take the union and remove those duplicate vertices. But for edges, a mergeEdges function seems more reasonable: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7893) Complex Operators between Graphs
Andy Huang created SPARK-7893: - Summary: Complex Operators between Graphs Key: SPARK-7893 URL: https://issues.apache.org/jira/browse/SPARK-7893 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang Currently there are 30+ operators in GraphX. But few of them deal with operators between graphs. The only one is mask, which takes another graph as a parameter and returns a new graph. In many complex cases it is helpful to operate between graphs directly, such as union or join, especially for streaming cases and for combining small and big graphs. Higher-level graph operators help users focus and think in terms of graphs. A detailed list of complex graph operators can be found here: complex_graph_operations. We will focus on two frequently-used operators first: union and join. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7894) Graph Union Operator
[ https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Huang updated SPARK-7894: -- Labels: graph (was: graph union) Graph Union Operator Key: SPARK-7894 URL: https://issues.apache.org/jira/browse/SPARK-7894 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang Labels: graph Attachments: union_operator.png This operator aims to union two graphs and generate a new graph directly. Vertices and edges which are included in either graph will be part of the new graph. The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H. !union_operator.png|thumbnail! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertices and edges will inevitably overlap at the borders of the two graphs, so the interface needs to consider how to handle this case for both vertices and edges. For vertices, it is natural to take the union and remove duplicate vertices. But for edges, a mergeEdges function seems more reasonable: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7894) Graph Union Operator
[ https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Huang updated SPARK-7894: -- Description: This operator aims to union two graphs and generate a new graph directly. Vertices and edges which are included in either graph will be part of the new graph. The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H. !union_operator.png! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertices and edges will inevitably overlap at the borders of the two graphs, so the interface needs to consider how to handle this case for both vertices and edges. For vertices, it is natural to take the union and remove duplicate vertices. But for edges, a mergeEdges function seems more reasonable: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] was: This operator aims to union two graphs and generate a new graph directly. Vertices and edges which are included in either graph will be part of the new graph. The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H. !union_operator.png|thumbnail! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertices and edges will inevitably overlap at the borders of the two graphs, so the interface needs to consider how to handle this case for both vertices and edges. For vertices, it is natural to take the union and remove duplicate vertices. But for edges, a mergeEdges function seems more reasonable: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] Graph Union Operator Key: SPARK-7894 URL: https://issues.apache.org/jira/browse/SPARK-7894 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang Labels: graph Attachments: union_operator.png This operator aims to union two graphs and generate a new graph directly. Vertices and edges which are included in either graph will be part of the new graph. The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H. !union_operator.png! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertices and edges will inevitably overlap at the borders of the two graphs, so the interface needs to consider how to handle this case for both vertices and edges. For vertices, it is natural to take the union and remove duplicate vertices. But for edges, a mergeEdges function seems more reasonable: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-7817) Intellij Idea cannot find symbol when import scala object
[ https://issues.apache.org/jira/browse/SPARK-7817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] bofei.xiao closed SPARK-7817. - Thanks Owen! Intellij Idea cannot find symbol when import scala object - Key: SPARK-7817 URL: https://issues.apache.org/jira/browse/SPARK-7817 Project: Spark Issue Type: Bug Components: Examples Affects Versions: 1.3.1 Environment: Microsoft Server 2003, Java 1.6, Maven 3.04 Reporter: bofei.xiao [ERROR] src\main\java\org\apache\spark\exaples\streaming\JavaQueueStream.java:[33,47] cannot find symbol symbol: class StreamingExamples location: package org.apache.spark.exaples.streaming In fact, StreamingExamples is an object under org.apache.spark.exaples.streaming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7042) Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x
[ https://issues.apache.org/jira/browse/SPARK-7042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560688#comment-14560688 ] Konstantin Shaposhnikov commented on SPARK-7042: Yes, I've just tested it locally - the 2.11 Spark build works with akka 2.3.11 Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x -- Key: SPARK-7042 URL: https://issues.apache.org/jira/browse/SPARK-7042 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.1 Reporter: Konstantin Shaposhnikov Assignee: Konstantin Shaposhnikov Priority: Minor Fix For: 1.5.0 When connecting to a remote Spark cluster (that runs Spark branch-1.3 built with Scala 2.11) from an application that uses akka 2.3.9, I get the following error: {noformat} 2015-04-22 09:01:38,924 - [WARN] - [akka.remote.ReliableDeliverySupervisor] [sparkDriver-akka.actor.default-dispatcher-5] - Association with remote system [akka.tcp://sparkExecutor@server:59007] has failed, address is now gated for [5000] ms. Reason is: [akka.actor.Identify; local class incompatible: stream classdesc serialVersionUID = -213377755528332889, local class serialVersionUID = 1]. {noformat} It looks like the akka-actor_2.11 2.3.4-spark used by Spark was built with Scala compiler 2.11.0, which ignores SerialVersionUID annotations (see https://issues.scala-lang.org/browse/SI-8549). The following steps can resolve the issue: - re-build the custom akka library used by Spark with a more recent version of the Scala compiler (e.g. 2.11.6) - deploy a new version (e.g. 2.3.4.1-spark) to a maven repo - update the version of akka used by Spark (master and 1.3 branch) I would also suggest upgrading to the latest version of akka, 2.3.9 (or 2.3.10, which should be released soon). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
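The failure mode can be illustrated in miniature. The class below is a simplified stand-in, not akka's actual source: compiled with a fixed Scala compiler it serializes with the declared serialVersionUID of 1, but under Scala 2.11.0 the annotation is silently dropped (SI-8549), so the JVM computes an id from the class shape (the -213377755528332889 seen in the error above) and deserialization on the other side fails with "local class incompatible":
{code}
// Simplified illustration only. With the annotation honored, both ends of
// the wire agree on serialVersionUID = 1; if the compiler drops it, the JVM
// derives a different id and remote deserialization fails.
@SerialVersionUID(1L)
class Identify(val messageId: Any) extends Serializable
{code}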
[jira] [Updated] (SPARK-7892) Python class in __main__ may trigger AssertionError
[ https://issues.apache.org/jira/browse/SPARK-7892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] flykobe cheng updated SPARK-7892: - Description: Callback functions for Spark transformations and actions will be pickled. If the callback is an instance method of a class defined in the __main__ module, and the class has more than one instance method that uses class properties or classmethods, the class will be pickled twice, 'pickle.memoize' runs twice, and an AssertionError is triggered. Demo code and traceback attached. was: Callback functions for Spark transformations and actions will be pickled. If the callback is an instance method of a class defined in the __main__ module, and the class has more than one instance method that uses class properties or classmethods, the class will be pickled twice, 'pickle.memoize' runs twice, and an AssertionError is triggered. Demo code:

import logging
import sys

import pyspark

class AClass(object):
    _class_var = {'classkey': 'classval', }

    def main_object_method(self, item):
        logging.warn("class var by %s: %s" % (
            sys._getframe().f_code.co_name, AClass._class_var['classkey']))

    def main_object_method2(self, item):
        logging.warn("class var by %s: %s" % (
            sys._getframe().f_code.co_name, AClass._class_var['classkey']))

def test_main_object_method(sc):
    obj = AClass()
    res = sc.parallelize(range(4)).map(obj.main_object_method).collect()

if __name__ == '__main__':
    cf = pyspark.SparkConf()
    cf.set('spark.cores.max', 1)
    sc = pyspark.SparkContext(appName='flykobe_demo_pickle_error', conf=cf)
    test_main_object_method(sc)

Traceback:

  File "/home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py", line 310, in save_function_tuple
    save(f_globals)
  File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 291, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py", line 174, in save_dict
    pickle.Pickler.save_dict(self, obj)
  File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 654, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 686, in _batch_setitems
    save(v)
  File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 291, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py", line 468, in save_global
    d),obj=obj)
  File "/home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py", line 638, in save_reduce
    self.memoize(obj)
  File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 248, in memoize
    assert id(obj) not in self.memo
AssertionError

Problem in Python/Lib/pickle.py:

def memoize(self, obj):
    """Store an object in the memo."""
    if self.fast:
        return
    assert id(obj) not in self.memo
    memo_len = len(self.memo)
    self.write(self.put(memo_len))
    self.memo[id(obj)] = memo_len, obj

Python class in __main__ may trigger AssertionError --- Key: SPARK-7892 URL: https://issues.apache.org/jira/browse/SPARK-7892 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Environment: Linux, Python 2.7.3, pickled by the Python pickle lib Reporter: flykobe cheng Priority: Minor Callback functions for Spark transformations and actions will be pickled. If the callback is an instance method of a class defined in the __main__ module, and the class has more than one instance method that uses class properties or classmethods, the class will be pickled twice, 'pickle.memoize' runs twice, and an AssertionError is triggered. Demo code and traceback attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7894) Graph Union Operator
[ https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Huang updated SPARK-7894: -- Description: This operator aims to union two graphs and generate a new graph directly. The union of two graphs is the union of their vertex sets and their edge families: vertices and edges which are included in either graph will be part of the new graph, i.e. G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H. !union_operator.png|width=600px! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertices and edges will inevitably overlap at the borders of the two graphs, so the interface needs to consider how to handle this case for both vertices and edges. For vertices, it is natural to simply take the union and remove those duplicate vertices. But for edges, a mergeEdges function seems more reasonable: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] was: This operator aims to union two graphs and generate a new graph directly. Vertices and edges which are included in either graph will be part of the new graph. The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H. !union_operator.png|width=600px! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertices and edges will inevitably overlap at the borders of the two graphs, so the interface needs to consider how to handle this case for both vertices and edges. For vertices, it is natural to simply take the union and remove those duplicate vertices. But for edges, a mergeEdges function seems more reasonable: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] Graph Union Operator Key: SPARK-7894 URL: https://issues.apache.org/jira/browse/SPARK-7894 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang Labels: graph, union Attachments: union_operator.png This operator aims to union two graphs and generate a new graph directly. The union of two graphs is the union of their vertex sets and their edge families: vertices and edges which are included in either graph will be part of the new graph, i.e. G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H. !union_operator.png|width=600px! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertices and edges will inevitably overlap at the borders of the two graphs, so the interface needs to consider how to handle this case for both vertices and edges. For vertices, it is natural to simply take the union and remove those duplicate vertices. But for edges, a mergeEdges function seems more reasonable: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7894) Graph Union Operator
[ https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Huang updated SPARK-7894: -- Labels: graph union (was: graph) Graph Union Operator Key: SPARK-7894 URL: https://issues.apache.org/jira/browse/SPARK-7894 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang Labels: graph, union Attachments: union_operator.png This operator aims to union two graphs and generate a new graph directly. Vertices and edges which are included in either graph will be part of the new graph. The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H. !union_operator.png|width=600px! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertices and edges will inevitably overlap at the borders of the two graphs, so the interface needs to consider how to handle this case for both vertices and edges. For vertices, it is natural to simply take the union and remove those duplicate vertices. But for edges, a mergeEdges function seems more reasonable: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7894) Graph Union Operator
[ https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Huang updated SPARK-7894: -- Description: This operator aims to union two graphs and generate a new graph directly. Vertices and edges which are included in either graph will be part of the new graph. The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H. !https://raw.githubusercontent.com/andyyehoo/anything/master/images/union_operator.png|thumbnail! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertices and edges will inevitably overlap at the borders of the two graphs, so the interface needs to consider how to handle this case for both vertices and edges. For vertices, it is natural to take the union and remove duplicate vertices. But for edges, a mergeEdges function seems more reasonable: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] was: This operator aims to union two graphs and generate a new graph directly. Vertices and edges which are included in either graph will be part of the new graph. The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H. !https://raw.githubusercontent.com/andyyehoo/anything/master/images/union_operator.png! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertices and edges will inevitably overlap at the borders of the two graphs, so the interface needs to consider how to handle this case for both vertices and edges. For vertices, it is natural to take the union and remove duplicate vertices. But for edges, a mergeEdges function seems more reasonable: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] Graph Union Operator Key: SPARK-7894 URL: https://issues.apache.org/jira/browse/SPARK-7894 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang This operator aims to union two graphs and generate a new graph directly. Vertices and edges which are included in either graph will be part of the new graph. The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H. !https://raw.githubusercontent.com/andyyehoo/anything/master/images/union_operator.png|thumbnail! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertices and edges will inevitably overlap at the borders of the two graphs, so the interface needs to consider how to handle this case for both vertices and edges. For vertices, it is natural to take the union and remove duplicate vertices. But for edges, a mergeEdges function seems more reasonable: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7894) Graph Union Operator
[ https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Huang updated SPARK-7894: -- Description: This operator aims to union two graphs and generate a new graph directly. Vertices and edges which are included in either graph will be part of the new graph. The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H. !https://raw.githubusercontent.com/andyyehoo/anything/master/images/union_operator.png! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertices and edges will inevitably overlap at the borders of the two graphs, so the interface needs to consider how to handle this case for both vertices and edges. For vertices, it is natural to take the union and remove duplicate vertices. But for edges, a mergeEdges function seems more reasonable: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] was: This operator aims to union two graphs and generate a new graph directly. Vertices and edges which are included in either graph will be part of the new graph. The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H. !https://raw.githubusercontent.com/andyyehoo/anything/master/images/union_operator.png A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertices and edges will inevitably overlap at the borders of the two graphs, so the interface needs to consider how to handle this case for both vertices and edges. For vertices, it is natural to take the union and remove duplicate vertices. But for edges, a mergeEdges function seems more reasonable: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] Graph Union Operator Key: SPARK-7894 URL: https://issues.apache.org/jira/browse/SPARK-7894 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang This operator aims to union two graphs and generate a new graph directly. Vertices and edges which are included in either graph will be part of the new graph. The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H. !https://raw.githubusercontent.com/andyyehoo/anything/master/images/union_operator.png! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertices and edges will inevitably overlap at the borders of the two graphs, so the interface needs to consider how to handle this case for both vertices and edges. For vertices, it is natural to take the union and remove duplicate vertices. But for edges, a mergeEdges function seems more reasonable: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7894) Graph Union Operator
[ https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Huang updated SPARK-7894: -- Description: This operator aims to union two graphs and generate a new graph directly. Vertices and edges which are included in either graph will be part of the new graph. The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H. !https://raw.githubusercontent.com/andyyehoo/anything/master/images/union_operator.png A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertices and edges will inevitably overlap at the borders of the two graphs, so the interface needs to consider how to handle this case for both vertices and edges. For vertices, it is natural to take the union and remove duplicate vertices. But for edges, a mergeEdges function seems more reasonable: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] was: This operator aims to union two graphs and generate a new graph directly. Vertices and edges which are included in either graph will be part of the new graph. The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H. A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertices and edges will inevitably overlap at the borders of the two graphs, so the interface needs to consider how to handle this case for both vertices and edges. For vertices, it is natural to take the union and remove duplicate vertices. But for edges, a mergeEdges function seems more reasonable: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] Graph Union Operator Key: SPARK-7894 URL: https://issues.apache.org/jira/browse/SPARK-7894 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang This operator aims to union two graphs and generate a new graph directly. Vertices and edges which are included in either graph will be part of the new graph. The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H. !https://raw.githubusercontent.com/andyyehoo/anything/master/images/union_operator.png A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, vertices and edges will inevitably overlap at the borders of the two graphs, so the interface needs to consider how to handle this case for both vertices and edges. For vertices, it is natural to take the union and remove duplicate vertices. But for edges, a mergeEdges function seems more reasonable: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7893) Complex Operators between Graphs
[ https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Huang updated SPARK-7893: -- Labels: complex graph operators (was: ) Complex Operators between Graphs Key: SPARK-7893 URL: https://issues.apache.org/jira/browse/SPARK-7893 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang Labels: complex, graph, operators Currently there are 30+ operators in GraphX. But few of them deal with operators between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases it is helpful to operate between graphs directly, such as *union or join*, especially for streaming cases and for combining small and big graphs. Higher-level graph operators help users focus and think in terms of graphs. A detailed list of complex graph operators can be found here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7893) Complex Operators between Graphs
[ https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Huang updated SPARK-7893: -- Labels: complex graph join operators union (was: complex graph operators) Complex Operators between Graphs Key: SPARK-7893 URL: https://issues.apache.org/jira/browse/SPARK-7893 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang Labels: complex, graph, join, operators, union Currently there are 30+ operators in GraphX. But few of them deal with operators between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases it is helpful to operate between graphs directly, such as *union or join*, especially for streaming cases and for combining small and big graphs. Higher-level graph operators help users focus and think in terms of graphs. A detailed list of complex graph operators can be found here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7893) Complex Operators between Graphs
[ https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Huang updated SPARK-7893: -- Description: Currently there are 30+ operators in GraphX, while few of them deal with operators between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases, such as streaming graphs or combining small and big graphs, it is helpful to operate between graphs directly, e.g. *union or join*. Higher-level graph operators help users focus and think in terms of graphs. A list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*. was: Currently there are 30+ operators in GraphX, while few of them deal with operators between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases it is helpful to operate between graphs directly, such as *union or join*, especially for streaming cases and for combining small and big graphs. Higher-level graph operators help users focus and think in terms of graphs. A list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*. Complex Operators between Graphs Key: SPARK-7893 URL: https://issues.apache.org/jira/browse/SPARK-7893 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang Labels: complex, graph, join, operators, union Currently there are 30+ operators in GraphX, while few of them deal with operators between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases, such as streaming graphs or combining small and big graphs, it is helpful to operate between graphs directly, e.g. *union or join*. Higher-level graph operators help users focus and think in terms of graphs. A list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7853) ClassNotFoundException for SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-7853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560772#comment-14560772 ] Apache Spark commented on SPARK-7853: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/6435 ClassNotFoundException for SparkSQL --- Key: SPARK-7853 URL: https://issues.apache.org/jira/browse/SPARK-7853 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Cheng Hao Priority: Blocker Reproduce steps: {code} bin/spark-sql --jars ./sql/hive/src/test/resources/hive-hcatalog-core-0.13.1.jar CREATE TABLE t1(a string, b string) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'; {code} Throws Exception like: {noformat}
15/05/26 00:16:33 ERROR SparkSQLDriver: Failed in [CREATE TABLE t1(a string, b string) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe']
org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Cannot validate serde: org.apache.hive.hcatalog.data.JsonSerDe
    at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:333)
    at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:310)
    at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:139)
    at org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:310)
    at org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:300)
    at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:457)
    at org.apache.spark.sql.hive.execution.HiveNativeCommand.run(HiveNativeCommand.scala:33)
    at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
    at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
    at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:68)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:87)
    at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:922)
    at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:922)
    at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:147)
    at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:131)
    at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
    at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:727)
    at org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:57)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:283)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:218)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7891) Python class in __main__ may trigger AssertionError
[ https://issues.apache.org/jira/browse/SPARK-7891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] flykobe cheng updated SPARK-7891: - Attachment: demo_error.log demo_pickle_error.py Python class in __main__ may trigger AssertionError --- Key: SPARK-7891 URL: https://issues.apache.org/jira/browse/SPARK-7891 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Environment: Linux, Python 2.7.3 pickled by Python pickle Lib Reporter: flykobe cheng Priority: Minor Attachments: demo_error.log, demo_pickle_error.py Callback functions for Spark transformations and actions will be pickled. If the callback is an instance method of a class defined in the __main__ module, and the class has more than one instance method that uses class properties or classmethods, the class will be pickled twice and 'pickle.memoize'd twice, which triggers an AssertionError. Demo code:
{code}
class AClass(object):
    _class_var = {'classkey': 'classval', }

    def main_object_method(self, item):
        logging.warn("class var by %s: %s" % (sys._getframe().f_code.co_name, AClass._class_var['classkey']))

    def main_object_method2(self, item):
        logging.warn("class var by %s: %s" % (sys._getframe().f_code.co_name, AClass._class_var['classkey']))

def test_main_object_method(sc):
    obj = AClass()
    res = sc.parallelize(range(4)).map(obj.main_object_method).collect()

if __name__ == '__main__':
    cf = pyspark.SparkConf()
    cf.set('spark.cores.max', 1)
    sc = pyspark.SparkContext(appName = "flykobe_demo_pickle_error", conf = cf)
    test_main_object_method(sc)
{code}
Traceback:
{noformat}
File "/home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py", line 310, in save_function_tuple
    save(f_globals)
File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 291, in save
    f(self, obj) # Call unbound method with explicit self
File "/home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py", line 174, in save_dict
    pickle.Pickler.save_dict(self, obj)
File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 654, in save_dict
    self._batch_setitems(obj.iteritems())
File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 686, in _batch_setitems
    save(v)
File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 291, in save
    f(self, obj) # Call unbound method with explicit self
File "/home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py", line 468, in save_global
    d),obj=obj)
File "/home/users/chengyi02/svn-root/app/ecom/darwin/local/spark-1.2.0.5-client/python/pyspark/cloudpickle.py", line 638, in save_reduce
    self.memoize(obj)
File "/home/users/chengyi02/.jumbo/lib/python2.7/pickle.py", line 248, in memoize
    assert id(obj) not in self.memo
AssertionError
{noformat}
Problem in Python/Lib/pickle.py:
{code}
def memoize(self, obj):
    """Store an object in the memo."""
    if self.fast:
        return
    assert id(obj) not in self.memo
    memo_len = len(self.memo)
    self.write(self.put(memo_len))
    self.memo[id(obj)] = memo_len, obj
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
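A common way to avoid this class of pickling failure is to keep Spark callbacks out of classes defined in __main__. Below is a hypothetical workaround sketch, not code from the ticket; the names are illustrative and an existing SparkContext {{sc}} is assumed:
{code}
# Workaround sketch: use a module-level function instead of a bound method,
# so cloudpickle never needs to serialize the __main__-module class at all.
_CLASS_VAR = {'classkey': 'classval'}

def map_item(item):
    # Only module-level state is referenced here, so the class above is never
    # pickled and the double-memoize assertion cannot fire.
    return (item, _CLASS_VAR['classkey'])

def run_demo(sc):
    return sc.parallelize(range(4)).map(map_item).collect()
{code}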
[jira] [Updated] (SPARK-7894) Graph Union Operator
[ https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Huang updated SPARK-7894: -- External issue ID: (was: 7893) External issue URL: (was: https://issues.apache.org/jira/browse/SPARK-7893) Graph Union Operator Key: SPARK-7894 URL: https://issues.apache.org/jira/browse/SPARK-7894 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang This operator aims to union two graphs and generate a new graph directly. Vertices and edges which are included in either graph will be part of the new graph. The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H. A simple interface would be: def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, overlapping vertices and edges will inevitably occur at the borders of the two graphs, so the interface needs to define how to handle this case for both vertices and edges. For vertices, it is natural to take the union and remove duplicates. But for edges, a mergeEdges function seems more reasonable. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7894) Graph Union Operator
[ https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Huang updated SPARK-7894: -- Description: This operator aims to union two graphs and generate a new graph directly. Vertices and edges which are included in either graph will be part of the new graph. The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H. A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, overlapping vertices and edges will inevitably occur at the borders of the two graphs, so the interface needs to define how to handle this case for both vertices and edges. For vertices, it is natural to take the union and remove duplicates. But for edges, a mergeEdges function seems more reasonable. bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] was: This operator aims to union two graphs and generate a new graph directly. Vertices and edges which are included in either graph will be part of the new graph. The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H. A simple interface would be: def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, overlapping vertices and edges will inevitably occur at the borders of the two graphs, so the interface needs to define how to handle this case for both vertices and edges. For vertices, it is natural to take the union and remove duplicates. But for edges, a mergeEdges function seems more reasonable. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] Graph Union Operator Key: SPARK-7894 URL: https://issues.apache.org/jira/browse/SPARK-7894 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang This operator aims to union two graphs and generate a new graph directly. Vertices and edges which are included in either graph will be part of the new graph. The union of two graphs G(VG, EG) and H(VH, EH) is the union of their vertex sets and their edge families, which means G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H. A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, overlapping vertices and edges will inevitably occur at the borders of the two graphs, so the interface needs to define how to handle this case for both vertices and edges. For vertices, it is natural to take the union and remove duplicates. But for edges, a mergeEdges function seems more reasonable. bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7894) Graph Union Operator
[ https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Huang updated SPARK-7894: -- Description: This operator aims to union two graphs and generate a new graph directly. The union of two graphs is the union of their vertex sets and their edge families. Vertices and edges which are included in either graph will be part of the new graph. bq. G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H !union_operator.png|width=800px! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, overlapping vertices and edges will inevitably occur at the borders of the two graphs, so the interface needs to define how to handle this case for both vertices and edges. For vertices, it is natural to just take the union and remove duplicates. But for edges, a mergeEdges function seems more reasonable. bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] was: This operator aims to union two graphs and generate a new graph directly. The union of two graphs is the union of their vertex sets and their edge families. Vertices and edges which are included in either graph will be part of the new graph. bg. G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H !union_operator.png|width=600px! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, overlapping vertices and edges will inevitably occur at the borders of the two graphs, so the interface needs to define how to handle this case for both vertices and edges. For vertices, it is natural to just take the union and remove duplicates. But for edges, a mergeEdges function seems more reasonable. bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] Graph Union Operator Key: SPARK-7894 URL: https://issues.apache.org/jira/browse/SPARK-7894 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang Labels: graph, union Attachments: union_operator.png This operator aims to union two graphs and generate a new graph directly. The union of two graphs is the union of their vertex sets and their edge families. Vertices and edges which are included in either graph will be part of the new graph. bq. G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H !union_operator.png|width=800px! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, overlapping vertices and edges will inevitably occur at the borders of the two graphs, so the interface needs to define how to handle this case for both vertices and edges. For vertices, it is natural to just take the union and remove duplicates. But for edges, a mergeEdges function seems more reasonable. bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7894) Graph Union Operator
[ https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Huang updated SPARK-7894: -- Description: This operator aims to union two graphs and generate a new graph directly. The union of two graphs is the union of their vertex sets and their edge families. Vertices and edges which are included in either graph will be part of the new graph. bq. G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H !union_operator.png|width=600px,align=center! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, overlapping vertices and edges will inevitably occur at the borders of the two graphs, so the interface needs to define how to handle this case for both vertices and edges. For vertices, it is natural to just take the union and remove duplicates. But for edges, a mergeEdges function seems more reasonable. bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] was: This operator aims to union two graphs and generate a new graph directly. The union of two graphs is the union of their vertex sets and their edge families. Vertices and edges which are included in either graph will be part of the new graph. bq. G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H !union_operator.png|width=800px! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, overlapping vertices and edges will inevitably occur at the borders of the two graphs, so the interface needs to define how to handle this case for both vertices and edges. For vertices, it is natural to just take the union and remove duplicates. But for edges, a mergeEdges function seems more reasonable. bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] Graph Union Operator Key: SPARK-7894 URL: https://issues.apache.org/jira/browse/SPARK-7894 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang Labels: graph, union Attachments: union_operator.png This operator aims to union two graphs and generate a new graph directly. The union of two graphs is the union of their vertex sets and their edge families. Vertices and edges which are included in either graph will be part of the new graph. bq. G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H !union_operator.png|width=600px,align=center! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, overlapping vertices and edges will inevitably occur at the borders of the two graphs, so the interface needs to define how to handle this case for both vertices and edges. For vertices, it is natural to just take the union and remove duplicates. But for edges, a mergeEdges function seems more reasonable. bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7894) Graph Union Operator
[ https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Huang updated SPARK-7894: -- Description: This operator aims to union two graphs and generate a new graph directly. The union of two graphs is the union of their vertex sets and their edge families. Vertices and edges which are included in either graph will be part of the new graph. bg. G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H !union_operator.png|width=600px! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, overlapping vertices and edges will inevitably occur at the borders of the two graphs, so the interface needs to define how to handle this case for both vertices and edges. For vertices, it is natural to just take the union and remove duplicates. But for edges, a mergeEdges function seems more reasonable. bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] was: This operator aims to union two graphs and generate a new graph directly. The union of two graphs is the union of their vertex sets and their edge families. Vertices and edges which are included in either graph will be part of the new graph. | G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H !union_operator.png|width=600px! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, overlapping vertices and edges will inevitably occur at the borders of the two graphs, so the interface needs to define how to handle this case for both vertices and edges. For vertices, it is natural to just take the union and remove duplicates. But for edges, a mergeEdges function seems more reasonable. bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] Graph Union Operator Key: SPARK-7894 URL: https://issues.apache.org/jira/browse/SPARK-7894 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang Labels: graph, union Attachments: union_operator.png This operator aims to union two graphs and generate a new graph directly. The union of two graphs is the union of their vertex sets and their edge families. Vertices and edges which are included in either graph will be part of the new graph. bg. G ∪ H = (VG ∪ VH, EG ∪ EH). The image below shows a union of graph G and graph H !union_operator.png|width=600px! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, overlapping vertices and edges will inevitably occur at the borders of the two graphs, so the interface needs to define how to handle this case for both vertices and edges. For vertices, it is natural to just take the union and remove duplicates. But for edges, a mergeEdges function seems more reasonable. bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
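GraphX itself is Scala-only, but the union semantics proposed above can be sketched with plain PySpark pair RDDs, modeling a graph as a vertex RDD of (id, attr) pairs and an edge RDD of ((src, dst), attr) pairs. This is an illustrative sketch under those assumptions, not the proposed GraphX implementation:
{code}
def graph_union(g_vertices, g_edges, h_vertices, h_edges, merge_edges):
    # Vertices: plain set union on ids; for ids present in both graphs,
    # either attribute may be kept (here, the first one seen).
    vertices = g_vertices.union(h_vertices).reduceByKey(lambda a, b: a)
    # Edges: set union; attributes of edges present in both graphs are
    # combined with the caller-supplied merge_edges, mirroring mergeEdges.
    edges = g_edges.union(h_edges).reduceByKey(merge_edges)
    return vertices, edges
{code}
With weighted edges, for example, merge_edges could simply be {{lambda a, b: a + b}}.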
[jira] [Updated] (SPARK-7893) Complex Operators between Graphs
[ https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Huang updated SPARK-7893: -- Description: Currently there are 30+ operators in GraphX, but few of them operate between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases, such as _*streaming graphs or merging a small graph into a huge graph*_, higher-level operators between graphs can help users focus and think in terms of graphs. Performance optimization can be done within the operator and remain transparent to them. A list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*. was: Currently there are 30+ operators in GraphX, but few of them operate between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases, such as _*streaming graphs or merging a small graph into a huge graph*_, higher-level operators between graphs can help users focus and think in terms of graphs. Performance optimization can be done within the operator and remain transparent. A list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*. Complex Operators between Graphs Key: SPARK-7893 URL: https://issues.apache.org/jira/browse/SPARK-7893 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang Labels: complex, graph, join, operators, union Currently there are 30+ operators in GraphX, but few of them operate between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases, such as _*streaming graphs or merging a small graph into a huge graph*_, higher-level operators between graphs can help users focus and think in terms of graphs. Performance optimization can be done within the operator and remain transparent to them. A list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7893) Complex Operators between Graphs
[ https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Huang updated SPARK-7893: -- Description: Currently there are 30+ operators in GraphX, but few of them operate between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases, such as _*streaming graphs or merging a small graph into a huge graph*_, higher-level operators between graphs can help users focus and think in terms of graphs. Performance optimization can be done within the operator and remain transparent. A list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*. was: Currently there are 30+ operators in GraphX, but few of them operate between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases, such as _*streaming graphs or merging a small graph into a huge graph*_, higher-level operators between graphs can help users focus and think in terms of graphs. A list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*. Complex Operators between Graphs Key: SPARK-7893 URL: https://issues.apache.org/jira/browse/SPARK-7893 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Andy Huang Labels: complex, graph, join, operators, union Currently there are 30+ operators in GraphX, but few of them operate between graphs. The only one is _*mask*_, which takes another graph as a parameter and returns a new graph. In many complex cases, such as _*streaming graphs or merging a small graph into a huge graph*_, higher-level operators between graphs can help users focus and think in terms of graphs. Performance optimization can be done within the operator and remain transparent. A list of complex graph operators is here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. This issue will focus on two frequently-used operators first: *union* and *join*. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7782) A small problem on history server webpage
[ https://issues.apache.org/jira/browse/SPARK-7782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560995#comment-14560995 ] Apache Spark commented on SPARK-7782: - User 'zuxqoj' has created a pull request for this issue: https://github.com/apache/spark/pull/6437 A small problem on history server webpage - Key: SPARK-7782 URL: https://issues.apache.org/jira/browse/SPARK-7782 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.2.1, 1.3.1 Reporter: Xia Hu Priority: Minor Labels: starter A very small problem on the Spark history server webpage: we can click on each column header to sort the application list, for example by start time or completed time. But when the down arrow is shown, the rows are actually sorted in ascending order. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7782) A small problem on history server webpage
[ https://issues.apache.org/jira/browse/SPARK-7782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7782: --- Assignee: Apache Spark A small problem on history server webpage - Key: SPARK-7782 URL: https://issues.apache.org/jira/browse/SPARK-7782 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.2.1, 1.3.1 Reporter: Xia Hu Assignee: Apache Spark Priority: Minor Labels: starter A very small problem on the Spark history server webpage: we can click on each column header to sort the application list, for example by start time or completed time. But when the down arrow is shown, the rows are actually sorted in ascending order. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7897) Column with an unsigned bigint should be treated as DecimalType in JDBCRDD
Liang-Chi Hsieh created SPARK-7897: -- Summary: Column with an unsigned bigint should be treated as DecimalType in JDBCRDD Key: SPARK-7897 URL: https://issues.apache.org/jira/browse/SPARK-7897 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7896) IndexOutOfCountsException in ChainedBuffer
Arun Ahuja created SPARK-7896: - Summary: IndexOutOfCountsException in ChainedBuffer Key: SPARK-7896 URL: https://issues.apache.org/jira/browse/SPARK-7896 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Arun Ahuja I've run into this on two tasks that use the same dataset. The dataset is a collection of strings where the most common string appears ~200M times and the next few appear ~50M times each. For this rdd: RDD[String], I can do rdd.map(x => (x, 1)).reduceByKey(_ + _) to get the counts (which is how I got the numbers above), but I hit the error on rdd.groupByKey(). Also, I have a second RDD of strings rdd2: RDD[String], and I cannot do rdd2.leftOuterJoin(rdd) without hitting this error:
{code}
15/05/26 23:27:55 WARN scheduler.TaskSetManager: Lost task 3169.1 in stage 5.0 (TID 4843, demeter-csmaz10-19.demeter.hpc.mssm.edu): java.lang.IndexOutOfBoundsException: 512
    at scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43)
    at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:47)
    at org.apache.spark.util.collection.ChainedBuffer.write(ChainedBuffer.scala:110)
    at org.apache.spark.util.collection.ChainedBufferOutputStream.write(ChainedBuffer.scala:141)
    at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
    at org.apache.spark.serializer.KryoSerializationStream.flush(KryoSerializer.scala:147)
    at org.apache.spark.util.collection.PartitionedSerializedPairBuffer.insert(PartitionedSerializedPairBuffer.scala:78)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:219)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
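For context on why the counting job succeeds while groupByKey fails: reduceByKey combines values map-side before the shuffle, while groupByKey ships every occurrence of a key to a single partition, so a key with ~200M occurrences puts enormous pressure on the shuffle buffers. A hedged PySpark sketch of the same counting step, with illustrative names:
{code}
from operator import add

def count_keys(rdd):
    # Map-side combining keeps per-key shuffle traffic small even when
    # the key distribution is heavily skewed.
    return rdd.map(lambda x: (x, 1)).reduceByKey(add)

def top_keys(rdd, n=5):
    # Illustrative skew check: the n most frequent strings with their counts.
    return count_keys(rdd).map(lambda kv: (kv[1], kv[0])).top(n)
{code}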
[jira] [Assigned] (SPARK-7897) Column with an unsigned bigint should be treated as DecimalType in JDBCRDD
[ https://issues.apache.org/jira/browse/SPARK-7897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7897: --- Assignee: (was: Apache Spark) Column with an unsigned bigint should be treated as DecimalType in JDBCRDD -- Key: SPARK-7897 URL: https://issues.apache.org/jira/browse/SPARK-7897 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7897) Column with an unsigned bigint should be treated as DecimalType in JDBCRDD
[ https://issues.apache.org/jira/browse/SPARK-7897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561015#comment-14561015 ] Apache Spark commented on SPARK-7897: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/6438 Column with an unsigned bigint should be treated as DecimalType in JDBCRDD -- Key: SPARK-7897 URL: https://issues.apache.org/jira/browse/SPARK-7897 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7697) Column with an unsigned int should be treated as long in JDBCRDD
[ https://issues.apache.org/jira/browse/SPARK-7697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561022#comment-14561022 ] Liang-Chi Hsieh commented on SPARK-7697: [~treffer] Thanks for reporting the problem. I opened another [ticket|https://issues.apache.org/jira/browse/SPARK-7897] and a PR for that. I will use DecimalType for unsigned bigint. It would be great if you could test it. Column with an unsigned int should be treated as long in JDBCRDD Key: SPARK-7697 URL: https://issues.apache.org/jira/browse/SPARK-7697 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: DAITO Teppei Assignee: Liang-Chi Hsieh Fix For: 1.4.0 Columns with an unsigned numeric type in JDBC should be treated as the next 'larger' Java type in JDBCRDD#getCatalystType. https://github.com/apache/spark/blob/517eb37a85e0a28820bcfd5d98c50d02df6521c6/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JDBCRDD.scala#L49
{code:title=q.sql}
create table t1 (id int unsigned);
insert into t1 values (4234567890);
{code}
{code:title=T1.scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object T1 {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf())
    val s = new SQLContext(sc)
    val url = "jdbc:mysql://localhost/test"
    val t1 = s.jdbc(url, "t1")
    t1.printSchema()
    t1.collect().foreach(println)
  }
}
{code}
This code caused an error like the one below.
{noformat}
15/05/18 11:39:51 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, xxx): com.mysql.jdbc.exceptions.jdbc4.MySQLDataException: '4.23456789E9' in column '1' is outside valid range for the datatype INTEGER.
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at com.mysql.jdbc.Util.handleNewInstance(Util.java:377)
    at com.mysql.jdbc.Util.getInstance(Util.java:360)
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:963)
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:935)
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:924)
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:870)
    at com.mysql.jdbc.ResultSetImpl.throwRangeException(ResultSetImpl.java:7090)
    at com.mysql.jdbc.ResultSetImpl.parseIntAsDouble(ResultSetImpl.java:6364)
    at com.mysql.jdbc.ResultSetImpl.getInt(ResultSetImpl.java:2484)
    at org.apache.spark.sql.jdbc.JDBCRDD$$anon$1.getNext(JDBCRDD.scala:344)
    at org.apache.spark.sql.jdbc.JDBCRDD$$anon$1.hasNext(JDBCRDD.scala:399)
    ...
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7897) Column with an unsigned bigint should be treated as DecimalType in JDBCRDD
[ https://issues.apache.org/jira/browse/SPARK-7897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7897: --- Assignee: Apache Spark Column with an unsigned bigint should be treated as DecimalType in JDBCRDD -- Key: SPARK-7897 URL: https://issues.apache.org/jira/browse/SPARK-7897 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7895) Move Kafka examples from scala-2.10/src to src
Shixiong Zhu created SPARK-7895: --- Summary: Move Kafka examples from scala-2.10/src to src Key: SPARK-7895 URL: https://issues.apache.org/jira/browse/SPARK-7895 Project: Spark Issue Type: Improvement Components: Examples, Streaming Reporter: Shixiong Zhu Since spark-streaming-kafka now is published for both Scala 2.10 and 2.11, we can move KafkaWordCount and DirectKafkaWordCount from examples/scala-2.10/src/ to examples/src/ so that they will appear in spark-examples-***-jar for Scala 2.11. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7896) IndexOutOfBoundsException in ChainedBuffer
[ https://issues.apache.org/jira/browse/SPARK-7896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun Ahuja updated SPARK-7896: -- Summary: IndexOutOfBoundsException in ChainedBuffer (was: IndexOutOfCountsException in ChainedBuffer) IndexOutOfBoundsException in ChainedBuffer -- Key: SPARK-7896 URL: https://issues.apache.org/jira/browse/SPARK-7896 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Arun Ahuja I've run into this on two tasks that use the same dataset. The dataset is a collection of strings where the most common string appears ~200M times and the next few appear ~50M times each. For this rdd: RDD[String], I can do rdd.map(x => (x, 1)).reduceByKey(_ + _) to get the counts (which is how I got the numbers above), but I hit the error on rdd.groupByKey(). Also, I have a second RDD of strings rdd2: RDD[String], and I cannot do rdd2.leftOuterJoin(rdd) without hitting this error:
{code}
15/05/26 23:27:55 WARN scheduler.TaskSetManager: Lost task 3169.1 in stage 5.0 (TID 4843, demeter-csmaz10-19.demeter.hpc.mssm.edu): java.lang.IndexOutOfBoundsException: 512
    at scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43)
    at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:47)
    at org.apache.spark.util.collection.ChainedBuffer.write(ChainedBuffer.scala:110)
    at org.apache.spark.util.collection.ChainedBufferOutputStream.write(ChainedBuffer.scala:141)
    at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
    at org.apache.spark.serializer.KryoSerializationStream.flush(KryoSerializer.scala:147)
    at org.apache.spark.util.collection.PartitionedSerializedPairBuffer.insert(PartitionedSerializedPairBuffer.scala:78)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:219)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7895) Move Kafka examples from scala-2.10/src to src
[ https://issues.apache.org/jira/browse/SPARK-7895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7895: --- Assignee: Apache Spark Move Kafka examples from scala-2.10/src to src -- Key: SPARK-7895 URL: https://issues.apache.org/jira/browse/SPARK-7895 Project: Spark Issue Type: Improvement Components: Examples, Streaming Reporter: Shixiong Zhu Assignee: Apache Spark Since spark-streaming-kafka now is published for both Scala 2.10 and 2.11, we can move KafkaWordCount and DirectKafkaWordCount from examples/scala-2.10/src/ to examples/src/ so that they will appear in spark-examples-***-jar for Scala 2.11. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7895) Move Kafka examples from scala-2.10/src to src
[ https://issues.apache.org/jira/browse/SPARK-7895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560937#comment-14560937 ] Apache Spark commented on SPARK-7895: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/6436 Move Kafka examples from scala-2.10/src to src -- Key: SPARK-7895 URL: https://issues.apache.org/jira/browse/SPARK-7895 Project: Spark Issue Type: Improvement Components: Examples, Streaming Reporter: Shixiong Zhu Since spark-streaming-kafka now is published for both Scala 2.10 and 2.11, we can move KafkaWordCount and DirectKafkaWordCount from examples/scala-2.10/src/ to examples/src/ so that they will appear in spark-examples-***-jar for Scala 2.11. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7895) Move Kafka examples from scala-2.10/src to src
[ https://issues.apache.org/jira/browse/SPARK-7895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7895: --- Assignee: (was: Apache Spark) Move Kafka examples from scala-2.10/src to src -- Key: SPARK-7895 URL: https://issues.apache.org/jira/browse/SPARK-7895 Project: Spark Issue Type: Improvement Components: Examples, Streaming Reporter: Shixiong Zhu Since spark-streaming-kafka now is published for both Scala 2.10 and 2.11, we can move KafkaWordCount and DirectKafkaWordCount from examples/scala-2.10/src/ to examples/src/ so that they will appear in spark-examples-***-jar for Scala 2.11. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7782) A small problem on history server webpage
[ https://issues.apache.org/jira/browse/SPARK-7782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7782: --- Assignee: (was: Apache Spark) A small problem on history server webpage - Key: SPARK-7782 URL: https://issues.apache.org/jira/browse/SPARK-7782 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.2.1, 1.3.1 Reporter: Xia Hu Priority: Minor Labels: starter A very small problem on the Spark history server webpage: we can click on each column header to sort the application list, for example by start time or completed time. But when the down arrow is shown, the rows are actually sorted in ascending order. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7806) spark-ec2 launch script fails for Python3
[ https://issues.apache.org/jira/browse/SPARK-7806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561390#comment-14561390 ] Shivaram Venkataraman commented on SPARK-7806: -- Merged https://github.com/mesos/spark-ec2/pull/117 which fixes the issue on the spark-ec2 side. spark-ec2 launch script fails for Python3 - Key: SPARK-7806 URL: https://issues.apache.org/jira/browse/SPARK-7806 Project: Spark Issue Type: Bug Components: EC2, PySpark Affects Versions: 1.3.1 Environment: All platforms. Reporter: Matthew Goodman Priority: Minor Depending on the options used, the spark-ec2 script will terminate ungracefully. Relevant buglets include:
- urlopen() returning bytes vs. string
- the floor-division change in the partition calculation
- the filter() iteration behavior change in the module calculation
I have a fixed version that I wish to contribute. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
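For readers unfamiliar with the Python 3 changes listed above, here is a hedged sketch of the kinds of fixes involved; the values and module names are illustrative, not the actual patch:
{code}
from urllib.request import urlopen  # Python 3 location of urlopen

def fetch_text(url):
    # urlopen().read() returns bytes in Python 3; decode before string handling.
    return urlopen(url).read().decode("utf-8")

# '/' became true division in Python 3; use '//' where an integer count is needed.
num_partitions = 10 // 3  # 3, not 3.333...

# filter() returns a lazy iterator in Python 3; materialize it before len() or reuse.
modules = list(filter(lambda m: m != "ganglia", ["spark", "ganglia", "tachyon"]))
{code}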
[jira] [Resolved] (SPARK-7806) spark-ec2 launch script fails for Python3
[ https://issues.apache.org/jira/browse/SPARK-7806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-7806. -- Resolution: Fixed Fix Version/s: 1.4.0 [~srowen] Could you add [~meawoppl] to the Developers group and assign this issue? spark-ec2 launch script fails for Python3 - Key: SPARK-7806 URL: https://issues.apache.org/jira/browse/SPARK-7806 Project: Spark Issue Type: Bug Components: EC2, PySpark Affects Versions: 1.3.1 Environment: All platforms. Reporter: Matthew Goodman Priority: Minor Fix For: 1.4.0 Depending on the options used, the spark-ec2 script will terminate ungracefully. Relevant buglets include:
- urlopen() returning bytes vs. string
- the floor-division change in the partition calculation
- the filter() iteration behavior change in the module calculation
I have a fixed version that I wish to contribute. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7550) Support setting the right schema serde when writing to Hive metastore
[ https://issues.apache.org/jira/browse/SPARK-7550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-7550: Assignee: Yin Huai Support setting the right schema serde when writing to Hive metastore --- Key: SPARK-7550 URL: https://issues.apache.org/jira/browse/SPARK-7550 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.0 Reporter: Reynold Xin Assignee: Yin Huai As of 1.4, Spark SQL does not properly set the table schema and serde when writing a table to Hive's metastore. Would be great to do that properly so users can use non-Spark SQL systems to read those tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7550) Support setting the right schema serde when writing to Hive metastore
[ https://issues.apache.org/jira/browse/SPARK-7550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-7550: Assignee: Cheng Hao (was: Yin Huai) Support setting the right schema serde when writing to Hive metastore --- Key: SPARK-7550 URL: https://issues.apache.org/jira/browse/SPARK-7550 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.0 Reporter: Reynold Xin Assignee: Cheng Hao As of 1.4, Spark SQL does not properly set the table schema and serde when writing a table to Hive's metastore. Would be great to do that properly so users can use non-Spark SQL systems to read those tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7550) Support setting the right schema serde when writing to Hive metastore
[ https://issues.apache.org/jira/browse/SPARK-7550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-7550: Shepherd: Yin Huai Support setting the right schema serde when writing to Hive metastore --- Key: SPARK-7550 URL: https://issues.apache.org/jira/browse/SPARK-7550 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.0 Reporter: Reynold Xin Assignee: Cheng Hao As of 1.4, Spark SQL does not properly set the table schema and serde when writing a table to Hive's metastore. Would be great to do that properly so users can use non-Spark SQL systems to read those tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7550) Support setting the right schema serde when writing to Hive metastore
[ https://issues.apache.org/jira/browse/SPARK-7550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561416#comment-14561416 ] Yin Huai commented on SPARK-7550: - [~chenghao] Will you have time to take a look at it? It is related to SPARK-6923. I think once we store the serde info using Hive's data structures, our CLI will work correctly. But for SPARK-6923, it also needs to handle data source tables that only have schema info in the table properties. Support setting the right schema serde when writing to Hive metastore --- Key: SPARK-7550 URL: https://issues.apache.org/jira/browse/SPARK-7550 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.0 Reporter: Reynold Xin Assignee: Cheng Hao As of 1.4, Spark SQL does not properly set the table schema and serde when writing a table to Hive's metastore. It would be great to do that properly so users can use non-Spark SQL systems to read those tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7899) PySpark sql/tests breaks pylint validation
[ https://issues.apache.org/jira/browse/SPARK-7899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561466#comment-14561466 ] Apache Spark commented on SPARK-7899: - User 'mnazario' has created a pull request for this issue: https://github.com/apache/spark/pull/6439 PySpark sql/tests breaks pylint validation -- Key: SPARK-7899 URL: https://issues.apache.org/jira/browse/SPARK-7899 Project: Spark Issue Type: Bug Components: PySpark, Tests Affects Versions: 1.4.0 Reporter: Michael Nazario The pyspark.sql.types module is dynamically renamed from {{_types}} to {{types}}, which breaks pylint validation. From [~justin.uang] below: In commit 04e44b37 (the migration to Python 3), {{pyspark/sql/types.py}} was renamed to {{pyspark/sql/\_types.py}}, and then some magic in {{pyspark/sql/\_\_init\_\_.py}} dynamically renamed the module back to {{types}}. I imagine that this is some naming conflict with Python 3, but what was the error that showed up? The reason I'm asking about this is that it's messing with pylint, since pylint can no longer statically find the module. I also tried importing the package so that {{\_\_init\_\_}} would be run in an init-hook, but that isn't what the discovery mechanism uses; I imagine it's probably just crawling the directory structure. One way to work around this would be something akin to this (http://stackoverflow.com/questions/9602811/how-to-tell-pylint-to-ignore-certain-imports), where I would have to create a fake module, but I would probably be missing a ton of pylint features for users of that module, and it's pretty hacky. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
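A hedged sketch of how such dynamic module aliasing typically works (simplified; the actual {{\_\_init\_\_.py}} may differ, and the {{_types}} import path is the assumption here):
{code}
import sys
from pyspark.sql import _types as types  # privately named module, per the report

# Re-expose the module under its public name so runtime imports keep working.
types.__name__ = "pyspark.sql.types"
sys.modules["pyspark.sql.types"] = types

# Static analyzers such as pylint crawl the filesystem rather than executing
# this package initializer, so they never see the alias; that is the reported problem.
{code}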
[jira] [Assigned] (SPARK-7899) PySpark sql/tests breaks pylint validation
[ https://issues.apache.org/jira/browse/SPARK-7899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7899: --- Assignee: (was: Apache Spark) PySpark sql/tests breaks pylint validation -- Key: SPARK-7899 URL: https://issues.apache.org/jira/browse/SPARK-7899 Project: Spark Issue Type: Bug Components: PySpark, Tests Affects Versions: 1.4.0 Reporter: Michael Nazario The pyspark.sql.types module is dynamically renamed from {{_types}} to {{types}}, which breaks pylint validation. From [~justin.uang] below: In commit 04e44b37 (the migration to Python 3), {{pyspark/sql/types.py}} was renamed to {{pyspark/sql/\_types.py}}, and then some magic in {{pyspark/sql/\_\_init\_\_.py}} dynamically renamed the module back to {{types}}. I imagine that this is some naming conflict with Python 3, but what was the error that showed up? The reason I'm asking about this is that it's messing with pylint, since pylint can no longer statically find the module. I also tried importing the package so that {{\_\_init\_\_}} would be run in an init-hook, but that isn't what the discovery mechanism uses; I imagine it's probably just crawling the directory structure. One way to work around this would be something akin to this (http://stackoverflow.com/questions/9602811/how-to-tell-pylint-to-ignore-certain-imports), where I would have to create a fake module, but I would probably be missing a ton of pylint features for users of that module, and it's pretty hacky. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7901) Attempt to request negative number of executors with dynamic allocation
Ryan Williams created SPARK-7901: Summary: Attempt to request negative number of executors with dynamic allocation Key: SPARK-7901 URL: https://issues.apache.org/jira/browse/SPARK-7901 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.3.1 Reporter: Ryan Williams I ran a {{spark-shell}} on YARN with dynamic allocation enabled; relevant params:
{code}
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.minExecutors=5 \
--conf spark.dynamicAllocation.maxExecutors=300 \
--conf spark.dynamicAllocation.schedulerBacklogTimeout=3 \
--conf spark.dynamicAllocation.executorIdleTimeout=300 \
{code}
It started out with 5 executors, went up to 300 when I ran a job, and then killed them all back down to 5 executors after 5 minutes of idle time; all working as intended. When I ran another job, it tried to request -187 executors:
{code}
15/05/27 17:41:12 ERROR util.Utils: Uncaught exception in thread spark-dynamic-executor-allocation-0
java.lang.IllegalArgumentException: Attempted to request a negative number of executor(s) -187 from the cluster manager. Please specify a positive number!
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:338)
    at org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1137)
    at org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:294)
    at org.apache.spark.ExecutorAllocationManager.addOrCancelExecutorRequests(ExecutorAllocationManager.scala:263)
    at org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:230)
    at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply$mcV$sp(ExecutorAllocationManager.scala:189)
    at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189)
    at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618)
    at org.apache.spark.ExecutorAllocationManager$$anon$1.run(ExecutorAllocationManager.scala:189)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
{code}
Now it seems like I'm stuck with 5 executors in this application, as some internal state is corrupt. [This dropbox folder|https://www.dropbox.com/sh/36slqgyll8nwxrk/AACPMc9UbKRY7SieR9bCXPJCa?dl=0] has the stdout from my console, including the -187 error above, as well as the eventlog for this application. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7901) Attempt to request negative number of executors with dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-7901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561484#comment-14561484 ] Ryan Williams commented on SPARK-7901: -- Looks like a dupe of [SPARK-6954|https://issues.apache.org/jira/browse/SPARK-6954]… Attempt to request negative number of executors with dynamic allocation --- Key: SPARK-7901 URL: https://issues.apache.org/jira/browse/SPARK-7901 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.3.1 Reporter: Ryan Williams I ran a {{spark-shell}} on YARN with dynamic allocation enabled; relevant params:
{code}
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.minExecutors=5 \
--conf spark.dynamicAllocation.maxExecutors=300 \
--conf spark.dynamicAllocation.schedulerBacklogTimeout=3 \
--conf spark.dynamicAllocation.executorIdleTimeout=300 \
{code}
It started out with 5 executors, went up to 300 when I ran a job, and then killed them all back down to 5 executors after 5 minutes of idle time; all working as intended. When I ran another job, it tried to request -187 executors:
{code}
15/05/27 17:41:12 ERROR util.Utils: Uncaught exception in thread spark-dynamic-executor-allocation-0
java.lang.IllegalArgumentException: Attempted to request a negative number of executor(s) -187 from the cluster manager. Please specify a positive number!
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:338)
    at org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1137)
    at org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:294)
    at org.apache.spark.ExecutorAllocationManager.addOrCancelExecutorRequests(ExecutorAllocationManager.scala:263)
    at org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:230)
    at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply$mcV$sp(ExecutorAllocationManager.scala:189)
    at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189)
    at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618)
    at org.apache.spark.ExecutorAllocationManager$$anon$1.run(ExecutorAllocationManager.scala:189)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
{code}
Now it seems like I'm stuck with 5 executors in this application, as some internal state is corrupt. [This dropbox folder|https://www.dropbox.com/sh/36slqgyll8nwxrk/AACPMc9UbKRY7SieR9bCXPJCa?dl=0] has the stdout from my console, including the -187 error above, as well as the eventlog for this application. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7901) Attempt to request negative number of executors with dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-7901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Williams resolved SPARK-7901. -- Resolution: Duplicate Attempt to request negative number of executors with dynamic allocation --- Key: SPARK-7901 URL: https://issues.apache.org/jira/browse/SPARK-7901 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.3.1 Reporter: Ryan Williams -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3379) Implement 'POWER' for sql
[ https://issues.apache.org/jira/browse/SPARK-3379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-3379: Target Version/s: (was: 1.4.0) Implement 'POWER' for sql - Key: SPARK-3379 URL: https://issues.apache.org/jira/browse/SPARK-3379 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.2, 1.1.0 Environment: All Reporter: Xinyun Huang Priority: Minor Original Estimate: 0h Remaining Estimate: 0h Add support for the mathematical function POWER within Spark SQL. Split from SPARK-3176 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
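For illustration, a minimal PySpark sketch of the intended usage once POWER lands (assumptions: the function takes the HiveQL-style name and semantics, i.e. POWER(x, y) = x ** y; the table and column names are made up): {code} from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="power_demo")
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame([(2.0, 10.0)], ["base", "exp"])
df.registerTempTable("t")
# Expected once implemented: POWER(base, exp) evaluates to base ** exp
print(sqlContext.sql("SELECT POWER(base, exp) FROM t").collect())  # ~ [Row(...=1024.0)] {code}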
[jira] [Commented] (SPARK-7896) IndexOutOfBoundsException in ChainedBuffer
[ https://issues.apache.org/jira/browse/SPARK-7896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561513#comment-14561513 ] Apache Spark commented on SPARK-7896: - User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/6440 IndexOutOfBoundsException in ChainedBuffer -- Key: SPARK-7896 URL: https://issues.apache.org/jira/browse/SPARK-7896 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Arun Ahuja Assignee: Sandy Ryza Priority: Blocker I've run into this on two tasks that use the same dataset. The dataset is a collection of strings where the most common string appears ~200M times and the next few appear ~50M times each. For this rdd: RDD[String], I can do rdd.map( x => (x, 1)).reduceByKey( _ + _) to get the counts (how I got the numbers above), but I hit the error on rdd.groupByKey(). Also, I have a second RDD of strings rdd2: RDD[String] and I cannot do rdd2.leftOuterJoin(rdd) without hitting this error {code} 15/05/26 23:27:55 WARN scheduler.TaskSetManager: Lost task 3169.1 in stage 5.0 (TID 4843, demeter-csmaz10-19.demeter.hpc.mssm.edu): java.lang.IndexOutOfBoundsException: 512 at scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43) at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:47) at org.apache.spark.util.collection.ChainedBuffer.write(ChainedBuffer.scala:110) at org.apache.spark.util.collection.ChainedBufferOutputStream.write(ChainedBuffer.scala:141) at com.esotericsoftware.kryo.io.Output.flush(Output.java:155) at org.apache.spark.serializer.KryoSerializationStream.flush(KryoSerializer.scala:147) at org.apache.spark.util.collection.PartitionedSerializedPairBuffer.insert(PartitionedSerializedPairBuffer.scala:78) at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:219) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6690) spark-sql script ends up throwing Exception when event logging is enabled.
[ https://issues.apache.org/jira/browse/SPARK-6690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561522#comment-14561522 ] Yin Huai commented on SPARK-6690: - Per https://github.com/apache/spark/pull/5341, https://github.com/apache/spark/pull/5560 addressed this issue. spark-sql script ends up throwing Exception when event logging is enabled. -- Key: SPARK-6690 URL: https://issues.apache.org/jira/browse/SPARK-6690 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Kousuke Saruta Priority: Minor When event logging is enabled, the spark-sql script ends up throwing an Exception like the following. {code} 15/04/03 13:51:49 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/jobs,null} 15/04/03 13:51:49 ERROR scheduler.LiveListenerBus: Listener EventLoggingListener threw an exception java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144) at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:144) at org.apache.spark.scheduler.EventLoggingListener.onApplicationEnd(EventLoggingListener.scala:188) at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:54) at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:53) at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37) at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:79) at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1171) at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63) Caused by: java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707) at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:1843) at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1804) at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:127) ... 17 more 15/04/03 13:51:49 INFO ui.SparkUI: Stopped Spark web UI at http://sarutak-devel:4040 15/04/03 13:51:49 INFO scheduler.DAGScheduler: Stopping DAGScheduler Exception in thread "Thread-6" java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707) at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1760) at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1124) at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120) at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1398) at org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:209) at org.apache.spark.SparkContext$$anonfun$stop$3.apply(SparkContext.scala:1408) at org.apache.spark.SparkContext$$anonfun$stop$3.apply(SparkContext.scala:1408) at scala.Option.foreach(Option.scala:236) at org.apache.spark.SparkContext.stop(SparkContext.scala:1408) at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.stop(SparkSQLEnv.scala:66) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$$anon$1.run(SparkSQLCLIDriver.scala:107) {code} This is because FileSystem#close is called by the shutdown hook registered in SparkSQLCLIDriver. {code} Runtime.getRuntime.addShutdownHook( new Thread() { override def run() { SparkSQLEnv.stop() } } ) {code} This issue was resolved by SPARK-3062, but I think it was brought back by SPARK-2261. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6690) spark-sql script ends up throwing Exception when event logging is enabled.
[ https://issues.apache.org/jira/browse/SPARK-6690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-6690. - Resolution: Fixed Fix Version/s: 1.4.0 spark-sql script ends up throwing Exception when event logging is enabled. -- Key: SPARK-6690 URL: https://issues.apache.org/jira/browse/SPARK-6690 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Kousuke Saruta Priority: Minor Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7896) IndexOutOfBoundsException in ChainedBuffer
[ https://issues.apache.org/jira/browse/SPARK-7896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7896: --- Assignee: Apache Spark (was: Sandy Ryza) IndexOutOfBoundsException in ChainedBuffer -- Key: SPARK-7896 URL: https://issues.apache.org/jira/browse/SPARK-7896 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Arun Ahuja Assignee: Apache Spark Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4119) Don't rely on HIVE_DEV_HOME to find .q files
[ https://issues.apache.org/jira/browse/SPARK-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-4119: Target Version/s: (was: 1.4.0) Don't rely on HIVE_DEV_HOME to find .q files Key: SPARK-4119 URL: https://issues.apache.org/jira/browse/SPARK-4119 Project: Spark Issue Type: Test Components: SQL Affects Versions: 1.1.1 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Minor After merging in Hive 0.13.1 support, a bunch of .q files and golden answer files got updated. Unfortunately, some .q files were also updated in Hive itself; for example, an ORDER BY clause was added to groupby1_limit.q for a bug fix. With HIVE_DEV_HOME set, developers working on Hive 0.12.0 may end up with false test failures, because .q files are looked up from HIVE_DEV_HOME and outdated .q files are used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4782) Add inferSchema support for RDD[Map[String, Any]]
[ https://issues.apache.org/jira/browse/SPARK-4782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-4782: Target Version/s: (was: 1.4.0) Add inferSchema support for RDD[Map[String, Any]] - Key: SPARK-4782 URL: https://issues.apache.org/jira/browse/SPARK-4782 Project: Spark Issue Type: Improvement Components: SQL Reporter: Jianshi Huang Priority: Minor The best way to convert RDD[Map[String, Any]] to SchemaRDD currently seems to be converting each Map to a JSON String first and using JsonRDD.inferSchema on it, which is very inefficient. Instead of JsonRDD, RDD[Map[String, Any]] is a better common denominator for schemaless data, as adding a Map-like interface to any serialization format is easy. So please add inferSchema support to RDD[Map[String, Any]]. *Then, for any new serialization format we want to support, we just need to add a Map interface wrapper to it.* Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
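For concreteness, the workaround described above looks roughly like this in PySpark (a sketch; assumes a live sc and sqlContext and the SQLContext.jsonRDD API as it exists in 1.3/1.4): {code} import json
# The current detour: dict -> JSON string -> schema inference, paying a full
# serialization round-trip just to discover the schema.
rdd = sc.parallelize([{"a": 1, "b": "x"}, {"a": 2, "b": "y"}])
df = sqlContext.jsonRDD(rdd.map(json.dumps))
df.printSchema() {code}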
[jira] [Commented] (SPARK-4119) Don't rely on HIVE_DEV_HOME to find .q files
[ https://issues.apache.org/jira/browse/SPARK-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561525#comment-14561525 ] Yin Huai commented on SPARK-4119: - [~lian cheng] Feel free to re-target it. Don't rely on HIVE_DEV_HOME to find .q files Key: SPARK-4119 URL: https://issues.apache.org/jira/browse/SPARK-4119 Project: Spark Issue Type: Test Components: SQL Affects Versions: 1.1.1 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7902) SQL UDF doesn't support UDT in PySpark
Xiangrui Meng created SPARK-7902: Summary: SQL UDF doesn't support UDT in PySpark Key: SPARK-7902 URL: https://issues.apache.org/jira/browse/SPARK-7902 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.4.0 Reporter: Xiangrui Meng We don't convert Python SQL internal types to Python types in SQL UDF execution. This causes problems if the input arguments contain UDTs or the return type is a UDT. Right now, the raw SQL types are passed into the Python UDF and the return value is not converted to Python SQL types. This is the code to produce this bug. (Actually, it triggers another bug first right now.) {code} from pyspark.mllib.linalg import SparseVector from pyspark.sql.functions import udf from pyspark.sql.types import IntegerType df = sqlContext.createDataFrame([(SparseVector(2, {0: 0.0}),)], ["features"]) sz = udf(lambda s: s.size, IntegerType()) df.select(sz(df.features).alias("sz")).collect() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6467) Override QueryPlan.missingInput when necessary and rely on it CheckAnalysis
[ https://issues.apache.org/jira/browse/SPARK-6467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561529#comment-14561529 ] Yin Huai commented on SPARK-6467: - Probably {{Generate}} should override it? It seems to be the cause of some wrong analysis error messages (like {{abc is not in col1, col2, abc, col3}}). Override QueryPlan.missingInput when necessary and rely on it CheckAnalysis --- Key: SPARK-6467 URL: https://issues.apache.org/jira/browse/SPARK-6467 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Priority: Minor Currently, some LogicalPlans do not override missingInput, but they should. The lack of proper missingInput implementations then leaks into CheckAnalysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6467) Override QueryPlan.missingInput when necessary and rely on it CheckAnalysis
[ https://issues.apache.org/jira/browse/SPARK-6467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6467: Priority: Major (was: Minor) Override QueryPlan.missingInput when necessary and rely on it CheckAnalysis --- Key: SPARK-6467 URL: https://issues.apache.org/jira/browse/SPARK-6467 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7550) Support setting the right schema serde when writing to Hive metastore
[ https://issues.apache.org/jira/browse/SPARK-7550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561541#comment-14561541 ] Yin Huai commented on SPARK-7550: - I think it will also address https://issues.apache.org/jira/browse/SPARK-6413. Support setting the right schema serde when writing to Hive metastore --- Key: SPARK-7550 URL: https://issues.apache.org/jira/browse/SPARK-7550 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.0 Reporter: Reynold Xin Assignee: Cheng Hao As of 1.4, Spark SQL does not properly set the table schema and serde when writing a table to Hive's metastore. Would be great to do that properly so users can use non-Spark SQL systems to read those tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6413) For data source tables, we should provide better output for DESCRIBE FORMATTED
[ https://issues.apache.org/jira/browse/SPARK-6413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6413: Target Version/s: 1.5.0 (was: 1.4.0) For data source tables, we should provide better output for DESCRIBE FORMATTED -- Key: SPARK-6413 URL: https://issues.apache.org/jira/browse/SPARK-6413 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Priority: Minor Right now, we show Hive-specific metadata such as the SerDe. Users will be confused when they see the output of DESCRIBE FORMATTED (it is a Hive native command for now) and think the table is not stored in the right format, when in fact the table is stored correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4684) Add a script to run JDBC server on Windows
[ https://issues.apache.org/jira/browse/SPARK-4684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-4684: Target Version/s: 1.5.0 (was: 1.4.0) Add a script to run JDBC server on Windows -- Key: SPARK-4684 URL: https://issues.apache.org/jira/browse/SPARK-4684 Project: Spark Issue Type: New Feature Components: SQL Reporter: Matei Zaharia Assignee: Cheng Lian Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7899) PySpark sql/tests breaks pylint validation
[ https://issues.apache.org/jira/browse/SPARK-7899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561164#comment-14561164 ] Michael Nazario commented on SPARK-7899: The problem is that pyspark/sql/types conflicts with the built-in Python 3 types module, which causes tests to fail. The Python documentation (https://docs.python.org/3/using/cmdline.html#interface-options) says that when you call python path/to/script.py, the directory containing the script is automatically prepended to sys.path. This is what causes the conflict with the built-in Python 3 types module. You can fix this by running the pyspark tests with -m instead, since that resolves a module by name on sys.path and does not add the script's directory to the Python path; see the illustration below. PySpark sql/tests breaks pylint validation -- Key: SPARK-7899 URL: https://issues.apache.org/jira/browse/SPARK-7899 Project: Spark Issue Type: Bug Components: PySpark, Tests Affects Versions: 1.4.0 Reporter: Michael Nazario -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
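To make the sys.path difference concrete, a small illustration (the invocations and paths are hypothetical; the print at the end is just a quick way to see which module wins): {code} # Hypothetical invocations:
#   python python/pyspark/sql/tests.py  # prepends 'python/pyspark/sql' to sys.path,
#                                       # so 'import types' can resolve to pyspark's module
#   python -m pyspark.sql.tests         # resolved by module name; the script's
#                                       # directory is not prepended to sys.path
import types
# If this prints a path under pyspark/sql, the stdlib module is being shadowed.
print(getattr(types, "__file__", "built-in")) {code}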
[jira] [Commented] (SPARK-7536) Audit MLlib Python API for 1.4
[ https://issues.apache.org/jira/browse/SPARK-7536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561135#comment-14561135 ] Yanbo Liang commented on SPARK-7536: [~josephkb] Yes, the 3 main sub-tasks are in progress. I will submit the completed parts ASAP. I know the 3 main sub-tasks listed above are tied to the 1.4 release, so I will try to finish them ASAP. Audit MLlib Python API for 1.4 -- Key: SPARK-7536 URL: https://issues.apache.org/jira/browse/SPARK-7536 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Joseph K. Bradley Assignee: Yanbo Liang For new public APIs added to MLlib, we need to check the generated HTML doc and compare the Scala and Python versions. We need to track: * Inconsistency: Do class/method/parameter names match? SPARK-7667 * Docs: Is the Python doc missing or just a stub? We want the Python doc to be as complete as the Scala doc. SPARK-7666 * API breaking changes: These should be very rare but are occasionally either necessary (intentional) or accidental. These must be recorded and added in the Migration Guide for this release. SPARK-7665 ** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, please note that as well. * Missing classes/methods/parameters: We should create to-do JIRAs for functionality missing from Python. ** classification *** StreamingLogisticRegressionWithSGD SPARK-7633 ** clustering *** GaussianMixture SPARK-6258 *** LDA SPARK-6259 *** Power Iteration Clustering SPARK-5962 *** StreamingKMeans SPARK-4118 ** evaluation *** MultilabelMetrics SPARK-6094 ** feature *** ElementwiseProduct SPARK-7605 *** PCA SPARK-7604 ** linalg *** Distributed linear algebra SPARK-6100 ** pmml.export SPARK-7638 ** regression *** StreamingLinearRegressionWithSGD SPARK-4127 ** stat *** KernelDensity SPARK-7639 ** util *** MLUtils SPARK-6263 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7899) PySpark sql/tests breaks pylint validation
Michael Nazario created SPARK-7899: -- Summary: PySpark sql/tests breaks pylint validation Key: SPARK-7899 URL: https://issues.apache.org/jira/browse/SPARK-7899 Project: Spark Issue Type: Bug Components: PySpark, Tests Affects Versions: 1.4.0 Reporter: Michael Nazario The pyspark.sql.types module is dynamically renamed from _types to types, which messes up pylint validation. From [~justin.uang] below: In commit 04e44b37 (the migration to Python 3), pyspark/sql/types.py was renamed to pyspark/sql/_types.py, and then some magic in pyspark/sql/__init__.py dynamically renamed the module back to types. I imagine that this is some naming conflict with Python 3, but what was the error that showed up? The reason I'm asking is that it's messing with pylint, since pylint now cannot statically find the module. I also tried importing the package so that __init__ would be run in an init-hook, but that isn't what the discovery mechanism is using. I imagine it's probably just crawling the directory structure. One way to work around this would be something akin to this (http://stackoverflow.com/questions/9602811/how-to-tell-pylint-to-ignore-certain-imports), where I would have to create a fake module, but I would probably be missing a ton of pylint features for users of that module, and it's pretty hacky. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
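For context, the dynamic rename described above amounts to something like the following (a sketch of the technique only, not the verbatim pyspark/sql/__init__.py code): {code} # pyspark/sql/__init__.py (sketch): import the implementation from _types.py
# and re-register it in sys.modules under the public name 'types', so
# 'from pyspark.sql import types' keeps working while no file named types.py
# shadows the Python 3 stdlib 'types' module.
import sys
from . import _types as types
types.__name__ = __name__ + ".types"
sys.modules[types.__name__] = types {code}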
[jira] [Created] (SPARK-7900) Reduce number of tagging calls in spark-ec2
Nicholas Chammas created SPARK-7900: --- Summary: Reduce number of tagging calls in spark-ec2 Key: SPARK-7900 URL: https://issues.apache.org/jira/browse/SPARK-7900 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.4.0 Reporter: Nicholas Chammas Priority: Minor spark-ec2 currently tags each instance with its own name: https://github.com/apache/spark/blob/4615081d7a10b023491e25478d19b8161e030974/ec2/spark_ec2.py#L684-L692 Quite often, one of these tagging calls will fail: {code} Launching instances... Launched 10 slaves in us-west-2a, regid = r-89656e83 Launched master in us-west-2a, regid = r-07646f0d Waiting for AWS to propagate instance metadata... Traceback (most recent call last): File "../spark/ec2/spark_ec2.py", line 1395, in <module> main() File "../spark/ec2/spark_ec2.py", line 1387, in main real_main() File "../spark/ec2/spark_ec2.py", line 1222, in real_main (master_nodes, slave_nodes) = launch_cluster(conn, opts, cluster_name) File "../spark/ec2/spark_ec2.py", line 667, in launch_cluster value='{cn}-slave-{iid}'.format(cn=cluster_name, iid=slave.id)) File "/path/spark/ec2/lib/boto-2.34.0/boto/ec2/ec2object.py", line 80, in add_tag self.add_tags({key: value}, dry_run) File "/path/spark/ec2/lib/boto-2.34.0/boto/ec2/ec2object.py", line 97, in add_tags dry_run=dry_run File "/path/spark/ec2/lib/boto-2.34.0/boto/ec2/connection.py", line 4202, in create_tags return self.get_status('CreateTags', params, verb='POST') File "/path/spark/ec2/lib/boto-2.34.0/boto/connection.py", line 1223, in get_status raise self.ResponseError(response.status, response.reason, body) boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request <?xml version="1.0" encoding="UTF-8"?> <Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The instance ID 'i-d3b72524' does not exist</Message></Error></Errors><RequestID>f0936ab5-4d10-46d1-a35d-cefaf8a68adc</RequestID></Response> {code} This is presumably a problem with AWS metadata taking time to become available on all the servers that spark-ec2 hits as it makes the several tagging calls. Instead of retrying the tagging calls, we should just reduce them to 2 calls--one for the master, one for the slaves. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
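The proposed batching could look roughly like this with boto 2's EC2Connection.create_tags, which tags many resources in one API call (a sketch: variable names follow the spark_ec2.py snippet above, and note that batching drops the per-instance {iid} suffix used today): {code} # One CreateTags call for all slaves and one for the master, instead of one
# add_tag call per instance, leaving far fewer calls to race against
# metadata propagation.
conn.create_tags([i.id for i in slave_nodes],
                 {'Name': '{cn}-slave'.format(cn=cluster_name)})
conn.create_tags([i.id for i in master_nodes],
                 {'Name': '{cn}-master'.format(cn=cluster_name)}) {code}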
[jira] [Created] (SPARK-7898) pyspark merges stderr into stdout
Sam Steingold created SPARK-7898: Summary: pyspark merges stderr into stdout Key: SPARK-7898 URL: https://issues.apache.org/jira/browse/SPARK-7898 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.0 Reporter: Sam Steingold When I type {code} hadoop fs -text /foo/bar/baz.bz2 2>err 1>out {code} I get two non-empty files: {{err}} with {code} 2015-05-26 15:33:49,786 INFO [main] bzip2.Bzip2Factory (Bzip2Factory.java:isNativeBzip2Loaded(70)) - Successfully loaded & initialized native-bzip2 library system-native 2015-05-26 15:33:49,789 INFO [main] compress.CodecPool (CodecPool.java:getDecompressor(179)) - Got brand-new decompressor [.bz2] {code} and {{out}} with the content of the file (as expected). When I call the same command from Python (2.6): {code} from subprocess import Popen with open("out", "w") as out: with open("err", "w") as err: p = Popen(['hadoop', 'fs', '-text', '/foo/bar/baz.bz2'], stdin=None, stdout=out, stderr=err) print p.wait() {code} I get the exact same (correct) behavior. *However*, when I run the same code under *PySpark* (or using {{spark-submit}}), I get an *empty* {{err}} file and the {{out}} file starts with the log messages above (and then it contains the actual data). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
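As a baseline for triaging, a self-contained sketch (the child command is hypothetical) showing the separation one would expect from Popen with distinct stdout/stderr handles; per the report above, the same pattern under pyspark lands the stderr text in {{out}} instead: {code} from subprocess import Popen

# The child writes one marker byte to each stream; with separate handles the
# markers must land in separate files.
child = ["python", "-c",
         "import sys; sys.stdout.write('O'); sys.stderr.write('E')"]
with open("out", "w") as out, open("err", "w") as err:
    Popen(child, stdout=out, stderr=err).wait()
print(open("out").read(), open("err").read())  # expected: O E {code}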