[jira] [Updated] (SPARK-3364) Zip equal-length but unequally-partition
[ https://issues.apache.org/jira/browse/SPARK-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-3364: --- Affects Version/s: 1.0.2 Zip equal-length but unequally-partition Key: SPARK-3364 URL: https://issues.apache.org/jira/browse/SPARK-3364 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: Kevin Jung ZippedRDD loses some elements after zipping RDDs with equal numbers of partitions but unequal numbers of elements in each of their partitions. This can happen when a user creates an RDD with sc.textFile(path, partitionNumbers) on a physically unbalanced HDFS file.
{noformat}
var x = sc.parallelize(1 to 9, 3)
var y = sc.parallelize(Array(1,1,1,1,1,2,2,3,3), 3).keyBy(i => i)
var z = y.partitionBy(new RangePartitioner(3, y))

expected
x.zip(y).count()
9
x.zip(y).collect()
Array[(Int, (Int, Int))] = Array((1,(1,1)), (2,(1,1)), (3,(1,1)), (4,(1,1)), (5,(1,1)), (6,(2,2)), (7,(2,2)), (8,(3,3)), (9,(3,3)))

unexpected
x.zip(z).count()
7
x.zip(z).collect()
Array[(Int, (Int, Int))] = Array((1,(1,1)), (2,(1,1)), (3,(1,1)), (4,(2,2)), (5,(2,2)), (7,(3,3)), (8,(3,3)))
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
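A possible workaround (not from the report; variable names follow the snippet above): until the fix, an RDD pair whose partition-level alignment cannot be trusted can be paired through an explicit index instead of zip, at the cost of a shuffle. A minimal sketch:
{code}
// Hypothetical workaround sketch: pair elements by an explicit index rather than by
// per-partition position. (In a compiled job, import org.apache.spark.SparkContext._
// for the pair-RDD operations; the shell imports them automatically.)
val xi = x.zipWithIndex().map { case (v, i) => (i, v) }
val zi = z.zipWithIndex().map { case (v, i) => (i, v) }
val paired = xi.join(zi).sortByKey().values   // plays the role of x.zip(z) and should retain all 9 elements
{code}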
[jira] [Commented] (SPARK-3364) Zip equal-length but unequally-partition
[ https://issues.apache.org/jira/browse/SPARK-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119414#comment-14119414 ] Guoqiang Li commented on SPARK-3364: This bug has been fixed in 1.1.0. Zip equal-length but unequally-partition Key: SPARK-3364 URL: https://issues.apache.org/jira/browse/SPARK-3364 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: Kevin Jung -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3195) Can you add some statistics to do logistic regression better in mllib?
[ https://issues.apache.org/jira/browse/SPARK-3195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3195: - Affects Version/s: (was: 1.3.0) Can you add some statistics to do logistic regression better in mllib? -- Key: SPARK-3195 URL: https://issues.apache.org/jira/browse/SPARK-3195 Project: Spark Issue Type: New Feature Components: MLlib Reporter: miumiu Labels: test Fix For: 1.3.0 Original Estimate: 1m Remaining Estimate: 1m Hi, in logistic regression practice, tests of the regression coefficients and of the overall model fit are very important. Can you add effective support for these aspects? For example, the likelihood ratio test or the Wald test is often used to test coefficients, and the Hosmer-Lemeshow test is used to evaluate model fit. We already have ROC and Precision-Recall, but could you also provide the KS statistic, which is widely used for model evaluation? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
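As context for the KS request, a minimal illustrative sketch (not an existing MLlib API; the helper name and the driver-side computation are assumptions): the KS statistic is the maximum gap between the cumulative true-positive and false-positive rates as the score threshold varies.
{code}
// Hypothetical sketch, not an MLlib API: KS = max |TPR - FPR| over score thresholds,
// computed from (score, label) pairs with label in {0.0, 1.0}; assumes both classes are present.
def ksStatistic(scoreAndLabel: Seq[(Double, Double)]): Double = {
  val sorted = scoreAndLabel.sortBy(-_._1)            // descending by predicted score
  val totalPos = sorted.count(_._2 == 1.0).toDouble
  val totalNeg = sorted.size - totalPos
  var tp = 0.0
  var fp = 0.0
  var ks = 0.0
  for ((_, label) <- sorted) {
    if (label == 1.0) tp += 1 else fp += 1
    ks = math.max(ks, math.abs(tp / totalPos - fp / totalNeg))
  }
  ks
}
{code}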
[jira] [Closed] (SPARK-3344) Reformat code: add blank lines
[ https://issues.apache.org/jira/browse/SPARK-3344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma closed SPARK-3344. -- Resolution: Invalid Reformat code: add blank lines -- Key: SPARK-3344 URL: https://issues.apache.org/jira/browse/SPARK-3344 Project: Spark Issue Type: Test Components: Deploy Reporter: WangTaoTheTonic Priority: Trivial There should be blank lines between test cases, so do some code reformatting. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3219) K-Means clusterer should support Bregman distance functions
[ https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119422#comment-14119422 ] Derrick Burns commented on SPARK-3219: -- The current implementation supports one concrete representation of points, centers, and centroids. However, in order for arbitrary distance functions to be supported efficiently, the clusterer needs abstractions for point, center, and centroid that are defined along with the distance function. K-Means clusterer should support Bregman distance functions --- Key: SPARK-3219 URL: https://issues.apache.org/jira/browse/SPARK-3219 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Derrick Burns Assignee: Derrick Burns The K-Means clusterer supports the Euclidean distance metric. However, it is rather straightforward to support Bregman (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) distance functions, which would increase the utility of the clusterer tremendously. I have modified the clusterer to support pluggable distance functions. However, I notice that there are hundreds of outstanding pull requests. If someone is willing to work with me to sponsor the work through the process, I will create a pull request. Otherwise, I will just keep my own fork. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
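For reference, the Bregman divergence induced by a strictly convex function F is D_F(x, y) = F(x) - F(y) - <grad F(y), x - y>. Below is a minimal sketch of what a pluggable divergence abstraction could look like; the trait and object names are assumptions, not the proposed MLlib interface.
{code}
// Hypothetical sketch of a pluggable Bregman divergence, not the actual MLlib interface.
trait BregmanDivergence extends Serializable {
  def F(v: Array[Double]): Double                // the strictly convex generator
  def gradF(v: Array[Double]): Array[Double]     // its gradient
  def divergence(x: Array[Double], y: Array[Double]): Double = {
    val g = gradF(y)
    F(x) - F(y) - x.indices.map(i => g(i) * (x(i) - y(i))).sum
  }
}

// Squared Euclidean distance is recovered with F(v) = ||v||^2.
object SquaredEuclidean extends BregmanDivergence {
  def F(v: Array[Double]): Double = v.map(x => x * x).sum
  def gradF(v: Array[Double]): Array[Double] = v.map(2 * _)
}
{code}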
[jira] [Created] (SPARK-3365) Failure to save Lists to Parquet
Michael Armbrust created SPARK-3365: --- Summary: Failure to save Lists to Parquet Key: SPARK-3365 URL: https://issues.apache.org/jira/browse/SPARK-3365 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: Michael Armbrust Reproduction, same works if type is Seq. (props to [~chrisgrier] for finding this)
{code}
scala> case class Test(x: List[String])
defined class Test

scala> sparkContext.parallelize(Test(List()) :: Nil).saveAsParquetFile("bug")
23:09:51.807 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.ArithmeticException: / by zero
  at parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:99)
  at parquet.hadoop.InternalParquetRecordWriter.<init>(InternalParquetRecordWriter.java:92)
  at parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:64)
  at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
  at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
  at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300)
  at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
  at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
  at org.apache.spark.scheduler.Task.run(Task.scala:54)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:744)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3366) Compute best splits distributively in decision tree
Xiangrui Meng created SPARK-3366: Summary: Compute best splits distributively in decision tree Key: SPARK-3366 URL: https://issues.apache.org/jira/browse/SPARK-3366 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng The current implementation computes all best splits locally on the driver, which makes the driver a bottleneck for both communication and computation. It would be nice if we could compute the best splits distributively. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
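A minimal sketch of the general idea (the names and the shape of the statistics RDD are assumptions, not the actual decision tree implementation): keep the per-node split statistics on the cluster and reduce the pick-the-best-split step there, so only one small record per tree node reaches the driver.
{code}
// Hypothetical sketch: choose the best split per tree node distributively.
import org.apache.spark.SparkContext._   // pair-RDD functions such as reduceByKey (pre-1.3 style)
import org.apache.spark.rdd.RDD

// stats: (nodeId, (splitId, impurityGain)), produced by per-partition aggregation.
def bestSplits(stats: RDD[(Int, (Int, Double))]): Map[Int, (Int, Double)] =
  stats
    .reduceByKey((a, b) => if (a._2 >= b._2) a else b)  // best split per node, computed on the executors
    .collectAsMap()                                     // only one small record per node is sent to the driver
    .toMap
{code}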
[jira] [Resolved] (SPARK-2907) Use mutable.HashMap to represent Model in Word2Vec
[ https://issues.apache.org/jira/browse/SPARK-2907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-2907. -- Resolution: Fixed Use mutable.HashMap to represent Model in Word2Vec -- Key: SPARK-2907 URL: https://issues.apache.org/jira/browse/SPARK-2907 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.1.0 Reporter: Liquan Pei Assignee: Liquan Pei Use mutable.HashMap to represent the Word2Vec model to reduce memory footprint and shuffle size. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3367) Remove spark.shuffle.spill.compress (replace it with existing spark.shuffle.compress)
Reynold Xin created SPARK-3367: -- Summary: Remove spark.shuffle.spill.compress (replace it with existing spark.shuffle.compress) Key: SPARK-3367 URL: https://issues.apache.org/jira/browse/SPARK-3367 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin See the discussion at: https://github.com/apache/spark/pull/2178#issuecomment-54214760 We have both spark.shuffle.spill.compress and spark.shuffle.compress. It is very unlikely a user would want the two to be different. We should just remove the spill.compress option. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
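For context, a short sketch of the two settings as they stand today (both default to true per the configuration docs; treat the values as illustrative); the proposal is to keep only the first key.
{code}
// Today there are two separate knobs; the proposal is to let spark.shuffle.compress govern both cases.
val conf = new org.apache.spark.SparkConf()
  .set("spark.shuffle.compress", "true")        // compress map outputs written for the shuffle
  .set("spark.shuffle.spill.compress", "true")  // compress data spilled to disk during shuffles
{code}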
[jira] [Resolved] (SPARK-2304) Add example application to run tera sort
[ https://issues.apache.org/jira/browse/SPARK-2304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-2304. Resolution: Won't Fix spark-perf is a better long term home for this. Add example application to run tera sort Key: SPARK-2304 URL: https://issues.apache.org/jira/browse/SPARK-2304 Project: Spark Issue Type: Improvement Reporter: Reynold Xin Assignee: Reynold Xin See http://sortbenchmark.org/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2243) Support multiple SparkContexts in the same JVM
[ https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2243: --- Target Version/s: (was: 1.2.0) Assignee: (was: Reynold Xin) Support multiple SparkContexts in the same JVM -- Key: SPARK-2243 URL: https://issues.apache.org/jira/browse/SPARK-2243 Project: Spark Issue Type: New Feature Components: Block Manager, Spark Core Affects Versions: 1.0.0, 1.1.0 Reporter: Miguel Angel Fernandez Diaz We're developing a platform where we create several Spark contexts for carrying out different calculations. Is there any restriction when using several Spark contexts? We have two contexts, one for Spark calculations and another one for Spark Streaming jobs. The next error arises when we first execute a Spark calculation and, once the execution is finished, a Spark Streaming job is launched: {code} 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0 java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63) at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139) at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193) at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0) 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to java.io.FileNotFoundException 
java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at
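A minimal sketch of the usual way to avoid the failure above in current versions (a workaround, not the requested feature): either reuse the existing SparkContext for the StreamingContext, or stop the first context before creating another. Master URLs and app names below are illustrative.
{code}
// Hypothetical workaround sketch; only one live SparkContext per JVM is supported today.
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sc = new SparkContext("local[2]", "batch-job")
// ... run the batch calculation ...

// Option 1: reuse the same SparkContext for streaming instead of creating a second one.
val ssc = new StreamingContext(sc, Seconds(1))

// Option 2: stop the first context before creating a new one.
// sc.stop()
// val ssc2 = new StreamingContext("local[2]", "streaming-job", Seconds(1))
{code}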
[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM
[ https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119479#comment-14119479 ] Reynold Xin commented on SPARK-2243: Pushing this for later since Hive on Spark no longer depends on this ... Support multiple SparkContexts in the same JVM -- Key: SPARK-2243 URL: https://issues.apache.org/jira/browse/SPARK-2243 Project: Spark Issue Type: New Feature Components: Block Manager, Spark Core Affects Versions: 1.0.0, 1.1.0 Reporter: Miguel Angel Fernandez Diaz Assignee: Reynold Xin
[jira] [Commented] (SPARK-3350) Strange anomaly trying to write a SchemaRDD into an Avro file
[ https://issues.apache.org/jira/browse/SPARK-3350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119499#comment-14119499 ] David Greco commented on SPARK-3350: Here you are: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 14/09/03 09:33:10 INFO SecurityManager: Changing view acls to: dgreco, 14/09/03 09:33:10 INFO SecurityManager: Changing modify acls to: dgreco, 14/09/03 09:33:10 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(dgreco, ); users with modify permissions: Set(dgreco, ) 14/09/03 09:33:11 INFO Slf4jLogger: Slf4jLogger started 14/09/03 09:33:11 INFO Remoting: Starting remoting 14/09/03 09:33:12 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.0.21:49652] 14/09/03 09:33:12 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver@192.168.0.21:49652] 14/09/03 09:33:12 INFO Utils: Successfully started service 'sparkDriver' on port 49652. 14/09/03 09:33:12 INFO SparkEnv: Registering MapOutputTracker 14/09/03 09:33:12 INFO SparkEnv: Registering BlockManagerMaster 14/09/03 09:33:12 INFO DiskBlockManager: Created local directory at /var/folders/bw/7dmpv8qx3p75mcmctmchwl5mgn/T/spark-local-20140903093312-29d3 14/09/03 09:33:12 INFO Utils: Successfully started service 'Connection manager for block manager' on port 49653. 14/09/03 09:33:12 INFO ConnectionManager: Bound socket to port 49653 with id = ConnectionManagerId(192.168.0.21,49653) 14/09/03 09:33:12 INFO MemoryStore: MemoryStore started with capacity 983.1 MB 14/09/03 09:33:12 INFO BlockManagerMaster: Trying to register BlockManager 14/09/03 09:33:12 INFO BlockManagerMasterActor: Registering block manager 192.168.0.21:49653 with 983.1 MB RAM 14/09/03 09:33:12 INFO BlockManagerMaster: Registered BlockManager 14/09/03 09:33:12 INFO HttpFileServer: HTTP File server directory is /var/folders/bw/7dmpv8qx3p75mcmctmchwl5mgn/T/spark-f16014c3-f795-4300-88af-7e161d6f7547 14/09/03 09:33:12 INFO HttpServer: Starting HTTP Server 14/09/03 09:33:12 INFO Utils: Successfully started service 'HTTP file server' on port 49654. 14/09/03 09:33:13 INFO Utils: Successfully started service 'SparkUI' on port 4040. 14/09/03 09:33:13 INFO SparkUI: Started SparkUI at http://192.168.0.21:4040 14/09/03 09:33:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/09/03 09:33:15 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@192.168.0.21:49652/user/HeartbeatReceiver 14/09/03 09:33:18 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id 14/09/03 09:33:18 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id 14/09/03 09:33:18 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap 14/09/03 09:33:18 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition 14/09/03 09:33:18 INFO deprecation: mapred.job.id is deprecated. 
Instead, use mapreduce.job.id 14/09/03 09:33:18 INFO SparkContext: Starting job: saveAsHadoopFile at AvroWriteTestCase.scala:67 14/09/03 09:33:18 INFO DAGScheduler: Got job 0 (saveAsHadoopFile at AvroWriteTestCase.scala:67) with 1 output partitions (allowLocal=false) 14/09/03 09:33:18 INFO DAGScheduler: Final stage: Stage 0(saveAsHadoopFile at AvroWriteTestCase.scala:67) 14/09/03 09:33:18 INFO DAGScheduler: Parents of final stage: List() 14/09/03 09:33:18 INFO DAGScheduler: Missing parents: List() 14/09/03 09:33:18 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[4] at map at AvroWriteTestCase.scala:66), which has no missing parents 14/09/03 09:33:19 INFO MemoryStore: ensureFreeSpace(45032) called with curMem=0, maxMem=1030823608 14/09/03 09:33:19 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 44.0 KB, free 983.0 MB) 14/09/03 09:33:19 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (MappedRDD[4] at map at AvroWriteTestCase.scala:66) 14/09/03 09:33:19 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks 14/09/03 09:33:19 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 3191 bytes) 14/09/03 09:33:19 INFO Executor: Running task 0.0 in stage 0.0 (TID 0) 14/09/03 09:33:19 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.NullPointerException at org.apache.spark.sql.SchemaRDDLike$class.queryExecution(SchemaRDDLike.scala:52) at org.apache.spark.sql.SchemaRDD.queryExecution$lzycompute(SchemaRDD.scala:103) at org.apache.spark.sql.SchemaRDD.queryExecution(SchemaRDD.scala:103) at org.apache.spark.sql.SchemaRDD.schema(SchemaRDD.scala:126) at
[jira] [Created] (SPARK-3368) Spark cannot be used with Avro and Parquet
Graham Dennis created SPARK-3368: Summary: Spark cannot be used with Avro and Parquet Key: SPARK-3368 URL: https://issues.apache.org/jira/browse/SPARK-3368 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: Graham Dennis Spark cannot currently (as of 1.0.2) use any Parquet write support classes that are not part of the spark assembly jar (at least when launched using `spark-submit`). This prevents using Avro with Parquet. See https://github.com/GrahamDennis/spark-avro-parquet for a test case to reproduce this issue. The problem appears in the master logs as: {noformat} 14/09/03 17:31:10 ERROR Executor: Exception in task ID 0 parquet.hadoop.BadConfigurationException: could not instanciate class parquet.avro.AvroWriteSupport set in job conf at parquet.write.support.class at parquet.hadoop.ParquetOutputFormat.getWriteSupportClass(ParquetOutputFormat.java:121) at parquet.hadoop.ParquetOutputFormat.getWriteSupport(ParquetOutputFormat.java:302) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:714) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:699) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ClassNotFoundException: parquet.avro.AvroWriteSupport at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:190) at parquet.hadoop.ParquetOutputFormat.getWriteSupportClass(ParquetOutputFormat.java:115) ... 11 more {noformat} The root cause of the problem is that the class loader that's used to find the Parquet write support class only searches the spark assembly jar and doesn't also search the application jar. A solution would be to ensure that the application jar is always available on the executor classpath. This is the same underlying issue as SPARK-2878, and SPARK-3166 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
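A sketch of one possible mitigation until the executor classpath behaviour is fixed (the jar path is hypothetical, it assumes parquet-avro is bundled in the application assembly, and it is not confirmed here that this bypasses the task-side class loader problem in 1.0.2):
{code}
// Hypothetical mitigation sketch: make the application assembly, which contains
// parquet.avro.AvroWriteSupport, visible on the executor JVM classpath directly.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("avro-parquet-write")
  .setJars(Seq("/path/to/app-assembly.jar"))                         // ship the jar to executors
  .set("spark.executor.extraClassPath", "/path/to/app-assembly.jar") // assumes the jar already exists at this path on every worker
val sc = new SparkContext(conf)
{code}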
[jira] [Commented] (SPARK-3368) Spark cannot be used with Avro and Parquet
[ https://issues.apache.org/jira/browse/SPARK-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119522#comment-14119522 ] Graham Dennis commented on SPARK-3368: -- There are a couple of github repos that demonstrate using Avro and Parquet, see https://github.com/AndreSchumacher/avro-parquet-spark-example and https://github.com/massie/spark-parquet-example In the first case, data is written out locally, and in the second, spark is launched via maven (not spark-submit), which puts both spark jars and application jars on the classpath at launch time. Spark cannot be used with Avro and Parquet -- Key: SPARK-3368 URL: https://issues.apache.org/jira/browse/SPARK-3368 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: Graham Dennis -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3368) Spark cannot be used with Avro and Parquet
[ https://issues.apache.org/jira/browse/SPARK-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119524#comment-14119524 ] Graham Dennis commented on SPARK-3368: -- [~rxin]: I filed this issue to demonstrate that the classpath issue behind SPARK-2878 and SPARK-3166 appears in other situations too. Spark cannot be used with Avro and Parquet -- Key: SPARK-3368 URL: https://issues.apache.org/jira/browse/SPARK-3368 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: Graham Dennis -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3273) We should read the version information from the same place.
[ https://issues.apache.org/jira/browse/SPARK-3273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-3273: --- Summary: We should read the version information from the same place. (was: The spark version in the welcome message of spark-shell is not correct) We should read the version information from the same place. --- Key: SPARK-3273 URL: https://issues.apache.org/jira/browse/SPARK-3273 Project: Spark Issue Type: Bug Components: Spark Shell Reporter: Guoqiang Li Assignee: Guoqiang Li Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3336) [Spark SQL] In pyspark, cannot group by field on UDF
[ https://issues.apache.org/jira/browse/SPARK-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-3336: --- Description: Running pyspark on a spark cluster with standalone master. Cannot group by field on a UDF. But we can group by UDF in Scala. For example: q = sqlContext.sql('SELECT COUNT(*), MYUDF(foo) FROM bar GROUP BY MYUDF(foo)') out = q.collect() I got this exception: {code} Py4JJavaError: An error occurred while calling o183.collect. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in stage 56.0 failed 4 times, most recent failure: Lost task 26.3 in stage 56.0 (TID 14038, ip-10-33-9-144.us-west-2.compute.internal): org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: pythonUDF#1278 org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47) org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:43) org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:42) org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165) org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) scala.collection.AbstractIterator.to(Iterator.scala:1157) scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) scala.collection.AbstractIterator.toArray(Iterator.scala:1157) org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212) org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168) org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156) org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:42) org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52) org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52) scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) scala.collection.immutable.List.foreach(List.scala:318) scala.collection.TraversableLike$class.map(TraversableLike.scala:244) scala.collection.AbstractTraversable.map(Traversable.scala:105) org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.init(Projection.scala:52) org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7$$anon$1.init(Aggregate.scala:176) org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:172) org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:115) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185) at
[jira] [Comment Edited] (SPARK-3336) [Spark SQL] In pyspark, cannot group by field on UDF
[ https://issues.apache.org/jira/browse/SPARK-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119400#comment-14119400 ] Reynold Xin edited comment on SPARK-3336 at 9/3/14 8:45 AM: [~marmbrus], If we reverse the order of count() and MYUDF(), then it can run successfully: {code} SELECT MYUDF(foo), COUNT(1) FROM bar GROUP BY MYUDF(foo); {code} If we put COUNT(1) before MYUDF(foo), it will also raise the same exception in Scala: {code} scala sqlContext.sql(SELECT count(1), MYUDF(name) from people GROUP BY MYUDF(name) ).collect() 14/09/02 22:45:50 ERROR Executor: Exception in task 2.0 in stage 3.0 (TID 214) org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: name#0 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:43) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:42) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$2.apply(TreeNode.scala:180) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:178) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:191) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:147) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:162) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:191) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:147) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135) at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:42) at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52) at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at
[jira] [Created] (SPARK-3369) Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator
Sean Owen created SPARK-3369: Summary: Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator Key: SPARK-3369 URL: https://issues.apache.org/jira/browse/SPARK-3369 Project: Spark Issue Type: Improvement Components: Java API Affects Versions: 1.0.2 Reporter: Sean Owen {{mapPartitions}} in the Scala RDD API takes a function that transforms an {{Iterator}} to an {{Iterator}}: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD In the Java RDD API, the equivalent is a FlatMapFunction, which operates on an {{Iterator}} but is required to return an {{Iterable}}, which is a stronger condition and appears inconsistent. It's a problematic inconsistency though, because this seems to require copying all of the input into memory in order to create an object that can be iterated many times, since the input does not afford this itself. Similarly for other {{mapPartitions*}} methods and other {{*FlatMapFunctions}}s in Java. (Is there a reason for this difference that I'm overlooking?) If I'm right that this was inadvertent inconsistency, then the big issue here is that of course this is part of a public API. Workarounds I can think of: Promise that Spark will only call {{iterator()}} once, so implementors can use a hacky {{IteratorIterable}} that returns the same {{Iterator}}. Or, make a series of methods accepting a {{FlatMapFunction2}}, etc. with the desired signature, and deprecate existing ones. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
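A minimal sketch of the first workaround mentioned above (a hypothetical helper, not part of Spark): an Iterable that hands back a single underlying Iterator, valid only if iterator() is called exactly once.
{code}
// Hypothetical helper sketch: wraps an Iterator as a single-use Iterable so a Java
// FlatMapFunction can return it without first copying the partition into memory.
// Only safe if the caller promises to invoke iterator() exactly once.
class IteratorIterable[T](underlying: java.util.Iterator[T]) extends java.lang.Iterable[T] {
  override def iterator(): java.util.Iterator[T] = underlying
}
{code}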
[jira] [Commented] (SPARK-3336) [Spark SQL] In pyspark, cannot group by field on UDF
[ https://issues.apache.org/jira/browse/SPARK-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119660#comment-14119660 ] kay feng commented on SPARK-3336: - We tested on spark 1.1.0 rc2. In spark-shell, it works:
./bin/spark-shell
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val rdd = sc.parallelize(List("""{"foo":"bar"}"""))
sqlContext.jsonRDD(rdd).registerAsTable("t1")
sqlContext.registerFunction("len", (s: String) => s.length)
sqlContext.sql("select len(foo), count(1) from t1 group by len(foo)").collect()
However, in pyspark, we still have the error: TreeNodeException: Binding attribute...
./bin/pyspark
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
rdd = sc.parallelize(['{"foo":"bar"}'])
sqlContext.jsonRDD(rdd).registerAsTable("t1")
sqlContext.registerFunction("len", lambda s: len(s))
sqlContext.sql("select len(foo), count(1) from t1 group by len(foo)").collect()
[Spark SQL] In pyspark, cannot group by field on UDF Key: SPARK-3336 URL: https://issues.apache.org/jira/browse/SPARK-3336 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.1.0 Reporter: kay feng Assignee: Davies Liu
[jira] [Created] (SPARK-3370) The simple test error
edmonds created SPARK-3370: -- Summary: The simple test error Key: SPARK-3370 URL: https://issues.apache.org/jira/browse/SPARK-3370 Project: Spark Issue Type: Question Components: Examples, MLlib, PySpark Affects Versions: 1.0.2 Environment: Linux 3.13.0-35-generic #62-Ubuntu SMP Fri Aug 15 01:58:01 UTC 2014 i686 i686 i686 GNU/Linux Reporter: edmonds I tried to run the ALS example, but failed both in als.py, and MovielenALS.scala, how can I fix that? the following is the report for running the als.py Traceback (most recent call last): File ALS.py, line 61, in module model = ALS.train(ratings, rank, numIterations) File spark-1.0.2-bin-hadoop2/python/pyspark/mllib/recommendation.py, line 66, in train ratingBytes._jrdd, rank, iterations, lambda_, blocks) File spark-1.0.2-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 537, in __call__ File spark-1.0.2-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o24.trainALSModel. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 23.0:0 failed 1 times, most recent failure: Exception failure in TID 62 on host localhost: java.lang.StackOverflowError java.util.zip.Inflater.inflate(Inflater.java:259) java.util.zip.InflaterInputStream.read(InflaterInputStream.java:152) java.util.zip.GZIPInputStream.read(GZIPInputStream.java:116) java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310) java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2718) java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2742) java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1979) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) scala.collection.immutable.$colon$colon.readObject(List.scala:366) sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source) sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) java.lang.reflect.Method.invoke(Method.java:606) java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) scala.collection.immutable.$colon$colon.readObject(List.scala:362) 
sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source) sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) java.lang.reflect.Method.invoke(Method.java:606) java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
[jira] [Created] (SPARK-3371) Spark SQL: Renaming a function expression with group by gives error
Pei-Lun Lee created SPARK-3371: -- Summary: Spark SQL: Renaming a function expression with group by gives error Key: SPARK-3371 URL: https://issues.apache.org/jira/browse/SPARK-3371 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Pei-Lun Lee
{code}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val rdd = sc.parallelize(List("""{"foo":"bar"}"""))
sqlContext.jsonRDD(rdd).registerAsTable("t1")
sqlContext.registerFunction("len", (s: String) => s.length)
sqlContext.sql("select len(foo) as a, count(1) from t1 group by len(foo)").collect()
{code}
Running the above code in spark-shell gives the following error {noformat} 14/09/03 17:20:13 ERROR Executor: Exception in task 2.0 in stage 3.0 (TID 214) org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: foo#0 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:43) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:42) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$2.apply(TreeNode.scala:201) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:199) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) {noformat} Removing "as a" from the query causes no error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3372) Scalastyle fails due to UTF-8 character being contained in Gradient.scala
[ https://issues.apache.org/jira/browse/SPARK-3372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119706#comment-14119706 ] Apache Spark commented on SPARK-3372: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/2248 Scalastyle fails due to UTF-8 character being contained in Gradient.scala - Key: SPARK-3372 URL: https://issues.apache.org/jira/browse/SPARK-3372 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Environment: Windows8 / Java 7 / Maven Reporter: Kousuke Saruta Priority: Blocker Gradient.scala includes 2 UTF-8 hyphens. Because of this, mvn package failed on Windows 8 because it cannot pass the Scalastyle check. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3372) Scalastyle fails due to UTF-8 character contained in Gradient.scala
[ https://issues.apache.org/jira/browse/SPARK-3372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3372: -- Summary: Scalastyle fails due to UTF-8 character contained in Gradient.scala (was: Scalastyle fails due to UTF-8 character being contained in Gradient.scala) Scalastyle fails due to UTF-8 character contained in Gradient.scala --- Key: SPARK-3372 URL: https://issues.apache.org/jira/browse/SPARK-3372 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Environment: Windows8 / Java 7 / Maven Reporter: Kousuke Saruta Priority: Blocker Gradient.scala includes 2 UTF-8 hyphens. Because of this, mvn package failed on Windows 8 because it cannot pass the Scalastyle check. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3350) Strange anomaly trying to write a SchemaRDD into an Avro file
[ https://issues.apache.org/jira/browse/SPARK-3350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119742#comment-14119742 ] David Greco commented on SPARK-3350: I managed to make it work; the passing code is below. I changed the function rowToAvroKey to return a function instead, and now it works. I think it's related to how Spark serialises closures. Just my .02. Hope it helps.
{code}
package com.eligotech.hnavigator.prototypes.spark

import org.apache.avro.{SchemaBuilder, Schema}
import org.apache.avro.SchemaBuilder.FieldAssembler
import org.apache.avro.generic.GenericData.Record
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.{AvroOutputFormat, AvroWrapper, AvroJob, AvroKey}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.sql.BooleanType
import org.apache.spark.sql.DoubleType
import org.apache.spark.sql.StructType
import org.apache.spark.sql._
import org.apache.spark.sql.catalyst.types.FloatType
import org.apache.spark.sql.catalyst.types.IntegerType
import org.apache.spark.sql.catalyst.types.LongType
import org.apache.spark.sql.catalyst.types.StringType
import org.apache.spark.sql.catalyst.types.StructField
import org.apache.spark.{SparkConf, SparkContext}

object AvroWriteTestCase extends App {
  val conf = new SparkConf().setAppName("HarpoonSparkTest").setMaster("local")
  val sparkContext = new SparkContext(conf)
  val sqlContext = new SQLContext(sparkContext)
  val tpeople = sparkContext.parallelize((1 to 100).map(i => Person(s"CIAO$i", i)))

  case class Person(name: String, age: Int)

  import sqlContext._
  tpeople.registerTempTable("people")
  val people = sqlContext.sql("select * from people")

  // Returning a function (rather than mapping with a method) keeps the
  // enclosing object out of the serialized closure.
  def rowToAvroKey(structType: StructType): Row => (AvroKey[GenericRecord], NullWritable) = (row: Row) => {
    val schema = structTypeToAvroSchema(structType)
    val record = new Record(schema)
    var i = 0
    structType.fields.foreach { f =>
      record.put(f.name, row.apply(i))
      i += 1
    }
    (new AvroKey(record), NullWritable.get())
  }

  def structTypeToAvroSchema(structType: StructType): Schema = {
    val fieldsAssembler: FieldAssembler[Schema] = SchemaBuilder.record("RECORD").fields()
    structType.fields foreach { (field: StructField) =>
      field.dataType match {
        case IntegerType => fieldsAssembler.name(field.name).`type`().nullable().intType().noDefault()
        case LongType => fieldsAssembler.name(field.name).`type`().nullable().longType().noDefault()
        case StringType => fieldsAssembler.name(field.name).`type`().nullable().stringType().noDefault()
        case FloatType => fieldsAssembler.name(field.name).`type`().nullable().floatType().noDefault()
        case DoubleType => fieldsAssembler.name(field.name).`type`().nullable().doubleType().noDefault()
        case BooleanType => fieldsAssembler.name(field.name).`type`().nullable().booleanType().noDefault()
        case _ => throw new IllegalArgumentException("StructType with unhandled type: " + field)
      }
    }
    fieldsAssembler.endRecord()
  }

  def saveAsAvroFile(schemaRDD: SchemaRDD, path: String): Unit = {
    val jobConf = new JobConf(schemaRDD.sparkContext.hadoopConfiguration)
    val schema = structTypeToAvroSchema(schemaRDD.schema)
    AvroJob.setOutputSchema(jobConf, schema)
    import org.apache.spark.SparkContext._
    schemaRDD.map(rowToAvroKey(schemaRDD.schema)).
      saveAsHadoopFile(path, classOf[AvroWrapper[GenericRecord]], classOf[NullWritable], classOf[AvroOutputFormat[GenericRecord]], jobConf)
  }

  saveAsAvroFile(people, "/tmp/people.avro")
}
{code}
Strange anomaly trying to write a SchemaRDD into an Avro file - Key: SPARK-3350 URL: https://issues.apache.org/jira/browse/SPARK-3350 Project: Spark Issue Type: Bug Components: SQL Environment: jdk1.7, macosx Reporter: David Greco Attachments: AvroWriteTestCase.scala I found a way to automatically save a SchemaRDD in Avro format, similarly to what Spark does with Parquet files. I attached a test case to this issue. The code fails with an NPE. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
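The fix described in the comment above matches a common Spark closure-serialization pattern: using a method directly in an RDD transformation captures the enclosing instance, while a returned function value does not. A tiny, hypothetical sketch of the difference (the class and names here are illustrative, not from the attached test case):
{code}
// Hypothetical sketch of why returning a function can avoid dragging the
// enclosing (possibly non-serializable) object into the task closure.
class Converters {  // note: not Serializable
  // Used as rdd.map(toLen _), this eta-expands to a closure that captures
  // `this`, so the whole Converters instance must be serialized.
  def toLen(s: String): Int = s.length

  // Returning a standalone function first means only that function value is
  // captured by rdd.map(fn), not the enclosing Converters instance.
  def toLenFn: String => Int = (s: String) => s.length
}
{code}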
[jira] [Resolved] (SPARK-563) Run findBugs and IDEA inspections in the codebase
[ https://issues.apache.org/jira/browse/SPARK-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-563. - Resolution: Won't Fix This appears to be obsolete/stale too. Run findBugs and IDEA inspections in the codebase - Key: SPARK-563 URL: https://issues.apache.org/jira/browse/SPARK-563 Project: Spark Issue Type: Improvement Reporter: Ismael Juma I ran into a few instances of unused local variables and unnecessary usage of the 'return' keyword (the recommended practice is to avoid 'return' if possible) and thought it would be good to run findBugs and IDEA inspections to clean-up the code. I am willing to do this, but first would like to know whether you agree that this is a good idea and whether this is the right time to do it. These changes tend to affect many source files and can cause issues if there is major work ongoing in separate branches. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2718) YARN does not handle spark configs with quotes or backslashes
[ https://issues.apache.org/jira/browse/SPARK-2718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119854#comment-14119854 ] Thomas Graves commented on SPARK-2718: -- [~andrewor] the pr for this is closed, can we close the jira also? YARN does not handle spark configs with quotes or backslashes - Key: SPARK-2718 URL: https://issues.apache.org/jira/browse/SPARK-2718 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.2 Reporter: Andrew Or Assignee: Marcelo Vanzin Priority: Minor Fix For: 1.1.0 Say we have the following config: {code} spark.app.name spark shell with spaces and quotes and \ backslashes \ {code} This works in standalone mode but not in YARN mode. This is because standalone mode uses Java's ProcessBuilder, which handles these cases nicely, but YARN mode uses org.apache.hadoop.yarn.api.records.ContainerLaunchContext, which does not. As a result, submitting an application to YARN with the given config leads to the following exception: {code} line 0: unexpected EOF while looking for matching `' syntax error: unexpected end of file at org.apache.hadoop.util.Shell.runCommand(Shell.java:505) at org.apache.hadoop.util.Shell.run(Shell.java:418) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650) ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2718) YARN does not handle spark configs with quotes or backslashes
[ https://issues.apache.org/jira/browse/SPARK-2718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119854#comment-14119854 ] Thomas Graves edited comment on SPARK-2718 at 9/3/14 1:37 PM: -- [~andrewor] the pr for this is closed, can we resolve the jira also? was (Author: tgraves): [~andrewor] the pr for this is closed, can we close the jira also? YARN does not handle spark configs with quotes or backslashes - Key: SPARK-2718 URL: https://issues.apache.org/jira/browse/SPARK-2718 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.2 Reporter: Andrew Or Assignee: Marcelo Vanzin Priority: Minor Fix For: 1.1.0 Say we have the following config: {code} spark.app.name spark shell with spaces and quotes and \ backslashes \ {code} This works in standalone mode but not in YARN mode. This is because standalone mode uses Java's ProcessBuilder, which handles these cases nicely, but YARN mode uses org.apache.hadoop.yarn.api.records.ContainerLaunchContext, which does not. As a result, submitting an application to YARN with the given config leads to the following exception: {code} line 0: unexpected EOF while looking for matching `' syntax error: unexpected end of file at org.apache.hadoop.util.Shell.runCommand(Shell.java:505) at org.apache.hadoop.util.Shell.run(Shell.java:418) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650) ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3373) trim some useless informations of VertexRDD in some cases
uncleGen created SPARK-3373: --- Summary: trim some useless informations of VertexRDD in some cases Key: SPARK-3373 URL: https://issues.apache.org/jira/browse/SPARK-3373 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 1.0.2, 1.0.0 Reporter: uncleGen Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3187) Refactor and cleanup Yarn allocator code
[ https://issues.apache.org/jira/browse/SPARK-3187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-3187. -- Resolution: Fixed Fix Version/s: 1.2.0 Refactor and cleanup Yarn allocator code Key: SPARK-3187 URL: https://issues.apache.org/jira/browse/SPARK-3187 Project: Spark Issue Type: Improvement Components: YARN Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Priority: Minor Fix For: 1.2.0 This is a follow-up to SPARK-2933, which dealt with the ApplicationMaster code. There's a lot of logic in the container allocation code in alpha/stable that could probably be merged. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3375) spark on yarn alpha container allocation issues
Thomas Graves created SPARK-3375: Summary: spark on yarn alpha container allocation issues Key: SPARK-3375 URL: https://issues.apache.org/jira/browse/SPARK-3375 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Thomas Graves Priority: Blocker It looks like if YARN doesn't get the containers immediately, it stops asking for them, and the YARN application hangs without ever getting any executors. This was introduced by https://github.com/apache/spark/pull/2169 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O
uncleGen created SPARK-3376: --- Summary: Memory-based shuffle strategy to reduce overhead of disk I/O Key: SPARK-3376 URL: https://issues.apache.org/jira/browse/SPARK-3376 Project: Spark Issue Type: Planned Work Reporter: uncleGen Priority: Trivial -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3377) codahale base Metrics data between applications can jumble up together
Kousuke Saruta created SPARK-3377: - Summary: codahale base Metrics data between applications can jumble up together Key: SPARK-3377 URL: https://issues.apache.org/jira/browse/SPARK-3377 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Kousuke Saruta Priority: Critical I'm using the codahale-based MetricsSystem of Spark with JMX or Graphite, and I saw the following 2 problems. (1) When applications which have the same spark.app.name run on the cluster at the same time, some metric names jumble up together, e.g. SparkPi.DAGScheduler.stage.failedStages. (2) When 2+ executors run on the same machine, the JVM metrics of each executor jumble. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O
[ https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] uncleGen updated SPARK-3376: Description: I think a memory-based shuffle can reduce some overhead of disk I/O. I just want to know whether there is any plan to do something about it, or any suggestion about it. Based on the work (SPARK-2044), it is feasible to have several implementations of shuffle. (was: I think a memory-based shuffle can reduce some overhead of disk I/O. I just want to know whether there is any plan to do something about it, or any suggestion about it. Based on the work (SPARK-2044), it is feasible to have several implementation of shuffle.) Memory-based shuffle strategy to reduce overhead of disk I/O Key: SPARK-3376 URL: https://issues.apache.org/jira/browse/SPARK-3376 Project: Spark Issue Type: Planned Work Reporter: uncleGen Priority: Trivial I think a memory-based shuffle can reduce some overhead of disk I/O. I just want to know whether there is any plan to do something about it, or any suggestion about it. Based on the work (SPARK-2044), it is feasible to have several implementations of shuffle. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O
[ https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] uncleGen updated SPARK-3376: Description: I think a memory-based shuffle can reduce some overhead of disk I/O. I just want to know whether there is any plan to do something about it, or any suggestion about it. Based on the work (SPARK-2044), it is feasible to have several implementation of shuffle. Memory-based shuffle strategy to reduce overhead of disk I/O Key: SPARK-3376 URL: https://issues.apache.org/jira/browse/SPARK-3376 Project: Spark Issue Type: Planned Work Reporter: uncleGen Priority: Trivial I think a memory-based shuffle can reduce some overhead of disk I/O. I just want to know whether there is any plan to do something about it, or any suggestion about it. Based on the work (SPARK-2044), it is feasible to have several implementation of shuffle. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3377) codahale base Metrics data between applications can jumble up together
[ https://issues.apache.org/jira/browse/SPARK-3377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119975#comment-14119975 ] Apache Spark commented on SPARK-3377: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/2250 codahale base Metrics data between applications can jumble up together -- Key: SPARK-3377 URL: https://issues.apache.org/jira/browse/SPARK-3377 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Kousuke Saruta Priority: Critical I'm using the codahale-based MetricsSystem of Spark with JMX or Graphite, and I saw the following 2 problems. (1) When applications which have the same spark.app.name run on the cluster at the same time, some metric names jumble up together, e.g. SparkPi.DAGScheduler.stage.failedStages. (2) When 2+ executors run on the same machine, the JVM metrics of each executor jumble. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3377) codahale base Metrics data between applications can jumble up together
[ https://issues.apache.org/jira/browse/SPARK-3377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3377: -- Description: I'm using the codahale-based MetricsSystem of Spark with JMX or Graphite, and I saw the following 2 problems. (1) When applications which have the same spark.app.name run on the cluster at the same time, some metric names jumble up together, e.g. SparkPi.DAGScheduler.stage.failedStages. (2) When 2+ executors run on the same machine, the JVM metrics of each executor jumble, e.g. the current implementation cannot distinguish which executor the metric jvm.memory belongs to. was: I'm using the codahale-based MetricsSystem of Spark with JMX or Graphite, and I saw the following 2 problems. (1) When applications which have the same spark.app.name run on the cluster at the same time, some metric names jumble up together, e.g. SparkPi.DAGScheduler.stage.failedStages. (2) When 2+ executors run on the same machine, the JVM metrics of each executor jumble. codahale base Metrics data between applications can jumble up together -- Key: SPARK-3377 URL: https://issues.apache.org/jira/browse/SPARK-3377 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Kousuke Saruta Priority: Critical I'm using the codahale-based MetricsSystem of Spark with JMX or Graphite, and I saw the following 2 problems. (1) When applications which have the same spark.app.name run on the cluster at the same time, some metric names jumble up together, e.g. SparkPi.DAGScheduler.stage.failedStages. (2) When 2+ executors run on the same machine, the JVM metrics of each executor jumble, e.g. the current implementation cannot distinguish which executor the metric jvm.memory belongs to. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
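One possible direction, offered here only as an illustrative sketch (not the shape of any planned fix), is to prefix metric names with identifiers that are unique per application and per executor, so that concurrent applications with the same spark.app.name, or multiple executors on one host, no longer collide:
{code}
// Sketch only: build a fully qualified metric name from app/executor IDs.
// appId and executorId are assumed to come from the SparkConf / executor env.
def qualifiedMetricName(appId: String, executorId: String, source: String, metric: String): String =
  Seq(appId, executorId, source, metric).mkString(".")

// e.g. "app-20140903001-0007.2.jvm.memory" instead of a bare "jvm.memory"
{code}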
[jira] [Commented] (SPARK-3304) ApplicationMaster's Finish status is wrong when uncaught exception is thrown from ReporterThread
[ https://issues.apache.org/jira/browse/SPARK-3304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120030#comment-14120030 ] Kousuke Saruta commented on SPARK-3304: --- Actually, I haven't seen an exception kill the Reporter thread, but I think unexpected exceptions like OOM, or other exceptions caused by communication errors with the RM during resource negotiation (for example due to a temporary traffic jam), can be thrown. ApplicationMaster's Finish status is wrong when uncaught exception is thrown from ReporterThread Key: SPARK-3304 URL: https://issues.apache.org/jira/browse/SPARK-3304 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Kousuke Saruta It's a rare case, but even when an uncaught exception is thrown from the Reporter thread while allocating containers, the finish status is marked as SUCCEEDED. In addition, in YARN cluster mode, if we don't notice the Reporter thread is dead, we wait a long time until the timeout. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
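A minimal sketch of the kind of guard the issue is asking for, assuming the ApplicationMaster exposes some way to unregister with a final status (the finish helper below is a placeholder, not the actual AM API):
{code}
// Sketch: make the AM report FAILED if the reporter thread dies unexpectedly,
// instead of silently ending up with a SUCCEEDED status.
def finish(status: String, diagnostics: String): Unit = {
  // placeholder: unregister the ApplicationMaster with the given final status
}

val reporterThread = new Thread("Reporter") {
  override def run(): Unit = {
    try {
      // ... heartbeat with the RM and request containers in a loop ...
    } catch {
      case t: Throwable =>
        finish("FAILED", s"Reporter thread died unexpectedly: ${t.getMessage}")
        throw t
    }
  }
}
reporterThread.setDaemon(true)
reporterThread.start()
{code}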
[jira] [Created] (SPARK-3378) Replace the word SparkSQL with right word Spark SQL
Kousuke Saruta created SPARK-3378: - Summary: Replace the word SparkSQL with right word Spark SQL Key: SPARK-3378 URL: https://issues.apache.org/jira/browse/SPARK-3378 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.1.0 Reporter: Kousuke Saruta Priority: Trivial In programming-guide.md, there are 2 SparkSQL. We should use Spark SQL instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3378) Replace the word SparkSQL with right word Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-3378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120081#comment-14120081 ] Apache Spark commented on SPARK-3378: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/2251 Replace the word SparkSQL with right word Spark SQL --- Key: SPARK-3378 URL: https://issues.apache.org/jira/browse/SPARK-3378 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.1.0 Reporter: Kousuke Saruta Priority: Trivial In programming-guide.md, there are 2 SparkSQL. We should use Spark SQL instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3336) [Spark SQL] In pyspark, cannot group by field on UDF
[ https://issues.apache.org/jira/browse/SPARK-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120141#comment-14120141 ] Davies Liu commented on SPARK-3336: --- [~kayfeng] The pyspark test case ran successfully in master but failed in 1.1 branch. After testing, This bug was fixed in https://github.com/apache/spark/pull/2155 accidentally. [Spark SQL] In pyspark, cannot group by field on UDF Key: SPARK-3336 URL: https://issues.apache.org/jira/browse/SPARK-3336 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.1.0 Reporter: kay feng Assignee: Davies Liu Running pyspark on a spark cluster with standalone master. Cannot group by field on a UDF. But we can group by UDF in Scala. For example: q = sqlContext.sql('SELECT COUNT(*), MYUDF(foo) FROM bar GROUP BY MYUDF(foo)') out = q.collect() I got this exception: {code} Py4JJavaError: An error occurred while calling o183.collect. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in stage 56.0 failed 4 times, most recent failure: Lost task 26.3 in stage 56.0 (TID 14038, ip-10-33-9-144.us-west-2.compute.internal): org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: pythonUDF#1278 org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47) org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:43) org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:42) org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165) org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) scala.collection.AbstractIterator.to(Iterator.scala:1157) scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) scala.collection.AbstractIterator.toArray(Iterator.scala:1157) org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212) org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168) org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156) org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:42) org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52) org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52) scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) scala.collection.immutable.List.foreach(List.scala:318) scala.collection.TraversableLike$class.map(TraversableLike.scala:244) scala.collection.AbstractTraversable.map(Traversable.scala:105) 
org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.init(Projection.scala:52) org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7$$anon$1.init(Aggregate.scala:176) org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:172) org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:115) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
[jira] [Commented] (SPARK-3336) [Spark SQL] In pyspark, cannot group by field on UDF
[ https://issues.apache.org/jira/browse/SPARK-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120145#comment-14120145 ] Davies Liu commented on SPARK-3336: --- [~marmbrus] Should we merge this patch into 1.1 branch ? we can hold it for 1.1.1. [Spark SQL] In pyspark, cannot group by field on UDF Key: SPARK-3336 URL: https://issues.apache.org/jira/browse/SPARK-3336 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.1.0 Reporter: kay feng Assignee: Davies Liu Running pyspark on a spark cluster with standalone master. Cannot group by field on a UDF. But we can group by UDF in Scala. For example: q = sqlContext.sql('SELECT COUNT(*), MYUDF(foo) FROM bar GROUP BY MYUDF(foo)') out = q.collect() I got this exception: {code} Py4JJavaError: An error occurred while calling o183.collect. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in stage 56.0 failed 4 times, most recent failure: Lost task 26.3 in stage 56.0 (TID 14038, ip-10-33-9-144.us-west-2.compute.internal): org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: pythonUDF#1278 org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47) org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:43) org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:42) org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165) org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) scala.collection.AbstractIterator.to(Iterator.scala:1157) scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) scala.collection.AbstractIterator.toArray(Iterator.scala:1157) org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212) org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168) org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156) org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:42) org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52) org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52) scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) scala.collection.immutable.List.foreach(List.scala:318) scala.collection.TraversableLike$class.map(TraversableLike.scala:244) scala.collection.AbstractTraversable.map(Traversable.scala:105) org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.init(Projection.scala:52) 
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7$$anon$1.init(Aggregate.scala:176) org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:172) org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:115) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
[jira] [Created] (SPARK-3379) Implement 'POWER' for sql
Xinyun Huang created SPARK-3379: --- Summary: Implement 'POWER' for sql Key: SPARK-3379 URL: https://issues.apache.org/jira/browse/SPARK-3379 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.2, 1.1.0 Environment: All Reporter: Xinyun Huang Priority: Minor Fix For: 1.2.0 Add support for the mathematical function POWER within Spark SQL. Split from SPARK-3176 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
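Presumably the intended usage would look something like the following once POWER is supported (a hypothetical query; the table and column names are made up for illustration):
{code}
// Hypothetical usage after the feature lands; `measurements` and `value`
// are illustrative names, not part of the issue.
sqlContext.sql("SELECT POWER(value, 2) FROM measurements").collect()
{code}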
[jira] [Updated] (SPARK-3176) Implement 'ABS' and 'LAST' for sql
[ https://issues.apache.org/jira/browse/SPARK-3176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinyun Huang updated SPARK-3176: Summary: Implement 'ABS' and 'LAST' for sql (was: Implement 'POWER', 'ABS and 'LAST' for sql) Implement 'ABS' and 'LAST' for sql -- Key: SPARK-3176 URL: https://issues.apache.org/jira/browse/SPARK-3176 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.2, 1.1.0 Environment: All Reporter: Xinyun Huang Priority: Minor Fix For: 1.2.0 Original Estimate: 3h Remaining Estimate: 3h Add support for the mathematical function POWER and ABS and the analytic function last to return a subset of the rows satisfying a query within spark sql. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3176) Implement 'ABS' and 'LAST' for sql
[ https://issues.apache.org/jira/browse/SPARK-3176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120162#comment-14120162 ] Xinyun Huang commented on SPARK-3176: - Split POWER to SPARK-3379 Implement 'ABS' and 'LAST' for sql -- Key: SPARK-3176 URL: https://issues.apache.org/jira/browse/SPARK-3176 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.2, 1.1.0 Environment: All Reporter: Xinyun Huang Priority: Minor Fix For: 1.2.0 Original Estimate: 3h Remaining Estimate: 3h Add support for the mathematical function POWER and ABS and the analytic function last to return a subset of the rows satisfying a query within spark sql. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3176) Implement 'ABS' and 'LAST' for sql
[ https://issues.apache.org/jira/browse/SPARK-3176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinyun Huang updated SPARK-3176: Description: Add support for the mathematical function ABS and the analytic function last to return a subset of the rows satisfying a query within spark sql. (was: Add support for the mathematical function POWER and ABS and the analytic function last to return a subset of the rows satisfying a query within spark sql.) Implement 'ABS' and 'LAST' for sql -- Key: SPARK-3176 URL: https://issues.apache.org/jira/browse/SPARK-3176 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.2, 1.1.0 Environment: All Reporter: Xinyun Huang Priority: Minor Fix For: 1.2.0 Original Estimate: 3h Remaining Estimate: 3h Add support for the mathematical function ABS and the analytic function last to return a subset of the rows satisfying a query within spark sql. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3379) Implement 'POWER' for sql
[ https://issues.apache.org/jira/browse/SPARK-3379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120210#comment-14120210 ] Apache Spark commented on SPARK-3379: - User 'xinyunh' has created a pull request for this issue: https://github.com/apache/spark/pull/2252 Implement 'POWER' for sql - Key: SPARK-3379 URL: https://issues.apache.org/jira/browse/SPARK-3379 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.2, 1.1.0 Environment: All Reporter: Xinyun Huang Priority: Minor Fix For: 1.2.0 Original Estimate: 0h Remaining Estimate: 0h Add support for the mathematical function POWER within Spark SQL. Split from SPARK-3176 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3366) Compute best splits distributively in decision tree
[ https://issues.apache.org/jira/browse/SPARK-3366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120218#comment-14120218 ] Joseph K. Bradley commented on SPARK-3366: -- It is not really a bottleneck for tests I have done. E.g., on 2M examples, 3500 features, 15 workers EC2, about 95% of the time is spent during the aggregation. (This is a rough estimate based on timing results from several tests.) In terms of scaling, most factors should not make the driver be a bottleneck: * # examples: Increasing this will not increase the load on the driver. * # features: Increasing this will increase the load on the driver (linearly in # features), but it will also increase the amount of communication (also linearly). I suspect communication will outweigh the cost of computation in most cases. * # bins/splits: Same as # features. * tree size: Same as # features. Are there factors I have missed? I suspect we could save some time by doing a distributed best splits computation, but I do not think it is high priority. Compute best splits distributively in decision tree --- Key: SPARK-3366 URL: https://issues.apache.org/jira/browse/SPARK-3366 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng The current implementation computes all best splits locally on the driver, which makes the driver a bottleneck for both communication and computation. It would be nice if we can compute the best splits distributively. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3327) Make broadcasted value mutable for caching useful information
[ https://issues.apache.org/jira/browse/SPARK-3327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-3327. Resolution: Won't Fix Make broadcasted value mutable for caching useful information - Key: SPARK-3327 URL: https://issues.apache.org/jira/browse/SPARK-3327 Project: Spark Issue Type: New Feature Reporter: Liang-Chi Hsieh When implementing some algorithms, it is helpful to be able to cache some useful information for later use. Specifically, we would like to perform operation A on each partition of the data. Some variables are updated. Then we want to run operation B on the data too. Operation B uses the variables updated by operation A. One of the examples is the Liblinear on Spark from Dr. Lin. They discuss the problem in Section IV.D of the paper Large-scale Logistic Regression and Linear Support Vector Machines Using Spark. Currently broadcasted variables can partially satisfy that need. We can broadcast variables to reduce communication costs. However, because broadcasted variables cannot be modified, this doesn't solve the problem, and we may need to collect updated variables back to the master and broadcast them again before conducting the next data operation. I would like to add an interface to broadcasted variables to make them mutable so later data operations can use them again. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3380) DecisionTree: overflow and precision in aggregation
Joseph K. Bradley created SPARK-3380: Summary: DecisionTree: overflow and precision in aggregation Key: SPARK-3380 URL: https://issues.apache.org/jira/browse/SPARK-3380 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.1.0 Reporter: Joseph K. Bradley DecisionTree does not check for overflows or loss of precision while aggregating sufficient statistics (binAggregates). It uses Double, which may be a problem for DecisionTree regression since the variance calculation could blow up. At the least, it could check for overflow and renormalize as needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
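As an illustration of the "check for overflow and renormalize" idea (this is a standard numerically stable update, not the DecisionTree implementation), the variance sufficient statistics can be kept as count/mean/M2 instead of raw sums of squares:
{code}
// Sketch of a numerically stable (Welford-style) aggregate for variance.
// The case class and field names are illustrative, not from DecisionTree.
case class VarianceStats(count: Long, mean: Double, m2: Double) {
  def add(x: Double): VarianceStats = {
    val n = count + 1
    val delta = x - mean
    val newMean = mean + delta / n
    VarianceStats(n, newMean, m2 + delta * (x - newMean))
  }
  def variance: Double = if (count > 0) m2 / count else 0.0
}
{code}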
[jira] [Created] (SPARK-3381) DecisionTree: eliminate bins for unordered features
Joseph K. Bradley created SPARK-3381: Summary: DecisionTree: eliminate bins for unordered features Key: SPARK-3381 URL: https://issues.apache.org/jira/browse/SPARK-3381 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.1.0 Reporter: Joseph K. Bradley Priority: Trivial Code simplification: DecisionTree currently allocates bins for unordered features (in findSplitsBins). However, those bins are not needed; only the splits are required. This change will require modifying findSplitsBins, as well as modifying a few other functions to use splits instead of bins. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3327) Make broadcasted value mutable for caching useful information
[ https://issues.apache.org/jira/browse/SPARK-3327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120223#comment-14120223 ] Reynold Xin commented on SPARK-3327: Closing this as won't fix. See comments in the pull request for more information: https://github.com/apache/spark/pull/2217 Make broadcasted value mutable for caching useful information - Key: SPARK-3327 URL: https://issues.apache.org/jira/browse/SPARK-3327 Project: Spark Issue Type: New Feature Reporter: Liang-Chi Hsieh When implementing some algorithms, it is helpful to be able to cache some useful information for later use. Specifically, we would like to perform operation A on each partition of the data. Some variables are updated. Then we want to run operation B on the data too. Operation B uses the variables updated by operation A. One of the examples is the Liblinear on Spark from Dr. Lin. They discuss the problem in Section IV.D of the paper Large-scale Logistic Regression and Linear Support Vector Machines Using Spark. Currently broadcasted variables can partially satisfy that need. We can broadcast variables to reduce communication costs. However, because broadcasted variables cannot be modified, this doesn't solve the problem, and we may need to collect updated variables back to the master and broadcast them again before conducting the next data operation. I would like to add an interface to broadcasted variables to make them mutable so later data operations can use them again. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3382) GradientDescent convergence tolerance
Joseph K. Bradley created SPARK-3382: Summary: GradientDescent convergence tolerance Key: SPARK-3382 URL: https://issues.apache.org/jira/browse/SPARK-3382 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.1.0 Reporter: Joseph K. Bradley Priority: Minor GradientDescent should support a convergence tolerance setting. In general, for optimization, convergence tolerance should be preferred over a limit on the number of iterations since it is a somewhat data-adaptive or data-specific convergence criterion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
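A convergence check of the kind being proposed might look roughly like the following (a sketch only; the parameter name convergenceTol and the relative-change criterion are assumptions, not a final API):
{code}
import breeze.linalg.{DenseVector, norm}

// Sketch: stop iterating when the weight vector changes by less than a
// relative tolerance, rather than always running the full iteration budget.
def isConverged(prev: DenseVector[Double], curr: DenseVector[Double], convergenceTol: Double): Boolean =
  norm(curr - prev) < convergenceTol * math.max(norm(prev), 1.0)
{code}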
[jira] [Resolved] (SPARK-3309) Put all public API in __all__
[ https://issues.apache.org/jira/browse/SPARK-3309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3309. --- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2205 [https://github.com/apache/spark/pull/2205] Put all public API in __all__ - Key: SPARK-3309 URL: https://issues.apache.org/jira/browse/SPARK-3309 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Davies Liu Assignee: Davies Liu Fix For: 1.2.0 __all__ could be used to define a collection of public interfaces in pyspark. Also put all the public interfaces in top module of pyspark (pyspark.__init__.py). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3383) DecisionTree aggregate size could be smaller
Joseph K. Bradley created SPARK-3383: Summary: DecisionTree aggregate size could be smaller Key: SPARK-3383 URL: https://issues.apache.org/jira/browse/SPARK-3383 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.1.0 Reporter: Joseph K. Bradley Priority: Minor Storage and communication optimization: DecisionTree aggregate statistics could store less data (described below). The savings would be significant for datasets with many low-arity categorical features (binary features, or unordered categorical features). Savings would be negligible for continuous features. DecisionTree stores a vector of sufficient statistics for each (node, feature, bin). We could store 1 fewer bin per (node, feature): For a given (node, feature), if we store these vectors for all but the last bin, and also store the total statistics for each node, then we could compute the statistics for the last bin. For binary and unordered categorical features, this would cut in half the number of bins to store and communicate. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
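The proposal relies on a simple identity: the omitted bin's statistics are the node totals minus the stored bins. A toy sketch of the reconstruction (the flat-array layout here is made up purely for illustration):
{code}
// Sketch: recover the dropped last bin from the per-node totals and the
// stored bins. The array layout is an illustrative assumption.
def lastBinStats(nodeTotals: Array[Double], storedBins: Seq[Array[Double]]): Array[Double] = {
  val last = nodeTotals.clone()
  for (bin <- storedBins; j <- bin.indices) {
    last(j) -= bin(j)
  }
  last
}
{code}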
[jira] [Created] (SPARK-3384) Potential thread unsafe Breeze vector addition in KMeans
RJ Nowling created SPARK-3384: - Summary: Potential thread unsafe Breeze vector addition in KMeans Key: SPARK-3384 URL: https://issues.apache.org/jira/browse/SPARK-3384 Project: Spark Issue Type: Bug Components: MLlib Reporter: RJ Nowling In the KMeans clustering implementation, the Breeze vectors are accumulated using +=: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L162 This is potentially a thread unsafe operation. (This is what I observed in local testing.) I suggest changing the += to + -- a new object will be allocated but it will be thread safe since it won't write to an old location accessed by multiple threads. Further testing is required to reproduce and verify. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3384) Potential thread unsafe Breeze vector addition in KMeans
[ https://issues.apache.org/jira/browse/SPARK-3384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] RJ Nowling updated SPARK-3384: -- Description: In the KMeans clustering implementation, the Breeze vectors are accumulated using +=. For example, https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L162 This is potentially a thread unsafe operation. (This is what I observed in local testing.) I suggest changing the += to + -- a new object will be allocated but it will be thread safe since it won't write to an old location accessed by multiple threads. Further testing is required to reproduce and verify. was: In the KMeans clustering implementation, the Breeze vectors are accumulated using +=: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L162 This is potentially a thread unsafe operation. (This is what I observed in local testing.) I suggest changing the += to + -- a new object will be allocated but it will be thread safe since it won't write to an old location accessed by multiple threads. Further testing is required to reproduce and verify. Potential thread unsafe Breeze vector addition in KMeans Key: SPARK-3384 URL: https://issues.apache.org/jira/browse/SPARK-3384 Project: Spark Issue Type: Bug Components: MLlib Reporter: RJ Nowling In the KMeans clustering implementation, the Breeze vectors are accumulated using +=. For example, https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L162 This is potentially a thread unsafe operation. (This is what I observed in local testing.) I suggest changing the += to + -- a new object will be allocated but it will be thread safe since it won't write to an old location accessed by multiple threads. Further testing is required to reproduce and verify. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
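For reference, the two accumulation styles under discussion, side by side (a toy sketch with Breeze dense vectors; whether the in-place form is actually unsafe in KMeans is exactly what still needs to be verified):
{code}
import breeze.linalg.DenseVector

var sum = DenseVector.zeros[Double](3)
val v = DenseVector(1.0, 2.0, 3.0)

// In-place accumulation: mutates sum's underlying array. Cheap, but only
// safe if no other thread can share or observe that array.
sum += v

// Allocating accumulation: builds a fresh vector each time, so previously
// shared references are never overwritten, at the cost of extra allocation.
sum = sum + v
{code}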
[jira] [Commented] (SPARK-2718) YARN does not handle spark configs with quotes or backslashes
[ https://issues.apache.org/jira/browse/SPARK-2718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120289#comment-14120289 ] Andrew Or commented on SPARK-2718: -- Yes, thanks for the reminder. YARN does not handle spark configs with quotes or backslashes - Key: SPARK-2718 URL: https://issues.apache.org/jira/browse/SPARK-2718 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.2 Reporter: Andrew Or Assignee: Marcelo Vanzin Priority: Minor Fix For: 1.1.0 Say we have the following config: {code} spark.app.name spark shell with spaces and quotes and \ backslashes \ {code} This works in standalone mode but not in YARN mode. This is because standalone mode uses Java's ProcessBuilder, which handles these cases nicely, but YARN mode uses org.apache.hadoop.yarn.api.records.ContainerLaunchContext, which does not. As a result, submitting an application to YARN with the given config leads to the following exception: {code} line 0: unexpected EOF while looking for matching `' syntax error: unexpected end of file at org.apache.hadoop.util.Shell.runCommand(Shell.java:505) at org.apache.hadoop.util.Shell.run(Shell.java:418) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650) ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-2718) YARN does not handle spark configs with quotes or backslashes
[ https://issues.apache.org/jira/browse/SPARK-2718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-2718. Resolution: Fixed YARN does not handle spark configs with quotes or backslashes - Key: SPARK-2718 URL: https://issues.apache.org/jira/browse/SPARK-2718 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.2 Reporter: Andrew Or Assignee: Marcelo Vanzin Priority: Minor Fix For: 1.1.0 Say we have the following config: {code} spark.app.name spark shell with spaces and quotes and \ backslashes \ {code} This works in standalone mode but not in YARN mode. This is because standalone mode uses Java's ProcessBuilder, which handles these cases nicely, but YARN mode uses org.apache.hadoop.yarn.api.records.ContainerLaunchContext, which does not. As a result, submitting an application to YARN with the given config leads to the following exception: {code} line 0: unexpected EOF while looking for matching `' syntax error: unexpected end of file at org.apache.hadoop.util.Shell.runCommand(Shell.java:505) at org.apache.hadoop.util.Shell.run(Shell.java:418) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650) ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-3216) Spark-shell is broken for branch-1.0
[ https://issues.apache.org/jira/browse/SPARK-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-3216. Resolution: Fixed Assignee: Andrew Or Spark-shell is broken for branch-1.0 Key: SPARK-3216 URL: https://issues.apache.org/jira/browse/SPARK-3216 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker This fails when EC2 tries to clone the most recent version of Spark from branch-1.0. This does not actually affect any released distributions, and so I did not set the affected/fix/target versions. I marked this a blocker because this is completely broken, but it is technically not blocking anything. This was caused by https://github.com/apache/spark/pull/1831, which broke spark-shell. The follow-up fix in https://github.com/apache/spark/pull/1825 was only merged into branch-1.1 and master, but not branch-1.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3384) Potential thread unsafe Breeze vector addition in KMeans
[ https://issues.apache.org/jira/browse/SPARK-3384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120294#comment-14120294 ] RJ Nowling commented on SPARK-3384: --- Xiangrui Meng I'll try to get a code example together in the next couple days. Even if Spark itself is thread safe, I would re-iterate that it is easy to make the mistake of using += in the wrong place. I suggest that we should frown upon that behavior, document it when we use it, and maybe even add checks for the presence of += with Breeze vectors in the tests so we can flag it. Potential thread unsafe Breeze vector addition in KMeans Key: SPARK-3384 URL: https://issues.apache.org/jira/browse/SPARK-3384 Project: Spark Issue Type: Bug Components: MLlib Reporter: RJ Nowling In the KMeans clustering implementation, the Breeze vectors are accumulated using +=. For example, https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L162 This is potentially a thread unsafe operation. (This is what I observed in local testing.) I suggest changing the += to + -- a new object will be allocated but it will be thread safe since it won't write to an old location accessed by multiple threads. Further testing is required to reproduce and verify. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3384) Potential thread unsafe Breeze vector addition in KMeans
[ https://issues.apache.org/jira/browse/SPARK-3384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120314#comment-14120314 ] Sean Owen commented on SPARK-3384: -- [~rnowling] I think it is fairly important for speed in that section of code, though. Using mutable data structures is not a problem if done correctly and for the right reason. Potential thread unsafe Breeze vector addition in KMeans Key: SPARK-3384 URL: https://issues.apache.org/jira/browse/SPARK-3384 Project: Spark Issue Type: Bug Components: MLlib Reporter: RJ Nowling In the KMeans clustering implementation, the Breeze vectors are accumulated using +=. For example, https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L162 This is potentially a thread unsafe operation. (This is what I observed in local testing.) I suggest changing the += to + -- a new object will be allocated but it will be thread safe since it won't write to an old location accessed by multiple threads. Further testing is required to reproduce and verify. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3385) Improve shuffle performance
Reynold Xin created SPARK-3385: -- Summary: Improve shuffle performance Key: SPARK-3385 URL: https://issues.apache.org/jira/browse/SPARK-3385 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Just a ticket to track various efforts related to improving shuffle in Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3219) K-Means clusterer should support Bregman distance functions
[ https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120325#comment-14120325 ] Derrick Burns commented on SPARK-3219: -- Great! You can find my work here: https://github.com/derrickburns/generalized-kmeans-clustering.git. I should warn you that I rewrote much of the original Spark clusterer because the original is too tightly coupled to using the Euclidean norm and does not allow one to identify efficiently which points belong to which clusters. I have tested this version extensively. You will notice a package called com.rincaro.clusterer.metrics. Please take a look at the two files EuOps.scala and FastEuclideansOps.scala. They both implement the Euclidean norm. However, one is much faster than the other by using the same algebraic transformations that the Spark version uses. This demonstrates that it is possible to be efficient while not being tightly coupled. One could easily re-implement FastEuclideanOps using Breeze or BLAS without affecting the core K-means implementation. Not included in this project is another distance function that I have implemented: the Kullback-Leibler distance function, a.k.a. relative entropy. In my implementation, I also perform algebraic transformations to expedite the computation, resulting in a distance computation that is even faster than the fast Euclidean norm. Let me know if this is useful to you. K-Means clusterer should support Bregman distance functions --- Key: SPARK-3219 URL: https://issues.apache.org/jira/browse/SPARK-3219 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Derrick Burns Assignee: Derrick Burns The K-Means clusterer supports the Euclidean distance metric. However, it is rather straightforward to support Bregman (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) distance functions, which would increase the utility of the clusterer tremendously. I have modified the clusterer to support pluggable distance functions. However, I notice that there are hundreds of outstanding pull requests. If someone is willing to work with me to sponsor the work through the process, I will create a pull request. Otherwise, I will just keep my own fork. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
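The linked repository is not reproduced here, but the general shape of a pluggable divergence interface is easy to sketch. The trait and object names below are hypothetical illustrations (they are not taken from com.rincaro.clusterer.metrics or from MLlib); they only show how squared Euclidean distance and KL divergence can sit behind the same interface so the clusterer core never mentions a specific norm.
{code}
// Hypothetical names, for illustration only.
trait DivergenceOps extends Serializable {
  /** Non-negative divergence from a point to a cluster center. */
  def distance(point: Array[Double], center: Array[Double]): Double
}

/** Squared Euclidean distance, the Bregman divergence generated by ||x||^2. */
object SquaredEuclideanOps extends DivergenceOps {
  def distance(p: Array[Double], c: Array[Double]): Double = {
    var i = 0
    var sum = 0.0
    while (i < p.length) { val d = p(i) - c(i); sum += d * d; i += 1 }
    sum
  }
}

/** Kullback-Leibler divergence (relative entropy), for strictly positive probability vectors. */
object KullbackLeiblerOps extends DivergenceOps {
  def distance(p: Array[Double], c: Array[Double]): Double = {
    var i = 0
    var sum = 0.0
    while (i < p.length) { sum += p(i) * math.log(p(i) / c(i)); i += 1 }
    sum
  }
}

object ClustererCore {
  // The assignment step is written against the trait, so swapping in a new
  // Bregman divergence does not touch this code.
  def closestCenter(ops: DivergenceOps, centers: Array[Array[Double]], point: Array[Double]): Int =
    centers.indices.minBy(i => ops.distance(point, centers(i)))
}
{code}
The algebraic transformations mentioned above (for example, expanding ||x - c||^2 and caching the center norms) would live inside a faster implementation of the same trait, leaving the core untouched.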
[jira] [Created] (SPARK-3386) Reuse serializer and serializer buffer in shuffle block iterator
Reynold Xin created SPARK-3386: -- Summary: Reuse serializer and serializer buffer in shuffle block iterator Key: SPARK-3386 URL: https://issues.apache.org/jira/browse/SPARK-3386 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin Assignee: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3216) Spark-shell is broken for branch-1.0
[ https://issues.apache.org/jira/browse/SPARK-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120347#comment-14120347 ] Patrick Wendell commented on SPARK-3216: [~andrewor14] for this you should mark it as a blocker for target version 1.0.3 and also put fix version 1.0.3 when it is merged. Spark-shell is broken for branch-1.0 Key: SPARK-3216 URL: https://issues.apache.org/jira/browse/SPARK-3216 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker This fails when EC2 tries to clone the most recent version of Spark from branch-1.0. This does not actually affect any released distributions, and so I did not set the affected/fix/target versions. I marked this a blocker because this is completely broken, but it is technically not blocking anything. This was caused by https://github.com/apache/spark/pull/1831, which broke spark-shell. The follow-up fix in https://github.com/apache/spark/pull/1825 was only merged into branch-1.1 and master, but not branch-1.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3216) Spark-shell is broken for branch-1.0
[ https://issues.apache.org/jira/browse/SPARK-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3216: --- Target Version/s: 1.0.3 Spark-shell is broken for branch-1.0 Key: SPARK-3216 URL: https://issues.apache.org/jira/browse/SPARK-3216 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker Fix For: 1.0.3 This fails when EC2 tries to clone the most recent version of Spark from branch-1.0. This does not actually affect any released distributions, and so I did not set the affected/fix/target versions. I marked this a blocker because this is completely broken, but it is technically not blocking anything. This was caused by https://github.com/apache/spark/pull/1831, which broke spark-shell. The follow-up fix in https://github.com/apache/spark/pull/1825 was only merged into branch-1.1 and master, but not branch-1.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-3216) Spark-shell is broken for branch-1.0
[ https://issues.apache.org/jira/browse/SPARK-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell reopened SPARK-3216: Spark-shell is broken for branch-1.0 Key: SPARK-3216 URL: https://issues.apache.org/jira/browse/SPARK-3216 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker Fix For: 1.0.3 This fails when EC2 tries to clone the most recent version of Spark from branch-1.0. This does not actually affect any released distributions, and so I did not set the affected/fix/target versions. I marked this a blocker because this is completely broken, but it is technically not blocking anything. This was caused by https://github.com/apache/spark/pull/1831, which broke spark-shell. The follow-up fix in https://github.com/apache/spark/pull/1825 was only merged into branch-1.1 and master, but not branch-1.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3216) Spark-shell is broken for branch-1.0
[ https://issues.apache.org/jira/browse/SPARK-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3216: --- Fix Version/s: 1.0.3 Spark-shell is broken for branch-1.0 Key: SPARK-3216 URL: https://issues.apache.org/jira/browse/SPARK-3216 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker Fix For: 1.0.3 This fails when EC2 tries to clone the most recent version of Spark from branch-1.0. This does not actually affect any released distributions, and so I did not set the affected/fix/target versions. I marked this a blocker because this is completely broken, but it is technically not blocking anything. This was caused by https://github.com/apache/spark/pull/1831, which broke spark-shell. The follow-up fix in https://github.com/apache/spark/pull/1825 was only merged into branch-1.1 and master, but not branch-1.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3216) Spark-shell is broken for branch-1.0
[ https://issues.apache.org/jira/browse/SPARK-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3216. Resolution: Fixed Spark-shell is broken for branch-1.0 Key: SPARK-3216 URL: https://issues.apache.org/jira/browse/SPARK-3216 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker Fix For: 1.0.3 This fails when EC2 tries to clone the most recent version of Spark from branch-1.0. This does not actually affect any released distributions, and so I did not set the affected/fix/target versions. I marked this a blocker because this is completely broken, but it is technically not blocking anything. This was caused by https://github.com/apache/spark/pull/1831, which broke spark-shell. The follow-up fix in https://github.com/apache/spark/pull/1825 was only merged into branch-1.1 and master, but not branch-1.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3358) PySpark worker fork()ing performance regression in m3.* / PVM instances
[ https://issues.apache.org/jira/browse/SPARK-3358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120354#comment-14120354 ] Josh Rosen commented on SPARK-3358: --- Agreed. Long term, I think it would be better to address the causes behind why we need to fork so many processes. PySpark worker fork()ing performance regression in m3.* / PVM instances --- Key: SPARK-3358 URL: https://issues.apache.org/jira/browse/SPARK-3358 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Environment: m3.* instances on EC2 Reporter: Josh Rosen SPARK-2764 (and some followup commits) simplified PySpark's worker process structure by removing an intermediate pool of processes forked by daemon.py. Previously, daemon.py forked a fixed-size pool of processes that shared a socket and handled worker launch requests from Java. After my patch, this intermediate pool was removed and launch requests are handled directly in daemon.py. Unfortunately, this seems to have increased PySpark task launch latency when running on m3* class instances in EC2. Most of this difference can be attributed to m3 instances' more expensive fork() system calls. I tried the following microbenchmark on m3.xlarge and r3.xlarge instances:
{code}
import os
for x in range(1000):
    if os.fork() == 0:
        exit()
{code}
On the r3.xlarge instance:
{code}
real    0m0.761s
user    0m0.008s
sys     0m0.144s
{code}
And on m3.xlarge:
{code}
real    0m1.699s
user    0m0.012s
sys     0m1.008s
{code}
I think this is due to HVM vs PVM EC2 instances using different virtualization technologies with different fork costs. It may be the case that this performance difference only appears in certain microbenchmarks and is masked by other performance improvements in PySpark, such as improvements to large group-bys. I'm in the process of re-running spark-perf benchmarks on m3 instances in order to confirm whether this impacts more realistic jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3375) spark on yarn container allocation issues
[ https://issues.apache.org/jira/browse/SPARK-3375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-3375: - Summary: spark on yarn container allocation issues (was: spark on yarn alpha container allocation issues) spark on yarn container allocation issues - Key: SPARK-3375 URL: https://issues.apache.org/jira/browse/SPARK-3375 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Thomas Graves Priority: Blocker It looks like if yarn doesn't get the containers immediately it stops asking for them and the yarn application hangs with never getting any executors. This was introduced by https://github.com/apache/spark/pull/2169 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3375) spark on yarn alpha container allocation issues
[ https://issues.apache.org/jira/browse/SPARK-3375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120361#comment-14120361 ] Thomas Graves commented on SPARK-3375: -- OK, so the real issue here is that we are sending the number of containers as 0 after we send the original request for X. On the YARN side this clears out the original request. I would expect this to affect 2.x also. I think the original 0.23 code kept just sending max-running, whereas now we are sending max-pending-running. For a ping we should just send empty asks. spark on yarn alpha container allocation issues --- Key: SPARK-3375 URL: https://issues.apache.org/jira/browse/SPARK-3375 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Thomas Graves Priority: Blocker It looks like if yarn doesn't get the containers immediately it stops asking for them and the yarn application hangs with never getting any executors. This was introduced by https://github.com/apache/spark/pull/2169 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
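For context on the "empty asks" suggestion, here is a rough sketch of the allocation-loop shape being described, written against the stable AMRMClient API purely for illustration (the ticket itself concerns the alpha API, and the resource size, priority, and method names below are assumptions rather than Spark's actual YarnAllocator code):
{code}
import org.apache.hadoop.yarn.api.records.{Priority, Resource}
import org.apache.hadoop.yarn.client.api.AMRMClient
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest

object AllocatorSketch {
  // Called on every AM heartbeat. `missing` is how many executors are still
  // needed beyond what has already been requested.
  def heartbeat(amClient: AMRMClient[ContainerRequest], missing: Int): Unit = {
    if (missing > 0) {
      // Only register requests for what is newly missing; earlier requests
      // remain queued on the ResourceManager.
      val capability = Resource.newInstance(1024, 1)
      val priority = Priority.newInstance(1)
      (1 to missing).foreach { _ =>
        amClient.addContainerRequest(new ContainerRequest(capability, null, null, priority))
      }
    }
    // When nothing new is needed, this is an empty ask: allocate() acts as the
    // heartbeat and leaves previously registered requests pending, whereas
    // explicitly re-sending an ask of size 0 would clear them on the RM side,
    // which is the hang described in this ticket.
    val response = amClient.allocate(0.1f)
    println(s"Granted ${response.getAllocatedContainers.size()} containers this round")
    // ... hand the granted containers to the executor launcher ...
  }
}
{code}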
[jira] [Updated] (SPARK-3216) Spark-shell is broken for branch-1.0
[ https://issues.apache.org/jira/browse/SPARK-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3216: - Fix Version/s: (was: 1.0.2) Spark-shell is broken for branch-1.0 Key: SPARK-3216 URL: https://issues.apache.org/jira/browse/SPARK-3216 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker Fix For: 1.0.3 This fails when EC2 tries to clone the most recent version of Spark from branch-1.0. This does not actually affect any released distributions, and so I did not set the affected/fix/target versions. I marked this a blocker because this is completely broken, but it is technically not blocking anything. This was caused by https://github.com/apache/spark/pull/1831, which broke spark-shell. The follow-up fix in https://github.com/apache/spark/pull/1825 was only merged into branch-1.1 and master, but not branch-1.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3216) Spark-shell is broken for branch-1.0
[ https://issues.apache.org/jira/browse/SPARK-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3216: - Fix Version/s: 1.0.2 Spark-shell is broken for branch-1.0 Key: SPARK-3216 URL: https://issues.apache.org/jira/browse/SPARK-3216 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker Fix For: 1.0.3 This fails when EC2 tries to clone the most recent version of Spark from branch-1.0. This does not actually affect any released distributions, and so I did not set the affected/fix/target versions. I marked this a blocker because this is completely broken, but it is technically not blocking anything. This was caused by https://github.com/apache/spark/pull/1831, which broke spark-shell. The follow-up fix in https://github.com/apache/spark/pull/1825 was only merged into branch-1.1 and master, but not branch-1.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3384) Potential thread unsafe Breeze vector addition in KMeans
[ https://issues.apache.org/jira/browse/SPARK-3384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120432#comment-14120432 ] Evan Sparks commented on SPARK-3384: I agree with Sean. Avoiding the costly penalty of object allocation overhead is important to avoid here. As far as I can tell, we are using reduceByKey in the prescribed way (see: http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3cecd3c09a-50f3-4683-a639-daddc4101...@gmail.com%3E) mutating the left input. I don't believe that spark needs this mutation to be thread-safe, because it executes the combine sequentially on all workers, and then reduces sequentially on the master, but I could be wrong. Potential thread unsafe Breeze vector addition in KMeans Key: SPARK-3384 URL: https://issues.apache.org/jira/browse/SPARK-3384 Project: Spark Issue Type: Bug Components: MLlib Reporter: RJ Nowling In the KMeans clustering implementation, the Breeze vectors are accumulated using +=. For example, https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L162 This is potentially a thread unsafe operation. (This is what I observed in local testing.) I suggest changing the += to + -- a new object will be allocated but it will be thread safe since it won't write to an old location accessed by multiple threads. Further testing is required to reproduce and verify. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3384) Potential thread unsafe Breeze vector addition in KMeans
[ https://issues.apache.org/jira/browse/SPARK-3384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120432#comment-14120432 ] Evan Sparks edited comment on SPARK-3384 at 9/3/14 8:54 PM: I agree with Sean. Avoiding the costly penalty of object allocation overhead is important here. As far as I can tell, we are using reduceByKey in the prescribed way (see: http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3cecd3c09a-50f3-4683-a639-daddc4101...@gmail.com%3E) mutating the left input. I don't believe that spark needs this mutation to be thread-safe, because it executes the combine sequentially on all workers, and then reduces sequentially on the master, but I could be wrong. was (Author: sparks): I agree with Sean. Avoiding the costly penalty of object allocation overhead is important to avoid here. As far as I can tell, we are using reduceByKey in the prescribed way (see: http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3cecd3c09a-50f3-4683-a639-daddc4101...@gmail.com%3E) mutating the left input. I don't believe that spark needs this mutation to be thread-safe, because it executes the combine sequentially on all workers, and then reduces sequentially on the master, but I could be wrong. Potential thread unsafe Breeze vector addition in KMeans Key: SPARK-3384 URL: https://issues.apache.org/jira/browse/SPARK-3384 Project: Spark Issue Type: Bug Components: MLlib Reporter: RJ Nowling In the KMeans clustering implementation, the Breeze vectors are accumulated using +=. For example, https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L162 This is potentially a thread unsafe operation. (This is what I observed in local testing.) I suggest changing the += to + -- a new object will be allocated but it will be thread safe since it won't write to an old location accessed by multiple threads. Further testing is required to reproduce and verify. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
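For reference, the reduceByKey pattern being discussed (mutating the left argument of the reduce function to avoid a per-record allocation) looks roughly like the sketch below. This is an illustration with made-up data, not the actual KMeans.scala code; MLlib reduces (sum vector, count) pairs per cluster in essentially this shape.
{code}
import breeze.linalg.DenseVector
import org.apache.spark.{SparkConf, SparkContext}

object ReduceByKeyMutationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sketch").setMaster("local[2]"))

    // (clusterId, (runningSum, count)) pairs, as produced by an assignment step.
    val assigned = sc.parallelize(Seq(
      (0, (DenseVector(1.0, 1.0), 1L)),
      (0, (DenseVector(2.0, 0.0), 1L)),
      (1, (DenseVector(5.0, 5.0), 1L))
    ))

    // The reduce function mutates its LEFT argument in place (sum += v) and
    // returns it, so no new vector is allocated per record. This relies on
    // Spark applying the function sequentially within a task and when merging
    // partitions, which is the property debated in the comments above.
    val totals = assigned.reduceByKey { case ((sum, count), (v, c)) =>
      sum += v
      (sum, count + c)
    }

    totals.collect().foreach { case (cluster, (sum, count)) =>
      println(s"cluster $cluster: sum=$sum count=$count")
    }
    sc.stop()
  }
}
{code}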
[jira] [Created] (SPARK-3388) Expose cluster applicationId, use it to serve history information
Marcelo Vanzin created SPARK-3388: - Summary: Expose cluster applicationId, use it to serve history information Key: SPARK-3388 URL: https://issues.apache.org/jira/browse/SPARK-3388 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Marcelo Vanzin This is a follow up to SPARK-2150. Currently Spark uses the application-generated log directory name as the URL for the history server. That's not very user-friendly since there's no way for the user to figure out that information otherwise. A more user-friendly approach would be to use the cluster-defined application id, when present. That also has the advantage of providing that information as metadata for the Spark app, so that someone can easily correlate information kept in Spark with that kept by the cluster manager. Open PR: https://github.com/apache/spark/pull/1218 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2419) Misc updates to streaming programming guide
[ https://issues.apache.org/jira/browse/SPARK-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120463#comment-14120463 ] Apache Spark commented on SPARK-2419: - User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/2254 Misc updates to streaming programming guide --- Key: SPARK-2419 URL: https://issues.apache.org/jira/browse/SPARK-2419 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical This JIRA collects together a number of small issues that should be added to the streaming programming guide:
- Receivers consume an executor slot; highlight the fact that # cores > # receivers is necessary
- Classes of spark-streaming-XYZ cannot be accessed from the Spark Shell
- Deploying and using spark-streaming-XYZ requires spark-streaming-XYZ.jar and its dependencies to be packaged with the application JAR
- Ordering and parallelism of the output operations
- Receivers should be serializable
- Add more information on how socketStream works: input stream => iterator function.
- New Flume and Kinesis stuff.
- Twitter4j version
- Design pattern: creating connections to external sinks (see the sketch below)
- Design pattern: multiple input streams
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
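The "creating connections to external sinks" item refers to the pattern sketched below (the SinkConnection trait and factory are hypothetical stand-ins for a real client library, not Spark APIs): the connection is created inside foreachPartition on the executor, once per partition, rather than on the driver or once per record.
{code}
import org.apache.spark.streaming.dstream.DStream

// Hypothetical sink API, used only to make the sketch self-contained.
trait SinkConnection { def send(record: String): Unit; def close(): Unit }
object SinkConnection {
  def open(): SinkConnection = new SinkConnection {
    def send(record: String): Unit = println(s"sending: $record")
    def close(): Unit = println("closing connection")
  }
}

object ExternalSinkSketch {
  def writeToSink(lines: DStream[String]): Unit = {
    lines.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        // Created on the executor, once per partition: a connection built on the
        // driver would have to be serialized, and one per record is too costly.
        val conn = SinkConnection.open()
        try {
          partition.foreach(conn.send)
        } finally {
          conn.close()
        }
      }
    }
  }
}
{code}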
[jira] [Resolved] (SPARK-3263) PR #720 broke GraphGenerator.logNormal
[ https://issues.apache.org/jira/browse/SPARK-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave resolved SPARK-3263. --- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 2168 [https://github.com/apache/spark/pull/2168] PR #720 broke GraphGenerator.logNormal -- Key: SPARK-3263 URL: https://issues.apache.org/jira/browse/SPARK-3263 Project: Spark Issue Type: Bug Components: GraphX Reporter: RJ Nowling Fix For: 1.3.0 PR #720 made multiple changes to GraphGenerator.logNormalGraph including: * Replacing the call to functions for generating random vertices and edges with in-line implementations with different equations * Hard-coding of RNG seeds so that method now generates the same graph for a given number of vertices, edges, mu, and sigma -- user is not able to override seed or specify that seed should be randomly generated. * Backwards-incompatible change to logNormalGraph signature with introduction of new required parameter. * Failed to update scala docs and programming guide for API changes I also see that PR #720 added a Synthetic Benchmark in the examples. Based on reading the Pregel paper, I believe the in-line functions are incorrect. I proposed to: * Removing the in-line calls * Adding a seed for deterministic behavior (when desired) * Keeping the number of partitions parameter. * Updating the synthetic benchmark example -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3373) Filtering operations should optionally rebuild routing tables
[ https://issues.apache.org/jira/browse/SPARK-3373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave updated SPARK-3373: -- Summary: Filtering operations should optionally rebuild routing tables (was: trim some useless informations of VertexRDD in some cases) Filtering operations should optionally rebuild routing tables - Key: SPARK-3373 URL: https://issues.apache.org/jira/browse/SPARK-3373 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 1.0.0, 1.0.2 Reporter: uncleGen Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3373) Filtering operations should optionally rebuild routing tables
[ https://issues.apache.org/jira/browse/SPARK-3373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave updated SPARK-3373: -- Description: Graph operations that filter the edges (subgraph, mask, groupEdges) currently reuse the existing routing table to avoid the shuffle which would be required to build a new one. However, this may be inefficient when the filtering is highly selective. Vertices will be sent to more partitions than necessary, and the extra routing information may take up excessive space. Filtering operations should optionally rebuild routing tables - Key: SPARK-3373 URL: https://issues.apache.org/jira/browse/SPARK-3373 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 1.0.0, 1.0.2 Reporter: uncleGen Priority: Minor Graph operations that filter the edges (subgraph, mask, groupEdges) currently reuse the existing routing table to avoid the shuffle which would be required to build a new one. However, this may be inefficient when the filtering is highly selective. Vertices will be sent to more partitions than necessary, and the extra routing information may take up excessive space. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
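As an illustration of the selectivity concern (a sketch, not taken from the eventual patch), the subgraph call below keeps only a tiny fraction of the edges, yet the filtered graph still carries the routing information that was built for the full edge set:
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object SelectiveSubgraphSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sketch").setMaster("local[2]"))

    // A synthetic graph with ~100k edges over ~100k source vertices.
    val edges = sc.parallelize((1L until 100000L).map(i => Edge(i, (i % 1000) + 1, i.toDouble)))
    val graph = Graph.fromEdges(edges, defaultValue = 0)

    // Keep only a handful of edges. The filtered graph reuses the original
    // routing tables, so vertices are still replicated to partitions that no
    // longer hold any of their edges -- the inefficiency this ticket proposes
    // to (optionally) fix by rebuilding the routing tables.
    val sparse = graph.subgraph(epred = triplet => triplet.attr > 99990.0)
    println(s"edges before: ${graph.numEdges}, after: ${sparse.numEdges}")
    sc.stop()
  }
}
{code}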
[jira] [Commented] (SPARK-3206) Error in PageRank values
[ https://issues.apache.org/jira/browse/SPARK-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120525#comment-14120525 ] Ankur Dave commented on SPARK-3206: --- I think the `run` method is incorrect due to a bug in Pregel. I wrote a standalone (non-Pregel) version of PageRank that provides the functionality of `run` without using Pregel: https://github.com/ankurdave/spark/blob/low-level-PageRank/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L274 I'd like to run that and check the results against these ones when I get a chance. Error in PageRank values Key: SPARK-3206 URL: https://issues.apache.org/jira/browse/SPARK-3206 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.0.2 Environment: UNIX with Hadoop Reporter: Peter Fontana I have found a small example where the PageRank values using run and runUntilConvergence differ quite a bit. I am running the PageRank module on the following graph:
Edge Table:
|| Node1 || Node2 ||
| 1 | 2 |
| 1 | 3 |
| 3 | 2 |
| 3 | 4 |
| 5 | 3 |
| 6 | 7 |
| 7 | 8 |
| 8 | 9 |
| 9 | 7 |
Node Table (note the extra node):
|| NodeID || NodeName ||
| a | 1 |
| b | 2 |
| c | 3 |
| d | 4 |
| e | 5 |
| f | 6 |
| g | 7 |
| h | 8 |
| i | 9 |
| j.longaddress.com | 10 |
with a default resetProb of 0.15. When I compute the PageRank with runUntilConvergence, running {{val ranks = PageRank.runUntilConvergence(graph,0.0001).vertices}} I get the ranks
(4,0.295031247) (1,0.15) (6,0.15) (3,0.341249994) (7,1.3299054047985106) (9,1.2381240056453071) (8,1.2803346052504254) (10,0.15) (5,0.15) (2,0.358781244)
However, when I run PageRank with the run() method, running {{val ranksI = PageRank.run(graph,100).vertices}} I get the page ranks
(4,0.295031247) (1,0.15) (6,0.15) (3,0.341249994) (7,0.999387662847) (9,0.999256447741) (8,0.999256447741) (10,0.15) (5,0.15) (2,0.295031247)
These are quite different, leading me to suspect that one of the PageRank methods is incorrect. I have examined the source, but I do not know what the correct fix is, or which set of values is correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
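The report is straightforward to reproduce with a few lines of GraphX. The sketch below builds the graph from the edge table above (with the isolated tenth vertex added explicitly) and prints both sets of ranks side by side; it is only a comparison harness that assumes the GraphX 1.0 PageRank signatures, and it does not assert which result is correct.
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.graphx.lib.PageRank

object PageRankComparisonSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sketch").setMaster("local[2]"))

    // Edge table from the report; vertex 10 has no edges, so it is added explicitly.
    val edges = sc.parallelize(Seq(
      (1L, 2L), (1L, 3L), (3L, 2L), (3L, 4L), (5L, 3L),
      (6L, 7L), (7L, 8L), (8L, 9L), (9L, 7L)
    ).map { case (src, dst) => Edge(src, dst, 1) })
    val vertices = sc.parallelize((1L to 10L).map(id => (id, 1)))
    val graph = Graph(vertices, edges)

    val byTolerance = PageRank.runUntilConvergence(graph, tol = 0.0001).vertices
    val byIterations = PageRank.run(graph, numIter = 100).vertices

    // Print: vertex id, tolerance-based rank, iteration-based rank.
    byTolerance.join(byIterations).collect().sortBy(_._1).foreach {
      case (id, (tolRank, iterRank)) => println(f"$id%2d  $tolRank%.6f  $iterRank%.6f")
    }
    sc.stop()
  }
}
{code}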
[jira] [Reopened] (SPARK-3123) override the setName function to set EdgeRDD's name manually just as VertexRDD does.
[ https://issues.apache.org/jira/browse/SPARK-3123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave reopened SPARK-3123: --- Reopening so I can change the status from Closed to Resolved. override the setName function to set EdgeRDD's name manually just as VertexRDD does. -- Key: SPARK-3123 URL: https://issues.apache.org/jira/browse/SPARK-3123 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 1.0.0, 1.0.2 Reporter: uncleGen Priority: Minor Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3123) override the setName function to set EdgeRDD's name manually just as VertexRDD does.
[ https://issues.apache.org/jira/browse/SPARK-3123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave resolved SPARK-3123. --- Resolution: Fixed Issue resolved by pull request 2033 https://github.com/apache/spark/pull/2033 override the setName function to set EdgeRDD's name manually just as VertexRDD does. -- Key: SPARK-3123 URL: https://issues.apache.org/jira/browse/SPARK-3123 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 1.0.0, 1.0.2 Reporter: uncleGen Priority: Minor Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-3123) override the setName function to set EdgeRDD's name manually just as VertexRDD does.
[ https://issues.apache.org/jira/browse/SPARK-3123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave updated SPARK-3123: -- Comment: was deleted (was: Reopening so I can change the status from Closed to Resolved.) override the setName function to set EdgeRDD's name manually just as VertexRDD does. -- Key: SPARK-3123 URL: https://issues.apache.org/jira/browse/SPARK-3123 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 1.0.0, 1.0.2 Reporter: uncleGen Priority: Minor Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2845) Add timestamp to BlockManager events
[ https://issues.apache.org/jira/browse/SPARK-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-2845. Resolution: Fixed Target Version/s: 1.2.0 Add timestamp to BlockManager events Key: SPARK-2845 URL: https://issues.apache.org/jira/browse/SPARK-2845 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Priority: Minor BlockManager events in SparkListener.scala do not have a timestamp; while not necessary for rendering the Spark UI, the extra information might be interesting for other tools that use the event data. Note that the same applies to lots of other events; it would be nice, at some point, to have all of them have a timestamp assigned at the point of creation of the event, so that network traffic / event queueing in the driver do not affect that information. But right now, I'm more interested in those particular events. :-) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
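For listeners written before this change, the only option was to stamp events on arrival, roughly as in the sketch below (a hedged illustration against the public SparkListener trait, not code from the patch); the arrival time includes network and listener-bus queueing delay, which is exactly what a creation-time field on the event avoids.
{code}
import org.apache.spark.scheduler.{SparkListener, SparkListenerBlockManagerAdded, SparkListenerBlockManagerRemoved}

// Sketch only: the listener has to stamp events itself on arrival, folding
// network and event-queue delay into the recorded time.
class BlockManagerTimingListener extends SparkListener {
  override def onBlockManagerAdded(event: SparkListenerBlockManagerAdded): Unit = {
    val observedAt = System.currentTimeMillis()  // arrival time, not creation time
    println(s"BlockManager ${event.blockManagerId} added, observed at $observedAt")
  }

  override def onBlockManagerRemoved(event: SparkListenerBlockManagerRemoved): Unit = {
    println(s"BlockManager ${event.blockManagerId} removed, observed at ${System.currentTimeMillis()}")
  }
}

// Registered on the driver with: sc.addSparkListener(new BlockManagerTimingListener)
{code}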
[jira] [Commented] (SPARK-2845) Add timestamp to BlockManager events
[ https://issues.apache.org/jira/browse/SPARK-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120533#comment-14120533 ] Andrew Or commented on SPARK-2845: -- Fixed by https://github.com/apache/spark/pull/654 Add timestamp to BlockManager events Key: SPARK-2845 URL: https://issues.apache.org/jira/browse/SPARK-2845 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Priority: Minor BlockManager events in SparkListener.scala do not have a timestamp; while not necessary for rendering the Spark UI, the extra information might be interesting for other tools that use the event data. Note that the same applies to lots of other events; it would be nice, at some point, to have all of them have a timestamp assigned at the point of creation of the event, so that network traffic / event queueing in the driver do not affect that information. But right now, I'm more interested in those particular events. :-) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2845) Add timestamp to BlockManager events
[ https://issues.apache.org/jira/browse/SPARK-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2845: - Affects Version/s: 1.1.0 Add timestamp to BlockManager events Key: SPARK-2845 URL: https://issues.apache.org/jira/browse/SPARK-2845 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Priority: Minor Fix For: 1.2.0 BlockManager events in SparkListener.scala do not have a timestamp; while not necessary for rendering the Spark UI, the extra information might be interesting for other tools that use the event data. Note that the same applies to lots of other events; it would be nice, at some point, to have all of them have a timestamp assigned at the point of creation of the event, so that network traffic / event queueing in the driver do not affect that information. But right now, I'm more interested in those particular events. :-) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2845) Add timestamp to BlockManager events
[ https://issues.apache.org/jira/browse/SPARK-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2845: - Fix Version/s: 1.2.0 Add timestamp to BlockManager events Key: SPARK-2845 URL: https://issues.apache.org/jira/browse/SPARK-2845 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Priority: Minor Fix For: 1.2.0 BlockManager events in SparkListener.scala do not have a timestamp; while not necessary for rendering the Spark UI, the extra information might be interesting for other tools that use the event data. Note that the same applies to lots of other events; it would be nice, at some point, to have all of them have a timestamp assigned at the point of creation of the event, so that network traffic / event queueing in the driver do not affect that information. But right now, I'm more interested in those particular events. :-) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3388) Expose cluster applicationId, use it to serve history information
[ https://issues.apache.org/jira/browse/SPARK-3388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-3388. -- Resolution: Fixed Fix Version/s: 1.2.0 Target Version/s: 1.2.0 Fixed by https://github.com/apache/spark/pull/1218 Expose cluster applicationId, use it to serve history information - Key: SPARK-3388 URL: https://issues.apache.org/jira/browse/SPARK-3388 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Fix For: 1.2.0 This is a follow up to SPARK-2150. Currently Spark uses the application-generated log directory name as the URL for the history server. That's not very user-friendly since there's no way for the user to figure out that information otherwise. A more user-friendly approach would be to use the cluster-defined application id, when present. That also has the advantage of providing that information as metadata for the Spark app, so that someone can easily correlate information kept in Spark with that kept by the cluster manager. Open PR: https://github.com/apache/spark/pull/1218 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3388) Expose cluster applicationId, use it to serve history information
[ https://issues.apache.org/jira/browse/SPARK-3388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3388: - Affects Version/s: 1.1.0 Expose cluster applicationId, use it to serve history information - Key: SPARK-3388 URL: https://issues.apache.org/jira/browse/SPARK-3388 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Fix For: 1.2.0 This is a follow up to SPARK-2150. Currently Spark uses the application-generated log directory name as the URL for the history server. That's not very user-friendly since there's no way for the user to figure out that information otherwise. A more user-friendly approach would be to use the cluster-defined application id, when present. That also has the advantage of providing that information as metadata for the Spark app, so that someone can easily correlate information kept in Spark with that kept by the cluster manager. Open PR: https://github.com/apache/spark/pull/1218 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3388) Expose cluster applicationId, use it to serve history information
[ https://issues.apache.org/jira/browse/SPARK-3388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3388: - Fix Version/s: (was: 1.1.0) Expose cluster applicationId, use it to serve history information - Key: SPARK-3388 URL: https://issues.apache.org/jira/browse/SPARK-3388 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Fix For: 1.2.0 This is a follow up to SPARK-2150. Currently Spark uses the application-generated log directory name as the URL for the history server. That's not very user-friendly since there's no way for the user to figure out that information otherwise. A more user-friendly approach would be to use the cluster-defined application id, when present. That also has the advantage of providing that information as metadata for the Spark app, so that someone can easily correlate information kept in Spark with that kept by the cluster manager. Open PR: https://github.com/apache/spark/pull/1218 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3388) Expose cluster applicationId, use it to serve history information
[ https://issues.apache.org/jira/browse/SPARK-3388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3388: - Fix Version/s: 1.1.0 Expose cluster applicationId, use it to serve history information - Key: SPARK-3388 URL: https://issues.apache.org/jira/browse/SPARK-3388 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Fix For: 1.2.0 This is a follow up to SPARK-2150. Currently Spark uses the application-generated log directory name as the URL for the history server. That's not very user-friendly since there's no way for the user to figure out that information otherwise. A more user-friendly approach would be to use the cluster-defined application id, when present. That also has the advantage of providing that information as metadata for the Spark app, so that someone can easily correlate information kept in Spark with that kept by the cluster manager. Open PR: https://github.com/apache/spark/pull/1218 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3389) Add converter class to make reading Parquet files easy with PySpark
Uri Laserson created SPARK-3389: --- Summary: Add converter class to make reading Parquet files easy with PySpark Key: SPARK-3389 URL: https://issues.apache.org/jira/browse/SPARK-3389 Project: Spark Issue Type: Improvement Reporter: Uri Laserson If a user wants to read Parquet data from PySpark, they currently must use SparkContext.newAPIHadoopFile. If they do not provide a valueConverter, they will get JSON string that must be parsed. Here I add a Converter implementation based on the one in the AvroConverters.scala file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
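A sketch of what such a value converter can look like, modeled loosely on the existing AvroConverters.scala (the use of Avro's IndexedRecord as the value type is an assumption about the Parquet input format being read, and this is not the code in the pull request): it turns each record into a java.util.Map so PySpark receives structured data rather than a JSON string that must be parsed.
{code}
import scala.collection.JavaConverters._

import org.apache.avro.generic.IndexedRecord
import org.apache.spark.api.python.Converter

// Illustrative converter: field values are passed through as-is, so nested
// records, enums, etc. would need further handling in a real implementation.
class IndexedRecordToJavaConverter extends Converter[Any, Any] {
  override def convert(obj: Any): Any = {
    val record = obj.asInstanceOf[IndexedRecord]
    val javaMap = new java.util.HashMap[String, Any]()
    record.getSchema.getFields.asScala.foreach { field =>
      javaMap.put(field.name(), record.get(field.pos()))
    }
    javaMap
  }
}
{code}
From PySpark, the class name would then be passed as the valueConverter argument to SparkContext.newAPIHadoopFile alongside the Parquet input format.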
[jira] [Commented] (SPARK-3389) Add converter class to make reading Parquet files easy with PySpark
[ https://issues.apache.org/jira/browse/SPARK-3389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120638#comment-14120638 ] Uri Laserson commented on SPARK-3389: - https://github.com/apache/spark/pull/2256 Add converter class to make reading Parquet files easy with PySpark --- Key: SPARK-3389 URL: https://issues.apache.org/jira/browse/SPARK-3389 Project: Spark Issue Type: Improvement Reporter: Uri Laserson If a user wants to read Parquet data from PySpark, they currently must use SparkContext.newAPIHadoopFile. If they do not provide a valueConverter, they will get JSON string that must be parsed. Here I add a Converter implementation based on the one in the AvroConverters.scala file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3389) Add converter class to make reading Parquet files easy with PySpark
[ https://issues.apache.org/jira/browse/SPARK-3389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120639#comment-14120639 ] Apache Spark commented on SPARK-3389: - User 'laserson' has created a pull request for this issue: https://github.com/apache/spark/pull/2256 Add converter class to make reading Parquet files easy with PySpark --- Key: SPARK-3389 URL: https://issues.apache.org/jira/browse/SPARK-3389 Project: Spark Issue Type: Improvement Reporter: Uri Laserson If a user wants to read Parquet data from PySpark, they currently must use SparkContext.newAPIHadoopFile. If they do not provide a valueConverter, they will get JSON string that must be parsed. Here I add a Converter implementation based on the one in the AvroConverters.scala file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3389) Add converter class to make reading Parquet files easy with PySpark
[ https://issues.apache.org/jira/browse/SPARK-3389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-3389: -- Component/s: PySpark Add converter class to make reading Parquet files easy with PySpark --- Key: SPARK-3389 URL: https://issues.apache.org/jira/browse/SPARK-3389 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Uri Laserson If a user wants to read Parquet data from PySpark, they currently must use SparkContext.newAPIHadoopFile. If they do not provide a valueConverter, they will get JSON string that must be parsed. Here I add a Converter implementation based on the one in the AvroConverters.scala file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3215) Add remote interface for SparkContext
[ https://issues.apache.org/jira/browse/SPARK-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120657#comment-14120657 ] Matei Zaharia commented on SPARK-3215: -- Thanks Marcelo! Just a few notes on the API:
- It needs to be Java-friendly, so it's probably not good to use Scala functions (e.g. `JobContext => T`) and maybe even the Future type (a rough sketch of one possible shape follows below).
- It's a little weird to be passing a SparkConf to the JobClient, since most flags in there will not affect the jobs run (as they use the remote Spark cluster's SparkConf). Maybe it would be better to just pass a cluster URL.
- It would be good to give jobs some kind of ID that client apps can log and can refer to even if the client crashes and the JobHandle object is gone. This is similar to how Hive prints the MapReduce job IDs it launched, and lets you kill them later using MR's hadoop job -kill.
Add remote interface for SparkContext - Key: SPARK-3215 URL: https://issues.apache.org/jira/browse/SPARK-3215 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Marcelo Vanzin Labels: hive Attachments: RemoteSparkContext.pdf A quick description of the issue: as part of running Hive jobs on top of Spark, it's desirable to have a SparkContext that is running in the background and listening for job requests for a particular user session. Running multiple contexts in the same JVM is not a very good solution. Not only does SparkContext currently have issues sharing the same JVM among multiple instances, but it also turns the JVM running the contexts into a huge bottleneck in the system. So I'm proposing a solution where we have a SparkContext that is running in a separate process, and listening for requests from the client application via some RPC interface (most probably Akka). I'll attach a document shortly with the current proposal. Let's use this bug to discuss the proposal and any other suggestions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
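To make the first and third points concrete, one possible Java-friendly shape is sketched below. All of the names (JobContext, RemoteJob, JobHandle, JobClient) are hypothetical and are not taken from the attached RemoteSparkContext.pdf proposal; the sketch only shows plain interfaces instead of Scala function types, a cluster URL instead of a SparkConf, and a string job ID that survives a client crash.
{code}
import java.util.concurrent.TimeUnit

import org.apache.spark.api.java.JavaSparkContext

// Hypothetical API sketch, not the proposed interface.
trait JobContext {
  def sc: JavaSparkContext
}

// A plain interface instead of a Scala function type (JobContext => T), so Java
// clients can implement it without Scala on the classpath.
trait RemoteJob[T] extends java.io.Serializable {
  def call(ctx: JobContext): T
}

trait JobHandle[T] {
  /** Stable identifier that can be logged, and used to look up or kill the job later. */
  def jobId: String
  def get(timeout: Long, unit: TimeUnit): T
  def cancel(): Unit
}

trait JobClient {
  def submit[T](job: RemoteJob[T]): JobHandle[T]
  def close(): Unit
}

object JobClient {
  // A cluster URL rather than a full SparkConf: most local flags would not affect
  // jobs that run under the remote context's own configuration anyway.
  def connect(clusterUrl: String): JobClient =
    throw new UnsupportedOperationException("illustrative sketch only")
}
{code}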