[jira] [Resolved] (SPARK-2252) mathjax doesn't work in HTTPS (math formulas not rendered)
[ https://issues.apache.org/jira/browse/SPARK-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-2252. Resolution: Fixed Fix Version/s: 1.1.0 1.0.1 mathjax doesn't work in HTTPS (math formulas not rendered) -- Key: SPARK-2252 URL: https://issues.apache.org/jira/browse/SPARK-2252 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.0.1, 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2249) how to convert rows of schemaRdd into HashMaps
[ https://issues.apache.org/jira/browse/SPARK-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14041766#comment-14041766 ] Sean Owen commented on SPARK-2249: -- This is where issues are reported, rather than where questions are asked. I think this should be closed. Use the user@ mailing list instead. how to convert rows of schemaRdd into HashMaps -- Key: SPARK-2249 URL: https://issues.apache.org/jira/browse/SPARK-2249 Project: Spark Issue Type: Question Reporter: jackielihf spark 1.0.0 how to convert rows of schemaRdd into HashMaps using column names as keys? -- This message was sent by Atlassian JIRA (v6.2#6252)
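For reference, one way to express the conversion the reporter asks about, sketched against the Spark 1.0-era SchemaRDD API. This is a hedged example, not an official recipe: the Person class, table name, and column list below are illustrative, and the sketch assumes a spark-shell style `sc`.
{code}
import org.apache.spark.sql.SQLContext

// Sketch only: zip known column names with each row's values to get a Map per row.
case class Person(name: String, age: Int)                 // illustrative schema
val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD
val people = sc.parallelize(Seq(Person("alice", 30), Person("bob", 25)))
people.registerAsTable("people")                           // 1.0-era API
val schemaRdd = sqlContext.sql("SELECT name, age FROM people")
val columns = Seq("name", "age")                           // column names the caller already knows
val rowMaps = schemaRdd.map(row => columns.zipWithIndex.map { case (c, i) => c -> row(i) }.toMap)
{code}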
[jira] [Commented] (SPARK-1199) Type mismatch in Spark shell when using case class defined in shell
[ https://issues.apache.org/jira/browse/SPARK-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14041789#comment-14041789 ] Prashant Sharma commented on SPARK-1199: One workaround is to use the `:paste` command of the REPL for these kinds of scenarios. So if you use :paste and put the whole thing in at once, it works nicely. I am just mentioning it because I found it; we also have a slightly better fix in a GitHub PR. Type mismatch in Spark shell when using case class defined in shell --- Key: SPARK-1199 URL: https://issues.apache.org/jira/browse/SPARK-1199 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0 Reporter: Andrew Kerr Assignee: Prashant Sharma Priority: Blocker Define a class in the shell: {code} case class TestClass(a: String) {code} and an RDD {code} val data = sc.parallelize(Seq("a")).map(TestClass(_)) {code} define a function on it and map over the RDD: {code} def itemFunc(a: TestClass): TestClass = a data.map(itemFunc) {code} Error: {code} <console>:19: error: type mismatch; found : TestClass => TestClass required: TestClass => ? data.map(itemFunc) {code} Similarly with mapPartitions: {code} def partitionFunc(a: Iterator[TestClass]): Iterator[TestClass] = a data.mapPartitions(partitionFunc) {code} {code} <console>:19: error: type mismatch; found : Iterator[TestClass] => Iterator[TestClass] required: Iterator[TestClass] => Iterator[?] Error occurred in an application involving default arguments. data.mapPartitions(partitionFunc) {code} The behavior is the same whether in local mode or on a cluster. This isn't specific to RDDs. A Scala collection in the Spark shell has the same problem. {code} scala> Seq(TestClass("foo")).map(itemFunc) <console>:15: error: type mismatch; found : TestClass => TestClass required: TestClass => ? Seq(TestClass("foo")).map(itemFunc) ^ {code} When run in the Scala console (not the Spark shell) there are no type mismatch errors. -- This message was sent by Atlassian JIRA (v6.2#6252)
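To make the workaround concrete, here is a sketch of the same reproduction entered through :paste, so the case class, the function, and the map call are compiled as a single unit:
{code}
scala> :paste
// Entering paste mode (ctrl-D to finish)

case class TestClass(a: String)
def itemFunc(a: TestClass): TestClass = a
val data = sc.parallelize(Seq("a")).map(TestClass(_))
data.map(itemFunc).collect()

// Exiting paste mode, now interpreting.
{code}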
[jira] [Commented] (SPARK-2125) Add sorting flag to ShuffleManager, and implement it in HashShuffleManager
[ https://issues.apache.org/jira/browse/SPARK-2125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14041809#comment-14041809 ] Saisai Shao commented on SPARK-2125: Hi Matei, I have some basic implementations about this sub-task, would you mind assign this task to me? Thanks a lot. Add sorting flag to ShuffleManager, and implement it in HashShuffleManager -- Key: SPARK-2125 URL: https://issues.apache.org/jira/browse/SPARK-2125 Project: Spark Issue Type: Sub-task Components: Shuffle, Spark Core Reporter: Matei Zaharia -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2253) Disable partial aggregation automatically when reduction factor is low
[ https://issues.apache.org/jira/browse/SPARK-2253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2253: --- Description: Once we see enough number of rows in partial aggregation and don't observe any reduction, Aggregator should just turn off partial aggregation. This reduces memory usage for high cardinality aggregations. This one is for Spark core. There is another ticket tracking this for SQL. was: Once we see enough number of rows in partial aggregation and don't observe any reduction, Aggregator should just turn off partial aggregation. This one is for Spark core. There is another ticket tracking this for SQL. Disable partial aggregation automatically when reduction factor is low -- Key: SPARK-2253 URL: https://issues.apache.org/jira/browse/SPARK-2253 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Once we see enough number of rows in partial aggregation and don't observe any reduction, Aggregator should just turn off partial aggregation. This reduces memory usage for high cardinality aggregations. This one is for Spark core. There is another ticket tracking this for SQL. -- This message was sent by Atlassian JIRA (v6.2#6252)
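A minimal sketch of the heuristic being proposed, illustrative only and not the Aggregator code; the thresholds are made-up tuning values:
{code}
// Once enough rows have gone into the map-side combine, compare distinct keys to
// rows seen; if the map is barely shrinking the data, stop partial aggregation
// and pass records straight through to the shuffle.
val sampleThreshold = 100000        // assumed tuning value
val reductionThreshold = 0.9        // assumed tuning value
def shouldSkipPartialAggregation(rowsSeen: Long, distinctKeys: Long): Boolean =
  rowsSeen >= sampleThreshold && distinctKeys.toDouble / rowsSeen > reductionThreshold
{code}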
[jira] [Comment Edited] (SPARK-2125) Add sorting flag to ShuffleManager, and implement it in HashShuffleManager
[ https://issues.apache.org/jira/browse/SPARK-2125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14041809#comment-14041809 ] Saisai Shao edited comment on SPARK-2125 at 6/24/14 7:26 AM: - Hi Matei, I have some basic implementations about this sub-task, would you mind assigning this task to me? Thanks a lot. was (Author: jerryshao): Hi Matei, I have some basic implementations about this sub-task, would you mind assign this task to me? Thanks a lot. Add sorting flag to ShuffleManager, and implement it in HashShuffleManager -- Key: SPARK-2125 URL: https://issues.apache.org/jira/browse/SPARK-2125 Project: Spark Issue Type: Sub-task Components: Shuffle, Spark Core Reporter: Matei Zaharia -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2255) create a self destructing iterator that release key value pairs from in-memory hash maps
Reynold Xin created SPARK-2255: -- Summary: create a self destructing iterator that release key value pairs from in-memory hash maps Key: SPARK-2255 URL: https://issues.apache.org/jira/browse/SPARK-2255 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin This is a small thing to do that can help out with GC pressure. For aggregations (and potentially joins), we don't really need to hold onto the key-value pairs once we have iterated over them. We can create a self-destructing iterator for AppendOnlyMap / ExternalAppendOnlyMap that removes references to each key-value pair as the iterator goes through the records, so that memory can be freed quickly. -- This message was sent by Atlassian JIRA (v6.2#6252)
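A minimal sketch of the idea, independent of the actual AppendOnlyMap/ExternalAppendOnlyMap internals: an iterator that nulls out each slot as it hands the element back, so consumed records become garbage-collectable immediately.
{code}
// Sketch only, not the real implementation.
class SelfDestructingIterator[T <: AnyRef](data: Array[T]) extends Iterator[T] {
  private var pos = 0
  override def hasNext: Boolean = pos < data.length
  override def next(): T = {
    val elem = data(pos)
    data(pos) = null.asInstanceOf[T]  // drop the reference so it can be collected early
    pos += 1
    elem
  }
}
{code}
The caller must not traverse the map again after consuming such an iterator, which is why the description calls it self-destructing.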
[jira] [Created] (SPARK-2254) ScalaRefection should mark primitive types as non-nullable.
Takuya Ueshin created SPARK-2254: Summary: ScalaRefection should mark primitive types as non-nullable. Key: SPARK-2254 URL: https://issues.apache.org/jira/browse/SPARK-2254 Project: Spark Issue Type: Improvement Components: SQL Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2255) create a self destructing iterator that releases records from hash maps
[ https://issues.apache.org/jira/browse/SPARK-2255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2255: --- Summary: create a self destructing iterator that releases records from hash maps (was: create a self destructing iterator that release key value pairs from in-memory hash maps) create a self destructing iterator that releases records from hash maps --- Key: SPARK-2255 URL: https://issues.apache.org/jira/browse/SPARK-2255 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin This is a small thing to do that can help out with GC pressure. For aggregations (and potentially joins), we don't really need to hold onto the key value pairs as soon as we have iterate over them. We can create a self destructing iterator for AppendOnlyMap / ExternalAppendOnlyMap that removes references to the key value pair as the iterator goes through records so those memory can be freed quickly. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2256) pyspark: RDD.take doesn't work ... sometimes ...
Ángel Álvarez created SPARK-2256: Summary: pyspark: RDD.take doesn't work ... sometimes ... Key: SPARK-2256 URL: https://issues.apache.org/jira/browse/SPARK-2256 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Environment: local file/remote HDFS Reporter: Ángel Álvarez If I try to take some lines from a file, sometimes it doesn't work. Code: myfile = sc.textFile("A_ko") print myfile.take(10) Stacktrace: 14/06/24 09:29:27 INFO DAGScheduler: Failed to run take at mytest.py:19 Traceback (most recent call last): File "mytest.py", line 19, in <module> print myfile.take(10) File "spark-1.0.0-bin-hadoop2\python\pyspark\rdd.py", line 868, in take iterator = mapped._jrdd.collectPartitions(partitionsToTake)[0].iterator() File "spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\java_gateway.py", line 537, in __call__ File "spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\protocol.py", line 300, in get_return_value Test data: START TEST DATA A A A
[jira] [Commented] (SPARK-2248) spark.default.parallelism does not apply in local mode
[ https://issues.apache.org/jira/browse/SPARK-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14041838#comment-14041838 ] Guoqiang Li commented on SPARK-2248: PR: https://github.com/apache/spark/pull/1194 spark.default.parallelism does not apply in local mode -- Key: SPARK-2248 URL: https://issues.apache.org/jira/browse/SPARK-2248 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Matei Zaharia Assignee: Guoqiang Li Priority: Trivial Labels: Starter LocalBackend.defaultParallelism ignores the spark.default.parallelism property, unlike the other SchedulerBackends. We should make it take this in for consistency. -- This message was sent by Atlassian JIRA (v6.2#6252)
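The change being proposed (see the PR above for the actual diff) amounts to making local mode read the property the same way the other scheduler backends do. A hedged sketch of the intended behavior, not the merged code:
{code}
import org.apache.spark.SparkConf

// Respect spark.default.parallelism in local mode, falling back to the
// number of local cores when the property is not set.
def localDefaultParallelism(conf: SparkConf, totalCores: Int): Int =
  conf.getInt("spark.default.parallelism", totalCores)
{code}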
[jira] [Updated] (SPARK-2125) Add sorting flag to ShuffleManager, and implement it in HashShuffleManager
[ https://issues.apache.org/jira/browse/SPARK-2125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2125: --- Assignee: Saisai Shao Add sorting flag to ShuffleManager, and implement it in HashShuffleManager -- Key: SPARK-2125 URL: https://issues.apache.org/jira/browse/SPARK-2125 Project: Spark Issue Type: Sub-task Components: Shuffle, Spark Core Reporter: Matei Zaharia Assignee: Saisai Shao -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2254) ScalaRefection should mark primitive types as non-nullable.
[ https://issues.apache.org/jira/browse/SPARK-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14041837#comment-14041837 ] Takuya Ueshin commented on SPARK-2254: -- PRed: https://github.com/apache/spark/pull/1193 ScalaRefection should mark primitive types as non-nullable. --- Key: SPARK-2254 URL: https://issues.apache.org/jira/browse/SPARK-2254 Project: Spark Issue Type: Improvement Components: SQL Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Closed] (SPARK-2249) how to convert rows of schemaRdd into HashMaps
[ https://issues.apache.org/jira/browse/SPARK-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-2249. -- Resolution: Not a Problem how to convert rows of schemaRdd into HashMaps -- Key: SPARK-2249 URL: https://issues.apache.org/jira/browse/SPARK-2249 Project: Spark Issue Type: Question Reporter: jackielihf spark 1.0.0 how to convert rows of schemaRdd into HashMaps using column names as keys? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2244) pyspark - RDD action hangs (after previously succeeding)
[ https://issues.apache.org/jira/browse/SPARK-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2244: --- Description: {code} $ ./dist/bin/pyspark Python 2.7.5 (default, Feb 19 2014, 13:47:28) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)] on linux2 Type help, copyright, credits or license for more information. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 1.0.0-SNAPSHOT /_/ Using Python version 2.7.5 (default, Feb 19 2014 13:47:28) SparkContext available as sc. hundy = sc.parallelize(range(100)) hundy.count() 100 hundy.count() 100 hundy.count() 100 [repeat until hang, ctrl-C to get] hundy.count() ^CTraceback (most recent call last): File stdin, line 1, in module File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 774, in count return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 765, in sum return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add) File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 685, in reduce vals = self.mapPartitions(func).collect() File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 649, in collect bytesInJava = self._jrdd.collect().iterator() File /home/matt/Documents/Repositories/spark/dist/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 535, in __call__ File /home/matt/Documents/Repositories/spark/dist/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 363, in send_command File /home/matt/Documents/Repositories/spark/dist/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 472, in send_command File /usr/lib64/python2.7/socket.py, line 430, in readline data = recv(1) KeyboardInterrupt {code} was: $ ./dist/bin/pyspark Python 2.7.5 (default, Feb 19 2014, 13:47:28) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)] on linux2 Type help, copyright, credits or license for more information. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 1.0.0-SNAPSHOT /_/ Using Python version 2.7.5 (default, Feb 19 2014 13:47:28) SparkContext available as sc. 
hundy = sc.parallelize(range(100)) hundy.count() 100 hundy.count() 100 hundy.count() 100 [repeat until hang, ctrl-C to get] hundy.count() ^CTraceback (most recent call last): File stdin, line 1, in module File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 774, in count return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 765, in sum return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add) File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 685, in reduce vals = self.mapPartitions(func).collect() File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 649, in collect bytesInJava = self._jrdd.collect().iterator() File /home/matt/Documents/Repositories/spark/dist/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 535, in __call__ File /home/matt/Documents/Repositories/spark/dist/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 363, in send_command File /home/matt/Documents/Repositories/spark/dist/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 472, in send_command File /usr/lib64/python2.7/socket.py, line 430, in readline data = recv(1) KeyboardInterrupt pyspark - RDD action hangs (after previously succeeding) Key: SPARK-2244 URL: https://issues.apache.org/jira/browse/SPARK-2244 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Environment: system: fedora 20 w/ maven 3.1.1 and openjdk 1.7.0_55 code: sha b88238fa (master on 23 june 2014) cluster: make-distribution.sh followed by ./dist/sbin/start-all.sh (running locally) Reporter: Matthew Farrellee Labels: openjdk, pyspark, python, shell, spark {code} $ ./dist/bin/pyspark Python 2.7.5 (default, Feb 19 2014, 13:47:28) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)] on linux2 Type help, copyright, credits or license for more information. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 1.0.0-SNAPSHOT /_/ Using Python version 2.7.5 (default, Feb 19 2014 13:47:28) SparkContext available as sc. hundy = sc.parallelize(range(100)) hundy.count() 100 hundy.count() 100 hundy.count() 100 [repeat until hang, ctrl-C to get] hundy.count() ^CTraceback (most recent
[jira] [Commented] (SPARK-2247) data frames based on RDD
[ https://issues.apache.org/jira/browse/SPARK-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14041841#comment-14041841 ] Reynold Xin commented on SPARK-2247: This would be a great feature to build out on Spark SQL, actually. data frames based on RDD - Key: SPARK-2247 URL: https://issues.apache.org/jira/browse/SPARK-2247 Project: Spark Issue Type: New Feature Components: PySpark, Spark Core, SQL Affects Versions: 1.0.0 Reporter: venu k tangirala Labels: features It would be nice to have R or Python pandas-like data frames on Spark. 1) To be able to access the RDD data frame from Python with pandas 2) To be able to access the RDD data frame from R 3) To be able to access the RDD data frame from Scala's Saddle -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2247) data frames based on RDD
[ https://issues.apache.org/jira/browse/SPARK-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2247: --- Target Version/s: (was: 1.0.0) data frames based on RDD - Key: SPARK-2247 URL: https://issues.apache.org/jira/browse/SPARK-2247 Project: Spark Issue Type: New Feature Components: PySpark, Spark Core, SQL Affects Versions: 1.0.0 Reporter: venu k tangirala Labels: features It would be nice to have R or Python pandas-like data frames on Spark. 1) To be able to access the RDD data frame from Python with pandas 2) To be able to access the RDD data frame from R 3) To be able to access the RDD data frame from Scala's Saddle -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2242) Running sc.parallelize(..).count() hangs pyspark
[ https://issues.apache.org/jira/browse/SPARK-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2242: --- Assignee: Andrew Or Running sc.parallelize(..).count() hangs pyspark Key: SPARK-2242 URL: https://issues.apache.org/jira/browse/SPARK-2242 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Reporter: Xiangrui Meng Assignee: Andrew Or Priority: Blocker Running the following code hangs pyspark in a shell: {code} sc.parallelize(range(100), 100).count() {code} It happens in the master branch, but not branch-1.0. And it seems that it only happens in a pyspark shell. [~andrewor14] helped confirm this bug. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2243) Using several Spark Contexts
[ https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14041844#comment-14041844 ] Reynold Xin commented on SPARK-2243: I don't think we currently support multiple SparkContext objects in the same JVM process. There are numerous assumptions in the code base that uses a a shared cache or thread local variables or some global identifiers which prevent us from using multiple SparkContext's. However, there is nothing fundamental here. We do want to support multiple SparkContext in the long run. Using several Spark Contexts Key: SPARK-2243 URL: https://issues.apache.org/jira/browse/SPARK-2243 Project: Spark Issue Type: Bug Components: Block Manager, Spark Core Affects Versions: 1.0.0 Reporter: Miguel Angel Fernandez Diaz We're developing a platform where we create several Spark contexts for carrying out different calculations. Is there any restriction when using several Spark contexts? We have two contexts, one for Spark calculations and another one for Spark Streaming jobs. The next error arises when we first execute a Spark calculation and, once the execution is finished, a Spark Streaming job is launched: {code} 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0 java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63) at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139) at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193) at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0) 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to java.io.FileNotFoundException java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at
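Until multiple live contexts are supported, the pattern consistent with the comment above is to run the two workloads sequentially in one JVM, stopping the first context before creating the second. A hedged sketch (app names and master are placeholders):
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Only one live SparkContext per JVM: stop the batch context before the next one.
val batchSc = new SparkContext(new SparkConf().setAppName("batch").setMaster("local[2]"))
// ... run the Spark calculation ...
batchSc.stop()
val secondSc = new SparkContext(new SparkConf().setAppName("streaming").setMaster("local[2]"))
// ... set up the Spark Streaming job on top of secondSc ...
{code}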
[jira] [Updated] (SPARK-2244) pyspark - RDD action hangs (after previously succeeding)
[ https://issues.apache.org/jira/browse/SPARK-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Farrellee updated SPARK-2244: - Environment: system: fedora 20 w/ maven 3.1.1 and openjdk 1.7.0_55 1.8.0_05 code: sha b88238fa (master on 23 june 2014) cluster: make-distribution.sh followed by ./dist/sbin/start-all.sh (running locally) was: system: fedora 20 w/ maven 3.1.1 and openjdk 1.7.0_55 code: sha b88238fa (master on 23 june 2014) cluster: make-distribution.sh followed by ./dist/sbin/start-all.sh (running locally) pyspark - RDD action hangs (after previously succeeding) Key: SPARK-2244 URL: https://issues.apache.org/jira/browse/SPARK-2244 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Environment: system: fedora 20 w/ maven 3.1.1 and openjdk 1.7.0_55 1.8.0_05 code: sha b88238fa (master on 23 june 2014) cluster: make-distribution.sh followed by ./dist/sbin/start-all.sh (running locally) Reporter: Matthew Farrellee Labels: openjdk, pyspark, python, shell, spark {code} $ ./dist/bin/pyspark Python 2.7.5 (default, Feb 19 2014, 13:47:28) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)] on linux2 Type help, copyright, credits or license for more information. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 1.0.0-SNAPSHOT /_/ Using Python version 2.7.5 (default, Feb 19 2014 13:47:28) SparkContext available as sc. hundy = sc.parallelize(range(100)) hundy.count() 100 hundy.count() 100 hundy.count() 100 [repeat until hang, ctrl-C to get] hundy.count() ^CTraceback (most recent call last): File stdin, line 1, in module File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 774, in count return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 765, in sum return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add) File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 685, in reduce vals = self.mapPartitions(func).collect() File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 649, in collect bytesInJava = self._jrdd.collect().iterator() File /home/matt/Documents/Repositories/spark/dist/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 535, in __call__ File /home/matt/Documents/Repositories/spark/dist/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 363, in send_command File /home/matt/Documents/Repositories/spark/dist/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 472, in send_command File /usr/lib64/python2.7/socket.py, line 430, in readline data = recv(1) KeyboardInterrupt {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2257) The algorithm of ALS in mlib lacks a parameter
zhengbing li created SPARK-2257: --- Summary: The algorithm of ALS in mlib lacks a parameter Key: SPARK-2257 URL: https://issues.apache.org/jira/browse/SPARK-2257 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Environment: spark 1.0 Reporter: zhengbing li Fix For: 1.1.0 When I test the ALS algorithm using the Netflix data, I find I cannot get the accurate results declared by the paper. The best MSE value is 0.9066300038109709 (RMSE 0.952), which is worse than the paper's result. If I increase the number of features or the number of iterations, I get a worse result. After studying the paper and source code, I found a bug in the updateBlock function of ALS. The original code is: while (i < rank) { // --- fullXtX.data(i * rank + i) += lambda i += 1 } The code doesn't consider the number of products that one user rates. So this code should be modified: while (i < rank) { // ratingsNum(index) equals the number of products that a user rates fullXtX.data(i * rank + i) += lambda * ratingsNum(index) i += 1 } After I modify the code, the MSE value has been improved; this is one test result. Conditions: val numIterations = 20 val features = 30 val model = ALS.train(trainRatings, features, numIterations, 0.06) Result of the modified version: MSE: Double = 0.8472313396478773 RMSE: 0.92045 Results of version 1.0: MSE: Double = 1.2680743123043832 RMSE: 1.1261 In order to add the vector ratingsNum, I want to change the InLinkBlock structure as follows: private[recommendation] case class InLinkBlock(elementIds: Array[Int], ratingsNum: Array[Int], ratingsForBlock: Array[Array[(Array[Int], Array[Double])]]) So I could calculate the vector ratingsNum in the makeInLinkBlock function. This is the code I add in makeInLinkBlock: ... // added val ratingsNum = new Array[Int](numUsers) ratings.map(r => ratingsNum(userIdToPos(r.user)) += 1) // end of added InLinkBlock(userIds, ratingsNum, ratingsForBlock) Is this solution reasonable?? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2257) The algorithm of ALS in mlib lacks a parameter
[ https://issues.apache.org/jira/browse/SPARK-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengbing li updated SPARK-2257: Description: When I test ALS algorithm using netflix data, I find I cannot get the acurate results declared by the paper. The best MSE value is 0.9066300038109709(RMSE 0.952), which is worse than the paper's result. If I increase the number of features or the number of iterations, I will get a worse result. After I studing the paper and source code, I find a bug in the updateBlock function of ALS. orgin code is: while (i rank) { // --- fullXtX.data(i * rank + i) += lambda i += 1 } The code doesn't consider the number of products that one user rates. So this code should be modified: while (i rank) { //ratingsNum(index) equals the number of products that a user rates fullXtX.data(i * rank + i) += lambda * ratingsNum(index) i += 1 } After I modify code, the MSE value has been decreased, this is one test result conditions: val numIterations =20 val features = 30 val model = ALS.train(trainRatings,features, numIterations, 0.06) result of modified version: MSE: Double = 0.8472313396478773 RMSE: 0.92045 results of version of 1.0 MSE: Double = 1.2680743123043832 RMSE: 1.1261 In order to add the vector ratingsNum, I want to change the InLinkBlock structure as follows: private[recommendation] case class InLinkBlock(elementIds: Array[Int], ratingsNum:Array[Int], ratingsForBlock: Array[Array[(Array[Int], Array[Double])]]) So I could calculte the vector ratingsNum in the function of makeInLinkBlock. This is the code I add in the makeInLinkBlock: ... //added val ratingsNum = new Array[Int](numUsers) ratings.map(r = ratingsNum(userIdToPos(r.user)) += 1) //end of added InLinkBlock(userIds, ratingsNum, ratingsForBlock) Is this solution reasonable?? was: When I test ALS algorithm using netflix data, I find I cannot get the acurate results declared by the paper. The best MSE value is 0.9066300038109709(RMSE 0.952), which is worse than the paper's result. If I increase the number of features or the number of iterations, I will get a worse result. After I studing the paper and source code, I find a bug in the updateBlock function of ALS. orgin code is: while (i rank) { // --- fullXtX.data(i * rank + i) += lambda i += 1 } The code doesn't consider the number of products that one user rates. So this code should be modified: while (i rank) { //ratingsNum(index) equals the number of products that a user rates fullXtX.data(i * rank + i) += lambda * ratingsNum(index) i += 1 } After I modify code, the MSE value has been improved, this is one test result conditions: val numIterations =20 val features = 30 val model = ALS.train(trainRatings,features, numIterations, 0.06) result of modified version: MSE: Double = 0.8472313396478773 RMSE: 0.92045 results of version of 1.0 MSE: Double = 1.2680743123043832 RMSE: 1.1261 In order to add the vector ratingsNum, I want to change the InLinkBlock structure as follows: private[recommendation] case class InLinkBlock(elementIds: Array[Int], ratingsNum:Array[Int], ratingsForBlock: Array[Array[(Array[Int], Array[Double])]]) So I could calculte the vector ratingsNum in the function of makeInLinkBlock. This is the code I add in the makeInLinkBlock: ... //added val ratingsNum = new Array[Int](numUsers) ratings.map(r = ratingsNum(userIdToPos(r.user)) += 1) //end of added InLinkBlock(userIds, ratingsNum, ratingsForBlock) Is this solution reasonable?? 
The algorithm of ALS in mlib lacks a parameter --- Key: SPARK-2257 URL: https://issues.apache.org/jira/browse/SPARK-2257 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Environment: spark 1.0 Reporter: zhengbing li Labels: patch Fix For: 1.1.0 Original Estimate: 336h Remaining Estimate: 336h When I test ALS algorithm using netflix data, I find I cannot get the acurate results declared by the paper. The best MSE value is 0.9066300038109709(RMSE 0.952), which is worse than the paper's result. If I increase the number of features or the number of iterations, I will get a worse result. After I studing the paper and source code, I find a bug in the updateBlock function of ALS. orgin code is: while (i rank) { // --- fullXtX.data(i * rank + i) += lambda i += 1 } The code doesn't consider the number of products that one user rates. So this code should be modified: while (i rank) { //ratingsNum(index) equals the number of products that a user rates
[jira] [Commented] (SPARK-2257) The algorithm of ALS in mlib lacks a parameter
[ https://issues.apache.org/jira/browse/SPARK-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14042025#comment-14042025 ] Sean Owen commented on SPARK-2257: -- I don't think this is a bug, in the sense that it is just a different formulation of ALS. It's in the ALS-WR paper, but not the more well-known Hu/Koren/Volinsky paper. This is weighted regularization and it does help in some cases. In fact, it's already implemented in MLlib, although went in just after 1.0.0: https://github.com/apache/spark/commit/a6e0afdcf0174425e8a6ff20b2bc2e3a7a374f19#diff-2b593e0b4bd6eddab37f04968baa826c I think this is therefore already implemented. The algorithm of ALS in mlib lacks a parameter --- Key: SPARK-2257 URL: https://issues.apache.org/jira/browse/SPARK-2257 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Environment: spark 1.0 Reporter: zhengbing li Labels: patch Fix For: 1.1.0 Original Estimate: 336h Remaining Estimate: 336h When I test ALS algorithm using netflix data, I find I cannot get the acurate results declared by the paper. The best MSE value is 0.9066300038109709(RMSE 0.952), which is worse than the paper's result. If I increase the number of features or the number of iterations, I will get a worse result. After I studing the paper and source code, I find a bug in the updateBlock function of ALS. orgin code is: while (i rank) { // --- fullXtX.data(i * rank + i) += lambda i += 1 } The code doesn't consider the number of products that one user rates. So this code should be modified: while (i rank) { //ratingsNum(index) equals the number of products that a user rates fullXtX.data(i * rank + i) += lambda * ratingsNum(index) i += 1 } After I modify code, the MSE value has been decreased, this is one test result conditions: val numIterations =20 val features = 30 val model = ALS.train(trainRatings,features, numIterations, 0.06) result of modified version: MSE: Double = 0.8472313396478773 RMSE: 0.92045 results of version of 1.0 MSE: Double = 1.2680743123043832 RMSE: 1.1261 In order to add the vector ratingsNum, I want to change the InLinkBlock structure as follows: private[recommendation] case class InLinkBlock(elementIds: Array[Int], ratingsNum:Array[Int], ratingsForBlock: Array[Array[(Array[Int], Array[Double])]]) So I could calculte the vector ratingsNum in the function of makeInLinkBlock. This is the code I add in the makeInLinkBlock: ... //added val ratingsNum = new Array[Int](numUsers) ratings.map(r = ratingsNum(userIdToPos(r.user)) += 1) //end of added InLinkBlock(userIds, ratingsNum, ratingsForBlock) Is this solution reasonable?? -- This message was sent by Atlassian JIRA (v6.2#6252)
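For clarity, the weighted-lambda regularization Sean refers to (ALS-WR) scales the regularization term on the normal-equation diagonal by the number of ratings each user or product has. The sketch below restates the loop from the issue description with the comparison operators that were lost in the issue text restored; it is illustrative only, not the MLlib implementation:
{code}
// ALS-WR style: the diagonal of X^T X gets lambda scaled by the rating count,
// instead of a flat lambda.
def addWeightedRegularization(fullXtX: Array[Double], rank: Int,
                              lambda: Double, numRatings: Int): Unit = {
  var i = 0
  while (i < rank) {
    fullXtX(i * rank + i) += lambda * numRatings
    i += 1
  }
}
{code}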
[jira] [Closed] (SPARK-2257) The algorithm of ALS in mlib lacks a parameter
[ https://issues.apache.org/jira/browse/SPARK-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengbing li closed SPARK-2257. --- Resolution: Fixed Fix Version/s: 1.0.1 The algorithm of ALS in mlib lacks a parameter --- Key: SPARK-2257 URL: https://issues.apache.org/jira/browse/SPARK-2257 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Environment: spark 1.0 Reporter: zhengbing li Labels: patch Fix For: 1.0.1, 1.1.0 Original Estimate: 336h Remaining Estimate: 336h When I test ALS algorithm using netflix data, I find I cannot get the acurate results declared by the paper. The best MSE value is 0.9066300038109709(RMSE 0.952), which is worse than the paper's result. If I increase the number of features or the number of iterations, I will get a worse result. After I studing the paper and source code, I find a bug in the updateBlock function of ALS. orgin code is: while (i rank) { // --- fullXtX.data(i * rank + i) += lambda i += 1 } The code doesn't consider the number of products that one user rates. So this code should be modified: while (i rank) { //ratingsNum(index) equals the number of products that a user rates fullXtX.data(i * rank + i) += lambda * ratingsNum(index) i += 1 } After I modify code, the MSE value has been decreased, this is one test result conditions: val numIterations =20 val features = 30 val model = ALS.train(trainRatings,features, numIterations, 0.06) result of modified version: MSE: Double = 0.8472313396478773 RMSE: 0.92045 results of version of 1.0 MSE: Double = 1.2680743123043832 RMSE: 1.1261 In order to add the vector ratingsNum, I want to change the InLinkBlock structure as follows: private[recommendation] case class InLinkBlock(elementIds: Array[Int], ratingsNum:Array[Int], ratingsForBlock: Array[Array[(Array[Int], Array[Double])]]) So I could calculte the vector ratingsNum in the function of makeInLinkBlock. This is the code I add in the makeInLinkBlock: ... //added val ratingsNum = new Array[Int](numUsers) ratings.map(r = ratingsNum(userIdToPos(r.user)) += 1) //end of added InLinkBlock(userIds, ratingsNum, ratingsForBlock) Is this solution reasonable?? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2258) Worker UI displays zombie executors
Andrew Or created SPARK-2258: Summary: Worker UI displays zombie executors Key: SPARK-2258 URL: https://issues.apache.org/jira/browse/SPARK-2258 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or See attached. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2258) Worker UI displays zombie executors
[ https://issues.apache.org/jira/browse/SPARK-2258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2258: - Attachment: Screen Shot 2014-06-24 at 9.23.18 AM.png Worker UI displays zombie executors --- Key: SPARK-2258 URL: https://issues.apache.org/jira/browse/SPARK-2258 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Attachments: Screen Shot 2014-06-24 at 9.23.18 AM.png See attached. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1556) jets3t dep doesn't update properly with newer Hadoop versions
[ https://issues.apache.org/jira/browse/SPARK-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14042332#comment-14042332 ] David Rosenstrauch commented on SPARK-1556: --- Thanks again for the tip. I didn't seem to have the /usr/lib/spark/assembly/lib directory in my installation, but adding the symlink in /usr/lib/shark/lib did the trick as well. jets3t dep doesn't update properly with newer Hadoop versions - Key: SPARK-1556 URL: https://issues.apache.org/jira/browse/SPARK-1556 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.8.1, 0.9.0, 1.0.0 Reporter: Nan Zhu Assignee: Sean Owen Priority: Blocker Fix For: 1.0.0 In Hadoop 2.2.x or newer, Jet3st 0.9.0 which defines S3ServiceException/ServiceException is introduced, however, Spark still relies on Jet3st 0.7.x which has no definition of these classes What I met is that [code] 14/04/21 19:30:53 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id 14/04/21 19:30:53 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id 14/04/21 19:30:53 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id 14/04/21 19:30:53 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap 14/04/21 19:30:53 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException at org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:280) at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:270) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2316) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2350) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2332) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:369) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:221) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:140) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:205) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:205) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:205) at org.apache.spark.SparkContext.runJob(SparkContext.scala:891) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:741) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:692) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:574) at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:900) at 
$iwC$$iwC$$iwC$$iwC.init(console:15) at $iwC$$iwC$$iwC.init(console:20) at $iwC$$iwC.init(console:22) at $iwC.init(console:24) at init(console:26) at .init(console:30) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:772) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1040) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:609) at
[jira] [Created] (SPARK-2259) Spark submit documentation for --deploy-mode is highly misleading
Andrew Or created SPARK-2259: Summary: Spark submit documentation for --deploy-mode is highly misleading Key: SPARK-2259 URL: https://issues.apache.org/jira/browse/SPARK-2259 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.1.0 Reporter: Andrew Or Priority: Critical There are a few issues: 1. Client mode does not necessarily mean the driver program must be launched outside of the cluster. 2. For standalone clusters, only client mode is currently supported. This was the case even before 1.0. Currently, the docs tell the user to use cluster deploy mode when deploying your driver program within the cluster, which is true also for standalone-client mode. In short, the docs encourage the user to use standalone-cluster, an unsupported mode. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2204) Scheduler for Mesos in fine-grained mode launches tasks on wrong executors
[ https://issues.apache.org/jira/browse/SPARK-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2204: --- Target Version/s: 1.0.1, 1.1.0 Scheduler for Mesos in fine-grained mode launches tasks on wrong executors -- Key: SPARK-2204 URL: https://issues.apache.org/jira/browse/SPARK-2204 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.0.0 Reporter: Sebastien Rainville Priority: Blocker MesosSchedulerBackend.resourceOffers(SchedulerDriver, List[Offer]) is assuming that TaskSchedulerImpl.resourceOffers(Seq[WorkerOffer]) is returning task lists in the same order as the offers it was passed, but in the current implementation TaskSchedulerImpl.resourceOffers shuffles the offers to avoid assigning the tasks always to the same executors. The result is that the tasks are launched on the wrong executors. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2260) Spark submit standalone-cluster mode is broken
Andrew Or created SPARK-2260: Summary: Spark submit standalone-cluster mode is broken Key: SPARK-2260 URL: https://issues.apache.org/jira/browse/SPARK-2260 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Well, it is technically not officially supported... but we should still fix it. In particular, important configs such as spark.master and the application jar are not propagated to the worker nodes properly, due to obvious missing pieces in the code. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1937) Tasks can be submitted before executors are registered
[ https://issues.apache.org/jira/browse/SPARK-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1937: - Assignee: Rui Li Tasks can be submitted before executors are registered -- Key: SPARK-1937 URL: https://issues.apache.org/jira/browse/SPARK-1937 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Rui Li Assignee: Rui Li Fix For: 1.1.0 Attachments: After-patch.PNG, Before-patch.png, RSBTest.scala During construction, TaskSetManager will assign tasks to several pending lists according to the tasks’ preferred locations. If the desired location is unavailable, it’ll then assign this task to “pendingTasksWithNoPrefs”, a list containing tasks without preferred locations. The problem is that tasks may be submitted before the executors get registered with the driver, in which case TaskSetManager will assign all the tasks to pendingTasksWithNoPrefs. Later when it looks for a task to schedule, it will pick one from this list and assign it to arbitrary executor, since TaskSetManager considers the tasks can run equally well on any node. This problem deprives benefits of data locality, drags the whole job slow and can cause imbalance between executors. I ran into this issue when running a spark program on a 7-node cluster (node6~node12). The program processes 100GB data. Since the data is uploaded to HDFS from node6, this node has a complete copy of the data and as a result, node6 finishes tasks much faster, which in turn makes it complete dis-proportionally more tasks than other nodes. To solve this issue, I think we shouldn't check availability of executors/hosts when constructing TaskSetManager. If a task prefers a node, we simply add the task to that node’s pending list. When later on the node is added, TaskSetManager can schedule the task according to proper locality level. If unfortunately the preferred node(s) never gets added, TaskSetManager can still schedule the task at locality level “ANY”. -- This message was sent by Atlassian JIRA (v6.2#6252)
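A sketch of the queuing change the reporter proposes (illustrative, not the TaskSetManager code): queue a task under its preferred host even if that host has no registered executor yet, so locality can still be honored once the executor arrives.
{code}
import scala.collection.mutable.{ArrayBuffer, HashMap}

// Do not check executor/host availability at construction time; tasks with
// preferences always go into their hosts' pending lists.
val pendingTasksForHost = new HashMap[String, ArrayBuffer[Int]]
val pendingTasksWithNoPrefs = new ArrayBuffer[Int]

def addPendingTask(taskIndex: Int, preferredHosts: Seq[String]): Unit = {
  if (preferredHosts.isEmpty) {
    pendingTasksWithNoPrefs += taskIndex
  } else {
    preferredHosts.foreach { host =>
      pendingTasksForHost.getOrElseUpdate(host, new ArrayBuffer[Int]) += taskIndex
    }
  }
}
{code}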
[jira] [Resolved] (SPARK-1937) Tasks can be submitted before executors are registered
[ https://issues.apache.org/jira/browse/SPARK-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1937. -- Resolution: Fixed Fix Version/s: 1.1.0 Target Version/s: 1.1.0 Tasks can be submitted before executors are registered -- Key: SPARK-1937 URL: https://issues.apache.org/jira/browse/SPARK-1937 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Rui Li Assignee: Rui Li Fix For: 1.1.0 Attachments: After-patch.PNG, Before-patch.png, RSBTest.scala During construction, TaskSetManager will assign tasks to several pending lists according to the tasks’ preferred locations. If the desired location is unavailable, it’ll then assign this task to “pendingTasksWithNoPrefs”, a list containing tasks without preferred locations. The problem is that tasks may be submitted before the executors get registered with the driver, in which case TaskSetManager will assign all the tasks to pendingTasksWithNoPrefs. Later when it looks for a task to schedule, it will pick one from this list and assign it to arbitrary executor, since TaskSetManager considers the tasks can run equally well on any node. This problem deprives benefits of data locality, drags the whole job slow and can cause imbalance between executors. I ran into this issue when running a spark program on a 7-node cluster (node6~node12). The program processes 100GB data. Since the data is uploaded to HDFS from node6, this node has a complete copy of the data and as a result, node6 finishes tasks much faster, which in turn makes it complete dis-proportionally more tasks than other nodes. To solve this issue, I think we shouldn't check availability of executors/hosts when constructing TaskSetManager. If a task prefers a node, we simply add the task to that node’s pending list. When later on the node is added, TaskSetManager can schedule the task according to proper locality level. If unfortunately the preferred node(s) never gets added, TaskSetManager can still schedule the task at locality level “ANY”. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1911) Warn users that jars should be built with Java 6 for YARN
[ https://issues.apache.org/jira/browse/SPARK-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-1911: - Summary: Warn users that jars should be built with Java 6 for YARN (was: Warn users that jars should be built with Java 6 for PySpark to work on YARN) Warn users that jars should be built with Java 6 for YARN - Key: SPARK-1911 URL: https://issues.apache.org/jira/browse/SPARK-1911 Project: Spark Issue Type: Sub-task Components: Documentation Reporter: Andrew Or Fix For: 1.0.0 Python sometimes fails to read jars created by Java 7. This is necessary for PySpark to work in YARN, and so Spark assembly JAR should compiled in Java 6 for PySpark to work on YARN. Currently we warn users only in make-distribution.sh, but most users build the jars directly. We should emphasize it in the docs especially for PySpark and YARN because this issue is not trivial to debug. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Reopened] (SPARK-1911) Warn users that jars should be built with Java 6 for YARN
[ https://issues.apache.org/jira/browse/SPARK-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or reopened SPARK-1911: -- Looks like we still haven't fixed this. At the very least we should explicitly add this warning in the documentation. The next step that is more involved is to add a warning prompt in the mvn build scripts whenever Java 7+ is detected. Warn users that jars should be built with Java 6 for YARN - Key: SPARK-1911 URL: https://issues.apache.org/jira/browse/SPARK-1911 Project: Spark Issue Type: Sub-task Components: Documentation Affects Versions: 1.1.0 Reporter: Andrew Or Fix For: 1.1.0 Python sometimes fails to read jars created by Java 7. This is necessary for PySpark to work in YARN, and so Spark assembly JAR should compiled in Java 6 for PySpark to work on YARN. Currently we warn users only in make-distribution.sh, but most users build the jars directly. We should emphasize it in the docs especially for PySpark and YARN because this issue is not trivial to debug. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1911) Warn users if their jars are not built with Java 6
[ https://issues.apache.org/jira/browse/SPARK-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-1911: - Fix Version/s: (was: 1.0.0) 1.1.0 Warn users if their jars are not built with Java 6 -- Key: SPARK-1911 URL: https://issues.apache.org/jira/browse/SPARK-1911 Project: Spark Issue Type: Sub-task Components: Documentation Affects Versions: 1.1.0 Reporter: Andrew Or Fix For: 1.1.0 Currently we warn users only in make-distribution.sh, but most users build the jars directly. We should emphasize it in the docs especially for PySpark and YARN because this issue is not trivial to debug. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1911) Warn users if their jars are not built with Java 6
[ https://issues.apache.org/jira/browse/SPARK-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-1911: - Affects Version/s: 1.1.0 Warn users if their jars are not built with Java 6 -- Key: SPARK-1911 URL: https://issues.apache.org/jira/browse/SPARK-1911 Project: Spark Issue Type: Sub-task Components: Documentation Affects Versions: 1.1.0 Reporter: Andrew Or Fix For: 1.1.0 Currently we warn users only in make-distribution.sh, but most users build the jars directly. We should emphasize it in the docs especially for PySpark and YARN because this issue is not trivial to debug. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1911) Warn users if their jars are not built with Java 6
[ https://issues.apache.org/jira/browse/SPARK-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-1911: - Description: Currently we warn users only in make-distribution.sh, but most users build the jars directly. We should emphasize it in the docs especially for PySpark and YARN because this issue is not trivial to debug. was: Python sometimes fails to read jars created by Java 7. This is necessary for PySpark to work in YARN, and so Spark assembly JAR should compiled in Java 6 for PySpark to work on YARN. Currently we warn users only in make-distribution.sh, but most users build the jars directly. We should emphasize it in the docs especially for PySpark and YARN because this issue is not trivial to debug. Warn users if their jars are not built with Java 6 -- Key: SPARK-1911 URL: https://issues.apache.org/jira/browse/SPARK-1911 Project: Spark Issue Type: Sub-task Components: Documentation Affects Versions: 1.1.0 Reporter: Andrew Or Fix For: 1.1.0 Currently we warn users only in make-distribution.sh, but most users build the jars directly. We should emphasize it in the docs especially for PySpark and YARN because this issue is not trivial to debug. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1911) Warn users if their assembly jars are not built with Java 6
[ https://issues.apache.org/jira/browse/SPARK-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-1911: - Summary: Warn users if their assembly jars are not built with Java 6 (was: Warn users if their jars are not built with Java 6) Warn users if their assembly jars are not built with Java 6 --- Key: SPARK-1911 URL: https://issues.apache.org/jira/browse/SPARK-1911 Project: Spark Issue Type: Sub-task Components: Documentation Affects Versions: 1.1.0 Reporter: Andrew Or Fix For: 1.1.0 Currently we warn users only in make-distribution.sh, but most users build the jars directly. We should emphasize it in the docs especially for PySpark and YARN because this issue is not trivial to debug. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1911) Warn users if their assembly jars are not built with Java 6
[ https://issues.apache.org/jira/browse/SPARK-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-1911: - Description: The root cause of the problem is detailed in: https://issues.apache.org/jira/browse/SPARK-1520. In short, an assembly jar built with Java 7+ is not always accessible by Python or other versions of Java (especially Java 6). If the assembly jar is not built on the cluster itself, this problem may manifest itself in strange exceptions that are not trivial to debug. Currently we warn users only in make-distribution.sh, but most users build the jars directly. We should emphasize it in the docs especially for PySpark and YARN because this issue is not trivial to debug. was: Currently we warn users only in make-distribution.sh, but most users build the jars directly. We should emphasize it in the docs especially for PySpark and YARN because this issue is not trivial to debug. Warn users if their assembly jars are not built with Java 6 --- Key: SPARK-1911 URL: https://issues.apache.org/jira/browse/SPARK-1911 Project: Spark Issue Type: Sub-task Components: Documentation Affects Versions: 1.1.0 Reporter: Andrew Or Fix For: 1.1.0 The root cause of the problem is detailed in: https://issues.apache.org/jira/browse/SPARK-1520. In short, an assembly jar built with Java 7+ is not always accessible by Python or other versions of Java (especially Java 6). If the assembly jar is not built on the cluster itself, this problem may manifest itself in strange exceptions that are not trivial to debug. Currently we warn users only in make-distribution.sh, but most users build the jars directly. We should emphasize it in the docs especially for PySpark and YARN because this issue is not trivial to debug. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1891) Add admin acls to the Web UI
[ https://issues.apache.org/jira/browse/SPARK-1891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14042516#comment-14042516 ] Thomas Graves commented on SPARK-1891: -- https://github.com/apache/spark/pull/1196 combined pull request with SPARK-1890 Add admin acls to the Web UI Key: SPARK-1891 URL: https://issues.apache.org/jira/browse/SPARK-1891 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.0.0 Reporter: Thomas Graves Assignee: Thomas Graves Fix For: 1.1.0 In some environments you have multi-tenant clusters. In those environments you sometimes have users, operations, support teams, and framework development teams (Spark, MR, etc developers). When users have issues with their application they can ask for help from support teams or development. In those cases you want the support team or development teams to be able to see the application UI so they can help the user debug what is going on without having to rerun the application. In order to more easily do that we can add a concept of admin acls to the UI. This is the list of users who are considered admins for the cluster and would always have permission to view the UI. Generally this list would be set at the cluster level and picked up by anyone running on that cluster. We should add admin acls to the Spark web ui to allow this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1911) Warn users if their jars are not built with Java 6
[ https://issues.apache.org/jira/browse/SPARK-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-1911: - Summary: Warn users if their jars are not built with Java 6 (was: Warn users that jars should be built with Java 6) Warn users if their jars are not built with Java 6 -- Key: SPARK-1911 URL: https://issues.apache.org/jira/browse/SPARK-1911 Project: Spark Issue Type: Sub-task Components: Documentation Affects Versions: 1.1.0 Reporter: Andrew Or Fix For: 1.1.0 Python sometimes fails to read jars created by Java 7. This is necessary for PySpark to work in YARN, and so Spark assembly JAR should compiled in Java 6 for PySpark to work on YARN. Currently we warn users only in make-distribution.sh, but most users build the jars directly. We should emphasize it in the docs especially for PySpark and YARN because this issue is not trivial to debug. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1911) Warn users if their assembly jars are not built with Java 6
[ https://issues.apache.org/jira/browse/SPARK-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-1911: - Description: The root cause of the problem is detailed in: https://issues.apache.org/jira/browse/SPARK-1520. In short, an assembly jar built with Java 7+ is not always accessible by Python or other versions of Java (especially Java 6). If the assembly jar is not built on the cluster itself, this problem may manifest itself in strange exceptions that are not trivial to debug. This is an issue especially for PySpark on YARN, which relies on the python files included within the assembly jar. Currently we warn users only in make-distribution.sh, but most users build the jars directly. At the very least we need to emphasize this in the docs (currently missing entirely). The next step is to add a warning prompt in the mvn scripts whenever Java 7+ is detected. was: The root cause of the problem is detailed in: https://issues.apache.org/jira/browse/SPARK-1520. In short, an assembly jar built with Java 7+ is not always accessible by Python or other versions of Java (especially Java 6). If the assembly jar is not built on the cluster itself, this problem may manifest itself in strange exceptions that are not trivial to debug. Currently we warn users only in make-distribution.sh, but most users build the jars directly. We should emphasize it in the docs especially for PySpark and YARN because this issue is not trivial to debug. Warn users if their assembly jars are not built with Java 6 --- Key: SPARK-1911 URL: https://issues.apache.org/jira/browse/SPARK-1911 Project: Spark Issue Type: Sub-task Components: Documentation Affects Versions: 1.1.0 Reporter: Andrew Or Fix For: 1.1.0 The root cause of the problem is detailed in: https://issues.apache.org/jira/browse/SPARK-1520. In short, an assembly jar built with Java 7+ is not always accessible by Python or other versions of Java (especially Java 6). If the assembly jar is not built on the cluster itself, this problem may manifest itself in strange exceptions that are not trivial to debug. This is an issue especially for PySpark on YARN, which relies on the python files included within the assembly jar. Currently we warn users only in make-distribution.sh, but most users build the jars directly. At the very least we need to emphasize this in the docs (currently missing entirely). The next step is to add a warning prompt in the mvn scripts whenever Java 7+ is detected. -- This message was sent by Atlassian JIRA (v6.2#6252)
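The updated description above proposes warning whenever the build runs on Java 7+. As a rough illustration of that kind of check (not the actual SPARK-1911 patch, which would live in the maven/make-distribution scripts), a version probe could look like this:
{code}
// Illustrative sketch only: warn when building on Java 7+, since the resulting
// assembly may not be accessible to Python or Java 6 (see SPARK-1520).
object JavaVersionWarning {
  private def featureVersion(spec: String): Int = spec.split('.') match {
    case Array("1", minor, _*) => minor.toInt   // "1.6", "1.7", ...
    case Array(major, _*)      => major.toInt   // "9", "11", ... on newer JVMs
  }

  def warnIfNewerThanJava6(): Unit = {
    val spec = sys.props.getOrElse("java.specification.version", "")
    if (spec.nonEmpty && featureVersion(spec) > 6) {
      Console.err.println(s"Warning: building with Java $spec; assemblies built with " +
        "Java 7+ may not be readable by Python or Java 6.")
    }
  }
}
{code}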
[jira] [Commented] (SPARK-2244) pyspark - RDD action hangs (after previously succeeding)
[ https://issues.apache.org/jira/browse/SPARK-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14042523#comment-14042523 ] Matthew Farrellee commented on SPARK-2244: -- i have a theory - after a long bisect session the following commit was implicated - 3870248740d83b0292ccca88a494ce19783847f0 is the first bad commit commit 3870248740d83b0292ccca88a494ce19783847f0 Author: Kay Ousterhout kayousterh...@gmail.com Date: Wed Jun 18 13:16:26 2014 -0700 in that commit stderr is captured into a PIPE for the first time theory is the pipe is filling a buffer that is never drained, resulting in an eventual hang in communication. testing this theory by adding an additional EchoOutputThread for proc.stderr, which appears to resolve the issue i'll come up with an appropriate fix and send a pull request pyspark - RDD action hangs (after previously succeeding) Key: SPARK-2244 URL: https://issues.apache.org/jira/browse/SPARK-2244 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Environment: system: fedora 20 w/ maven 3.1.1 and openjdk 1.7.0_55 1.8.0_05 code: sha b88238fa (master on 23 june 2014) cluster: make-distribution.sh followed by ./dist/sbin/start-all.sh (running locally) Reporter: Matthew Farrellee Labels: openjdk, pyspark, python, shell, spark {code} $ ./dist/bin/pyspark Python 2.7.5 (default, Feb 19 2014, 13:47:28) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)] on linux2 Type help, copyright, credits or license for more information. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 1.0.0-SNAPSHOT /_/ Using Python version 2.7.5 (default, Feb 19 2014 13:47:28) SparkContext available as sc. hundy = sc.parallelize(range(100)) hundy.count() 100 hundy.count() 100 hundy.count() 100 [repeat until hang, ctrl-C to get] hundy.count() ^CTraceback (most recent call last): File stdin, line 1, in module File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 774, in count return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 765, in sum return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add) File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 685, in reduce vals = self.mapPartitions(func).collect() File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 649, in collect bytesInJava = self._jrdd.collect().iterator() File /home/matt/Documents/Repositories/spark/dist/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 535, in __call__ File /home/matt/Documents/Repositories/spark/dist/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 363, in send_command File /home/matt/Documents/Repositories/spark/dist/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 472, in send_command File /usr/lib64/python2.7/socket.py, line 430, in readline data = recv(1) KeyboardInterrupt {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
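The theory above is that once the child process's stderr is captured into a PIPE, the OS pipe buffer fills up with nothing draining it, eventually blocking the child. The actual fix belongs in PySpark's gateway code, but the general pattern is a dedicated drainer thread per captured stream; a minimal Scala sketch of that pattern, illustrative only:
{code}
import java.io.{BufferedReader, InputStream, InputStreamReader}

// Drain a child process's stderr continuously so the pipe buffer never fills and
// blocks the child, the failure mode theorized in the comment above.
class EchoOutputThread(in: InputStream, prefix: String) extends Thread {
  setDaemon(true)
  override def run(): Unit = {
    val reader = new BufferedReader(new InputStreamReader(in))
    var line = reader.readLine()
    while (line != null) {
      System.err.println(prefix + line)
      line = reader.readLine()
    }
  }
}

// Usage sketch: new EchoOutputThread(proc.getErrorStream, "[worker] ").start()
{code}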
[jira] [Commented] (SPARK-2244) pyspark - RDD action hangs (after previously succeeding)
[ https://issues.apache.org/jira/browse/SPARK-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14042529#comment-14042529 ] Reynold Xin commented on SPARK-2244: Is this related? https://github.com/apache/spark/pull/1178 pyspark - RDD action hangs (after previously succeeding) Key: SPARK-2244 URL: https://issues.apache.org/jira/browse/SPARK-2244 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Environment: system: fedora 20 w/ maven 3.1.1 and openjdk 1.7.0_55 1.8.0_05 code: sha b88238fa (master on 23 june 2014) cluster: make-distribution.sh followed by ./dist/sbin/start-all.sh (running locally) Reporter: Matthew Farrellee Labels: openjdk, pyspark, python, shell, spark {code} $ ./dist/bin/pyspark Python 2.7.5 (default, Feb 19 2014, 13:47:28) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)] on linux2 Type help, copyright, credits or license for more information. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 1.0.0-SNAPSHOT /_/ Using Python version 2.7.5 (default, Feb 19 2014 13:47:28) SparkContext available as sc. hundy = sc.parallelize(range(100)) hundy.count() 100 hundy.count() 100 hundy.count() 100 [repeat until hang, ctrl-C to get] hundy.count() ^CTraceback (most recent call last): File stdin, line 1, in module File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 774, in count return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 765, in sum return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add) File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 685, in reduce vals = self.mapPartitions(func).collect() File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 649, in collect bytesInJava = self._jrdd.collect().iterator() File /home/matt/Documents/Repositories/spark/dist/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 535, in __call__ File /home/matt/Documents/Repositories/spark/dist/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 363, in send_command File /home/matt/Documents/Repositories/spark/dist/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 472, in send_command File /usr/lib64/python2.7/socket.py, line 430, in readline data = recv(1) KeyboardInterrupt {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2244) pyspark - RDD action hangs (after previously succeeding)
[ https://issues.apache.org/jira/browse/SPARK-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14042536#comment-14042536 ] Matthew Farrellee commented on SPARK-2244: -- yes, but i prefer my solution - https://github.com/apache/spark/pull/1197 pyspark - RDD action hangs (after previously succeeding) Key: SPARK-2244 URL: https://issues.apache.org/jira/browse/SPARK-2244 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Environment: system: fedora 20 w/ maven 3.1.1 and openjdk 1.7.0_55 1.8.0_05 code: sha b88238fa (master on 23 june 2014) cluster: make-distribution.sh followed by ./dist/sbin/start-all.sh (running locally) Reporter: Matthew Farrellee Labels: openjdk, pyspark, python, shell, spark {code} $ ./dist/bin/pyspark Python 2.7.5 (default, Feb 19 2014, 13:47:28) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)] on linux2 Type help, copyright, credits or license for more information. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 1.0.0-SNAPSHOT /_/ Using Python version 2.7.5 (default, Feb 19 2014 13:47:28) SparkContext available as sc. hundy = sc.parallelize(range(100)) hundy.count() 100 hundy.count() 100 hundy.count() 100 [repeat until hang, ctrl-C to get] hundy.count() ^CTraceback (most recent call last): File stdin, line 1, in module File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 774, in count return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 765, in sum return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add) File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 685, in reduce vals = self.mapPartitions(func).collect() File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 649, in collect bytesInJava = self._jrdd.collect().iterator() File /home/matt/Documents/Repositories/spark/dist/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 535, in __call__ File /home/matt/Documents/Repositories/spark/dist/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 363, in send_command File /home/matt/Documents/Repositories/spark/dist/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 472, in send_command File /usr/lib64/python2.7/socket.py, line 430, in readline data = recv(1) KeyboardInterrupt {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1718) pyspark doesn't work with assembly jar containing over 65536 files/dirs built on redhat
[ https://issues.apache.org/jira/browse/SPARK-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-1718. -- Resolution: Duplicate pyspark doesn't work with assembly jar containing over 65536 files/dirs built on redhat Key: SPARK-1718 URL: https://issues.apache.org/jira/browse/SPARK-1718 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Reporter: Thomas Graves Recently pyspark was ported to yarn (pr 30), but when I went to try it I couldn't get it work. I was building on a redhat 6 box. I figured out that if the assembly jar file contained over 65536 files/directories then it wouldn't work. If I unjarred the assembly and removed some stuff to get it under 65536 and jarred it back up, then it would work. It appears to only be an issue when building on a redhat box as I can build on my mac and it works just fine there. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2261) Spark application event logs are not very NameNode-friendly
Marcelo Vanzin created SPARK-2261: - Summary: Spark application event logs are not very NameNode-friendly Key: SPARK-2261 URL: https://issues.apache.org/jira/browse/SPARK-2261 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Marcelo Vanzin Priority: Minor Currently, EventLoggingListener will generate application logs using, in the worst case, five different entries in the file system: * The directory to hold the files * One file for the Spark version * One file for the event logs * One file to identify the compression codec of the event logs * One file to say the application is finished. It would be better to be more friendly to the NameNodes and use a single entry for all of those. -- This message was sent by Atlassian JIRA (v6.2#6252)
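A purely illustrative sketch of the consolidation idea (the file layout and field names below are assumptions, not the eventual implementation): the version, codec, and completion marker become header and footer lines of a single log file, so the NameNode tracks one entry instead of five.
{code}
import java.io.PrintWriter

// Hypothetical single-file layout, for illustration only.
def writeEventLog(path: String, sparkVersion: String, codec: Option[String],
                  events: Iterator[String]): Unit = {
  val out = new PrintWriter(path)                             // one filesystem entry
  try {
    out.println(s"SPARK_VERSION=$sparkVersion")               // replaces the version file
    codec.foreach(c => out.println(s"COMPRESSION_CODEC=$c"))  // replaces the codec file
    events.foreach(e => out.println(e))
    out.println("APPLICATION_COMPLETE")                       // replaces the completion marker
  } finally {
    out.close()
  }
}
{code}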
[jira] [Commented] (SPARK-2262) Extreme Learning Machines (ELM) for MLLib
[ https://issues.apache.org/jira/browse/SPARK-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14042731#comment-14042731 ] Erik Erlandson commented on SPARK-2262: --- I'd like to have this assigned to me Extreme Learning Machines (ELM) for MLLib - Key: SPARK-2262 URL: https://issues.apache.org/jira/browse/SPARK-2262 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Erik Erlandson Labels: features MLLib has a gap in the NN space. There's some good reason for this, as batching gradient updates in traditional backprop training is known to not perform well. However, Extreme Learning Machines(ELM) combine support for nonlinear activation functions in a hidden layer with a batch-friendly linear training. There is also a body of ELM literature on various avenues for extension, including multi-category classification, multiple hidden layers and adaptive addition/deletion of hidden nodes. -- This message was sent by Atlassian JIRA (v6.2#6252)
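For context on why ELM training is batch-friendly (background knowledge, not taken from the JIRA): the hidden-layer weights are assigned randomly and fixed, so only the output weights are learned, via a single linear least-squares solve over the whole batch:
{code}
% H = hidden-layer activation matrix over the batch, T = target matrix,
% H^+ = Moore-Penrose pseudoinverse.
\[ H\beta = T \quad\Rightarrow\quad \hat{\beta} = H^{+} T \]
{code}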
[jira] [Commented] (SPARK-2251) MLLib Naive Bayes Example SparkException: Can only zip RDDs with same number of elements in each partition
[ https://issues.apache.org/jira/browse/SPARK-2251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14042738#comment-14042738 ] Sean Owen commented on SPARK-2251: -- For what it's worth, I can reproduce this. In the sample, the test RDD has 2 partitions, containing 2 and 1 examples. The prediction RDD has 2 partitions, containing 1 and 2 examples respectively. So they aren't matched up, even though one is a 1-1 map() of the other. That seems like it shouldn't happen? maybe someone more knowledgeable can say whether that itself should occur. test is a PartitionwiseSampledRDD and prediction is a MappedRDD of course. If it is allowed to happen, then the example should be fixed, and I could easily supply a patch. It can be done without having to zip up RDDs to begin with. MLLib Naive Bayes Example SparkException: Can only zip RDDs with same number of elements in each partition -- Key: SPARK-2251 URL: https://issues.apache.org/jira/browse/SPARK-2251 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Environment: OS: Fedora Linux Spark Version: 1.0.0. Git clone from the Spark Repository Reporter: Jun Xie Priority: Minor Labels: Naive-Bayes I follow the exact code from Naive Bayes Example (http://spark.apache.org/docs/latest/mllib-naive-bayes.html) of MLLib. When I executed the final command: val accuracy = 1.0 * predictionAndLabel.filter(x = x._1 == x._2).count() / test.count() It complains Can only zip RDDs with same number of elements in each partition. I got the following exception: 14/06/23 19:39:23 INFO SparkContext: Starting job: count at console:31 14/06/23 19:39:23 INFO DAGScheduler: Got job 3 (count at console:31) with 2 output partitions (allowLocal=false) 14/06/23 19:39:23 INFO DAGScheduler: Final stage: Stage 4(count at console:31) 14/06/23 19:39:23 INFO DAGScheduler: Parents of final stage: List() 14/06/23 19:39:23 INFO DAGScheduler: Missing parents: List() 14/06/23 19:39:23 INFO DAGScheduler: Submitting Stage 4 (FilteredRDD[14] at filter at console:31), which has no missing parents 14/06/23 19:39:23 INFO DAGScheduler: Submitting 2 missing tasks from Stage 4 (FilteredRDD[14] at filter at console:31) 14/06/23 19:39:23 INFO TaskSchedulerImpl: Adding task set 4.0 with 2 tasks 14/06/23 19:39:23 INFO TaskSetManager: Starting task 4.0:0 as TID 8 on executor localhost: localhost (PROCESS_LOCAL) 14/06/23 19:39:23 INFO TaskSetManager: Serialized task 4.0:0 as 3410 bytes in 0 ms 14/06/23 19:39:23 INFO TaskSetManager: Starting task 4.0:1 as TID 9 on executor localhost: localhost (PROCESS_LOCAL) 14/06/23 19:39:23 INFO TaskSetManager: Serialized task 4.0:1 as 3410 bytes in 1 ms 14/06/23 19:39:23 INFO Executor: Running task ID 8 14/06/23 19:39:23 INFO Executor: Running task ID 9 14/06/23 19:39:23 INFO BlockManager: Found block broadcast_0 locally 14/06/23 19:39:23 INFO BlockManager: Found block broadcast_0 locally 14/06/23 19:39:23 INFO HadoopRDD: Input split: file:/home/jun/open_source/spark/mllib/data/sample_naive_bayes_data.txt:0+24 14/06/23 19:39:23 INFO HadoopRDD: Input split: file:/home/jun/open_source/spark/mllib/data/sample_naive_bayes_data.txt:24+24 14/06/23 19:39:23 INFO HadoopRDD: Input split: file:/home/jun/open_source/spark/mllib/data/sample_naive_bayes_data.txt:0+24 14/06/23 19:39:23 INFO HadoopRDD: Input split: file:/home/jun/open_source/spark/mllib/data/sample_naive_bayes_data.txt:24+24 14/06/23 19:39:23 ERROR Executor: Exception in task ID 9 org.apache.spark.SparkException: Can only zip RDDs with same number of elements 
in each partition at org.apache.spark.rdd.RDD$$anonfun$zip$1$$anon$1.hasNext(RDD.scala:663) at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1067) at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:858) at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:858) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1079) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1079) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) 14/06/23 19:39:23 ERROR Executor: Exception in task ID 8
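The zip-free rewrite Sean Owen suggests above amounts to computing the (prediction, label) pairs in a single map over the test set, so partitioning can never disagree. A minimal sketch, assuming the `model` and `test` values from the linked docs example rather than anything defined in this JIRA:
{code}
// Zip-free version of the accuracy computation from the Naive Bayes example.
val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))
val accuracy =
  1.0 * predictionAndLabel.filter { case (pred, label) => pred == label }.count() / test.count()
{code}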
[jira] [Created] (SPARK-2263) Can't insert Map&lt;K, V&gt; values to Hive tables

Cheng Lian created SPARK-2263: - Summary: Can't insert MapK, V values to Hive tables Key: SPARK-2263 URL: https://issues.apache.org/jira/browse/SPARK-2263 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Cheng Lian Scala {{Map\[K, V\]}} values are not converted to their Java correspondence: {code} scala loadTestTable(src) scala hql(create table m(value mapint, string)) res1: org.apache.spark.sql.SchemaRDD = SchemaRDD[0] at RDD at SchemaRDD.scala:100 == Query Plan == Native command: executed by Hive scala hql(insert overwrite table m select map(key, value) from src) org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 1 times, most recent failure: Exception failure in TID 0 on host localhost: java.lang.ClassCastException: scala.collection.immutable.Map$Map1 cannot be cast to java.util.Map org.apache.hadoop.hive.serde2.objectinspector.StandardMapObjectInspector.getMap(StandardMapObjectInspector.java:82) org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:515) org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:439) org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:423) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$2$$anonfun$apply$1.apply(InsertIntoHiveTable.scala:200) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$2$$anonfun$apply$1.apply(InsertIntoHiveTable.scala:192) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:112) org.apache.spark.scheduler.Task.run(Task.scala:51) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1040) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1024) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1022) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1022) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:640) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:640) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:640) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1214) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at 
akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) scala {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
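The ClassCastException above comes from handing a Scala immutable Map to Hive's StandardMapObjectInspector, which expects a java.util.Map. A minimal sketch of the missing conversion (the real fix belongs in the Hive insertion path, per the PR referenced later in this digest):
{code}
import scala.collection.JavaConverters._

// Wrap a Scala Map so Hive's object inspectors see a java.util.Map.
def toJavaMap[K, V](m: Map[K, V]): java.util.Map[K, V] = m.asJava
{code}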
[jira] [Created] (SPARK-2264) CachedTableSuite SQL Tests are Failing
Patrick Wendell created SPARK-2264: -- Summary: CachedTableSuite SQL Tests are Failing Key: SPARK-2264 URL: https://issues.apache.org/jira/browse/SPARK-2264 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Assignee: Michael Armbrust Priority: Blocker {code} [info] CachedTableSuite: [info] - read from cached table and uncache *** FAILED *** [info] java.lang.RuntimeException: Table Not Found: testData [info] at scala.sys.package$.error(package.scala:27) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:64) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:64) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog.lookupRelation(Catalog.scala:64) [info] at org.apache.spark.sql.SQLContext.table(SQLContext.scala:185) [info] at org.apache.spark.sql.CachedTableSuite$$anonfun$1.apply$mcV$sp(CachedTableSuite.scala:43) [info] at org.apache.spark.sql.CachedTableSuite$$anonfun$1.apply(CachedTableSuite.scala:27) [info] at org.apache.spark.sql.CachedTableSuite$$anonfun$1.apply(CachedTableSuite.scala:27) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) [info] ... [info] - correct error on uncache of non-cached table *** FAILED *** [info] Expected exception java.lang.IllegalArgumentException to be thrown, but java.lang.RuntimeException was thrown. (CachedTableSuite.scala:55) [info] - SELECT Star Cached Table *** FAILED *** [info] java.lang.RuntimeException: Table Not Found: testData [info] at scala.sys.package$.error(package.scala:27) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:64) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:64) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog.lookupRelation(Catalog.scala:64) [info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$1.applyOrElse(Analyzer.scala:67) [info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$1.applyOrElse(Analyzer.scala:65) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165) [info] at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183) [info] at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) [info] ... [info] - Self-join cached *** FAILED *** [info] java.lang.RuntimeException: Table Not Found: testData [info] at scala.sys.package$.error(package.scala:27) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:64) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:64) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog.lookupRelation(Catalog.scala:64) [info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$1.applyOrElse(Analyzer.scala:67) [info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$1.applyOrElse(Analyzer.scala:65) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165) [info] at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183) [info] at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) [info] ... 
[info] - 'CACHE TABLE' and 'UNCACHE TABLE' SQL statement *** FAILED *** [info] java.lang.RuntimeException: Table Not Found: testData [info] at scala.sys.package$.error(package.scala:27) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:64) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:64) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog.lookupRelation(Catalog.scala:64) [info] at org.apache.spark.sql.SQLContext.cacheTable(SQLContext.scala:189) [info] at org.apache.spark.sql.execution.CacheCommand.sideEffectResult$lzycompute(commands.scala:110) [info] at org.apache.spark.sql.execution.CacheCommand.sideEffectResult(commands.scala:108) [info] at org.apache.spark.sql.execution.CacheCommand.execute(commands.scala:118) [info] at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:322) [info] ... {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2264) CachedTableSuite SQL Tests are Failing
[ https://issues.apache.org/jira/browse/SPARK-2264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2264: Component/s: SQL CachedTableSuite SQL Tests are Failing -- Key: SPARK-2264 URL: https://issues.apache.org/jira/browse/SPARK-2264 Project: Spark Issue Type: Bug Components: SQL Reporter: Patrick Wendell Assignee: Michael Armbrust Priority: Blocker {code} [info] CachedTableSuite: [info] - read from cached table and uncache *** FAILED *** [info] java.lang.RuntimeException: Table Not Found: testData [info] at scala.sys.package$.error(package.scala:27) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:64) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:64) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog.lookupRelation(Catalog.scala:64) [info] at org.apache.spark.sql.SQLContext.table(SQLContext.scala:185) [info] at org.apache.spark.sql.CachedTableSuite$$anonfun$1.apply$mcV$sp(CachedTableSuite.scala:43) [info] at org.apache.spark.sql.CachedTableSuite$$anonfun$1.apply(CachedTableSuite.scala:27) [info] at org.apache.spark.sql.CachedTableSuite$$anonfun$1.apply(CachedTableSuite.scala:27) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) [info] ... [info] - correct error on uncache of non-cached table *** FAILED *** [info] Expected exception java.lang.IllegalArgumentException to be thrown, but java.lang.RuntimeException was thrown. (CachedTableSuite.scala:55) [info] - SELECT Star Cached Table *** FAILED *** [info] java.lang.RuntimeException: Table Not Found: testData [info] at scala.sys.package$.error(package.scala:27) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:64) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:64) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog.lookupRelation(Catalog.scala:64) [info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$1.applyOrElse(Analyzer.scala:67) [info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$1.applyOrElse(Analyzer.scala:65) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165) [info] at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183) [info] at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) [info] ... 
[info] - Self-join cached *** FAILED *** [info] java.lang.RuntimeException: Table Not Found: testData [info] at scala.sys.package$.error(package.scala:27) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:64) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:64) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog.lookupRelation(Catalog.scala:64) [info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$1.applyOrElse(Analyzer.scala:67) [info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$1.applyOrElse(Analyzer.scala:65) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165) [info] at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183) [info] at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) [info] ... [info] - 'CACHE TABLE' and 'UNCACHE TABLE' SQL statement *** FAILED *** [info] java.lang.RuntimeException: Table Not Found: testData [info] at scala.sys.package$.error(package.scala:27) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:64) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:64) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog.lookupRelation(Catalog.scala:64) [info] at org.apache.spark.sql.SQLContext.cacheTable(SQLContext.scala:189) [info] at org.apache.spark.sql.execution.CacheCommand.sideEffectResult$lzycompute(commands.scala:110) [info] at org.apache.spark.sql.execution.CacheCommand.sideEffectResult(commands.scala:108) [info] at org.apache.spark.sql.execution.CacheCommand.execute(commands.scala:118) [info] at
[jira] [Commented] (SPARK-2259) Spark submit documentation for --deploy-mode is highly misleading
[ https://issues.apache.org/jira/browse/SPARK-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14042811#comment-14042811 ] Andrew Or commented on SPARK-2259: -- https://github.com/apache/spark/pull/1200 Spark submit documentation for --deploy-mode is highly misleading - Key: SPARK-2259 URL: https://issues.apache.org/jira/browse/SPARK-2259 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.1.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical There are a few issues: 1. Client mode does not necessarily mean the driver program must be launched outside of the cluster. 2. For standalone clusters, only client mode is currently supported. This was the case supported even before 1.0. Currently, the docs tell the user to use cluster deploy mode when deploying your driver program within the cluster, which is true also for standalone-client mode. In short, the docs encourage the user to use standalone-cluster, an unsupported mode. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2266) Log page on Worker UI displays Some(app-id)
[ https://issues.apache.org/jira/browse/SPARK-2266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2266: - Attachment: Screen Shot 2014-06-24 at 5.07.54 PM.png Log page on Worker UI displays Some(app-id) - Key: SPARK-2266 URL: https://issues.apache.org/jira/browse/SPARK-2266 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: Andrew Or Priority: Minor Fix For: 1.1.0 Attachments: Screen Shot 2014-06-24 at 5.07.54 PM.png Oops. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2267) Log the exception when TaskResultGetter fails to fetch/deserialize task result
[ https://issues.apache.org/jira/browse/SPARK-2267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2267: --- Summary: Log the exception when TaskResultGetter fails to fetch/deserialize task result (was: Log the exception when we fail to fetch/deserialize task result) Log the exception when TaskResultGetter fails to fetch/deserialize task result -- Key: SPARK-2267 URL: https://issues.apache.org/jira/browse/SPARK-2267 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Reynold Xin In 1.0.0, it is pretty confusing to get an error message that just says: Exception while deserializing and fetching task ... without a stack trace. This has been fixed in master by [~matei] https://github.com/apache/spark/commit/a01f3401e32ca4324884d13c9fad53c6c87bb5f0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2267) Log exception when TaskResultGetter fails to fetch/deserialize task result
[ https://issues.apache.org/jira/browse/SPARK-2267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2267: --- Summary: Log exception when TaskResultGetter fails to fetch/deserialize task result (was: Log the exception when TaskResultGetter fails to fetch/deserialize task result) Log exception when TaskResultGetter fails to fetch/deserialize task result -- Key: SPARK-2267 URL: https://issues.apache.org/jira/browse/SPARK-2267 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Reynold Xin In 1.0.0, it is pretty confusing to get an error message that just says: Exception while deserializing and fetching task ... without a stack trace. This has been fixed in master by [~matei] https://github.com/apache/spark/commit/a01f3401e32ca4324884d13c9fad53c6c87bb5f0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2268) Utils.createTempDir() creates race with HDFS at shutdown
Marcelo Vanzin created SPARK-2268: - Summary: Utils.createTempDir() creates race with HDFS at shutdown Key: SPARK-2268 URL: https://issues.apache.org/jira/browse/SPARK-2268 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Marcelo Vanzin Utils.createTempDir() has this code: {code} // Add a shutdown hook to delete the temp dir when the JVM exits Runtime.getRuntime.addShutdownHook(new Thread("delete Spark temp dir " + dir) { override def run() { // Attempt to delete if some path which is parent of this is not already registered. if (! hasRootAsShutdownDeleteDir(dir)) Utils.deleteRecursively(dir) } }) {code} This creates a race with the shutdown hooks registered by HDFS, since the order of execution is undefined; if the HDFS hooks run first, you'll get exceptions about the file system being closed. Instead, this should use Hadoop's ShutdownHookManager with a proper priority, so that it runs before the HDFS hooks. -- This message was sent by Atlassian JIRA (v6.2#6252)
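A sketch of the suggested alternative, assuming Hadoop 2.x's org.apache.hadoop.util.ShutdownHookManager, where higher-priority hooks run earlier. The priority constant below is an assumption, and `dir`, `hasRootAsShutdownDeleteDir` and `deleteRecursively` are the names from the quoted snippet:
{code}
import org.apache.hadoop.util.ShutdownHookManager

// Register the temp-dir cleanup with a priority above the HDFS FileSystem hook so
// it runs before the HDFS client is closed (priority value is illustrative).
val TEMP_DIR_SHUTDOWN_PRIORITY = 25
ShutdownHookManager.get().addShutdownHook(new Runnable {
  override def run(): Unit = {
    if (!hasRootAsShutdownDeleteDir(dir)) Utils.deleteRecursively(dir)
  }
}, TEMP_DIR_SHUTDOWN_PRIORITY)
{code}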
[jira] [Commented] (SPARK-2192) Examples Data Not in Binary Distribution
[ https://issues.apache.org/jira/browse/SPARK-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14042938#comment-14042938 ] Henry Saputra commented on SPARK-2192: -- I think several tests already have the data in the main/resources. Do you have list of which ones missing? Examples Data Not in Binary Distribution Key: SPARK-2192 URL: https://issues.apache.org/jira/browse/SPARK-2192 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.0 Reporter: Pat McDonough The data used by examples is not packaged up with the binary distribution. The data subdirectory of spark should make it's way in to the distribution somewhere so the examples can use it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2263) Can't insert Map&lt;K, V&gt; values to Hive tables
[ https://issues.apache.org/jira/browse/SPARK-2263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14042941#comment-14042941 ] Cheng Lian commented on SPARK-2263: --- PR: https://github.com/apache/spark/pull/1205 Can't insert MapK, V values to Hive tables Key: SPARK-2263 URL: https://issues.apache.org/jira/browse/SPARK-2263 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Cheng Lian Scala {{Map\[K, V\]}} values are not converted to their Java correspondence: {code} scala loadTestTable(src) scala hql(create table m(value mapint, string)) res1: org.apache.spark.sql.SchemaRDD = SchemaRDD[0] at RDD at SchemaRDD.scala:100 == Query Plan == Native command: executed by Hive scala hql(insert overwrite table m select map(key, value) from src) org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 1 times, most recent failure: Exception failure in TID 0 on host localhost: java.lang.ClassCastException: scala.collection.immutable.Map$Map1 cannot be cast to java.util.Map org.apache.hadoop.hive.serde2.objectinspector.StandardMapObjectInspector.getMap(StandardMapObjectInspector.java:82) org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:515) org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:439) org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:423) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$2$$anonfun$apply$1.apply(InsertIntoHiveTable.scala:200) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$2$$anonfun$apply$1.apply(InsertIntoHiveTable.scala:192) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:112) org.apache.spark.scheduler.Task.run(Task.scala:51) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1040) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1024) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1022) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1022) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:640) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:640) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:640) at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1214) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) scala {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2245) VertexRDD can not be materialized for checkpointing
[ https://issues.apache.org/jira/browse/SPARK-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14041448#comment-14041448 ] Baoxu Shi edited comment on SPARK-2245 at 6/25/14 1:43 AM: --- Hi [~ankurd], I changed my pull request. But there is another exception, ShippableVertexPartition is not serializable. So I serialized it, but there is another exception org.apache.spark.graphx.impl.RoutingTablePartition is not serializable. Then I serialized it again, but on iteration 2 there will be an exception: org.apache.spark.graphx.impl.ShippableVertexPartition cannot be cast to scala.Tuple2 The code I'm using are: val conf = new SparkConf().setAppName(HDTM) .setMaster(local[4]) val sc = new SparkContext(conf) sc.setCheckpointDir(./checkpoint) val v = sc.parallelize(Seq[(VertexId, Long)]((0L, 0L), (1L, 1L), (2L, 2L))) val e = sc.parallelize(Seq[Edge[Long]](Edge(0L, 1L, 0L), Edge(1L, 2L, 1L), Edge(2L, 0L, 2L))) var g = Graph(v, e) val vertexIds = Seq(0L, 1L, 2L) var prevG: Graph[VertexId, Long] = null for (i - 1 to 2000) { vertexIds.toStream.foreach(id = { prevG = g g = Graph(g.vertices, g.edges) g.vertices.cache() g.edges.cache() prevG.unpersistVertices(blocking = false) prevG.edges.unpersist(blocking = false) }) g.vertices.checkpoint() g.edges.checkpoint() g.edges.count() g.vertices.count() println(s${g.vertices.isCheckpointed} ${g.edges.isCheckpointed}) println( iter + i + finished) } println(g.vertices.collect().mkString( )) println(g.edges.collect().mkString( )) Am I on the right track? Or Should there be another way to change it? was (Author: bxshi): Just submit the changes, thanks! VertexRDD can not be materialized for checkpointing --- Key: SPARK-2245 URL: https://issues.apache.org/jira/browse/SPARK-2245 Project: Spark Issue Type: Bug Components: GraphX Reporter: Baoxu Shi Seems one can not materialize VertexRDD by simply calling count method, which is overridden by VertexRDD. But if you call RDD's count, it could materialize it. Is this a feature that designed to get the count without materialize VertexRDD? If so, do you guys think it is necessary to add a materialize method to VertexRDD? By the way, does count() is the cheapest way to materialize a RDD? Or it just cost the same resources like other actions? The pull request is here: https://github.com/apache/spark/pull/1177 Best, -- This message was sent by Atlassian JIRA (v6.2#6252)
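The Scala snippet in the comment above lost its string quotes and `=>` arrows somewhere along the way; a restored and slightly simplified version of its core (the inner foreach over vertexIds is collapsed into the outer loop), for readability only:
{code}
// Restored from the comment above; quotes and arrows re-added.
sc.setCheckpointDir("./checkpoint")
val v = sc.parallelize(Seq[(VertexId, Long)]((0L, 0L), (1L, 1L), (2L, 2L)))
val e = sc.parallelize(Seq[Edge[Long]](Edge(0L, 1L, 0L), Edge(1L, 2L, 1L), Edge(2L, 0L, 2L)))
var g = Graph(v, e)
var prevG: Graph[VertexId, Long] = null
for (i <- 1 to 2000) {
  prevG = g
  g = Graph(g.vertices, g.edges)
  g.vertices.cache()
  g.edges.cache()
  prevG.unpersistVertices(blocking = false)
  prevG.edges.unpersist(blocking = false)
  g.vertices.checkpoint()
  g.edges.checkpoint()
  g.edges.count()
  g.vertices.count()
  println(s"${g.vertices.isCheckpointed} ${g.edges.isCheckpointed}")
  println("iter " + i + " finished")
}
{code}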
[jira] [Created] (SPARK-2269) Clean up and add unit tests for resourceOffers in MesosSchedulerBackend
Patrick Wendell created SPARK-2269: -- Summary: Clean up and add unit tests for resourceOffers in MesosSchedulerBackend Key: SPARK-2269 URL: https://issues.apache.org/jira/browse/SPARK-2269 Project: Spark Issue Type: Bug Components: Mesos Reporter: Patrick Wendell This function could be simplified a bit. We could re-write it without offerableIndices or creating the mesosTasks array as large as the offer list. There is a lot of logic around making sure you get the correct index into mesosTasks and offers, really we should just build mesosTasks directly from the offers we get back. To associate the tasks we are launching with the offers we can just create a hashMap from the slaveId to the original offer. The basic logic of the function is that you take the mesos offers, convert them to spark offers, then convert the results back. One thing we should check is whether Mesos guarantees that it won't give two offers for the same worker. That would make things much more complicated. -- This message was sent by Atlassian JIRA (v6.2#6252)
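A sketch of the restructuring described above, with simplified stand-in types rather than the real Mesos and Spark classes: launched tasks are associated back to the offer they consume through a slaveId-keyed map instead of parallel index-aligned arrays. It assumes Mesos never sends two offers for the same slave in one batch, which is exactly the open question raised in the description.
{code}
// Simplified stand-in types, illustration only.
case class Offer(offerId: String, slaveId: String, cores: Int)
case class LaunchedTask(slaveId: String, taskId: Long)

def groupTasksByOffer(offers: Seq[Offer], tasks: Seq[LaunchedTask]): Map[Offer, Seq[LaunchedTask]] = {
  val offerBySlave = offers.map(o => o.slaveId -> o).toMap   // assumes one offer per slave
  tasks.groupBy(t => offerBySlave(t.slaveId))
}
{code}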
[jira] [Commented] (SPARK-2156) When the size of serialized results for one partition is slightly smaller than 10MB (the default akka.frameSize), the execution blocks
[ https://issues.apache.org/jira/browse/SPARK-2156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14042962#comment-14042962 ] Patrick Wendell commented on SPARK-2156: Fixed in 1.1.0 via: https://github.com/apache/spark/pull/1132 When the size of serialized results for one partition is slightly smaller than 10MB (the default akka.frameSize), the execution blocks -- Key: SPARK-2156 URL: https://issues.apache.org/jira/browse/SPARK-2156 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.1, 1.0.0 Environment: AWS EC2 1 master 2 slaves with the instance type of r3.2xlarge Reporter: Chen Jin Assignee: Xiangrui Meng Priority: Blocker Fix For: 1.0.1, 1.1.0 Original Estimate: 504h Remaining Estimate: 504h I have done some experiments when the frameSize is around 10MB . 1) spark.akka.frameSize = 10 If one of the partition size is very close to 10MB, say 9.97MB, the execution blocks without any exception or warning. Worker finished the task to send the serialized result, and then throw exception saying hadoop IPC client connection stops (changing the logging to debug level). However, the master never receives the results and the program just hangs. But if sizes for all the partitions less than some number btw 9.96MB amd 9.97MB, the program works fine. 2) spark.akka.frameSize = 9 when the partition size is just a little bit smaller than 9MB, it fails as well. This bug behavior is not exactly what spark-1112 is about. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2270) Kryo cannot serialize results returned by asJavaIterable (and thus groupBy/cogroup are broken in Java APIs when Kryo is used)
Reynold Xin created SPARK-2270: -- Summary: Kryo cannot serialize results returned by asJavaIterable (and thus groupBy/cogroup are broken in Java APIs when Kryo is used) Key: SPARK-2270 URL: https://issues.apache.org/jira/browse/SPARK-2270 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical The combination of Kryo serializer Java API could lead to the following exception in groupBy/groupByKey/cogroup: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Exception while deserializing and fetching task: java.lang.UnsupportedOperationException org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017) org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015) scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015) org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) scala.Option.foreach(Option.scala:236) org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633) org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207) akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) akka.actor.ActorCell.invoke(ActorCell.scala:456) akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) akka.dispatch.Mailbox.run(Mailbox.scala:219) akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)3:45 PM {code} or {code} 14/06/24 16:38:09 ERROR TaskResultGetter: Exception while getting task result java.lang.UnsupportedOperationException at java.util.AbstractCollection.add(AbstractCollection.java:260) at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109) at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at carbonite.serializer$mk_collection_reader$fn__50.invoke(serializer.clj:57) at clojure.lang.Var.invoke(Var.java:383) at carbonite.ClojureVecSerializer.read(ClojureVecSerializer.java:17) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:144) at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79) at org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:480) at 
org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:316) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1213) at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) {code} Thanks [~sorenmacbeth] for reporting this. -- This message was sent by Atlassian JIRA (v6.2#6252)
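One possible application-side workaround (the proper fix belongs in Spark's Kryo setup, not in user code): copy the `asJavaIterable` wrapper into a concrete java.util.ArrayList before it is serialized, so Kryo's CollectionSerializer deserializes into a collection that supports add().
{code}
import scala.collection.JavaConverters._

// Copy a wrapper Iterable (e.g. the result of asJavaIterable) into a plain ArrayList.
def materialize[T](it: java.lang.Iterable[T]): java.util.ArrayList[T] = {
  val list = new java.util.ArrayList[T]()
  it.asScala.foreach(x => list.add(x))
  list
}
{code}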
[jira] [Resolved] (SPARK-2248) spark.default.parallelism does not apply in local mode
[ https://issues.apache.org/jira/browse/SPARK-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2248. Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 1194 [https://github.com/apache/spark/pull/1194] spark.default.parallelism does not apply in local mode -- Key: SPARK-2248 URL: https://issues.apache.org/jira/browse/SPARK-2248 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Matei Zaharia Assignee: Guoqiang Li Priority: Trivial Labels: Starter Fix For: 1.1.0 LocalBackend.defaultParallelism ignores the spark.default.parallelism property, unlike the other SchedulerBackends. We should make it take this in for consistency. -- This message was sent by Atlassian JIRA (v6.2#6252)
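For reference, the setting involved; after the fix, local mode should honor it the same way the other backends do. A minimal check, with arbitrary values:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Arbitrary values, just to show where the property is set.
val conf = new SparkConf()
  .setMaster("local[4]")
  .setAppName("default-parallelism-check")
  .set("spark.default.parallelism", "8")
val sc = new SparkContext(conf)
println(sc.defaultParallelism)   // should reflect the property once LocalBackend honors it
{code}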
[jira] [Updated] (SPARK-2271) Use Hive's high performance Decimal128 to replace BigDecimal
[ https://issues.apache.org/jira/browse/SPARK-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2271: --- Description: Hive JIRA: https://issues.apache.org/jira/browse/HIVE-6017 Use Hive's high performance Decimal128 to replace BigDecimal Key: SPARK-2271 URL: https://issues.apache.org/jira/browse/SPARK-2271 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Cheng Lian Hive JIRA: https://issues.apache.org/jira/browse/HIVE-6017 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2271) Use Hive's high performance Decimal128 to replace BigDecimal
[ https://issues.apache.org/jira/browse/SPARK-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2271: --- Assignee: Cheng Lian Use Hive's high performance Decimal128 to replace BigDecimal Key: SPARK-2271 URL: https://issues.apache.org/jira/browse/SPARK-2271 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Cheng Lian -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2272) Feature scaling which standardizes the range of independent variables or features of data.
DB Tsai created SPARK-2272: -- Summary: Feature scaling which standardizes the range of independent variables or features of data. Key: SPARK-2272 URL: https://issues.apache.org/jira/browse/SPARK-2272 Project: Spark Issue Type: New Feature Components: MLlib Reporter: DB Tsai Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step. In this work, a trait called `VectorTransformer` is defined for generic transformation of a vector. It contains two methods: `apply`, which applies the transformation to a vector, and `unapply`, which applies the inverse transformation to a vector. There are three concrete implementations of `VectorTransformer`, and they can all be easily extended with PMML transformation support. 1) `VectorStandardizer` - Standardizes a vector given the mean and variance. Since the standardization densifies the output, the output is always in dense vector format. 2) `VectorRescaler` - Rescales a vector into a target range specified by a tuple of two double values, or by two vectors as the new target minimum and maximum. Since the rescaling subtracts the minimum of each column first, the output will always be a dense vector regardless of the input vector type. 3) `VectorDivider` - Transforms a vector by dividing it by a constant, or by another vector on an element-by-element basis. This transformation preserves the type of the input vector without densifying the result. Utility helper methods are implemented that take an RDD[Vector] as input and return the transformed RDD[Vector] together with the transformer, for dividing, rescaling, normalization, and standardization. -- This message was sent by Atlassian JIRA (v6.2#6252)
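As an illustration of the proposed API, a minimal Scala sketch of the trait and the standardizer (the trait and method names come from the description above; the field names and arithmetic are my own assumptions, not the actual patch):
{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Proposed trait: apply performs the transformation, unapply its inverse.
trait VectorTransformer extends Serializable {
  def apply(v: Vector): Vector
  def unapply(v: Vector): Vector
}

// Sketch of VectorStandardizer: (x - mean) / stddev per column, always
// returning a dense vector since standardization densifies the output.
class VectorStandardizer(mean: Vector, variance: Vector) extends VectorTransformer {
  private val meanArr = mean.toArray
  private val stdArr = variance.toArray.map(math.sqrt)

  override def apply(v: Vector): Vector = {
    val values = v.toArray.zipWithIndex.map { case (x, i) =>
      if (stdArr(i) != 0.0) (x - meanArr(i)) / stdArr(i) else x - meanArr(i)
    }
    Vectors.dense(values)
  }

  override def unapply(v: Vector): Vector = {
    val values = v.toArray.zipWithIndex.map { case (x, i) =>
      x * stdArr(i) + meanArr(i)
    }
    Vectors.dense(values)
  }
}
{code}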
[jira] [Commented] (SPARK-2251) MLLib Naive Bayes Example SparkException: Can only zip RDDs with same number of elements in each partition
[ https://issues.apache.org/jira/browse/SPARK-2251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14043052#comment-14043052 ] Jun Xie commented on SPARK-2251: Hi Sean, thanks very much for your insight. I am new to Spark, so if you can easily supply a patch, please do; I would really appreciate it. I am digging into it following your suggestion to see what is going on, and familiarizing myself with Spark along the way. MLLib Naive Bayes Example SparkException: Can only zip RDDs with same number of elements in each partition -- Key: SPARK-2251 URL: https://issues.apache.org/jira/browse/SPARK-2251 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Environment: OS: Fedora Linux Spark Version: 1.0.0. Git clone from the Spark Repository Reporter: Jun Xie Priority: Minor Labels: Naive-Bayes I followed the exact code from the Naive Bayes example (http://spark.apache.org/docs/latest/mllib-naive-bayes.html) in MLlib. When I executed the final command: val accuracy = 1.0 * predictionAndLabel.filter(x => x._1 == x._2).count() / test.count() it complained "Can only zip RDDs with same number of elements in each partition". I got the following exception: 14/06/23 19:39:23 INFO SparkContext: Starting job: count at console:31 14/06/23 19:39:23 INFO DAGScheduler: Got job 3 (count at console:31) with 2 output partitions (allowLocal=false) 14/06/23 19:39:23 INFO DAGScheduler: Final stage: Stage 4(count at console:31) 14/06/23 19:39:23 INFO DAGScheduler: Parents of final stage: List() 14/06/23 19:39:23 INFO DAGScheduler: Missing parents: List() 14/06/23 19:39:23 INFO DAGScheduler: Submitting Stage 4 (FilteredRDD[14] at filter at console:31), which has no missing parents 14/06/23 19:39:23 INFO DAGScheduler: Submitting 2 missing tasks from Stage 4 (FilteredRDD[14] at filter at console:31) 14/06/23 19:39:23 INFO TaskSchedulerImpl: Adding task set 4.0 with 2 tasks 14/06/23 19:39:23 INFO TaskSetManager: Starting task 4.0:0 as TID 8 on executor localhost: localhost (PROCESS_LOCAL) 14/06/23 19:39:23 INFO TaskSetManager: Serialized task 4.0:0 as 3410 bytes in 0 ms 14/06/23 19:39:23 INFO TaskSetManager: Starting task 4.0:1 as TID 9 on executor localhost: localhost (PROCESS_LOCAL) 14/06/23 19:39:23 INFO TaskSetManager: Serialized task 4.0:1 as 3410 bytes in 1 ms 14/06/23 19:39:23 INFO Executor: Running task ID 8 14/06/23 19:39:23 INFO Executor: Running task ID 9 14/06/23 19:39:23 INFO BlockManager: Found block broadcast_0 locally 14/06/23 19:39:23 INFO BlockManager: Found block broadcast_0 locally 14/06/23 19:39:23 INFO HadoopRDD: Input split: file:/home/jun/open_source/spark/mllib/data/sample_naive_bayes_data.txt:0+24 14/06/23 19:39:23 INFO HadoopRDD: Input split: file:/home/jun/open_source/spark/mllib/data/sample_naive_bayes_data.txt:24+24 14/06/23 19:39:23 INFO HadoopRDD: Input split: file:/home/jun/open_source/spark/mllib/data/sample_naive_bayes_data.txt:0+24 14/06/23 19:39:23 INFO HadoopRDD: Input split: file:/home/jun/open_source/spark/mllib/data/sample_naive_bayes_data.txt:24+24 14/06/23 19:39:23 ERROR Executor: Exception in task ID 9 org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition at org.apache.spark.rdd.RDD$$anonfun$zip$1$$anon$1.hasNext(RDD.scala:663) at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1067) at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:858) at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:858) at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1079) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1079) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) 14/06/23 19:39:23 ERROR Executor: Exception in task ID 8 org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition at org.apache.spark.rdd.RDD$$anonfun$zip$1$$anon$1.hasNext(RDD.scala:663) at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1067) at
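For background on the error above: RDD.zip requires both RDDs to have the same number of partitions and the same number of elements in each partition. A zip-free way to build the (prediction, label) pairs is a plain map over the test set (a sketch, assuming a trained NaiveBayesModel named `model` and a test RDD[LabeledPoint] named `test`, as in the linked example):
{code}
// Avoids zip entirely: predict each test point and pair it with its label.
val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))
val accuracy = 1.0 * predictionAndLabel.filter(x => x._1 == x._2).count() / test.count()
{code}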
[jira] [Commented] (SPARK-2268) Utils.createTempDir() creates race with HDFS at shutdown
[ https://issues.apache.org/jira/browse/SPARK-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14043071#comment-14043071 ] Mridul Muralidharan commented on SPARK-2268: Setting a priority for shutdown hooks does not have much impact given the state of the VM. Note that this hook is trying to delete local directories - not dfs directories. Utils.createTempDir() creates race with HDFS at shutdown Key: SPARK-2268 URL: https://issues.apache.org/jira/browse/SPARK-2268 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Marcelo Vanzin Utils.createTempDir() has this code: {code} // Add a shutdown hook to delete the temp dir when the JVM exits Runtime.getRuntime.addShutdownHook(new Thread("delete Spark temp dir " + dir) { override def run() { // Attempt to delete if some patch which is parent of this is not already registered. if (! hasRootAsShutdownDeleteDir(dir)) Utils.deleteRecursively(dir) } }) {code} This creates a race with the shutdown hooks registered by HDFS, since the order of execution is undefined; if the HDFS hooks run first, you'll get exceptions about the file system being closed. Instead, this should use Hadoop's ShutdownHookManager with a proper priority, so that it runs before the HDFS hooks. -- This message was sent by Atlassian JIRA (v6.2#6252)
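A minimal sketch of the proposal in the description (my own illustration, not the eventual patch; `dir`, `hasRootAsShutdownDeleteDir`, and `Utils.deleteRecursively` are the names from the quoted code): register the cleanup through Hadoop's ShutdownHookManager with a priority above the FileSystem hook so it runs first.
{code}
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.util.ShutdownHookManager

// Hooks with higher priority run earlier, so this executes before the HDFS
// client shutdown hooks, which register at FileSystem.SHUTDOWN_HOOK_PRIORITY.
ShutdownHookManager.get().addShutdownHook(new Runnable {
  override def run(): Unit = {
    if (!hasRootAsShutdownDeleteDir(dir)) Utils.deleteRecursively(dir)
  }
}, FileSystem.SHUTDOWN_HOOK_PRIORITY + 1)
{code}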
[jira] [Commented] (SPARK-2265) MIMA excludes aren't generated for external modules
[ https://issues.apache.org/jira/browse/SPARK-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14043076#comment-14043076 ] Prashant Sharma commented on SPARK-2265: Those jars are already on its classpath. Take a look at https://github.com/apache/spark/blob/master/project/SparkBuild.scala#L635. And I have also observed in the logs that it fails to instrument some of them: {noformat} Unable to load:org/apache/spark/streaming/flume/FlumeEventServer.class Unable to load:org/apache/spark/streaming/mqtt/MQTTReceiver$$anon$1.class Unable to load:org/apache/spark/streaming/twitter/TwitterReceiver$$anon$1.class {noformat} MIMA excludes aren't generated for external modules --- Key: SPARK-2265 URL: https://issues.apache.org/jira/browse/SPARK-2265 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: Patrick Wendell Assignee: Prashant Sharma Fix For: 1.1.0 The Flume, etc. code doesn't get considered when generating the MIMA excludes. I think it could work to just set SPARK_CLASSPATH with the package jars for those modules before running GenerateMimaIgnore so they end up visible to the class loader. -- This message was sent by Atlassian JIRA (v6.2#6252)