[jira] [Closed] (SPARK-1784) Add a partitioner which partitions an RDD with each partition having specified # of keys
[ https://issues.apache.org/jira/browse/SPARK-1784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia closed SPARK-1784. Resolution: Invalid Fix Version/s: (was: 1.0.0) Add a partitioner which partitions an RDD with each partition having specified # of keys Key: SPARK-1784 URL: https://issues.apache.org/jira/browse/SPARK-1784 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 0.9.0 Reporter: Syed A. Hashmi Priority: Minor At times on mailing lists, I have seen people complaining about having no control over the # of keys per partition. RangePartitioner partitions keys into roughly equal-sized partitions, but in cases where the user wants full control over the exact partition size, that is not possible today. -- This message was sent by Atlassian JIRA (v6.2#6252)
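The core of the requested partitioner is a mapping from a key's global rank to a partition index. A minimal sketch of that arithmetic follows; this is illustration only, not Spark's Partitioner API, and it assumes keys have already been assigned a global rank 0..n-1 (e.g. by sampling and sorting, the way RangePartitioner estimates ranges). The class and method names here are hypothetical.

```java
// Sketch: assign each key to a partition holding at most `keysPerPartition` keys.
// Not Spark code; a real implementation would extend org.apache.spark.Partitioner
// and derive ranks from sampled key ranges.
public class FixedSizePartitioner {
    private final int keysPerPartition;

    public FixedSizePartitioner(int keysPerPartition) {
        if (keysPerPartition <= 0) {
            throw new IllegalArgumentException("keysPerPartition must be > 0");
        }
        this.keysPerPartition = keysPerPartition;
    }

    // Partition index for the key with the given global rank (0-based).
    public int partitionFor(long rank) {
        return (int) (rank / keysPerPartition);
    }

    // Number of partitions needed to hold n distinct keys.
    public int numPartitions(long n) {
        return (int) ((n + keysPerPartition - 1) / keysPerPartition);
    }
}
```

With keysPerPartition = 100, ranks 0..99 land in partition 0, rank 100 starts partition 1, and 250 keys need 3 partitions; the user gets an exact per-partition key count instead of RangePartitioner's roughly equal split.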
[jira] [Updated] (SPARK-1811) Support resizable output buffer for kryo serializer
[ https://issues.apache.org/jira/browse/SPARK-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1811: - Assignee: Koert Kuipers Support resizable output buffer for kryo serializer --- Key: SPARK-1811 URL: https://issues.apache.org/jira/browse/SPARK-1811 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.0 Reporter: koert kuipers Assignee: Koert Kuipers Priority: Minor Currently the size of the kryo serializer output buffer can be set with spark.kryoserializer.buffer.mb. The issue with this setting is that it has to be one-size-fits-all, so it ends up being the maximum size needed, even if only a single task out of many needs it to be that big. A resizable buffer will allow most tasks to use a modest-sized buffer, while the occasional task that needs a really big buffer can get it at a cost (allocating a new buffer and copying the contents over repeatedly as the buffer grows... with each new allocation the size doubles). The class used for the buffer is kryo's Output, which supports resizing if maxCapacity is set bigger than capacity. I suggest we provide a setting spark.kryoserializer.buffer.max.mb which defaults to spark.kryoserializer.buffer.mb, and which sets Output's maxCapacity. Pull request for this JIRA: https://github.com/apache/spark/pull/735 -- This message was sent by Atlassian JIRA (v6.2#6252)
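The grow-at-a-cost behavior described above (double until the write fits, never exceed the max) can be sketched independently of kryo. This is only the growth arithmetic, not the actual kryo Output class; the method name is a stand-in.

```java
// Sketch of doubling-growth arithmetic: starting from `capacity` bytes, double
// until `needed` bytes fit, but cap at `maxCapacity`. This mirrors the behavior
// described for kryo's Output when maxCapacity > capacity; illustration only.
public class GrowthSketch {
    public static int grow(int capacity, int needed, int maxCapacity) {
        if (needed > maxCapacity) {
            // The analogue of kryo's buffer-overflow error when even the max is too small.
            throw new IllegalStateException(
                "required " + needed + " bytes exceeds max capacity " + maxCapacity);
        }
        int size = Math.max(capacity, 1); // guard against a zero starting capacity
        while (size < needed) {
            size = Math.min(size * 2, maxCapacity); // each reallocation doubles
        }
        return size;
    }
}
```

So a task with a modest payload never grows past the small default, while a task needing far more pays for a handful of reallocations up to the configured maximum.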
[jira] [Comment Edited] (SPARK-874) Have a --wait flag in ./sbin/stop-all.sh that polls until Workers are finished
[ https://issues.apache.org/jira/browse/SPARK-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012127#comment-14012127 ] Archit Thakur edited comment on SPARK-874 at 5/29/14 6:45 AM: -- I am interested in taking it up; please do assign. Thanks :) was (Author: archit279): I am interested in taking it up. Have a --wait flag in ./sbin/stop-all.sh that polls until Workers are finished --- Key: SPARK-874 URL: https://issues.apache.org/jira/browse/SPARK-874 Project: Spark Issue Type: New Feature Components: Deploy Reporter: Patrick Wendell Priority: Minor Labels: starter Fix For: 1.1.0 When running benchmarking jobs, sometimes the cluster takes a long time to shut down. We should add a feature where it will ssh into all the workers every few seconds and check that the processes are dead, and won't return until they are all dead. This would help a lot with automating benchmarking scripts. There is some equivalent logic here written in Python; we just need to add it to the shell script: https://github.com/pwendell/spark-perf/blob/master/bin/run#L117 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-874) Have a --wait flag in ./sbin/stop-all.sh that polls until Workers are finished
[ https://issues.apache.org/jira/browse/SPARK-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012127#comment-14012127 ] Archit Thakur commented on SPARK-874: - I am interested in taking it up. Have a --wait flag in ./sbin/stop-all.sh that polls until Workers are finished --- Key: SPARK-874 URL: https://issues.apache.org/jira/browse/SPARK-874 Project: Spark Issue Type: New Feature Components: Deploy Reporter: Patrick Wendell Priority: Minor Labels: starter Fix For: 1.1.0 When running benchmarking jobs, sometimes the cluster takes a long time to shut down. We should add a feature where it will ssh into all the workers every few seconds and check that the processes are dead, and won't return until they are all dead. This would help a lot with automating benchmarking scripts. There is some equivalent logic here written in Python; we just need to add it to the shell script: https://github.com/pwendell/spark-perf/blob/master/bin/run#L117 -- This message was sent by Atlassian JIRA (v6.2#6252)
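The polling behavior this issue asks for reduces to: repeatedly evaluate a liveness check, sleep between checks, and give up after a timeout. A generic sketch of that loop follows (in Java rather than the shell script the issue actually targets; the `allDead` predicate here is a hypothetical stand-in for the per-worker ssh process check).

```java
import java.util.function.BooleanSupplier;

public class WaitFor {
    // Poll `allDead` every pollMillis until it returns true or timeoutMillis elapses.
    // Returns true if the condition was met, false on timeout or interruption.
    public static boolean waitUntil(BooleanSupplier allDead, long pollMillis, long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        // e.g. allDead = run `ssh worker pgrep ...` on every worker, true when none found
        while (!allDead.getAsBoolean()) {
            if (System.currentTimeMillis() >= deadline) {
                return false; // gave up waiting
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}
```

In the shell version, the same structure would be a `while` loop around the ssh checks with a `sleep` and a retry counter, as in the spark-perf Python logic the issue links to.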
[jira] [Created] (SPARK-1961) when data returned from map is about 10 kb, reduce(_ + _) always stays pending
zhoudi created SPARK-1961: - Summary: when data returned from map is about 10 kb, reduce(_ + _) always stays pending Key: SPARK-1961 URL: https://issues.apache.org/jira/browse/SPARK-1961 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: zhoudi -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1962) Add RDD cache reference counting
[ https://issues.apache.org/jira/browse/SPARK-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Taeyun Kim updated SPARK-1962: -- Description: It would be nice if the RDD cache() method incorporated reference counting. That is, {code} void test() { JavaRDD... rdd = ...; rdd.cache(); // to depth 1. Actual caching happens. rdd.cache(); // to depth 2. No-op as long as the storage level is the same. Else, exception. ... rdd.uncache(); // to depth 1. No-op. rdd.uncache(); // to depth 0. Actual unpersist happens. } {code} This can be useful when writing code in a modular way. When a function receives an RDD as an argument, it doesn't necessarily know the cache status of the RDD. But it may want to cache the RDD, since it will use the RDD multiple times. But with the current RDD API, it cannot determine whether it should unpersist it or leave it alone (so that the caller can continue to use that RDD without rebuilding). Add RDD cache reference counting Key: SPARK-1962 URL: https://issues.apache.org/jira/browse/SPARK-1962 Project: Spark Issue Type: New Feature Reporter: Taeyun Kim Priority: Minor It would be nice if the RDD cache() method incorporated reference counting. That is, {code} void test() { JavaRDD... rdd = ...; rdd.cache(); // to depth 1. Actual caching happens. rdd.cache(); // to depth 2. No-op as long as the storage level is the same. Else, exception. ... rdd.uncache(); // to depth 1. No-op. rdd.uncache(); // to depth 0. Actual unpersist happens. } {code} This can be useful when writing code in a modular way. When a function receives an RDD as an argument, it doesn't necessarily know the cache status of the RDD. But it may want to cache the RDD, since it will use the RDD multiple times. But with the current RDD API, it cannot determine whether it should unpersist it or leave it alone (so that the caller can continue to use that RDD without rebuilding). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1962) Add RDD cache reference counting
[ https://issues.apache.org/jira/browse/SPARK-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Taeyun Kim updated SPARK-1962: -- Affects Version/s: 1.0.0 Add RDD cache reference counting Key: SPARK-1962 URL: https://issues.apache.org/jira/browse/SPARK-1962 Project: Spark Issue Type: New Feature Affects Versions: 1.0.0 Reporter: Taeyun Kim Priority: Minor It would be nice if the RDD cache() method incorporated reference counting. That is, {code} void test() { JavaRDD... rdd = ...; rdd.cache(); // to reference count 1. Actual caching happens. rdd.cache(); // to reference count 2. No-op as long as the storage level is the same. Else, exception. ... rdd.uncache(); // to reference count 1. No-op. rdd.uncache(); // to reference count 0. Actual unpersist happens. } {code} This can be useful when writing code in a modular way. When a function receives an RDD as an argument, it doesn't necessarily know the cache status of the RDD. But it may want to cache the RDD, since it will use the RDD multiple times. But with the current RDD API, it cannot determine whether it should unpersist it or leave it alone (so that the caller can continue to use that RDD without rebuilding). For API compatibility, introducing a new method or adding a parameter may be required. -- This message was sent by Atlassian JIRA (v6.2#6252)
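The proposed semantics can be sketched as a thin wrapper that only triggers the real cache/unpersist at the 0-to-1 and 1-to-0 transitions of the reference count. In the sketch below, `Runnable` callbacks stand in for the actual `rdd.cache()` and `rdd.unpersist()` calls; this wrapper is hypothetical and not part of the RDD API.

```java
// Sketch of the reference-counted cache proposal. The two Runnables stand in
// for the real rdd.cache() / rdd.unpersist() calls; illustration only.
public class RefCountedCache {
    private final Runnable doCache;     // stand-in for rdd.cache()
    private final Runnable doUnpersist; // stand-in for rdd.unpersist()
    private int refCount = 0;

    public RefCountedCache(Runnable doCache, Runnable doUnpersist) {
        this.doCache = doCache;
        this.doUnpersist = doUnpersist;
    }

    public synchronized void cache() {
        if (refCount == 0) {
            doCache.run(); // actual caching only on the 0 -> 1 transition
        }
        refCount++;
    }

    public synchronized void uncache() {
        if (refCount == 0) {
            throw new IllegalStateException("uncache() without a matching cache()");
        }
        refCount--;
        if (refCount == 0) {
            doUnpersist.run(); // actual unpersist only on the 1 -> 0 transition
        }
    }
}
```

A function that receives an already-cached RDD would then bump the count on entry and decrement on exit, and the data only actually leaves the cache when the outermost caller is also done with it.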
[jira] [Commented] (SPARK-1963) Job aborted with NullPointerException from DAGScheduler.scala:1020
[ https://issues.apache.org/jira/browse/SPARK-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012181#comment-14012181 ] Kevin (Sangwoo) Kim commented on SPARK-1963: I guess the data is valid, because when I split the data into two and run the same code, each split runs fine. I'll post more after investigating this issue. Job aborted with NullPointerException from DAGScheduler.scala:1020 -- Key: SPARK-1963 URL: https://issues.apache.org/jira/browse/SPARK-1963 Project: Spark Issue Type: Bug Reporter: Kevin (Sangwoo) Kim Hi, I'm testing Spark 0.9.1 on an EC2 r3.8xlarge (32 cores, 240GiB MEM). While counting active users from 70GB of data, the Spark job aborted with an NPE from the DAGScheduler. I guess the active user count is around 1~2M. Here's what I did: {code} val logs = sc.textFile("file:///spark/data/*") val activeUser = logs.map { x => val a = LogObjectExtractor.getAnonymousAction(x); a.getUserId }.distinct activeUser.count {code} and here's the log. 
{code} 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Serialized task 1.0:2235 as 1883 bytes in 1 ms 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Finished TID 2207 in 17541 ms on ip-10-169-5-198.ap-northeast-1.compute.internal (progress: 2204/2267) 14/05/29 05:26:46 INFO scheduler.DAGScheduler: Completed ShuffleMapTask(1, 2207) 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Starting task 1.0:2236 as TID 2236 on executor 0: ip-10-169-5-198.ap-northeast-1.compute.internal (PROCESS_LOCAL) 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Serialized task 1.0:2236 as 1883 bytes in 1 ms 14/05/29 05:26:46 WARN scheduler.TaskSetManager: Lost TID 2230 (task 1.0:2230) 14/05/29 05:26:46 WARN scheduler.TaskSetManager: Loss was due to java.lang.NullPointerException java.lang.NullPointerException at $line16.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:17) at $line16.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:17) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:97) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96) at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:477) at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:477) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) at org.apache.spark.rdd.RDD.iterator(RDD.scala:232) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102) at org.apache.spark.scheduler.Task.run(Task.scala:53) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:211) at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Starting task 1.0:2230 as TID 2237 on executor 0: ip-10-169-5-198.ap-northeast-1.compute.internal (PROCESS_LOCAL) 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Serialized task 1.0:2230 as 1883 bytes in 0 ms 14/05/29 05:26:46 WARN scheduler.TaskSetManager: Lost TID 2231 (task 1.0:2231) 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Loss was due to java.lang.NullPointerException [duplicate 1] 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Starting task 1.0:2231 as TID 2238 on executor 0: ip-10-169-5-198.ap-northeast-1.compute.internal (PROCESS_LOCAL) {code} ... {code} 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Loss was due to java.lang.NullPointerException [duplicate 27] 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Loss was due to java.lang.NullPointerException
[jira] [Commented] (SPARK-1963) Job aborted with NullPointerException from DAGScheduler.scala:1020
[ https://issues.apache.org/jira/browse/SPARK-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012241#comment-14012241 ] Kevin (Sangwoo) Kim commented on SPARK-1963: I think the path file:///spark/data/* included some bad files. :x Requesting that the issue be closed. Job aborted with NullPointerException from DAGScheduler.scala:1020 -- Key: SPARK-1963 URL: https://issues.apache.org/jira/browse/SPARK-1963 Project: Spark Issue Type: Bug Reporter: Kevin (Sangwoo) Kim Hi, I'm testing Spark 0.9.1 on an EC2 r3.8xlarge (32 cores, 240GiB MEM). While counting active users from 70GB of data, the Spark job aborted with an NPE from the DAGScheduler. I guess the active user count is around 1~2M. Here's what I did: {code} val logs = sc.textFile("file:///spark/data/*") val activeUser = logs.map { x => val a = LogObjectExtractor.getAnonymousAction(x); a.getUserId }.distinct activeUser.count {code} and here's the log. {code} 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Serialized task 1.0:2235 as 1883 bytes in 1 ms 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Finished TID 2207 in 17541 ms on ip-10-169-5-198.ap-northeast-1.compute.internal (progress: 2204/2267) 14/05/29 05:26:46 INFO scheduler.DAGScheduler: Completed ShuffleMapTask(1, 2207) 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Starting task 1.0:2236 as TID 2236 on executor 0: ip-10-169-5-198.ap-northeast-1.compute.internal (PROCESS_LOCAL) 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Serialized task 1.0:2236 as 1883 bytes in 1 ms 14/05/29 05:26:46 WARN scheduler.TaskSetManager: Lost TID 2230 (task 1.0:2230) 14/05/29 05:26:46 WARN scheduler.TaskSetManager: Loss was due to java.lang.NullPointerException java.lang.NullPointerException at $line16.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:17) at $line16.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:17) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at 
scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:97) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96) at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:477) at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:477) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) at org.apache.spark.rdd.RDD.iterator(RDD.scala:232) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102) at org.apache.spark.scheduler.Task.run(Task.scala:53) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:211) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Starting task 1.0:2230 as TID 2237 on executor 0: ip-10-169-5-198.ap-northeast-1.compute.internal (PROCESS_LOCAL) 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Serialized task 1.0:2230 as 1883 bytes in 0 ms 14/05/29 05:26:46 WARN scheduler.TaskSetManager: Lost TID 2231 (task 1.0:2231) 14/05/29 
05:26:46 INFO scheduler.TaskSetManager: Loss was due to java.lang.NullPointerException [duplicate 1] 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Starting task 1.0:2231 as TID 2238 on executor 0: ip-10-169-5-198.ap-northeast-1.compute.internal (PROCESS_LOCAL) {code} ... {code} 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Loss was due to java.lang.NullPointerException [duplicate 27] 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Loss was due to java.lang.NullPointerException [duplicate 28] 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Finished
[jira] [Resolved] (SPARK-1963) Job aborted with NullPointerException from DAGScheduler.scala:1020
[ https://issues.apache.org/jira/browse/SPARK-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin (Sangwoo) Kim resolved SPARK-1963. Resolution: Invalid Job aborted with NullPointerException from DAGScheduler.scala:1020 -- Key: SPARK-1963 URL: https://issues.apache.org/jira/browse/SPARK-1963 Project: Spark Issue Type: Bug Reporter: Kevin (Sangwoo) Kim Hi, I'm testing Spark 0.9.1 on an EC2 r3.8xlarge (32 cores, 240GiB MEM). While counting active users from 70GB of data, the Spark job aborted with an NPE from the DAGScheduler. I guess the active user count is around 1~2M. Here's what I did: {code} val logs = sc.textFile("file:///spark/data/*") val activeUser = logs.map { x => val a = LogObjectExtractor.getAnonymousAction(x); a.getUserId }.distinct activeUser.count {code} and here's the log. {code} 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Serialized task 1.0:2235 as 1883 bytes in 1 ms 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Finished TID 2207 in 17541 ms on ip-10-169-5-198.ap-northeast-1.compute.internal (progress: 2204/2267) 14/05/29 05:26:46 INFO scheduler.DAGScheduler: Completed ShuffleMapTask(1, 2207) 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Starting task 1.0:2236 as TID 2236 on executor 0: ip-10-169-5-198.ap-northeast-1.compute.internal (PROCESS_LOCAL) 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Serialized task 1.0:2236 as 1883 bytes in 1 ms 14/05/29 05:26:46 WARN scheduler.TaskSetManager: Lost TID 2230 (task 1.0:2230) 14/05/29 05:26:46 WARN scheduler.TaskSetManager: Loss was due to java.lang.NullPointerException java.lang.NullPointerException at $line16.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:17) at $line16.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:17) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:97) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96) at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:477) at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:477) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) at org.apache.spark.rdd.RDD.iterator(RDD.scala:232) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102) at org.apache.spark.scheduler.Task.run(Task.scala:53) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:211) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Starting task 1.0:2230 as TID 2237 on executor 0: ip-10-169-5-198.ap-northeast-1.compute.internal (PROCESS_LOCAL) 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Serialized task 1.0:2230 as 1883 bytes in 0 ms 14/05/29 05:26:46 WARN scheduler.TaskSetManager: Lost TID 2231 (task 1.0:2231) 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Loss was due to java.lang.NullPointerException [duplicate 1] 14/05/29 05:26:46 INFO 
scheduler.TaskSetManager: Starting task 1.0:2231 as TID 2238 on executor 0: ip-10-169-5-198.ap-northeast-1.compute.internal (PROCESS_LOCAL) {code} ... {code} 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Loss was due to java.lang.NullPointerException [duplicate 27] 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Loss was due to java.lang.NullPointerException [duplicate 28] 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Finished TID 2201 in 17959 ms on ip-10-169-5-198.ap-northeast-1.compute.internal (progress: 2210/2267) 14/05/29 05:26:46 INFO
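Since the eventual diagnosis was a few bad input files in the glob, the NPE pattern in the stack trace (a null flowing from the extractor into distinct) can be guarded against defensively. The sketch below shows the null-guard pattern on plain Java collections rather than RDDs; `LogObjectExtractor` is the reporter's own class, so a hypothetical stand-in parser is used here.

```java
import java.util.List;
import java.util.Objects;

public class ActiveUsers {
    // Hypothetical stand-in for LogObjectExtractor.getAnonymousAction(x).getUserId():
    // returns null for lines it cannot parse, which is exactly the kind of value
    // that triggers an NPE downstream in distinct/combineValuesByKey.
    static String extractUserId(String line) {
        String[] parts = line.split(",");
        return parts.length >= 2 ? parts[1] : null;
    }

    // Count distinct user ids, dropping unparseable lines instead of letting
    // a null propagate into the distinct step.
    public static long countActiveUsers(List<String> lines) {
        return lines.stream()
                .map(ActiveUsers::extractUserId)
                .filter(Objects::nonNull) // the defensive guard
                .distinct()
                .count();
    }
}
```

In the Spark job the equivalent would be a `.filter(_ != null)` (or an Option-based extractor) between the map and the distinct, so one malformed file degrades the count instead of aborting the job.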
[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data
[ https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012295#comment-14012295 ] sam commented on SPARK-1867: [~srowen] Thanks, I still find it difficult to find the correct artifacts. Perhaps you could help me with these please: Which artifact contains: org.apache.hadoop.io.LongWritable? Which artifact contains: org.apache.hadoop.io.CombineTextInputFormat? If I search for hadoop-io here https://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH-Version-and-Packaging-Information/cdhvd_cdh5_maven_repo.html I can't find anything :( Many thanks Spark Documentation Error causes java.lang.IllegalStateException: unread block data --- Key: SPARK-1867 URL: https://issues.apache.org/jira/browse/SPARK-1867 Project: Spark Issue Type: Bug Reporter: sam I've employed two System Administrators on a contract basis (for quite a bit of money), and both contractors have independently hit the following exception. What we are doing is: 1. Installing Spark 0.9.1 according to the documentation on the website, along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs. 2. Building a fat jar with a Spark app with sbt then trying to run it on the cluster I've also included code snippets, and sbt deps at the bottom. When I've Googled this, there seem to be two somewhat vague responses: a) Mismatching spark versions on nodes/user code b) Need to add more jars to the SparkConf Now I know that (b) is not the problem, having successfully run the same code on other clusters while only including one jar (it's a fat jar). But I have no idea how to check for (a) - it appears Spark doesn't have any version checks or anything - it would be nice if it checked versions and threw a mismatching-version exception: you have user code using version X and node Y has version Z. I would be very grateful for advice on this. 
The exception: Exception in thread main org.apache.spark.SparkException: Job aborted: Task 0.0:1 failed 32 times (most recent failure: Exception failure: java.lang.IllegalStateException: unread block data) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to java.lang.IllegalStateException: unread block data [duplicate 59] My code snippet: val conf = new 
SparkConf() .setMaster(clusterMaster) .setAppName(appName) .setSparkHome(sparkHome) .setJars(SparkContext.jarOfClass(this.getClass)) println("count = " + new SparkContext(conf).textFile(someHdfsPath).count()) My SBT dependencies: // relevant "org.apache.spark" % "spark-core_2.10" % "0.9.1", "org.apache.hadoop" % "hadoop-client" % "2.3.0-mr1-cdh5.0.0", // standard, probably unrelated "com.github.seratch" %% "awscala" % "[0.2,)", "org.scalacheck" %% "scalacheck" % "1.10.1" % "test", "org.specs2" %% "specs2" % "1.14" % "test", "org.scala-lang" % "scala-reflect" % "2.10.3", "org.scalaz" %% "scalaz-core" % "7.0.5", "net.minidev" % "json-smart" % "1.2" --
[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data
[ https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012309#comment-14012309 ] Sean Owen commented on SPARK-1867: -- There is no hadoop-io module. Modules are subcomponents of the components distributed in something like CDH, and are not versioned independently, so you would not find them described on that page. That is, if CDH X.Y includes Hadoop Z.W then it includes version Z.W of all Hadoop's modules. LongWritable was in hadoop-core in Hadoop 1.x, and is in hadoop-common in 2.x. Just about everything depends on these modules, so you should not find yourself missing them at runtime if you are depending on something like hadoop-client. There isn't an org.apache.hadoop.io.CombineTextInputFormat, not that I can see -- where do you see that referenced? There is org.apache.hadoop.mapreduce.lib.CombineTextInputFormat although it looks like it appeared from about Hadoop 2.1 (https://issues.apache.org/jira/browse/MAPREDUCE-5069) You would not be able to use it if you are running on Hadoop 1, and wouldn't find it if you depend on Hadoop 1 modules. It appears to be in the mapreduce-client-core module, but again, you wouldn't need to depend on that directly. That gets pulled in from hadoop-client, via hadoop-mapreduce-client. I can see it doing so when you build Spark for Hadoop 2, for example. I'm still not clear why you need to hunt for individual classes. Maybe one of those is just due to a package typo. Spark Documentation Error causes java.lang.IllegalStateException: unread block data --- Key: SPARK-1867 URL: https://issues.apache.org/jira/browse/SPARK-1867 Project: Spark Issue Type: Bug Reporter: sam I've employed two System Administrators on a contract basis (for quite a bit of money), and both contractors have independently hit the following exception. What we are doing is: 1. 
Installing Spark 0.9.1 according to the documentation on the website, along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs. 2. Building a fat jar with a Spark app with sbt then trying to run it on the cluster I've also included code snippets, and sbt deps at the bottom. When I've Googled this, there seem to be two somewhat vague responses: a) Mismatching spark versions on nodes/user code b) Need to add more jars to the SparkConf Now I know that (b) is not the problem, having successfully run the same code on other clusters while only including one jar (it's a fat jar). But I have no idea how to check for (a) - it appears Spark doesn't have any version checks or anything - it would be nice if it checked versions and threw a mismatching-version exception: you have user code using version X and node Y has version Z. I would be very grateful for advice on this. The exception: Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task 0.0:1 failed 32 times (most recent failure: Exception failure: java.lang.IllegalStateException: unread block data) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at
[jira] [Commented] (SPARK-1948) Scalac crashes when building Spark in IntelliJ IDEA
[ https://issues.apache.org/jira/browse/SPARK-1948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012366#comment-14012366 ] Cheng Lian commented on SPARK-1948: --- Hi [~sowen], thanks for the suggestion! I tried to switch to Maven, but it wasn't as smooth as expected, so I took some time to investigate this issue. In the end it seems to be a bug in the {{sbt-idea}} plugin. The {{.idea/libraries/SBT__org_apache_mesos_0_18_1.xml}} file generated by {{sbt gen-idea}} declares both {{mesos-0.18.1.jar}} and {{mesos-0.18.1-shaded-protobuf.jar}} in the {{CLASSES}} element:
{code}
<component name="libraryTable">
  <library name="SBT: org.apache.mesos:mesos:0.18.1">
    <CLASSES>
      <root url="jar://$PROJECT_DIR$/lib_managed/jars/mesos-0.18.1.jar!/" />
      <root url="jar://$PROJECT_DIR$/lib_managed/jars/mesos-0.18.1-shaded-protobuf.jar!/" />
    </CLASSES>
    <JAVADOC />
    <SOURCES>
      <root url="jar://$PROJECT_DIR$/lib_managed/srcs/mesos-0.18.1-sources.jar!/" />
    </SOURCES>
  </library>
</component>
{code}
On the other hand, running {{show core/dependencyClasspath}} in SBT only shows the latter one, which means SBT only refers to the shaded jar to build the project. Removing the non-shaded Mesos jar from the IDEA project settings sets everything right: # Run {{sbt/sbt gen-idea}} # Open the project in IDEA # Click the File / Project Structure... menu item # Locate the SBT: org.apache.mesos:mesos:0.18.1 dependency along the UI path Project Settings / Modules / core / Dependencies # Click the Edit button (with a pen icon) # Remove the non-shaded jar file from Classes Now build Spark within IDEA; everything should be fine. 
Scalac crashes when building Spark in IntelliJ IDEA --- Key: SPARK-1948 URL: https://issues.apache.org/jira/browse/SPARK-1948 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.0 Reporter: Cheng Lian Priority: Minor Attachments: scalac-crash.log After [commit 0be8b45|https://github.com/apache/spark/commit/0be8b45], the master branch fails to compile within IntelliJ IDEA and causes {{scalac}} to crash, although building Spark with SBT works fine. This issue is not blocking, but it's annoying since it prevents developers from debugging Spark within IDEA. I can't figure out the exact cause; I only nailed it down to this commit by bisecting. Maybe I should file a bug report against IDEA instead? How to reproduce:
# Checkout [commit 0be8b45|https://github.com/apache/spark/commit/0be8b45]
# Run {{sbt/sbt clean gen-idea}} under the Spark source directory
# Open the project in IntelliJ IDEA
# Build the project
The {{scalac}} crash report is attached. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data
[ https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012431#comment-14012431 ] sam commented on SPARK-1867: Hmm, the reason for my confusion is a very strange compile problem. If I do the following: import org.apache.hadoop.io.LongWritable; ... public class CombinedInputFormat extends CombineFileInputFormat<LongWritable, Text> it doesn't work, but if I do public class CombinedInputFormat extends CombineFileInputFormat<org.apache.hadoop.io.LongWritable, Text> it does work. Odd, though I was just copy-pasting Java code from the web ... it's been 4 years since I've written Java, so maybe I'm missing something silly. Spark Documentation Error causes java.lang.IllegalStateException: unread block data --- Key: SPARK-1867 URL: https://issues.apache.org/jira/browse/SPARK-1867 Project: Spark Issue Type: Bug Reporter: sam I've employed two System Administrators on a contract basis (for quite a bit of money), and both contractors have independently hit the following exception. What we are doing is: 1. Installing Spark 0.9.1 according to the documentation on the website, along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs. 2. Building a fat jar with a Spark app with sbt, then trying to run it on the cluster. I've also included code snippets and sbt deps at the bottom. When I've Googled this, there seem to be two somewhat vague responses: a) Mismatching spark versions on nodes/user code b) Need to add more jars to the SparkConf Now I know that (b) is not the problem, having successfully run the same code on other clusters while only including one jar (it's a fat jar). But I have no idea how to check for (a) - it appears Spark doesn't have any version checks or anything - it would be nice if it checked versions and threw a mismatching version exception: you have user code using version X and node Y has version Z. I would be very grateful for advice on this. 
The exception: Exception in thread main org.apache.spark.SparkException: Job aborted: Task 0.0:1 failed 32 times (most recent failure: Exception failure: java.lang.IllegalStateException: unread block data) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to java.lang.IllegalStateException: unread block data [duplicate 59] My code snippet: val conf = new 
SparkConf() .setMaster(clusterMaster) .setAppName(appName) .setSparkHome(sparkHome) .setJars(SparkContext.jarOfClass(this.getClass)) println("count = " + new SparkContext(conf).textFile(someHdfsPath).count()) My SBT dependencies: // relevant "org.apache.spark" % "spark-core_2.10" % "0.9.1", "org.apache.hadoop" % "hadoop-client" % "2.3.0-mr1-cdh5.0.0", // standard, probably unrelated "com.github.seratch" %% "awscala" % "[0.2,)", "org.scalacheck" %% "scalacheck" % "1.10.1" % "test", "org.specs2" %% "specs2" % "1.14" % "test", "org.scala-lang" % "scala-reflect" % "2.10.3", "org.scalaz" %% "scalaz-core" % "7.0.5", "net.minidev" % "json-smart" % "1.2"
[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data
[ https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012434#comment-14012434 ] Sean Owen commented on SPARK-1867: -- Something else is up; those are equivalent in Java and you couldn't have imported two symbols with the same name. I could look at the file offline if you think I might spot something else. Spark Documentation Error causes java.lang.IllegalStateException: unread block data --- Key: SPARK-1867 URL: https://issues.apache.org/jira/browse/SPARK-1867 Project: Spark Issue Type: Bug Reporter: sam 
-- This message was sent by Atlassian JIRA (v6.2#6252)
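Sean Owen's point that an imported name and a fully-qualified name are interchangeable in Java can be checked with a tiny self-contained example. The class names below are JDK stand-ins, not the Hadoop types from the snippet, so the check runs anywhere:

```java
// Demonstrates that an imported class name and its fully-qualified form
// resolve to the same class, so the two declarations above are equivalent.
import java.util.ArrayList;

public class FqnDemo {
    static String viaImport() {
        // Uses the imported simple name.
        return new ArrayList<String>().getClass().getName();
    }

    static String viaFqn() {
        // Uses the fully-qualified name, no import required.
        return new java.util.ArrayList<String>().getClass().getName();
    }

    public static void main(String[] args) {
        // Both resolve to java.util.ArrayList.
        System.out.println(viaImport().equals(viaFqn()));
    }
}
```

If the imported form really fails to compile while the fully-qualified form succeeds, the culprit is often another class with the same simple name on the classpath or a stale build, not a difference in Java semantics.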
[jira] [Updated] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk
[ https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1518: --- Component/s: Spark Core Spark master doesn't compile against hadoop-common trunk Key: SPARK-1518 URL: https://issues.apache.org/jira/browse/SPARK-1518 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Marcelo Vanzin Assignee: Colin Patrick McCabe Priority: Critical FSDataOutputStream::sync() has disappeared from trunk in Hadoop; FileLogger.scala is calling it. I've changed it locally to hsync() so I can compile the code, but haven't checked yet whether those are equivalent. hsync() seems to have been there forever, so it hopefully works with all versions Spark cares about. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk
[ https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1518: --- Target Version/s: 1.1.0, 1.0.1 (was: 1.0.1) Spark master doesn't compile against hadoop-common trunk Key: SPARK-1518 URL: https://issues.apache.org/jira/browse/SPARK-1518 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Marcelo Vanzin Assignee: Colin Patrick McCabe Priority: Critical -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1961) when data return from map is about 10 kb, reduce(_ + _) would always pending
[ https://issues.apache.org/jira/browse/SPARK-1961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012471#comment-14012471 ] Patrick Wendell commented on SPARK-1961: Could you add a bit more information here or a reproduction? As it stands it's not really clear what this report means. when data return from map is about 10 kb, reduce(_ + _) would always pending Key: SPARK-1961 URL: https://issues.apache.org/jira/browse/SPARK-1961 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: zhoudi -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1962) Add RDD cache reference counting
[ https://issues.apache.org/jira/browse/SPARK-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1962: --- Component/s: Spark Core Add RDD cache reference counting Key: SPARK-1962 URL: https://issues.apache.org/jira/browse/SPARK-1962 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.0 Reporter: Taeyun Kim Priority: Minor It would be nice if the RDD cache() method incorporated reference counting. That is,
{code}
void test() {
    JavaRDD... rdd = ...;
    rdd.cache();   // ref count becomes 1; actual caching happens.
    rdd.cache();   // ref count becomes 2; no-op as long as the storage level is the same, otherwise an exception.
    ...
    rdd.uncache(); // ref count becomes 1; no-op.
    rdd.uncache(); // ref count becomes 0; actual unpersist happens.
}
{code}
This can be useful when writing code in a modular way. When a function receives an RDD as an argument, it doesn't necessarily know the cache status of that RDD, but it may want to cache it because it will use it multiple times. With the current RDD API, it cannot determine whether it should unpersist the RDD afterwards or leave it alone (so that the caller can continue to use it without rebuilding). For API compatibility, introducing a new method or adding a parameter may be required. -- This message was sent by Atlassian JIRA (v6.2#6252)
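The proposed semantics can be sketched with a small reference-counting wrapper. Everything here is hypothetical illustration, not Spark API: {{RefCountedCache}} stands in for an RDD, and the boolean field stands in for persist/unpersist side effects.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the cache()/uncache() reference-counting proposal above.
// Only the first cache() materializes; only the last uncache() releases.
public class RefCountedCache {
    private final AtomicInteger refCount = new AtomicInteger(0);
    private volatile boolean materialized = false;

    public void cache() {
        // First caller actually persists; later callers only bump the count.
        if (refCount.incrementAndGet() == 1) {
            materialized = true; // stands in for rdd.persist()
        }
    }

    public void uncache() {
        // Last caller actually unpersists; earlier callers only drop the count.
        if (refCount.decrementAndGet() == 0) {
            materialized = false; // stands in for rdd.unpersist()
        }
    }

    public boolean isCached() {
        return materialized;
    }
}
```

With this, a library function can safely call cache() on entry and uncache() on exit: if the caller had already cached the RDD, the function's uncache() only decrements the count and the caller's copy stays cached.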
[jira] [Resolved] (SPARK-1935) Explicitly add commons-codec 1.5 as a dependency
[ https://issues.apache.org/jira/browse/SPARK-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1935. Resolution: Fixed Explicitly add commons-codec 1.5 as a dependency Key: SPARK-1935 URL: https://issues.apache.org/jira/browse/SPARK-1935 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.9.1 Reporter: Yin Huai Assignee: Yin Huai Priority: Minor Fix For: 1.1.0, 1.0.1 Right now, commons-codec is a transitive dependency. When Spark is built with Maven for Hadoop 1, jets3t 0.7.1 will pull in commons-codec 1.3, which is an older version (Hadoop 1.0.4 depends on 1.4). This older version can cause problems because 1.4 introduces incompatible changes and new methods. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1935) Explicitly add commons-codec 1.5 as a dependency
[ https://issues.apache.org/jira/browse/SPARK-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1935: --- Fix Version/s: 1.0.1 1.1.0 Explicitly add commons-codec 1.5 as a dependency Key: SPARK-1935 URL: https://issues.apache.org/jira/browse/SPARK-1935 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.9.1 Reporter: Yin Huai Assignee: Yin Huai Priority: Minor Fix For: 1.1.0, 1.0.1 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1964) Timestamp missing from HiveMetastore types parser
Michael Armbrust created SPARK-1964: --- Summary: Timestamp missing from HiveMetastore types parser Key: SPARK-1964 URL: https://issues.apache.org/jira/browse/SPARK-1964 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Michael Armbrust {code} -- Forwarded message -- From: dataginjaninja rickett.stepha...@gmail.com Date: Thu, May 29, 2014 at 8:54 AM Subject: Timestamp support in v1.0 To: d...@spark.incubator.apache.org Can anyone verify which rc [SPARK-1360] Add Timestamp Support for SQL #275 https://github.com/apache/spark/pull/275 is included in? I am running rc3, but receiving errors with TIMESTAMP as a datatype in my Hive tables when trying to use them in pyspark. *The error I get:* 14/05/29 15:44:47 INFO ParseDriver: Parsing command: SELECT COUNT(*) FROM aol 14/05/29 15:44:48 INFO ParseDriver: Parse Completed 14/05/29 15:44:48 INFO metastore: Trying to connect to metastore with URI thrift: 14/05/29 15:44:48 INFO metastore: Waiting 1 seconds before next connection attempt. 14/05/29 15:44:49 INFO metastore: Connected to metastore. Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/opt/spark-1.0.0-rc3/python/pyspark/sql.py", line 189, in hql return self.hiveql(hqlQuery) File "/opt/spark-1.0.0-rc3/python/pyspark/sql.py", line 183, in hiveql return SchemaRDD(self._ssql_ctx.hiveql(hqlQuery), self) File "/opt/spark-1.0.0-rc3/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 537, in __call__ File "/opt/spark-1.0.0-rc3/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o14.hiveql. 
: java.lang.RuntimeException: Unsupported dataType: timestamp *The table I loaded:* DROP TABLE IF EXISTS aol; CREATE EXTERNAL TABLE aol ( userid STRING, query STRING, query_time TIMESTAMP, item_rank INT, click_url STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/tmp/data/aol'; *The pyspark commands:* from pyspark.sql import HiveContext hctx = HiveContext(sc) results = hctx.hql("SELECT COUNT(*) FROM aol").collect() {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (SPARK-1964) Timestamp missing from HiveMetastore types parser
[ https://issues.apache.org/jira/browse/SPARK-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-1964: --- Assignee: Michael Armbrust Timestamp missing from HiveMetastore types parser - Key: SPARK-1964 URL: https://issues.apache.org/jira/browse/SPARK-1964 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Michael Armbrust Assignee: Michael Armbrust -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk
[ https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012520#comment-14012520 ] Matei Zaharia commented on SPARK-1518: -- Sorry, I'm still not sure I understand what you're asking for -- maybe I missed it above. Are you worried that the Spark assembly on the cluster has to be pre-built against Hadoop? We could perhaps make it find stuff out of HADOOP_HOME, but then it wouldn't work for users who don't have a Hadoop installation, which is a lot of users. For client apps, it's really enough to add that hadoop-client dependency; no other manipulation is needed. If you want to build a client app that automatically works with multiple versions of Hadoop, you can also package it with Spark and hadoop-client marked as provided, and use spark-submit to put the Spark assembly on your cluster in the classpath. Then it will work with whatever version that assembly was built against. But you do need to specify hadoop-client when you run without spark-submit and want to talk to the version of HDFS in your cluster (e.g. you're testing the app on your laptop and trying to make it read from HDFS). Spark master doesn't compile against hadoop-common trunk Key: SPARK-1518 URL: https://issues.apache.org/jira/browse/SPARK-1518 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Marcelo Vanzin Assignee: Colin Patrick McCabe Priority: Critical -- This message was sent by Atlassian JIRA (v6.2#6252)
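The packaging approach Matei describes (Spark and hadoop-client marked as provided, deployment via spark-submit) might look like the following sbt fragment. This is an illustrative sketch only; the coordinates and versions are examples matching the era of this thread, not taken from any project in it:

{code}
// build.sbt sketch -- versions are illustrative, not prescriptive
libraryDependencies ++= Seq(
  "org.apache.spark"  % "spark-core_2.10" % "0.9.1"              % "provided",
  "org.apache.hadoop" % "hadoop-client"   % "2.3.0-mr1-cdh5.0.0" % "provided"
)
{code}

Marking both as provided keeps them out of the fat jar, so at runtime the app binds to whatever Spark assembly and Hadoop client the cluster's spark-submit puts on the classpath.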
[jira] [Commented] (SPARK-1964) Timestamp missing from HiveMetastore types parser
[ https://issues.apache.org/jira/browse/SPARK-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012543#comment-14012543 ] Michael Armbrust commented on SPARK-1964: - https://github.com/apache/spark/pull/913 Timestamp missing from HiveMetastore types parser - Key: SPARK-1964 URL: https://issues.apache.org/jira/browse/SPARK-1964 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Michael Armbrust Assignee: Michael Armbrust -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk
[ https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012552#comment-14012552 ] Sean Owen commented on SPARK-1518: -- Heh, I think the essence is: at least one more separate Maven artifact, under a different classifier, for Hadoop 2.x builds. If you package that, you get Spark and everything it needs to work against a Hadoop 2 cluster. Yeah, I see that you're suggesting various ways to push the app to the cluster, where it can bind to the right version of things, and that may be the right-est way to think about this. I had envisioned running a stand-alone app on a machine that is not part of the cluster but is a client of it, which means packaging in the right Hadoop client dependencies. Spark already declares how it wants to include these various Hadoop client versions -- it's more than just including hadoop-client -- so I wanted to leverage that. Let's see if this actually turns out to be a broader request, though. Spark master doesn't compile against hadoop-common trunk Key: SPARK-1518 URL: https://issues.apache.org/jira/browse/SPARK-1518 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Marcelo Vanzin Assignee: Colin Patrick McCabe Priority: Critical -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1952) slf4j version conflicts with pig
[ https://issues.apache.org/jira/browse/SPARK-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012650#comment-14012650 ] Ryan Compton commented on SPARK-1952: - Thanks so much! Here's the fix I had to make (removed jul-to-slf4j and jcl-over-slf4j):
{code}
diff --git a/project/SparkBuild.scala b/project/SparkBuild.scala
index 29dcd86..a113ead 100644
--- a/project/SparkBuild.scala
+++ b/project/SparkBuild.scala
@@ -336,8 +336,8 @@ object SparkBuild extends Build {
         "log4j"          % "log4j"          % "1.2.17",
         "org.slf4j"      % "slf4j-api"      % slf4jVersion,
         "org.slf4j"      % "slf4j-log4j12"  % slf4jVersion,
-        "org.slf4j"      % "jul-to-slf4j"   % slf4jVersion,
-        "org.slf4j"      % "jcl-over-slf4j" % slf4jVersion,
+        //"org.slf4j"    % "jul-to-slf4j"   % slf4jVersion,
+        //"org.slf4j"    % "jcl-over-slf4j" % slf4jVersion,
         "commons-daemon" % "commons-daemon" % "1.0.10", // workaround for bug HADOOP-9407
         "com.ning"       % "compress-lzf"   % "1.0.0",
         "org.xerial.snappy" % "snappy-java" % "1.0.5",
{code}
This led to several of these warnings during compilation:
{code}
[warn] Class org.apache.commons.logging.Log not found - continuing with a stub.
{code}
But it did successfully remove SLF4JLocationAwareLog.class:
{code}
[rfcompton@node19 spark-1.0.0]$ jar tvf assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar | grep -i slf4j | grep Location
455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class
479 Fri Dec 13 16:44:40 PST 2013 parquet/org/slf4j/spi/LocationAwareLogger.class
{code}
For reference, there was no *-*-slf4j bridge in 0.9.1:
{code}
[rfcompton@node19 spark-0.9.1-bin-hadoop1]$ cat project/SparkBuild.scala | grep slf4
  val slf4jVersion = "1.7.2"
        "org.slf4j" % "slf4j-api"     % slf4jVersion,
        "org.slf4j" % "slf4j-log4j12" % slf4jVersion,
        "org.spark-project.akka" %% "akka-slf4j" % "2.2.3-shaded-protobuf" excludeAll(excludeNetty),
{code}
I'm the only guy in my lab using Spark, so I have no idea whether it's common to keep Pig UDFs in the same jar as Spark code. I'm going to assume it is and submit a pull request now. 
slf4j version conflicts with pig Key: SPARK-1952 URL: https://issues.apache.org/jira/browse/SPARK-1952 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: pig 12.1 on Cloudera Hadoop, CDH3 Reporter: Ryan Compton Labels: pig, slf4j Upgrading from Spark-0.9.1 to Spark-1.0.0 causes all Pig scripts to fail when they register a jar containing Spark. The error appears to be related to org.slf4j.spi.LocationAwareLogger.log. {code} Caused by: java.lang.RuntimeException: Could not resolve error that occured when launching map reduce job: java.lang.NoSuchMethodError: org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$JobControlThreadExceptionHandler.uncaughtException(MapReduceLauncher.java:598) at java.lang.Thread.dispatchUncaughtException(Thread.java:1874) {code} To reproduce: compile Spark via $ SPARK_HADOOP_VERSION=0.20.2-cdh3u4 sbt/sbt assembly and register the resulting jar into a pig script. E.g. {code} REGISTER /usr/share/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar; data0 = LOAD 'data' USING PigStorage(); ttt = LIMIT data0 10; DUMP ttt; {code} The Spark-1.0 jar includes some slf4j dependencies that were not present in 0.9.1 {code} rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar | grep -i slf | grep LocationAware 3259 Mon Mar 25 21:49:34 PDT 2013 org/apache/commons/logging/impl/SLF4JLocationAwareLog.class 455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class 479 Fri Dec 13 16:44:40 PST 2013 parquet/org/slf4j/spi/LocationAwareLogger.class {code} vs. 
{code} rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf spark-assembly-0.9.1-hadoop0.20.2-cdh3u3.jar | grep -i slf | grep LocationAware 455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk
[ https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012651#comment-14012651 ] Matei Zaharia commented on SPARK-1518: -- Okay, got it. But this only applies to you running the job on your laptop, right? Because otherwise you'll get the right Hadoop via the installation on the cluster. For this use case I still think it's fine to require use of hadoop-client. It's been like that for the past 2 releases and nobody has asked questions about it. It's just one more entry to add to your pom.xml. The concrete problem is that Hadoop has been extremely fickle with compatibility even within a major release series (1.x or 2.x). HDFS protocol versions change and you can't access the cluster, YARN versions change, etc. I don't think there's a single release I'd call Hadoop 2, and it would be confusing to users to link to the Hadoop 2 artifact and not have it run on their cluster. Spark master doesn't compile against hadoop-common trunk Key: SPARK-1518 URL: https://issues.apache.org/jira/browse/SPARK-1518 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Marcelo Vanzin Assignee: Colin Patrick McCabe Priority: Critical -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk
[ https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012655#comment-14012655 ] Matei Zaharia commented on SPARK-1518: -- BTW one other thing is that in 1.0, you can also use spark-submit in local mode to get your locally installed Spark. So people will be able to yum install spark-from-their-vendor, build their app with just spark-core, and then run it with the spark-submit on their PATH. Spark master doesn't compile against hadoop-common trunk Key: SPARK-1518 URL: https://issues.apache.org/jira/browse/SPARK-1518 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Marcelo Vanzin Assignee: Colin Patrick McCabe Priority: Critical FSDataOutputStream::sync() has disappeared from trunk in Hadoop; FileLogger.scala is calling it. I've changed it locally to hsync() so I can compile the code, but haven't checked yet whether those are equivalent. hsync() seems to have been there forever, so it hopefully works with all versions Spark cares about. -- This message was sent by Atlassian JIRA (v6.2#6252)
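A sketch of the workflow described in this comment, assuming a vendor-installed spark-submit on the PATH (the jar name and main class here are hypothetical):

```
# Build the app against spark-core only, then run it with the locally
# installed Spark rather than bundling Spark into the application jar.
spark-submit --master local[4] --class com.example.WordCount my-app.jar input.txt
```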
[jira] [Created] (SPARK-1965) Spark UI throws NPE on trying to load the app page for non-existent app
Kay Ousterhout created SPARK-1965: - Summary: Spark UI throws NPE on trying to load the app page for non-existent app Key: SPARK-1965 URL: https://issues.apache.org/jira/browse/SPARK-1965 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.0.0 Reporter: Kay Ousterhout Priority: Minor If you try to load the Spark UI for an application that doesn't exist: sparkHost:8080/app/?appId=foobar The UI throws an NPE. The problem is in ApplicationPage.scala -- Spark proceeds even if the app variable is null. We should handle this more gracefully. -- This message was sent by Atlassian JIRA (v6.2#6252)
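A minimal sketch of the more graceful handling, with hypothetical names standing in for the ApplicationPage internals: look the app up as an Option and render a "not found" page instead of dereferencing a possibly-null variable.

```scala
// Hypothetical stand-ins for the master's app list and the page renderer.
case class AppInfo(id: String, name: String)

def renderAppPage(apps: Seq[AppInfo], appId: String): String =
  apps.find(_.id == appId) match {
    case Some(app) => s"Application: ${app.name}"
    case None      => s"No running application with ID $appId" // graceful, no NPE
  }
```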
[jira] [Created] (SPARK-1966) Cannot cancel tasks running locally
Aaron Davidson created SPARK-1966: - Summary: Cannot cancel tasks running locally Key: SPARK-1966 URL: https://issues.apache.org/jira/browse/SPARK-1966 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.1, 1.0.0 Reporter: Aaron Davidson -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1967) Using parallelize method to create RDD, wordcount app just hanging there without errors or warnings
Min Li created SPARK-1967: - Summary: Using parallelize method to create RDD, wordcount app just hanging there without errors or warnings Key: SPARK-1967 URL: https://issues.apache.org/jira/browse/SPARK-1967 Project: Spark Issue Type: Bug Affects Versions: 0.9.1 Environment: Ubuntu-12.04, single machine spark standalone, 8 core, 8G mem, spark 0.9.1, java-1.7 Reporter: Min Li I was trying the parallelize method to create an RDD. I used Java. And it's a simple wordcount program, except that I first read the input into memory and then use the parallelize method to create the RDD, rather than the default textFile method in the given example. Pseudo code:
{code}
JavaSparkContext ctx = new JavaSparkContext($SparkMasterURL, $NAME, $SparkHome, $jars);
List<String> input = ...; // read lines from the input file into an ArrayList<String>
JavaRDD<String> lines = ctx.parallelize(input); // followed by wordcount: not working
JavaRDD<String> lines = ctx.textFile(file);     // followed by wordcount: this works
{code}
The log is: 14/05/29 16:18:43 INFO Slf4jLogger: Slf4jLogger started 14/05/29 16:18:43 INFO Remoting: Starting remoting 14/05/29 16:18:43 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@spark:55224] 14/05/29 16:18:43 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@spark:55224] 14/05/29 16:18:43 INFO SparkEnv: Registering BlockManagerMaster 14/05/29 16:18:43 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140529161843-836a 14/05/29 16:18:43 INFO MemoryStore: MemoryStore started with capacity 1056.0 MB.
14/05/29 16:18:43 INFO ConnectionManager: Bound socket to port 42942 with id = ConnectionManagerId(spark,42942) 14/05/29 16:18:43 INFO BlockManagerMaster: Trying to register BlockManager 14/05/29 16:18:43 INFO BlockManagerMasterActor$BlockManagerInfo: Registering block manager spark:42942 with 1056.0 MB RAM 14/05/29 16:18:43 INFO BlockManagerMaster: Registered BlockManager 14/05/29 16:18:43 INFO HttpServer: Starting HTTP Server 14/05/29 16:18:43 INFO HttpBroadcast: Broadcast server started at http://10.227.119.185:43522 14/05/29 16:18:43 INFO SparkEnv: Registering MapOutputTracker 14/05/29 16:18:43 INFO HttpFileServer: HTTP File server directory is /tmp/spark-3704a621-789c-4d97-b1fc-9654236dba3e 14/05/29 16:18:43 INFO HttpServer: Starting HTTP Server 14/05/29 16:18:43 INFO SparkUI: Started Spark Web UI at http://spark:4040 14/05/29 16:18:44 INFO SparkContext: Added JAR /home/maxmin/tmp/spark-test-1.0-SNAPSHOT-jar-with-dependencies.jar at http://10.227.119.185:55286/jars/spark-test-1.0-SNAPSHOT-jar-with-dependencies.jar with timestamp 1401394724045 14/05/29 16:18:44 INFO AppClient$ClientActor: Connecting to master spark://spark:7077... 14/05/29 16:18:44 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20140529161844-0001 14/05/29 16:18:44 INFO AppClient$ClientActor: Executor added: app-20140529161844-0001/0 on worker-20140529155406-spark-59658 (spark:59658) with 8 cores The app is hanging here forever. And spark:8080 spark:4040 are not showing any strange info. The Spark Stages page shows the Active Stages is reduceByKey, tasks: Succeeded/Total is 0/2. I've also tried directly call lines.count after parallelize, and the app will stuck at the count stage. I used spark-0.9.1 and used default spark-env.sh. In the slaves file I have only one host. I used maven to compile a fat jar with spark specified as provided. I modified the run-example script to submit the jar. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1697) Driver error org.apache.spark.scheduler.TaskSetManager - Loss was due to java.io.FileNotFoundException
[ https://issues.apache.org/jira/browse/SPARK-1697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012913#comment-14012913 ] Arup Malakar commented on SPARK-1697: - [~mridulm80] We saw this issue again. Are you referring to SPARK-1592 GC patch? That would require us to upgrade to trunk right? Also could you advice me on how to disable the cleaners? Driver error org.apache.spark.scheduler.TaskSetManager - Loss was due to java.io.FileNotFoundException -- Key: SPARK-1697 URL: https://issues.apache.org/jira/browse/SPARK-1697 Project: Spark Issue Type: Bug Reporter: Arup Malakar We are running spark-streaming 0.9.0 on top of Yarn (Hadoop 2.2.0-cdh5.0.0-beta-2). It reads from kafka and processes the data. So far we haven't seen any issues, except today we saw an exception in the driver log and it is not consuming kafka messages any more. Here is the exception we saw: {code} 2014-05-01 10:00:43,962 [Result resolver thread-3] WARN org.apache.spark.scheduler.TaskSetManager - Loss was due to java.io.FileNotFoundException java.io.FileNotFoundException: http://10.50.40.85:53055/broadcast_2412 at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at scala.collection.immutable.$colon$colon.readObject(List.scala:362) at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at scala.collection.immutable.$colon$colon.readObject(List.scala:362) at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at
[jira] [Updated] (SPARK-1939) Refactor takeSample method in RDD to use ScaSRS
[ https://issues.apache.org/jira/browse/SPARK-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doris Xin updated SPARK-1939: - Summary: Refactor takeSample method in RDD to use ScaSRS (was: Improve takeSample method in RDD) Refactor takeSample method in RDD to use ScaSRS --- Key: SPARK-1939 URL: https://issues.apache.org/jira/browse/SPARK-1939 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Doris Xin Assignee: Doris Xin Labels: newbie reimplement takeSample with the ScaSRS algorithm -- This message was sent by Atlassian JIRA (v6.2#6252)
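The idea behind ScaSRS (scalable simple random sampling) is to draw an exact-size sample in a single pass: give each item a uniform key, accept keys below a low threshold outright, waitlist keys below a high threshold, and fill the remaining slots from the sorted waitlist. A rough local sketch follows; the thresholds here are simplified stand-ins, whereas the actual algorithm derives them from concentration bounds so the waitlist stays small with high probability.

```scala
import scala.util.Random

// Sketch of ScaSRS-style exact sampling over a local Seq (not an RDD).
def scasrsSample[T](items: Seq[T], k: Int, seed: Long = 42L): Seq[T] = {
  val n = items.size
  require(k >= 0 && k <= n)
  val rng = new Random(seed)
  val p = k.toDouble / n
  // Simplified slack around p; the paper derives tighter bounds.
  val slack = 4.0 * math.sqrt(p * (1 - p) / n) + 4.0 / n
  val qLo = math.max(0.0, p - slack) // accept outright below this key
  val qHi = math.min(1.0, p + slack) // waitlist below this key
  val keyed = items.map(x => (rng.nextDouble(), x))
  val accepted = keyed.filter(_._1 < qLo)
  val waitlist = keyed.filter(kv => kv._1 >= qLo && kv._1 < qHi)
  // Fill the remaining slots from the waitlist, smallest keys first.
  (accepted ++ waitlist.sortBy(_._1).take(k - accepted.size)).map(_._2).take(k)
}
```

On an RDD, the accept/waitlist pass parallelizes per partition and only the (small) waitlist needs to be collected and sorted on the driver.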
[jira] [Updated] (SPARK-1958) Calling .collect() on a SchemaRDD should call executeCollect() on the underlying query plan.
[ https://issues.apache.org/jira/browse/SPARK-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-1958: Assignee: Cheng Lian Calling .collect() on a SchemaRDD should call executeCollect() on the underlying query plan. Key: SPARK-1958 URL: https://issues.apache.org/jira/browse/SPARK-1958 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Michael Armbrust Assignee: Cheng Lian Fix For: 1.1.0 In some cases (like LIMIT) executeCollect() makes optimizations that execute().collect() will not. -- This message was sent by Atlassian JIRA (v6.2#6252)
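The distinction in miniature, using toy stand-ins for the query-plan classes (the real SparkSQL nodes operate on RDDs of rows, not local iterators): the default executeCollect() simply runs the plan and collects everything, while a Limit node can override it to take n rows directly and stop early.

```scala
// Toy query plan: Int rows stand in for SparkSQL's Row type.
trait PlanNode {
  def execute(): Iterator[Int]
  def executeCollect(): Array[Int] = execute().toArray // default: full scan
}

case class Scan(rows: Seq[Int]) extends PlanNode {
  def execute(): Iterator[Int] = rows.iterator
}

case class Limit(n: Int, child: PlanNode) extends PlanNode {
  def execute(): Iterator[Int] = child.execute().take(n)
  // Optimized path: pull exactly n rows, no full materialization.
  override def executeCollect(): Array[Int] = child.execute().take(n).toArray
}
```

Routing SchemaRDD.collect() through executeCollect() means a query like SELECT ... LIMIT 3 hits the overridden path instead of the generic one.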
[jira] [Updated] (SPARK-1852) SparkSQL Queries with Sorts run before the user asks them to
[ https://issues.apache.org/jira/browse/SPARK-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-1852: Assignee: Cheng Lian SparkSQL Queries with Sorts run before the user asks them to Key: SPARK-1852 URL: https://issues.apache.org/jira/browse/SPARK-1852 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian This is related to [SPARK-1021] but will not be fixed by that since we do our own partitioning. Part of the problem here is that we calculate the range partitioning too eagerly. Though this could also be alleviated by avoiding the call to toRdd for non DDL queries. -- This message was sent by Atlassian JIRA (v6.2#6252)
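One way to read "we calculate the range partitioning too eagerly": the expensive sort work is kicked off while the query object is being constructed, not when the user asks for results. A toy illustration (not the actual SparkSQL code) of deferring that work behind a lazy val:

```scala
// `prepare` stands in for the expensive range-partitioning sort.
var sortRuns = 0
def prepare(data: Seq[Int]): Seq[Int] = { sortRuns += 1; data.sorted }

class EagerQuery(data: Seq[Int]) { val toRdd: Seq[Int] = prepare(data) }      // sorts at construction
class LazyQuery(data: Seq[Int])  { lazy val toRdd: Seq[Int] = prepare(data) } // sorts on first use
```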
[jira] [Updated] (SPARK-1947) Child of SumDistinct or Average should be widened to prevent overflows the same as Sum.
[ https://issues.apache.org/jira/browse/SPARK-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-1947: Assignee: Takuya Ueshin Child of SumDistinct or Average should be widened to prevent overflows the same as Sum. --- Key: SPARK-1947 URL: https://issues.apache.org/jira/browse/SPARK-1947 Project: Spark Issue Type: Improvement Components: SQL Reporter: Takuya Ueshin Assignee: Takuya Ueshin Child of {{SumDistinct}} or {{Average}} should be widened to prevent overflows the same as {{Sum}}. -- This message was sent by Atlassian JIRA (v6.2#6252)
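The overflow being guarded against is easy to demonstrate: summing Int values wraps around, while widening the child expression to Long first gives the exact result. A self-contained illustration:

```scala
val xs = Seq(Int.MaxValue, Int.MaxValue)

val unwidened = xs.sum               // Int arithmetic wraps around to -2
val widened   = xs.map(_.toLong).sum // widen first: 4294967294
```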
[jira] [Updated] (SPARK-1968) SQL commands for caching tables
[ https://issues.apache.org/jira/browse/SPARK-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-1968: Component/s: SQL SQL commands for caching tables --- Key: SPARK-1968 URL: https://issues.apache.org/jira/browse/SPARK-1968 Project: Spark Issue Type: New Feature Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Users should be able to cache tables by calling CACHE TABLE in SQL or HQL. -- This message was sent by Atlassian JIRA (v6.2#6252)
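A sketch of what such commands might look like to a user (syntax and table name are illustrative, not a confirmed design):

```sql
CACHE TABLE logs;            -- pin the table's data in memory
SELECT COUNT(*) FROM logs;   -- subsequent queries read the cached copy
UNCACHE TABLE logs;          -- release it
```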
[jira] [Commented] (SPARK-1911) Warn users that jars should be built with Java 6 for PySpark to work on YARN
[ https://issues.apache.org/jira/browse/SPARK-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013142#comment-14013142 ] Tathagata Das commented on SPARK-1911: -- As far as I can tell, it is because Java 7 uses Zip64 encoding when making JARs with more than 2^16 files, and Python (at least 2.x) is not able to read Zip64. So it fails whenever the Spark assembly JAR has more than 65k files, which in turn depends on whether it was generated with YARN and/or Hive enabled. Warn users that jars should be built with Java 6 for PySpark to work on YARN Key: SPARK-1911 URL: https://issues.apache.org/jira/browse/SPARK-1911 Project: Spark Issue Type: Sub-task Components: Documentation Reporter: Andrew Or Fix For: 1.0.0 Python sometimes fails to read jars created by Java 7. Building with Java 6 is necessary for PySpark to work on YARN, so the Spark assembly JAR should be compiled with Java 6. Currently we warn users only in make-distribution.sh, but most users build the jars directly. We should emphasize this in the docs, especially for PySpark on YARN, because this issue is not trivial to debug. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-1911) Warn users that jars should be built with Java 6 for PySpark to work on YARN
[ https://issues.apache.org/jira/browse/SPARK-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14013142#comment-14013142 ] Tathagata Das edited comment on SPARK-1911 at 5/30/14 12:31 AM: As far as I think, it is because Java 7 uses Zip64 encoding when making JARs with more 2^16 files and python (at least 2.x) is not able to read Zip64. So it fails in those times when the Spark assembly JAR has more than 65k files, which in turn depends on whether it has been generated with YARN and/or Hive enabled. Java 6 uses the traditional Zip format to create JARs, even if it has more than 65k files. So python always seems to work with Java 6 Jars Caveat: I cant claim 100% certainty on this interpretation because there is so little documentation on this on the net. was (Author: tdas): As far as I think, it is because Java 7 uses Zip64 encoding when making JARs with more 2^16 files and python (at least 2.x) is not able to read Zip64. So it fails in those times when the Spark assembly JAR has more than 65k files, which in turn depends on whether it has been generated with YARN and/or Hive enabled. Warn users that jars should be built with Java 6 for PySpark to work on YARN Key: SPARK-1911 URL: https://issues.apache.org/jira/browse/SPARK-1911 Project: Spark Issue Type: Sub-task Components: Documentation Reporter: Andrew Or Fix For: 1.0.0 Python sometimes fails to read jars created by Java 7. This is necessary for PySpark to work in YARN, and so Spark assembly JAR should compiled in Java 6 for PySpark to work on YARN. Currently we warn users only in make-distribution.sh, but most users build the jars directly. We should emphasize it in the docs especially for PySpark and YARN because this issue is not trivial to debug. -- This message was sent by Atlassian JIRA (v6.2#6252)
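Under that interpretation, whether a given assembly jar is at risk can be checked by counting its entries against the classic zip format's 65535 limit; a small sketch using the JDK's ZipFile:

```scala
import java.util.zip.ZipFile

// Jars are zip files; the pre-Zip64 format caps the entry count at 65535
// (2^16 - 1), so a count above that forces Java 7's jar tool into Zip64.
val Zip64EntryLimit = 65535

def needsZip64(jarPath: String): Boolean = {
  val zf = new ZipFile(jarPath)
  try zf.size() > Zip64EntryLimit finally zf.close()
}
```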