[jira] [Closed] (SPARK-1784) Add a partitioner which partitions an RDD with each partition having specified # of keys

2014-05-29 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia closed SPARK-1784.


   Resolution: Invalid
Fix Version/s: (was: 1.0.0)

 Add a partitioner which partitions an RDD with each partition having 
 specified # of keys
 

 Key: SPARK-1784
 URL: https://issues.apache.org/jira/browse/SPARK-1784
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Syed A. Hashmi
Priority: Minor

 At times on mailing lists, I have seen people complaining about having no 
 control over the # of keys per partition. RangePartitioner partitions keys into 
 roughly equal-sized partitions, but in cases where the user wants full control 
 over the exact size, that is not possible today.
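 For illustration, here is a minimal sketch (not an existing Spark API) of how 
 exact keys-per-partition control could be expressed on top of the current 
 Partitioner interface; the name FixedKeysPartitioner and the block size of 3 
 keys are hypothetical:
 {code}
 import org.apache.spark.Partitioner
 
 // Hypothetical sketch: assign each distinct key to a partition so that no
 // partition holds more than a fixed number of keys. The key-to-partition map
 // is precomputed by the caller (e.g. from rdd.keys.distinct().collect()).
 class FixedKeysPartitioner(assignment: Map[Any, Int]) extends Partitioner {
   override val numPartitions: Int =
     if (assignment.isEmpty) 1 else assignment.values.max + 1
   // Keys not present when the map was built fall back to partition 0.
   override def getPartition(key: Any): Int = assignment.getOrElse(key, 0)
 }
 
 // Usage sketch: blocks of exactly 3 keys per partition.
 // val keys = pairs.keys.distinct().collect().sorted
 // val assignment = keys.grouped(3).zipWithIndex
 //   .flatMap { case (block, idx) => block.map(k => (k: Any) -> idx) }.toMap
 // val repartitioned = pairs.partitionBy(new FixedKeysPartitioner(assignment))
 {code}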



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1811) Support resizable output buffer for kryo serializer

2014-05-29 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-1811:
-

Assignee: Koert Kuipers

 Support resizable output buffer for kryo serializer
 ---

 Key: SPARK-1811
 URL: https://issues.apache.org/jira/browse/SPARK-1811
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: koert kuipers
Assignee: Koert Kuipers
Priority: Minor

 Currently the size of the Kryo serializer output buffer can be set with 
 spark.kryoserializer.buffer.mb.
 The issue with this setting is that it is one-size-fits-all, so it ends up 
 being the maximum size needed, even if only a single task out of many needs it 
 to be that big. A resizable buffer would allow most tasks to use a modest-sized 
 buffer, while the occasional task that needs a really big buffer can get it at a 
 cost (allocating a new buffer and copying the contents over repeatedly as the 
 buffer grows; with each new allocation the size doubles).
 The class used for the buffer is Kryo's Output, which supports resizing if 
 maxCapacity is set bigger than capacity. I suggest we provide a setting 
 spark.kryoserializer.buffer.max.mb, which defaults to 
 spark.kryoserializer.buffer.mb and sets Output's maxCapacity.
 Pull request for this jira:
 https://github.com/apache/spark/pull/735
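 As a minimal sketch of the Kryo API this proposal builds on (the sizes below are 
 illustrative; in Spark they would come from the two settings named above):
 {code}
 import com.esotericsoftware.kryo.io.Output
 
 // Start with a modest buffer and let Kryo grow it on demand (doubling each
 // time) up to a hard maximum, instead of pre-allocating the worst case.
 val initialBytes = 2 * 1024 * 1024   // would come from spark.kryoserializer.buffer.mb
 val maxBytes = 64 * 1024 * 1024      // would come from the proposed spark.kryoserializer.buffer.max.mb
 val output = new Output(initialBytes, maxBytes)
 // Writing a record larger than the current capacity now triggers a resize
 // (new allocation + copy) rather than a "Buffer overflow" KryoException.
 {code}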



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-874) Have a --wait flag in ./sbin/stop-all.sh that polls until Worker's are finished

2014-05-29 Thread Archit Thakur (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012127#comment-14012127
 ] 

Archit Thakur edited comment on SPARK-874 at 5/29/14 6:45 AM:
--

I am interested in taking it up, please do assign. Thanks :) .


was (Author: archit279):
I am intersted in taking it up.

 Have a --wait flag in ./sbin/stop-all.sh that polls until Worker's are 
 finished
 ---

 Key: SPARK-874
 URL: https://issues.apache.org/jira/browse/SPARK-874
 Project: Spark
  Issue Type: New Feature
  Components: Deploy
Reporter: Patrick Wendell
Priority: Minor
  Labels: starter
 Fix For: 1.1.0


 When running benchmarking jobs, sometimes the cluster takes a long time to 
 shut down. We should add a feature where it will ssh into all the workers 
 every few seconds and check that the processes are dead, and won't return 
 until they are all dead. This would help a lot with automating benchmarking 
 scripts.
 There is some equivalent logic written in Python here; we just need to add it 
 to the shell script:
 https://github.com/pwendell/spark-perf/blob/master/bin/run#L117
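 For illustration only, here is a rough sketch of that polling loop in Scala (the 
 actual change would go into ./sbin/stop-all.sh; the helper name, host list, and 
 process pattern below are assumptions):
 {code}
 import scala.sys.process._
 
 // Hypothetical helper: ssh into every worker host and poll until no Worker
 // JVM is left, mirroring the python logic linked above.
 def waitForWorkersToDie(hosts: Seq[String], intervalMs: Long = 5000): Unit = {
   def workerAlive(host: String): Boolean =
     Seq("ssh", host, "pgrep -f org.apache.spark.deploy.worker.Worker").! == 0
   while (hosts.exists(workerAlive)) {
     Thread.sleep(intervalMs)
   }
 }
 {code}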



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-874) Have a --wait flag in ./sbin/stop-all.sh that polls until Worker's are finished

2014-05-29 Thread Archit Thakur (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012127#comment-14012127
 ] 

Archit Thakur commented on SPARK-874:
-

I am interested in taking it up.

 Have a --wait flag in ./sbin/stop-all.sh that polls until Worker's are 
 finished
 ---

 Key: SPARK-874
 URL: https://issues.apache.org/jira/browse/SPARK-874
 Project: Spark
  Issue Type: New Feature
  Components: Deploy
Reporter: Patrick Wendell
Priority: Minor
  Labels: starter
 Fix For: 1.1.0


 When running benchmarking jobs, sometimes the cluster takes a long time to 
 shut down. We should add a feature where it will ssh into all the workers 
 every few seconds and check that the processes are dead, and won't return 
 until they are all dead. This would help a lot with automating benchmarking 
 scripts.
 There is some equivalent logic written in Python here; we just need to add it 
 to the shell script:
 https://github.com/pwendell/spark-perf/blob/master/bin/run#L117



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1961) when data return from map is about 10 kb, reduce(_ + _) would always pending

2014-05-29 Thread zhoudi (JIRA)
zhoudi created SPARK-1961:
-

 Summary: when data return from map is about 10 kb, reduce(_ + _) 
would always pending
 Key: SPARK-1961
 URL: https://issues.apache.org/jira/browse/SPARK-1961
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: zhoudi






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1962) Add RDD cache reference counting

2014-05-29 Thread Taeyun Kim (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Taeyun Kim updated SPARK-1962:
--

Description: 
It would be nice if the RDD cache() method incorporated reference counting 
information.

That is,

{code}
void test()
{
    JavaRDD<...> rdd = ...;

    rdd.cache();    // to depth 1. actual caching happens.
    rdd.cache();    // to depth 2. Nop as long as the storage level is the same. Else, exception.

    ...

    rdd.uncache();  // to depth 1. Nop.
    rdd.uncache();  // to depth 0. Actual unpersist happens.
}
{code}

This can be useful when writing code in a modular way.
When a function receives an RDD as an argument, it doesn't necessarily know the 
cache status of the RDD. But it may want to cache the RDD, since it will use the 
RDD multiple times. With the current RDD API, however, it cannot determine whether 
it should unpersist it or leave it alone (so that the caller can continue to use 
that RDD without rebuilding).
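
For illustration, here is a user-side sketch of the requested semantics 
(RefCountedCache is hypothetical, not an existing Spark API): persist/unpersist 
are only issued at the 0 -> 1 and 1 -> 0 transitions of a per-RDD counter.

{code}
import org.apache.spark.rdd.RDD
import scala.collection.mutable

// Hypothetical helper wrapping cache()/unpersist() with a reference count keyed
// by RDD id, so that balanced cache/uncache calls compose across independently
// written functions.
object RefCountedCache {
  private val counts = mutable.Map.empty[Int, Int]

  def cache[T](rdd: RDD[T]): RDD[T] = synchronized {
    val n = counts.getOrElse(rdd.id, 0)
    if (n == 0) rdd.cache()                    // actual caching happens only once
    counts(rdd.id) = n + 1
    rdd
  }

  def uncache[T](rdd: RDD[T]): Unit = synchronized {
    counts.getOrElse(rdd.id, 0) match {
      case n if n <= 1 => rdd.unpersist(); counts.remove(rdd.id)  // last reference released
      case n           => counts(rdd.id) = n - 1                  // still needed elsewhere
    }
  }
}
{code}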


 Add RDD cache reference counting
 

 Key: SPARK-1962
 URL: https://issues.apache.org/jira/browse/SPARK-1962
 Project: Spark
  Issue Type: New Feature
Reporter: Taeyun Kim
Priority: Minor

 It would be nice if the RDD cache() method incorporated reference counting 
 information.
 That is,
 {code}
 void test()
 {
     JavaRDD<...> rdd = ...;
     rdd.cache();    // to depth 1. actual caching happens.
     rdd.cache();    // to depth 2. Nop as long as the storage level is the same. Else, exception.
     ...
     rdd.uncache();  // to depth 1. Nop.
     rdd.uncache();  // to depth 0. Actual unpersist happens.
 }
 {code}
 This can be useful when writing code in a modular way.
 When a function receives an RDD as an argument, it doesn't necessarily know 
 the cache status of the RDD. But it may want to cache the RDD, since it will use 
 the RDD multiple times. With the current RDD API, however, it cannot determine 
 whether it should unpersist it or leave it alone (so that the caller can continue 
 to use that RDD without rebuilding).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1962) Add RDD cache reference counting

2014-05-29 Thread Taeyun Kim (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Taeyun Kim updated SPARK-1962:
--

Affects Version/s: 1.0.0

 Add RDD cache reference counting
 

 Key: SPARK-1962
 URL: https://issues.apache.org/jira/browse/SPARK-1962
 Project: Spark
  Issue Type: New Feature
Affects Versions: 1.0.0
Reporter: Taeyun Kim
Priority: Minor

 It would be nice if the RDD cache() method incorporated reference counting 
 information.
 That is,
 {code}
 void test()
 {
     JavaRDD<...> rdd = ...;
     rdd.cache();    // to reference count 1. actual caching happens.
     rdd.cache();    // to reference count 2. Nop as long as the storage level is the same. Else, exception.
     ...
     rdd.uncache();  // to reference count 1. Nop.
     rdd.uncache();  // to reference count 0. Actual unpersist happens.
 }
 {code}
 This can be useful when writing code in a modular way.
 When a function receives an RDD as an argument, it doesn't necessarily know 
 the cache status of the RDD. But it may want to cache the RDD, since it will use 
 the RDD multiple times. With the current RDD API, however, it cannot determine 
 whether it should unpersist it or leave it alone (so that the caller can continue 
 to use that RDD without rebuilding).
 For API compatibility, introducing a new method or adding a parameter may be 
 required.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1963) Job aborted with NullPointerException from DAGScheduler.scala:1020

2014-05-29 Thread Kevin (Sangwoo) Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012181#comment-14012181
 ] 

Kevin (Sangwoo) Kim commented on SPARK-1963:


I guess the data is valid, because when I split the data into two and run the 
same code, each split runs well. 

I'll post more after investigating this issue. 

 Job aborted with NullPointerException from DAGScheduler.scala:1020
 --

 Key: SPARK-1963
 URL: https://issues.apache.org/jira/browse/SPARK-1963
 Project: Spark
  Issue Type: Bug
Reporter: Kevin (Sangwoo) Kim

 Hi, I'm testing Spark 0.9.1 on EC2 r3.8xlarge (32 cores, 240 GiB RAM).
 While counting active users from 70GB of data, the Spark job aborted with an NPE 
 from the DAGScheduler. 
 I guess the active user count is around 1~2M. 
 Here's what I did: 
 {code}
 val logs = sc.textFile("file:///spark/data/*")
 val activeUser = logs.map { x => val a = 
   LogObjectExtractor.getAnonymousAction(x); a.getUserId }.distinct
 activeUser.count
 {code}
 and here's the log: 
 {code}
 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Serialized task 1.0:2235 as 
 1883 bytes in 1 ms
 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Finished TID 2207 in 17541 
 ms on ip-10-169-5-198.ap-northeast-1.compute.internal (progress: 2204/2267)
 14/05/29 05:26:46 INFO scheduler.DAGScheduler: Completed ShuffleMapTask(1, 
 2207)
 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Starting task 1.0:2236 as 
 TID 2236 on executor 0: ip-10-169-5-198.ap-northeast-1.compute.internal 
 (PROCESS_LOCAL)
 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Serialized task 1.0:2236 as 
 1883 bytes in 1 ms
 14/05/29 05:26:46 WARN scheduler.TaskSetManager: Lost TID 2230 (task 1.0:2230)
 14/05/29 05:26:46 WARN scheduler.TaskSetManager: Loss was due to 
 java.lang.NullPointerException
 java.lang.NullPointerException
   at 
 $line16.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:17)
   at 
 $line16.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:17)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
   at 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:97)
   at 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96)
   at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:477)
   at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:477)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
   at org.apache.spark.scheduler.Task.run(Task.scala:53)
   at 
 org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:211)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
   at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:744)
 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Starting task 1.0:2230 as 
 TID 2237 on executor 0: ip-10-169-5-198.ap-northeast-1.compute.internal 
 (PROCESS_LOCAL)
 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Serialized task 1.0:2230 as 
 1883 bytes in 0 ms
 14/05/29 05:26:46 WARN scheduler.TaskSetManager: Lost TID 2231 (task 1.0:2231)
 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Loss was due to 
 java.lang.NullPointerException [duplicate 1]
 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Starting task 1.0:2231 as 
 TID 2238 on executor 0: ip-10-169-5-198.ap-northeast-1.compute.internal 
 (PROCESS_LOCAL)
 {code}
 ...
 {code}
 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Loss was due to 
 java.lang.NullPointerException [duplicate 27]
 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Loss was due to 
 java.lang.NullPointerException 

[jira] [Commented] (SPARK-1963) Job aborted with NullPointerException from DAGScheduler.scala:1020

2014-05-29 Thread Kevin (Sangwoo) Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012241#comment-14012241
 ] 

Kevin (Sangwoo) Kim commented on SPARK-1963:


I think the path file:///spark/data/* was including some wrong files. :x 
I request closing the issue.

 Job aborted with NullPointerException from DAGScheduler.scala:1020
 --

 Key: SPARK-1963
 URL: https://issues.apache.org/jira/browse/SPARK-1963
 Project: Spark
  Issue Type: Bug
Reporter: Kevin (Sangwoo) Kim

 Hi, I'm testing Spark 0.9.1 on EC2 r3.8xlarge (32 cores, 240 GiB RAM).
 While counting active users from 70GB of data, the Spark job aborted with an NPE 
 from the DAGScheduler. 
 I guess the active user count is around 1~2M. 
 Here's what I did: 
 {code}
 val logs = sc.textFile("file:///spark/data/*")
 val activeUser = logs.map { x => val a = 
   LogObjectExtractor.getAnonymousAction(x); a.getUserId }.distinct
 activeUser.count
 {code}
 and here's the log: 
 {code}
 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Serialized task 1.0:2235 as 
 1883 bytes in 1 ms
 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Finished TID 2207 in 17541 
 ms on ip-10-169-5-198.ap-northeast-1.compute.internal (progress: 2204/2267)
 14/05/29 05:26:46 INFO scheduler.DAGScheduler: Completed ShuffleMapTask(1, 
 2207)
 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Starting task 1.0:2236 as 
 TID 2236 on executor 0: ip-10-169-5-198.ap-northeast-1.compute.internal 
 (PROCESS_LOCAL)
 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Serialized task 1.0:2236 as 
 1883 bytes in 1 ms
 14/05/29 05:26:46 WARN scheduler.TaskSetManager: Lost TID 2230 (task 1.0:2230)
 14/05/29 05:26:46 WARN scheduler.TaskSetManager: Loss was due to 
 java.lang.NullPointerException
 java.lang.NullPointerException
   at 
 $line16.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:17)
   at 
 $line16.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:17)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
   at 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:97)
   at 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96)
   at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:477)
   at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:477)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
   at org.apache.spark.scheduler.Task.run(Task.scala:53)
   at 
 org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:211)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
   at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:744)
 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Starting task 1.0:2230 as 
 TID 2237 on executor 0: ip-10-169-5-198.ap-northeast-1.compute.internal 
 (PROCESS_LOCAL)
 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Serialized task 1.0:2230 as 
 1883 bytes in 0 ms
 14/05/29 05:26:46 WARN scheduler.TaskSetManager: Lost TID 2231 (task 1.0:2231)
 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Loss was due to 
 java.lang.NullPointerException [duplicate 1]
 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Starting task 1.0:2231 as 
 TID 2238 on executor 0: ip-10-169-5-198.ap-northeast-1.compute.internal 
 (PROCESS_LOCAL)
 {code}
 ...
 {code}
 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Loss was due to 
 java.lang.NullPointerException [duplicate 27]
 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Loss was due to 
 java.lang.NullPointerException [duplicate 28]
 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Finished 

[jira] [Resolved] (SPARK-1963) Job aborted with NullPointerException from DAGScheduler.scala:1020

2014-05-29 Thread Kevin (Sangwoo) Kim (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin (Sangwoo) Kim resolved SPARK-1963.


Resolution: Invalid

 Job aborted with NullPointerException from DAGScheduler.scala:1020
 --

 Key: SPARK-1963
 URL: https://issues.apache.org/jira/browse/SPARK-1963
 Project: Spark
  Issue Type: Bug
Reporter: Kevin (Sangwoo) Kim

 Hi, I'm testing Spark 0.9.1 on EC2 r3.8xlarge (32 cores, 240 GiB RAM).
 While counting active users from 70GB of data, the Spark job aborted with an NPE 
 from the DAGScheduler. 
 I guess the active user count is around 1~2M. 
 Here's what I did: 
 {code}
 val logs = sc.textFile("file:///spark/data/*")
 val activeUser = logs.map { x => val a = 
   LogObjectExtractor.getAnonymousAction(x); a.getUserId }.distinct
 activeUser.count
 {code}
 and here's the log: 
 {code}
 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Serialized task 1.0:2235 as 
 1883 bytes in 1 ms
 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Finished TID 2207 in 17541 
 ms on ip-10-169-5-198.ap-northeast-1.compute.internal (progress: 2204/2267)
 14/05/29 05:26:46 INFO scheduler.DAGScheduler: Completed ShuffleMapTask(1, 
 2207)
 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Starting task 1.0:2236 as 
 TID 2236 on executor 0: ip-10-169-5-198.ap-northeast-1.compute.internal 
 (PROCESS_LOCAL)
 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Serialized task 1.0:2236 as 
 1883 bytes in 1 ms
 14/05/29 05:26:46 WARN scheduler.TaskSetManager: Lost TID 2230 (task 1.0:2230)
 14/05/29 05:26:46 WARN scheduler.TaskSetManager: Loss was due to 
 java.lang.NullPointerException
 java.lang.NullPointerException
   at 
 $line16.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:17)
   at 
 $line16.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:17)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
   at 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:97)
   at 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96)
   at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:477)
   at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:477)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
   at org.apache.spark.scheduler.Task.run(Task.scala:53)
   at 
 org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:211)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
   at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:744)
 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Starting task 1.0:2230 as 
 TID 2237 on executor 0: ip-10-169-5-198.ap-northeast-1.compute.internal 
 (PROCESS_LOCAL)
 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Serialized task 1.0:2230 as 
 1883 bytes in 0 ms
 14/05/29 05:26:46 WARN scheduler.TaskSetManager: Lost TID 2231 (task 1.0:2231)
 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Loss was due to 
 java.lang.NullPointerException [duplicate 1]
 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Starting task 1.0:2231 as 
 TID 2238 on executor 0: ip-10-169-5-198.ap-northeast-1.compute.internal 
 (PROCESS_LOCAL)
 {code}
 ...
 {code}
 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Loss was due to 
 java.lang.NullPointerException [duplicate 27]
 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Loss was due to 
 java.lang.NullPointerException [duplicate 28]
 14/05/29 05:26:46 INFO scheduler.TaskSetManager: Finished TID 2201 in 17959 
 ms on ip-10-169-5-198.ap-northeast-1.compute.internal (progress: 2210/2267)
 14/05/29 05:26:46 INFO 

[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data

2014-05-29 Thread sam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012295#comment-14012295
 ] 

sam commented on SPARK-1867:


[~srowen] Thanks, I still find it difficult to find the correct artifacts.  
Perhaps you could help me with these, please:

Which artifact contains: org.apache.hadoop.io.LongWritable?
Which artifact contains: org.apache.hadoop.io.CombineTextInputFormat?

If I search for hadoop-io here 
https://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH-Version-and-Packaging-Information/cdhvd_cdh5_maven_repo.html
 I can't find anything :(

Many thanks

 Spark Documentation Error causes java.lang.IllegalStateException: unread 
 block data
 ---

 Key: SPARK-1867
 URL: https://issues.apache.org/jira/browse/SPARK-1867
 Project: Spark
  Issue Type: Bug
Reporter: sam

 I've employed two System Administrators on a contract basis (for quite a bit 
 of money), and both contractors have independently hit the following 
 exception.  What we are doing is:
 1. Installing Spark 0.9.1 according to the documentation on the website, 
 along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs.
 2. Building a fat jar with a Spark app with sbt then trying to run it on the 
 cluster
 I've also included code snippets, and sbt deps at the bottom.
 When I've Googled this, there seem to be two somewhat vague responses:
 a) Mismatching Spark versions on nodes/user code
 b) Need to add more jars to the SparkConf
 Now I know that (b) is not the problem, having successfully run the same code 
 on other clusters while only including one jar (it's a fat jar).
 But I have no idea how to check for (a) - it appears Spark doesn't have any 
 version checks or anything - it would be nice if it checked versions and 
 threw a mismatching version exception: "you have user code using version X 
 and node Y has version Z".
 I would be very grateful for advice on this.
 The exception:
 Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task 
 0.0:1 failed 32 times (most recent failure: Exception failure: 
 java.lang.IllegalStateException: unread block data)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to 
 java.lang.IllegalStateException: unread block data [duplicate 59]
 My code snippet:
 val conf = new SparkConf()
   .setMaster(clusterMaster)
   .setAppName(appName)
   .setSparkHome(sparkHome)
   .setJars(SparkContext.jarOfClass(this.getClass))
 println("count = " + new SparkContext(conf).textFile(someHdfsPath).count())
 My SBT dependencies:
 // relevant
 "org.apache.spark" % "spark-core_2.10" % "0.9.1",
 "org.apache.hadoop" % "hadoop-client" % "2.3.0-mr1-cdh5.0.0",
 // standard, probably unrelated
 "com.github.seratch" %% "awscala" % "[0.2,)",
 "org.scalacheck" %% "scalacheck" % "1.10.1" % "test",
 "org.specs2" %% "specs2" % "1.14" % "test",
 "org.scala-lang" % "scala-reflect" % "2.10.3",
 "org.scalaz" %% "scalaz-core" % "7.0.5",
 "net.minidev" % "json-smart" % "1.2"



--

[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data

2014-05-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012309#comment-14012309
 ] 

Sean Owen commented on SPARK-1867:
--

There is no hadoop-io module. Modules are subcomponents of the components 
distributed in something like CDH, and are not versioned independently, so you 
would not find them described on that page. That is, if CDH X.Y includes Hadoop 
Z.W then it includes version Z.W of all Hadoop's modules.

LongWritable was in hadoop-core in Hadoop 1.x, and is in hadoop-common in 2.x. 
Just about everything depends on these modules, so you should not find yourself 
missing them at runtime if you are depending on something like hadoop-client.

There isn't an org.apache.hadoop.io.CombineTextInputFormat, not that I can see 
-- where do you see that referenced?
There is org.apache.hadoop.mapreduce.lib.CombineTextInputFormat, although it 
looks like it appeared around Hadoop 2.1 
(https://issues.apache.org/jira/browse/MAPREDUCE-5069). You would not be able to 
use it if you are running on Hadoop 1, and wouldn't find it if you depend on 
Hadoop 1 modules.

It appears to be in the mapreduce-client-core module, but again, you wouldn't 
need to depend on that directly. That gets pulled in from hadoop-client, via 
hadoop-mapreduce-client. I can see it doing so when you build Spark for Hadoop 
2, for example.

I'm still not clear on why you need to hunt for individual classes? Maybe 
one of those is just due to a package typo.
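
For what it's worth, a minimal sbt sketch of the layout described above (versions 
are illustrative, not prescriptive): depending on hadoop-client is enough, since 
it transitively pulls in hadoop-common (org.apache.hadoop.io.LongWritable) and the 
MapReduce client modules on Hadoop 2.x, so there is no per-class artifact to hunt 
for.

{code}
// build.sbt sketch with illustrative versions; hadoop-client brings in the
// Hadoop module that actually contains each class.
libraryDependencies ++= Seq(
  "org.apache.spark"  % "spark-core_2.10" % "0.9.1",
  "org.apache.hadoop" % "hadoop-client"   % "2.3.0"
)
{code}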



 Spark Documentation Error causes java.lang.IllegalStateException: unread 
 block data
 ---

 Key: SPARK-1867
 URL: https://issues.apache.org/jira/browse/SPARK-1867
 Project: Spark
  Issue Type: Bug
Reporter: sam

 I've employed two System Administrators on a contract basis (for quite a bit 
 of money), and both contractors have independently hit the following 
 exception.  What we are doing is:
 1. Installing Spark 0.9.1 according to the documentation on the website, 
 along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs.
 2. Building a fat jar with a Spark app with sbt then trying to run it on the 
 cluster
 I've also included code snippets, and sbt deps at the bottom.
 When I've Googled this, there seem to be two somewhat vague responses:
 a) Mismatching Spark versions on nodes/user code
 b) Need to add more jars to the SparkConf
 Now I know that (b) is not the problem, having successfully run the same code 
 on other clusters while only including one jar (it's a fat jar).
 But I have no idea how to check for (a) - it appears Spark doesn't have any 
 version checks or anything - it would be nice if it checked versions and 
 threw a mismatching version exception: "you have user code using version X 
 and node Y has version Z".
 I would be very grateful for advice on this.
 The exception:
 Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task 
 0.0:1 failed 32 times (most recent failure: Exception failure: 
 java.lang.IllegalStateException: unread block data)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 

[jira] [Commented] (SPARK-1948) Scalac crashes when building Spark in IntelliJ IDEA

2014-05-29 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012366#comment-14012366
 ] 

Cheng Lian commented on SPARK-1948:
---

Hi [~sowen], thanks for the suggestion! I tried to switch to Maven, but it 
wasn't as smooth as expected, so I took some time to investigate this issue. 
In the end it seems to be a bug in the {{sbt-idea}} plugin.

The {{.idea/libraries/SBT__org_apache_mesos_0_18_1.xml}} file generated by 
{{sbt gen-idea}} declares both {{mesos-0.18.1.jar}} and 
{{mesos-0.18.1-shaded-protobuf.jar}} in the {{CLASSES}} elements:

{code}
<component name="libraryTable">
  <library name="SBT: org.apache.mesos:mesos:0.18.1">
    <CLASSES>
      <root url="jar://$PROJECT_DIR$/lib_managed/jars/mesos-0.18.1.jar!/" />
      <root url="jar://$PROJECT_DIR$/lib_managed/jars/mesos-0.18.1-shaded-protobuf.jar!/" />
    </CLASSES>
    <JAVADOC> </JAVADOC>
    <SOURCES>
      <root url="jar://$PROJECT_DIR$/lib_managed/srcs/mesos-0.18.1-sources.jar!/" />
    </SOURCES>
  </library>
</component>
{code}

On the other hand, running {{show core/dependencyClasspath}} in SBT only shows 
the latter one, which means SBT only refers to the shaded one to build the 
project.

Removing the non-shaded Mesos jar file from the IDEA project settings resolves 
the problem:

# Run {{sbt/sbt gen-idea}}
# Open the project in IDEA
# Click the File / Project Structure... menu item
# Locate the SBT: org.apache.mesos:mesos:0.18.1 dependency along the UI path 
Project Settings / Modules / core / Dependencies
# Click the Edit button (with a pen icon)
# Remove the non-shaded jar file from Classes

Now build Spark within IDEA, and everything should be fine.

 Scalac crashes when building Spark in IntelliJ IDEA
 ---

 Key: SPARK-1948
 URL: https://issues.apache.org/jira/browse/SPARK-1948
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.0
Reporter: Cheng Lian
Priority: Minor
 Attachments: scalac-crash.log


 After [commit 0be8b45|https://github.com/apache/spark/commit/0be8b45], the 
 master branch fails to compile within IntelliJ IDEA and causes {{scalac}} to 
 crash. But building Spark with SBT is OK. This issue is not blocking, but 
 it's annoying since it prevents developers from debugging Spark within IDEA.
 I can't figure out the exact reason; I only nailed it down to this commit by 
 binary searching. Maybe I should file a bug report with IDEA instead?
 How to reproduce:
 # Checkout [commit 0be8b45|https://github.com/apache/spark/commit/0be8b45]
 # Run {{sbt/sbt clean gen-idea}} under Spark source directory
 # Open the project in IntelliJ IDEA
 # Build the project
 The {{scalac}} crash report is attached.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data

2014-05-29 Thread sam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012431#comment-14012431
 ] 

sam commented on SPARK-1867:


Hmm, the reason for my confusion is a very strange compile problem. If I do the 
following:

import org.apache.hadoop.io.LongWritable;
...
public class CombinedInputFormat extends CombineFileInputFormat<LongWritable, 
Text> {

It doesn't work, but if I do

public class CombinedInputFormat extends 
CombineFileInputFormat<org.apache.hadoop.io.LongWritable, Text> {

it does work. Odd, though I was just copy-pasting Java code from the web ... it's 
been 4 years since I've written Java, so maybe I'm missing something silly.

 Spark Documentation Error causes java.lang.IllegalStateException: unread 
 block data
 ---

 Key: SPARK-1867
 URL: https://issues.apache.org/jira/browse/SPARK-1867
 Project: Spark
  Issue Type: Bug
Reporter: sam

 I've employed two System Administrators on a contract basis (for quite a bit 
 of money), and both contractors have independently hit the following 
 exception.  What we are doing is:
 1. Installing Spark 0.9.1 according to the documentation on the website, 
 along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs.
 2. Building a fat jar with a Spark app with sbt then trying to run it on the 
 cluster
 I've also included code snippets, and sbt deps at the bottom.
 When I've Googled this, there seem to be two somewhat vague responses:
 a) Mismatching Spark versions on nodes/user code
 b) Need to add more jars to the SparkConf
 Now I know that (b) is not the problem, having successfully run the same code 
 on other clusters while only including one jar (it's a fat jar).
 But I have no idea how to check for (a) - it appears Spark doesn't have any 
 version checks or anything - it would be nice if it checked versions and 
 threw a mismatching version exception: "you have user code using version X 
 and node Y has version Z".
 I would be very grateful for advice on this.
 The exception:
 Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task 
 0.0:1 failed 32 times (most recent failure: Exception failure: 
 java.lang.IllegalStateException: unread block data)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to 
 java.lang.IllegalStateException: unread block data [duplicate 59]
 My code snippet:
 val conf = new SparkConf()
   .setMaster(clusterMaster)
   .setAppName(appName)
   .setSparkHome(sparkHome)
   .setJars(SparkContext.jarOfClass(this.getClass))
 println("count = " + new SparkContext(conf).textFile(someHdfsPath).count())
 My SBT dependencies:
 // relevant
 "org.apache.spark" % "spark-core_2.10" % "0.9.1",
 "org.apache.hadoop" % "hadoop-client" % "2.3.0-mr1-cdh5.0.0",
 // standard, probably unrelated
 "com.github.seratch" %% "awscala" % "[0.2,)",
 "org.scalacheck" %% "scalacheck" % "1.10.1" % "test",
 "org.specs2" %% "specs2" % "1.14" % "test",
 "org.scala-lang" % "scala-reflect" % "2.10.3",
 "org.scalaz" %% "scalaz-core" 

[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data

2014-05-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012434#comment-14012434
 ] 

Sean Owen commented on SPARK-1867:
--

Something else is up; those are equivalent in Java and you couldn't have 
imported two symbols with the same name. I could look at the file offline if 
you think I might spot something else.

 Spark Documentation Error causes java.lang.IllegalStateException: unread 
 block data
 ---

 Key: SPARK-1867
 URL: https://issues.apache.org/jira/browse/SPARK-1867
 Project: Spark
  Issue Type: Bug
Reporter: sam

 I've employed two System Administrators on a contract basis (for quite a bit 
 of money), and both contractors have independently hit the following 
 exception.  What we are doing is:
 1. Installing Spark 0.9.1 according to the documentation on the website, 
 along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs.
 2. Building a fat jar with a Spark app with sbt then trying to run it on the 
 cluster
 I've also included code snippets, and sbt deps at the bottom.
 When I've Googled this, there seem to be two somewhat vague responses:
 a) Mismatching Spark versions on nodes/user code
 b) Need to add more jars to the SparkConf
 Now I know that (b) is not the problem, having successfully run the same code 
 on other clusters while only including one jar (it's a fat jar).
 But I have no idea how to check for (a) - it appears Spark doesn't have any 
 version checks or anything - it would be nice if it checked versions and 
 threw a mismatching version exception: "you have user code using version X 
 and node Y has version Z".
 I would be very grateful for advice on this.
 The exception:
 Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task 
 0.0:1 failed 32 times (most recent failure: Exception failure: 
 java.lang.IllegalStateException: unread block data)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to 
 java.lang.IllegalStateException: unread block data [duplicate 59]
 My code snippet:
 val conf = new SparkConf()
   .setMaster(clusterMaster)
   .setAppName(appName)
   .setSparkHome(sparkHome)
   .setJars(SparkContext.jarOfClass(this.getClass))
 println("count = " + new SparkContext(conf).textFile(someHdfsPath).count())
 My SBT dependencies:
 // relevant
 "org.apache.spark" % "spark-core_2.10" % "0.9.1",
 "org.apache.hadoop" % "hadoop-client" % "2.3.0-mr1-cdh5.0.0",
 // standard, probably unrelated
 "com.github.seratch" %% "awscala" % "[0.2,)",
 "org.scalacheck" %% "scalacheck" % "1.10.1" % "test",
 "org.specs2" %% "specs2" % "1.14" % "test",
 "org.scala-lang" % "scala-reflect" % "2.10.3",
 "org.scalaz" %% "scalaz-core" % "7.0.5",
 "net.minidev" % "json-smart" % "1.2"



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk

2014-05-29 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1518:
---

Component/s: Spark Core

 Spark master doesn't compile against hadoop-common trunk
 

 Key: SPARK-1518
 URL: https://issues.apache.org/jira/browse/SPARK-1518
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Marcelo Vanzin
Assignee: Colin Patrick McCabe
Priority: Critical

 FSDataOutputStream::sync() has disappeared from trunk in Hadoop; 
 FileLogger.scala is calling it.
 I've changed it locally to hsync() so I can compile the code, but haven't 
 checked yet whether those are equivalent. hsync() seems to have been there 
 forever, so it hopefully works with all versions Spark cares about.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk

2014-05-29 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1518:
---

Target Version/s: 1.1.0, 1.0.1  (was: 1.0.1)

 Spark master doesn't compile against hadoop-common trunk
 

 Key: SPARK-1518
 URL: https://issues.apache.org/jira/browse/SPARK-1518
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Marcelo Vanzin
Assignee: Colin Patrick McCabe
Priority: Critical

 FSDataOutputStream::sync() has disappeared from trunk in Hadoop; 
 FileLogger.scala is calling it.
 I've changed it locally to hsync() so I can compile the code, but haven't 
 checked yet whether those are equivalent. hsync() seems to have been there 
 forever, so it hopefully works with all versions Spark cares about.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1961) when data return from map is about 10 kb, reduce(_ + _) would always pending

2014-05-29 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012471#comment-14012471
 ] 

Patrick Wendell commented on SPARK-1961:


Could you add a bit more information here or a reproduction? As it stands it's 
not really clear what this report means.

 when data return from map is about 10 kb, reduce(_ + _) would always pending
 

 Key: SPARK-1961
 URL: https://issues.apache.org/jira/browse/SPARK-1961
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: zhoudi





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1962) Add RDD cache reference counting

2014-05-29 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1962:
---

Component/s: Spark Core

 Add RDD cache reference counting
 

 Key: SPARK-1962
 URL: https://issues.apache.org/jira/browse/SPARK-1962
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Taeyun Kim
Priority: Minor

 It would be nice if the RDD cache() method incorporated reference counting 
 information.
 That is,
 {code}
 void test()
 {
     JavaRDD<...> rdd = ...;
     rdd.cache();    // to reference count 1. actual caching happens.
     rdd.cache();    // to reference count 2. Nop as long as the storage level is the same. Else, exception.
     ...
     rdd.uncache();  // to reference count 1. Nop.
     rdd.uncache();  // to reference count 0. Actual unpersist happens.
 }
 {code}
 This can be useful when writing code in a modular way.
 When a function receives an RDD as an argument, it doesn't necessarily know 
 the cache status of the RDD. But it may want to cache the RDD, since it will use 
 the RDD multiple times. With the current RDD API, however, it cannot determine 
 whether it should unpersist it or leave it alone (so that the caller can continue 
 to use that RDD without rebuilding).
 For API compatibility, introducing a new method or adding a parameter may be 
 required.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1935) Explicitly add commons-codec 1.5 as a dependency

2014-05-29 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1935.


Resolution: Fixed

 Explicitly add commons-codec 1.5 as a dependency
 

 Key: SPARK-1935
 URL: https://issues.apache.org/jira/browse/SPARK-1935
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 0.9.1
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Minor
 Fix For: 1.1.0, 1.0.1


 Right now, commons-codec is a transitive dependency. When Spark is built with 
 Maven for Hadoop 1, jets3t 0.7.1 will pull in commons-codec 1.3, which is an 
 older version (Hadoop 1.0.4 depends on 1.4). This older version can cause 
 problems because 1.4 introduces incompatible changes and new methods.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1935) Explicitly add commons-codec 1.5 as a dependency

2014-05-29 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1935:
---

Fix Version/s: 1.0.1
   1.1.0

 Explicitly add commons-codec 1.5 as a dependency
 

 Key: SPARK-1935
 URL: https://issues.apache.org/jira/browse/SPARK-1935
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 0.9.1
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Minor
 Fix For: 1.1.0, 1.0.1


 Right now, commons-codec is a transitive dependency. When Spark is built with 
 Maven for Hadoop 1, jets3t 0.7.1 will pull in commons-codec 1.3, which is an 
 older version (Hadoop 1.0.4 depends on 1.4). This older version can cause 
 problems because 1.4 introduces incompatible changes and new methods.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1964) Timestamp missing from HiveMetastore types parser

2014-05-29 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-1964:
---

 Summary: Timestamp missing from HiveMetastore types parser
 Key: SPARK-1964
 URL: https://issues.apache.org/jira/browse/SPARK-1964
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Michael Armbrust


{code}


-- Forwarded message --
From: dataginjaninja rickett.stepha...@gmail.com
Date: Thu, May 29, 2014 at 8:54 AM
Subject: Timestamp support in v1.0
To: d...@spark.incubator.apache.org


Can anyone verify which rc [SPARK-1360] Add Timestamp Support for SQL #275
https://github.com/apache/spark/pull/275 is included in? I am running
rc3, but receiving errors with TIMESTAMP as a datatype in my Hive tables
when trying to use them in pyspark.

*The error I get:*
14/05/29 15:44:47 INFO ParseDriver: Parsing command: SELECT COUNT(*) FROM
aol
14/05/29 15:44:48 INFO ParseDriver: Parse Completed
14/05/29 15:44:48 INFO metastore: Trying to connect to metastore with URI
thrift:
14/05/29 15:44:48 INFO metastore: Waiting 1 seconds before next connection
attempt.
14/05/29 15:44:49 INFO metastore: Connected to metastore.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/spark-1.0.0-rc3/python/pyspark/sql.py", line 189, in hql
    return self.hiveql(hqlQuery)
  File "/opt/spark-1.0.0-rc3/python/pyspark/sql.py", line 183, in hiveql
    return SchemaRDD(self._ssql_ctx.hiveql(hqlQuery), self)
  File "/opt/spark-1.0.0-rc3/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 537, in __call__
  File "/opt/spark-1.0.0-rc3/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o14.hiveql.
: java.lang.RuntimeException: Unsupported dataType: timestamp

*The table I loaded:*
DROP TABLE IF EXISTS aol;
CREATE EXTERNAL TABLE aol (
userid STRING,
query STRING,
query_time TIMESTAMP,
item_rank INT,
click_url STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/tmp/data/aol';

*The pyspark commands:*
from pyspark.sql import HiveContext
hctx= HiveContext(sc)
results = hctx.hql("SELECT COUNT(*) FROM aol").collect()
{code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (SPARK-1964) Timestamp missing from HiveMetastore types parser

2014-05-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reassigned SPARK-1964:
---

Assignee: Michael Armbrust

 Timestamp missing from HiveMetastore types parser
 -

 Key: SPARK-1964
 URL: https://issues.apache.org/jira/browse/SPARK-1964
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Michael Armbrust
Assignee: Michael Armbrust

 {code}
 -- Forwarded message --
 From: dataginjaninja rickett.stepha...@gmail.com
 Date: Thu, May 29, 2014 at 8:54 AM
 Subject: Timestamp support in v1.0
 To: d...@spark.incubator.apache.org
 Can anyone verify which rc  [SPARK-1360] Add Timestamp Support for SQL #275
 https://github.com/apache/spark/pull/275   is included in? I am running
 rc3, but receiving errors with TIMESTAMP as a datatype in my Hive tables
 when trying to use them in pyspark.
 *The error I get:*
 14/05/29 15:44:47 INFO ParseDriver: Parsing command: SELECT COUNT(*) FROM
 aol
 14/05/29 15:44:48 INFO ParseDriver: Parse Completed
 14/05/29 15:44:48 INFO metastore: Trying to connect to metastore with URI
 thrift:
 14/05/29 15:44:48 INFO metastore: Waiting 1 seconds before next connection
 attempt.
 14/05/29 15:44:49 INFO metastore: Connected to metastore.
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/opt/spark-1.0.0-rc3/python/pyspark/sql.py", line 189, in hql
     return self.hiveql(hqlQuery)
   File "/opt/spark-1.0.0-rc3/python/pyspark/sql.py", line 183, in hiveql
     return SchemaRDD(self._ssql_ctx.hiveql(hqlQuery), self)
   File "/opt/spark-1.0.0-rc3/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 537, in __call__
   File "/opt/spark-1.0.0-rc3/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
 py4j.protocol.Py4JJavaError: An error occurred while calling o14.hiveql.
 : java.lang.RuntimeException: Unsupported dataType: timestamp
 *The table I loaded:*
 DROP TABLE IF EXISTS aol;
 CREATE EXTERNAL TABLE aol (
 userid STRING,
 query STRING,
 query_time TIMESTAMP,
 item_rank INT,
 click_url STRING)
 ROW FORMAT DELIMITED
 FIELDS TERMINATED BY '\t'
 LOCATION '/tmp/data/aol';
 *The pyspark commands:*
 from pyspark.sql import HiveContext
 hctx= HiveContext(sc)
 results = hctx.hql("SELECT COUNT(*) FROM aol").collect()
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk

2014-05-29 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012520#comment-14012520
 ] 

Matei Zaharia commented on SPARK-1518:
--

Sorry, I'm still not sure I understand what you're asking for -- maybe I missed 
it above. Are you worried that the Spark assembly on the cluster has to be 
pre-built against Hadoop? We could perhaps make it find stuff out of 
HADOOP_HOME, but then it wouldn't work for users that don't have a Hadoop 
installation, which is a lot of users. For client apps, it's really enough to 
add that hadoop-client dependency. No other manipulation is needed.

If you want to build a client app that automatically works with multiple versions of 
Hadoop, you can also package it with Spark and hadoop-client marked as provided and 
use spark-submit to put the Spark assembly on your cluster on the classpath. Then it 
will work with whatever version that assembly was built against. But you do need to 
specify hadoop-client when you run without spark-submit and want to talk to the 
version of HDFS in your cluster (e.g. when you're testing the app on your laptop and 
trying to make it read from HDFS).
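
For reference, a minimal build.sbt-style sketch of the setup described above (artifact 
versions are illustrative for the 1.0.0 era, not a recommendation):

{code}
// Hedged sketch: both Spark and the Hadoop client are "provided", so the spark-submit
// assembly (built against your cluster's Hadoop) supplies them at runtime.
libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-core"    % "1.0.0" % "provided",
  "org.apache.hadoop" %  "hadoop-client" % "2.2.0" % "provided"  // match your cluster's version
)
{code}

When running without spark-submit (e.g. from a laptop for testing), drop the "provided" 
scope so the same dependencies end up on the application classpath.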

 Spark master doesn't compile against hadoop-common trunk
 

 Key: SPARK-1518
 URL: https://issues.apache.org/jira/browse/SPARK-1518
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Marcelo Vanzin
Assignee: Colin Patrick McCabe
Priority: Critical

 FSDataOutputStream::sync() has disappeared from trunk in Hadoop; 
 FileLogger.scala is calling it.
 I've changed it locally to hsync() so I can compile the code, but haven't 
 checked yet whether those are equivalent. hsync() seems to have been there 
 forever, so it hopefully works with all versions Spark cares about.
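
As a hedged sketch of the workaround described in this issue (assuming a Hadoop 2.x 
client on the classpath, where FSDataOutputStream exposes hsync(); this is not the 
actual FileLogger.scala change), the fix amounts to:

{code}
import org.apache.hadoop.fs.FSDataOutputStream

// Illustrative only: flush the event-log stream with hsync(), which still exists in
// hadoop-common trunk, instead of the removed sync().
def flushToHdfs(out: FSDataOutputStream): Unit = {
  out.flush()
  out.hsync()  // replaces the old out.sync() call that no longer compiles against trunk
}
{code}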



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1964) Timestamp missing from HiveMetastore types parser

2014-05-29 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012543#comment-14012543
 ] 

Michael Armbrust commented on SPARK-1964:
-

https://github.com/apache/spark/pull/913

 Timestamp missing from HiveMetastore types parser
 -

 Key: SPARK-1964
 URL: https://issues.apache.org/jira/browse/SPARK-1964
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Michael Armbrust
Assignee: Michael Armbrust

 {code}
 -- Forwarded message --
 From: dataginjaninja rickett.stepha...@gmail.com
 Date: Thu, May 29, 2014 at 8:54 AM
 Subject: Timestamp support in v1.0
 To: d...@spark.incubator.apache.org
 Can anyone verify which rc [SPARK-1360] Add Timestamp Support for SQL #275
 (https://github.com/apache/spark/pull/275) is included in? I am running
 rc3, but receiving errors with TIMESTAMP as a datatype in my Hive tables
 when trying to use them in pyspark.
 *The error I get:*
 14/05/29 15:44:47 INFO ParseDriver: Parsing command: SELECT COUNT(*) FROM
 aol
 14/05/29 15:44:48 INFO ParseDriver: Parse Completed
 14/05/29 15:44:48 INFO metastore: Trying to connect to metastore with URI
 thrift:
 14/05/29 15:44:48 INFO metastore: Waiting 1 seconds before next connection
 attempt.
 14/05/29 15:44:49 INFO metastore: Connected to metastore.
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/opt/spark-1.0.0-rc3/python/pyspark/sql.py", line 189, in hql
     return self.hiveql(hqlQuery)
   File "/opt/spark-1.0.0-rc3/python/pyspark/sql.py", line 183, in hiveql
     return SchemaRDD(self._ssql_ctx.hiveql(hqlQuery), self)
   File "/opt/spark-1.0.0-rc3/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 537, in __call__
   File "/opt/spark-1.0.0-rc3/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
 py4j.protocol.Py4JJavaError: An error occurred while calling o14.hiveql.
 : java.lang.RuntimeException: Unsupported dataType: timestamp
 *The table I loaded:*
 DROP TABLE IF EXISTS aol;
 CREATE EXTERNAL TABLE aol (
 userid STRING,
 query STRING,
 query_time TIMESTAMP,
 item_rank INT,
 click_url STRING)
 ROW FORMAT DELIMITED
 FIELDS TERMINATED BY '\t'
 LOCATION '/tmp/data/aol';
 *The pyspark commands:*
 from pyspark.sql import HiveContext
 hctx = HiveContext(sc)
 results = hctx.hql("SELECT COUNT(*) FROM aol").collect()
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk

2014-05-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012552#comment-14012552
 ] 

Sean Owen commented on SPARK-1518:
--

Heh, I think the essence is: at least one more separate Maven artifact, under a 
different classifier, for Hadoop 2.x builds. If you package that, you get Spark and 
everything it needs to work against a Hadoop 2 cluster. Yeah, I see that you're 
suggesting various ways to push the app to the cluster, where it can bind to the 
right version of things, and that may be the right-est way to think about this. I had 
envisioned running a stand-alone app on a machine that is not part of the cluster but 
is a client of it, and that means packaging in the right Hadoop client dependencies. 
Spark already declares how it wants to include the various Hadoop client versions -- 
it's more than just including hadoop-client -- so I wanted to leverage that. Let's see 
if this actually turns out to be a broader request, though.

 Spark master doesn't compile against hadoop-common trunk
 

 Key: SPARK-1518
 URL: https://issues.apache.org/jira/browse/SPARK-1518
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Marcelo Vanzin
Assignee: Colin Patrick McCabe
Priority: Critical

 FSDataOutputStream::sync() has disappeared from trunk in Hadoop; 
 FileLogger.scala is calling it.
 I've changed it locally to hsync() so I can compile the code, but haven't 
 checked yet whether those are equivalent. hsync() seems to have been there 
 forever, so it hopefully works with all versions Spark cares about.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1952) slf4j version conflicts with pig

2014-05-29 Thread Ryan Compton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012650#comment-14012650
 ] 

Ryan Compton commented on SPARK-1952:
-

Thanks so much! 

Here's the fix I had to make (removed jul-to-slf4j and jcl-over-slf4j)

{code}
diff --git a/project/SparkBuild.scala b/project/SparkBuild.scala
index 29dcd86..a113ead 100644
--- a/project/SparkBuild.scala
+++ b/project/SparkBuild.scala
@@ -336,8 +336,8 @@ object SparkBuild extends Build {
         "log4j"             % "log4j"            % "1.2.17",
         "org.slf4j"         % "slf4j-api"        % slf4jVersion,
         "org.slf4j"         % "slf4j-log4j12"    % slf4jVersion,
-        "org.slf4j"         % "jul-to-slf4j"     % slf4jVersion,
-        "org.slf4j"         % "jcl-over-slf4j"   % slf4jVersion,
+        //"org.slf4j"       % "jul-to-slf4j"     % slf4jVersion,
+        //"org.slf4j"       % "jcl-over-slf4j"   % slf4jVersion,
         "commons-daemon"    % "commons-daemon"   % "1.0.10", // workaround for bug HADOOP-9407
         "com.ning"          % "compress-lzf"     % "1.0.0",
         "org.xerial.snappy" % "snappy-java"      % "1.0.5",
{code}

This led to several of these during compilation:

{code}
[warn] Class org.apache.commons.logging.Log not found - continuing with a stub.
{code}

But it did successfully remove the SLF4JLocationAwareLog.class

{code}
[rfcompton@node19 spark-1.0.0]$ jar tvf assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar | grep -i slf4j | grep Location
   455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class
   479 Fri Dec 13 16:44:40 PST 2013 parquet/org/slf4j/spi/LocationAwareLogger.class
{code}

For reference, there was no *-to-slf4j / *-over-slf4j dependency in 0.9.1:

{code}
[rfcompton@node19 spark-0.9.1-bin-hadoop1]$ cat project/SparkBuild.scala | grep slf4
  val slf4jVersion = "1.7.2"
        "org.slf4j"              % "slf4j-api"     % slf4jVersion,
        "org.slf4j"              % "slf4j-log4j12" % slf4jVersion,
        "org.spark-project.akka" %% "akka-slf4j"   % "2.2.3-shaded-protobuf"  excludeAll(excludeNetty),
{code}

I'm the only guy in my lab using Spark so I have no idea if it's common to keep 
Pig UDFs in the same jar as Spark code. I'm going to assume it is and submit a 
pull request now.

 slf4j version conflicts with pig
 

 Key: SPARK-1952
 URL: https://issues.apache.org/jira/browse/SPARK-1952
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: pig 12.1 on Cloudera Hadoop, CDH3
Reporter: Ryan Compton
  Labels: pig, slf4j

 Upgrading from Spark-0.9.1 to Spark-1.0.0 causes all Pig scripts to fail when 
 they register a jar containing Spark. The error appears to be related to 
 org.slf4j.spi.LocationAwareLogger.log.
 {code}
 Caused by: java.lang.RuntimeException: Could not resolve error that
 occured when launching map reduce job: java.lang.NoSuchMethodError:
 org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$JobControlThreadExceptionHandler.uncaughtException(MapReduceLauncher.java:598)
 at java.lang.Thread.dispatchUncaughtException(Thread.java:1874)
 {code}
 To reproduce: compile Spark via $ SPARK_HADOOP_VERSION=0.20.2-cdh3u4 sbt/sbt 
 assembly and register the resulting jar into a pig script. E.g.
 {code}
 REGISTER 
 /usr/share/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar;
 data0 = LOAD 'data' USING PigStorage();
 ttt = LIMIT data0 10;
 DUMP ttt;
 {code}
 The Spark-1.0 jar includes some slf4j dependencies that were not present in 
 0.9.1
 {code}
 rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar | grep -i slf | grep LocationAware
   3259 Mon Mar 25 21:49:34 PDT 2013 org/apache/commons/logging/impl/SLF4JLocationAwareLog.class
    455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class
    479 Fri Dec 13 16:44:40 PST 2013 parquet/org/slf4j/spi/LocationAwareLogger.class
 {code}
 vs.
 {code}
 rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf spark-assembly-0.9.1-hadoop0.20.2-cdh3u3.jar | grep -i slf | grep LocationAware
    455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk

2014-05-29 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012651#comment-14012651
 ] 

Matei Zaharia commented on SPARK-1518:
--

Okay, got it. But this only applies to you running the job on your laptop, 
right? Because otherwise you'll get the right Hadoop via the installation on 
the cluster.

For this use case I still think it's fine to require use of hadoop-client. It's 
been like that for the past 2 releases and nobody has asked questions about it. 
It's just one more entry to add to your pom.xml.

The concrete problem is that Hadoop has been extremely fickle with 
compatibility even within a major release series (1.x or 2.x). HDFS protocol 
versions change and you can't access the cluster, YARN versions change, etc. I 
don't think there's a single release I'd call Hadoop 2, and it would be 
confusing to users to link to the Hadoop 2 artifact and not have it run on 
their cluster.

 Spark master doesn't compile against hadoop-common trunk
 

 Key: SPARK-1518
 URL: https://issues.apache.org/jira/browse/SPARK-1518
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Marcelo Vanzin
Assignee: Colin Patrick McCabe
Priority: Critical

 FSDataOutputStream::sync() has disappeared from trunk in Hadoop; 
 FileLogger.scala is calling it.
 I've changed it locally to hsync() so I can compile the code, but haven't 
 checked yet whether those are equivalent. hsync() seems to have been there 
 forever, so it hopefully works with all versions Spark cares about.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk

2014-05-29 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012655#comment-14012655
 ] 

Matei Zaharia commented on SPARK-1518:
--

BTW one other thing is that in 1.0, you can also use spark-submit in local mode 
to get your locally installed Spark. So people will be able to yum install 
spark-from-their-vendor, build their app with just spark-core, and then run it 
with the spark-submit on their PATH.

 Spark master doesn't compile against hadoop-common trunk
 

 Key: SPARK-1518
 URL: https://issues.apache.org/jira/browse/SPARK-1518
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Marcelo Vanzin
Assignee: Colin Patrick McCabe
Priority: Critical

 FSDataOutputStream::sync() has disappeared from trunk in Hadoop; 
 FileLogger.scala is calling it.
 I've changed it locally to hsync() so I can compile the code, but haven't 
 checked yet whether those are equivalent. hsync() seems to have been there 
 forever, so it hopefully works with all versions Spark cares about.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1965) Spark UI throws NPE on trying to load the app page for non-existent app

2014-05-29 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-1965:
-

 Summary: Spark UI throws NPE on trying to load the app page for 
non-existent app
 Key: SPARK-1965
 URL: https://issues.apache.org/jira/browse/SPARK-1965
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.0.0
Reporter: Kay Ousterhout
Priority: Minor


If you try to load the Spark UI for an application that doesn't exist:

sparkHost:8080/app/?appId=foobar

The UI throws an NPE. The problem is in ApplicationPage.scala -- Spark proceeds 
even if the app variable is null. We should handle this more gracefully.
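
A hedged sketch of the kind of graceful handling this calls for (the type and method 
names are illustrative, not the actual ApplicationPage code): treat a missing or 
unknown appId as "not found" instead of dereferencing a null app.

{code}
case class AppInfo(id: String, name: String)

// Illustrative only: resolve the requested appId through a lookup that may miss,
// and render a "not found" message instead of letting a null flow into the page.
def renderAppPage(appId: Option[String], lookup: String => Option[AppInfo]): String =
  appId.flatMap(lookup) match {
    case Some(app) => s"Application: ${app.name} (${app.id})"
    case None      => s"No running or completed application with ID ${appId.getOrElse("<none>")}"
  }
{code}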



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1966) Cannot cancel tasks running locally

2014-05-29 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-1966:
-

 Summary: Cannot cancel tasks running locally
 Key: SPARK-1966
 URL: https://issues.apache.org/jira/browse/SPARK-1966
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.1, 1.0.0
Reporter: Aaron Davidson






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1967) Using parallelize method to create RDD, wordcount app just hanging there without errors or warnings

2014-05-29 Thread Min Li (JIRA)
Min Li created SPARK-1967:
-

 Summary: Using parallelize method to create RDD, wordcount app 
just hanging there without errors or warnings
 Key: SPARK-1967
 URL: https://issues.apache.org/jira/browse/SPARK-1967
 Project: Spark
  Issue Type: Bug
Affects Versions: 0.9.1
 Environment: Ubuntu-12.04, single machine spark standalone, 8 core, 8G 
mem, spark 0.9.1, java-1.7
Reporter: Min Li


I was trying the parallelize method to create an RDD, using Java. It's a simple 
wordcount program, except that I first read the input into memory and then use the 
parallelize method to create the RDD, rather than the textFile method used in the 
given example.
Pseudo code:
JavaSparkContext ctx = new JavaSparkContext($SparkMasterURL, $NAME, $SparkHome, $jars);
List<String> input = ... // read lines from the input file into an ArrayList<String>
JavaRDD<String> lines = ctx.parallelize(input);
// followed by wordcount
The above is not working.
JavaRDD<String> lines = ctx.textFile(file);
// followed by wordcount
This is working.

The log is:
14/05/29 16:18:43 INFO Slf4jLogger: Slf4jLogger started
14/05/29 16:18:43 INFO Remoting: Starting remoting
14/05/29 16:18:43 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://spark@spark:55224]
14/05/29 16:18:43 INFO Remoting: Remoting now listens on addresses: 
[akka.tcp://spark@spark:55224]
14/05/29 16:18:43 INFO SparkEnv: Registering BlockManagerMaster
14/05/29 16:18:43 INFO DiskBlockManager: Created local directory at 
/tmp/spark-local-20140529161843-836a
14/05/29 16:18:43 INFO MemoryStore: MemoryStore started with capacity 1056.0 MB.
14/05/29 16:18:43 INFO ConnectionManager: Bound socket to port 42942 with id = 
ConnectionManagerId(spark,42942)
14/05/29 16:18:43 INFO BlockManagerMaster: Trying to register BlockManager
14/05/29 16:18:43 INFO BlockManagerMasterActor$BlockManagerInfo: Registering 
block manager spark:42942 with 1056.0 MB RAM
14/05/29 16:18:43 INFO BlockManagerMaster: Registered BlockManager
14/05/29 16:18:43 INFO HttpServer: Starting HTTP Server
14/05/29 16:18:43 INFO HttpBroadcast: Broadcast server started at 
http://10.227.119.185:43522
14/05/29 16:18:43 INFO SparkEnv: Registering MapOutputTracker
14/05/29 16:18:43 INFO HttpFileServer: HTTP File server directory is 
/tmp/spark-3704a621-789c-4d97-b1fc-9654236dba3e
14/05/29 16:18:43 INFO HttpServer: Starting HTTP Server
14/05/29 16:18:43 INFO SparkUI: Started Spark Web UI at http://spark:4040
14/05/29 16:18:44 INFO SparkContext: Added JAR 
/home/maxmin/tmp/spark-test-1.0-SNAPSHOT-jar-with-dependencies.jar at 
http://10.227.119.185:55286/jars/spark-test-1.0-SNAPSHOT-jar-with-dependencies.jar
 with timestamp 1401394724045
14/05/29 16:18:44 INFO AppClient$ClientActor: Connecting to master 
spark://spark:7077...
14/05/29 16:18:44 INFO SparkDeploySchedulerBackend: Connected to Spark cluster 
with app ID app-20140529161844-0001
14/05/29 16:18:44 INFO AppClient$ClientActor: Executor added: 
app-20140529161844-0001/0 on worker-20140529155406-spark-59658 (spark:59658) 
with 8 cores

The app hangs here forever, and spark:8080 and spark:4040 are not showing any 
strange info. The Spark Stages page shows the active stage is reduceByKey, with tasks 
Succeeded/Total at 0/2. I've also tried calling lines.count directly after 
parallelize, and the app gets stuck at the count stage.

I used spark-0.9.1 with the default spark-env.sh. In the slaves file I have only one 
host. I used Maven to compile a fat jar with Spark specified as provided, and I 
modified the run-example script to submit the jar.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1697) Driver error org.apache.spark.scheduler.TaskSetManager - Loss was due to java.io.FileNotFoundException

2014-05-29 Thread Arup Malakar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012913#comment-14012913
 ] 

Arup Malakar commented on SPARK-1697:
-

[~mridulm80] We saw this issue again. Are you referring to the SPARK-1592 GC patch? 
That would require us to upgrade to trunk, right? Also, could you advise me on 
how to disable the cleaners?


 Driver error org.apache.spark.scheduler.TaskSetManager - Loss was due to 
 java.io.FileNotFoundException
 --

 Key: SPARK-1697
 URL: https://issues.apache.org/jira/browse/SPARK-1697
 Project: Spark
  Issue Type: Bug
Reporter: Arup Malakar

 We are running spark-streaming 0.9.0 on top of YARN (Hadoop 
 2.2.0-cdh5.0.0-beta-2). It reads from Kafka and processes the data. So far we 
 haven't seen any issues, except that today we saw an exception in the driver log 
 and it is not consuming Kafka messages any more.
 Here is the exception we saw:
 {code}
 2014-05-01 10:00:43,962 [Result resolver thread-3] WARN  
 org.apache.spark.scheduler.TaskSetManager - Loss was due to 
 java.io.FileNotFoundException
 java.io.FileNotFoundException: http://10.50.40.85:53055/broadcast_2412
   at 
 sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
   at 
 org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
   at 
 org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
   at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
   at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
   at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
   at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
   at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 

[jira] [Updated] (SPARK-1939) Refactor takeSample method in RDD to use ScaSRS

2014-05-29 Thread Doris Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doris Xin updated SPARK-1939:
-

Summary: Refactor takeSample method in RDD to use ScaSRS  (was: Improve 
takeSample method in RDD)

 Refactor takeSample method in RDD to use ScaSRS
 ---

 Key: SPARK-1939
 URL: https://issues.apache.org/jira/browse/SPARK-1939
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Doris Xin
Assignee: Doris Xin
  Labels: newbie

 reimplement takeSample with the ScaSRS algorithm
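
For context, a hedged, self-contained sketch of the ScaSRS idea (this is not the 
MLlib/RDD implementation; the names, the failure-rate constant, and the local Seq 
standing in for an RDD are assumptions for illustration): over-sample with a small 
slack term so that a single pass almost surely yields at least k items, then trim 
down to exactly k.

{code}
import scala.util.Random

// Acceptance probability such that, with failure probability ~failureRate,
// one pass over `total` items collects at least k of them.
def oversampleFraction(k: Long, total: Long, failureRate: Double = 1e-4): Double = {
  val fraction = k.toDouble / total
  val gamma = -math.log(failureRate) / total
  math.min(1.0, fraction + gamma + math.sqrt(gamma * gamma + 2 * gamma * fraction))
}

// Sketch of takeSample without replacement on a local Seq standing in for an RDD.
def takeSampleSketch[T](data: Seq[T], k: Int, seed: Long = 42L): Seq[T] = {
  val rng = new Random(seed)
  val p = oversampleFraction(k, data.size)
  val candidates = data.filter(_ => rng.nextDouble() < p)  // the single pass over the data
  rng.shuffle(candidates).take(k)                          // trim the slight over-sample to k
}
{code}

In the distributed setting, only the filter pass scans the full data; the 
shuffle-and-trim happens on the small candidate set at the driver.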



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1958) Calling .collect() on a SchemaRDD should call executeCollect() on the underlying query plan.

2014-05-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-1958:


Assignee: Cheng Lian

 Calling .collect() on a SchemaRDD should call executeCollect() on the 
 underlying query plan.
 

 Key: SPARK-1958
 URL: https://issues.apache.org/jira/browse/SPARK-1958
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Michael Armbrust
Assignee: Cheng Lian
 Fix For: 1.1.0


 In some cases (like LIMIT) executeCollect() makes optimizations that 
 execute().collect() will not.
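
A hedged sketch of the change being requested (the trait and method shapes are 
illustrative, not the actual SchemaRDD/SparkPlan signatures): route collect() through 
the physical plan's executeCollect() so plan-specific shortcuts such as LIMIT take 
effect.

{code}
// Illustrative only.
trait PhysicalPlanSketch {
  def execute(): Iterator[Any]        // generic row-by-row execution
  def executeCollect(): Array[Any]    // may short-circuit, e.g. stop after LIMIT n rows
}

class SchemaRDDSketch(plan: PhysicalPlanSketch) {
  // Before: collect() drained execute(); after: defer to the optimized path.
  def collect(): Array[Any] = plan.executeCollect()
}
{code}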



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1852) SparkSQL Queries with Sorts run before the user asks them to

2014-05-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-1852:


Assignee: Cheng Lian

 SparkSQL Queries with Sorts run before the user asks them to
 

 Key: SPARK-1852
 URL: https://issues.apache.org/jira/browse/SPARK-1852
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Cheng Lian

 This is related to [SPARK-1021] but will not be fixed by that since we do our 
 own partitioning.
 Part of the problem here is that we calculate the range partitioning too 
 eagerly. This could also be alleviated by avoiding the call to toRdd 
 for non-DDL queries.
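
A standalone illustration of why a sort can trigger work before any action is called 
(this is plain Spark core, not the SparkSQL code path this issue is about): 
constructing a RangePartitioner samples the input RDD to choose range bounds, which 
runs a job on its own.

{code}
import org.apache.spark.{RangePartitioner, SparkContext}

object RangePartitionerEagerness {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "range-partitioner-demo")  // local master just for illustration
    val pairs = sc.parallelize(1 to 1000).map(i => (i, i))
    // No action has been called yet, but this constructor samples `pairs`
    // to compute range bounds, so a Spark job runs right here.
    val partitioner = new RangePartitioner(8, pairs)
    println(s"computed ${partitioner.numPartitions} partitions")
    sc.stop()
  }
}
{code}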



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1947) Child of SumDistinct or Average should be widened to prevent overflows the same as Sum.

2014-05-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-1947:


Assignee: Takuya Ueshin

 Child of SumDistinct or Average should be widened to prevent overflows the 
 same as Sum.
 ---

 Key: SPARK-1947
 URL: https://issues.apache.org/jira/browse/SPARK-1947
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Takuya Ueshin
Assignee: Takuya Ueshin

 The child of {{SumDistinct}} or {{Average}} should be widened to prevent 
 overflows, the same as is already done for {{Sum}}.
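
As a hedged illustration of the overflow being guarded against (plain Scala 
arithmetic, not Catalyst expressions): summing 32-bit values without widening them 
first wraps around, which is exactly what widening the child expression avoids.

{code}
// Two Int.MaxValue values: the Int sum wraps to a negative number,
// while widening the inputs to Long first gives the correct result.
val xs = Seq(Int.MaxValue, Int.MaxValue)
val overflowed = xs.sum                // -2
val widened    = xs.map(_.toLong).sum  // 4294967294
println(s"overflowed=$overflowed widened=$widened")
{code}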



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1968) SQL commands for caching tables

2014-05-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-1968:


Component/s: SQL

 SQL commands for caching tables
 ---

 Key: SPARK-1968
 URL: https://issues.apache.org/jira/browse/SPARK-1968
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Michael Armbrust
Assignee: Cheng Lian

 Users should be able to cache tables by calling CACHE TABLE in SQL or HQL.
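
A hedged sketch of how the proposed commands might be used from a HiveContext once 
this feature lands ("records" is a hypothetical table name, and UNCACHE TABLE is 
assumed as the natural counterpart rather than confirmed syntax):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext("local", "cache-table-demo")   // assumed local setup for illustration
val hiveContext = new HiveContext(sc)
hiveContext.hql("CACHE TABLE records")                   // proposed syntax: pin the table in memory
println(hiveContext.hql("SELECT COUNT(*) FROM records").collect().mkString)
hiveContext.hql("UNCACHE TABLE records")                 // assumed counterpart for releasing it
sc.stop()
{code}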



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1911) Warn users that jars should be built with Java 6 for PySpark to work on YARN

2014-05-29 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14013142#comment-14013142
 ] 

Tathagata Das commented on SPARK-1911:
--

As far as I can tell, it is because Java 7 uses Zip64 encoding when making JARs 
with more than 2^16 files, and Python (at least 2.x) is not able to read Zip64. So it 
fails whenever the Spark assembly JAR has more than 65k files, which 
in turn depends on whether it has been generated with YARN and/or Hive enabled.
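
A small hedged check for the threshold described above (the assembly path is only an 
example): count the entries in the JAR; above 65535 entries, a Java 7-built JAR falls 
back to Zip64, which this generation of Python cannot read.

{code}
import java.util.jar.JarFile
import scala.collection.JavaConverters._

// Count entries in an assembly JAR; the classic ZIP central directory tops out at 65535.
val jar = new JarFile("assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop2.2.0.jar")
val entries = jar.entries().asScala.size
println(s"$entries entries" + (if (entries > 65535) " -- would need Zip64" else ""))
jar.close()
{code}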

 Warn users that jars should be built with Java 6 for PySpark to work on YARN
 

 Key: SPARK-1911
 URL: https://issues.apache.org/jira/browse/SPARK-1911
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation
Reporter: Andrew Or
 Fix For: 1.0.0


 Python sometimes fails to read JARs created by Java 7, so the Spark assembly JAR 
 should be compiled with Java 6 for PySpark to work on YARN.
 Currently we warn users only in make-distribution.sh, but most users build 
 the jars directly. We should emphasize it in the docs, especially for PySpark 
 and YARN, because this issue is not trivial to debug.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1911) Warn users that jars should be built with Java 6 for PySpark to work on YARN

2014-05-29 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14013142#comment-14013142
 ] 

Tathagata Das edited comment on SPARK-1911 at 5/30/14 12:31 AM:


As far as I can tell, it is because Java 7 uses Zip64 encoding when making JARs 
with more than 2^16 files, and Python (at least 2.x) is not able to read Zip64. So it 
fails whenever the Spark assembly JAR has more than 65k files, which 
in turn depends on whether it has been generated with YARN and/or Hive enabled.

Java 6 uses the traditional Zip format to create JARs, even if there are more than 
65k files, so Python always seems to work with JARs built by Java 6.

Caveat: I can't claim 100% certainty on this interpretation because there is so 
little documentation on this on the net.


was (Author: tdas):
As far as I can tell, it is because Java 7 uses Zip64 encoding when making JARs 
with more than 2^16 files, and Python (at least 2.x) is not able to read Zip64. So it 
fails whenever the Spark assembly JAR has more than 65k files, which 
in turn depends on whether it has been generated with YARN and/or Hive enabled.

 Warn users that jars should be built with Java 6 for PySpark to work on YARN
 

 Key: SPARK-1911
 URL: https://issues.apache.org/jira/browse/SPARK-1911
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation
Reporter: Andrew Or
 Fix For: 1.0.0


 Python sometimes fails to read JARs created by Java 7, so the Spark assembly JAR 
 should be compiled with Java 6 for PySpark to work on YARN.
 Currently we warn users only in make-distribution.sh, but most users build 
 the jars directly. We should emphasize it in the docs, especially for PySpark 
 and YARN, because this issue is not trivial to debug.



--
This message was sent by Atlassian JIRA
(v6.2#6252)