[jira] [Created] (SPARK-2013) Add Python pickleFile to programming guide

2014-06-04 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-2013:


 Summary: Add Python pickleFile to programming guide
 Key: SPARK-2013
 URL: https://issues.apache.org/jira/browse/SPARK-2013
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, PySpark
Reporter: Matei Zaharia
Priority: Trivial
 Fix For: 1.1.0


Should be added in the Python version of 
http://spark.apache.org/docs/latest/programming-guide.html#external-datasets.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2014) Make PySpark store RDDs in MEMORY_ONLY_SER with compression by default

2014-06-04 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-2014:


 Summary: Make PySpark store RDDs in MEMORY_ONLY_SER with 
compression by default
 Key: SPARK-2014
 URL: https://issues.apache.org/jira/browse/SPARK-2014
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Matei Zaharia


Since the data is serialized on the Python side, there's not much point in 
keeping it as byte arrays in Java, or even in skipping compression. We should 
make cache() in PySpark use MEMORY_ONLY_SER and turn on spark.rdd.compress for 
it.
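
For reference, a minimal Scala sketch of the storage configuration this proposes as the PySpark default; the app name and RDD contents are illustrative:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object SerCacheSketch {
  def main(args: Array[String]) {
    // spark.rdd.compress compresses serialized RDD partitions in memory.
    val conf = new SparkConf()
      .setAppName("SerCacheSketch")
      .set("spark.rdd.compress", "true")
    val sc = new SparkContext(conf)
    // MEMORY_ONLY_SER keeps partitions as serialized byte arrays rather than
    // deserialized Java objects, which is what PySpark data already is.
    val rdd = sc.parallelize(1 to 1000).persist(StorageLevel.MEMORY_ONLY_SER)
    println(rdd.count())
    sc.stop()
  }
}
{code}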



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1977) mutable.BitSet in ALS not serializable with KryoSerializer

2014-06-04 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14017477#comment-14017477
 ] 

Xiangrui Meng commented on SPARK-1977:
--

This is more likely a version conflict in your dependencies. From the Spark 
WebUI, you can find the system classpath in the environment tab. Please verify 
that you don't have two different versions of spark, kryo, or any other related 
library. Classes may hide inside an assembly jar.

 mutable.BitSet in ALS not serializable with KryoSerializer
 --

 Key: SPARK-1977
 URL: https://issues.apache.org/jira/browse/SPARK-1977
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Neville Li
Priority: Minor

 OutLinkBlock in ALS.scala has an Array[mutable.BitSet] member.
 KryoSerializer uses AllScalaRegistrar from Twitter chill but it doesn't 
 register mutable.BitSet.
 Right now we have to register mutable.BitSet manually. A proper fix would be 
 using immutable.BitSet in ALS or register mutable.BitSet in upstream chill.
 {code}
 Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
 Task 1724.0:9 failed 4 times, most recent failure: Exception failure in TID 
 68548 on host lon4-hadoopslave-b232.lon4.spotify.net: 
 com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: 
 scala.collection.mutable.HashSet
 Serialization trace:
 shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock)
 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
 
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
 
 org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:115)
 
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:155)
 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154)
 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
 org.apache.spark.scheduler.Task.run(Task.scala:51)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 java.lang.Thread.run(Thread.java:662)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
   at 
 

[jira] [Created] (SPARK-2016) rdd in-memory storage UI becomes unresponsive when the number of RDD partitions is large

2014-06-04 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-2016:
--

 Summary: rdd in-memory storage UI becomes unresponsive when the 
number of RDD partitions is large
 Key: SPARK-2016
 URL: https://issues.apache.org/jira/browse/SPARK-2016
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


Try running
{code}
sc.parallelize(1 to 1000000, 1000000).cache().count()
{code}

And open the storage UI for this RDD. It takes forever to load the page.

When the number of partitions is very large, I think there are a few 
alternatives:

0. Only show the top 1000.
1. Pagination
2. Instead of grouping by RDD blocks, group by executors




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2017) web ui stage page becomes unresponsive when the number of tasks is large

2014-06-04 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-2017:
--

 Summary: web ui stage page becomes unresponsive when the number of 
tasks is large
 Key: SPARK-2017
 URL: https://issues.apache.org/jira/browse/SPARK-2017
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


{code}
sc.parallelize(1 to 1000000, 1000000).count()
{code}

The above code creates one million tasks to be executed. The stage detail web 
ui page takes forever to load (if it ever completes).

There are again a few different alternatives:

0. Limit the number of tasks we show.
1. Pagination
2. By default only show the aggregate metrics and failed tasks, and hide the 
successful ones.






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2016) rdd in-memory storage UI becomes unresponsive when the number of RDD partitions is large

2014-06-04 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2016:
---

Labels: starter  (was: )

 rdd in-memory storage UI becomes unresponsive when the number of RDD 
 partitions is large
 

 Key: SPARK-2016
 URL: https://issues.apache.org/jira/browse/SPARK-2016
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin
  Labels: starter

 Try running
 {code}
 sc.parallelize(1 to 1000000, 1000000).cache().count()
 {code}
 And open the storage UI for this RDD. It takes forever to load the page.
 When the number of partitions is very large, I think there are a few 
 alternatives:
 0. Only show the top 1000.
 1. Pagination
 2. Instead of grouping by RDD blocks, group by executors



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1977) mutable.BitSet in ALS not serializable with KryoSerializer

2014-06-04 Thread Neville Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14017508#comment-14017508
 ] 

Neville Li edited comment on SPARK-1977 at 6/4/14 8:45 AM:
---

We submit 1 spark-assembly and 1 job assembly jar via spark-submit and there 
are no other obvious scala/spark/kryo jars in the global classpath. I can 
reproduce the same exception locally with the following snippet, when 
kryo.register() is commented out.

I just added mutable BitSet to Twitter chill: 
https://github.com/twitter/chill/pull/185

{code}
import com.twitter.chill._
import org.apache.spark.serializer.{KryoSerializer, KryoRegistrator}
import org.apache.spark.SparkConf
import scala.collection.mutable

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    // kryo.register(classOf[mutable.BitSet])
  }
}

case class OutLinkBlock(elementIds: Array[Int], shouldSend: Array[mutable.BitSet])

object KryoTest {
  def main(args: Array[String]) {
    println("hello")
    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", classOf[MyRegistrator].getName)
    val serializer = new KryoSerializer(conf).newInstance()

    // Throws the reported ArrayStoreException when mutable.BitSet is not registered.
    val bytes = serializer.serialize(OutLinkBlock(Array(1, 2, 3), Array(mutable.BitSet(2, 4, 6))))
    serializer.deserialize(bytes).asInstanceOf[OutLinkBlock]
  }
}
{code}


was (Author: sinisa_lyh):
We submit 1 spark-assembly and 1 job assembly jar via spark-submit and there 
are no other obvious scala/spark/kryo jars in the global classpath. I can 
reproduce the same exception locally with the following snippet, when 
kryo.register() is commented out.

I just added mutable BitSet to Twitter chill: 
https://github.com/twitter/chill/pull/185

{code}
import com.twitter.chill._
import org.apache.spark.serializer.{KryoSerializer, KryoRegistrator}
import org.apache.spark.SparkConf
import scala.collection.mutable

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    // kryo.register(classOf[mutable.BitSet])
  }
}

case class OutLinkBlock(elementIds: Array[Int], shouldSend: Array[mutable.BitSet])

object KryoTest {
  def main(args: Array[String]) {
    println("hello")
    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", classOf[MyRegistrator].getName)
    val serializer = new KryoSerializer(conf).newInstance()

    // Throws the reported ArrayStoreException when mutable.BitSet is not registered.
    val bytes = serializer.serialize(OutLinkBlock(Array(1, 2, 3), Array(mutable.BitSet(2, 4, 6))))
    serializer.deserialize(bytes).asInstanceOf[OutLinkBlock]
  }
}
{code}

 mutable.BitSet in ALS not serializable with KryoSerializer
 --

 Key: SPARK-1977
 URL: https://issues.apache.org/jira/browse/SPARK-1977
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Neville Li
Priority: Minor

 OutLinkBlock in ALS.scala has an Array[mutable.BitSet] member.
 KryoSerializer uses AllScalaRegistrar from Twitter chill but it doesn't 
 register mutable.BitSet.
 Right now we have to register mutable.BitSet manually. A proper fix would be 
 using immutable.BitSet in ALS or register mutable.BitSet in upstream chill.
 {code}
 Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
 Task 1724.0:9 failed 4 times, most recent failure: Exception failure in TID 
 68548 on host lon4-hadoopslave-b232.lon4.spotify.net: 
 com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: 
 scala.collection.mutable.HashSet
 Serialization trace:
 shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock)
 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
 
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
 
 org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:115)
 
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:155)
 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154)
 
 

[jira] [Commented] (SPARK-1977) mutable.BitSet in ALS not serializable with KryoSerializer

2014-06-04 Thread Neville Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14017508#comment-14017508
 ] 

Neville Li commented on SPARK-1977:
---

We submit 1 spark-assembly and 1 job assembly jar via spark-submit and there 
are no other obvious scala/spark/kryo jars in the global classpath. I can 
reproduce the same exception locally with the following snippet, when 
kryo.register() is commented out.

I just added mutable BitSet to Twitter chill: 
https://github.com/twitter/chill/pull/185

{code}
import com.twitter.chill._
import org.apache.spark.serializer.{KryoSerializer, KryoRegistrator}
import org.apache.spark.SparkConf
import scala.collection.mutable

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    // kryo.register(classOf[mutable.BitSet])
  }
}

case class OutLinkBlock(elementIds: Array[Int], shouldSend: Array[mutable.BitSet])

object KryoTest {
  def main(args: Array[String]) {
    println("hello")
    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", classOf[MyRegistrator].getName)
    val serializer = new KryoSerializer(conf).newInstance()

    // Throws the reported ArrayStoreException when mutable.BitSet is not registered.
    val bytes = serializer.serialize(OutLinkBlock(Array(1, 2, 3), Array(mutable.BitSet(2, 4, 6))))
    serializer.deserialize(bytes).asInstanceOf[OutLinkBlock]
  }
}
{code}

 mutable.BitSet in ALS not serializable with KryoSerializer
 --

 Key: SPARK-1977
 URL: https://issues.apache.org/jira/browse/SPARK-1977
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Neville Li
Priority: Minor

 OutLinkBlock in ALS.scala has an Array[mutable.BitSet] member.
 KryoSerializer uses AllScalaRegistrar from Twitter chill but it doesn't 
 register mutable.BitSet.
 Right now we have to register mutable.BitSet manually. A proper fix would be 
 using immutable.BitSet in ALS or register mutable.BitSet in upstream chill.
 {code}
 Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
 Task 1724.0:9 failed 4 times, most recent failure: Exception failure in TID 
 68548 on host lon4-hadoopslave-b232.lon4.spotify.net: 
 com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: 
 scala.collection.mutable.HashSet
 Serialization trace:
 shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock)
 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
 
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
 
 org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:115)
 
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:155)
 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154)
 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
 org.apache.spark.scheduler.Task.run(Task.scala:51)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
 
 

[jira] [Issue Comment Deleted] (SPARK-1999) UI : StorageLevel in storage tab and RDD Storage Info never changes

2014-06-04 Thread Chen Chao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Chao updated SPARK-1999:
-

Comment: was deleted

(was: https://github.com/apache/spark/pull/950
Sorry, I will repost soon; the above link will be invalid.)

 UI : StorageLevel in storage tab and RDD Storage Info never changes 
 

 Key: SPARK-1999
 URL: https://issues.apache.org/jira/browse/SPARK-1999
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.0.0
Reporter: Chen Chao

 StorageLevel in 'storage tab' and 'RDD Storage Info' never changes even if 
 you call rdd.unpersist() and then you give the rdd another different storage 
 level.
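
A minimal shell-style sketch of the reported behavior (the storage levels are chosen for illustration):

{code}
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 100).persist(StorageLevel.MEMORY_ONLY)
rdd.count()
rdd.unpersist()
rdd.persist(StorageLevel.DISK_ONLY)
rdd.count()  // the storage tab reportedly still shows MEMORY_ONLY
{code}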



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1999) UI : StorageLevel in storage tab and RDD Storage Info never changes

2014-06-04 Thread Chen Chao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14017519#comment-14017519
 ] 

Chen Chao commented on SPARK-1999:
--

PR: https://github.com/apache/spark/pull/968

 UI : StorageLevel in storage tab and RDD Storage Info never changes 
 

 Key: SPARK-1999
 URL: https://issues.apache.org/jira/browse/SPARK-1999
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.0.0
Reporter: Chen Chao

 StorageLevel in 'storage tab' and 'RDD Storage Info' never changes even if 
 you call rdd.unpersist() and then you give the rdd another different storage 
 level.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Issue Comment Deleted] (SPARK-1999) UI : StorageLevel in storage tab and RDD Storage Info never changes

2014-06-04 Thread Chen Chao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Chao updated SPARK-1999:
-

Comment: was deleted

(was: I have fixed it and it tests fine. Please assign it to me, I will post a PR 
soon!)

 UI : StorageLevel in storage tab and RDD Storage Info never changes 
 

 Key: SPARK-1999
 URL: https://issues.apache.org/jira/browse/SPARK-1999
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.0.0
Reporter: Chen Chao

 StorageLevel in 'storage tab' and 'RDD Storage Info' never changes even if 
 you call rdd.unpersist() and then you give the rdd another different storage 
 level.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2018) Big-Endian (IBM Power7) Spark Serialization issue

2014-06-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14017528#comment-14017528
 ] 

Sean Owen commented on SPARK-2018:
--

The meaning of the error is that Java thinks two serializable classes are not 
mutually compatible. This is because two different serialVersionUIDs get 
computed for two copies of what may be the same class. If I understand you 
correctly, you are communicating between different JVM versions, or reading 
one's output from the other? I don't think it's guaranteed that the 
auto-generated serialVersionUID will be the same. If so, it's nothing to do 
with big-endianness per se. Does it happen entirely within the same machine / 
JVM? 
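
When that is the cause, the usual remedy is to pin the UID explicitly instead of relying on the auto-generated one; a minimal Scala sketch (the class is illustrative):

{code}
// An explicit serialVersionUID keeps serialized instances compatible across
// JVMs that would otherwise compute different auto-generated IDs.
@SerialVersionUID(1L)
class MyMessage(val value: Int) extends Serializable
{code}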

 Big-Endian (IBM Power7) Spark Serialization issue
 --

 Key: SPARK-2018
 URL: https://issues.apache.org/jira/browse/SPARK-2018
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
 Environment: hardware : IBM Power7
 OS:Linux version 2.6.32-358.el6.ppc64 
 (mockbu...@ppc-017.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red 
 Hat 4.4.7-3) (GCC) ) #1 SMP Tue Jan 29 11:43:27 EST 2013
 JDK: Java(TM) SE Runtime Environment (build pxp6470sr5-20130619_01(SR5))
 IBM J9 VM (build 2.6, JRE 1.7.0 Linux ppc64-64 Compressed References 
 20130617_152572 (JIT enabled, AOT enabled)
 Hadoop:Hadoop-0.2.3-CDH5.0
 Spark:Spark-1.0.0 or Spark-0.9.1
 spark-env.sh:
 export JAVA_HOME=/opt/ibm/java-ppc64-70/
 export SPARK_MASTER_IP=9.114.34.69
 export SPARK_WORKER_MEMORY=1m
 export SPARK_CLASSPATH=/home/test1/spark-1.0.0-bin-hadoop2/lib
 export  STANDALONE_SPARK_MASTER_HOST=9.114.34.69
 #export SPARK_JAVA_OPTS=' -Xdebug 
 -Xrunjdwp:transport=dt_socket,address=9,server=y,suspend=n '
Reporter: Yanjie Gao

 We have an application running on Spark on a Power7 system,
 but we hit an important serialization issue.
 The example HdfsWordCount reproduces the problem:
 ./bin/run-example org.apache.spark.examples.streaming.HdfsWordCount localdir
 We use Power7 (a big-endian arch) and Red Hat 6.4.
 Big-endianness is the main cause, since the example ran successfully in 
 another Power-based little-endian setup.
 Here is the exception stack and log:
 Spark Executor Command: /opt/ibm/java-ppc64-70//bin/java -cp 
 /home/test1/spark-1.0.0-bin-hadoop2/lib::/home/test1/src/spark-1.0.0-bin-hadoop2/conf:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/test1/src/hadoop-2.3.0-cdh5.0.0/etc/hadoop/:/home/test1/src/hadoop-2.3.0-cdh5.0.0/etc/hadoop/
  -XX:MaxPermSize=128m  -Xdebug 
 -Xrunjdwp:transport=dt_socket,address=9,server=y,suspend=n -Xms512M 
 -Xmx512M org.apache.spark.executor.CoarseGrainedExecutorBackend 
 akka.tcp://spark@9.186.105.141:60253/user/CoarseGrainedScheduler 2 
 p7hvs7br16 4 akka.tcp://sparkWorker@p7hvs7br16:59240/user/Worker 
 app-20140604023054-
 
 14/06/04 02:31:20 WARN util.NativeCodeLoader: Unable to load native-hadoop 
 library for your platform... using builtin-java classes where applicable
 14/06/04 02:31:21 INFO spark.SecurityManager: Changing view acls to: 
 test1,yifeng
 14/06/04 02:31:21 INFO spark.SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(test1, yifeng)
 14/06/04 02:31:22 INFO slf4j.Slf4jLogger: Slf4jLogger started
 14/06/04 02:31:22 INFO Remoting: Starting remoting
 14/06/04 02:31:22 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://sparkExecutor@p7hvs7br16:39658]
 14/06/04 02:31:22 INFO Remoting: Remoting now listens on addresses: 
 [akka.tcp://sparkExecutor@p7hvs7br16:39658]
 14/06/04 02:31:22 INFO executor.CoarseGrainedExecutorBackend: Connecting to 
 driver: akka.tcp://spark@9.186.105.141:60253/user/CoarseGrainedScheduler
 14/06/04 02:31:22 INFO worker.WorkerWatcher: Connecting to worker 
 akka.tcp://sparkWorker@p7hvs7br16:59240/user/Worker
 14/06/04 02:31:23 INFO worker.WorkerWatcher: Successfully connected to 
 akka.tcp://sparkWorker@p7hvs7br16:59240/user/Worker
 14/06/04 02:31:24 INFO executor.CoarseGrainedExecutorBackend: Successfully 
 registered with driver
 14/06/04 02:31:24 INFO spark.SecurityManager: Changing view acls to: 
 test1,yifeng
 14/06/04 02:31:24 INFO spark.SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(test1, yifeng)
 14/06/04 02:31:24 INFO slf4j.Slf4jLogger: Slf4jLogger started
 14/06/04 02:31:24 INFO Remoting: Starting remoting
 14/06/04 02:31:24 INFO Remoting: Remoting started; listening on addresses 
 

[jira] [Created] (SPARK-2019) Spark workers die/disappear when job fails for nearly any reason

2014-06-04 Thread sam (JIRA)
sam created SPARK-2019:
--

 Summary: Spark workers die/disappear when job fails for nearly any 
reason
 Key: SPARK-2019
 URL: https://issues.apache.org/jira/browse/SPARK-2019
 Project: Spark
  Issue Type: Bug
Reporter: sam


We either have to reboot all the nodes, or run 'sudo service spark-worker 
restart' across our cluster.  I don't think this should happen - the job 
failures are often not even that bad.  There is a 5 upvoted SO question here: 
http://stackoverflow.com/questions/22031006/spark-0-9-0-worker-keeps-dying-in-standalone-mode-when-job-fails

We shouldn't be giving restart privileges to our devs, and therefore our sysadm 
has to frequently restart the workers.  When the sysadm is not around, there is 
nothing our devs can do.

Many thanks



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1520) Assembly Jar with more than 65536 files won't work when compiled on JDK7 and run on JDK6

2014-06-04 Thread Qiuzhuang Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14017753#comment-14017753
 ] 

Qiuzhuang Lian commented on SPARK-1520:
---

I can run the assembly jar via bin\spark-shell.cmd, but couldn't run the example 
LocalKMeans in IntelliJ IDEA, which throws Exception in thread "main" 
java.lang.NoClassDefFoundError: breeze/linalg/Vector. Can somebody suggest a 
fix, since I prefer to try coding in IntelliJ IDEA? Thanks.

 Assembly Jar with more than 65536 files won't work when compiled on JDK7 and 
 run on JDK6
 -

 Key: SPARK-1520
 URL: https://issues.apache.org/jira/browse/SPARK-1520
 Project: Spark
  Issue Type: Bug
  Components: MLlib, Spark Core
Reporter: Patrick Wendell
Assignee: Xiangrui Meng
Priority: Blocker
 Fix For: 1.0.0


 This is a real doozie - when compiling a Spark assembly with JDK7, the 
 produced jar does not work well with JRE6. I confirmed the byte code being 
 produced is JDK 6 compatible (major version 50). What happens is that, 
 silently, the JRE will not load any class files from the assembled jar.
 {code}
 $ sbt/sbt assembly/assembly
 $ /usr/lib/jvm/java-1.7.0-openjdk-amd64/bin/java -cp 
 /home/patrick/Documents/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar
  org.apache.spark.ui.UIWorkloadGenerator
 usage: ./bin/spark-class org.apache.spark.ui.UIWorkloadGenerator [master] 
 [FIFO|FAIR]
 $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp 
 /home/patrick/Documents/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar
  org.apache.spark.ui.UIWorkloadGenerator
 Exception in thread "main" java.lang.NoClassDefFoundError: 
 org/apache/spark/ui/UIWorkloadGenerator
 Caused by: java.lang.ClassNotFoundException: 
 org.apache.spark.ui.UIWorkloadGenerator
   at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
 Could not find the main class: org.apache.spark.ui.UIWorkloadGenerator. 
 Program will exit.
 {code}
 I also noticed that if the jar is unzipped, and the classpath set to the 
 currently directory, it just works. Finally, if the assembly jar is 
 compiled with JDK6, it also works. The error is seen with any class, not just 
 the UIWorkloadGenerator. Also, this error doesn't exist in branch 0.9, only 
 in master.
 h1. Isolation and Cause
 The package-time behavior of Java 6 and 7 differ with respect to the format 
 used for jar files:
 ||Number of entries||JDK 6||JDK 7||
 |<= 65536|zip|zip|
 |> 65536|zip*|zip64|
 zip* is a workaround for the original zip format, [described in 
 JDK-6828461|https://bugs.openjdk.java.net/browse/JDK-4828461], that allows 
 some versions of Java 6 to support larger assembly jars.
 The Scala libraries we depend on have added a large number of classes which 
 bumped us over the limit. This causes the Java 7 packaging to not work with 
 Java 6. We can probably go back under the limit by clearing out some 
 accidental inclusion of FastUtil, but eventually we'll go over again.
 The real answer is to force people to build with JDK 6 if they want to run 
 Spark on JRE 6.
 -I've found that if I just unpack and re-pack the jar (using `jar`) it always 
 works:-
 {code}
 $ cd assembly/target/scala-2.10/
 $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp 
 ./spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar 
 org.apache.spark.ui.UIWorkloadGenerator # fails
 $ jar xvf spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar
 $ jar cvf spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar *
 $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp 
 ./spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar 
 org.apache.spark.ui.UIWorkloadGenerator # succeeds
 {code}
 -I also noticed something of note. The Breeze package contains single 
 directories that have huge numbers of files in them (e.g. 2000+ class files 
 in one directory). It's possible we are hitting some weird bugs/corner cases 
 with compatibility of the internal storage format of the jar itself.-
 -I narrowed this down specifically to the inclusion of the breeze library. 
 Just adding breeze to an older (unaffected) build triggered the issue.-
 -I ran a git bisection and this appeared after the MLLib sparse vector patch 
 was merged:-
 https://github.com/apache/spark/commit/80c29689ae3b589254a571da3ddb5f9c866ae534
 SPARK-1212



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2019) Spark workers die/disappear when job fails for nearly any reason

2014-06-04 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14017776#comment-14017776
 ] 

Mark Hamstra commented on SPARK-2019:
-

Please don't leave the Affects Version/s selector on None.  As with the SO 
question, is this an issue that you are seeing with Spark 0.9.0?  If so, then 
the version of Spark that you are using is significantly out of date even on 
the 0.9 branch.  Several bug fixes are present in the 0.9.1 release of Spark, 
which has been available for almost two months.  There are a few more in the 
current 0.9.2-SNAPSHOT code, and many more in the recent 1.0.0 release.

 Spark workers die/disappear when job fails for nearly any reason
 

 Key: SPARK-2019
 URL: https://issues.apache.org/jira/browse/SPARK-2019
 Project: Spark
  Issue Type: Bug
Reporter: sam

 We either have to reboot all the nodes, or run 'sudo service spark-worker 
 restart' across our cluster.  I don't think this should happen - the job 
 failures are often not even that bad.  There is a 5 upvoted SO question here: 
 http://stackoverflow.com/questions/22031006/spark-0-9-0-worker-keeps-dying-in-standalone-mode-when-job-fails
 We shouldn't be giving restart privileges to our devs, and therefore our 
 sysadm has to frequently restart the workers.  When the sysadm is not around, 
 there is nothing our devs can do.
 Many thanks



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1817) RDD zip erroneous when partitions do not divide RDD count

2014-06-04 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14017812#comment-14017812
 ] 

Kan Zhang commented on SPARK-1817:
--

There are two issues related to this bug. One is that we partition numeric 
ranges (e.g., Long and Double ranges) differently from other types of sequences 
(i.e., at different indexes). This causes elements to be dropped when zipping 
with numeric ranges, since we zip by partition and partitions of numeric ranges 
may have different sizes than those of other sequences (even if the total 
length and the number of partitions are the same). This is fixed in SPARK-1837. 
One caveat: partitioning Double ranges currently still doesn't work properly, 
due to a Scala bug that breaks {{take}} and {{drop}} on Double ranges 
(https://issues.scala-lang.org/browse/SI-8518).

The other issue is that instead of dropping elements silently, we should throw 
an error during zipping when we find that the partition sizes of the two 
sequences differ. This is fixed by https://github.com/apache/spark/pull/944
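
For context, a runnable local-mode form of the repro from the description below (the master and app name are illustrative):

{code}
import org.apache.spark.{SparkConf, SparkContext}

object ZipRepro {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("ZipRepro"))
    // 1L to 2L is a numeric (Long) range, 11 to 12 an Int sequence; before the
    // fix they were split at different indexes, so zipping by partition
    // silently dropped elements instead of yielding (1,11) and (2,12).
    val zipped = sc.parallelize(1L to 2L, 4).zip(sc.parallelize(11 to 12, 4))
    println(zipped.collect().mkString(", "))
    sc.stop()
  }
}
{code}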

 RDD zip erroneous when partitions do not divide RDD count
 -

 Key: SPARK-1817
 URL: https://issues.apache.org/jira/browse/SPARK-1817
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0, 1.0.0
Reporter: Michael Malak
Assignee: Kan Zhang
 Fix For: 1.1.0


 Example:
 scala> sc.parallelize(1L to 2L,4).zip(sc.parallelize(11 to 12,4)).collect
 res1: Array[(Long, Int)] = Array((2,11))
 But more generally, it's whenever the number of partitions does not evenly 
 divide the total number of elements in the RDD.
 See https://groups.google.com/forum/#!msg/spark-users/demrmjHFnoc/Ek3ijiXHr2MJ



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Issue Comment Deleted] (SPARK-1817) RDD zip erroneous when partitions do not divide RDD count

2014-06-04 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-1817:
-

Comment: was deleted

(was: PR: https://github.com/apache/spark/pull/760)

 RDD zip erroneous when partitions do not divide RDD count
 -

 Key: SPARK-1817
 URL: https://issues.apache.org/jira/browse/SPARK-1817
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0, 1.0.0
Reporter: Michael Malak
Assignee: Kan Zhang
 Fix For: 1.1.0


 Example:
 scala> sc.parallelize(1L to 2L,4).zip(sc.parallelize(11 to 12,4)).collect
 res1: Array[(Long, Int)] = Array((2,11))
 But more generally, it's whenever the number of partitions does not evenly 
 divide the total number of elements in the RDD.
 See https://groups.google.com/forum/#!msg/spark-users/demrmjHFnoc/Ek3ijiXHr2MJ



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2013) Add Python pickleFile to programming guide

2014-06-04 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-2013:
-

Assignee: Kan Zhang

 Add Python pickleFile to programming guide
 --

 Key: SPARK-2013
 URL: https://issues.apache.org/jira/browse/SPARK-2013
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, PySpark
Reporter: Matei Zaharia
Assignee: Kan Zhang
Priority: Trivial
 Fix For: 1.1.0


 Should be added in the Python version of 
 http://spark.apache.org/docs/latest/programming-guide.html#external-datasets.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1973) Add randomSplit to JavaRDD (with tests, and tidy Java tests)

2014-06-04 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-1973.
--

Resolution: Implemented

PR: https://github.com/apache/spark/pull/919

 Add randomSplit to JavaRDD (with tests, and tidy Java tests)
 

 Key: SPARK-1973
 URL: https://issues.apache.org/jira/browse/SPARK-1973
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
 Fix For: 1.1.0


 I'd like to use randomSplit through the Java API, and would like to add a 
 convenience wrapper for this method to JavaRDD. This is fairly trivial. (In 
 fact, is the intent that JavaRDD not wrap every RDD method? and that 
 sometimes users should just use JavaRDD.wrapRDD()?)
 Along the way, I added tests for it, and also touched up the Java API test 
 style and behavior. This is maybe the more useful part of this small change.
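
For illustration, a sketch of what such a wrapper could look like, assuming it forwards to the underlying Scala RDD and re-wraps the results (a sketch, not the merged change):

{code}
// Inside JavaRDD[T]: delegate to the Scala RDD's randomSplit and wrap each
// resulting split so Java callers get JavaRDDs back.
def randomSplit(weights: Array[Double], seed: Long): Array[JavaRDD[T]] =
  rdd.randomSplit(weights, seed).map(wrapRDD)
{code}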



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1973) Add randomSplit to JavaRDD (with tests, and tidy Java tests)

2014-06-04 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1973:
-

Assignee: Sean Owen

 Add randomSplit to JavaRDD (with tests, and tidy Java tests)
 

 Key: SPARK-1973
 URL: https://issues.apache.org/jira/browse/SPARK-1973
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
 Fix For: 1.1.0


 I'd like to use randomSplit through the Java API, and would like to add a 
 convenience wrapper for this method to JavaRDD. This is fairly trivial. (In 
 fact, is the intent that JavaRDD not wrap every RDD method? and that 
 sometimes users should just use JavaRDD.wrapRDD()?)
 Along the way, I added tests for it, and also touched up the Java API test 
 style and behavior. This is maybe the more useful part of this small change.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Issue Comment Deleted] (SPARK-1704) java.lang.AssertionError: assertion failed: No plan for ExplainCommand (Project [*])

2014-06-04 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-1704:


Comment: was deleted

(was: [~marmbrus] I am attaching the link to the PR.)

 java.lang.AssertionError: assertion failed: No plan for ExplainCommand 
 (Project [*])
 

 Key: SPARK-1704
 URL: https://issues.apache.org/jira/browse/SPARK-1704
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
 Environment: linux
Reporter: Yangjp
  Labels: sql
 Fix For: 1.1.0

   Original Estimate: 612h
  Remaining Estimate: 612h

 14/05/03 22:08:40 INFO ParseDriver: Parsing command: explain select * from src
 14/05/03 22:08:40 INFO ParseDriver: Parse Completed
 14/05/03 22:08:40 WARN LoggingFilter: EXCEPTION :
 java.lang.AssertionError: assertion failed: No plan for ExplainCommand 
 (Project [*])
 at scala.Predef$.assert(Predef.scala:179)
 at 
 org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:263)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:263)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:264)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:264)
 at 
 org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:260)
 at 
 org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:248)
 at 
 org.apache.spark.sql.hive.api.java.JavaHiveContext.hql(JavaHiveContext.scala:39)
 at 
 org.apache.spark.examples.TimeServerHandler.messageReceived(TimeServerHandler.java:72)
 at 
 org.apache.mina.core.filterchain.DefaultIoFilterChain$TailFilter.messageReceived(DefaultIoFilterChain.java:690)
 at 
 org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417)
 at 
 org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47)
 at 
 org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765)
 at 
 org.apache.mina.filter.codec.ProtocolCodecFilter$ProtocolDecoderOutputImpl.flush(ProtocolCodecFilter.java:407)
 at 
 org.apache.mina.filter.codec.ProtocolCodecFilter.messageReceived(ProtocolCodecFilter.java:236)
 at 
 org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417)
 at 
 org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47)
 at 
 org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765)
 at 
 org.apache.mina.filter.logging.LoggingFilter.messageReceived(LoggingFilter.java:208)
 at 
 org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417)
 at 
 org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47)
 at 
 org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765)
 at 
 org.apache.mina.core.filterchain.IoFilterAdapter.messageReceived(IoFilterAdapter.java:109)
 at 
 org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417)
 at 
 org.apache.mina.core.filterchain.DefaultIoFilterChain.fireMessageReceived(DefaultIoFilterChain.java:410)
 at 
 org.apache.mina.core.polling.AbstractPollingIoProcessor.read(AbstractPollingIoProcessor.java:710)
 at 
 org.apache.mina.core.polling.AbstractPollingIoProcessor.process(AbstractPollingIoProcessor.java:664)
 at 
 org.apache.mina.core.polling.AbstractPollingIoProcessor.process(AbstractPollingIoProcessor.java:653)
 at 
 org.apache.mina.core.polling.AbstractPollingIoProcessor.access$600(AbstractPollingIoProcessor.java:67)
 at 
 org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.run(AbstractPollingIoProcessor.java:1124)
 at 
 org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:701)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2020) Spark 1.0.0 fails to run in coarse-grained mesos mode

2014-06-04 Thread Ajay Viswanathan (JIRA)
Ajay Viswanathan created SPARK-2020:
---

 Summary: Spark 1.0.0 fails to run in coarse-grained mesos mode
 Key: SPARK-2020
 URL: https://issues.apache.org/jira/browse/SPARK-2020
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.0.0
 Environment: Ubuntu 14.04, 64-bit
8GB RAM
Reporter: Ajay Viswanathan


I am using Mesos to run Spark applications on a cluster.
Earlier, in Spark 0.9.1 and below, I could run tasks in coarse-grained mode on 
the workers; but now, when I try to do the same in Spark 1.0.0, I get an 
exception preventing me from running the tasks. Fine-grained mode works fine in 
Spark 1.0.0 though.

Snippet of stderr - 
Executor registered on slave
Exception in thread "main" java.lang.NumberFormatException: For input string: 
"ip"
at 
java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:492)
at java.lang.Integer.parseInt(Integer.java:527)
at 
scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
at scala.collection.immutable.StringOps.toInt(StringOps.scala:31)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:135)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)

Running the spark application connected to the mesos master throws an error: "Is 
Spark installed on it?"



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2020) Spark 1.0.0 fails to run in coarse-grained mesos mode

2014-06-04 Thread Ajay Viswanathan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018022#comment-14018022
 ] 

Ajay Viswanathan commented on SPARK-2020:
-

Do I have to use Java 8 to rectify this error?

 Spark 1.0.0 fails to run in coarse-grained mesos mode
 -

 Key: SPARK-2020
 URL: https://issues.apache.org/jira/browse/SPARK-2020
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.0.0
 Environment: Ubuntu 14.04, 64-bit
 8GB RAM
Reporter: Ajay Viswanathan

 I am using Mesos to run Spark applications on a cluster.
 Earlier, in Spark 0.9.1 and below, I could run tasks in coarse-grained mode 
 on the workers; but now, when I try to do the same in Spark 1.0.0, I get an 
 exception preventing me from running the tasks. Fine-grained mode works fine 
 in Spark 1.0.0 though.
 Snippet of stderr - 
 Executor registered on slave
 Exception in thread "main" java.lang.NumberFormatException: For input string: 
 "ip"
 at 
 java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
 at java.lang.Integer.parseInt(Integer.java:492)
 at java.lang.Integer.parseInt(Integer.java:527)
 at 
 scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
 at scala.collection.immutable.StringOps.toInt(StringOps.scala:31)
 at 
 org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:135)
 at 
 org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
 Running the spark application connected to the mesos master throws an error: 
 "Is Spark installed on it?"



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2019) Spark workers die/disappear when job fails for nearly any reason

2014-06-04 Thread sam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018048#comment-14018048
 ] 

sam commented on SPARK-2019:


Sorry. It's 0.9.1.

 Spark workers die/disappear when job fails for nearly any reason
 

 Key: SPARK-2019
 URL: https://issues.apache.org/jira/browse/SPARK-2019
 Project: Spark
  Issue Type: Bug
Reporter: sam

 We either have to reboot all the nodes, or run 'sudo service spark-worker 
 restart' across our cluster.  I don't think this should happen - the job 
 failures are often not even that bad.  There is a 5 upvoted SO question here: 
 http://stackoverflow.com/questions/22031006/spark-0-9-0-worker-keeps-dying-in-standalone-mode-when-job-fails
 We shouldn't be giving restart privileges to our devs, and therefore our 
 sysadm has to frequently restart the workers.  When the sysadm is not around, 
 there is nothing our devs can do.
 Many thanks



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1508) Add support for reading from SparkConf

2014-06-04 Thread Zongheng Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018043#comment-14018043
 ] 

Zongheng Yang commented on SPARK-1508:
--

WIP PR: https://github.com/apache/spark/pull/956

We'd want to support:

(1) API calls on SQLConf objects to get/set properties.
(2) SQL/HiveQL SET commands of various kinds, e.g. "SET key=val", "SET", and 
"SET key", in the sense that these should be reflected in / go through SQLConf 
objects.
(3) Making sql("SET ...").collect() (or perhaps also some other operations; 
also for hql()) return expected results, i.e. the key/val pairs (see the 
sketch below). To do this, some refactorings of the QueryExecution pipeline 
are necessary.
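
A rough sketch of the intended usage, assuming the WIP PR lands roughly as described (the exact API is still in flux, so this is illustrative):

{code}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// (2) SET commands go through SQLConf:
sqlContext.sql("SET spark.sql.shuffle.partitions=10")
// (3) a bare SET returns the currently-set key/value pairs:
sqlContext.sql("SET").collect().foreach(println)
{code}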

 Add support for reading from SparkConf
 --

 Key: SPARK-1508
 URL: https://issues.apache.org/jira/browse/SPARK-1508
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Michael Armbrust
Assignee: Zongheng Yang
 Fix For: 1.1.0


 Right now we have no ability to configure things in Spark SQL.  A good start 
 would be passing a SparkConf though the planner such that users could 
 override the number of partitions used during an Exchange.
 Note that while current spark confs are immutable after the context is 
 created, we want some ability to change settings on a per query basis.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1508) Add support for reading from SparkConf

2014-06-04 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018053#comment-14018053
 ] 

Michael Armbrust commented on SPARK-1508:
-

It is likely we will fix this issue through the solution proposed above.

 Add support for reading from SparkConf
 --

 Key: SPARK-1508
 URL: https://issues.apache.org/jira/browse/SPARK-1508
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Michael Armbrust
Assignee: Zongheng Yang
 Fix For: 1.1.0


 Right now we have no ability to configure things in Spark SQL.  A good start 
 would be passing a SparkConf though the planner such that users could 
 override the number of partitions used during an Exchange.
 Note that while current spark confs are immutable after the context is 
 created, we want some ability to change settings on a per query basis.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2019) Spark workers die/disappear when job fails for nearly any reason

2014-06-04 Thread Mark Hamstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamstra updated SPARK-2019:


Affects Version/s: 0.9.1

 Spark workers die/disappear when job fails for nearly any reason
 

 Key: SPARK-2019
 URL: https://issues.apache.org/jira/browse/SPARK-2019
 Project: Spark
  Issue Type: Bug
Affects Versions: 0.9.1
Reporter: sam

 We either have to reboot all the nodes, or run 'sudo service spark-worker 
 restart' across our cluster.  I don't think this should happen - the job 
 failures are often not even that bad.  There is a 5 upvoted SO question here: 
 http://stackoverflow.com/questions/22031006/spark-0-9-0-worker-keeps-dying-in-standalone-mode-when-job-fails
 We shouldn't be giving restart privileges to our devs, and therefore our 
 sysadm has to frequently restart the workers.  When the sysadm is not around, 
 there is nothing our devs can do.
 Many thanks



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2019) Spark workers die/disappear when job fails for nearly any reason

2014-06-04 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2019:
---

Fix Version/s: 0.9.2

 Spark workers die/disappear when job fails for nearly any reason
 

 Key: SPARK-2019
 URL: https://issues.apache.org/jira/browse/SPARK-2019
 Project: Spark
  Issue Type: Bug
Affects Versions: 0.9.1
Reporter: sam
Priority: Critical
 Fix For: 0.9.2


 We either have to reboot all the nodes, or run 'sudo service spark-worker 
 restart' across our cluster.  I don't think this should happen - the job 
 failures are often not even that bad.  There is a 5 upvoted SO question here: 
 http://stackoverflow.com/questions/22031006/spark-0-9-0-worker-keeps-dying-in-standalone-mode-when-job-fails
 We shouldn't be giving restart privileges to our devs, and therefore our 
 sysadm has to frequently restart the workers.  When the sysadm is not around, 
 there is nothing our devs can do.
 Many thanks



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2019) Spark workers die/disappear when job fails for nearly any reason

2014-06-04 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018086#comment-14018086
 ] 

Patrick Wendell commented on SPARK-2019:


We should dig into this and figure out what's going on.

 Spark workers die/disappear when job fails for nearly any reason
 

 Key: SPARK-2019
 URL: https://issues.apache.org/jira/browse/SPARK-2019
 Project: Spark
  Issue Type: Bug
Affects Versions: 0.9.1
Reporter: sam
Priority: Critical
 Fix For: 0.9.2


 We either have to reboot all the nodes, or run 'sudo service spark-worker 
 restart' across our cluster.  I don't think this should happen - the job 
 failures are often not even that bad.  There is a 5 upvoted SO question here: 
 http://stackoverflow.com/questions/22031006/spark-0-9-0-worker-keeps-dying-in-standalone-mode-when-job-fails
 We shouldn't be giving restart privileges to our devs, and therefore our 
 sysadm has to frequently restart the workers.  When the sysadm is not around, 
 there is nothing our devs can do.
 Many thanks



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2019) Spark workers die/disappear when job fails for nearly any reason

2014-06-04 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2019:
---

Priority: Critical  (was: Major)

 Spark workers die/disappear when job fails for nearly any reason
 

 Key: SPARK-2019
 URL: https://issues.apache.org/jira/browse/SPARK-2019
 Project: Spark
  Issue Type: Bug
Affects Versions: 0.9.1
Reporter: sam
Priority: Critical
 Fix For: 0.9.2


 We either have to reboot all the nodes, or run 'sudo service spark-worker 
 restart' across our cluster.  I don't think this should happen - the job 
 failures are often not even that bad.  There is a 5 upvoted SO question here: 
 http://stackoverflow.com/questions/22031006/spark-0-9-0-worker-keeps-dying-in-standalone-mode-when-job-fails
 We shouldn't be giving restart privileges to our devs, and therefore our 
 sysadm has to frequently restart the workers.  When the sysadm is not around, 
 there is nothing our devs can do.
 Many thanks



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1977) mutable.BitSet in ALS not serializable with KryoSerializer

2014-06-04 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018155#comment-14018155
 ] 

Xiangrui Meng commented on SPARK-1977:
--

In our example code, we only register `Rating` and it works. Could you try 
adding the following:

{code}
import org.apache.spark.mllib.recommendation.Rating
kryo.register(classOf[Rating])
{code}

I need to reproduce this problem with `ALS.train`.
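
A repro attempt with `ALS.train` might look like the following shell-style sketch (the ratings are made up for illustration):

{code}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

val ratings = sc.parallelize(Seq(
  Rating(1, 1, 5.0), Rating(1, 2, 1.0), Rating(2, 1, 4.0)))
// rank = 10, iterations = 5
val model = ALS.train(ratings, 10, 5)
{code}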

 mutable.BitSet in ALS not serializable with KryoSerializer
 --

 Key: SPARK-1977
 URL: https://issues.apache.org/jira/browse/SPARK-1977
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Neville Li
Priority: Minor

 OutLinkBlock in ALS.scala has an Array[mutable.BitSet] member.
 KryoSerializer uses AllScalaRegistrar from Twitter chill but it doesn't 
 register mutable.BitSet.
 Right now we have to register mutable.BitSet manually. A proper fix would be 
 using immutable.BitSet in ALS or register mutable.BitSet in upstream chill.
 {code}
 Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
 Task 1724.0:9 failed 4 times, most recent failure: Exception failure in TID 
 68548 on host lon4-hadoopslave-b232.lon4.spotify.net: 
 com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: 
 scala.collection.mutable.HashSet
 Serialization trace:
 shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock)
 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
 
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
 
 org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:115)
 
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:155)
 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154)
 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
 org.apache.spark.scheduler.Task.run(Task.scala:51)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 java.lang.Thread.run(Thread.java:662)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
   at 
 

[jira] [Updated] (SPARK-1912) Compression memory issue during reduce

2014-06-04 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-1912:
-

Target Version/s: 0.9.2, 1.0.1, 1.1.0  (was: 0.9.2, 1.0.1)

 Compression memory issue during reduce
 --

 Key: SPARK-1912
 URL: https://issues.apache.org/jira/browse/SPARK-1912
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Wenchen Fan
Assignee: Wenchen Fan
 Fix For: 1.1.0


 When we need to read a compressed block, we first create a compression 
 stream instance (LZF or Snappy) and use it to wrap that block.
 Say a reducer task needs to read 1000 local shuffle blocks: it first 
 prepares to read all 1000 blocks, which means creating 1000 compression 
 stream instances to wrap them. But initializing a compression instance 
 allocates some memory, and having many compression instances alive at the 
 same time is a problem.
 In practice the reducer reads the shuffle blocks one by one, so why create 
 all the compression instances up front? We could do it lazily: create the 
 compression instance for a block only when that block is first read.
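
A minimal sketch of that lazy idea as a plain wrapper class - the {{open}} callback standing in for the codec call (LZF/Snappy) is our own illustration, not Spark's API:

{code}
import java.io.InputStream

// Hedged sketch: defer creating the (memory-hungry) compression stream
// until the block is actually read for the first time.
class LazilyCompressedStream(open: () => InputStream) extends InputStream {
  private lazy val underlying: InputStream = open() // created on first read
  override def read(): Int = underlying.read()
  override def read(b: Array[Byte], off: Int, len: Int): Int =
    underlying.read(b, off, len)
  override def close(): Unit = underlying.close()
}
{code}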



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1912) Compression memory issue during reduce

2014-06-04 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-1912:
-

Target Version/s: 0.9.2, 1.0.1

 Compression memory issue during reduce
 --

 Key: SPARK-1912
 URL: https://issues.apache.org/jira/browse/SPARK-1912
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Wenchen Fan
Assignee: Wenchen Fan
 Fix For: 1.1.0


 When we need to read a compressed block, we first create a compression 
 stream instance (LZF or Snappy) and use it to wrap that block.
 Say a reducer task needs to read 1000 local shuffle blocks: it first 
 prepares to read all 1000 blocks, which means creating 1000 compression 
 stream instances to wrap them. But initializing a compression instance 
 allocates some memory, and having many compression instances alive at the 
 same time is a problem.
 In practice the reducer reads the shuffle blocks one by one, so why create 
 all the compression instances up front? We could do it lazily: create the 
 compression instance for a block only when that block is first read.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1977) mutable.BitSet in ALS not serializable with KryoSerializer

2014-06-04 Thread Neville Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14018254#comment-14018254
 ] 

Neville Li commented on SPARK-1977:
---

Yes, we did register `Rating`, and we had to register(classOf[mutable.BitSet]) 
in addition to make it work.
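
For anyone hitting the same thing, a minimal registrator sketch matching that workaround (the class name is ours; wire it up via {{spark.kryo.registrator}}, with {{spark.serializer}} set to the KryoSerializer):

{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.serializer.KryoRegistrator
import scala.collection.mutable

// Registers Rating plus the mutable.BitSet that OutLinkBlock pulls in.
class ALSKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[Rating])
    kryo.register(classOf[mutable.BitSet])
  }
}
{code}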

 mutable.BitSet in ALS not serializable with KryoSerializer
 --

 Key: SPARK-1977
 URL: https://issues.apache.org/jira/browse/SPARK-1977
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Neville Li
Priority: Minor

 OutLinkBlock in ALS.scala has an Array[mutable.BitSet] member.
 KryoSerializer uses AllScalaRegistrar from Twitter chill but it doesn't 
 register mutable.BitSet.
 Right now we have to register mutable.BitSet manually. A proper fix would be 
 using immutable.BitSet in ALS or register mutable.BitSet in upstream chill.
 {code}
 Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
 Task 1724.0:9 failed 4 times, most recent failure: Exception failure in TID 
 68548 on host lon4-hadoopslave-b232.lon4.spotify.net: 
 com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: 
 scala.collection.mutable.HashSet
 Serialization trace:
 shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock)
 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
 
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
 
 org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:115)
 
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:155)
 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154)
 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
 org.apache.spark.scheduler.Task.run(Task.scala:51)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 java.lang.Thread.run(Thread.java:662)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
   at scala.Option.foreach(Option.scala:236)
   

[jira] [Created] (SPARK-2023) PySpark reduce does a map side reduce and then sends the results to the driver for final reduce, instead do this more like Scala Spark.

2014-06-04 Thread holdenk (JIRA)
holdenk created SPARK-2023:
--

 Summary: PySpark reduce does a map side reduce and then sends the 
results to the driver for final reduce, instead do this more like Scala Spark.
 Key: SPARK-2023
 URL: https://issues.apache.org/jira/browse/SPARK-2023
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: holdenk


PySpark's reduce does a map-side reduce and then sends the results to the 
driver for the final reduce; instead, we should do this more like Scala Spark. 
The current implementation could be a bottleneck. 
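
For comparison, a rough Scala sketch of what "more like Scala Spark" means here, simplified from RDD.reduce and specialized to Int to keep it self-contained: the driver folds per-partition results in as tasks finish, rather than collecting them all first.

{code}
import org.apache.spark.rdd.RDD

// Simplified shape of Scala Spark's reduce: each task reduces its own
// partition; the driver merges task results incrementally as they arrive.
def sketchReduce(rdd: RDD[Int], f: (Int, Int) => Int): Int = {
  var acc: Option[Int] = None
  rdd.sparkContext.runJob(rdd,
    (it: Iterator[Int]) => if (it.hasNext) Some(it.reduceLeft(f)) else None,
    (_: Int, res: Option[Int]) =>
      for (r <- res) acc = Some(acc.map(f(_, r)).getOrElse(r)))
  acc.getOrElse(throw new UnsupportedOperationException("empty collection"))
}
{code}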



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2024) Add saveAsSequenceFile to PySpark

2014-06-04 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-2024:


 Summary: Add saveAsSequenceFile to PySpark
 Key: SPARK-2024
 URL: https://issues.apache.org/jira/browse/SPARK-2024
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Reporter: Matei Zaharia


After SPARK-1414 we will be able to read SequenceFiles from Python, but it 
remains to write them.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1790) Update EC2 scripts to support r3 instance types

2014-06-04 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1790:
---

Fix Version/s: 1.1.0
   0.9.2

 Update EC2 scripts to support r3 instance types
 ---

 Key: SPARK-1790
 URL: https://issues.apache.org/jira/browse/SPARK-1790
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Affects Versions: 0.9.0, 0.9.1, 1.0.0
Reporter: Matei Zaharia
Assignee: Sujeet Varakhedi
  Labels: Starter
 Fix For: 0.9.2, 1.0.1, 1.1.0


 These were recently added by Amazon as a cheaper high-memory option



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1790) Update EC2 scripts to support r3 instance types

2014-06-04 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1790.


Resolution: Fixed

 Update EC2 scripts to support r3 instance types
 ---

 Key: SPARK-1790
 URL: https://issues.apache.org/jira/browse/SPARK-1790
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Affects Versions: 0.9.0, 0.9.1, 1.0.0
Reporter: Matei Zaharia
Assignee: Sujeet Varakhedi
  Labels: Starter
 Fix For: 0.9.2, 1.0.1, 1.1.0


 These were recently added by Amazon as a cheaper high-memory option



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2011) Eliminate duplicate join in Pregel

2014-06-04 Thread Tim Weninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14018327#comment-14018327
 ] 

Tim Weninger commented on SPARK-2011:
-

I also think there is a memory leak related to this. When I run a Pregel 
script/function, it creates and holds an EdgeRDD (visible in the Storage menu 
of the WebUI) that is never released.

So after 15 Pregel iterations, I'll have 15 extra EdgeRDDs taking up space.

Is this related, or should I file a new bug report? (This affects the 1.0.1 
snapshot.)

TW

 Eliminate duplicate join in Pregel
 --

 Key: SPARK-2011
 URL: https://issues.apache.org/jira/browse/SPARK-2011
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Ankur Dave
Assignee: Ankur Dave

 In the iteration loop, Pregel currently performs an innerJoin to apply 
 messages to vertices followed by an outerJoinVertices to join the resulting 
 subset of vertices back to the graph. These two operations could be merged 
 into a single call to joinVertices, which should be reimplemented in a more 
 efficient manner. This would allow us to examine only the vertices that 
 received messages.
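
For reference, a toy GraphX snippet (the graph, messages, and vertex program are all invented for illustration) showing the two joins in question:

{code}
import org.apache.spark.SparkContext
import org.apache.spark.graphx._

val sc = new SparkContext("local", "pregel-join-sketch")
val g = Graph.fromEdgeTuples(sc.parallelize(Seq((1L, 2L), (2L, 3L))), defaultValue = 0)
val messages = sc.parallelize(Seq((1L, 5)))

// 1) innerJoin applies messages, touching only vertices that received one.
val newVerts = g.vertices.innerJoin(messages) { (vid, attr, msg) => attr + msg }
// 2) outerJoinVertices joins that subset back into the full graph.
val merged = g.outerJoinVertices(newVerts) { (vid, old, newOpt) => newOpt.getOrElse(old) }
{code}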



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1977) mutable.BitSet in ALS not serializable with KryoSerializer

2014-06-04 Thread Shuo Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14018328#comment-14018328
 ] 

Shuo Xiang commented on SPARK-1977:
---

Hi [~neville], I just ran the MovieLens example on my YARN cluster 
(hadoop-2.0.5-alpha) with Kryo enabled and it works. I used the following 
command:

{code}
bin/spark-submit --master yarn-cluster --class 
org.apache.spark.examples.mllib.MovieLensALS --num-executors ** 
--driver-memory ** --executor-memory ** --executor-cores 1 
spark-examples-1.0.0-hadoop2.0.5-alpha.jar --rank 5 --numIterations 20 
--lambda 1.0 --kryo /path/to/sample_movielens_data.txt
{code}

 mutable.BitSet in ALS not serializable with KryoSerializer
 --

 Key: SPARK-1977
 URL: https://issues.apache.org/jira/browse/SPARK-1977
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Neville Li
Priority: Minor

 OutLinkBlock in ALS.scala has an Array[mutable.BitSet] member.
 KryoSerializer uses AllScalaRegistrar from Twitter chill but it doesn't 
 register mutable.BitSet.
 Right now we have to register mutable.BitSet manually. A proper fix would be 
 using immutable.BitSet in ALS or register mutable.BitSet in upstream chill.
 {code}
 Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
 Task 1724.0:9 failed 4 times, most recent failure: Exception failure in TID 
 68548 on host lon4-hadoopslave-b232.lon4.spotify.net: 
 com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: 
 scala.collection.mutable.HashSet
 Serialization trace:
 shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock)
 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
 
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
 
 org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:115)
 
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:155)
 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154)
 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
 org.apache.spark.scheduler.Task.run(Task.scala:51)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 java.lang.Thread.run(Thread.java:662)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 

[jira] [Created] (SPARK-2025) EdgeRDD persists after pregel iteration

2014-06-04 Thread Tim Weninger (JIRA)
Tim Weninger created SPARK-2025:
---

 Summary: EdgeRDD persists after pregel iteration
 Key: SPARK-2025
 URL: https://issues.apache.org/jira/browse/SPARK-2025
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.0.0, 1.0.1
 Environment: RHEL6 on local and on spark cluster
Reporter: Tim Weninger


Symptoms: During execution of a Pregel script/function, a copy of an 
intermediate EdgeRDD object persists after each iteration, as shown by the 
Spark WebUI's Storage page.

This amounts to a memory leak in the Pregel function.

For example, after the first iteration I will have an extra EdgeRDD in 
addition to the EdgeRDD and VertexRDD that are kept for the next iteration. 
After 15 iterations I will have 15 extra EdgeRDDs in addition to the 
current/correct state represented by a single EdgeRDD and a single VertexRDD.

At the end of a Pregel loop the old EdgeRDD and VertexRDD are unpersisted, but 
there seems to be another EdgeRDD created somewhere that does not get 
unpersisted.

I _think_ this comes from the replicateVertex function, but I cannot be sure.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2025) EdgeRDD persists after pregel iteration

2014-06-04 Thread Tim Weninger (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Weninger updated SPARK-2025:


Description: 
Symptoms: During execution of a Pregel script/function, a copy of an 
intermediate EdgeRDD object persists after each iteration, as shown by the 
Spark WebUI's Storage page.

This amounts to a memory leak in the Pregel function.

For example, after the first iteration I will have an extra EdgeRDD in 
addition to the EdgeRDD and VertexRDD that are kept for the next iteration. 
After 15 iterations I will have 15 extra EdgeRDDs in addition to the 
current/correct state represented by a single EdgeRDD and a single VertexRDD.

At the end of a Pregel loop the old EdgeRDD and VertexRDD are unpersisted, but 
there seems to be another EdgeRDD created somewhere that does not get 
unpersisted.

I _think_ this comes from the replicateVertex function, but I cannot be sure.

Update - Ankur Dave says, in comments on SPARK-2011 - 
{quote}
... is a bug introduced by https://github.com/apache/spark/pull/497.
It occurs because unpersistVertices used to unpersist both the vertices and the 
replicated vertices, but after unifying replicated vertices with edges, there 
was no way to unpersist only one of them. I think the solution is just to 
unpersist both the vertices and the edges in Pregel.{quote}

  was:
Symptoms: During execution of a Pregel script/function, a copy of an 
intermediate EdgeRDD object persists after each iteration, as shown by the 
Spark WebUI's Storage page.

This amounts to a memory leak in the Pregel function.

For example, after the first iteration I will have an extra EdgeRDD in 
addition to the EdgeRDD and VertexRDD that are kept for the next iteration. 
After 15 iterations I will have 15 extra EdgeRDDs in addition to the 
current/correct state represented by a single EdgeRDD and a single VertexRDD.

At the end of a Pregel loop the old EdgeRDD and VertexRDD are unpersisted, but 
there seems to be another EdgeRDD created somewhere that does not get 
unpersisted.

I _think_ this comes from the replicateVertex function, but I cannot be sure.


 EdgeRDD persists after pregel iteration
 ---

 Key: SPARK-2025
 URL: https://issues.apache.org/jira/browse/SPARK-2025
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.0.0, 1.0.1
 Environment: RHEL6 on local and on spark cluster
Reporter: Tim Weninger
  Labels: Pregel

 Symptoms: During execution of a Pregel script/function, a copy of an 
 intermediate EdgeRDD object persists after each iteration, as shown by the 
 Spark WebUI's Storage page.
 This amounts to a memory leak in the Pregel function.
 For example, after the first iteration I will have an extra EdgeRDD in 
 addition to the EdgeRDD and VertexRDD that are kept for the next iteration. 
 After 15 iterations I will have 15 extra EdgeRDDs in addition to the 
 current/correct state represented by a single EdgeRDD and a single VertexRDD.
 At the end of a Pregel loop the old EdgeRDD and VertexRDD are unpersisted, 
 but there seems to be another EdgeRDD created somewhere that does not get 
 unpersisted.
 I _think_ this comes from the replicateVertex function, but I cannot be sure.
 Update - Ankur Dave says, in comments on SPARK-2011 - 
 {quote}
 ... is a bug introduced by https://github.com/apache/spark/pull/497.
 It occurs because unpersistVertices used to unpersist both the vertices and 
 the replicated vertices, but after unifying replicated vertices with edges, 
 there was no way to unpersist only one of them. I think the solution is just 
 to unpersist both the vertices and the edges in Pregel.{quote}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2026) Maven hadoop* Profiles Should Set the expected Hadoop Version.

2014-06-04 Thread Bernardo Gomez Palacio (JIRA)
Bernardo Gomez Palacio created SPARK-2026:
-

 Summary: Maven hadoop* Profiles Should Set the expected Hadoop 
Version.
 Key: SPARK-2026
 URL: https://issues.apache.org/jira/browse/SPARK-2026
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 1.0.0
Reporter: Bernardo Gomez Palacio


The Maven profiles that refer to _hadoopX_, e.g. hadoop-2.4, should set the 
expected _hadoop.version_.

e.g.

{code}
<profile>
  <id>hadoop-2.4</id>
  <properties>
    <protobuf.version>2.5.0</protobuf.version>
    <jets3t.version>0.9.0</jets3t.version>
  </properties>
</profile>
{code}

as is suggested below:

{code}
<profile>
  <id>hadoop-2.4</id>
  <properties>
    <hadoop.version>2.4.0</hadoop.version>
    <yarn.version>${hadoop.version}</yarn.version>
    <protobuf.version>2.5.0</protobuf.version>
    <jets3t.version>0.9.0</jets3t.version>
  </properties>
</profile>
{code}

Builds can still set the -Dhadoop.version option explicitly, but with this 
change the Hadoop version will correctly default to the one expected for the 
selected profile.

e.g.

{code}
$ mvn -P hadoop-2.4,yarn clean compile
{code}
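
And the default could still be overridden explicitly when needed (the 2.4.1 below is only an illustrative version number):

{code}
$ mvn -P hadoop-2.4,yarn -Dhadoop.version=2.4.1 clean compile
{code}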




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2027) spark-ec2 puts Hadoop's log4j ahead of Spark's in classpath

2014-06-04 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-2027:
-

 Summary: spark-ec2 puts Hadoop's log4j ahead of Spark's in 
classpath
 Key: SPARK-2027
 URL: https://issues.apache.org/jira/browse/SPARK-2027
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Aaron Davidson
Assignee: Aaron Davidson






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (SPARK-2025) EdgeRDD persists after pregel iteration

2014-06-04 Thread Ankur Dave (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave reassigned SPARK-2025:
-

Assignee: Ankur Dave

 EdgeRDD persists after pregel iteration
 ---

 Key: SPARK-2025
 URL: https://issues.apache.org/jira/browse/SPARK-2025
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.0.0, 1.0.1
 Environment: RHEL6 on local and on spark cluster
Reporter: Tim Weninger
Assignee: Ankur Dave
  Labels: Pregel

 Symptoms: During execution of a Pregel script/function, a copy of an 
 intermediate EdgeRDD object persists after each iteration, as shown by the 
 Spark WebUI's Storage page.
 This amounts to a memory leak in the Pregel function.
 For example, after the first iteration I will have an extra EdgeRDD in 
 addition to the EdgeRDD and VertexRDD that are kept for the next iteration. 
 After 15 iterations I will have 15 extra EdgeRDDs in addition to the 
 current/correct state represented by a single EdgeRDD and a single VertexRDD.
 At the end of a Pregel loop the old EdgeRDD and VertexRDD are unpersisted, 
 but there seems to be another EdgeRDD created somewhere that does not get 
 unpersisted.
 I _think_ this comes from the replicateVertex function, but I cannot be sure.
 Update - Ankur Dave says, in comments on SPARK-2011 - 
 {quote}
 ... is a bug introduced by https://github.com/apache/spark/pull/497.
 It occurs because unpersistVertices used to unpersist both the vertices and 
 the replicated vertices, but after unifying replicated vertices with edges, 
 there was no way to unpersist only one of them. I think the solution is just 
 to unpersist both the vertices and the edges in Pregel.{quote}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2025) EdgeRDD persists after pregel iteration

2014-06-04 Thread Tim Weninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14018389#comment-14018389
 ] 

Tim Weninger commented on SPARK-2025:
-

Adding

{{prevG.edges.unpersist(blocking=false)}}

after line 152 in Pregel.scala fixes the issue.
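
In context, a hedged sketch of what the end of the iteration loop would look like with both unpersist calls (variable names follow Pregel.scala's conventions; this fragment is not the actual patch):

{code}
// Once the new graph is materialized, drop the previous iteration's
// vertices *and* edges, as suggested in the description below.
prevG = g
g = g.outerJoinVertices(newVerts) { (vid, old, newOpt) => newOpt.getOrElse(old) }
prevG.unpersistVertices(blocking = false)
prevG.edges.unpersist(blocking = false) // the proposed added line
{code}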

 EdgeRDD persists after pregel iteration
 ---

 Key: SPARK-2025
 URL: https://issues.apache.org/jira/browse/SPARK-2025
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.0.0, 1.0.1
 Environment: RHEL6 on local and on spark cluster
Reporter: Tim Weninger
Assignee: Ankur Dave
  Labels: Pregel

 Symptoms: During execution of a Pregel script/function, a copy of an 
 intermediate EdgeRDD object persists after each iteration, as shown by the 
 Spark WebUI's Storage page.
 This amounts to a memory leak in the Pregel function.
 For example, after the first iteration I will have an extra EdgeRDD in 
 addition to the EdgeRDD and VertexRDD that are kept for the next iteration. 
 After 15 iterations I will have 15 extra EdgeRDDs in addition to the 
 current/correct state represented by a single EdgeRDD and a single VertexRDD.
 At the end of a Pregel loop the old EdgeRDD and VertexRDD are unpersisted, 
 but there seems to be another EdgeRDD created somewhere that does not get 
 unpersisted.
 I _think_ this comes from the replicateVertex function, but I cannot be sure.
 Update - Ankur Dave says, in comments on SPARK-2011 - 
 {quote}
 ... is a bug introduced by https://github.com/apache/spark/pull/497.
 It occurs because unpersistVertices used to unpersist both the vertices and 
 the replicated vertices, but after unifying replicated vertices with edges, 
 there was no way to unpersist only one of them. I think the solution is just 
 to unpersist both the vertices and the edges in Pregel.{quote}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1977) mutable.BitSet in ALS not serializable with KryoSerializer

2014-06-04 Thread Neville Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14018390#comment-14018390
 ] 

Neville Li commented on SPARK-1977:
---

Our YARN cluster runs 2.2.0. We built spark-assembly and spark-examples jars 
with 1.0.0 release source and the bundled make_distribution.sh. And here's my 
command:

{code}
spark-submit --master yarn-cluster --class 
org.apache.spark.examples.mllib.MovieLensALS --num-executors 2 
--executor-memory 2g --driver-memory 2g 
dist/lib/spark-examples-1.0.0-hadoop2.2.0.jar --kryo --implicitPrefs 
sample_movielens_data.txt
{code}

Here's a complete list of classpath from the environment tab.
{code}
/etc/hadoop/conf
/usr/lib/hadoop-hdfs/hadoop-hdfs-2.2.0.2.0.6.0-76-tests.jar
/usr/lib/hadoop-hdfs/hadoop-hdfs-2.2.0.2.0.6.0-76.jar
/usr/lib/hadoop-hdfs/hadoop-hdfs-nfs-2.2.0.2.0.6.0-76.jar
/usr/lib/hadoop-hdfs/lib/asm-3.2.jar
/usr/lib/hadoop-hdfs/lib/commons-cli-1.2.jar
/usr/lib/hadoop-hdfs/lib/commons-codec-1.4.jar
/usr/lib/hadoop-hdfs/lib/commons-daemon-1.0.13.jar
/usr/lib/hadoop-hdfs/lib/commons-el-1.0.jar
/usr/lib/hadoop-hdfs/lib/commons-io-2.1.jar
/usr/lib/hadoop-hdfs/lib/commons-lang-2.5.jar
/usr/lib/hadoop-hdfs/lib/commons-logging-1.1.1.jar
/usr/lib/hadoop-hdfs/lib/guava-11.0.2.jar
/usr/lib/hadoop-hdfs/lib/jackson-core-asl-1.8.8.jar
/usr/lib/hadoop-hdfs/lib/jackson-mapper-asl-1.8.8.jar
/usr/lib/hadoop-hdfs/lib/jasper-runtime-5.5.23.jar
/usr/lib/hadoop-hdfs/lib/jersey-core-1.9.jar
/usr/lib/hadoop-hdfs/lib/jersey-server-1.9.jar
/usr/lib/hadoop-hdfs/lib/jetty-6.1.26.jar
/usr/lib/hadoop-hdfs/lib/jetty-util-6.1.26.jar
/usr/lib/hadoop-hdfs/lib/jsp-api-2.1.jar
/usr/lib/hadoop-hdfs/lib/jsr305-1.3.9.jar
/usr/lib/hadoop-hdfs/lib/log4j-1.2.17.jar
/usr/lib/hadoop-hdfs/lib/netty-3.6.2.Final.jar
/usr/lib/hadoop-hdfs/lib/protobuf-java-2.5.0.jar
/usr/lib/hadoop-hdfs/lib/servlet-api-2.5.jar
/usr/lib/hadoop-hdfs/lib/xmlenc-0.52.jar
/usr/lib/hadoop-mapreduce/hadoop-archives-2.2.0.2.0.6.0-76.jar
/usr/lib/hadoop-mapreduce/hadoop-datajoin-2.2.0.2.0.6.0-76.jar
/usr/lib/hadoop-mapreduce/hadoop-distcp-2.2.0.2.0.6.0-76.jar
/usr/lib/hadoop-mapreduce/hadoop-extras-2.2.0.2.0.6.0-76.jar
/usr/lib/hadoop-mapreduce/hadoop-gridmix-2.2.0.2.0.6.0-76.jar
/usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-app-2.2.0.2.0.6.0-76.jar
/usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-common-2.2.0.2.0.6.0-76.jar
/usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-core-2.2.0.2.0.6.0-76.jar
/usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-hs-2.2.0.2.0.6.0-76.jar
/usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-hs-plugins-2.2.0.2.0.6.0-76.jar
/usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.2.0.2.0.6.0-76-tests.jar
/usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.2.0.2.0.6.0-76.jar
/usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-shuffle-2.2.0.2.0.6.0-76.jar
/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.2.0.2.0.6.0-76.jar
/usr/lib/hadoop-mapreduce/hadoop-rumen-2.2.0.2.0.6.0-76.jar
/usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-76.jar
/usr/lib/hadoop-mapreduce/lib/aopalliance-1.0.jar
/usr/lib/hadoop-mapreduce/lib/asm-3.2.jar
/usr/lib/hadoop-mapreduce/lib/avro-1.7.4.jar
/usr/lib/hadoop-mapreduce/lib/commons-compress-1.4.1.jar
/usr/lib/hadoop-mapreduce/lib/commons-io-2.1.jar
/usr/lib/hadoop-mapreduce/lib/guice-3.0.jar
/usr/lib/hadoop-mapreduce/lib/guice-servlet-3.0.jar
/usr/lib/hadoop-mapreduce/lib/hamcrest-core-1.1.jar
/usr/lib/hadoop-mapreduce/lib/jackson-core-asl-1.8.8.jar
/usr/lib/hadoop-mapreduce/lib/jackson-mapper-asl-1.8.8.jar
/usr/lib/hadoop-mapreduce/lib/javax.inject-1.jar
/usr/lib/hadoop-mapreduce/lib/jersey-core-1.9.jar
/usr/lib/hadoop-mapreduce/lib/jersey-guice-1.9.jar
/usr/lib/hadoop-mapreduce/lib/jersey-server-1.9.jar
/usr/lib/hadoop-mapreduce/lib/junit-4.10.jar
/usr/lib/hadoop-mapreduce/lib/log4j-1.2.17.jar
/usr/lib/hadoop-mapreduce/lib/netty-3.6.2.Final.jar
/usr/lib/hadoop-mapreduce/lib/paranamer-2.3.jar
/usr/lib/hadoop-mapreduce/lib/protobuf-java-2.5.0.jar
/usr/lib/hadoop-mapreduce/lib/snappy-java-1.0.4.1.jar
/usr/lib/hadoop-mapreduce/lib/xz-1.0.jar
/usr/lib/hadoop-yarn/hadoop-yarn-api-2.2.0.2.0.6.0-76.jar
/usr/lib/hadoop-yarn/hadoop-yarn-applications-distributedshell-2.2.0.2.0.6.0-76.jar
/usr/lib/hadoop-yarn/hadoop-yarn-applications-unmanaged-am-launcher-2.2.0.2.0.6.0-76.jar
/usr/lib/hadoop-yarn/hadoop-yarn-client-2.2.0.2.0.6.0-76.jar
/usr/lib/hadoop-yarn/hadoop-yarn-common-2.2.0.2.0.6.0-76.jar
/usr/lib/hadoop-yarn/hadoop-yarn-server-common-2.2.0.2.0.6.0-76.jar
/usr/lib/hadoop-yarn/hadoop-yarn-server-nodemanager-2.2.0.2.0.6.0-76.jar
/usr/lib/hadoop-yarn/hadoop-yarn-server-resourcemanager-2.2.0.2.0.6.0-76.jar
/usr/lib/hadoop-yarn/hadoop-yarn-server-tests-2.2.0.2.0.6.0-76.jar
/usr/lib/hadoop-yarn/hadoop-yarn-server-web-proxy-2.2.0.2.0.6.0-76.jar
/usr/lib/hadoop-yarn/hadoop-yarn-site-2.2.0.2.0.6.0-76.jar
/usr/lib/hadoop-yarn/lib/aopalliance-1.0.jar

[jira] [Commented] (SPARK-2025) EdgeRDD persists after pregel iteration

2014-06-04 Thread Tim Weninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14018391#comment-14018391
 ] 

Tim Weninger commented on SPARK-2025:
-

I'll leave it to you to make the bug fix. You seem to be a pro.

 EdgeRDD persists after pregel iteration
 ---

 Key: SPARK-2025
 URL: https://issues.apache.org/jira/browse/SPARK-2025
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.0.0, 1.0.1
 Environment: RHEL6 on local and on spark cluster
Reporter: Tim Weninger
Assignee: Ankur Dave
  Labels: Pregel

 Symptoms: During execution of a Pregel script/function, a copy of an 
 intermediate EdgeRDD object persists after each iteration, as shown by the 
 Spark WebUI's Storage page.
 This amounts to a memory leak in the Pregel function.
 For example, after the first iteration I will have an extra EdgeRDD in 
 addition to the EdgeRDD and VertexRDD that are kept for the next iteration. 
 After 15 iterations I will have 15 extra EdgeRDDs in addition to the 
 current/correct state represented by a single EdgeRDD and a single VertexRDD.
 At the end of a Pregel loop the old EdgeRDD and VertexRDD are unpersisted, 
 but there seems to be another EdgeRDD created somewhere that does not get 
 unpersisted.
 I _think_ this comes from the replicateVertex function, but I cannot be sure.
 Update - Ankur Dave says, in comments on SPARK-2011 - 
 {quote}
 ... is a bug introduced by https://github.com/apache/spark/pull/497.
 It occurs because unpersistVertices used to unpersist both the vertices and 
 the replicated vertices, but after unifying replicated vertices with edges, 
 there was no way to unpersist only one of them. I think the solution is just 
 to unpersist both the vertices and the edges in Pregel.{quote}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2025) EdgeRDD persists after pregel iteration

2014-06-04 Thread Ankur Dave (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14018402#comment-14018402
 ] 

Ankur Dave commented on SPARK-2025:
---

Proposed fix: https://github.com/apache/spark/pull/972

 EdgeRDD persists after pregel iteration
 ---

 Key: SPARK-2025
 URL: https://issues.apache.org/jira/browse/SPARK-2025
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.0.0, 1.0.1
 Environment: RHEL6 on local and on spark cluster
Reporter: Tim Weninger
Assignee: Ankur Dave
  Labels: Pregel

 Symptoms: During execution of a Pregel script/function, a copy of an 
 intermediate EdgeRDD object persists after each iteration, as shown by the 
 Spark WebUI's Storage page.
 This amounts to a memory leak in the Pregel function.
 For example, after the first iteration I will have an extra EdgeRDD in 
 addition to the EdgeRDD and VertexRDD that are kept for the next iteration. 
 After 15 iterations I will have 15 extra EdgeRDDs in addition to the 
 current/correct state represented by a single EdgeRDD and a single VertexRDD.
 At the end of a Pregel loop the old EdgeRDD and VertexRDD are unpersisted, 
 but there seems to be another EdgeRDD created somewhere that does not get 
 unpersisted.
 I _think_ this comes from the replicateVertex function, but I cannot be sure.
 Update - Ankur Dave says, in comments on SPARK-2011 - 
 {quote}
 ... is a bug introduced by https://github.com/apache/spark/pull/497.
 It occurs because unpersistVertices used to unpersist both the vertices and 
 the replicated vertices, but after unifying replicated vertices with edges, 
 there was no way to unpersist only one of them. I think the solution is just 
 to unpersist both the vertices and the edges in Pregel.{quote}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1988) Enable storing edges out-of-core

2014-06-04 Thread Ankur Dave (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave updated SPARK-1988:
--

Priority: Minor  (was: Major)

 Enable storing edges out-of-core
 

 Key: SPARK-1988
 URL: https://issues.apache.org/jira/browse/SPARK-1988
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Ankur Dave
Assignee: Ankur Dave
Priority: Minor

 A graph's edges are usually the largest component of the graph, and a cluster 
 may not have enough memory to hold them. For example, a graph with 20 billion 
 edges requires at least 400 GB of memory, because each edge takes 20 bytes.
 GraphX only ever accesses the edges using full table scans or cluster scans 
 using the clustered index on source vertex ID. The edges are therefore 
 amenable to being stored on disk. EdgePartition should provide the option of 
 storing edges on disk transparently and streaming through them as needed.
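
A quick back-of-the-envelope check of that figure (the per-edge byte breakdown is our assumption):

{code}
// 20 billion edges x ~20 bytes each (two 8-byte vertex IDs plus a small
// attribute) = 400 GB.
val edges = 20e9
val bytesPerEdge = 20.0
println(edges * bytesPerEdge / 1e9) // 400.0 (GB)
{code}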



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2018) Big-Endian (IBM Power7) Spark Serialization issue

2014-06-04 Thread Yanjie Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14018409#comment-14018409
 ] 

Yanjie Gao commented on SPARK-2018:
---

Thanks for your quick reply!
I believe they use the same JVM.

Do you think this may have another cause?

How can I debug it to find the reason?

Best regards!
Yanjie Gao
Here is the ps aux | grep java log:

 test1  349  0.5  3.7 2945280 195456 pts/7  Sl   02:30   0:22 
/opt/ibm/java-ppc64-70//bin/java -cp 
/home/test1/spark-1.0.0-bin-hadoop2/lib::/home/test1/src/spark-1.0.0-bin-hadoop2/conf:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/test1/src/hadoop-2.3.0-cdh5.0.0/etc/hadoop/:/home/test1/src/hadoop-2.3.0-cdh5.0.0/etc/hadoop/
 -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m 
org.apache.spark.deploy.master.Master --ip 9.114.34.69 --port 7077 --webui-port 
8080
test1  492  0.4  3.7 2946496 194432 ?  Sl   02:30   0:19 
/opt/ibm/java-ppc64-70//bin/java -cp 
/home/test1/spark-1.0.0-bin-hadoop2/lib::/home/test1/src/spark-1.0.0-bin-hadoop2/conf:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/test1/src/hadoop-2.3.0-cdh5.0.0/etc/hadoop/:/home/test1/src/hadoop-2.3.0-cdh5.0.0/etc/hadoop/
 -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m 
org.apache.spark.deploy.worker.Worker spark://9.114.34.69:7077
test1 3160  0.0  0.0 104832  2816 pts/10   S+   03:40   0:00 grep java
test113163  0.1  2.7 1631232 144256 ?  Sl   Jun02   2:00 
/opt/ibm/java-ppc64-70/bin/java -Dproc_namenode -Xmx1000m 
-Djava.net.preferIPv4Stack=true 
-Dhadoop.log.dir=/home/test1/src/hadoop-2.3.0-cdh5.0.0/logs 
-Dhadoop.log.file=hadoop.log 
-Dhadoop.home.dir=/home/test1/src/hadoop-2.3.0-cdh5.0.0 -Dhadoop.id.str=test1 
-Dhadoop.root.logger=INFO,console 
-Djava.library.path=/home/test1/src/hadoop-2.3.0-cdh5.0.0/lib/native 
-Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true 
-Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true 
-Dhadoop.log.dir=/home/test1/src/hadoop-2.3.0-cdh5.0.0/logs 
-Dhadoop.log.file=hadoop-test1-namenode-p7hvs7br16.log 
-Dhadoop.home.dir=/home/test1/src/hadoop-2.3.0-cdh5.0.0 -Dhadoop.id.str=test1 
-Dhadoop.root.logger=INFO,RFA 
-Djava.library.path=/home/test1/src/hadoop-2.3.0-cdh5.0.0/lib/native 
-Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true 
-Dhadoop.security.logger=INFO,RFAS -Dhdfs.audit.logger=INFO,NullAppender 
-Dhadoop.security.logger=INFO,RFAS -Dhdfs.audit.logger=INFO,NullAppender 
-Dhadoop.security.logger=INFO,RFAS -Dhdfs.audit.logger=INFO,NullAppender 
-Dhadoop.security.logger=INFO,RFAS 
org.apache.hadoop.hdfs.server.namenode.NameNode
test113328  0.0  2.1 1636160 113152 ?  Sl   Jun02   1:39 
/opt/ibm/java-ppc64-70/bin/java -Dproc_datanode -Xmx1000m 
-Djava.net.preferIPv4Stack=true 
-Dhadoop.log.dir=/home/test1/src/hadoop-2.3.0-cdh5.0.0/logs 
-Dhadoop.log.file=hadoop.log 
-Dhadoop.home.dir=/home/test1/src/hadoop-2.3.0-cdh5.0.0 -Dhadoop.id.str=test1 
-Dhadoop.root.logger=INFO,console 
-Djava.library.path=/home/test1/src/hadoop-2.3.0-cdh5.0.0/lib/native 
-Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true 
-Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true 
-Dhadoop.log.dir=/home/test1/src/hadoop-2.3.0-cdh5.0.0/logs 
-Dhadoop.log.file=hadoop-test1-datanode-p7hvs7br16.log 
-Dhadoop.home.dir=/home/test1/src/hadoop-2.3.0-cdh5.0.0 -Dhadoop.id.str=test1 
-Dhadoop.root.logger=INFO,RFA 
-Djava.library.path=/home/test1/src/hadoop-2.3.0-cdh5.0.0/lib/native 
-Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -server 
-Dhadoop.security.logger=ERROR,RFAS -Dhadoop.security.logger=ERROR,RFAS 
-Dhadoop.security.logger=ERROR,RFAS -Dhadoop.security.logger=INFO,RFAS 
org.apache.hadoop.hdfs.server.datanode.DataNode
test113474  0.0  2.1 1624960 113408 ?  Sl   Jun02   0:35 
/opt/ibm/java-ppc64-70/bin/java -Dproc_secondarynamenode -Xmx1000m 
-Djava.net.preferIPv4Stack=true 
-Dhadoop.log.dir=/home/test1/src/hadoop-2.3.0-cdh5.0.0/logs 
-Dhadoop.log.file=hadoop.log 
-Dhadoop.home.dir=/home/test1/src/hadoop-2.3.0-cdh5.0.0 -Dhadoop.id.str=test1 
-Dhadoop.root.logger=INFO,console 
-Djava.library.path=/home/test1/src/hadoop-2.3.0-cdh5.0.0/lib/native 
-Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true 
-Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true 

[jira] [Created] (SPARK-2028) Users of HadoopRDD cannot access the partition InputSplits

2014-06-04 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-2028:
-

 Summary: Users of HadoopRDD cannot access the partition InputSplits
 Key: SPARK-2028
 URL: https://issues.apache.org/jira/browse/SPARK-2028
 Project: Spark
  Issue Type: Bug
Reporter: Aaron Davidson
Assignee: Aaron Davidson


If a user creates a HadoopRDD (e.g., via textFile), there is no way to find out 
which file it came from, though this information is contained in the InputSplit 
within the RDD. We should find a way to expose this publicly.
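
As a sketch of the kind of access being asked for - the method name here is the one that eventually landed for this ticket, but treat the whole snippet as an assumption rather than settled API:

{code}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileSplit, InputSplit, TextInputFormat}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.HadoopRDD

val sc = new SparkContext("local", "input-split-sketch")
// hadoopFile returns a HadoopRDD under the hood, so the cast is safe.
val hadoopRdd = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data")
  .asInstanceOf[HadoopRDD[LongWritable, Text]]
// Tag every record with the file its partition's InputSplit points at.
val linesWithFiles = hadoopRdd.mapPartitionsWithInputSplit { (split: InputSplit, it) =>
  val file = split.asInstanceOf[FileSplit].getPath.toString
  it.map { case (_, text) => (file, text.toString) }
}
{code}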



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2028) Users of HadoopRDD cannot access the partition InputSplits

2014-06-04 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2028:
---

Issue Type: New Feature  (was: Bug)

 Users of HadoopRDD cannot access the partition InputSplits
 --

 Key: SPARK-2028
 URL: https://issues.apache.org/jira/browse/SPARK-2028
 Project: Spark
  Issue Type: New Feature
Reporter: Aaron Davidson
Assignee: Aaron Davidson

 If a user creates a HadoopRDD (e.g., via textFile), there is no way to find 
 out which file it came from, though this information is contained in the 
 InputSplit within the RDD. We should find a way to expose this publicly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2028) Let users of HadoopRDD access the partition InputSplits

2014-06-04 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14018474#comment-14018474
 ] 

Patrick Wendell commented on SPARK-2028:


I wantonly changed this from a Bug to a New Feature. We just never 
supported this before, but it would be nice to support in the future.

 Let users of HadoopRDD access the partition InputSplits
 ---

 Key: SPARK-2028
 URL: https://issues.apache.org/jira/browse/SPARK-2028
 Project: Spark
  Issue Type: New Feature
Reporter: Aaron Davidson
Assignee: Aaron Davidson

 If a user creates a HadoopRDD (e.g., via textFile), there is no way to find 
 out which file it came from, though this information is contained in the 
 InputSplit within the RDD. We should find a way to expose this publicly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2027) spark-ec2 puts Hadoop's log4j ahead of Spark's in classpath

2014-06-04 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2027:
---

Component/s: EC2

 spark-ec2 puts Hadoop's log4j ahead of Spark's in classpath
 ---

 Key: SPARK-2027
 URL: https://issues.apache.org/jira/browse/SPARK-2027
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.0.0
Reporter: Aaron Davidson
Assignee: Aaron Davidson





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2028) Let users of HadoopRDD access the partition InputSplits

2014-06-04 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2028:
---

Summary: Let users of HadoopRDD access the partition InputSplits  (was: 
Users of HadoopRDD cannot access the partition InputSplits)

 Let users of HadoopRDD access the partition InputSplits
 ---

 Key: SPARK-2028
 URL: https://issues.apache.org/jira/browse/SPARK-2028
 Project: Spark
  Issue Type: New Feature
Reporter: Aaron Davidson
Assignee: Aaron Davidson

 If a user creates a HadoopRDD (e.g., via textFile), there is no way to find 
 out which file it came from, though this information is contained in the 
 InputSplit within the RDD. We should find a way to expose this publicly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2024) Add saveAsSequenceFile to PySpark

2014-06-04 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14018485#comment-14018485
 ] 

Kan Zhang commented on SPARK-2024:
--

You meant SPARK-1416?

 Add saveAsSequenceFile to PySpark
 -

 Key: SPARK-2024
 URL: https://issues.apache.org/jira/browse/SPARK-2024
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Reporter: Matei Zaharia

 After SPARK-1414 we will be able to read SequenceFiles from Python, but it 
 remains to write them.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2029) Bump pom.xml version number of master branch to 1.1.0-SNAPSHOT.

2014-06-04 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-2029:


 Summary: Bump pom.xml version number of master branch to 
1.1.0-SNAPSHOT.
 Key: SPARK-2029
 URL: https://issues.apache.org/jira/browse/SPARK-2029
 Project: Spark
  Issue Type: Bug
Reporter: Takuya Ueshin


Bump pom.xml version number of master branch to 1.1.0-SNAPSHOT.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2029) Bump pom.xml version number of master branch to 1.1.0-SNAPSHOT.

2014-06-04 Thread Takuya Ueshin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14018492#comment-14018492
 ] 

Takuya Ueshin commented on SPARK-2029:
--

PRed: https://github.com/apache/spark/pull/974

 Bump pom.xml version number of master branch to 1.1.0-SNAPSHOT.
 ---

 Key: SPARK-2029
 URL: https://issues.apache.org/jira/browse/SPARK-2029
 Project: Spark
  Issue Type: Bug
Reporter: Takuya Ueshin

 Bump pom.xml version number of master branch to 1.1.0-SNAPSHOT.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2030) Bump SparkBuild.scala version number of branch-1.0 to 1.0.1-SNAPSHOT.

2014-06-04 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-2030:


 Summary: Bump SparkBuild.scala version number of branch-1.0 to 
1.0.1-SNAPSHOT.
 Key: SPARK-2030
 URL: https://issues.apache.org/jira/browse/SPARK-2030
 Project: Spark
  Issue Type: Bug
Reporter: Takuya Ueshin


Bump SparkBuild.scala version number of branch-1.0 to 1.0.1-SNAPSHOT.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2030) Bump SparkBuild.scala version number of branch-1.0 to 1.0.1-SNAPSHOT.

2014-06-04 Thread Takuya Ueshin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14018493#comment-14018493
 ] 

Takuya Ueshin commented on SPARK-2030:
--

PRed: https://github.com/apache/spark/pull/975

 Bump SparkBuild.scala version number of branch-1.0 to 1.0.1-SNAPSHOT.
 -

 Key: SPARK-2030
 URL: https://issues.apache.org/jira/browse/SPARK-2030
 Project: Spark
  Issue Type: Bug
Reporter: Takuya Ueshin

 Bump SparkBuild.scala version number of branch-1.0 to 1.0.1-SNAPSHOT.



--
This message was sent by Atlassian JIRA
(v6.2#6252)