I've got a dataset saved with saveAsPickleFile using PySpark -- it saves
without problems. When I try to read it back in, it fails with:

Job aborted due to stage failure: Task 401 in stage 0.0 failed 4 times, most
recent failure: Lost task 401.3 in stage 0.0 (TID 449, e1326.hpc-lca.ethz.ch):
java.lang.NegativeArraySizeException
        org.apache.hadoop.io.BytesWritable.setCapacity(BytesWritable.java:119)
        org.apache.hadoop.io.BytesWritable.setSize(BytesWritable.java:98)
        org.apache.hadoop.io.BytesWritable.readFields(BytesWritable.java:153)
        org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
        org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
        org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1875)
        org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1848)
        org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
        org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
        org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:219)
        org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:188)
        org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
        org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
        scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
        org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:330)
        org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
        org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
        org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
        org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
        org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)
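For reference, the save/load pattern is essentially the following (the path
and data here are placeholders, not the actual job):

    from pyspark import SparkContext

    sc = SparkContext(appName="pickle-roundtrip")

    # Write: saveAsPickleFile batches records into pickled blobs and
    # stores each batch as a BytesWritable value in a SequenceFile.
    rdd = sc.parallelize(range(1000))
    rdd.saveAsPickleFile("hdfs:///tmp/pickled-rdd")

    # Read back: this is the step that dies with the
    # NegativeArraySizeException above, while Hadoop deserializes
    # the BytesWritable values.
    loaded = sc.pickleFile("hdfs:///tmp/pickled-rdd")
    loaded.count()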


I'm not really sure where to start looking for the culprit -- any suggestions
would be most welcome. Thanks!

Rok