I'm trying to read Thrift objects from a SequenceFile using elephant-bird's ThriftWritable. My code looks like this:
    val rawData = sc.sequenceFile[BooleanWritable, ThriftWritable[TrainingSample]](input)
    val samples = rawData.map { case (key, value) =>
      value.setConverter(classOf[TrainingSample])
      val conversion = if (key.get) 1 else 0
      val sample = value.get
      (conversion, sample)
    }

When I spark-submit in local mode, it fails with:

    Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times,
    most recent failure: Lost task 0.0 in stage 1.0 (TID 2, localhost):
    java.lang.AbstractMethodError: org.apache.thrift.TUnion.standardSchemeReadValue(Lorg/apache/thrift/protocol/TProtocol;Lorg/apache/thrift/protocol/TField;)Ljava/lang/Object;
    ...

I'm fairly sure this is caused by a libthrift version conflict: my project uses thrift-0.6.1, while Spark uses 0.9.2, which requires TUnion subclasses to implement the abstract standardSchemeReadValue method. But when I set spark.files.userClassPathFirst=true, it fails even earlier:

    Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times,
    most recent failure: Lost task 1.0 in stage 0.0 (TID 1, localhost):
    java.lang.ClassCastException: cannot assign instance of scala.None$ to
    field org.apache.spark.scheduler.Task.metrics of type scala.Option in
    instance of org.apache.spark.scheduler.ResultTask
        at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2089)
        at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2006)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
        at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
        at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

It seems I have introduced an even bigger conflict, but I can't figure out which dependency is causing this failure. Interestingly, when I run mvn test in my project, which runs the Spark job in local mode, everything works fine.

So what is the right way to give user jars precedence over Spark jars?

--
Yizhi Liu
Senior Software Engineer / Data Mining
www.mvad.com, Shanghai, China
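
P.S. For reference, a minimal sketch of how the flag is set (the app name below is a placeholder; I'm showing the programmatic SparkConf form, which I believe is equivalent to passing --conf spark.files.userClassPathFirst=true to spark-submit):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch of the driver setup; only the userClassPathFirst line matters here.
    val conf = new SparkConf()
      .setAppName("thrift-sequencefile-reader")       // placeholder app name
      .set("spark.files.userClassPathFirst", "true")  // the flag mentioned above
    val sc = new SparkContext(conf)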