Using TestSQLContext from multiple tests leads to: SparkException: : Task not serializable
ERROR ContextCleaner: Error cleaning broadcast 10
java.lang.NullPointerException
        at org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:246)
        at org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:46)
        at org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:66)
        at org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:185)
        at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:147)
        at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:138)
        at scala.Option.foreach(Option.scala:236)

On 15.12.2014, at 22:36, Marius Soutier <mps....@gmail.com> wrote:

> Ok, maybe these test versions will help me then. I’ll check it out.
>
> On 15.12.2014, at 22:33, Michael Armbrust <mich...@databricks.com> wrote:
>
>> Using a single SparkContext should not cause this problem. In the SQL tests
>> we use TestSQLContext and TestHive, which are global singletons for all of
>> our unit testing.
>>
>> On Mon, Dec 15, 2014 at 1:27 PM, Marius Soutier <mps....@gmail.com> wrote:
>> Possible, yes, although I’m trying everything I can to prevent it, i.e.
>> fork in Test := true and isolated specs. Can you confirm that reusing a
>> single SparkContext for multiple tests poses a problem as well?
>>
>> Other than that, just switching from SQLContext to HiveContext also
>> provoked the error.
>>
>> On 15.12.2014, at 20:22, Michael Armbrust <mich...@databricks.com> wrote:
>>
>>> Is it possible that you are starting more than one SparkContext in a
>>> single JVM without stopping previous ones? I'd try testing with Spark 1.2,
>>> which will throw an exception in this case.
>>>
>>> On Mon, Dec 15, 2014 at 8:48 AM, Marius Soutier <mps....@gmail.com> wrote:
>>> Hi,
>>>
>>> I’m seeing strange, random errors when running unit tests for my Spark
>>> jobs. In this particular case I’m using Spark SQL to read and write
>>> Parquet files, and one error that I keep running into is this one:
>>>
>>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 19
>>> in stage 6.0 failed 1 times, most recent failure: Lost task 19.0 in stage
>>> 6.0 (TID 223, localhost): java.io.IOException: PARSING_ERROR(2)
>>>     org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
>>>     org.xerial.snappy.SnappyNative.uncompressedLength(Native Method)
>>>     org.xerial.snappy.Snappy.uncompressedLength(Snappy.java:545)
>>>
>>> I can only prevent this from happening by using isolated Specs tests that
>>> always create a new SparkContext which is not shared between tests (but
>>> there can also be only a single SparkContext per test), and also by using
>>> the standard SQLContext instead of HiveContext. It does not seem to have
>>> anything to do with the actual files that I also create during the test
>>> run with SQLContext.saveAsParquetFile.
>>>
>>> Cheers
>>> - Marius
>>>
>>> PS The full trace:
>>>
>>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 19
>>> in stage 6.0 failed 1 times, most recent failure: Lost task 19.0 in stage
>>> 6.0 (TID 223, localhost): java.io.IOException: PARSING_ERROR(2)
>>>     org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
>>>     org.xerial.snappy.SnappyNative.uncompressedLength(Native Method)
>>>     org.xerial.snappy.Snappy.uncompressedLength(Snappy.java:545)
>>>     org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:125)
>>>     org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
>>>     org.xerial.snappy.SnappyInputStream.<init>(SnappyInputStream.java:58)
>>>     org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128)
>>>     org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:232)
>>>     org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readObject$1.apply$mcV$sp(TorrentBroadcast.scala:169)
>>>     org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:927)
>>>     org.apache.spark.broadcast.TorrentBroadcast.readObject(TorrentBroadcast.scala:155)
>>>     sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
>>>     sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>     java.lang.reflect.Method.invoke(Method.java:606)
>>>     java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>>>     java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>>>     java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>>     java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>>     java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>>>     java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>>>     java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>>     java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>>     java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>>>     org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
>>>     org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
>>>     org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:160)
>>>     java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>     java.lang.Thread.run(Thread.java:745)
>>> Driver stacktrace:
>>>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185) ~[spark-core_2.10-1.1.1.jar:1.1.1]
>>>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174) ~[spark-core_2.10-1.1.1.jar:1.1.1]
>>>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173) ~[spark-core_2.10-1.1.1.jar:1.1.1]
>>>     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) ~[scala-library.jar:na]
>>>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) ~[scala-library.jar:na]
>>>     at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173) ~[spark-core_2.10-1.1.1.jar:1.1.1]
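
For reference, a minimal sketch of the singleton pattern Michael describes above: one SparkContext and one SQLContext per JVM, shared by every test suite. The object name SharedTestContext and the local[2] master are made up for illustration; this is not Spark's own TestSQLContext, just the same idea under assumed names.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Hypothetical sketch: a global singleton context for unit tests.
    // The lazy vals are initialized exactly once per JVM, so every suite
    // that references them reuses the same SparkContext.
    object SharedTestContext {
      lazy val conf = new SparkConf()
        .setMaster("local[2]")      // assumed local master for unit tests
        .setAppName("unit-tests")
      lazy val sc = new SparkContext(conf)      // created on first use, never duplicated
      lazy val sqlContext = new SQLContext(sc)  // shared by all suites
    }

Suites then import SharedTestContext.sqlContext instead of creating their own contexts, so at most one SparkContext is ever alive in the test JVM.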
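On the build side, a sketch of the sbt settings Marius mentions ("fork in Test := true and isolated"), using sbt 0.13-style keys: forking keeps the test contexts out of sbt's own JVM, and disabling parallel execution stops two suites from racing to create a SparkContext at the same time. The heap size is an assumed example value.

    // build.sbt sketch (sbt 0.13-style keys, as commonly used with Spark 1.x projects)
    fork in Test := true                 // run tests in a separate, throwaway JVM
    parallelExecution in Test := false   // run suites sequentially so they don't race on context creation
    javaOptions in Test += "-Xmx2g"      // assumed heap size for the forked JVM; adjust as needed

Note that javaOptions only reach the test JVM when forking is enabled.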