Hi all,

We have a Spark Streaming job that's been working great for many months with "local" as the master.
We just tried to run it on our new Mesos cluster. This cluster has been set up properly, and the Spark examples (e.g. SparkPi) run distributed, correctly, under Mesos. But our job does not. This stack trace is widely seen on the web, but nowhere is a root cause identified. Now we're seeing it as well:

$ ./stagingviewbeta.sh
java -cp /opt/mapr/hadoop/hadoop-0.20.2/conf:/opt/mapr/hadoop/hadoop-0.20.2/lib/hadoop-0.20.2-dev-core.jar:/opt/mapr/hadoop/hadoop-0.20.2/lib/maprfs-0.20.2-2.1.3.jar:/opt/mapr/hadoop/hadoop-0.20.2/lib/maprfs-diagnostic-tools-0.20.2-2.1.3.jar:/opt/mapr/hadoop/hadoop-0.20.2/lib/maprfs-jni-0.20.2-2.1.3.jar:./target/scala-2.9.2/vodviews2_2.9.2-1.0.jar:./lib/eventgateway2.jar:./lib/spark-core-assembly-0.7.3.jar:./lib/spark-streaming-assembly-0.7.3.jar:/usr/share/spark/lib/spark-core-assembly-0.7.3.jar:/usr/share/spark/lib/spark-streaming-assembly-0.7.3.jar:/usr/share/spark/lib/spark-examples_2.9.3-0.7.3.jar:/usr/share/spark/lib/spark-repl_2.9.3-0.7.3.jar:/usr/share/java/scala-library.jar:/usr/share/java/scala-compiler.jar:/usr/share/java/jline.jar -Dspark.local.dir=/spark/tmp/viewbeta -Djava.library.path=/opt/mapr/lib -Xms6g -Xmx6g ViewBeta mesos://bigd-mesos-01:5050 cdp-sleuth-kafka-01.cdp.webapps.rr.com:2181 localhost:9092 stagingviewbeta5 prod-eg_v2_2-big_data 1 true
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/craigv/viewbeta/lib/spark-core-assembly-0.7.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/craigv/viewbeta/lib/spark-streaming-assembly-0.7.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/share/spark/lib/spark-core-assembly-0.7.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/share/spark/lib/spark-streaming-assembly-0.7.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
13/11/13 22:29:17 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
topicMap: Map(prod-eg_v2_2-big_data_v0.3.201-8158c -> 1)
13/11/13 22:29:20 ERROR cluster.TaskSetManager: Task 1.0:1 failed more than 4 times; aborting job
Exception in thread "Thread-27" spark.SparkException: Job failed: Task 1.0:1 failed more than 4 times
    at spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:642)
    at spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:640)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:640)
    at spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:303)
    at spark.scheduler.DAGScheduler.spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:364)
    at spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:107)
13/11/13 22:29:21 ERROR cluster.TaskSetManager: Task 3.0:2 failed more than 4 times; aborting job
Exception in thread "Thread-30" spark.SparkException: Job failed: Task 3.0:2 failed more than 4 times
    at spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:642)
    at spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:640)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:640)
    at spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:303)
    at spark.scheduler.DAGScheduler.spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:364)
    at spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:107)

Does anyone know what's causing this? There is very little information to go on here, and as soon as these two exceptions burp forth the job hangs.

Please advise.

Thanks in advance,
Craig Vanderborgh
