Dell - Internal Use - Confidential

Not sure if this will help, but I was running into similar failures when using my Spark cluster compared to running locally: my main Java driver kept firing jobs that failed with error code 1 but no exception trace, until the main driver gave up and hit the same exception in spark.scheduler.DAGScheduler.
I was able to resolve it as follows:

1. Avoid caching the initial JavaRDDs generated from parallelize.
2. Set the partition count explicitly in parallelize. Using JavaRDD.toDebugString on a cluster with 32 cores available, I found the partition count was sometimes 2 and sometimes 21, so I explicitly set the partitions of my generated JavaRDDs to 32 (the maximum I have).
3. Do not set any worker memory in spark-env.sh, which lets each worker use the maximum memory available on the box minus 1 GB (i.e. 61.9 GB per worker in my case).
4. Do not set any main driver memory in spark-env.sh either.
5. Paginate the data loading from the source when generating the initial JavaRDDs: instead of issuing one request to load all the data, I divided the loading into chunks small enough to keep my worker jobs happy.

Thanks,
Hussam

From: Craig Vanderborgh [mailto:[email protected]]
Sent: Wednesday, November 13, 2013 2:38 PM
To: [email protected]
Subject: Spark Streaming Job Fails to Run Under Mesos-0.14

Hi all,

We have a Spark Streaming job that's been working great under Mesos for many months with "local" as the master. We just tried to run it on our new Mesos cluster. The cluster has been set up properly, and the Spark examples (e.g. SparkPi) run distributed, correctly, under Mesos. But our job does not. The stack trace below is widely seen on the web, but nowhere is a root cause identified.
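Hussam's points 2 and 5 above can be sketched roughly as follows. This is a minimal plain-Java illustration, not the actual driver code: the total record count, chunk size, and loadChunk helper are hypothetical stand-ins for whatever paginated request the real data source supports. In a real driver, each chunk would then be handed to something like sc.parallelize(chunk, 32), with the partition count set explicitly rather than left to the default.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkedLoad {
    // Hypothetical stand-in for one paginated request against the data source:
    // returns records [offset, offset+limit), clipped to the total record count.
    static List<Integer> loadChunk(int offset, int limit, int total) {
        List<Integer> chunk = new ArrayList<>();
        for (int i = offset; i < Math.min(offset + limit, total); i++) {
            chunk.add(i);
        }
        return chunk;
    }

    public static void main(String[] args) {
        int total = 100_000;    // total records in the source (hypothetical)
        int chunkSize = 10_000; // small enough to keep the worker jobs happy

        int chunks = 0;
        long loaded = 0;
        for (int offset = 0; offset < total; offset += chunkSize) {
            List<Integer> chunk = loadChunk(offset, chunkSize, total);
            // In the real driver each chunk would become an RDD with an
            // explicit partition count (point 2), e.g.:
            //   JavaRDD<Integer> rdd = sc.parallelize(chunk, 32);
            chunks++;
            loaded += chunk.size();
        }
        System.out.println(chunks + " chunks, " + loaded + " records");
    }
}
```

The point is simply that no single request ever materializes the whole dataset in the driver; each chunk is bounded, so each generated RDD is bounded too.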
Now we're seeing it as well:

$ ./stagingviewbeta.sh
java -cp /opt/mapr/hadoop/hadoop-0.20.2/conf:/opt/mapr/hadoop/hadoop-0.20.2/lib/hadoop-0.20.2-dev-core.jar:/opt/mapr/hadoop/hadoop-0.20.2/lib/maprfs-0.20.2-2.1.3.jar:/opt/mapr/hadoop/hadoop-0.20.2/lib/maprfs-diagnostic-tools-0.20.2-2.1.3.jar:/opt/mapr/hadoop/hadoop-0.20.2/lib/maprfs-jni-0.20.2-2.1.3.jar:./target/scala-2.9.2/vodviews2_2.9.2-1.0.jar:./lib/eventgateway2.jar:./lib/spark-core-assembly-0.7.3.jar:./lib/spark-streaming-assembly-0.7.3.jar:/usr/share/spark/lib/spark-core-assembly-0.7.3.jar:/usr/share/spark/lib/spark-streaming-assembly-0.7.3.jar:/usr/share/spark/lib/spark-examples_2.9.3-0.7.3.jar:/usr/share/spark/lib/spark-repl_2.9.3-0.7.3.jar:/usr/share/java/scala-library.jar:/usr/share/java/scala-compiler.jar:/usr/share/java/jline.jar -Dspark.local.dir=/spark/tmp/viewbeta -Djava.library.path=/opt/mapr/lib -Xms6g -Xmx6g ViewBeta mesos://bigd-mesos-01:5050 cdp-sleuth-kafka-01.cdp.webapps.rr.com:2181 localhost:9092 stagingviewbeta5 prod-eg_v2_2-big_data 1 true

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/craigv/viewbeta/lib/spark-core-assembly-0.7.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/craigv/viewbeta/lib/spark-streaming-assembly-0.7.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/share/spark/lib/spark-core-assembly-0.7.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/share/spark/lib/spark-streaming-assembly-0.7.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
13/11/13 22:29:17 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform...
using builtin-java classes where applicable
topicMap: Map(prod-eg_v2_2-big_data_v0.3.201-8158c -> 1)
13/11/13 22:29:20 ERROR cluster.TaskSetManager: Task 1.0:1 failed more than 4 times; aborting job
Exception in thread "Thread-27" spark.SparkException: Job failed: Task 1.0:1 failed more than 4 times
    at spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:642)
    at spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:640)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:640)
    at spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:303)
    at spark.scheduler.DAGScheduler.spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:364)
    at spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:107)
13/11/13 22:29:21 ERROR cluster.TaskSetManager: Task 3.0:2 failed more than 4 times; aborting job
Exception in thread "Thread-30" spark.SparkException: Job failed: Task 3.0:2 failed more than 4 times
    at spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:642)
    at spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:640)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:640)
    at spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:303)
    at spark.scheduler.DAGScheduler.spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:364)
    at spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:107)

Does anyone know what's causing this? There is very little information to go on here, and as soon as these two exceptions burp forth the job hangs. Please advise.

Thanks in advance,
Craig Vanderborgh
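For what it's worth, the aborts in the trace above are Spark's task-retry limit kicking in: the TaskSetManager retries each failed task a fixed number of times, and once the limit is exceeded it aborts the whole stage, which surfaces as the SparkException in the DAGScheduler thread. A rough plain-Java illustration of that retry-then-abort pattern (a sketch of the behavior only, not Spark's actual code; the method names here are made up):

```java
import java.util.concurrent.Callable;

public class RetryLimit {
    // Retry a task up to maxFailures attempts, then give up, mirroring
    // "Task ... failed more than 4 times; aborting job" in the log above.
    static <T> T runWithRetries(Callable<T> task, int maxFailures) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxFailures; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e; // record the failure and retry
            }
        }
        throw new Exception("Task failed more than " + maxFailures + " times", last);
    }

    public static void main(String[] args) {
        int[] attempts = {0};
        try {
            // A task that always fails, like the exit-code-1 tasks in this thread.
            runWithRetries(() -> {
                attempts[0]++;
                throw new RuntimeException("task exited with code 1");
            }, 4);
        } catch (Exception e) {
            System.out.println(e.getMessage() + " (" + attempts[0] + " attempts)");
        }
    }
}
```

The practical implication is that the interesting error is whatever made the individual task attempts fail; the DAGScheduler trace only reports that the retry budget was exhausted, so the per-task executor logs are where the root cause hides.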
