Not sure if this will help, but I was running into similar failures when using my 
Spark cluster, compared to running locally: my main Java driver kept firing jobs 
that failed with error code 1 but no exception trace, until the driver gave up 
and raised the same exception in spark.scheduler.DAGScheduler.

I was able to resolve it as follows:

1. Avoid caching the initial JavaRDDs generated from parallelize().

2. Set the number of partitions explicitly in parallelize(). Inspecting 
JavaRDD.toDebugString on a cluster with 32 cores available, I found the partition 
count was sometimes 2 and sometimes 21. I had to explicitly set the partitions of 
my generated JavaRDD to 32 (the maximum I have).

3. I did not set any worker memory in spark-env.sh, which lets each worker use 
the maximum memory available on the box minus 1 GB (i.e. 61.9 GB per worker in 
my case).

4. I did not set any driver memory in spark-env.sh either.

5. Paginate the data loading from the source when generating the initial 
JavaRDDs, i.e. instead of issuing one request to load all the data, I divided 
the loading into chunks small enough to keep the worker jobs happy.
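For reference, items 3 and 4 amount to leaving the memory settings commented out in spark-env.sh. A minimal sketch of what that might look like (the variable names match the Spark 0.7-era configuration, and the example values are illustrative, not from the original post):

```shell
# spark-env.sh -- memory settings deliberately left unset (items 3 and 4 above).
# With SPARK_WORKER_MEMORY unset, each worker defaults to using the machine's
# total RAM minus 1 GB (e.g. 61.9 GB on a ~63 GB box).

# SPARK_WORKER_MEMORY=16g   # left commented out: let the worker take max - 1 GB
# SPARK_MEM=4g              # left commented out: use the default JVM memory size
```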
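Items 2 and 5 above can be sketched roughly as follows. This is plain Java without Spark on the classpath, so the driver-side parallelize() call is shown only as a comment; the helper names and chunk sizes are hypothetical, not from the original post:

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkedLoad {
    // Split a large record range into fixed-size chunks so that each load
    // request from the source stays small (item 5 above).
    static List<int[]> chunks(int total, int chunkSize) {
        List<int[]> out = new ArrayList<>();
        for (int start = 0; start < total; start += chunkSize) {
            int end = Math.min(start + chunkSize, total);
            out.add(new int[] { start, end }); // half-open range [start, end)
        }
        return out;
    }

    public static void main(String[] args) {
        // e.g. 100,000 records loaded 10,000 at a time -> 10 small requests
        for (int[] c : chunks(100_000, 10_000)) {
            // In the real driver, each chunk would be fetched and turned into
            // a JavaRDD with an explicit partition count (item 2 above), e.g.:
            //   JavaRDD<Row> rdd = sc.parallelize(loadRows(c[0], c[1]), 32);
            System.out.println(c[0] + ".." + c[1]);
        }
    }
}
```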

Thanks,
Hussam

From: Craig Vanderborgh [mailto:[email protected]]
Sent: Wednesday, November 13, 2013 2:38 PM
To: [email protected]
Subject: Spark Streaming Job Fails to Run Under Mesos-0.14

Hi all,

We have a Spark Streaming job that's been working great under Mesos for many 
months with "local" as the master.
We just tried to run it on our new Mesos cluster.  This cluster has been set up 
properly, and the Spark examples (e.g. SparkPi) run distributed, correctly, 
under Mesos.  But our job does not.
This stack trace is widely seen on the web, but nowhere is a root cause 
identified.  Now we're seeing it as well:
$ ./stagingviewbeta.sh
java -cp 
/opt/mapr/hadoop/hadoop-0.20.2/conf:/opt/mapr/hadoop/hadoop-0.20.2/lib/hadoop-0.20.2-dev-core.jar:/opt/mapr/hadoop/hadoop-0.20.2/lib/maprfs-0.20.2-2.1.3.jar:/opt/mapr/hadoop/hadoop-0.20.2/lib/maprfs-diagnostic-tools-0.20.2-2.1.3.jar:/opt/mapr/hadoop/hadoop-0.20.2/lib/maprfs-jni-0.20.2-2.1.3.jar:./target/scala-2.9.2/vodviews2_2.9.2-1.0.jar:./lib/eventgateway2.jar:./lib/spark-core-assembly-0.7.3.jar:./lib/spark-streaming-assembly-0.7.3.jar:/usr/share/spark/lib/spark-core-assembly-0.7.3.jar:/usr/share/spark/lib/spark-streaming-assembly-0.7.3.jar:/usr/share/spark/lib/spark-examples_2.9.3-0.7.3.jar:/usr/share/spark/lib/spark-repl_2.9.3-0.7.3.jar:/usr/share/java/scala-library.jar:/usr/share/java/scala-compiler.jar:/usr/share/java/jline.jar
 -Dspark.local.dir=/spark/tmp/viewbeta  -Djava.library.path=/opt/mapr/lib 
-Xms6g -Xmx6g ViewBeta mesos://bigd-mesos-01:5050 
cdp-sleuth-kafka-01.cdp.webapps.rr.com:2181
 localhost:9092 stagingviewbeta5 prod-eg_v2_2-big_data 1 true
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/home/craigv/viewbeta/lib/spark-core-assembly-0.7.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/home/craigv/viewbeta/lib/spark-streaming-assembly-0.7.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/usr/share/spark/lib/spark-core-assembly-0.7.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/usr/share/spark/lib/spark-streaming-assembly-0.7.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
13/11/13 22:29:17 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
topicMap: Map(prod-eg_v2_2-big_data_v0.3.201-8158c -> 1)
13/11/13 22:29:20 ERROR cluster.TaskSetManager: Task 1.0:1 failed more than 4 
times; aborting job
Exception in thread "Thread-27" spark.SparkException: Job failed: Task 1.0:1 
failed more than 4 times
        at 
spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:642)
        at 
spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:640)
        at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:640)
        at spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:303)
        at 
spark.scheduler.DAGScheduler.spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:364)
        at spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:107)
13/11/13 22:29:21 ERROR cluster.TaskSetManager: Task 3.0:2 failed more than 4 
times; aborting job
Exception in thread "Thread-30" spark.SparkException: Job failed: Task 3.0:2 
failed more than 4 times
        at 
spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:642)
        at 
spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:640)
        at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:640)
        at spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:303)
        at 
spark.scheduler.DAGScheduler.spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:364)
        at spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:107)
Does anyone know what's causing this?  There is very little information to go 
on here, and as soon as these two exceptions burp forth, the job hangs.
Please advise.

Thanks in advance,
Craig Vanderborgh
