I'm using a current Spark 1.0.0-SNAPSHOT for Hadoop 2.2.0 on Mesos 0.17.0.

If I run a single Spark job, it runs fine on Mesos. Running multiple
Spark jobs in parallel also works, as long as I use coarse-grained mode
("spark.mesos.coarse" = true).
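For reference, this is how I enable the coarse-grained mode (a minimal sketch; the property can equally be set programmatically via SparkConf.set before creating the SparkContext):

```
# conf/spark-defaults.conf
# Run one long-lived Mesos executor per node instead of
# one short-lived Mesos task per Spark task (fine-grained mode).
spark.mesos.coarse   true
```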

But if I run two Spark jobs in parallel using the fine-grained mode, the
jobs seem to block each other after a few seconds.
In this state, the Mesos UI reports neither idle nor used CPUs.

As soon as I kill one job, the other continues normally. See below for some
log output.
It looks to me as if something goes wrong when assigning resources to the
two jobs.

Can anybody give me a hint about the cause? The jobs read some HDFS files,
but have no other communication with external processes.
Or does anyone have other suggestions on how to analyze this problem?

Thanks,

Martin

-----
Here is the relevant log output of job1:

INFO 17:53:09,247 Missing parents for Stage 2: List()
INFO 17:53:09,250 Submitting Stage 2 (MapPartitionsRDD[9] at mapPartitions at HighTemperatureSpansPerLogfile.java:92), which is now runnable
INFO 17:53:09,269 Submitting 1 missing tasks from Stage 2 (MapPartitionsRDD[9] at mapPartitions at HighTemperatureSpansPerLogfile.java:92)
INFO 17:53:09,269 Adding task set 2.0 with 1 tasks
................................................................................
 
*** at this point job 1 was killed ***
 
 
log output of job2:
INFO 17:53:04,874 Missing parents for Stage 6: List()
INFO 17:53:04,875 Submitting Stage 6 (MappedRDD[23] at values at ComputeLogFileTimespan.java:71), which is now runnable
INFO 17:53:04,881 Submitting 1 missing tasks from Stage 6 (MappedRDD[23] at values at ComputeLogFileTimespan.java:71)
INFO 17:53:04,882 Adding task set 6.0 with 1 tasks
................................................................................
*** at this point job 1 was killed ***
INFO 18:01:39,307 Starting task 6.0:0 as TID 7 on executor 20140501-141732-308511242-5050-2657-1: ustst019-cep-node2.usu.usu.grp (PROCESS_LOCAL)
INFO 18:01:39,307 Serialized task 6.0:0 as 3052 bytes in 0 ms
INFO 18:01:39,328 Asked to send map output locations for shuffle 2 to sp...@ustst018-cep-node1.usu.usu.grp:40542
INFO 18:01:39,328 Size of output statuses for shuffle 2 is 178 bytes



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Dead-lock-running-multiple-Spark-Jobs-on-Mesos-tp5611.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.