Hello,

I am using Spark 1.3.1-hadoop2.4 with Mesos 0.22.1 (with ZooKeeper) on a cluster of 3 nodes running 64-bit Ubuntu.

My application is compiled against Spark 1.3.1 (which apparently brings in a Mesos 0.21.0 dependency), Hadoop 2.5.1-mapr-1503 and Akka 2.3.10. This is the only combination with which I have managed to run Spark jobs on Mesos at all; other version combinations cause class loader issues.
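
For reference, a simplified sketch of how the dependencies are wired up in sbt (the MapR resolver URL and exact artifact names here are my reconstruction, not copied verbatim from the real build file):

// Simplified sbt sketch of the dependency combination described above;
// resolver URL and artifact names are reconstructed from memory, not the
// exact build file.
scalaVersion := "2.10.4"

resolvers += "mapr-releases" at "http://repository.mapr.com/maven/"

libraryDependencies ++= Seq(
  // Spark 1.3.1 itself pulls in the mesos 0.21.0 client transitively;
  // marked provided assuming the job jar does not bundle Spark itself
  "org.apache.spark"  %% "spark-core"    % "1.3.1" % "provided",
  "org.apache.hadoop"  % "hadoop-client" % "2.5.1-mapr-1503",
  "com.typesafe.akka" %% "akka-actor"    % "2.3.10"
)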

I am submitting Spark jobs with spark-submit, using mesos://zk://.../mesos as the master URL.
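
The submit command looks roughly like this (class name and jar are placeholders, the ZooKeeper hosts are omitted as above, and spark.executor.uri, or the equivalent entry in spark-defaults.conf, points at the tarball that the Mesos fetcher downloads in the sandbox log further down):

spark-submit \
  --master mesos://zk://.../mesos \
  --conf spark.executor.uri=hdfs://dev-hadoop01/apps/spark-1.3.1-bin-hadoop2.4.tgz \
  --class <main class of the job> \
  <path to assembly jar> [job arguments]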

About 50% of all jobs stall forever (or until I kill the Spark driver).
The error occurs randomly on different slave nodes. Sometimes 4 Spark jobs in a row run without any problems and then the problem suddenly appears. I am always testing the same set of 5 different jobs, individually and combined, and the error always shows up in a different job/node/stage/task combination.

Whenever a slave node stalls, this message appears in the sandbox log of the failing slave:

10:01:34 ERROR MesosExecutorBackend: Received launchTask but executor was null
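
Judging by the ordering in the sandbox log further down (the ERROR line is printed just before "Executor registered on slave ..."), it looks to me as if launchTask reaches MesosExecutorBackend before its registered() callback has finished creating the executor, so the task is neither run nor reported as failed, which would explain why the driver waits forever. A toy Scala sketch of the race I suspect (not actual Spark or Mesos code, just an illustration):

// Toy illustration of the suspected ordering problem, NOT Spark/Mesos code:
// a callback-driven backend whose launchTask() can be invoked before
// registered() has assigned the executor, so the first task is only logged
// as an error and never executed or failed.
object LaunchBeforeRegisteredDemo {

  class Backend {
    @volatile var executor: AnyRef = null          // assigned by registered()

    def registered(): Unit =
      executor = new Object                        // heavy init in real life

    def launchTask(taskId: Int): Unit =
      if (executor == null)
        println(s"ERROR Received launchTask but executor was null (task $taskId)")
      else
        println(s"running task $taskId")
  }

  def main(args: Array[String]): Unit = {
    val backend = new Backend
    backend.launchTask(1)   // delivered before registration -> task is lost
    backend.registered()
    backend.launchTask(2)   // subsequent tasks run normally
  }
}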

Any hints on how to address this issue are greatly appreciated.

kind regards
reinis


A job that stalls shows the following in the Spark driver log (as one can see, task 1.0 is never finished):

10:01:25,620 INFO o.a.s.s.DAGScheduler - Submitting 4 missing tasks from Stage 0 (MapPartitionsRDD[1] at groupBy at ImportExtensionFieldsSparkJob.scala:57)
10:01:25,621 INFO o.a.s.s.TaskSchedulerImpl - Adding task set 0.0 with 4 tasks
10:01:25,656 INFO o.a.s.s.TaskSetManager - Starting task 0.0 in stage 0.0 (TID 0, app03, PROCESS_LOCAL, 1140 bytes)
10:01:25,660 INFO o.a.s.s.TaskSetManager - Starting task 1.0 in stage 0.0 (TID 1, app01, PROCESS_LOCAL, 1140 bytes)
10:01:25,661 INFO o.a.s.s.TaskSetManager - Starting task 2.0 in stage 0.0 (TID 2, app02, PROCESS_LOCAL, 1140 bytes)
10:01:25,662 INFO o.a.s.s.TaskSetManager - Starting task 3.0 in stage 0.0 (TID 3, app03, PROCESS_LOCAL, 1140 bytes)
10:01:36,842 INFO o.a.s.s.BlockManagerMasterActor - Registering block manager app02 with 88.3 MB RAM, BlockManagerId(20150511-150924-3410235146-5050-1903-S1, app02, 59622)
10:01:36,862 INFO o.a.s.s.BlockManagerMasterActor - Registering block manager app03 with 88.3 MB RAM, BlockManagerId(20150511-150924-3410235146-5050-1903-S2, app03, 39420)
10:01:36,917 INFO o.a.s.s.BlockManagerMasterActor - Registering block manager app01 with 88.3 MB RAM, BlockManagerId(20150511-150924-3410235146-5050-1903-S3, app01, 45605)
10:01:38,701 INFO o.a.s.s.BlockManagerInfo - Added broadcast_2_piece0 in memory on app03 (size: 2.6 KB, free: 88.3 MB)
10:01:38,702 INFO o.a.s.s.BlockManagerInfo - Added broadcast_2_piece0 in memory on app02 (size: 2.6 KB, free: 88.3 MB)
10:01:41,400 INFO o.a.s.s.TaskSetManager - Finished task 0.0 in stage 0.0 (TID 0) in 15721 ms on app03 (1/4)
10:01:41,539 INFO o.a.s.s.TaskSetManager - Finished task 2.0 in stage 0.0 (TID 2) in 15870 ms on app02 (2/4)
10:01:41,697 INFO o.a.s.s.TaskSetManager - Finished task 3.0 in stage 0.0 (TID 3) in 16029 ms on app03 (3/4)

The sandbox log of slave node app01 (the one that stalls) shows the following:

10:01:25.815506 35409 fetcher.cpp:214] Fetching URI 'hdfs://dev-hadoop01/apps/spark-1.3.1-bin-hadoop2.4.tgz'
10:01:26.497764 35409 fetcher.cpp:99] Fetching URI 'hdfs://dev-hadoop01/apps/spark-1.3.1-bin-hadoop2.4.tgz' using Hadoop Client
10:01:26.497869 35409 fetcher.cpp:109] Downloading resource from 'hdfs://dev-hadoop01/apps/spark-1.3.1-bin-hadoop2.4.tgz' to '/tmp/mesos/slaves/20150511-150924-3410235146-5050-1903-S3/frameworks/20150511-150924-3410235146-5050-1903-0249/executors/20150511-150924-3410235146-5050-1903-S3/runs/ec3a0f13-2f44-4952-bb23-4d48abaacc05/spark-1.3.1-bin-hadoop2.4.tgz'
10:01:32.877717 35409 fetcher.cpp:78] Extracted resource '/tmp/mesos/slaves/20150511-150924-3410235146-5050-1903-S3/frameworks/20150511-150924-3410235146-5050-1903-0249/executors/20150511-150924-3410235146-5050-1903-S3/runs/ec3a0f13-2f44-4952-bb23-4d48abaacc05/spark-1.3.1-bin-hadoop2.4.tgz' into '/tmp/mesos/slaves/20150511-150924-3410235146-5050-1903-S3/frameworks/20150511-150924-3410235146-5050-1903-0249/executors/20150511-150924-3410235146-5050-1903-S3/runs/ec3a0f13-2f44-4952-bb23-4d48abaacc05'
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
10:01:34 INFO MesosExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
10:01:34.459292 35730 exec.cpp:132] Version: 0.22.0
*10:01:34 ERROR MesosExecutorBackend: Received launchTask but executor was null*
10:01:34.540870 35765 exec.cpp:206] Executor registered on slave 20150511-150924-3410235146-5050-1903-S3
10:01:34 INFO MesosExecutorBackend: Registered with Mesos as executor ID 20150511-150924-3410235146-5050-1903-S3 with 1 cpus
10:01:34 INFO SecurityManager: Changing view acls to...
10:01:35 INFO Slf4jLogger: Slf4jLogger started
10:01:35 INFO Remoting: Starting remoting
10:01:35 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor@app01:xxx]
10:01:35 INFO Utils: Successfully started service 'sparkExecutor' on port xxx.
10:01:35 INFO AkkaUtils: Connecting to MapOutputTracker: akka.tcp://sparkDriver@dev-web01/user/MapOutputTracker
10:01:35 INFO AkkaUtils: Connecting to BlockManagerMaster: akka.tcp://sparkDriver@dev-web01/user/BlockManagerMaster
10:01:36 INFO DiskBlockManager: Created local directory at /tmp/spark-52a6585a-f9f2-4ab6-bebc-76be99b0c51c/blockmgr-e6d79818-fe30-4b5c-bcd6-8fbc5a201252
10:01:36 INFO MemoryStore: MemoryStore started with capacity 88.3 MB
10:01:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
10:01:36 INFO AkkaUtils: Connecting to OutputCommitCoordinator: akka.tcp://sparkDriver@dev-web01/user/OutputCommitCoordinator
10:01:36 INFO Executor: Starting executor ID 20150511-150924-3410235146-5050-1903-S3 on host app01
10:01:36 INFO NettyBlockTransferService: Server created on XXX
10:01:36 INFO BlockManagerMaster: Trying to register BlockManager
10:01:36 INFO BlockManagerMaster: Registered BlockManager
10:01:36 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@dev-web01/user/HeartbeatReceiver

As soon as the Spark driver is aborted, the following log entries are added to the sandbox log of slave node app01:

10:17:29.559433 35772 exec.cpp:379] Executor asked to shutdown
10:17:29 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@dev-web01] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]

A successful job, by contrast, shows the following in the Spark driver log:

08:03:19,862 INFO o.a.s.s.TaskSetManager - Finished task 3.0 in stage 1.0 (TID 7) in 1688 ms on app01 (1/4)
08:03:19,869 INFO o.a.s.s.TaskSetManager - Finished task 0.0 in stage 1.0 (TID 4) in 1700 ms on app03 (2/4)
08:03:19,874 INFO o.a.s.s.TaskSetManager - Finished task 1.0 in stage 1.0 (TID 5) in 1703 ms on app02 (3/4)
08:03:19,878 INFO o.a.s.s.TaskSetManager - Finished task 2.0 in stage 1.0 (TID 6) in 1706 ms on app02 (4/4)
08:03:19,878 INFO o.a.s.s.DAGScheduler - Stage 1 (saveAsNewAPIHadoopDataset at ImportSparkJob.scala:90) finished in 1.718 s
08:03:19,878 INFO o.a.s.s.TaskSchedulerImpl - Removed TaskSet 1.0, whose tasks have all completed, from pool
08:03:19,886 INFO o.a.s.s.DAGScheduler - Job 0 finished: saveAsNewAPIHadoopDataset at ImportSparkJob.scala:90, took 16.946405 s

This corresponds nicely to the sandbox logs of the slave nodes:

08:03:19 INFO Executor: Finished task 3.0 in stage 1.0 (TID 7). 872 bytes result sent to driver
08:03:19 INFO Executor: Finished task 0.0 in stage 1.0 (TID 4). 872 bytes result sent to driver
08:03:19 INFO Executor: Finished task 1.0 in stage 1.0 (TID 5). 872 bytes result sent to driver
08:03:19 INFO Executor: Finished task 2.0 in stage 1.0 (TID 6). 872 bytes result sent to driver
08:03:20 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@dev-web01] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
