I'm using Apache Mesos 0.19.0 together with Apache Spark 1.0.2 on a three-node cluster.
When using Spark's fine-grained task scheduling mode, I can reproducibly trigger what looks like a deadlock under high load. If multiple jobs are running, after some time the jobs stop submitting tasks. I have added some more log output to Spark's Scheduler implementation, and it looks as if Mesos stops making offers even though there are allocatable resources.
Below is the log from Mesos. The last task finishes normally, the resources are recovered, and the filters are removed, but the log shows no "Sending ... offers to framework" entries after that point. I have tried to wake up the offers with a reviveOffers call that I added to the Spark code, but it had no effect. The "Resources" section of the Mesos web UI shows all CPUs as idle; none are used or offered. If I kill all jobs but one, the remaining job continues and finishes normally.
Is this a bug?

Thanks,
Martin

I0807 15:17:54.605695 15727 master.cpp:2933] Sending 1 offers to framework 20140717-090825-308511242-5050-15711-0044
I0807 15:17:54.615705 15732 master.cpp:1889] Processing reply for offers: [ 20140717-090825-308511242-5050-15711-2132 ] on slave 20140717-090821-325288458-5050-2360-1 at slave(1)@10.130.99.20:5051 (ustst020-cep-node3.usu.usu.grp) for framework 20140717-090825-308511242-5050-15711-0044
I0807 15:17:54.615897 15732 master.hpp:655] Adding task 1 with resources cpus(*):1; mem(*):1 on slave 20140717-090821-325288458-5050-2360-1 (ustst020-cep-node3.usu.usu.grp)
I0807 15:17:54.616029 15732 master.cpp:3111] Launching task 1 of framework 20140717-090825-308511242-5050-15711-0044 with resources cpus(*):1; mem(*):1 on slave 20140717-090821-325288458-5050-2360-1 at slave(1)@10.130.99.20:5051 (ustst020-cep-node3.usu.usu.grp)
I0807 15:17:54.616325 15732 hierarchical_allocator_process.hpp:589] Framework 20140717-090825-308511242-5050-15711-0044 filtered slave 20140717-090821-325288458-5050-2360-1 for 8secs
I0807 15:17:58.324476 15728 master.cpp:2628] Status update TASK_RUNNING (UUID: ec5ecf90-7313-4bf1-af9e-b5f6e35189f7) for task 1 of framework 20140717-090825-308511242-5050-15711-0044 from slave 20140717-090821-325288458-5050-2360-1 at slave(1)@10.130.99.20:5051 (ustst020-cep-node3.usu.usu.grp)
I0807 15:17:58.326279 15726 master.cpp:1988] Reviving offers for framework 20140717-090825-308511242-5050-15711-0044
I0807 15:17:58.326406 15732 hierarchical_allocator_process.hpp:660] Removed filters for framework 20140717-090825-308511242-5050-15711-0044
I0807 15:18:00.993798 15726 master.cpp:2628] Status update TASK_FINISHED (UUID: ef7a4dfd-c403-483a-a6a7-c2cd995aa64e) for task 1 of framework 20140717-090825-308511242-5050-15711-0044 from slave 20140717-090821-325288458-5050-2360-1 at slave(1)@10.130.99.20:5051 (ustst020-cep-node3.usu.usu.grp)
I0807 15:18:00.994935 15726 master.hpp:673] Removing task 1 with resources cpus(*):1; mem(*):1 on slave 20140717-090821-325288458-5050-2360-1 (ustst020-cep-node3.usu.usu.grp)
I0807 15:18:00.995511 15726 master.cpp:1988] Reviving offers for framework 20140717-090825-308511242-5050-15711-0044
I0807 15:18:00.995599 15725 hierarchical_allocator_process.hpp:636] Recovered cpus(*):1; mem(*):1 (total allocatable: cpus(*):2; mem(*):2; disk(*):12526; ports(*):[31000-32000]) on slave 20140717-090821-325288458-5050-2360-1 from framework 20140717-090825-308511242-5050-15711-0044
I0807 15:18:00.995846 15725 hierarchical_allocator_process.hpp:660] Removed filters for framework 20140717-090825-308511242-5050-15711-0044
I0807 15:18:01.055794 15730 master.cpp:1988] Reviving offers for framework 20140717-090825-308511242-5050-15711-0044
I0807 15:18:01.055982 15730 hierarchical_allocator_process.hpp:660] Removed filters for framework 20140717-090825-308511242-5050-15711-0044
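
For anyone who wants to check a longer master log for the same pattern, here is a minimal sketch in Python. It counts how many "Sending N offers to framework" entries appear after the last "Recovered ... from framework" entry for a given framework ID. The function name and the two regular expressions are my own assumptions, modelled only on the line shapes in the log above, and it expects one log entry per line; other Mesos versions may word these messages differently.

```python
import re

# Assumed line shapes, taken from the master log excerpt above.
OFFER_RE = re.compile(r"Sending \d+ offers? to framework (\S+)")
RECOVER_RE = re.compile(r"Recovered .* from framework (\S+)")


def offers_after_last_recovery(lines, framework_id):
    """Count offer entries that follow the last recovery entry.

    Returns None if the log contains no 'Recovered' entry for the
    framework, otherwise the number of 'Sending N offers' entries
    seen after the most recent one.
    """
    count = None
    for line in lines:
        recovered = RECOVER_RE.search(line)
        if recovered and recovered.group(1) == framework_id:
            count = 0  # restart the count at every recovery
            continue
        offered = OFFER_RE.search(line)
        if offered and offered.group(1) == framework_id and count is not None:
            count += 1
    return count
```

Passing the lines of the master log and the framework ID 20140717-090825-308511242-5050-15711-0044 should return 0 if the framework really received no offers after its resources were last recovered, which is the situation described above.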

