In the meantime I have found the cause of the problem and created an issue:
https://issues.apache.org/jira/browse/MESOS-1688

On 07.08.2014 at 16:03, Martin Weindel wrote:
I'm using Apache Mesos 0.19.0 together with Apache Spark 1.0.2 on a three node cluster.

When using Spark's fine-grained task scheduling mode, I can reproducibly trigger what looks like a deadlock under high load: if multiple jobs are running, after some time the jobs stop submitting any tasks.

I have added some more log output to Spark's scheduler implementation, and it looks as if Mesos stops making any offers, even though allocatable resources are available.

Below is the log from Mesos. The last task finishes normally, the resources are recovered, and the filters are removed, but the log shows no "Sending ... offers to framework" entries after that point. I tried to wake up the offers by adding a reviveOffers call to the Spark code, but to no effect. The "Resources" section of the Mesos web UI shows all CPUs as idle; none are used or offered.

If I kill all jobs but one, the remaining job continues and finishes normally.

Is this a bug?

Thanks,
Martin
I0807 15:17:54.605695 15727 master.cpp:2933] Sending 1 offers to framework 20140717-090825-308511242-5050-15711-0044
I0807 15:17:54.615705 15732 master.cpp:1889] Processing reply for offers: [ 20140717-090825-308511242-5050-15711-2132 ] on slave 20140717-090821-325288458-5050-2360-1 at slave(1)@10.130.99.20:5051 (ustst020-cep-node3.usu.usu.grp) for framework 20140717-090825-308511242-5050-15711-0044
I0807 15:17:54.615897 15732 master.hpp:655] Adding task 1 with resources cpus(*):1; mem(*):1 on slave 20140717-090821-325288458-5050-2360-1 (ustst020-cep-node3.usu.usu.grp)
I0807 15:17:54.616029 15732 master.cpp:3111] Launching task 1 of framework 20140717-090825-308511242-5050-15711-0044 with resources cpus(*):1; mem(*):1 on slave 20140717-090821-325288458-5050-2360-1 at slave(1)@10.130.99.20:5051 (ustst020-cep-node3.usu.usu.grp)
I0807 15:17:54.616325 15732 hierarchical_allocator_process.hpp:589] Framework 20140717-090825-308511242-5050-15711-0044 filtered slave 20140717-090821-325288458-5050-2360-1 for 8secs
I0807 15:17:58.324476 15728 master.cpp:2628] Status update TASK_RUNNING (UUID: ec5ecf90-7313-4bf1-af9e-b5f6e35189f7) for task 1 of framework 20140717-090825-308511242-5050-15711-0044 from slave 20140717-090821-325288458-5050-2360-1 at slave(1)@10.130.99.20:5051 (ustst020-cep-node3.usu.usu.grp)
I0807 15:17:58.326279 15726 master.cpp:1988] Reviving offers for framework 20140717-090825-308511242-5050-15711-0044
I0807 15:17:58.326406 15732 hierarchical_allocator_process.hpp:660] Removed filters for framework 20140717-090825-308511242-5050-15711-0044
I0807 15:18:00.993798 15726 master.cpp:2628] Status update TASK_FINISHED (UUID: ef7a4dfd-c403-483a-a6a7-c2cd995aa64e) for task 1 of framework 20140717-090825-308511242-5050-15711-0044 from slave 20140717-090821-325288458-5050-2360-1 at slave(1)@10.130.99.20:5051 (ustst020-cep-node3.usu.usu.grp)
I0807 15:18:00.994935 15726 master.hpp:673] Removing task 1 with resources cpus(*):1; mem(*):1 on slave 20140717-090821-325288458-5050-2360-1 (ustst020-cep-node3.usu.usu.grp)
I0807 15:18:00.995511 15726 master.cpp:1988] Reviving offers for framework 20140717-090825-308511242-5050-15711-0044
I0807 15:18:00.995599 15725 hierarchical_allocator_process.hpp:636] Recovered cpus(*):1; mem(*):1 (total allocatable: cpus(*):2; mem(*):2; disk(*):12526; ports(*):[31000-32000]) on slave 20140717-090821-325288458-5050-2360-1 from framework 20140717-090825-308511242-5050-15711-0044
I0807 15:18:00.995846 15725 hierarchical_allocator_process.hpp:660] Removed filters for framework 20140717-090825-308511242-5050-15711-0044
I0807 15:18:01.055794 15730 master.cpp:1988] Reviving offers for framework 20140717-090825-308511242-5050-15711-0044
I0807 15:18:01.055982 15730 hierarchical_allocator_process.hpp:660] Removed filters for framework 20140717-090825-308511242-5050-15711-0044
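For readers unfamiliar with the allocator messages above: launching (or declining) an offer installs a time-limited refuse filter that suppresses further offers of that slave to the framework, and reviveOffers clears those filters. The following is a minimal conceptual sketch of those semantics, not Mesos source code; the class and identifiers (Allocator, "fw-0044", "slave-1") are illustrative only.

```python
import time

class Allocator:
    """Toy model of the refuse-filter bookkeeping seen in the Mesos log."""

    def __init__(self):
        # (framework, slave) -> timestamp at which the refuse filter expires
        self.filters = {}

    def filter_slave(self, framework, slave, seconds, now=None):
        # Corresponds to "Framework ... filtered slave ... for 8secs".
        now = time.monotonic() if now is None else now
        self.filters[(framework, slave)] = now + seconds

    def revive_offers(self, framework):
        # Corresponds to "Removed filters for framework ...":
        # drop every filter belonging to this framework.
        self.filters = {k: v for k, v in self.filters.items()
                        if k[0] != framework}

    def offerable(self, framework, slave, now=None):
        # A slave's resources may be offered unless an unexpired filter exists.
        now = time.monotonic() if now is None else now
        expiry = self.filters.get((framework, slave))
        return expiry is None or now >= expiry

alloc = Allocator()
alloc.filter_slave("fw-0044", "slave-1", seconds=8, now=0.0)
assert not alloc.offerable("fw-0044", "slave-1", now=1.0)  # filter active
alloc.revive_offers("fw-0044")
assert alloc.offerable("fw-0044", "slave-1", now=1.0)      # filters cleared
```

In this model, once the filters are removed the slave is offerable again, which is why the absence of any subsequent "Sending ... offers" entry after the "Removed filters" lines above looks like a bug rather than expected filtering.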
