It seems strange that your slave has only 2MB of allocatable memory ("total allocatable: cpus(*):2; mem(*):2;"). Try bumping that up to something like 2GB ("mem(*):2048") and I bet you'll see more tasks able to run. Even the default executor (with no task) needs 32MB, so you won't be able to do much with a Mesos slave that has less than 64MB of memory. Are you explicitly setting a --resources flag on your slave? If not, do you only have tiny VMs available for the slaves?
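To make the check concrete, here is a rough Python sketch that pulls the scalar values out of a "total allocatable" string like the one in your log and compares the memory against the 32MB executor figure mentioned above. The function names and the threshold constant are my own illustration, not part of Mesos, and the parser only handles the simple `name(role):value` form seen in this log.

```python
import re

# Assumption: 32 MB is the default-executor figure quoted in this reply,
# not a value read from Mesos itself.
EXECUTOR_MIN_MEM_MB = 32


def parse_scalar_resources(text):
    """Return {name: float} for scalar entries such as 'mem(*):2'.

    Non-scalar entries (for example the ports range) are skipped.
    """
    resources = {}
    for entry in text.split(";"):
        match = re.fullmatch(r"\s*(\w+)\((\S+?)\):([0-9.]+)\s*", entry)
        if match:
            resources[match.group(1)] = float(match.group(3))
    return resources


def has_room_for_executor(text, min_mem_mb=EXECUTOR_MIN_MEM_MB):
    """True if the advertised memory covers at least one executor."""
    return parse_scalar_resources(text).get("mem", 0.0) >= min_mem_mb


if __name__ == "__main__":
    from_log = "cpus(*):2; mem(*):2; disk(*):12526; ports(*):[31000-32000]"
    print(parse_scalar_resources(from_log))
    print(has_room_for_executor(from_log))
    print(has_room_for_executor("cpus(*):2; mem(*):2048"))
```

With the string from your log the check fails, while the suggested "mem(*):2048" passes.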
On Thu, Aug 7, 2014 at 7:03 AM, Martin Weindel <martin.wein...@gmail.com> wrote:

> I'm using Apache Mesos 0.19.0 together with Apache Spark 1.0.2 on a
> three-node cluster.
>
> When using the fine-grained task scheduling mode of Spark, I reproducibly
> see some kind of deadlock under high load.
> If multiple jobs are running, after some time the jobs do not submit any
> tasks anymore.
>
> I have added some more log output in the Scheduler implementation of Spark
> and it looks as if Mesos does not make any offers anymore, although there
> are allocatable resources.
>
> Below is the log from Mesos. The last task finishes normally, the
> resources are recovered, and the filters are removed, but the log shows no
> "sending ... offers to framework" entries after this point.
> I have tried to wake up the offers with a reviveOffers call I have added
> to the Spark code, but with no effect.
> The "Resources" section on the Mesos web UI shows all CPUs as idle; none
> is used or offered.
>
> If I kill all jobs but one, this last job continues and finishes normally.
>
> Is this a bug?
>
> Thanks,
> Martin
>
> I0807 15:17:54.605695 15727 master.cpp:2933] Sending 1 offers to framework 20140717-090825-308511242-5050-15711-0044
> I0807 15:17:54.615705 15732 master.cpp:1889] Processing reply for offers: [ 20140717-090825-308511242-5050-15711-2132 ] on slave 20140717-090821-325288458-5050-2360-1 at slave(1)@10.130.99.20:5051 (ustst020-cep-node3.usu.usu.grp) for framework 20140717-090825-308511242-5050-15711-0044
> I0807 15:17:54.615897 15732 master.hpp:655] Adding task 1 with resources cpus(*):1; mem(*):1 on slave 20140717-090821-325288458-5050-2360-1 (ustst020-cep-node3.usu.usu.grp)
> I0807 15:17:54.616029 15732 master.cpp:3111] Launching task 1 of framework 20140717-090825-308511242-5050-15711-0044 with resources cpus(*):1; mem(*):1 on slave 20140717-090821-325288458-5050-2360-1 at slave(1)@10.130.99.20:5051 (ustst020-cep-node3.usu.usu.grp)
> I0807 15:17:54.616325 15732 hierarchical_allocator_process.hpp:589] Framework 20140717-090825-308511242-5050-15711-0044 filtered slave 20140717-090821-325288458-5050-2360-1 for 8secs
> I0807 15:17:58.324476 15728 master.cpp:2628] Status update TASK_RUNNING (UUID: ec5ecf90-7313-4bf1-af9e-b5f6e35189f7) for task 1 of framework 20140717-090825-308511242-5050-15711-0044 from slave 20140717-090821-325288458-5050-2360-1 at slave(1)@10.130.99.20:5051 (ustst020-cep-node3.usu.usu.grp)
> I0807 15:17:58.326279 15726 master.cpp:1988] Reviving offers for framework 20140717-090825-308511242-5050-15711-0044
> I0807 15:17:58.326406 15732 hierarchical_allocator_process.hpp:660] Removed filters for framework 20140717-090825-308511242-5050-15711-0044
> I0807 15:18:00.993798 15726 master.cpp:2628] Status update TASK_FINISHED (UUID: ef7a4dfd-c403-483a-a6a7-c2cd995aa64e) for task 1 of framework 20140717-090825-308511242-5050-15711-0044 from slave 20140717-090821-325288458-5050-2360-1 at slave(1)@10.130.99.20:5051 (ustst020-cep-node3.usu.usu.grp)
> I0807 15:18:00.994935 15726 master.hpp:673] Removing task 1 with resources cpus(*):1; mem(*):1 on slave 20140717-090821-325288458-5050-2360-1 (ustst020-cep-node3.usu.usu.grp)
> I0807 15:18:00.995511 15726 master.cpp:1988] Reviving offers for framework 20140717-090825-308511242-5050-15711-0044
> I0807 15:18:00.995599 15725 hierarchical_allocator_process.hpp:636] Recovered cpus(*):1; mem(*):1 (total allocatable: cpus(*):2; mem(*):2; disk(*):12526; ports(*):[31000-32000]) on slave 20140717-090821-325288458-5050-2360-1 from framework 20140717-090825-308511242-5050-15711-0044
> I0807 15:18:00.995846 15725 hierarchical_allocator_process.hpp:660] Removed filters for framework 20140717-090825-308511242-5050-15711-0044
> I0807 15:18:01.055794 15730 master.cpp:1988] Reviving offers for framework 20140717-090825-308511242-5050-15711-0044
> I0807 15:18:01.055982 15730 hierarchical_allocator_process.hpp:660] Removed filters for framework 20140717-090825-308511242-5050-15711-0044