Sorry for the delayed response. It looks like the master received the registration request from the scheduler at 20:13, at which point it sent a registration ack back to the scheduler. What is not clear is why no registration requests were sent or received between 19:47, when the framework apparently started, and 20:13. You can add GLOG_v=1 to the scheduler's environment to get more verbose logging from the driver; that will tell us what the driver has been doing. Are you sure you didn't have any network partitions around that time?
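For reference, here is a stripped-down sketch of the pattern you describe further down (the "Running job ..." line logged right before driver.run(), and a registered() callback that only logs). This is not your client and the framework name is made up; the point is that GLOG_v=1 has to be present in the environment of the JVM process itself, e.g. "GLOG_v=1 java -cp ... MinimalScheduler <master>", because it is read by the native libmesos (glog) side and cannot be set from within Java after startup.

// Minimal sketch of a JVM scheduler using the Mesos Java bindings.
// Not the actual client; the framework name below is made up.
import java.util.List;

import org.apache.mesos.MesosSchedulerDriver;
import org.apache.mesos.Protos;
import org.apache.mesos.Scheduler;
import org.apache.mesos.SchedulerDriver;

public class MinimalScheduler implements Scheduler {

  @Override
  public void registered(SchedulerDriver driver,
                         Protos.FrameworkID frameworkId,
                         Protos.MasterInfo masterInfo) {
    // Mirrors the "Registered as <framework> to master '<master>'" log line.
    System.out.println("Registered as " + frameworkId.getValue()
        + " to master '" + masterInfo.getId() + "'");
  }

  // The remaining callbacks are no-ops for this sketch.
  @Override public void reregistered(SchedulerDriver d, Protos.MasterInfo m) {}
  @Override public void resourceOffers(SchedulerDriver d, List<Protos.Offer> offers) {}
  @Override public void offerRescinded(SchedulerDriver d, Protos.OfferID id) {}
  @Override public void statusUpdate(SchedulerDriver d, Protos.TaskStatus status) {}
  @Override public void frameworkMessage(SchedulerDriver d, Protos.ExecutorID e,
                                          Protos.SlaveID s, byte[] data) {}
  @Override public void disconnected(SchedulerDriver d) {}
  @Override public void slaveLost(SchedulerDriver d, Protos.SlaveID id) {}
  @Override public void executorLost(SchedulerDriver d, Protos.ExecutorID e,
                                     Protos.SlaveID s, int status) {}
  @Override public void error(SchedulerDriver d, String message) {}

  public static void main(String[] args) {
    Protos.FrameworkInfo framework = Protos.FrameworkInfo.newBuilder()
        .setUser("")                  // empty user = current user
        .setName("registration-test") // made-up framework name
        .build();

    // args[0] is the master, e.g. host:5050 or a zk:// URL.
    System.out.println("Running job ...");  // logged immediately before run()
    MesosSchedulerDriver driver =
        new MesosSchedulerDriver(new MinimalScheduler(), framework, args[0]);
    driver.run();  // blocks until the driver is stopped or aborts
  }
}

With GLOG_v=1 exported this way, the driver's verbose logs should show whether it was trying to (re)register with the master during the 19:47-20:13 window.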
On Fri, Oct 10, 2014 at 1:30 PM, Colleen Lee <[email protected]> wrote:

>>> 2014-09-19 19:46:49,361 INFO [JobManager] Running job 1uZArT-yEeS7gCIACpcfeA
>>> <snip>
>>> 2014-09-19 20:13:48,134 INFO [JobScheduler] Job 1uZArT-yEeS7gCIACpcfeA: Registered as 20140818-235718-3165886730-5050-901-1507 to master '20140818-235718-3165886730-5050-901'
>>>
>>> The snipped code is for unrelated internals of our client. Going back to implementation, we call for the "Running job ..." log line to be output immediately before calling driver.run(), and our implementation of the registered() method in the scheduler is simply to print out the second log line above. During this time, from the mesos master logs, the master continues to function as normal, sending offers to (other) frameworks, processing the replies, adding/launching tasks, completing/removing tasks, unregistering/removing frameworks, etc. Here are the log lines that may be suspicious during that window:
>>>
>>> W0919 19:47:00.258894 938 master.cpp:2718] Ignoring unknown exited executor default on slave 20140818-235718-3165886730-5050-901-5 at slave(1)@10.101.195.45:5051 (ip-10-101-195-45.ec2.internal)
>>> W0919 19:47:00.260349 939 master.cpp:2718] Ignoring unknown exited executor default on slave 20140818-235718-3165886730-5050-901-5 at slave(1)@10.101.195.45:5051 (ip-10-101-195-45.ec2.internal)
>>> I0919 20:07:02.690067 940 master.cpp:1041] Received registration request from scheduler(316)@10.151.31.120:36446
>>> I0919 20:07:02.690192 940 master.cpp:1059] Registering framework 20140818-235718-3165886730-5050-901-1502 at scheduler(316)@10.151.31.120:36446
>>
>> The log line for framework in master is for a different framework (20140818-235718-3165886730-5050-901-1502) than the problematic job/framework (20140818-235718-3165886730-5050-901-1507). Can you grep for the log lines corresponding to the problematic framework (grep for ip:port) in the master logs? That should tell us what's happening.
>
> Good catch, here are the relevant lines from the master logs (I grepped for the framework id -- how is the port determined?).
>
> I0919 20:13:48.122954 941 master.cpp:1041] Received registration request from scheduler(13)@10.45.199.181:43573
> I0919 20:13:48.123080 941 master.cpp:1059] Registering framework 20140818-235718-3165886730-5050-901-1507 at scheduler(13)@10.45.199.181:43573
> I0919 20:13:48.123914 941 hierarchical_allocator_process.hpp:331] Added framework 20140818-235718-3165886730-5050-901-1507
> I0919 20:13:48.124739 941 master.cpp:2933] Sending 2 offers to framework 20140818-235718-3165886730-5050-901-1507
> I0919 20:13:48.129932 940 master.cpp:1889] Processing reply for offers: [ 20140818-235718-3165886730-5050-901-486420 ] on slave 20140818-235718-3165886730-5050-901-5 at slave(1)@10.101.195.45:5051 (ip-10-101-195-45.ec2.internal) for framework 20140818-235718-3165886730-5050-901-1507
> I0919 20:13:48.130028 940 master.hpp:655] Adding task 1uZArT-yEeS7gCIACpcfeA with resources cpus(*):1; mem(*):1536 on slave 20140818-235718-3165886730-5050-901-5 (ip-10-101-195-45.ec2.internal)
> I0919 20:13:48.130630 940 master.cpp:3111] Launching task 1uZArT-yEeS7gCIACpcfeA of framework 20140818-235718-3165886730-5050-901-1507 with resources cpus(*):1; mem(*):1536 on slave 20140818-235718-3165886730-5050-901-5 at slave(1)@10.101.195.45:5051 (ip-10-101-195-45.ec2.internal)
> I0919 20:13:48.131531 940 master.cpp:1889] Processing reply for offers: [ 20140818-235718-3165886730-5050-901-486421 ] on slave 20140818-235718-3165886730-5050-901-4 at slave(1)@10.51.165.231:5051 (ip-10-51-165-231.ec2.internal) for framework 20140818-235718-3165886730-5050-901-1507
> I0919 20:13:48.131589 937 hierarchical_allocator_process.hpp:589] Framework 20140818-235718-3165886730-5050-901-1507 filtered slave 20140818-235718-3165886730-5050-901-5 for 5secs
> I0919 20:13:48.133910 937 hierarchical_allocator_process.hpp:589] Framework 20140818-235718-3165886730-5050-901-1507 filtered slave 20140818-235718-3165886730-5050-901-4 for 5secs
> I0919 20:13:49.833616 940 master.cpp:2628] Status update TASK_RUNNING (UUID: ea495e2b-2d19-4206-ae08-9daad311a525) for task 1uZArT-yEeS7gCIACpcfeA of framework 20140818-235718-3165886730-5050-901-1507 from slave 20140818-235718-3165886730-5050-901-5 at slave(1)@10.101.195.45:5051 (ip-10-101-195-45.ec2.internal)
> I0919 20:13:53.184873 941 master.cpp:2933] Sending 2 offers to framework 20140818-235718-3165886730-5050-901-1507
> I0919 20:13:53.188966 938 master.cpp:1889] Processing reply for offers: [ 20140818-235718-3165886730-5050-901-486424 ] on slave 20140818-235718-3165886730-5050-901-5 at slave(1)@10.101.195.45:5051 (ip-10-101-195-45.ec2.internal) for framework 20140818-235718-3165886730-5050-901-1507
> I0919 20:13:53.189149 938 hierarchical_allocator_process.hpp:589] Framework 20140818-235718-3165886730-5050-901-1507 filtered slave 20140818-235718-3165886730-5050-901-5 for 5secs
> I0919 20:13:53.189857 937 master.cpp:1889] Processing reply for offers: [ 20140818-235718-3165886730-5050-901-486425 ] on slave 20140818-235718-3165886730-5050-901-4 at slave(1)@10.51.165.231:5051 (ip-10-51-165-231.ec2.internal) for framework 20140818-235718-3165886730-5050-901-1507
> I0919 20:13:53.190274 937 hierarchical_allocator_process.hpp:589] Framework 20140818-235718-3165886730-5050-901-1507 filtered slave 20140818-235718-3165886730-5050-901-4 for 5secs
> I0919 20:13:56.359884 935 master.cpp:2628] Status update TASK_FINISHED (UUID: e57c953e-d6c2-45e4-a541-1bcf16cd3179) for task 1uZArT-yEeS7gCIACpcfeA of framework 20140818-235718-3165886730-5050-901-1507 from slave 20140818-235718-3165886730-5050-901-5 at slave(1)@10.101.195.45:5051 (ip-10-101-195-45.ec2.internal)
> I0919 20:13:56.360065 935 master.hpp:673] Removing task 1uZArT-yEeS7gCIACpcfeA with resources cpus(*):1; mem(*):1536 on slave 20140818-235718-3165886730-5050-901-5 (ip-10-101-195-45.ec2.internal)
> I0919 20:13:56.360790 935 hierarchical_allocator_process.hpp:636] Recovered cpus(*):1; mem(*):1536 (total allocatable: cpus(*):8; mem(*):14019; disk(*):14910; ports(*):[31000-32000]) on slave 20140818-235718-3165886730-5050-901-5 from framework 20140818-235718-3165886730-5050-901-1507
> I0919 20:13:56.363013 937 master.cpp:1272] Asked to unregister framework 20140818-235718-3165886730-5050-901-1507
> I0919 20:13:56.363065 937 master.cpp:3344] Removing framework 20140818-235718-3165886730-5050-901-1507
> I0919 20:13:56.363761 937 hierarchical_allocator_process.hpp:407] Deactivated framework 20140818-235718-3165886730-5050-901-1507
> I0919 20:13:56.364296 937 hierarchical_allocator_process.hpp:362] Removed framework 20140818-235718-3165886730-5050-901-1507
>
>>> Are there any special steps we should take, in this case? I expect that the JVM's regular GC should take care of this, but we've noticed the number of threads increase steadily over a matter of days when running 1 job at a time (the increase happens much more quickly when we run multiple jobs at once!).
>>
>> Can you dump a stack trace and tell us what those threads are? Or is it hard to tell past the JNI boundary?
>
> There was one day where the process went from 40 to 42 Java threads. The new threads were the following:
>
> Thread 18610: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
>  - scala.concurrent.forkjoin.ForkJoinPool.scan(scala.concurrent.forkjoin.ForkJoinPool$WorkQueue) @bci=525, line=2075 (Compiled frame)
>  - scala.concurrent.forkjoin.ForkJoinPool.runWorker(scala.concurrent.forkjoin.ForkJoinPool$WorkQueue) @bci=8, line=1979 (Compiled frame)
>  - scala.concurrent.forkjoin.ForkJoinWorkerThread.run() @bci=14, line=107 (Interpreted frame)
>
> Thread 18607: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
>  - scala.concurrent.forkjoin.ForkJoinPool.scan(scala.concurrent.forkjoin.ForkJoinPool$WorkQueue) @bci=525, line=2075 (Compiled frame)
>  - scala.concurrent.forkjoin.ForkJoinPool.runWorker(scala.concurrent.forkjoin.ForkJoinPool$WorkQueue) @bci=8, line=1979 (Compiled frame)
>  - scala.concurrent.forkjoin.ForkJoinWorkerThread.run() @bci=14, line=107 (Interpreted frame)
>
> Thanks!
>
> Colleen
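On the thread growth in the quoted message above: those two stacks look like idle workers of a scala.concurrent fork-join pool parked in scan() waiting for work, i.e. threads created on the Java/Scala side rather than by libmesos. If the count keeps climbing, one low-effort way to see which pools are growing, without attaching jstack each time, is to log a periodic thread census from inside the scheduler JVM. Rough sketch only; ThreadCensus is a made-up helper, not part of Mesos:

import java.util.Map;

// Made-up helper: a daemon thread that periodically prints every live
// thread's name, state, and top stack frame, so a slowly growing pool
// stands out across samples.
public final class ThreadCensus implements Runnable {

  @Override
  public void run() {
    try {
      while (true) {
        Map<Thread, StackTraceElement[]> stacks = Thread.getAllStackTraces();
        System.out.println("live threads: " + stacks.size());
        for (Map.Entry<Thread, StackTraceElement[]> entry : stacks.entrySet()) {
          StackTraceElement[] frames = entry.getValue();
          System.out.println("  " + entry.getKey().getName()
              + " (" + entry.getKey().getState() + ")"
              + (frames.length > 0 ? " at " + frames[0] : ""));
        }
        Thread.sleep(60000);  // sample once a minute
      }
    } catch (InterruptedException ignored) {
      // Exit quietly when the scheduler shuts down.
    }
  }

  // Call once from the scheduler's startup code, e.g. before driver.run().
  public static void start() {
    Thread census = new Thread(new ThreadCensus(), "thread-census");
    census.setDaemon(true);
    census.start();
  }
}

Comparing a couple of samples a day apart should tell whether the extra threads all come from the same fork-join pool or from somewhere else.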

