>> This normally occurs as expected, but after some time running multiple
>> jobs, there will be approximately a 30-minute delay between the call to
>> driver.run() and the registered() method being called (based on logs).
>
> This seems weird. Can you show us the logs?
There isn't much informative here, unfortunately:

2014-09-19 19:46:49,361 INFO [JobManager] >>> Running job 1uZArT-yEeS7gCIACpcfeA
<snip>
2014-09-19 20:13:48,134 INFO [JobScheduler] >>> Job 1uZArT-yEeS7gCIACpcfeA: Registered as 20140818-235718-3165886730-5050-901-1507 to master '20140818-235718-3165886730-5050-901'

The snipped lines are unrelated internals of our client. As for the implementation: we emit the "Running job ..." log line immediately before calling driver.run(), and our scheduler's registered() method does nothing more than print the second log line above (there is a stripped-down sketch of this wiring at the end of this mail).

During that window, the Mesos master logs show the master continuing to function as normal: sending offers to (other) frameworks, processing the replies, adding/launching tasks, completing/removing tasks, unregistering/removing frameworks, etc. Here are the log lines from that window that look suspicious:

W0919 19:47:00.258894 938 master.cpp:2718] Ignoring unknown exited executor default on slave 20140818-235718-3165886730-5050-901-5 at slave(1)@10.101.195.45:5051 (ip-10-101-195-45.ec2.internal)
W0919 19:47:00.260349 939 master.cpp:2718] Ignoring unknown exited executor default on slave 20140818-235718-3165886730-5050-901-5 at slave(1)@10.101.195.45:5051 (ip-10-101-195-45.ec2.internal)
I0919 20:07:02.690067 940 master.cpp:1041] Received registration request from scheduler(316)@10.151.31.120:36446
I0919 20:07:02.690192 940 master.cpp:1059] Registering framework 20140818-235718-3165886730-5050-901-1502 at scheduler(316)@10.151.31.120:36446

There are also several lines of "http.cpp:452] HTTP request for '/master/state.json'" during this time. For reference, we are still on Mesos 0.19.0.

>> Another thing we've noticed is that the number of threads used by the
>> process increases as more jobs are run. Does calling driver.stop()
>> terminate any threads launched for calling the native Mesos code through
>> JNI? Or are additional steps required?
>
> The ZooKeeper thread is not terminated on driver.stop() (it's a bug) but
> when the driver object is destructed.

Are there any special steps we should take, in this case? (The kind of per-job cleanup we have in mind is sketched at the very end of this mail.) I expect that the JVM's regular GC should take care of this, but we've noticed the number of threads increase steadily over a matter of days when running one job at a time, and the increase happens much more quickly when we run multiple jobs at once!

Thanks!
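P.S. In case it helps, here is a stripped-down sketch of how our scheduler and job runner are wired up. It is illustrative only: the class names, the job-id plumbing, and the use of System.out in place of our logger are made up for this mail; the real code differs, but the two log statements and the driver.run() call sit in the same places.

    import java.util.List;

    import org.apache.mesos.MesosSchedulerDriver;
    import org.apache.mesos.Scheduler;
    import org.apache.mesos.SchedulerDriver;
    import org.apache.mesos.Protos.ExecutorID;
    import org.apache.mesos.Protos.FrameworkID;
    import org.apache.mesos.Protos.FrameworkInfo;
    import org.apache.mesos.Protos.MasterInfo;
    import org.apache.mesos.Protos.Offer;
    import org.apache.mesos.Protos.OfferID;
    import org.apache.mesos.Protos.SlaveID;
    import org.apache.mesos.Protos.TaskStatus;

    // Hypothetical stand-in for our real scheduler; only the logging that is
    // relevant to the delay discussed above is shown.
    public class JobScheduler implements Scheduler {
        private final String jobId;

        public JobScheduler(String jobId) {
            this.jobId = jobId;
        }

        @Override
        public void registered(SchedulerDriver driver, FrameworkID frameworkId,
                               MasterInfo masterInfo) {
            // This produces the second log line quoted above; in the bad case
            // it fires roughly 30 minutes after driver.run().
            System.out.println("Job " + jobId + ": Registered as "
                + frameworkId.getValue() + " to master '" + masterInfo.getId() + "'");
        }

        // The remaining Scheduler callbacks are no-ops in this sketch.
        @Override public void reregistered(SchedulerDriver driver, MasterInfo masterInfo) {}
        @Override public void resourceOffers(SchedulerDriver driver, List<Offer> offers) {}
        @Override public void offerRescinded(SchedulerDriver driver, OfferID offerId) {}
        @Override public void statusUpdate(SchedulerDriver driver, TaskStatus status) {}
        @Override public void frameworkMessage(SchedulerDriver driver, ExecutorID executorId,
                                               SlaveID slaveId, byte[] data) {}
        @Override public void disconnected(SchedulerDriver driver) {}
        @Override public void slaveLost(SchedulerDriver driver, SlaveID slaveId) {}
        @Override public void executorLost(SchedulerDriver driver, ExecutorID executorId,
                                           SlaveID slaveId, int status) {}
        @Override public void error(SchedulerDriver driver, String message) {}

        // Called once per job by our job manager (each job gets its own driver).
        public static void runJob(String jobId, String masterAddress) {
            FrameworkInfo framework = FrameworkInfo.newBuilder()
                .setUser("")              // let Mesos fill in the current user
                .setName("job-" + jobId)
                .build();

            MesosSchedulerDriver driver =
                new MesosSchedulerDriver(new JobScheduler(jobId), framework, masterAddress);

            // This produces the first log line quoted above ("Running job ...");
            // it is printed immediately before driver.run().
            System.out.println("Running job " + jobId);
            driver.run();   // blocks until the driver is stopped or aborts
        }
    }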

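P.P.S. Regarding the thread growth, this is the kind of per-job cleanup we have in mind, based on our reading of the reply above (that the ZooKeeper thread only goes away when the driver object itself is destructed). Everything here is hypothetical: JobManager, the drivers map, and finishJob() are illustrative names, and we have not verified that letting the Java driver object become unreachable actually causes the native driver to be destroyed and its threads to exit.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    import org.apache.mesos.MesosSchedulerDriver;

    // Hypothetical job-manager bookkeeping: the point is simply to avoid
    // keeping old MesosSchedulerDriver objects reachable after driver.stop(),
    // so that nothing on our side prevents them from being collected.
    public class JobManager {
        private final Map<String, MesosSchedulerDriver> drivers =
            new ConcurrentHashMap<String, MesosSchedulerDriver>();

        public void finishJob(String jobId) {
            // Drop our reference first so the driver is no longer reachable
            // from the job table.
            MesosSchedulerDriver driver = drivers.remove(jobId);
            if (driver != null) {
                driver.stop();  // stops the driver, but (per the bug mentioned
                                // above) does NOT terminate the ZooKeeper thread
            }
            // Once nothing else references the driver, the JVM can eventually
            // collect it; whether and when that destructs the native driver
            // (and so terminates the ZooKeeper thread) is exactly what we are
            // unsure about and asking here.
        }
    }

Is that roughly the right approach, or is there an explicit teardown step we should be calling instead?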
