>> This normally occurs as expected, but after some time running multiple
>> jobs, there will be approximately a 30-minute delay between the call to
>> driver.run() and the registered() method being called (based on logs).
>
> This seems weird. Can you show us the logs?
There isn't much informative here, unfortunately:

2014-09-19 19:46:49,361 INFO [JobManager] >>> Running job 1uZArT-yEeS7gCIACpcfeA
<snip>
2014-09-19 20:13:48,134 INFO [JobScheduler] >>> Job 1uZArT-yEeS7gCIACpcfeA: Registered as 20140818-235718-3165886730-5050-901-1507 to master '20140818-235718-3165886730-5050-901'

The snipped lines are unrelated internals of our client. As for the implementation: we emit the "Running job ..." log line immediately before calling driver.run(), and our scheduler's registered() method does nothing more than print the second log line above (there is a stripped-down sketch of this wiring at the end of this mail).

During that window, the Mesos master logs show the master continuing to function as normal: sending offers to (other) frameworks, processing the replies, adding/launching tasks, completing/removing tasks, unregistering/removing frameworks, etc. Here are the log lines from that window that look suspicious:

W0919 19:47:00.258894 938 master.cpp:2718] Ignoring unknown exited executor default on slave 20140818-235718-3165886730-5050-901-5 at slave(1)@10.101.195.45:5051 (ip-10-101-195-45.ec2.internal)
W0919 19:47:00.260349 939 master.cpp:2718] Ignoring unknown exited executor default on slave 20140818-235718-3165886730-5050-901-5 at slave(1)@10.101.195.45:5051 (ip-10-101-195-45.ec2.internal)
I0919 20:07:02.690067 940 master.cpp:1041] Received registration request from scheduler(316)@10.151.31.120:36446
I0919 20:07:02.690192 940 master.cpp:1059] Registering framework 20140818-235718-3165886730-5050-901-1502 at scheduler(316)@10.151.31.120:36446

There are also several lines of "http.cpp:452] HTTP request for '/master/state.json'" during this time. For reference, we are still on Mesos 0.19.0.

>> Another thing we've noticed is that the number of threads used by the
>> process increases as more jobs are run. Does calling driver.stop()
>> terminate any threads launched for calling the native Mesos code through
>> JNI? Or are additional steps required?
>
> The ZooKeeper thread is not terminated on driver.stop() (it's a bug) but
> when the driver object is destructed.

Are there any special steps we should take, in this case? (The kind of per-job cleanup we have in mind is sketched at the very end of this mail.) I expect that the JVM's regular GC should take care of this, but we've noticed the number of threads increase steadily over a matter of days when running one job at a time, and the increase happens much more quickly when we run multiple jobs at once!

Thanks!
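P.S. In case it helps, here is a stripped-down sketch of how our scheduler and job runner are wired up. It is illustrative only: the class names, the job-id plumbing, and the use of System.out in place of our logger are made up for this mail; the real code differs, but the two log statements and the driver.run() call sit in the same places.

    import java.util.List;

    import org.apache.mesos.MesosSchedulerDriver;
    import org.apache.mesos.Scheduler;
    import org.apache.mesos.SchedulerDriver;
    import org.apache.mesos.Protos.ExecutorID;
    import org.apache.mesos.Protos.FrameworkID;
    import org.apache.mesos.Protos.FrameworkInfo;
    import org.apache.mesos.Protos.MasterInfo;
    import org.apache.mesos.Protos.Offer;
    import org.apache.mesos.Protos.OfferID;
    import org.apache.mesos.Protos.SlaveID;
    import org.apache.mesos.Protos.TaskStatus;

    // Hypothetical stand-in for our real scheduler; only the logging that is
    // relevant to the delay discussed above is shown.
    public class JobScheduler implements Scheduler {
        private final String jobId;

        public JobScheduler(String jobId) {
            this.jobId = jobId;
        }

        @Override
        public void registered(SchedulerDriver driver, FrameworkID frameworkId,
                               MasterInfo masterInfo) {
            // This produces the second log line quoted above; in the bad case
            // it fires roughly 30 minutes after driver.run().
            System.out.println("Job " + jobId + ": Registered as "
                + frameworkId.getValue() + " to master '" + masterInfo.getId() + "'");
        }

        // The remaining Scheduler callbacks are no-ops in this sketch.
        @Override public void reregistered(SchedulerDriver driver, MasterInfo masterInfo) {}
        @Override public void resourceOffers(SchedulerDriver driver, List<Offer> offers) {}
        @Override public void offerRescinded(SchedulerDriver driver, OfferID offerId) {}
        @Override public void statusUpdate(SchedulerDriver driver, TaskStatus status) {}
        @Override public void frameworkMessage(SchedulerDriver driver, ExecutorID executorId,
                                               SlaveID slaveId, byte[] data) {}
        @Override public void disconnected(SchedulerDriver driver) {}
        @Override public void slaveLost(SchedulerDriver driver, SlaveID slaveId) {}
        @Override public void executorLost(SchedulerDriver driver, ExecutorID executorId,
                                           SlaveID slaveId, int status) {}
        @Override public void error(SchedulerDriver driver, String message) {}

        // Called once per job by our job manager (each job gets its own driver).
        public static void runJob(String jobId, String masterAddress) {
            FrameworkInfo framework = FrameworkInfo.newBuilder()
                .setUser("")              // let Mesos fill in the current user
                .setName("job-" + jobId)
                .build();

            MesosSchedulerDriver driver =
                new MesosSchedulerDriver(new JobScheduler(jobId), framework, masterAddress);

            // This produces the first log line quoted above ("Running job ...");
            // it is printed immediately before driver.run().
            System.out.println("Running job " + jobId);
            driver.run();   // blocks until the driver is stopped or aborts
        }
    }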

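P.P.S. Regarding the thread growth, this is the kind of per-job cleanup we have in mind, based on our reading of the reply above (that the ZooKeeper thread only goes away when the driver object itself is destructed). Everything here is hypothetical: JobManager, the drivers map, and finishJob() are illustrative names, and we have not verified that letting the Java driver object become unreachable actually causes the native driver to be destroyed and its threads to exit.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    import org.apache.mesos.MesosSchedulerDriver;

    // Hypothetical job-manager bookkeeping: the point is simply to avoid
    // keeping old MesosSchedulerDriver objects reachable after driver.stop(),
    // so that nothing on our side prevents them from being collected.
    public class JobManager {
        private final Map<String, MesosSchedulerDriver> drivers =
            new ConcurrentHashMap<String, MesosSchedulerDriver>();

        public void finishJob(String jobId) {
            // Drop our reference first so the driver is no longer reachable
            // from the job table.
            MesosSchedulerDriver driver = drivers.remove(jobId);
            if (driver != null) {
                driver.stop();  // stops the driver, but (per the bug mentioned
                                // above) does NOT terminate the ZooKeeper thread
            }
            // Once nothing else references the driver, the JVM can eventually
            // collect it; whether and when that destructs the native driver
            // (and so terminates the ZooKeeper thread) is exactly what we are
            // unsure about and asking here.
        }
    }

Is that roughly the right approach, or is there an explicit teardown step we should be calling instead?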
