answers inline.

> 2014-09-19 19:46:49,361 INFO [JobManager] >>> Running job
> 1uZArT-yEeS7gCIACpcfeA
> <snip>
> 2014-09-19 20:13:48,134 INFO [JobScheduler] >>> Job
> 1uZArT-yEeS7gCIACpcfeA: Registered as
> 20140818-235718-3165886730-5050-901-1507 to master
> '20140818-235718-3165886730-5050-901'
>
> The snipped code is for unrelated internals of our client. Going back to
> the implementation: we output the "Running job ..." log line immediately
> before calling driver.run(), and our implementation of the registered()
> method in the scheduler simply prints the second log line above. During
> this window, judging from the Mesos master logs, the master continues to
> function as normal, sending offers to (other) frameworks, processing the
> replies, adding/launching tasks, completing/removing tasks,
> unregistering/removing frameworks, etc. Here are the log lines that may be
> suspicious during that window:
>
> W0919 19:47:00.258894 938 master.cpp:2718] Ignoring unknown exited
> executor default on slave 20140818-235718-3165886730-5050-901-5 at slave(1)@
> 10.101.195.45:5051 (ip-10-101-195-45.ec2.internal)
> W0919 19:47:00.260349 939 master.cpp:2718] Ignoring unknown exited
> executor default on slave 20140818-235718-3165886730-5050-901-5 at slave(1)@
> 10.101.195.45:5051 (ip-10-101-195-45.ec2.internal)
> I0919 20:07:02.690067 940 master.cpp:1041] Received registration request
> from scheduler(316)@10.151.31.120:36446
> I0919 20:07:02.690192 940 master.cpp:1059] Registering framework
> 20140818-235718-3165886730-5050-901-1502 at scheduler(316)@
> 10.151.31.120:36446

The "Registering framework" line in the master log refers to a different
framework (20140818-235718-3165886730-5050-901-1502) than the problematic
job/framework (20140818-235718-3165886730-5050-901-1507). Can you grep the
master logs for the lines corresponding to the problematic framework (grep
for its scheduler's ip:port)? That should tell us what's happening.
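For reference, here is a minimal Java sketch of the setup you describe: one
log line printed just before driver.run(), and another from registered(). It
is only my guess at roughly what your client does (the class name, framework
name, master URL, and log text are made up), not your actual code:

import java.util.List;
import org.apache.mesos.MesosSchedulerDriver;
import org.apache.mesos.Protos.*;
import org.apache.mesos.Scheduler;
import org.apache.mesos.SchedulerDriver;

public class LoggingScheduler implements Scheduler {
    @Override
    public void registered(SchedulerDriver driver, FrameworkID frameworkId,
                           MasterInfo masterInfo) {
        // Second log line: reached only once the master has acked the registration.
        System.out.println(">>> Job example-job: Registered as " + frameworkId.getValue()
                           + " to master '" + masterInfo.getId() + "'");
    }

    // The remaining Scheduler callbacks are no-ops in this sketch.
    @Override public void reregistered(SchedulerDriver d, MasterInfo m) {}
    @Override public void resourceOffers(SchedulerDriver d, List<Offer> offers) {}
    @Override public void offerRescinded(SchedulerDriver d, OfferID id) {}
    @Override public void statusUpdate(SchedulerDriver d, TaskStatus status) {}
    @Override public void frameworkMessage(SchedulerDriver d, ExecutorID e, SlaveID s, byte[] data) {}
    @Override public void disconnected(SchedulerDriver d) {}
    @Override public void slaveLost(SchedulerDriver d, SlaveID s) {}
    @Override public void executorLost(SchedulerDriver d, ExecutorID e, SlaveID s, int status) {}
    @Override public void error(SchedulerDriver d, String message) {}

    public static void main(String[] args) {
        FrameworkInfo framework = FrameworkInfo.newBuilder()
            .setUser("")                 // empty => libmesos fills in the current user
            .setName("example-job")
            .build();
        MesosSchedulerDriver driver = new MesosSchedulerDriver(
            new LoggingScheduler(), framework, "zk://zk-host:2181/mesos");
        // First log line: printed immediately before blocking in run().
        System.out.println(">>> Running job example-job");
        driver.run();  // blocks; registered() is invoked later on a driver thread
    }
}

The point of the sketch is that registered() only fires after the master
acknowledges the registration, so a long gap between the two log lines means
that acknowledgement is not reaching your scheduler; the master logs for your
scheduler's ip:port should show which side is stuck.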
> Are there any special steps we should take, in this case? I expect that
> the JVM's regular GC should take care of this, but we've noticed the number
> of threads increase steadily over a matter of days when running 1 job at a
> time (the increase happens much more quickly when we run multiple jobs at
> once!).

Can you dump a stack trace and tell us what those threads are? Or is it hard
to tell past the JNI boundary?
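For what it's worth, jstack <pid> against the client JVM is the simplest way
to get that dump. The snippet below is a sketch of how to capture the same
information from inside the process (my suggestion, not something from your
code; the ThreadDumper name is made up). Note that threads created on the
native libmesos side of the JNI binding only show up here if they have
attached themselves to the JVM, so an OS-level view of the thread count
(e.g. top -H or ps -eLf) is a useful cross-check.

import java.util.Map;

// Hypothetical helper (not part of any Mesos API): dumps every thread the JVM
// knows about, with name, daemon flag, state, and stack. Run it periodically
// and diff the output to see which thread names keep accumulating.
public class ThreadDumper {
    public static void dump() {
        Map<Thread, StackTraceElement[]> traces = Thread.getAllStackTraces();
        System.out.println("Live JVM threads: " + traces.size());
        for (Map.Entry<Thread, StackTraceElement[]> entry : traces.entrySet()) {
            Thread t = entry.getKey();
            System.out.printf("%n\"%s\" daemon=%b state=%s%n",
                              t.getName(), t.isDaemon(), t.getState());
            for (StackTraceElement frame : entry.getValue()) {
                System.out.println("    at " + frame);
            }
        }
    }

    public static void main(String[] args) {
        dump();
    }
}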

