Answers inline.

2014-09-19 19:46:49,361 INFO [JobManager] >>> Running job
> 1uZArT-yEeS7gCIACpcfeA
> <snip>
> 2014-09-19 20:13:48,134 INFO [JobScheduler] >>> Job
> 1uZArT-yEeS7gCIACpcfeA: Registered as
> 20140818-235718-3165886730-5050-901-1507 to master
> '20140818-235718-3165886730-5050-901'
>
> The snipped code is unrelated internals of our client. Back to the
> implementation: we emit the "Running job ..." log line immediately
> before calling driver.run(), and our implementation of the
> registered() method in the scheduler simply prints the second log
> line above. During this time, the mesos master logs show the master
> continuing to function as normal: sending offers to (other)
> frameworks, processing the replies, adding/launching tasks,
> completing/removing tasks, unregistering/removing frameworks, etc.
> Here are the log lines that may be suspicious during that window:
>
> W0919 19:47:00.258894   938 master.cpp:2718] Ignoring unknown exited
> executor default on slave 20140818-235718-3165886730-5050-901-5 at slave(1)@
> 10.101.195.45:5051 (ip-10-101-195-45.ec2.internal)
> W0919 19:47:00.260349   939 master.cpp:2718] Ignoring unknown exited
> executor default on slave 20140818-235718-3165886730-5050-901-5 at slave(1)@
> 10.101.195.45:5051 (ip-10-101-195-45.ec2.internal)
> I0919 20:07:02.690067   940 master.cpp:1041] Received registration request
> from scheduler(316)@10.151.31.120:36446
> I0919 20:07:02.690192   940 master.cpp:1059] Registering framework
> 20140818-235718-3165886730-5050-901-1502 at scheduler(316)@
> 10.151.31.120:36446
>
>
The registration log line in the master is for a different framework
(20140818-235718-3165886730-5050-901-1502) than the problematic
job/framework (20140818-235718-3165886730-5050-901-1507). Can you grep
the master logs for the lines corresponding to the problematic
framework (grep for its scheduler's ip:port)? That should tell us
what's happening.
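For reference, a grep along these lines should do it; the ip:port below
is the one from the (unrelated) framework in your excerpt, so substitute
the problematic scheduler's endpoint, and point grep at your master's
actual log file. Demoed here on a tiny sample built from the lines above:

```shell
# Sketch: filter the master log for every line mentioning one
# scheduler's ip:port. On a real master, grep its log directly
# (e.g. /var/log/mesos/mesos-master.INFO -- path is an assumption).
cat > /tmp/master.sample.log <<'EOF'
I0919 20:07:02.690067 940 master.cpp:1041] Received registration request from scheduler(316)@10.151.31.120:36446
I0919 20:07:02.690192 940 master.cpp:1059] Registering framework 20140818-235718-3165886730-5050-901-1502 at scheduler(316)@10.151.31.120:36446
W0919 19:47:00.258894 938 master.cpp:2718] Ignoring unknown exited executor default on slave ...
EOF
# Dots escaped so the ip:port matches literally, not as regex wildcards:
grep '10\.151\.31\.120:36446' /tmp/master.sample.log
```

That shows both master-side lines touching the scheduler endpoint while
filtering out all the unrelated framework traffic.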


> Are there any special steps we should take in this case? I expect
> that the JVM's regular GC should take care of this, but we've noticed
> the number of threads increasing steadily over a matter of days when
> running 1 job at a time (the increase happens much more quickly when
> we run multiple jobs at once!).
>
>
Can you dump a stack trace and tell us what those threads are? Or is it
hard to tell past the JNI boundary?
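Something like the following would help; `jstack` ships with the JDK,
and the pid placeholder is obviously yours to fill in. The summarizing
pipeline is a sketch, demoed here on a tiny fake dump, that groups
thread names into families so a leaking pool stands out:

```shell
# Against the real JVM you would run (pid from `jps`, for example):
#   jstack -l <scheduler-pid> > /tmp/threads.txt
# Fake a small dump here so the summary step below is runnable:
cat > /tmp/threads.txt <<'EOF'
"main" #1 prio=5 os_prio=0 tid=0x... runnable
"pool-1-thread-7" #42 prio=5 os_prio=0 tid=0x... waiting on condition
"pool-1-thread-8" #43 prio=5 os_prio=0 tid=0x... waiting on condition
"Thread-12" #44 prio=5 os_prio=0 tid=0x... runnable
EOF
# Collapse trailing thread numbers into one family name and count;
# default names like "Thread-N" often belong to natively-created
# threads, which is a hint the leak is past the JNI boundary.
grep -o '^"[^"]*"' /tmp/threads.txt \
  | sed -E 's/[0-9]+"$/N"/' \
  | sort | uniq -c | sort -rn
```

If the biggest family is an executor pool your client owns, it's a
client-side leak; if it's all anonymous "Thread-N" entries, the driver
side is worth a closer look.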
