>> 2014-09-19 19:46:49,361 INFO [JobManager] >>> Running job 1uZArT-yEeS7gCIACpcfeA
>> <snip>
>> 2014-09-19 20:13:48,134 INFO [JobScheduler] >>> Job 1uZArT-yEeS7gCIACpcfeA: Registered as 20140818-235718-3165886730-5050-901-1507 to master '20140818-235718-3165886730-5050-901'
>>
>> The snipped code is unrelated internals of our client. Back to the implementation: we output the "Running job ..." log line immediately before calling driver.run(), and our implementation of the registered() method in the scheduler simply prints the second log line above. During this window the mesos master logs show the master continuing to function as normal: sending offers to (other) frameworks, processing the replies, adding/launching tasks, completing/removing tasks, unregistering/removing frameworks, etc. Here are the log lines that may be suspicious during that window:
>>
>> W0919 19:47:00.258894 938 master.cpp:2718] Ignoring unknown exited executor default on slave 20140818-235718-3165886730-5050-901-5 at slave(1)@10.101.195.45:5051 (ip-10-101-195-45.ec2.internal)
>> W0919 19:47:00.260349 939 master.cpp:2718] Ignoring unknown exited executor default on slave 20140818-235718-3165886730-5050-901-5 at slave(1)@10.101.195.45:5051 (ip-10-101-195-45.ec2.internal)
>> I0919 20:07:02.690067 940 master.cpp:1041] Received registration request from scheduler(316)@10.151.31.120:36446
>> I0919 20:07:02.690192 940 master.cpp:1059] Registering framework 20140818-235718-3165886730-5050-901-1502 at scheduler(316)@10.151.31.120:36446
>
> The "Registering framework" line in the master log is for a different framework (20140818-235718-3165886730-5050-901-1502) than the problematic job/framework (20140818-235718-3165886730-5050-901-1507). Can you grep the master logs for the log lines corresponding to the problematic framework (grep for its ip:port)? That should tell us what's happening.
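The suggested grep can be sketched like this. The log path and sample lines below are illustrative (a tiny canned file stands in for the real mesos-master.INFO so the commands run as-is); the idea is to first find the scheduler endpoint (ip:port) the framework registered from, then grep for that endpoint, since some lines (e.g. the initial registration request) mention the endpoint but not the framework id:

```shell
# Assumption: the real log lives somewhere like /var/log/mesos/mesos-master.INFO;
# we use a canned sample file here so the sketch is runnable.
MASTER_LOG=sample-master.INFO

cat > "$MASTER_LOG" <<'EOF'
I0919 20:07:02.690067 940 master.cpp:1041] Received registration request from scheduler(316)@10.151.31.120:36446
I0919 20:13:48.122954 941 master.cpp:1041] Received registration request from scheduler(13)@10.45.199.181:43573
I0919 20:13:48.123080 941 master.cpp:1059] Registering framework 20140818-235718-3165886730-5050-901-1507 at scheduler(13)@10.45.199.181:43573
EOF

# Step 1: find the scheduler endpoint for the problematic framework id.
grep '20140818-235718-3165886730-5050-901-1507' "$MASTER_LOG"

# Step 2: grep for that ip:port to catch every line about that scheduler,
# including ones that never mention the framework id.
grep '10.45.199.181:43573' "$MASTER_LOG"
```
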
Good catch, here are the relevant lines from the master logs (I grepped for the framework id -- how is the port determined?).

I0919 20:13:48.122954 941 master.cpp:1041] Received registration request from scheduler(13)@10.45.199.181:43573
I0919 20:13:48.123080 941 master.cpp:1059] Registering framework 20140818-235718-3165886730-5050-901-1507 at scheduler(13)@10.45.199.181:43573
I0919 20:13:48.123914 941 hierarchical_allocator_process.hpp:331] Added framework 20140818-235718-3165886730-5050-901-1507
I0919 20:13:48.124739 941 master.cpp:2933] Sending 2 offers to framework 20140818-235718-3165886730-5050-901-1507
I0919 20:13:48.129932 940 master.cpp:1889] Processing reply for offers: [ 20140818-235718-3165886730-5050-901-486420 ] on slave 20140818-235718-3165886730-5050-901-5 at slave(1)@10.101.195.45:5051 (ip-10-101-195-45.ec2.internal) for framework 20140818-235718-3165886730-5050-901-1507
I0919 20:13:48.130028 940 master.hpp:655] Adding task 1uZArT-yEeS7gCIACpcfeA with resources cpus(*):1; mem(*):1536 on slave 20140818-235718-3165886730-5050-901-5 (ip-10-101-195-45.ec2.internal)
I0919 20:13:48.130630 940 master.cpp:3111] Launching task 1uZArT-yEeS7gCIACpcfeA of framework 20140818-235718-3165886730-5050-901-1507 with resources cpus(*):1; mem(*):1536 on slave 20140818-235718-3165886730-5050-901-5 at slave(1)@10.101.195.45:5051 (ip-10-101-195-45.ec2.internal)
I0919 20:13:48.131531 940 master.cpp:1889] Processing reply for offers: [ 20140818-235718-3165886730-5050-901-486421 ] on slave 20140818-235718-3165886730-5050-901-4 at slave(1)@10.51.165.231:5051 (ip-10-51-165-231.ec2.internal) for framework 20140818-235718-3165886730-5050-901-1507
I0919 20:13:48.131589 937 hierarchical_allocator_process.hpp:589] Framework 20140818-235718-3165886730-5050-901-1507 filtered slave 20140818-235718-3165886730-5050-901-5 for 5secs
I0919 20:13:48.133910 937 hierarchical_allocator_process.hpp:589] Framework 20140818-235718-3165886730-5050-901-1507 filtered slave 20140818-235718-3165886730-5050-901-4 for 5secs
I0919 20:13:49.833616 940 master.cpp:2628] Status update TASK_RUNNING (UUID: ea495e2b-2d19-4206-ae08-9daad311a525) for task 1uZArT-yEeS7gCIACpcfeA of framework 20140818-235718-3165886730-5050-901-1507 from slave 20140818-235718-3165886730-5050-901-5 at slave(1)@10.101.195.45:5051 (ip-10-101-195-45.ec2.internal)
I0919 20:13:53.184873 941 master.cpp:2933] Sending 2 offers to framework 20140818-235718-3165886730-5050-901-1507
I0919 20:13:53.188966 938 master.cpp:1889] Processing reply for offers: [ 20140818-235718-3165886730-5050-901-486424 ] on slave 20140818-235718-3165886730-5050-901-5 at slave(1)@10.101.195.45:5051 (ip-10-101-195-45.ec2.internal) for framework 20140818-235718-3165886730-5050-901-1507
I0919 20:13:53.189149 938 hierarchical_allocator_process.hpp:589] Framework 20140818-235718-3165886730-5050-901-1507 filtered slave 20140818-235718-3165886730-5050-901-5 for 5secs
I0919 20:13:53.189857 937 master.cpp:1889] Processing reply for offers: [ 20140818-235718-3165886730-5050-901-486425 ] on slave 20140818-235718-3165886730-5050-901-4 at slave(1)@10.51.165.231:5051 (ip-10-51-165-231.ec2.internal) for framework 20140818-235718-3165886730-5050-901-1507
I0919 20:13:53.190274 937 hierarchical_allocator_process.hpp:589] Framework 20140818-235718-3165886730-5050-901-1507 filtered slave 20140818-235718-3165886730-5050-901-4 for 5secs
I0919 20:13:56.359884 935 master.cpp:2628] Status update TASK_FINISHED (UUID: e57c953e-d6c2-45e4-a541-1bcf16cd3179) for task 1uZArT-yEeS7gCIACpcfeA of framework 20140818-235718-3165886730-5050-901-1507 from slave 20140818-235718-3165886730-5050-901-5 at slave(1)@10.101.195.45:5051 (ip-10-101-195-45.ec2.internal)
I0919 20:13:56.360065 935 master.hpp:673] Removing task 1uZArT-yEeS7gCIACpcfeA with resources cpus(*):1; mem(*):1536 on slave 20140818-235718-3165886730-5050-901-5 (ip-10-101-195-45.ec2.internal)
I0919 20:13:56.360790 935 hierarchical_allocator_process.hpp:636] Recovered cpus(*):1; mem(*):1536 (total allocatable: cpus(*):8; mem(*):14019; disk(*):14910; ports(*):[31000-32000]) on slave 20140818-235718-3165886730-5050-901-5 from framework 20140818-235718-3165886730-5050-901-1507
I0919 20:13:56.363013 937 master.cpp:1272] Asked to unregister framework 20140818-235718-3165886730-5050-901-1507
I0919 20:13:56.363065 937 master.cpp:3344] Removing framework 20140818-235718-3165886730-5050-901-1507
I0919 20:13:56.363761 937 hierarchical_allocator_process.hpp:407] Deactivated framework 20140818-235718-3165886730-5050-901-1507
I0919 20:13:56.364296 937 hierarchical_allocator_process.hpp:362] Removed framework 20140818-235718-3165886730-5050-901-1507

>> Are there any special steps we should take, in this case? I expect that the JVM's regular GC should take care of this, but we've noticed the number of threads increase steadily over a matter of days when running 1 job at a time (the increase happens much more quickly when we run multiple jobs at once!).
>
> Can you dump a stack trace and tell us what those threads are? Or is it hard to tell past the JNI boundary?

There was one day where the process went from 40 to 42 Java threads.
The new threads were the following:

Thread 18610: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
 - scala.concurrent.forkjoin.ForkJoinPool.scan(scala.concurrent.forkjoin.ForkJoinPool$WorkQueue) @bci=525, line=2075 (Compiled frame)
 - scala.concurrent.forkjoin.ForkJoinPool.runWorker(scala.concurrent.forkjoin.ForkJoinPool$WorkQueue) @bci=8, line=1979 (Compiled frame)
 - scala.concurrent.forkjoin.ForkJoinWorkerThread.run() @bci=14, line=107 (Interpreted frame)

Thread 18607: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
 - scala.concurrent.forkjoin.ForkJoinPool.scan(scala.concurrent.forkjoin.ForkJoinPool$WorkQueue) @bci=525, line=2075 (Compiled frame)
 - scala.concurrent.forkjoin.ForkJoinPool.runWorker(scala.concurrent.forkjoin.ForkJoinPool$WorkQueue) @bci=8, line=1979 (Compiled frame)
 - scala.concurrent.forkjoin.ForkJoinWorkerThread.run() @bci=14, line=107 (Interpreted frame)

Thanks!
Colleen
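P.S. In case it helps anyone reproducing this: counting the suspect threads in a dump can be done mechanically, so the creep can be tracked over time. A sketch; on the live process you would capture `jstack <pid> > dump.txt`, but here `dump.txt` is a canned two-thread sample (mirroring the traces above) so the commands run as-is:

```shell
# Canned sample standing in for real `jstack <pid>` output.
cat > dump.txt <<'EOF'
Thread 18610: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame)
 - scala.concurrent.forkjoin.ForkJoinPool.scan(scala.concurrent.forkjoin.ForkJoinPool$WorkQueue) @bci=525, line=2075 (Compiled frame)
Thread 18607: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame)
 - scala.concurrent.forkjoin.ForkJoinPool.scan(scala.concurrent.forkjoin.ForkJoinPool$WorkQueue) @bci=525, line=2075 (Compiled frame)
EOF

# Count threads parked in Scala's ForkJoinPool; running this against
# periodic dumps shows whether these are the threads accumulating.
grep -c 'ForkJoinPool.scan' dump.txt
```
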

