Sorry for the delayed response. It looks like the master received the registration request from the scheduler at 20:13, at which point it sent a registration ack back to the scheduler. What is not clear is why no registration requests were sent or received between 19:47, when the framework apparently started, and 20:13. You can add GLOG_v=1 to the scheduler's environment to get more verbose logging from the driver; that will tell us what the driver has been doing. Are you sure you didn't have any network partitions around that time?
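For reference, here is a stripped-down sketch of the pattern you describe further down (the "Running job ..." line logged right before driver.run(), and a registered() callback that only logs). This is not your client and the framework name is made up; the point is that GLOG_v=1 has to be present in the environment of the JVM process itself, e.g. "GLOG_v=1 java -cp ... MinimalScheduler <master>", because it is read by the native libmesos (glog) side and cannot be set from within Java after startup.

// Minimal sketch of a JVM scheduler using the Mesos Java bindings.
// Not the actual client; the framework name below is made up.
import java.util.List;

import org.apache.mesos.MesosSchedulerDriver;
import org.apache.mesos.Protos;
import org.apache.mesos.Scheduler;
import org.apache.mesos.SchedulerDriver;

public class MinimalScheduler implements Scheduler {

  @Override
  public void registered(SchedulerDriver driver,
                         Protos.FrameworkID frameworkId,
                         Protos.MasterInfo masterInfo) {
    // Mirrors the "Registered as <framework> to master '<master>'" log line.
    System.out.println("Registered as " + frameworkId.getValue()
        + " to master '" + masterInfo.getId() + "'");
  }

  // The remaining callbacks are no-ops for this sketch.
  @Override public void reregistered(SchedulerDriver d, Protos.MasterInfo m) {}
  @Override public void resourceOffers(SchedulerDriver d, List<Protos.Offer> offers) {}
  @Override public void offerRescinded(SchedulerDriver d, Protos.OfferID id) {}
  @Override public void statusUpdate(SchedulerDriver d, Protos.TaskStatus status) {}
  @Override public void frameworkMessage(SchedulerDriver d, Protos.ExecutorID e,
                                          Protos.SlaveID s, byte[] data) {}
  @Override public void disconnected(SchedulerDriver d) {}
  @Override public void slaveLost(SchedulerDriver d, Protos.SlaveID id) {}
  @Override public void executorLost(SchedulerDriver d, Protos.ExecutorID e,
                                     Protos.SlaveID s, int status) {}
  @Override public void error(SchedulerDriver d, String message) {}

  public static void main(String[] args) {
    Protos.FrameworkInfo framework = Protos.FrameworkInfo.newBuilder()
        .setUser("")                  // empty user = current user
        .setName("registration-test") // made-up framework name
        .build();

    // args[0] is the master, e.g. host:5050 or a zk:// URL.
    System.out.println("Running job ...");  // logged immediately before run()
    MesosSchedulerDriver driver =
        new MesosSchedulerDriver(new MinimalScheduler(), framework, args[0]);
    driver.run();  // blocks until the driver is stopped or aborts
  }
}

With GLOG_v=1 exported this way, the driver's verbose logs should show whether it was trying to (re)register with the master during the 19:47-20:13 window.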
On Fri, Oct 10, 2014 at 1:30 PM, Colleen Lee <[email protected]> wrote:

>>> 2014-09-19 19:46:49,361 INFO [JobManager] Running job 1uZArT-yEeS7gCIACpcfeA
>>> <snip>
>>> 2014-09-19 20:13:48,134 INFO [JobScheduler] Job 1uZArT-yEeS7gCIACpcfeA: Registered as 20140818-235718-3165886730-5050-901-1507 to master '20140818-235718-3165886730-5050-901'
>>>
>>> The snipped code is for unrelated internals of our client. Going back to implementation, we call for the "Running job ..." log line to be output immediately before calling driver.run(), and our implementation of the registered() method in the scheduler is simply to print out the second log line above. During this time, from the mesos master logs, the master continues to function as normal, sending offers to (other) frameworks, processing the replies, adding/launching tasks, completing/removing tasks, unregistering/removing frameworks, etc. Here are the log lines that may be suspicious during that window:
>>>
>>> W0919 19:47:00.258894 938 master.cpp:2718] Ignoring unknown exited executor default on slave 20140818-235718-3165886730-5050-901-5 at slave(1)@10.101.195.45:5051 (ip-10-101-195-45.ec2.internal)
>>> W0919 19:47:00.260349 939 master.cpp:2718] Ignoring unknown exited executor default on slave 20140818-235718-3165886730-5050-901-5 at slave(1)@10.101.195.45:5051 (ip-10-101-195-45.ec2.internal)
>>> I0919 20:07:02.690067 940 master.cpp:1041] Received registration request from scheduler(316)@10.151.31.120:36446
>>> I0919 20:07:02.690192 940 master.cpp:1059] Registering framework 20140818-235718-3165886730-5050-901-1502 at scheduler(316)@10.151.31.120:36446
>>
>> The log line for framework in master is for a different framework (20140818-235718-3165886730-5050-901-1502) than the problematic job/framework (20140818-235718-3165886730-5050-901-1507). Can you grep for the log lines corresponding to the problematic framework (grep for ip:port) in the master logs? That should tell us what's happening.
>
> Good catch, here are the relevant lines from the master logs (I grepped for the framework id -- how is the port determined?).
>
> I0919 20:13:48.122954 941 master.cpp:1041] Received registration request from scheduler(13)@10.45.199.181:43573
> I0919 20:13:48.123080 941 master.cpp:1059] Registering framework 20140818-235718-3165886730-5050-901-1507 at scheduler(13)@10.45.199.181:43573
> I0919 20:13:48.123914 941 hierarchical_allocator_process.hpp:331] Added framework 20140818-235718-3165886730-5050-901-1507
> I0919 20:13:48.124739 941 master.cpp:2933] Sending 2 offers to framework 20140818-235718-3165886730-5050-901-1507
> I0919 20:13:48.129932 940 master.cpp:1889] Processing reply for offers: [ 20140818-235718-3165886730-5050-901-486420 ] on slave 20140818-235718-3165886730-5050-901-5 at slave(1)@10.101.195.45:5051 (ip-10-101-195-45.ec2.internal) for framework 20140818-235718-3165886730-5050-901-1507
> I0919 20:13:48.130028 940 master.hpp:655] Adding task 1uZArT-yEeS7gCIACpcfeA with resources cpus(*):1; mem(*):1536 on slave 20140818-235718-3165886730-5050-901-5 (ip-10-101-195-45.ec2.internal)
> I0919 20:13:48.130630 940 master.cpp:3111] Launching task 1uZArT-yEeS7gCIACpcfeA of framework 20140818-235718-3165886730-5050-901-1507 with resources cpus(*):1; mem(*):1536 on slave 20140818-235718-3165886730-5050-901-5 at slave(1)@10.101.195.45:5051 (ip-10-101-195-45.ec2.internal)
> I0919 20:13:48.131531 940 master.cpp:1889] Processing reply for offers: [ 20140818-235718-3165886730-5050-901-486421 ] on slave 20140818-235718-3165886730-5050-901-4 at slave(1)@10.51.165.231:5051 (ip-10-51-165-231.ec2.internal) for framework 20140818-235718-3165886730-5050-901-1507
> I0919 20:13:48.131589 937 hierarchical_allocator_process.hpp:589] Framework 20140818-235718-3165886730-5050-901-1507 filtered slave 20140818-235718-3165886730-5050-901-5 for 5secs
> I0919 20:13:48.133910 937 hierarchical_allocator_process.hpp:589] Framework 20140818-235718-3165886730-5050-901-1507 filtered slave 20140818-235718-3165886730-5050-901-4 for 5secs
> I0919 20:13:49.833616 940 master.cpp:2628] Status update TASK_RUNNING (UUID: ea495e2b-2d19-4206-ae08-9daad311a525) for task 1uZArT-yEeS7gCIACpcfeA of framework 20140818-235718-3165886730-5050-901-1507 from slave 20140818-235718-3165886730-5050-901-5 at slave(1)@10.101.195.45:5051 (ip-10-101-195-45.ec2.internal)
> I0919 20:13:53.184873 941 master.cpp:2933] Sending 2 offers to framework 20140818-235718-3165886730-5050-901-1507
> I0919 20:13:53.188966 938 master.cpp:1889] Processing reply for offers: [ 20140818-235718-3165886730-5050-901-486424 ] on slave 20140818-235718-3165886730-5050-901-5 at slave(1)@10.101.195.45:5051 (ip-10-101-195-45.ec2.internal) for framework 20140818-235718-3165886730-5050-901-1507
> I0919 20:13:53.189149 938 hierarchical_allocator_process.hpp:589] Framework 20140818-235718-3165886730-5050-901-1507 filtered slave 20140818-235718-3165886730-5050-901-5 for 5secs
> I0919 20:13:53.189857 937 master.cpp:1889] Processing reply for offers: [ 20140818-235718-3165886730-5050-901-486425 ] on slave 20140818-235718-3165886730-5050-901-4 at slave(1)@10.51.165.231:5051 (ip-10-51-165-231.ec2.internal) for framework 20140818-235718-3165886730-5050-901-1507
> I0919 20:13:53.190274 937 hierarchical_allocator_process.hpp:589] Framework 20140818-235718-3165886730-5050-901-1507 filtered slave 20140818-235718-3165886730-5050-901-4 for 5secs
> I0919 20:13:56.359884 935 master.cpp:2628] Status update TASK_FINISHED (UUID: e57c953e-d6c2-45e4-a541-1bcf16cd3179) for task 1uZArT-yEeS7gCIACpcfeA of framework 20140818-235718-3165886730-5050-901-1507 from slave 20140818-235718-3165886730-5050-901-5 at slave(1)@10.101.195.45:5051 (ip-10-101-195-45.ec2.internal)
> I0919 20:13:56.360065 935 master.hpp:673] Removing task 1uZArT-yEeS7gCIACpcfeA with resources cpus(*):1; mem(*):1536 on slave 20140818-235718-3165886730-5050-901-5 (ip-10-101-195-45.ec2.internal)
> I0919 20:13:56.360790 935 hierarchical_allocator_process.hpp:636] Recovered cpus(*):1; mem(*):1536 (total allocatable: cpus(*):8; mem(*):14019; disk(*):14910; ports(*):[31000-32000]) on slave 20140818-235718-3165886730-5050-901-5 from framework 20140818-235718-3165886730-5050-901-1507
> I0919 20:13:56.363013 937 master.cpp:1272] Asked to unregister framework 20140818-235718-3165886730-5050-901-1507
> I0919 20:13:56.363065 937 master.cpp:3344] Removing framework 20140818-235718-3165886730-5050-901-1507
> I0919 20:13:56.363761 937 hierarchical_allocator_process.hpp:407] Deactivated framework 20140818-235718-3165886730-5050-901-1507
> I0919 20:13:56.364296 937 hierarchical_allocator_process.hpp:362] Removed framework 20140818-235718-3165886730-5050-901-1507
>
>>> Are there any special steps we should take, in this case? I expect that the JVM's regular GC should take care of this, but we've noticed the number of threads increase steadily over a matter of days when running 1 job at a time (the increase happens much more quickly when we run multiple jobs at once!).
>>
>> Can you dump a stack trace and tell us what those threads are? Or is it hard to tell past the JNI boundary?
>
> There was one day where the process went from 40 to 42 Java threads. The new threads were the following:
>
> Thread 18610: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
>  - scala.concurrent.forkjoin.ForkJoinPool.scan(scala.concurrent.forkjoin.ForkJoinPool$WorkQueue) @bci=525, line=2075 (Compiled frame)
>  - scala.concurrent.forkjoin.ForkJoinPool.runWorker(scala.concurrent.forkjoin.ForkJoinPool$WorkQueue) @bci=8, line=1979 (Compiled frame)
>  - scala.concurrent.forkjoin.ForkJoinWorkerThread.run() @bci=14, line=107 (Interpreted frame)
>
> Thread 18607: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
>  - scala.concurrent.forkjoin.ForkJoinPool.scan(scala.concurrent.forkjoin.ForkJoinPool$WorkQueue) @bci=525, line=2075 (Compiled frame)
>  - scala.concurrent.forkjoin.ForkJoinPool.runWorker(scala.concurrent.forkjoin.ForkJoinPool$WorkQueue) @bci=8, line=1979 (Compiled frame)
>  - scala.concurrent.forkjoin.ForkJoinWorkerThread.run() @bci=14, line=107 (Interpreted frame)
>
> Thanks!
>
> Colleen
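On the thread growth in the quoted message above: those two stacks look like idle workers of a scala.concurrent fork-join pool parked in scan() waiting for work, i.e. threads created on the Java/Scala side rather than by libmesos. If the count keeps climbing, one low-effort way to see which pools are growing, without attaching jstack each time, is to log a periodic thread census from inside the scheduler JVM. Rough sketch only; ThreadCensus is a made-up helper, not part of Mesos:

import java.util.Map;

// Made-up helper: a daemon thread that periodically prints every live
// thread's name, state, and top stack frame, so a slowly growing pool
// stands out across samples.
public final class ThreadCensus implements Runnable {

  @Override
  public void run() {
    try {
      while (true) {
        Map<Thread, StackTraceElement[]> stacks = Thread.getAllStackTraces();
        System.out.println("live threads: " + stacks.size());
        for (Map.Entry<Thread, StackTraceElement[]> entry : stacks.entrySet()) {
          StackTraceElement[] frames = entry.getValue();
          System.out.println("  " + entry.getKey().getName()
              + " (" + entry.getKey().getState() + ")"
              + (frames.length > 0 ? " at " + frames[0] : ""));
        }
        Thread.sleep(60000);  // sample once a minute
      }
    } catch (InterruptedException ignored) {
      // Exit quietly when the scheduler shuts down.
    }
  }

  // Call once from the scheduler's startup code, e.g. before driver.run().
  public static void start() {
    Thread census = new Thread(new ThreadCensus(), "thread-census");
    census.setDaemon(true);
    census.start();
  }
}

Comparing a couple of samples a day apart should tell whether the extra threads all come from the same fork-join pool or from somewhere else.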

