>
>> 2014-09-19 19:46:49,361 INFO [JobManager] >>> Running job
>> 1uZArT-yEeS7gCIACpcfeA
>> <snip>
>> 2014-09-19 20:13:48,134 INFO [JobScheduler] >>> Job
>> 1uZArT-yEeS7gCIACpcfeA: Registered as
>> 20140818-235718-3165886730-5050-901-1507 to master
>> '20140818-235718-3165886730-5050-901'
>>
>> The snipped portion covers unrelated internals of our client. Back to the
>> implementation: we emit the "Running job ..." log line immediately before
>> calling driver.run(), and our implementation of the registered() callback
>> in the scheduler simply prints the second log line above. During that
>> window, the mesos master logs show the master continuing to function as
>> normal: sending offers to (other) frameworks, processing the replies,
>> adding/launching tasks, completing/removing tasks, unregistering/removing
>> frameworks, etc. Here are the log lines from that window that may be
>> suspicious:
>>
>> W0919 19:47:00.258894   938 master.cpp:2718] Ignoring unknown exited
>> executor default on slave 20140818-235718-3165886730-5050-901-5 at slave(1)@
>> 10.101.195.45:5051 (ip-10-101-195-45.ec2.internal)
>> W0919 19:47:00.260349   939 master.cpp:2718] Ignoring unknown exited
>> executor default on slave 20140818-235718-3165886730-5050-901-5 at slave(1)@
>> 10.101.195.45:5051 (ip-10-101-195-45.ec2.internal)
>> I0919 20:07:02.690067   940 master.cpp:1041] Received registration
>> request from scheduler(316)@10.151.31.120:36446
>> I0919 20:07:02.690192   940 master.cpp:1059] Registering framework
>> 20140818-235718-3165886730-5050-901-1502 at scheduler(316)@
>> 10.151.31.120:36446
>>
>
> The framework registration shown in the master log is for a different
> framework (20140818-235718-3165886730-5050-901-1502) than the problematic
> job/framework (20140818-235718-3165886730-5050-901-1507). Can you grep the
> master logs for the log lines corresponding to the problematic framework
> (grep for its ip:port)? That should tell us what's happening.
>

Good catch! Here are the relevant lines from the master logs (I grepped for
the framework id instead -- how is the port determined?).

I0919 20:13:48.122954   941 master.cpp:1041] Received registration request
from scheduler(13)@10.45.199.181:43573
I0919 20:13:48.123080   941 master.cpp:1059] Registering framework
20140818-235718-3165886730-5050-901-1507 at scheduler(13)@
10.45.199.181:43573
I0919 20:13:48.123914   941 hierarchical_allocator_process.hpp:331] Added
framework 20140818-235718-3165886730-5050-901-1507
I0919 20:13:48.124739   941 master.cpp:2933] Sending 2 offers to framework
20140818-235718-3165886730-5050-901-1507
I0919 20:13:48.129932   940 master.cpp:1889] Processing reply for offers: [
20140818-235718-3165886730-5050-901-486420 ] on slave
20140818-235718-3165886730-5050-901-5 at slave(1)@10.101.195.45:5051
(ip-10-101-195-45.ec2.internal) for framework
20140818-235718-3165886730-5050-901-1507
I0919 20:13:48.130028   940 master.hpp:655] Adding task
1uZArT-yEeS7gCIACpcfeA with resources cpus(*):1; mem(*):1536 on slave
20140818-235718-3165886730-5050-901-5 (ip-10-101-195-45.ec2.internal)
I0919 20:13:48.130630   940 master.cpp:3111] Launching task
1uZArT-yEeS7gCIACpcfeA of framework
20140818-235718-3165886730-5050-901-1507 with resources cpus(*):1;
mem(*):1536 on slave 20140818-235718-3165886730-5050-901-5 at slave(1)@
10.101.195.45:5051 (ip-10-101-195-45.ec2.internal)
I0919 20:13:48.131531   940 master.cpp:1889] Processing reply for offers: [
20140818-235718-3165886730-5050-901-486421 ] on slave
20140818-235718-3165886730-5050-901-4 at slave(1)@10.51.165.231:5051
(ip-10-51-165-231.ec2.internal) for framework
20140818-235718-3165886730-5050-901-1507
I0919 20:13:48.131589   937 hierarchical_allocator_process.hpp:589]
Framework 20140818-235718-3165886730-5050-901-1507 filtered slave
20140818-235718-3165886730-5050-901-5 for 5secs
I0919 20:13:48.133910   937 hierarchical_allocator_process.hpp:589]
Framework 20140818-235718-3165886730-5050-901-1507 filtered slave
20140818-235718-3165886730-5050-901-4 for 5secs
I0919 20:13:49.833616   940 master.cpp:2628] Status update TASK_RUNNING
(UUID: ea495e2b-2d19-4206-ae08-9daad311a525) for task
1uZArT-yEeS7gCIACpcfeA of framework
20140818-235718-3165886730-5050-901-1507 from slave
20140818-235718-3165886730-5050-901-5 at slave(1)@10.101.195.45:5051
(ip-10-101-195-45.ec2.internal)
I0919 20:13:53.184873   941 master.cpp:2933] Sending 2 offers to framework
20140818-235718-3165886730-5050-901-1507
I0919 20:13:53.188966   938 master.cpp:1889] Processing reply for offers: [
20140818-235718-3165886730-5050-901-486424 ] on slave
20140818-235718-3165886730-5050-901-5 at slave(1)@10.101.195.45:5051
(ip-10-101-195-45.ec2.internal) for framework
20140818-235718-3165886730-5050-901-1507
I0919 20:13:53.189149   938 hierarchical_allocator_process.hpp:589]
Framework 20140818-235718-3165886730-5050-901-1507 filtered slave
20140818-235718-3165886730-5050-901-5 for 5secs
I0919 20:13:53.189857   937 master.cpp:1889] Processing reply for offers: [
20140818-235718-3165886730-5050-901-486425 ] on slave
20140818-235718-3165886730-5050-901-4 at slave(1)@10.51.165.231:5051
(ip-10-51-165-231.ec2.internal) for framework
20140818-235718-3165886730-5050-901-1507
I0919 20:13:53.190274   937 hierarchical_allocator_process.hpp:589]
Framework 20140818-235718-3165886730-5050-901-1507 filtered slave
20140818-235718-3165886730-5050-901-4 for 5secs
I0919 20:13:56.359884   935 master.cpp:2628] Status update TASK_FINISHED
(UUID: e57c953e-d6c2-45e4-a541-1bcf16cd3179) for task
1uZArT-yEeS7gCIACpcfeA of framework
20140818-235718-3165886730-5050-901-1507 from slave
20140818-235718-3165886730-5050-901-5 at slave(1)@10.101.195.45:5051
(ip-10-101-195-45.ec2.internal)
I0919 20:13:56.360065   935 master.hpp:673] Removing task
1uZArT-yEeS7gCIACpcfeA with resources cpus(*):1; mem(*):1536 on slave
20140818-235718-3165886730-5050-901-5 (ip-10-101-195-45.ec2.internal)
I0919 20:13:56.360790   935 hierarchical_allocator_process.hpp:636]
Recovered cpus(*):1; mem(*):1536 (total allocatable: cpus(*):8;
mem(*):14019; disk(*):14910; ports(*):[31000-32000]) on slave
20140818-235718-3165886730-5050-901-5 from framework
20140818-235718-3165886730-5050-901-1507
I0919 20:13:56.363013   937 master.cpp:1272] Asked to unregister framework
20140818-235718-3165886730-5050-901-1507
I0919 20:13:56.363065   937 master.cpp:3344] Removing framework
20140818-235718-3165886730-5050-901-1507
I0919 20:13:56.363761   937 hierarchical_allocator_process.hpp:407]
Deactivated framework 20140818-235718-3165886730-5050-901-1507
I0919 20:13:56.364296   937 hierarchical_allocator_process.hpp:362] Removed
framework 20140818-235718-3165886730-5050-901-1507
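
For context on the client side, the wiring around driver.run() and
registered() looks roughly like the following (simplified sketch only --
class and framework names are illustrative, and the real offer/status
handling is elided):

import java.util.List;

import org.apache.mesos.MesosSchedulerDriver;
import org.apache.mesos.Protos.*;
import org.apache.mesos.Scheduler;
import org.apache.mesos.SchedulerDriver;

// Illustrative sketch, not our actual code.
public class JobScheduler implements Scheduler {

  // Our real registered() does essentially just this: log the framework id
  // and the master id (the second client log line quoted above).
  public void registered(SchedulerDriver driver, FrameworkID frameworkId,
                         MasterInfo masterInfo) {
    System.out.println(">>> Job: Registered as " + frameworkId.getValue()
        + " to master '" + masterInfo.getId() + "'");
  }

  public void resourceOffers(SchedulerDriver driver, List<Offer> offers) { /* launch task */ }
  public void statusUpdate(SchedulerDriver driver, TaskStatus status) { /* track task */ }
  public void reregistered(SchedulerDriver driver, MasterInfo masterInfo) {}
  public void offerRescinded(SchedulerDriver driver, OfferID offerId) {}
  public void frameworkMessage(SchedulerDriver driver, ExecutorID executorId,
                               SlaveID slaveId, byte[] data) {}
  public void disconnected(SchedulerDriver driver) {}
  public void slaveLost(SchedulerDriver driver, SlaveID slaveId) {}
  public void executorLost(SchedulerDriver driver, ExecutorID executorId,
                           SlaveID slaveId, int status) {}
  public void error(SchedulerDriver driver, String message) {}

  public static void main(String[] args) {
    FrameworkInfo framework = FrameworkInfo.newBuilder()
        .setUser("")            // let Mesos fill in the current user
        .setName("our-client")  // illustrative framework name
        .build();

    MesosSchedulerDriver driver =
        new MesosSchedulerDriver(new JobScheduler(), framework, args[0] /* master */);

    // The "Running job ..." line is logged immediately before this call, so
    // the ~27-minute gap in the first excerpt sits entirely between
    // driver.run() and the registered() callback above.
    System.out.println(">>> Running job ...");
    driver.run();  // blocks until the driver is stopped or aborted
  }
}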


>> Are there any special steps we should take in this case? I expect the
>> JVM's regular GC to take care of this, but we've noticed the number of
>> threads increase steadily over a matter of days when running one job at a
>> time (the increase happens much more quickly when we run multiple jobs at
>> once!).
>>
>
> Can you dump a stack trace and tell us what those threads are? Or is it
> hard to tell past the JNI boundary?
>

There was one day when the process went from 40 to 42 Java threads. The new
threads were the following:

Thread 18610: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
 - scala.concurrent.forkjoin.ForkJoinPool.scan(scala.concurrent.forkjoin.ForkJoinPool$WorkQueue) @bci=525, line=2075 (Compiled frame)
 - scala.concurrent.forkjoin.ForkJoinPool.runWorker(scala.concurrent.forkjoin.ForkJoinPool$WorkQueue) @bci=8, line=1979 (Compiled frame)
 - scala.concurrent.forkjoin.ForkJoinWorkerThread.run() @bci=14, line=107 (Interpreted frame)


Thread 18607: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
 - scala.concurrent.forkjoin.ForkJoinPool.scan(scala.concurrent.forkjoin.ForkJoinPool$WorkQueue) @bci=525, line=2075 (Compiled frame)
 - scala.concurrent.forkjoin.ForkJoinPool.runWorker(scala.concurrent.forkjoin.ForkJoinPool$WorkQueue) @bci=8, line=1979 (Compiled frame)
 - scala.concurrent.forkjoin.ForkJoinWorkerThread.run() @bci=14, line=107 (Interpreted frame)
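
In case it's useful, something like the following minimal sketch
(illustrative only, not our actual code) could be dropped into the client to
diff the Java-side thread list over time:

import java.util.Map;

// Snapshot every live Java thread with its id, name, state, and stack, so
// successive dumps can be diffed to see which threads are accumulating.
public class ThreadSnapshot {
  public static void dump() {
    Map<Thread, StackTraceElement[]> stacks = Thread.getAllStackTraces();
    System.out.println("Live Java threads: " + stacks.size());
    for (Map.Entry<Thread, StackTraceElement[]> entry : stacks.entrySet()) {
      Thread t = entry.getKey();
      System.out.println(t.getId() + " " + t.getName() + " (" + t.getState() + ")");
      for (StackTraceElement frame : entry.getValue()) {
        System.out.println("  at " + frame);
      }
    }
  }

  public static void main(String[] args) {
    dump();
  }
}

Note that this only sees threads the JVM knows about, so any native threads
created below the JNI boundary won't show up here.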

Thanks!

Colleen
