It looks like the scheduler is blocked somehow. The second TASK_FINISHED status update (@06:16:25.295990) is most likely resent by the status update manager because it never got an acknowledgement from the scheduler. You can run your scheduler with GLOG_v=2 to get more verbose logging from the scheduler process inside libmesos, for example:
$ GLOG_v=2 ./your-scheduler master:5050

Do you block in any callbacks?

Niklas

On 8 October 2014 18:23, Colleen Lee <[email protected]> wrote:

> We're encountering issues where our scheduler will never receive the
> TASK_FINISHED message, causing the scheduler driver to never be stopped.
>
> Logs from scheduler:
>
> 2014-10-01 05:48:14,549 INFO [JobManager] >>> Running job DTdJgUhTEeSnPCIACwSpXg
> 2014-10-01 06:06:06,536 INFO [JobScheduler] >>> Job DTdJgUhTEeSnPCIACwSpXg: Registered as 20140927-080358-3165886730-5050-928-0069 to master '20140927-080358-3165886730-5050-928'
> 2014-10-01 06:06:06,538 INFO [JobScheduler] >>> Job DTdJgUhTEeSnPCIACwSpXg: Found matching offer, declining all other offers
> 2014-10-01 06:06:08,706 INFO [JobScheduler] >>> Job DTdJgUhTEeSnPCIACwSpXg: Got status update task_id {
>   value: "DTdJgUhTEeSnPCIACwSpXg"
> }
> state: TASK_RUNNING
> slave_id {
>   value: "20140818-235718-3165886730-5050-901-4"
> }
> timestamp: 1.41214356869682E9
>
> Most of these logs are generated by our side, but they indicate that the
> driver is run, the framework successfully registered with the master, the
> framework successfully accepted offers, and the framework was successfully
> transmitted a TASK_RUNNING message.
>
> Logs from the master:
>
> /var/log/mesos/mesos-master.INFO:I1001 06:06:06.540736 951 master.hpp:655] Adding task DTdJgUhTEeSnPCIACwSpXg with resources cpus(*):1; mem(*):1536 on slave 20140818-235718-3165886730-5050-901-4 (ip-10-51-165-231.ec2.internal)
> /var/log/mesos/mesos-master.INFO:I1001 06:06:06.541306 951 master.cpp:3111] Launching task DTdJgUhTEeSnPCIACwSpXg of framework 20140927-080358-3165886730-5050-928-0069 with resources cpus(*):1; mem(*):1536 on slave 20140818-235718-3165886730-5050-901-4 at slave(1)@10.51.165.231:5051 (ip-10-51-165-231.ec2.internal)
> /var/log/mesos/mesos-master.INFO:I1001 06:06:08.699162 951 master.cpp:2628] Status update TASK_RUNNING (UUID: 0e17e133-c1d0-4a56-9cef-afb45a486045) for task DTdJgUhTEeSnPCIACwSpXg of framework 20140927-080358-3165886730-5050-928-0069 from slave 20140818-235718-3165886730-5050-901-4 at slave(1)@10.51.165.231:5051 (ip-10-51-165-231.ec2.internal)
> /var/log/mesos/mesos-master.INFO:I1001 06:16:15.295013 945 master.cpp:2628] Status update TASK_FINISHED (UUID: 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task DTdJgUhTEeSnPCIACwSpXg of framework 20140927-080358-3165886730-5050-928-0069 from slave 20140818-235718-3165886730-5050-901-4 at slave(1)@10.51.165.231:5051 (ip-10-51-165-231.ec2.internal)
> /var/log/mesos/mesos-master.INFO:I1001 06:16:15.295222 945 master.hpp:673] Removing task DTdJgUhTEeSnPCIACwSpXg with resources cpus(*):1; mem(*):1536 on slave 20140818-235718-3165886730-5050-901-4 (ip-10-51-165-231.ec2.internal)
> /var/log/mesos/mesos-master.INFO:W1001 06:16:25.295990 951 master.cpp:2621] Could not lookup task for status update TASK_FINISHED (UUID: 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task DTdJgUhTEeSnPCIACwSpXg of framework 20140927-080358-3165886730-5050-928-0069 from slave 20140818-235718-3165886730-5050-901-4 at slave(1)@10.51.165.231:5051 (ip-10-51-165-231.ec2.internal)
>
> The "Could not lookup task" messages have been recurring ever since.
>
> Logs from the slave:
>
> /var/log/mesos/mesos-slave.INFO:I1001 06:06:06.543417 1207 slave.cpp:933] Got assigned task DTdJgUhTEeSnPCIACwSpXg for framework 20140927-080358-3165886730-5050-928-0069
> /var/log/mesos/mesos-slave.INFO:I1001 06:06:06.543709 1207 slave.cpp:1043] Launching task DTdJgUhTEeSnPCIACwSpXg for framework 20140927-080358-3165886730-5050-928-0069
> /var/log/mesos/mesos-slave.INFO:I1001 06:06:06.544926 1207 slave.cpp:1153] Queuing task 'DTdJgUhTEeSnPCIACwSpXg' for executor default of framework '20140927-080358-3165886730-5050-928-0069
> /var/log/mesos/mesos-slave.INFO:I1001 06:06:08.644618 1205 slave.cpp:1783] Flushing queued task DTdJgUhTEeSnPCIACwSpXg for executor 'default' of framework 20140927-080358-3165886730-5050-928-0069
> /var/log/mesos/mesos-slave.INFO:I1001 06:06:08.698570 1204 slave.cpp:2018] Handling status update TASK_RUNNING (UUID: 0e17e133-c1d0-4a56-9cef-afb45a486045) for task DTdJgUhTEeSnPCIACwSpXg of framework 20140927-080358-3165886730-5050-928-0069 from executor(1)@10.51.165.231:42450
> /var/log/mesos/mesos-slave.INFO:I1001 06:06:08.698796 1204 status_update_manager.cpp:320] Received status update TASK_RUNNING (UUID: 0e17e133-c1d0-4a56-9cef-afb45a486045) for task DTdJgUhTEeSnPCIACwSpXg of framework 20140927-080358-3165886730-5050-928-0069
> /var/log/mesos/mesos-slave.INFO:I1001 06:06:08.698974 1204 status_update_manager.cpp:373] Forwarding status update TASK_RUNNING (UUID: 0e17e133-c1d0-4a56-9cef-afb45a486045) for task DTdJgUhTEeSnPCIACwSpXg of framework 20140927-080358-3165886730-5050-928-0069 to [email protected]:5050
> /var/log/mesos/mesos-slave.INFO:I1001 06:06:08.699262 1204 slave.cpp:2145] Sending acknowledgement for status update TASK_RUNNING (UUID: 0e17e133-c1d0-4a56-9cef-afb45a486045) for task DTdJgUhTEeSnPCIACwSpXg of framework 20140927-080358-3165886730-5050-928-0069 to executor(1)@10.51.165.231:42450
> /var/log/mesos/mesos-slave.INFO:I1001 06:06:08.711634 1208 status_update_manager.cpp:398] Received status update acknowledgement (UUID: 0e17e133-c1d0-4a56-9cef-afb45a486045) for task DTdJgUhTEeSnPCIACwSpXg of framework 20140927-080358-3165886730-5050-928-0069
> /var/log/mesos/mesos-slave.INFO:I1001 06:16:15.294246 1207 slave.cpp:2018] Handling status update TASK_FINISHED (UUID: 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task DTdJgUhTEeSnPCIACwSpXg of framework 20140927-080358-3165886730-5050-928-0069 from executor(1)@10.51.165.231:42450
> /var/log/mesos/mesos-slave.INFO:I1001 06:16:15.294914 1205 status_update_manager.cpp:320] Received status update TASK_FINISHED (UUID: 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task DTdJgUhTEeSnPCIACwSpXg of framework 20140927-080358-3165886730-5050-928-0069
> /var/log/mesos/mesos-slave.INFO:I1001 06:16:15.295012 1205 status_update_manager.cpp:373] Forwarding status update TASK_FINISHED (UUID: 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task DTdJgUhTEeSnPCIACwSpXg of framework 20140927-080358-3165886730-5050-928-0069 to [email protected]:5050
> /var/log/mesos/mesos-slave.INFO:I1001 06:16:15.295220 1205 slave.cpp:2145] Sending acknowledgement for status update TASK_FINISHED (UUID: 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task DTdJgUhTEeSnPCIACwSpXg of framework 20140927-080358-3165886730-5050-928-0069 to executor(1)@10.51.165.231:42450
> /var/log/mesos/mesos-slave.INFO:W1001 06:16:25.295886 1202 status_update_manager.cpp:477] Resending status update TASK_FINISHED (UUID: 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task DTdJgUhTEeSnPCIACwSpXg of framework 20140927-080358-3165886730-5050-928-0069
> /var/log/mesos/mesos-slave.INFO:I1001 06:16:25.296064 1202 status_update_manager.cpp:373] Forwarding status update TASK_FINISHED (UUID: 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task DTdJgUhTEeSnPCIACwSpXg of framework 20140927-080358-3165886730-5050-928-0069 to [email protected]:5050
>
> As with the master logs, the "Resending"/"Forwarding" log lines have been
> recurring ever since.
>
> Finally, logs from the executor indicate that the TASK_FINISHED message
> had been sent. One oddity here is that some log lines that should be
> printed by the executor after the TASK_FINISHED message is sent never
> appear -- I'm guessing this is some kind of race condition, which explains
> why this situation only happens occasionally. That said, there are no logs
> about the executor's killTask, shutdown, or error methods being invoked, so
> I'm not sure what is racing here.
>
> Any thoughts on what is going on here/how to debug would be appreciated.
> Thanks! =)
>
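
To illustrate the "blocked callback" theory: with implicit acknowledgements (the default), libmesos sends the acknowledgement for a status update back to the slave only after your Scheduler's statusUpdate() callback returns, and callbacks are delivered serially on the driver's thread. So if statusUpdate() does long-running work, or waits on something that never completes, the TASK_FINISHED update is never acknowledged and the slave's status update manager keeps resending it, which matches the logs above. Below is a minimal Java sketch of the usual workaround, handing the work off to a separate thread so the callback returns right away; the class name and worker setup here are illustrative, not taken from the scheduler in this thread.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.mesos.Protos.TaskState;
import org.apache.mesos.Protos.TaskStatus;
import org.apache.mesos.Scheduler;
import org.apache.mesos.SchedulerDriver;

// Illustrative sketch only: keep statusUpdate() non-blocking so the driver
// thread can acknowledge updates; do the real work on a separate thread.
public abstract class NonBlockingScheduler implements Scheduler {

    // Worker thread for anything slow (bookkeeping, persistence, waiting, ...).
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    @Override
    public void statusUpdate(final SchedulerDriver driver, final TaskStatus status) {
        // Do not block here: with implicit acknowledgements the driver only
        // acknowledges this update after the callback returns.
        worker.submit(new Runnable() {
            @Override
            public void run() {
                if (status.getState() == TaskState.TASK_FINISHED) {
                    // ... finish-up bookkeeping for the job goes here ...
                    driver.stop();  // driver methods may be called from other threads
                }
            }
        });
    }

    // (registered, resourceOffers, etc. omitted; they should also avoid blocking.)
}

If the callback really is stuck, running the scheduler with GLOG_v=2 as suggested above should show whether the resent TASK_FINISHED updates are still reaching the driver and whether any acknowledgement ever goes back to the slave.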

