It looks like the first TASK_FINISHED made it to the master, and should have been forwarded to the framework, but the acknowledgement never made it back to the slave. If you're using an old version of Mesos (pre-0.19), the acks may not be going through the master, so that would explain them not showing up in the master logs. If your Mesos is 0.19 or later, you should see a "Forwarding status update acknowledgement" log message when the master receives an ack. If the master/slave is not receiving the ack for TASK_FINISHED, it could be because your scheduler is never sending one.
On Tue, Oct 14, 2014 at 9:52 AM, Niklas Nielsen <[email protected]> wrote: > It could look like the scheduler is blocked somehow. The second > TASK_FINISHED status update (@06:16:25.295990) is most likely resent by > the status update manager because it never got an acknowledgement from the > scheduler. > You can run your scheduler with GLOG_v=2 to get more information about > from scheduler process in libmesos, so for example > > $ GLOG_v=2 ./your-scheduler master:5050 > > Do you block in any callbacks? > > Niklas > > On 8 October 2014 18:23, Colleen Lee <[email protected]> wrote: > >> We're encountering issues where our scheduler will never receive the >> TASK_FINISHED message, causing the scheduler driver to never be stopped. >> >> Logs from scheduler: >> >> 2014-10-01 05:48:14,549 INFO [JobManager] >>> Running job >> DTdJgUhTEeSnPCIACwSpXg >> 2014-10-01 06:06:06,536 INFO [JobScheduler] >>> Job >> DTdJgUhTEeSnPCIACwSpXg: Registered as >> 20140927-080358-3165886730-5050-928-0069 to master >> '20140927-080358-3165886730-5050-928' >> 2014-10-01 06:06:06,538 INFO [JobScheduler] >>> Job >> DTdJgUhTEeSnPCIACwSpXg: Found matching offer, declining all other offers >> 2014-10-01 06:06:08,706 INFO [JobScheduler] >>> Job >> DTdJgUhTEeSnPCIACwSpXg: Got status update task_id { >> value: "DTdJgUhTEeSnPCIACwSpXg" >> } >> state: TASK_RUNNING >> slave_id { >> value: "20140818-235718-3165886730-5050-901-4" >> } >> timestamp: 1.41214356869682E9 >> >> Most of these logs are generated by our side, but they indicate that the >> driver is run, the framework successfully registered with the master, the >> framework successfully accepted offers, and the framework was successfully >> transmitted a TASK_RUNNING message. >> >> Logs from the master: >> >> /var/log/mesos/mesos-master.INFO:I1001 06:06:06.540736 951 >> master.hpp:655] Adding task DTdJgUhTEeSnPCIACwSpXg with resources >> cpus(*):1; mem(*):1536 on slave 20140818-235718-3165886730-5050-901-4 >> (ip-10-51-165-231.ec2.internal) >> /var/log/mesos/mesos-master.INFO:I1001 06:06:06.541306 951 >> master.cpp:3111] Launching task DTdJgUhTEeSnPCIACwSpXg of framework >> 20140927-080358-3165886730-5050-928-0069 with resources cpus(*):1; >> mem(*):1536 on slave 20140818-235718-3165886730-5050-901-4 at slave(1)@ >> 10.51.165.231:5051 (ip-10-51-165-231.ec2.internal) >> /var/log/mesos/mesos-master.INFO:I1001 06:06:08.699162 951 >> master.cpp:2628] Status update TASK_RUNNING (UUID: >> 0e17e133-c1d0-4a56-9cef-afb45a486045) for task DTdJgUhTEeSnPCIACwSpXg of >> framework 20140927-080358-3165886730-5050-928-0069 from slave >> 20140818-235718-3165886730-5050-901-4 at slave(1)@10.51.165.231:5051 >> (ip-10-51-165-231.ec2.internal) >> /var/log/mesos/mesos-master.INFO:I1001 06:16:15.295013 945 >> master.cpp:2628] Status update TASK_FINISHED (UUID: >> 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task DTdJgUhTEeSnPCIACwSpXg of >> framework 20140927-080358-3165886730-5050-928-0069 from slave >> 20140818-235718-3165886730-5050-901-4 at slave(1)@10.51.165.231:5051 >> (ip-10-51-165-231.ec2.internal) >> /var/log/mesos/mesos-master.INFO:I1001 06:16:15.295222 945 >> master.hpp:673] Removing task DTdJgUhTEeSnPCIACwSpXg with resources >> cpus(*):1; mem(*):1536 on slave 20140818-235718-3165886730-5050-901-4 >> (ip-10-51-165-231.ec2.internal) >> /var/log/mesos/mesos-master.INFO:W1001 06:16:25.295990 951 >> master.cpp:2621] Could not lookup task for status update TASK_FINISHED >> (UUID: 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task >> DTdJgUhTEeSnPCIACwSpXg of framework >> 20140927-080358-3165886730-5050-928-0069 from slave >> 20140818-235718-3165886730-5050-901-4 at slave(1)@10.51.165.231:5051 >> (ip-10-51-165-231.ec2.internal) >> >> The "Could not lookup task" messages have been recurring ever since. >> >> Logs from the slave: >> >> /var/log/mesos/mesos-slave.INFO:I1001 06:06:06.543417 1207 >> slave.cpp:933] Got assigned task DTdJgUhTEeSnPCIACwSpXg for framework >> 20140927-080358-3165886730-5050-928-0069 >> /var/log/mesos/mesos-slave.INFO:I1001 06:06:06.543709 1207 >> slave.cpp:1043] Launching task DTdJgUhTEeSnPCIACwSpXg for framework >> 20140927-080358-3165886730-5050-928-0069 >> /var/log/mesos/mesos-slave.INFO:I1001 06:06:06.544926 1207 >> slave.cpp:1153] Queuing task 'DTdJgUhTEeSnPCIACwSpXg' for executor default >> of framework '20140927-080358-3165886730-5050-928-0069 >> /var/log/mesos/mesos-slave.INFO:I1001 06:06:08.644618 1205 >> slave.cpp:1783] Flushing queued task DTdJgUhTEeSnPCIACwSpXg for executor >> 'default' of framework 20140927-080358-3165886730-5050-928-0069 >> /var/log/mesos/mesos-slave.INFO:I1001 06:06:08.698570 1204 >> slave.cpp:2018] Handling status update TASK_RUNNING (UUID: >> 0e17e133-c1d0-4a56-9cef-afb45a486045) for task DTdJgUhTEeSnPCIACwSpXg of >> framework 20140927-080358-3165886730-5050-928-0069 from executor(1)@ >> 10.51.165.231:42450 >> /var/log/mesos/mesos-slave.INFO:I1001 06:06:08.698796 1204 >> status_update_manager.cpp:320] Received status update TASK_RUNNING (UUID: >> 0e17e133-c1d0-4a56-9cef-afb45a486045) for task DTdJgUhTEeSnPCIACwSpXg of >> framework 20140927-080358-3165886730-5050-928-0069 >> /var/log/mesos/mesos-slave.INFO:I1001 06:06:08.698974 1204 >> status_update_manager.cpp:373] Forwarding status update TASK_RUNNING (UUID: >> 0e17e133-c1d0-4a56-9cef-afb45a486045) for task DTdJgUhTEeSnPCIACwSpXg of >> framework 20140927-080358-3165886730-5050-928-0069 to >> [email protected]:5050 >> /var/log/mesos/mesos-slave.INFO:I1001 06:06:08.699262 1204 >> slave.cpp:2145] Sending acknowledgement for status update TASK_RUNNING >> (UUID: 0e17e133-c1d0-4a56-9cef-afb45a486045) for task >> DTdJgUhTEeSnPCIACwSpXg of framework >> 20140927-080358-3165886730-5050-928-0069 to executor(1)@ >> 10.51.165.231:42450 >> /var/log/mesos/mesos-slave.INFO:I1001 06:06:08.711634 1208 >> status_update_manager.cpp:398] Received status update acknowledgement >> (UUID: 0e17e133-c1d0-4a56-9cef-afb45a486045) for task >> DTdJgUhTEeSnPCIACwSpXg of framework 20140927-080358-3165886730-5050-928-0069 >> /var/log/mesos/mesos-slave.INFO:I1001 06:16:15.294246 1207 >> slave.cpp:2018] Handling status update TASK_FINISHED (UUID: >> 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task DTdJgUhTEeSnPCIACwSpXg of >> framework 20140927-080358-3165886730-5050-928-0069 from executor(1)@ >> 10.51.165.231:42450 >> /var/log/mesos/mesos-slave.INFO:I1001 06:16:15.294914 1205 >> status_update_manager.cpp:320] Received status update TASK_FINISHED (UUID: >> 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task DTdJgUhTEeSnPCIACwSpXg of >> framework 20140927-080358-3165886730-5050-928-0069 >> /var/log/mesos/mesos-slave.INFO:I1001 06:16:15.295012 1205 >> status_update_manager.cpp:373] Forwarding status update TASK_FINISHED >> (UUID: 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task >> DTdJgUhTEeSnPCIACwSpXg of framework >> 20140927-080358-3165886730-5050-928-0069 to [email protected]:5050 >> /var/log/mesos/mesos-slave.INFO:I1001 06:16:15.295220 1205 >> slave.cpp:2145] Sending acknowledgement for status update TASK_FINISHED >> (UUID: 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task >> DTdJgUhTEeSnPCIACwSpXg of framework >> 20140927-080358-3165886730-5050-928-0069 to executor(1)@ >> 10.51.165.231:42450 >> /var/log/mesos/mesos-slave.INFO:W1001 06:16:25.295886 1202 >> status_update_manager.cpp:477] Resending status update TASK_FINISHED (UUID: >> 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task DTdJgUhTEeSnPCIACwSpXg of >> framework 20140927-080358-3165886730-5050-928-0069 >> /var/log/mesos/mesos-slave.INFO:I1001 06:16:25.296064 1202 >> status_update_manager.cpp:373] Forwarding status update TASK_FINISHED >> (UUID: 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task >> DTdJgUhTEeSnPCIACwSpXg of framework >> 20140927-080358-3165886730-5050-928-0069 to [email protected]:5050 >> >> As with the master logs, the "Resending"/"Forwarding" log lines have been >> recurring ever since. >> >> Finally, logs from the executor indicate that the TASK_FINISHED message >> had been sent. One oddity here is that some log lines that should be >> printed by the executor after the TASK_FINISHED message is sent never >> appear -- I'm guessing this is some kind of race condition, which explains >> why this situation only happens occasionally. That said, there are no logs >> about the executor's killTask, shutdown, or error methods being invoked, so >> I'm not sure what is racing here. >> >> Any thoughts on what is going on here/how to debug would be appreciated. >> Thanks! =) >> > >

