It looks like the first TASK_FINISHED made it to the master, and should
have been forwarded to the framework, but the acknowledgement never made it
back to the slave. If you're using an old version of Mesos (pre-0.19), the
acks may not be going through the master, so that would explain them not
showing up in the master logs. If your Mesos is 0.19 or later, you should
see a "Forwarding status update acknowledgement" log message when the
master receives an ack. If the master/slave is not receiving the ack for
TASK_FINISHED, it could be because your scheduler is never sending one.

On Tue, Oct 14, 2014 at 9:52 AM, Niklas Nielsen <[email protected]>
wrote:

> It could look like the scheduler is blocked somehow. The second
> TASK_FINISHED status update (@06:16:25.295990) is most likely resent by
> the status update manager because it never got an acknowledgement from the
> scheduler.
> You can run your scheduler with GLOG_v=2 to get more information about
> from scheduler process in libmesos, so for example
>
> $ GLOG_v=2 ./your-scheduler master:5050
>
> Do you block in any callbacks?
>
> Niklas
>
> On 8 October 2014 18:23, Colleen Lee <[email protected]> wrote:
>
>> We're encountering issues where our scheduler will never receive the
>> TASK_FINISHED message, causing the scheduler driver to never be stopped.
>>
>> Logs from scheduler:
>>
>> 2014-10-01 05:48:14,549 INFO [JobManager] >>> Running job
>> DTdJgUhTEeSnPCIACwSpXg
>> 2014-10-01 06:06:06,536 INFO [JobScheduler] >>> Job
>> DTdJgUhTEeSnPCIACwSpXg: Registered as
>> 20140927-080358-3165886730-5050-928-0069 to master
>> '20140927-080358-3165886730-5050-928'
>> 2014-10-01 06:06:06,538 INFO [JobScheduler] >>> Job
>> DTdJgUhTEeSnPCIACwSpXg: Found matching offer, declining all other offers
>> 2014-10-01 06:06:08,706 INFO [JobScheduler] >>> Job
>> DTdJgUhTEeSnPCIACwSpXg: Got status update task_id {
>>   value: "DTdJgUhTEeSnPCIACwSpXg"
>> }
>> state: TASK_RUNNING
>> slave_id {
>>   value: "20140818-235718-3165886730-5050-901-4"
>> }
>> timestamp: 1.41214356869682E9
>>
>> Most of these logs are generated by our side, but they indicate that the
>> driver is run, the framework successfully registered with the master, the
>> framework successfully accepted offers, and the framework was successfully
>> transmitted a TASK_RUNNING message.
>>
>> Logs from the master:
>>
>> /var/log/mesos/mesos-master.INFO:I1001 06:06:06.540736   951
>> master.hpp:655] Adding task DTdJgUhTEeSnPCIACwSpXg with resources
>> cpus(*):1; mem(*):1536 on slave 20140818-235718-3165886730-5050-901-4
>> (ip-10-51-165-231.ec2.internal)
>> /var/log/mesos/mesos-master.INFO:I1001 06:06:06.541306   951
>> master.cpp:3111] Launching task DTdJgUhTEeSnPCIACwSpXg of framework
>> 20140927-080358-3165886730-5050-928-0069 with resources cpus(*):1;
>> mem(*):1536 on slave 20140818-235718-3165886730-5050-901-4 at slave(1)@
>> 10.51.165.231:5051 (ip-10-51-165-231.ec2.internal)
>> /var/log/mesos/mesos-master.INFO:I1001 06:06:08.699162   951
>> master.cpp:2628] Status update TASK_RUNNING (UUID:
>> 0e17e133-c1d0-4a56-9cef-afb45a486045) for task DTdJgUhTEeSnPCIACwSpXg of
>> framework 20140927-080358-3165886730-5050-928-0069 from slave
>> 20140818-235718-3165886730-5050-901-4 at slave(1)@10.51.165.231:5051
>> (ip-10-51-165-231.ec2.internal)
>> /var/log/mesos/mesos-master.INFO:I1001 06:16:15.295013   945
>> master.cpp:2628] Status update TASK_FINISHED (UUID:
>> 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task DTdJgUhTEeSnPCIACwSpXg of
>> framework 20140927-080358-3165886730-5050-928-0069 from slave
>> 20140818-235718-3165886730-5050-901-4 at slave(1)@10.51.165.231:5051
>> (ip-10-51-165-231.ec2.internal)
>> /var/log/mesos/mesos-master.INFO:I1001 06:16:15.295222   945
>> master.hpp:673] Removing task DTdJgUhTEeSnPCIACwSpXg with resources
>> cpus(*):1; mem(*):1536 on slave 20140818-235718-3165886730-5050-901-4
>> (ip-10-51-165-231.ec2.internal)
>> /var/log/mesos/mesos-master.INFO:W1001 06:16:25.295990   951
>> master.cpp:2621] Could not lookup task for status update TASK_FINISHED
>> (UUID: 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task
>> DTdJgUhTEeSnPCIACwSpXg of framework
>> 20140927-080358-3165886730-5050-928-0069 from slave
>> 20140818-235718-3165886730-5050-901-4 at slave(1)@10.51.165.231:5051
>> (ip-10-51-165-231.ec2.internal)
>>
>> The "Could not lookup task" messages have been recurring ever since.
>>
>> Logs from the slave:
>>
>> /var/log/mesos/mesos-slave.INFO:I1001 06:06:06.543417  1207
>> slave.cpp:933] Got assigned task DTdJgUhTEeSnPCIACwSpXg for framework
>> 20140927-080358-3165886730-5050-928-0069
>> /var/log/mesos/mesos-slave.INFO:I1001 06:06:06.543709  1207
>> slave.cpp:1043] Launching task DTdJgUhTEeSnPCIACwSpXg for framework
>> 20140927-080358-3165886730-5050-928-0069
>> /var/log/mesos/mesos-slave.INFO:I1001 06:06:06.544926  1207
>> slave.cpp:1153] Queuing task 'DTdJgUhTEeSnPCIACwSpXg' for executor default
>> of framework '20140927-080358-3165886730-5050-928-0069
>> /var/log/mesos/mesos-slave.INFO:I1001 06:06:08.644618  1205
>> slave.cpp:1783] Flushing queued task DTdJgUhTEeSnPCIACwSpXg for executor
>> 'default' of framework 20140927-080358-3165886730-5050-928-0069
>> /var/log/mesos/mesos-slave.INFO:I1001 06:06:08.698570  1204
>> slave.cpp:2018] Handling status update TASK_RUNNING (UUID:
>> 0e17e133-c1d0-4a56-9cef-afb45a486045) for task DTdJgUhTEeSnPCIACwSpXg of
>> framework 20140927-080358-3165886730-5050-928-0069 from executor(1)@
>> 10.51.165.231:42450
>> /var/log/mesos/mesos-slave.INFO:I1001 06:06:08.698796  1204
>> status_update_manager.cpp:320] Received status update TASK_RUNNING (UUID:
>> 0e17e133-c1d0-4a56-9cef-afb45a486045) for task DTdJgUhTEeSnPCIACwSpXg of
>> framework 20140927-080358-3165886730-5050-928-0069
>> /var/log/mesos/mesos-slave.INFO:I1001 06:06:08.698974  1204
>> status_update_manager.cpp:373] Forwarding status update TASK_RUNNING (UUID:
>> 0e17e133-c1d0-4a56-9cef-afb45a486045) for task DTdJgUhTEeSnPCIACwSpXg of
>> framework 20140927-080358-3165886730-5050-928-0069 to
>> [email protected]:5050
>> /var/log/mesos/mesos-slave.INFO:I1001 06:06:08.699262  1204
>> slave.cpp:2145] Sending acknowledgement for status update TASK_RUNNING
>> (UUID: 0e17e133-c1d0-4a56-9cef-afb45a486045) for task
>> DTdJgUhTEeSnPCIACwSpXg of framework
>> 20140927-080358-3165886730-5050-928-0069 to executor(1)@
>> 10.51.165.231:42450
>> /var/log/mesos/mesos-slave.INFO:I1001 06:06:08.711634  1208
>> status_update_manager.cpp:398] Received status update acknowledgement
>> (UUID: 0e17e133-c1d0-4a56-9cef-afb45a486045) for task
>> DTdJgUhTEeSnPCIACwSpXg of framework 20140927-080358-3165886730-5050-928-0069
>> /var/log/mesos/mesos-slave.INFO:I1001 06:16:15.294246  1207
>> slave.cpp:2018] Handling status update TASK_FINISHED (UUID:
>> 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task DTdJgUhTEeSnPCIACwSpXg of
>> framework 20140927-080358-3165886730-5050-928-0069 from executor(1)@
>> 10.51.165.231:42450
>> /var/log/mesos/mesos-slave.INFO:I1001 06:16:15.294914  1205
>> status_update_manager.cpp:320] Received status update TASK_FINISHED (UUID:
>> 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task DTdJgUhTEeSnPCIACwSpXg of
>> framework 20140927-080358-3165886730-5050-928-0069
>> /var/log/mesos/mesos-slave.INFO:I1001 06:16:15.295012  1205
>> status_update_manager.cpp:373] Forwarding status update TASK_FINISHED
>> (UUID: 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task
>> DTdJgUhTEeSnPCIACwSpXg of framework
>> 20140927-080358-3165886730-5050-928-0069 to [email protected]:5050
>> /var/log/mesos/mesos-slave.INFO:I1001 06:16:15.295220  1205
>> slave.cpp:2145] Sending acknowledgement for status update TASK_FINISHED
>> (UUID: 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task
>> DTdJgUhTEeSnPCIACwSpXg of framework
>> 20140927-080358-3165886730-5050-928-0069 to executor(1)@
>> 10.51.165.231:42450
>> /var/log/mesos/mesos-slave.INFO:W1001 06:16:25.295886  1202
>> status_update_manager.cpp:477] Resending status update TASK_FINISHED (UUID:
>> 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task DTdJgUhTEeSnPCIACwSpXg of
>> framework 20140927-080358-3165886730-5050-928-0069
>> /var/log/mesos/mesos-slave.INFO:I1001 06:16:25.296064  1202
>> status_update_manager.cpp:373] Forwarding status update TASK_FINISHED
>> (UUID: 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task
>> DTdJgUhTEeSnPCIACwSpXg of framework
>> 20140927-080358-3165886730-5050-928-0069 to [email protected]:5050
>>
>> As with the master logs, the "Resending"/"Forwarding" log lines have been
>> recurring ever since.
>>
>> Finally, logs from the executor indicate that the TASK_FINISHED message
>> had been sent. One oddity here is that some log lines that should be
>> printed by the executor after the TASK_FINISHED message is sent never
>> appear -- I'm guessing this is some kind of race condition, which explains
>> why this situation only happens occasionally. That said, there are no logs
>> about the executor's killTask, shutdown, or error methods being invoked, so
>> I'm not sure what is racing here.
>>
>> Any thoughts on what is going on here/how to debug would be appreciated.
>> Thanks! =)
>>
>
>

Reply via email to