It looks like the scheduler may be blocked somehow. The second
TASK_FINISHED status update (@06:16:25.295990) is most likely being resent by
the status update manager because it never received an acknowledgement from
the scheduler.
You can run your scheduler with GLOG_v=2 to get more information from the
scheduler process inside libmesos, for example:

$ GLOG_v=2 ./your-scheduler master:5050

Do you block in any callbacks?
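To illustrate the pattern (a sketch with hypothetical names, not your actual
code): the driver only acknowledges a status update after your statusUpdate
callback returns, so any long-running work done inline there can hold up the
acknowledgement and trigger exactly this resend loop. Handing the work off to
a worker pool keeps the callback fast:

```python
# Sketch: keep scheduler callbacks non-blocking by offloading slow work.
# JobScheduler and handle_status are hypothetical stand-ins for your
# framework code; only the hand-off pattern matters here.
from concurrent.futures import ThreadPoolExecutor

class JobScheduler:
    def __init__(self):
        # One worker pool for anything slow (DB writes, RPCs, etc.).
        self.workers = ThreadPoolExecutor(max_workers=4)

    def statusUpdate(self, driver, status):
        # Return quickly: the acknowledgement is sent only after this
        # callback returns, so never block in it.
        self.workers.submit(self.handle_status, status)

    def handle_status(self, status):
        # Slow processing happens off the driver's callback thread.
        pass
```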

Niklas

On 8 October 2014 18:23, Colleen Lee <[email protected]> wrote:

> We're encountering issues where our scheduler will never receive the
> TASK_FINISHED message, causing the scheduler driver to never be stopped.
>
> Logs from scheduler:
>
> 2014-10-01 05:48:14,549 INFO [JobManager] >>> Running job
> DTdJgUhTEeSnPCIACwSpXg
> 2014-10-01 06:06:06,536 INFO [JobScheduler] >>> Job
> DTdJgUhTEeSnPCIACwSpXg: Registered as
> 20140927-080358-3165886730-5050-928-0069 to master
> '20140927-080358-3165886730-5050-928'
> 2014-10-01 06:06:06,538 INFO [JobScheduler] >>> Job
> DTdJgUhTEeSnPCIACwSpXg: Found matching offer, declining all other offers
> 2014-10-01 06:06:08,706 INFO [JobScheduler] >>> Job
> DTdJgUhTEeSnPCIACwSpXg: Got status update task_id {
>   value: "DTdJgUhTEeSnPCIACwSpXg"
> }
> state: TASK_RUNNING
> slave_id {
>   value: "20140818-235718-3165886730-5050-901-4"
> }
> timestamp: 1.41214356869682E9
>
> Most of these logs are generated by our side, but they indicate that the
> driver is run, the framework successfully registered with the master, the
> framework successfully accepted offers, and the framework was successfully
> transmitted a TASK_RUNNING message.
>
> Logs from the master:
>
> /var/log/mesos/mesos-master.INFO:I1001 06:06:06.540736   951
> master.hpp:655] Adding task DTdJgUhTEeSnPCIACwSpXg with resources
> cpus(*):1; mem(*):1536 on slave 20140818-235718-3165886730-5050-901-4
> (ip-10-51-165-231.ec2.internal)
> /var/log/mesos/mesos-master.INFO:I1001 06:06:06.541306   951
> master.cpp:3111] Launching task DTdJgUhTEeSnPCIACwSpXg of framework
> 20140927-080358-3165886730-5050-928-0069 with resources cpus(*):1;
> mem(*):1536 on slave 20140818-235718-3165886730-5050-901-4 at slave(1)@
> 10.51.165.231:5051 (ip-10-51-165-231.ec2.internal)
> /var/log/mesos/mesos-master.INFO:I1001 06:06:08.699162   951
> master.cpp:2628] Status update TASK_RUNNING (UUID:
> 0e17e133-c1d0-4a56-9cef-afb45a486045) for task DTdJgUhTEeSnPCIACwSpXg of
> framework 20140927-080358-3165886730-5050-928-0069 from slave
> 20140818-235718-3165886730-5050-901-4 at slave(1)@10.51.165.231:5051
> (ip-10-51-165-231.ec2.internal)
> /var/log/mesos/mesos-master.INFO:I1001 06:16:15.295013   945
> master.cpp:2628] Status update TASK_FINISHED (UUID:
> 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task DTdJgUhTEeSnPCIACwSpXg of
> framework 20140927-080358-3165886730-5050-928-0069 from slave
> 20140818-235718-3165886730-5050-901-4 at slave(1)@10.51.165.231:5051
> (ip-10-51-165-231.ec2.internal)
> /var/log/mesos/mesos-master.INFO:I1001 06:16:15.295222   945
> master.hpp:673] Removing task DTdJgUhTEeSnPCIACwSpXg with resources
> cpus(*):1; mem(*):1536 on slave 20140818-235718-3165886730-5050-901-4
> (ip-10-51-165-231.ec2.internal)
> /var/log/mesos/mesos-master.INFO:W1001 06:16:25.295990   951
> master.cpp:2621] Could not lookup task for status update TASK_FINISHED
> (UUID: 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task
> DTdJgUhTEeSnPCIACwSpXg of framework
> 20140927-080358-3165886730-5050-928-0069 from slave
> 20140818-235718-3165886730-5050-901-4 at slave(1)@10.51.165.231:5051
> (ip-10-51-165-231.ec2.internal)
>
> The "Could not lookup task" messages have been recurring ever since.
>
> Logs from the slave:
>
> /var/log/mesos/mesos-slave.INFO:I1001 06:06:06.543417  1207 slave.cpp:933]
> Got assigned task DTdJgUhTEeSnPCIACwSpXg for framework
> 20140927-080358-3165886730-5050-928-0069
> /var/log/mesos/mesos-slave.INFO:I1001 06:06:06.543709  1207
> slave.cpp:1043] Launching task DTdJgUhTEeSnPCIACwSpXg for framework
> 20140927-080358-3165886730-5050-928-0069
> /var/log/mesos/mesos-slave.INFO:I1001 06:06:06.544926  1207
> slave.cpp:1153] Queuing task 'DTdJgUhTEeSnPCIACwSpXg' for executor default
> of framework '20140927-080358-3165886730-5050-928-0069
> /var/log/mesos/mesos-slave.INFO:I1001 06:06:08.644618  1205
> slave.cpp:1783] Flushing queued task DTdJgUhTEeSnPCIACwSpXg for executor
> 'default' of framework 20140927-080358-3165886730-5050-928-0069
> /var/log/mesos/mesos-slave.INFO:I1001 06:06:08.698570  1204
> slave.cpp:2018] Handling status update TASK_RUNNING (UUID:
> 0e17e133-c1d0-4a56-9cef-afb45a486045) for task DTdJgUhTEeSnPCIACwSpXg of
> framework 20140927-080358-3165886730-5050-928-0069 from executor(1)@
> 10.51.165.231:42450
> /var/log/mesos/mesos-slave.INFO:I1001 06:06:08.698796  1204
> status_update_manager.cpp:320] Received status update TASK_RUNNING (UUID:
> 0e17e133-c1d0-4a56-9cef-afb45a486045) for task DTdJgUhTEeSnPCIACwSpXg of
> framework 20140927-080358-3165886730-5050-928-0069
> /var/log/mesos/mesos-slave.INFO:I1001 06:06:08.698974  1204
> status_update_manager.cpp:373] Forwarding status update TASK_RUNNING (UUID:
> 0e17e133-c1d0-4a56-9cef-afb45a486045) for task DTdJgUhTEeSnPCIACwSpXg of
> framework 20140927-080358-3165886730-5050-928-0069 to
> [email protected]:5050
> /var/log/mesos/mesos-slave.INFO:I1001 06:06:08.699262  1204
> slave.cpp:2145] Sending acknowledgement for status update TASK_RUNNING
> (UUID: 0e17e133-c1d0-4a56-9cef-afb45a486045) for task
> DTdJgUhTEeSnPCIACwSpXg of framework
> 20140927-080358-3165886730-5050-928-0069 to executor(1)@
> 10.51.165.231:42450
> /var/log/mesos/mesos-slave.INFO:I1001 06:06:08.711634  1208
> status_update_manager.cpp:398] Received status update acknowledgement
> (UUID: 0e17e133-c1d0-4a56-9cef-afb45a486045) for task
> DTdJgUhTEeSnPCIACwSpXg of framework 20140927-080358-3165886730-5050-928-0069
> /var/log/mesos/mesos-slave.INFO:I1001 06:16:15.294246  1207
> slave.cpp:2018] Handling status update TASK_FINISHED (UUID:
> 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task DTdJgUhTEeSnPCIACwSpXg of
> framework 20140927-080358-3165886730-5050-928-0069 from executor(1)@
> 10.51.165.231:42450
> /var/log/mesos/mesos-slave.INFO:I1001 06:16:15.294914  1205
> status_update_manager.cpp:320] Received status update TASK_FINISHED (UUID:
> 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task DTdJgUhTEeSnPCIACwSpXg of
> framework 20140927-080358-3165886730-5050-928-0069
> /var/log/mesos/mesos-slave.INFO:I1001 06:16:15.295012  1205
> status_update_manager.cpp:373] Forwarding status update TASK_FINISHED
> (UUID: 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task
> DTdJgUhTEeSnPCIACwSpXg of framework
> 20140927-080358-3165886730-5050-928-0069 to [email protected]:5050
> /var/log/mesos/mesos-slave.INFO:I1001 06:16:15.295220  1205
> slave.cpp:2145] Sending acknowledgement for status update TASK_FINISHED
> (UUID: 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task
> DTdJgUhTEeSnPCIACwSpXg of framework
> 20140927-080358-3165886730-5050-928-0069 to executor(1)@
> 10.51.165.231:42450
> /var/log/mesos/mesos-slave.INFO:W1001 06:16:25.295886  1202
> status_update_manager.cpp:477] Resending status update TASK_FINISHED (UUID:
> 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task DTdJgUhTEeSnPCIACwSpXg of
> framework 20140927-080358-3165886730-5050-928-0069
> /var/log/mesos/mesos-slave.INFO:I1001 06:16:25.296064  1202
> status_update_manager.cpp:373] Forwarding status update TASK_FINISHED
> (UUID: 587869ec-a4ef-439a-a706-8a46fbc9fde8) for task
> DTdJgUhTEeSnPCIACwSpXg of framework
> 20140927-080358-3165886730-5050-928-0069 to [email protected]:5050
>
> As with the master logs, the "Resending"/"Forwarding" log lines have been
> recurring ever since.
>
> Finally, logs from the executor indicate that the TASK_FINISHED message
> had been sent. One oddity here is that some log lines that should be
> printed by the executor after the TASK_FINISHED message is sent never
> appear -- I'm guessing this is some kind of race condition, which explains
> why this situation only happens occasionally. That said, there are no logs
> about the executor's killTask, shutdown, or error methods being invoked, so
> I'm not sure what is racing here.
>
> Any thoughts on what is going on here/how to debug would be appreciated.
> Thanks! =)
>