[jira] [Commented] (MESOS-5693) slave delay to forword status update
[ https://issues.apache.org/jira/browse/MESOS-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363452#comment-15363452 ]

Avinash Sridharan commented on MESOS-5693:
------------------------------------------

So we should see the log from the retries (for TASK_RUNNING, in case it is not acknowledged)? Not sure if the retries are shown in LOG(INFO)?

> slave delay to forword status update
> ------------------------------------
>
>                 Key: MESOS-5693
>                 URL: https://issues.apache.org/jira/browse/MESOS-5693
>             Project: Mesos
>          Issue Type: Improvement
>          Components: slave
>    Affects Versions: 0.22.1
>         Environment: debian7
>            Reporter: zhangfuxing
>
> We observe that the Mesos slave delays forwarding task status updates to the master:
> I0615 14:59:10.997902 3890 slave.cpp:2531] Handling status update TASK_KILLED (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for task xxx.64554b80 of framework 20150629-151659-3355508746-5060-6173-0001 from executor(1)@10.0.40.189:54304
> I0615 14:59:11.001126 3895 status_update_manager.cpp:317] Received status update TASK_KILLED (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for task xxx.64554b80 of framework 20150629-151659-3355508746-5060-6173-0001
> I0615 14:59:11.001174 3895 status_update_manager.hpp:346] Checkpointing UPDATE for status update TASK_KILLED (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for task xxx.64554b80 of framework 20150629-151659-3355508746-5060-6173-0001
> I0615 14:59:11.037376 3894 slave.cpp:2709] Sending acknowledgement for status update TASK_KILLED (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for task xxx.64554b80 of framework 20150629-151659-3355508746-5060-6173-0001 to executor(1)@10.0.40.189:54304
> I0615 15:54:21.352087 3888 slave.cpp:2776] Forwarding the update TASK_KILLED (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for task xxx.64554b80 of framework 20150629-151659-3355508746-5060-6173-0001 to master@10.0.1.200:5060
> In this example, the task xxx.64554b80 was killed at 14:59, but the status update was not forwarded to the master until 15:54.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-5693) slave delay to forword status update
[ https://issues.apache.org/jira/browse/MESOS-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363424#comment-15363424 ]

Adam B commented on MESOS-5693:
-------------------------------

Only the earliest unacknowledged update (i.e. the TASK_RUNNING, not the TASK_KILLED) will be sent for each task from the agent's StatusUpdateManager to the master. However, with these updates, the agent will add the latest task state (not a full StatusUpdate), so the master knows to release the resources and update the state in the webui. The data and messages from the final terminal status update cannot be sent until all of its preceding updates have been acknowledged.
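The queueing behaviour described in this comment can be illustrated with a toy model. This is only a sketch of the idea, not Mesos code: the class `TaskUpdateStream` and its methods are hypothetical names invented for illustration, and the real StatusUpdateManager also handles checkpointing, UUIDs, and retries, which are left out here.

```python
from collections import deque


class TaskUpdateStream:
    """Toy model of one task's status-update queue on the agent (hypothetical)."""

    def __init__(self):
        self.pending = deque()    # unacknowledged updates, oldest first
        self.latest_state = None  # most recent state reported by the executor

    def enqueue(self, state):
        """Record a new status update coming from the executor."""
        self.pending.append(state)
        self.latest_state = state

    def next_to_forward(self):
        """Return (earliest unacknowledged update, latest state), or None."""
        if not self.pending:
            return None
        return self.pending[0], self.latest_state

    def acknowledge(self, state):
        """Drop the front update once it has been acknowledged."""
        if self.pending and self.pending[0] == state:
            self.pending.popleft()
            return True
        return False


stream = TaskUpdateStream()
stream.enqueue("TASK_RUNNING")
stream.enqueue("TASK_KILLED")

# TASK_RUNNING is still unacknowledged, so it is the update that is forwarded;
# the latest state (TASK_KILLED) only rides along with it.
print(stream.next_to_forward())  # ('TASK_RUNNING', 'TASK_KILLED')

stream.acknowledge("TASK_RUNNING")
# Only now does the terminal update reach the front of the queue.
print(stream.next_to_forward())  # ('TASK_KILLED', 'TASK_KILLED')
```

In this model, a TASK_KILLED update that sits behind an unacknowledged TASK_RUNNING stays queued for as long as the acknowledgement is missing, which would be consistent with the gap seen in the reporter's log.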
[jira] [Commented] (MESOS-5693) slave delay to forword status update
[ https://issues.apache.org/jira/browse/MESOS-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363399#comment-15363399 ]

Avinash Sridharan commented on MESOS-5693:
------------------------------------------

Even if the scheduler disconnected from the Master (assuming its timeout is set to a very high value), would the Agent still keep forwarding updates to the Master?
[jira] [Commented] (MESOS-5693) slave delay to forword status update
[ https://issues.apache.org/jira/browse/MESOS-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363397#comment-15363397 ]

Avinash Sridharan commented on MESOS-5693:
------------------------------------------

If possible, could you also try upgrading to a more recent version of Mesos (v0.28.2) and see whether you still hit the same problem? Thanks!
[jira] [Commented] (MESOS-5693) slave delay to forword status update
[ https://issues.apache.org/jira/browse/MESOS-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363396#comment-15363396 ]

Adam B commented on MESOS-5693:
-------------------------------

It could be that the agent was not in contact with the master and so could not forward the update, but that's not possible for a whole hour. More likely the scheduler was disconnected for a long time, and since the scheduler never acknowledged the previous status update (TASK_RUNNING?), the agent never sent the next update in the queue. For Mesos to provide guaranteed at-least-once delivery of status updates to the schedulers, the scheduler must be connected so that it can ACK each update.
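The at-least-once guarantee mentioned in this comment rests on resending an update until it is acknowledged. The loop below is a minimal sketch of that pattern only. The function `deliver_at_least_once`, the `send` callable, and the `max_attempts` limit are all hypothetical and are not taken from the Mesos code base, which among other things spaces its retries out over time.

```python
def deliver_at_least_once(update, send, max_attempts=5):
    """Resend `update` until `send` reports an acknowledgement.

    `send` is a hypothetical callable that returns True when the scheduler
    acknowledged the update and False when no ACK arrived.
    Returns the number of attempts used, or None if no ACK was ever seen.
    """
    for attempt in range(1, max_attempts + 1):
        if send(update):
            return attempt
    return None


# A scheduler that is disconnected for the first two attempts.
connected = iter([False, False, True])
print(deliver_at_least_once("TASK_RUNNING", lambda update: next(connected)))  # 3

# A scheduler that never reconnects: the update is never acknowledged,
# so anything queued behind it would stay blocked.
print(deliver_at_least_once("TASK_RUNNING", lambda update: False))  # None
```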
[jira] [Commented] (MESOS-5693) slave delay to forword status update
[ https://issues.apache.org/jira/browse/MESOS-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363392#comment-15363392 ]

Avinash Sridharan commented on MESOS-5693:
------------------------------------------

[~zfx] could you reproduce the problem with GLOG_v=1, or by running `strace` on the Agent when you see it stuck? I think we will need more information to see what exactly the Agent was doing between I0615 14:59:11.037376 and I0615 15:54:21.352087 that kept it from sending an update.
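For reference, the size of the gap between the two log timestamps quoted in this thread can be computed directly. This is a small sketch: `glog_time` is a hypothetical helper written for this example, and it assumes both entries fall on the same day, as the shared `0615` prefix suggests.

```python
from datetime import datetime


def glog_time(stamp):
    """Parse the time-of-day part of a glog prefix like 'I0615 14:59:11.037376'."""
    return datetime.strptime(stamp.split()[1], "%H:%M:%S.%f")


acked = glog_time("I0615 14:59:11.037376")      # ACK sent to the executor
forwarded = glog_time("I0615 15:54:21.352087")  # update forwarded to the master

gap = forwarded - acked
print(gap)  # 0:55:10.314711
```

So the update sat on the agent for a little over 55 minutes before it was forwarded.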