[jira] [Commented] (MESOS-5693) slave delay to forword status update

2016-07-05 Thread Avinash Sridharan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363452#comment-15363452
 ] 

Avinash Sridharan commented on MESOS-5693:
--

So we should see log lines from the retries (for TASK_RUNNING, if it was not 
acknowledged)? I'm not sure whether the retries are logged at LOG(INFO).

> slave delay to forword status update
>
> Key: MESOS-5693
> URL: https://issues.apache.org/jira/browse/MESOS-5693
> Project: Mesos
> Issue Type: Improvement
> Components: slave
> Affects Versions: 0.22.1
> Environment: debian7
> Reporter: zhangfuxing
>
> We observe that the Mesos slave delays forwarding task status updates to the 
> master:
> I0615 14:59:10.997902  3890 slave.cpp:2531] Handling status update 
> TASK_KILLED (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for task 
> xxx.64554b80 of framework 20150629-151659-3355508746-5060-6173-0001 from 
> executor(1)@10.0.40.189:54304
> I0615 14:59:11.001126  3895 status_update_manager.cpp:317] Received status 
> update TASK_KILLED (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for task 
> xxx.64554b80 of framework 20150629-151659-3355508746-5060-6173-0001
> I0615 14:59:11.001174  3895 status_update_manager.hpp:346] Checkpointing 
> UPDATE for status update TASK_KILLED (UUID: 
> 17e9c12f-5241-4aca-81fa-67d6830990b0) for task xxx.64554b80 of framework 
> 20150629-151659-3355508746-5060-6173-0001
> I0615 14:59:11.037376  3894 slave.cpp:2709] Sending acknowledgement for 
> status update TASK_KILLED (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for 
> task xxx.64554b80 of framework 20150629-151659-3355508746-5060-6173-0001 to 
> executor(1)@10.0.40.189:54304
> I0615 15:54:21.352087  3888 slave.cpp:2776] Forwarding the update TASK_KILLED 
> (UUID: 17e9c12f-5241-4aca-81fa-67d6830990b0) for task xxx.64554b80 of 
> framework 20150629-151659-3355508746-5060-6173-0001 to master@10.0.1.200:5060
> In this example, the task xxx.64554b80 was killed at 14:59, but the status 
> update was not forwarded to the master until 15:54.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5693) slave delay to forword status update

2016-07-05 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363424#comment-15363424
 ] 

Adam B commented on MESOS-5693:
---

Only the earliest unacknowledged update (i.e. the TASK_RUNNING, not the 
TASK_KILLED) will be sent for each task from the agent's StatusUpdateManager to 
the master. However, with these updates, the agent will add the latest task 
state (not a full StatusUpdate), so the master can know to release the 
resources and update the state in the webui. The data and messages from the 
final terminal status update must wait until all preceding updates have been 
acknowledged before they can be sent.
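
The ordering rule described above can be illustrated with a small model. This is only a sketch under stated assumptions, not the actual StatusUpdateManager code: the class `TaskUpdateStream`, its method names, and the short UUIDs `u1` and `u2` are hypothetical and chosen for illustration. It shows that only the oldest unacknowledged update is forwarded, that the latest state travels along with it, and that a later update cannot move forward until the earlier one is acknowledged.

```python
from collections import deque


class TaskUpdateStream:
    """Hypothetical per-task queue of unacknowledged status updates."""

    def __init__(self):
        # (uuid, state) pairs, oldest first.
        self.pending = deque()

    def enqueue(self, uuid, state):
        """Record a new status update reported by the executor."""
        self.pending.append((uuid, state))

    def next_to_forward(self):
        """Return (oldest_update, latest_state), or None if nothing is pending.

        Only the oldest unacknowledged update is forwarded; the latest state
        is attached so the receiver can learn the task's current state.
        """
        if not self.pending:
            return None
        oldest = self.pending[0]
        latest_state = self.pending[-1][1]
        return oldest, latest_state

    def acknowledge(self, uuid):
        """Drop the oldest update if the acknowledgement matches its UUID."""
        if self.pending and self.pending[0][0] == uuid:
            self.pending.popleft()
            return True
        return False


stream = TaskUpdateStream()
stream.enqueue("u1", "TASK_RUNNING")
stream.enqueue("u2", "TASK_KILLED")

# TASK_RUNNING is still the update being forwarded, tagged with TASK_KILLED.
print(stream.next_to_forward())

# Acknowledging the later update out of order has no effect in this model.
print(stream.acknowledge("u2"))

# Once TASK_RUNNING is acknowledged, TASK_KILLED becomes eligible.
print(stream.acknowledge("u1"))
print(stream.next_to_forward())
```

In this model an unacknowledged TASK_RUNNING holds back the full TASK_KILLED update indefinitely, which matches the delay pattern in the reported log.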



[jira] [Commented] (MESOS-5693) slave delay to forword status update

2016-07-05 Thread Avinash Sridharan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363399#comment-15363399
 ] 

Avinash Sridharan commented on MESOS-5693:
--

Even if the scheduler disconnected from the Master (assuming its timeout is set 
to a very high value), would the Agent still keep forwarding updates to the 
Master?



[jira] [Commented] (MESOS-5693) slave delay to forword status update

2016-07-05 Thread Avinash Sridharan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363397#comment-15363397
 ] 

Avinash Sridharan commented on MESOS-5693:
--

If possible, could you also try upgrading to a more recent version of Mesos 
(v0.28.2) and see if you still hit the same problem?

Thanks!!



[jira] [Commented] (MESOS-5693) slave delay to forword status update

2016-07-05 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363396#comment-15363396
 ] 

Adam B commented on MESOS-5693:
---

It could be that the agent was not in contact with the master and so could not 
forward the update, but that's not possible for a whole hour. More likely the 
scheduler was disconnected for a long time, and since the scheduler never 
acknowledged the previous status update (TASK_RUNNING?), the agent never sent 
the next update in the queue. In order for Mesos to provide guaranteed 
at-least-once delivery of status updates to the schedulers, the scheduler must 
be connected to ACK each update.
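
To see how a disconnected scheduler could produce a delay of this size, here is a rough timing simulation. It is a sketch under assumptions, not a description of the agent's real retry policy: the function `simulate_forwarding`, the doubling backoff, and the 10 second and 600 second values are hypothetical and used only for illustration. Each update is resent until an acknowledgement becomes possible, and the next update is first sent only after that.

```python
def simulate_forwarding(updates, ack_available_at,
                        retry_initial=10.0, retry_cap=600.0):
    """Return [(state, first_sent_at, acked_at, attempts)] in send order.

    updates: states queued for one task, oldest first.
    ack_available_at: time in seconds when the scheduler can ACK again.
    retry_initial, retry_cap: assumed backoff bounds, purely illustrative.
    """
    now = 0.0
    timeline = []
    for state in updates:
        first_sent_at = now
        attempts = 1
        interval = retry_initial
        # Resend the same update until the scheduler is able to ACK it.
        while now < ack_available_at:
            now += interval
            attempts += 1
            interval = min(interval * 2, retry_cap)
        timeline.append((state, first_sent_at, now, attempts))
    return timeline


# Scheduler assumed to be unreachable for the first 55 minutes.
timeline = simulate_forwarding(["TASK_RUNNING", "TASK_KILLED"],
                               ack_available_at=3300.0)
for state, first_sent_at, acked_at, attempts in timeline:
    print(state, first_sent_at, acked_at, attempts)
```

Under these assumed numbers the TASK_KILLED update is not sent for the first time until after the scheduler returns, even though the executor reported it at the very start.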



[jira] [Commented] (MESOS-5693) slave delay to forword status update

2016-07-05 Thread Avinash Sridharan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363392#comment-15363392
 ] 

Avinash Sridharan commented on MESOS-5693:
--

[~zfx] could you reproduce the problem with GLOG_v=1, or run `strace` on the 
Agent when you see it stuck? I think we will need more information to see what 
exactly the Agent was doing between I0615 14:59:11.037376 and I0615 
15:54:21.352087 that prevented it from sending an update.
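
For reference, the size of that gap can be computed directly from the two log prefixes. This is a minimal sketch: the helper `glog_time` is hypothetical, and because these log lines carry no year, the `year=2016` default is an assumption that does not affect the difference between two lines from the same day.

```python
import re
from datetime import datetime

# Severity letter, MMDD, then HH:MM:SS.microseconds, as seen in the agent log.
GLOG_PREFIX = re.compile(
    r"^[IWEF](\d{2})(\d{2}) (\d{2}):(\d{2}):(\d{2})\.(\d{6})")


def glog_time(line, year=2016):
    """Parse the timestamp at the start of a log line (the year is assumed)."""
    match = GLOG_PREFIX.match(line)
    if match is None:
        raise ValueError("not a recognised log line: %r" % line)
    month, day, hour, minute, second, micro = (int(g) for g in match.groups())
    return datetime(year, month, day, hour, minute, second, micro)


ack = glog_time("I0615 14:59:11.037376  3894 slave.cpp:2709] Sending acknowledgement")
fwd = glog_time("I0615 15:54:21.352087  3888 slave.cpp:2776] Forwarding the update")

gap = fwd - ack
print(gap)                  # 0:55:10.314711
print(gap.total_seconds())
```

The gap is a little over 55 minutes, so whatever held the update back lasted far longer than a brief network interruption.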
