[jira] [Commented] (MESOS-7783) Framework might not receive status update when a just launched task is killed immediately

2017-08-04 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16114988#comment-16114988
 ] 

Benjamin Mahler commented on MESOS-7783:


The bug occurs as follows:

(1) Two (or more) tasks arrive at the agent, but do not yet reach 
{{Slave::_run}}.
(2) Kill task messages arrive at the agent and are processed.
(3) The first task to reach {{Slave::_run}} will cause the framework to be 
removed, since the pending tasks / executors are now empty (see 
[here|https://github.com/apache/mesos/blob/1.2.0/src/slave/slave.cpp?utf8=%E2%9C%93#L1841-L1845]).
(4) The remaining tasks to reach {{Slave::_run}} encounter the framework as 
removed and are dropped without a status update (see 
[here|https://github.com/apache/mesos/blob/1.2.0/src/slave/slave.cpp?utf8=%E2%9C%93#L1788-L1794]).
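
The four steps above can be reproduced with a small toy model. This is only a sketch under simplifying assumptions, not the agent's actual C++ code: {{ToyAgent}}, {{Framework}}, {{run_task}}, {{kill_task}} and {{simulate}} are hypothetical names, executors are not modeled, and the framework is assumed to be removed as soon as its pending set is empty.

```python
from dataclasses import dataclass, field


@dataclass
class Framework:
    # Task IDs that have arrived at the agent but not yet reached _run.
    pending: set = field(default_factory=set)


class ToyAgent:
    """Toy stand-in for the agent; not the real Mesos Slave class."""

    def __init__(self):
        self.frameworks = {}  # framework ID -> Framework
        self.updates = []     # (task ID, state) pairs sent to the framework

    def run_task(self, framework_id, task_id):
        # Step (1): the task arrives and is recorded as pending.
        framework = self.frameworks.setdefault(framework_id, Framework())
        framework.pending.add(task_id)

    def kill_task(self, framework_id, task_id):
        # Step (2): the kill drops the task from pending. The framework is
        # kept so that TASK_KILLED can be sent once the task reaches _run.
        framework = self.frameworks.get(framework_id)
        if framework is not None:
            framework.pending.discard(task_id)

    def _run(self, framework_id, task_id):
        framework = self.frameworks.get(framework_id)
        if framework is None:
            # Step (4): the framework is already gone, so the task is
            # dropped without any status update.
            return
        if task_id not in framework.pending:
            # Step (3): the task was killed while pending.
            self.updates.append((task_id, "TASK_KILLED"))
            if not framework.pending:
                # Nothing is pending any more, so the framework is removed.
                del self.frameworks[framework_id]
            return
        # Normal launch path (not relevant to this bug, omitted).
        framework.pending.discard(task_id)


def simulate():
    agent = ToyAgent()
    for task_id in ("t1", "t2"):
        agent.run_task("f1", task_id)
    for task_id in ("t1", "t2"):
        agent.kill_task("f1", task_id)
    for task_id in ("t1", "t2"):
        agent._run("f1", task_id)
    return agent.updates


print(simulate())  # [('t1', 'TASK_KILLED')]
```

In this model only {{t1}} produces a TASK_KILLED update; {{t2}} reaches {{_run}} after the framework has been removed and is silently dropped, which matches the missing status update described in the issue.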

> Framework might not receive status update when a just launched task is killed 
> immediately
> -
>
> Key: MESOS-7783
> URL: https://issues.apache.org/jira/browse/MESOS-7783
> Project: Mesos
> Issue Type: Bug
> Components: agent
> Affects Versions: 1.2.0
> Reporter: Benjamin Bannier
> Priority: Critical
> Labels: reliability
> Attachments: GroupDeployIntegrationTest.log.zip, logs
>
>
> Our Marathon team are seeing issues in their integration test suite where 
> Marathon gets stuck in an infinite loop trying to kill a just-launched task. 
> In their test, a task is launched and then immediately killed; the framework 
> does not, for example, wait for any task status update first.
> In this case the launch and kill messages arrive at the agent in the correct 
> order, but neither the launch path nor the kill path in the agent reaches the 
> point where a status update is sent to the framework. Since the framework has 
> seen no status update for the task, it re-triggers the kill, causing an 
> infinite loop.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7783) Framework might not receive status update when a just launched task is killed immediately

2017-07-13 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16086450#comment-16086450
 ] 

Benjamin Mahler commented on MESOS-7783:


I took a quick look at the code. This comment \[1\] in the agent's kill task 
handling says that we avoid removing the framework so that the TASK_KILLED 
update can be sent later. However, when we later discover in the launch path 
that the task was killed, the framework appears to have already been removed, 
and we don't generate the update \[2\].

It appears that the framework somehow gets removed in the interim, but the 
removal does not show up in the logs. [~bbannier], these agent logs appear to 
be filtered on the task ID; do you have the full agent logs? They should help 
reveal the cause.

\[1\] https://github.com/apache/mesos/blob/1.2.0/src/slave/slave.cpp?utf8=%E2%9C%93#L2473-L2477
\[2\] https://github.com/apache/mesos/blob/1.2.0/src/slave/slave.cpp?utf8=%E2%9C%93#L1788-L1794



