[jira] [Commented] (MESOS-9750) Agent V1 GET_STATE response may report a complete executor's tasks as non-terminal after a graceful agent shutdown

2019-09-04 Thread Meng Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922871#comment-16922871
 ] 

Meng Zhu commented on MESOS-9750:
-

Note that while this ticket makes the completed task with the nonterminal 
status appear in the right place (i.e. under completed tasks), it results in 
odd behavior: a completed task can still carry a nonterminal status, e.g. 
TASK_RUNNING.
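
A client could flag this condition itself. The following is a minimal sketch, 
not part of the fix: it assumes the GET_STATE response has been decoded from 
JSON into a dict, that the {{get_tasks}} section carries a {{completed_tasks}} 
list, and that the set of terminal states below matches the Mesos version in 
use. The function name and the sample data are hypothetical.

```python
# Terminal task states. This set is an assumption about the Mesos TaskState
# enum; verify it against the Mesos version you run.
TERMINAL_STATES = frozenset({
    "TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST",
    "TASK_ERROR", "TASK_DROPPED", "TASK_GONE", "TASK_GONE_BY_OPERATOR",
})


def completed_but_nonterminal(get_tasks):
    """Return IDs of tasks listed as completed whose state is non-terminal."""
    suspicious = []
    for task in get_tasks.get("completed_tasks", []):
        if task.get("state") not in TERMINAL_STATES:
            suspicious.append(task["task_id"]["value"])
    return suspicious


# Hypothetical, trimmed-down "get_tasks" fragment of a GET_STATE response.
example_get_tasks = {
    "completed_tasks": [
        {"task_id": {"value": "dff5a155-47f1-4a71-9b92-30ca059ab456"},
         "state": "TASK_RUNNING"},
        {"task_id": {"value": "some-finished-task"},
         "state": "TASK_FINISHED"},
    ]
}

print(completed_but_nonterminal(example_get_tasks))
```

A consumer that treats such tasks as finished regardless of the reported 
state would sidestep the inconsistency described above.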

> Agent V1 GET_STATE response may report a complete executor's tasks as 
> non-terminal after a graceful agent shutdown
> --
>
>                 Key: MESOS-9750
>                 URL: https://issues.apache.org/jira/browse/MESOS-9750
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, executor
>    Affects Versions: 1.6.0, 1.7.0, 1.8.0
>            Reporter: Joseph Wu
>            Assignee: Joseph Wu
>            Priority: Major
>              Labels: foundations
>             Fix For: 1.7.3, 1.8.1, 1.9.0
>
>
> The problem occurs with the following sequence:
> 1) A graceful shutdown is initiated on the agent (i.e. SIGUSR1 or 
> /master/machine/down).
> 2) The executor is sent a kill, and the agent counts down on 
> {{executor_shutdown_grace_period}}.
> 3) The executor exits before all terminal status updates reach the agent. 
> This is more likely if {{executor_shutdown_grace_period}} elapses.
> This results in a completed executor with non-terminal tasks (according to 
> status updates).
> When the agent starts back up, the completed executor is recovered and 
> shows up correctly as a completed executor in {{/state}}.  However, if you 
> fetch the V1 {{GET_STATE}} result, there will be an entry in 
> {{launched_tasks}} even though nothing is running.
> {code}
> get_tasks {
>   launched_tasks {
>     name: "test-task"
>     task_id {
>       value: "dff5a155-47f1-4a71-9b92-30ca059ab456"
>     }
>     framework_id {
>       value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-"
>     }
>     executor_id {
>       value: "default"
>     }
>     agent_id {
>       value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-S0"
>     }
>     state: TASK_RUNNING
>     resources { ... }
>     resources { ... }
>     resources { ... }
>     resources { ... }
>     statuses {
>       task_id {
>         value: "dff5a155-47f1-4a71-9b92-30ca059ab456"
>       }
>       state: TASK_RUNNING
>       agent_id {
>         value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-S0"
>       }
>       timestamp: 1556674758.2175469
>       executor_id {
>         value: "default"
>       }
>       source: SOURCE_EXECUTOR
>       uuid: "xPmn\234\236F&\235\\d\364\326\323\222\224"
>       container_status { ... }
>     }
>   }
> }
> get_executors {
>   completed_executors {
>     executor_info {
>       executor_id {
>         value: "default"
>       }
>       command {
>         value: ""
>       }
>       framework_id {
>         value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-"
>       }
>     }
>   }
> }
> get_frameworks {
>   completed_frameworks {
>     framework_info {
>       user: "user"
>       name: "default"
>       id {
>         value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-"
>       }
>       checkpoint: true
>       hostname: "localhost"
>       principal: "test-principal"
>       capabilities {
>         type: MULTI_ROLE
>       }
>       capabilities {
>         type: RESERVATION_REFINEMENT
>       }
>       roles: "*"
>     }
>   }
> }
> {code}
> This happens because we combine executors and completed executors when 
> constructing the response.  The terminal task(s) with non-terminal updates 
> appear under completed executors.
> https://github.com/apache/mesos/blob/89c3dd95a421e14044bc91ceb1998ff4ae3883b4/src/slave/http.cpp#L1734-L1756
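
The cause described above can be illustrated with a toy model. This is only a 
sketch under stated assumptions, not the logic in src/slave/http.cpp: the 
{{Task}} and {{Executor}} classes and both functions are hypothetical, and the 
real agent tracks considerably more state. It shows how treating running and 
completed executors alike leaves a dead executor's task in 
{{launched_tasks}}, while handling the two lists separately does not.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    task_id: str
    state: str  # latest state known from status updates


@dataclass
class Executor:
    executor_id: str
    tasks: list = field(default_factory=list)


def get_tasks_combined(executors, completed_executors):
    """Problematic shape: both executor lists are walked the same way."""
    launched = []
    for executor in executors + completed_executors:
        launched.extend(task.task_id for task in executor.tasks)
    return {"launched_tasks": launched, "completed_tasks": []}


def get_tasks_separated(executors, completed_executors):
    """Tasks of completed executors are reported as completed instead."""
    launched = [t.task_id for e in executors for t in e.tasks]
    completed = [t.task_id for e in completed_executors for t in e.tasks]
    return {"launched_tasks": launched, "completed_tasks": completed}


# The executor exited before its terminal status update reached the agent,
# so the last known task state is still TASK_RUNNING.
dead_executor = Executor("default", [Task("test-task", "TASK_RUNNING")])

print(get_tasks_combined([], [dead_executor]))
print(get_tasks_separated([], [dead_executor]))
```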



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (MESOS-9750) Agent V1 GET_STATE response may report a complete executor's tasks as non-terminal after a graceful agent shutdown

2019-05-14 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839373#comment-16839373
 ] 

Joseph Wu commented on MESOS-9750:
--

Found one more code path where the agent's {{GET_STATE}} will return extraneous 
{{launched_tasks}}.

This happens when a Framework or Master {{TEARDOWN}} call is used and the 
executor does not send a terminal status update in time.  This path does not 
require an agent restart/shutdown.
It also leaves the executor's checkpointed state looking identical to the 
agent shutdown case, so if the agent is restarted, the code in the above patch 
will run and put the agent back into a consistent state.

Fix and test here: https://reviews.apache.org/r/70641/
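
For agents that do not yet have the fix, a consumer of {{GET_STATE}} can 
cross-check the response. The sketch below is hypothetical and makes two 
assumptions: the response has been decoded from JSON into a dict with the 
{{get_tasks}} and {{get_executors}} sections shown in the issue description, 
and the affected executor is listed under {{completed_executors}}. It reports 
launched tasks whose executor appears only among completed executors.

```python
def executor_key(message):
    """Identify an executor by (framework ID, executor ID)."""
    return (message["framework_id"]["value"], message["executor_id"]["value"])


def extraneous_launched_tasks(state):
    """Return IDs of launched tasks whose executor is listed as completed."""
    executors = state.get("get_executors", {})
    running = {executor_key(e["executor_info"])
               for e in executors.get("executors", [])}
    completed = {executor_key(e["executor_info"])
                 for e in executors.get("completed_executors", [])}
    dead = completed - running

    extraneous = []
    for task in state.get("get_tasks", {}).get("launched_tasks", []):
        # Skip tasks that carry no executor ID; they cannot be matched.
        if "executor_id" not in task:
            continue
        if executor_key(task) in dead:
            extraneous.append(task["task_id"]["value"])
    return extraneous


# Hypothetical response fragment; the IDs are placeholders.
example_state = {
    "get_tasks": {
        "launched_tasks": [{
            "task_id": {"value": "task-1"},
            "framework_id": {"value": "framework-1"},
            "executor_id": {"value": "default"},
            "state": "TASK_RUNNING",
        }]
    },
    "get_executors": {
        "completed_executors": [{
            "executor_info": {
                "executor_id": {"value": "default"},
                "framework_id": {"value": "framework-1"},
            }
        }]
    },
}

print(extraneous_launched_tasks(example_state))
```

The check is deliberately conservative: an executor that is listed as both 
running and completed (for example, an executor ID that was reused) is not 
reported.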



[jira] [Commented] (MESOS-9750) Agent V1 GET_STATE response may report a complete executor's tasks as non-terminal after a graceful agent shutdown

2019-04-30 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16830821#comment-16830821
 ] 

Joseph Wu commented on MESOS-9750:
--

Preliminary fix and test here: https://reviews.apache.org/r/70577/
