[ 
https://issues.apache.org/jira/browse/MESOS-8111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cosmin Lehene updated MESOS-8111:
---------------------------------
    Description: 
After scaling down a cluster, the master still reports a task as running 
even though the agent it ran on is long gone.
At the same time, the master reports that it cannot kill the task because 
the agent is offline:
{noformat}
I1018 16:55:22.000000  6976 master.cpp:4913] Processing KILL call for task 
'spark.7b59a77b-b353-11e7-addd-b29ecbf071e1' of framework 
4d2a982a-0e62-4471-88e8-8df9cc0ae437-0001 (marathon) at 
scheduler-45eafb76-4510-482e-9bcc-06e3ad97c276@172.16.0.7:15101
W1018 16:55:22.000000  6976 master.cpp:5000] Cannot kill task 
spark.7b59a77b-b353-11e7-addd-b29ecbf071e1 of framework 
4d2a982a-0e62-4471-88e8-8df9cc0ae437-0001 (marathon) at 
scheduler-45eafb76-4510-482e-9bcc-06e3ad97c276@172.16.0.7:15101 because the 
agent 4d2a982a-0e62-4471-88e8-8df9cc0ae437-S129 at slave(1)@10.0.0.81:5051 
(10.0.0.81) is disconnected. Kill will be retried if the agent re-registers
{noformat}

Clearly, if the agent is offline, the task is not running either. I am also 
not sure that waiting indefinitely for an agent to recover is a good strategy.

  was:
After scaling down a cluster, the master still reports a task as running 
even though the agent it ran on is long gone.
At the same time, the master reports that it cannot kill the task because 
the agent is offline:
{noformat}
I1018 16:55:22.000000  6976 master.cpp:4913] Processing KILL call for task 
'spark.7b59a77b-b353-11e7-addd-b29ecbf071e1' of framework 
4d2a982a-0e62-4471-88e8-8df9cc0ae437-0001 (marathon) at 
scheduler-45eafb76-4510-482e-9bcc-06e3ad97c276@172.16.0.7:15101
W1018 16:55:22.000000  6976 master.cpp:5000] Cannot kill task 
spark.7b59a77b-b353-11e7-addd-b29ecbf071e1 of framework 
4d2a982a-0e62-4471-88e8-8df9cc0ae437-0001 (marathon) at 
scheduler-45eafb76-4510-482e-9bcc-06e3ad97c276@172.16.0.7:15101 because the 
agent 4d2a982a-0e62-4471-88e8-8df9cc0ae437-S129 at slave(1)@10.0.0.81:5051 
(10.0.0.81) is disconnected. Kill will be retried if the agent re-registers
{noformat}


> Mesos sees task as running, but cannot kill it because the agent is offline
> ---------------------------------------------------------------------------
>
>                 Key: MESOS-8111
>                 URL: https://issues.apache.org/jira/browse/MESOS-8111
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.2.3
>         Environment: DC/OS 1.9.4
>            Reporter: Cosmin Lehene
>
> After scaling down a cluster, the master still reports a task as running 
> even though the agent it ran on is long gone.
> At the same time, the master reports that it cannot kill the task because 
> the agent is offline:
> {noformat}
> I1018 16:55:22.000000  6976 master.cpp:4913] Processing KILL call for task 
> 'spark.7b59a77b-b353-11e7-addd-b29ecbf071e1' of framework 
> 4d2a982a-0e62-4471-88e8-8df9cc0ae437-0001 (marathon) at 
> scheduler-45eafb76-4510-482e-9bcc-06e3ad97c276@172.16.0.7:15101
> W1018 16:55:22.000000  6976 master.cpp:5000] Cannot kill task 
> spark.7b59a77b-b353-11e7-addd-b29ecbf071e1 of framework 
> 4d2a982a-0e62-4471-88e8-8df9cc0ae437-0001 (marathon) at 
> scheduler-45eafb76-4510-482e-9bcc-06e3ad97c276@172.16.0.7:15101 because the 
> agent 4d2a982a-0e62-4471-88e8-8df9cc0ae437-S129 at slave(1)@10.0.0.81:5051 
> (10.0.0.81) is disconnected. Kill will be retried if the agent re-registers
> {noformat}
> Clearly, if the agent is offline, the task is not running either. I am also 
> not sure that waiting indefinitely for an agent to recover is a good strategy.
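
The condition described above (a task still reported as running while its agent is no longer active) could be checked from a snapshot of the master's state. The following is only a minimal sketch, not part of the report: it assumes the snapshot is a dict with a "slaves" list (each entry having "id" and "active") and a "frameworks" list whose "tasks" carry "id", "slave_id" and "state". Those field names and the function name are assumptions made for illustration; the sample data reuses the IDs from the log above.

```python
def find_running_tasks_without_active_agent(state):
    """Return (framework_id, task_id, agent_id) tuples for tasks reported as
    TASK_RUNNING whose agent is missing from, or inactive in, the snapshot."""
    # Collect the IDs of agents the snapshot lists as active (assumed layout).
    active_agents = {
        agent["id"]
        for agent in state.get("slaves", [])
        if agent.get("active", False)
    }
    stale = []
    for framework in state.get("frameworks", []):
        for task in framework.get("tasks", []):
            agent_id = task.get("slave_id")
            if task.get("state") == "TASK_RUNNING" and agent_id not in active_agents:
                stale.append((framework.get("id"), task.get("id"), agent_id))
    return stale


# Hypothetical snapshot built from the IDs in the log; not real master output.
sample_state = {
    "slaves": [
        {"id": "4d2a982a-0e62-4471-88e8-8df9cc0ae437-S129", "active": False},
    ],
    "frameworks": [
        {
            "id": "4d2a982a-0e62-4471-88e8-8df9cc0ae437-0001",
            "tasks": [
                {
                    "id": "spark.7b59a77b-b353-11e7-addd-b29ecbf071e1",
                    "slave_id": "4d2a982a-0e62-4471-88e8-8df9cc0ae437-S129",
                    "state": "TASK_RUNNING",
                },
            ],
        },
    ],
}

stale_tasks = find_running_tasks_without_active_agent(sample_state)
```

With the sample snapshot, the spark task is flagged because its agent is listed as inactive. The check treats an agent that is absent from the list the same way as an inactive one.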



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
