Hi,

We have recently upgraded from Mesos 0.28.x to 1.3.0 and have started seeing an issue where tasks are stuck in the staging state.
What we are seeing looks very similar to this bug report: https://issues.apache.org/jira/browse/MESOS-5482

Essentially, when the agent re-registers (note: we are still trying to figure out why this is happening) the tasks are terminated, but in some cases the terminal status update never makes it back to the framework.

Can anyone provide some guidance on the best approach to resolve this? Are there fixes in the works? Should there be changes in the framework to better handle this case, e.g. something along the lines of the reconciliation sketch at the end of this mail? (Note: it is the https://github.com/mesos/kafka framework that is encountering the problem.)

Here is a breakdown of what we observed in the Mesos agent logs.

The agent lost its connection to ZooKeeper at 05:01 on dx-fmwk-agent[0-3]:

Aug 29 05:01:21 dx-fmwk-agent1.rtp.raleigh.ibm.com mesos-slave[12898]: I0829 05:01:21.397711 12906 slave.cpp:911] Lost leading master
Aug 29 05:01:21 dx-fmwk-agent1.rtp.raleigh.ibm.com mesos-slave[12898]: I0829 05:01:21.397732 12906 slave.cpp:953] Detecting new master
Aug 29 05:01:21 dx-fmwk-agent1.rtp.raleigh.ibm.com mesos-slave[12898]: I0829 05:01:21.397753 12906 status_update_manager.cpp:177] Pausing sending status updates
Aug 29 05:01:21 dx-fmwk-agent1.rtp.raleigh.ibm.com mesos-slave[12898]: 2017-08-29 05:01:21,397:12889(0x7fc4d6447700):ZOO_INFO@zookeeper_close@2543: Freeing zookeeper resources for sessionId=0x15d8772f08b5d77

This causes the agent to re-register with the master, which means the tasks running on it (i.e. the Kafka brokers) are stopped, e.g.:

Aug 29 05:01:23 dx-fmwk-agent1.rtp.raleigh.ibm.com mesos-slave[12898]: I0829 05:01:23.625780 12903 slave.cpp:5555] Killing executor 'kafka-aggregate-5-719a0e03-7657-4a07-acff-36f821374ef2' of framework 020072e0-ec00-4f66-935d-fe033b47be76-0003 at executor(1)@9.37.246.154:35098
Aug 29 05:01:23 dx-fmwk-agent1.rtp.raleigh.ibm.com mesos-slave[12898]: I0829 05:01:23.625898 12903 slave.cpp:5555] Killing executor 'kafka-local-5-d54944b8-c619-4a27-b233-efae98b1bf02' of framework 020072e0-ec00-4f66-935d-fe033b47be76-0002 at executor(1)@9.37.246.154:41839

On dx-fmwk-agent[0-1], where the brokers recovered automatically, we see that the agent sends a TASK_LOST status update to the framework, causing it to relaunch them:
Aug 29 05:01:22 dx-fmwk-agent1.rtp.raleigh.ibm.com mesos-slave[12898]: I0829 05:01:22.235452 12900 status_update_manager.cpp:323] Received status update TASK_LOST (UUID: f3fde599-862a-4009-8947-8b733e357e87) for task kafka-local-5-aa4c5126-6899-4550-8cfb-a2a9b4d4a88b of framework 020072e0-ec00-4f66-93
Aug 29 05:01:22 dx-fmwk-agent1.rtp.raleigh.ibm.com mesos-slave[12898]: I0829 05:01:22.236565 12900 status_update_manager.cpp:323] Received status update TASK_LOST (UUID: e19f3665-75b1-48c7-a5b4-e9bd8ee38e07) for task kafka-aggregate-5-c0000796-cb2e-46d4-92a5-307ba74b9088 of framework 020072e0-ec00-4f6
Aug 29 05:01:22 dx-fmwk-agent1.rtp.raleigh.ibm.com mesos-slave[12898]: I0829 05:01:22.236675 12905 slave.cpp:4655] Forwarding the update TASK_LOST (UUID: f3fde599-862a-4009-8947-8b733e357e87) for task kafka-local-5-aa4c5126-6899-4550-8cfb-a2a9b4d4a88b of framework 020072e0-ec00-4f66-935d-fe033b47be76-
Aug 29 05:01:22 dx-fmwk-agent1.rtp.raleigh.ibm.com mesos-slave[12898]: I0829 05:01:22.236783 12905 slave.cpp:4655] Forwarding the update TASK_LOST (UUID: e19f3665-75b1-48c7-a5b4-e9bd8ee38e07) for task kafka-aggregate-5-c0000796-cb2e-46d4-92a5-307ba74b9088 of framework 020072e0-ec00-4f66-935d-fe033b47b

However, on dx-fmwk-agent[2-3] we do not see this, and the framework never receives a TASK_LOST status update. When the VMs were restarted, we see that the Mesos agent is forcibly re-registered, causing a TASK_LOST status update to be sent to the framework, which then recovers:

Aug 29 12:53:07 dx-fmwk-agent2.rtp.raleigh.ibm.com mesos-slave[958]: I0829 12:53:07.696624 3334 slave.cpp:4753] Master marked the agent as disconnected but the agent considers itself registered! Forcing re-registration.
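To illustrate the kind of framework-side change we have in mind, here is a rough sketch of periodic explicit task reconciliation using the Mesos Java scheduler API. This is only our own illustration under our own assumptions (the class name, the interval, and the way active task IDs are tracked are made up, it is not code from mesos/kafka): the scheduler periodically asks the master for the current state of every task it believes is running, and if the master no longer knows about a task it replies with a terminal update such as TASK_LOST through the normal statusUpdate() callback, which would let the framework relaunch the broker even when the original terminal update was dropped.

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.mesos.Protos.TaskID;
import org.apache.mesos.Protos.TaskState;
import org.apache.mesos.Protos.TaskStatus;
import org.apache.mesos.SchedulerDriver;

/**
 * Periodically asks the master to re-send the current state of every task
 * the framework believes is active ("explicit" reconciliation). The master
 * answers through the scheduler's normal statusUpdate() callback, so a task
 * whose terminal update was lost while the agent was disconnected is
 * eventually reported back and can be relaunched.
 */
public class TaskReconciler {

  private final ScheduledExecutorService timer =
      Executors.newSingleThreadScheduledExecutor();

  /** taskIds: the task IDs the scheduler currently believes are running. */
  public void start(final SchedulerDriver driver, final Collection<String> taskIds) {
    timer.scheduleAtFixedRate(new Runnable() {
      @Override
      public void run() {
        List<TaskStatus> statuses = new ArrayList<TaskStatus>();
        for (String taskId : taskIds) {
          statuses.add(TaskStatus.newBuilder()
              .setTaskId(TaskID.newBuilder().setValue(taskId).build())
              // 'state' is a required field of TaskStatus; we send the last
              // state we knew about and let the master reply with the truth.
              .setState(TaskState.TASK_RUNNING)
              .build());
        }
        driver.reconcileTasks(statuses);
      }
    }, 1, 15, TimeUnit.MINUTES);
  }

  public void stop() {
    timer.shutdownNow();
  }
}

The 15-minute interval above is arbitrary; we would pick something long enough not to add noticeable load on the master. If there is a better or more standard way to handle missed terminal updates, we would appreciate pointers.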

