Hi,

We have recently upgraded from Mesos 0.28.x to 1.3.0 and have started seeing an issue where tasks are stuck in the staging state.
What we are seeing looks very similar to this bug report: https://issues.apache.org/jira/browse/MESOS-5482

Essentially, when the agent re-registers (note: we are still trying to figure out why this is happening) the tasks are terminated, but in some cases the terminal status update never makes it back to the framework.

Can anyone provide some guidance on the best approach to resolve this? Are there fixes in the works? Should there be changes in the framework to better handle this case, e.g. something along the lines of the reconciliation sketch at the end of this mail? (Note: it is the https://github.com/mesos/kafka framework that is encountering the problem.)

Here is a breakdown of what we observed in the Mesos agent logs.

The agent lost its connection to ZooKeeper at 05:01 on dx-fmwk-agent[0-3]:

Aug 29 05:01:21 dx-fmwk-agent1.rtp.raleigh.ibm.com mesos-slave[12898]: I0829 05:01:21.397711 12906 slave.cpp:911] Lost leading master
Aug 29 05:01:21 dx-fmwk-agent1.rtp.raleigh.ibm.com mesos-slave[12898]: I0829 05:01:21.397732 12906 slave.cpp:953] Detecting new master
Aug 29 05:01:21 dx-fmwk-agent1.rtp.raleigh.ibm.com mesos-slave[12898]: I0829 05:01:21.397753 12906 status_update_manager.cpp:177] Pausing sending status updates
Aug 29 05:01:21 dx-fmwk-agent1.rtp.raleigh.ibm.com mesos-slave[12898]: 2017-08-29 05:01:21,397:12889(0x7fc4d6447700):ZOO_INFO@zookeeper_close@2543: Freeing zookeeper resources for sessionId=0x15d8772f08b5d77

This causes the agent to re-register with the master, which means the tasks running on it (i.e. the Kafka brokers) are stopped, e.g.:

Aug 29 05:01:23 dx-fmwk-agent1.rtp.raleigh.ibm.com mesos-slave[12898]: I0829 05:01:23.625780 12903 slave.cpp:5555] Killing executor 'kafka-aggregate-5-719a0e03-7657-4a07-acff-36f821374ef2' of framework 020072e0-ec00-4f66-935d-fe033b47be76-0003 at executor(1)@9.37.246.154:35098
Aug 29 05:01:23 dx-fmwk-agent1.rtp.raleigh.ibm.com mesos-slave[12898]: I0829 05:01:23.625898 12903 slave.cpp:5555] Killing executor 'kafka-local-5-d54944b8-c619-4a27-b233-efae98b1bf02' of framework 020072e0-ec00-4f66-935d-fe033b47be76-0002 at executor(1)@9.37.246.154:41839

On dx-fmwk-agent[0-1], where the brokers recovered automatically, we see that the agent sends a TASK_LOST status update to the framework, causing it to relaunch them:
Aug 29 05:01:22 dx-fmwk-agent1.rtp.raleigh.ibm.com mesos-slave[12898]: I0829 05:01:22.235452 12900 status_update_manager.cpp:323] Received status update TASK_LOST (UUID: f3fde599-862a-4009-8947-8b733e357e87) for task kafka-local-5-aa4c5126-6899-4550-8cfb-a2a9b4d4a88b of framework 020072e0-ec00-4f66-93
Aug 29 05:01:22 dx-fmwk-agent1.rtp.raleigh.ibm.com mesos-slave[12898]: I0829 05:01:22.236565 12900 status_update_manager.cpp:323] Received status update TASK_LOST (UUID: e19f3665-75b1-48c7-a5b4-e9bd8ee38e07) for task kafka-aggregate-5-c0000796-cb2e-46d4-92a5-307ba74b9088 of framework 020072e0-ec00-4f6
Aug 29 05:01:22 dx-fmwk-agent1.rtp.raleigh.ibm.com mesos-slave[12898]: I0829 05:01:22.236675 12905 slave.cpp:4655] Forwarding the update TASK_LOST (UUID: f3fde599-862a-4009-8947-8b733e357e87) for task kafka-local-5-aa4c5126-6899-4550-8cfb-a2a9b4d4a88b of framework 020072e0-ec00-4f66-935d-fe033b47be76-
Aug 29 05:01:22 dx-fmwk-agent1.rtp.raleigh.ibm.com mesos-slave[12898]: I0829 05:01:22.236783 12905 slave.cpp:4655] Forwarding the update TASK_LOST (UUID: e19f3665-75b1-48c7-a5b4-e9bd8ee38e07) for task kafka-aggregate-5-c0000796-cb2e-46d4-92a5-307ba74b9088 of framework 020072e0-ec00-4f66-935d-fe033b47b

However, on dx-fmwk-agent[2-3] we do not see this, and the framework never receives a TASK_LOST status update. When the VMs were restarted, we see that the Mesos agent is forcibly re-registered, causing a TASK_LOST status update to be sent to the framework, which then recovers:

Aug 29 12:53:07 dx-fmwk-agent2.rtp.raleigh.ibm.com mesos-slave[958]: I0829 12:53:07.696624 3334 slave.cpp:4753] Master marked the agent as disconnected but the agent considers itself registered! Forcing re-registration.
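To illustrate the kind of framework-side change we have in mind, here is a rough sketch of periodic explicit task reconciliation using the Mesos Java scheduler API. This is only our own illustration under our own assumptions (the class name, the interval, and the way active task IDs are tracked are made up, it is not code from mesos/kafka): the scheduler periodically asks the master for the current state of every task it believes is running, and if the master no longer knows about a task it replies with a terminal update such as TASK_LOST through the normal statusUpdate() callback, which would let the framework relaunch the broker even when the original terminal update was dropped.

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.mesos.Protos.TaskID;
import org.apache.mesos.Protos.TaskState;
import org.apache.mesos.Protos.TaskStatus;
import org.apache.mesos.SchedulerDriver;

/**
 * Periodically asks the master to re-send the current state of every task
 * the framework believes is active ("explicit" reconciliation). The master
 * answers through the scheduler's normal statusUpdate() callback, so a task
 * whose terminal update was lost while the agent was disconnected is
 * eventually reported back and can be relaunched.
 */
public class TaskReconciler {

  private final ScheduledExecutorService timer =
      Executors.newSingleThreadScheduledExecutor();

  /** taskIds: the task IDs the scheduler currently believes are running. */
  public void start(final SchedulerDriver driver, final Collection<String> taskIds) {
    timer.scheduleAtFixedRate(new Runnable() {
      @Override
      public void run() {
        List<TaskStatus> statuses = new ArrayList<TaskStatus>();
        for (String taskId : taskIds) {
          statuses.add(TaskStatus.newBuilder()
              .setTaskId(TaskID.newBuilder().setValue(taskId).build())
              // 'state' is a required field of TaskStatus; we send the last
              // state we knew about and let the master reply with the truth.
              .setState(TaskState.TASK_RUNNING)
              .build());
        }
        driver.reconcileTasks(statuses);
      }
    }, 1, 15, TimeUnit.MINUTES);
  }

  public void stop() {
    timer.shutdownNow();
  }
}

The 15-minute interval above is arbitrary; we would pick something long enough not to add noticeable load on the master. If there is a better or more standard way to handle missed terminal updates, we would appreciate pointers.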

