Andreas Raster created MESOS-2334:
-------------------------------------

             Summary: Tasks get stuck in TASK_STAGING after a network decode error
                 Key: MESOS-2334
                 URL: https://issues.apache.org/jira/browse/MESOS-2334
             Project: Mesos
          Issue Type: Bug
            Reporter: Andreas Raster

We observed the following with a test case that uses launchTasks to schedule a large number of small CommandInfo tasks on a cluster (shell commands that look like this: "sleep `shuf -i 2-3 -n 1`; echo foo >> /share/bar"): sometimes a single task that had been launched and set to TASK_STAGING would never receive a TASK_RUNNING message (or any other message at all). It would then stay in TASK_STAGING indefinitely until we killed the framework.

We asked about this in #mesos on freenode and got an answer from alexr_:

[15:56:55] <alexr_> henno: thanks for the slave logs
[15:57:09] rakete [~rak...@static.198.2.63.178.clients.your-server.de] has left #mesos
[15:58:47] <alexr_> henno: it looks from the logs, that the slave successfully registers the executor and sends the task
[15:59:07] tillt_ [~Till@212.53.142.20] has joined #mesos
[15:59:30] <alexr_> the executor, for some reason, refuses to start the task, most probably because of the message decoding error

So he suspects that the cause is a network decoding error. I am not 100% sure what he means by that, and I wasn't the one talking to alexr_ on IRC, so I cannot post the exact log section that indicates the decoding error. I'll attach the logs that we supplied to alexr_, though, so those should contain the relevant information.

The name of the task in question was:
727527fc-a3f3-418d-a44e-ec3bbdd26315

cat /var/log/mesos/mesos-slave.INFO | grep 727527fc-a3f3-418d-a44e-ec3bbdd26315
>> http://paste.ubuntu.com/10160270/

cat /tmp/mesos/slaves/20141217-133241-2867204268-5050-12776-S1/frameworks/20150209-153125-2867204268-5050-2553-0025/executors/29cde3b3-994a-4480-b10e-c49b4fc6c706+0+727527fc-a3f3-418d-a44e-ec3bbdd26315/runs/d73a76e7-aec6-4760-bd48-86b79df89d52/stderr
>> http://paste.ubuntu.com/10160335/

cat /tmp/mesos/slaves/20141217-133241-2867204268-5050-12776-S1/frameworks/20150209-153125-2867204268-5050-2553-0025/executors/29cde3b3-994a-4480-b10e-c49b4fc6c706+0+727527fc-a3f3-418d-a44e-ec3bbdd26315/runs/d73a76e7-aec6-4760-bd48-86b79df89d52/stdout
>> http://paste.ubuntu.com/10160346/

If any relevant information is still missing, don't hesitate to ask me.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)