Andreas Raster created MESOS-2334:
-------------------------------------

             Summary: Tasks get stuck in TASK_STAGING after a network decode error
                 Key: MESOS-2334
                 URL: https://issues.apache.org/jira/browse/MESOS-2334
             Project: Mesos
          Issue Type: Bug
            Reporter: Andreas Raster


We have a test case that uses launchTasks to schedule a large number of small 
CommandInfo tasks on a cluster (shell commands that look like this: 
"sleep `shuf -i 2-3 -n 1`; echo foo >> /share/bar"). Occasionally, a single 
task that has been launched and set to TASK_STAGING never receives a 
TASK_RUNNING message (or any other status update). It then stays in 
TASK_STAGING indefinitely until we kill the framework.
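
To make the symptom easier to spot from the framework side, a scheduler could 
keep track of how long each task has been in TASK_STAGING. The following is 
only a sketch under assumptions and is not part of Mesos: the StagingWatchdog 
class and its method names are hypothetical, and the framework would have to 
call them from its own launch and status update handling.

```python
import time


class StagingWatchdog:
    """Hypothetical helper: remembers when tasks were launched and
    reports those that never left TASK_STAGING within a timeout."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self.staging_since = {}  # task_id -> time the task was launched

    def on_launch(self, task_id):
        # Call right after handing the task to launchTasks.
        self.staging_since[task_id] = self.clock()

    def on_status_update(self, task_id, state):
        # Any state other than TASK_STAGING means the task has moved on.
        if state != "TASK_STAGING":
            self.staging_since.pop(task_id, None)

    def stuck_tasks(self):
        # Task IDs that have been staging for at least the timeout.
        now = self.clock()
        return sorted(
            task_id
            for task_id, since in self.staging_since.items()
            if now - since >= self.timeout_seconds
        )
```

Calling stuck_tasks() periodically would list candidates to kill or to report, 
instead of waiting until the whole framework has to be killed. The timeout 
value is a guess that depends on the workload.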

We asked in #mesos on freenode about this and got an answer from alexr_:

[15:56:55] <alexr_> henno: thanks for the slave logs
[15:57:09] rakete [~rak...@static.198.2.63.178.clients.your-server.de] has left 
#mesos
[15:58:47] <alexr_> henno: it looks from the logs, that the slave successfully 
registers the executor and sends the task
[15:59:07] tillt_ [~Till@212.53.142.20] has joined #mesos
[15:59:30] <alexr_> the executor, for some reason, refuses to start the task, 
most probably because of the message decoding error

In other words, he suspects that the cause is a network decoding error. I am 
not sure what exactly he means by that, and since I was not the one talking 
to alexr_ on IRC, I cannot post the exact log section that indicates the 
decoding error. I will attach the logs that we supplied to alexr_, which 
should contain the relevant information.

The name of the task in question was: 727527fc-a3f3-418d-a44e-ec3bbdd26315

cat /var/log/mesos/mesos-slave.INFO | grep 727527fc-a3f3-418d-a44e-ec3bbdd26315
>> http://paste.ubuntu.com/10160270/

cat /tmp/mesos/slaves/20141217-133241-2867204268-5050-12776-S1/frameworks/20150209-153125-2867204268-5050-2553-0025/executors/29cde3b3-994a-4480-b10e-c49b4fc6c706+0+727527fc-a3f3-418d-a44e-ec3bbdd26315/runs/d73a76e7-aec6-4760-bd48-86b79df89d52/stderr
>> http://paste.ubuntu.com/10160335/

cat /tmp/mesos/slaves/20141217-133241-2867204268-5050-12776-S1/frameworks/20150209-153125-2867204268-5050-2553-0025/executors/29cde3b3-994a-4480-b10e-c49b4fc6c706+0+727527fc-a3f3-418d-a44e-ec3bbdd26315/runs/d73a76e7-aec6-4760-bd48-86b79df89d52/stdout
>> http://paste.ubuntu.com/10160346/

If any relevant information is still missing, don't hesitate to ask me.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)