Re: Custom Scheduler: Diagnosing cause of container task failures
It looks like we can have a better error message here. @Jay, mind filing a JIRA ticket with a description, status update, and your fix attached? Thanks!

On Fri, Aug 21, 2015 at 7:36 PM, Jay Taylor j...@jaytaylor.com wrote:
> Eventually I was able to isolate what was going on; in this case the
> FrameworkInfo.User was set to an invalid value and setting it to root
> did the trick. My scheduler is now working [in a basic form]!!!
Re: Custom Scheduler: Diagnosing cause of container task failures
Eventually I was able to isolate what was going on; in this case the FrameworkInfo.User was set to an invalid value, and setting it to root did the trick. My scheduler is now working [in a basic form]!!!

Cheers,
Jay

On Thu, Aug 20, 2015 at 4:15 PM, Jay Taylor j...@jaytaylor.com wrote:
> Hey Tim, Thank you for the quick response! Just checked the sandbox logs
> and they are all empty (stdout and stderr are both 0 bytes).
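A minimal sketch of what the fix looks like with the Go bindings, assuming mesos-go's mesosproto package (the function and framework name here are illustrative, not from the actual scheduler):

    package main

    import (
        "log"

        proto "github.com/gogo/protobuf/proto"
        mesos "github.com/mesos/mesos-go/mesosproto"
    )

    // buildFrameworkInfo constructs the FrameworkInfo handed to the
    // scheduler driver. User is the OS account the executor runs as on
    // each slave: if that account doesn't exist there, the command
    // executor dies before the task starts, leaving 0-byte stdout and
    // stderr. An empty string lets the driver fill in the current user.
    func buildFrameworkInfo() *mesos.FrameworkInfo {
        return &mesos.FrameworkInfo{
            User: proto.String("root"),                    // must exist on every slave
            Name: proto.String("flux-capacitor-framework"), // illustrative
        }
    }

    func main() {
        log.Printf("registering as user %q", buildFrameworkInfo().GetUser())
    }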
Re: Custom Scheduler: Diagnosing cause of container task failures
It received a TASK_FAILED from the executor, so you'll need to look at the sandbox logs of your task (the stdout and stderr files) to see what went wrong. These files should be reachable through the Mesos UI.

Tim

On Thu, Aug 20, 2015 at 4:01 PM, Jay Taylor outtat...@gmail.com wrote:
> Hey everyone, I am writing a scheduler for Mesos and one of my first
> goals is to get a simple Docker container to run. The tasks get marked
> as failed with the failure messages originating from the slave logs.
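If the UI is hard to reach, the same files can be pulled straight from the slave's files API over HTTP. A minimal sketch in Go, assuming the default slave port 5051 and the /files/read.json endpoint of this era of Mesos; the host and sandbox path in main are placeholders (the real path is shown in the UI's Sandbox view, or by /files/browse.json):

    package main

    import (
        "encoding/json"
        "fmt"
        "io/ioutil"
        "log"
        "net/http"
        "net/url"
    )

    // readSandboxFile fetches up to `length` bytes of a sandbox file
    // from a slave's files API. `path` must be the absolute path of the
    // file inside the sandbox on that slave.
    func readSandboxFile(slaveHost, path string, length int) (string, error) {
        endpoint := fmt.Sprintf(
            "http://%s:5051/files/read.json?path=%s&offset=0&length=%d",
            slaveHost, url.QueryEscape(path), length)

        resp, err := http.Get(endpoint)
        if err != nil {
            return "", err
        }
        defer resp.Body.Close()

        body, err := ioutil.ReadAll(resp.Body)
        if err != nil {
            return "", err
        }

        // The endpoint answers with {"data": "<contents>", "offset": N}.
        var out struct {
            Data string `json:"data"`
        }
        if err := json.Unmarshal(body, &out); err != nil {
            return "", err
        }
        return out.Data, nil
    }

    func main() {
        // Placeholder host and path; substitute values from your cluster.
        contents, err := readSandboxFile("slave.example.com",
            "/tmp/mesos/slaves/.../executors/jay-test-29/runs/latest/stderr",
            50000)
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(contents)
    }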
Custom Scheduler: Diagnosing cause of container task failures
Hey everyone,

I am writing a scheduler for Mesos, and one of my first goals is to get a simple Docker container to run. The tasks get marked as failed, with the failure messages originating from the slave logs, and I'm not sure how to determine exactly what is causing the failure. The most informative log messages I've found were in the slave log:

    == /var/log/mesos/mesos-slave.INFO ==
    W0820 20:44:25.242230 29639 docker.cpp:994] Ignoring updating unknown container: e190037a-b011-4681-9e10-dcbacf6cb819
    I0820 20:44:25.242270 29639 status_update_manager.cpp:322] Received status update TASK_FAILED (UUID: 17a21cf7-17d1-42dd-92eb-b281396ebf60) for task jay-test-29 of framework 20150804-211741-1608624320-5050-18273-0060
    I0820 20:44:25.242377 29639 slave.cpp:2961] Forwarding the update TASK_FAILED (UUID: 17a21cf7-17d1-42dd-92eb-b281396ebf60) for task jay-test-29 of framework 20150804-211741-1608624320-5050-18273-0060 to master@63.198.215.105:5050
    I0820 20:44:25.247926 29636 status_update_manager.cpp:394] Received status update acknowledgement (UUID: 17a21cf7-17d1-42dd-92eb-b281396ebf60) for task jay-test-29 of framework 20150804-211741-1608624320-5050-18273-0060
    I0820 20:44:25.248108 29636 slave.cpp:3502] Cleaning up executor 'jay-test-29' of framework 20150804-211741-1608624320-5050-18273-0060
    I0820 20:44:25.248342 29636 slave.cpp:3591] Cleaning up framework 20150804-211741-1608624320-5050-18273-0060

And this doesn't really tell me much about *why* it failed. Is there somewhere else I should be looking, or an option that needs to be turned on to show more information?

Your assistance is greatly appreciated!

Jay
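On the "option that needs to be turned on" part: the slave logs through glog, so verbosity can be raised without a rebuild. A sketch assuming a stock mesos-slave install; GLOG_v is the standard glog environment variable, but check the flags your version supports:

    # glog reads GLOG_v from the environment; level 1 enables the
    # slave's VLOG(1) messages, which include more containerizer detail.
    GLOG_v=1 mesos-slave --master=63.198.215.105:5050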
Re: Custom Scheduler: Diagnosing cause of container task failures
Hey Tim,

Thank you for the quick response! Just checked the sandbox logs and they are all empty (stdout and stderr are both 0 bytes). I have discovered a little bit more information from the StatusUpdate event posted back to my scheduler:

    TaskStatus{
        TaskId: TaskID{
            Value:            *fluxCapacitor-test-1,
            XXX_unrecognized: [],
        },
        State:   *TASK_FAILED,
        Message: *Abnormal executor termination,
        Source:  *SOURCE_SLAVE,
        Reason:  *REASON_COMMAND_EXECUTOR_FAILED,
        Data:    nil,
        SlaveId: SlaveID{
            Value:            *20150804-211459-1407297728-5050-5855-S1,
            XXX_unrecognized: [],
        },
        ExecutorId:       nil,
        Timestamp:        *1.440112075509318e+09,
        Uuid:             *[102 75 82 85 38 139 68 94 153 189 210 87 218 235 147 166],
        Healthy:          nil,
        XXX_unrecognized: [],
    }

How can I find out why the command executor is failing?

On Thu, Aug 20, 2015 at 4:08 PM, Tim Chen t...@mesosphere.io wrote:
> It received a TASK_FAILED from the executor, so you'll need to look at
> the sandbox logs of your task stdout and stderr files to see what went
> wrong. These files should be reachable by the Mesos UI.
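The struct above is what arrives in the scheduler's StatusUpdate callback. A minimal sketch of a handler that logs the fields that matter for diagnosis, assuming the mesos-go bindings (myScheduler is a placeholder for the real scheduler type):

    package main

    import (
        "log"

        mesos "github.com/mesos/mesos-go/mesosproto"
        sched "github.com/mesos/mesos-go/scheduler"
    )

    // myScheduler stands in for the scheduler type that implements the
    // rest of the mesos-go Scheduler interface.
    type myScheduler struct{}

    // StatusUpdate is invoked by the driver for every task status
    // change. Reason and Message carry the slave-side diagnostics (here
    // REASON_COMMAND_EXECUTOR_FAILED / "Abnormal executor termination")
    // that never show up in the sandbox stdout/stderr.
    func (s *myScheduler) StatusUpdate(driver sched.SchedulerDriver, status *mesos.TaskStatus) {
        log.Printf("task %s: state=%s source=%s reason=%s message=%q",
            status.GetTaskId().GetValue(),
            status.GetState().String(),
            status.GetSource().String(),
            status.GetReason().String(),
            status.GetMessage())
    }

    func main() {
        _ = &myScheduler{} // wired into sched.DriverConfig in a real scheduler
    }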