Re: Custom Scheduler: Diagnosing cause of container task failures

2015-08-25 Thread Alex Rukletsov
It looks like we could provide a better error message here.

@Jay, mind filing a JIRA ticket with the description, the status update, and
your fix attached? Thanks!

On Fri, Aug 21, 2015 at 7:36 PM, Jay Taylor j...@jaytaylor.com wrote:

 Eventually I was able to isolate what was going on: in this case
 FrameworkInfo.User was set to an invalid value, and setting it to root did
 the trick.

 My scheduler is now working [in a basic form]!!!

 Cheers,
 Jay

 On Thu, Aug 20, 2015 at 4:15 PM, Jay Taylor j...@jaytaylor.com wrote:

 Hey Tim,

 Thank you for the quick response!

 Just checked the sandbox logs and they are all empty (stdout and stderr
 are both 0 bytes).

 I have discovered a little bit more information from the StatusUpdate
 event posted back to my scheduler:

 TaskStatus{
 TaskId: TaskID{
 Value:*fluxCapacitor-test-1,XXX_unrecognized:[],
 },
 State: *TASK_FAILED,
 Message: *Abnormal executor termination,
 Source: *SOURCE_SLAVE,
 Reason: *REASON_COMMAND_EXECUTOR_FAILED,
 Data:nil,
 SlaveId: SlaveID{
 Value: *20150804-211459-1407297728-5050-5855-S1,
 XXX_unrecognized: [],
 },
 ExecutorId: nil,
 Timestamp: *1.440112075509318e+09,
 Uuid: *[102 75 82 85 38 139 68 94 153 189 210 87 218 235 147 166],
 Healthy: nil,
 XXX_unrecognized: [],
 }

 How can I find out why the command executor is failing?


 On Thu, Aug 20, 2015 at 4:08 PM, Tim Chen t...@mesosphere.io wrote:

 It received a TASK_FAILED from the executor, so you'll need to look at
 your task's sandbox logs (the stdout and stderr files) to see what went
 wrong.

 These files should be reachable through the Mesos UI.

 Tim

 On Thu, Aug 20, 2015 at 4:01 PM, Jay Taylor outtat...@gmail.com wrote:

 Hey everyone,

 I am writing a scheduler for Mesos, and one of my first goals is to get a
 simple Docker container to run.

 The tasks get marked as failed, with the failure messages originating from
 the slave logs, but I'm not sure how to determine exactly what is causing
 the failure.

 The most informative log messages I've found were in the slave log:

 ==> /var/log/mesos/mesos-slave.INFO <==
 W0820 20:44:25.242230 29639 docker.cpp:994] Ignoring updating unknown
 container: e190037a-b011-4681-9e10-dcbacf6cb819
 I0820 20:44:25.242270 29639 status_update_manager.cpp:322] Received
 status update TASK_FAILED (UUID: 17a21cf7-17d1-42dd-92eb-b281396ebf60) for
 task jay-test-29 of framework 20150804-211741-1608624320-5050-18273-0060
 I0820 20:44:25.242377 29639 slave.cpp:2961] Forwarding the update
 TASK_FAILED (UUID: 17a21cf7-17d1-42dd-92eb-b281396ebf60) for task
 jay-test-29 of framework 20150804-211741-1608624320-5050-18273-0060 to
 master@63.198.215.105:5050
 I0820 20:44:25.247926 29636 status_update_manager.cpp:394] Received
 status update acknowledgement (UUID: 17a21cf7-17d1-42dd-92eb-b281396ebf60)
 for task jay-test-29 of framework 
 20150804-211741-1608624320-5050-18273-0060
 I0820 20:44:25.248108 29636 slave.cpp:3502] Cleaning up executor
 'jay-test-29' of framework 20150804-211741-1608624320-5050-18273-0060
 I0820 20:44:25.248342 29636 slave.cpp:3591] Cleaning up framework
 20150804-211741-1608624320-5050-18273-0060

 And this doesn't really tell me much about *why* it failed.

 Is there somewhere else I should be looking or an option that needs to
 be turned on to show more information?

 Your assistance is greatly appreciated!

 Jay







Re: Custom Scheduler: Diagnosing cause of container task failures

2015-08-21 Thread Jay Taylor
Eventually I was able to isolate what was going on: in this case
FrameworkInfo.User was set to an invalid value, and setting it to root did
the trick.
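
For reference, a minimal sketch of what that fix looks like, assuming the
mesos-go v0 bindings; the helper, the framework name, and the standalone main
below are illustrative only, not taken from the actual scheduler:

package main

import (
	"fmt"

	mesos "github.com/mesos/mesos-go/mesosproto"
)

// stringPtr is a small helper, since the generated protobuf structs use
// pointer fields for scalar values.
func stringPtr(s string) *string { return &s }

// newFrameworkInfo builds the FrameworkInfo the scheduler registers with.
// The key detail: User must name an account that exists on the slaves
// (or be empty, letting Mesos choose). An unknown user kills the command
// executor before it can write anything, which is why the sandbox
// stdout/stderr stayed empty.
func newFrameworkInfo() *mesos.FrameworkInfo {
	return &mesos.FrameworkInfo{
		User: stringPtr("root"),          // was an invalid value; "root" did the trick
		Name: stringPtr("fluxCapacitor"), // illustrative name only
	}
}

func main() {
	fmt.Println(newFrameworkInfo())
}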

My scheduler is now working [in a basic form]!!!

Cheers,
Jay

On Thu, Aug 20, 2015 at 4:15 PM, Jay Taylor j...@jaytaylor.com wrote:

 Hey Tim,

 Thank you for the quick response!

 Just checked the sandbox logs and they are all empty (stdout and stderr
 are both 0 bytes).

 I have discovered a little bit more information from the StatusUpdate
 event posted back to my scheduler:

 TaskStatus{
 TaskId: TaskID{
 Value:*fluxCapacitor-test-1,XXX_unrecognized:[],
 },
 State: *TASK_FAILED,
 Message: *Abnormal executor termination,
 Source: *SOURCE_SLAVE,
 Reason: *REASON_COMMAND_EXECUTOR_FAILED,
 Data:nil,
 SlaveId: SlaveID{
 Value: *20150804-211459-1407297728-5050-5855-S1,
 XXX_unrecognized: [],
 },
 ExecutorId: nil,
 Timestamp: *1.440112075509318e+09,
 Uuid: *[102 75 82 85 38 139 68 94 153 189 210 87 218 235 147 166],
 Healthy: nil,
 XXX_unrecognized: [],
 }

 How can I find out why the command executor is failing?


 On Thu, Aug 20, 2015 at 4:08 PM, Tim Chen t...@mesosphere.io wrote:

 It received a TASK_FAILED from the executor, so you'll need to look at
 your task's sandbox logs (the stdout and stderr files) to see what went
 wrong.

 These files should be reachable through the Mesos UI.

 Tim

 On Thu, Aug 20, 2015 at 4:01 PM, Jay Taylor outtat...@gmail.com wrote:

 Hey everyone,

 I am writing a scheduler for Mesos, and one of my first goals is to get a
 simple Docker container to run.

 The tasks get marked as failed, with the failure messages originating from
 the slave logs, but I'm not sure how to determine exactly what is causing
 the failure.

 The most informative log messages I've found were in the slave log:

 ==> /var/log/mesos/mesos-slave.INFO <==
 W0820 20:44:25.242230 29639 docker.cpp:994] Ignoring updating unknown
 container: e190037a-b011-4681-9e10-dcbacf6cb819
 I0820 20:44:25.242270 29639 status_update_manager.cpp:322] Received
 status update TASK_FAILED (UUID: 17a21cf7-17d1-42dd-92eb-b281396ebf60) for
 task jay-test-29 of framework 20150804-211741-1608624320-5050-18273-0060
 I0820 20:44:25.242377 29639 slave.cpp:2961] Forwarding the update
 TASK_FAILED (UUID: 17a21cf7-17d1-42dd-92eb-b281396ebf60) for task
 jay-test-29 of framework 20150804-211741-1608624320-5050-18273-0060 to
 master@63.198.215.105:5050
 I0820 20:44:25.247926 29636 status_update_manager.cpp:394] Received
 status update acknowledgement (UUID: 17a21cf7-17d1-42dd-92eb-b281396ebf60)
 for task jay-test-29 of framework 20150804-211741-1608624320-5050-18273-0060
 I0820 20:44:25.248108 29636 slave.cpp:3502] Cleaning up executor
 'jay-test-29' of framework 20150804-211741-1608624320-5050-18273-0060
 I0820 20:44:25.248342 29636 slave.cpp:3591] Cleaning up framework
 20150804-211741-1608624320-5050-18273-0060

 And this doesn't really tell me much about *why* it failed.

 Is there somewhere else I should be looking or an option that needs to
 be turned on to show more information?

 Your assistance is greatly appreciated!

 Jay






Re: Custom Scheduler: Diagnosing cause of container task failures

2015-08-20 Thread Tim Chen
It received a TASK_FAILED from the executor, so you'll need to look at your
task's sandbox logs (the stdout and stderr files) to see what went wrong.

These files should be reachable through the Mesos UI.
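
For anyone who wants to pull those files without clicking through the UI, a
rough sketch of reading a sandbox file over HTTP follows; the /files/read.json
endpoint, the slave host/port, and the sandbox path are assumptions based on
what the web UI of that era talks to, not details from this thread:

package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"net/url"
)

func main() {
	// Hypothetical sandbox path; the real one can be read from the slave's
	// /state.json under the executor's "directory" field.
	sandbox := "/tmp/mesos/slaves/<slave-id>/frameworks/<framework-id>/" +
		"executors/<executor-id>/runs/latest"

	q := url.Values{}
	q.Set("path", sandbox+"/stderr")
	q.Set("offset", "0") // read from the start of the file

	// Hypothetical slave address; 5051 is the default slave port.
	resp, err := http.Get("http://slave.example.com:5051/files/read.json?" + q.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	// The response is JSON carrying the file chunk and the next offset.
	fmt.Println(string(body))
}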

Tim

On Thu, Aug 20, 2015 at 4:01 PM, Jay Taylor outtat...@gmail.com wrote:

 Hey everyone,

 I am writing a scheduler for Mesos, and one of my first goals is to get a
 simple Docker container to run.

 The tasks get marked as failed, with the failure messages originating from
 the slave logs, but I'm not sure how to determine exactly what is causing
 the failure.

 The most informative log messages I've found were in the slave log:

 ==> /var/log/mesos/mesos-slave.INFO <==
 W0820 20:44:25.242230 29639 docker.cpp:994] Ignoring updating unknown
 container: e190037a-b011-4681-9e10-dcbacf6cb819
 I0820 20:44:25.242270 29639 status_update_manager.cpp:322] Received
 status update TASK_FAILED (UUID: 17a21cf7-17d1-42dd-92eb-b281396ebf60) for
 task jay-test-29 of framework 20150804-211741-1608624320-5050-18273-0060
 I0820 20:44:25.242377 29639 slave.cpp:2961] Forwarding the update
 TASK_FAILED (UUID: 17a21cf7-17d1-42dd-92eb-b281396ebf60) for task
 jay-test-29 of framework 20150804-211741-1608624320-5050-18273-0060 to
 master@63.198.215.105:5050
 I0820 20:44:25.247926 29636 status_update_manager.cpp:394] Received
 status update acknowledgement (UUID: 17a21cf7-17d1-42dd-92eb-b281396ebf60)
 for task jay-test-29 of framework 20150804-211741-1608624320-5050-18273-0060
 I0820 20:44:25.248108 29636 slave.cpp:3502] Cleaning up executor
 'jay-test-29' of framework 20150804-211741-1608624320-5050-18273-0060
 I0820 20:44:25.248342 29636 slave.cpp:3591] Cleaning up framework
 20150804-211741-1608624320-5050-18273-0060

 And this doesn't really tell me much about *why* it failed.

 Is there somewhere else I should be looking or an option that needs to be
 turned on to show more information?

 Your assistance is greatly appreciated!

 Jay



Custom Scheduler: Diagnosing cause of container task failures

2015-08-20 Thread Jay Taylor
Hey everyone,

I am writing a scheduler for Mesos, and one of my first goals is to get a
simple Docker container to run.

The tasks get marked as failed, with the failure messages originating from
the slave logs, but I'm not sure how to determine exactly what is causing
the failure.

The most informative log messages I've found were in the slave log:

==> /var/log/mesos/mesos-slave.INFO <==
W0820 20:44:25.242230 29639 docker.cpp:994] Ignoring updating unknown
container: e190037a-b011-4681-9e10-dcbacf6cb819
I0820 20:44:25.242270 29639 status_update_manager.cpp:322] Received
status update TASK_FAILED (UUID: 17a21cf7-17d1-42dd-92eb-b281396ebf60) for
task jay-test-29 of framework 20150804-211741-1608624320-5050-18273-0060
I0820 20:44:25.242377 29639 slave.cpp:2961] Forwarding the update
TASK_FAILED (UUID: 17a21cf7-17d1-42dd-92eb-b281396ebf60) for task
jay-test-29 of framework 20150804-211741-1608624320-5050-18273-0060 to
master@63.198.215.105:5050
I0820 20:44:25.247926 29636 status_update_manager.cpp:394] Received
status update acknowledgement (UUID: 17a21cf7-17d1-42dd-92eb-b281396ebf60)
for task jay-test-29 of framework 20150804-211741-1608624320-5050-18273-0060
I0820 20:44:25.248108 29636 slave.cpp:3502] Cleaning up executor
'jay-test-29' of framework 20150804-211741-1608624320-5050-18273-0060
I0820 20:44:25.248342 29636 slave.cpp:3591] Cleaning up framework
20150804-211741-1608624320-5050-18273-0060

And this doesn't really tell me much about *why* it failed.

Is there somewhere else I should be looking or an option that needs to be
turned on to show more information?

Your assistance is greatly appreciated!

Jay


Re: Custom Scheduler: Diagnosing cause of container task failures

2015-08-20 Thread Jay Taylor
Hey Tim,

Thank you for the quick response!

Just checked the sandbox logs and they are all empty (stdout and stderr are
both 0 bytes).

I have discovered a little bit more information from the StatusUpdate event
posted back to my scheduler:

TaskStatus{
TaskId: TaskID{
Value:*fluxCapacitor-test-1,XXX_unrecognized:[],
},
State: *TASK_FAILED,
Message: *Abnormal executor termination,
Source: *SOURCE_SLAVE,
Reason: *REASON_COMMAND_EXECUTOR_FAILED,
Data:nil,
SlaveId: SlaveID{
Value: *20150804-211459-1407297728-5050-5855-S1,
XXX_unrecognized: [],
},
ExecutorId: nil,
Timestamp: *1.440112075509318e+09,
Uuid: *[102 75 82 85 38 139 68 94 153 189 210 87 218 235 147 166],
Healthy: nil,
XXX_unrecognized: [],
}

How can I find out why the command executor is failing?
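
A quick way to make those fields easier to scan is to log them from the
scheduler's StatusUpdate callback. A minimal sketch, again assuming the
mesos-go v0 bindings (the helper name and the standalone main are
illustrative only):

package main

import (
	"log"

	mesos "github.com/mesos/mesos-go/mesosproto"
)

// logTaskStatus prints the fields that explain a failure: State says what
// happened, while Reason, Source and Message narrow down where it happened
// (here the slave reported that the command executor itself died, so the
// task never got a chance to write its own stdout/stderr).
func logTaskStatus(status *mesos.TaskStatus) {
	log.Printf("task %s: state=%s reason=%s source=%s message=%q",
		status.GetTaskId().GetValue(),
		status.GetState().String(),
		status.GetReason().String(),
		status.GetSource().String(),
		status.GetMessage())
}

func main() {
	// Normally the driver hands the TaskStatus to StatusUpdate; an empty
	// status is used here just to keep the example self-contained.
	logTaskStatus(&mesos.TaskStatus{})
}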


On Thu, Aug 20, 2015 at 4:08 PM, Tim Chen t...@mesosphere.io wrote:

 It received a TASK_FAILED from the executor, so you'll need to look at
 your task's sandbox logs (the stdout and stderr files) to see what went
 wrong.

 These files should be reachable through the Mesos UI.

 Tim

 On Thu, Aug 20, 2015 at 4:01 PM, Jay Taylor outtat...@gmail.com wrote:

 Hey everyone,

 I am writing a scheduler for Mesos, and one of my first goals is to get a
 simple Docker container to run.

 The tasks get marked as failed, with the failure messages originating from
 the slave logs, but I'm not sure how to determine exactly what is causing
 the failure.

 The most informative log messages I've found were in the slave log:

 ==> /var/log/mesos/mesos-slave.INFO <==
 W0820 20:44:25.242230 29639 docker.cpp:994] Ignoring updating unknown
 container: e190037a-b011-4681-9e10-dcbacf6cb819
 I0820 20:44:25.242270 29639 status_update_manager.cpp:322] Received
 status update TASK_FAILED (UUID: 17a21cf7-17d1-42dd-92eb-b281396ebf60) for
 task jay-test-29 of framework 20150804-211741-1608624320-5050-18273-0060
 I0820 20:44:25.242377 29639 slave.cpp:2961] Forwarding the update
 TASK_FAILED (UUID: 17a21cf7-17d1-42dd-92eb-b281396ebf60) for task
 jay-test-29 of framework 20150804-211741-1608624320-5050-18273-0060 to
 master@63.198.215.105:5050
 I0820 20:44:25.247926 29636 status_update_manager.cpp:394] Received
 status update acknowledgement (UUID: 17a21cf7-17d1-42dd-92eb-b281396ebf60)
 for task jay-test-29 of framework 20150804-211741-1608624320-5050-18273-0060
 I0820 20:44:25.248108 29636 slave.cpp:3502] Cleaning up executor
 'jay-test-29' of framework 20150804-211741-1608624320-5050-18273-0060
 I0820 20:44:25.248342 29636 slave.cpp:3591] Cleaning up framework
 20150804-211741-1608624320-5050-18273-0060

 And this doesn't really tell me much about *why* it failed.

 Is there somewhere else I should be looking or an option that needs to be
 turned on to show more information?

 Your assistance is greatly appreciated!

 Jay