GPU containerizer post mortem

2016-09-12 Thread Justin Pinkul
Hi everyone,
I just ran into a very subtle configuration problem with enabling GPU support 
on Mesos and thought I'd share a brief post mortem.
Scenario: Running a GPU Mesos task. The task first executes nvidia-smi to
confirm the GPUs are visible and then executes a Caffe training example to
verify the GPU is usable.
Symptom: nvidia-smi reported the correct number of GPUs, but the training
example crashed when creating the CUDA device.
Debugging tactics: To debug this I added an infinite loop to the end of the
task so the environment would not be torn down. Next I logged into the
machine, found the PID of the Mesos task, and entered its namespaces with:
nsenter -t $TASK_PID -m -u -i -n -p -r -w

At this point I ran the test manually and it worked. It worked because my
test terminal had not been added to the devices cgroup. So next I added it
to the cgroup with:
echo $TEST_TERMINAL_PID >> /sys/fs/cgroup/devices/mesos/f6736041-d403-4494-95fd-604eace34ce1/tasks
After joining the cgroup I could reproduce the problem, and I systematically
added devices to the cgroup's allow list until the test worked.
Root cause: After a reboot the nvidia-uvm device is not created
automatically. To create it, "sudo mknod -m 666 /dev/nvidia-uvm c 250 0" had
been added to a startup script. The problem is that nvidia-uvm uses a major
device ID in the experimental range. One consequence is that the major
device ID might change on boot, so the hardcoded value of 250 in the startup
script was incorrect. When Mesos starts up it reads the major device ID from
/dev/nvidia-uvm, which matches the value given by the startup script. When
it then creates the devices cgroup it uses that number instead of the
correct one. nvidia-smi worked because it never accesses the nvidia-uvm
device.
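One way to catch this kind of mismatch is to compare the major number stored on the device node with the one the kernel currently has registered. The following is only a sketch under a few assumptions: the driver is listed as "nvidia-uvm" in /proc/devices, GNU stat is available, and the function names (registered_major, node_major, check_uvm_major) are made up for illustration.

```shell
#!/bin/sh
# Print the major number the kernel registered for a character driver.
# $1: driver name as listed in /proc/devices (assumed: nvidia-uvm)
# $2: file to read; defaults to /proc/devices (a parameter so the
#     function can also be run against a saved copy)
registered_major() {
    awk -v name="$1" '
        /^Character devices:/ { in_char = 1; next }
        /^Block devices:/     { in_char = 0 }
        in_char && $2 == name { print $1; exit }
    ' "${2:-/proc/devices}"
}

# Print the major number recorded on an existing device node, in decimal.
# GNU stat reports it in hex via %t, so convert it with printf.
node_major() {
    printf '%d\n' "0x$(stat -c '%t' "$1")"
}

# Compare the node against the kernel's registration.
# Returns 0 on match, 1 on mismatch, 2 if either side is missing.
check_uvm_major() {
    dev=${1:-/dev/nvidia-uvm}
    want=$(registered_major nvidia-uvm)
    if [ -z "$want" ]; then
        echo "nvidia-uvm is not registered in /proc/devices"
        return 2
    fi
    if [ ! -c "$dev" ]; then
        echo "$dev is not a character device"
        return 2
    fi
    have=$(node_major "$dev")
    if [ "$have" = "$want" ]; then
        echo "ok: $dev has major $have"
    else
        echo "mismatch: $dev has major $have, kernel registered $want"
        return 1
    fi
}
```

Running check_uvm_major on the agent host before starting Mesos would flag a stale node created with a hardcoded major number.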
The fix: Do not hard-code the major device ID of nvidia-uvm in a startup
script. Instead bring the device up with:
nvidia-modprobe -u -c 0

I hope this information helps someone and a big thanks to Kevin Klues for 
helping me debug this issue.
Justin
  

Re: Setting log path for mesos java client library

2016-09-12 Thread Vinod Kone
Looks like the Mesos logging flags for these override the corresponding
GLOG-related flags.

Try setting "MESOS_LOG_DIR=" and "MESOS_QUIET=true"
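As a rough sketch of that suggestion, the two variables could be exported in the shell that launches the scheduler JVM. The log directory below is a hypothetical placeholder (the thread does not give a path), and the java command in the comment is likewise only illustrative.

```shell
#!/bin/sh
# Hypothetical log directory; the thread leaves the actual path unspecified.
log_dir="${TMPDIR:-/tmp}/mesos-scheduler-logs"
mkdir -p "$log_dir"

# Mesos logging settings, passed through the environment.
export MESOS_LOG_DIR="$log_dir"   # write log files here
export MESOS_QUIET=true           # stop logging to stderr

# Start the JVM that hosts the scheduler from this same shell so it
# inherits both variables, for example (hypothetical main class):
#   java -cp "$CLASSPATH" com.example.MyScheduler
```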

On Mon, Sep 12, 2016 at 12:09 PM, Wil Yegelwel wrote:

> I’m trying to set the log path (and later the log format) for the Mesos
> Java library. From the docs at
> http://mesos.apache.org/api/latest/java/org/apache/mesos/MesosSchedulerDriver.html
> it appears I need to set the correct GLOG environment variable to get this
> to work, but I can’t get it to take effect. I’ve tried setting the
> environment variables “GLOG_log_dir=…” and “GLOG_logtostderr=0”, but
> neither seems to change the behavior: it still logs to stderr. Has anyone
> been able to set the path the Mesos Java client library writes to and, if
> so, how?
>
>


Setting log path for mesos java client library

2016-09-12 Thread Wil Yegelwel
I’m trying to set the log path (and later the log format) for the Mesos
Java library. From the docs at
http://mesos.apache.org/api/latest/java/org/apache/mesos/MesosSchedulerDriver.html
it appears I need to set the correct GLOG environment variable to get this
to work, but I can’t get it to take effect. I’ve tried setting the
environment variables “GLOG_log_dir=…” and “GLOG_logtostderr=0”, but
neither seems to change the behavior: it still logs to stderr. Has anyone
been able to set the path the Mesos Java client library writes to and, if
so, how?


Re: forcing framework to re-schedule?

2016-09-12 Thread Victor L
How can I explicitly kill the task from my class?

On Mon, Sep 12, 2016 at 2:10 PM, haosdent wrote:

> If the target of your health check is your task, Mesos supports health
> checks by a command. When your task reaches the health check failure
> limit, the task is killed, and your framework can launch the task again
> when it receives `TASK_KILLED` in `statusUpdate`.
>
> On Tue, Sep 13, 2016 at 2:03 AM, Victor L wrote:
>
>> It checks whether the process is functional. I don't think standard
>> healthchecks would be sufficient for my purpose, and my question still
>> stands: how to use the result...
>>
>> On Mon, Sep 12, 2016 at 1:48 PM, haosdent wrote:
>>
>>> Hi @victor, what is your health check agent used for? I ask because
>>> Mesos supports health checks now.
>>>
>>> On Tue, Sep 13, 2016 at 1:46 AM, Victor L wrote:
>>>
>>>> Hello,
>>>> I am writing a "healthcheck agent" for a Mesos deployment framework: an
>>>> independent thread that periodically checks whether the main process
>>>> (started by the framework) is running...
>>>> What would be the mechanism to "communicate" a failure to the framework
>>>> to cause a specific outcome? For example: how can I use a failure to
>>>> cause the framework to reschedule the deployment on a different node?
>>>> Thanks,

>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Haosdent Huang
>>>
>>
>>
>
>
> --
> Best Regards,
> Haosdent Huang
>


Re: forcing framework to re-schedule?

2016-09-12 Thread Victor L
It checks whether the process is functional. I don't think standard
healthchecks would be sufficient for my purpose, and my question still
stands: how to use the result...

On Mon, Sep 12, 2016 at 1:48 PM, haosdent wrote:

> Hi @victor, what is your health check agent used for? I ask because
> Mesos supports health checks now.
>
> On Tue, Sep 13, 2016 at 1:46 AM, Victor L wrote:
>
>> Hello,
>> I am writing a "healthcheck agent" for a Mesos deployment framework: an
>> independent thread that periodically checks whether the main process
>> (started by the framework) is running...
>> What would be the mechanism to "communicate" a failure to the framework
>> to cause a specific outcome? For example: how can I use a failure to
>> cause the framework to reschedule the deployment on a different node?
>> Thanks,
>>
>
>
>
> --
> Best Regards,
> Haosdent Huang
>


Re: forcing framework to re-schedule?

2016-09-12 Thread haosdent
If the target of your health check is your task, Mesos supports health
checks by a command. When your task reaches the health check failure limit,
the task is killed, and your framework can launch the task again when it
receives `TASK_KILLED` in `statusUpdate`.
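For illustration, a command-style check for "is the main process still running" can be as small as the sketch below. It assumes the health check command signals failure through a non-zero exit status and that the PID of the main process is known to the command; the function name is_process_alive is hypothetical.

```shell
#!/bin/sh
# Exit 0 if a process with the given PID exists and can be signalled,
# non-zero otherwise. Signal 0 performs the existence and permission
# checks without delivering anything, so a process owned by another user
# may also be reported as not alive.
is_process_alive() {
    kill -0 "$1" 2>/dev/null
}
```

A health check command could then be something like `is_process_alive "$MAIN_PID"`, where MAIN_PID is a placeholder for however the task records its PID.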

On Tue, Sep 13, 2016 at 2:03 AM, Victor L wrote:

> It checks whether the process is functional. I don't think standard
> healthchecks would be sufficient for my purpose, and my question still
> stands: how to use the result...
>
> On Mon, Sep 12, 2016 at 1:48 PM, haosdent wrote:
>
>> Hi @victor, what is your health check agent used for? I ask because
>> Mesos supports health checks now.
>>
>> On Tue, Sep 13, 2016 at 1:46 AM, Victor L wrote:
>>
>>> Hello,
>>> I am writing a "healthcheck agent" for a Mesos deployment framework: an
>>> independent thread that periodically checks whether the main process
>>> (started by the framework) is running...
>>> What would be the mechanism to "communicate" a failure to the framework
>>> to cause a specific outcome? For example: how can I use a failure to
>>> cause the framework to reschedule the deployment on a different node?
>>> Thanks,
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Haosdent Huang
>>
>
>


-- 
Best Regards,
Haosdent Huang


Re: forcing framework to re-schedule?

2016-09-12 Thread haosdent
Hi @victor, what is your health check agent used for? I ask because Mesos
supports health checks now.

On Tue, Sep 13, 2016 at 1:46 AM, Victor L wrote:

> Hello,
> I am writing a "healthcheck agent" for a Mesos deployment framework: an
> independent thread that periodically checks whether the main process
> (started by the framework) is running...
> What would be the mechanism to "communicate" a failure to the framework
> to cause a specific outcome? For example: how can I use a failure to
> cause the framework to reschedule the deployment on a different node?
> Thanks,
>



-- 
Best Regards,
Haosdent Huang


forcing framework to re-schedule?

2016-09-12 Thread Victor L
Hello,
I am writing a "healthcheck agent" for a Mesos deployment framework: an
independent thread that periodically checks whether the main process
(started by the framework) is running...
What would be the mechanism to "communicate" a failure to the framework to
cause a specific outcome? For example: how can I use a failure to cause the
framework to reschedule the deployment on a different node?
Thanks,