[jira] [Commented] (MESOS-6143) resolv.conf is not copied when using the Mesos containerizer with a Docker image

2016-09-09 Thread Justin Pinkul (JIRA)

[ https://issues.apache.org/jira/browse/MESOS-6143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15477530#comment-15477530 ]

Justin Pinkul commented on MESOS-6143:
--

My {{resolv.conf}} is still empty when using the alpine image. I don't 
understand how that test would pass, since {{resolv.conf}} is only modified by 
the {{network/cni}} isolator, which is not enabled in this test.
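
For reference, this is roughly how I am checking it; the agent host and the 
executor PID below are placeholders from my setup:

{code}
# Confirm which isolators the agent actually has enabled
# (the agent's standard /flags endpoint on its default port 5051).
curl -s http://mesosagent1:5051/flags | grep -o '"isolation":"[^"]*"'

# Enter the container's mount namespace and inspect resolv.conf;
# $PID is the mesos-executor's child process, found via ps/pstree.
nsenter -m -t $PID cat /etc/resolv.conf
{code}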

> resolv.conf is not copied when using the Mesos containerizer with a Docker 
> image
> 
>
> Key: MESOS-6143
> URL: https://issues.apache.org/jira/browse/MESOS-6143
> Project: Mesos
> Issue Type: Bug
> Components: containerization, isolation
> Affects Versions: 1.0.0
> Environment: OS: Debian Jessie
> Mesos version: 1.0.0
> Reporter: Justin Pinkul
> Assignee: Avinash Sridharan
> Fix For: 1.1.0
>
>
> When using the Mesos containerizer, host networking, and a Docker image, 
> {{resolv.conf}} is not copied from the host. The only piece of Mesos code 
> that copies this file is currently in the {{network/cni}} isolator, so I 
> tried turning it on by setting 
> {{isolation=network/cni,namespaces/pid,docker/runtime,cgroups/devices,gpu/nvidia,cgroups/cpu,disk/du,filesystem/linux}},
>  but the issue still remained. I suspect this might be related to not setting 
> {{network_cni_config_dir}} and {{network_cni_plugins_dir}}, but it seems 
> incorrect that these flags would be required to use host networking.
> Here is how I am able to reproduce this issue:
> {code}
> mesos-execute --master=mesosmaster1:5050 \
>   --name=dns-test \
>   --docker_image=my-docker-image:1.1.3 \
>   --command="bash -c 'ping google.com; while ((1)); do date; sleep 10; done'"
> # Find the PID of mesos-executor's child process and enter it
> nsenter -m -u -i -n -p -r -w -t $PID
> # This file will be empty
> cat /etc/resolv.conf
> {code}
> {code:title=Mesos agent log}
> I0908 17:39:24.599149 181564 slave.cpp:1688] Launching task dns-test for 
> framework 51831498-0902-4ae9-a1ff-4396f8b8d823-0006
> I0908 17:39:24.599567 181564 paths.cpp:528] Trying to chown 
> '/mnt/01/mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S2/frameworks/51831498-0902-4ae9-a1ff-4396f8b8d823-0006/executors/dns-test/runs/52bdce71-04b0-4440-bb71-cb826f0635c6'
>  to user 'root'
> I0908 17:39:24.603970 181564 slave.cpp:5748] Launching executor dns-test of 
> framework 51831498-0902-4ae9-a1ff-4396f8b8d823-0006 with resources 
> cpus(*):0.1; mem(*):32 in work directory 
> '/mnt/01/mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S2/frameworks/51831498-0902-4ae9-a1ff-4396f8b8d823-0006/executors/dns-test/runs/52bdce71-04b0-4440-bb71-cb826f0635c6'
> I0908 17:39:24.604178 181564 slave.cpp:1914] Queuing task 'dns-test' for 
> executor 'dns-test' of framework 51831498-0902-4ae9-a1ff-4396f8b8d823-0006
> I0908 17:39:24.604284 181571 docker.cpp:1020] Skipping non-docker container
> I0908 17:39:24.604532 181578 containerizer.cpp:781] Starting container 
> '52bdce71-04b0-4440-bb71-cb826f0635c6' for executor 'dns-test' of framework 
> '51831498-0902-4ae9-a1ff-4396f8b8d823-0006'
> I0908 17:39:24.606972 181571 provisioner.cpp:294] Provisioning image rootfs 
> '/mnt/01/mesos_work/provisioner/containers/52bdce71-04b0-4440-bb71-cb826f0635c6/backends/copy/rootfses/db97ba50-c9f0-45e7-8a39-871e4038abf9'
>  for container 52bdce71-04b0-4440-bb71-cb826f0635c6
> I0908 17:39:30.037472 181564 cpushare.cpp:389] Updated 'cpu.shares' to 102 
> (cpus 0.1) for container 52bdce71-04b0-4440-bb71-cb826f0635c6
> I0908 17:39:30.038415 181560 linux_launcher.cpp:281] Cloning child process 
> with flags = CLONE_NEWNS | CLONE_NEWPID
> I0908 17:39:30.040742 181560 systemd.cpp:96] Assigned child process '190563' 
> to 'mesos_executors.slice'
> I0908 17:39:30.161613 181576 slave.cpp:2902] Got registration for executor 
> 'dns-test' of framework 51831498-0902-4ae9-a1ff-4396f8b8d823-0006 from 
> executor(1)@10.191.4.65:43707
> I0908 17:39:30.162148 181563 disk.cpp:171] Updating the disk resources for 
> container 52bdce71-04b0-4440-bb71-cb826f0635c6 to cpus(*):0.1; mem(*):32; 
> gpus(*):2
> I0908 17:39:30.162648 181566 cpushare.cpp:389] Updated 'cpu.shares' to 102 
> (cpus 0.1) for container 52bdce71-04b0-4440-bb71-cb826f0635c6
> I0908 17:39:30.162822 181574 slave.cpp:2079] Sending queued task 'dns-test' 
> to executor 'dns-test' of framework 51831498-0902-4ae9-a1ff-4396f8b8d823-0006 
> at executor(1)@10.191.4.65:43707
> I0908 17:39:30.168383 181570 slave.cpp:3285] Handling status update 
> TASK_RUNNING (UUID: 319e0235-01b9-42ce-a2f8-ed9fc33de150) for task dns-test 
> of framework 51831498-0902-4ae9-a1ff-4396f8b8d823-0006 from 
> executor(1)@10.191.4.65:43707
> I0908 17:39:30.169019 181577 status_update_manager.cpp:320] Received status 

[jira] [Created] (MESOS-6143) resolv.conf is not copied when using the Mesos containerizer with a Docker image

2016-09-08 Thread Justin Pinkul (JIRA)
Justin Pinkul created MESOS-6143:


 Summary: resolv.conf is not copied when using the Mesos 
containerizer with a Docker image
 Key: MESOS-6143
 URL: https://issues.apache.org/jira/browse/MESOS-6143
 Project: Mesos
 Issue Type: Bug
 Components: containerization, isolation
 Affects Versions: 1.0.0
 Environment: OS: Debian Jessie
 Mesos version: 1.0.0
 Reporter: Justin Pinkul


When using the Mesos containerizer, host networking, and a Docker image, 
{{resolv.conf}} is not copied from the host. The only piece of Mesos code that 
copies this file is currently in the {{network/cni}} isolator, so I tried 
turning it on by setting 
{{isolation=network/cni,namespaces/pid,docker/runtime,cgroups/devices,gpu/nvidia,cgroups/cpu,disk/du,filesystem/linux}},
 but the issue still remained. I suspect this might be related to not setting 
{{network_cni_config_dir}} and {{network_cni_plugins_dir}}, but it seems 
incorrect that these flags would be required to use host networking.
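
For reference, the agent invocation I have been experimenting with looks 
roughly like the following; the CNI config and plugin directories are 
placeholders since I have not actually set them:

{code}
# Sketch of the agent flags in question; directory paths are placeholders.
mesos-agent --master=mesosmaster1:5050 \
  --containerizers=mesos \
  --image_providers=docker \
  --isolation=network/cni,namespaces/pid,docker/runtime,cgroups/devices,gpu/nvidia,cgroups/cpu,disk/du,filesystem/linux \
  --network_cni_config_dir=/etc/mesos/cni \
  --network_cni_plugins_dir=/usr/libexec/mesos/cni
{code}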

Here is how I am able to reproduce this issue:
{code}
mesos-execute --master=mesosmaster1:5050 \
  --name=dns-test \
  --docker_image=my-docker-image:1.1.3 \
  --command="bash -c 'ping google.com; while ((1)); do date; sleep 10; done'"

# Find the PID of mesos-executor's child process and enter it
nsenter -m -u -i -n -p -r -w -t $PID

# This file will be empty
cat /etc/resolv.conf
{code}
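
To double-check where the empty file comes from, one can also look at the 
provisioned rootfs directly on the agent host; the path below is taken from 
the {{provisioner.cpp}} line in the agent log further down:

{code}
# Inspect the copy-backend rootfs that the provisioner created for this container.
ROOTFS=/mnt/01/mesos_work/provisioner/containers/52bdce71-04b0-4440-bb71-cb826f0635c6/backends/copy/rootfses/db97ba50-c9f0-45e7-8a39-871e4038abf9
ls -l "$ROOTFS/etc/resolv.conf"
# Expected to be empty (or missing), since nothing copies it from the host.
cat "$ROOTFS/etc/resolv.conf"
{code}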

{code:title=Mesos agent log}
I0908 17:39:24.599149 181564 slave.cpp:1688] Launching task dns-test for 
framework 51831498-0902-4ae9-a1ff-4396f8b8d823-0006
I0908 17:39:24.599567 181564 paths.cpp:528] Trying to chown 
'/mnt/01/mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S2/frameworks/51831498-0902-4ae9-a1ff-4396f8b8d823-0006/executors/dns-test/runs/52bdce71-04b0-4440-bb71-cb826f0635c6'
 to user 'root'
I0908 17:39:24.603970 181564 slave.cpp:5748] Launching executor dns-test of 
framework 51831498-0902-4ae9-a1ff-4396f8b8d823-0006 with resources cpus(*):0.1; 
mem(*):32 in work directory 
'/mnt/01/mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S2/frameworks/51831498-0902-4ae9-a1ff-4396f8b8d823-0006/executors/dns-test/runs/52bdce71-04b0-4440-bb71-cb826f0635c6'
I0908 17:39:24.604178 181564 slave.cpp:1914] Queuing task 'dns-test' for 
executor 'dns-test' of framework 51831498-0902-4ae9-a1ff-4396f8b8d823-0006
I0908 17:39:24.604284 181571 docker.cpp:1020] Skipping non-docker container
I0908 17:39:24.604532 181578 containerizer.cpp:781] Starting container 
'52bdce71-04b0-4440-bb71-cb826f0635c6' for executor 'dns-test' of framework 
'51831498-0902-4ae9-a1ff-4396f8b8d823-0006'
I0908 17:39:24.606972 181571 provisioner.cpp:294] Provisioning image rootfs 
'/mnt/01/mesos_work/provisioner/containers/52bdce71-04b0-4440-bb71-cb826f0635c6/backends/copy/rootfses/db97ba50-c9f0-45e7-8a39-871e4038abf9'
 for container 52bdce71-04b0-4440-bb71-cb826f0635c6
I0908 17:39:30.037472 181564 cpushare.cpp:389] Updated 'cpu.shares' to 102 
(cpus 0.1) for container 52bdce71-04b0-4440-bb71-cb826f0635c6
I0908 17:39:30.038415 181560 linux_launcher.cpp:281] Cloning child process with 
flags = CLONE_NEWNS | CLONE_NEWPID
I0908 17:39:30.040742 181560 systemd.cpp:96] Assigned child process '190563' to 
'mesos_executors.slice'
I0908 17:39:30.161613 181576 slave.cpp:2902] Got registration for executor 
'dns-test' of framework 51831498-0902-4ae9-a1ff-4396f8b8d823-0006 from 
executor(1)@10.191.4.65:43707
I0908 17:39:30.162148 181563 disk.cpp:171] Updating the disk resources for 
container 52bdce71-04b0-4440-bb71-cb826f0635c6 to cpus(*):0.1; mem(*):32; 
gpus(*):2
I0908 17:39:30.162648 181566 cpushare.cpp:389] Updated 'cpu.shares' to 102 
(cpus 0.1) for container 52bdce71-04b0-4440-bb71-cb826f0635c6
I0908 17:39:30.162822 181574 slave.cpp:2079] Sending queued task 'dns-test' to 
executor 'dns-test' of framework 51831498-0902-4ae9-a1ff-4396f8b8d823-0006 at 
executor(1)@10.191.4.65:43707
I0908 17:39:30.168383 181570 slave.cpp:3285] Handling status update 
TASK_RUNNING (UUID: 319e0235-01b9-42ce-a2f8-ed9fc33de150) for task dns-test of 
framework 51831498-0902-4ae9-a1ff-4396f8b8d823-0006 from 
executor(1)@10.191.4.65:43707
I0908 17:39:30.169019 181577 status_update_manager.cpp:320] Received status 
update TASK_RUNNING (UUID: 319e0235-01b9-42ce-a2f8-ed9fc33de150) for task 
dns-test of framework 51831498-0902-4ae9-a1ff-4396f8b8d823-0006
I0908 17:39:30.169173 181576 slave.cpp:3678] Forwarding the update TASK_RUNNING 
(UUID: 319e0235-01b9-42ce-a2f8-ed9fc33de150) for task dns-test of framework 
51831498-0902-4ae9-a1ff-4396f8b8d823-0006 to master@10.191.248.194:5050
I0908 17:39:30.169242 181576 slave.cpp:3588] Sending acknowledgement for status 
update TASK_RUNNING (UUID: 319e0235-01b9-42ce-a2f8-ed9fc33de150) for task 
dns-test of framework 51831498-0902-4ae9-a1ff-4396f8b8d823-0006 to 
executor(1)@10.191.4.65:43707
I0908 

[jira] [Commented] (MESOS-8038) Launching GPU task sporadically fails.

2018-08-28 Thread Justin Pinkul (JIRA)


[ https://issues.apache.org/jira/browse/MESOS-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16595136#comment-16595136 ]

Justin Pinkul commented on MESOS-8038:
--

In our GPU cluster we have seen many cases where a task acquires resources on a 
GPU and then gets stuck in the D state forever. Being stuck in the D state is 
generally caused by bugs in the GPU driver or an NFS driver. When these kinds 
of driver bugs are hit, Linux has no way to recover, and the only way to kill 
the process is to restart the machine.
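
A quick way to spot such processes is to look for tasks stuck in uninterruptible 
sleep; this is just a generic sketch, not our exact tooling:

{code}
# List processes in uninterruptible sleep (D state) together with the kernel
# function they are blocked in; a PID that stays here across samples is wedged.
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'
{code}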

In our GPU cluster we handle these issues by automatically detecting them and 
putting the machine into maintenance mode with a start time of now and an end 
time of one year from now. This keeps new tasks off the machine, and therefore 
from failing on it, until our operations team has a chance to investigate what 
caused the process to get stuck in the D state.
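
For anyone wanting to do something similar, here is a rough sketch of 
scheduling such a window via the master's {{/maintenance/schedule}} operator 
endpoint; the master address, hostname, and IP are placeholders:

{code}
# Schedule maintenance starting now, lasting one year, for the affected machine.
START_NS=$(( $(date +%s) * 1000000000 ))
DURATION_NS=$(( 365 * 24 * 3600 * 1000000000 ))
curl -s -X POST http://mesosmaster1:5050/maintenance/schedule \
  -H 'Content-Type: application/json' \
  -d "{
    \"windows\": [{
      \"machine_ids\": [{\"hostname\": \"gpu-node-42\", \"ip\": \"10.0.0.42\"}],
      \"unavailability\": {
        \"start\": {\"nanoseconds\": ${START_NS}},
        \"duration\": {\"nanoseconds\": ${DURATION_NS}}
      }
    }]
  }"
{code}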

I think the only graceful way Mesos could handle this state is to offer fewer 
GPUs until the machine is restarted.

> Launching GPU task sporadically fails.
> --
>
> Key: MESOS-8038
> URL: https://issues.apache.org/jira/browse/MESOS-8038
> Project: Mesos
> Issue Type: Bug
> Components: allocation, containerization, gpu
> Affects Versions: 1.4.0
> Reporter: Sai Teja Ranuva
> Assignee: Zhitao Li
> Priority: Critical
> Attachments: mesos-master.log, mesos-slave-with-issue-uber.txt, 
> mesos-slave.INFO.log
>
>
> I was running a job which uses GPUs. It runs fine most of the time, but 
> occasionally I see the following message in the mesos log:
> "Collect failed: Requested 1 but only 0 available"
> This is followed by the executor getting killed and the tasks getting lost. It 
> happens even before the job starts. A little search in the code base points to 
> something related to the GPU resource as the probable cause.
> There is no deterministic way to reproduce this; it happens occasionally.
> I have attached the slave log for the issue.
> Using 1.4.0 Mesos Master and 1.4.0 Mesos Slave.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)