[jira] [Commented] (AURORA-1739) createJob thrift api for golang consistently failing with empty CronSchedule

2016-09-06 Thread Renan DelValle (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15469144#comment-15469144
 ] 

Renan DelValle commented on AURORA-1739:


I encountered this as well but, thankfully, [~jfarrell] lent me a hand with it. 
I'm sure you've fixed this issue by now, but this might help anyone else who 
runs into it.

This can be fixed by modifying the thrift API definition from which the Go 
bindings are generated.

This line: 
https://github.com/apache/aurora/blob/master/api/src/main/thrift/org/apache/aurora/gen/api.thrift#L328

has to be changed from:
{code}
  4: string cronSchedule
{code}
to:
{code}
  4: optional string cronSchedule
{code}

Maybe I should submit a patch for this, but first I have to check whether the 
change causes any issues when the bindings for other languages are generated.
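
For anyone hitting this from the Go side, here is a minimal sketch of why the 
change matters, assuming the standard Apache Thrift Go generator behavior (the 
struct and field names below are illustrative, not the exact generated code):

{code}
package main

import "fmt"

// Illustrative only: with a plain `4: string cronSchedule`, the generator
// emits a value field, so an unset schedule is serialized as "" and the
// scheduler sees an (empty) cron schedule on every createJob call.
type jobConfigRequired struct {
	CronSchedule string // zero value "" is always written on the wire
}

// With `4: optional string cronSchedule`, the generator emits a pointer,
// so an unset schedule stays nil and the field is simply omitted.
type jobConfigOptional struct {
	CronSchedule *string // nil means "not a cron job"
}

func main() {
	var req jobConfigRequired
	var opt jobConfigOptional
	fmt.Printf("non-optional field sends %q; optional field set: %v\n",
		req.CronSchedule, opt.CronSchedule != nil)
}
{code}

With the optional qualifier, the Go client only sends cronSchedule when it has 
been explicitly set, which is what the scheduler's cron check expects.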

> createJob thrift api for golang consistently failing with empty CronSchedule
> ---
>
> Key: AURORA-1739
> URL: https://issues.apache.org/jira/browse/AURORA-1739
> Project: Aurora
>  Issue Type: Bug
>  Components: Client
>Affects Versions: 0.15.0
>Reporter: Jimmy Wu
>Priority: Critical
>
> Trying to create a non-cron job via the thrift API for golang consistently 
> fails with the error "Cron jobs may only be created/updated by calling 
> scheduleCronJob.". Root cause: CronSchedule is not set in JobConfiguration, so 
> an empty string is used, and the createJob request is rejected because Aurora 
> now treats an empty cron schedule as a failure (related change: 
> https://reviews.apache.org/r/28571/). This breaks all createJob requests 
> submitted from the golang thrift API, because the default value for a string 
> is an empty string instead of nil.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (AURORA-1221) Modify task state machine to treat STARTING as a new active state

2016-09-06 Thread Kai Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468426#comment-15468426
 ] 

Kai Huang edited comment on AURORA-1221 at 9/6/16 8:27 PM:
---

There are two side effects if we add the STARTING state to the LIVE_STATES 
thrift constant:

1. aurora job create --wait-until=RUNNING will finish waiting when a task 
reaches the STARTING state (instead of RUNNING).

2. aurora task commands will now also work on STARTING tasks.

For now, we do NOT treat STARTING as a live state. It makes more sense to 
preserve the original meaning of the above two commands, especially for "aurora 
task ssh", given that the sandbox is still being initialized while a task is in 
the STARTING state.


was (Author: kaih):
There are two side-effects after we add STARTING state into LIVE_STATES thrift 
constant.

aurora job create --wait-until=RUNNING will finish waiting when a task reaches 
STARTING state (instead of RUNNING)

aurora task commands will now also work for STARTING tasks.

> Modify task state machine to treat STARTING as a new active state
> -
>
> Key: AURORA-1221
> URL: https://issues.apache.org/jira/browse/AURORA-1221
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Maxim Khutornenko
>Assignee: Kai Huang
>
> Scheduler needs to treat STARTING as the new live state. 
> Open: should we treat STARTING as a transient state with general timeout 
> (currently 5 minutes) or treat it as a persistent live state instead?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (AURORA-1222) Modify stats and SLA metrics to properly account for STARTING

2016-09-06 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang resolved AURORA-1222.
---
Resolution: Fixed

> Modify stats and SLA metrics to properly account for STARTING
> -
>
> Key: AURORA-1222
> URL: https://issues.apache.org/jira/browse/AURORA-1222
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Maxim Khutornenko
>Assignee: Kai Huang
>
> Both platform and job uptime calculations will be affected by treating 
> STARTING as a new live state. Also, a new MTTS (Median Time To Starting) 
> metric would be great to have in addition to MTTA and MTTR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (AURORA-1222) Modify stats and SLA metrics to properly account for STARTING

2016-09-06 Thread Kai Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468433#comment-15468433
 ] 

Kai Huang edited comment on AURORA-1222 at 9/6/16 8:19 PM:
---

After discussion with Maxim, I decided not to account for STARTING in the 
platform and job uptime calculations. As a result, I added a new MTTS (Median 
Time To Starting) metric to the sla module. See https://reviews.apache.org/r/51580/.
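
As a rough sketch of what MTTS measures, here is a generic Go example (the 
actual sla module lives in the Java scheduler, and the exact reference point 
for the measurement below is an assumption):

{code}
package main

import (
	"fmt"
	"sort"
	"time"
)

// medianTimeToStarting returns the median of the given durations, where each
// duration is the time a task took to reach STARTING (measured here from task
// creation; the reference point used by the real sla module may differ).
func medianTimeToStarting(d []time.Duration) time.Duration {
	if len(d) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), d...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	mid := len(sorted) / 2
	if len(sorted)%2 == 1 {
		return sorted[mid]
	}
	return (sorted[mid-1] + sorted[mid]) / 2
}

func main() {
	samples := []time.Duration{2 * time.Second, 5 * time.Second, 3 * time.Second}
	fmt.Println("MTTS:", medianTimeToStarting(samples)) // prints 3s
}
{code}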


was (Author: kaih):
After discussion with Maxim, I decided not to account STARTING for the platform 
and job uptime calculation.
As a result, I add a new MTTS (Median Time To Starting) metric in the sla 
module. See https://reviews.apache.org/r/51580/.

> Modify stats and SLA metrics to properly account for STARTING
> -
>
> Key: AURORA-1222
> URL: https://issues.apache.org/jira/browse/AURORA-1222
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Maxim Khutornenko
>Assignee: Kai Huang
>
> Both platform and job uptime calculations will be affected by treating 
> STARTING as a new live state. Also, a new MTTS (Median Time To Starting) 
> metric would be great to have in addition to MTTA and MTTR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1222) Modify stats and SLA metrics to properly account for STARTING

2016-09-06 Thread Kai Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468433#comment-15468433
 ] 

Kai Huang commented on AURORA-1222:
---

After discussion with Maxim, I decided not to account for STARTING in the 
platform and job uptime calculations.
As a result, I added a new MTTS (Median Time To Starting) metric to the sla 
module. See https://reviews.apache.org/r/51580/.

> Modify stats and SLA metrics to properly account for STARTING
> -
>
> Key: AURORA-1222
> URL: https://issues.apache.org/jira/browse/AURORA-1222
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Maxim Khutornenko
>Assignee: Kai Huang
>
> Both platform and job uptime calculations will be affected by treating 
> STARTING as a new live state. Also, a new MTTS (Median Time To Starting) 
> metric would be great to have in addition to MTTA and MTTR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1221) Modify task state machine to treat STARTING as a new active state

2016-09-06 Thread Kai Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468426#comment-15468426
 ] 

Kai Huang commented on AURORA-1221:
---

There are two side effects of adding the STARTING state to the LIVE_STATES 
thrift constant:

1. aurora job create --wait-until=RUNNING will finish waiting when a task 
reaches the STARTING state (instead of RUNNING).

2. aurora task commands will now also work on STARTING tasks.

> Modify task state machine to treat STARTING as a new active state
> -
>
> Key: AURORA-1221
> URL: https://issues.apache.org/jira/browse/AURORA-1221
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Maxim Khutornenko
>Assignee: Kai Huang
>
> Scheduler needs to treat STARTING as the new live state. 
> Open: should we treat STARTING as a transient state with general timeout 
> (currently 5 minutes) or treat it as a persistent live state instead?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1640) Write end-user documentation for the Unified Containerizer support

2016-09-06 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468286#comment-15468286
 ] 

Stephan Erb commented on AURORA-1640:
-

https://reviews.apache.org/r/51664/

> Write end-user documentation for the Unified Containerizer support
> -
>
> Key: AURORA-1640
> URL: https://issues.apache.org/jira/browse/AURORA-1640
> Project: Aurora
>  Issue Type: Story
>  Components: Documentation
>Reporter: Stephan Erb
>Assignee: Stephan Erb
>
> We have to document the Unified Containerizer feature so that it is easy for 
> users and operators to adopt it. 
> Ideally, we cover:
> * how to configure the Aurora scheduler
> * links to the relevant Mesos documentation
> * an example showing a working Aurora spec that can be run within our vagrant 
> environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1762) /pendingtasks endpoint should show reason tasks are pending

2016-09-06 Thread Renan DelValle (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468270#comment-15468270
 ] 

Renan DelValle commented on AURORA-1762:


In that case, I'll ask someone from my research lab to take a crack at this.

> /pendingtasks endpoint should show reason tasks are pending
> ---
>
> Key: AURORA-1762
> URL: https://issues.apache.org/jira/browse/AURORA-1762
> Project: Aurora
>  Issue Type: Task
>Reporter: David Robinson
>Priority: Minor
>  Labels: newbie
>
> The /pendingtasks endpoint is essentially useless as is: it shows that tasks 
> are pending but doesn't show why. The information is also not easily 
> discovered via the /scheduler UI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (AURORA-1762) /pendingtasks endpoint should show reason tasks are pending

2016-09-06 Thread Karthik Anantha Padmanabhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Anantha Padmanabhan reassigned AURORA-1762:
---

Assignee: Karthik Anantha Padmanabhan

> /pendingtasks endpoint should show reason tasks are pending
> ---
>
> Key: AURORA-1762
> URL: https://issues.apache.org/jira/browse/AURORA-1762
> Project: Aurora
>  Issue Type: Task
>Reporter: David Robinson
>Assignee: Karthik Anantha Padmanabhan
>Priority: Minor
>  Labels: newbie
>
> The /pendingtasks endpoint is essentially useless as is: it shows that tasks 
> are pending but doesn't show why. The information is also not easily 
> discovered via the /scheduler UI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1762) /pendingtasks endpoint should show reason tasks are pending

2016-09-06 Thread Karthik Anantha Padmanabhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Anantha Padmanabhan updated AURORA-1762:

Assignee: (was: Karthik Anantha Padmanabhan)

> /pendingtasks endpoint should show reason tasks are pending
> ---
>
> Key: AURORA-1762
> URL: https://issues.apache.org/jira/browse/AURORA-1762
> Project: Aurora
>  Issue Type: Task
>Reporter: David Robinson
>Priority: Minor
>  Labels: newbie
>
> The /pendingtasks endpoint is essentially useless as is: it shows that tasks 
> are pending but doesn't show why. The information is also not easily 
> discovered via the /scheduler UI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1763) GPU drivers are missing when using a Docker image

2016-09-06 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468200#comment-15468200
 ] 

Joshua Cohen commented on AURORA-1763:
--

Yes, for the reasons Jie mentions, setting a rootfs is not an option for Thermos.

Another option would be to configure each Mesos agent host with a 
{{/usr/local/nvidia}} directory and then point the {{--global_container_mounts}} 
flag on the scheduler at that path. Thermos will then mount it into each task.

> GPU drivers are missing when using a Docker image
> -
>
> Key: AURORA-1763
> URL: https://issues.apache.org/jira/browse/AURORA-1763
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>
> When launching a GPU job that uses a Docker image and the unified 
> containerizer the Nvidia drivers are not correctly mounted. As an experiment 
> I launched a task using both mesos-execute and Aurora using the same Docker 
> image and ran nvidia-smi. During the experiment I noticed that the 
> /usr/local/nvidia folder was not being mounted properly. To confirm this was 
> the issue I tar'ed the drivers up (/run/mesos/isolators/gpu/nvidia_352.39) 
> and manually added it to the Docker image. When this was done the task was 
> able to launch correctly.
> Here is the resulting mountinfo for the mesos-execute task. Notice how 
> /usr/local/nvidia is mounted from the /mesos directory.
> {noformat}140 102 8:17 
> /mesos_work/provisioner/containers/11c497a2-a300-4c9e-a474-79aad1f28f11/backends/copy/rootfses/8ee046a6-bacb-42ff-b039-2cabda5d0e62
>  / rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
> 141 140 8:17 
> /mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/67025326-9dfd-4cbb-a008-454a40bce2f5-0009/executors/gpu-test/runs/11c497a2-a300-4c9e-a474-79aad1f28f11
>  /mnt/mesos/sandbox rw,relatime master:24 - ext4 /dev/sdb1 
> rw,errors=remount-ro,data=ordered
> 142 140 0:15 /mesos/isolators/gpu/nvidia_352.39 /usr/local/nvidia 
> rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
> 143 140 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
> 144 143 0:3 /sys /proc/sys ro,relatime - proc proc rw
> 145 140 0:14 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
> 146 140 0:38 / /dev rw,nosuid - tmpfs tmpfs rw,mode=755
> 147 146 0:39 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts 
> rw,mode=600,ptmxmode=666
> 148 146 0:40 / /dev/shm rw,nosuid,nodev - tmpfs tmpfs rw{noformat}
> Here is the mountinfo when using Aurora. Notice how /usr/local/nvidia is 
> missing.
> {noformat}72 71 8:1 / / rw,relatime master:1 - ext4 /dev/sda1 
> rw,errors=remount-ro,data=ordered
> 73 72 0:5 / /dev rw,relatime master:2 - devtmpfs udev 
> rw,size=10240k,nr_inodes=16521649,mode=755
> 74 73 0:11 / /dev/pts rw,nosuid,noexec,relatime master:3 - devpts devpts 
> rw,gid=5,mode=620,ptmxmode=000
> 75 73 0:17 / /dev/shm rw,nosuid,nodev master:4 - tmpfs tmpfs rw
> 76 73 0:13 / /dev/mqueue rw,relatime master:21 - mqueue mqueue rw
> 77 73 0:30 / /dev/hugepages rw,relatime master:23 - hugetlbfs hugetlbfs rw
> 78 72 0:15 / /run rw,nosuid,relatime master:5 - tmpfs tmpfs 
> rw,size=26438160k,mode=755
> 79 78 0:18 / /run/lock rw,nosuid,nodev,noexec,relatime master:6 - tmpfs tmpfs 
> rw,size=5120k
> 80 78 0:32 / /run/rpc_pipefs rw,relatime master:25 - rpc_pipefs rpc_pipefs rw
> 82 72 0:14 / /sys rw,nosuid,nodev,noexec,relatime master:7 - sysfs sysfs rw
> 83 82 0:16 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime master:8 - 
> securityfs securityfs rw
> 84 82 0:19 / /sys/fs/cgroup ro,nosuid,nodev,noexec master:9 - tmpfs tmpfs 
> ro,mode=755
> 85 84 0:20 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:10 
> - cgroup cgroup 
> rw,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd
> 86 84 0:22 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime master:13 
> - cgroup cgroup rw,cpuset
> 87 84 0:23 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime 
> master:14 - cgroup cgroup rw,cpu,cpuacct
> 88 84 0:24 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime master:15 
> - cgroup cgroup rw,devices
> 89 84 0:25 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime master:16 
> - cgroup cgroup rw,freezer
> 90 84 0:26 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime 
> master:17 - cgroup cgroup rw,net_cls,net_prio
> 91 84 0:27 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime master:18 - 
> cgroup cgroup rw,blkio
> 92 84 0:28 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime 
> master:19 - cgroup cgroup rw,perf_event
> 93 82 0:21 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime master:11 - 
> pstore pstore rw
> 94 82 0:6 / /sys/kernel/debug rw,relatime master:22 - debugfs debugfs rw

[jira] [Commented] (AURORA-1763) GPU drivers are missing when using a Docker image

2016-09-06 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468190#comment-15468190
 ] 

Jie Yu commented on AURORA-1763:


Setting a rootfs for the executor is another option, but I think that might 
break Thermos because it assumes it can see the host rootfs (I might be wrong). 
Also, bundling the executor (and libmesos.so) in an image is not trivial because 
of the ABI compatibility issue; that would mean Thermos needs one Docker image 
per Linux distribution.

> GPU drivers are missing when using a Docker image
> -
>
> Key: AURORA-1763
> URL: https://issues.apache.org/jira/browse/AURORA-1763
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>
> When launching a GPU job that uses a Docker image and the unified 
> containerizer the Nvidia drivers are not correctly mounted. As an experiment 
> I launched a task using both mesos-execute and Aurora using the same Docker 
> image and ran nvidia-smi. During the experiment I noticed that the 
> /usr/local/nvidia folder was not being mounted properly. To confirm this was 
> the issue I tar'ed the drivers up (/run/mesos/isolators/gpu/nvidia_352.39) 
> and manually added it to the Docker image. When this was done the task was 
> able to launch correctly.
> Here is the resulting mountinfo for the mesos-execute task. Notice how 
> /usr/local/nvidia is mounted from the /mesos directory.
> {noformat}140 102 8:17 
> /mesos_work/provisioner/containers/11c497a2-a300-4c9e-a474-79aad1f28f11/backends/copy/rootfses/8ee046a6-bacb-42ff-b039-2cabda5d0e62
>  / rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
> 141 140 8:17 
> /mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/67025326-9dfd-4cbb-a008-454a40bce2f5-0009/executors/gpu-test/runs/11c497a2-a300-4c9e-a474-79aad1f28f11
>  /mnt/mesos/sandbox rw,relatime master:24 - ext4 /dev/sdb1 
> rw,errors=remount-ro,data=ordered
> 142 140 0:15 /mesos/isolators/gpu/nvidia_352.39 /usr/local/nvidia 
> rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
> 143 140 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
> 144 143 0:3 /sys /proc/sys ro,relatime - proc proc rw
> 145 140 0:14 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
> 146 140 0:38 / /dev rw,nosuid - tmpfs tmpfs rw,mode=755
> 147 146 0:39 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts 
> rw,mode=600,ptmxmode=666
> 148 146 0:40 / /dev/shm rw,nosuid,nodev - tmpfs tmpfs rw{noformat}
> Here is the mountinfo when using Aurora. Notice how /usr/local/nvidia is 
> missing.
> {noformat}72 71 8:1 / / rw,relatime master:1 - ext4 /dev/sda1 
> rw,errors=remount-ro,data=ordered
> 73 72 0:5 / /dev rw,relatime master:2 - devtmpfs udev 
> rw,size=10240k,nr_inodes=16521649,mode=755
> 74 73 0:11 / /dev/pts rw,nosuid,noexec,relatime master:3 - devpts devpts 
> rw,gid=5,mode=620,ptmxmode=000
> 75 73 0:17 / /dev/shm rw,nosuid,nodev master:4 - tmpfs tmpfs rw
> 76 73 0:13 / /dev/mqueue rw,relatime master:21 - mqueue mqueue rw
> 77 73 0:30 / /dev/hugepages rw,relatime master:23 - hugetlbfs hugetlbfs rw
> 78 72 0:15 / /run rw,nosuid,relatime master:5 - tmpfs tmpfs 
> rw,size=26438160k,mode=755
> 79 78 0:18 / /run/lock rw,nosuid,nodev,noexec,relatime master:6 - tmpfs tmpfs 
> rw,size=5120k
> 80 78 0:32 / /run/rpc_pipefs rw,relatime master:25 - rpc_pipefs rpc_pipefs rw
> 82 72 0:14 / /sys rw,nosuid,nodev,noexec,relatime master:7 - sysfs sysfs rw
> 83 82 0:16 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime master:8 - 
> securityfs securityfs rw
> 84 82 0:19 / /sys/fs/cgroup ro,nosuid,nodev,noexec master:9 - tmpfs tmpfs 
> ro,mode=755
> 85 84 0:20 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:10 
> - cgroup cgroup 
> rw,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd
> 86 84 0:22 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime master:13 
> - cgroup cgroup rw,cpuset
> 87 84 0:23 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime 
> master:14 - cgroup cgroup rw,cpu,cpuacct
> 88 84 0:24 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime master:15 
> - cgroup cgroup rw,devices
> 89 84 0:25 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime master:16 
> - cgroup cgroup rw,freezer
> 90 84 0:26 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime 
> master:17 - cgroup cgroup rw,net_cls,net_prio
> 91 84 0:27 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime master:18 - 
> cgroup cgroup rw,blkio
> 92 84 0:28 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime 
> master:19 - cgroup cgroup rw,perf_event
> 93 82 0:21 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime master:11 - 
> pstore pstore rw
> 94 82 0:6 / /sys/kernel/debug rw,relatime master:22 - 

[jira] [Commented] (AURORA-1763) GPU drivers are missing when using a Docker image

2016-09-06 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468177#comment-15468177
 ] 

Jie Yu commented on AURORA-1763:


I mean it's from /var/run/mesos/isolators/gpu/xxx and should be mounted to 
/usr/local/nvidia in the task's rootfs.

The GPU isolator will prepare all the files under 
`/var/run/mesos/isolators/gpu`, so Thermos does not have to worry about that 
part.
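
For illustration, a minimal Linux bind-mount sketch in Go (Thermos itself is 
Python; the driver directory name is taken from the paths in this ticket and 
the task rootfs path is a placeholder):

{code}
package main

import (
	"log"
	"os"
	"path/filepath"
	"syscall"
)

func main() {
	// Assumed paths from this ticket: the GPU isolator prepares the driver
	// volume under /run/mesos/isolators/gpu/<version>, and the task expects
	// it at /usr/local/nvidia inside its rootfs.
	src := "/run/mesos/isolators/gpu/nvidia_352.39"
	taskRootfs := "/path/to/task/rootfs" // placeholder for the task's rootfs
	dst := filepath.Join(taskRootfs, "usr/local/nvidia")

	if err := os.MkdirAll(dst, 0755); err != nil {
		log.Fatal(err)
	}
	// Bind mount the driver volume into the task's rootfs (a read-only
	// remount would typically follow; omitted here for brevity).
	if err := syscall.Mount(src, dst, "", syscall.MS_BIND, ""); err != nil {
		log.Fatal(err)
	}
}
{code}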

> GPU drivers are missing when using a Docker image
> -
>
> Key: AURORA-1763
> URL: https://issues.apache.org/jira/browse/AURORA-1763
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>
> When launching a GPU job that uses a Docker image and the unified 
> containerizer the Nvidia drivers are not correctly mounted. As an experiment 
> I launched a task using both mesos-execute and Aurora using the same Docker 
> image and ran nvidia-smi. During the experiment I noticed that the 
> /usr/local/nvidia folder was not being mounted properly. To confirm this was 
> the issue I tar'ed the drivers up (/run/mesos/isolators/gpu/nvidia_352.39) 
> and manually added it to the Docker image. When this was done the task was 
> able to launch correctly.
> Here is the resulting mountinfo for the mesos-execute task. Notice how 
> /usr/local/nvidia is mounted from the /mesos directory.
> {noformat}140 102 8:17 
> /mesos_work/provisioner/containers/11c497a2-a300-4c9e-a474-79aad1f28f11/backends/copy/rootfses/8ee046a6-bacb-42ff-b039-2cabda5d0e62
>  / rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
> 141 140 8:17 
> /mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/67025326-9dfd-4cbb-a008-454a40bce2f5-0009/executors/gpu-test/runs/11c497a2-a300-4c9e-a474-79aad1f28f11
>  /mnt/mesos/sandbox rw,relatime master:24 - ext4 /dev/sdb1 
> rw,errors=remount-ro,data=ordered
> 142 140 0:15 /mesos/isolators/gpu/nvidia_352.39 /usr/local/nvidia 
> rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
> 143 140 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
> 144 143 0:3 /sys /proc/sys ro,relatime - proc proc rw
> 145 140 0:14 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
> 146 140 0:38 / /dev rw,nosuid - tmpfs tmpfs rw,mode=755
> 147 146 0:39 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts 
> rw,mode=600,ptmxmode=666
> 148 146 0:40 / /dev/shm rw,nosuid,nodev - tmpfs tmpfs rw{noformat}
> Here is the mountinfo when using Aurora. Notice how /usr/local/nvidia is 
> missing.
> {noformat}72 71 8:1 / / rw,relatime master:1 - ext4 /dev/sda1 
> rw,errors=remount-ro,data=ordered
> 73 72 0:5 / /dev rw,relatime master:2 - devtmpfs udev 
> rw,size=10240k,nr_inodes=16521649,mode=755
> 74 73 0:11 / /dev/pts rw,nosuid,noexec,relatime master:3 - devpts devpts 
> rw,gid=5,mode=620,ptmxmode=000
> 75 73 0:17 / /dev/shm rw,nosuid,nodev master:4 - tmpfs tmpfs rw
> 76 73 0:13 / /dev/mqueue rw,relatime master:21 - mqueue mqueue rw
> 77 73 0:30 / /dev/hugepages rw,relatime master:23 - hugetlbfs hugetlbfs rw
> 78 72 0:15 / /run rw,nosuid,relatime master:5 - tmpfs tmpfs 
> rw,size=26438160k,mode=755
> 79 78 0:18 / /run/lock rw,nosuid,nodev,noexec,relatime master:6 - tmpfs tmpfs 
> rw,size=5120k
> 80 78 0:32 / /run/rpc_pipefs rw,relatime master:25 - rpc_pipefs rpc_pipefs rw
> 82 72 0:14 / /sys rw,nosuid,nodev,noexec,relatime master:7 - sysfs sysfs rw
> 83 82 0:16 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime master:8 - 
> securityfs securityfs rw
> 84 82 0:19 / /sys/fs/cgroup ro,nosuid,nodev,noexec master:9 - tmpfs tmpfs 
> ro,mode=755
> 85 84 0:20 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:10 
> - cgroup cgroup 
> rw,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd
> 86 84 0:22 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime master:13 
> - cgroup cgroup rw,cpuset
> 87 84 0:23 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime 
> master:14 - cgroup cgroup rw,cpu,cpuacct
> 88 84 0:24 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime master:15 
> - cgroup cgroup rw,devices
> 89 84 0:25 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime master:16 
> - cgroup cgroup rw,freezer
> 90 84 0:26 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime 
> master:17 - cgroup cgroup rw,net_cls,net_prio
> 91 84 0:27 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime master:18 - 
> cgroup cgroup rw,blkio
> 92 84 0:28 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime 
> master:19 - cgroup cgroup rw,perf_event
> 93 82 0:21 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime master:11 - 
> pstore pstore rw
> 94 82 0:6 / /sys/kernel/debug rw,relatime master:22 - debugfs debugfs rw
> 95 72 0:3 / /proc rw,nosuid,nodev,noexec,relatime master:12 - proc proc rw
> 96 95 0:29 / 

[jira] [Comment Edited] (AURORA-1763) GPU drivers are missing when using a Docker image

2016-09-06 Thread Justin Pinkul (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468170#comment-15468170
 ] 

Justin Pinkul edited comment on AURORA-1763 at 9/6/16 6:42 PM:
---

Including the GPU drivers in the Docker image could work as a temporary 
workaround but falls apart incredibly fast. The problem here is that the driver 
is specific to the GPU and, I think, even the kernel version. This means that 
as soon as we have more than one type of GPU this totally falls apart. In our 
particular use case we expect to hit this scenario in around a month.

Binding the GPU volume into the task's rootfs will work. However, as the 
executor is implemented right now, determining the location to mount from would 
be very fragile. Since the executor is not launched with a different rootfs, 
Mesos does not mount the drivers for the executor. This means that the executor 
would have to find the drivers itself, in this case from 
{{/run/mesos/isolators/gpu/nvidia_352.39}}. This sounds very fragile, 
especially since the path includes the driver version.

I think a better approach would be to set the rootfs for the executor so that 
Mesos mounts the drivers to {{/usr/local/nvidia}}. Then, when the executor 
starts the task, it can pass this mount on without duplicating the logic Mesos 
uses to find the drivers.


was (Author: jpinkul):
Including the GPU drivers in the Docker image could work as a temporary work 
around but falls apart incredibly fast. The problem here is that the driver is 
specific to the GPU and, I think, even the kernel version. This means that as 
soon as we have more than one type of GPU this totally falls part. In our 
particular use case we expect to hit this scenario in around a month.

Binding the GPU volume into the task's rootfs will work. However as the 
executor is implemented right now determining the location to mount from will 
be very fragile. Since the executor is not launched with a different rootfs 
Mesos does not mount the driver's for the executor. This means that the 
executor would have to find the drivers itself, in this case from 
{{/run/mesos/isolators/gpu/nvidia_352.39}}. This sounds very fragile, 
especially since the path includes the driver version.

I think a better approach would be to set the rootfs for the executor so Mesos 
mounts the drivers to {{/usr/loca/nvidia}}. Then when the executor starts the 
task it can pass this mount on without duplicating the logic Mesos uses to find 
the drivers.

> GPU drivers are missing when using a Docker image
> -
>
> Key: AURORA-1763
> URL: https://issues.apache.org/jira/browse/AURORA-1763
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>
> When launching a GPU job that uses a Docker image and the unified 
> containerizer the Nvidia drivers are not correctly mounted. As an experiment 
> I launched a task using both mesos-execute and Aurora using the same Docker 
> image and ran nvidia-smi. During the experiment I noticed that the 
> /usr/local/nvidia folder was not being mounted properly. To confirm this was 
> the issue I tar'ed the drivers up (/run/mesos/isolators/gpu/nvidia_352.39) 
> and manually added it to the Docker image. When this was done the task was 
> able to launch correctly.
> Here is the resulting mountinfo for the mesos-execute task. Notice how 
> /usr/local/nvidia is mounted from the /mesos directory.
> {noformat}140 102 8:17 
> /mesos_work/provisioner/containers/11c497a2-a300-4c9e-a474-79aad1f28f11/backends/copy/rootfses/8ee046a6-bacb-42ff-b039-2cabda5d0e62
>  / rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
> 141 140 8:17 
> /mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/67025326-9dfd-4cbb-a008-454a40bce2f5-0009/executors/gpu-test/runs/11c497a2-a300-4c9e-a474-79aad1f28f11
>  /mnt/mesos/sandbox rw,relatime master:24 - ext4 /dev/sdb1 
> rw,errors=remount-ro,data=ordered
> 142 140 0:15 /mesos/isolators/gpu/nvidia_352.39 /usr/local/nvidia 
> rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
> 143 140 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
> 144 143 0:3 /sys /proc/sys ro,relatime - proc proc rw
> 145 140 0:14 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
> 146 140 0:38 / /dev rw,nosuid - tmpfs tmpfs rw,mode=755
> 147 146 0:39 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts 
> rw,mode=600,ptmxmode=666
> 148 146 0:40 / /dev/shm rw,nosuid,nodev - tmpfs tmpfs rw{noformat}
> Here is the mountinfo when using Aurora. Notice how /usr/local/nvidia is 
> missing.
> {noformat}72 71 8:1 / / rw,relatime master:1 - ext4 /dev/sda1 
> rw,errors=remount-ro,data=ordered
> 73 72 0:5 / /dev rw,relatime master:2 - devtmpfs udev 
> 

[jira] [Commented] (AURORA-1763) GPU drivers are missing when using a Docker image

2016-09-06 Thread Justin Pinkul (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468170#comment-15468170
 ] 

Justin Pinkul commented on AURORA-1763:
---

Including the GPU drivers in the Docker image could work as a temporary 
workaround but falls apart incredibly fast. The problem here is that the driver 
is specific to the GPU and, I think, even the kernel version. This means that 
as soon as we have more than one type of GPU this totally falls apart. In our 
particular use case we expect to hit this scenario in around a month.

Binding the GPU volume into the task's rootfs will work. However, as the 
executor is implemented right now, determining the location to mount from would 
be very fragile. Since the executor is not launched with a different rootfs, 
Mesos does not mount the drivers for the executor. This means that the executor 
would have to find the drivers itself, in this case from 
{{/run/mesos/isolators/gpu/nvidia_352.39}}. This sounds very fragile, 
especially since the path includes the driver version.

I think a better approach would be to set the rootfs for the executor so that 
Mesos mounts the drivers to {{/usr/local/nvidia}}. Then, when the executor 
starts the task, it can pass this mount on without duplicating the logic Mesos 
uses to find the drivers.

> GPU drivers are missing when using a Docker image
> -
>
> Key: AURORA-1763
> URL: https://issues.apache.org/jira/browse/AURORA-1763
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>
> When launching a GPU job that uses a Docker image and the unified 
> containerizer the Nvidia drivers are not correctly mounted. As an experiment 
> I launched a task using both mesos-execute and Aurora using the same Docker 
> image and ran nvidia-smi. During the experiment I noticed that the 
> /usr/local/nvidia folder was not being mounted properly. To confirm this was 
> the issue I tar'ed the drivers up (/run/mesos/isolators/gpu/nvidia_352.39) 
> and manually added it to the Docker image. When this was done the task was 
> able to launch correctly.
> Here is the resulting mountinfo for the mesos-execute task. Notice how 
> /usr/local/nvidia is mounted from the /mesos directory.
> {noformat}140 102 8:17 
> /mesos_work/provisioner/containers/11c497a2-a300-4c9e-a474-79aad1f28f11/backends/copy/rootfses/8ee046a6-bacb-42ff-b039-2cabda5d0e62
>  / rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
> 141 140 8:17 
> /mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/67025326-9dfd-4cbb-a008-454a40bce2f5-0009/executors/gpu-test/runs/11c497a2-a300-4c9e-a474-79aad1f28f11
>  /mnt/mesos/sandbox rw,relatime master:24 - ext4 /dev/sdb1 
> rw,errors=remount-ro,data=ordered
> 142 140 0:15 /mesos/isolators/gpu/nvidia_352.39 /usr/local/nvidia 
> rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
> 143 140 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
> 144 143 0:3 /sys /proc/sys ro,relatime - proc proc rw
> 145 140 0:14 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
> 146 140 0:38 / /dev rw,nosuid - tmpfs tmpfs rw,mode=755
> 147 146 0:39 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts 
> rw,mode=600,ptmxmode=666
> 148 146 0:40 / /dev/shm rw,nosuid,nodev - tmpfs tmpfs rw{noformat}
> Here is the mountinfo when using Aurora. Notice how /usr/local/nvidia is 
> missing.
> {noformat}72 71 8:1 / / rw,relatime master:1 - ext4 /dev/sda1 
> rw,errors=remount-ro,data=ordered
> 73 72 0:5 / /dev rw,relatime master:2 - devtmpfs udev 
> rw,size=10240k,nr_inodes=16521649,mode=755
> 74 73 0:11 / /dev/pts rw,nosuid,noexec,relatime master:3 - devpts devpts 
> rw,gid=5,mode=620,ptmxmode=000
> 75 73 0:17 / /dev/shm rw,nosuid,nodev master:4 - tmpfs tmpfs rw
> 76 73 0:13 / /dev/mqueue rw,relatime master:21 - mqueue mqueue rw
> 77 73 0:30 / /dev/hugepages rw,relatime master:23 - hugetlbfs hugetlbfs rw
> 78 72 0:15 / /run rw,nosuid,relatime master:5 - tmpfs tmpfs 
> rw,size=26438160k,mode=755
> 79 78 0:18 / /run/lock rw,nosuid,nodev,noexec,relatime master:6 - tmpfs tmpfs 
> rw,size=5120k
> 80 78 0:32 / /run/rpc_pipefs rw,relatime master:25 - rpc_pipefs rpc_pipefs rw
> 82 72 0:14 / /sys rw,nosuid,nodev,noexec,relatime master:7 - sysfs sysfs rw
> 83 82 0:16 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime master:8 - 
> securityfs securityfs rw
> 84 82 0:19 / /sys/fs/cgroup ro,nosuid,nodev,noexec master:9 - tmpfs tmpfs 
> ro,mode=755
> 85 84 0:20 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:10 
> - cgroup cgroup 
> rw,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd
> 86 84 0:22 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime master:13 
> - cgroup cgroup rw,cpuset
> 87 84 0:23 / 

[jira] [Commented] (AURORA-1763) GPU drivers are missing when using a Docker image

2016-09-06 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468146#comment-15468146
 ] 

Joshua Cohen commented on AURORA-1763:
--

Where are they mounted from?

> GPU drivers are missing when using a Docker image
> -
>
> Key: AURORA-1763
> URL: https://issues.apache.org/jira/browse/AURORA-1763
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>
> When launching a GPU job that uses a Docker image and the unified 
> containerizer the Nvidia drivers are not correctly mounted. As an experiment 
> I launched a task using both mesos-execute and Aurora using the same Docker 
> image and ran nvidia-smi. During the experiment I noticed that the 
> /usr/local/nvidia folder was not being mounted properly. To confirm this was 
> the issue I tar'ed the drivers up (/run/mesos/isolators/gpu/nvidia_352.39) 
> and manually added it to the Docker image. When this was done the task was 
> able to launch correctly.
> Here is the resulting mountinfo for the mesos-execute task. Notice how 
> /usr/local/nvidia is mounted from the /mesos directory.
> {noformat}140 102 8:17 
> /mesos_work/provisioner/containers/11c497a2-a300-4c9e-a474-79aad1f28f11/backends/copy/rootfses/8ee046a6-bacb-42ff-b039-2cabda5d0e62
>  / rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
> 141 140 8:17 
> /mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/67025326-9dfd-4cbb-a008-454a40bce2f5-0009/executors/gpu-test/runs/11c497a2-a300-4c9e-a474-79aad1f28f11
>  /mnt/mesos/sandbox rw,relatime master:24 - ext4 /dev/sdb1 
> rw,errors=remount-ro,data=ordered
> 142 140 0:15 /mesos/isolators/gpu/nvidia_352.39 /usr/local/nvidia 
> rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
> 143 140 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
> 144 143 0:3 /sys /proc/sys ro,relatime - proc proc rw
> 145 140 0:14 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
> 146 140 0:38 / /dev rw,nosuid - tmpfs tmpfs rw,mode=755
> 147 146 0:39 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts 
> rw,mode=600,ptmxmode=666
> 148 146 0:40 / /dev/shm rw,nosuid,nodev - tmpfs tmpfs rw{noformat}
> Here is the mountinfo when using Aurora. Notice how /usr/local/nvidia is 
> missing.
> {noformat}72 71 8:1 / / rw,relatime master:1 - ext4 /dev/sda1 
> rw,errors=remount-ro,data=ordered
> 73 72 0:5 / /dev rw,relatime master:2 - devtmpfs udev 
> rw,size=10240k,nr_inodes=16521649,mode=755
> 74 73 0:11 / /dev/pts rw,nosuid,noexec,relatime master:3 - devpts devpts 
> rw,gid=5,mode=620,ptmxmode=000
> 75 73 0:17 / /dev/shm rw,nosuid,nodev master:4 - tmpfs tmpfs rw
> 76 73 0:13 / /dev/mqueue rw,relatime master:21 - mqueue mqueue rw
> 77 73 0:30 / /dev/hugepages rw,relatime master:23 - hugetlbfs hugetlbfs rw
> 78 72 0:15 / /run rw,nosuid,relatime master:5 - tmpfs tmpfs 
> rw,size=26438160k,mode=755
> 79 78 0:18 / /run/lock rw,nosuid,nodev,noexec,relatime master:6 - tmpfs tmpfs 
> rw,size=5120k
> 80 78 0:32 / /run/rpc_pipefs rw,relatime master:25 - rpc_pipefs rpc_pipefs rw
> 82 72 0:14 / /sys rw,nosuid,nodev,noexec,relatime master:7 - sysfs sysfs rw
> 83 82 0:16 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime master:8 - 
> securityfs securityfs rw
> 84 82 0:19 / /sys/fs/cgroup ro,nosuid,nodev,noexec master:9 - tmpfs tmpfs 
> ro,mode=755
> 85 84 0:20 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:10 
> - cgroup cgroup 
> rw,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd
> 86 84 0:22 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime master:13 
> - cgroup cgroup rw,cpuset
> 87 84 0:23 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime 
> master:14 - cgroup cgroup rw,cpu,cpuacct
> 88 84 0:24 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime master:15 
> - cgroup cgroup rw,devices
> 89 84 0:25 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime master:16 
> - cgroup cgroup rw,freezer
> 90 84 0:26 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime 
> master:17 - cgroup cgroup rw,net_cls,net_prio
> 91 84 0:27 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime master:18 - 
> cgroup cgroup rw,blkio
> 92 84 0:28 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime 
> master:19 - cgroup cgroup rw,perf_event
> 93 82 0:21 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime master:11 - 
> pstore pstore rw
> 94 82 0:6 / /sys/kernel/debug rw,relatime master:22 - debugfs debugfs rw
> 95 72 0:3 / /proc rw,nosuid,nodev,noexec,relatime master:12 - proc proc rw
> 96 95 0:29 / /proc/sys/fs/binfmt_misc rw,relatime master:20 - autofs 
> systemd-1 rw,fd=22,pgrp=1,timeout=300,minproto=5,maxproto=5,direct
> 97 96 0:34 / /proc/sys/fs/binfmt_misc rw,relatime master:27 - 

[jira] [Commented] (AURORA-1763) GPU drivers are missing when using a Docker image

2016-09-06 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468142#comment-15468142
 ] 

Jie Yu commented on AURORA-1763:


Short term, can thermos bind mount gpu volumes into task's rootfs? 

> GPU drivers are missing when using a Docker image
> -
>
> Key: AURORA-1763
> URL: https://issues.apache.org/jira/browse/AURORA-1763
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>
> When launching a GPU job that uses a Docker image and the unified 
> containerizer the Nvidia drivers are not correctly mounted. As an experiment 
> I launched a task using both mesos-execute and Aurora using the same Docker 
> image and ran nvidia-smi. During the experiment I noticed that the 
> /usr/local/nvidia folder was not being mounted properly. To confirm this was 
> the issue I tar'ed the drivers up (/run/mesos/isolators/gpu/nvidia_352.39) 
> and manually added it to the Docker image. When this was done the task was 
> able to launch correctly.
> Here is the resulting mountinfo for the mesos-execute task. Notice how 
> /usr/local/nvidia is mounted from the /mesos directory.
> {noformat}140 102 8:17 
> /mesos_work/provisioner/containers/11c497a2-a300-4c9e-a474-79aad1f28f11/backends/copy/rootfses/8ee046a6-bacb-42ff-b039-2cabda5d0e62
>  / rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
> 141 140 8:17 
> /mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/67025326-9dfd-4cbb-a008-454a40bce2f5-0009/executors/gpu-test/runs/11c497a2-a300-4c9e-a474-79aad1f28f11
>  /mnt/mesos/sandbox rw,relatime master:24 - ext4 /dev/sdb1 
> rw,errors=remount-ro,data=ordered
> 142 140 0:15 /mesos/isolators/gpu/nvidia_352.39 /usr/local/nvidia 
> rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
> 143 140 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
> 144 143 0:3 /sys /proc/sys ro,relatime - proc proc rw
> 145 140 0:14 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
> 146 140 0:38 / /dev rw,nosuid - tmpfs tmpfs rw,mode=755
> 147 146 0:39 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts 
> rw,mode=600,ptmxmode=666
> 148 146 0:40 / /dev/shm rw,nosuid,nodev - tmpfs tmpfs rw{noformat}
> Here is the mountinfo when using Aurora. Notice how /usr/local/nvidia is 
> missing.
> {noformat}72 71 8:1 / / rw,relatime master:1 - ext4 /dev/sda1 
> rw,errors=remount-ro,data=ordered
> 73 72 0:5 / /dev rw,relatime master:2 - devtmpfs udev 
> rw,size=10240k,nr_inodes=16521649,mode=755
> 74 73 0:11 / /dev/pts rw,nosuid,noexec,relatime master:3 - devpts devpts 
> rw,gid=5,mode=620,ptmxmode=000
> 75 73 0:17 / /dev/shm rw,nosuid,nodev master:4 - tmpfs tmpfs rw
> 76 73 0:13 / /dev/mqueue rw,relatime master:21 - mqueue mqueue rw
> 77 73 0:30 / /dev/hugepages rw,relatime master:23 - hugetlbfs hugetlbfs rw
> 78 72 0:15 / /run rw,nosuid,relatime master:5 - tmpfs tmpfs 
> rw,size=26438160k,mode=755
> 79 78 0:18 / /run/lock rw,nosuid,nodev,noexec,relatime master:6 - tmpfs tmpfs 
> rw,size=5120k
> 80 78 0:32 / /run/rpc_pipefs rw,relatime master:25 - rpc_pipefs rpc_pipefs rw
> 82 72 0:14 / /sys rw,nosuid,nodev,noexec,relatime master:7 - sysfs sysfs rw
> 83 82 0:16 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime master:8 - 
> securityfs securityfs rw
> 84 82 0:19 / /sys/fs/cgroup ro,nosuid,nodev,noexec master:9 - tmpfs tmpfs 
> ro,mode=755
> 85 84 0:20 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:10 
> - cgroup cgroup 
> rw,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd
> 86 84 0:22 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime master:13 
> - cgroup cgroup rw,cpuset
> 87 84 0:23 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime 
> master:14 - cgroup cgroup rw,cpu,cpuacct
> 88 84 0:24 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime master:15 
> - cgroup cgroup rw,devices
> 89 84 0:25 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime master:16 
> - cgroup cgroup rw,freezer
> 90 84 0:26 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime 
> master:17 - cgroup cgroup rw,net_cls,net_prio
> 91 84 0:27 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime master:18 - 
> cgroup cgroup rw,blkio
> 92 84 0:28 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime 
> master:19 - cgroup cgroup rw,perf_event
> 93 82 0:21 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime master:11 - 
> pstore pstore rw
> 94 82 0:6 / /sys/kernel/debug rw,relatime master:22 - debugfs debugfs rw
> 95 72 0:3 / /proc rw,nosuid,nodev,noexec,relatime master:12 - proc proc rw
> 96 95 0:29 / /proc/sys/fs/binfmt_misc rw,relatime master:20 - autofs 
> systemd-1 rw,fd=22,pgrp=1,timeout=300,minproto=5,maxproto=5,direct
> 97 96 0:34 / /proc/sys/fs/binfmt_misc 

[jira] [Commented] (AURORA-1763) GPU drivers are missing when using a Docker image

2016-09-06 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468115#comment-15468115
 ] 

Joshua Cohen commented on AURORA-1763:
--

[~jpinkul] I think that, in the meantime, the only solution is to explicitly 
include the GPU drivers in your Docker image if you'd like to use the unified 
containerizer. Is that correct, [~jieyu]?

> GPU drivers are missing when using a Docker image
> -
>
> Key: AURORA-1763
> URL: https://issues.apache.org/jira/browse/AURORA-1763
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>
> When launching a GPU job that uses a Docker image and the unified 
> containerizer the Nvidia drivers are not correctly mounted. As an experiment 
> I launched a task using both mesos-execute and Aurora using the same Docker 
> image and ran nvidia-smi. During the experiment I noticed that the 
> /usr/local/nvidia folder was not being mounted properly. To confirm this was 
> the issue I tar'ed the drivers up (/run/mesos/isolators/gpu/nvidia_352.39) 
> and manually added it to the Docker image. When this was done the task was 
> able to launch correctly.
> Here is the resulting mountinfo for the mesos-execute task. Notice how 
> /usr/local/nvidia is mounted from the /mesos directory.
> {noformat}140 102 8:17 
> /mesos_work/provisioner/containers/11c497a2-a300-4c9e-a474-79aad1f28f11/backends/copy/rootfses/8ee046a6-bacb-42ff-b039-2cabda5d0e62
>  / rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
> 141 140 8:17 
> /mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/67025326-9dfd-4cbb-a008-454a40bce2f5-0009/executors/gpu-test/runs/11c497a2-a300-4c9e-a474-79aad1f28f11
>  /mnt/mesos/sandbox rw,relatime master:24 - ext4 /dev/sdb1 
> rw,errors=remount-ro,data=ordered
> 142 140 0:15 /mesos/isolators/gpu/nvidia_352.39 /usr/local/nvidia 
> rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
> 143 140 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
> 144 143 0:3 /sys /proc/sys ro,relatime - proc proc rw
> 145 140 0:14 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
> 146 140 0:38 / /dev rw,nosuid - tmpfs tmpfs rw,mode=755
> 147 146 0:39 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts 
> rw,mode=600,ptmxmode=666
> 148 146 0:40 / /dev/shm rw,nosuid,nodev - tmpfs tmpfs rw{noformat}
> Here is the mountinfo when using Aurora. Notice how /usr/local/nvidia is 
> missing.
> {noformat}72 71 8:1 / / rw,relatime master:1 - ext4 /dev/sda1 
> rw,errors=remount-ro,data=ordered
> 73 72 0:5 / /dev rw,relatime master:2 - devtmpfs udev 
> rw,size=10240k,nr_inodes=16521649,mode=755
> 74 73 0:11 / /dev/pts rw,nosuid,noexec,relatime master:3 - devpts devpts 
> rw,gid=5,mode=620,ptmxmode=000
> 75 73 0:17 / /dev/shm rw,nosuid,nodev master:4 - tmpfs tmpfs rw
> 76 73 0:13 / /dev/mqueue rw,relatime master:21 - mqueue mqueue rw
> 77 73 0:30 / /dev/hugepages rw,relatime master:23 - hugetlbfs hugetlbfs rw
> 78 72 0:15 / /run rw,nosuid,relatime master:5 - tmpfs tmpfs 
> rw,size=26438160k,mode=755
> 79 78 0:18 / /run/lock rw,nosuid,nodev,noexec,relatime master:6 - tmpfs tmpfs 
> rw,size=5120k
> 80 78 0:32 / /run/rpc_pipefs rw,relatime master:25 - rpc_pipefs rpc_pipefs rw
> 82 72 0:14 / /sys rw,nosuid,nodev,noexec,relatime master:7 - sysfs sysfs rw
> 83 82 0:16 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime master:8 - 
> securityfs securityfs rw
> 84 82 0:19 / /sys/fs/cgroup ro,nosuid,nodev,noexec master:9 - tmpfs tmpfs 
> ro,mode=755
> 85 84 0:20 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:10 
> - cgroup cgroup 
> rw,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd
> 86 84 0:22 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime master:13 
> - cgroup cgroup rw,cpuset
> 87 84 0:23 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime 
> master:14 - cgroup cgroup rw,cpu,cpuacct
> 88 84 0:24 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime master:15 
> - cgroup cgroup rw,devices
> 89 84 0:25 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime master:16 
> - cgroup cgroup rw,freezer
> 90 84 0:26 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime 
> master:17 - cgroup cgroup rw,net_cls,net_prio
> 91 84 0:27 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime master:18 - 
> cgroup cgroup rw,blkio
> 92 84 0:28 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime 
> master:19 - cgroup cgroup rw,perf_event
> 93 82 0:21 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime master:11 - 
> pstore pstore rw
> 94 82 0:6 / /sys/kernel/debug rw,relatime master:22 - debugfs debugfs rw
> 95 72 0:3 / /proc rw,nosuid,nodev,noexec,relatime master:12 - proc proc rw
> 96 95 0:29 / /proc/sys/fs/binfmt_misc 

[jira] [Commented] (AURORA-1763) GPU drivers are missing when using a Docker image

2016-09-06 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468116#comment-15468116
 ] 

Zameer Manji commented on AURORA-1763:
--

[~jieyu]: I agree Pods are the future here. We can drop our dependency on 
{{mesos-containerizer}} and use the appropriate primitives. However, what are 
we supposed to do in the short term?

> GPU drivers are missing when using a Docker image
> -
>
> Key: AURORA-1763
> URL: https://issues.apache.org/jira/browse/AURORA-1763
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>
> When launching a GPU job that uses a Docker image and the unified 
> containerizer the Nvidia drivers are not correctly mounted. As an experiment 
> I launched a task using both mesos-execute and Aurora using the same Docker 
> image and ran nvidia-smi. During the experiment I noticed that the 
> /usr/local/nvidia folder was not being mounted properly. To confirm this was 
> the issue I tar'ed the drivers up (/run/mesos/isolators/gpu/nvidia_352.39) 
> and manually added it to the Docker image. When this was done the task was 
> able to launch correctly.
> Here is the resulting mountinfo for the mesos-execute task. Notice how 
> /usr/local/nvidia is mounted from the /mesos directory.
> {noformat}140 102 8:17 
> /mesos_work/provisioner/containers/11c497a2-a300-4c9e-a474-79aad1f28f11/backends/copy/rootfses/8ee046a6-bacb-42ff-b039-2cabda5d0e62
>  / rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
> 141 140 8:17 
> /mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/67025326-9dfd-4cbb-a008-454a40bce2f5-0009/executors/gpu-test/runs/11c497a2-a300-4c9e-a474-79aad1f28f11
>  /mnt/mesos/sandbox rw,relatime master:24 - ext4 /dev/sdb1 
> rw,errors=remount-ro,data=ordered
> 142 140 0:15 /mesos/isolators/gpu/nvidia_352.39 /usr/local/nvidia 
> rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
> 143 140 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
> 144 143 0:3 /sys /proc/sys ro,relatime - proc proc rw
> 145 140 0:14 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
> 146 140 0:38 / /dev rw,nosuid - tmpfs tmpfs rw,mode=755
> 147 146 0:39 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts 
> rw,mode=600,ptmxmode=666
> 148 146 0:40 / /dev/shm rw,nosuid,nodev - tmpfs tmpfs rw{noformat}
> Here is the mountinfo when using Aurora. Notice how /usr/local/nvidia is 
> missing.
> {noformat}72 71 8:1 / / rw,relatime master:1 - ext4 /dev/sda1 
> rw,errors=remount-ro,data=ordered
> 73 72 0:5 / /dev rw,relatime master:2 - devtmpfs udev 
> rw,size=10240k,nr_inodes=16521649,mode=755
> 74 73 0:11 / /dev/pts rw,nosuid,noexec,relatime master:3 - devpts devpts 
> rw,gid=5,mode=620,ptmxmode=000
> 75 73 0:17 / /dev/shm rw,nosuid,nodev master:4 - tmpfs tmpfs rw
> 76 73 0:13 / /dev/mqueue rw,relatime master:21 - mqueue mqueue rw
> 77 73 0:30 / /dev/hugepages rw,relatime master:23 - hugetlbfs hugetlbfs rw
> 78 72 0:15 / /run rw,nosuid,relatime master:5 - tmpfs tmpfs 
> rw,size=26438160k,mode=755
> 79 78 0:18 / /run/lock rw,nosuid,nodev,noexec,relatime master:6 - tmpfs tmpfs 
> rw,size=5120k
> 80 78 0:32 / /run/rpc_pipefs rw,relatime master:25 - rpc_pipefs rpc_pipefs rw
> 82 72 0:14 / /sys rw,nosuid,nodev,noexec,relatime master:7 - sysfs sysfs rw
> 83 82 0:16 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime master:8 - 
> securityfs securityfs rw
> 84 82 0:19 / /sys/fs/cgroup ro,nosuid,nodev,noexec master:9 - tmpfs tmpfs 
> ro,mode=755
> 85 84 0:20 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:10 
> - cgroup cgroup 
> rw,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd
> 86 84 0:22 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime master:13 
> - cgroup cgroup rw,cpuset
> 87 84 0:23 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime 
> master:14 - cgroup cgroup rw,cpu,cpuacct
> 88 84 0:24 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime master:15 
> - cgroup cgroup rw,devices
> 89 84 0:25 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime master:16 
> - cgroup cgroup rw,freezer
> 90 84 0:26 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime 
> master:17 - cgroup cgroup rw,net_cls,net_prio
> 91 84 0:27 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime master:18 - 
> cgroup cgroup rw,blkio
> 92 84 0:28 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime 
> master:19 - cgroup cgroup rw,perf_event
> 93 82 0:21 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime master:11 - 
> pstore pstore rw
> 94 82 0:6 / /sys/kernel/debug rw,relatime master:22 - debugfs debugfs rw
> 95 72 0:3 / /proc rw,nosuid,nodev,noexec,relatime master:12 - proc proc rw
> 96 95 0:29 / /proc/sys/fs/binfmt_misc 

[jira] [Reopened] (AURORA-1223) Modify scheduler updater to not use "watch_secs" for health-check enabled jobs

2016-09-06 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang reopened AURORA-1223:
---

Found a watch_secs constraint to relax on the scheduler side.

> Modify scheduler updater to not use "watch_secs" for health-check enabled jobs
> --
>
> Key: AURORA-1223
> URL: https://issues.apache.org/jira/browse/AURORA-1223
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Maxim Khutornenko
>Assignee: Kai Huang
>
> When health checks are enabled in a job config, scheduler updater should 
> ignore "watch_secs" UpdateConfig value. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (AURORA-1223) Modify scheduler updater to not use "watch_secs" for health-check enabled jobs

2016-09-06 Thread Kai Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468067#comment-15468067
 ] 

Kai Huang edited comment on AURORA-1223 at 9/6/16 6:19 PM:
---

After discussion on the Aurora dev list, we decided to keep the watch_secs 
infrastructure on the scheduler side.

Our final conclusion is to adopt the following implementation:

If users want purely health-check-driven updates, they can set watch_secs to 0 
and enable health checks.

If they want both health-check-driven and time-driven updates, they can set 
watch_secs to the time they care about and also run health checks in the 
STARTING state.

If they just want time-driven updates, they can disable health checking and set 
watch_secs to the time they care about.

Only one scheduler change is required: the scheduler currently does not accept 
a zero value for watch_secs, and we need to relax this constraint.
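
For illustration, a minimal sketch of the purely health-check-driven case using 
the Aurora DSL's UpdateConfig and HealthCheckConfig. The cluster, role, job 
name, task reference, and check intervals below are hypothetical, and it 
assumes the relaxed scheduler constraint so that watch_secs=0 is accepted:

{code}
# Hypothetical .aurora snippet: purely health-check-driven updates.
# Assumes watch_secs=0 is accepted once the scheduler constraint is relaxed.
update_config = UpdateConfig(
  batch_size = 1,
  watch_secs = 0,   # no fixed watch window; rely on health checks instead
  max_per_shard_failures = 1
)

health_check_config = HealthCheckConfig(
  initial_interval_secs = 15,
  interval_secs = 10,
  max_consecutive_failures = 3
)

jobs = [
  Service(
    cluster = 'devcluster',   # hypothetical cluster/role/env/name
    role = 'www-data',
    environment = 'prod',
    name = 'hello',
    task = hello_task,        # assumed to be defined elsewhere in the config
    update_config = update_config,
    health_check_config = health_check_config
  )
]
{code}

With this shape, an updated instance would be considered healthy as soon as its 
health checks pass, rather than after a fixed watch_secs window.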


was (Author: kaih):
After discussion on the Aurora dev list, it turns out there will be no 
scheduler-side change associated with this issue.

Our final conclusion is to adopt the following implementation:

If users want purely health-check-driven updates, they can set watch_secs to 0 
and enable health checks. (watch_secs=0 is not allowed on the client side; we 
will relax this constraint after we modify the executor. However, no scheduler 
change is required since the scheduler allows non-negative values for 
watch_secs.)

If they want both health-check-driven and time-driven updates, they can set 
watch_secs to the time they care about and also run health checks in the 
STARTING state.

If they just want time-driven updates, they can disable health checking and set 
watch_secs to the time they care about.

> Modify scheduler updater to not use "watch_secs" for health-check enabled jobs
> --
>
> Key: AURORA-1223
> URL: https://issues.apache.org/jira/browse/AURORA-1223
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Maxim Khutornenko
>Assignee: Kai Huang
>
> When health checks are enabled in a job config, scheduler updater should 
> ignore "watch_secs" UpdateConfig value. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1763) GPU drivers are missing when using a Docker image

2016-09-06 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468110#comment-15468110
 ] 

Jie Yu commented on AURORA-1763:


This is expected. I think the ultimate solution for Aurora is to use the 
upcoming nested container primitive to create nested containers. The GPU 
isolator will be made nesting-aware so it will prepare the rootfs for the 
nested container as well. The Mesos design doc is here:
https://docs.google.com/document/d/1FtcyQkDfGp-bPHTW4pUoqQCgVlPde936bo-IIENO_ho/edit

Expect nested container support to land soon (in one or two months).

> GPU drivers are missing when using a Docker image
> -
>
> Key: AURORA-1763
> URL: https://issues.apache.org/jira/browse/AURORA-1763
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>
> When launching a GPU job that uses a Docker image and the unified 
> containerizer the Nvidia drivers are not correctly mounted. As an experiment 
> I launched a task using both mesos-execute and Aurora using the same Docker 
> image and ran nvidia-smi. During the experiment I noticed that the 
> /usr/local/nvidia folder was not being mounted properly. To confirm this was 
> the issue I tar'ed the drivers up (/run/mesos/isolators/gpu/nvidia_352.39) 
> and manually added it to the Docker image. When this was done the task was 
> able to launch correctly.
> Here is the resulting mountinfo for the mesos-execute task. Notice how 
> /usr/local/nvidia is mounted from the /mesos directory.
> {noformat}140 102 8:17 
> /mesos_work/provisioner/containers/11c497a2-a300-4c9e-a474-79aad1f28f11/backends/copy/rootfses/8ee046a6-bacb-42ff-b039-2cabda5d0e62
>  / rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
> 141 140 8:17 
> /mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/67025326-9dfd-4cbb-a008-454a40bce2f5-0009/executors/gpu-test/runs/11c497a2-a300-4c9e-a474-79aad1f28f11
>  /mnt/mesos/sandbox rw,relatime master:24 - ext4 /dev/sdb1 
> rw,errors=remount-ro,data=ordered
> 142 140 0:15 /mesos/isolators/gpu/nvidia_352.39 /usr/local/nvidia 
> rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
> 143 140 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
> 144 143 0:3 /sys /proc/sys ro,relatime - proc proc rw
> 145 140 0:14 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
> 146 140 0:38 / /dev rw,nosuid - tmpfs tmpfs rw,mode=755
> 147 146 0:39 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts 
> rw,mode=600,ptmxmode=666
> 148 146 0:40 / /dev/shm rw,nosuid,nodev - tmpfs tmpfs rw{noformat}
> Here is the mountinfo when using Aurora. Notice how /usr/local/nvidia is 
> missing.
> {noformat}72 71 8:1 / / rw,relatime master:1 - ext4 /dev/sda1 
> rw,errors=remount-ro,data=ordered
> 73 72 0:5 / /dev rw,relatime master:2 - devtmpfs udev 
> rw,size=10240k,nr_inodes=16521649,mode=755
> 74 73 0:11 / /dev/pts rw,nosuid,noexec,relatime master:3 - devpts devpts 
> rw,gid=5,mode=620,ptmxmode=000
> 75 73 0:17 / /dev/shm rw,nosuid,nodev master:4 - tmpfs tmpfs rw
> 76 73 0:13 / /dev/mqueue rw,relatime master:21 - mqueue mqueue rw
> 77 73 0:30 / /dev/hugepages rw,relatime master:23 - hugetlbfs hugetlbfs rw
> 78 72 0:15 / /run rw,nosuid,relatime master:5 - tmpfs tmpfs 
> rw,size=26438160k,mode=755
> 79 78 0:18 / /run/lock rw,nosuid,nodev,noexec,relatime master:6 - tmpfs tmpfs 
> rw,size=5120k
> 80 78 0:32 / /run/rpc_pipefs rw,relatime master:25 - rpc_pipefs rpc_pipefs rw
> 82 72 0:14 / /sys rw,nosuid,nodev,noexec,relatime master:7 - sysfs sysfs rw
> 83 82 0:16 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime master:8 - 
> securityfs securityfs rw
> 84 82 0:19 / /sys/fs/cgroup ro,nosuid,nodev,noexec master:9 - tmpfs tmpfs 
> ro,mode=755
> 85 84 0:20 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:10 
> - cgroup cgroup 
> rw,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd
> 86 84 0:22 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime master:13 
> - cgroup cgroup rw,cpuset
> 87 84 0:23 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime 
> master:14 - cgroup cgroup rw,cpu,cpuacct
> 88 84 0:24 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime master:15 
> - cgroup cgroup rw,devices
> 89 84 0:25 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime master:16 
> - cgroup cgroup rw,freezer
> 90 84 0:26 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime 
> master:17 - cgroup cgroup rw,net_cls,net_prio
> 91 84 0:27 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime master:18 - 
> cgroup cgroup rw,blkio
> 92 84 0:28 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime 
> master:19 - cgroup cgroup rw,perf_event
> 93 82 0:21 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime master:11 - 

[jira] [Commented] (AURORA-1763) GPU drivers are missing when using a Docker image

2016-09-06 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15467903#comment-15467903
 ] 

Joshua Cohen commented on AURORA-1763:
--

Thanks for the context. I'm wondering if this is what's causing it:

{code:title=https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/isolators/gpu/isolator.cpp#L288-L290}
  if (!containerConfig.has_rootfs()) {
 return None();
  }
{code}

Aurora does not configure tasks with a {{ContainerInfo}} that has an {{Image}} 
set. Instead, it configures the task's filesystem as a {{Volume}} with an 
{{Image}} set. The executor then uses {{mesos-containerizer launch ...}} to 
pivot/chroot into that filesystem. To me it looks like the code above relies on 
the container itself having a rootfs, which currently won't be the case, given 
the way we isolate task filesystems.
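
To make the distinction concrete, here is a rough sketch of the two container 
shapes as Python dicts mirroring the Mesos {{ContainerInfo}}/{{Volume}} 
protobuf fields; the image name and mount point are hypothetical, not taken 
from this ticket:

{code}
# Rough illustration only; field names follow the Mesos ContainerInfo/Volume
# protobufs, values are made up for this sketch.

# mesos-execute style: the image is set on the container itself, so the
# provisioner gives the container a rootfs and the GPU isolator's
# has_rootfs() check passes, injecting the /usr/local/nvidia volume.
container_with_rootfs = {
    "type": "MESOS",
    "mesos": {
        "image": {"type": "DOCKER", "docker": {"name": "example/gpu-image"}}
    }
}

# Aurora style: no image on the container; the image is attached as a volume
# that the executor later pivots/chroots into, so has_rootfs() is false and
# the GPU isolator skips the nvidia mount.
container_with_image_volume = {
    "type": "MESOS",
    "volumes": [{
        "container_path": "taskfs",   # hypothetical mount point
        "mode": "RW",
        "image": {"type": "DOCKER", "docker": {"name": "example/gpu-image"}}
    }]
}
{code}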

> GPU drivers are missing when using a Docker image
> -
>
> Key: AURORA-1763
> URL: https://issues.apache.org/jira/browse/AURORA-1763
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>
> When launching a GPU job that uses a Docker image and the unified 
> containerizer the Nvidia drivers are not correctly mounted. As an experiment 
> I launched a task using both mesos-execute and Aurora using the same Docker 
> image and ran nvidia-smi. During the experiment I noticed that the 
> /usr/local/nvidia folder was not being mounted properly. To confirm this was 
> the issue I tar'ed the drivers up (/run/mesos/isolators/gpu/nvidia_352.39) 
> and manually added it to the Docker image. When this was done the task was 
> able to launch correctly.
> Here is the resulting mountinfo for the mesos-execute task. Notice how 
> /usr/local/nvidia is mounted from the /mesos directory.
> {noformat}140 102 8:17 
> /mesos_work/provisioner/containers/11c497a2-a300-4c9e-a474-79aad1f28f11/backends/copy/rootfses/8ee046a6-bacb-42ff-b039-2cabda5d0e62
>  / rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
> 141 140 8:17 
> /mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/67025326-9dfd-4cbb-a008-454a40bce2f5-0009/executors/gpu-test/runs/11c497a2-a300-4c9e-a474-79aad1f28f11
>  /mnt/mesos/sandbox rw,relatime master:24 - ext4 /dev/sdb1 
> rw,errors=remount-ro,data=ordered
> 142 140 0:15 /mesos/isolators/gpu/nvidia_352.39 /usr/local/nvidia 
> rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
> 143 140 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
> 144 143 0:3 /sys /proc/sys ro,relatime - proc proc rw
> 145 140 0:14 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
> 146 140 0:38 / /dev rw,nosuid - tmpfs tmpfs rw,mode=755
> 147 146 0:39 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts 
> rw,mode=600,ptmxmode=666
> 148 146 0:40 / /dev/shm rw,nosuid,nodev - tmpfs tmpfs rw{noformat}
> Here is the mountinfo when using Aurora. Notice how /usr/local/nvidia is 
> missing.
> {noformat}72 71 8:1 / / rw,relatime master:1 - ext4 /dev/sda1 
> rw,errors=remount-ro,data=ordered
> 73 72 0:5 / /dev rw,relatime master:2 - devtmpfs udev 
> rw,size=10240k,nr_inodes=16521649,mode=755
> 74 73 0:11 / /dev/pts rw,nosuid,noexec,relatime master:3 - devpts devpts 
> rw,gid=5,mode=620,ptmxmode=000
> 75 73 0:17 / /dev/shm rw,nosuid,nodev master:4 - tmpfs tmpfs rw
> 76 73 0:13 / /dev/mqueue rw,relatime master:21 - mqueue mqueue rw
> 77 73 0:30 / /dev/hugepages rw,relatime master:23 - hugetlbfs hugetlbfs rw
> 78 72 0:15 / /run rw,nosuid,relatime master:5 - tmpfs tmpfs 
> rw,size=26438160k,mode=755
> 79 78 0:18 / /run/lock rw,nosuid,nodev,noexec,relatime master:6 - tmpfs tmpfs 
> rw,size=5120k
> 80 78 0:32 / /run/rpc_pipefs rw,relatime master:25 - rpc_pipefs rpc_pipefs rw
> 82 72 0:14 / /sys rw,nosuid,nodev,noexec,relatime master:7 - sysfs sysfs rw
> 83 82 0:16 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime master:8 - 
> securityfs securityfs rw
> 84 82 0:19 / /sys/fs/cgroup ro,nosuid,nodev,noexec master:9 - tmpfs tmpfs 
> ro,mode=755
> 85 84 0:20 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:10 
> - cgroup cgroup 
> rw,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd
> 86 84 0:22 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime master:13 
> - cgroup cgroup rw,cpuset
> 87 84 0:23 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime 
> master:14 - cgroup cgroup rw,cpu,cpuacct
> 88 84 0:24 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime master:15 
> - cgroup cgroup rw,devices
> 89 84 0:25 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime master:16 
> - cgroup cgroup rw,freezer
> 90 84 0:26 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime 
> master:17 - cgroup cgroup rw,net_cls,net_prio
> 91 84 

[jira] [Updated] (AURORA-1763) GPU drivers are missing when using a Docker image

2016-09-06 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen updated AURORA-1763:
-
Description: 
When launching a GPU job that uses a Docker image and the unified containerizer 
the Nvidia drivers are not correctly mounted. As an experiment I launched a 
task using both mesos-execute and Aurora using the same Docker image and ran 
nvidia-smi. During the experiment I noticed that the /usr/local/nvidia folder 
was not being mounted properly. To confirm this was the issue I tar'ed the 
drivers up (/run/mesos/isolators/gpu/nvidia_352.39) and manually added it to 
the Docker image. When this was done the task was able to launch correctly.

Here is the resulting mountinfo for the mesos-execute task. Notice how 
/usr/local/nvidia is mounted from the /mesos directory.

{noformat}140 102 8:17 
/mesos_work/provisioner/containers/11c497a2-a300-4c9e-a474-79aad1f28f11/backends/copy/rootfses/8ee046a6-bacb-42ff-b039-2cabda5d0e62
 / rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
141 140 8:17 
/mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/67025326-9dfd-4cbb-a008-454a40bce2f5-0009/executors/gpu-test/runs/11c497a2-a300-4c9e-a474-79aad1f28f11
 /mnt/mesos/sandbox rw,relatime master:24 - ext4 /dev/sdb1 
rw,errors=remount-ro,data=ordered
142 140 0:15 /mesos/isolators/gpu/nvidia_352.39 /usr/local/nvidia 
rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
143 140 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
144 143 0:3 /sys /proc/sys ro,relatime - proc proc rw
145 140 0:14 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
146 140 0:38 / /dev rw,nosuid - tmpfs tmpfs rw,mode=755
147 146 0:39 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts 
rw,mode=600,ptmxmode=666
148 146 0:40 / /dev/shm rw,nosuid,nodev - tmpfs tmpfs rw{noformat}

Here is the mountinfo when using Aurora. Notice how /usr/local/nvidia is 
missing.

{noformat}72 71 8:1 / / rw,relatime master:1 - ext4 /dev/sda1 
rw,errors=remount-ro,data=ordered
73 72 0:5 / /dev rw,relatime master:2 - devtmpfs udev 
rw,size=10240k,nr_inodes=16521649,mode=755
74 73 0:11 / /dev/pts rw,nosuid,noexec,relatime master:3 - devpts devpts 
rw,gid=5,mode=620,ptmxmode=000
75 73 0:17 / /dev/shm rw,nosuid,nodev master:4 - tmpfs tmpfs rw
76 73 0:13 / /dev/mqueue rw,relatime master:21 - mqueue mqueue rw
77 73 0:30 / /dev/hugepages rw,relatime master:23 - hugetlbfs hugetlbfs rw
78 72 0:15 / /run rw,nosuid,relatime master:5 - tmpfs tmpfs 
rw,size=26438160k,mode=755
79 78 0:18 / /run/lock rw,nosuid,nodev,noexec,relatime master:6 - tmpfs tmpfs 
rw,size=5120k
80 78 0:32 / /run/rpc_pipefs rw,relatime master:25 - rpc_pipefs rpc_pipefs rw
82 72 0:14 / /sys rw,nosuid,nodev,noexec,relatime master:7 - sysfs sysfs rw
83 82 0:16 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime master:8 - 
securityfs securityfs rw
84 82 0:19 / /sys/fs/cgroup ro,nosuid,nodev,noexec master:9 - tmpfs tmpfs 
ro,mode=755
85 84 0:20 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:10 - 
cgroup cgroup 
rw,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd
86 84 0:22 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime master:13 - 
cgroup cgroup rw,cpuset
87 84 0:23 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime 
master:14 - cgroup cgroup rw,cpu,cpuacct
88 84 0:24 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime master:15 - 
cgroup cgroup rw,devices
89 84 0:25 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime master:16 - 
cgroup cgroup rw,freezer
90 84 0:26 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime 
master:17 - cgroup cgroup rw,net_cls,net_prio
91 84 0:27 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime master:18 - 
cgroup cgroup rw,blkio
92 84 0:28 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime 
master:19 - cgroup cgroup rw,perf_event
93 82 0:21 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime master:11 - pstore 
pstore rw
94 82 0:6 / /sys/kernel/debug rw,relatime master:22 - debugfs debugfs rw
95 72 0:3 / /proc rw,nosuid,nodev,noexec,relatime master:12 - proc proc rw
96 95 0:29 / /proc/sys/fs/binfmt_misc rw,relatime master:20 - autofs systemd-1 
rw,fd=22,pgrp=1,timeout=300,minproto=5,maxproto=5,direct
97 96 0:34 / /proc/sys/fs/binfmt_misc rw,relatime master:27 - binfmt_misc 
binfmt_misc rw
98 72 8:17 / /mnt/01 rw,relatime master:24 - ext4 /dev/sdb1 
rw,errors=remount-ro,data=ordered
99 98 8:17 
/mesos_work/provisioner/containers/3790dd16-d1e2-4974-ba21-095a029b8c7d/backends/copy/rootfses/7ce26962-10a7-40ec-843b-c76e7e29c88d
 
/mnt/01/mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/13e02526-f2b7-4677-bb23-0faeeac65be9-/executors/thermos-root-devel-gpu_test-0-beeb742b-28c1-46f3-b49f-23443b6efcc2/runs/3790dd16-d1e2-4974-ba21-095a029b8c7d/taskfs
 rw,relatime master:24 - ext4 /dev/sdb1 

[jira] [Updated] (AURORA-1763) GPU drivers are missing when using a Docker image

2016-09-06 Thread Justin Pinkul (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Pinkul updated AURORA-1763:
--
Description: 
When launching a GPU job that uses a Docker image and the unified containerizer 
the Nvidia drivers are not correctly mounted. As an experiment I launched a 
task using both mesos-execute and Aurora using the same Docker image and ran 
`nvidia-smi`. During the experiment I noticed that the /usr/local/nvidia folder 
was not being mounted properly. To confirm this was the issue I tar'ed the 
drivers up (/run/mesos/isolators/gpu/nvidia_352.39) and manually added it to 
the Docker image. When this was done the task was able to launch correctly.

Here is the resulting mountinfo for the mesos-execute task. Notice how 
/usr/local/nvidia is mounted from the /mesos directory.

140 102 8:17 
/mesos_work/provisioner/containers/11c497a2-a300-4c9e-a474-79aad1f28f11/backends/copy/rootfses/8ee046a6-bacb-42ff-b039-2cabda5d0e62
 / rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
141 140 8:17 
/mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/67025326-9dfd-4cbb-a008-454a40bce2f5-0009/executors/gpu-test/runs/11c497a2-a300-4c9e-a474-79aad1f28f11
 /mnt/mesos/sandbox rw,relatime master:24 - ext4 /dev/sdb1 
rw,errors=remount-ro,data=ordered
142 140 0:15 /mesos/isolators/gpu/nvidia_352.39 /usr/local/nvidia 
rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
143 140 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
144 143 0:3 /sys /proc/sys ro,relatime - proc proc rw
145 140 0:14 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
146 140 0:38 / /dev rw,nosuid - tmpfs tmpfs rw,mode=755
147 146 0:39 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts 
rw,mode=600,ptmxmode=666
148 146 0:40 / /dev/shm rw,nosuid,nodev - tmpfs tmpfs rw

Here is the mountinfo when using Aurora. Notice how /usr/local/nvidia is 
missing.

72 71 8:1 / / rw,relatime master:1 - ext4 /dev/sda1 
rw,errors=remount-ro,data=ordered
73 72 0:5 / /dev rw,relatime master:2 - devtmpfs udev 
rw,size=10240k,nr_inodes=16521649,mode=755
74 73 0:11 / /dev/pts rw,nosuid,noexec,relatime master:3 - devpts devpts 
rw,gid=5,mode=620,ptmxmode=000
75 73 0:17 / /dev/shm rw,nosuid,nodev master:4 - tmpfs tmpfs rw
76 73 0:13 / /dev/mqueue rw,relatime master:21 - mqueue mqueue rw
77 73 0:30 / /dev/hugepages rw,relatime master:23 - hugetlbfs hugetlbfs rw
78 72 0:15 / /run rw,nosuid,relatime master:5 - tmpfs tmpfs 
rw,size=26438160k,mode=755
79 78 0:18 / /run/lock rw,nosuid,nodev,noexec,relatime master:6 - tmpfs tmpfs 
rw,size=5120k
80 78 0:32 / /run/rpc_pipefs rw,relatime master:25 - rpc_pipefs rpc_pipefs rw
82 72 0:14 / /sys rw,nosuid,nodev,noexec,relatime master:7 - sysfs sysfs rw
83 82 0:16 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime master:8 - 
securityfs securityfs rw
84 82 0:19 / /sys/fs/cgroup ro,nosuid,nodev,noexec master:9 - tmpfs tmpfs 
ro,mode=755
85 84 0:20 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:10 - 
cgroup cgroup 
rw,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd
86 84 0:22 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime master:13 - 
cgroup cgroup rw,cpuset
87 84 0:23 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime 
master:14 - cgroup cgroup rw,cpu,cpuacct
88 84 0:24 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime master:15 - 
cgroup cgroup rw,devices
89 84 0:25 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime master:16 - 
cgroup cgroup rw,freezer
90 84 0:26 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime 
master:17 - cgroup cgroup rw,net_cls,net_prio
91 84 0:27 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime master:18 - 
cgroup cgroup rw,blkio
92 84 0:28 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime 
master:19 - cgroup cgroup rw,perf_event
93 82 0:21 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime master:11 - pstore 
pstore rw
94 82 0:6 / /sys/kernel/debug rw,relatime master:22 - debugfs debugfs rw
95 72 0:3 / /proc rw,nosuid,nodev,noexec,relatime master:12 - proc proc rw
96 95 0:29 / /proc/sys/fs/binfmt_misc rw,relatime master:20 - autofs systemd-1 
rw,fd=22,pgrp=1,timeout=300,minproto=5,maxproto=5,direct
97 96 0:34 / /proc/sys/fs/binfmt_misc rw,relatime master:27 - binfmt_misc 
binfmt_misc rw
98 72 8:17 / /mnt/01 rw,relatime master:24 - ext4 /dev/sdb1 
rw,errors=remount-ro,data=ordered
99 98 8:17 
/mesos_work/provisioner/containers/3790dd16-d1e2-4974-ba21-095a029b8c7d/backends/copy/rootfses/7ce26962-10a7-40ec-843b-c76e7e29c88d
 
/mnt/01/mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/13e02526-f2b7-4677-bb23-0faeeac65be9-/executors/thermos-root-devel-gpu_test-0-beeb742b-28c1-46f3-b49f-23443b6efcc2/runs/3790dd16-d1e2-4974-ba21-095a029b8c7d/taskfs
 rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
100 

[jira] [Comment Edited] (AURORA-1763) GPU drivers are missing when using a Docker image

2016-09-06 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15467796#comment-15467796
 ] 

Joshua Cohen edited comment on AURORA-1763 at 9/6/16 4:17 PM:
--

[~jieyu] Is this something that should be handled by {{mesos-containerizer 
launch ...}}? I'm not sure how mesos decides when to mount 
{{/usr/local/nvidia}} or how Aurora could make that decision in the executor.


was (Author: joshua.cohen):
[~jieyu] Is this something that should be handled by {{mesos-containerizer 
launch ...}}? I'm not sure how mesos decides when to mount 
{{/usr/local/nvidia}} or how Aurora could make that decision in the Executor.

> GPU drivers are missing when using a Docker image
> -
>
> Key: AURORA-1763
> URL: https://issues.apache.org/jira/browse/AURORA-1763
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>
> When launching a GPU job that uses a Docker image and the unified 
> containerizer the Nvidia drivers are not correctly mounted. As an experiment 
> I launched a task using both mesos-execute and Aurora using the same Docker 
> image and ran nvidia-smi. During the experiment I noticed that the 
> /usr/local/nvidia folder was not being mounted properly. To confirm this was 
> the issue I tar'ed the drivers up (/run/mesos/isolators/gpu/nvidia_352.39) 
> and manually added it to the Docker image. When this was done the task was 
> able to launch correctly.
> Here is the resulting mountinfo for the mesos-execute task. Notice how 
> /usr/local/nvidia is mounted from the /mesos directory.
> 140 102 8:17 
> /mesos_work/provisioner/containers/11c497a2-a300-4c9e-a474-79aad1f28f11/backends/copy/rootfses/8ee046a6-bacb-42ff-b039-2cabda5d0e62
>  / rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
> 141 140 8:17 
> /mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/67025326-9dfd-4cbb-a008-454a40bce2f5-0009/executors/gpu-test/runs/11c497a2-a300-4c9e-a474-79aad1f28f11
>  /mnt/mesos/sandbox rw,relatime master:24 - ext4 /dev/sdb1 
> rw,errors=remount-ro,data=ordered
> 142 140 0:15 /mesos/isolators/gpu/nvidia_352.39 /usr/local/nvidia 
> rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
> 143 140 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
> 144 143 0:3 /sys /proc/sys ro,relatime - proc proc rw
> 145 140 0:14 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
> 146 140 0:38 / /dev rw,nosuid - tmpfs tmpfs rw,mode=755
> 147 146 0:39 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts 
> rw,mode=600,ptmxmode=666
> 148 146 0:40 / /dev/shm rw,nosuid,nodev - tmpfs tmpfs rw
> Here is the mountinfo when using Aurora. Notice how /usr/local/nvidia is 
> missing.
> 72 71 8:1 / / rw,relatime master:1 - ext4 /dev/sda1 
> rw,errors=remount-ro,data=ordered
> 73 72 0:5 / /dev rw,relatime master:2 - devtmpfs udev 
> rw,size=10240k,nr_inodes=16521649,mode=755
> 74 73 0:11 / /dev/pts rw,nosuid,noexec,relatime master:3 - devpts devpts 
> rw,gid=5,mode=620,ptmxmode=000
> 75 73 0:17 / /dev/shm rw,nosuid,nodev master:4 - tmpfs tmpfs rw
> 76 73 0:13 / /dev/mqueue rw,relatime master:21 - mqueue mqueue rw
> 77 73 0:30 / /dev/hugepages rw,relatime master:23 - hugetlbfs hugetlbfs rw
> 78 72 0:15 / /run rw,nosuid,relatime master:5 - tmpfs tmpfs 
> rw,size=26438160k,mode=755
> 79 78 0:18 / /run/lock rw,nosuid,nodev,noexec,relatime master:6 - tmpfs tmpfs 
> rw,size=5120k
> 80 78 0:32 / /run/rpc_pipefs rw,relatime master:25 - rpc_pipefs rpc_pipefs rw
> 82 72 0:14 / /sys rw,nosuid,nodev,noexec,relatime master:7 - sysfs sysfs rw
> 83 82 0:16 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime master:8 - 
> securityfs securityfs rw
> 84 82 0:19 / /sys/fs/cgroup ro,nosuid,nodev,noexec master:9 - tmpfs tmpfs 
> ro,mode=755
> 85 84 0:20 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:10 
> - cgroup cgroup 
> rw,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd
> 86 84 0:22 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime master:13 
> - cgroup cgroup rw,cpuset
> 87 84 0:23 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime 
> master:14 - cgroup cgroup rw,cpu,cpuacct
> 88 84 0:24 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime master:15 
> - cgroup cgroup rw,devices
> 89 84 0:25 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime master:16 
> - cgroup cgroup rw,freezer
> 90 84 0:26 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime 
> master:17 - cgroup cgroup rw,net_cls,net_prio
> 91 84 0:27 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime master:18 - 
> cgroup cgroup rw,blkio
> 92 84 0:28 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime 
> master:19 - cgroup cgroup rw,perf_event
> 93 82 

[jira] [Updated] (AURORA-1763) GPU drivers are missing when using a Docker image

2016-09-06 Thread Justin Pinkul (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Pinkul updated AURORA-1763:
--
Description: 
When launching a GPU job that uses a Docker image and the unified containerizer 
the Nvidia drivers are not correctly mounted. As an experiment I launched a 
task using both mesos-execute and Aurora using the same Docker image and ran 
nvidia-smi. During the experiment I noticed that the /usr/local/nvidia folder 
was not being mounted properly. To confirm this was the issue I tar'ed the 
drivers up (/run/mesos/isolators/gpu/nvidia_352.39) and manually added it to 
the Docker image. When this was done the task was able to launch correctly.

Here is the resulting mountinfo for the mesos-execute task. Notice how 
/usr/local/nvidia is mounted from the /mesos directory.

140 102 8:17 
/mesos_work/provisioner/containers/11c497a2-a300-4c9e-a474-79aad1f28f11/backends/copy/rootfses/8ee046a6-bacb-42ff-b039-2cabda5d0e62
 / rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
141 140 8:17 
/mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/67025326-9dfd-4cbb-a008-454a40bce2f5-0009/executors/gpu-test/runs/11c497a2-a300-4c9e-a474-79aad1f28f11
 /mnt/mesos/sandbox rw,relatime master:24 - ext4 /dev/sdb1 
rw,errors=remount-ro,data=ordered
142 140 0:15 /mesos/isolators/gpu/nvidia_352.39 /usr/local/nvidia 
rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
143 140 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
144 143 0:3 /sys /proc/sys ro,relatime - proc proc rw
145 140 0:14 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
146 140 0:38 / /dev rw,nosuid - tmpfs tmpfs rw,mode=755
147 146 0:39 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts 
rw,mode=600,ptmxmode=666
148 146 0:40 / /dev/shm rw,nosuid,nodev - tmpfs tmpfs rw

Here is the mountinfo when using Aurora. Notice how /usr/local/nvidia is 
missing.

72 71 8:1 / / rw,relatime master:1 - ext4 /dev/sda1 
rw,errors=remount-ro,data=ordered
73 72 0:5 / /dev rw,relatime master:2 - devtmpfs udev 
rw,size=10240k,nr_inodes=16521649,mode=755
74 73 0:11 / /dev/pts rw,nosuid,noexec,relatime master:3 - devpts devpts 
rw,gid=5,mode=620,ptmxmode=000
75 73 0:17 / /dev/shm rw,nosuid,nodev master:4 - tmpfs tmpfs rw
76 73 0:13 / /dev/mqueue rw,relatime master:21 - mqueue mqueue rw
77 73 0:30 / /dev/hugepages rw,relatime master:23 - hugetlbfs hugetlbfs rw
78 72 0:15 / /run rw,nosuid,relatime master:5 - tmpfs tmpfs 
rw,size=26438160k,mode=755
79 78 0:18 / /run/lock rw,nosuid,nodev,noexec,relatime master:6 - tmpfs tmpfs 
rw,size=5120k
80 78 0:32 / /run/rpc_pipefs rw,relatime master:25 - rpc_pipefs rpc_pipefs rw
82 72 0:14 / /sys rw,nosuid,nodev,noexec,relatime master:7 - sysfs sysfs rw
83 82 0:16 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime master:8 - 
securityfs securityfs rw
84 82 0:19 / /sys/fs/cgroup ro,nosuid,nodev,noexec master:9 - tmpfs tmpfs 
ro,mode=755
85 84 0:20 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:10 - 
cgroup cgroup 
rw,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd
86 84 0:22 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime master:13 - 
cgroup cgroup rw,cpuset
87 84 0:23 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime 
master:14 - cgroup cgroup rw,cpu,cpuacct
88 84 0:24 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime master:15 - 
cgroup cgroup rw,devices
89 84 0:25 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime master:16 - 
cgroup cgroup rw,freezer
90 84 0:26 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime 
master:17 - cgroup cgroup rw,net_cls,net_prio
91 84 0:27 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime master:18 - 
cgroup cgroup rw,blkio
92 84 0:28 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime 
master:19 - cgroup cgroup rw,perf_event
93 82 0:21 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime master:11 - pstore 
pstore rw
94 82 0:6 / /sys/kernel/debug rw,relatime master:22 - debugfs debugfs rw
95 72 0:3 / /proc rw,nosuid,nodev,noexec,relatime master:12 - proc proc rw
96 95 0:29 / /proc/sys/fs/binfmt_misc rw,relatime master:20 - autofs systemd-1 
rw,fd=22,pgrp=1,timeout=300,minproto=5,maxproto=5,direct
97 96 0:34 / /proc/sys/fs/binfmt_misc rw,relatime master:27 - binfmt_misc 
binfmt_misc rw
98 72 8:17 / /mnt/01 rw,relatime master:24 - ext4 /dev/sdb1 
rw,errors=remount-ro,data=ordered
99 98 8:17 
/mesos_work/provisioner/containers/3790dd16-d1e2-4974-ba21-095a029b8c7d/backends/copy/rootfses/7ce26962-10a7-40ec-843b-c76e7e29c88d
 
/mnt/01/mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/13e02526-f2b7-4677-bb23-0faeeac65be9-/executors/thermos-root-devel-gpu_test-0-beeb742b-28c1-46f3-b49f-23443b6efcc2/runs/3790dd16-d1e2-4974-ba21-095a029b8c7d/taskfs
 rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
100 99