[jira] [Updated] (YARN-7224) Support GPU isolation for docker container

2018-05-16 Thread Eric Badger (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-7224:
--
Labels: Docker  (was: )

> Support GPU isolation for docker container
> --
>
> Key: YARN-7224
> URL: https://issues.apache.org/jira/browse/YARN-7224
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
>  Labels: Docker
> Fix For: 3.1.0
>
> Attachments: YARN-7224.001.patch, YARN-7224.002-wip.patch, 
> YARN-7224.003.patch, YARN-7224.004.patch, YARN-7224.005.patch, 
> YARN-7224.006.patch, YARN-7224.007.patch, YARN-7224.008.patch, 
> YARN-7224.009.patch
>
>
> This patch addresses the following issues when docker containers are used:
> 1. GPU driver and nvidia libraries: If GPU drivers and NV libraries are 
> pre-packaged inside the docker image, they could conflict with the driver and 
> nvidia libraries installed on the host OS. An alternative solution is to detect 
> the drivers and devices installed on the host OS and mount them when launching 
> the docker container. Please refer to \[1\] for more details. 
> 2. Image detection: 
> From \[2\], the challenge is: 
> bq. Mounting user-level driver libraries and device files clobbers the 
> environment of the container, it should be done only when the container is 
> running a GPU application. The challenge here is to determine if a given 
> image will be using the GPU or not. We should also prevent launching 
> containers based on a Docker image that is incompatible with the host NVIDIA 
> driver version, you can find more details on this wiki page.
> 3. GPU isolation.
> *Proposed solution*:
> a. Use nvidia-docker-plugin \[3\] to address issue #1; this is the same 
> solution used by K8S \[4\]. Issue #2 could be addressed in a separate JIRA.
> We won't ship nvidia-docker-plugin with our releases, and we require the 
> cluster admin to preinstall nvidia-docker-plugin to use GPU+docker support on 
> YARN. "nvidia-docker" is a wrapper around the docker binary which can address 
> #3 as well; however, "nvidia-docker" doesn't provide the same semantics as 
> docker, and it needs additional environment setup such as 
> PATH/LD_LIBRARY_PATH. To avoid introducing additional issues, we plan to use 
> the nvidia-docker-plugin + docker binary approach.
> b. To address GPU driver and nvidia libraries, we use nvidia-docker-plugin 
> \[3\] to create a volume which includes GPU-related libraries and mount it 
> when the docker container is launched. Changes include: 
> - Instead of using {{volume-driver}}, this patch adds the {{docker volume 
> create}} command to c-e and the NM Java side. The reason is that 
> {{volume-driver}} can only use a single volume driver for each launched 
> docker container.
> - Updated {{c-e}} and the Java side: if a mounted volume is a named volume in 
> docker, skip checking file existence. (Named volumes still need to be added 
> to the permitted list in container-executor.cfg.)
> c. To address the isolation issue:
> We found that cgroup + docker doesn't work under newer docker versions which 
> use {{runc}} as the default runtime. Setting {{--cgroup-parent}} to a cgroup 
> which includes any {{devices.deny}} prevents the docker container from 
> launching.
> Instead, this patch passes the allowed GPU devices via {{--device}} to the 
> docker launch command.
> References:
> \[1\] https://github.com/NVIDIA/nvidia-docker/wiki/NVIDIA-driver
> \[2\] https://github.com/NVIDIA/nvidia-docker/wiki/Image-inspection
> \[3\] https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker-plugin
> \[4\] https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7224) Support GPU isolation for docker container

2017-10-27 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-7224:
-
Attachment: YARN-7224.009.patch

bq. Could we print nvidia-docker-plugin -v somewhere from c-e or java side to 
dump version info. Helpful for debugging later.
Good suggestion, but can we do this later, as part of a separate 
GPU-debuggability JIRA (I will file it later)?

Fixed #2/#3.  

Uploaded the ver.9 patch; could you help review?




[jira] [Updated] (YARN-7224) Support GPU isolation for docker container

2017-10-25 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-7224:
-
Attachment: YARN-7224.008.patch

Thanks [~sunilg] for comments,

bq. In assignGpus, do we also need to update the assigned gpus to container's 
resource mapping list ?
I would prefer to keep them in NMStateStore#storeAssignedResources; otherwise, 
every new resource plugin would need to implement such logic.

bq. In general dockerCommandPlugin.updateDockerRunCommand helps to update 
docker command for volume etc. However, is it better to have an api named 
sanitize/verifyCommand in dockerCommandPlugin so that incoming/created command 
will be validated and logged based on system parameters
I'm not quite sure about this; could you explain?

bq. Once a docker volume is created, when this volume will be cleaned or 
unmounted ? in case when container crashes or force stopping container from 
external docker commands etc
bq. With container upgrades or partially using GPU device for a timeslice of 
container lifetime, how volumes could be mounted/re-mounted ?
For the GPU docker integration, we don't need to do this: all launched 
containers share the same docker volume, so we don't need to create the 
docker volume again and again. I agree that we may need this in the future, so 
I added one method (getCleanupDockerVolumeCommand) to the DockerCommandPlugin 
interface.

bq. In GpuDevice, do we also need to add make (like nvidia with version etc ? )
We don't need it for now, we can add it in the future easily when required.

bq. In initializeWhenGpuRequested, we do a lazy initialization. However if 
docker end point is down(default port), this could cause delay in container 
launch. Do we need a health mechanism to get this data updated ?
To me this is the same as the docker daemon being down. Since containers will 
fail fast, the admin should be able to fix this issue. 

bq. Once docker volume is created, it's better to dump the docker volume inspect 
o/p on created volume. Could help for debugging later.
I like this idea, but considering the size of this patch, can we do this in a 
follow-up JIRA?

Attached ver.8 patch.


[jira] [Updated] (YARN-7224) Support GPU isolation for docker container

2017-10-24 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-7224:
-
Attachment: YARN-7224.007.patch

Attached ver.7 patch: fixed warnings / javadocs. The UT failure is not related.




[jira] [Updated] (YARN-7224) Support GPU isolation for docker container

2017-10-24 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-7224:
-
Attachment: YARN-7224.006.patch

Attached ver.6 patch to run Jenkins.




[jira] [Updated] (YARN-7224) Support GPU isolation for docker container

2017-10-18 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-7224:
-
Description: 
This patch addresses the following issues when docker containers are used:
1. GPU driver and nvidia libraries: If GPU drivers and NV libraries are 
pre-packaged inside the docker image, they could conflict with the driver and 
nvidia libraries installed on the host OS. An alternative solution is to detect 
the drivers and devices installed on the host OS and mount them when launching 
the docker container. Please refer to \[1\] for more details. 

2. Image detection: 
From \[2\], the challenge is: 
bq. Mounting user-level driver libraries and device files clobbers the 
environment of the container, it should be done only when the container is 
running a GPU application. The challenge here is to determine if a given image 
will be using the GPU or not. We should also prevent launching containers based 
on a Docker image that is incompatible with the host NVIDIA driver version, you 
can find more details on this wiki page.

3. GPU isolation.

*Proposed solution*:

a. Use nvidia-docker-plugin \[3\] to address issue #1; this is the same 
solution used by K8S \[4\]. Issue #2 could be addressed in a separate JIRA.

We won't ship nvidia-docker-plugin with our releases, and we require the cluster 
admin to preinstall nvidia-docker-plugin to use GPU+docker support on YARN. 
"nvidia-docker" is a wrapper around the docker binary which can address #3 as 
well; however, "nvidia-docker" doesn't provide the same semantics as docker, and 
it needs additional environment setup such as PATH/LD_LIBRARY_PATH. To avoid 
introducing additional issues, we plan to use the nvidia-docker-plugin + docker 
binary approach.

b. To address GPU driver and nvidia libraries, we use nvidia-docker-plugin 
\[3\] to create a volume which includes GPU-related libraries and mount it when 
the docker container is launched. Changes include: 

- Instead of using {{volume-driver}}, this patch adds the {{docker volume 
create}} command to c-e and the NM Java side. The reason is that 
{{volume-driver}} can only use a single volume driver for each launched docker 
container.
- Updated {{c-e}} and the Java side: if a mounted volume is a named volume in 
docker, skip checking file existence. (Named volumes still need to be added to 
the permitted list in container-executor.cfg.)

c. To address the isolation issue:

We found that cgroup + docker doesn't work under newer docker versions which 
use {{runc}} as the default runtime. Setting {{--cgroup-parent}} to a cgroup 
which includes any {{devices.deny}} prevents the docker container from launching.

Instead, this patch passes the allowed GPU devices via {{--device}} to the docker 
launch command.

References:

\[1\] https://github.com/NVIDIA/nvidia-docker/wiki/NVIDIA-driver
\[2\] https://github.com/NVIDIA/nvidia-docker/wiki/Image-inspection
\[3\] https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker-plugin
\[4\] https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/

  was:YARN-6620 added support for GPU isolation on the NM side, which only 
supports non-docker containers. We need to add support so that docker 
containers launched by YARN can utilize GPUs.



[jira] [Updated] (YARN-7224) Support GPU isolation for docker container

2017-10-15 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-7224:
-
Attachment: YARN-7224.005.patch

Attached ver.5 patch.




[jira] [Updated] (YARN-7224) Support GPU isolation for docker container

2017-10-12 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-7224:
-
Attachment: YARN-7224.004.patch

Attached ver.4 patch, fixed warnings / test failures and added more preventive 
tests.




[jira] [Updated] (YARN-7224) Support GPU isolation for docker container

2017-10-10 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-7224:
-
Attachment: YARN-7224.003.patch

Attached ver.003 patch, major updates:

1) Instead of using {{volume-driver}}, this patch adds the {{docker volume 
create}} command to c-e and the NM Java side. The reason is that 
{{volume-driver}} can only use a single volume driver for each launched docker 
container.

2) Updated {{c-e}} and the Java side: if a mounted volume is a named volume in 
docker, skip checking file existence. (Named volumes still need to be added to 
the permitted list in container-executor.cfg.)

3) More tests and cleanups.
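
For illustration only, items 1) and 2) can be sketched roughly as below. This is not the code in the patch (which lives in c-e and the NM Java side); the function names, the volume name {{nvidia_driver_375.66}}, and the regular expression used to recognize a named volume are all assumptions made for this sketch.

```python
import re
from typing import List

# Assumption for this sketch: a mount source is treated as a named volume
# when it starts with an alphanumeric character and otherwise contains only
# alphanumerics, '_', '.', or '-'. Anything else (e.g. an absolute path) is
# treated as a host path.
NAMED_VOLUME_RE = re.compile(r"[a-zA-Z0-9][a-zA-Z0-9_.-]*")


def volume_create_command(volume_name: str, driver: str) -> List[str]:
    """Build the argv for `docker volume create --driver <driver> <name>`."""
    return ["docker", "volume", "create", "--driver", driver, volume_name]


def is_named_volume(source: str) -> bool:
    """Return True if the mount source looks like a docker named volume."""
    return NAMED_VOLUME_RE.fullmatch(source) is not None


def needs_file_existence_check(source: str) -> bool:
    """Host paths are checked for existence; named volumes are skipped."""
    return not is_named_volume(source)


if __name__ == "__main__":
    # Hypothetical volume name and driver, used only as example values.
    print(" ".join(volume_create_command("nvidia_driver_375.66", "nvidia-docker")))
    for src in ("nvidia_driver_375.66", "/usr/lib/nvidia"):
        print(src, "-> check file existence:", needs_file_existence_check(src))
```

Creating the volume as a separate step, rather than passing a volume driver on the run command, leaves the run command free to mount volumes from other drivers as well.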




[jira] [Updated] (YARN-7224) Support GPU isolation for docker container

2017-10-09 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-7224:
-
Attachment: YARN-7224.002-wip.patch

Attached ver.2 work-in-progress patch. The major change in this patch: I found 
that cgroup + docker doesn't work under newer docker versions which use 
{{runc}} as the default runtime. Setting {{--cgroup-parent}} to a cgroup which 
includes any {{devices.deny}} prevents the docker container from launching.

Instead, this patch passes the allowed GPU devices via {{--device}} to the docker 
launch command. I tested this patch on a CentOS 7 machine with 2 GPU devices, 
and it works fine. Some cleanups still need to be done and more unit tests need 
to be added, so it is marked as WIP.
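
As a rough sketch of the {{--device}} approach (not the patch itself): the helper below turns a list of assigned GPU minor numbers into {{--device}} arguments for the docker launch command. The helper name, the {{/dev/nvidia<N>}} naming, and the set of control devices that is always passed through are assumptions for this example.

```python
from typing import Iterable, List, Sequence

# Assumption for this sketch: these control devices are exposed to every
# GPU container in addition to the per-GPU device files.
CONTROL_DEVICES = ("/dev/nvidiactl", "/dev/nvidia-uvm")


def device_args(minor_numbers: Iterable[int],
                control_devices: Sequence[str] = CONTROL_DEVICES) -> List[str]:
    """Build `--device host:container` arguments for the allowed GPUs only.

    Devices that are not listed are simply never passed to docker, so the
    container cannot open them; no devices.deny cgroup entry is involved.
    """
    paths = list(control_devices)
    # De-duplicate and sort so the generated command is deterministic.
    paths += [f"/dev/nvidia{n}" for n in sorted(set(minor_numbers))]
    args: List[str] = []
    for path in paths:
        args += ["--device", f"{path}:{path}"]
    return args


if __name__ == "__main__":
    # Example: the container was assigned the GPUs with minor numbers 0 and 1.
    print(" ".join(["docker", "run"] + device_args([1, 0]) + ["<image>"]))
```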




[jira] [Updated] (YARN-7224) Support GPU isolation for docker container

2017-10-04 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-7224:
-
Attachment: YARN-7224.001.patch

Attached ver.1 patch on top of YARN-6620. Please feel free to share your 
thoughts! 




[jira] [Updated] (YARN-7224) Support GPU isolation for docker container

2017-10-04 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-7224:
-
Description: YARN-6620 added support for GPU isolation on the NM side, which 
only supports non-docker containers. We need to add support so that docker 
containers launched by YARN can utilize GPUs.
