[jira] [Updated] (YARN-6223) [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation on YARN
[ https://issues.apache.org/jira/browse/YARN-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wangda Tan updated YARN-6223:

    Attachment: YARN-6223.wip.3.patch

Attached the ver.3 patch, which organizes the container-executor code changes better. The work is based on YARN-6033, so all test cases are now based on gtest.

TODO: put all configuration items into sections.

> [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation on YARN
>
>                 Key: YARN-6223
>                 URL: https://issues.apache.org/jira/browse/YARN-6223
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Wangda Tan
>            Assignee: Wangda Tan
>         Attachments: YARN-6223.Natively-support-GPU-on-YARN-v1.pdf, YARN-6223.wip.1.patch, YARN-6223.wip.2.patch, YARN-6223.wip.3.patch
>
> A growing variety of workloads is moving to YARN, including machine learning / deep learning workloads that can be sped up by leveraging GPU computation power. Workloads should be able to request GPUs from YARN as simply as CPU and memory.
>
> *To make a complete GPU story, we should support the following pieces:*
> 1) GPU discovery/configuration: An admin can either configure GPU resources and architectures on each node or, more advanced, the NodeManager can automatically discover GPU resources and architectures and report them to the ResourceManager.
> 2) GPU scheduling: The YARN scheduler should account for GPU as a resource type, just like CPU and memory.
> 3) GPU isolation/monitoring: Once a task is launched with GPU resources, the NodeManager should properly isolate and monitor the task's resource usage.
>
> For #2, YARN-3926 can support it natively. For #3, YARN-3611 has introduced an extensible framework to support isolation for different resource types and different runtimes.
>
> *Related JIRAs:*
> There are a couple of JIRAs (YARN-4122/YARN-5517) filed with similar goals but different solutions:
>
> For scheduling:
> - YARN-4122/YARN-5517 both add a new GPU resource type to the Resource protocol instead of leveraging YARN-3926.
>
> For isolation:
> - YARN-4122 proposed to use CGroups for isolation, which cannot solve the problems listed at https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation#challenges such as minor device number mapping, loading the nvidia_uvm module, mismatched CUDA/driver versions, etc.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)

To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6223) [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation on YARN
[ https://issues.apache.org/jira/browse/YARN-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wangda Tan updated YARN-6223:

    Attachment: YARN-6223.wip.2.patch

Attached the ver.2 WIP patch, which was discussed with [~chris.douglas] / [~vinodkv] / [~sunil.gov...@gmail.com] / [~sidharta-s] / [~vvasudev]. I would like your help reviewing the overall implementation on the C side. I made it better modularized, and the new logic can be turned on/off. This is related to YARN-5673 but doesn't include refactoring of existing logic, dynamic loading of modules, etc.

TODO items:
- The C side needs more validation / tests; it doesn't yet handle memory leaks, etc. properly.
- Doesn't include the logic for GPU allocation recovery on NM restart.

+[~tangzhankun], this may be related to FPGA isolation as well.

I will be traveling till early next week, so please expect delayed responses :).
[jira] [Updated] (YARN-6223) [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation on YARN
[ https://issues.apache.org/jira/browse/YARN-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wangda Tan updated YARN-6223:

    Attachment: YARN-6223.Natively-support-GPU-on-YARN-v1.pdf
                YARN-6223.wip.1.patch

Attached the design doc and a WIP prototype patch. Please feel free to share your thoughts.
[jira] [Updated] (YARN-6223) [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation on YARN
[ https://issues.apache.org/jira/browse/YARN-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wangda Tan updated YARN-6223:

    Description:
A growing variety of workloads is moving to YARN, including machine learning / deep learning workloads that can be sped up by leveraging GPU computation power. Workloads should be able to request GPUs from YARN as simply as CPU and memory.

*To make a complete GPU story, we should support the following pieces:*
1) GPU discovery/configuration: An admin can either configure GPU resources and architectures on each node or, more advanced, the NodeManager can automatically discover GPU resources and architectures and report them to the ResourceManager.
2) GPU scheduling: The YARN scheduler should account for GPU as a resource type, just like CPU and memory.
3) GPU isolation/monitoring: Once a task is launched with GPU resources, the NodeManager should properly isolate and monitor the task's resource usage.

For #2, YARN-3926 can support it natively. For #3, YARN-3611 has introduced an extensible framework to support isolation for different resource types and different runtimes.

*Related JIRAs:*
There are a couple of JIRAs (YARN-4122/YARN-5517) filed with similar goals but different solutions:

For scheduling:
- YARN-4122/YARN-5517 both add a new GPU resource type to the Resource protocol instead of leveraging YARN-3926.

For isolation:
- YARN-4122 proposed to use CGroups for isolation, which cannot solve the problems listed at https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation#challenges such as minor device number mapping, loading the nvidia_uvm module, mismatched CUDA/driver versions, etc.

  was:
A growing variety of workloads is moving to YARN, including machine learning / deep learning workloads that can be sped up by leveraging GPU computation power. Workloads should be able to request GPUs from YARN as simply as CPU and memory.

To make a complete GPU story, we should support the following pieces:
1) GPU discovery/configuration: An admin can either configure GPU resources and architectures on each node or, more advanced, the NodeManager can automatically discover GPU resources and architectures and report them to the ResourceManager.
2) GPU scheduling: The YARN scheduler should account for GPU as a resource type, just like CPU and memory.
3) GPU isolation/monitoring: Once a task is launched with GPU resources, the NodeManager should properly isolate and monitor the task's resource usage.

For #2, YARN-3926 can support it natively. For #3, YARN-3611 has introduced an extensible framework to support isolation for different resource types and different runtimes.

There are a couple of JIRAs (YARN-4122/YARN-5517) filed with similar goals but different solutions:

For scheduling:
- YARN-4122/YARN-5517 both add a new GPU resource type to the Resource protocol instead of leveraging YARN-3926.

For isolation:
- YARN-4122 proposed to use CGroups for isolation, which cannot solve the problems listed at https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation#challenges such as minor device number mapping, loading the nvidia_uvm module, mismatched CUDA/driver versions, etc.