[jira] [Commented] (YARN-6223) [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation on YARN

2017-09-25 Thread zhengchenyu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16178812#comment-16178812
 ] 

zhengchenyu commented on YARN-6223:
---

I thinks NM recovery for gpu doesn't take effect. Becuase 
RecoveredContainerState's ResourceMappings didn't add to the recovered 
Container. 

> [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation 
> on YARN
> 
>
> Key: YARN-6223
> URL: https://issues.apache.org/jira/browse/YARN-6223
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-6223.Natively-support-GPU-on-YARN-v1.pdf, 
> YARN-6223.wip.1.patch, YARN-6223.wip.2.patch, YARN-6223.wip.3.patch
>
>
> As varieties of workloads are moving to YARN, including machine learning / 
> deep learning which can speed up by leveraging GPU computation power. 
> Workloads should be able to request GPU from YARN as simple as CPU and memory.
> *To make a complete GPU story, we should support following pieces:*
> 1) GPU discovery/configuration: Admin can either config GPU resources and 
> architectures on each node, or more advanced, NodeManager can automatically 
> discover GPU resources and architectures and report to ResourceManager 
> 2) GPU scheduling: YARN scheduler should account GPU as a resource type just 
> like CPU and memory.
> 3) GPU isolation/monitoring: once launch a task with GPU resources, 
> NodeManager should properly isolate and monitor task's resource usage.
> For #2, YARN-3926 can support it natively. For #3, YARN-3611 has introduced 
> an extensible framework to support isolation for different resource types and 
> different runtimes.
> *Related JIRAs:*
> There're a couple of JIRAs (YARN-4122/YARN-5517) filed with similar goals but 
> different solutions:
> For scheduling:
> - YARN-4122/YARN-5517 are all adding a new GPU resource type to Resource 
> protocol instead of leveraging YARN-3926.
> For isolation:
> - And YARN-4122 proposed to use CGroups to do isolation which cannot solve 
> the problem listed at 
> https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation#challenges such as 
> minor device number mapping; load nvidia_uvm module; mismatch of CUDA/driver 
> versions, etc.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6223) [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation on YARN

2017-09-25 Thread zhengchenyu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16178761#comment-16178761
 ] 

zhengchenyu commented on YARN-6223:
---

I think there is a bug in below code. 
{code}
try {
  nmContext.getNMStateStore().storeAssignedResources(containerId,
RESOURCE_TYPE, allocatedDevices);
} catch (IOException e) {
throw new ResourceHandlerException(e);
}
{code}

When storeAssignedResources failed, there will be a leak. Because postComplete 
will never be called. It means usedDevices won't be removed. So should add 
cleanupAssignGpus in catch section. 

> [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation 
> on YARN
> 
>
> Key: YARN-6223
> URL: https://issues.apache.org/jira/browse/YARN-6223
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-6223.Natively-support-GPU-on-YARN-v1.pdf, 
> YARN-6223.wip.1.patch, YARN-6223.wip.2.patch, YARN-6223.wip.3.patch
>
>
> As varieties of workloads are moving to YARN, including machine learning / 
> deep learning which can speed up by leveraging GPU computation power. 
> Workloads should be able to request GPU from YARN as simple as CPU and memory.
> *To make a complete GPU story, we should support following pieces:*
> 1) GPU discovery/configuration: Admin can either config GPU resources and 
> architectures on each node, or more advanced, NodeManager can automatically 
> discover GPU resources and architectures and report to ResourceManager 
> 2) GPU scheduling: YARN scheduler should account GPU as a resource type just 
> like CPU and memory.
> 3) GPU isolation/monitoring: once launch a task with GPU resources, 
> NodeManager should properly isolate and monitor task's resource usage.
> For #2, YARN-3926 can support it natively. For #3, YARN-3611 has introduced 
> an extensible framework to support isolation for different resource types and 
> different runtimes.
> *Related JIRAs:*
> There're a couple of JIRAs (YARN-4122/YARN-5517) filed with similar goals but 
> different solutions:
> For scheduling:
> - YARN-4122/YARN-5517 are all adding a new GPU resource type to Resource 
> protocol instead of leveraging YARN-3926.
> For isolation:
> - And YARN-4122 proposed to use CGroups to do isolation which cannot solve 
> the problem listed at 
> https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation#challenges such as 
> minor device number mapping; load nvidia_uvm module; mismatch of CUDA/driver 
> versions, etc.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6223) [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation on YARN

2017-07-26 Thread Tan N. Le (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16102067#comment-16102067
 ] 

Tan N. Le commented on YARN-6223:
-

Hi Wangda Tan,

Does this feature have a working version yet? I am looking for this feature on 
Yarn for machine learning jobs (like Tensorflow).
I wonder whether you can share the source code if it works.

best,
Tan

> [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation 
> on YARN
> 
>
> Key: YARN-6223
> URL: https://issues.apache.org/jira/browse/YARN-6223
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-6223.Natively-support-GPU-on-YARN-v1.pdf, 
> YARN-6223.wip.1.patch, YARN-6223.wip.2.patch, YARN-6223.wip.3.patch
>
>
> As varieties of workloads are moving to YARN, including machine learning / 
> deep learning which can speed up by leveraging GPU computation power. 
> Workloads should be able to request GPU from YARN as simple as CPU and memory.
> *To make a complete GPU story, we should support following pieces:*
> 1) GPU discovery/configuration: Admin can either config GPU resources and 
> architectures on each node, or more advanced, NodeManager can automatically 
> discover GPU resources and architectures and report to ResourceManager 
> 2) GPU scheduling: YARN scheduler should account GPU as a resource type just 
> like CPU and memory.
> 3) GPU isolation/monitoring: once launch a task with GPU resources, 
> NodeManager should properly isolate and monitor task's resource usage.
> For #2, YARN-3926 can support it natively. For #3, YARN-3611 has introduced 
> an extensible framework to support isolation for different resource types and 
> different runtimes.
> *Related JIRAs:*
> There're a couple of JIRAs (YARN-4122/YARN-5517) filed with similar goals but 
> different solutions:
> For scheduling:
> - YARN-4122/YARN-5517 are all adding a new GPU resource type to Resource 
> protocol instead of leveraging YARN-3926.
> For isolation:
> - And YARN-4122 proposed to use CGroups to do isolation which cannot solve 
> the problem listed at 
> https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation#challenges such as 
> minor device number mapping; load nvidia_uvm module; mismatch of CUDA/driver 
> versions, etc.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6223) [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation on YARN

2017-07-20 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16095391#comment-16095391
 ] 

Wangda Tan commented on YARN-6223:
--

[~chris.douglas], I just split native patch to YARN-6852 and included summaries 
of what the patch is doing and how the patch will be used. Please share your 
thoughts.

Thanks!

> [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation 
> on YARN
> 
>
> Key: YARN-6223
> URL: https://issues.apache.org/jira/browse/YARN-6223
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-6223.Natively-support-GPU-on-YARN-v1.pdf, 
> YARN-6223.wip.1.patch, YARN-6223.wip.2.patch, YARN-6223.wip.3.patch
>
>
> As varieties of workloads are moving to YARN, including machine learning / 
> deep learning which can speed up by leveraging GPU computation power. 
> Workloads should be able to request GPU from YARN as simple as CPU and memory.
> *To make a complete GPU story, we should support following pieces:*
> 1) GPU discovery/configuration: Admin can either config GPU resources and 
> architectures on each node, or more advanced, NodeManager can automatically 
> discover GPU resources and architectures and report to ResourceManager 
> 2) GPU scheduling: YARN scheduler should account GPU as a resource type just 
> like CPU and memory.
> 3) GPU isolation/monitoring: once launch a task with GPU resources, 
> NodeManager should properly isolate and monitor task's resource usage.
> For #2, YARN-3926 can support it natively. For #3, YARN-3611 has introduced 
> an extensible framework to support isolation for different resource types and 
> different runtimes.
> *Related JIRAs:*
> There're a couple of JIRAs (YARN-4122/YARN-5517) filed with similar goals but 
> different solutions:
> For scheduling:
> - YARN-4122/YARN-5517 are all adding a new GPU resource type to Resource 
> protocol instead of leveraging YARN-3926.
> For isolation:
> - And YARN-4122 proposed to use CGroups to do isolation which cannot solve 
> the problem listed at 
> https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation#challenges such as 
> minor device number mapping; load nvidia_uvm module; mismatch of CUDA/driver 
> versions, etc.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6223) [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation on YARN

2017-07-19 Thread Chris Douglas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094074#comment-16094074
 ] 

Chris Douglas commented on YARN-6223:
-

[~leftnoteasy], could you summarize the implementation a bit? What would an 
example cfg look like and how is it interpreted?

> [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation 
> on YARN
> 
>
> Key: YARN-6223
> URL: https://issues.apache.org/jira/browse/YARN-6223
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-6223.Natively-support-GPU-on-YARN-v1.pdf, 
> YARN-6223.wip.1.patch, YARN-6223.wip.2.patch, YARN-6223.wip.3.patch
>
>
> As varieties of workloads are moving to YARN, including machine learning / 
> deep learning which can speed up by leveraging GPU computation power. 
> Workloads should be able to request GPU from YARN as simple as CPU and memory.
> *To make a complete GPU story, we should support following pieces:*
> 1) GPU discovery/configuration: Admin can either config GPU resources and 
> architectures on each node, or more advanced, NodeManager can automatically 
> discover GPU resources and architectures and report to ResourceManager 
> 2) GPU scheduling: YARN scheduler should account GPU as a resource type just 
> like CPU and memory.
> 3) GPU isolation/monitoring: once launch a task with GPU resources, 
> NodeManager should properly isolate and monitor task's resource usage.
> For #2, YARN-3926 can support it natively. For #3, YARN-3611 has introduced 
> an extensible framework to support isolation for different resource types and 
> different runtimes.
> *Related JIRAs:*
> There're a couple of JIRAs (YARN-4122/YARN-5517) filed with similar goals but 
> different solutions:
> For scheduling:
> - YARN-4122/YARN-5517 are all adding a new GPU resource type to Resource 
> protocol instead of leveraging YARN-3926.
> For isolation:
> - And YARN-4122 proposed to use CGroups to do isolation which cannot solve 
> the problem listed at 
> https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation#challenges such as 
> minor device number mapping; load nvidia_uvm module; mismatch of CUDA/driver 
> versions, etc.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6223) [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation on YARN

2017-07-19 Thread Zhankun Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16093148#comment-16093148
 ] 

Zhankun Tang commented on YARN-6223:


[~wangda], sorry for the late reply. Great thanks for the ver.3 patch!

It looks good to me. But my concerns are mainly about modularity:
1. If container-executor will be compiled with GPU module, does it mean that 
all type of accelerator devices(like FPGA, SSD, DSP) should implement one 
module in C language? Or perhaps is it possible that we provide a generic 
interface in container-executor to handle all device isolation?
2. Since we already have "node-resources.xml" for end-users to declare 
customized resource like GPU/FPGA, is it possible to put the allowed devices 
configuration here instead of "container-executor.cfg"?
For instance:
{code:xml}
...

   yarn.nodemanager.resource-types.MCP
   2


   yarn.nodemanager.resource-types.NvidiaGPU.allowed
   195:0,195:1

...
{code}

Please correct me if I made mistakes.

> [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation 
> on YARN
> 
>
> Key: YARN-6223
> URL: https://issues.apache.org/jira/browse/YARN-6223
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-6223.Natively-support-GPU-on-YARN-v1.pdf, 
> YARN-6223.wip.1.patch, YARN-6223.wip.2.patch, YARN-6223.wip.3.patch
>
>
> As varieties of workloads are moving to YARN, including machine learning / 
> deep learning which can speed up by leveraging GPU computation power. 
> Workloads should be able to request GPU from YARN as simple as CPU and memory.
> *To make a complete GPU story, we should support following pieces:*
> 1) GPU discovery/configuration: Admin can either config GPU resources and 
> architectures on each node, or more advanced, NodeManager can automatically 
> discover GPU resources and architectures and report to ResourceManager 
> 2) GPU scheduling: YARN scheduler should account GPU as a resource type just 
> like CPU and memory.
> 3) GPU isolation/monitoring: once launch a task with GPU resources, 
> NodeManager should properly isolate and monitor task's resource usage.
> For #2, YARN-3926 can support it natively. For #3, YARN-3611 has introduced 
> an extensible framework to support isolation for different resource types and 
> different runtimes.
> *Related JIRAs:*
> There're a couple of JIRAs (YARN-4122/YARN-5517) filed with similar goals but 
> different solutions:
> For scheduling:
> - YARN-4122/YARN-5517 are all adding a new GPU resource type to Resource 
> protocol instead of leveraging YARN-3926.
> For isolation:
> - And YARN-4122 proposed to use CGroups to do isolation which cannot solve 
> the problem listed at 
> https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation#challenges such as 
> minor device number mapping; load nvidia_uvm module; mismatch of CUDA/driver 
> versions, etc.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6223) [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation on YARN

2017-05-17 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16014609#comment-16014609
 ] 

Wangda Tan commented on YARN-6223:
--

Opened YARN-6620 to track code works and reviews for GPU 
configuration/isolation.

> [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation 
> on YARN
> 
>
> Key: YARN-6223
> URL: https://issues.apache.org/jira/browse/YARN-6223
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-6223.Natively-support-GPU-on-YARN-v1.pdf, 
> YARN-6223.wip.1.patch
>
>
> As varieties of workloads are moving to YARN, including machine learning / 
> deep learning which can speed up by leveraging GPU computation power. 
> Workloads should be able to request GPU from YARN as simple as CPU and memory.
> *To make a complete GPU story, we should support following pieces:*
> 1) GPU discovery/configuration: Admin can either config GPU resources and 
> architectures on each node, or more advanced, NodeManager can automatically 
> discover GPU resources and architectures and report to ResourceManager 
> 2) GPU scheduling: YARN scheduler should account GPU as a resource type just 
> like CPU and memory.
> 3) GPU isolation/monitoring: once launch a task with GPU resources, 
> NodeManager should properly isolate and monitor task's resource usage.
> For #2, YARN-3926 can support it natively. For #3, YARN-3611 has introduced 
> an extensible framework to support isolation for different resource types and 
> different runtimes.
> *Related JIRAs:*
> There're a couple of JIRAs (YARN-4122/YARN-5517) filed with similar goals but 
> different solutions:
> For scheduling:
> - YARN-4122/YARN-5517 are all adding a new GPU resource type to Resource 
> protocol instead of leveraging YARN-3926.
> For isolation:
> - And YARN-4122 proposed to use CGroups to do isolation which cannot solve 
> the problem listed at 
> https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation#challenges such as 
> minor device number mapping; load nvidia_uvm module; mismatch of CUDA/driver 
> versions, etc.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6223) [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation on YARN

2017-04-01 Thread Zhankun Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15952132#comment-15952132
 ] 

Zhankun Tang commented on YARN-6223:


[~wangda], Yeah, I agree. It's a very good plan.

> [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation 
> on YARN
> 
>
> Key: YARN-6223
> URL: https://issues.apache.org/jira/browse/YARN-6223
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-6223.Natively-support-GPU-on-YARN-v1.pdf, 
> YARN-6223.wip.1.patch
>
>
> As varieties of workloads are moving to YARN, including machine learning / 
> deep learning which can speed up by leveraging GPU computation power. 
> Workloads should be able to request GPU from YARN as simple as CPU and memory.
> *To make a complete GPU story, we should support following pieces:*
> 1) GPU discovery/configuration: Admin can either config GPU resources and 
> architectures on each node, or more advanced, NodeManager can automatically 
> discover GPU resources and architectures and report to ResourceManager 
> 2) GPU scheduling: YARN scheduler should account GPU as a resource type just 
> like CPU and memory.
> 3) GPU isolation/monitoring: once launch a task with GPU resources, 
> NodeManager should properly isolate and monitor task's resource usage.
> For #2, YARN-3926 can support it natively. For #3, YARN-3611 has introduced 
> an extensible framework to support isolation for different resource types and 
> different runtimes.
> *Related JIRAs:*
> There're a couple of JIRAs (YARN-4122/YARN-5517) filed with similar goals but 
> different solutions:
> For scheduling:
> - YARN-4122/YARN-5517 are all adding a new GPU resource type to Resource 
> protocol instead of leveraging YARN-3926.
> For isolation:
> - And YARN-4122 proposed to use CGroups to do isolation which cannot solve 
> the problem listed at 
> https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation#challenges such as 
> minor device number mapping; load nvidia_uvm module; mismatch of CUDA/driver 
> versions, etc.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6223) [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation on YARN

2017-03-31 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15951989#comment-15951989
 ] 

Wangda Tan commented on YARN-6223:
--

[~tangzhankun], all great suggestions.

>From what you mentioned, there're at least 3 common things we can do to 
>support different resource types:
1) In RM scheduler, add special considerations to scarce resources like 
GPU/FPGA/SSD.
2) In NM side, have a common abstraction for resource discovery. 
3) Similarly, in NM side, have a common abstract for resource allocation and 
affinity (can help topology requirements such as NUMA/GPU interconnections, 
etc.).

I completely agree with directions you mentioned. For #1, like you said, we can 
improve it in the background and make it to be a longer term goal. For #2/#3, I 
think we can do it and refine design along with the GPU feature development.

Sounds like a plan? 


> [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation 
> on YARN
> 
>
> Key: YARN-6223
> URL: https://issues.apache.org/jira/browse/YARN-6223
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-6223.Natively-support-GPU-on-YARN-v1.pdf, 
> YARN-6223.wip.1.patch
>
>
> As varieties of workloads are moving to YARN, including machine learning / 
> deep learning which can speed up by leveraging GPU computation power. 
> Workloads should be able to request GPU from YARN as simple as CPU and memory.
> *To make a complete GPU story, we should support following pieces:*
> 1) GPU discovery/configuration: Admin can either config GPU resources and 
> architectures on each node, or more advanced, NodeManager can automatically 
> discover GPU resources and architectures and report to ResourceManager 
> 2) GPU scheduling: YARN scheduler should account GPU as a resource type just 
> like CPU and memory.
> 3) GPU isolation/monitoring: once launch a task with GPU resources, 
> NodeManager should properly isolate and monitor task's resource usage.
> For #2, YARN-3926 can support it natively. For #3, YARN-3611 has introduced 
> an extensible framework to support isolation for different resource types and 
> different runtimes.
> *Related JIRAs:*
> There're a couple of JIRAs (YARN-4122/YARN-5517) filed with similar goals but 
> different solutions:
> For scheduling:
> - YARN-4122/YARN-5517 are all adding a new GPU resource type to Resource 
> protocol instead of leveraging YARN-3926.
> For isolation:
> - And YARN-4122 proposed to use CGroups to do isolation which cannot solve 
> the problem listed at 
> https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation#challenges such as 
> minor device number mapping; load nvidia_uvm module; mismatch of CUDA/driver 
> versions, etc.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6223) [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation on YARN

2017-03-31 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15951985#comment-15951985
 ] 

Wangda Tan commented on YARN-6223:
--

[~grey],
[~hex108] mentioned sharing one GPU to multiple applications to me offline. 
[~hex108] could you add your thoughts here?

bq. We may need consider the scheduling of next level resource in GPU, at least 
not blocking future extension for next level resource scheduling. This is also 
related to the isolation part
Make sense and totally agree.

> [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation 
> on YARN
> 
>
> Key: YARN-6223
> URL: https://issues.apache.org/jira/browse/YARN-6223
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-6223.Natively-support-GPU-on-YARN-v1.pdf, 
> YARN-6223.wip.1.patch
>
>
> As varieties of workloads are moving to YARN, including machine learning / 
> deep learning which can speed up by leveraging GPU computation power. 
> Workloads should be able to request GPU from YARN as simple as CPU and memory.
> *To make a complete GPU story, we should support following pieces:*
> 1) GPU discovery/configuration: Admin can either config GPU resources and 
> architectures on each node, or more advanced, NodeManager can automatically 
> discover GPU resources and architectures and report to ResourceManager 
> 2) GPU scheduling: YARN scheduler should account GPU as a resource type just 
> like CPU and memory.
> 3) GPU isolation/monitoring: once launch a task with GPU resources, 
> NodeManager should properly isolate and monitor task's resource usage.
> For #2, YARN-3926 can support it natively. For #3, YARN-3611 has introduced 
> an extensible framework to support isolation for different resource types and 
> different runtimes.
> *Related JIRAs:*
> There're a couple of JIRAs (YARN-4122/YARN-5517) filed with similar goals but 
> different solutions:
> For scheduling:
> - YARN-4122/YARN-5517 are all adding a new GPU resource type to Resource 
> protocol instead of leveraging YARN-3926.
> For isolation:
> - And YARN-4122 proposed to use CGroups to do isolation which cannot solve 
> the problem listed at 
> https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation#challenges such as 
> minor device number mapping; load nvidia_uvm module; mismatch of CUDA/driver 
> versions, etc.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6223) [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation on YARN

2017-03-31 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15951986#comment-15951986
 ] 

Wangda Tan commented on YARN-6223:
--

[~Ying Zhang] sure thing :), could you take the uploaded doc/poc patch and let 
me know your suggestions?

Thanks,

> [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation 
> on YARN
> 
>
> Key: YARN-6223
> URL: https://issues.apache.org/jira/browse/YARN-6223
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-6223.Natively-support-GPU-on-YARN-v1.pdf, 
> YARN-6223.wip.1.patch
>
>
> As varieties of workloads are moving to YARN, including machine learning / 
> deep learning which can speed up by leveraging GPU computation power. 
> Workloads should be able to request GPU from YARN as simple as CPU and memory.
> *To make a complete GPU story, we should support following pieces:*
> 1) GPU discovery/configuration: Admin can either config GPU resources and 
> architectures on each node, or more advanced, NodeManager can automatically 
> discover GPU resources and architectures and report to ResourceManager 
> 2) GPU scheduling: YARN scheduler should account GPU as a resource type just 
> like CPU and memory.
> 3) GPU isolation/monitoring: once launch a task with GPU resources, 
> NodeManager should properly isolate and monitor task's resource usage.
> For #2, YARN-3926 can support it natively. For #3, YARN-3611 has introduced 
> an extensible framework to support isolation for different resource types and 
> different runtimes.
> *Related JIRAs:*
> There're a couple of JIRAs (YARN-4122/YARN-5517) filed with similar goals but 
> different solutions:
> For scheduling:
> - YARN-4122/YARN-5517 are all adding a new GPU resource type to Resource 
> protocol instead of leveraging YARN-3926.
> For isolation:
> - And YARN-4122 proposed to use CGroups to do isolation which cannot solve 
> the problem listed at 
> https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation#challenges such as 
> minor device number mapping; load nvidia_uvm module; mismatch of CUDA/driver 
> versions, etc.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6223) [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation on YARN

2017-03-30 Thread Zhankun Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15950199#comment-15950199
 ] 

Zhankun Tang commented on YARN-6223:


[~wangda], thanks for sharing the whole story. I agree that we should evolve 
based on YARN-3926. We've done a PoC on enabling FPGA as a first-class 
internally and would like to share our finding and propose more ideas to make 
YARN's new resource model more general and flexible.

Current YARN-3926 only considers non-exclusive resource like cpu, memory and 
network bandwidths. Resources like GPU, FPGA and disks are sort of exclusive 
resources to be resolved. In my opinion, below are additional things may need 
more discussion about exclusive resources:

On RM side,
1. Device resource may have extra attributes needs to be matched when 
scheduling. Not just simply add or reduce a number when "fitsIn". For instance, 
in our PoC, A FPGA slot in one node may already have one IP flashed so that the 
scheduler should try to match this IP attribute to reuse it. The YARN-2139 
proposal also mentioned the locality issue which is similar to FPGA IP reusing. 
And for network port, an individual network port or a range may be requested 
which may requires different scheduler behavior.
2. Are there similar requirements like FPGA that the scheduler should schedule 
a compromised/non-matched resource that needs extra operation in NM to make it 
usable? In detail, when an application request a FPGA slot with required IP 
description, the scheduler can choose a non-perfect matched FPGA slot based on 
policy and leave a hint to tell NM that this FPGA slot should be re-programmed 
before container launch. 

In one word, these exclusive resource seems requires the scheduler to consider 
both countable availability and device affinity to avoid potential contention 
and improve utilization. 

On NM side,
1. Device resource dynamically discovery and static configuration. This is 
vendor specific and we should have a plugin framework for different vendor to 
implement. The interfaces may consists "listDevice, monitorDeviceHealth"
2. In current YARN-3926 implementation, when a container allocated by scheduler 
and send to NM, NM seems should have a new component that handling sub-optimal 
scheduling of the resources in container. This new component will track nodes's 
exclusive resource and connect the virtual presentation resource in container 
to the real devices.
3. Device resource needs additional preparation and isolation before container 
launch. For instance, FPGA device may needs download a IP file from a repo then 
flash to an allocated FPGA slot. This is also vendor specific and should be 
pluggable. We can try extend the ResourceHandlerModule introduced by YARN-3366 
to achieve this.

I'm not quite sure that if we should try unify non-exclusive and exclusive 
resource in scheduler directly, but at least I think we should make it a long 
term goal.

A practical way is that we split it into two steps. Firstly, keep current 
scheduler feature unchanged, then finish the NM side local scheduler and plugin 
framework design to be a sub-optimal solution( we're now designing this ). 
Secondly,  remove NM side local scheduler and try unify all types of resource 
in a global scheduler. Thought?

> [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation 
> on YARN
> 
>
> Key: YARN-6223
> URL: https://issues.apache.org/jira/browse/YARN-6223
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>
> As varieties of workloads are moving to YARN, including machine learning / 
> deep learning which can speed up by leveraging GPU computation power. 
> Workloads should be able to request GPU from YARN as simple as CPU and memory.
> *To make a complete GPU story, we should support following pieces:*
> 1) GPU discovery/configuration: Admin can either config GPU resources and 
> architectures on each node, or more advanced, NodeManager can automatically 
> discover GPU resources and architectures and report to ResourceManager 
> 2) GPU scheduling: YARN scheduler should account GPU as a resource type just 
> like CPU and memory.
> 3) GPU isolation/monitoring: once launch a task with GPU resources, 
> NodeManager should properly isolate and monitor task's resource usage.
> For #2, YARN-3926 can support it natively. For #3, YARN-3611 has introduced 
> an extensible framework to support isolation for different resource types and 
> different runtimes.
> *Related JIRAs:*
> There're a couple of JIRAs (YARN-4122/YARN-5517) filed with similar goals but 
> different solutions:
> For scheduling:
> - YARN-4122/YARN-5517 are all adding a new GPU resource type to Resource 
> protocol instead 

[jira] [Commented] (YARN-6223) [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation on YARN

2017-03-07 Thread Ying Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900877#comment-15900877
 ] 

Ying Zhang commented on YARN-6223:
--

Hi [~wangda], we are interested in this and would like to contribute. Please 
let us know how we can evolve.:-)

> [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation 
> on YARN
> 
>
> Key: YARN-6223
> URL: https://issues.apache.org/jira/browse/YARN-6223
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>
> As varieties of workloads are moving to YARN, including machine learning / 
> deep learning which can speed up by leveraging GPU computation power. 
> Workloads should be able to request GPU from YARN as simple as CPU and memory.
> *To make a complete GPU story, we should support following pieces:*
> 1) GPU discovery/configuration: Admin can either config GPU resources and 
> architectures on each node, or more advanced, NodeManager can automatically 
> discover GPU resources and architectures and report to ResourceManager 
> 2) GPU scheduling: YARN scheduler should account GPU as a resource type just 
> like CPU and memory.
> 3) GPU isolation/monitoring: once launch a task with GPU resources, 
> NodeManager should properly isolate and monitor task's resource usage.
> For #2, YARN-3926 can support it natively. For #3, YARN-3611 has introduced 
> an extensible framework to support isolation for different resource types and 
> different runtimes.
> *Related JIRAs:*
> There're a couple of JIRAs (YARN-4122/YARN-5517) filed with similar goals but 
> different solutions:
> For scheduling:
> - YARN-4122/YARN-5517 are all adding a new GPU resource type to Resource 
> protocol instead of leveraging YARN-3926.
> For isolation:
> - And YARN-4122 proposed to use CGroups to do isolation which cannot solve 
> the problem listed at 
> https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation#challenges such as 
> minor device number mapping; load nvidia_uvm module; mismatch of CUDA/driver 
> versions, etc.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6223) [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation on YARN

2017-02-23 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15881907#comment-15881907
 ] 

Vinod Kumar Vavilapalli commented on YARN-6223:
---

bq. I had a clarification regarding GPU scheduling - are you planning to use 
YARN-3926?
bq. 1. I feel the alternate approaches (YARN-4122/YARN-5517) are neither 
generic nor scaleable solutions.
bq. YARN-4122/YARN-5517 are all adding a new GPU resource type to Resource 
protocol instead of leveraging YARN-3926.
Agreed, we should stop hard-cording more resources into the core and instead 
add them 'on the surface' in a pluggable manner.

> [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation 
> on YARN
> 
>
> Key: YARN-6223
> URL: https://issues.apache.org/jira/browse/YARN-6223
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>
> As varieties of workloads are moving to YARN, including machine learning / 
> deep learning which can speed up by leveraging GPU computation power. 
> Workloads should be able to request GPU from YARN as simple as CPU and memory.
> *To make a complete GPU story, we should support following pieces:*
> 1) GPU discovery/configuration: Admin can either config GPU resources and 
> architectures on each node, or more advanced, NodeManager can automatically 
> discover GPU resources and architectures and report to ResourceManager 
> 2) GPU scheduling: YARN scheduler should account GPU as a resource type just 
> like CPU and memory.
> 3) GPU isolation/monitoring: once launch a task with GPU resources, 
> NodeManager should properly isolate and monitor task's resource usage.
> For #2, YARN-3926 can support it natively. For #3, YARN-3611 has introduced 
> an extensible framework to support isolation for different resource types and 
> different runtimes.
> *Related JIRAs:*
> There're a couple of JIRAs (YARN-4122/YARN-5517) filed with similar goals but 
> different solutions:
> For scheduling:
> - YARN-4122/YARN-5517 are all adding a new GPU resource type to Resource 
> protocol instead of leveraging YARN-3926.
> For isolation:
> - And YARN-4122 proposed to use CGroups to do isolation which cannot solve 
> the problem listed at 
> https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation#challenges such as 
> minor device number mapping; load nvidia_uvm module; mismatch of CUDA/driver 
> versions, etc.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6223) [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation on YARN

2017-02-23 Thread Lei Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15880655#comment-15880655
 ] 

Lei Guo commented on YARN-6223:
---

[~leftnoteasy], based on the description "YARN scheduler should account GPU as 
a resource type just like CPU and memory", seems we are planning to consider 
the number of GPU as the resource. Depending on the GPU type, driver and 
toolkit, the sharing of one GPU among multiple applications could be important. 
During some discussion with domain expert, they do need the sharing of single 
GPU among applications, and do see the context switch overhead during the 
sharing. If the scheduling can let application know which core/memory to use, 
they can avoid some level context switch. 

We may need consider the scheduling of next level resource in GPU, at least not 
blocking future extension for next level resource scheduling. This is also 
related to the isolation part 

> [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation 
> on YARN
> 
>
> Key: YARN-6223
> URL: https://issues.apache.org/jira/browse/YARN-6223
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>
> As varieties of workloads are moving to YARN, including machine learning / 
> deep learning which can speed up by leveraging GPU computation power. 
> Workloads should be able to request GPU from YARN as simple as CPU and memory.
> *To make a complete GPU story, we should support following pieces:*
> 1) GPU discovery/configuration: Admin can either config GPU resources and 
> architectures on each node, or more advanced, NodeManager can automatically 
> discover GPU resources and architectures and report to ResourceManager 
> 2) GPU scheduling: YARN scheduler should account GPU as a resource type just 
> like CPU and memory.
> 3) GPU isolation/monitoring: once launch a task with GPU resources, 
> NodeManager should properly isolate and monitor task's resource usage.
> For #2, YARN-3926 can support it natively. For #3, YARN-3611 has introduced 
> an extensible framework to support isolation for different resource types and 
> different runtimes.
> *Related JIRAs:*
> There're a couple of JIRAs (YARN-4122/YARN-5517) filed with similar goals but 
> different solutions:
> For scheduling:
> - YARN-4122/YARN-5517 are all adding a new GPU resource type to Resource 
> protocol instead of leveraging YARN-3926.
> For isolation:
> - And YARN-4122 proposed to use CGroups to do isolation which cannot solve 
> the problem listed at 
> https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation#challenges such as 
> minor device number mapping; load nvidia_uvm module; mismatch of CUDA/driver 
> versions, etc.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6223) [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation on YARN

2017-02-22 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15879797#comment-15879797
 ] 

Wangda Tan commented on YARN-6223:
--

bq. Thanks Wangda Tan for initiating this. I had a clarification regarding GPU 
scheduling - are you planning to use YARN-3926?
Will add link to YARN-3926 to this JIRA.


> [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation 
> on YARN
> 
>
> Key: YARN-6223
> URL: https://issues.apache.org/jira/browse/YARN-6223
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>
> As varieties of workloads are moving to YARN, including machine learning / 
> deep learning which can speed up by leveraging GPU computation power. 
> Workloads should be able to request GPU from YARN as simple as CPU and memory.
> *To make a complete GPU story, we should support following pieces:*
> 1) GPU discovery/configuration: Admin can either config GPU resources and 
> architectures on each node, or more advanced, NodeManager can automatically 
> discover GPU resources and architectures and report to ResourceManager 
> 2) GPU scheduling: YARN scheduler should account GPU as a resource type just 
> like CPU and memory.
> 3) GPU isolation/monitoring: once launch a task with GPU resources, 
> NodeManager should properly isolate and monitor task's resource usage.
> For #2, YARN-3926 can support it natively. For #3, YARN-3611 has introduced 
> an extensible framework to support isolation for different resource types and 
> different runtimes.
> *Related JIRAs:*
> There're a couple of JIRAs (YARN-4122/YARN-5517) filed with similar goals but 
> different solutions:
> For scheduling:
> - YARN-4122/YARN-5517 are all adding a new GPU resource type to Resource 
> protocol instead of leveraging YARN-3926.
> For isolation:
> - And YARN-4122 proposed to use CGroups to do isolation which cannot solve 
> the problem listed at 
> https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation#challenges such as 
> minor device number mapping; load nvidia_uvm module; mismatch of CUDA/driver 
> versions, etc.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6223) [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation on YARN

2017-02-22 Thread Subru Krishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15879794#comment-15879794
 ] 

Subru Krishnan commented on YARN-6223:
--

Thanks [~leftnoteasy] for initiating this. I had a clarification regarding GPU 
scheduling - are you planning to use YARN-3926?
We are planning to use it because:
  1. I feel the alternate approaches (YARN-4122/YARN-5517) are neither generic 
nor scaleable solutions.
  2. Moreover YARN-3926 seems to be close to completion.

Thoughts? cc [~vvasudev], [~asuresh].

> [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation 
> on YARN
> 
>
> Key: YARN-6223
> URL: https://issues.apache.org/jira/browse/YARN-6223
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>
> As varieties of workloads are moving to YARN, including machine learning / 
> deep learning which can speed up by leveraging GPU computation power. 
> Workloads should be able to request GPU from YARN as simple as CPU and memory.
> *To make a complete GPU story, we should support following pieces:*
> 1) GPU discovery/configuration: Admin can either config GPU resources and 
> architectures on each node, or more advanced, NodeManager can automatically 
> discover GPU resources and architectures and report to ResourceManager 
> 2) GPU scheduling: YARN scheduler should account GPU as a resource type just 
> like CPU and memory.
> 3) GPU isolation/monitoring: once launch a task with GPU resources, 
> NodeManager should properly isolate and monitor task's resource usage.
> For #2, YARN-3926 can support it natively. For #3, YARN-3611 has introduced 
> an extensible framework to support isolation for different resource types and 
> different runtimes.
> *Related JIRAs:*
> There're a couple of JIRAs (YARN-4122/YARN-5517) filed with similar goals but 
> different solutions:
> For scheduling:
> - YARN-4122/YARN-5517 are all adding a new GPU resource type to Resource 
> protocol instead of leveraging YARN-3926.
> For isolation:
> - And YARN-4122 proposed to use CGroups to do isolation which cannot solve 
> the problem listed at 
> https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation#challenges such as 
> minor device number mapping; load nvidia_uvm module; mismatch of CUDA/driver 
> versions, etc.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6223) [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation on YARN

2017-02-22 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15879610#comment-15879610
 ] 

Wangda Tan commented on YARN-6223:
--

We're working on a prototype of the feature and will post design doc soon.

> [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation 
> on YARN
> 
>
> Key: YARN-6223
> URL: https://issues.apache.org/jira/browse/YARN-6223
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>
> As varieties of workloads are moving to YARN, including machine learning / 
> deep learning which can speed up by leveraging GPU computation power. 
> Workloads should be able to request GPU from YARN as simple as CPU and memory.
> *To make a complete GPU story, we should support following pieces:*
> 1) GPU discovery/configuration: Admin can either config GPU resources and 
> architectures on each node, or more advanced, NodeManager can automatically 
> discover GPU resources and architectures and report to ResourceManager 
> 2) GPU scheduling: YARN scheduler should account GPU as a resource type just 
> like CPU and memory.
> 3) GPU isolation/monitoring: once launch a task with GPU resources, 
> NodeManager should properly isolate and monitor task's resource usage.
> For #2, YARN-3926 can support it natively. For #3, YARN-3611 has introduced 
> an extensible framework to support isolation for different resource types and 
> different runtimes.
> *Related JIRAs:*
> There're a couple of JIRAs (YARN-4122/YARN-5517) filed with similar goals but 
> different solutions:
> For scheduling:
> - YARN-4122/YARN-5517 are all adding a new GPU resource type to Resource 
> protocol instead of leveraging YARN-3926.
> For isolation:
> - And YARN-4122 proposed to use CGroups to do isolation which cannot solve 
> the problem listed at 
> https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation#challenges such as 
> minor device number mapping; load nvidia_uvm module; mismatch of CUDA/driver 
> versions, etc.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org