[
https://issues.apache.org/jira/browse/YARN-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15951989#comment-15951989
]
Wangda Tan commented on YARN-6223:
----------------------------------
[~tangzhankun], all great suggestions.
From what you mentioned, there are at least 3 common things we can do to
support different resource types:
1) In the RM scheduler, add special considerations for scarce resources like
GPU/FPGA/SSD.
2) On the NM side, have a common abstraction for resource discovery.
3) Similarly, on the NM side, have a common abstraction for resource allocation
and affinity (can help with topology requirements such as NUMA/GPU
interconnections, etc.).
I completely agree with the directions you mentioned. For #1, like you said, we
can improve it in the background and make it a longer-term goal. For #2/#3, I
think we can do it and refine the design along with the GPU feature development.
Sounds like a plan?
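
To make items 2 and 3 above concrete, here is a minimal sketch of what a common
NM-side abstraction could look like. Every name in it (ResourceDiscoverer,
StaticDiscoverer, DeviceAllocator) is hypothetical and is not an existing YARN
API; it assumes devices are identified by integer ids and ignores affinity and
topology entirely.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: none of these types exist in YARN.

/** Item 2: a per-resource-type plugin that reports the devices found on a node. */
interface ResourceDiscoverer {
    String resourceType();              // e.g. "gpu" or "fpga"
    List<Integer> discoverDeviceIds();  // ids of the devices present on this node
}

/** Simplest discoverer: the admin lists the devices by hand. */
final class StaticDiscoverer implements ResourceDiscoverer {
    private final String type;
    private final List<Integer> ids;

    StaticDiscoverer(String type, List<Integer> ids) {
        this.type = type;
        this.ids = new ArrayList<>(ids);
    }

    @Override public String resourceType() { return type; }
    @Override public List<Integer> discoverDeviceIds() { return new ArrayList<>(ids); }
}

/** Item 3: tracks which container owns which discovered device. */
final class DeviceAllocator {
    // deviceId -> owning containerId, or null when the device is free
    private final Map<Integer, String> owners = new LinkedHashMap<>();

    DeviceAllocator(ResourceDiscoverer discoverer) {
        for (int id : discoverer.discoverDeviceIds()) {
            owners.put(id, null);
        }
    }

    /** Assigns `count` free devices to the container, or throws if too few are free. */
    synchronized List<Integer> allocate(String containerId, int count) {
        List<Integer> picked = new ArrayList<>();
        for (Map.Entry<Integer, String> e : owners.entrySet()) {
            if (picked.size() == count) {
                break;
            }
            if (e.getValue() == null) {
                picked.add(e.getKey());
            }
        }
        if (picked.size() < count) {
            throw new IllegalStateException("not enough free devices");
        }
        for (int id : picked) {
            owners.put(id, containerId);
        }
        return picked;
    }

    /** Frees every device held by the container. */
    synchronized void release(String containerId) {
        for (Map.Entry<Integer, String> e : owners.entrySet()) {
            if (containerId.equals(e.getValue())) {
                e.setValue(null);
            }
        }
    }
}
```

An automatic discoverer (for example one that queries the driver) would plug in
behind the same interface, and an affinity-aware allocator could replace the
first-free policy used here without changing its callers.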
> [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation
> on YARN
> ------------------------------------------------------------------------------------
>
> Key: YARN-6223
> URL: https://issues.apache.org/jira/browse/YARN-6223
> Project: Hadoop YARN
> Issue Type: New Feature
> Reporter: Wangda Tan
> Assignee: Wangda Tan
> Attachments: YARN-6223.Natively-support-GPU-on-YARN-v1.pdf,
> YARN-6223.wip.1.patch
>
>
> As a variety of workloads move to YARN, including machine learning / deep
> learning workloads that can be sped up by leveraging GPU computation power,
> workloads should be able to request GPU from YARN as simply as CPU and memory.
> *To make a complete GPU story, we should support the following pieces:*
> 1) GPU discovery/configuration: Admin can either configure GPU resources and
> architectures on each node or, more advanced, NodeManager can automatically
> discover GPU resources and architectures and report them to ResourceManager.
> 2) GPU scheduling: YARN scheduler should account for GPU as a resource type
> just like CPU and memory.
> 3) GPU isolation/monitoring: once a task is launched with GPU resources,
> NodeManager should properly isolate and monitor the task's resource usage.
> For #2, YARN-3926 can support it natively. For #3, YARN-3611 has introduced
> an extensible framework to support isolation for different resource types and
> different runtimes.
> *Related JIRAs:*
> There are a couple of JIRAs (YARN-4122/YARN-5517) filed with similar goals but
> different solutions:
> For scheduling:
> - YARN-4122/YARN-5517 both add a new GPU resource type to the Resource
> protocol instead of leveraging YARN-3926.
> For isolation:
> - YARN-4122 proposed to use CGroups for isolation, which cannot solve the
> problems listed at
> https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation#challenges such as
> minor device number mapping, loading the nvidia_uvm module, mismatch of
> CUDA/driver versions, etc.