[
https://issues.apache.org/jira/browse/YARN-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15951985#comment-15951985
]
Wangda Tan commented on YARN-6223:
----------------------------------
[~grey],
[~hex108] mentioned to me offline the idea of sharing one GPU among multiple applications.
[~hex108] could you add your thoughts here?
bq. We may need to consider scheduling of next-level resources within a GPU, or
at least not block future extension for next-level resource scheduling. This is
also related to the isolation part.
Makes sense, and I totally agree.
> [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation
> on YARN
> ------------------------------------------------------------------------------------
>
> Key: YARN-6223
> URL: https://issues.apache.org/jira/browse/YARN-6223
> Project: Hadoop YARN
> Issue Type: New Feature
> Reporter: Wangda Tan
> Assignee: Wangda Tan
> Attachments: YARN-6223.Natively-support-GPU-on-YARN-v1.pdf,
> YARN-6223.wip.1.patch
>
>
> A variety of workloads are moving to YARN, including machine learning /
> deep learning workloads that can be sped up by leveraging GPU computation power.
> Workloads should be able to request GPUs from YARN as simply as CPU and memory.
> *To make a complete GPU story, we should support the following pieces:*
> 1) GPU discovery/configuration: Either an admin configures GPU resources and
> architectures on each node or, as a more advanced option, NodeManager automatically
> discovers GPU resources and architectures and reports them to ResourceManager.
> 2) GPU scheduling: The YARN scheduler should account for GPU as a resource type just
> like CPU and memory.
> 3) GPU isolation/monitoring: Once a task is launched with GPU resources,
> NodeManager should properly isolate and monitor the task's resource usage.
> For #2, YARN-3926 can support it natively. For #3, YARN-3611 has introduced
> an extensible framework to support isolation for different resource types and
> different runtimes.
> *Related JIRAs:*
> There are a couple of JIRAs (YARN-4122/YARN-5517) filed with similar goals but
> different solutions:
> For scheduling:
> - YARN-4122/YARN-5517 both add a new GPU resource type to the Resource
> protocol instead of leveraging YARN-3926.
> For isolation:
> - YARN-4122 proposed using CGroups for isolation, which cannot solve
> the problems listed at
> https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation#challenges such as
> minor device number mapping, loading the nvidia_uvm module, and mismatched
> CUDA/driver versions.