Wangda Tan created YARN-6223:
--------------------------------
Summary: [Umbrella] Natively support GPU
configuration/discovery/scheduling/isolation on YARN
Key: YARN-6223
URL: https://issues.apache.org/jira/browse/YARN-6223
Project: Hadoop YARN
Issue Type: New Feature
Reporter: Wangda Tan
Assignee: Wangda Tan
A growing variety of workloads is moving to YARN, including machine learning /
deep learning workloads that can be sped up by leveraging GPU computation power.
Workloads should be able to request GPUs from YARN as simply as they request CPU
and memory.
To make a complete GPU story, we should support the following pieces:
1) GPU discovery/configuration: an admin can either configure GPU resources and
architectures on each node or, in the more advanced case, the NodeManager can
automatically discover GPU resources and architectures and report them to the
ResourceManager.
2) GPU scheduling: the YARN scheduler should account for GPU as a resource type,
just like CPU and memory.
3) GPU isolation/monitoring: once a task is launched with GPU resources, the
NodeManager should properly isolate and monitor the task's resource usage.
For #2, YARN-3926 can support it natively. For #3, YARN-3611 has introduced an
extensible framework to support isolation for different resource types and
different runtimes.
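
As a rough illustration of piece #2, the sketch below models resources as a map
from a type name to an amount, so that GPU is accounted for in the same way as
CPU and memory. This is not YARN code and not the YARN-3926 API: the class
ResourceVectorSketch, its methods, and the type names "memory-mb", "vcores" and
"gpu" are all hypothetical and used only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not YARN code: a resource vector keyed by type name,
// so a new type such as "gpu" needs no change to the accounting logic.
final class ResourceVectorSketch {
    private final Map<String, Long> amounts = new HashMap<>();

    // Sets the amount for one resource type and returns this for chaining.
    ResourceVectorSketch set(String type, long amount) {
        if (amount < 0) {
            throw new IllegalArgumentException("negative amount for " + type);
        }
        amounts.put(type, amount);
        return this;
    }

    // Amount available for a type; a type that was never set counts as 0.
    long get(String type) {
        return amounts.getOrDefault(type, 0L);
    }

    // True if this vector has at least the requested amount of every type.
    boolean fits(ResourceVectorSketch request) {
        for (Map.Entry<String, Long> e : request.amounts.entrySet()) {
            if (get(e.getKey()) < e.getValue()) {
                return false;
            }
        }
        return true;
    }

    // Deducts the request from this vector, or fails if it does not fit.
    void allocate(ResourceVectorSketch request) {
        if (!fits(request)) {
            throw new IllegalStateException("insufficient resources");
        }
        for (Map.Entry<String, Long> e : request.amounts.entrySet()) {
            amounts.merge(e.getKey(), -e.getValue(), Long::sum);
        }
    }

    public static void main(String[] args) {
        // Assumed node capacity: 64 GB of memory, 16 vcores, 2 GPUs.
        ResourceVectorSketch node = new ResourceVectorSketch()
            .set("memory-mb", 65536).set("vcores", 16).set("gpu", 2);
        ResourceVectorSketch request = new ResourceVectorSketch()
            .set("memory-mb", 8192).set("vcores", 4).set("gpu", 2);

        System.out.println(node.fits(request)); // true
        node.allocate(request);
        System.out.println(node.fits(request)); // false: no GPUs left
    }
}
```

The point of the design is that the fit check and the deduction never mention
GPU by name, which is the property a generic resource-type mechanism gives the
scheduler.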
There are a couple of JIRAs (YARN-4122/YARN-5517) filed with similar goals but
different solutions:
For scheduling:
- YARN-4122 and YARN-5517 both add a new GPU resource type to the Resource
protocol instead of leveraging YARN-3926.
For isolation:
- YARN-4122 proposed using CGroups for isolation, which cannot solve the
problems listed at
https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation#challenges such as
minor device number mapping, loading the nvidia_uvm module, and mismatched
CUDA/driver versions.
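
To make the minor device number concern concrete, here is a minimal sketch of
what a CGroups-only approach has to produce: entries for the cgroup v1 devices
controller (the "c MAJOR:MINOR rwm" form written to devices.deny) for every GPU
the container was not assigned. The class and method names are hypothetical,
and the sketch assumes that /dev/nvidiaN is a character device with major
number 195 and that the caller already knows each GPU's minor number. It does
nothing about loading nvidia_uvm or about CUDA/driver version mismatches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not YARN code: build devices.deny entries for the GPUs
// that a container must not be able to access.
final class GpuDeviceDenySketch {
    // Assumption: character-device major number of /dev/nvidiaN on Linux.
    static final int NVIDIA_MAJOR = 195;

    // Returns one deny entry per GPU minor number that is not in allowedMinors.
    static List<String> denyEntries(List<Integer> allMinors, Set<Integer> allowedMinors) {
        List<String> entries = new ArrayList<>();
        for (int minor : allMinors) {
            if (!allowedMinors.contains(minor)) {
                // "c" = character device, "rwm" = read, write and mknod access.
                entries.add("c " + NVIDIA_MAJOR + ":" + minor + " rwm");
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        // Assumed node with four GPUs; the container was assigned minor 1 only.
        List<Integer> allMinors = Arrays.asList(0, 1, 2, 3);
        Set<Integer> allowed = Collections.singleton(1);

        // Prints [c 195:0 rwm, c 195:2 rwm, c 195:3 rwm]
        System.out.println(denyEntries(allMinors, allowed));
    }
}
```

The entries are only correct if the minor numbers passed in match the GPUs the
scheduler actually assigned, which is exactly the mapping problem named above.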