[
https://issues.apache.org/jira/browse/YARN-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15950199#comment-15950199
]
Zhankun Tang commented on YARN-6223:
------------------------------------
[~wangda], thanks for sharing the whole story. I agree that we should evolve
based on YARN-3926. We've done an internal PoC on enabling FPGA as a
first-class resource and would like to share our findings and propose some
ideas to make YARN's new resource model more general and flexible.
Currently YARN-3926 only considers non-exclusive resources like CPU, memory
and network bandwidth. Resources like GPU, FPGA and disks are exclusive
resources that still need to be addressed. In my opinion, the following points
about exclusive resources need more discussion:
On the RM side,
1. Device resources may have extra attributes that need to be matched during
scheduling, rather than simply adding or subtracting a number in "fitsIn". For
instance, in our PoC, an FPGA slot on a node may already have an IP flashed,
so the scheduler should try to match this IP attribute to reuse it (see the
sketch after the summary below). The YARN-2139 proposal also mentioned a
locality issue similar to FPGA IP reuse. For network ports, an individual port
or a port range may be requested, which may require different scheduler
behavior.
2. Are there other requirements, like FPGA's, where the scheduler should
schedule a compromised/non-matching resource that needs an extra operation on
the NM to make it usable? In detail, when an application requests an FPGA slot
with a required IP description, the scheduler can choose a non-perfectly
matched FPGA slot based on policy and leave a hint telling the NM that this
FPGA slot should be re-programmed before container launch.
In short, these exclusive resources seem to require the scheduler to consider
both countable availability and device affinity, to avoid potential contention
and improve utilization.
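To make the attribute-matching and re-programming-hint ideas above concrete,
here is a minimal sketch in Java. All class and method names are hypothetical
illustrations of the proposal, not existing YARN APIs:
{code:java}
import java.util.Objects;

/**
 * Hypothetical sketch only: attribute-aware matching for exclusive devices.
 * Illustrates that "fitsIn" must consider a device attribute (the IP
 * currently flashed on an FPGA slot), not just a resource count, and how
 * the scheduler could leave a re-program hint for the NM.
 */
public class FpgaSlotMatcher {

  /** One FPGA slot on a node; flashedIp is null if the slot is blank. */
  static class FpgaSlot {
    final String slotId;
    final String flashedIp;
    FpgaSlot(String slotId, String flashedIp) {
      this.slotId = slotId;
      this.flashedIp = flashedIp;
    }
  }

  /** The chosen slot plus a hint telling the NM to re-program it first. */
  static class MatchResult {
    final FpgaSlot slot;
    final boolean needsReprogram;
    MatchResult(FpgaSlot slot, boolean needsReprogram) {
      this.slot = slot;
      this.needsReprogram = needsReprogram;
    }
  }

  /**
   * Prefer a free slot whose flashed IP already matches the request so the
   * IP can be reused; otherwise fall back to any free slot and hint that
   * the NM must re-program it before container launch.
   */
  static MatchResult match(Iterable<FpgaSlot> freeSlots, String requestedIp) {
    FpgaSlot fallback = null;
    for (FpgaSlot slot : freeSlots) {
      if (Objects.equals(slot.flashedIp, requestedIp)) {
        return new MatchResult(slot, false); // perfect match, reuse the IP
      }
      if (fallback == null) {
        fallback = slot; // remember a compromise candidate
      }
    }
    return fallback == null ? null : new MatchResult(fallback, true);
  }
}
{code}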
On the NM side,
1. Dynamic discovery and static configuration of device resources. This is
vendor specific, and we should provide a plugin framework for different
vendors to implement. The interface may consist of methods such as
"listDevices" and "monitorDeviceHealth" (see the strawman sketch after this
list).
2. In the current YARN-3926 implementation, when a container is allocated by
the scheduler and sent to the NM, the NM seems to need a new component that
handles sub-optimal scheduling of the resources in the container. This new
component would track the node's exclusive resources and map the virtual
resource representation in the container to the real devices.
3. Device resources need additional preparation and isolation before container
launch. For instance, an FPGA device may need to download an IP file from a
repository and then flash it to the allocated FPGA slot. This is also vendor
specific and should be pluggable. We can try extending the
ResourceHandlerModule introduced by YARN-3366 to achieve this.
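As a strawman for the plugin framework mentioned in points 1 and 3 above, the
vendor-facing contract might look like the interface below. This is our
proposal sketch only; no such interface exists in YARN today, and the method
names merely extend the "listDevices"/"monitorDeviceHealth" idea:
{code:java}
import java.util.List;

/**
 * Strawman vendor plugin interface for exclusive device resources.
 * Proposal sketch only; not an existing YARN API.
 */
public interface DeviceResourcePlugin {

  /** Vendor-neutral handle for one physical device, e.g. an FPGA slot. */
  final class Device {
    public final String deviceId; // e.g. a PCI address or slot id
    public final boolean healthy;
    public Device(String deviceId, boolean healthy) {
      this.deviceId = deviceId;
      this.healthy = healthy;
    }
  }

  /** Point 1: dynamically discover the devices on this node at NM start. */
  List<Device> listDevices();

  /** Point 1: periodic health check, reported to the RM via heartbeat. */
  boolean monitorDeviceHealth(String deviceId);

  /**
   * Point 3: vendor-specific preparation before container launch, e.g.
   * download an IP file from a repository and flash it to the allocated
   * FPGA slot when the scheduler left a re-program hint.
   */
  void prepareDevice(String deviceId, String requiredIp) throws Exception;
}
{code}
A ResourceHandler-style hook (as in YARN-3366) could then delegate to
prepareDevice during container launch.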
I'm not quite sure whether we should try to unify non-exclusive and exclusive
resources in the scheduler directly, but I think we should at least make that
a long-term goal.
A practical way is to split the work into two steps. First, keep the current
scheduler features unchanged and finish the NM-side local scheduler and plugin
framework design as a sub-optimal solution (we're designing this now). Second,
remove the NM-side local scheduler and try to unify all resource types in a
global scheduler. Thoughts?
> [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation
> on YARN
> ------------------------------------------------------------------------------------
>
> Key: YARN-6223
> URL: https://issues.apache.org/jira/browse/YARN-6223
> Project: Hadoop YARN
> Issue Type: New Feature
> Reporter: Wangda Tan
> Assignee: Wangda Tan
>
> A variety of workloads are moving to YARN, including machine learning /
> deep learning workloads that can be sped up by leveraging GPU computation
> power. Workloads should be able to request GPUs from YARN as simply as CPU
> and memory.
> *To make a complete GPU story, we should support the following pieces:*
> 1) GPU discovery/configuration: Admins can either configure GPU resources
> and architectures on each node or, more advanced, the NodeManager can
> automatically discover GPU resources and architectures and report them to
> the ResourceManager.
> 2) GPU scheduling: the YARN scheduler should account for GPUs as a resource
> type just like CPU and memory.
> 3) GPU isolation/monitoring: once a task with GPU resources is launched, the
> NodeManager should properly isolate and monitor the task's resource usage.
> For #2, YARN-3926 can support it natively. For #3, YARN-3611 has introduced
> an extensible framework to support isolation for different resource types and
> different runtimes.
> *Related JIRAs:*
> There're a couple of JIRAs (YARN-4122/YARN-5517) filed with similar goals but
> different solutions:
> For scheduling:
> - YARN-4122/YARN-5517 both add a new GPU resource type to the Resource
> protocol instead of leveraging YARN-3926.
> For isolation:
> - YARN-4122 proposed using CGroups for isolation, which cannot solve the
> problems listed at
> https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation#challenges, such
> as minor device number mapping, loading the nvidia_uvm module, and
> mismatched CUDA/driver versions.