[ 
https://issues.apache.org/jira/browse/YARN-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15950199#comment-15950199
 ] 

Zhankun Tang edited comment on YARN-6223 at 3/31/17 8:39 AM:
-------------------------------------------------------------

[~wangda], thanks for sharing the whole story. I agree that we should evolve 
based on YARN-3926. We've done an internal PoC on enabling FPGA as a 
first-class resource and would like to share our findings and propose more 
ideas to make YARN's new resource model more general and flexible.

The current YARN-3926 only considers non-exclusive resources like CPU, memory 
and network bandwidth. Resources like GPU, FPGA and disks are exclusive 
resources that still need to be resolved. In my opinion, the following points 
about exclusive resources may need more discussion:

On the RM side,
1. A device resource may have extra attributes that need to be matched during 
scheduling, rather than simply adding or subtracting a number in "fitsIn". For 
instance, in our PoC, an FPGA slot on one node may already have an IP flashed, 
so the scheduler should try to match this IP attribute and reuse it. The 
YARN-2139 proposal also mentioned a locality issue which is similar to FPGA IP 
reuse. And for network ports, an individual port or a range of ports may be 
requested, which may require different scheduler behavior.
2. Are there other resources like FPGA where the scheduler should deliberately 
schedule a compromised/non-matching device that needs extra operations in the 
NM to become usable? Concretely, when an application requests an FPGA slot 
with a required IP description, the scheduler can choose an imperfectly 
matched FPGA slot based on policy and leave a hint telling the NM that this 
FPGA slot should be re-programmed before container launch.

In short, these exclusive resources seem to require the scheduler to consider 
both countable availability and device affinity to avoid potential contention 
and improve utilization. A rough sketch of such matching follows.
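
To make this concrete, here is a minimal sketch of what attribute-aware 
matching with a re-programming hint could look like. All the names below 
(FpgaSlot, FpgaRequest, MatchResult, FpgaMatcher) are hypothetical, invented 
for illustration; none of them are existing YARN classes:

{code:java}
import java.util.Objects;

// EXACT: slot already has the requested IP, reuse it.
// NEEDS_REPROGRAM: slot is free but must be re-flashed before launch.
enum MatchResult { EXACT, NEEDS_REPROGRAM, NO_FIT }

final class FpgaSlot {
  final String flashedIpId;   // IP currently flashed on this slot, may be null
  final boolean free;
  FpgaSlot(String flashedIpId, boolean free) {
    this.flashedIpId = flashedIpId;
    this.free = free;
  }
}

final class FpgaRequest {
  final String requiredIpId;  // IP the application needs
  FpgaRequest(String requiredIpId) { this.requiredIpId = requiredIpId; }
}

final class FpgaMatcher {
  static MatchResult match(FpgaRequest req, FpgaSlot slot) {
    if (!slot.free) {
      return MatchResult.NO_FIT;
    }
    if (Objects.equals(req.requiredIpId, slot.flashedIpId)) {
      return MatchResult.EXACT;  // reuse the already-flashed IP
    }
    // Free but flashed with a different IP: still schedulable under a
    // "compromise" policy, with a hint so the NM re-programs the slot
    // before container launch.
    return MatchResult.NEEDS_REPROGRAM;
  }
}
{code}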

On the NM side,
1. Dynamic discovery and static configuration of device resources. This is 
vendor specific, so we should have a plugin framework for different vendors 
to implement; the interface might consist of methods like "listDevice" and 
"monitorDeviceHealth" (see the sketch after this list).
2. In the current YARN-3926 implementation, when a container is allocated by 
the scheduler and sent to the NM, the NM seems to need a new component (or an 
extended ContainerScheduler) that handles the local scheduling of the 
resources in the container. This new component would track the node's 
exclusive resources and connect the virtual resource representation in the 
container to the real devices.
3. Device resources need additional preparation and isolation before 
container launch. For instance, an FPGA device may need to download an IP file 
from a repository and then flash it onto the allocated FPGA slot. This is also 
vendor specific and should be pluggable; we can try extending the 
ResourceHandlerModule introduced by YARN-3366 to achieve this.
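
Here is a minimal sketch of what such a vendor plugin interface might look 
like, covering discovery (point 1) and pre-launch preparation (point 3). The 
interface and supporting types (DeviceResourcePlugin, Device, DeviceHealth, 
ContainerHint) are hypothetical names for illustration, not existing YARN 
APIs:

{code:java}
import java.util.List;

// Hypothetical vendor plugin SPI: each vendor (FPGA, GPU, ...) would
// ship an implementation loaded by the NM.
interface DeviceResourcePlugin {

  /** Discover devices on this node, e.g. at NM start-up (point 1). */
  List<Device> listDevices();

  /** Report device health so the NM can exclude unhealthy devices. */
  DeviceHealth monitorDeviceHealth(Device device);

  /**
   * Vendor-specific preparation before container launch (point 3),
   * e.g. download an IP file from a repo and flash it onto the
   * allocated FPGA slot, driven by the scheduler's hint.
   */
  void prepareDevice(Device device, ContainerHint hint) throws Exception;
}

// Minimal supporting types so the sketch is self-contained.
final class Device {
  final String id;      // e.g. a PCI address or vendor device id
  Device(String id) { this.id = id; }
}

enum DeviceHealth { HEALTHY, UNHEALTHY }

final class ContainerHint {
  final String detail;  // e.g. the scheduler's "re-program to IP X" hint
  ContainerHint(String detail) { this.detail = detail; }
}
{code}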

I'm not quite sure whether we should try to unify non-exclusive and exclusive 
resources in the scheduler directly, but I think we should at least make it a 
long-term goal.

A practical way is to split the work into two steps. First, keep the current 
scheduler features unchanged and finish the NM-side local resource scheduler 
and plugin framework design as a sub-optimal interim solution (we're designing 
this now). Second, remove the NM-side local resource scheduler and try to 
unify all resource types in a global scheduler. Thoughts?


> [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation 
> on YARN
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-6223
>                 URL: https://issues.apache.org/jira/browse/YARN-6223
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Wangda Tan
>            Assignee: Wangda Tan
>
> A variety of workloads are moving to YARN, including machine learning / 
> deep learning, which can be sped up by leveraging GPU computation power. 
> Workloads should be able to request GPUs from YARN as simply as CPU and 
> memory.
> *To make a complete GPU story, we should support following pieces:*
> 1) GPU discovery/configuration: Admins can either configure GPU resources 
> and architectures on each node, or, more advanced, the NodeManager can 
> automatically discover GPU resources and architectures and report them to 
> the ResourceManager.
> 2) GPU scheduling: the YARN scheduler should account for GPU as a resource 
> type just like CPU and memory.
> 3) GPU isolation/monitoring: once a task is launched with GPU resources, 
> the NodeManager should properly isolate and monitor the task's resource 
> usage.
> For #2, YARN-3926 can support it natively. For #3, YARN-3611 has introduced 
> an extensible framework to support isolation for different resource types and 
> different runtimes.
> *Related JIRAs:*
> There're a couple of JIRAs (YARN-4122/YARN-5517) filed with similar goals but 
> different solutions:
> For scheduling:
> - YARN-4122/YARN-5517 both add a new GPU resource type to the Resource 
> protocol instead of leveraging YARN-3926.
> For isolation:
> - YARN-4122 proposed using CGroups for isolation, which cannot solve the 
> problems listed at 
> https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation#challenges, such 
> as minor device number mapping, loading the nvidia_uvm module, mismatched 
> CUDA/driver versions, etc.


