[ https://issues.apache.org/jira/browse/YARN-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15950199#comment-15950199 ]
Zhankun Tang commented on YARN-6223:
------------------------------------

[~wangda], thanks for sharing the whole story. I agree that we should evolve based on YARN-3926. We have done an internal PoC on enabling FPGA as a first-class resource and would like to share our findings and propose some ideas to make YARN's new resource model more general and flexible.

The current YARN-3926 work only considers non-exclusive resources such as CPU, memory and network bandwidth. Resources like GPUs, FPGAs and disks are essentially exclusive resources that still need to be addressed. In my opinion, the following aspects of exclusive resources need more discussion:

On the RM side:
1. A device resource may have extra attributes that need to be matched during scheduling, rather than simply adding or subtracting a number in "fitsIn". For instance, in our PoC, an FPGA slot on a node may already have an IP flashed, so the scheduler should try to match on this IP attribute to reuse it. The YARN-2139 proposal also mentions a locality issue that is similar to FPGA IP reuse. For network ports, either an individual port or a port range may be requested, which may require different scheduler behavior.
2. Are there requirements similar to FPGA where the scheduler should allocate a compromised/non-matching resource that then needs extra work on the NM to make it usable? In detail, when an application requests an FPGA slot with a required IP description, the scheduler could choose a non-perfectly-matched FPGA slot based on policy and leave a hint telling the NM that this slot should be re-programmed before container launch.
In short, these exclusive resources seem to require the scheduler to consider both countable availability and device affinity, to avoid potential contention and improve utilization.

On the NM side:
1. Device resources need dynamic discovery and static configuration. This is vendor specific, so we should provide a plugin framework for different vendors to implement. The interfaces might consist of "listDevices" and "monitorDeviceHealth".
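To make the plugin idea concrete, here is a minimal Java sketch of what such a vendor plugin interface might look like. All names here (DevicePlugin, Device, listDevices, monitorDeviceHealth, the mock FPGA class) are illustrative assumptions for discussion, not existing YARN APIs:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical vendor plugin interface for exclusive device resources.
// Names are illustrative only; nothing here is an existing YARN API.
interface DevicePlugin {
    /** Discover devices of this vendor present on the node. */
    List<Device> listDevices();

    /** Report the current health of a specific device. */
    DeviceHealth monitorDeviceHealth(String deviceId);
}

// Minimal device description, including a vendor-specific attribute
// (e.g. the IP currently flashed on an FPGA slot) that a scheduler
// could match on for affinity/reuse.
class Device {
    final String id;
    final String type;
    final String attribute; // e.g. currently flashed FPGA IP; may be null

    Device(String id, String type, String attribute) {
        this.id = id;
        this.type = type;
        this.attribute = attribute;
    }
}

enum DeviceHealth { HEALTHY, UNHEALTHY }

// Toy FPGA implementation showing how a vendor might plug in.
class MockFpgaPlugin implements DevicePlugin {
    @Override
    public List<Device> listDevices() {
        // One slot already flashed with an IP, one empty slot.
        return Arrays.asList(
            new Device("fpga-0", "FPGA", "ip-compress-v1"),
            new Device("fpga-1", "FPGA", null));
    }

    @Override
    public DeviceHealth monitorDeviceHealth(String deviceId) {
        return DeviceHealth.HEALTHY; // a real plugin would query the driver
    }
}
```

With an attribute like this surfaced, a scheduler asked for "ip-compress-v1" could prefer fpga-0 and avoid a re-flash, which is exactly the reuse case described above.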
2. In the current YARN-3926 implementation, when a container is allocated by the scheduler and sent to the NM, the NM seems to need a new component that handles the sub-optimal scheduling of the resources in the container. This new component would track the node's exclusive resources and map the virtual resource representation in the container to real devices.
3. Device resources need additional preparation and isolation before container launch. For instance, an FPGA device may need an IP file downloaded from a repository and then flashed to the allocated FPGA slot. This is also vendor specific and should be pluggable. We can try extending the ResourceHandlerModule introduced by YARN-3366 to achieve this.

I'm not quite sure whether we should try to unify non-exclusive and exclusive resources directly in the scheduler, but I think we should at least make that a long-term goal. A practical way is to split it into two steps. First, keep the current scheduler features unchanged, and finish the NM-side local scheduler and plugin framework design as a sub-optimal solution (we are designing this now). Second, remove the NM-side local scheduler and try to unify all resource types in a global scheduler. Thoughts?

> [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation
> on YARN
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-6223
>                 URL: https://issues.apache.org/jira/browse/YARN-6223
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Wangda Tan
>            Assignee: Wangda Tan
>
> As varieties of workloads are moving to YARN, including machine learning /
> deep learning which can be sped up by leveraging GPU computation power,
> workloads should be able to request GPU from YARN as simply as CPU and memory.
> *To make a complete GPU story, we should support the following pieces:*
> 1) GPU discovery/configuration: admins can either configure GPU resources and
> architectures on each node, or, more advanced, the NodeManager can automatically
> discover GPU resources and architectures and report them to the ResourceManager.
> 2) GPU scheduling: the YARN scheduler should account for GPU as a resource type,
> just like CPU and memory.
> 3) GPU isolation/monitoring: once a task is launched with GPU resources, the
> NodeManager should properly isolate and monitor the task's resource usage.
> For #2, YARN-3926 can support it natively. For #3, YARN-3611 has introduced
> an extensible framework to support isolation for different resource types and
> different runtimes.
> *Related JIRAs:*
> There are a couple of JIRAs (YARN-4122/YARN-5517) filed with similar goals but
> different solutions:
> For scheduling:
> - YARN-4122/YARN-5517 both add a new GPU resource type to the Resource
> protocol instead of leveraging YARN-3926.
> For isolation:
> - YARN-4122 proposed using CGroups for isolation, which cannot solve the
> problems listed at
> https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation#challenges, such as
> minor device number mapping, loading the nvidia_uvm module, and mismatched
> CUDA/driver versions.