[ 
https://issues.apache.org/jira/browse/YARN-6620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16169655#comment-16169655
 ] 

Zhankun Tang commented on YARN-6620:
------------------------------------

{quote}
Good point, I think we should use node attribute to distinguish them. I think 
this might be unavoidable: different DL workload needs different driver 
versions / GPU architectures, and different frameworks like OpenCL/CUDA, we 
need node attribute anyway.
{quote}
[~wangda], Yeah. Node attributes are a must.
One more thing came to mind: do we need to support one physical machine with 
GPU cards from two different vendors? If this scenario is a real requirement, 
we may need to extend the resource handler to manage several different 
plugins (I've done this in a prior FPGA patch), as below:
1. In the "bootstrap" method, every GPU vendor's plugin registers with a single 
GPU resource handler under the resource name it can handle. For instance, 
plugin A registers a resource "A-GPU" and plugin B registers "B-GPU". The GPU 
resource handler holds records of <resourceName, pluginInstance>.
2. When "preStart" is invoked, it retrieves the ResourceInformation array from 
container.getResource().getResources() to find the proper GPU vendor plugin and 
do the plugin callback (possibly no callback is needed for GPU; it seems to be 
needed for FPGA). It then uses the GPU allocator to allocate the requested 
count of this specific type of GPU in a round-robin manner, and then does the 
cgroups isolation.
3. Now back to the AM: it can request a container with one "A-GPU" named 
resource in the containerRequest and the node attribute "CUDA v1" at the same 
time.
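
A minimal sketch of steps 1 and 2, to make the proposal concrete. Everything 
here is hypothetical: GpuVendorPlugin, FixedGpuPlugin and MultiVendorGpuHandler 
are illustrative names, not existing YARN classes, and a plain 
Map<String, Integer> stands in for the container's ResourceInformation array. 
The round-robin cursor is deliberately naive and does not track devices that 
are already in use or released.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical vendor plugin interface; not an existing YARN API.
interface GpuVendorPlugin {
    String resourceName();            // e.g. "A-GPU"
    List<Integer> discoverDevices();  // device minor numbers on this node
    void onContainerStart(String containerId, List<Integer> devices);
}

// Plugin with a fixed device list, used only for illustration.
class FixedGpuPlugin implements GpuVendorPlugin {
    private final String name;
    private final List<Integer> minors;
    final List<String> callbacks = new ArrayList<>();

    FixedGpuPlugin(String name, List<Integer> minors) {
        this.name = name;
        this.minors = minors;
    }
    public String resourceName() { return name; }
    public List<Integer> discoverDevices() { return minors; }
    public void onContainerStart(String containerId, List<Integer> devices) {
        callbacks.add(containerId + "=" + devices);
    }
}

// Hypothetical handler keeping <resourceName, pluginInstance> records.
class MultiVendorGpuHandler {
    private final Map<String, GpuVendorPlugin> plugins = new HashMap<>();
    private final Map<String, List<Integer>> pools = new HashMap<>();
    private final Map<String, Integer> cursors = new HashMap<>();

    // Step 1: every vendor plugin registers the resource name it handles.
    void bootstrap(List<GpuVendorPlugin> vendorPlugins) {
        for (GpuVendorPlugin p : vendorPlugins) {
            if (plugins.putIfAbsent(p.resourceName(), p) != null) {
                throw new IllegalArgumentException("Duplicate resource: " + p.resourceName());
            }
            pools.put(p.resourceName(), new ArrayList<>(p.discoverDevices()));
            cursors.put(p.resourceName(), 0);
        }
    }

    // Step 2: find the plugin for each requested resource, pick devices round-robin.
    Map<String, List<Integer>> preStart(String containerId, Map<String, Integer> requested) {
        Map<String, List<Integer>> assigned = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : requested.entrySet()) {
            GpuVendorPlugin plugin = plugins.get(e.getKey());
            if (plugin == null) {
                continue; // not a GPU resource known to this handler
            }
            List<Integer> pool = pools.get(e.getKey());
            int count = e.getValue();
            if (count > pool.size()) {
                throw new IllegalStateException("Not enough " + e.getKey() + " devices");
            }
            int cursor = cursors.get(e.getKey());
            List<Integer> picked = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                picked.add(pool.get(cursor));
                cursor = (cursor + 1) % pool.size();
            }
            cursors.put(e.getKey(), cursor);
            plugin.onContainerStart(containerId, picked); // optional vendor callback
            assigned.put(e.getKey(), picked);             // cgroups isolation would follow here
        }
        return assigned;
    }
}
```

With this shape, a request for "A-GPU" never touches plugin B's devices, which 
is the point of keying the handler on the resource name instead of treating 
all GPUs as one resource.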

I'm not sure whether one host with devices from different vendors is a real 
requirement. If so, it may raise another concern with our current design, since 
we implicitly treat them as the same resource. Any ideas?

> [YARN-6223] NM Java side code changes to support isolate GPU devices by using 
> CGroups
> -------------------------------------------------------------------------------------
>
>                 Key: YARN-6620
>                 URL: https://issues.apache.org/jira/browse/YARN-6620
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Wangda Tan
>            Assignee: Wangda Tan
>         Attachments: YARN-6620.001.patch, YARN-6620.002.patch, 
> YARN-6620.003.patch, YARN-6620.004.patch, YARN-6620.005.patch, 
> YARN-6620.006-WIP.patch
>
>
> This JIRA plan to add support of:
> 1) GPU configuration for NodeManagers
> 2) Isolation in CGroups. (Java side).
> 3) NM restart and recovery allocated GPU devices



