[jira] [Commented] (YARN-6620) [YARN-6223] NM Java side code changes to support isolate GPU devices by using CGroups

Sunil G (JIRA) Wed, 04 Oct 2017 08:33:16 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-6620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16191435#comment-16191435
 ]


Sunil G commented on YARN-6620:
-------------------------------

Thanks [~leftnoteasy] for the effort. Generally, the approach and the patch seem fine. 

A few comments:
# {{ResourceInformation.GPU_URI}}: I think this need not be hard-coded. 
Other than memory-mb and vcores, all other resource names could be pulled from 
resource-types.xml. So would it be fine to have the new resource names pulled 
from ResourceUtils itself?
# In {{NM#serviceInit}}, *ResourcePluginManager* is always created. So do we 
need a null check in other places?
# In {{recoverAssignedGpus}}, we could also catch NumberFormatException 
while doing parseInt.
# Let's do something like {{Long.valueOf(value).intValue()}} instead of a direct 
type cast from long to int. Refer to {{getRequestedGpus}}.
# In {{GPU#preStart(Container container)}}, if the requested container doesn't 
have any GPU demand, do we need to proceed further?
# In case an affinity constraint for a given GPU device comes up for a 
container in the future, I guess we would need a few more changes to the 
GpuAllocation class. Could we define some placeholder APIs now so that too 
much redesign is not needed later?
# In {{GPU#bootstrap}}, we are returning null. Is that correct?
# In {{ResourcePluginManager}}, could we avoid synchronized and keep the map as 
a ConcurrentHashMap?
# {{ResourcePluginManager}} could be a service itself so that we can handle 
cases where the NM is shut down and some cleanup is needed. I think all plugins 
could have a *cleanup* method that could be triggered from the manager.
# In {{GpuDiscover#initialize}}, along with file existence, we could also 
check file permissions, owner, etc. to ensure that it is being accessed correctly.
# {{GpuDeviceInformationParser}} has too much XML dependency. Hadoop common has 
an XML parser, correct? Could we use that?
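
To make points 3 and 4 concrete, here is a rough sketch of what I have in mind. The class and method names ({{GpuParseSketch}}, {{parseGpuIds}}, {{toGpuCount}}) and the comma-separated input format are hypothetical and not taken from the patch. Note that the sketch swaps in {{Math.toIntExact}} for the narrowing step, since {{Long#intValue}} truncates silently in the same way a cast does; whether failing fast is the right behaviour here is an assumption on my side.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only (not part of the patch).
final class GpuParseSketch {

  // Parses a comma-separated list of GPU ids (assumed format), reporting
  // which token was malformed instead of letting NumberFormatException
  // escape without context.
  static List<Integer> parseGpuIds(String serialized) {
    List<Integer> ids = new ArrayList<>();
    for (String token : serialized.split(",")) {
      String trimmed = token.trim();
      if (trimmed.isEmpty()) {
        continue; // tolerate empty entries such as a trailing comma
      }
      try {
        ids.add(Integer.parseInt(trimmed));
      } catch (NumberFormatException e) {
        throw new IllegalArgumentException(
            "Failed to parse GPU id '" + trimmed + "' from '" + serialized + "'", e);
      }
    }
    return ids;
  }

  // Narrows long to int, throwing ArithmeticException on overflow
  // rather than silently truncating.
  static int toGpuCount(long value) {
    return Math.toIntExact(value);
  }
}
```

In the recovery path, the caller would presumably wrap the failure in whatever exception the state-store code already throws.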

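For the two {{ResourcePluginManager}} points (ConcurrentHashMap instead of synchronized, and a cleanup hook triggered on shutdown), something along these lines could work. This is only a minimal, self-contained sketch: {{ResourcePluginSketch}} and {{PluginManagerSketch}} are hypothetical stand-ins, and in the real code {{stop()}} would presumably be the manager's {{serviceStop()}} once it is a service.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical plugin contract; the real plugin interface may look different.
interface ResourcePluginSketch {
  void cleanup();
}

// Hypothetical manager, for illustration only.
final class PluginManagerSketch {
  // ConcurrentHashMap gives thread-safe put/get without synchronized methods.
  private final Map<String, ResourcePluginSketch> plugins =
      new ConcurrentHashMap<>();

  void register(String resourceName, ResourcePluginSketch plugin) {
    plugins.put(resourceName, plugin);
  }

  ResourcePluginSketch get(String resourceName) {
    return plugins.get(resourceName);
  }

  // Stand-in for serviceStop(): give every plugin a chance to clean up,
  // and do not let one failing plugin block the others.
  void stop() {
    for (ResourcePluginSketch plugin : plugins.values()) {
      try {
        plugin.cleanup();
      } catch (RuntimeException e) {
        // A real implementation would log the failure here and continue.
      }
    }
    plugins.clear();
  }
}
```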

> [YARN-6223] NM Java side code changes to support isolate GPU devices by using 
> CGroups
> -------------------------------------------------------------------------------------
>
>                 Key: YARN-6620
>                 URL: https://issues.apache.org/jira/browse/YARN-6620
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Wangda Tan
>            Assignee: Wangda Tan
>         Attachments: YARN-6620.001.patch, YARN-6620.002.patch, 
> YARN-6620.003.patch, YARN-6620.004.patch, YARN-6620.005.patch, 
> YARN-6620.006-WIP.patch, YARN-6620.007.patch, YARN-6620.008.patch, 
> YARN-6620.009.patch, YARN-6620.010.patch, YARN-6620.011.patch
>
>
> This JIRA plans to add support for:
> 1) GPU configuration for NodeManagers
> 2) Isolation in CGroups. (Java side).
> 3) NM restart and recovery of allocated GPU devices



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
