[jira] [Commented] (YARN-6620) [YARN-6223] NM Java side code changes to support isolate GPU devices by using CGroups

Sunil G (JIRA) Thu, 05 Oct 2017 08:50:34 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-6620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16193076#comment-16193076
 ]


Sunil G commented on YARN-6620:
-------------------------------

Thanks [~leftnoteasy]

A few more comments:

# As {{GPU_URI}} is now hardcoded, we have to ensure that the same name is not 
supplied by the user with a different unit etc. As with CPU and Memory, we 
need a mandatory check for this resource as well, correct?
# In {{ResourcePluginManager#initialize}}, {{GpuResourcePlugin}} is created 
directly in the main code path. It would be better to pass the plugin name 
({{resourceName}}) to a factory and get the correct object back. For now 
this is fine as we have only one plugin, so it can be improved later.
# In {{ResourcePluginManager#cleanup}}, do we need to remove the plugin entry 
as cleanup is already done?
# {{GpuDiscoverer#getGpuDeviceInformation}} still continues to execute the 
discovery command even after reaching the max fail limit?
{code}
if (numOfErrorExecutionSinceLastSucceed == MAX_REPEATED_ERROR_ALLOWED) {
  LOG.error("Failed to execute GPU device information detection script for "
      + MAX_REPEATED_ERROR_ALLOWED + " times, skip following exections.");
  numOfErrorExecutionSinceLastSucceed++;
}
{code}
Also, please fix the typo {{exections}} to {{executions}} in the above error 
message.
# I think {{GpuDiscoverer}} does not need to have a {{getConf}}?
# In {{GpuNodeResourceUpdateHandler#updateConfiguredResource}}, do we need to 
throw an exception when YARN was not configured to support any GPUs but auto 
discovery found some devices as per the plugin? Could we log a warning instead?
# The {{GpuDiscoverer}} command is specific to Linux, so in a Windows shell 
this has to fail, correct?
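
To illustrate the factory suggestion in the second point above, here is a 
minimal sketch. The types {{PluginSketch}}, {{GpuPluginSketch}} and 
{{PluginFactorySketch}}, and the key {{"gpu-resource-name"}}, are hypothetical 
stand-ins and not the classes or constants in the patch. The idea is only that 
the manager asks a registry for a plugin by {{resourceName}} instead of 
constructing {{GpuResourcePlugin}} inline.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for the plugin abstraction; not the YARN interface.
interface PluginSketch {
  String describe();
}

// Hypothetical stand-in for the GPU plugin.
class GpuPluginSketch implements PluginSketch {
  @Override
  public String describe() {
    return "gpu plugin";
  }
}

// Maps a resource name to a constructor, so the caller never names a
// concrete plugin class directly.
class PluginFactorySketch {
  private final Map<String, Supplier<PluginSketch>> registry = new HashMap<>();

  void register(String resourceName, Supplier<PluginSketch> constructor) {
    registry.put(resourceName, constructor);
  }

  PluginSketch create(String resourceName) {
    Supplier<PluginSketch> constructor = registry.get(resourceName);
    if (constructor == null) {
      throw new IllegalArgumentException(
          "No plugin registered for resource: " + resourceName);
    }
    return constructor.get();
  }

  public static void main(String[] args) {
    PluginFactorySketch factory = new PluginFactorySketch();
    // "gpu-resource-name" is a placeholder key, not the real GPU_URI value.
    factory.register("gpu-resource-name", GpuPluginSketch::new);
    System.out.println(factory.create("gpu-resource-name").describe());
  }
}
```

With something along these lines, adding a second plugin later would only 
need one more {{register}} call rather than another branch in {{initialize}}.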
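
For the fourth point, this is roughly the behaviour I would expect, written as 
a self-contained sketch rather than a change to the actual patch. 
{{DiscoveryGuardSketch}}, the {{Supplier}} standing in for the discovery 
script, and the constructor parameters are assumptions for illustration. The 
point is that once the failure count reaches the limit, the method returns 
early (here by throwing) without running the command again.

```java
import java.util.function.Supplier;

// Hypothetical sketch: stop invoking the discovery command after it has
// failed maxRepeatedErrorAllowed times in a row.
class DiscoveryGuardSketch {
  private final Supplier<String> discoveryCommand; // stand-in for the script
  private final int maxRepeatedErrorAllowed;
  private int numOfErrorExecutionSinceLastSucceed = 0;
  private int commandInvocations = 0;

  DiscoveryGuardSketch(Supplier<String> discoveryCommand,
      int maxRepeatedErrorAllowed) {
    this.discoveryCommand = discoveryCommand;
    this.maxRepeatedErrorAllowed = maxRepeatedErrorAllowed;
  }

  String getGpuDeviceInformation() {
    // Short-circuit before running the command once the limit is reached.
    if (numOfErrorExecutionSinceLastSucceed >= maxRepeatedErrorAllowed) {
      throw new IllegalStateException("Discovery failed "
          + maxRepeatedErrorAllowed + " times, skipping further executions.");
    }
    commandInvocations++;
    try {
      String output = discoveryCommand.get();
      numOfErrorExecutionSinceLastSucceed = 0; // a success resets the counter
      return output;
    } catch (RuntimeException e) {
      numOfErrorExecutionSinceLastSucceed++;
      throw e;
    }
  }

  int getCommandInvocations() {
    return commandInvocations;
  }

  public static void main(String[] args) {
    DiscoveryGuardSketch guard = new DiscoveryGuardSketch(
        () -> { throw new RuntimeException("script failed"); }, 3);
    for (int i = 0; i < 5; i++) {
      try {
        guard.getGpuDeviceInformation();
      } catch (RuntimeException e) {
        System.out.println("attempt " + i + ": " + e.getMessage());
      }
    }
    System.out.println("command invocations: " + guard.getCommandInvocations());
  }
}
```

In this sketch the command runs only three times across the five attempts; 
the remaining calls are rejected by the guard.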

> [YARN-6223] NM Java side code changes to support isolate GPU devices by using 
> CGroups
> -------------------------------------------------------------------------------------
>
>                 Key: YARN-6620
>                 URL: https://issues.apache.org/jira/browse/YARN-6620
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Wangda Tan
>            Assignee: Wangda Tan
>         Attachments: YARN-6620.001.patch, YARN-6620.002.patch, 
> YARN-6620.003.patch, YARN-6620.004.patch, YARN-6620.005.patch, 
> YARN-6620.006-WIP.patch, YARN-6620.007.patch, YARN-6620.008.patch, 
> YARN-6620.009.patch, YARN-6620.010.patch, YARN-6620.011.patch, 
> YARN-6620.012.patch, YARN-6620.013.patch
>
>
> This JIRA plans to add support for:
> 1) GPU configuration for NodeManagers
> 2) Isolation in CGroups (Java side).
> 3) NM restart and recovery of allocated GPU devices



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

