[
https://issues.apache.org/jira/browse/YARN-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16721009#comment-16721009
]
Szilard Nemeth commented on YARN-9120:
--------------------------------------
As I understand it, the intended way to turn off GPU support on a specific node
is to remove the GPU resource plugin from the config.
I would like to introduce a new value for the field that already controls
whether auto-discovery happens or the user explicitly lists the identifiers of
the GPU devices to use on a specific node (property:
yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices).
This way, users could turn auto-discovery on and off, even at runtime.
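To illustrate the idea, here is a minimal sketch of how the property value
could be interpreted. This is not the GpuDiscoverer code: the class, the Mode
enum, the helper methods and the value "off" are hypothetical names chosen for
this example; only the value "auto" and the property it belongs to come from
the existing configuration. Treating an unset value as "auto" is also an
assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch, not the actual GpuDiscoverer implementation.
final class AllowedGpuDevicesSketch {
    // Existing value: let the NodeManager discover the GPUs itself.
    static final String AUTO = "auto";
    // HYPOTHETICAL new value (name assumed): this node contributes no GPUs.
    static final String OFF = "off";

    enum Mode { AUTO_DISCOVERY, DISABLED, USER_DEFINED }

    // Classifies the raw value of the allowed-gpu-devices property.
    static Mode modeOf(String value) {
        // Assumption: an unset or blank value behaves like "auto".
        String v = (value == null) ? AUTO : value.trim();
        if (v.isEmpty() || v.equalsIgnoreCase(AUTO)) {
            return Mode.AUTO_DISCOVERY;
        }
        if (v.equalsIgnoreCase(OFF)) {
            return Mode.DISABLED;
        }
        return Mode.USER_DEFINED;
    }

    // Splits a comma-separated device list, dropping blanks.
    static List<String> splitDevices(String value) {
        List<String> devices = new ArrayList<>();
        for (String part : value.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                devices.add(trimmed);
            }
        }
        return devices;
    }

    // Returns the devices the node would offer for scheduling.
    // 'discovered' stands in for the result of auto-discovery.
    static List<String> usableDevices(String value, List<String> discovered) {
        switch (modeOf(value)) {
            case DISABLED:
                return Collections.emptyList();
            case USER_DEFINED:
                return splitDevices(value);
            default:
                return discovered;
        }
    }
}
```

With this shape, switching a node off would be a one-value change to the
property, and the resource plugin setup would stay untouched.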
I think it's more lightweight and more feasible to have a single control for
how many GPUs a node contributes to scheduling (as said, this could be changed
even at runtime) than to manipulate the resource plugins. The plugins are not
easily modifiable at runtime, because setting them up is a core part of NM
initialization.
Do you think it's a good idea?
Even if we don't want to support runtime config changes for auto-discovery, I
think having one config that effectively turns off the GPU feature (the new
switch) is better than having two steps: removing the plugin and also removing
the GPU-related configs from container-executor.cfg.
About 95% of this change is in GpuDiscoverer; the remaining ~5% is
error-handling changes.
Does this change make sense?
Thanks!
> Need to have a way to turn off GPU auto-discovery in GpuDiscoverer
> ------------------------------------------------------------------
>
> Key: YARN-9120
> URL: https://issues.apache.org/jira/browse/YARN-9120
> Project: Hadoop YARN
> Issue Type: Improvement
> Reporter: Szilard Nemeth
> Assignee: Szilard Nemeth
> Priority: Major
>
> GpuDiscoverer.getGpusUsableByYarn either parses the user-defined GPU devices
> or expects the value 'auto' (from property:
> yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices).
> In some circumstances, users would want to exclude a node from scheduling, so
> they should have an option to turn off auto-discovery.
> This is already possible by removing the GPU resource-plugin from YARN's
> config along with the GPU-related config in container-executor.cfg, but
> doing it with a dedicated value for
> yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices is a more
> lightweight approach.