[ 
https://issues.apache.org/jira/browse/YARN-9217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907942#comment-16907942
 ] 

Szilard Nemeth commented on YARN-9217:
--------------------------------------

Hi [~pbacsko]!

{quote}
Fundamental question: is this how we want to use this plugin? Just 
asking because we might accidentally mask erratic behavior. E.g. a Hadoop user 
might think that he has a cluster with 10 GPUs. In reality, the plugin failed 
to detect some cards, and only 5 NMs support GPU scheduling. If it's not 
explicitly displayed, the user might be under the impression that 10 GPUs are 
ready to run YARN workloads. This can be very misleading.

At the very least, a fail-fast method should be considered.
{quote}
I agree with your approach on the fail-fast config flag, so please fix the TODO 
and upload a new patch. Then I can start reviewing it!

Thanks!

> Nodemanager will fail to start if GPU is misconfigured on the node or GPU 
> drivers missing
> -----------------------------------------------------------------------------------------
>
>                 Key: YARN-9217
>                 URL: https://issues.apache.org/jira/browse/YARN-9217
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>    Affects Versions: 3.0.0, 3.1.0
>            Reporter: Antal Bálint Steinbach
>            Assignee: Peter Bacsko
>            Priority: Major
>         Attachments: YARN-9217.001.patch, YARN-9217.002.patch, 
> YARN-9217.003.patch, YARN-9217.004.patch, YARN-9217.005.patch, 
> YARN-9217.006.patch, YARN-9217.007.patch, YARN-9217.008.patch, 
> YARN-9217.009.patch
>
>
> Nodemanager will not start
> 1. If Autodiscovery is enabled:
>  * If the nvidia-smi path is misconfigured or the file does not exist
>  * If 0 GPUs are found
>  * If the file exists but does not point to an nvidia-smi binary
>  * If the binary is OK but an IOException occurs
> 2. If the manually configured GPU devices are misconfigured:
>  * Any index:minor number format failure will cause a problem
>  * 0 configured devices will cause a problem
>  * NumberFormatException is not handled
> It would be a better option to add warnings about the configuration, set 0 
> available GPUs and let the node work and run non-GPU jobs.
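
The trade-off between the description (warn and continue with 0 GPUs) and the fail-fast flag discussed in the comment can be illustrated with a minimal sketch. This is not the code from the attached patches: the class name GpuDeviceConfigSketch, the method parseAllowedDevices and the failFast parameter are hypothetical names chosen for illustration, and only the manually configured "index:minor" case is covered.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the actual YARN GPU plugin code.
final class GpuDeviceConfigSketch {

    /**
     * Parses a comma-separated list of "index:minor" pairs, e.g. "0:0,1:1",
     * and returns one int[]{index, minor} per device.
     * With failFast=true any problem is rethrown to the caller.
     * With failFast=false the problem is logged and 0 devices are returned.
     */
    static List<int[]> parseAllowedDevices(String configured, boolean failFast) {
        List<int[]> devices = new ArrayList<>();
        try {
            if (configured == null || configured.trim().isEmpty()) {
                throw new IllegalArgumentException("0 GPU devices configured");
            }
            for (String entry : configured.split(",")) {
                String[] parts = entry.trim().split(":");
                if (parts.length != 2) {
                    throw new IllegalArgumentException(
                        "Expected index:minor but got '" + entry + "'");
                }
                // Integer.parseInt throws NumberFormatException, which is a
                // subclass of IllegalArgumentException and is caught below.
                devices.add(new int[] {
                    Integer.parseInt(parts[0].trim()),
                    Integer.parseInt(parts[1].trim())
                });
            }
            return devices;
        } catch (IllegalArgumentException e) {
            if (failFast) {
                throw e; // surface the misconfiguration immediately
            }
            System.err.println(
                "WARN: GPU configuration problem, continuing with 0 GPUs: "
                + e.getMessage());
            return new ArrayList<>(); // node keeps running non-GPU jobs
        }
    }
}
```

Under these assumptions, parseAllowedDevices("0:x", false) logs a warning and returns an empty list, while the same input with failFast set to true throws, so a cluster operator is not left believing that undetected GPUs are available.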



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
