[ 
https://issues.apache.org/jira/browse/YARN-9217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16762647#comment-16762647
 ] 

Szilard Nemeth commented on YARN-9217:
--------------------------------------

Thanks [~bsteinbach] for this patch!

Some comments:
1. In GpuResourceHandlerImpl#bootstrap 


{code:java}
List<GpuDevice> usableGpus = 
        GpuDiscoverer.getInstance().getGpusUsableByYarn();
{code}


would look like better if written like this (different place for line break): 


{code:java}
List<GpuDevice> usableGpus = GpuDiscoverer.getInstance()
                .getGpusUsableByYarn();
{code}


2. In the same method, you are storing the message in a variable (message) but 
it's only used for the LOG.warn statement so I would rather inline it.

3. In GpuResourceHandlerImpl: There is an unused import of YarnException

4. There's an unnecessary empty line at the beginning of method: 
GpuDiscoverer#getGpusUsableByYarn. 

5. I would inline the msg variable as it's only used for logging in 
GpuDiscoverer#getGpusUsableByYarn

6. In GpuDiscoverer#splitDeviceIdToParts: The exception message contains 
"param". I would use "parameter".

7. In GpuDiscoverer#initialize: You have this code: 


{code:java}
 LOG.warn("Specified path is not pointing to an {} executable." +
            " Please check [{}] setting.", 
            DEFAULT_BINARY_NAME, 
            YarnConfiguration.NM_GPU_PATH_TO_EXEC);
{code}

I would rephrase it to: 
{code:java}
"Please check the configuration value of " + 
YarnConfiguration.NM_GPU_PATH_TO_EXEC
{code}


8. Unused import in: GpuNodeResourceUpdateHandler

9. I understand that the error handling of calling nvidia-smi could belong to 
GpuResourcePlugin as the NM Web API the only caller of nvidia-smi after 
startup, but I would prefer to keep the error handling in its original place 
instead. As we discussed offline, we should evaluate whether we need this 
"smart" error handling at all, as nvidia-smi is either gives back the data 
correctly or not, it's not likely that the correctness would change over time. 
Maybe it's better to keep the error handling in NvidiaBinaryHelper? Let's 
discuss this further later.

Overall comment: As YARN-9118, YARN-9121, YARN-9139, YARN-9138 is also dealing 
with similar code changes that your patch does, I suggest to wait for those to 
be merged then resolve the conflicts after that.

> Nodemanager will fail to start if GPU is misconfigured on the node or GPU 
> drivers missing
> -----------------------------------------------------------------------------------------
>
>                 Key: YARN-9217
>                 URL: https://issues.apache.org/jira/browse/YARN-9217
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>    Affects Versions: 3.0.0, 3.1.0
>            Reporter: Antal Bálint Steinbach
>            Assignee: Antal Bálint Steinbach
>            Priority: Major
>         Attachments: YARN-9217.001.patch, YARN-9217.002.patch, 
> YARN-9217.003.patch
>
>
> Nodemanager will not start
> 1. If Autodiscovery is enabled:
>  * If nvidia-smi path is misconfigured or the file does not exist.
>  * There is 0 GPU found
>  * If the file exists but it is not pointing to an nvidia-smi
>  * if the binary is ok but there is an IOException
> 2. If the manually configured GPU devices are misconfigured
>  * Any index:minor number format failure will cause a problem
>  * 0 configured device will cause a problem
>  * NumberFormatException is not handled
> It would be a better option to add warnings about the configuration, set 0 
> available GPUs and let the node work and run non-gpu jobs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to