[jira] [Commented] (YARN-8823) Monitor the healthy state of GPU
[ https://issues.apache.org/jira/browse/YARN-8823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299419#comment-17299419 ]

Qi Zhu commented on YARN-8823:
------------------------------

Thanks [~adam.antal] for the reply. I will investigate it when I am free.

> Monitor the healthy state of GPU
> --------------------------------
>
>                 Key: YARN-8823
>                 URL: https://issues.apache.org/jira/browse/YARN-8823
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Zhankun Tang
>            Assignee: Zhankun Tang
>            Priority: Major
>
> GPU resources are discovered when the NM bootstraps, but they are not updated
> through later heartbeats with the RM. There should be a monitoring mechanism
> that checks GPU health status from time to time, along with the corresponding
> handling.
> YARN-8851 will also handle device monitoring, so there could be some common
> parts between the two.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Commented] (YARN-8823) Monitor the healthy state of GPU
[ https://issues.apache.org/jira/browse/YARN-8823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299405#comment-17299405 ]

Adam Antal commented on YARN-8823:
----------------------------------

I think you can go ahead and work on this [~zhuqi].

[jira] [Commented] (YARN-8823) Monitor the healthy state of GPU
[ https://issues.apache.org/jira/browse/YARN-8823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298127#comment-17298127 ]

Qi Zhu commented on YARN-8823:
------------------------------

[~adam.antal] [~tangzhankun]

Is this still in progress?

"I was wondering if this issue can be easily finished by writing a custom health checker script (idea from YARN-9923). I think it would make sense to push this feature - would you like to share your PoC and compare its advantages/disadvantages to a node checker script?"

This is a good suggestion.

[jira] [Commented] (YARN-8823) Monitor the healthy state of GPU
[ https://issues.apache.org/jira/browse/YARN-8823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16966847#comment-16966847 ]

Adam Antal commented on YARN-8823:
----------------------------------

I was wondering if this issue can be easily finished by writing a custom health checker script (idea from YARN-9923). I think it would make sense to push this feature - would you like to share your PoC and compare its advantages/disadvantages to a node checker script?

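For illustration only, the health checker script idea raised in this comment could look roughly like the sketch below. It is not the PoC mentioned in the thread and not code from YARN-8823 or YARN-9923. It assumes NVIDIA GPUs with `nvidia-smi` on the PATH, and it relies on the NodeManager health script convention that an output line beginning with `ERROR` marks the node unhealthy. The function name `find_gpu_problems` and the values of `EXPECTED_GPUS` and `MAX_TEMP_C` are hypothetical.

```python
import subprocess

# Hypothetical values for illustration; neither comes from YARN-8823.
EXPECTED_GPUS = 2
MAX_TEMP_C = 90


def find_gpu_problems(smi_csv, expected_gpus=EXPECTED_GPUS, max_temp_c=MAX_TEMP_C):
    """Return a list of problem descriptions for nvidia-smi CSV output.

    Each non-empty input line is expected to look like "<index>, <temperature>".
    """
    problems = []
    rows = [line.strip() for line in smi_csv.splitlines() if line.strip()]
    if len(rows) != expected_gpus:
        problems.append(f"expected {expected_gpus} GPUs, found {len(rows)}")
    for row in rows:
        fields = [field.strip() for field in row.split(",")]
        if len(fields) != 2 or not fields[1].isdigit():
            # Covers values such as "[N/A]" that cannot be compared to a threshold.
            problems.append(f"unreadable GPU status: {row!r}")
            continue
        index, temp = fields[0], int(fields[1])
        if temp > max_temp_c:
            problems.append(f"GPU {index} at {temp}C exceeds {max_temp_c}C")
    return problems


def main():
    cmd = [
        "nvidia-smi",
        "--query-gpu=index,temperature.gpu",
        "--format=csv,noheader,nounits",
    ]
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=10, check=True
        )
        problems = find_gpu_problems(result.stdout)
    except (OSError, subprocess.SubprocessError) as exc:
        # A missing binary, a timeout, or a non-zero exit all count as a problem.
        problems = [f"nvidia-smi failed: {exc}"]
    if problems:
        # A line starting with ERROR is what reports the node as unhealthy.
        print("ERROR " + "; ".join(problems))
    else:
        print("GPUs healthy")


if __name__ == "__main__":
    main()
```

One limitation to weigh in the comparison: a script like this can only report the node as a whole as unhealthy. As far as I can tell, it would not update the GPU resources the NM reports to the RM through heartbeats, which is the gap the issue description points at.
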
[jira] [Commented] (YARN-8823) Monitor the healthy state of GPU
[ https://issues.apache.org/jira/browse/YARN-8823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959769#comment-16959769 ]

Adam Antal commented on YARN-8823:
----------------------------------

Hi [~tangzhankun],

Is there any update on this?