[jira] [Commented] (YARN-8823) Monitor the healthy state of GPU

2021-03-11 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299419#comment-17299419
 ] 

Qi Zhu commented on YARN-8823:
--

Thanks [~adam.antal] for the reply.

I will investigate it when I am free.

> Monitor the healthy state of GPU
> 
>
> Key: YARN-8823
> URL: https://issues.apache.org/jira/browse/YARN-8823
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Zhankun Tang
> Assignee: Zhankun Tang
> Priority: Major
>
> GPU resources are discovered when the NM bootstraps, but they are not updated
> through later heartbeats with the RM. There should be a monitoring mechanism
> that checks GPU health status periodically and handles changes accordingly.
> YARN-8851 will also handle device monitoring, so the two may share some
> common parts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8823) Monitor the healthy state of GPU

2021-03-11 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299405#comment-17299405
 ] 

Adam Antal commented on YARN-8823:
--

I think you can go ahead and work on this [~zhuqi].




[jira] [Commented] (YARN-8823) Monitor the healthy state of GPU

2021-03-09 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298127#comment-17298127
 ] 

Qi Zhu commented on YARN-8823:
--

[~adam.antal] [~tangzhankun]

Is this still being worked on?
"I was wondering if this issue can be easily finished by writing a custom 
health checker script (idea from YARN-9923).
I think it would make sense to push this feature - would you like to share your 
PoC and compare its advantages/disadvantages to a node checker script?"

This is a good suggestion.
 




[jira] [Commented] (YARN-8823) Monitor the healthy state of GPU

2019-11-04 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16966847#comment-16966847
 ] 

Adam Antal commented on YARN-8823:
--

I was wondering if this issue can be easily finished by writing a custom health 
checker script (idea from YARN-9923). 
I think it would make sense to push this feature - would you like to share your 
PoC and compare its advantages/disadvantages to a node checker script?
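
For illustration only, a custom health checker script along these lines might look like the sketch below. It is not the PoC mentioned above and not part of any patch on this issue. It assumes that `nvidia-smi` is on the NodeManager's PATH and relies on the NodeManager convention that an output line beginning with ERROR marks the node unhealthy. The function names and the 90 C threshold are hypothetical.

```python
import subprocess

# Hypothetical threshold, chosen only for illustration.
MAX_TEMP_C = 90


def health_report(query_output, max_temp_c=MAX_TEMP_C):
    """Turn `index, temperature` CSV lines into a single health line."""
    rows = [line.split(",") for line in query_output.splitlines() if line.strip()]
    if not rows:
        return "ERROR no GPU reported by nvidia-smi"
    hot = []
    for row in rows:
        index, temp = row[0].strip(), row[-1].strip()
        # Some devices report a non-numeric temperature; skip the check for those.
        if temp.isdigit() and int(temp) > max_temp_c:
            hot.append(index)
    if hot:
        return "ERROR GPU(s) over %d C: %s" % (max_temp_c, ", ".join(hot))
    return "OK %d GPU(s) healthy" % len(rows)


def main():
    # Ask nvidia-smi for one "index, temperature" line per GPU.
    cmd = ["nvidia-smi", "--query-gpu=index,temperature.gpu",
           "--format=csv,noheader,nounits"]
    try:
        out = subprocess.run(cmd, capture_output=True, text=True,
                             timeout=30, check=True).stdout
    except (OSError, subprocess.SubprocessError) as exc:
        # Assumption: a GPU node where nvidia-smi cannot run is unhealthy.
        print("ERROR nvidia-smi failed: %s" % exc)
        return
    print(health_report(out))


if __name__ == "__main__":
    main()
```

A script like this only flags the whole node. It cannot mark a single GPU as unusable, which is one point a comparison with a built-in monitoring mechanism would need to cover.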




[jira] [Commented] (YARN-8823) Monitor the healthy state of GPU

2019-10-25 Thread Adam Antal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959769#comment-16959769
 ] 

Adam Antal commented on YARN-8823:
--

Hi [~tangzhankun],
Is there any update on this?
