[jira] [Commented] (TEZ-3075) Revamp bad node handling

2018-10-16 Thread Yingda Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651765#comment-16651765
 ] 

Yingda Chen commented on TEZ-3075:
----------------------------------

[~Chyler] and I will be looking at this together with TEZ-3822.

> Revamp bad node handling
> ------------------------
>
> Key: TEZ-3075
> URL: https://issues.apache.org/jira/browse/TEZ-3075
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Bikas Saha
> Assignee: Ying Han
> Priority: Major
>
> The current logic for bad node handling is derived from MR and does not work
> in all cases.
> Things to consider:
> 1) Have a notion of probation where machines are put out of service for a
> period of time (say 5m, 15m and 30m) before being given up for good. This
> allows more graceful handling of temporary glitches.
> 2) Different handling for YARN marking a node as bad vs. internal heuristics.
> 3) Bad nodes should not immediately trigger re-execution of completed work.
> That should be based on the presence of downstream consumers (i.e. existing
> demand for that output) and a reasonable indication from other consumers
> reading from that node that it cannot serve results (e.g. multiple reports of
> read errors with that node as a source).
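
A rough sketch of how items 1) and 3) above might fit together, in case it
helps the discussion. This is not existing Tez code: the class name
NodeProbationTracker, its methods, and the minReports threshold are all
hypothetical, and the escalation simply follows the "5m, 15m and 30m" example
from the description. Time is passed in explicitly so the logic is easy to
test.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch only; none of these names exist in Tez. */
class NodeProbationTracker {
    // Escalating probation periods taken from the issue text: 5m, 15m, 30m.
    private static final long[] PROBATION_MS = {
        5 * 60_000L, 15 * 60_000L, 30 * 60_000L
    };

    private static final class State {
        int strikes;           // number of probations served so far
        long probationEndsAt;  // time (ms) at which the node is usable again
    }

    private final Map<String, State> nodes = new HashMap<>();

    /**
     * Records a failure on a node at time nowMs.
     * Returns the probation length in ms, or -1 once every probation
     * period has been used up and the node is given up for good.
     */
    long recordFailure(String nodeId, long nowMs) {
        State s = nodes.computeIfAbsent(nodeId, k -> new State());
        if (s.strikes >= PROBATION_MS.length) {
            s.probationEndsAt = Long.MAX_VALUE; // never comes back
            return -1L;
        }
        long period = PROBATION_MS[s.strikes++];
        s.probationEndsAt = nowMs + period;
        return period;
    }

    /** A node is usable if it was never flagged or its probation is over. */
    boolean isUsable(String nodeId, long nowMs) {
        State s = nodes.get(nodeId);
        return s == null || nowMs >= s.probationEndsAt;
    }

    /**
     * Item 3: re-run completed work only if some downstream consumer still
     * needs the output AND enough consumers reported read errors against
     * the node. minReports is an assumed tunable, not an existing setting.
     */
    static boolean shouldReexecute(int pendingConsumers,
                                   int readErrorReports,
                                   int minReports) {
        return pendingConsumers > 0 && readErrorReports >= minReports;
    }
}
```

With this shape, a node that fails once is skipped for five minutes and then
reused, while completed output on it is left alone unless consumers that still
need it keep reporting read errors. Item 2) is not covered here; a node that
YARN itself marks as bad would presumably bypass probation altogether.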



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3075) Revamp bad node handling

2018-10-10 Thread JIN SUN (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16645319#comment-16645319
 ] 

JIN SUN commented on TEZ-3075:
------------------------------

+1

Need more heuristics for handling machine failures.



