[ 
https://issues.apache.org/jira/browse/YARN-8044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414089#comment-16414089
 ] 

Eric Yang commented on YARN-8044:
---------------------------------

What if the binary doesn't exist on a faulty node because of a disk failure, and 
the exit code is -1?  In that case we would want the retry to happen on some 
other node.  I am not sure that adding logic to inspect exit codes is a good way 
to fix the retry policy.  There are too many exit codes, and their meanings 
differ among applications. 

We might want to use a heuristic approach based on failure validity intervals.  
We might be able to count the number of failures within the interval to decide 
whether we should abort the containers.
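
As a rough illustration of the counting idea, here is a minimal standalone 
sketch.  It does not use any YARN API; the class name FailureWindow, the method 
name, and the two parameters (interval length and failure threshold) are all 
hypothetical and only meant to show how failures inside a sliding time window 
could drive the retry-or-abort decision.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical helper (not part of YARN): tracks failure timestamps and
 * reports whether the failure count inside the validity interval is still
 * within the allowed threshold.
 */
final class FailureWindow {
    private final long validityIntervalMs; // assumed window length
    private final int maxFailures;         // assumed failures tolerated per window
    private final Deque<Long> failureTimes = new ArrayDeque<>();

    FailureWindow(long validityIntervalMs, int maxFailures) {
        this.validityIntervalMs = validityIntervalMs;
        this.maxFailures = maxFailures;
    }

    /**
     * Records a failure at nowMs. Returns true if the container should
     * still be retried, false if it should be aborted.
     */
    boolean recordFailureAndShouldRetry(long nowMs) {
        // Forget failures that are older than the validity interval.
        while (!failureTimes.isEmpty()
                && nowMs - failureTimes.peekFirst() > validityIntervalMs) {
            failureTimes.pollFirst();
        }
        failureTimes.addLast(nowMs);
        return failureTimes.size() <= maxFailures;
    }
}
```

With an interval of 1000 ms and a threshold of 2, a third failure inside the 
same second would abort, while a failure arriving well after the window has 
passed would be retried again, whatever the exit code was.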

> Determine the appropriate default ContainerRetryPolicy
> ------------------------------------------------------
>
>                 Key: YARN-8044
>                 URL: https://issues.apache.org/jira/browse/YARN-8044
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Shane Kumpf
>            Priority: Major
>
> {{AbstractLauncher}} sets the retry policy to {{RETRY_ON_ALL_ERRORS}}, which 
> may be too inclusive. Some error codes, such as -1, should likely result in a 
> hard fail.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
