[jira] [Commented] (YARN-8044) Determine the appropriate default ContainerRetryPolicy

2018-03-27 Thread Shane Kumpf (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16416112#comment-16416112
 ] 

Shane Kumpf commented on YARN-8044:
---

Sounds good. I'll close this issue.

> Determine the appropriate default ContainerRetryPolicy
> --
>
> Key: YARN-8044
> URL: https://issues.apache.org/jira/browse/YARN-8044
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Shane Kumpf
> Priority: Major
>
> {{AbstractLauncher}} sets the retry policy to {{RETRY_ON_ALL_ERRORS}}, which 
> may be too inclusive. Some error codes, such as -1, should likely result in a 
> hard fail.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8044) Determine the appropriate default ContainerRetryPolicy

2018-03-27 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16416101#comment-16416101
 ] 

Eric Yang commented on YARN-8044:
-

[~shaneku...@gmail.com] It appears that Wangda is offering an alternate 
configuration for the retry policy in YARN-8080.  I think his proposal is good 
enough to let the user decide whether or not to retry.  This can eliminate 
possible overlaps of exit codes.




[jira] [Commented] (YARN-8044) Determine the appropriate default ContainerRetryPolicy

2018-03-27 Thread Shane Kumpf (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16416093#comment-16416093
 ] 

Shane Kumpf commented on YARN-8044:
---

{quote}What if the binary doesn't exist on one of the faulty nodes due to disk 
failure, and the exit code is -1?  We will want the retry to happen on some 
other node.
{quote}
I agree that we would want to retry in that case and can see the challenge with 
using exit codes.
{quote}We might want to use a heuristic approach with failure validity 
intervals.  We might be able to count the number of failures within the time 
frame to decide whether we should abort the containers.
{quote}
Makes sense to me. It seems YARN-5015 / YARN-8032 address this approach.

Given the above, would it make more sense to re-purpose this issue to expose 
the retry policy used by Native Services to the end user? We could use 
RETRY_ON_ALL_ERRORS as the default.
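
A minimal sketch of what exposing the policy could look like, assuming the user supplies the policy name as a plain string. The class {{RetryPolicyConfigSketch}}, its local {{Policy}} enum (a stand-in that only mirrors the names of YARN's {{ContainerRetryPolicy}} values), and the {{resolve}} method are hypothetical and are not part of Native Services.

```java
import java.util.Locale;

// Hypothetical sketch: resolves a user-supplied retry policy name,
// falling back to RETRY_ON_ALL_ERRORS when nothing is configured.
public class RetryPolicyConfigSketch {

    // Local stand-in mirroring the names of YARN's ContainerRetryPolicy values.
    public enum Policy {
        NEVER_RETRY,
        RETRY_ON_ALL_ERRORS,
        RETRY_ON_SPECIFIC_ERROR_CODES
    }

    // Default suggested in this comment.
    public static final Policy DEFAULT_POLICY = Policy.RETRY_ON_ALL_ERRORS;

    // Returns the default for a missing or blank value; an unknown name
    // makes Policy.valueOf throw IllegalArgumentException.
    public static Policy resolve(String userValue) {
        if (userValue == null || userValue.trim().isEmpty()) {
            return DEFAULT_POLICY;
        }
        return Policy.valueOf(userValue.trim().toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(resolve(null));
        System.out.println(resolve("never_retry"));
    }
}
```

Rejecting unknown names outright, rather than silently falling back to the default, is a design choice made here for illustration only.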




[jira] [Commented] (YARN-8044) Determine the appropriate default ContainerRetryPolicy

2018-03-26 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414089#comment-16414089
 ] 

Eric Yang commented on YARN-8044:
-

What if the binary doesn't exist on one of the faulty nodes due to disk failure, 
and the exit code is -1?  We will want the retry to happen on some other node.  
I am not sure that adding logic to detect exit codes is a good way to go about 
fixing the retry policy.  There are too many exit codes that have different 
meanings among applications. 

We might want to use a heuristic approach with failure validity intervals.  
We might be able to count the number of failures within the time frame to decide 
whether we should abort the containers.
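
As a rough illustration of the counting idea, the sketch below keeps the timestamps of recent failures and allows a retry only while the number of failures inside the validity interval stays at or below a limit. The class {{FailureWindowSketch}} and its parameters are hypothetical; this is not how YARN implements failure validity intervals.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: decide whether to keep retrying based on how many
// failures occurred within a sliding validity interval.
public class FailureWindowSketch {
    private final int maxFailures;          // failures tolerated inside the window
    private final long validityIntervalMs;  // window length in milliseconds
    private final Deque<Long> failureTimes = new ArrayDeque<>();

    public FailureWindowSketch(int maxFailures, long validityIntervalMs) {
        this.maxFailures = maxFailures;
        this.validityIntervalMs = validityIntervalMs;
    }

    // Records a failure at nowMs and returns true if a retry is still allowed.
    // Timestamps are assumed to be non-decreasing across calls.
    public boolean recordFailureAndCheckRetry(long nowMs) {
        failureTimes.addLast(nowMs);
        // Forget failures that are older than the validity interval.
        while (!failureTimes.isEmpty()
                && nowMs - failureTimes.peekFirst() > validityIntervalMs) {
            failureTimes.removeFirst();
        }
        return failureTimes.size() <= maxFailures;
    }

    public static void main(String[] args) {
        FailureWindowSketch window = new FailureWindowSketch(2, 1000L);
        System.out.println(window.recordFailureAndCheckRetry(0L));
        System.out.println(window.recordFailureAndCheckRetry(100L));
        System.out.println(window.recordFailureAndCheckRetry(200L));
    }
}
```

With this shape, a container that fails occasionally over a long period keeps being retried, while a burst of failures inside one interval leads to an abort, regardless of the exit codes involved.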




[jira] [Commented] (YARN-8044) Determine the appropriate default ContainerRetryPolicy

2018-03-17 Thread Shane Kumpf (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16403756#comment-16403756
 ] 

Shane Kumpf commented on YARN-8044:
---

{{ContainerRetryPolicy}} doesn't really provide a way to hard fail on specific 
error codes today. {{RETRY_ON_SPECIFIC_ERROR_CODES}} is likely too restrictive, 
as -1 may be the only code where a hard fail makes sense. Adding 
{{FAIL_ON_SPECIFIC_ERROR_CODES}} support may make sense.
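
To make the difference between the two list-based policies concrete, here is a minimal sketch of the retry decision. The class {{RetryPolicySketch}} and its {{Policy}} enum are local stand-ins, not YARN code; {{FAIL_ON_SPECIFIC_ERROR_CODES}} is only the value proposed above and does not exist in {{ContainerRetryPolicy}}. Treating exit code 0 as success is also an assumption of the sketch.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical sketch of the retry decision for each policy.
public class RetryPolicySketch {

    // Local stand-in for ContainerRetryPolicy. FAIL_ON_SPECIFIC_ERROR_CODES
    // is the proposed value and is NOT part of YARN.
    public enum Policy {
        NEVER_RETRY,
        RETRY_ON_ALL_ERRORS,
        RETRY_ON_SPECIFIC_ERROR_CODES,
        FAIL_ON_SPECIFIC_ERROR_CODES
    }

    public static boolean shouldRetry(Policy policy, Set<Integer> errorCodes, int exitCode) {
        if (exitCode == 0) {
            return false; // assumed success, nothing to retry
        }
        switch (policy) {
            case NEVER_RETRY:
                return false;
            case RETRY_ON_ALL_ERRORS:
                return true;
            case RETRY_ON_SPECIFIC_ERROR_CODES:
                // Retry only for the listed codes.
                return errorCodes.contains(exitCode);
            case FAIL_ON_SPECIFIC_ERROR_CODES:
                // Retry for everything except the listed codes.
                return !errorCodes.contains(exitCode);
            default:
                throw new IllegalArgumentException("Unknown policy: " + policy);
        }
    }

    public static void main(String[] args) {
        Set<Integer> hardFailCodes = Collections.singleton(-1);
        System.out.println(shouldRetry(Policy.FAIL_ON_SPECIFIC_ERROR_CODES, hardFailCodes, -1));
        System.out.println(shouldRetry(Policy.FAIL_ON_SPECIFIC_ERROR_CODES, hardFailCodes, 137));
    }
}
```

With a deny list, only -1 has to be named, whereas {{RETRY_ON_SPECIFIC_ERROR_CODES}} would require enumerating every code that should be retried.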
