[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184968#comment-15184968
 ] 

Junping Du commented on YARN-3998:
----------------------------------

bq. We could set the retry policy to RETRY_ON_SPECIFIC_ERROR_CODE to handle 
this case; the error code might be INITIALIZE_USER_FAILED for the "user not 
found" case. Or do you mean it is not sufficient to just specify error codes?

In theory, basing the retry policy on error codes sounds fine. However, 
in some particular cases (such as the "user not found" exception in 
LinuxContainerExecutor), the failure could share an error code such as 
-1000 (INVALID) with other cases, so we may need more specific information 
(for example, by parsing the error message).
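
To illustrate the concern, here is a minimal Java sketch. It is not YARN code: the class FailureClassifier, its Kind enum, and the marker string are hypothetical names introduced only for this example, and the real diagnostics text may be worded differently. Only the -1000 (INVALID) value and the "user not found" case come from the discussion. The sketch shows that when several failures share one exit code, the code alone cannot separate them, so the classifier falls back to the message text.

```java
// Hypothetical sketch, not YARN code: tell apart failures that share one exit code.
final class FailureClassifier {
    enum Kind { USER_NOT_FOUND, OTHER_INVALID, OTHER }

    // Value taken from the discussion: -1000 (INVALID).
    static final int INVALID = -1000;

    // Assumed marker text; the real diagnostics message may be worded differently.
    static final String USER_NOT_FOUND_MARKER = "user not found";

    static Kind classify(int exitCode, String diagnostics) {
        if (exitCode != INVALID) {
            return Kind.OTHER;
        }
        // The exit code alone is ambiguous here, so fall back to the message text.
        boolean userMissing = diagnostics != null
            && diagnostics.toLowerCase(java.util.Locale.ROOT).contains(USER_NOT_FOUND_MARKER);
        return userMissing ? Kind.USER_NOT_FOUND : Kind.OTHER_INVALID;
    }
}
```

A retry policy could then act on the returned Kind instead of the raw exit code. Matching on message text is brittle, so a dedicated error code for this case would avoid the parsing step where one can be assigned.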

> Add retry-times to let NM re-launch container when it fails to run
> ------------------------------------------------------------------
>
>                 Key: YARN-3998
>                 URL: https://issues.apache.org/jira/browse/YARN-3998
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch
>
>
> I'd like to add a field (retry-times) to ContainerLaunchContext. When the AM 
> launches containers, it can specify the value. The NM will then re-launch the 
> container up to 'retry-times' times when it fails to run (e.g. the exit code 
> is not 0). This saves a lot of time: it avoids container localization, and 
> the RM does not need to re-schedule the container. Local files in the 
> container's working directory are also left for re-use (if the container has 
> downloaded some big files, it does not need to re-download them when running 
> again). 
> We find this useful in systems like Storm.
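
The re-launch behaviour described in the issue can be sketched as a simple loop. This is a hypothetical Java illustration, not the NodeManager implementation or any of the attached patches: RelaunchLoop and runWithRetries are made-up names, and the launcher is modelled as a function that returns the container's exit code.

```java
// Hypothetical sketch, not the NodeManager implementation.
final class RelaunchLoop {
    // Runs the launcher once, then up to retryTimes more times while it
    // keeps exiting with a non-zero code. Returns the last exit code.
    static int runWithRetries(java.util.function.IntSupplier launcher, int retryTimes) {
        int exitCode = launcher.getAsInt();
        for (int attempt = 0; attempt < retryTimes && exitCode != 0; attempt++) {
            // Re-launch in place: no re-scheduling and no re-localization.
            exitCode = launcher.getAsInt();
        }
        return exitCode;
    }
}
```

With retryTimes set to 0, the loop body never runs and the container is launched exactly once.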



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)