[
https://issues.apache.org/jira/browse/YARN-614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13655439#comment-13655439
]
Bikas Saha commented on YARN-614:
---------------------------------
Sorry for the delayed response. Thanks for checking the flow of code.
Recovering there by mistake?
{code}
+ // After recovering all app attempts, go over every attempt, and check to
{code}
I think the previous patch was incrementing the ignoredFailureCount by looking
at the status of the current failed attempt. This patch changes it to iterating
over all the attempts. What was the reason? IMO, we could look at
currentAttempt.getMasterContainer() status and increment, right? Other than
repeating the calculation every time, the current code runs the risk of finding
an empty attempt.getJustFinishedContainers() for older attempts. In fact, I am
surprised that the RM is not cleaning up data structures from older attempts
and freeing up memory. We should open a jira to fix that. It could be a
significant overhead if jobs crash with large outstanding finished containers.
I am debating between shouldCountFailureToAttemptLimit() vs isSystemFailure().
What do you say?
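To make the suggestion concrete, here is a rough sketch of the per-failure bookkeeping I have in mind. It is not the patch and not RM code: the class, the ExitReason enum and onAttemptFailed() are hypothetical stand-ins, and the set of reasons treated as system failures is only an assumption based on the issue description. The point is that the counter is updated once, from the failed attempt's master container status, rather than recomputed over all attempts.

```java
import java.util.EnumSet;
import java.util.Set;

/** Illustrative only: hypothetical names, not actual ResourceManager classes. */
public class AttemptFailureSketch {

    /** Hypothetical stand-in for the exit status of an attempt's master container. */
    public enum ExitReason { USER_ERROR, NODE_LOST, DISK_FAILED, ABORTED_BY_YARN }

    // Assumption: these reasons are hardware or YARN failures, not user errors.
    private static final Set<ExitReason> SYSTEM_FAILURES =
        EnumSet.of(ExitReason.NODE_LOST, ExitReason.DISK_FAILED, ExitReason.ABORTED_BY_YARN);

    private final int maxAttempts;
    private int attempts;
    private int ignoredFailureCount;

    public AttemptFailureSketch(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    /** One of the two candidate names discussed above. */
    public static boolean isSystemFailure(ExitReason reason) {
        return SYSTEM_FAILURES.contains(reason);
    }

    /**
     * Called once when the current attempt fails. Only the status of that
     * attempt's master container is inspected; older attempts are not revisited.
     * Returns true if another attempt may still be started.
     */
    public boolean onAttemptFailed(ExitReason masterContainerExit) {
        attempts++;
        if (isSystemFailure(masterContainerExit)) {
            ignoredFailureCount++; // does not count toward the attempt limit
        }
        return attempts - ignoredFailureCount < maxAttempts;
    }

    public int getIgnoredFailureCount() {
        return ignoredFailureCount;
    }
}
```

With maxAttempts set to 1, a NODE_LOST failure would still allow a retry, while a subsequent USER_ERROR failure would not.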
In the test, I think we should use the submission context to set the desired max
attempt value instead of relying on the config default. One of the results of this
jira would be to set the config value to 1.
I wish the test could be made simpler and more constrained around the RMApp
code itself instead of mocking scheduler and allocate calls etc. It would be
less fragile to other changes. I can't think of anything off the top of my head.
Will comment if I hit on anything. Perhaps an alternate test that creates a
mockRM and mockNMs, submits an app, and then removes the mockNM. This would
also be a more real-life test.
> Retry attempts automatically for hardware failures or YARN issues and set
> default app retries to 1
> --------------------------------------------------------------------------------------------------
>
> Key: YARN-614
> URL: https://issues.apache.org/jira/browse/YARN-614
> Project: Hadoop YARN
> Issue Type: Improvement
> Reporter: Bikas Saha
> Assignee: Chris Riccomini
> Fix For: 2.0.5-beta
>
> Attachments: YARN-614-0.patch, YARN-614-1.patch, YARN-614-2.patch,
> YARN-614-3.patch, YARN-614-4.patch, YARN-614-5.patch, YARN-614-6.patch
>
>
> Attempts can fail due to a large number of user errors and they should not be
> retried unnecessarily. The only reason YARN should retry an attempt is when
> the hardware fails or YARN has an error. NM failing, lost NM and NM disk
> errors are the hardware errors that come to mind.