[
https://issues.apache.org/jira/browse/YARN-614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648197#comment-13648197
]
Bikas Saha commented on YARN-614:
---------------------------------
Unfortunately that is not how it happens. The RECOVER event is enqueued but not
sent since the event dispatcher starts after recovery is completed. So by the
time RECOVER is reached all the state has been populated. The catch is that
currently recovery stores the attempt state before launching the attempt. At
that time it does not have the container status because that is obtained when
the attempt finishes. So I am of the opinion of leaving recovery for a
different jira.
Now for the patch itself.
Minor style thing. All this code could go inside app.updateFailureCount() and
let it do whatever it wants because the app has access to all the data and
more. That method can evolve separately without bloating the transition method.
We need to check if the "justFinished" containers would always have an entry
for the master container. Specially the case when the node is lost because it
went down.
{code}
+ // If the failure was not the AM's fault (e.g. node lost, or disk
+ // failure), then increment ignored failures, so we don't count the
+ // failure when determining whether to restart the app or not.
+ RMAppAttempt appMasterAttempt = app.attempts.get(app.currentAttempt
+ .getAppAttemptId());
+ Container appMasterContainer = appMasterAttempt.getMasterContainer();
+ ContainerStatus status = appMasterAttempt.getJustFinishedContainers()
+ .get(appMasterContainer.getId());
+
+ app.updateFailureCount(status.getExitStatus());
{code}
I am assuming aborted implies node lost in the patch. We need to make sure that
aborted is not being used as a generic catch all. Else we may need to add a new
specific exit status NODE_LOST for the specific case.
{code}
+ private boolean shouldCountFailureToAttemptLimit(int
masterContainerExitStatus) {
+ return masterContainerExitStatus != ContainerExitStatus.DISKS_FAILED
+ && masterContainerExitStatus != ContainerExitStatus.ABORTED;
+ }
{code}
I am not in favor of changing the List to a Map. The search is performed only
once at the end of the life of the attempt and also if it has failed. So I am
not sure perf is an issue here if we iterate once through this list. List is
cheaper wrt memory and also maintains the order of completion of containers as
received by the RM. Its cheap for the ApplicationMasterService to pull when it
populates the allocate response. This code probably wont compile because
ApplicationMasterService expects a list and not a map.
{code}
- private final List<ContainerStatus> justFinishedContainers =
- new ArrayList<ContainerStatus>();
+ private final Map<ContainerId, ContainerStatus> justFinishedContainers =
+ new HashMap<ContainerId, ContainerStatus>();
{code}
Not quite sure why this method needs to be public. If its private then it need
not be part of the RMApp interface and thus MockAsm or MockRMapp need not
change.
{code}
@Override
+ public int getIgnoredFailures() {
+ this.readLock.lock();
+
+ try {
+ return this.ignoredFailures;
+ } finally {
+ this.readLock.unlock();
+ }
+ }
+
{code}
> Retry attempts automatically for hardware failures or YARN issues and set
> default app retries to 1
> --------------------------------------------------------------------------------------------------
>
> Key: YARN-614
> URL: https://issues.apache.org/jira/browse/YARN-614
> Project: Hadoop YARN
> Issue Type: Improvement
> Reporter: Bikas Saha
> Attachments: YARN-614-0.patch, YARN-614-1.patch
>
>
> Attempts can fail due to a large number of user errors and they should not be
> retried unnecessarily. The only reason YARN should retry an attempt is when
> the hardware fails or YARN has an error. NM failing, lost NM and NM disk
> errors are the hardware errors that come to mind.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira