[jira] [Commented] (YARN-614) Retry attempts automatically for hardware failures or YARN issues and set default app retries to 1

Bikas Saha (JIRA) Thu, 02 May 2013 22:24:20 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648197#comment-13648197
 ]


Bikas Saha commented on YARN-614:
---------------------------------

Unfortunately that is not how it happens. The RECOVER event is enqueued but not 
sent since the event dispatcher starts after recovery is completed. So by the 
time RECOVER is reached all the state has been populated. The catch is that 
currently recovery stores the attempt state before launching the attempt. At 
that time it does not have the container status because that is obtained when 
the attempt finishes. So I am of the opinion of leaving recovery for a 
different jira.

Now for the patch itself.
Minor style thing. All this code could go inside app.updateFailureCount() and 
let it do whatever it wants because the app has access to all the data and 
more. That method can evolve separately without bloating the transition method.
We need to check if the "justFinished" containers would always have an entry 
for the master container. Specially the case when the node is lost because it 
went down.
{code}
+        // If the failure was not the AM's fault (e.g. node lost, or disk
+        // failure), then increment ignored failures, so we don't count the
+        // failure when determining whether to restart the app or not.
+        RMAppAttempt appMasterAttempt = app.attempts.get(app.currentAttempt
+            .getAppAttemptId());
+        Container appMasterContainer = appMasterAttempt.getMasterContainer();
+        ContainerStatus status = appMasterAttempt.getJustFinishedContainers()
+            .get(appMasterContainer.getId());
+
+        app.updateFailureCount(status.getExitStatus());
{code}

I am assuming aborted implies node lost in the patch. We need to make sure that 
aborted is not being used as a generic catch all. Else we may need to add a new 
specific exit status NODE_LOST for the specific case.
{code}
+  private boolean shouldCountFailureToAttemptLimit(int 
masterContainerExitStatus) {
+    return masterContainerExitStatus != ContainerExitStatus.DISKS_FAILED
+        && masterContainerExitStatus != ContainerExitStatus.ABORTED;
+  }
{code}

I am not in favor of changing the List to a Map. The search is performed only 
once at the end of the life of the attempt and also if it has failed. So I am 
not sure perf is an issue here if we iterate once through this list. List is 
cheaper wrt memory and also maintains the order of completion of containers as 
received by the RM. Its cheap for the ApplicationMasterService to pull when it 
populates the allocate response. This code probably wont compile because 
ApplicationMasterService expects a list and not a map.
{code}
-  private final List<ContainerStatus> justFinishedContainers =
-    new ArrayList<ContainerStatus>();
+  private final Map<ContainerId, ContainerStatus> justFinishedContainers =
+    new HashMap<ContainerId, ContainerStatus>();
{code}

Not quite sure why this method needs to be public. If its private then it need 
not be part of the RMApp interface and thus MockAsm or MockRMapp need not 
change.
{code}
   @Override
+  public int getIgnoredFailures() {
+    this.readLock.lock();
+
+    try {
+      return this.ignoredFailures;
+    } finally {
+      this.readLock.unlock();
+    }
+  }
+
{code}
                
> Retry attempts automatically for hardware failures or YARN issues and set 
> default app retries to 1
> --------------------------------------------------------------------------------------------------
>
>                 Key: YARN-614
>                 URL: https://issues.apache.org/jira/browse/YARN-614
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Bikas Saha
>         Attachments: YARN-614-0.patch, YARN-614-1.patch
>
>
> Attempts can fail due to a large number of user errors and they should not be 
> retried unnecessarily. The only reason YARN should retry an attempt is when 
> the hardware fails or YARN has an error. NM failing, lost NM and NM disk 
> errors are the hardware errors that come to mind.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (YARN-614) Retry attempts automatically for hardware failures or YARN issues and set default app retries to 1

Reply via email to