[ 
https://issues.apache.org/jira/browse/YARN-6153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam updated YARN-6153:
-------------------------------
    Attachment: YARN-6153-branch-2.8.patch

I'm uploading the patch for branch-2.8.

1. In testRMAppAttemptFailuresValidityInterval, the use of the systemClock has 
been replaced with Thread.sleep.

Because of the following change, RMAppImpl no longer uses the systemClock to 
check the validity interval:

{code}
-  private int getNumFailedAppAttempts() {
+  public int getNumFailedAppAttempts() {
     int completedAttempts = 0;
-    long endTime = this.systemClock.getTime();
     // Do not count AM preemption, hardware failures or NM resync
     // as attempt failure.
     for (RMAppAttempt attempt : attempts.values()) {
       if (attempt.shouldCountTowardsMaxAttemptRetry()) {
-        if (this.attemptFailuresValidityInterval <= 0
-            || (attempt.getFinishTime() > endTime
-                - this.attemptFailuresValidityInterval)) {
-          completedAttempts++;
-        }
+        completedAttempts++;
       }
     }
{code}

2. In testAMRestartNotLostContainerAfterAttemptFailuresValidityInterval, 
the timeout value has been increased to 40 seconds.

YARN-4807 is not yet included in branch-2.8. I think that is why the timeout 
happens.
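
For reference, the validity-interval condition that the diff in item 1 removes 
from getNumFailedAppAttempts can be sketched in isolation as below. This is only 
an illustration, not the RMAppImpl code: the class ValidityWindowSketch and the 
helper countFailuresInWindow are hypothetical names, and the sample timestamps 
are assumptions taken from the scenario in the issue description (a 5-minute 
interval, two AM failures 10 minutes apart).

```java
import java.util.Arrays;
import java.util.List;

public class ValidityWindowSketch {

    // Hypothetical helper mirroring the condition removed in the diff:
    // a failed attempt is counted if the interval is disabled (<= 0) or
    // if the attempt finished later than (now - interval).
    public static int countFailuresInWindow(List<Long> finishTimesMs,
                                            long nowMs,
                                            long validityIntervalMs) {
        int count = 0;
        for (long finishTime : finishTimesMs) {
            if (validityIntervalMs <= 0
                    || finishTime > nowMs - validityIntervalMs) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        long intervalMs = 300_000L;      // attemptFailuresValidityInterval
        long firstFailureMs = 0L;        // assumed time of the first AM failure
        long secondFailureMs = 600_000L; // assumed: 10 minutes later
        List<Long> failures = Arrays.asList(firstFailureMs, secondFailureMs);

        // With the window applied, only the second failure is recent enough.
        System.out.println(
            countFailuresInWindow(failures, secondFailureMs, intervalMs));
        // With the interval disabled, every failure is counted.
        System.out.println(
            countFailuresInWindow(failures, secondFailureMs, 0L));
    }
}
```

Under these assumed timestamps the windowed count is 1 and the unwindowed count 
is 2, which is the difference between the two sides of the diff.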


> keepContainer does not work when AM retry window is set
> -------------------------------------------------------
>
>                 Key: YARN-6153
>                 URL: https://issues.apache.org/jira/browse/YARN-6153
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.7.1
>            Reporter: kyungwan nam
>            Assignee: kyungwan nam
>             Fix For: 2.8.0, 3.0.0-alpha3
>
>         Attachments: YARN-6153.001.patch, YARN-6153.002.patch, 
> YARN-6153.003.patch, YARN-6153.004.patch, YARN-6153.005.patch, 
> YARN-6153.006.patch, YARN-6153-branch-2.8.patch
>
>
> yarn.resourcemanager.am.max-attempts has been configured to 2 in my cluster.
> I submitted a YARN application (a Slider app) with keepContainers=true and 
> attemptFailuresValidityInterval=300000.
> It worked properly when the AM failed the first time: all containers 
> launched by the previous AM were resynced with the new AM (attempt2) 
> without being killed.
> After 10 minutes, I expected the AM failure count to have been reset by 
> attemptFailuresValidityInterval (5 minutes).
> However, all containers were killed when the AM failed a second time (the 
> new AM, attempt3, was launched properly).


