[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

Varun Vasudev (JIRA) Thu, 07 Jan 2016 06:11:01 -0800

    [ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15087392#comment-15087392
 ]


Varun Vasudev commented on YARN-3998:
-------------------------------------

Thanks for the patch [~hex108]!

In your implementation, the relaunched container will go through the 
launchContainer call which will try to setup the container launch environment 
again(like creating the local and log dirs, creating tokens, etc). Won't this 
lead to FileAlreadyExistsException being thrown as part of the launchContainer 
call? In addition, this also means that on a node with more than one local dir, 
different attempts could get allocated to different local dirs. I wonder if 
it's better to move the retry logic into the launchContainer function instead 
of adding a new state transition?

Some feedback on the code itself -
1)
Rename ContainerRetry to ContainerRetryContext

2)
{code}
{@link ContainerRetryPolicy} : NEVER_RETRY(no matter what error code is
+ *     when container fails to run, just do not retry), ALWAYS_RETRY(no matter
+ *     what error code is, when container fails to run, just retry),
+ *     RETRY_ON_SPECIFIC_ERROR_CODE(when container fails to run, do retry if 
the
+ *     error code is one of <em>errorCodes</em>, otherwise do not retry.
+ *     Note: if error code is 137(SIGKILL) or 143(SIGTERM), it will not retry
+ *     because it is usually killed on purpose.
{code}
Specify that the default policy is NEVER_RETRY

3)
Rename retryTimes to maxRetries

4)
Change the interval unit to ms instead of seconds

5)
{code}
+  // remain retries to relaunch container if needed
+  private int remainRetries;
{code}
Rename remainRetries to remainingRetryAttempts

6)
{code}
+    if (launchContext != null) {
+      this.containerRetry = launchContext.getContainerRetry();
+      if (this.containerRetry != null) {
+        remainRetries = containerRetry.getRetryTimes();
+      }
+    } else {
+      this.containerRetry = null;
+    }
{code}
Instead of setting containerRetry to null, can we initialize to a retry object 
with never_retry

7)
Rename RetryWithFailureTransition to RetryFailureTransition

8)
Change
{code}
+        LOG.info("Relaunch Container " + container.getContainerId()
+            + ". Remain retry times : " + container.remainRetries
+            + ". Retry interval is "
+            + container.containerRetry.getRetryInterval() + "s");
{code}
to 
{code}
+        LOG.info("Relaunching container " + container.getContainerId()
+            + ". Remaining retry attempts : " + container.remainRetries
+            + ". Retry interval is "
+            + container.containerRetry.getRetryInterval() + "s");
{code}

9)
Rename storeContainerRemainRetries to storeContainerRemainingRetryAttempts


> Add retry-times to let NM re-launch container when it fails to run
> ------------------------------------------------------------------
>
>                 Key: YARN-3998
>                 URL: https://issues.apache.org/jira/browse/YARN-3998
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3998.01.patch, YARN-3998.02.patch
>
>
> I'd like to add a field(retry-times) in ContainerLaunchContext. When AM 
> launches containers, it could specify the value. Then NM will re-launch the 
> container 'retry-times' times when it fails to run(e.g.exit code is not 0). 
> It will save a lot of time. It avoids container localization. RM does not 
> need to re-schedule the container. And local files in container's working 
> directory will be left for re-use.(If container have downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it is useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run

Reply via email to