[jira] [Comment Edited] (YARN-5620) Core changes in NodeManager to support for upgrade and rollback of Containers

Arun Suresh (JIRA) Wed, 07 Sep 2016 14:18:44 -0700

    [ https://issues.apache.org/jira/browse/YARN-5620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15471809#comment-15471809 ]


Arun Suresh edited comment on YARN-5620 at 9/7/16 9:17 PM:
-----------------------------------------------------------

Thanks for the review [~jianhe]

bq. The COMMIT_UPGRADE API: I don’t quite get the necessity of this API. Could 
you explain under what scenario should the user call this API ?
Consider an AM that upgrades a container with a new binary, after which the
process is restarted. Suppose the process dies about 10 minutes later. The NM
has no way of knowing whether the process died because of the upgrade (a memory
leak, say) or because of some transient failure, so it cannot decide whether to
Retry the process or Rollback the upgrade. Only the AM knows whether the
upgrade actually succeeded. Essentially, the AM should call the commit API,
AFTER it has performed some upgrade diagnostics on the container, to notify the
NM that the upgrade is fine and that any subsequent failure can be handled by
the existing Retry Policy. We can provide an *autoCommit* convenience method
that combines upgrade + commit, but I feel it is important that we keep the
explicit commit API.
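
To make the commit semantics concrete, here is a minimal sketch of how the NM could choose between Rollback and Retry depending on whether the AM has committed. It assumes that a failure before commit is attributed to the upgrade and triggers a rollback. All class and method names are hypothetical and are not taken from the YARN-5620 patches.

```java
// Hypothetical sketch; none of these names come from the YARN-5620 patches.
final class UpgradeCommitSketch {
    enum FailureAction { ROLLBACK, RETRY }

    /** Tracks whether the AM has committed the latest upgrade of one container. */
    static final class ContainerUpgradeState {
        private boolean upgradePending = false;

        /** The AM supplied a new launch context; the upgrade is not yet trusted. */
        void upgrade() { upgradePending = true; }

        /** The AM ran its diagnostics and declares the upgrade good. */
        void commit() { upgradePending = false; }

        /** Decides what to do when the container process exits unexpectedly. */
        FailureAction onProcessExit() {
            if (upgradePending) {
                // Assumption: an uncommitted upgrade is blamed for the failure.
                upgradePending = false;
                return FailureAction.ROLLBACK;
            }
            // After commit, the existing retry policy takes over.
            return FailureAction.RETRY;
        }
    }
}
```

An *autoCommit* variant would amount to calling {{upgrade()}} and {{commit()}} back to back, which gives up the window in which the AM can still blame a failure on the upgrade.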

bq. The ROLLBACK_UPGRADE API: I think it should be able to rollback to any 
previous version, rather than only the immediate previous one. In some sense, 
it’s the same as upgrade.
I agree the AM should be able to move to any previous version, but:
# I feel the versioning should NOT be managed by the NM, since a) the launch
context is provided and managed by the AM, so the AM should take care of tying
the context to the version, and b) the NM would have to deal with (possibly
huge) storage implications to keep track of all the earlier versions.
# It should not be called *rollback*. The AM should call
{{upgradeContainer(launchContext)}} with some previous context.
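
As a rough illustration of AM-side versioning, the sketch below has the AM keep its own version-to-context map and move to any earlier version by passing that context to an upgrade call. {{LaunchContext}} and {{NodeManagerStub}} are simplified stand-ins invented for this example, not the real YARN types.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; LaunchContext and NodeManagerStub are simplified stand-ins.
final class AmVersionRegistrySketch {
    /** Stand-in for ContainerLaunchContext; carries only a command string. */
    static final class LaunchContext {
        final String command;
        LaunchContext(String command) { this.command = command; }
    }

    /** Stand-in for the NM-facing call; remembers only the context it was last given. */
    static final class NodeManagerStub {
        LaunchContext running;
        void upgradeContainer(LaunchContext ctx) { running = ctx; }
    }

    // The AM, not the NM, owns the mapping from version label to launch context.
    private final Map<String, LaunchContext> versions = new LinkedHashMap<>();
    private final NodeManagerStub nm;

    AmVersionRegistrySketch(NodeManagerStub nm) { this.nm = nm; }

    void register(String version, LaunchContext ctx) { versions.put(version, ctx); }

    /** Moves the container to any registered version, newer or older. */
    void moveTo(String version) {
        LaunchContext ctx = versions.get(version);
        if (ctx == null) {
            throw new IllegalArgumentException("Unknown version: " + version);
        }
        nm.upgradeContainer(ctx);
    }
}
```

With this split, the NM never stores more than what it is currently running (plus whatever it needs for an implicit rollback), and the storage cost of the version history stays with the AM.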



bq. IMHO, we probably can use one API restartContainer(context) for both 
upgrade and downgrade
I agree that both *rollback* (explicit rollback via API) and *upgrade* can be
implemented as wrappers over {{restartContainer(launchContext)}}. But, in my
opinion, *rollback* should not take an _explicit_ launchContext; it should
always use the immediately previous context.
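
A minimal sketch of that layering, assuming the NM remembers only the current and the immediately previous context. Plain strings stand in for launch contexts, and the class and method names are hypothetical.

```java
// Hypothetical sketch; strings stand in for launch contexts.
final class RestartWrapperSketch {
    private String current;
    private String previous; // only the immediately previous context is kept

    RestartWrapperSketch(String initialContext) { this.current = initialContext; }

    /** The single underlying primitive: restart with the given context. */
    private void restartContainer(String launchContext) { current = launchContext; }

    /** Upgrade takes an explicit context and remembers the one it replaces. */
    void upgrade(String newContext) {
        previous = current;
        restartContainer(newContext);
    }

    /** Rollback takes no context; it always restores the immediately previous one. */
    void rollback() {
        if (previous == null) {
            throw new IllegalStateException("No previous context to roll back to");
        }
        restartContainer(previous);
        previous = null;
    }

    String runningContext() { return current; }
}
```

Note that after two upgrades a rollback returns to the second-to-last context only; going further back would be an {{upgrade}} call with an older context supplied by the AM.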

bq. Also, Forcing containers to be restarted with previous version if upgrade 
fails may not be suitable in all cases, User wants to troubleshoot the failure 
first before triggering a new wave of restarts.
Agreed. I can include an UpgradePolicy that lets users choose to *terminate* or
*rollBack* (implicit rollback) on failure. COMMIT is also useful here: the user
can verify that one wave has upgraded successfully, commit the upgrade on those
instances, and then move on to the next wave.
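
The wave-by-wave use of COMMIT could look roughly like the sketch below: the AM checks every container in a wave, commits the wave only if all checks pass, and otherwise stops so the failure can be investigated. The health check and all names are hypothetical placeholders.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch; the health check and container ids are placeholders.
final class WaveUpgradeSketch {
    /**
     * Returns the ids of the containers whose upgrade was committed.
     * Stops at the first wave that contains an unhealthy container.
     */
    static List<String> rollOut(List<List<String>> waves, Predicate<String> healthy) {
        List<String> committed = new ArrayList<>();
        for (List<String> wave : waves) {
            for (String containerId : wave) {
                if (!healthy.test(containerId)) {
                    // Do not start further waves; leave this one for troubleshooting.
                    return committed;
                }
            }
            // Every container in this wave passed its diagnostics: commit them all.
            committed.addAll(wave);
        }
        return committed;
    }
}
```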

bq. IMO, as first cut implementation, we can fail the container if upgrade 
fails. we can add retry,  rollback, or release the container as RetryPolicy on 
failure later. your opinion ?
Yup, I will include a policy, as I mentioned above. I don't think *retry* makes
sense, though.





was (Author: asuresh):
Thanks for the review [~jianhe]

bq. The COMMIT_UPGRADE API: I don’t quite get the necessity of this API. Could 
you explain under what scenario should the user call this API ?
Consider an AM that upgrades a container with a new binary, after which the
process is restarted. Suppose the process dies about 10 minutes later. The NM
has no way of knowing whether the process died because of the upgrade (a memory
leak, say) or because of some transient failure, so it cannot decide whether to
Retry the process or Rollback the upgrade. Only the AM knows whether the
upgrade actually succeeded. Essentially, the AM should call the commit API,
AFTER it has performed some upgrade diagnostics on the container, to notify the
NM that the upgrade is fine and that any subsequent failure can be handled by
the existing Retry Policy. We can provide an *autoCommit* convenience method
that combines upgrade + commit, but I feel it is important that we keep the
explicit commit API.

bq. The ROLLBACK_UPGRADE API: I think it should be able to rollback to any 
previous version, rather than only the immediate previous one. In some sense, 
it’s the same as upgrade.
I agree the AM should be able to move to any previous version, but:
# I feel the versioning should NOT be managed by the NM, since a) the launch
context is provided and managed by the AM, so the AM should take care of tying
the context to the version, and b) the NM would have to deal with (possibly
huge) storage implications to keep track of all the earlier versions.
# It should not be called *rollback*. The AM should call
{{restartContainer(launchContext)}} with some previous context.


bq. IMHO, we probably can use one API restartContainer(context) for both 
upgrade and downgrade
I agree that both *rollback* (explicit rollback via API) and *upgrade* can be
implemented as wrappers over {{restartContainer(launchContext)}}. But, in my
opinion, *rollback* should not take an _explicit_ launchContext; it should
always use the immediately previous context.






> Core changes in NodeManager to support for upgrade and rollback of Containers
> -----------------------------------------------------------------------------
>
>                 Key: YARN-5620
>                 URL: https://issues.apache.org/jira/browse/YARN-5620
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Arun Suresh
>            Assignee: Arun Suresh
>         Attachments: YARN-5620.001.patch, YARN-5620.002.patch, 
> YARN-5620.003.patch
>
>
> JIRA proposes to modify the ContainerManager (and other core classes) to 
> support upgrade of a running container with a new {{ContainerLaunchContext}} 
> as well as the ability to rollback the upgrade if the container is not able 
> to restart using the new launch Context. 


