[
https://issues.apache.org/jira/browse/YARN-4876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230349#comment-15230349
]
Varun Vasudev commented on YARN-4876:
-------------------------------------
Thanks for the document [~asuresh]!
Here are my initial thoughts -
{code} Add int field 'destroyDelay' to each 'StartContainerRequest':{code}
I think we should avoid this for now. We should require that AMs that use
initialize() must call destroy, and that AMs that call start with the
ContainerLaunchContext can't call destroy. We can achieve that by adding the
destroyDelay field you mentioned in your document but not allowing AMs to set
it. If initialize is called, set destroyDelay internally to -1, else to 0. I'm
not saying we should drop the feature, just that we should come back to it once
we've sorted out the lifecycle from an initialize->destroy perspective.
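To make the rule concrete, here is a rough sketch of what I have in mind. All of the names below (DestroyDelaySketch, Entry, destroyDelayFor, mayCallDestroy) are made up for illustration and are not existing YARN classes or methods; the reading of -1 as "keep until destroy is called" and 0 as "tear down on stop" is my interpretation of the document.

```java
// Hypothetical sketch; none of these names exist in the YARN code base.
final class DestroyDelaySketch {
    /** Interpreted here as: keep the container until the AM calls destroy. */
    static final int DESTROY_ON_REQUEST = -1;
    /** Interpreted here as: tear the container down as soon as it stops. */
    static final int DESTROY_IMMEDIATELY = 0;

    /** How the AM brought the container up. */
    enum Entry { INITIALIZE, START_WITH_LAUNCH_CONTEXT }

    /** The NM derives the value internally; the AM never supplies it. */
    static int destroyDelayFor(Entry entry) {
        return entry == Entry.INITIALIZE ? DESTROY_ON_REQUEST : DESTROY_IMMEDIATELY;
    }

    /** destroy is only legal for containers that went through initialize. */
    static boolean mayCallDestroy(Entry entry) {
        return destroyDelayFor(entry) == DESTROY_ON_REQUEST;
    }

    public static void main(String[] args) {
        for (Entry e : Entry.values()) {
            System.out.println(e + " -> destroyDelay=" + destroyDelayFor(e)
                    + ", mayCallDestroy=" + mayCallDestroy(e));
        }
    }
}
```

Keeping the field internal means we can expose it to AMs later without changing the record layout again.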
{code}
Modify 'StopContainerRequest' Record:
Add boolean 'destroyContainer':
{code}
Similar to above - let's avoid mixing initialize/destroy with start/stop for
now.
{code}
• Introduce a new 'ContainerEventType.START_CONTAINER' event type.
• Introduce a new 'ContainerEventType.DESTROY_CONTAINER' event type.
• The Container remains in the LOCALIZED state until it receives the
'START_CONTAINER' event.
{code}
Can you add a state machine transition diagram to explain how new states and
events affect each other?
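Even a small transition table would help. The sketch below is only my guess at what the document implies, using a heavily simplified state set; the LOCALIZED -> DONE transition on DESTROY_CONTAINER in particular is an assumption on my part, and LifecycleSketch is a made-up class, not the NM's ContainerImpl state machine.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical, simplified transition table; not the real NM state machine.
final class LifecycleSketch {
    enum State { NEW, LOCALIZED, RUNNING, DONE }
    enum Event { INIT_CONTAINER, START_CONTAINER, KILL_CONTAINER, DESTROY_CONTAINER }

    private static final Map<State, Map<Event, State>> TABLE = new EnumMap<>(State.class);

    private static void on(State from, Event event, State to) {
        TABLE.computeIfAbsent(from, s -> new EnumMap<>(Event.class)).put(event, to);
    }

    static {
        on(State.NEW, Event.INIT_CONTAINER, State.LOCALIZED);
        // The container stays LOCALIZED until START_CONTAINER arrives.
        on(State.LOCALIZED, Event.START_CONTAINER, State.RUNNING);
        // Assumption: destroy is only accepted when no process is running.
        on(State.LOCALIZED, Event.DESTROY_CONTAINER, State.DONE);
        // A killed process leaves the container LOCALIZED, ready for a restart.
        on(State.RUNNING, Event.KILL_CONTAINER, State.LOCALIZED);
    }

    /** Returns the next state, or throws if the event is not legal here. */
    static State next(State current, Event event) {
        Map<Event, State> row = TABLE.get(current);
        State target = (row == null) ? null : row.get(event);
        if (target == null) {
            throw new IllegalStateException(event + " is not valid in state " + current);
        }
        return target;
    }

    public static void main(String[] args) {
        State s = State.NEW;
        for (Event e : new Event[] {Event.INIT_CONTAINER, Event.START_CONTAINER,
                Event.KILL_CONTAINER, Event.DESTROY_CONTAINER}) {
            State t = next(s, e);
            System.out.println(s + " --" + e + "--> " + t);
            s = t;
        }
    }
}
```

Writing the table down this way also makes it obvious which event/state pairs the document leaves undefined.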
{code}
If 'initializeContainer' with a new ContainerLaunchContext is called by the AM
while the Container
is RUNNING, it is treated as a KILL_CONTAINER event followed by a
CONTAINER_RESOURCE_CLEANUP and an INIT_CONTAINER event to kick off
re-localization after which the Container will return to LOCALIZED state.
{code}
I'd really like to avoid this specific behavior. I think we should add an
explicit re-initialize API. For a running process, ideally, we want to localize
the upgraded bits while the container is running and then kill the existing
process to minimize the downtime. For containers where localization can take a
long time, forcing a kill and then a re-initialize adds quite a serious amount
of downtime. Re-initialize and initialize will probably end up having differing
behaviors. On a similar note, I think we might have to introduce a new
"re-initializing/re-localizing/running-localizing" state, which implies that a
container is running but we are carrying out some background work.
In addition, I don't think we can do a cleanup of resources during an upgrade.
For services that have local state in the container work dir, we're essentially
wiping away all the local state and forcing them to start from scratch.
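The ordering I'm suggesting for an upgrade looks roughly like the sketch below: localize in the background first, kill only once the new bits are in place, and leave the work dir alone. ReInitializeSketch, RUNNING_LOCALIZING and the method names are placeholders I made up to show the ordering, not a proposal for the actual class or state names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the re-initialize ordering; all names are placeholders.
final class ReInitializeSketch {
    enum State { RUNNING, RUNNING_LOCALIZING, LOCALIZED }

    private State state = State.RUNNING;
    private final List<String> log = new ArrayList<>();

    /** Start fetching the upgraded bits while the old process keeps running. */
    void reInitialize() {
        require(State.RUNNING);
        state = State.RUNNING_LOCALIZING;
        log.add("localize new resources (old process still running)");
    }

    /** Called once background localization has finished. */
    void onLocalized() {
        require(State.RUNNING_LOCALIZING);
        // Downtime starts only here, after the slow part is already done.
        log.add("kill old process");
        // No CLEANUP_CONTAINER_RESOURCES: local state in the work dir survives.
        log.add("keep container work dir");
        state = State.LOCALIZED;
    }

    State state() { return state; }

    List<String> log() { return log; }

    private void require(State expected) {
        if (state != expected) {
            throw new IllegalStateException("expected " + expected + " but was " + state);
        }
    }

    public static void main(String[] args) {
        ReInitializeSketch container = new ReInitializeSketch();
        container.reInitialize();
        container.onLocalized();
        container.log().forEach(System.out::println);
        System.out.println("final state: " + container.state());
    }
}
```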
Just a clarification: when you mentioned CONTAINER_RESOURCE_CLEANUP, I'm
assuming you meant CLEANUP_CONTAINER_RESOURCES.
{code}
• If 'initializeContainer' is called WITHOUT a new ContainerLaunchContext by the
AM, it is considered a restart, and will follow the same code path as
'initializeContainer' with new ContainerLaunchContext, but will not perform a
CONTAINER_RESOURCE_CLEANUP and INIT_CONTAINER. The Container process will be
killed and the container will be returned to LOCALIZED state.
• If 'startContainer' is called WITHOUT a new ContainerLaunchContext by the AM,
it is treated exactly as the above case, but it will also trigger a
START_CONTAINER event.
{code}
Instead of forcing AMs to make two calls, why don't we just add a restart API
that does everything you've outlined above? It's cleaner and we don't have to
do as many condition checks. In addition, with a restart API we can do stuff
like allowing AMs to specify a delay, or some conditions when the restart
should happen.
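As a strawman, a restart request could look something like the sketch below. RestartRequestSketch and its fields are invented here purely for illustration; the event names in the plan are the ones from your document, and the optional delay is the only extra knob shown.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

// Hypothetical strawman for a single restart call; not an existing YARN record.
final class RestartRequestSketch {
    private final String containerId;
    private final Duration delay;

    RestartRequestSketch(String containerId, Duration delay) {
        if (delay.isNegative()) {
            throw new IllegalArgumentException("delay must not be negative");
        }
        this.containerId = containerId;
        this.delay = delay;
    }

    /** Convenience factory for a restart with no delay. */
    static RestartRequestSketch immediate(String containerId) {
        return new RestartRequestSketch(containerId, Duration.ZERO);
    }

    String containerId() { return containerId; }

    /** The steps the NM would run for this one call, in order. */
    List<String> plan() {
        List<String> steps = new ArrayList<>();
        if (!delay.isZero()) {
            steps.add("wait " + delay);
        }
        steps.add("KILL_CONTAINER");
        steps.add("container back in LOCALIZED");
        steps.add("START_CONTAINER");
        return steps;
    }

    public static void main(String[] args) {
        RestartRequestSketch request =
                new RestartRequestSketch("container_example", Duration.ofSeconds(30));
        System.out.println("restart plan for " + request.containerId() + ":");
        request.plan().forEach(step -> System.out.println("  " + step));
    }
}
```

With a single request type, the NM only has to validate one call instead of inferring a restart from whether a ContainerLaunchContext was passed.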
> [Phase 1] Decoupled Init / Destroy of Containers from Start / Stop
> ------------------------------------------------------------------
>
> Key: YARN-4876
> URL: https://issues.apache.org/jira/browse/YARN-4876
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Arun Suresh
> Assignee: Arun Suresh
> Attachments: YARN-4876-design-doc.pdf
>
>
> Introduce *initialize* and *destroy* container API into the
> *ContainerManagementProtocol* and decouple the actual start of a container
> from the initialization. This will allow AMs to re-start a container without
> having to lose the allocation.
> Additionally, if the localization of the container is associated with the
> initialize (and the cleanup with the destroy), this can also be used by
> applications to upgrade a Container by *re-initializing* with a new
> *ContainerLaunchContext*
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)