[
https://issues.apache.org/jira/browse/YARN-4876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232530#comment-15232530
]
Arun Suresh commented on YARN-4876:
-----------------------------------
Thanks for the feedback, [~vvasudev].
bq. We can achieve that by adding the destroyDelay field you mentioned in your
document but don't allow AMs to set it. If initialize is called, set
destroyDelay internally to -1, else to 0.
I tend to agree with you, but my intention was to introduce a timeout after
which, if no action is taken by the AM, the Container is killed. Maybe we can
have a default timeout (5 mins?) and allow AMs to override it.
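As a rough illustration of the "default timeout with AM override" idea, a minimal sketch might look like the following. The class name, method names, and the treatment of a missing value are assumptions made for this example and are not part of the YARN API; the 5-minute default is simply the tentative figure floated above.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration only; not part of YARN.
final class DestroyDelayPolicy {
    // Assumed default, taken from the tentative "5 mins?" suggestion.
    static final long DEFAULT_DESTROY_DELAY_MS = TimeUnit.MINUTES.toMillis(5);

    /** Returns the AM-supplied delay, or the default when the AM did not set one. */
    static long resolveDelayMs(Long amRequestedMs) {
        if (amRequestedMs == null) {
            return DEFAULT_DESTROY_DELAY_MS;
        }
        if (amRequestedMs < 0) {
            // Assumption: negative overrides are rejected rather than given a special meaning.
            throw new IllegalArgumentException("destroy delay must be >= 0: " + amRequestedMs);
        }
        return amRequestedMs;
    }

    /** True once no AM action has been seen for at least delayMs. */
    static boolean shouldKill(long lastAmActionMs, long nowMs, long delayMs) {
        return nowMs - lastAmActionMs >= delayMs;
    }
}
```

Whether an override should be bounded by an NM-side maximum is left open here.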
bq. Can you add a state machine transition diagram to explain how new states
and events affect each other?
Will do. I was thinking maybe we add another Container State, such as
*AWAITING_START*, to explicitly distinguish it from *LOCALIZED* as I had
suggested in the initial doc. I shall update the doc and put it up.
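Until the diagram is up, one possible shape for such a transition table can be sketched as below. This is purely illustrative: apart from *AWAITING_START*, the reduced set of states, the event names, and every transition are assumptions made for the example, not the NodeManager's actual container state machine or the design in the doc.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical, simplified lifecycle; not the real NM container state machine.
final class ContainerLifecycleSketch {
    enum State { NEW, LOCALIZING, AWAITING_START, RUNNING, DONE }
    enum Event { INITIALIZE, LOCALIZATION_DONE, START, STOP, DESTROY, TIMEOUT }

    private static final Map<State, Map<Event, State>> TRANSITIONS =
        new EnumMap<>(State.class);

    private static void allow(State from, Event on, State to) {
        TRANSITIONS
            .computeIfAbsent(from, s -> new EnumMap<Event, State>(Event.class))
            .put(on, to);
    }

    static {
        // Assumed transitions, chosen only to show where AWAITING_START could sit.
        allow(State.NEW, Event.INITIALIZE, State.LOCALIZING);
        allow(State.LOCALIZING, Event.LOCALIZATION_DONE, State.AWAITING_START);
        allow(State.AWAITING_START, Event.START, State.RUNNING);
        allow(State.AWAITING_START, Event.DESTROY, State.DONE);
        allow(State.AWAITING_START, Event.TIMEOUT, State.DONE);
        allow(State.RUNNING, Event.STOP, State.AWAITING_START);
        allow(State.RUNNING, Event.DESTROY, State.DONE);
    }

    /** Returns the next state, or throws if the event is not allowed in the current state. */
    static State next(State current, Event event) {
        Map<Event, State> row = TRANSITIONS.get(current);
        State target = (row == null) ? null : row.get(event);
        if (target == null) {
            throw new IllegalStateException(event + " not allowed in " + current);
        }
        return target;
    }
}
```

In this sketch a stopped container returns to AWAITING_START rather than DONE, which is what would let an AM restart it without losing the allocation.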
bq. I think we should add an explicit re-initialize/re-localize API. For a
running process, ideally, we want to localize the upgraded bits while the
container is running and then kill the existing process to minimize the
downtime.
Yup, agreed. We had thought about that, but felt that introducing concurrent
localization while running might introduce more states (like you identified -
"running-localizing.." etc). Also, I was thinking about what happens when a
concurrent localization completes:
* Should it move to the AWAITING state that waits for a startContainer command
from the AM (which would increase start-up latency) or should it just start
automatically?
* What happens when a concurrent re-localization attempt fails? Should the
container continue running or be killed (with the RM notified)? If it continues
to run, we need to notify the AM about the failure (or wait for the AM to call
getStatus etc.)
In any case, the interactions between AM and the NM/Container would become
non-trivial.
I was thinking we should probably do a sequential stop + initialize/localize +
start as a first cut, and tackle concurrent re-initialization in subsequent
JIRAs. Furthermore, I was planning on tackling this in a more principled manner
in YARN-4597.
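The sequential first cut can be pictured roughly as follows. The `ContainerOps` interface, its method signatures, and the use of plain strings for the container id and launch context are hypothetical stand-ins for this example; they are not the real *ContainerManagementProtocol* calls.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; names and signatures are assumptions.
final class SequentialReinitSketch {
    /** Hypothetical stand-in for the NM-side container operations. */
    interface ContainerOps {
        void stop(String containerId);
        void initialize(String containerId, String launchContext);
        void start(String containerId);
    }

    /** Re-initializes strictly in sequence: stop, then initialize/localize, then start. */
    static void reinitialize(ContainerOps ops, String containerId, String newLaunchContext) {
        ops.stop(containerId);                         // process is down from here on
        ops.initialize(containerId, newLaunchContext); // localize the upgraded bits
        ops.start(containerId);                        // downtime ends once this returns
    }

    /** Test double that records the order of calls. */
    static final class RecordingOps implements ContainerOps {
        final List<String> calls = new ArrayList<>();

        @Override public void stop(String containerId) {
            calls.add("stop");
        }

        @Override public void initialize(String containerId, String launchContext) {
            calls.add("initialize:" + launchContext);
        }

        @Override public void start(String containerId) {
            calls.add("start");
        }
    }
}
```

The trade-off is visible in the ordering: localization happens while the process is stopped, so downtime includes the localization time, which is what a later concurrent re-initialization would avoid.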
bq. Just a clarification, when you mentioned CONTAINER_RESOURCE_CLEANUP , I'm
assuming you meant CLEANUP_CONTAINER_RESOURCES
Yup
bq. Instead of forcing AMs to make two calls, why don't we just add a restart
API that does everything you've outlined above? It's cleaner and we don't have
to do as many condition checks.
Totally agree!! But I was thinking we first get the base initialize/destroy and
start/stop APIs well defined and working as expected. Clubbing them into
composite commands can be handled in a subsequent JIRA, since in any case we do
have to handle all these cases when an AM calls initialize/start while the
container is running. We could just choose to ignore all commands except a
*restart*, *stop* or *destroy*, but I'd prefer to handle restart as a
composite command.
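To show what "restart as a composite command" could mean on top of the base commands, here is a minimal sketch. The `Command` enum and the `expand` helper are hypothetical names introduced for this example, and expanding restart into exactly stop followed by start is an assumption.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only; not an actual YARN type.
final class CompositeCommandSketch {
    enum Command { INITIALIZE, START, STOP, DESTROY, RESTART }

    /** Expands a composite command into base commands; base commands map to themselves. */
    static List<Command> expand(Command command) {
        if (command == Command.RESTART) {
            // Assumption: restart is just stop followed by start.
            return Arrays.asList(Command.STOP, Command.START);
        }
        return Collections.singletonList(command);
    }
}
```

With this shape, the condition checks live only in the base commands, and a composite reuses them instead of adding its own.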
> [Phase 1] Decoupled Init / Destroy of Containers from Start / Stop
> ------------------------------------------------------------------
>
> Key: YARN-4876
> URL: https://issues.apache.org/jira/browse/YARN-4876
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Arun Suresh
> Assignee: Arun Suresh
> Attachments: YARN-4876-design-doc.pdf
>
>
> Introduce *initialize* and *destroy* container API into the
> *ContainerManagementProtocol* and decouple the actual start of a container
> from the initialization. This will allow AMs to re-start a container without
> having to lose the allocation.
> Additionally, if the localization of the container is associated with the
> initialize (and the cleanup with the destroy), this can also be used by
> applications to upgrade a Container by *re-initializing* with a new
> *ContainerLaunchContext*