
Arun Suresh (JIRA) Fri, 08 Apr 2016 10:32:58 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-4876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232530#comment-15232530
 ]


Arun Suresh commented on YARN-4876:
-----------------------------------

Thanks for the feedback [~vvasudev]..

bq. We can achieve that by adding the destroyDelay field you mentioned in your 
document but don't allow AMs to set it. If initialize is called, set 
destroyDelay internally to -1, else to 0.
I tend to agree with you, but my intention was to introduce a timeout after 
which, if no action is taken by the AM, the Container is killed. Maybe we can 
have a default timeout (5 mins?) and allow AMs to override it.
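
A minimal sketch of the default-plus-override idea, assuming a 5 minute default as floated above. The class and method names (InitTimeoutPolicy, resolve, shouldKill) are hypothetical and are not part of any YARN API:

```java
import java.time.Duration;
import java.util.Optional;

// Hypothetical helper for illustration only; nothing here exists in YARN.
final class InitTimeoutPolicy {
    // Assumed default, taken from the "5 mins?" suggestion in the comment.
    static final Duration DEFAULT_TIMEOUT = Duration.ofMinutes(5);

    // Use the AM-supplied timeout when it is present and positive,
    // otherwise fall back to the default.
    static Duration resolve(Optional<Duration> amOverride) {
        return amOverride
            .filter(d -> !d.isZero() && !d.isNegative())
            .orElse(DEFAULT_TIMEOUT);
    }

    // True once the container has been idle (no AM action) for longer
    // than the effective timeout, i.e. it is eligible to be killed.
    static boolean shouldKill(Duration idleFor, Optional<Duration> amOverride) {
        return idleFor.compareTo(resolve(amOverride)) > 0;
    }
}
```

Ignoring a zero or negative override is a choice made for this sketch; a real implementation might instead reject such a value or treat it as "never time out".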

bq. Can you add a state machine transition diagram to explain how new states 
and events affect each other?
Will do.. I was thinking maybe we add another Container State, such as 
*AWAITING_START*, to explicitly distinguish it from *LOCALIZED* as I had 
suggested in the initial doc. Shall update and put it up.
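
Until the diagram is up, here is a rough sketch of how an *AWAITING_START* state could sit in a transition table. The state set is deliberately simplified, and both the states and the allowed transitions are assumptions made for illustration, not the NodeManager's actual container state machine:

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified lifecycle; not the real NM ContainerState enum.
final class ContainerLifecycleSketch {
    enum State { NEW, LOCALIZING, AWAITING_START, RUNNING, DONE }

    private static final Map<State, Set<State>> ALLOWED = new EnumMap<>(State.class);
    static {
        // initialize: begin localization, or give up early.
        ALLOWED.put(State.NEW, EnumSet.of(State.LOCALIZING, State.DONE));
        // Localization finished: wait for an explicit start from the AM.
        ALLOWED.put(State.LOCALIZING, EnumSet.of(State.AWAITING_START, State.DONE));
        // start -> RUNNING, re-initialize -> LOCALIZING, destroy/timeout -> DONE.
        ALLOWED.put(State.AWAITING_START,
            EnumSet.of(State.RUNNING, State.LOCALIZING, State.DONE));
        // stop keeps the allocation and goes back to waiting; destroy ends it.
        ALLOWED.put(State.RUNNING, EnumSet.of(State.AWAITING_START, State.DONE));
        // Terminal state.
        ALLOWED.put(State.DONE, EnumSet.noneOf(State.class));
    }

    static boolean canTransition(State from, State to) {
        return ALLOWED.get(from).contains(to);
    }
}
```

Note that this table has no RUNNING to LOCALIZING edge, which matches a sequential stop + initialize + start flow rather than concurrent re-localization.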

bq. I think we should add an explicit re-initialize/re-localize API. For a 
running process, ideally, we want to localize the upgraded bits while the 
container is running and then kill the existing process to minimize the 
downtime.
Yup, agreed.. we had thought about that, but felt that introducing concurrent 
localization while running might introduce more states (like you identified - 
"running-localizing.." etc). I was also thinking about what happens when a 
concurrent localization completes:
* Should it move to the AWAITING state that waits for a startContainer command 
from the AM (which would increase start-up latency) or should it just start 
automatically? 
* What happens when a concurrent re-localization attempt fails? Should the 
container continue running or be killed (with the RM notified)? If it continues 
to run, we need to notify the AM about the failure (or wait for the AM to call 
getStatus etc.).

In any case, the interactions between AM and the NM/Container would become 
non-trivial.
I was thinking we should probably do a sequential stop + initialize/localize + 
start as a first cut, and tackle concurrent re-initialization in subsequent 
JIRAs. Furthermore, I was planning on tackling this in a more principled manner 
in YARN-4597.

bq. Just a clarification, when you mentioned CONTAINER_RESOURCE_CLEANUP , I'm 
assuming you meant CLEANUP_CONTAINER_RESOURCES
Yup

bq. Instead of forcing AMs to make two calls, why don't we just add a restart 
API that does everything you've outlined above? It's cleaner and we don't have 
to do as many condition checks.
Totally agree!! But I was thinking we first get the base initialize/destroy and 
start/stop APIs well defined and working as expected. Clubbing them into 
composite commands can be handled in a subsequent JIRA, since in any case we do 
have to handle all these cases when an AM calls initialize/start while the 
container is running. We could just choose to ignore all commands except a 
*restart*, *stop* or *destroy*, but I'd prefer to handle restart as a 
composite command.
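
To illustrate what "restart as a composite command" could look like once the base calls are in place, here is a small sketch. The ContainerLifecycle interface and RecordingContainer class are hypothetical stand-ins, not the *ContainerManagementProtocol* itself, and the launch context is reduced to a plain string:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface for illustration; not an actual YARN protocol.
interface ContainerLifecycle {
    void initialize(String launchContext);
    void start();
    void stop();
    void destroy();

    // Composite command: built purely from the base stop and start calls,
    // so the allocation and localized resources are kept.
    default void restart() {
        stop();
        start();
    }
}

// Test double that records the order of base calls it receives.
final class RecordingContainer implements ContainerLifecycle {
    final List<String> calls = new ArrayList<>();

    @Override public void initialize(String launchContext) {
        calls.add("initialize:" + launchContext);
    }
    @Override public void start() { calls.add("start"); }
    @Override public void stop() { calls.add("stop"); }
    @Override public void destroy() { calls.add("destroy"); }
}
```

Calling restart() on a RecordingContainer records "stop" followed by "start", which shows that the composite adds no state handling of its own beyond what the base commands already need.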


> [Phase 1] Decoupled Init / Destroy of Containers from Start / Stop
> ------------------------------------------------------------------
>
>                 Key: YARN-4876
>                 URL: https://issues.apache.org/jira/browse/YARN-4876
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Arun Suresh
>            Assignee: Arun Suresh
>         Attachments: YARN-4876-design-doc.pdf
>
>
> Introduce *initialize* and *destroy* container API into the 
> *ContainerManagementProtocol* and decouple the actual start of a container 
> from the initialization. This will allow AMs to re-start a container without 
> having to lose the allocation.
> Additionally, if the localization of the container is associated with the 
> initialize (and the cleanup with the destroy), this can also be used by 
> applications to upgrade a Container by *re-initializing* with a new 
> *ContainerLaunchContext*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
