[ 
https://issues.apache.org/jira/browse/YARN-4876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15235312#comment-15235312
 ] 

Varun Vasudev commented on YARN-4876:
-------------------------------------

bq. I tend to agree with you, but my intention was to introduce a timeout after 
which, if no action is taken by the AM, the Container is killed. Maybe we can 
have a default timeout (5 mins ?) and allow AMs to override it.

The zero activity timeout makes sense, but for now let's not let AMs override 
it.
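
To make the zero activity timeout concrete, here is a minimal sketch of how it could be tracked. Everything in it is an assumption for illustration: the class {{IdleTimeoutTracker}} and its methods are hypothetical and not existing NM code, and the 5 minute value is just the default floated above. The default constructor is the only one intended for real use, so AMs have no way to override the value.

```java
import java.time.Duration;

// Hypothetical helper for illustration only; not part of the YARN code base.
final class IdleTimeoutTracker {
    // Assumed default taken from the "5 mins ?" suggestion in this thread.
    static final Duration DEFAULT_IDLE_TIMEOUT = Duration.ofMinutes(5);

    private final long timeoutMillis;
    private long lastActivityMillis;

    // Fixed default timeout: there is deliberately no AM-facing override.
    IdleTimeoutTracker(long nowMillis) {
        this(DEFAULT_IDLE_TIMEOUT, nowMillis);
    }

    // Package-private so that only tests can use a shorter timeout.
    IdleTimeoutTracker(Duration timeout, long nowMillis) {
        this.timeoutMillis = timeout.toMillis();
        this.lastActivityMillis = nowMillis;
    }

    // Called whenever the AM issues a command for this container.
    void recordAmActivity(long nowMillis) {
        lastActivityMillis = nowMillis;
    }

    // True once the AM has taken no action for the full timeout,
    // at which point the container would be killed.
    boolean isExpired(long nowMillis) {
        return nowMillis - lastActivityMillis >= timeoutMillis;
    }
}
```

Passing the clock value in explicitly keeps the sketch deterministic; a real implementation would more likely use the NM's own clock and monitoring threads.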

{quote}
Yup, agreed.. we had thought about that, but felt that introducing concurrent 
localization while running might introduce more states (like you identified - 
"running-localizing.." etc). Also, was thinking about what happens when a 
concurrent localization completes

    Should it move to the AWAITING state that waits for a startContainer 
command from the AM (which would increase start-up latency) or should it just 
start automatically?
    What happens when a concurrent re-localization attempt fails? Should the 
container continue running / be killed (notified to the RM)? If it continues to 
run, we need to notify the AM about the failure (or wait for the AM to call 
getStatus etc.)
{quote}

In the first case, we should continue to run the old bits until the AM asks for 
an upgrade/restart. In the second case, I would expect the old bits to continue 
running and provide a status function to let the AM know that the localization 
failed. This can become more and more involved - we might eventually need to 
provide first class localization APIs.
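
As a rough sketch of the semantics I have in mind for both cases (all names below are hypothetical and only for illustration, this is not existing NM code): the old bits keep running while new bits are localized, a successful localization changes nothing until the AM explicitly asks for the upgrade, and a failed localization is only surfaced through a status call.

```java
// Hypothetical model of a running container with concurrent re-localization.
// None of these names exist in YARN; they only illustrate the proposed behaviour.
final class RunningContainerSketch {
    enum LocalizationStatus { NONE, IN_PROGRESS, SUCCEEDED, FAILED }

    private String runningVersion;   // the "old bits" currently executing
    private String pendingVersion;   // newly localized bits, null if none staged
    private LocalizationStatus status = LocalizationStatus.NONE;

    RunningContainerSketch(String initialVersion) {
        this.runningVersion = initialVersion;
    }

    // The AM asks for new bits to be localized while the container keeps running.
    void beginLocalization(String newVersion) {
        pendingVersion = newVersion;
        status = LocalizationStatus.IN_PROGRESS;
    }

    // Localization finished. The running bits are never touched here.
    void onLocalizationDone(boolean success) {
        if (status != LocalizationStatus.IN_PROGRESS) {
            throw new IllegalStateException("no localization in progress");
        }
        status = success ? LocalizationStatus.SUCCEEDED : LocalizationStatus.FAILED;
        if (!success) {
            pendingVersion = null; // nothing usable was staged
        }
    }

    // Explicit upgrade/restart request from the AM.
    // Returns false (and keeps the old bits) unless new bits are ready.
    boolean upgrade() {
        if (status != LocalizationStatus.SUCCEEDED) {
            return false;
        }
        runningVersion = pendingVersion;
        pendingVersion = null;
        status = LocalizationStatus.NONE;
        return true;
    }

    String runningVersion() { return runningVersion; }

    // The status function the AM can poll to learn that localization failed.
    LocalizationStatus localizationStatus() { return status; }
}
```

Even this toy version needs an extra status dimension alongside the container state, which is the kind of lifecycle change we would have to sort out.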

{quote}
In any case, the interactions between AM and the NM/Container would become 
non-trivial.
I was thinking we should probably do a sequential stop + initialize/localize + 
start as a first cut, and tackle concurrent re-initialization in subsequent 
JIRAs. Furthermore, I was planning on tackling this in a more principled manner 
in YARN-4597
{quote}

You're correct that some of the interactions become non-trivial but I think 
we're better off getting those interactions fleshed out. My concern with the 
sequential start/stop is that it brings up scenarios such as
 - a running container will be failed because we couldn't localize the new 
bits (which operationally runs counter to what's expected)
 - a running container will be killed to localize the new bits, which get slowed 
down, leading to no container running (again, operationally counter to what 
we expect)
 - the AMs being unable to schedule upgrades
 
YARN-4597 is related to this but not exactly the same. Its scope is much 
broader - we really only care about being able to localize while a container is 
running. My suggestion is let's just take care of the whole 
localization-while-running piece in another JIRA. I have no problem doing the 
implementation in a phased manner, but we need to get the transitions sorted 
out first.

In addition, I was wondering if you had any thoughts about cleaning up 
resources for a running container and wiping out its data?

{quote}
Totally agree!! But I was thinking we get the base initialize/destroy and 
start/stop APIs well defined and working as expected.. Was thinking clubbing 
into composite commands can be handled in a subsequent JIRA. Since in any case, 
we do have to handle all these cases when an AM calls initialize/start while 
the container is running. Although we can just choose to ignore all commands 
except a restart, stop or destroy, but I'd prefer to handle restart as a 
composite command. 
{quote}

Given that we agree on this - how about we modify the scope of this JIRA to 
implement the initialize/destroy operations only? We can handle the upgrade 
changes in a complete manner in a follow-up JIRA and then tackle restart. That 
way we're unblocked on this JIRA and we have some time to sort out the 
lifecycle changes. What do you think?

> [Phase 1] Decoupled Init / Destroy of Containers from Start / Stop
> ------------------------------------------------------------------
>
>                 Key: YARN-4876
>                 URL: https://issues.apache.org/jira/browse/YARN-4876
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Arun Suresh
>            Assignee: Arun Suresh
>         Attachments: YARN-4876-design-doc.pdf
>
>
> Introduce *initialize* and *destroy* container API into the 
> *ContainerManagementProtocol* and decouple the actual start of a container 
> from the initialization. This will allow AMs to re-start a container without 
> having to lose the allocation.
> Additionally, if the localization of the container is associated with the 
> initialize (and the cleanup with the destroy), this can also be used by 
> applications to upgrade a Container by *re-initializing* with a new 
> *ContainerLaunchContext*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
