[
https://issues.apache.org/jira/browse/YARN-4876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15235312#comment-15235312
]
Varun Vasudev commented on YARN-4876:
-------------------------------------
bq. I tend to agree with you, but my intention was to introduce a timeout after
which, if no action is taken by the AM, the Container is killed. Maybe we can
have a default timeout (5 mins?) and allow AMs to override it.
The zero activity timeout makes sense, but for now let's not let AMs override
it.
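A minimal sketch of what that zero-activity timeout could look like on the NM side. The class and method names ({{IdleContainerTimeout}}, {{recordAmAction}}, {{shouldKill}}) are hypothetical and not existing YARN code, and the 5 minute value is just the number floated above:

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper for illustration only; not an existing YARN class.
final class IdleContainerTimeout {
    // Assumed default taken from the "5 mins" suggestion; fixed, with no AM override.
    static final Duration DEFAULT_TIMEOUT = Duration.ofMinutes(5);

    private Instant lastAmAction;

    IdleContainerTimeout(Instant initializedAt) {
        this.lastAmAction = initializedAt;
    }

    // Called whenever the AM issues a command against the container.
    void recordAmAction(Instant now) {
        this.lastAmAction = now;
    }

    // True once the AM has been silent for at least the default timeout.
    boolean shouldKill(Instant now) {
        return Duration.between(lastAmAction, now).compareTo(DEFAULT_TIMEOUT) >= 0;
    }
}
```

Keeping the timeout as a constant reflects the "no AM override for now" position; an override could be added later as an extra constructor argument.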
{quote}
Yup, agreed.. we had thought about that, but felt that introducing concurrent
localization while running might introduce more states (like you identified -
"running-localizing.." etc). Also, was thinking about what happens when a
concurrent localization completes:
- Should it move to the AWAITING state that waits for a startContainer
command from the AM (which would increase start-up latency) or should it just
start automatically?
- What happens when a concurrent re-localization attempt fails? Should the
container continue running / be killed (notified to the RM)? If it continues to
run, we need to notify the AM about the failure (or wait for the AM to call
getStatus etc.)
{quote}
In the first case, we should continue to run the old bits until the AM asks for
an upgrade/restart. In the second case, I would expect the old bits to continue
running and provide a status function to let the AM know that the localization
failed. This can become more and more involved - we might eventually need to
provide first class localization APIs.
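To make the two cases concrete, here is a rough sketch of the behaviour I have in mind. Everything in it ({{RelocalizingContainer}}, the {{Localization}} enum, the method names) is made up for illustration and is not the actual NM container state machine:

```java
// Hypothetical model for illustration only; names are not taken from YARN.
final class RelocalizingContainer {
    enum Localization { NONE, IN_PROGRESS, SUCCEEDED, FAILED }

    private String runningBits;   // what the container is executing right now
    private String pendingBits;   // newly localized bits awaiting an AM upgrade request
    private Localization status = Localization.NONE;

    RelocalizingContainer(String initialBits) {
        this.runningBits = initialBits;
    }

    // Start localizing new bits while the old bits keep running.
    void beginLocalization(String newBits) {
        this.pendingBits = newBits;
        this.status = Localization.IN_PROGRESS;
    }

    // Case 1 (success): keep running the old bits until the AM asks for an upgrade.
    // Case 2 (failure): keep running the old bits and only record the failure.
    void localizationFinished(boolean succeeded) {
        if (status != Localization.IN_PROGRESS) {
            throw new IllegalStateException("no localization in progress");
        }
        if (succeeded) {
            status = Localization.SUCCEEDED;
        } else {
            status = Localization.FAILED;
            pendingBits = null;
        }
    }

    // Status function the AM can poll to learn about a failed localization.
    Localization getLocalizationStatus() {
        return status;
    }

    String getRunningBits() {
        return runningBits;
    }

    // Only an explicit AM request switches the container to the new bits.
    boolean upgrade() {
        if (status != Localization.SUCCEEDED) {
            return false;
        }
        runningBits = pendingBits;
        pendingBits = null;
        status = Localization.NONE;
        return true;
    }
}
```

In this sketch a failed localization never touches the running bits, and the AM learns about the failure by polling the status rather than through a kill.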
{quote}
In any case, the interactions between AM and the NM/Container would become
non-trivial.
I was thinking we should probably do a sequential stop + initialize/localize +
start as a first cut, and tackle concurrent re-initialization in subsequent
JIRAs. Furthermore, I was planning on tackling this in a more principled manner
in YARN-4597.
{quote}
You're correct that some of the interactions become non-trivial but I think
we're better off getting those interactions fleshed out. My concern with the
sequential start/stop is that it brings up scenarios such as:
- a running container will be failed because we couldn't localize the new
bits (which operationally runs counter to what's expected).
- a running container will be killed to localize the new bits, which get slowed
down, leading to no container running (again, operationally runs counter to what
we expect).
- the AMs being unable to schedule upgrades.
YARN-4597 is related to this but not exactly the same. Its scope is much
broader - we really only care about being able to localize while a container is
running. My suggestion is that we take care of the whole
localization-while-running piece in another JIRA. I have no problem doing the
implementation in a phased manner, but we need to get the transitions sorted
out first.
In addition, I was wondering if you had any thoughts about cleaning up
resources for a running container and wiping out its data?
{quote}
Totally agree!! But I was thinking we get the base initialize/destroy and
start/stop APIs well defined and working as expected.. Was thinking clubbing
them into composite commands can be handled in a subsequent JIRA, since in any
case we do have to handle all these cases when an AM calls initialize/start
while the container is running. We could just choose to ignore all commands
except a restart, stop or destroy, but I'd prefer to handle restart as a
composite command.
{quote}
Given that we agree on this - how about we modify the scope of this JIRA to
implement the initialize/destroy operations only? We can handle the upgrade
changes in a complete manner in a follow-up JIRA and then tackle restart. That
way we're unblocked on this JIRA and we have some time to sort out the
lifecycle changes. What do you think?
> [Phase 1] Decoupled Init / Destroy of Containers from Start / Stop
> ------------------------------------------------------------------
>
> Key: YARN-4876
> URL: https://issues.apache.org/jira/browse/YARN-4876
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Arun Suresh
> Assignee: Arun Suresh
> Attachments: YARN-4876-design-doc.pdf
>
>
> Introduce *initialize* and *destroy* container API into the
> *ContainerManagementProtocol* and decouple the actual start of a container
> from the initialization. This will allow AMs to re-start a container without
> having to lose the allocation.
> Additionally, if the localization of the container is associated with the
> initialize (and the cleanup with the destroy), this can also be used by
> applications to upgrade a Container by *re-initializing* with a new
> *ContainerLaunchContext*.