[
https://issues.apache.org/jira/browse/YARN-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16202047#comment-16202047
]
Arun Suresh edited comment on YARN-7275 at 10/12/17 2:51 PM:
-------------------------------------------------------------
Thanks for the updated patch [~kartheek]
Couple of comments:
* In the new {{ContainerScheduler::recoverActiveContainer}} method, if the
container is running, you need to update the utilizationTracker by calling
{{this.utilizationTracker.addContainerResources(..)}}.
* After the recovery process is complete on the NM, we need to consider the
following:
** It is possible that just before the NM went down, some of the queued
containers were in the process of being started or resumed (the container would
be in RESUMING / SCHEDULED and the recovered container state would be QUEUED).
The LAUNCH event was sent but did not reach the 'ContainerLaunch', in which
case the {{ContainerScheduler}} would need to resend those events.
** It is also possible that just before the NM went down, some running
containers were in the process of being PAUSED (the container would be in the
PAUSING state and the recovered container state would be RUNNING). The
kill/pause event was sent but again did not reach the executor.
* Both the above scenarios should be covered by calling the
{{ContainerScheduler::startPendingContainers(..)}} method on the
ContainerScheduler. It will check if there are queued opportunistic containers
and start/resume them. I propose we create another
{{ContainerSchedulerEventType}} - just call it RECOVERY_COMPLETED - and dispatch
this event to the containerScheduler at the end of the
{{ContainerManager::recover()}} method. In the ContainerScheduler, when we
receive the event, just call {{startPendingContainers(..)}}. Makes sense?
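To make the proposal concrete, here is a rough, self-contained sketch of the flow. It does not use the real NM classes: {{SketchContainerScheduler}}, {{SketchRecoveredState}}, {{SketchSchedulerEventType}} and the single-integer resource model are all hypothetical stand-ins, and the real {{startPendingContainers(..)}} takes arguments that are omitted here. It only shows the ordering: account for running containers during recovery, queue the rest, and start the queued ones once RECOVERY_COMPLETED arrives.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical stand-ins for illustration; these are not the real NM classes.
enum SketchRecoveredState { QUEUED, RUNNING }

enum SketchSchedulerEventType { RECOVERY_COMPLETED }

class SketchContainerScheduler {
    // A queued container and the (simplified, single-number) resource it needs.
    private static final class Queued {
        final String id;
        final int size;
        Queued(String id, int size) { this.id = id; this.size = size; }
    }

    private final int capacity;   // total resource units on the node
    private int used = 0;         // stands in for the utilizationTracker
    private final Queue<Queued> queuedOpportunistic = new ArrayDeque<>();
    final List<String> launched = new ArrayList<>();

    SketchContainerScheduler(int capacity) { this.capacity = capacity; }

    // Called once per recovered container while the NM is recovering.
    void recoverActiveContainer(String id, SketchRecoveredState state, int size) {
        if (state == SketchRecoveredState.RUNNING) {
            // Running containers already hold resources, so count them now.
            used += size;
        } else {
            // Queued containers wait until recovery has finished.
            queuedOpportunistic.add(new Queued(id, size));
        }
    }

    // Event handler; RECOVERY_COMPLETED would be dispatched at the end of recover().
    void handle(SketchSchedulerEventType type) {
        if (type == SketchSchedulerEventType.RECOVERY_COMPLETED) {
            startPendingContainers();
        }
    }

    // Launch queued containers in order for as long as the next one fits.
    private void startPendingContainers() {
        while (!queuedOpportunistic.isEmpty()
                && used + queuedOpportunistic.peek().size <= capacity) {
            Queued next = queuedOpportunistic.poll();
            used += next.size;
            launched.add(next.id);
        }
    }

    int getUsed() { return used; }
}
```

Nothing is launched while containers are still being recovered, so the scheduler decides with the full picture of what is already running.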
Also with regard to my earlier comment:
bq. in addition to storing the container update token, use the old resource
update key and store the changed resource also.
Apologies, but I think we can revert to how you had it in your earlier patch,
because it looks like this won't guarantee that rollback will work: the old
version of the NM will still see the new key and bomb anyway. So we will just
have to document somewhere that if a running container is updated, rollback is
not possible until the container has completed.
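A toy illustration of the rollback concern, assuming a loader that rejects any state key it does not recognize (which is what I mean by "bomb"). {{SketchStateLoader}} and the key suffixes {{/resource}} and {{/updateToken}} are made up for this example and are not the actual state store keys.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical loader; models an NM version by the key suffixes it understands.
class SketchStateLoader {
    private final Set<String> knownSuffixes;

    SketchStateLoader(String... knownSuffixes) {
        this.knownSuffixes = new HashSet<>(Arrays.asList(knownSuffixes));
    }

    // Returns the number of entries loaded, or fails on the first unknown key.
    int loadContainer(Map<String, String> entries) throws IOException {
        int loaded = 0;
        for (String key : entries.keySet()) {
            int idx = key.lastIndexOf('/');
            String suffix = idx >= 0 ? key.substring(idx) : key;
            if (!knownSuffixes.contains(suffix)) {
                // Writing the old key as well does not help: the new key is
                // still present and the old version still trips over it.
                throw new IOException("Unexpected container state key: " + key);
            }
            loaded++;
        }
        return loaded;
    }
}
```

In this model, a loader built with only {{/resource}} fails on a container that has both {{/resource}} and {{/updateToken}} entries, while a loader that knows both suffixes loads it.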
> NM Statestore cleanup for Container updates
> -------------------------------------------
>
> Key: YARN-7275
> URL: https://issues.apache.org/jira/browse/YARN-7275
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Arun Suresh
> Assignee: kartheek muthyala
> Priority: Blocker
> Attachments: YARN-7275.001.patch, YARN-7275.002.patch,
> YARN-7275.003.patch, YARN-7275.004.patch
>
>
> Currently, only resource updates are recorded in the NM state store, we need
> to add ExecutionType updates as well.