[ 
https://issues.apache.org/jira/browse/YARN-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16202047#comment-16202047
 ] 

Arun Suresh edited comment on YARN-7275 at 10/12/17 2:51 PM:
-------------------------------------------------------------

Thanks for the updated patch [~kartheek]

Couple of comments:
* In the new {{ContainerScheduler::recoverActiveContainer}} method, if the 
container is running, you need to update the utilizationTracker by calling 
{{this.utilizationTracker.addContainerResources(..)}}.
* After the recovery process is complete on the NM, we need to consider the 
following:
** It is possible that just before the NM went down, some of the queued 
containers were in the process of being started or resumed (the container 
would be in RESUMING / SCHEDULED and the recovered container state would be 
QUEUED). The LAUNCH event was sent but did not reach the 'ContainerLaunch', 
in which case the {{ContainerScheduler}} would need to resend those events.
** It is also possible that just before the NM went down, some running 
containers were in the process of being PAUSED (the container would be in the 
PAUSING state and the recovered container state (rcs) would be RUNNING). The 
kill/pause event was sent but, again, did not reach the executor.
* Both of the above scenarios should be covered by calling the 
{{ContainerScheduler::startPendingContainers(..)}} method. It will check if 
there are queued opportunistic containers and start/resume them. I propose we 
create another {{ContainerSchedulerEventType}} (just call it 
RECOVERY_COMPLETED) and dispatch this event to the containerScheduler at the 
end of the {{ContainerManager::recover()}} method. In the ContainerScheduler, 
when we receive the event, just call {{startPendingContainers(..)}}. Makes 
sense?
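
To make the proposal above concrete, here is a minimal, self-contained sketch. 
It is not the Hadoop code: {{RecoverySketch}}, its {{Container}} class, the 
vcore counter standing in for the utilizationTracker, and every enum value 
other than RECOVERY_COMPLETED are assumptions made for illustration. It only 
models the first scenario (queued containers whose launch was lost).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Simplified stand-in for the NM ContainerScheduler; NOT the Hadoop class.
class RecoverySketch {

  // Recovered container state (rcs) as read back from the state store.
  enum RecoveredState { QUEUED, RUNNING }

  // Hypothetical event types; RECOVERY_COMPLETED is the one proposed above.
  enum EventType { SCHEDULE_CONTAINER, RECOVERY_COMPLETED }

  // Hypothetical container record: an id plus the vcores it needs.
  static final class Container {
    final String id;
    final int vcores;

    Container(String id, int vcores) {
      this.id = id;
      this.vcores = vcores;
    }
  }

  private final int totalVcores;
  private int usedVcores; // stands in for the utilizationTracker
  private final Queue<Container> queued = new ArrayDeque<>();
  private final List<String> launched = new ArrayList<>();

  RecoverySketch(int totalVcores) {
    this.totalVcores = totalVcores;
  }

  // Called once per container while the NM replays its state store.
  void recoverActiveContainer(Container c, RecoveredState rcs) {
    if (rcs == RecoveredState.RUNNING) {
      // A running container already holds resources, so account for them.
      usedVcores += c.vcores;
    } else {
      // The launch may have been lost before the NM went down; re-queue it.
      queued.add(c);
    }
  }

  // Event handler: recovery has finished, so retry whatever is still queued.
  void handle(EventType type) {
    if (type == EventType.RECOVERY_COMPLETED) {
      startPendingContainers();
    }
  }

  // Starts queued containers in order while capacity remains.
  void startPendingContainers() {
    while (!queued.isEmpty()
        && usedVcores + queued.peek().vcores <= totalVcores) {
      Container c = queued.poll();
      usedVcores += c.vcores;
      launched.add(c.id); // the real scheduler would resend LAUNCH here
    }
  }

  int usedVcores() { return usedVcores; }
  int queuedCount() { return queued.size(); }
  List<String> launched() { return launched; }

  public static void main(String[] args) {
    RecoverySketch scheduler = new RecoverySketch(4);
    scheduler.recoverActiveContainer(new Container("c1", 2), RecoveredState.RUNNING);
    scheduler.recoverActiveContainer(new Container("c2", 2), RecoveredState.QUEUED);
    // Dispatched at the end of recovery, as proposed in the comment.
    scheduler.handle(EventType.RECOVERY_COMPLETED);
    System.out.println("launched after recovery: " + scheduler.launched());
  }
}
```

Accounting for running containers during recovery matters here: without it, 
the capacity check in {{startPendingContainers}} would see an empty node and 
could start more queued containers than the node can hold.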

Also with regard to my earlier comment:
bq.  in addition to storing the container update token, use the old resource 
update key and store the changed resource also.
Apologies, but I think we can revert it back to how you had it in your earlier 
patch, because it looks like this won't guarantee that rollback will work: the 
old version of the NM will still see the new key and bomb anyway. So we will 
just have to document somewhere that if a running container is updated, 
rollback is not possible until the container is completed.
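
The rollback concern can be illustrated with a small sketch. It assumes, as 
the comment above states, that an older NM fails when it meets a key it does 
not recognise. The class name {{RollbackSketch}}, both key suffixes and the 
stored values are hypothetical and are not taken from the patch.

```java
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only; NOT the NM state store implementation.
class RollbackSketch {

  // Hypothetical key suffixes, chosen for this example.
  static final String RESOURCE_SUFFIX = "/resourceChanged";
  static final String UPDATE_TOKEN_SUFFIX = "/updateToken";

  // Loader as an older NM version might behave: it understands only the
  // resource key and rejects anything else. Returns the number of entries read.
  static int loadWithOldVersion(Map<String, String> entries) throws IOException {
    int loaded = 0;
    for (String key : entries.keySet()) {
      if (key.endsWith(RESOURCE_SUFFIX)) {
        loaded++;
      } else {
        throw new IOException("Unexpected container state key: " + key);
      }
    }
    return loaded;
  }

  public static void main(String[] args) {
    Map<String, String> store = new LinkedHashMap<>();
    // Writing the old key alongside the new one does not help, because the
    // new key is still present when the old NM scans the store.
    store.put("container_1" + RESOURCE_SUFFIX, "2048MB");
    store.put("container_1" + UPDATE_TOKEN_SUFFIX, "token-bytes");
    try {
      loadWithOldVersion(store);
      System.out.println("recovery succeeded");
    } catch (IOException e) {
      System.out.println("recovery failed: " + e.getMessage());
    }
  }
}
```

Under that assumption, the extra old-format entry is harmless but useless for 
rollback, which is why documenting the limitation is the simpler option.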


> NM Statestore cleanup for Container updates
> -------------------------------------------
>
>                 Key: YARN-7275
>                 URL: https://issues.apache.org/jira/browse/YARN-7275
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Arun Suresh
>            Assignee: kartheek muthyala
>            Priority: Blocker
>         Attachments: YARN-7275.001.patch, YARN-7275.002.patch, 
> YARN-7275.003.patch, YARN-7275.004.patch
>
>
> Currently, only resource updates are recorded in the NM state store; we need 
> to add ExecutionType updates as well.


