[ 
https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-1368:
--------------------------

    Attachment: YARN-1368.preliminary.patch

Preliminary patch to re-populate the RMContainer, SchedulerNode, 
SchedulerApplicationAttempt, AppSchedulingInfo and Queue states.
- ResourceTrackerService receives the container info and sends it to RMNode, 
which in turn sends the container statuses to the scheduler for recovery.
- The majority of the recovery logic is in 
AbstractYarnScheduler#recoverContainersOnNode(), which recovers RMContainer, 
SchedulerNode, Queue, SchedulerApplicationAttempt and AppSchedulingInfo 
accordingly.
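
A rough sketch of the flow described above, for illustration only. It is not 
taken from the patch: ReportedContainer and SketchScheduler are hypothetical, 
simplified stand-ins for the real Hadoop classes, and resources are reduced to 
a memory figure in MB. It only shows the idea that one pass over the container 
statuses reported by a node re-populates the node, application attempt and 
queue bookkeeping.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the container info an NM reports on registration.
final class ReportedContainer {
    final String containerId;
    final String appAttemptId;
    final String queueName;
    final int memoryMb;

    ReportedContainer(String containerId, String appAttemptId,
                      String queueName, int memoryMb) {
        this.containerId = containerId;
        this.appAttemptId = appAttemptId;
        this.queueName = queueName;
        this.memoryMb = memoryMb;
    }
}

// Hypothetical stand-in for the scheduler-side state; not the real
// AbstractYarnScheduler.
final class SketchScheduler {
    // containerId -> nodeId, standing in for the recovered RMContainer records.
    final Map<String, String> nodeByContainer = new HashMap<>();
    // Used memory re-populated per node, per application attempt and per queue.
    final Map<String, Integer> usedMbByNode = new HashMap<>();
    final Map<String, Integer> usedMbByAttempt = new HashMap<>();
    final Map<String, Integer> usedMbByQueue = new HashMap<>();

    // Re-populates state for every container reported by one node and
    // returns the number of containers processed.
    int recoverContainersOnNode(String nodeId, List<ReportedContainer> reported) {
        int recovered = 0;
        for (ReportedContainer c : reported) {
            nodeByContainer.put(c.containerId, nodeId);
            usedMbByNode.merge(nodeId, c.memoryMb, Integer::sum);
            usedMbByAttempt.merge(c.appAttemptId, c.memoryMb, Integer::sum);
            usedMbByQueue.merge(c.queueName, c.memoryMb, Integer::sum);
            recovered++;
        }
        return recovered;
    }
}
```

In this sketch the caller would invoke recoverContainersOnNode() once per 
registering node, with the statuses that node reported.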

To do:
- FiCaSchedulerNode and FSSchedulerNode are almost the same. Is there any 
reason for keeping both? I am thinking of merging the common methods into 
SchedulerNode.
- RM_WORK_PRESERVING_RECOVERY_ENABLED will be used to guard the new 
changes.
- The ContainerStatus sent in NM registration doesn’t capture enough 
information to reconstruct the containers. We may replace it with a new 
object, or just add more fields to encapsulate all the information needed to 
reconstruct the container.
- More changes to the recover interfaces, edge cases and the transition logic 
in RMApp/RMAppAttempt.
- More thorough test cases.

RMContainer, SchedulerNode, SchedulerApplicationAttempt and AppSchedulingInfo 
can be recovered in a common way. CSQueue and FSQueue may each need to 
implement their own recoverContainer method.
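
One possible shape for that split, as a sketch under assumptions rather than 
the patch's actual design: the common recovery code walks the containers and 
delegates to a queue-specific recoverContainer. RecoverableQueue, 
CapacityStyleQueue, FairStyleQueue and CommonRecovery are all hypothetical 
names, and the per-queue bookkeeping shown here is invented purely to 
illustrate that the two queue types may track different things.

```java
import java.util.List;

// Hypothetical interface; the real CSQueue and FSQueue APIs may differ.
interface RecoverableQueue {
    void recoverContainer(int memoryMb);
    int usedMb();
}

// Illustrative queue that also charges the recovered usage to its parent.
final class CapacityStyleQueue implements RecoverableQueue {
    private final RecoverableQueue parent; // null for the root queue
    private int usedMb;

    CapacityStyleQueue(RecoverableQueue parent) {
        this.parent = parent;
    }

    @Override
    public void recoverContainer(int memoryMb) {
        usedMb += memoryMb;
        if (parent != null) {
            parent.recoverContainer(memoryMb);
        }
    }

    @Override
    public int usedMb() {
        return usedMb;
    }
}

// Illustrative queue that additionally counts its running containers.
final class FairStyleQueue implements RecoverableQueue {
    private int usedMb;
    private int runningContainers;

    @Override
    public void recoverContainer(int memoryMb) {
        usedMb += memoryMb;
        runningContainers++;
    }

    @Override
    public int usedMb() {
        return usedMb;
    }

    int runningContainers() {
        return runningContainers;
    }
}

// Common code path: identical for every scheduler, queue details delegated.
final class CommonRecovery {
    static void recover(RecoverableQueue queue, List<Integer> containerSizesMb) {
        for (int memoryMb : containerSizesMb) {
            queue.recoverContainer(memoryMb);
        }
    }
}
```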

> Common work to re-populate containers’ state into scheduler
> -----------------------------------------------------------
>
>                 Key: YARN-1368
>                 URL: https://issues.apache.org/jira/browse/YARN-1368
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Bikas Saha
>            Assignee: Anubhav Dhoot
>         Attachments: YARN-1368.preliminary.patch
>
>
> YARN-1367 adds support for the NM to tell the RM about all currently running 
> containers upon registration. The RM needs to send this information to the 
> schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover 
> the current allocation state of the cluster.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
