[
https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jian He updated YARN-1368:
--------------------------
Attachment: YARN-1368.preliminary.patch
Preliminary patch to re-populate the RMContainer, SchedulerNode,
SchedulerApplicationAttempt, AppSchedulingInfo and Queue states.
- ResourceTrackerService receives the container info and sends it to RMNode,
which in turn sends the container statuses to the scheduler to do the recovery.
- The majority of the recovery logic is in
AbstractYarnScheduler#recoverContainersOnNode(), which recovers RMContainer,
SchedulerNode, Queue, SchedulerApplicationAttempt and AppSchedulingInfo
accordingly.
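The flow described above can be sketched roughly as follows. This is a self-contained illustration, not the patch itself: RecoveryFlowSketch and its nested ContainerReport, Scheduler and Node types are hypothetical stand-ins for the real YARN classes, and a container's resource is reduced to a single memory figure.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RecoveryFlowSketch {

    /** Hypothetical stand-in for the per-container info an NM reports on registration. */
    static final class ContainerReport {
        final String containerId;
        final String appAttemptId;
        final int memoryMb;

        ContainerReport(String containerId, String appAttemptId, int memoryMb) {
            this.containerId = containerId;
            this.appAttemptId = appAttemptId;
            this.memoryMb = memoryMb;
        }
    }

    /** Simplified scheduler: tracks used memory per node and live containers per attempt. */
    static final class Scheduler {
        final Map<String, Integer> usedMbByNode = new HashMap<>();
        final Map<String, List<String>> containersByAttempt = new HashMap<>();

        // Plays the role of recoverContainersOnNode(): rebuild node and attempt state.
        void recoverContainersOnNode(String nodeId, List<ContainerReport> reports) {
            for (ContainerReport r : reports) {
                usedMbByNode.merge(nodeId, r.memoryMb, Integer::sum);
                containersByAttempt
                    .computeIfAbsent(r.appAttemptId, k -> new ArrayList<>())
                    .add(r.containerId);
            }
        }
    }

    /** Stand-in for RMNode: forwards the reported containers to the scheduler. */
    static final class Node {
        final String nodeId;
        final Scheduler scheduler;

        Node(String nodeId, Scheduler scheduler) {
            this.nodeId = nodeId;
            this.scheduler = scheduler;
        }

        void onContainersReported(List<ContainerReport> reports) {
            scheduler.recoverContainersOnNode(nodeId, reports);
        }
    }

    /** Stand-in for ResourceTrackerService handling an NM registration. */
    static void registerNodeManager(Node node, List<ContainerReport> reports) {
        node.onContainersReported(reports);
    }

    public static void main(String[] args) {
        Scheduler scheduler = new Scheduler();
        Node node = new Node("node-1", scheduler);
        registerNodeManager(node, Arrays.asList(
            new ContainerReport("container_1", "attempt_1", 1024),
            new ContainerReport("container_2", "attempt_1", 2048)));
        System.out.println(scheduler.usedMbByNode);
        System.out.println(scheduler.containersByAttempt);
    }
}
```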
To do:
- Noticed that FiCaSchedulerNode and FSSchedulerNode are almost the same. Is
there any reason for keeping both? I am thinking of merging the common methods
into SchedulerNode.
- RM_WORK_PRESERVING_RECOVERY_ENABLED will be used to guard the new changes.
- The ContainerStatus sent in NM registration doesn’t capture enough
information for reconstructing the containers. We may replace it with a new
object, or just add more fields, to encapsulate all the information necessary
for reconstructing the container.
- More changes to the recover interfaces, edge cases and the transition logic
in RMApp/RMAppAttempt.
- More thorough test cases.
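Two of the to-do items above (a richer per-container report and the enable flag) could look roughly like the sketch below. Everything in it is an assumption made for illustration: ContainerRecoveryInfo and its fields (memory, vcores, priority) are hypothetical, not the object the patch will introduce, the flag is modelled as a plain boolean rather than read from the YARN configuration, and recovering only RUNNING containers is a simplification.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ContainerRecoveryInfoSketch {

    enum State { RUNNING, COMPLETE }

    /** Hypothetical report carrying what a scheduler would need to rebuild a container. */
    static final class ContainerRecoveryInfo {
        final String containerId;
        final State state;
        final int memoryMb;
        final int vcores;
        final int priority;

        ContainerRecoveryInfo(String containerId, State state,
                              int memoryMb, int vcores, int priority) {
            if (containerId == null || state == null) {
                throw new IllegalArgumentException("containerId and state are required");
            }
            if (memoryMb < 0 || vcores < 0) {
                throw new IllegalArgumentException("resources must be non-negative");
            }
            this.containerId = containerId;
            this.state = state;
            this.memoryMb = memoryMb;
            this.vcores = vcores;
            this.priority = priority;
        }
    }

    /**
     * Returns the containers the scheduler should recover. When the flag is
     * off, nothing is recovered, so the new code path stays inactive.
     */
    static List<ContainerRecoveryInfo> containersToRecover(
            boolean workPreservingRecoveryEnabled,
            List<ContainerRecoveryInfo> reported) {
        if (!workPreservingRecoveryEnabled) {
            return Collections.emptyList();
        }
        List<ContainerRecoveryInfo> result = new ArrayList<>();
        for (ContainerRecoveryInfo info : reported) {
            if (info.state == State.RUNNING) {
                result.add(info);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<ContainerRecoveryInfo> reported = Arrays.asList(
            new ContainerRecoveryInfo("container_1", State.RUNNING, 1024, 1, 0),
            new ContainerRecoveryInfo("container_2", State.COMPLETE, 2048, 2, 0));
        System.out.println(containersToRecover(true, reported).size());
        System.out.println(containersToRecover(false, reported).size());
    }
}
```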
RMContainer, SchedulerNode, SchedulerApplicationAttempt and AppSchedulingInfo
can be recovered in a common way. CSQueue and FSQueue may each need to
implement their own recoverContainer method.
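One way to picture this split is a shared recovery routine that delegates only the queue bookkeeping to a per-scheduler hook. The sketch below is an assumption-laden illustration, not the patch: RecoverableQueue, CapacityStyleQueue, FairStyleQueue and CommonRecovery are hypothetical names, the recoverContainer signature is invented, and the difference between the two queue types is arbitrary.

```java
import java.util.ArrayList;
import java.util.List;

public class QueueRecoverySketch {

    /** Hypothetical hook each queue type would implement in its own way. */
    interface RecoverableQueue {
        void recoverContainer(int memoryMb);
    }

    /** Illustrative queue that also charges the recovered usage to its parents. */
    static final class CapacityStyleQueue implements RecoverableQueue {
        final CapacityStyleQueue parent;
        int usedMb;

        CapacityStyleQueue(CapacityStyleQueue parent) {
            this.parent = parent;
        }

        @Override
        public void recoverContainer(int memoryMb) {
            usedMb += memoryMb;
            if (parent != null) {
                parent.recoverContainer(memoryMb);
            }
        }
    }

    /** Illustrative queue that tracks usage and a running-container count. */
    static final class FairStyleQueue implements RecoverableQueue {
        int usedMb;
        int runningContainers;

        @Override
        public void recoverContainer(int memoryMb) {
            usedMb += memoryMb;
            runningContainers++;
        }
    }

    /** Shared part: the same steps for every scheduler, then the queue-specific hook. */
    static final class CommonRecovery {
        final List<String> recoveredContainerIds = new ArrayList<>();

        void recover(String containerId, int memoryMb, RecoverableQueue queue) {
            recoveredContainerIds.add(containerId); // common bookkeeping
            queue.recoverContainer(memoryMb);       // scheduler-specific bookkeeping
        }
    }

    public static void main(String[] args) {
        CapacityStyleQueue root = new CapacityStyleQueue(null);
        CapacityStyleQueue leaf = new CapacityStyleQueue(root);
        FairStyleQueue fair = new FairStyleQueue();

        CommonRecovery recovery = new CommonRecovery();
        recovery.recover("container_1", 1024, leaf);
        recovery.recover("container_2", 512, fair);

        System.out.println(root.usedMb + " " + leaf.usedMb + " " + fair.runningContainers);
    }
}
```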
> Common work to re-populate containers’ state into scheduler
> -----------------------------------------------------------
>
> Key: YARN-1368
> URL: https://issues.apache.org/jira/browse/YARN-1368
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Bikas Saha
> Assignee: Anubhav Dhoot
> Attachments: YARN-1368.preliminary.patch
>
>
> YARN-1367 adds support for the NM to tell the RM about all currently running
> containers upon registration. The RM needs to send this information to the
> schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover
> the current allocation state of the cluster.