[
https://issues.apache.org/jira/browse/YARN-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997386#comment-13997386
]
Wangda Tan commented on YARN-1368:
----------------------------------
Hi Jian,
Thanks for updating your patch, I took a look at it, some comments,
1) RMAppImpl.java
{code}
+ // Add the current attempt to the scheduler.It'll be removed from
+ // scheduler in RMAppAttempt#BaseFinalTransition
+ app.handler.handle(new AppAttemptAddedSchedulerEvent(app.currentAttempt
+ .getAppAttemptId(), false));
{code}
Not quite understand about this, in current trunk code, how RMAppAttempt notify
scheduler about "AppAttemptAddedSchedulerEvent" when recovering? If this is a
missing part in current trunk, I tend to put sending
"AppAttemptAddedSchedulerEvent" code into RMAppAttemptImpl.
2) RMAppAttemptImpl.java
{code}
+ new ContainerFinishedTransition(
+ new AMContainerCrashedAtRunningTransition(),
+ RMAppAttemptState.RUNNING))
{code}
And
{code}
+ new ContainerFinishedTransition(
+ new AMContainerCrashedBeforeRunningTransition(),
+ RMAppAttemptState.LAUNCHED))
{code}
I found the "RUNNING" and "LAUNCHED" state are passed in as targetedFinalState,
and the targetFinalState will only used in FinalStateSavedTransition, I got
confused, could you please elaborate on this? Why use split
AMContainerCrashedTransition to two transitions and set their states to
RUNNING/LAUNCHED differently.
3) In AbstractYarnScheduler.java
3.1
{code}
+ if (rmApp == null) {
+ LOG.error("Skip recovering container " + status
+ + " for unknown application.");
+ continue;
+ }
{code}
And
{code}
+ if (rmApp.getApplicationSubmissionContext().getUnmanagedAM()) {
+ if (LOG.isDebugEnabled()) {
+ LOG.debug("Skip recovering container " + status
+ + " for unmanaged AM." + rmApp.getApplicationId());
+ }
+ continue;
+ }
{code}
And
{code}
+ SchedulerApplication schedulerApp = applications.get(appId);
+ if (schedulerApp == null) {
+ LOG.info("Skip recovering container " + status
+ + " for unknown SchedulerApplication. Application state is "
+ + rmApp.getState());
+ continue;
+ }
{code}
It's better to make log level more consistency.
3.2
{code}
+ public RMContainer createContainer(ContainerStatus status, RMNode node) {
+ Container container =
+ Container.newInstance(status.getContainerId(), node.getNodeID(),
+ node.getHttpAddress(), Resource.newInstance(1024, 1),
+ Priority.newInstance(0), null);
{code}
Should we change Resource(1024, 1) to its actually resource?
3.3 For recoverContainersOnNode, is it possible NODE_ADDED happened before
APP_ADDED? I ask this before container may be recovered before its application
added to scheduler if yes.
4) In FiCaSchedulerNode.java:
{code}
+ @Override
+ public void recoverContainer(RMContainer rmContainer) {
+ if (rmContainer.getState().equals(RMContainerState.COMPLETED)) {
+ return;
+ }
+ allocateContainer(null, rmContainer);
+ }
{code}
Since the allocateContainer doesn't use application-id parameter, I think it's
better to remove it.
5) In TestWorkPreservingRMRestart.java
{code}
+ assertEquals(usedCapacity, queue.getAbsoluteUsedCapacity(), 0);
{code}
It may better to use two parameter assertEquals, because delta is 0
> Common work to re-populate containers’ state into scheduler
> -----------------------------------------------------------
>
> Key: YARN-1368
> URL: https://issues.apache.org/jira/browse/YARN-1368
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Bikas Saha
> Assignee: Jian He
> Attachments: YARN-1368.1.patch, YARN-1368.2.patch,
> YARN-1368.combined.001.patch, YARN-1368.preliminary.patch
>
>
> YARN-1367 adds support for the NM to tell the RM about all currently running
> containers upon registration. The RM needs to send this information to the
> schedulers along with the NODE_ADDED_EVENT so that the schedulers can recover
> the current allocation state of the cluster.
--
This message was sent by Atlassian JIRA
(v6.2#6252)