[
https://issues.apache.org/jira/browse/YARN-8579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562355#comment-16562355
]
Chandni Singh commented on YARN-8579:
-------------------------------------
[~gsaha], regarding your question:
{quote}I do have one fundamental question though. I don't understand why for
FAIR scheduler the below assert fails (which means no NMTokens are sent over
even with this patch). The method where I made the code change is a common
method which is called by both Fair and Capacity Schedulers. Any idea? That's
why I had to enable this assert for CAPACITY scheduler only. I don't have a
cluster setup where I can test FairScheduler.
{quote}
The bug is in the order in which {{FairScheduler}} calls
{{SchedulerApplicationAttempt.pullPreviousAttemptContainers()}} and
{{SchedulerApplicationAttempt.pullUpdatedNMTokens()}}.
{{FiCaSchedulerApp}} uses the right order: it first calls
{{pullPreviousAttemptContainers()}}, which updates the NM tokens, and then pulls
them with {{pullUpdatedNMTokens()}}:
{code:java}
List<Container> previousAttemptContainers =
    pullPreviousAttemptContainers();
List<Container> newlyAllocatedContainers = pullNewlyAllocatedContainers();
List<Container> newlyIncreasedContainers = pullNewlyIncreasedContainers();
List<Container> newlyDecreasedContainers = pullNewlyDecreasedContainers();
List<Container> newlyPromotedContainers = pullNewlyPromotedContainers();
List<Container> newlyDemotedContainers = pullNewlyDemotedContainers();
List<NMToken> updatedNMTokens = pullUpdatedNMTokens();
{code}
However, {{FairScheduler}} uses the wrong order: it calls
{{pullUpdatedNMTokens()}} before {{pullPreviousAttemptContainers()}}
(Java evaluates the constructor arguments left to right):
{code:java}
return new Allocation(newlyAllocatedContainers, headroom,
    preemptionContainerIds, null, null,
    application.pullUpdatedNMTokens(), null, null,
    application.pullNewlyPromotedContainers(),
    application.pullNewlyDemotedContainers(),
    application.pullPreviousAttemptContainers());
{code}
Because the NM tokens are pulled before they have been updated, they are null
in the allocation.
I think we should fix this ordering instead of modifying the test to check this
only for the capacity scheduler.
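To illustrate why the order matters, here is a minimal standalone sketch. It does not use the YARN classes: {{ToyAttempt}}, {{tokensOf()}} and the string-based containers and tokens are hypothetical stand-ins, and the only behavior carried over from the real code is the assumption stated above, that pulling the previous attempt's containers is what queues their NM tokens. In this toy the tokens come back as an empty list rather than null.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PullOrderSketch {

  /** Hypothetical stand-in for a scheduler application attempt; not the YARN class. */
  static class ToyAttempt {
    private final List<String> recoveredContainers = new ArrayList<>();
    private final List<String> updatedNMTokens = new ArrayList<>();

    ToyAttempt(List<String> recovered) {
      recoveredContainers.addAll(recovered);
    }

    // Hands over the recovered containers and, as a side effect,
    // queues one token per container (the assumption this sketch models).
    List<String> pullPreviousAttemptContainers() {
      List<String> pulled = new ArrayList<>(recoveredContainers);
      recoveredContainers.clear();
      for (String container : pulled) {
        updatedNMTokens.add("token-for-" + container);
      }
      return pulled;
    }

    // Drains whatever tokens have been queued so far.
    List<String> pullUpdatedNMTokens() {
      List<String> pulled = new ArrayList<>(updatedNMTokens);
      updatedNMTokens.clear();
      return pulled;
    }
  }

  // Plays the role of the constructor call: Java evaluates arguments left to right.
  static List<String> tokensOf(List<String> tokens, List<String> previousContainers) {
    return tokens;
  }

  // Tokens are pulled before the containers, so nothing has been queued yet.
  static List<String> tokensPulledFirst() {
    ToyAttempt app = new ToyAttempt(Arrays.asList("c1"));
    return tokensOf(app.pullUpdatedNMTokens(), app.pullPreviousAttemptContainers());
  }

  // Containers are pulled into a local first, so their tokens are queued in time.
  static List<String> containersPulledFirst() {
    ToyAttempt app = new ToyAttempt(Arrays.asList("c1"));
    List<String> previous = app.pullPreviousAttemptContainers();
    return tokensOf(app.pullUpdatedNMTokens(), previous);
  }

  public static void main(String[] args) {
    System.out.println("tokens pulled first:     " + tokensPulledFirst());
    System.out.println("containers pulled first: " + containersPulledFirst());
  }
}
```

One possible shape for the fix, under the same assumption, is to pull the previous attempt's containers into a local variable before building the allocation, as {{containersPulledFirst()}} does in the sketch.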
> New AM attempt could not retrieve previous attempt component data
> -----------------------------------------------------------------
>
> Key: YARN-8579
> URL: https://issues.apache.org/jira/browse/YARN-8579
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 3.1.1
> Reporter: Yesha Vora
> Assignee: Gour Saha
> Priority: Critical
> Fix For: 3.2.0, 3.1.2
>
> Attachments: YARN-8579.001.patch, YARN-8579.002.patch,
> YARN-8579.003.patch
>
>
> Steps:
> 1) Launch httpd-docker
> 2) Wait for app to be in STABLE state
> 3) Run validation for app (It takes around 3 mins)
> 4) Stop all ZKs
> 5) Wait 60 sec
> 6) Kill AM
> 7) Wait for 30 sec
> 8) Start all ZKs
> 9) Wait for application to finish
> 10) Validate expected containers of the app
> Expected behavior:
> A new AM attempt should start, and the docker containers launched by the 1st
> attempt should be recovered by the new attempt.
> Actual behavior:
> A new AM attempt starts. It cannot recover the 1st attempt's docker containers
> because it cannot read component details from ZK.
> Thus, it requests new containers for all components.
> {code}
> 2018-07-19 22:42:47,595 [main] INFO service.ServiceScheduler - Registering
> appattempt_1531977563978_0015_000002, fault-test-zkrm-httpd-docker into
> registry
> 2018-07-19 22:42:47,611 [main] INFO service.ServiceScheduler - Received 1
> containers from previous attempt.
> 2018-07-19 22:42:47,642 [main] INFO service.ServiceScheduler - Could not
> read component paths:
> `/users/hrt-qa/services/yarn-service/fault-test-zkrm-httpd-docker/components':
> No such file or directory: KeeperErrorCode = NoNode for
> /registry/users/hrt-qa/services/yarn-service/fault-test-zkrm-httpd-docker/components
> 2018-07-19 22:42:47,643 [main] INFO service.ServiceScheduler - Handling
> container_e08_1531977563978_0015_01_000003 from previous attempt
> 2018-07-19 22:42:47,643 [main] INFO service.ServiceScheduler - Record not
> found in registry for container container_e08_1531977563978_0015_01_000003
> from previous attempt, releasing
> 2018-07-19 22:42:47,649 [AMRM Callback Handler Thread] INFO
> impl.TimelineV2ClientImpl - Updated timeline service address to xxx:33019
> 2018-07-19 22:42:47,651 [main] INFO service.ServiceScheduler - Triggering
> initial evaluation of component httpd
> 2018-07-19 22:42:47,652 [main] INFO component.Component - [INIT COMPONENT
> httpd]: 2 instances.
> 2018-07-19 22:42:47,652 [main] INFO component.Component - [COMPONENT httpd]
> Requesting for 2 container(s){code}