Peter Bacsko created YARN-9552:
----------------------------------

             Summary: FairScheduler: NODE_UPDATE can cause a 
NoSuchElementException
                 Key: YARN-9552
                 URL: https://issues.apache.org/jira/browse/YARN-9552
             Project: Hadoop YARN
          Issue Type: Bug
          Components: fairscheduler
            Reporter: Peter Bacsko
            Assignee: Peter Bacsko


We observed a race condition inside YARN with the following stack trace:

{noformat}
18/11/07 06:45:09.559 SchedulerEventDispatcher:Event Processor ERROR 
EventDispatcher: Error in handling event type NODE_UPDATE to the Event 
Dispatcher
java.util.NoSuchElementException
        at 
java.util.concurrent.ConcurrentSkipListMap.firstKey(ConcurrentSkipListMap.java:2036)
        at 
java.util.concurrent.ConcurrentSkipListSet.first(ConcurrentSkipListSet.java:396)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.getNextPendingAsk(AppSchedulingInfo.java:373)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.isOverAMShareLimit(FSAppAttempt.java:941)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:1373)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:353)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:204)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1094)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:961)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1183)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:132)
        at 
org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
        at java.lang.Thread.run(Thread.java:748)
{noformat}

This is basically the same as the one described in YARN-7382, but the root 
cause is different.

When we create an application attempt, we create an {{FSAppAttempt}} object. 
This contains an {{AppSchedulingInfo}} which contains a set of 
{{SchedulerRequestKey}}. Initially, this set is empty and only initialized a 
bit later on a separate thread during a state transition:
{noformat}
2019-05-07 15:58:02,659 INFO  [RM StateStore dispatcher] recovery.RMStateStore 
(RMStateStore.java:transition(239)) - Storing info for app: 
application_1557237478804_0001
2019-05-07 15:58:02,684 INFO  [RM Event dispatcher] rmapp.RMAppImpl 
(RMAppImpl.java:handle(903)) - application_1557237478804_0001 State change from 
NEW_SAVING to SUBMITTED on event = APP_NEW_SAVED
2019-05-07 15:58:02,690 INFO  [SchedulerEventDispatcher:Event Processor] 
fair.FairScheduler (FairScheduler.java:addApplication(490)) - Accepted 
application application_1557237478804_0001 from user: bacskop, in queue: 
root.bacskop, currently num of applications: 1
2019-05-07 15:58:02,698 INFO  [RM Event dispatcher] rmapp.RMAppImpl 
(RMAppImpl.java:handle(903)) - application_1557237478804_0001 State change from 
SUBMITTED to ACCEPTED on event = APP_ACCEPTED
2019-05-07 15:58:02,731 INFO  [RM Event dispatcher] 
resourcemanager.ApplicationMasterService 
(ApplicationMasterService.java:registerAppAttempt(434)) - Registering app 
attempt : appattempt_1557237478804_0001_000001
2019-05-07 15:58:02,732 INFO  [RM Event dispatcher] attempt.RMAppAttemptImpl 
(RMAppAttemptImpl.java:handle(920)) - appattempt_1557237478804_0001_000001 
State change from NEW to SUBMITTED on event = START
2019-05-07 15:58:02,746 INFO  [SchedulerEventDispatcher:Event Processor] 
scheduler.SchedulerApplicationAttempt 
(SchedulerApplicationAttempt.java:<init>(207)) - *** In the constructor of 
SchedulerApplicationAttempt
2019-05-07 15:58:02,747 INFO  [SchedulerEventDispatcher:Event Processor] 
scheduler.SchedulerApplicationAttempt 
(SchedulerApplicationAttempt.java:<init>(230)) - *** Contents of 
appSchedulingInfo: []
2019-05-07 15:58:02,752 INFO  [SchedulerEventDispatcher:Event Processor] 
fair.FairScheduler (FairScheduler.java:addApplicationAttempt(546)) - Added 
Application Attempt appattempt_1557237478804_0001_000001 to scheduler from 
user: bacskop
2019-05-07 15:58:02,756 INFO  [RM Event dispatcher] scheduler.AppSchedulingInfo 
(AppSchedulingInfo.java:updatePendingResources(257)) - *** Adding scheduler 
key: SchedulerRequestKey{priority=0, allocationRequestId=-1, 
containerToUpdate=null}  for attempt: appattempt_1557237478804_0001_000001
2019-05-07 15:58:02,759 INFO  [RM Event dispatcher] attempt.RMAppAttemptImpl 
(RMAppAttemptImpl.java:handle(920)) - appattempt_1557237478804_0001_000001 
State change from SUBMITTED to SCHEDULED on event = ATTEMPT_ADDED
2019-05-07 15:58:02,892 INFO  [main] impl.YarnClientImpl 
(YarnClientImpl.java:submitApplication(310)) - Submitted application 
application_1557237478804_0001
{noformat}
(some extra lines are printed with ***).

So at 15:58:02,747 the set is empty and populated with a single element at 
15:58:02,756 on "RM Event dispatcher". This means there's a tiny time window 
during which a {{NODE_UPDATE}} can cause a {{NoSuchElementException}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org

Reply via email to