[ https://issues.apache.org/jira/browse/YARN-3675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14554994#comment-14554994 ]
Karthik Kambatla commented on YARN-3675:
----------------------------------------

+1. Checking this in.

Just want to note that we rely on the lock on FairScheduler to ensure a node doesn't get removed during attemptScheduling. I feel we are getting to the point where we should invest time in making these locks finer grained, otherwise we might end up in the MR1 world.

> FairScheduler: RM quits when node removal races with continousscheduling on the same node
> -----------------------------------------------------------------------------------------
>
>                 Key: YARN-3675
>                 URL: https://issues.apache.org/jira/browse/YARN-3675
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>            Reporter: Anubhav Dhoot
>            Assignee: Anubhav Dhoot
>            Priority: Critical
>         Attachments: YARN-3675.001.patch, YARN-3675.002.patch, YARN-3675.003.patch
>
>
> With continuous scheduling, scheduling can be done on a node that's just been removed, causing errors like the one below.
> {noformat}
> 12:28:53.782 AM FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager
> Error in handling event type APP_ATTEMPT_REMOVED to the scheduler
> java.lang.NullPointerException
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.unreserve(FSAppAttempt.java:469)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.completedContainer(FairScheduler.java:815)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:763)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1217)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:111)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:684)
>     at java.lang.Thread.run(Thread.java:745)
> 12:28:53.783 AM INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager Exiting, bbye..
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
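
For readers following the locking remark in the comment, here is a minimal sketch of the general idea: node removal and the scheduling attempt both take the same coarse scheduler lock, and the scheduling attempt re-checks that the node is still registered once it holds the lock. This is not the attached patch and not the FairScheduler code; the class `NodeRemovalRaceSketch`, its `Node` type, and all method bodies are hypothetical stand-ins chosen only to illustrate the pattern.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the real FairScheduler.
class NodeRemovalRaceSketch {

    // Hypothetical stand-in for the scheduler's per-node state.
    static final class Node {
        final String id;
        int assignedContainers = 0;

        Node(String id) {
            this.id = id;
        }
    }

    // Guarded by the monitor of this object (the single coarse lock).
    private final Map<String, Node> nodes = new HashMap<>();

    public synchronized void addNode(String id) {
        nodes.put(id, new Node(id));
    }

    public synchronized void removeNode(String id) {
        nodes.remove(id);
    }

    /**
     * Tries to schedule on the given node. Because this method and
     * removeNode share one lock, the node cannot disappear between the
     * lookup and the update. Returns false if the node was already
     * removed by the time the lock was acquired.
     */
    public synchronized boolean attemptScheduling(String id) {
        Node node = nodes.get(id);
        if (node == null) {
            // Node was removed while the caller waited for the lock: skip it
            // rather than dereference missing state.
            return false;
        }
        node.assignedContainers++;
        return true;
    }

    public static void main(String[] args) {
        NodeRemovalRaceSketch scheduler = new NodeRemovalRaceSketch();
        scheduler.addNode("node-1");
        System.out.println(scheduler.attemptScheduling("node-1"));
        scheduler.removeNode("node-1");
        System.out.println(scheduler.attemptScheduling("node-1"));
    }
}
```

The trade-off raised in the comment shows up directly here: every operation serializes on one monitor, which keeps the check simple but limits concurrency. Finer-grained locks would ease that contention at the cost of more careful reasoning about which state each lock protects.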