[jira] [Commented] (YARN-5195) RM intermittently crashed with NPE while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler

sandflee (JIRA) Wed, 20 Jul 2016 17:29:54 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-5195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15386891#comment-15386891
 ]


sandflee commented on YARN-5195:
--------------------------------

AsyncSchedulerThread will copy all node from nodeTracker before 
attemptScheduling on node. there is a race condition:
1, all nodes copied from nodeTracker
2, nodeA lost and  removed from scheduler, all launched containers are cleaned
3, app attempt completed and the container allocated (or reserved) on nodeA 
will refer to non-exist node.
this is fixed in fairscheduler in YARN-3675, add a init patch and will add a 
test later

> RM intermittently crashed with NPE while handling APP_ATTEMPT_REMOVED event 
> when async-scheduling enabled in CapacityScheduler
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-5195
>                 URL: https://issues.apache.org/jira/browse/YARN-5195
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.7.2
>            Reporter: Karam Singh
>            Assignee: sandflee
>         Attachments: YARN-5195.01.patch
>
>
> While running gridmix experiments one time came across incident where RM went 
> down with following exception
> {noformat}
> 2016-05-28 15:45:24,459 [ResourceManager Event Processor] FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type APP_ATTEMPT_REMOVED to the scheduler
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1282)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1469)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:497)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:860)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1319)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:127)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:704)
>         at java.lang.Thread.run(Thread.java:745)
> 2016-05-28 15:45:24,460 [ApplicationMasterLauncher #49] INFO 
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Cleaning 
> master appattempt_1464449118385_0006_000001
> 2016-05-28 15:45:24,460 [ResourceManager Event Processor] INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (YARN-5195) RM intermittently crashed with NPE while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler

Reply via email to