[ 
https://issues.apache.org/jira/browse/YARN-10934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424033#comment-17424033
 ] 

Benjamin Teke commented on YARN-10934:
--------------------------------------

[~luoyuan], [~snemeth] uploaded a possible fix for the issue. While I wasn't 
able to reproduce the issue, the reason for it was most likely the following:
# an app was removed via LeafQueue.finishApplicationAttempt (which calls 
removeApplicationAttempt)
# removeApplicationAttempt removes the user from the usersManager because it 
seems like the user has no more pending or running applications
# activateApplications() is called and for some reason an app is still in the 
pending applications list with a removed user

I've noticed a behaviour change in YARN-3140: before that patch, 
LeafQueue.getUser() added a user to the list if it was missing, similar to what 
usersManager.getUserAndAddIfAbsent(username) does now. Since this method is 
called most of the time anyway (instead of usersManager.getUser(username)), I 
think the safe fix for this issue (without repro steps) is to add the user if 
it has pending applications but for some reason was previously removed, just 
as the code did before.
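
To illustrate the suspected sequence and the proposed fix, here is a minimal, self-contained sketch. It is not the actual Hadoop code: LeafQueueSketch, User, the pendingAppOwners list, and the activateUnsafe/activateSafe methods are hypothetical stand-ins. Only the method names getUser and getUserAndAddIfAbsent are taken from the comment above, and their bodies here are assumed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the queue; NOT the real LeafQueue implementation.
class LeafQueueSketch {

    // Hypothetical per-user bookkeeping.
    static class User {
        int activeApps;
    }

    // Hypothetical stand-in for the users manager; bodies are assumed.
    static class UsersManager {
        private final Map<String, User> users = new HashMap<>();

        // Plain lookup: returns null when the user was removed.
        User getUser(String name) {
            return users.get(name);
        }

        // Lookup that re-creates the user entry when it is missing.
        User getUserAndAddIfAbsent(String name) {
            return users.computeIfAbsent(name, n -> new User());
        }

        void removeUser(String name) {
            users.remove(name);
        }
    }

    final UsersManager usersManager = new UsersManager();

    // Owners of applications that are still pending (assumed representation).
    final List<String> pendingAppOwners = new ArrayList<>();

    // Suspected failing path: a pending app whose user was already removed
    // makes getUser return null, and the dereference throws an NPE.
    int activateUnsafe() {
        int activated = 0;
        for (String owner : pendingAppOwners) {
            User user = usersManager.getUser(owner);
            user.activeApps++; // NullPointerException if the user is gone
            activated++;
        }
        return activated;
    }

    // Proposed behaviour: re-add the user if it still has pending apps.
    int activateSafe() {
        int activated = 0;
        for (String owner : pendingAppOwners) {
            User user = usersManager.getUserAndAddIfAbsent(owner);
            user.activeApps++;
            activated++;
        }
        return activated;
    }
}
```

In this sketch, removing a user that still owns a pending app makes activateUnsafe() throw, while activateSafe() re-creates the user entry and proceeds. Whether the real patch takes exactly this shape is an assumption.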

> LeafQueue activateApplications NPE
> ----------------------------------
>
>                 Key: YARN-10934
>                 URL: https://issues.apache.org/jira/browse/YARN-10934
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: RM
>    Affects Versions: 3.3.1
>            Reporter: Yuan Luo
>            Assignee: Benjamin Teke
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: RM-capacity-scheduler.xml, RM-yarn-site.xml
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Our prod YARN cluster runs Hadoop 3.3.1. We changed 
> DefaultResourceCalculator -> DominantResourceCalculator and restarted the RM, 
> and then the RM crashed with the exception stack below. I think this is a 
> serious bug and hope someone can follow up and fix it.
> 2021-08-30 21:00:59,114 ERROR event.EventDispatcher 
> (MarkerIgnoringBase.java:error(159)) - Error in handling event type 
> APP_ATTEMPT_REMOVED to the Event Dispatcher
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.activateApplications(LeafQueue.java:868)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.removeApplicationAttempt(LeafQueue.java:1014)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.finishApplicationAttempt(LeafQueue.java:972)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:1188)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1904)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:171)
>         at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:79)
>         at java.base/java.lang.Thread.run(Thread.java:834)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
