[ 
https://issues.apache.org/jira/browse/YARN-10934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424033#comment-17424033
 ] 

Benjamin Teke edited comment on YARN-10934 at 10/4/21, 4:54 PM:
----------------------------------------------------------------

[~luoyuan], [~snemeth] uploaded a possible fix for the issue. While I wasn't 
able to reproduce the issue, the reason for it was most likely the following:
# an app was removed via LeafQueue.finishApplicationAttempt (which calls 
removeApplicationAttempt)
# removeApplicationAttempt removes the user from the usersManager because it 
seems like the user has no more pending or running applications
# activateApplications() is called and, for some reason, an app whose user was 
removed is still in the pending applications list
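
The suspected sequence above can be sketched with a toy model. This is not the Hadoop code: ToyLeafQueue, its User class and the removeUserOnly() helper are hypothetical names invented for illustration, and the real LeafQueue does far more bookkeeping. The sketch only shows that a pending app whose owner is gone from the user map leads to a NullPointerException when the pending list is walked.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: hypothetical classes, not the Hadoop sources.
class ToyLeafQueue {
    static final class User {
        int pendingApps;
    }

    private final Map<String, User> users = new HashMap<>();
    // Each pending entry is represented by the owning user's name.
    private final List<String> pendingApps = new ArrayList<>();

    void submit(String userName) {
        users.computeIfAbsent(userName, k -> new User()).pendingApps++;
        pendingApps.add(userName);
    }

    // Models the suspected inconsistency: the user is dropped while one of
    // its apps is still in the pending list.
    void removeUserOnly(String userName) {
        users.remove(userName);
    }

    // Plain lookup: returns null for an unknown user.
    User getUser(String userName) {
        return users.get(userName);
    }

    // Walks the pending apps and dereferences each owner, which is the
    // step that fails when the owner was removed. Returns the number of
    // apps activated.
    int activateApplications() {
        int activated = 0;
        for (String userName : new ArrayList<>(pendingApps)) {
            User user = getUser(userName);
            user.pendingApps--; // throws NullPointerException if user is null
            pendingApps.remove(userName);
            activated++;
        }
        return activated;
    }

    // Runs the three steps and reports whether the NPE occurred.
    static boolean reproducesNpe() {
        ToyLeafQueue queue = new ToyLeafQueue();
        queue.submit("alice");
        queue.removeUserOnly("alice");
        try {
            queue.activateApplications();
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("NPE reproduced: " + reproducesNpe());
    }
}
```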

I've noticed a behaviour change in YARN-3140: before that patch, 
LeafQueue.getUser() added a user to the list if it was missing, similar to what 
usersManager.getUserAndAddIfAbsent(username) does now. Since this is the method 
LeafQueue calls most of the time anyway (instead of 
usersManager.getUser(username)), I think the safe "fix" for this issue 
(without repro steps) is to add the user if it has pending applications (but 
for some reason was previously removed), just like it did before.
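
As a rough illustration of the difference between the two lookups, here is a minimal sketch. ToyUsersManager is a hypothetical stand-in, not the Hadoop UsersManager; only the method names getUser and getUserAndAddIfAbsent are taken from the comment above, and their bodies here are assumptions about the general get-or-add pattern rather than the actual implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a users manager; not the Hadoop class.
class ToyUsersManager {
    static final class User {
        final String name;

        User(String name) {
            this.name = name;
        }
    }

    private final Map<String, User> users = new ConcurrentHashMap<>();

    // Plain lookup: returns null when the user was never added or was removed.
    User getUser(String userName) {
        return users.get(userName);
    }

    // Get-or-add lookup: never returns null, re-creating the user if needed.
    User getUserAndAddIfAbsent(String userName) {
        return users.computeIfAbsent(userName, User::new);
    }

    void removeUser(String userName) {
        users.remove(userName);
    }

    public static void main(String[] args) {
        ToyUsersManager manager = new ToyUsersManager();
        manager.getUserAndAddIfAbsent("alice");
        manager.removeUser("alice");

        // The plain lookup now yields null, so any dereference would fail.
        System.out.println("getUser is null: " + (manager.getUser("alice") == null));

        // The get-or-add lookup restores the user instead of returning null.
        User restored = manager.getUserAndAddIfAbsent("alice");
        System.out.println("restored user: " + restored.name);
    }
}
```

The trade-off in this pattern is that a user entry can reappear after removal, which is acceptable here only because the user still has a pending application.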



> LeafQueue activateApplications NPE
> ----------------------------------
>
>                 Key: YARN-10934
>                 URL: https://issues.apache.org/jira/browse/YARN-10934
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: RM
>    Affects Versions: 3.3.1
>            Reporter: Yuan Luo
>            Assignee: Benjamin Teke
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: RM-capacity-scheduler.xml, RM-yarn-site.xml
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Our production YARN cluster runs Hadoop 3.3.1. We changed the resource 
> calculator from DefaultResourceCalculator to DominantResourceCalculator and 
> restarted the RM, after which the RM crashed with the exception stack below. 
> I think this is a serious bug and hope someone can follow up and fix it.
> 2021-08-30 21:00:59,114 ERROR event.EventDispatcher (MarkerIgnoringBase.java:error(159)) - Error in handling event type APP_ATTEMPT_REMOVED to the Event Dispatcher
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.activateApplications(LeafQueue.java:868)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.removeApplicationAttempt(LeafQueue.java:1014)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.finishApplicationAttempt(LeafQueue.java:972)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:1188)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1904)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:171)
>         at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:79)
>         at java.base/java.lang.Thread.run(Thread.java:834)



