[jira] [Comment Edited] (YARN-10934) LeafQueue activateApplications NPE

2021-10-04 Thread Benjamin Teke (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17424033#comment-17424033
 ] 

Benjamin Teke edited comment on YARN-10934 at 10/4/21, 4:54 PM:


[~luoyuan], [~snemeth] I uploaded a possible fix for the issue. While I wasn't
able to reproduce the issue, the most likely cause is the following sequence:
# an app was removed via LeafQueue.finishApplicationAttempt (which calls
removeApplicationAttempt)
# removeApplicationAttempt removed the user from the usersManager because the
user appeared to have no more pending or running applications
# activateApplications() was called and, for some reason, an app belonging to
the removed user was still in the pending applications list

I've noticed a behaviour change in YARN-3140: before that patch,
LeafQueue.getUser() added the user to the list if it was missing, similar to
what usersManager.getUserAndAddIfAbsent(username) does now. Since LeafQueue
calls this method most of the time anyway (instead of
usersManager.getUser(username)), I think the safe "fix" for this issue, in the
absence of repro steps, is to add the user if it has pending applications but
was for some reason previously removed, just like the code did before.
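
To make the suspected sequence and the proposed fallback concrete, here is a
minimal, self-contained sketch. It is not the actual YARN code or the uploaded
patch: ActivateApplicationsSketch, SimpleUsersManager, User and both activate
methods are hypothetical stand-ins, and only the two lookup names (getUser and
getUserAndAddIfAbsent) are borrowed from the discussion above. It assumes the
NPE comes from dereferencing the result of a plain lookup for a user that was
already removed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class ActivateApplicationsSketch {

  /** Hypothetical per-user record; not the real YARN class. */
  static final class User {
    int activeApplications;
  }

  /** Hypothetical users manager, reduced to the two lookups discussed. */
  static final class SimpleUsersManager {
    private final Map<String, User> users = new ConcurrentHashMap<>();

    /** Plain lookup: returns null if the user was removed or never added. */
    User getUser(String name) {
      return users.get(name);
    }

    /** Lookup that re-creates the entry when it is missing. */
    User getUserAndAddIfAbsent(String name) {
      return users.computeIfAbsent(name, n -> new User());
    }

    void removeUser(String name) {
      users.remove(name);
    }
  }

  /** Uses the plain lookup; throws NullPointerException for a removed user. */
  static int activateWithPlainLookup(SimpleUsersManager mgr,
                                     List<String> pendingAppOwners) {
    int activated = 0;
    for (String owner : pendingAppOwners) {
      User user = mgr.getUser(owner);
      user.activeApplications++; // fails here when user is null
      activated++;
    }
    return activated;
  }

  /** Uses the add-if-absent lookup, so a removed user is added back. */
  static int activateWithAddIfAbsent(SimpleUsersManager mgr,
                                     List<String> pendingAppOwners) {
    int activated = 0;
    for (String owner : pendingAppOwners) {
      User user = mgr.getUserAndAddIfAbsent(owner);
      user.activeApplications++;
      activated++;
    }
    return activated;
  }

  public static void main(String[] args) {
    SimpleUsersManager mgr = new SimpleUsersManager();
    mgr.getUserAndAddIfAbsent("alice");

    // Steps 1-2: the last known attempt finishes and the user is removed.
    mgr.removeUser("alice");

    // Step 3: an app owned by the removed user is still pending.
    List<String> pendingAppOwners = new ArrayList<>();
    pendingAppOwners.add("alice");

    try {
      activateWithPlainLookup(mgr, pendingAppOwners);
    } catch (NullPointerException e) {
      System.out.println("plain lookup failed with NullPointerException");
    }

    int activated = activateWithAddIfAbsent(mgr, pendingAppOwners);
    System.out.println("add-if-absent lookup activated " + activated + " app(s)");
  }
}
```

The add-if-absent variant hides the inconsistency rather than explaining it: the
stale pending entry is still unexplained, which is why this is only a "safe"
workaround until there are repro steps.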


was (Author: bteke):
[~luoyuan], [~snemeth] I uploaded a possible fix for the issue. While I wasn't
able to reproduce the issue, the most likely cause is the following sequence:
# an app was removed via LeafQueue.finishApplicationAttempt (which calls
removeApplicationAttempt)
# removeApplicationAttempt removed the user from the usersManager because the
user appeared to have no more pending or running applications
# activateApplications() was called and, for some reason, an app belonging to
the removed user was still in the pending applications list

I've noticed a behaviour change in YARN-3140: before that patch,
LeafQueue.getUser() added the user to the list if it was missing, similar to
what usersManager.getUserAndAddIfAbsent(username) does now. Since this method
is called most of the time anyway (instead of
usersManager.getUser(username)), I think the safe fix for this issue, in the
absence of repro steps, is to add the user if it has pending applications but
was for some reason previously removed, just like the code did before.

> LeafQueue activateApplications NPE
> --
>
> Key: YARN-10934
> URL: https://issues.apache.org/jira/browse/YARN-10934
> Project: Hadoop YARN
> Issue Type: Bug
> Components: RM
> Affects Versions: 3.3.1
> Reporter: Yuan Luo
> Assignee: Benjamin Teke
> Priority: Major
> Labels: pull-request-available
> Attachments: RM-capacity-scheduler.xml, RM-yarn-site.xml
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Our production YARN cluster runs Hadoop 3.3.1. We changed
> DefaultResourceCalculator to DominantResourceCalculator and restarted the RM,
> after which the RM crashed with the exception stack below. I think this is a
> serious bug and hope someone can follow up and fix it.
> 2021-08-30 21:00:59,114 ERROR event.EventDispatcher (MarkerIgnoringBase.java:error(159)) - Error in handling event type APP_ATTEMPT_REMOVED to the Event Dispatcher
> java.lang.NullPointerException
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.activateApplications(LeafQueue.java:868)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.removeApplicationAttempt(LeafQueue.java:1014)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.finishApplicationAttempt(LeafQueue.java:972)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:1188)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1904)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:171)
>     at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:79)
>     at java.base/java.lang.Thread.run(Thread.java:834)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10934) LeafQueue activateApplications NPE

2021-09-08 Thread Yuan Luo (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17411636#comment-17411636
 ] 

Yuan Luo edited comment on YARN-10934 at 9/8/21, 7:26 AM:
--

[~snemeth] Thanks for your reply. I have fixed the title; it is an NPE. I have
attached some of our YARN configuration. We use DefaultResourceCalculator and
the vcore setting configured for the queue is 0. I suspect this is related, but
I have not found the problem in the code.


was (Author: luoyuan):
[~snemeth] Thanks for your reply. I have fixed the title; it is an NPE. I will
add some information in the attachment.



