[
https://issues.apache.org/jira/browse/YARN-10428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Akira Ajisaka updated YARN-10428:
---------------------------------
Fix Version/s: 2.10.2
Backported to branch-2.10.
> Zombie applications in the YARN queue using FAIR + sizebasedweight
> ------------------------------------------------------------------
>
> Key: YARN-10428
> URL: https://issues.apache.org/jira/browse/YARN-10428
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacityscheduler
> Affects Versions: 2.8.5
> Reporter: Guang Yang
> Assignee: Andras Gyori
> Priority: Critical
> Fix For: 3.4.0, 2.10.2, 3.2.3, 3.3.2
>
> Attachments: YARN-10428.001.patch, YARN-10428.002.patch,
> YARN-10428.003.patch
>
>
> We are seeing zombie jobs in a YARN queue that uses the FAIR ordering policy
> with size-based weight.
> *Detection:*
> The YARN UI shows an incorrect number for "Num Schedulable Applications".
> *Impact:*
> The queue has an upper limit on the number of running applications. With
> zombie jobs present, it hits the limit even though the number of
> applications actually running is far below the limit.
> *Workaround:*
> Fail over and restart the ResourceManager process.
> *Analysis:*
> In the heap dump, we can find the zombie jobs in
> `FairOrderingPolicy#schedulableEntities` (see attachment). Take application
> "application_1599157165858_29429" as an example: it is still in the
> `FairOrderingPolicy#schedulableEntities` set. However, the ResourceManager
> log shows that the RM already tried to remove the application:
>
> ./yarn-yarn-resourcemanager-ip-172-21-153-252.log.2020-09-04-04:2020-09-04
> 04:32:19,730 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue
> (ResourceManager Event Processor): Application removed - appId:
> application_1599157165858_29429 user: svc_di_data_eng queue: core-data
> #user-pending-applications: -3 #user-active-applications: 7
> #queue-pending-applications: 0 #queue-active-applications: 21
>
> So it appears the RM failed to remove the application from the set.
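
The description above does not say why the removal failed. As a general
illustration only, here is a minimal Java sketch of one way a
comparator-ordered set can keep an entry after remove() has been called: the
value the comparator reads changes while the entry is in the set, so the
lookup walks to the wrong position. The class and field names (App, demand)
are hypothetical stand-ins, not YARN classes, and using java.util.TreeSet is
an assumption made for the demo, not a statement about the scheduler's code.

```java
import java.util.Comparator;
import java.util.TreeSet;

class ZombieEntryDemo {
    // Hypothetical stand-in for a schedulable entity; not a YARN class.
    static final class App {
        final String id;
        long demand; // mutable value that the comparator reads

        App(String id, long demand) {
            this.id = id;
            this.demand = demand;
        }
    }

    // Returns the set size after trying to remove "app_1" from three entries.
    static int leftoverAfterRemove(boolean mutateBeforeRemove) {
        // Order by demand, then by id to break ties.
        Comparator<App> byDemandThenId = (x, y) -> {
            int c = Long.compare(x.demand, y.demand);
            return c != 0 ? c : x.id.compareTo(y.id);
        };
        TreeSet<App> entities = new TreeSet<>(byDemandThenId);

        App a = new App("app_1", 10);
        entities.add(a);
        entities.add(new App("app_2", 20));
        entities.add(new App("app_3", 30));

        if (mutateBeforeRemove) {
            // The sort key changes while the entry is still in the set,
            // so the tree position no longer matches the comparator.
            a.demand = 100;
        }

        // With a changed key, the lookup searches the wrong branch and
        // remove() returns false, leaving the entry behind.
        entities.remove(a);
        return entities.size();
    }

    public static void main(String[] args) {
        System.out.println(leftoverAfterRemove(false)); // prints 2
        System.out.println(leftoverAfterRemove(true));  // prints 3
    }
}
```

In the second call the entry is still visited when iterating over the set,
although remove() and contains() can no longer find it. That matches the
symptom of an application that stays in the set after the "Application
removed" log line, but it is not a confirmed diagnosis of this issue.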
--
This message was sent by Atlassian Jira
(v8.3.4#803005)