[ https://issues.apache.org/jira/browse/YARN-6661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Subru Krishnan updated YARN-6661: --------------------------------- Fix Version/s: (was: 2.9.0) 2.9.1 > Too much CLEANUP event hang ApplicationMasterLauncher thread pool > ----------------------------------------------------------------- > > Key: YARN-6661 > URL: https://issues.apache.org/jira/browse/YARN-6661 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler > Affects Versions: 2.7.2 > Environment: hadoop 2.7.2 > Reporter: JackZhou > Fix For: 2.9.1 > > > Some one else have already come up with the similar problem and fix it. > We can look the jira(https://issues.apache.org/jira/browse/YARN-3809) for > detail. > But I think the fix have not solve the problem completely, blow was the > problem I encountered: > There is about 1000 nodes in my hadoop cluster, and I submit about 1800 apps. > I failover my active rm and rm will failover all those 1800 apps. > When a application failover, It will wait for AM container register itself. > But there is a bug in my AM (I do it intentionally), and it will not register > itself. > So the RM will wait for about 10mins for the AM expiration, and it will send > a CLEANUP event to > ApplicationMasterLauncher thread pool. Because there is about 1800 apps, so > it will hang the ApplicationMasterLauncher > thread pool for a large time. I have already use the > patch(https://issues.apache.org/jira/secure/attachment/12740804/YARN-3809.03.patch), > so > a CLEANUP event will hang a thread 10 * 20 = 200s. But I have 1800 apps, so > for each of my thread, it will > hang 1800 / 50 * 200s = 7200s=20min. > Because the AM have register itself during 10mins, so it will retry and > create a new application attempt. > The application attempt will accept a container from RM, and send a LAUNCH to > ApplicationMasterLauncher thread pool. > Because the 1800 CLEANUP will hang the 50 thread pools about 20mins. So the > application attempt will not > start the AM container during 10min. > And it will expire, and send a CLEANUP event to ApplicationMasterLauncher > thread pools too. > As you can see, none of my application can really run it. > Each of them have 5 application attempts as follows, and each of them keep > retrying. > appattempt_1495786030132_4000_000005 > appattempt_1495786030132_4000_000004 > appattempt_1495786030132_4000_000003 > appattempt_1495786030132_4000_000002 > appattempt_1495786030132_4000_000001 > So all of my apps have hang several hours, and none of them can really run. > I think this is a bug!!! We can treat CLEANUP and LAUNCH as different events. > And use some other thread to deal with LAUNCH event or use other way. > Sorry, I english is so poor. I don't know have I describe it clearly. -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org