[ 
https://issues.apache.org/jira/browse/YARN-6661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subru Krishnan updated YARN-6661:
---------------------------------
    Fix Version/s:     (was: 2.9.0)
                   2.9.1

> Too much CLEANUP event hang ApplicationMasterLauncher thread pool
> -----------------------------------------------------------------
>
>                 Key: YARN-6661
>                 URL: https://issues.apache.org/jira/browse/YARN-6661
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.7.2
>         Environment: hadoop 2.7.2 
>            Reporter: JackZhou
>             Fix For: 2.9.1
>
>
> Some one else have already come up with the similar problem and fix it.
> We can look the jira(https://issues.apache.org/jira/browse/YARN-3809) for 
> detail.
> But I think the fix have not solve the problem completely, blow was the 
> problem I encountered:
> There is about 1000 nodes in my hadoop cluster, and I submit about 1800 apps.
> I failover my active rm and rm will failover all those 1800 apps.
> When a application failover, It will wait for AM container register itself. 
> But there is a bug in my AM (I do it intentionally), and it will not register 
> itself.
> So the RM will wait for about 10mins for the AM expiration, and it will send 
> a CLEANUP event to 
> ApplicationMasterLauncher thread pool. Because there is about 1800 apps, so 
> it will hang the ApplicationMasterLauncher
> thread pool for a large time. I have already use the 
> patch(https://issues.apache.org/jira/secure/attachment/12740804/YARN-3809.03.patch),
>  so
> a CLEANUP event will hang a thread 10 * 20 = 200s. But I have 1800 apps, so 
> for each of my thread, it will
> hang 1800 / 50 * 200s = 7200s=20min.
> Because the AM have register itself during 10mins, so it will retry and 
> create a new application attempt. 
> The application attempt will accept a container from RM, and send a LAUNCH to 
> ApplicationMasterLauncher thread pool.
> Because the 1800 CLEANUP will hang the 50 thread pools about 20mins. So the 
> application attempt will not 
> start the AM container during 10min. 
> And it will expire, and send a CLEANUP event to ApplicationMasterLauncher 
> thread pools too.
> As you can see, none of my application can really run it. 
> Each of them have 5 application attempts as follows, and each of them keep 
> retrying.
> appattempt_1495786030132_4000_000005
> appattempt_1495786030132_4000_000004
> appattempt_1495786030132_4000_000003
> appattempt_1495786030132_4000_000002  
> appattempt_1495786030132_4000_000001
> So all of my apps have hang several hours, and none of them can really run. 
> I think this is a bug!!! We can treat CLEANUP and LAUNCH as different events.
> And use some other thread to deal with LAUNCH event or use other way.
> Sorry, I english is so poor. I don't know have I describe it clearly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

Reply via email to