[ 
https://issues.apache.org/jira/browse/YARN-6710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

daemon updated YARN-6710:
-------------------------
    Attachment: screenshot-4.png

> There is a heavy bug in FSLeafQueue#amResourceUsage which will let the fair 
> scheduler not assign container to the queue
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-6710
>                 URL: https://issues.apache.org/jira/browse/YARN-6710
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.7.2
>            Reporter: daemon
>             Fix For: 2.8.0
>
>         Attachments: screenshot-1.png, screenshot-2.png, screenshot-3.png, 
> screenshot-4.png
>
>
> My Hadoop production cluster has over three thousand nodes, and we use the
> Fair Scheduler.
> Although the ResourceManager has plenty of free resources, 46 applications
> are pending.
> They still could not run after several hours, and in the end I had to stop
> them.
> I reproduced the problem in my test environment and found a bug in
> FSLeafQueue.
> In an extreme scenario, FSLeafQueue#amResourceUsage can become greater than
> its real value.
> When the Fair Scheduler tries to assign a container to an application
> attempt, it performs the following check:
> !screenshot-2.png!
> !screenshot-3.png!
> Because FSLeafQueue#amResourceUsage is tracked incorrectly, it ends up
> greater than its real value.
> Once amResourceUsage grows beyond
> Resources.multiply(getFairShare(), maxAMShare),
> FSLeafQueue#canRunAppAM returns false, so the Fair Scheduler does not assign
> a container to the FSAppAttempt.
> In this scenario, every application attempt stays pending and never gets
> any resources.
> I found the reason why so many applications in my leaf queue are pending. I
> will describe it as follows:
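A simplified model of the FSLeafQueue#canRunAppAM check quoted above can show why an inflated amResourceUsage starves the whole queue. It is a sketch under stated assumptions, not the Hadoop implementation and not the reporter's root-cause explanation: resources are reduced to memory in MB (the real scheduler uses the Resource type and the Resources helpers), and the check is assumed to add the new AM's request to the recorded usage before comparing against Resources.multiply(getFairShare(), maxAMShare). The class name, method names, and all numbers are hypothetical.

```java
/**
 * Hypothetical, simplified model of the AM-share check described in the
 * report. Memory-only (MB) longs stand in for YARN's Resource type.
 */
final class AmShareCheck {

    /** AM limit for the queue: fairShare * maxAMShare (memory only). */
    static long maxAmUsageMb(long fairShareMb, double maxAMShare) {
        return (long) (fairShareMb * maxAMShare);
    }

    /**
     * Assumed form of the check: admit a new AM only if the recorded AM usage
     * plus the new AM's request stays within the queue's AM limit.
     */
    static boolean canRunAppAM(long amResourceUsageMb, long amRequestMb,
                               long fairShareMb, double maxAMShare) {
        long ifRunAmUsageMb = amResourceUsageMb + amRequestMb;
        return ifRunAmUsageMb <= maxAmUsageMb(fairShareMb, maxAMShare);
    }

    public static void main(String[] args) {
        long fairShareMb = 100_000; // hypothetical queue fair share
        double maxAMShare = 0.5;    // hypothetical maxAMShare setting
        long amRequestMb = 2_048;   // hypothetical AM container size

        // Accurate bookkeeping: 20480 MB recorded, limit is 50000 MB -> true.
        System.out.println(canRunAppAM(20_480, amRequestMb, fairShareMb, maxAMShare));

        // Inflated bookkeeping: the counter claims 60000 MB regardless of the
        // real usage, so the check fails for every pending attempt -> false.
        System.out.println(canRunAppAM(60_000, amRequestMb, fairShareMb, maxAMShare));
    }
}
```

Because the check only compares the recorded counter against the limit, nothing in it can detect that the counter itself is wrong; as long as the inflated value stays above the limit minus the AM request, no pending attempt in the queue can start, even when the cluster has free capacity.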



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
