[ 
https://issues.apache.org/jira/browse/YARN-6710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

daemon updated YARN-6710:
-------------------------
    Description: 
There are over three thousand nodes in my Hadoop production cluster, and we use 
the fair scheduler as our scheduler.
Although the resource manager has plenty of free resources, 46 applications 
are pending. 
Those applications still had not started after several hours, and in the end 
I had to stop them.

I reproduced the situation in my test environment and found a bug in FSLeafQueue. 
In an extreme scenario, FSLeafQueue#amResourceUsage can become greater than 
its real value.
When the fair scheduler tries to assign a container to an application attempt, 
it performs the following check:

!screenshot-2.png!
!screenshot-3.png!

Because the value of FSLeafQueue#amResourceUsage is invalid, it is greater 
than its real value.
So when the value of amResourceUsage is greater than the value of 
Resources.multiply(getFairShare(), maxAMShare), 
the FSLeafQueue#canRunAppAM function returns false, which stops the 
fair scheduler from assigning a container 
to the FSAppAttempt. 
In this scenario, all the application attempts stay pending and never get any 
resources.

I found the reason why so many applications in my leaf queue are pending. I will 
describe it as follows:
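
The check described above can be sketched roughly as follows. This is a simplified, memory-only model and not the Hadoop source: the class AmShareCheckSketch, its method names, and all the numbers are hypothetical and are used only to show how an inflated amResourceUsage makes the check fail even though the real usage is well under the limit.

```java
// Simplified, memory-only model of the AM share check described above.
// All names and values here are hypothetical; this is not the Hadoop code.
public class AmShareCheckSketch {

    /** Upper bound on AM memory for the queue: fairShare * maxAMShare. */
    static long maxAmResourceMb(long fairShareMb, double maxAmShare) {
        return (long) (fairShareMb * maxAmShare);
    }

    /** Returns false once the tracked AM usage is greater than the limit. */
    static boolean canRunAppAm(long amResourceUsageMb, long fairShareMb,
                               double maxAmShare) {
        return amResourceUsageMb <= maxAmResourceMb(fairShareMb, maxAmShare);
    }

    public static void main(String[] args) {
        long fairShareMb = 100_000;     // assumed fair share of the leaf queue
        double maxAmShare = 0.5;        // assumed maxAMShare setting
        long realUsageMb = 20_000;      // memory really held by running AMs
        long inflatedUsageMb = 60_000;  // tracked value after it drifted upward

        // With the real usage the queue could still start a new AM.
        System.out.println(canRunAppAm(realUsageMb, fairShareMb, maxAmShare));
        // With the inflated value the check fails, so new attempts stay pending.
        System.out.println(canRunAppAm(inflatedUsageMb, fairShareMb, maxAmShare));
    }
}
```

In this model the limit is 50,000 MB, so the first call passes and the second does not. If the tracked value never comes back down, the second case holds for every new attempt in the queue.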

> There is a heavy bug in FSLeafQueue#amResourceUsage which will let the fair 
> scheduler not assign container to the queue
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-6710
>                 URL: https://issues.apache.org/jira/browse/YARN-6710
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.7.2
>            Reporter: daemon
>             Fix For: 2.8.0
>
>         Attachments: screenshot-1.png, screenshot-2.png, screenshot-3.png
>
>


