[
https://issues.apache.org/jira/browse/YARN-5773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15604144#comment-15604144
]
Sunil G edited comment on YARN-5773 at 10/25/16 4:46 AM:
---------------------------------------------------------
*Issues in Recovery of apps:*
1. activateApplications works under a write lock.
2. If one application is found to exceed the AM resource limit, instead of
breaking out of the loop, we continue and scan the complete set of apps in
pendingOrderingPolicy. We may need to iterate over all apps because the apps
belong to different partitions and pendingOrderingPolicy does not provide any
ordering of apps by partition.
3. As mentioned by [~bibinchundatt], each time an app fails to get activated
because it hits the upper AM resource limit, one INFO log is emitted (because
*amLimit* is 0). During recovery, this is costly.
[~leftnoteasy] and [~rohithsharma]
bq.If a given app's AM resource amount > AM headroom, should we skip the AM and
activate following app which AM resource amount <= AM headroom?
bq.But one point to be considered is for each Node registration, head room
changes. So, user head room changes as new node registered. This need to be
taken care.
Currently activateApplications is invoked whenever there is a change in
cluster resource. So any change in cluster resource will ensure a call to
activateApplications, and we can recalculate this headroom there. I am not very
sure about the suggested map. Will this check come before the existing AM
resource percentage check for the queue/partition (not user based), or will it
replace that check?
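To make the skip-instead-of-break point concrete, here is a minimal sketch of
activating apps against per-partition AM headroom. The AppInfo type, the
scalar resource, and the headroom map are illustrative assumptions, not the
actual YARN classes: the point is only that, since pendingOrderingPolicy mixes
partitions, an app that overflows one partition's limit cannot end the loop.

```java
import java.util.*;

// Hypothetical sketch, not YARN API: skip (don't break on) apps whose AM
// demand exceeds the per-partition AM headroom, because a later app in the
// pending order may belong to a different partition that still has room.
public class ActivateSketch {
    static final class AppInfo {
        final String id;
        final String partition;
        final int amResource; // simplified scalar resource
        AppInfo(String id, String partition, int amResource) {
            this.id = id;
            this.partition = partition;
            this.amResource = amResource;
        }
    }

    // Returns the ids of apps that can be activated, consuming each
    // partition's AM headroom as apps activate.
    static List<String> activate(List<AppInfo> pending,
                                 Map<String, Integer> headroom) {
        List<String> activated = new ArrayList<>();
        for (AppInfo app : pending) {
            int room = headroom.getOrDefault(app.partition, 0);
            if (app.amResource > room) {
                continue; // skip only this app; cannot break here
            }
            headroom.put(app.partition, room - app.amResource);
            activated.add(app.id);
        }
        return activated;
    }

    public static void main(String[] args) {
        List<AppInfo> pending = Arrays.asList(
            new AppInfo("app1", "x", 4),  // overflows partition x's headroom
            new AppInfo("app2", "x", 2),  // still fits in x after app1 is skipped
            new AppInfo("app3", "y", 1)); // different partition, also fits
        Map<String, Integer> headroom = new HashMap<>();
        headroom.put("x", 3);
        headroom.put("y", 2);
        System.out.println(activate(pending, headroom)); // [app2, app3]
    }
}
```

Breaking at app1 would have wrongly stranded app2 and app3 in the pending
list, which is why the loop must visit every app.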
> RM recovery too slow due to LeafQueue#activateApplication()
> -----------------------------------------------------------
>
> Key: YARN-5773
> URL: https://issues.apache.org/jira/browse/YARN-5773
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Bibin A Chundatt
> Assignee: Bibin A Chundatt
> Priority: Critical
> Attachments: YARN-5773.0001.patch, YARN-5773.0002.patch
>
>
> # Submit 10K applications to the default queue.
> # All applications are in the ACCEPTED state.
> # Now restart the ResourceManager.
> For each application recovered, {{LeafQueue#activateApplications()}} is
> invoked, resulting in the AM limit check being done even before node
> managers are registered.
> The total iteration count for N applications is about {{N(N+1)/2}}; for
> {{10K}} applications that is roughly {{50000000}} iterations, causing the
> time taken for the RM to become active to exceed 10 min.
> Since NM resources are not yet added back during recovery, we should skip
> {{activateApplication()}}
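The quadratic cost described above can be sketched as follows: if each of the
N recovered apps triggers a full scan of all apps submitted so far, the total
work is 1 + 2 + ... + N = N(N+1)/2 iterations. This is a back-of-the-envelope
model of the behavior, not the actual scheduler code:

```java
// Sketch of the recovery cost: app k's recovery scans the k apps
// recovered so far, so the total is the triangular number N(N+1)/2.
public class RecoveryCost {
    static long totalIterations(long n) {
        long total = 0;
        for (long recovered = 1; recovered <= n; recovered++) {
            total += recovered; // one scan over the apps recovered so far
        }
        return total; // equals n * (n + 1) / 2
    }

    public static void main(String[] args) {
        // For 10K apps this is ~5 * 10^7 iterations, matching the
        // roughly-50000000 figure in the description.
        System.out.println(totalIterations(10_000)); // 50005000
    }
}
```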
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]