[ 
https://issues.apache.org/jira/browse/YARN-5773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15603964#comment-15603964
 ] 

Rohith Sharma K S commented on YARN-5773:
-----------------------------------------

Thanks, folks, for the discussion.
I went through the discussion above and have one doubt: how can *RM
recovery* itself be too slow? In the current RM restart there are 2 stages.
# Recover : Read all the application data from ZooKeeper and replay it.
For running/pending apps, an event is triggered to the scheduler, and the
scheduler has a *separate dispatcher* to handle it.
# Service Start : Once the recovery process is completed, all the RM services
are started.
IIUC, at that point the RM service is up and able to accept new requests from
clients. So the problem is that, after RM service start, activation of
applications is delayed because nodes are not yet registered; the delay is not
in the actual recovery. It would be better if the JIRA summary were updated to
something like "Scheduler takes longer time for activating recovered apps when
RM is restarted", or similar.

As for the improvement, as wangda suggested, maybe we can keep a
Map<UserName, List<Application>>, which would optimize the head room check in
{{activateApplications()}}. But one point to consider is that head room
changes with each node registration, so user head room changes as new nodes
register. This needs to be taken care of.
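
A rough sketch of the per-user idea, only to make the trade-off concrete. This
is not the YARN-5773 patch and none of the names below are YARN classes:
PendingAppsByUser, userAmShare and the memory-only limit are assumptions made
for illustration. Pending apps are grouped per user, the scan for a user stops
at the first app that does not fit, and every node registration re-runs
activation because the limit grows with cluster size.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; these are not YARN classes or APIs.
class PendingAppsByUser {
    private static final class App {
        final String id;
        final int amMb; // memory requested for the application master
        App(String id, int amMb) { this.id = id; this.amMb = amMb; }
    }

    // Pending apps grouped by user, in submission order.
    private final Map<String, Deque<App>> pendingByUser = new LinkedHashMap<>();
    private final Map<String, Integer> usedAmMbByUser = new HashMap<>();
    // Assumed: fraction of cluster memory that one user's AMs may occupy.
    private final double userAmShare;
    private long clusterMb = 0;

    PendingAppsByUser(double userAmShare) { this.userAmShare = userAmShare; }

    void submit(String appId, String user, int amMb) {
        pendingByUser.computeIfAbsent(user, u -> new ArrayDeque<>())
                     .add(new App(appId, amMb));
    }

    // A node registration grows the cluster, so every user's limit changes
    // and activation has to be attempted again.
    List<String> nodeRegistered(int nodeMb) {
        clusterMb += nodeMb;
        return activate();
    }

    // Returns the ids of the apps activated by this pass.
    List<String> activate() {
        List<String> activated = new ArrayList<>();
        long userLimitMb = (long) (clusterMb * userAmShare);
        for (Map.Entry<String, Deque<App>> e : pendingByUser.entrySet()) {
            Deque<App> queue = e.getValue();
            int used = usedAmMbByUser.getOrDefault(e.getKey(), 0);
            // Stop at the first app of this user that does not fit instead
            // of walking the rest of that user's pending apps.
            while (!queue.isEmpty() && used + queue.peek().amMb <= userLimitMb) {
                App app = queue.poll();
                used += app.amMb;
                activated.add(app.id);
            }
            usedAmMbByUser.put(e.getKey(), used);
        }
        return activated;
    }
}
```

With no nodes registered, activate() does one check per user and returns an
empty list. The cost of re-running it on each node registration is the point
raised above and would still need to be handled in a real change.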

> RM recovery too slow due to LeafQueue#activateApplication()
> -----------------------------------------------------------
>
>                 Key: YARN-5773
>                 URL: https://issues.apache.org/jira/browse/YARN-5773
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Bibin A Chundatt
>            Assignee: Bibin A Chundatt
>            Priority: Critical
>         Attachments: YARN-5773.0001.patch, YARN-5773.0002.patch
>
>
> # Submit 10K applications to the default queue.
> # All applications are in accepted state.
> # Now restart the resourcemanager.
> For each application recovery, {{LeafQueue#activateApplications()}} is
> invoked, resulting in the AM limit check being done even before the node
> managers are registered.
> Total iterations for N applications is about {{N(N+1)/2}}; for {{10K}}
> applications that is roughly {{50000000}} iterations, causing the RM to take
> more than 10 min to become active.
> Since NM resources are not yet added during recovery, we should skip
> {{activateApplications()}}.
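
The iteration count in the description can be checked with a small counting
sketch. It assumes, as the description does, that every recovered application
triggers one full pass over all currently pending applications and that
nothing activates because no node has registered yet. ActivationScanCount and
scansDuringRecovery are hypothetical names, not YARN code.

```java
// Hypothetical helper that only counts iterations; it is not YARN code.
class ActivationScanCount {
    // Assumes one full pass over the pending list per recovered app,
    // with no app leaving the list (no node resources available yet).
    static long scansDuringRecovery(int apps) {
        long scans = 0;
        int pending = 0;
        for (int i = 0; i < apps; i++) {
            pending++;        // the recovered app joins the pending list
            scans += pending; // the pass visits every pending app
        }
        return scans;         // equals apps * (apps + 1) / 2
    }
}
```

For 10K applications this gives 10000 * 10001 / 2 = 50005000, which matches
the "about 50000000" figure quoted in the description.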



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
