Hi Prabhu Joseph,

Thanks for raising the problem. It will be a problem if there are too many
completed applications or if applications have many attempts. We are trying to
solve this problem. As a first step, we are trying to reduce the number of
attempts in YARN-3480. After that, we might need to recover completed
applications' info in another way, e.g. put completed applications in another
directory and restore them separately.

PS: How much time does your cluster's restore process take?

Thanks,
Jun Gong

From: Prabhu Joseph<mailto:[email protected]>
Date: 2015-12-14 15:08
To: [email protected]<mailto:[email protected]>
Subject: YARN Recovery uses Serial Producer Consumer design(Internet mail)

Hi Experts,


During ResourceManager startup, it first starts the recovery process, which
reads the Application state store, creates an event for each application,
and puts the events into the AsyncDispatcher queue. Only once the recovery
process has read the entire state store is the service that dispatches
events of RMAppManagerEventType from the AsyncDispatcher started.

Assume there are 10000 completed applications under the recovery directory.
When the RM starts, it puts 10000 APP_COMPLETED events into the AsyncDispatcher
queue, and only then is the service that dispatches APP_COMPLETED events started.

In the worst case, if a customer configures more than a lakh (100,000)
applications to recover, there is a possibility of the queue getting full
and the producer getting blocked.

To solve this, the service that dispatches RMAppManagerEventType events has
to be started in parallel, at the point where recovery starts creating
APP_COMPLETED events.
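
As a rough illustration of the two orderings, here is a minimal Java sketch.
It is not YARN code: it uses a plain java.util.concurrent.ArrayBlockingQueue
with an assumed fixed capacity in place of the AsyncDispatcher queue, and the
class and method names (RecoveryQueueSketch, produceSerially,
produceWithConsumer) are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch only; the bounded queue stands in for the dispatcher queue.
final class RecoveryQueueSketch {

    // Serial ordering: all events are produced before any consumer exists.
    // Returns how many events fit before the bounded queue is full.
    static int produceSerially(int events, int capacity) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);
        int accepted = 0;
        for (int i = 0; i < events; i++) {
            // offer() returns false when the queue is full; a blocking put()
            // would wait forever here because nothing is draining the queue.
            if (!queue.offer("APP_COMPLETED-" + i)) {
                break;
            }
            accepted++;
        }
        return accepted;
    }

    // Parallel ordering: the consumer is started before production begins.
    // Returns how many events the consumer handled.
    static int produceWithConsumer(int events, int capacity) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);
        AtomicInteger handled = new AtomicInteger();
        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < events; i++) {
                    queue.take();              // waits for the next event
                    handled.incrementAndGet();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();
        try {
            for (int i = 0; i < events; i++) {
                queue.put("APP_COMPLETED-" + i); // blocks only while the queue is full
            }
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while producing", e);
        }
        return handled.get();
    }

    public static void main(String[] args) {
        System.out.println("serial, accepted before full: " + produceSerially(10000, 1000));
        System.out.println("parallel, handled: " + produceWithConsumer(10000, 1000));
    }
}
```

With 10000 events and an assumed capacity of 1000, the serial variant stops
after 1000 events because nothing drains the queue, while the parallel variant
gets through all 10000.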




Thanks,
Prabhu Joseph
