Hi Prabhu Joseph,

Thanks for raising the problem. It will be a problem if there are too many completed applications or if applications have many attempts. We are trying to solve this problem. As a first step, we are trying to reduce the number of attempts in YARN-3480. Then we might need to recover completed applications' info in another way, e.g. put completed applications' info in another directory and restore it separately.
PS: How much time does your cluster's restore process take?

Thanks,
Jun Gong

From: Prabhu Joseph<mailto:[email protected]>
Date: 2015-12-14 15:08
To: [email protected]<mailto:[email protected]>
Subject: YARN Recovery uses Serial Producer Consumer design(Internet mail)

Hi Experts,

During ResourceManager startup, it first starts the recovery process, which reads the application state store, creates events for each application, and puts them into the AsyncDispatcher queue. Only once the recovery process has read the entire state store is the service that dispatches RMAppManagerEventType events from the AsyncDispatcher started.

Assume there are 10000 completed applications under the recovery directory. When the RM starts, it puts 10000 APP_COMPLETED events into the AsyncDispatcher queue, and only then does the service that dispatches APP_COMPLETED events get started. In the worst case, if a customer configures more than a lakh (100,000) applications for recovery, there is a possibility of the queue getting full and the producer getting blocked.

To solve this, the service that dispatches RMAppManagerEventType events has to be started in parallel when recovery starts creating APP_COMPLETED events.

Thanks,
Prabhu Joseph
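
The difference between the serial design and the proposed parallel one can be illustrated with a minimal Java sketch. It is only an illustration under stated assumptions: it uses a bounded java.util.concurrent queue, and the names RecoverySketch, serialRecovery and parallelRecovery are hypothetical. It does not use or reproduce the real AsyncDispatcher or any other YARN class.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not YARN code: integers stand in for APP_COMPLETED events.
public class RecoverySketch {

    // Serial design: all events are produced before any consumer exists.
    // Returns how many events fit before the bounded queue is full.
    static int serialRecovery(int apps, int capacity) {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(capacity);
        int enqueued = 0;
        for (int appId = 0; appId < apps; appId++) {
            // offer() returns false when the queue is full; put() would
            // block here indefinitely because no consumer is running yet.
            if (!queue.offer(appId)) {
                break;
            }
            enqueued++;
        }
        return enqueued;
    }

    // Parallel design: the consumer is started before recovery begins,
    // so the producer only waits briefly whenever the queue is full.
    // Returns how many events the consumer handled.
    static int parallelRecovery(int apps, int capacity) {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(capacity);
        AtomicInteger handled = new AtomicInteger();
        Thread dispatcher = new Thread(() -> {
            try {
                for (int i = 0; i < apps; i++) {
                    queue.take();              // wait for the next event
                    handled.incrementAndGet(); // "dispatch" it
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        dispatcher.start();
        try {
            for (int appId = 0; appId < apps; appId++) {
                queue.put(appId); // blocks only until the consumer frees a slot
            }
            dispatcher.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("recovery interrupted", e);
        }
        return handled.get();
    }

    public static void main(String[] args) {
        // prints "serial enqueued: 1000"
        System.out.println("serial enqueued: " + serialRecovery(10000, 1000));
        // prints "parallel handled: 10000"
        System.out.println("parallel handled: " + parallelRecovery(10000, 1000));
    }
}
```

With 10000 events and an assumed capacity of 1000, the serial variant stalls after 1000 events, while the parallel variant drains all 10000 because the consumer is already running when production starts.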
