Hi Jun Gong,
Thanks for taking up this issue. There are 10000 completed applications,
each with 1 attempt, in the recovery directory. The recovery process took
10 + 3 seconds to complete:
10 seconds for recovery to put 10000 APP_COMPLETED events into the
AsyncDispatcher queue.
3 seconds to dispatch the 10000 events.
My concern is this: when the thread starts putting events into the
AsyncDispatcher queue, why is the service that dispatches the events not
started in parallel? The recovery process is a serial producer-consumer.
If in future customers keep a lakh (100,000) of applications for recovery,
RM startup will take correspondingly longer.
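
To make the two orderings concrete, here is a minimal sketch using plain
java.util.concurrent rather than the actual YARN classes. The names
RecoveryDispatchSketch, AppCompletedEvent and recover are made up for
illustration, and the "dispatch" step only counts events, so this shows
the ordering only and says nothing about real RM timings.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class RecoveryDispatchSketch {

    // Hypothetical stand-in for an APP_COMPLETED event (not the real YARN class).
    static final class AppCompletedEvent {
        final int appId;
        AppCompletedEvent(int appId) { this.appId = appId; }
    }

    /**
     * Enqueues appCount events and dispatches all of them.
     * startDispatcherFirst == false: the dispatcher thread starts only after
     * every event has been queued (the serial behavior described above).
     * startDispatcherFirst == true: the dispatcher thread drains the queue
     * while events are still being produced (the proposed behavior).
     * Returns the number of events dispatched.
     */
    public static int recover(int appCount, boolean startDispatcherFirst) {
        BlockingQueue<AppCompletedEvent> queue = new LinkedBlockingQueue<>();
        AtomicInteger dispatched = new AtomicInteger();

        // Consumer: takes exactly appCount events off the queue, then exits.
        Thread dispatcher = new Thread(() -> {
            try {
                for (int i = 0; i < appCount; i++) {
                    queue.take();
                    dispatched.incrementAndGet();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        try {
            if (startDispatcherFirst) {
                dispatcher.start();
            }
            // Producer: one event per recovered application.
            for (int id = 0; id < appCount; id++) {
                queue.put(new AppCompletedEvent(id));
            }
            if (!startDispatcherFirst) {
                dispatcher.start();
            }
            dispatcher.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted during recovery", e);
        }
        return dispatched.get();
    }

    public static void main(String[] args) {
        System.out.println("serial dispatched:   " + recover(10000, false));
        System.out.println("parallel dispatched: " + recover(10000, true));
    }
}
```

In both modes every event is dispatched. The difference is that in the
second mode the queue is drained while recovery is still producing, so
the dispatch time can overlap with the read time instead of being added
to it.
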
Thanks,
Prabhu Joseph
On Mon, Dec 14, 2015 at 1:18 PM, jungong(龚军) <[email protected]> wrote:
> Hi Prabhu Joseph,
>
> Thanks for raising the problem. It will be a problem if there are too many
> completed applications or if applications have many attempts. We are trying
> to solve this problem. As a first step, we are trying to reduce the number
> of attempts in YARN-3480. Then we might need to recover completed
> applications' info in another way, e.g. put completed applications in
> another directory and restore them separately.
>
> PS: How much time does your cluster's restore process take?
>
> Thanks,
> Jun Gong
>
> From: Prabhu Joseph<mailto:[email protected]>
> Date: 2015-12-14 15:08
> To: [email protected]<mailto:[email protected]>
> Subject: YARN Recovery uses Serial Producer Consumer design(Internet mail)
>
> Hi Experts,
>
>
> During ResourceManager startup, it first starts the recovery process, which
> reads the application state store, creates an event for each application,
> and puts the events into the AsyncDispatcher queue. Only once the recovery
> process has read the entire state store is the service that dispatches
> RMAppManagerEventType events from the AsyncDispatcher started.
>
> Assume there are 10000 completed applications under the recovery directory.
> When the RM starts, it puts 10000 APP_COMPLETED events into the
> AsyncDispatcher queue, and only then is the service that dispatches
> APP_COMPLETED events started.
>
> In the worst case, if a customer configures more than a lakh of recovery
> applications, there is a possibility of the queue getting full and the
> producer getting blocked.
>
> To solve this, the service that dispatches RMAppManagerEventType events
> should be started in parallel when recovery starts creating APP_COMPLETED
> events.
>
> Thanks,
> Prabhu Joseph
>