RE: AM timeout on RM failure?

Bikas Saha Mon, 12 Aug 2013 19:39:40 -0700

We can fix it once we have an idea on how long RM takes to restart for
some large clusters. I am hoping it will be considerably shorter than 15
mins.


-----Original Message-----
From: Karthik Kambatla [mailto:[email protected]]
Sent: Monday, August 12, 2013 11:38 AM
To: [email protected]
Subject: Re: AM timeout on RM failure?

The RMProxy code, by default, uses 15 minutes for connect.max-wait, but
the AM aborts trying to connect only after 20 mins. Wonder where the
additional 5 minutes comes from? Let me run it again and see.
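
One way to reason about the gap is the arithmetic below. It is only a sketch, assuming the retry policy turns the max wait into a fixed number of attempts (max wait divided by retry interval) and sleeps the full interval between attempts. The class name, method names, and the 10-second per-attempt cost are hypothetical values chosen for illustration, not measurements or a diagnosis of the RMProxy behaviour.

```java
import java.time.Duration;

public class RetryBudget {

    // Number of attempts a count-based policy would allow if it derives
    // the count from maxWait / retryInterval (an assumption of this sketch).
    static long maxAttempts(Duration maxWait, Duration retryInterval) {
        if (retryInterval.isZero() || retryInterval.isNegative()) {
            throw new IllegalArgumentException("retryInterval must be positive");
        }
        return maxWait.toMillis() / retryInterval.toMillis();
    }

    // Wall-clock time until the client gives up when every attempt also
    // blocks for perAttemptCost before it fails (hypothetical overhead).
    static Duration worstCaseElapsed(Duration maxWait,
                                     Duration retryInterval,
                                     Duration perAttemptCost) {
        long attempts = maxAttempts(maxWait, retryInterval);
        return retryInterval.plus(perAttemptCost).multipliedBy(attempts);
    }

    public static void main(String[] args) {
        Duration maxWait = Duration.ofMinutes(15);       // default quoted in this thread
        Duration retryInterval = Duration.ofSeconds(30); // default quoted in this thread
        Duration perAttemptCost = Duration.ofSeconds(10); // made-up value, not measured

        System.out.println("attempts = " + maxAttempts(maxWait, retryInterval));
        System.out.println("elapsed minutes = "
            + worstCaseElapsed(maxWait, retryInterval, perAttemptCost).toMinutes());
    }
}
```

Under these assumptions, any time spent inside each connection attempt is added on top of the configured max wait, so the observed abort time can exceed 15 minutes. Whether that accounts for the extra 5 minutes here would need to be confirmed by another run.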

Also, 15 minutes seems a little excessive, compared to other similar
timeouts being 10 mins. I can fix this as part of YARN-1056 if you agree
we should bring it down.

Thanks
Karthik


On Mon, Aug 12, 2013 at 10:22 AM, Bikas Saha <[email protected]>
wrote:

> You should probably look at the RMProxy code and the configs it uses.
> I am hoping that all clients including the MR AM now use that proxy
> and so older configs are no longer valid.
>
> Bikas
>
> -----Original Message-----
> From: Karthik Kambatla [mailto:[email protected]]
> Sent: Sunday, August 11, 2013 8:45 PM
> To: [email protected]
> Subject: AM timeout on RM failure?
>
> Hi YARN devs,
>
> I am working on the ZKRMStateStore, and had a very basic question: on
> RM failure, how long does the AM wait before crashing, and more
> importantly, what controls it?
>
> Looking into the code, I see the following two parameters:
>
>    1. yarn.app.mapreduce.am.scheduler.connection.wait.interval-ms - set to
>    1 min
>    2. Fix configs
>
> yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
>    - set by default to 15 mins and 30 seconds respectively
>
> The AM crashes only after 20 minutes.
>
> Are there any other configs that influence this?
>
> Thanks
> Karthik
>
