We can fix it once we have an idea of how long the RM takes to restart on some large clusters. I am hoping it will be considerably shorter than 15 mins.

-----Original Message-----
From: Karthik Kambatla [mailto:[email protected]]
Sent: Monday, August 12, 2013 11:38 AM
To: [email protected]
Subject: Re: AM timeout on RM failure?

The RMProxy code, by default, uses 15 minutes for connect.max-wait, but the AM aborts trying to connect only after 20 mins. Wonder where the additional 5 minutes comes from? Let me run it again and see.

Also, 15 minutes seems a little excessive, compared to other similar timeouts being 10 mins. I can fix this as part of YARN-1056 if you agree we should bring it down.

Thanks
Karthik

On Mon, Aug 12, 2013 at 10:22 AM, Bikas Saha <[email protected]> wrote:

> You should probably look at the RMProxy code and the configs it uses.
> I am hoping that all clients including the MR AM now use that proxy
> and so older configs are no longer valid.
>
> Bikas
>
> -----Original Message-----
> From: Karthik Kambatla [mailto:[email protected]]
> Sent: Sunday, August 11, 2013 8:45 PM
> To: [email protected]
> Subject: AM timeout on RM failure?
>
> Hi YARN devs,
>
> I am working on the ZKRMStateStore, and had a very basic question - on
> RM failure, how long does the AM fail before crashing, or more
> importantly what controls it.
>
> Looking into the code, I see the following two parameters:
>
> 1. yarn.app.mapreduce.am.scheduler.connection.wait.interval-ms - set to
> 1 min
> 2. Fix configs
>
> yarn.resourcemanager.resourcemanager.connect.{max.wait.secs|retry_interval.secs}
> - set by default to 15 mins and 30 seconds respectively
>
> The AM crashes only after 20 minutes.
>
> Are there any other configs that influence this?
>
> Thanks
> Karthik
>
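
For reference, the arithmetic behind the numbers quoted in this thread can be written out as a small sketch. This is not the RMProxy implementation: the function names are hypothetical, and it assumes a plain fixed-interval retry in which the number of attempts is bounded by the max wait divided by the retry interval. The values are the ones reported above (15 min max wait, 30 s retry interval, 20 min observed before the AM gives up).

```python
# Hypothetical helpers for illustration only; not taken from the Hadoop code base.

def max_retry_attempts(max_wait_secs: int, retry_interval_secs: int) -> int:
    """Upper bound on attempts, assuming one attempt per fixed retry interval."""
    if retry_interval_secs <= 0:
        raise ValueError("retry interval must be positive")
    return max_wait_secs // retry_interval_secs


def unexplained_wait_secs(observed_secs: int, configured_max_wait_secs: int) -> int:
    """Time the AM kept trying beyond the configured max wait."""
    return observed_secs - configured_max_wait_secs


# Values as reported in the thread.
MAX_WAIT_SECS = 15 * 60        # connect max wait: 15 mins
RETRY_INTERVAL_SECS = 30       # connect retry interval: 30 seconds
OBSERVED_ABORT_SECS = 20 * 60  # AM observed to abort after 20 mins

attempts = max_retry_attempts(MAX_WAIT_SECS, RETRY_INTERVAL_SECS)
gap = unexplained_wait_secs(OBSERVED_ABORT_SECS, MAX_WAIT_SECS)
print(f"attempts within max wait: {attempts}")
print(f"unexplained extra wait: {gap} s")
```

Under that assumption the configured values account for 30 attempts over 15 minutes, which leaves the extra 5 minutes (300 s) to be explained by something outside these two settings.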
