Thanks a bunch for the insight. This message was actually coming from the master, but the master still needs to grab the HLog files from HDFS, so I can still see it being what you mentioned. I'm going to look into tuning these parameters down in preparation for future failures.
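
For the archives, here is roughly what I'm planning to try. This is only a sketch based on Suraj's pointers, not validated settings; the concrete values below are my own starting guesses and still need testing against our GC pauses and cluster size.

In hdfs-site.xml (the property is heartbeat.recheck.interval on our CDH3 version; Suraj also mentioned the dfs.-prefixed name, which some versions use):

  <property>
    <name>heartbeat.recheck.interval</name>
    <!-- As I understand it, the NN marks a DN dead after roughly
         2 * heartbeat.recheck.interval + 10 * dfs.heartbeat.interval.
         The 300000 ms default works out to ~10.5 minutes; 45000 ms
         (a guess, untested) would bring that down to ~2 minutes with
         the default 3 s heartbeat. -->
    <value>45000</value>
  </property>

In hbase-site.xml:

  <property>
    <name>hbase.rpc.timeout</name>
    <!-- down from the 60000 ms default; untested guess -->
    <value>10000</value>
  </property>
  <property>
    <name>zookeeper.session.timeout</name>
    <!-- down from the 180000 ms default on 0.90; has to stay longer
         than our worst GC pause or region servers will get killed -->
    <value>30000</value>
  </property>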
On Mon, Jul 2, 2012 at 7:56 PM, Suraj Varma <[email protected]> wrote:
> This looks like it is trying to reach a datanode ... doesn't it?
>
> 12/06/30 00:07:22 INFO ipc.Client: Retrying connect to server: /
> 10.125.18.129:50020. Already tried 14 time(s).
>
> Is this from a master log or from a region server log? (I'm guessing
> the above is from a region server log while trying to replay hlogs.)
>
> Sometime back, we had a similar symptom (HLog splitting takes a long
> time due to the retries) and found that even though the datanode died,
> it was not being detected by the namenode. This leads to the region
> server retrying over dead datanodes over and over, stretching out the
> splitting process.
>
> See this thread:
> http://www.mail-archive.com/[email protected]/msg10033.html
>
> We found that by default it takes 15 mins for a datanode death to be
> detected by the NN ... and this seems to cause the NN to serve back
> the dead DN as a valid one when the RS tries to read the hlogs.
> The parameters in question are dfs.heartbeat.recheck.interval and
> heartbeat.recheck.interval ... tweaking these down made the recovery
> much faster.
> Also - hbase.rpc.timeout and zookeeper.session.timeout are two other
> configurations that need to be tweaked down from the defaults for
> quick recovery.
>
> Not sure if this is the case in your error - but it might be something
> to investigate ...
> --Suraj
>
>
> On Sat, Jun 30, 2012 at 8:53 AM, Jimmy Xiang <[email protected]> wrote:
> > Bryan,
> >
> > The master could not detect whether the region server was dead.
> > How do you set the zookeeper session timeout?
> >
> > Thanks,
> > Jimmy
> >
> > On Sat, Jun 30, 2012 at 8:09 AM, Stack <[email protected]> wrote:
> >> On Sat, Jun 30, 2012 at 7:04 AM, Bryan Beaudreault
> >> <[email protected]> wrote:
> >>> 12/06/30 00:07:22 INFO ipc.Client: Retrying connect to server: /
> >>> 10.125.18.129:50020. Already tried 14 time(s).
> >>>
> >>
> >> This was one of the servers that went down?
> >>
> >>> It was not following through on the splitting of HLog files and
> >>> didn't appear to be moving regions off failed hosts. After giving
> >>> it about 20 minutes to try to right itself, I tried restarting the
> >>> service. The restart script just hung for a while printing dots,
> >>> and nothing apparent was happening in the logs at the time.
> >>
> >> Can we see the log, Bryan?
> >>
> >> You might take a thread dump when it's hung up the next time, Bryan
> >> (would be something for us to do a looksee on).
> >>
> >>> Finally I kill -9'd the process so that another master could take
> >>> over. The new master seemed to start splitting logs, but eventually
> >>> got into the same state of printing the above message.
> >>>
> >>
> >> You think it's a particular log?
> >>
> >>> Eventually it all worked out, but it took WAY too long (almost an
> >>> hour, all said). Is this something that is tunable?
> >>
> >> Have RS carry fewer WALs? It's a configuration.
> >>
> >>> They should have instantly been removed from the list instead of
> >>> being retried so many times. Each server was retried upwards of
> >>> 30-40 times.
> >>>
> >>
> >> Yeah, that's a bit silly.
> >>
> >> We're working on the MTTR in general. Your logs would be of interest
> >> to a few of us, if it's ok for someone else to take a look.
> >>
> >> St.Ack
> >>
> >>> I am running cdh3u2 (0.90.4).
> >>>
> >>> Thanks,
> >>>
> >>> Bryan
