Thanks a bunch for the insight. This message was actually coming from the master, but the master still needs to grab the HLog files from HDFS, so I can still see it being what you mentioned. I'm going to look into tuning these parameters down in preparation for future failures.
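
For the archives, here is roughly what I'm planning to try. This is only a sketch based on Suraj's pointers, not validated settings; the concrete values below are my own starting guesses and still need testing against our GC pauses and cluster size.

In hdfs-site.xml (the property is heartbeat.recheck.interval on our CDH3 version; Suraj also mentioned the dfs.-prefixed name, which some versions use):

  <property>
    <name>heartbeat.recheck.interval</name>
    <!-- As I understand it, the NN marks a DN dead after roughly
         2 * heartbeat.recheck.interval + 10 * dfs.heartbeat.interval.
         The 300000 ms default works out to ~10.5 minutes; 45000 ms
         (a guess, untested) would bring that down to ~2 minutes with
         the default 3 s heartbeat. -->
    <value>45000</value>
  </property>

In hbase-site.xml:

  <property>
    <name>hbase.rpc.timeout</name>
    <!-- down from the 60000 ms default; untested guess -->
    <value>10000</value>
  </property>
  <property>
    <name>zookeeper.session.timeout</name>
    <!-- down from the 180000 ms default on 0.90; has to stay longer
         than our worst GC pause or region servers will get killed -->
    <value>30000</value>
  </property>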
On Mon, Jul 2, 2012 at 7:56 PM, Suraj Varma <[email protected]> wrote:
> This looks like it is trying to reach a datanode ... doesn't it?
>
> 12/06/30 00:07:22 INFO ipc.Client: Retrying connect to server: /
> 10.125.18.129:50020. Already tried 14 time(s).
>
> Is this from a master log or from a region server log? (I'm guessing
> the above is from a region server log while trying to replay hlogs.)
>
> Sometime back, we had a similar symptom (HLog splitting takes a long
> time due to the retries) and found that even though the datanode died,
> it was not being detected by the namenode. This leads to the region
> server retrying over dead datanodes over and over, stretching out the
> splitting process.
>
> See this thread:
> http://www.mail-archive.com/[email protected]/msg10033.html
>
> We found that by default it takes 15 mins for a datanode death to be
> detected by the NN ... and this seems to cause the NN to serve back
> the dead DN as a valid one when the RS tries to read the hlogs.
> The parameters in question are dfs.heartbeat.recheck.interval and
> heartbeat.recheck.interval ... tweaking these down made the recovery
> much faster.
> Also - hbase.rpc.timeout and zookeeper.session.timeout are two other
> configurations that need to be tweaked down from the defaults for
> quick recovery.
>
> Not sure if this is the case in your error - but it might be something
> to investigate ...
> --Suraj
>
>
> On Sat, Jun 30, 2012 at 8:53 AM, Jimmy Xiang <[email protected]> wrote:
> > Bryan,
> >
> > The master could not detect whether the region server was dead.
> > How do you set the zookeeper session timeout?
> >
> > Thanks,
> > Jimmy
> >
> > On Sat, Jun 30, 2012 at 8:09 AM, Stack <[email protected]> wrote:
> >> On Sat, Jun 30, 2012 at 7:04 AM, Bryan Beaudreault
> >> <[email protected]> wrote:
> >>> 12/06/30 00:07:22 INFO ipc.Client: Retrying connect to server: /
> >>> 10.125.18.129:50020. Already tried 14 time(s).
> >>>
> >>
> >> This was one of the servers that went down?
> >>
> >>> It was not following through on the splitting of HLog files and
> >>> didn't appear to be moving regions off failed hosts. After giving
> >>> it about 20 minutes to try to right itself, I tried restarting the
> >>> service. The restart script just hung for a while printing dots,
> >>> and nothing apparent was happening in the logs at the time.
> >>
> >> Can we see the log, Bryan?
> >>
> >> You might take a thread dump when it's hung up the next time, Bryan
> >> (would be something for us to do a looksee on).
> >>
> >>> Finally I kill -9'd the process so that another master could take
> >>> over. The new master seemed to start splitting logs, but eventually
> >>> got into the same state of printing the above message.
> >>>
> >>
> >> You think it's a particular log?
> >>
> >>> Eventually it all worked out, but it took WAY too long (almost an
> >>> hour, all said). Is this something that is tunable?
> >>
> >> Have RS carry fewer WALs? It's a configuration.
> >>
> >>> They should have instantly been removed from the list instead of
> >>> being retried so many times. Each server was retried upwards of
> >>> 30-40 times.
> >>>
> >>
> >> Yeah, that's a bit silly.
> >>
> >> We're working on the MTTR in general. Your logs would be of interest
> >> to a few of us, if it's ok for someone else to take a look.
> >>
> >> St.Ack
> >>
> >>> I am running cdh3u2 (0.90.4).
> >>>
> >>> Thanks,
> >>>
> >>> Bryan
