I was discussing this exact issue this morning. Ran into a problem where the master was timing out a region in transition because the RS clock was 5 minutes behind the master's.
I like the idea of the RS sending its timestamp on startup; if it is outside a certain threshold, the master throws it a ClockOutOfSync-like exception and the RS goes down.

Please do file a jira, Jeff. Or let me know and I can do it.

JG

> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Jean-Daniel Cryans
> Sent: Thursday, October 28, 2010 10:00 AM
> To: [email protected]
> Subject: Re: Sanity date time check when a region server joins the cluster
>
> That could be done easily when the server checks in by looking at the
> given start code. In ServerManager we already do:
>
>   HServerInfo info = new HServerInfo(serverInfo);
>   checkIsDead(info.getServerName(), "STARTUP");
>   checkAlreadySameHostPort(info);
>   recordNewServer(info, false, null);
>
> A new check in there would fit nicely. Can you open a jira Jeff?
>
> Thx!
>
> J-D
>
> On Thu, Oct 28, 2010 at 9:56 AM, Jeff Whiting <[email protected]> wrote:
> > We recently had a problem where one of our machines in the cluster had a
> > time that was 6 hours behind the other ones (ntp was supposed to be set up
> > on that machine but wasn't). We subsequently restarted our cluster and the
> > '-ROOT-' table was assigned to that machine. The problem was that when it
> > tried to update the value (info:server) for who was holding the '.META.'
> > table, the value wasn't updating and stayed set as the previous machine.
> > I'm pretty sure the problem was that the timestamp for the new server was
> > older than the timestamp for the previous server, preventing the value
> > from updating correctly. Having the incorrect info:server in the ROOT
> > table basically made the cluster unusable.
> >
> > So my question is, would it make sense to have a sanity time check when a
> > region server joins the cluster? Basically, when the region server joins
> > it would send its current time and the master would check that time
> > against its own current time, and if the difference is too large it would
> > prevent the region server from joining. I know this is basic server
> > configuration stuff, but because of human error these things happen, and
> > they seem like they can cause major problems for the cluster if the
> > servers' times aren't synchronized.
> >
> > ~Jeff
> >
> > --
> > Jeff Whiting
> > Qualtrics Senior Software Engineer
> > [email protected]
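
For reference, here is a rough sketch of the check discussed in this thread, assuming the region server reports its current time in milliseconds when it checks in. The class, method, and exception names and the 30-second default threshold are all hypothetical, not existing HBase API. In practice the call would presumably sit next to the ServerManager checks J-D quoted.

```java
import java.io.IOException;

// Hypothetical helper; nothing here is existing HBase API.
final class ClockSkewCheck {

    // Hypothetical exception, named after the "ClockOutOfSync-like" idea in the thread.
    static class ClockOutOfSyncException extends IOException {
        ClockOutOfSyncException(String message) {
            super(message);
        }
    }

    // Assumed default threshold; the thread does not settle on a value.
    static final long DEFAULT_MAX_SKEW_MS = 30_000L;

    // Absolute difference between the two clocks, in milliseconds.
    static long skewMs(long serverTimeMs, long masterTimeMs) {
        return Math.abs(masterTimeMs - serverTimeMs);
    }

    // Rejects a region server whose reported time is too far from the master's time.
    static void checkClockSkew(String serverName, long serverTimeMs,
                               long masterTimeMs, long maxSkewMs)
            throws ClockOutOfSyncException {
        long skew = skewMs(serverTimeMs, masterTimeMs);
        if (skew > maxSkewMs) {
            throw new ClockOutOfSyncException("Server " + serverName
                + " rejected: reported time is off by " + skew
                + "ms, max allowed is " + maxSkewMs + "ms");
        }
    }

    public static void main(String[] args) {
        long masterNow = System.currentTimeMillis();
        // Simulate a region server whose clock is 6 hours behind, as in Jeff's report.
        long sixHoursMs = 6L * 60 * 60 * 1000;
        try {
            checkClockSkew("rs-example", masterNow - sixHoursMs, masterNow,
                DEFAULT_MAX_SKEW_MS);
            System.out.println("Server accepted");
        } catch (ClockOutOfSyncException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The comparison uses the absolute difference so that a server running ahead of the master is rejected as well as one running behind. Making the threshold configurable would let clusters with looser time sync relax it.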
