I continued this discussion on the JIRA: https://issues.apache.org/jira/browse/HBASE-3168
Please comment over there where we're working on implementing this. (And I'm planning to run at about 5 seconds.)

> -----Original Message-----
> From: M. C. Srivas [mailto:[email protected]]
> Sent: Sunday, October 31, 2010 11:35 AM
> To: [email protected]
> Subject: Re: Sanity date time check when a region server joins the cluster
>
> How about 5 mins? 1 min is too narrow -- if the clocks are off, NTP will not
> slam clocks around that fast to get into within-one-minute resolution
> quickly.
>
> On Fri, Oct 29, 2010 at 11:03 AM, Jean-Daniel Cryans <[email protected]> wrote:
>
> > A minute? Although it could be configurable.
> >
> > J-D
> >
> > On Fri, Oct 29, 2010 at 10:58 AM, Jeff Whiting <[email protected]> wrote:
> > > Created HBASE-3168 for this issue. It seems pretty straightforward and I
> > > wouldn't mind tackling this problem. How much of a skew do we want to
> > > allow between the RS and the rest of the cluster?
> > >
> > > ~Jeff
> > >
> > > On 10/28/2010 12:08 PM, Jonathan Gray wrote:
> > >>
> > >> I was discussing this exact issue this morning. Ran into a problem
> > >> where master was timing out a region in transition because the RS was
> > >> 5 minutes behind the master.
> > >>
> > >> I like the idea of the RS sending its timestamp on startup and if it is
> > >> outside a certain threshold, the master throws it a ClockOutOfSync-like
> > >> exception and the RS goes down.
> > >>
> > >> Please do file a jira, Jeff. Or let me know and I can do it.
> > >>
> > >> JG
> > >>
> > >>> -----Original Message-----
> > >>> From: [email protected] [mailto:[email protected]] On Behalf Of Jean-Daniel Cryans
> > >>> Sent: Thursday, October 28, 2010 10:00 AM
> > >>> To: [email protected]
> > >>> Subject: Re: Sanity date time check when a region server joins the cluster
> > >>>
> > >>> That could be done easily when the server checks in by looking at the
> > >>> given start code. In ServerManager we already do:
> > >>>
> > >>>     HServerInfo info = new HServerInfo(serverInfo);
> > >>>     checkIsDead(info.getServerName(), "STARTUP");
> > >>>     checkAlreadySameHostPort(info);
> > >>>     recordNewServer(info, false, null);
> > >>>
> > >>> A new check in there would fit nicely. Can you open a jira Jeff?
> > >>>
> > >>> Thx!
> > >>>
> > >>> J-D
> > >>>
> > >>> On Thu, Oct 28, 2010 at 9:56 AM, Jeff Whiting <[email protected]> wrote:
> > >>>>
> > >>>> We recently had a problem where one of our machines in the cluster had
> > >>>> a time that was 6 hours behind the other ones (ntp was supposed to be
> > >>>> set up on that machine but wasn't). We subsequently restarted our
> > >>>> cluster and the '-ROOT-' table was assigned to that machine. The
> > >>>> problem was that when it tried to update the value (info:server) for
> > >>>> who was holding the '.META.' table the value wasn't updating and
> > >>>> stayed set as the previous machine. I'm pretty sure the problem was
> > >>>> the timestamp for the new server was older than the timestamp for the
> > >>>> previous server, preventing the value from updating correctly. Having
> > >>>> the incorrect info:server in the ROOT table basically made the cluster
> > >>>> unusable.
> > >>>>
> > >>>> So my question is, would it make sense to have a sanity time check
> > >>>> when a region server joins the cluster? Basically when the region
> > >>>> server joins it would send its current time and the master would check
> > >>>> that time against its current time and if the difference is too large
> > >>>> then it would prevent the region server from joining. I know this is
> > >>>> basic server configuration stuff but because of human error these
> > >>>> things happen and seem like they can cause major problems for the
> > >>>> cluster if the servers' times aren't synchronized.
> > >>>>
> > >>>> ~Jeff
> > >>>>
> > >>>> --
> > >>>> Jeff Whiting
> > >>>> Qualtrics Senior Software Engineer
> > >>>> [email protected]
> > >
> > > --
> > > Jeff Whiting
> > > Qualtrics Senior Software Engineer
> > > [email protected]
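
For anyone reading this thread later, the check being discussed could look roughly like the sketch below. It is only an illustration of the idea (region server reports its time, master compares it against its own and rejects the server if the difference exceeds a threshold), not the HBASE-3168 patch. The names ClockSkewCheck, checkClockSkew, ClockOutOfSyncException and maxSkewMillis are all hypothetical, and the threshold is deliberately left as a parameter since the thread has not settled on a value.

```java
import java.io.IOException;

/** Hypothetical sketch of a startup clock-skew check; not actual HBase code. */
final class ClockSkewCheck {

    /** Hypothetical name, after the "ClockOutOfSync-like" exception suggested in the thread. */
    static final class ClockOutOfSyncException extends IOException {
        ClockOutOfSyncException(String message) {
            super(message);
        }
    }

    private ClockSkewCheck() {
    }

    /**
     * Throws if the time reported by the region server differs from the
     * master's time by more than maxSkewMillis, in either direction.
     */
    static void checkClockSkew(String serverName,
                               long masterTimeMillis,
                               long serverTimeMillis,
                               long maxSkewMillis) throws ClockOutOfSyncException {
        long skew = Math.abs(masterTimeMillis - serverTimeMillis);
        if (skew > maxSkewMillis) {
            throw new ClockOutOfSyncException(
                "Server " + serverName + " rejected: reported time is off by "
                    + skew + "ms, maximum allowed is " + maxSkewMillis + "ms");
        }
    }
}
```

On the master side, a call like this would presumably sit next to the existing checks in the ServerManager startup path quoted above, with System.currentTimeMillis() supplying the master's time. A six-hour offset like the one Jeff described would be rejected under any of the thresholds proposed so far.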
