How about 5 mins? 1 min is too narrow -- if the clocks are off, NTP will not slam them around fast enough to bring them back within one minute of each other.
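For what it's worth, here is a minimal sketch of the kind of check being discussed, under some loud assumptions: `ClockSkewCheck`, `ClockOutOfSyncException`, `skewMs`, and `checkClockSkew` are hypothetical names made up for illustration (not the HBASE-3168 patch and not existing HBase API), the RS is assumed to report its wall-clock time in milliseconds at check-in, and the 5 minute default is just the number floated above. The max skew is a parameter so it could be made configurable.

```java
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch of a startup clock-skew check.
 * All names here are illustrative assumptions, not HBase API.
 */
final class ClockSkewCheck {

    /** Assumed default: the 5 minutes suggested in this thread. */
    static final long DEFAULT_MAX_SKEW_MS = TimeUnit.MINUTES.toMillis(5);

    /** Hypothetical stand-in for the "ClockOutOfSync-like" exception. */
    static final class ClockOutOfSyncException extends RuntimeException {
        ClockOutOfSyncException(String message) {
            super(message);
        }
    }

    private ClockSkewCheck() {
    }

    /** Absolute difference between the two clocks, in milliseconds. */
    static long skewMs(long masterTimeMs, long regionServerTimeMs) {
        return Math.abs(masterTimeMs - regionServerTimeMs);
    }

    /**
     * Rejects a region server whose reported time is further than
     * maxSkewMs from the master's time, in either direction.
     */
    static void checkClockSkew(String serverName, long masterTimeMs,
                               long regionServerTimeMs, long maxSkewMs) {
        long skew = skewMs(masterTimeMs, regionServerTimeMs);
        if (skew > maxSkewMs) {
            throw new ClockOutOfSyncException("Server " + serverName
                + " rejected: reported time is " + skew
                + "ms out of sync with master, max allowed is "
                + maxSkewMs + "ms");
        }
    }
}
```

Something like this would presumably sit next to the existing startup checks in ServerManager that J-D quotes below, with the master passing its own `System.currentTimeMillis()` as `masterTimeMs`. Comparing absolute skew means an RS that is ahead of the master is rejected as well as one that is behind.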
On Fri, Oct 29, 2010 at 11:03 AM, Jean-Daniel Cryans <[email protected]> wrote:

> A minute? Although it could be configurable.
>
> J-D
>
> On Fri, Oct 29, 2010 at 10:58 AM, Jeff Whiting <[email protected]> wrote:
> > Created HBASE-3168 for this issue. It seems pretty straightforward and I
> > wouldn't mind tackling this problem. How much of a skew do we want to
> > allow between the RS and the rest of the cluster?
> >
> > ~Jeff
> >
> > On 10/28/2010 12:08 PM, Jonathan Gray wrote:
> >>
> >> I was discussing this exact issue this morning. Ran into a problem
> >> where master was timing out a region in transition because the RS was
> >> 5 minutes behind the master.
> >>
> >> I like the idea of the RS sending its timestamp on startup and if it is
> >> outside a certain threshold, the master throws it a ClockOutOfSync-like
> >> exception and the RS goes down.
> >>
> >> Please do file a jira, Jeff. Or let me know and I can do it.
> >>
> >> JG
> >>
> >>> -----Original Message-----
> >>> From: [email protected] [mailto:[email protected]] On Behalf Of Jean-Daniel Cryans
> >>> Sent: Thursday, October 28, 2010 10:00 AM
> >>> To: [email protected]
> >>> Subject: Re: Sanity date time check when a region server joins the cluster
> >>>
> >>> That could be done easily when the server checks in by looking at the
> >>> given start code. In ServerManager we already do:
> >>>
> >>> HServerInfo info = new HServerInfo(serverInfo);
> >>> checkIsDead(info.getServerName(), "STARTUP");
> >>> checkAlreadySameHostPort(info);
> >>> recordNewServer(info, false, null);
> >>>
> >>> A new check in there would fit nicely. Can you open a jira Jeff?
> >>>
> >>> Thx!
> >>>
> >>> J-D
> >>>
> >>> On Thu, Oct 28, 2010 at 9:56 AM, Jeff Whiting <[email protected]> wrote:
> >>>>
> >>>> We recently had a problem where one of our machines in the cluster
> >>>> had a time that was 6 hours behind the other ones (NTP was supposed
> >>>> to be set up on that machine but wasn't). We subsequently restarted
> >>>> our cluster and the '-ROOT-' table was assigned to that machine. The
> >>>> problem was that when it tried to update the value (info:server) for
> >>>> who was holding the '.META.' table the value wasn't updating and
> >>>> stayed set as the previous machine. I'm pretty sure the problem was
> >>>> the timestamp for the new server was older than the timestamp for the
> >>>> previous server preventing the value from updating correctly. Having
> >>>> the incorrect info:server in the ROOT table basically made the
> >>>> cluster unusable.
> >>>>
> >>>> So my question is, would it make sense to have a sanity time check
> >>>> when a region server joins the cluster? Basically when the region
> >>>> server joins it would send its current time and the master would
> >>>> check that time against its current time and if the difference is too
> >>>> large then it would prevent the region server from joining. I know
> >>>> this is basic server configuration stuff but because of human error
> >>>> these things happen and seem like they can cause major problems for
> >>>> the cluster if the servers' times aren't synchronized.
> >>>>
> >>>> ~Jeff
> >>>>
> >>>> --
> >>>> Jeff Whiting
> >>>> Qualtrics Senior Software Engineer
> >>>> [email protected]
> >
> > --
> > Jeff Whiting
> > Qualtrics Senior Software Engineer
> > [email protected]
