A minute? Although it could be configurable.

J-D
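
For illustration, here is a minimal sketch of what a master-side skew check along these lines might look like. It is not HBase's actual code: ClockSkewCheck, checkClockSkew, ClockOutOfSyncException and DEFAULT_MAX_SKEW_MS are hypothetical names, and the one-minute default is only the value floated in this thread. It assumes the region server reports its current time in milliseconds when it checks in.

```java
import java.io.IOException;

/** Hypothetical sketch of a startup clock-skew check; not HBase's ServerManager. */
final class ClockSkewCheck {

    /** One minute, as suggested in this thread; intended to be configurable. */
    static final long DEFAULT_MAX_SKEW_MS = 60_000L;

    /** Hypothetical exception, named after the "ClockOutOfSync-like" idea. */
    static final class ClockOutOfSyncException extends IOException {
        ClockOutOfSyncException(String message) {
            super(message);
        }
    }

    /**
     * Compares the time reported by a joining server with the master's time.
     *
     * @return the absolute skew in milliseconds when it is within the limit
     * @throws ClockOutOfSyncException when the skew exceeds maxSkewMs
     */
    static long checkClockSkew(String serverName, long serverTimeMs,
                               long masterTimeMs, long maxSkewMs)
            throws ClockOutOfSyncException {
        // The direction of the skew does not matter, only its size.
        long skewMs = Math.abs(masterTimeMs - serverTimeMs);
        if (skewMs > maxSkewMs) {
            throw new ClockOutOfSyncException(
                "Server " + serverName + " rejected: clock skew of " + skewMs
                + " ms exceeds the allowed " + maxSkewMs + " ms");
        }
        return skewMs;
    }
}
```

In this sketch the master would call checkClockSkew with System.currentTimeMillis() as masterTimeMs before recording the new server, and a region server receiving the exception would shut itself down. With the one-minute limit, the server that was 6 hours behind would have been rejected at startup instead of being assigned '-ROOT-'.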

On Fri, Oct 29, 2010 at 10:58 AM, Jeff Whiting <[email protected]> wrote:
> Created HBASE-3168 for this issue. It seems pretty straightforward and I
> wouldn't mind tackling this problem. How much of a skew do we want to allow
> between the RS and the rest of the cluster?
>
> ~Jeff
>
> On 10/28/2010 12:08 PM, Jonathan Gray wrote:
>>
>> I was discussing this exact issue this morning. Ran into a problem where
>> master was timing out a region in transition because the RS was 5 minutes
>> behind the master.
>>
>> I like the idea of the RS sending its timestamp on startup and if it is
>> outside a certain threshold, the master throws it a ClockOutOfSync-like
>> exception and the RS goes down.
>>
>> Please do file a jira, Jeff. Or let me know and I can do it.
>>
>> JG
>>
>>> -----Original Message-----
>>> From: [email protected] [mailto:[email protected]] On Behalf Of
>>> Jean-Daniel Cryans
>>> Sent: Thursday, October 28, 2010 10:00 AM
>>> To: [email protected]
>>> Subject: Re: Sanity date time check when a region server joins the
>>> cluster
>>>
>>> That could be done easily when the server checks in by looking at the
>>> given start code. In ServerManager we already do:
>>>
>>>     HServerInfo info = new HServerInfo(serverInfo);
>>>     checkIsDead(info.getServerName(), "STARTUP");
>>>     checkAlreadySameHostPort(info);
>>>     recordNewServer(info, false, null);
>>>
>>> A new check in there would fit nicely. Can you open a jira Jeff?
>>>
>>> Thx!
>>>
>>> J-D
>>>
>>> On Thu, Oct 28, 2010 at 9:56 AM, Jeff Whiting <[email protected]>
>>> wrote:
>>>>
>>>> We recently had a problem where one of our machines in the cluster
>>>> had a time that was 6 hours behind the other ones (ntp was supposed
>>>> to be set up on that machine but wasn't). We subsequently restarted
>>>> our cluster and the '-ROOT-' table was assigned to that machine. The
>>>> problem was that when it tried to update the value (info:server) for
>>>> who was holding the '.META.' table the value wasn't updating and
>>>> stayed set as the previous machine. I'm pretty sure the problem was
>>>> the timestamp for the new server was older than the timestamp for
>>>> the previous server, preventing the value from updating correctly.
>>>> Having the incorrect info:server in the ROOT table basically made
>>>> the cluster unusable.
>>>>
>>>> So my question is, would it make sense to have a sanity time check
>>>> when a region server joins the cluster? Basically when the region
>>>> server joins it would send its current time and the master would
>>>> check that time against its current time and if the difference is
>>>> too large then it would prevent the region server from joining. I
>>>> know this is basic server configuration stuff but because of human
>>>> error these things happen and seem like they can cause major
>>>> problems for the cluster if the servers' times aren't synchronized.
>>>>
>>>> ~Jeff
>>>>
>>>> --
>>>> Jeff Whiting
>>>> Qualtrics Senior Software Engineer
>>>> [email protected]
>>>>
>
> --
> Jeff Whiting
> Qualtrics Senior Software Engineer
> [email protected]
>
