How about 5 mins? 1 min is too narrow -- if the clocks are off, NTP will not
slam clocks around fast enough to bring them back within a minute of each
other quickly.
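
For what it's worth, the check being discussed could look roughly like the
sketch below. Every name in it (ClockSkewCheck, checkClockSkew,
DEFAULT_MAX_SKEW_MS, and this ClockOutOfSyncException) is made up for
illustration and is not taken from the HBase source. It assumes the region
server reports its current time in milliseconds at check-in, and that the
threshold would be configurable with a 5-minute default.

```java
// Hypothetical sketch; none of these names are taken from the HBase source.
class ClockSkewCheck {

    /** Assumed default: the 5-minute tolerance suggested in this thread. */
    static final long DEFAULT_MAX_SKEW_MS = 5 * 60 * 1000L;

    /** Stand-in for the "ClockOutOfSync-like" exception mentioned below. */
    static class ClockOutOfSyncException extends RuntimeException {
        ClockOutOfSyncException(String message) {
            super(message);
        }
    }

    /**
     * Rejects a region server whose reported clock differs from the
     * master's clock by more than maxSkewMs, in either direction.
     */
    static void checkClockSkew(String serverName, long serverTimeMs,
                               long masterTimeMs, long maxSkewMs) {
        long skewMs = Math.abs(masterTimeMs - serverTimeMs);
        if (skewMs > maxSkewMs) {
            throw new ClockOutOfSyncException("Server " + serverName
                + " rejected: reported time is " + skewMs
                + "ms out of sync with master, max allowed is "
                + maxSkewMs + "ms");
        }
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // One minute off: inside the 5-minute window, so this is accepted.
        checkClockSkew("rs1", now - 60_000L, now, DEFAULT_MAX_SKEW_MS);
        try {
            // Six hours behind, like the machine in Jeff's report.
            checkClockSkew("rs2", now - 6 * 60 * 60 * 1000L, now,
                           DEFAULT_MAX_SKEW_MS);
        } catch (ClockOutOfSyncException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The comparison uses the absolute difference so that a server running ahead
of the master is rejected as well as one running behind. The exception is
unchecked only to keep the sketch short.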


On Fri, Oct 29, 2010 at 11:03 AM, Jean-Daniel Cryans <[email protected]> wrote:

> A minute? Although it could be configurable.
>
> J-D
>
> On Fri, Oct 29, 2010 at 10:58 AM, Jeff Whiting <[email protected]>
> wrote:
> > Created HBASE-3168 for this issue.  It seems pretty straightforward and I
> > wouldn't mind tackling this problem.  How much of a skew do we want to allow
> > between the RS and the rest of the cluster?
> >
> > ~Jeff
> >
> > On 10/28/2010 12:08 PM, Jonathan Gray wrote:
> >>
> >> I was discussing this exact issue this morning.  Ran into a problem where
> >> master was timing out a region in transition because the RS was 5 minutes
> >> behind the master.
> >>
> >> I like the idea of the RS sending its timestamp on startup and if it is
> >> outside a certain threshold, the master throws it a ClockOutOfSync-like
> >> exception and the RS goes down.
> >>
> >> Please do file a jira, Jeff.  Or let me know and I can do it.
> >>
> >> JG
> >>
> >>> -----Original Message-----
> >>> From: [email protected] [mailto:[email protected]] On Behalf Of Jean-Daniel Cryans
> >>> Sent: Thursday, October 28, 2010 10:00 AM
> >>> To: [email protected]
> >>> Subject: Re: Sanity date time check when a region server joins the
> >>> cluster
> >>>
> >>> That could be done easily when the server checks in by looking at the
> >>> given start code. In ServerManager we already do:
> >>>
> >>>     HServerInfo info = new HServerInfo(serverInfo);
> >>>     checkIsDead(info.getServerName(), "STARTUP");
> >>>     checkAlreadySameHostPort(info);
> >>>     recordNewServer(info, false, null);
> >>>
> >>> A new check in there would fit nicely. Can you open a jira Jeff?
> >>>
> >>> Thx!
> >>>
> >>> J-D
> >>>
> >>> On Thu, Oct 28, 2010 at 9:56 AM, Jeff Whiting <[email protected]>
> >>> wrote:
> >>>>
> >>>> We recently had a problem where one of our machines in the cluster had a
> >>>> time that was 6 hours behind the other ones (ntp was supposed to be set up on
> >>>> that machine but wasn't).  We subsequently restarted our cluster and the
> >>>> '-ROOT-' table was assigned to that machine.  The problem was that when it
> >>>> tried to update the value (info:server) for who was holding the '.META.'
> >>>> table the value wasn't updating and stayed set as the previous machine. I'm
> >>>> pretty sure the problem was the timestamp for the new server was older than
> >>>> the timestamp for the previous server preventing the value from updating
> >>>> correctly.  Having the incorrect info:server in the ROOT table basically
> >>>> made the cluster unusable.
> >>>>
> >>>> So my question is, would it make sense to have a sanity time check when a
> >>>> region server joins the cluster?  Basically when the region server joins it
> >>>> would send its current time and the master would check that time against its
> >>>> current time and if the difference is too large then it would prevent the region
> >>>> server from joining.  I know this is basic server configuration stuff but
> >>>> because of human error these things happen and seem like they can cause
> >>>> major problems for the cluster if the servers' times aren't synchronized.
> >>>>
> >>>> ~Jeff
> >>>>
> >>>> --
> >>>>
> >>>> Jeff Whiting
> >>>> Qualtrics Senior Software Engineer
> >>>> [email protected]
> >>>>
> >>>>
> >
> > --
> > Jeff Whiting
> > Qualtrics Senior Software Engineer
> > [email protected]
> >
> >
>
