I continued this discussion on the JIRA:

https://issues.apache.org/jira/browse/HBASE-3168

Please comment over there where we're working on implementing this.

(And I'm planning to run with an allowed skew of about 5 seconds.)
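
For anyone skimming the thread, the check being discussed can be sketched roughly like this. It is only an illustration, not the HBASE-3168 patch: ClockSkewChecker, ClockOutOfSyncException and maxSkewMillis are made-up names (the thread only asks for a "ClockOutOfSync-like" exception), and the region server's reported time is passed in as a plain long.

```java
// Hypothetical sketch: none of these names come from the HBase code base.
class ClockOutOfSyncException extends Exception {
    ClockOutOfSyncException(String message) {
        super(message);
    }
}

class ClockSkewChecker {
    // Largest tolerated difference between the two clocks, in milliseconds.
    private final long maxSkewMillis;

    ClockSkewChecker(long maxSkewMillis) {
        if (maxSkewMillis < 0) {
            throw new IllegalArgumentException("maxSkewMillis must be >= 0");
        }
        this.maxSkewMillis = maxSkewMillis;
    }

    // Compares the time reported by a starting region server with the
    // master's own time and rejects the server if they differ too much.
    // The skew is checked in both directions (server ahead or behind).
    void check(String serverName, long reportedMillis, long masterMillis)
            throws ClockOutOfSyncException {
        long skew = Math.abs(masterMillis - reportedMillis);
        if (skew > maxSkewMillis) {
            throw new ClockOutOfSyncException(
                "Server " + serverName + " rejected: clock skew of " + skew
                    + " ms exceeds the allowed " + maxSkewMillis + " ms");
        }
    }
}
```

With a 5000 ms limit, a region server 2 seconds off would be accepted, and one 6 hours behind, as in Jeff's report, would be rejected at startup.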

> -----Original Message-----
> From: M. C. Srivas [mailto:[email protected]]
> Sent: Sunday, October 31, 2010 11:35 AM
> To: [email protected]
> Subject: Re: Sanity date time check when a region server joins the cluster
> 
> How about 5 mins? 1 min is too narrow -- if the clocks are off, NTP will not
> slam clocks around that fast to get into within-one-minute resolution quickly.
> 
> 
> On Fri, Oct 29, 2010 at 11:03 AM, Jean-Daniel Cryans <[email protected]> wrote:
> 
> > A minute? Although it could be configurable.
> >
> > J-D
> >
> > On Fri, Oct 29, 2010 at 10:58 AM, Jeff Whiting <[email protected]> wrote:
> > > Created HBASE-3168 for this issue.  It seems pretty straightforward and I
> > > wouldn't mind tackling this problem.  How much of a skew do we want to
> > > allow between the RS and the rest of the cluster?
> > >
> > > ~Jeff
> > >
> > > On 10/28/2010 12:08 PM, Jonathan Gray wrote:
> > >>
> > >> I was discussing this exact issue this morning.  Ran into a problem where
> > >> master was timing out a region in transition because the RS was 5 minutes
> > >> behind the master.
> > >>
> > >> I like the idea of the RS sending its timestamp on startup and if it is
> > >> outside a certain threshold, the master throws it a ClockOutOfSync-like
> > >> exception and the RS goes down.
> > >>
> > >> Please do file a jira, Jeff.  Or let me know and I can do it.
> > >>
> > >> JG
> > >>
> > >>> -----Original Message-----
> > >>> From: [email protected] [mailto:[email protected]] On Behalf Of Jean-Daniel Cryans
> > >>> Sent: Thursday, October 28, 2010 10:00 AM
> > >>> To: [email protected]
> > >>> Subject: Re: Sanity date time check when a region server joins the cluster
> > >>>
> > >>> That could be done easily when the server checks in by looking at the
> > >>> given start code. In ServerManager we already do:
> > >>>
> > >>>     HServerInfo info = new HServerInfo(serverInfo);
> > >>>     checkIsDead(info.getServerName(), "STARTUP");
> > >>>     checkAlreadySameHostPort(info);
> > >>>     recordNewServer(info, false, null);
> > >>>
> > >>> A new check in there would fit nicely. Can you open a jira Jeff?
> > >>>
> > >>> Thx!
> > >>>
> > >>> J-D
> > >>>
> > >>> On Thu, Oct 28, 2010 at 9:56 AM, Jeff Whiting <[email protected]> wrote:
> > >>>>
> > >>>> We recently had a problem where one of our machines in the cluster had a
> > >>>> time that was 6 hours behind the other ones (ntp was supposed to be set up
> > >>>> on that machine but wasn't).  We subsequently restarted our cluster and the
> > >>>> '-ROOT-' table was assigned to that machine.  The problem was that when it
> > >>>> tried to update the value (info:server) for who was holding the '.META.'
> > >>>> table the value wasn't updating and stayed set as the previous machine. I'm
> > >>>> pretty sure the problem was the timestamp for the new server was older than
> > >>>> the timestamp for the previous server preventing the value from updating
> > >>>> correctly.  Having the incorrect info:server in the ROOT table basically
> > >>>> made the cluster unusable.
> > >>>>
> > >>>> So my question is, would it make sense to have a sanity time check when a
> > >>>> region server joins the cluster?  Basically when the region server joins it
> > >>>> would send its current time and the master would check that time against
> > >>>> its current time and if the difference is too large then it would prevent
> > >>>> the region server from joining.  I know this is basic server configuration
> > >>>> stuff but because of human error these things happen and seem like they can
> > >>>> cause major problems for the cluster if the servers' times aren't
> > >>>> synchronized.
> > >>>>
> > >>>> ~Jeff
> > >>>>
> > >>>> --
> > >>>>
> > >>>> Jeff Whiting
> > >>>> Qualtrics Senior Software Engineer
> > >>>> [email protected]
> > >>>>
> > >>>>
> > >
> > > --
> > > Jeff Whiting
> > > Qualtrics Senior Software Engineer
> > > [email protected]
> > >
> > >
> >
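
Jeff's diagnosis at the bottom of the thread (a write stamped with an older time losing to an existing value with a newer stamp) can be illustrated with a toy model. This is not HBase code: VersionedCell is a hypothetical class, and it assumes that a read returns the value with the highest timestamp.

```java
import java.util.TreeMap;

// Toy model only: a single cell that keeps every write keyed by its timestamp.
class VersionedCell {
    private final TreeMap<Long, String> versions = new TreeMap<>();

    // Stores the value under the timestamp supplied by the writer's clock.
    void put(long timestampMillis, String value) {
        versions.put(timestampMillis, value);
    }

    // Returns the value with the highest timestamp, or null if empty.
    // Arrival order does not matter, only the timestamps do.
    String get() {
        return versions.isEmpty() ? null : versions.lastEntry().getValue();
    }
}
```

If the previous server wrote info:server at time 2000 and the new server, with its clock behind, writes at time 1000, get() still returns the previous server's value, which matches what Jeff saw.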
