Created HBASE-3168 for this issue. It seems pretty straightforward and I wouldn't mind tackling
this problem. How much of a skew do we want to allow between the RS and the rest of the cluster?
~Jeff
On 10/28/2010 12:08 PM, Jonathan Gray wrote:
I was discussing this exact issue this morning. Ran into a problem where
master was timing out a region in transition because the RS was 5 minutes
behind the master.
I like the idea of the RS sending its timestamp on startup and if it is
outside a certain threshold, the master throws it a ClockOutOfSync-like
exception and the RS goes down.
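The check described above could look roughly like the following minimal Java sketch. All names here are hypothetical and for illustration only: `ClockSkewCheck` is not an existing HBase class, the exception name just follows the "ClockOutOfSync-like" wording, and the threshold is assumed to be a configurable number of milliseconds.

```java
import java.io.IOException;

/** Hypothetical exception, named after the "ClockOutOfSync-like" idea in this thread. */
class ClockOutOfSyncException extends IOException {
    ClockOutOfSyncException(String message) {
        super(message);
    }
}

/** Hypothetical helper for illustration; not part of HBase. */
class ClockSkewCheck {
    // Assumed to come from configuration; the thread leaves the actual value open.
    private final long maxSkewMs;

    ClockSkewCheck(long maxSkewMs) {
        this.maxSkewMs = maxSkewMs;
    }

    /**
     * Compares the time a region server reports at startup with the master's
     * own time and rejects the server if the absolute difference is too large.
     */
    void check(String serverName, long serverTimeMs, long masterTimeMs)
            throws ClockOutOfSyncException {
        long skewMs = Math.abs(masterTimeMs - serverTimeMs);
        if (skewMs > maxSkewMs) {
            throw new ClockOutOfSyncException("Server " + serverName
                + " rejected: clock skew " + skewMs + "ms exceeds max " + maxSkewMs + "ms");
        }
    }
}
```

On the master side this would be called with `System.currentTimeMillis()` as the master time at the point where the server checks in; a region server receiving the exception would then shut itself down.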
Please do file a jira, Jeff. Or let me know and I can do it.
JG
-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Jean-Daniel Cryans
Sent: Thursday, October 28, 2010 10:00 AM
To: [email protected]
Subject: Re: Sanity date time check when a region server joins the cluster
That could be done easily when the server checks in by looking at the
given start code. In ServerManager we already do:
HServerInfo info = new HServerInfo(serverInfo);
checkIsDead(info.getServerName(), "STARTUP");
checkAlreadySameHostPort(info);
recordNewServer(info, false, null);
A new check in there would fit nicely. Can you open a jira Jeff?
Thx!
J-D
On Thu, Oct 28, 2010 at 9:56 AM, Jeff Whiting <[email protected]> wrote:
We recently had a problem where one of our machines in the cluster had a
time that was 6 hours behind the other ones (ntp was supposed to be set up
on that machine but wasn't). We subsequently restarted our cluster and the
'-ROOT-' table was assigned to that machine. The problem was that when it
tried to update the value (info:server) for who was holding the '.META.'
table, the value wasn't updating and stayed set as the previous machine.
I'm pretty sure the problem was that the timestamp for the new server was
older than the timestamp for the previous server, preventing the value
from updating correctly. Having the incorrect info:server in the ROOT
table basically made the cluster unusable.
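The suspected failure mode can be illustrated with a toy Java model of a single versioned cell, assuming that a read returns the version with the highest timestamp. `VersionedCell` is a hypothetical class for illustration only, not HBase code, and the timestamps are made up.

```java
import java.util.TreeMap;

/** Toy model of one cell whose versions are keyed by write timestamp. */
class VersionedCell {
    // Sorted by timestamp, so the last entry is the newest version.
    private final TreeMap<Long, String> versions = new TreeMap<>();

    /** Stores a value under the timestamp supplied by the writer's clock. */
    void put(long timestampMs, String value) {
        versions.put(timestampMs, value);
    }

    /** Returns the value with the highest timestamp, or null if the cell is empty. */
    String getLatest() {
        return versions.isEmpty() ? null : versions.lastEntry().getValue();
    }
}
```

If the previous server wrote info:server at time T and the new server, running 6 hours behind, writes at T minus 6 hours, `getLatest()` in this model still returns the previous server's value, which matches the behavior described above.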
So my question is, would it make sense to have a sanity time check when a
region server joins the cluster? Basically, when the region server joins
it would send its current time, and the master would check that time
against its own current time; if the difference is too large, it would
prevent the region server from joining. I know this is basic server
configuration stuff, but because of human error these things happen, and
they seem like they can cause major problems for the cluster if the
servers' times aren't synchronized.
~Jeff
--
Jeff Whiting
Qualtrics Senior Software Engineer
[email protected]