A minute? Although it could be configurable.
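
To make the idea concrete, here is a rough sketch of what such a check could look like. It is only an illustration, not the HBASE-3168 patch: the class name, method name, exception name, and the one-minute default are all assumptions, and the real check would presumably sit next to the existing calls in ServerManager and read its threshold from the configuration.

```java
import java.io.IOException;

/** Hypothetical helper; none of these names come from the HBase code base. */
final class ClockSkewCheck {

    /** Hypothetical "ClockOutOfSync-like" exception, as suggested in the thread. */
    static class ClockOutOfSyncException extends IOException {
        ClockOutOfSyncException(String message) {
            super(message);
        }
    }

    /** Assumed default of one minute; meant to be overridable by the caller. */
    static final long DEFAULT_MAX_SKEW_MS = 60_000L;

    /**
     * Rejects a region server whose reported clock differs from the master's
     * clock by more than maxSkewMs, in either direction.
     */
    static void checkClockSkew(String serverName, long serverTimeMs,
                               long masterTimeMs, long maxSkewMs)
            throws ClockOutOfSyncException {
        long skewMs = Math.abs(masterTimeMs - serverTimeMs);
        if (skewMs > maxSkewMs) {
            throw new ClockOutOfSyncException("Server " + serverName
                + " rejected: clock skew of " + skewMs
                + " ms exceeds the allowed " + maxSkewMs + " ms");
        }
    }
}
```

With a check like this, a server that is a few seconds off would pass, while the machine from Jeff's report (6 hours behind) would get the exception at startup instead of being assigned regions.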

J-D

On Fri, Oct 29, 2010 at 10:58 AM, Jeff Whiting <[email protected]> wrote:
> Created HBASE-3168 for this issue.  It seems pretty straightforward and I
> wouldn't mind tackling this problem.  How much skew do we want to allow
> between the RS and the rest of the cluster?
>
> ~Jeff
>
> On 10/28/2010 12:08 PM, Jonathan Gray wrote:
>>
>> I was discussing this exact issue this morning.  Ran into a problem where
>> master was timing out a region in transition because the RS was 5 minutes
>> behind the master.
>>
>> I like the idea of the RS sending its timestamp on startup and if it is
>> outside a certain threshold, the master throws it a ClockOutOfSync-like
>> exception and the RS goes down.
>>
>> Please do file a jira, Jeff.  Or let me know and I can do it.
>>
>> JG
>>
>>> -----Original Message-----
>>> From: [email protected] [mailto:[email protected]] On Behalf Of Jean-Daniel Cryans
>>> Sent: Thursday, October 28, 2010 10:00 AM
>>> To: [email protected]
>>> Subject: Re: Sanity date time check when a region server joins the cluster
>>>
>>> That could be done easily when the server checks in by looking at the
>>> given start code. In ServerManager we already do:
>>>
>>>     HServerInfo info = new HServerInfo(serverInfo);
>>>     checkIsDead(info.getServerName(), "STARTUP");
>>>     checkAlreadySameHostPort(info);
>>>     recordNewServer(info, false, null);
>>>
>>> A new check in there would fit nicely. Can you open a jira Jeff?
>>>
>>> Thx!
>>>
>>> J-D
>>>
>>> On Thu, Oct 28, 2010 at 9:56 AM, Jeff Whiting <[email protected]> wrote:
>>>>
>>>> We recently had a problem where one of the machines in our cluster had a
>>>> clock that was 6 hours behind the others (NTP was supposed to be set up on
>>>> that machine but wasn't).  We subsequently restarted our cluster and the
>>>> '-ROOT-' table was assigned to that machine.  The problem was that when it
>>>> tried to update the value (info:server) for who was holding the '.META.'
>>>> table, the value wasn't updating and stayed set as the previous machine.
>>>> I'm pretty sure the problem was that the timestamp for the new server was
>>>> older than the timestamp for the previous server, preventing the value
>>>> from updating correctly.  Having the incorrect info:server in the ROOT
>>>> table basically made the cluster unusable.
>>>>
>>>> So my question is, would it make sense to have a sanity time check when a
>>>> region server joins the cluster?  Basically, when the region server joins
>>>> it would send its current time and the master would check that time
>>>> against its own current time, and if the difference is too large it would
>>>> prevent the region server from joining.  I know this is basic server
>>>> configuration stuff, but because of human error these things happen and
>>>> seem like they can cause major problems for the cluster if the servers'
>>>> times aren't synchronized.
>>>>
>>>> ~Jeff
>>>>
>>>> --
>>>>
>>>> Jeff Whiting
>>>> Qualtrics Senior Software Engineer
>>>> [email protected]
>>>>
>>>>
>
> --
> Jeff Whiting
> Qualtrics Senior Software Engineer
> [email protected]
>
>
