That could be done easily when the server checks in by looking at the
given start code. In ServerManager we already do:

    HServerInfo info = new HServerInfo(serverInfo);
    checkIsDead(info.getServerName(), "STARTUP");
    checkAlreadySameHostPort(info);
    recordNewServer(info, false, null);

A new check in there would fit nicely. Can you open a JIRA, Jeff?
Thx!
J-D
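
For illustration only, here is a minimal sketch of what such a check could
look like. Everything in it is an assumption rather than existing HBase code:
the ClockSkewCheck class, the checkClockSkew method, the
ClockOutOfSyncException type and the 30 second limit in the usage note are
hypothetical. The time reported by the server could be the start code
mentioned above.

```java
import java.io.IOException;

/** Hypothetical sketch of a startup clock check; not actual HBase code. */
class ClockSkewCheck {

    /** Hypothetical exception type used to reject a server at check-in. */
    static class ClockOutOfSyncException extends IOException {
        ClockOutOfSyncException(String message) {
            super(message);
        }
    }

    // Assumed limit in milliseconds; a real check would likely make it configurable.
    private final long maxSkewMillis;

    ClockSkewCheck(long maxSkewMillis) {
        this.maxSkewMillis = maxSkewMillis;
    }

    /**
     * Compares the time reported by the region server (for example its start
     * code) with the master's current time and rejects the server when the
     * absolute difference exceeds the allowed skew.
     */
    void checkClockSkew(String serverName, long serverTimeMillis, long masterTimeMillis)
            throws ClockOutOfSyncException {
        long skew = Math.abs(masterTimeMillis - serverTimeMillis);
        if (skew > maxSkewMillis) {
            throw new ClockOutOfSyncException("Server " + serverName
                + " rejected: clock differs from master by " + skew
                + "ms, allowed " + maxSkewMillis + "ms");
        }
    }
}
```

With something like new ClockSkewCheck(30000L), a call such as
checkClockSkew(info.getServerName(), info.getStartCode(), System.currentTimeMillis())
could sit next to checkIsDead(...) so that a badly skewed server never reaches
recordNewServer(...). That placement is only a suggestion.
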
On Thu, Oct 28, 2010 at 9:56 AM, Jeff Whiting <[email protected]> wrote:
> We recently had a problem where one of our machines in the cluster had a
> time that was 6 hours behind the other ones (NTP was supposed to be set up on
> that machine but wasn't). We subsequently restarted our cluster and the
> '-ROOT-' table was assigned to that machine. The problem was that when it
> tried to update the value (info:server) for who was holding the '.META.'
> table, the value wasn't updating and stayed set to the previous machine. I'm
> pretty sure the problem was that the timestamp for the new server was older
> than the timestamp for the previous server, preventing the value from
> updating correctly. Having the incorrect info:server in the ROOT table
> basically made the cluster unusable.
>
> So my question is, would it make sense to have a sanity time check when a
> region server joins the cluster? Basically, when the region server joins it
> would send its current time and the master would check that time against its
> current time, and if the difference is too large it would prevent the region
> server from joining. I know this is basic server configuration stuff, but
> because of human error these things happen, and they seem like they can cause
> major problems for the cluster if the servers' times aren't synchronized.
>
> ~Jeff
>
> --
>
> Jeff Whiting
> Qualtrics Senior Software Engineer
> [email protected]
>
>
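
The stale info:server value described in the quoted message can be reproduced
with a toy model of a versioned cell in which a read returns the value with
the highest timestamp. This is a simplified illustration, not HBase's storage
code. The VersionedCell class and the server names in the usage note are made
up for the example.

```java
import java.util.TreeMap;

/** Simplified model of one versioned cell; not HBase's actual storage code. */
class VersionedCell {
    // Versions keyed by timestamp in milliseconds, sorted ascending.
    private final TreeMap<Long, String> versions = new TreeMap<>();

    /** Stores a value under the timestamp supplied by the writer's clock. */
    void put(long timestampMillis, String value) {
        versions.put(timestampMillis, value);
    }

    /** Returns the value with the highest timestamp, or null if the cell is empty. */
    String get() {
        return versions.isEmpty() ? null : versions.lastEntry().getValue();
    }
}
```

If "old-server" is written at time T and "new-server" is then written by a
machine whose clock is 6 hours behind (timestamp T minus 6 hours), get() still
returns "old-server", because the later write carries the older timestamp.
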