We recently had a problem where one of the machines in our cluster had a clock that was 6 hours
behind the others (ntp was supposed to be set up on that machine but wasn't). We subsequently
restarted our cluster and the '-ROOT-' table was assigned to that machine. The problem was that
when it tried to update the value (info:server) for who was holding the '.META.' table, the value
didn't update and stayed set to the previous machine. I'm pretty sure the cause was that the
timestamp written by the new server was older than the timestamp from the previous server, which
prevented the new value from taking effect. Having the incorrect info:server in the ROOT table
basically made the cluster unusable.
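
To make the suspected failure mode concrete, here is a toy model of a single versioned cell where
a read returns the value with the highest timestamp. This is only a sketch of the behavior
described above, not HBase code: VersionedCell and its methods are hypothetical names, and the
server names in the usage note are made up.

```java
import java.util.TreeMap;

/** Toy model (hypothetical, not the HBase API): one cell whose reads return the newest timestamp. */
final class VersionedCell {
    // Versions keyed by the timestamp supplied by the writer's own clock.
    private final TreeMap<Long, String> versions = new TreeMap<>();

    /** Stores a value under the given writer-supplied timestamp. */
    void put(long timestampMillis, String value) {
        versions.put(timestampMillis, value);
    }

    /** Returns the value with the largest timestamp, or null if nothing was written. */
    String get() {
        return versions.isEmpty() ? null : versions.lastEntry().getValue();
    }
}
```

If "old-server" writes at time T and "new-server" then writes at T minus 6 hours, get() still
returns "old-server", because the later write carries the smaller timestamp.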
So my question is: would it make sense to have a sanity time check when a region server joins the
cluster? Basically, when the region server joins it would send its current time, and the master
would check that time against its own current time; if the difference is too large, it would
prevent the region server from joining. I know this is basic server configuration stuff, but
because of human error these things happen, and they seem like they can cause major problems for
the cluster if the servers' times aren't synchronized.
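
The check proposed above could look roughly like the sketch below. Everything here is an
assumption for illustration: ClockSkewCheck, isAcceptable and checkOrReject are hypothetical names
rather than existing HBase classes, the threshold is whatever the caller picks, and the sketch
ignores the network delay between the region server reading its clock and the master comparing it.

```java
import java.time.Duration;

/** Hypothetical master-side check run when a region server reports in. */
final class ClockSkewCheck {
    private final Duration maxSkew; // assumed configurable threshold

    ClockSkewCheck(Duration maxSkew) {
        this.maxSkew = maxSkew;
    }

    /** True if the two clock readings differ by no more than the allowed skew. */
    boolean isAcceptable(long masterTimeMillis, long regionServerTimeMillis) {
        long skewMillis = Math.abs(masterTimeMillis - regionServerTimeMillis);
        return skewMillis <= maxSkew.toMillis();
    }

    /** Throws if the region server's clock is too far off, so the join can be refused. */
    void checkOrReject(long masterTimeMillis, long regionServerTimeMillis) {
        if (!isAcceptable(masterTimeMillis, regionServerTimeMillis)) {
            long skewMillis = Math.abs(masterTimeMillis - regionServerTimeMillis);
            throw new IllegalStateException(
                "Region server clock differs from master by " + skewMillis
                    + " ms (max allowed " + maxSkew.toMillis() + " ms)");
        }
    }
}
```

With a 30 second threshold, for example, a server that is 5 seconds off would be let in, while
the machine that was 6 hours behind would have been turned away at startup.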
~Jeff
--
Jeff Whiting
Qualtrics Senior Software Engineer
[email protected]