We recently had a problem where one of the machines in our cluster had a clock that was 6 hours behind the others (NTP was supposed to be set up on that machine but wasn't). We subsequently restarted our cluster and the '-ROOT-' table was assigned to that machine. When it tried to update the value (info:server) recording which server was holding the '.META.' table, the value didn't update and stayed set to the previous machine. I'm pretty sure the cause was that the timestamp written by the new server was older than the timestamp written by the previous server, so the new value never became the current one. Having the incorrect info:server in the ROOT table basically made the cluster unusable.
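
To illustrate what I think happened, here is a toy model in Python. It is not HBase code: the VersionedCell class, the host names, and the times are made up for the example. The only assumption is that a read returns the version with the highest timestamp, no matter which write arrived last.

```python
class VersionedCell:
    """Toy model of a cell that keeps several timestamped versions."""

    def __init__(self):
        self.versions = {}  # timestamp in ms -> value

    def put(self, timestamp_ms, value):
        self.versions[timestamp_ms] = value

    def get(self):
        # A read returns the value with the highest timestamp,
        # regardless of the order in which the writes arrived.
        if not self.versions:
            return None
        return self.versions[max(self.versions)]


SIX_HOURS_MS = 6 * 60 * 60 * 1000

info_server = VersionedCell()
now_ms = 1_000_000_000_000  # arbitrary time on a correctly synced machine

# The previous holder writes using a correct clock.
info_server.put(now_ms, "old-host")

# One minute later in real time, the skewed machine writes,
# but it stamps the write with its own clock, which is 6 hours behind.
info_server.put(now_ms + 60_000 - SIX_HOURS_MS, "new-host")

print(info_server.get())  # prints "old-host"
```

The second write is stored, but it sits behind the older write's newer timestamp, so readers keep seeing the old server.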

So my question is, would it make sense to have a sanity check on the time when a region server joins the cluster? Basically, when the region server joins it would send its current time, the master would compare that against its own current time, and if the difference is too large the master would prevent the region server from joining. I know this is basic server configuration stuff, but because of human error these things happen, and they seem to cause major problems for the cluster if the servers' clocks aren't synchronized.
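
Roughly, the check I have in mind could look like the sketch below. This is Python for brevity and not a proposal for actual HBase code: the function name, the exception, and the 30 second threshold are all hypothetical, and a real check would probably want the limit to be configurable.

```python
import time

MAX_SKEW_MS = 30_000  # hypothetical threshold, would likely be configurable


class ClockOutOfSyncError(Exception):
    """Raised when a joining server's clock differs too much from the master's."""


def check_clock_skew(reported_ms, master_ms=None, max_skew_ms=MAX_SKEW_MS):
    """Compare the time reported by a joining server with the master's time.

    Returns the absolute skew in milliseconds if it is within the limit,
    otherwise raises ClockOutOfSyncError so the join can be refused.
    """
    if master_ms is None:
        master_ms = int(time.time() * 1000)
    skew_ms = abs(master_ms - reported_ms)
    if skew_ms > max_skew_ms:
        raise ClockOutOfSyncError(
            f"clock skew of {skew_ms} ms exceeds limit of {max_skew_ms} ms"
        )
    return skew_ms
```

With this, a server reporting a time 6 hours behind the master would be rejected at startup instead of being handed regions. The comparison ignores network delay between the two machines, which should be small compared with any threshold worth setting.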

~Jeff

--

Jeff Whiting
Qualtrics Senior Software Engineer
[email protected]
