A few days ago at work our Kudu servers started having fatal errors and shutting down with the following error message:
Couldn't get the current time: Clock unsynchronized. Status: Service unavailable: Error: Clock synchronized but error was too high (10000016 us).

After some research in the community forums, I found a post by Todd that pointed to this JIRA issue: https://issues.apache.org/jira/browse/KUDU-2079

I then checked our ntpd configuration and, sure enough, we had the '-x' option in the daemon command. I removed that option, restarted ntpd, and a few minutes later restarted all the Kudu processes (one master and three tablet servers). A few minutes after that, a couple of those Kudu processes were down again, this time with a new time-sync-related error message:

Tried to update clock beyond the max. error.

To try to address this new error, I brought down all the Kudu processes, stopped ntpd, resynced the time on all the servers with ntpdate, brought ntpd back up, waited a bit, and restarted Kudu (master and tablet servers). Within a few minutes a couple of them were down again with the same 'Tried to update clock beyond the max. error.' message.

I eventually ended up doubling the parameter 'max_clock_sync_error_usec' to 20,000,000 (20 seconds), and everything stayed up (and is still up).

Looking at the source code in git, I found the relevant section here (source file https://github.com/apache/kudu/blob/master/src/kudu/clock/hybrid_clock.cc):

    // we won't update our clock if to_update is more than 'max_clock_sync_error_usec'
    // into the future as it might have been corrupted or originated from an out-of-sync
    // server.
    if ((to_update_physical - now_physical) > FLAGS_max_clock_sync_error_usec) {
      return Status::InvalidArgument("Tried to update clock beyond the max. error.");
    }

If I understand this code correctly, it is complaining because for some reason Kudu is trying to update its clock by more than 10 seconds. However, I ran ntptime and several ntpq queries, and I don't see the time between the servers being off by that much (or even by, say, half a second, since they are all synchronized with a stratum 3 NTP server).

Has anyone in this group seen anything similar, or does anyone have a better understanding of what this message means and what could be causing it?

Thanks,
Franco
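
P.S. To make my reading of that check concrete, here is a minimal standalone sketch of the comparison as I understand it. It is not Kudu code: the function name CheckClockUpdate and the constant kMaxClockSyncErrorUsec are made up for illustration, the 10,000,000 value is simply the setting we had before I doubled it, and I use signed integers so that a remote time in the past gives a negative difference (the quoted snippet does not show how the real code handles that case).

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for FLAGS_max_clock_sync_error_usec; 10 seconds is
// the value we were running with before doubling it to 20,000,000.
constexpr int64_t kMaxClockSyncErrorUsec = 10000000;

// Hypothetical helper mirroring the quoted comparison. Returns an empty
// string when the update would be accepted, otherwise the error text.
// Both time arguments are physical clock readings in microseconds.
std::string CheckClockUpdate(int64_t now_physical_usec,
                             int64_t to_update_physical_usec,
                             int64_t max_error_usec) {
  // Reject the update only if the incoming time is further into the future
  // than the allowed maximum error.
  if (to_update_physical_usec - now_physical_usec > max_error_usec) {
    return "Tried to update clock beyond the max. error.";
  }
  return "";
}
```

Under this reading, an incoming time 0.5 seconds ahead of the local clock passes, while one more than 10 seconds ahead is rejected, which is why the error surprises me given what ntpq reports.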
