A few days ago at work our Kudu servers started having fatal errors and 
shutting down with the following error message: 




Couldn't get the current time: Clock unsynchronized. Status: Service 
unavailable: Error: Clock synchronized but error was too high (10000016 us). 




After some research in the community forums, I found a post by Todd that 
pointed to this JIRA issue: https://issues.apache.org/jira/browse/KUDU-2079 

I then checked our ntpd configuration and, sure enough, we had the '-x' option 
in the daemon command. I removed that option, restarted ntpd, and a few minutes 
later restarted all the Kudu processes (one master and three tablet servers). 
A few minutes after that, a couple of those Kudu processes were down again, 
this time with a new time-sync-related error message: 




Tried to update clock beyond the max. error. 




To try to address this new error, I brought down all the Kudu processes, 
stopped ntpd, resynced the time on all the servers with ntpdate, brought ntpd 
back up, waited a bit, and restarted Kudu (master and tablet servers). Within a 
few minutes, a couple of them were down again with the same 'Tried to update 
clock beyond the max. error.' message. 




I eventually ended up doubling the 'max_clock_sync_error_usec' parameter to 
20,000,000 microseconds (20 seconds), and everything stayed up (and is still up). 
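
For reference, here is a minimal sketch of the arithmetic behind that change. It assumes the previous limit was 10,000,000 us (half of the new value) and that the first error message compares the reported clock error against the same parameter. The constant and function names are made up for illustration and are not Kudu identifiers.

```cpp
#include <cstdint>

// All names below are hypothetical; they only illustrate the arithmetic.
constexpr int64_t kUsecPerSec = 1000000;

// Value printed in the first error message: "(10000016 us)".
constexpr int64_t kReportedErrorUsec = 10000016;

// Assumed previous limit (half of the doubled value) and the new limit.
constexpr int64_t kOldLimitUsec = 10 * kUsecPerSec;
constexpr int64_t kNewLimitUsec = 20 * kUsecPerSec;

// True when a reported clock error is strictly above the configured limit.
constexpr bool ExceedsLimit(int64_t error_usec, int64_t limit_usec) {
  return error_usec > limit_usec;
}

// How far above the limit the reported error is, in microseconds
// (negative when the error is within the limit).
constexpr int64_t OvershootUsec(int64_t error_usec, int64_t limit_usec) {
  return error_usec - limit_usec;
}
```

Under those assumptions, the reported error was only 16 us above the old limit, and it is well within the new one.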




Looking at the source code in git, I found the relevant section here (source 
file 
https://github.com/apache/kudu/blob/master/src/kudu/clock/hybrid_clock.cc): 




// we won't update our clock if to_update is more than 'max_clock_sync_error_usec' 
// into the future as it might have been corrupted or originated from an out-of-sync 
// server. 
if ((to_update_physical - now_physical) > FLAGS_max_clock_sync_error_usec) { 
  return Status::InvalidArgument("Tried to update clock beyond the max. error."); 
} 




If I understand this code correctly, it is complaining because for some reason 
Kudu is trying to update its clock by more than 10 seconds. However, I ran 
ntptime and several ntpq queries, and I don't see the time between the servers 
being off by that much (or even by, say, half a second, since they are all 
synchronized with a stratum 3 NTP server). 
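
To make my reading of the check concrete, here is a small standalone sketch of the same comparison. It is not the Kudu code: the function name and the default limit are assumptions for illustration, and both times are plain microsecond counts.

```cpp
#include <cstdint>

// Hypothetical stand-in for FLAGS_max_clock_sync_error_usec. The 10 s value
// is an assumption, taken as half of the doubled setting of 20,000,000.
constexpr int64_t kAssumedMaxClockSyncErrorUsec = 10000000;

// Mirrors the quoted condition: returns true when the timestamp being applied
// (to_update) is further ahead of the local physical time (now) than the
// allowed maximum, i.e. when the update would be rejected.
constexpr bool UpdateWouldBeRejected(
    int64_t to_update_physical_usec,
    int64_t now_physical_usec,
    int64_t max_error_usec = kAssumedMaxClockSyncErrorUsec) {
  return (to_update_physical_usec - now_physical_usec) > max_error_usec;
}
```

Two things stand out in this form. The check is one-sided, so a timestamp behind the local clock never trips it. And what gets compared is the timestamp being applied against the local clock, which, going by the comment in the source, may come from another server, so it is not necessarily the same quantity as the offset that ntpq reports.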




Has anyone in this group seen anything similar or does anyone have a better 
understanding of what this message means and what could be causing it? 




Thanks, 
Franco 
