Hi Alexey,

Thank you for your kind answer!

Best,
Jason

2017-06-17 4:53 GMT+09:00 Alexey Serbin <[email protected]>:

> Hi Jason,
>
> I think the workaround you mentioned (i.e. replacing LOG(FATAL) with
> LOG(WARNING) in the cited code snippet) is not safe at all. If
> ntp_gettime() returns the TIME_ERROR code, the 'now_usec' variable
> might be left uninitialized, and the code relying on the
> HybridClock::NowWithError() method would get garbage instead of the
> wall clock value in microseconds. That might lead to serious issues
> elsewhere up the chain, and it's hard to predict what would happen. If
> you are lucky, a tserver will just crash later on; if not, you'll get
> undefined behavior and data corruption which would be very hard to
> track and fix.
>
> Instead of running your tservers with that unsafe change, I would
> recommend tracking down the issue with NTP in your cluster. Make sure
> nothing besides ntpd adjusts the clock on your machines (e.g., make
> sure nobody runs ntpdate manually and ntpdate is not executed by a
> cron job, etc.). If your local network experiences internet outages
> for long periods of time, one suggestion might be running an NTP
> server on a stable machine (or two) within your local network. Your
> local NTP servers would source time from 5-7 public NTP servers of
> stratum 2 or 3 on the internet. In turn, the NTP daemons on your Kudu
> nodes would use your internal NTP server(s) as a source. Also, it
> would make sense to take a look at some 'NTP best practice' guides you
> can find elsewhere on the Internet; hopefully, you will find some
> ideas on how to tailor those to your case.
>
> Hope this helps.
>
>
> Kind regards,
>
> Alexey
>
>
> On 6/16/17 1:59 AM, Jason Heo wrote:
>
>> Hi.
>>
>> Congrats on Apache Kudu 1.4.0.
>>
>> To prevent tserver from dying accidentally, I've changed LOG(FATAL)
>> <https://github.com/apache/kudu/blob/1.4.0/src/kudu/server/hybrid_clock.cc#L227>
>> to LOG(WARNING).
>>
>> I wanted to know whether it is safe to continue if ntp_gettime() in
>> GetClockTime
>> <https://github.com/apache/kudu/blob/1.4.0/src/kudu/server/hybrid_clock.cc#L90>
>> returns TIME_ERROR.
>>
>> Could anyone help me?
>>
>> Regards,
>>
>> Jason
>>
>>
>> 2017-06-15 12:40 GMT+09:00 Jason Heo <[email protected]
>> <mailto:[email protected]>>:
>>
>>     Hi,
>>
>>     I'm using Apache Kudu 1.4.0.
>>
>>     Yesterday, 6 tservers died at the same time. The following
>>     message was logged for each tserver.
>>
>>     F0614 14:58:32.868551 111454 hybrid_clock.cc:227]
>>     Couldn't get the current time: Clock unsynchronized.
>>     Status: Service unavailable:
>>     Error reading clock. Clock considered unsynchronized
>>
>>     We are already using ntpd, and in /var/log/messages an
>>     NTP-related message was logged.
>>
>>     Jun 14 14:58:38 hostname ntpdate[10231]: step time server ip_addr
>>     offset -0.000168 sec
>>
>>     We use our own NTP service. I don't know the exact reason, but I
>>     suspect that our NTP service malfunctioned or the network was
>>     temporarily bad.
>>
>>     The problem is that this could happen again and again.
>>
>>     So, I'm considering modifying the Kudu source code from
>>     LOG(FATAL) to LOG(WARNING) so that tserver does not exit when the
>>     clock is unsynchronized.
>>
>>     uint64_t now_usec;
>>     uint64_t error_usec;
>>     Status s = WalltimeWithError(&now_usec, &error_usec);
>>     if (PREDICT_FALSE(!s.ok())) {
>>       LOG(FATAL) << Substitute("Couldn't get the current time: Clock unsynchronized. "
>>                                "Status: $0", s.ToString());
>>     }
>>
>>     So, my question is: is it OK to change LOG(FATAL) to LOG(WARNING)
>>     in the above code? And would this prevent the tserver from dying
>>     when the clock is unsynchronized?
>>
>>     Thanks.
>>
>>     Regards,
>>
>>     Jason
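
To illustrate the point in Alexey's reply (a failed clock read can leave now_usec unset, so only logging a warning would let garbage flow onward), here is a minimal C++ sketch. It is not Kudu code and not a suggested patch: ReadWallclock, ClockReading and TryNow are hypothetical names invented for this example, and the timestamp values are made up. It only shows one way a caller could be kept from ever seeing an unset value; the advice in the thread remains to fix NTP.

```cpp
#include <cassert>
#include <cstdint>
#include <iostream>

// Hypothetical stand-in for the clock read in the quoted snippet (not Kudu's
// WalltimeWithError). On failure it returns false and leaves the outputs
// unwritten, which is the situation described in the reply.
bool ReadWallclock(bool synchronized, uint64_t* now_usec, uint64_t* error_usec) {
  assert(now_usec != nullptr && error_usec != nullptr);
  if (!synchronized) {
    return false;  // outputs intentionally left untouched
  }
  *now_usec = 1497419912868551ULL;  // made-up sample timestamp
  *error_usec = 500;                // made-up sample error bound
  return true;
}

// A reading that carries its own validity flag, so a failed read cannot be
// mistaken for a real timestamp.
struct ClockReading {
  bool valid;
  uint64_t now_usec;
  uint64_t error_usec;
};

// Warns instead of aborting, but never hands back an unset value: the locals
// are zero-initialized and the result is marked invalid on failure.
ClockReading TryNow(bool synchronized) {
  uint64_t now_usec = 0;
  uint64_t error_usec = 0;
  if (!ReadWallclock(synchronized, &now_usec, &error_usec)) {
    std::cerr << "WARNING: clock unsynchronized; no timestamp produced\n";
    return ClockReading{false, 0, 0};
  }
  return ClockReading{true, now_usec, error_usec};
}
```

A caller would have to check `valid` before using `now_usec`. In the quoted snippet there is no such check after the log statement, which is why simply downgrading LOG(FATAL) to LOG(WARNING) lets execution continue with whatever happens to be in the variable.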
