Hi Jason,
I think the workaround you mentioned (i.e. replacing LOG(FATAL) with
LOG(WARNING) in the cited code snippet) is not safe at all. If
ntp_gettime() returns TIME_ERROR, the 'now_usec' variable might be
left uninitialized, and the code relying on the
HybridClock::NowWithError() method would get garbage instead of the
wall clock value in microseconds. That might lead to serious issues
further up the call chain, and it's hard to predict what would happen.
If you are lucky, the tserver will simply crash later on; if not, you'll
get undefined behavior and data corruption, which would be very hard to
track down and fix.
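To illustrate the point, here is a minimal sketch of the failure mode. The names FakeWalltimeWithError and NowUsec are hypothetical stand-ins I made up for this example, not Kudu's actual code. The sketch assumes a clock read that reports failure through its return value and leaves its out-parameters unwritten, and shows a wrapper that never hands a timestamp to the caller unless the read succeeded.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical stand-in for a clock read that can fail (not Kudu's API).
// On failure it returns false and does NOT write to the out-parameters,
// which mirrors the situation described above for 'now_usec'.
bool FakeWalltimeWithError(bool clock_synced,
                           uint64_t* now_usec,
                           uint64_t* error_usec) {
  assert(now_usec != nullptr && error_usec != nullptr);
  if (!clock_synced) {
    return false;  // out-parameters are left untouched
  }
  *now_usec = 1497423512868551ULL;  // arbitrary example value
  *error_usec = 500;                // arbitrary example value
  return true;
}

// A caller can only obtain a timestamp if the read succeeded. Downgrading
// the failure to a warning and continuing would instead pass along
// whatever happened to be in 'now_usec'.
std::optional<uint64_t> NowUsec(bool clock_synced) {
  uint64_t now_usec = 0;
  uint64_t error_usec = 0;
  if (!FakeWalltimeWithError(clock_synced, &now_usec, &error_usec)) {
    return std::nullopt;
  }
  return now_usec;
}
```

With this shape, NowUsec(false) yields an empty optional, so the caller has to handle the failure explicitly rather than silently using a bogus timestamp.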
Instead of running your tservers with that unsafe change, I would
recommend tracking down the NTP issue in your cluster. Make sure
nothing besides ntpd adjusts the clock on your machines (e.g., make
sure nobody runs ntpdate manually and ntpdate is not executed by a
cron job). If your local network loses internet connectivity for long
periods of time, one option is to run an NTP server on a stable
machine (or two) within your local network. Your local NTP servers
would source time from 5-7 public NTP servers of stratum 2 or 3 on the
internet. In turn, ntpd on your Kudu nodes would use your internal NTP
server(s) as its source. Also, it would make sense to take a look at
some 'NTP best practice' guides available on the Internet; hopefully
you can find some ideas there on how to tailor them to your case.
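While tracking this down, it may help to see what the kernel itself reports about the clock on each node. Below is a small sketch, assuming Linux with glibc, that calls ntp_gettime() from <sys/timex.h> and prints the clock state. ClockStateName and ReportClockState are names I made up for this example.

```cpp
#include <sys/timex.h>  // Linux/glibc: ntp_gettime(), struct ntptimeval, TIME_*

#include <cstdio>
#include <string>

// Maps the value returned by ntp_gettime() to a readable label.
std::string ClockStateName(int state) {
  switch (state) {
    case TIME_OK:
      return "synchronized";
    case TIME_ERROR:
      return "unsynchronized";
    case -1:
      return "ntp_gettime failed";
    default:
      return "leap-second state";  // TIME_INS, TIME_DEL, TIME_OOP, TIME_WAIT
  }
}

// Queries the kernel clock state and prints it together with the
// maximum and estimated error, both in microseconds.
int ReportClockState() {
  struct ntptimeval ntv;
  int state = ntp_gettime(&ntv);
  std::printf("clock state: %s (code %d)\n",
              ClockStateName(state).c_str(), state);
  if (state != -1) {
    std::printf("maxerror: %ld usec, esterror: %ld usec\n",
                ntv.maxerror, ntv.esterror);
  }
  return state;
}
```

Calling ReportClockState() from a small main() on a node that prints "unsynchronized" would point at the same TIME_ERROR condition discussed above.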
Hope this helps.
Kind regards,
Alexey
On 6/16/17 1:59 AM, Jason Heo wrote:
Hi.
Congratulations on the Apache Kudu 1.4.0 release.
To prevent tserver from dying accidentally, I've changed LOG(FATAL)
<https://github.com/apache/kudu/blob/1.4.0/src/kudu/server/hybrid_clock.cc#L227>
to LOG(WARNING)
I wanted to know whether it is safe to continue if ntp_gettime() in
GetClockTime
<https://github.com/apache/kudu/blob/1.4.0/src/kudu/server/hybrid_clock.cc#L90>
returns TIME_ERROR.
Could anyone help me?
Regards,
Jason
2017-06-15 12:40 GMT+09:00 Jason Heo <jason.heo....@gmail.com>:
Hi,
I'm using Apache Kudu 1.4.0.
Yesterday, 6 tservers died at the same time. The following message
was logged for each tserver.
F0614 14:58:32.868551 111454 hybrid_clock.cc:227]
Couldn't get the current time: Clock unsynchronized.
Status: Service unavailable:
Error reading clock. Clock considered unsynchronized
We are already using ntpd, and the following ntpd-related message
was logged in /var/log/messages.
Jun 14 14:58:38 hostname ntpdate[10231]: step time server ip_addr
offset -0.000168 sec
We use our own NTP service. I don't know the exact reason, but I
suspect that our NTP service malfunctioned or the network was
temporarily bad.
The problem is that this could happen again and again.
So, I'm considering modifying the Kudu source code from LOG(FATAL)
to LOG(WARNING) so that the tserver does not exit when the clock is
unsynchronized.
uint64_t now_usec;
uint64_t error_usec;
Status s = WalltimeWithError(&now_usec, &error_usec);
if (PREDICT_FALSE(!s.ok())) {
  LOG(FATAL) << Substitute("Couldn't get the current time: Clock unsynchronized. "
                           "Status: $0", s.ToString());
}
So, my question is: is it OK to change LOG(FATAL) to LOG(WARNING)
in the code above? And would this prevent the tserver from dying
when the clock is unsynchronized?
Thanks.
Regards,
Jason