Hi Jason,

I think the workaround you mentioned (i.e. replacing LOG(FATAL) with LOG(WARNING) in the cited code snippet) is not safe at all. If ntp_gettime() returns the TIME_ERROR code, the 'now_usec' variable might be left uninitialized, and the code relying on the HybridClock::NowWithError() method would get garbage instead of the wall clock value in microseconds. That might lead to serious issues elsewhere up the chain, and it's hard to predict what would happen. If you are lucky, a tserver will simply crash a bit later; if not, you'll get undefined behavior and data corruption, which would be very hard to track down and fix.
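
To illustrate the concern, here is a minimal sketch, not the actual Kudu code. FakeStatus, FakeWalltimeWithError() and NowWithErrorChecked() are hypothetical names made up for this example, and the sketch assumes that the failing clock read never writes to its output parameters. The point is that the only safe reaction to a failed read is to stop using the outputs, rather than to log a warning and carry on:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in for kudu::Status: an ok flag plus a message.
struct FakeStatus {
  bool ok;
  std::string message;
};

// Hypothetical stand-in for the clock read. Assumption: on failure it returns
// early and never writes to its output parameters.
FakeStatus FakeWalltimeWithError(bool clock_synced, uint64_t* now_usec,
                                 uint64_t* error_usec) {
  if (!clock_synced) {
    return {false, "Error reading clock. Clock considered unsynchronized"};
  }
  *now_usec = 1497423512868551ULL;  // arbitrary example timestamp
  *error_usec = 500;                // arbitrary example error bound
  return {true, ""};
}

// Returns false and leaves the outputs untouched when the clock read fails,
// so a caller cannot accidentally consume an uninitialized timestamp.
bool NowWithErrorChecked(bool clock_synced, uint64_t* now_usec,
                         uint64_t* error_usec) {
  assert(now_usec != nullptr && error_usec != nullptr);
  uint64_t now = 0;
  uint64_t error = 0;
  FakeStatus s = FakeWalltimeWithError(clock_synced, &now, &error);
  if (!s.ok) {
    return false;  // do not continue with a bogus timestamp
  }
  *now_usec = now;
  *error_usec = error;
  return true;
}
```

With LOG(WARNING) in place of LOG(FATAL), the original code has no such early return, so execution would fall through and use whatever happens to be in 'now_usec'.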

Instead of running your tservers with that unsafe change, I would recommend tracking down the issue with NTP in your cluster. Make sure nothing besides ntpd adjusts the clock on your machines (e.g., make sure nobody runs ntpdate manually and that ntpdate is not executed by a cron job, etc.). If your local network experiences long internet outages, one suggestion would be to run an NTP server on a stable machine (or two) within your local network. Your local NTP servers would source time from 5-7 public NTP servers of stratum 2 or 3 on the internet. In turn, the NTP daemons on your Kudu nodes would use your internal NTP server(s) as their source. Also, it would make sense to take a look at some 'NTP best practice' guides you can find on the Internet; hopefully, they will give you some ideas you can tailor to your case.
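
While tracking this down, it may help to query the kernel's view of the clock directly on a node. The following is a rough sketch for Linux with glibc, built around the same ntp_gettime() call mentioned above. QueryClock() and DescribeClockState() are hypothetical helper names invented for this example, not part of Kudu:

```cpp
#include <sys/timex.h>  // ntp_gettime(), struct ntptimeval, TIME_ERROR

#include <cassert>
#include <string>

// Maps the return code of ntp_gettime() to a short label.
std::string DescribeClockState(int rc) {
  if (rc < 0) {
    return "ntp_gettime() failed";
  }
  if (rc == TIME_ERROR) {
    return "unsynchronized";
  }
  // TIME_OK and the leap-second states all mean the clock is synchronized.
  return "synchronized";
}

// Calls ntp_gettime() once, stores the kernel's maximum error estimate
// (in microseconds) in *max_error_usec and returns the raw return code.
int QueryClock(long* max_error_usec) {
  assert(max_error_usec != nullptr);
  struct ntptimeval tv = {};
  int rc = ntp_gettime(&tv);
  *max_error_usec = tv.maxerror;
  return rc;
}
```

Calling QueryClock() periodically and logging DescribeClockState() of its result, together with the maximum error, could show whether the nodes lose synchronization around the time ntpdate steps the clock.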

Hope this helps.


Kind regards,

Alexey


On 6/16/17 1:59 AM, Jason Heo wrote:
Hi.

Congrats on Apache Kudu 1.4.0.

To prevent tserver from dying accidentally, I've changed LOG(FATAL) <https://github.com/apache/kudu/blob/1.4.0/src/kudu/server/hybrid_clock.cc#L227> to LOG(WARNING)

I wanted to know whether it is safe to continue if ntp_gettime() in GetClockTime <https://github.com/apache/kudu/blob/1.4.0/src/kudu/server/hybrid_clock.cc#L90> returns TIME_ERROR.

Could anyone help me?

Regards,

Jason



2017-06-15 12:40 GMT+09:00 Jason Heo <jason.heo....@gmail.com <mailto:jason.heo....@gmail.com>>:

    Hi,

    I'm using Apache Kudu 1.4.0

    Yesterday, 6 tservers died at the same time. The following message
    was logged for each tserver.


    F0614 14:58:32.868551 111454 hybrid_clock.cc:227] Couldn't get the
    current time: Clock unsynchronized. Status: Service unavailable:
    Error reading clock. Clock considered unsynchronized


    We are already using ntpd, and the following NTP-related message
    was logged in /var/log/messages.

    Jun 14 14:58:38 hostname ntpdate[10231]: step time server ip_addr
    offset -0.000168 sec


    We use our own NTP service. I don't know the exact reason, but I
    suspect that our NTP service malfunctioned or the network was
    temporarily unreliable.

    The problem is that this could happen again and again.

    So, I'm considering changing LOG(FATAL) to LOG(WARNING) in the
    Kudu source code so that the tserver does not exit when the clock
    is unsynchronized.

      uint64_t now_usec;
      uint64_t error_usec;
      Status s = WalltimeWithError(&now_usec, &error_usec);
      if (PREDICT_FALSE(!s.ok())) {
        LOG(FATAL) << Substitute("Couldn't get the current time: Clock unsynchronized. "
                                 "Status: $0", s.ToString());
      }



    So, my question is: is it OK to change LOG(FATAL) to LOG(WARNING)
    in the above code? And would this prevent the tserver from dying
    when the clock is unsynchronized?

    Thanks.

    Regards,

    Jason


