Re: tserver died by clock unsync.

Alexey Serbin Fri, 16 Jun 2017 12:54:06 -0700

Hi Jason,

I think the workaround you mentioned (i.e. replacing LOG(FATAL) with LOG(WARNING) in the cited code snippet) is not safe at all. If ntp_gettime() returns the TIME_ERROR code, the 'now_usec' variable might be left uninitialized, and the code relying on the HybridClock::NowWithError() method would get garbage instead of the wall clock usec value. That might lead to serious issues elsewhere up the chain, and it's hard to predict what would happen. If you are lucky, a tserver will simply crash later on; if not, you'll get undefined behavior and data corruption which would be very hard to track down and fix.
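
To illustrate the concern, here is a minimal, self-contained sketch. It is not Kudu's actual code: ReadWallTime(), TimeWithError and NowWithErrorChecked() are hypothetical names, and the timestamp values are arbitrary. ReadWallTime() stands in for a clock reader that may leave its output arguments untouched on failure. The caller propagates the failure instead of logging a warning and continuing with indeterminate values. It assumes a C++17 compiler.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical stand-in for a clock reader (NOT Kudu's API). When `synced`
// is false it reports failure and, like many C-style APIs, does not write
// to its output arguments.
bool ReadWallTime(bool synced, uint64_t* now_usec, uint64_t* error_usec) {
  if (!synced) {
    return false;  // outputs intentionally left untouched
  }
  *now_usec = 1497642846000000ULL;  // arbitrary example timestamp, in usec
  *error_usec = 500;                // arbitrary example max error, in usec
  return true;
}

struct TimeWithError {
  uint64_t now_usec;
  uint64_t error_usec;
};

// Safe caller: a failed clock read is propagated as an empty optional.
std::optional<TimeWithError> NowWithErrorChecked(bool synced) {
  uint64_t now_usec;    // indeterminate until ReadWallTime() succeeds
  uint64_t error_usec;  // indeterminate until ReadWallTime() succeeds
  if (!ReadWallTime(synced, &now_usec, &error_usec)) {
    // Logging a warning here and falling through would return
    // indeterminate values, which is undefined behavior in C++.
    return std::nullopt;
  }
  return TimeWithError{now_usec, error_usec};
}
```

With this shape, every caller has to handle the "no valid time" case explicitly, which is roughly what LOG(FATAL) enforces in a blunter way by stopping the process.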

Instead of running your tservers with that unsafe change, I would recommend tracking down the issue with NTP in your cluster. Make sure nothing other than ntpd adjusts the clock on your machines (e.g., make sure nobody runs ntpdate manually and ntpdate is not executed by a cron job, etc.). If your local network experiences internet outages for long periods of time, one suggestion would be to run an NTP server on a stable machine (or two) within your local network. Your local NTP servers would source time from 5-7 public NTP servers of stratum 2 or 3 from the internet. In turn, the NTP daemons on your Kudu nodes would use your internal NTP server(s) as a source. Also, it would make sense to take a look at some 'NTP best practice' guides you can find on the Internet; hopefully, you will find some ideas on how to tailor those to your case.
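
While tracking this down, it can help to look at the clock state the kernel reports on each node. The following is a rough sketch, assuming Linux with glibc, where ntp_adjtime() from <sys/timex.h> is available. DescribeClockState() and PrintClockState() are hypothetical helper names, not part of Kudu. The query uses modes == 0, so it only reads the state and does not change the clock.

```cpp
#include <sys/timex.h>  // Linux/glibc: ntp_adjtime(), struct timex, TIME_* codes

#include <cstdio>
#include <string>

// Maps the clock state returned by ntp_adjtime() to a readable label.
std::string DescribeClockState(int state) {
  switch (state) {
    case TIME_OK:
      return "synchronized";
    case TIME_ERROR:
      return "unsynchronized";
    case -1:
      return "query failed";
    default:
      // TIME_INS, TIME_DEL, TIME_OOP, TIME_WAIT
      return "synchronized (leap-second handling state)";
  }
}

// Reads the kernel clock state without modifying it and prints a summary.
void PrintClockState() {
  struct timex tx = {};
  tx.modes = 0;  // read-only query
  const int state = ntp_adjtime(&tx);
  std::printf("state=%s maxerror=%lld usec esterror=%lld usec\n",
              DescribeClockState(state).c_str(),
              static_cast<long long>(tx.maxerror),
              static_cast<long long>(tx.esterror));
}
```

Calling PrintClockState() periodically from a small monitoring program on each node would show when, and for how long, the kernel considers the clock unsynchronized.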


Hope this helps.


Kind regards,

Alexey


On 6/16/17 1:59 AM, Jason Heo wrote:

Hi.

Congratulations on Apache Kudu 1.4.0.

To prevent the tserver from dying accidentally, I've changed LOG(FATAL) <https://github.com/apache/kudu/blob/1.4.0/src/kudu/server/hybrid_clock.cc#L227> to LOG(WARNING).

I wanted to know whether it is safe to continue if ntp_gettime() in GetClockTime <https://github.com/apache/kudu/blob/1.4.0/src/kudu/server/hybrid_clock.cc#L90> returns TIME_ERROR.


Could anyone help me?

Regards,

Jason

2017-06-15 12:40 GMT+09:00 Jason Heo <jason.heo....@gmail.com>:


    Hi,

    I'm using Apache Kudu 1.4.0

    Yesterday, 6 tservers died at the same time. The following message
    was logged by each tserver.


    F0614 14:58:32.868551 111454 hybrid_clock.cc:227] Couldn't get the current time: Clock unsynchronized. Status: Service unavailable: Error reading clock. Clock considered unsynchronized


    We are already using ntpd, and the following NTP-related message
    was logged in /var/log/messages.

    Jun 14 14:58:38 hostname ntpdate[10231]: step time server ip_addr
    offset -0.000168 sec


    We use our own NTP service. I don't know the exact reason, but I
    suspect that our NTP service malfunctioned or that the network was
    temporarily unreliable.

    The problem is that this could happen again and again.

    So, I'm considering modifying the Kudu source code from LOG(FATAL)
    to LOG(WARNING) so that the tserver does not exit on unsync.

      uint64_t now_usec;
      uint64_t error_usec;
      Status s = WalltimeWithError(&now_usec, &error_usec);
      if (PREDICT_FALSE(!s.ok())) {
        LOG(FATAL) << Substitute("Couldn't get the current time: Clock unsynchronized. "
            "Status: $0", s.ToString());
      }



    So, my question is: is it OK to change LOG(FATAL) to LOG(WARNING)
    in the above code? And would this prevent the tserver from dying
    when the clock is unsynced?

    Thanks.

    Regards,

    Jason
