docs: improvements to NTP troubleshooting

Change-Id: I07b6871b91ed4ee08992d2fcd093f1054c7d61b8
Reviewed-on: http://gerrit.cloudera.org:8080/9234
Reviewed-by: Will Berkeley <wdberke...@gmail.com>
Tested-by: Kudu Jenkins

Project: http://git-wip-us.apache.org/repos/asf/kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/60eca012
Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/60eca012
Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/60eca012

Branch: refs/heads/master
Commit: 60eca0125c9383fa67b304b15b64728b8f153ceb
Parents: c9c86f4
Author: Todd Lipcon <t...@apache.org>
Authored: Tue Feb 6 17:07:58 2018 -0800
Committer: Todd Lipcon <t...@apache.org>
Committed: Tue Feb 13 01:09:08 2018 +0000

----------------------------------------------------------------------
 docs/troubleshooting.adoc | 151 ++++++++++++++++++++++++++++++++++++++---
 1 file changed, 143 insertions(+), 8 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/kudu/blob/60eca012/docs/troubleshooting.adoc
----------------------------------------------------------------------
diff --git a/docs/troubleshooting.adoc b/docs/troubleshooting.adoc
index 34ac291..95557e3 100644
--- a/docs/troubleshooting.adoc
+++ b/docs/troubleshooting.adoc
@@ -94,8 +94,8 @@ or
 Sep 17, 8:32:31.135 PM FATAL tablet_server_main.cc:38 Check failed: _s.ok() Bad status: Service unavailable: Cannot initialize clock: Cannot initialize HybridClock. Clock synchronized but error was too high (11711000 us).
 ----
 
-TIP: If NTP is installed the user can monitor the synchronization status by running
-`ntptime`. The relevant value is what is reported for `maximum error`.
+==== Installing NTP
+
 To install NTP, use the appropriate command for your operating system:
 
 [cols="1,1", options="header"]
@@ -113,14 +113,149 @@ If NTP is installed but not running, start it using one of these commands:
 | RHEL/CentOS | `sudo /etc/init.d/ntpd restart`
 |===
 
-TIP: NTP requires a network connection and may take a few minutes to synchronize the clock.
-In some cases a spotty network connection may make NTP report the clock as unsynchronized.
+==== Monitoring NTP Status
+
+When NTP is installed, you can monitor the synchronization status by running
+`ntptime`. For example, a healthy system may report:
+
+----
+ntp_gettime() returns code 0 (OK)
+  time de24c0cf.8d5da274  Tue, Feb  6 2018 16:03:27.552, (.552210980),
+  maximum error 224455 us, estimated error 383 us, TAI offset 0
+ntp_adjtime() returns code 0 (OK)
+  modes 0x0 (),
+  offset 1279.543 us, frequency 2.500 ppm, interval 1 s,
+  maximum error 224455 us, estimated error 383 us,
+  status 0x2001 (PLL,NANO),
+  time constant 10, precision 0.001 us, tolerance 500 ppm,
+----
+
+In particular, note the two most important pieces of output:
+
+- `maximum error 224455 us`: this value is well under the 10-second maximum error required
+  by Kudu.
+- `status 0x2001 (PLL,NANO)`: this indicates a healthy synchronization status.
+
+In contrast, a system without NTP properly configured and running will output
+something like the following:
+
+----
+ntp_gettime() returns code 5 (ERROR)
+  time de24c240.0c006000  Tue, Feb  6 2018 16:09:36.046, (.046881),
+  maximum error 16000000 us, estimated error 16000000 us, TAI offset 0
+ntp_adjtime() returns code 5 (ERROR)
+  modes 0x0 (),
+  offset 0.000 us, frequency 2.500 ppm, interval 1 s,
+  maximum error 16000000 us, estimated error 16000000 us,
+  status 0x40 (UNSYNC),
+  time constant 10, precision 1.000 us, tolerance 500 ppm,
+----
+
+Note the `UNSYNC` status and the 16-second maximum error.
+
+If more detailed information is needed, the `ntpq` or `ntpdc` tools
+can be used to dump further information about which network time servers
+are currently acting as sources:
+
+----
+$ ntpq -n -c opeers
+     remote           local      st t when poll reach   delay   offset    disp
+==============================================================================
+ 0.0.0.0         0.0.0.0         16 p    -   64     0   0.000    0.000 16000.0
+ 0.0.0.0         0.0.0.0         16 p    -   64     0   0.000    0.000 16000.0
+ 0.0.0.0         0.0.0.0         16 p    -   64     0   0.000    0.000 16000.0
+ 0.0.0.0         0.0.0.0         16 p    -   64     0   0.000    0.000 16000.0
+ 0.0.0.0         0.0.0.0         16 p    -   64     0   0.000    0.000 16000.0
+-108.59.2.24     10.16.2.89       2 u    3   64     3  74.380    0.321  62.992
+-208.82.104.205  10.16.2.89       2 u    5   64     3  52.654   -4.054  62.965
+#192.96.202.120  10.16.2.89       2 u    1   64     3  74.737    6.538  62.988
+#69.10.161.7     10.16.2.89       3 u    5   64     3  28.353   -1.967  62.960
+-173.255.206.154 10.16.2.89       3 u    -   64     3  42.906   -3.127  62.996
+-69.195.159.158  10.16.2.89       2 u    1   64     3  52.543   -4.788  62.987
+*216.218.254.202 10.16.2.89       1 u    5   64     3   2.567    0.053  62.974
+-129.250.35.250  10.16.2.89       2 u    3   64     3   2.603    0.256  62.985
++45.76.244.193   10.16.2.89       2 u    5   64     3  19.522    0.188  62.969
+-69.89.207.199   10.16.2.89       2 u    5   64     3  66.687   -0.395  62.967
+-171.66.97.126   10.16.2.89       1 u    1   64     3  12.627   -3.572  62.963
+#66.228.42.59    10.16.2.89       4 u    1   64     3  72.143    4.034  62.971
+ 91.189.89.198   10.16.2.89       2 u    5   64     3 135.329    3.069 3937.74
+#162.210.111.4   10.16.2.89       2 u    -   64     3  29.572    6.849  62.966
++199.102.46.80   10.16.2.89       1 u    3   64     3  57.022    0.111  63.386
+ 91.189.89.199   10.16.2.89       2 u    4   64     3 138.269    3.228 3937.98
+----
+
+TIP: Depending on the specific version of NTP, the correct command may be either
+`ntpq -n -c opeers` or `ntpq -n -c lpeers`.
+
+
+[NOTE]
+****
+.Using `chrony` for time synchronization
+
+Some operating systems offer `chrony` as an alternative to `ntpd` for network time
+synchronization. Kudu has been tested most thoroughly using `ntpd` and use of
+`chrony` is considered experimental.
+
+In order to use `chrony` for synchronization, `chrony.conf` must be configured
+with the `rtcsync` option.
+****
+
+==== NTP Configuration Best Practices
+
+In order to provide stable time synchronization with low maximum error, follow
+these NTP configuration best practices.
+
+*Always configure at least four time sources for NTP.* In addition to providing
+redundancy in case one or more time sources becomes unavailable, the NTP protocol is
+designed to increase its accuracy with a diversity of sources. Even if your organization
+provides one or more local time servers, configuring additional remote servers is highly
+recommended for a robust setup.
+
+*Pick servers in your server's local geography.* For example, if your servers are located
+in Europe, pick servers from the European NTP pool. If your servers are running in a public
+cloud environment, consult the cloud provider's documentation for a recommended NTP setup.
+Many cloud providers offer highly accurate clock synchronization as a service.
+
+*Use the `iburst` option for faster synchronization at startup*. The `iburst` option
+instructs `ntpd` to send an initial "burst" of time queries at startup. This typically
+results in a faster time synchronization when a machine restarts.
+
+An example NTP server list may appear as follows:
+
+----
+# Use my organization's internal NTP servers.
+server ntp1.myorg.internal iburst
+server ntp2.myorg.internal iburst
+# Provide several public pool servers from the US pool for
+# redundancy and robustness.
+server 0.pool.us.ntp.org iburst
+server 1.pool.us.ntp.org iburst
+server 2.pool.us.ntp.org iburst
+server 3.pool.us.ntp.org iburst
+----
+
+TIP: After configuring NTP, use the `ntpq` tool described above to verify that `ntpd` was
+able to connect to a variety of peers. If no public peers appear, it is possible that
+the NTP protocol is being blocked by a firewall or other network connectivity issue.
+
+==== Troubleshooting NTP Stability Problems
+
+As of Kudu 1.6.0, Kudu daemons are able to continue to operate during a brief loss of
+NTP synchronization. If NTP synchronization is lost for several hours, however, daemons
+may crash. If a daemon crashes due to NTP synchronization issues, consult the `ERROR` log
+for a dump of related information which may help to diagnose the issue.
+
+TIP: Kudu 1.5.0 and earlier versions were less resilient to brief NTP outages. In
+addition, they contained a link:https://issues.apache.org/jira/browse/KUDU-2209[bug]
+which could cause Kudu to incorrectly measure the maximum error, resulting in
+crashes. If you experience crashes related to clock synchronization on these
+earlier versions of Kudu and it appears that the system's NTP configuration is correct,
+consider upgrading to Kudu 1.6.0 or later.
+
+TIP: NTP requires a network connection and may take a few minutes to synchronize the clock
+at startup. In some cases a spotty network connection may make NTP report the clock as unsynchronized.
 A common, though temporary, workaround for this is to restart NTP with one of
 the commands above.
-If the clock is being reported as synchronized by NTP, but the maximum error is too high,
-the user can increase the threshold to a higher value by setting the above
-mentioned flag. For example to increase the possible maximum error to
-20 seconds the flag should be set like: `--max_clock_sync_error_usec=20000000`
 
 [[crash_reporting]]
 == Reporting Kudu Crashes
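
The `ntptime` check described in the "Monitoring NTP Status" section of this change could be scripted roughly as follows. This is a minimal sketch, not part of the commit: the function names (`parse_ntptime`, `clock_is_healthy`, `run_ntptime`) are hypothetical, the 10-second threshold is taken from the documentation text above, and treating any `UNSYNC` flag as unhealthy is an assumption based on the two sample outputs shown.

```python
import re
import subprocess

# Maximum clock error Kudu accepts according to the text above: 10 seconds.
MAX_ERROR_US = 10_000_000


def parse_ntptime(output):
    """Return (maximum error in microseconds, list of status flags)."""
    err = re.search(r"maximum error (\d+) us", output)
    status = re.search(r"status 0x[0-9a-fA-F]+ \(([^)]*)\)", output)
    if err is None or status is None:
        raise ValueError("unrecognized ntptime output")
    flags = [f.strip() for f in status.group(1).split(",") if f.strip()]
    return int(err.group(1)), flags


def clock_is_healthy(output, max_error_us=MAX_ERROR_US):
    """True if the clock is not UNSYNC and the error is under the threshold."""
    max_error, flags = parse_ntptime(output)
    return "UNSYNC" not in flags and max_error < max_error_us


def run_ntptime():
    """Run `ntptime` and return its text output.

    The exit status is deliberately not checked, since the tool may exit
    non-zero when it reports an ERROR code, as in the unsynchronized sample.
    """
    result = subprocess.run(["ntptime"], capture_output=True, text=True)
    return result.stdout
```

A monitoring job could call `clock_is_healthy(run_ntptime())` periodically and alert on `False`; the parsing is kept separate from the subprocess call so that it can be exercised against saved output such as the samples in the diff.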