I investigated this quite a bit, and this appears to be an ntp bug and
not a charm bug.

This host is a trusty host, running ntp version
1:4.2.6.p5+dfsg-3ubuntu2.14.04.13. We have other hosts running the same
version that don't have the problem described above.

I spent quite some time comparing the hosts, running strace etc., and I
noticed a subtle difference in /etc/hosts: on the working host, the ::1
entry doesn't have "localhost", but it does on the failing host. When I
removed "localhost" from the ::1 entry on the failing host, "ntpq -pn"
started working.

Investigating a bit more, I found that on the working host ntpd was
listening on ::1, but on the failing host it wasn't (checked with the
"ss -anupe" output as well as the ntpd startup logs).

Comparing straces of ntpd starting up, I think I was able to find what's
going on. On the working host it gives (only the relevant output is
shown here):

3973  19:41:32 socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 5
[...]
3973  19:41:32 ioctl(5, SIOCGIFFLAGS, {ifr_name="qvobb268af4-e9", ifr_flags=IFF_UP|IFF_BROADCAST|IFF_RUNNING|IFF_PROMISC|IFF_MULTICAST}) = 0
3973  19:41:32 ioctl(5, SIOCGIFFLAGS, {ifr_name="qbrd5588b49-e3", ifr_flags=IFF_UP|IFF_BROADCAST|IFF_RUNNING|IFF_MULTICAST}) = 0
3973  19:41:32 ioctl(5, SIOCGIFFLAGS, {ifr_name="qvb1693c156-5f", ifr_flags=IFF_UP|IFF_BROADCAST|IFF_RUNNING|IFF_PROMISC|IFF_MULTICAST}) = 0
[... the same for a bunch of interfaces - this is a nova compute node so this is expected ...]
3973  19:41:32 close(5)                 = 0


But on the failing host, it checks a single interface:
56717 19:37:03 socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 5
[...]
56717 19:37:03 ioctl(5, SIOCGIFFLAGS, {ifr_name="qvbba244f00-69", ifr_flags=IFF_UP|IFF_BROADCAST|IFF_RUNNING|IFF_PROMISC|IFF_MULTICAST}) = 0
56717 19:37:03 close(5)                 = 0

So I thought this interface was a bit special:
$ ip li sh dev qvbba244f00-69
67772: qvbba244f00-69@qvoba244f00-69: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc noqueue master qbrba244f00-69 state UP mode DEFAULT group default qlen 1000
    link/ether 0e:ac:86:b1:c8:24 brd ff:ff:ff:ff:ff:ff

It appears completely normal, except that it has an unusually high
ifindex (67772, i.e. 0x108bc). Could that be the cause of the problem?
Looking at the source code at
https://git.launchpad.net/ubuntu/+source/ntp/tree/?h=ubuntu/trusty-updates,
interfaces are enumerated by reading the /proc/net/if_inet6 file
(https://git.launchpad.net/ubuntu/+source/ntp/tree/lib/isc/unix/ifiter_getifaddrs.c?h=ubuntu/trusty-updates#n54),
which strace confirms:

3973  19:41:32 open("/proc/net/if_inet6", O_RDONLY) = 6

Each line is read using fgets():

fgets(iter->entry, sizeof(iter->entry), iter->proc) != NULL)

https://git.launchpad.net/ubuntu/+source/ntp/tree/lib/isc/unix/interfaceiter.c?h=ubuntu/trusty-updates#n181

What's sizeof(iter->entry)? Well, "entry" is defined like this:

        char                    entry[ISC_IF_INET6_SZ];

https://git.launchpad.net/ubuntu/+source/ntp/tree/lib/isc/unix/ifiter_getifaddrs.c?h=ubuntu/trusty-updates#n48

And ISC_IF_INET6_SZ is :
#define ISC_IF_INET6_SZ \
    sizeof("00000000000000000000000000000001 01 80 10 80 XXXXXXloXXXXXXXX\n")

https://git.launchpad.net/ubuntu/+source/ntp/tree/lib/isc/unix/interfaceiter.c?h=ubuntu/trusty-updates#n153

And this is where the problem is. ISC_IF_INET6_SZ assumes that the
ifindex takes 2 chars (in hex), i.e. that the ifindex is < 256.
Ifindexes higher than that are likely common, so why don't we see this
bug elsewhere? Because ISC_IF_INET6_SZ also assumes that the interface
name is 16 chars long, so shorter names leave some slack.

In our example, the interface name is "only" 14 chars, which leaves 2
spare chars for the ifindex. But our ifindex needs 5 hex chars (108bc),
3 more than assumed, so that's not enough: it's off by 1, in fact!
"00000000000000000000000000000001 01 80 10 80 XXXXXXloXXXXXXXX\n" is 62
chars long, so sizeof() gives 63 and fgets() reads at most 62 chars per
call.
The first line of if_inet6 on our machine is:
fe800000000000000cac86fffeb1c824 108bc 40 20 80 qvbba244f00-69
and that's 62 chars long... but without the \n!

So what might be happening here is that the first iteration of the loop
reads the whole line except the \n, and the next iteration resumes at
that location. Because fgets() stops at a newline or EOF, it returns
just "\n", which presumably fails to parse as an entry and makes the
whole iteration stop after the first interface.

The fix here is pretty simple: the computation of ISC_IF_INET6_SZ
should assume an ifindex of UINT_MAX, i.e. ffffffff (or any 8-char hex
number). If I can trust
https://git.launchpad.net/ubuntu/+source/ntp/tree/lib/isc/unix/interfaceiter.c?h=applied/ubuntu/jammy
the bug is still present in Jammy.

Redirecting the bug to the "ntp" package.

** Also affects: ntp (Ubuntu)
   Importance: Undecided
       Status: New

** Changed in: ntp-charm
       Status: New => Invalid

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to ntp in Ubuntu.
https://bugs.launchpad.net/bugs/1952264

Title:
  ntp sync checks fail when server has no IPv6 connectivity

Status in NTP Charm:
  Invalid
Status in ntp package in Ubuntu:
  New

Bug description:
  This charm sets up ntpmon and nagios checks to alert when ntp was not
  able to select a sync peer.

  On a server without a routable ipv6 configured, ntpq -p fails with:
  $ ntpq -p
  localhost: timed out, nothing received
  ***Request timed out

  $ /opt/ntpmon-ntp-charm/check_ntpmon.py --check sync
  CRITICAL: No sync peer selected | frequency= offset=nan peers=0 reach=nan result=2 rootdelay= rootdisp= runtime= stratum= sync=0.000000 sysjitter= sysoffset= tracehosts= traceloops= tracetime=

  This results in a nagios alert complaining about the problem.
  Although:

  $ ntpq -p -4
       remote           refid      st t when poll reach   delay   offset  jitter
  ==============================================================================
  *hostname1       xxx.xxx.xxx.x    2 u  210  256  377    0.842    0.031   0.050
  +hostname2       xxx.xxx.xxx.x    2 u   88  256  377    0.327    0.062   0.107
  -hostname3       xxx.xxx.xxx.x    2 u  210  256  377   75.810   -1.198   1.035
  +hostname4       xxx.xxx.xxx.x    2 u   68  256  377    0.751    0.078   0.193

  $ ntpq -p -4 | /opt/ntpmon-ntp-charm/check_ntpmon.py --check sync --test
  OK: Time is in sync with hostname1 | frequency= offset=0.000057 peers=4 reach=100.000000 result=0 rootdelay= rootdisp= runtime= stratum= sync=1.000000 sysjitter= sysoffset= tracehosts= traceloops= tracetime=

  Maybe this is a bug to file against ntp itself ? Or some configuration
  could allow ntpq -p and check_ntpmon.py to succeed ? I've tested
  running ntpd with -4 (using defaults file) but with no luck.

  Let us know if you need more information.

  Thank you,
  Loïc
