[Touch-packages] [Bug 1926139] Re: dhclient: thread concurrency race leads to DHCPOFFER packets not being received

2023-02-27 Thread Launchpad Bug Tracker
This bug was fixed in the package isc-dhcp - 4.4.1-2.1ubuntu5.20.04.5

---
isc-dhcp (4.4.1-2.1ubuntu5.20.04.5) focal; urgency=medium

  [ Mauricio Faria de Oliveira ]
  * Prevent race condition that might ignore DHCP OFFERs/ACKs
when dhclient receives DHCP traffic noise. (LP: #1926139)
The previous/racy behavior can be switched back on with
the 'DHCP_FD_FLAGS_POKE=0' environment variable or
the 'dhcp.fd_flags_poke=0' kernel cmdline option.
- d/p/lp1926139-watch-socket-fd-later.patch: fix, switches.
- d/apparmor/sbin.dhclient,usr.sbin.dhcpd: /proc/cmdline r.

  [ Steve Langasek ]
  * Include /etc/dhcp/dhclient-exit-hooks.d/rfc3442-classless-routes
in the initramfs. (LP: #1937110)
- d/initramfs-tools/share/hooks/zz-dhclient: copy_exec it.

 -- Mauricio Faria de Oliveira   Tue, 31 Jan 2023
19:10:35 -0300

** Changed in: isc-dhcp (Ubuntu Focal)
   Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to isc-dhcp in Ubuntu.
https://bugs.launchpad.net/bugs/1926139

Title:
  dhclient: thread concurrency race leads to DHCPOFFER packets not being
  received

Status in bind9-libs package in Ubuntu:
  Won't Fix
Status in isc-dhcp package in Ubuntu:
  Invalid
Status in isc-dhcp source package in Focal:
  Fix Released
Status in isc-dhcp source package in Jammy:
  Fix Released

Bug description:
  [Impact]

   * Occasionally, during instance boot or machine start-up,
 dhclient will attempt to acquire a dhcp lease and fail,
 leaving the instance with no IP address and making it
 unreachable.

   * This happens about once every 100 reboots on bare metal,
 or affecting between ~0.3% to 2% of deployments on Azure
 (comment #2).
 
   * Azure uses dhclient called from cloud-init instead of
 systemd-networkd, and this is causing issues with larger
 deployments.

   * The logs of an affected dhclient produce the following:

 Listening on LPF/enp1s0/52:54:00:1c:d7:00
 Sending on   LPF/enp1s0/52:54:00:1c:d7:00
 Sending on   Socket/fallback
 DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
 DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
 ...
 (omitting 20 similar lines)
 ...
 DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
 DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
 DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
 No DHCPOFFERS received.
 No working leases in persistent database - sleeping.

   * This only impacts Focal and Jammy, where bind9-libs
 are multi-threaded (Bionic/earlier and Kinetic/later
 are single-threaded).

   * The actual problem is dhclient containing a thread
 concurrency race condition, and when the race occurs,
 the read socket is incorrectly/prematurely unwatched
 because required structures are not yet consistent,
 thus dhclient does not read any DHCPOFFER replies.

   * Detailed analysis of the issue is in comment #17.

  [Fix]

   * Prevent the race condition by starting to watch the
 read socket after required structures are consistent.
   
   * The fix has been tested in Azure w/ 13500 instances,
 and no errors have been observed (previously: 0.4%).
 
   * Anyway, in case regressions are observed, the patch
 introduces 2 switches to revert to previous behavior,
 which can be applied per-process or system-wide:
 - DHCP_FD_FLAGS_POKE=0 environment variable
 - dhcp.fd_flags_poke=0 kernel cmdline option 
   
   * (Previous approaches/discussions included reverting
  bind9-libs to single-threaded, but we concluded it
  would have more regression risk than the expected
  [some bits in comment #8, and some internal chat],
  and remove exported symbols (apparently unused, but).
  We also considered a mutex/spinlock approach, but
  later found a simpler way w/ isc lib; comment #13.)
  
  [Test Plan]

   * Synthetic reproducer with GDB to force the race
 condition, and DHCP server/client/noise injection
 is described in comment #9.
   
   * Test with the original package (problem occurs).
   
   * Test with the modified package (problem fixed).
 - Set DHCP_FD_FLAGS_POKE=0 (problem occurs).
 - Set dhcp.fd_flags_poke=0 (problem occurs).

  [Regression Potential]

   * 1) dhclient failing to acquire DHCP leases.
  
   * 2) dhcpd is also affected by code changes,
 thus failures to handle DHCP lease requests
 also have potential for regressions.
 
   * 3) the functional change added by the fix,
 if a regression were to occur, would likely
 be an issue only under some (unknown) race
 condition as well, thus expected to be rare.

   * Note: this potentially affects Focal/Jammy 
 on Azure as a whole, per usage of dhclient
 in cloud-init instead of systemd-networkd.
 
 Azure provided extensive testing for all 3
 

[Touch-packages] [Bug 1926139] Re: dhclient: thread concurrency race leads to DHCPOFFER packets not being received

2023-02-27 Thread Launchpad Bug Tracker
This bug was fixed in the package isc-dhcp - 4.4.1-2.3ubuntu2.4

---
isc-dhcp (4.4.1-2.3ubuntu2.4) jammy; urgency=medium

  [ Mauricio Faria de Oliveira ]
  * Prevent race condition that might ignore DHCP OFFERs/ACKs
when dhclient receives DHCP traffic noise. (LP: #1926139)
The previous/racy behavior can be switched back on with
the 'DHCP_FD_FLAGS_POKE=0' environment variable or
the 'dhcp.fd_flags_poke=0' kernel cmdline option.
- d/p/lp1926139-watch-socket-fd-later.patch: fix, switches.
- d/apparmor/sbin.dhclient,usr.sbin.dhcpd: /proc/cmdline r.

  [ Steve Langasek ]
  * Include /etc/dhcp/dhclient-exit-hooks.d/rfc3442-classless-routes
in the initramfs. (LP: #1937110)
- d/initramfs-tools/share/hooks/zz-dhclient: copy_exec it.

 -- Mauricio Faria de Oliveira   Tue, 31 Jan 2023
18:54:40 -0300

** Changed in: isc-dhcp (Ubuntu Jammy)
   Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to isc-dhcp in Ubuntu.
https://bugs.launchpad.net/bugs/1926139

Title:
  dhclient: thread concurrency race leads to DHCPOFFER packets not being
  received

Status in bind9-libs package in Ubuntu:
  Won't Fix
Status in isc-dhcp package in Ubuntu:
  Invalid
Status in isc-dhcp source package in Focal:
  Fix Released
Status in isc-dhcp source package in Jammy:
  Fix Released

Bug description:
  [Impact]

   * Occasionally, during instance boot or machine start-up,
 dhclient will attempt to acquire a dhcp lease and fail,
 leaving the instance with no IP address and making it
 unreachable.

   * This happens about once every 100 reboots on bare metal,
 or affecting between ~0.3% to 2% of deployments on Azure
 (comment #2).
 
   * Azure uses dhclient called from cloud-init instead of
 systemd-networkd, and this is causing issues with larger
 deployments.

   * The logs of an affected dhclient produce the following:

 Listening on LPF/enp1s0/52:54:00:1c:d7:00
 Sending on   LPF/enp1s0/52:54:00:1c:d7:00
 Sending on   Socket/fallback
 DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
 DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
 ...
 (omitting 20 similar lines)
 ...
 DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
 DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
 DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
 No DHCPOFFERS received.
 No working leases in persistent database - sleeping.

   * This only impacts Focal and Jammy, where bind9-libs
 are multi-threaded (Bionic/earlier and Kinetic/later
 are single-threaded).

   * The actual problem is dhclient containing a thread
 concurrency race condition, and when the race occurs,
 the read socket is incorrectly/prematurely unwatched
 because required structures are not yet consistent,
 thus dhclient does not read any DHCPOFFER replies.

   * Detailed analysis of the issue is in comment #17.

  [Fix]

   * Prevent the race condition by starting to watch the
 read socket after required structures are consistent.
   
   * The fix has been tested in Azure w/ 13500 instances,
 and no errors have been observed (previously: 0.4%).
 
   * Anyway, in case regressions are observed, the patch
 introduces 2 switches to revert to previous behavior,
 which can be applied per-process or system-wide:
 - DHCP_FD_FLAGS_POKE=0 environment variable
 - dhcp.fd_flags_poke=0 kernel cmdline option 
   
   * (Previous approaches/discussions included reverting
  bind9-libs to single-threaded, but we concluded it
  would have more regression risk than the expected
  [some bits in comment #8, and some internal chat],
  and remove exported symbols (apparently unused, but).
  We also considered a mutex/spinlock approach, but
  later found a simpler way w/ isc lib; comment #13.)
  
  [Test Plan]

   * Synthetic reproducer with GDB to force the race
 condition, and DHCP server/client/noise injection
 is described in comment #9.
   
   * Test with the original package (problem occurs).
   
   * Test with the modified package (problem fixed).
 - Set DHCP_FD_FLAGS_POKE=0 (problem occurs).
 - Set dhcp.fd_flags_poke=0 (problem occurs).

  [Regression Potential]

   * 1) dhclient failing to acquire DHCP leases.
  
   * 2) dhcpd is also affected by code changes,
 thus failures to handle DHCP lease requests
 also have potential for regressions.
 
   * 3) the functional change added by the fix,
 if a regression were to occur, would likely
 be an issue only under some (unknown) race
 condition as well, thus expected to be rare.

   * Note: this potentially affects Focal/Jammy 
 on Azure as a whole, per usage of dhclient
 in cloud-init instead of systemd-networkd.
 
 Azure provided extensive testing for all 3
 approaches 

[Touch-packages] [Bug 1926139] Re: dhclient: thread concurrency race leads to DHCPOFFER packets not being received

2023-02-15 Thread Mauricio Faria de Oliveira
The -proposed packages have also been verified by the Microsoft Azure team.
14k+ VM deployments with the fix didn't observe the issue over the weekend.

Marking jammy/focal as verified based on that plus 2 synthetic tests
above.

** Tags removed: verification-needed verification-needed-focal 
verification-needed-jammy
** Tags added: verification-done verification-done-focal verification-done-jammy

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to isc-dhcp in Ubuntu.
https://bugs.launchpad.net/bugs/1926139

Title:
  dhclient: thread concurrency race leads to DHCPOFFER packets not being
  received

Status in bind9-libs package in Ubuntu:
  Won't Fix
Status in isc-dhcp package in Ubuntu:
  Invalid
Status in isc-dhcp source package in Focal:
  Fix Committed
Status in isc-dhcp source package in Jammy:
  Fix Committed

Bug description:
  [Impact]

   * Occasionally, during instance boot or machine start-up,
 dhclient will attempt to acquire a dhcp lease and fail,
 leaving the instance with no IP address and making it
 unreachable.

   * This happens about once every 100 reboots on bare metal,
 or affecting between ~0.3% to 2% of deployments on Azure
 (comment #2).
 
   * Azure uses dhclient called from cloud-init instead of
 systemd-networkd, and this is causing issues with larger
 deployments.

   * The logs of an affected dhclient produce the following:

 Listening on LPF/enp1s0/52:54:00:1c:d7:00
 Sending on   LPF/enp1s0/52:54:00:1c:d7:00
 Sending on   Socket/fallback
 DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
 DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
 ...
 (omitting 20 similar lines)
 ...
 DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
 DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
 DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
 No DHCPOFFERS received.
 No working leases in persistent database - sleeping.

   * This only impacts Focal and Jammy, where bind9-libs
 are multi-threaded (Bionic/earlier and Kinetic/later
 are single-threaded).

   * The actual problem is dhclient containing a thread
 concurrency race condition, and when the race occurs,
 the read socket is incorrectly/prematurely unwatched
 because required structures are not yet consistent,
 thus dhclient does not read any DHCPOFFER replies.

   * Detailed analysis of the issue is in comment #17.

  [Fix]

   * Prevent the race condition by starting to watch the
 read socket after required structures are consistent.
   
   * The fix has been tested in Azure w/ 13500 instances,
 and no errors have been observed (previously: 0.4%).
 
   * Anyway, in case regressions are observed, the patch
 introduces 2 switches to revert to previous behavior,
 which can be applied per-process or system-wide:
 - DHCP_FD_FLAGS_POKE=0 environment variable
 - dhcp.fd_flags_poke=0 kernel cmdline option 
   
   * (Previous approaches/discussions included reverting
  bind9-libs to single-threaded, but we concluded it
  would have more regression risk than the expected
  [some bits in comment #8, and some internal chat],
  and remove exported symbols (apparently unused, but).
  We also considered a mutex/spinlock approach, but
  later found a simpler way w/ isc lib; comment #13.)
  
  [Test Plan]

   * Synthetic reproducer with GDB to force the race
 condition, and DHCP server/client/noise injection
 is described in comment #9.
   
   * Test with the original package (problem occurs).
   
   * Test with the modified package (problem fixed).
 - Set DHCP_FD_FLAGS_POKE=0 (problem occurs).
 - Set dhcp.fd_flags_poke=0 (problem occurs).

  [Regression Potential]

   * 1) dhclient failing to acquire DHCP leases.
  
   * 2) dhcpd is also affected by code changes,
 thus failures to handle DHCP lease requests
 also have potential for regressions.
 
   * 3) the functional change added by the fix,
 if a regression were to occur, would likely
 be an issue only under some (unknown) race
 condition as well, thus expected to be rare.

   * Note: this potentially affects Focal/Jammy 
 on Azure as a whole, per usage of dhclient
 in cloud-init instead of systemd-networkd.
 
 Azure provided extensive testing for all 3
 approaches (mostly internal communications,
 and some bug comments), with ~13k instances.
 
 No issues were observed (previously: 0.4%).
 
   * Such testing scale seems to indicate that
 there are no regressions for dhclient to
 acquire DHCP leases (1), nor another race
 condition that hit the fix/new behavior (3).
 
 With that, apparently (2) should be OK too.
 
   * Also, so to mitigate the regression risk
 as much as possible, there's very detailed
 analysis provided here (comments #17, 

[Touch-packages] [Bug 1926139] Re: dhclient: thread concurrency race leads to DHCPOFFER packets not being received

2023-02-15 Thread Mauricio Faria de Oliveira
verification done on focal
(full steps in comment #9)

The issue is not reproducible with the package in -proposed,
and is reproducible with the 2 switches for the old behavior
(DHCP_FD_FLAGS_POKE=0 or dhcp.fd_flags_poke=0), as expected.

...

# add-apt-repository -y 'deb http://archive.ubuntu.com/ubuntu focal-
proposed main'

# apt policy isc-dhcp-client
isc-dhcp-client:
  Installed: 4.4.1-2.1ubuntu5.20.04.4
  Candidate: 4.4.1-2.1ubuntu5.20.04.5
  Version table:
 4.4.1-2.1ubuntu5.20.04.5 500
500 http://archive.ubuntu.com/ubuntu focal-proposed/main amd64 Packages
...

# wget https://launchpad.net/ubuntu/+archive/primary/+files/isc-dhcp-
client-dbgsym_4.4.1-2.1ubuntu5.20.04.5_amd64.ddeb

# apt install -y isc-dhcp-client ./isc-dhcp-client-
dbgsym_4.4.1-2.1ubuntu5.20.04.5_amd64.ddeb gdb

Source code line numbers (for breakpoint):

 233 isc_result_t omapi_register_io_object (omapi_object_t *h,
 ...
 312 status = isc_socket_fdwatchcreate(dhcp_gbl_ctx.socketmgr,
 ...
 333 for (p = omapi_io_states.next;
  
# ip netns exec ns1 \
  gdb -ex 'set target-async on' -ex 'set non-stop on' -ex 'set pagination off' 
-ex 'set confirm off' -q dhclient

(gdb) break omapip/dispatch.c:333
(gdb) commands
shell sleep 0.2
continue
end
(gdb) run -d -v veth1
...
Thread 1 "dhclient" hit Breakpoint 1, omapi_register_io_object 
(h=0x561afb034940, readfd=0x561afad13630 , 
writefd=writefd@entry=0x0, reader=0x561afad30fb0 , 
writer=writer@entry=0x0, reaper=reaper@entry=0x0) at dispatch.c:337
337 in dispatch.c
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 3 (xid=0x72ac8a14)
DHCPOFFER of 192.168.42.100 from 192.168.42.1
DHCPREQUEST for 192.168.42.100 on veth1 to 255.255.255.255 port 67 
(xid=0x148aac72)
DHCPACK of 192.168.42.100 from 192.168.42.1 (xid=0x72ac8a14)
[Detaching after fork from child process 1037683]
bound to 192.168.42.100 -- renewal in 290 seconds.
^C
Thread 1 "dhclient" received signal SIGINT, Interrupt.
...

(gdb) run -d -v veth1 -r
...
DHCPRELEASE of 192.168.42.100 on veth1 to 192.168.42.1 port 67 (xid=0x20449570)
...

<<< WORKS 10/10 >>>

...

(gdb) set environment DHCP_FD_FLAGS_POKE 0

(gdb) run -d -v veth1
...
Thread 1 "dhclient" hit Breakpoint 1, omapi_register_io_object 
(h=0x557c0d5d1350, readfd=0x557c0beb8630 , writefd=0x0, 
reader=0x557c0bed5fb0 , writer=0x0, reaper=0x0) at 
dispatch.c:337
337 in dispatch.c
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 3 (xid=0xa5a2783d)
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 6 (xid=0xa5a2783d)
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 8 (xid=0xa5a2783d)
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 10 (xid=0xa5a2783d)
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 21 (xid=0xa5a2783d)
^C
Thread 1 "dhclient" received signal SIGINT, Interrupt.
...
(gdb) kill

<<< FAILS 3/3 >>>

(gdb) unset environment DHCP_FD_FLAGS_POKE

...

(gdb) shell echo "$(cat /proc/cmdline) dhcp.fd_flags_poke=0" >/tmp/cmdline
(gdb) shell mount --bind /tmp/cmdline /proc/cmdline
(gdb) shell cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-5.4.0-1084-kvm 
root=PARTUUID=7a9ea63e-b971-413c-9238-d59509520a9e ro console=tty1 
console=ttyS0 dhcp.fd_flags_poke=0

(gdb) run -d -v veth1
...
Thread 1 "dhclient" hit Breakpoint 1, omapi_register_io_object 
(h=0x5604afcaac00, readfd=0x5604ae2d3630 , writefd=0x0, 
reader=0x5604ae2f0fb0 , writer=0x0, reaper=0x0) at 
dispatch.c:337
337 in dispatch.c
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 3 (xid=0x6095ca20)
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 7 (xid=0x6095ca20)
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 18 (xid=0x6095ca20)
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 7 (xid=0x6095ca20)
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 14 (xid=0x6095ca20)
^C
Thread 1 "dhclient" received signal SIGINT, Interrupt.
...
(gdb) kill

<<< FAILS 3/3 >>>

(gdb) shell umount /proc/cmdline

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to isc-dhcp in Ubuntu.
https://bugs.launchpad.net/bugs/1926139

Title:
  dhclient: thread concurrency race leads to DHCPOFFER packets not being
  received

Status in bind9-libs package in Ubuntu:
  Won't Fix
Status in isc-dhcp package in Ubuntu:
  Invalid
Status in isc-dhcp source package in Focal:
  Fix Committed
Status in isc-dhcp source package in Jammy:
  Fix Committed

Bug description:
  [Impact]

   * Occasionally, during instance boot or machine start-up,
 dhclient will attempt to acquire a dhcp lease and fail,
 leaving the instance with no IP address and making it
 unreachable.

   * This happens about once every 100 reboots on bare metal,
 or affecting between ~0.3% to 2% of deployments on Azure
 (comment #2).
 
   * Azure uses dhclient called from cloud-init instead of
 systemd-networkd, and this is causing issues with larger
 deployments.

   * The logs of an affected 

[Touch-packages] [Bug 1926139] Re: dhclient: thread concurrency race leads to DHCPOFFER packets not being received

2023-02-15 Thread Mauricio Faria de Oliveira
verification done on jammy 
(full steps in comment #9)

The issue is not reproducible with the package in -proposed,
and is reproducible with the 2 switches for the old behavior
(DHCP_FD_FLAGS_POKE=0 or dhcp.fd_flags_poke=0), as expected.

...

# add-apt-repository -yp proposed

# apt policy isc-dhcp-client
isc-dhcp-client:
  Installed: 4.4.1-2.3ubuntu2.3
  Candidate: 4.4.1-2.3ubuntu2.4
  Version table:
 4.4.1-2.3ubuntu2.4 500
500 http://archive.ubuntu.com/ubuntu jammy-proposed/main amd64 Packages
...

# wget https://launchpad.net/ubuntu/+archive/primary/+files/isc-dhcp-
client-dbgsym_4.4.1-2.3ubuntu2.4_amd64.ddeb

# apt install -y isc-dhcp-client ./isc-dhcp-client-
dbgsym_4.4.1-2.3ubuntu2.4_amd64.ddeb gdb

Source code line numbers (for breakpoint):

 233 isc_result_t omapi_register_io_object (omapi_object_t *h,
 ...
 312 status = isc_socket_fdwatchcreate(dhcp_gbl_ctx.socketmgr,
 ...
 333 for (p = omapi_io_states.next;
  
# ip netns exec ns1 \
  gdb -ex 'set target-async on' -ex 'set non-stop on' -ex 'set pagination off' 
-ex 'set confirm off' -q dhclient

(gdb) break omapip/dispatch.c:333
(gdb) commands
shell sleep 0.2
continue
end
(gdb) run -d -v veth1
...
Thread 1 "dhclient" hit Breakpoint 1, omapi_register_io_object 
(h=0x558f3b9b1180, readfd=0x558f3b00f150 , writefd=0x0, 
reader=0x558f3b0114b0 , writer=0x0, reaper=0x0) at 
../omapip/dispatch.c:343
343 ../omapip/dispatch.c: No such file or directory.
Sending on   Socket/fallback
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 3 (xid=0x6909975b)
DHCPOFFER of 192.168.42.100 fbreak omapip/dispatch.c:333rom 192.168.42.1
DHCPREQUEST for 192.168.42.100 on veth1 to 255.255.255.255 port 67 
(xid=0x5b970969)
DHCPACK of 192.168.42.100 from 192.168.42.1 (xid=0x6909975b)
[Detaching after fork from child process 34351]
bound to 192.168.42.100 -- renewal in 301 seconds.
^C
Thread 1 "dhclient" received signal SIGINT, Interrupt.
...

(gdb) run -d -v veth1 -r
...
DHCPRELEASE of 192.168.42.100 on veth1 to 192.168.42.1 port 67 (xid=0x270d7f02)
...

<<< WORKS 10/10 >>>

...

(gdb) set environment DHCP_FD_FLAGS_POKE 0

(gdb) run -d -v veth1
...
Thread 1 "dhclient" hit Breakpoint 1, omapi_register_io_object 
(h=0x55e3f8364180, readfd=0x55e3f7828150 , writefd=0x0, 
reader=0x55e3f782a4b0 , writer=0x0, reaper=0x0) at 
../omapip/dispatch.c:343
343 ../omapip/dispatch.c: No such file or directory.
Sending on   Socket/fallback
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 3 (xid=0xd7155345)
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 5 (xid=0xd7155345)
DHCPDISCOVER on veth1 to 255break omapip/dispatch.c:333.255.255.255 port 67 
interval 6 (xid=0xd7155345)
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 7 (xid=0xd7155345)
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 7 (xid=0xd7155345)
^C
Thread 1 "dhclient" received signal SIGINT, Interrupt.
...
(gdb) kill

<<< FAILS 3/3 >>>

(gdb) unset environment DHCP_FD_FLAGS_POKE

...

(gdb) shell echo "$(cat /proc/cmdline) dhcp.fd_flags_poke=0" >/tmp/cmdline
(gdb) shell mount --bind /tmp/cmdline /proc/cmdline
(gdb) shell cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-5.15.0-1025-kvm 
root=PARTUUID=72954768-850e-45d3-a9ca-d064a700c8b5 ro console=tty1 
console=ttyS0 panic=-1 dhcp.fd_flags_poke=0

(gdb) run -d -v veth1
...
Sending on   Socket/fallback
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 3 (xid=0xf141b060)
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 5 (xid=0xf141b060)
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 13 (xid=0xf141b060)
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 20 (xid=0xf141b060)
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 11 (xid=0xf141b060)
^C
Thread 1 "dhclient" received signal SIGINT, Interrupt.
...
(gdb) kill

<<< FAILS 3/3 >>>

(gdb) shell umount /proc/cmdline

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to isc-dhcp in Ubuntu.
https://bugs.launchpad.net/bugs/1926139

Title:
  dhclient: thread concurrency race leads to DHCPOFFER packets not being
  received

Status in bind9-libs package in Ubuntu:
  Won't Fix
Status in isc-dhcp package in Ubuntu:
  Invalid
Status in isc-dhcp source package in Focal:
  Fix Committed
Status in isc-dhcp source package in Jammy:
  Fix Committed

Bug description:
  [Impact]

   * Occasionally, during instance boot or machine start-up,
 dhclient will attempt to acquire a dhcp lease and fail,
 leaving the instance with no IP address and making it
 unreachable.

   * This happens about once every 100 reboots on bare metal,
 or affecting between ~0.3% to 2% of deployments on Azure
 (comment #2).
 
   * Azure uses dhclient called from cloud-init instead of
 systemd-networkd, and this is causing issues with larger
 deployments.

   * The logs of an affected dhclient produce the following:

 Listening on LPF/enp1s0/52:54:00:1c:d7:00
 Sending on 

[Touch-packages] [Bug 1926139] Re: dhclient: thread concurrency race leads to DHCPOFFER packets not being received

2023-02-09 Thread Łukasz Zemczak
Hello Martijn, or anyone else affected,

Accepted isc-dhcp into focal-proposed. The package will build now and be
available at https://launchpad.net/ubuntu/+source/isc-
dhcp/4.4.1-2.1ubuntu5.20.04.5 in a few hours, and then in the -proposed
repository.

Please help us by testing this new package.  See
https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how
to enable and use -proposed.  Your feedback will aid us getting this
update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug,
mentioning the version of the package you tested, what testing has been
performed on the package and change the tag from verification-needed-
focal to verification-done-focal. If it does not fix the bug for you,
please add a comment stating that, and change the tag to verification-
failed-focal. In either case, without details of your testing we will
not be able to proceed.

Further information regarding the verification process can be found at
https://wiki.ubuntu.com/QATeam/PerformingSRUVerification .  Thank you in
advance for helping!

N.B. The updated package will be released to -updates after the bug(s)
fixed by this package have been verified and the package has been in
-proposed for a minimum of 7 days.

** Changed in: isc-dhcp (Ubuntu Focal)
   Status: In Progress => Fix Committed

** Tags added: verification-needed-focal

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to isc-dhcp in Ubuntu.
https://bugs.launchpad.net/bugs/1926139

Title:
  dhclient: thread concurrency race leads to DHCPOFFER packets not being
  received

Status in bind9-libs package in Ubuntu:
  Won't Fix
Status in isc-dhcp package in Ubuntu:
  Invalid
Status in isc-dhcp source package in Focal:
  Fix Committed
Status in isc-dhcp source package in Jammy:
  Fix Committed

Bug description:
  [Impact]

   * Occasionally, during instance boot or machine start-up,
 dhclient will attempt to acquire a dhcp lease and fail,
 leaving the instance with no IP address and making it
 unreachable.

   * This happens about once every 100 reboots on bare metal,
 or affecting between ~0.3% to 2% of deployments on Azure
 (comment #2).
 
   * Azure uses dhclient called from cloud-init instead of
 systemd-networkd, and this is causing issues with larger
 deployments.

   * The logs of an affected dhclient produce the following:

 Listening on LPF/enp1s0/52:54:00:1c:d7:00
 Sending on   LPF/enp1s0/52:54:00:1c:d7:00
 Sending on   Socket/fallback
 DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
 DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
 ...
 (omitting 20 similar lines)
 ...
 DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
 DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
 DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
 No DHCPOFFERS received.
 No working leases in persistent database - sleeping.

   * This only impacts Focal and Jammy, where bind9-libs
 are multi-threaded (Bionic/earlier and Kinetic/later
 are single-threaded).

   * The actual problem is dhclient containing a thread
 concurrency race condition, and when the race occurs,
 the read socket is incorrectly/prematurely unwatched
 because required structures are not yet consistent,
 thus dhclient does not read any DHCPOFFER replies.

   * Detailed analysis of the issue is in comment #17.

  [Fix]

   * Prevent the race condition by starting to watch the
 read socket after required structures are consistent.
   
   * The fix has been tested in Azure w/ 13500 instances,
 and no errors have been observed (previously: 0.4%).
 
   * Anyway, in case regressions are observed, the patch
 introduces 2 switches to revert to previous behavior,
 which can be applied per-process or system-wide:
 - DHCP_FD_FLAGS_POKE=0 environment variable
 - dhcp.fd_flags_poke=0 kernel cmdline option 
   
   * (Previous approaches/discussions included reverting
  bind9-libs to single-threaded, but we concluded it
  would have more regression risk than the expected
  [some bits in comment #8, and some internal chat],
  and remove exported symbols (apparently unused, but).
  We also considered a mutex/spinlock approach, but
  later found a simpler way w/ isc lib; comment #13.)
  
  [Test Plan]

   * Synthetic reproducer with GDB to force the race
 condition, and DHCP server/client/noise injection
 is described in comment #9.
   
   * Test with the original package (problem occurs).
   
   * Test with the modified package (problem fixed).
 - Set DHCP_FD_FLAGS_POKE=0 (problem occurs).
 - Set dhcp.fd_flags_poke=0 (problem occurs).

  [Regression Potential]

   * 1) dhclient failing to acquire DHCP leases.
  
   * 2) dhcpd is also affected by code changes,
 thus failures to handle DHCP lease 

[Touch-packages] [Bug 1926139] Re: dhclient: thread concurrency race leads to DHCPOFFER packets not being received

2023-02-09 Thread Łukasz Zemczak
Hello Martijn, or anyone else affected,

Accepted isc-dhcp into jammy-proposed. The package will build now and be
available at https://launchpad.net/ubuntu/+source/isc-
dhcp/4.4.1-2.3ubuntu2.4 in a few hours, and then in the -proposed
repository.

Please help us by testing this new package.  See
https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how
to enable and use -proposed.  Your feedback will aid us getting this
update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug,
mentioning the version of the package you tested, what testing has been
performed on the package and change the tag from verification-needed-
jammy to verification-done-jammy. If it does not fix the bug for you,
please add a comment stating that, and change the tag to verification-
failed-jammy. In either case, without details of your testing we will
not be able to proceed.

Further information regarding the verification process can be found at
https://wiki.ubuntu.com/QATeam/PerformingSRUVerification .  Thank you in
advance for helping!

N.B. The updated package will be released to -updates after the bug(s)
fixed by this package have been verified and the package has been in
-proposed for a minimum of 7 days.

** Changed in: isc-dhcp (Ubuntu Jammy)
   Status: In Progress => Fix Committed

** Tags added: verification-needed verification-needed-jammy

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to isc-dhcp in Ubuntu.
https://bugs.launchpad.net/bugs/1926139

Title:
  dhclient: thread concurrency race leads to DHCPOFFER packets not being
  received

Status in bind9-libs package in Ubuntu:
  Won't Fix
Status in isc-dhcp package in Ubuntu:
  Invalid
Status in isc-dhcp source package in Focal:
  In Progress
Status in isc-dhcp source package in Jammy:
  Fix Committed

Bug description:
  [Impact]

   * Occasionally, during instance boot or machine start-up,
 dhclient will attempt to acquire a dhcp lease and fail,
 leaving the instance with no IP address and making it
 unreachable.

   * This happens about once every 100 reboots on bare metal,
 or affecting between ~0.3% to 2% of deployments on Azure
 (comment #2).
 
   * Azure uses dhclient called from cloud-init instead of
 systemd-networkd, and this is causing issues with larger
 deployments.

   * The logs of an affected dhclient produce the following:

 Listening on LPF/enp1s0/52:54:00:1c:d7:00
 Sending on   LPF/enp1s0/52:54:00:1c:d7:00
 Sending on   Socket/fallback
 DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
 DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
 ...
 (omitting 20 similar lines)
 ...
 DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
 DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
 DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
 No DHCPOFFERS received.
 No working leases in persistent database - sleeping.

   * This only impacts Focal and Jammy, where bind9-libs
 are multi-threaded (Bionic/earlier and Kinetic/later
 are single-threaded).

   * The actual problem is dhclient containing a thread
 concurrency race condition, and when the race occurs,
 the read socket is incorrectly/prematurely unwatched
 because required structures are not yet consistent,
 thus dhclient does not read any DHCPOFFER replies.

   * Detailed analysis of the issue is in comment #17.

  [Fix]

   * Prevent the race condition by starting to watch the
 read socket after required structures are consistent.
   
   * The fix has been tested in Azure w/ 13500 instances,
 and no errors have been observed (previously: 0.4%).
 
   * Anyway, in case regressions are observed, the patch
 introduces 2 switches to revert to previous behavior,
 which can be applied per-process or system-wide:
 - DHCP_FD_FLAGS_POKE=0 environment variable
 - dhcp.fd_flags_poke=0 kernel cmdline option 
   
   * (Previous approaches/discussions included reverting
  bind9-libs to single-threaded, but we concluded it
  would have more regression risk than the expected
  [some bits in comment #8, and some internal chat],
  and remove exported symbols (apparently unused, but).
  We also considered a mutex/spinlock approach, but
  later found a simpler way w/ isc lib; comment #13.)
  
  [Test Plan]

   * Synthetic reproducer with GDB to force the race
 condition, and DHCP server/client/noise injection
 is described in comment #9.
   
   * Test with the original package (problem occurs).
   
   * Test with the modified package (problem fixed).
 - Set DHCP_FD_FLAGS_POKE=0 (problem occurs).
 - Set dhcp.fd_flags_poke=0 (problem occurs).

  [Regression Potential]

   * 1) dhclient failing to acquire DHCP leases.
  
   * 2) dhcpd is also affected by code changes,
 thus failures to handle 

[Touch-packages] [Bug 1926139] Re: dhclient: thread concurrency race leads to DHCPOFFER packets not being received

2023-02-02 Thread Mauricio Faria de Oliveira
** Tags added: se-sru-sponsor-mfo

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to isc-dhcp in Ubuntu.
https://bugs.launchpad.net/bugs/1926139

Title:
  dhclient: thread concurrency race leads to DHCPOFFER packets not being
  received

Status in bind9-libs package in Ubuntu:
  Won't Fix
Status in isc-dhcp package in Ubuntu:
  Invalid
Status in isc-dhcp source package in Focal:
  In Progress
Status in isc-dhcp source package in Jammy:
  In Progress

Bug description:
  [Impact]

   * Occasionally, during instance boot or machine start-up,
 dhclient will attempt to acquire a dhcp lease and fail,
 leaving the instance with no IP address and making it
 unreachable.

   * This happens about once every 100 reboots on bare metal,
 or affecting between ~0.3% to 2% of deployments on Azure
 (comment #2).
 
   * Azure uses dhclient called from cloud-init instead of
 systemd-networkd, and this is causing issues with larger
 deployments.

   * The logs of an affected dhclient produce the following:

 Listening on LPF/enp1s0/52:54:00:1c:d7:00
 Sending on   LPF/enp1s0/52:54:00:1c:d7:00
 Sending on   Socket/fallback
 DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
 DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
 ...
 (omitting 20 similar lines)
 ...
 DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
 DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
 DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
 No DHCPOFFERS received.
 No working leases in persistent database - sleeping.

   * This only impacts Focal and Jammy, where bind9-libs
 are multi-threaded (Bionic/earlier and Kinetic/later
 are single-threaded).

   * The actual problem is dhclient containing a thread
 concurrency race condition, and when the race occurs,
 the read socket is incorrectly/prematurely unwatched
 because required structures are not yet consistent,
 thus dhclient does not read any DHCPOFFER replies.

   * Detailed analysis of the issue is in comment #17.

  [Fix]

   * Prevent the race condition by starting to watch the
 read socket after required structures are consistent.
   
   * The fix has been tested in Azure w/ 13500 instances,
 and no errors have been observed (previously: 0.4%).
 
   * Anyway, in case regressions are observed, the patch
 introduces 2 switches to revert to previous behavior,
 which can be applied per-process or system-wide:
 - DHCP_FD_FLAGS_POKE=0 environment variable
 - dhcp.fd_flags_poke=0 kernel cmdline option 
   
   * (Previous approaches/discussions included reverting
  bind9-libs to single-threaded, but we concluded it
  would have more regression risk than the expected
  [some bits in comment #8, and some internal chat],
  and remove exported symbols (apparently unused, but).
  We also considered a mutex/spinlock approach, but
  later found a simpler way w/ isc lib; comment #13.)
  
  [Test Plan]

   * Synthetic reproducer with GDB to force the race
 condition, and DHCP server/client/noise injection
 is described in comment #9.
   
   * Test with the original package (problem occurs).
   
   * Test with the modified package (problem fixed).
 - Set DHCP_FD_FLAGS_POKE=0 (problem occurs).
 - Set dhcp.fd_flags_poke=0 (problem occurs).

  [Regression Potential]

   * 1) dhclient failing to acquire DHCP leases.
  
   * 2) dhcpd is also affected by code changes,
 thus failures to handle DHCP lease requests
 also have potential for regressions.
 
   * 3) the functional change added by the fix,
 if a regression were to occur, would likely
 be an issue only under some (unknown) race
 condition as well, thus expected to be rare.

   * Note: this potentially affects Focal/Jammy 
 on Azure as a whole, per usage of dhclient
 in cloud-init instead of systemd-networkd.
 
 Azure provided extensive testing for all 3
 approaches (mostly internal communications,
 and some bug comments), with ~13k instances.
 
 No issues were observed (previously: 0.4%).
 
   * Such testing scale seems to indicate that
 there are no regressions for dhclient to
 acquire DHCP leases (1), nor another race
 condition that hit the fix/new behavior (3).
 
 With that, apparently (2) should be OK too.
 
   * Also, so to mitigate the regression risk
 as much as possible, there's very detailed
 analysis provided here (comments #17, #18)
 and more information about the fix in its
 patch file's comment.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/bind9-libs/+bug/1926139/+subscriptions


-- 
Mailing list: https://launchpad.net/~touch-packages
Post to : touch-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~touch-packages
More 

[Touch-packages] [Bug 1926139] Re: dhclient: thread concurrency race leads to DHCPOFFER packets not being received

2023-01-31 Thread Mauricio Faria de Oliveira
>From the 'Where problems could occur' section.
Considerations on regressions and approaches.

[Where problems could occur]

isc-dhcp is a core package, and any change comes with the risk that
users would not be able to receive dhcp leases with dhclient, leaving
their systems with no IP address and unreachable, and could potentially
cripple images that depend on it, e.g. Microsoft Azure uses dhclient
called from cloud-init, instead of systemd-networkd, so a regression
could potentially affect all Ubuntu users on Azure.

Additionally, the code is called whenever sockets are constructed, and
isc-dhcp-server could also be affected.

We have mitigated the risks of regression as best as possible by adding
as much detail as possible to this launchpad bug, so it is clear how the
race operates and how the patch fixes the issue.

Mauricio has additionally added a environment variable and a kernel
command line parameter, that when present, disables the fix from
operating. If a regression were to occur, users can add these parameters
to their deployments to work around any issues.

Mauricio and Matthew have decided that the individual fix route is best
in terms of lessening regression risk, as the alternate solution would
be to disable threading on bind9-libs.

Disabling threading on bind9-libs, while complete as a solution, and
removes the risk of a future regression caused by thread concurrency
issues that are currently undetected, comes with the fact that it
removes publicly exported symbols from bind9-libs, and adds others, and
changes the entire library from multithreaded to single threaded. If any
users happen to use bind9-libs outside of isc-dhcp, they would see their
applications either fail to work due to missing symbols, or performance
would change.

Disabling threading on bind9-libs is shelved, and can be looked at in
the future if necessary.

Back to the individual fix solution, Chris Patterson, has been testing
this solution at scale on Azure, and in 13k instances, has not had a
failure. With the gdb reproducer, we are confident that adding the mutex
will not prevent other parts of the software from functioning correctly.



** Description changed:

  [Impact]
  
- Occasionally, during instance boot or machine start-up, dhclient will
- attempt to acquire a dhcp lease and fail, leaving the instance with no
- IP address and making it unreachable.
+  * Occasionally, during instance boot or machine start-up,
+dhclient will attempt to acquire a dhcp lease and fail,
+leaving the instance with no IP address and making it
+unreachable.
  
- This happens about once every 100 reboots on bare metal, or Chris
- Patterson in comment #2 describes it as affecting between ~0.3% to 2% of
- deployments on Microsoft Azure. Azure uses dhclient called from cloud-
- init instead of systemd-networkd, and this is causing issues with larger
- deployments.
+  * This happens about once every 100 reboots on bare metal,
+or affecting between ~0.3% to 2% of deployments on Azure
+(comment #2).
+
+  * Azure uses dhclient called from cloud-init instead of
+systemd-networkd, and this is causing issues with larger
+deployments.
  
- The logs of an affected dhclient produce the following:
+  * The logs of an affected dhclient produce the following:
  
- Listening on LPF/enp1s0/52:54:00:1c:d7:00
- Sending on   LPF/enp1s0/52:54:00:1c:d7:00
- Sending on   Socket/fallback
- DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 3 (xid=0xd222950f)
- DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 5 (xid=0xd222950f)
- ...
- (omitting 20 similar lines)
- ...
- DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 13 (xid=0xd222950f)
- DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 8 (xid=0xd222950f)
- DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 6 (xid=0xd222950f)
- No DHCPOFFERS received.
- No working leases in persistent database - sleeping.
+Listening on LPF/enp1s0/52:54:00:1c:d7:00
+Sending on   LPF/enp1s0/52:54:00:1c:d7:00
+Sending on   Socket/fallback
+DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
+DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
+...
+(omitting 20 similar lines)
+...
+DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
+DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
+DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
+No DHCPOFFERS received.
+No working leases in persistent database - sleeping.
  
- Full log: https://paste.ubuntu.com/p/8yBfw2KR5h/
- Log of a working run: https://paste.ubuntu.com/p/N3ZgqrxyQD/
+  * This only impacts Focal and Jammy, where bind9-libs
+are multi-threaded (Bionic/earlier and Kinetic/later
+are single-threaded).
  
- The bizarre thing is when you tcpdump dhclient, we see all DHCPDISOVER
- packets being replied to with DHCPOFFER packets, but the got_one()
- callback is never called, dhclient does not read these DHCPOFFER
- packets, 

[Touch-packages] [Bug 1926139] Re: dhclient: thread concurrency race leads to DHCPOFFER packets not being received

2023-01-31 Thread Mauricio Faria de Oliveira
From the 'Other Info' section.
Detailed analysis of the issue and more.

[Other Info]

Full log: https://paste.ubuntu.com/p/8yBfw2KR5h/
Log of a working run: https://paste.ubuntu.com/p/N3ZgqrxyQD/

When you tcpdump dhclient, we see all DHCPDISCOVER packets being replied
to with DHCPOFFER packets, but the got_one() callback is never called,
dhclient does not read these DHCPOFFER packets, and continues sending
DHCPDISCOVER packets. Once it reaches 25 DHCPDISCOVER packets sent, it
gives up.

This behaviour led several bug reporters to believe it was a kernel
issue, with the kernel not pushing DHCPOFFER packets to dhclient. This
is not the case, the actual problem is dhclient containing a thread
concurrency race condition, and when the race occurs, the read socket is
closed prematurely, and dhclient does not read any of the DHCPOFFER
replies.

tcpdump: 'Screenshot of Wireshark' attached.

...

I was reading around the upstream issue trackers, and found the
following two bug reports:

https://gitlab.isc.org/isc-projects/dhcp/-/issues/264
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=996356

The ISC upstream report was actually quite detailed, and it has the same
symptoms of what we are experiencing.

Let's have a look at the root cause. The code I am using is isc-dhcp
4.4.1-2.1ubuntu5.20.04.4 from Focal.

common/discover.c

 567 void
 568 discover_interfaces(int state) {
...
1002 case AF_INET:
1003 default:
1004 status = omapi_register_io_object((omapi_object_t *)tmp,
1005   if_readsocket,
1006   0, got_one, 0, 0);
1007 break;
1008 }
...

In discover.c, we call discover_interfaces() to iterate over the
interfaces, and attempt to register a raw socket against it. We do this
by calling omapi_register_io_object() which is used for reading data,
and calls the elusive got_one() callback that you instrumented your code
to see if it was being called or not.

omapip/dispatch.c

196 /* Register an I/O handle so that we can do asynchronous I/O on it. */
197
198 isc_result_t omapi_register_io_object (omapi_object_t *h,
199int (*readfd) (omapi_object_t *),
200int (*writefd) (omapi_object_t *),
201isc_result_t (*reader)
202 (omapi_object_t *),
203isc_result_t (*writer)
204 (omapi_object_t *),
205isc_result_t (*reaper)
206 (omapi_object_t *))
207 {
...
241 /*
242  * Attach the I/O object to the isc socket library via the
243  * fdwatch function.  This allows the socket library to watch
244  * over a socket that we built.  If there are both a read and
245  * a write socket we asssume they are the same socket.
246  */
247
248 if (readfd) {
249 fd_flags |= ISC_SOCKFDWATCH_READ;
250 fd = readfd(h);
251 }
...
257
258 if (fd_flags != 0) {
259 status = isc_socket_fdwatchcreate(dhcp_gbl_ctx.socketmgr,
260   fd, fd_flags,
261   omapi_iscsock_cb,
262   obj,
263   dhcp_gbl_ctx.task,
264   >fd);
...
275 }
276
277
278 /* Find the last I/O state, if there are any. */
279 for (p = omapi_io_states.next;
280  p && p -> next; p = p -> next)
281 ;
282 if (p)
283 omapi_io_reference ( -> next, obj, MDL);
284 else
285 omapi_io_reference (_io_states.next, obj, MDL);
...

omapi_register_io_object() is called for each socket created, in this
case, the if_readsocket from discover_interfaces(). The file descriptor
is assigned ISC_SOCKFDWATCH_READ, and we enter the if statement.

The if statement calls isc_socket_fdwatchcreate(), which registers the
socket with the ISC socket manager, and sets up the callback
omapi_iscsock_cb(), to be called.

Once that has been done, we iterate over the omapi_io_states linked
list, which is a global list of registered sockets. We get to the end of
the list (or the beginning, if the list is empty), and add the socket to
the list.

Now, the bug happens between calling isc_socket_fdwatchcreate() to
register the socket with the socket manager, and adding it to the global
linked list.

Sometimes, the callback omapi_iscsock_cb() is called inbetween.

omapip/dispatch.c

101 /*
102  * Callback routine to connect the omapi I/O object and socket with
103  * the isc socket code.  The isc socket code will call this routine
104  * which will then call the correct local routine to process the bytes.
105  *
106  * Currently we are always willing to read more data, this should be 
modified
107  * so that on connections we don't read more if we already have enough.
108  *
109  * If we have more bytes to write we ask the library to call us when
110  * we can write more.  If we indicate we don't have more to write 

[Touch-packages] [Bug 1926139] Re: dhclient: thread concurrency race leads to DHCPOFFER packets not being received

2023-01-31 Thread Mauricio Faria de Oliveira
Pressing Page Down 17 times to go over the bug description sounds like a
new record! ;-)

Just in case future reviewers don't find that as exciting while going
through reviews, I'll move some text into comments and reference them
from the description.

** Bug watch added: gitlab.isc.org/isc-projects/dhcp/-/issues #264
   https://gitlab.isc.org/isc-projects/dhcp/-/issues/264

** Bug watch added: Debian Bug tracker #996356
   https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=996356

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to isc-dhcp in Ubuntu.
https://bugs.launchpad.net/bugs/1926139

Title:
  dhclient: thread concurrency race leads to DHCPOFFER packets not being
  received

Status in bind9-libs package in Ubuntu:
  Won't Fix
Status in isc-dhcp package in Ubuntu:
  Invalid
Status in isc-dhcp source package in Focal:
  In Progress
Status in isc-dhcp source package in Jammy:
  In Progress

Bug description:
  [Impact]

   * Occasionally, during instance boot or machine start-up,
 dhclient will attempt to acquire a dhcp lease and fail,
 leaving the instance with no IP address and making it
 unreachable.

   * This happens about once every 100 reboots on bare metal,
 or affecting between ~0.3% to 2% of deployments on Azure
 (comment #2).
 
   * Azure uses dhclient called from cloud-init instead of
 systemd-networkd, and this is causing issues with larger
 deployments.

   * The logs of an affected dhclient produce the following:

 Listening on LPF/enp1s0/52:54:00:1c:d7:00
 Sending on   LPF/enp1s0/52:54:00:1c:d7:00
 Sending on   Socket/fallback
 DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
 DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
 ...
 (omitting 20 similar lines)
 ...
 DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
 DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
 DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
 No DHCPOFFERS received.
 No working leases in persistent database - sleeping.

   * This only impacts Focal and Jammy, where bind9-libs
 are multi-threaded (Bionic/earlier and Kinetic/later
 are single-threaded).

   * The actual problem is dhclient containing a thread
 concurrency race condition, and when the race occurs,
 the read socket is incorrectly/prematurely unwatched
 because required structures are not yet consistent,
 thus dhclient does not read any DHCPOFFER replies.

   * Detailed analysis of the issue is in comment #17.

  [Fix]

   * Prevent the race condition by starting to watch the
 read socket after required structures are consistent.
   
   * The fix has been tested in Azure w/ 13500 instances,
 and no errors have been observed (previously: 0.4%).
 
   * Anyway, in case regressions are observed, the patch
 introduces 2 switches to revert to previous behavior,
 which can be applied per-process or system-wide:
 - DHCP_FD_FLAGS_POKE=0 environment variable
 - dhcp.fd_flags_poke=0 kernel cmdline option 
   
   * (Previous approaches/discussions included reverting
  bind9-libs to single-threaded, but we concluded it
  would have more regression risk than the expected
  [some bits in comment #8, and some internal chat],
  and remove exported symbols (apparently unused, but).
  We also considered a mutex/spinlock approach, but
  later found a simpler way w/ isc lib; comment #13.)
  
  [Test Plan]

   * Synthetic reproducer with GDB to force the race
 condition, and DHCP server/client/noise injection
 is described in comment #9.
   
   * Test with the original package (problem occurs).
   
   * Test with the modified package (problem fixed).
 - Set DHCP_FD_FLAGS_POKE=0 (problem occurs).
 - Set dhcp.fd_flags_poke=0 (problem occurs).

  [Regression Potential]

   * 1) dhclient failing to acquire DHCP leases.
  
   * 2) dhcpd is also affected by code changes,
 thus failures to handle DHCP lease requests
 also have potential for regressions.
 
   * 3) the functional change added by the fix,
 if a regression were to occur, would likely
 be an issue only under some (unknown) race
 condition as well, thus expected to be rare.

   * Note: this potentially affects Focal/Jammy 
 on Azure as a whole, per usage of dhclient
 in cloud-init instead of systemd-networkd.
 
 Azure provided extensive testing for all 3
 approaches (mostly internal communications,
 and some bug comments), with ~13k instances.
 
 No issues were observed (previously: 0.4%).
 
   * Such testing scale seems to indicate that
 there are no regressions for dhclient to
 acquire DHCP leases (1), nor another race
 condition that hit the fix/new behavior (3).
 
 With that, apparently (2) should be OK too.
 
   * Also, so to mitigate the regression risk
 as 

[Touch-packages] [Bug 1926139] Re: dhclient: thread concurrency race leads to DHCPOFFER packets not being received

2023-01-27 Thread Mauricio Faria de Oliveira
Jammy/22.04:
- test packages in ppa:mfo/lp1926139
- reproduction steps delta (based on comment #9)

...

Reproducer based on GDB and DHCP noise injection.

It uses 3 veth pairs (DHCP server/client/injector,
the latter two under namespaces) on a linux bridge.

...

LXD VM:

 lxc launch ubuntu:jammy lp1926139-jammy --vm
 lxc shell lp1926139-jammy

GDB Reproducer (original package):
==

Debug symbols:

 # wget 
https://launchpad.net/ubuntu/+archive/primary/+files/isc-dhcp-client-dbgsym_4.4.1-2.3ubuntu2.3_amd64.ddeb
 # apt install -y ./isc-dhcp-client-dbgsym_4.4.1-2.3ubuntu2.3_amd64.ddeb

Source code line numbers (for breakpoint):

 198 isc_result_t omapi_register_io_object (omapi_object_t *h,
 ...
 259 status = isc_socket_fdwatchcreate(dhcp_gbl_ctx.socketmgr,
 ...
 279 for (p = omapi_io_states.next;

Attempt to reproduce the issue
with a delay introduced via breakpoint on line 279:

 # ip netns exec ns1 \
   gdb -ex 'set target-async on' -ex 'set non-stop on' -ex 'set pagination off' 
-ex 'set confirm off' -q dhclient

 (gdb) break omapip/dispatch.c:279
 (gdb) commands
 shell sleep 0.2
 continue
 end
 (gdb) run -v -d veth1

GDB Reproducer (patched package):
==

Client & Debug symbols:

 # wget \
   
https://launchpad.net/~mfo/+archive/ubuntu/lp1926139/+files/isc-dhcp-client_4.4.1-2.3ubuntu2.3+lp1926139.2_amd64.deb
 \
   
https://launchpad.net/~mfo/+archive/ubuntu/lp1926139/+files/isc-dhcp-client-dbgsym_4.4.1-2.3ubuntu2.3+lp1926139.2_amd64.ddeb

  # sudo apt install \
 ./isc-dhcp-client_4.4.1-2.3ubuntu2.3+lp1926139.2_amd64.deb \
 ./isc-dhcp-client-dbgsym_4.4.1-2.3ubuntu2.3+lp1926139.2_amd64.ddeb

Source code line numbers (for breakpoint):

  233 isc_result_t omapi_register_io_object (omapi_object_t *h,
 ...
  312 status = isc_socket_fdwatchcreate(dhcp_gbl_ctx.socketmgr,
 ...
  333 for (p = omapi_io_states.next;

Attempt to reproduce the issue again, the same way,
with a delay introduced via breakpoint on line 333:

 # ip netns exec ns1 \
   gdb -ex 'set target-async on' -ex 'set non-stop on' -ex 'set pagination off' 
-ex 'set confirm off' -q dhclient

 (gdb) break omapip/dispatch.c:333
 (gdb) commands
 shell sleep 0.2
 continue
 end
 (gdb) run -v -d veth1

...

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to isc-dhcp in Ubuntu.
https://bugs.launchpad.net/bugs/1926139

Title:
  dhclient: thread concurrency race leads to DHCPOFFER packets not being
  received

Status in bind9-libs package in Ubuntu:
  Won't Fix
Status in isc-dhcp package in Ubuntu:
  Invalid
Status in isc-dhcp source package in Focal:
  In Progress
Status in isc-dhcp source package in Jammy:
  In Progress

Bug description:
  [Impact]

  Occasionally, during instance boot or machine start-up, dhclient will
  attempt to acquire a dhcp lease and fail, leaving the instance with no
  IP address and making it unreachable.

  This happens about once every 100 reboots on bare metal, or Chris
  Patterson in comment #2 describes it as affecting between ~0.3% to 2%
  of deployments on Microsoft Azure. Azure uses dhclient called from
  cloud-init instead of systemd-networkd, and this is causing issues
  with larger deployments.

  The logs of an affected dhclient produce the following:

  Listening on LPF/enp1s0/52:54:00:1c:d7:00
  Sending on   LPF/enp1s0/52:54:00:1c:d7:00
  Sending on   Socket/fallback
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 3 (xid=0xd222950f)
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 5 (xid=0xd222950f)
  ...
  (omitting 20 similar lines)
  ...
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 13 (xid=0xd222950f)
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 8 (xid=0xd222950f)
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 6 (xid=0xd222950f)
  No DHCPOFFERS received.
  No working leases in persistent database - sleeping.

  Full log: https://paste.ubuntu.com/p/8yBfw2KR5h/
  Log of a working run: https://paste.ubuntu.com/p/N3ZgqrxyQD/

  The bizarre thing is when you tcpdump dhclient, we see all DHCPDISOVER
  packets being replied to with DHCPOFFER packets, but the got_one()
  callback is never called, dhclient does not read these DHCPOFFER
  packets, and continues sending DHCPDISCOVER packets. Once it reaches
  25 DHCPDISCOVER packets sent, it gives up.

  tcpdump:
  Screenshot of Wireshark:

  This behaviour led several bug reporters to believe it was a kernel
  issue, with the kernel not pushing DHCPOFFER packets to dhclient. This
  is not the case, the actual problem is dhclient containing a thread
  concurrency race condition, and when the race occurs, the read socket
  is closed prematurely, and dhclient does not read any of the DHCPOFFER
  replies.

  The full explanation is in the "Other Info" section, but the fix is to
  add a mutex that restricts access to the global linked list of open
  sockets, and ensures that a newly created socket is added to 

[Touch-packages] [Bug 1926139] Re: dhclient: thread concurrency race leads to DHCPOFFER packets not being received

2023-01-26 Thread Mauricio Faria de Oliveira
** Patch added: "lp1926139_focal_isc-dhcp_v2.debdiff"
   
https://bugs.launchpad.net/ubuntu/+source/isc-dhcp/+bug/1926139/+attachment/5643626/+files/lp1926139_focal_isc-dhcp_v2.debdiff

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to isc-dhcp in Ubuntu.
https://bugs.launchpad.net/bugs/1926139

Title:
  dhclient: thread concurrency race leads to DHCPOFFER packets not being
  received

Status in bind9-libs package in Ubuntu:
  Won't Fix
Status in isc-dhcp package in Ubuntu:
  Invalid
Status in isc-dhcp source package in Focal:
  In Progress
Status in isc-dhcp source package in Jammy:
  In Progress

Bug description:
  [Impact]

  Occasionally, during instance boot or machine start-up, dhclient will
  attempt to acquire a dhcp lease and fail, leaving the instance with no
  IP address and making it unreachable.

  This happens about once every 100 reboots on bare metal, or Chris
  Patterson in comment #2 describes it as affecting between ~0.3% to 2%
  of deployments on Microsoft Azure. Azure uses dhclient called from
  cloud-init instead of systemd-networkd, and this is causing issues
  with larger deployments.

  The logs of an affected dhclient produce the following:

  Listening on LPF/enp1s0/52:54:00:1c:d7:00
  Sending on   LPF/enp1s0/52:54:00:1c:d7:00
  Sending on   Socket/fallback
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 3 (xid=0xd222950f)
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 5 (xid=0xd222950f)
  ...
  (omitting 20 similar lines)
  ...
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 13 (xid=0xd222950f)
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 8 (xid=0xd222950f)
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 6 (xid=0xd222950f)
  No DHCPOFFERS received.
  No working leases in persistent database - sleeping.

  Full log: https://paste.ubuntu.com/p/8yBfw2KR5h/
  Log of a working run: https://paste.ubuntu.com/p/N3ZgqrxyQD/

  The bizarre thing is when you tcpdump dhclient, we see all DHCPDISOVER
  packets being replied to with DHCPOFFER packets, but the got_one()
  callback is never called, dhclient does not read these DHCPOFFER
  packets, and continues sending DHCPDISCOVER packets. Once it reaches
  25 DHCPDISCOVER packets sent, it gives up.

  tcpdump:
  Screenshot of Wireshark:

  This behaviour led several bug reporters to believe it was a kernel
  issue, with the kernel not pushing DHCPOFFER packets to dhclient. This
  is not the case, the actual problem is dhclient containing a thread
  concurrency race condition, and when the race occurs, the read socket
  is closed prematurely, and dhclient does not read any of the DHCPOFFER
  replies.

  The full explanation is in the "Other Info" section, but the fix is to
  add a mutex that restricts access to the global linked list of open
  sockets, and ensures that a newly created socket is added to this
  list, before the socketmanager callback has an opportunity to walk
  this list when there is data immediately able to be read.

  Mauricio has provided such a patch, and includes options to disable
  this behaviour during runtime to minimise regression risk.

  [Testcase]

  Reproducer based on GDB and DHCP noise injection.

  It uses 3 veth pairs (DHCP server/client/injector,
  the latter two under namespaces) on a linux bridge.

  LXD VM:

   $ lxc launch ubuntu:focal lp1926139-focal --vm
   $ lxc shell lp1926139-focal

  Network Setup:

   # ip link add br0 type bridge
   # ip link set br0 up

   # ip link add veth0 type veth peer name veth0br
   # ip link set veth0 up
   # ip link set veth0br up master br0

   # ip netns add ns1
   # ip link add veth1 netns ns1 type veth peer name veth1br
   # ip -n ns1 link set veth1 up
   # ip link set veth1br up master br0

   # ip netns add ns2
   # ip link add veth2 netns ns2 type veth peer name veth2br
   # ip -n ns2 link set veth2 up
   # ip link set veth2br up master br0

  Network Check:

   # ip link show type veth | grep veth
   5: veth0br@veth0:  mtu 1500 qdisc noqueue 
master br0 state UP mode DEFAULT group default qlen 1000
   6: veth0@veth0br:  mtu 1500 qdisc noqueue 
state UP mode DEFAULT group default qlen 1000
   7: veth1br@if2:  mtu 1500 qdisc noqueue 
master br0 state UP mode DEFAULT group default qlen 1000
   8: veth2br@if2:  mtu 1500 qdisc noqueue 
master br0 state UP mode DEFAULT group default qlen 1000

   # ip -n ns1 link show type veth | grep veth
   2: veth1@if7:  mtu 1500 qdisc noqueue state 
UP mode DEFAULT group default qlen 1000

   # ip -n ns2 link show type veth | grep veth
   2: veth2@if8:  mtu 1500 qdisc noqueue state 
UP mode DEFAULT group default qlen 1000

  DHCP Server Setup:

   # apt install -y isc-dhcp-server

   # ip addr add 192.168.42.1/24 dev veth0

   # echo 'INTERFACESv4="veth0"' >>/etc/default/isc-dhcp-server

   # cat <>/etc/dhcp/dhcpd.conf
   subnet 192.168.42.0 netmask 255.255.255.0 {
 range 192.168.42.100 

[Touch-packages] [Bug 1926139] Re: dhclient: thread concurrency race leads to DHCPOFFER packets not being received

2023-01-26 Thread Mauricio Faria de Oliveira
Test packages:

https://launchpad.net/~mfo/+archive/ubuntu/lp1926139
isc-dhcp 4.4.1-2.1ubuntu5.20.04.4+lp1926139.2 

Default behavior: issue fixed.
---

(gdb) break omapip/dispatch.c:333
(gdb) commands
shell sleep 0.2
continue
end

(gdb) run -d -v veth1
...
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 3 
(xid=0x9679b264)
DHCPOFFER of 192.168.42.100 from 192.168.42.1
DHCPREQUEST for 192.168.42.100 on veth1 to 255.255.255.255 port 67 
(xid=0x64b27996)
DHCPACK of 192.168.42.100 from 192.168.42.1 (xid=0x9679b264)
...
^C
(gdb) kill

Release address.

(gdb) run -d -v veth1 -r
...


Original behavior with environment variable: issue observed.
---

(gdb) set environment DHCP_FD_FLAGS_POKE 0
(gdb) run -d -v veth1
...
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 3 
(xid=0xc2db3363)
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 5 
(xid=0xc2db3363)
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 5 
(xid=0xc2db3363)
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 6 
(xid=0xc2db3363)
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 16 
(xid=0xc2db3363)
^C
...
(gdb) kill

(gdb) unset environment DHCP_FD_FLAGS_POKE


Original behavior with kernel cmdline option: issue observed.
---

(gdb) shell echo "$(cat /proc/cmdline) dhcp.fd_flags_poke=0" 
>/tmp/cmdline
(gdb) shell mount --bind /tmp/cmdline /proc/cmdline
(gdb) shell cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-5.4.0-1084-kvm 
root=PARTUUID=a1286399-334e-4597-b30f-da227b6c076b ro console=tty1 
console=ttyS0 panic=-1 dhcp.fd_flags_poke=0

(gdb) run -d -v veth1
...
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 3 
(xid=0x938a6b0b)
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 4 
(xid=0x938a6b0b)
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 8 
(xid=0x938a6b0b)
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 20 
(xid=0x938a6b0b)
^C
...
(gdb) kill

(gdb) shell umount /proc/cmdline

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to isc-dhcp in Ubuntu.
https://bugs.launchpad.net/bugs/1926139

Title:
  dhclient: thread concurrency race leads to DHCPOFFER packets not being
  received

Status in bind9-libs package in Ubuntu:
  Won't Fix
Status in isc-dhcp package in Ubuntu:
  Invalid
Status in isc-dhcp source package in Focal:
  In Progress
Status in isc-dhcp source package in Jammy:
  In Progress

Bug description:
  [Impact]

  Occasionally, during instance boot or machine start-up, dhclient will
  attempt to acquire a dhcp lease and fail, leaving the instance with no
  IP address and making it unreachable.

  This happens about once every 100 reboots on bare metal, or Chris
  Patterson in comment #2 describes it as affecting between ~0.3% to 2%
  of deployments on Microsoft Azure. Azure uses dhclient called from
  cloud-init instead of systemd-networkd, and this is causing issues
  with larger deployments.

  The logs of an affected dhclient produce the following:

  Listening on LPF/enp1s0/52:54:00:1c:d7:00
  Sending on   LPF/enp1s0/52:54:00:1c:d7:00
  Sending on   Socket/fallback
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 3 (xid=0xd222950f)
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 5 (xid=0xd222950f)
  ...
  (omitting 20 similar lines)
  ...
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 13 (xid=0xd222950f)
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 8 (xid=0xd222950f)
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 6 (xid=0xd222950f)
  No DHCPOFFERS received.
  No working leases in persistent database - sleeping.

  Full log: https://paste.ubuntu.com/p/8yBfw2KR5h/
  Log of a working run: https://paste.ubuntu.com/p/N3ZgqrxyQD/

  The bizarre thing is when you tcpdump dhclient, we see all DHCPDISOVER
  packets being replied to with DHCPOFFER packets, but the got_one()
  callback is never called, dhclient does not read these DHCPOFFER
  packets, and continues sending DHCPDISCOVER packets. Once it reaches
  25 DHCPDISCOVER packets sent, it gives up.

  tcpdump:
  Screenshot of Wireshark:

  This behaviour led several bug reporters to believe it was a kernel
  issue, with the kernel not pushing DHCPOFFER packets to dhclient. This
  is not the case, the actual problem is dhclient containing a thread
  concurrency race condition, and when the race occurs, the read socket
  is closed prematurely, and dhclient does not read any of the DHCPOFFER
  replies.

  The full explanation is in the "Other Info" section, but the fix is to
  add a mutex that restricts access to the global linked 

[Touch-packages] [Bug 1926139] Re: dhclient: thread concurrency race leads to DHCPOFFER packets not being received

2023-01-26 Thread Mauricio Faria de Oliveira
Hey Matthew, Chris,

Apparently there's a simpler, less intrusive, and more specific way to
do this.

Apologies that I missed this earlier, but I found more about the
possibilities in bind9-libs functions while checking the previous fix
approach for regressions.

Could you please provide your thoughts, Matthew?

If it looks good for you, please feel free to discuss additional testing
with Chris, if at all possible.

P.S.: the workaround disable switches are in, via environment variable
and kernel cmdline option.

Thanks!
Mauricio

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to isc-dhcp in Ubuntu.
https://bugs.launchpad.net/bugs/1926139

Title:
  dhclient: thread concurrency race leads to DHCPOFFER packets not being
  received

Status in bind9-libs package in Ubuntu:
  Won't Fix
Status in isc-dhcp package in Ubuntu:
  Invalid
Status in isc-dhcp source package in Focal:
  In Progress
Status in isc-dhcp source package in Jammy:
  In Progress

Bug description:
  [Impact]

  Occasionally, during instance boot or machine start-up, dhclient will
  attempt to acquire a dhcp lease and fail, leaving the instance with no
  IP address and making it unreachable.

  This happens about once every 100 reboots on bare metal, or Chris
  Patterson in comment #2 describes it as affecting between ~0.3% to 2%
  of deployments on Microsoft Azure. Azure uses dhclient called from
  cloud-init instead of systemd-networkd, and this is causing issues
  with larger deployments.

  The logs of an affected dhclient produce the following:

  Listening on LPF/enp1s0/52:54:00:1c:d7:00
  Sending on   LPF/enp1s0/52:54:00:1c:d7:00
  Sending on   Socket/fallback
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 3 (xid=0xd222950f)
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 5 (xid=0xd222950f)
  ...
  (omitting 20 similar lines)
  ...
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 13 (xid=0xd222950f)
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 8 (xid=0xd222950f)
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 6 (xid=0xd222950f)
  No DHCPOFFERS received.
  No working leases in persistent database - sleeping.

  Full log: https://paste.ubuntu.com/p/8yBfw2KR5h/
  Log of a working run: https://paste.ubuntu.com/p/N3ZgqrxyQD/

  The bizarre thing is when you tcpdump dhclient, we see all DHCPDISOVER
  packets being replied to with DHCPOFFER packets, but the got_one()
  callback is never called, dhclient does not read these DHCPOFFER
  packets, and continues sending DHCPDISCOVER packets. Once it reaches
  25 DHCPDISCOVER packets sent, it gives up.

  tcpdump:
  Screenshot of Wireshark:

  This behaviour led several bug reporters to believe it was a kernel
  issue, with the kernel not pushing DHCPOFFER packets to dhclient. This
  is not the case, the actual problem is dhclient containing a thread
  concurrency race condition, and when the race occurs, the read socket
  is closed prematurely, and dhclient does not read any of the DHCPOFFER
  replies.

  The full explanation is in the "Other Info" section, but the fix is to
  add a mutex that restricts access to the global linked list of open
  sockets, and ensures that a newly created socket is added to this
  list, before the socketmanager callback has an opportunity to walk
  this list when there is data immediately able to be read.

  Mauricio has provided such a patch, and includes options to disable
  this behaviour during runtime to minimise regression risk.

  [Testcase]

  Reproducer based on GDB and DHCP noise injection.

  It uses 3 veth pairs (DHCP server/client/injector,
  the latter two under namespaces) on a linux bridge.

  LXD VM:

   $ lxc launch ubuntu:focal lp1926139-focal --vm
   $ lxc shell lp1926139-focal

  Network Setup:

   # ip link add br0 type bridge
   # ip link set br0 up

   # ip link add veth0 type veth peer name veth0br
   # ip link set veth0 up
   # ip link set veth0br up master br0

   # ip netns add ns1
   # ip link add veth1 netns ns1 type veth peer name veth1br
   # ip -n ns1 link set veth1 up
   # ip link set veth1br up master br0

   # ip netns add ns2
   # ip link add veth2 netns ns2 type veth peer name veth2br
   # ip -n ns2 link set veth2 up
   # ip link set veth2br up master br0

  Network Check:

   # ip link show type veth | grep veth
   5: veth0br@veth0:  mtu 1500 qdisc noqueue 
master br0 state UP mode DEFAULT group default qlen 1000
   6: veth0@veth0br:  mtu 1500 qdisc noqueue 
state UP mode DEFAULT group default qlen 1000
   7: veth1br@if2:  mtu 1500 qdisc noqueue 
master br0 state UP mode DEFAULT group default qlen 1000
   8: veth2br@if2:  mtu 1500 qdisc noqueue 
master br0 state UP mode DEFAULT group default qlen 1000

   # ip -n ns1 link show type veth | grep veth
   2: veth1@if7:  mtu 1500 qdisc noqueue state 
UP mode DEFAULT group default qlen 1000

   # ip -n ns2 link show type veth | grep veth
   2: 

[Touch-packages] [Bug 1926139] Re: dhclient: thread concurrency race leads to DHCPOFFER packets not being received

2023-01-24 Thread Matthew Ruffell
** Description changed:

  [Impact]
  
  Occasionally, during instance boot or machine start-up, dhclient will
  attempt to acquire a dhcp lease and fail, leaving the instance with no
  IP address and making it unreachable.
  
  This happens about once every 100 reboots on bare metal, or Chris
  Patterson in comment #2 describes it as affecting between ~0.3% to 2% of
  deployments on Microsoft Azure. Azure uses dhclient called from cloud-
  init instead of systemd-networkd, and this is causing issues with larger
  deployments.
  
  The logs of an affected dhclient produce the following:
  
  Listening on LPF/enp1s0/52:54:00:1c:d7:00
  Sending on   LPF/enp1s0/52:54:00:1c:d7:00
  Sending on   Socket/fallback
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 3 (xid=0xd222950f)
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 5 (xid=0xd222950f)
  ...
  (omitting 20 similar lines)
  ...
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 13 (xid=0xd222950f)
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 8 (xid=0xd222950f)
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 6 (xid=0xd222950f)
  No DHCPOFFERS received.
  No working leases in persistent database - sleeping.
  
  Full log: https://paste.ubuntu.com/p/8yBfw2KR5h/
  Log of a working run: https://paste.ubuntu.com/p/N3ZgqrxyQD/
  
  The bizarre thing is when you tcpdump dhclient, we see all DHCPDISOVER
  packets being replied to with DHCPOFFER packets, but the got_one()
  callback is never called, dhclient does not read these DHCPOFFER
  packets, and continues sending DHCPDISCOVER packets. Once it reaches 25
  DHCPDISCOVER packets sent, it gives up.
  
- tcpdump: 
https://bugs.launchpad.net/ubuntu/+source/isc-dhcp/+bug/1926139/+attachment/5641810/+files/test.pcap
- Screenshot of Wireshark: 
https://bugs.launchpad.net/ubuntu/+source/isc-dhcp/+bug/1926139/+attachment/5641811/+files/Screenshot_2023-01-17-16-14-21_1920x1200%250A1920x1080%250A1920x1080.png
+ tcpdump:
+ Screenshot of Wireshark:
  
  This behaviour led several bug reporters to believe it was a kernel
  issue, with the kernel not pushing DHCPOFFER packets to dhclient. This
  is not the case, the actual problem is dhclient containing a thread
  concurrency race condition, and when the race occurs, the read socket is
  closed prematurely, and dhclient does not read any of the DHCPOFFER
  replies.
  
- The full explanation is in the "Other Info" section, but the fix for
- this is to change bind9-libs from being built multithreaded, back to
- single threaded as intended by dhclient maintainers.
- 
- In Focal and Jammy, isc-dhcp links against bind9 libraries provided in
- bind9-libs, while in Kinetic onward isc-dhcp has an in-tree bind9
- library it uses, which is already configured properly to --disable-
- threads.
- 
- Change the Focal and Jammy bind9-libs to --disable-threads and update
- symbol files to reflect the library is single threaded again.
+ The full explanation is in the "Other Info" section, but the fix is to
+ add a mutex that restricts access to the global linked list of open
+ sockets, and ensures that a newly created socket is added to this list,
+ before the socketmanager callback has an opportunity to walk this list
+ when there is data immediately able to be read.
+ 
+ Mauricio has provided such a patch, and includes options to disable this
+ behaviour during runtime to minimise regression risk.
  
  [Testcase]
  
- Start a fresh Focal or Jammy instance.
- 
- Download and set executable test-parallel.sh, and edit some lines:
- 
- 1) wget 
https://bugs.launchpad.net/ubuntu/+source/isc-dhcp/+bug/1926139/+attachment/5593045/+files/test-parallel.sh
- 2) chmod +x test-parallel.sh
- 3) vim test-parallel.sh
- 
- Change iface="enp5s0" to your interface, likely iface="enp1s0".
- Comment out the line "#   cp bionic-dhclient $workdir/dhclient".
- 
- 4) sudo ./test-parallel.sh
- 
- After five minutes, if you issue reproduces, you will see "TEST FAILED".
- 
- You can watch the output with:
- 
- 5) cat /tmp/dhclient-* | less
- 
- Next, for instrumented runs, you need to build dhclient from source.
- 
- 1) sudo apt install build-essential devscripts
- 2) apt source isc-dhcp
- 3) sudo apt build-dep isc-dhcp
- 4) cd isc-dhcp
- 
- Apply the below patch:
- 
- https://paste.ubuntu.com/p/hGsssrVyG4/
- 
- 5) patch -p1 < ~/patch.patch
- 6) debuild -b -uc -us
- 7) cd ..
- 8) sudo dpkg -i isc-dhcp-client-*
- 9) sudo ./test-parallel.sh
- 10) cat /tmp/dhclient-* | less
- 
- Look for the race, as described in "Other Info", namely:
- 
- mruffell: registering with socket manager
- mruffell: callback called
- mruffell: omapi object is NULL
- mruffell: omapi object is NULL
- mruffell: Adding obj to linked list
- mruffell: Obj added to list
- 
- The issue has reproduced.
- 
- If you install the test package from the following ppa:
- 
- https://launchpad.net/~mruffell/+archive/ubuntu/sf337873-test
- 
- Instructions to install (on a Focal or Jammy 

[Touch-packages] [Bug 1926139] Re: dhclient: thread concurrency race leads to DHCPOFFER packets not being received

2023-01-22 Thread Chris Patterson
Great work Maurico, I think you make several excellent points and I
appreciate your efforts on a better reproducer and alternative patch.
FWIW I began testing the Matthew's initial build (which disabled
threads) against a large number of VMs and that appeared to address the
issues we're seeing.  I'm cutting those tests short and am updating the
tests now to use your patch as provided by Matthew and we'll see how
that goes!

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to isc-dhcp in Ubuntu.
https://bugs.launchpad.net/bugs/1926139

Title:
  dhclient: thread concurrency race leads to DHCPOFFER packets not being
  received

Status in bind9-libs package in Ubuntu:
  Fix Released
Status in isc-dhcp package in Ubuntu:
  Invalid
Status in bind9-libs source package in Focal:
  In Progress
Status in bind9-libs source package in Jammy:
  In Progress

Bug description:
  [Impact]

  Occasionally, during instance boot or machine start-up, dhclient will
  attempt to acquire a dhcp lease and fail, leaving the instance with no
  IP address and making it unreachable.

  This happens about once every 100 reboots on bare metal, or Chris
  Patterson in comment #2 describes it as affecting between ~0.3% to 2%
  of deployments on Microsoft Azure. Azure uses dhclient called from
  cloud-init instead of systemd-networkd, and this is causing issues
  with larger deployments.

  The logs of an affected dhclient produce the following:

  Listening on LPF/enp1s0/52:54:00:1c:d7:00
  Sending on   LPF/enp1s0/52:54:00:1c:d7:00
  Sending on   Socket/fallback
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 3 (xid=0xd222950f)
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 5 (xid=0xd222950f)
  ...
  (omitting 20 similar lines)
  ...
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 13 (xid=0xd222950f)
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 8 (xid=0xd222950f)
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 6 (xid=0xd222950f)
  No DHCPOFFERS received.
  No working leases in persistent database - sleeping.

  Full log: https://paste.ubuntu.com/p/8yBfw2KR5h/
  Log of a working run: https://paste.ubuntu.com/p/N3ZgqrxyQD/

  The bizarre thing is when you tcpdump dhclient, we see all DHCPDISOVER
  packets being replied to with DHCPOFFER packets, but the got_one()
  callback is never called, dhclient does not read these DHCPOFFER
  packets, and continues sending DHCPDISCOVER packets. Once it reaches
  25 DHCPDISCOVER packets sent, it gives up.

  tcpdump: 
https://bugs.launchpad.net/ubuntu/+source/isc-dhcp/+bug/1926139/+attachment/5641810/+files/test.pcap
  Screenshot of Wireshark: 
https://bugs.launchpad.net/ubuntu/+source/isc-dhcp/+bug/1926139/+attachment/5641811/+files/Screenshot_2023-01-17-16-14-21_1920x1200%250A1920x1080%250A1920x1080.png

  This behaviour led several bug reporters to believe it was a kernel
  issue, with the kernel not pushing DHCPOFFER packets to dhclient. This
  is not the case, the actual problem is dhclient containing a thread
  concurrency race condition, and when the race occurs, the read socket
  is closed prematurely, and dhclient does not read any of the DHCPOFFER
  replies.

  The full explanation is in the "Other Info" section, but the fix for
  this is to change bind9-libs from being built multithreaded, back to
  single threaded as intended by dhclient maintainers.

  In Focal and Jammy, isc-dhcp links against bind9 libraries provided in
  bind9-libs, while in Kinetic onward isc-dhcp has an in-tree bind9
  library it uses, which is already configured properly to --disable-
  threads.

  Change the Focal and Jammy bind9-libs to --disable-threads and update
  symbol files to reflect the library is single threaded again.

  [Testcase]

  Start a fresh Focal or Jammy instance.

  Download and set executable test-parallel.sh, and edit some lines:

  1) wget 
https://bugs.launchpad.net/ubuntu/+source/isc-dhcp/+bug/1926139/+attachment/5593045/+files/test-parallel.sh
  2) chmod +x test-parallel.sh
  3) vim test-parallel.sh

  Change iface="enp5s0" to your interface, likely iface="enp1s0".
  Comment out the line "#   cp bionic-dhclient $workdir/dhclient".

  4) sudo ./test-parallel.sh

  After five minutes, if you issue reproduces, you will see "TEST
  FAILED".

  You can watch the output with:

  5) cat /tmp/dhclient-* | less

  Next, for instrumented runs, you need to build dhclient from source.

  1) sudo apt install build-essential devscripts
  2) apt source isc-dhcp
  3) sudo apt build-dep isc-dhcp
  4) cd isc-dhcp

  Apply the below patch:

  https://paste.ubuntu.com/p/hGsssrVyG4/

  5) patch -p1 < ~/patch.patch
  6) debuild -b -uc -us
  7) cd ..
  8) sudo dpkg -i isc-dhcp-client-*
  9) sudo ./test-parallel.sh
  10) cat /tmp/dhclient-* | less

  Look for the race, as described in "Other Info", namely:

  mruffell: registering with socket manager
  mruffell: callback 

[Touch-packages] [Bug 1926139] Re: dhclient: thread concurrency race leads to DHCPOFFER packets not being received

2023-01-18 Thread Mauricio Faria de Oliveira
** Patch added: "lp1926139_focal_isc-dhcp.debdiff"
   
https://bugs.launchpad.net/ubuntu/+source/isc-dhcp/+bug/1926139/+attachment/5642442/+files/lp1926139_focal_isc-dhcp.debdiff

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to isc-dhcp in Ubuntu.
https://bugs.launchpad.net/bugs/1926139

Title:
  dhclient: thread concurrency race leads to DHCPOFFER packets not being
  received

Status in bind9-libs package in Ubuntu:
  Fix Released
Status in isc-dhcp package in Ubuntu:
  Invalid
Status in bind9-libs source package in Focal:
  In Progress
Status in bind9-libs source package in Jammy:
  In Progress

Bug description:
  [Impact]

  Occasionally, during instance boot or machine start-up, dhclient will
  attempt to acquire a dhcp lease and fail, leaving the instance with no
  IP address and making it unreachable.

  This happens about once every 100 reboots on bare metal, or Chris
  Patterson in comment #2 describes it as affecting between ~0.3% to 2%
  of deployments on Microsoft Azure. Azure uses dhclient called from
  cloud-init instead of systemd-networkd, and this is causing issues
  with larger deployments.

  The logs of an affected dhclient produce the following:

  Listening on LPF/enp1s0/52:54:00:1c:d7:00
  Sending on   LPF/enp1s0/52:54:00:1c:d7:00
  Sending on   Socket/fallback
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 3 (xid=0xd222950f)
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 5 (xid=0xd222950f)
  ...
  (omitting 20 similar lines)
  ...
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 13 (xid=0xd222950f)
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 8 (xid=0xd222950f)
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 6 (xid=0xd222950f)
  No DHCPOFFERS received.
  No working leases in persistent database - sleeping.

  Full log: https://paste.ubuntu.com/p/8yBfw2KR5h/
  Log of a working run: https://paste.ubuntu.com/p/N3ZgqrxyQD/

  The bizarre thing is when you tcpdump dhclient, we see all DHCPDISOVER
  packets being replied to with DHCPOFFER packets, but the got_one()
  callback is never called, dhclient does not read these DHCPOFFER
  packets, and continues sending DHCPDISCOVER packets. Once it reaches
  25 DHCPDISCOVER packets sent, it gives up.

  tcpdump: 
https://bugs.launchpad.net/ubuntu/+source/isc-dhcp/+bug/1926139/+attachment/5641810/+files/test.pcap
  Screenshot of Wireshark: 
https://bugs.launchpad.net/ubuntu/+source/isc-dhcp/+bug/1926139/+attachment/5641811/+files/Screenshot_2023-01-17-16-14-21_1920x1200%250A1920x1080%250A1920x1080.png

  This behaviour led several bug reporters to believe it was a kernel
  issue, with the kernel not pushing DHCPOFFER packets to dhclient. This
  is not the case, the actual problem is dhclient containing a thread
  concurrency race condition, and when the race occurs, the read socket
  is closed prematurely, and dhclient does not read any of the DHCPOFFER
  replies.

  The full explanation is in the "Other Info" section, but the fix for
  this is to change bind9-libs from being built multithreaded, back to
  single threaded as intended by dhclient maintainers.

  In Focal and Jammy, isc-dhcp links against bind9 libraries provided in
  bind9-libs, while in Kinetic onward isc-dhcp has an in-tree bind9
  library it uses, which is already configured properly to --disable-
  threads.

  Change the Focal and Jammy bind9-libs to --disable-threads and update
  symbol files to reflect the library is single threaded again.

  [Testcase]

  Start a fresh Focal or Jammy instance.

  Download and set executable test-parallel.sh, and edit some lines:

  1) wget 
https://bugs.launchpad.net/ubuntu/+source/isc-dhcp/+bug/1926139/+attachment/5593045/+files/test-parallel.sh
  2) chmod +x test-parallel.sh
  3) vim test-parallel.sh

  Change iface="enp5s0" to your interface, likely iface="enp1s0".
  Comment out the line "#   cp bionic-dhclient $workdir/dhclient".

  4) sudo ./test-parallel.sh

  After five minutes, if you issue reproduces, you will see "TEST
  FAILED".

  You can watch the output with:

  5) cat /tmp/dhclient-* | less

  Next, for instrumented runs, you need to build dhclient from source.

  1) sudo apt install build-essential devscripts
  2) apt source isc-dhcp
  3) sudo apt build-dep isc-dhcp
  4) cd isc-dhcp

  Apply the below patch:

  https://paste.ubuntu.com/p/hGsssrVyG4/

  5) patch -p1 < ~/patch.patch
  6) debuild -b -uc -us
  7) cd ..
  8) sudo dpkg -i isc-dhcp-client-*
  9) sudo ./test-parallel.sh
  10) cat /tmp/dhclient-* | less

  Look for the race, as described in "Other Info", namely:

  mruffell: registering with socket manager
  mruffell: callback called
  mruffell: omapi object is NULL
  mruffell: omapi object is NULL
  mruffell: Adding obj to linked list
  mruffell: Obj added to list

  The issue has reproduced.

  If you install the test package from the following ppa:

  

[Touch-packages] [Bug 1926139] Re: dhclient: thread concurrency race leads to DHCPOFFER packets not being received

2023-01-18 Thread Mauricio Faria de Oliveira
Hi Matthew,

Thanks for the excellent analysis and considerate fix proposal, as
always!

I looked at this for the last couple of days, for potential sponsorship.

I have attentively gone through the SRU template and Other Info section,
and considered the proposal to switch bind9-libs into --disable-threads,
with the goal of not only address this issue, but also prevent others:

> So, we have two options for a fix for Focal and Jammy:
> 
> 1) We disable threading for dhclient.
> 2) We add in a mutex to resolve this particular concurrency issue.
> [...]
> I think if we fix the problem, another issue will crop up in six months
> time, and it will be another concurrency issue.

...

I'm aware you realize such change is concerning :) thus explained it
well.

Changing this is Focal (around for almost 3 years) brings regression risk
to an amount I have the _impression_ the SRU team would not be okay with.

And even though I agree with your analysis, proposal and risk assessment,
I'm a bit concerned too, specially as this touches DHCP / IP addressing.

(I'm also very aware this is ultimately their call, not mine at all. :)

...

However, considering how much work and time have likely gone into this
(and internal status) I can't just say 'no' without trying to help out.

I'd like to bring a different opinion.

The reason it's concerning is the very same reason 2) is reasonable:

This concurrency issue (and potential for other concurrency issues)
has been around with Focal since 2020/04 (~3 years), and until now,
its impact does not seem to statistically significant:

> This happens about once every 100 reboots on bare metal, or [...]
> affecting between ~0.3% to 2% of deployments on Microsoft Azure.

So, if there's a way to fix this particular concurrency issue with
less regression risk, that might be worth it, as it would build on
top of dhclient's life on Focal, instead of starting it over again.

...

So, while reviewing the source code for your analysis, I had ideas.

First, a synthetic reproducer with GDB that works every time.

Second, a patch that addressed the issue with the test above.
(It's not final form, I'd like to add a way to turn it off.)

...

Could you please review and verify both, and share your 
thoughts on possibly going with that proposal instead?

Of course, if you disagree with the argument or approach,
or if turns out not to work on your end/tests, that's OK!

We would defer this to the Foundations team and SRU team.

- Test steps in the next comment.
- Test packages in ppa:mfo/lp1926139 [1].
- Debdiff attached for reference (code has details).

(Right now only Focal patches/packages are available.
I can go look at Jammy depending on your feedback.)

Hope this helps, after all.
Thanks again!

[1] https://launchpad.net/~mfo/+archive/ubuntu/lp1926139

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to isc-dhcp in Ubuntu.
https://bugs.launchpad.net/bugs/1926139

Title:
  dhclient: thread concurrency race leads to DHCPOFFER packets not being
  received

Status in bind9-libs package in Ubuntu:
  Fix Released
Status in isc-dhcp package in Ubuntu:
  Invalid
Status in bind9-libs source package in Focal:
  In Progress
Status in bind9-libs source package in Jammy:
  In Progress

Bug description:
  [Impact]

  Occasionally, during instance boot or machine start-up, dhclient will
  attempt to acquire a dhcp lease and fail, leaving the instance with no
  IP address and making it unreachable.

  This happens about once every 100 reboots on bare metal, or Chris
  Patterson in comment #2 describes it as affecting between ~0.3% to 2%
  of deployments on Microsoft Azure. Azure uses dhclient called from
  cloud-init instead of systemd-networkd, and this is causing issues
  with larger deployments.

  The logs of an affected dhclient produce the following:

  Listening on LPF/enp1s0/52:54:00:1c:d7:00
  Sending on   LPF/enp1s0/52:54:00:1c:d7:00
  Sending on   Socket/fallback
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 3 (xid=0xd222950f)
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 5 (xid=0xd222950f)
  ...
  (omitting 20 similar lines)
  ...
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 13 (xid=0xd222950f)
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 8 (xid=0xd222950f)
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 6 (xid=0xd222950f)
  No DHCPOFFERS received.
  No working leases in persistent database - sleeping.

  Full log: https://paste.ubuntu.com/p/8yBfw2KR5h/
  Log of a working run: https://paste.ubuntu.com/p/N3ZgqrxyQD/

  The bizarre thing is when you tcpdump dhclient, we see all DHCPDISOVER
  packets being replied to with DHCPOFFER packets, but the got_one()
  callback is never called, dhclient does not read these DHCPOFFER
  packets, and continues sending DHCPDISCOVER packets. Once it reaches
  25 DHCPDISCOVER packets sent, it gives up.

  tcpdump: 

[Touch-packages] [Bug 1926139] Re: dhclient: thread concurrency race leads to DHCPOFFER packets not being received

2023-01-18 Thread Mauricio Faria de Oliveira
Reproducer based on GDB and DHCP noise injection.

It uses 3 veth pairs (DHCP server/client/injector,
the latter two under namespaces) on a linux bridge.

LXD VM:

$ lxc launch ubuntu:focal lp1926139-focal --vm
$ lxc shell lp1926139-focal

Network Setup:

# ip link add br0 type bridge
# ip link set br0 up

# ip link add veth0 type veth peer name veth0br
# ip link set veth0 up
# ip link set veth0br up master br0

# ip netns add ns1
# ip link add veth1 netns ns1 type veth peer name veth1br
# ip -n ns1 link set veth1 up
# ip link set veth1br up master br0

# ip netns add ns2
# ip link add veth2 netns ns2 type veth peer name veth2br
# ip -n ns2 link set veth2 up
# ip link set veth2br up master br0

Network Check:

# ip link show type veth | grep veth
5: veth0br@veth0:  mtu 1500 qdisc 
noqueue master br0 state UP mode DEFAULT group default qlen 1000
6: veth0@veth0br:  mtu 1500 qdisc 
noqueue state UP mode DEFAULT group default qlen 1000
7: veth1br@if2:  mtu 1500 qdisc 
noqueue master br0 state UP mode DEFAULT group default qlen 1000
8: veth2br@if2:  mtu 1500 qdisc 
noqueue master br0 state UP mode DEFAULT group default qlen 1000

# ip -n ns1 link show type veth | grep veth
2: veth1@if7:  mtu 1500 qdisc noqueue 
state UP mode DEFAULT group default qlen 1000

# ip -n ns2 link show type veth | grep veth
2: veth2@if8:  mtu 1500 qdisc noqueue 
state UP mode DEFAULT group default qlen 1000

DHCP Server Setup:

# apt install -y isc-dhcp-server

# ip addr add 192.168.42.1/24 dev veth0

# echo 'INTERFACESv4="veth0"' >>/etc/default/isc-dhcp-server

# cat <>/etc/dhcp/dhcpd.conf
subnet 192.168.42.0 netmask 255.255.255.0 {
  range 192.168.42.100 192.168.42.200;
}
EOF

# systemctl restart isc-dhcp-server.service
# systemctl status isc-dhcp-server.service | grep Active:
 Active: active (running) since Thu 2023-01-19 02:06:18 UTC; 19s ago

# ss -nlp | grep 0.0.0.0:67
udp UNCONN   00 
0.0.0.0:67 0.0.0.0:*
 users:(("dhcpd",pid=3279,fd=9))

DHCP Server Check:

# ip netns exec ns1 \
  dhclient -v veth1
...
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 3 
(xid=0xd147ab17)
DHCPOFFER of 192.168.42.100 from 192.168.42.1
DHCPREQUEST for 192.168.42.100 on veth1 to 255.255.255.255 port 67 
(xid=0x17ab47d1)
DHCPACK of 192.168.42.100 from 192.168.42.1 (xid=0xd147ab17)
bound to 192.168.42.100 -- renewal in 245 seconds.

# ip netns exec ns1 \
  dhclient -v veth1 -r
...
DHCPRELEASE of 192.168.42.100 on veth1 to 192.168.42.1 port 67 
(xid=0x1cd4aacf)

DHCP Noise Setup:

# ip -n ns2 addr add 192.168.42.2/24 dev veth2

# ip netns exec ns2 \
  /bin/sh -c 'while sleep 0.1; do echo; done | nc -u -v -b -s 
192.168.42.2 -p 67 255.255.255.255 68' &
Connection to 255.255.255.255 68 port [udp/bootpc] succeeded!

i.e., every 0.1 seconds, broadcast a message as DHCP (port 67)
to DHCP client receive (port 68).

DHCP Noise Check:

# tcpdump -i veth0 -n 'udp and host 255.255.255.255' -c 10
...
02:13:26.993233 IP 192.168.42.2.67 > 255.255.255.255.68: BOOTP/DHCP, 
unknown (0x0a) [|bootp]
02:13:27.098317 IP 192.168.42.2.67 > 255.255.255.255.68: BOOTP/DHCP, 
unknown (0x0a) [|bootp]
02:13:27.205879 IP 192.168.42.2.67 > 255.255.255.255.68: BOOTP/DHCP, 
unknown (0x0a) [|bootp]
02:13:27.314234 IP 192.168.42.2.67 > 255.255.255.255.68: BOOTP/DHCP, 
unknown (0x0a) [|bootp]
02:13:27.424486 IP 192.168.42.2.67 > 255.255.255.255.68: BOOTP/DHCP, 
unknown (0x0a) [|bootp]
02:13:27.532431 IP 192.168.42.2.67 > 255.255.255.255.68: BOOTP/DHCP, 
unknown (0x0a) [|bootp]
02:13:27.639614 IP 192.168.42.2.67 > 255.255.255.255.68: BOOTP/DHCP, 
unknown (0x0a) [|bootp]
02:13:27.747633 IP 192.168.42.2.67 > 255.255.255.255.68: BOOTP/DHCP, 
unknown (0x0a) [|bootp]
02:13:27.864037 IP 192.168.42.2.67 > 255.255.255.255.68: BOOTP/DHCP, 
unknown (0x0a) [|bootp]
02:13:27.977402 IP 192.168.42.2.67 > 255.255.255.255.68: BOOTP/DHCP, 
unknown (0x0a) [|bootp]
...

GDB Reproducer (original package):
==

# apt install -y gdb

Capture DHCP Server's UDP packets for reference:

# tcpdump -i veth0 -n 'udp and host 192.168.42.1' -w 
veth0-udp-192-168-42-1.pcap & pid=$!

Debug symbols:

# wget 
https://launchpad.net/ubuntu/+archive/primary/+files/isc-dhcp-client-dbgsym_4.4.1-2.1ubuntu5.20.04.4_amd64.ddeb
# apt install -y 

[Touch-packages] [Bug 1926139] Re: dhclient: thread concurrency race leads to DHCPOFFER packets not being received

2023-01-17 Thread Mauricio Faria de Oliveira
** Tags removed: sts-sponsor

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to isc-dhcp in Ubuntu.
https://bugs.launchpad.net/bugs/1926139

Title:
  dhclient: thread concurrency race leads to DHCPOFFER packets not being
  received

Status in bind9-libs package in Ubuntu:
  Fix Released
Status in isc-dhcp package in Ubuntu:
  Invalid
Status in bind9-libs source package in Focal:
  In Progress
Status in bind9-libs source package in Jammy:
  In Progress

Bug description:
  [Impact]

  Occasionally, during instance boot or machine start-up, dhclient will
  attempt to acquire a dhcp lease and fail, leaving the instance with no
  IP address and making it unreachable.

  This happens about once every 100 reboots on bare metal, or Chris
  Patterson in comment #2 describes it as affecting between ~0.3% to 2%
  of deployments on Microsoft Azure. Azure uses dhclient called from
  cloud-init instead of systemd-networkd, and this is causing issues
  with larger deployments.

  The logs of an affected dhclient produce the following:

  Listening on LPF/enp1s0/52:54:00:1c:d7:00
  Sending on   LPF/enp1s0/52:54:00:1c:d7:00
  Sending on   Socket/fallback
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 3 (xid=0xd222950f)
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 5 (xid=0xd222950f)
  ...
  (omitting 20 similar lines)
  ...
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 13 (xid=0xd222950f)
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 8 (xid=0xd222950f)
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 6 (xid=0xd222950f)
  No DHCPOFFERS received.
  No working leases in persistent database - sleeping.

  Full log: https://paste.ubuntu.com/p/8yBfw2KR5h/
  Log of a working run: https://paste.ubuntu.com/p/N3ZgqrxyQD/

  The bizarre thing is when you tcpdump dhclient, we see all DHCPDISOVER
  packets being replied to with DHCPOFFER packets, but the got_one()
  callback is never called, dhclient does not read these DHCPOFFER
  packets, and continues sending DHCPDISCOVER packets. Once it reaches
  25 DHCPDISCOVER packets sent, it gives up.

  tcpdump: 
https://bugs.launchpad.net/ubuntu/+source/isc-dhcp/+bug/1926139/+attachment/5641810/+files/test.pcap
  Screenshot of Wireshark: 
https://bugs.launchpad.net/ubuntu/+source/isc-dhcp/+bug/1926139/+attachment/5641811/+files/Screenshot_2023-01-17-16-14-21_1920x1200%250A1920x1080%250A1920x1080.png

  This behaviour led several bug reporters to believe it was a kernel
  issue, with the kernel not pushing DHCPOFFER packets to dhclient. This
  is not the case, the actual problem is dhclient containing a thread
  concurrency race condition, and when the race occurs, the read socket
  is closed prematurely, and dhclient does not read any of the DHCPOFFER
  replies.

  The full explanation is in the "Other Info" section, but the fix for
  this is to change bind9-libs from being built multithreaded, back to
  single threaded as intended by dhclient maintainers.

  In Focal and Jammy, isc-dhcp links against bind9 libraries provided in
  bind9-libs, while in Kinetic onward isc-dhcp has an in-tree bind9
  library it uses, which is already configured properly to --disable-
  threads.

  Change the Focal and Jammy bind9-libs to --disable-threads and update
  symbol files to reflect the library is single threaded again.

  [Testcase]

  Start a fresh Focal or Jammy instance.

  Download and set executable test-parallel.sh, and edit some lines:

  1) wget 
https://bugs.launchpad.net/ubuntu/+source/isc-dhcp/+bug/1926139/+attachment/5593045/+files/test-parallel.sh
  2) chmod +x test-parallel.sh
  3) vim test-parallel.sh

  Change iface="enp5s0" to your interface, likely iface="enp1s0".
  Comment out the line "#   cp bionic-dhclient $workdir/dhclient".

  4) sudo ./test-parallel.sh

  After five minutes, if you issue reproduces, you will see "TEST
  FAILED".

  You can watch the output with:

  5) cat /tmp/dhclient-* | less

  Next, for instrumented runs, you need to build dhclient from source.

  1) sudo apt install build-essential devscripts
  2) apt source isc-dhcp
  3) sudo apt build-dep isc-dhcp
  4) cd isc-dhcp

  Apply the below patch:

  https://paste.ubuntu.com/p/hGsssrVyG4/

  5) patch -p1 < ~/patch.patch
  6) debuild -b -uc -us
  7) cd ..
  8) sudo dpkg -i isc-dhcp-client-*
  9) sudo ./test-parallel.sh
  10) cat /tmp/dhclient-* | less

  Look for the race, as described in "Other Info", namely:

  mruffell: registering with socket manager
  mruffell: callback called
  mruffell: omapi object is NULL
  mruffell: omapi object is NULL
  mruffell: Adding obj to linked list
  mruffell: Obj added to list

  The issue has reproduced.

  If you install the test package from the following ppa:

  https://launchpad.net/~mruffell/+archive/ubuntu/sf337873-test

  Instructions to install (on a Focal or Jammy system):
  1) sudo add-apt-repository 

[Touch-packages] [Bug 1926139] Re: dhclient: thread concurrency race leads to DHCPOFFER packets not being received

2023-01-16 Thread Matthew Ruffell
** Tags added: sts-sponsor

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to isc-dhcp in Ubuntu.
https://bugs.launchpad.net/bugs/1926139

Title:
  dhclient: thread concurrency race leads to DHCPOFFER packets not being
  received

Status in bind9-libs package in Ubuntu:
  Fix Released
Status in isc-dhcp package in Ubuntu:
  Invalid
Status in bind9-libs source package in Focal:
  In Progress
Status in bind9-libs source package in Jammy:
  In Progress

Bug description:
  [Impact]

  Occasionally, during instance boot or machine start-up, dhclient will
  attempt to acquire a dhcp lease and fail, leaving the instance with no
  IP address and making it unreachable.

  This happens about once every 100 reboots on bare metal, or Chris
  Patterson in comment #2 describes it as affecting between ~0.3% to 2%
  of deployments on Microsoft Azure. Azure uses dhclient called from
  cloud-init instead of systemd-networkd, and this is causing issues
  with larger deployments.

  The logs of an affected dhclient produce the following:

  Listening on LPF/enp1s0/52:54:00:1c:d7:00
  Sending on   LPF/enp1s0/52:54:00:1c:d7:00
  Sending on   Socket/fallback
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 3 (xid=0xd222950f)
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 5 (xid=0xd222950f)
  ...
  (omitting 20 similar lines)
  ...
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 13 (xid=0xd222950f)
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 8 (xid=0xd222950f)
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 6 (xid=0xd222950f)
  No DHCPOFFERS received.
  No working leases in persistent database - sleeping.

  Full log: https://paste.ubuntu.com/p/8yBfw2KR5h/
  Log of a working run: https://paste.ubuntu.com/p/N3ZgqrxyQD/

  The bizarre thing is when you tcpdump dhclient, we see all DHCPDISOVER
  packets being replied to with DHCPOFFER packets, but the got_one()
  callback is never called, dhclient does not read these DHCPOFFER
  packets, and continues sending DHCPDISCOVER packets. Once it reaches
  25 DHCPDISCOVER packets sent, it gives up.

  tcpdump: 
https://bugs.launchpad.net/ubuntu/+source/isc-dhcp/+bug/1926139/+attachment/5641810/+files/test.pcap
  Screenshot of Wireshark: 
https://bugs.launchpad.net/ubuntu/+source/isc-dhcp/+bug/1926139/+attachment/5641811/+files/Screenshot_2023-01-17-16-14-21_1920x1200%250A1920x1080%250A1920x1080.png

  This behaviour led several bug reporters to believe it was a kernel
  issue, with the kernel not pushing DHCPOFFER packets to dhclient. This
  is not the case, the actual problem is dhclient containing a thread
  concurrency race condition, and when the race occurs, the read socket
  is closed prematurely, and dhclient does not read any of the DHCPOFFER
  replies.

  The full explanation is in the "Other Info" section, but the fix for
  this is to change bind9-libs from being built multithreaded, back to
  single threaded as intended by dhclient maintainers.

  In Focal and Jammy, isc-dhcp links against bind9 libraries provided in
  bind9-libs, while in Kinetic onward isc-dhcp has an in-tree bind9
  library it uses, which is already configured properly to --disable-
  threads.

  Change the Focal and Jammy bind9-libs to --disable-threads and update
  symbol files to reflect the library is single threaded again.

  [Testcase]

  Start a fresh Focal or Jammy instance.

  Download and set executable test-parallel.sh, and edit some lines:

  1) wget 
https://bugs.launchpad.net/ubuntu/+source/isc-dhcp/+bug/1926139/+attachment/5593045/+files/test-parallel.sh
  2) chmod +x test-parallel.sh
  3) vim test-parallel.sh

  Change iface="enp5s0" to your interface, likely iface="enp1s0".
  Comment out the line "#   cp bionic-dhclient $workdir/dhclient".

  4) sudo ./test-parallel.sh

  After five minutes, if you issue reproduces, you will see "TEST
  FAILED".

  You can watch the output with:

  5) cat /tmp/dhclient-* | less

  Next, for instrumented runs, you need to build dhclient from source.

  1) sudo apt install build-essential devscripts
  2) apt source isc-dhcp
  3) sudo apt build-dep isc-dhcp
  4) cd isc-dhcp

  Apply the below patch:

  https://paste.ubuntu.com/p/hGsssrVyG4/

  5) patch -p1 < ~/patch.patch
  6) debuild -b -uc -us
  7) cd ..
  8) sudo dpkg -i isc-dhcp-client-*
  9) sudo ./test-parallel.sh
  10) cat /tmp/dhclient-* | less

  Look for the race, as described in "Other Info", namely:

  mruffell: registering with socket manager
  mruffell: callback called
  mruffell: omapi object is NULL
  mruffell: omapi object is NULL
  mruffell: Adding obj to linked list
  mruffell: Obj added to list

  The issue has reproduced.

  If you install the test package from the following ppa:

  https://launchpad.net/~mruffell/+archive/ubuntu/sf337873-test

  Instructions to install (on a Focal or Jammy system):
  1) sudo add-apt-repository 

[Touch-packages] [Bug 1926139] Re: dhclient: thread concurrency race leads to DHCPOFFER packets not being received

2023-01-16 Thread Matthew Ruffell
Screenshot of wireshark.

** Attachment added: "Screenshot of wireshark"
   
https://bugs.launchpad.net/ubuntu/+source/isc-dhcp/+bug/1926139/+attachment/5641811/+files/Screenshot_2023-01-17-16-14-21_1920x1200%250A1920x1080%250A1920x1080.png

** Description changed:

  [Impact]
  
  Occasionally, during instance boot or machine start-up, dhclient will
  attempt to acquire a dhcp lease and fail, leaving the instance with no
  IP address and making it unreachable.
  
  This happens about once every 100 reboots on bare metal, or Chris
  Patterson in comment #2 describes it as affecting between ~0.3% to 2% of
  deployments on Microsoft Azure. Azure uses dhclient called from cloud-
  init instead of systemd-networkd, and this is causing issues with larger
  deployments.
  
  The logs of an affected dhclient produce the following:
  
  Listening on LPF/enp1s0/52:54:00:1c:d7:00
  Sending on   LPF/enp1s0/52:54:00:1c:d7:00
  Sending on   Socket/fallback
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 3 (xid=0xd222950f)
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 5 (xid=0xd222950f)
  ...
  (omitting 20 similar lines)
  ...
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 13 (xid=0xd222950f)
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 8 (xid=0xd222950f)
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 6 (xid=0xd222950f)
  No DHCPOFFERS received.
  No working leases in persistent database - sleeping.
  
  Full log: https://paste.ubuntu.com/p/8yBfw2KR5h/
  Log of a working run: https://paste.ubuntu.com/p/N3ZgqrxyQD/
  
  The bizarre thing is when you tcpdump dhclient, we see all DHCPDISOVER
  packets being replied to with DHCPOFFER packets, but the got_one()
  callback is never called, dhclient does not read these DHCPOFFER
  packets, and continues sending DHCPDISCOVER packets. Once it reaches 25
  DHCPDISCOVER packets sent, it gives up.
  
- tcpdump:
- Screenshot of Wireshark:
+ tcpdump: 
https://bugs.launchpad.net/ubuntu/+source/isc-dhcp/+bug/1926139/+attachment/5641810/+files/test.pcap
+ Screenshot of Wireshark: 
https://bugs.launchpad.net/ubuntu/+source/isc-dhcp/+bug/1926139/+attachment/5641811/+files/Screenshot_2023-01-17-16-14-21_1920x1200%250A1920x1080%250A1920x1080.png
  
  This behaviour led several bug reporters to believe it was a kernel
  issue, with the kernel not pushing DHCPOFFER packets to dhclient. This
  is not the case, the actual problem is dhclient containing a thread
  concurrency race condition, and when the race occurs, the read socket is
  closed prematurely, and dhclient does not read any of the DHCPOFFER
  replies.
  
  The full explanation is in the "Other Info" section, but the fix for
  this is to change bind9-libs from being built multithreaded, back to
  single threaded as intended by dhclient maintainers.
  
  In Focal and Jammy, isc-dhcp links against bind9 libraries provided in
  bind9-libs, while in Kinetic onward isc-dhcp has an in-tree bind9
  library it uses, which is already configured properly to --disable-
  threads.
  
  Change the Focal and Jammy bind9-libs to --disable-threads and update
  symbol files to reflect the library is single threaded again.
  
  [Testcase]
  
  Start a fresh Focal or Jammy instance.
  
  Download and set executable test-parallel.sh, and edit some lines:
  
  1) wget 
https://bugs.launchpad.net/ubuntu/+source/isc-dhcp/+bug/1926139/+attachment/5593045/+files/test-parallel.sh
  2) chmod +x test-parallel.sh
  3) vim test-parallel.sh
  
  Change iface="enp5s0" to your interface, likely iface="enp1s0".
  Comment out the line "#   cp bionic-dhclient $workdir/dhclient".
  
  4) sudo ./test-parallel.sh
  
  After five minutes, if you issue reproduces, you will see "TEST FAILED".
  
  You can watch the output with:
  
  5) cat /tmp/dhclient-* | less
  
  Next, for instrumented runs, you need to build dhclient from source.
  
  1) sudo apt install build-essential devscripts
  2) apt source isc-dhcp
  3) sudo apt build-dep isc-dhcp
  4) cd isc-dhcp
  
  Apply the below patch:
  
  https://paste.ubuntu.com/p/hGsssrVyG4/
  
  5) patch -p1 < ~/patch.patch
  6) debuild -b -uc -us
  7) cd ..
  8) sudo dpkg -i isc-dhcp-client-*
  9) sudo ./test-parallel.sh
  10) cat /tmp/dhclient-* | less
  
  Look for the race, as described in "Other Info", namely:
  
  mruffell: registering with socket manager
  mruffell: callback called
  mruffell: omapi object is NULL
  mruffell: omapi object is NULL
  mruffell: Adding obj to linked list
  mruffell: Obj added to list
  
  The issue has reproduced.
  
  If you install the test package from the following ppa:
  
  https://launchpad.net/~mruffell/+archive/ubuntu/sf337873-test
  
  Instructions to install (on a Focal or Jammy system):
  1) sudo add-apt-repository ppa:mruffell/sf337873-test
  2) sudo apt update
  3) sudo apt install libdns-export1109 libisc-export1105
  4) sudo apt-cache policy libisc-export1105 | grep Installed
  Installed: 

[Touch-packages] [Bug 1926139] Re: dhclient: thread concurrency race leads to DHCPOFFER packets not being received

2023-01-16 Thread Matthew Ruffell
packet capture from a reproduction run

** Description changed:

- Platform: Qemu/libvirt on AMD64
- Ubuntu version: 20.04
- isc-dhcp-client version: 4.4.1-2.1ubuntu5
- Problem: When dhclient is used during boot every few reboots the DHCP OFFER 
packets aren't pushed from the kernel to dhclient. The DISCOVER packets can be 
seen in strace and tcpdump. The OFFER packets can be seen in tcpdump, but no 
read event is triggered.
- Ubuntu 18.04 doesn't have the problem, neither does Debian 10. Building these 
dhclient versions on Ubuntu 20.04 alleviates the problem a little, but it still 
occurs. So this issue might also be kernel related.
- 
- Attached diff shows a strace of all threads and a pcap showing the
- tcpdump output.
- 
- Edit:
- - Sometimes the dhclient command does receive the OFFER packet and connection 
is restored.
- - In my testing running dhclient manually from the terminal when the OFFERs 
aren't received will result in a new dhclient session which does receive the 
OFFER packet and connection is restored.
+ [Impact]
+ 
+ Occasionally, during instance boot or machine start-up, dhclient will
+ attempt to acquire a dhcp lease and fail, leaving the instance with no
+ IP address and making it unreachable.
+ 
+ This happens about once every 100 reboots on bare metal, or Chris
+ Patterson in comment #2 describes it as affecting between ~0.3% to 2% of
+ deployments on Microsoft Azure. Azure uses dhclient called from cloud-
+ init instead of systemd-networkd, and this is causing issues with larger
+ deployments.
+ 
+ The logs of an affected dhclient produce the following:
+ 
+ Listening on LPF/enp1s0/52:54:00:1c:d7:00
+ Sending on   LPF/enp1s0/52:54:00:1c:d7:00
+ Sending on   Socket/fallback
+ DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 3 (xid=0xd222950f)
+ DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 5 (xid=0xd222950f)
+ ...
+ (omitting 20 similar lines)
+ ...
+ DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 13 (xid=0xd222950f)
+ DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 8 (xid=0xd222950f)
+ DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 6 (xid=0xd222950f)
+ No DHCPOFFERS received.
+ No working leases in persistent database - sleeping.
+ 
+ Full log: https://paste.ubuntu.com/p/8yBfw2KR5h/
+ Log of a working run: https://paste.ubuntu.com/p/N3ZgqrxyQD/
+ 
+ The bizarre thing is when you tcpdump dhclient, we see all DHCPDISOVER
+ packets being replied to with DHCPOFFER packets, but the got_one()
+ callback is never called, dhclient does not read these DHCPOFFER
+ packets, and continues sending DHCPDISCOVER packets. Once it reaches 25
+ DHCPDISCOVER packets sent, it gives up.
+ 
+ tcpdump:
+ Screenshot of Wireshark:
+ 
+ This behaviour led several bug reporters to believe it was a kernel
+ issue, with the kernel not pushing DHCPOFFER packets to dhclient. This
+ is not the case, the actual problem is dhclient containing a thread
+ concurrency race condition, and when the race occurs, the read socket is
+ closed prematurely, and dhclient does not read any of the DHCPOFFER
+ replies.
+ 
+ The full explanation is in the "Other Info" section, but the fix for
+ this is to change bind9-libs from being built multithreaded, back to
+ single threaded as intended by dhclient maintainers.
+ 
+ In Focal and Jammy, isc-dhcp links against bind9 libraries provided in
+ bind9-libs, while in Kinetic onward isc-dhcp has an in-tree bind9
+ library it uses, which is already configured properly to --disable-
+ threads.
+ 
+ Change the Focal and Jammy bind9-libs to --disable-threads and update
+ symbol files to reflect the library is single threaded again.
+ 
+ [Testcase]
+ 
+ Start a fresh Focal or Jammy instance.
+ 
+ Download and set executable test-parallel.sh, and edit some lines:
+ 
+ 1) wget 
https://bugs.launchpad.net/ubuntu/+source/isc-dhcp/+bug/1926139/+attachment/5593045/+files/test-parallel.sh
+ 2) chmod +x test-parallel.sh
+ 3) vim test-parallel.sh
+ 
+ Change iface="enp5s0" to your interface, likely iface="enp1s0".
+ Comment out the line "#   cp bionic-dhclient $workdir/dhclient".
+ 
+ 4) sudo ./test-parallel.sh
+ 
+ After five minutes, if you issue reproduces, you will see "TEST FAILED".
+ 
+ You can watch the output with:
+ 
+ 5) cat /tmp/dhclient-* | less
+ 
+ Next, for instrumented runs, you need to build dhclient from source.
+ 
+ 1) sudo apt install build-essential devscripts
+ 2) apt source isc-dhcp
+ 3) sudo apt build-dep isc-dhcp
+ 4) cd isc-dhcp
+ 
+ Apply the below patch:
+ 
+ https://paste.ubuntu.com/p/hGsssrVyG4/
+ 
+ 5) patch -p1 < ~/patch.patch
+ 6) debuild -b -uc -us
+ 7) cd ..
+ 8) sudo dpkg -i isc-dhcp-client-*
+ 9) sudo ./test-parallel.sh
+ 10) cat /tmp/dhclient-* | less
+ 
+ Look for the race, as described in "Other Info", namely:
+ 
+ mruffell: registering with socket manager
+ mruffell: callback called
+ mruffell: omapi object is NULL
+ mruffell: omapi object is NULL
+ mruffell: Adding obj to linked list
+ mruffell: 

[Touch-packages] [Bug 1926139] Re: dhclient: thread concurrency race leads to DHCPOFFER packets not being received

2023-01-15 Thread Matthew Ruffell
** Summary changed:

- dhclient doesn't receive dhcp offer from kernel
+ dhclient: thread concurrency race leads to DHCPOFFER packets not being 
received

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to isc-dhcp in Ubuntu.
https://bugs.launchpad.net/bugs/1926139

Title:
  dhclient: thread concurrency race leads to DHCPOFFER packets not being
  received

Status in bind9-libs package in Ubuntu:
  Fix Released
Status in isc-dhcp package in Ubuntu:
  Invalid
Status in bind9-libs source package in Focal:
  In Progress
Status in bind9-libs source package in Jammy:
  In Progress

Bug description:
  Platform: Qemu/libvirt on AMD64
  Ubuntu version: 20.04
  isc-dhcp-client version: 4.4.1-2.1ubuntu5
  Problem: When dhclient is used during boot every few reboots the DHCP OFFER 
packets aren't pushed from the kernel to dhclient. The DISCOVER packets can be 
seen in strace and tcpdump. The OFFER packets can be seen in tcpdump, but no 
read event is triggered.
  Ubuntu 18.04 doesn't have the problem, neither does Debian 10. Building these 
dhclient versions on Ubuntu 20.04 alleviates the problem a little, but it still 
occurs. So this issue might also be kernel related.

  Attached diff shows a strace of all threads and a pcap showing the
  tcpdump output.

  Edit:
  - Sometimes the dhclient command does receive the OFFER packet and connection 
is restored.
  - In my testing running dhclient manually from the terminal when the OFFERs 
aren't received will result in a new dhclient session which does receive the 
OFFER packet and connection is restored.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/bind9-libs/+bug/1926139/+subscriptions


-- 
Mailing list: https://launchpad.net/~touch-packages
Post to : touch-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~touch-packages
More help   : https://help.launchpad.net/ListHelp