Pressing Page Down 17 times to go over the bug description sounds like a
new record! ;-)

Just in case future reviewers don't find that as exciting while going
through reviews, I'll move some text into comments and reference them
from the description.

** Bug watch added: gitlab.isc.org/isc-projects/dhcp/-/issues #264
   https://gitlab.isc.org/isc-projects/dhcp/-/issues/264

** Bug watch added: Debian Bug tracker #996356
   https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=996356

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to isc-dhcp in Ubuntu.
https://bugs.launchpad.net/bugs/1926139

Title:
  dhclient: thread concurrency race leads to DHCPOFFER packets not being
  received

Status in bind9-libs package in Ubuntu:
  Won't Fix
Status in isc-dhcp package in Ubuntu:
  Invalid
Status in isc-dhcp source package in Focal:
  In Progress
Status in isc-dhcp source package in Jammy:
  In Progress

Bug description:
  [Impact]

   * Occasionally, during instance boot or machine start-up,
     dhclient will attempt to acquire a dhcp lease and fail,
     leaving the instance with no IP address and making it
     unreachable.

   * This happens about once in every 100 reboots on bare metal,
     and affects roughly 0.3% to 2% of deployments on Azure
     (comment #2).
     
   * Azure uses dhclient called from cloud-init instead of
     systemd-networkd, and this is causing issues with larger
     deployments.

   * The logs of an affected dhclient produce the following:

     Listening on LPF/enp1s0/52:54:00:1c:d7:00
     Sending on   LPF/enp1s0/52:54:00:1c:d7:00
     Sending on   Socket/fallback
     DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
     DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
     ...
     (omitting 20 similar lines)
     ...
     DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
     DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
     DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
     No DHCPOFFERS received.
     No working leases in persistent database - sleeping.

   * This only impacts Focal and Jammy, where bind9-libs
     are multi-threaded (Bionic/earlier and Kinetic/later
     are single-threaded).

   * The underlying problem is a thread concurrency race
     condition in dhclient: when the race occurs, the read
     socket is prematurely (and incorrectly) unwatched
     because required structures are not yet consistent,
     so dhclient never reads any DHCPOFFER replies.

   * Detailed analysis of the issue is in comment #17.
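
To make the ordering problem described above concrete, here is a
deterministic toy model. It is only a sketch: ToyEventLoop, its methods
and the step names are hypothetical and are not the isc-dhcp or
bind9-libs code (the real analysis is in comment #17). It assumes the
watcher simply stops watching a descriptor it cannot resolve yet.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEventLoop:
    """Toy stand-in for a socket watcher (hypothetical, not the isc API)."""
    registered: set = field(default_factory=set)  # fds whose structures are consistent
    watched: set = field(default_factory=set)     # fds polled for incoming data

    def register(self, fd):
        # The required structures for fd become consistent here.
        self.registered.add(fd)

    def request_watch(self, fd):
        # The caller asks for read events on fd.
        self.watched.add(fd)

    def watcher_step(self):
        # One pass of the watcher thread: unwatch any fd it cannot resolve yet.
        for fd in sorted(self.watched):
            if fd not in self.registered:
                self.watched.discard(fd)

    def would_read(self, fd):
        # A reply is only read if fd is still being watched.
        return fd in self.watched


def run(order, fd=5):
    """Apply the steps in the given order and report whether a reply is read."""
    loop = ToyEventLoop()
    steps = {
        "watch": lambda: loop.request_watch(fd),
        "register": lambda: loop.register(fd),
        "watcher": loop.watcher_step,
    }
    for name in order:
        steps[name]()
    return loop.would_read(fd)


# Old ordering, watcher thread wins the race: the fd is unwatched for good.
RACY = ["watch", "watcher", "register"]
# Old ordering, watcher thread runs late: works, so the bug is only occasional.
LUCKY = ["watch", "register", "watcher"]
# Fixed ordering: watch only after the structures are consistent.
FIXED = ["register", "watch", "watcher"]

if __name__ == "__main__":
    for label, order in (("racy", RACY), ("lucky", LUCKY), ("fixed", FIXED)):
        print(label, run(order))
```

In this model the only ordering that loses the reply is the one where
the watcher runs between the watch request and the registration, which
matches the "occasional" nature of the failure.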

  [Fix]

   * Prevent the race condition by starting to watch the read
     socket only after the required structures are consistent.
   
   * The fix has been tested in Azure with 13500 instances,
     and no errors were observed (previous failure rate: 0.4%).
     
   * In case regressions are observed anyway, the patch
     introduces 2 switches to revert to the previous behavior,
     which can be applied per-process or system-wide:
     - DHCP_FD_FLAGS_POKE=0 environment variable
     - dhcp.fd_flags_poke=0 kernel cmdline option 
   
   * (Previous approaches/discussions included reverting
      bind9-libs to single-threaded, but we concluded that it
      would carry more regression risk than expected
      [some bits in comment #8, and some internal chat],
      and would remove exported symbols (apparently unused,
      but still). We also considered a mutex/spinlock approach,
      but later found a simpler way using the isc library;
      see comment #13.)
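
As a rough illustration of how the two switches listed above could be
evaluated, here is a minimal sketch. Only the names DHCP_FD_FLAGS_POKE
and dhcp.fd_flags_poke come from this bug; the helper functions, the
exact matching rules and the "either switch disables it" semantics are
assumptions, not the behavior of the actual patch.

```python
import os

ENV_SWITCH = "DHCP_FD_FLAGS_POKE"       # per-process switch named in this bug
CMDLINE_SWITCH = "dhcp.fd_flags_poke"   # system-wide switch named in this bug


def poke_enabled(environ, cmdline):
    """Return False if either switch is set to 0 (assumed semantics)."""
    if environ.get(ENV_SWITCH) == "0":
        return False
    # Kernel command line options are whitespace-separated tokens.
    for token in cmdline.split():
        if token == CMDLINE_SWITCH + "=0":
            return False
    return True


def read_cmdline(path="/proc/cmdline"):
    """Read the kernel command line, or return '' if it is unavailable."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        return ""


if __name__ == "__main__":
    print("new behavior enabled:", poke_enabled(os.environ, read_cmdline()))
```

Passing the environment and command line in as arguments keeps the
check easy to exercise without rebooting or changing the real
environment.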
      
  [Test Plan]

   * A synthetic reproducer, using GDB to force the race
     condition together with DHCP server/client/noise
     injection, is described in comment #9.
   
   * Test with the original package (problem occurs).
   
   * Test with the modified package (problem fixed).
     - Set DHCP_FD_FLAGS_POKE=0 (problem occurs).
     - Set dhcp.fd_flags_poke=0 (problem occurs).
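
When running the steps above repeatedly, deciding "problem occurs" vs.
"problem fixed" can be automated by looking for the failure signature
from the [Impact] log excerpt. The sketch below does that; the function
name and the returned dictionary are hypothetical, and it assumes the
dhclient output has already been captured as text.

```python
def summarize_dhclient_log(log_text):
    """Count DHCPDISCOVER attempts and detect the failure signature."""
    lines = [line.strip() for line in log_text.splitlines()]
    discovers = sum(1 for line in lines if line.startswith("DHCPDISCOVER on "))
    failed = any(line == "No DHCPOFFERS received." for line in lines)
    return {"discovers": discovers, "failed": failed}


# Shortened sample based on the log lines quoted in this bug.
SAMPLE = """\
Listening on LPF/enp1s0/52:54:00:1c:d7:00
Sending on   LPF/enp1s0/52:54:00:1c:d7:00
Sending on   Socket/fallback
DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
No DHCPOFFERS received.
No working leases in persistent database - sleeping.
"""

if __name__ == "__main__":
    print(summarize_dhclient_log(SAMPLE))
```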

  [Regression Potential]

   * 1) dhclient failing to acquire DHCP leases.
      
   * 2) dhcpd is also affected by the code changes,
     so failures to handle DHCP lease requests
     are a potential regression as well.
     
   * 3) the functional change added by the fix,
     if a regression were to occur, would likely
     be an issue only under some (unknown) race
     condition as well, thus expected to be rare.

   * Note: this potentially affects Focal/Jammy
     on Azure as a whole, due to the use of dhclient
     in cloud-init instead of systemd-networkd.
     
     Azure provided extensive testing for all 3
     approaches (mostly internal communications,
     and some bug comments), with ~13k instances.
     
     No issues were observed (previously: 0.4%).
     
   * Testing at this scale seems to indicate that
     there are no regressions in dhclient's ability
     to acquire DHCP leases (1), nor another race
     condition that hits the fix/new behavior (3).
     
     With that, apparently (2) should be OK too.
     
   * Also, to mitigate the regression risk as
     much as possible, a very detailed analysis
     is provided here (comments #17, #18), with
     more information about the fix in its
     patch file's comment.
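
As a back-of-the-envelope check on why that testing scale is
meaningful: using the 13500 instances and the previous 0.4% failure
rate quoted in this bug, and assuming (a simplification) that each
deployment fails independently at that rate, one can compute how many
failures the old code would be expected to produce and how likely a
clean run would be by chance.

```python
INSTANCES = 13500      # test size quoted in this bug
PRIOR_RATE = 0.004     # previous failure rate (0.4%) quoted in this bug

# Expected number of failures if the fix had changed nothing.
expected_failures = INSTANCES * PRIOR_RATE

# Probability of seeing zero failures by luck, assuming independent
# deployments that each fail with probability PRIOR_RATE.
p_zero_by_chance = (1 - PRIOR_RATE) ** INSTANCES

if __name__ == "__main__":
    print(f"expected failures without the fix: {expected_failures:.0f}")
    print(f"chance of zero failures by luck:   {p_zero_by_chance:.3g}")
```

Under those assumptions, roughly 54 failures would be expected without
the fix, so a run with none is very unlikely to be a coincidence.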

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/bind9-libs/+bug/1926139/+subscriptions


-- 
Mailing list: https://launchpad.net/~touch-packages
Post to     : touch-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~touch-packages
More help   : https://help.launchpad.net/ListHelp
