[Touch-packages] [Bug 1926139] Re: dhclient: thread concurrency race leads to DHCPOFFER packets not being received
Great work Mauricio, I think you make several excellent points and I appreciate your efforts on a better reproducer and alternative patch.

FWIW I began testing Matthew's initial build (which disabled threads) against a large number of VMs, and that appeared to address the issues we're seeing. I'm cutting those tests short and am now updating the tests to use your patch as provided by Matthew, and we'll see how that goes!

--
You received this bug notification because you are a member of Ubuntu Touch seeded packages, which is subscribed to isc-dhcp in Ubuntu.
https://bugs.launchpad.net/bugs/1926139

Title:
  dhclient: thread concurrency race leads to DHCPOFFER packets not being received

Status in bind9-libs package in Ubuntu:
  Fix Released
Status in isc-dhcp package in Ubuntu:
  Invalid
Status in bind9-libs source package in Focal:
  In Progress
Status in bind9-libs source package in Jammy:
  In Progress

Bug description:
  [Impact]

  Occasionally, during instance boot or machine start-up, dhclient will attempt to acquire a DHCP lease and fail, leaving the instance with no IP address and making it unreachable.

  This happens about once every 100 reboots on bare metal; Chris Patterson, in comment #2, describes it as affecting between ~0.3% and 2% of deployments on Microsoft Azure.

  Azure uses dhclient called from cloud-init instead of systemd-networkd, and this is causing issues with larger deployments.

  The logs of an affected dhclient look like the following:

  Listening on LPF/enp1s0/52:54:00:1c:d7:00
  Sending on LPF/enp1s0/52:54:00:1c:d7:00
  Sending on Socket/fallback
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 3 (xid=0xd222950f)
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 5 (xid=0xd222950f)
  ... (omitting 20 similar lines) ...
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 13 (xid=0xd222950f)
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 8 (xid=0xd222950f)
  DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 6 (xid=0xd222950f)
  No DHCPOFFERS received.
  No working leases in persistent database - sleeping.

  Full log: https://paste.ubuntu.com/p/8yBfw2KR5h/
  Log of a working run: https://paste.ubuntu.com/p/N3ZgqrxyQD/

  The bizarre thing is that when you tcpdump dhclient, we see all DHCPDISCOVER packets being replied to with DHCPOFFER packets, but the got_one() callback is never called, dhclient does not read these DHCPOFFER packets, and it continues sending DHCPDISCOVER packets. Once it reaches 25 DHCPDISCOVER packets sent, it gives up.

  tcpdump:
  https://bugs.launchpad.net/ubuntu/+source/isc-dhcp/+bug/1926139/+attachment/5641810/+files/test.pcap
  Screenshot of Wireshark:
  https://bugs.launchpad.net/ubuntu/+source/isc-dhcp/+bug/1926139/+attachment/5641811/+files/Screenshot_2023-01-17-16-14-21_1920x1200%250A1920x1080%250A1920x1080.png

  This behaviour led several bug reporters to believe it was a kernel issue, with the kernel not pushing DHCPOFFER packets to dhclient.

  This is not the case. The actual problem is a thread concurrency race condition in dhclient: when the race occurs, the read socket is closed prematurely, and dhclient does not read any of the DHCPOFFER replies.

  The full explanation is in the "Other Info" section, but the fix is to change bind9-libs from being built multithreaded back to single threaded, as intended by the dhclient maintainers.

  In Focal and Jammy, isc-dhcp links against the bind9 libraries provided in bind9-libs, while from Kinetic onward isc-dhcp uses an in-tree bind9 library, which is already configured properly with --disable-threads.

  Change the Focal and Jammy bind9-libs to --disable-threads and update the symbol files to reflect that the library is single threaded again.

  [Testcase]

  Start a fresh Focal or Jammy instance.
  Download test-parallel.sh, make it executable, and edit some lines:

  1) wget https://bugs.launchpad.net/ubuntu/+source/isc-dhcp/+bug/1926139/+attachment/5593045/+files/test-parallel.sh
  2) chmod +x test-parallel.sh
  3) vim test-parallel.sh

  Change iface="enp5s0" to your interface, likely iface="enp1s0".
  Comment out the line "# cp bionic-dhclient $workdir/dhclient".

  4) sudo ./test-parallel.sh

  After five minutes, if the issue reproduces, you will see "TEST FAILED". You can watch the output with:

  5) cat /tmp/dhclient-* | less

  Next, for instrumented runs, you need to build dhclient from source.

  1) sudo apt install build-essential devscripts
  2) apt source isc-dhcp
  3) sudo apt build-dep isc-dhcp
  4) cd isc-dhcp

  Apply the below patch:
  https://paste.ubuntu.com/p/hGsssrVyG4/

  5) patch -p1 < ~/patch.patch
  6) debuild -b -uc -us
  7) cd ..
  8) sudo dpkg -i isc-dhcp-client-*
  9) sudo ./test-parallel.sh
  10) cat /tmp/dhclient-* | less

  Look for the race, as described in "Other Info", namely:
[Touch-packages] [Bug 1989190] Re: Bionic networking failures after NIC reordering
Reproducer script for both variants of systemd.

** Attachment added: "reproducer script"
   https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1989190/+attachment/5614805/+files/lp1989190-reproducer.sh

** Description changed:

- Partially documented in https://bugs.launchpad.net/bugs/1958280 and
+ Documented across https://bugs.launchpad.net/bugs/1958280 and
  https://canonical.force.com/ua/s/case/5004K0E96qlQAB/vf-nic-not-getting-renamed-properly-for-ubuntu-2004.

- Splitting these reports to focus on Bionic, because it's different than
- 20.04+ and last week's failure
+ Creating this bug to focus on Bionic, because it's different than 20.04+
+ and last week's failure
  https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1988119 helped me
  identify part of the root cause.

  When NICs are renamed on boot, networkd tends to fail to configure them.

  # WITHOUT THE PROPOSED SYSTEMD PATCH

  cpatterson@test-ubu1804-nicrenamerepro-x1:~$ networkctl list
  IDX LINK TYPE     OPERATIONAL SETUP
-   1 lo   loopback carrier     unmanaged
-   2 eth0 ether    routable    configured
-   3 eth1 ether    n/a         unmanaged
-   4 eth2 ether    routable    configured
-   5 eth3 ether    routable    configured
-   6 eth4 ether    routable    configured
-   7 eth5 ether    off         unmanaged
-   8 eth6 ether    off         unmanaged
-   9 eth7 ether    off         unmanaged
-
+   1 lo   loopback carrier     unmanaged
+   2 eth0 ether    routable    configured
+   3 eth1 ether    n/a         unmanaged
+   4 eth2 ether    routable    configured
+   5 eth3 ether    routable    configured
+   6 eth4 ether    routable    configured
+   7 eth5 ether    off         unmanaged
+   8 eth6 ether    off         unmanaged
+   9 eth7 ether    off         unmanaged

  ### As expected, we can see the properties are missing.
  cpatterson@test-ubu1804-nicrenamerepro-x1:~$ sudo udevadm info /sys/class/net/eth7
  P: /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/0022481f-69aa-0022-481f-69aa0022481f/net/eth7
  E: DEVPATH=/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/0022481f-69aa-0022-481f-69aa0022481f/net/rename9
  E: ID_NET_NAME_MAC=enx0022481f69aa
  E: ID_OUI_FROM_DATABASE=Microsoft Corporation
  E: ID_PATH=acpi-VMBUS:01
  E: ID_PATH_TAG=acpi-VMBUS_01
  E: IFINDEX=9
  E: INTERFACE=eth1
  E: SUBSYSTEM=net
  E: SYSTEMD_ALIAS=/sys/subsystem/net/devices/rename9 /sys/subsystem/net/devices/eth1 /sys/subsystem/net/devices/cirename0 /sys/subsystem/net/devices/eth7
  E: TAGS=:systemd:
  E: USEC_INITIALIZED=11203606

  ### As expected, restarting networkd does not fix the issue.

  cpatterson@test-ubu1804-nicrenamerepro-x1:~$ sudo systemctl restart systemd-networkd
  cpatterson@test-ubu1804-nicrenamerepro-x1:~$ networkctl list
  IDX LINK TYPE     OPERATIONAL SETUP
-   1 lo   loopback carrier     unmanaged
-   2 eth0 ether    routable    configured
-   3 eth1 ether    off         unmanaged
-   4 eth2 ether    routable    configured
-   5 eth3 ether    routable    configured
-   6 eth4 ether    routable    configured
-   7 eth5 ether    off         unmanaged
-   8 eth6 ether    off         unmanaged
-   9 eth7 ether    off         unmanaged
+   1 lo   loopback carrier     unmanaged
+   2 eth0 ether    routable    configured
+   3 eth1 ether    off         unmanaged
+   4 eth2 ether    routable    configured
+   5 eth3 ether    routable    configured
+   6 eth4 ether    routable    configured
+   7 eth5 ether    off         unmanaged
+   8 eth6 ether    off         unmanaged
+   9 eth7 ether    off         unmanaged

  9 links listed.

  # WITH THE PROPOSED SYSTEMD PATCH

  I built systemd with the proposed patches in
  https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1988119.

  With these patches, networking still comes up broken, but restarting networkd does fix things.

  cpatterson@test-ubu1804-nicrenamerepro-systemd55-x2:~$ networkctl list
  IDX LINK TYPE
[Touch-packages] [Bug 1989190] [NEW] Bionic networking failures after NIC reordering
Public bug reported:

Documented across https://bugs.launchpad.net/bugs/1958280 and
https://canonical.force.com/ua/s/case/5004K0E96qlQAB/vf-nic-not-getting-renamed-properly-for-ubuntu-2004.

Creating this bug to focus on Bionic, because it's different than 20.04+ and last week's failure https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1988119 helped me identify part of the root cause.

When NICs are renamed on boot, networkd tends to fail to configure them.

# WITHOUT THE PROPOSED SYSTEMD PATCH

cpatterson@test-ubu1804-nicrenamerepro-x1:~$ networkctl list
IDX LINK TYPE     OPERATIONAL SETUP
  1 lo   loopback carrier     unmanaged
  2 eth0 ether    routable    configured
  3 eth1 ether    n/a         unmanaged
  4 eth2 ether    routable    configured
  5 eth3 ether    routable    configured
  6 eth4 ether    routable    configured
  7 eth5 ether    off         unmanaged
  8 eth6 ether    off         unmanaged
  9 eth7 ether    off         unmanaged

### As expected, we can see the properties are missing.

cpatterson@test-ubu1804-nicrenamerepro-x1:~$ sudo udevadm info /sys/class/net/eth7
P: /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/0022481f-69aa-0022-481f-69aa0022481f/net/eth7
E: DEVPATH=/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/0022481f-69aa-0022-481f-69aa0022481f/net/rename9
E: ID_NET_NAME_MAC=enx0022481f69aa
E: ID_OUI_FROM_DATABASE=Microsoft Corporation
E: ID_PATH=acpi-VMBUS:01
E: ID_PATH_TAG=acpi-VMBUS_01
E: IFINDEX=9
E: INTERFACE=eth1
E: SUBSYSTEM=net
E: SYSTEMD_ALIAS=/sys/subsystem/net/devices/rename9 /sys/subsystem/net/devices/eth1 /sys/subsystem/net/devices/cirename0 /sys/subsystem/net/devices/eth7
E: TAGS=:systemd:
E: USEC_INITIALIZED=11203606

### As expected, restarting networkd does not fix the issue.
cpatterson@test-ubu1804-nicrenamerepro-x1:~$ sudo systemctl restart systemd-networkd
cpatterson@test-ubu1804-nicrenamerepro-x1:~$ networkctl list
IDX LINK TYPE     OPERATIONAL SETUP
  1 lo   loopback carrier     unmanaged
  2 eth0 ether    routable    configured
  3 eth1 ether    off         unmanaged
  4 eth2 ether    routable    configured
  5 eth3 ether    routable    configured
  6 eth4 ether    routable    configured
  7 eth5 ether    off         unmanaged
  8 eth6 ether    off         unmanaged
  9 eth7 ether    off         unmanaged

9 links listed.

# WITH THE PROPOSED SYSTEMD PATCH

I built systemd with the proposed patches in
https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1988119.

With these patches, networking still comes up broken, but restarting networkd does fix things.

cpatterson@test-ubu1804-nicrenamerepro-systemd55-x2:~$ networkctl list
IDX LINK TYPE     OPERATIONAL SETUP
  1 lo   loopback carrier     unmanaged
  2 eth0 ether    routable    configured
  3 eth1 ether    n/a         unmanaged
  4 eth2 ether    n/a         unmanaged
  5 eth3 ether    n/a         unmanaged
  6 eth4 ether    routable    configured
  7 eth5 ether    n/a         unmanaged
  8 eth6 ether    n/a         unmanaged
  9 eth7 ether    n/a         unmanaged

9 links listed.
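Listings like the ones above can also be checked programmatically when comparing many boots. The following is only a minimal sketch, not part of the attached reproducer script: the function names are hypothetical, and it assumes the `networkctl list` columns are separated by whitespace as in the tool's normal output. The sample text is the patched, pre-restart listing quoted above.

```python
def parse_networkctl_list(output):
    """Parse `networkctl list` text into a list of dicts, one per link."""
    links = []
    for line in output.splitlines():
        fields = line.split()
        # Data rows start with a numeric index and have exactly five columns;
        # this skips the header row and the "N links listed." trailer.
        if len(fields) != 5 or not fields[0].isdigit():
            continue
        idx, name, kind, operational, setup = fields
        links.append({
            "idx": int(idx),
            "link": name,
            "type": kind,
            "operational": operational,
            "setup": setup,
        })
    return links


def unconfigured_links(links):
    """Names of non-loopback links whose SETUP column is not 'configured'."""
    return [
        link["link"]
        for link in links
        if link["type"] != "loopback" and link["setup"] != "configured"
    ]


# Sample copied from the listing in the bug report (patched systemd, before restart).
SAMPLE = """\
IDX LINK TYPE     OPERATIONAL SETUP
  1 lo   loopback carrier     unmanaged
  2 eth0 ether    routable    configured
  3 eth1 ether    n/a         unmanaged
  4 eth2 ether    n/a         unmanaged
  5 eth3 ether    n/a         unmanaged
  6 eth4 ether    routable    configured
  7 eth5 ether    n/a         unmanaged
  8 eth6 ether    n/a         unmanaged
  9 eth7 ether    n/a         unmanaged

9 links listed.
"""

if __name__ == "__main__":
    print(unconfigured_links(parse_networkctl_list(SAMPLE)))
```

Whether an "unmanaged" link is actually a failure depends on the netplan configuration of the VM, so the result is a list of candidates to inspect rather than a verdict.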
cpatterson@test-ubu1804-nicrenamerepro-systemd55-x2:~$ sudo udevadm info /sys/class/net/eth1
P: /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/0022482b-f769-0022-482b-f7690022482b/net/eth1
E: DEVPATH=/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/0022482b-f769-0022-482b-f7690022482b/net/rename3
E: ID_NET_DRIVER=hv_netvsc
E: ID_NET_LINK_FILE=/run/systemd/network/10-netplan-eth7.link
E: ID_NET_NAME=eth1
E: ID_NET_NAME_MAC=enx0022482bf769
E: ID_OUI_FROM_DATABASE=Microsoft Corporation
E: ID_PATH=acpi-VMBUS:01
E: ID_PATH_TAG=acpi-VMBUS_01
E: IFINDEX=3
E: INTERFACE=eth7
E: NM_UNMANAGED=1
E: SUBSYSTEM=net
E: SYSTEMD_ALIAS=/sys/subsystem/net/devices/rename3 /sys/subsystem/net/devices/eth7 /sys/subsystem/net/devices/eth1
E: TAGS=:systemd:
E: USEC_INITIALIZED=10280176

cpatterson@test-ubu1804-nicrenamerepro-systemd55-x2:~$ sudo systemctl restart systemd-networkd
cpatterson@test-ubu1804-nicrenamerepro-systemd55-x2:~$ networkctl list
IDX LINK TYPE     OPERATIONAL SETUP
  1 lo   loopback carrier     unmanaged
  2
[Touch-packages] [Bug 1926139] Re: dhclient doesn't receive dhcp offer from kernel
We've been investigating a similar issue in Ubuntu 20.04 (and now 22.04) on Azure where Running PPS re-use fails to perform DHCP for 5 minutes when dhclient is invoked by cloud-init.

dhclient is run by cloud-init, but sees no DHCPOFFER. The rate varies for unknown reasons, but it has affected ~0.3-2% of deployments in this scenario over time.

We instrumented our images to capture network traffic and see what is happening, and sure enough DHCP offers are coming through to the guest but dhclient doesn't see them. We instrumented dhclient and the "got_one()" callback is never invoked in these failures.

18.04 does not have this issue.

This behavior can be reproduced multiple ways:

- Reproduce a test environment similar to the above scenario using cloud-init (switch the hyperv nic to a different vnet while waiting for the link status to reset, then perform dhcp). This test case will reproduce in ~1,500 runs, though it varies and requires a more complex setup.

- Repeatedly run dhclient in a loop until it fails (see test-sequential.sh). It may take a while, but even this simple test will reproduce this behavior in ~50k runs for me in an LXD VM.

- Simply launch instances of dhclient in parallel (see test-parallel.sh). There is an excellent chance at least one of those dhclients will fail this way.

I noticed the uprev of bind9 libs in focal:

focal (net): 1:9.11.16+dfsg-3~build1
focal-updates (net): 1:9.11.16+dfsg-3~ubuntu1
impish (net): 1:9.11.19+dfsg-2.1ubuntu1
jammy (net): 1:9.11.19+dfsg-2.1ubuntu3
kinetic (net): 1:9.11.19+dfsg-2.1ubuntu3

I couldn't find any related issue on the isc-dhcp tracker, etc. I did build dhclient from the Debian master branch (https://salsa.debian.org/debian/isc-dhcp/-/commits/master/debian), which uses the in-tree bind libs, and that seems to have addressed the issue for all scenarios. Not that it helps much to bisect this just yet.
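For anyone triaging logs from runs like these, the failure signature (repeated DHCPDISCOVER lines, no offer, then "No DHCPOFFERS received.") can be checked mechanically. This is a minimal sketch rather than anything taken from test-sequential.sh or test-parallel.sh: the function name is hypothetical, the DHCPDISCOVER and "No DHCPOFFERS received." strings follow the dhclient log quoted in the bug description, and the assumption that a successful run logs a line starting with "DHCPOFFER" is mine.

```python
import re

# DHCPDISCOVER line format as quoted in the bug description.
DISCOVER_RE = re.compile(r"^DHCPDISCOVER on \S+ to \S+ port \d+ interval \d+")
# Assumption: a run that receives an offer logs a line starting with "DHCPOFFER".
OFFER_RE = re.compile(r"^DHCPOFFER\b")
GAVE_UP = "No DHCPOFFERS received."


def classify_dhclient_log(text):
    """Return (verdict, discover_count) for the text of one dhclient log.

    verdict is "failed" when dhclient gave up without logging an offer,
    "ok" when an offer was logged, and "inconclusive" otherwise.
    """
    discovers = 0
    saw_offer = False
    gave_up = False
    for raw in text.splitlines():
        line = raw.strip()
        if DISCOVER_RE.match(line):
            discovers += 1
        elif OFFER_RE.match(line):
            saw_offer = True
        elif line.startswith(GAVE_UP):
            gave_up = True
    if gave_up and not saw_offer:
        return "failed", discovers
    if saw_offer:
        return "ok", discovers
    return "inconclusive", discovers


# Shortened excerpt of the failing log from the bug description.
FAILED_LOG = """\
DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 13 (xid=0xd222950f)
DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 8 (xid=0xd222950f)
DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 6 (xid=0xd222950f)
No DHCPOFFERS received.
No working leases in persistent database - sleeping.
"""

if __name__ == "__main__":
    print(classify_dhclient_log(FAILED_LOG))
```

A log that is still in progress, or one that was truncated, comes back as "inconclusive" rather than being counted as a pass.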
** Attachment added: "parallel test"
   https://bugs.launchpad.net/ubuntu/+source/isc-dhcp/+bug/1926139/+attachment/5593045/+files/test-parallel.sh

--
You received this bug notification because you are a member of Ubuntu Touch seeded packages, which is subscribed to isc-dhcp in Ubuntu.
https://bugs.launchpad.net/bugs/1926139

Title:
  dhclient doesn't receive dhcp offer from kernel

Status in isc-dhcp package in Ubuntu:
  New

Bug description:
  Platform: Qemu/libvirt on AMD64
  Ubuntu version: 20.04
  isc-dhcp-client version: 4.4.1-2.1ubuntu5

  Problem:
  When dhclient is used during boot, every few reboots the DHCP OFFER packets aren't pushed from the kernel to dhclient. The DISCOVER packets can be seen in strace and tcpdump. The OFFER packets can be seen in tcpdump, but no read event is triggered.

  Ubuntu 18.04 doesn't have the problem, neither does Debian 10. Building these dhclient versions on Ubuntu 20.04 alleviates the problem a little, but it still occurs. So this issue might also be kernel related.

  Attached diff shows a strace of all threads and a pcap showing the tcpdump output.

  Edit:
  - Sometimes the dhclient command does receive the OFFER packet and the connection is restored.
  - In my testing, running dhclient manually from the terminal when the OFFERs aren't received will result in a new dhclient session which does receive the OFFER packet, and the connection is restored.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/isc-dhcp/+bug/1926139/+subscriptions

--
Mailing list: https://launchpad.net/~touch-packages
Post to     : touch-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~touch-packages
More help   : https://help.launchpad.net/ListHelp
[Touch-packages] [Bug 1926139] Re: dhclient doesn't receive dhcp offer from kernel
** Attachment added: "sequential test"
   https://bugs.launchpad.net/ubuntu/+source/isc-dhcp/+bug/1926139/+attachment/5593046/+files/test-sequential.sh