[Bug 1909950] Re: named: TCP connections sometimes never close due to race in socket teardown
** Tags removed: sts-sponsor-mfo -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1909950 Title: named: TCP connections sometimes never close due to race in socket teardown To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/bind9/+bug/1909950/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Bug 1909950] Re: named: TCP connections sometimes never close due to race in socket teardown
This bug was fixed in the package bind9 - 1:9.16.1-0ubuntu2.7

---
bind9 (1:9.16.1-0ubuntu2.7) focal; urgency=medium

  * Fix a race between deactivating socket handle and processing
    async callbacks, which can lead to sockets not being closed
    properly, exhausting TCP connection limits. (LP: #1909950)
    - d/p/lp-1909950-fix-race-between-deactivating-handle-async-callback.patch

 -- Matthew Ruffell  Thu, 18 Feb 2021 16:28:44 +1300

** Changed in: bind9 (Ubuntu Focal)
       Status: Fix Committed => Fix Released
[Bug 1909950] Re: named: TCP connections sometimes never close due to race in socket teardown
> I will also write back in a few days time with feedback from a user,
> who is testing this fixed package in production.

That user is me. I've been running 1:9.16.1-0ubuntu2.7 on an ISP production recursive server "since Fri 2021-02-19 17:44:17 CST; 5 days ago" (per systemd). The system remains stable. The TCP numbers in "rndc status" look good compared to when we were experiencing the problem of named hitting the TCP client limit due to sockets not closing.
[Bug 1909950] Re: named: TCP connections sometimes never close due to race in socket teardown
Performing verification for Bind9 on Focal.

I first installed 9.16.1-0ubuntu2.6 from -updates to ensure that the issue is still present. I checked that I could look up ubuntu.com through the local caching resolver. From there I started a second VM, and checked I could look up addresses through the first VM. I then added the 30% packet loss rule with tc.

From there I opened up 11 tabs in gnome-terminal and hit the first VM with:

$ for run in {1..1}; do dig +tcp @192.168.122.21 ubuntu.com & done

https://paste.ubuntu.com/p/sF9SXkWpZK/

We can see that the "TCP high-water" mark kept rising until it reached 150, at which point I killed the thundering herd from the second VM. I then did a DNS lookup, and found that named was not listening to TCP, and the lookup timed out. This confirms that 9.16.1-0ubuntu2.6 from -updates is affected.

I then enabled -proposed, installed bind9 9.16.1-0ubuntu2.7 and rebooted.

From there, I checked that I could look up ubuntu.com through the local caching resolver, and again started the second VM. The second VM could also look up addresses through the first VM. I again added a 30% packet loss with tc.

I then opened up 11 tabs of gnome-terminal and hit the first VM with the dig for loop of doom. Except this time, once I reached the TCP high water mark and killed the second VM, the number of TCP connections fell back down to 1, and did not get stuck at a higher number. I did a TCP DNS lookup for ubuntu.com on the server, and the request was successful and did not time out. named is listening to TCP connections as it is supposed to.

https://paste.ubuntu.com/p/SzJMzz6xbh/

bind9 9.16.1-0ubuntu2.7 fixes the problem. Happy to mark as verified.

I will also write back in a few days time with feedback from a user, who is testing this fixed package in production.

** Tags removed: verification-needed verification-needed-focal
** Tags added: verification-done-focal
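
For anyone repeating the verification, the two counters watched above can be pulled out of the "rndc status" output with a small filter instead of reading them by eye. This is only a sketch: the function name parse_tcp_stats is made up for illustration, and it assumes the "tcp clients: N/M" and "TCP high-water: H" lines appear exactly as quoted in this bug.

```shell
#!/bin/sh
# Hypothetical helper (not part of bind9): reads `rndc status` output on
# stdin and prints the current TCP client count, the configured limit and
# the high-water mark on a single line.
parse_tcp_stats() {
    awk -F': ' '
        # "tcp clients: 31/150" -> current count and limit
        /^tcp clients:/    { split($2, parts, "/"); cur = parts[1]; max = parts[2] }
        # "TCP high-water: 58" -> highest count seen so far
        /^TCP high-water:/ { hw = $2 }
        END { printf "clients=%s limit=%s high_water=%s\n", cur, max, hw }
    '
}
```

On the server this would be used as "sudo rndc status | parse_tcp_stats", for example once a second in a loop while the second VM is generating load. On an affected package the clients value is expected to stay pinned near the limit after the load stops; on the fixed package it should fall back down.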
[Bug 1909950] Re: named: TCP connections sometimes never close due to race in socket teardown
Hello Adam, or anyone else affected,

Accepted bind9 into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/bind9/1:9.16.1-0ubuntu2.7 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

** Changed in: bind9 (Ubuntu Focal)
       Status: In Progress => Fix Committed
[Bug 1909950] Re: named: TCP connections sometimes never close due to race in socket teardown
Ack; will do. Thanks for the notice!

** Changed in: bind9 (Ubuntu Focal)
       Status: Fix Committed => In Progress
[Bug 1909950] Re: named: TCP connections sometimes never close due to race in socket teardown
Unfortunately this SRU got superseded by a security update, and will have to be re-uploaded.
[Bug 1909950] Re: named: TCP connections sometimes never close due to race in socket teardown
Hello Adam, or anyone else affected,

Accepted bind9 into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/bind9/1:9.16.1-0ubuntu2.5 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

** Changed in: bind9 (Ubuntu Focal)
       Status: In Progress => Fix Committed

** Tags added: verification-needed verification-needed-focal
[Bug 1909950] Re: named: TCP connections sometimes never close due to race in socket teardown
For doc purposes, I've had an interesting time debugging why the bind9 forwarding didn't work to a host running dnsmasq/libvirt (DNS server).

After some tcpdump comparisons against a local dig client that worked fine, it turns out that dnssec-validation must be changed from 'auto' to 'yes', and then bind9 forwarding worked OK!

bind forwarder / default (see percent symbol): FAIL / NotImp
---

$ sudo tcpdump -i vnet9 'port 53'
...
22:59:07.461914 IP 192.168.122.11.48475 > rotom.domain: 36180+% [1au] A? ubuntu.com. (51)
22:59:07.462424 IP rotom.domain > 192.168.122.11.48475: 36180 NotImp 0/0/1 (62)
...

local client (no percent symbol): PASS
---

$ sudo tcpdump -i lo 'port 53'
...
22:58:24.444288 IP rotom.47673 > rotom.domain: 30984+ [1au] A? ubuntu.com. (51)
22:58:24.444915 IP rotom.domain > rotom.47673: 30984 4/0/1 A 91.189.88.181, A 91.189.91.44, A 91.189.91.45, A 91.189.88.180 (103)
...

bind forwarder / dnssec-validation yes (NO percent symbol): PASS
---

$ sudo tcpdump -i vnet9 'port 53'
...
23:04:28.551700 IP 192.168.122.11.47530 > rotom.domain: 36699+ [1au] A? ubuntu.com. (51)
23:04:28.648898 IP rotom.domain > 192.168.122.11.47530: 36699 4/0/1 A 91.189.91.45, A 91.189.88.181, A 91.189.88.180, A 91.189.91.44 (126)
...

Reference:
https://serverfault.com/questions/399911/tcpdump-dns-output-codes#400044
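
The comparison above was done by eye. A rough way to pick the same queries out of a longer capture is to filter for lines where the query ID is followed by the percent symbol. This is a sketch only: queries_with_percent is a made-up name, and the pattern assumes tcpdump prints the ID and flags in the "36180+%" form shown in the captures above.

```shell
#!/bin/sh
# Hypothetical helper: reads tcpdump DNS lines on stdin and prints only the
# queries whose ID field carries the '%' symbol, e.g. ": 36180+% [1au] A? ...".
queries_with_percent() {
    # ": <digits>" is the query ID, "\+?" an optional '+', then the '%' itself.
    grep -E ': [0-9]+\+?% '
}
```

It could be fed from a live capture, e.g. "sudo tcpdump -l -n -i vnet9 'port 53' | queries_with_percent". If nothing is printed after switching dnssec-validation to 'yes', that matches the PASS case above.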
[Bug 1909950] Re: named: TCP connections sometimes never close due to race in socket teardown
Matthew,

Thanks for the great work on this bug.
Sponsored to focal.

I've reviewed the debdiff and had only two minor changes:
1) the Description: field, to conform with DEP3/deb822 [1,2] on multiline fields (first line and paragraph separators),
2) trimmed 'and-' from the patch name to keep its line under 80 chars in the changelog (we could break it, but it's weird.)

The package built fine on all archs w/ focal-proposed enabled.

The test-case consistently broke bind9 within ~30 seconds with a powerful client VM (32 CPUs, 5 tmux tabs w/ the loop) for the version in focal-updates.

The test package consistently survived the test-case.

cheers,
Mauricio

[1] https://dep-team.pages.debian.net/deps/dep3/
[2] https://manpages.debian.org/unstable/dpkg-dev/deb822.5.en.html

** Description changed:

[Impact]

We are seeing busy Bind9 servers stop accepting TCP connections after a period of time. Looking at netstat, named is still listening to port 53 on all interfaces, but if you send a dig, the connection will just time out:

$ dig +tcp ubuntu.com @192.168.122.2
;; Connection to 192.168.122.2#53(192.168.122.2) for ubuntu.com failed: timed out.

Symptoms are that the number of tcp connections slowly increases, as does the tcp high water mark, if you run the "rndc status" command. Eventually, the number of tcp connections will reach the tcp connection limit, and named will "break" and no longer accept any new tcp connections.

There will also be a number of connections in the conntrack table stuck in the ESTABLISHED state, even though they are idle and ready to close, and there will be a number of connections in the SYN_SENT state, due to these connections getting stuck since the tcp connection limit has been reached.

This appears to be caused by a race between deactivating a netmgr handle and processing an asynchronous callback for the socket close code, which can get triggered when a client sends a broken packet to the server and then doesn't close the connection properly.

[Testcase]

You will need two VMs to reproduce this issue.

On the first, install bind9:

$ sudo apt install bind9

Set up a caching resolver by editing /etc/bind/named.conf.options and uncommenting the forwarding block, and adding a DNS provider:

forwarders {
    8.8.8.8;
};
+
+ If the DNS provider runs on dnsmasq/libvirt, also set:
+
+ dnssec-validation yes;

Next, restart the named service:

$ sudo systemctl restart named.service

Edit /etc/resolv.conf and change the resolver to 127.0.0.1.

Disable the systemd-resolved service:

$ sudo systemctl stop systemd-resolved.service

Test to make sure resolving ubuntu.com works, using the IP of the NIC:

$ dig +tcp @192.168.122.21 ubuntu.com
https://paste.ubuntu.com/p/7NQJ6RRJHN/

Now, go to the second VM:

Test to make sure that you can dig the other VM with:

$ dig +tcp @192.168.122.21 ubuntu.com

After that, use tc to intentionally drop some packets, so we can simulate bad clients dropping connections and not closing them properly, so we can see if we can trigger the race.

My NIC is enp1s0, and 30% drop should do the trick.

$ sudo tc qdisc add dev enp1s0 root netem loss 30%

Next, open gnome-terminal and paste and run the below command in 10-15 tabs, the more the better:

$ for run in {1..1}; do dig +tcp @192.168.122.21 ubuntu.com & done

This parallelizes the connections to the bind9 server, to try and get above the 150 connection limit.

Back on the server, watch the tcp high water mark in:

$ sudo rndc status
..
tcp clients: 0/150
TCP high-water: 10
..

$ sudo rndc status
..
tcp clients: 31/150
TCP high-water: 58
..

$ sudo rndc status
..
tcp clients: 56/150
TCP high-water: 141
..

$ sudo rndc status
..
tcp clients: 142/150
TCP high-water: 150
..

If you can't hit the 150 mark on tcp high water, add more tabs to the other VM and keep hitting the DNS server. This will likely make the other VM unstable as well, FYI.

Eventually, you will hit the 150 mark. After hitting it a bit longer, your bind9 server will be broken.

$ dig +tcp @192.168.122.21 ubuntu.com
;; Connection to 192.168.122.21#53(192.168.122.21) for ubuntu.com failed: timed out.
;; Connection to 192.168.122.21#53(192.168.122.21) for ubuntu.com failed: timed out.

; <<>> DiG 9.16.1-Ubuntu <<>> +tcp @192.168.122.21 ubuntu.com
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached

;; Connection to 192.168.122.21#53(192.168.122.21) for ubuntu.com failed: timed out.

Do this from the bind9 server, so you don't get confused with the 30% packet drop of the other VM.

If you install the test package from the below ppa:

https://launchpad.net/~mruffell/+archive/ubuntu/lp1909950-test

You can hit this bind9 as much as
[Bug 1909950] Re: named: TCP connections sometimes never close due to race in socket teardown
** Tags added: sts-sponsor-mfo
[Bug 1909950] Re: named: TCP connections sometimes never close due to race in socket teardown
Attached is a V2 of the debdiff for Bind9 on Focal that fixes this bug.

** Patch removed: "Debdiff for bind9 on Focal"
   https://bugs.launchpad.net/ubuntu/+source/bind9/+bug/1909950/+attachment/5461528/+files/lp1909950_focal.debdiff

** Patch added: "Debdiff for bind9 on Focal V2"
   https://bugs.launchpad.net/ubuntu/+source/bind9/+bug/1909950/+attachment/5461529/+files/lp1909950_focal_v2.debdiff

** Tags added: sts-sponsor
[Bug 1909950] Re: named: TCP connections sometimes never close due to race in socket teardown
Attached is a debdiff for Bind9 on Focal to fix this bug.

** Patch added: "Debdiff for bind9 on Focal"
   https://bugs.launchpad.net/ubuntu/+source/bind9/+bug/1909950/+attachment/5461528/+files/lp1909950_focal.debdiff
[Bug 1909950] Re: named: TCP connections sometimes never close due to race in socket teardown
** Summary changed:

- TCP connections never close
+ named: TCP connections sometimes never close due to race in socket teardown

** Description changed:

- The default timeout for TCP connections on port 53 to named/bind9 is 300
- seconds. The upstream ISC build of bind9 uses this and honors an
- overrides you set in the config files.
+ [Impact]

- The Ubuntu packaged version of bind9 seems to hold idle connections
- forever, eventually exhausting the allowed open socket limit and
- refusing to service any more requests. This is reproducible on all of
- our 20.04 hosts.
+ We are seeing busy Bind9 servers stop accepting TCP connections after a
+ period of time. Looking at netstat, named is still listening to port 53
+ on all interfaces, but if you send a dig, the connection will just time
+ out:

- lsb_release -rd
- Description: Ubuntu 20.04.1 LTS
- Release: 20.04
+ $ dig +tcp ubuntu.com @192.168.122.2
+ ;; Connection to 192.168.122.2#53(192.168.122.2) for ubuntu.com failed: timed out.

+ Symptoms are the number of tcp connections slowly increase, as well as
+ the tcp high water mark increases, if you run the "rndc status" command.
+ Eventually, the number of tcp connections will reach the tcp connection
+ limit, and named will "break" and no longer accept any new tcp
+ connections.

- Package Version: 1:9.16.1-0ubuntu2.4 500
+ There will also be a number of connections in the conntrack table stuck
+ in the ESTABLISHED state, even through they are idle and ready to close,
+ and there will be a number of connections in the SYN_SENT state, due to
+ these connections getting stuck since the tcp connection limit has been
+ reached.
+
+ This appears to be caused by a race between deactivating a netmgr handle
+ and processing a asynchronous callback for the socket close code, which
+ can get triggered when a client sends a broken packet to the server and
+ then doesn't close the connection properly.
+
+ [Testcase]
+
+ You will need two VMs to reproduce this issue.
+
+ On the first, install bind9:
+
+ $ sudo apt install bind9
+
+ Set up a caching resolver by editing /etc/bind/named.conf.options and
+ uncommenting the forwarding block, and adding a DNS provider:
+
+ forwarders {
+ 8.8.8.8;
+ };
+
+ Next, restart the named service:
+
+ $ sudo systemctl restart named.service
+
+ Edit /etc/resolv.conf and change the resolver to 127.0.0.1.
+
+ Disable the systemd-resolved service:
+
+ $ sudo systemctl stop systemd-resolved.service
+
+ Test to make sure resolving ubuntu.com works, using the IP of the NIC:
+
+ $ dig +tcp @192.168.122.21 ubuntu.com
+ https://paste.ubuntu.com/p/7NQJ6RRJHN/
+
+ Now, go to the second VM:
+
+ Test to make sure that you can dig the other VM with:
+
+ $ dig +tcp @192.168.122.21 ubuntu.com
+
+ After that, use tc to intentionally drop some packets, so we can
+ simulate bad clients dropping connections and not closing them properly,
+ so we can see if we can trigger the race.
+
+ My NIC is enp1s0, and 30% drop should do the trick.
+
+ $ sudo tc qdisc add dev enp1s0 root netem loss 30%
+
+ Next, open gnome-terminal and paste and run the below command in 10-15
+ tabs, the more the better:
+
+ $ for run in {1..1}; do dig +tcp @192.168.122.21 ubuntu.com & done
+
+ This parallelizes the connections to the bind9 server, to try and get
+ above the 150 connection limit.
+
+ Back on the server, watch the tcp high water mark in:
+
+ $ sudo rndc status
+ ..
+ tcp clients: 0/150
+ TCP high-water: 10
+ ..
+
+ $ sudo rndc status
+ ..
+ tcp clients: 31/150
+ TCP high-water: 58
+ ..
+
+ $ sudo rndc status
+ ..
+ tcp clients: 56/150
+ TCP high-water: 141
+ ..
+
+ $ sudo rndc status
+ ..
+ tcp clients: 15/150
+ TCP high-water: 150
+ ..
+
+ If you can't hit the 150 mark on tcp high water, add more tabs to the
+ other VM and keep hitting the DNS server. This will likely make the
+ other VM unstable as well, FYI.
+
+ Eventually, you will hit the 150 mark. After hitting it a bit longer,
+ your bind9 server will be broken.
+
+ $ dig +tcp @192.168.122.21 ubuntu.com
+ ;; Connection to 192.168.122.21#53(192.168.122.21) for ubuntu.com failed: timed out.
+ ;; Connection to 192.168.122.21#53(192.168.122.21) for ubuntu.com failed: timed out.
+
+ ; <<>> DiG 9.16.1-Ubuntu <<>> +tcp @192.168.122.21 ubuntu.com
+ ; (1 server found)
+ ;; global options: +cmd
+ ;; connection timed out; no servers could be reached
+
+ ;; Connection to 192.168.122.21#53(192.168.122.21) for ubuntu.com
+ failed: timed out.
+
+ Do this from the bind9 server, so you don't get confused with the 30%
+ packet drop of the other VM.
+
+ If you install the test package from the below ppa:
+
+ https://launchpad.net/~mruffell/+archive/ubuntu/lp1909950-test
+
+ You can hit this bind9 as much as you can, but it will never become
+ broken. If you stop the thundering herd at the 150 max connections, the
+ server will correctly tear down tcp connections, and you will be able to
+ successfully