Public bug reported:

On our 16.04 LTS system we use the ipvsadm --ops UDP support (one-packet scheduling) to get a better distribution amongst our real servers behind the load-balancer for some small subset of applications.
This has worked fine through the 4.4.0-xxx kernels. But when we started a program to upgrade systems to the 4.15 series of kernels to take advantage of new facilities, the subset of systems that used the --ops option ran into problems. Everything else with the 4.15 kernels appeared to work well.

The issue appears to be the change in the ip_vs module from the "atomic_*()" increment/decrement functions used in the 4.4 kernel to the "refcount_*()" functions used in later kernels, including the 4.15 one we switched to. Unfortunately, the simple refcount_dec() function is inadequate here: it puts out a time-consuming message, with the associated handling, whenever the refcount drops to zero. That is the expected case for --ops support, which retains no state after packet delivery.

I will upload an attachment with sample messages. They are put out at packet arrival rate, which of course destroys performance. On this VM, with far fewer CPUs than our production servers, the system actually hangs (crashes?) for quite some time.

This issue was apparently already recognized as an error and has been fixed in upstream kernels. This is a reference to the 4.17 version of the fix, which we'd like to see incorporated into the next possible kernel maintenance release:

https://github.com/torvalds/linux/commit/a050d345cef0dc6249263540da1e902bba617e43#diff-75923493f6e3f314b196a8223b0d6342

We have successfully used the livepatch facility to build a livepatch .ko with the above diffs on our 4.15.0-36 system, and demonstrated the contrast in good/bad behavior with/without the livepatch module loaded. But we'd rather not have to build a version of livepatch.ko for each kernel maintenance release.

The problem is easy to generate, with only a couple of packets and a simple configuration.
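For background on the refcount behavior described above, here is a rough userspace sketch in C. It is not kernel code and not the upstream patch: every model_* name is hypothetical, and the functions only imitate the semantics as I understand them (a plain atomic decrement never complains, a checked decrement reports reaching zero, and a "dec and test" style decrement, loosely modeled on the kernel's refcount_dec_and_test(), returns true at zero instead of reporting it). A counter stands in for the costly log output.

```c
/* Userspace model only; NOT the kernel's implementation.
 * All model_* names are hypothetical and exist just for this sketch. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum model_mode {
    MODEL_ATOMIC_DEC,    /* 4.4-style: plain atomic decrement */
    MODEL_REFCOUNT_DEC,  /* checked decrement that reports reaching zero */
    MODEL_DEC_AND_TEST   /* decrement where zero is an expected result */
};

/* Stands in for the time-consuming kernel log message. */
static int warnings_emitted;

/* Plain atomic decrement: no checking at all. */
void model_atomic_dec(atomic_int *v)
{
    assert(v != NULL);
    atomic_fetch_sub(v, 1);
}

/* Checked decrement: going from 1 to 0 is treated as an error. */
void model_refcount_dec(atomic_int *r)
{
    assert(r != NULL);
    if (atomic_fetch_sub(r, 1) == 1)
        warnings_emitted++;  /* the kernel would log a message here */
}

/* Decrement for callers that expect to drop the last reference:
 * returns true when the count reaches zero, and reports nothing. */
bool model_refcount_dec_and_test(atomic_int *r)
{
    assert(r != NULL);
    return atomic_fetch_sub(r, 1) == 1;
}

/* Simulate n one-packet connections. Each starts with one reference
 * and is released right after its packet is forwarded, so every
 * release takes the count to zero. Returns the warnings emitted. */
int model_ops_packets(int n, enum model_mode mode)
{
    warnings_emitted = 0;
    for (int i = 0; i < n; i++) {
        atomic_int refcnt;
        atomic_init(&refcnt, 1);
        switch (mode) {
        case MODEL_ATOMIC_DEC:
            model_atomic_dec(&refcnt);
            break;
        case MODEL_REFCOUNT_DEC:
            model_refcount_dec(&refcnt);
            break;
        case MODEL_DEC_AND_TEST:
            (void)model_refcount_dec_and_test(&refcnt);
            break;
        }
    }
    return warnings_emitted;
}
```

In this model, calling model_ops_packets(n, MODEL_REFCOUNT_DEC) yields one warning per packet, while the other two modes yield none. That matches the pattern we see: messages at packet arrival rate only when the checked decrement is used on a path where reaching zero is normal.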
Here's a very basic test version (addresses rewritten/obscured) of an example configuration for 2 servers that worked on my test VM:

ipvsadm -A -f 100 -s rr --ops
ipvsadm -a -f 100 -r 10.129.131.227:0 -g -w 9999
ipvsadm -a -f 100 -r 10.129.131.228:0 -g -w 9999
iptables -t mangle -A PREROUTING -d 172.16.5.1/32 -j MARK --set-xmark 0x64/0xffffffff
ifconfig lo:0 172.16.5.1/32 up

Routing and addressing to achieve the above, or adaptation for one's own test environment, is left to the tester. I set up routing and addresses on my 2-NIC test machine such that packets arrived on its eth1 NIC and were directed by ip_vs out eth2.

To test, all I did was throw a few UDP packets via traceroute at the address in the iptables firewall-mark rule, from a system whose default gateway was the eth1 interface of the test system:

traceroute -m 2 172.16.5.1

Without the fix, my test ip_vs system either hangs or puts out messages as per the attachment. With our livepatch module using the above commit's contents, all is well.

I just read that, despite using apport-bug, I should include the lsb_release info requested. Here it is inline, with the uname -a output to emphasize the kernel we are running, which is the issue:

$ lsb_release -rd
Description: Ubuntu 16.04.5 LTS
Release: 16.04
$ uname -a
Linux director-16-04 4.15.0-36-generic #39~16.04.1-Ubuntu SMP Tue Sep 25 08:59:23 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Let me know of anything I can do to help accelerate addressing or understanding of this issue. Incorporating the fix seems fairly straightforward, and without it the kernel is a performance disaster for anyone using the --ops facility to any significant degree.

Thanks!
ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-image-4.15.0-36-generic 4.15.0-36.39~16.04.1
ProcVersionSignature: Ubuntu 4.15.0-36.39~16.04.1-generic 4.15.18
Uname: Linux 4.15.0-36-generic x86_64
ApportVersion: 2.20.1-0ubuntu2.18
Architecture: amd64
CurrentDesktop: Unity
Date: Thu Feb 21 20:13:18 2019
InstallationDate: Installed on 2017-06-21 (611 days ago)
InstallationMedia: Ubuntu 16.04 LTS "Xenial Xerus" - Release amd64 (20160420.1)
SourcePackage: linux-signed-hwe
UpgradeStatus: No upgrade log present (probably fresh install)

** Affects: linux-signed-hwe (Ubuntu)
     Importance: Undecided
         Status: New

** Tags: amd64 apport-bug xenial

** Attachment added: "ip_vs refcount error call trace logs for a single --ops packet"
   https://bugs.launchpad.net/bugs/1817247/+attachment/5240862/+files/kern_log.txt

--
https://bugs.launchpad.net/bugs/1817247

Title:
  4.15 kernel's ip_vs module gets refcount errors with --ops usage
