Public bug reported:

On our 16.04 LTS system we use the ipvsadm --ops UDP support (one-packet
scheduling) to get a better distribution amongst our real servers behind
the load-balancer for some small subset of applications.

This has worked fine through the 4.4.0-xxx kernels.   But when we
started a program to upgrade systems to use the 4.15 series of kernels
to take advantage of new facilities, the subset of systems which used
the --ops option ran into problems.   Everything else with the 4.15
kernels appeared to work well.

The issue appears to stem from the ip_vs module's change from the
"atomic_*()" increment/decrement functions used in the 4.4 kernel to
"refcount_*()" functions in later kernels, including the 4.15 one we
switched to.  Unfortunately, the simple refcount_dec() function is a
poor fit here: it treats the refcount dropping to zero as an error and
puts out a time-consuming message each time, yet a drop to zero is
expected with --ops support, which retains no state after packet
delivery.  I will upload an attachment with the sample messages, which
get put out at packet arrival rate and of course destroy performance.
On this VM, which has far fewer CPUs than our production servers, the
system actually hangs (crashes?) for quite some time.
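
The difference described above can be illustrated with a small
userspace model.  This is only a sketch under stated assumptions: the
type and function names (model_refcount, model_dec_warn_on_zero,
model_dec_and_test, simulate_ops_packets) are hypothetical and are not
the kernel's refcount_t code or the ip_vs code.  It assumes each --ops
packet creates a connection holding one reference and drops it right
after delivery.  The kernel's refcount API does offer variants such as
refcount_dec_and_test() that report reaching zero to the caller; which
variant the upstream fix actually uses should be checked against the
commit itself.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Userspace model only; NOT the kernel's refcount_t. */
typedef struct {
    atomic_int refs;
} model_refcount;

/* Plain decrement: reaching zero is treated as a bug and logged.
 * Returns true if a warning was emitted. */
static bool model_dec_warn_on_zero(model_refcount *r)
{
    int old = atomic_fetch_sub(&r->refs, 1);
    if (old == 1) {
        fprintf(stderr, "model: refcount hit 0 in plain decrement\n");
        return true;
    }
    return false;
}

/* Decrement-and-test: reaching zero is a normal result that is
 * reported to the caller instead of being logged. */
static bool model_dec_and_test(model_refcount *r)
{
    return atomic_fetch_sub(&r->refs, 1) == 1;
}

/* Simulate `packets` one-packet connections, each created with a
 * single reference and released immediately after "delivery".
 * Returns the number of warnings that were emitted. */
int simulate_ops_packets(int packets, bool tolerate_zero)
{
    int warnings = 0;
    for (int i = 0; i < packets; i++) {
        model_refcount conn;
        atomic_init(&conn.refs, 1);
        if (tolerate_zero) {
            (void)model_dec_and_test(&conn);
        } else if (model_dec_warn_on_zero(&conn)) {
            warnings++;
        }
    }
    return warnings;
}
```

In this model, calling simulate_ops_packets(n, false) emits one warning
per packet, which mirrors messages appearing at packet arrival rate,
while simulate_ops_packets(n, true) emits none.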

This issue was apparently already recognized as an error and has
appeared as a fix in upstream kernels. This is a reference to the 4.17
version of the fix that we'd like to see incorporated into the next
possible kernel maintenance release:

https://github.com/torvalds/linux/commit/a050d345cef0dc6249263540da1e902bba617e43#diff-75923493f6e3f314b196a8223b0d6342

We have used the livepatch facility to build a livepatch .ko with the
above diffs on our 4.15.0-36 system and successfully demonstrated the
contrast in good/bad behavior with/without the livepatch module loaded.
But we'd rather not have to build a version of livepatch.ko for each
kernel maintenance release.

The problem is easy to reproduce, with only a couple of packets and a
simple configuration.   Here's a very basic example configuration for
2 servers (addresses rewritten/obscured) that worked on my test VM:

ipvsadm -A -f 100 -s rr --ops
ipvsadm -a -f 100 -r 10.129.131.227:0 -g -w 9999
ipvsadm -a -f 100 -r 10.129.131.228:0 -g -w 9999
iptables -t mangle -A PREROUTING -d 172.16.5.1/32 -j MARK --set-xmark 0x64/0xffffffff
ifconfig lo:0 172.16.5.1/32 up

Routing and addressing to achieve the above, or adaptation for one's own
test environment, is left to the tester.

I set up routing and addresses on my two-NIC test machine such that
packets arrived on its eth1 NIC and were directed by ip_vs out eth2.
To test, all I did was throw a few UDP packets via traceroute at the
address in the iptables firewall mark rule, from a system whose default
gateway was the test machine's eth1 interface:

  traceroute -m 2 172.16.5.1

Without the fix my test ip_vs system either hangs or puts out messages
as per the attached.  With our livepatch module using the above commit's
contents, all is well.

I just read that, despite using apport-bug, I should include the
lsb_release info requested.   Here it is inline, along with uname -a to
emphasize the kernel we are running, which is where the issue lies:

$ lsb_release -rd
Description:    Ubuntu 16.04.5 LTS
Release:        16.04
$ uname -a
Linux director-16-04 4.15.0-36-generic #39~16.04.1-Ubuntu SMP Tue Sep 25 08:59:23 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux


Let me know of anything I can do to help accelerate understanding or
addressing of this issue.  Incorporating the fix seems fairly
straightforward, and without it the --ops facility is a performance
disaster for anyone using it to any significant degree.

Thanks!

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-image-4.15.0-36-generic 4.15.0-36.39~16.04.1
ProcVersionSignature: Ubuntu 4.15.0-36.39~16.04.1-generic 4.15.18
Uname: Linux 4.15.0-36-generic x86_64
ApportVersion: 2.20.1-0ubuntu2.18
Architecture: amd64
CurrentDesktop: Unity
Date: Thu Feb 21 20:13:18 2019
InstallationDate: Installed on 2017-06-21 (611 days ago)
InstallationMedia: Ubuntu 16.04 LTS "Xenial Xerus" - Release amd64 (20160420.1)
SourcePackage: linux-signed-hwe
UpgradeStatus: No upgrade log present (probably fresh install)

** Affects: linux-signed-hwe (Ubuntu)
     Importance: Undecided
         Status: New


** Tags: amd64 apport-bug xenial

** Attachment added: "ip_vs refcount error call trace logs for a single --ops packet"
   https://bugs.launchpad.net/bugs/1817247/+attachment/5240862/+files/kern_log.txt

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1817247

Title:
  4.15 kernel's ip_vs module gets refcount errors with --ops usage
