> The LXC images failed to start under linux-image-4.2.0-28-generic,
with a kernel oops.

This bug isn't about kernel oopses.

> Setting /proc/sys/net/ipv4/xfrm4_gc_thresh to 5 causes the failure almost 
> immediately.
> I would like to confirm my procedure however. I've been changing 
> /proc/sys/net/ipv4/xfrm4_gc_thresh inside the containers,
> not the host. Is this correct?

No, that's not correct, and unfortunately the "reproducer" in this bug
description is completely invalid (it was copied from a private bug for
this issue).  ipsec will NEVER work with gc_thresh set to 5, due to the
interaction between the xfrm gc_thresh and the hardcoded flowcache
per-cpu hashtable size limit.  Upstream, xfrm4_gc_thresh has been
changed to INT_MAX, because a gc_thresh doesn't make any sense for
xfrm dst entries; the total number of them possible depends entirely
on the number of cpus in the system.

There is 1 xfrm dst entry per flowcache entry, and the flowcache entries
are kept in a per-cpu hashtable that is strictly limited to 4096 entries
(per cpu).  So the total number of xfrm dst entries is at most 4096 *
num_active_cpus().  Since the dst code stops allowing new dst
allocation once the dst entry count is >= 2 * gc_thresh, the threshold
(with the current default 32k gc_thresh) where dst allocation failures
can begin to be seen is (32k * 2) / 4096 = 16 cpus.  On systems with
fewer than 16 cpus, at the default gc_thresh of 32k, there will never be
any dst allocation failures (except due to the real bug this addresses).
On systems with 16 or more cpus, at the default gc_thresh of 32k, dst
alloc failures can occur given a high enough ipsec usage rate, i.e.
constantly creating new connections - the flowcache clears all its
entries every 10 minutes, so a lightly loaded ipsec system with > 16
cpus could be fine.
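The arithmetic above can be sketched as a quick shell calculation
(GC_THRESH is the current default and PER_CPU_LIMIT the hardcoded
flowcache limit described above):

```shell
#!/bin/sh
# Compute the cpu count at which dst allocation failures become possible,
# given the per-cpu flowcache limit and a gc_thresh value.
GC_THRESH=32768      # current default xfrm4_gc_thresh (32k)
PER_CPU_LIMIT=4096   # hardcoded flowcache per-cpu hashtable size
# Allocation stops once the dst entry count reaches 2 * gc_thresh, and the
# count can be at most PER_CPU_LIMIT * cpus, so failures need at least:
MIN_CPUS=$(( (GC_THRESH * 2) / PER_CPU_LIMIT ))
echo "dst allocation failures possible with >= $MIN_CPUS cpus"   # -> 16
```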

Setting the xfrm4_gc_thresh value to anything at or below
(4096 * CPUS) / 2 will result in failures, and in fact there is no point
in setting xfrm4_gc_thresh to ANYTHING other than INT_MAX, because the
garbage collection doesn't actually remove any dst entries.
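Given that, a sketch of checking and raising the threshold - on the
host, not inside a container - to match the upstream INT_MAX default
(2147483647 being INT_MAX here):

```shell
# Check the current threshold, then set it to INT_MAX as upstream now does.
cat /proc/sys/net/ipv4/xfrm4_gc_thresh
echo 2147483647 > /proc/sys/net/ipv4/xfrm4_gc_thresh
```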

All that, unfortunately, is tangential to the real bug.  The problem
here is that with multiple net namespaces (i.e. multiple containers) all
running ipsec, the dst entry counter changes for one container can
incorrectly be applied to a different container - so one container's dst
entry count can steadily (and incorrectly) go down while another
container's dst entry count keeps (incorrectly) going up.  The latter
container eventually reaches its 2 * gc_thresh limit and encounters dst
allocation failures, making its ipsec network unusable.  Unfortunately,
that error looks identical to the error seen when the dst entry counter
is correct but the system has 16 or more cpus, as described above.

To test this fix, multiple containers must be started (just 2 is fine).  On 
each container, new ipsec connections should be created as fast as possible; 
e.g. something like:

while true ; do ping -c 1 OTHER_CONTAINER ; done

so the containers are pinging each other - but it's important to use -c 1
so that each ping creates a new ipsec dst entry; a plain continuous ping
will re-use the existing dst entry.

After a sufficiently long period (depending on the number of containers,
the number of cpus, and the luck of which containers the dst entry count
changes get incorrectly assigned to), one or more containers should hit
dst allocation failures, and their ipsec networks should no longer be
usable.  To speed up reproduction of this bug, lower the xfrm4_gc_thresh
to a value ABOVE (2 * 4096 * CPUS), but close to it - e.g. something
like 10k * CPUS.  With that gc_thresh value set, this bug should be
reproducible fairly quickly (on the order of days or less) without the
patch, but not reproducible with the patch (i.e. with the -proposed
repo kernel).
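A sketch of that setup step, run on the host as root (using the 10k per
cpu value suggested above):

```shell
#!/bin/sh
# Lower xfrm4_gc_thresh to speed up reproduction, while keeping it high
# enough that dst allocation failures cannot happen without the bug.
CPUS=$(getconf _NPROCESSORS_ONLN)
THRESH=$(( 10 * 1024 * CPUS ))
echo "$THRESH" > /proc/sys/net/ipv4/xfrm4_gc_thresh
```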

  using ipsec, many connections result in no buffer space error