** Attachment added: "kmod.c" https://bugs.launchpad.net/ubuntu/+source/stress-ng/+bug/1851553/+attachment/5308837/+files/kmod.c
** Description changed:

  [Impact]

   * Users running stress-ng's 'af-alg' stressor (which is part of the
     'cpu' and 'os' classes of stressors) with 50+ instances may get a
     failure exit status and the message
     'bind failed, errno=110 (Connection timed out)'.

   * For MAAS users, this means the CPU hardware tests (which run the
     'cpu' class of stressors) on larger systems might report 'FAILED'
     status, possibly misleading admins about the state of the hardware
     in the system.

   * The root cause is the kernel's threshold on concurrent module
     loading requests (50), which the crypto API exercises at the time
     of the bind() system call in order to load the requested crypto
     algorithm module.

   * The failure is a race between two instances. One instance exceeds
     the threshold of concurrent module loading (50 requests) and times
     out while waiting for a second chance (5 seconds). Another instance
     gets its module load request through, but the module's self-tests
     do not finish within the time-out running in the first instance
     (60 seconds), because all the CPUs are under stress at that point.
     The time-out error is then returned to userspace by bind().

   * Not all instances fail with that error: once the crypto algorithm
     module is successfully loaded (i.e., requested by another concurrent
     instance, with the module self-tests eventually finished), the
     problem no longer occurs.

   * The fix checks for an ETIMEDOUT errno/failure on the bind() system
     call and performs a bounded retry loop (3 attempts), as the module
     may just have been loaded successfully by another instance.

  [Test Case]

   * A synthetic reproducer is available: a kernel module that uses
     kprobes to force the synchronization of af-alg instances to happen
     in the way needed to reproduce the problem.
     (comments #7 and #13, test in comments #10-#12)

   * With the kernel module loaded, one of the af-alg instances (not all
     of them) hits the bind() connection timed out error if this patch
     is not applied.

  [Regression Potential]

   * The code changes are minimal and contained within the af-alg
     stressor code.

   * Differences in behavior might be af-alg/cpu/os stressors that now
     pass/exit with successful status on larger systems.

  [Other Info]

   * Fix applied in stress-ng [1] on V0.10.09 and in Focal (development
     series).

   * Backport provided for these stable releases: Bionic, Disco, Eoan.

  [1] https://kernel.ubuntu.com/git/cking/stress-ng.git/commit/?id=637e0a9b7050cc69e76eeb7b61c14a659d8b8cfd

  [Original Description]

  The MAAS hardware test for CPU (long/12h) fails due to stress-ng-af-alg
  bind() errors.
  stress-ng-cpu-long <...> Failed [View log]

  disabled 'cpu-online' as it may hang the machine (enable it with the --pathological option)
  dispatching hogs: 72 af-alg, 72 atomic, 72 branch, 72 bsearch, 72 cache, 72 context, 72 cpu, 72 crypt, 72 fp-error, 72 funccall, 72 getrandom, 72 heapsort, 72 hsearch, 72 icache, 72 ioport, 72 lockbus, 72 longjmp, 72 lsearch, 72 malloc, 72 matrix, 72 membarrier, 72 memcpy, 72 mergesort, 72 nop, 72 numa, 72 opcode, 72 qsort, 72 radixsort, 72 rdrand, 72 str, 72 stream, 72 tree, 72 tsc, 72 tsearch, 72 vecmath, 72 wcs, 72 zlib
  stress-ng-numa: system has 2 of a maximum 1024 memory NUMA nodes
  stress-ng-stream: stressor loosely based on a variant of the STREAM benchmark code
  stress-ng-stream: do NOT submit any of these results to the STREAM benchmark results
  stress-ng-stream: Using CPU cache size of 25344K
  stress-ng-af-alg: bind failed, errno=110 (Connection timed out)
  stress-ng-af-alg: bind failed, errno=110 (Connection timed out)
  stress-ng-af-alg: bind failed, errno=110 (Connection timed out)
  stress-ng-af-alg: bind failed, errno=110 (Connection timed out)
  stress-ng-af-alg: bind failed, errno=110 (Connection timed out)
  stress-ng-af-alg: bind failed, errno=110 (Connection timed out)
  ...
  process 6626 (stress-ng-af_alg) terminated with an error, exit status=1 (stress-ng core failure)
  process 6673 (stress-ng-af_alg) terminated with an error, exit status=1 (stress-ng core failure)
  process 6713 (stress-ng-af_alg) terminated with an error, exit status=1 (stress-ng core failure)
  process 6751 (stress-ng-af_alg) terminated with an error, exit status=1 (stress-ng core failure)
  process 6800 (stress-ng-af_alg) terminated with an error, exit status=1 (stress-ng core failure)
  ...
  unsuccessful run completed in 44935.38s (12 hours, 28 mins, 55.38 secs)
  ...

--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1851553

Title:
  stress-ng-af-alg: bind failed, errno=110 (Connection timed out)
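
For illustration only, the bounded retry described under [Impact] (retry bind() when it fails with ETIMEDOUT, at most 3 attempts) could look roughly like the sketch below. This is not the code from the stress-ng commit. The names bind_with_retry, bind_attempt_fn, BIND_MAX_ATTEMPTS, and the fake_bind stand-in are all hypothetical, and the bind() attempt is passed in as a callback so the logic can be exercised without an AF_ALG socket.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Retry bound taken from the bug description (3 attempts). */
#define BIND_MAX_ATTEMPTS 3

/*
 * Hypothetical callback type: performs one bind() attempt and follows the
 * bind(2) convention, returning 0 on success or -1 with errno set.
 */
typedef int (*bind_attempt_fn)(void *ctx);

/*
 * Call attempt() up to max_attempts times. Only ETIMEDOUT triggers a retry,
 * since the module may have been loaded by another instance in the meantime;
 * any other error is returned to the caller immediately.
 */
int bind_with_retry(bind_attempt_fn attempt, void *ctx, int max_attempts)
{
    int ret = -1;

    assert(attempt != NULL);
    if (max_attempts < 1) {
        errno = EINVAL;
        return -1;
    }

    for (int i = 0; i < max_attempts; i++) {
        ret = attempt(ctx);
        if (ret == 0)
            return 0;           /* bound successfully */
        if (errno != ETIMEDOUT)
            break;              /* unrelated failure, do not retry */
    }
    return ret;                 /* last failure, errno left as set by attempt() */
}

/* Stand-in for the real bind(): times out a fixed number of times, then succeeds. */
struct fake_bind {
    int timeouts_left;  /* remaining simulated ETIMEDOUT failures */
    int calls;          /* number of attempts made so far */
};

int fake_bind_attempt(void *ctx)
{
    struct fake_bind *fb = ctx;

    fb->calls++;
    if (fb->timeouts_left > 0) {
        fb->timeouts_left--;
        errno = ETIMEDOUT;
        return -1;
    }
    return 0;
}
```

In a real stressor, the callback would presumably wrap a bind() call on an AF_ALG socket with a struct sockaddr_alg (from <linux/if_alg.h>) describing the requested algorithm. With the stand-in above, two simulated time-outs followed by a success are absorbed by the three attempts, while three consecutive time-outs still surface ETIMEDOUT to the caller.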