** Attachment added: "kmod.c"
   
https://bugs.launchpad.net/ubuntu/+source/stress-ng/+bug/1851553/+attachment/5308837/+files/kmod.c

** Description changed:

  [Impact]
  
   * Users running stress-ng's 'af-alg' stressor (which is part of the 'cpu'
     and 'os' classes of stressors) with 50+ instances may get a failure exit
     status and the message 'bind failed, errno=110 (Connection timed out)'.
  
   * For MAAS users, this means the CPU hardware tests (that run the 'cpu'
     class of stressors) on larger systems might report 'FAILED' status, thus
     possibly misleading admins about the hardware present in the system.
  
   * The root cause is the kernel's threshold on concurrent module loading
     requests (50), which the crypto API exercises during the bind() system
     call in order to load the requested crypto algorithm module.
  
   * The problem is a race condition between two instances. The first
     instance exceeds the threshold of concurrent module loading requests
     (50) and times out while waiting for a second chance (5 seconds). A
     second instance successfully requests the module load, but the module's
     self-tests do not finish within the time-out running in the first
     instance (60 seconds), as all the CPUs are under stress. The time-out
     error is then returned to userspace via bind().
  
   * Not all instances fail with that error: once the crypto algorithm
     module is successfully loaded (i.e., another concurrent instance
     requested it and the module self-tests eventually finished), the
     problem no longer occurs.
  
   * The fix simply checks for ETIMEDOUT errno/failure on the bind() system
     call, and performs a bounded retry loop (3 attempts), as the module may
     just have been loaded successfully by another instance.
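
  The retry described above can be sketched in C as follows. This is a
  minimal illustration under stated assumptions, not the actual stress-ng
  patch: the names bind_retry_on_timeout, BIND_MAX_ATTEMPTS, bind_fn and
  fake_bind are hypothetical. The bind call is passed in as a function
  pointer only so the loop can be exercised without a kernel that
  reproduces the race.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical constant: the description mentions 3 attempts. */
#define BIND_MAX_ATTEMPTS 3

/* Same shape as bind(2), so a stand-in can be injected (assumption). */
typedef int (*bind_fn)(int fd, const struct sockaddr *addr, socklen_t len);

/*
 * Call do_bind up to BIND_MAX_ATTEMPTS times, retrying only when it
 * fails with ETIMEDOUT (the module may have been loaded by another
 * instance in the meantime). Any other error is returned immediately.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int bind_retry_on_timeout(bind_fn do_bind, int fd,
                                 const struct sockaddr *addr, socklen_t len)
{
    int attempt;

    for (attempt = 0; attempt < BIND_MAX_ATTEMPTS; attempt++) {
        if (do_bind(fd, addr, len) == 0)
            return 0;
        if (errno != ETIMEDOUT)
            return -1;  /* unrelated failure: do not retry */
    }
    errno = ETIMEDOUT;  /* every attempt timed out */
    return -1;
}

/*
 * Stand-in for bind(2), for illustration only: it fails with ETIMEDOUT
 * fake_timeouts_left times and then succeeds, counting its calls.
 */
static int fake_timeouts_left;
static int fake_calls;

static int fake_bind(int fd, const struct sockaddr *addr, socklen_t len)
{
    (void)fd;
    (void)addr;
    (void)len;

    fake_calls++;
    if (fake_timeouts_left > 0) {
        fake_timeouts_left--;
        errno = ETIMEDOUT;
        return -1;
    }
    return 0;
}
```

  In real use, do_bind would be bind(2) or a thin wrapper around it, called
  on the AF_ALG socket. The loop is bounded so that a persistent time-out is
  still reported as a failure.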
  
  [Test Case]
  
-  * A synthetic reproducer is available; a kernel module that uses kprobes to
-    force the synchronization of af-alg instances to happen in the way needed
-    to reproduce the problem.
+  * A synthetic reproducer is available; a kernel module that uses kprobes to
+    force the synchronization of af-alg instances to happen in the way needed
+    to reproduce the problem. (comments #7 and #13, test in comments #10-#12)
  
-  * With the kernel module loaded, one of the af-alg instances (not all of
-    them) hits the bind() connection timed out if this patch is not applied.
+  * With the kernel module loaded, one of the af-alg instances (not all of
+    them) hits the bind() connection timed out if this patch is not applied.
  
  [Regression Potential]
  
-  * The code changes are minimal and contained within af-alg stressor
+  * The code changes are minimal and contained within af-alg stressor
  code.
  
-  * Differences in behavior might be af-alg/cpu/os stressors that now
-    pass/exit with successful status on larger systems.
+  * Differences in behavior might be af-alg/cpu/os stressors that now
+    pass/exit with successful status on larger systems.
  
  [Other Info]
  
   * Fix applied in stress-ng [1] in V0.10.09 and in Focal (development
     series).
  
   * Backport provided for these stable releases: Bionic, Disco, Eoan.
  
   [1] https://kernel.ubuntu.com/git/cking/stress-ng.git/commit/?id=637e0a9b7050cc69e76eeb7b61c14a659d8b8cfd
  
  [Original Description]
  
  The MAAS hardware test for CPU (long/12h) fails due to stress-ng-af-alg
  bind() errors.
  
  stress-ng-cpu-long <...> Failed [View log]
  
  disabled 'cpu-online' as it may hang the machine (enable it with the --pathological option)
  dispatching hogs: 72 af-alg, 72 atomic, 72 branch, 72 bsearch, 72 cache, 72 context, 72 cpu, 72 crypt, 72 fp-error, 72 funccall, 72 getrandom, 72 heapsort, 72 hsearch, 72 icache, 72 ioport, 72 lockbus, 72 longjmp, 72 lsearch, 72 malloc, 72 matrix, 72 membarrier, 72 memcpy, 72 mergesort, 72 nop, 72 numa, 72 opcode, 72 qsort, 72 radixsort, 72 rdrand, 72 str, 72 stream, 72 tree, 72 tsc, 72 tsearch, 72 vecmath, 72 wcs, 72 zlib
  stress-ng-numa: system has 2 of a maximum 1024 memory NUMA nodes
  stress-ng-stream: stressor loosely based on a variant of the STREAM benchmark code
  stress-ng-stream: do NOT submit any of these results to the STREAM benchmark results
  stress-ng-stream: Using CPU cache size of 25344K
  stress-ng-af-alg: bind failed, errno=110 (Connection timed out)
  stress-ng-af-alg: bind failed, errno=110 (Connection timed out)
  stress-ng-af-alg: bind failed, errno=110 (Connection timed out)
  stress-ng-af-alg: bind failed, errno=110 (Connection timed out)
  stress-ng-af-alg: bind failed, errno=110 (Connection timed out)
  stress-ng-af-alg: bind failed, errno=110 (Connection timed out)
  ...
  process 6626 (stress-ng-af_alg) terminated with an error, exit status=1 (stress-ng core failure)
  process 6673 (stress-ng-af_alg) terminated with an error, exit status=1 (stress-ng core failure)
  process 6713 (stress-ng-af_alg) terminated with an error, exit status=1 (stress-ng core failure)
  process 6751 (stress-ng-af_alg) terminated with an error, exit status=1 (stress-ng core failure)
  process 6800 (stress-ng-af_alg) terminated with an error, exit status=1 (stress-ng core failure)
  ...
  unsuccessful run completed in 44935.38s (12 hours, 28 mins, 55.38 secs)
  ...

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1851553

Title:
  stress-ng-af-alg: bind failed, errno=110 (Connection timed out)

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/stress-ng/+bug/1851553/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
