** Description changed:

- Fix applied in stress-ng [1] on V0.10.09 and in Focal (development series),
- working on the backport patch for the stable releases (Bionic, Disco, Eoan).
-
- [1] https://kernel.ubuntu.com/git/cking/stress-ng.git/commit/?id=637e0a9b7050cc69e76eeb7b61c14a659d8b8cfd
+ [Impact]
+
+  * Users running stress-ng's 'af-alg' stressor (which is part of the 'cpu' and
+    'os' classes of stressors) with 50+ instances, might get failure exit status
+    and the message 'bind failed, errno=110 (Connection timed out)'.
+
+  * For MAAS users, this means the CPU hardware tests (that run the 'cpu' class
+    of stressors) on larger systems might report 'FAILED' status, thus possibly
+    misleading administrators about the hardware present in the system.
+
+  * It has been determined that the problem root cause is related to concurrent
+    module loading request threshold in the kernel (50), which is exercised by
+    the crypto API at the time of the bind() system call (so to load the crypto
+    algorithm module requested).
+
+  * The problem happens due to a race condition between the instance that exceeded
+    the threshold of concurrent module loading (50 requests), then timed out while
+    waiting for a second chance (5 seconds), and another instance that successfully
+    made it and requested the module load but the module's self-tests didn't finish
+    within the time-out running in the first instance (60 seconds), as all the CPUs
+    are currently under stress; this error is then returned to userspace/bind().
+
+  * Not all instances fail with that error, as once the crypto algorithm module
+    is successfully loaded (i.e., by another concurrent instance and the module
+    self-tests eventually finished), the problem no longer occurs.
+
+  * The fix simply checks for ETIMEDOUT errno/failure on the bind() system call,
+    and performs a bounded retry loop (3 attempts), as the module may just have
+    been loaded successfully by another instance.
+
+ [Test Case]
+
+  * A synthetic reproducer is available; a kernel module that uses kprobes to
+    force the synchronization of af-alg instances to happen in the way needed
+    to reproduce the problem.
+
+  * With the kernel module loaded, one of the af-alg instances (not all of them)
+    hits the bind() connection timed out error if this fix/patch is not applied.
+
+ [Regression Potential]
+
+  * The code changes are minimal and contained within the af-alg stressor
+    code.
+
+  * Differences in behavior might be af-alg/cpu/os stressors that now pass/exit
+    with successful status on larger systems.
+
+ [Other Info]
+
+  * Fix applied in stress-ng [1] on V0.10.09 and in Focal (development
+    series).
+
+  * Backport provided for these stable releases: Bionic, Disco, Eoan.
+
+ [1] https://kernel.ubuntu.com/git/cking/stress-ng.git/commit/?id=637e0a9b7050cc69e76eeb7b61c14a659d8b8cfd
+
+ [Original Description]
+
+ The MAAS hardware test for CPU (long/12h) fails due to stress-ng-af-alg
+ bind() errors.
+
+ stress-ng-cpu-long <...> Failed [View log]
+
+ disabled 'cpu-online' as it may hang the machine (enable it with the --pathological option)
+ dispatching hogs: 72 af-alg, 72 atomic, 72 branch, 72 bsearch, 72 cache, 72 context, 72 cpu, 72 crypt, 72 fp-error, 72 funccall, 72 getrandom, 72 heapsort, 72 hsearch, 72 icache, 72 ioport, 72 lockbus, 72 longjmp, 72 lsearch, 72 malloc, 72 matrix, 72 membarrier, 72 memcpy, 72 mergesort, 72 nop, 72 numa, 72 opcode, 72 qsort, 72 radixsort, 72 rdrand, 72 str, 72 stream, 72 tree, 72 tsc, 72 tsearch, 72 vecmath, 72 wcs, 72 zlib
+ stress-ng-numa: system has 2 of a maximum 1024 memory NUMA nodes
+ stress-ng-stream: stressor loosely based on a variant of the STREAM benchmark code
+ stress-ng-stream: do NOT submit any of these results to the STREAM benchmark results
+ stress-ng-stream: Using CPU cache size of 25344K
+ stress-ng-af-alg: bind failed, errno=110 (Connection timed out)
+ stress-ng-af-alg: bind failed, errno=110 (Connection timed out)
+ stress-ng-af-alg: bind failed, errno=110 (Connection timed out)
+ stress-ng-af-alg: bind failed, errno=110 (Connection timed out)
+ stress-ng-af-alg: bind failed, errno=110 (Connection timed out)
+ stress-ng-af-alg: bind failed, errno=110 (Connection timed out)
+ ...
+ process 6626 (stress-ng-af_alg) terminated with an error, exit status=1 (stress-ng core failure)
+ process 6673 (stress-ng-af_alg) terminated with an error, exit status=1 (stress-ng core failure)
+ process 6713 (stress-ng-af_alg) terminated with an error, exit status=1 (stress-ng core failure)
+ process 6751 (stress-ng-af_alg) terminated with an error, exit status=1 (stress-ng core failure)
+ process 6800 (stress-ng-af_alg) terminated with an error, exit status=1 (stress-ng core failure)
+ ...
+ unsuccessful run completed in 44935.38s (12 hours, 28 mins, 55.38 secs)
+ ...
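
For readers who want to see the shape of the fix described under [Impact]
(retry bind() only on ETIMEDOUT, at most 3 attempts), here is a minimal
sketch in C. It is not the actual stress-ng patch: the names
bind_with_retry, bind_attempt_fn, bind_args, bind_once, flaky_bind and
flaky_bind_once are all hypothetical, and the sketch assumes no delay
between attempts, which the description above does not specify. The
callback indirection is only there so the loop can be exercised without
an AF_ALG socket.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>

/* Hypothetical callback: one bind attempt, returns 0 or -1 with errno set. */
typedef int (*bind_attempt_fn)(void *ctx);

/*
 * Run `attempt` up to max_attempts times. Retry only when it fails with
 * ETIMEDOUT; any other error is returned to the caller immediately.
 * Returns 0 on success, -1 on failure with errno left as set by the
 * last attempt.
 */
static int bind_with_retry(bind_attempt_fn attempt, void *ctx, int max_attempts)
{
    assert(max_attempts > 0);

    for (int i = 0; i < max_attempts; i++) {
        if (attempt(ctx) == 0)
            return 0;
        if (errno != ETIMEDOUT)
            return -1;          /* unrelated failure: do not retry */
    }
    return -1;                  /* still ETIMEDOUT after the last attempt */
}

/* Arguments for a real bind() call (assumed layout, for illustration). */
struct bind_args {
    int fd;
    const struct sockaddr *addr;
    socklen_t len;
};

/* Real attempt: forwards to bind(2) on the socket described by ctx. */
static int bind_once(void *ctx)
{
    const struct bind_args *a = ctx;

    return bind(a->fd, a->addr, a->len);
}

/*
 * Stand-in for a loaded system: times out a fixed number of times, then
 * succeeds, as if another instance had finished loading the module.
 */
struct flaky_bind {
    int timeouts_left;
    int calls;
};

static int flaky_bind_once(void *ctx)
{
    struct flaky_bind *f = ctx;

    f->calls++;
    if (f->timeouts_left > 0) {
        f->timeouts_left--;
        errno = ETIMEDOUT;
        return -1;
    }
    return 0;
}
```

With a real AF_ALG socket, the caller would fill a struct sockaddr_alg
(from <linux/if_alg.h>), point bind_args at it, and call
bind_with_retry(bind_once, &args, 3). Keeping the loop bounded means a
genuinely broken system still reports the failure instead of spinning.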

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1851553

Title:
  stress-ng-af-alg: bind failed, errno=110 (Connection timed out)

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/stress-ng/+bug/1851553/+subscriptions

-- 
ubuntu-bugs mailing list
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
