[Sts-sponsors] [Bug 1663280] Re: Serious performance degradation of math functions
$ dpkg -l | grep libc6:amd64
ii  libc6:amd64    2.23-0ubuntu10    amd64    GNU C Library: Shared libraries

$ time ./exp
127781126.100057

real	0m3.334s
user	0m3.336s
sys	0m0.000s

$ time LD_BIND_NOW=1 ./exp
127781126.100057

real	0m0.710s
user	0m0.708s
sys	0m0.000s

$ dpkg -l | grep libc6:amd64
ii  libc6:amd64    2.23-0ubuntu11    amd64    GNU C Library: Shared libraries

$ time ./exp
127781126.100057

real	0m0.709s
user	0m0.708s
sys	0m0.000s

$ time LD_BIND_NOW=1 ./exp
127781126.100057

real	0m0.714s
user	0m0.712s
sys	0m0.004s

** Tags removed: verification-needed verification-needed-xenial yakkety zesty
** Tags added: verification-done verification-done-xenial

--
You received this bug notification because you are a member of STS
Sponsors, which is subscribed to the bug report.
https://bugs.launchpad.net/bugs/1663280

Title:
  Serious performance degradation of math functions

Status in GLibC:
  Fix Released
Status in glibc package in Ubuntu:
  Fix Released
Status in glibc source package in Xenial:
  Fix Committed
Status in glibc source package in Zesty:
  Won't Fix
Status in glibc package in Fedora:
  Fix Released

Bug description:

  SRU Justification
  =================

  [Impact]

  * Severe performance hit on many maths-heavy workloads. For example, a
  user reports linpack performance of 13 Gflops on Trusty and Bionic but
  only 3.9 Gflops on Xenial.

  * Because the impact is so large (>3x) and Xenial is supported until
  2021, the fix should be backported.

  * The fix avoids an AVX-SSE transition penalty. It stops
  _dl_runtime_resolve() from using AVX-256 instructions, which touch the
  upper halves of various registers. This change means that the
  processor does not need to save and restore them.

  [Test Case]

  Firstly, you need a suitable Intel machine. Users report that Sandy
  Bridge, Ivy Bridge, Haswell, and Broadwell CPUs are affected, and I
  have been able to reproduce it on a Skylake CPU using a suitable Azure
  VM.

  Create the following C file, exp.c:

  #include <stdio.h>
  #include <math.h>

  int main(void)
  {
      double a, b;
      for (a = b = 0.0; b < 2.0; b += 0.0005)
          a += exp(b);
      printf("%f\n", a);
      return 0;
  }

  $ gcc -O3 -march=x86-64 -o exp exp.c -lm

  With the current version of glibc:

  $ time ./exp
  ...
  real	0m1.349s
  user	0m1.349s

  $ time LD_BIND_NOW=1 ./exp
  ...
  real	0m0.625s
  user	0m0.621s

  Observe that LD_BIND_NOW makes a big difference, as it avoids the call
  to _dl_runtime_resolve.

  With the proposed update:

  $ time ./exp
  ...
  real	0m0.625s
  user	0m0.621s

  $ time LD_BIND_NOW=1 ./exp
  ...
  real	0m0.631s
  user	0m0.631s

  Observe that the normal case is faster, and LD_BIND_NOW makes a
  negligible difference.

  [Regression Potential]

  glibc is the nightmare case for regressions, as it could affect pretty
  much anything, and this patch touches a key part (dynamic libraries).
  We can be fairly confident in the fix generally - it's in the glibc in
  Bionic, Debian, and some RPM-based distros. The backport is based on
  the patches in the release/2.23/master branch of the upstream glibc
  repository, and the backport was straightforward. Obviously that
  doesn't remove all risk. There is also a fair bit of Ubuntu-specific
  patching in glibc, so other distros are of limited value for ruling
  out bugs. So I have done the following testing, and I'm happy to do
  more as required.

  All testing has been done:
  - on an Azure VM (affected by the change), with the proposed package
  - on a local VM (not affected by the change), with the proposed package

  * Boot with the upgraded libc6.
  * Watch a YouTube video in Firefox over VNC.
  * Build some C code (debuild of zlib).
  * Test Java by installing and running Eclipse.

  Autopkgtest also passes.

  [Original Description]

  Bug [0] has been introduced in Glibc 2.23 [1] and fixed in Glibc 2.25
  [2]. All Ubuntu versions starting from 16.04 are affected because they
  use either Glibc 2.23 or 2.24.
  The bug introduces serious (2x-4x) performance degradation of math
  functions (pow, exp/exp2/exp10, log/log2/log10, sin/cos/sincos/tan,
  asin/acos/atan/atan2, sinh/cosh/tanh, asinh/acosh/atanh) provided by
  libm. The bug can be reproduced on any AVX-capable x86-64 machine.

  @strikov: According to a quite reliable source [5], all AMD CPUs and
  the latest Intel CPUs (Skylake and Knights Landing) don't suffer from
  the AVX/SSE transition penalty. This means that the scope of this bug
  becomes smaller and includes only the following generations of Intel
  CPUs: Sandy Bridge, Ivy Bridge, Haswell, and Broadwell. The scope
  still remains quite large, though.

  @strikov: Ubuntu 16.10/17.04, which use Glibc 2.24, may receive the
  fix from the upstream 2.24 branch (as Marcel pointed out, the fix has
  been backported to the 2.24 branch, where Fedora took it successfully)
  if such synchronization takes place. Ubuntu 16.04 (the main target of
  this bug) uses Glibc 2.23, which hasn't been patched upstream and will
  suffer from performance degradation until we fix it manually.

  This bug is all about the AVX-SSE transition penalty [3]. The 256-bit
  YMM registers used by AVX-256 instructions extend the 128-bit
  registers used by SSE (XMM0 is the low half of YMM0, and so on). Every
  time the CPU executes an SSE instruction after an AVX-256 instruction,
  it has to store the upper halves of the YMM registers to an internal
  buffer and then restore them when execution returns to AVX
  instructions. The store/restore is required because old-fashioned SSE
  knows nothing about the upper halves of its registers and may damage
  them.
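  [Editorial illustration, not part of the original report.] The
  AVX-SSE transition penalty at the heart of this bug can be avoided in
  user code with the vzeroupper instruction, which declares the upper
  YMM halves dead before SSE-width code runs; the glibc fix takes the
  related approach of never dirtying those halves in
  _dl_runtime_resolve() in the first place. A minimal sketch using
  compiler intrinsics (function and variable names are mine):

  ```c
  #include <stdio.h>
  #include <immintrin.h>

  /* Illustrative only: mix a 256-bit AVX operation with 128-bit code.
   * On Sandy Bridge through Broadwell, entering SSE code with dirty
   * upper YMM halves forces an internal save/restore; issuing
   * vzeroupper first marks the upper halves clean, avoiding it. */
  __attribute__((target("avx")))
  static double mixed_avx_sse(double x)
  {
      __m256d wide  = _mm256_set1_pd(x);            /* AVX-256: writes all of YMM  */
      __m128d low   = _mm256_castpd256_pd128(wide); /* keep only the XMM half      */
      _mm256_zeroupper();                           /* declare upper halves dead   */
      __m128d twice = _mm_add_pd(low, low);         /* 128-bit (SSE-width) add     */
      double out[2];
      _mm_storeu_pd(out, twice);
      return out[0];
  }

  int main(void)
  {
      if (!__builtin_cpu_supports("avx")) {         /* skip on non-AVX machines */
          puts("no AVX");
          return 0;
      }
      printf("%f\n", mixed_avx_sse(2.0));           /* 2.0 + 2.0 */
      return 0;
  }
  ```

  On an AVX-capable machine this prints 4.000000; the point is the
  instruction sequence around the AVX-to-SSE boundary, not the
  arithmetic.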
[Sts-sponsors] [Bug 1663280] Re: Serious performance degradation of math functions
pdns-recursor (s390x/i386): test fails the same way for only these 2 archs, on yakkety; failure unrelated to this SRU, ignore.
[Sts-sponsors] [Bug 1663280] Re: Serious performance degradation of math functions
mercurial (all archs): fails since security update, verified fails in local test run, test failure introduced by security update. ignore.
[Sts-sponsors] [Bug 1663280] Re: Serious performance degradation of math functions
all autopkgtest failures should be ignored based on above comments.
[Sts-sponsors] [Bug 1663280] Re: Serious performance degradation of math functions
ruby2.3/s390x: test fails on all archs, but is hinted as "always fails" only on the other archs; it should be hinted as "always fails" on s390x as well. ignore.
[Sts-sponsors] [Bug 1663280] Re: Serious performance degradation of math functions
autopkgtest regressions that should be ignored:

node-srs (all archs): test has not been run in 3 years, and fails in local build with existing glibc; failure unrelated to this SRU. ignore.
bzr (all archs): test failed for last 2 years, since last bzr pkg update; fails on local system with current glibc; ignore.
systemd/amd64: 2 failures:
  subprocess.CalledProcessError: Command '['modprobe', 'scsi_debug']' returned non-zero exit status 1
  logind FAIL stderr: grep: /sys/power/state: No such file or directory
  both unrelated to glibc (likely a change in the kernel api and/or pkging introduced this test regression), ignore.
systemd/s390x: test always failed. ignore.
linux-oracle/amd64: test fails consistently, ignore.
node-ws (all archs): test has not been run in 3 years, and fails in local build with existing glibc; failure unrelated to this SRU. ignore.
fpc (i386/armhf): test always failed, ignore.
libreoffice/i386: test always failed, ignore.
apt (all archs): existing apt test bug, ignore: https://bugs.launchpad.net/ubuntu/+source/apt/+bug/1815750
nvidia-graphics-drivers-340/armhf: test always failed, ignore.
ruby-xmlparser (all archs): test failed for 3 years, fails locally with current glibc; ignore.
linux (ppc64el/i386): tests flaky, fail more than succeed; ignore.
gearmand/armhf: test failed for last year, ignore.
node-groove (all archs): test has not been run in 3 years, and fails in local build with existing glibc; failure unrelated to this SRU. ignore.
iscsitarget/armhf: test always failed, ignore.
libnih/armhf: test blacklisted, ignore.
nplan (all archs): test flaky, almost always fails; ignore.
node-leveldown (all archs): test has not been run in 3 years, and fails in local build with existing glibc; failure unrelated to this SRU. ignore.
snapcraft: failed since pkg last updated, unrelated to this SRU, ignore.
r-bioc-genomicalignments/s390x: test not run in 3 years, fails locally with current glibc; ignore.
ruby-nokogiri (all archs): test not run for 3 years, fails locally with current glibc; ignore.
ipset (all archs): test not run for 3 years, fails locally with current glibc; ignore.
snapd (all archs): test flaky, almost always fails, ignore.
dadhi-linux/s390x: test always failed, ignore.

still reviewing:
mercurial (all archs)
ruby2.3/s390x
pdns-recursor (s390x/i386)
[Sts-sponsors] [Bug 1663280] Re: Serious performance degradation of math functions
Hello Oleg, or anyone else affected, Accepted glibc into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/glibc/2.23-0ubuntu11 in a few hours, and then in the -proposed repository. Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users. If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, without details of your testing we will not be able to proceed. Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping! N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days. ** Changed in: glibc (Ubuntu Xenial) Status: In Progress => Fix Committed ** Tags added: verification-needed verification-needed-xenial -- You received this bug notification because you are a member of STS Sponsors, which is subscribed to the bug report. https://bugs.launchpad.net/bugs/1663280 Title: Serious performance degradation of math functions Status in GLibC: Fix Released Status in glibc package in Ubuntu: Fix Released Status in glibc source package in Xenial: Fix Committed Status in glibc source package in Zesty: Won't Fix Status in glibc package in Fedora: Fix Released Bug description: SRU Justification = [Impact] * Severe performance hit on many maths-heavy workloads. For example, a user reports linpack performance of 13 Gflops on Trusty and Bionic and 3.9 Gflops on Xenial. 
* Because the impact is so large (>3x) and Xenial is supported until 2021, the fix should be backported. * The fix avoids an AVX-SSE transition penalty. It stops _dl_runtime_resolve() from using AVX-256 instructions which touch the upper halves of various registers. This change means that the processor does not need to save and restore them. [Test Case] Firstly, you need a suitable Intel machine. Users report that Sandy Bridge, Ivy Bridge, Haswell, and Broadwell CPUs are affected, and I have been able to reproduce it on a Skylake CPU using a suitable Azure VM. Create the following C file, exp.c: #include #include int main () { double a, b; for (a = b = 0.0; b < 2.0; b += 0.0005) a += exp(b); printf("%f\n", a); return 0; } $ gcc -O3 -march=x86-64 -o exp exp.c -lm With the current version of glibc: $ time ./exp ... real0m1.349s user0m1.349s $ time LD_BIND_NOW=1 ./exp ... real0m0.625s user0m0.621s Observe that LD_BIND_NOW makes a big difference as it avoids the call to _dl_runtime_resolve. With the proposed update: $ time ./exp ... real0m0.625s user0m0.621s $ time LD_BIND_NOW=1 ./exp ... real0m0.631s user0m0.631s Observe that the normal case is faster, and LD_BIND_NOW makes a negligible difference. [Regression Potential] glibc is the nightmare case for regressions as could affect pretty much anything, and this patch touches a key part (dynamic libraries). We can be fairly confident in the fix generally - it's in the glibc in Bionic, Debian and some RPM-based distros. The backport is based on the patches in the release/2.23/master branch in the upstream glibc repository, and the backport was straightforward. Obviously that doesn't remove all risk. There is also a fair bit of Ubuntu-specific patching in glibc so other distros are of limited value for ruling out bugs. So I have done the following testing, and I'm happy to do more as required. 
  All testing has been done:
  - on an Azure VM (affected by the change), with the proposed package
  - on a local VM (not affected by the change), with the proposed package

  * Boot with the upgraded libc6.
  * Watch a YouTube video in Firefox over VNC.
  * Build some C code (debuild of zlib).
  * Test Java by installing and running Eclipse.

  Autopkgtest also passes.

  [Original Description]

  Bug [0] has been introduced in Glibc 2.23 [1] and fixed in Glibc 2.25 [2]. All Ubuntu versions starting from 16.04 are affected because they use either Glibc 2.23 or 2.24.

  The bug introduces serious (2x-4x) performance degradation of math functions (pow, exp/exp2/exp10, log/log2/log10, sin/cos/sincos/tan, asin/acos/atan/atan2, sinh/cosh/tanh, asinh/acosh/atanh) provided by libm. The bug can be reproduced on any AVX-capable x86-64 machine.

  @strikov: According to a quite reliable source [5], all AMD CPUs and the latest Intel CPUs (Skylake and Knights Landing) don't suffer from the AVX/SSE transition penalty. It means that the scope of this bug becomes smaller and includes only the following generations of Intel CPUs: Sandy Bridge, Ivy Bridge, Haswell, and Broadwell. The scope still remains quite large though.

  @strikov: Ubuntu 16.10/17.04, which use Glibc 2.24, may receive the fix from the upstream 2.24 branch (as Marcel pointed out, the fix has been backported to the 2.24 branch, where Fedora took it successfully) if such synchronization takes place. Ubuntu 16.04 (the main target of this bug) uses Glibc 2.23, which hasn't been patched upstream and will suffer from performance degradation until we fix it manually.

  This bug is all about the AVX-SSE transition penalty [3]. The 256-bit YMM registers used by AVX-256 instructions extend the 128-bit registers used by SSE (XMM0 is the low half of YMM0, and so on). Every time the CPU executes an SSE instruction after an AVX-256 instruction, it has to store the upper halves of the YMM registers to an internal buffer and then restore them when execution returns to AVX instructions. The store/restore is required because old-fashioned SSE knows nothing about the upper halves of its registers and may damage them.
** Changed in: glibc (Ubuntu Xenial)
   Importance: Low => High
** Changed in: glibc (Ubuntu Xenial)
   Importance: Medium => Low
** Changed in: glibc (Ubuntu Xenial)
   Importance: Undecided => Medium