[Bug 1928508] [NEW] Performance regression on memcpy() calls for AMD Zen

Heitor Alves de Siqueira Fri, 14 May 2021 12:10:57 -0700

Public bug reported:

[Impact]
On AMD Zen systems, memcpy() calls see a heavy performance regression in Focal 
and Groovy, due to the way __x86_shared_non_temporal_threshold is calculated.


Before 'glibc-2.33~455', the cache values behind this threshold were calculated 
taking into consideration the number of hardware threads in the CPU. On AMD 
Ryzen and EPYC systems, this can be counter-productive if the number of threads 
is high enough for the last-level caches to "overrun" each other and cause 
cache line flushes. The solution is to lower the size threshold for 
non-temporal stores by removing the number of threads from the equation.

[Test Plan]
Attached to this bug is a short C program that exercises memcpy() calls on 
buffers of variable length. It was obtained from a similar bug report for 
Red Hat, and is publicly available at [0].
This test program was compiled with gcc 10.2.0, using the following flags:
$ gcc -mtune=generic -march=x86-64 -g -O3 test_memcpy.c -o test_memcpy64

Tests were performed with the following criteria:
- use 32 MB buffers ("./test_memcpy64 32")
- benchmark with the hyperfine tool [1], as it calculates relevant statistics 
automatically
- benchmark with at least 10 runs in the same environment, to minimize variance
- measure on AMD Zen (3700X) and on Intel Xeon (E5-2683), to ensure we don't 
penalize one x86 vendor in favor of the other

Below is a comparison between two Focal containers, leveraging LXD to
make use of different libc versions on the same host:

$ hyperfine -n libc-2.31-0ubuntu9.2 'lxc exec focal ./test_memcpy64 32' \
    -n libc-patched 'lxc exec focal-patched ./test_memcpy64 32'
Benchmark #1: libc-2.31-0ubuntu9.2
  Time (mean ± σ):      2.723 s ±  0.013 s    [User: 4.7 ms, System: 5.1 ms]
  Range (min … max):    2.693 s …  2.735 s    10 runs

Benchmark #2: libc-patched
  Time (mean ± σ):      1.522 s ±  0.004 s    [User: 3.9 ms, System: 5.6 ms]
  Range (min … max):    1.515 s …  1.528 s    10 runs

Summary
  'libc-patched' ran
    1.79 ± 0.01 times faster than 'libc-2.31-0ubuntu9.2'
$ head -n5 /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 23
model           : 113
model name      : AMD Ryzen 7 3700X 8-Core Processor

[0] https://bugzilla.redhat.com/show_bug.cgi?id=1880670
[1] https://github.com/sharkdp/hyperfine/

[Where problems could occur]
Since we're changing the cacheinfo calculation for x86 in general, we need to 
be careful not to introduce further performance regressions on memory-heavy 
workloads. Even though initial results show an improvement on AMD Ryzen 
hardware, and the same is expected on EPYC, we should also validate different 
configurations (e.g. Intel, different buffer sizes, etc.) to make sure we 
don't hurt performance in non-AMD environments.

[Other Info]
This has been fixed by the following upstream commit:
- d3c57027470b (Reversing calculation of __x86_shared_non_temporal_threshold)

$ git describe --contains d3c57027470b
glibc-2.33~455
$ rmadison glibc -s focal,focal-updates,groovy,groovy-proposed,hirsute
 glibc | 2.31-0ubuntu9   | focal           | source
 glibc | 2.31-0ubuntu9.2 | focal-updates   | source
 glibc | 2.32-0ubuntu3   | groovy          | source
 glibc | 2.32-0ubuntu3.2 | groovy-proposed | source
 glibc | 2.33-0ubuntu5   | hirsute         | source

Affected releases include Ubuntu Focal and Groovy. Bionic is not
affected, and releases starting with Hirsute already ship the upstream
patch to fix this regression.

** Affects: glibc (Ubuntu)
     Importance: High
     Assignee: Heitor Alves de Siqueira (halves)
         Status: Fix Released

** Affects: glibc (Ubuntu Focal)
     Importance: High
     Assignee: Heitor Alves de Siqueira (halves)
         Status: Confirmed

** Affects: glibc (Ubuntu Groovy)
     Importance: High
     Assignee: Heitor Alves de Siqueira (halves)
         Status: Confirmed


** Tags: sts

** Also affects: glibc (Ubuntu Groovy)
   Importance: Undecided
       Status: New

** Also affects: glibc (Ubuntu Focal)
   Importance: Undecided
       Status: New

** Changed in: glibc (Ubuntu Focal)
   Importance: Undecided => High

** Changed in: glibc (Ubuntu Groovy)
   Importance: Undecided => High

** Changed in: glibc (Ubuntu Focal)
       Status: New => Confirmed

** Changed in: glibc (Ubuntu Groovy)
       Status: New => Won't Fix

** Changed in: glibc (Ubuntu Groovy)
       Status: Won't Fix => Confirmed

** Changed in: glibc (Ubuntu)
       Status: New => Fix Released

** Changed in: glibc (Ubuntu Focal)
     Assignee: (unassigned) => Heitor Alves de Siqueira (halves)

** Changed in: glibc (Ubuntu Groovy)
     Assignee: (unassigned) => Heitor Alves de Siqueira (halves)


-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1928508

