Re: TCG performance on PPC64

Daniel Henrique Barboza Wed, 18 May 2022 06:56:48 -0700

I'm adding qemu-devel for extra visibility, since you've also measured TCG
performance of aarch64 and s390x guests on x86 and M1 hosts.



This is very interesting. Nice work collecting all this data.

As for ppc64 performance, I'll say that I am surprised that, in the end,
the ppc64 TCG backend isn't that bad in comparison with x86. There's a good
chance that the pseries guest executes a lot of instructions whose emulation
benefits x86 more than ppc64. Otherwise we wouldn't see other guests perform
better on ppc64.

The aarch64 guest is booting too slowly IMHO. I'd guess that some command
line tuning could turn off features that the ARM guest enables by default.

I'll also mention that I had no idea that the Apple M1 was that fast. Apple
really meant business when developing this chip.


Thanks,


Daniel


On 5/18/22 10:16, Matheus K. Ferst wrote:

Hi,

Since we started working with QEMU on PPC, we've noticed that
emulating PPC64 VMs is faster on x86_64 than on PPC64 itself, even when
compared with x86 machines that are slower in other workloads (like building
QEMU or the Linux kernel).

We thought this was related to the TCG backend, which might be better
optimized on x86. As a first approach to better understand the problem, I ran
some boot tests with Fedora Cloud Base 35-1.2[1] on both platforms. Using the
command line

./qemu-system-ppc64 -name Fedora-Cloud-Base-35-1.2.ppc64le -smp 2 -m 2G \
  -vga none -nographic -serial pipe:Fedora-Cloud-Base-35-1.2.ppc64le \
  -monitor unix:Fedora-Cloud-Base-35-1.2.ppc64le.mon,server,nowait \
  -device virtio-net,netdev=vmnic -netdev user,id=vmnic \
  -cdrom fedora-cloud-init.iso -cpu POWER10 -accel tcg \
  -device virtio-scsi-pci \
  -drive file=Fedora-Cloud-Base-35-1.2.ppc64le.temp.qcow2,if=none,format=qcow2,id=hd0 \
  -device scsi-hd,drive=hd0 -boot c

on a POWER9 DD2.2 and an Intel Xeon E5-2687W, a simple bash script reads the
".out" pipe until the "fedora login:" string is found and then issues a
"system_powerdown" through the QEMU monitor. The ".temp.qcow2" file is backed
by the original Fedora image and deleted at the end of the test, so every boot
is fresh. Running the test 10 times gave us 235.26 ± 6.27 s on PPC64 and
192.92 ± 4.53 s on x86_64, i.e., TCG is ~20% slower on the POWER9.
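
The boot test described above could be sketched in Python roughly as follows.
This is not the bash script that was actually used; it is a minimal
illustration under stated assumptions. The helper names (overlay_command,
wait_for_marker, send_powerdown, summarize) are hypothetical, the monitor is
assumed to be the human monitor on the UNIX socket from the command line, and
the "±" figures are assumed to be sample standard deviations.

```python
import io
import socket
import statistics


def overlay_command(base_image, overlay):
    """Build the qemu-img call that creates a throwaway qcow2 overlay.

    The overlay is backed by the pristine image, so deleting it after a run
    makes the next boot a fresh first boot.
    """
    return ["qemu-img", "create", "-f", "qcow2",
            "-b", base_image, "-F", "qcow2", overlay]


def wait_for_marker(stream, marker=b"fedora login:"):
    """Read the guest console one byte at a time until `marker` appears.

    Returns True once the marker is seen, False if the stream ends first.
    """
    window = b""
    while True:
        chunk = stream.read(1)
        if not chunk:
            return False
        # Keep only the last len(marker) bytes for the comparison.
        window = (window + chunk)[-len(marker):]
        if window == marker:
            return True


def send_powerdown(monitor_socket_path):
    """Send 'system_powerdown' to a QEMU human monitor on a UNIX socket."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(monitor_socket_path)
        sock.sendall(b"system_powerdown\n")


def summarize(samples):
    """Return (mean, sample standard deviation) of the measured times."""
    return statistics.mean(samples), statistics.stdev(samples)


# Stand-in for the ".out" FIFO that "-serial pipe:<name>" writes to.
demo_console = io.BytesIO(b"[  OK  ] Reached target ...\nfedora login: ")
print(wait_for_marker(demo_console))  # prints True
```

In a real run, the stream would be the "<name>.out" FIFO opened in binary
mode, the time between starting QEMU and QEMU exiting would be recorded for
each iteration, and summarize() would be applied to the list of times.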

As a second step, I wondered if this gap would be the same when emulating other 
architectures on PPC64, so I used the same version of Fedora Cloud for 
aarch64[2] and s390x[3], using the following command lines:

./qemu-system-aarch64 -name Fedora-Cloud-Base-35-1.2.aarch64 -smp 2 -m 2G \
  -vga none -nographic -serial pipe:Fedora-Cloud-Base-35-1.2.aarch64 \
  -monitor unix:Fedora-Cloud-Base-35-1.2.aarch64.mon,server,nowait \
  -device virtio-net,netdev=vmnic -netdev user,id=vmnic \
  -cdrom fedora-cloud-init.iso -machine virt -cpu max -accel tcg \
  -device virtio-scsi-pci \
  -drive file=Fedora-Cloud-Base-35-1.2.aarch64.temp.qcow2,if=none,format=qcow2,id=hd0 \
  -device scsi-hd,drive=hd0 -boot c -bios ./pc-bios/edk2-aarch64-code.fd

and

./qemu-system-s390x -name Fedora-Cloud-Base-35-1.2.s390x -smp 2 -m 2G \
  -vga none -nographic -serial pipe:Fedora-Cloud-Base-35-1.2.s390x \
  -monitor unix:Fedora-Cloud-Base-35-1.2.s390x.mon,server,nowait \
  -device virtio-net,netdev=vmnic -netdev user,id=vmnic \
  -cdrom fedora-cloud-init.iso -machine s390-ccw-virtio -cpu max -accel tcg \
  -hda Fedora-Cloud-Base-35-1.2.s390x.temp.qcow2 -boot c

With 50 runs, we got:

+---------+---------------------------------+
|         |               Host              |
|  Guest  +----------------+----------------+
|         |      PPC64     |     x86_64     |
+---------+----------------+----------------+
| PPC64   |  194.72 ± 7.28 |  162.75 ± 8.75 |
| aarch64 |  501.89 ± 9.98 | 586.08 ± 10.55 |
| s390x   | 294.10 ± 21.62 | 223.71 ± 85.30 |
+---------+----------------+----------------+

The difference with an s390x guest is ~30%, with a greater variability on
x86_64 whose source I couldn't find. However, POWER9 emulates aarch64
faster than this Xeon.
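
For reference, the percentages quoted for the 50-run boot table can be
recomputed with a few lines of Python. This is only a sketch: it assumes the
table values are seconds and that the "±" term is a standard deviation, which
the table itself does not state. The function names are hypothetical.

```python
# Boot times from the 50-run table: guest -> host -> (mean, spread).
# Assumption: values are seconds and the spread is a standard deviation.
boot_times = {
    "PPC64":   {"PPC64": (194.72, 7.28),  "x86_64": (162.75, 8.75)},
    "aarch64": {"PPC64": (501.89, 9.98),  "x86_64": (586.08, 10.55)},
    "s390x":   {"PPC64": (294.10, 21.62), "x86_64": (223.71, 85.30)},
}


def slowdown_pct(ppc64_time, x86_time):
    """Extra time the PPC64 host needs relative to the x86_64 host, in %."""
    return (ppc64_time / x86_time - 1.0) * 100.0


def relative_spread_pct(mean, spread):
    """Spread as a percentage of the mean (coefficient of variation)."""
    return spread / mean * 100.0


for guest, hosts in boot_times.items():
    ppc_mean, ppc_spread = hosts["PPC64"]
    x86_mean, x86_spread = hosts["x86_64"]
    print(f"{guest:8s} PPC64 vs x86_64: "
          f"{slowdown_pct(ppc_mean, x86_mean):+6.1f}%  "
          f"spread PPC64 {relative_spread_pct(ppc_mean, ppc_spread):4.1f}%  "
          f"x86_64 {relative_spread_pct(x86_mean, x86_spread):4.1f}%")
```

With these inputs the PPC64 guest comes out roughly 20% slower on POWER9, the
s390x guest roughly 31% slower, and the aarch64 guest roughly 14% faster. The
s390x spread on x86_64 is about 38% of its mean, against about 7% on PPC64,
which is the variability mentioned above.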

The particular workload of the guest could distort this result since in the 
first boot Cloud-Init will create user accounts, generate SSH keys, etc. If the 
aarch64 guest uses many vector instructions for this initial setup, that might 
explain why an older Xeon would be slower here.

As a final test, I changed the images to have a normal user account already
created and unlocked, disabled Cloud-Init, downloaded bc-1.07 sources[4][5],
installed its build dependencies[6], and changed the test script to log in,
extract, configure, build, and shut down the guest. I also added an
aarch64-compatible machine (Apple M1 w/ 10 cores) to our test setup. Running
100 iterations gave us the following results:

+---------+----------------------------------------------------+
|         |                        Host                        |
|  Guest  +-----------------+-----------------+----------------+
|         |      PPC64      |     x86_64      |     aarch64    |
+---------+-----------------+-----------------+----------------+
| PPC64   |  429.82 ± 11.57 |   352.34 ± 8.51 | 180.78 ± 42.02 |
| aarch64 | 1029.78 ± 46.01 | 1207.98 ± 80.49 |  487.50 ± 7.54 |
| s390x   |  589.97 ± 86.67 |  411.83 ± 41.88 | 221.86 ± 79.85 |
+---------+-----------------+-----------------+----------------+

The pattern with PPC64 vs. x86_64 remains: PPC64/s390x guests are ~20%/~30%
slower on POWER9, but the aarch64 VM is slower on this Xeon. If the PPC backend
can perform better than the x86 one when emulating some architectures, I guess
that improving PPC64-on-PPC64 emulation isn't "just" TCG backend optimization
but a more complex problem to tackle.
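
The comparison for the 100-iteration build test can be recomputed in the same
way. The sketch below uses only the means from the table (units assumed to be
seconds) and shows two common ways of expressing "slower", since they give
noticeably different percentages when the gap is large. The function names are
hypothetical.

```python
# Mean build-test times from the 100-iteration table: guest -> host -> mean.
# Assumption: values are seconds.
build_times = {
    "PPC64":   {"PPC64": 429.82,  "x86_64": 352.34,  "aarch64": 180.78},
    "aarch64": {"PPC64": 1029.78, "x86_64": 1207.98, "aarch64": 487.50},
    "s390x":   {"PPC64": 589.97,  "x86_64": 411.83,  "aarch64": 221.86},
}


def extra_time_pct(slow, fast):
    """How much longer the slow host takes, relative to the fast host."""
    return (slow / fast - 1.0) * 100.0


def time_saved_pct(slow, fast):
    """How much time the fast host saves, relative to the slow host."""
    return (1.0 - fast / slow) * 100.0


def fastest_host(hosts):
    """Name of the host with the smallest mean time for one guest."""
    return min(hosts, key=hosts.get)


for guest, hosts in build_times.items():
    slow = max(hosts["PPC64"], hosts["x86_64"])
    fast = min(hosts["PPC64"], hosts["x86_64"])
    loser = "PPC64" if hosts["PPC64"] > hosts["x86_64"] else "x86_64"
    print(f"{guest:8s} slower host: {loser:6s} "
          f"extra time {extra_time_pct(slow, fast):5.1f}%  "
          f"time saved {time_saved_pct(slow, fast):5.1f}%  "
          f"fastest overall: {fastest_host(hosts)}")
```

For the PPC64 guest the two conventions give about 22% and 18%, both close to
the ~20% quoted. For the s390x guest they give about 43% and 30%, so the ~30%
figure matches the "time saved on x86_64" reading. The aarch64 guest takes
about 17% longer on the Xeon than on POWER9, and the M1 is the fastest host
for all three guests.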

What would be different in aarch64 emulation that yields a better performance 
on our POWER9?
  - I suppose that aarch64 has more instructions with GVec implementations than 
PPC64 and s390x, so maybe aarch64 guests can better use host-vector 
instructions?
  - Looking at the flame graphs of each test (attached), I can see that
tb_gen_code accounts for proportionally less time in aarch64 emulation than in
PPC64 and s390x, so it might be that decodetree is faster?
  - There is more than TCG at play, so perhaps the differences can be better 
explained by VirtIO performance or something else?

Currently, Leandro Lupori is working to improve TLB invalidation[7], Victor
Colombo is working to enable hardfpu in some scenarios, and I'm reviewing some
older helpers that can use GVec or be easily implemented inline. We're also
planning to add some Power ISA v3.1 instructions to the TCG backend, but it's
probably better to test on hardware whether our changes are doing any good, and
we don't have access to a POWER10 yet.

Are there any other known performance problems for TCG on PPC64 that we should 
investigate?

[1] 
https://download.fedoraproject.org/pub/fedora-secondary/releases/36/Cloud/ppc64le/images/Fedora-Cloud-Base-36-1.5.ppc64le.qcow2
[2] 
https://download.fedoraproject.org/pub/fedora/linux/releases/36/Cloud/aarch64/images/Fedora-Cloud-Base-36-1.5.aarch64.qcow2
[3] 
https://download.fedoraproject.org/pub/fedora-secondary/releases/36/Cloud/s390x/images/Fedora-Cloud-Base-36-1.5.s390x.qcow2
[4] https://ftp.gnu.org/gnu/bc/bc-1.07.tar.gz
[5] I'm using bc here because it's a reasonably sized project (not a hello world
and not a defconfig Linux kernel), with few build dependencies.
[6] "sudo dnf install gcc flex make bison ed texinfo"
[7] https://gitlab.com/qemu-project/qemu/-/issues/767

Thanks,
Matheus K. Ferst
Instituto de Pesquisas ELDORADO <http://www.eldorado.org.br/>
Analista de Software
Aviso Legal - Disclaimer <https://www.eldorado.org.br/disclaimer.html>
