[Qemu-commits] [qemu/qemu] c700b5: spapr: avoid overhead of finding vhyp class in cri...

2024-05-24 Thread Richard Henderson via Qemu-commits
  Branch: refs/heads/master
  Home:   https://github.com/qemu/qemu
  Commit: c700b5e162208a0fa4211fc6d9dab271b1342640
  
https://github.com/qemu/qemu/commit/c700b5e162208a0fa4211fc6d9dab271b1342640
  Author: Nicholas Piggin 
  Date:   2024-05-24 (Fri, 24 May 2024)

  Changed paths:
    M hw/ppc/pegasos2.c
    M target/ppc/cpu.h
    M target/ppc/cpu_init.c
    M target/ppc/excp_helper.c
    M target/ppc/kvm.c
    M target/ppc/mmu-book3s-v3.h
    M target/ppc/mmu-hash64.c
    M target/ppc/mmu-radix64.c

  Log Message:
  ---
  spapr: avoid overhead of finding vhyp class in critical operations

PPC_VIRTUAL_HYPERVISOR_GET_CLASS is used in critical operations like
interrupts and TLB misses and is quite costly. Running the
kvm-unit-tests sieve program with radix MMU enabled thrashes the TCG
TLB and spends a lot of time in TLB and page table walking code. The
test takes 67 seconds to complete with a lot of time being spent in
code related to finding the vhyp class:

   12.01%  [.] g_str_hash
    8.94%  [.] g_hash_table_lookup
    8.06%  [.] object_class_dynamic_cast
    6.21%  [.] address_space_ldq
    4.94%  [.] __strcmp_avx2
    4.28%  [.] tlb_set_page_full
    4.08%  [.] address_space_translate_internal
    3.17%  [.] object_class_dynamic_cast_assert
    2.84%  [.] ppc_radix64_xlate

Keep a pointer to the class and avoid this lookup. This reduces the
execution time to 40 seconds.
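
The idea of keeping a pointer to the class can be illustrated with a minimal C sketch. This is not the QEMU code: every name below (VhypClass, DemoCPU, lookup_class_by_name and so on) is hypothetical, and a linear strcmp search stands in for the hashed, string-keyed class lookup that the profile above suggests. It only shows the general pattern of resolving the class once and reusing the pointer on hot paths.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a class with one method; not a QEMU type. */
typedef struct VhypClass {
    const char *type_name;
    int (*hypercall)(int nr);
} VhypClass;

static int demo_hypercall(int nr) { return nr + 1; }

/* Hypothetical class registry, searched by type name. */
static VhypClass class_table[] = {
    { "other-type", NULL },
    { "demo-vhyp", demo_hypercall },
};

static unsigned long slow_lookups; /* counts string-keyed lookups */

/* Slow path: resolve the class by name each time it is needed. */
static VhypClass *lookup_class_by_name(const char *name)
{
    slow_lookups++;
    for (size_t i = 0; i < sizeof(class_table) / sizeof(class_table[0]); i++) {
        if (strcmp(class_table[i].type_name, name) == 0) {
            return &class_table[i];
        }
    }
    return NULL;
}

typedef struct DemoCPU {
    const char *vhyp_type;  /* type name of the attached hypervisor object */
    VhypClass *vhyp_class;  /* class pointer cached when vhyp is attached */
} DemoCPU;

/* Attach a hypervisor type and resolve its class exactly once. */
static void demo_cpu_set_vhyp(DemoCPU *cpu, const char *type)
{
    cpu->vhyp_type = type;
    cpu->vhyp_class = lookup_class_by_name(type);
    assert(cpu->vhyp_class != NULL);
}

/* Hot path before the change: one name lookup per call. */
static int hot_path_uncached(DemoCPU *cpu, int nr)
{
    return lookup_class_by_name(cpu->vhyp_type)->hypercall(nr);
}

/* Hot path after the change: just a pointer dereference. */
static int hot_path_cached(DemoCPU *cpu, int nr)
{
    return cpu->vhyp_class->hypercall(nr);
}
```

In this sketch the cached pointer stays valid only while the attached type does not change, so any code that swaps the hypervisor object would need to refresh vhyp_class as well.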

Reviewed-by: Harsh Prateek Bora 
Signed-off-by: Nicholas Piggin 


  Commit: 95912ce1ebe4303d17118219691573ae6227b0e2
  
https://github.com/qemu/qemu/commit/95912ce1ebe4303d17118219691573ae6227b0e2
  Author: Nicholas Piggin 
  Date:   2024-05-24 (Fri, 24 May 2024)

  Changed paths:
    M hw/ppc/spapr.c

  Log Message:
  ---
  ppc/spapr: Add ibm,pi-features

The ibm,pi-features property has a bit to say whether or not
msgsndp should be used. Linux checks if it is being run under
KVM and avoids msgsndp anyway, but it would be preferable to
rely on this bit.

Reviewed-by: Harsh Prateek Bora 
Signed-off-by: Nicholas Piggin 


  Commit: 82676f1fc4b1511a5fe32256aaec885d200ffbf6
  
https://github.com/qemu/qemu/commit/82676f1fc4b1511a5fe32256aaec885d200ffbf6
  Author: Nicholas Piggin 
  Date:   2024-05-24 (Fri, 24 May 2024)

  Changed paths:
    M target/ppc/helper_regs.c
    M target/ppc/mmu_helper.c
    M target/ppc/translate.c
    M target/ppc/translate/storage-ctrl-impl.c.inc

  Log Message:
  ---
  target/ppc: Fix broadcast tlbie synchronisation

With mttcg, broadcast tlbie instructions do not wait until other vCPUs
have been kicked out of TCG execution before they complete (including
necessary subsequent tlbsync, etc., instructions). This is contrary to
the ISA, and it permits other vCPUs to use translations after the TLB
flush. For example:

   CPU0
   // *memP is initially 0, memV maps to memP with *pte
   *pte = 0;
   ptesync ; tlbie ; eieio ; tlbsync ; ptesync
   *memP = 1;

   CPU1
   assert(*memV == 0);

It is possible for the assertion to fail because CPU1 translates memV
using the TLB after CPU0 has stored 1 to the underlying memory. This
race was observed with a carefully constructed test case in which the
CPU1 check runs in a very large, expensive TB, so that it spans the
entire CPU0 period between clearing the pte and storing to memory, but
host vCPU thread preemption could cause the race to hit anywhere.

As explained in commit 4ddc104689b ("target/ppc: Fix tlbie"), it is not
enough to just use tlb_flush_all_cpus_synced(), because that does not
execute until the calling CPU has finished its TB. It is also required
that the TB is ended at the point where the TLB flush must subsequently
take effect.
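
Why ending the TB matters can be shown with a toy, single-threaded C model of the example above. It is only a sketch under stated assumptions: it uses no QEMU code, all names (toy_tlbie, toy_cpu0_end_tb, toy_cpu1_sees_store) are hypothetical, and the interleaving is fixed by hand rather than produced by real threads. The one property it models is that a queued flush runs only when the issuing CPU leaves its TB.

```c
#include <assert.h>
#include <stdbool.h>

static int memP;             /* backing memory, reached through memV */
static bool cpu1_tlb_valid;  /* CPU1 still caches the memV -> memP mapping */
static bool flush_queued;    /* flush requested, waiting for CPU0's TB end */

/* Model of the flush request: it is only queued, not yet executed. */
static void toy_tlbie(void)
{
    flush_queued = true;
}

/* Model of CPU0 leaving its TB: queued flush work runs here. */
static void toy_cpu0_end_tb(void)
{
    if (flush_queued) {
        cpu1_tlb_valid = false;
        flush_queued = false;
    }
}

/*
 * Run the CPU0 sequence, then let CPU1 load through memV.
 * Returns true if CPU1 used the stale translation and saw the new value,
 * which is the case where assert(*memV == 0) in the example would fail.
 */
static bool toy_cpu1_sees_store(bool end_tb_after_flush)
{
    memP = 0;
    cpu1_tlb_valid = true;
    flush_queued = false;

    toy_tlbie();                 /* *pte = 0; ptesync ; tlbie ; ... */
    if (end_tb_after_flush) {
        toy_cpu0_end_tb();       /* TB ended where the flush must take effect */
    }
    memP = 1;                    /* *memP = 1 */

    /* CPU1 load: hits only if its cached translation survived. */
    bool stale_hit = cpu1_tlb_valid && memP == 1;

    toy_cpu0_end_tb();           /* the TB ends eventually in either case */
    assert(!flush_queued);
    return stale_hit;
}
```

Calling toy_cpu1_sees_store(false) models the store sharing a TB with the flush request, so the stale translation is still usable when the store lands; toy_cpu1_sees_store(true) models ending the TB first.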

Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Nicholas Piggin 


  Commit: 99cd12ced16d15a1ffde055f842497747f070f91
  
https://github.com/qemu/qemu/commit/99cd12ced16d15a1ffde055f842497747f070f91
  Author: Nicholas Piggin 
  Date:   2024-05-24 (Fri, 24 May 2024)

  Changed paths:
    M accel/tcg/cputlb.c
    M docs/devel/multi-thread-tcg.rst
    M include/exec/exec-all.h

  Log Message:
  ---
  tcg/cputlb: Remove non-synced variants of global TLB flushes

These are no longer used.

  tlb_flush_all_cpus: removed by previous commit.
  tlb_flush_page_all_cpus: removed by previous commit.

  tlb_flush_page_bits_by_mmuidx_all_cpus: never used.
  tlb_flush_page_by_mmuidx_all_cpus: never used.
  tlb_flush_page_bits_by_mmuidx_all_cpus: never used, thus:
    tlb_flush_range_by_mmuidx_all_cpus: never used.
    tlb_flush_by_mmuidx_all_cpus: never used.

Reviewed-by: Richard Henderson 
Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Nicholas Piggin 


  Commit: 30933c4fb4f3df95ae44c4c3c86a5df049852c01
  
https://github.com/qemu/qemu/commit/30933c4fb4f3df95ae44c4c3c86a5df049852c01
  Author: Nicholas Piggin 
  Date:   2024-05-24 (Fri, 24 May 2024)

  Changed paths:
    M accel/tcg/cputlb.c

  Log Message:
  ---
  tcg/cputlb: remove other-cpu capability from TLB flushing

Some TLB flush 

[Qemu-commits] [qemu/qemu] c700b5: spapr: avoid overhead of finding vhyp class in cri...

2024-05-23 Thread Richard Henderson via Qemu-commits
  Branch: refs/heads/staging
  Home:   https://github.com/qemu/qemu