Re: [PATCH] target/riscv: Use a direct cast for better performance
On 10/7/23 02:02, Richard W.M. Jones wrote:
> RISCV_CPU(cs) uses a checked cast.  When QOM cast debugging is enabled
> this adds about 5% total overhead when emulating RV64 on an x86-64 host.
>
> Using a RISC-V guest with 16 vCPUs, 16 GB of guest RAM, virtio-blk disk.
> The guest has a copy of the qemu source tree.  The test involves
> compiling the qemu source tree with 'make clean; time make -j16'.
>
> Before making this change the compile step took 449 & 447 seconds over
> two consecutive runs.  After making this change, 428 & 422 seconds.
> The saving is about 5%.
>
> Thanks: Paolo Bonzini
> Signed-off-by: Richard W.M. Jones
> ---
>  target/riscv/cpu_helper.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/target/riscv/cpu_helper.c b/target/riscv/cpu_helper.c
> index 3a02079290..6174d99fb2 100644
> --- a/target/riscv/cpu_helper.c
> +++ b/target/riscv/cpu_helper.c
> @@ -66,7 +66,11 @@ void cpu_get_tb_cpu_state(CPURISCVState *env, vaddr *pc,
>                            uint64_t *cs_base, uint32_t *pflags)
>  {
>      CPUState *cs = env_cpu(env);
> -    RISCVCPU *cpu = RISCV_CPU(cs);
> +    /*
> +     * Using the checked cast RISCV_CPU(cs) imposes ~ 5% overhead when
> +     * qemu cast debugging is enabled, so use a direct cast instead.
> +     */
> +    RISCVCPU *cpu = (RISCVCPU *)cs;

  RISCVCPU *cpu = env_archcpu(env);

and avoid "CPUState *cs" entirely.

r~
Re: [PATCH] target/riscv: Use a direct cast for better performance
On 10/7/23 06:02, Richard W.M. Jones wrote:
> RISCV_CPU(cs) uses a checked cast.  When QOM cast debugging is enabled
> this adds about 5% total overhead when emulating RV64 on an x86-64 host.
>
> Using a RISC-V guest with 16 vCPUs, 16 GB of guest RAM, virtio-blk disk.
> The guest has a copy of the qemu source tree.  The test involves
> compiling the qemu source tree with 'make clean; time make -j16'.
>
> Before making this change the compile step took 449 & 447 seconds over
> two consecutive runs.  After making this change, 428 & 422 seconds.
> The saving is about 5%.
>
> Thanks: Paolo Bonzini
> Signed-off-by: Richard W.M. Jones
> ---

Reviewed-by: Daniel Henrique Barboza

>  target/riscv/cpu_helper.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/target/riscv/cpu_helper.c b/target/riscv/cpu_helper.c
> index 3a02079290..6174d99fb2 100644
> --- a/target/riscv/cpu_helper.c
> +++ b/target/riscv/cpu_helper.c
> @@ -66,7 +66,11 @@ void cpu_get_tb_cpu_state(CPURISCVState *env, vaddr *pc,
>                            uint64_t *cs_base, uint32_t *pflags)
>  {
>      CPUState *cs = env_cpu(env);
> -    RISCVCPU *cpu = RISCV_CPU(cs);
> +    /*
> +     * Using the checked cast RISCV_CPU(cs) imposes ~ 5% overhead when
> +     * qemu cast debugging is enabled, so use a direct cast instead.
> +     */
> +    RISCVCPU *cpu = (RISCVCPU *)cs;
>      RISCVExtStatus fs, vs;
>      uint32_t flags = 0;
Re: [PATCH] target/riscv: Use a direct cast for better performance
If you're interested in how I found this problem, it was done using
'perf report -a -g' & flamegraphs.  This is the flamegraph of qemu (on
the host) while the guest is running the parallel compile:

http://oirase.annexia.org/tmp/qemu-riscv.svg

If you click into 'CPU_0/TCG' at the bottom left (all the vCPUs
basically act alike) and then go to 'cpu_get_tb_cpu_state', you can see
the call to 'object_dynamic_cast_assert' taking considerable time.

If you zoom out, hit Ctrl-F, and type 'object_dynamic_cast_assert' into
the search box, the flamegraph will tell you this call takes about 6.6%
of total time (not all, but most, attributable to the call from
'cpu_get_tb_cpu_state' -> 'object_dynamic_cast_assert').

There are several other issues in the flamegraph which I'm trying to
address, but this was the simplest one.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat
http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html
[PATCH] target/riscv: Use a direct cast for better performance
RISCV_CPU(cs) uses a checked cast.  When QOM cast debugging is enabled
this adds about 5% total overhead when emulating RV64 on an x86-64 host.

Using a RISC-V guest with 16 vCPUs, 16 GB of guest RAM, virtio-blk disk.
The guest has a copy of the qemu source tree.  The test involves
compiling the qemu source tree with 'make clean; time make -j16'.

Before making this change the compile step took 449 & 447 seconds over
two consecutive runs.  After making this change, 428 & 422 seconds.
The saving is about 5%.

Thanks: Paolo Bonzini
Signed-off-by: Richard W.M. Jones
---
 target/riscv/cpu_helper.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/target/riscv/cpu_helper.c b/target/riscv/cpu_helper.c
index 3a02079290..6174d99fb2 100644
--- a/target/riscv/cpu_helper.c
+++ b/target/riscv/cpu_helper.c
@@ -66,7 +66,11 @@ void cpu_get_tb_cpu_state(CPURISCVState *env, vaddr *pc,
                           uint64_t *cs_base, uint32_t *pflags)
 {
     CPUState *cs = env_cpu(env);
-    RISCVCPU *cpu = RISCV_CPU(cs);
+    /*
+     * Using the checked cast RISCV_CPU(cs) imposes ~ 5% overhead when
+     * qemu cast debugging is enabled, so use a direct cast instead.
+     */
+    RISCVCPU *cpu = (RISCVCPU *)cs;
     RISCVExtStatus fs, vs;
     uint32_t flags = 0;
-- 
2.41.0