KVM P9 optimisation series
I have put my current series here:

https://github.com/npiggin/linux/tree/kvm-in-c-new

It contains the existing Cify series plus about 50 patches, and it's
getting fairly stable with both L0 and L1 hypervisors.

The aim of the series is to speed up the P9 entry/exit code and also
simplify things where possible. It does this in several main ways:

- Rearrange code to optimise SPR accesses. Mainly, avoid scoreboard
  stalls.

- Test SPR values to avoid mtSPRs where possible. mtSPRs are expensive.

- Reduce mftb. mftb is expensive.

- Demand fault certain facilities to avoid saving and/or restoring them
  (at the cost of a fault when they are used, but this is mitigated over
  a number of entries, like the facilities when context switching
  processes). PM, TM, and EBB so far.

- Defer some sequences that are made just in case a guest is
  interrupted in the middle of a critical section to the case where the
  guest is scheduled on a different CPU, rather than every time (at the
  cost of an extra IPI in this case). Namely the tlbsync sequence for
  radix with GTSE, which is very expensive.

- Reduce barriers and atomics, and start shedding some of the vcore
  complexity to reduce path length, locking, etc.

So far this speeds up the full entry/exit cycle (measured by a guest
spinning in 'sc 1' to cause exits, with a host hack to make it exit
rather than SIGILL) by about 2x on P9 and more on a P10.

There is some more that can be done (xive optimisation, more complexity
reduction, removing another mftb) but there are not many easy gains
left here.

The big thing which is not yet addressed is a lightweight exit that
does not switch all context each time. That will take a bit more design
to get working really well, so I prefer to do that over a longer
period, perhaps with the help of some realistic workloads. It's very
simple to hack something up to work fast with a few TCE or HPT hcalls
for example, but really we may prefer on balance to do something which
is slightly slower for those but works for other host interrupts like
timers, device irqs, IPIs, partition scope page faults, etc.

I will submit this after the first Cify series is accepted into the
powerpc/kvm tree.

Thanks,
Nick
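
As an illustration of the "test SPR values to avoid mtSPRs" item in the
list above, here is a minimal userspace sketch of the general pattern:
compare the value already known to be in the register with the value
wanted next, and only issue the write when they differ. This is not
code from the series. The names sim_spr, sim_mtspr, sim_mtspr_count and
switch_spr are hypothetical, and the SPRs are simulated with a plain
array so that the number of writes can be counted.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for hardware SPRs: an array plus a write counter. */
enum { SIM_NUM_SPRS = 4 };
static uint64_t sim_spr[SIM_NUM_SPRS];
static unsigned long sim_mtspr_count;

/* Simulated mtspr: performs the "expensive" write and counts it. */
static void sim_mtspr(unsigned int n, uint64_t val)
{
	assert(n < SIM_NUM_SPRS);
	sim_spr[n] = val;
	sim_mtspr_count++;
}

/*
 * Switch SPR n from the value software knows it currently holds
 * (cur, e.g. a saved host value) to the wanted value (next, e.g. a
 * guest value). The write is skipped when the two are equal.
 */
static void switch_spr(unsigned int n, uint64_t cur, uint64_t next)
{
	if (cur != next)
		sim_mtspr(n, next);
}
```

The trade is one compare and branch on values that are already in
memory against one register write. It only pays off when equal values
are common and the write is costly, which is the premise stated in the
list above.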
Re: [PATCH 11/15] powerpc: convert to setup_initial_init_mm()
On 2021/5/30 0:16, Christophe Leroy wrote:
> Kefeng Wang wrote:
> > Use setup_initial_init_mm() helper to simplify code.
> >
> > Cc: Michael Ellerman
> > Cc: Benjamin Herrenschmidt
> > Cc: linuxppc-dev@lists.ozlabs.org
> > Signed-off-by: Kefeng Wang
> > ---
> >  arch/powerpc/kernel/setup-common.c | 5 +----
> >  1 file changed, 1 insertion(+), 4 deletions(-)
> >
> > diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
> > index 046fe21b5c3b..c046d99efd18 100644
> > --- a/arch/powerpc/kernel/setup-common.c
> > +++ b/arch/powerpc/kernel/setup-common.c
> > @@ -928,10 +928,7 @@ void __init setup_arch(char **cmdline_p)
> >  	klp_init_thread_info(&init_task);
> > -	init_mm.start_code = (unsigned long)_stext;
> > -	init_mm.end_code = (unsigned long) _etext;
> > -	init_mm.end_data = (unsigned long) _edata;
> > -	init_mm.brk = klimit;
> > +	setup_initial_init_mm(_stext, _etext, _edata, _end);
>
> This looks wrong, should be klimit instead of _end IIUC

See arch/powerpc/kernel/setup-common.c:

	unsigned long klimit = (unsigned long) _end;

The setup_initial_init_mm helper [1] should use the original _end:

+static inline void setup_initial_init_mm(char *start_code,
+					 char *end_code,
+					 char *end_data,
+					 char *brk)
+{
+	init_mm.start_code = (unsigned long)start_code;
+	init_mm.end_code = (unsigned long)end_code;
+	init_mm.end_data = (unsigned long)end_data;
+	init_mm.brk = (unsigned long)brk;
+}

[1] https://lkml.org/lkml/2021/5/29/84

> >  	mm_iommu_init(&init_mm);
> >  	irqstack_early_init();
> > --
> > 2.26.2
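
To make the point in the reply above concrete, here is a small
userspace sketch comparing the old open-coded assignment (brk taken
from klimit) with the helper call (brk taken from _end). It assumes
klimit still holds its initialiser, (unsigned long)_end, at the point
where setup_arch() sets init_mm.brk; the thread itself does not confirm
that. Every name ending in _sketch or _SKETCH is a hypothetical
stand-in for the kernel symbol of the same name, not kernel code.

```c
#include <assert.h>

/* Userspace stand-in for the four init_mm fields the helper sets. */
struct mm_sketch {
	unsigned long start_code, end_code, end_data, brk;
};
static struct mm_sketch init_mm_sketch;

/* Stand-ins for the linker symbols _stext, _etext, _edata and _end. */
static char image_sketch[4096];
#define STEXT_SKETCH (image_sketch)
#define ETEXT_SKETCH (image_sketch + 1024)
#define EDATA_SKETCH (image_sketch + 2048)
#define END_SKETCH   (image_sketch + 4096)

/* Same shape as the helper quoted above, writing to the stand-in struct. */
static inline void setup_initial_init_mm_sketch(char *start_code,
						char *end_code,
						char *end_data,
						char *brk)
{
	init_mm_sketch.start_code = (unsigned long)start_code;
	init_mm_sketch.end_code = (unsigned long)end_code;
	init_mm_sketch.end_data = (unsigned long)end_data;
	init_mm_sketch.brk = (unsigned long)brk;
}

/* Old open-coded form: brk comes from a klimit-like variable. */
static void setup_open_coded_sketch(unsigned long klimit)
{
	init_mm_sketch.start_code = (unsigned long)STEXT_SKETCH;
	init_mm_sketch.end_code = (unsigned long)ETEXT_SKETCH;
	init_mm_sketch.end_data = (unsigned long)EDATA_SKETCH;
	init_mm_sketch.brk = klimit;
}

/* Runs both forms and checks that they leave the same brk behind. */
static void demo_sketch(void)
{
	/* Mirrors "unsigned long klimit = (unsigned long) _end;". */
	unsigned long klimit = (unsigned long)END_SKETCH;
	unsigned long old_brk;

	setup_open_coded_sketch(klimit);
	old_brk = init_mm_sketch.brk;

	setup_initial_init_mm_sketch(STEXT_SKETCH, ETEXT_SKETCH,
				     EDATA_SKETCH, END_SKETCH);
	assert(init_mm_sketch.brk == old_brk);
}
```

If klimit were changed before this point, the two forms would no longer
agree, which is the concern raised in the quoted review comment.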
Re: [PATCH 11/15] powerpc: convert to setup_initial_init_mm()
On 2021/5/29 23:22, Christophe Leroy wrote:
> Santosh Sivaraj wrote:
> > Kefeng Wang writes:
> > > Use setup_initial_init_mm() helper to simplify code.
>
> I only got that patch, and patchwork as well
> (https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=246315)
>
> Can you tell me where I can see and get the full series?
>
> And next time can you copy all patches to linuxppc-dev

OK, I will be careful next time. Thanks for the reminder.

> Thanks
> Christophe
>
> > > Cc: Michael Ellerman
> > > Cc: Benjamin Herrenschmidt
> > > Cc: linuxppc-dev@lists.ozlabs.org
> > > Signed-off-by: Kefeng Wang
> > > ---
> > >  arch/powerpc/kernel/setup-common.c | 5 +----
> > >  1 file changed, 1 insertion(+), 4 deletions(-)
> > >
> > > diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
> > > index 046fe21b5c3b..c046d99efd18 100644
> > > --- a/arch/powerpc/kernel/setup-common.c
> > > +++ b/arch/powerpc/kernel/setup-common.c
> > > @@ -928,10 +928,7 @@ void __init setup_arch(char **cmdline_p)
> > >  	klp_init_thread_info(&init_task);
> > > -	init_mm.start_code = (unsigned long)_stext;
> > > -	init_mm.end_code = (unsigned long) _etext;
> > > -	init_mm.end_data = (unsigned long) _edata;
> > > -	init_mm.brk = klimit;
> > > +	setup_initial_init_mm(_stext, _etext, _edata, _end);
> >
> > This function definition is not visible to those who have subscribed
> > only to the linuxppc-dev mailing list. I had to do a web search with
> > the ID.
> >
> > Thanks,
> > Santosh
> >
> > >  	mm_iommu_init(&init_mm);
> > >  	irqstack_early_init();
> > > --
> > > 2.26.2
Re: [PATCH] Revert "powerpc: Switch to relative jump labels"
On Sat, May 29, 2021 at 09:39:49AM +1000, Michael Ellerman wrote:
> Roman Bolshakov writes:
> > This reverts commit b0b3b2c78ec075cec4721986a95abbbac8c3da4f.
> >
> > Otherwise, direct kernel boot with initramfs no longer works in QEMU.
> > It's broken in some bizarre way because a valid initramfs is not
> > recognized anymore:
> >
> >   Found initrd at 0xc1f7:0xc3d61d64
> >   rootfs image is not initramfs (XZ-compressed data is corrupt); looks like
> >   an initrd
> >
> > The issue is observed on v5.13-rc3 if the kernel is built with
> > defconfig, GCC 7.5.0 and GNU ld 2.32.0.
>
> Are you able to try a different compiler?

Hi Michael,

I've just tried GCC 9.3.1 and the result is the same. The offending
patch has inline assembly, which typically goes through binutils/GAS,
so it might also be a case where older binutils doesn't implement
something properly (I've seen this on x86 and arm).

> I test booting qemu constantly, but I don't use GCC 7.5.
>
> And what qemu version are you using?
>

QEMU 3.1.1, but I've also tried 6.0.50 (QEMU master, 62c0ac5041e913)
and it fails the same way.

> I assume your initramfs is compressed with XZ? How large is it
> compressed?
>

Yes, XZ. The initramfs size is 30 MB (around 100 MB cpio size).

It's interesting that the issue doesn't happen if I pass the initramfs
from the host (11MB); then the initramfs is recognized. So it might be
related to initramfs size: a bigger initramfs that used to work no
longer works with v5.13-rc3.

So, I've created a small initramfs using only static busybox (2.7M
uncompressed, 960K compressed with xz). No error is produced and it
boots fine.

If I add a dummy file (11M off /dev/urandom) to the small busybox
initramfs, it boots and the init is started, but I'm seeing the error:

  rootfs image is not initramfs (XZ-compressed data is corrupt); looks like
  an initrd

The sha1sum of the file inside the initramfs doesn't match the sha1sum
on the host.

  guest # sha1sum dummy
  407c347e671ddd00f69df12b3368048bad0ebf0c  dummy
  # QEMU: Terminated
  host $ sha1sum dummy
  ed8494b3eecab804960ceba2c497270eed0b0cd1  dummy

The sha1sum is the same in the guest and on the host for a 10M dummy
file:

  guest # sha1sum dummy
  43855f7a772a28cce91da9eb8f86f53bc807631f  dummy
  # QEMU: Terminated
  host $ sha1sum dummy
  43855f7a772a28cce91da9eb8f86f53bc807631f  dummy

That might explain why a bigger initramfs (or an initramfs with bigger
files) doesn't boot: some files might appear corrupted inside the
guest.

Here are the sources of the initrd along with the 11M dummy file:

https://drive.yadro.com/s/W8HdbPnaKmPPwK4

I've compressed it with:

  $ find . 2>/dev/null | cpio -ocR 0:0 | xz --check=crc32 > ../initrd-dummy.xz

Hope this helps,
Roman