KVM P9 optimisation series

2021-05-30 Thread Nicholas Piggin
I have put my current series here

https://github.com/npiggin/linux/tree/kvm-in-c-new

It contains the existing Cify series plus about 50 patches, and it's getting
fairly stable with both L0 and L1 hypervisors. The aim of the series
is to speed up the P9 entry/exit code and also simplify things where
possible.

It does this in several main ways:

- Rearrange code to optimise SPR accesses. Mainly, avoid scoreboard
  stalls.

- Test SPR values to avoid mtSPRs where possible. mtSPRs are expensive.

- Reduce mftb. mftb is expensive.

- Demand fault certain facilities to avoid saving and/or restoring them
  (at the cost of a fault when they are used, but this is mitigated over
  a number of entries, as is done for facilities when context switching
  processes). PM, TM, and EBB so far.

- Defer some sequences that are made just in case a guest is interrupted
  in the middle of a critical section to the case where the guest is
  scheduled on a different CPU, rather than every time (at the cost of
  an extra IPI in this case). Namely the tlbsync sequence for radix with
  GTSE, which is very expensive.

- Reduce barriers and atomics, and start shedding some of the vcore
  complexity to reduce path length, locking, etc.
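
As a rough illustration of the "test SPR values to avoid mtSPRs" item
above, here is a minimal user-space sketch. It is not code from the
series: struct fake_spr, fake_mtspr(), switch_spr() and
simulate_round_trips() are hypothetical names, and the register is
modelled as a plain struct that counts writes instead of issuing a real
mtspr.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for one SPR: the value it holds plus a count of
 * how many (expensive) writes were issued to it. */
struct fake_spr {
	uint64_t value;
	unsigned long writes;
};

/* Models an unconditional mtspr. */
static void fake_mtspr(struct fake_spr *spr, uint64_t val)
{
	assert(spr);
	spr->value = val;
	spr->writes++;
}

/* Switch the register from the value it is known to hold (cur) to the
 * value the next context wants (next), skipping the write when they are
 * already equal. Returns 1 if a write was issued, 0 otherwise. */
static int switch_spr(struct fake_spr *spr, uint64_t cur, uint64_t next)
{
	if (cur == next)
		return 0;
	fake_mtspr(spr, next);
	return 1;
}

/* Simulate n guest entry/exit round trips and return the number of
 * writes that were actually issued. */
static unsigned long simulate_round_trips(uint64_t host_val,
					  uint64_t guest_val,
					  unsigned int n)
{
	struct fake_spr spr = { .value = host_val, .writes = 0 };
	unsigned int i;

	for (i = 0; i < n; i++) {
		switch_spr(&spr, host_val, guest_val);	/* entry */
		switch_spr(&spr, guest_val, host_val);	/* exit */
	}
	return spr.writes;
}
```

When host and guest values match, no writes are issued at all; when
they differ, the cost is the same as the unconditional version plus one
compare per switch.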

So far this speeds up the full entry/exit cycle (measured by a guest
spinning in 'sc 1' to cause exits, with a host hack to make it exit rather
than SIGILL) by about 2x on P9 and more on a P10.

There is some more that can be done (xive optimisation, more complexity
reduction, removing another mftb) but there are not many easy gains left
here. The big thing which is not yet addressed is a lightweight exit
that does not switch all context each time. That will take a bit more
design to get working really well, so I prefer to do that over a longer
period perhaps with the help of some realistic workloads. It's very
simple to hack something up to work fast with a few TCE or HPT hcalls
for example, but really we may prefer on balance to do something which
is slightly slower for those but works for other host interrupts like 
timers, device irqs, IPIs, partition scope page faults, etc.

I will submit this after the first Cify series is accepted into the
powerpc/kvm tree.

Thanks,
Nick


Re: [PATCH 11/15] powerpc: convert to setup_initial_init_mm()

2021-05-30 Thread Kefeng Wang



On 2021/5/30 0:16, Christophe Leroy wrote:
> Kefeng Wang wrote:
>
>> Use setup_initial_init_mm() helper to simplify code.
>>
>> Cc: Michael Ellerman
>> Cc: Benjamin Herrenschmidt
>> Cc: linuxppc-dev@lists.ozlabs.org
>> Signed-off-by: Kefeng Wang
>> ---
>>  arch/powerpc/kernel/setup-common.c | 5 +----
>>  1 file changed, 1 insertion(+), 4 deletions(-)
>>
>> diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
>> index 046fe21b5c3b..c046d99efd18 100644
>> --- a/arch/powerpc/kernel/setup-common.c
>> +++ b/arch/powerpc/kernel/setup-common.c
>> @@ -928,10 +928,7 @@ void __init setup_arch(char **cmdline_p)
>>
>>  klp_init_thread_info(&init_task);
>>
>> -    init_mm.start_code = (unsigned long)_stext;
>> -    init_mm.end_code = (unsigned long) _etext;
>> -    init_mm.end_data = (unsigned long) _edata;
>> -    init_mm.brk = klimit;
>> +    setup_initial_init_mm(_stext, _etext, _edata, _end);
>
> This looks wrong, should be klimit instead of _end IIUC

See arch/powerpc/kernel/setup-common.c:

unsigned long klimit = (unsigned long) _end;

so the setup_initial_init_mm() helper [1] should use the original _end:

+static inline void setup_initial_init_mm(char *start_code,
+                                         char *end_code,
+                                         char *end_data,
+                                         char *brk)
+{
+   init_mm.start_code = (unsigned long)start_code;
+   init_mm.end_code = (unsigned long)end_code;
+   init_mm.end_data = (unsigned long)end_data;
+   init_mm.brk = (unsigned long)brk;
+}

[1] https://lkml.org/lkml/2021/5/29/84

>>
>>  mm_iommu_init(&init_mm);
>>  irqstack_early_init();
>> --
>> 2.26.2






Re: [PATCH 11/15] powerpc: convert to setup_initial_init_mm()

2021-05-30 Thread Kefeng Wang



On 2021/5/29 23:22, Christophe Leroy wrote:
> Santosh Sivaraj wrote:
>
>> Kefeng Wang writes:
>>
>>> Use setup_initial_init_mm() helper to simplify code.
>
> I only got that patch, and patchwork as well
> (https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=246315)
>
> Can you tell where I can see and get the full series ?
>
> And next time can you copy all patches to linuxppc-dev

OK, I will be more careful next time, thanks for the reminder.

> Thanks
> Christophe
>
>>> Cc: Michael Ellerman
>>> Cc: Benjamin Herrenschmidt
>>> Cc: linuxppc-dev@lists.ozlabs.org
>>> Signed-off-by: Kefeng Wang
>>> ---
>>>  arch/powerpc/kernel/setup-common.c | 5 +----
>>>  1 file changed, 1 insertion(+), 4 deletions(-)
>>>
>>> diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
>>> index 046fe21b5c3b..c046d99efd18 100644
>>> --- a/arch/powerpc/kernel/setup-common.c
>>> +++ b/arch/powerpc/kernel/setup-common.c
>>> @@ -928,10 +928,7 @@ void __init setup_arch(char **cmdline_p)
>>>
>>>  klp_init_thread_info(&init_task);
>>>
>>> -    init_mm.start_code = (unsigned long)_stext;
>>> -    init_mm.end_code = (unsigned long) _etext;
>>> -    init_mm.end_data = (unsigned long) _edata;
>>> -    init_mm.brk = klimit;
>>> +    setup_initial_init_mm(_stext, _etext, _edata, _end);
>>
>> This function definition is not visible for those who have subscribed
>> only to the linuxppc-dev mailing list. I had to do a web-search with
>> the ID.
>>
>> Thanks,
>> Santosh
>>
>>>
>>>  mm_iommu_init(&init_mm);
>>>  irqstack_early_init();
>>> --
>>> 2.26.2






Re: [PATCH] Revert "powerpc: Switch to relative jump labels"

2021-05-30 Thread Roman Bolshakov
On Sat, May 29, 2021 at 09:39:49AM +1000, Michael Ellerman wrote:
> Roman Bolshakov  writes:
> > This reverts commit b0b3b2c78ec075cec4721986a95abbbac8c3da4f.
> >
> > Otherwise, direct kernel boot with initramfs no longer works in QEMU.
> > It's broken in some bizarre way because a valid initramfs is not
> > recognized anymore:
> >
> >   Found initrd at 0xc1f7:0xc3d61d64
> >   rootfs image is not initramfs (XZ-compressed data is corrupt); looks like an initrd
> >
> > The issue is observed on v5.13-rc3 if the kernel is built with
> > defconfig, GCC 7.5.0 and GNU ld 2.32.0.
> 
> Are you able to try a different compiler?

Hi Michael,

I've just tried GCC 9.3.1 and the result is the same.

The offending patch uses inline assembly, which goes through
binutils/GAS, so it might also be a case where older binutils doesn't
implement something properly (I've seen this on x86 and arm).

> I test booting qemu constantly, but I don't use GCC 7.5.
>
> And what qemu version are you using?
> 

QEMU 3.1.1, but I've also tried 6.0.50 (QEMU master, 62c0ac5041e913) and
it fails the same way.

> 
> I assume your initramfs is compressed with XZ? How large is it
> compressed?
> 

Yes, XZ. The initramfs is 30 MB (around 100 MB cpio size).

Interestingly, the issue doesn't happen if I pass the initramfs from
the host (11MB); that one is recognized. So it might be related to
initramfs size: bigger initramfs images that used to work no longer
work with v5.13-rc3.

So, I've created a small initramfs using only static busybox (2.7M
uncompressed, 960K compressed with xz). No error is produced and it
boots fine.

If I add a dummy file (11M from /dev/urandom) to the small busybox
initramfs, it boots and init is started, but I'm seeing the error:

  rootfs image is not initramfs (XZ-compressed data is corrupt); looks like an initrd

The sha1sum of the file inside the initramfs doesn't match the sha1sum on the host:

  guest # sha1sum dummy
  407c347e671ddd00f69df12b3368048bad0ebf0c  dummy
  # QEMU: Terminated
  host $ sha1sum dummy
  ed8494b3eecab804960ceba2c497270eed0b0cd1  dummy

The sha1sum is the same in the guest and on the host for a 10M dummy file:

  guest # sha1sum dummy
  43855f7a772a28cce91da9eb8f86f53bc807631f  dummy
  # QEMU: Terminated
  host $ sha1sum dummy
  43855f7a772a28cce91da9eb8f86f53bc807631f  dummy

That might explain why a bigger initramfs (or an initramfs with bigger
files) doesn't boot: some files might appear corrupted inside the guest.
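
To narrow this down further than a checksum mismatch, it could help to
find where the two copies first differ. Below is a minimal sketch,
assuming both copies of the file can be brought onto one machine;
first_mismatch() and first_file_mismatch() are hypothetical helper names
and are not part of any existing tool. Read errors are not distinguished
from end of file here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Return the offset of the first differing byte, or len if the two
 * buffers are identical over len bytes. */
static size_t first_mismatch(const unsigned char *a, const unsigned char *b,
			     size_t len)
{
	size_t i;

	assert(a && b);
	for (i = 0; i < len; i++)
		if (a[i] != b[i])
			return i;
	return len;
}

/* Compare two files chunk by chunk. Returns the offset of the first
 * difference (which is the end of the shorter file if one is a prefix
 * of the other), -1 if they are identical, -2 if either cannot be
 * opened. */
static long long first_file_mismatch(const char *path_a, const char *path_b)
{
	unsigned char buf_a[4096], buf_b[4096];
	long long offset = 0;
	long long result = -1;
	FILE *fa = fopen(path_a, "rb");
	FILE *fb = fopen(path_b, "rb");

	if (!fa || !fb) {
		result = -2;
		goto out;
	}
	for (;;) {
		size_t na = fread(buf_a, 1, sizeof(buf_a), fa);
		size_t nb = fread(buf_b, 1, sizeof(buf_b), fb);
		size_t n = na < nb ? na : nb;
		size_t i = first_mismatch(buf_a, buf_b, n);

		if (i < n || na != nb) {
			result = offset + (long long)i;
			break;
		}
		if (na == 0)
			break;	/* both files ended, no difference found */
		offset += (long long)na;
	}
out:
	if (fa)
		fclose(fa);
	if (fb)
		fclose(fb);
	return result;
}
```

Knowing whether the first bad offset falls near a particular size
boundary would show whether the corruption depends on position in the
image rather than on the file contents.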

Here are the sources of the initrd along with the 11M dummy file:
  https://drive.yadro.com/s/W8HdbPnaKmPPwK4

I've compressed it with:
  $ find . 2>/dev/null | cpio -ocR 0:0 | xz  --check=crc32 > ../initrd-dummy.xz

Hope this helps,
Roman