Re: [PATCH v3 07/12] asm-generic: Define 'func_desc_t' to commonly describe function descriptors

2021-10-18 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of October 18, 2021 5:07 pm:
> 
> 
> Le 18/10/2021 à 08:29, Nicholas Piggin a écrit :
>> Excerpts from Christophe Leroy's message of October 17, 2021 10:38 pm:
>>> We have three architectures using function descriptors, each with its
>>> own type and name.
>>>
>>> Add a common typedef that can be used in generic code.
>>>
>>> Also add a stub typedef for architectures without function descriptors,
>>> to avoid a forest of #ifdefs.
>>>
>>> It replaces the similar 'func_desc_t' previously defined in
>>> arch/powerpc/kernel/module_64.c
>>>
>>> Reviewed-by: Kees Cook 
>>> Acked-by: Arnd Bergmann 
>>> Signed-off-by: Christophe Leroy 
>>> ---
>> 
>> [...]
>> 
>>> diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h
>>> index a918388d9bf6..33b51efe3a24 100644
>>> --- a/include/asm-generic/sections.h
>>> +++ b/include/asm-generic/sections.h
>>> @@ -63,6 +63,9 @@ extern __visible const void __nosave_begin, __nosave_end;
>>>   #else
>>>   #define dereference_function_descriptor(p) ((void *)(p))
>>>   #define dereference_kernel_function_descriptor(p) ((void *)(p))
>>> +typedef struct {
>>> +   unsigned long addr;
>>> +} func_desc_t;
>>>   #endif
>>>   
>> 
>> I think that deserves a comment. If it's just to allow ifdefs to be
>> avoided, I guess that's okay with a comment. It would be nice if you could
>> cause it to generate a link-time error if it was ever used, like
>> undefined functions do, but I guess you can't. It's not a necessity though.
>> 
> 
> I tried to explain it in the commit message, but I can add a comment 
> here in addition for sure.

Thanks.

> 
> By the way, it IS used in powerpc's module_64.c:

Ah yes of course. I guess the point is function descriptors don't exist 
so it should not be used (in general). powerpc module code knows what it
is doing, so I guess it's okay for it to use it.
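
For illustration, the stub plus a comment could end up looking something
like this (the wording is only a sketch, not taken from the patch):

	/*
	 * Stub type so generic code can be built without #ifdefs.
	 * Architectures without function descriptors are not expected
	 * to use this directly; powerpc's module_64.c is the odd one out.
	 */
	typedef struct {
		unsigned long addr;
	} func_desc_t;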

Thanks,
Nick


Re: [PATCH v3 07/12] asm-generic: Define 'func_desc_t' to commonly describe function descriptors

2021-10-18 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of October 17, 2021 10:38 pm:
> We have three architectures using function descriptors, each with its
> own type and name.
> 
> Add a common typedef that can be used in generic code.
> 
> Also add a stub typedef for architectures without function descriptors,
> to avoid a forest of #ifdefs.
> 
> It replaces the similar 'func_desc_t' previously defined in
> arch/powerpc/kernel/module_64.c
> 
> Reviewed-by: Kees Cook 
> Acked-by: Arnd Bergmann 
> Signed-off-by: Christophe Leroy 
> ---

[...]

> diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h
> index a918388d9bf6..33b51efe3a24 100644
> --- a/include/asm-generic/sections.h
> +++ b/include/asm-generic/sections.h
> @@ -63,6 +63,9 @@ extern __visible const void __nosave_begin, __nosave_end;
>  #else
>  #define dereference_function_descriptor(p) ((void *)(p))
>  #define dereference_kernel_function_descriptor(p) ((void *)(p))
> +typedef struct {
> + unsigned long addr;
> +} func_desc_t;
>  #endif
>  

I think that deserves a comment. If it's just to allow ifdefs to be
avoided, I guess that's okay with a comment. It would be nice if you could
cause it to generate a link-time error if it was ever used, like
undefined functions do, but I guess you can't. It's not a necessity though.

Thanks,
Nick


Re: [PATCH v3 04/12] powerpc: Prepare func_desc_t for refactorisation

2021-10-18 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of October 17, 2021 10:38 pm:
> In preparation for making func_desc_t generic, change the ELFv2
> version to a struct containing 'addr' element.
> 
> This allows using single helpers common to ELFv1 and ELFv2.
> 
> Signed-off-by: Christophe Leroy 

> ---
>  arch/powerpc/kernel/module_64.c | 32 ++--
>  1 file changed, 14 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/module_64.c b/arch/powerpc/kernel/module_64.c
> index a89da0ee25e2..b687ef88c4c4 100644
> --- a/arch/powerpc/kernel/module_64.c
> +++ b/arch/powerpc/kernel/module_64.c
> @@ -33,19 +33,13 @@
>  #ifdef PPC64_ELF_ABI_v2
>  
>  /* An address is simply the address of the function. */
> -typedef unsigned long func_desc_t;
> +typedef struct {
> + unsigned long addr;
> +} func_desc_t;

I'm not quite following why this change is done. I guess it is so you 
can move this func_desc_t type into core code, but why do that? Is it
just to avoid using the preprocessor?
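
For context, the "single helpers common to ELFv1 and ELFv2" mentioned in
the changelog would presumably be along these lines (sketched from
module_64.c; with both ABIs returning a struct that has an 'addr' member,
one definition serves both):

	static unsigned long func_addr(unsigned long addr)
	{
		/* one body for ELFv1 and ELFv2 once both use .addr */
		return func_desc(addr).addr;
	}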

On its own this patch looks okay.

Acked-by: Nicholas Piggin 


Re: [PATCH v3 01/12] powerpc: Move and rename func_descr_t

2021-10-17 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of October 17, 2021 10:38 pm:
> There are three architectures with function descriptors; try to
> have common names for the address they contain in order to
> refactor some functions into generic functions later.
> 
> powerpc has 'entry'
> ia64 has 'ip'
> parisc has 'addr'
> 
> Vote for 'addr' and update 'func_descr_t' accordingly.
> 
> Move it into asm/elf.h to have it in the same place on all
> three architectures, remove the typedef which hides its real
> type, and change it to a smoother name 'struct func_desc'.
> 

Reviewed-by: Nicholas Piggin 

> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/include/asm/code-patching.h | 2 +-
>  arch/powerpc/include/asm/elf.h   | 6 ++
>  arch/powerpc/include/asm/types.h | 6 --
>  arch/powerpc/kernel/signal_64.c  | 8 
>  4 files changed, 11 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/code-patching.h 
> b/arch/powerpc/include/asm/code-patching.h
> index 4ba834599c4d..c6e805976e6f 100644
> --- a/arch/powerpc/include/asm/code-patching.h
> +++ b/arch/powerpc/include/asm/code-patching.h
> @@ -110,7 +110,7 @@ static inline unsigned long ppc_function_entry(void *func)
>* function's descriptor. The first entry in the descriptor is the
>* address of the function text.
>*/
> - return ((func_descr_t *)func)->entry;
> + return ((struct func_desc *)func)->addr;
>  #else
>   return (unsigned long)func;
>  #endif
> diff --git a/arch/powerpc/include/asm/elf.h b/arch/powerpc/include/asm/elf.h
> index b8425e3cfd81..971589a21bc0 100644
> --- a/arch/powerpc/include/asm/elf.h
> +++ b/arch/powerpc/include/asm/elf.h
> @@ -176,4 +176,10 @@ do { 
> \
>  /* Relocate the kernel image to @final_address */
>  void relocate(unsigned long final_address);
>  
> +struct func_desc {
> + unsigned long addr;
> + unsigned long toc;
> + unsigned long env;
> +};
> +
>  #endif /* _ASM_POWERPC_ELF_H */
> diff --git a/arch/powerpc/include/asm/types.h 
> b/arch/powerpc/include/asm/types.h
> index f1630c553efe..97da77bc48c9 100644
> --- a/arch/powerpc/include/asm/types.h
> +++ b/arch/powerpc/include/asm/types.h
> @@ -23,12 +23,6 @@
>  
>  typedef __vector128 vector128;
>  
> -typedef struct {
> - unsigned long entry;
> - unsigned long toc;
> - unsigned long env;
> -} func_descr_t;
> -
>  #endif /* __ASSEMBLY__ */
>  
>  #endif /* _ASM_POWERPC_TYPES_H */
> diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
> index 1831bba0582e..36537d7d5191 100644
> --- a/arch/powerpc/kernel/signal_64.c
> +++ b/arch/powerpc/kernel/signal_64.c
> @@ -933,11 +933,11 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t 
> *set,
>* descriptor is the entry address of signal and the second
>* entry is the TOC value we need to use.
>*/
> - func_descr_t __user *funct_desc_ptr =
> - (func_descr_t __user *) ksig->ka.sa.sa_handler;
> + struct func_desc __user *ptr =
> + (struct func_desc __user *)ksig->ka.sa.sa_handler;
>  
> -	err |= get_user(regs->ctr, &funct_desc_ptr->entry);
> -	err |= get_user(regs->gpr[2], &funct_desc_ptr->toc);
> +	err |= get_user(regs->ctr, &ptr->addr);
> +	err |= get_user(regs->gpr[2], &ptr->toc);
>   }
>  
>   /* enter the signal handler in native-endian mode */
> -- 
> 2.31.1
> 
> 
> 


[PATCH v1 11/11] powerpc/microwatt: Don't select the hash MMU code

2021-10-15 Thread Nicholas Piggin
Microwatt is radix-only, so it does not require hash MMU support.

This saves 20kB compressed dtbImage and 56kB vmlinux size.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/configs/microwatt_defconfig | 1 -
 arch/powerpc/platforms/microwatt/Kconfig | 1 -
 2 files changed, 2 deletions(-)

diff --git a/arch/powerpc/configs/microwatt_defconfig 
b/arch/powerpc/configs/microwatt_defconfig
index 6e62966730d3..7c8eb29d8afe 100644
--- a/arch/powerpc/configs/microwatt_defconfig
+++ b/arch/powerpc/configs/microwatt_defconfig
@@ -27,7 +27,6 @@ CONFIG_PPC_MICROWATT=y
 # CONFIG_PPC_OF_BOOT_TRAMPOLINE is not set
 CONFIG_CPU_FREQ=y
 CONFIG_HZ_100=y
-# CONFIG_PPC_MEM_KEYS is not set
 # CONFIG_SECCOMP is not set
 # CONFIG_MQ_IOSCHED_KYBER is not set
 # CONFIG_COREDUMP is not set
diff --git a/arch/powerpc/platforms/microwatt/Kconfig 
b/arch/powerpc/platforms/microwatt/Kconfig
index 823192e9d38a..5e320f49583a 100644
--- a/arch/powerpc/platforms/microwatt/Kconfig
+++ b/arch/powerpc/platforms/microwatt/Kconfig
@@ -5,7 +5,6 @@ config PPC_MICROWATT
select PPC_XICS
select PPC_ICS_NATIVE
select PPC_ICP_NATIVE
-   select PPC_HASH_MMU_NATIVE if PPC_64S_HASH_MMU
select PPC_UDBG_16550
select ARCH_RANDOM
help
-- 
2.23.0



[PATCH v1 10/11] powerpc/configs/microwatt: add POWER9_CPU

2021-10-15 Thread Nicholas Piggin
Microwatt implements a subset of ISA v3.0 (which is equivalent to
the POWER9_CPU option).

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/configs/microwatt_defconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/configs/microwatt_defconfig 
b/arch/powerpc/configs/microwatt_defconfig
index 9465209b8c5b..6e62966730d3 100644
--- a/arch/powerpc/configs/microwatt_defconfig
+++ b/arch/powerpc/configs/microwatt_defconfig
@@ -15,6 +15,7 @@ CONFIG_EMBEDDED=y
 # CONFIG_COMPAT_BRK is not set
 # CONFIG_SLAB_MERGE_DEFAULT is not set
 CONFIG_PPC64=y
+CONFIG_POWER9_CPU=y
 # CONFIG_PPC_KUEP is not set
 # CONFIG_PPC_KUAP is not set
 CONFIG_CPU_LITTLE_ENDIAN=y
-- 
2.23.0



[PATCH v1 09/11] powerpc/64s: Make hash MMU code build configurable

2021-10-15 Thread Nicholas Piggin
Introduce a new option CONFIG_PPC_64S_HASH_MMU which allows the 64s hash
MMU code to be compiled out if radix is selected and the minimum
supported CPU type is POWER9 or higher, and KVM is not selected.

This saves 128kB kernel image size (90kB text) on powernv_defconfig
minus KVM, 350kB on pseries_defconfig minus KVM, 40kB on a tiny config.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/Kconfig  |  1 +
 arch/powerpc/include/asm/book3s/64/mmu.h  | 22 ++-
 .../include/asm/book3s/64/tlbflush-hash.h |  7 ++
 arch/powerpc/include/asm/book3s/pgtable.h |  4 
 arch/powerpc/include/asm/mmu.h| 14 +---
 arch/powerpc/include/asm/mmu_context.h|  2 ++
 arch/powerpc/include/asm/paca.h   |  8 +++
 arch/powerpc/kernel/asm-offsets.c |  2 ++
 arch/powerpc/kernel/dt_cpu_ftrs.c |  8 ++-
 arch/powerpc/kernel/entry_64.S|  4 ++--
 arch/powerpc/kernel/exceptions-64s.S  | 16 ++
 arch/powerpc/kernel/mce.c |  2 +-
 arch/powerpc/kernel/mce_power.c   | 10 ++---
 arch/powerpc/kernel/paca.c| 18 ++-
 arch/powerpc/kernel/process.c | 13 ++-
 arch/powerpc/kernel/prom.c|  2 ++
 arch/powerpc/kernel/setup_64.c|  4 
 arch/powerpc/kexec/core_64.c  |  4 ++--
 arch/powerpc/kexec/ranges.c   |  4 
 arch/powerpc/kvm/Kconfig  |  1 +
 arch/powerpc/mm/book3s64/Makefile | 17 --
 arch/powerpc/mm/book3s64/hash_utils.c | 10 -
 .../{hash_hugetlbpage.c => hugetlbpage.c} |  6 +
 arch/powerpc/mm/book3s64/mmu_context.c| 16 ++
 arch/powerpc/mm/book3s64/pgtable.c| 12 ++
 arch/powerpc/mm/book3s64/radix_pgtable.c  |  4 
 arch/powerpc/mm/copro_fault.c |  2 ++
 arch/powerpc/mm/pgtable.c | 10 ++---
 arch/powerpc/platforms/Kconfig.cputype| 21 +-
 arch/powerpc/platforms/cell/Kconfig   |  1 +
 arch/powerpc/platforms/maple/Kconfig  |  1 +
 arch/powerpc/platforms/microwatt/Kconfig  |  2 +-
 arch/powerpc/platforms/pasemi/Kconfig |  1 +
 arch/powerpc/platforms/powermac/Kconfig   |  1 +
 arch/powerpc/platforms/powernv/Kconfig|  2 +-
 arch/powerpc/platforms/powernv/idle.c |  2 ++
 arch/powerpc/platforms/powernv/setup.c|  2 ++
 arch/powerpc/platforms/pseries/lpar.c | 11 --
 arch/powerpc/platforms/pseries/lparcfg.c  |  2 +-
 arch/powerpc/platforms/pseries/mobility.c |  6 +
 arch/powerpc/platforms/pseries/ras.c  |  2 ++
 arch/powerpc/platforms/pseries/reconfig.c |  2 ++
 arch/powerpc/platforms/pseries/setup.c|  6 +++--
 arch/powerpc/xmon/xmon.c  |  8 +--
 44 files changed, 233 insertions(+), 60 deletions(-)
 rename arch/powerpc/mm/book3s64/{hash_hugetlbpage.c => hugetlbpage.c} (95%)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index ba5b66189358..3c36511e6133 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -936,6 +936,7 @@ config PPC_MEM_KEYS
prompt "PowerPC Memory Protection Keys"
def_bool y
depends on PPC_BOOK3S_64
+   depends on PPC_64S_HASH_MMU
select ARCH_USES_HIGH_VMA_FLAGS
select ARCH_HAS_PKEYS
help
diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h 
b/arch/powerpc/include/asm/book3s/64/mmu.h
index c02f42d1031e..857dc88b0043 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -98,7 +98,9 @@ typedef struct {
 * from EA and new context ids to build the new VAs.
 */
mm_context_id_t id;
+#ifdef CONFIG_PPC_64S_HASH_MMU
mm_context_id_t extended_id[TASK_SIZE_USER64/TASK_CONTEXT_SIZE];
+#endif
};
 
/* Number of bits in the mm_cpumask */
@@ -110,7 +112,9 @@ typedef struct {
/* Number of user space windows opened in process mm_context */
atomic_t vas_windows;
 
+#ifdef CONFIG_PPC_64S_HASH_MMU
struct hash_mm_context *hash_context;
+#endif
 
void __user *vdso;
/*
@@ -133,6 +137,7 @@ typedef struct {
 #endif
 } mm_context_t;
 
+#ifdef CONFIG_PPC_64S_HASH_MMU
 static inline u16 mm_ctx_user_psize(mm_context_t *ctx)
 {
return ctx->hash_context->user_psize;
@@ -187,11 +192,22 @@ static inline struct subpage_prot_table 
*mm_ctx_subpage_prot(mm_context_t *ctx)
 }
 #endif
 
+#endif
+
 /*
  * The current system page and segment sizes
  */
-extern int mmu_linear_psize;
+#if defined(CONFIG_PPC_RADIX_MMU) && !defined(CONFIG_PPC_64S_HASH_MMU)
+#ifdef CONFIG_PPC_64K_PAGES
+#define mmu_virtual_psize MMU_PAGE_64K
+#else
+#define mmu_virtual_psize MMU_PAGE_4K
+#endif
+#else

[PATCH v1 08/11] powerpc/64s: Make flush_and_reload_slb a no-op when radix is enabled

2021-10-15 Thread Nicholas Piggin
The radix test can exclude slb_flush_all_realmode() from being called
because flush_and_reload_slb() is only expected to flush ERAT when
called by flush_erat(), which is only on pre-ISA v3.0 CPUs that do not
support radix.

This helps the later change that makes hash support configurable avoid
introducing runtime changes to radix mode behaviour.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/mce_power.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index c2f55fe7092d..cf5263b648fc 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -80,12 +80,12 @@ static bool mce_in_guest(void)
 #ifdef CONFIG_PPC_BOOK3S_64
 void flush_and_reload_slb(void)
 {
-   /* Invalidate all SLBs */
-   slb_flush_all_realmode();
-
if (early_radix_enabled())
return;
 
+   /* Invalidate all SLBs */
+   slb_flush_all_realmode();
+
/*
 * This probably shouldn't happen, but it may be possible it's
 * called in early boot before SLB shadows are allocated.
-- 
2.23.0



[PATCH v1 07/11] powerpc/64s: move THP trace point creation out of hash specific file

2021-10-15 Thread Nicholas Piggin
In preparation for making hash MMU support configurable, move THP
trace point function definitions out of an otherwise hash specific
file.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/mm/book3s64/Makefile   | 2 +-
 arch/powerpc/mm/book3s64/hash_pgtable.c | 1 -
 arch/powerpc/mm/book3s64/pgtable.c  | 1 +
 arch/powerpc/mm/book3s64/trace.c| 8 
 4 files changed, 10 insertions(+), 2 deletions(-)
 create mode 100644 arch/powerpc/mm/book3s64/trace.c

diff --git a/arch/powerpc/mm/book3s64/Makefile 
b/arch/powerpc/mm/book3s64/Makefile
index 319f4b7f3357..1579e18e098d 100644
--- a/arch/powerpc/mm/book3s64/Makefile
+++ b/arch/powerpc/mm/book3s64/Makefile
@@ -5,7 +5,7 @@ ccflags-y   := $(NO_MINIMAL_TOC)
 CFLAGS_REMOVE_slb.o = $(CC_FLAGS_FTRACE)
 
 obj-y  += hash_pgtable.o hash_utils.o slb.o \
-  mmu_context.o pgtable.o hash_tlb.o
+  mmu_context.o pgtable.o hash_tlb.o trace.o
 obj-$(CONFIG_PPC_HASH_MMU_NATIVE)  += hash_native.o
 obj-$(CONFIG_PPC_RADIX_MMU)+= radix_pgtable.o radix_tlb.o
 obj-$(CONFIG_PPC_4K_PAGES) += hash_4k.o
diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c 
b/arch/powerpc/mm/book3s64/hash_pgtable.c
index ad5eff097d31..7ce8914992e3 100644
--- a/arch/powerpc/mm/book3s64/hash_pgtable.c
+++ b/arch/powerpc/mm/book3s64/hash_pgtable.c
@@ -16,7 +16,6 @@
 
 #include 
 
-#define CREATE_TRACE_POINTS
 #include 
 
 #if H_PGTABLE_RANGE > (USER_VSID_RANGE * (TASK_SIZE_USER64 / 
TASK_CONTEXT_SIZE))
diff --git a/arch/powerpc/mm/book3s64/pgtable.c 
b/arch/powerpc/mm/book3s64/pgtable.c
index 9e16c7b1a6c5..049843c8c875 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -28,6 +28,7 @@ unsigned long __pmd_frag_size_shift;
 EXPORT_SYMBOL(__pmd_frag_size_shift);
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+
 /*
  * This is called when relaxing access to a hugepage. It's also called in the 
page
  * fault path when we don't hit any of the major fault cases, ie, a minor
diff --git a/arch/powerpc/mm/book3s64/trace.c b/arch/powerpc/mm/book3s64/trace.c
new file mode 100644
index ..b86e7b906257
--- /dev/null
+++ b/arch/powerpc/mm/book3s64/trace.c
@@ -0,0 +1,8 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * This file is for defining trace points and trace related helpers.
+ */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define CREATE_TRACE_POINTS
+#include 
+#endif
-- 
2.23.0



[PATCH v1 06/11] powerpc/pseries: lparcfg don't include slb_size line in radix mode

2021-10-15 Thread Nicholas Piggin
This avoids a change in behaviour in the later patch making hash
support configurable. This is possibly a user interface change, so
the alternative would be a hard-coded slb_size=0 here.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/platforms/pseries/lparcfg.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/lparcfg.c 
b/arch/powerpc/platforms/pseries/lparcfg.c
index f71eac74ea92..3354c00914fa 100644
--- a/arch/powerpc/platforms/pseries/lparcfg.c
+++ b/arch/powerpc/platforms/pseries/lparcfg.c
@@ -532,7 +532,8 @@ static int pseries_lparcfg_data(struct seq_file *m, void *v)
   lppaca_shared_proc(get_lppaca()));
 
 #ifdef CONFIG_PPC_BOOK3S_64
-   seq_printf(m, "slb_size=%d\n", mmu_slb_size);
+   if (!radix_enabled())
+   seq_printf(m, "slb_size=%d\n", mmu_slb_size);
 #endif
parse_em_data(m);
maxmem_data(m);
-- 
2.23.0



[PATCH v1 05/11] powerpc/pseries: move pseries_lpar_register_process_table() out from hash specific code

2021-10-15 Thread Nicholas Piggin
This reduces ifdefs in a later change making hash support configurable.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/platforms/pseries/lpar.c | 56 +--
 1 file changed, 28 insertions(+), 28 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/lpar.c 
b/arch/powerpc/platforms/pseries/lpar.c
index 3df6bdfea475..06d6a824c0dc 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -712,6 +712,34 @@ void vpa_init(int cpu)
 
 #ifdef CONFIG_PPC_BOOK3S_64
 
+static int pseries_lpar_register_process_table(unsigned long base,
+   unsigned long page_size, unsigned long table_size)
+{
+   long rc;
+   unsigned long flags = 0;
+
+   if (table_size)
+   flags |= PROC_TABLE_NEW;
+   if (radix_enabled()) {
+   flags |= PROC_TABLE_RADIX;
+   if (mmu_has_feature(MMU_FTR_GTSE))
+   flags |= PROC_TABLE_GTSE;
+   } else
+   flags |= PROC_TABLE_HPT_SLB;
+   for (;;) {
+   rc = plpar_hcall_norets(H_REGISTER_PROC_TBL, flags, base,
+   page_size, table_size);
+   if (!H_IS_LONG_BUSY(rc))
+   break;
+   mdelay(get_longbusy_msecs(rc));
+   }
+   if (rc != H_SUCCESS) {
+   pr_err("Failed to register process table (rc=%ld)\n", rc);
+   BUG();
+   }
+   return rc;
+}
+
 static long pSeries_lpar_hpte_insert(unsigned long hpte_group,
 unsigned long vpn, unsigned long pa,
 unsigned long rflags, unsigned long vflags,
@@ -1680,34 +1708,6 @@ static int pseries_lpar_resize_hpt(unsigned long shift)
return 0;
 }
 
-static int pseries_lpar_register_process_table(unsigned long base,
-   unsigned long page_size, unsigned long table_size)
-{
-   long rc;
-   unsigned long flags = 0;
-
-   if (table_size)
-   flags |= PROC_TABLE_NEW;
-   if (radix_enabled()) {
-   flags |= PROC_TABLE_RADIX;
-   if (mmu_has_feature(MMU_FTR_GTSE))
-   flags |= PROC_TABLE_GTSE;
-   } else
-   flags |= PROC_TABLE_HPT_SLB;
-   for (;;) {
-   rc = plpar_hcall_norets(H_REGISTER_PROC_TBL, flags, base,
-   page_size, table_size);
-   if (!H_IS_LONG_BUSY(rc))
-   break;
-   mdelay(get_longbusy_msecs(rc));
-   }
-   if (rc != H_SUCCESS) {
-   pr_err("Failed to register process table (rc=%ld)\n", rc);
-   BUG();
-   }
-   return rc;
-}
-
 void __init hpte_init_pseries(void)
 {
mmu_hash_ops.hpte_invalidate = pSeries_lpar_hpte_invalidate;
-- 
2.23.0



[PATCH v1 04/11] powerpc/64s: Move and rename do_bad_slb_fault as it is not hash specific

2021-10-15 Thread Nicholas Piggin
slb.c is hash-specific SLB management, but do_bad_slb_fault deals with
segment interrupts that occur with radix MMU as well.
---
 arch/powerpc/include/asm/interrupt.h |  2 +-
 arch/powerpc/kernel/exceptions-64s.S |  4 ++--
 arch/powerpc/mm/book3s64/slb.c   | 16 
 arch/powerpc/mm/fault.c  | 17 +
 4 files changed, 20 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/include/asm/interrupt.h 
b/arch/powerpc/include/asm/interrupt.h
index a1d238255f07..3487aab12229 100644
--- a/arch/powerpc/include/asm/interrupt.h
+++ b/arch/powerpc/include/asm/interrupt.h
@@ -564,7 +564,7 @@ DECLARE_INTERRUPT_HANDLER(kernel_bad_stack);
 
 /* slb.c */
 DECLARE_INTERRUPT_HANDLER_RAW(do_slb_fault);
-DECLARE_INTERRUPT_HANDLER(do_bad_slb_fault);
+DECLARE_INTERRUPT_HANDLER(do_bad_segment_interrupt);
 
 /* hash_utils.c */
 DECLARE_INTERRUPT_HANDLER_RAW(do_hash_fault);
diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index eaf1f72131a1..046c99e31d01 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1430,7 +1430,7 @@ MMU_FTR_SECTION_ELSE
 ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX)
std r3,RESULT(r1)
	addi	r3,r1,STACK_FRAME_OVERHEAD
-   bl  do_bad_slb_fault
+   bl  do_bad_segment_interrupt
b   interrupt_return_srr
 
 
@@ -1510,7 +1510,7 @@ MMU_FTR_SECTION_ELSE
 ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX)
std r3,RESULT(r1)
	addi	r3,r1,STACK_FRAME_OVERHEAD
-   bl  do_bad_slb_fault
+   bl  do_bad_segment_interrupt
b   interrupt_return_srr
 
 
diff --git a/arch/powerpc/mm/book3s64/slb.c b/arch/powerpc/mm/book3s64/slb.c
index f0037bcc47a0..31f4cef3adac 100644
--- a/arch/powerpc/mm/book3s64/slb.c
+++ b/arch/powerpc/mm/book3s64/slb.c
@@ -868,19 +868,3 @@ DEFINE_INTERRUPT_HANDLER_RAW(do_slb_fault)
return err;
}
 }
-
-DEFINE_INTERRUPT_HANDLER(do_bad_slb_fault)
-{
-   int err = regs->result;
-
-   if (err == -EFAULT) {
-   if (user_mode(regs))
-   _exception(SIGSEGV, regs, SEGV_BNDERR, regs->dar);
-   else
-   bad_page_fault(regs, SIGSEGV);
-   } else if (err == -EINVAL) {
-   unrecoverable_exception(regs);
-   } else {
-   BUG();
-   }
-}
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index a8d0ce85d39a..53ddcae0ac9e 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 
+#include 
 #include 
 #include 
 #include 
@@ -620,4 +621,20 @@ DEFINE_INTERRUPT_HANDLER(do_bad_page_fault_segv)
 {
bad_page_fault(regs, SIGSEGV);
 }
+
+DEFINE_INTERRUPT_HANDLER(do_bad_segment_interrupt)
+{
+   int err = regs->result;
+
+   if (err == -EFAULT) {
+   if (user_mode(regs))
+   _exception(SIGSEGV, regs, SEGV_BNDERR, regs->dar);
+   else
+   bad_page_fault(regs, SIGSEGV);
+   } else if (err == -EINVAL) {
+   unrecoverable_exception(regs);
+   } else {
+   BUG();
+   }
+}
 #endif
-- 
2.23.0



[PATCH v1 03/11] powerpc/pseries: Stop selecting PPC_HASH_MMU_NATIVE

2021-10-15 Thread Nicholas Piggin
The pseries platform does not use the native hash code but the PAPR
virtualised hash interfaces, so remove PPC_HASH_MMU_NATIVE.

This requires moving tlbiel code from hash_native.c to hash_utils.c.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/book3s/64/tlbflush.h |   4 -
 arch/powerpc/mm/book3s64/hash_native.c| 104 --
 arch/powerpc/mm/book3s64/hash_utils.c | 104 ++
 arch/powerpc/platforms/pseries/Kconfig|   1 -
 4 files changed, 104 insertions(+), 109 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush.h 
b/arch/powerpc/include/asm/book3s/64/tlbflush.h
index 215973b4cb26..d2e80f178b6d 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush.h
@@ -14,7 +14,6 @@ enum {
TLB_INVAL_SCOPE_LPID = 1,   /* invalidate TLBs for current LPID */
 };
 
-#ifdef CONFIG_PPC_NATIVE
 static inline void tlbiel_all(void)
 {
/*
@@ -30,9 +29,6 @@ static inline void tlbiel_all(void)
else
hash__tlbiel_all(TLB_INVAL_SCOPE_GLOBAL);
 }
-#else
-static inline void tlbiel_all(void) { BUG(); }
-#endif
 
 static inline void tlbiel_all_lpid(bool radix)
 {
diff --git a/arch/powerpc/mm/book3s64/hash_native.c 
b/arch/powerpc/mm/book3s64/hash_native.c
index d8279bfe68ea..d2a320828c0b 100644
--- a/arch/powerpc/mm/book3s64/hash_native.c
+++ b/arch/powerpc/mm/book3s64/hash_native.c
@@ -43,110 +43,6 @@
 
 static DEFINE_RAW_SPINLOCK(native_tlbie_lock);
 
-static inline void tlbiel_hash_set_isa206(unsigned int set, unsigned int is)
-{
-   unsigned long rb;
-
-   rb = (set << PPC_BITLSHIFT(51)) | (is << PPC_BITLSHIFT(53));
-
-   asm volatile("tlbiel %0" : : "r" (rb));
-}
-
-/*
- * tlbiel instruction for hash, set invalidation
- * i.e., r=1 and is=01 or is=10 or is=11
- */
-static __always_inline void tlbiel_hash_set_isa300(unsigned int set, unsigned 
int is,
-   unsigned int pid,
-   unsigned int ric, unsigned int prs)
-{
-   unsigned long rb;
-   unsigned long rs;
-   unsigned int r = 0; /* hash format */
-
-   rb = (set << PPC_BITLSHIFT(51)) | (is << PPC_BITLSHIFT(53));
-   rs = ((unsigned long)pid << PPC_BITLSHIFT(31));
-
-   asm volatile(PPC_TLBIEL(%0, %1, %2, %3, %4)
-: : "r"(rb), "r"(rs), "i"(ric), "i"(prs), "i"(r)
-: "memory");
-}
-
-
-static void tlbiel_all_isa206(unsigned int num_sets, unsigned int is)
-{
-   unsigned int set;
-
-   asm volatile("ptesync": : :"memory");
-
-   for (set = 0; set < num_sets; set++)
-   tlbiel_hash_set_isa206(set, is);
-
-   ppc_after_tlbiel_barrier();
-}
-
-static void tlbiel_all_isa300(unsigned int num_sets, unsigned int is)
-{
-   unsigned int set;
-
-   asm volatile("ptesync": : :"memory");
-
-   /*
-* Flush the partition table cache if this is HV mode.
-*/
-   if (early_cpu_has_feature(CPU_FTR_HVMODE))
-   tlbiel_hash_set_isa300(0, is, 0, 2, 0);
-
-   /*
-* Now invalidate the process table cache. UPRT=0 HPT modes (what
-* current hardware implements) do not use the process table, but
-* add the flushes anyway.
-*
-* From ISA v3.0B p. 1078:
-* The following forms are invalid.
-*  * PRS=1, R=0, and RIC!=2 (The only process-scoped
-*HPT caching is of the Process Table.)
-*/
-   tlbiel_hash_set_isa300(0, is, 0, 2, 1);
-
-   /*
-* Then flush the sets of the TLB proper. Hash mode uses
-* partition scoped TLB translations, which may be flushed
-* in !HV mode.
-*/
-   for (set = 0; set < num_sets; set++)
-   tlbiel_hash_set_isa300(set, is, 0, 0, 0);
-
-   ppc_after_tlbiel_barrier();
-
-   asm volatile(PPC_ISA_3_0_INVALIDATE_ERAT "; isync" : : :"memory");
-}
-
-void hash__tlbiel_all(unsigned int action)
-{
-   unsigned int is;
-
-   switch (action) {
-   case TLB_INVAL_SCOPE_GLOBAL:
-   is = 3;
-   break;
-   case TLB_INVAL_SCOPE_LPID:
-   is = 2;
-   break;
-   default:
-   BUG();
-   }
-
-   if (early_cpu_has_feature(CPU_FTR_ARCH_300))
-   tlbiel_all_isa300(POWER9_TLB_SETS_HASH, is);
-   else if (early_cpu_has_feature(CPU_FTR_ARCH_207S))
-   tlbiel_all_isa206(POWER8_TLB_SETS, is);
-   else if (early_cpu_has_feature(CPU_FTR_ARCH_206))
-   tlbiel_all_isa206(POWER7_TLB_SETS, is);
-   else
-   WARN(1, "%s called on pre-POWER7 CPU\n", __func__);
-}
-
 static inline unsigned long  ___tlbie(unsigned long vpn, int psize,

[PATCH v1 02/11] powerpc: Rename PPC_NATIVE to PPC_HASH_MMU_NATIVE

2021-10-15 Thread Nicholas Piggin
PPC_NATIVE now only controls the native HPT code, so rename it to be
more descriptive. Restrict it to Book3S only.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/mm/book3s64/Makefile  | 2 +-
 arch/powerpc/mm/book3s64/hash_utils.c  | 2 +-
 arch/powerpc/platforms/52xx/Kconfig| 2 +-
 arch/powerpc/platforms/Kconfig | 4 ++--
 arch/powerpc/platforms/cell/Kconfig| 2 +-
 arch/powerpc/platforms/chrp/Kconfig| 2 +-
 arch/powerpc/platforms/embedded6xx/Kconfig | 2 +-
 arch/powerpc/platforms/maple/Kconfig   | 2 +-
 arch/powerpc/platforms/microwatt/Kconfig   | 2 +-
 arch/powerpc/platforms/pasemi/Kconfig  | 2 +-
 arch/powerpc/platforms/powermac/Kconfig| 2 +-
 arch/powerpc/platforms/powernv/Kconfig | 2 +-
 arch/powerpc/platforms/pseries/Kconfig | 2 +-
 13 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/Makefile 
b/arch/powerpc/mm/book3s64/Makefile
index 1b56d3af47d4..319f4b7f3357 100644
--- a/arch/powerpc/mm/book3s64/Makefile
+++ b/arch/powerpc/mm/book3s64/Makefile
@@ -6,7 +6,7 @@ CFLAGS_REMOVE_slb.o = $(CC_FLAGS_FTRACE)
 
 obj-y  += hash_pgtable.o hash_utils.o slb.o \
   mmu_context.o pgtable.o hash_tlb.o
-obj-$(CONFIG_PPC_NATIVE)   += hash_native.o
+obj-$(CONFIG_PPC_HASH_MMU_NATIVE)  += hash_native.o
 obj-$(CONFIG_PPC_RADIX_MMU)+= radix_pgtable.o radix_tlb.o
 obj-$(CONFIG_PPC_4K_PAGES) += hash_4k.o
 obj-$(CONFIG_PPC_64K_PAGES)+= hash_64k.o
diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index c145776d3ae5..ebe3044711ce 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -1091,7 +1091,7 @@ void __init hash__early_init_mmu(void)
ps3_early_mm_init();
else if (firmware_has_feature(FW_FEATURE_LPAR))
hpte_init_pseries();
-   else if (IS_ENABLED(CONFIG_PPC_NATIVE))
+   else if (IS_ENABLED(CONFIG_PPC_HASH_MMU_NATIVE))
hpte_init_native();
 
if (!mmu_hash_ops.hpte_insert)
diff --git a/arch/powerpc/platforms/52xx/Kconfig 
b/arch/powerpc/platforms/52xx/Kconfig
index 99d60acc20c8..b72ed2950ca8 100644
--- a/arch/powerpc/platforms/52xx/Kconfig
+++ b/arch/powerpc/platforms/52xx/Kconfig
@@ -34,7 +34,7 @@ config PPC_EFIKA
bool "bPlan Efika 5k2. MPC5200B based computer"
depends on PPC_MPC52xx
select PPC_RTAS
-   select PPC_NATIVE
+   select PPC_HASH_MMU_NATIVE
 
 config PPC_LITE5200
bool "Freescale Lite5200 Eval Board"
diff --git a/arch/powerpc/platforms/Kconfig b/arch/powerpc/platforms/Kconfig
index e02d29a9d12f..d41dad227de8 100644
--- a/arch/powerpc/platforms/Kconfig
+++ b/arch/powerpc/platforms/Kconfig
@@ -40,9 +40,9 @@ config EPAPR_PARAVIRT
 
  In case of doubt, say Y
 
-config PPC_NATIVE
+config PPC_HASH_MMU_NATIVE
bool
-   depends on PPC_BOOK3S_32 || PPC64
+   depends on PPC_BOOK3S
help
  Support for running natively on the hardware, i.e. without
  a hypervisor. This option is not user-selectable but should
diff --git a/arch/powerpc/platforms/cell/Kconfig 
b/arch/powerpc/platforms/cell/Kconfig
index cb70c5f25bc6..db4465c51b56 100644
--- a/arch/powerpc/platforms/cell/Kconfig
+++ b/arch/powerpc/platforms/cell/Kconfig
@@ -8,7 +8,7 @@ config PPC_CELL_COMMON
select PPC_DCR_MMIO
select PPC_INDIRECT_PIO
select PPC_INDIRECT_MMIO
-   select PPC_NATIVE
+   select PPC_HASH_MMU_NATIVE
select PPC_RTAS
select IRQ_EDGE_EOI_HANDLER
 
diff --git a/arch/powerpc/platforms/chrp/Kconfig 
b/arch/powerpc/platforms/chrp/Kconfig
index 9b5c5505718a..ff30ed579a39 100644
--- a/arch/powerpc/platforms/chrp/Kconfig
+++ b/arch/powerpc/platforms/chrp/Kconfig
@@ -11,6 +11,6 @@ config PPC_CHRP
select RTAS_ERROR_LOGGING
select PPC_MPC106
select PPC_UDBG_16550
-   select PPC_NATIVE
+   select PPC_HASH_MMU_NATIVE
select FORCE_PCI
default y
diff --git a/arch/powerpc/platforms/embedded6xx/Kconfig 
b/arch/powerpc/platforms/embedded6xx/Kconfig
index 4c6d703a4284..c54786f8461e 100644
--- a/arch/powerpc/platforms/embedded6xx/Kconfig
+++ b/arch/powerpc/platforms/embedded6xx/Kconfig
@@ -55,7 +55,7 @@ config MVME5100
select FORCE_PCI
select PPC_INDIRECT_PCI
select PPC_I8259
-   select PPC_NATIVE
+   select PPC_HASH_MMU_NATIVE
select PPC_UDBG_16550
help
  This option enables support for the Motorola (now Emerson) MVME5100
diff --git a/arch/powerpc/platforms/maple/Kconfig 
b/arch/powerpc/platforms/maple/Kconfig
index 86ae210bee9a..7fd84311ade5 100644
--- a/arch/powerpc/platforms/maple/Kconfig
+++ b/arch/powerpc/platforms/maple/Kconfig
@@ -9,7 +9,7 @@ config PPC_MAPLE
select GENERIC_TBSYNC
select PPC_UDBG_16550
select PPC_970_NAP
-   select PPC_N

[PATCH v1 00/11] powerpc: Make hash MMU code build configurable

2021-10-15 Thread Nicholas Piggin
Now that there's a platform that can make good use of it, here's
a series that can prevent the hash MMU code being built for 64s
platforms that don't need it.

Thanks Christophe and Michael for reviews of the RFC, I hope I
got all the issues raised.

Since RFC:
- Split out large code movement from other changes.
- Used mmu ftr test constant folding rather than adding new constant
  true/false for radix_enabled().
- Restore the tlbie trace point that had to be commented out in the
  previous version.
- Avoid minor (probably unreachable) behaviour change in machine check
  handler when hash was not compiled.
- Fix microwatt updates so !HASH is not enforced.
- Rebase, build fixes.

Thanks,
Nick

Nicholas Piggin (11):
  powerpc: Remove unused FW_FEATURE_NATIVE references
  powerpc: Rename PPC_NATIVE to PPC_HASH_MMU_NATIVE
  powerpc/pseries: Stop selecting PPC_HASH_MMU_NATIVE
  powerpc/64s: Move and rename do_bad_slb_fault as it is not hash
specific
  powerpc/pseries: move pseries_lpar_register_process_table() out from
hash specific code
  powerpc/pseries: lparcfg don't include slb_size line in radix mode
  powerpc/64s: move THP trace point creation out of hash specific file
  powerpc/64s: Make flush_and_reload_slb a no-op when radix is enabled
  powerpc/64s: Make hash MMU code build configurable
  powerpc/configs/microwatt: add POWER9_CPU
  powerpc/microwatt: Don't select the hash MMU code

 arch/powerpc/Kconfig  |   1 +
 arch/powerpc/configs/microwatt_defconfig  |   2 +-
 arch/powerpc/include/asm/book3s/64/mmu.h  |  22 +++-
 .../include/asm/book3s/64/tlbflush-hash.h |   7 ++
 arch/powerpc/include/asm/book3s/64/tlbflush.h |   4 -
 arch/powerpc/include/asm/book3s/pgtable.h |   4 +
 arch/powerpc/include/asm/firmware.h   |   8 --
 arch/powerpc/include/asm/interrupt.h  |   2 +-
 arch/powerpc/include/asm/mmu.h|  14 ++-
 arch/powerpc/include/asm/mmu_context.h|   2 +
 arch/powerpc/include/asm/paca.h   |   8 ++
 arch/powerpc/kernel/asm-offsets.c |   2 +
 arch/powerpc/kernel/dt_cpu_ftrs.c |   8 +-
 arch/powerpc/kernel/entry_64.S|   4 +-
 arch/powerpc/kernel/exceptions-64s.S  |  20 ++-
 arch/powerpc/kernel/mce.c |   2 +-
 arch/powerpc/kernel/mce_power.c   |  16 ++-
 arch/powerpc/kernel/paca.c|  18 ++-
 arch/powerpc/kernel/process.c |  13 +-
 arch/powerpc/kernel/prom.c|   2 +
 arch/powerpc/kernel/setup_64.c|   4 +
 arch/powerpc/kexec/core_64.c  |   4 +-
 arch/powerpc/kexec/ranges.c   |   4 +
 arch/powerpc/kvm/Kconfig  |   1 +
 arch/powerpc/mm/book3s64/Makefile |  19 +--
 arch/powerpc/mm/book3s64/hash_native.c| 104 
 arch/powerpc/mm/book3s64/hash_pgtable.c   |   1 -
 arch/powerpc/mm/book3s64/hash_utils.c | 116 --
 .../{hash_hugetlbpage.c => hugetlbpage.c} |   6 +
 arch/powerpc/mm/book3s64/mmu_context.c|  16 +++
 arch/powerpc/mm/book3s64/pgtable.c|  13 ++
 arch/powerpc/mm/book3s64/radix_pgtable.c  |   4 +
 arch/powerpc/mm/book3s64/slb.c|  16 ---
 arch/powerpc/mm/book3s64/trace.c  |   8 ++
 arch/powerpc/mm/copro_fault.c |   2 +
 arch/powerpc/mm/fault.c   |  17 +++
 arch/powerpc/mm/pgtable.c |  10 +-
 arch/powerpc/platforms/52xx/Kconfig   |   2 +-
 arch/powerpc/platforms/Kconfig|   4 +-
 arch/powerpc/platforms/Kconfig.cputype|  21 +++-
 arch/powerpc/platforms/cell/Kconfig   |   3 +-
 arch/powerpc/platforms/chrp/Kconfig   |   2 +-
 arch/powerpc/platforms/embedded6xx/Kconfig|   2 +-
 arch/powerpc/platforms/maple/Kconfig  |   3 +-
 arch/powerpc/platforms/microwatt/Kconfig  |   1 -
 arch/powerpc/platforms/pasemi/Kconfig |   3 +-
 arch/powerpc/platforms/powermac/Kconfig   |   3 +-
 arch/powerpc/platforms/powernv/Kconfig|   2 +-
 arch/powerpc/platforms/powernv/idle.c |   2 +
 arch/powerpc/platforms/powernv/setup.c|   2 +
 arch/powerpc/platforms/pseries/Kconfig|   1 -
 arch/powerpc/platforms/pseries/lpar.c |  67 +-
 arch/powerpc/platforms/pseries/lparcfg.c  |   5 +-
 arch/powerpc/platforms/pseries/mobility.c |   6 +
 arch/powerpc/platforms/pseries/ras.c  |   2 +
 arch/powerpc/platforms/pseries/reconfig.c |   2 +
 arch/powerpc/platforms/pseries/setup.c|   6 +-
 arch/powerpc/xmon/xmon.c  |   8 +-
 58 files changed, 410 insertions(+), 241 deletions(-)
 rename arch/powerpc/mm/book3s64/{hash_hugetlbpage.c => hugetlbpage.c} (95%)
 create mode 100644 arch/powerpc/mm/book3s64/trace.c

-- 
2.23.0



[PATCH v1 01/11] powerpc: Remove unused FW_FEATURE_NATIVE references

2021-10-15 Thread Nicholas Piggin
FW_FEATURE_NATIVE_ALWAYS and FW_FEATURE_NATIVE_POSSIBLE are always
zero and never do anything. Remove them.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/firmware.h | 8 
 1 file changed, 8 deletions(-)

diff --git a/arch/powerpc/include/asm/firmware.h 
b/arch/powerpc/include/asm/firmware.h
index 97a3bd9ffeb9..9b702d2b80fb 100644
--- a/arch/powerpc/include/asm/firmware.h
+++ b/arch/powerpc/include/asm/firmware.h
@@ -80,8 +80,6 @@ enum {
FW_FEATURE_POWERNV_ALWAYS = 0,
FW_FEATURE_PS3_POSSIBLE = FW_FEATURE_LPAR | FW_FEATURE_PS3_LV1,
FW_FEATURE_PS3_ALWAYS = FW_FEATURE_LPAR | FW_FEATURE_PS3_LV1,
-   FW_FEATURE_NATIVE_POSSIBLE = 0,
-   FW_FEATURE_NATIVE_ALWAYS = 0,
FW_FEATURE_POSSIBLE =
 #ifdef CONFIG_PPC_PSERIES
FW_FEATURE_PSERIES_POSSIBLE |
@@ -91,9 +89,6 @@ enum {
 #endif
 #ifdef CONFIG_PPC_PS3
FW_FEATURE_PS3_POSSIBLE |
-#endif
-#ifdef CONFIG_PPC_NATIVE
-   FW_FEATURE_NATIVE_ALWAYS |
 #endif
0,
FW_FEATURE_ALWAYS =
@@ -105,9 +100,6 @@ enum {
 #endif
 #ifdef CONFIG_PPC_PS3
FW_FEATURE_PS3_ALWAYS &
-#endif
-#ifdef CONFIG_PPC_NATIVE
-   FW_FEATURE_NATIVE_ALWAYS &
 #endif
FW_FEATURE_POSSIBLE,
 
-- 
2.23.0



Re: [PATCH v2 06/13] asm-generic: Use HAVE_FUNCTION_DESCRIPTORS to define associated stubs

2021-10-15 Thread Nicholas Piggin
Excerpts from Nicholas Piggin's message of October 15, 2021 6:02 pm:
> Excerpts from Christophe Leroy's message of October 15, 2021 4:24 pm:
>> 
>> 
>> Le 15/10/2021 à 08:16, Nicholas Piggin a écrit :
>>> Excerpts from Christophe Leroy's message of October 14, 2021 3:49 pm:
>>>> Replace HAVE_DEREFERENCE_FUNCTION_DESCRIPTOR by
>>>> HAVE_FUNCTION_DESCRIPTORS and use it instead of
>>>> 'dereference_function_descriptor' macro to know
>>>> whether an arch has function descriptors.
>>>>
>>>> To limit churn in one of the following patches, use
>>>> an #ifdef/#else construct with empty first part
>>>> instead of an #ifndef in asm-generic/sections.h
>>> 
>>> Is it worth putting this into Kconfig if you're going to
>>> change it? In any case
>> 
>> That was what I wanted to do in the begining but how can I do that in 
>> Kconfig ?
>> 
>> #ifdef __powerpc64__
>> #if defined(_CALL_ELF) && _CALL_ELF == 2
>> #define PPC64_ELF_ABI_v2
>> #else
>> #define PPC64_ELF_ABI_v1
>> #endif
>> #endif /* __powerpc64__ */
>> 
>> #ifdef PPC64_ELF_ABI_v1
>> #define HAVE_DEREFERENCE_FUNCTION_DESCRIPTOR 1
> 
> We have ELFv2 ABI / function descriptors iff big-endian so you could 
> just select based on that.

Of course that should read ELFv1. To be clearer: BE is ELFv1 ABI and
LE is ELFv2 ABI.

Thanks,
Nick


Re: [PATCH v2 06/13] asm-generic: Use HAVE_FUNCTION_DESCRIPTORS to define associated stubs

2021-10-15 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of October 15, 2021 4:24 pm:
> 
> 
> Le 15/10/2021 à 08:16, Nicholas Piggin a écrit :
>> Excerpts from Christophe Leroy's message of October 14, 2021 3:49 pm:
>>> Replace HAVE_DEREFERENCE_FUNCTION_DESCRIPTOR by
>>> HAVE_FUNCTION_DESCRIPTORS and use it instead of
>>> 'dereference_function_descriptor' macro to know
>>> whether an arch has function descriptors.
>>>
>>> To limit churn in one of the following patches, use
>>> an #ifdef/#else construct with empty first part
>>> instead of an #ifndef in asm-generic/sections.h
>> 
>> Is it worth putting this into Kconfig if you're going to
>> change it? In any case
> 
> That was what I wanted to do in the begining but how can I do that in 
> Kconfig ?
> 
> #ifdef __powerpc64__
> #if defined(_CALL_ELF) && _CALL_ELF == 2
> #define PPC64_ELF_ABI_v2
> #else
> #define PPC64_ELF_ABI_v1
> #endif
> #endif /* __powerpc64__ */
> 
> #ifdef PPC64_ELF_ABI_v1
> #define HAVE_DEREFERENCE_FUNCTION_DESCRIPTOR 1

We have ELFv2 ABI / function descriptors iff big-endian so you could 
just select based on that.

I have a patch that makes the ABI version configurable which cleans
some of this up a bit, but that can be rebased on your series if we
ever merge it. Maybe just add BUILD_BUG_ONs in the above ifdef block
to ensure CONFIG_HAVE_FUNCTION_DESCRIPTORS was set the right way, so
I don't forget.
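
For example, something along these lines (just a sketch: it assumes the
Kconfig symbol ends up being called CONFIG_HAVE_FUNCTION_DESCRIPTORS, and
since BUILD_BUG_ON() only works inside a function, a file-scope check in
the header would probably have to be static_assert() instead):

	#ifdef PPC64_ELF_ABI_v1
	static_assert(IS_ENABLED(CONFIG_HAVE_FUNCTION_DESCRIPTORS),
		      "ELFv1 ABI implies function descriptors");
	#else
	static_assert(!IS_ENABLED(CONFIG_HAVE_FUNCTION_DESCRIPTORS),
		      "ELFv2 ABI does not use function descriptors");
	#endif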

Thanks,
Nick


Re: [PATCH] powerpc/mce: check if event info is valid

2021-10-15 Thread Nicholas Piggin
Excerpts from Michael Ellerman's message of October 7, 2021 10:09 pm:
> Ganesh  writes:
>> On 8/6/21 6:53 PM, Ganesh Goudar wrote:
>>
>>> Check if the event info is valid before printing the
>>> event information. When a fwnmi enabled nested kvm guest
>>> hits a machine check exception L0 and L2 would generate
>>> machine check event info, But L1 would not generate any
>>> machine check event info as it won't go through 0x200
>>> vector and prints some unwanted message.
>>>
>>> To fix this, 'in_use' variable in machine check event info is
>>> no more in use, rename it to 'valid' and check if the event
>>> information is valid before logging the event information.
>>>
>>> without this patch L1 would print following message for
>>> exceptions encountered in L2, as event structure will be
>>> empty in L1.
>>>
>>> "Machine Check Exception, Unknown event version 0".
>>>
>>> Signed-off-by: Ganesh Goudar 
>>> ---
>>
>> Hi mpe, Any comments on this patch.
> 
> The variable rename is a bit of a distraction.
> 
> But ignoring that, how do we end up processing a machine_check_event
> that has in_use = 0?
> 
> You don't give much detail on what call path you're talking about. I
> guess we're coming in via the calls in the KVM code?
> 
> In the definition of kvm_vcpu_arch we have:
> 
>   struct machine_check_event mce_evt; /* Valid if trap == 0x200 */
> 
> And you said we're not going via 0x200 in L1. But so shouldn't we be
> teaching the KVM code not to use mce_evt when trap is not 0x200?

I'm not sure we want the MCE to skip the L1 either. It should match the 
L0 hypervisor behaviour as closely as reasonably possible.

We might have to teach the KVM pseries path to do something about the
0x200 before the common HV guest exit handler (which is where the L1
message comes from).

Thanks,
Nick


Re: [PATCH v2 08/13] asm-generic: Refactor dereference_[kernel]_function_descriptor()

2021-10-15 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of October 14, 2021 3:49 pm:
> dereference_function_descriptor() and
> dereference_kernel_function_descriptor() are identical on the
> three architectures implementing them.
> 
> Make them common and put them out-of-line in kernel/extable.c
> which is one of the users and has similar type of functions.

We should be moving more stuff out of extable.c (including all the
kernel address tests). lib/kimage.c or kelf.c or something. 

It could be after your series though.

> 
> Reviewed-by: Kees Cook 
> Reviewed-by: Arnd Bergmann 
> Signed-off-by: Christophe Leroy 
> ---
>  arch/ia64/include/asm/sections.h| 19 ---
>  arch/parisc/include/asm/sections.h  |  9 -
>  arch/parisc/kernel/process.c| 21 -
>  arch/powerpc/include/asm/sections.h | 23 ---
>  include/asm-generic/sections.h  |  2 ++
>  kernel/extable.c| 23 ++-
>  6 files changed, 24 insertions(+), 73 deletions(-)
> 
> diff --git a/arch/ia64/include/asm/sections.h 
> b/arch/ia64/include/asm/sections.h
> index 1aaed8882294..96c9bb500c34 100644
> --- a/arch/ia64/include/asm/sections.h
> +++ b/arch/ia64/include/asm/sections.h
> @@ -31,23 +31,4 @@ extern char __start_gate_brl_fsys_bubble_down_patchlist[], 
> __end_gate_brl_fsys_b
>  extern char __start_unwind[], __end_unwind[];
>  extern char __start_ivt_text[], __end_ivt_text[];
>  
> -#undef dereference_function_descriptor
> -static inline void *dereference_function_descriptor(void *ptr)
> -{
> - struct fdesc *desc = ptr;
> - void *p;
> -
> -	if (!get_kernel_nofault(p, (void *)&desc->addr))
> - ptr = p;
> - return ptr;
> -}
> -
> -#undef dereference_kernel_function_descriptor
> -static inline void *dereference_kernel_function_descriptor(void *ptr)
> -{
> - if (ptr < (void *)__start_opd || ptr >= (void *)__end_opd)
> - return ptr;
> - return dereference_function_descriptor(ptr);
> -}
> -
>  #endif /* _ASM_IA64_SECTIONS_H */
> diff --git a/arch/parisc/include/asm/sections.h 
> b/arch/parisc/include/asm/sections.h
> index 37b34b357cb5..6b1fe22baaf5 100644
> --- a/arch/parisc/include/asm/sections.h
> +++ b/arch/parisc/include/asm/sections.h
> @@ -13,13 +13,4 @@ typedef Elf64_Fdesc func_desc_t;
>  
>  extern char __alt_instructions[], __alt_instructions_end[];
>  
> -#ifdef CONFIG_64BIT
> -
> -#undef dereference_function_descriptor
> -void *dereference_function_descriptor(void *);
> -
> -#undef dereference_kernel_function_descriptor
> -void *dereference_kernel_function_descriptor(void *);
> -#endif
> -
>  #endif
> diff --git a/arch/parisc/kernel/process.c b/arch/parisc/kernel/process.c
> index 38ec4ae81239..7382576b52a8 100644
> --- a/arch/parisc/kernel/process.c
> +++ b/arch/parisc/kernel/process.c
> @@ -266,27 +266,6 @@ get_wchan(struct task_struct *p)
>   return 0;
>  }
>  
> -#ifdef CONFIG_64BIT
> -void *dereference_function_descriptor(void *ptr)
> -{
> - Elf64_Fdesc *desc = ptr;
> - void *p;
> -
> -	if (!get_kernel_nofault(p, (void *)&desc->addr))
> - ptr = p;
> - return ptr;
> -}
> -
> -void *dereference_kernel_function_descriptor(void *ptr)
> -{
> - if (ptr < (void *)__start_opd ||
> - ptr >= (void *)__end_opd)
> - return ptr;
> -
> - return dereference_function_descriptor(ptr);
> -}
> -#endif
> -
>  static inline unsigned long brk_rnd(void)
>  {
>   return (get_random_int() & BRK_RND_MASK) << PAGE_SHIFT;
> diff --git a/arch/powerpc/include/asm/sections.h 
> b/arch/powerpc/include/asm/sections.h
> index 1322d7b2f1a3..fbfe1957edbe 100644
> --- a/arch/powerpc/include/asm/sections.h
> +++ b/arch/powerpc/include/asm/sections.h
> @@ -72,29 +72,6 @@ static inline int overlaps_kernel_text(unsigned long 
> start, unsigned long end)
>   (unsigned long)_stext < end;
>  }
>  
> -#ifdef PPC64_ELF_ABI_v1
> -
> -#undef dereference_function_descriptor
> -static inline void *dereference_function_descriptor(void *ptr)
> -{
> - struct ppc64_opd_entry *desc = ptr;
> - void *p;
> -
> -	if (!get_kernel_nofault(p, (void *)&desc->addr))
> - ptr = p;
> - return ptr;
> -}
> -
> -#undef dereference_kernel_function_descriptor
> -static inline void *dereference_kernel_function_descriptor(void *ptr)
> -{
> - if (ptr < (void *)__start_opd || ptr >= (void *)__end_opd)
> - return ptr;
> -
> - return dereference_function_descriptor(ptr);
> -}
> -#endif /* PPC64_ELF_ABI_v1 */
> -
>  #endif
>  
>  #endif /* __KERNEL__ */
> diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h
> index cbec7d5f1678..76163883c6ff 100644
> --- a/include/asm-generic/sections.h
> +++ b/include/asm-generic/sections.h
> @@ -60,6 +60,8 @@ extern __visible const void __nosave_begin, __nosave_end;
>  
>  /* Function descriptor handling (if any).  Override in asm/sections.h */
>  #ifdef HAVE_FUNCTION_DESCRIPTORS
> +void 

Re: [PATCH v2 06/13] asm-generic: Use HAVE_FUNCTION_DESCRIPTORS to define associated stubs

2021-10-15 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of October 14, 2021 3:49 pm:
> Replace HAVE_DEREFERENCE_FUNCTION_DESCRIPTOR by
> HAVE_FUNCTION_DESCRIPTORS and use it instead of
> 'dereference_function_descriptor' macro to know
> whether an arch has function descriptors.
> 
> To limit churn in one of the following patches, use
> an #ifdef/#else construct with empty first part
> instead of an #ifndef in asm-generic/sections.h

Is it worth putting this into Kconfig if you're going to
change it? In any case

Reviewed-by: Nicholas Piggin 

> 
> Reviewed-by: Kees Cook 
> Signed-off-by: Christophe Leroy 
> ---
>  arch/ia64/include/asm/sections.h| 5 +++--
>  arch/parisc/include/asm/sections.h  | 6 --
>  arch/powerpc/include/asm/sections.h | 6 --
>  include/asm-generic/sections.h  | 3 ++-
>  include/linux/kallsyms.h| 2 +-
>  5 files changed, 14 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/ia64/include/asm/sections.h 
> b/arch/ia64/include/asm/sections.h
> index 35f24e52149a..6e55e545bf02 100644
> --- a/arch/ia64/include/asm/sections.h
> +++ b/arch/ia64/include/asm/sections.h
> @@ -9,6 +9,9 @@
>  
>  #include 
>  #include 
> +
> +#define HAVE_FUNCTION_DESCRIPTORS 1
> +
>  #include 
>  
>  extern char __phys_per_cpu_start[];
> @@ -27,8 +30,6 @@ extern char __start_gate_brl_fsys_bubble_down_patchlist[], 
> __end_gate_brl_fsys_b
>  extern char __start_unwind[], __end_unwind[];
>  extern char __start_ivt_text[], __end_ivt_text[];
>  
> -#define HAVE_DEREFERENCE_FUNCTION_DESCRIPTOR 1
> -
>  #undef dereference_function_descriptor
>  static inline void *dereference_function_descriptor(void *ptr)
>  {
> diff --git a/arch/parisc/include/asm/sections.h 
> b/arch/parisc/include/asm/sections.h
> index bb52aea0cb21..85149a89ff3e 100644
> --- a/arch/parisc/include/asm/sections.h
> +++ b/arch/parisc/include/asm/sections.h
> @@ -2,6 +2,10 @@
>  #ifndef _PARISC_SECTIONS_H
>  #define _PARISC_SECTIONS_H
>  
> +#ifdef CONFIG_64BIT
> +#define HAVE_FUNCTION_DESCRIPTORS 1
> +#endif
> +
>  /* nothing to see, move along */
>  #include 
>  
> @@ -9,8 +13,6 @@ extern char __alt_instructions[], __alt_instructions_end[];
>  
>  #ifdef CONFIG_64BIT
>  
> -#define HAVE_DEREFERENCE_FUNCTION_DESCRIPTOR 1
> -
>  #undef dereference_function_descriptor
>  void *dereference_function_descriptor(void *);
>  
> diff --git a/arch/powerpc/include/asm/sections.h 
> b/arch/powerpc/include/asm/sections.h
> index 32e7035863ac..bba97b8c38cf 100644
> --- a/arch/powerpc/include/asm/sections.h
> +++ b/arch/powerpc/include/asm/sections.h
> @@ -8,6 +8,10 @@
>  
>  #define arch_is_kernel_initmem_freed arch_is_kernel_initmem_freed
>  
> +#ifdef PPC64_ELF_ABI_v1
> +#define HAVE_FUNCTION_DESCRIPTORS 1
> +#endif
> +
>  #include 
>  
>  extern bool init_mem_is_free;
> @@ -69,8 +73,6 @@ static inline int overlaps_kernel_text(unsigned long start, 
> unsigned long end)
>  
>  #ifdef PPC64_ELF_ABI_v1
>  
> -#define HAVE_DEREFERENCE_FUNCTION_DESCRIPTOR 1
> -
>  #undef dereference_function_descriptor
>  static inline void *dereference_function_descriptor(void *ptr)
>  {
> diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h
> index d16302d3eb59..b677e926e6b3 100644
> --- a/include/asm-generic/sections.h
> +++ b/include/asm-generic/sections.h
> @@ -59,7 +59,8 @@ extern char __noinstr_text_start[], __noinstr_text_end[];
>  extern __visible const void __nosave_begin, __nosave_end;
>  
>  /* Function descriptor handling (if any).  Override in asm/sections.h */
> -#ifndef dereference_function_descriptor
> +#ifdef HAVE_FUNCTION_DESCRIPTORS
> +#else
>  #define dereference_function_descriptor(p) ((void *)(p))
>  #define dereference_kernel_function_descriptor(p) ((void *)(p))
>  #endif
> diff --git a/include/linux/kallsyms.h b/include/linux/kallsyms.h
> index a1d6fc82d7f0..9f277baeb559 100644
> --- a/include/linux/kallsyms.h
> +++ b/include/linux/kallsyms.h
> @@ -57,7 +57,7 @@ static inline int is_ksym_addr(unsigned long addr)
>  
>  static inline void *dereference_symbol_descriptor(void *ptr)
>  {
> -#ifdef HAVE_DEREFERENCE_FUNCTION_DESCRIPTOR
> +#ifdef HAVE_FUNCTION_DESCRIPTORS
>   struct module *mod;
>  
>   ptr = dereference_kernel_function_descriptor(ptr);
> -- 
> 2.31.1
> 
> 
> 


Re: [PATCH v2 03/13] powerpc: Remove func_descr_t

2021-10-15 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of October 15, 2021 3:19 pm:
> 
> 
> Le 15/10/2021 à 00:17, Daniel Axtens a écrit :
>> Christophe Leroy  writes:
>> 
>>> 'func_descr_t' is redundant with 'struct ppc64_opd_entry'
>> 
>> So, if I understand the overall direction of the series, you're
>> consolidating powerpc around one single type for function descriptors,
>> and then you're creating a generic typedef so that generic code can
>> always do ((func_desc_t)x)->addr to get the address of a function out of
>> a function descriptor regardless of arch. (And regardless of whether the
>> arch uses function descriptors or not.)
> 
> An architecture not using function descriptors won't do much with 
> ((func_desc_t *)x)->addr. This is just done to allow building stuff 
> regardless.
> 
> I prefer something like
> 
>   if (have_function_descriptors())
>   addr = ((func_desc_t *)ptr)->addr;
>   else
>   addr = ptr;

If you make a generic data type like this for architectures without
function descriptors

typedef struct func_desc {
char addr[0];
} func_desc_t;

Then you can do that with no if. The downside is your addr has to be 
char * and it's maybe not helpful to be so "clever".
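
To spell that out, generic code could then do something like the
following sketch with no conditional at all (the helper name here is
made up for illustration):

	static unsigned long func_addr_generic(void *ptr)
	{
		func_desc_t *desc = ptr;

		/*
		 * With the zero-length 'addr' member, desc->addr is
		 * (char *)desc, i.e. the original pointer, so this
		 * returns the address unchanged on architectures
		 * without function descriptors.
		 */
		return (unsigned long)desc->addr;
	}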

>>   - why pick ppc64_opd_entry over func_descr_t?
> 
> Good question. At the beginning it was because it was in UAPI headers,
> and also because it was the one used in our 
> dereference_function_descriptor().
> 
> But in the end maybe that's not the most logical choice. I need to look
> a bit more.

I would prefer the func_descr_t (with 'toc' and 'env') if you're going 
to change it.

Thanks,
Nick


Re: [PATCH v2 02/13] powerpc: Rename 'funcaddr' to 'addr' in 'struct ppc64_opd_entry'

2021-10-15 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of October 14, 2021 3:49 pm:
> There are three architectures with function descriptors; try to
> have common names for the address they contain in order to
> refactor some functions into generic functions later.
> 
> powerpc has 'funcaddr'
> ia64 has 'ip'
> parisc has 'addr'
> 
> Vote for 'addr' and update 'struct ppc64_opd_entry' accordingly.

It is the "address of the entry point of the function" according to 
powerpc ELF spec, so addr seems fine.

Reviewed-by: Nicholas Piggin 

> 
> Reviewed-by: Kees Cook 
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/include/asm/elf.h  | 2 +-
>  arch/powerpc/include/asm/sections.h | 2 +-
>  arch/powerpc/kernel/module_64.c | 6 +++---
>  3 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/elf.h b/arch/powerpc/include/asm/elf.h
> index a4406714c060..bb0f278f9ed4 100644
> --- a/arch/powerpc/include/asm/elf.h
> +++ b/arch/powerpc/include/asm/elf.h
> @@ -178,7 +178,7 @@ void relocate(unsigned long final_address);
>  
>  /* There's actually a third entry here, but it's unused */
>  struct ppc64_opd_entry {
> - unsigned long funcaddr;
> + unsigned long addr;
>   unsigned long r2;
>  };
>  
> diff --git a/arch/powerpc/include/asm/sections.h 
> b/arch/powerpc/include/asm/sections.h
> index 6e4af4492a14..32e7035863ac 100644
> --- a/arch/powerpc/include/asm/sections.h
> +++ b/arch/powerpc/include/asm/sections.h
> @@ -77,7 +77,7 @@ static inline void *dereference_function_descriptor(void 
> *ptr)
>   struct ppc64_opd_entry *desc = ptr;
>   void *p;
>  
> -	if (!get_kernel_nofault(p, (void *)&desc->funcaddr))
> +	if (!get_kernel_nofault(p, (void *)&desc->addr))
>   ptr = p;
>   return ptr;
>  }
> diff --git a/arch/powerpc/kernel/module_64.c b/arch/powerpc/kernel/module_64.c
> index 6baa676e7cb6..82908c9be627 100644
> --- a/arch/powerpc/kernel/module_64.c
> +++ b/arch/powerpc/kernel/module_64.c
> @@ -72,11 +72,11 @@ static func_desc_t func_desc(unsigned long addr)
>  }
>  static unsigned long func_addr(unsigned long addr)
>  {
> - return func_desc(addr).funcaddr;
> + return func_desc(addr).addr;
>  }
>  static unsigned long stub_func_addr(func_desc_t func)
>  {
> - return func.funcaddr;
> + return func.addr;
>  }
>  static unsigned int local_entry_offset(const Elf64_Sym *sym)
>  {
> @@ -187,7 +187,7 @@ static int relacmp(const void *_x, const void *_y)
>  static unsigned long get_stubs_size(const Elf64_Ehdr *hdr,
>   const Elf64_Shdr *sechdrs)
>  {
> - /* One extra reloc so it's always 0-funcaddr terminated */
> + /* One extra reloc so it's always 0-addr terminated */
>   unsigned long relocs = 1;
>   unsigned i;
>  
> -- 
> 2.31.1
> 
> 
> 


Re: [PATCH v2 01/13] powerpc: Move 'struct ppc64_opd_entry' back into asm/elf.h

2021-10-14 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of October 14, 2021 3:49 pm:
> 'struct ppc64_opd_entry' doesn't belong in uapi/asm/elf.h
> 
> It was initially in module_64.c and commit 2d291e902791 ("Fix compile
> failure with non modular builds") moved it into asm/elf.h
> 
> But it was added outside of the __KERNEL__ section by mistake,
> therefore commit c3617f72036c ("UAPI: (Scripted) Disintegrate
> arch/powerpc/include/asm") moved it to uapi/asm/elf.h
> 
> Move it back into asm/elf.h, this brings it back in line with
> IA64 and PARISC architectures.
> 
> Fixes: 2d291e902791 ("Fix compile failure with non modular builds")
> Reviewed-by: Kees Cook 
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/include/asm/elf.h  | 6 ++
>  arch/powerpc/include/uapi/asm/elf.h | 8 
>  2 files changed, 6 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/elf.h b/arch/powerpc/include/asm/elf.h
> index b8425e3cfd81..a4406714c060 100644
> --- a/arch/powerpc/include/asm/elf.h
> +++ b/arch/powerpc/include/asm/elf.h
> @@ -176,4 +176,10 @@ do { 
> \
>  /* Relocate the kernel image to @final_address */
>  void relocate(unsigned long final_address);
>  
> +/* There's actually a third entry here, but it's unused */
> +struct ppc64_opd_entry {
> + unsigned long funcaddr;
> + unsigned long r2;
> +};

Reviewed-by: Nicholas Piggin 

I wonder if we should add that third entry, just for completeness. And 
'r2' isn't a good name, it should probably be 'toc'. And should it be 
packed? At any rate that's not for your series; it's a cleanup I might 
think about for later.
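
Something along these lines is what I have in mind (just thinking out 
loud, not a patch; the field names are only suggestions, with 'env' 
borrowed from func_descr_t):

/* Complete ELFv1 function descriptor */
struct ppc64_opd_entry {
	unsigned long funcaddr;	/* entry point address */
	unsigned long toc;	/* TOC pointer, currently named 'r2' */
	unsigned long env;	/* environment pointer, unused by the kernel */
};

Three unsigned longs have no padding between them anyway, so __packed 
probably wouldn't change the layout.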

Thanks,
Nick


Re: [PATCH v2] KVM: PPC: Defer vtime accounting 'til after IRQ handling

2021-10-14 Thread Nicholas Piggin
Excerpts from Laurent Vivier's message of October 13, 2021 7:30 pm:
> On 13/10/2021 01:18, Michael Ellerman wrote:
>> Laurent Vivier  writes:
>>> Commit 112665286d08 moved guest_exit() into the interrupt-protected
>>> area to avoid a wrong-context warning (or worse), but then the tick
>>> counter cannot be updated and the guest time is accounted to system time.
>>>
>>> To fix the problem port to POWER the x86 fix
>>> 160457140187 ("Defer vtime accounting 'til after IRQ handling"):
>>>
>>> "Defer the call to account guest time until after servicing any IRQ(s)
>>>   that happened in the guest or immediately after VM-Exit.  Tick-based
>>>   accounting of vCPU time relies on PF_VCPU being set when the tick IRQ
>>>   handler runs, and IRQs are blocked throughout the main sequence of
>>>   vcpu_enter_guest(), including the call into vendor code to actually
>>>   enter and exit the guest."
>>>
>>> Fixes: 112665286d08 ("KVM: PPC: Book3S HV: Context tracking exit guest 
>>> context before enabling irqs")
>>> Cc: npig...@gmail.com
>>> Cc:  # 5.12
>>> Signed-off-by: Laurent Vivier 
>>> ---
>>>
>>> Notes:
>>>  v2: remove reference to commit 61bd0f66ff92
>>>  cc stable 5.12
>>>  add the same comment in the code as for x86
>>>
>>>   arch/powerpc/kvm/book3s_hv.c | 24 
>>>   1 file changed, 20 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
>>> index 2acb1c96cfaf..a694d1a8f6ce 100644
>>> --- a/arch/powerpc/kvm/book3s_hv.c
>>> +++ b/arch/powerpc/kvm/book3s_hv.c
>> ...
>>> @@ -4506,13 +4514,21 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, 
>>> u64 time_limit,
>>>   
>>> srcu_read_unlock(&kvm->srcu, srcu_idx);
>>>   
>>> +   context_tracking_guest_exit();
>>> +
>>> set_irq_happened(trap);
>>>   
>>> kvmppc_set_host_core(pcpu);
>>>   
>>> -   guest_exit_irqoff();
>>> -
>>> local_irq_enable();
>>> +   /*
>>> +* Wait until after servicing IRQs to account guest time so that any
>>> +* ticks that occurred while running the guest are properly accounted
>>> +* to the guest.  Waiting until IRQs are enabled degrades the accuracy
>>> +* of accounting via context tracking, but the loss of accuracy is
>>> +* acceptable for all known use cases.
>>> +*/
>>> +   vtime_account_guest_exit();
>> 
>> This pops a warning for me, running guest(s) on Power8:
>>   
>>[  270.745303][T16661] [ cut here ]
>>[  270.745374][T16661] WARNING: CPU: 72 PID: 16661 at 
>> arch/powerpc/kernel/time.c:311 vtime_account_kernel+0xe0/0xf0
> 
> Thank you, I missed that...
> 
> My patch is wrong, I have to add vtime_account_guest_exit() before the 
> local_irq_enable().

I thought so, because if we take an interrupt after exiting the guest it 
should be accounted to the kernel, not the guest.

> 
> arch/powerpc/kernel/time.c
> 
>   305 static unsigned long vtime_delta(struct cpu_accounting_data *acct,
>   306  unsigned long *stime_scaled,
>   307  unsigned long *steal_time)
>   308 {
>   309 unsigned long now, stime;
>   310
>   311 WARN_ON_ONCE(!irqs_disabled());
> ...
> 
> But I don't understand how ticks can be accounted now if irqs are still 
> disabled.
> 
> Not sure it is as simple as expected...

I don't know all the timer stuff too well. The 
!CONFIG_VIRT_CPU_ACCOUNTING case relies on PF_VCPU being set when 
the host timer interrupt runs irqtime_account_process_tick, so it
can accumulate that tick to the guest?

That probably makes sense then, but it seems like we need that in a
different place. Timer interrupts are not guaranteed to be the first
to occur when interrupts are enabled.

Maybe add a new tick_account_guest_exit() and move the PF_VCPU clearing
into it for tick-based accounting. Call it after local_irq_enable() and
call the vtime accounting before it. Would that work?
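
Roughly like this -- a completely untested sketch just to show the
ordering I mean; tick_account_guest_exit() is a made-up name rather than
an existing interface, and the PF_VCPU clearing would be moved out of
the vtime path into it:

/*
 * For tick based accounting (!CONFIG_VIRT_CPU_ACCOUNTING), keep PF_VCPU
 * set across local_irq_enable() so a timer interrupt taken there still
 * charges the tick to the guest, and only clear it afterwards.
 */
static inline void tick_account_guest_exit(void)
{
	if (!IS_ENABLED(CONFIG_VIRT_CPU_ACCOUNTING))
		current->flags &= ~PF_VCPU;
}

and the exit path in kvmhv_run_single_vcpu() would then look like:

	context_tracking_guest_exit();
	...
	vtime_account_guest_exit();	/* irqs still hard disabled */
	local_irq_enable();
	tick_account_guest_exit();	/* after pending irqs have run */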

Thanks,
Nick


[RFC PATCH] KVM: PPC: Book3S HV P9: Move H_CEDE logic mostly to one place

2021-10-04 Thread Nicholas Piggin
Move the vcpu->arch.ceded, hrtimer, and blocking handling to one place.
The only special case is the xive escalation rearm, which has to be done
before the xive context is pulled.

This means the P9 path does not run with ceded==1 or the hrtimer armed
except in the kvmhv_handle_cede function, and hopefully cede handling is
a bit more understandable.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/kvm_ppc.h|   4 +-
 arch/powerpc/kvm/book3s_hv.c  | 137 +-
 arch/powerpc/kvm/book3s_hv_p9_entry.c |   3 +-
 arch/powerpc/kvm/book3s_xive.c|  26 ++---
 4 files changed, 88 insertions(+), 82 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 70ffcb3c91bf..e2e6cee9dddf 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -674,7 +674,7 @@ extern int kvmppc_xive_set_irq(struct kvm *kvm, int 
irq_source_id, u32 irq,
   int level, bool line_status);
 extern void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu);
 extern void kvmppc_xive_pull_vcpu(struct kvm_vcpu *vcpu);
-extern void kvmppc_xive_rearm_escalation(struct kvm_vcpu *vcpu);
+extern bool kvmppc_xive_rearm_escalation(struct kvm_vcpu *vcpu);
 
 static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu)
 {
@@ -712,7 +712,7 @@ static inline int kvmppc_xive_set_irq(struct kvm *kvm, int 
irq_source_id, u32 ir
  int level, bool line_status) { return 
-ENODEV; }
 static inline void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu) { }
 static inline void kvmppc_xive_pull_vcpu(struct kvm_vcpu *vcpu) { }
-static inline void kvmppc_xive_rearm_escalation(struct kvm_vcpu *vcpu) { }
+static inline bool kvmppc_xive_rearm_escalation(struct kvm_vcpu *vcpu) { 
return false; }
 
 static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu)
{ return 0; }
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 36c54f483a02..230f10b67f98 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1019,6 +1019,8 @@ static long kvmppc_h_rpt_invalidate(struct kvm_vcpu *vcpu,
return H_SUCCESS;
 }
 
+static int kvmhv_handle_cede(struct kvm_vcpu *vcpu);
+
 int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
 {
struct kvm *kvm = vcpu->kvm;
@@ -1080,7 +1082,9 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
break;
 
case H_CEDE:
+   ret = kvmhv_handle_cede(vcpu);
break;
+
case H_PROD:
target = kvmppc_get_gpr(vcpu, 4);
tvcpu = kvmppc_find_vcpu(kvm, target);
@@ -1292,25 +1296,6 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
return RESUME_GUEST;
 }
 
-/*
- * Handle H_CEDE in the P9 path where we don't call the real-mode hcall
- * handlers in book3s_hv_rmhandlers.S.
- *
- * This has to be done early, not in kvmppc_pseries_do_hcall(), so
- * that the cede logic in kvmppc_run_single_vcpu() works properly.
- */
-static void kvmppc_cede(struct kvm_vcpu *vcpu)
-{
-   vcpu->arch.shregs.msr |= MSR_EE;
-   vcpu->arch.ceded = 1;
-   smp_mb();
-   if (vcpu->arch.prodded) {
-   vcpu->arch.prodded = 0;
-   smp_mb();
-   vcpu->arch.ceded = 0;
-   }
-}
-
 static int kvmppc_hcall_impl_hv(unsigned long cmd)
 {
switch (cmd) {
@@ -2971,7 +2956,7 @@ static int kvmppc_core_check_requests_hv(struct kvm_vcpu 
*vcpu)
return 1;
 }
 
-static void kvmppc_set_timer(struct kvm_vcpu *vcpu)
+static bool kvmppc_set_timer(struct kvm_vcpu *vcpu)
 {
unsigned long dec_nsec, now;
 
@@ -2980,11 +2965,12 @@ static void kvmppc_set_timer(struct kvm_vcpu *vcpu)
/* decrementer has already gone negative */
kvmppc_core_queue_dec(vcpu);
kvmppc_core_prepare_to_enter(vcpu);
-   return;
+   return false;
}
dec_nsec = tb_to_ns(kvmppc_dec_expires_host_tb(vcpu) - now);
hrtimer_start(&vcpu->arch.dec_timer, dec_nsec, HRTIMER_MODE_REL);
vcpu->arch.timer_running = 1;
+   return true;
 }
 
 extern int __kvmppc_vcore_entry(void);
@@ -4015,21 +4001,11 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
else if (*tb >= time_limit) /* nested time limit */
return BOOK3S_INTERRUPT_NESTED_HV_DECREMENTER;
 
-   vcpu->arch.ceded = 0;
-
vcpu_vpa_increment_dispatch(vcpu);
 
if (kvmhv_on_pseries()) {
trap = kvmhv_vcpu_entry_p9_nested(vcpu, time_limit, lpcr, tb);
 
-   /* H_CEDE has to be handled now, not later */
-   if (trap == BOOK3S_INTERRUPT_SYSCALL && !vcpu->arch.nested &&
-   kvmppc_get_gpr(vcpu, 3) == H_CEDE) {
-   kvmppc_cede(vcpu);
-   kvmppc_s

[PATCH v3 52/52] KVM: PPC: Book3S HV P9: Remove subcore HMI handling

2021-10-04 Thread Nicholas Piggin
On POWER9 and newer, rather than the complex HMI synchronisation and
subcore state, have each thread un-apply the guest TB offset before
calling into the early HMI handler.

This allows the subcore state to be avoided, including subcore enter
/ exit guest, which includes an expensive divide that shows up
slightly in profiles.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/kvm_ppc.h|  1 +
 arch/powerpc/kvm/book3s_hv.c  | 12 +++---
 arch/powerpc/kvm/book3s_hv_hmi.c  |  7 +++-
 arch/powerpc/kvm/book3s_hv_p9_entry.c |  2 +-
 arch/powerpc/kvm/book3s_hv_ras.c  | 54 +++
 5 files changed, 67 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 671fbd1a765e..70ffcb3c91bf 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -760,6 +760,7 @@ void kvmppc_realmode_machine_check(struct kvm_vcpu *vcpu);
 void kvmppc_subcore_enter_guest(void);
 void kvmppc_subcore_exit_guest(void);
 long kvmppc_realmode_hmi_handler(void);
+long kvmppc_p9_realmode_hmi_handler(struct kvm_vcpu *vcpu);
 long kvmppc_h_enter(struct kvm_vcpu *vcpu, unsigned long flags,
 long pte_index, unsigned long pteh, unsigned long ptel);
 long kvmppc_h_remove(struct kvm_vcpu *vcpu, unsigned long flags,
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 351018f617fb..449ac0a19ceb 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4014,8 +4014,6 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
 
vcpu->arch.ceded = 0;
 
-   kvmppc_subcore_enter_guest();
-
vcpu_vpa_increment_dispatch(vcpu);
 
if (kvmhv_on_pseries()) {
@@ -4068,8 +4066,6 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
 
vcpu_vpa_increment_dispatch(vcpu);
 
-   kvmppc_subcore_exit_guest();
-
return trap;
 }
 
@@ -6069,9 +6065,11 @@ static int kvmppc_book3s_init_hv(void)
if (r)
return r;
 
-   r = kvm_init_subcore_bitmap();
-   if (r)
-   return r;
+   if (!cpu_has_feature(CPU_FTR_ARCH_300)) {
+   r = kvm_init_subcore_bitmap();
+   if (r)
+   return r;
+   }
 
/*
 * We need a way of accessing the XICS interrupt controller,
diff --git a/arch/powerpc/kvm/book3s_hv_hmi.c b/arch/powerpc/kvm/book3s_hv_hmi.c
index 9af660476314..1ec50c69678b 100644
--- a/arch/powerpc/kvm/book3s_hv_hmi.c
+++ b/arch/powerpc/kvm/book3s_hv_hmi.c
@@ -20,10 +20,15 @@ void wait_for_subcore_guest_exit(void)
 
/*
 * NULL bitmap pointer indicates that KVM module hasn't
-* been loaded yet and hence no guests are running.
+* been loaded yet and hence no guests are running, or running
+* on POWER9 or newer CPU.
+*
 * If no KVM is in use, no need to co-ordinate among threads
 * as all of them will always be in host and no one is going
 * to modify TB other than the opal hmi handler.
+*
+* POWER9 and newer don't need this synchronisation.
+*
 * Hence, just return from here.
 */
if (!local_paca->sibling_subcore_state)
diff --git a/arch/powerpc/kvm/book3s_hv_p9_entry.c 
b/arch/powerpc/kvm/book3s_hv_p9_entry.c
index fbecbdc42c26..86a222f97e8e 100644
--- a/arch/powerpc/kvm/book3s_hv_p9_entry.c
+++ b/arch/powerpc/kvm/book3s_hv_p9_entry.c
@@ -938,7 +938,7 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
kvmppc_realmode_machine_check(vcpu);
 
} else if (unlikely(trap == BOOK3S_INTERRUPT_HMI)) {
-   kvmppc_realmode_hmi_handler();
+   kvmppc_p9_realmode_hmi_handler(vcpu);
 
} else if (trap == BOOK3S_INTERRUPT_H_EMUL_ASSIST) {
vcpu->arch.emul_inst = mfspr(SPRN_HEIR);
diff --git a/arch/powerpc/kvm/book3s_hv_ras.c b/arch/powerpc/kvm/book3s_hv_ras.c
index d4bca93b79f6..ccfd96965630 100644
--- a/arch/powerpc/kvm/book3s_hv_ras.c
+++ b/arch/powerpc/kvm/book3s_hv_ras.c
@@ -136,6 +136,60 @@ void kvmppc_realmode_machine_check(struct kvm_vcpu *vcpu)
vcpu->arch.mce_evt = mce_evt;
 }
 
+
+long kvmppc_p9_realmode_hmi_handler(struct kvm_vcpu *vcpu)
+{
+   struct kvmppc_vcore *vc = vcpu->arch.vcore;
+   long ret = 0;
+
+   /*
+* Unapply and clear the offset first. That way, if the TB was not
+* resynced then it will remain in host-offset, and if it was resynced
+* then it is brought into host-offset. Then the tb offset is
+* re-applied before continuing with the KVM exit.
+*
+* This way, we don't need to actually know whether not OPAL resynced
+* the timebase or do any of the complicated dance that the P7/8
+* path requires.
+*/
+   if (vc->tb_offset_applied) {
+   

[PATCH v3 51/52] KVM: PPC: Book3S HV P9: Stop using vc->dpdes

2021-10-04 Thread Nicholas Piggin
The P9 path uses vc->dpdes only for msgsndp / SMT emulation. This adds
an ordering requirement between vcpu->doorbell_request and vc->dpdes for
no real benefit. Use vcpu->doorbell_request directly.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv.c  | 18 ++
 arch/powerpc/kvm/book3s_hv_builtin.c  |  2 ++
 arch/powerpc/kvm/book3s_hv_p9_entry.c | 14 ++
 3 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 57bf49c90e73..351018f617fb 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -761,6 +761,8 @@ static bool kvmppc_doorbell_pending(struct kvm_vcpu *vcpu)
 
if (vcpu->arch.doorbell_request)
return true;
+   if (cpu_has_feature(CPU_FTR_ARCH_300))
+   return false;
/*
 * Ensure that the read of vcore->dpdes comes after the read
 * of vcpu->doorbell_request.  This barrier matches the
@@ -2185,8 +2187,10 @@ static int kvmppc_get_one_reg_hv(struct kvm_vcpu *vcpu, 
u64 id,
 * either vcore->dpdes or doorbell_request.
 * On POWER8, doorbell_request is 0.
 */
-   *val = get_reg_val(id, vcpu->arch.vcore->dpdes |
-  vcpu->arch.doorbell_request);
+   if (cpu_has_feature(CPU_FTR_ARCH_300))
+   *val = get_reg_val(id, vcpu->arch.doorbell_request);
+   else
+   *val = get_reg_val(id, vcpu->arch.vcore->dpdes);
break;
case KVM_REG_PPC_VTB:
*val = get_reg_val(id, vcpu->arch.vcore->vtb);
@@ -2423,7 +2427,10 @@ static int kvmppc_set_one_reg_hv(struct kvm_vcpu *vcpu, 
u64 id,
vcpu->arch.pspb = set_reg_val(id, *val);
break;
case KVM_REG_PPC_DPDES:
-   vcpu->arch.vcore->dpdes = set_reg_val(id, *val);
+   if (cpu_has_feature(CPU_FTR_ARCH_300))
+   vcpu->arch.doorbell_request = set_reg_val(id, *val) & 1;
+   else
+   vcpu->arch.vcore->dpdes = set_reg_val(id, *val);
break;
case KVM_REG_PPC_VTB:
vcpu->arch.vcore->vtb = set_reg_val(id, *val);
@@ -4472,11 +4479,6 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 
time_limit,
 
if (!nested) {
kvmppc_core_prepare_to_enter(vcpu);
-   if (vcpu->arch.doorbell_request) {
-   vc->dpdes = 1;
-   smp_wmb();
-   vcpu->arch.doorbell_request = 0;
-   }
if (test_bit(BOOK3S_IRQPRIO_EXTERNAL,
 &vcpu->arch.pending_exceptions))
lpcr |= LPCR_MER;
diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c 
b/arch/powerpc/kvm/book3s_hv_builtin.c
index fcf4760a3a0e..a4fc4b2d3806 100644
--- a/arch/powerpc/kvm/book3s_hv_builtin.c
+++ b/arch/powerpc/kvm/book3s_hv_builtin.c
@@ -649,6 +649,8 @@ void kvmppc_guest_entry_inject_int(struct kvm_vcpu *vcpu)
int ext;
unsigned long lpcr;
 
+   WARN_ON_ONCE(cpu_has_feature(CPU_FTR_ARCH_300));
+
/* Insert EXTERNAL bit into LPCR at the MER bit position */
ext = (vcpu->arch.pending_exceptions >> BOOK3S_IRQPRIO_EXTERNAL) & 1;
lpcr = mfspr(SPRN_LPCR);
diff --git a/arch/powerpc/kvm/book3s_hv_p9_entry.c 
b/arch/powerpc/kvm/book3s_hv_p9_entry.c
index 5a71532a3adf..fbecbdc42c26 100644
--- a/arch/powerpc/kvm/book3s_hv_p9_entry.c
+++ b/arch/powerpc/kvm/book3s_hv_p9_entry.c
@@ -705,6 +705,7 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
unsigned long host_pidr;
unsigned long host_dawr1;
unsigned long host_dawrx1;
+   unsigned long dpdes;
 
hdec = time_limit - *tb;
if (hdec < 0)
@@ -767,8 +768,10 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
 
if (vc->pcr)
mtspr(SPRN_PCR, vc->pcr | PCR_MASK);
-   if (vc->dpdes)
-   mtspr(SPRN_DPDES, vc->dpdes);
+   if (vcpu->arch.doorbell_request) {
+   vcpu->arch.doorbell_request = 0;
+   mtspr(SPRN_DPDES, 1);
+   }
 
if (dawr_enabled()) {
if (vcpu->arch.dawr0 != host_dawr0)
@@ -999,7 +1002,10 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
vcpu->arch.shregs.sprg2 = mfspr(SPRN_SPRG2);
vcpu->arch.shregs.sprg3 = mfspr(SPRN_SPRG3);
 
-   vc->dpdes = mfspr(SPRN_DPDES);
+   dpdes = mfspr(SPRN_DPDES);
+   if (dpdes)
+   vcpu->arch.doorbell_request = 1;
+
vc->vtb = mfspr(SPRN_VTB);
 
dec = mfspr(SPRN_DEC);
@@ -1061,7 +1067,7 @@ int kvmhv_vc

[PATCH v3 50/52] KVM: PPC: Book3S HV P9: Tidy kvmppc_create_dtl_entry

2021-10-04 Thread Nicholas Piggin
This goes further toward removing vcores from the P9 path. Also avoid the
memset in favour of explicitly initialising all fields.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv.c | 61 +---
 1 file changed, 35 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index d614f83c7b3f..57bf49c90e73 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -698,41 +698,30 @@ static u64 vcore_stolen_time(struct kvmppc_vcore *vc, u64 
now)
return p;
 }
 
-static void kvmppc_create_dtl_entry(struct kvm_vcpu *vcpu,
-   struct kvmppc_vcore *vc, u64 tb)
+static void __kvmppc_create_dtl_entry(struct kvm_vcpu *vcpu,
+   unsigned int pcpu, u64 now,
+   unsigned long stolen)
 {
struct dtl_entry *dt;
struct lppaca *vpa;
-   unsigned long stolen;
-   unsigned long core_stolen;
-   u64 now;
-   unsigned long flags;
 
dt = vcpu->arch.dtl_ptr;
vpa = vcpu->arch.vpa.pinned_addr;
-   now = tb;
-
-   if (cpu_has_feature(CPU_FTR_ARCH_300)) {
-   stolen = 0;
-   } else {
-   core_stolen = vcore_stolen_time(vc, now);
-   stolen = core_stolen - vcpu->arch.stolen_logged;
-   vcpu->arch.stolen_logged = core_stolen;
-   spin_lock_irqsave(&vcpu->arch.tbacct_lock, flags);
-   stolen += vcpu->arch.busy_stolen;
-   vcpu->arch.busy_stolen = 0;
-   spin_unlock_irqrestore(&vcpu->arch.tbacct_lock, flags);
-   }
 
if (!dt || !vpa)
return;
-   memset(dt, 0, sizeof(struct dtl_entry));
+
dt->dispatch_reason = 7;
-   dt->processor_id = cpu_to_be16(vc->pcpu + vcpu->arch.ptid);
-   dt->timebase = cpu_to_be64(now + vc->tb_offset);
+   dt->preempt_reason = 0;
+   dt->processor_id = cpu_to_be16(pcpu + vcpu->arch.ptid);
dt->enqueue_to_dispatch_time = cpu_to_be32(stolen);
+   dt->ready_to_enqueue_time = 0;
+   dt->waiting_to_ready_time = 0;
+   dt->timebase = cpu_to_be64(now);
+   dt->fault_addr = 0;
dt->srr0 = cpu_to_be64(kvmppc_get_pc(vcpu));
dt->srr1 = cpu_to_be64(vcpu->arch.shregs.msr);
+
++dt;
if (dt == vcpu->arch.dtl.pinned_end)
dt = vcpu->arch.dtl.pinned_addr;
@@ -743,6 +732,27 @@ static void kvmppc_create_dtl_entry(struct kvm_vcpu *vcpu,
vcpu->arch.dtl.dirty = true;
 }
 
+static void kvmppc_create_dtl_entry(struct kvm_vcpu *vcpu,
+   struct kvmppc_vcore *vc)
+{
+   unsigned long stolen;
+   unsigned long core_stolen;
+   u64 now;
+   unsigned long flags;
+
+   now = mftb();
+
+   core_stolen = vcore_stolen_time(vc, now);
+   stolen = core_stolen - vcpu->arch.stolen_logged;
+   vcpu->arch.stolen_logged = core_stolen;
+   spin_lock_irqsave(&vcpu->arch.tbacct_lock, flags);
+   stolen += vcpu->arch.busy_stolen;
+   vcpu->arch.busy_stolen = 0;
+   spin_unlock_irqrestore(&vcpu->arch.tbacct_lock, flags);
+
+   __kvmppc_create_dtl_entry(vcpu, vc->pcpu, now + vc->tb_offset, stolen);
+}
+
 /* See if there is a doorbell interrupt pending for a vcpu */
 static bool kvmppc_doorbell_pending(struct kvm_vcpu *vcpu)
 {
@@ -3750,7 +3760,7 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore 
*vc)
pvc->pcpu = pcpu + thr;
for_each_runnable_thread(i, vcpu, pvc) {
kvmppc_start_thread(vcpu, pvc);
-   kvmppc_create_dtl_entry(vcpu, pvc, mftb());
+   kvmppc_create_dtl_entry(vcpu, pvc);
trace_kvm_guest_enter(vcpu);
if (!vcpu->arch.ptid)
thr0_done = true;
@@ -4313,7 +4323,7 @@ static int kvmppc_run_vcpu(struct kvm_vcpu *vcpu)
if ((vc->vcore_state == VCORE_PIGGYBACK ||
 vc->vcore_state == VCORE_RUNNING) &&
   !VCORE_IS_EXITING(vc)) {
-   kvmppc_create_dtl_entry(vcpu, vc, mftb());
+   kvmppc_create_dtl_entry(vcpu, vc);
kvmppc_start_thread(vcpu, vc);
trace_kvm_guest_enter(vcpu);
} else if (vc->vcore_state == VCORE_SLEEPING) {
@@ -4490,8 +4500,7 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 
time_limit,
local_paca->kvm_hstate.ptid = 0;
local_paca->kvm_hstate.fake_suspend = 0;
 
-   vc->pcpu = pcpu; // for kvmppc_create_dtl_entry
-   kvmppc_create_dtl_entry(vcpu, vc, tb);
+   __kvmppc_create_dtl_entry(vcpu, pcpu, tb + vc->tb_offset, 0);
 
trace_kvm_guest_enter(vcpu);
 
-- 
2.23.0



[PATCH v3 49/52] KVM: PPC: Book3S HV P9: Remove most of the vcore logic

2021-10-04 Thread Nicholas Piggin
The P9 path always uses one vcpu per vcore, so none of the vcore, locks,
stolen time, blocking logic, shared waitq, etc., is required.

Remove most of it.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv.c | 147 ---
 1 file changed, 85 insertions(+), 62 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 6574e8a3731e..d614f83c7b3f 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -276,6 +276,8 @@ static void kvmppc_core_start_stolen(struct kvmppc_vcore 
*vc, u64 tb)
 {
unsigned long flags;
 
+   WARN_ON_ONCE(cpu_has_feature(CPU_FTR_ARCH_300));
+
spin_lock_irqsave(&vc->stoltb_lock, flags);
vc->preempt_tb = tb;
spin_unlock_irqrestore(&vc->stoltb_lock, flags);
@@ -285,6 +287,8 @@ static void kvmppc_core_end_stolen(struct kvmppc_vcore *vc, 
u64 tb)
 {
unsigned long flags;
 
+   WARN_ON_ONCE(cpu_has_feature(CPU_FTR_ARCH_300));
+
spin_lock_irqsave(&vc->stoltb_lock, flags);
if (vc->preempt_tb != TB_NIL) {
vc->stolen_tb += tb - vc->preempt_tb;
@@ -297,7 +301,12 @@ static void kvmppc_core_vcpu_load_hv(struct kvm_vcpu 
*vcpu, int cpu)
 {
struct kvmppc_vcore *vc = vcpu->arch.vcore;
unsigned long flags;
-   u64 now = mftb();
+   u64 now;
+
+   if (cpu_has_feature(CPU_FTR_ARCH_300))
+   return;
+
+   now = mftb();
 
/*
 * We can test vc->runner without taking the vcore lock,
@@ -321,7 +330,12 @@ static void kvmppc_core_vcpu_put_hv(struct kvm_vcpu *vcpu)
 {
struct kvmppc_vcore *vc = vcpu->arch.vcore;
unsigned long flags;
-   u64 now = mftb();
+   u64 now;
+
+   if (cpu_has_feature(CPU_FTR_ARCH_300))
+   return;
+
+   now = mftb();
 
if (vc->runner == vcpu && vc->vcore_state >= VCORE_SLEEPING)
kvmppc_core_start_stolen(vc, now);
@@ -673,6 +687,8 @@ static u64 vcore_stolen_time(struct kvmppc_vcore *vc, u64 
now)
u64 p;
unsigned long flags;
 
+   WARN_ON_ONCE(cpu_has_feature(CPU_FTR_ARCH_300));
+
spin_lock_irqsave(&vc->stoltb_lock, flags);
p = vc->stolen_tb;
if (vc->vcore_state != VCORE_INACTIVE &&
@@ -695,13 +711,19 @@ static void kvmppc_create_dtl_entry(struct kvm_vcpu *vcpu,
dt = vcpu->arch.dtl_ptr;
vpa = vcpu->arch.vpa.pinned_addr;
now = tb;
-   core_stolen = vcore_stolen_time(vc, now);
-   stolen = core_stolen - vcpu->arch.stolen_logged;
-   vcpu->arch.stolen_logged = core_stolen;
-   spin_lock_irqsave(&vcpu->arch.tbacct_lock, flags);
-   stolen += vcpu->arch.busy_stolen;
-   vcpu->arch.busy_stolen = 0;
-   spin_unlock_irqrestore(&vcpu->arch.tbacct_lock, flags);
+
+   if (cpu_has_feature(CPU_FTR_ARCH_300)) {
+   stolen = 0;
+   } else {
+   core_stolen = vcore_stolen_time(vc, now);
+   stolen = core_stolen - vcpu->arch.stolen_logged;
+   vcpu->arch.stolen_logged = core_stolen;
+   spin_lock_irqsave(&vcpu->arch.tbacct_lock, flags);
+   stolen += vcpu->arch.busy_stolen;
+   vcpu->arch.busy_stolen = 0;
+   spin_unlock_irqrestore(&vcpu->arch.tbacct_lock, flags);
+   }
+
if (!dt || !vpa)
return;
memset(dt, 0, sizeof(struct dtl_entry));
@@ -898,13 +920,14 @@ static int kvm_arch_vcpu_yield_to(struct kvm_vcpu *target)
 * mode handler is not called but no other threads are in the
 * source vcore.
 */
-
-   spin_lock(&vcore->lock);
-   if (target->arch.state == KVMPPC_VCPU_RUNNABLE &&
-   vcore->vcore_state != VCORE_INACTIVE &&
-   vcore->runner)
-   target = vcore->runner;
-   spin_unlock(&vcore->lock);
+   if (!cpu_has_feature(CPU_FTR_ARCH_300)) {
+   spin_lock(&vcore->lock);
+   if (target->arch.state == KVMPPC_VCPU_RUNNABLE &&
+   vcore->vcore_state != VCORE_INACTIVE &&
+   vcore->runner)
+   target = vcore->runner;
+   spin_unlock(&vcore->lock);
+   }
 
return kvm_vcpu_yield_to(target);
 }
@@ -3125,13 +3148,6 @@ static void kvmppc_start_thread(struct kvm_vcpu *vcpu, 
struct kvmppc_vcore *vc)
kvmppc_ipi_thread(cpu);
 }
 
-/* Old path does this in asm */
-static void kvmppc_stop_thread(struct kvm_vcpu *vcpu)
-{
-   vcpu->cpu = -1;
-   vcpu->arch.thread_cpu = -1;
-}
-
 static void kvmppc_wait_for_nap(int n_threads)
 {
int cpu = smp_processor_id();
@@ -3220,6 +3236,8 @@ static void kvmppc_vcore_preempt(struct kvmppc_vcore *vc)
 {
struct preempted_vcore_list *lp = this_cpu_ptr(&preempted_vcores);
 
+   WARN_ON_ONCE(cpu_has_feature(CPU_FTR_ARCH_30

[PATCH v3 48/52] KVM: PPC: Book3S HV P9: Avoid cpu_in_guest atomics on entry and exit

2021-10-04 Thread Nicholas Piggin
cpu_in_guest is set to determine if a CPU needs to be IPI'ed to exit
the guest and notice the need_tlb_flush bit.

This can be implemented as a global per-CPU pointer to the currently
running guest instead of per-guest cpumasks, saving 2 atomics per
entry/exit. P7/8 doesn't require cpu_in_guest, nor does a nested HV
(only the L0 does), so move it to the P9 HV path.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/kvm_book3s_64.h |  1 -
 arch/powerpc/include/asm/kvm_host.h  |  1 -
 arch/powerpc/kvm/book3s_hv.c | 38 +---
 3 files changed, 21 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index 96f0fda50a07..fe07558173ef 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -44,7 +44,6 @@ struct kvm_nested_guest {
struct mutex tlb_lock;  /* serialize page faults and tlbies */
struct kvm_nested_guest *next;
cpumask_t need_tlb_flush;
-   cpumask_t cpu_in_guest;
short prev_cpu[NR_CPUS];
u8 radix;   /* is this nested guest radix */
 };
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 92925f82a1e3..4de418f6c0a2 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -287,7 +287,6 @@ struct kvm_arch {
u32 online_vcores;
atomic_t hpte_mod_interest;
cpumask_t need_tlb_flush;
-   cpumask_t cpu_in_guest;
u8 radix;
u8 fwnmi_enabled;
u8 secure_guest;
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 6e072e2e130a..6574e8a3731e 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3009,30 +3009,33 @@ static void kvmppc_release_hwthread(int cpu)
tpaca->kvm_hstate.kvm_split_mode = NULL;
 }
 
+static DEFINE_PER_CPU(struct kvm *, cpu_in_guest);
+
 static void radix_flush_cpu(struct kvm *kvm, int cpu, struct kvm_vcpu *vcpu)
 {
struct kvm_nested_guest *nested = vcpu->arch.nested;
-   cpumask_t *cpu_in_guest;
int i;
 
cpu = cpu_first_tlb_thread_sibling(cpu);
-   if (nested) {
+   if (nested)
cpumask_set_cpu(cpu, &nested->need_tlb_flush);
-   cpu_in_guest = &nested->cpu_in_guest;
-   } else {
+   else
cpumask_set_cpu(cpu, &kvm->arch.need_tlb_flush);
-   cpu_in_guest = &kvm->arch.cpu_in_guest;
-   }
/*
-* Make sure setting of bit in need_tlb_flush precedes
-* testing of cpu_in_guest bits.  The matching barrier on
-* the other side is the first smp_mb() in kvmppc_run_core().
+* Make sure setting of bit in need_tlb_flush precedes testing of
+* cpu_in_guest. The matching barrier on the other side is hwsync
+* when switching to guest MMU mode, which happens between
+* cpu_in_guest being set to the guest kvm, and need_tlb_flush bit
+* being tested.
 */
smp_mb();
for (i = cpu; i <= cpu_last_tlb_thread_sibling(cpu);
-   i += cpu_tlb_thread_sibling_step())
-   if (cpumask_test_cpu(i, cpu_in_guest))
+   i += cpu_tlb_thread_sibling_step()) {
+   struct kvm *running = *per_cpu_ptr(&cpu_in_guest, i);
+
+   if (running == kvm)
smp_call_function_single(i, do_nothing, NULL, 1);
+   }
 }
 
 static void do_migrate_away_vcpu(void *arg)
@@ -3100,7 +3103,6 @@ static void kvmppc_start_thread(struct kvm_vcpu *vcpu, 
struct kvmppc_vcore *vc)
 {
int cpu;
struct paca_struct *tpaca;
-   struct kvm *kvm = vc->kvm;
 
cpu = vc->pcpu;
if (vcpu) {
@@ -3111,7 +3113,6 @@ static void kvmppc_start_thread(struct kvm_vcpu *vcpu, 
struct kvmppc_vcore *vc)
cpu += vcpu->arch.ptid;
vcpu->cpu = vc->pcpu;
vcpu->arch.thread_cpu = cpu;
-   cpumask_set_cpu(cpu, &kvm->arch.cpu_in_guest);
}
tpaca = paca_ptrs[cpu];
tpaca->kvm_hstate.kvm_vcpu = vcpu;
@@ -3829,7 +3830,6 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore 
*vc)
kvmppc_release_hwthread(pcpu + i);
if (sip && sip->napped[i])
kvmppc_ipi_thread(pcpu + i);
-   cpumask_clear_cpu(pcpu + i, &vc->kvm->arch.cpu_in_guest);
}
 
spin_unlock(&vc->lock);
@@ -3997,8 +3997,14 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
}
 
} else {
+   struct kvm *kvm = vcpu->kvm;
+
kvmppc_xive_push_vcpu(vcpu);
+
+   __this_cpu_write(cpu_in_guest, kvm);
trap = kvmhv_vcpu_entry_p9(vcpu, time_limit, lpcr, tb);
+   

[PATCH v3 47/52] KVM: PPC: Book3S HV P9: Add unlikely annotation for !mmu_ready

2021-10-04 Thread Nicholas Piggin
The mmu will almost always be ready.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 0bbef4587f41..6e072e2e130a 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4408,7 +4408,7 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 
time_limit,
vc->runner = vcpu;
 
/* See if the MMU is ready to go */
-   if (!kvm->arch.mmu_ready) {
+   if (unlikely(!kvm->arch.mmu_ready)) {
r = kvmhv_setup_mmu(vcpu);
if (r) {
run->exit_reason = KVM_EXIT_FAIL_ENTRY;
-- 
2.23.0



[PATCH v3 46/52] KVM: PPC: Book3S HV P9: Avoid changing MSR[RI] in entry and exit

2021-10-04 Thread Nicholas Piggin
kvm_hstate.in_guest provides the equivalent of MSR[RI]=0 protection,
and it covers the existing MSR[RI]=0 section in late entry and early
exit, so clearing and setting MSR[RI] in those cases does not
actually do anything useful.

Remove the RI manipulation and replace it with comments. Make the
in_guest memory accesses a bit closer to a proper critical section
pattern. This speeds up guest entry/exit performance.

This also removes the MSR[RI] warnings which aren't very interesting
and would cause crashes if they hit due to causing an interrupt in
non-recoverable code.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv_p9_entry.c | 50 ---
 1 file changed, 23 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_p9_entry.c 
b/arch/powerpc/kvm/book3s_hv_p9_entry.c
index 99ce5805ea28..5a71532a3adf 100644
--- a/arch/powerpc/kvm/book3s_hv_p9_entry.c
+++ b/arch/powerpc/kvm/book3s_hv_p9_entry.c
@@ -829,7 +829,15 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
 * But TM could be split out if this would be a significant benefit.
 */
 
-   local_paca->kvm_hstate.in_guest = KVM_GUEST_MODE_HV_P9;
+   /*
+* MSR[RI] does not need to be cleared (and is not, for radix guests
+* with no prefetch bug), because in_guest is set. If we take a SRESET
+* or MCE with in_guest set but still in HV mode, then
+* kvmppc_p9_bad_interrupt handles the interrupt, which effectively
+* clears MSR[RI] and doesn't return.
+*/
+   WRITE_ONCE(local_paca->kvm_hstate.in_guest, KVM_GUEST_MODE_HV_P9);
+   barrier(); /* Open in_guest critical section */
 
/*
 * Hash host, hash guest, or radix guest with prefetch bug, all have
@@ -841,14 +849,10 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
 
save_clear_host_mmu(kvm);
 
-   if (kvm_is_radix(kvm)) {
+   if (kvm_is_radix(kvm))
switch_mmu_to_guest_radix(kvm, vcpu, lpcr);
-   if (!cpu_has_feature(CPU_FTR_P9_RADIX_PREFETCH_BUG))
-   __mtmsrd(0, 1); /* clear RI */
-
-   } else {
+   else
switch_mmu_to_guest_hpt(kvm, vcpu, lpcr);
-   }
 
/* TLBIEL uses LPID=LPIDR, so run this after setting guest LPID */
kvmppc_check_need_tlb_flush(kvm, vc->pcpu, nested);
@@ -903,19 +907,16 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
vcpu->arch.regs.gpr[3] = local_paca->kvm_hstate.scratch2;
 
/*
-* Only set RI after reading machine check regs (DAR, DSISR, SRR0/1)
-* and hstate scratch (which we need to move into exsave to make
-* re-entrant vs SRESET/MCE)
+* After reading machine check regs (DAR, DSISR, SRR0/1) and hstate
+* scratch (which we need to move into exsave to make re-entrant vs
+* SRESET/MCE), register state is protected from reentrancy. However
+* timebase, MMU, among other state is still set to guest, so don't
+* enable MSR[RI] here. It gets enabled at the end, after in_guest
+* is cleared.
+*
+* It is possible an NMI could come in here, which is why it is
+* important to save the above state early so it can be debugged.
 */
-   if (ri_set) {
-   if (unlikely(!(mfmsr() & MSR_RI))) {
-   __mtmsrd(MSR_RI, 1);
-   WARN_ON_ONCE(1);
-   }
-   } else {
-   WARN_ON_ONCE(mfmsr() & MSR_RI);
-   __mtmsrd(MSR_RI, 1);
-   }
 
vcpu->arch.regs.gpr[9] = exsave[EX_R9/sizeof(u64)];
vcpu->arch.regs.gpr[10] = exsave[EX_R10/sizeof(u64)];
@@ -973,13 +974,6 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
 */
mtspr(SPRN_HSRR0, vcpu->arch.regs.nip);
mtspr(SPRN_HSRR1, vcpu->arch.shregs.msr);
-
-   /*
-* tm_return_to_guest re-loads SRR0/1, DAR,
-* DSISR after RI is cleared, in case they had
-* been clobbered by a MCE.
-*/
-   __mtmsrd(0, 1); /* clear RI */
goto tm_return_to_guest;
}
}
@@ -1079,7 +1073,9 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
 
restore_p9_host_os_sprs(vcpu, &host_os_sprs);
 
-   local_paca->kvm_hstate.in_guest = KVM_GUEST_MODE_NONE;
+   barrier(); /* Close in_guest critical section */
+   WRITE_ONCE(local_paca->kvm_hstate.in_guest, KVM_GUEST_MODE_NONE);
+   /* Interrupts are recoverable at this point */
 
/*
 * cp_ab

[PATCH v3 45/52] KVM: PPC: Book3S HV P9: Optimise hash guest SLB saving

2021-10-04 Thread Nicholas Piggin
slbmfee/slbmfev instructions are very expensive, more so than a regular
mfspr instruction, so minimising them significantly improves hash guest
exit performance. The slbmfev is only required if slbmfee found a valid
SLB entry.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv_p9_entry.c | 22 ++
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_p9_entry.c 
b/arch/powerpc/kvm/book3s_hv_p9_entry.c
index 646f487ebf97..99ce5805ea28 100644
--- a/arch/powerpc/kvm/book3s_hv_p9_entry.c
+++ b/arch/powerpc/kvm/book3s_hv_p9_entry.c
@@ -487,10 +487,22 @@ static void __accumulate_time(struct kvm_vcpu *vcpu, 
struct kvmhv_tb_accumulator
 #define accumulate_time(vcpu, next) do {} while (0)
 #endif
 
-static inline void mfslb(unsigned int idx, u64 *slbee, u64 *slbev)
+static inline u64 mfslbv(unsigned int idx)
 {
-   asm volatile("slbmfev  %0,%1" : "=r" (*slbev) : "r" (idx));
-   asm volatile("slbmfee  %0,%1" : "=r" (*slbee) : "r" (idx));
+   u64 slbev;
+
+   asm volatile("slbmfev  %0,%1" : "=r" (slbev) : "r" (idx));
+
+   return slbev;
+}
+
+static inline u64 mfslbe(unsigned int idx)
+{
+   u64 slbee;
+
+   asm volatile("slbmfee  %0,%1" : "=r" (slbee) : "r" (idx));
+
+   return slbee;
 }
 
 static inline void mtslb(u64 slbee, u64 slbev)
@@ -620,8 +632,10 @@ static void save_clear_guest_mmu(struct kvm *kvm, struct 
kvm_vcpu *vcpu)
 */
for (i = 0; i < vcpu->arch.slb_nr; i++) {
u64 slbee, slbev;
-   mfslb(i, &slbee, &slbev);
+
+   slbee = mfslbe(i);
if (slbee & SLB_ESID_V) {
+   slbev = mfslbv(i);
vcpu->arch.slb[nr].orige = slbee | i;
vcpu->arch.slb[nr].origv = slbev;
nr++;
-- 
2.23.0



[PATCH v3 44/52] KVM: PPC: Book3S HV P9: Improve mfmsr performance on entry

2021-10-04 Thread Nicholas Piggin
Rearrange the MSR saving on entry so it does not follow the mtmsrd to
disable interrupts, avoiding a possible RAW scoreboard stall.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/kvm_book3s_64.h |  2 +
 arch/powerpc/kvm/book3s_hv.c | 18 ++-
 arch/powerpc/kvm/book3s_hv_p9_entry.c| 66 +++-
 3 files changed, 47 insertions(+), 39 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index 0a319ed9c2fd..96f0fda50a07 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -154,6 +154,8 @@ static inline bool kvmhv_vcpu_is_radix(struct kvm_vcpu 
*vcpu)
return radix;
 }
 
+unsigned long kvmppc_msr_hard_disable_set_facilities(struct kvm_vcpu *vcpu, 
unsigned long msr);
+
 int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 time_limit, unsigned long 
lpcr, u64 *tb);
 
 #define KVM_DEFAULT_HPT_ORDER  24  /* 16MB HPT by default */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 342b4f125d03..0bbef4587f41 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3878,6 +3878,8 @@ static int kvmhv_vcpu_entry_p9_nested(struct kvm_vcpu 
*vcpu, u64 time_limit, uns
s64 dec;
int trap;
 
+   msr = mfmsr();
+
save_p9_host_os_sprs(&host_os_sprs);
 
/*
@@ -3888,24 +3890,10 @@ static int kvmhv_vcpu_entry_p9_nested(struct kvm_vcpu 
*vcpu, u64 time_limit, uns
 */
host_psscr = mfspr(SPRN_PSSCR_PR);
 
-   hard_irq_disable();
+   kvmppc_msr_hard_disable_set_facilities(vcpu, msr);
if (lazy_irq_pending())
return 0;
 
-   /* MSR bits may have been cleared by context switch */
-   msr = 0;
-   if (IS_ENABLED(CONFIG_PPC_FPU))
-   msr |= MSR_FP;
-   if (cpu_has_feature(CPU_FTR_ALTIVEC))
-   msr |= MSR_VEC;
-   if (cpu_has_feature(CPU_FTR_VSX))
-   msr |= MSR_VSX;
-   if ((cpu_has_feature(CPU_FTR_TM) ||
-   cpu_has_feature(CPU_FTR_P9_TM_HV_ASSIST)) &&
-   (vcpu->arch.hfscr & HFSCR_TM))
-   msr |= MSR_TM;
-   msr = msr_check_and_set(msr);
-
if (unlikely(load_vcpu_state(vcpu, &host_os_sprs)))
msr = mfmsr(); /* TM restore can update msr */
 
diff --git a/arch/powerpc/kvm/book3s_hv_p9_entry.c 
b/arch/powerpc/kvm/book3s_hv_p9_entry.c
index a45ba584c734..646f487ebf97 100644
--- a/arch/powerpc/kvm/book3s_hv_p9_entry.c
+++ b/arch/powerpc/kvm/book3s_hv_p9_entry.c
@@ -632,6 +632,44 @@ static void save_clear_guest_mmu(struct kvm *kvm, struct 
kvm_vcpu *vcpu)
}
 }
 
+unsigned long kvmppc_msr_hard_disable_set_facilities(struct kvm_vcpu *vcpu, 
unsigned long msr)
+{
+   unsigned long msr_needed = 0;
+
+   msr &= ~MSR_EE;
+
+   /* MSR bits may have been cleared by context switch so must recheck */
+   if (IS_ENABLED(CONFIG_PPC_FPU))
+   msr_needed |= MSR_FP;
+   if (cpu_has_feature(CPU_FTR_ALTIVEC))
+   msr_needed |= MSR_VEC;
+   if (cpu_has_feature(CPU_FTR_VSX))
+   msr_needed |= MSR_VSX;
+   if ((cpu_has_feature(CPU_FTR_TM) ||
+   cpu_has_feature(CPU_FTR_P9_TM_HV_ASSIST)) &&
+   (vcpu->arch.hfscr & HFSCR_TM))
+   msr_needed |= MSR_TM;
+
+   /*
+* This could be combined with MSR[RI] clearing, but that expands
+* the unrecoverable window. It would be better to cover unrecoverable
+* with KVM bad interrupt handling rather than use MSR[RI] at all.
+*
+* Much more difficult and less worthwhile to combine with IR/DR
+* disable.
+*/
+   if ((msr & msr_needed) != msr_needed) {
+   msr |= msr_needed;
+   __mtmsrd(msr, 0);
+   } else {
+   __hard_irq_disable();
+   }
+   local_paca->irq_happened |= PACA_IRQ_HARD_DIS;
+
+   return msr;
+}
+EXPORT_SYMBOL_GPL(kvmppc_msr_hard_disable_set_facilities);
+
 int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 time_limit, unsigned long 
lpcr, u64 *tb)
 {
struct p9_host_os_sprs host_os_sprs;
@@ -665,6 +703,9 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
 
vcpu->arch.ceded = 0;
 
+   /* Save MSR for restore, with EE clear. */
+   msr = mfmsr() & ~MSR_EE;
+
host_hfscr = mfspr(SPRN_HFSCR);
host_ciabr = mfspr(SPRN_CIABR);
host_psscr = mfspr(SPRN_PSSCR_PR);
@@ -686,35 +727,12 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
 
save_p9_host_os_sprs(&host_os_sprs);
 
-   /*
-* This could be combined with MSR[RI] clearing, but that expands
-* the unrecoverable window. It would be better to cover unrecoverable
-* with KVM bad interrupt handling rather than use MSR[RI] at all.
-*
- 

[PATCH v3 43/52] KVM: PPC: Book3S HV Nested: Avoid extra mftb() in nested entry

2021-10-04 Thread Nicholas Piggin
mftb() is expensive and one can be avoided on nested guest dispatch.

If the time checking code distinguishes between the L0 timer and the
nested HV timer, then both can be tested in the same place with the
same mftb() value.

This also nicely illustrates the relationship between the L0 and nested
HV timers.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/kvm_asm.h  |  1 +
 arch/powerpc/kvm/book3s_hv.c| 12 
 arch/powerpc/kvm/book3s_hv_nested.c |  5 -
 3 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_asm.h 
b/arch/powerpc/include/asm/kvm_asm.h
index fbbf3cec92e9..d68d71987d5c 100644
--- a/arch/powerpc/include/asm/kvm_asm.h
+++ b/arch/powerpc/include/asm/kvm_asm.h
@@ -79,6 +79,7 @@
 #define BOOK3S_INTERRUPT_FP_UNAVAIL0x800
 #define BOOK3S_INTERRUPT_DECREMENTER   0x900
 #define BOOK3S_INTERRUPT_HV_DECREMENTER0x980
+#define BOOK3S_INTERRUPT_NESTED_HV_DECREMENTER 0x1980
 #define BOOK3S_INTERRUPT_DOORBELL  0xa00
 #define BOOK3S_INTERRUPT_SYSCALL   0xc00
 #define BOOK3S_INTERRUPT_TRACE 0xd00
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index d2a9c930f33e..342b4f125d03 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1486,6 +1486,10 @@ static int kvmppc_handle_exit_hv(struct kvm_vcpu *vcpu,
run->ready_for_interrupt_injection = 1;
switch (vcpu->arch.trap) {
/* We're good on these - the host merely wanted to get our attention */
+   case BOOK3S_INTERRUPT_NESTED_HV_DECREMENTER:
+   WARN_ON_ONCE(1); /* Should never happen */
+   vcpu->arch.trap = BOOK3S_INTERRUPT_HV_DECREMENTER;
+   fallthrough;
case BOOK3S_INTERRUPT_HV_DECREMENTER:
vcpu->stat.dec_exits++;
r = RESUME_GUEST;
@@ -1814,6 +1818,12 @@ static int kvmppc_handle_nested_exit(struct kvm_vcpu 
*vcpu)
vcpu->stat.ext_intr_exits++;
r = RESUME_GUEST;
break;
+   /* These need to go to the nested HV */
+   case BOOK3S_INTERRUPT_NESTED_HV_DECREMENTER:
+   vcpu->arch.trap = BOOK3S_INTERRUPT_HV_DECREMENTER;
+   vcpu->stat.dec_exits++;
+   r = RESUME_HOST;
+   break;
/* SR/HMI/PMI are HV interrupts that host has handled. Resume guest.*/
case BOOK3S_INTERRUPT_HMI:
case BOOK3S_INTERRUPT_PERFMON:
@@ -3975,6 +3985,8 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
return BOOK3S_INTERRUPT_HV_DECREMENTER;
if (next_timer < time_limit)
time_limit = next_timer;
+   else if (*tb >= time_limit) /* nested time limit */
+   return BOOK3S_INTERRUPT_NESTED_HV_DECREMENTER;
 
vcpu->arch.ceded = 0;
 
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c 
b/arch/powerpc/kvm/book3s_hv_nested.c
index 7bed0b91245e..e57c08b968c0 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -375,11 +375,6 @@ long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu)
vcpu->arch.ret = RESUME_GUEST;
vcpu->arch.trap = 0;
do {
-   if (mftb() >= hdec_exp) {
-   vcpu->arch.trap = BOOK3S_INTERRUPT_HV_DECREMENTER;
-   r = RESUME_HOST;
-   break;
-   }
r = kvmhv_run_single_vcpu(vcpu, hdec_exp, lpcr);
} while (is_kvmppc_resume_guest(r));
 
-- 
2.23.0



[PATCH v3 42/52] KVM: PPC: Book3S HV P9: Avoid tlbsync sequence on radix guest exit

2021-10-04 Thread Nicholas Piggin
Use the existing TLB flushing logic to IPI the previous CPU and run the
necessary barriers before running a guest vCPU on a new physical CPU,
to do the necessary radix GTSE barriers for handling the case of an
interrupted guest tlbie sequence.

This results in more IPIs than the TLB flush logic requires, but it's
a significant win for common case scheduling when the vCPU remains on
the same physical CPU.

This saves about 520 cycles (nearly 10%) on a guest entry+exit micro
benchmark on a POWER9.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv.c  | 31 +++
 arch/powerpc/kvm/book3s_hv_p9_entry.c |  9 
 2 files changed, 27 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index f10cb4167549..d2a9c930f33e 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3025,6 +3025,25 @@ static void radix_flush_cpu(struct kvm *kvm, int cpu, 
struct kvm_vcpu *vcpu)
smp_call_function_single(i, do_nothing, NULL, 1);
 }
 
+static void do_migrate_away_vcpu(void *arg)
+{
+   struct kvm_vcpu *vcpu = arg;
+   struct kvm *kvm = vcpu->kvm;
+
+   /*
+* If the guest has GTSE, it may execute tlbie, so do a eieio; tlbsync;
+* ptesync sequence on the old CPU before migrating to a new one, in
+* case we interrupted the guest between a tlbie ; eieio ;
+* tlbsync; ptesync sequence.
+*
+* Otherwise, ptesync is sufficient.
+*/
+   if (kvm->arch.lpcr & LPCR_GTSE)
+   asm volatile("eieio; tlbsync; ptesync");
+   else
+   asm volatile("ptesync");
+}
+
 static void kvmppc_prepare_radix_vcpu(struct kvm_vcpu *vcpu, int pcpu)
 {
struct kvm_nested_guest *nested = vcpu->arch.nested;
@@ -3052,10 +3071,14 @@ static void kvmppc_prepare_radix_vcpu(struct kvm_vcpu 
*vcpu, int pcpu)
 * so we use a single bit in .need_tlb_flush for all 4 threads.
 */
if (prev_cpu != pcpu) {
-   if (prev_cpu >= 0 &&
-   cpu_first_tlb_thread_sibling(prev_cpu) !=
-   cpu_first_tlb_thread_sibling(pcpu))
-   radix_flush_cpu(kvm, prev_cpu, vcpu);
+   if (prev_cpu >= 0) {
+   if (cpu_first_tlb_thread_sibling(prev_cpu) !=
+   cpu_first_tlb_thread_sibling(pcpu))
+   radix_flush_cpu(kvm, prev_cpu, vcpu);
+
+   smp_call_function_single(prev_cpu,
+   do_migrate_away_vcpu, vcpu, 1);
+   }
if (nested)
nested->prev_cpu[vcpu->arch.nested_vcpu_id] = pcpu;
else
diff --git a/arch/powerpc/kvm/book3s_hv_p9_entry.c 
b/arch/powerpc/kvm/book3s_hv_p9_entry.c
index eae9d806d704..a45ba584c734 100644
--- a/arch/powerpc/kvm/book3s_hv_p9_entry.c
+++ b/arch/powerpc/kvm/book3s_hv_p9_entry.c
@@ -1049,15 +1049,6 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
 
local_paca->kvm_hstate.in_guest = KVM_GUEST_MODE_NONE;
 
-   if (kvm_is_radix(kvm)) {
-   /*
-* Since this is radix, do a eieio; tlbsync; ptesync sequence
-* in case we interrupted the guest between a tlbie and a
-* ptesync.
-*/
-   asm volatile("eieio; tlbsync; ptesync");
-   }
-
/*
 * cp_abort is required if the processor supports local copy-paste
 * to clear the copy buffer that was under control of the guest.
-- 
2.23.0



[PATCH v3 41/52] KVM: PPC: Book3S HV P9: Don't restore PSSCR if not needed

2021-10-04 Thread Nicholas Piggin
This also moves the PSSCR update in nested entry to avoid a SPR
scoreboard stall.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv.c  |  7 +--
 arch/powerpc/kvm/book3s_hv_p9_entry.c | 26 +++---
 2 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index f4445cc5a29a..f10cb4167549 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3876,7 +3876,9 @@ static int kvmhv_vcpu_entry_p9_nested(struct kvm_vcpu 
*vcpu, u64 time_limit, uns
if (unlikely(load_vcpu_state(vcpu, &host_os_sprs)))
msr = mfmsr(); /* TM restore can update msr */
 
-   mtspr(SPRN_PSSCR_PR, vcpu->arch.psscr);
+   if (vcpu->arch.psscr != host_psscr)
+   mtspr(SPRN_PSSCR_PR, vcpu->arch.psscr);
+
kvmhv_save_hv_regs(vcpu, &hvregs);
hvregs.lpcr = lpcr;
vcpu->arch.regs.msr = vcpu->arch.shregs.msr;
@@ -3917,7 +3919,6 @@ static int kvmhv_vcpu_entry_p9_nested(struct kvm_vcpu 
*vcpu, u64 time_limit, uns
vcpu->arch.shregs.dar = mfspr(SPRN_DAR);
vcpu->arch.shregs.dsisr = mfspr(SPRN_DSISR);
vcpu->arch.psscr = mfspr(SPRN_PSSCR_PR);
-   mtspr(SPRN_PSSCR_PR, host_psscr);
 
store_vcpu_state(vcpu);
 
@@ -3930,6 +3931,8 @@ static int kvmhv_vcpu_entry_p9_nested(struct kvm_vcpu 
*vcpu, u64 time_limit, uns
timer_rearm_host_dec(*tb);
 
restore_p9_host_os_sprs(vcpu, &host_os_sprs);
+   if (vcpu->arch.psscr != host_psscr)
+   mtspr(SPRN_PSSCR_PR, host_psscr);
 
return trap;
 }
diff --git a/arch/powerpc/kvm/book3s_hv_p9_entry.c 
b/arch/powerpc/kvm/book3s_hv_p9_entry.c
index 0f341011816c..eae9d806d704 100644
--- a/arch/powerpc/kvm/book3s_hv_p9_entry.c
+++ b/arch/powerpc/kvm/book3s_hv_p9_entry.c
@@ -649,6 +649,7 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
unsigned long host_dawr0;
unsigned long host_dawrx0;
unsigned long host_psscr;
+   unsigned long host_hpsscr;
unsigned long host_pidr;
unsigned long host_dawr1;
unsigned long host_dawrx1;
@@ -666,7 +667,9 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
 
host_hfscr = mfspr(SPRN_HFSCR);
host_ciabr = mfspr(SPRN_CIABR);
-   host_psscr = mfspr(SPRN_PSSCR);
+   host_psscr = mfspr(SPRN_PSSCR_PR);
+   if (cpu_has_feature(CPU_FTR_P9_TM_HV_ASSIST))
+   host_hpsscr = mfspr(SPRN_PSSCR);
host_pidr = mfspr(SPRN_PID);
 
if (dawr_enabled()) {
@@ -750,8 +753,14 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
if (vcpu->arch.ciabr != host_ciabr)
mtspr(SPRN_CIABR, vcpu->arch.ciabr);
 
-   mtspr(SPRN_PSSCR, vcpu->arch.psscr | PSSCR_EC |
- (local_paca->kvm_hstate.fake_suspend << PSSCR_FAKE_SUSPEND_LG));
+
+   if (cpu_has_feature(CPU_FTR_P9_TM_HV_ASSIST)) {
+   mtspr(SPRN_PSSCR, vcpu->arch.psscr | PSSCR_EC |
+ (local_paca->kvm_hstate.fake_suspend << 
PSSCR_FAKE_SUSPEND_LG));
+   } else {
+   if (vcpu->arch.psscr != host_psscr)
+   mtspr(SPRN_PSSCR_PR, vcpu->arch.psscr);
+   }
 
mtspr(SPRN_HFSCR, vcpu->arch.hfscr);
 
@@ -957,7 +966,7 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
 
vcpu->arch.ic = mfspr(SPRN_IC);
vcpu->arch.pid = mfspr(SPRN_PID);
-   vcpu->arch.psscr = mfspr(SPRN_PSSCR) & PSSCR_GUEST_VIS;
+   vcpu->arch.psscr = mfspr(SPRN_PSSCR_PR);
 
vcpu->arch.shregs.sprg0 = mfspr(SPRN_SPRG0);
vcpu->arch.shregs.sprg1 = mfspr(SPRN_SPRG1);
@@ -1003,9 +1012,12 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
mtspr(SPRN_PURR, local_paca->kvm_hstate.host_purr);
mtspr(SPRN_SPURR, local_paca->kvm_hstate.host_spurr);
 
-   /* Preserve PSSCR[FAKE_SUSPEND] until we've called kvmppc_save_tm_hv */
-   mtspr(SPRN_PSSCR, host_psscr |
- (local_paca->kvm_hstate.fake_suspend << PSSCR_FAKE_SUSPEND_LG));
+   if (cpu_has_feature(CPU_FTR_P9_TM_HV_ASSIST)) {
+   /* Preserve PSSCR[FAKE_SUSPEND] until we've called 
kvmppc_save_tm_hv */
+   mtspr(SPRN_PSSCR, host_hpsscr |
+ (local_paca->kvm_hstate.fake_suspend << 
PSSCR_FAKE_SUSPEND_LG));
+   }
+
mtspr(SPRN_HFSCR, host_hfscr);
if (vcpu->arch.ciabr != host_ciabr)
mtspr(SPRN_CIABR, host_ciabr);
-- 
2.23.0



[PATCH v3 40/52] KVM: PPC: Book3S HV P9: Test dawr_enabled() before saving host DAWR SPRs

2021-10-04 Thread Nicholas Piggin
Some of the DAWR SPR access is already predicated on dawr_enabled(),
apply this to the remainder of the accesses.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv_p9_entry.c | 34 ---
 1 file changed, 20 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_p9_entry.c 
b/arch/powerpc/kvm/book3s_hv_p9_entry.c
index 323b692bbfe2..0f341011816c 100644
--- a/arch/powerpc/kvm/book3s_hv_p9_entry.c
+++ b/arch/powerpc/kvm/book3s_hv_p9_entry.c
@@ -666,13 +666,16 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
 
host_hfscr = mfspr(SPRN_HFSCR);
host_ciabr = mfspr(SPRN_CIABR);
-   host_dawr0 = mfspr(SPRN_DAWR0);
-   host_dawrx0 = mfspr(SPRN_DAWRX0);
host_psscr = mfspr(SPRN_PSSCR);
host_pidr = mfspr(SPRN_PID);
-   if (cpu_has_feature(CPU_FTR_DAWR1)) {
-   host_dawr1 = mfspr(SPRN_DAWR1);
-   host_dawrx1 = mfspr(SPRN_DAWRX1);
+
+   if (dawr_enabled()) {
+   host_dawr0 = mfspr(SPRN_DAWR0);
+   host_dawrx0 = mfspr(SPRN_DAWRX0);
+   if (cpu_has_feature(CPU_FTR_DAWR1)) {
+   host_dawr1 = mfspr(SPRN_DAWR1);
+   host_dawrx1 = mfspr(SPRN_DAWRX1);
+   }
}
 
local_paca->kvm_hstate.host_purr = mfspr(SPRN_PURR);
@@ -1006,15 +1009,18 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
mtspr(SPRN_HFSCR, host_hfscr);
if (vcpu->arch.ciabr != host_ciabr)
mtspr(SPRN_CIABR, host_ciabr);
-   if (vcpu->arch.dawr0 != host_dawr0)
-   mtspr(SPRN_DAWR0, host_dawr0);
-   if (vcpu->arch.dawrx0 != host_dawrx0)
-   mtspr(SPRN_DAWRX0, host_dawrx0);
-   if (cpu_has_feature(CPU_FTR_DAWR1)) {
-   if (vcpu->arch.dawr1 != host_dawr1)
-   mtspr(SPRN_DAWR1, host_dawr1);
-   if (vcpu->arch.dawrx1 != host_dawrx1)
-   mtspr(SPRN_DAWRX1, host_dawrx1);
+
+   if (dawr_enabled()) {
+   if (vcpu->arch.dawr0 != host_dawr0)
+   mtspr(SPRN_DAWR0, host_dawr0);
+   if (vcpu->arch.dawrx0 != host_dawrx0)
+   mtspr(SPRN_DAWRX0, host_dawrx0);
+   if (cpu_has_feature(CPU_FTR_DAWR1)) {
+   if (vcpu->arch.dawr1 != host_dawr1)
+   mtspr(SPRN_DAWR1, host_dawr1);
+   if (vcpu->arch.dawrx1 != host_dawrx1)
+   mtspr(SPRN_DAWRX1, host_dawrx1);
+   }
}
 
if (vc->dpdes)
-- 
2.23.0



[PATCH v3 39/52] KVM: PPC: Book3S HV P9: Comment and fix MMU context switching code

2021-10-04 Thread Nicholas Piggin
Tighten up partition switching code synchronisation and comments.

In particular, hwsync ; isync is required after the last access that is
performed in the context of a partition, before the partition is
switched away from.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_64_entry.S | 11 +--
 arch/powerpc/kvm/book3s_64_mmu_radix.c |  4 +++
 arch/powerpc/kvm/book3s_hv_p9_entry.c  | 40 +++---
 3 files changed, 42 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_entry.S 
b/arch/powerpc/kvm/book3s_64_entry.S
index 983b8c18bc31..05e003eb5d90 100644
--- a/arch/powerpc/kvm/book3s_64_entry.S
+++ b/arch/powerpc/kvm/book3s_64_entry.S
@@ -374,11 +374,16 @@ END_MMU_FTR_SECTION_IFCLR(MMU_FTR_TYPE_RADIX)
 BEGIN_FTR_SECTION
mtspr   SPRN_DAWRX1,r10
 END_FTR_SECTION_IFSET(CPU_FTR_DAWR1)
-   mtspr   SPRN_PID,r10
 
/*
-* Switch to host MMU mode
+* Switch to host MMU mode (don't have the real host PID but we aren't
+* going back to userspace).
 */
+   hwsync
+   isync
+
+   mtspr   SPRN_PID,r10
+
ld  r10, HSTATE_KVM_VCPU(r13)
ld  r10, VCPU_KVM(r10)
lwz r10, KVM_HOST_LPID(r10)
@@ -389,6 +394,8 @@ END_FTR_SECTION_IFSET(CPU_FTR_DAWR1)
ld  r10, KVM_HOST_LPCR(r10)
mtspr   SPRN_LPCR,r10
 
+   isync
+
/*
 * Set GUEST_MODE_NONE so the handler won't branch to KVM, and clear
 * MSR_RI in r12 ([H]SRR1) so the handler won't try to return.
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 16359525a40f..8cebe5542256 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -57,6 +57,8 @@ unsigned long __kvmhv_copy_tofrom_guest_radix(int lpid, int 
pid,
 
preempt_disable();
 
+   asm volatile("hwsync" ::: "memory");
+   isync();
/* switch the lpid first to avoid running host with unallocated pid */
old_lpid = mfspr(SPRN_LPID);
if (old_lpid != lpid)
@@ -75,6 +77,8 @@ unsigned long __kvmhv_copy_tofrom_guest_radix(int lpid, int 
pid,
ret = __copy_to_user_inatomic((void __user *)to, from, n);
pagefault_enable();
 
+   asm volatile("hwsync" ::: "memory");
+   isync();
/* switch the pid first to avoid running host with unallocated pid */
if (quadrant == 1 && pid != old_pid)
mtspr(SPRN_PID, old_pid);
diff --git a/arch/powerpc/kvm/book3s_hv_p9_entry.c 
b/arch/powerpc/kvm/book3s_hv_p9_entry.c
index 093ac0453d91..323b692bbfe2 100644
--- a/arch/powerpc/kvm/book3s_hv_p9_entry.c
+++ b/arch/powerpc/kvm/book3s_hv_p9_entry.c
@@ -531,17 +531,19 @@ static void switch_mmu_to_guest_radix(struct kvm *kvm, 
struct kvm_vcpu *vcpu, u6
lpid = nested ? nested->shadow_lpid : kvm->arch.lpid;
 
/*
-* All the isync()s are overkill but trivially follow the ISA
-* requirements. Some can likely be replaced with justification
-* comment for why they are not needed.
+* Prior memory accesses to host PID Q3 must be completed before we
+* start switching, and stores must be drained to avoid not-my-LPAR
+* logic (see switch_mmu_to_host).
 */
+   asm volatile("hwsync" ::: "memory");
isync();
mtspr(SPRN_LPID, lpid);
-   isync();
mtspr(SPRN_LPCR, lpcr);
-   isync();
mtspr(SPRN_PID, vcpu->arch.pid);
-   isync();
+   /*
+* isync not required here because we are HRFID'ing to guest before
+* any guest context access, which is context synchronising.
+*/
 }
 
 static void switch_mmu_to_guest_hpt(struct kvm *kvm, struct kvm_vcpu *vcpu, 
u64 lpcr)
@@ -551,25 +553,41 @@ static void switch_mmu_to_guest_hpt(struct kvm *kvm, 
struct kvm_vcpu *vcpu, u64
 
lpid = kvm->arch.lpid;
 
+   /*
+* See switch_mmu_to_guest_radix. ptesync should not be required here
+* even if the host is in HPT mode because speculative accesses would
+* not cause RC updates (we are in real mode).
+*/
+   asm volatile("hwsync" ::: "memory");
+   isync();
mtspr(SPRN_LPID, lpid);
mtspr(SPRN_LPCR, lpcr);
mtspr(SPRN_PID, vcpu->arch.pid);
 
for (i = 0; i < vcpu->arch.slb_max; i++)
mtslb(vcpu->arch.slb[i].orige, vcpu->arch.slb[i].origv);
-
-   isync();
+   /*
+* isync not required here, see switch_mmu_to_guest_radix.
+*/
 }
 
 static void switch_mmu_to_host(struct kvm *kvm, u32 pid)
 {
+   /*
+* The guest has exited, so guest MMU context is no longer being
+* non-speculatively accessed, but a hwsync is needed before the
+* mtLPIDR / mtPIDR switch, in order to ensure all stores are drained,
+* so the not

[PATCH v3 38/52] KVM: PPC: Book3S HV P9: Use Linux SPR save/restore to manage some host SPRs

2021-10-04 Thread Nicholas Piggin
Linux implements SPR save/restore including storage space for registers
in the task struct for process context switching. Make use of this
similarly to the way we make use of the context switching fp/vec save
restore.

This improves code reuse, allows some stack space to be saved, and helps
with avoiding VRSAVE updates if they are not required.
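
The save side is just save_sprs(&current->thread); the payoff is that
the guest SPR load path can then compare against the task-struct copy
and skip redundant writes. A rough sketch of that pattern (field names
taken from thread_struct; illustrative, not the literal hunk):

	if (current->thread.ebbhr != vcpu->arch.ebbhr)
		mtspr(SPRN_EBBHR, vcpu->arch.ebbhr);
	if (current->thread.ebbrr != vcpu->arch.ebbrr)
		mtspr(SPRN_EBBRR, vcpu->arch.ebbrr);
	if (current->thread.bescr != vcpu->arch.bescr)
		mtspr(SPRN_BESCR, vcpu->arch.bescr);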

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/switch_to.h  |  1 +
 arch/powerpc/kernel/process.c |  6 ++
 arch/powerpc/kvm/book3s_hv.c  | 21 +-
 arch/powerpc/kvm/book3s_hv.h  |  3 -
 arch/powerpc/kvm/book3s_hv_p9_entry.c | 93 +++
 5 files changed, 73 insertions(+), 51 deletions(-)

diff --git a/arch/powerpc/include/asm/switch_to.h 
b/arch/powerpc/include/asm/switch_to.h
index e8013cd6b646..1f43ef696033 100644
--- a/arch/powerpc/include/asm/switch_to.h
+++ b/arch/powerpc/include/asm/switch_to.h
@@ -113,6 +113,7 @@ static inline void clear_task_ebb(struct task_struct *t)
 }
 
 void kvmppc_save_user_regs(void);
+void kvmppc_save_current_sprs(void);
 
 extern int set_thread_tidr(struct task_struct *t);
 
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 3fca321b820d..b2d191a8fbf9 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1182,6 +1182,12 @@ void kvmppc_save_user_regs(void)
 #endif
 }
 EXPORT_SYMBOL_GPL(kvmppc_save_user_regs);
+
+void kvmppc_save_current_sprs(void)
+{
+   save_sprs(&current->thread);
+}
+EXPORT_SYMBOL_GPL(kvmppc_save_current_sprs);
 #endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
 
 static inline void restore_sprs(struct thread_struct *old_thread,
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index e9037f7a3737..f4445cc5a29a 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4540,9 +4540,6 @@ static int kvmppc_vcpu_run_hv(struct kvm_vcpu *vcpu)
struct kvm_run *run = vcpu->run;
int r;
int srcu_idx;
-   unsigned long ebb_regs[3] = {}; /* shut up GCC */
-   unsigned long user_tar = 0;
-   unsigned int user_vrsave;
struct kvm *kvm;
unsigned long msr;
 
@@ -4603,14 +4600,7 @@ static int kvmppc_vcpu_run_hv(struct kvm_vcpu *vcpu)
 
kvmppc_save_user_regs();
 
-   /* Save userspace EBB and other register values */
-   if (cpu_has_feature(CPU_FTR_ARCH_207S)) {
-   ebb_regs[0] = mfspr(SPRN_EBBHR);
-   ebb_regs[1] = mfspr(SPRN_EBBRR);
-   ebb_regs[2] = mfspr(SPRN_BESCR);
-   user_tar = mfspr(SPRN_TAR);
-   }
-   user_vrsave = mfspr(SPRN_VRSAVE);
+   kvmppc_save_current_sprs();
 
vcpu->arch.waitp = &vcpu->arch.vcore->wait;
vcpu->arch.pgdir = kvm->mm->pgd;
@@ -4651,15 +4641,6 @@ static int kvmppc_vcpu_run_hv(struct kvm_vcpu *vcpu)
}
} while (is_kvmppc_resume_guest(r));
 
-   /* Restore userspace EBB and other register values */
-   if (cpu_has_feature(CPU_FTR_ARCH_207S)) {
-   mtspr(SPRN_EBBHR, ebb_regs[0]);
-   mtspr(SPRN_EBBRR, ebb_regs[1]);
-   mtspr(SPRN_BESCR, ebb_regs[2]);
-   mtspr(SPRN_TAR, user_tar);
-   }
-   mtspr(SPRN_VRSAVE, user_vrsave);
-
vcpu->arch.state = KVMPPC_VCPU_NOTREADY;
atomic_dec(&kvm->arch.vcpus_running);
 
diff --git a/arch/powerpc/kvm/book3s_hv.h b/arch/powerpc/kvm/book3s_hv.h
index d7485b9e9762..6b7f07d9026b 100644
--- a/arch/powerpc/kvm/book3s_hv.h
+++ b/arch/powerpc/kvm/book3s_hv.h
@@ -4,11 +4,8 @@
  * Privileged (non-hypervisor) host registers to save.
  */
 struct p9_host_os_sprs {
-   unsigned long dscr;
-   unsigned long tidr;
unsigned long iamr;
unsigned long amr;
-   unsigned long fscr;
 
unsigned int pmc1;
unsigned int pmc2;
diff --git a/arch/powerpc/kvm/book3s_hv_p9_entry.c 
b/arch/powerpc/kvm/book3s_hv_p9_entry.c
index 8499e8a9ca8f..093ac0453d91 100644
--- a/arch/powerpc/kvm/book3s_hv_p9_entry.c
+++ b/arch/powerpc/kvm/book3s_hv_p9_entry.c
@@ -231,15 +231,26 @@ EXPORT_SYMBOL_GPL(switch_pmu_to_host);
 static void load_spr_state(struct kvm_vcpu *vcpu,
struct p9_host_os_sprs *host_os_sprs)
 {
+   /* TAR is very fast */
mtspr(SPRN_TAR, vcpu->arch.tar);
 
+#ifdef CONFIG_ALTIVEC
+   if (cpu_has_feature(CPU_FTR_ALTIVEC) &&
+   current->thread.vrsave != vcpu->arch.vrsave)
+   mtspr(SPRN_VRSAVE, vcpu->arch.vrsave);
+#endif
+
if (vcpu->arch.hfscr & HFSCR_EBB) {
-   mtspr(SPRN_EBBHR, vcpu->arch.ebbhr);
-   mtspr(SPRN_EBBRR, vcpu->arch.ebbrr);
-   mtspr(SPRN_BESCR, vcpu->arch.bescr);
+   if (current->thread.ebbhr != vcpu->arch.ebbhr)
+   mtspr(SPRN_EBBHR, vcpu->arch.ebbhr);
+   if (current->thread.ebbrr != vcpu->arch.ebbrr)
+

[PATCH v3 37/52] KVM: PPC: Book3S HV P9: Demand fault TM facility registers

2021-10-04 Thread Nicholas Piggin
Use HFSCR facility disabling to implement demand faulting for TM, with
a hysteresis counter similar to the load_fp etc counters in context
switching that implement the equivalent demand faulting for userspace
facilities.

This speeds up guest entry/exit by avoiding the register save/restore
when a guest is not frequently using them. When a guest does use them
often, there will be some additional demand fault overhead, but these
are not commonly used facilities.
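
The hysteresis works like load_fp in process.c: the facility stays
granted while it is in use and is only revoked once an 8-bit use
counter wraps, so a guest actively using TM is not penalised on every
exit. A sketch of the exit-side bookkeeping (illustrative only; load_tm
is the counter added by this patch):

	if (MSR_TM_ACTIVE(guest_msr)) {
		kvmppc_save_tm_hv(vcpu, guest_msr, true);
	} else if (vcpu->arch.hfscr & HFSCR_TM) {
		vcpu->arch.texasr = mfspr(SPRN_TEXASR);
		vcpu->arch.tfhar = mfspr(SPRN_TFHAR);
		vcpu->arch.tfiar = mfspr(SPRN_TFIAR);
		if (!vcpu->arch.nested) {
			vcpu->arch.load_tm++;		/* u8, wraps to 0 */
			if (!vcpu->arch.load_tm)	/* revoke after wrap */
				vcpu->arch.hfscr &= ~HFSCR_TM;
		}
	}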

Reviewed-by: Fabiano Rosas 
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/kvm_host.h   |  3 +++
 arch/powerpc/kvm/book3s_hv.c  | 26 --
 arch/powerpc/kvm/book3s_hv_p9_entry.c | 15 +++
 3 files changed, 34 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 9c63eff35812..92925f82a1e3 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -580,6 +580,9 @@ struct kvm_vcpu_arch {
ulong ppr;
u32 pspb;
u8 load_ebb;
+#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
+   u8 load_tm;
+#endif
ulong fscr;
ulong shadow_fscr;
ulong ebbhr;
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 070469867bf5..e9037f7a3737 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1446,6 +1446,16 @@ static int kvmppc_ebb_unavailable(struct kvm_vcpu *vcpu)
return RESUME_GUEST;
 }
 
+static int kvmppc_tm_unavailable(struct kvm_vcpu *vcpu)
+{
+   if (!(vcpu->arch.hfscr_permitted & HFSCR_TM))
+   return EMULATE_FAIL;
+
+   vcpu->arch.hfscr |= HFSCR_TM;
+
+   return RESUME_GUEST;
+}
+
 static int kvmppc_handle_exit_hv(struct kvm_vcpu *vcpu,
 struct task_struct *tsk)
 {
@@ -1739,6 +1749,8 @@ static int kvmppc_handle_exit_hv(struct kvm_vcpu *vcpu,
r = kvmppc_pmu_unavailable(vcpu);
if (cause == FSCR_EBB_LG)
r = kvmppc_ebb_unavailable(vcpu);
+   if (cause == FSCR_TM_LG)
+   r = kvmppc_tm_unavailable(vcpu);
}
if (r == EMULATE_FAIL) {
kvmppc_core_queue_program(vcpu, SRR1_PROGILL);
@@ -2783,9 +2795,9 @@ static int kvmppc_core_vcpu_create_hv(struct kvm_vcpu 
*vcpu)
vcpu->arch.hfscr_permitted = vcpu->arch.hfscr;
 
/*
-* PM, EBB is demand-faulted so start with it clear.
+* PM, EBB, TM are demand-faulted so start with it clear.
 */
-   vcpu->arch.hfscr &= ~(HFSCR_PM | HFSCR_EBB);
+   vcpu->arch.hfscr &= ~(HFSCR_PM | HFSCR_EBB | HFSCR_TM);
 
kvmppc_mmu_book3s_hv_init(vcpu);
 
@@ -3855,8 +3867,9 @@ static int kvmhv_vcpu_entry_p9_nested(struct kvm_vcpu 
*vcpu, u64 time_limit, uns
msr |= MSR_VEC;
if (cpu_has_feature(CPU_FTR_VSX))
msr |= MSR_VSX;
-   if (cpu_has_feature(CPU_FTR_TM) ||
-   cpu_has_feature(CPU_FTR_P9_TM_HV_ASSIST))
+   if ((cpu_has_feature(CPU_FTR_TM) ||
+   cpu_has_feature(CPU_FTR_P9_TM_HV_ASSIST)) &&
+   (vcpu->arch.hfscr & HFSCR_TM))
msr |= MSR_TM;
msr = msr_check_and_set(msr);
 
@@ -4582,8 +4595,9 @@ static int kvmppc_vcpu_run_hv(struct kvm_vcpu *vcpu)
msr |= MSR_VEC;
if (cpu_has_feature(CPU_FTR_VSX))
msr |= MSR_VSX;
-   if (cpu_has_feature(CPU_FTR_TM) ||
-   cpu_has_feature(CPU_FTR_P9_TM_HV_ASSIST))
+   if ((cpu_has_feature(CPU_FTR_TM) ||
+   cpu_has_feature(CPU_FTR_P9_TM_HV_ASSIST)) &&
+   (vcpu->arch.hfscr & HFSCR_TM))
msr |= MSR_TM;
msr = msr_check_and_set(msr);
 
diff --git a/arch/powerpc/kvm/book3s_hv_p9_entry.c 
b/arch/powerpc/kvm/book3s_hv_p9_entry.c
index 929a7c336b09..8499e8a9ca8f 100644
--- a/arch/powerpc/kvm/book3s_hv_p9_entry.c
+++ b/arch/powerpc/kvm/book3s_hv_p9_entry.c
@@ -310,7 +310,7 @@ bool load_vcpu_state(struct kvm_vcpu *vcpu,
if (MSR_TM_ACTIVE(guest_msr)) {
kvmppc_restore_tm_hv(vcpu, guest_msr, true);
ret = true;
-   } else {
+   } else if (vcpu->arch.hfscr & HFSCR_TM) {
mtspr(SPRN_TEXASR, vcpu->arch.texasr);
mtspr(SPRN_TFHAR, vcpu->arch.tfhar);
mtspr(SPRN_TFIAR, vcpu->arch.tfiar);
@@ -346,10 +346,16 @@ void store_vcpu_state(struct kvm_vcpu *vcpu)
unsigned long guest_msr = vcpu->arch.shregs.msr;
if (MSR_TM_ACTIVE(guest_msr)) {
kvmppc_save_tm_hv(vcpu, guest_msr, true);
-   } else {
+   } else if (vcpu->arch.hfscr & HFSCR_TM) {
vc

[PATCH v3 36/52] KVM: PPC: Book3S HV P9: Demand fault EBB facility registers

2021-10-04 Thread Nicholas Piggin
Use HFSCR facility disabling to implement demand faulting for EBB, with
a hysteresis counter similar to the load_fp etc counters in context
switching that implement the equivalent demand faulting for userspace
facilities.

This speeds up guest entry/exit by avoiding the register save/restore
when a guest is not frequently using them. When a guest does use them
often, there will be some additional demand fault overhead, but these
are not commonly used facilities.
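
Putting the pieces of the diff together, the facility lifecycle is (a
summary sketch, not new code):

	/* 1. vcpu create: EBB starts disabled, so guest use will trap. */
	vcpu->arch.hfscr &= ~(HFSCR_PM | HFSCR_EBB);

	/* 2. H_FAC_UNAVAIL with cause FSCR_EBB_LG: grant and re-enter. */
	vcpu->arch.hfscr |= HFSCR_EBB;

	/* 3. On each exit while granted, save the EBB SPRs and bump
	 *    load_ebb; when the u8 wraps, clear HFSCR_EBB again so an
	 *    idle guest goes back to skipping the save/restore. */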

Reviewed-by: Fabiano Rosas 
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/kvm_host.h   |  1 +
 arch/powerpc/kvm/book3s_hv.c  | 16 +--
 arch/powerpc/kvm/book3s_hv_p9_entry.c | 28 +--
 3 files changed, 37 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index c5fc4d016695..9c63eff35812 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -579,6 +579,7 @@ struct kvm_vcpu_arch {
ulong cfar;
ulong ppr;
u32 pspb;
+   u8 load_ebb;
ulong fscr;
ulong shadow_fscr;
ulong ebbhr;
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 6fb941aa77f1..070469867bf5 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1436,6 +1436,16 @@ static int kvmppc_pmu_unavailable(struct kvm_vcpu *vcpu)
return RESUME_GUEST;
 }
 
+static int kvmppc_ebb_unavailable(struct kvm_vcpu *vcpu)
+{
+   if (!(vcpu->arch.hfscr_permitted & HFSCR_EBB))
+   return EMULATE_FAIL;
+
+   vcpu->arch.hfscr |= HFSCR_EBB;
+
+   return RESUME_GUEST;
+}
+
 static int kvmppc_handle_exit_hv(struct kvm_vcpu *vcpu,
 struct task_struct *tsk)
 {
@@ -1727,6 +1737,8 @@ static int kvmppc_handle_exit_hv(struct kvm_vcpu *vcpu,
r = kvmppc_emulate_doorbell_instr(vcpu);
if (cause == FSCR_PM_LG)
r = kvmppc_pmu_unavailable(vcpu);
+   if (cause == FSCR_EBB_LG)
+   r = kvmppc_ebb_unavailable(vcpu);
}
if (r == EMULATE_FAIL) {
kvmppc_core_queue_program(vcpu, SRR1_PROGILL);
@@ -2771,9 +2783,9 @@ static int kvmppc_core_vcpu_create_hv(struct kvm_vcpu 
*vcpu)
vcpu->arch.hfscr_permitted = vcpu->arch.hfscr;
 
/*
-* PM is demand-faulted so start with it clear.
+* PM, EBB is demand-faulted so start with it clear.
 */
-   vcpu->arch.hfscr &= ~HFSCR_PM;
+   vcpu->arch.hfscr &= ~(HFSCR_PM | HFSCR_EBB);
 
kvmppc_mmu_book3s_hv_init(vcpu);
 
diff --git a/arch/powerpc/kvm/book3s_hv_p9_entry.c 
b/arch/powerpc/kvm/book3s_hv_p9_entry.c
index a23f09fa7d2d..929a7c336b09 100644
--- a/arch/powerpc/kvm/book3s_hv_p9_entry.c
+++ b/arch/powerpc/kvm/book3s_hv_p9_entry.c
@@ -232,9 +232,12 @@ static void load_spr_state(struct kvm_vcpu *vcpu,
struct p9_host_os_sprs *host_os_sprs)
 {
mtspr(SPRN_TAR, vcpu->arch.tar);
-   mtspr(SPRN_EBBHR, vcpu->arch.ebbhr);
-   mtspr(SPRN_EBBRR, vcpu->arch.ebbrr);
-   mtspr(SPRN_BESCR, vcpu->arch.bescr);
+
+   if (vcpu->arch.hfscr & HFSCR_EBB) {
+   mtspr(SPRN_EBBHR, vcpu->arch.ebbhr);
+   mtspr(SPRN_EBBRR, vcpu->arch.ebbrr);
+   mtspr(SPRN_BESCR, vcpu->arch.bescr);
+   }
 
if (cpu_has_feature(CPU_FTR_P9_TIDR))
mtspr(SPRN_TIDR, vcpu->arch.tid);
@@ -265,9 +268,22 @@ static void load_spr_state(struct kvm_vcpu *vcpu,
 static void store_spr_state(struct kvm_vcpu *vcpu)
 {
vcpu->arch.tar = mfspr(SPRN_TAR);
-   vcpu->arch.ebbhr = mfspr(SPRN_EBBHR);
-   vcpu->arch.ebbrr = mfspr(SPRN_EBBRR);
-   vcpu->arch.bescr = mfspr(SPRN_BESCR);
+
+   if (vcpu->arch.hfscr & HFSCR_EBB) {
+   vcpu->arch.ebbhr = mfspr(SPRN_EBBHR);
+   vcpu->arch.ebbrr = mfspr(SPRN_EBBRR);
+   vcpu->arch.bescr = mfspr(SPRN_BESCR);
+   /*
+* This is like load_fp in context switching, turn off the
+* facility after it wraps the u8 to try avoiding saving
+* and restoring the registers each partition switch.
+*/
+   if (!vcpu->arch.nested) {
+   vcpu->arch.load_ebb++;
+   if (!vcpu->arch.load_ebb)
+   vcpu->arch.hfscr &= ~HFSCR_EBB;
+   }
+   }
 
if (cpu_has_feature(CPU_FTR_P9_TIDR))
vcpu->arch.tid = mfspr(SPRN_TIDR);
-- 
2.23.0



[PATCH v3 35/52] KVM: PPC: Book3S HV P9: More SPR speed improvements

2021-10-04 Thread Nicholas Piggin
This avoids more scoreboard stalls and reduces mtSPRs.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv_p9_entry.c | 73 ---
 1 file changed, 43 insertions(+), 30 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_p9_entry.c 
b/arch/powerpc/kvm/book3s_hv_p9_entry.c
index 67f57b03a896..a23f09fa7d2d 100644
--- a/arch/powerpc/kvm/book3s_hv_p9_entry.c
+++ b/arch/powerpc/kvm/book3s_hv_p9_entry.c
@@ -645,24 +645,29 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
vc->tb_offset_applied = vc->tb_offset;
}
 
-   if (vc->pcr)
-   mtspr(SPRN_PCR, vc->pcr | PCR_MASK);
-   mtspr(SPRN_DPDES, vc->dpdes);
mtspr(SPRN_VTB, vc->vtb);
-
mtspr(SPRN_PURR, vcpu->arch.purr);
mtspr(SPRN_SPURR, vcpu->arch.spurr);
 
+   if (vc->pcr)
+   mtspr(SPRN_PCR, vc->pcr | PCR_MASK);
+   if (vc->dpdes)
+   mtspr(SPRN_DPDES, vc->dpdes);
+
if (dawr_enabled()) {
-   mtspr(SPRN_DAWR0, vcpu->arch.dawr0);
-   mtspr(SPRN_DAWRX0, vcpu->arch.dawrx0);
+   if (vcpu->arch.dawr0 != host_dawr0)
+   mtspr(SPRN_DAWR0, vcpu->arch.dawr0);
+   if (vcpu->arch.dawrx0 != host_dawrx0)
+   mtspr(SPRN_DAWRX0, vcpu->arch.dawrx0);
if (cpu_has_feature(CPU_FTR_DAWR1)) {
-   mtspr(SPRN_DAWR1, vcpu->arch.dawr1);
-   mtspr(SPRN_DAWRX1, vcpu->arch.dawrx1);
+   if (vcpu->arch.dawr1 != host_dawr1)
+   mtspr(SPRN_DAWR1, vcpu->arch.dawr1);
+   if (vcpu->arch.dawrx1 != host_dawrx1)
+   mtspr(SPRN_DAWRX1, vcpu->arch.dawrx1);
}
}
-   mtspr(SPRN_CIABR, vcpu->arch.ciabr);
-   mtspr(SPRN_IC, vcpu->arch.ic);
+   if (vcpu->arch.ciabr != host_ciabr)
+   mtspr(SPRN_CIABR, vcpu->arch.ciabr);
 
mtspr(SPRN_PSSCR, vcpu->arch.psscr | PSSCR_EC |
  (local_paca->kvm_hstate.fake_suspend << PSSCR_FAKE_SUSPEND_LG));
@@ -881,20 +886,6 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
vc->dpdes = mfspr(SPRN_DPDES);
vc->vtb = mfspr(SPRN_VTB);
 
-   save_clear_guest_mmu(kvm, vcpu);
-   switch_mmu_to_host(kvm, host_pidr);
-
-   /*
-* If we are in real mode, only switch MMU on after the MMU is
-* switched to host, to avoid the P9_RADIX_PREFETCH_BUG.
-*/
-   if (IS_ENABLED(CONFIG_PPC_TRANSACTIONAL_MEM) &&
-   vcpu->arch.shregs.msr & MSR_TS_MASK)
-   msr |= MSR_TS_S;
-   __mtmsrd(msr, 0);
-
-   store_vcpu_state(vcpu);
-
dec = mfspr(SPRN_DEC);
if (!(lpcr & LPCR_LD)) /* Sign extend if not using large decrementer */
dec = (s32) dec;
@@ -912,6 +903,22 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
vc->tb_offset_applied = 0;
}
 
+   save_clear_guest_mmu(kvm, vcpu);
+   switch_mmu_to_host(kvm, host_pidr);
+
+   /*
+* Enable MSR here in order to have facilities enabled to save
+* guest registers. This enables MMU (if we were in realmode), so
+* only switch MMU on after the MMU is switched to host, to avoid
+* the P9_RADIX_PREFETCH_BUG or hash guest context.
+*/
+   if (IS_ENABLED(CONFIG_PPC_TRANSACTIONAL_MEM) &&
+   vcpu->arch.shregs.msr & MSR_TS_MASK)
+   msr |= MSR_TS_S;
+   __mtmsrd(msr, 0);
+
+   store_vcpu_state(vcpu);
+
mtspr(SPRN_PURR, local_paca->kvm_hstate.host_purr);
mtspr(SPRN_SPURR, local_paca->kvm_hstate.host_spurr);
 
@@ -919,15 +926,21 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
mtspr(SPRN_PSSCR, host_psscr |
  (local_paca->kvm_hstate.fake_suspend << PSSCR_FAKE_SUSPEND_LG));
mtspr(SPRN_HFSCR, host_hfscr);
-   mtspr(SPRN_CIABR, host_ciabr);
-   mtspr(SPRN_DAWR0, host_dawr0);
-   mtspr(SPRN_DAWRX0, host_dawrx0);
+   if (vcpu->arch.ciabr != host_ciabr)
+   mtspr(SPRN_CIABR, host_ciabr);
+   if (vcpu->arch.dawr0 != host_dawr0)
+   mtspr(SPRN_DAWR0, host_dawr0);
+   if (vcpu->arch.dawrx0 != host_dawrx0)
+   mtspr(SPRN_DAWRX0, host_dawrx0);
if (cpu_has_feature(CPU_FTR_DAWR1)) {
-   mtspr(SPRN_DAWR1, host_dawr1);
-   mtspr(SPRN_DAWRX1, host_dawrx1);
+   if (vcpu->arch.dawr1 != host_dawr1)
+   mtspr(SPRN_DAWR1, host_dawr1);
+   if (vcpu->arch.dawrx1 != host_dawrx1)
+   mtspr(SPRN_DAWRX1, host_dawr

[PATCH v3 34/52] KVM: PPC: Book3S HV P9: Restrict DSISR canary workaround to processors that require it

2021-10-04 Thread Nicholas Piggin
Use CPU_FTR_P9_RADIX_PREFETCH_BUG, which tests for DD2.1 and below
processors, to decide whether to apply the workaround. This saves a
mtSPR in guest entry.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv.c  | 3 ++-
 arch/powerpc/kvm/book3s_hv_p9_entry.c | 6 --
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 5a1859311b3e..6fb941aa77f1 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1590,7 +1590,8 @@ static int kvmppc_handle_exit_hv(struct kvm_vcpu *vcpu,
unsigned long vsid;
long err;
 
-   if (vcpu->arch.fault_dsisr == HDSISR_CANARY) {
+   if (cpu_has_feature(CPU_FTR_P9_RADIX_PREFETCH_BUG) &&
+   unlikely(vcpu->arch.fault_dsisr == HDSISR_CANARY)) {
r = RESUME_GUEST; /* Just retry if it's the canary */
break;
}
diff --git a/arch/powerpc/kvm/book3s_hv_p9_entry.c 
b/arch/powerpc/kvm/book3s_hv_p9_entry.c
index 619bbcd47b92..67f57b03a896 100644
--- a/arch/powerpc/kvm/book3s_hv_p9_entry.c
+++ b/arch/powerpc/kvm/book3s_hv_p9_entry.c
@@ -683,9 +683,11 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
 * HDSI which should correctly update the HDSISR the second time HDSI
 * entry.
 *
-* Just do this on all p9 processors for now.
+* The "radix prefetch bug" test can be used to test for this bug, as
+* it also exists for DD2.1 and below.
 */
-   mtspr(SPRN_HDSISR, HDSISR_CANARY);
+   if (cpu_has_feature(CPU_FTR_P9_RADIX_PREFETCH_BUG))
+   mtspr(SPRN_HDSISR, HDSISR_CANARY);
 
mtspr(SPRN_SPRG0, vcpu->arch.shregs.sprg0);
mtspr(SPRN_SPRG1, vcpu->arch.shregs.sprg1);
-- 
2.23.0



[PATCH v3 33/52] KVM: PPC: Book3S HV P9: Switch PMU to guest as late as possible

2021-10-04 Thread Nicholas Piggin
This moves PMU switch to guest as late as possible in entry, and switch
back to host as early as possible at exit. This helps the host get the
most perf coverage of KVM entry/exit code as possible.

This is slightly suboptimal from an SPR scheduling point of view when the
PMU is enabled, but when perf is disabled there is no real difference.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv.c  | 6 ++
 arch/powerpc/kvm/book3s_hv_p9_entry.c | 6 ++
 2 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index db42eeb27c15..5a1859311b3e 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3820,8 +3820,6 @@ static int kvmhv_vcpu_entry_p9_nested(struct kvm_vcpu 
*vcpu, u64 time_limit, uns
s64 dec;
int trap;
 
-   switch_pmu_to_guest(vcpu, &host_os_sprs);
-
save_p9_host_os_sprs(&host_os_sprs);
 
/*
@@ -3884,9 +3882,11 @@ static int kvmhv_vcpu_entry_p9_nested(struct kvm_vcpu 
*vcpu, u64 time_limit, uns
 
mtspr(SPRN_DAR, vcpu->arch.shregs.dar);
mtspr(SPRN_DSISR, vcpu->arch.shregs.dsisr);
+   switch_pmu_to_guest(vcpu, &host_os_sprs);
trap = plpar_hcall_norets(H_ENTER_NESTED, __pa(&hvregs),
  __pa(&vcpu->arch.regs));
kvmhv_restore_hv_return_state(vcpu, &hvregs);
+   switch_pmu_to_host(vcpu, &host_os_sprs);
vcpu->arch.shregs.msr = vcpu->arch.regs.msr;
vcpu->arch.shregs.dar = mfspr(SPRN_DAR);
vcpu->arch.shregs.dsisr = mfspr(SPRN_DSISR);
@@ -3905,8 +3905,6 @@ static int kvmhv_vcpu_entry_p9_nested(struct kvm_vcpu 
*vcpu, u64 time_limit, uns
 
restore_p9_host_os_sprs(vcpu, &host_os_sprs);
 
-   switch_pmu_to_host(vcpu, &host_os_sprs);
-
return trap;
 }
 
diff --git a/arch/powerpc/kvm/book3s_hv_p9_entry.c 
b/arch/powerpc/kvm/book3s_hv_p9_entry.c
index 6bef509bccb8..619bbcd47b92 100644
--- a/arch/powerpc/kvm/book3s_hv_p9_entry.c
+++ b/arch/powerpc/kvm/book3s_hv_p9_entry.c
@@ -601,8 +601,6 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
local_paca->kvm_hstate.host_purr = mfspr(SPRN_PURR);
local_paca->kvm_hstate.host_spurr = mfspr(SPRN_SPURR);
 
-   switch_pmu_to_guest(vcpu, &host_os_sprs);
-
-   save_p9_host_os_sprs(&host_os_sprs);
 
/*
@@ -744,7 +742,9 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
 
accumulate_time(vcpu, &vcpu->arch.guest_time);
 
+   switch_pmu_to_guest(vcpu, &host_os_sprs);
kvmppc_p9_enter_guest(vcpu);
+   switch_pmu_to_host(vcpu, &host_os_sprs);
 
accumulate_time(vcpu, &vcpu->arch.rm_intr);
 
@@ -955,8 +955,6 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
asm volatile(PPC_CP_ABORT);
 
 out:
-   switch_pmu_to_host(vcpu, &host_os_sprs);
-
end_timing(vcpu);
 
return trap;
-- 
2.23.0



[PATCH v3 32/52] KVM: PPC: Book3S HV P9: Implement TM fastpath for guest entry/exit

2021-10-04 Thread Nicholas Piggin
If TM is not active, only TM register state needs to be saved and
restored, avoiding several mfmsr/mtmsrd instructions and improving
performance.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv_p9_entry.c | 27 +++
 1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_p9_entry.c 
b/arch/powerpc/kvm/book3s_hv_p9_entry.c
index fa080533bd8d..6bef509bccb8 100644
--- a/arch/powerpc/kvm/book3s_hv_p9_entry.c
+++ b/arch/powerpc/kvm/book3s_hv_p9_entry.c
@@ -287,11 +287,20 @@ bool load_vcpu_state(struct kvm_vcpu *vcpu,
 {
bool ret = false;
 
+#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
if (cpu_has_feature(CPU_FTR_TM) ||
cpu_has_feature(CPU_FTR_P9_TM_HV_ASSIST)) {
-   kvmppc_restore_tm_hv(vcpu, vcpu->arch.shregs.msr, true);
-   ret = true;
+   unsigned long guest_msr = vcpu->arch.shregs.msr;
+   if (MSR_TM_ACTIVE(guest_msr)) {
+   kvmppc_restore_tm_hv(vcpu, guest_msr, true);
+   ret = true;
+   } else {
+   mtspr(SPRN_TEXASR, vcpu->arch.texasr);
+   mtspr(SPRN_TFHAR, vcpu->arch.tfhar);
+   mtspr(SPRN_TFIAR, vcpu->arch.tfiar);
+   }
}
+#endif
 
load_spr_state(vcpu, host_os_sprs);
 
@@ -315,9 +324,19 @@ void store_vcpu_state(struct kvm_vcpu *vcpu)
 #endif
vcpu->arch.vrsave = mfspr(SPRN_VRSAVE);
 
+#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
if (cpu_has_feature(CPU_FTR_TM) ||
-   cpu_has_feature(CPU_FTR_P9_TM_HV_ASSIST))
-   kvmppc_save_tm_hv(vcpu, vcpu->arch.shregs.msr, true);
+   cpu_has_feature(CPU_FTR_P9_TM_HV_ASSIST)) {
+   unsigned long guest_msr = vcpu->arch.shregs.msr;
+   if (MSR_TM_ACTIVE(guest_msr)) {
+   kvmppc_save_tm_hv(vcpu, guest_msr, true);
+   } else {
+   vcpu->arch.texasr = mfspr(SPRN_TEXASR);
+   vcpu->arch.tfhar = mfspr(SPRN_TFHAR);
+   vcpu->arch.tfiar = mfspr(SPRN_TFIAR);
+   }
+   }
+#endif
 }
 EXPORT_SYMBOL_GPL(store_vcpu_state);
 
-- 
2.23.0



[PATCH v3 31/52] KVM: PPC: Book3S HV P9: Move remaining SPR and MSR access into low level entry

2021-10-04 Thread Nicholas Piggin
Move register saving and loading from kvmhv_p9_guest_entry() into the HV
and nested entry handlers.

Accesses are scheduled to reduce mtSPR / mfSPR interleaving which
reduces SPR scoreboard stalls.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv.c  | 79 ++
 arch/powerpc/kvm/book3s_hv_p9_entry.c | 96 ---
 2 files changed, 109 insertions(+), 66 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index a57727463980..db42eeb27c15 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3814,9 +3814,15 @@ static int kvmhv_vcpu_entry_p9_nested(struct kvm_vcpu 
*vcpu, u64 time_limit, uns
 {
struct kvmppc_vcore *vc = vcpu->arch.vcore;
unsigned long host_psscr;
+   unsigned long msr;
struct hv_guest_state hvregs;
-   int trap;
+   struct p9_host_os_sprs host_os_sprs;
s64 dec;
+   int trap;
+
+   switch_pmu_to_guest(vcpu, &host_os_sprs);
+
+   save_p9_host_os_sprs(&host_os_sprs);
 
/*
 * We need to save and restore the guest visible part of the
@@ -3825,6 +3831,27 @@ static int kvmhv_vcpu_entry_p9_nested(struct kvm_vcpu 
*vcpu, u64 time_limit, uns
 * this is done in kvmhv_vcpu_entry_p9() below otherwise.
 */
host_psscr = mfspr(SPRN_PSSCR_PR);
+
+   hard_irq_disable();
+   if (lazy_irq_pending())
+   return 0;
+
+   /* MSR bits may have been cleared by context switch */
+   msr = 0;
+   if (IS_ENABLED(CONFIG_PPC_FPU))
+   msr |= MSR_FP;
+   if (cpu_has_feature(CPU_FTR_ALTIVEC))
+   msr |= MSR_VEC;
+   if (cpu_has_feature(CPU_FTR_VSX))
+   msr |= MSR_VSX;
+   if (cpu_has_feature(CPU_FTR_TM) ||
+   cpu_has_feature(CPU_FTR_P9_TM_HV_ASSIST))
+   msr |= MSR_TM;
+   msr = msr_check_and_set(msr);
+
+   if (unlikely(load_vcpu_state(vcpu, &host_os_sprs)))
+   msr = mfmsr(); /* TM restore can update msr */
+
mtspr(SPRN_PSSCR_PR, vcpu->arch.psscr);
kvmhv_save_hv_regs(vcpu, &hvregs);
hvregs.lpcr = lpcr;
@@ -3866,12 +3893,20 @@ static int kvmhv_vcpu_entry_p9_nested(struct kvm_vcpu 
*vcpu, u64 time_limit, uns
vcpu->arch.psscr = mfspr(SPRN_PSSCR_PR);
mtspr(SPRN_PSSCR_PR, host_psscr);
 
+   store_vcpu_state(vcpu);
+
dec = mfspr(SPRN_DEC);
if (!(lpcr & LPCR_LD)) /* Sign extend if not using large decrementer */
dec = (s32) dec;
*tb = mftb();
vcpu->arch.dec_expires = dec + (*tb + vc->tb_offset);
 
+   timer_rearm_host_dec(*tb);
+
+   restore_p9_host_os_sprs(vcpu, &host_os_sprs);
+
+   switch_pmu_to_host(vcpu, &host_os_sprs);
+
return trap;
 }
 
@@ -3882,9 +3917,7 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
 unsigned long lpcr, u64 *tb)
 {
struct kvmppc_vcore *vc = vcpu->arch.vcore;
-   struct p9_host_os_sprs host_os_sprs;
u64 next_timer;
-   unsigned long msr;
int trap;
 
next_timer = timer_get_next_tb();
@@ -3895,33 +3928,6 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
 
vcpu->arch.ceded = 0;
 
-   save_p9_host_os_sprs(&host_os_sprs);
-
-   /*
-* This could be combined with MSR[RI] clearing, but that expands
-* the unrecoverable window. It would be better to cover unrecoverable
-* with KVM bad interrupt handling rather than use MSR[RI] at all.
-*
-* Much more difficult and less worthwhile to combine with IR/DR
-* disable.
-*/
-   hard_irq_disable();
-   if (lazy_irq_pending())
-   return 0;
-
-   /* MSR bits may have been cleared by context switch */
-   msr = 0;
-   if (IS_ENABLED(CONFIG_PPC_FPU))
-   msr |= MSR_FP;
-   if (cpu_has_feature(CPU_FTR_ALTIVEC))
-   msr |= MSR_VEC;
-   if (cpu_has_feature(CPU_FTR_VSX))
-   msr |= MSR_VSX;
-   if (cpu_has_feature(CPU_FTR_TM) ||
-   cpu_has_feature(CPU_FTR_P9_TM_HV_ASSIST))
-   msr |= MSR_TM;
-   msr = msr_check_and_set(msr);
-
kvmppc_subcore_enter_guest();
 
vc->entry_exit_map = 1;
@@ -3929,11 +3935,6 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
 
vcpu_vpa_increment_dispatch(vcpu);
 
-   if (unlikely(load_vcpu_state(vcpu, &host_os_sprs)))
-   msr = mfmsr(); /* MSR may have been updated */
-
-   switch_pmu_to_guest(vcpu, &host_os_sprs);
-
if (kvmhv_on_pseries()) {
trap = kvmhv_vcpu_entry_p9_nested(vcpu, time_limit, lpcr, tb);
 
@@ -3976,16 +3977,8 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
vcpu->arch.slb_max = 0;
}
 
-   switch_pmu_to_host(vcpu, &host_os_sprs);
-
-   store_vcpu_state(vcpu);
-
v

[PATCH v3 30/52] KVM: PPC: Book3S HV P9: Move nested guest entry into its own function

2021-10-04 Thread Nicholas Piggin
Move the part of the guest entry which is specific to nested HV into its
own function. This is just refactoring.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv.c | 125 +++
 1 file changed, 67 insertions(+), 58 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 580bac4753f6..a57727463980 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3809,6 +3809,72 @@ static void vcpu_vpa_increment_dispatch(struct kvm_vcpu 
*vcpu)
}
 }
 
+/* call our hypervisor to load up HV regs and go */
+static int kvmhv_vcpu_entry_p9_nested(struct kvm_vcpu *vcpu, u64 time_limit, 
unsigned long lpcr, u64 *tb)
+{
+   struct kvmppc_vcore *vc = vcpu->arch.vcore;
+   unsigned long host_psscr;
+   struct hv_guest_state hvregs;
+   int trap;
+   s64 dec;
+
+   /*
+* We need to save and restore the guest visible part of the
+* psscr (i.e. using SPRN_PSSCR_PR) since the hypervisor
+* doesn't do this for us. Note only required if pseries since
+* this is done in kvmhv_vcpu_entry_p9() below otherwise.
+*/
+   host_psscr = mfspr(SPRN_PSSCR_PR);
+   mtspr(SPRN_PSSCR_PR, vcpu->arch.psscr);
+   kvmhv_save_hv_regs(vcpu, &hvregs);
+   hvregs.lpcr = lpcr;
+   vcpu->arch.regs.msr = vcpu->arch.shregs.msr;
+   hvregs.version = HV_GUEST_STATE_VERSION;
+   if (vcpu->arch.nested) {
+   hvregs.lpid = vcpu->arch.nested->shadow_lpid;
+   hvregs.vcpu_token = vcpu->arch.nested_vcpu_id;
+   } else {
+   hvregs.lpid = vcpu->kvm->arch.lpid;
+   hvregs.vcpu_token = vcpu->vcpu_id;
+   }
+   hvregs.hdec_expiry = time_limit;
+
+   /*
+* When setting DEC, we must always deal with irq_work_raise
+* via NMI vs setting DEC. The problem occurs right as we
+* switch into guest mode if a NMI hits and sets pending work
+* and sets DEC, then that will apply to the guest and not
+* bring us back to the host.
+*
+* irq_work_raise could check a flag (or possibly LPCR[HDICE]
+* for example) and set HDEC to 1? That wouldn't solve the
+* nested hv case which needs to abort the hcall or zero the
+* time limit.
+*
+* XXX: Another day's problem.
+*/
+   mtspr(SPRN_DEC, kvmppc_dec_expires_host_tb(vcpu) - *tb);
+
+   mtspr(SPRN_DAR, vcpu->arch.shregs.dar);
+   mtspr(SPRN_DSISR, vcpu->arch.shregs.dsisr);
+   trap = plpar_hcall_norets(H_ENTER_NESTED, __pa(&hvregs),
+ __pa(&vcpu->arch.regs));
+   kvmhv_restore_hv_return_state(vcpu, &hvregs);
+   vcpu->arch.shregs.msr = vcpu->arch.regs.msr;
+   vcpu->arch.shregs.dar = mfspr(SPRN_DAR);
+   vcpu->arch.shregs.dsisr = mfspr(SPRN_DSISR);
+   vcpu->arch.psscr = mfspr(SPRN_PSSCR_PR);
+   mtspr(SPRN_PSSCR_PR, host_psscr);
+
+   dec = mfspr(SPRN_DEC);
+   if (!(lpcr & LPCR_LD)) /* Sign extend if not using large decrementer */
+   dec = (s32) dec;
+   *tb = mftb();
+   vcpu->arch.dec_expires = dec + (*tb + vc->tb_offset);
+
+   return trap;
+}
+
 /*
  * Guest entry for POWER9 and later CPUs.
  */
@@ -3817,7 +3883,6 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
 {
struct kvmppc_vcore *vc = vcpu->arch.vcore;
struct p9_host_os_sprs host_os_sprs;
-   s64 dec;
u64 next_timer;
unsigned long msr;
int trap;
@@ -3870,63 +3935,7 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
switch_pmu_to_guest(vcpu, &host_os_sprs);
 
if (kvmhv_on_pseries()) {
-   /*
-* We need to save and restore the guest visible part of the
-* psscr (i.e. using SPRN_PSSCR_PR) since the hypervisor
-* doesn't do this for us. Note only required if pseries since
-* this is done in kvmhv_vcpu_entry_p9() below otherwise.
-*/
-   unsigned long host_psscr;
-   /* call our hypervisor to load up HV regs and go */
-   struct hv_guest_state hvregs;
-
-   host_psscr = mfspr(SPRN_PSSCR_PR);
-   mtspr(SPRN_PSSCR_PR, vcpu->arch.psscr);
-   kvmhv_save_hv_regs(vcpu, &hvregs);
-   hvregs.lpcr = lpcr;
-   vcpu->arch.regs.msr = vcpu->arch.shregs.msr;
-   hvregs.version = HV_GUEST_STATE_VERSION;
-   if (vcpu->arch.nested) {
-   hvregs.lpid = vcpu->arch.nested->shadow_lpid;
-   hvregs.vcpu_token = vcpu->arch.nested_vcpu_id;
-   } else {
-   hvregs.lpid = vcpu->kvm->arch.lpid;
-   hvregs.vcpu_token = vcpu->vcpu_id;
-   }
-   hvreg

[PATCH v3 29/52] KVM: PPC: Book3S HV P9: Move host OS save/restore functions to built-in

2021-10-04 Thread Nicholas Piggin
Move the P9 guest/host register switching functions to the built-in
P9 entry code, and export it for nested to use as well.

This allows more flexibility in scheduling these supervisor privileged
SPR accesses with the HV privileged and PR SPR accesses in the low level
entry code.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv.c  | 379 +-
 arch/powerpc/kvm/book3s_hv.h  |  45 +++
 arch/powerpc/kvm/book3s_hv_p9_entry.c | 353 
 3 files changed, 399 insertions(+), 378 deletions(-)
 create mode 100644 arch/powerpc/kvm/book3s_hv.h

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 8d721baf8c6b..580bac4753f6 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -80,6 +80,7 @@
 #include 
 
 #include "book3s.h"
+#include "book3s_hv.h"
 
 #define CREATE_TRACE_POINTS
 #include "trace_hv.h"
@@ -127,11 +128,6 @@ static bool nested = true;
 module_param(nested, bool, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(nested, "Enable nested virtualization (only on POWER9)");
 
-static inline bool nesting_enabled(struct kvm *kvm)
-{
-   return kvm->arch.nested_enable && kvm_is_radix(kvm);
-}
-
 static int kvmppc_hv_setup_htab_rma(struct kvm_vcpu *vcpu);
 
 /*
@@ -3797,379 +3793,6 @@ static noinline void kvmppc_run_core(struct 
kvmppc_vcore *vc)
trace_kvmppc_run_core(vc, 1);
 }
 
-/*
- * Privileged (non-hypervisor) host registers to save.
- */
-struct p9_host_os_sprs {
-   unsigned long dscr;
-   unsigned long tidr;
-   unsigned long iamr;
-   unsigned long amr;
-   unsigned long fscr;
-
-   unsigned int pmc1;
-   unsigned int pmc2;
-   unsigned int pmc3;
-   unsigned int pmc4;
-   unsigned int pmc5;
-   unsigned int pmc6;
-   unsigned long mmcr0;
-   unsigned long mmcr1;
-   unsigned long mmcr2;
-   unsigned long mmcr3;
-   unsigned long mmcra;
-   unsigned long siar;
-   unsigned long sier1;
-   unsigned long sier2;
-   unsigned long sier3;
-   unsigned long sdar;
-};
-
-static void freeze_pmu(unsigned long mmcr0, unsigned long mmcra)
-{
-   if (!(mmcr0 & MMCR0_FC))
-   goto do_freeze;
-   if (mmcra & MMCRA_SAMPLE_ENABLE)
-   goto do_freeze;
-   if (cpu_has_feature(CPU_FTR_ARCH_31)) {
-   if (!(mmcr0 & MMCR0_PMCCEXT))
-   goto do_freeze;
-   if (!(mmcra & MMCRA_BHRB_DISABLE))
-   goto do_freeze;
-   }
-   return;
-
-do_freeze:
-   mmcr0 = MMCR0_FC;
-   mmcra = 0;
-   if (cpu_has_feature(CPU_FTR_ARCH_31)) {
-   mmcr0 |= MMCR0_PMCCEXT;
-   mmcra = MMCRA_BHRB_DISABLE;
-   }
-
-   mtspr(SPRN_MMCR0, mmcr0);
-   mtspr(SPRN_MMCRA, mmcra);
-   isync();
-}
-
-static void switch_pmu_to_guest(struct kvm_vcpu *vcpu,
-   struct p9_host_os_sprs *host_os_sprs)
-{
-   struct lppaca *lp;
-   int load_pmu = 1;
-
-   lp = vcpu->arch.vpa.pinned_addr;
-   if (lp)
-   load_pmu = lp->pmcregs_in_use;
-
-   /* Save host */
-   if (ppc_get_pmu_inuse()) {
-   /*
-* It might be better to put PMU handling (at least for the
-* host) in the perf subsystem because it knows more about what
-* is being used.
-*/
-
-   /* POWER9, POWER10 do not implement HPMC or SPMC */
-
-   host_os_sprs->mmcr0 = mfspr(SPRN_MMCR0);
-   host_os_sprs->mmcra = mfspr(SPRN_MMCRA);
-
-   freeze_pmu(host_os_sprs->mmcr0, host_os_sprs->mmcra);
-
-   host_os_sprs->pmc1 = mfspr(SPRN_PMC1);
-   host_os_sprs->pmc2 = mfspr(SPRN_PMC2);
-   host_os_sprs->pmc3 = mfspr(SPRN_PMC3);
-   host_os_sprs->pmc4 = mfspr(SPRN_PMC4);
-   host_os_sprs->pmc5 = mfspr(SPRN_PMC5);
-   host_os_sprs->pmc6 = mfspr(SPRN_PMC6);
-   host_os_sprs->mmcr1 = mfspr(SPRN_MMCR1);
-   host_os_sprs->mmcr2 = mfspr(SPRN_MMCR2);
-   host_os_sprs->sdar = mfspr(SPRN_SDAR);
-   host_os_sprs->siar = mfspr(SPRN_SIAR);
-   host_os_sprs->sier1 = mfspr(SPRN_SIER);
-
-   if (cpu_has_feature(CPU_FTR_ARCH_31)) {
-   host_os_sprs->mmcr3 = mfspr(SPRN_MMCR3);
-   host_os_sprs->sier2 = mfspr(SPRN_SIER2);
-   host_os_sprs->sier3 = mfspr(SPRN_SIER3);
-   }
-   }
-
-#ifdef CONFIG_PPC_PSERIES
-   /* After saving PMU, before loading guest PMU, flip pmcregs_in_use */
-   if (kvmhv_on_pseries()) {
-   barrier();
-   get_lppaca()->pmcregs_in_use = load_pmu;
-   barrier();
-   }
-#endif
-
- 

[PATCH v3 28/52] KVM: PPC: Book3S HV P9: Move vcpu register save/restore into functions

2021-10-04 Thread Nicholas Piggin
This should make no functional difference but makes the caller easier
to read.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv.c | 65 +++-
 1 file changed, 41 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index e817159cd53f..8d721baf8c6b 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4095,6 +4095,44 @@ static void store_spr_state(struct kvm_vcpu *vcpu)
vcpu->arch.ctrl = mfspr(SPRN_CTRLF);
 }
 
+/* Returns true if current MSR and/or guest MSR may have changed */
+static bool load_vcpu_state(struct kvm_vcpu *vcpu,
+  struct p9_host_os_sprs *host_os_sprs)
+{
+   bool ret = false;
+
+   if (cpu_has_feature(CPU_FTR_TM) ||
+   cpu_has_feature(CPU_FTR_P9_TM_HV_ASSIST)) {
+   kvmppc_restore_tm_hv(vcpu, vcpu->arch.shregs.msr, true);
+   ret = true;
+   }
+
+   load_spr_state(vcpu, host_os_sprs);
+
+   load_fp_state(&vcpu->arch.fp);
+#ifdef CONFIG_ALTIVEC
+   load_vr_state(&vcpu->arch.vr);
+#endif
+   mtspr(SPRN_VRSAVE, vcpu->arch.vrsave);
+
+   return ret;
+}
+
+static void store_vcpu_state(struct kvm_vcpu *vcpu)
+{
+   store_spr_state(vcpu);
+
+   store_fp_state(&vcpu->arch.fp);
+#ifdef CONFIG_ALTIVEC
+   store_vr_state(&vcpu->arch.vr);
+#endif
+   vcpu->arch.vrsave = mfspr(SPRN_VRSAVE);
+
+   if (cpu_has_feature(CPU_FTR_TM) ||
+   cpu_has_feature(CPU_FTR_P9_TM_HV_ASSIST))
+   kvmppc_save_tm_hv(vcpu, vcpu->arch.shregs.msr, true);
+}
+
 static void save_p9_host_os_sprs(struct p9_host_os_sprs *host_os_sprs)
 {
host_os_sprs->dscr = mfspr(SPRN_DSCR);
@@ -4203,19 +4241,8 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
 
vcpu_vpa_increment_dispatch(vcpu);
 
-   if (cpu_has_feature(CPU_FTR_TM) ||
-   cpu_has_feature(CPU_FTR_P9_TM_HV_ASSIST)) {
-   kvmppc_restore_tm_hv(vcpu, vcpu->arch.shregs.msr, true);
-   msr = mfmsr(); /* TM restore can update msr */
-   }
-
-   load_spr_state(vcpu, &host_os_sprs);
-
-   load_fp_state(&vcpu->arch.fp);
-#ifdef CONFIG_ALTIVEC
-   load_vr_state(&vcpu->arch.vr);
-#endif
-   mtspr(SPRN_VRSAVE, vcpu->arch.vrsave);
+   if (unlikely(load_vcpu_state(vcpu, &host_os_sprs)))
+   msr = mfmsr(); /* MSR may have been updated */
 
switch_pmu_to_guest(vcpu, &host_os_sprs);
 
@@ -4319,17 +4346,7 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
 
switch_pmu_to_host(vcpu, &host_os_sprs);
 
-   store_spr_state(vcpu);
-
-   store_fp_state(&vcpu->arch.fp);
-#ifdef CONFIG_ALTIVEC
-   store_vr_state(&vcpu->arch.vr);
-#endif
-   vcpu->arch.vrsave = mfspr(SPRN_VRSAVE);
-
-   if (cpu_has_feature(CPU_FTR_TM) ||
-   cpu_has_feature(CPU_FTR_P9_TM_HV_ASSIST))
-   kvmppc_save_tm_hv(vcpu, vcpu->arch.shregs.msr, true);
+   store_vcpu_state(vcpu);
 
vcpu_vpa_increment_dispatch(vcpu);
 
-- 
2.23.0



[PATCH v3 27/52] KVM: PPC: Book3S HV P9: Juggle SPR switching around

2021-10-04 Thread Nicholas Piggin
This juggles SPR switching on the entry and exit sides to be more
symmetric, which makes the next refactoring patch possible with no
functional change.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 460290cc79af..e817159cd53f 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4209,7 +4209,7 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
msr = mfmsr(); /* TM restore can update msr */
}
 
-   switch_pmu_to_guest(vcpu, &host_os_sprs);
+   load_spr_state(vcpu, &host_os_sprs);
 
load_fp_state(&vcpu->arch.fp);
 #ifdef CONFIG_ALTIVEC
@@ -4217,7 +4217,7 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
 #endif
mtspr(SPRN_VRSAVE, vcpu->arch.vrsave);
 
-   load_spr_state(vcpu, &host_os_sprs);
+   switch_pmu_to_guest(vcpu, &host_os_sprs);
 
if (kvmhv_on_pseries()) {
/*
@@ -4317,6 +4317,8 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
vcpu->arch.slb_max = 0;
}
 
+   switch_pmu_to_host(vcpu, &host_os_sprs);
+
store_spr_state(vcpu);
 
store_fp_state(&vcpu->arch.fp);
@@ -4331,8 +4333,6 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
 
vcpu_vpa_increment_dispatch(vcpu);
 
-   switch_pmu_to_host(vcpu, &host_os_sprs);
-
timer_rearm_host_dec(*tb);
 
restore_p9_host_os_sprs(vcpu, &host_os_sprs);
-- 
2.23.0



[PATCH v3 26/52] KVM: PPC: Book3S HV P9: Only execute mtSPR if the value changed

2021-10-04 Thread Nicholas Piggin
Keep better track of current SPR values in places where they are to be
loaded with a new context, to reduce expensive mtSPR operations.
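
The comparison is worth it because mtspr to these registers is costly
(and in some cases execution serialising) while the compare is nearly
free. The idiom used throughout is equivalent to this illustrative
macro (not part of the patch):

	#define MTSPR_IF_CHANGED(sprn, cur, new)		\
		do {						\
			if ((cur) != (new))			\
				mtspr((sprn), (new));		\
		} while (0)

	/* e.g. MTSPR_IF_CHANGED(SPRN_IAMR, host_os_sprs->iamr, vcpu->arch.iamr) */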

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv.c | 51 ++--
 1 file changed, 31 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 823d64047d01..460290cc79af 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4042,20 +4042,28 @@ static void switch_pmu_to_host(struct kvm_vcpu *vcpu,
}
 }
 
-static void load_spr_state(struct kvm_vcpu *vcpu)
+static void load_spr_state(struct kvm_vcpu *vcpu,
+   struct p9_host_os_sprs *host_os_sprs)
 {
-   mtspr(SPRN_DSCR, vcpu->arch.dscr);
-   mtspr(SPRN_IAMR, vcpu->arch.iamr);
-   mtspr(SPRN_PSPB, vcpu->arch.pspb);
-   mtspr(SPRN_FSCR, vcpu->arch.fscr);
mtspr(SPRN_TAR, vcpu->arch.tar);
mtspr(SPRN_EBBHR, vcpu->arch.ebbhr);
mtspr(SPRN_EBBRR, vcpu->arch.ebbrr);
mtspr(SPRN_BESCR, vcpu->arch.bescr);
+
if (cpu_has_feature(CPU_FTR_P9_TIDR))
mtspr(SPRN_TIDR, vcpu->arch.tid);
-   mtspr(SPRN_AMR, vcpu->arch.amr);
-   mtspr(SPRN_UAMOR, vcpu->arch.uamor);
+   if (host_os_sprs->iamr != vcpu->arch.iamr)
+   mtspr(SPRN_IAMR, vcpu->arch.iamr);
+   if (host_os_sprs->amr != vcpu->arch.amr)
+   mtspr(SPRN_AMR, vcpu->arch.amr);
+   if (vcpu->arch.uamor != 0)
+   mtspr(SPRN_UAMOR, vcpu->arch.uamor);
+   if (host_os_sprs->fscr != vcpu->arch.fscr)
+   mtspr(SPRN_FSCR, vcpu->arch.fscr);
+   if (host_os_sprs->dscr != vcpu->arch.dscr)
+   mtspr(SPRN_DSCR, vcpu->arch.dscr);
+   if (vcpu->arch.pspb != 0)
+   mtspr(SPRN_PSPB, vcpu->arch.pspb);
 
/*
 * DAR, DSISR, and for nested HV, SPRGs must be set with MSR[RI]
@@ -4070,20 +4078,21 @@ static void load_spr_state(struct kvm_vcpu *vcpu)
 
 static void store_spr_state(struct kvm_vcpu *vcpu)
 {
-   vcpu->arch.ctrl = mfspr(SPRN_CTRLF);
-
-   vcpu->arch.iamr = mfspr(SPRN_IAMR);
-   vcpu->arch.pspb = mfspr(SPRN_PSPB);
-   vcpu->arch.fscr = mfspr(SPRN_FSCR);
vcpu->arch.tar = mfspr(SPRN_TAR);
vcpu->arch.ebbhr = mfspr(SPRN_EBBHR);
vcpu->arch.ebbrr = mfspr(SPRN_EBBRR);
vcpu->arch.bescr = mfspr(SPRN_BESCR);
+
if (cpu_has_feature(CPU_FTR_P9_TIDR))
vcpu->arch.tid = mfspr(SPRN_TIDR);
+   vcpu->arch.iamr = mfspr(SPRN_IAMR);
vcpu->arch.amr = mfspr(SPRN_AMR);
vcpu->arch.uamor = mfspr(SPRN_UAMOR);
+   vcpu->arch.fscr = mfspr(SPRN_FSCR);
vcpu->arch.dscr = mfspr(SPRN_DSCR);
+   vcpu->arch.pspb = mfspr(SPRN_PSPB);
+
+   vcpu->arch.ctrl = mfspr(SPRN_CTRLF);
 }
 
 static void save_p9_host_os_sprs(struct p9_host_os_sprs *host_os_sprs)
@@ -4094,6 +4103,7 @@ static void save_p9_host_os_sprs(struct p9_host_os_sprs 
*host_os_sprs)
host_os_sprs->iamr = mfspr(SPRN_IAMR);
host_os_sprs->amr = mfspr(SPRN_AMR);
host_os_sprs->fscr = mfspr(SPRN_FSCR);
+   host_os_sprs->dscr = mfspr(SPRN_DSCR);
 }
 
 /* vcpu guest regs must already be saved */
@@ -4102,19 +4112,20 @@ static void restore_p9_host_os_sprs(struct kvm_vcpu 
*vcpu,
 {
mtspr(SPRN_SPRG_VDSO_WRITE, local_paca->sprg_vdso);
 
-   mtspr(SPRN_PSPB, 0);
-   mtspr(SPRN_UAMOR, 0);
-
-   mtspr(SPRN_DSCR, host_os_sprs->dscr);
if (cpu_has_feature(CPU_FTR_P9_TIDR))
mtspr(SPRN_TIDR, host_os_sprs->tidr);
-   mtspr(SPRN_IAMR, host_os_sprs->iamr);
-
+   if (host_os_sprs->iamr != vcpu->arch.iamr)
+   mtspr(SPRN_IAMR, host_os_sprs->iamr);
+   if (vcpu->arch.uamor != 0)
+   mtspr(SPRN_UAMOR, 0);
if (host_os_sprs->amr != vcpu->arch.amr)
mtspr(SPRN_AMR, host_os_sprs->amr);
-
if (host_os_sprs->fscr != vcpu->arch.fscr)
mtspr(SPRN_FSCR, host_os_sprs->fscr);
+   if (host_os_sprs->dscr != vcpu->arch.dscr)
+   mtspr(SPRN_DSCR, host_os_sprs->dscr);
+   if (vcpu->arch.pspb != 0)
+   mtspr(SPRN_PSPB, 0);
 
/* Save guest CTRL register, set runlatch to 1 */
if (!(vcpu->arch.ctrl & 1))
@@ -4206,7 +4217,7 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
 #endif
mtspr(SPRN_VRSAVE, vcpu->arch.vrsave);
 
-   load_spr_state(vcpu);
+   load_spr_state(vcpu, &host_os_sprs);
 
if (kvmhv_on_pseries()) {
/*
-- 
2.23.0



[PATCH v3 25/52] KVM: PPC: Book3S HV P9: Avoid SPR scoreboard stalls

2021-10-04 Thread Nicholas Piggin
Avoid interleaving mfSPR and mtSPR to reduce SPR scoreboard stalls.
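
Reduced to a toy example, the shape of the change is to batch the reads
and batch the writes rather than alternating them (illustrative
ordering only; the real hunks are below):

	/* before: each mtspr sits behind the preceding unrelated mfspr */
	vc->dpdes = mfspr(SPRN_DPDES);
	mtspr(SPRN_DPDES, 0);
	vc->vtb = mfspr(SPRN_VTB);
	mtspr(SPRN_PURR, local_paca->kvm_hstate.host_purr);

	/* after: group the mfsprs, then group the mtsprs */
	vc->dpdes = mfspr(SPRN_DPDES);
	vc->vtb = mfspr(SPRN_VTB);
	mtspr(SPRN_DPDES, 0);
	mtspr(SPRN_PURR, local_paca->kvm_hstate.host_purr);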

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv.c  |  8 
 arch/powerpc/kvm/book3s_hv_p9_entry.c | 19 +++
 2 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index f3c052b8b7ee..823d64047d01 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4308,10 +4308,6 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
 
store_spr_state(vcpu);
 
-   timer_rearm_host_dec(*tb);
-
-   restore_p9_host_os_sprs(vcpu, &host_os_sprs);
-
store_fp_state(&vcpu->arch.fp);
 #ifdef CONFIG_ALTIVEC
store_vr_state(&vcpu->arch.vr);
@@ -4326,6 +4322,10 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
 
switch_pmu_to_host(vcpu, &host_os_sprs);
 
+   timer_rearm_host_dec(*tb);
+
+   restore_p9_host_os_sprs(vcpu, &host_os_sprs);
+
vc->entry_exit_map = 0x101;
vc->in_guest = 0;
 
diff --git a/arch/powerpc/kvm/book3s_hv_p9_entry.c 
b/arch/powerpc/kvm/book3s_hv_p9_entry.c
index 2bd96d8256d1..bd0021cd3a67 100644
--- a/arch/powerpc/kvm/book3s_hv_p9_entry.c
+++ b/arch/powerpc/kvm/book3s_hv_p9_entry.c
@@ -228,6 +228,9 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
host_dawrx1 = mfspr(SPRN_DAWRX1);
}
 
+   local_paca->kvm_hstate.host_purr = mfspr(SPRN_PURR);
+   local_paca->kvm_hstate.host_spurr = mfspr(SPRN_SPURR);
+
if (vc->tb_offset) {
u64 new_tb = *tb + vc->tb_offset;
mtspr(SPRN_TBU40, new_tb);
@@ -244,8 +247,6 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
mtspr(SPRN_DPDES, vc->dpdes);
mtspr(SPRN_VTB, vc->vtb);
 
-   local_paca->kvm_hstate.host_purr = mfspr(SPRN_PURR);
-   local_paca->kvm_hstate.host_spurr = mfspr(SPRN_SPURR);
mtspr(SPRN_PURR, vcpu->arch.purr);
mtspr(SPRN_SPURR, vcpu->arch.spurr);
 
@@ -448,10 +449,8 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
/* Advance host PURR/SPURR by the amount used by guest */
purr = mfspr(SPRN_PURR);
spurr = mfspr(SPRN_SPURR);
-   mtspr(SPRN_PURR, local_paca->kvm_hstate.host_purr +
- purr - vcpu->arch.purr);
-   mtspr(SPRN_SPURR, local_paca->kvm_hstate.host_spurr +
- spurr - vcpu->arch.spurr);
+   local_paca->kvm_hstate.host_purr += purr - vcpu->arch.purr;
+   local_paca->kvm_hstate.host_spurr += spurr - vcpu->arch.spurr;
vcpu->arch.purr = purr;
vcpu->arch.spurr = spurr;
 
@@ -464,6 +463,9 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
vcpu->arch.shregs.sprg2 = mfspr(SPRN_SPRG2);
vcpu->arch.shregs.sprg3 = mfspr(SPRN_SPRG3);
 
+   vc->dpdes = mfspr(SPRN_DPDES);
+   vc->vtb = mfspr(SPRN_VTB);
+
dec = mfspr(SPRN_DEC);
if (!(lpcr & LPCR_LD)) /* Sign extend if not using large decrementer */
dec = (s32) dec;
@@ -481,6 +483,9 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
vc->tb_offset_applied = 0;
}
 
+   mtspr(SPRN_PURR, local_paca->kvm_hstate.host_purr);
+   mtspr(SPRN_SPURR, local_paca->kvm_hstate.host_spurr);
+
/* Preserve PSSCR[FAKE_SUSPEND] until we've called kvmppc_save_tm_hv */
mtspr(SPRN_PSSCR, host_psscr |
  (local_paca->kvm_hstate.fake_suspend << PSSCR_FAKE_SUSPEND_LG));
@@ -509,8 +514,6 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
if (cpu_has_feature(CPU_FTR_ARCH_31))
asm volatile(PPC_CP_ABORT);
 
-   vc->dpdes = mfspr(SPRN_DPDES);
-   vc->vtb = mfspr(SPRN_VTB);
mtspr(SPRN_DPDES, 0);
if (vc->pcr)
mtspr(SPRN_PCR, PCR_MASK);
-- 
2.23.0



[PATCH v3 24/52] KVM: PPC: Book3S HV P9: Optimise timebase reads

2021-10-04 Thread Nicholas Piggin
Reduce the number of mftb instructions executed by passing the current
timebase around entry and exit code rather than reading it multiple
times.
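
The shape of the change is to read the timebase once and thread the
value through as a parameter, so a caller now looks roughly like this
(sketch of a call site, not a literal hunk):

	u64 now = mftb();			/* single mftb */

	kvmppc_core_end_stolen(vc, now);	/* was: mftb() inside */
	kvmppc_create_dtl_entry(vcpu, vc, now);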

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/kvm_book3s_64.h |  2 +-
 arch/powerpc/kvm/book3s_hv.c | 88 +---
 arch/powerpc/kvm/book3s_hv_p9_entry.c| 33 +
 3 files changed, 65 insertions(+), 58 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index fff391b9b97b..0a319ed9c2fd 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -154,7 +154,7 @@ static inline bool kvmhv_vcpu_is_radix(struct kvm_vcpu 
*vcpu)
return radix;
 }
 
-int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 time_limit, unsigned long 
lpcr);
+int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 time_limit, unsigned long 
lpcr, u64 *tb);
 
 #define KVM_DEFAULT_HPT_ORDER  24  /* 16MB HPT by default */
 #endif
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 4abe4a24e5e7..f3c052b8b7ee 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -276,22 +276,22 @@ static void kvmppc_fast_vcpu_kick_hv(struct kvm_vcpu 
*vcpu)
  * they should never fail.)
  */
 
-static void kvmppc_core_start_stolen(struct kvmppc_vcore *vc)
+static void kvmppc_core_start_stolen(struct kvmppc_vcore *vc, u64 tb)
 {
unsigned long flags;
 
spin_lock_irqsave(&vc->stoltb_lock, flags);
-   vc->preempt_tb = mftb();
+   vc->preempt_tb = tb;
spin_unlock_irqrestore(&vc->stoltb_lock, flags);
 }
 
-static void kvmppc_core_end_stolen(struct kvmppc_vcore *vc)
+static void kvmppc_core_end_stolen(struct kvmppc_vcore *vc, u64 tb)
 {
unsigned long flags;
 
spin_lock_irqsave(&vc->stoltb_lock, flags);
if (vc->preempt_tb != TB_NIL) {
-   vc->stolen_tb += mftb() - vc->preempt_tb;
+   vc->stolen_tb += tb - vc->preempt_tb;
vc->preempt_tb = TB_NIL;
}
spin_unlock_irqrestore(&vc->stoltb_lock, flags);
@@ -301,6 +301,7 @@ static void kvmppc_core_vcpu_load_hv(struct kvm_vcpu *vcpu, 
int cpu)
 {
struct kvmppc_vcore *vc = vcpu->arch.vcore;
unsigned long flags;
+   u64 now = mftb();
 
/*
 * We can test vc->runner without taking the vcore lock,
@@ -309,12 +310,12 @@ static void kvmppc_core_vcpu_load_hv(struct kvm_vcpu 
*vcpu, int cpu)
 * ever sets it to NULL.
 */
if (vc->runner == vcpu && vc->vcore_state >= VCORE_SLEEPING)
-   kvmppc_core_end_stolen(vc);
+   kvmppc_core_end_stolen(vc, now);
 
spin_lock_irqsave(&vcpu->arch.tbacct_lock, flags);
if (vcpu->arch.state == KVMPPC_VCPU_BUSY_IN_HOST &&
vcpu->arch.busy_preempt != TB_NIL) {
-   vcpu->arch.busy_stolen += mftb() - vcpu->arch.busy_preempt;
+   vcpu->arch.busy_stolen += now - vcpu->arch.busy_preempt;
vcpu->arch.busy_preempt = TB_NIL;
}
spin_unlock_irqrestore(&vcpu->arch.tbacct_lock, flags);
@@ -324,13 +325,14 @@ static void kvmppc_core_vcpu_put_hv(struct kvm_vcpu *vcpu)
 {
struct kvmppc_vcore *vc = vcpu->arch.vcore;
unsigned long flags;
+   u64 now = mftb();
 
if (vc->runner == vcpu && vc->vcore_state >= VCORE_SLEEPING)
-   kvmppc_core_start_stolen(vc);
+   kvmppc_core_start_stolen(vc, now);
 
spin_lock_irqsave(&vcpu->arch.tbacct_lock, flags);
if (vcpu->arch.state == KVMPPC_VCPU_BUSY_IN_HOST)
-   vcpu->arch.busy_preempt = mftb();
+   vcpu->arch.busy_preempt = now;
spin_unlock_irqrestore(&vcpu->arch.tbacct_lock, flags);
 }
 
@@ -685,7 +687,7 @@ static u64 vcore_stolen_time(struct kvmppc_vcore *vc, u64 
now)
 }
 
 static void kvmppc_create_dtl_entry(struct kvm_vcpu *vcpu,
-   struct kvmppc_vcore *vc)
+   struct kvmppc_vcore *vc, u64 tb)
 {
struct dtl_entry *dt;
struct lppaca *vpa;
@@ -696,7 +698,7 @@ static void kvmppc_create_dtl_entry(struct kvm_vcpu *vcpu,
 
dt = vcpu->arch.dtl_ptr;
vpa = vcpu->arch.vpa.pinned_addr;
-   now = mftb();
+   now = tb;
core_stolen = vcore_stolen_time(vc, now);
stolen = core_stolen - vcpu->arch.stolen_logged;
vcpu->arch.stolen_logged = core_stolen;
@@ -2914,14 +2916,14 @@ static void kvmppc_set_timer(struct kvm_vcpu *vcpu)
 extern int __kvmppc_vcore_entry(void);
 
 static void kvmppc_remove_runnable(struct kvmppc_vcore *vc,
-  struct kvm_vcpu *vcpu)
+  struct kvm_vcpu *vcpu, u64 tb)
 {
u64 now;
 
if (vcpu->arch.state != KVMPPC_VCPU_RUNNABLE)
retur

[PATCH v3 23/52] KVM: PPC: Book3S HV P9: Move TB updates

2021-10-04 Thread Nicholas Piggin
Move the TB updates between saving and loading guest and host SPRs,
to improve scheduling by keeping issue-NTC operations together as
much as possible.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv_p9_entry.c | 36 +--
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_p9_entry.c 
b/arch/powerpc/kvm/book3s_hv_p9_entry.c
index 814b0dfd590f..e7793bb806eb 100644
--- a/arch/powerpc/kvm/book3s_hv_p9_entry.c
+++ b/arch/powerpc/kvm/book3s_hv_p9_entry.c
@@ -215,15 +215,6 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
 
vcpu->arch.ceded = 0;
 
-   if (vc->tb_offset) {
-   u64 new_tb = tb + vc->tb_offset;
-   mtspr(SPRN_TBU40, new_tb);
-   tb = mftb();
-   if ((tb & 0xff) < (new_tb & 0xff))
-   mtspr(SPRN_TBU40, new_tb + 0x100);
-   vc->tb_offset_applied = vc->tb_offset;
-   }
-
/* Could avoid mfmsr by passing around, but probably no big deal */
msr = mfmsr();
 
@@ -238,6 +229,15 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
host_dawrx1 = mfspr(SPRN_DAWRX1);
}
 
+   if (vc->tb_offset) {
+   u64 new_tb = tb + vc->tb_offset;
+   mtspr(SPRN_TBU40, new_tb);
+   tb = mftb();
+   if ((tb & 0xffffff) < (new_tb & 0xffffff))
+   mtspr(SPRN_TBU40, new_tb + 0x1000000);
+   vc->tb_offset_applied = vc->tb_offset;
+   }
+
if (vc->pcr)
mtspr(SPRN_PCR, vc->pcr | PCR_MASK);
mtspr(SPRN_DPDES, vc->dpdes);
@@ -469,6 +469,15 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
tb = mftb();
vcpu->arch.dec_expires = dec + tb;
 
+   if (vc->tb_offset_applied) {
+   u64 new_tb = tb - vc->tb_offset_applied;
+   mtspr(SPRN_TBU40, new_tb);
+   tb = mftb();
+   if ((tb & 0xffffff) < (new_tb & 0xffffff))
+   mtspr(SPRN_TBU40, new_tb + 0x1000000);
+   vc->tb_offset_applied = 0;
+   }
+
/* Preserve PSSCR[FAKE_SUSPEND] until we've called kvmppc_save_tm_hv */
mtspr(SPRN_PSSCR, host_psscr |
  (local_paca->kvm_hstate.fake_suspend << PSSCR_FAKE_SUSPEND_LG));
@@ -503,15 +512,6 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
if (vc->pcr)
mtspr(SPRN_PCR, PCR_MASK);
 
-   if (vc->tb_offset_applied) {
-   u64 new_tb = mftb() - vc->tb_offset_applied;
-   mtspr(SPRN_TBU40, new_tb);
-   tb = mftb();
-   if ((tb & 0xffffff) < (new_tb & 0xffffff))
-   mtspr(SPRN_TBU40, new_tb + 0x1000000);
-   vc->tb_offset_applied = 0;
-   }
-
/* HDEC must be at least as large as DEC, so decrementer_max fits */
mtspr(SPRN_HDEC, decrementer_max);
 
-- 
2.23.0



[PATCH v3 22/52] KVM: PPC: Book3S HV: Change dec_expires to be relative to guest timebase

2021-10-04 Thread Nicholas Piggin
Change dec_expires to be relative to the guest timebase, and allow
it to be moved into low level P9 guest entry functions, to improve
SPR access scheduling.

Signed-off-by: Nicholas Piggin 
---
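For illustration only (not part of the patch): with dec_expires stored
relative to the guest timebase, converting back to the host view is a single
subtraction of the vcore tb_offset, as the new kvmppc_dec_expires_host_tb()
helper does. A standalone sketch, assuming the usual convention
guest_tb = host_tb + tb_offset:

#include <stdint.h>
#include <stdio.h>

/* guest_tb = host_tb + tb_offset, so an expiry stored in guest timebase
 * units is converted back to the host view by subtracting the offset. */
static uint64_t dec_expires_host_tb(uint64_t dec_expires_guest, int64_t tb_offset)
{
	return dec_expires_guest - tb_offset;
}

int main(void)
{
	int64_t tb_offset = 0x1000;		/* guest runs "ahead" of the host */
	uint64_t host_tb = 5000000;
	uint64_t guest_tb = host_tb + tb_offset;
	uint64_t dec_expires = guest_tb + 200000;	/* stored relative to guest TB */

	printf("expires in %llu host ticks\n",
	       (unsigned long long)(dec_expires_host_tb(dec_expires, tb_offset) - host_tb));
	return 0;
}
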
 arch/powerpc/include/asm/kvm_book3s.h   |  6 +++
 arch/powerpc/include/asm/kvm_host.h |  2 +-
 arch/powerpc/kvm/book3s_hv.c| 58 +
 arch/powerpc/kvm/book3s_hv_nested.c |  3 ++
 arch/powerpc/kvm/book3s_hv_p9_entry.c   | 10 -
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 13 --
 6 files changed, 49 insertions(+), 43 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index caaa0f592d8e..15b573671f99 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -406,6 +406,12 @@ static inline ulong kvmppc_get_fault_dar(struct kvm_vcpu 
*vcpu)
return vcpu->arch.fault_dar;
 }
 
+/* Expiry time of vcpu DEC relative to host TB */
+static inline u64 kvmppc_dec_expires_host_tb(struct kvm_vcpu *vcpu)
+{
+   return vcpu->arch.dec_expires - vcpu->arch.vcore->tb_offset;
+}
+
 static inline bool is_kvmppc_resume_guest(int r)
 {
return (r == RESUME_GUEST || r == RESUME_GUEST_NV);
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 080a7feb7731..c5fc4d016695 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -741,7 +741,7 @@ struct kvm_vcpu_arch {
 
struct hrtimer dec_timer;
u64 dec_jiffies;
-   u64 dec_expires;
+   u64 dec_expires;/* Relative to guest timebase. */
unsigned long pending_exceptions;
u8 ceded;
u8 prodded;
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 0a711d929531..4abe4a24e5e7 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -2261,8 +2261,7 @@ static int kvmppc_get_one_reg_hv(struct kvm_vcpu *vcpu, 
u64 id,
*val = get_reg_val(id, vcpu->arch.vcore->arch_compat);
break;
case KVM_REG_PPC_DEC_EXPIRY:
-   *val = get_reg_val(id, vcpu->arch.dec_expires +
-  vcpu->arch.vcore->tb_offset);
+   *val = get_reg_val(id, vcpu->arch.dec_expires);
break;
case KVM_REG_PPC_ONLINE:
*val = get_reg_val(id, vcpu->arch.online);
@@ -2514,8 +2513,7 @@ static int kvmppc_set_one_reg_hv(struct kvm_vcpu *vcpu, 
u64 id,
r = kvmppc_set_arch_compat(vcpu, set_reg_val(id, *val));
break;
case KVM_REG_PPC_DEC_EXPIRY:
-   vcpu->arch.dec_expires = set_reg_val(id, *val) -
-   vcpu->arch.vcore->tb_offset;
+   vcpu->arch.dec_expires = set_reg_val(id, *val);
break;
case KVM_REG_PPC_ONLINE:
i = set_reg_val(id, *val);
@@ -2902,13 +2900,13 @@ static void kvmppc_set_timer(struct kvm_vcpu *vcpu)
unsigned long dec_nsec, now;
 
now = get_tb();
-   if (now > vcpu->arch.dec_expires) {
+   if (now > kvmppc_dec_expires_host_tb(vcpu)) {
/* decrementer has already gone negative */
kvmppc_core_queue_dec(vcpu);
kvmppc_core_prepare_to_enter(vcpu);
return;
}
-   dec_nsec = tb_to_ns(vcpu->arch.dec_expires - now);
+   dec_nsec = tb_to_ns(kvmppc_dec_expires_host_tb(vcpu) - now);
hrtimer_start(>arch.dec_timer, dec_nsec, HRTIMER_MODE_REL);
vcpu->arch.timer_running = 1;
 }
@@ -3380,7 +3378,7 @@ static void post_guest_process(struct kvmppc_vcore *vc, 
bool is_master)
 */
spin_unlock(>lock);
/* cancel pending dec exception if dec is positive */
-   if (now < vcpu->arch.dec_expires &&
+   if (now < kvmppc_dec_expires_host_tb(vcpu) &&
kvmppc_core_pending_dec(vcpu))
kvmppc_core_dequeue_dec(vcpu);
 
@@ -4211,20 +4209,6 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
 
load_spr_state(vcpu);
 
-   /*
-* When setting DEC, we must always deal with irq_work_raise via NMI vs
-* setting DEC. The problem occurs right as we switch into guest mode
-* if a NMI hits and sets pending work and sets DEC, then that will
-* apply to the guest and not bring us back to the host.
-*
-* irq_work_raise could check a flag (or possibly LPCR[HDICE] for
-* example) and set HDEC to 1? That wouldn't solve the nested hv
-* case which needs to abort the hcall or zero the time limit.
-*
-* XXX: Another day's problem.
-*/
-   mtspr(SPRN_DEC, vcpu->arch.dec_expires - tb);
-
if (kvmhv_on_pseries()) {
/*

[PATCH v3 21/52] KVM: PPC: Book3S HV P9: Add kvmppc_stop_thread to match kvmppc_start_thread

2021-10-04 Thread Nicholas Piggin
Small cleanup makes it a bit easier to match up entry and exit
operations.

Reviewed-by: Fabiano Rosas 
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 7e8ddffd61c7..0a711d929531 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3070,6 +3070,13 @@ static void kvmppc_start_thread(struct kvm_vcpu *vcpu, 
struct kvmppc_vcore *vc)
kvmppc_ipi_thread(cpu);
 }
 
+/* Old path does this in asm */
+static void kvmppc_stop_thread(struct kvm_vcpu *vcpu)
+{
+   vcpu->cpu = -1;
+   vcpu->arch.thread_cpu = -1;
+}
+
 static void kvmppc_wait_for_nap(int n_threads)
 {
int cpu = smp_processor_id();
@@ -4297,8 +4304,6 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
dec = (s32) dec;
tb = mftb();
vcpu->arch.dec_expires = dec + tb;
-   vcpu->cpu = -1;
-   vcpu->arch.thread_cpu = -1;
 
store_spr_state(vcpu);
 
@@ -4782,6 +4787,8 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 
time_limit,
 
guest_exit_irqoff();
 
+   kvmppc_stop_thread(vcpu);
+
powerpc_local_irq_pmu_restore(flags);
 
cpumask_clear_cpu(pcpu, >arch.cpu_in_guest);
-- 
2.23.0



[PATCH v3 20/52] KVM: PPC: Book3S HV P9: Improve mtmsrd scheduling by delaying MSR[EE] disable

2021-10-04 Thread Nicholas Piggin
Moving the mtmsrd after the host SPRs are saved and before the guest
SPRs start to be loaded can prevent an SPR scoreboard stall (because
the mtmsrd is L=1 type, which does not cause context synchronisation).

This also makes it more convenient to combine with the mtmsrd L=0
instruction that enables facilities just below, but that is not done yet.

Signed-off-by: Nicholas Piggin 
---
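For illustration only (not part of the patch): the ordering being preserved
here is "hard-disable interrupts, then check whether anything became pending
while they were only lazily disabled, and bail out before committing to guest
entry". A toy userspace model of that check (the flag merely stands in for
lazy_irq_pending(); none of this is the real implementation):

#include <stdbool.h>
#include <stdio.h>

static bool irqs_hard_disabled;
static bool lazy_pending;		/* stands in for lazy_irq_pending() */

static void hard_irq_disable(void)
{
	irqs_hard_disabled = true;
}

/* Returns 1 if we "entered the guest", 0 if a pending interrupt made us
 * bail out before committing, mirroring the early 'return 0' in the patch. */
static int try_enter_guest(void)
{
	hard_irq_disable();
	if (lazy_pending)
		return 0;
	return 1;
}

int main(void)
{
	int entered;

	lazy_pending = true;
	entered = try_enter_guest();
	printf("pending irq:     entered=%d hard_disabled=%d\n", entered, irqs_hard_disabled);

	lazy_pending = false;
	entered = try_enter_guest();
	printf("nothing pending: entered=%d hard_disabled=%d\n", entered, irqs_hard_disabled);
	return 0;
}
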
 arch/powerpc/kvm/book3s_hv.c | 23 ++-
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 16365c0e9872..7e8ddffd61c7 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4156,6 +4156,18 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
 
save_p9_host_os_sprs(_os_sprs);
 
+   /*
+* This could be combined with MSR[RI] clearing, but that expands
+* the unrecoverable window. It would be better to cover unrecoverable
+* with KVM bad interrupt handling rather than use MSR[RI] at all.
+*
+* Much more difficult and less worthwhile to combine with IR/DR
+* disable.
+*/
+   hard_irq_disable();
+   if (lazy_irq_pending())
+   return 0;
+
/* MSR bits may have been cleared by context switch */
msr = 0;
if (IS_ENABLED(CONFIG_PPC_FPU))
@@ -4667,6 +4679,7 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 
time_limit,
struct kvmppc_vcore *vc;
struct kvm *kvm = vcpu->kvm;
struct kvm_nested_guest *nested = vcpu->arch.nested;
+   unsigned long flags;
 
trace_kvmppc_run_vcpu_enter(vcpu);
 
@@ -4710,11 +4723,11 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 
time_limit,
if (kvm_is_radix(kvm))
kvmppc_prepare_radix_vcpu(vcpu, pcpu);
 
-   local_irq_disable();
-   hard_irq_disable();
+   /* flags save not required, but irq_pmu has no disable/enable API */
+   powerpc_local_irq_pmu_save(flags);
if (signal_pending(current))
goto sigpend;
-   if (lazy_irq_pending() || need_resched() || !kvm->arch.mmu_ready)
+   if (need_resched() || !kvm->arch.mmu_ready)
goto out;
 
if (!nested) {
@@ -4769,7 +4782,7 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 
time_limit,
 
guest_exit_irqoff();
 
-   local_irq_enable();
+   powerpc_local_irq_pmu_restore(flags);
 
cpumask_clear_cpu(pcpu, >arch.cpu_in_guest);
 
@@ -4827,7 +4840,7 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 
time_limit,
run->exit_reason = KVM_EXIT_INTR;
vcpu->arch.ret = -EINTR;
  out:
-   local_irq_enable();
+   powerpc_local_irq_pmu_restore(flags);
preempt_enable();
goto done;
 }
-- 
2.23.0



[PATCH v3 19/52] KVM: PPC: Book3S HV P9: Reduce mtmsrd instructions required to save host SPRs

2021-10-04 Thread Nicholas Piggin
This reduces the number of mtmsrd required to enable facility bits when
saving/restoring registers, by having the KVM code set all bits up front
rather than using individual facility functions that set their particular
MSR bits.

Signed-off-by: Nicholas Piggin 
---
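For illustration only (not part of the patch): the pattern is to accumulate
every facility bit the entry path will need and enable them with one MSR
update instead of one update per facility. A standalone model with made-up
bit values in place of MSR_FP/MSR_VEC/MSR_VSX/MSR_TM and a plain variable
standing in for the MSR:

#include <stdbool.h>
#include <stdio.h>

/* Made-up bit positions, for the model only. */
#define MSR_FP	(1ul << 0)
#define MSR_VEC	(1ul << 1)
#define MSR_VSX	(1ul << 2)
#define MSR_TM	(1ul << 3)

static unsigned long cpu_msr;		/* stands in for the real MSR */
static int msr_updates;			/* how many "mtmsrd" updates we issued */

/* Model of msr_check_and_set(): enable all requested bits in one update. */
static unsigned long msr_check_and_set(unsigned long bits)
{
	if ((cpu_msr & bits) != bits) {
		cpu_msr |= bits;
		msr_updates++;
	}
	return cpu_msr;
}

int main(void)
{
	bool has_altivec = true, has_vsx = true, has_tm = false;
	unsigned long msr = 0;

	msr |= MSR_FP;
	if (has_altivec)
		msr |= MSR_VEC;
	if (has_vsx)
		msr |= MSR_VSX;
	if (has_tm)
		msr |= MSR_TM;
	msr = msr_check_and_set(msr);

	printf("msr=%#lx, updates issued=%d\n", msr, msr_updates);
	return 0;
}
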
 arch/powerpc/include/asm/switch_to.h  |  2 +
 arch/powerpc/kernel/process.c | 28 +
 arch/powerpc/kvm/book3s_hv.c  | 59 ++-
 arch/powerpc/kvm/book3s_hv_p9_entry.c |  1 +
 4 files changed, 71 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/include/asm/switch_to.h 
b/arch/powerpc/include/asm/switch_to.h
index 9d1fbd8be1c7..e8013cd6b646 100644
--- a/arch/powerpc/include/asm/switch_to.h
+++ b/arch/powerpc/include/asm/switch_to.h
@@ -112,6 +112,8 @@ static inline void clear_task_ebb(struct task_struct *t)
 #endif
 }
 
+void kvmppc_save_user_regs(void);
+
 extern int set_thread_tidr(struct task_struct *t);
 
 #endif /* _ASM_POWERPC_SWITCH_TO_H */
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 50436b52c213..3fca321b820d 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1156,6 +1156,34 @@ static inline void save_sprs(struct thread_struct *t)
 #endif
 }
 
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+void kvmppc_save_user_regs(void)
+{
+   unsigned long usermsr;
+
+   if (!current->thread.regs)
+   return;
+
+   usermsr = current->thread.regs->msr;
+
+   if (usermsr & MSR_FP)
+   save_fpu(current);
+
+   if (usermsr & MSR_VEC)
+   save_altivec(current);
+
+#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
+   if (usermsr & MSR_TM) {
+   current->thread.tm_tfhar = mfspr(SPRN_TFHAR);
+   current->thread.tm_tfiar = mfspr(SPRN_TFIAR);
+   current->thread.tm_texasr = mfspr(SPRN_TEXASR);
+   current->thread.regs->msr &= ~MSR_TM;
+   }
+#endif
+}
+EXPORT_SYMBOL_GPL(kvmppc_save_user_regs);
+#endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
+
 static inline void restore_sprs(struct thread_struct *old_thread,
struct thread_struct *new_thread)
 {
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index fca89ed2244f..16365c0e9872 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4140,6 +4140,7 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
struct p9_host_os_sprs host_os_sprs;
s64 dec;
u64 tb, next_timer;
+   unsigned long msr;
int trap;
 
WARN_ON_ONCE(vcpu->arch.ceded);
@@ -4151,8 +4152,23 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
if (next_timer < time_limit)
time_limit = next_timer;
 
+   vcpu->arch.ceded = 0;
+
save_p9_host_os_sprs(_os_sprs);
 
+   /* MSR bits may have been cleared by context switch */
+   msr = 0;
+   if (IS_ENABLED(CONFIG_PPC_FPU))
+   msr |= MSR_FP;
+   if (cpu_has_feature(CPU_FTR_ALTIVEC))
+   msr |= MSR_VEC;
+   if (cpu_has_feature(CPU_FTR_VSX))
+   msr |= MSR_VSX;
+   if (cpu_has_feature(CPU_FTR_TM) ||
+   cpu_has_feature(CPU_FTR_P9_TM_HV_ASSIST))
+   msr |= MSR_TM;
+   msr = msr_check_and_set(msr);
+
kvmppc_subcore_enter_guest();
 
vc->entry_exit_map = 1;
@@ -4161,12 +4177,13 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
vcpu_vpa_increment_dispatch(vcpu);
 
if (cpu_has_feature(CPU_FTR_TM) ||
-   cpu_has_feature(CPU_FTR_P9_TM_HV_ASSIST))
+   cpu_has_feature(CPU_FTR_P9_TM_HV_ASSIST)) {
kvmppc_restore_tm_hv(vcpu, vcpu->arch.shregs.msr, true);
+   msr = mfmsr(); /* TM restore can update msr */
+   }
 
switch_pmu_to_guest(vcpu, _os_sprs);
 
-   msr_check_and_set(MSR_FP | MSR_VEC | MSR_VSX);
load_fp_state(>arch.fp);
 #ifdef CONFIG_ALTIVEC
load_vr_state(>arch.vr);
@@ -4275,7 +4292,6 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
 
restore_p9_host_os_sprs(vcpu, _os_sprs);
 
-   msr_check_and_set(MSR_FP | MSR_VEC | MSR_VSX);
store_fp_state(>arch.fp);
 #ifdef CONFIG_ALTIVEC
store_vr_state(>arch.vr);
@@ -4825,19 +4841,24 @@ static int kvmppc_vcpu_run_hv(struct kvm_vcpu *vcpu)
unsigned long user_tar = 0;
unsigned int user_vrsave;
struct kvm *kvm;
+   unsigned long msr;
 
if (!vcpu->arch.sane) {
run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
return -EINVAL;
}
 
+   /* No need to go into the guest when all we'll do is come back out */
+   if (signal_pending(current)) {
+   run->exit_reason = KVM_EXIT_INTR;
+   return -EINTR;
+   }
+
+#ifdef CONFIG_PPC_TRANSACTIO

[PATCH v3 18/52] KVM: PPC: Book3S HV P9: Move SPRG restore to restore_p9_host_os_sprs

2021-10-04 Thread Nicholas Piggin
Move the SPR update into its relevant helper function. This will
help with SPR scheduling improvements in later changes.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 1c5b81bd02c1..fca89ed2244f 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4093,6 +4093,8 @@ static void save_p9_host_os_sprs(struct p9_host_os_sprs 
*host_os_sprs)
 static void restore_p9_host_os_sprs(struct kvm_vcpu *vcpu,
struct p9_host_os_sprs *host_os_sprs)
 {
+   mtspr(SPRN_SPRG_VDSO_WRITE, local_paca->sprg_vdso);
+
mtspr(SPRN_PSPB, 0);
mtspr(SPRN_UAMOR, 0);
 
@@ -4293,8 +4295,6 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
 
timer_rearm_host_dec(tb);
 
-   mtspr(SPRN_SPRG_VDSO_WRITE, local_paca->sprg_vdso);
-
kvmppc_subcore_exit_guest();
 
return trap;
-- 
2.23.0



[PATCH v3 17/52] KVM: PPC: Book3S HV: CTRL SPR does not require read-modify-write

2021-10-04 Thread Nicholas Piggin
Processors that support KVM HV do not require read-modify-write of
the CTRL SPR to set/clear their thread's runlatch. Just write 1 or 0
to it.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv.c|  2 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 15 ++-
 2 files changed, 7 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index f0ad3fb2eabd..1c5b81bd02c1 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4058,7 +4058,7 @@ static void load_spr_state(struct kvm_vcpu *vcpu)
 */
 
if (!(vcpu->arch.ctrl & 1))
-   mtspr(SPRN_CTRLT, mfspr(SPRN_CTRLF) & ~1);
+   mtspr(SPRN_CTRLT, 0);
 }
 
 static void store_spr_state(struct kvm_vcpu *vcpu)
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 7fa0df632f89..070e228b3c20 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -775,12 +775,11 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)
mtspr   SPRN_AMR,r5
mtspr   SPRN_UAMOR,r6
 
-   /* Restore state of CTRL run bit; assume 1 on entry */
+   /* Restore state of CTRL run bit; the host currently has it set to 1 */
lwz r5,VCPU_CTRL(r4)
andi.   r5,r5,1
bne 4f
-   mfspr   r6,SPRN_CTRLF
-   clrrdi  r6,r6,1
+   li  r6,0
mtspr   SPRN_CTRLT,r6
 4:
/* Secondary threads wait for primary to have done partition switch */
@@ -1203,12 +1202,12 @@ guest_bypass:
stw r0, VCPU_CPU(r9)
stw r0, VCPU_THREAD_CPU(r9)
 
-   /* Save guest CTRL register, set runlatch to 1 */
+   /* Save guest CTRL register, set runlatch to 1 if it was clear */
mfspr   r6,SPRN_CTRLF
stw r6,VCPU_CTRL(r9)
andi.   r0,r6,1
bne 4f
-   ori r6,r6,1
+   li  r6,1
mtspr   SPRN_CTRLT,r6
 4:
/*
@@ -2178,8 +2177,7 @@ END_FTR_SECTION_IFCLR(CPU_FTR_TM)
 * Also clear the runlatch bit before napping.
 */
 kvm_do_nap:
-   mfspr   r0, SPRN_CTRLF
-   clrrdi  r0, r0, 1
+   li  r0,0
mtspr   SPRN_CTRLT, r0
 
li  r0,1
@@ -2198,8 +2196,7 @@ kvm_nap_sequence: /* desired LPCR value in r5 */
 
bl  isa206_idle_insn_mayloss
 
-   mfspr   r0, SPRN_CTRLF
-   ori r0, r0, 1
+   li  r0,1
mtspr   SPRN_CTRLT, r0
 
mtspr   SPRN_SRR1, r3
-- 
2.23.0



[PATCH v3 16/52] KVM: PPC: Book3S HV P9: Factor out yield_count increment

2021-10-04 Thread Nicholas Piggin
Factor duplicated code into a helper function.

Reviewed-by: Fabiano Rosas 
Signed-off-by: Nicholas Piggin 
---
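For illustration only (not part of the patch): the factored-out helper bumps
a big-endian 32-bit field in place. The same read/convert/increment/convert/
write sequence in userspace, with htonl/ntohl standing in for
cpu_to_be32/be32_to_cpu:

#include <arpa/inet.h>		/* htonl/ntohl */
#include <stdint.h>
#include <stdio.h>

struct lppaca_model {
	uint32_t yield_count;	/* stored big-endian, as in the real lppaca */
};

static void vpa_increment_dispatch(struct lppaca_model *lp)
{
	if (!lp)
		return;		/* no VPA registered: nothing to update */
	uint32_t yield_count = ntohl(lp->yield_count) + 1;
	lp->yield_count = htonl(yield_count);
}

int main(void)
{
	struct lppaca_model lp = { .yield_count = htonl(41) };

	vpa_increment_dispatch(&lp);
	printf("yield_count = %u\n", ntohl(lp.yield_count));
	return 0;
}
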
 arch/powerpc/kvm/book3s_hv.c | 24 
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 6bbd670658b9..f0ad3fb2eabd 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4118,6 +4118,16 @@ static inline bool hcall_is_xics(unsigned long req)
req == H_IPOLL || req == H_XIRR || req == H_XIRR_X;
 }
 
+static void vcpu_vpa_increment_dispatch(struct kvm_vcpu *vcpu)
+{
+   struct lppaca *lp = vcpu->arch.vpa.pinned_addr;
+   if (lp) {
+   u32 yield_count = be32_to_cpu(lp->yield_count) + 1;
+   lp->yield_count = cpu_to_be32(yield_count);
+   vcpu->arch.vpa.dirty = 1;
+   }
+}
+
 /*
  * Guest entry for POWER9 and later CPUs.
  */
@@ -4146,12 +4156,7 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
vc->entry_exit_map = 1;
vc->in_guest = 1;
 
-   if (vcpu->arch.vpa.pinned_addr) {
-   struct lppaca *lp = vcpu->arch.vpa.pinned_addr;
-   u32 yield_count = be32_to_cpu(lp->yield_count) + 1;
-   lp->yield_count = cpu_to_be32(yield_count);
-   vcpu->arch.vpa.dirty = 1;
-   }
+   vcpu_vpa_increment_dispatch(vcpu);
 
if (cpu_has_feature(CPU_FTR_TM) ||
cpu_has_feature(CPU_FTR_P9_TM_HV_ASSIST))
@@ -4279,12 +4284,7 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
cpu_has_feature(CPU_FTR_P9_TM_HV_ASSIST))
kvmppc_save_tm_hv(vcpu, vcpu->arch.shregs.msr, true);
 
-   if (vcpu->arch.vpa.pinned_addr) {
-   struct lppaca *lp = vcpu->arch.vpa.pinned_addr;
-   u32 yield_count = be32_to_cpu(lp->yield_count) + 1;
-   lp->yield_count = cpu_to_be32(yield_count);
-   vcpu->arch.vpa.dirty = 1;
-   }
+   vcpu_vpa_increment_dispatch(vcpu);
 
switch_pmu_to_host(vcpu, _os_sprs);
 
-- 
2.23.0



[PATCH v3 15/52] KVM: PPC: Book3S HV P9: Demand fault PMU SPRs when marked not inuse

2021-10-04 Thread Nicholas Piggin
The pmcregs_in_use field in the guest VPA can not be trusted to reflect
what the guest is doing with PMU SPRs, so the PMU must always be managed
(stopped) when exiting the guest, and SPR values set when entering the
guest to ensure it can't cause a covert channel or otherwise cause other
guests or the host to misbehave.

So prevent guest access to the PMU with HFSCR[PM] if pmcregs_in_use is
clear, and avoid the PMU SPR accesses on every partition switch. Guests
that set pmcregs_in_use incorrectly, or that have just set it and are
touching the PMU for the first time, will take a hypervisor facility
unavailable interrupt that brings the PMU SPRs back in.

Reviewed-by: Athira Jajeev 
Signed-off-by: Nicholas Piggin 
---
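For illustration only (not part of the patch): the demand-fault scheme is to
start the guest with the facility bit clear, take the facility unavailable
interrupt on first use, and grant the bit only if it was ever permitted. A
tiny model of that flow, with a made-up constant in place of HFSCR_PM and the
interrupt path reduced to a function call:

#include <stdio.h>

#define HFSCR_PM	(1ul << 0)	/* made-up bit value, model only */

struct vcpu_model {
	unsigned long hfscr;
	unsigned long hfscr_permitted;
};

/* Facility unavailable handler: grant the facility if it is permitted,
 * otherwise report failure (the real code queues a program check). */
static int pmu_unavailable(struct vcpu_model *vcpu)
{
	if (!(vcpu->hfscr_permitted & HFSCR_PM))
		return -1;
	vcpu->hfscr |= HFSCR_PM;
	return 0;
}

static void guest_touches_pmu(struct vcpu_model *vcpu)
{
	if (!(vcpu->hfscr & HFSCR_PM))
		printf("facility unavailable interrupt, handler says %d\n",
		       pmu_unavailable(vcpu));
	printf("PMU access %s\n", (vcpu->hfscr & HFSCR_PM) ? "allowed" : "denied");
}

int main(void)
{
	struct vcpu_model vcpu = { .hfscr_permitted = HFSCR_PM };

	vcpu.hfscr = vcpu.hfscr_permitted & ~HFSCR_PM;	/* start demand-faulted */
	guest_touches_pmu(&vcpu);	/* first access faults, then succeeds */
	guest_touches_pmu(&vcpu);	/* later accesses go straight through */
	return 0;
}
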
 arch/powerpc/kvm/book3s_hv.c | 131 ++-
 1 file changed, 98 insertions(+), 33 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 29a8c770c4a6..6bbd670658b9 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1421,6 +1421,23 @@ static int kvmppc_emulate_doorbell_instr(struct kvm_vcpu 
*vcpu)
return RESUME_GUEST;
 }
 
+/*
+ * If the lppaca had pmcregs_in_use clear when we exited the guest, then
+ * HFSCR_PM is cleared for next entry. If the guest then tries to access
+ * the PMU SPRs, we get this facility unavailable interrupt. Putting HFSCR_PM
+ * back in the guest HFSCR will cause the next entry to load the PMU SPRs and
+ * allow the guest access to continue.
+ */
+static int kvmppc_pmu_unavailable(struct kvm_vcpu *vcpu)
+{
+   if (!(vcpu->arch.hfscr_permitted & HFSCR_PM))
+   return EMULATE_FAIL;
+
+   vcpu->arch.hfscr |= HFSCR_PM;
+
+   return RESUME_GUEST;
+}
+
 static int kvmppc_handle_exit_hv(struct kvm_vcpu *vcpu,
 struct task_struct *tsk)
 {
@@ -1702,16 +1719,22 @@ static int kvmppc_handle_exit_hv(struct kvm_vcpu *vcpu,
 * to emulate.
 * Otherwise, we just generate a program interrupt to the guest.
 */
-   case BOOK3S_INTERRUPT_H_FAC_UNAVAIL:
+   case BOOK3S_INTERRUPT_H_FAC_UNAVAIL: {
+   u64 cause = vcpu->arch.hfscr >> 56;
+
r = EMULATE_FAIL;
-   if (((vcpu->arch.hfscr >> 56) == FSCR_MSGP_LG) &&
-   cpu_has_feature(CPU_FTR_ARCH_300))
-   r = kvmppc_emulate_doorbell_instr(vcpu);
+   if (cpu_has_feature(CPU_FTR_ARCH_300)) {
+   if (cause == FSCR_MSGP_LG)
+   r = kvmppc_emulate_doorbell_instr(vcpu);
+   if (cause == FSCR_PM_LG)
+   r = kvmppc_pmu_unavailable(vcpu);
+   }
if (r == EMULATE_FAIL) {
kvmppc_core_queue_program(vcpu, SRR1_PROGILL);
r = RESUME_GUEST;
}
break;
+   }
 
case BOOK3S_INTERRUPT_HV_RM_HARD:
r = RESUME_PASSTHROUGH;
@@ -2750,6 +2773,11 @@ static int kvmppc_core_vcpu_create_hv(struct kvm_vcpu 
*vcpu)
 
vcpu->arch.hfscr_permitted = vcpu->arch.hfscr;
 
+   /*
+* PM is demand-faulted so start with it clear.
+*/
+   vcpu->arch.hfscr &= ~HFSCR_PM;
+
kvmppc_mmu_book3s_hv_init(vcpu);
 
vcpu->arch.state = KVMPPC_VCPU_NOTREADY;
@@ -3820,6 +3848,14 @@ static void freeze_pmu(unsigned long mmcr0, unsigned 
long mmcra)
 static void switch_pmu_to_guest(struct kvm_vcpu *vcpu,
struct p9_host_os_sprs *host_os_sprs)
 {
+   struct lppaca *lp;
+   int load_pmu = 1;
+
+   lp = vcpu->arch.vpa.pinned_addr;
+   if (lp)
+   load_pmu = lp->pmcregs_in_use;
+
+   /* Save host */
if (ppc_get_pmu_inuse()) {
/*
 * It might be better to put PMU handling (at least for the
@@ -3854,41 +3890,47 @@ static void switch_pmu_to_guest(struct kvm_vcpu *vcpu,
}
 
 #ifdef CONFIG_PPC_PSERIES
+   /* After saving PMU, before loading guest PMU, flip pmcregs_in_use */
if (kvmhv_on_pseries()) {
barrier();
-   if (vcpu->arch.vpa.pinned_addr) {
-   struct lppaca *lp = vcpu->arch.vpa.pinned_addr;
-   get_lppaca()->pmcregs_in_use = lp->pmcregs_in_use;
-   } else {
-   get_lppaca()->pmcregs_in_use = 1;
-   }
+   get_lppaca()->pmcregs_in_use = load_pmu;
barrier();
}
 #endif
 
-   /* load guest */
-   mtspr(SPRN_PMC1, vcpu->arch.pmc[0]);
-   mtspr(SPRN_PMC2, vcpu->arch.pmc[1]);
-   mtspr(SPRN_PMC3, vcpu->arch.pmc[2]);
-   mtspr(SPRN_PMC4, vcpu->arch.pmc[3]);
-   mtspr(SPRN_PMC5, vcpu->arch.pmc[4]);
-   mtspr(SPRN_PMC6, vcpu->arch.pmc[5]);
-   mtspr(SPRN_MMCR1, vcpu->arch.mmcr[1]);
-   mtspr(SPRN_MMC

[PATCH v3 14/52] KVM: PPC: Book3S HV P9: Factor PMU save/load into context switch functions

2021-10-04 Thread Nicholas Piggin
Rather than guest/host save/restore functions, implement context switch
functions that take care of details like the VPA update for nested.

The reason to split these kind of helpers into explicit save/load
functions is mainly to schedule SPR access nicely, but PMU is a special
case where the load requires mtSPR (to stop counters) and other
difficulties, so there's less possibility to schedule those nicely. The
SPR accesses also have side-effects if the PMU is running, and in later
changes we keep the host PMU running as long as possible so this code
can be better profiled, which also complicates scheduling.

Reviewed-by: Athira Jajeev 
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv.c | 61 +---
 1 file changed, 28 insertions(+), 33 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 211184544599..29a8c770c4a6 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3817,7 +3817,8 @@ static void freeze_pmu(unsigned long mmcr0, unsigned long 
mmcra)
isync();
 }
 
-static void save_p9_host_pmu(struct p9_host_os_sprs *host_os_sprs)
+static void switch_pmu_to_guest(struct kvm_vcpu *vcpu,
+   struct p9_host_os_sprs *host_os_sprs)
 {
if (ppc_get_pmu_inuse()) {
/*
@@ -3851,10 +3852,21 @@ static void save_p9_host_pmu(struct p9_host_os_sprs 
*host_os_sprs)
host_os_sprs->sier3 = mfspr(SPRN_SIER3);
}
}
-}
 
-static void load_p9_guest_pmu(struct kvm_vcpu *vcpu)
-{
+#ifdef CONFIG_PPC_PSERIES
+   if (kvmhv_on_pseries()) {
+   barrier();
+   if (vcpu->arch.vpa.pinned_addr) {
+   struct lppaca *lp = vcpu->arch.vpa.pinned_addr;
+   get_lppaca()->pmcregs_in_use = lp->pmcregs_in_use;
+   } else {
+   get_lppaca()->pmcregs_in_use = 1;
+   }
+   barrier();
+   }
+#endif
+
+   /* load guest */
mtspr(SPRN_PMC1, vcpu->arch.pmc[0]);
mtspr(SPRN_PMC2, vcpu->arch.pmc[1]);
mtspr(SPRN_PMC3, vcpu->arch.pmc[2]);
@@ -3879,7 +3891,8 @@ static void load_p9_guest_pmu(struct kvm_vcpu *vcpu)
/* No isync necessary because we're starting counters */
 }
 
-static void save_p9_guest_pmu(struct kvm_vcpu *vcpu)
+static void switch_pmu_to_host(struct kvm_vcpu *vcpu,
+   struct p9_host_os_sprs *host_os_sprs)
 {
struct lppaca *lp;
int save_pmu = 1;
@@ -3922,10 +3935,15 @@ static void save_p9_guest_pmu(struct kvm_vcpu *vcpu)
} else {
freeze_pmu(mfspr(SPRN_MMCR0), mfspr(SPRN_MMCRA));
}
-}
 
-static void load_p9_host_pmu(struct p9_host_os_sprs *host_os_sprs)
-{
+#ifdef CONFIG_PPC_PSERIES
+   if (kvmhv_on_pseries()) {
+   barrier();
+   get_lppaca()->pmcregs_in_use = ppc_get_pmu_inuse();
+   barrier();
+   }
+#endif
+
if (ppc_get_pmu_inuse()) {
mtspr(SPRN_PMC1, host_os_sprs->pmc1);
mtspr(SPRN_PMC2, host_os_sprs->pmc2);
@@ -4058,8 +4076,6 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
 
save_p9_host_os_sprs(_os_sprs);
 
-   save_p9_host_pmu(_os_sprs);
-
kvmppc_subcore_enter_guest();
 
vc->entry_exit_map = 1;
@@ -4076,19 +4092,7 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
cpu_has_feature(CPU_FTR_P9_TM_HV_ASSIST))
kvmppc_restore_tm_hv(vcpu, vcpu->arch.shregs.msr, true);
 
-#ifdef CONFIG_PPC_PSERIES
-   if (kvmhv_on_pseries()) {
-   barrier();
-   if (vcpu->arch.vpa.pinned_addr) {
-   struct lppaca *lp = vcpu->arch.vpa.pinned_addr;
-   get_lppaca()->pmcregs_in_use = lp->pmcregs_in_use;
-   } else {
-   get_lppaca()->pmcregs_in_use = 1;
-   }
-   barrier();
-   }
-#endif
-   load_p9_guest_pmu(vcpu);
+   switch_pmu_to_guest(vcpu, _os_sprs);
 
msr_check_and_set(MSR_FP | MSR_VEC | MSR_VSX);
load_fp_state(>arch.fp);
@@ -4217,14 +4221,7 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
vcpu->arch.vpa.dirty = 1;
}
 
-   save_p9_guest_pmu(vcpu);
-#ifdef CONFIG_PPC_PSERIES
-   if (kvmhv_on_pseries()) {
-   barrier();
-   get_lppaca()->pmcregs_in_use = ppc_get_pmu_inuse();
-   barrier();
-   }
-#endif
+   switch_pmu_to_host(vcpu, _os_sprs);
 
vc->entry_exit_map = 0x101;
vc->in_guest = 0;
@@ -4233,8 +4230,6 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
 
mtspr(SPRN_SPRG_VDSO_WRITE, local_paca->sprg_vdso);
 
-   load_p9_host_pmu(_os_sprs);
-
kvmppc_subcore_exit_guest();
 
return trap;
-- 
2.23.0



[PATCH v3 13/52] KVM: PPC: Book3S HV P9: Implement PMU save/restore in C

2021-10-04 Thread Nicholas Piggin
Implement the P9 path PMU save/restore code in C, and remove the
POWER9/10 code from the P7/8 path assembly.

Cc: Madhavan Srinivasan 
Reviewed-by: Athira Jajeev 
Signed-off-by: Nicholas Piggin 
---
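For illustration only (not part of the patch): freeze_pmu() only writes the
control registers when the counters are not already in the frozen state it
wants, which is what the earlier boot-time changes set up. A simplified model
of that decision, tracking only the FC bit and counting the writes it avoids:

#include <stdio.h>

#define MMCR0_FC	(1ul << 31)	/* made-up bit position, model only */

static unsigned long spr_mmcr0;
static int spr_writes;

static void write_mmcr0(unsigned long val)
{
	spr_mmcr0 = val;
	spr_writes++;
}

/* Freeze the counters unless they are already frozen; skipping the write
 * is the point, it avoids SPR traffic on every guest exit. */
static void freeze_pmu(unsigned long mmcr0)
{
	if (mmcr0 & MMCR0_FC)
		return;
	write_mmcr0(MMCR0_FC);
}

int main(void)
{
	spr_mmcr0 = MMCR0_FC;		/* host left the PMU frozen */
	freeze_pmu(spr_mmcr0);
	printf("already frozen: %d writes\n", spr_writes);

	spr_mmcr0 = 0;			/* counters running */
	freeze_pmu(spr_mmcr0);
	printf("running -> frozen: %d writes, FC=%d\n",
	       spr_writes, !!(spr_mmcr0 & MMCR0_FC));
	return 0;
}
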
 arch/powerpc/include/asm/asm-prototypes.h |   5 -
 arch/powerpc/kvm/book3s_hv.c  | 221 +++---
 arch/powerpc/kvm/book3s_hv_interrupts.S   |  13 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   |  43 +
 4 files changed, 208 insertions(+), 74 deletions(-)

diff --git a/arch/powerpc/include/asm/asm-prototypes.h 
b/arch/powerpc/include/asm/asm-prototypes.h
index 222823861a67..41b8a1e1144a 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -141,11 +141,6 @@ static inline void kvmppc_restore_tm_hv(struct kvm_vcpu 
*vcpu, u64 msr,
bool preserve_nv) { }
 #endif /* CONFIG_PPC_TRANSACTIONAL_MEM */
 
-void kvmhv_save_host_pmu(void);
-void kvmhv_load_host_pmu(void);
-void kvmhv_save_guest_pmu(struct kvm_vcpu *vcpu, bool pmu_in_use);
-void kvmhv_load_guest_pmu(struct kvm_vcpu *vcpu);
-
 void kvmppc_p9_enter_guest(struct kvm_vcpu *vcpu);
 
 long kvmppc_h_set_dabr(struct kvm_vcpu *vcpu, unsigned long dabr);
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index b069209b49b2..211184544599 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3762,6 +3762,196 @@ static noinline void kvmppc_run_core(struct 
kvmppc_vcore *vc)
trace_kvmppc_run_core(vc, 1);
 }
 
+/*
+ * Privileged (non-hypervisor) host registers to save.
+ */
+struct p9_host_os_sprs {
+   unsigned long dscr;
+   unsigned long tidr;
+   unsigned long iamr;
+   unsigned long amr;
+   unsigned long fscr;
+
+   unsigned int pmc1;
+   unsigned int pmc2;
+   unsigned int pmc3;
+   unsigned int pmc4;
+   unsigned int pmc5;
+   unsigned int pmc6;
+   unsigned long mmcr0;
+   unsigned long mmcr1;
+   unsigned long mmcr2;
+   unsigned long mmcr3;
+   unsigned long mmcra;
+   unsigned long siar;
+   unsigned long sier1;
+   unsigned long sier2;
+   unsigned long sier3;
+   unsigned long sdar;
+};
+
+static void freeze_pmu(unsigned long mmcr0, unsigned long mmcra)
+{
+   if (!(mmcr0 & MMCR0_FC))
+   goto do_freeze;
+   if (mmcra & MMCRA_SAMPLE_ENABLE)
+   goto do_freeze;
+   if (cpu_has_feature(CPU_FTR_ARCH_31)) {
+   if (!(mmcr0 & MMCR0_PMCCEXT))
+   goto do_freeze;
+   if (!(mmcra & MMCRA_BHRB_DISABLE))
+   goto do_freeze;
+   }
+   return;
+
+do_freeze:
+   mmcr0 = MMCR0_FC;
+   mmcra = 0;
+   if (cpu_has_feature(CPU_FTR_ARCH_31)) {
+   mmcr0 |= MMCR0_PMCCEXT;
+   mmcra = MMCRA_BHRB_DISABLE;
+   }
+
+   mtspr(SPRN_MMCR0, mmcr0);
+   mtspr(SPRN_MMCRA, mmcra);
+   isync();
+}
+
+static void save_p9_host_pmu(struct p9_host_os_sprs *host_os_sprs)
+{
+   if (ppc_get_pmu_inuse()) {
+   /*
+* It might be better to put PMU handling (at least for the
+* host) in the perf subsystem because it knows more about what
+* is being used.
+*/
+
+   /* POWER9, POWER10 do not implement HPMC or SPMC */
+
+   host_os_sprs->mmcr0 = mfspr(SPRN_MMCR0);
+   host_os_sprs->mmcra = mfspr(SPRN_MMCRA);
+
+   freeze_pmu(host_os_sprs->mmcr0, host_os_sprs->mmcra);
+
+   host_os_sprs->pmc1 = mfspr(SPRN_PMC1);
+   host_os_sprs->pmc2 = mfspr(SPRN_PMC2);
+   host_os_sprs->pmc3 = mfspr(SPRN_PMC3);
+   host_os_sprs->pmc4 = mfspr(SPRN_PMC4);
+   host_os_sprs->pmc5 = mfspr(SPRN_PMC5);
+   host_os_sprs->pmc6 = mfspr(SPRN_PMC6);
+   host_os_sprs->mmcr1 = mfspr(SPRN_MMCR1);
+   host_os_sprs->mmcr2 = mfspr(SPRN_MMCR2);
+   host_os_sprs->sdar = mfspr(SPRN_SDAR);
+   host_os_sprs->siar = mfspr(SPRN_SIAR);
+   host_os_sprs->sier1 = mfspr(SPRN_SIER);
+
+   if (cpu_has_feature(CPU_FTR_ARCH_31)) {
+   host_os_sprs->mmcr3 = mfspr(SPRN_MMCR3);
+   host_os_sprs->sier2 = mfspr(SPRN_SIER2);
+   host_os_sprs->sier3 = mfspr(SPRN_SIER3);
+   }
+   }
+}
+
+static void load_p9_guest_pmu(struct kvm_vcpu *vcpu)
+{
+   mtspr(SPRN_PMC1, vcpu->arch.pmc[0]);
+   mtspr(SPRN_PMC2, vcpu->arch.pmc[1]);
+   mtspr(SPRN_PMC3, vcpu->arch.pmc[2]);
+   mtspr(SPRN_PMC4, vcpu->arch.pmc[3]);
+   mtspr(SPRN_PMC5, vcpu->arch.pmc[4]);
+   mtspr(SPRN_PMC6, vcpu->arch.pmc[5]);
+   mtspr(SPRN_MMCR1, vcpu->arch.mmcr[1]);
+   mtspr(SPRN_MMCR2, vcpu->arch.mmcr[2]);
+

[PATCH v3 12/52] powerpc/64s: Implement PMU override command line option

2021-10-04 Thread Nicholas Piggin
It can be useful in simulators (with very constrained environments)
to allow some PMCs to run from boot so they can be sampled directly
by a test harness, rather than having to run perf.

A previous change freezes counters at boot by default, so provide
a boot time option to un-freeze (plus a bit more flexibility).

Cc: Madhavan Srinivasan 
Reviewed-by: Athira Jajeev 
Signed-off-by: Nicholas Piggin 
---
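For illustration only (not part of the patch): the parsing treats a bare value
such as 'pmu_override=on' (anything non-numeric) as "unfreeze, leave MMCR1 at
0", and a numeric value as the MMCR1 contents to program. A userspace sketch
of the same parse, with strtoul standing in for kstrtoul:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse the value of a pmu_override= style option: numbers become the
 * MMCR1 value to program, anything else means "just unfreeze" (value 0). */
static unsigned long parse_pmu_override(const char *str)
{
	char *end;
	unsigned long val;

	errno = 0;
	val = strtoul(str, &end, 0);
	if (errno || end == str || *end != '\0')
		return 0;	/* mimics: if (kstrtoul(str, 0, &val)) val = 0; */
	return val;
}

int main(void)
{
	printf("pmu_override=on     -> MMCR1=%#lx\n", parse_pmu_override("on"));
	printf("pmu_override=0x1234 -> MMCR1=%#lx\n", parse_pmu_override("0x1234"));
	return 0;
}
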
 .../admin-guide/kernel-parameters.txt |  8 +
 arch/powerpc/perf/core-book3s.c   | 35 +++
 2 files changed, 43 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 91ba391f9b32..02a80c02a713 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4120,6 +4120,14 @@
Override pmtimer IOPort with a hex value.
e.g. pmtmr=0x508
 
+   pmu_override=   [PPC] Override the PMU.
+   This option takes over the PMU facility, so it is no
+   longer usable by perf. Setting this option starts the
+   PMU counters by setting MMCR0 to 0 (the FC bit is
+   cleared). If a number is given, then MMCR1 is set to
+   that number, otherwise (e.g., 'pmu_override=on'), MMCR1
+   remains 0.
+
pm_debug_messages   [SUSPEND,KNL]
Enable suspend/resume debug messages during boot up.
 
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 73e62e9b179b..8d4ff93462fb 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -2419,8 +2419,24 @@ int register_power_pmu(struct power_pmu *pmu)
 }
 
 #ifdef CONFIG_PPC64
+static bool pmu_override = false;
+static unsigned long pmu_override_val;
+static void do_pmu_override(void *data)
+{
+   ppc_set_pmu_inuse(1);
+   if (pmu_override_val)
+   mtspr(SPRN_MMCR1, pmu_override_val);
+   mtspr(SPRN_MMCR0, mfspr(SPRN_MMCR0) & ~MMCR0_FC);
+}
+
 static int __init init_ppc64_pmu(void)
 {
+   if (cpu_has_feature(CPU_FTR_HVMODE) && pmu_override) {
+   pr_warn("disabling perf due to pmu_override= command line 
option.\n");
+   on_each_cpu(do_pmu_override, NULL, 1);
+   return 0;
+   }
+
/* run through all the pmu drivers one at a time */
if (!init_power5_pmu())
return 0;
@@ -2442,4 +2458,23 @@ static int __init init_ppc64_pmu(void)
return init_generic_compat_pmu();
 }
 early_initcall(init_ppc64_pmu);
+
+static int __init pmu_setup(char *str)
+{
+   unsigned long val;
+
+   if (!early_cpu_has_feature(CPU_FTR_HVMODE))
+   return 0;
+
+   pmu_override = true;
+
+   if (kstrtoul(str, 0, ))
+   val = 0;
+
+   pmu_override_val = val;
+
+   return 1;
+}
+__setup("pmu_override=", pmu_setup);
+
 #endif
-- 
2.23.0



[PATCH v3 11/52] powerpc/64s: Always set PMU control registers to frozen/disabled when not in use

2021-10-04 Thread Nicholas Piggin
KVM PMU management code looks for particular frozen/disabled bits in
the PMU registers so it knows whether it must clear them when coming
out of a guest or not. Always leaving the control registers in that
frozen/disabled state when the PMU is not in use lets KVM make those
optimisations without getting confused. Longer term, the better approach
might be to move guest/host PMU switching to the perf subsystem.

Cc: Madhavan Srinivasan 
Reviewed-by: Athira Jajeev 
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/cpu_setup_power.c | 4 ++--
 arch/powerpc/kernel/dt_cpu_ftrs.c | 6 +++---
 arch/powerpc/kvm/book3s_hv.c  | 5 +
 3 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/cpu_setup_power.c 
b/arch/powerpc/kernel/cpu_setup_power.c
index a29dc8326622..3dc61e203f37 100644
--- a/arch/powerpc/kernel/cpu_setup_power.c
+++ b/arch/powerpc/kernel/cpu_setup_power.c
@@ -109,7 +109,7 @@ static void init_PMU_HV_ISA207(void)
 static void init_PMU(void)
 {
mtspr(SPRN_MMCRA, 0);
-   mtspr(SPRN_MMCR0, 0);
+   mtspr(SPRN_MMCR0, MMCR0_FC);
mtspr(SPRN_MMCR1, 0);
mtspr(SPRN_MMCR2, 0);
 }
@@ -123,7 +123,7 @@ static void init_PMU_ISA31(void)
 {
mtspr(SPRN_MMCR3, 0);
mtspr(SPRN_MMCRA, MMCRA_BHRB_DISABLE);
-   mtspr(SPRN_MMCR0, MMCR0_PMCCEXT);
+   mtspr(SPRN_MMCR0, MMCR0_FC | MMCR0_PMCCEXT);
 }
 
 /*
diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c 
b/arch/powerpc/kernel/dt_cpu_ftrs.c
index 0a6b36b4bda8..06a089fbeaa7 100644
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c
+++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -353,7 +353,7 @@ static void init_pmu_power8(void)
}
 
mtspr(SPRN_MMCRA, 0);
-   mtspr(SPRN_MMCR0, 0);
+   mtspr(SPRN_MMCR0, MMCR0_FC);
mtspr(SPRN_MMCR1, 0);
mtspr(SPRN_MMCR2, 0);
mtspr(SPRN_MMCRS, 0);
@@ -392,7 +392,7 @@ static void init_pmu_power9(void)
mtspr(SPRN_MMCRC, 0);
 
mtspr(SPRN_MMCRA, 0);
-   mtspr(SPRN_MMCR0, 0);
+   mtspr(SPRN_MMCR0, MMCR0_FC);
mtspr(SPRN_MMCR1, 0);
mtspr(SPRN_MMCR2, 0);
 }
@@ -428,7 +428,7 @@ static void init_pmu_power10(void)
 
mtspr(SPRN_MMCR3, 0);
mtspr(SPRN_MMCRA, MMCRA_BHRB_DISABLE);
-   mtspr(SPRN_MMCR0, MMCR0_PMCCEXT);
+   mtspr(SPRN_MMCR0, MMCR0_FC | MMCR0_PMCCEXT);
 }
 
 static int __init feat_enable_pmu_power10(struct dt_cpu_feature *f)
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 945fc9a96439..b069209b49b2 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -2715,6 +2715,11 @@ static int kvmppc_core_vcpu_create_hv(struct kvm_vcpu 
*vcpu)
 #endif
 #endif
vcpu->arch.mmcr[0] = MMCR0_FC;
+   if (cpu_has_feature(CPU_FTR_ARCH_31)) {
+   vcpu->arch.mmcr[0] |= MMCR0_PMCCEXT;
+   vcpu->arch.mmcra = MMCRA_BHRB_DISABLE;
+   }
+
vcpu->arch.ctrl = CTRL_RUNLATCH;
/* default to host PVR, since we can't spoof it */
kvmppc_set_pvr_hv(vcpu, mfspr(SPRN_PVR));
-- 
2.23.0



[PATCH v3 10/52] KVM: PPC: Book3S HV: Don't always save PMU for guest capable of nesting

2021-10-04 Thread Nicholas Piggin
Provide a config option that controls the workaround added by commit
63279eeb7f93 ("KVM: PPC: Book3S HV: Always save guest pmu for guest
capable of nesting"). The option defaults to y for now, but is expected
to go away within a few releases.

Nested capable guests running with the earlier commit ("KVM: PPC: Book3S
HV Nested: Indicate guest PMU in-use in VPA") will now indicate the PMU
in-use status of their guests, which means the parent does not need to
unconditionally save the PMU for nested capable guests.

After this latest round of performance optimisations, this option costs
about 540 cycles or 10% entry/exit performance on a POWER9 nested-capable
guest.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/Kconfig | 15 +++
 arch/powerpc/kvm/book3s_hv.c | 10 --
 2 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index ff581d70f20c..1e7aae522be8 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -130,6 +130,21 @@ config KVM_BOOK3S_HV_EXIT_TIMING
 
  If unsure, say N.
 
+config KVM_BOOK3S_HV_NESTED_PMU_WORKAROUND
+   bool "Nested L0 host workaround for L1 KVM host PMU handling bug" if 
EXPERT
+   depends on KVM_BOOK3S_HV_POSSIBLE
+   default !EXPERT
+   help
+ Old nested HV capable Linux guests have a bug where they don't
+ reflect the PMU in-use status of their L2 guest to the L0 host
+ while the L2 PMU registers are live. This can result in loss
+  of L2 PMU register state, causing perf to not work correctly in
+ L2 guests.
+
+ Selecting this option for the L0 host implements a workaround for
+ those buggy L1s which saves the L2 state, at the cost of performance
+ in all nested-capable guest entry/exit.
+
 config KVM_BOOKE_HV
bool
 
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 463534402107..945fc9a96439 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4034,8 +4034,14 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
vcpu->arch.vpa.dirty = 1;
save_pmu = lp->pmcregs_in_use;
}
-   /* Must save pmu if this guest is capable of running nested guests */
-   save_pmu |= nesting_enabled(vcpu->kvm);
+   if (IS_ENABLED(CONFIG_KVM_BOOK3S_HV_NESTED_PMU_WORKAROUND)) {
+   /*
+* Save pmu if this guest is capable of running nested guests.
+* This option is for old L1s that do not set their
+* lppaca->pmcregs_in_use properly when entering their L2.
+*/
+   save_pmu |= nesting_enabled(vcpu->kvm);
+   }
 
kvmhv_save_guest_pmu(vcpu, save_pmu);
 #ifdef CONFIG_PPC_PSERIES
-- 
2.23.0



[PATCH v3 09/52] powerpc/64s: Keep AMOR SPR a constant ~0 at runtime

2021-10-04 Thread Nicholas Piggin
This register controls supervisor SPR modifications, and as such is only
relevant for KVM. KVM always sets AMOR to ~0 on guest entry, and never
restores it coming back out to the host, so it can be kept constant and
avoid the mtSPR in KVM guest entry.

Reviewed-by: Fabiano Rosas 
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/cpu_setup_power.c|  8 
 arch/powerpc/kernel/dt_cpu_ftrs.c|  2 ++
 arch/powerpc/kvm/book3s_hv_p9_entry.c|  2 --
 arch/powerpc/kvm/book3s_hv_rmhandlers.S  |  2 --
 arch/powerpc/mm/book3s64/radix_pgtable.c | 15 ---
 arch/powerpc/platforms/powernv/idle.c|  8 +++-
 6 files changed, 13 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/kernel/cpu_setup_power.c 
b/arch/powerpc/kernel/cpu_setup_power.c
index 3cca88ee96d7..a29dc8326622 100644
--- a/arch/powerpc/kernel/cpu_setup_power.c
+++ b/arch/powerpc/kernel/cpu_setup_power.c
@@ -137,6 +137,7 @@ void __setup_cpu_power7(unsigned long offset, struct 
cpu_spec *t)
return;
 
mtspr(SPRN_LPID, 0);
+   mtspr(SPRN_AMOR, ~0);
mtspr(SPRN_PCR, PCR_MASK);
init_LPCR_ISA206(mfspr(SPRN_LPCR), LPCR_LPES1 >> LPCR_LPES_SH);
 }
@@ -150,6 +151,7 @@ void __restore_cpu_power7(void)
return;
 
mtspr(SPRN_LPID, 0);
+   mtspr(SPRN_AMOR, ~0);
mtspr(SPRN_PCR, PCR_MASK);
init_LPCR_ISA206(mfspr(SPRN_LPCR), LPCR_LPES1 >> LPCR_LPES_SH);
 }
@@ -164,6 +166,7 @@ void __setup_cpu_power8(unsigned long offset, struct 
cpu_spec *t)
return;
 
mtspr(SPRN_LPID, 0);
+   mtspr(SPRN_AMOR, ~0);
mtspr(SPRN_PCR, PCR_MASK);
init_LPCR_ISA206(mfspr(SPRN_LPCR) | LPCR_PECEDH, 0); /* LPES = 0 */
init_HFSCR();
@@ -184,6 +187,7 @@ void __restore_cpu_power8(void)
return;
 
mtspr(SPRN_LPID, 0);
+   mtspr(SPRN_AMOR, ~0);
mtspr(SPRN_PCR, PCR_MASK);
init_LPCR_ISA206(mfspr(SPRN_LPCR) | LPCR_PECEDH, 0); /* LPES = 0 */
init_HFSCR();
@@ -202,6 +206,7 @@ void __setup_cpu_power9(unsigned long offset, struct 
cpu_spec *t)
mtspr(SPRN_PSSCR, 0);
mtspr(SPRN_LPID, 0);
mtspr(SPRN_PID, 0);
+   mtspr(SPRN_AMOR, ~0);
mtspr(SPRN_PCR, PCR_MASK);
init_LPCR_ISA300((mfspr(SPRN_LPCR) | LPCR_PECEDH | LPCR_PECE_HVEE |\
 LPCR_HVICE | LPCR_HEIC) & ~(LPCR_UPRT | LPCR_HR), 0);
@@ -223,6 +228,7 @@ void __restore_cpu_power9(void)
mtspr(SPRN_PSSCR, 0);
mtspr(SPRN_LPID, 0);
mtspr(SPRN_PID, 0);
+   mtspr(SPRN_AMOR, ~0);
mtspr(SPRN_PCR, PCR_MASK);
init_LPCR_ISA300((mfspr(SPRN_LPCR) | LPCR_PECEDH | LPCR_PECE_HVEE |\
 LPCR_HVICE | LPCR_HEIC) & ~(LPCR_UPRT | LPCR_HR), 0);
@@ -242,6 +248,7 @@ void __setup_cpu_power10(unsigned long offset, struct 
cpu_spec *t)
mtspr(SPRN_PSSCR, 0);
mtspr(SPRN_LPID, 0);
mtspr(SPRN_PID, 0);
+   mtspr(SPRN_AMOR, ~0);
mtspr(SPRN_PCR, PCR_MASK);
init_LPCR_ISA300((mfspr(SPRN_LPCR) | LPCR_PECEDH | LPCR_PECE_HVEE |\
 LPCR_HVICE | LPCR_HEIC) & ~(LPCR_UPRT | LPCR_HR), 0);
@@ -264,6 +271,7 @@ void __restore_cpu_power10(void)
mtspr(SPRN_PSSCR, 0);
mtspr(SPRN_LPID, 0);
mtspr(SPRN_PID, 0);
+   mtspr(SPRN_AMOR, ~0);
mtspr(SPRN_PCR, PCR_MASK);
init_LPCR_ISA300((mfspr(SPRN_LPCR) | LPCR_PECEDH | LPCR_PECE_HVEE |\
 LPCR_HVICE | LPCR_HEIC) & ~(LPCR_UPRT | LPCR_HR), 0);
diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c 
b/arch/powerpc/kernel/dt_cpu_ftrs.c
index 358aee7c2d79..0a6b36b4bda8 100644
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c
+++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -80,6 +80,7 @@ static void __restore_cpu_cpufeatures(void)
mtspr(SPRN_LPCR, system_registers.lpcr);
if (hv_mode) {
mtspr(SPRN_LPID, 0);
+   mtspr(SPRN_AMOR, ~0);
mtspr(SPRN_HFSCR, system_registers.hfscr);
mtspr(SPRN_PCR, system_registers.pcr);
}
@@ -216,6 +217,7 @@ static int __init feat_enable_hv(struct dt_cpu_feature *f)
}
 
mtspr(SPRN_LPID, 0);
+   mtspr(SPRN_AMOR, ~0);
 
lpcr = mfspr(SPRN_LPCR);
lpcr &=  ~LPCR_LPES0; /* HV external interrupts */
diff --git a/arch/powerpc/kvm/book3s_hv_p9_entry.c 
b/arch/powerpc/kvm/book3s_hv_p9_entry.c
index bd8cf0a65ce8..a7f63082b4e3 100644
--- a/arch/powerpc/kvm/book3s_hv_p9_entry.c
+++ b/arch/powerpc/kvm/book3s_hv_p9_entry.c
@@ -286,8 +286,6 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
mtspr(SPRN_SPRG2, vcpu->arch.shregs.sprg2);
mtspr(SPRN_SPRG3, vcpu->arch.shregs.sprg3);
 
-   mtspr(SPRN_AMOR, ~0UL);
-
local_paca->kvm_hstate.in_guest = KVM_GUEST_MODE_HV_P9;
 
/*
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 904

[PATCH v3 08/52] KVM: PPC: Book3S HV: POWER10 enable HAIL when running radix guests

2021-10-04 Thread Nicholas Piggin
HV interrupts may be taken with the MMU enabled when radix guests are
running. Enable LPCR[HAIL] on ISA v3.1 processors for radix guests.
Make this depend on the host LPCR[HAIL] being enabled. Currently that is
always enabled, but having this test means any issue that might require
LPCR[HAIL] to be disabled in the host will not have to be duplicated in
KVM.

This optimisation takes 1380 cycles off a NULL hcall entry+exit micro
benchmark on a POWER10.

Reviewed-by: Fabiano Rosas 
Signed-off-by: Nicholas Piggin 
---
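For illustration only (not part of the patch): adding LPCR_HAIL to the mask
passed to kvmppc_update_lpcr() lets the bit be set or cleared along with the
other MMU-related bits without touching the rest of the register. A minimal
model, assuming the helper applies the value under the mask as
new = (old & ~mask) | (val & mask), with made-up bit positions:

#include <stdio.h>

/* Made-up bit values, for the model only. */
#define LPCR_VPM1	(1ul << 0)
#define LPCR_UPRT	(1ul << 1)
#define LPCR_GTSE	(1ul << 2)
#define LPCR_HR		(1ul << 3)
#define LPCR_HAIL	(1ul << 4)

/* Model of kvmppc_update_lpcr(): only bits covered by the mask change. */
static unsigned long update_lpcr(unsigned long old, unsigned long val,
				 unsigned long mask)
{
	return (old & ~mask) | (val & mask);
}

int main(void)
{
	unsigned long vm_lpcr = LPCR_VPM1;		/* HPT guest configuration */
	unsigned long lpcr = LPCR_UPRT | LPCR_GTSE | LPCR_HR | LPCR_HAIL;
	unsigned long mask = LPCR_VPM1 | LPCR_UPRT | LPCR_GTSE | LPCR_HR | LPCR_HAIL;

	vm_lpcr = update_lpcr(vm_lpcr, lpcr, mask);	/* switch to radix */
	printf("radix LPCR bits: %#lx (HAIL %s)\n", vm_lpcr,
	       (vm_lpcr & LPCR_HAIL) ? "set" : "clear");
	return 0;
}
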
 arch/powerpc/kvm/book3s_hv.c | 29 +
 1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index e83c7aa7dbba..463534402107 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -5047,6 +5047,8 @@ static int kvmppc_hv_setup_htab_rma(struct kvm_vcpu *vcpu)
  */
 int kvmppc_switch_mmu_to_hpt(struct kvm *kvm)
 {
+   unsigned long lpcr, lpcr_mask;
+
if (nesting_enabled(kvm))
kvmhv_release_all_nested(kvm);
kvmppc_rmap_reset(kvm);
@@ -5056,8 +5058,13 @@ int kvmppc_switch_mmu_to_hpt(struct kvm *kvm)
kvm->arch.radix = 0;
spin_unlock(>mmu_lock);
kvmppc_free_radix(kvm);
-   kvmppc_update_lpcr(kvm, LPCR_VPM1,
-  LPCR_VPM1 | LPCR_UPRT | LPCR_GTSE | LPCR_HR);
+
+   lpcr = LPCR_VPM1;
+   lpcr_mask = LPCR_VPM1 | LPCR_UPRT | LPCR_GTSE | LPCR_HR;
+   if (cpu_has_feature(CPU_FTR_ARCH_31))
+   lpcr_mask |= LPCR_HAIL;
+   kvmppc_update_lpcr(kvm, lpcr, lpcr_mask);
+
return 0;
 }
 
@@ -5067,6 +5074,7 @@ int kvmppc_switch_mmu_to_hpt(struct kvm *kvm)
  */
 int kvmppc_switch_mmu_to_radix(struct kvm *kvm)
 {
+   unsigned long lpcr, lpcr_mask;
int err;
 
err = kvmppc_init_vm_radix(kvm);
@@ -5078,8 +5086,17 @@ int kvmppc_switch_mmu_to_radix(struct kvm *kvm)
kvm->arch.radix = 1;
spin_unlock(>mmu_lock);
kvmppc_free_hpt(>arch.hpt);
-   kvmppc_update_lpcr(kvm, LPCR_UPRT | LPCR_GTSE | LPCR_HR,
-  LPCR_VPM1 | LPCR_UPRT | LPCR_GTSE | LPCR_HR);
+
+   lpcr = LPCR_UPRT | LPCR_GTSE | LPCR_HR;
+   lpcr_mask = LPCR_VPM1 | LPCR_UPRT | LPCR_GTSE | LPCR_HR;
+   if (cpu_has_feature(CPU_FTR_ARCH_31)) {
+   lpcr_mask |= LPCR_HAIL;
+   if (cpu_has_feature(CPU_FTR_HVMODE) &&
+   (kvm->arch.host_lpcr & LPCR_HAIL))
+   lpcr |= LPCR_HAIL;
+   }
+   kvmppc_update_lpcr(kvm, lpcr, lpcr_mask);
+
return 0;
 }
 
@@ -5243,6 +5260,10 @@ static int kvmppc_core_init_vm_hv(struct kvm *kvm)
kvm->arch.mmu_ready = 1;
lpcr &= ~LPCR_VPM1;
lpcr |= LPCR_UPRT | LPCR_GTSE | LPCR_HR;
+   if (cpu_has_feature(CPU_FTR_HVMODE) &&
+   cpu_has_feature(CPU_FTR_ARCH_31) &&
+   (kvm->arch.host_lpcr & LPCR_HAIL))
+   lpcr |= LPCR_HAIL;
ret = kvmppc_init_vm_radix(kvm);
if (ret) {
kvmppc_free_lpid(kvm->arch.lpid);
-- 
2.23.0



[PATCH v3 07/52] powerpc/time: add API for KVM to re-arm the host timer/decrementer

2021-10-04 Thread Nicholas Piggin
Rather than have KVM look up the host timer and fiddle with the
irq-work internal details, have the powerpc/time.c code provide a
function for KVM to re-arm the Linux timer code when exiting a
guest.

This implementation improves on the existing code by marking a
decrementer interrupt as soft-pending if a timer has already expired,
rather than setting DEC to a negative value, which tended to cause host
timers to take two interrupts (first the hdec to exit the guest, then
the immediate dec).

Signed-off-by: Nicholas Piggin 
---
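For illustration only (not part of the patch): the new helper's decision is,
if the next host timer has already expired, mark a decrementer interrupt
soft-pending instead of programming DEC negative; otherwise arm DEC with the
remaining ticks, clamped by decrementer_max. A standalone model of that logic,
with plain variables standing in for the PACA flag and the DEC register:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t decrementer_max = 0x7fffffff;
static bool dec_soft_pending;		/* stands in for PACA_IRQ_DEC */
static uint64_t dec_armed;

static void set_dec_or_work(uint64_t val)
{
	dec_armed = val;
}

/* Model of timer_rearm_host_dec(): 'now' is the current timebase, 'next_tb'
 * the cached decrementers_next_tb value. */
static void timer_rearm_host_dec(uint64_t now, uint64_t next_tb)
{
	if (now >= next_tb) {
		dec_soft_pending = true;	/* replay the timer when irqs are enabled */
	} else {
		uint64_t delta = next_tb - now;
		if (delta <= decrementer_max)
			set_dec_or_work(delta);
	}
}

int main(void)
{
	timer_rearm_host_dec(1000, 5000);
	printf("future timer:  DEC=%llu pending=%d\n",
	       (unsigned long long)dec_armed, dec_soft_pending);

	timer_rearm_host_dec(9000, 5000);
	printf("expired timer: DEC unchanged=%llu pending=%d\n",
	       (unsigned long long)dec_armed, dec_soft_pending);
	return 0;
}
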
 arch/powerpc/include/asm/time.h | 16 +++---
 arch/powerpc/kernel/time.c  | 52 +++--
 arch/powerpc/kvm/book3s_hv.c|  7 ++---
 3 files changed, 49 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index 69b6be617772..924b2157882f 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -99,18 +99,6 @@ extern void div128_by_32(u64 dividend_high, u64 dividend_low,
 extern void secondary_cpu_time_init(void);
 extern void __init time_init(void);
 
-#ifdef CONFIG_PPC64
-static inline unsigned long test_irq_work_pending(void)
-{
-   unsigned long x;
-
-   asm volatile("lbz %0,%1(13)"
-   : "=r" (x)
-   : "i" (offsetof(struct paca_struct, irq_work_pending)));
-   return x;
-}
-#endif
-
 DECLARE_PER_CPU(u64, decrementers_next_tb);
 
 static inline u64 timer_get_next_tb(void)
@@ -118,6 +106,10 @@ static inline u64 timer_get_next_tb(void)
return __this_cpu_read(decrementers_next_tb);
 }
 
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+void timer_rearm_host_dec(u64 now);
+#endif
+
 /* Convert timebase ticks to nanoseconds */
 unsigned long long tb_to_ns(unsigned long long tb_ticks);
 
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 6ce40d2ac201..2a6c118a43fb 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -498,6 +498,16 @@ EXPORT_SYMBOL(profile_pc);
  * 64-bit uses a byte in the PACA, 32-bit uses a per-cpu variable...
  */
 #ifdef CONFIG_PPC64
+static inline unsigned long test_irq_work_pending(void)
+{
+   unsigned long x;
+
+   asm volatile("lbz %0,%1(13)"
+   : "=r" (x)
+   : "i" (offsetof(struct paca_struct, irq_work_pending)));
+   return x;
+}
+
 static inline void set_irq_work_pending_flag(void)
 {
asm volatile("stb %0,%1(13)" : :
@@ -541,13 +551,44 @@ void arch_irq_work_raise(void)
preempt_enable();
 }
 
+static void set_dec_or_work(u64 val)
+{
+   set_dec(val);
+   /* We may have raced with new irq work */
+   if (unlikely(test_irq_work_pending()))
+   set_dec(1);
+}
+
 #else  /* CONFIG_IRQ_WORK */
 
 #define test_irq_work_pending()0
 #define clear_irq_work_pending()
 
+static void set_dec_or_work(u64 val)
+{
+   set_dec(val);
+}
 #endif /* CONFIG_IRQ_WORK */
 
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+void timer_rearm_host_dec(u64 now)
+{
+   u64 *next_tb = this_cpu_ptr(_next_tb);
+
+   WARN_ON_ONCE(!arch_irqs_disabled());
+   WARN_ON_ONCE(mfmsr() & MSR_EE);
+
+   if (now >= *next_tb) {
+   local_paca->irq_happened |= PACA_IRQ_DEC;
+   } else {
+   now = *next_tb - now;
+   if (now <= decrementer_max)
+   set_dec_or_work(now);
+   }
+}
+EXPORT_SYMBOL_GPL(timer_rearm_host_dec);
+#endif
+
 /*
  * timer_interrupt - gets called when the decrementer overflows,
  * with interrupts disabled.
@@ -608,10 +649,7 @@ DEFINE_INTERRUPT_HANDLER_ASYNC(timer_interrupt)
} else {
now = *next_tb - now;
if (now <= decrementer_max)
-   set_dec(now);
-   /* We may have raced with new irq work */
-   if (test_irq_work_pending())
-   set_dec(1);
+   set_dec_or_work(now);
__this_cpu_inc(irq_stat.timer_irqs_others);
}
 
@@ -853,11 +891,7 @@ static int decrementer_set_next_event(unsigned long evt,
  struct clock_event_device *dev)
 {
__this_cpu_write(decrementers_next_tb, get_tb() + evt);
-   set_dec(evt);
-
-   /* We may have raced with new irq work */
-   if (test_irq_work_pending())
-   set_dec(1);
+   set_dec_or_work(evt);
 
return 0;
 }
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index e4482bf546ed..e83c7aa7dbba 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4049,11 +4049,8 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
vc->entry_exit_map = 0x101;
vc->in_guest = 0;
 
-   next_timer = timer_get_next_tb();
-   set_dec(next_timer - tb);
-   /* We may have raced with new irq work */
-   if (test_irq_w

[PATCH v3 06/52] KVM: PPC: Book3S HV P9: Reduce mftb per guest entry/exit

2021-10-04 Thread Nicholas Piggin
mftb is serialising (dispatch next-to-complete), so it is heavyweight
for an mfspr. Avoid reading it multiple times in the entry and exit paths.
A delay of a small number of cycles to the timers is tolerable.

Reviewed-by: Fabiano Rosas 
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv.c  | 4 ++--
 arch/powerpc/kvm/book3s_hv_p9_entry.c | 5 +++--
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 30d400bf161b..e4482bf546ed 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3927,7 +3927,7 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
 *
 * XXX: Another day's problem.
 */
-   mtspr(SPRN_DEC, vcpu->arch.dec_expires - mftb());
+   mtspr(SPRN_DEC, vcpu->arch.dec_expires - tb);
 
if (kvmhv_on_pseries()) {
/*
@@ -4050,7 +4050,7 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
vc->in_guest = 0;
 
next_timer = timer_get_next_tb();
-   set_dec(next_timer - mftb());
+   set_dec(next_timer - tb);
/* We may have raced with new irq work */
if (test_irq_work_pending())
set_dec(1);
diff --git a/arch/powerpc/kvm/book3s_hv_p9_entry.c 
b/arch/powerpc/kvm/book3s_hv_p9_entry.c
index 0ff9ddb5e7ca..bd8cf0a65ce8 100644
--- a/arch/powerpc/kvm/book3s_hv_p9_entry.c
+++ b/arch/powerpc/kvm/book3s_hv_p9_entry.c
@@ -203,7 +203,8 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
unsigned long host_dawr1;
unsigned long host_dawrx1;
 
-   hdec = time_limit - mftb();
+   tb = mftb();
+   hdec = time_limit - tb;
if (hdec < 0)
return BOOK3S_INTERRUPT_HV_DECREMENTER;
 
@@ -215,7 +216,7 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
vcpu->arch.ceded = 0;
 
if (vc->tb_offset) {
-   u64 new_tb = mftb() + vc->tb_offset;
+   u64 new_tb = tb + vc->tb_offset;
mtspr(SPRN_TBU40, new_tb);
tb = mftb();
if ((tb & 0xffffff) < (new_tb & 0xffffff))
-- 
2.23.0



[PATCH v3 05/52] KVM: PPC: Book3S HV P9: Use large decrementer for HDEC

2021-10-04 Thread Nicholas Piggin
On processors that don't suppress the HDEC exceptions when LPCR[HDICE]=0,
this could help reduce needless guest exits due to leftover exceptions on
entering the guest.

Reviewed-by: Alexey Kardashevskiy 
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/time.h   | 2 ++
 arch/powerpc/kernel/time.c| 1 +
 arch/powerpc/kvm/book3s_hv_p9_entry.c | 3 ++-
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index fd09b4797fd7..69b6be617772 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -18,6 +18,8 @@
 #include 
 
 /* time.c */
+extern u64 decrementer_max;
+
 extern unsigned long tb_ticks_per_jiffy;
 extern unsigned long tb_ticks_per_usec;
 extern unsigned long tb_ticks_per_sec;
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index e84a087223ce..6ce40d2ac201 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -88,6 +88,7 @@ static struct clocksource clocksource_timebase = {
 
 #define DECREMENTER_DEFAULT_MAX 0x7FFF
 u64 decrementer_max = DECREMENTER_DEFAULT_MAX;
+EXPORT_SYMBOL_GPL(decrementer_max); /* for KVM HDEC */
 
 static int decrementer_set_next_event(unsigned long evt,
  struct clock_event_device *dev);
diff --git a/arch/powerpc/kvm/book3s_hv_p9_entry.c 
b/arch/powerpc/kvm/book3s_hv_p9_entry.c
index 961b3d70483c..0ff9ddb5e7ca 100644
--- a/arch/powerpc/kvm/book3s_hv_p9_entry.c
+++ b/arch/powerpc/kvm/book3s_hv_p9_entry.c
@@ -504,7 +504,8 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
vc->tb_offset_applied = 0;
}
 
-   mtspr(SPRN_HDEC, 0x7fff);
+   /* HDEC must be at least as large as DEC, so decrementer_max fits */
+   mtspr(SPRN_HDEC, decrementer_max);
 
save_clear_guest_mmu(kvm, vcpu);
switch_mmu_to_host(kvm, host_pidr);
-- 
2.23.0



[PATCH v3 04/52] KVM: PPC: Book3S HV P9: Use host timer accounting to avoid decrementer read

2021-10-04 Thread Nicholas Piggin
There is no need to save away the host DEC value, as it is derived
from the host timer subsystem which maintains the next timer time,
so it can be restored from there.

Signed-off-by: Nicholas Piggin 
---
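For illustration only (not part of the patch): rather than saving the host
DEC, the entry path compares the timebase against the cached
decrementers_next_tb value and clamps the guest time limit to it. The
arithmetic on its own:

#include <stdint.h>
#include <stdio.h>

/* Model: returns 0 if the host timer has already expired (bail out of guest
 * entry), otherwise the clamped time limit to run the guest for. */
static uint64_t guest_time_limit(uint64_t tb, uint64_t next_timer, uint64_t time_limit)
{
	if (tb >= next_timer)
		return 0;
	if (next_timer < time_limit)
		time_limit = next_timer;
	return time_limit;
}

int main(void)
{
	printf("limit=%llu\n", (unsigned long long)guest_time_limit(100, 500, 1000));
	printf("limit=%llu\n", (unsigned long long)guest_time_limit(100, 2000, 1000));
	printf("limit=%llu\n", (unsigned long long)guest_time_limit(600, 500, 1000));
	return 0;
}
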
 arch/powerpc/include/asm/time.h |  5 +
 arch/powerpc/kernel/time.c  |  1 +
 arch/powerpc/kvm/book3s_hv.c| 14 +++---
 3 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index 8c2c3dd4ddba..fd09b4797fd7 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -111,6 +111,11 @@ static inline unsigned long test_irq_work_pending(void)
 
 DECLARE_PER_CPU(u64, decrementers_next_tb);
 
+static inline u64 timer_get_next_tb(void)
+{
+   return __this_cpu_read(decrementers_next_tb);
+}
+
 /* Convert timebase ticks to nanoseconds */
 unsigned long long tb_to_ns(unsigned long long tb_ticks);
 
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 934d8ae66cc6..e84a087223ce 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -107,6 +107,7 @@ struct clock_event_device decrementer_clockevent = {
 EXPORT_SYMBOL(decrementer_clockevent);
 
 DEFINE_PER_CPU(u64, decrementers_next_tb);
+EXPORT_SYMBOL_GPL(decrementers_next_tb);
 static DEFINE_PER_CPU(struct clock_event_device, decrementers);
 
 #define XSEC_PER_SEC (1024*1024)
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 6a07a79f07d8..30d400bf161b 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3860,18 +3860,17 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
struct kvmppc_vcore *vc = vcpu->arch.vcore;
struct p9_host_os_sprs host_os_sprs;
s64 dec;
-   u64 tb;
+   u64 tb, next_timer;
int trap, save_pmu;
 
WARN_ON_ONCE(vcpu->arch.ceded);
 
-   dec = mfspr(SPRN_DEC);
tb = mftb();
-   if (dec < 0)
+   next_timer = timer_get_next_tb();
+   if (tb >= next_timer)
return BOOK3S_INTERRUPT_HV_DECREMENTER;
-   local_paca->kvm_hstate.dec_expires = dec + tb;
-   if (local_paca->kvm_hstate.dec_expires < time_limit)
-   time_limit = local_paca->kvm_hstate.dec_expires;
+   if (next_timer < time_limit)
+   time_limit = next_timer;
 
save_p9_host_os_sprs(&host_os_sprs);
 
@@ -4050,7 +4049,8 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
vc->entry_exit_map = 0x101;
vc->in_guest = 0;
 
-   set_dec(local_paca->kvm_hstate.dec_expires - mftb());
+   next_timer = timer_get_next_tb();
+   set_dec(next_timer - mftb());
/* We may have raced with new irq work */
if (test_irq_work_pending())
set_dec(1);
-- 
2.23.0



[PATCH v3 03/52] KVM: PPC: Book3S HV P9: Use set_dec to set decrementer to host

2021-10-04 Thread Nicholas Piggin
The host Linux timer code arms the decrementer with the value
'decrementers_next_tb - current_tb' using set_dec(), which stores
val - 1 on Book3S-64, which is not quite the same as what KVM does
to re-arm the host decrementer when exiting the guest.

This shouldn't be a significant change, but it makes the logic match
and avoids this small extra change being brought into the next patch.

Suggested-by: Alexey Kardashevskiy 
Signed-off-by: Nicholas Piggin 
---
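A small standalone model of the difference being unified (illustrative
only; set_dec() below just models the "stores val - 1" behaviour
described above, not the real Book3S-64 implementation):

#include <stdio.h>
#include <stdint.h>

static uint64_t dec_spr;                         /* stand-in for the DEC register */

static void mtspr_dec(uint64_t val) { dec_spr = val; }

static void set_dec(uint64_t val)                /* models the Book3S-64 helper */
{
        mtspr_dec(val - 1);
}

int main(void)
{
        uint64_t ticks = 1000;

        mtspr_dec(ticks);   /* old KVM exit path: raw mtspr */
        printf("mtspr path: DEC = %llu\n", (unsigned long long)dec_spr);

        set_dec(ticks);     /* host timer path, now used by KVM too */
        printf("set_dec path: DEC = %llu\n", (unsigned long long)dec_spr);
        return 0;
}
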
 arch/powerpc/kvm/book3s_hv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index f4a779fffd18..6a07a79f07d8 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4050,7 +4050,7 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
vc->entry_exit_map = 0x101;
vc->in_guest = 0;
 
-   mtspr(SPRN_DEC, local_paca->kvm_hstate.dec_expires - mftb());
+   set_dec(local_paca->kvm_hstate.dec_expires - mftb());
/* We may have raced with new irq work */
if (test_irq_work_pending())
set_dec(1);
-- 
2.23.0



[PATCH v3 02/52] powerpc/64s: guard optional TIDR SPR with CPU ftr test

2021-10-04 Thread Nicholas Piggin
The TIDR SPR only exists on POWER9. Avoid accessing it when the
feature bit for it is not set.

Signed-off-by: Nicholas Piggin 
---
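A standalone sketch of the guard pattern being applied (not kernel code;
the feature flag and the stub accessors below are stand-ins for
cpu_has_feature()/CPU_FTR_P9_TIDR and the SPR access):

#include <stdbool.h>
#include <stdio.h>

#define FTR_P9_TIDR (1u << 0)                    /* stand-in feature bit */

static unsigned int cpu_features;                /* would come from CPU setup */
static unsigned long tidr_value;                 /* stand-in for mfspr(SPRN_TIDR) */

static bool cpu_has_feature(unsigned int f) { return cpu_features & f; }

static void save_tidr(unsigned long *slot)
{
        if (cpu_has_feature(FTR_P9_TIDR))        /* skip the SPR access entirely */
                *slot = tidr_value;
}

int main(void)
{
        unsigned long saved = 0;

        cpu_features = 0;                        /* e.g. a CPU without TIDR */
        save_tidr(&saved);
        printf("saved=%lu (untouched when the feature is absent)\n", saved);
        return 0;
}
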
 arch/powerpc/kvm/book3s_hv.c | 12 
 arch/powerpc/xmon/xmon.c | 10 --
 2 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 2acb1c96cfaf..f4a779fffd18 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3767,7 +3767,8 @@ static void load_spr_state(struct kvm_vcpu *vcpu)
mtspr(SPRN_EBBHR, vcpu->arch.ebbhr);
mtspr(SPRN_EBBRR, vcpu->arch.ebbrr);
mtspr(SPRN_BESCR, vcpu->arch.bescr);
-   mtspr(SPRN_TIDR, vcpu->arch.tid);
+   if (cpu_has_feature(CPU_FTR_P9_TIDR))
+   mtspr(SPRN_TIDR, vcpu->arch.tid);
mtspr(SPRN_AMR, vcpu->arch.amr);
mtspr(SPRN_UAMOR, vcpu->arch.uamor);
 
@@ -3793,7 +3794,8 @@ static void store_spr_state(struct kvm_vcpu *vcpu)
vcpu->arch.ebbhr = mfspr(SPRN_EBBHR);
vcpu->arch.ebbrr = mfspr(SPRN_EBBRR);
vcpu->arch.bescr = mfspr(SPRN_BESCR);
-   vcpu->arch.tid = mfspr(SPRN_TIDR);
+   if (cpu_has_feature(CPU_FTR_P9_TIDR))
+   vcpu->arch.tid = mfspr(SPRN_TIDR);
vcpu->arch.amr = mfspr(SPRN_AMR);
vcpu->arch.uamor = mfspr(SPRN_UAMOR);
vcpu->arch.dscr = mfspr(SPRN_DSCR);
@@ -3813,7 +3815,8 @@ struct p9_host_os_sprs {
 static void save_p9_host_os_sprs(struct p9_host_os_sprs *host_os_sprs)
 {
host_os_sprs->dscr = mfspr(SPRN_DSCR);
-   host_os_sprs->tidr = mfspr(SPRN_TIDR);
+   if (cpu_has_feature(CPU_FTR_P9_TIDR))
+   host_os_sprs->tidr = mfspr(SPRN_TIDR);
host_os_sprs->iamr = mfspr(SPRN_IAMR);
host_os_sprs->amr = mfspr(SPRN_AMR);
host_os_sprs->fscr = mfspr(SPRN_FSCR);
@@ -3827,7 +3830,8 @@ static void restore_p9_host_os_sprs(struct kvm_vcpu *vcpu,
mtspr(SPRN_UAMOR, 0);
 
mtspr(SPRN_DSCR, host_os_sprs->dscr);
-   mtspr(SPRN_TIDR, host_os_sprs->tidr);
+   if (cpu_has_feature(CPU_FTR_P9_TIDR))
+   mtspr(SPRN_TIDR, host_os_sprs->tidr);
mtspr(SPRN_IAMR, host_os_sprs->iamr);
 
if (host_os_sprs->amr != vcpu->arch.amr)
diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index dd8241c009e5..7958e5aae844 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -2107,8 +2107,14 @@ static void dump_300_sprs(void)
if (!cpu_has_feature(CPU_FTR_ARCH_300))
return;
 
-   printf("pidr   = %.16lx  tidr  = %.16lx\n",
-   mfspr(SPRN_PID), mfspr(SPRN_TIDR));
+   if (cpu_has_feature(CPU_FTR_P9_TIDR)) {
+   printf("pidr   = %.16lx  tidr  = %.16lx\n",
+   mfspr(SPRN_PID), mfspr(SPRN_TIDR));
+   } else {
+   printf("pidr   = %.16lx\n",
+   mfspr(SPRN_PID));
+   }
+
printf("psscr  = %.16lx\n",
hv ? mfspr(SPRN_PSSCR) : mfspr(SPRN_PSSCR_PR));
 
-- 
2.23.0



[PATCH v3 01/52] powerpc/64s: Remove WORT SPR from POWER9/10 (take 2)

2021-10-04 Thread Nicholas Piggin
This removes a missed remnant of the WORT SPR.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/platforms/powernv/idle.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/idle.c 
b/arch/powerpc/platforms/powernv/idle.c
index e3ffdc8e8567..86e787502e42 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -589,7 +589,6 @@ struct p9_sprs {
u64 purr;
u64 spurr;
u64 dscr;
-   u64 wort;
u64 ciabr;
 
u64 mmcra;
-- 
2.23.0



[PATCH v3 00/52] KVM: PPC: Book3S HV P9: entry/exit optimisations

2021-10-04 Thread Nicholas Piggin
This reduces radix guest full entry/exit latency on POWER9 and POWER10
by 2x.

Nested HV guests should see smaller improvements in their L1 entry/exit,
but this is combined with most of the L0 speedups also applying to nested
entry. An nginx localhost throughput test in an SMP nested guest is
improved by about 10% (in a direct guest it doesn't change much because
it uses XIVE for IPIs) when L0 and L1 are patched.

It does this in several main ways:

- Rearrange code to optimise SPR accesses. Mainly, avoid scoreboard
  stalls.

- Test SPR values to avoid mtSPRs where possible. mtSPRs are expensive.

- Reduce mftb. mftb is expensive.

- Demand fault certain facilities to avoid saving and/or restoring them
  (at the cost of a fault when they are used, but this is mitigated over
  a number of entries, like the facilities when context switching 
  processes). PM, TM, and EBB so far.

- Defer some sequences that are made just in case a guest is interrupted
  in the middle of a critical section to the case where the guest is
  scheduled on a different CPU, rather than every time (at the cost of
  an extra IPI in this case). Namely the tlbsync sequence for radix with
  GTSE, which is very expensive.

- Reduce locking, barriers, atomics related to the vcpus-per-vcore > 1
  handling that the P9 path does not require.

Changes since v2:
- Rebased, several patches from the series were merged in the previous
  merge window.
- Fixed some compile errors noticed by kernel test robot.
- Added RB from Athira for the PMU stuff (thanks!)
- Split TIDR ftr check (patch 2) out into its own patch.
- Added a missed license tag on new file.

Changes since v1:
- Verified DPDES changes still work with msgsndp SMT emulation.
- Fixed HMI handling bug.
- Split softpatch handling fixes into smaller pieces.
- Rebased with Fabiano's latest HV sanitising patches.
- Fix TM demand faulting bug causing nested guest TM tests to TM Bad
  Thing the host in rare cases.
- Re-name new "pmu=" command line option to "pmu_override=" and update
  documentation wording.
- Add default=y config option rather than unconditionally removing the
  L0 nested PMU workaround.
- Remove unnecessary MSR[RI] updates in entry/exit. Down to about 4700
  cycles now.
- Another bugfix from Alexey's testing.

Changes since RFC:
- Rebased with Fabiano's HV sanitising patches at the front.
- Several demand faulting bug fixes mostly relating to nested guests.
- Removed facility demand-faulting from L0 nested entry/exit handler.
  Demand faulting is still done in the L1, but not the L0. The reason
  is to reduce complexity (although it's only a small amount of
  complexity), reduce demand faulting overhead that may require several

Thanks,
Nick

Nicholas Piggin (52):
  powerpc/64s: Remove WORT SPR from POWER9/10 (take 2)
  powerpc/64s: guard optional TIDR SPR with CPU ftr test
  KVM: PPC: Book3S HV P9: Use set_dec to set decrementer to host
  KVM: PPC: Book3S HV P9: Use host timer accounting to avoid decrementer
read
  KVM: PPC: Book3S HV P9: Use large decrementer for HDEC
  KVM: PPC: Book3S HV P9: Reduce mftb per guest entry/exit
  powerpc/time: add API for KVM to re-arm the host timer/decrementer
  KVM: PPC: Book3S HV: POWER10 enable HAIL when running radix guests
  powerpc/64s: Keep AMOR SPR a constant ~0 at runtime
  KVM: PPC: Book3S HV: Don't always save PMU for guest capable of
nesting
  powerpc/64s: Always set PMU control registers to frozen/disabled when
not in use
  powerpc/64s: Implement PMU override command line option
  KVM: PPC: Book3S HV P9: Implement PMU save/restore in C
  KVM: PPC: Book3S HV P9: Factor PMU save/load into context switch
functions
  KVM: PPC: Book3S HV P9: Demand fault PMU SPRs when marked not inuse
  KVM: PPC: Book3S HV P9: Factor out yield_count increment
  KVM: PPC: Book3S HV: CTRL SPR does not require read-modify-write
  KVM: PPC: Book3S HV P9: Move SPRG restore to restore_p9_host_os_sprs
  KVM: PPC: Book3S HV P9: Reduce mtmsrd instructions required to save
host SPRs
  KVM: PPC: Book3S HV P9: Improve mtmsrd scheduling by delaying MSR[EE]
disable
  KVM: PPC: Book3S HV P9: Add kvmppc_stop_thread to match
kvmppc_start_thread
  KVM: PPC: Book3S HV: Change dec_expires to be relative to guest
timebase
  KVM: PPC: Book3S HV P9: Move TB updates
  KVM: PPC: Book3S HV P9: Optimise timebase reads
  KVM: PPC: Book3S HV P9: Avoid SPR scoreboard stalls
  KVM: PPC: Book3S HV P9: Only execute mtSPR if the value changed
  KVM: PPC: Book3S HV P9: Juggle SPR switching around
  KVM: PPC: Book3S HV P9: Move vcpu register save/restore into functions
  KVM: PPC: Book3S HV P9: Move host OS save/restore functions to
built-in
  KVM: PPC: Book3S HV P9: Move nested guest entry into its own function
  KVM: PPC: Book3S HV P9: Move remaining SPR and MSR access into low
level entry
  KVM: PPC: Book3S HV P9: Implement TM fastpath for guest entry/exit
  KVM: PPC: Book3S HV P9: Switch PMU to guest as late as possible
 

[PATCH] KVM: PPC: Book3S HV: H_ENTER filter out reserved HPTE[B] value

2021-10-04 Thread Nicholas Piggin
The HPTE B field is a 2-bit field with values 0b10 and 0b11 reserved.
This field is also taken from the HPTE and used when KVM executes
TLBIEs to set the B field of those instructions.

Disallow the guest setting B to a reserved value with H_ENTER by
rejecting it. This is the same approach already taken for rejecting
reserved (unsupported) LLP values. This prevents the guest from being
able to induce the host to execute TLBIE with reserved values, which
is not known to be a problem with current processors but in theory it
could prevent the TLBIE from working correctly in a future processor.

Signed-off-by: Nicholas Piggin 
---
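A standalone sketch of the new check (illustrative only;
HPTE_V_SSIZE_SHIFT = 62 and H_PARAMETER = -4 restate the kernel's values
here as assumptions):

#include <stdint.h>
#include <stdio.h>

#define HPTE_V_SSIZE_SHIFT 62
#define H_PARAMETER        (-4L)

static long check_hpte_b_field(uint64_t pteh)
{
        /* B lives in the top two bits of pteh; 0b10 and 0b11 are reserved. */
        if ((pteh >> HPTE_V_SSIZE_SHIFT) & 0x2)
                return H_PARAMETER;
        return 0;
}

int main(void)
{
        printf("B=0b00 -> %ld\n", check_hpte_b_field(0x0ULL << HPTE_V_SSIZE_SHIFT));
        printf("B=0b01 -> %ld\n", check_hpte_b_field(0x1ULL << HPTE_V_SSIZE_SHIFT));
        printf("B=0b10 -> %ld\n", check_hpte_b_field(0x2ULL << HPTE_V_SSIZE_SHIFT));
        return 0;
}
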
 arch/powerpc/include/asm/kvm_book3s_64.h | 4 
 arch/powerpc/kvm/book3s_hv_rm_mmu.c  | 9 +
 2 files changed, 13 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index 19b6942c6969..fff391b9b97b 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -378,6 +378,10 @@ static inline unsigned long compute_tlbie_rb(unsigned long 
v, unsigned long r,
rb |= 1;/* L field */
rb |= r & 0xff000 & ((1ul << a_pgshift) - 1); /* LP field */
}
+   /*
+* This sets both bits of the B field in the PTE. 0b1x values are
+* reserved, but those will have been filtered by kvmppc_do_h_enter.
+*/
rb |= (v >> HPTE_V_SSIZE_SHIFT) << 8;   /* B field */
return rb;
 }
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 632b2545072b..2c1f3c6e72d1 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -207,6 +207,15 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long 
flags,
 
if (kvm_is_radix(kvm))
return H_FUNCTION;
+   /*
+* The HPTE gets used by compute_tlbie_rb() to set TLBIE bits, so
+* these functions should work together -- must ensure a guest can not
+* cause problems with the TLBIE that KVM executes.
+*/
+   if ((pteh >> HPTE_V_SSIZE_SHIFT) & 0x2) {
+   /* B=0b1x is a reserved value, disallow it. */
+   return H_PARAMETER;
+   }
psize = kvmppc_actual_pgsz(pteh, ptel);
if (!psize)
return H_PARAMETER;
-- 
2.23.0



[PATCH 5/5] powerpc/64s: Fix unrecoverable MCE calling async handler from NMI

2021-10-04 Thread Nicholas Piggin
The machine check handler is not considered NMI on 64s. The early
handler is the true NMI handler, and then it schedules the
machine_check_exception handler to run when interrupts are enabled.

This works fine except in the case of an unrecoverable MCE, where the
true NMI is taken while MSR[RI] is clear and cannot recover, so it calls
machine_check_exception directly so that something might be done about it.

Calling an async handler from NMI context can result in irq state and
other things getting corrupted. This can also trigger the BUG at
  arch/powerpc/include/asm/interrupt.h:168
  BUG_ON(!arch_irq_disabled_regs(regs) && !(regs->msr & MSR_EE));

Fix this by making an _async version of the handler which is called
in the normal case, and a NMI version that is called for unrecoverable
interrupts.

Fixes: 2b43dd7653cc ("powerpc/64: enable MSR[EE] in irq replay pt_regs")
Signed-off-by: Nicholas Piggin 
---
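A standalone sketch of the shape of the fix, with stand-in names rather
than the kernel's: one shared body, an async entry wrapper for the normal
irq_work path, and an NMI entry wrapper for the unrecoverable
early-handler path.

#include <stdio.h>

static void __machine_check_body(const char *ctx)
{
        printf("handle MCE from %s context\n", ctx);
}

/* Normal case: scheduled by the early handler, runs with interrupts enabled. */
static void machine_check_async(void)  { __machine_check_body("async"); }

/* Unrecoverable case: called directly from the true NMI early handler. */
static void machine_check_nmi(void)    { __machine_check_body("NMI"); }

int main(void)
{
        machine_check_async();
        machine_check_nmi();
        return 0;
}
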
 arch/powerpc/include/asm/interrupt.h |  5 ++---
 arch/powerpc/kernel/exceptions-64s.S |  8 +--
 arch/powerpc/kernel/traps.c  | 31 
 3 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/include/asm/interrupt.h 
b/arch/powerpc/include/asm/interrupt.h
index b894b7169706..a1d238255f07 100644
--- a/arch/powerpc/include/asm/interrupt.h
+++ b/arch/powerpc/include/asm/interrupt.h
@@ -528,10 +528,9 @@ static __always_inline long ____##func(struct pt_regs
*regs)
 /* kernel/traps.c */
 DECLARE_INTERRUPT_HANDLER_NMI(system_reset_exception);
 #ifdef CONFIG_PPC_BOOK3S_64
-DECLARE_INTERRUPT_HANDLER_ASYNC(machine_check_exception);
-#else
-DECLARE_INTERRUPT_HANDLER_NMI(machine_check_exception);
+DECLARE_INTERRUPT_HANDLER_ASYNC(machine_check_exception_async);
 #endif
+DECLARE_INTERRUPT_HANDLER_NMI(machine_check_exception);
 DECLARE_INTERRUPT_HANDLER(SMIException);
 DECLARE_INTERRUPT_HANDLER(handle_hmi_exception);
 DECLARE_INTERRUPT_HANDLER(unknown_exception);
diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 024d9231f88c..eaf1f72131a1 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1243,7 +1243,7 @@ EXC_COMMON_BEGIN(machine_check_common)
li  r10,MSR_RI
mtmsrd  r10,1
addir3,r1,STACK_FRAME_OVERHEAD
-   bl  machine_check_exception
+   bl  machine_check_exception_async
b   interrupt_return_srr
 
 
@@ -1303,7 +1303,11 @@ END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
subir12,r12,1
sth r12,PACA_IN_MCE(r13)
 
-   /* Invoke machine_check_exception to print MCE event and panic. */
+   /*
+* Invoke machine_check_exception to print MCE event and panic.
+* This is the NMI version of the handler because we are called from
+* the early handler which is a true NMI.
+*/
addir3,r1,STACK_FRAME_OVERHEAD
bl  machine_check_exception
 
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index e453b13b..11741703d26e 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -796,24 +796,22 @@ void die_mce(const char *str, struct pt_regs *regs, long 
err)
 * do_exit() checks for in_interrupt() and panics in that case, so
 * exit the irq/nmi before calling die.
 */
-   if (IS_ENABLED(CONFIG_PPC_BOOK3S_64))
-   irq_exit();
-   else
+   if (in_nmi())
nmi_exit();
+   else
+   irq_exit();
die(str, regs, err);
 }
 
 /*
- * BOOK3S_64 does not call this handler as a non-maskable interrupt
+ * BOOK3S_64 does not usually call this handler as a non-maskable interrupt
  * (it uses its own early real-mode handler to handle the MCE proper
  * and then raises irq_work to call this handler when interrupts are
- * enabled).
+ * enabled). The only time when this is not true is if the early handler
+ * is unrecoverable, then it does call this directly to try to get a
+ * message out.
  */
-#ifdef CONFIG_PPC_BOOK3S_64
-DEFINE_INTERRUPT_HANDLER_ASYNC(machine_check_exception)
-#else
-DEFINE_INTERRUPT_HANDLER_NMI(machine_check_exception)
-#endif
+static void __machine_check_exception(struct pt_regs *regs)
 {
int recover = 0;
 
@@ -847,12 +845,19 @@ DEFINE_INTERRUPT_HANDLER_NMI(machine_check_exception)
/* Must die if the interrupt is not recoverable */
if (regs_is_unrecoverable(regs))
die_mce("Unrecoverable Machine check", regs, SIGBUS);
+}
 
 #ifdef CONFIG_PPC_BOOK3S_64
-   return;
-#else
-   return 0;
+DEFINE_INTERRUPT_HANDLER_ASYNC(machine_check_exception_async)
+{
+   __machine_check_exception(regs);
+}
 #endif
+DEFINE_INTERRUPT_HANDLER_NMI(machine_check_exception)
+{
+   __machine_check_exception(regs);
+
+   return 0;
 }
 
 DEFINE_INTERRUPT_HANDLER(SMIException) /* async? */
-- 
2.23.0



[PATCH 4/5] powerpc/64/interrupt: Reconcile soft-mask state in NMI and fix false BUG

2021-10-04 Thread Nicholas Piggin
If a NMI hits early in an interrupt handler before the irq soft-mask
state is reconciled, that can cause a false-positive BUG with a
CONFIG_PPC_IRQ_SOFT_MASK_DEBUG assertion.

Remove that assertion and instead check the case that if regs->msr has
EE clear, then regs->softe should be marked as disabled so the irq state
looks correct to NMI handlers, the same as how it's fixed up in the
case it was implicit soft-masked.

This doesn't fix a known problem -- the change that was fixed by commit
4ec5feec1ad02 ("powerpc/64s: Make NMI record implicitly soft-masked code
as irqs disabled") was the addition of a warning in the soft-nmi
watchdog interrupt which can never actually fire when MSR[EE]=0. However
it may be important if NMI handlers grow more code, and it is less
surprising to anything using 'regs' (I tripped over this when working
in the area).

Signed-off-by: Nicholas Piggin 
---
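A standalone sketch of the reconciliation rule (not kernel code; the
MSR_EE bit position and the IRQS_* values restate kernel constants as
assumptions, and the regs structure is simplified):

#include <stdbool.h>
#include <stdio.h>

#define MSR_EE            (1ul << 15)
#define IRQS_ENABLED      0
#define IRQS_ALL_DISABLED 3

struct regs { unsigned long msr; unsigned long softe; };

static void nmi_reconcile_softe(struct regs *r, bool implicit_soft_masked)
{
        /* If EE was off but softe was not yet marked disabled, fix it up so
         * arch_irq_disabled_regs() gives a consistent answer to NMI code. */
        if (!(r->msr & MSR_EE) || implicit_soft_masked)
                r->softe = IRQS_ALL_DISABLED;
}

int main(void)
{
        struct regs r = { .msr = 0, .softe = IRQS_ENABLED }; /* EE=0, not yet reconciled */

        nmi_reconcile_softe(&r, false);
        printf("softe after reconcile: %lu\n", r.softe);     /* now reads as disabled */
        return 0;
}
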
 arch/powerpc/include/asm/interrupt.h | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/interrupt.h 
b/arch/powerpc/include/asm/interrupt.h
index 6b800d3e2681..b894b7169706 100644
--- a/arch/powerpc/include/asm/interrupt.h
+++ b/arch/powerpc/include/asm/interrupt.h
@@ -265,13 +265,16 @@ static inline void interrupt_nmi_enter_prepare(struct 
pt_regs *regs, struct inte
local_paca->irq_soft_mask = IRQS_ALL_DISABLED;
local_paca->irq_happened |= PACA_IRQ_HARD_DIS;
 
-   if (is_implicit_soft_masked(regs)) {
-   // Adjust regs->softe soft implicit soft-mask, so
-   // arch_irq_disabled_regs(regs) behaves as expected.
+   if (!(regs->msr & MSR_EE) || is_implicit_soft_masked(regs)) {
+   /*
+* Adjust regs->softe to be soft-masked if it had not been
+* reconciled (e.g., interrupt entry with MSR[EE]=0 but softe
+* not yet set disabled), or if it was in an implicit soft
+* masked state. This makes arch_irq_disabled_regs(regs)
+* behave as expected.
+*/
regs->softe = IRQS_ALL_DISABLED;
}
-   if (IS_ENABLED(CONFIG_PPC_IRQ_SOFT_MASK_DEBUG))
-   BUG_ON(!arch_irq_disabled_regs(regs) && !(regs->msr & MSR_EE));
 
/* Don't do any per-CPU operations until interrupt state is fixed */
 
-- 
2.23.0



[PATCH 3/5] powerpc/64: warn if local irqs are enabled in NMI or hardirq context

2021-10-04 Thread Nicholas Piggin
This can help catch bugs such as the one fixed by the previous change
to prevent _exception() from enabling irqs.

ppc32 could have a similar warning but it has no good config option to
debug this stuff (the test may be overkill to add for production
kernels).

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/irq.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 551b653228c4..c4f1d6b7d992 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -229,6 +229,9 @@ notrace void arch_local_irq_restore(unsigned long mask)
return;
}
 
+   if (IS_ENABLED(CONFIG_PPC_IRQ_SOFT_MASK_DEBUG))
+   WARN_ON_ONCE(in_nmi() || in_hardirq());
+
/*
 * After the stb, interrupts are unmasked and there are no interrupts
 * pending replay. The restart sequence makes this atomic with
@@ -321,6 +324,9 @@ notrace void arch_local_irq_restore(unsigned long mask)
if (mask)
return;
 
+   if (IS_ENABLED(CONFIG_PPC_IRQ_SOFT_MASK_DEBUG))
+   WARN_ON_ONCE(in_nmi() || in_hardirq());
+
/*
 * From this point onward, we can take interrupts, preempt,
 * etc... unless we got hard-disabled. We check if an event
-- 
2.23.0



[PATCH 2/5] powerpc/traps: do not enable irqs in _exception

2021-10-04 Thread Nicholas Piggin
_exception can be called by machine check handlers when the MCE hits
user code (e.g., pseries and powernv). This will enable local irqs,
which is a dicey thing to do in NMI or hard irq context.

This seemed to work out okay because a userspace MCE can basically be
treated like a synchronous interrupt (after async / imprecise MCEs are
filtered out). Since NMI and hard irq handlers have started growing
nmi_enter / irq_enter, and more irq state sanity checks, this has
started to cause problems (or at least trigger warnings).

The Fixes tag points to the commit which introduced this, rather than
trying to work out exactly which commit was the first that could
possibly cause a problem, because that may be difficult to prove.

Fixes: 9f2f79e3a3c1 ("powerpc: Disable interrupts in 64-bit kernel FP and 
vector faults")
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/traps.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index aac8c0412ff9..e453b13b 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -340,10 +340,16 @@ static bool exception_common(int signr, struct pt_regs 
*regs, int code,
return false;
}
 
-   show_signal_msg(signr, regs, code, addr);
+   /*
+* Must not enable interrupts even for user-mode exception, because
+* this can be called from machine check, which may be a NMI or IRQ
+* which don't like interrupts being enabled. Could check for
+* in_hardirq || in_nmi perhaps, but there doesn't seem to be a good
+* reason why _exception() should enable irqs for an exception handler,
+* the handlers themselves do that directly.
+*/
 
-   if (arch_irqs_disabled())
-   interrupt_cond_local_irq_enable(regs);
+   show_signal_msg(signr, regs, code, addr);
 
current->thread.trap_nr = code;
 
-- 
2.23.0



[PATCH 0/5] powerpc: various interrupt handling fixes

2021-10-04 Thread Nicholas Piggin
This fixes a number of bugs found mostly while looking at an MCE handler
issue, which should be fixed by patch 5 of this series; the previous
attempt, which Ganesh found to be wrong, is here:

https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20210922020247.209409-1-npig...@gmail.com/

I didn't increment to patch v2 because it's a different approach (so I
gave it a different title).

Thanks,
Nick

Nicholas Piggin (5):
  powerpc/64s: fix program check interrupt emergency stack path
  powerpc/traps: do not enable irqs in _exception
  powerpc/64: warn if local irqs are enabled in NMI or hardirq context
  powerpc/64/interrupt: Reconcile soft-mask state in NMI and fix false
BUG
  powerpc/64s: Fix unrecoverable MCE calling async handler from NMI

 arch/powerpc/include/asm/interrupt.h | 18 ++--
 arch/powerpc/kernel/exceptions-64s.S | 25 ++--
 arch/powerpc/kernel/irq.c|  6 
 arch/powerpc/kernel/traps.c  | 43 +---
 4 files changed, 59 insertions(+), 33 deletions(-)

-- 
2.23.0



[PATCH 1/5] powerpc/64s: fix program check interrupt emergency stack path

2021-10-04 Thread Nicholas Piggin
The emergency stack path was jumping into a 3: label inside the
__GEN_COMMON_BODY macro for the normal path after it had finished,
rather than jumping over it. By a small miracle this is the correct
place to build up a new interrupt frame with the existing stack
pointer, so things basically worked okay, with an added weird-looking
700 trap frame on top (which had the wrong ->nip, so it didn't decode
bug messages either).

Fix this by avoiding using numeric labels when jumping over non-trivial
macros.

Before:

 LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
 Modules linked in:
 CPU: 0 PID: 88 Comm: sh Not tainted 5.15.0-rc2-00034-ge057cdade6e5 #2637
 NIP:  7265677368657265 LR: c006c0c8 CTR: c00097f0
 REGS: c000fffb3a50 TRAP: 0700   Not tainted
 MSR:  90021031   CR: 0700  XER: 2004
 CFAR: c00098b0 IRQMASK: 0
 GPR00: c006c964 c000fffb3cf0 c1513800 
 GPR04: 48ab0778 4200  1299
 GPR08: 01e447c718ec 22424282 2710 c006bee8
 GPR12: 90009033 c16b 00b0 0001
 GPR16:  0002  0ff8
 GPR20: 1fff 0007 0080 7fff89d90158
 GPR24: 0200 0200 0255 0300
 GPR28: c127 4200 48ab0778 c00080647e80
 NIP [7265677368657265] 0x7265677368657265
 LR [c006c0c8] ___do_page_fault+0x3f8/0xb10
 Call Trace:
 [c000fffb3cf0] [c000bdac] soft_nmi_common+0x13c/0x1d0 (unreliable)
 --- interrupt: 700 at decrementer_common_virt+0xb8/0x230
 NIP:  c00098b8 LR: c006c0c8 CTR: c00097f0
 REGS: c000fffb3d60 TRAP: 0700   Not tainted
 MSR:  90021031   CR: 22424282  XER: 2004
 CFAR: c00098b0 IRQMASK: 0
 GPR00: c006c964 2400 c1513800 
 GPR04: 48ab0778 4200  1299
 GPR08: 01e447c718ec 22424282 2710 c006bee8
 GPR12: 90009033 c16b 00b0 0001
 GPR16:  0002  0ff8
 GPR20: 1fff 0007 0080 7fff89d90158
 GPR24: 0200 0200 0255 0300
 GPR28: c127 4200 48ab0778 c00080647e80
 NIP [c00098b8] decrementer_common_virt+0xb8/0x230
 LR [c006c0c8] ___do_page_fault+0x3f8/0xb10
 --- interrupt: 700
 Instruction dump:
        
        
 ---[ end trace 6d28218e0cc3c949 ]---

After:

 [ cut here ]
 kernel BUG at arch/powerpc/kernel/exceptions-64s.S:491!
 Oops: Exception in kernel mode, sig: 5 [#1]
 LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
 Modules linked in:
 CPU: 0 PID: 88 Comm: login Not tainted 5.15.0-rc2-00034-ge057cdade6e5-dirty 
#2638
 NIP:  c00098b8 LR: c006bf04 CTR: c00097f0
 REGS: c000fffb3d60 TRAP: 0700   Not tainted
 MSR:  90021031   CR: 24482227  XER: 0004
 CFAR: c00098b0 IRQMASK: 0
 GPR00: c006bf04 2400 c1513800 c1271868
 GPR04: 100f0d29 4200 0007 0009
 GPR08: 100f0d29 24482227 2710 c0181b3c
 GPR12: 90009033 c16b 100f0d29 c5b22f00
 GPR16:  0001 0009 100eed90
 GPR20: 100eed90 1000 1000a49c 100f1430
 GPR24: c1271868 0200 0215 0300
 GPR28: c1271800 4200 100f0d29 c00080647860
 NIP [c00098b8] decrementer_common_virt+0xb8/0x230
 LR [c006bf04] ___do_page_fault+0x234/0xb10
 Call Trace:
 Instruction dump:
 4182000c 3941 4808 894d0932 714a0001 3948 408225fc 718a4000
 7c2a0b78 3821fcf0 41c20008 e82d0910 <0981fcf0> f92101a0 f9610170 f9810178
 ---[ end trace a5dbd1f5ea4ccc51 ]---

Fixes: 0a882e28468f4 ("powerpc/64s/exception: remove bad stack branch")
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 17 ++---
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 37859e62a8dc..024d9231f88c 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1665,27 +1665,30 @@ EXC_COMMON_BEGIN(program_check_common)
 */
 
andi.   r10,r12,MSR_PR
-   bne 2f  /* If userspace, go normal path */
+   bne .Lnormal_stack  

[PATCH v3 6/6] powerpc/64s/interrupt: avoid saving CFAR in some asynchronous interrupts

2021-09-22 Thread Nicholas Piggin
Reading the CFAR register is quite costly (~20 cycles on POWER9). It is
a good idea to save it for most synchronous interrupts, but for async
ones it is much less important.

Doorbell, external, and decrementer interrupts are the important
asynchronous ones. HV interrupts can't skip CFAR if KVM HV is possible,
because it might be a guest exit that requires CFAR preserved. But the
important pseries interrupts can avoid loading CFAR.

Signed-off-by: Nicholas Piggin 
---
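A standalone C model of the per-interrupt choice (illustrative only; the
struct below stands in for the ICFAR/ICFAR_IF_HVMODE assembler flags, and
the 20-cycle figure is the rough cost quoted above):

#include <stdbool.h>
#include <stdio.h>

struct int_type {
        const char *name;
        bool wants_cfar;        /* models ICFAR */
        bool cfar_if_hvmode;    /* models ICFAR_IF_HVMODE */
};

static int cycles_spent;

static void entry(const struct int_type *t, bool hvmode)
{
        if (t->wants_cfar || (t->cfar_if_hvmode && hvmode))
                cycles_spent += 20;     /* roughly, the mfspr SPRN_CFAR */
}

int main(void)
{
        const struct int_type dec = { "decrementer", false, false };
        const struct int_type ext = { "external",    false, true  };
        const struct int_type pgf = { "page fault",  true,  false };

        entry(&dec, false);
        entry(&ext, false);             /* pseries guest: CFAR skipped */
        entry(&pgf, false);             /* synchronous: CFAR still saved */
        printf("CFAR read cost across these entries: %d cycles\n", cycles_spent);
        return 0;
}
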
 arch/powerpc/kernel/exceptions-64s.S | 63 
 1 file changed, 63 insertions(+)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 4dcc76206f8e..eb3af5abdd37 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -111,6 +111,8 @@ name:
 #define IAREA  .L_IAREA_\name\()   /* PACA save area */
 #define IVIRT  .L_IVIRT_\name\()   /* Has virt mode entry point */
 #define IISIDE .L_IISIDE_\name\()  /* Uses SRR0/1 not DAR/DSISR */
+#define ICFAR  .L_ICFAR_\name\()   /* Uses CFAR */
+#define ICFAR_IF_HVMODE.L_ICFAR_IF_HVMODE_\name\() /* Uses CFAR if HV 
*/
 #define IDAR   .L_IDAR_\name\()/* Uses DAR (or SRR0) */
 #define IDSISR .L_IDSISR_\name\()  /* Uses DSISR (or SRR1) */
 #define IBRANCH_TO_COMMON  .L_IBRANCH_TO_COMMON_\name\() /* ENTRY branch 
to common */
@@ -150,6 +152,12 @@ do_define_int n
.ifndef IISIDE
IISIDE=0
.endif
+   .ifndef ICFAR
+   ICFAR=1
+   .endif
+   .ifndef ICFAR_IF_HVMODE
+   ICFAR_IF_HVMODE=0
+   .endif
.ifndef IDAR
IDAR=0
.endif
@@ -287,9 +295,21 @@ BEGIN_FTR_SECTION
 END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
HMT_MEDIUM
std r10,IAREA+EX_R10(r13)   /* save r10 - r12 */
+   .if ICFAR
 BEGIN_FTR_SECTION
mfspr   r10,SPRN_CFAR
 END_FTR_SECTION_IFSET(CPU_FTR_CFAR)
+   .elseif ICFAR_IF_HVMODE
+BEGIN_FTR_SECTION
+  BEGIN_FTR_SECTION_NESTED(69)
+   mfspr   r10,SPRN_CFAR
+  END_FTR_SECTION_NESTED(CPU_FTR_CFAR, CPU_FTR_CFAR, 69)
+FTR_SECTION_ELSE
+  BEGIN_FTR_SECTION_NESTED(69)
+   li  r10,0
+  END_FTR_SECTION_NESTED(CPU_FTR_CFAR, CPU_FTR_CFAR, 69)
+ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
+   .endif
.if \ool
.if !\virt
b   tramp_real_\name
@@ -305,9 +325,11 @@ END_FTR_SECTION_IFSET(CPU_FTR_CFAR)
 BEGIN_FTR_SECTION
std r9,IAREA+EX_PPR(r13)
 END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
+   .if ICFAR || ICFAR_IF_HVMODE
 BEGIN_FTR_SECTION
std r10,IAREA+EX_CFAR(r13)
 END_FTR_SECTION_IFSET(CPU_FTR_CFAR)
+   .endif
INTERRUPT_TO_KERNEL
mfctr   r10
std r10,IAREA+EX_CTR(r13)
@@ -559,7 +581,11 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
.endif
 
 BEGIN_FTR_SECTION
+   .if ICFAR || ICFAR_IF_HVMODE
ld  r10,IAREA+EX_CFAR(r13)
+   .else
+   li  r10,0
+   .endif
std r10,ORIG_GPR3(r1)
 END_FTR_SECTION_IFSET(CPU_FTR_CFAR)
ld  r10,IAREA+EX_CTR(r13)
@@ -1502,6 +1528,12 @@ ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX)
  *
  * If soft masked, the masked handler will note the pending interrupt for
  * replay, and clear MSR[EE] in the interrupted context.
+ *
+ * CFAR is not required because this is an asynchronous interrupt that in
+ * general won't have much bearing on the state of the CPU, with the possible
+ * exception of crash/debug IPIs, but those are generally moving to use SRESET
+ * IPIs. Unless this is an HV interrupt and KVM HV is possible, in which case
+ * it may be exiting the guest and need CFAR to be saved.
  */
 INT_DEFINE_BEGIN(hardware_interrupt)
IVEC=0x500
@@ -1509,6 +1541,10 @@ INT_DEFINE_BEGIN(hardware_interrupt)
IMASK=IRQS_DISABLED
IKVM_REAL=1
IKVM_VIRT=1
+   ICFAR=0
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+   ICFAR_IF_HVMODE=1
+#endif
 INT_DEFINE_END(hardware_interrupt)
 
 EXC_REAL_BEGIN(hardware_interrupt, 0x500, 0x100)
@@ -1727,6 +1763,10 @@ END_FTR_SECTION_IFSET(CPU_FTR_TM)
  * If PPC_WATCHDOG is configured, the soft masked handler will actually set
  * things back up to run soft_nmi_interrupt as a regular interrupt handler
  * on the emergency stack.
+ *
+ * CFAR is not required because this is asynchronous (see hardware_interrupt).
+ * A watchdog interrupt may like to have CFAR, but usually the interesting
+ * branch is long gone by that point (e.g., infinite loop).
  */
 INT_DEFINE_BEGIN(decrementer)
IVEC=0x900
@@ -1734,6 +1774,7 @@ INT_DEFINE_BEGIN(decrementer)
 #ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
IKVM_REAL=1
 #endif
+   ICFAR=0
 INT_DEFINE_END(decrementer)
 
 EXC_REAL_BEGIN(decrementer, 0x900, 0x80)
@@ -1809,6 +1850,8 @@ EXC_COMMON_BEGIN(hdecrementer_common)
  * If soft masked, the masked handler will note the pending interrupt for
  * replay

[PATCH v3 5/6] powerpc/64/interrupt: reduce expensive debug tests

2021-09-22 Thread Nicholas Piggin
Move the assertions requiring restart table searches under
CONFIG_PPC_IRQ_SOFT_MASK_DEBUG.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/interrupt.h | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/interrupt.h 
b/arch/powerpc/include/asm/interrupt.h
index e178d143671a..0e84e99af37b 100644
--- a/arch/powerpc/include/asm/interrupt.h
+++ b/arch/powerpc/include/asm/interrupt.h
@@ -97,6 +97,11 @@ static inline void srr_regs_clobbered(void)
local_paca->hsrr_valid = 0;
 }
 #else
+static inline unsigned long search_kernel_restart_table(unsigned long addr)
+{
+   return 0;
+}
+
 static inline bool is_implicit_soft_masked(struct pt_regs *regs)
 {
return false;
@@ -190,13 +195,14 @@ static inline void interrupt_enter_prepare(struct pt_regs 
*regs, struct interrup
 */
if (TRAP(regs) != INTERRUPT_PROGRAM) {
CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
-   BUG_ON(is_implicit_soft_masked(regs));
+   if (IS_ENABLED(CONFIG_PPC_IRQ_SOFT_MASK_DEBUG))
+   BUG_ON(is_implicit_soft_masked(regs));
}
-#ifdef CONFIG_PPC_BOOK3S
+
/* Move this under a debugging check */
-   if (arch_irq_disabled_regs(regs))
+   if (IS_ENABLED(CONFIG_PPC_IRQ_SOFT_MASK_DEBUG) &&
+   arch_irq_disabled_regs(regs))
BUG_ON(search_kernel_restart_table(regs->nip));
-#endif
}
if (IS_ENABLED(CONFIG_PPC_IRQ_SOFT_MASK_DEBUG))
BUG_ON(!arch_irq_disabled_regs(regs) && !(regs->msr & MSR_EE));
-- 
2.23.0



[PATCH v3 4/6] powerpc/64s/interrupt: Don't enable MSR[EE] in irq handlers unless perf is in use

2021-09-22 Thread Nicholas Piggin
Enabling MSR[EE] in interrupt handlers while interrupts are still soft
masked allows PMIs to profile interrupt handlers to some degree, beyond
what SIAR latching allows.

When perf is not being used, this is almost useless work. It requires an
extra mtmsrd in the irq handler, and it also opens the door to masked
interrupts hitting and requiring replay, which is more expensive than
just taking them directly. This effect can be noticeable in high IRQ
workloads.

Avoid enabling MSR[EE] unless perf is currently in use. This saves about
60 cycles (or 8%) on a simple decrementer interrupt microbenchmark.
Replayed interrupts drop from 1.4% of all interrupts taken, to 0.003%.

This does prevent the soft-nmi interrupt being taken in these handlers,
but that's not too reliable anyway. The SMP watchdog will continue to be
the reliable way to catch lockups.

Signed-off-by: Nicholas Piggin 
---
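A standalone sketch of the predicate/action split (not kernel code; the
flag value and globals below are simplified stand-ins for the paca fields
and PMU state):

#include <stdbool.h>
#include <stdio.h>

#define PACA_IRQ_MUST_HARD_MASK 0x08             /* stand-in flag */

static bool perf_in_use;                         /* models power_pmu_wants_prompt_pmi() */
static unsigned int irq_happened;                /* models paca->irq_happened */
static bool msr_ee;

static bool should_hard_irq_enable(void)
{
        if (!perf_in_use)                        /* nothing to gain: stay hard-masked */
                return false;
        if (irq_happened & PACA_IRQ_MUST_HARD_MASK)
                return false;
        return true;
}

static void do_hard_irq_enable(void) { msr_ee = true; }  /* one mtmsrd in the kernel */

int main(void)
{
        perf_in_use = true;
        if (should_hard_irq_enable())
                do_hard_irq_enable();
        printf("MSR[EE] re-enabled in handler: %s\n", msr_ee ? "yes" : "no");
        return 0;
}
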
 arch/powerpc/include/asm/hw_irq.h | 57 +--
 arch/powerpc/kernel/dbell.c   |  3 +-
 arch/powerpc/kernel/irq.c |  3 +-
 arch/powerpc/kernel/time.c| 31 +
 4 files changed, 67 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/include/asm/hw_irq.h 
b/arch/powerpc/include/asm/hw_irq.h
index b987822e552e..55e3fa44f280 100644
--- a/arch/powerpc/include/asm/hw_irq.h
+++ b/arch/powerpc/include/asm/hw_irq.h
@@ -309,17 +309,54 @@ static inline bool lazy_irq_pending_nocheck(void)
 bool power_pmu_wants_prompt_pmi(void);
 
 /*
- * This is called by asynchronous interrupts to conditionally
- * re-enable hard interrupts after having cleared the source
- * of the interrupt. They are kept disabled if there is a different
- * soft-masked interrupt pending that requires hard masking.
+ * This is called by asynchronous interrupts to check whether to
+ * conditionally re-enable hard interrupts after having cleared
+ * the source of the interrupt. They are kept disabled if there
+ * is a different soft-masked interrupt pending that requires hard
+ * masking.
  */
-static inline void may_hard_irq_enable(void)
+static inline bool should_hard_irq_enable(void)
 {
-   if (!(get_paca()->irq_happened & PACA_IRQ_MUST_HARD_MASK)) {
-   get_paca()->irq_happened &= ~PACA_IRQ_HARD_DIS;
-   __hard_irq_enable();
-   }
+#ifdef CONFIG_PPC_IRQ_SOFT_MASK_DEBUG
+   WARN_ON(irq_soft_mask_return() == IRQS_ENABLED);
+   WARN_ON(mfmsr() & MSR_EE);
+#endif
+#ifdef CONFIG_PERF_EVENTS
+   /*
+* If the PMU is not running, there is not much reason to enable
+* MSR[EE] in irq handlers because any interrupts would just be
+* soft-masked.
+*
+* TODO: Add test for 64e
+*/
+   if (IS_ENABLED(CONFIG_PPC_BOOK3S_64) && !power_pmu_wants_prompt_pmi())
+   return false;
+
+   if (get_paca()->irq_happened & PACA_IRQ_MUST_HARD_MASK)
+   return false;
+
+   return true;
+#else
+   return false;
+#endif
+}
+
+/*
+ * Do the hard enabling, only call this if should_hard_irq_enable is true.
+ */
+static inline void do_hard_irq_enable(void)
+{
+#ifdef CONFIG_PPC_IRQ_SOFT_MASK_DEBUG
+   WARN_ON(irq_soft_mask_return() == IRQS_ENABLED);
+   WARN_ON(get_paca()->irq_happened & PACA_IRQ_MUST_HARD_MASK);
+   WARN_ON(mfmsr() & MSR_EE);
+#endif
+   /*
+* This allows PMI interrupts (and watchdog soft-NMIs) through.
+* There is no other reason to enable this way.
+*/
+   get_paca()->irq_happened &= ~PACA_IRQ_HARD_DIS;
+   __hard_irq_enable();
 }
 
 static inline bool arch_irq_disabled_regs(struct pt_regs *regs)
@@ -400,7 +437,7 @@ static inline bool arch_irq_disabled_regs(struct pt_regs 
*regs)
return !(regs->msr & MSR_EE);
 }
 
-static inline bool may_hard_irq_enable(void)
+static inline bool should_hard_irq_enable(void)
 {
return false;
 }
diff --git a/arch/powerpc/kernel/dbell.c b/arch/powerpc/kernel/dbell.c
index 5545c9cd17c1..f55c6fb34a3a 100644
--- a/arch/powerpc/kernel/dbell.c
+++ b/arch/powerpc/kernel/dbell.c
@@ -27,7 +27,8 @@ DEFINE_INTERRUPT_HANDLER_ASYNC(doorbell_exception)
 
ppc_msgsync();
 
-   may_hard_irq_enable();
+   if (should_hard_irq_enable())
+   do_hard_irq_enable();
 
kvmppc_clear_host_ipi(smp_processor_id());
__this_cpu_inc(irq_stat.doorbell_irqs);
diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 551b653228c4..f658aa22a21e 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -739,7 +739,8 @@ void __do_irq(struct pt_regs *regs)
irq = ppc_md.get_irq();
 
/* We can hard enable interrupts now to allow perf interrupts */
-   may_hard_irq_enable();
+   if (should_hard_irq_enable())
+   do_hard_irq_enable();
 
/* And finally process it */
if (unlikely(!irq))
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 9

[PATCH v3 3/6] powerpc/64s/perf: add power_pmu_wants_prompt_pmi to say whether perf wants PMIs to be soft-NMI

2021-09-22 Thread Nicholas Piggin
Interrupt code enables MSR[EE] in some irq handlers while keeping local
irqs disabled via soft-mask, allowing PMI interrupts to be taken as
soft-NMI to improve profiling of irq handlers.

When perf is not enabled, there is no point in doing this; it is just
additional overhead. So provide a function that can say whether PMIs
should be taken promptly when possible.

Cc: Madhavan Srinivasan 
Cc: Athira Rajeev 
Signed-off-by: Nicholas Piggin 
---
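A standalone sketch of the policy the new function implements
(illustrative only; the structs below are simplified stand-ins for the
kernel's ppmu and per-CPU cpu_hw_events):

#include <stdbool.h>
#include <stdio.h>

struct pmu_driver { const char *name; };
struct cpu_hw_events { int n_events; };

static struct pmu_driver *ppmu;                  /* NULL if no PMU driver registered */
static struct cpu_hw_events this_cpu_events;     /* models this CPU's events */

static bool power_pmu_wants_prompt_pmi(void)
{
        if (!ppmu)
                return false;
        /* Prompt PMIs are only worth the hard-enable if events are scheduled. */
        return this_cpu_events.n_events != 0;
}

int main(void)
{
        struct pmu_driver p9 = { "POWER9" };

        ppmu = &p9;
        this_cpu_events.n_events = 2;            /* e.g. two perf counters active */
        printf("want prompt PMI: %s\n", power_pmu_wants_prompt_pmi() ? "yes" : "no");
        return 0;
}
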
 arch/powerpc/include/asm/hw_irq.h |  2 ++
 arch/powerpc/perf/core-book3s.c   | 31 +++
 2 files changed, 33 insertions(+)

diff --git a/arch/powerpc/include/asm/hw_irq.h 
b/arch/powerpc/include/asm/hw_irq.h
index 21cc571ea9c2..b987822e552e 100644
--- a/arch/powerpc/include/asm/hw_irq.h
+++ b/arch/powerpc/include/asm/hw_irq.h
@@ -306,6 +306,8 @@ static inline bool lazy_irq_pending_nocheck(void)
return __lazy_irq_pending(local_paca->irq_happened);
 }
 
+bool power_pmu_wants_prompt_pmi(void);
+
 /*
  * This is called by asynchronous interrupts to conditionally
  * re-enable hard interrupts after having cleared the source
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 73e62e9b179b..773d07d68b6d 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #ifdef CONFIG_PPC64
@@ -2381,6 +2382,36 @@ static void perf_event_interrupt(struct pt_regs *regs)
perf_sample_event_took(sched_clock() - start_clock);
 }
 
+/*
+ * If the perf subsystem wants performance monitor interrupts as soon as
+ * possible (e.g., to sample the instruction address and stack chain),
+ * this should return true. The IRQ masking code can then enable MSR[EE]
+ * in some places (e.g., interrupt handlers) that allows PMI interrupts
+ * though to improve accuracy of profiles, at the cost of some performance.
+ *
+ * The PMU counters can be enabled by other means (e.g., sysfs raw SPR
+ * access), but in that case there is no need for prompt PMI handling.
+ *
+ * This currently returns true if any perf counter is being used. It
+ * could possibly return false if only events are being counted rather than
+ * samples being taken, but for now this is good enough.
+ */
+bool power_pmu_wants_prompt_pmi(void)
+{
+   struct cpu_hw_events *cpuhw;
+
+   /*
+* This could simply test local_paca->pmcregs_in_use if that were not
+* under ifdef KVM.
+*/
+
+   if (!ppmu)
+   return false;
+
cpuhw = this_cpu_ptr(&cpu_hw_events);
+   return cpuhw->n_events;
+}
+
 static int power_pmu_prepare_cpu(unsigned int cpu)
 {
struct cpu_hw_events *cpuhw = &per_cpu(cpu_hw_events, cpu);
-- 
2.23.0



[PATCH v3 2/6] powerpc/64s/interrupt: handle MSR EE and RI in interrupt entry wrapper

2021-09-22 Thread Nicholas Piggin
The mtmsrd to enable MSR[RI] can be combined with the mtmsrd to enable
MSR[EE] in interrupt entry code, for those interrupts which enable EE.
This helps performance of important synchronous interrupts (e.g., page
faults).

This is similar to what commit dd152f70bdc1 ("powerpc/64s: system call
avoid setting MSR[RI] until we set MSR[EE]") does for system calls.

Do this by enabling EE and RI together at the beginning of the entry
wrapper if PACA_IRQ_HARD_DIS is clear, and only enabling RI if it is
set.

Asynchronous interrupts set PACA_IRQ_HARD_DIS, but synchronous ones
leave it unchanged, so by default they always get EE=1 unless they have
interrupted a caller that is hard disabled. When the sync interrupt
later calls interrupt_cond_local_irq_enable(), it will not require
another mtmsrd because MSR[EE] was already enabled here.

This avoids one mtmsrd L=1 for synchronous interrupts on 64s, which
saves about 20 cycles on POWER9. And for kernel-mode interrupts, both
synchronous and asynchronous, this saves an additional 40 cycles due to
the mtmsrd being moved ahead of mfspr SPRN_AMR, which prevents a SPR
scoreboard stall.

Signed-off-by: Nicholas Piggin 
---
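A standalone model of the entry-wrapper decision (not kernel code; the
MSR bit positions and the PACA_IRQ_HARD_DIS value restate kernel
constants as assumptions, and mtmsrd() is a stub):

#include <stdio.h>

#define MSR_RI (1ul << 1)
#define MSR_EE (1ul << 15)
#define PACA_IRQ_HARD_DIS 0x01

static unsigned long msr;                        /* stand-in for the real MSR */
static unsigned int irq_happened;                /* models paca->irq_happened */

static void mtmsrd(unsigned long bits) { msr |= bits; }  /* the "expensive" op */

static void interrupt_enter_prepare(void)
{
        if (!(irq_happened & PACA_IRQ_HARD_DIS))
                mtmsrd(MSR_RI | MSR_EE);         /* sync interrupt: one combined mtmsrd */
        else
                mtmsrd(MSR_RI);                  /* async/hard-disabled: RI only */
}

int main(void)
{
        irq_happened = 0;                        /* e.g. a page fault from user mode */
        interrupt_enter_prepare();
        printf("MSR after entry: EE=%d RI=%d\n",
               (int)!!(msr & MSR_EE), (int)!!(msr & MSR_RI));
        return 0;
}
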
 arch/powerpc/include/asm/interrupt.h | 27 +---
 arch/powerpc/kernel/exceptions-64s.S | 38 +++-
 arch/powerpc/kernel/fpu.S|  5 
 arch/powerpc/kernel/vector.S | 10 
 4 files changed, 42 insertions(+), 38 deletions(-)

diff --git a/arch/powerpc/include/asm/interrupt.h 
b/arch/powerpc/include/asm/interrupt.h
index 3802390d8eea..e178d143671a 100644
--- a/arch/powerpc/include/asm/interrupt.h
+++ b/arch/powerpc/include/asm/interrupt.h
@@ -148,8 +148,14 @@ static inline void interrupt_enter_prepare(struct pt_regs 
*regs, struct interrup
 #endif
 
 #ifdef CONFIG_PPC64
-   if (irq_soft_mask_set_return(IRQS_ALL_DISABLED) == IRQS_ENABLED)
-   trace_hardirqs_off();
+   bool trace_enable = false;
+
+   if (IS_ENABLED(CONFIG_TRACE_IRQFLAGS)) {
+   if (irq_soft_mask_set_return(IRQS_ALL_DISABLED) == IRQS_ENABLED)
+   trace_enable = true;
+   } else {
+   irq_soft_mask_set(IRQS_ALL_DISABLED);
+   }
 
/*
 * If the interrupt was taken with HARD_DIS clear, then enable MSR[EE].
@@ -163,8 +169,14 @@ static inline void interrupt_enter_prepare(struct pt_regs 
*regs, struct interrup
if (IS_ENABLED(CONFIG_PPC_IRQ_SOFT_MASK_DEBUG))
BUG_ON(!(regs->msr & MSR_EE));
__hard_irq_enable();
+   } else {
+   __hard_RI_enable();
}
 
+   /* Do this when RI=1 because it can cause SLB faults */
+   if (trace_enable)
+   trace_hardirqs_off();
+
if (user_mode(regs)) {
CT_WARN_ON(ct_state() != CONTEXT_USER);
user_exit_irqoff();
@@ -217,13 +229,16 @@ static inline void interrupt_async_enter_prepare(struct 
pt_regs *regs, struct in
/* Ensure interrupt_enter_prepare does not enable MSR[EE] */
local_paca->irq_happened |= PACA_IRQ_HARD_DIS;
 #endif
+   interrupt_enter_prepare(regs, state);
 #ifdef CONFIG_PPC_BOOK3S_64
+   /*
+* RI=1 is set by interrupt_enter_prepare, so this thread flags access
+* has to come afterward (it can cause SLB faults).
+*/
if (cpu_has_feature(CPU_FTR_CTRL) &&
!test_thread_local_flags(_TLF_RUNLATCH))
__ppc64_runlatch_on();
 #endif
-
-   interrupt_enter_prepare(regs, state);
irq_enter();
 }
 
@@ -293,6 +308,8 @@ static inline void interrupt_nmi_enter_prepare(struct 
pt_regs *regs, struct inte
regs->softe = IRQS_ALL_DISABLED;
}
 
+   __hard_RI_enable();
+
/* Don't do any per-CPU operations until interrupt state is fixed */
 
if (nmi_disables_ftrace(regs)) {
@@ -390,6 +407,8 @@ interrupt_handler long func(struct pt_regs *regs)   
\
 {  \
long ret;   \
\
+   __hard_RI_enable(); \
+   \
ret = ____##func (regs);\
\
return ret; \
diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 37859e62a8dc..4dcc76206f8e 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -113,7 +113,6 @@ name:
 #define IISIDE .L_IISIDE_\name\()  /* Uses SRR0/1 not DAR/DSISR */
 #define IDAR   .L_IDAR_\name\()/* Uses D

[PATCH v3 1/6] powerpc/64/interrupt: make normal synchronous interrupts enable MSR[EE] if possible

2021-09-22 Thread Nicholas Piggin
Make synchronous interrupt handler entry wrappers enable MSR[EE] if
MSR[EE] was enabled in the interrupted context. IRQs are soft-disabled
at this point so there is no change to high level code, but it means a
masked interrupt could fire.

This is a performance disadvantage for interrupts which do not later
call interrupt_cond_local_irq_enable(), because an additional mtmsrd
or wrtee instruction is executed. However the important synchronous
interrupts (e.g., page fault) do enable interrupts, so the performance
disadvantage is mostly avoided.

In the next patch, MSR[RI] enabling can be combined with MSR[EE]
enabling, which mitigates the performance drop for the former and gives
a performance advantage for the latter interrupts, on 64s machines. 64e
is coming along for the ride for now to avoid divergences with 64s in
this tricky code.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/interrupt.h | 19 ++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/interrupt.h 
b/arch/powerpc/include/asm/interrupt.h
index b76ab848aa0d..3802390d8eea 100644
--- a/arch/powerpc/include/asm/interrupt.h
+++ b/arch/powerpc/include/asm/interrupt.h
@@ -150,7 +150,20 @@ static inline void interrupt_enter_prepare(struct pt_regs 
*regs, struct interrup
 #ifdef CONFIG_PPC64
if (irq_soft_mask_set_return(IRQS_ALL_DISABLED) == IRQS_ENABLED)
trace_hardirqs_off();
-   local_paca->irq_happened |= PACA_IRQ_HARD_DIS;
+
+   /*
+* If the interrupt was taken with HARD_DIS clear, then enable MSR[EE].
+* Asynchronous interrupts get here with HARD_DIS set (see below), so
+* this enables MSR[EE] for synchronous interrupts. IRQs remain
+* soft-masked. The interrupt handler may later call
+* interrupt_cond_local_irq_enable() to achieve a regular process
+* context.
+*/
+   if (!(local_paca->irq_happened & PACA_IRQ_HARD_DIS)) {
+   if (IS_ENABLED(CONFIG_PPC_IRQ_SOFT_MASK_DEBUG))
+   BUG_ON(!(regs->msr & MSR_EE));
+   __hard_irq_enable();
+   }
 
if (user_mode(regs)) {
CT_WARN_ON(ct_state() != CONTEXT_USER);
@@ -200,6 +213,10 @@ static inline void interrupt_exit_prepare(struct pt_regs 
*regs, struct interrupt
 
 static inline void interrupt_async_enter_prepare(struct pt_regs *regs, struct 
interrupt_state *state)
 {
+#ifdef CONFIG_PPC64
+   /* Ensure interrupt_enter_prepare does not enable MSR[EE] */
+   local_paca->irq_happened |= PACA_IRQ_HARD_DIS;
+#endif
 #ifdef CONFIG_PPC_BOOK3S_64
if (cpu_has_feature(CPU_FTR_CTRL) &&
!test_thread_local_flags(_TLF_RUNLATCH))
-- 
2.23.0



[PATCH v3 0/6] powerpc/64s: interrupt speedups

2021-09-22 Thread Nicholas Piggin
Here's a few stragglers. The first patch was submitted already but had
some bugs with unrecoverable exceptions on HPT (current->blah being
accessed before MSR[RI] was enabled). Those should be fixed now.

The others are generally for helping asynch interrupts, which are a bit
harder to measure well but important for IO and IPIs.

After this series, the SPR accesses of the interrupt handlers for radix
are becoming pretty optimal except for PPR which we could improve on,
and virt CPU accounting which is very costly -- we might disable that
by default unless someone comes up with a good reason to keep it.

Since v1:
- Compile fixes for 64e.
- Fixed a SOFT_MASK_DEBUG false positive.
- Improve function name and comments explaining why patch 2 does not
  need to hard enable when PMU is enabled via sysfs.

Since v2:
- Split first patch into patch 1 and 2, improve on the changelogs.
- More compile fixes.
- Fixed several review comments from Daniel.
- Added patch 5.

Thanks,
Nick

Nicholas Piggin (6):
  powerpc/64/interrupt: make normal synchronous interrupts enable
MSR[EE] if possible
  powerpc/64s/interrupt: handle MSR EE and RI in interrupt entry wrapper
  powerpc/64s/perf: add power_pmu_wants_prompt_pmi to say whether perf
wants PMIs to be soft-NMI
  powerpc/64s/interrupt: Don't enable MSR[EE] in irq handlers unless
perf is in use
  powerpc/64/interrupt: reduce expensive debug tests
  powerpc/64s/interrupt: avoid saving CFAR in some asynchronous
interrupts

 arch/powerpc/include/asm/hw_irq.h|  59 +---
 arch/powerpc/include/asm/interrupt.h |  58 ---
 arch/powerpc/kernel/dbell.c  |   3 +-
 arch/powerpc/kernel/exceptions-64s.S | 101 ++-
 arch/powerpc/kernel/fpu.S|   5 ++
 arch/powerpc/kernel/irq.c|   3 +-
 arch/powerpc/kernel/time.c   |  31 
 arch/powerpc/kernel/vector.S |  10 +++
 arch/powerpc/perf/core-book3s.c  |  31 
 9 files changed, 232 insertions(+), 69 deletions(-)

-- 
2.23.0



[PATCH v1] powerpc/64/interrupt: Reconcile soft-mask state in NMI and fix false BUG

2021-09-22 Thread Nicholas Piggin
If a NMI hits early in an interrupt handler before the irq soft-mask
state is reconciled, that can cause a false-positive BUG with a
CONFIG_PPC_IRQ_SOFT_MASK_DEBUG assertion.

Remove that assertion and instead check the case that if regs->msr has
EE clear, then regs->softe should be marked as disabled so the irq state
looks correct to NMI handlers, the same as how it's fixed up in the
case it was implicit soft-masked.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/interrupt.h | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/interrupt.h 
b/arch/powerpc/include/asm/interrupt.h
index b32ed910a8cf..b76ab848aa0d 100644
--- a/arch/powerpc/include/asm/interrupt.h
+++ b/arch/powerpc/include/asm/interrupt.h
@@ -265,13 +265,16 @@ static inline void interrupt_nmi_enter_prepare(struct 
pt_regs *regs, struct inte
local_paca->irq_soft_mask = IRQS_ALL_DISABLED;
local_paca->irq_happened |= PACA_IRQ_HARD_DIS;
 
-   if (is_implicit_soft_masked(regs)) {
-   // Adjust regs->softe soft implicit soft-mask, so
-   // arch_irq_disabled_regs(regs) behaves as expected.
+   if (!(regs->msr & MSR_EE) || is_implicit_soft_masked(regs)) {
+   /*
+* Adjust regs->softe to be soft-masked if it had not been
+* reconciled (e.g., interrupt entry with MSR[EE]=0 but softe
+* not yet set disabled), or if it was in an implicit soft
+* masked state. This makes arch_irq_disabled_regs(regs)
+* behave as expected.
+*/
regs->softe = IRQS_ALL_DISABLED;
}
-   if (IS_ENABLED(CONFIG_PPC_IRQ_SOFT_MASK_DEBUG))
-   BUG_ON(!arch_irq_disabled_regs(regs) && !(regs->msr & MSR_EE));
 
/* Don't do any per-CPU operations until interrupt state is fixed */
 
-- 
2.23.0



[PATCH v1] powerpc/64s: Fix unrecoverable MCE crash

2021-09-21 Thread Nicholas Piggin
The machine check handler is not considered NMI on 64s. The early
handler is the true NMI handler, and then it schedules the
machine_check_exception handler to run when interrupts are enabled.

This works fine except in the case of an unrecoverable MCE, where the
true NMI is taken while MSR[RI] is clear and cannot recover to schedule
the next handler, so it calls machine_check_exception directly so that
something might be done about it.

Calling an async handler from NMI context can result in irq state and
other things getting corrupted. This can also trigger the BUG at
arch/powerpc/include/asm/interrupt.h:168.

Fix this by just making the 64s machine_check_exception handler an NMI
like it is on other subarchs.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/interrupt.h |  4 
 arch/powerpc/kernel/traps.c  | 23 +++
 2 files changed, 7 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/include/asm/interrupt.h 
b/arch/powerpc/include/asm/interrupt.h
index 6b800d3e2681..b32ed910a8cf 100644
--- a/arch/powerpc/include/asm/interrupt.h
+++ b/arch/powerpc/include/asm/interrupt.h
@@ -524,11 +524,7 @@ static __always_inline long ____##func(struct pt_regs
*regs)
 /* Interrupt handlers */
 /* kernel/traps.c */
 DECLARE_INTERRUPT_HANDLER_NMI(system_reset_exception);
-#ifdef CONFIG_PPC_BOOK3S_64
-DECLARE_INTERRUPT_HANDLER_ASYNC(machine_check_exception);
-#else
 DECLARE_INTERRUPT_HANDLER_NMI(machine_check_exception);
-#endif
 DECLARE_INTERRUPT_HANDLER(SMIException);
 DECLARE_INTERRUPT_HANDLER(handle_hmi_exception);
 DECLARE_INTERRUPT_HANDLER(unknown_exception);
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index aac8c0412ff9..b21450c655d2 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -790,24 +790,19 @@ void die_mce(const char *str, struct pt_regs *regs, long 
err)
 * do_exit() checks for in_interrupt() and panics in that case, so
 * exit the irq/nmi before calling die.
 */
-   if (IS_ENABLED(CONFIG_PPC_BOOK3S_64))
-   irq_exit();
-   else
-   nmi_exit();
+   nmi_exit();
die(str, regs, err);
 }
 
 /*
- * BOOK3S_64 does not call this handler as a non-maskable interrupt
- * (it uses its own early real-mode handler to handle the MCE proper
- * and then raises irq_work to call this handler when interrupts are
- * enabled).
+ * BOOK3S_64 does not call this handler as a non-maskable interrupt (it uses
+ * its own early real-mode handler to handle the MCE proper and then raises
+ * irq_work to call this handler when interrupts are enabled), except in the
+ * case of unrecoverable_mce. If unrecoverable_mce was a separate NMI handler,
+ * then this could be ASYNC on 64s. However it should all work okay as an NMI
+ * handler (and it is NMI on other platforms) so just make it an NMI.
  */
-#ifdef CONFIG_PPC_BOOK3S_64
-DEFINE_INTERRUPT_HANDLER_ASYNC(machine_check_exception)
-#else
 DEFINE_INTERRUPT_HANDLER_NMI(machine_check_exception)
-#endif
 {
int recover = 0;
 
@@ -842,11 +837,7 @@ DEFINE_INTERRUPT_HANDLER_NMI(machine_check_exception)
if (regs_is_unrecoverable(regs))
die_mce("Unrecoverable Machine check", regs, SIGBUS);
 
-#ifdef CONFIG_PPC_BOOK3S_64
-   return;
-#else
return 0;
-#endif
 }
 
 DEFINE_INTERRUPT_HANDLER(SMIException) /* async? */
-- 
2.23.0



[PATCH v1] powerpc: flexible GPR range save/restore macros

2021-09-10 Thread Nicholas Piggin
Introduce macros that operate on a (start, end) range of GPRs, which
reduces lines of code and the need to do mental arithmetic while
reading the code.

Signed-off-by: Nicholas Piggin 
---
Since RFC:
- Removed FP / VMX regs changes from the patch

 arch/powerpc/boot/crt0.S  | 31 +++--
 arch/powerpc/crypto/md5-asm.S | 10 ++---
 arch/powerpc/crypto/sha1-powerpc-asm.S|  6 +--
 arch/powerpc/include/asm/ppc_asm.h| 43 ---
 arch/powerpc/kernel/entry_32.S| 23 --
 arch/powerpc/kernel/exceptions-64e.S  | 14 ++
 arch/powerpc/kernel/exceptions-64s.S  |  6 +--
 arch/powerpc/kernel/head_32.h |  3 +-
 arch/powerpc/kernel/head_booke.h  |  3 +-
 arch/powerpc/kernel/interrupt_64.S| 34 ++-
 arch/powerpc/kernel/optprobes_head.S  |  4 +-
 arch/powerpc/kernel/tm.S  | 15 ++-
 .../powerpc/kernel/trace/ftrace_64_mprofile.S | 14 +++---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   |  5 +--
 .../lib/test_emulate_step_exec_instr.S|  8 ++--
 15 files changed, 93 insertions(+), 126 deletions(-)

diff --git a/arch/powerpc/boot/crt0.S b/arch/powerpc/boot/crt0.S
index 1d83966f5ef6..e8f10a599659 100644
--- a/arch/powerpc/boot/crt0.S
+++ b/arch/powerpc/boot/crt0.S
@@ -226,16 +226,19 @@ p_base:   mflr    r10 /* r10 now points to 
runtime addr of p_base */
 #ifdef __powerpc64__
 
 #define PROM_FRAME_SIZE 512
-#define SAVE_GPR(n, base)   std n,8*(n)(base)
-#define REST_GPR(n, base)   ld  n,8*(n)(base)
-#define SAVE_2GPRS(n, base) SAVE_GPR(n, base); SAVE_GPR(n+1, base)
-#define SAVE_4GPRS(n, base) SAVE_2GPRS(n, base); SAVE_2GPRS(n+2, base)
-#define SAVE_8GPRS(n, base) SAVE_4GPRS(n, base); SAVE_4GPRS(n+4, base)
-#define SAVE_10GPRS(n, base)SAVE_8GPRS(n, base); SAVE_2GPRS(n+8, base)
-#define REST_2GPRS(n, base) REST_GPR(n, base); REST_GPR(n+1, base)
-#define REST_4GPRS(n, base) REST_2GPRS(n, base); REST_2GPRS(n+2, base)
-#define REST_8GPRS(n, base) REST_4GPRS(n, base); REST_4GPRS(n+4, base)
-#define REST_10GPRS(n, base)REST_8GPRS(n, base); REST_2GPRS(n+8, base)
+
+.macro OP_REGS op, width, start, end, base, offset
+   .Lreg=\start
+   .rept (\end - \start + 1)
+   \op .Lreg,\offset+\width*.Lreg(\base)
+   .Lreg=.Lreg+1
+   .endr
+.endm
+
+#define SAVE_GPRS(start, end, base)OP_REGS std, 8, start, end, base, 0
+#define REST_GPRS(start, end, base)OP_REGS ld, 8, start, end, base, 0
+#define SAVE_GPR(n, base)  SAVE_GPRS(n, n, base)
+#define REST_GPR(n, base)  REST_GPRS(n, n, base)
 
 /* prom handles the jump into and return from firmware.  The prom args pointer
is loaded in r3. */
@@ -246,9 +249,7 @@ prom:
stdu    r1,-PROM_FRAME_SIZE(r1) /* Save SP and create stack space */
 
SAVE_GPR(2, r1)
-   SAVE_GPR(13, r1)
-   SAVE_8GPRS(14, r1)
-   SAVE_10GPRS(22, r1)
+   SAVE_GPRS(13, 31, r1)
mfcr    r10
std r10,8*32(r1)
mfmsr   r10
@@ -283,9 +284,7 @@ prom:
 
/* Restore other registers */
REST_GPR(2, r1)
-   REST_GPR(13, r1)
-   REST_8GPRS(14, r1)
-   REST_10GPRS(22, r1)
+   REST_GPRS(13, 31, r1)
ld  r10,8*32(r1)
mtcr    r10
 
diff --git a/arch/powerpc/crypto/md5-asm.S b/arch/powerpc/crypto/md5-asm.S
index 948d100a2934..fa6bc440cf4a 100644
--- a/arch/powerpc/crypto/md5-asm.S
+++ b/arch/powerpc/crypto/md5-asm.S
@@ -38,15 +38,11 @@
 
 #define INITIALIZE \
PPC_STLU r1,-INT_FRAME_SIZE(r1); \
-   SAVE_8GPRS(14, r1); /* push registers onto stack*/ \
-   SAVE_4GPRS(22, r1);\
-   SAVE_GPR(26, r1)
+   SAVE_GPRS(14, 26, r1)   /* push registers onto stack*/
 
 #define FINALIZE \
-   REST_8GPRS(14, r1); /* pop registers from stack */ \
-   REST_4GPRS(22, r1);\
-   REST_GPR(26, r1);  \
-   addi    r1,r1,INT_FRAME_SIZE;
+   REST_GPRS(14, 26, r1);  /* pop registers from stack */ \
+   addi    r1,r1,INT_FRAME_SIZE
 
 #ifdef __BIG_ENDIAN__
 #define LOAD_DATA(reg, off) \
diff --git a/arch/powerpc/crypto/sha1-powerpc-asm.S 
b/arch/powerpc/crypto/sha1-powerpc-asm.S
index 23e248beff71..f0d5ed557ab1 100644
--- a/arch/powerpc/crypto/sha1-powerpc-asm.S
+++ b/arch/powerpc/crypto/sha1-powerpc-asm.S
@@ -125,8 +125,7 @@
 
 _GLOBAL(powerpc_sha_transform)
PPC_STLU r1,-INT_FRAME_SIZE(r1)
-   SAVE_8GPRS(14, r1)
-   SAVE_10GPRS(22, r1)
+   SAVE_GPRS(14, 31, r1)
 
/* Load up A - E */
lwz RA(0),0(r3) /* A */
@@ -184,7 +183,6 @@ _GLOBAL(powerpc_sha_transform)
stw RD(0),12(r3)
stw RE(0),16(r3)
 
-   REST_8GPRS(14, r1)
-   REST_10GPRS(22, r1

[PATCH v1 1/2] powerpc/64s: system call rfscv workaround for TM bugs

2021-09-08 Thread Nicholas Piggin
The rfscv instruction does not work correctly with the fake-suspend mode
in POWER9, which can result in the hypervisor restoring an incorrect
checkpoint.

Work around this by setting the _TIF_RESTOREALL flag if a system call
returns to a transaction active state, causing rfid to be used instead
of rfscv to return, which will do the right thing. The contents of the
registers are irrelevant because they will be overwritten in this case
anyway.

Reported-by: Eirik Fuller 
Fixes: 7fa95f9adaee7 ("powerpc/64s: system call support for scv/rfscv 
instructions")
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/interrupt.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/arch/powerpc/kernel/interrupt.c b/arch/powerpc/kernel/interrupt.c
index c77c80214ad3..917a2ac4def6 100644
--- a/arch/powerpc/kernel/interrupt.c
+++ b/arch/powerpc/kernel/interrupt.c
@@ -139,6 +139,19 @@ notrace long system_call_exception(long r3, long r4, long 
r5,
 */
irq_soft_mask_regs_set_state(regs, IRQS_ENABLED);
 
+   /*
+* If system call is called with TM active, set _TIF_RESTOREALL to
+* prevent RFSCV being used to return to userspace, because POWER9
+* TM implementation has problems with this instruction returning to
+* transactional state. Final register values are not relevant because
+* the transaction will be aborted upon return anyway. Or in the case
+* of unsupported_scv SIGILL fault, the return state does not much
+* matter because it's an edge case.
+*/
+   if (IS_ENABLED(CONFIG_PPC_TRANSACTIONAL_MEM) &&
+   unlikely(MSR_TM_TRANSACTIONAL(regs->msr)))
+   current_thread_info()->flags |= _TIF_RESTOREALL;
+
/*
 * If the system call was made with a transaction active, doom it and
 * return without performing the system call. Unless it was an
-- 
2.23.0



[PATCH v1 2/2] KVM: PPC: Book3S HV: Tolerate treclaim. in fake-suspend mode changing registers

2021-09-08 Thread Nicholas Piggin
POWER9 DD2.2 and 2.3 hardware implements a "fake-suspend" mode where
certain TM instructions executed in HV=0 mode cause softpatch interrupts
so the hypervisor can emulate them and prevent problematic processor
conditions. In this fake-suspend mode, the treclaim. instruction does
not modify registers.

Unfortunately the rfscv instruction executed by the guest does not
generate a softpatch interrupt, which can cause the hypervisor to lose
track of the fake-suspend mode, and it can then execute this treclaim.
while not in fake-suspend mode. This modifies GPRs and crashes the
hypervisor.

It's not trivial to disable scv in the guest with HFSCR now, because
guests assume a POWER9 has scv available. So this fix saves and restores
the checkpointed registers across the treclaim.

Fixes: 7854f7545bff ("KVM: PPC: Book3S: Rework TM save/restore code and make it 
C-callable")
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 36 +++--
 1 file changed, 34 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 8dd437d7a2c6..dd18e1c44751 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -2578,7 +2578,7 @@ END_FTR_SECTION_IFCLR(CPU_FTR_P9_TM_HV_ASSIST)
/* The following code handles the fake_suspend = 1 case */
mflr    r0
std r0, PPC_LR_STKOFF(r1)
-   stdu    r1, -PPC_MIN_STKFRM(r1)
+   stdu    r1, -TM_FRAME_SIZE(r1)
 
/* Turn on TM. */
mfmsr   r8
@@ -2593,10 +2593,42 @@ BEGIN_FTR_SECTION
 END_FTR_SECTION_IFSET(CPU_FTR_P9_TM_XER_SO_BUG)
nop
 
+   /*
+* It's possible that treclaim. may modify registers, if we have lost
+* track of fake-suspend state in the guest due to it using rfscv.
+* Save and restore registers in case this occurs.
+*/
+   mfspr   r3, SPRN_DSCR
+   mfspr   r4, SPRN_XER
+   mfspr   r5, SPRN_AMR
+   /* SPRN_TAR would need to be saved here if the kernel ever used it */
+   mfcr    r12
+   SAVE_NVGPRS(r1)
+   SAVE_GPR(2, r1)
+   SAVE_GPR(3, r1)
+   SAVE_GPR(4, r1)
+   SAVE_GPR(5, r1)
+   stw r12, 8(r1)
+   std r1, HSTATE_HOST_R1(r13)
+
/* We have to treclaim here because that's the only way to do S->N */
li  r3, TM_CAUSE_KVM_RESCHED
TRECLAIM(R3)
 
+   GET_PACA(r13)
+   ld  r1, HSTATE_HOST_R1(r13)
+   REST_GPR(2, r1)
+   REST_GPR(3, r1)
+   REST_GPR(4, r1)
+   REST_GPR(5, r1)
+   lwz r12, 8(r1)
+   REST_NVGPRS(r1)
+   mtspr   SPRN_DSCR, r3
+   mtspr   SPRN_XER, r4
+   mtspr   SPRN_AMR, r5
+   mtcr    r12
+   HMT_MEDIUM
+
/*
 * We were in fake suspend, so we are not going to save the
 * register state as the guest checkpointed state (since
@@ -2624,7 +2656,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_P9_TM_XER_SO_BUG)
std r5, VCPU_TFHAR(r9)
std r6, VCPU_TFIAR(r9)
 
-   addi    r1, r1, PPC_MIN_STKFRM
+   addi    r1, r1, TM_FRAME_SIZE
ld  r0, PPC_LR_STKOFF(r1)
mtlr    r0
blr
-- 
2.23.0



[PATCH v3 1/2] powerpc/64s: system call scv tabort fix for corrupt irq soft-mask state

2021-09-03 Thread Nicholas Piggin
If a system call is made with a transaction active, the kernel
immediately aborts it and returns. scv system calls disable irqs even
earlier in their interrupt handler, and tabort_syscall does not fix this
up.

This can result in irq soft-mask state being messed up on the next
kernel entry, and crashing at BUG_ON(arch_irq_disabled_regs(regs)) in
the kernel exit handlers, or possibly worse.

This can't easily be fixed in asm because at this point an async irq may
have hit, which is soft-masked and marked pending. The pending interrupt
has to be replayed before returning to userspace. The fix is to move the
tabort_syscall code to C in the main syscall handler, and just skip the
system call but otherwise return as usual, which will take care of the
pending irqs. This also does a bunch of other things including possible
signal delivery to the process, but the doomed transaction should still
be aborted when it is eventually returned to.

The sc system call path is changed to use the new C function as well to
reduce code and path differences. This slows down how quickly system
calls are aborted when called while a transaction is active, which could
potentially impact TM performance. But making any system call is already
bad for performance, and TM is on the way out, so go with simpler over
faster.

Reported-by: Eirik Fuller 
Fixes: 7fa95f9adaee7 ("powerpc/64s: system call support for scv/rfscv 
instructions")
Signed-off-by: Nicholas Piggin 
---

v2 of this fix had a bug where an irq could be soft masked and pending
before we hard disable interrupts in tabort_syscall for the case of
scv (because it enters the kernel with EE enabled). So this actually
requires a pretty large change to fix because we can't replay interrupts
just from this early asm context.

Thanks,
Nick
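
For context, a minimal userspace sketch of the scenario being fixed,
assuming GCC's PowerPC HTM builtins and -mhtm (hypothetical test program,
not part of the patch):

    /* Make a system call while a transaction is active. The kernel dooms
     * the transaction (persistent abort), so execution resumes at the
     * tbegin with the failure path taken and the syscall is not performed.
     * Before this fix, entering the kernel via scv in this state could
     * leave the irq soft-mask state corrupted on the next kernel entry.
     */
    #include <unistd.h>

    int main(void)
    {
            if (__builtin_tbegin(0)) {
                    getpid();          /* syscall inside the transaction */
                    __builtin_tend(0); /* not reached; transaction is doomed */
            }
            /* failure path: the kernel aborted the transaction */
            return 0;
    }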

 arch/powerpc/kernel/interrupt.c| 29 +
 arch/powerpc/kernel/interrupt_64.S | 41 --
 2 files changed, 29 insertions(+), 41 deletions(-)

diff --git a/arch/powerpc/kernel/interrupt.c b/arch/powerpc/kernel/interrupt.c
index 21bbd615ca41..c77c80214ad3 100644
--- a/arch/powerpc/kernel/interrupt.c
+++ b/arch/powerpc/kernel/interrupt.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #if defined(CONFIG_PPC_ADV_DEBUG_REGS) && defined(CONFIG_PPC32)
@@ -138,6 +139,34 @@ notrace long system_call_exception(long r3, long r4, long 
r5,
 */
irq_soft_mask_regs_set_state(regs, IRQS_ENABLED);
 
+   /*
+* If the system call was made with a transaction active, doom it and
+* return without performing the system call. Unless it was an
+* unsupported scv vector, in which case it's treated like an illegal
+* instruction.
+*/
+   if (IS_ENABLED(CONFIG_PPC_TRANSACTIONAL_MEM) &&
+   unlikely(MSR_TM_TRANSACTIONAL(regs->msr)) &&
+   !trap_is_unsupported_scv(regs)) {
+   /* Enable TM in the kernel, and disable EE (for scv) */
+   hard_irq_disable();
+   mtmsr(mfmsr() | MSR_TM);
+
+   /* tabort, this dooms the transaction, nothing else */
+   asm volatile(".long 0x7c00071d | ((%0) << 16)"
+   :: "r"(TM_CAUSE_SYSCALL|TM_CAUSE_PERSISTENT));
+
+   /*
+* Userspace will never see the return value. Execution will
+* resume after the tbegin. of the aborted transaction with the
+* checkpointed register state. A context switch could occur
+* or signal delivered to the process before resuming the
+* doomed transaction context, but that should all be handled
+* as expected.
+*/
+   return -ENOSYS;
+   }
+
local_irq_enable();
 
if (unlikely(current_thread_info()->flags & _TIF_SYSCALL_DOTRACE)) {
diff --git a/arch/powerpc/kernel/interrupt_64.S 
b/arch/powerpc/kernel/interrupt_64.S
index d4212d2ff0b5..ec950b08a8dc 100644
--- a/arch/powerpc/kernel/interrupt_64.S
+++ b/arch/powerpc/kernel/interrupt_64.S
@@ -12,7 +12,6 @@
 #include 
 #include 
 #include 
-#include 
 
.section".toc","aw"
 SYS_CALL_TABLE:
@@ -55,12 +54,6 @@ COMPAT_SYS_CALL_TABLE:
.globl system_call_vectored_\name
 system_call_vectored_\name:
 _ASM_NOKPROBE_SYMBOL(system_call_vectored_\name)
-#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
-BEGIN_FTR_SECTION
-   extrdi. r10, r12, 1, (63-MSR_TS_T_LG) /* transaction active? */
-   bne tabort_syscall
-END_FTR_SECTION_IFSET(CPU_FTR_TM)
-#endif
SCV_INTERRUPT_TO_KERNEL
mr  r10,r1
ld  r1,PACAKSAVE(r13)
@@ -247,12 +240,6 @@ _ASM_NOKPROBE_SYMBOL(system_call_common_real)
.globl system_call_common
 system_call_common:
 _ASM_NOKPROBE_SYMBOL(system_call_common)
-#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
-BEGIN_FTR_S

Re: [PATCH v2 1/2] powerpc/64s: system call scv tabort fix for corrupt irq soft-mask state

2021-09-01 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of September 2, 2021 3:21 am:
> 
> 
> On 01/09/2021 at 18:54, Nicholas Piggin wrote:
>> If a system call is made with a transaction active, the kernel
>> immediately aborts it and returns. scv system calls disable irqs even
>> earlier in their interrupt handler, and tabort_syscall does not fix this
>> up.
>> 
>> This can result in irq soft-mask state being messed up on the next
>> kernel entry, and crashing at BUG_ON(arch_irq_disabled_regs(regs)) in
>> the kernel exit handlers, or possibly worse.
>> 
>> Fix this by having tabort_syscall setting irq soft-mask back to enabled
>> (which requires MSR[EE] be disabled first).
>> 
>> Reported-by: Eirik Fuller 
>> Fixes: 7fa95f9adaee7 ("powerpc/64s: system call support for scv/rfscv 
>> instructions")
>> Signed-off-by: Nicholas Piggin 
>> ---
>> 
>> Tested the wrong kernel before sending v1 and missed a bug, sorry.
>> 
>>   arch/powerpc/kernel/interrupt_64.S | 8 +++-
>>   1 file changed, 7 insertions(+), 1 deletion(-)
>> 
>> diff --git a/arch/powerpc/kernel/interrupt_64.S 
>> b/arch/powerpc/kernel/interrupt_64.S
>> index d4212d2ff0b5..9c31d65b4851 100644
>> --- a/arch/powerpc/kernel/interrupt_64.S
>> +++ b/arch/powerpc/kernel/interrupt_64.S
>> @@ -428,16 +428,22 @@ RESTART_TABLE(.Lsyscall_rst_start, .Lsyscall_rst_end, 
>> syscall_restart)
>>   #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
>>   tabort_syscall:
>>   _ASM_NOKPROBE_SYMBOL(tabort_syscall)
>> -/* Firstly we need to enable TM in the kernel */
>> +/* We need to enable TM in the kernel, and disable EE (for scv) */
>>  mfmsr   r10
>>  li  r9, 1
>>  rldimi  r10, r9, MSR_TM_LG, 63-MSR_TM_LG
>> +LOAD_REG_IMMEDIATE(r9, MSR_EE)
>> +   andc    r10, r10, r9
> 
> Why not use 'rlwinm' to mask out MSR_EE ?
> 
> Something like
> 
>   rlwinm  r10, r10, 0, ~MSR_EE

Mainly because I'm bad at powerpc assembly. Why do you think I'm trying 
to change as much as possible to C?

Actually there should really be no need for mfmsr either; I wanted to
rewrite the thing entirely as

ld  r10,PACAKMSR(r13)
LOAD_REG_IMMEDIATE(r9, MSR_TM)
or  r10,r10,r9
mtmsrd  r10

But I thought that's not a minimal bug fix.

Thanks,
Nick
> 
>>  mtmsrd  r10, 0
>>   
>>  /* tabort, this dooms the transaction, nothing else */
>>  li  r9, (TM_CAUSE_SYSCALL|TM_CAUSE_PERSISTENT)
>>  TABORT(R9)
>>   
>> +/* scv has disabled irqs so must re-enable. sc just remains enabled */
>> +li  r9,IRQS_ENABLED
>> +stb r9,PACAIRQSOFTMASK(r13)
>> +
>>  /*
>>   * Return directly to userspace. We have corrupted user register state,
>>   * but userspace will never see that register state. Execution will
>> 
> 

