Re: [PATCH -next v2] KVM: PPC: Book3S HV: XIVE: Convert to DEFINE_SHOW_ATTRIBUTE

2020-09-22 Thread Paul Mackerras
On Sat, Sep 19, 2020 at 09:29:25AM +0800, Qinglang Miao wrote:
> Use DEFINE_SHOW_ATTRIBUTE macro to simplify the code.
> 
> Signed-off-by: Qinglang Miao 

Thanks, applied.

Paul.


Re: [PATCH] KVM: PPC: Don't return -ENOTSUPP to userspace in ioctls

2020-09-22 Thread Paul Mackerras
On Fri, Sep 11, 2020 at 12:53:45PM +0200, Greg Kurz wrote:
> ENOTSUPP is a linux only thingy, the value of which is unknown to
> userspace, not to be confused with ENOTSUP which linux maps to
> EOPNOTSUPP, as permitted by POSIX [1]:
> 
> [EOPNOTSUPP]
> Operation not supported on socket. The type of socket (address family
> or protocol) does not support the requested operation. A conforming
> implementation may assign the same values for [EOPNOTSUPP] and [ENOTSUP].
> 
> Return -EOPNOTSUPP instead of -ENOTSUPP for the following ioctls:
> - KVM_GET_FPU for Book3s and BookE
> - KVM_SET_FPU for Book3s and BookE
> - KVM_GET_DIRTY_LOG for BookE
> 
> This doesn't affect QEMU which doesn't call the KVM_GET_FPU and
> KVM_SET_FPU ioctls on POWER anyway since they are not supported,
> and _buggily_ ignores anything but -EPERM for KVM_GET_DIRTY_LOG.
> 
> [1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html
> 
> Signed-off-by: Greg Kurz 
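For anyone following along, the shape of the change is simply this (sketch
from memory -- the real hunks are the get/set FPU and dirty-log handlers
under arch/powerpc/kvm/):

	/* before: errno value is internal to the kernel */
	int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
	{
		return -ENOTSUPP;
	}

	/* after: userspace sees a POSIX-defined error code */
	int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
	{
		return -EOPNOTSUPP;
	}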

Thanks, applied.

Paul.


Re: [PATCH] KVM: PPC: Book3S HV: Do not allocate HPT for a nested guest

2020-09-22 Thread Paul Mackerras
On Fri, Sep 11, 2020 at 01:16:07AM -0300, Fabiano Rosas wrote:
> The current nested KVM code does not support HPT guests. This is
> informed/enforced in several ways:
> 
> - Hosts < P9 will not be able to enable the nested HV feature;
> 
> - The nested hypervisor MMU capabilities will not contain
>   KVM_CAP_PPC_MMU_HASH_V3;
> 
> - QEMU reflects the MMU capabilities in the
>   'ibm,arch-vec-5-platform-support' device-tree property;
> 
> - The nested guest, at 'prom_parse_mmu_model' ignores the
>   'disable_radix' kernel command line option if HPT is not supported;
> 
> - The KVM_PPC_CONFIGURE_V3_MMU ioctl will fail if trying to use HPT.
> 
> There is, however, still a way to start an HPT guest by using
> max-cpu-compat=power8 in the QEMU machine options. This leads to the
> guest being set to use hash after QEMU calls the KVM_PPC_ALLOCATE_HTAB
> ioctl.
> 
> With the guest set to hash, the nested hypervisor goes through the
> entry path that has no knowledge of nesting (kvmppc_run_vcpu) and
> crashes when it tries to execute a hypervisor-privileged instruction
> (mtspr HDEC) at __kvmppc_vcore_entry:
> 
> root@L1:~ $ qemu-system-ppc64 -machine pseries,max-cpu-compat=power8 ...
> 
> 
> [  538.543303] CPU: 83 PID: 25185 Comm: CPU 0/KVM Not tainted 5.9.0-rc4 #1
> [  538.543355] NIP:  c0080753f388 LR: c0080753f368 CTR: c01e5ec0
> [  538.543417] REGS: c013e91e33b0 TRAP: 0700   Not tainted  (5.9.0-rc4)
> [  538.543470] MSR:  82843033   CR: 22422882  XER: 2004
> [  538.543546] CFAR: c0080753f4b0 IRQMASK: 3
> GPR00: c008075397a0 c013e91e3640 c0080755e600 8000
> GPR04:  c013eab19800 c01394de 0043a054db72
> GPR08: 003b1652   c008075502e0
> GPR12: c01e5ec0 c007ffa74200 c013eab19800 0008
> GPR16:  c0139676c6c0 c1d23948 c013e91e38b8
> GPR20: 0053  0001 
> GPR24: 0001 0001  0001
> GPR28: 0001 0053 c013eab19800 0001
> [  538.544067] NIP [c0080753f388] __kvmppc_vcore_entry+0x90/0x104 [kvm_hv]
> [  538.544121] LR [c0080753f368] __kvmppc_vcore_entry+0x70/0x104 [kvm_hv]
> [  538.544173] Call Trace:
> [  538.544196] [c013e91e3640] [c013e91e3680] 0xc013e91e3680 (unreliable)
> [  538.544260] [c013e91e3820] [c008075397a0] kvmppc_run_core+0xbc8/0x19d0 [kvm_hv]
> [  538.544325] [c013e91e39e0] [c0080753d99c] kvmppc_vcpu_run_hv+0x404/0xc00 [kvm_hv]
> [  538.544394] [c013e91e3ad0] [c008072da4fc] kvmppc_vcpu_run+0x34/0x48 [kvm]
> [  538.544472] [c013e91e3af0] [c008072d61b8] kvm_arch_vcpu_ioctl_run+0x310/0x420 [kvm]
> [  538.544539] [c013e91e3b80] [c008072c7450] kvm_vcpu_ioctl+0x298/0x778 [kvm]
> [  538.544605] [c013e91e3ce0] [c04b8c2c] sys_ioctl+0x1dc/0xc90
> [  538.544662] [c013e91e3dc0] [c002f9a4] system_call_exception+0xe4/0x1c0
> [  538.544726] [c013e91e3e20] [c000d140] system_call_common+0xf0/0x27c
> [  538.544787] Instruction dump:
> [  538.544821] f86d1098 6000 6000 4899 e8ad0fe8 e8c500a0 e9264140 75290002
> [  538.544886] 7d1602a6 7cec42a6 40820008 7d0807b4 <7d164ba6> 7d083a14 f90d10a0 480104fd
> [  538.544953] ---[ end trace 74423e2b948c2e0c ]---
> 
> This patch makes the KVM_PPC_ALLOCATE_HTAB ioctl fail when running in
> the nested hypervisor, causing QEMU to abort.
> 
> Reported-by: Satheesh Rajendran 
> Signed-off-by: Fabiano Rosas 
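A minimal sketch of the sort of guard involved (kvmhv_on_pseries() is the
existing test for running as a nested hypervisor; the exact placement in
the final patch may differ):

	/* Refuse HPT allocation when we are ourselves a nested hypervisor,
	 * since the HPT (non-radix) guest entry path cannot handle nesting. */
	if (kvmhv_on_pseries())
		return -EINVAL;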

Thanks, applied.

Paul.


Re: [PATCH] KVM: PPC: Book3S: Remove redundant initialization of variable ret

2020-09-22 Thread Paul Mackerras
On Sat, Sep 19, 2020 at 03:12:30PM +0800, Jing Xiangfeng wrote:
> The variable ret is being initialized with '-ENOMEM', but that value is
> never used. So remove the initialization.
> 
> Signed-off-by: Jing Xiangfeng 

Thanks, applied.

Paul.


Re: [PATCH -next] powerpc/kvm/books: Fix symbol undeclared warnings

2020-09-22 Thread Paul Mackerras
On Mon, Sep 21, 2020 at 11:22:11AM +, Wang Wensheng wrote:
> Build the kernel with `C=2`:
> arch/powerpc/kvm/book3s_hv_nested.c:572:25: warning: symbol
> 'kvmhv_alloc_nested' was not declared. Should it be static?
> arch/powerpc/kvm/book3s_64_mmu_radix.c:350:6: warning: symbol
> 'kvmppc_radix_set_pte_at' was not declared. Should it be static?
> arch/powerpc/kvm/book3s_hv.c:3568:5: warning: symbol
> 'kvmhv_p9_guest_entry' was not declared. Should it be static?
> arch/powerpc/kvm/book3s_hv_rm_xics.c:767:15: warning: symbol 'eoi_rc'
> was not declared. Should it be static?
> arch/powerpc/kvm/book3s_64_vio_hv.c:240:13: warning: symbol
> 'iommu_tce_kill_rm' was not declared. Should it be static?
> arch/powerpc/kvm/book3s_64_vio.c:492:6: warning: symbol
> 'kvmppc_tce_iommu_do_map' was not declared. Should it be static?
> arch/powerpc/kvm/book3s_pr.c:572:6: warning: symbol 'kvmppc_set_pvr_pr'
> was not declared. Should it be static?
> 
> Those symbols are used only in the files that define them so make them
> static to fix the warnings.
> 
> Signed-off-by: Wang Wensheng 

Thanks, applied.

Paul.


Re: [PATCH V2] Doc: admin-guide: Add entry for kvm_cma_resv_ratio kernel param

2020-09-21 Thread Paul Mackerras
On Mon, Sep 21, 2020 at 02:32:20PM +0530, sathn...@linux.vnet.ibm.com wrote:
> From: Satheesh Rajendran 
> 
> Add document entry for kvm_cma_resv_ratio kernel param which
> is used to alter the KVM contiguous memory allocation percentage
> for hash pagetable allocation used by hash mode PowerPC KVM guests.
> 
> Cc: linux-ker...@vger.kernel.org
> Cc: kvm-...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Paul Mackerras 
> Cc: Michael Ellerman 
> Cc: Jonathan Corbet 
> Reviewed-by: Randy Dunlap 
> Signed-off-by: Satheesh Rajendran 
> ---
> 
> V2: 
> Addressed review comments from Randy.
> 
> V1: https://lkml.org/lkml/2020/9/16/72
> ---
>  Documentation/admin-guide/kernel-parameters.txt | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index a1068742a6df..932ed45740c9 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -2258,6 +2258,14 @@
>   [KVM,ARM] Allow use of GICv4 for direct injection of
>   LPIs.
>  
> + kvm_cma_resv_ratio=n [PPC]
> + Reserves given percentage from system memory area for
> + contiguous memory allocation for KVM hash pagetable
> + allocation.
> + By default it reserves 5% of total system memory.

I am concerned that using the term "reserve" here could give the
impression that this memory is then not available for any other use.
It is in fact available for other uses as long as they are movable
allocations.  So this memory is available for uses such as process
anonymous memory and page cache, just not for things like kmalloc.

I'm not sure what would be a better term than "reserve", though.
Perhaps we need to add a sentence something like "The reserved memory
is available for use as process memory and page cache when it is not
being used by KVM."
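For reference, the percentage is consumed by an early_param hook in
book3s_hv_builtin.c along these lines (paraphrased from memory; details
may differ):

	static unsigned long kvm_cma_resv_ratio = 5;	/* percent */

	static int __init early_parse_kvm_cma_resv(char *p)
	{
		if (!p)
			return -EINVAL;
		return kstrtoul(p, 0, &kvm_cma_resv_ratio);
	}
	early_param("kvm_cma_resv_ratio", early_parse_kvm_cma_resv);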

Paul.


Re: [PATCH] powerpc/64s: handle ISA v3.1 local copy-paste context switches

2020-09-02 Thread Paul Mackerras
On Tue, Aug 25, 2020 at 05:55:35PM +1000, Nicholas Piggin wrote:
> In ISA v3.1 the copy-paste facility has new memory move functionality
> which allows the copy buffer to be pasted to domestic memory (RAM) as
> opposed to foreign memory (accelerator).
> 
> This means the POWER9 trick of avoiding the cp_abort on context switch if
> the process had not mapped foreign memory does not work on POWER10. Do the
> cp_abort unconditionally there.
> 
> KVM must also cp_abort on guest exit to prevent copy buffer state leaking
> between contexts.
> 
> Signed-off-by: Nicholas Piggin 
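The shape of the change in the context-switch path, as a sketch (the
used_vas test is the existing POWER9 optimisation referred to above; the
actual hunk may differ):

	if (cpu_has_feature(CPU_FTR_ARCH_31)) {
		/* ISA v3.1: paste can target RAM, so the copy buffer can
		 * leak between processes; always discard it. */
		asm volatile(PPC_CP_ABORT);
	} else if (current->thread.used_vas) {
		/* POWER9: only needed if the task mapped foreign memory. */
		asm volatile(PPC_CP_ABORT);
	}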

For the KVM part:

Acked-by: Paul Mackerras 


Re: [RFC PATCH 1/2] KVM: PPC: Use the ppc_inst type

2020-09-02 Thread Paul Mackerras
On Wed, Sep 02, 2020 at 06:00:24PM +1000, Jordan Niethe wrote:
> On Wed, Sep 2, 2020 at 4:18 PM Paul Mackerras  wrote:
> >
> > On Thu, Aug 20, 2020 at 01:39:21PM +1000, Jordan Niethe wrote:
> > > The ppc_inst type was added to help cope with the addition of prefixed
> > > instructions to the ISA. Convert KVM to use this new type for dealing
> > > with instructions. For now do not try to add further support for
> > > prefixed instructions.
> >
> > This change does seem to splatter itself across a lot of code that
> > mostly or exclusively runs on machines which are not POWER10 and will
> > never need to handle prefixed instructions, unfortunately.  I wonder
> > if there is a less invasive way to approach this.
> Something less invasive would be good.
> >
> > In particular we are inflicting this 64-bit struct on 32-bit platforms
> > unnecessarily (I assume, correct me if I am wrong here).
> No, that is something that I wanted to avoid; on 32-bit platforms
> it is a 32-bit struct:
> 
> struct ppc_inst {
> u32 val;
> #ifdef CONFIG_PPC64
> u32 suffix;
> #endif
> } __packed;
> >
> > How would it be to do something like:
> >
> > typedef unsigned long ppc_inst_t;
> >
> > so it is 32 bits on 32-bit platforms and 64 bits on 64-bit platforms,
> > and then use that instead of 'struct ppc_inst'?  You would still need
> > to change the function declarations but I think most of the function
> > bodies would not need to be changed.  In particular you would avoid a
> > lot of the churn related to having to add ppc_inst_val() and suchlike.
> 
> Would the idea be to get rid of `struct ppc_inst` entirely or just not
> use it in kvm?
> In an earlier series I did something similar (at least code shared
> between 32bit and 64bit would need helpers, but 32bit only code need
> not change):
> 
> #ifdef __powerpc64__
> 
> typedef struct ppc_inst {
> union {
> struct {
> u32 word;
> u32 pad;
> } __packed;
> struct {
> u32 prefix;
> u32 suffix;
> } __packed;
> };
> } ppc_inst;
> 
> #else /* !__powerpc64__ */
> 
> typedef u32 ppc_inst;
> #endif
> 
> However mpe wanted to avoid using a typedef
> (https://patchwork.ozlabs.org/comment/2391845/)

Well it doesn't have to be typedef'd, it could just be "unsigned
long", which is used in other places for things that want to be 32-bit
on 32-bit machines and 64-bit on 64-bit machines.

I do however think that it should be a numeric type so that we can
mask, shift and compare it more easily.  I know that's less "abstract"
but it's also a lot less obfuscated and I think that will lead to
clearer code.  If you got the opposite advice from Michael Ellerman or
Nick Piggin then I will discuss it with them.

> We did also talk about just using a u64 for instructions
> (https://lore.kernel.org/linuxppc-dev/1585028462.t27rstc2uf.astr...@bobo.none/)
> but the concern was that as prefixed instructions act as two separate
> u32s (the prefix always comes before the suffix regardless of endianness)
> keeping it as a u64 would lead to a lot of macros and potential
> confusion.
> But it does seem that if it can avoid a lot of needless churn it might
> be worth the trade-off.

u64 instr;
u32 *ip;

instr = *ip++;
if (is_prefix(instr) && is_suitably_aligned(ip))
	instr = (instr << 32) | *ip++;

would avoid the endian issues pretty cleanly I think.  In other words
the prefix would always be the high half of the 64-bit value, so you
can't just do a single 64-bit load of the instruction on little-endian
platforms; but you can't do a single 64-bit load for other reasons as
well, such as alignment.
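(For what it's worth, is_prefix() here would just check for primary
opcode 1, which the ISA reserves for prefixes -- something like:

	static inline bool is_prefix(u32 word)
	{
		return (word >> 26) == 1;	/* primary opcode 1 = prefix */
	}

and is_suitably_aligned() would check that the suffix doesn't cross a
fetch boundary.)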

Paul.


Re: [RFC PATCH 2/2] KVM: PPC: Book3S HV: Support prefixed instructions

2020-09-02 Thread Paul Mackerras
On Thu, Aug 20, 2020 at 01:39:22PM +1000, Jordan Niethe wrote:
> There are two main places where instructions are loaded from the guest:
> * Emulate loadstore - such as when performing MMIO emulation
>   triggered by an HDSI
> * After an HV emulation assistance interrupt (e40)
> 
> If it is a prefixed instruction that triggers these cases, its suffix
> must be loaded. Use the SRR1_PREFIX bit to decide if a suffix needs to
> be loaded. Make sure that if this bit is set, inject_interrupt() also sets
> it when giving an interrupt to the guest.
> 
> ISA v3.10 extends the Hypervisor Emulation Instruction Register (HEIR)
> to 64 bits long to accommodate prefixed instructions. For interrupts
> caused by a word instruction the instruction is loaded into bits 32:63
> and bits 0:31 are zeroed. When caused by a prefixed instruction the
> prefix and suffix are loaded into bits 0:63.
> 
> Signed-off-by: Jordan Niethe 
> ---
>  arch/powerpc/kvm/book3s.c   | 15 +--
>  arch/powerpc/kvm/book3s_64_mmu_hv.c | 10 +++---
>  arch/powerpc/kvm/book3s_hv_builtin.c|  3 +++
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 14 ++
>  4 files changed, 37 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
> index 70d8967acc9b..18b1928a571b 100644
> --- a/arch/powerpc/kvm/book3s.c
> +++ b/arch/powerpc/kvm/book3s.c
> @@ -456,13 +456,24 @@ int kvmppc_load_last_inst(struct kvm_vcpu *vcpu,
>  {
>   ulong pc = kvmppc_get_pc(vcpu);
>   u32 word;
> + u64 doubleword;
>   int r;
>  
>   if (type == INST_SC)
>   pc -= 4;
>  
> - r = kvmppc_ld(vcpu, &pc, sizeof(u32), &word, false);
> - *inst = ppc_inst(word);
> + if ((kvmppc_get_msr(vcpu) & SRR1_PREFIXED)) {
> + r = kvmppc_ld(vcpu, &pc, sizeof(u64), &doubleword, false);

Should we also have a check here that the doubleword is not crossing a
page boundary?  I can't think of a way to get this code to cross a
page boundary, assuming the hardware is working correctly, but it
makes me just a little nervous.

> +#ifdef CONFIG_CPU_LITTLE_ENDIAN
> + *inst = ppc_inst_prefix(doubleword & 0xffffffff, doubleword >> 32);
> +#else
> + *inst = ppc_inst_prefix(doubleword >> 32, doubleword & 0xffffffff);
> +#endif

Ick.  Is there a cleaner way to do this?
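One possibility might be to fetch the prefix and suffix as two u32 loads,
which sidesteps the endian question entirely -- untested sketch:

	u32 prefix, suffix;
	ulong spc = pc + 4;

	r = kvmppc_ld(vcpu, &pc, sizeof(u32), &prefix, false);
	if (r == EMULATE_DONE)
		r = kvmppc_ld(vcpu, &spc, sizeof(u32), &suffix, false);
	if (r == EMULATE_DONE)
		*inst = ppc_inst_prefix(prefix, suffix);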

> + } else {
> + r = kvmppc_ld(vcpu, &pc, sizeof(u32), &word, false);
> + *inst = ppc_inst(word);
> + }
> +
>   if (r == EMULATE_DONE)
>   return r;
>   else
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
> b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> index 775ce41738ce..0802471f4856 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> @@ -411,9 +411,13 @@ static int instruction_is_store(struct ppc_inst instr)
>   unsigned int mask;
>  
> mask = 0x10000000;
> - if ((ppc_inst_val(instr) & 0xfc000000) == 0x7c000000)
> - mask = 0x100;   /* major opcode 31 */
> - return (ppc_inst_val(instr) & mask) != 0;
> + if (ppc_inst_prefixed(instr)) {
> + return (ppc_inst_suffix(instr) & mask) != 0;
> + } else {
> + if ((ppc_inst_val(instr) & 0xfc000000) == 0x7c000000)
> + mask = 0x100;   /* major opcode 31 */
> + return (ppc_inst_val(instr) & mask) != 0;
> + }

The way the code worked before, the mask depended on whether the
instruction was a D-form (or DS-form or other variant) instruction,
where you can tell loads and stores apart by looking at the major
opcode, or an X-form instruction, where you look at the minor opcode.

Now we are only looking at the minor opcode if it is not a prefixed
instruction.  Are there no X-form prefixed loads or stores?

Paul.


Re: [RFC PATCH 1/2] KVM: PPC: Use the ppc_inst type

2020-09-02 Thread Paul Mackerras
On Thu, Aug 20, 2020 at 01:39:21PM +1000, Jordan Niethe wrote:
> The ppc_inst type was added to help cope with the addition of prefixed
> instructions to the ISA. Convert KVM to use this new type for dealing
> with instructions. For now do not try to add further support for
> prefixed instructions.

This change does seem to splatter itself across a lot of code that
mostly or exclusively runs on machines which are not POWER10 and will
never need to handle prefixed instructions, unfortunately.  I wonder
if there is a less invasive way to approach this.

In particular we are inflicting this 64-bit struct on 32-bit platforms
unnecessarily (I assume, correct me if I am wrong here).

How would it be to do something like:

typedef unsigned long ppc_inst_t;

so it is 32 bits on 32-bit platforms and 64 bits on 64-bit platforms,
and then use that instead of 'struct ppc_inst'?  You would still need
to change the function declarations but I think most of the function
bodies would not need to be changed.  In particular you would avoid a
lot of the churn related to having to add ppc_inst_val() and suchlike.

> -static inline unsigned make_dsisr(unsigned instr)
> +static inline unsigned make_dsisr(struct ppc_inst instr)
>  {
>   unsigned dsisr;
> + u32 word = ppc_inst_val(instr);
>  
>  
>   /* bits  6:15 --> 22:31 */
> - dsisr = (instr & 0x03ff0000) >> 16;
> + dsisr = (word & 0x03ff0000) >> 16;
>  
>   if (IS_XFORM(instr)) {
>   /* bits 29:30 --> 15:16 */
> - dsisr |= (instr & 0x00000006) << 14;
> + dsisr |= (word & 0x00000006) << 14;
>   /* bit 25 -->17 */
> - dsisr |= (instr & 0x00000040) << 8;
> + dsisr |= (word & 0x00000040) << 8;
>   /* bits 21:24 --> 18:21 */
> - dsisr |= (instr & 0x00000780) << 3;
> + dsisr |= (word & 0x00000780) << 3;
>   } else {
>   /* bit  5 -->17 */
> - dsisr |= (instr & 0x04000000) >> 12;
> + dsisr |= (word & 0x04000000) >> 12;
>   /* bits  1: 4 --> 18:21 */
> - dsisr |= (instr & 0x78000000) >> 17;
> + dsisr |= (word & 0x78000000) >> 17;
>   /* bits 30:31 --> 12:13 */
>   if (IS_DSFORM(instr))
> - dsisr |= (instr & 0x00000003) << 18;
> + dsisr |= (word & 0x00000003) << 18;

Here I would have done something like:

> -static inline unsigned make_dsisr(unsigned instr)
> +static inline unsigned make_dsisr(struct ppc_inst pi)
>  {
>   unsigned dsisr;
> + u32 instr = ppc_inst_val(pi);

and left the rest of the function unchanged.

At first I wondered why we still had that function, since IBM Power
CPUs have not set DSISR on an alignment interrupt since POWER3 days.
It turns out this function is used by PR KVM when it is emulating
one of the old 32-bit PowerPC CPUs (601, 603, 604, 750, 7450 etc.).

> diff --git a/arch/powerpc/kernel/kvm.c b/arch/powerpc/kernel/kvm.c

Despite the file name, this code is not used on IBM Power servers.
It is for platforms which run under an ePAPR (not server PAPR)
hypervisor (which would be a KVM variant, but generally book E KVM not
book 3S).

Paul.


Re: [PATCH -next] powerpc: Convert to DEFINE_SHOW_ATTRIBUTE

2020-09-01 Thread Paul Mackerras
On Thu, Jul 16, 2020 at 05:07:12PM +0800, Qinglang Miao wrote:
> From: Chen Huang 
> 
> Use DEFINE_SHOW_ATTRIBUTE macro to simplify the code.
> 
> Signed-off-by: Chen Huang 

For the arch/powerpc/kvm part:

Acked-by: Paul Mackerras 

I expect Michael Ellerman will take the patch through his tree.

Paul.


Re: [PATCH 0/7] powerpc/watchpoint: 2nd DAWR kvm enablement + selftests

2020-09-01 Thread Paul Mackerras
On Thu, Jul 23, 2020 at 03:50:51PM +0530, Ravi Bangoria wrote:
> Patch #1, #2 and #3 enable the p10 2nd DAWR feature for Book3S kvm guests. DAWR
> is a hypervisor resource and thus H_SET_MODE hcall is used to set/unset it.
> A new case H_SET_MODE_RESOURCE_SET_DAWR1 is introduced in H_SET_MODE hcall
> for setting/unsetting 2nd DAWR. Also, new capability KVM_CAP_PPC_DAWR1 has
> been added to query 2nd DAWR support via kvm ioctl.
> 
> This feature also needs to be enabled in Qemu to really use it. I'll reply
> with a link to the qemu patches once I post them to the qemu-devel mailing list.
> 
> Patch #4, #5, #6 and #7 adds selftests to test 2nd DAWR.

If/when you resubmit these patches, please split the KVM patches into
a separate series, since the KVM patches would go via my tree whereas
I expect the selftests/powerpc patches would go through Michael
Ellerman's tree.

Paul.


Re: [PATCH 2/7] powerpc/watchpoint/kvm: Add infrastructure to support 2nd DAWR

2020-09-01 Thread Paul Mackerras
On Thu, Jul 23, 2020 at 03:50:53PM +0530, Ravi Bangoria wrote:
> kvm code assumes a single DAWR everywhere. Add code to support a 2nd DAWR.
> DAWR is a hypervisor resource and thus H_SET_MODE hcall is used to set/
> unset it. Introduce new case H_SET_MODE_RESOURCE_SET_DAWR1 for 2nd DAWR.

Is this the same interface as will be defined in PAPR and available
under PowerVM, or is it a new/different interface for KVM?

> Also, kvm will support 2nd DAWR only if CPU_FTR_DAWR1 is set.

In general QEMU wants to be able to control all aspects of the virtual
machine presented to the guest, meaning that just because a host has a
particular hardware capability does not mean we should automatically
present that capability to the guest.

In this case, QEMU will want a way to control whether the guest sees
the availability of the second DAWR/X registers or not, i.e. whether a
H_SET_MODE to set DAWR[X]1 will succeed or fail.
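In other words, something along these lines, where dawr1_enabled is a
hypothetical flag that a new capability would set (sketch only):

	/* Gate guest visibility of the second DAWR on an explicit opt-in
	 * from userspace, not on the host CPU feature alone. */
	static bool dawr1_allowed(struct kvm *kvm)
	{
		return cpu_has_feature(CPU_FTR_DAWR1) && kvm->arch.dawr1_enabled;
	}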

Paul.


Re: [PATCH 1/7] powerpc/watchpoint/kvm: Rename current DAWR macros and variables

2020-09-01 Thread Paul Mackerras
On Thu, Jul 23, 2020 at 03:50:52PM +0530, Ravi Bangoria wrote:
> Power10 is introducing second DAWR. Use real register names (with
> suffix 0) from ISA for current macros and variables used by kvm.

Most of this looks fine, but I think we should not change the existing
names in arch/powerpc/include/uapi/asm/kvm.h (and therefore also
Documentation/virt/kvm/api.rst).

> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 426f94582b7a..4dc18fe6a2bf 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -2219,8 +2219,8 @@ registers, find a list below:
>PPC KVM_REG_PPC_BESCR   64
>PPC KVM_REG_PPC_TAR 64
>PPC KVM_REG_PPC_DPDES   64
> -  PPC KVM_REG_PPC_DAWR64
> -  PPC KVM_REG_PPC_DAWRX   64
> +  PPC KVM_REG_PPC_DAWR0   64
> +  PPC KVM_REG_PPC_DAWRX0  64
>PPC KVM_REG_PPC_CIABR   64
>PPC KVM_REG_PPC_IC  64
>PPC KVM_REG_PPC_VTB 64
...
> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
> index 264e266a85bf..38d61b73f5ed 100644
> --- a/arch/powerpc/include/uapi/asm/kvm.h
> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> @@ -608,8 +608,8 @@ struct kvm_ppc_cpu_char {
>  #define KVM_REG_PPC_BESCR(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xa7)
>  #define KVM_REG_PPC_TAR  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xa8)
>  #define KVM_REG_PPC_DPDES(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xa9)
> -#define KVM_REG_PPC_DAWR (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xaa)
> -#define KVM_REG_PPC_DAWRX(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xab)
> +#define KVM_REG_PPC_DAWR0(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xaa)
> +#define KVM_REG_PPC_DAWRX0   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xab)
>  #define KVM_REG_PPC_CIABR(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xac)
>  #define KVM_REG_PPC_IC   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xad)
>  #define KVM_REG_PPC_VTB  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xae)

The existing names are an API, and if you change them you will break
compilation of existing userspace programs.  I don't see that adding
the '0' on the end is so important that we need to break userspace.

Paul.


Re: [PATCH v6 0/5] Migrate non-migrated pages of a SVM.

2020-07-27 Thread Paul Mackerras
On Mon, Jul 27, 2020 at 11:07:13AM -0700, Ram Pai wrote:
> The time to switch a VM to a Secure-VM increases with the size of the VM.
> A 100GB VM takes about 7 minutes. This is unacceptable.  This linear
> increase is caused by a suboptimal behavior by the Ultravisor and the
> Hypervisor.  The Ultravisor unnecessarily migrates all the GFNs of the
> VM from normal-memory to secure-memory. It has to just migrate the
> necessary and sufficient GFNs.
> 
> However when the optimization is incorporated in the Ultravisor, the
> Hypervisor starts misbehaving. The Hypervisor has an inbuilt assumption
> that the Ultravisor will explicitly request to migrate each and every
> GFN of the VM. If only necessary and sufficient GFNs are requested for
> migration, the Hypervisor continues to manage the remaining GFNs as
> normal GFNs. This leads to memory corruption; manifested
> consistently when the SVM reboots.
> 
> The same is true, when a memory slot is hotplugged into a SVM. The
> Hypervisor expects the ultravisor to request migration of all GFNs to
> secure-GFN.  But the hypervisor cannot handle any H_SVM_PAGE_IN
> requests from the Ultravisor, done in the context of
> UV_REGISTER_MEM_SLOT ucall.  This problem manifests as random errors
> in the SVM, when a memory-slot is hotplugged.
> 
> This patch series automatically migrates the non-migrated pages of a
> SVM, and thus solves the problem.
> 
> Testing: Passed rigorous testing using various sized SVMs.

Thanks, series applied to my kvm-ppc-next branch and pull request sent.

Paul.


Re: [PATCH v2 0/2] Rework secure memslot dropping

2020-07-27 Thread Paul Mackerras
On Mon, Jul 27, 2020 at 12:24:27PM -0700, Ram Pai wrote:
> From: Laurent Dufour 
> 
> When doing memory hotplug on a secure VM, the secure pages are not well
> cleaned from the secure device when dropping the memslot.  This silent
> error then prevents the SVM from rebooting properly after the following
> sequence of commands are run in the Qemu monitor:
> 
> device_add pc-dimm,id=dimm1,memdev=mem1
> device_del dimm1
> device_add pc-dimm,id=dimm1,memdev=mem1
> 
> At reboot time, when the kernel is booting again and switching to the
> secure mode, the page_in is failing for the pages in the memslot because
> the cleanup was not done properly, because the memslot is flagged as
> invalid during the hot unplug and thus the page fault mechanism is not
> triggered.
> 
> To prevent that, during the memslot dropping, instead of relying on the
> page fault mechanism to trigger the page out of the secured pages, it seems
> simpler to directly call the function doing the page out. This way the
> state of the memslot does not interfere with the page out process.
> 
> This series applies on top of the Ram's one titled:
> "[v6 0/5] Migrate non-migrated pages of a SVM."

Thanks, series applied to my kvm-ppc-next branch and pull request sent.

Paul.


Re: [PATCH v6 3/5] KVM: PPC: clean up redundant kvm_run parameters in assembly

2020-07-23 Thread Paul Mackerras
On Tue, Jun 23, 2020 at 09:14:16PM +0800, Tianjia Zhang wrote:
> In the current kvm version, 'kvm_run' has been included in the 'kvm_vcpu'
> structure. For historical reasons, many kvm-related function parameters
> retain the 'kvm_run' and 'kvm_vcpu' parameters at the same time. This
> patch does a unified cleanup of these remaining redundant parameters.

Thanks, patch applied to my kvm-ppc-next branch, with fixes.

Paul.


Re: [PATCH] powerpc/kvm: Enable support for ISA v3.1 guests

2020-07-23 Thread Paul Mackerras
On Tue, Jun 02, 2020 at 03:53:25PM +1000, Alistair Popple wrote:
> Adds support for emulating ISAv3.1 guests by adding the appropriate PCR
> and FSCR bits.
> 
> Signed-off-by: Alistair Popple 

Thanks, patch applied to my kvm-ppc-next branch.

Paul.


Re: [PATCH kernel] KVM: PPC: Protect kvm_vcpu_read_guest with srcu locks

2020-07-23 Thread Paul Mackerras
On Tue, Jun 09, 2020 at 12:12:29PM +1000, Alexey Kardashevskiy wrote:
> The kvm_vcpu_read_guest/kvm_vcpu_write_guest used for nested guests
> eventually call srcu_dereference_check to dereference a memslot and
> lockdep produces a warning as neither kvm->slots_lock nor
> kvm->srcu lock is held and kvm->users_count is above zero (>100 in fact).
> 
> This wraps mentioned VCPU read/write helpers in srcu read lock/unlock as
> it is done in other places. This uses vcpu->srcu_idx when possible.
> 
> These helpers are only used for nested KVM so this may explain why
> we did not see these before.
> 
> Here is an example of a warning:
> 
> =
> WARNING: suspicious RCU usage
> 5.7.0-rc3-le_dma-bypass.3.2_a+fstn1 #897 Not tainted
> -
> include/linux/kvm_host.h:633 suspicious rcu_dereference_check() usage!
> 
> other info that might help us debug this:
> 
> rcu_scheduler_active = 2, debug_locks = 1
> 1 lock held by qemu-system-ppc/2752:
>  #0: c000200359016be0 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x144/0xd80 [kvm]
> 
> stack backtrace:
> CPU: 80 PID: 2752 Comm: qemu-system-ppc Not tainted 5.7.0-rc3-le_dma-bypass.3.2_a+fstn1 #897
> Call Trace:
> [c0002003591ab240] [c0b23ab4] dump_stack+0x190/0x25c (unreliable)
> [c0002003591ab2b0] [c023f954] lockdep_rcu_suspicious+0x140/0x164
> [c0002003591ab330] [c00804a445f8] kvm_vcpu_gfn_to_memslot+0x4c0/0x510 [kvm]
> [c0002003591ab3a0] [c00804a44c18] kvm_vcpu_read_guest+0xa0/0x180 [kvm]
> [c0002003591ab410] [c00804ff9bd8] kvmhv_enter_nested_guest+0x90/0xb80 [kvm_hv]
> [c0002003591ab980] [c00804fe07bc] kvmppc_pseries_do_hcall+0x7b4/0x1c30 [kvm_hv]
> [c0002003591aba10] [c00804fe5d30] kvmppc_vcpu_run_hv+0x10a8/0x1a30 [kvm_hv]
> [c0002003591abae0] [c00804a5d954] kvmppc_vcpu_run+0x4c/0x70 [kvm]
> [c0002003591abb10] [c00804a56e54] kvm_arch_vcpu_ioctl_run+0x56c/0x7c0 [kvm]
> [c0002003591abba0] [c00804a3ddc4] kvm_vcpu_ioctl+0x4ac/0xd80 [kvm]
> [c0002003591abd20] [c06ebb58] ksys_ioctl+0x188/0x210
> [c0002003591abd70] [c06ebc28] sys_ioctl+0x48/0xb0
> [c0002003591abdb0] [c0042764] system_call_exception+0x1d4/0x2e0
> [c0002003591abe20] [c000cce8] system_call_common+0xe8/0x214
> 
> Signed-off-by: Alexey Kardashevskiy 
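For reference, the wrapping being described is the usual srcu read-side
idiom, roughly (sketch; the patch itself reuses vcpu->srcu_idx where it
is already held):

	int idx, rc;

	idx = srcu_read_lock(&vcpu->kvm->srcu);
	rc = kvm_vcpu_read_guest(vcpu, gpa, buf, len);
	srcu_read_unlock(&vcpu->kvm->srcu, idx);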

Thanks, patch applied to my kvm-ppc-next branch.

Paul.


Re: [PATCH] KVM: PPC: Book3S HV: increase KVMPPC_NR_LPIDS on POWER8 and POWER9

2020-07-23 Thread Paul Mackerras
On Mon, Jun 08, 2020 at 01:57:14PM +0200, Cédric Le Goater wrote:
> POWER8 and POWER9 have 12-bit LPIDs. Change LPID_RSVD to support up to
> (4096 - 2) guests on these processors. POWER7 is kept the same with a
> limitation of (1024 - 2), but it might be time to drop KVM support for
> POWER7.
> 
> Tested with 2048 guests * 4 vCPUs on a witherspoon system with 512G
> RAM and a bit of swap.
> 
> Signed-off-by: Cédric Le Goater 
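The gist of the change, as a sketch (names from memory; the reserved
value stays at the top of the LPID range):

	/* POWER7 has 10-bit LPIDs; POWER8 and POWER9 have 12-bit LPIDs */
	#define LPID_RSVD_POWER7	0x3ff
	#define LPID_RSVD		0xfff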

Thanks, patch applied to my kvm-ppc-next branch.

Paul.


Re: [RFC PATCH] powerpc/pseries/svm: capture instruction faulting on MMIO access, in sprg0 register

2020-07-21 Thread Paul Mackerras
On Thu, Jul 16, 2020 at 01:32:13AM -0700, Ram Pai wrote:
> An instruction accessing an mmio address generates an HDSI fault.  This
> fault is appropriately handled by the Hypervisor.  However in the case of
> secureVMs, the fault is delivered to the ultravisor.
> 
> Unfortunately the Ultravisor has no correct way to fetch the faulting
> instruction. The PEF architecture does not allow Ultravisor to enable MMU
> translation. Walking the two level page table to read the instruction can race
> with other vcpus modifying the SVM's process scoped page table.
> 
> This problem can be correctly solved with some help from the kernel.
> 
> Capture the faulting instruction in SPRG0 register, before executing the
> faulting instruction. This enables the ultravisor to easily procure the
> faulting instruction and emulate it.

Just a comment on the approach of putting the instruction in SPRG0:
these I/O accessors can be used in interrupt routines, which means
that if these accessors are ever used with interrupts enabled, there
is the possibility of an external interrupt occurring between the
instruction that sets SPRG0 and the load/store instruction that
faults.  If the handler for that interrupt itself does an I/O access,
it will overwrite SPRG0, corrupting the value set by the interrupted
code.

The choices to fix that would seem to be (a) disable interrupts around
all I/O accesses, (b) have the accessor save and restore SPRG0, or (c)
solve the problem another way, such as by doing a H_LOGICAL_CI_LOAD
or H_LOGICAL_CI_STORE hypercall.
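As a sketch of option (a), a hypothetical accessor might look like this
('insn' being the image of the load/store that is about to fault):

	static inline u8 svm_mmio_readb(const volatile u8 *addr, u32 insn)
	{
		unsigned long flags;
		u8 ret;

		local_irq_save(flags);		/* no handler can clobber SPRG0 */
		mtspr(SPRN_SPRG0, insn);	/* for the ultravisor to fetch */
		ret = *addr;			/* HDSI -> ultravisor emulates */
		local_irq_restore(flags);
		return ret;
	}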

Paul.


Re: [v3 02/15] KVM: PPC: Book3S HV: Cleanup updates for kvm vcpu MMCR

2020-07-21 Thread Paul Mackerras
On Wed, Jul 22, 2020 at 07:39:26AM +0530, Athira Rajeev wrote:
> 
> 
> > On 21-Jul-2020, at 9:24 AM, Paul Mackerras  wrote:
> > 
> > On Fri, Jul 17, 2020 at 10:38:14AM -0400, Athira Rajeev wrote:
> >> Currently `kvm_vcpu_arch` stores all Monitor Mode Control registers
> >> in a flat array in order: mmcr0, mmcr1, mmcra, mmcr2, mmcrs
> >> Split this to give mmcra and mmcrs their own entries in vcpu and
> >> use a flat array for mmcr0 to mmcr2. This patch implements this
> >> cleanup to make code easier to read.
> > 
> > Changing the way KVM stores these values internally is fine, but
> > changing the user ABI is not.  This part:
> > 
> >> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
> >> index 264e266..e55d847 100644
> >> --- a/arch/powerpc/include/uapi/asm/kvm.h
> >> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> >> @@ -510,8 +510,8 @@ struct kvm_ppc_cpu_char {
> >> 
> >> #define KVM_REG_PPC_MMCR0  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x10)
> >> #define KVM_REG_PPC_MMCR1  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x11)
> >> -#define KVM_REG_PPC_MMCRA (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x12)
> >> -#define KVM_REG_PPC_MMCR2 (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x13)
> >> +#define KVM_REG_PPC_MMCR2 (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x12)
> >> +#define KVM_REG_PPC_MMCRA (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x13)
> > 
> > means that existing userspace programs that used to work would now be
> > broken.  That is not acceptable (breaking the user ABI is only ever
> > acceptable with a very compelling reason).  So NAK to this part of the
> > patch.
> 
> Hi Paul
> 
> Thanks for checking the patch. I understood your point on user ABI
> breakage that this particular change can cause.
> I will retain the original KVM_REG_PPC_MMCRA and KVM_REG_PPC_MMCR2 order
> in `kvm.h`. With that, I will additionally need the change below (on top
> of the current patch) for my cleanup of the kvm vcpu MMCR code to work,
> because mmcra and mmcrs now have their own entries in vcpu and are not
> part of the mmcr[] array.
> Please suggest if this looks good.

Yes, that looks fine.

By the way, is the new MMCRS register here at all related to the MMCRS
that there used to be on POWER8, but which wasn't present (as far as I
know) on POWER9?

Paul.


Re: [PATCH 2/2] KVM: PPC: Book3S HV: rework secure mem slot dropping

2020-07-20 Thread Paul Mackerras
On Wed, Jul 08, 2020 at 02:16:36PM +0200, Laurent Dufour wrote:
> Le 08/07/2020 à 13:25, Bharata B Rao a écrit :
> > On Fri, Jul 03, 2020 at 05:59:14PM +0200, Laurent Dufour wrote:
> > > When a secure memslot is dropped, all the pages backed in the secure
> > > device (aka really backed by secure memory by the Ultravisor) should be
> > > paged out to a normal page. Previously, this was achieved by triggering
> > > the page fault mechanism which is calling kvmppc_svm_page_out() on each page.
> > > 
> > > This can't work when hot unplugging a memory slot because the memory slot
> > > is flagged as invalid and gfn_to_pfn() is then not trying to access the
> > > page, so the page fault mechanism is not triggered.
> > > 
> > > Since the final goal is to make a call to kvmppc_svm_page_out() it seems
> > > simpler to directly call it instead of triggering such a mechanism. This
> > > way kvmppc_uvmem_drop_pages() can be called even when hot unplugging a
> > > memslot.
> > 
> > Yes, this appears much simpler.
> 
> Thanks Bharata for reviewing this.
> 
> > 
> > > 
> > > Since kvmppc_uvmem_drop_pages() is already holding kvm->arch.uvmem_lock,
> > > the call to __kvmppc_svm_page_out() is made.
> > > As __kvmppc_svm_page_out needs the vma pointer to migrate the pages, the
> > > VMA is fetched in a lazy way, to not trigger find_vma() all the time. In
> > > addition, the mmap_sem is help in read mode during that time, not in write
> > > mode since the virual memory layout is not impacted, and
> > > kvm->arch.uvmem_lock prevents concurrent operation on the secure device.
> > > 
> > > Cc: Ram Pai 
> > > Cc: Bharata B Rao 
> > > Cc: Paul Mackerras 
> > > Signed-off-by: Laurent Dufour 
> > > ---
> > >   arch/powerpc/kvm/book3s_hv_uvmem.c | 54 --
> > >   1 file changed, 37 insertions(+), 17 deletions(-)
> > > 
> > > diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
> > > index 852cc9ae6a0b..479ddf16d18c 100644
> > > --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> > > +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> > > @@ -533,35 +533,55 @@ static inline int kvmppc_svm_page_out(struct vm_area_struct *vma,
> > >* fault on them, do fault time migration to replace the device PTEs in
> > >* QEMU page table with normal PTEs from newly allocated pages.
> > >*/
> > > -void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
> > > +void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *slot,
> > >struct kvm *kvm, bool skip_page_out)
> > >   {
> > >   int i;
> > >   struct kvmppc_uvmem_page_pvt *pvt;
> > > - unsigned long pfn, uvmem_pfn;
> > > - unsigned long gfn = free->base_gfn;
> > > + struct page *uvmem_page;
> > > + struct vm_area_struct *vma = NULL;
> > > + unsigned long uvmem_pfn, gfn;
> > > + unsigned long addr, end;
> > > +
> > > + down_read(>mm->mmap_sem);
> > 
> > You should be using mmap_read_lock(kvm->mm) with recent kernels.
> 
> Absolutely, shame on me, I reviewed Michel's series about that!
> 
> Paul, Michael, could you fix that when pulling this patch or should I send a
> whole new series?

Given that Ram has reworked his series, I think it would be best if
you rebase on top of his new series and re-send your series.

Thanks,
Paul.


Re: [PATCH] KVM: PPC: Book3S HV: Use feature flag CPU_FTR_P9_TIDR when accessing TIDR

2020-07-20 Thread Paul Mackerras
On Tue, Jun 23, 2020 at 06:50:27PM +0200, Cédric Le Goater wrote:
> The TIDR register is only available on POWER9 systems and code
> accessing this register is not always protected by the CPU_FTR_P9_TIDR
> flag. Fix that to make sure POWER10 systems won't use it as TIDR has
> been removed.

I'm concerned about what this patch would do if we are trying to
migrate from a P9 guest to a guest on P10 in P9-compat mode, in that
the destination QEMU would get an error on doing the SET_ONE_REG for
the TIDR.  I don't think the lack of TIDR is worth failing the
migration for given that TIDR only actually does anything if you are
using an accelerator, and KVM has never supported use of accelerators
in guests.  I'm cc'ing David Gibson for his comments on the
compatibility and migration issues.

In any case, given that both move to and move from TIDR will be no-ops
on P10 (for privileged code), I don't think there is a great urgency
for this patch.

Paul.


Re: [v3 02/15] KVM: PPC: Book3S HV: Cleanup updates for kvm vcpu MMCR

2020-07-20 Thread Paul Mackerras
On Fri, Jul 17, 2020 at 10:38:14AM -0400, Athira Rajeev wrote:
> Currently `kvm_vcpu_arch` stores all Monitor Mode Control registers
> in a flat array in order: mmcr0, mmcr1, mmcra, mmcr2, mmcrs
> Split this to give mmcra and mmcrs their own entries in vcpu and
> use a flat array for mmcr0 to mmcr2. This patch implements this
> cleanup to make code easier to read.

Changing the way KVM stores these values internally is fine, but
changing the user ABI is not.  This part:

> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
> index 264e266..e55d847 100644
> --- a/arch/powerpc/include/uapi/asm/kvm.h
> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> @@ -510,8 +510,8 @@ struct kvm_ppc_cpu_char {
>  
>  #define KVM_REG_PPC_MMCR0(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x10)
>  #define KVM_REG_PPC_MMCR1(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x11)
> -#define KVM_REG_PPC_MMCRA(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x12)
> -#define KVM_REG_PPC_MMCR2(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x13)
> +#define KVM_REG_PPC_MMCR2(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x12)
> +#define KVM_REG_PPC_MMCRA(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x13)

means that existing userspace programs that used to work would now be
broken.  That is not acceptable (breaking the user ABI is only ever
acceptable with a very compelling reason).  So NAK to this part of the
patch.
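That is, keep the uapi numbering untouched and make the mapping to the
new internal layout explicit instead, e.g. (sketch):

	case KVM_REG_PPC_MMCRA:		/* still ... | 0x12 */
		*val = get_reg_val(id, vcpu->arch.mmcra);
		break;
	case KVM_REG_PPC_MMCR2:		/* still ... | 0x13 */
		*val = get_reg_val(id, vcpu->arch.mmcr[2]);
		break;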

Regards,
Paul.


Re: [PATCH] pseries: Fix 64 bit logical memory block panic

2020-07-16 Thread Paul Mackerras
On Wed, Jul 15, 2020 at 06:12:25PM +0530, Aneesh Kumar K.V wrote:
> Anton Blanchard  writes:
> 
> > Booting with a 4GB LMB size causes us to panic:
> >
> >   qemu-system-ppc64: OS terminated: OS panic:
> >   Memory block size not suitable: 0x0
> >
> > Fix pseries_memory_block_size() to handle 64 bit LMBs.
> >
> > Cc: sta...@vger.kernel.org
> > Signed-off-by: Anton Blanchard 
> > ---
> >  arch/powerpc/platforms/pseries/hotplug-memory.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
> > index 5ace2f9a277e..6574ac33e887 100644
> > --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> > +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> > @@ -27,7 +27,7 @@ static bool rtas_hp_event;
> >  unsigned long pseries_memory_block_size(void)
> >  {
> > struct device_node *np;
> > -   unsigned int memblock_size = MIN_MEMORY_BLOCK_SIZE;
> > +   uint64_t memblock_size = MIN_MEMORY_BLOCK_SIZE;
> > struct resource r;
> >  
> > np = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
> 
> We need similar changes at more places?
> 
> modified   arch/powerpc/include/asm/book3s/64/mmu.h
> @@ -85,7 +85,7 @@ extern unsigned int mmu_base_pid;
>  /*
>   * memory block size used with radix translation.
>   */
> -extern unsigned int __ro_after_init radix_mem_block_size;
> +extern unsigned long __ro_after_init radix_mem_block_size;
>  
>  #define PRTB_SIZE_SHIFT  (mmu_pid_bits + 4)
>  #define PRTB_ENTRIES (1ul << mmu_pid_bits)
> modified   arch/powerpc/include/asm/drmem.h
> @@ -21,7 +21,7 @@ struct drmem_lmb {
>  struct drmem_lmb_info {
>   struct drmem_lmb*lmbs;
>   int n_lmbs;
> - u32 lmb_size;
> + u64 lmb_size;
>  };
>  
>  extern struct drmem_lmb_info *drmem_info;
> modified   arch/powerpc/mm/book3s64/radix_pgtable.c
> @@ -34,7 +34,7 @@
>  
>  unsigned int mmu_pid_bits;
>  unsigned int mmu_base_pid;
> -unsigned int radix_mem_block_size __ro_after_init;
> +unsigned long radix_mem_block_size __ro_after_init;

These changes look fine.

>  static __ref void *early_alloc_pgtable(unsigned long size, int nid,
>   unsigned long region_start, unsigned long region_end)
> modified   arch/powerpc/mm/drmem.c
> @@ -268,14 +268,15 @@ static void __init __walk_drmem_v2_lmbs(const __be32 *prop, const __be32 *usm,
>  void __init walk_drmem_lmbs_early(unsigned long node,
>   void (*func)(struct drmem_lmb *, const __be32 **))
>  {
> + const __be64 *lmb_prop;
>   const __be32 *prop, *usm;
>   int len;
>  
> - prop = of_get_flat_dt_prop(node, "ibm,lmb-size", &len);
> - if (!prop || len < dt_root_size_cells * sizeof(__be32))
> + lmb_prop = of_get_flat_dt_prop(node, "ibm,lmb-size", &len);
> + if (!lmb_prop || len < sizeof(__be64))
>   return;
>  
> - drmem_info->lmb_size = dt_mem_next_cell(dt_root_size_cells, &prop);
> + drmem_info->lmb_size = be64_to_cpup(lmb_prop);

This particular change shouldn't be necessary.  We already have
dt_mem_next_cell() returning u64, and it knows how to combine two
cells to give a u64 (for dt_root_size_cells == 2).

>   usm = of_get_flat_dt_prop(node, "linux,drconf-usable-memory", &len);
>  
> @@ -296,19 +297,19 @@ void __init walk_drmem_lmbs_early(unsigned long node,
>  
>  static int __init init_drmem_lmb_size(struct device_node *dn)
>  {
> - const __be32 *prop;
> + const __be64 *prop;
>   int len;
>  
>   if (drmem_info->lmb_size)
>   return 0;
>  
>   prop = of_get_property(dn, "ibm,lmb-size", &len);
> - if (!prop || len < dt_root_size_cells * sizeof(__be32)) {
> + if (!prop || len < sizeof(__be64)) {
>   pr_info("Could not determine LMB size\n");
>   return -1;
>   }
>  
> - drmem_info->lmb_size = dt_mem_next_cell(dt_root_size_cells, &prop);
> + drmem_info->lmb_size = be64_to_cpup(prop);

Same comment here.

Paul.


Re: [RFC PATCH v0 2/2] KVM: PPC: Book3S HV: Use H_RPT_INVALIDATE in nested KVM

2020-07-09 Thread Paul Mackerras
On Thu, Jul 09, 2020 at 02:38:51PM +0530, Bharata B Rao wrote:
> On Thu, Jul 09, 2020 at 03:18:03PM +1000, Paul Mackerras wrote:
> > On Fri, Jul 03, 2020 at 04:14:20PM +0530, Bharata B Rao wrote:
> > > In the nested KVM case, replace H_TLB_INVALIDATE by the new hcall
> > > H_RPT_INVALIDATE if available. The availability of this hcall
> > > is determined from "hcall-rpt-invalidate" string in ibm,hypertas-functions
> > > DT property.
> > 
> > What are we going to use when nested KVM supports HPT guests at L2?
> > L1 will need to do partition-scoped tlbies with R=0 via a hypercall,
> > but H_RPT_INVALIDATE says in its name that it only handles radix
> > page tables (i.e. R=1).
> 
> For L2 HPT guests, the old hcall is expected to work after it adds
> support for the R=0 case?

That was the plan.

> The new hcall should be advertised via ibm,hypertas-functions only
> for radix guests I suppose.

Well, the L1 hypervisor is a radix guest of L0, so it would have
H_RPT_INVALIDATE available to it?

I guess the question is whether H_RPT_INVALIDATE is supposed to do
everything, that is, radix process-scoped invalidations, radix
partition-scoped invalidations, and HPT partition-scoped
invalidations.  If that is the plan then we should call it something
different.

This patchset seems to imply that H_RPT_INVALIDATE is at least going
to be used for radix partition-scoped invalidations as well as radix
process-scoped invalidations.  If you are thinking that in future when
we need HPT partition-scoped invalidations for a radix L1 hypervisor
running a HPT L2 guest, we are going to define a new hypercall for
that, I suppose that is OK, though it doesn't really seem necessary.

Paul.


Re: [RFC PATCH v0 2/2] KVM: PPC: Book3S HV: Use H_RPT_INVALIDATE in nested KVM

2020-07-08 Thread Paul Mackerras
On Fri, Jul 03, 2020 at 04:14:20PM +0530, Bharata B Rao wrote:
> In the nested KVM case, replace H_TLB_INVALIDATE by the new hcall
> H_RPT_INVALIDATE if available. The availability of this hcall
> is determined from "hcall-rpt-invalidate" string in ibm,hypertas-functions
> DT property.

What are we going to use when nested KVM supports HPT guests at L2?
L1 will need to do partition-scoped tlbies with R=0 via a hypercall,
but H_RPT_INVALIDATE says in its name that it only handles radix
page tables (i.e. R=1).

Paul.


Re: [PATCH v2 2/3] powerpc/64s: remove PROT_SAO support

2020-07-08 Thread Paul Mackerras
On Fri, Jul 03, 2020 at 11:19:57AM +1000, Nicholas Piggin wrote:
> ISA v3.1 does not support the SAO storage control attribute required to
> implement PROT_SAO. PROT_SAO was used by specialised system software
> (Lx86) that has been discontinued for about 7 years, and is not thought
> to be used elsewhere, so removal should not cause problems.
> 
> We rather remove it than keep support for older processors, because
> live migrating guest partitions to newer processors may not be possible
> if SAO is in use (or worse allowed with silent races).

This is actually a real problem for KVM, because now we have the
capabilities of the host affecting the characteristics of the guest
virtual machine in a manner which userspace (e.g. QEMU) is unable to
control.

It would probably be better to disallow SAO on all machines than have
it available on some hosts and not others.  (Yes I know there is a
check on CPU_FTR_ARCH_206 in there, but that has been a no-op since we
removed the PPC970 KVM support.)

Solving this properly will probably require creating a new KVM host
capability and associated machine parameter in QEMU, along with a new
machine type.
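As a sketch of that direction (capability name hypothetical), the host
side would simply report whether SAO can be offered:

	case KVM_CAP_PPC_SAO:	/* hypothetical new capability */
		r = cpu_has_feature(CPU_FTR_ARCH_206) &&
		    !cpu_has_feature(CPU_FTR_ARCH_31);
		break;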

[snip]

> diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
> index 9bb9bb370b53..fac39ff659d4 100644
> --- a/arch/powerpc/include/asm/kvm_book3s_64.h
> +++ b/arch/powerpc/include/asm/kvm_book3s_64.h
> @@ -398,9 +398,10 @@ static inline bool hpte_cache_flags_ok(unsigned long 
> hptel, bool is_ci)
>  {
>   unsigned int wimg = hptel & HPTE_R_WIMG;
>  
> - /* Handle SAO */
> + /* Handle SAO for POWER7,8,9 */
>   if (wimg == (HPTE_R_W | HPTE_R_I | HPTE_R_M) &&
> - cpu_has_feature(CPU_FTR_ARCH_206))
> + cpu_has_feature(CPU_FTR_ARCH_206) &&
> + !cpu_has_feature(CPU_FTR_ARCH_31))
>   wimg = HPTE_R_M;

Paul.


Re: [PATCH 1/1] KVM/PPC: Fix typo on H_DISABLE_AND_GET hcall

2020-07-08 Thread Paul Mackerras
On Mon, Jul 06, 2020 at 09:48:12PM -0300, Leonardo Bras wrote:
> On PAPR+ the hcall() on 0x1B0 is called H_DISABLE_AND_GET, but got
> defined as H_DISABLE_AND_GETC instead.
> 
> This define was introduced with a typo in commit 
> ("[PATCH] powerpc: Extends HCALL interface for InfiniBand usage"), and was
> later used without the typo being noticed.
> 
> Signed-off-by: Leonardo Bras 

Acked-by: Paul Mackerras 

Since this hypercall is not implemented in KVM nor used by KVM guests,
I'll leave this one for Michael to pick up.

Paul.


Re: [PATCH v2 02/10] KVM: PPC: Book3S HV: Save/restore new PMU registers

2020-07-01 Thread Paul Mackerras
On Wed, Jul 01, 2020 at 05:20:54AM -0400, Athira Rajeev wrote:
> PowerISA v3.1 has added new performance monitoring unit (PMU)
> special purpose registers (SPRs). They are
> 
> Monitor Mode Control Register 3 (MMCR3)
> Sampled Instruction Event Register A (SIER2)
> Sampled Instruction Event Register B (SIER3)
> 
> The patch adds support to save/restore these new
> SPRs while entering/exiting the guest.

This mostly looks reasonable, at a quick glance at least, but I am
puzzled by two of the changes you are making.  See below.

> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 6bf66649..c265800 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -1698,7 +1698,8 @@ static int kvmppc_get_one_reg_hv(struct kvm_vcpu *vcpu, 
> u64 id,
>   *val = get_reg_val(id, vcpu->arch.sdar);
>   break;
>   case KVM_REG_PPC_SIER:
> - *val = get_reg_val(id, vcpu->arch.sier);
> + i = id - KVM_REG_PPC_SIER;
> + *val = get_reg_val(id, vcpu->arch.sier[i]);

This is inside a switch (id) statement, so here we know that id is
KVM_REG_PPC_SIER.  In other words i will always be zero, so what is
the point of doing the subtraction?

>   break;
>   case KVM_REG_PPC_IAMR:
>   *val = get_reg_val(id, vcpu->arch.iamr);
> @@ -1919,7 +1920,8 @@ static int kvmppc_set_one_reg_hv(struct kvm_vcpu *vcpu, 
> u64 id,
>   vcpu->arch.sdar = set_reg_val(id, *val);
>   break;
>   case KVM_REG_PPC_SIER:
> - vcpu->arch.sier = set_reg_val(id, *val);
> + i = id - KVM_REG_PPC_SIER;
> + vcpu->arch.sier[i] = set_reg_val(id, *val);

Same comment here.

I think that new defines for the new registers will need to be added
to arch/powerpc/include/uapi/asm/kvm.h and
Documentation/virt/kvm/api.rst, and then new cases will need to be
added to these switch statements.
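i.e. something like the following, with the ID values illustrative only --
they would need to be properly allocated in the uapi header and documented:

	#define KVM_REG_PPC_MMCR3	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xc1)
	#define KVM_REG_PPC_SIER2	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xc2)
	#define KVM_REG_PPC_SIER3	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xc3)

	case KVM_REG_PPC_SIER2:
		*val = get_reg_val(id, vcpu->arch.sier[1]);
		break;
	case KVM_REG_PPC_SIER3:
		*val = get_reg_val(id, vcpu->arch.sier[2]);
		break;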

By the way, please cc kvm-...@vger.kernel.org and k...@vger.kernel.org
on KVM patches.

Paul.


Re: [PATCH 3/3] powerpc/pseries: Add KVM guest doorbell restrictions

2020-06-30 Thread Paul Mackerras
On Tue, Jun 30, 2020 at 03:35:08PM +1000, Nicholas Piggin wrote:
> Excerpts from Paul Mackerras's message of June 30, 2020 12:27 pm:
> > On Sun, Jun 28, 2020 at 01:04:28AM +1000, Nicholas Piggin wrote:
> >> KVM guests have certain restrictions and performance quirks when
> >> using doorbells. This patch tests for KVM environment in doorbell
> >> setup, and optimises IPI performance:
> >> 
> >>  - PowerVM guests may now use doorbells even if they are secure.
> >> 
> >>  - KVM guests no longer use doorbells if XIVE is available.
> > 
> > It seems, from the fact that you completely remove
> > kvm_para_available(), that you perhaps haven't tried building with
> > CONFIG_KVM_GUEST=y.
> 
> It's still there and builds:

OK, good, I missed that.

> static inline int kvm_para_available(void)
> {
> return IS_ENABLED(CONFIG_KVM_GUEST) && is_kvm_guest();
> }
> 
> but...
> 
> > Somewhat confusingly, that option is not used or
> > needed when building for a PAPR guest (i.e. the "pseries" platform)
> > but is used on non-IBM platforms using the "epapr" hypervisor
> > interface.
> 
> ... is_kvm_guest() returns false on !PSERIES now.

And therefore kvm_para_available() returns false on all the platforms
where the code that depends on it could actually be used.

It's not correct to assume that !PSERIES means not a KVM guest.

> Not intended
> to break EPAPR. I'm not sure of a good way to share this between
> EPAPR and PSERIES, I might just make a copy of it but I'll see.

OK, so you're doing a new version?

Regards,
Paul.


Re: [PATCH 3/3] powerpc/pseries: Add KVM guest doorbell restrictions

2020-06-29 Thread Paul Mackerras
On Sun, Jun 28, 2020 at 01:04:28AM +1000, Nicholas Piggin wrote:
> KVM guests have certain restrictions and performance quirks when
> using doorbells. This patch tests for KVM environment in doorbell
> setup, and optimises IPI performance:
> 
>  - PowerVM guests may now use doorbells even if they are secure.
> 
>  - KVM guests no longer use doorbells if XIVE is available.

It seems, from the fact that you completely remove
kvm_para_available(), that you perhaps haven't tried building with
CONFIG_KVM_GUEST=y.  Somewhat confusingly, that option is not used or
needed when building for a PAPR guest (i.e. the "pseries" platform)
but is used on non-IBM platforms using the "epapr" hypervisor
interface.

If you did intend to remove support for the epapr hypervisor interface
then that should have been talked about in the commit message (and
would I expect be controversial).

So NAK on the kvm_para_available() removal.

Paul.


Re: [PATCH v4 09/22] powerpc/kvm/book3s: Add helper to walk partition scoped linux page table.

2020-05-28 Thread Paul Mackerras
On Thu, May 28, 2020 at 11:31:04AM +0530, Aneesh Kumar K.V wrote:
> On 5/28/20 7:13 AM, Paul Mackerras wrote:
> > On Tue, May 05, 2020 at 12:47:16PM +0530, Aneesh Kumar K.V wrote:
> > > The locking rules for walking a partition scoped table are different
> > > from those for a process scoped table. Hence add a helper for the
> > > secondary linux page table walk and also add a check that we are
> > > holding the right locks.
> > 
> > This patch is causing new warnings to appear when testing migration,
> > like this:
> > 
> > [  142.090159] [ cut here ]
> > [  142.090160] find_kvm_secondary_pte called with kvm mmu_lock not held
> > [  142.090176] WARNING: CPU: 23 PID: 5341 at arch/powerpc/include/asm/kvm_book3s_64.h:644 kvmppc_hv_get_dirty_log_radix+0x2e4/0x340 [kvm_hv]
> > [  142.090177] Modules linked in: xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables bpfilter overlay binfmt_misc input_leds raid_class scsi_transport_sas sch_fq_codel sunrpc kvm_hv kvm
> > [  142.090188] CPU: 23 PID: 5341 Comm: qemu-system-ppc Tainted: GW 5.7.0-rc5-kvm-00211-g9ccf10d6d088 #432
> > [  142.090189] NIP:  c00800fe848c LR: c00800fe8488 CTR: 
> > [  142.090190] REGS: c01e19f077e0 TRAP: 0700   Tainted: GW (5.7.0-rc5-kvm-00211-g9ccf10d6d088)
> > [  142.090191] MSR:  90029033   CR: 4422  XER: 2004
> > [  142.090196] CFAR: c012f5ac IRQMASK: 0
> > GPR00: c00800fe8488 c01e19f07a70 c00800ffe200 0039
> > GPR04: 0001 c01ffc8b4900 00018840 0007
> > GPR08: 0003 0001 0007 0001
> > GPR12: 2000 c01fff6d9400 00011f884678 7fff70b7
> > GPR16: 7fff7137cb90 7fff7dcb4410 0001 
> > GPR20: 0ffe  0001 
> > GPR24: 8000 0001 c01e1f67e600 c01e1fd82410
> > GPR28: 1000 c01e2e41 0fff 0ffe
> > [  142.090217] NIP [c00800fe848c] kvmppc_hv_get_dirty_log_radix+0x2e4/0x340 [kvm_hv]
> > [  142.090223] LR [c00800fe8488] kvmppc_hv_get_dirty_log_radix+0x2e0/0x340 [kvm_hv]
> > [  142.090224] Call Trace:
> > [  142.090230] [c01e19f07a70] [c00800fe8488] kvmppc_hv_get_dirty_log_radix+0x2e0/0x340 [kvm_hv] (unreliable)
> > [  142.090236] [c01e19f07b50] [c00800fd42e4] kvm_vm_ioctl_get_dirty_log_hv+0x33c/0x3c0 [kvm_hv]
> > [  142.090292] [c01e19f07be0] [c00800eea878] kvm_vm_ioctl_get_dirty_log+0x30/0x50 [kvm]
> > [  142.090300] [c01e19f07c00] [c00800edc818] kvm_vm_ioctl+0x2b0/0xc00 [kvm]
> > [  142.090302] [c01e19f07d50] [c046e148] ksys_ioctl+0xf8/0x150
> > [  142.090305] [c01e19f07da0] [c046e1c8] sys_ioctl+0x28/0x80
> > [  142.090307] [c01e19f07dc0] [c003652c] system_call_exception+0x16c/0x240
> > [  142.090309] [c01e19f07e20] [c000d070] system_call_common+0xf0/0x278
> > [  142.090310] Instruction dump:
> > [  142.090312] 7d3a512a 4200ffd0 7ffefb78 4bfffdc4 6000 3c82 e8848468 3c62
> > [  142.090317] e86384a8 38840010 4800673d e8410018 <0fe0> 4bfffdd4 6000 6000
> > [  142.090322] ---[ end trace 619d45057b6919e0 ]---
> > 
> > Indeed, kvm_radix_test_clear_dirty() tests the PTE dirty bit
> > locklessly, and only takes the kvm->mmu_lock once it finds a dirty
> > PTE.  I think that is important for performance, since on any given
> > scan of the guest real address space we may only find a small
> > proportion of the guest pages to be dirty.
> > 
> > Are you now relying on the kvm->mmu_lock to protect the existence of
> > the PTEs, or just their content?
> > 
> 
> The patch series should not change any rules w.r.t kvm partition scoped page
> table walk. We only added helpers to make it explicit that this is different
> from regular linux page table walk. And kvm->mmu_lock is what was used to
> protect the partition scoped table walk earlier.
> 
> In this specific case, what we need probably is an open coded

Re: [PATCH v4 09/22] powerpc/kvm/book3s: Add helper to walk partition scoped linux page table.

2020-05-27 Thread Paul Mackerras
On Tue, May 05, 2020 at 12:47:16PM +0530, Aneesh Kumar K.V wrote:
> The locking rules for walking the partition scoped table are different from
> those for the process scoped table. Hence add a helper for the secondary
> linux page table walk, and also add a check that we are holding the right
> locks.

This patch is causing new warnings to appear when testing migration,
like this:

[  142.090159] [ cut here ]
[  142.090160] find_kvm_secondary_pte called with kvm mmu_lock not held 
[  142.090176] WARNING: CPU: 23 PID: 5341 at 
arch/powerpc/include/asm/kvm_book3s_64.h:644 
kvmppc_hv_get_dirty_log_radix+0x2e4/0x340 [kvm_hv]
[  142.090177] Modules linked in: xt_conntrack nf_conntrack nf_defrag_ipv6 
nf_defrag_ipv4 libcrc32c bridge stp llc ebtable_filter ebtables ip6table_filter 
ip6_tables bpfilter overlay binfmt_misc input_leds raid_class 
scsi_transport_sas sch_fq_codel sunrpc kvm_hv kvm
[  142.090188] CPU: 23 PID: 5341 Comm: qemu-system-ppc Tainted: GW  
   5.7.0-rc5-kvm-00211-g9ccf10d6d088 #432
[  142.090189] NIP:  c00800fe848c LR: c00800fe8488 CTR: 
[  142.090190] REGS: c01e19f077e0 TRAP: 0700   Tainted: GW  
(5.7.0-rc5-kvm-00211-g9ccf10d6d088)
[  142.090191] MSR:  90029033   CR: 4422  
XER: 2004
[  142.090196] CFAR: c012f5ac IRQMASK: 0 
   GPR00: c00800fe8488 c01e19f07a70 c00800ffe200 
0039 
   GPR04: 0001 c01ffc8b4900 00018840 
0007 
   GPR08: 0003 0001 0007 
0001 
   GPR12: 2000 c01fff6d9400 00011f884678 
7fff70b7 
   GPR16: 7fff7137cb90 7fff7dcb4410 0001 
 
   GPR20: 0ffe  0001 
 
   GPR24: 8000 0001 c01e1f67e600 
c01e1fd82410 
   GPR28: 1000 c01e2e41 0fff 
0ffe 
[  142.090217] NIP [c00800fe848c] kvmppc_hv_get_dirty_log_radix+0x2e4/0x340 
[kvm_hv]
[  142.090223] LR [c00800fe8488] kvmppc_hv_get_dirty_log_radix+0x2e0/0x340 
[kvm_hv]
[  142.090224] Call Trace:
[  142.090230] [c01e19f07a70] [c00800fe8488] 
kvmppc_hv_get_dirty_log_radix+0x2e0/0x340 [kvm_hv] (unreliable)
[  142.090236] [c01e19f07b50] [c00800fd42e4] 
kvm_vm_ioctl_get_dirty_log_hv+0x33c/0x3c0 [kvm_hv]
[  142.090292] [c01e19f07be0] [c00800eea878] 
kvm_vm_ioctl_get_dirty_log+0x30/0x50 [kvm]
[  142.090300] [c01e19f07c00] [c00800edc818] kvm_vm_ioctl+0x2b0/0xc00 
[kvm]
[  142.090302] [c01e19f07d50] [c046e148] ksys_ioctl+0xf8/0x150
[  142.090305] [c01e19f07da0] [c046e1c8] sys_ioctl+0x28/0x80
[  142.090307] [c01e19f07dc0] [c003652c] 
system_call_exception+0x16c/0x240
[  142.090309] [c01e19f07e20] [c000d070] 
system_call_common+0xf0/0x278
[  142.090310] Instruction dump:
[  142.090312] 7d3a512a 4200ffd0 7ffefb78 4bfffdc4 6000 3c82 e8848468 
3c62 
[  142.090317] e86384a8 38840010 4800673d e8410018 <0fe0> 4bfffdd4 6000 
6000 
[  142.090322] ---[ end trace 619d45057b6919e0 ]---

Indeed, kvm_radix_test_clear_dirty() tests the PTE dirty bit
locklessly, and only takes the kvm->mmu_lock once it finds a dirty
PTE.  I think that is important for performance, since on any given
scan of the guest real address space we may only find a small
proportion of the guest pages to be dirty.
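
(For reference, the scheme is essentially this -- sketched, not quoted
from the source:

	pte = READ_ONCE(*ptep);		/* lockless read of the PTE */
	if (pte_present(pte) && pte_dirty(pte)) {
		spin_lock(&kvm->mmu_lock);
		/* re-check and clear the dirty bit under the lock */
		spin_unlock(&kvm->mmu_lock);
	}

so the lock is only taken for the pages that are actually dirty.)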

Are you now relying on the kvm->mmu_lock to protect the existence of
the PTEs, or just their content?

Paul.


Re: [PATCH] KVM: PPC: Book3S HV: read ibm,secure-memory nodes

2020-05-26 Thread Paul Mackerras
On Thu, Apr 16, 2020 at 06:27:15PM +0200, Laurent Dufour wrote:
> The newly introduced ibm,secure-memory nodes supersede the
> ibm,uv-firmware's property secure-memory-ranges.
> 
> Firmware will no longer expose the secure-memory-ranges property, so first
> read the new one and, if it is not found, fall back to the older one.
> 
> Signed-off-by: Laurent Dufour 

Thanks, applied to my kvm-ppc-next branch.

Paul.


Re: [PATCH v4 3/7] KVM: PPC: Remove redundant kvm_run from vcpu_arch

2020-05-26 Thread Paul Mackerras
On Mon, Apr 27, 2020 at 12:35:10PM +0800, Tianjia Zhang wrote:
> The 'kvm_run' field already exists in the 'vcpu' structure and is
> the same as the 'kvm_run' field in 'vcpu_arch', so the latter
> should be deleted.
> 
> Signed-off-by: Tianjia Zhang 

Thanks, patches 3 and 4 of this series applied to my kvm-ppc-next branch.

Paul.


Re: [PATCH -next] KVM: PPC: Book3S HV: remove redundant NULL check

2020-05-26 Thread Paul Mackerras
On Wed, Apr 01, 2020 at 09:09:03PM +0800, Chen Zhou wrote:
> The free function kfree() already does a NULL check, so the additional
> check is unnecessary; just remove it.
> 
> Signed-off-by: Chen Zhou 
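
In other words, a pattern like this (with a hypothetical pointer p):

	if (p)
		kfree(p);

can simply become kfree(p), since kfree(NULL) is a no-op.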

Thanks, applied to my kvm-ppc-next branch.

Paul.


Re: [PATCH v2] KVM: PPC: Book3S HV: relax check on H_SVM_INIT_ABORT

2020-05-26 Thread Paul Mackerras
On Wed, May 20, 2020 at 07:43:08PM +0200, Laurent Dufour wrote:
> The commit 8c47b6ff29e3 ("KVM: PPC: Book3S HV: Check caller of H_SVM_*
> Hcalls") added checks of the secure bit of SRR1 to filter out the Hcalls
> reserved for the Ultravisor.
> 
> However, the Hcall H_SVM_INIT_ABORT is made by the Ultravisor passing the
> context of the VM calling UV_ESM. This allows the Hypervisor to return to
> the guest without going through the Ultravisor. Thus the Secure bit of SRR1
> is not set in that particular case.
> 
> In the case where a regular VM calls H_SVM_INIT_ABORT, the hcall will be
> filtered out in kvmppc_h_svm_init_abort() because kvm->arch.secure_guest is
> not set in that case.
> 
> Fixes: 8c47b6ff29e3 ("KVM: PPC: Book3S HV: Check caller of H_SVM_* Hcalls")
> Signed-off-by: Laurent Dufour 
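
For reference, the net change in the hcall dispatch is roughly this
(sketch):

	case H_SVM_INIT_ABORT:
		/*
		 * No MSR_S check here: the UV makes this call in the
		 * context of the VM that ran UV_ESM, so the Secure bit
		 * is clear; kvmppc_h_svm_init_abort() itself rejects
		 * the call unless kvm->arch.secure_guest is set.
		 */
		ret = kvmppc_h_svm_init_abort(vcpu->kvm);
		break;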

Thanks, applied to my kvm-ppc-next branch.  I expanded the comment in
the code a little.

Paul.


Re: [PATCH] powerpc/kvm/radix: ignore kmemleak false positives

2020-05-26 Thread Paul Mackerras
On Wed, May 13, 2020 at 09:39:15AM -0400, Qian Cai wrote:
> kvmppc_pmd_alloc() and kvmppc_pte_alloc() allocate some memory but then
> pud_populate() and pmd_populate() will use __pa() to reference the newly
> allocated memory.
> 
> Since kmemleak is unable to track the physical memory, resulting in false
> positives, silence those by using kmemleak_ignore().
> 
> unreferenced object 0xc000201c382a1000 (size 4096):
>  comm "qemu-kvm", pid 124828, jiffies 4295733767 (age 341.250s)
>  hex dump (first 32 bytes):
>c0 00 20 09 f4 60 03 87 c0 00 20 10 72 a0 03 87  .. ..` .r...
>c0 00 20 0e 13 a0 03 87 c0 00 20 1b dc c0 03 87  .. ... .
>  backtrace:
>[<4cc2790f>] kvmppc_create_pte+0x838/0xd20 [kvm_hv]
>kvmppc_pmd_alloc at arch/powerpc/kvm/book3s_64_mmu_radix.c:366
>(inlined by) kvmppc_create_pte at 
> arch/powerpc/kvm/book3s_64_mmu_radix.c:590
>[] kvmppc_book3s_instantiate_page+0x2e0/0x8c0 [kvm_hv]
>[] kvmppc_book3s_radix_page_fault+0x1b4/0x2b0 [kvm_hv]
>[<86dddc0e>] kvmppc_book3s_hv_page_fault+0x214/0x12a0 [kvm_hv]
>[<5ae9ccc2>] kvmppc_vcpu_run_hv+0xc5c/0x15f0 [kvm_hv]
>[] kvmppc_vcpu_run+0x34/0x48 [kvm]
>[] kvm_arch_vcpu_ioctl_run+0x314/0x420 [kvm]
>[<2543dd54>] kvm_vcpu_ioctl+0x33c/0x950 [kvm]
>[<48155cd6>] ksys_ioctl+0xd8/0x130
>[<41ffeaa7>] sys_ioctl+0x28/0x40
>[<4afc4310>] system_call_exception+0x114/0x1e0
>[] system_call_common+0xf0/0x278
> unreferenced object 0xc0002001f0c03900 (size 256):
>  comm "qemu-kvm", pid 124830, jiffies 4295735235 (age 326.570s)
>  hex dump (first 32 bytes):
>c0 00 20 10 fa a0 03 87 c0 00 20 10 fa a1 03 87  .. ... .
>c0 00 20 10 fa a2 03 87 c0 00 20 10 fa a3 03 87  .. ... .
>  backtrace:
>[<23f675b8>] kvmppc_create_pte+0x854/0xd20 [kvm_hv]
>kvmppc_pte_alloc at arch/powerpc/kvm/book3s_64_mmu_radix.c:356
>(inlined by) kvmppc_create_pte at 
> arch/powerpc/kvm/book3s_64_mmu_radix.c:593
>[] kvmppc_book3s_instantiate_page+0x2e0/0x8c0 [kvm_hv]
>[] kvmppc_book3s_radix_page_fault+0x1b4/0x2b0 [kvm_hv]
>[<86dddc0e>] kvmppc_book3s_hv_page_fault+0x214/0x12a0 [kvm_hv]
>[<5ae9ccc2>] kvmppc_vcpu_run_hv+0xc5c/0x15f0 [kvm_hv]
>[] kvmppc_vcpu_run+0x34/0x48 [kvm]
>[] kvm_arch_vcpu_ioctl_run+0x314/0x420 [kvm]
>[<2543dd54>] kvm_vcpu_ioctl+0x33c/0x950 [kvm]
>[<48155cd6>] ksys_ioctl+0xd8/0x130
>[<41ffeaa7>] sys_ioctl+0x28/0x40
>[<4afc4310>] system_call_exception+0x114/0x1e0
>[] system_call_common+0xf0/0x278
> 
> Signed-off-by: Qian Cai 

Thanks, applied to my kvm-ppc-next branch.

Paul.


Re: [PATCH] powerpc/kvm/book3s64/vio: fix some RCU-list locks

2020-05-26 Thread Paul Mackerras
On Sun, May 10, 2020 at 01:18:34AM -0400, Qian Cai wrote:
> It is unsafe to traverse kvm->arch.spapr_tce_tables and
> stt->iommu_tables without the RCU read lock held. Also, add
> cond_resched_rcu() in places with the RCU read lock held that could take
> a while to finish.
> 
>  arch/powerpc/kvm/book3s_64_vio.c:76 RCU-list traversed in non-reader 
> section!!
> 
>  other info that might help us debug this:
> 
>  rcu_scheduler_active = 2, debug_locks = 1
>  no locks held by qemu-kvm/4265.
> 
>  stack backtrace:
>  CPU: 96 PID: 4265 Comm: qemu-kvm Not tainted 5.7.0-rc4-next-20200508+ #2
>  Call Trace:
>  [c000201a8690f720] [c0715948] dump_stack+0xfc/0x174 (unreliable)
>  [c000201a8690f770] [c01d9470] lockdep_rcu_suspicious+0x140/0x164
>  [c000201a8690f7f0] [c00810b9fb48] 
> kvm_spapr_tce_release_iommu_group+0x1f0/0x220 [kvm]
>  [c000201a8690f870] [c00810b8462c] 
> kvm_spapr_tce_release_vfio_group+0x54/0xb0 [kvm]
>  [c000201a8690f8a0] [c00810b84710] kvm_vfio_destroy+0x88/0x140 [kvm]
>  [c000201a8690f8f0] [c00810b7d488] kvm_put_kvm+0x370/0x600 [kvm]
>  [c000201a8690f990] [c00810b7e3c0] kvm_vm_release+0x38/0x60 [kvm]
>  [c000201a8690f9c0] [c05223f4] __fput+0x124/0x330
>  [c000201a8690fa20] [c0151cd8] task_work_run+0xb8/0x130
>  [c000201a8690fa70] [c01197e8] do_exit+0x4e8/0xfa0
>  [c000201a8690fb70] [c011a374] do_group_exit+0x64/0xd0
>  [c000201a8690fbb0] [c0132c90] get_signal+0x1f0/0x1200
>  [c000201a8690fcc0] [c0020690] do_notify_resume+0x130/0x3c0
>  [c000201a8690fda0] [c0038d64] syscall_exit_prepare+0x1a4/0x280
>  [c000201a8690fe20] [c000c8f8] system_call_common+0xf8/0x278
> 
>  
>  arch/powerpc/kvm/book3s_64_vio.c:368 RCU-list traversed in non-reader 
> section!!
> 
>  other info that might help us debug this:
> 
>  rcu_scheduler_active = 2, debug_locks = 1
>  2 locks held by qemu-kvm/4264:
>   #0: c000201ae2d000d8 (>mutex){+.+.}-{3:3}, at: 
> kvm_vcpu_ioctl+0xdc/0x950 [kvm]
>   #1: c000200c9ed0c468 (>srcu){}-{0:0}, at: 
> kvmppc_h_put_tce+0x88/0x340 [kvm]
> 
>  
>  arch/powerpc/kvm/book3s_64_vio.c:108 RCU-list traversed in non-reader 
> section!!
> 
>  other info that might help us debug this:
> 
>  rcu_scheduler_active = 2, debug_locks = 1
>  1 lock held by qemu-kvm/4257:
>   #0: c000200b1b363a40 (>lock){+.+.}-{3:3}, at: 
> kvm_vfio_set_attr+0x598/0x6c0 [kvm]
> 
>  
>  arch/powerpc/kvm/book3s_64_vio.c:146 RCU-list traversed in non-reader 
> section!!
> 
>  other info that might help us debug this:
> 
>  rcu_scheduler_active = 2, debug_locks = 1
>  1 lock held by qemu-kvm/4257:
>   #0: c000200b1b363a40 (>lock){+.+.}-{3:3}, at: 
> kvm_vfio_set_attr+0x598/0x6c0 [kvm]
> 
> Signed-off-by: Qian Cai 

Thanks, applied to my kvm-ppc-next branch, with the cond_resched_rcu()
in kvmppc_tce_validate removed.

Paul.


Re: [PATCH] powerpc/kvm/book3s64/vio: fix some RCU-list locks

2020-05-26 Thread Paul Mackerras
On Sun, May 10, 2020 at 01:18:34AM -0400, Qian Cai wrote:
> It is unsafe to traverse kvm->arch.spapr_tce_tables and
> stt->iommu_tables without the RCU read lock held. Also, add
> cond_resched_rcu() in places with the RCU read lock held that could take
> a while to finish.

This mostly looks fine.  The cond_resched_rcu() in kvmppc_tce_validate
doesn't seem necessary (the list would rarely have more than a few
dozen entries) and could be a performance problem given that TCE
validation is a hot-path.
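
For context, the traversal pattern in question looks roughly like this
(sketch):

	rcu_read_lock();
	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
		/* ... validate or release entries ... */
		cond_resched_rcu();	/* unlock, maybe schedule, relock */
	}
	rcu_read_unlock();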

Are you OK with me modifying the patch to take out that
cond_resched_rcu(), or is there some reason why it's essential that it
be there?

Paul.


Re: [linux-next PATCH] mm/gup.c: Convert to use get_user_{page|pages}_fast_only()

2020-05-26 Thread Paul Mackerras
On Mon, May 25, 2020 at 02:23:32PM +0530, Souptick Joarder wrote:
> The API __get_user_pages_fast() is renamed to get_user_pages_fast_only()
> to align with pin_user_pages_fast_only().
> 
> As part of this we get rid of the write parameter; instead the caller
> will pass FOLL_WRITE to get_user_pages_fast_only().
> This does not change any existing functionality of the API.
> 
> All the callers are changed to pass FOLL_WRITE.
> 
> Also introduce get_user_page_fast_only(), and use it in a few
> places that hard-code nr_pages to 1.
> 
> Updated the documentation of the API.
> 
> Signed-off-by: Souptick Joarder 
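
For a typical caller the conversion looks like this (sketch):

	/* before: the write flag was a separate int parameter */
	npages = __get_user_pages_fast(addr, 1, 1, &page);

	/* after: pass FOLL_WRITE in gup_flags instead */
	npages = get_user_pages_fast_only(addr, 1, FOLL_WRITE, &page);

	/* or, for the hard-coded single-page case */
	if (get_user_page_fast_only(addr, FOLL_WRITE, &page)) {
		/* use page, then put_page(page) */
	}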

The arch/powerpc/kvm bits look reasonable.

Reviewed-by: Paul Mackerras 


Re: [PATCH v4 5/7] KVM: PPC: clean up redundant kvm_run parameters in assembly

2020-05-26 Thread Paul Mackerras
On Mon, Apr 27, 2020 at 12:35:12PM +0800, Tianjia Zhang wrote:
> In the current kvm version, 'kvm_run' has been included in the 'kvm_vcpu'
> structure. For historical reasons, many kvm-related function parameters
> retain the 'kvm_run' and 'kvm_vcpu' parameters at the same time. This
> patch does a unified cleanup of these remaining redundant parameters.

Some of these changes don't look completely correct to me, see below.
If you're expecting these patches to go through my tree, I can fix up
the patch and commit it (with you as author), noting the changes I
made in the commit message.  Do you want me to do that?

> diff --git a/arch/powerpc/kvm/book3s_interrupts.S 
> b/arch/powerpc/kvm/book3s_interrupts.S
> index f7ad99d972ce..0eff749d8027 100644
> --- a/arch/powerpc/kvm/book3s_interrupts.S
> +++ b/arch/powerpc/kvm/book3s_interrupts.S
> @@ -55,8 +55,7 @@
>   
> /
>  
>  /* Registers:
> - *  r3: kvm_run pointer
> - *  r4: vcpu pointer
> + *  r3: vcpu pointer
>   */
>  _GLOBAL(__kvmppc_vcpu_run)
>  
> @@ -68,8 +67,8 @@ kvm_start_entry:
>   /* Save host state to the stack */
>   PPC_STLU r1, -SWITCH_FRAME_SIZE(r1)
>  
> - /* Save r3 (kvm_run) and r4 (vcpu) */
> - SAVE_2GPRS(3, r1)
> + /* Save r3 (vcpu) */
> + SAVE_GPR(3, r1)
>  
>   /* Save non-volatile registers (r14 - r31) */
>   SAVE_NVGPRS(r1)
> @@ -82,11 +81,11 @@ kvm_start_entry:
>   PPC_STL r0, _LINK(r1)
>  
>   /* Load non-volatile guest state from the vcpu */
> - VCPU_LOAD_NVGPRS(r4)
> + VCPU_LOAD_NVGPRS(r3)
>  
>  kvm_start_lightweight:
>   /* Copy registers into shadow vcpu so we can access them in real mode */
> - mr  r3, r4
> + mr  r4, r3

This mr doesn't seem necessary.

>   bl  FUNC(kvmppc_copy_to_svcpu)
>   nop
>   REST_GPR(4, r1)

This should be loading r4 from GPR3(r1), not GPR4(r1) - which is what
REST_GPR(4, r1) will do.

Then, in the file but not in the patch context, there is this line:

PPC_LL  r3, GPR4(r1)/* vcpu pointer */

where once again GPR4 needs to be GPR3.

> @@ -191,10 +190,10 @@ after_sprg3_load:
>   PPC_STL r31, VCPU_GPR(R31)(r7)
>  
>   /* Pass the exit number as 3rd argument to kvmppc_handle_exit */

The comment should be modified to say "2nd" instead of "3rd",
otherwise it is confusing.

The rest of the patch looks OK.

Paul.


Re: [PATCH v4 4/7] KVM: PPC: clean up redundant 'kvm_run' parameters

2020-05-25 Thread Paul Mackerras
On Mon, Apr 27, 2020 at 12:35:11PM +0800, Tianjia Zhang wrote:
> In the current kvm version, 'kvm_run' has been included in the 'kvm_vcpu'
> structure. For historical reasons, many kvm-related function parameters
> retain the 'kvm_run' and 'kvm_vcpu' parameters at the same time. This
> patch does a unified cleanup of these remaining redundant parameters.
> 
> Signed-off-by: Tianjia Zhang 

This looks OK, though possibly a little larger than it needs to be
because of variable name changes (kvm_run -> run) that aren't strictly
necessary.

Reviewed-by: Paul Mackerras 


Re: [PATCH v4 3/7] KVM: PPC: Remove redundant kvm_run from vcpu_arch

2020-05-25 Thread Paul Mackerras
On Mon, Apr 27, 2020 at 12:35:10PM +0800, Tianjia Zhang wrote:
> The 'kvm_run' field already exists in the 'vcpu' structure and is
> the same as the 'kvm_run' field in 'vcpu_arch', so the latter
> should be deleted.
> 
> Signed-off-by: Tianjia Zhang 

This looks fine.

I assume each architecture sub-maintainer is taking the relevant
patches from this series via their tree - is that right?

Reviewed-by: Paul Mackerras 


[PATCH RFC 2/4] powerpc: Add Microwatt platform

2020-05-08 Thread Paul Mackerras
Microwatt is an FPGA-based implementation of the Power ISA.  It
currently only implements little-endian 64-bit mode, and does
not (yet) support SMP.

This adds a new machine type to support FPGA-based SoCs with a
Microwatt core.

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/Kconfig  |2 +-
 arch/powerpc/configs/microwatt_defconfig  | 1418 +
 arch/powerpc/platforms/Kconfig|1 +
 arch/powerpc/platforms/Makefile   |1 +
 arch/powerpc/platforms/microwatt/Kconfig  |9 +
 arch/powerpc/platforms/microwatt/Makefile |1 +
 arch/powerpc/platforms/microwatt/setup.c  |   40 +
 7 files changed, 1471 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/configs/microwatt_defconfig
 create mode 100644 arch/powerpc/platforms/microwatt/Kconfig
 create mode 100644 arch/powerpc/platforms/microwatt/Makefile
 create mode 100644 arch/powerpc/platforms/microwatt/setup.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 497b7d0b2d7e..97286b8312f5 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -407,7 +407,7 @@ config HUGETLB_PAGE_SIZE_VARIABLE
 
 config MATH_EMULATION
bool "Math emulation"
-   depends on 4xx || PPC_8xx || PPC_MPC832x || BOOKE
+   depends on 4xx || PPC_8xx || PPC_MPC832x || BOOKE || PPC_MICROWATT
help
  Some PowerPC chips designed for embedded applications do not have
  a floating-point unit and therefore do not implement the
diff --git a/arch/powerpc/configs/microwatt_defconfig 
b/arch/powerpc/configs/microwatt_defconfig
new file mode 100644
index ..f4f4c965a786
--- /dev/null
+++ b/arch/powerpc/configs/microwatt_defconfig
@@ -0,0 +1,1418 @@
+#
+# Automatically generated file; DO NOT EDIT.
+# Linux/powerpc 5.6.0 Kernel Configuration
+#
+
+#
+# Compiler: powerpc64le-linux-gnu-gcc (GCC) 9.2.1 20190827 (Red Hat Cross 
9.2.1-1)
+#
+CONFIG_CC_IS_GCC=y
+CONFIG_GCC_VERSION=90201
+CONFIG_CLANG_VERSION=0
+CONFIG_CC_HAS_ASM_GOTO=y
+CONFIG_CC_HAS_ASM_INLINE=y
+CONFIG_CC_HAS_WARN_MAYBE_UNINITIALIZED=y
+CONFIG_CC_DISABLE_WARN_MAYBE_UNINITIALIZED=y
+CONFIG_IRQ_WORK=y
+CONFIG_BUILDTIME_TABLE_SORT=y
+CONFIG_THREAD_INFO_IN_TASK=y
+
+#
+# General setup
+#
+CONFIG_BROKEN_ON_SMP=y
+CONFIG_INIT_ENV_ARG_LIMIT=32
+# CONFIG_COMPILE_TEST is not set
+CONFIG_LOCALVERSION=""
+CONFIG_LOCALVERSION_AUTO=y
+CONFIG_BUILD_SALT=""
+CONFIG_HAVE_KERNEL_GZIP=y
+CONFIG_HAVE_KERNEL_XZ=y
+# CONFIG_KERNEL_GZIP is not set
+CONFIG_KERNEL_XZ=y
+CONFIG_DEFAULT_HOSTNAME="arty"
+# CONFIG_SWAP is not set
+# CONFIG_SYSVIPC is not set
+# CONFIG_CROSS_MEMORY_ATTACH is not set
+# CONFIG_USELIB is not set
+CONFIG_HAVE_ARCH_AUDITSYSCALL=y
+
+#
+# IRQ subsystem
+#
+CONFIG_GENERIC_IRQ_SHOW=y
+CONFIG_GENERIC_IRQ_SHOW_LEVEL=y
+CONFIG_HARDIRQS_SW_RESEND=y
+CONFIG_IRQ_DOMAIN=y
+CONFIG_IRQ_FORCED_THREADING=y
+CONFIG_SPARSE_IRQ=y
+# end of IRQ subsystem
+
+CONFIG_GENERIC_TIME_VSYSCALL=y
+CONFIG_GENERIC_CLOCKEVENTS=y
+CONFIG_GENERIC_CMOS_UPDATE=y
+
+#
+# Timers subsystem
+#
+CONFIG_TICK_ONESHOT=y
+CONFIG_HZ_PERIODIC=y
+# CONFIG_NO_HZ_IDLE is not set
+# CONFIG_NO_HZ is not set
+CONFIG_HIGH_RES_TIMERS=y
+# end of Timers subsystem
+
+# CONFIG_PREEMPT_NONE is not set
+CONFIG_PREEMPT_VOLUNTARY=y
+# CONFIG_PREEMPT is not set
+
+#
+# CPU/Task time and stats accounting
+#
+CONFIG_TICK_CPU_ACCOUNTING=y
+# CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set
+# CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set
+# CONFIG_IRQ_TIME_ACCOUNTING is not set
+# CONFIG_BSD_PROCESS_ACCT is not set
+# CONFIG_PSI is not set
+# end of CPU/Task time and stats accounting
+
+#
+# RCU Subsystem
+#
+CONFIG_TINY_RCU=y
+# CONFIG_RCU_EXPERT is not set
+CONFIG_SRCU=y
+CONFIG_TINY_SRCU=y
+# end of RCU Subsystem
+
+# CONFIG_IKCONFIG is not set
+# CONFIG_IKHEADERS is not set
+CONFIG_LOG_BUF_SHIFT=16
+CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT=12
+
+#
+# Scheduler features
+#
+# end of Scheduler features
+
+CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
+CONFIG_CC_HAS_INT128=y
+# CONFIG_CGROUPS is not set
+# CONFIG_NAMESPACES is not set
+# CONFIG_CHECKPOINT_RESTORE is not set
+# CONFIG_SCHED_AUTOGROUP is not set
+# CONFIG_SYSFS_DEPRECATED is not set
+# CONFIG_RELAY is not set
+CONFIG_BLK_DEV_INITRD=y
+CONFIG_INITRAMFS_SOURCE="rootfs.cpio"
+CONFIG_INITRAMFS_ROOT_UID=0
+CONFIG_INITRAMFS_ROOT_GID=0
+# CONFIG_RD_GZIP is not set
+# CONFIG_RD_BZIP2 is not set
+# CONFIG_RD_LZMA is not set
+CONFIG_RD_XZ=y
+# CONFIG_RD_LZO is not set
+# CONFIG_RD_LZ4 is not set
+CONFIG_INITRAMFS_COMPRESSION_XZ=y
+# CONFIG_INITRAMFS_COMPRESSION_NONE is not set
+# CONFIG_BOOT_CONFIG is not set
+# CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE is not set
+CONFIG_CC_OPTIMIZE_FOR_SIZE=y
+CONFIG_HAVE_LD_DEAD_CODE_DATA_ELIMINATION=y
+# CONFIG_LD_DEAD_CODE_DATA_ELIMINATION is not set
+CONFIG_SYSCTL=y
+CONFIG_SYSCTL_EXCEPTION_TRACE=y
+CONFIG_EXPERT=y
+CONFIG_MULTIUSER=y
+# CONFIG_SGETMASK_SYSCALL is not set
+# CONFIG_SYSFS_SYSCALL is not set
+# CONFIG_FHANDLE 

[PATCH RFC 0/4] Add support for Microwatt-based SoCs

2020-05-08 Thread Paul Mackerras
This patch series adds support for running Linux on a Microwatt SoC
(system on chip) implementation on an FPGA.  Microwatt is a small
Power ISA implementation, targeted at FPGAs, aiming for PowerISA
v3.0B compliance.  It does not currently implement any floating-point
or vector instructions, hypervisor mode, big-endian mode, 32-bit mode,
or the HPT/SLB MMU facilities.  However, it does support enough to run
Linux (as of my "mmu" branch plus Ben's "litedram" branch).

Paul.


[PATCH RFC 1/4] powerpc/radix: Fix compilation for radix with CONFIG_SMP=n

2020-05-08 Thread Paul Mackerras
This fixes the compile errors we currently get with CONFIG_SMP=n and
CONFIG_PPC_RADIX_MMU=y.

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/book3s/64/tlbflush-radix.h | 2 ++
 arch/powerpc/mm/book3s64/radix_tlb.c| 2 --
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h 
b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
index ca8db193ae38..adcc6114d170 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
@@ -68,6 +68,8 @@ extern void radix__flush_tlb_page_psize(struct mm_struct *mm, 
unsigned long vmad
 #define radix__flush_all_mm(mm)radix__local_flush_all_mm(mm)
 #define radix__flush_tlb_page(vma,addr)
radix__local_flush_tlb_page(vma,addr)
 #define radix__flush_tlb_page_psize(mm,addr,p) 
radix__local_flush_tlb_page_psize(mm,addr,p)
+#define exit_flush_lazy_tlbs(mm)   do { } while (0)
+#define __flush_all_mm(mm, fullmm) radix__local_flush_all_mm(mm)
 #endif
 extern void radix__flush_tlb_pwc(struct mmu_gather *tlb, unsigned long addr);
 extern void radix__flush_tlb_collapsed_pmd(struct mm_struct *mm, unsigned long 
addr);
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c 
b/arch/powerpc/mm/book3s64/radix_tlb.c
index 03f43c924e00..e3ea026cf91e 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -776,8 +776,6 @@ void radix__flush_tlb_page(struct vm_area_struct *vma, 
unsigned long vmaddr)
 }
 EXPORT_SYMBOL(radix__flush_tlb_page);
 
-#else /* CONFIG_SMP */
-#define radix__flush_all_mm radix__local_flush_all_mm
 #endif /* CONFIG_SMP */
 
 static void do_tlbiel_kernel(void *info)
-- 
2.25.3



[PATCH RFC 4/4] powerpc/radix: Add support for microwatt's PRTBL SPR

2020-05-08 Thread Paul Mackerras
Microwatt currently doesn't implement hypervisor mode and therefore
doesn't implement the partition table.  It does implement the process
table and radix page table walks.

This adds code to write the base address of the process table to the
PRTBL SPR, which has been assigned SPR 720 for now, as that is in the
range of SPR numbers assigned for experimental use.  PRTBL is only
written when we have neither the FW_FEATURE_LPAR feature nor the
CPU_FTR_HVMODE feature.

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/reg.h   |  1 +
 arch/powerpc/mm/book3s64/radix_pgtable.c | 13 +
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 1aa46dff0957..6ea3fc42740d 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -721,6 +721,7 @@
 #endif
 #define SPRN_TIR   0x1BE   /* Thread Identification Register */
 #define SPRN_PTCR  0x1D0   /* Partition table control Register */
+#define SPRN_PRTBL 0x2D0   /* Process table pointer */
 #define SPRN_PSPB  0x09F   /* Problem State Priority Boost reg */
 #define SPRN_PTEHI 0x3D5   /* 981 7450 PTE HI word (S/W TLB load) */
 #define SPRN_PTELO 0x3D6   /* 982 7450 PTE LO word (S/W TLB load) */
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index dd1bea45325c..2e6a376c9d82 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -600,10 +600,15 @@ void __init radix__early_init_mmu(void)
radix_init_pgtable();
 
if (!firmware_has_feature(FW_FEATURE_LPAR)) {
-   lpcr = mfspr(SPRN_LPCR);
-   mtspr(SPRN_LPCR, lpcr | LPCR_UPRT | LPCR_HR);
-   radix_init_partition_table();
-   radix_init_amor();
+   if (cpu_has_feature(CPU_FTR_HVMODE)) {
+   lpcr = mfspr(SPRN_LPCR);
+   mtspr(SPRN_LPCR, lpcr | LPCR_UPRT | LPCR_HR);
+   radix_init_partition_table();
+   radix_init_amor();
+   } else {
+   mtspr(SPRN_PRTBL, (__pa(process_tb) |
+  (PRTB_SIZE_SHIFT - 12)));
+   }
} else {
radix_init_pseries();
}
-- 
2.25.3



[PATCH RFC 3/4] powerpc/microwatt: Add early debug UART support for Microwatt

2020-05-08 Thread Paul Mackerras
Currently Microwatt-based SoCs come with a "potato" UART.  This
adds udbg support for the potato UART, giving us both an early
debug console and a runtime console using the hvc-udbg support.

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/Kconfig.debug   |   6 ++
 arch/powerpc/include/asm/udbg.h  |   1 +
 arch/powerpc/kernel/udbg.c   |   2 +
 arch/powerpc/platforms/microwatt/setup.c | 108 +++
 4 files changed, 117 insertions(+)

diff --git a/arch/powerpc/Kconfig.debug b/arch/powerpc/Kconfig.debug
index 0b063830eea8..399abc6d2af7 100644
--- a/arch/powerpc/Kconfig.debug
+++ b/arch/powerpc/Kconfig.debug
@@ -277,6 +277,12 @@ config PPC_EARLY_DEBUG_MEMCONS
  This console provides input and output buffers stored within the
  kernel BSS and should be safe to select on any system. A debugger
  can then be used to read kernel output or send input to the console.
+
+config PPC_EARLY_DEBUG_MICROWATT
+   bool "Microwatt potato UART"
+   help
+ Select this to enable early debugging using the potato UART
+included in the Microwatt SOC.
 endchoice
 
 config PPC_MEMCONS_OUTPUT_SIZE
diff --git a/arch/powerpc/include/asm/udbg.h b/arch/powerpc/include/asm/udbg.h
index 0ea9e70ed78b..2dbd2d3b0591 100644
--- a/arch/powerpc/include/asm/udbg.h
+++ b/arch/powerpc/include/asm/udbg.h
@@ -53,6 +53,7 @@ extern void __init udbg_init_ehv_bc(void);
 extern void __init udbg_init_ps3gelic(void);
 extern void __init udbg_init_debug_opal_raw(void);
 extern void __init udbg_init_debug_opal_hvsi(void);
+extern void __init udbg_init_debug_microwatt(void);
 
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_UDBG_H */
diff --git a/arch/powerpc/kernel/udbg.c b/arch/powerpc/kernel/udbg.c
index 01595e8cafe7..e614993021c6 100644
--- a/arch/powerpc/kernel/udbg.c
+++ b/arch/powerpc/kernel/udbg.c
@@ -67,6 +67,8 @@ void __init udbg_early_init(void)
udbg_init_debug_opal_raw();
 #elif defined(CONFIG_PPC_EARLY_DEBUG_OPAL_HVSI)
udbg_init_debug_opal_hvsi();
+#elif defined(CONFIG_PPC_EARLY_DEBUG_MICROWATT)
+   udbg_init_debug_microwatt();
 #endif
 
 #ifdef CONFIG_PPC_EARLY_DEBUG
diff --git a/arch/powerpc/platforms/microwatt/setup.c 
b/arch/powerpc/platforms/microwatt/setup.c
index 3cfc5955a6fe..a5145adeaae7 100644
--- a/arch/powerpc/platforms/microwatt/setup.c
+++ b/arch/powerpc/platforms/microwatt/setup.c
@@ -10,8 +10,115 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 
+static u64 potato_uart_base;
+
+#define PROC_FREQ 1
+#define UART_FREQ 115200
+#define UART_BASE 0xc0002000
+
+#define POTATO_CONSOLE_TX  0x00
+#define POTATO_CONSOLE_RX  0x08
+#define POTATO_CONSOLE_STATUS  0x10
+#define   POTATO_CONSOLE_STATUS_RX_EMPTY   0x01
+#define   POTATO_CONSOLE_STATUS_TX_EMPTY   0x02
+#define   POTATO_CONSOLE_STATUS_RX_FULL0x04
+#define   POTATO_CONSOLE_STATUS_TX_FULL0x08
+#define POTATO_CONSOLE_CLOCK_DIV   0x18
+#define POTATO_CONSOLE_IRQ_EN  0x20
+
+static u64 potato_uart_reg_read(int offset)
+{
+   u64 val, msr;
+
+   msr = mfmsr();
+   __asm__ volatile("mtmsrd %3,0; ldcix %0,%1,%2; mtmsrd %4,0"
+: "=r" (val) : "b" (potato_uart_base), "r" (offset),
+  "r" (msr & ~MSR_DR), "r" (msr));
+
+   return val;
+}
+
+static void potato_uart_reg_write(int offset, u64 val)
+{
+   u64 msr;
+
+   msr = mfmsr();
+   __asm__ volatile("mtmsrd %3,0; stdcix %0,%1,%2; mtmsrd %4,0"
+: : "r" (val), "b" (potato_uart_base), "r" (offset),
+  "r" (msr & ~MSR_DR), "r" (msr));
+}
+
+static int potato_uart_rx_empty(void)
+{
+   u64 val;
+
+   val = potato_uart_reg_read(POTATO_CONSOLE_STATUS);
+
+   if (val & POTATO_CONSOLE_STATUS_RX_EMPTY)
+   return 1;
+
+   return 0;
+}
+
+static int potato_uart_tx_full(void)
+{
+   u64 val;
+
+   val = potato_uart_reg_read(POTATO_CONSOLE_STATUS);
+
+   if (val & POTATO_CONSOLE_STATUS_TX_FULL)
+   return 1;
+
+   return 0;
+}
+
+static int potato_uart_read(void)
+{
+   while (potato_uart_rx_empty())
+   ;
+   return potato_uart_reg_read(POTATO_CONSOLE_RX);
+}
+
+static int potato_uart_read_poll(void)
+{
+   if (potato_uart_rx_empty())
+   return -1;
+   return potato_uart_reg_read(POTATO_CONSOLE_RX);
+}
+
+static void potato_uart_write(char c)
+{
+   if (c == '\n')
+   potato_uart_write('\r');
+   while (potato_uart_tx_full())
+   ;
+   potato_uart_reg_write(POTATO_CONSOLE_TX, c);
+}
+
+static unsigned long potato_uart_divisor(unsigned long proc_freq, unsigned 
long uart_freq)
+{

[PATCH] KVM: PPC: Book3S HV: Handle non-present PTEs in page fault functions

2020-04-15 Thread Paul Mackerras
Since cd758a9b57ee "KVM: PPC: Book3S HV: Use __gfn_to_pfn_memslot in HPT
page fault handler", it's been possible in fairly rare circumstances to
load a non-present PTE in kvmppc_book3s_hv_page_fault() when running a
guest on a POWER8 host.

Because that case wasn't checked for, we could misinterpret the non-present
PTE as being a cache-inhibited PTE.  That could mismatch with the
corresponding hash PTE, which would cause the function to fail with -EFAULT
a little further down.  That would propagate up to the KVM_RUN ioctl(),
generally causing the KVM userspace (usually qemu) to fall over.

This addresses the problem by catching that case and returning to the guest
instead, letting it fault again, and retrying the whole page fault from
the beginning.

For completeness, this fixes the radix page fault handler in the same
way.  For radix this didn't cause any obvious misbehaviour, because we
ended up putting the non-present PTE into the guest's partition-scoped
page tables, leading immediately to another hypervisor data/instruction
storage interrupt, which would go through the page fault path again
and fix things up.

Fixes: cd758a9b57ee "KVM: PPC: Book3S HV: Use __gfn_to_pfn_memslot in HPT page 
fault handler"
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1820402
Reported-by: David Gibson 
Signed-off-by: Paul Mackerras 
---
This is a reworked version of the patch David Gibson sent recently,
with the fix applied to the radix case as well. The commit message
is mostly stolen from David's patch.

 arch/powerpc/kvm/book3s_64_mmu_hv.c| 9 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 9 +
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 3aecec8..20b7dce 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -604,18 +604,19 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, 
struct kvm_vcpu *vcpu,
 */
local_irq_disable();
ptep = __find_linux_pte(vcpu->arch.pgdir, hva, NULL, );
+   pte = __pte(0);
+   if (ptep)
+   pte = *ptep;
+   local_irq_enable();
/*
 * If the PTE disappeared temporarily due to a THP
 * collapse, just return and let the guest try again.
 */
-   if (!ptep) {
-   local_irq_enable();
+   if (!pte_present(pte)) {
if (page)
put_page(page);
return RESUME_GUEST;
}
-   pte = *ptep;
-   local_irq_enable();
hpa = pte_pfn(pte) << PAGE_SHIFT;
pte_size = PAGE_SIZE;
if (shift)
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 134fbc1..7bf94ba 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -815,18 +815,19 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
 */
local_irq_disable();
ptep = __find_linux_pte(vcpu->arch.pgdir, hva, NULL, );
+   pte = __pte(0);
+   if (ptep)
+   pte = *ptep;
+   local_irq_enable();
/*
 * If the PTE disappeared temporarily due to a THP
 * collapse, just return and let the guest try again.
 */
-   if (!ptep) {
-   local_irq_enable();
+   if (!pte_present(pte)) {
if (page)
put_page(page);
return RESUME_GUEST;
}
-   pte = *ptep;
-   local_irq_enable();
 
/* If we're logging dirty pages, always map single pages */
large_enable = !(memslot->flags & KVM_MEM_LOG_DIRTY_PAGES);
-- 
2.7.4



Re: CVE-2020-11669: Linux kernel 4.10 to 5.1: powerpc: guest can cause DoS on POWER9 KVM hosts

2020-04-15 Thread Paul Mackerras
On Wed, Apr 15, 2020 at 04:03:29PM +0200, Michal Suchánek wrote:
> On Wed, Apr 15, 2020 at 10:52:53PM +1000, Andrew Donnellan wrote:
> > The Linux kernel for powerpc from v4.10 to v5.1 has a bug where the
> > Authority Mask Register (AMR), Authority Mask Override Register (AMOR) and
> > User Authority Mask Override Register (UAMOR) are not correctly saved and
> > restored when the CPU is going into/coming out of idle state.
> > 
> > On POWER9 CPUs, this means that a CPU may return from idle with the AMR
> > value of another thread on the same core.
> > 
> > This allows a trivial Denial of Service attack against KVM hosts, by booting
> > a guest kernel which makes use of the AMR, such as a v5.2 or later kernel
> > with Kernel Userspace Access Prevention (KUAP) enabled.
> > 
> > The guest kernel will set the AMR to prevent userspace access, then the
> > thread will go idle. At a later point, the hardware thread that the guest
> > was using may come out of idle and start executing in the host, without
> > restoring the host AMR value. The host kernel can get caught in a page fault
> > loop, as the AMR is unexpectedly causing memory accesses to fail in the
> > host, and the host is eventually rendered unusable.
> 
> Hello,
> 
> shouldn't the kernel restore the host registers when leaving the guest?

It does.  That's not the bug.

> I recall some code exists for handling the *AM*R when leaving guest. Can
> the KVM guest enter idle without exiting to host?

No, we currently never execute the "stop" instruction in guest context.

The bug occurs when a thread that is in the host goes idle and
executes the stop instruction to go to a power-saving state, while
another thread is executing inside a guest.  Hardware loses the first
thread's AMR while it is stopped, and as it happens, it is possible
for the first thread to wake up with the contents of its AMR equal to
the other thread's AMR.  This can happen even if the first thread has
never executed in the guest.

The kernel needs to save and restore AMR (among other registers)
across the stop instruction because of this hardware behaviour.
We missed the AMR initially, which is what led to this vulnerability.
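
In code terms the fix is of this shape (a sketch; the real code also
covers IAMR, UAMOR and friends):

	u64 amr = mfspr(SPRN_AMR);	/* save before going idle */
	/* ... execute stop; SPR contents may be lost here ... */
	mtspr(SPRN_AMR, amr);		/* restore on wakeup */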

Paul.


Re: [PATCH v3 1/1] ppc/crash: Reset spinlocks during crash

2020-04-08 Thread Paul Mackerras
On Wed, Apr 08, 2020 at 10:21:29PM +1000, Michael Ellerman wrote:
> 
> We should be able to just allocate the rtas_args on the stack, it's only
> ~80 odd bytes. And then we can use rtas_call_unlocked() which doesn't
> take the global lock.
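
(That is, something along these lines, assuming the 4GB constraint
mentioned below doesn't apply; token and the argument counts here are
hypothetical:

	struct rtas_args args;		/* on the stack, ~80 bytes */

	rtas_call_unlocked(&args, token, 0, 1);
)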

Do we instantiate a 64-bit RTAS these days, or is it still 32-bit?
In the old days we had to make sure the RTAS argument buffer was
below the 4GB point.  If that's still necessary then perhaps putting
rtas_args inside the PACA would be the way to go.

Paul.


Re: [RFC PATCH 1/1] ppc/smp: Replace unnecessary 'while' by 'if'

2020-03-26 Thread Paul Mackerras
On Thu, Mar 26, 2020 at 05:37:52PM -0300, Leonardo Bras wrote:
> spin_until_cond() will wait until nmi_ipi_busy == false, and
> nmi_ipi_lock_start() does not seem to change nmi_ipi_busy, so there is
> no way this 'while' loop will ever repeat.
> 
> Replace this 'while' with an 'if', so it does not look like it can repeat.

Nack, it can repeat.  The scenario is that cpu A is in this code,
inside spin_until_cond(); cpu B has previously set nmi_ipi_busy, and
cpu C is also waiting for nmi_ipi_busy to be cleared, like cpu A.
When cpu B clears nmi_ipi_busy, both cpu A and cpu C will see that and
will race inside nmi_ipi_lock_start().  One of them, say cpu C, will
take the lock and proceed to set nmi_ipi_busy and then call
nmi_ipi_unlock().  Then the other cpu (cpu A) will then take the lock
and return from nmi_ipi_lock_start() and find nmi_ipi_busy == true.
At that point it needs to go through the while loop body once more.
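
For reference, the code in question has the classic recheck-under-lock
shape (sketched from memory, not quoted):

	nmi_ipi_lock_start(&flags);
	while (nmi_ipi_busy) {			/* can be true again here */
		nmi_ipi_unlock_end(&flags);
		spin_until_cond(!nmi_ipi_busy);
		nmi_ipi_lock_start(&flags);
	}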

Paul.


Re: [PATCH 0/2] Fix SVM hang at startup

2020-03-23 Thread Paul Mackerras
On Fri, Mar 20, 2020 at 11:26:41AM +0100, Laurent Dufour wrote:
> This series fixes an SVM hang that occurs when starting an SVM requiring
> more secure memory than is available. The hang happens in the SVM when
> calling UV_ESM.
> 
> The following is happening:
> 
> 1. SVM calls UV_ESM
> 2. Ultravisor (UV) calls H_SVM_INIT_START
> 3. Hypervisor (HV) calls UV_REGISTER_MEM_SLOT
> 4. UV returns error because there is not enough free secure memory
> 5. HV enter the error path in kvmppc_h_svm_init_start()
> 6. In the return path, since kvm->arch.secure_guest is not yet set,
>    hrfid is called
> 7. As the HV doesn't know the SVM calling context, hrfid jumps to an
>    unknown address in the SVM, leading to various exceptions.
> 
> This series fixes the setting of kvm->arch.secure_guest in
> kvmppc_h_svm_init_start() to ensure that UV_RETURN is called on the return
> path to get back to the UV.
> 
> In addition, to ensure that a malicious VM will not call UV-reserved
> Hcalls, a check of the Secure bit in the calling MSR is added to reject
> such a call.
> 
> It is assumed that the UV will filter out such Hcalls made by a malicious
> SVM.
> 
> Laurent Dufour (2):
>   KVM: PPC: Book3S HV: check caller of H_SVM_* Hcalls
>   KVM: PPC: Book3S HV: H_SVM_INIT_START must call UV_RETURN
> 
>  arch/powerpc/kvm/book3s_hv.c   | 32 --
>  arch/powerpc/kvm/book3s_hv_uvmem.c |  3 ++-
>  2 files changed, 23 insertions(+), 12 deletions(-)
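
For reference, the check in patch 1 has this shape (sketch; the call
arguments are approximated):

	case H_SVM_PAGE_IN:
		ret = H_UNSUPPORTED;
		if (kvmppc_get_srr1(vcpu) & MSR_S)	/* UV or SVM caller */
			ret = kvmppc_h_svm_page_in(vcpu->kvm,
						   kvmppc_get_gpr(vcpu, 4),
						   kvmppc_get_gpr(vcpu, 5),
						   kvmppc_get_gpr(vcpu, 6));
		break;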

Thanks, series applied to my kvm-ppc-next branch.

Paul.


Re: [PATCH] KVM: PPC: Book3S HV: Skip kvmppc_uvmem_free if Ultravisor is not supported

2020-03-23 Thread Paul Mackerras
On Thu, Mar 19, 2020 at 07:55:10PM -0300, Fabiano Rosas wrote:
> kvmppc_uvmem_init checks for Ultravisor support and returns early if
> it is not present. Calling kvmppc_uvmem_free at module exit will cause
> an Oops:
> 
> $ modprobe -r kvm-hv
> 
>   Oops: Kernel access of bad area, sig: 11 [#1]
>   
>   NIP:  c0789e90 LR: c0789e8c CTR: c0401030
>   REGS: c03fa7bab9a0 TRAP: 0300   Not tainted  
> (5.6.0-rc6-00033-g6c90b86a745a-dirty)
>   MSR:  90009033   CR: 24002282  XER: 
> 
>   CFAR: c0dae880 DAR: 0008 DSISR: 4000 IRQMASK: 1
>   GPR00: c0789e8c c03fa7babc30 c16fe500 
>   GPR04:  0006  c03faf205c00
>   GPR08:  0001 802d c0080ddde140
>   GPR12: c0401030 c03d9080 0001 
>   GPR16:   00013aad0074 00013aaac978
>   GPR20: 00013aad0070  7fffd1b37158 
>   GPR24: 00014fef0d58  00014fef0cf0 0001
>   GPR28:   c18b2a60 
>   NIP [c0789e90] percpu_ref_kill_and_confirm+0x40/0x170
>   LR [c0789e8c] percpu_ref_kill_and_confirm+0x3c/0x170
>   Call Trace:
>   [c03fa7babc30] [c03faf2064d4] 0xc03faf2064d4 (unreliable)
>   [c03fa7babcb0] [c0400e8c] dev_pagemap_kill+0x6c/0x80
>   [c03fa7babcd0] [c0401064] memunmap_pages+0x34/0x2f0
>   [c03fa7babd50] [c0080548] kvmppc_uvmem_free+0x30/0x80 [kvm_hv]
>   [c03fa7babd80] [c0080ddcef18] kvmppc_book3s_exit_hv+0x20/0x78 
> [kvm_hv]
>   [c03fa7babda0] [c02084d0] sys_delete_module+0x1d0/0x2c0
>   [c03fa7babe20] [c000b9d0] system_call+0x5c/0x68
>   Instruction dump:
>   3fc2001b fb81ffe0 fba1ffe8 fbe1fff8 7c7f1b78 7c9c2378 3bde4560 7fc3f378
>   f8010010 f821ff81 486249a1 6000  7c7d1b78 712a0002 40820084
>   ---[ end trace 5774ef4dc2c98279 ]---
> 
> So this patch checks if kvmppc_uvmem_init actually allocated anything
> before running kvmppc_uvmem_free.
> 
> Fixes: ca9f4942670c ("KVM: PPC: Book3S HV: Support for running secure guests")
> Reported-by: Greg Kurz 
> Signed-off-by: Fabiano Rosas 

Thanks, applied to my kvm-ppc-next branch (Michael Ellerman decided
that he didn't need to take it as the crash only occurs with
CONFIG_PPC_UV=n, which is not the default).

Paul.


Re: [PATCH 1/2] KVM: PPC: Book3S HV: check caller of H_SVM_* Hcalls

2020-03-23 Thread Paul Mackerras
On Fri, Mar 20, 2020 at 01:22:48PM +0100, Greg Kurz wrote:
> On Fri, 20 Mar 2020 11:26:42 +0100
> Laurent Dufour  wrote:
> 
> > The Hcalls named H_SVM_* are reserved for the Ultravisor. However, nothing
> > prevents a malicious VM or SVM from calling them. This could lead to weird
> > results and should be filtered out.
> > 
> > Checking the Secure bit of the calling MSR ensures that the call is coming
> > from either the Ultravisor or an SVM. But any system call made from an SVM
> > goes through the Ultravisor, and the Ultravisor should filter out these
> > malicious calls. This way, only the Ultravisor is able to make such an
> > Hcall.
> 
> "Ultravisor should filter" ? And what if it doesn't (eg. because of a bug) ?
> 
> Shouldn't we also check the HV bit of the calling MSR as well to
> disambiguate SVM and UV ?

The trouble with doing that (checking the HV bit) is that KVM does not
expect to see the HV bit set on an interrupt that occurred while we
were in the guest, and if it is set, it indicates a serious problem,
i.e. that an interrupt occurred while we were in the code that
transitions from host context to guest context, or from guest context
to host context.  In those cases we don't know how much of the
transition has been completed and therefore whether we have guest
values or host values in the CPU registers (GPRs, FPRs/VSRs, SPRs).
If we do see HV set then KVM reports a severe error to userspace which
should cause userspace to terminate the guest.

Therefore the UV should *always* have the HV bit clear in HSRR1/SRR1
when transitioning to KVM.

Paul.


Re: [PATCH] KVM: PPC: Book3S HV: Skip kvmppc_uvmem_free if Ultravisor is not supported

2020-03-19 Thread Paul Mackerras
On Thu, Mar 19, 2020 at 07:55:10PM -0300, Fabiano Rosas wrote:
> kvmppc_uvmem_init checks for Ultravisor support and returns early if
> it is not present. Calling kvmppc_uvmem_free at module exit will cause
> an Oops:
> 
> $ modprobe -r kvm-hv
> 
>   Oops: Kernel access of bad area, sig: 11 [#1]
>   
>   NIP:  c0789e90 LR: c0789e8c CTR: c0401030
>   REGS: c03fa7bab9a0 TRAP: 0300   Not tainted  
> (5.6.0-rc6-00033-g6c90b86a745a-dirty)
>   MSR:  90009033   CR: 24002282  XER: 
> 
>   CFAR: c0dae880 DAR: 0008 DSISR: 4000 IRQMASK: 1
>   GPR00: c0789e8c c03fa7babc30 c16fe500 
>   GPR04:  0006  c03faf205c00
>   GPR08:  0001 802d c0080ddde140
>   GPR12: c0401030 c03d9080 0001 
>   GPR16:   00013aad0074 00013aaac978
>   GPR20: 00013aad0070  7fffd1b37158 
>   GPR24: 00014fef0d58  00014fef0cf0 0001
>   GPR28:   c18b2a60 
>   NIP [c0789e90] percpu_ref_kill_and_confirm+0x40/0x170
>   LR [c0789e8c] percpu_ref_kill_and_confirm+0x3c/0x170
>   Call Trace:
>   [c03fa7babc30] [c03faf2064d4] 0xc03faf2064d4 (unreliable)
>   [c03fa7babcb0] [c0400e8c] dev_pagemap_kill+0x6c/0x80
>   [c03fa7babcd0] [c0401064] memunmap_pages+0x34/0x2f0
>   [c03fa7babd50] [c0080548] kvmppc_uvmem_free+0x30/0x80 [kvm_hv]
>   [c03fa7babd80] [c0080ddcef18] kvmppc_book3s_exit_hv+0x20/0x78 
> [kvm_hv]
>   [c03fa7babda0] [c02084d0] sys_delete_module+0x1d0/0x2c0
>   [c03fa7babe20] [c000b9d0] system_call+0x5c/0x68
>   Instruction dump:
>   3fc2001b fb81ffe0 fba1ffe8 fbe1fff8 7c7f1b78 7c9c2378 3bde4560 7fc3f378
>   f8010010 f821ff81 486249a1 6000  7c7d1b78 712a0002 40820084
>   ---[ end trace 5774ef4dc2c98279 ]---
> 
> So this patch checks if kvmppc_uvmem_init actually allocated anything
> before running kvmppc_uvmem_free.
> 
> Fixes: ca9f4942670c ("KVM: PPC: Book3S HV: Support for running secure guests")
> Reported-by: Greg Kurz 
> Signed-off-by: Fabiano Rosas 

Good catch!

This should be Cc: sta...@vger.kernel.org # v5.5+

Acked-by: Paul Mackerras 

Paul.


Re: [PATCH 0/3] KVM: PPC: Fix host kernel crash with PR KVM

2020-03-19 Thread Paul Mackerras
On Wed, Mar 18, 2020 at 06:43:24PM +0100, Greg Kurz wrote:
> A recent cleanup from Sean Christopherson introduced a use-after-free
> condition that crashes the kernel when shutting down the VM with
> PR KVM. It went unnoticed so far because PR isn't tested/used much
> these days (mostly used for nested on POWER8, not supported on POWER9
> where HV should be used for nested), and other KVM implementations for
> ppc are unaffected.
> 
> This all boils down to the fact that the path that frees the per-vCPU
> MMU data goes through a complex set of indirections. This obfuscates
> the code to the point that we didn't realize that the MMU data was
> now being freed too early. And worse, most of the indirection isn't
> needed because only PR KVM has some MMU data to free when the vCPU is
> destroyed.
> 
> Fix the issue (patch 1) and simplify the code (patch 2 and 3).

I have put this series in my kvm-ppc-next branch, and I believe
Michael Ellerman is putting patch 1 in his fixes branch so it gets
into 5.6.

Thanks,
Paul.


Re: [PATCH] KVM: PPC: Book3S HV: Use RADIX_PTE_INDEX_SIZE in Radix MMU code

2020-03-19 Thread Paul Mackerras
On Tue, Feb 18, 2020 at 03:36:50PM +1100, Michael Ellerman wrote:
> In kvmppc_unmap_free_pte() in book3s_64_mmu_radix.c, we use the
> non-constant value PTE_INDEX_SIZE to clear a PTE page.
> 
> We can instead use the constant RADIX_PTE_INDEX_SIZE, because we know
> this code will only be running when the Radix MMU is active.
> 
> Note that we already use RADIX_PTE_INDEX_SIZE for the allocation of
> kvm_pte_cache.
> 
> Signed-off-by: Michael Ellerman 
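
That is, the clear in kvmppc_unmap_free_pte() becomes (sketch):

	/* RADIX_PTE_INDEX_SIZE is a compile-time constant */
	memset(pte, 0, sizeof(long) << RADIX_PTE_INDEX_SIZE);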

Thanks, applied to my kvm-ppc-next branch.

Paul.


Re: [PATCH -next 016/491] KERNEL VIRTUAL MACHINE FOR POWERPC (KVM/powerpc): Use fallthrough;

2020-03-19 Thread Paul Mackerras
On Wed, Mar 18, 2020 at 06:22:29PM -0700, Joe Perches wrote:
> On Thu, 2020-03-19 at 12:18 +1100, Paul Mackerras wrote:
> > On Tue, Mar 10, 2020 at 09:51:30PM -0700, Joe Perches wrote:
> > > Convert the various uses of fallthrough comments to fallthrough;
> > > 
> > > Done via script
> > > Link: 
> > > https://lore.kernel.org/lkml/b56602fcf79f849e733e7b521bb0e17895d390fa.1582230379.git.joe.com/
> > > 
> > > Signed-off-by: Joe Perches 
> > 
> > The subject line should look like "KVM: PPC: Use fallthrough".
> 
> There's no way to generate a subject line like that via a script
> so far as I can tell.
> 
> > Apart from that,
> > 
> > Acked-by: Paul Mackerras 
> > 
> > How are these patches going upstream?  Do you want me to take this via
> > my tree?
> 
> If you want.
> 
> Ideally, these changes would go in treewide via a script run
> by Linus at an -rc1, but if the change is OK with you, it'd
> be fine to have you apply it now.

I have taken this patch in my kvm-ppc-next branch.

Thanks,
Paul.


Re: [PATCH] KVM: PPC: Book3S HV: Fix H_CEDE return code for nested guests

2020-03-19 Thread Paul Mackerras
On Tue, Mar 10, 2020 at 04:11:28PM -0500, Michael Roth wrote:
> The h_cede_tm kvm-unit-test currently fails when run inside an L1 guest
> via the guest/nested hypervisor.
> 
>   ./run-tests.sh -v
>   ...
>   TESTNAME=h_cede_tm TIMEOUT=90s ACCEL= ./powerpc/run powerpc/tm.elf -smp 
> 2,threads=2 -machine cap-htm=on -append "h_cede_tm"
>   FAIL h_cede_tm (2 tests, 1 unexpected failures)
> 
> While the test relates to transactional memory instructions, the actual
> failure is due to the return code of the H_CEDE hypercall, which is
> reported as 224 instead of 0. This happens even when no TM instructions
> are issued.
> 
> 224 is the value placed in r3 to execute a hypercall for H_CEDE, and r3
> is where the caller expects the return code to be placed upon return.
> 
> In the case of a guest running under a nested hypervisor, issuing H_CEDE
> causes a return from H_ENTER_NESTED. In this case H_CEDE is
> specially-handled immediately rather than later in
> kvmppc_pseries_do_hcall() as with most other hcalls, but we forget to
> set the return code for the caller, which is why kvm-unit-tests sees the
> 224 return code and reports an error.
> 
> Guest kernels generally don't check the return value of H_CEDE, so
> that likely explains why this hasn't caused issues outside of
> kvm-unit-tests so far.
> 
> Fix this by setting r3 to 0 after we finish processing the H_CEDE.
> 
> RHBZ: 1778556
> 
> Fixes: 4bad77799fed ("KVM: PPC: Book3S HV: Handle hypercalls correctly when 
> nested")
> Cc: linuxppc-...@ozlabs.org
> Cc: David Gibson 
> Cc: Paul Mackerras 
> Signed-off-by: Michael Roth 
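
For reference, the fix boils down to this, in the path that handles
H_CEDE straight after the return from H_ENTER_NESTED (sketch, names
approximate):

	if (trap == BOOK3S_INTERRUPT_SYSCALL &&
	    kvmppc_get_gpr(vcpu, 3) == H_CEDE) {
		kvmppc_cede(vcpu);
		kvmppc_set_gpr(vcpu, 3, 0);	/* H_SUCCESS for the caller */
		trap = 0;			/* stay in the guest loop */
	}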

Thanks, applied to my kvm-ppc-next branch.

Paul.


Re: [PATCH v3] KVM: PPC: Book3S HV: Treat TM-related invalid form instructions on P9 like the valid ones

2020-03-19 Thread Paul Mackerras
On Fri, Feb 21, 2020 at 11:29:50AM -0500, Gustavo Romero wrote:
> On P9 DD2.2 due to a CPU defect some TM instructions need to be emulated by
> KVM. This is handled at first by the hardware raising a softpatch interrupt
> when certain TM instructions that need KVM assistance are executed in the
> guest. Although some TM instructions are invalid forms per the Power ISA,
> they can raise a softpatch interrupt too. For instance, the 'tresume.' instruction
> as defined in the ISA must have bit 31 set (1), but an instruction that
> matches 'tresume.' PO and XO opcode fields but has bit 31 not set (0), like
> 0x7cfe9ddc, also raises a softpatch interrupt. Similarly for 'treclaim.'
> and 'trechkpt.' instructions with bit 31 = 0, i.e. 0x7c00075c and
> 0x7c0007dc, respectively. Hence, if a code like the following is executed
> in the guest it will raise a softpatch interrupt just like a 'tresume.'
> when the TM facility is enabled ('tabort. 0' in the example is used only
> to enable the TM facility):
> 
> int main() { asm("tabort. 0; .long 0x7cfe9ddc;"); }
> 
> Currently in such a case KVM throws a complete trace like:
> 
> [345523.705984] WARNING: CPU: 24 PID: 64413 at 
> arch/powerpc/kvm/book3s_hv_tm.c:211 kvmhv_p9_tm_emulation+0x68/0x620 [kvm_hv]
> [345523.705985] Modules linked in: kvm_hv(E) xt_conntrack ipt_REJECT 
> nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat
> iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 
> ebtable_filter ebtables ip6table_filter
> ip6_tables iptable_filter bridge stp llc sch_fq_codel ipmi_powernv at24 
> vmx_crypto ipmi_devintf ipmi_msghandler
> ibmpowernv uio_pdrv_genirq kvm opal_prd uio leds_powernv ib_iser rdma_cm 
> iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp
> libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs 
> blake2b_generic zstd_compress raid10 raid456
> async_raid6_recov async_memcpy async_pq async_xor async_tx libcrc32c xor 
> raid6_pq raid1 raid0 multipath linear tg3
> crct10dif_vpmsum crc32c_vpmsum ipr [last unloaded: kvm_hv]
> [345523.706030] CPU: 24 PID: 64413 Comm: CPU 0/KVM Tainted: GW   E
>  5.5.0+ #1
> [345523.706031] NIP:  c008072cb9c0 LR: c008072b5e80 CTR: 
> c008085c7850
> [345523.706034] REGS: c00399467680 TRAP: 0700   Tainted: GW   E   
>(5.5.0+)
> [345523.706034] MSR:  90010282b033 
>   CR: 24022428  XER: 
> [345523.706042] CFAR: c008072b5e7c IRQMASK: 0
> GPR00: c008072b5e80 c00399467910 c008072db500 
> c00375ccc720
> GPR04: c00375ccc720 0003fbec a10395dda5a6 
> 
> GPR08: 7cfe9ddc 7cfe9ddc05dc 7cfe9ddc7c0005dc 
> c008072cd530
> GPR12: c008085c7850 c003fffeb800 0001 
> 7dfb737f
> GPR16: c0002001edcca558   
> 0001
> GPR20: c1b21258 c0002001edcca558 0018 
> 
> GPR24: 0100  0001 
> 1500
> GPR28: c0002001edcc4278 c0037dd8 80050280f033 
> c00375ccc720
> [345523.706062] NIP [c008072cb9c0] kvmhv_p9_tm_emulation+0x68/0x620 
> [kvm_hv]
> [345523.706065] LR [c008072b5e80] 
> kvmppc_handle_exit_hv.isra.53+0x3e8/0x798 [kvm_hv]
> [345523.706066] Call Trace:
> [345523.706069] [c00399467910] [c00399467940] 0xc00399467940 
> (unreliable)
> [345523.706071] [c00399467950] [c00399467980] 0xc00399467980
> [345523.706075] [c003994679f0] [c008072bd1c4] 
> kvmhv_run_single_vcpu+0xa1c/0xb80 [kvm_hv]
> [345523.706079] [c00399467ac0] [c008072bd8e0] 
> kvmppc_vcpu_run_hv+0x5b8/0xb00 [kvm_hv]
> [345523.706087] [c00399467b90] [c008085c93cc] 
> kvmppc_vcpu_run+0x34/0x48 [kvm]
> [345523.706095] [c00399467bb0] [c008085c582c] 
> kvm_arch_vcpu_ioctl_run+0x244/0x420 [kvm]
> [345523.706101] [c00399467c40] [c008085b7498] 
> kvm_vcpu_ioctl+0x3d0/0x7b0 [kvm]
> [345523.706105] [c00399467db0] [c04adf9c] ksys_ioctl+0x13c/0x170
> [345523.706107] [c00399467e00] [c04adff8] sys_ioctl+0x28/0x80
> [345523.706111] [c00399467e20] [c000b278] system_call+0x5c/0x68
> [345523.706112] Instruction dump:
> [345523.706114] 419e0390 7f8a4840 409d0048 6d497c00 2f89075d 419e021c 
> 6d497c00 2f8907dd
> [345523.706119] 419e01c0 6d497c00 2f8905dd 419e00a4 <0fe0> 38210040 
> 3860 ebc1fff0
> 
> and then treats the executed instruction as a 'nop'.
> 
> However the POWER9 User's Manual, in section "4.6.10 Book II Invalid
> Forms", informs that for TM instructions bit 31 is in fact ignored, thus
> for the TM-related invalid forms ignoring bit 31 and handling them like the
> valid forms is an acceptable way to handle them. POWER8 behaves the same
> way too.
> 
> This commit changes the handling of the cases here described by treating
> the TM-related invalid 

Re: [PATCH -next 016/491] KERNEL VIRTUAL MACHINE FOR POWERPC (KVM/powerpc): Use fallthrough;

2020-03-18 Thread Paul Mackerras
On Tue, Mar 10, 2020 at 09:51:30PM -0700, Joe Perches wrote:
> Convert the various uses of fallthrough comments to fallthrough;
> 
> Done via script
> Link: 
> https://lore.kernel.org/lkml/b56602fcf79f849e733e7b521bb0e17895d390fa.1582230379.git.joe.com/
> 
> Signed-off-by: Joe Perches 

The subject line should look like "KVM: PPC: Use fallthrough".

Apart from that,

Acked-by: Paul Mackerras 

How are these patches going upstream?  Do you want me to take this via
my tree?

Paul.


Re: [PATCH FIX] KVM: PPC: Book3S HV: Release lock on page-out failure path

2020-01-29 Thread Paul Mackerras
On Wed, Jan 22, 2020 at 10:25:42AM +0530, Bharata B Rao wrote:
> When migrate_vma_setup() fails in kvmppc_svm_page_out(),
> release kvm->arch.uvmem_lock before returning.
> 
> Fixes: ca9f4942670 ("KVM: PPC: Book3S HV: Support for running secure guests")
> Signed-off-by: Bharata B Rao 

Thanks, applied to my kvm-ppc-next branch.

Paul.


Re: [PATCH v3] powerpc/mm: Remove kvm radix prefetch workaround for Power9 DD2.2

2020-01-16 Thread Paul Mackerras
On Fri, Dec 06, 2019 at 02:17:22PM +1100, Jordan Niethe wrote:
> Commit a25bd72badfa ("powerpc/mm/radix: Workaround prefetch issue with
> KVM") introduced a number of workarounds as coming out of a guest with
> the mmu enabled would make the cpu would start running in hypervisor
> state with the PID value from the guest. The cpu will then start
> prefetching for the hypervisor with that PID value.
> 
> In Power9 DD2.2 the cpu behaviour was modified to fix this. When
> accessing Quadrant 0 in hypervisor mode with LPID != 0 prefetching will
> not be performed. This means that we can get rid of the workarounds for
> Power9 DD2.2 and later revisions. Add a new cpu feature
> CPU_FTR_P9_RADIX_PREFETCH_BUG to indicate if the workarounds are needed.
> 
> Signed-off-by: Jordan Niethe 

Acked-by: Paul Mackerras 


Re: [PATCH v3 2/2] powerpc/pseries/svm: Disable BHRB/EBB/PMU access

2020-01-12 Thread Paul Mackerras
On Thu, Jan 09, 2020 at 09:19:57PM -0800, Sukadev Bhattiprolu wrote:
> Ultravisor disables some CPU features like BHRB, EBB and PMU in
> secure virtual machines (SVMs). Skip accessing those registers
> in SVMs to avoid getting a Program Interrupt.

It would be useful to have more explanation of the rationale for the
ultravisor disabling access to those features, and indicate whether
this is a temporary restriction or a permanent one.  If SVMs are never
going to be able to use the PMU then that is a bad thing in my
opinion.  In other words, the commit message should tell us whether
the restriction is just because the ultravisor doesn't yet have code
for managing and context-switching the PMU, or if there is some
reason why using the PMU in an SVM will always be prohibited for some
security-related reason.

Also, the only way that a SVM would be getting into the KVM code that
you are patching is if it is trying to do nested virtualization.
However, the SVM should already know that it is not able to do nested
virtualization because the ultravisor should be intercepting and
failing the H_SET_PARTITION_TABLE hypercall.  So I think there is no
good reason for patching the KVM code like you are doing unless the
PMU restriction is permanent and we are intending someday to enable
SVMs to have nested guests.

Paul.


Re: [PATCH 2/3] powerpc sstep: add support for divde[.] and divdeu[.] instructions

2020-01-08 Thread Paul Mackerras
On Tue, Dec 10, 2019 at 12:49:03PM +0530, Balamuruhan S wrote:
> This patch adds emulation support for divde, divdeu instructions,
>   * Divide Doubleword Extended (divde[.])
>   * Divide Doubleword Extended Unsigned (divdeu[.])
> 
> Signed-off-by: Balamuruhan S 
> ---
>  arch/powerpc/lib/sstep.c | 27 ++-
>  1 file changed, 26 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
> index c077acb983a1..4b4119729e59 100644
> --- a/arch/powerpc/lib/sstep.c
> +++ b/arch/powerpc/lib/sstep.c
> @@ -1736,7 +1736,32 @@ int analyse_instr(struct instruction_op *op, const 
> struct pt_regs *regs,
>   op->val = (int) regs->gpr[ra] /
>   (int) regs->gpr[rb];
>   goto arith_done;
> -
> +#ifdef __powerpc64__
> + case 425:   /* divde[.] */
> + if (instr & 1) {
> + asm volatile(PPC_DIVDE_DOT(%0, %1, %2) :
> + "=r" (op->val) : "r" (regs->gpr[ra]),
> + "r" (regs->gpr[rb]));
> + set_cr0(regs, op);

This seems unnecessarily complicated.  You take the trouble to do a
"divde." instruction rather than a "divde" instruction but then don't
use the CR0 setting that the instruction did, but instead go and work
out what happens to CR0 manually in set_cr0().  Also you don't tell
the compiler that CR0 has been modified, which could lead to problems.

This case could be done much more simply like this:



	case 425:	/* divde[.] */
		asm volatile(PPC_DIVDE(%0, %1, %2) :
			     "=r" (op->val) : "r" (regs->gpr[ra]),
			     "r" (regs->gpr[rb]));
		goto arith_done;

(note, goto arith_done rather than compute_done) and similarly for the
divdeu case.
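
For completeness, the divdeu case would be the analogous sketch below; the
extended-opcode value (393) and a PPC_DIVDEU macro parallel to PPC_DIVDE are
assumptions here, not taken from the patch:

	case 393:	/* divdeu[.] */
		asm volatile(PPC_DIVDEU(%0, %1, %2) :
			     "=r" (op->val) : "r" (regs->gpr[ra]),
			     "r" (regs->gpr[rb]));
		goto arith_done;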

Paul.


Re: [PATCH V3 2/2] KVM: PPC: Implement H_SVM_INIT_ABORT hcall

2019-12-17 Thread Paul Mackerras
On Sat, Dec 14, 2019 at 06:12:08PM -0800, Sukadev Bhattiprolu wrote:
> 
> Implement the H_SVM_INIT_ABORT hcall which the Ultravisor can use to
> abort an SVM after it has issued the H_SVM_INIT_START and before the
> H_SVM_INIT_DONE hcalls. This hcall could be used when Ultravisor
> encounters security violations or other errors when starting an SVM.
> 
> Note that this hcall is different from UV_SVM_TERMINATE ucall which
> is used by HV to terminate/cleanup a VM that has become secure.
> 
> The H_SVM_INIT_ABORT should basically undo operations that were done
> since the H_SVM_INIT_START hcall - i.e page-out all the VM pages back
> to normal memory, and terminate the SVM.
> 
> (If we do not bring the pages back to normal memory, the text/data
> of the VM would be stuck in secure memory and since the SVM did not
> go secure, its MSR_S bit will be clear and the VM won't be able to
> access its pages even to do a clean exit).
> 
> Based on patches and discussion with Paul Mackerras, Ram Pai and
> Bharata Rao.
> 
> Signed-off-by: Ram Pai 
> Signed-off-by: Sukadev Bhattiprolu 
> Signed-off-by: Bharata B Rao 

Minor comment below, but not a showstopper.  Also, as Bharata noted
you need to hold the srcu lock for reading.

> + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> + struct kvm_memory_slot *memslot;
> + struct kvm_memslots *slots = __kvm_memslots(kvm, i);
> +
> + if (!slots)
> + continue;
> +
> + kvm_for_each_memslot(memslot, slots)
> + kvmppc_uvmem_drop_pages(memslot, kvm, false);
> + }

Since we use the default KVM_ADDRESS_SPACE_NUM, which is 1, this code
isn't wrong but it is more verbose than it needs to be.  It could be

	kvm_for_each_memslot(memslot, kvm_memslots(kvm))
		kvmppc_uvmem_drop_pages(memslot, kvm, false);
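
Combined with the srcu requirement noted above, a sketch of the locked form:

	srcu_idx = srcu_read_lock(&kvm->srcu);
	kvm_for_each_memslot(memslot, kvm_memslots(kvm))
		kvmppc_uvmem_drop_pages(memslot, kvm, false);
	srcu_read_unlock(&kvm->srcu, srcu_idx);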

Paul.


Re: [PATCH V3 1/2] KVM: PPC: Add skip_page_out parameter

2019-12-17 Thread Paul Mackerras
On Sat, Dec 14, 2019 at 06:11:04PM -0800, Sukadev Bhattiprolu wrote:
> 
> This patch is based on Bharata's v11 KVM patches for secure guests:
> https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-November/200918.html
> ---
> 
> From: Sukadev Bhattiprolu 
> Date: Fri, 13 Dec 2019 15:06:16 -0600
> Subject: [PATCH V3 1/2] KVM: PPC: Add skip_page_out parameter
> 
> Add 'skip_page_out' parameter to kvmppc_uvmem_drop_pages() which will
> be needed in a follow-on patch that implements H_SVM_INIT_ABORT hcall.
> 
> Signed-off-by: Sukadev Bhattiprolu 

Reviewed-by: Paul Mackerras 


Re: [PATCH 1/1] powerpc/kvm/book3s: Fixes possible 'use after release' of kvm

2019-11-27 Thread Paul Mackerras
On Tue, Nov 26, 2019 at 02:52:12PM -0300, Leonardo Bras wrote:
> Fixes a possible 'use after free' of kvm variable.
> It does use mutex_unlock(&kvm->lock) after possibly freeing the variable
> with kvm_put_kvm(kvm).

Comments below...

> diff --git a/arch/powerpc/kvm/book3s_64_vio.c 
> b/arch/powerpc/kvm/book3s_64_vio.c
> index 5834db0a54c6..a402ead833b6 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -316,14 +316,13 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  
>   if (ret >= 0)
>   list_add_rcu(&stt->list, &kvm->arch.spapr_tce_tables);
> - else
> - kvm_put_kvm(kvm);
>  
>   mutex_unlock(&kvm->lock);
>  
>   if (ret >= 0)
>   return ret;
>  
> + kvm_put_kvm(kvm);

There isn't a potential use-after-free here.  We are relying on the
property that the release function (kvm_vm_release) cannot be called
in parallel with this function.  The reason is that this function
(kvm_vm_ioctl_create_spapr_tce) is handling an ioctl on a kvm VM file
descriptor.  That means that a userspace process has the file
descriptor still open.  The code that implements the close() system
call makes sure that no thread is still executing inside any system
call that is using the same file descriptor before calling the file
descriptor's release function (in this case, kvm_vm_release).  That
means that this kvm_put_kvm() call here cannot make the reference
count go to zero.

>   kfree(stt);
>   fail_acct:
>   account_locked_vm(current->mm, kvmppc_stt_pages(npages), false);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 13efc291b1c7..f37089b60d09 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2744,10 +2744,8 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, 
> u32 id)
>   /* Now it's all set up, let userspace reach it */
>   kvm_get_kvm(kvm);
>   r = create_vcpu_fd(vcpu);
> - if (r < 0) {
> - kvm_put_kvm(kvm);
> + if (r < 0)
>   goto unlock_vcpu_destroy;
> - }
>  
>   kvm->vcpus[atomic_read(&kvm->online_vcpus)] = vcpu;
>  
> @@ -2771,6 +2769,8 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, 
> u32 id)
>   mutex_lock(&kvm->lock);
>   kvm->created_vcpus--;
>   mutex_unlock(&kvm->lock);
> + if (r < 0)
> + kvm_put_kvm(kvm);
>   return r;
>  }

Once again we are inside an ioctl on the kvm VM file descriptor, so
the reference count cannot go to zero.

> @@ -3183,10 +3183,10 @@ static int kvm_ioctl_create_device(struct kvm *kvm,
>   kvm_get_kvm(kvm);
>   ret = anon_inode_getfd(ops->name, &kvm_device_fops, dev, O_RDWR | 
> O_CLOEXEC);
>   if (ret < 0) {
> - kvm_put_kvm(kvm);
>   mutex_lock(&kvm->lock);
>   list_del(&dev->vm_node);
>   mutex_unlock(&kvm->lock);
> + kvm_put_kvm(kvm);
>   ops->destroy(dev);
>   return ret;
>   }

Same again here.

Paul.


Re: [PATCH v10 7/8] KVM: PPC: Implement H_SVM_INIT_ABORT hcall

2019-11-13 Thread Paul Mackerras
On Wed, Nov 13, 2019 at 01:50:42PM -0800, Ram Pai wrote:
> On Thu, Nov 14, 2019 at 08:18:24AM +1100, Paul Mackerras wrote:
> > On Tue, Nov 12, 2019 at 10:32:33PM -0800, Ram Pai wrote:
> > > On Wed, Nov 13, 2019 at 11:14:27AM +1100, Paul Mackerras wrote:
> > > > On Tue, Nov 12, 2019 at 06:45:55AM -0800, Ram Pai wrote:
> > > > > On Tue, Nov 12, 2019 at 10:32:04PM +1100, Paul Mackerras wrote:
> > > > > > On Mon, Nov 11, 2019 at 11:52:15PM -0800, Ram Pai wrote:
> > > > > > > There is subtle problem removing that code from the assembly.
> > > > > > > 
> > > > > > > If the H_SVM_INIT_ABORT hcall returns to the ultravisor without 
> > > > > > > clearing
> > > > > > > kvm->arch.secure_guest, the hypervisor will continue to think 
> > > > > > > that the
> > > > > > > VM is a secure VM.   However the primary reason the 
> > > > > > > H_SVM_INIT_ABORT
> > > > > > > hcall was invoked, was to inform the Hypervisor that it should no 
> > > > > > > longer
> > > > > > > consider the VM as a Secure VM. So there is an inconsistency there.
> > > > > > 
> > > > > > Most of the checks that look at whether a VM is a secure VM use code
> > > > > > like "if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE)".  Now
> > > > > > since KVMPPC_SECURE_INIT_ABORT is 4, an if statement such as that 
> > > > > > will
> > > > > > take the false branch once we have set kvm->arch.secure_guest to
> > > > > > KVMPPC_SECURE_INIT_ABORT in kvmppc_h_svm_init_abort.  So in fact in
> > > > > > most places we will treat the VM as a normal VM from then on.  If
> > > > > > there are any places where we still need to treat the VM as a secure
> > > > > > VM while we are processing the abort we can easily do that too.
> > > > > 
> > > > > Is the suggestion --  KVMPPC_SECURE_INIT_ABORT should never return 
> > > > > back
> > > > > to the Ultravisor?   Because removing that assembly code will NOT 
> > > > > lead the
> > > > 
> > > > No.  The suggestion is that vcpu->arch.secure_guest stays set to
> > > > KVMPPC_SECURE_INIT_ABORT until userspace calls KVM_PPC_SVM_OFF.
> > > 
> > > In the fast_guest_return path, if it finds 
> > > (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_ABORT) is true, should it 
> > > return to
> > > UV or not?
> > > 
> > > Ideally it should return back to the ultravisor the first time
> > > KVMPPC_SECURE_INIT_ABORT is set, and not from then onwards.
> > 
> > What is ideal about that behavior?  Why would that be a particularly
> > good thing to do?
> 
> It is following the rule -- "return back to the caller".

That doesn't address the question of why vcpu->arch.secure_guest
should be cleared at the point where we do that.

Paul.


Re: [PATCH v10 6/8] KVM: PPC: Support reset of secure guest

2019-11-13 Thread Paul Mackerras
On Wed, Nov 13, 2019 at 08:59:08PM +0530, Bharata B Rao wrote:
> On Tue, Nov 12, 2019 at 04:34:34PM +1100, Paul Mackerras wrote:
> > On Mon, Nov 04, 2019 at 09:47:58AM +0530, Bharata B Rao wrote:
> > [snip]
> > > @@ -5442,6 +5471,64 @@ static int kvmhv_store_to_eaddr(struct kvm_vcpu 
> > > *vcpu, ulong *eaddr, void *ptr,
> > >   return rc;
> > >  }
> > >  
> > > +/*
> > > + *  IOCTL handler to turn off secure mode of guest
> > > + *
> > > + * - Issue ucall to terminate the guest on the UV side
> > > + * - Unpin the VPA pages (Enables these pages to be migrated back
> > > + *   when VM becomes secure again)
> > > + * - Recreate partition table as the guest is transitioning back to
> > > + *   normal mode
> > > + * - Release all device pages
> > > + */
> > > +static int kvmhv_svm_off(struct kvm *kvm)
> > > +{
> > > + struct kvm_vcpu *vcpu;
> > > + int srcu_idx;
> > > + int ret = 0;
> > > + int i;
> > > +
> > > + if (!(kvm->arch.secure_guest & KVMPPC_SECURE_INIT_START))
> > > + return ret;
> > > +
> > 
> > A further comment on this code: it should check that no vcpus are
> > running and fail if any are running, and it should prevent any vcpus
> > from running until the function is finished, using code like that in
> > kvmhv_configure_mmu().  That is, it should do something like this:
> > 
> > mutex_lock(&kvm->arch.mmu_setup_lock);
> > mmu_was_ready = kvm->arch.mmu_ready;
> > if (kvm->arch.mmu_ready) {
> > kvm->arch.mmu_ready = 0;
> > /* order mmu_ready vs. vcpus_running */
> > smp_mb();
> > if (atomic_read(&kvm->arch.vcpus_running)) {
> > kvm->arch.mmu_ready = 1;
> > ret = -EBUSY;
> > goto out_unlock;
> > }
> > }
> > 
> > and then after clearing kvm->arch.secure_guest below:
> > 
> > > + srcu_idx = srcu_read_lock(&kvm->srcu);
> > > + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > > + struct kvm_memory_slot *memslot;
> > > + struct kvm_memslots *slots = __kvm_memslots(kvm, i);
> > > +
> > > + if (!slots)
> > > + continue;
> > > +
> > > + kvm_for_each_memslot(memslot, slots) {
> > > + kvmppc_uvmem_drop_pages(memslot, kvm, true);
> > > + uv_unregister_mem_slot(kvm->arch.lpid, memslot->id);
> > > + }
> > > + }
> > > + srcu_read_unlock(&kvm->srcu, srcu_idx);
> > > +
> > > + ret = uv_svm_terminate(kvm->arch.lpid);
> > > + if (ret != U_SUCCESS) {
> > > + ret = -EINVAL;
> > > + goto out;
> > > + }
> > > +
> > > + kvm_for_each_vcpu(i, vcpu, kvm) {
> > > + spin_lock(&vcpu->arch.vpa_update_lock);
> > > + unpin_vpa_reset(kvm, &vcpu->arch.dtl);
> > > + unpin_vpa_reset(kvm, &vcpu->arch.slb_shadow);
> > > + unpin_vpa_reset(kvm, &vcpu->arch.vpa);
> > > + spin_unlock(&vcpu->arch.vpa_update_lock);
> > > + }
> > > +
> > > + ret = kvmppc_reinit_partition_table(kvm);
> > > + if (ret)
> > > + goto out;
> > > +
> > > + kvm->arch.secure_guest = 0;
> > 
> > you need to do:
> > 
> > kvm->arch.mmu_ready = mmu_was_ready;
> >  out_unlock:
> > mutex_unlock(&kvm->arch.mmu_setup_lock);
> > 
> > > +out:
> > > + return ret;
> > > +}
> > > +
> > 
> > With that extra check in place, it should be safe to unpin the vpas if
> > there is a good reason to do so.  ("Userspace has some bug that we
> > haven't found" isn't a good reason to do so.)
> 
> QEMU indeed does set_one_reg to reset the VPAs but that only marks
> the VPA update as pending. The actual unpinning happens when vcpu
> gets to run after reset at which time the VPAs are updated after
> any unpinning (if required)
> 
> When secure guest reboots, vcpu 0 gets to run and does unpin its
> VPA pages and then proceeds with switching to secure. Here UV
> tries to page-in all the guest pages, including the still pinned
> VPA pages corresponding to other vcpus which haven't had a chance
> to run till now. They are all still pinned and hence page-in fails.
> 
> To prevent this, we have to explicitly unpin the VPA pages during
> this svm off ioctl. This will ensure that SMP secure guest is able
> to reboot correctly.

OK, that makes sense.  Please put a comment in the code explaining
this briefly.

> So I will incorporate the code chunk you have shown above to fail
> if any vcpu is running and prevent any vcpu from running when
> we unpin VPAs from this ioctl.

Sounds good.

Paul.


Re: [PATCH v10 7/8] KVM: PPC: Implement H_SVM_INIT_ABORT hcall

2019-11-13 Thread Paul Mackerras
On Tue, Nov 12, 2019 at 10:32:33PM -0800, Ram Pai wrote:
> On Wed, Nov 13, 2019 at 11:14:27AM +1100, Paul Mackerras wrote:
> > On Tue, Nov 12, 2019 at 06:45:55AM -0800, Ram Pai wrote:
> > > On Tue, Nov 12, 2019 at 10:32:04PM +1100, Paul Mackerras wrote:
> > > > On Mon, Nov 11, 2019 at 11:52:15PM -0800, Ram Pai wrote:
> > > > > There is subtle problem removing that code from the assembly.
> > > > > 
> > > > > If the H_SVM_INIT_ABORT hcall returns to the ultravisor without 
> > > > > clearing
> > > > > kvm->arch.secure_guest, the hypervisor will continue to think that the
> > > > > VM is a secure VM.   However the primary reason the H_SVM_INIT_ABORT
> > > > > hcall was invoked, was to inform the Hypervisor that it should no 
> > > > > longer
> > > > > consider the VM as a Secure VM. So there is an inconsistency there.
> > > > 
> > > > Most of the checks that look at whether a VM is a secure VM use code
> > > > like "if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE)".  Now
> > > > since KVMPPC_SECURE_INIT_ABORT is 4, an if statement such as that will
> > > > take the false branch once we have set kvm->arch.secure_guest to
> > > > KVMPPC_SECURE_INIT_ABORT in kvmppc_h_svm_init_abort.  So in fact in
> > > > most places we will treat the VM as a normal VM from then on.  If
> > > > there are any places where we still need to treat the VM as a secure
> > > > VM while we are processing the abort we can easily do that too.
> > > 
> > > Is the suggestion --  KVMPPC_SECURE_INIT_ABORT should never return back
> > > to the Ultravisor?   Because removing that assembly code will NOT lead the
> > 
> > No.  The suggestion is that vcpu->arch.secure_guest stays set to
> > KVMPPC_SECURE_INIT_ABORT until userspace calls KVM_PPC_SVM_OFF.
> 
> In the fast_guest_return path, if it finds 
> (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_ABORT) is true, should it return 
> to
> UV or not?
> 
> Ideally it should return back to the ultravisor the first time
> KVMPPC_SECURE_INIT_ABORT is set, and not from then onwards.

What is ideal about that behavior?  Why would that be a particularly
good thing to do?

Paul.


Re: [PATCH v10 7/8] KVM: PPC: Implement H_SVM_INIT_ABORT hcall

2019-11-12 Thread Paul Mackerras
On Tue, Nov 12, 2019 at 06:45:55AM -0800, Ram Pai wrote:
> On Tue, Nov 12, 2019 at 10:32:04PM +1100, Paul Mackerras wrote:
> > On Mon, Nov 11, 2019 at 11:52:15PM -0800, Ram Pai wrote:
> > > There is subtle problem removing that code from the assembly.
> > > 
> > > If the H_SVM_INIT_ABORT hcall returns to the ultravisor without clearing
> > > kvm->arch.secure_guest, the hypervisor will continue to think that the
> > > VM is a secure VM.   However the primary reason the H_SVM_INIT_ABORT
> > > hcall was invoked, was to inform the Hypervisor that it should no longer
> > > consider the VM as a Secure VM. So there is an inconsistency there.
> > 
> > Most of the checks that look at whether a VM is a secure VM use code
> > like "if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE)".  Now
> > since KVMPPC_SECURE_INIT_ABORT is 4, an if statement such as that will
> > take the false branch once we have set kvm->arch.secure_guest to
> > KVMPPC_SECURE_INIT_ABORT in kvmppc_h_svm_init_abort.  So in fact in
> > most places we will treat the VM as a normal VM from then on.  If
> > there are any places where we still need to treat the VM as a secure
> > VM while we are processing the abort we can easily do that too.
> 
> Is the suggestion --  KVMPPC_SECURE_INIT_ABORT should never return back
> to the Ultravisor?   Because removing that assembly code will NOT lead the

No.  The suggestion is that vcpu->arch.secure_guest stays set to
KVMPPC_SECURE_INIT_ABORT until userspace calls KVM_PPC_SVM_OFF.

> Hypervisor back into the Ultravisor.  This is fine with the
> ultravisor. But then the hypervisor will not know where to return to.
> If it wants to return directly to the VM, it won't know to
> which address. It will be in limbo.
> 
> > 
> > > This is fine, as long as the VM does not invoke any hcall or does not
> > > receive any hypervisor-exceptions.  The moment either of those happen,
> > > the control goes into the hypervisor, the hypervisor services
> > > the exception/hcall and while returning, it will see that the
> > > kvm->arch.secure_guest flag is set and **incorrectly** return
> > > to the ultravisor through a UV_RETURN ucall.  Ultravisor will
> > > not know what to do with it, because it does not consider that
> > > VM as a Secure VM.  Bad things happen.
> > 
> > If bad things happen in the ultravisor because the hypervisor did
> > something it shouldn't, then it's game over, you just lost, thanks for
> > playing.  The ultravisor MUST be able to cope with bogus UV_RETURN
> > calls for a VM that it doesn't consider to be a secure VM.  You need
> > to work out how to handle such calls safely and implement that in the
> > ultravisor.
> 
> Actually we do handle this gracefully in the ultravisor :). 
> We just return back to the hypervisor saying "sorry, don't know what
> to do with it, please handle it yourself".
> 
> However hypervisor would not know what to do with that return, and bad
> things happen in the hypervisor.

Right.  We need something after the "sc 2" to handle the case where
the ultravisor returns with an error from the UV_RETURN.

Paul.


Re: [PATCH v10 7/8] KVM: PPC: Implement H_SVM_INIT_ABORT hcall

2019-11-12 Thread Paul Mackerras
On Mon, Nov 11, 2019 at 11:52:15PM -0800, Ram Pai wrote:
> On Tue, Nov 12, 2019 at 04:38:36PM +1100, Paul Mackerras wrote:
> > On Mon, Nov 11, 2019 at 05:01:58PM -0800, Ram Pai wrote:
> > > On Mon, Nov 11, 2019 at 03:19:24PM +1100, Paul Mackerras wrote:
> > > > On Mon, Nov 04, 2019 at 09:47:59AM +0530, Bharata B Rao wrote:
> > > > > From: Sukadev Bhattiprolu 
> > > > > 
> > > > > Implement the H_SVM_INIT_ABORT hcall which the Ultravisor can use to
> > > > > abort an SVM after it has issued the H_SVM_INIT_START and before the
> > > > > H_SVM_INIT_DONE hcalls. This hcall could be used when Ultravisor
> > > > > encounters security violations or other errors when starting an SVM.
> > > > > 
> > > > > Note that this hcall is different from UV_SVM_TERMINATE ucall which
> > > > > is used by HV to terminate/cleanup an SVM.
> > > > > 
> > > > > In case of H_SVM_INIT_ABORT, we should page-out all the pages back to
> > > > > HV (i.e., we should not skip the page-out). Otherwise the VM's pages,
> > > > > possibly including its text/data would be stuck in secure memory.
> > > > > Since the SVM did not go secure, its MSR_S bit will be clear and the
> > > > > VM won't be able to access its pages even to do a clean exit.
> > > > 
> ...skip...
> > > 
> > > If the ultravisor cleans up the SVM's state on its side and then informs
> > > the Hypervisor to abort the SVM, the hypervisor will not be able to
> > > cleanly terminate the VM.  Because to terminate the SVM, the hypervisor
> > > still needs the services of the Ultravisor. For example: to get the
> > > pages back into the hypervisor if needed. Another example is, the
> > > hypervisor can call UV_SVM_TERMINATE.  Regardless of which ucall
> > > gets called, the ultravisor has to hold on to enough state of the
> > > SVM to service that request.
> > 
> > OK, that's a good reason.  That should be explained in the commit
> > message.
> > 
> > > The current design assumes that the hypervisor explicitly informs the
> > ultravisor that it is done with the SVM, through the UV_SVM_TERMINATE
> > ucall. Till that point the Ultravisor must be ready to service any
> > > ucalls made by the hypervisor on the SVM's behalf.
> > 
> > I see that UV_SVM_TERMINATE is done when the VM is being destroyed (at
> > which point kvm->arch.secure_guest doesn't matter any more), and in
> > kvmhv_svm_off(), where kvm->arch.secure_guest gets cleared
> > explicitly.  Hence I don't see any need for clearing it in the
> > assembly code on the next secure guest entry.  I think the change to
> > book3s_hv_rmhandlers.S can just be dropped.
> 
> There is subtle problem removing that code from the assembly.
> 
> If the H_SVM_INIT_ABORT hcall returns to the ultravisor without clearing
> kvm->arch.secure_guest, the hypervisor will continue to think that the
> VM is a secure VM.   However the primary reason the H_SVM_INIT_ABORT
> hcall was invoked, was to inform the Hypervisor that it should no longer
> consider the VM as a Secure VM. So there is an inconsistency there.

Most of the checks that look at whether a VM is a secure VM use code
like "if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE)".  Now
since KVMPPC_SECURE_INIT_ABORT is 4, an if statement such as that will
take the false branch once we have set kvm->arch.secure_guest to
KVMPPC_SECURE_INIT_ABORT in kvmppc_h_svm_init_abort.  So in fact in
most places we will treat the VM as a normal VM from then on.  If
there are any places where we still need to treat the VM as a secure
VM while we are processing the abort we can easily do that too.
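
Concretely, with the flag values used upstream (reproduced here for
illustration; only the ABORT value is quoted in this thread):

	#define KVMPPC_SECURE_INIT_START	0x1 /* H_SVM_INIT_START issued */
	#define KVMPPC_SECURE_INIT_DONE		0x2 /* H_SVM_INIT_DONE issued */
	#define KVMPPC_SECURE_INIT_ABORT	0x4 /* H_SVM_INIT_ABORT issued */

	/* once secure_guest == KVMPPC_SECURE_INIT_ABORT, this test is false
	 * and the VM is treated as a normal VM on most paths: */
	if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE)
		/* ... secure-VM handling not taken ... */;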

> This is fine, as long as the VM does not invoke any hcall or does not
> receive any hypervisor-exceptions.  The moment either of those happen,
> the control goes into the hypervisor, the hypervisor services
> the exception/hcall and while returning, it will see that the
> kvm->arch.secure_guest flag is set and **incorrectly** return
> to the ultravisor through a UV_RETURN ucall.  Ultravisor will
> not know what to do with it, because it does not consider that
> VM as a Secure VM.  Bad things happen.

If bad things happen in the ultravisor because the hypervisor did
something it shouldn't, then it's game over, you just lost, thanks for
playing.  The ultravisor MUST be able to cope with bogus UV_RETURN
calls for a VM that it doesn't consider to be a secure VM.  You need
to work out how to handle such calls safely and implement that in the
ultravisor.

Paul.


Re: [PATCH v10 7/8] KVM: PPC: Implement H_SVM_INIT_ABORT hcall

2019-11-11 Thread Paul Mackerras
On Mon, Nov 11, 2019 at 05:01:58PM -0800, Ram Pai wrote:
> On Mon, Nov 11, 2019 at 03:19:24PM +1100, Paul Mackerras wrote:
> > On Mon, Nov 04, 2019 at 09:47:59AM +0530, Bharata B Rao wrote:
> > > From: Sukadev Bhattiprolu 
> > > 
> > > Implement the H_SVM_INIT_ABORT hcall which the Ultravisor can use to
> > > abort an SVM after it has issued the H_SVM_INIT_START and before the
> > > H_SVM_INIT_DONE hcalls. This hcall could be used when Ultravisor
> > > encounters security violations or other errors when starting an SVM.
> > > 
> > > Note that this hcall is different from UV_SVM_TERMINATE ucall which
> > > is used by HV to terminate/cleanup an SVM.
> > > 
> > > In case of H_SVM_INIT_ABORT, we should page-out all the pages back to
> > > HV (i.e., we should not skip the page-out). Otherwise the VM's pages,
> > > possibly including its text/data would be stuck in secure memory.
> > > Since the SVM did not go secure, its MSR_S bit will be clear and the
> > > VM won't be able to access its pages even to do a clean exit.
> > 
> > It seems fragile to me to have one more transfer back into the
> > ultravisor after this call.  Why does the UV need to do this call and
> > then get control back again one more time?  
> > Why can't the UV defer
> > doing this call until it can do it without expecting to see a return
> > from the hcall?  
> 
> Sure. But, what if the hypervisor calls back into the UV through a
> ucall, asking for some page to be paged-out?  If the ultravisor has
> cleaned up the state associated with the SVM, it wont be able to service
> that request.
> 
> H_SVM_INIT_ABORT is invoked to tell the hypervisor that the
> secure-state-transition for the VM cannot be continued any further.
> Hypervisor can than choose to do whatever with that information. It can
> cleanup its state, and/or make ucalls to get some information from the
> ultravisor.  It can also choose not to return control back to the ultravisor.
> 
> 
> > And if it does need to see a return from the hcall,
> > what would happen if a malicious hypervisor doesn't do the return?
> 
> That is fine.  At most it will be a denial-of-service attack.
> 
> RP
> 
> > 
> > Paul.
> 
> 
> 
> 
> 
> If the ultravisor cleans up the SVM's state on its side and then informs
> the Hypervisor to abort the SVM, the hypervisor will not be able to
> cleanly terminate the VM.  Because to terminate the SVM, the hypervisor
> still needs the services of the Ultravisor. For example: to get the
> pages back into the hypervisor if needed. Another example is, the
> hypervisor can call UV_SVM_TERMINATE.  Regardless of which ucall
> gets called, the ultravisor has to hold on to enough state of the
> SVM to service that request.

OK, that's a good reason.  That should be explained in the commit
message.

> The current design assumes that the hypervisor explicitly informs the
> ultravisor that it is done with the SVM, through the UV_SVM_TERMINATE
> ucall. Till that point the Ultravisor must be ready to service any
> ucalls made by the hypervisor on the SVM's behalf.

I see that UV_SVM_TERMINATE is done when the VM is being destroyed (at
which point kvm->arch.secure_guest doesn't matter any more), and in
kvmhv_svm_off(), where kvm->arch.secure_guest gets cleared
explicitly.  Hence I don't see any need for clearing it in the
assembly code on the next secure guest entry.  I think the change to
book3s_hv_rmhandlers.S can just be dropped.

Paul.


Re: [PATCH v10 6/8] KVM: PPC: Support reset of secure guest

2019-11-11 Thread Paul Mackerras
On Mon, Nov 04, 2019 at 09:47:58AM +0530, Bharata B Rao wrote:
[snip]
> @@ -5442,6 +5471,64 @@ static int kvmhv_store_to_eaddr(struct kvm_vcpu *vcpu, 
> ulong *eaddr, void *ptr,
>   return rc;
>  }
>  
> +/*
> + *  IOCTL handler to turn off secure mode of guest
> + *
> + * - Issue ucall to terminate the guest on the UV side
> + * - Unpin the VPA pages (Enables these pages to be migrated back
> + *   when VM becomes secure again)
> + * - Recreate partition table as the guest is transitioning back to
> + *   normal mode
> + * - Release all device pages
> + */
> +static int kvmhv_svm_off(struct kvm *kvm)
> +{
> + struct kvm_vcpu *vcpu;
> + int srcu_idx;
> + int ret = 0;
> + int i;
> +
> + if (!(kvm->arch.secure_guest & KVMPPC_SECURE_INIT_START))
> + return ret;
> +

A further comment on this code: it should check that no vcpus are
running and fail if any are running, and it should prevent any vcpus
from running until the function is finished, using code like that in
kvmhv_configure_mmu().  That is, it should do something like this:

mutex_lock(&kvm->arch.mmu_setup_lock);
mmu_was_ready = kvm->arch.mmu_ready;
if (kvm->arch.mmu_ready) {
kvm->arch.mmu_ready = 0;
/* order mmu_ready vs. vcpus_running */
smp_mb();
if (atomic_read(&kvm->arch.vcpus_running)) {
kvm->arch.mmu_ready = 1;
ret = -EBUSY;
goto out_unlock;
}
}

and then after clearing kvm->arch.secure_guest below:

> + srcu_idx = srcu_read_lock(&kvm->srcu);
> + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> + struct kvm_memory_slot *memslot;
> + struct kvm_memslots *slots = __kvm_memslots(kvm, i);
> +
> + if (!slots)
> + continue;
> +
> + kvm_for_each_memslot(memslot, slots) {
> + kvmppc_uvmem_drop_pages(memslot, kvm, true);
> + uv_unregister_mem_slot(kvm->arch.lpid, memslot->id);
> + }
> + }
> + srcu_read_unlock(&kvm->srcu, srcu_idx);
> +
> + ret = uv_svm_terminate(kvm->arch.lpid);
> + if (ret != U_SUCCESS) {
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + spin_lock(&vcpu->arch.vpa_update_lock);
> + unpin_vpa_reset(kvm, &vcpu->arch.dtl);
> + unpin_vpa_reset(kvm, &vcpu->arch.slb_shadow);
> + unpin_vpa_reset(kvm, &vcpu->arch.vpa);
> + spin_unlock(&vcpu->arch.vpa_update_lock);
> + }
> +
> + ret = kvmppc_reinit_partition_table(kvm);
> + if (ret)
> + goto out;
> +
> + kvm->arch.secure_guest = 0;

you need to do:

kvm->arch.mmu_ready = mmu_was_ready;
 out_unlock:
mutex_unlock(&kvm->arch.mmu_setup_lock);

> +out:
> + return ret;
> +}
> +

With that extra check in place, it should be safe to unpin the vpas if
there is a good reason to do so.  ("Userspace has some bug that we
haven't found" isn't a good reason to do so.)

Paul.



Re: [PATCH v10 6/8] KVM: PPC: Support reset of secure guest

2019-11-10 Thread Paul Mackerras
On Mon, Nov 04, 2019 at 09:47:58AM +0530, Bharata B Rao wrote:
> Add support for reset of secure guest via a new ioctl KVM_PPC_SVM_OFF.
> This ioctl will be issued by QEMU during reset and includes the
> following steps:
> 
> - Ask UV to terminate the guest via UV_SVM_TERMINATE ucall
> - Unpin the VPA pages so that they can be migrated back to secure
>   side when guest becomes secure again. This is required because
>   pinned pages can't be migrated.

Unpinning the VPA pages is normally handled during VM reset by QEMU
doing set_one_reg operations to set the values for the
KVM_REG_PPC_VPA_ADDR, KVM_REG_PPC_VPA_SLB and KVM_REG_PPC_VPA_DTL
pseudo-registers to zero.  Is there some reason why this isn't
happening for a secure VM, and if so, what is that reason?
If it is happening, then why do we need to unpin the pages explicitly
here?
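
(For reference, that reset path amounts to QEMU doing roughly the following;
this is a paraphrase of the QEMU side, not its exact code:)

	/* sketch: QEMU zeroing the VPA pseudo-registers on machine reset */
	uint64_t zeroed[2] = { 0, 0 };	/* SLB/DTL regs carry addr + length */
	kvm_set_one_reg(cs, KVM_REG_PPC_VPA_ADDR, zeroed);
	kvm_set_one_reg(cs, KVM_REG_PPC_VPA_SLB, zeroed);
	kvm_set_one_reg(cs, KVM_REG_PPC_VPA_DTL, zeroed);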

> - Reinitialize guest's partitioned scoped page tables. These are
>   freed when guest becomes secure (H_SVM_INIT_DONE)

It doesn't seem particularly useful to me to free the partition-scoped
page tables when the guest becomes secure, and it feels like it makes
things more fragile.  If you don't free them at that point, you don't need
to reallocate them now.

> - Release all device pages of the secure guest.
> 
> After these steps, guest is ready to issue UV_ESM call once again
> to switch to secure mode.

Paul.


Re: [PATCH v10 5/8] KVM: PPC: Handle memory plug/unplug to secure VM

2019-11-10 Thread Paul Mackerras
On Mon, Nov 04, 2019 at 09:47:57AM +0530, Bharata B Rao wrote:
> Register the new memslot with UV during plug and unregister
> the memslot during unplug. In addition, release all the
> device pages during unplug.
> 
> Signed-off-by: Bharata B Rao 
> Signed-off-by: Sukadev Bhattiprolu 
>   [Added skip_page_out arg to kvmppc_uvmem_drop_pages()]

Reviewed-by: Paul Mackerras 


Re: [PATCH v10 7/8] KVM: PPC: Implement H_SVM_INIT_ABORT hcall

2019-11-10 Thread Paul Mackerras
On Mon, Nov 04, 2019 at 09:47:59AM +0530, Bharata B Rao wrote:
> From: Sukadev Bhattiprolu 
> 
> Implement the H_SVM_INIT_ABORT hcall which the Ultravisor can use to
> abort an SVM after it has issued the H_SVM_INIT_START and before the
> H_SVM_INIT_DONE hcalls. This hcall could be used when Ultravisor
> encounters security violations or other errors when starting an SVM.
> 
> Note that this hcall is different from UV_SVM_TERMINATE ucall which
> is used by HV to terminate/cleanup an SVM.
> 
> In case of H_SVM_INIT_ABORT, we should page-out all the pages back to
> HV (i.e., we should not skip the page-out). Otherwise the VM's pages,
> possibly including its text/data would be stuck in secure memory.
> Since the SVM did not go secure, its MSR_S bit will be clear and the
> VM won't be able to access its pages even to do a clean exit.

It seems fragile to me to have one more transfer back into the
ultravisor after this call.  Why does the UV need to do this call and
then get control back again one more time?  Why can't the UV defer
doing this call until it can do it without expecting to see a return
from the hcall?  And if it does need to see a return from the hcall,
what would happen if a malicious hypervisor doesn't do the return?

Paul.


Re: [PATCH v10 1/8] mm: ksm: Export ksm_madvise()

2019-11-06 Thread Paul Mackerras
On Wed, Nov 06, 2019 at 12:15:42PM +0530, Bharata B Rao wrote:
> On Wed, Nov 06, 2019 at 03:33:29PM +1100, Paul Mackerras wrote:
> > On Mon, Nov 04, 2019 at 09:47:53AM +0530, Bharata B Rao wrote:
> > > KVM PPC module needs ksm_madvise() for supporting secure guests.
> > > Guest pages that become secure are represented as device private
> > > pages in the host. Such pages shouldn't participate in KSM merging.
> > 
> > If we don't do the ksm_madvise call, then as far as I can tell, it
> > should all still work correctly, but we might have KSM pulling pages
> > in unnecessarily, causing a reduction in performance.  Is that right?
> 
> I thought so too. When KSM tries to merge a secure page, it should
> cause a fault resulting in page-out of the secure page. However I see
> the below crash when KSM is enabled and KSM scan tries to kmap and
> memcmp the device private page.
> 
> BUG: Unable to handle kernel data access at 0xc007fffe0001
> Faulting instruction address: 0xc00ab5a0
> Oops: Kernel access of bad area, sig: 11 [#1]
> LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
> Modules linked in:
> CPU: 0 PID: 22 Comm: ksmd Not tainted 5.4.0-rc2-00026-g2249c0ae4a53-dirty #376
> NIP:  c00ab5a0 LR: c03d7c3c CTR: 0004
> REGS: c001c85d79b0 TRAP: 0300   Not tainted  
> (5.4.0-rc2-00026-g2249c0ae4a53-dirty)
> MSR:  9280b033   CR: 24002242  
> XER: 2004
> CFAR: c00ab3d0 DAR: c007fffe0001 DSISR: 4000 IRQMASK: 0 
> GPR00: 0004 c001c85d7c40 c18ce000 c001c388 
> GPR04: c007fffe0001 0001   
> GPR08: c1992298 603820002138  3a69 
> GPR12: 24002242 c255 c001c870 c179b728 
> GPR16: c00c01800040 c179b5b8 c00c0070e200  
> GPR20:   f000 c179b648 
> GPR24: c24464a0 c249f568 c1118918  
> GPR28: c001c804c590 c249f518  c001c870 
> NIP [c00ab5a0] memcmp+0x320/0x6a0
> LR [c03d7c3c] memcmp_pages+0x8c/0xe0
> Call Trace:
> [c001c85d7c40] [c001c804c590] 0xc001c804c590 (unreliable)
> [c001c85d7c70] [c04591d0] ksm_scan_thread+0x960/0x21b0
> [c001c85d7db0] [c01bf328] kthread+0x198/0x1a0
> [c001c85d7e20] [c000bfbc] ret_from_kernel_thread+0x5c/0x80
> Instruction dump:
> ebc1fff0 eba1ffe8 eb81ffe0 eb61ffd8 4e800020 3861 4d810020 3860 
> 4e800020 3804 7c0903a6 7d201c28 <7d402428> 7c295040 38630008 38840008 

Hmmm, that seems like a bug in the ZONE_DEVICE stuff generally.  All
that ksm is doing as far as I can see is follow_page() and
kmap_atomic().  I wonder how many other places in the kernel might
also be prone to crashing if they try to touch device pages?
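
The sequence in question is roughly the following sketch (KSM's memcmp_pages
path, paraphrased):

	/* what the ksmd scan does when comparing candidate pages */
	struct page *page = follow_page(vma, addr, FOLL_GET);
	char *mapped = kmap_atomic(page);	/* a device-private page has no
						 * kernel mapping, hence the bad
						 * data access in the oops above */
	ret = memcmp(mapped, other, PAGE_SIZE);
	kunmap_atomic(mapped);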

> In any case, we wouldn't want secure guest pages to be pulled out due
> to KSM, hence disabled merging.

Sure, I don't disagree with that, but I worry that we are papering
over a bug here.

Paul.


Re: [PATCH v10 2/8] KVM: PPC: Support for running secure guests

2019-11-05 Thread Paul Mackerras
On Mon, Nov 04, 2019 at 09:47:54AM +0530, Bharata B Rao wrote:
> A pseries guest can be run as secure guest on Ultravisor-enabled
> POWER platforms. On such platforms, this driver will be used to manage
> the movement of guest pages between the normal memory managed by
> hypervisor (HV) and secure memory managed by Ultravisor (UV).
> 
> HV is informed about the guest's transition to secure mode via hcalls:
> 
> H_SVM_INIT_START: Initiate securing a VM
> H_SVM_INIT_DONE: Conclude securing a VM
> 
> As part of H_SVM_INIT_START, register all existing memslots with
> the UV. H_SVM_INIT_DONE call by UV informs HV that transition of
> the guest to secure mode is complete.
> 
> These two states (transition to secure mode STARTED and transition
> to secure mode COMPLETED) are recorded in kvm->arch.secure_guest.
> Setting these states will cause the assembly code that enters the
> guest to call the UV_RETURN ucall instead of trying to enter the
> guest directly.
> 
> Migration of pages between normal and secure memory of a secure
> guest is implemented in H_SVM_PAGE_IN and H_SVM_PAGE_OUT hcalls.
> 
> H_SVM_PAGE_IN: Move the content of a normal page to secure page
> H_SVM_PAGE_OUT: Move the content of a secure page to normal page
> 
> Private ZONE_DEVICE memory equal to the amount of secure memory
> available in the platform for running secure guests is created.
> Whenever a page belonging to the guest becomes secure, a page from
> this private device memory is used to represent and track that secure
> page on the HV side. The movement of pages between normal and secure
> memory is done via migrate_vma_pages() using UV_PAGE_IN and
> UV_PAGE_OUT ucalls.
> 
> Signed-off-by: Bharata B Rao 

Reviewed-by: Paul Mackerras 
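
The HV side of a page-in has roughly the following shape (a sketch using the
migrate_vma API of that era; UV interaction and error handling elided):

	struct migrate_vma mig = {
		.vma	= vma,
		.start	= start,
		.end	= end,
		.src	= &src_pfn,
		.dst	= &dst_pfn,
	};

	if (migrate_vma_setup(&mig))
		return H_PARAMETER;
	/* copy the page into secure memory via the UV_PAGE_IN ucall, then: */
	migrate_vma_pages(&mig);
	migrate_vma_finalize(&mig);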


Re: [PATCH v10 1/8] mm: ksm: Export ksm_madvise()

2019-11-05 Thread Paul Mackerras
On Mon, Nov 04, 2019 at 09:47:53AM +0530, Bharata B Rao wrote:
> KVM PPC module needs ksm_madvise() for supporting secure guests.
> Guest pages that become secure are represented as device private
> pages in the host. Such pages shouldn't participate in KSM merging.

If we don't do the ksm_madvise call, then as far as I can tell, it
should all still work correctly, but we might have KSM pulling pages
in unnecessarily, causing a reduction in performance.  Is that right?

> Signed-off-by: Bharata B Rao 

Acked-by: Paul Mackerras 
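
For context, the exported call is used in the standard opt-out form, roughly:

	/* sketch: mark the VMA unmergeable before its pages are replaced
	 * by device-private (secure) pages */
	ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
			  MADV_UNMERGEABLE, &vma->vm_flags);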


Re: [PATCH v10 4/8] KVM: PPC: Radix changes for secure guest

2019-11-05 Thread Paul Mackerras
On Mon, Nov 04, 2019 at 09:47:56AM +0530, Bharata B Rao wrote:
> - After the guest becomes secure, when we handle a page fault of a page
>   belonging to SVM in HV, send that page to UV via UV_PAGE_IN.
> - Whenever a page is unmapped on the HV side, inform UV via UV_PAGE_INVAL.
> - Ensure all those routines that walk the secondary page tables of
>   the guest don't do so in case of secure VM. For secure guest, the
>   active secondary page tables are in secure memory and the secondary
>   page tables in HV are freed when guest becomes secure.

Why do we free the page tables?  Just to save a little memory?  It
feels like it would make things more fragile.

Also, I don't see where the freeing gets done in this patch.

Paul.


Re: [PATCH v10 3/8] KVM: PPC: Shared pages support for secure guests

2019-11-05 Thread Paul Mackerras
On Mon, Nov 04, 2019 at 09:47:55AM +0530, Bharata B Rao wrote:
> A secure guest will share some of its pages with the hypervisor (e.g. virtio
> bounce buffers etc). Support sharing of pages between hypervisor and
> ultravisor.
> 
> Shared page is reachable via both HV and UV side page tables. Once a
> secure page is converted to shared page, the device page that represents
> the secure page is unmapped from the HV side page tables.

I'd like to understand a little better what's going on - see below...

> +/*
> + * Shares the page with HV, thus making it a normal page.
> + *
> + * - If the page is already secure, then provision a new page and share
> + * - If the page is a normal page, share the existing page
> + *
> + * In the former case, uses dev_pagemap_ops.migrate_to_ram handler
> + * to unmap the device page from QEMU's page tables.
> + */
> +static unsigned long
> +kvmppc_share_page(struct kvm *kvm, unsigned long gpa, unsigned long 
> page_shift)
> +{
> +
> + int ret = H_PARAMETER;
> + struct page *uvmem_page;
> + struct kvmppc_uvmem_page_pvt *pvt;
> + unsigned long pfn;
> + unsigned long gfn = gpa >> page_shift;
> + int srcu_idx;
> + unsigned long uvmem_pfn;
> +
> + srcu_idx = srcu_read_lock(&kvm->srcu);
> + mutex_lock(&kvm->arch.uvmem_lock);
> + if (kvmppc_gfn_is_uvmem_pfn(gfn, kvm, &uvmem_pfn)) {
> + uvmem_page = pfn_to_page(uvmem_pfn);
> + pvt = uvmem_page->zone_device_data;
> + pvt->skip_page_out = true;
> + }
> +
> +retry:
> + mutex_unlock(&kvm->arch.uvmem_lock);
> + pfn = gfn_to_pfn(kvm, gfn);

At this point, pfn is the value obtained from the page table for
userspace (e.g. QEMU), right?  I would think it should be equal to
uvmem_pfn in most cases, shouldn't it?  If not, what is it going to
be?

> + if (is_error_noslot_pfn(pfn))
> + goto out;
> +
> + mutex_lock(&kvm->arch.uvmem_lock);
> + if (kvmppc_gfn_is_uvmem_pfn(gfn, kvm, &uvmem_pfn)) {
> + uvmem_page = pfn_to_page(uvmem_pfn);
> + pvt = uvmem_page->zone_device_data;
> + pvt->skip_page_out = true;
> + kvm_release_pfn_clean(pfn);

This is going to do a put_page(), unless pfn is a reserved pfn.  If it
does a put_page(), where did we do the corresponding get_page()?
However, since kvmppc_gfn_is_uvmem_pfn() returned true, doesn't that
mean that pfn here should be a device pfn, and in fact should be the
same as uvmem_pfn (possibly with some extra bit(s) set)?  What does
kvm_is_reserved_pfn() return for a device pfn?

Paul.
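
For reference, the helper in question reduces to roughly this (paraphrased
from the kvm_main.c of that era):

	void kvm_release_pfn_clean(kvm_pfn_t pfn)
	{
		if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn))
			put_page(pfn_to_page(pfn));
	}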


Re: [PATCH v10 0/8] KVM: PPC: Driver to manage pages of secure guest

2019-11-05 Thread Paul Mackerras
On Mon, Nov 04, 2019 at 09:47:52AM +0530, Bharata B Rao wrote:
> Hi,
> 
> This is the next version of the patchset that adds required support
> in the KVM hypervisor to run secure guests on PEF-enabled POWER platforms.
> 
> The major change in this version is about not using kvm.arch->rmap[]
> array to store device PFNs, thus not depending on the memslot availability
> to reach the device PFN from the fault path. Instead of rmap[], we
> now have a different array which gets created and destroyed along with
> memslot creation and deletion. These arrays hang off from kvm.arch and
> are arranged in a simple linked list for now. We could move to some other
> data structure in future if walking of linked list becomes an overhead
> due to large number of memslots.

Thanks.  This is looking really close now.

> Other changes include:
> 
> - Rearranged/Merged/Cleaned up patches, removed all Acks/Reviewed-by since
>   all the patches have changed.
> - Added a new patch to support H_SVM_INIT_ABORT hcall (From Suka)
> - Added KSM unmerge support so that VMAs that have device PFNs don't
>   participate in KSM merging and eventually crash in KSM code.
> - Release device pages during unplug (Paul) and ensure that memory
>   hotplug and unplug works correctly.
> - Let kvm-hv module to load on PEF-disabled platforms (Ram) when
>   CONFIG_PPC_UV is enabled allowing regular non-secure guests
>   to still run.
> - Support guest reset when switching to secure is in progress.
> - Check if page is already secure in kvmppc_send_page_to_uv() before
>   sending it to UV.
> - Fixed sentinel for header file kvm_book3s_uvmem.h (Jason)
> 
> Now, all the dependencies required by this patchset are in powerpc/next
> on which this patchset is based upon.

Can you tell me which patches that are in powerpc/next but not upstream
this depends on?

Paul.


Re: [PATCH v9 2/8] KVM: PPC: Move pages between normal and secure memory

2019-10-22 Thread Paul Mackerras
On Tue, Oct 22, 2019 at 11:59:35AM +0530, Bharata B Rao wrote:
> On Fri, Oct 18, 2019 at 8:31 AM Paul Mackerras  wrote:
> >
> > On Wed, Sep 25, 2019 at 10:36:43AM +0530, Bharata B Rao wrote:
> > > Manage migration of pages between normal and secure memory of a secure
> > > guest by implementing H_SVM_PAGE_IN and H_SVM_PAGE_OUT hcalls.
> > >
> > > H_SVM_PAGE_IN: Move the content of a normal page to secure page
> > > H_SVM_PAGE_OUT: Move the content of a secure page to normal page
> > >
> > > Private ZONE_DEVICE memory equal to the amount of secure memory
> > > available in the platform for running secure guests is created.
> > > Whenever a page belonging to the guest becomes secure, a page from
> > > this private device memory is used to represent and track that secure
> > > page on the HV side. The movement of pages between normal and secure
> > > memory is done via migrate_vma_pages() using UV_PAGE_IN and
> > > UV_PAGE_OUT ucalls.
> >
> > As we discussed privately, but mentioning it here so there is a
> > record:  I am concerned about this structure
> >
> > > +struct kvmppc_uvmem_page_pvt {
> > > + unsigned long *rmap;
> > > + struct kvm *kvm;
> > > + unsigned long gpa;
> > > +};
> >
> > which keeps a reference to the rmap.  The reference could become stale
> > if the memslot is deleted or moved, and nothing in the patch series
> > ensures that the stale references are cleaned up.
> 
> I will add code to release the device PFNs when memslot goes away. In
> fact the early versions of the patchset had this, but it subsequently
> got removed.
> 
> >
> > If it is possible to do without the long-term rmap reference, and
> > instead find the rmap via the memslots (with the srcu lock held) each
> > time we need the rmap, that would be safer, I think, provided that we
> > can sort out the lock ordering issues.
> 
> All paths except fault handler access rmap[] under srcu lock. Even in
> case of fault handler, for those faults induced by us (shared page
> handling, releasing device pfns), we do hold srcu lock. The difficult
> case is when we fault due to HV accessing a device page. In this case
> we come to the fault handler with mmap_sem already held and are not in a
> position to take kvm srcu lock as that would lead to lock order
> reversal. Given that we have pages mapped in still, I assume memslot
> can't go away while we access rmap[], so I think we should be ok here.

The mapping of pages in userspace memory, and the mapping of userspace
memory to guest physical space, are two distinct things.  The memslots
describe the mapping of userspace addresses to guest physical
addresses, but don't say anything about what is mapped at those
userspace addresses.  So you can indeed get a page fault on a
userspace address at the same time that a memslot is being deleted
(even a memslot that maps that particular userspace address), because
removing the memslot does not unmap anything from userspace memory,
it just breaks the association between that userspace memory and guest
physical memory.  Deleting the memslot does unmap the pages from the
guest but doesn't unmap them from the userspace process (e.g. QEMU).

It is an interesting question what the semantics should be when a
memslot is deleted and there are pages of userspace currently paged
out to the device (i.e. the ultravisor).  One approach might be to say
that all those pages have to come back to the host before we finish
the memslot deletion, but that is probably not necessary; I think we
could just say that those pages are gone and can be replaced by zero
pages if they get accessed on the host side.  If userspace then unmaps
the corresponding region of the userspace memory map, we can then just
forget all those pages with very little work.

> However if that sounds fragile, maybe I can go back to my initial
> design where we weren't using rmap[] to store device PFNs. That will
> increase the memory usage but we give us an easy option to have
> per-guest mutex to protect concurrent page-ins/outs/faults.

That sounds like it would be the best option, even if only in the
short term.  At least it would give us a working solution, even if
it's not the best performing solution.

Paul.


Re: [PATCH v2 0/6] KVM: PPC: Book3S: HV: XIVE: Allocate less VPs in OPAL

2019-10-20 Thread Paul Mackerras
On Wed, Oct 16, 2019 at 11:44:03PM +0200, Greg Kurz wrote:
> On Fri, 27 Sep 2019 13:53:32 +0200
> Greg Kurz  wrote:
> 
> > This brings some fixes and allows starting more VMs with an in-kernel
> > XIVE or XICS-on-XIVE device.
> > 
> > Changes since v1 (https://patchwork.ozlabs.org/cover/1166099/):
> > - drop a useless patch
> > - add a patch to show VP ids in debugfs
> > - update some changelogs
> > - fix buggy check in patch 5
> > - Cc: stable 
> > 
> > --
> > Greg
> > 
> > ---
> > 
> > Greg Kurz (6):
> >   KVM: PPC: Book3S HV: XIVE: Set kvm->arch.xive when VPs are allocated
> >   KVM: PPC: Book3S HV: XIVE: Ensure VP isn't already in use
> >   KVM: PPC: Book3S HV: XIVE: Show VP id in debugfs
> >   KVM: PPC: Book3S HV: XIVE: Compute the VP id in a common helper
> >   KVM: PPC: Book3S HV: XIVE: Make VP block size configurable
> >   KVM: PPC: Book3S HV: XIVE: Allow userspace to set the # of VPs
> > 
> > 
> >  Documentation/virt/kvm/devices/xics.txt |   14 +++
> >  Documentation/virt/kvm/devices/xive.txt |8 ++
> >  arch/powerpc/include/uapi/asm/kvm.h |3 +
> >  arch/powerpc/kvm/book3s_xive.c  |  142 ---
> >  arch/powerpc/kvm/book3s_xive.h  |   17 
> >  arch/powerpc/kvm/book3s_xive_native.c   |   40 +++--
> >  6 files changed, 167 insertions(+), 57 deletions(-)
> > 
> 
> Ping ?

I'm about to send a pull request to Paolo for 2/6 (to go into 5.4) and
I'm preparing a tree of stuff for 5.5 that will include the rest of
the patches.  However, I have been delayed by the fact that multipath
SCSI is currently broken upstream on the P8 test box that I use, so I
haven't been able to test things.

Paul.


Re: [PATCH v9 2/8] KVM: PPC: Move pages between normal and secure memory

2019-10-17 Thread Paul Mackerras
On Wed, Sep 25, 2019 at 10:36:43AM +0530, Bharata B Rao wrote:
> Manage migration of pages between normal and secure memory of a secure
> guest by implementing H_SVM_PAGE_IN and H_SVM_PAGE_OUT hcalls.
> 
> H_SVM_PAGE_IN: Move the content of a normal page to secure page
> H_SVM_PAGE_OUT: Move the content of a secure page to normal page
> 
> Private ZONE_DEVICE memory equal to the amount of secure memory
> available in the platform for running secure guests is created.
> Whenever a page belonging to the guest becomes secure, a page from
> this private device memory is used to represent and track that secure
> page on the HV side. The movement of pages between normal and secure
> memory is done via migrate_vma_pages() using UV_PAGE_IN and
> UV_PAGE_OUT ucalls.

As we discussed privately, but mentioning it here so there is a
record:  I am concerned about this structure

> +struct kvmppc_uvmem_page_pvt {
> + unsigned long *rmap;
> + struct kvm *kvm;
> + unsigned long gpa;
> +};

which keeps a reference to the rmap.  The reference could become stale
if the memslot is deleted or moved, and nothing in the patch series
ensures that the stale references are cleaned up.

If it is possible to do without the long-term rmap reference, and
instead find the rmap via the memslots (with the srcu lock held) each
time we need the rmap, that would be safer, I think, provided that we
can sort out the lock ordering issues.

Paul.


Re: [PATCH 3/6] KVM: PPC: Book3S HV: XIVE: Ensure VP isn't already in use

2019-09-24 Thread Paul Mackerras
On Mon, Sep 23, 2019 at 05:43:48PM +0200, Greg Kurz wrote:
> We currently prevent userspace from connecting a new vCPU if we already have
> one with the same vCPU id. This is good but unfortunately not enough,
> because VP ids derive from the packed vCPU ids, and kvmppc_pack_vcpu_id()
> can return colliding values. For examples, 348 stays unchanged since it
> is < KVM_MAX_VCPUS, but it is also the packed value of 2392 when the
> guest's core stride is 8. Nothing currently prevents userspace to connect
> vCPUs with forged ids, that end up being associated to the same VP. This
> confuses the irq layer and likely crashes the kernel:
> 
> [96631.670454] genirq: Flags mismatch irq 4161. 0001 (kvm-1-2392) vs. 
> 0001 (kvm-1-348)

Have you seen a host kernel crash?  How hard would it be to exploit
this, and would it just be a denial of service, or do you think it
could be used to get a use-after-free in the kernel or something like
that?

Also, does this patch depend on the patch 2 in this series?

Paul.
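
For readers following the quoted log: the collision between vCPU ids 348 and
2392 can be reproduced with a sketch of the packing arithmetic, where
KVM_MAX_VCPUS = 2048 and the block-offset table are assumptions based on the
kvmppc_pack_vcpu_id() of that era:

	/* sketch: why vCPU ids 348 and 2392 map to the same VP id */
	const int block_offsets[8] = {0, 4, 2, 6, 1, 5, 3, 7};
	int stride = 8;					/* guest core stride */
	u32 id = 2392;
	int block = (id / 2048) * (8 / stride);		/* = 1 */
	u32 packed = (id % 2048) + block_offsets[block];/* 344 + 4 = 348 */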


Re: [PATCH 1/6] KVM: PPC: Book3S HV: XIVE: initialize private pointer when VPs are allocated

2019-09-23 Thread Paul Mackerras
On Mon, Sep 23, 2019 at 05:43:37PM +0200, Greg Kurz wrote:
> From: Cédric Le Goater 
> 
> Do not assign the device private pointer before making sure the XIVE
> VPs are allocated in OPAL and test pointer validity when releasing
> the device.
> 
> Fixes: 5422e95103cf ("KVM: PPC: Book3S HV: XIVE: Replace the 'destroy' method 
> by a 'release' method")
> Signed-off-by: Cédric Le Goater 
> Signed-off-by: Greg Kurz 

What happens in the case where the OPAL allocation fails?  Does the
host crash, or hang, or leak resources?  I presume that users can
trigger the allocation failure just by starting a suitably large
number of guests - is that right?  Is there an easier way?  I'm trying
to work out whether this is urgently needed in 5.4 and the stable
trees or not.

Paul.


Re: [PATCH v2] KVM: PPC: Book3S HV: use smp_mb() when setting/clearing host_ipi flag

2019-09-23 Thread Paul Mackerras
   42: // returns to executing guest
> 105:   ppc_msgsnd_sync()/smp_mb()
> 105:   ppc_msgsnd() -> 42
>  42: local_paca->kvm_hstate.host_ipi == 0 // IPI ignored
> 105: // hangs waiting on 42 to process messages/call_single_queue
> 
> Fixing this scenario would require an smp_mb() *after* clearing
> host_ipi flag in kvmppc_set_host_ipi() to order the store vs.
> subsequent processing of IPI messages.
> 
> To handle both cases, this patch splits kvmppc_set_host_ipi() into
> separate set/clear functions, where we execute smp_mb() prior to
> setting host_ipi flag, and after clearing host_ipi flag. These
> functions pair with each other to synchronize the sender and receiver
> sides.
> 
> With that change in place the above workload ran for 20 hours without
> triggering any lock-ups.
> 
> Fixes: 755563bc79c7 ("powerpc/powernv: Fixes for hypervisor doorbell 
> handling") # v4.0
> Cc: Michael Ellerman 
> Cc: Paul Mackerras 
> Cc: Nicholas Piggin 
> Cc: kvm-...@vger.kernel.org
> Signed-off-by: Michael Roth 

Looks good, makes sense.

Acked-by: Paul Mackerras 
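
The split described in the quoted commit message looks roughly like this
(paraphrased sketch; the paca field name follows the text above):

	static inline void kvmppc_set_host_ipi(int cpu)
	{
		/* order stores of IPI messages vs. setting of host_ipi */
		smp_mb();
		paca_ptrs[cpu]->kvm_hstate.host_ipi = 1;
	}

	static inline void kvmppc_clear_host_ipi(int cpu)
	{
		paca_ptrs[cpu]->kvm_hstate.host_ipi = 0;
		/* order clearing of host_ipi vs. processing of messages */
		smp_mb();
	}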


Re: [PATCH kernel v2 2/4] KVM: PPC: Invalidate multiple TCEs at once

2019-08-26 Thread Paul Mackerras
On Mon, Aug 26, 2019 at 04:17:03PM +1000, Alexey Kardashevskiy wrote:
> Invalidating a TCE cache entry for each updated TCE is quite expensive.
> This makes use of the new iommu_table_ops::xchg_no_kill()/tce_kill()
> callbacks to bring down the time spent in mapping a huge guest DMA window;
> from roughly 20s to 10s for each guest's 100GB of DMA space.
> 
> Signed-off-by: Alexey Kardashevskiy 

With the addition of "Book3S" to the patch title,

Acked-by: Paul Mackerras 


[PATCH] KVM: PPC: Book3S: Enable XIVE native capability only if OPAL has required functions

2019-08-26 Thread Paul Mackerras
There are some POWER9 machines where the OPAL firmware does not support
the OPAL_XIVE_GET_QUEUE_STATE and OPAL_XIVE_SET_QUEUE_STATE calls.
The impact of this is that a guest using XIVE natively will not be able
to be migrated successfully.  On the source side, the get_attr operation
on the KVM native device for the KVM_DEV_XIVE_GRP_EQ_CONFIG attribute
will fail; on the destination side, the set_attr operation for the same
attribute will fail.

This adds tests for the existence of the OPAL get/set queue state
functions, and if they are not supported, the XIVE-native KVM device
is not created and the KVM_CAP_PPC_IRQ_XIVE capability returns false.
Userspace can then either provide a software emulation of XIVE, or
else tell the guest that it does not have a XIVE controller available
to it.
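
From userspace, the effect is visible through the capability check; a
hedged sketch of how a VMM might probe it (illustrative only, not QEMU's
actual code):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* returns non-zero if the kernel can expose a native XIVE device */
static int have_xive_native(int kvm_fd)
{
	/* KVM_CHECK_EXTENSION returns 0 when the capability is absent */
	return ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_PPC_IRQ_XIVE) > 0;
}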

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/kvm_ppc.h| 1 +
 arch/powerpc/include/asm/xive.h   | 1 +
 arch/powerpc/kvm/book3s.c | 8 +---
 arch/powerpc/kvm/book3s_xive_native.c | 5 +
 arch/powerpc/kvm/powerpc.c| 3 ++-
 arch/powerpc/sysdev/xive/native.c | 7 +++
 6 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 2484e6a..8e8514e 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -598,6 +598,7 @@ extern int kvmppc_xive_native_get_vp(struct kvm_vcpu *vcpu,
 union kvmppc_one_reg *val);
 extern int kvmppc_xive_native_set_vp(struct kvm_vcpu *vcpu,
 union kvmppc_one_reg *val);
+extern bool kvmppc_xive_native_supported(void);
 
 #else
 static inline int kvmppc_xive_set_xive(struct kvm *kvm, u32 irq, u32 server,
diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
index efb0e59..818989e 100644
--- a/arch/powerpc/include/asm/xive.h
+++ b/arch/powerpc/include/asm/xive.h
@@ -135,6 +135,7 @@ extern int xive_native_get_queue_state(u32 vp_id, uint32_t prio, u32 *qtoggle,
 extern int xive_native_set_queue_state(u32 vp_id, uint32_t prio, u32 qtoggle,
   u32 qindex);
 extern int xive_native_get_vp_state(u32 vp_id, u64 *out_state);
+extern bool xive_native_has_queue_state_support(void);
 
 #else
 
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 9524d92..d7fcdfa 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -1083,9 +1083,11 @@ static int kvmppc_book3s_init(void)
 	if (xics_on_xive()) {
 		kvmppc_xive_init_module();
 		kvm_register_device_ops(&kvm_xive_ops, KVM_DEV_TYPE_XICS);
-		kvmppc_xive_native_init_module();
-		kvm_register_device_ops(&kvm_xive_native_ops,
-					KVM_DEV_TYPE_XIVE);
+		if (kvmppc_xive_native_supported()) {
+			kvmppc_xive_native_init_module();
+			kvm_register_device_ops(&kvm_xive_native_ops,
+						KVM_DEV_TYPE_XIVE);
+		}
 	} else
 #endif
 		kvm_register_device_ops(&kvm_xics_ops, KVM_DEV_TYPE_XICS);
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index f0cab43..248c1ea 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -1179,6 +1179,11 @@ int kvmppc_xive_native_set_vp(struct kvm_vcpu *vcpu, union kvmppc_one_reg *val)
return 0;
 }
 
+bool kvmppc_xive_native_supported(void)
+{
+   return xive_native_has_queue_state_support();
+}
+
 static int xive_native_debug_show(struct seq_file *m, void *private)
 {
struct kvmppc_xive *xive = m->private;
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 0dba7eb..7012dd7 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -566,7 +566,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 * a POWER9 processor) and the PowerNV platform, as
 * nested is not yet supported.
 */
-   r = xive_enabled() && !!cpu_has_feature(CPU_FTR_HVMODE);
+   r = xive_enabled() && !!cpu_has_feature(CPU_FTR_HVMODE) &&
+   kvmppc_xive_native_supported();
break;
 #endif
 
diff --git a/arch/powerpc/sysdev/xive/native.c b/arch/powerpc/sysdev/xive/native.c
index 2f26b74..37987c8 100644
--- a/arch/powerpc/sysdev/xive/native.c
+++ b/arch/powerpc/sysdev/xive/native.c
@@ -800,6 +800,13 @@ int xive_native_set_queue_state(u32 vp_id, u32 prio, u32 qtoggle, u32 qindex)
 }
 EXPORT_SYMBOL_GPL(xive_native_set_queue_state);
 
+bool xive_native_has_queue_state_support(void)
+{
+   return opal_check_token(OPAL_XIVE_GET_QUEUE_STATE) &&
+   opal_check_token(OPAL_XIVE_SET_QUEUE_STATE);
+}
+EXPORT_SYMBOL_GPL(xive_native_has_queue_state_support);

Re: [PATCH v7 0/7] KVMPPC driver to manage secure guest pages

2019-08-22 Thread Paul Mackerras
On Thu, Aug 22, 2019 at 03:56:13PM +0530, Bharata B Rao wrote:
> Hi,
> 
> A pseries guest can be run as a secure guest on Ultravisor-enabled
> POWER platforms. On such platforms, this driver will be used to manage
> the movement of guest pages between the normal memory managed by the
> hypervisor (HV) and secure memory managed by the Ultravisor (UV).
> 
> Private ZONE_DEVICE memory equal to the amount of secure memory
> available in the platform for running secure guests is created.
> Whenever a page belonging to the guest becomes secure, a page from
> this private device memory is used to represent and track that secure
> page on the HV side. The movement of pages between normal and secure
> memory is done via migrate_vma_pages(). The reverse movement is driven
> via pagemap_ops.migrate_to_ram().
> 
> The page-in or page-out requests from UV will come to HV as hcalls and
> HV will call back into UV via uvcalls to satisfy these page requests.
> 
> These patches are against hmm.git
> (https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/log/?h=hmm)
> 
> plus
> 
> Claudio Carvalho's base ultravisor enablement patchset v6
> (https://lore.kernel.org/linuxppc-dev/20190822034838.27876-1-cclau...@linux.ibm.com/T/#t)
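
For readers unfamiliar with the device-private memory setup described
above, a rough sketch of the registration step (names such as
kvmppc_uvmem_* are illustrative; error handling and details are
simplified from the series):

static vm_fault_t kvmppc_uvmem_migrate_to_ram(struct vm_fault *vmf);

static struct dev_pagemap kvmppc_uvmem_pgmap;

static const struct dev_pagemap_ops kvmppc_uvmem_ops = {
	/* invoked when the guest touches a page now living in secure memory */
	.migrate_to_ram	= kvmppc_uvmem_migrate_to_ram,
};

static int kvmppc_uvmem_init(struct resource *res)
{
	void *addr;

	kvmppc_uvmem_pgmap.type = MEMORY_DEVICE_PRIVATE;
	kvmppc_uvmem_pgmap.res = *res;	/* one device PFN per secure page */
	kvmppc_uvmem_pgmap.ops = &kvmppc_uvmem_ops;

	addr = memremap_pages(&kvmppc_uvmem_pgmap, NUMA_NO_NODE);
	return PTR_ERR_OR_ZERO(addr);
}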

How are you thinking these patches will go upstream?  Are you going to
send them via the hmm tree?

I assume you need Claudio's patchset as a prerequisite for your series
to compile, which means the hmm maintainers would need to pull in a
topic branch from Michael Ellerman's powerpc tree, or something like
that.

Paul.


Re: [PATCH kernel] vfio/spapr_tce: Fix incorrect tce_iommu_group memory free

2019-08-22 Thread Paul Mackerras
On Mon, Aug 19, 2019 at 11:51:17AM +1000, Alexey Kardashevskiy wrote:
> The @tcegrp variable is used 1) in a loop over attached groups and
> 2) to store a pointer to a newly allocated tce_iommu_group if 1) found
> nothing. However the error handler does not distinguish how we got there
> and incorrectly releases memory for a found+incompatible group.
> 
> This fixes it by adding another error handling case.
> 
> Fixes: 0bd971676e68 ("powerpc/powernv/npu: Add compound IOMMU groups")
> Signed-off-by: Alexey Kardashevskiy 

Good catch.  This is potentially nasty since it is a double free.
Alex, are you going to take this, or would you prefer it goes via
Michael Ellerman's tree?

Reviewed-by: Paul Mackerras 


Re: [PATCH v6 7/7] powerpc/kvm: Use UV_RETURN ucall to return to ultravisor

2019-08-22 Thread Paul Mackerras
On Thu, Aug 22, 2019 at 12:48:38AM -0300, Claudio Carvalho wrote:
> From: Sukadev Bhattiprolu 
> 
> When an SVM makes a hypercall or incurs some other exception, the
> Ultravisor usually forwards (a.k.a. reflects) the exception to the
> Hypervisor. After processing the exception, the Hypervisor uses the
> UV_RETURN ultracall to return control back to the SVM.
> 
> The expected register state on entry to this ultracall is:
> 
> * Non-volatile registers are restored to their original values.
> * If returning from a hypercall, register R0 contains the return value
>   (unlike other ultracalls) and, registers R4 through R12 contain any
>   output values of the hypercall.
> * R3 contains the ultracall number, i.e UV_RETURN.
> * If returning with a synthesized interrupt, R2 contains the
>   synthesized interrupt number.

This isn't accurate: R2 contains the value that should end up in SRR1
when we are back in the secure guest.  HSRR0 and HSRR1 contain the
instruction pointer and MSR that the guest should run with.  They may
be different from the instruction pointer and MSR that the guest vCPU
last had, if the hypervisor has synthesized an interrupt for the
guest.  The ultravisor needs to detect this case and respond
appropriately.

> Thanks to input from Paul Mackerras, Ram Pai and Mike Anderson.
> 
> Signed-off-by: Sukadev Bhattiprolu 
> Signed-off-by: Claudio Carvalho 

Apart from that comment on the patch description -

Acked-by: Paul Mackerras 


Re: [PATCH v2 1/3] KVM: PPC: Book3S HV: Fix race in re-enabling XIVE escalation interrupts

2019-08-14 Thread Paul Mackerras
On Wed, Aug 14, 2019 at 02:46:38PM +1000, Jordan Niethe wrote:
> On Tue, 2019-08-13 at 20:03 +1000, Paul Mackerras wrote:

[snip]
> > diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > index 337e644..2e7e788 100644
> > --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > @@ -2831,29 +2831,39 @@ kvm_cede_prodded:
> >  kvm_cede_exit:
> > ld  r9, HSTATE_KVM_VCPU(r13)
> >  #ifdef CONFIG_KVM_XICS
> > -   /* Abort if we still have a pending escalation */
> > +   /* are we using XIVE with single escalation? */
> > +   ld  r10, VCPU_XIVE_ESC_VADDR(r9)
> > +   cmpdi   r10, 0
> > +   beq 3f
> > +   li  r6, XIVE_ESB_SET_PQ_00
> Would it make sense to put the above instruction down into the 4: label
> instead? If we do not branch to 4, r6 is overwritten anyway.

Right.

> I think that would save a load when we do not branch to 4. Also it

Well, li is a load immediate rather than a load ("load" would normally
imply a load from memory).  Load-immediate instructions are
essentially free since they can easily be executed in parallel with
other instructions and execute in a single cycle.

> would mean that you could use r5 everywhere instead of changing it to
> r6? 

Yes.  If I have to respin the patch for other reasons then I will
rearrange things as you suggest.  I don't think it's worth respinning
just for this change -- it won't reduce the total number of
instructions, and I strongly doubt there would be any measurable
performance difference.

> > +   /*
> > +* If we still have a pending escalation, abort the cede,
> > +* and we must set PQ to 10 rather than 00 so that we don't
> > +* potentially end up with two entries for the escalation
> > +* interrupt in the XIVE interrupt queue.  In that case
> > +* we also don't want to set xive_esc_on to 1 here in
> > +* case we race with xive_esc_irq().
> > +*/
> > lbz r5, VCPU_XIVE_ESC_ON(r9)
> > cmpwi   r5, 0
> > -   beq 1f
> > +   beq 4f
> > li  r0, 0
> > stb r0, VCPU_CEDED(r9)
> > -1: /* Enable XIVE escalation */
> > -   li  r5, XIVE_ESB_SET_PQ_00
> > +   li  r6, XIVE_ESB_SET_PQ_10
> > +   b   5f
> > +4: li  r0, 1
> > +   stb r0, VCPU_XIVE_ESC_ON(r9)
> > +   /* make sure store to xive_esc_on is seen before xive_esc_irq runs */
> > +   sync
> > +5: /* Enable XIVE escalation */
> > mfmsr   r0
> > andi.   r0, r0, MSR_DR  /* in real mode? */
> > beq 1f
> > -   ld  r10, VCPU_XIVE_ESC_VADDR(r9)
> > -   cmpdi   r10, 0
> > -   beq 3f
> > -   ldx r0, r10, r5
> > +   ldx r0, r10, r6
> > b   2f
> >  1: ld  r10, VCPU_XIVE_ESC_RADDR(r9)
> > -   cmpdi   r10, 0
> > -   beq 3f
> > -   ldcix   r0, r10, r5
> > +   ldcix   r0, r10, r6
> >  2: sync
> > -   li  r0, 1
> > -   stb r0, VCPU_XIVE_ESC_ON(r9)
> >  #endif /* CONFIG_KVM_XICS */
> >  3: b   guest_exit_cont
> >  

Paul.
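
For context, the PQ manipulation the assembly above performs is an
ordinary ESB MMIO load; a hedged C sketch of the equivalent operation
(modelled on the XIVE driver's ESB accessors, simplified here):

/*
 * An ESB load at a given offset returns the old PQ bits and installs
 * new ones: XIVE_ESB_SET_PQ_00 re-enables the interrupt, while
 * XIVE_ESB_SET_PQ_10 leaves it masked.
 */
static u64 xive_esb_load(void __iomem *esb_mmio, u32 offset)
{
	u64 old_pq = in_be64(esb_mmio + offset);

	/* make sure the PQ update has taken effect before continuing */
	mb();
	return old_pq;
}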

