Re: [PATCH] KVM: selftests: Rename vm_handle_exception in evmcs test

2021-06-04 Thread Ricardo Koller
On Fri, Jun 04, 2021 at 09:26:54PM +, Sean Christopherson wrote:
> On Fri, Jun 04, 2021, Ricardo Koller wrote:
> > Kernel test robot reports this:
> > 
> > > /usr/bin/ld: tools/testing/selftests/kvm/x86_64/evmcs_test.c:157: 
> > > undefined reference to `vm_handle_exception'
> > > /usr/bin/ld: tools/testing/selftests/kvm/x86_64/evmcs_test.c:158: 
> > > undefined reference to `vm_handle_exception'
> > > collect2: error: ld returned 1 exit status
> > 
> > Fix it by renaming vm_handle_exception to vm_install_vector_handler in
> > evmcs_test.c.
> > 
> > Fixes: a2bad6a990a4 ("KVM: selftests: Rename vm_handle_exception")
> 
> Belated code review...

Thanks for the review.

> Can we rename the helper to vm_install_exception_handler()?
> 
> In x86, "vector" is the number of the exception and "vectoring" is the process
> of determining the resulting vector that gets delivered to software (e.g. when
> dealing with contributory faults like #GP->#PF->#DF), but the thing that's
> being handled is an exception.

Got it. What about this renaming:

  vm_handle_exception(vec)  -> vm_install_exception_handler(vec)
  vm_install_exception_handler(vec, ec) -> vm_install_sync_handler(vec, ec)
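
For reference, a quick sketch of how the prototypes would look with that
renaming applied to the aarch64 declarations quoted below (proposal only,
not committed code):

  typedef void (*handler_fn)(struct ex_regs *);

  /* Per-vector handler, matching the x86 selftest usage. */
  void vm_install_exception_handler(struct kvm_vm *vm, int vector,
                                    handler_fn handler);

  /* Synchronous exception handler, routed on vector + ESR exception class. */
  void vm_install_sync_handler(struct kvm_vm *vm, int vector, int ec,
                               handler_fn handler);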

> 
> arm appears to have similar terminology.  And looking at the arm code, it's
> very confusing to have a helper vm_install_vector_handler() install into
> exception_handlers, _not_ into vector_handlers.  Calling the vector_handlers
> "default" handlers is also confusing, as "default" usually implies the thing
> can be overwritten.  But in this case, the "default" handler is just another
> layer in the routing.
> 
> The multiple layers of routing is also confusing and a bit hard to wade
> through for the uninitiated.  The whole thing can be made more straightforward
> by doing away with the intermediate routing, whacking ~50 lines of code in the
> process.
> E.g. (definitely not functional code):

I think that works and it does remove a bunch of code. Just need to play
with the idea and check that it can cover all cases.
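
To make sure I understand the direction, here is a minimal sketch of the
single-table routing (the table bounds and the unhandled-exception behaviour
are my own placeholders, not taken from your diff):

  /* One table indexed by vector and ESR exception class; non-synchronous
   * vectors would simply use EC 0.  VECTOR_NUM/ESR_EC_NUM are placeholders. */
  static handler_fn exception_handlers[VECTOR_NUM][ESR_EC_NUM];

  void route_exception(struct ex_regs *regs, int vector)
  {
          unsigned long esr = read_sysreg(esr_el1);
          unsigned int ec = (esr >> 26) & 0x3f;   /* ESR_ELx.EC */
          handler_fn fn = exception_handlers[vector][ec];

          if (fn)
                  fn(regs);
          /* else: report/assert on the unhandled exception */
  }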

For now, given that the build is broken, what about this series of
patches:

1. keep this patch to fix x86 kvm selftests
2. rename both arm and x86 to vm_install_exception_handler and 
vm_install_sync_handler
3. restructure the internals of exception handling in arm

Alternatively, I can send 1+2 together and then 3. What do you think?

Thanks,
Ricardo

> 
> diff --git a/tools/testing/selftests/kvm/aarch64/debug-exceptions.c 
> b/tools/testing/selftests/kvm/aarch64/debug-exceptions.c
> index 51c42ac24dca..c784e4b770cf 100644
> --- a/tools/testing/selftests/kvm/aarch64/debug-exceptions.c
> +++ b/tools/testing/selftests/kvm/aarch64/debug-exceptions.c
> @@ -212,15 +212,15 @@ int main(int argc, char *argv[])
> exit(KSFT_SKIP);
> }
>  
> -   vm_install_exception_handler(vm, VECTOR_SYNC_CURRENT,
> +   vm_install_exception_handler_ec(vm, VECTOR_SYNC_CURRENT,
> ESR_EC_BRK_INS, guest_sw_bp_handler);
> -   vm_install_exception_handler(vm, VECTOR_SYNC_CURRENT,
> +   vm_install_exception_handler_ec(vm, VECTOR_SYNC_CURRENT,
> ESR_EC_HW_BP_CURRENT, guest_hw_bp_handler);
> -   vm_install_exception_handler(vm, VECTOR_SYNC_CURRENT,
> +   vm_install_exception_handler_ec(vm, VECTOR_SYNC_CURRENT,
> ESR_EC_WP_CURRENT, guest_wp_handler);
> -   vm_install_exception_handler(vm, VECTOR_SYNC_CURRENT,
> +   vm_install_exception_handler_ec(vm, VECTOR_SYNC_CURRENT,
> ESR_EC_SSTEP_CURRENT, guest_ss_handler);
> -   vm_install_exception_handler(vm, VECTOR_SYNC_CURRENT,
> +   vm_install_exception_handler_ec(vm, VECTOR_SYNC_CURRENT,
> ESR_EC_SVC64, guest_svc_handler);
>  
> for (stage = 0; stage < 7; stage++) {
> diff --git a/tools/testing/selftests/kvm/include/aarch64/processor.h 
> b/tools/testing/selftests/kvm/include/aarch64/processor.h
> index 1a3abe1037b0..211cb684577a 100644
> --- a/tools/testing/selftests/kvm/include/aarch64/processor.h
> +++ b/tools/testing/selftests/kvm/include/aarch64/processor.h
> @@ -110,10 +110,10 @@ void vm_init_descriptor_tables(struct kvm_vm *vm);
>  void vcpu_init_descriptor_tables(struct kvm_vm *vm, uint32_t vcpuid);
>  
>  typedef void(*handler_fn)(struct ex_regs *);
> -void vm_install_exception_handler(struct kvm_vm *vm,
> -   int vector, int ec, handler_fn handler);
> -void vm_install_vector_handler(struct kvm_vm *vm,
> -   int vector, handler_fn handler);
> +void vm_install_exception_handler_ec(struct kvm_vm *vm, int vector, int ec,
> +handler_fn handler);
> +void vm_install_exception_handler(struct kvm_vm *vm, int vector,
> + handler_fn handler);
>  
>  #define write_sysreg(reg, val)   \
>  ({  

[PATCH] KVM: selftests: Rename vm_handle_exception in evmcs test

2021-06-04 Thread Ricardo Koller
Kernel test robot reports this:

> /usr/bin/ld: tools/testing/selftests/kvm/x86_64/evmcs_test.c:157: undefined 
> reference to `vm_handle_exception'
> /usr/bin/ld: tools/testing/selftests/kvm/x86_64/evmcs_test.c:158: undefined 
> reference to `vm_handle_exception'
> collect2: error: ld returned 1 exit status

Fix it by renaming vm_handle_exception to vm_install_vector_handler in
evmcs_test.c.

Fixes: a2bad6a990a4 ("KVM: selftests: Rename vm_handle_exception")
Reported-by: kernel test robot 
Signed-off-by: Ricardo Koller 
---
 tools/testing/selftests/kvm/x86_64/evmcs_test.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/kvm/x86_64/evmcs_test.c 
b/tools/testing/selftests/kvm/x86_64/evmcs_test.c
index 63096cea26c6..278711723f4b 100644
--- a/tools/testing/selftests/kvm/x86_64/evmcs_test.c
+++ b/tools/testing/selftests/kvm/x86_64/evmcs_test.c
@@ -154,8 +154,8 @@ int main(int argc, char *argv[])
 
vm_init_descriptor_tables(vm);
vcpu_init_descriptor_tables(vm, VCPU_ID);
-   vm_handle_exception(vm, UD_VECTOR, guest_ud_handler);
-   vm_handle_exception(vm, NMI_VECTOR, guest_nmi_handler);
+   vm_install_vector_handler(vm, UD_VECTOR, guest_ud_handler);
+   vm_install_vector_handler(vm, NMI_VECTOR, guest_nmi_handler);
 
pr_info("Running L1 which uses EVMCS to run L2\n");
 
-- 
2.32.0.rc1.229.g3e70b5a671-goog



Re: [PATCH v13 7/8] KVM: arm64: ioctl to fetch/store tags in a guest

2021-06-04 Thread Catalin Marinas
On Fri, Jun 04, 2021 at 02:09:50PM +0100, Steven Price wrote:
> On 04/06/2021 12:42, Catalin Marinas wrote:
> > On Fri, Jun 04, 2021 at 12:15:56PM +0100, Steven Price wrote:
> >> On 03/06/2021 18:13, Catalin Marinas wrote:
> >>> On Mon, May 24, 2021 at 11:45:12AM +0100, Steven Price wrote:
>  diff --git a/arch/arm64/include/uapi/asm/kvm.h 
>  b/arch/arm64/include/uapi/asm/kvm.h
>  index 24223adae150..b3edde68bc3e 100644
>  --- a/arch/arm64/include/uapi/asm/kvm.h
>  +++ b/arch/arm64/include/uapi/asm/kvm.h
>  @@ -184,6 +184,17 @@ struct kvm_vcpu_events {
>   __u32 reserved[12];
>   };
>   
>  +struct kvm_arm_copy_mte_tags {
>  +__u64 guest_ipa;
>  +__u64 length;
>  +void __user *addr;
>  +__u64 flags;
>  +__u64 reserved[2];
>  +};
>  +
>  +#define KVM_ARM_TAGS_TO_GUEST   0
>  +#define KVM_ARM_TAGS_FROM_GUEST 1
>  +
>   /* If you need to interpret the index values, here is the key: */
>   #define KVM_REG_ARM_COPROC_MASK 0x0FFF
>   #define KVM_REG_ARM_COPROC_SHIFT16
>  diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>  index e89a5e275e25..baa33359e477 100644
>  --- a/arch/arm64/kvm/arm.c
>  +++ b/arch/arm64/kvm/arm.c
>  @@ -1345,6 +1345,13 @@ long kvm_arch_vm_ioctl(struct file *filp,
>   
>   return 0;
>   }
>  +case KVM_ARM_MTE_COPY_TAGS: {
>  +struct kvm_arm_copy_mte_tags copy_tags;
>  +
>  +if (copy_from_user(&copy_tags, argp, sizeof(copy_tags)))
>  +return -EFAULT;
>  +return kvm_vm_ioctl_mte_copy_tags(kvm, &copy_tags);
>  +}
> >>>
> >>> I wonder whether we need an update of the user structure following a
> >>> fault, like how much was copied etc. In case of an error, some tags were
> >>> copied and the VMM may want to skip the page before continuing. But here
> >>> there's no such information provided.
> >>>
> >>> On the ptrace interface, we return 0 on the syscall if any bytes were
> >>> copied and update iov_len to such number. Maybe you want to still return
> >>> an error here but updating copy_tags.length would be nice (and, of
> >>> course, a copy_to_user() back).
> >>
> >> Good idea - as you suggest I'll make it update length with the number of
> >> bytes not processed. Although in general I think we're expecting the VMM
> >> to know where the memory is so this is more of a programming error - but
> >> could still be useful for debugging.
> > 
> > Or update it to the number of bytes copied to be consistent with
> > ptrace()'s iov.len. On success, the structure is effectively left
> > unchanged.
> 
> I was avoiding that because it confuses the error code when the initial
> copy_from_user() fails. In that case the structure is clearly unchanged,
> so you can only tell from a -EFAULT return that nothing happened. By
> returning the number of bytes left you can return an error code along
> with the information that the copy only half completed.
> 
> It also seems cleaner to leave the structure unchanged if e.g. the flags
> or reserved fields are invalid rather than having to set length=0 to
> signal that nothing was done.
> 
> Although I do feel like arguing whether to use a ptrace() interface or a
> copy_{to,from}_user() interface is somewhat ridiculous considering
> neither are exactly considered good.
> 
> Rather than changing the structure we could return either an error code
> (if nothing was copied) or the number of bytes left. That way ioctl()==0
> means complete success, >0 means partial success and <0 means complete
> failure and provides a detailed error code. The ioctl() can be repeated
> (with adjusted pointers) if it returns >0 and a detailed error is needed.

That would be more like read/write (nearly, those always return the
amount copied). Anyway, I don't have any strong preference, I'll leave
the details up to you as long as there is some indication of how much
was copied or left.

-- 
Catalin


Re: [RFC][PATCH] kvm: add suspend pm-notifier

2021-06-04 Thread Sergey Senozhatsky
On (21/06/04 09:24), Paolo Bonzini wrote:
> On 04/06/21 09:21, Vitaly Kuznetsov wrote:
> > >   preempt_notifier_inc();
> > > + kvm_init_pm_notifier(kvm);
> > You've probably thought it through and I didn't but wouldn't it be
> > easier to have one global pm_notifier call for KVM which would go
> > through the list of VMs instead of registering/deregistering a
> > pm_notifier call for every created/destroyed VM?
> 
> That raises questions on the locking, i.e. if we can we take the kvm_lock
> safely from the notifier.

Right, I wanted to take the VM lock, rather than subsystem lock
(kvm_lock).


Re: [RFC][PATCH] kvm: add suspend pm-notifier

2021-06-04 Thread Sergey Senozhatsky
On (21/06/04 09:46), Marc Zyngier wrote:
[..]
> > +void kvm_arch_pm_notifier(struct kvm *kvm)
> > +{
> > +#ifdef CONFIG_PM
> > +   int c;
> > +
> > +   mutex_lock(&kvm->lock);
> > +   for (c = 0; c < kvm->created_vcpus; c++) {
> > +   struct kvm_vcpu *vcpu = kvm->vcpus[c];
> > +   int r;
> > +
> > +   if (!vcpu)
> > +   continue;
> 
> Wouldn't kvm_for_each_vcpu() avoid this kind of checks?

Right, that's what I do in v2, which I haven't posted yet.

[..]
> > +#include 
> > +
> >  #ifndef KVM_MAX_VCPU_ID
> >  #define KVM_MAX_VCPU_ID KVM_MAX_VCPUS
> >  #endif
> > @@ -579,6 +581,10 @@ struct kvm {
> > pid_t userspace_pid;
> > unsigned int max_halt_poll_ns;
> > u32 dirty_ring_size;
> > +
> > +#ifdef CONFIG_PM
> > +   struct notifier_block pm_notifier;
> > +#endif
> 
> I'd certainly like to be able to opt out from this on architectures
> that do not implement anything useful in the PM callbacks.

Well, on the other hand, PM callbacks are harmless on those archs: they
won't override the __weak function.

> Please consider making this an independent config option that individual
> archs can buy into.

Sure, I can take a look into this, but how is this better than __weak
function? (that's a real question)
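
(For context, the __weak arrangement I have in mind is roughly the sketch
below; the file placement and the empty default body are illustrative only:)

  /* Generic code, e.g. virt/kvm/kvm_main.c: weak default that does nothing. */
  void __weak kvm_arch_pm_notifier(struct kvm *kvm)
  {
  }

  /* An architecture that cares supplies a strong definition in its own
   * arch code; the linker picks it over the weak one.  Everyone else
   * silently gets the no-op above. */
  void kvm_arch_pm_notifier(struct kvm *kvm)
  {
          /* arch-specific suspend/hibernate preparation for this VM */
  }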

[..]
> > +#ifdef CONFIG_PM
> > +static int kvm_pm_notifier_call(struct notifier_block *bl,
> > +   unsigned long state,
> > +   void *unused)
> > +{
> > +   struct kvm *kvm = container_of(bl, struct kvm, pm_notifier);
> > +
> > +   switch (state) {
> > +   case PM_HIBERNATION_PREPARE:
> > +   case PM_SUSPEND_PREPARE:
> > +   kvm_arch_pm_notifier(kvm);
> 
> How about passing the state to the notifier callback? I'd expect it to
> be useful to do something on resume too.

For different states we can have different kvm_arch functions instead.
kvm_arch_pm_notifier() can be renamed to kvm_arch_suspend_notifier(),
so that we don't need to have `switch (state)` in every arch-code. Then
for resume/post resume states we can have kvm_arch_resume_notifier()
arch functions.
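
Roughly like the sketch below (the PM_POST_* handling and the
kvm_arch_*_notifier() names are hypothetical at this point):

  static int kvm_pm_notifier_call(struct notifier_block *bl,
                                  unsigned long state, void *unused)
  {
          struct kvm *kvm = container_of(bl, struct kvm, pm_notifier);

          switch (state) {
          case PM_HIBERNATION_PREPARE:
          case PM_SUSPEND_PREPARE:
                  kvm_arch_suspend_notifier(kvm);
                  break;
          case PM_POST_HIBERNATION:
          case PM_POST_SUSPEND:
                  kvm_arch_resume_notifier(kvm);
                  break;
          }
          return NOTIFY_DONE;
  }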

> > +   break;
> > +   }
> > +   return NOTIFY_DONE;
> > +}
> > +
> > +static void kvm_init_pm_notifier(struct kvm *kvm)
> > +{
> > +   kvm->pm_notifier.notifier_call = kvm_pm_notifier_call;
> > +   kvm->pm_notifier.priority = INT_MAX;
> 
> How is this priority determined?

Nothing magical here. I want this to be executed first, before we suspend
ftrace, RCU and the like. Besides, KVM is usually the last one to register
its PM callbacks, so there can already be an earlier entry on the notifier
list that returns NOTIFY_STOP_MASK before the KVM PM-notifier entry is reached.


Re: [RFC PATCH 0/3] kvm/arm: New VMID allocator based on asid(2nd approach)

2021-06-04 Thread Marc Zyngier
On Fri, 04 Jun 2021 15:51:29 +0100,
Shameerali Kolothum Thodi  wrote:
> 
> Hi Marc,
> 
> > -Original Message-
> > From: Marc Zyngier [mailto:m...@kernel.org]
> > Sent: 04 June 2021 14:55
> > To: Shameerali Kolothum Thodi 
> > Cc: linux-arm-ker...@lists.infradead.org; kvmarm@lists.cs.columbia.edu;
> > linux-ker...@vger.kernel.org; w...@kernel.org; catalin.mari...@arm.com;
> > james.mo...@arm.com; julien.thierry.k...@gmail.com;
> > suzuki.poul...@arm.com; jean-phili...@linaro.org; Alexandru Elisei
> > ; Linuxarm 
> > Subject: Re: [RFC PATCH 0/3] kvm/arm: New VMID allocator based on asid(2nd
> > approach)
> > 
> > On Fri, 04 Jun 2021 09:13:10 +0100,
> > Shameerali Kolothum Thodi 
> > wrote:
> > >
> > > Hi,
> > >
> > > > -Original Message-
> > > > From: Shameerali Kolothum Thodi
> > > > Sent: 06 May 2021 17:52
> > > > To: linux-arm-ker...@lists.infradead.org; kvmarm@lists.cs.columbia.edu;
> > > > linux-ker...@vger.kernel.org
> > > > Cc: m...@kernel.org; w...@kernel.org; catalin.mari...@arm.com;
> > > > james.mo...@arm.com; julien.thierry.k...@gmail.com;
> > > > suzuki.poul...@arm.com; jean-phili...@linaro.org; Linuxarm
> > > > 
> > > > Subject: [RFC PATCH 0/3] kvm/arm: New VMID allocator based on asid(2nd
> > > > approach)
> > > >
> > > > This is based on a suggestion from Will [0] to try out the asid
> > > > based kvm vmid solution as a separate VMID allocator instead of
> > > > the shared lib approach attempted in v4[1].
> > > >
> > > > The idea is to compare both the approaches and see whether the
> > > > shared lib solution with callbacks make sense or not.
> > >
> > > A gentle ping on this. Please take a look and let me know.
> > 
> > I had a look and I don't overly dislike it. I'd like to see the code
> > without the pinned stuff though, at least to ease the reviewing. I
> > haven't tested it in anger, but I have pushed the rebased code at [1]
> > as it really didn't apply to 5.13-rc4.
> 
> Thanks for taking a look and the rebase. I will remove the pinned stuff
> in the next revision as that was added just to compare against the previous
> version.
> 
> > 
> > One thing I'm a bit worried about is that we so far relied on VMID 0
> > never being allocated to a guest, which is now crucial for protected
> > KVM. I can't really convince myself that this can never happen with
> > this.
> 
> Hmm..not sure I quite follow that. As per the current logic vmid 0 is
> reserved and is not allocated to Guest.

And that's the bit I'm struggling to validate here. I guess it works
because cur_idx is set to 1 in new_vmid().
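
(In other words, the property being relied on is just that the bitmap scan
never starts at bit 0. Sketching that out, with vmid_map/NUM_USER_VMIDS as
stand-in names rather than the actual patch code:)

  static unsigned long pick_new_vmid(unsigned long *vmid_map)
  {
          unsigned long cur_idx = 1;      /* never consider bit 0 == VMID 0 */
          unsigned long vmid;

          vmid = find_next_zero_bit(vmid_map, NUM_USER_VMIDS, cur_idx);
          if (vmid >= NUM_USER_VMIDS)
                  return 0;               /* caller rolls over the generation */

          __set_bit(vmid, vmid_map);
          return vmid;
  }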

> 
> > Plus, I've found this nugget:
> > 
> >  > max_pinned_vmids = NUM_USER_VMIDS - num_possible_cpus() - 2;
> > 
> > 
> > What is this "- 2"? My hunch is that it should really be "- 1" as VMID
> > 0 is reserved, and we have no equivalent of KPTI for S2.
> 
> I think this is more related to the "pinned vmid" stuff and was borrowed from
> the asid_update_limit() fn in arch/arm64/mm/context.c. But I missed the
> comment that explains the reason behind it. It says,
> 
> ---x---
>   /*
>* There must always be an ASID available after rollover. Ensure that,
>* even if all CPUs have a reserved ASID and the maximum number of ASIDs
>* are pinned, there still is at least one empty slot in the ASID map.
>*/
>   max_pinned_asids = num_available_asids - num_possible_cpus() - 2;
> ---x---
> 
> So this is to make sure we will have at least one VMID available
> after rollover in case we have pinned the max number of VMIDs. I
> will include that comment to make it clear.

That doesn't really explain the -2. Or is it that we have one for the
extra empty slot, and one for the reserved?

Jean-Philippe?

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.


RE: [RFC PATCH 0/3] kvm/arm: New VMID allocator based on asid(2nd approach)

2021-06-04 Thread Shameerali Kolothum Thodi
Hi Marc,

> -Original Message-
> From: Marc Zyngier [mailto:m...@kernel.org]
> Sent: 04 June 2021 14:55
> To: Shameerali Kolothum Thodi 
> Cc: linux-arm-ker...@lists.infradead.org; kvmarm@lists.cs.columbia.edu;
> linux-ker...@vger.kernel.org; w...@kernel.org; catalin.mari...@arm.com;
> james.mo...@arm.com; julien.thierry.k...@gmail.com;
> suzuki.poul...@arm.com; jean-phili...@linaro.org; Alexandru Elisei
> ; Linuxarm 
> Subject: Re: [RFC PATCH 0/3] kvm/arm: New VMID allocator based on asid(2nd
> approach)
> 
> On Fri, 04 Jun 2021 09:13:10 +0100,
> Shameerali Kolothum Thodi 
> wrote:
> >
> > Hi,
> >
> > > -Original Message-
> > > From: Shameerali Kolothum Thodi
> > > Sent: 06 May 2021 17:52
> > > To: linux-arm-ker...@lists.infradead.org; kvmarm@lists.cs.columbia.edu;
> > > linux-ker...@vger.kernel.org
> > > Cc: m...@kernel.org; w...@kernel.org; catalin.mari...@arm.com;
> > > james.mo...@arm.com; julien.thierry.k...@gmail.com;
> > > suzuki.poul...@arm.com; jean-phili...@linaro.org; Linuxarm
> > > 
> > > Subject: [RFC PATCH 0/3] kvm/arm: New VMID allocator based on asid(2nd
> > > approach)
> > >
> > > This is based on a suggestion from Will [0] to try out the asid
> > > based kvm vmid solution as a separate VMID allocator instead of
> > > the shared lib approach attempted in v4[1].
> > >
> > > The idea is to compare both the approaches and see whether the
> > > shared lib solution with callbacks make sense or not.
> >
> > A gentle ping on this. Please take a look and let me know.
> 
> I had a look and I don't overly dislike it. I'd like to see the code
> without the pinned stuff though, at least to ease the reviewing. I
> haven't tested it in anger, but I have pushed the rebased code at [1]
> as it really didn't apply to 5.13-rc4.

Thanks for taking a look and the rebase. I will remove the pinned stuff
in the next revision as that was added just to compare against the previous
version.

> 
> One thing I'm a bit worried about is that we so far relied on VMID 0
> never being allocated to a guest, which is now crucial for protected
> KVM. I can't really convince myself that this can never happen with
> this.

Hmm..not sure I quite follow that. As per the current logic vmid 0 is
reserved and is not allocated to Guest. 

> Plus, I've found this nugget:
> 
>max_pinned_vmids = NUM_USER_VMIDS - num_possible_cpus() - 2;
> 
> 
> What is this "- 2"? My hunch is that it should really be "- 1" as VMID
> 0 is reserved, and we have no equivalent of KPTI for S2.

I think this is more related to the "pinned vmid" stuff and was borrowed from
the asid_update_limit() fn in arch/arm64/mm/context.c. But I missed the
comment that explains the reason behind it. It says,

---x---
/*
 * There must always be an ASID available after rollover. Ensure that,
 * even if all CPUs have a reserved ASID and the maximum number of ASIDs
 * are pinned, there still is at least one empty slot in the ASID map.
 */
max_pinned_asids = num_available_asids - num_possible_cpus() - 2;
---x---

So this is to make sure we will have at least one VMID available after rollover
in case we have pinned the max number of VMIDs. I will include that comment
to make it clear.

Thanks,
Shameer

> 
>   M.
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=kvm-arm64/mmu/vmid
> 
> --
> Without deviation from the norm, progress is not possible.


Re: [RFC PATCH 4/4] KVM: arm64: Introduce KVM_CAP_ARM_PROTECTED_VM

2021-06-04 Thread Mark Rutland
On Thu, Jun 03, 2021 at 07:33:47PM +0100, Will Deacon wrote:
> Introduce a new VM capability, KVM_CAP_ARM_PROTECTED_VM, which can be
> used to isolate guest memory from the host. For now, the EL2 portion is
> missing, so this documents and exposes the user ABI for the host.
> 
> Signed-off-by: Will Deacon 
> ---
>  Documentation/virt/kvm/api.rst|  69 
>  arch/arm64/include/asm/kvm_host.h |  10 +++
>  arch/arm64/include/uapi/asm/kvm.h |   9 +++
>  arch/arm64/kvm/arm.c  |  18 +++---
>  arch/arm64/kvm/mmu.c  |   3 +
>  arch/arm64/kvm/pkvm.c | 104 ++
>  include/uapi/linux/kvm.h  |   1 +
>  7 files changed, 205 insertions(+), 9 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 7fcb2fd38f42..dfbaf905c435 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6362,6 +6362,75 @@ default.
>  
>  See Documentation/x86/sgx/2.Kernel-internals.rst for more details.
>  
> +7.26 KVM_CAP_ARM_PROTECTED_VM
> +-
> +
> +:Architectures: arm64
> +:Target: VM
> +:Parameters: flags is a single KVM_CAP_ARM_PROTECTED_VM_FLAGS_* value
> +
> +The presence of this capability indicates that KVM supports running in a
> +configuration where the host Linux kernel does not have access to guest 
> memory.
> +On such a system, a small hypervisor layer at EL2 can configure the stage-2
> +page tables for both the CPU and any DMA-capable devices to protect guest
> +memory pages so that they are inaccessible to the host unless access is 
> granted
> +explicitly by the guest.
> +
> +The 'flags' parameter is defined as follows:
> +
> +7.26.1 KVM_CAP_ARM_PROTECTED_VM_FLAGS_ENABLE
> +
> +
> +:Capability: 'flag' parameter to KVM_CAP_ARM_PROTECTED_VM
> +:Architectures: arm64
> +:Target: VM
> +:Parameters: args[0] contains memory slot ID to hold guest firmware
> +:Returns: 0 on success; negative error code on failure
> +
> +Enabling this capability causes all memory slots of the specified VM to be
> +unmapped from the host system and put into a state where they are no longer
> +configurable. The memory slot corresponding to the ID passed in args[0] is
> +populated with the guest firmware image provided by the host firmware.

As on the prior patch, I don't quite follow the rationale for the guest
fw coming from the host fw, and it seems to go against the usual design
for VM contents, so I fear it could be a problem in future (even if not
in android's specific model for usage).

> +The first vCPU to enter the guest is defined to be the primary vCPU. All 
> other
> +vCPUs belonging to the VM are secondary vCPUs.
> +
> +All vCPUs belonging to a VM with this capability enabled are initialised to a
> +pre-determined reset state

What is that "pre-determined reset state"? e.g. is that just what KVM
does today, or is there something more specific (e.g. that might change
with the "Boot protocol version" below)?

> irrespective of any prior configuration according to
> +the KVM_ARM_VCPU_INIT ioctl, with the following exceptions for the primary
> +vCPU:
> +
> + === ===
> + Register(s) Reset value
> + === ===
> + X0-X14: Preserved (see KVM_SET_ONE_REG)
> + X15:Boot protocol version (0)

What's the "Boot protocol" in this context? Is that just referring to
this handover state, or is that something more involved?

> + X16-X30:Reserved (0)
> + PC: IPA base of firmware memory slot
> + SP: IPA end of firmware memory slot
> + === ===
> +
> +Secondary vCPUs belonging to a VM with this capability enabled will return
> +-EPERM in response to a KVM_RUN ioctl() if the vCPU was not initialised with
> +the KVM_ARM_VCPU_POWER_OFF feature.

I assume this means that protected VMs always get a trusted PSCI
implementation? It might be worth mentioning so (and worth considering
if that should always have the SMCCC bits too).

> +
> +There is no support for AArch32 at any exception level.

Is this only going to run on CPUs without AArch32 EL0? ... or does this
mean behaviour will be erratic if someone tries to run AArch32 EL0?

> +
> +It is an error to enable this capability on a VM after issuing a KVM_RUN
> +ioctl() on one of its vCPUs.
> +
> +7.26.2 KVM_CAP_ARM_PROTECTED_VM_FLAGS_INFO
> +--
> +
> +:Capability: 'flag' parameter to KVM_CAP_ARM_PROTECTED_VM
> +:Architectures: arm64
> +:Target: VM
> +:Parameters: args[0] contains pointer to 'struct kvm_protected_vm_info'
> +:Returns: 0 on success; negative error code on failure
> +
> +Populates the 'struct kvm_protected_vm_info' pointed to by args[0] with
> +information about the protected environment for the VM.
> +
>  8. Other capabilities.
>  ==
>  
> diff --git 

Re: [PATCH 3/4] KVM: arm64: Parse reserved-memory node for pkvm guest firmware region

2021-06-04 Thread Mark Rutland
On Thu, Jun 03, 2021 at 07:33:46PM +0100, Will Deacon wrote:
> Add support for a "linux,pkvm-guest-firmware-memory" reserved memory
> region, which can be used to identify a firmware image for protected
> VMs.

The idea that the guest's FW comes from the host's FW strikes me as
unusual; what's the rationale for this coming from the host FW? IIUC
other confidential compute VM environments allow you to load up whatever
virtual FW you want, but this is measured such that the virtual FW used
can be attested.

Thanks,
Mark.

> 
> Signed-off-by: Will Deacon 
> ---
>  arch/arm64/kvm/Makefile |  2 +-
>  arch/arm64/kvm/pkvm.c   | 52 +
>  2 files changed, 53 insertions(+), 1 deletion(-)
>  create mode 100644 arch/arm64/kvm/pkvm.c
> 
> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
> index 589921392cb1..61e054411831 100644
> --- a/arch/arm64/kvm/Makefile
> +++ b/arch/arm64/kvm/Makefile
> @@ -14,7 +14,7 @@ kvm-y := $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o 
> $(KVM)/eventfd.o \
>$(KVM)/vfio.o $(KVM)/irqchip.o \
>arm.o mmu.o mmio.o psci.o perf.o hypercalls.o pvtime.o \
>inject_fault.o va_layout.o handle_exit.o \
> -  guest.o debug.o reset.o sys_regs.o \
> +  guest.o debug.o pkvm.o reset.o sys_regs.o \
>vgic-sys-reg-v3.o fpsimd.o pmu.o \
>arch_timer.o trng.o\
>vgic/vgic.o vgic/vgic-init.o \
> diff --git a/arch/arm64/kvm/pkvm.c b/arch/arm64/kvm/pkvm.c
> new file mode 100644
> index ..7af5d03a3941
> --- /dev/null
> +++ b/arch/arm64/kvm/pkvm.c
> @@ -0,0 +1,52 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * KVM host (EL1) interface to Protected KVM (pkvm) code at EL2.
> + *
> + * Copyright (C) 2021 Google LLC
> + * Author: Will Deacon 
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +static struct reserved_mem *pkvm_firmware_mem;
> +
> +static int __init pkvm_firmware_rmem_err(struct reserved_mem *rmem,
> +  const char *reason)
> +{
> + phys_addr_t end = rmem->base + rmem->size;
> +
> + kvm_err("Ignoring pkvm guest firmware memory reservation [%pa - %pa]: %s\n",
> + &rmem->base, &end, reason);
> + return -EINVAL;
> +}
> +
> +static int __init pkvm_firmware_rmem_init(struct reserved_mem *rmem)
> +{
> + unsigned long node = rmem->fdt_node;
> +
> + if (kvm_get_mode() != KVM_MODE_PROTECTED)
> + return pkvm_firmware_rmem_err(rmem, "protected mode not enabled");
> +
> + if (pkvm_firmware_mem)
> + return pkvm_firmware_rmem_err(rmem, "duplicate reservation");
> +
> + if (!of_get_flat_dt_prop(node, "no-map", NULL))
> + return pkvm_firmware_rmem_err(rmem, "missing \"no-map\" property");
> +
> + if (of_get_flat_dt_prop(node, "reusable", NULL))
> + return pkvm_firmware_rmem_err(rmem, "\"reusable\" property unsupported");
> +
> + if (!PAGE_ALIGNED(rmem->base))
> + return pkvm_firmware_rmem_err(rmem, "base is not page-aligned");
> +
> + if (!PAGE_ALIGNED(rmem->size))
> + return pkvm_firmware_rmem_err(rmem, "size is not page-aligned");
> +
> + pkvm_firmware_mem = rmem;
> + return 0;
> +}
> +RESERVEDMEM_OF_DECLARE(pkvm_firmware, "linux,pkvm-guest-firmware-memory",
> +pkvm_firmware_rmem_init);
> -- 
> 2.32.0.rc0.204.g9fa02ecfa5-goog
> 


Re: [PATCH 2/4] KVM: arm64: Extend comment in has_vhe()

2021-06-04 Thread Mark Rutland
On Thu, Jun 03, 2021 at 07:33:45PM +0100, Will Deacon wrote:
> has_vhe() expands to a compile-time constant when evaluated from the VHE
> or nVHE code, alternatively checking a static key when called from
> elsewhere in the kernel. On face value, this looks like a case of
> premature optimization, but in fact this allows symbol references on
> VHE-specific code paths to be dropped from the nVHE object.
> 
> Expand the comment in has_vhe() to make this clearer, hopefully
> discouraging anybody from simplifying the code.
> 
> Cc: David Brazdil 
> Signed-off-by: Will Deacon 

Acked-by: Mark Rutland 

Mark.

> ---
>  arch/arm64/include/asm/virt.h | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/virt.h b/arch/arm64/include/asm/virt.h
> index 7379f35ae2c6..3218ca17f819 100644
> --- a/arch/arm64/include/asm/virt.h
> +++ b/arch/arm64/include/asm/virt.h
> @@ -111,6 +111,9 @@ static __always_inline bool has_vhe(void)
>   /*
>* Code only run in VHE/NVHE hyp context can assume VHE is present or
>* absent. Otherwise fall back to caps.
> +  * This allows the compiler to discard VHE-specific code from the
> +  * nVHE object, reducing the number of external symbol references
> +  * needed to link.
>*/
>   if (is_vhe_hyp_code())
>   return true;
> -- 
> 2.32.0.rc0.204.g9fa02ecfa5-goog
> 


Re: [PATCH v13 4/8] KVM: arm64: Introduce MTE VM feature

2021-06-04 Thread Catalin Marinas
On Fri, Jun 04, 2021 at 01:51:38PM +0100, Steven Price wrote:
> On 04/06/2021 12:36, Catalin Marinas wrote:
> > On Fri, Jun 04, 2021 at 11:42:11AM +0100, Steven Price wrote:
> >> On 03/06/2021 17:00, Catalin Marinas wrote:
> >>> On Mon, May 24, 2021 at 11:45:09AM +0100, Steven Price wrote:
>  @@ -971,8 +1007,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
>  phys_addr_t fault_ipa,
>   if (writable)
>   prot |= KVM_PGTABLE_PROT_W;
>   
>  -if (fault_status != FSC_PERM && !device)
>  +if (fault_status != FSC_PERM && !device) {
>  +ret = sanitise_mte_tags(kvm, pfn, vma_pagesize);
>  +if (ret)
>  +goto out_unlock;
> >>>
> >>> Maybe it was discussed in a previous version, why do we need this in
> >>> addition to kvm_set_spte_gfn()?
> >>
> >> kvm_set_spte_gfn() is only used for the MMU notifier path (e.g. if a
> >> memslot is changed by the VMM). For the initial access we will normally
> >> fault the page into stage 2 with user_mem_abort().
> > 
> > Right. Can we move the sanitise_mte_tags() call to
> > kvm_pgtable_stage2_map() instead or we don't have the all the
> > information needed?
> 
> I tried that before: kvm_pgtable_stage2_map() is shared with the
> hypervisor so sadly we can't go poking around in the host as this breaks
> on nVHE. I mentioned it in the v12 cover letter but it was in a wall of
> text:

Ah, I missed this in the cover letter (haven't read it).

So, apart from the nitpick with the early return for less indentation,
feel free to add:

Reviewed-by: Catalin Marinas 


Re: [PATCH 1/4] KVM: arm64: Ignore 'kvm-arm.mode=protected' when using VHE

2021-06-04 Thread Mark Rutland
On Thu, Jun 03, 2021 at 07:33:44PM +0100, Will Deacon wrote:
> Ignore 'kvm-arm.mode=protected' when using VHE so that kvm_get_mode()
> only returns KVM_MODE_PROTECTED on systems where the feature is available.

IIUC, since the introduction of the idreg-override code, and the
mutate_to_vhe stuff, passing 'kvm-arm.mode=protected' should make the
kernel stick to EL1, right? So this should only affect M1 (or other HW
with a similar impediment).

One minor comment below; otherwise:

Acked-by: Mark Rutland 

> 
> Cc: David Brazdil 
> Signed-off-by: Will Deacon 
> ---
>  Documentation/admin-guide/kernel-parameters.txt |  1 -
>  arch/arm64/kernel/cpufeature.c  | 10 +-
>  arch/arm64/kvm/arm.c|  6 +-
>  3 files changed, 6 insertions(+), 11 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> b/Documentation/admin-guide/kernel-parameters.txt
> index cb89dbdedc46..e85dbdf1ee8e 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -2300,7 +2300,6 @@
>  
>   protected: nVHE-based mode with support for guests whose
>  state is kept private from the host.
> -Not valid if the kernel is running in EL2.
>  
>   Defaults to VHE/nVHE based on hardware support.
>  
> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
> index efed2830d141..dc1f2e747828 100644
> --- a/arch/arm64/kernel/cpufeature.c
> +++ b/arch/arm64/kernel/cpufeature.c
> @@ -1773,15 +1773,7 @@ static void cpu_enable_mte(struct 
> arm64_cpu_capabilities const *cap)
>  #ifdef CONFIG_KVM
>  static bool is_kvm_protected_mode(const struct arm64_cpu_capabilities 
> *entry, int __unused)
>  {
> - if (kvm_get_mode() != KVM_MODE_PROTECTED)
> - return false;
> -
> - if (is_kernel_in_hyp_mode()) {
> - pr_warn("Protected KVM not available with VHE\n");
> - return false;
> - }
> -
> - return true;
> + return kvm_get_mode() == KVM_MODE_PROTECTED;
>  }
>  #endif /* CONFIG_KVM */
>  
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 1cb39c0803a4..8d5e23198dfd 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -2121,7 +2121,11 @@ static int __init early_kvm_mode_cfg(char *arg)
>   return -EINVAL;
>  
>   if (strcmp(arg, "protected") == 0) {
> - kvm_mode = KVM_MODE_PROTECTED;
> + if (!is_kernel_in_hyp_mode())
> + kvm_mode = KVM_MODE_PROTECTED;
> + else
> + pr_warn_once("Protected KVM not available with VHE\n");

... assuming this is only for M1, it might be better to say:

Protected KVM not available on this hardware

... since that doesn't suggest that other VHE-capable HW is also not
PKVM-capable.

Thanks,
Mark.


Re: [RFC PATCH 0/3] kvm/arm: New VMID allocator based on asid(2nd approach)

2021-06-04 Thread Marc Zyngier
On Fri, 04 Jun 2021 09:13:10 +0100,
Shameerali Kolothum Thodi  wrote:
> 
> Hi,
> 
> > -Original Message-
> > From: Shameerali Kolothum Thodi
> > Sent: 06 May 2021 17:52
> > To: linux-arm-ker...@lists.infradead.org; kvmarm@lists.cs.columbia.edu;
> > linux-ker...@vger.kernel.org
> > Cc: m...@kernel.org; w...@kernel.org; catalin.mari...@arm.com;
> > james.mo...@arm.com; julien.thierry.k...@gmail.com;
> > suzuki.poul...@arm.com; jean-phili...@linaro.org; Linuxarm
> > 
> > Subject: [RFC PATCH 0/3] kvm/arm: New VMID allocator based on asid(2nd
> > approach)
> > 
> > This is based on a suggestion from Will [0] to try out the asid
> > based kvm vmid solution as a separate VMID allocator instead of
> > the shared lib approach attempted in v4[1].
> > 
> > The idea is to compare both the approaches and see whether the
> > shared lib solution with callbacks make sense or not.
> 
> A gentle ping on this. Please take a look and let me know.

I had a look and I don't overly dislike it. I'd like to see the code
without the pinned stuff though, at least to ease the reviewing. I
haven't tested it in anger, but I have pushed the rebased code at [1]
as it really didn't apply to 5.13-rc4.

One thing I'm a bit worried about is that we so far relied on VMID 0
never being allocated to a guest, which is now crucial for protected
KVM. I can't really convince myself that this can never happen with
this. Plus, I've found this nugget:

	max_pinned_vmids = NUM_USER_VMIDS - num_possible_cpus() - 2;

What is this "- 2"? My hunch is that it should really be "- 1" as VMID
0 is reserved, and we have no equivalent of KPTI for S2.

M.

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=kvm-arm64/mmu/vmid

-- 
Without deviation from the norm, progress is not possible.


Re: [PATCH v13 7/8] KVM: arm64: ioctl to fetch/store tags in a guest

2021-06-04 Thread Steven Price
On 04/06/2021 12:42, Catalin Marinas wrote:
> On Fri, Jun 04, 2021 at 12:15:56PM +0100, Steven Price wrote:
>> On 03/06/2021 18:13, Catalin Marinas wrote:
>>> On Mon, May 24, 2021 at 11:45:12AM +0100, Steven Price wrote:
 diff --git a/arch/arm64/include/uapi/asm/kvm.h 
 b/arch/arm64/include/uapi/asm/kvm.h
 index 24223adae150..b3edde68bc3e 100644
 --- a/arch/arm64/include/uapi/asm/kvm.h
 +++ b/arch/arm64/include/uapi/asm/kvm.h
 @@ -184,6 +184,17 @@ struct kvm_vcpu_events {
__u32 reserved[12];
  };
  
 +struct kvm_arm_copy_mte_tags {
 +  __u64 guest_ipa;
 +  __u64 length;
 +  void __user *addr;
 +  __u64 flags;
 +  __u64 reserved[2];
 +};
 +
 +#define KVM_ARM_TAGS_TO_GUEST 0
 +#define KVM_ARM_TAGS_FROM_GUEST   1
 +
  /* If you need to interpret the index values, here is the key: */
  #define KVM_REG_ARM_COPROC_MASK   0x0FFF
  #define KVM_REG_ARM_COPROC_SHIFT  16
 diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
 index e89a5e275e25..baa33359e477 100644
 --- a/arch/arm64/kvm/arm.c
 +++ b/arch/arm64/kvm/arm.c
 @@ -1345,6 +1345,13 @@ long kvm_arch_vm_ioctl(struct file *filp,
  
return 0;
}
 +  case KVM_ARM_MTE_COPY_TAGS: {
 +  struct kvm_arm_copy_mte_tags copy_tags;
 +
>  +  if (copy_from_user(&copy_tags, argp, sizeof(copy_tags)))
>  +  return -EFAULT;
>  +  return kvm_vm_ioctl_mte_copy_tags(kvm, &copy_tags);
 +  }
>>>
>>> I wonder whether we need an update of the user structure following a
>>> fault, like how much was copied etc. In case of an error, some tags were
>>> copied and the VMM may want to skip the page before continuing. But here
>>> there's no such information provided.
>>>
>>> On the ptrace interface, we return 0 on the syscall if any bytes were
>>> copied and update iov_len to such number. Maybe you want to still return
>>> an error here but updating copy_tags.length would be nice (and, of
>>> course, a copy_to_user() back).
>>
>> Good idea - as you suggest I'll make it update length with the number of
>> bytes not processed. Although in general I think we're expecting the VMM
>> to know where the memory is so this is more of a programming error - but
>> could still be useful for debugging.
> 
> Or update it to the number of bytes copied to be consistent with
> ptrace()'s iov.len. On success, the structure is effectively left
> unchanged.

I was avoiding that because it confuses the error code when the initial
copy_from_user() fails. In that case the structure is clearly unchanged,
so you can only tell from a -EFAULT return that nothing happened. By
returning the number of bytes left you can return an error code along
with the information that the copy only half completed.

It also seems cleaner to leave the structure unchanged if e.g. the flags
or reserved fields are invalid rather than having to set length=0 to
signal that nothing was done.

Although I do feel like arguing whether to use a ptrace() interface or a
copy_{to,from}_user() interface is somewhat ridiculous considering
neither are exactly considered good.

Rather than changing the structure we could return either an error code
(if nothing was copied) or the number of bytes left. That way ioctl()==0
means complete success, >0 means partial success and <0 means complete
failure and provides a detailed error code. The ioctl() can be repeated
(with adjusted pointers) if it returns >0 and a detailed error is needed.
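
(Illustratively, the VMM side of that convention would be a loop along the
lines below; the struct and ioctl number are as in this patch, while the
helper itself and the addr adjustment are hypothetical and need <sys/ioctl.h>,
<errno.h> and the uapi header from this series:)

  /* Sketch only: retry a partially-completed tag copy. */
  static int copy_tags_all(int vm_fd, struct kvm_arm_copy_mte_tags args)
  {
          long ret;

          do {
                  ret = ioctl(vm_fd, KVM_ARM_MTE_COPY_TAGS, &args);
                  if (ret > 0) {
                          __u64 done = args.length - ret;

                          args.guest_ipa += done;
                          args.length = ret;
                          /* args.addr advances by the tags covering 'done'
                           * bytes; the packing is left out of this sketch. */
                  }
          } while (ret > 0);

          return ret ? -errno : 0;        /* 0 = done, <0 = nothing copied */
  }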

Steve


Re: [RFC PATCH 4/4] KVM: arm64: Introduce KVM_CAP_ARM_PROTECTED_VM

2021-06-04 Thread Sean Christopherson
On Thu, Jun 03, 2021, Will Deacon wrote:
> +Enabling this capability causes all memory slots of the specified VM to be
> +unmapped from the host system and put into a state where they are no longer
> +configurable.

What happens if userspace remaps the hva of the memslot?  In other words, how
will the implementation ensure the physical memory backing the guest is truly
inaccessible?

And how will the guest/host share memory?  E.g. what if the guest wants to set
up a new shared memory region and the host wants that to happen in a new 
memslot?

On a related topic, would the concept of guest-only memory be useful[*]?  The
idea is to require userspace to map guest-private memory with a dedicated VMA
flag.  That will allow the host kernel to prevent userspace (or the kernel) from
mapping guest-private memory, and will also allow KVM to verify that memory that
the guest wants to be private is indeed private.

[*] 
https://lkml.kernel.org/r/20210416154106.23721-14-kirill.shute...@linux.intel.com

> The memory slot corresponding to the ID passed in args[0] is
> +populated with the guest firmware image provided by the host firmware.

Why take a slot instead of an address range?  I assume copying the host firmware
into guest memory will utilize existing APIs, i.e. the memslot lookups to 
resolve
the GPA->HVA will Just Work (TM).

> +The first vCPU to enter the guest is defined to be the primary vCPU. All 
> other
> +vCPUs belonging to the VM are secondary vCPUs.
> +
> +All vCPUs belonging to a VM with this capability enabled are initialised to a
> +pre-determined reset state irrespective of any prior configuration according 
> to
> +the KVM_ARM_VCPU_INIT ioctl, with the following exceptions for the primary
> +vCPU:
> +
> + === ===
> + Register(s) Reset value
> + === ===
> + X0-X14: Preserved (see KVM_SET_ONE_REG)

What's the intended use case for allowing userspace to set a pile of registers?
Giving userspace control of vCPU state is tricky because the state is untrusted.
Is the state going to be attested in any way, or is the VMM trusted while the
VM is being created and only de-privileged later on?

> + X15:Boot protocol version (0)
> + X16-X30:Reserved (0)
> + PC: IPA base of firmware memory slot
> + SP: IPA end of firmware memory slot
> + === ===
> +

...

> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index c5d1f3c87dbd..e1d4a87d18e4 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1349,6 +1349,9 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>   bool writable = !(mem->flags & KVM_MEM_READONLY);
>   int ret = 0;
>  
> + if (kvm_vm_is_protected(kvm))
> + return -EPERM;

FWIW, this will miss the benign corner case where KVM_SET_USER_MEMORY_REGION
does not actualy change anything about an existing region.

A more practical problem is that this check won't happen until KVM has marked
the existing region as invalid in the DELETE or MOVE case.  So userspace can't
completely delete a region, but it can temporarily delete a region by briefly
making it invalid.  No idea what the ramifications of that are on arm64.

>   if (change != KVM_MR_CREATE && change != KVM_MR_MOVE &&
>   change != KVM_MR_FLAGS_ONLY)
>   return 0;

...

> +static int pkvm_vm_ioctl_enable(struct kvm *kvm, u64 slotid)
> +{
> + int ret = 0;

Deep into the nits: technically unnecessary since ret is guaranteed to be 
written
before being consumed.

> + mutex_lock(&kvm->lock);
> + if (kvm_vm_is_protected(kvm)) {
> + ret = -EPERM;
> + goto out_kvm_unlock;
> + }
> +
> + mutex_lock(&kvm->slots_lock);
> + ret = pkvm_enable(kvm, slotid);
> + if (ret)
> + goto out_slots_unlock;
> +
> + kvm->arch.pkvm.enabled = true;
> +out_slots_unlock:
> + mutex_unlock(&kvm->slots_lock);
> +out_kvm_unlock:
> + mutex_unlock(&kvm->lock);
> + return ret;
> +}

...

> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 3fd9a7e9d90c..58ab8508be5e 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1082,6 +1082,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_SGX_ATTRIBUTE 196
>  #define KVM_CAP_VM_COPY_ENC_CONTEXT_FROM 197
>  #define KVM_CAP_PTP_KVM 198
> +#define KVM_CAP_ARM_PROTECTED_VM 199

Rather than a dedicated (and arm-only) capability for protected VMs, what about
adding a capability to retrieve the supported VM types?  And obviously make
protected VMs a different type that must be specified at VM creation (there's
already plumbing for this).  Even if there's no current need to know at creation
time that a VM will be protected, it's also unlikely to be a burden on userspace
and could be nice to have down the road.  The late "activation" ioctl() can 
still
be supported, e.g. to start disallowing memslot 

Re: [PATCH v13 4/8] KVM: arm64: Introduce MTE VM feature

2021-06-04 Thread Steven Price
On 04/06/2021 12:36, Catalin Marinas wrote:
> On Fri, Jun 04, 2021 at 11:42:11AM +0100, Steven Price wrote:
>> On 03/06/2021 17:00, Catalin Marinas wrote:
>>> On Mon, May 24, 2021 at 11:45:09AM +0100, Steven Price wrote:
 diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
 index c5d1f3c87dbd..226035cf7d6c 100644
 --- a/arch/arm64/kvm/mmu.c
 +++ b/arch/arm64/kvm/mmu.c
 @@ -822,6 +822,42 @@ transparent_hugepage_adjust(struct kvm_memory_slot 
 *memslot,
return PAGE_SIZE;
  }
  
 +static int sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn,
 +   unsigned long size)
 +{
 +  if (kvm_has_mte(kvm)) {
 +  /*
 +   * The page will be mapped in stage 2 as Normal Cacheable, so
 +   * the VM will be able to see the page's tags and therefore
 +   * they must be initialised first. If PG_mte_tagged is set,
 +   * tags have already been initialised.
 +   * pfn_to_online_page() is used to reject ZONE_DEVICE pages
 +   * that may not support tags.
 +   */
 +  unsigned long i, nr_pages = size >> PAGE_SHIFT;
 +  struct page *page = pfn_to_online_page(pfn);
 +
 +  if (!page)
 +  return -EFAULT;
 +
 +  for (i = 0; i < nr_pages; i++, page++) {
 +  /*
 +   * There is a potential (but very unlikely) race
 +   * between two VMs which are sharing a physical page
 +   * entering this at the same time. However by splitting
 +   * the test/set the only risk is tags being overwritten
 +   * by the mte_clear_page_tags() call.
 +   */
>>>
>>> And I think the real risk here is when the page is writable by at least
>>> one of the VMs sharing the page. This excludes KSM, so it only leaves
>>> the MAP_SHARED mappings.
>>>
 +  if (!test_bit(PG_mte_tagged, &page->flags)) {
 +  mte_clear_page_tags(page_address(page));
 +  set_bit(PG_mte_tagged, &page->flags);
 +  }
 +  }
>>>
>>> If we want to cover this race (I'd say in a separate patch), we can call
>>> mte_sync_page_tags(page, __pte(0), false, true) directly (hopefully I
>>> got the arguments right). We can avoid the big lock in most cases if
>>> kvm_arch_prepare_memory_region() sets a VM_MTE_RESET (tag clear etc.)
>>> and __alloc_zeroed_user_highpage() clears the tags on allocation (as we
>>> do for VM_MTE but the new flag would not affect the stage 1 VMM page
>>> attributes).
>>
>> To be honest I'm coming round to just exporting a
>> mte_prepare_page_tags() function which does the clear/set with the lock
>> held. I doubt it's such a performance critical path that it will cause
>> any noticeable issues. Then if we run into performance problems in the
>> future we can start experimenting with extra VM flags etc as necessary.
> 
> It works for me.
> 
>> And from your later email:
>>> Another idea: if VM_SHARED is found for any vma within a region in
>>> kvm_arch_prepare_memory_region(), we either prevent the enabling of MTE
>>> for the guest or reject the memory slot if MTE was already enabled.
>>>
>>> An alternative here would be to clear VM_MTE_ALLOWED so that any
>>> subsequent mprotect(PROT_MTE) in the VMM would fail in
>>> arch_validate_flags(). MTE would still be allowed in the guest but in
>>> the VMM for the guest memory regions. We can probably do this
>>> irrespective of VM_SHARED. Of course, the VMM can still mmap() the
>>> memory initially with PROT_MTE but that's not an issue IIRC, only the
>>> concurrent mprotect().
>>
> >> This could work, but I worry that it's potentially fragile. Also the rules
>> for what user space can do are not obvious and may be surprising. I'd
>> also want to look into the likes of mremap() to see how easy it would be
>> to ensure that we couldn't end up with VM_SHARED (or VM_MTE_ALLOWED)
>> memory sneaking into a memslot.
>>
>> Unless you think it's worth complicating the ABI in the hope of avoiding
>> the big lock overhead I think it's probably best to stick with the big
>> lock at least until we have more data on the overhead.
> 
> It's up to Marc but I think for now just make it safe and once we get
> our hands on hardware, we can assess the impact. For example, starting
> multiple VMs simultaneously will contend on such big lock but we have an
> option to optimise it by setting PG_mte_tagged on allocation via a new
> VM_* flag.
> 
> For my last suggestion above, changing the VMM ABI afterwards is a bit
> tricky, so we could state now that VM_SHARED and MTE are not allowed
> (though it needs a patch to enforce it). That's assuming that mprotect()
> in the VMM cannot race with the user_mem_abort() on another CPU which
> makes the lock necessary anyway.
> 
 +  }
 +
 

Re: [PATCH v13 7/8] KVM: arm64: ioctl to fetch/store tags in a guest

2021-06-04 Thread Catalin Marinas
On Fri, Jun 04, 2021 at 12:15:56PM +0100, Steven Price wrote:
> On 03/06/2021 18:13, Catalin Marinas wrote:
> > On Mon, May 24, 2021 at 11:45:12AM +0100, Steven Price wrote:
> >> diff --git a/arch/arm64/include/uapi/asm/kvm.h 
> >> b/arch/arm64/include/uapi/asm/kvm.h
> >> index 24223adae150..b3edde68bc3e 100644
> >> --- a/arch/arm64/include/uapi/asm/kvm.h
> >> +++ b/arch/arm64/include/uapi/asm/kvm.h
> >> @@ -184,6 +184,17 @@ struct kvm_vcpu_events {
> >>__u32 reserved[12];
> >>  };
> >>  
> >> +struct kvm_arm_copy_mte_tags {
> >> +  __u64 guest_ipa;
> >> +  __u64 length;
> >> +  void __user *addr;
> >> +  __u64 flags;
> >> +  __u64 reserved[2];
> >> +};
> >> +
> >> +#define KVM_ARM_TAGS_TO_GUEST 0
> >> +#define KVM_ARM_TAGS_FROM_GUEST   1
> >> +
> >>  /* If you need to interpret the index values, here is the key: */
> >>  #define KVM_REG_ARM_COPROC_MASK   0x0FFF
> >>  #define KVM_REG_ARM_COPROC_SHIFT  16
> >> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> >> index e89a5e275e25..baa33359e477 100644
> >> --- a/arch/arm64/kvm/arm.c
> >> +++ b/arch/arm64/kvm/arm.c
> >> @@ -1345,6 +1345,13 @@ long kvm_arch_vm_ioctl(struct file *filp,
> >>  
> >>return 0;
> >>}
> >> +  case KVM_ARM_MTE_COPY_TAGS: {
> >> +  struct kvm_arm_copy_mte_tags copy_tags;
> >> +
> >> +  if (copy_from_user(&copy_tags, argp, sizeof(copy_tags)))
> >> +  return -EFAULT;
> >> +  return kvm_vm_ioctl_mte_copy_tags(kvm, &copy_tags);
> >> +  }
> > 
> > I wonder whether we need an update of the user structure following a
> > fault, like how much was copied etc. In case of an error, some tags were
> > copied and the VMM may want to skip the page before continuing. But here
> > there's no such information provided.
> > 
> > On the ptrace interface, we return 0 on the syscall if any bytes were
> > copied and update iov_len to such number. Maybe you want to still return
> > an error here but updating copy_tags.length would be nice (and, of
> > course, a copy_to_user() back).
> 
> Good idea - as you suggest I'll make it update length with the number of
> bytes not processed. Although in general I think we're expecting the VMM
> to know where the memory is so this is more of a programming error - but
> could still be useful for debugging.

Or update it to the number of bytes copied to be consistent with
ptrace()'s iov.len. On success, the structure is effectively left
unchanged.

-- 
Catalin


Re: [PATCH v13 4/8] KVM: arm64: Introduce MTE VM feature

2021-06-04 Thread Catalin Marinas
On Fri, Jun 04, 2021 at 11:42:11AM +0100, Steven Price wrote:
> On 03/06/2021 17:00, Catalin Marinas wrote:
> > On Mon, May 24, 2021 at 11:45:09AM +0100, Steven Price wrote:
> >> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> >> index c5d1f3c87dbd..226035cf7d6c 100644
> >> --- a/arch/arm64/kvm/mmu.c
> >> +++ b/arch/arm64/kvm/mmu.c
> >> @@ -822,6 +822,42 @@ transparent_hugepage_adjust(struct kvm_memory_slot 
> >> *memslot,
> >>return PAGE_SIZE;
> >>  }
> >>  
> >> +static int sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn,
> >> +   unsigned long size)
> >> +{
> >> +  if (kvm_has_mte(kvm)) {
> >> +  /*
> >> +   * The page will be mapped in stage 2 as Normal Cacheable, so
> >> +   * the VM will be able to see the page's tags and therefore
> >> +   * they must be initialised first. If PG_mte_tagged is set,
> >> +   * tags have already been initialised.
> >> +   * pfn_to_online_page() is used to reject ZONE_DEVICE pages
> >> +   * that may not support tags.
> >> +   */
> >> +  unsigned long i, nr_pages = size >> PAGE_SHIFT;
> >> +  struct page *page = pfn_to_online_page(pfn);
> >> +
> >> +  if (!page)
> >> +  return -EFAULT;
> >> +
> >> +  for (i = 0; i < nr_pages; i++, page++) {
> >> +  /*
> >> +   * There is a potential (but very unlikely) race
> >> +   * between two VMs which are sharing a physical page
> >> +   * entering this at the same time. However by splitting
> >> +   * the test/set the only risk is tags being overwritten
> >> +   * by the mte_clear_page_tags() call.
> >> +   */
> > 
> > And I think the real risk here is when the page is writable by at least
> > one of the VMs sharing the page. This excludes KSM, so it only leaves
> > the MAP_SHARED mappings.
> > 
> >> +  if (!test_bit(PG_mte_tagged, &page->flags)) {
> >> +  mte_clear_page_tags(page_address(page));
> >> +  set_bit(PG_mte_tagged, &page->flags);
> >> +  }
> >> +  }
> > 
> > If we want to cover this race (I'd say in a separate patch), we can call
> > mte_sync_page_tags(page, __pte(0), false, true) directly (hopefully I
> > got the arguments right). We can avoid the big lock in most cases if
> > kvm_arch_prepare_memory_region() sets a VM_MTE_RESET (tag clear etc.)
> > and __alloc_zeroed_user_highpage() clears the tags on allocation (as we
> > do for VM_MTE but the new flag would not affect the stage 1 VMM page
> > attributes).
> 
> To be honest I'm coming round to just exporting a
> mte_prepare_page_tags() function which does the clear/set with the lock
> held. I doubt it's such a performance critical path that it will cause
> any noticeable issues. Then if we run into performance problems in the
> future we can start experimenting with extra VM flags etc as necessary.

It works for me.

> And from your later email:
> > Another idea: if VM_SHARED is found for any vma within a region in
> > kvm_arch_prepare_memory_region(), we either prevent the enabling of MTE
> > for the guest or reject the memory slot if MTE was already enabled.
> > 
> > An alternative here would be to clear VM_MTE_ALLOWED so that any
> > subsequent mprotect(PROT_MTE) in the VMM would fail in
> > arch_validate_flags(). MTE would still be allowed in the guest but in
> > the VMM for the guest memory regions. We can probably do this
> > irrespective of VM_SHARED. Of course, the VMM can still mmap() the
> > memory initially with PROT_MTE but that's not an issue IIRC, only the
> > concurrent mprotect().
> 
> This could work, but I worry that it's potentially fragile. Also the rules
> for what user space can do are not obvious and may be surprising. I'd
> also want to look into the likes of mremap() to see how easy it would be
> to ensure that we couldn't end up with VM_SHARED (or VM_MTE_ALLOWED)
> memory sneaking into a memslot.
> 
> Unless you think it's worth complicating the ABI in the hope of avoiding
> the big lock overhead I think it's probably best to stick with the big
> lock at least until we have more data on the overhead.

It's up to Marc, but I think for now we should just make it safe and,
once we get our hands on hardware, assess the impact. For example,
starting multiple VMs simultaneously will contend on such a big lock,
but we have the option to optimise it by setting PG_mte_tagged on
allocation via a new VM_* flag.

For my last suggestion above, changing the VMM ABI afterwards is a bit
tricky, so we could state now that VM_SHARED and MTE are not allowed
(though it needs a patch to enforce it). That's assuming that mprotect()
in the VMM cannot race with user_mem_abort() on another CPU, which
makes the lock necessary anyway.

> >> +  }
> >> +
> >> +  return 0;
> >> +}
> >> +
> >>  static int user_mem_abort(struct kvm_vcpu *vcpu, 

Re: [PATCH v13 7/8] KVM: arm64: ioctl to fetch/store tags in a guest

2021-06-04 Thread Steven Price
On 03/06/2021 18:13, Catalin Marinas wrote:
> On Mon, May 24, 2021 at 11:45:12AM +0100, Steven Price wrote:
>> diff --git a/arch/arm64/include/uapi/asm/kvm.h 
>> b/arch/arm64/include/uapi/asm/kvm.h
>> index 24223adae150..b3edde68bc3e 100644
>> --- a/arch/arm64/include/uapi/asm/kvm.h
>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>> @@ -184,6 +184,17 @@ struct kvm_vcpu_events {
>>  __u32 reserved[12];
>>  };
>>  
>> +struct kvm_arm_copy_mte_tags {
>> +__u64 guest_ipa;
>> +__u64 length;
>> +void __user *addr;
>> +__u64 flags;
>> +__u64 reserved[2];
>> +};
>> +
>> +#define KVM_ARM_TAGS_TO_GUEST   0
>> +#define KVM_ARM_TAGS_FROM_GUEST 1
>> +
>>  /* If you need to interpret the index values, here is the key: */
>>  #define KVM_REG_ARM_COPROC_MASK 0x0FFF
>>  #define KVM_REG_ARM_COPROC_SHIFT16
>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index e89a5e275e25..baa33359e477 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -1345,6 +1345,13 @@ long kvm_arch_vm_ioctl(struct file *filp,
>>  
>>  return 0;
>>  }
>> +case KVM_ARM_MTE_COPY_TAGS: {
>> +struct kvm_arm_copy_mte_tags copy_tags;
>> +
> >> +if (copy_from_user(&copy_tags, argp, sizeof(copy_tags)))
> >> +return -EFAULT;
> >> +return kvm_vm_ioctl_mte_copy_tags(kvm, &copy_tags);
>> +}
> 
> I wonder whether we need an update of the user structure following a
> fault, like how much was copied etc. In case of an error, some tags were
> copied and the VMM may want to skip the page before continuing. But here
> there's no such information provided.
> 
> On the ptrace interface, we return 0 on the syscall if any bytes were
> copied and update iov_len to such number. Maybe you want to still return
> an error here but updating copy_tags.length would be nice (and, of
> course, a copy_to_user() back).
> 

Good idea - as you suggest, I'll make it update length with the number of
bytes not processed. Although in general I think we're expecting the VMM
to know where the memory is, so this is more of a programming error - but
it could still be useful for debugging.
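
For illustration, the ioctl wrapper could then copy the struct back to
user space, roughly like this (a sketch, not the actual patch; it
assumes kvm_vm_ioctl_mte_copy_tags() updates copy_tags.length to the
number of bytes not processed):

	case KVM_ARM_MTE_COPY_TAGS: {
		struct kvm_arm_copy_mte_tags copy_tags;
		long ret;

		if (copy_from_user(&copy_tags, argp, sizeof(copy_tags)))
			return -EFAULT;
		ret = kvm_vm_ioctl_mte_copy_tags(kvm, &copy_tags);
		/* report back how much was left unprocessed, even on error */
		if (copy_to_user(argp, &copy_tags, sizeof(copy_tags)))
			return -EFAULT;
		return ret;
	}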

Thanks,

Steve


Re: [PATCH v13 4/8] KVM: arm64: Introduce MTE VM feature

2021-06-04 Thread Steven Price
On 03/06/2021 17:00, Catalin Marinas wrote:
> On Mon, May 24, 2021 at 11:45:09AM +0100, Steven Price wrote:
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index c5d1f3c87dbd..226035cf7d6c 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -822,6 +822,42 @@ transparent_hugepage_adjust(struct kvm_memory_slot 
>> *memslot,
>>  return PAGE_SIZE;
>>  }
>>  
>> +static int sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn,
>> + unsigned long size)
>> +{
>> +if (kvm_has_mte(kvm)) {
> 
> Nitpick (less indentation):
> 
>   if (!kvm_has_mte(kvm))
>   return 0;

Thanks, will change.

>> +/*
>> + * The page will be mapped in stage 2 as Normal Cacheable, so
>> + * the VM will be able to see the page's tags and therefore
>> + * they must be initialised first. If PG_mte_tagged is set,
>> + * tags have already been initialised.
>> + * pfn_to_online_page() is used to reject ZONE_DEVICE pages
>> + * that may not support tags.
>> + */
>> +unsigned long i, nr_pages = size >> PAGE_SHIFT;
>> +struct page *page = pfn_to_online_page(pfn);
>> +
>> +if (!page)
>> +return -EFAULT;
>> +
>> +for (i = 0; i < nr_pages; i++, page++) {
>> +/*
>> + * There is a potential (but very unlikely) race
>> + * between two VMs which are sharing a physical page
>> + * entering this at the same time. However by splitting
>> + * the test/set the only risk is tags being overwritten
>> + * by the mte_clear_page_tags() call.
>> + */
> 
> And I think the real risk here is when the page is writable by at least
> one of the VMs sharing the page. This excludes KSM, so it only leaves
> the MAP_SHARED mappings.
> 
> >> +if (!test_bit(PG_mte_tagged, &page->flags)) {
> >> +mte_clear_page_tags(page_address(page));
> >> +set_bit(PG_mte_tagged, &page->flags);
>> +}
>> +}
> 
> If we want to cover this race (I'd say in a separate patch), we can call
> mte_sync_page_tags(page, __pte(0), false, true) directly (hopefully I
> got the arguments right). We can avoid the big lock in most cases if
> kvm_arch_prepare_memory_region() sets a VM_MTE_RESET (tag clear etc.)
> and __alloc_zeroed_user_highpage() clears the tags on allocation (as we
> do for VM_MTE but the new flag would not affect the stage 1 VMM page
> attributes).

To be honest I'm coming round to just exporting a
mte_prepare_page_tags() function which does the clear/set with the lock
held. I doubt it's such a performance critical path that it will cause
any noticeable issues. Then if we run into performance problems in the
future we can start experimenting with extra VM flags etc as necessary.
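
For illustration only, such an exported helper could look roughly like
this (the lock name is made up and this is not a final implementation):

	/* arch/arm64/kernel/mte.c (sketch); tag_init_lock is a made-up global lock */
	static DEFINE_SPINLOCK(tag_init_lock);

	void mte_prepare_page_tags(struct page *page)
	{
		spin_lock(&tag_init_lock);
		if (!test_bit(PG_mte_tagged, &page->flags)) {
			mte_clear_page_tags(page_address(page));
			set_bit(PG_mte_tagged, &page->flags);
		}
		spin_unlock(&tag_init_lock);
	}

The loop in sanitise_mte_tags() would then just call
mte_prepare_page_tags(page) instead of open-coding the test/clear/set.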

And from your later email:
> Another idea: if VM_SHARED is found for any vma within a region in
> kvm_arch_prepare_memory_region(), we either prevent the enabling of MTE
> for the guest or reject the memory slot if MTE was already enabled.
> 
> An alternative here would be to clear VM_MTE_ALLOWED so that any
> subsequent mprotect(PROT_MTE) in the VMM would fail in
> arch_validate_flags(). MTE would still be allowed in the guest but not
> in the VMM for the guest memory regions. We can probably do this
> irrespective of VM_SHARED. Of course, the VMM can still mmap() the
> memory initially with PROT_MTE but that's not an issue IIRC, only the
> concurrent mprotect().

This could work, but I worry that it's potentially fragile. Also the rules
for what user space can do are not obvious and may be surprising. I'd
also want to look into the likes of mremap() to see how easy it would be
to ensure that we couldn't end up with VM_SHARED (or VM_MTE_ALLOWED)
memory sneaking into a memslot.

Unless you think it's worth complicating the ABI in the hope of avoiding
the big lock overhead I think it's probably best to stick with the big
lock at least until we have more data on the overhead.

>> +}
>> +
>> +return 0;
>> +}
>> +
>>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>struct kvm_memory_slot *memslot, unsigned long hva,
>>unsigned long fault_status)
>> @@ -971,8 +1007,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
>> phys_addr_t fault_ipa,
>>  if (writable)
>>  prot |= KVM_PGTABLE_PROT_W;
>>  
>> -if (fault_status != FSC_PERM && !device)
>> +if (fault_status != FSC_PERM && !device) {
>> +ret = sanitise_mte_tags(kvm, pfn, vma_pagesize);
>> +if (ret)
>> +goto out_unlock;
> 
> Maybe it was discussed in a previous version, why do we need this in
> addition to kvm_set_spte_gfn()?

kvm_set_spte_gfn() is only used 

Re: [RFC][PATCH] kvm: add suspend pm-notifier

2021-06-04 Thread Marc Zyngier
On Fri, 04 Jun 2021 10:20:28 +0100,
Sergey Senozhatsky  wrote:
> 
> On (21/06/04 09:46), Marc Zyngier wrote:
> [..]
> > > +void kvm_arch_pm_notifier(struct kvm *kvm)
> > > +{
> > > +#ifdef CONFIG_PM
> > > + int c;
> > > +
> > > + mutex_lock(&kvm->lock);
> > > + for (c = 0; c < kvm->created_vcpus; c++) {
> > > + struct kvm_vcpu *vcpu = kvm->vcpus[c];
> > > + int r;
> > > +
> > > + if (!vcpu)
> > > + continue;
> > 
> > Wouldn't kvm_for_each_vcpu() avoid this kind of checks?
> 
> Right, that's what I do in v2, which I haven't posted yet.
> 
> [..]
> > > +#include 
> > > +
> > >  #ifndef KVM_MAX_VCPU_ID
> > >  #define KVM_MAX_VCPU_ID KVM_MAX_VCPUS
> > >  #endif
> > > @@ -579,6 +581,10 @@ struct kvm {
> > >   pid_t userspace_pid;
> > >   unsigned int max_halt_poll_ns;
> > >   u32 dirty_ring_size;
> > > +
> > > +#ifdef CONFIG_PM
> > > + struct notifier_block pm_notifier;
> > > +#endif
> > 
> > I'd certainly like to be able to opt out from this on architectures
> > that do not implement anything useful in the PM callbacks.
> 
> Well, on the other hand, PM callbacks are harmless on those archs: they
> won't override the __weak function.

I don't care much for the callbacks. But struct kvm is bloated enough,
and I'd be happy not to have this structure embedded in it if I can
avoid it.

> 
> > Please consider making this an independent config option that individual
> > archs can buy into.
> 
> Sure, I can take a look into this, but how is this better than __weak
> function? (that's a real question)

Weak functions are OK AFAIC. More crud in struct kvm is what I'm not
OK with.

> 
> [..]
> > > +#ifdef CONFIG_PM
> > > +static int kvm_pm_notifier_call(struct notifier_block *bl,
> > > + unsigned long state,
> > > + void *unused)
> > > +{
> > > + struct kvm *kvm = container_of(bl, struct kvm, pm_notifier);
> > > +
> > > + switch (state) {
> > > + case PM_HIBERNATION_PREPARE:
> > > + case PM_SUSPEND_PREPARE:
> > > + kvm_arch_pm_notifier(kvm);
> > 
> > How about passing the state to the notifier callback? I'd expect it to
> > be useful to do something on resume too.
> 
> For different states we can have different kvm_arch functions instead.
> kvm_arch_pm_notifier() can be renamed to kvm_arch_suspend_notifier(),
> so that we don't need to have `switch (state)` in every arch's code. Then
> for resume/post-resume states we can have kvm_arch_resume_notifier()
> arch functions.

I'd rather we keep an arch API that is similar to the one the rest of
the kernel has, instead of a flurry of small helpers that need to grow
each time someone adds a new PM state. A switch() in the arch-specific
implementation is absolutely fine.
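
For illustration, a rough sketch of such an arch implementation,
assuming the PM state is passed through (the two-argument signature is
my assumption, not what the patch posted):

	void kvm_arch_pm_notifier(struct kvm *kvm, unsigned long state)
	{
		switch (state) {
		case PM_HIBERNATION_PREPARE:
		case PM_SUSPEND_PREPARE:
			/* quiesce guest state before the host sleeps */
			break;
		case PM_POST_HIBERNATION:
		case PM_POST_SUSPEND:
			/* undo anything done on the suspend path */
			break;
		}
	}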

> 
> > > + break;
> > > + }
> > > + return NOTIFY_DONE;
> > > +}
> > > +
> > > +static void kvm_init_pm_notifier(struct kvm *kvm)
> > > +{
> > > + kvm->pm_notifier.notifier_call = kvm_pm_notifier_call;
> > > + kvm->pm_notifier.priority = INT_MAX;
> > 
> > How is this priority determined?
> 
> Nothing magical here. I want this to be executed first, before we suspend
> ftrace, RCU and the like. Besides, KVM is usually the last one to register
> its PM callbacks, so there can be something on the notifier list that
> returns NOTIFY_STOP_MASK in front of the KVM PM-notifier entry.

Which begs the question: should arch-specific code be able to veto
suspend and return an error itself? Always returning NOTIFY_DONE seems
highly optimistic.
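
As a sketch of what that could look like (assuming the arch hook gains
an int return plus the state argument from the variant above, neither of
which the posted patch has):

	static int kvm_pm_notifier_call(struct notifier_block *bl,
					unsigned long state,
					void *unused)
	{
		struct kvm *kvm = container_of(bl, struct kvm, pm_notifier);
		int r = kvm_arch_pm_notifier(kvm, state);

		/* NOTIFY_OK for 0, NOTIFY_STOP_MASK-encoded value for -Exxx */
		return notifier_from_errno(r);
	}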

M.

-- 
Without deviation from the norm, progress is not possible.


Re: [PATCH v13 4/8] KVM: arm64: Introduce MTE VM feature

2021-06-04 Thread Catalin Marinas
On Thu, Jun 03, 2021 at 05:00:31PM +0100, Catalin Marinas wrote:
> On Mon, May 24, 2021 at 11:45:09AM +0100, Steven Price wrote:
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index c5d1f3c87dbd..226035cf7d6c 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -822,6 +822,42 @@ transparent_hugepage_adjust(struct kvm_memory_slot 
> > *memslot,
> > return PAGE_SIZE;
> >  }
> >  
> > +static int sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn,
> > +unsigned long size)
> > +{
> > +   if (kvm_has_mte(kvm)) {
> > +   /*
> > +* The page will be mapped in stage 2 as Normal Cacheable, so
> > +* the VM will be able to see the page's tags and therefore
> > +* they must be initialised first. If PG_mte_tagged is set,
> > +* tags have already been initialised.
> > +* pfn_to_online_page() is used to reject ZONE_DEVICE pages
> > +* that may not support tags.
> > +*/
> > +   unsigned long i, nr_pages = size >> PAGE_SHIFT;
> > +   struct page *page = pfn_to_online_page(pfn);
> > +
> > +   if (!page)
> > +   return -EFAULT;
> > +
> > +   for (i = 0; i < nr_pages; i++, page++) {
> > +   /*
> > +* There is a potential (but very unlikely) race
> > +* between two VMs which are sharing a physical page
> > +* entering this at the same time. However by splitting
> > +* the test/set the only risk is tags being overwritten
> > +* by the mte_clear_page_tags() call.
> > +*/
> 
> And I think the real risk here is when the page is writable by at least
> one of the VMs sharing the page. This excludes KSM, so it only leaves
> the MAP_SHARED mappings.
> 
> > +   if (!test_bit(PG_mte_tagged, &page->flags)) {
> > +   mte_clear_page_tags(page_address(page));
> > +   set_bit(PG_mte_tagged, &page->flags);
> > +   }
> > +   }
> 
> If we want to cover this race (I'd say in a separate patch), we can call
> mte_sync_page_tags(page, __pte(0), false, true) directly (hopefully I
> got the arguments right). We can avoid the big lock in most cases if
> kvm_arch_prepare_memory_region() sets a VM_MTE_RESET (tag clear etc.)
> and __alloc_zeroed_user_highpage() clears the tags on allocation (as we
> do for VM_MTE but the new flag would not affect the stage 1 VMM page
> attributes).

Another idea: if VM_SHARED is found for any vma within a region in
kvm_arch_prepare_memory_region(), we either prevent the enabling of MTE
for the guest or reject the memory slot if MTE was already enabled.

An alternative here would be to clear VM_MTE_ALLOWED so that any
subsequent mprotect(PROT_MTE) in the VMM would fail in
arch_validate_flags(). MTE would still be allowed in the guest but not
in the VMM for the guest memory regions. We can probably do this
irrespective of VM_SHARED. Of course, the VMM can still mmap() the
memory initially with PROT_MTE but that's not an issue IIRC, only the
concurrent mprotect().
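
A rough sketch of the VM_SHARED check itself, only to show its shape
(the helper name and loop are made up; the caller would need to hold
mmap_read_lock(current->mm)):

	/* somewhere in arch/arm64/kvm/mmu.c (sketch) */
	static bool memslot_has_shared_vma(unsigned long hva, unsigned long size)
	{
		unsigned long addr = hva;

		while (addr < hva + size) {
			struct vm_area_struct *vma = find_vma(current->mm, addr);

			if (!vma || vma->vm_start >= hva + size)
				break;
			if (vma->vm_flags & VM_SHARED)
				return true;
			addr = vma->vm_end;
		}
		return false;
	}

kvm_arch_prepare_memory_region() could then reject the slot (or refuse
to enable MTE) when this returns true.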

-- 
Catalin


Re: [RFC][PATCH] kvm: add suspend pm-notifier

2021-06-04 Thread Vitaly Kuznetsov
Sergey Senozhatsky  writes:

> Add a KVM suspend/hibernate PM-notifier which lets architectures
> implement arch-specific VM suspend code. For instance, on x86
> this sets PVCLOCK_GUEST_STOPPED on all the VCPUs.
>
> Our use case is that the user puts the host system to sleep multiple
> times a day (e.g. closes the laptop's lid), so we need a reliable
> way to suspend VMs properly.
>
> Signed-off-by: Sergey Senozhatsky 
> ---
>  arch/arm64/kvm/arm.c   |  4 
>  arch/mips/kvm/mips.c   |  4 
>  arch/powerpc/kvm/powerpc.c |  4 
>  arch/s390/kvm/kvm-s390.c   |  4 
>  arch/x86/kvm/x86.c | 21 
>  include/linux/kvm_host.h   |  8 
>  virt/kvm/kvm_main.c| 40 ++
>  7 files changed, 85 insertions(+)
>
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 1126eae27400..547dbe44d039 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -1311,6 +1311,10 @@ static int kvm_vm_ioctl_set_device_addr(struct kvm 
> *kvm,
>   }
>  }
>  
> +void kvm_arch_pm_notifier(struct kvm *kvm)
> +{
> +}
> +
>  long kvm_arch_vm_ioctl(struct file *filp,
>  unsigned int ioctl, unsigned long arg)
>  {
> diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
> index 4d4af97dcc88..d4408acd2be6 100644
> --- a/arch/mips/kvm/mips.c
> +++ b/arch/mips/kvm/mips.c
> @@ -980,6 +980,10 @@ void kvm_arch_flush_remote_tlbs_memslot(struct kvm *kvm,
>   kvm_flush_remote_tlbs(kvm);
>  }
>  
> +void kvm_arch_pm_notifier(struct kvm *kvm)
> +{
> +}
> +
>  long kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long 
> arg)
>  {
>   long r;
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index a2a68a958fa0..96e8a7b6fcf0 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -2334,6 +2334,10 @@ static int kvmppc_get_cpu_char(struct kvm_ppc_cpu_char 
> *cp)
>  }
>  #endif
>  
> +void kvm_arch_pm_notifier(struct kvm *kvm)
> +{
> +}
> +
>  long kvm_arch_vm_ioctl(struct file *filp,
> unsigned int ioctl, unsigned long arg)
>  {
> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
> index 1296fc10f80c..c5f86fc1e497 100644
> --- a/arch/s390/kvm/kvm-s390.c
> +++ b/arch/s390/kvm/kvm-s390.c
> @@ -2367,6 +2367,10 @@ static int kvm_s390_handle_pv(struct kvm *kvm, struct 
> kvm_pv_cmd *cmd)
>   return r;
>  }
>  
> +void kvm_arch_pm_notifier(struct kvm *kvm)
> +{
> +}
> +
>  long kvm_arch_vm_ioctl(struct file *filp,
>  unsigned int ioctl, unsigned long arg)
>  {
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index bbc4e04e67ad..3f3d6497593f 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5613,6 +5613,27 @@ static int kvm_vm_ioctl_set_msr_filter(struct kvm 
> *kvm, void __user *argp)
>   return 0;
>  }
>  
> +void kvm_arch_pm_notifier(struct kvm *kvm)
> +{
> +#ifdef CONFIG_PM
> + int c;
> +
> + mutex_lock(&kvm->lock);
> + for (c = 0; c < kvm->created_vcpus; c++) {
> + struct kvm_vcpu *vcpu = kvm->vcpus[c];
> + int r;
> +
> + if (!vcpu)
> + continue;
> + r = kvm_set_guest_paused(vcpu);
> + if (!r)
> + continue;
> + pr_err("Failed to suspend VCPU-%d: %d\n", vcpu->vcpu_id,  r);
> + }
> + mutex_unlock(&kvm->lock);
> +#endif
> +}
> +
>  long kvm_arch_vm_ioctl(struct file *filp,
>  unsigned int ioctl, unsigned long arg)
>  {
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 2f34487e21f2..86695320a6b7 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -37,6 +37,8 @@
>  #include 
>  #include 
>  
> +#include 
> +
>  #ifndef KVM_MAX_VCPU_ID
>  #define KVM_MAX_VCPU_ID KVM_MAX_VCPUS
>  #endif
> @@ -579,6 +581,10 @@ struct kvm {
>   pid_t userspace_pid;
>   unsigned int max_halt_poll_ns;
>   u32 dirty_ring_size;
> +
> +#ifdef CONFIG_PM
> + struct notifier_block pm_notifier;
> +#endif
>  };
>  
>  #define kvm_err(fmt, ...) \
> @@ -992,6 +998,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu);
>  void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu);
>  void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu);
>  
> +void kvm_arch_pm_notifier(struct kvm *kvm);
> +
>  #ifdef __KVM_HAVE_ARCH_VCPU_DEBUGFS
>  void kvm_arch_create_vcpu_debugfs(struct kvm_vcpu *vcpu, struct dentry 
> *debugfs_dentry);
>  #endif
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 6b4feb92dc79..86925ab7d162 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -51,6 +51,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -779,6 +780,43 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
>  
>  #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
>  
> +#ifdef CONFIG_PM
> +static int kvm_pm_notifier_call(struct 

Re: [RFC][PATCH] kvm: add suspend pm-notifier

2021-06-04 Thread Sergey Senozhatsky
On (21/06/03 19:27), Peter Zijlstra wrote:
> On Fri, Jun 04, 2021 at 01:43:15AM +0900, Sergey Senozhatsky wrote:
[..]
> >  
> > +void kvm_arch_pm_notifier(struct kvm *kvm)
> > +{
> > +}
> > +
> >  long kvm_arch_vm_ioctl(struct file *filp,
> >unsigned int ioctl, unsigned long arg)
> >  {
> 
> Looks like you want a __weak function.

True. Thanks for the suggestion.

I thought about it, but I recalled that tglx had  __strong opinions
on __weak functions.


[RFC][PATCH] kvm: add suspend pm-notifier

2021-06-04 Thread Sergey Senozhatsky
Add a KVM suspend/hibernate PM-notifier which lets architectures
implement arch-specific VM suspend code. For instance, on x86
this sets PVCLOCK_GUEST_STOPPED on all the VCPUs.

Our use case is that the user puts the host system to sleep multiple
times a day (e.g. closes the laptop's lid), so we need a reliable
way to suspend VMs properly.

Signed-off-by: Sergey Senozhatsky 
---
 arch/arm64/kvm/arm.c   |  4 
 arch/mips/kvm/mips.c   |  4 
 arch/powerpc/kvm/powerpc.c |  4 
 arch/s390/kvm/kvm-s390.c   |  4 
 arch/x86/kvm/x86.c | 21 
 include/linux/kvm_host.h   |  8 
 virt/kvm/kvm_main.c| 40 ++
 7 files changed, 85 insertions(+)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 1126eae27400..547dbe44d039 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1311,6 +1311,10 @@ static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
}
 }
 
+void kvm_arch_pm_notifier(struct kvm *kvm)
+{
+}
+
 long kvm_arch_vm_ioctl(struct file *filp,
   unsigned int ioctl, unsigned long arg)
 {
diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index 4d4af97dcc88..d4408acd2be6 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -980,6 +980,10 @@ void kvm_arch_flush_remote_tlbs_memslot(struct kvm *kvm,
kvm_flush_remote_tlbs(kvm);
 }
 
+void kvm_arch_pm_notifier(struct kvm *kvm)
+{
+}
+
 long kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long 
arg)
 {
long r;
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index a2a68a958fa0..96e8a7b6fcf0 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -2334,6 +2334,10 @@ static int kvmppc_get_cpu_char(struct kvm_ppc_cpu_char 
*cp)
 }
 #endif
 
+void kvm_arch_pm_notifier(struct kvm *kvm)
+{
+}
+
 long kvm_arch_vm_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg)
 {
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 1296fc10f80c..c5f86fc1e497 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -2367,6 +2367,10 @@ static int kvm_s390_handle_pv(struct kvm *kvm, struct 
kvm_pv_cmd *cmd)
return r;
 }
 
+void kvm_arch_pm_notifier(struct kvm *kvm)
+{
+}
+
 long kvm_arch_vm_ioctl(struct file *filp,
   unsigned int ioctl, unsigned long arg)
 {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bbc4e04e67ad..3f3d6497593f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5613,6 +5613,27 @@ static int kvm_vm_ioctl_set_msr_filter(struct kvm *kvm, 
void __user *argp)
return 0;
 }
 
+void kvm_arch_pm_notifier(struct kvm *kvm)
+{
+#ifdef CONFIG_PM
+   int c;
+
+   mutex_lock(&kvm->lock);
+   for (c = 0; c < kvm->created_vcpus; c++) {
+   struct kvm_vcpu *vcpu = kvm->vcpus[c];
+   int r;
+
+   if (!vcpu)
+   continue;
+   r = kvm_set_guest_paused(vcpu);
+   if (!r)
+   continue;
+   pr_err("Failed to suspend VCPU-%d: %d\n", vcpu->vcpu_id,  r);
+   }
+   mutex_unlock(&kvm->lock);
+#endif
+}
+
 long kvm_arch_vm_ioctl(struct file *filp,
   unsigned int ioctl, unsigned long arg)
 {
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 2f34487e21f2..86695320a6b7 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -37,6 +37,8 @@
 #include 
 #include 
 
+#include 
+
 #ifndef KVM_MAX_VCPU_ID
 #define KVM_MAX_VCPU_ID KVM_MAX_VCPUS
 #endif
@@ -579,6 +581,10 @@ struct kvm {
pid_t userspace_pid;
unsigned int max_halt_poll_ns;
u32 dirty_ring_size;
+
+#ifdef CONFIG_PM
+   struct notifier_block pm_notifier;
+#endif
 };
 
 #define kvm_err(fmt, ...) \
@@ -992,6 +998,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu);
 void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu);
 void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu);
 
+void kvm_arch_pm_notifier(struct kvm *kvm);
+
 #ifdef __KVM_HAVE_ARCH_VCPU_DEBUGFS
 void kvm_arch_create_vcpu_debugfs(struct kvm_vcpu *vcpu, struct dentry 
*debugfs_dentry);
 #endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6b4feb92dc79..86925ab7d162 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -51,6 +51,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -779,6 +780,43 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
 
 #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
 
+#ifdef CONFIG_PM
+static int kvm_pm_notifier_call(struct notifier_block *bl,
+   unsigned long state,
+   void *unused)
+{
+   struct kvm *kvm = container_of(bl, struct kvm, pm_notifier);
+
+   switch (state) {
+   case PM_HIBERNATION_PREPARE:
+   case PM_SUSPEND_PREPARE:
+   

Re: [RFC][PATCH] kvm: add suspend pm-notifier

2021-06-04 Thread Marc Zyngier
On Thu, 03 Jun 2021 17:43:15 +0100,
Sergey Senozhatsky  wrote:
> 
> Add a KVM suspend/hibernate PM-notifier which lets architectures
> implement arch-specific VM suspend code. For instance, on x86
> this sets PVCLOCK_GUEST_STOPPED on all the VCPUs.
> 
> Our use case is that the user puts the host system to sleep multiple
> times a day (e.g. closes the laptop's lid), so we need a reliable
> way to suspend VMs properly.
> 
> Signed-off-by: Sergey Senozhatsky 
> ---
>  arch/arm64/kvm/arm.c   |  4 
>  arch/mips/kvm/mips.c   |  4 
>  arch/powerpc/kvm/powerpc.c |  4 
>  arch/s390/kvm/kvm-s390.c   |  4 
>  arch/x86/kvm/x86.c | 21 
>  include/linux/kvm_host.h   |  8 
>  virt/kvm/kvm_main.c| 40 ++
>  7 files changed, 85 insertions(+)
> 
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 1126eae27400..547dbe44d039 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -1311,6 +1311,10 @@ static int kvm_vm_ioctl_set_device_addr(struct kvm 
> *kvm,
>   }
>  }
>  
> +void kvm_arch_pm_notifier(struct kvm *kvm)
> +{
> +}
> +
>  long kvm_arch_vm_ioctl(struct file *filp,
>  unsigned int ioctl, unsigned long arg)
>  {
> diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
> index 4d4af97dcc88..d4408acd2be6 100644
> --- a/arch/mips/kvm/mips.c
> +++ b/arch/mips/kvm/mips.c
> @@ -980,6 +980,10 @@ void kvm_arch_flush_remote_tlbs_memslot(struct kvm *kvm,
>   kvm_flush_remote_tlbs(kvm);
>  }
>  
> +void kvm_arch_pm_notifier(struct kvm *kvm)
> +{
> +}
> +
>  long kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long 
> arg)
>  {
>   long r;
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index a2a68a958fa0..96e8a7b6fcf0 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -2334,6 +2334,10 @@ static int kvmppc_get_cpu_char(struct kvm_ppc_cpu_char 
> *cp)
>  }
>  #endif
>  
> +void kvm_arch_pm_notifier(struct kvm *kvm)
> +{
> +}
> +
>  long kvm_arch_vm_ioctl(struct file *filp,
> unsigned int ioctl, unsigned long arg)
>  {
> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
> index 1296fc10f80c..c5f86fc1e497 100644
> --- a/arch/s390/kvm/kvm-s390.c
> +++ b/arch/s390/kvm/kvm-s390.c
> @@ -2367,6 +2367,10 @@ static int kvm_s390_handle_pv(struct kvm *kvm, struct 
> kvm_pv_cmd *cmd)
>   return r;
>  }
>  
> +void kvm_arch_pm_notifier(struct kvm *kvm)
> +{
> +}
> +
>  long kvm_arch_vm_ioctl(struct file *filp,
>  unsigned int ioctl, unsigned long arg)
>  {
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index bbc4e04e67ad..3f3d6497593f 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5613,6 +5613,27 @@ static int kvm_vm_ioctl_set_msr_filter(struct kvm 
> *kvm, void __user *argp)
>   return 0;
>  }
>  
> +void kvm_arch_pm_notifier(struct kvm *kvm)
> +{
> +#ifdef CONFIG_PM
> + int c;
> +
> + mutex_lock(&kvm->lock);
> + for (c = 0; c < kvm->created_vcpus; c++) {
> + struct kvm_vcpu *vcpu = kvm->vcpus[c];
> + int r;
> +
> + if (!vcpu)
> + continue;

Wouldn't kvm_for_each_vcpu() avoid this kind of checks?
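
E.g. something like this (untested sketch):

	struct kvm_vcpu *vcpu;
	int i;

	mutex_lock(&kvm->lock);
	kvm_for_each_vcpu(i, vcpu, kvm) {
		int r = kvm_set_guest_paused(vcpu);

		if (r)
			pr_err("Failed to suspend VCPU-%d: %d\n",
			       vcpu->vcpu_id, r);
	}
	mutex_unlock(&kvm->lock);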

> + r = kvm_set_guest_paused(vcpu);
> + if (!r)
> + continue;
> + pr_err("Failed to suspend VCPU-%d: %d\n", vcpu->vcpu_id,  r);
> + }
> + mutex_unlock(&kvm->lock);
> +#endif
> +}
> +
>  long kvm_arch_vm_ioctl(struct file *filp,
>  unsigned int ioctl, unsigned long arg)
>  {
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 2f34487e21f2..86695320a6b7 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -37,6 +37,8 @@
>  #include 
>  #include 
>  
> +#include 
> +
>  #ifndef KVM_MAX_VCPU_ID
>  #define KVM_MAX_VCPU_ID KVM_MAX_VCPUS
>  #endif
> @@ -579,6 +581,10 @@ struct kvm {
>   pid_t userspace_pid;
>   unsigned int max_halt_poll_ns;
>   u32 dirty_ring_size;
> +
> +#ifdef CONFIG_PM
> + struct notifier_block pm_notifier;
> +#endif

I'd certainly like to be able to opt out from this on architectures
that do not implement anything useful in the PM callbacks. Please
consider making this an independent config option that individual
archs can buy into.

>  };
>  
>  #define kvm_err(fmt, ...) \
> @@ -992,6 +998,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu);
>  void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu);
>  void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu);
>  
> +void kvm_arch_pm_notifier(struct kvm *kvm);
> +
>  #ifdef __KVM_HAVE_ARCH_VCPU_DEBUGFS
>  void kvm_arch_create_vcpu_debugfs(struct kvm_vcpu *vcpu, struct dentry 
> *debugfs_dentry);
>  #endif
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 6b4feb92dc79..86925ab7d162 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ 

RE: [RFC PATCH 0/3] kvm/arm: New VMID allocator based on asid(2nd approach)

2021-06-04 Thread Shameerali Kolothum Thodi
Hi,

> -Original Message-
> From: Shameerali Kolothum Thodi
> Sent: 06 May 2021 17:52
> To: linux-arm-ker...@lists.infradead.org; kvmarm@lists.cs.columbia.edu;
> linux-ker...@vger.kernel.org
> Cc: m...@kernel.org; w...@kernel.org; catalin.mari...@arm.com;
> james.mo...@arm.com; julien.thierry.k...@gmail.com;
> suzuki.poul...@arm.com; jean-phili...@linaro.org; Linuxarm
> 
> Subject: [RFC PATCH 0/3] kvm/arm: New VMID allocator based on asid(2nd
> approach)
> 
> This is based on a suggestion from Will [0] to try out the ASID-based
> KVM VMID solution as a separate VMID allocator instead of the shared
> lib approach attempted in v4 [1].
> 
> The idea is to compare both approaches and see whether the shared lib
> solution with callbacks makes sense or not.

A gentle ping on this. Please take a look and let me know.

Thanks,
Shameer

> Though we are not using the pinned VMIDs yet, patch #2 has
> code for pinned VMID support. This is just to help the comparison.
> 
> Test Setup/Results
> 
> The measurement was made with maxcpus set to 8 and with the
> number of VMIDs limited to 4 bits. The test involves running
> 40 guests with 2 vCPUs each, concurrently. Each guest then
> executes hackbench 5 times before exiting.
> 
> The performance difference between the current algorithm and the
> new one is (avg. of 10 runs):
> - 1.9% less entry/exit from the guest
> - 0.5% faster
> This is more or less comparable to v4 numbers.
> 
> For the complete series, please see,
> https://github.com/hisilicon/kernel-dev/tree/private-v5.12-rc7-vmid-2nd-rfc
> 
> and for the shared asid lib v4 solution,
> https://github.com/hisilicon/kernel-dev/tree/private-v5.12-rc7-asid-v4
> 
> As you can see, there is of course code duplication with this
> approach, but maybe this one is easier to maintain considering
> the complexity involved.
> 
> Please take a look and let me know your feedback.
> 
> Thanks,
> Shameer
> 
> 
> [0] https://lore.kernel.org/lkml/20210422160846.GB2214@willie-the-truck/
> [1]
> https://lore.kernel.org/lkml/20210414112312.13704-1-shameerali.kolothum.t
> h...@huawei.com/
> 
> Julien Grall (2):
>   arch/arm64: Introduce a capability to tell whether 16-bit VMID is
> available
>   kvm/arm: Align the VMID allocation with the arm64 ASID one
> 
> Shameer Kolothum (1):
>   kvm/arm: Introduce a new vmid allocator for KVM
> 
>  arch/arm64/include/asm/cpucaps.h   |   3 +-
>  arch/arm64/include/asm/kvm_asm.h   |   4 +-
>  arch/arm64/include/asm/kvm_host.h  |  11 +-
>  arch/arm64/include/asm/kvm_mmu.h   |   7 +-
>  arch/arm64/kernel/cpufeature.c |   9 +
>  arch/arm64/kvm/Makefile|   2 +-
>  arch/arm64/kvm/arm.c   | 115 
>  arch/arm64/kvm/hyp/nvhe/hyp-main.c |   6 +-
>  arch/arm64/kvm/hyp/nvhe/tlb.c  |  10 +-
>  arch/arm64/kvm/hyp/vhe/tlb.c   |  10 +-
>  arch/arm64/kvm/mmu.c   |   1 -
>  arch/arm64/kvm/vmid.c  | 285
> +
>  12 files changed, 354 insertions(+), 109 deletions(-)
>  create mode 100644 arch/arm64/kvm/vmid.c
> 
> --
> 2.17.1



Re: [RFC][PATCH] kvm: add suspend pm-notifier

2021-06-04 Thread Paolo Bonzini

On 04/06/21 09:21, Vitaly Kuznetsov wrote:
>>  	preempt_notifier_inc();
>> +	kvm_init_pm_notifier(kvm);
>
> You've probably thought it through and I didn't but wouldn't it be
> easier to have one global pm_notifier call for KVM which would go
> through the list of VMs instead of registering/deregistering a
> pm_notifier call for every created/destroyed VM?


That raises questions on the locking, i.e. whether we can take the
kvm_lock safely from the notifier.
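
For reference, a minimal sketch of the global variant (assuming taking
kvm_lock from the notifier is actually safe, which is exactly the open
question):

	static int kvm_pm_notifier_call(struct notifier_block *bl,
					unsigned long state, void *unused)
	{
		struct kvm *kvm;

		mutex_lock(&kvm_lock);
		list_for_each_entry(kvm, &vm_list, vm_list)
			kvm_arch_pm_notifier(kvm);
		mutex_unlock(&kvm_lock);

		return NOTIFY_DONE;
	}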


Paolo



Re: [RFC][PATCH] kvm: add suspend pm-notifier

2021-06-04 Thread Paolo Bonzini

On 04/06/21 02:39, Sergey Senozhatsky wrote:
>>> +void kvm_arch_pm_notifier(struct kvm *kvm)
>>> +{
>>> +}
>>> +
>>>  long kvm_arch_vm_ioctl(struct file *filp,
>>> 		       unsigned int ioctl, unsigned long arg)
>>>  {
>>
>> Looks like you want a __weak function.
>
> True. Thanks for the suggestion.
>
> I thought about it, but I recalled that tglx had __strong opinions
> on __weak functions.



Alternatively, you can add a Kconfig symbol to virt/kvm/Kconfig and 
select it from arch/x86/kvm.


Paolo



[PATCH] KVM: arm64: vgic: Communicate a change of the IRQ state via vgic_queue_irq_unlock

2021-06-04 Thread Shenming Lu
Hi Marc,

Some time ago, you suggested that we should communicate a change
of the IRQ state via vgic_queue_irq_unlock() [1], which needs to
include dropping the IRQ from the VCPU's ap_list if the IRQ is
not pending or enabled but is still on the ap_list. I additionally
added a case where the IRQ has to be migrated to another ap_list.

(maybe you have forgotten this...)
Does this patch match what you had in mind at the time?

[1] https://lore.kernel.org/patchwork/patch/1371884/

Signed-off-by: Shenming Lu 
---
 arch/arm64/kvm/vgic/vgic.c | 116 -
 1 file changed, 75 insertions(+), 41 deletions(-)

diff --git a/arch/arm64/kvm/vgic/vgic.c b/arch/arm64/kvm/vgic/vgic.c
index 15b666200f0b..9b88d49aa439 100644
--- a/arch/arm64/kvm/vgic/vgic.c
+++ b/arch/arm64/kvm/vgic/vgic.c
@@ -326,8 +326,9 @@ static bool vgic_validate_injection(struct vgic_irq *irq, 
bool level, void *owne
 
 /*
  * Check whether an IRQ needs to (and can) be queued to a VCPU's ap list.
- * Do the queuing if necessary, taking the right locks in the right order.
- * Returns true when the IRQ was queued, false otherwise.
+ * Do the queuing, dropping or migrating if necessary, taking the right
+ * locks in the right order. Returns true when the IRQ was queued, false
+ * otherwise.
  *
  * Needs to be entered with the IRQ lock already held, but will return
  * with all locks dropped.
@@ -335,49 +336,38 @@ static bool vgic_validate_injection(struct vgic_irq *irq, 
bool level, void *owne
 bool vgic_queue_irq_unlock(struct kvm *kvm, struct vgic_irq *irq,
   unsigned long flags)
 {
+   struct kvm_vcpu *target_vcpu;
struct kvm_vcpu *vcpu;
+   bool ret = false;
 
	lockdep_assert_held(&irq->irq_lock);
 
 retry:
-   vcpu = vgic_target_oracle(irq);
-   if (irq->vcpu || !vcpu) {
+   target_vcpu = vgic_target_oracle(irq);
+   vcpu = irq->vcpu;
+   if (target_vcpu == vcpu) {
/*
-* If this IRQ is already on a VCPU's ap_list, then it
-* cannot be moved or modified and there is no more work for
+* If this IRQ's state is consistent with whether it is on
+* the right ap_list or not, there is no more work for
 * us to do.
-*
-* Otherwise, if the irq is not pending and enabled, it does
-* not need to be inserted into an ap_list and there is also
-* no more work for us to do.
 */
	raw_spin_unlock_irqrestore(&irq->irq_lock, flags);
-
-   /*
-* We have to kick the VCPU here, because we could be
-* queueing an edge-triggered interrupt for which we
-* get no EOI maintenance interrupt. In that case,
-* while the IRQ is already on the VCPU's AP list, the
-* VCPU could have EOI'ed the original interrupt and
-* won't see this one until it exits for some other
-* reason.
-*/
-   if (vcpu) {
-   kvm_make_request(KVM_REQ_IRQ_PENDING, vcpu);
-   kvm_vcpu_kick(vcpu);
-   }
-   return false;
+   goto out;
}
 
/*
 * We must unlock the irq lock to take the ap_list_lock where
-* we are going to insert this new pending interrupt.
+* we are going to insert/drop this IRQ.
 */
	raw_spin_unlock_irqrestore(&irq->irq_lock, flags);
 
/* someone can do stuff here, which we re-check below */
 
-   raw_spin_lock_irqsave(&vcpu->arch.vgic_cpu.ap_list_lock, flags);
+   if (target_vcpu)
+   raw_spin_lock_irqsave(&target_vcpu->arch.vgic_cpu.ap_list_lock,
+ flags);
+   if (vcpu)
+   raw_spin_lock_irqsave(&vcpu->arch.vgic_cpu.ap_list_lock, flags);
	raw_spin_lock(&irq->irq_lock);
 
/*
@@ -392,30 +382,74 @@ bool vgic_queue_irq_unlock(struct kvm *kvm, struct 
vgic_irq *irq,
 * In both cases, drop the locks and retry.
 */
 
-   if (unlikely(irq->vcpu || vcpu != vgic_target_oracle(irq))) {
+   if (unlikely(target_vcpu != vgic_target_oracle(irq) ||
+vcpu != irq->vcpu)) {
	raw_spin_unlock(&irq->irq_lock);
-   raw_spin_unlock_irqrestore(&vcpu->arch.vgic_cpu.ap_list_lock,
-  flags);
+   if (target_vcpu)
+   raw_spin_unlock_irqrestore(&target_vcpu->arch.vgic_cpu.ap_list_lock,
+  flags);
+   if (vcpu)
+   raw_spin_unlock_irqrestore(&vcpu->arch.vgic_cpu.ap_list_lock,
+  flags);
 
	raw_spin_lock_irqsave(&irq->irq_lock, flags);
goto retry;
}
 
-   /*
-* Grab a reference to the irq to reflect the fact that it is
-* now in the