Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-03-30 Thread Steven Price
On 29/03/2022 18:01, Quentin Perret wrote:
> On Monday 28 Mar 2022 at 18:58:35 (+), Sean Christopherson wrote:
>> On Mon, Mar 28, 2022, Quentin Perret wrote:
>>> Hi Sean,
>>>
>>> Thanks for the reply, this helps a lot.
>>>
>>> On Monday 28 Mar 2022 at 17:13:10 (+), Sean Christopherson wrote:
 On Thu, Mar 24, 2022, Quentin Perret wrote:
> For Protected KVM (and I suspect most other confidential computing
> solutions), guests have the ability to share some of their pages back
> with the host kernel using a dedicated hypercall. This is necessary
> for e.g. virtio communications, so these shared pages need to be mapped
> back into the VMM's address space. I'm a bit confused about how that
> would work with the approach proposed here. What is going to be the
> approach for TDX?
>
> It feels like the most 'natural' thing would be to have a KVM exit
> reason describing which pages have been shared back by the guest, and to
> then allow the VMM to mmap those specific pages in response in the
> memfd. Is this something that has been discussed or considered?

 The proposed solution is to exit to userspace with a new exit reason,
 KVM_EXIT_MEMORY_ERROR, when the guest makes the hypercall to request
 conversion[1].  The private fd itself will never allow mapping memory into
 userspace, instead userspace will need to punch a hole in the private fd
 backing store.  The absence of a valid mapping in the private fd is how KVM
 detects that a pfn is "shared" (memslots without a private fd are always
 shared)[2].
>>>
>>> Right. I'm still a bit confused about how the VMM is going to get the
>>> shared page mapped in its page-table. Once it has punched a hole into
>>> the private fd, how is it supposed to access the actual physical page
>>> that the guest shared?
>>
>> The guest doesn't share a _host_ physical page, the guest shares a _guest_
>> physical page.  Until host userspace converts the gfn to shared and thus maps
>> the gfn=>hva via mmap(), the guest is blocked and can't read/write/exec the
>> memory.  AFAIK, no architecture allows in-place decryption of guest private
>> memory.  s390 allows a page to be "made accessible" to the host for the
>> purposes of swap, and other architectures will have similar behavior for
>> migrating a protected VM, but those scenarios are not sharing the page (and
>> they also make the page inaccessible to the guest).
> 
> I see. FWIW, since pKVM is entirely MMU-based, we are in fact capable of
> doing in-place sharing, which also means it can retain the content of
> the page as part of the conversion.
> 
> Also, I'll ask the Arm CCA developers to correct me if this is wrong, but
> I _believe_ it should be technically possible to do in-place sharing for
> them too.

In general this isn't possible as the physical memory could be
encrypted, so some temporary memory is required. We have prototyped
having a single temporary page for the setup when populating the guest's
initial memory - this has the nice property of not requiring any
additional allocation during the process but with the downside of
effectively two memcpy()s per page (one to the temporary page and
another, with optional encryption, into the now private page).
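
As a rough illustration of the flow above (hypothetical helper name, purely
to show the data movement - this is not the actual prototype code):

	/*
	 * Populate one protected page from VMM-supplied data using a single
	 * scratch page. rmm_copy_to_protected() stands in for whatever RMM
	 * interface performs the copy/encryption; its name and signature are
	 * assumptions for illustration only.
	 */
	static int populate_protected_page(void *scratch_page,
					   const void __user *src,
					   unsigned long protected_pfn)
	{
		/* First memcpy(): user data into the shared scratch page */
		if (copy_from_user(scratch_page, src, PAGE_SIZE))
			return -EFAULT;

		/* Second memcpy() (with optional encryption): scratch page
		 * into the now-private page via the RMM */
		return rmm_copy_to_protected(protected_pfn, scratch_page);
	}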

>>> Is there an assumption somewhere that the VMM should have this page mapped in
>>> via an alias that it can legally access only once it has punched a hole at
>>> the corresponding offset in the private fd or something along those lines?
>>
>> Yes, the VMM must have a completely separate VMA.  The VMM doesn't have to
>> wait until the conversion to mmap() the shared variant, though obviously it
>> will potentially consume double the memory if the VMM actually populates both
>> the private and shared backing stores.
> 
> Gotcha. This is what confused me I think -- in this approach private and
> shared pages are in fact entirely different.
> 
> In which scenario could you end up with both the private and shared
> pages live at the same time? Would this be something like follows?
> 
>  - userspace creates a private fd, fallocates into it, and associates
>    the <fd, offset> tuple with a private memslot;
> 
>  - userspace then mmaps anonymous memory (for ex.), and associates it
>with a standard memslot, which happens to be positioned at exactly
>the right offset w.r.t to the private memslot (with this offset
>defined by the bit that is set for the private addresses in the gpa
>space);
> 
>  - the guest runs, and accesses both 'aliases' of the page without doing
>an explicit share hypercall.
> 
> Is there another option?

AIUI you can't have both private and shared "live" at the same time. But
you can have a page allocated both in the private fd and in the same
location in the (shared) memslot in the VMM's memory map. In this
situation the private fd page effectively hides the shared page.
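
(For reference, under the proposal above the "conversion" itself is just the
VMM punching a hole in the private fd at the offset backing that GPA, roughly
as sketched below - the offset/size would come from the KVM_EXIT_MEMORY_ERROR
information, whose exact layout is still under discussion:)

	/* Release the private backing so the gfn becomes "shared" */
	if (fallocate(private_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      offset, size))
		err(1, "punch hole in private fd");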

> Is implicit sharing a 

Re: [PATCH v4 01/12] mm/shmem: Introduce F_SEAL_INACCESSIBLE

2022-02-23 Thread Steven Price
On 23/02/2022 11:49, Chao Peng wrote:
> On Thu, Feb 17, 2022 at 11:09:35AM -0800, Andy Lutomirski wrote:
>> On Thu, Feb 17, 2022, at 5:06 AM, Chao Peng wrote:
>>> On Fri, Feb 11, 2022 at 03:33:35PM -0800, Andy Lutomirski wrote:
 On 1/18/22 05:21, Chao Peng wrote:
> From: "Kirill A. Shutemov" 
>
> Introduce a new seal F_SEAL_INACCESSIBLE indicating the content of
> the file is inaccessible from userspace through ordinary MMU access
> (e.g., read/write/mmap). However, the file content can be accessed
> via a different mechanism (e.g. KVM MMU) indirectly.
>
> It provides semantics required for KVM guest private memory support
> that a file descriptor with this seal set is going to be used as the
> source of guest memory in confidential computing environments such
> as Intel TDX/AMD SEV but may not be accessible from host userspace.
>
> At this time only shmem implements this seal.
>

 I don't dislike this *that* much, but I do dislike this. F_SEAL_INACCESSIBLE
 essentially transmutes a memfd into a different type of object.  While this
 can apparently be done successfully and without races (as in this code),
 it's at least awkward.  I think that either creating a special inaccessible
 memfd should be a single operation that creates the correct type of object or
 there should be a clear justification for why it's a two-step process.
>>>
>>> Now one justification may be from Steven's comment to patch-00: for ARM
>>> usage it can be used by creating a normal memfd, (partially) populating
>>> it with initial guest memory content (e.g. firmware), and then
>>> F_SEAL_INACCESSIBLE-ing it just before the first launch of the guest in
>>> KVM (definitely the current code needs to be changed to support that).
>>
>> Except we don't allow F_SEAL_INACCESSIBLE on a non-empty file, right?  So 
>> this won't work.
> 
> Hmm, right, if we set F_SEAL_INACCESSIBLE on a non-empty file, we will
> need to make sure access to any existing mmap-ed areas is prevented,
> but that is hard.
> 
>>
>> In any case, the whole confidential VM initialization story is a bit muddy.  
>> From the earlier emails, it sounds like ARM expects the host to fill in 
>> guest memory and measure it.  From my recollection of Intel's scheme (which 
>> may well be wrong, and I could easily be confusing it with SGX), TDX instead 
>> measures what is essentially a transcript of the series of operations that 
>> initializes the VM.  These are fundamentally not the same thing even if they 
>> accomplish the same end goal.  For TDX, we unavoidably need an operation 
>> (ioctl or similar) that initializes things according to the VM's 
>> instructions, and ARM ought to be able to use roughly the same mechanism.
> 
> Yes, TDX requires an ioctl. Steven may comment on the ARM part.

The Arm story is evolving so I can't give a definite answer yet. Our
current prototyping works by creating the initial VM content in a
memslot as with a normal VM and then calling an ioctl which throws the
big switch and converts all the (populated) pages to be protected. At
this point the RMM performs a measurement of the data that the VM is
being populated with.

The above (in our prototype) suffers from all the expected problems with
a malicious VMM being able to trick the host kernel into accessing those
pages after they have been protected (causing a fault detected by the
hardware).

The ideal (from our perspective) approach would be to follow the same
flow but where the VMM populates a memfd rather than normal anonymous
pages. The memfd could then be sealed and the pages converted to
protected ones (with the RMM measuring them in the process).

The question becomes how is that memfd populated? It would be nice if
that could be done using normal operations on a memfd (i.e. using
mmap()) and therefore this code could be (relatively) portable. This
would mean that any pages mapped from the memfd would either need to
block the sealing or be revoked at the time of sealing.

The other approach is we could of course implement a special ioctl which
effectively does a memcpy into the (created empty and sealed) memfd and
does the necessary dance with the RMM to measure the contents. This
would match the "transcript of the series of operations" described above
- but seems much less ideal from the viewpoint of the VMM.
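
For illustration only, from the VMM's point of view such an ioctl might look
something like the below - the name, structure layout and semantics are
entirely hypothetical, nothing like this exists yet:

	/* Hypothetical "populate and measure" call: copy plaintext data from
	 * VMM memory into the sealed memfd backing the given IPA range, with
	 * the RMM measuring it as part of the copy. */
	struct kvm_arm_rme_populate {		/* hypothetical */
		__u64 guest_ipa;
		__u64 size;
		__u64 src_uaddr;		/* plaintext in VMM memory */
	};

	struct kvm_arm_rme_populate pop = {
		.guest_ipa = ipa,
		.size	   = len,
		.src_uaddr = (__u64)buf,
	};
	ioctl(vm_fd, KVM_ARM_RME_POPULATE, &pop);	/* hypothetical ioctl */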

Steve

> Chao
>>
>> Also, if we ever get fancy and teach the page allocator about memory with 
>> reduced directmap permissions, it may well be more efficient for userspace 
>> to shove data into a memfd via ioctl than it is to mmap it and write the 
>> data.
> 
> 
> 




Re: [PATCH v4 00/12] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-02-02 Thread Steven Price
Hi Jun,

On 02/02/2022 02:28, Nakajima, Jun wrote:
> 
>> On Jan 28, 2022, at 8:47 AM, Steven Price  wrote:
>>
>> On 18/01/2022 13:21, Chao Peng wrote:
>>> This is the v4 of this series which tries to implement the fd-based KVM
>>> guest private memory. The patches are based on latest kvm/queue branch
>>> commit:
>>>
>>>  fea31d169094 KVM: x86/pmu: Fix available_event_types check for
>>>   REF_CPU_CYCLES event
>>>
>>> Introduction
>>> 
>>> In general this patch series introduces an fd-based memslot which provides
>>> guest memory through memory file descriptor fd[offset,size] instead of
>>> hva/size. The fd can be created from a supported memory filesystem
>>> like tmpfs/hugetlbfs etc. which we refer to as the memory backing store. KVM
>>> and the memory backing store exchange callbacks when such a memslot
>>> gets created. At runtime KVM will call into callbacks provided by the
>>> backing store to get the pfn with the fd+offset. Memory backing store
>>> will also call into KVM callbacks when userspace fallocate/punch hole
>>> on the fd to notify KVM to map/unmap secondary MMU page tables.
>>>
>>> Compared to the existing hva-based memslot, this new type of memslot allows
>>> guest memory to be unmapped from host userspace such as QEMU and even from the
>>> kernel itself, therefore reducing the attack surface and preventing bugs.
>>>
>>> Based on this fd-based memslot, we can build guest private memory that
>>> is going to be used in confidential computing environments such as Intel
>>> TDX and AMD SEV. When supported, the memory backing store can provide
>>> more enforcement on the fd and KVM can use a single memslot to hold both
>>> the private and shared part of the guest memory. 
>>
>> This looks like it will be useful for Arm's Confidential Compute
>> Architecture (CCA) too - in particular we need a way of ensuring that
>> user space cannot 'trick' the kernel into accessing memory which has
>> been delegated to a realm (i.e. protected guest), and a memfd seems like
>> a good match.
> 
> Good to hear that it will be useful for ARM’s CCA as well.
> 
>>
>> Some comments below.
>>
>>> mm extension
>>> -
>>> Introduces new F_SEAL_INACCESSIBLE for shmem and new MFD_INACCESSIBLE
>>> flag for memfd_create(), the file created with these flags cannot read(),
>>> write() or mmap() etc via normal MMU operations. The file content can
>>> only be used with the newly introduced memfile_notifier extension.
>>
>> For Arm CCA we are expecting to seed the realm with initial memory
>> contents (e.g. kernel and initrd) which will then be measured before
>> execution starts. The 'obvious' way of doing this with a memfd would be
>> to populate parts of the memfd then seal it with F_SEAL_INACCESSIBLE.
> 
> As far as I understand, we have the same problem with TDX, where a guest TD 
> (Trust Domain) starts in private memory. We seed the private memory typically 
> with a guest firmware, and the initial image (plaintext) is copied to 
> somewhere in QEMU memory (from disk, for example) for that purpose; this 
> location is not associated with the target GPA.
> 
> Upon a (new) ioctl from QEMU, KVM requests the TDX Module to copy the pages 
> to private memory (by encrypting) specifying the target GPA, using a TDX 
> interface function (TDH.MEM.PAGE.ADD). The actual pages for the private 
> memory are allocated by the callbacks provided by the backing store during the 
> “copy” operation.
> 
> We extended the existing KVM_MEMORY_ENCRYPT_OP (ioctl) for the above. 

Ok, so if I understand correctly QEMU would do something along the lines of:

1. Use memfd_create(...MFD_INACCESSIBLE) to allocate private memory for
the guest.

2. ftruncate/fallocate the memfd to back the appropriate areas of the memfd.

3. Create a memslot in KVM pointing to the memfd

4. Load the 'guest firmware' (kernel/initrd or similar) into VMM memory

5. Use the KVM_MEMORY_ENCRYPT_OP to request the 'guest firmware' be
copied into the private memory. The ioctl would temporarily pin the
pages and ask the TDX module to copy (& encrypt) the data into the
private memory, unpinning after the copy.

6. QEMU can then free the unencrypted copy of the guest firmware.

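In code, steps 1 and 2 would be roughly the following (a sketch only:
MFD_INACCESSIBLE is the flag proposed in this series, so it isn't in any
released headers yet):

	#define _GNU_SOURCE
	#include <err.h>
	#include <fcntl.h>		/* fallocate() */
	#include <sys/mman.h>		/* memfd_create() */

	static int create_private_backing(off_t guest_mem_size)
	{
		/* Step 1: private memory for the guest, never mappable */
		int fd = memfd_create("guest-private-mem", MFD_INACCESSIBLE);

		if (fd < 0)
			err(1, "memfd_create");

		/* Step 2: back the area that will hold guest private memory */
		if (fallocate(fd, 0, 0, guest_mem_size))
			err(1, "fallocate");

		/*
		 * Steps 3-5 then register a memslot referring to fd+offset
		 * (using the extended memslot uAPI from this series), load
		 * the guest firmware into ordinary VMM memory and hand it to
		 * KVM via the KVM_MEMORY_ENCRYPT_OP-style ioctl above.
		 */
		return fd;
	}
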
>>
>> However as things stand it's not possible to set the INACCESSIBLE seal
>> after creating a memfd (F_ALL_SEALS hasn't been updated to include it).
>>
>> One potential workaround would be for arm64 to provide a custom KVM
>> ioctl to effectively memcpy() into the guest's protected memory which
>

Re: [PATCH v4 00/12] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-01-28 Thread Steven Price
On 18/01/2022 13:21, Chao Peng wrote:
> This is the v4 of this series which tries to implement the fd-based KVM
> guest private memory. The patches are based on latest kvm/queue branch
> commit:
> 
>   fea31d169094 KVM: x86/pmu: Fix available_event_types check for
>REF_CPU_CYCLES event
> 
> Introduction
> 
> In general this patch series introduces an fd-based memslot which provides
> guest memory through memory file descriptor fd[offset,size] instead of
> hva/size. The fd can be created from a supported memory filesystem
> like tmpfs/hugetlbfs etc. which we refer to as the memory backing store. KVM
> and the memory backing store exchange callbacks when such a memslot
> gets created. At runtime KVM will call into callbacks provided by the
> backing store to get the pfn with the fd+offset. Memory backing store
> will also call into KVM callbacks when userspace fallocate/punch hole
> on the fd to notify KVM to map/unmap secondary MMU page tables.
> 
> Compared to the existing hva-based memslot, this new type of memslot allows
> guest memory to be unmapped from host userspace such as QEMU and even from the
> kernel itself, therefore reducing the attack surface and preventing bugs.
> 
> Based on this fd-based memslot, we can build guest private memory that
> is going to be used in confidential computing environments such as Intel
> TDX and AMD SEV. When supported, the memory backing store can provide
> more enforcement on the fd and KVM can use a single memslot to hold both
> the private and shared part of the guest memory. 

This looks like it will be useful for Arm's Confidential Compute
Architecture (CCA) too - in particular we need a way of ensuring that
user space cannot 'trick' the kernel into accessing memory which has
been delegated to a realm (i.e. protected guest), and a memfd seems like
a good match.

Some comments below.

> mm extension
> -
> Introduces new F_SEAL_INACCESSIBLE for shmem and new MFD_INACCESSIBLE
> flag for memfd_create(), the file created with these flags cannot read(),
> write() or mmap() etc via normal MMU operations. The file content can
> only be used with the newly introduced memfile_notifier extension.

For Arm CCA we are expecting to seed the realm with initial memory
contents (e.g. kernel and initrd) which will then be measured before
execution starts. The 'obvious' way of doing this with a memfd would be
to populate parts of the memfd then seal it with F_SEAL_INACCESSIBLE.

However as things stand it's not possible to set the INACCESSIBLE seal
after creating a memfd (F_ALL_SEALS hasn't been updated to include it).

One potential workaround would be for arm64 to provide a custom KVM
ioctl to effectively memcpy() into the guest's protected memory which
would only be accessible before the guest has started. The drawback is
that it requires two copies of the data during guest setup.

Do you think things could be relaxed so the F_SEAL_INACCESSIBLE flag
could be set after a memfd has been created (and partially populated)?
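
For concreteness, the flow being asked about would be something like the
below in the VMM (error handling trimmed) - note this is exactly what is not
possible today, since F_SEAL_INACCESSIBLE is not part of F_ALL_SEALS and
cannot yet be applied to a populated file:

	int fd = memfd_create("guest-mem", MFD_CLOEXEC | MFD_ALLOW_SEALING);

	ftruncate(fd, size);
	void *map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/* Populate the initial contents (kernel, initrd, ...) */
	memcpy(map, kernel_image, kernel_size);
	munmap(map, size);

	/* Seal just before the first run of the guest */
	if (fcntl(fd, F_ADD_SEALS, F_SEAL_INACCESSIBLE))
		err(1, "F_ADD_SEALS");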

Thanks,

Steve

> The memfile_notifier extension provides two sets of callbacks for KVM to
> interact with the memory backing store:
>   - memfile_notifier_ops: callbacks for memory backing store to notify
> KVM when memory gets allocated/invalidated.
>   - memfile_pfn_ops: callbacks for KVM to call into memory backing store
> to request memory pages for guest private memory.
> 
> memslot extension
> -
> Add the private fd and the fd offset to existing 'shared' memslot so that
> both private/shared guest memory can live in one single memslot. A page in
> the memslot is either private or shared. A page is private only when it's
> already allocated in the backing store fd; in all other cases it's treated
> as shared. This includes those already mapped as shared as well as those
> having not been mapped. This means the memory backing store is the place
> which tells the truth of which page is private.
> 
> Private memory map/unmap and conversion
> ---
> Userspace's map/unmap operations are done by fallocate() ioctl on the
> backing store fd.
>   - map: default fallocate() with mode=0.
>   - unmap: fallocate() with FALLOC_FL_PUNCH_HOLE.
> The map/unmap will trigger above memfile_notifier_ops to let KVM map/unmap
> secondary MMU page tables.
> 
> Test
> 
> To test the new functionalities of this patch the TDX patchset is needed.
> Since the TDX patchset has not been merged, I did two kinds of test:
> 
> -  Regression test on kvm/queue (this patch)
>    Most new code is not covered. I only tested building and booting.
> 
> -  New functional test on latest TDX code
>    The patch is rebased to the latest TDX code and tested the new
>    functionalities.
> 
> For TDX test please see below repos:
> Linux: https://github.com/chao-p/linux/tree/privmem-v4.3
> QEMU: https://github.com/chao-p/qemu/tree/privmem-v4
> 
> And an example QEMU command line:
> -object tdx-guest,id=tdx \
> -object 

Re: [PATCH v4 02/12] mm/memfd: Introduce MFD_INACCESSIBLE flag

2022-01-21 Thread Steven Price
On 18/01/2022 13:21, Chao Peng wrote:
> Introduce a new memfd_create() flag indicating the content of the
> created memfd is inaccessible from userspace. It does this by force
> setting the F_SEAL_INACCESSIBLE seal when the file is created. It also sets
> F_SEAL_SEAL to prevent future sealing, which means it cannot coexist
> with MFD_ALLOW_SEALING.
> 
> The pages backed by such memfd will be used as guest private memory in
> confidential computing environments such as Intel TDX/AMD SEV. Since
> page migration/swapping is not yet supported for such usages, these
> pages are currently marked as UNMOVABLE and UNEVICTABLE which makes
> them behave like long-term pinned pages.
> 
> Signed-off-by: Chao Peng 
> ---
>  include/uapi/linux/memfd.h |  1 +
>  mm/memfd.c | 20 +++-
>  2 files changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
> index 7a8a26751c23..48750474b904 100644
> --- a/include/uapi/linux/memfd.h
> +++ b/include/uapi/linux/memfd.h
> @@ -8,6 +8,7 @@
>  #define MFD_CLOEXEC  0x0001U
>  #define MFD_ALLOW_SEALING0x0002U
>  #define MFD_HUGETLB  0x0004U
> +#define MFD_INACCESSIBLE 0x0008U
>  
>  /*
>   * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
> diff --git a/mm/memfd.c b/mm/memfd.c
> index 9f80f162791a..26998d96dc11 100644
> --- a/mm/memfd.c
> +++ b/mm/memfd.c
> @@ -245,16 +245,19 @@ long memfd_fcntl(struct file *file, unsigned int cmd, 
> unsigned long arg)
>  #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
>  #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
>  
> -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
> +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
> +MFD_INACCESSIBLE)
>  
>  SYSCALL_DEFINE2(memfd_create,
>   const char __user *, uname,
>   unsigned int, flags)
>  {
> + struct address_space *mapping;
>   unsigned int *file_seals;
>   struct file *file;
>   int fd, error;
>   char *name;
> + gfp_t gfp;
>   long len;
>  
>   if (!(flags & MFD_HUGETLB)) {
> @@ -267,6 +270,10 @@ SYSCALL_DEFINE2(memfd_create,
>   return -EINVAL;
>   }
>  
> + /* Disallow sealing when MFD_INACCESSIBLE is set. */
> + if (flags & MFD_INACCESSIBLE && flags & MFD_ALLOW_SEALING)
> + return -EINVAL;
> +
>   /* length includes terminating zero */
>   len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
>   if (len <= 0)
> @@ -315,6 +322,17 @@ SYSCALL_DEFINE2(memfd_create,
>   *file_seals &= ~F_SEAL_SEAL;
>   }
>  
> + if (flags & MFD_INACCESSIBLE) {
> + mapping = file_inode(file)->i_mapping;
> + gfp = mapping_gfp_mask(mapping);
> + gfp &= ~__GFP_MOVABLE;
> + mapping_set_gfp_mask(mapping, gfp);
> + mapping_set_unevictable(mapping);
> +
> + file_seals = memfd_file_seals_ptr(file);
> + *file_seals &= F_SEAL_SEAL | F_SEAL_INACCESSIBLE;

This looks backwards - the flags should be set on *file_seals, but here
you are unsetting all other flags.
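
i.e. presumably the intention was something along these lines:

	*file_seals |= F_SEAL_SEAL | F_SEAL_INACCESSIBLE;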

Steve

> + }
> +
>   fd_install(fd, file);
>   kfree(name);
>   return fd;
> 




Re: [RFC v2 PATCH 06/13] KVM: Register/unregister memfd backed memslot

2021-11-25 Thread Steven Price
On 19/11/2021 13:47, Chao Peng wrote:
> Signed-off-by: Yu Zhang 
> Signed-off-by: Chao Peng 
> ---
>  virt/kvm/kvm_main.c | 23 +++
>  1 file changed, 19 insertions(+), 4 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 271cef8d1cd0..b8673490d301 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1426,7 +1426,7 @@ static void update_memslots(struct kvm_memslots *slots,
>  static int check_memory_region_flags(struct kvm *kvm,
>const struct kvm_userspace_memory_region_ext *mem)
>  {
> - u32 valid_flags = 0;
> + u32 valid_flags = KVM_MEM_FD;
>  
>   if (!kvm->dirty_log_unsupported)
>   valid_flags |= KVM_MEM_LOG_DIRTY_PAGES;
> @@ -1604,10 +1604,20 @@ static int kvm_set_memslot(struct kvm *kvm,
>   kvm_copy_memslots(slots, __kvm_memslots(kvm, as_id));
>   }
>  
> + if (mem->flags & KVM_MEM_FD && change == KVM_MR_CREATE) {
> + r = kvm_memfd_register(kvm, mem, new);
> + if (r)
> + goto out_slots;
> + }
> +
>   r = kvm_arch_prepare_memory_region(kvm, new, mem, change);
>   if (r)
>   goto out_slots;
>  
> + if (mem->flags & KVM_MEM_FD && (r || change == KVM_MR_DELETE)) {
^
r will never be non-zero as the 'if' above will catch that case and jump
to out_slots.

I *think* the intention was that the "if (r)" code should be after this
check to clean up in the case of error from
kvm_arch_prepare_memory_region() (as well as an explicit MR_DELETE).
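
i.e. something along these lines (a sketch only, reusing the names from the
patch):

	if (mem->flags & KVM_MEM_FD && change == KVM_MR_CREATE) {
		r = kvm_memfd_register(kvm, mem, new);
		if (r)
			goto out_slots;
	}

	r = kvm_arch_prepare_memory_region(kvm, new, mem, change);

	/* Clean up the registration on error or on an explicit delete */
	if (mem->flags & KVM_MEM_FD && (r || change == KVM_MR_DELETE))
		kvm_memfd_unregister(kvm, new);

	if (r)
		goto out_slots;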

Steve

> + kvm_memfd_unregister(kvm, new);
> + }
> +
>   update_memslots(slots, new, change);
>   slots = install_new_memslots(kvm, as_id, slots);
>  
> @@ -1683,10 +1693,12 @@ int __kvm_set_memory_region(struct kvm *kvm,
>   return -EINVAL;
>   if (mem->guest_phys_addr & (PAGE_SIZE - 1))
>   return -EINVAL;
> - /* We can read the guest memory with __xxx_user() later on. */
>   if ((mem->userspace_addr & (PAGE_SIZE - 1)) ||
> - (mem->userspace_addr != untagged_addr(mem->userspace_addr)) ||
> -  !access_ok((void __user *)(unsigned long)mem->userspace_addr,
> + (mem->userspace_addr != untagged_addr(mem->userspace_addr)))
> + return -EINVAL;
> + /* We can read the guest memory with __xxx_user() later on. */
> + if (!(mem->flags & KVM_MEM_FD) &&
> + !access_ok((void __user *)(unsigned long)mem->userspace_addr,
>   mem->memory_size))
>   return -EINVAL;
>   if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
> @@ -1727,6 +1739,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
>   new.dirty_bitmap = NULL;
> memset(&new.arch, 0, sizeof(new.arch));
>   } else { /* Modify an existing slot. */
> + /* Private memslots are immutable, they can only be deleted. */
> + if (mem->flags & KVM_MEM_FD && mem->private_fd >= 0)
> + return -EINVAL;
>   if ((new.userspace_addr != old.userspace_addr) ||
>   (new.npages != old.npages) ||
>   ((new.flags ^ old.flags) & KVM_MEM_READONLY))
> 




Re: [PATCH v17 5/6] KVM: arm64: ioctl to fetch/store tags in a guest

2021-06-24 Thread Steven Price
On 24/06/2021 14:35, Marc Zyngier wrote:
> Hi Steven,
> 
> On Mon, 21 Jun 2021 12:17:15 +0100,
> Steven Price  wrote:
>>
>> The VMM may not wish to have it's own mapping of guest memory mapped
>> with PROT_MTE because this causes problems if the VMM has tag checking
>> enabled (the guest controls the tags in physical RAM and it's unlikely
>> the tags are correct for the VMM).
>>
>> Instead add a new ioctl which allows the VMM to easily read/write the
>> tags from guest memory, allowing the VMM's mapping to be non-PROT_MTE
>> while the VMM can still read/write the tags for the purpose of
>> migration.
>>
>> Reviewed-by: Catalin Marinas 
>> Signed-off-by: Steven Price 
>> ---
>>  arch/arm64/include/asm/kvm_host.h |  3 ++
>>  arch/arm64/include/asm/mte-def.h  |  1 +
>>  arch/arm64/include/uapi/asm/kvm.h | 11 +
>>  arch/arm64/kvm/arm.c  |  7 +++
>>  arch/arm64/kvm/guest.c| 82 +++
>>  include/uapi/linux/kvm.h  |  1 +
>>  6 files changed, 105 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/kvm_host.h 
>> b/arch/arm64/include/asm/kvm_host.h
>> index 309e36cc1b42..6a2ac4636d42 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -729,6 +729,9 @@ int kvm_arm_vcpu_arch_get_attr(struct kvm_vcpu *vcpu,
>>  int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
>> struct kvm_device_attr *attr);
>>  
>> +long kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
>> +struct kvm_arm_copy_mte_tags *copy_tags);
>> +
>>  /* Guest/host FPSIMD coordination helpers */
>>  int kvm_arch_vcpu_run_map_fp(struct kvm_vcpu *vcpu);
>>  void kvm_arch_vcpu_load_fp(struct kvm_vcpu *vcpu);
>> diff --git a/arch/arm64/include/asm/mte-def.h 
>> b/arch/arm64/include/asm/mte-def.h
>> index cf241b0f0a42..626d359b396e 100644
>> --- a/arch/arm64/include/asm/mte-def.h
>> +++ b/arch/arm64/include/asm/mte-def.h
>> @@ -7,6 +7,7 @@
>>  
>>  #define MTE_GRANULE_SIZE   UL(16)
>>  #define MTE_GRANULE_MASK   (~(MTE_GRANULE_SIZE - 1))
>> +#define MTE_GRANULES_PER_PAGE  (PAGE_SIZE / MTE_GRANULE_SIZE)
>>  #define MTE_TAG_SHIFT  56
>>  #define MTE_TAG_SIZE   4
>>  #define MTE_TAG_MASK   GENMASK((MTE_TAG_SHIFT + (MTE_TAG_SIZE - 1)), MTE_TAG_SHIFT)
>> diff --git a/arch/arm64/include/uapi/asm/kvm.h 
>> b/arch/arm64/include/uapi/asm/kvm.h
>> index 24223adae150..b3edde68bc3e 100644
>> --- a/arch/arm64/include/uapi/asm/kvm.h
>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>> @@ -184,6 +184,17 @@ struct kvm_vcpu_events {
>>  __u32 reserved[12];
>>  };
>>  
>> +struct kvm_arm_copy_mte_tags {
>> +__u64 guest_ipa;
>> +__u64 length;
>> +void __user *addr;
>> +__u64 flags;
>> +__u64 reserved[2];
>> +};
>> +
>> +#define KVM_ARM_TAGS_TO_GUEST   0
>> +#define KVM_ARM_TAGS_FROM_GUEST 1
>> +
>>  /* If you need to interpret the index values, here is the key: */
>>  #define KVM_REG_ARM_COPROC_MASK 0x0FFF
>>  #define KVM_REG_ARM_COPROC_SHIFT16
>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index 28ce26a68f09..511f3716fe33 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -1359,6 +1359,13 @@ long kvm_arch_vm_ioctl(struct file *filp,
>>  
>>  return 0;
>>  }
>> +case KVM_ARM_MTE_COPY_TAGS: {
>> +struct kvm_arm_copy_mte_tags copy_tags;
>> +
>> if (copy_from_user(&copy_tags, argp, sizeof(copy_tags)))
>> +return -EFAULT;
>> return kvm_vm_ioctl_mte_copy_tags(kvm, &copy_tags);
>> +}
>>  default:
>>  return -EINVAL;
>>  }
>> diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
>> index 5cb4a1cd5603..4ddb20017b2f 100644
>> --- a/arch/arm64/kvm/guest.c
>> +++ b/arch/arm64/kvm/guest.c
>> @@ -995,3 +995,85 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
>>  
>>  return ret;
>>  }
>> +
>> +long kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
>> +struct kvm_arm_copy_mte_tags *copy_tags)
>> +{
>> +gpa_t guest_ipa = copy_tags->guest_ipa;
>> +size_t length = copy_tags->length;
>> +void __user *tags = copy_tags->addr;
>> +gpa_t gfn;
>> +bool write = !(c

Re: [PATCH v17 0/6] MTE support for KVM guest

2021-06-23 Thread Steven Price
On 22/06/2021 15:21, Marc Zyngier wrote:
> On Mon, 21 Jun 2021 12:17:10 +0100, Steven Price wrote:
>> This series adds support for using the Arm Memory Tagging Extensions
>> (MTE) in a KVM guest.
>>
>> Changes since v16[1]:
>>
>>  - Dropped the first patch ("Handle race when synchronising tags") as
>>it's not KVM specific and by restricting MAP_SHARED in KVM there is
>>no longer a dependency.
>>
>> [...]
> 
> Applied to next, thanks!
> 
> [1/6] arm64: mte: Sync tags for pages where PTE is untagged
>   commit: 69e3b846d8a753f9f279f29531ca56b0f7563ad0
> [2/6] KVM: arm64: Introduce MTE VM feature
>   commit: ea7fc1bb1cd1b92b42b1d9273ce7e231d3dc9321
> [3/6] KVM: arm64: Save/restore MTE registers
>   commit: e1f358b5046479d2897f23b1d5b092687c6e7a67
> [4/6] KVM: arm64: Expose KVM_ARM_CAP_MTE
>   commit: 673638f434ee4a00319e254ade338c57618d6f7e
> [5/6] KVM: arm64: ioctl to fetch/store tags in a guest
>   commit: f0376edb1ddcab19a473b4bf1fbd5b6bbed3705b
> [6/6] KVM: arm64: Document MTE capability and ioctl
>   commit: 04c02c201d7e8149ae336ead69fb64e4e6f94bc9
> 
> I performed a number of changes in user_mem_abort(), so please
> have a look at the result. It is also pretty late in the merge
> cycle, so if anything looks amiss, I'll just drop it.

It all looks good to me - thanks for making those changes.

Steve



Re: [PATCH v17 5/6] KVM: arm64: ioctl to fetch/store tags in a guest

2021-06-23 Thread Steven Price
On 22/06/2021 11:56, Fuad Tabba wrote:
> Hi Marc,
> 
> On Tue, Jun 22, 2021 at 11:25 AM Marc Zyngier  wrote:
>>
>> Hi Fuad,
>>
>> On Tue, 22 Jun 2021 09:56:22 +0100,
>> Fuad Tabba  wrote:
>>>
>>> Hi,
>>>
>>>
>>> On Mon, Jun 21, 2021 at 12:18 PM Steven Price  wrote:
>>>>
>>>> The VMM may not wish to have it's own mapping of guest memory mapped
>>>> with PROT_MTE because this causes problems if the VMM has tag checking
>>>> enabled (the guest controls the tags in physical RAM and it's unlikely
>>>> the tags are correct for the VMM).
>>>>
>>>> Instead add a new ioctl which allows the VMM to easily read/write the
>>>> tags from guest memory, allowing the VMM's mapping to be non-PROT_MTE
>>>> while the VMM can still read/write the tags for the purpose of
>>>> migration.
>>>>
>>>> Reviewed-by: Catalin Marinas 
>>>> Signed-off-by: Steven Price 
>>>> ---
>>>>  arch/arm64/include/asm/kvm_host.h |  3 ++
>>>>  arch/arm64/include/asm/mte-def.h  |  1 +
>>>>  arch/arm64/include/uapi/asm/kvm.h | 11 +
>>>>  arch/arm64/kvm/arm.c  |  7 +++
>>>>  arch/arm64/kvm/guest.c| 82 +++
>>>>  include/uapi/linux/kvm.h  |  1 +
>>>>  6 files changed, 105 insertions(+)
>>>>
>>>> diff --git a/arch/arm64/include/asm/kvm_host.h 
>>>> b/arch/arm64/include/asm/kvm_host.h
>>>> index 309e36cc1b42..6a2ac4636d42 100644
>>>> --- a/arch/arm64/include/asm/kvm_host.h
>>>> +++ b/arch/arm64/include/asm/kvm_host.h
>>>> @@ -729,6 +729,9 @@ int kvm_arm_vcpu_arch_get_attr(struct kvm_vcpu *vcpu,
>>>>  int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
>>>>struct kvm_device_attr *attr);
>>>>
>>>> +long kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
>>>> +   struct kvm_arm_copy_mte_tags *copy_tags);
>>>> +
>>>>  /* Guest/host FPSIMD coordination helpers */
>>>>  int kvm_arch_vcpu_run_map_fp(struct kvm_vcpu *vcpu);
>>>>  void kvm_arch_vcpu_load_fp(struct kvm_vcpu *vcpu);
>>>> diff --git a/arch/arm64/include/asm/mte-def.h 
>>>> b/arch/arm64/include/asm/mte-def.h
>>>> index cf241b0f0a42..626d359b396e 100644
>>>> --- a/arch/arm64/include/asm/mte-def.h
>>>> +++ b/arch/arm64/include/asm/mte-def.h
>>>> @@ -7,6 +7,7 @@
>>>>
>>>>  #define MTE_GRANULE_SIZE   UL(16)
>>>>  #define MTE_GRANULE_MASK   (~(MTE_GRANULE_SIZE - 1))
>>>> +#define MTE_GRANULES_PER_PAGE  (PAGE_SIZE / MTE_GRANULE_SIZE)
>>>>  #define MTE_TAG_SHIFT  56
>>>>  #define MTE_TAG_SIZE   4
>>>>  #define MTE_TAG_MASK   GENMASK((MTE_TAG_SHIFT + (MTE_TAG_SIZE - 
>>>> 1)), MTE_TAG_SHIFT)
>>>> diff --git a/arch/arm64/include/uapi/asm/kvm.h 
>>>> b/arch/arm64/include/uapi/asm/kvm.h
>>>> index 24223adae150..b3edde68bc3e 100644
>>>> --- a/arch/arm64/include/uapi/asm/kvm.h
>>>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>>>> @@ -184,6 +184,17 @@ struct kvm_vcpu_events {
>>>> __u32 reserved[12];
>>>>  };
>>>>
>>>> +struct kvm_arm_copy_mte_tags {
>>>> +   __u64 guest_ipa;
>>>> +   __u64 length;
>>>> +   void __user *addr;
>>>> +   __u64 flags;
>>>> +   __u64 reserved[2];
>>>> +};
>>>> +
>>>> +#define KVM_ARM_TAGS_TO_GUEST  0
>>>> +#define KVM_ARM_TAGS_FROM_GUEST1
>>>> +
>>>>  /* If you need to interpret the index values, here is the key: */
>>>>  #define KVM_REG_ARM_COPROC_MASK0x0FFF
>>>>  #define KVM_REG_ARM_COPROC_SHIFT   16
>>>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>>>> index 28ce26a68f09..511f3716fe33 100644
>>>> --- a/arch/arm64/kvm/arm.c
>>>> +++ b/arch/arm64/kvm/arm.c
>>>> @@ -1359,6 +1359,13 @@ long kvm_arch_vm_ioctl(struct file *filp,
>>>>
>>>> return 0;
>>>> }
>>>> +   case KVM_ARM_MTE_COPY_TAGS: {
>>>> +   struct kvm_arm_copy_mte_tags copy_tags;
>>>> +
>>>> +   if (copy_from_user(&copy_tags, argp, sizeof(copy_ta

[PATCH v17 5/6] KVM: arm64: ioctl to fetch/store tags in a guest

2021-06-21 Thread Steven Price
The VMM may not wish to have it's own mapping of guest memory mapped
with PROT_MTE because this causes problems if the VMM has tag checking
enabled (the guest controls the tags in physical RAM and it's unlikely
the tags are correct for the VMM).

Instead add a new ioctl which allows the VMM to easily read/write the
tags from guest memory, allowing the VMM's mapping to be non-PROT_MTE
while the VMM can still read/write the tags for the purpose of
migration.

Reviewed-by: Catalin Marinas 
Signed-off-by: Steven Price 
---
 arch/arm64/include/asm/kvm_host.h |  3 ++
 arch/arm64/include/asm/mte-def.h  |  1 +
 arch/arm64/include/uapi/asm/kvm.h | 11 +
 arch/arm64/kvm/arm.c  |  7 +++
 arch/arm64/kvm/guest.c| 82 +++
 include/uapi/linux/kvm.h  |  1 +
 6 files changed, 105 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 309e36cc1b42..6a2ac4636d42 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -729,6 +729,9 @@ int kvm_arm_vcpu_arch_get_attr(struct kvm_vcpu *vcpu,
 int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
   struct kvm_device_attr *attr);
 
+long kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
+   struct kvm_arm_copy_mte_tags *copy_tags);
+
 /* Guest/host FPSIMD coordination helpers */
 int kvm_arch_vcpu_run_map_fp(struct kvm_vcpu *vcpu);
 void kvm_arch_vcpu_load_fp(struct kvm_vcpu *vcpu);
diff --git a/arch/arm64/include/asm/mte-def.h b/arch/arm64/include/asm/mte-def.h
index cf241b0f0a42..626d359b396e 100644
--- a/arch/arm64/include/asm/mte-def.h
+++ b/arch/arm64/include/asm/mte-def.h
@@ -7,6 +7,7 @@
 
 #define MTE_GRANULE_SIZE   UL(16)
 #define MTE_GRANULE_MASK   (~(MTE_GRANULE_SIZE - 1))
+#define MTE_GRANULES_PER_PAGE  (PAGE_SIZE / MTE_GRANULE_SIZE)
 #define MTE_TAG_SHIFT  56
 #define MTE_TAG_SIZE   4
 #define MTE_TAG_MASK   GENMASK((MTE_TAG_SHIFT + (MTE_TAG_SIZE - 1)), 
MTE_TAG_SHIFT)
diff --git a/arch/arm64/include/uapi/asm/kvm.h 
b/arch/arm64/include/uapi/asm/kvm.h
index 24223adae150..b3edde68bc3e 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -184,6 +184,17 @@ struct kvm_vcpu_events {
__u32 reserved[12];
 };
 
+struct kvm_arm_copy_mte_tags {
+   __u64 guest_ipa;
+   __u64 length;
+   void __user *addr;
+   __u64 flags;
+   __u64 reserved[2];
+};
+
+#define KVM_ARM_TAGS_TO_GUEST  0
+#define KVM_ARM_TAGS_FROM_GUEST1
+
 /* If you need to interpret the index values, here is the key: */
 #define KVM_REG_ARM_COPROC_MASK0x0FFF
 #define KVM_REG_ARM_COPROC_SHIFT   16
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 28ce26a68f09..511f3716fe33 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1359,6 +1359,13 @@ long kvm_arch_vm_ioctl(struct file *filp,
 
return 0;
}
+   case KVM_ARM_MTE_COPY_TAGS: {
+   struct kvm_arm_copy_mte_tags copy_tags;
+
+   if (copy_from_user(&copy_tags, argp, sizeof(copy_tags)))
+   return -EFAULT;
+   return kvm_vm_ioctl_mte_copy_tags(kvm, &copy_tags);
+   }
default:
return -EINVAL;
}
diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
index 5cb4a1cd5603..4ddb20017b2f 100644
--- a/arch/arm64/kvm/guest.c
+++ b/arch/arm64/kvm/guest.c
@@ -995,3 +995,85 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
 
return ret;
 }
+
+long kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
+   struct kvm_arm_copy_mte_tags *copy_tags)
+{
+   gpa_t guest_ipa = copy_tags->guest_ipa;
+   size_t length = copy_tags->length;
+   void __user *tags = copy_tags->addr;
+   gpa_t gfn;
+   bool write = !(copy_tags->flags & KVM_ARM_TAGS_FROM_GUEST);
+   int ret = 0;
+
+   if (!kvm_has_mte(kvm))
+   return -EINVAL;
+
+   if (copy_tags->reserved[0] || copy_tags->reserved[1])
+   return -EINVAL;
+
+   if (copy_tags->flags & ~KVM_ARM_TAGS_FROM_GUEST)
+   return -EINVAL;
+
+   if (length & ~PAGE_MASK || guest_ipa & ~PAGE_MASK)
+   return -EINVAL;
+
+   gfn = gpa_to_gfn(guest_ipa);
+
+   mutex_lock(&kvm->slots_lock);
+
+   while (length > 0) {
+   kvm_pfn_t pfn = gfn_to_pfn_prot(kvm, gfn, write, NULL);
+   void *maddr;
+   unsigned long num_tags;
+   struct page *page;
+
+   if (is_error_noslot_pfn(pfn)) {
+   ret = -EFAULT;
+   goto out;
+   }
+
+   page = pfn_to_online_page(pfn);
+   if (!page) {
+   /* Reject ZONE_DEVICE memory */
+   ret 

[PATCH v17 4/6] KVM: arm64: Expose KVM_ARM_CAP_MTE

2021-06-21 Thread Steven Price
It's now safe for the VMM to enable MTE in a guest, so expose the
capability to user space.

Reviewed-by: Catalin Marinas 
Signed-off-by: Steven Price 
---
 arch/arm64/kvm/arm.c  | 9 +
 arch/arm64/kvm/reset.c| 4 
 arch/arm64/kvm/sys_regs.c | 3 +++
 3 files changed, 16 insertions(+)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index e720148232a0..28ce26a68f09 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -93,6 +93,12 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
r = 0;
kvm->arch.return_nisv_io_abort_to_user = true;
break;
+   case KVM_CAP_ARM_MTE:
+   if (!system_supports_mte() || kvm->created_vcpus)
+   return -EINVAL;
+   r = 0;
+   kvm->arch.mte_enabled = true;
+   break;
default:
r = -EINVAL;
break;
@@ -237,6 +243,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 */
r = 1;
break;
+   case KVM_CAP_ARM_MTE:
+   r = system_supports_mte();
+   break;
case KVM_CAP_STEAL_TIME:
r = kvm_arm_pvtime_supported();
break;
diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index d37ebee085cf..9e6922b9503a 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -244,6 +244,10 @@ int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
switch (vcpu->arch.target) {
default:
if (test_bit(KVM_ARM_VCPU_EL1_32BIT, vcpu->arch.features)) {
+   if (vcpu->kvm->arch.mte_enabled) {
+   ret = -EINVAL;
+   goto out;
+   }
pstate = VCPU_RESET_PSTATE_SVC;
} else {
pstate = VCPU_RESET_PSTATE_EL1;
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 5c75b24eae21..f6f126eb6ac1 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1312,6 +1312,9 @@ static bool access_ccsidr(struct kvm_vcpu *vcpu, struct 
sys_reg_params *p,
 static unsigned int mte_visibility(const struct kvm_vcpu *vcpu,
   const struct sys_reg_desc *rd)
 {
+   if (kvm_has_mte(vcpu->kvm))
+   return 0;
+
return REG_HIDDEN;
 }
 
-- 
2.20.1




[PATCH v17 3/6] KVM: arm64: Save/restore MTE registers

2021-06-21 Thread Steven Price
Define the new system registers that MTE introduces and context switch
them. The MTE feature is still hidden from the ID register as it isn't
supported in a VM yet.

Reviewed-by: Catalin Marinas 
Signed-off-by: Steven Price 
---
 arch/arm64/include/asm/kvm_arm.h   |  3 +-
 arch/arm64/include/asm/kvm_host.h  |  6 ++
 arch/arm64/include/asm/kvm_mte.h   | 66 ++
 arch/arm64/include/asm/sysreg.h|  3 +-
 arch/arm64/kernel/asm-offsets.c|  2 +
 arch/arm64/kvm/hyp/entry.S |  7 +++
 arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 21 +++
 arch/arm64/kvm/sys_regs.c  | 22 ++--
 8 files changed, 124 insertions(+), 6 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_mte.h

diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index 692c9049befa..d436831dd706 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -12,7 +12,8 @@
 #include 
 
 /* Hyp Configuration Register (HCR) bits */
-#define HCR_ATA        (UL(1) << 56)
+#define HCR_ATA_SHIFT  56
+#define HCR_ATA        (UL(1) << HCR_ATA_SHIFT)
 #define HCR_FWB        (UL(1) << 46)
 #define HCR_API        (UL(1) << 41)
 #define HCR_APK        (UL(1) << 40)
diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index afaa5333f0e4..309e36cc1b42 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -208,6 +208,12 @@ enum vcpu_sysreg {
CNTP_CVAL_EL0,
CNTP_CTL_EL0,
 
+   /* Memory Tagging Extension registers */
+   RGSR_EL1,   /* Random Allocation Tag Seed Register */
+   GCR_EL1,/* Tag Control Register */
+   TFSR_EL1,   /* Tag Fault Status Register (EL1) */
+   TFSRE0_EL1, /* Tag Fault Status Register (EL0) */
+
/* 32bit specific registers. Keep them at the end of the range */
DACR32_EL2, /* Domain Access Control Register */
IFSR32_EL2, /* Instruction Fault Status Register */
diff --git a/arch/arm64/include/asm/kvm_mte.h b/arch/arm64/include/asm/kvm_mte.h
new file mode 100644
index ..88dd1199670b
--- /dev/null
+++ b/arch/arm64/include/asm/kvm_mte.h
@@ -0,0 +1,66 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2020-2021 ARM Ltd.
+ */
+#ifndef __ASM_KVM_MTE_H
+#define __ASM_KVM_MTE_H
+
+#ifdef __ASSEMBLY__
+
+#include 
+
+#ifdef CONFIG_ARM64_MTE
+
+.macro mte_switch_to_guest g_ctxt, h_ctxt, reg1
+alternative_if_not ARM64_MTE
+   b   .L__skip_switch\@
+alternative_else_nop_endif
+   mrs \reg1, hcr_el2
+   tbz \reg1, #(HCR_ATA_SHIFT), .L__skip_switch\@
+
+   mrs_s   \reg1, SYS_RGSR_EL1
+   str \reg1, [\h_ctxt, #CPU_RGSR_EL1]
+   mrs_s   \reg1, SYS_GCR_EL1
+   str \reg1, [\h_ctxt, #CPU_GCR_EL1]
+
+   ldr \reg1, [\g_ctxt, #CPU_RGSR_EL1]
+   msr_s   SYS_RGSR_EL1, \reg1
+   ldr \reg1, [\g_ctxt, #CPU_GCR_EL1]
+   msr_s   SYS_GCR_EL1, \reg1
+
+.L__skip_switch\@:
+.endm
+
+.macro mte_switch_to_hyp g_ctxt, h_ctxt, reg1
+alternative_if_not ARM64_MTE
+   b   .L__skip_switch\@
+alternative_else_nop_endif
+   mrs \reg1, hcr_el2
+   tbz \reg1, #(HCR_ATA_SHIFT), .L__skip_switch\@
+
+   mrs_s   \reg1, SYS_RGSR_EL1
+   str \reg1, [\g_ctxt, #CPU_RGSR_EL1]
+   mrs_s   \reg1, SYS_GCR_EL1
+   str \reg1, [\g_ctxt, #CPU_GCR_EL1]
+
+   ldr \reg1, [\h_ctxt, #CPU_RGSR_EL1]
+   msr_s   SYS_RGSR_EL1, \reg1
+   ldr \reg1, [\h_ctxt, #CPU_GCR_EL1]
+   msr_s   SYS_GCR_EL1, \reg1
+
+   isb
+
+.L__skip_switch\@:
+.endm
+
+#else /* CONFIG_ARM64_MTE */
+
+.macro mte_switch_to_guest g_ctxt, h_ctxt, reg1
+.endm
+
+.macro mte_switch_to_hyp g_ctxt, h_ctxt, reg1
+.endm
+
+#endif /* CONFIG_ARM64_MTE */
+#endif /* __ASSEMBLY__ */
+#endif /* __ASM_KVM_MTE_H */
diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index 65d15700a168..347ccac2341e 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -651,7 +651,8 @@
 
 #define INIT_SCTLR_EL2_MMU_ON  \
(SCTLR_ELx_M  | SCTLR_ELx_C | SCTLR_ELx_SA | SCTLR_ELx_I |  \
-SCTLR_ELx_IESB | SCTLR_ELx_WXN | ENDIAN_SET_EL2 | SCTLR_EL2_RES1)
+SCTLR_ELx_IESB | SCTLR_ELx_WXN | ENDIAN_SET_EL2 |  \
+SCTLR_ELx_ITFSB | SCTLR_EL2_RES1)
 
 #define INIT_SCTLR_EL2_MMU_OFF \
(SCTLR_EL2_RES1 | ENDIAN_SET_EL2)
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 0cb34ccb6e73..6f0044cb233e 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -111,6 +111,8 @@ int main(void)
   DEFINE(VCPU_WORKAROUND_FLAGS,offsetof(struct kvm_vcpu, 
arch.workaround_flags));
   DEFINE(VCPU_HCR_EL2, offsetof(struct kvm_v

[PATCH v17 2/6] KVM: arm64: Introduce MTE VM feature

2021-06-21 Thread Steven Price
Add a new VM feature 'KVM_ARM_CAP_MTE' which enables memory tagging
for a VM. This will expose the feature to the guest and automatically
tag memory pages touched by the VM as PG_mte_tagged (and clear the tag
storage) to ensure that the guest cannot see stale tags, and so that
the tags are correctly saved/restored across swap.

Actually exposing the new capability to user space happens in a later
patch.

Reviewed-by: Catalin Marinas 
Signed-off-by: Steven Price 
---
 arch/arm64/include/asm/kvm_emulate.h |  3 ++
 arch/arm64/include/asm/kvm_host.h|  3 ++
 arch/arm64/kvm/hyp/exception.c   |  3 +-
 arch/arm64/kvm/mmu.c | 64 +++-
 arch/arm64/kvm/sys_regs.c|  7 +++
 include/uapi/linux/kvm.h |  1 +
 6 files changed, 79 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_emulate.h 
b/arch/arm64/include/asm/kvm_emulate.h
index 01b9857757f2..fd418955e31e 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -84,6 +84,9 @@ static inline void vcpu_reset_hcr(struct kvm_vcpu *vcpu)
if (cpus_have_const_cap(ARM64_MISMATCHED_CACHE_TYPE) ||
vcpu_el1_is_32bit(vcpu))
vcpu->arch.hcr_el2 |= HCR_TID2;
+
+   if (kvm_has_mte(vcpu->kvm))
+   vcpu->arch.hcr_el2 |= HCR_ATA;
 }
 
 static inline unsigned long *vcpu_hcr(struct kvm_vcpu *vcpu)
diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 7cd7d5c8c4bc..afaa5333f0e4 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -132,6 +132,8 @@ struct kvm_arch {
 
u8 pfr0_csv2;
u8 pfr0_csv3;
+   /* Memory Tagging Extension enabled for the guest */
+   bool mte_enabled;
 };
 
 struct kvm_vcpu_fault_info {
@@ -769,6 +771,7 @@ bool kvm_arm_vcpu_is_finalized(struct kvm_vcpu *vcpu);
 #define kvm_arm_vcpu_sve_finalized(vcpu) \
((vcpu)->arch.flags & KVM_ARM64_VCPU_SVE_FINALIZED)
 
+#define kvm_has_mte(kvm) (system_supports_mte() && (kvm)->arch.mte_enabled)
 #define kvm_vcpu_has_pmu(vcpu) \
(test_bit(KVM_ARM_VCPU_PMU_V3, (vcpu)->arch.features))
 
diff --git a/arch/arm64/kvm/hyp/exception.c b/arch/arm64/kvm/hyp/exception.c
index 11541b94b328..0418399e0a20 100644
--- a/arch/arm64/kvm/hyp/exception.c
+++ b/arch/arm64/kvm/hyp/exception.c
@@ -112,7 +112,8 @@ static void enter_exception64(struct kvm_vcpu *vcpu, 
unsigned long target_mode,
new |= (old & PSR_C_BIT);
new |= (old & PSR_V_BIT);
 
-   // TODO: TCO (if/when ARMv8.5-MemTag is exposed to guests)
+   if (kvm_has_mte(vcpu->kvm))
+   new |= PSR_TCO_BIT;
 
new |= (old & PSR_DIT_BIT);
 
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index c10207fed2f3..52326b739357 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -822,6 +822,45 @@ transparent_hugepage_adjust(struct kvm_memory_slot 
*memslot,
return PAGE_SIZE;
 }
 
+/*
+ * The page will be mapped in stage 2 as Normal Cacheable, so the VM will be
+ * able to see the page's tags and therefore they must be initialised first. If
+ * PG_mte_tagged is set, tags have already been initialised.
+ *
+ * The race in the test/set of the PG_mte_tagged flag is handled by:
+ * - preventing VM_SHARED mappings in a memslot with MTE preventing two VMs
+ *   racing to sanitise the same page
+ * - mmap_lock protects between a VM faulting a page in and the VMM performing
+ *   an mprotect() to add VM_MTE
+ */
+static int sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn,
+unsigned long size)
+{
+   unsigned long i, nr_pages = size >> PAGE_SHIFT;
+   struct page *page;
+
+   if (!kvm_has_mte(kvm))
+   return 0;
+
+   /*
+* pfn_to_online_page() is used to reject ZONE_DEVICE pages
+* that may not support tags.
+*/
+   page = pfn_to_online_page(pfn);
+
+   if (!page)
+   return -EFAULT;
+
+   for (i = 0; i < nr_pages; i++, page++) {
+   if (!test_bit(PG_mte_tagged, &page->flags)) {
+   mte_clear_page_tags(page_address(page));
+   set_bit(PG_mte_tagged, >flags);
+   }
+   }
+
+   return 0;
+}
+
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
  struct kvm_memory_slot *memslot, unsigned long hva,
  unsigned long fault_status)
@@ -971,8 +1010,18 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
if (writable)
prot |= KVM_PGTABLE_PROT_W;
 
-   if (fault_status != FSC_PERM && !device)
+   if (fault_status != FSC_PERM && !device) {
+   /* Check the VMM hasn't introduced a new VM_SHARED VMA */
+   if (kvm_has_mte(kvm) && vma->vm_f

[PATCH v17 1/6] arm64: mte: Sync tags for pages where PTE is untagged

2021-06-21 Thread Steven Price
A KVM guest could store tags in a page even if the VMM hasn't mapped
the page with PROT_MTE. So when restoring pages from swap we will
need to check to see if there are any saved tags even if !pte_tagged().

However don't check pages for which pte_access_permitted() returns false
as these will not have been swapped out.

Reviewed-by: Catalin Marinas 
Signed-off-by: Steven Price 
---
 arch/arm64/include/asm/mte.h |  4 ++--
 arch/arm64/include/asm/pgtable.h | 22 +++---
 arch/arm64/kernel/mte.c  | 18 +-
 3 files changed, 34 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
index bc88a1ced0d7..347ef38a35f7 100644
--- a/arch/arm64/include/asm/mte.h
+++ b/arch/arm64/include/asm/mte.h
@@ -37,7 +37,7 @@ void mte_free_tag_storage(char *storage);
 /* track which pages have valid allocation tags */
 #define PG_mte_tagged  PG_arch_2
 
-void mte_sync_tags(pte_t *ptep, pte_t pte);
+void mte_sync_tags(pte_t old_pte, pte_t pte);
 void mte_copy_page_tags(void *kto, const void *kfrom);
 void mte_thread_init_user(void);
 void mte_thread_switch(struct task_struct *next);
@@ -53,7 +53,7 @@ int mte_ptrace_copy_tags(struct task_struct *child, long 
request,
 /* unused if !CONFIG_ARM64_MTE, silence the compiler */
 #define PG_mte_tagged  0
 
-static inline void mte_sync_tags(pte_t *ptep, pte_t pte)
+static inline void mte_sync_tags(pte_t old_pte, pte_t pte)
 {
 }
 static inline void mte_copy_page_tags(void *kto, const void *kfrom)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 0b10204e72fc..db5402168841 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -314,9 +314,25 @@ static inline void set_pte_at(struct mm_struct *mm, 
unsigned long addr,
if (pte_present(pte) && pte_user_exec(pte) && !pte_special(pte))
__sync_icache_dcache(pte);
 
-   if (system_supports_mte() &&
-   pte_present(pte) && pte_tagged(pte) && !pte_special(pte))
-   mte_sync_tags(ptep, pte);
+   /*
+* If the PTE would provide user space access to the tags associated
+* with it then ensure that the MTE tags are synchronised.  Although
+* pte_access_permitted() returns false for exec only mappings, they
+* don't expose tags (instruction fetches don't check tags).
+*/
+   if (system_supports_mte() && pte_access_permitted(pte, false) &&
+   !pte_special(pte)) {
+   pte_t old_pte = READ_ONCE(*ptep);
+   /*
+* We only need to synchronise if the new PTE has tags enabled
+* or if swapping in (in which case another mapping may have
+* set tags in the past even if this PTE isn't tagged).
+* (!pte_none() && !pte_present()) is an open coded version of
+* is_swap_pte()
+*/
+   if (pte_tagged(pte) || (!pte_none(old_pte) && 
!pte_present(old_pte)))
+   mte_sync_tags(old_pte, pte);
+   }
 
__check_racy_pte_update(mm, ptep, pte);
 
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index 125a10e413e9..69b3fde8759e 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -32,10 +32,9 @@ DEFINE_STATIC_KEY_FALSE(mte_async_mode);
 EXPORT_SYMBOL_GPL(mte_async_mode);
 #endif
 
-static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
+static void mte_sync_page_tags(struct page *page, pte_t old_pte,
+  bool check_swap, bool pte_is_tagged)
 {
-   pte_t old_pte = READ_ONCE(*ptep);
-
if (check_swap && is_swap_pte(old_pte)) {
swp_entry_t entry = pte_to_swp_entry(old_pte);
 
@@ -43,6 +42,9 @@ static void mte_sync_page_tags(struct page *page, pte_t 
*ptep, bool check_swap)
return;
}
 
+   if (!pte_is_tagged)
+   return;
+
page_kasan_tag_reset(page);
/*
 * We need smp_wmb() in between setting the flags and clearing the
@@ -55,16 +57,22 @@ static void mte_sync_page_tags(struct page *page, pte_t 
*ptep, bool check_swap)
mte_clear_page_tags(page_address(page));
 }
 
-void mte_sync_tags(pte_t *ptep, pte_t pte)
+void mte_sync_tags(pte_t old_pte, pte_t pte)
 {
struct page *page = pte_page(pte);
long i, nr_pages = compound_nr(page);
bool check_swap = nr_pages == 1;
+   bool pte_is_tagged = pte_tagged(pte);
+
+   /* Early out if there's nothing to do */
+   if (!check_swap && !pte_is_tagged)
+   return;
 
/* if PG_mte_tagged is set, tags have already been initialised */
for (i = 0; i < nr_pages; i++, page++) {
 if (!test_and_set_bit(PG_mte_tagged, &page->flags))
-   mte_sync_page_tags

[PATCH v17 6/6] KVM: arm64: Document MTE capability and ioctl

2021-06-21 Thread Steven Price
A new capability (KVM_CAP_ARM_MTE) identifies that the kernel supports
granting a guest access to the tags, and provides a mechanism for the
VMM to enable it.

A new ioctl (KVM_ARM_MTE_COPY_TAGS) provides a simple way for a VMM to
access the tags of a guest without having to maintain a PROT_MTE mapping
in userspace. The above capability gates access to the ioctl.

Reviewed-by: Catalin Marinas 
Signed-off-by: Steven Price 
---
 Documentation/virt/kvm/api.rst | 61 ++
 1 file changed, 61 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 7fcb2fd38f42..97661a97943f 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -5034,6 +5034,43 @@ see KVM_XEN_VCPU_SET_ATTR above.
 The KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST type may not be used
 with the KVM_XEN_VCPU_GET_ATTR ioctl.
 
+4.130 KVM_ARM_MTE_COPY_TAGS
+---
+
+:Capability: KVM_CAP_ARM_MTE
+:Architectures: arm64
+:Type: vm ioctl
+:Parameters: struct kvm_arm_copy_mte_tags
+:Returns: number of bytes copied, < 0 on error (-EINVAL for incorrect
+  arguments, -EFAULT if memory cannot be accessed).
+
+::
+
+  struct kvm_arm_copy_mte_tags {
+   __u64 guest_ipa;
+   __u64 length;
+   void __user *addr;
+   __u64 flags;
+   __u64 reserved[2];
+  };
+
+Copies Memory Tagging Extension (MTE) tags to/from guest tag memory. The
+``guest_ipa`` and ``length`` fields must be ``PAGE_SIZE`` aligned. The ``addr``
+field must point to a buffer which the tags will be copied to or from.
+
+``flags`` specifies the direction of copy, either ``KVM_ARM_TAGS_TO_GUEST`` or
+``KVM_ARM_TAGS_FROM_GUEST``.
+
+The size of the buffer to store the tags is ``(length / 16)`` bytes
+(granules in MTE are 16 bytes long). Each byte contains a single tag
+value. This matches the format of ``PTRACE_PEEKMTETAGS`` and
+``PTRACE_POKEMTETAGS``.
+
+If an error occurs before any data is copied then a negative error code is
+returned. If some tags have been copied before an error occurs then the number
+of bytes successfully copied is returned. If the call completes successfully
+then ``length`` is returned.
+
 5. The kvm_run structure
 
 
@@ -6362,6 +6399,30 @@ default.
 
 See Documentation/x86/sgx/2.Kernel-internals.rst for more details.
 
+7.26 KVM_CAP_ARM_MTE
+
+
+:Architectures: arm64
+:Parameters: none
+
+This capability indicates that KVM (and the hardware) supports exposing the
+Memory Tagging Extensions (MTE) to the guest. It must also be enabled by the
+VMM before creating any VCPUs to allow the guest access. Note that MTE is only
+available to a guest running in AArch64 mode and enabling this capability will
+cause attempts to create AArch32 VCPUs to fail.
+
+When enabled the guest is able to access tags associated with any memory given
+to the guest. KVM will ensure that the tags are maintained during swap or
+hibernation of the host; however the VMM needs to manually save/restore the
+tags as appropriate if the VM is migrated.
+
+When this capability is enabled all memory in memslots must be mapped as
+not-shareable (no MAP_SHARED); attempts to create a memslot with a
+MAP_SHARED mmap will result in an -EINVAL return.
+
+When enabled the VMM may make use of the ``KVM_ARM_MTE_COPY_TAGS`` ioctl to
+perform a bulk copy of tags to/from the guest.
+
 8. Other capabilities.
 ==
 
-- 
2.20.1
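
A minimal, hypothetical VMM-side sketch of driving the ioctl documented
above (it assumes a uapi that provides KVM_ARM_MTE_COPY_TAGS, a vm_fd on
which KVM_CAP_ARM_MTE has already been enabled, 4K pages, and trims all
but basic error handling; copy_tags_from_guest() and the example IPA are
made up for illustration):

#include <linux/kvm.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>

#define EXAMPLE_PAGE_SIZE	4096UL
#define MTE_GRANULE_BYTES	16UL

static long copy_tags_from_guest(int vm_fd, uint64_t guest_ipa,
				 uint64_t length, void *buf)
{
	struct kvm_arm_copy_mte_tags copy = {
		.guest_ipa = guest_ipa,	/* must be PAGE_SIZE aligned */
		.length    = length,	/* must be PAGE_SIZE aligned */
		.addr      = buf,	/* length / 16 bytes, one tag per byte */
		.flags     = KVM_ARM_TAGS_FROM_GUEST,
	};

	/* Returns the number of bytes copied (== length on full success). */
	return ioctl(vm_fd, KVM_ARM_MTE_COPY_TAGS, &copy);
}

int main(void)
{
	int vm_fd = -1;			/* obtained via KVM_CREATE_VM elsewhere */
	uint64_t ipa = 0x80000000;	/* example guest IPA, PAGE_SIZE aligned */
	uint8_t tag_buf[EXAMPLE_PAGE_SIZE / MTE_GRANULE_BYTES];	/* 256 bytes per 4K page */
	long ret;

	ret = copy_tags_from_guest(vm_fd, ipa, EXAMPLE_PAGE_SIZE, tag_buf);
	if (ret < 0)
		perror("KVM_ARM_MTE_COPY_TAGS");
	else
		printf("copied tags for %ld bytes of guest memory\n", ret);
	return 0;
}

The same structure, with KVM_ARM_TAGS_TO_GUEST, restores tags on the
destination side of a migration.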




[PATCH v17 0/6] MTE support for KVM guest

2021-06-21 Thread Steven Price
This series adds support for using the Arm Memory Tagging Extensions
(MTE) in a KVM guest.

Changes since v16[1]:

 - Dropped the first patch ("Handle race when synchronising tags") as
   it's not KVM specific and by restricting MAP_SHARED in KVM there is
   no longer a dependency.

 - Change return code when creating a memslot with VM_SHARED regions to
   -EFAULT (and correctly jump to out_unlock on this error case).

 - Clarify documentation thanks to Catalin.

 - Rebase onto v5.13-rc4.

 - Add Reviewed-by tags from Catalin - thanks!

[1] https://lore.kernel.org/r/20210618132826.54670-1-steven.price%40arm.com

Steven Price (6):
  arm64: mte: Sync tags for pages where PTE is untagged
  KVM: arm64: Introduce MTE VM feature
  KVM: arm64: Save/restore MTE registers
  KVM: arm64: Expose KVM_ARM_CAP_MTE
  KVM: arm64: ioctl to fetch/store tags in a guest
  KVM: arm64: Document MTE capability and ioctl

 Documentation/virt/kvm/api.rst | 61 
 arch/arm64/include/asm/kvm_arm.h   |  3 +-
 arch/arm64/include/asm/kvm_emulate.h   |  3 +
 arch/arm64/include/asm/kvm_host.h  | 12 
 arch/arm64/include/asm/kvm_mte.h   | 66 +
 arch/arm64/include/asm/mte-def.h   |  1 +
 arch/arm64/include/asm/mte.h   |  4 +-
 arch/arm64/include/asm/pgtable.h   | 22 +-
 arch/arm64/include/asm/sysreg.h|  3 +-
 arch/arm64/include/uapi/asm/kvm.h  | 11 +++
 arch/arm64/kernel/asm-offsets.c|  2 +
 arch/arm64/kernel/mte.c| 18 +++--
 arch/arm64/kvm/arm.c   | 16 +
 arch/arm64/kvm/guest.c | 82 ++
 arch/arm64/kvm/hyp/entry.S |  7 ++
 arch/arm64/kvm/hyp/exception.c |  3 +-
 arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 21 ++
 arch/arm64/kvm/mmu.c   | 64 -
 arch/arm64/kvm/reset.c |  4 ++
 arch/arm64/kvm/sys_regs.c  | 32 +++--
 include/uapi/linux/kvm.h   |  2 +
 21 files changed, 419 insertions(+), 18 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_mte.h

-- 
2.20.1




Re: [PATCH v16 3/7] KVM: arm64: Introduce MTE VM feature

2021-06-21 Thread Steven Price
On 21/06/2021 10:01, Marc Zyngier wrote:
> On Fri, 18 Jun 2021 14:28:22 +0100,
> Steven Price  wrote:
>>
>> Add a new VM feature 'KVM_ARM_CAP_MTE' which enables memory tagging
>> for a VM. This will expose the feature to the guest and automatically
>> tag memory pages touched by the VM as PG_mte_tagged (and clear the tag
>> storage) to ensure that the guest cannot see stale tags, and so that
>> the tags are correctly saved/restored across swap.
>>
>> Actually exposing the new capability to user space happens in a later
>> patch.
>>
>> Signed-off-by: Steven Price 
>> ---
>>  arch/arm64/include/asm/kvm_emulate.h |  3 ++
>>  arch/arm64/include/asm/kvm_host.h|  3 ++
>>  arch/arm64/kvm/hyp/exception.c   |  3 +-
>>  arch/arm64/kvm/mmu.c | 62 +++-
>>  arch/arm64/kvm/sys_regs.c|  7 
>>  include/uapi/linux/kvm.h |  1 +
>>  6 files changed, 77 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_emulate.h 
>> b/arch/arm64/include/asm/kvm_emulate.h
>> index f612c090f2e4..6bf776c2399c 100644
>> --- a/arch/arm64/include/asm/kvm_emulate.h
>> +++ b/arch/arm64/include/asm/kvm_emulate.h
>> @@ -84,6 +84,9 @@ static inline void vcpu_reset_hcr(struct kvm_vcpu *vcpu)
>>  if (cpus_have_const_cap(ARM64_MISMATCHED_CACHE_TYPE) ||
>>  vcpu_el1_is_32bit(vcpu))
>>  vcpu->arch.hcr_el2 |= HCR_TID2;
>> +
>> +if (kvm_has_mte(vcpu->kvm))
>> +vcpu->arch.hcr_el2 |= HCR_ATA;
>>  }
>>  
>>  static inline unsigned long *vcpu_hcr(struct kvm_vcpu *vcpu)
>> diff --git a/arch/arm64/include/asm/kvm_host.h 
>> b/arch/arm64/include/asm/kvm_host.h
>> index 7cd7d5c8c4bc..afaa5333f0e4 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -132,6 +132,8 @@ struct kvm_arch {
>>  
>>  u8 pfr0_csv2;
>>  u8 pfr0_csv3;
>> +/* Memory Tagging Extension enabled for the guest */
>> +bool mte_enabled;
>>  };
>>  
>>  struct kvm_vcpu_fault_info {
>> @@ -769,6 +771,7 @@ bool kvm_arm_vcpu_is_finalized(struct kvm_vcpu *vcpu);
>>  #define kvm_arm_vcpu_sve_finalized(vcpu) \
>>  ((vcpu)->arch.flags & KVM_ARM64_VCPU_SVE_FINALIZED)
>>  
>> +#define kvm_has_mte(kvm) (system_supports_mte() && (kvm)->arch.mte_enabled)
>>  #define kvm_vcpu_has_pmu(vcpu)  \
>>  (test_bit(KVM_ARM_VCPU_PMU_V3, (vcpu)->arch.features))
>>  
>> diff --git a/arch/arm64/kvm/hyp/exception.c b/arch/arm64/kvm/hyp/exception.c
>> index 73629094f903..56426565600c 100644
>> --- a/arch/arm64/kvm/hyp/exception.c
>> +++ b/arch/arm64/kvm/hyp/exception.c
>> @@ -112,7 +112,8 @@ static void enter_exception64(struct kvm_vcpu *vcpu, 
>> unsigned long target_mode,
>>  new |= (old & PSR_C_BIT);
>>  new |= (old & PSR_V_BIT);
>>  
>> -// TODO: TCO (if/when ARMv8.5-MemTag is exposed to guests)
>> +if (kvm_has_mte(vcpu->kvm))
>> +new |= PSR_TCO_BIT;
>>  
>>  new |= (old & PSR_DIT_BIT);
>>  
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index c5d1f3c87dbd..f5305b7561ad 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -822,6 +822,45 @@ transparent_hugepage_adjust(struct kvm_memory_slot 
>> *memslot,
>>  return PAGE_SIZE;
>>  }
>>  
>> +/*
>> + * The page will be mapped in stage 2 as Normal Cacheable, so the VM will be
>> + * able to see the page's tags and therefore they must be initialised 
>> first. If
>> + * PG_mte_tagged is set, tags have already been initialised.
>> + *
>> + * The race in the test/set of the PG_mte_tagged flag is handled by:
>> + * - preventing VM_SHARED mappings in a memslot with MTE preventing two VMs
>> + *   racing to sanitise the same page
>> + * - mmap_lock protects between a VM faulting a page in and the VMM 
>> performing
>> + *   an mprotect() to add VM_MTE
>> + */
>> +static int sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn,
>> + unsigned long size)
>> +{
>> +unsigned long i, nr_pages = size >> PAGE_SHIFT;
>> +struct page *page;
>> +
>> +if (!kvm_has_mte(kvm))
>> +return 0;
>> +
>> +/*
>> + * pfn_to_online_page() is used to reject ZONE_DEVICE pages
>> + * that may not support tags.
>> + */
>> +page 

Re: [PATCH v16 1/7] arm64: mte: Handle race when synchronising tags

2021-06-21 Thread Steven Price
On 18/06/2021 16:42, Marc Zyngier wrote:
> On 2021-06-18 15:40, Catalin Marinas wrote:
>> On Fri, Jun 18, 2021 at 02:28:20PM +0100, Steven Price wrote:
>>> mte_sync_tags() used test_and_set_bit() to set the PG_mte_tagged flag
>>> before restoring/zeroing the MTE tags. However if another thread were to
>>> race and attempt to sync the tags on the same page before the first
>>> thread had completed restoring/zeroing then it would see the flag is
>>> already set and continue without waiting. This would potentially expose
>>> the previous contents of the tags to user space, and cause any updates
>>> that user space makes before the restoring/zeroing has completed to
>>> potentially be lost.
>>>
>>> Since this code is run from atomic contexts we can't just lock the page
>>> during the process. Instead implement a new (global) spinlock to protect
>>> the mte_sync_page_tags() function.
>>>
>>> Fixes: 34bfeea4a9e9 ("arm64: mte: Clear the tags when a page is
>>> mapped in user-space with PROT_MTE")
>>> Reviewed-by: Catalin Marinas 
>>> Signed-off-by: Steven Price 
>>
>> Although I reviewed this patch, I think we should drop it from this
>> series and restart the discussion with the Chromium guys on what/if they
>> need PROT_MTE with MAP_SHARED. It currently breaks if you have two
>> PROT_MTE mappings but if they are ok with only one of the mappings being
>> PROT_MTE, I'm happy to just document it.
>>
>> Not sure whether subsequent patches depend on it though.
> 
> I'd certainly like it to be independent of the KVM series, specially
> as this series is pretty explicit that this MTE lock is not required
> for KVM.

Sure, since KVM no longer uses the lock we don't have the dependency -
so I'll drop the first patch.

> This will require some rework of patch #2, I believe. And while we're
> at it, a rebase on 5.13-rc4 wouldn't hurt, as both patches #3 and #5
> conflict with it...

Yeah there will be minor conflicts in patch #2 - but nothing major. I'll
rebase as requested at the same time.

Thanks,

Steve



Re: [PATCH v16 7/7] KVM: arm64: Document MTE capability and ioctl

2021-06-21 Thread Steven Price
On 18/06/2021 15:52, Catalin Marinas wrote:
> On Fri, Jun 18, 2021 at 02:28:26PM +0100, Steven Price wrote:
>> +When this capability is enabled all memory in (non-device) memslots must not
>> +used VM_SHARED, attempts to create a memslot with a VM_SHARED mmap will 
>> result
>> +in an -EINVAL return.
> 
> "must not used" doesn't sound right. Anyway, I'd remove VM_SHARED as
> that's a kernel internal and not something the VMM needs to be aware of.
> Just say something like "memslots must be mapped as shareable
> (MAP_SHARED)".

I think I meant "must not use" - and indeed memslots must *not* be
mapped as shareable. I'll update to this wording:

  When this capability is enabled all memory in memslots must be mapped as
  not-shareable (no MAP_SHARED); attempts to create a memslot with MAP_SHARED
  will result in an -EINVAL return.

> Otherwise:
> 
> Reviewed-by: Catalin Marinas 
> 

Thanks,

Steve



[PATCH v16 7/7] KVM: arm64: Document MTE capability and ioctl

2021-06-18 Thread Steven Price
A new capability (KVM_CAP_ARM_MTE) identifies that the kernel supports
granting a guest access to the tags, and provides a mechanism for the
VMM to enable it.

A new ioctl (KVM_ARM_MTE_COPY_TAGS) provides a simple way for a VMM to
access the tags of a guest without having to maintain a PROT_MTE mapping
in userspace. The above capability gates access to the ioctl.

Signed-off-by: Steven Price 
---
 Documentation/virt/kvm/api.rst | 61 ++
 1 file changed, 61 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 22d077562149..3c27e712b1fb 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -5034,6 +5034,43 @@ see KVM_XEN_VCPU_SET_ATTR above.
 The KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST type may not be used
 with the KVM_XEN_VCPU_GET_ATTR ioctl.
 
+4.130 KVM_ARM_MTE_COPY_TAGS
+---------------------------
+
+:Capability: KVM_CAP_ARM_MTE
+:Architectures: arm64
+:Type: vm ioctl
+:Parameters: struct kvm_arm_copy_mte_tags
+:Returns: number of bytes copied, < 0 on error (-EINVAL for incorrect
+  arguments, -EFAULT if memory cannot be accessed).
+
+::
+
+  struct kvm_arm_copy_mte_tags {
+   __u64 guest_ipa;
+   __u64 length;
+   void __user *addr;
+   __u64 flags;
+   __u64 reserved[2];
+  };
+
+Copies Memory Tagging Extension (MTE) tags to/from guest tag memory. The
+``guest_ipa`` and ``length`` fields must be ``PAGE_SIZE`` aligned. The ``addr``
+field must point to a buffer which the tags will be copied to or from.
+
+``flags`` specifies the direction of copy, either ``KVM_ARM_TAGS_TO_GUEST`` or
+``KVM_ARM_TAGS_FROM_GUEST``.
+
+The size of the buffer to store the tags is ``(length / 16)`` bytes
+(granules in MTE are 16 bytes long). Each byte contains a single tag
+value. This matches the format of ``PTRACE_PEEKMTETAGS`` and
+``PTRACE_POKEMTETAGS``.
+
+If an error occurs before any data is copied then a negative error code is
+returned. If some tags have been copied before an error occurs then the number
+of bytes successfully copied is returned. If the call completes successfully
+then ``length`` is returned.
+
 5. The kvm_run structure
 
 
@@ -6362,6 +6399,30 @@ default.
 
 See Documentation/x86/sgx/2.Kernel-internals.rst for more details.
 
+7.26 KVM_CAP_ARM_MTE
+--------------------
+
+:Architectures: arm64
+:Parameters: none
+
+This capability indicates that KVM (and the hardware) supports exposing the
+Memory Tagging Extensions (MTE) to the guest. It must also be enabled by the
+VMM before creating any VCPUs to allow the guest access. Note that MTE is only
+available to a guest running in AArch64 mode and enabling this capability will
+cause attempts to create AArch32 VCPUs to fail.
+
+When enabled the guest is able to access tags associated with any memory given
+to the guest. KVM will ensure that the tags are maintained during swap or
+hibernation of the host; however the VMM needs to manually save/restore the
+tags as appropriate if the VM is migrated.
+
+When this capability is enabled all memory in (non-device) memslots must not
+used VM_SHARED, attempts to create a memslot with a VM_SHARED mmap will result
+in an -EINVAL return.
+
+When enabled the VMM may make use of the ``KVM_ARM_MTE_COPY_TAGS`` ioctl to
+perform a bulk copy of tags to/from the guest.
+
 8. Other capabilities.
 ==
 
-- 
2.20.1




[PATCH v16 5/7] KVM: arm64: Expose KVM_ARM_CAP_MTE

2021-06-18 Thread Steven Price
It's now safe for the VMM to enable MTE in a guest, so expose the
capability to user space.

Reviewed-by: Catalin Marinas 
Signed-off-by: Steven Price 
---
 arch/arm64/kvm/arm.c  | 9 +
 arch/arm64/kvm/reset.c| 3 ++-
 arch/arm64/kvm/sys_regs.c | 3 +++
 3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 1cb39c0803a4..e89a5e275e25 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -93,6 +93,12 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
r = 0;
kvm->arch.return_nisv_io_abort_to_user = true;
break;
+   case KVM_CAP_ARM_MTE:
+   if (!system_supports_mte() || kvm->created_vcpus)
+   return -EINVAL;
+   r = 0;
+   kvm->arch.mte_enabled = true;
+   break;
default:
r = -EINVAL;
break;
@@ -237,6 +243,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 */
r = 1;
break;
+   case KVM_CAP_ARM_MTE:
+   r = system_supports_mte();
+   break;
case KVM_CAP_STEAL_TIME:
r = kvm_arm_pvtime_supported();
break;
diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index 956cdc240148..50635eacfa43 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -220,7 +220,8 @@ int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
switch (vcpu->arch.target) {
default:
if (test_bit(KVM_ARM_VCPU_EL1_32BIT, vcpu->arch.features)) {
-   if (!cpus_have_const_cap(ARM64_HAS_32BIT_EL1)) {
+   if (!cpus_have_const_cap(ARM64_HAS_32BIT_EL1) ||
+   vcpu->kvm->arch.mte_enabled) {
ret = -EINVAL;
goto out;
}
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 440315a556c2..d4e1c1b1a08d 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1312,6 +1312,9 @@ static bool access_ccsidr(struct kvm_vcpu *vcpu, struct 
sys_reg_params *p,
 static unsigned int mte_visibility(const struct kvm_vcpu *vcpu,
   const struct sys_reg_desc *rd)
 {
+   if (kvm_has_mte(vcpu->kvm))
+   return 0;
+
return REG_HIDDEN;
 }
 
-- 
2.20.1
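
A hypothetical userspace sketch of the ordering this patch enforces:
KVM_CAP_ARM_MTE is probed with KVM_CHECK_EXTENSION and then enabled on the
VM fd before the first vCPU is created, since enabling it after a vCPU
exists returns EINVAL (file-descriptor error handling is trimmed):

#include <fcntl.h>
#include <linux/kvm.h>
#include <stdio.h>
#include <sys/ioctl.h>

int main(void)
{
	int kvm_fd = open("/dev/kvm", O_RDWR);
	int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);
	struct kvm_enable_cap cap = { .cap = KVM_CAP_ARM_MTE };

	if (ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_ARM_MTE) <= 0) {
		fprintf(stderr, "MTE not supported by this kernel/CPU\n");
		return 1;
	}

	/* Must happen before the first KVM_CREATE_VCPU. */
	if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap) < 0) {
		perror("KVM_ENABLE_CAP(KVM_CAP_ARM_MTE)");
		return 1;
	}

	/* AArch32 (EL1_32BIT) vCPUs would now be rejected at vCPU init. */
	int vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);
	printf("vcpu fd: %d\n", vcpu_fd);
	return 0;
}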




[PATCH v16 4/7] KVM: arm64: Save/restore MTE registers

2021-06-18 Thread Steven Price
Define the new system registers that MTE introduces and context switch
them. The MTE feature is still hidden from the ID register as it isn't
supported in a VM yet.

Reviewed-by: Catalin Marinas 
Signed-off-by: Steven Price 
---
 arch/arm64/include/asm/kvm_arm.h   |  3 +-
 arch/arm64/include/asm/kvm_host.h  |  6 ++
 arch/arm64/include/asm/kvm_mte.h   | 66 ++
 arch/arm64/include/asm/sysreg.h|  3 +-
 arch/arm64/kernel/asm-offsets.c|  2 +
 arch/arm64/kvm/hyp/entry.S |  7 +++
 arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 21 +++
 arch/arm64/kvm/sys_regs.c  | 22 ++--
 8 files changed, 124 insertions(+), 6 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_mte.h

diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index 692c9049befa..d436831dd706 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -12,7 +12,8 @@
 #include 
 
 /* Hyp Configuration Register (HCR) bits */
-#define HCR_ATA(UL(1) << 56)
+#define HCR_ATA_SHIFT  56
+#define HCR_ATA(UL(1) << HCR_ATA_SHIFT)
 #define HCR_FWB(UL(1) << 46)
 #define HCR_API(UL(1) << 41)
 #define HCR_APK(UL(1) << 40)
diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index afaa5333f0e4..309e36cc1b42 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -208,6 +208,12 @@ enum vcpu_sysreg {
CNTP_CVAL_EL0,
CNTP_CTL_EL0,
 
+   /* Memory Tagging Extension registers */
+   RGSR_EL1,   /* Random Allocation Tag Seed Register */
+   GCR_EL1,/* Tag Control Register */
+   TFSR_EL1,   /* Tag Fault Status Register (EL1) */
+   TFSRE0_EL1, /* Tag Fault Status Register (EL0) */
+
/* 32bit specific registers. Keep them at the end of the range */
DACR32_EL2, /* Domain Access Control Register */
IFSR32_EL2, /* Instruction Fault Status Register */
diff --git a/arch/arm64/include/asm/kvm_mte.h b/arch/arm64/include/asm/kvm_mte.h
new file mode 100644
index ..88dd1199670b
--- /dev/null
+++ b/arch/arm64/include/asm/kvm_mte.h
@@ -0,0 +1,66 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2020-2021 ARM Ltd.
+ */
+#ifndef __ASM_KVM_MTE_H
+#define __ASM_KVM_MTE_H
+
+#ifdef __ASSEMBLY__
+
+#include 
+
+#ifdef CONFIG_ARM64_MTE
+
+.macro mte_switch_to_guest g_ctxt, h_ctxt, reg1
+alternative_if_not ARM64_MTE
+   b   .L__skip_switch\@
+alternative_else_nop_endif
+   mrs \reg1, hcr_el2
+   tbz \reg1, #(HCR_ATA_SHIFT), .L__skip_switch\@
+
+   mrs_s   \reg1, SYS_RGSR_EL1
+   str \reg1, [\h_ctxt, #CPU_RGSR_EL1]
+   mrs_s   \reg1, SYS_GCR_EL1
+   str \reg1, [\h_ctxt, #CPU_GCR_EL1]
+
+   ldr \reg1, [\g_ctxt, #CPU_RGSR_EL1]
+   msr_s   SYS_RGSR_EL1, \reg1
+   ldr \reg1, [\g_ctxt, #CPU_GCR_EL1]
+   msr_s   SYS_GCR_EL1, \reg1
+
+.L__skip_switch\@:
+.endm
+
+.macro mte_switch_to_hyp g_ctxt, h_ctxt, reg1
+alternative_if_not ARM64_MTE
+   b   .L__skip_switch\@
+alternative_else_nop_endif
+   mrs \reg1, hcr_el2
+   tbz \reg1, #(HCR_ATA_SHIFT), .L__skip_switch\@
+
+   mrs_s   \reg1, SYS_RGSR_EL1
+   str \reg1, [\g_ctxt, #CPU_RGSR_EL1]
+   mrs_s   \reg1, SYS_GCR_EL1
+   str \reg1, [\g_ctxt, #CPU_GCR_EL1]
+
+   ldr \reg1, [\h_ctxt, #CPU_RGSR_EL1]
+   msr_s   SYS_RGSR_EL1, \reg1
+   ldr \reg1, [\h_ctxt, #CPU_GCR_EL1]
+   msr_s   SYS_GCR_EL1, \reg1
+
+   isb
+
+.L__skip_switch\@:
+.endm
+
+#else /* CONFIG_ARM64_MTE */
+
+.macro mte_switch_to_guest g_ctxt, h_ctxt, reg1
+.endm
+
+.macro mte_switch_to_hyp g_ctxt, h_ctxt, reg1
+.endm
+
+#endif /* CONFIG_ARM64_MTE */
+#endif /* __ASSEMBLY__ */
+#endif /* __ASM_KVM_MTE_H */
diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index 65d15700a168..347ccac2341e 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -651,7 +651,8 @@
 
 #define INIT_SCTLR_EL2_MMU_ON  \
(SCTLR_ELx_M  | SCTLR_ELx_C | SCTLR_ELx_SA | SCTLR_ELx_I |  \
-SCTLR_ELx_IESB | SCTLR_ELx_WXN | ENDIAN_SET_EL2 | SCTLR_EL2_RES1)
+SCTLR_ELx_IESB | SCTLR_ELx_WXN | ENDIAN_SET_EL2 |  \
+SCTLR_ELx_ITFSB | SCTLR_EL2_RES1)
 
 #define INIT_SCTLR_EL2_MMU_OFF \
(SCTLR_EL2_RES1 | ENDIAN_SET_EL2)
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 0cb34ccb6e73..6f0044cb233e 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -111,6 +111,8 @@ int main(void)
   DEFINE(VCPU_WORKAROUND_FLAGS,offsetof(struct kvm_vcpu, 
arch.workaround_flags));
   DEFINE(VCPU_HCR_EL2, offsetof(struct kvm_v

[PATCH v16 3/7] KVM: arm64: Introduce MTE VM feature

2021-06-18 Thread Steven Price
Add a new VM feature 'KVM_ARM_CAP_MTE' which enables memory tagging
for a VM. This will expose the feature to the guest and automatically
tag memory pages touched by the VM as PG_mte_tagged (and clear the tag
storage) to ensure that the guest cannot see stale tags, and so that
the tags are correctly saved/restored across swap.

Actually exposing the new capability to user space happens in a later
patch.

Signed-off-by: Steven Price 
---
 arch/arm64/include/asm/kvm_emulate.h |  3 ++
 arch/arm64/include/asm/kvm_host.h|  3 ++
 arch/arm64/kvm/hyp/exception.c   |  3 +-
 arch/arm64/kvm/mmu.c | 62 +++-
 arch/arm64/kvm/sys_regs.c|  7 
 include/uapi/linux/kvm.h |  1 +
 6 files changed, 77 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_emulate.h 
b/arch/arm64/include/asm/kvm_emulate.h
index f612c090f2e4..6bf776c2399c 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -84,6 +84,9 @@ static inline void vcpu_reset_hcr(struct kvm_vcpu *vcpu)
if (cpus_have_const_cap(ARM64_MISMATCHED_CACHE_TYPE) ||
vcpu_el1_is_32bit(vcpu))
vcpu->arch.hcr_el2 |= HCR_TID2;
+
+   if (kvm_has_mte(vcpu->kvm))
+   vcpu->arch.hcr_el2 |= HCR_ATA;
 }
 
 static inline unsigned long *vcpu_hcr(struct kvm_vcpu *vcpu)
diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 7cd7d5c8c4bc..afaa5333f0e4 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -132,6 +132,8 @@ struct kvm_arch {
 
u8 pfr0_csv2;
u8 pfr0_csv3;
+   /* Memory Tagging Extension enabled for the guest */
+   bool mte_enabled;
 };
 
 struct kvm_vcpu_fault_info {
@@ -769,6 +771,7 @@ bool kvm_arm_vcpu_is_finalized(struct kvm_vcpu *vcpu);
 #define kvm_arm_vcpu_sve_finalized(vcpu) \
((vcpu)->arch.flags & KVM_ARM64_VCPU_SVE_FINALIZED)
 
+#define kvm_has_mte(kvm) (system_supports_mte() && (kvm)->arch.mte_enabled)
 #define kvm_vcpu_has_pmu(vcpu) \
(test_bit(KVM_ARM_VCPU_PMU_V3, (vcpu)->arch.features))
 
diff --git a/arch/arm64/kvm/hyp/exception.c b/arch/arm64/kvm/hyp/exception.c
index 73629094f903..56426565600c 100644
--- a/arch/arm64/kvm/hyp/exception.c
+++ b/arch/arm64/kvm/hyp/exception.c
@@ -112,7 +112,8 @@ static void enter_exception64(struct kvm_vcpu *vcpu, 
unsigned long target_mode,
new |= (old & PSR_C_BIT);
new |= (old & PSR_V_BIT);
 
-   // TODO: TCO (if/when ARMv8.5-MemTag is exposed to guests)
+   if (kvm_has_mte(vcpu->kvm))
+   new |= PSR_TCO_BIT;
 
new |= (old & PSR_DIT_BIT);
 
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index c5d1f3c87dbd..f5305b7561ad 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -822,6 +822,45 @@ transparent_hugepage_adjust(struct kvm_memory_slot 
*memslot,
return PAGE_SIZE;
 }
 
+/*
+ * The page will be mapped in stage 2 as Normal Cacheable, so the VM will be
+ * able to see the page's tags and therefore they must be initialised first. If
+ * PG_mte_tagged is set, tags have already been initialised.
+ *
+ * The race in the test/set of the PG_mte_tagged flag is handled by:
+ * - preventing VM_SHARED mappings in a memslot with MTE preventing two VMs
+ *   racing to sanitise the same page
+ * - mmap_lock protects between a VM faulting a page in and the VMM performing
+ *   an mprotect() to add VM_MTE
+ */
+static int sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn,
+unsigned long size)
+{
+   unsigned long i, nr_pages = size >> PAGE_SHIFT;
+   struct page *page;
+
+   if (!kvm_has_mte(kvm))
+   return 0;
+
+   /*
+* pfn_to_online_page() is used to reject ZONE_DEVICE pages
+* that may not support tags.
+*/
+   page = pfn_to_online_page(pfn);
+
+   if (!page)
+   return -EFAULT;
+
+   for (i = 0; i < nr_pages; i++, page++) {
+   if (!test_bit(PG_mte_tagged, &page->flags)) {
+   mte_clear_page_tags(page_address(page));
+   set_bit(PG_mte_tagged, &page->flags);
+   }
+   }
+
+   return 0;
+}
+
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
  struct kvm_memory_slot *memslot, unsigned long hva,
  unsigned long fault_status)
@@ -971,8 +1010,16 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
if (writable)
prot |= KVM_PGTABLE_PROT_W;
 
-   if (fault_status != FSC_PERM && !device)
+   if (fault_status != FSC_PERM && !device) {
+   /* Check the VMM hasn't introduced a new VM_SHARED VMA */
+   if (kvm_has_mte(kvm) && vma->vm_flags & VM_SHARE

[PATCH v16 6/7] KVM: arm64: ioctl to fetch/store tags in a guest

2021-06-18 Thread Steven Price
The VMM may not wish to have its own mapping of guest memory mapped
with PROT_MTE because this causes problems if the VMM has tag checking
enabled (the guest controls the tags in physical RAM and it's unlikely
the tags are correct for the VMM).

Instead add a new ioctl which allows the VMM to easily read/write the
tags from guest memory, allowing the VMM's mapping to be non-PROT_MTE
while the VMM can still read/write the tags for the purpose of
migration.

Reviewed-by: Catalin Marinas 
Signed-off-by: Steven Price 
---
 arch/arm64/include/asm/kvm_host.h |  3 ++
 arch/arm64/include/asm/mte-def.h  |  1 +
 arch/arm64/include/uapi/asm/kvm.h | 11 +
 arch/arm64/kvm/arm.c  |  7 +++
 arch/arm64/kvm/guest.c| 82 +++
 include/uapi/linux/kvm.h  |  1 +
 6 files changed, 105 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 309e36cc1b42..6a2ac4636d42 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -729,6 +729,9 @@ int kvm_arm_vcpu_arch_get_attr(struct kvm_vcpu *vcpu,
 int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
   struct kvm_device_attr *attr);
 
+long kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
+   struct kvm_arm_copy_mte_tags *copy_tags);
+
 /* Guest/host FPSIMD coordination helpers */
 int kvm_arch_vcpu_run_map_fp(struct kvm_vcpu *vcpu);
 void kvm_arch_vcpu_load_fp(struct kvm_vcpu *vcpu);
diff --git a/arch/arm64/include/asm/mte-def.h b/arch/arm64/include/asm/mte-def.h
index cf241b0f0a42..626d359b396e 100644
--- a/arch/arm64/include/asm/mte-def.h
+++ b/arch/arm64/include/asm/mte-def.h
@@ -7,6 +7,7 @@
 
 #define MTE_GRANULE_SIZE   UL(16)
 #define MTE_GRANULE_MASK   (~(MTE_GRANULE_SIZE - 1))
+#define MTE_GRANULES_PER_PAGE  (PAGE_SIZE / MTE_GRANULE_SIZE)
 #define MTE_TAG_SHIFT  56
 #define MTE_TAG_SIZE   4
 #define MTE_TAG_MASK   GENMASK((MTE_TAG_SHIFT + (MTE_TAG_SIZE - 1)), 
MTE_TAG_SHIFT)
diff --git a/arch/arm64/include/uapi/asm/kvm.h 
b/arch/arm64/include/uapi/asm/kvm.h
index 24223adae150..b3edde68bc3e 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -184,6 +184,17 @@ struct kvm_vcpu_events {
__u32 reserved[12];
 };
 
+struct kvm_arm_copy_mte_tags {
+   __u64 guest_ipa;
+   __u64 length;
+   void __user *addr;
+   __u64 flags;
+   __u64 reserved[2];
+};
+
+#define KVM_ARM_TAGS_TO_GUEST  0
+#define KVM_ARM_TAGS_FROM_GUEST1
+
 /* If you need to interpret the index values, here is the key: */
 #define KVM_REG_ARM_COPROC_MASK0x0FFF
 #define KVM_REG_ARM_COPROC_SHIFT   16
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index e89a5e275e25..baa33359e477 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1345,6 +1345,13 @@ long kvm_arch_vm_ioctl(struct file *filp,
 
return 0;
}
+   case KVM_ARM_MTE_COPY_TAGS: {
+   struct kvm_arm_copy_mte_tags copy_tags;
+
+   if (copy_from_user(&copy_tags, argp, sizeof(copy_tags)))
+   return -EFAULT;
+   return kvm_vm_ioctl_mte_copy_tags(kvm, &copy_tags);
+   }
default:
return -EINVAL;
}
diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
index 5cb4a1cd5603..4ddb20017b2f 100644
--- a/arch/arm64/kvm/guest.c
+++ b/arch/arm64/kvm/guest.c
@@ -995,3 +995,85 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
 
return ret;
 }
+
+long kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
+   struct kvm_arm_copy_mte_tags *copy_tags)
+{
+   gpa_t guest_ipa = copy_tags->guest_ipa;
+   size_t length = copy_tags->length;
+   void __user *tags = copy_tags->addr;
+   gpa_t gfn;
+   bool write = !(copy_tags->flags & KVM_ARM_TAGS_FROM_GUEST);
+   int ret = 0;
+
+   if (!kvm_has_mte(kvm))
+   return -EINVAL;
+
+   if (copy_tags->reserved[0] || copy_tags->reserved[1])
+   return -EINVAL;
+
+   if (copy_tags->flags & ~KVM_ARM_TAGS_FROM_GUEST)
+   return -EINVAL;
+
+   if (length & ~PAGE_MASK || guest_ipa & ~PAGE_MASK)
+   return -EINVAL;
+
+   gfn = gpa_to_gfn(guest_ipa);
+
+   mutex_lock(&kvm->slots_lock);
+
+   while (length > 0) {
+   kvm_pfn_t pfn = gfn_to_pfn_prot(kvm, gfn, write, NULL);
+   void *maddr;
+   unsigned long num_tags;
+   struct page *page;
+
+   if (is_error_noslot_pfn(pfn)) {
+   ret = -EFAULT;
+   goto out;
+   }
+
+   page = pfn_to_online_page(pfn);
+   if (!page) {
+   /* Reject ZONE_DEVICE memory */
+   ret 

[PATCH v16 2/7] arm64: mte: Sync tags for pages where PTE is untagged

2021-06-18 Thread Steven Price
A KVM guest could store tags in a page even if the VMM hasn't mapped
the page with PROT_MTE. So when restoring pages from swap we will
need to check to see if there are any saved tags even if !pte_tagged().

However don't check pages for which pte_access_permitted() returns false
as these will not have been swapped out.

Reviewed-by: Catalin Marinas 
Signed-off-by: Steven Price 
---
 arch/arm64/include/asm/mte.h |  4 ++--
 arch/arm64/include/asm/pgtable.h | 22 +++---
 arch/arm64/kernel/mte.c  | 17 +
 3 files changed, 34 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
index bc88a1ced0d7..347ef38a35f7 100644
--- a/arch/arm64/include/asm/mte.h
+++ b/arch/arm64/include/asm/mte.h
@@ -37,7 +37,7 @@ void mte_free_tag_storage(char *storage);
 /* track which pages have valid allocation tags */
 #define PG_mte_tagged  PG_arch_2
 
-void mte_sync_tags(pte_t *ptep, pte_t pte);
+void mte_sync_tags(pte_t old_pte, pte_t pte);
 void mte_copy_page_tags(void *kto, const void *kfrom);
 void mte_thread_init_user(void);
 void mte_thread_switch(struct task_struct *next);
@@ -53,7 +53,7 @@ int mte_ptrace_copy_tags(struct task_struct *child, long 
request,
 /* unused if !CONFIG_ARM64_MTE, silence the compiler */
 #define PG_mte_tagged  0
 
-static inline void mte_sync_tags(pte_t *ptep, pte_t pte)
+static inline void mte_sync_tags(pte_t old_pte, pte_t pte)
 {
 }
 static inline void mte_copy_page_tags(void *kto, const void *kfrom)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 0b10204e72fc..db5402168841 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -314,9 +314,25 @@ static inline void set_pte_at(struct mm_struct *mm, 
unsigned long addr,
if (pte_present(pte) && pte_user_exec(pte) && !pte_special(pte))
__sync_icache_dcache(pte);
 
-   if (system_supports_mte() &&
-   pte_present(pte) && pte_tagged(pte) && !pte_special(pte))
-   mte_sync_tags(ptep, pte);
+   /*
+* If the PTE would provide user space access to the tags associated
+* with it then ensure that the MTE tags are synchronised.  Although
+* pte_access_permitted() returns false for exec only mappings, they
+* don't expose tags (instruction fetches don't check tags).
+*/
+   if (system_supports_mte() && pte_access_permitted(pte, false) &&
+   !pte_special(pte)) {
+   pte_t old_pte = READ_ONCE(*ptep);
+   /*
+* We only need to synchronise if the new PTE has tags enabled
+* or if swapping in (in which case another mapping may have
+* set tags in the past even if this PTE isn't tagged).
+* (!pte_none() && !pte_present()) is an open coded version of
+* is_swap_pte()
+*/
+   if (pte_tagged(pte) || (!pte_none(old_pte) && 
!pte_present(old_pte)))
+   mte_sync_tags(old_pte, pte);
+   }
 
__check_racy_pte_update(mm, ptep, pte);
 
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index a3583a7fd400..ae0a3c68fece 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -33,10 +33,10 @@ DEFINE_STATIC_KEY_FALSE(mte_async_mode);
 EXPORT_SYMBOL_GPL(mte_async_mode);
 #endif
 
-static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
+static void mte_sync_page_tags(struct page *page, pte_t old_pte,
+  bool check_swap, bool pte_is_tagged)
 {
unsigned long flags;
-   pte_t old_pte = READ_ONCE(*ptep);
 
spin_lock_irqsave(&tag_sync_lock, flags);
 
@@ -53,6 +53,9 @@ static void mte_sync_page_tags(struct page *page, pte_t 
*ptep, bool check_swap)
}
}
 
+   if (!pte_is_tagged)
+   goto out;
+
page_kasan_tag_reset(page);
/*
 * We need smp_wmb() in between setting the flags and clearing the
@@ -69,16 +72,22 @@ static void mte_sync_page_tags(struct page *page, pte_t 
*ptep, bool check_swap)
spin_unlock_irqrestore(&tag_sync_lock, flags);
 }
 
-void mte_sync_tags(pte_t *ptep, pte_t pte)
+void mte_sync_tags(pte_t old_pte, pte_t pte)
 {
struct page *page = pte_page(pte);
long i, nr_pages = compound_nr(page);
bool check_swap = nr_pages == 1;
+   bool pte_is_tagged = pte_tagged(pte);
+
+   /* Early out if there's nothing to do */
+   if (!check_swap && !pte_is_tagged)
+   return;
 
/* if PG_mte_tagged is set, tags have already been initialised */
for (i = 0; i < nr_pages; i++, page++) {
if (!test_bit(PG_mte_tagged, &page->flags))
-   mte_sync_page_tags(page, ptep, check_swap);
+   mte_sync_page_ta

[PATCH v16 0/7] MTE support for KVM guest

2021-06-18 Thread Steven Price
This series adds support for using the Arm Memory Tagging Extensions
(MTE) in a KVM guest.

This time with less BKL but hopefully no new races!

Changes since v15[1]:

 - Prevent VM_SHARED mappings with an MTE-enabled VM.

 - Dropped the mte_prepare_page_tags() function, sanitise_mte_tags() now
   does the PG_mte_tagged dance without extra locks.

 - Added a comment to sanitise_mte_tags() explaining why the apparent
   race with the test/set of page->flags is safe.

 - Added a sentence to kvm/api.rst explaining that VM_SHARED is not
   permitted when used with MTE in a guest.

 - Dropped the Reviewed-by tags on patches 3 and 7 due to the changes.

[1] https://lore.kernel.org/r/20210614090525.4338-1-steven.price%40arm.com

Steven Price (7):
  arm64: mte: Handle race when synchronising tags
  arm64: mte: Sync tags for pages where PTE is untagged
  KVM: arm64: Introduce MTE VM feature
  KVM: arm64: Save/restore MTE registers
  KVM: arm64: Expose KVM_ARM_CAP_MTE
  KVM: arm64: ioctl to fetch/store tags in a guest
  KVM: arm64: Document MTE capability and ioctl

 Documentation/virt/kvm/api.rst | 61 
 arch/arm64/include/asm/kvm_arm.h   |  3 +-
 arch/arm64/include/asm/kvm_emulate.h   |  3 +
 arch/arm64/include/asm/kvm_host.h  | 12 
 arch/arm64/include/asm/kvm_mte.h   | 66 +
 arch/arm64/include/asm/mte-def.h   |  1 +
 arch/arm64/include/asm/mte.h   |  4 +-
 arch/arm64/include/asm/pgtable.h   | 22 +-
 arch/arm64/include/asm/sysreg.h|  3 +-
 arch/arm64/include/uapi/asm/kvm.h  | 11 +++
 arch/arm64/kernel/asm-offsets.c|  2 +
 arch/arm64/kernel/mte.c| 37 --
 arch/arm64/kvm/arm.c   | 16 +
 arch/arm64/kvm/guest.c | 82 ++
 arch/arm64/kvm/hyp/entry.S |  7 ++
 arch/arm64/kvm/hyp/exception.c |  3 +-
 arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 21 ++
 arch/arm64/kvm/mmu.c   | 62 +++-
 arch/arm64/kvm/reset.c |  3 +-
 arch/arm64/kvm/sys_regs.c  | 32 +++--
 include/uapi/linux/kvm.h   |  2 +
 21 files changed, 432 insertions(+), 21 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_mte.h

-- 
2.20.1




[PATCH v16 1/7] arm64: mte: Handle race when synchronising tags

2021-06-18 Thread Steven Price
mte_sync_tags() used test_and_set_bit() to set the PG_mte_tagged flag
before restoring/zeroing the MTE tags. However if another thread were to
race and attempt to sync the tags on the same page before the first
thread had completed restoring/zeroing then it would see the flag is
already set and continue without waiting. This would potentially expose
the previous contents of the tags to user space, and cause any updates
that user space makes before the restoring/zeroing has completed to
potentially be lost.

Since this code is run from atomic contexts we can't just lock the page
during the process. Instead implement a new (global) spinlock to protect
the mte_sync_page_tags() function.

Fixes: 34bfeea4a9e9 ("arm64: mte: Clear the tags when a page is mapped in 
user-space with PROT_MTE")
Reviewed-by: Catalin Marinas 
Signed-off-by: Steven Price 
---
 arch/arm64/kernel/mte.c | 20 +---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index 125a10e413e9..a3583a7fd400 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -25,6 +25,7 @@
 u64 gcr_kernel_excl __ro_after_init;
 
 static bool report_fault_once = true;
+static DEFINE_SPINLOCK(tag_sync_lock);
 
 #ifdef CONFIG_KASAN_HW_TAGS
 /* Whether the MTE asynchronous mode is enabled. */
@@ -34,13 +35,22 @@ EXPORT_SYMBOL_GPL(mte_async_mode);
 
 static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
 {
+   unsigned long flags;
pte_t old_pte = READ_ONCE(*ptep);
 
+   spin_lock_irqsave(&tag_sync_lock, flags);
+
+   /* Recheck with the lock held */
+   if (test_bit(PG_mte_tagged, &page->flags))
+   goto out;
+
if (check_swap && is_swap_pte(old_pte)) {
swp_entry_t entry = pte_to_swp_entry(old_pte);
 
-   if (!non_swap_entry(entry) && mte_restore_tags(entry, page))
-   return;
+   if (!non_swap_entry(entry) && mte_restore_tags(entry, page)) {
+   set_bit(PG_mte_tagged, &page->flags);
+   goto out;
+   }
}
 
page_kasan_tag_reset(page);
@@ -53,6 +63,10 @@ static void mte_sync_page_tags(struct page *page, pte_t 
*ptep, bool check_swap)
 */
smp_wmb();
mte_clear_page_tags(page_address(page));
+   set_bit(PG_mte_tagged, &page->flags);
+
+out:
+   spin_unlock_irqrestore(&tag_sync_lock, flags);
 }
 
 void mte_sync_tags(pte_t *ptep, pte_t pte)
@@ -63,7 +77,7 @@ void mte_sync_tags(pte_t *ptep, pte_t pte)
 
/* if PG_mte_tagged is set, tags have already been initialised */
for (i = 0; i < nr_pages; i++, page++) {
-   if (!test_and_set_bit(PG_mte_tagged, &page->flags))
+   if (!test_bit(PG_mte_tagged, &page->flags))
mte_sync_page_tags(page, ptep, check_swap);
}
 }
-- 
2.20.1
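
Purely as an illustration of the interleaving described in the commit
message above (not kernel code), here is a userspace analogue of the
pattern the patch moves to: check the flag, take a lock, re-check under
the lock, do the one-time initialisation, and only then publish the flag,
so a second thread can never observe the flag before the work is done:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_bool page_tagged;		/* stands in for PG_mte_tagged */
static pthread_mutex_t tag_lock = PTHREAD_MUTEX_INITIALIZER;
static int tags[16];			/* stands in for the page's tag storage */

static void sync_tags(void)
{
	if (atomic_load(&page_tagged))	/* fast path, like test_bit() */
		return;

	pthread_mutex_lock(&tag_lock);
	if (!atomic_load(&page_tagged)) {	/* re-check under the lock */
		for (int i = 0; i < 16; i++)	/* "restore/zero the tags" */
			tags[i] = 0;
		atomic_store(&page_tagged, 1);	/* publish only when done */
	}
	pthread_mutex_unlock(&tag_lock);
}

static void *worker(void *arg)
{
	(void)arg;
	sync_tags();
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, worker, NULL);
	pthread_create(&b, NULL, worker, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	printf("tags initialised exactly once\n");
	return 0;
}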




Re: [PATCH v15 0/7] MTE support for KVM guest

2021-06-17 Thread Steven Price
On 17/06/2021 14:15, Marc Zyngier wrote:
> On Thu, 17 Jun 2021 13:13:22 +0100,
> Catalin Marinas  wrote:
>>
>> On Mon, Jun 14, 2021 at 10:05:18AM +0100, Steven Price wrote:
>>> I realise there are still open questions[1] around the performance of
>>> this series (the 'big lock', tag_sync_lock, introduced in the first
>>> patch). But there should be no impact on non-MTE workloads and until we
>>> get real MTE-enabled hardware it's hard to know whether there is a need
>>> for something more sophisticated or not. Peter Collingbourne's patch[3]
>>> to clear the tags at page allocation time should hide more of the impact
>>> for non-VM cases. So the remaining concern is around VM startup which
>>> could be effectively serialised through the lock.
>> [...]
>>> [1]: https://lore.kernel.org/r/874ke7z3ng.wl-maz%40kernel.org
>>
>> Start-up, VM resume, migration could be affected by this lock, basically
>> any time you fault a page into the guest. As you said, for now it should
>> be fine as long as the hardware doesn't support MTE or qemu doesn't
>> enable MTE in guests. But the problem won't go away.
> 
> Indeed. And I find it odd to say "it's not a problem, we don't have
> any HW available". By this token, why should we merge this work the
> first place, or any of the MTE work that has gone into the kernel over
> the past years?
> 
>> We have a partial solution with an array of locks to mitigate against
>> this but there's still the question of whether we should actually bother
>> for something that's unlikely to happen in practice: MAP_SHARED memory
>> in guests (ignoring the stage 1 case for now).
>>
>> If MAP_SHARED in guests is not a realistic use-case, we have the vma in
>> user_mem_abort() and if the VM_SHARED flag is set together with MTE
>> enabled for guests, we can reject the mapping.
> 
> That's a reasonable approach. I wonder whether we could do that right
> at the point where the memslot is associated with the VM, like this:
> 
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index a36a2e3082d8..ebd3b3224386 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1376,6 +1376,9 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>   if (!vma)
>   break;
>  
> + if (kvm_has_mte(kvm) && vma->vm_flags & VM_SHARED)
> + return -EINVAL;
> +
>   /*
>* Take the intersection of this VMA with the memory region
>*/
> 
> which takes the problem out of the fault path altogether? We document
> the restriction and move on. With that, we can use a non-locking
> version of mte_sync_page_tags().

Does this deal with the case where the VMAs are changed after the
memslot is created? While we can do the check here to give the VMM a
heads-up if it gets it wrong, I think we also need it in
user_mem_abort() to deal with a VMM which mmap()s over the VA of the
memslot. Or am I missing something?

But if everyone is happy with the restriction (just for KVM) of not
allowing MTE+VM_SHARED then that sounds like a good way forward.

Thanks,

Steve

>> We can discuss the stage 1 case separately from this series.
> 
> Works for me.
> 
> Thanks,
> 
>   M.
> 
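
A hypothetical userspace illustration of the restriction being discussed:
once MTE is enabled for the VM, a memslot backed by a MAP_SHARED mapping
is refused (the exact error code differed between revisions of the
series), while MAP_PRIVATE backing behaves as before. It assumes a vm_fd
prepared as in the earlier sketches and omits mmap() error handling:

#include <linux/kvm.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

static int add_memslot(int vm_fd, uint32_t slot, void *host_mem,
		       uint64_t size)
{
	struct kvm_userspace_memory_region region = {
		.slot            = slot,
		.guest_phys_addr = 0x80000000ULL + (uint64_t)slot * size,
		.memory_size     = size,
		.userspace_addr  = (uint64_t)(uintptr_t)host_mem,
	};

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}

int mte_memslot_demo(int vm_fd)
{
	const uint64_t size = 2 << 20;		/* 2MiB */
	void *shared = mmap(NULL, size, PROT_READ | PROT_WRITE,
			    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	void *priv = mmap(NULL, size, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* Expected to fail for an MTE-enabled VM. */
	if (add_memslot(vm_fd, 0, shared, size) < 0)
		perror("MAP_SHARED memslot rejected");

	/* Expected to succeed. */
	return add_memslot(vm_fd, 1, priv, size);
}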




[PATCH v15 3/7] KVM: arm64: Introduce MTE VM feature

2021-06-14 Thread Steven Price
Add a new VM feature 'KVM_ARM_CAP_MTE' which enables memory tagging
for a VM. This will expose the feature to the guest and automatically
tag memory pages touched by the VM as PG_mte_tagged (and clear the tag
storage) to ensure that the guest cannot see stale tags, and so that
the tags are correctly saved/restored across swap.

Actually exposing the new capability to user space happens in a later
patch.

Reviewed-by: Catalin Marinas 
Signed-off-by: Steven Price 
---
 arch/arm64/include/asm/kvm_emulate.h |  3 ++
 arch/arm64/include/asm/kvm_host.h|  3 ++
 arch/arm64/include/asm/mte.h |  4 +++
 arch/arm64/kernel/mte.c  | 17 +++
 arch/arm64/kvm/hyp/exception.c   |  3 +-
 arch/arm64/kvm/mmu.c | 42 +++-
 arch/arm64/kvm/sys_regs.c|  7 +
 include/uapi/linux/kvm.h |  1 +
 8 files changed, 78 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_emulate.h 
b/arch/arm64/include/asm/kvm_emulate.h
index f612c090f2e4..6bf776c2399c 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -84,6 +84,9 @@ static inline void vcpu_reset_hcr(struct kvm_vcpu *vcpu)
if (cpus_have_const_cap(ARM64_MISMATCHED_CACHE_TYPE) ||
vcpu_el1_is_32bit(vcpu))
vcpu->arch.hcr_el2 |= HCR_TID2;
+
+   if (kvm_has_mte(vcpu->kvm))
+   vcpu->arch.hcr_el2 |= HCR_ATA;
 }
 
 static inline unsigned long *vcpu_hcr(struct kvm_vcpu *vcpu)
diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 7cd7d5c8c4bc..afaa5333f0e4 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -132,6 +132,8 @@ struct kvm_arch {
 
u8 pfr0_csv2;
u8 pfr0_csv3;
+   /* Memory Tagging Extension enabled for the guest */
+   bool mte_enabled;
 };
 
 struct kvm_vcpu_fault_info {
@@ -769,6 +771,7 @@ bool kvm_arm_vcpu_is_finalized(struct kvm_vcpu *vcpu);
 #define kvm_arm_vcpu_sve_finalized(vcpu) \
((vcpu)->arch.flags & KVM_ARM64_VCPU_SVE_FINALIZED)
 
+#define kvm_has_mte(kvm) (system_supports_mte() && (kvm)->arch.mte_enabled)
 #define kvm_vcpu_has_pmu(vcpu) \
(test_bit(KVM_ARM_VCPU_PMU_V3, (vcpu)->arch.features))
 
diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
index 347ef38a35f7..be1de541a11c 100644
--- a/arch/arm64/include/asm/mte.h
+++ b/arch/arm64/include/asm/mte.h
@@ -37,6 +37,7 @@ void mte_free_tag_storage(char *storage);
 /* track which pages have valid allocation tags */
 #define PG_mte_tagged  PG_arch_2
 
+void mte_prepare_page_tags(struct page *page);
 void mte_sync_tags(pte_t old_pte, pte_t pte);
 void mte_copy_page_tags(void *kto, const void *kfrom);
 void mte_thread_init_user(void);
@@ -53,6 +54,9 @@ int mte_ptrace_copy_tags(struct task_struct *child, long 
request,
 /* unused if !CONFIG_ARM64_MTE, silence the compiler */
 #define PG_mte_tagged  0
 
+static inline void mte_prepare_page_tags(struct page *page)
+{
+}
 static inline void mte_sync_tags(pte_t old_pte, pte_t pte)
 {
 }
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index ae0a3c68fece..b120f82a2258 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -72,6 +72,23 @@ static void mte_sync_page_tags(struct page *page, pte_t 
old_pte,
spin_unlock_irqrestore(&tag_sync_lock, flags);
 }
 
+void mte_prepare_page_tags(struct page *page)
+{
+   unsigned long flags;
+
+   spin_lock_irqsave(&tag_sync_lock, flags);
+
+   /* Recheck with the lock held */
+   if (test_bit(PG_mte_tagged, &page->flags))
+   goto out;
+
+   mte_clear_page_tags(page_address(page));
+   set_bit(PG_mte_tagged, &page->flags);
+
+out:
+   spin_unlock_irqrestore(&tag_sync_lock, flags);
+}
+
 void mte_sync_tags(pte_t old_pte, pte_t pte)
 {
struct page *page = pte_page(pte);
diff --git a/arch/arm64/kvm/hyp/exception.c b/arch/arm64/kvm/hyp/exception.c
index 73629094f903..56426565600c 100644
--- a/arch/arm64/kvm/hyp/exception.c
+++ b/arch/arm64/kvm/hyp/exception.c
@@ -112,7 +112,8 @@ static void enter_exception64(struct kvm_vcpu *vcpu, 
unsigned long target_mode,
new |= (old & PSR_C_BIT);
new |= (old & PSR_V_BIT);
 
-   // TODO: TCO (if/when ARMv8.5-MemTag is exposed to guests)
+   if (kvm_has_mte(vcpu->kvm))
+   new |= PSR_TCO_BIT;
 
new |= (old & PSR_DIT_BIT);
 
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index c5d1f3c87dbd..ed7c624e7362 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -822,6 +822,36 @@ transparent_hugepage_adjust(struct kvm_memory_slot 
*memslot,
return PAGE_SIZE;
 }
 
+static int sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn,
+unsigned long size)
+{
+   unsigned long i, nr_pages = size >> PAGE_SHIFT;
+   struct page *page;

[PATCH v15 0/7] MTE support for KVM guest

2021-06-14 Thread Steven Price
This series adds support for using the Arm Memory Tagging Extensions
(MTE) in a KVM guest.

I realise there are still open questions[1] around the performance of
this series (the 'big lock', tag_sync_lock, introduced in the first
patch). But there should be no impact on non-MTE workloads and until we
get real MTE-enabled hardware it's hard to know whether there is a need
for something more sophisticated or not. Peter Collingbourne's patch[3]
to clear the tags at page allocation time should hide more of the impact
for non-VM cases. So the remaining concern is around VM startup which
could be effectively serialised through the lock.

Changes since v14[2]:

 * Dropped "Handle MTE tags zeroing" patch in favour of Peter's similar
   patch[3] (now in arm64 tree).

 * Improved documentation following Catalin's review.

[1]: https://lore.kernel.org/r/874ke7z3ng.wl-maz%40kernel.org
[2]: https://lore.kernel.org/r/20210607110816.25762-1-steven.pr...@arm.com/
[3]: https://lore.kernel.org/r/20210602235230.3928842-4-...@google.com/

Steven Price (7):
  arm64: mte: Handle race when synchronising tags
  arm64: mte: Sync tags for pages where PTE is untagged
  KVM: arm64: Introduce MTE VM feature
  KVM: arm64: Save/restore MTE registers
  KVM: arm64: Expose KVM_ARM_CAP_MTE
  KVM: arm64: ioctl to fetch/store tags in a guest
  KVM: arm64: Document MTE capability and ioctl

 Documentation/virt/kvm/api.rst | 57 +++
 arch/arm64/include/asm/kvm_arm.h   |  3 +-
 arch/arm64/include/asm/kvm_emulate.h   |  3 +
 arch/arm64/include/asm/kvm_host.h  | 12 
 arch/arm64/include/asm/kvm_mte.h   | 66 +
 arch/arm64/include/asm/mte-def.h   |  1 +
 arch/arm64/include/asm/mte.h   |  8 ++-
 arch/arm64/include/asm/pgtable.h   | 22 +-
 arch/arm64/include/asm/sysreg.h|  3 +-
 arch/arm64/include/uapi/asm/kvm.h  | 11 +++
 arch/arm64/kernel/asm-offsets.c|  2 +
 arch/arm64/kernel/mte.c| 54 --
 arch/arm64/kvm/arm.c   | 16 +
 arch/arm64/kvm/guest.c | 82 ++
 arch/arm64/kvm/hyp/entry.S |  7 ++
 arch/arm64/kvm/hyp/exception.c |  3 +-
 arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 21 ++
 arch/arm64/kvm/mmu.c   | 42 ++-
 arch/arm64/kvm/reset.c |  3 +-
 arch/arm64/kvm/sys_regs.c  | 32 +++--
 include/uapi/linux/kvm.h   |  2 +
 21 files changed, 429 insertions(+), 21 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_mte.h

-- 
2.20.1




[PATCH v15 5/7] KVM: arm64: Expose KVM_ARM_CAP_MTE

2021-06-14 Thread Steven Price
It's now safe for the VMM to enable MTE in a guest, so expose the
capability to user space.

Reviewed-by: Catalin Marinas 
Signed-off-by: Steven Price 
---
 arch/arm64/kvm/arm.c  | 9 +
 arch/arm64/kvm/reset.c| 3 ++-
 arch/arm64/kvm/sys_regs.c | 3 +++
 3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 1cb39c0803a4..e89a5e275e25 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -93,6 +93,12 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
r = 0;
kvm->arch.return_nisv_io_abort_to_user = true;
break;
+   case KVM_CAP_ARM_MTE:
+   if (!system_supports_mte() || kvm->created_vcpus)
+   return -EINVAL;
+   r = 0;
+   kvm->arch.mte_enabled = true;
+   break;
default:
r = -EINVAL;
break;
@@ -237,6 +243,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 */
r = 1;
break;
+   case KVM_CAP_ARM_MTE:
+   r = system_supports_mte();
+   break;
case KVM_CAP_STEAL_TIME:
r = kvm_arm_pvtime_supported();
break;
diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index 956cdc240148..50635eacfa43 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -220,7 +220,8 @@ int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
switch (vcpu->arch.target) {
default:
if (test_bit(KVM_ARM_VCPU_EL1_32BIT, vcpu->arch.features)) {
-   if (!cpus_have_const_cap(ARM64_HAS_32BIT_EL1)) {
+   if (!cpus_have_const_cap(ARM64_HAS_32BIT_EL1) ||
+   vcpu->kvm->arch.mte_enabled) {
ret = -EINVAL;
goto out;
}
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 440315a556c2..d4e1c1b1a08d 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1312,6 +1312,9 @@ static bool access_ccsidr(struct kvm_vcpu *vcpu, struct 
sys_reg_params *p,
 static unsigned int mte_visibility(const struct kvm_vcpu *vcpu,
   const struct sys_reg_desc *rd)
 {
+   if (kvm_has_mte(vcpu->kvm))
+   return 0;
+
return REG_HIDDEN;
 }
 
-- 
2.20.1




[PATCH v15 2/7] arm64: mte: Sync tags for pages where PTE is untagged

2021-06-14 Thread Steven Price
A KVM guest could store tags in a page even if the VMM hasn't mapped
the page with PROT_MTE. So when restoring pages from swap we will
need to check to see if there are any saved tags even if !pte_tagged().

However don't check pages for which pte_access_permitted() returns false
as these will not have been swapped out.

Reviewed-by: Catalin Marinas 
Signed-off-by: Steven Price 
---
 arch/arm64/include/asm/mte.h |  4 ++--
 arch/arm64/include/asm/pgtable.h | 22 +++---
 arch/arm64/kernel/mte.c  | 17 +
 3 files changed, 34 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
index bc88a1ced0d7..347ef38a35f7 100644
--- a/arch/arm64/include/asm/mte.h
+++ b/arch/arm64/include/asm/mte.h
@@ -37,7 +37,7 @@ void mte_free_tag_storage(char *storage);
 /* track which pages have valid allocation tags */
 #define PG_mte_tagged  PG_arch_2
 
-void mte_sync_tags(pte_t *ptep, pte_t pte);
+void mte_sync_tags(pte_t old_pte, pte_t pte);
 void mte_copy_page_tags(void *kto, const void *kfrom);
 void mte_thread_init_user(void);
 void mte_thread_switch(struct task_struct *next);
@@ -53,7 +53,7 @@ int mte_ptrace_copy_tags(struct task_struct *child, long 
request,
 /* unused if !CONFIG_ARM64_MTE, silence the compiler */
 #define PG_mte_tagged  0
 
-static inline void mte_sync_tags(pte_t *ptep, pte_t pte)
+static inline void mte_sync_tags(pte_t old_pte, pte_t pte)
 {
 }
 static inline void mte_copy_page_tags(void *kto, const void *kfrom)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 0b10204e72fc..db5402168841 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -314,9 +314,25 @@ static inline void set_pte_at(struct mm_struct *mm, 
unsigned long addr,
if (pte_present(pte) && pte_user_exec(pte) && !pte_special(pte))
__sync_icache_dcache(pte);
 
-   if (system_supports_mte() &&
-   pte_present(pte) && pte_tagged(pte) && !pte_special(pte))
-   mte_sync_tags(ptep, pte);
+   /*
+* If the PTE would provide user space access to the tags associated
+* with it then ensure that the MTE tags are synchronised.  Although
+* pte_access_permitted() returns false for exec only mappings, they
+* don't expose tags (instruction fetches don't check tags).
+*/
+   if (system_supports_mte() && pte_access_permitted(pte, false) &&
+   !pte_special(pte)) {
+   pte_t old_pte = READ_ONCE(*ptep);
+   /*
+* We only need to synchronise if the new PTE has tags enabled
+* or if swapping in (in which case another mapping may have
+* set tags in the past even if this PTE isn't tagged).
+* (!pte_none() && !pte_present()) is an open coded version of
+* is_swap_pte()
+*/
+   if (pte_tagged(pte) || (!pte_none(old_pte) && 
!pte_present(old_pte)))
+   mte_sync_tags(old_pte, pte);
+   }
 
__check_racy_pte_update(mm, ptep, pte);
 
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index a3583a7fd400..ae0a3c68fece 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -33,10 +33,10 @@ DEFINE_STATIC_KEY_FALSE(mte_async_mode);
 EXPORT_SYMBOL_GPL(mte_async_mode);
 #endif
 
-static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
+static void mte_sync_page_tags(struct page *page, pte_t old_pte,
+  bool check_swap, bool pte_is_tagged)
 {
unsigned long flags;
-   pte_t old_pte = READ_ONCE(*ptep);
 
spin_lock_irqsave(&tag_sync_lock, flags);
 
@@ -53,6 +53,9 @@ static void mte_sync_page_tags(struct page *page, pte_t 
*ptep, bool check_swap)
}
}
 
+   if (!pte_is_tagged)
+   goto out;
+
page_kasan_tag_reset(page);
/*
 * We need smp_wmb() in between setting the flags and clearing the
@@ -69,16 +72,22 @@ static void mte_sync_page_tags(struct page *page, pte_t 
*ptep, bool check_swap)
spin_unlock_irqrestore(&tag_sync_lock, flags);
 }
 
-void mte_sync_tags(pte_t *ptep, pte_t pte)
+void mte_sync_tags(pte_t old_pte, pte_t pte)
 {
struct page *page = pte_page(pte);
long i, nr_pages = compound_nr(page);
bool check_swap = nr_pages == 1;
+   bool pte_is_tagged = pte_tagged(pte);
+
+   /* Early out if there's nothing to do */
+   if (!check_swap && !pte_is_tagged)
+   return;
 
/* if PG_mte_tagged is set, tags have already been initialised */
for (i = 0; i < nr_pages; i++, page++) {
if (!test_bit(PG_mte_tagged, &page->flags))
-   mte_sync_page_tags(page, ptep, check_swap);
+   mte_sync_page_ta

[PATCH v15 1/7] arm64: mte: Handle race when synchronising tags

2021-06-14 Thread Steven Price
mte_sync_tags() used test_and_set_bit() to set the PG_mte_tagged flag
before restoring/zeroing the MTE tags. However if another thread were to
race and attempt to sync the tags on the same page before the first
thread had completed restoring/zeroing then it would see the flag is
already set and continue without waiting. This would potentially expose
the previous contents of the tags to user space, and cause any updates
that user space makes before the restoring/zeroing has completed to
potentially be lost.

Since this code is run from atomic contexts we can't just lock the page
during the process. Instead implement a new (global) spinlock to protect
the mte_sync_page_tags() function.

Fixes: 34bfeea4a9e9 ("arm64: mte: Clear the tags when a page is mapped in 
user-space with PROT_MTE")
Reviewed-by: Catalin Marinas 
Signed-off-by: Steven Price 
---
 arch/arm64/kernel/mte.c | 20 +---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index 125a10e413e9..a3583a7fd400 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -25,6 +25,7 @@
 u64 gcr_kernel_excl __ro_after_init;
 
 static bool report_fault_once = true;
+static DEFINE_SPINLOCK(tag_sync_lock);
 
 #ifdef CONFIG_KASAN_HW_TAGS
 /* Whether the MTE asynchronous mode is enabled. */
@@ -34,13 +35,22 @@ EXPORT_SYMBOL_GPL(mte_async_mode);
 
 static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
 {
+   unsigned long flags;
pte_t old_pte = READ_ONCE(*ptep);
 
+   spin_lock_irqsave(&tag_sync_lock, flags);
+
+   /* Recheck with the lock held */
+   if (test_bit(PG_mte_tagged, &page->flags))
+   goto out;
+
if (check_swap && is_swap_pte(old_pte)) {
swp_entry_t entry = pte_to_swp_entry(old_pte);
 
-   if (!non_swap_entry(entry) && mte_restore_tags(entry, page))
-   return;
+   if (!non_swap_entry(entry) && mte_restore_tags(entry, page)) {
+   set_bit(PG_mte_tagged, &page->flags);
+   goto out;
+   }
}
 
page_kasan_tag_reset(page);
@@ -53,6 +63,10 @@ static void mte_sync_page_tags(struct page *page, pte_t 
*ptep, bool check_swap)
 */
smp_wmb();
mte_clear_page_tags(page_address(page));
+   set_bit(PG_mte_tagged, &page->flags);
+
+out:
+   spin_unlock_irqrestore(&tag_sync_lock, flags);
 }
 
 void mte_sync_tags(pte_t *ptep, pte_t pte)
@@ -63,7 +77,7 @@ void mte_sync_tags(pte_t *ptep, pte_t pte)
 
/* if PG_mte_tagged is set, tags have already been initialised */
for (i = 0; i < nr_pages; i++, page++) {
-   if (!test_and_set_bit(PG_mte_tagged, &page->flags))
+   if (!test_bit(PG_mte_tagged, &page->flags))
mte_sync_page_tags(page, ptep, check_swap);
}
 }
-- 
2.20.1




[PATCH v15 7/7] KVM: arm64: Document MTE capability and ioctl

2021-06-14 Thread Steven Price
A new capability (KVM_CAP_ARM_MTE) identifies that the kernel supports
granting a guest access to the tags, and provides a mechanism for the
VMM to enable it.

A new ioctl (KVM_ARM_MTE_COPY_TAGS) provides a simple way for a VMM to
access the tags of a guest without having to maintain a PROT_MTE mapping
in userspace. The above capability gates access to the ioctl.

Reviewed-by: Catalin Marinas 
Signed-off-by: Steven Price 
---
 Documentation/virt/kvm/api.rst | 57 ++
 1 file changed, 57 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 22d077562149..d412928e50cd 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -5034,6 +5034,43 @@ see KVM_XEN_VCPU_SET_ATTR above.
 The KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST type may not be used
 with the KVM_XEN_VCPU_GET_ATTR ioctl.
 
+4.130 KVM_ARM_MTE_COPY_TAGS
+---
+
+:Capability: KVM_CAP_ARM_MTE
+:Architectures: arm64
+:Type: vm ioctl
+:Parameters: struct kvm_arm_copy_mte_tags
+:Returns: number of bytes copied, < 0 on error (-EINVAL for incorrect
+  arguments, -EFAULT if memory cannot be accessed).
+
+::
+
+  struct kvm_arm_copy_mte_tags {
+   __u64 guest_ipa;
+   __u64 length;
+   void __user *addr;
+   __u64 flags;
+   __u64 reserved[2];
+  };
+
+Copies Memory Tagging Extension (MTE) tags to/from guest tag memory. The
+``guest_ipa`` and ``length`` fields must be ``PAGE_SIZE`` aligned. The ``addr``
+field must point to a buffer which the tags will be copied to or from.
+
+``flags`` specifies the direction of copy, either ``KVM_ARM_TAGS_TO_GUEST`` or
+``KVM_ARM_TAGS_FROM_GUEST``.
+
+The size of the buffer to store the tags is ``(length / 16)`` bytes
+(granules in MTE are 16 bytes long). Each byte contains a single tag
+value. This matches the format of ``PTRACE_PEEKMTETAGS`` and
+``PTRACE_POKEMTETAGS``.
+
+If an error occurs before any data is copied then a negative error code is
+returned. If some tags have been copied before an error occurs then the number
+of bytes successfully copied is returned. If the call completes successfully
+then ``length`` is returned.
+
 5. The kvm_run structure
 
 
@@ -6362,6 +6399,26 @@ default.
 
 See Documentation/x86/sgx/2.Kernel-internals.rst for more details.
 
+7.26 KVM_CAP_ARM_MTE
+
+
+:Architectures: arm64
+:Parameters: none
+
+This capability indicates that KVM (and the hardware) supports exposing the
+Memory Tagging Extensions (MTE) to the guest. It must also be enabled by the
+VMM before creating any VCPUs to allow the guest access. Note that MTE is only
+available to a guest running in AArch64 mode and enabling this capability will
+cause attempts to create AArch32 VCPUs to fail.
+
+When enabled the guest is able to access tags associated with any memory given
+to the guest. KVM will ensure that the tags are maintained during swap or
+hibernation of the host; however the VMM needs to manually save/restore the
+tags as appropriate if the VM is migrated.
+
+When enabled the VMM may make use of the ``KVM_ARM_MTE_COPY_TAGS`` ioctl to
+perform a bulk copy of tags to/from the guest.
+
 8. Other capabilities.
 ==
 
-- 
2.20.1




[PATCH v15 6/7] KVM: arm64: ioctl to fetch/store tags in a guest

2021-06-14 Thread Steven Price
The VMM may not wish to have its own mapping of guest memory mapped
with PROT_MTE because this causes problems if the VMM has tag checking
enabled (the guest controls the tags in physical RAM and it's unlikely
the tags are correct for the VMM).

Instead add a new ioctl which allows the VMM to easily read/write the
tags from guest memory, allowing the VMM's mapping to be non-PROT_MTE
while the VMM can still read/write the tags for the purpose of
migration.
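
For illustration only, a rough sketch of how a VMM could drive the new
ioctl to save the tags of a single page (vm_fd, the 4K page size and the
header choice are assumptions of the example; the struct, flag and ioctl
names are the ones added by this patch):

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Fetch the tags of one 4K page at guest_ipa into tag_buf. */
  static long save_page_tags(int vm_fd, uint64_t guest_ipa,
                             uint8_t tag_buf[256])
  {
          struct kvm_arm_copy_mte_tags copy = {
                  .guest_ipa = guest_ipa,   /* must be PAGE_SIZE aligned */
                  .length    = 4096,        /* one page */
                  .addr      = tag_buf,     /* 4096 / 16 = 256 tag bytes */
                  .flags     = KVM_ARM_TAGS_FROM_GUEST,
          };

          /* Returns the number of bytes copied, or < 0 on error. */
          return ioctl(vm_fd, KVM_ARM_MTE_COPY_TAGS, &copy);
  }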

Reviewed-by: Catalin Marinas 
Signed-off-by: Steven Price 
---
 arch/arm64/include/asm/kvm_host.h |  3 ++
 arch/arm64/include/asm/mte-def.h  |  1 +
 arch/arm64/include/uapi/asm/kvm.h | 11 +
 arch/arm64/kvm/arm.c  |  7 +++
 arch/arm64/kvm/guest.c| 82 +++
 include/uapi/linux/kvm.h  |  1 +
 6 files changed, 105 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 309e36cc1b42..6a2ac4636d42 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -729,6 +729,9 @@ int kvm_arm_vcpu_arch_get_attr(struct kvm_vcpu *vcpu,
 int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
   struct kvm_device_attr *attr);
 
+long kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
+   struct kvm_arm_copy_mte_tags *copy_tags);
+
 /* Guest/host FPSIMD coordination helpers */
 int kvm_arch_vcpu_run_map_fp(struct kvm_vcpu *vcpu);
 void kvm_arch_vcpu_load_fp(struct kvm_vcpu *vcpu);
diff --git a/arch/arm64/include/asm/mte-def.h b/arch/arm64/include/asm/mte-def.h
index cf241b0f0a42..626d359b396e 100644
--- a/arch/arm64/include/asm/mte-def.h
+++ b/arch/arm64/include/asm/mte-def.h
@@ -7,6 +7,7 @@
 
 #define MTE_GRANULE_SIZE   UL(16)
 #define MTE_GRANULE_MASK   (~(MTE_GRANULE_SIZE - 1))
+#define MTE_GRANULES_PER_PAGE  (PAGE_SIZE / MTE_GRANULE_SIZE)
 #define MTE_TAG_SHIFT  56
 #define MTE_TAG_SIZE   4
 #define MTE_TAG_MASK   GENMASK((MTE_TAG_SHIFT + (MTE_TAG_SIZE - 1)), 
MTE_TAG_SHIFT)
diff --git a/arch/arm64/include/uapi/asm/kvm.h 
b/arch/arm64/include/uapi/asm/kvm.h
index 24223adae150..b3edde68bc3e 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -184,6 +184,17 @@ struct kvm_vcpu_events {
__u32 reserved[12];
 };
 
+struct kvm_arm_copy_mte_tags {
+   __u64 guest_ipa;
+   __u64 length;
+   void __user *addr;
+   __u64 flags;
+   __u64 reserved[2];
+};
+
+#define KVM_ARM_TAGS_TO_GUEST  0
+#define KVM_ARM_TAGS_FROM_GUEST1
+
 /* If you need to interpret the index values, here is the key: */
 #define KVM_REG_ARM_COPROC_MASK0x0FFF
 #define KVM_REG_ARM_COPROC_SHIFT   16
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index e89a5e275e25..baa33359e477 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1345,6 +1345,13 @@ long kvm_arch_vm_ioctl(struct file *filp,
 
return 0;
}
+   case KVM_ARM_MTE_COPY_TAGS: {
+   struct kvm_arm_copy_mte_tags copy_tags;
+
+   if (copy_from_user(&copy_tags, argp, sizeof(copy_tags)))
+   return -EFAULT;
+   return kvm_vm_ioctl_mte_copy_tags(kvm, &copy_tags);
+   }
default:
return -EINVAL;
}
diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
index 5cb4a1cd5603..4ddb20017b2f 100644
--- a/arch/arm64/kvm/guest.c
+++ b/arch/arm64/kvm/guest.c
@@ -995,3 +995,85 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
 
return ret;
 }
+
+long kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
+   struct kvm_arm_copy_mte_tags *copy_tags)
+{
+   gpa_t guest_ipa = copy_tags->guest_ipa;
+   size_t length = copy_tags->length;
+   void __user *tags = copy_tags->addr;
+   gpa_t gfn;
+   bool write = !(copy_tags->flags & KVM_ARM_TAGS_FROM_GUEST);
+   int ret = 0;
+
+   if (!kvm_has_mte(kvm))
+   return -EINVAL;
+
+   if (copy_tags->reserved[0] || copy_tags->reserved[1])
+   return -EINVAL;
+
+   if (copy_tags->flags & ~KVM_ARM_TAGS_FROM_GUEST)
+   return -EINVAL;
+
+   if (length & ~PAGE_MASK || guest_ipa & ~PAGE_MASK)
+   return -EINVAL;
+
+   gfn = gpa_to_gfn(guest_ipa);
+
+   mutex_lock(&kvm->slots_lock);
+
+   while (length > 0) {
+   kvm_pfn_t pfn = gfn_to_pfn_prot(kvm, gfn, write, NULL);
+   void *maddr;
+   unsigned long num_tags;
+   struct page *page;
+
+   if (is_error_noslot_pfn(pfn)) {
+   ret = -EFAULT;
+   goto out;
+   }
+
+   page = pfn_to_online_page(pfn);
+   if (!page) {
+   /* Reject ZONE_DEVICE memory */
+   ret = -EFAULT;

[PATCH v15 4/7] KVM: arm64: Save/restore MTE registers

2021-06-14 Thread Steven Price
Define the new system registers that MTE introduces and context switch
them. The MTE feature is still hidden from the ID register as it isn't
supported in a VM yet.

Reviewed-by: Catalin Marinas 
Signed-off-by: Steven Price 
---
 arch/arm64/include/asm/kvm_arm.h   |  3 +-
 arch/arm64/include/asm/kvm_host.h  |  6 ++
 arch/arm64/include/asm/kvm_mte.h   | 66 ++
 arch/arm64/include/asm/sysreg.h|  3 +-
 arch/arm64/kernel/asm-offsets.c|  2 +
 arch/arm64/kvm/hyp/entry.S |  7 +++
 arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 21 +++
 arch/arm64/kvm/sys_regs.c  | 22 ++--
 8 files changed, 124 insertions(+), 6 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_mte.h

diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index 692c9049befa..d436831dd706 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -12,7 +12,8 @@
 #include 
 
 /* Hyp Configuration Register (HCR) bits */
-#define HCR_ATA(UL(1) << 56)
+#define HCR_ATA_SHIFT  56
+#define HCR_ATA(UL(1) << HCR_ATA_SHIFT)
 #define HCR_FWB(UL(1) << 46)
 #define HCR_API(UL(1) << 41)
 #define HCR_APK(UL(1) << 40)
diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index afaa5333f0e4..309e36cc1b42 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -208,6 +208,12 @@ enum vcpu_sysreg {
CNTP_CVAL_EL0,
CNTP_CTL_EL0,
 
+   /* Memory Tagging Extension registers */
+   RGSR_EL1,   /* Random Allocation Tag Seed Register */
+   GCR_EL1,/* Tag Control Register */
+   TFSR_EL1,   /* Tag Fault Status Register (EL1) */
+   TFSRE0_EL1, /* Tag Fault Status Register (EL0) */
+
/* 32bit specific registers. Keep them at the end of the range */
DACR32_EL2, /* Domain Access Control Register */
IFSR32_EL2, /* Instruction Fault Status Register */
diff --git a/arch/arm64/include/asm/kvm_mte.h b/arch/arm64/include/asm/kvm_mte.h
new file mode 100644
index ..88dd1199670b
--- /dev/null
+++ b/arch/arm64/include/asm/kvm_mte.h
@@ -0,0 +1,66 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2020-2021 ARM Ltd.
+ */
+#ifndef __ASM_KVM_MTE_H
+#define __ASM_KVM_MTE_H
+
+#ifdef __ASSEMBLY__
+
+#include 
+
+#ifdef CONFIG_ARM64_MTE
+
+.macro mte_switch_to_guest g_ctxt, h_ctxt, reg1
+alternative_if_not ARM64_MTE
+   b   .L__skip_switch\@
+alternative_else_nop_endif
+   mrs \reg1, hcr_el2
+   tbz \reg1, #(HCR_ATA_SHIFT), .L__skip_switch\@
+
+   mrs_s   \reg1, SYS_RGSR_EL1
+   str \reg1, [\h_ctxt, #CPU_RGSR_EL1]
+   mrs_s   \reg1, SYS_GCR_EL1
+   str \reg1, [\h_ctxt, #CPU_GCR_EL1]
+
+   ldr \reg1, [\g_ctxt, #CPU_RGSR_EL1]
+   msr_s   SYS_RGSR_EL1, \reg1
+   ldr \reg1, [\g_ctxt, #CPU_GCR_EL1]
+   msr_s   SYS_GCR_EL1, \reg1
+
+.L__skip_switch\@:
+.endm
+
+.macro mte_switch_to_hyp g_ctxt, h_ctxt, reg1
+alternative_if_not ARM64_MTE
+   b   .L__skip_switch\@
+alternative_else_nop_endif
+   mrs \reg1, hcr_el2
+   tbz \reg1, #(HCR_ATA_SHIFT), .L__skip_switch\@
+
+   mrs_s   \reg1, SYS_RGSR_EL1
+   str \reg1, [\g_ctxt, #CPU_RGSR_EL1]
+   mrs_s   \reg1, SYS_GCR_EL1
+   str \reg1, [\g_ctxt, #CPU_GCR_EL1]
+
+   ldr \reg1, [\h_ctxt, #CPU_RGSR_EL1]
+   msr_s   SYS_RGSR_EL1, \reg1
+   ldr \reg1, [\h_ctxt, #CPU_GCR_EL1]
+   msr_s   SYS_GCR_EL1, \reg1
+
+   isb
+
+.L__skip_switch\@:
+.endm
+
+#else /* CONFIG_ARM64_MTE */
+
+.macro mte_switch_to_guest g_ctxt, h_ctxt, reg1
+.endm
+
+.macro mte_switch_to_hyp g_ctxt, h_ctxt, reg1
+.endm
+
+#endif /* CONFIG_ARM64_MTE */
+#endif /* __ASSEMBLY__ */
+#endif /* __ASM_KVM_MTE_H */
diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index 65d15700a168..347ccac2341e 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -651,7 +651,8 @@
 
 #define INIT_SCTLR_EL2_MMU_ON  \
(SCTLR_ELx_M  | SCTLR_ELx_C | SCTLR_ELx_SA | SCTLR_ELx_I |  \
-SCTLR_ELx_IESB | SCTLR_ELx_WXN | ENDIAN_SET_EL2 | SCTLR_EL2_RES1)
+SCTLR_ELx_IESB | SCTLR_ELx_WXN | ENDIAN_SET_EL2 |  \
+SCTLR_ELx_ITFSB | SCTLR_EL2_RES1)
 
 #define INIT_SCTLR_EL2_MMU_OFF \
(SCTLR_EL2_RES1 | ENDIAN_SET_EL2)
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 0cb34ccb6e73..6f0044cb233e 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -111,6 +111,8 @@ int main(void)
   DEFINE(VCPU_WORKAROUND_FLAGS,offsetof(struct kvm_vcpu, 
arch.workaround_flags));
   DEFINE(VCPU_HCR_EL2, offsetof(struct kvm_v

Re: [PATCH v14 1/8] arm64: mte: Handle race when synchronising tags

2021-06-10 Thread Steven Price
On 09/06/2021 18:41, Catalin Marinas wrote:
> On Wed, Jun 09, 2021 at 12:19:31PM +0100, Marc Zyngier wrote:
>> On Wed, 09 Jun 2021 11:51:34 +0100,
>> Steven Price  wrote:
>>> On 09/06/2021 11:30, Marc Zyngier wrote:
>>>> On Mon, 07 Jun 2021 12:08:09 +0100,
>>>> Steven Price  wrote:
>>>>> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
>>>>> index 125a10e413e9..a3583a7fd400 100644
>>>>> --- a/arch/arm64/kernel/mte.c
>>>>> +++ b/arch/arm64/kernel/mte.c
>>>>> @@ -25,6 +25,7 @@
>>>>>  u64 gcr_kernel_excl __ro_after_init;
>>>>>  
>>>>>  static bool report_fault_once = true;
>>>>> +static DEFINE_SPINLOCK(tag_sync_lock);
>>>>>  
>>>>>  #ifdef CONFIG_KASAN_HW_TAGS
>>>>>  /* Whether the MTE asynchronous mode is enabled. */
>>>>> @@ -34,13 +35,22 @@ EXPORT_SYMBOL_GPL(mte_async_mode);
>>>>>  
>>>>>  static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool 
>>>>> check_swap)
>>>>>  {
>>>>> + unsigned long flags;
>>>>>   pte_t old_pte = READ_ONCE(*ptep);
>>>>>  
>>>>> + spin_lock_irqsave(&tag_sync_lock, flags);
>>>>
>>>> having thought a bit more about this after an offline discussion with
>>>> Catalin: why can't this lock be made per mm? We can't really share
>>>> tags across processes anyway, so this is limited to threads from the
>>>> same process.
>>>
>>> Currently there's nothing stopping processes sharing tags (mmap(...,
>>> PROT_MTE, MAP_SHARED)) - I agree making use of this is tricky and it
>>> would have been nice if this had just been prevented from the
>>> beginning.
>>
>> I don't think it should be prevented. I think it should be made clear
>> that it is unreliable and that it will result in tag corruption.
>>
>>> Given the above, clearly the lock can't be per mm and robust.
>>
>> I don't think we need to make it robust. The architecture actively
>> prevents sharing if the tags are also shared, just like we can't
>> really expect the VMM to share tags with the guest.
> 
> The architecture does not prevent MTE tag sharing (if that's what you
> meant). The tags are just an additional metadata stored in physical
> memory. It's not associated with the VA (as in the CHERI-style
> capability tags), only checked against the logical tag in a pointer. If
> the architecture prevented MAP_SHARED, we would have prevented PROT_MTE
> on them (well, it's not too late to do this ;)).
> 
> I went with Steven a few times through this exercise, though I tend to
> forget it quickly after. The use-case we had in mind when deciding to
> allow MTE on shared mappings is something like:
> 
>   int fd = memfd_create("jitted-code", MFD_ALLOW_SEALING);
>   ftruncate(fd, size);
> 
>   void* rw_mapping = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, 
> fd, 0);
>   void* rx_mapping = mmap(NULL, size, PROT_READ | PROT_EXEC, MAP_SHARED, 
> fd, 0);
> 
>   close(fd);
> 
> The above is within the same mm but you might as well have a fork and
> the rx mapping in a child process. Any of the mappings may have
> PROT_MTE from the start or set later with mprotect(), though it's
> probably the rw one only.
> 
> The race we have is in set_pte_at() and the equivalent KVM setting for
> stage 2 (in any combination of these). To detect a page that was not
> previously tagged (first time mapped, remapped with new attributes), we
> have a test like this via set_pte_at():
> 
>   if (!test_bit(PG_mte_tagged, &page->flags)) {
>   mte_clear_page_tags(page);
>   set_bit(PG_mte_tagged, &page->flags);
>   }
> 
> Calling the above concurrently on a page may cause some tag loss in the
> absence of any locking. Note that it only matters if one of the mappings
> is writable (to write tags), so this excludes CoW (fork, KSM).
> 
> For stage 1, I think almost all cases that end up in set_pte_at() also
> have the page->lock held and the ptl. The exception is mprotect() which
> doesn't bother to look up each page and lock it, it just takes the ptl
> lock. Within the same mm, mprotect() also takes the mmap_lock as a
> writer, so it's all fine. The race is between two mms, one doing an
> mprotect(PROT_MTE) with the page already mapped in its address space and
> the other taking a fault and mapping the page via set_pte_at(). Two
> faults in two mms again are fine because of the page lock.
> 
> For stage 2, the race betw

Re: [PATCH v14 1/8] arm64: mte: Handle race when synchronising tags

2021-06-09 Thread Steven Price
On 09/06/2021 11:30, Marc Zyngier wrote:
> On Mon, 07 Jun 2021 12:08:09 +0100,
> Steven Price  wrote:
>>
>> mte_sync_tags() used test_and_set_bit() to set the PG_mte_tagged flag
>> before restoring/zeroing the MTE tags. However if another thread were to
>> race and attempt to sync the tags on the same page before the first
>> thread had completed restoring/zeroing then it would see the flag is
>> already set and continue without waiting. This would potentially expose
>> the previous contents of the tags to user space, and cause any updates
>> that user space makes before the restoring/zeroing has completed to
>> potentially be lost.
>>
>> Since this code is run from atomic contexts we can't just lock the page
>> during the process. Instead implement a new (global) spinlock to protect
>> the mte_sync_page_tags() function.
>>
>> Fixes: 34bfeea4a9e9 ("arm64: mte: Clear the tags when a page is mapped in 
>> user-space with PROT_MTE")
>> Reviewed-by: Catalin Marinas 
>> Signed-off-by: Steven Price 
>> ---
>>  arch/arm64/kernel/mte.c | 20 +---
>>  1 file changed, 17 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
>> index 125a10e413e9..a3583a7fd400 100644
>> --- a/arch/arm64/kernel/mte.c
>> +++ b/arch/arm64/kernel/mte.c
>> @@ -25,6 +25,7 @@
>>  u64 gcr_kernel_excl __ro_after_init;
>>  
>>  static bool report_fault_once = true;
>> +static DEFINE_SPINLOCK(tag_sync_lock);
>>  
>>  #ifdef CONFIG_KASAN_HW_TAGS
>>  /* Whether the MTE asynchronous mode is enabled. */
>> @@ -34,13 +35,22 @@ EXPORT_SYMBOL_GPL(mte_async_mode);
>>  
>>  static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool 
>> check_swap)
>>  {
>> +unsigned long flags;
>>  pte_t old_pte = READ_ONCE(*ptep);
>>  
>> +spin_lock_irqsave(&tag_sync_lock, flags);
> 
> having thought a bit more about this after an offline discussion with
> Catalin: why can't this lock be made per mm? We can't really share
> tags across processes anyway, so this is limited to threads from the
> same process.

Currently there's nothing stopping processes sharing tags (mmap(...,
PROT_MTE, MAP_SHARED)) - I agree making use of this is tricky and it
would have been nice if this had just been prevented from the beginning.

Given the above, clearly the lock can't be per mm and robust.

> I'd also like it to be documented that page sharing can only reliably
> work with tagging if only one of the mappings is using tags.

I'm not entirely clear whether you mean "can only reliably work" to be
"is practically impossible to coordinate tag values", or whether you are
proposing to (purposefully) introduce the race with a per-mm lock? (and
document it).

I guess we could have a per-mm lock and handle the race if user space
screws up with the outcome being lost tags (double clear).

But it feels to me like it could come back to bite in the future since
VM_SHARED|VM_MTE will almost always work and I fear someone will start
using it since it's permitted by the kernel.

Steve



Re: [PATCH v14 8/8] KVM: arm64: Document MTE capability and ioctl

2021-06-09 Thread Steven Price
On 07/06/2021 18:32, Catalin Marinas wrote:
> On Mon, Jun 07, 2021 at 12:08:16PM +0100, Steven Price wrote:
>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>> index 22d077562149..fc6f0cbc30b3 100644
>> --- a/Documentation/virt/kvm/api.rst
>> +++ b/Documentation/virt/kvm/api.rst
>> @@ -5034,6 +5034,42 @@ see KVM_XEN_VCPU_SET_ATTR above.
>>  The KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST type may not be used
>>  with the KVM_XEN_VCPU_GET_ATTR ioctl.
>>  
>> +4.130 KVM_ARM_MTE_COPY_TAGS
>> +---
>> +
>> +:Capability: KVM_CAP_ARM_MTE
>> +:Architectures: arm64
>> +:Type: vm ioctl
>> +:Parameters: struct kvm_arm_copy_mte_tags
>> +:Returns: number of bytes copied, < 0 on error
> 
> I guess you can be a bit more specific here, -EINVAL on incorrect
> arguments, -EFAULT if the guest memory cannot be accessed.

Sure. Note that -EFAULT can also be returned if the VMM's memory cannot
be accessed (the other end of the copy).

>> +
>> +::
>> +
>> +  struct kvm_arm_copy_mte_tags {
>> +__u64 guest_ipa;
>> +__u64 length;
>> +void __user *addr;
>> +__u64 flags;
>> +__u64 reserved[2];
>> +  };
>> +
>> +Copies Memory Tagging Extension (MTE) tags to/from guest tag memory. The
>> +``guest_ipa`` and ``length`` fields must be ``PAGE_SIZE`` aligned. The 
>> ``addr``
>> +fieldmust point to a buffer which the tags will be copied to or from.
> 
> s/fieldmust/field must/

Thanks - Vim's spell checker missed that, apparently because its syntax
highlighter got confused.

>> +
>> +``flags`` specifies the direction of copy, either ``KVM_ARM_TAGS_TO_GUEST`` 
>> or
>> +``KVM_ARM_TAGS_FROM_GUEST``.
>> +
>> +The size of the buffer to store the tags is ``(length / 16)`` bytes
>> +(granules in MTE are 16 bytes long). Each byte contains a single tag
>> +value. This matches the format of ``PTRACE_PEEKMTETAGS`` and
>> +``PTRACE_POKEMTETAGS``.
> 
> One difference I think with ptrace() is that iov_len (length here) is
> the actual buffer size. But for kvm I think this works better since
> length is tied to the guest_ipa.

What I intended to say is that the storage in memory matches ptrace (one
byte per tag). In the kernel (e.g. for swap) we store it more compactly
(two tags per byte). As you say I think having 'length' match
'guest_ipa' is sensible rather than deducing it from the buffer size.
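
To put numbers on it (assuming 4K pages): one page is 4096 / 16 = 256 MTE
granules, so the ioctl (and ptrace) format needs 256 bytes of buffer per
page, one byte per 4-bit tag, while the packed in-kernel format used for
swap only needs 4096 / 32 = 128 bytes, two tags per byte.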

>> +
>> +If an error occurs before any data is copied then a negative error code is
>> +returned. If some tags have been copied before an error occurs then the 
>> number
>> +of bytes successfully copied is returned. If the call completes successfully
>> +then ``length`` is returned.
>> +
>>  5. The kvm_run structure
>>  
>>  
>> @@ -6362,6 +6398,27 @@ default.
>>  
>>  See Documentation/x86/sgx/2.Kernel-internals.rst for more details.
>>  
>> +7.26 KVM_CAP_ARM_MTE
>> +
>> +
>> +:Architectures: arm64
>> +:Parameters: none
>> +
>> +This capability indicates that KVM (and the hardware) supports exposing the
>> +Memory Tagging Extensions (MTE) to the guest. It must also be enabled by the
>> +VMM before creating any VCPUs to allow the guest access. Note that MTE is 
>> only
>> +available to a guest running in AArch64 mode and enabling this capability 
>> will
>> +cause attempts to create AArch32 VCPUs to fail.
>> +
>> +When enabled the guest is able to access tags associated with any memory 
>> given
>> +to the guest. KVM will ensure that the pages are flagged ``PG_mte_tagged`` 
>> so
>> +that the tags are maintained during swap or hibernation of the host; however
> 
> I'd drop PG_mte_tagged here, that's just how the implementation handles
> it, not necessary for describing the API. You can just say "KVM will
> ensure that the tags are maintained during swap or hibernation of the
> host"

Good point - will update with your wording.

>> +the VMM needs to manually save/restore the tags as appropriate if the VM is
>> +migrated.
>> +
>> +When enabled the VMM may make use of the ``KVM_ARM_MTE_COPY_TAGS`` ioctl to
>> +perform a bulk copy of tags to/from the guest.
>> +
>>  8. Other capabilities.
>>  ==
>>  
>> -- 
>> 2.20.1
> 
> Otherwise, feel free to add:
> 
> Reviewed-by: Catalin Marinas 
> 

Thanks!

Steve



Re: [PATCH v14 2/8] arm64: Handle MTE tags zeroing in __alloc_zeroed_user_highpage()

2021-06-09 Thread Steven Price
On 07/06/2021 18:07, Catalin Marinas wrote:
> On Mon, Jun 07, 2021 at 12:08:10PM +0100, Steven Price wrote:
>> From: Catalin Marinas 
>>
>> Currently, on an anonymous page fault, the kernel allocates a zeroed
>> page and maps it in user space. If the mapping is tagged (PROT_MTE),
>> set_pte_at() additionally clears the tags under a spinlock to avoid a
>> race on the page->flags. In order to optimise the lock, clear the page
>> tags on allocation in __alloc_zeroed_user_highpage() if the vma flags
>> have VM_MTE set.
>>
>> Signed-off-by: Catalin Marinas 
>> Signed-off-by: Steven Price 
> 
> I think you can drop this patch now that Peter's series has been queued
> via the arm64 tree:
> 
> https://lore.kernel.org/r/20210602235230.3928842-4-...@google.com
> 

Thanks for the heads up - I hadn't seen that land. I'll drop this patch
from the next posting.

Steve



[PATCH v14 3/8] arm64: mte: Sync tags for pages where PTE is untagged

2021-06-07 Thread Steven Price
A KVM guest could store tags in a page even if the VMM hasn't mapped
the page with PROT_MTE. So when restoring pages from swap we will
need to check to see if there are any saved tags even if !pte_tagged().

However don't check pages for which pte_access_permitted() returns false
as these will not have been swapped out.

Reviewed-by: Catalin Marinas 
Signed-off-by: Steven Price 
---
 arch/arm64/include/asm/mte.h |  4 ++--
 arch/arm64/include/asm/pgtable.h | 22 +++---
 arch/arm64/kernel/mte.c  | 17 +
 3 files changed, 34 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
index bc88a1ced0d7..347ef38a35f7 100644
--- a/arch/arm64/include/asm/mte.h
+++ b/arch/arm64/include/asm/mte.h
@@ -37,7 +37,7 @@ void mte_free_tag_storage(char *storage);
 /* track which pages have valid allocation tags */
 #define PG_mte_tagged  PG_arch_2
 
-void mte_sync_tags(pte_t *ptep, pte_t pte);
+void mte_sync_tags(pte_t old_pte, pte_t pte);
 void mte_copy_page_tags(void *kto, const void *kfrom);
 void mte_thread_init_user(void);
 void mte_thread_switch(struct task_struct *next);
@@ -53,7 +53,7 @@ int mte_ptrace_copy_tags(struct task_struct *child, long 
request,
 /* unused if !CONFIG_ARM64_MTE, silence the compiler */
 #define PG_mte_tagged  0
 
-static inline void mte_sync_tags(pte_t *ptep, pte_t pte)
+static inline void mte_sync_tags(pte_t old_pte, pte_t pte)
 {
 }
 static inline void mte_copy_page_tags(void *kto, const void *kfrom)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 0b10204e72fc..db5402168841 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -314,9 +314,25 @@ static inline void set_pte_at(struct mm_struct *mm, 
unsigned long addr,
if (pte_present(pte) && pte_user_exec(pte) && !pte_special(pte))
__sync_icache_dcache(pte);
 
-   if (system_supports_mte() &&
-   pte_present(pte) && pte_tagged(pte) && !pte_special(pte))
-   mte_sync_tags(ptep, pte);
+   /*
+* If the PTE would provide user space access to the tags associated
+* with it then ensure that the MTE tags are synchronised.  Although
+* pte_access_permitted() returns false for exec only mappings, they
+* don't expose tags (instruction fetches don't check tags).
+*/
+   if (system_supports_mte() && pte_access_permitted(pte, false) &&
+   !pte_special(pte)) {
+   pte_t old_pte = READ_ONCE(*ptep);
+   /*
+* We only need to synchronise if the new PTE has tags enabled
+* or if swapping in (in which case another mapping may have
+* set tags in the past even if this PTE isn't tagged).
+* (!pte_none() && !pte_present()) is an open coded version of
+* is_swap_pte()
+*/
+   if (pte_tagged(pte) || (!pte_none(old_pte) && 
!pte_present(old_pte)))
+   mte_sync_tags(old_pte, pte);
+   }
 
__check_racy_pte_update(mm, ptep, pte);
 
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index a3583a7fd400..ae0a3c68fece 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -33,10 +33,10 @@ DEFINE_STATIC_KEY_FALSE(mte_async_mode);
 EXPORT_SYMBOL_GPL(mte_async_mode);
 #endif
 
-static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
+static void mte_sync_page_tags(struct page *page, pte_t old_pte,
+  bool check_swap, bool pte_is_tagged)
 {
unsigned long flags;
-   pte_t old_pte = READ_ONCE(*ptep);
 
spin_lock_irqsave(&tag_sync_lock, flags);
 
@@ -53,6 +53,9 @@ static void mte_sync_page_tags(struct page *page, pte_t 
*ptep, bool check_swap)
}
}
 
+   if (!pte_is_tagged)
+   goto out;
+
page_kasan_tag_reset(page);
/*
 * We need smp_wmb() in between setting the flags and clearing the
@@ -69,16 +72,22 @@ static void mte_sync_page_tags(struct page *page, pte_t 
*ptep, bool check_swap)
spin_unlock_irqrestore(&tag_sync_lock, flags);
 }
 
-void mte_sync_tags(pte_t *ptep, pte_t pte)
+void mte_sync_tags(pte_t old_pte, pte_t pte)
 {
struct page *page = pte_page(pte);
long i, nr_pages = compound_nr(page);
bool check_swap = nr_pages == 1;
+   bool pte_is_tagged = pte_tagged(pte);
+
+   /* Early out if there's nothing to do */
+   if (!check_swap && !pte_is_tagged)
+   return;
 
/* if PG_mte_tagged is set, tags have already been initialised */
for (i = 0; i < nr_pages; i++, page++) {
if (!test_bit(PG_mte_tagged, &page->flags))
-   mte_sync_page_tags(page, ptep, check_swap);
+   mte_sync_page_tags(page, old_pte, check_swap, pte_is_tagged);

[PATCH v14 1/8] arm64: mte: Handle race when synchronising tags

2021-06-07 Thread Steven Price
mte_sync_tags() used test_and_set_bit() to set the PG_mte_tagged flag
before restoring/zeroing the MTE tags. However if another thread were to
race and attempt to sync the tags on the same page before the first
thread had completed restoring/zeroing then it would see the flag is
already set and continue without waiting. This would potentially expose
the previous contents of the tags to user space, and cause any updates
that user space makes before the restoring/zeroing has completed to
potentially be lost.

Since this code is run from atomic contexts we can't just lock the page
during the process. Instead implement a new (global) spinlock to protect
the mte_sync_page_tags() function.

Fixes: 34bfeea4a9e9 ("arm64: mte: Clear the tags when a page is mapped in 
user-space with PROT_MTE")
Reviewed-by: Catalin Marinas 
Signed-off-by: Steven Price 
---
 arch/arm64/kernel/mte.c | 20 +---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index 125a10e413e9..a3583a7fd400 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -25,6 +25,7 @@
 u64 gcr_kernel_excl __ro_after_init;
 
 static bool report_fault_once = true;
+static DEFINE_SPINLOCK(tag_sync_lock);
 
 #ifdef CONFIG_KASAN_HW_TAGS
 /* Whether the MTE asynchronous mode is enabled. */
@@ -34,13 +35,22 @@ EXPORT_SYMBOL_GPL(mte_async_mode);
 
 static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
 {
+   unsigned long flags;
pte_t old_pte = READ_ONCE(*ptep);
 
+   spin_lock_irqsave(&tag_sync_lock, flags);
+
+   /* Recheck with the lock held */
+   if (test_bit(PG_mte_tagged, &page->flags))
+   goto out;
+
if (check_swap && is_swap_pte(old_pte)) {
swp_entry_t entry = pte_to_swp_entry(old_pte);
 
-   if (!non_swap_entry(entry) && mte_restore_tags(entry, page))
-   return;
+   if (!non_swap_entry(entry) && mte_restore_tags(entry, page)) {
+   set_bit(PG_mte_tagged, &page->flags);
+   goto out;
+   }
}
 
page_kasan_tag_reset(page);
@@ -53,6 +63,10 @@ static void mte_sync_page_tags(struct page *page, pte_t 
*ptep, bool check_swap)
 */
smp_wmb();
mte_clear_page_tags(page_address(page));
+   set_bit(PG_mte_tagged, &page->flags);
+
+out:
+   spin_unlock_irqrestore(&tag_sync_lock, flags);
 }
 
 void mte_sync_tags(pte_t *ptep, pte_t pte)
@@ -63,7 +77,7 @@ void mte_sync_tags(pte_t *ptep, pte_t pte)
 
/* if PG_mte_tagged is set, tags have already been initialised */
for (i = 0; i < nr_pages; i++, page++) {
-   if (!test_and_set_bit(PG_mte_tagged, &page->flags))
+   if (!test_bit(PG_mte_tagged, &page->flags))
mte_sync_page_tags(page, ptep, check_swap);
}
 }
-- 
2.20.1




[PATCH v14 0/8] MTE support for KVM guest

2021-06-07 Thread Steven Price
This series adds support for using the Arm Memory Tagging Extensions
(MTE) in a KVM guest.

Changes since v13[1]:

 * Add Reviewed-by tags from Catalin - thanks!

 * Introduce a new function mte_prepare_page_tags() for handling the
   initialisation of pages ready for a KVM guest. This takes the big
   tag_sync_lock removing any race with the VMM (or another guest)
   around clearing the tags and setting PG_mte_tagged.

 * The ioctl to fetch/store tags now returns the number of bytes
   processed so userspace can tell how far a partial fetch/store got.

 * Some minor refactoring to tidy up the code thanks to pointers from
   Catalin.

[1] https://lore.kernel.org/r/20210524104513.13258-1-steven.price%40arm.com

Catalin Marinas (1):
  arm64: Handle MTE tags zeroing in __alloc_zeroed_user_highpage()

Steven Price (7):
  arm64: mte: Handle race when synchronising tags
  arm64: mte: Sync tags for pages where PTE is untagged
  KVM: arm64: Introduce MTE VM feature
  KVM: arm64: Save/restore MTE registers
  KVM: arm64: Expose KVM_ARM_CAP_MTE
  KVM: arm64: ioctl to fetch/store tags in a guest
  KVM: arm64: Document MTE capability and ioctl

 Documentation/virt/kvm/api.rst | 57 +++
 arch/arm64/include/asm/kvm_arm.h   |  3 +-
 arch/arm64/include/asm/kvm_emulate.h   |  3 +
 arch/arm64/include/asm/kvm_host.h  | 12 
 arch/arm64/include/asm/kvm_mte.h   | 66 +
 arch/arm64/include/asm/mte-def.h   |  1 +
 arch/arm64/include/asm/mte.h   |  8 ++-
 arch/arm64/include/asm/page.h  |  6 +-
 arch/arm64/include/asm/pgtable.h   | 22 +-
 arch/arm64/include/asm/sysreg.h|  3 +-
 arch/arm64/include/uapi/asm/kvm.h  | 11 +++
 arch/arm64/kernel/asm-offsets.c|  2 +
 arch/arm64/kernel/mte.c| 54 --
 arch/arm64/kvm/arm.c   | 16 +
 arch/arm64/kvm/guest.c | 82 ++
 arch/arm64/kvm/hyp/entry.S |  7 ++
 arch/arm64/kvm/hyp/exception.c |  3 +-
 arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 21 ++
 arch/arm64/kvm/mmu.c   | 42 ++-
 arch/arm64/kvm/reset.c |  3 +-
 arch/arm64/kvm/sys_regs.c  | 32 +++--
 arch/arm64/mm/fault.c  | 21 ++
 include/uapi/linux/kvm.h   |  2 +
 23 files changed, 454 insertions(+), 23 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_mte.h

-- 
2.20.1




[PATCH v14 2/8] arm64: Handle MTE tags zeroing in __alloc_zeroed_user_highpage()

2021-06-07 Thread Steven Price
From: Catalin Marinas 

Currently, on an anonymous page fault, the kernel allocates a zeroed
page and maps it in user space. If the mapping is tagged (PROT_MTE),
set_pte_at() additionally clears the tags under a spinlock to avoid a
race on the page->flags. In order to optimise the lock, clear the page
tags on allocation in __alloc_zeroed_user_highpage() if the vma flags
have VM_MTE set.

Signed-off-by: Catalin Marinas 
Signed-off-by: Steven Price 
---
 arch/arm64/include/asm/page.h |  6 --
 arch/arm64/mm/fault.c | 21 +
 2 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
index 012cffc574e8..97853570d0f1 100644
--- a/arch/arm64/include/asm/page.h
+++ b/arch/arm64/include/asm/page.h
@@ -13,6 +13,7 @@
 #ifndef __ASSEMBLY__
 
 #include  /* for READ_IMPLIES_EXEC */
+#include 
 #include 
 
 struct page;
@@ -28,8 +29,9 @@ void copy_user_highpage(struct page *to, struct page *from,
 void copy_highpage(struct page *to, struct page *from);
 #define __HAVE_ARCH_COPY_HIGHPAGE
 
-#define __alloc_zeroed_user_highpage(movableflags, vma, vaddr) \
-   alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vaddr)
+struct page *__alloc_zeroed_user_highpage(gfp_t movableflags,
+ struct vm_area_struct *vma,
+ unsigned long vaddr);
 #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
 
 #define clear_user_page(page, vaddr, pg)   clear_page(page)
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 871c82ab0a30..5a03428e97f3 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -921,3 +921,24 @@ void do_debug_exception(unsigned long addr_if_watchpoint, 
unsigned int esr,
debug_exception_exit(regs);
 }
 NOKPROBE_SYMBOL(do_debug_exception);
+
+/*
+ * Used during anonymous page fault handling.
+ */
+struct page *__alloc_zeroed_user_highpage(gfp_t movableflags,
+ struct vm_area_struct *vma,
+ unsigned long vaddr)
+{
+   struct page *page;
+   bool tagged = system_supports_mte() && (vma->vm_flags & VM_MTE);
+
+   page = alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma,
+ vaddr);
+   if (tagged && page) {
+   mte_clear_page_tags(page_address(page));
+   page_kasan_tag_reset(page);
+   set_bit(PG_mte_tagged, &page->flags);
+   }
+
+   return page;
+}
-- 
2.20.1




[PATCH v14 7/8] KVM: arm64: ioctl to fetch/store tags in a guest

2021-06-07 Thread Steven Price
The VMM may not wish to have its own mapping of guest memory mapped
with PROT_MTE because this causes problems if the VMM has tag checking
enabled (the guest controls the tags in physical RAM and it's unlikely
the tags are correct for the VMM).

Instead add a new ioctl which allows the VMM to easily read/write the
tags from guest memory, allowing the VMM's mapping to be non-PROT_MTE
while the VMM can still read/write the tags for the purpose of
migration.

Signed-off-by: Steven Price 
---
 arch/arm64/include/asm/kvm_host.h |  3 ++
 arch/arm64/include/asm/mte-def.h  |  1 +
 arch/arm64/include/uapi/asm/kvm.h | 11 +
 arch/arm64/kvm/arm.c  |  7 +++
 arch/arm64/kvm/guest.c| 82 +++
 include/uapi/linux/kvm.h  |  1 +
 6 files changed, 105 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 309e36cc1b42..6a2ac4636d42 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -729,6 +729,9 @@ int kvm_arm_vcpu_arch_get_attr(struct kvm_vcpu *vcpu,
 int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
   struct kvm_device_attr *attr);
 
+long kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
+   struct kvm_arm_copy_mte_tags *copy_tags);
+
 /* Guest/host FPSIMD coordination helpers */
 int kvm_arch_vcpu_run_map_fp(struct kvm_vcpu *vcpu);
 void kvm_arch_vcpu_load_fp(struct kvm_vcpu *vcpu);
diff --git a/arch/arm64/include/asm/mte-def.h b/arch/arm64/include/asm/mte-def.h
index cf241b0f0a42..626d359b396e 100644
--- a/arch/arm64/include/asm/mte-def.h
+++ b/arch/arm64/include/asm/mte-def.h
@@ -7,6 +7,7 @@
 
 #define MTE_GRANULE_SIZE   UL(16)
 #define MTE_GRANULE_MASK   (~(MTE_GRANULE_SIZE - 1))
+#define MTE_GRANULES_PER_PAGE  (PAGE_SIZE / MTE_GRANULE_SIZE)
 #define MTE_TAG_SHIFT  56
 #define MTE_TAG_SIZE   4
 #define MTE_TAG_MASK   GENMASK((MTE_TAG_SHIFT + (MTE_TAG_SIZE - 1)), 
MTE_TAG_SHIFT)
diff --git a/arch/arm64/include/uapi/asm/kvm.h 
b/arch/arm64/include/uapi/asm/kvm.h
index 24223adae150..b3edde68bc3e 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -184,6 +184,17 @@ struct kvm_vcpu_events {
__u32 reserved[12];
 };
 
+struct kvm_arm_copy_mte_tags {
+   __u64 guest_ipa;
+   __u64 length;
+   void __user *addr;
+   __u64 flags;
+   __u64 reserved[2];
+};
+
+#define KVM_ARM_TAGS_TO_GUEST  0
+#define KVM_ARM_TAGS_FROM_GUEST1
+
 /* If you need to interpret the index values, here is the key: */
 #define KVM_REG_ARM_COPROC_MASK0x0FFF
 #define KVM_REG_ARM_COPROC_SHIFT   16
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index e89a5e275e25..baa33359e477 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1345,6 +1345,13 @@ long kvm_arch_vm_ioctl(struct file *filp,
 
return 0;
}
+   case KVM_ARM_MTE_COPY_TAGS: {
+   struct kvm_arm_copy_mte_tags copy_tags;
+
+   if (copy_from_user(&copy_tags, argp, sizeof(copy_tags)))
+   return -EFAULT;
+   return kvm_vm_ioctl_mte_copy_tags(kvm, &copy_tags);
+   }
default:
return -EINVAL;
}
diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
index 5cb4a1cd5603..4ddb20017b2f 100644
--- a/arch/arm64/kvm/guest.c
+++ b/arch/arm64/kvm/guest.c
@@ -995,3 +995,85 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
 
return ret;
 }
+
+long kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
+   struct kvm_arm_copy_mte_tags *copy_tags)
+{
+   gpa_t guest_ipa = copy_tags->guest_ipa;
+   size_t length = copy_tags->length;
+   void __user *tags = copy_tags->addr;
+   gpa_t gfn;
+   bool write = !(copy_tags->flags & KVM_ARM_TAGS_FROM_GUEST);
+   int ret = 0;
+
+   if (!kvm_has_mte(kvm))
+   return -EINVAL;
+
+   if (copy_tags->reserved[0] || copy_tags->reserved[1])
+   return -EINVAL;
+
+   if (copy_tags->flags & ~KVM_ARM_TAGS_FROM_GUEST)
+   return -EINVAL;
+
+   if (length & ~PAGE_MASK || guest_ipa & ~PAGE_MASK)
+   return -EINVAL;
+
+   gfn = gpa_to_gfn(guest_ipa);
+
+   mutex_lock(&kvm->slots_lock);
+
+   while (length > 0) {
+   kvm_pfn_t pfn = gfn_to_pfn_prot(kvm, gfn, write, NULL);
+   void *maddr;
+   unsigned long num_tags;
+   struct page *page;
+
+   if (is_error_noslot_pfn(pfn)) {
+   ret = -EFAULT;
+   goto out;
+   }
+
+   page = pfn_to_online_page(pfn);
+   if (!page) {
+   /* Reject ZONE_DEVICE memory */
+   ret = -EFAULT;
+   goto out;

[PATCH v14 4/8] KVM: arm64: Introduce MTE VM feature

2021-06-07 Thread Steven Price
Add a new VM feature 'KVM_ARM_CAP_MTE' which enables memory tagging
for a VM. This will expose the feature to the guest and automatically
tag memory pages touched by the VM as PG_mte_tagged (and clear the tag
storage) to ensure that the guest cannot see stale tags, and so that
the tags are correctly saved/restored across swap.

Actually exposing the new capability to user space happens in a later
patch.

Reviewed-by: Catalin Marinas 
Signed-off-by: Steven Price 
---
 arch/arm64/include/asm/kvm_emulate.h |  3 ++
 arch/arm64/include/asm/kvm_host.h|  3 ++
 arch/arm64/include/asm/mte.h |  4 +++
 arch/arm64/kernel/mte.c  | 17 +++
 arch/arm64/kvm/hyp/exception.c   |  3 +-
 arch/arm64/kvm/mmu.c | 42 +++-
 arch/arm64/kvm/sys_regs.c|  7 +
 include/uapi/linux/kvm.h |  1 +
 8 files changed, 78 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_emulate.h 
b/arch/arm64/include/asm/kvm_emulate.h
index f612c090f2e4..6bf776c2399c 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -84,6 +84,9 @@ static inline void vcpu_reset_hcr(struct kvm_vcpu *vcpu)
if (cpus_have_const_cap(ARM64_MISMATCHED_CACHE_TYPE) ||
vcpu_el1_is_32bit(vcpu))
vcpu->arch.hcr_el2 |= HCR_TID2;
+
+   if (kvm_has_mte(vcpu->kvm))
+   vcpu->arch.hcr_el2 |= HCR_ATA;
 }
 
 static inline unsigned long *vcpu_hcr(struct kvm_vcpu *vcpu)
diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 7cd7d5c8c4bc..afaa5333f0e4 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -132,6 +132,8 @@ struct kvm_arch {
 
u8 pfr0_csv2;
u8 pfr0_csv3;
+   /* Memory Tagging Extension enabled for the guest */
+   bool mte_enabled;
 };
 
 struct kvm_vcpu_fault_info {
@@ -769,6 +771,7 @@ bool kvm_arm_vcpu_is_finalized(struct kvm_vcpu *vcpu);
 #define kvm_arm_vcpu_sve_finalized(vcpu) \
((vcpu)->arch.flags & KVM_ARM64_VCPU_SVE_FINALIZED)
 
+#define kvm_has_mte(kvm) (system_supports_mte() && (kvm)->arch.mte_enabled)
 #define kvm_vcpu_has_pmu(vcpu) \
(test_bit(KVM_ARM_VCPU_PMU_V3, (vcpu)->arch.features))
 
diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
index 347ef38a35f7..be1de541a11c 100644
--- a/arch/arm64/include/asm/mte.h
+++ b/arch/arm64/include/asm/mte.h
@@ -37,6 +37,7 @@ void mte_free_tag_storage(char *storage);
 /* track which pages have valid allocation tags */
 #define PG_mte_tagged  PG_arch_2
 
+void mte_prepare_page_tags(struct page *page);
 void mte_sync_tags(pte_t old_pte, pte_t pte);
 void mte_copy_page_tags(void *kto, const void *kfrom);
 void mte_thread_init_user(void);
@@ -53,6 +54,9 @@ int mte_ptrace_copy_tags(struct task_struct *child, long 
request,
 /* unused if !CONFIG_ARM64_MTE, silence the compiler */
 #define PG_mte_tagged  0
 
+static inline void mte_prepare_page_tags(struct page *page)
+{
+}
 static inline void mte_sync_tags(pte_t old_pte, pte_t pte)
 {
 }
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index ae0a3c68fece..b120f82a2258 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -72,6 +72,23 @@ static void mte_sync_page_tags(struct page *page, pte_t 
old_pte,
spin_unlock_irqrestore(_sync_lock, flags);
 }
 
+void mte_prepare_page_tags(struct page *page)
+{
+   unsigned long flags;
+
+   spin_lock_irqsave(&tag_sync_lock, flags);
+
+   /* Recheck with the lock held */
+   if (test_bit(PG_mte_tagged, &page->flags))
+   goto out;
+
+   mte_clear_page_tags(page_address(page));
+   set_bit(PG_mte_tagged, &page->flags);
+
+out:
+   spin_unlock_irqrestore(&tag_sync_lock, flags);
+}
+
 void mte_sync_tags(pte_t old_pte, pte_t pte)
 {
struct page *page = pte_page(pte);
diff --git a/arch/arm64/kvm/hyp/exception.c b/arch/arm64/kvm/hyp/exception.c
index 73629094f903..56426565600c 100644
--- a/arch/arm64/kvm/hyp/exception.c
+++ b/arch/arm64/kvm/hyp/exception.c
@@ -112,7 +112,8 @@ static void enter_exception64(struct kvm_vcpu *vcpu, 
unsigned long target_mode,
new |= (old & PSR_C_BIT);
new |= (old & PSR_V_BIT);
 
-   // TODO: TCO (if/when ARMv8.5-MemTag is exposed to guests)
+   if (kvm_has_mte(vcpu->kvm))
+   new |= PSR_TCO_BIT;
 
new |= (old & PSR_DIT_BIT);
 
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index c5d1f3c87dbd..ed7c624e7362 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -822,6 +822,36 @@ transparent_hugepage_adjust(struct kvm_memory_slot 
*memslot,
return PAGE_SIZE;
 }
 
+static int sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn,
+unsigned long size)
+{
+   unsigned long i, nr_pages = size >> PAGE_SHIFT;
+   struct page *page;

[PATCH v14 8/8] KVM: arm64: Document MTE capability and ioctl

2021-06-07 Thread Steven Price
A new capability (KVM_CAP_ARM_MTE) identifies that the kernel supports
granting a guest access to the tags, and provides a mechanism for the
VMM to enable it.

A new ioctl (KVM_ARM_MTE_COPY_TAGS) provides a simple way for a VMM to
access the tags of a guest without having to maintain a PROT_MTE mapping
in userspace. The above capability gates access to the ioctl.

Signed-off-by: Steven Price 
---
 Documentation/virt/kvm/api.rst | 57 ++
 1 file changed, 57 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 22d077562149..fc6f0cbc30b3 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -5034,6 +5034,42 @@ see KVM_XEN_VCPU_SET_ATTR above.
 The KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST type may not be used
 with the KVM_XEN_VCPU_GET_ATTR ioctl.
 
+4.130 KVM_ARM_MTE_COPY_TAGS
+---
+
+:Capability: KVM_CAP_ARM_MTE
+:Architectures: arm64
+:Type: vm ioctl
+:Parameters: struct kvm_arm_copy_mte_tags
+:Returns: number of bytes copied, < 0 on error
+
+::
+
+  struct kvm_arm_copy_mte_tags {
+   __u64 guest_ipa;
+   __u64 length;
+   void __user *addr;
+   __u64 flags;
+   __u64 reserved[2];
+  };
+
+Copies Memory Tagging Extension (MTE) tags to/from guest tag memory. The
+``guest_ipa`` and ``length`` fields must be ``PAGE_SIZE`` aligned. The ``addr``
+fieldmust point to a buffer which the tags will be copied to or from.
+
+``flags`` specifies the direction of copy, either ``KVM_ARM_TAGS_TO_GUEST`` or
+``KVM_ARM_TAGS_FROM_GUEST``.
+
+The size of the buffer to store the tags is ``(length / 16)`` bytes
+(granules in MTE are 16 bytes long). Each byte contains a single tag
+value. This matches the format of ``PTRACE_PEEKMTETAGS`` and
+``PTRACE_POKEMTETAGS``.
+
+If an error occurs before any data is copied then a negative error code is
+returned. If some tags have been copied before an error occurs then the number
+of bytes successfully copied is returned. If the call completes successfully
+then ``length`` is returned.
+
 5. The kvm_run structure
 
 
@@ -6362,6 +6398,27 @@ default.
 
 See Documentation/x86/sgx/2.Kernel-internals.rst for more details.
 
+7.26 KVM_CAP_ARM_MTE
+
+
+:Architectures: arm64
+:Parameters: none
+
+This capability indicates that KVM (and the hardware) supports exposing the
+Memory Tagging Extensions (MTE) to the guest. It must also be enabled by the
+VMM before creating any VCPUs to allow the guest access. Note that MTE is only
+available to a guest running in AArch64 mode and enabling this capability will
+cause attempts to create AArch32 VCPUs to fail.
+
+When enabled the guest is able to access tags associated with any memory given
+to the guest. KVM will ensure that the pages are flagged ``PG_mte_tagged`` so
+that the tags are maintained during swap or hibernation of the host; however
+the VMM needs to manually save/restore the tags as appropriate if the VM is
+migrated.
+
+When enabled the VMM may make use of the ``KVM_ARM_MTE_COPY_TAGS`` ioctl to
+perform a bulk copy of tags to/from the guest.
+
 8. Other capabilities.
 ==
 
-- 
2.20.1




[PATCH v14 6/8] KVM: arm64: Expose KVM_ARM_CAP_MTE

2021-06-07 Thread Steven Price
It's now safe for the VMM to enable MTE in a guest, so expose the
capability to user space.
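
For illustration, the VMM side is a single KVM_ENABLE_CAP on the VM fd
(vm_fd is an assumption of this sketch; the created_vcpus check below is
what enforces the ordering):

  struct kvm_enable_cap cap = {
          .cap = KVM_CAP_ARM_MTE,
  };

  /* After KVM_CREATE_VM but before any KVM_CREATE_VCPU calls. */
  if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap) < 0)
          /* MTE not supported by the host, or vcpus already created */;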

Reviewed-by: Catalin Marinas 
Signed-off-by: Steven Price 
---
 arch/arm64/kvm/arm.c  | 9 +
 arch/arm64/kvm/reset.c| 3 ++-
 arch/arm64/kvm/sys_regs.c | 3 +++
 3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 1cb39c0803a4..e89a5e275e25 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -93,6 +93,12 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
r = 0;
kvm->arch.return_nisv_io_abort_to_user = true;
break;
+   case KVM_CAP_ARM_MTE:
+   if (!system_supports_mte() || kvm->created_vcpus)
+   return -EINVAL;
+   r = 0;
+   kvm->arch.mte_enabled = true;
+   break;
default:
r = -EINVAL;
break;
@@ -237,6 +243,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 */
r = 1;
break;
+   case KVM_CAP_ARM_MTE:
+   r = system_supports_mte();
+   break;
case KVM_CAP_STEAL_TIME:
r = kvm_arm_pvtime_supported();
break;
diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index 956cdc240148..50635eacfa43 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -220,7 +220,8 @@ int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
switch (vcpu->arch.target) {
default:
if (test_bit(KVM_ARM_VCPU_EL1_32BIT, vcpu->arch.features)) {
-   if (!cpus_have_const_cap(ARM64_HAS_32BIT_EL1)) {
+   if (!cpus_have_const_cap(ARM64_HAS_32BIT_EL1) ||
+   vcpu->kvm->arch.mte_enabled) {
ret = -EINVAL;
goto out;
}
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 440315a556c2..d4e1c1b1a08d 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1312,6 +1312,9 @@ static bool access_ccsidr(struct kvm_vcpu *vcpu, struct 
sys_reg_params *p,
 static unsigned int mte_visibility(const struct kvm_vcpu *vcpu,
   const struct sys_reg_desc *rd)
 {
+   if (kvm_has_mte(vcpu->kvm))
+   return 0;
+
return REG_HIDDEN;
 }
 
-- 
2.20.1




[PATCH v14 5/8] KVM: arm64: Save/restore MTE registers

2021-06-07 Thread Steven Price
Define the new system registers that MTE introduces and context switch
them. The MTE feature is still hidden from the ID register as it isn't
supported in a VM yet.

Reviewed-by: Catalin Marinas 
Signed-off-by: Steven Price 
---
 arch/arm64/include/asm/kvm_arm.h   |  3 +-
 arch/arm64/include/asm/kvm_host.h  |  6 ++
 arch/arm64/include/asm/kvm_mte.h   | 66 ++
 arch/arm64/include/asm/sysreg.h|  3 +-
 arch/arm64/kernel/asm-offsets.c|  2 +
 arch/arm64/kvm/hyp/entry.S |  7 +++
 arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 21 +++
 arch/arm64/kvm/sys_regs.c  | 22 ++--
 8 files changed, 124 insertions(+), 6 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_mte.h

diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index 692c9049befa..d436831dd706 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -12,7 +12,8 @@
 #include 
 
 /* Hyp Configuration Register (HCR) bits */
-#define HCR_ATA(UL(1) << 56)
+#define HCR_ATA_SHIFT  56
+#define HCR_ATA(UL(1) << HCR_ATA_SHIFT)
 #define HCR_FWB(UL(1) << 46)
 #define HCR_API(UL(1) << 41)
 #define HCR_APK(UL(1) << 40)
diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index afaa5333f0e4..309e36cc1b42 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -208,6 +208,12 @@ enum vcpu_sysreg {
CNTP_CVAL_EL0,
CNTP_CTL_EL0,
 
+   /* Memory Tagging Extension registers */
+   RGSR_EL1,   /* Random Allocation Tag Seed Register */
+   GCR_EL1,/* Tag Control Register */
+   TFSR_EL1,   /* Tag Fault Status Register (EL1) */
+   TFSRE0_EL1, /* Tag Fault Status Register (EL0) */
+
/* 32bit specific registers. Keep them at the end of the range */
DACR32_EL2, /* Domain Access Control Register */
IFSR32_EL2, /* Instruction Fault Status Register */
diff --git a/arch/arm64/include/asm/kvm_mte.h b/arch/arm64/include/asm/kvm_mte.h
new file mode 100644
index ..88dd1199670b
--- /dev/null
+++ b/arch/arm64/include/asm/kvm_mte.h
@@ -0,0 +1,66 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2020-2021 ARM Ltd.
+ */
+#ifndef __ASM_KVM_MTE_H
+#define __ASM_KVM_MTE_H
+
+#ifdef __ASSEMBLY__
+
+#include 
+
+#ifdef CONFIG_ARM64_MTE
+
+.macro mte_switch_to_guest g_ctxt, h_ctxt, reg1
+alternative_if_not ARM64_MTE
+   b   .L__skip_switch\@
+alternative_else_nop_endif
+   mrs \reg1, hcr_el2
+   tbz \reg1, #(HCR_ATA_SHIFT), .L__skip_switch\@
+
+   mrs_s   \reg1, SYS_RGSR_EL1
+   str \reg1, [\h_ctxt, #CPU_RGSR_EL1]
+   mrs_s   \reg1, SYS_GCR_EL1
+   str \reg1, [\h_ctxt, #CPU_GCR_EL1]
+
+   ldr \reg1, [\g_ctxt, #CPU_RGSR_EL1]
+   msr_s   SYS_RGSR_EL1, \reg1
+   ldr \reg1, [\g_ctxt, #CPU_GCR_EL1]
+   msr_s   SYS_GCR_EL1, \reg1
+
+.L__skip_switch\@:
+.endm
+
+.macro mte_switch_to_hyp g_ctxt, h_ctxt, reg1
+alternative_if_not ARM64_MTE
+   b   .L__skip_switch\@
+alternative_else_nop_endif
+   mrs \reg1, hcr_el2
+   tbz \reg1, #(HCR_ATA_SHIFT), .L__skip_switch\@
+
+   mrs_s   \reg1, SYS_RGSR_EL1
+   str \reg1, [\g_ctxt, #CPU_RGSR_EL1]
+   mrs_s   \reg1, SYS_GCR_EL1
+   str \reg1, [\g_ctxt, #CPU_GCR_EL1]
+
+   ldr \reg1, [\h_ctxt, #CPU_RGSR_EL1]
+   msr_s   SYS_RGSR_EL1, \reg1
+   ldr \reg1, [\h_ctxt, #CPU_GCR_EL1]
+   msr_s   SYS_GCR_EL1, \reg1
+
+   isb
+
+.L__skip_switch\@:
+.endm
+
+#else /* CONFIG_ARM64_MTE */
+
+.macro mte_switch_to_guest g_ctxt, h_ctxt, reg1
+.endm
+
+.macro mte_switch_to_hyp g_ctxt, h_ctxt, reg1
+.endm
+
+#endif /* CONFIG_ARM64_MTE */
+#endif /* __ASSEMBLY__ */
+#endif /* __ASM_KVM_MTE_H */
diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index 65d15700a168..347ccac2341e 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -651,7 +651,8 @@
 
 #define INIT_SCTLR_EL2_MMU_ON  \
(SCTLR_ELx_M  | SCTLR_ELx_C | SCTLR_ELx_SA | SCTLR_ELx_I |  \
-SCTLR_ELx_IESB | SCTLR_ELx_WXN | ENDIAN_SET_EL2 | SCTLR_EL2_RES1)
+SCTLR_ELx_IESB | SCTLR_ELx_WXN | ENDIAN_SET_EL2 |  \
+SCTLR_ELx_ITFSB | SCTLR_EL2_RES1)
 
 #define INIT_SCTLR_EL2_MMU_OFF \
(SCTLR_EL2_RES1 | ENDIAN_SET_EL2)
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 0cb34ccb6e73..6f0044cb233e 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -111,6 +111,8 @@ int main(void)
   DEFINE(VCPU_WORKAROUND_FLAGS,offsetof(struct kvm_vcpu, 
arch.workaround_flags));
   DEFINE(VCPU_HCR_EL2, offsetof(struct kvm_v

Re: [PATCH v13 7/8] KVM: arm64: ioctl to fetch/store tags in a guest

2021-06-04 Thread Steven Price
On 04/06/2021 12:42, Catalin Marinas wrote:
> On Fri, Jun 04, 2021 at 12:15:56PM +0100, Steven Price wrote:
>> On 03/06/2021 18:13, Catalin Marinas wrote:
>>> On Mon, May 24, 2021 at 11:45:12AM +0100, Steven Price wrote:
>>>> diff --git a/arch/arm64/include/uapi/asm/kvm.h 
>>>> b/arch/arm64/include/uapi/asm/kvm.h
>>>> index 24223adae150..b3edde68bc3e 100644
>>>> --- a/arch/arm64/include/uapi/asm/kvm.h
>>>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>>>> @@ -184,6 +184,17 @@ struct kvm_vcpu_events {
>>>>__u32 reserved[12];
>>>>  };
>>>>  
>>>> +struct kvm_arm_copy_mte_tags {
>>>> +  __u64 guest_ipa;
>>>> +  __u64 length;
>>>> +  void __user *addr;
>>>> +  __u64 flags;
>>>> +  __u64 reserved[2];
>>>> +};
>>>> +
>>>> +#define KVM_ARM_TAGS_TO_GUEST 0
>>>> +#define KVM_ARM_TAGS_FROM_GUEST   1
>>>> +
>>>>  /* If you need to interpret the index values, here is the key: */
>>>>  #define KVM_REG_ARM_COPROC_MASK   0x0FFF
>>>>  #define KVM_REG_ARM_COPROC_SHIFT  16
>>>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>>>> index e89a5e275e25..baa33359e477 100644
>>>> --- a/arch/arm64/kvm/arm.c
>>>> +++ b/arch/arm64/kvm/arm.c
>>>> @@ -1345,6 +1345,13 @@ long kvm_arch_vm_ioctl(struct file *filp,
>>>>  
>>>>return 0;
>>>>}
>>>> +  case KVM_ARM_MTE_COPY_TAGS: {
>>>> +  struct kvm_arm_copy_mte_tags copy_tags;
>>>> +
>>>> +  if (copy_from_user(&copy_tags, argp, sizeof(copy_tags)))
>>>> +  return -EFAULT;
>>>> +  return kvm_vm_ioctl_mte_copy_tags(kvm, &copy_tags);
>>>> +  }
>>>
>>> I wonder whether we need an update of the user structure following a
>>> fault, like how much was copied etc. In case of an error, some tags were
>>> copied and the VMM may want to skip the page before continuing. But here
>>> there's no such information provided.
>>>
>>> On the ptrace interface, we return 0 on the syscall if any bytes were
>>> copied and update iov_len to such number. Maybe you want to still return
>>> an error here but updating copy_tags.length would be nice (and, of
>>> course, a copy_to_user() back).
>>
>> Good idea - as you suggest I'll make it update length with the number of
>> bytes not processed. Although in general I think we're expecting the VMM
>> to know where the memory is so this is more of a programming error - but
>> could still be useful for debugging.
> 
> Or update it to the number of bytes copied to be consistent with
> ptrace()'s iov.len. On success, the structure is effectively left
> unchanged.

I was avoiding that because it confuses the error code when the initial
copy_from_user() fails. In that case the structure is clearly unchanged,
so you can only tell from a -EFAULT return that nothing happened. By
returning the number of bytes left you can return an error code along
with the information that the copy only half completed.

It also seems cleaner to leave the structure unchanged if e.g. the flags
or reserved fields are invalid rather than having to set length=0 to
signal that nothing was done.

Although I do feel that arguing over whether to use a ptrace() interface or a
copy_{to,from}_user() interface is somewhat ridiculous, considering
neither is exactly considered good.

Rather than changing the structure we could return either an error code
(if nothing was copied) or the number of bytes left. That way ioctl()==0
means complete success, >0 means partial success and <0 means complete
failure with a detailed error code. The ioctl() can be repeated
(with adjusted pointers) if it returns >0 and a detailed error is needed.
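
For illustration, a VMM-side helper following that convention might look
roughly like this (a sketch only: the retry arithmetic and the assumption
that the positive return counts bytes of guest memory are mine, not part
of the posted series):

#include <errno.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>	/* assumes headers providing KVM_ARM_MTE_COPY_TAGS */

/*
 * Illustration of the "0 / >0 bytes left / <0 error" convention described
 * above, NOT the behaviour of the patches as posted.  One tag byte covers
 * a 16-byte granule, so the user buffer advances by done / 16.
 */
static int copy_tags_retry(int vm_fd, __u64 ipa, void *buf, __u64 len,
			   __u64 dir_flags)
{
	struct kvm_arm_copy_mte_tags args = {
		.guest_ipa = ipa,
		.length    = len,
		.addr      = buf,
		.flags     = dir_flags,
	};

	while (args.length) {
		int left = ioctl(vm_fd, KVM_ARM_MTE_COPY_TAGS, &args);
		__u64 done;

		if (left == 0)
			return 0;		/* complete success */
		if (left < 0)
			return -errno;		/* complete failure */

		/* Partial success: skip what was processed and retry. */
		done = args.length - (__u64)left;
		if (!done)
			return -EIO;		/* no forward progress */
		args.guest_ipa += done;
		args.addr = (char *)args.addr + done / 16;
		args.length = (__u64)left;
	}
	return 0;
}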

Steve



Re: [PATCH v13 4/8] KVM: arm64: Introduce MTE VM feature

2021-06-04 Thread Steven Price
On 04/06/2021 12:36, Catalin Marinas wrote:
> On Fri, Jun 04, 2021 at 11:42:11AM +0100, Steven Price wrote:
>> On 03/06/2021 17:00, Catalin Marinas wrote:
>>> On Mon, May 24, 2021 at 11:45:09AM +0100, Steven Price wrote:
>>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>>> index c5d1f3c87dbd..226035cf7d6c 100644
>>>> --- a/arch/arm64/kvm/mmu.c
>>>> +++ b/arch/arm64/kvm/mmu.c
>>>> @@ -822,6 +822,42 @@ transparent_hugepage_adjust(struct kvm_memory_slot 
>>>> *memslot,
>>>>return PAGE_SIZE;
>>>>  }
>>>>  
>>>> +static int sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn,
>>>> +   unsigned long size)
>>>> +{
>>>> +  if (kvm_has_mte(kvm)) {
>>>> +  /*
>>>> +   * The page will be mapped in stage 2 as Normal Cacheable, so
>>>> +   * the VM will be able to see the page's tags and therefore
>>>> +   * they must be initialised first. If PG_mte_tagged is set,
>>>> +   * tags have already been initialised.
>>>> +   * pfn_to_online_page() is used to reject ZONE_DEVICE pages
>>>> +   * that may not support tags.
>>>> +   */
>>>> +  unsigned long i, nr_pages = size >> PAGE_SHIFT;
>>>> +  struct page *page = pfn_to_online_page(pfn);
>>>> +
>>>> +  if (!page)
>>>> +  return -EFAULT;
>>>> +
>>>> +  for (i = 0; i < nr_pages; i++, page++) {
>>>> +  /*
>>>> +   * There is a potential (but very unlikely) race
>>>> +   * between two VMs which are sharing a physical page
>>>> +   * entering this at the same time. However by splitting
>>>> +   * the test/set the only risk is tags being overwritten
>>>> +   * by the mte_clear_page_tags() call.
>>>> +   */
>>>
>>> And I think the real risk here is when the page is writable by at least
>>> one of the VMs sharing the page. This excludes KSM, so it only leaves
>>> the MAP_SHARED mappings.
>>>
>>>> +  if (!test_bit(PG_mte_tagged, &page->flags)) {
>>>> +  mte_clear_page_tags(page_address(page));
>>>> +  set_bit(PG_mte_tagged, &page->flags);
>>>> +  }
>>>> +  }
>>>
>>> If we want to cover this race (I'd say in a separate patch), we can call
>>> mte_sync_page_tags(page, __pte(0), false, true) directly (hopefully I
>>> got the arguments right). We can avoid the big lock in most cases if
>>> kvm_arch_prepare_memory_region() sets a VM_MTE_RESET (tag clear etc.)
>>> and __alloc_zeroed_user_highpage() clears the tags on allocation (as we
>>> do for VM_MTE but the new flag would not affect the stage 1 VMM page
>>> attributes).
>>
>> To be honest I'm coming round to just exporting a
>> mte_prepare_page_tags() function which does the clear/set with the lock
>> held. I doubt it's such a performance critical path that it will cause
>> any noticeable issues. Then if we run into performance problems in the
>> future we can start experimenting with extra VM flags etc as necessary.
> 
> It works for me.
> 
>> And from your later email:
>>> Another idea: if VM_SHARED is found for any vma within a region in
>>> kvm_arch_prepare_memory_region(), we either prevent the enabling of MTE
>>> for the guest or reject the memory slot if MTE was already enabled.
>>>
>>> An alternative here would be to clear VM_MTE_ALLOWED so that any
>>> subsequent mprotect(PROT_MTE) in the VMM would fail in
>>> arch_validate_flags(). MTE would still be allowed in the guest but not in
>>> the VMM for the guest memory regions. We can probably do this
>>> irrespective of VM_SHARED. Of course, the VMM can still mmap() the
>>> memory initially with PROT_MTE but that's not an issue IIRC, only the
>>> concurrent mprotect().
>>
>> This could work, but I worry that it's potentially fragile. Also the rules
>> for what user space can do are not obvious and may be surprising. I'd
>> also want to look into the likes of mremap() to see how easy it would be
>> to ensure that we couldn't end up with VM_SHARED (or VM_MTE_ALLOWED)
>> memory sneaking into a memslot.
>>
>> Unless you think it's worth complicating the ABI in the hope of avoiding
>> the big lock overhead I think it's probably best to stick with the big
>> lock at least until we have more data on the overhead.

Re: [PATCH v13 7/8] KVM: arm64: ioctl to fetch/store tags in a guest

2021-06-04 Thread Steven Price
On 03/06/2021 18:13, Catalin Marinas wrote:
> On Mon, May 24, 2021 at 11:45:12AM +0100, Steven Price wrote:
>> diff --git a/arch/arm64/include/uapi/asm/kvm.h 
>> b/arch/arm64/include/uapi/asm/kvm.h
>> index 24223adae150..b3edde68bc3e 100644
>> --- a/arch/arm64/include/uapi/asm/kvm.h
>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>> @@ -184,6 +184,17 @@ struct kvm_vcpu_events {
>>  __u32 reserved[12];
>>  };
>>  
>> +struct kvm_arm_copy_mte_tags {
>> +__u64 guest_ipa;
>> +__u64 length;
>> +void __user *addr;
>> +__u64 flags;
>> +__u64 reserved[2];
>> +};
>> +
>> +#define KVM_ARM_TAGS_TO_GUEST   0
>> +#define KVM_ARM_TAGS_FROM_GUEST 1
>> +
>>  /* If you need to interpret the index values, here is the key: */
>>  #define KVM_REG_ARM_COPROC_MASK 0x0FFF
>>  #define KVM_REG_ARM_COPROC_SHIFT16
>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index e89a5e275e25..baa33359e477 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -1345,6 +1345,13 @@ long kvm_arch_vm_ioctl(struct file *filp,
>>  
>>  return 0;
>>  }
>> +case KVM_ARM_MTE_COPY_TAGS: {
>> +struct kvm_arm_copy_mte_tags copy_tags;
>> +
>> +if (copy_from_user(&copy_tags, argp, sizeof(copy_tags)))
>> +return -EFAULT;
>> +return kvm_vm_ioctl_mte_copy_tags(kvm, &copy_tags);
>> +}
> 
> I wonder whether we need an update of the user structure following a
> fault, like how much was copied etc. In case of an error, some tags were
> copied and the VMM may want to skip the page before continuing. But here
> there's no such information provided.
> 
> On the ptrace interface, we return 0 on the syscall if any bytes were
> copied and update iov_len to such number. Maybe you want to still return
> an error here but updating copy_tags.length would be nice (and, of
> course, a copy_to_user() back).
> 

Good idea - as you suggest I'll make it update length with the number of
bytes not processed. Although in general I think we're expecting the VMM
to know where the memory is so this is more of a programming error - but
could still be useful for debugging.
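
A rough sketch of how that write-back could look in the ioctl dispatch
(illustrative only; it assumes kvm_vm_ioctl_mte_copy_tags() updates
copy_tags.length to the bytes not processed on a partial copy, which is
not what the posted code does yet):

	case KVM_ARM_MTE_COPY_TAGS: {
		struct kvm_arm_copy_mte_tags copy_tags;
		int ret;

		if (copy_from_user(&copy_tags, argp, sizeof(copy_tags)))
			return -EFAULT;
		ret = kvm_vm_ioctl_mte_copy_tags(kvm, &copy_tags);
		/*
		 * Assumed behaviour: on failure the handler has set
		 * copy_tags.length to the number of bytes not processed,
		 * so write the structure back for the VMM to inspect.
		 */
		if (ret && copy_to_user(argp, &copy_tags, sizeof(copy_tags)))
			ret = -EFAULT;
		return ret;
	}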

Thanks,

Steve



Re: [PATCH v13 4/8] KVM: arm64: Introduce MTE VM feature

2021-06-04 Thread Steven Price
On 03/06/2021 17:00, Catalin Marinas wrote:
> On Mon, May 24, 2021 at 11:45:09AM +0100, Steven Price wrote:
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index c5d1f3c87dbd..226035cf7d6c 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -822,6 +822,42 @@ transparent_hugepage_adjust(struct kvm_memory_slot 
>> *memslot,
>>  return PAGE_SIZE;
>>  }
>>  
>> +static int sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn,
>> + unsigned long size)
>> +{
>> +if (kvm_has_mte(kvm)) {
> 
> Nitpick (less indentation):
> 
>   if (!kvm_has_mte(kvm))
>   return 0;

Thanks, will change.

>> +/*
>> + * The page will be mapped in stage 2 as Normal Cacheable, so
>> + * the VM will be able to see the page's tags and therefore
>> + * they must be initialised first. If PG_mte_tagged is set,
>> + * tags have already been initialised.
>> + * pfn_to_online_page() is used to reject ZONE_DEVICE pages
>> + * that may not support tags.
>> + */
>> +unsigned long i, nr_pages = size >> PAGE_SHIFT;
>> +struct page *page = pfn_to_online_page(pfn);
>> +
>> +if (!page)
>> +return -EFAULT;
>> +
>> +for (i = 0; i < nr_pages; i++, page++) {
>> +/*
>> + * There is a potential (but very unlikely) race
>> + * between two VMs which are sharing a physical page
>> + * entering this at the same time. However by splitting
>> + * the test/set the only risk is tags being overwritten
>> + * by the mte_clear_page_tags() call.
>> + */
> 
> And I think the real risk here is when the page is writable by at least
> one of the VMs sharing the page. This excludes KSM, so it only leaves
> the MAP_SHARED mappings.
> 
>> +if (!test_bit(PG_mte_tagged, &page->flags)) {
>> +mte_clear_page_tags(page_address(page));
>> +set_bit(PG_mte_tagged, &page->flags);
>> +}
>> +}
> 
> If we want to cover this race (I'd say in a separate patch), we can call
> mte_sync_page_tags(page, __pte(0), false, true) directly (hopefully I
> got the arguments right). We can avoid the big lock in most cases if
> kvm_arch_prepare_memory_region() sets a VM_MTE_RESET (tag clear etc.)
> and __alloc_zeroed_user_highpage() clears the tags on allocation (as we
> do for VM_MTE but the new flag would not affect the stage 1 VMM page
> attributes).

To be honest I'm coming round to just exporting a
mte_prepare_page_tags() function which does the clear/set with the lock
held. I doubt it's such a performance critical path that it will cause
any noticeable issues. Then if we run into performance problems in the
future we can start experimenting with extra VM flags etc as necessary.
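
For reference, a minimal sketch of what such an exported helper might look
like (the name comes from the paragraph above; this is not code from the
posted series and the final form may well differ):

/* In arch/arm64/kernel/mte.c -- sketch only. */
void mte_prepare_page_tags(struct page *page)
{
	unsigned long flags;

	spin_lock_irqsave(&tag_sync_lock, flags);

	/* Recheck under the lock: another caller may have initialised tags. */
	if (!test_bit(PG_mte_tagged, &page->flags)) {
		mte_clear_page_tags(page_address(page));
		set_bit(PG_mte_tagged, &page->flags);
	}

	spin_unlock_irqrestore(&tag_sync_lock, flags);
}
EXPORT_SYMBOL_GPL(mte_prepare_page_tags);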

And from your later email:
> Another idea: if VM_SHARED is found for any vma within a region in
> kvm_arch_prepare_memory_region(), we either prevent the enabling of MTE
> for the guest or reject the memory slot if MTE was already enabled.
> 
> An alternative here would be to clear VM_MTE_ALLOWED so that any
> subsequent mprotect(PROT_MTE) in the VMM would fail in
> arch_validate_flags(). MTE would still be allowed in the guest but not in
> the VMM for the guest memory regions. We can probably do this
> irrespective of VM_SHARED. Of course, the VMM can still mmap() the
> memory initially with PROT_MTE but that's not an issue IIRC, only the
> concurrent mprotect().

This could work, but I worry that it's potentially fragile. Also the rules
for what user space can do are not obvious and may be surprising. I'd
also want to look into the likes of mremap() to see how easy it would be
to ensure that we couldn't end up with VM_SHARED (or VM_MTE_ALLOWED)
memory sneaking into a memslot.

Unless you think it's worth complicating the ABI in the hope of avoiding
the big lock overhead I think it's probably best to stick with the big
lock at least until we have more data on the overhead.

>> +}
>> +
>> +return 0;
>> +}
>> +
>>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>struct kvm_memory_slot *memslot, unsigned long hva,
>>unsigned long fault_status)
>> @@ -971,8 +1007,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
>> phys_addr_t fault_ipa,
>>  if (writable)
>> 

Re: [PATCH v12 7/8] KVM: arm64: ioctl to fetch/store tags in a guest

2021-05-27 Thread Steven Price
On 24/05/2021 19:11, Catalin Marinas wrote:
> On Fri, May 21, 2021 at 10:42:09AM +0100, Steven Price wrote:
>> On 20/05/2021 18:27, Catalin Marinas wrote:
>>> On Thu, May 20, 2021 at 04:58:01PM +0100, Steven Price wrote:
>>>> On 20/05/2021 13:05, Catalin Marinas wrote:
>>>>> On Mon, May 17, 2021 at 01:32:38PM +0100, Steven Price wrote:
>>>>>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>>>>>> index e89a5e275e25..4b6c83beb75d 100644
>>>>>> --- a/arch/arm64/kvm/arm.c
>>>>>> +++ b/arch/arm64/kvm/arm.c
>>>>>> @@ -1309,6 +1309,65 @@ static int kvm_vm_ioctl_set_device_addr(struct 
>>>>>> kvm *kvm,
>>>>>>  }
>>>>>>  }
>>>>>>  
>>>>>> +static int kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
>>>>>> +  struct kvm_arm_copy_mte_tags 
>>>>>> *copy_tags)
>>>>>> +{
>>>>>> +gpa_t guest_ipa = copy_tags->guest_ipa;
>>>>>> +size_t length = copy_tags->length;
>>>>>> +void __user *tags = copy_tags->addr;
>>>>>> +gpa_t gfn;
>>>>>> +bool write = !(copy_tags->flags & KVM_ARM_TAGS_FROM_GUEST);
>>>>>> +int ret = 0;
>>>>>> +
>>>>>> +if (copy_tags->reserved[0] || copy_tags->reserved[1])
>>>>>> +return -EINVAL;
>>>>>> +
>>>>>> +if (copy_tags->flags & ~KVM_ARM_TAGS_FROM_GUEST)
>>>>>> +return -EINVAL;
>>>>>> +
>>>>>> +if (length & ~PAGE_MASK || guest_ipa & ~PAGE_MASK)
>>>>>> +return -EINVAL;
>>>>>> +
>>>>>> +gfn = gpa_to_gfn(guest_ipa);
>>>>>> +
>>>>>> +mutex_lock(&kvm->slots_lock);
>>>>>> +
>>>>>> +while (length > 0) {
>>>>>> +kvm_pfn_t pfn = gfn_to_pfn_prot(kvm, gfn, write, NULL);
>>>>>> +void *maddr;
>>>>>> +unsigned long num_tags = PAGE_SIZE / MTE_GRANULE_SIZE;
>>>>>> +
>>>>>> +if (is_error_noslot_pfn(pfn)) {
>>>>>> +ret = -EFAULT;
>>>>>> +goto out;
>>>>>> +}
>>>>>> +
>>>>>> +maddr = page_address(pfn_to_page(pfn));
>>>>>> +
>>>>>> +if (!write) {
>>>>>> +num_tags = mte_copy_tags_to_user(tags, maddr, 
>>>>>> num_tags);
>>>>>> +kvm_release_pfn_clean(pfn);
>>>>>
>>>>> Do we need to check if PG_mte_tagged is set? If the page was not faulted
>>>>> into the guest address space but the VMM has the page, does the
>>>>> gfn_to_pfn_prot() guarantee that a kvm_set_spte_gfn() was called? If
>>>>> not, this may read stale tags.
>>>>
>>>> Ah, I hadn't thought about that... No I don't believe gfn_to_pfn_prot()
>>>> will fault it into the guest.
>>>
>>> It doesn't indeed. What it does is a get_user_pages() but it's not of
>>> much help since the VMM pte wouldn't be tagged (we would have solved
>>> lots of problems if we required PROT_MTE in the VMM...)
>>
>> Sadly it solves some problems and creates others :(
> 
> I had some (random) thoughts on how to make things simpler, maybe. I
> think most of these races would have been solved if we required PROT_MTE
> in the VMM but this has an impact on the VMM if it wants to use MTE
> itself. If such requirement was in place, all KVM needed to do is check
> PG_mte_tagged.
> 
> So what we actually need is a set_pte_at() in the VMM to clear the tags
> and set PG_mte_tagged. Currently, we only do this if the memory type is
> tagged (PROT_MTE) but it's not strictly necessary.
> 
> As an optimisation for normal programs, we don't want to do this all the
> time but the visible behaviour wouldn't change (well, maybe for ptrace
> slightly). However, it doesn't mean we couldn't for a VMM, with an
> opt-in via prctl(). This would add a MMCF_MTE_TAG_INIT bit (couldn't
> think of a better name) to mm_context_t.flags and set_pte_at() would
> behave as if the pt

[PATCH v13 7/8] KVM: arm64: ioctl to fetch/store tags in a guest

2021-05-24 Thread Steven Price
The VMM may not wish to have its own mapping of guest memory mapped
with PROT_MTE because this causes problems if the VMM has tag checking
enabled (the guest controls the tags in physical RAM and it's unlikely
the tags are correct for the VMM).

Instead add a new ioctl which allows the VMM to easily read/write the
tags from guest memory, allowing the VMM's mapping to be non-PROT_MTE
while the VMM can still read/write the tags for the purpose of
migration.

Signed-off-by: Steven Price 
---
 arch/arm64/include/asm/kvm_host.h |  3 ++
 arch/arm64/include/asm/mte-def.h  |  1 +
 arch/arm64/include/uapi/asm/kvm.h | 11 +
 arch/arm64/kvm/arm.c  |  7 +++
 arch/arm64/kvm/guest.c| 79 +++
 include/uapi/linux/kvm.h  |  1 +
 6 files changed, 102 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 309e36cc1b42..66b6339df949 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -729,6 +729,9 @@ int kvm_arm_vcpu_arch_get_attr(struct kvm_vcpu *vcpu,
 int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
   struct kvm_device_attr *attr);
 
+int kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
+  struct kvm_arm_copy_mte_tags *copy_tags);
+
 /* Guest/host FPSIMD coordination helpers */
 int kvm_arch_vcpu_run_map_fp(struct kvm_vcpu *vcpu);
 void kvm_arch_vcpu_load_fp(struct kvm_vcpu *vcpu);
diff --git a/arch/arm64/include/asm/mte-def.h b/arch/arm64/include/asm/mte-def.h
index cf241b0f0a42..626d359b396e 100644
--- a/arch/arm64/include/asm/mte-def.h
+++ b/arch/arm64/include/asm/mte-def.h
@@ -7,6 +7,7 @@
 
 #define MTE_GRANULE_SIZE   UL(16)
 #define MTE_GRANULE_MASK   (~(MTE_GRANULE_SIZE - 1))
+#define MTE_GRANULES_PER_PAGE  (PAGE_SIZE / MTE_GRANULE_SIZE)
 #define MTE_TAG_SHIFT  56
 #define MTE_TAG_SIZE   4
 #define MTE_TAG_MASK   GENMASK((MTE_TAG_SHIFT + (MTE_TAG_SIZE - 1)), 
MTE_TAG_SHIFT)
diff --git a/arch/arm64/include/uapi/asm/kvm.h 
b/arch/arm64/include/uapi/asm/kvm.h
index 24223adae150..b3edde68bc3e 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -184,6 +184,17 @@ struct kvm_vcpu_events {
__u32 reserved[12];
 };
 
+struct kvm_arm_copy_mte_tags {
+   __u64 guest_ipa;
+   __u64 length;
+   void __user *addr;
+   __u64 flags;
+   __u64 reserved[2];
+};
+
+#define KVM_ARM_TAGS_TO_GUEST  0
+#define KVM_ARM_TAGS_FROM_GUEST1
+
 /* If you need to interpret the index values, here is the key: */
 #define KVM_REG_ARM_COPROC_MASK0x0FFF
 #define KVM_REG_ARM_COPROC_SHIFT   16
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index e89a5e275e25..baa33359e477 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1345,6 +1345,13 @@ long kvm_arch_vm_ioctl(struct file *filp,
 
return 0;
}
+   case KVM_ARM_MTE_COPY_TAGS: {
+   struct kvm_arm_copy_mte_tags copy_tags;
+
+   if (copy_from_user(&copy_tags, argp, sizeof(copy_tags)))
+   return -EFAULT;
+   return kvm_vm_ioctl_mte_copy_tags(kvm, &copy_tags);
+   }
default:
return -EINVAL;
}
diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
index 5cb4a1cd5603..7a1e181eb463 100644
--- a/arch/arm64/kvm/guest.c
+++ b/arch/arm64/kvm/guest.c
@@ -995,3 +995,82 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
 
return ret;
 }
+
+int kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
+  struct kvm_arm_copy_mte_tags *copy_tags)
+{
+   gpa_t guest_ipa = copy_tags->guest_ipa;
+   size_t length = copy_tags->length;
+   void __user *tags = copy_tags->addr;
+   gpa_t gfn;
+   bool write = !(copy_tags->flags & KVM_ARM_TAGS_FROM_GUEST);
+   int ret = 0;
+
+   if (!kvm_has_mte(kvm))
+   return -EINVAL;
+
+   if (copy_tags->reserved[0] || copy_tags->reserved[1])
+   return -EINVAL;
+
+   if (copy_tags->flags & ~KVM_ARM_TAGS_FROM_GUEST)
+   return -EINVAL;
+
+   if (length & ~PAGE_MASK || guest_ipa & ~PAGE_MASK)
+   return -EINVAL;
+
+   gfn = gpa_to_gfn(guest_ipa);
+
+   mutex_lock(&kvm->slots_lock);
+
+   while (length > 0) {
+   kvm_pfn_t pfn = gfn_to_pfn_prot(kvm, gfn, write, NULL);
+   void *maddr;
+   unsigned long num_tags;
+   struct page *page;
+
+   if (is_error_noslot_pfn(pfn)) {
+   ret = -EFAULT;
+   goto out;
+   }
+
+   page = pfn_to_online_page(pfn);
+   if (!page) {
+   /* Reject ZONE_DEVICE memory */
+   ret = -EFAULT;
+   goto out;
+  

[PATCH v13 5/8] KVM: arm64: Save/restore MTE registers

2021-05-24 Thread Steven Price
Define the new system registers that MTE introduces and context switch
them. The MTE feature is still hidden from the ID register as it isn't
supported in a VM yet.

Signed-off-by: Steven Price 
---
 arch/arm64/include/asm/kvm_host.h  |  6 ++
 arch/arm64/include/asm/kvm_mte.h   | 68 ++
 arch/arm64/include/asm/sysreg.h|  3 +-
 arch/arm64/kernel/asm-offsets.c|  2 +
 arch/arm64/kvm/hyp/entry.S |  7 +++
 arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 21 +++
 arch/arm64/kvm/sys_regs.c  | 22 +--
 7 files changed, 124 insertions(+), 5 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_mte.h

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index afaa5333f0e4..309e36cc1b42 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -208,6 +208,12 @@ enum vcpu_sysreg {
CNTP_CVAL_EL0,
CNTP_CTL_EL0,
 
+   /* Memory Tagging Extension registers */
+   RGSR_EL1,   /* Random Allocation Tag Seed Register */
+   GCR_EL1,/* Tag Control Register */
+   TFSR_EL1,   /* Tag Fault Status Register (EL1) */
+   TFSRE0_EL1, /* Tag Fault Status Register (EL0) */
+
/* 32bit specific registers. Keep them at the end of the range */
DACR32_EL2, /* Domain Access Control Register */
IFSR32_EL2, /* Instruction Fault Status Register */
diff --git a/arch/arm64/include/asm/kvm_mte.h b/arch/arm64/include/asm/kvm_mte.h
new file mode 100644
index ..eae4bce9e269
--- /dev/null
+++ b/arch/arm64/include/asm/kvm_mte.h
@@ -0,0 +1,68 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2020 ARM Ltd.
+ */
+#ifndef __ASM_KVM_MTE_H
+#define __ASM_KVM_MTE_H
+
+#ifdef __ASSEMBLY__
+
+#include 
+
+#ifdef CONFIG_ARM64_MTE
+
+.macro mte_switch_to_guest g_ctxt, h_ctxt, reg1
+alternative_if_not ARM64_MTE
+   b   .L__skip_switch\@
+alternative_else_nop_endif
+   mrs \reg1, hcr_el2
+   and \reg1, \reg1, #(HCR_ATA)
+   cbz \reg1, .L__skip_switch\@
+
+   mrs_s   \reg1, SYS_RGSR_EL1
+   str \reg1, [\h_ctxt, #CPU_RGSR_EL1]
+   mrs_s   \reg1, SYS_GCR_EL1
+   str \reg1, [\h_ctxt, #CPU_GCR_EL1]
+
+   ldr \reg1, [\g_ctxt, #CPU_RGSR_EL1]
+   msr_s   SYS_RGSR_EL1, \reg1
+   ldr \reg1, [\g_ctxt, #CPU_GCR_EL1]
+   msr_s   SYS_GCR_EL1, \reg1
+
+.L__skip_switch\@:
+.endm
+
+.macro mte_switch_to_hyp g_ctxt, h_ctxt, reg1
+alternative_if_not ARM64_MTE
+   b   .L__skip_switch\@
+alternative_else_nop_endif
+   mrs \reg1, hcr_el2
+   and \reg1, \reg1, #(HCR_ATA)
+   cbz \reg1, .L__skip_switch\@
+
+   mrs_s   \reg1, SYS_RGSR_EL1
+   str \reg1, [\g_ctxt, #CPU_RGSR_EL1]
+   mrs_s   \reg1, SYS_GCR_EL1
+   str \reg1, [\g_ctxt, #CPU_GCR_EL1]
+
+   ldr \reg1, [\h_ctxt, #CPU_RGSR_EL1]
+   msr_s   SYS_RGSR_EL1, \reg1
+   ldr \reg1, [\h_ctxt, #CPU_GCR_EL1]
+   msr_s   SYS_GCR_EL1, \reg1
+
+   isb
+
+.L__skip_switch\@:
+.endm
+
+#else /* CONFIG_ARM64_MTE */
+
+.macro mte_switch_to_guest g_ctxt, h_ctxt, reg1
+.endm
+
+.macro mte_switch_to_hyp g_ctxt, h_ctxt, reg1
+.endm
+
+#endif /* CONFIG_ARM64_MTE */
+#endif /* __ASSEMBLY__ */
+#endif /* __ASM_KVM_MTE_H */
diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index 65d15700a168..347ccac2341e 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -651,7 +651,8 @@
 
 #define INIT_SCTLR_EL2_MMU_ON  \
(SCTLR_ELx_M  | SCTLR_ELx_C | SCTLR_ELx_SA | SCTLR_ELx_I |  \
-SCTLR_ELx_IESB | SCTLR_ELx_WXN | ENDIAN_SET_EL2 | SCTLR_EL2_RES1)
+SCTLR_ELx_IESB | SCTLR_ELx_WXN | ENDIAN_SET_EL2 |  \
+SCTLR_ELx_ITFSB | SCTLR_EL2_RES1)
 
 #define INIT_SCTLR_EL2_MMU_OFF \
(SCTLR_EL2_RES1 | ENDIAN_SET_EL2)
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 0cb34ccb6e73..6f0044cb233e 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -111,6 +111,8 @@ int main(void)
   DEFINE(VCPU_WORKAROUND_FLAGS,offsetof(struct kvm_vcpu, 
arch.workaround_flags));
   DEFINE(VCPU_HCR_EL2, offsetof(struct kvm_vcpu, arch.hcr_el2));
   DEFINE(CPU_USER_PT_REGS, offsetof(struct kvm_cpu_context, regs));
+  DEFINE(CPU_RGSR_EL1, offsetof(struct kvm_cpu_context, 
sys_regs[RGSR_EL1]));
+  DEFINE(CPU_GCR_EL1,  offsetof(struct kvm_cpu_context, 
sys_regs[GCR_EL1]));
   DEFINE(CPU_APIAKEYLO_EL1,offsetof(struct kvm_cpu_context, 
sys_regs[APIAKEYLO_EL1]));
   DEFINE(CPU_APIBKEYLO_EL1,offsetof(struct kvm_cpu_context, 
sys_regs[APIBKEYLO_EL1]));
   DEFINE(CPU_APDAKEYLO_EL1,offsetof(struct kvm_cpu_context, 
sys_regs[APDAKEYLO_EL1]));
diff --git a/arch/arm64/kvm/hyp/entry.S b/arch/arm64/kvm/hyp/entry.S

[PATCH v13 6/8] KVM: arm64: Expose KVM_ARM_CAP_MTE

2021-05-24 Thread Steven Price
It's now safe for the VMM to enable MTE in a guest, so expose the
capability to user space.

Signed-off-by: Steven Price 
---
 arch/arm64/kvm/arm.c  | 9 +
 arch/arm64/kvm/reset.c| 3 ++-
 arch/arm64/kvm/sys_regs.c | 3 +++
 3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 1cb39c0803a4..e89a5e275e25 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -93,6 +93,12 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
r = 0;
kvm->arch.return_nisv_io_abort_to_user = true;
break;
+   case KVM_CAP_ARM_MTE:
+   if (!system_supports_mte() || kvm->created_vcpus)
+   return -EINVAL;
+   r = 0;
+   kvm->arch.mte_enabled = true;
+   break;
default:
r = -EINVAL;
break;
@@ -237,6 +243,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 */
r = 1;
break;
+   case KVM_CAP_ARM_MTE:
+   r = system_supports_mte();
+   break;
case KVM_CAP_STEAL_TIME:
r = kvm_arm_pvtime_supported();
break;
diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index 956cdc240148..50635eacfa43 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -220,7 +220,8 @@ int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
switch (vcpu->arch.target) {
default:
if (test_bit(KVM_ARM_VCPU_EL1_32BIT, vcpu->arch.features)) {
-   if (!cpus_have_const_cap(ARM64_HAS_32BIT_EL1)) {
+   if (!cpus_have_const_cap(ARM64_HAS_32BIT_EL1) ||
+   vcpu->kvm->arch.mte_enabled) {
ret = -EINVAL;
goto out;
}
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 440315a556c2..d4e1c1b1a08d 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1312,6 +1312,9 @@ static bool access_ccsidr(struct kvm_vcpu *vcpu, struct 
sys_reg_params *p,
 static unsigned int mte_visibility(const struct kvm_vcpu *vcpu,
   const struct sys_reg_desc *rd)
 {
+   if (kvm_has_mte(vcpu->kvm))
+   return 0;
+
return REG_HIDDEN;
 }
 
-- 
2.20.1
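
A minimal sketch of the VMM side of this capability (assumed file
descriptors, error handling trimmed; not part of the patch itself):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* MTE must be enabled on the VM fd before any vCPUs are created. */
static int enable_guest_mte(int kvm_fd, int vm_fd)
{
	struct kvm_enable_cap cap = { .cap = KVM_CAP_ARM_MTE };

	if (ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_ARM_MTE) <= 0)
		return -1;	/* kernel or hardware lacks MTE support */

	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}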




[PATCH v13 4/8] KVM: arm64: Introduce MTE VM feature

2021-05-24 Thread Steven Price
Add a new VM feature 'KVM_ARM_CAP_MTE' which enables memory tagging
for a VM. This will expose the feature to the guest and automatically
tag memory pages touched by the VM as PG_mte_tagged (and clear the tag
storage) to ensure that the guest cannot see stale tags, and so that
the tags are correctly saved/restored across swap.

Actually exposing the new capability to user space happens in a later
patch.

Signed-off-by: Steven Price 
---
 arch/arm64/include/asm/kvm_emulate.h |  3 ++
 arch/arm64/include/asm/kvm_host.h|  3 ++
 arch/arm64/kvm/hyp/exception.c   |  3 +-
 arch/arm64/kvm/mmu.c | 48 +++-
 arch/arm64/kvm/sys_regs.c|  7 
 include/uapi/linux/kvm.h |  1 +
 6 files changed, 63 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_emulate.h 
b/arch/arm64/include/asm/kvm_emulate.h
index f612c090f2e4..6bf776c2399c 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -84,6 +84,9 @@ static inline void vcpu_reset_hcr(struct kvm_vcpu *vcpu)
if (cpus_have_const_cap(ARM64_MISMATCHED_CACHE_TYPE) ||
vcpu_el1_is_32bit(vcpu))
vcpu->arch.hcr_el2 |= HCR_TID2;
+
+   if (kvm_has_mte(vcpu->kvm))
+   vcpu->arch.hcr_el2 |= HCR_ATA;
 }
 
 static inline unsigned long *vcpu_hcr(struct kvm_vcpu *vcpu)
diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 7cd7d5c8c4bc..afaa5333f0e4 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -132,6 +132,8 @@ struct kvm_arch {
 
u8 pfr0_csv2;
u8 pfr0_csv3;
+   /* Memory Tagging Extension enabled for the guest */
+   bool mte_enabled;
 };
 
 struct kvm_vcpu_fault_info {
@@ -769,6 +771,7 @@ bool kvm_arm_vcpu_is_finalized(struct kvm_vcpu *vcpu);
 #define kvm_arm_vcpu_sve_finalized(vcpu) \
((vcpu)->arch.flags & KVM_ARM64_VCPU_SVE_FINALIZED)
 
+#define kvm_has_mte(kvm) (system_supports_mte() && (kvm)->arch.mte_enabled)
 #define kvm_vcpu_has_pmu(vcpu) \
(test_bit(KVM_ARM_VCPU_PMU_V3, (vcpu)->arch.features))
 
diff --git a/arch/arm64/kvm/hyp/exception.c b/arch/arm64/kvm/hyp/exception.c
index 73629094f903..56426565600c 100644
--- a/arch/arm64/kvm/hyp/exception.c
+++ b/arch/arm64/kvm/hyp/exception.c
@@ -112,7 +112,8 @@ static void enter_exception64(struct kvm_vcpu *vcpu, 
unsigned long target_mode,
new |= (old & PSR_C_BIT);
new |= (old & PSR_V_BIT);
 
-   // TODO: TCO (if/when ARMv8.5-MemTag is exposed to guests)
+   if (kvm_has_mte(vcpu->kvm))
+   new |= PSR_TCO_BIT;
 
new |= (old & PSR_DIT_BIT);
 
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index c5d1f3c87dbd..226035cf7d6c 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -822,6 +822,42 @@ transparent_hugepage_adjust(struct kvm_memory_slot 
*memslot,
return PAGE_SIZE;
 }
 
+static int sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn,
+unsigned long size)
+{
+   if (kvm_has_mte(kvm)) {
+   /*
+* The page will be mapped in stage 2 as Normal Cacheable, so
+* the VM will be able to see the page's tags and therefore
+* they must be initialised first. If PG_mte_tagged is set,
+* tags have already been initialised.
+* pfn_to_online_page() is used to reject ZONE_DEVICE pages
+* that may not support tags.
+*/
+   unsigned long i, nr_pages = size >> PAGE_SHIFT;
+   struct page *page = pfn_to_online_page(pfn);
+
+   if (!page)
+   return -EFAULT;
+
+   for (i = 0; i < nr_pages; i++, page++) {
+   /*
+* There is a potential (but very unlikely) race
+* between two VMs which are sharing a physical page
+* entering this at the same time. However by splitting
+* the test/set the only risk is tags being overwritten
+* by the mte_clear_page_tags() call.
+*/
+   if (!test_bit(PG_mte_tagged, &page->flags)) {
+   mte_clear_page_tags(page_address(page));
+   set_bit(PG_mte_tagged, &page->flags);
+   }
+   }
+   }
+
+   return 0;
+}
+
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
  struct kvm_memory_slot *memslot, unsigned long hva,
  unsigned long fault_status)
@@ -971,8 +1007,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
if (writable)
prot |= KVM_PGTABLE_PROT_W;
 
-   if (f

[PATCH v13 3/8] arm64: mte: Sync tags for pages where PTE is untagged

2021-05-24 Thread Steven Price
A KVM guest could store tags in a page even if the VMM hasn't mapped
the page with PROT_MTE. So when restoring pages from swap we will
need to check to see if there are any saved tags even if !pte_tagged().

However don't check pages for which pte_access_permitted() returns false
as these will not have been swapped out.

Signed-off-by: Steven Price 
---
 arch/arm64/include/asm/mte.h |  4 ++--
 arch/arm64/include/asm/pgtable.h | 22 +++---
 arch/arm64/kernel/mte.c  | 16 
 3 files changed, 33 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
index bc88a1ced0d7..347ef38a35f7 100644
--- a/arch/arm64/include/asm/mte.h
+++ b/arch/arm64/include/asm/mte.h
@@ -37,7 +37,7 @@ void mte_free_tag_storage(char *storage);
 /* track which pages have valid allocation tags */
 #define PG_mte_tagged  PG_arch_2
 
-void mte_sync_tags(pte_t *ptep, pte_t pte);
+void mte_sync_tags(pte_t old_pte, pte_t pte);
 void mte_copy_page_tags(void *kto, const void *kfrom);
 void mte_thread_init_user(void);
 void mte_thread_switch(struct task_struct *next);
@@ -53,7 +53,7 @@ int mte_ptrace_copy_tags(struct task_struct *child, long 
request,
 /* unused if !CONFIG_ARM64_MTE, silence the compiler */
 #define PG_mte_tagged  0
 
-static inline void mte_sync_tags(pte_t *ptep, pte_t pte)
+static inline void mte_sync_tags(pte_t old_pte, pte_t pte)
 {
 }
 static inline void mte_copy_page_tags(void *kto, const void *kfrom)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 0b10204e72fc..db5402168841 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -314,9 +314,25 @@ static inline void set_pte_at(struct mm_struct *mm, 
unsigned long addr,
if (pte_present(pte) && pte_user_exec(pte) && !pte_special(pte))
__sync_icache_dcache(pte);
 
-   if (system_supports_mte() &&
-   pte_present(pte) && pte_tagged(pte) && !pte_special(pte))
-   mte_sync_tags(ptep, pte);
+   /*
+* If the PTE would provide user space access to the tags associated
+* with it then ensure that the MTE tags are synchronised.  Although
+* pte_access_permitted() returns false for exec only mappings, they
+* don't expose tags (instruction fetches don't check tags).
+*/
+   if (system_supports_mte() && pte_access_permitted(pte, false) &&
+   !pte_special(pte)) {
+   pte_t old_pte = READ_ONCE(*ptep);
+   /*
+* We only need to synchronise if the new PTE has tags enabled
+* or if swapping in (in which case another mapping may have
+* set tags in the past even if this PTE isn't tagged).
+* (!pte_none() && !pte_present()) is an open coded version of
+* is_swap_pte()
+*/
+   if (pte_tagged(pte) || (!pte_none(old_pte) && 
!pte_present(old_pte)))
+   mte_sync_tags(old_pte, pte);
+   }
 
__check_racy_pte_update(mm, ptep, pte);
 
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index 45fac0e9c323..ae0a3c68fece 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -33,10 +33,10 @@ DEFINE_STATIC_KEY_FALSE(mte_async_mode);
 EXPORT_SYMBOL_GPL(mte_async_mode);
 #endif
 
-static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
+static void mte_sync_page_tags(struct page *page, pte_t old_pte,
+  bool check_swap, bool pte_is_tagged)
 {
unsigned long flags;
-   pte_t old_pte = READ_ONCE(*ptep);
 
   spin_lock_irqsave(&tag_sync_lock, flags);
 
@@ -53,6 +53,9 @@ static void mte_sync_page_tags(struct page *page, pte_t 
*ptep, bool check_swap)
}
}
 
+   if (!pte_is_tagged)
+   goto out;
+
page_kasan_tag_reset(page);
/*
 * We need smp_wmb() in between setting the flags and clearing the
@@ -69,17 +72,22 @@ static void mte_sync_page_tags(struct page *page, pte_t 
*ptep, bool check_swap)
   spin_unlock_irqrestore(&tag_sync_lock, flags);
 }
 
-void mte_sync_tags(pte_t *ptep, pte_t pte)
+void mte_sync_tags(pte_t old_pte, pte_t pte)
 {
struct page *page = pte_page(pte);
long i, nr_pages = compound_nr(page);
bool check_swap = nr_pages == 1;
bool pte_is_tagged = pte_tagged(pte);
 
+   /* Early out if there's nothing to do */
+   if (!check_swap && !pte_is_tagged)
+   return;
+
/* if PG_mte_tagged is set, tags have already been initialised */
for (i = 0; i < nr_pages; i++, page++) {
        if (!test_bit(PG_mte_tagged, &page->flags))
-   mte_sync_page_tags(page, ptep, check_swap);
+   mte_sync_page_tags(page, old_pte, check_swap,
+  pte_is_tagged);
}
 }
 
-- 
2.20.1




[PATCH v13 2/8] arm64: Handle MTE tags zeroing in __alloc_zeroed_user_highpage()

2021-05-24 Thread Steven Price
From: Catalin Marinas 

Currently, on an anonymous page fault, the kernel allocates a zeroed
page and maps it in user space. If the mapping is tagged (PROT_MTE),
set_pte_at() additionally clears the tags under a spinlock to avoid a
race on the page->flags. In order to optimise the lock, clear the page
tags on allocation in __alloc_zeroed_user_highpage() if the vma flags
have VM_MTE set.

Signed-off-by: Catalin Marinas 
Signed-off-by: Steven Price 
---
 arch/arm64/include/asm/page.h |  6 --
 arch/arm64/mm/fault.c | 21 +
 2 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
index 012cffc574e8..97853570d0f1 100644
--- a/arch/arm64/include/asm/page.h
+++ b/arch/arm64/include/asm/page.h
@@ -13,6 +13,7 @@
 #ifndef __ASSEMBLY__
 
 #include  /* for READ_IMPLIES_EXEC */
+#include 
 #include 
 
 struct page;
@@ -28,8 +29,9 @@ void copy_user_highpage(struct page *to, struct page *from,
 void copy_highpage(struct page *to, struct page *from);
 #define __HAVE_ARCH_COPY_HIGHPAGE
 
-#define __alloc_zeroed_user_highpage(movableflags, vma, vaddr) \
-   alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vaddr)
+struct page *__alloc_zeroed_user_highpage(gfp_t movableflags,
+ struct vm_area_struct *vma,
+ unsigned long vaddr);
 #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
 
 #define clear_user_page(page, vaddr, pg)   clear_page(page)
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 871c82ab0a30..5a03428e97f3 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -921,3 +921,24 @@ void do_debug_exception(unsigned long addr_if_watchpoint, 
unsigned int esr,
debug_exception_exit(regs);
 }
 NOKPROBE_SYMBOL(do_debug_exception);
+
+/*
+ * Used during anonymous page fault handling.
+ */
+struct page *__alloc_zeroed_user_highpage(gfp_t movableflags,
+ struct vm_area_struct *vma,
+ unsigned long vaddr)
+{
+   struct page *page;
+   bool tagged = system_supports_mte() && (vma->vm_flags & VM_MTE);
+
+   page = alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma,
+ vaddr);
+   if (tagged && page) {
+   mte_clear_page_tags(page_address(page));
+   page_kasan_tag_reset(page);
+   set_bit(PG_mte_tagged, &page->flags);
+   }
+
+   return page;
+}
-- 
2.20.1




[PATCH v13 8/8] KVM: arm64: Document MTE capability and ioctl

2021-05-24 Thread Steven Price
A new capability (KVM_CAP_ARM_MTE) identifies that the kernel supports
granting a guest access to the tags, and provides a mechanism for the
VMM to enable it.

A new ioctl (KVM_ARM_MTE_COPY_TAGS) provides a simple way for a VMM to
access the tags of a guest without having to maintain a PROT_MTE mapping
in userspace. The above capability gates access to the ioctl.

Signed-off-by: Steven Price 
---
 Documentation/virt/kvm/api.rst | 52 ++
 1 file changed, 52 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 22d077562149..ab45d7fe2aa5 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -5034,6 +5034,37 @@ see KVM_XEN_VCPU_SET_ATTR above.
 The KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST type may not be used
 with the KVM_XEN_VCPU_GET_ATTR ioctl.
 
+4.130 KVM_ARM_MTE_COPY_TAGS
+---
+
+:Capability: KVM_CAP_ARM_MTE
+:Architectures: arm64
+:Type: vm ioctl
+:Parameters: struct kvm_arm_copy_mte_tags
+:Returns: 0 on success, < 0 on error
+
+::
+
+  struct kvm_arm_copy_mte_tags {
+   __u64 guest_ipa;
+   __u64 length;
+   void __user *addr;
+   __u64 flags;
+   __u64 reserved[2];
+  };
+
+Copies Memory Tagging Extension (MTE) tags to/from guest tag memory. The
+``guest_ipa`` and ``length`` fields must be ``PAGE_SIZE`` aligned. The ``addr``
+field must point to a buffer which the tags will be copied to or from.
+
+``flags`` specifies the direction of copy, either ``KVM_ARM_TAGS_TO_GUEST`` or
+``KVM_ARM_TAGS_FROM_GUEST``.
+
+The size of the buffer to store the tags is ``(length / 16)`` bytes
+(granules in MTE are 16 bytes long). Each byte contains a single tag
+value. This matches the format of ``PTRACE_PEEKMTETAGS`` and
+``PTRACE_POKEMTETAGS``.
+
 5. The kvm_run structure
 
 
@@ -6362,6 +6393,27 @@ default.
 
 See Documentation/x86/sgx/2.Kernel-internals.rst for more details.
 
+7.26 KVM_CAP_ARM_MTE
+
+
+:Architectures: arm64
+:Parameters: none
+
+This capability indicates that KVM (and the hardware) supports exposing the
+Memory Tagging Extensions (MTE) to the guest. It must also be enabled by the
+VMM before creating any VCPUs to allow the guest access. Note that MTE is only
+available to a guest running in AArch64 mode and enabling this capability will
+cause attempts to create AArch32 VCPUs to fail.
+
+When enabled the guest is able to access tags associated with any memory given
+to the guest. KVM will ensure that the pages are flagged ``PG_mte_tagged`` so
+that the tags are maintained during swap or hibernation of the host; however
+the VMM needs to manually save/restore the tags as appropriate if the VM is
+migrated.
+
+When enabled the VMM may make use of the ``KVM_ARM_MTE_COPY_TAGS`` ioctl to
+perform a bulk copy of tags to/from the guest.
+
 8. Other capabilities.
 ==
 
-- 
2.20.1
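
To make the buffer-size rule concrete, a hypothetical VMM helper for saving
tags ahead of migration might look like this (sketch only; the helper name
and error handling are assumptions, not part of the documented API):

#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* One tag byte per 16-byte granule, hence a length / 16 byte buffer. */
static void *save_guest_tags(int vm_fd, __u64 guest_ipa, __u64 length)
{
	void *buf = malloc(length / 16);
	struct kvm_arm_copy_mte_tags args = {
		.guest_ipa = guest_ipa,		/* must be PAGE_SIZE aligned */
		.length    = length,		/* must be PAGE_SIZE aligned */
		.addr      = buf,
		.flags     = KVM_ARM_TAGS_FROM_GUEST,
	};

	if (!buf || ioctl(vm_fd, KVM_ARM_MTE_COPY_TAGS, &args) < 0) {
		free(buf);
		return NULL;
	}
	return buf;	/* restore later with KVM_ARM_TAGS_TO_GUEST */
}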




[PATCH v13 0/8] MTE support for KVM guest

2021-05-24 Thread Steven Price
This series adds support for using the Arm Memory Tagging Extensions
(MTE) in a KVM guest.

Changes since v12[1]:

 * Use DEFINE_SPINLOCK() to define tag_sync_lock.

 * Refactor mte_sync_tags() to take the old PTE value rather than a
  pointer to the PTE. The checks in set_pte_at() are also strengthened to
   avoid the function call when possible.

 * Fix prefix on a couple of patches ("arm64: kvm" -> "KVM: arm64").

 * Reorder arguments to sanitise_mte_tags() ("size, pfn" -> "pfn,
   size").

 * Add/improve comments in several places.

 * Report the host's sanitised version of ID_AA64PFR1_EL1:MTE rather
   than making up one for the guest.

 * Insert ISB at the end of mte_switch_to_hyp macro.

 * Drop the definition of CPU_TFSRE0_EL1 in asm-offsets.c as it isn't
   used anymore.

 * Prevent creation of 32 bit vCPUs when MTE is enabled for the guest
   (and document it).

 * Move kvm_vm_ioctl_mte_copy_tags() to guest.c.

 * Reject ZONE_DEVICE memory in kvm_vm_ioctl_mte_copy_tags() and
   correctly handle pages where PG_mte_tagged hasn't been set yet.

 * Define MTE_GRANULES_PER_PAGE rather than open coding the division
   PAGE_SIZE / MTE_GRANULE_SIZE.

 * Correct the definition of struct kvm_arm_copy_mte_tags in the docs.
   Also avoid mentioning MTE_GRANULE_SIZE as it isn't exported to
   userspace.

[1] https://lore.kernel.org/r/20210517123239.8025-1-steven.pr...@arm.com/

Catalin Marinas (1):
  arm64: Handle MTE tags zeroing in __alloc_zeroed_user_highpage()

Steven Price (7):
  arm64: mte: Handle race when synchronising tags
  arm64: mte: Sync tags for pages where PTE is untagged
  KVM: arm64: Introduce MTE VM feature
  KVM: arm64: Save/restore MTE registers
  KVM: arm64: Expose KVM_ARM_CAP_MTE
  KVM: arm64: ioctl to fetch/store tags in a guest
  KVM: arm64: Document MTE capability and ioctl

 Documentation/virt/kvm/api.rst | 52 ++
 arch/arm64/include/asm/kvm_emulate.h   |  3 +
 arch/arm64/include/asm/kvm_host.h  | 12 
 arch/arm64/include/asm/kvm_mte.h   | 68 +++
 arch/arm64/include/asm/mte-def.h   |  1 +
 arch/arm64/include/asm/mte.h   |  4 +-
 arch/arm64/include/asm/page.h  |  6 +-
 arch/arm64/include/asm/pgtable.h   | 22 +-
 arch/arm64/include/asm/sysreg.h|  3 +-
 arch/arm64/include/uapi/asm/kvm.h  | 11 +++
 arch/arm64/kernel/asm-offsets.c|  2 +
 arch/arm64/kernel/mte.c| 37 --
 arch/arm64/kvm/arm.c   | 16 +
 arch/arm64/kvm/guest.c | 79 ++
 arch/arm64/kvm/hyp/entry.S |  7 ++
 arch/arm64/kvm/hyp/exception.c |  3 +-
 arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 21 ++
 arch/arm64/kvm/mmu.c   | 48 -
 arch/arm64/kvm/reset.c |  3 +-
 arch/arm64/kvm/sys_regs.c  | 32 +++--
 arch/arm64/mm/fault.c  | 21 ++
 include/uapi/linux/kvm.h   |  2 +
 22 files changed, 431 insertions(+), 22 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_mte.h

-- 
2.20.1




[PATCH v13 1/8] arm64: mte: Handle race when synchronising tags

2021-05-24 Thread Steven Price
mte_sync_tags() used test_and_set_bit() to set the PG_mte_tagged flag
before restoring/zeroing the MTE tags. However if another thread were to
race and attempt to sync the tags on the same page before the first
thread had completed restoring/zeroing then it would see the flag is
already set and continue without waiting. This would potentially expose
the previous contents of the tags to user space, and cause any updates
that user space makes before the restoring/zeroing has completed to
potentially be lost.

Since this code is run from atomic contexts we can't just lock the page
during the process. Instead implement a new (global) spinlock to protect
the mte_sync_page_tags() function.

Fixes: 34bfeea4a9e9 ("arm64: mte: Clear the tags when a page is mapped in 
user-space with PROT_MTE")
Reviewed-by: Catalin Marinas 
Signed-off-by: Steven Price 
---
---
 arch/arm64/kernel/mte.c | 21 ++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index 125a10e413e9..45fac0e9c323 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -25,6 +25,7 @@
 u64 gcr_kernel_excl __ro_after_init;
 
 static bool report_fault_once = true;
+static DEFINE_SPINLOCK(tag_sync_lock);
 
 #ifdef CONFIG_KASAN_HW_TAGS
 /* Whether the MTE asynchronous mode is enabled. */
@@ -34,13 +35,22 @@ EXPORT_SYMBOL_GPL(mte_async_mode);
 
 static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
 {
+   unsigned long flags;
pte_t old_pte = READ_ONCE(*ptep);
 
+   spin_lock_irqsave(&tag_sync_lock, flags);
+
+   /* Recheck with the lock held */
+   if (test_bit(PG_mte_tagged, &page->flags))
+   goto out;
+
if (check_swap && is_swap_pte(old_pte)) {
swp_entry_t entry = pte_to_swp_entry(old_pte);
 
-   if (!non_swap_entry(entry) && mte_restore_tags(entry, page))
-   return;
+   if (!non_swap_entry(entry) && mte_restore_tags(entry, page)) {
+   set_bit(PG_mte_tagged, &page->flags);
+   goto out;
+   }
}
 
page_kasan_tag_reset(page);
@@ -53,6 +63,10 @@ static void mte_sync_page_tags(struct page *page, pte_t 
*ptep, bool check_swap)
 */
smp_wmb();
mte_clear_page_tags(page_address(page));
+   set_bit(PG_mte_tagged, &page->flags);
+
+out:
+   spin_unlock_irqrestore(&tag_sync_lock, flags);
 }
 
 void mte_sync_tags(pte_t *ptep, pte_t pte)
@@ -60,10 +74,11 @@ void mte_sync_tags(pte_t *ptep, pte_t pte)
struct page *page = pte_page(pte);
long i, nr_pages = compound_nr(page);
bool check_swap = nr_pages == 1;
+   bool pte_is_tagged = pte_tagged(pte);
 
/* if PG_mte_tagged is set, tags have already been initialised */
for (i = 0; i < nr_pages; i++, page++) {
-   if (!test_and_set_bit(PG_mte_tagged, &page->flags))
+   if (!test_bit(PG_mte_tagged, &page->flags))
mte_sync_page_tags(page, ptep, check_swap);
}
 }
-- 
2.20.1




Re: [PATCH v12 7/8] KVM: arm64: ioctl to fetch/store tags in a guest

2021-05-21 Thread Steven Price
On 20/05/2021 18:27, Catalin Marinas wrote:
> On Thu, May 20, 2021 at 04:58:01PM +0100, Steven Price wrote:
>> On 20/05/2021 13:05, Catalin Marinas wrote:
>>> On Mon, May 17, 2021 at 01:32:38PM +0100, Steven Price wrote:
>>>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>>>> index e89a5e275e25..4b6c83beb75d 100644
>>>> --- a/arch/arm64/kvm/arm.c
>>>> +++ b/arch/arm64/kvm/arm.c
>>>> @@ -1309,6 +1309,65 @@ static int kvm_vm_ioctl_set_device_addr(struct kvm 
>>>> *kvm,
>>>>}
>>>>  }
>>>>  
>>>> +static int kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
>>>> +struct kvm_arm_copy_mte_tags *copy_tags)
>>>> +{
>>>> +  gpa_t guest_ipa = copy_tags->guest_ipa;
>>>> +  size_t length = copy_tags->length;
>>>> +  void __user *tags = copy_tags->addr;
>>>> +  gpa_t gfn;
>>>> +  bool write = !(copy_tags->flags & KVM_ARM_TAGS_FROM_GUEST);
>>>> +  int ret = 0;
>>>> +
>>>> +  if (copy_tags->reserved[0] || copy_tags->reserved[1])
>>>> +  return -EINVAL;
>>>> +
>>>> +  if (copy_tags->flags & ~KVM_ARM_TAGS_FROM_GUEST)
>>>> +  return -EINVAL;
>>>> +
>>>> +  if (length & ~PAGE_MASK || guest_ipa & ~PAGE_MASK)
>>>> +  return -EINVAL;
>>>> +
>>>> +  gfn = gpa_to_gfn(guest_ipa);
>>>> +
>>>> +  mutex_lock(&kvm->slots_lock);
>>>> +
>>>> +  while (length > 0) {
>>>> +  kvm_pfn_t pfn = gfn_to_pfn_prot(kvm, gfn, write, NULL);
>>>> +  void *maddr;
>>>> +  unsigned long num_tags = PAGE_SIZE / MTE_GRANULE_SIZE;
>>>> +
>>>> +  if (is_error_noslot_pfn(pfn)) {
>>>> +  ret = -EFAULT;
>>>> +  goto out;
>>>> +  }
>>>> +
>>>> +  maddr = page_address(pfn_to_page(pfn));
>>>> +
>>>> +  if (!write) {
>>>> +  num_tags = mte_copy_tags_to_user(tags, maddr, num_tags);
>>>> +  kvm_release_pfn_clean(pfn);
>>>
>>> Do we need to check if PG_mte_tagged is set? If the page was not faulted
>>> into the guest address space but the VMM has the page, does the
>>> gfn_to_pfn_prot() guarantee that a kvm_set_spte_gfn() was called? If
>>> not, this may read stale tags.
>>
>> Ah, I hadn't thought about that... No I don't believe gfn_to_pfn_prot()
>> will fault it into the guest.
> 
> It doesn't indeed. What it does is a get_user_pages() but it's not of
> much help since the VMM pte wouldn't be tagged (we would have solved
> lots of problems if we required PROT_MTE in the VMM...)

Sadly it solves some problems and creates others :(

>>>> +  } else {
>>>> +  num_tags = mte_copy_tags_from_user(maddr, tags,
>>>> + num_tags);
>>>> +  kvm_release_pfn_dirty(pfn);
>>>> +  }
>>>
>>> Same question here, if the we can't guarantee the stage 2 pte being set,
>>> we'd need to set PG_mte_tagged.
>>
>> This is arguably worse as we'll be writing tags into the guest but
>> without setting PG_mte_tagged - so they'll be lost when the guest then
>> faults the pages in. Which sounds like it should break migration.
>>
>> I think the below should be safe, and avoids the overhead of setting the
>> flag just for reads.
>>
>> Thanks,
>>
>> Steve
>>
>> 8<
>>  page = pfn_to_page(pfn);
>>  maddr = page_address(page);
>>
>>  if (!write) {
>>  if (test_bit(PG_mte_tagged, &page->flags))
>>  num_tags = mte_copy_tags_to_user(tags, maddr,
>>  MTE_GRANULES_PER_PAGE);
>>  else
>>  /* No tags in memory, so write zeros */
>>  num_tags = MTE_GRANULES_PER_PAGE -
>>  clear_user(tag, MTE_GRANULES_PER_PAGE);
>>  kvm_release_pfn_clean(pfn);
> 
> For ptrace we return a -EOPNOTSUPP if the address doesn't have VM_MTE
> but I don't think it makes sense here, so I'm fine with clearing the
> destination and assuming that the ta

Re: [PATCH v12 4/8] arm64: kvm: Introduce MTE VM feature

2021-05-21 Thread Steven Price
On 20/05/2021 18:50, Catalin Marinas wrote:
> On Thu, May 20, 2021 at 04:05:46PM +0100, Steven Price wrote:
>> On 20/05/2021 12:54, Catalin Marinas wrote:
>>> On Mon, May 17, 2021 at 01:32:35PM +0100, Steven Price wrote:
>>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>>> index c5d1f3c87dbd..8660f6a03f51 100644
>>>> --- a/arch/arm64/kvm/mmu.c
>>>> +++ b/arch/arm64/kvm/mmu.c
>>>> @@ -822,6 +822,31 @@ transparent_hugepage_adjust(struct kvm_memory_slot 
>>>> *memslot,
>>>>return PAGE_SIZE;
>>>>  }
>>>>  
>>>> +static int sanitise_mte_tags(struct kvm *kvm, unsigned long size,
>>>> +   kvm_pfn_t pfn)
>>>> +{
>>>> +  if (kvm_has_mte(kvm)) {
>>>> +  /*
>>>> +   * The page will be mapped in stage 2 as Normal Cacheable, so
>>>> +   * the VM will be able to see the page's tags and therefore
>>>> +   * they must be initialised first. If PG_mte_tagged is set,
>>>> +   * tags have already been initialised.
>>>> +   */
>>>> +  unsigned long i, nr_pages = size >> PAGE_SHIFT;
>>>> +  struct page *page = pfn_to_online_page(pfn);
>>>> +
>>>> +  if (!page)
>>>> +  return -EFAULT;
>>>
>>> IIRC we ended up with pfn_to_online_page() to reject ZONE_DEVICE pages
>>> that may be mapped into a guest and we have no idea whether they support
>>> MTE. It may be worth adding a comment, otherwise, as Marc said, the page
>>> wouldn't disappear.
>>
>> I'll add a comment.
>>
>>>> +
>>>> +  for (i = 0; i < nr_pages; i++, page++) {
>>>> +  if (!test_and_set_bit(PG_mte_tagged, &page->flags))
>>>> +  mte_clear_page_tags(page_address(page));
>>>
>>> We started the page->flags thread and ended up fixing it for the host
>>> set_pte_at() as per the first patch:
>>>
>>> https://lore.kernel.org/r/c3293d47-a5f2-ea4a-6730-f5cae26d8...@arm.com
>>>
>>> Now, can we have a race between the stage 2 kvm_set_spte_gfn() and a
>>> stage 1 set_pte_at()? Only the latter takes a lock. Or between two
>>> kvm_set_spte_gfn() in different VMs? I think in the above thread we
>>> concluded that there's only a problem if the page is shared between
>>> multiple VMMs (MAP_SHARED). How realistic is this and what's the
>>> workaround?
>>>
>>> Either way, I think it's worth adding a comment here on the race on
>>> page->flags as it looks strange that here it's just a test_and_set_bit()
>>> while set_pte_at() uses a spinlock.
>>>
>>
>> Very good point! I should have thought about that. I think splitting the
>> test_and_set_bit() in two (as with the cache flush) is sufficient. While
>> there technically still is a race which could lead to user space tags
>> being clobbered:
>>
>> a) It's very odd for a VMM to be doing an mprotect() after the fact to
>> add PROT_MTE, or to be sharing the memory with another process which
>> sets PROT_MTE.
>>
>> b) The window for the race is incredibly small and the VMM (generally)
>> needs to be robust against the guest changing tags anyway.
>>
>> But I'll add a comment here as well:
>>
>>  /*
>>   * There is a potential race between sanitising the
>>   * flags here and user space using mprotect() to add
>>   * PROT_MTE to access the tags, however by splitting
>>   * the test/set the only risk is user space tags
>>   * being overwritten by the mte_clear_page_tags() call.
>>   */
> 
> I think (well, I haven't re-checked), an mprotect() in the VMM ends up
> calling set_pte_at_notify() which would call kvm_set_spte_gfn() and that
> will map the page in the guest. So the problem only appears between
> different VMMs sharing the same page. In principle they can be
> MAP_PRIVATE but they'd be CoW so the race wouldn't matter. So it's left
> with MAP_SHARED between multiple VMMs.

mprotect.c only has a call to set_pte_at() not set_pte_at_notify(). And
AFAICT the MMU notifiers are called to invalidate only in
change_pmd_range(). So the stage 2 mappings would be invalidated rather
than populated. However I believe this should cause synchronisation
because of the KVM mmu_lock. So from my reading you are right an
mprotect() can't race.

MAP_SHARED between multiple VMs is then the only potential problem.

> I think we should just state that this is unsafe and they can delete
> each-others tags. If we are really worried, we can export that lock you
> added in mte.c.
> 

I'll just update the comment for now.

Thanks,

Steve



Re: [PATCH v12 7/8] KVM: arm64: ioctl to fetch/store tags in a guest

2021-05-20 Thread Steven Price
On 20/05/2021 13:05, Catalin Marinas wrote:
> On Mon, May 17, 2021 at 01:32:38PM +0100, Steven Price wrote:
>> diff --git a/arch/arm64/include/uapi/asm/kvm.h 
>> b/arch/arm64/include/uapi/asm/kvm.h
>> index 24223adae150..b3edde68bc3e 100644
>> --- a/arch/arm64/include/uapi/asm/kvm.h
>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>> @@ -184,6 +184,17 @@ struct kvm_vcpu_events {
>>  __u32 reserved[12];
>>  };
>>  
>> +struct kvm_arm_copy_mte_tags {
>> +__u64 guest_ipa;
>> +__u64 length;
>> +void __user *addr;
>> +__u64 flags;
>> +__u64 reserved[2];
> 
> I forgot the past discussions, what's the reserved for? Future
> expansion?

Yes - for future expansion. Marc asked for them[1]:

> I'd be keen on a couple of reserved __64s. Just in case...

[1] https://lore.kernel.org/r/87ft14xl9e.wl-maz%40kernel.org

>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index e89a5e275e25..4b6c83beb75d 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -1309,6 +1309,65 @@ static int kvm_vm_ioctl_set_device_addr(struct kvm 
>> *kvm,
>>  }
>>  }
>>  
>> +static int kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
>> +  struct kvm_arm_copy_mte_tags *copy_tags)
>> +{
>> +gpa_t guest_ipa = copy_tags->guest_ipa;
>> +size_t length = copy_tags->length;
>> +void __user *tags = copy_tags->addr;
>> +gpa_t gfn;
>> +bool write = !(copy_tags->flags & KVM_ARM_TAGS_FROM_GUEST);
>> +int ret = 0;
>> +
>> +if (copy_tags->reserved[0] || copy_tags->reserved[1])
>> +return -EINVAL;
>> +
>> +if (copy_tags->flags & ~KVM_ARM_TAGS_FROM_GUEST)
>> +return -EINVAL;
>> +
>> +if (length & ~PAGE_MASK || guest_ipa & ~PAGE_MASK)
>> +return -EINVAL;
>> +
>> +gfn = gpa_to_gfn(guest_ipa);
>> +
>> +mutex_lock(&kvm->slots_lock);
>> +
>> +while (length > 0) {
>> +kvm_pfn_t pfn = gfn_to_pfn_prot(kvm, gfn, write, NULL);
>> +void *maddr;
>> +unsigned long num_tags = PAGE_SIZE / MTE_GRANULE_SIZE;
>> +
>> +if (is_error_noslot_pfn(pfn)) {
>> +ret = -EFAULT;
>> +goto out;
>> +}
>> +
>> +maddr = page_address(pfn_to_page(pfn));
>> +
>> +if (!write) {
>> +num_tags = mte_copy_tags_to_user(tags, maddr, num_tags);
>> +kvm_release_pfn_clean(pfn);
> 
> Do we need to check if PG_mte_tagged is set? If the page was not faulted
> into the guest address space but the VMM has the page, does the
> gfn_to_pfn_prot() guarantee that a kvm_set_spte_gfn() was called? If
> not, this may read stale tags.

Ah, I hadn't thought about that... No I don't believe gfn_to_pfn_prot()
will fault it into the guest.

>> +} else {
>> +num_tags = mte_copy_tags_from_user(maddr, tags,
>> +   num_tags);
>> +kvm_release_pfn_dirty(pfn);
>> +}
> 
> Same question here, if the we can't guarantee the stage 2 pte being set,
> we'd need to set PG_mte_tagged.

This is arguably worse as we'll be writing tags into the guest but
without setting PG_mte_tagged - so they'll be lost when the guest then
faults the pages in. Which sounds like it should break migration.

I think the below should be safe, and avoids the overhead of setting the
flag just for reads.

Thanks,

Steve

8<
		page = pfn_to_page(pfn);
		maddr = page_address(page);

		if (!write) {
			if (test_bit(PG_mte_tagged, &page->flags))
				num_tags = mte_copy_tags_to_user(tags, maddr,
							MTE_GRANULES_PER_PAGE);
			else
				/* No tags in memory, so write zeros */
				num_tags = MTE_GRANULES_PER_PAGE -
					clear_user(tags, MTE_GRANULES_PER_PAGE);
			kvm_release_pfn_clean(pfn);
		} else {
			num_tags = mte_copy_tags_from_user(maddr, tags,
							MTE_GRANULES_PER_PAGE);
			kvm_release_pfn_dirty(pfn);
		}

		if (num_tags != MTE_GRANULES_PER_PAGE) {
			ret = -EFAULT;
			goto out;
		}

		/* Only set the flag once the tags have been fully written */
		if (write)
			test_and_set_bit(PG_mte_tagged, &page->flags);



Re: [PATCH v12 5/8] arm64: kvm: Save/restore MTE registers

2021-05-20 Thread Steven Price
On 20/05/2021 10:46, Marc Zyngier wrote:
> On Wed, 19 May 2021 14:04:20 +0100,
> Steven Price  wrote:
>>
>> On 17/05/2021 18:17, Marc Zyngier wrote:
>>> On Mon, 17 May 2021 13:32:36 +0100,
>>> Steven Price  wrote:
>>>>
>>>> Define the new system registers that MTE introduces and context switch
>>>> them. The MTE feature is still hidden from the ID register as it isn't
>>>> supported in a VM yet.
>>>>
>>>> Signed-off-by: Steven Price 
>>>> ---
>>>>  arch/arm64/include/asm/kvm_host.h  |  6 ++
>>>>  arch/arm64/include/asm/kvm_mte.h   | 66 ++
>>>>  arch/arm64/include/asm/sysreg.h|  3 +-
>>>>  arch/arm64/kernel/asm-offsets.c|  3 +
>>>>  arch/arm64/kvm/hyp/entry.S |  7 +++
>>>>  arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 21 +++
>>>>  arch/arm64/kvm/sys_regs.c  | 22 ++--
>>>>  7 files changed, 123 insertions(+), 5 deletions(-)
>>>>  create mode 100644 arch/arm64/include/asm/kvm_mte.h
>>>>
>>>> diff --git a/arch/arm64/include/asm/kvm_host.h 
>>>> b/arch/arm64/include/asm/kvm_host.h
>>>> index afaa5333f0e4..309e36cc1b42 100644
>>>> --- a/arch/arm64/include/asm/kvm_host.h
>>>> +++ b/arch/arm64/include/asm/kvm_host.h
>>>> @@ -208,6 +208,12 @@ enum vcpu_sysreg {
>>>>CNTP_CVAL_EL0,
>>>>CNTP_CTL_EL0,
>>>>  
>>>> +  /* Memory Tagging Extension registers */
>>>> +  RGSR_EL1,   /* Random Allocation Tag Seed Register */
>>>> +  GCR_EL1,/* Tag Control Register */
>>>> +  TFSR_EL1,   /* Tag Fault Status Register (EL1) */
>>>> +  TFSRE0_EL1, /* Tag Fault Status Register (EL0) */
>>>> +
>>>>/* 32bit specific registers. Keep them at the end of the range */
>>>>DACR32_EL2, /* Domain Access Control Register */
>>>>IFSR32_EL2, /* Instruction Fault Status Register */
>>>> diff --git a/arch/arm64/include/asm/kvm_mte.h 
>>>> b/arch/arm64/include/asm/kvm_mte.h
>>>> new file mode 100644
>>>> index ..6541c7d6ce06
>>>> --- /dev/null
>>>> +++ b/arch/arm64/include/asm/kvm_mte.h
>>>> @@ -0,0 +1,66 @@
>>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>>> +/*
>>>> + * Copyright (C) 2020 ARM Ltd.
>>>> + */
>>>> +#ifndef __ASM_KVM_MTE_H
>>>> +#define __ASM_KVM_MTE_H
>>>> +
>>>> +#ifdef __ASSEMBLY__
>>>> +
>>>> +#include 
>>>> +
>>>> +#ifdef CONFIG_ARM64_MTE
>>>> +
>>>> +.macro mte_switch_to_guest g_ctxt, h_ctxt, reg1
>>>> +alternative_if_not ARM64_MTE
>>>> +  b   .L__skip_switch\@
>>>> +alternative_else_nop_endif
>>>> +  mrs \reg1, hcr_el2
>>>> +  and \reg1, \reg1, #(HCR_ATA)
>>>> +  cbz \reg1, .L__skip_switch\@
>>>> +
>>>> +  mrs_s   \reg1, SYS_RGSR_EL1
>>>> +  str \reg1, [\h_ctxt, #CPU_RGSR_EL1]
>>>> +  mrs_s   \reg1, SYS_GCR_EL1
>>>> +  str \reg1, [\h_ctxt, #CPU_GCR_EL1]
>>>> +
>>>> +  ldr \reg1, [\g_ctxt, #CPU_RGSR_EL1]
>>>> +  msr_s   SYS_RGSR_EL1, \reg1
>>>> +  ldr \reg1, [\g_ctxt, #CPU_GCR_EL1]
>>>> +  msr_s   SYS_GCR_EL1, \reg1
>>>> +
>>>> +.L__skip_switch\@:
>>>> +.endm
>>>> +
>>>> +.macro mte_switch_to_hyp g_ctxt, h_ctxt, reg1
>>>> +alternative_if_not ARM64_MTE
>>>> +  b   .L__skip_switch\@
>>>> +alternative_else_nop_endif
>>>> +  mrs \reg1, hcr_el2
>>>> +  and \reg1, \reg1, #(HCR_ATA)
>>>> +  cbz \reg1, .L__skip_switch\@
>>>> +
>>>> +  mrs_s   \reg1, SYS_RGSR_EL1
>>>> +  str \reg1, [\g_ctxt, #CPU_RGSR_EL1]
>>>> +  mrs_s   \reg1, SYS_GCR_EL1
>>>> +  str \reg1, [\g_ctxt, #CPU_GCR_EL1]
>>>> +
>>>> +  ldr \reg1, [\h_ctxt, #CPU_RGSR_EL1]
>>>> +  msr_s   SYS_RGSR_EL1, \reg1
>>>> +  ldr \reg1, [\h_ctxt, #CPU_GCR_EL1]
>>>> +  msr_s   SYS_GCR_EL1, \reg1
>>>
>>> What is the rational for not having any synchronisation here? It is
>>> quite uncommon to allocate memory at EL2, but VHE can perform all kind
>>> of tricks.
>>
>> I don't follow. This 

Re: [PATCH v12 4/8] arm64: kvm: Introduce MTE VM feature

2021-05-20 Thread Steven Price
On 20/05/2021 12:54, Catalin Marinas wrote:
> On Mon, May 17, 2021 at 01:32:35PM +0100, Steven Price wrote:
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index c5d1f3c87dbd..8660f6a03f51 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -822,6 +822,31 @@ transparent_hugepage_adjust(struct kvm_memory_slot 
>> *memslot,
>>  return PAGE_SIZE;
>>  }
>>  
>> +static int sanitise_mte_tags(struct kvm *kvm, unsigned long size,
>> + kvm_pfn_t pfn)
>> +{
>> +if (kvm_has_mte(kvm)) {
>> +/*
>> + * The page will be mapped in stage 2 as Normal Cacheable, so
>> + * the VM will be able to see the page's tags and therefore
>> + * they must be initialised first. If PG_mte_tagged is set,
>> + * tags have already been initialised.
>> + */
>> +unsigned long i, nr_pages = size >> PAGE_SHIFT;
>> +struct page *page = pfn_to_online_page(pfn);
>> +
>> +if (!page)
>> +return -EFAULT;
> 
> IIRC we ended up with pfn_to_online_page() to reject ZONE_DEVICE pages
> that may be mapped into a guest and we have no idea whether they support
> MTE. It may be worth adding a comment, otherwise, as Marc said, the page
> wouldn't disappear.

I'll add a comment.

>> +
>> +for (i = 0; i < nr_pages; i++, page++) {
>> +if (!test_and_set_bit(PG_mte_tagged, &page->flags))
>> +mte_clear_page_tags(page_address(page));
> 
> We started the page->flags thread and ended up fixing it for the host
> set_pte_at() as per the first patch:
> 
> https://lore.kernel.org/r/c3293d47-a5f2-ea4a-6730-f5cae26d8...@arm.com
> 
> Now, can we have a race between the stage 2 kvm_set_spte_gfn() and a
> stage 1 set_pte_at()? Only the latter takes a lock. Or between two
> kvm_set_spte_gfn() in different VMs? I think in the above thread we
> concluded that there's only a problem if the page is shared between
> multiple VMMs (MAP_SHARED). How realistic is this and what's the
> workaround?
> 
> Either way, I think it's worth adding a comment here on the race on
> page->flags as it looks strange that here it's just a test_and_set_bit()
> while set_pte_at() uses a spinlock.
> 

Very good point! I should have thought about that. I think splitting the
test_and_set_bit() into a separate test and set (as with the cache flush)
is sufficient. There is technically still a race which could lead to user
space tags being clobbered, but:

a) It's very odd for a VMM to be doing an mprotect() after the fact to
add PROT_MTE, or to be sharing the memory with another process which
sets PROT_MTE.

b) The window for the race is incredibly small and the VMM (generally)
needs to be robust against the guest changing tags anyway.

But I'll add a comment here as well:

/*
 * There is a potential race between sanitising the
 * flags here and user space using mprotect() to add
 * PROT_MTE to access the tags, however by splitting
 * the test/set the only risk is user space tags
 * being overwritten by the mte_clear_page_tags() call.
 */
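
For reference, the split itself would look roughly like this (untested
sketch of the sanitise_mte_tags() loop):

		for (i = 0; i < nr_pages; i++, page++) {
			if (!test_bit(PG_mte_tagged, &page->flags)) {
				mte_clear_page_tags(page_address(page));
				set_bit(PG_mte_tagged, &page->flags);
			}
		}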

Thanks,

Steve



Re: [PATCH v12 4/8] arm64: kvm: Introduce MTE VM feature

2021-05-20 Thread Steven Price
On 20/05/2021 09:51, Marc Zyngier wrote:
> On Wed, 19 May 2021 11:48:21 +0100,
> Steven Price  wrote:
>>
>> On 17/05/2021 17:45, Marc Zyngier wrote:
>>> On Mon, 17 May 2021 13:32:35 +0100,
>>> Steven Price  wrote:
[...]
>>>> +  }
>>>> +  }
>>>> +
>>>> +  return 0;
>>>> +}
>>>> +
>>>>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>>>  struct kvm_memory_slot *memslot, unsigned long hva,
>>>>  unsigned long fault_status)
>>>> @@ -971,8 +996,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
>>>> phys_addr_t fault_ipa,
>>>>if (writable)
>>>>prot |= KVM_PGTABLE_PROT_W;
>>>>  
>>>> -  if (fault_status != FSC_PERM && !device)
>>>> +  if (fault_status != FSC_PERM && !device) {
>>>> +  ret = sanitise_mte_tags(kvm, vma_pagesize, pfn);
>>>> +  if (ret)
>>>> +  goto out_unlock;
>>>> +
>>>>clean_dcache_guest_page(pfn, vma_pagesize);
>>>> +  }
>>>>  
>>>>if (exec_fault) {
>>>>prot |= KVM_PGTABLE_PROT_X;
>>>> @@ -1168,12 +1198,17 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct 
>>>> kvm_gfn_range *range)
>>>>  bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>>>>  {
>>>>kvm_pfn_t pfn = pte_pfn(range->pte);
>>>> +  int ret;
>>>>  
>>>>if (!kvm->arch.mmu.pgt)
>>>>return 0;
>>>>  
>>>>WARN_ON(range->end - range->start != 1);
>>>>  
>>>> +  ret = sanitise_mte_tags(kvm, PAGE_SIZE, pfn);
>>>> +  if (ret)
>>>> +  return ret;
>>>
>>> Notice the change in return type?
>>
>> I do now - I was tricked by the use of '0' as false. Looks like false
>> ('0') is actually the correct return here to avoid an unnecessary
>> kvm_flush_remote_tlbs().
> 
> Yup. BTW, the return values have been fixed to proper boolean types in
> the latest set of fixes.

Thanks for the heads up - I'll return 'false' to avoid regressing that.

>>
>>>> +
>>>>/*
>>>> * We've moved a page around, probably through CoW, so let's treat it
>>>> * just like a translation fault and clean the cache to the PoC.
>>>> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
>>>> index 76ea2800c33e..24a844cb79ca 100644
>>>> --- a/arch/arm64/kvm/sys_regs.c
>>>> +++ b/arch/arm64/kvm/sys_regs.c
>>>> @@ -1047,6 +1047,9 @@ static u64 read_id_reg(const struct kvm_vcpu *vcpu,
>>>>break;
>>>>case SYS_ID_AA64PFR1_EL1:
>>>>val &= ~FEATURE(ID_AA64PFR1_MTE);
>>>> +  if (kvm_has_mte(vcpu->kvm))
>>>> +  val |= FIELD_PREP(FEATURE(ID_AA64PFR1_MTE),
>>>> +ID_AA64PFR1_MTE);
>>>
>>> Shouldn't this be consistent with what the HW is capable of
>>> (i.e. FEAT_MTE3 if available), and extracted from the sanitised view
>>> of the feature set?
>>
>> Yes - however at the moment our sanitised view is either FEAT_MTE2 or
>> nothing:
>>
>>  {
>>  .desc = "Memory Tagging Extension",
>>  .capability = ARM64_MTE,
>>  .type = ARM64_CPUCAP_STRICT_BOOT_CPU_FEATURE,
>>  .matches = has_cpuid_feature,
>>  .sys_reg = SYS_ID_AA64PFR1_EL1,
>>  .field_pos = ID_AA64PFR1_MTE_SHIFT,
>>  .min_field_value = ID_AA64PFR1_MTE,
>>  .sign = FTR_UNSIGNED,
>>  .cpu_enable = cpu_enable_mte,
>>  },
>>
>> When host support for FEAT_MTE3 is added then the KVM code will need
>> revisiting to expose that down to the guest safely (AFAICS there's
>> nothing extra to do here, but I haven't tested any of the MTE3
>> features). I don't think we want to expose newer versions to the guest
>> than the host is aware of. (Or indeed expose FEAT_MTE if the host has
>> MTE disabled because Linux requires at least FEAT_MTE2).
> 
> What I was suggesting is to have something like this:
> 
>  pfr = read_sanitised_ftr_reg(SYS_ID_AA64PFR1_EL1);
>  mte = cpuid_feature_extract_unsigned_field(pfr, ID_AA64PFR1_MTE_SHIFT);
>  val |= FIELD_PREP(FEATURE(ID_AA64PFR1_MTE), mte);
> 
> which does the trick nicely, and doesn't expose more than the host
> supports.

Ok, I have to admit to not fully understanding the sanitised register
code - but wouldn't this expose higher MTE values if all CPUs support
it, even though the host doesn't know what a hypothetical 'MTE4' adds?
Or is there some magic in the sanitising that caps the value to what the
host knows about?

Thanks,

Steve



Re: [PATCH v12 3/8] arm64: mte: Sync tags for pages where PTE is untagged

2021-05-20 Thread Steven Price
On 20/05/2021 13:25, Catalin Marinas wrote:
> On Thu, May 20, 2021 at 12:55:21PM +0100, Steven Price wrote:
>> On 19/05/2021 19:06, Catalin Marinas wrote:
>>> On Mon, May 17, 2021 at 01:32:34PM +0100, Steven Price wrote:
>>>> A KVM guest could store tags in a page even if the VMM hasn't mapped
>>>> the page with PROT_MTE. So when restoring pages from swap we will
>>>> need to check to see if there are any saved tags even if !pte_tagged().
>>>>
>>>> However don't check pages for which pte_access_permitted() returns false
>>>> as these will not have been swapped out.
>>>>
>>>> Signed-off-by: Steven Price 
>>>> ---
>>>>  arch/arm64/include/asm/pgtable.h |  9 +++--
>>>>  arch/arm64/kernel/mte.c  | 16 ++--
>>>>  2 files changed, 21 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/include/asm/pgtable.h 
>>>> b/arch/arm64/include/asm/pgtable.h
>>>> index 0b10204e72fc..275178a810c1 100644
>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>> @@ -314,8 +314,13 @@ static inline void set_pte_at(struct mm_struct *mm, 
>>>> unsigned long addr,
>>>>if (pte_present(pte) && pte_user_exec(pte) && !pte_special(pte))
>>>>__sync_icache_dcache(pte);
>>>>  
>>>> -  if (system_supports_mte() &&
>>>> -  pte_present(pte) && pte_tagged(pte) && !pte_special(pte))
>>>> +  /*
>>>> +   * If the PTE would provide user space access to the tags associated
>>>> +   * with it then ensure that the MTE tags are synchronised.  Exec-only
>>>> +   * mappings don't expose tags (instruction fetches don't check tags).
>>>> +   */
>>>> +  if (system_supports_mte() && pte_present(pte) &&
>>>> +  pte_access_permitted(pte, false) && !pte_special(pte))
>>>>mte_sync_tags(ptep, pte);
>>>
>>> Looking at the mte_sync_page_tags() logic, we bail out early if it's the
>>> old pte is not a swap one and the new pte is not tagged. So we only need
>>> to call mte_sync_tags() if it's a tagged new pte or the old one is swap.
>>> What about changing the set_pte_at() test to:
>>>
>>> if (system_supports_mte() && pte_present(pte) && !pte_special(pte) &&
>>> (pte_tagged(pte) || is_swap_pte(READ_ONCE(*ptep))))
>>> mte_sync_tags(ptep, pte);
>>>
>>> We can even change mte_sync_tags() to take the old pte directly:
>>>
>>> if (system_supports_mte() && pte_present(pte) && !pte_special(pte)) {
>>> pte_t old_pte = READ_ONCE(*ptep);
>>> if (pte_tagged(pte) || is_swap_pte(old_pte))
>>> mte_sync_tags(old_pte, pte);
>>> }
>>>
>>> It would save a function call in most cases where the page is not
>>> tagged.
>>
>> Yes that looks like a good optimisation - although you've missed the
>> pte_access_permitted() part of the check ;)
> 
> I was actually wondering if we could remove it. I don't think it buys us
> much as we have a pte_present() check already, so we know it is pointing
> to a valid page. Currently we'd only get a tagged pte on user mappings,
> same with swap entries.

Actually the other way round makes more sense surely?
pte_access_permitted() is true if both PTE_VALID & PTE_USER are set.
pte_present() is true if *either* PTE_VALID or PTE_PROT_NONE is set. So
the pte_present() check is actually redundant.
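
Just to illustrate the point (not something I'm proposing to post as-is),
dropping pte_present() would leave the set_pte_at() check looking
something like:

	if (system_supports_mte() && !pte_special(pte) &&
	    pte_access_permitted(pte, false))
		mte_sync_tags(ptep, pte);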

> When vmalloc kasan_hw will be added, I think we have a set_pte_at() with
> a tagged pte but init_mm and high address (we might as well add a
> warning if addr > TASK_SIZE_64 on the mte_sync_tags path so that we
> don't forget).

While we might not yet have tagged kernel pages - I'm not sure there's
much point weakening the check to have to then check addr as well in the
future.

>> The problem I hit is one of include dependencies:
>>
>> is_swap_pte() is defined (as a static inline) in
>> include/linux/swapops.h. However the definition depends on
>> pte_none()/pte_present() which are defined in pgtable.h - so there's a
>> circular dependency.
>>
>> Open coding is_swap_pte() in set_pte_at() works, but it's a bit ugly.
>> Any ideas on how to improve on the below?
>>
>>  if (system_supports_mte() && pte_present(pte) &&
>>  pte_access_permitted(pte, false) &&

Re: [PATCH v12 3/8] arm64: mte: Sync tags for pages where PTE is untagged

2021-05-20 Thread Steven Price
On 19/05/2021 19:06, Catalin Marinas wrote:
> On Mon, May 17, 2021 at 01:32:34PM +0100, Steven Price wrote:
>> A KVM guest could store tags in a page even if the VMM hasn't mapped
>> the page with PROT_MTE. So when restoring pages from swap we will
>> need to check to see if there are any saved tags even if !pte_tagged().
>>
>> However don't check pages for which pte_access_permitted() returns false
>> as these will not have been swapped out.
>>
>> Signed-off-by: Steven Price 
>> ---
>>  arch/arm64/include/asm/pgtable.h |  9 +++--
>>  arch/arm64/kernel/mte.c  | 16 ++--
>>  2 files changed, 21 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h 
>> b/arch/arm64/include/asm/pgtable.h
>> index 0b10204e72fc..275178a810c1 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -314,8 +314,13 @@ static inline void set_pte_at(struct mm_struct *mm, 
>> unsigned long addr,
>>  if (pte_present(pte) && pte_user_exec(pte) && !pte_special(pte))
>>  __sync_icache_dcache(pte);
>>  
>> -if (system_supports_mte() &&
>> -pte_present(pte) && pte_tagged(pte) && !pte_special(pte))
>> +/*
>> + * If the PTE would provide user space access to the tags associated
>> + * with it then ensure that the MTE tags are synchronised.  Exec-only
>> + * mappings don't expose tags (instruction fetches don't check tags).
>> + */
>> +if (system_supports_mte() && pte_present(pte) &&
>> +pte_access_permitted(pte, false) && !pte_special(pte))
>>  mte_sync_tags(ptep, pte);
> 
> Looking at the mte_sync_page_tags() logic, we bail out early if it's the
> old pte is not a swap one and the new pte is not tagged. So we only need
> to call mte_sync_tags() if it's a tagged new pte or the old one is swap.
> What about changing the set_pte_at() test to:
> 
>   if (system_supports_mte() && pte_present(pte) && !pte_special(pte) &&
>   (pte_tagged(pte) || is_swap_pte(READ_ONCE(*ptep))))
>   mte_sync_tags(ptep, pte);
> 
> We can even change mte_sync_tags() to take the old pte directly:
> 
>   if (system_supports_mte() && pte_present(pte) && !pte_special(pte)) {
>   pte_t old_pte = READ_ONCE(*ptep);
>   if (pte_tagged(pte) || is_swap_pte(old_pte))
>   mte_sync_tags(old_pte, pte);
>   }
> 
> It would save a function call in most cases where the page is not
> tagged.
> 

Yes that looks like a good optimisation - although you've missed the
pte_access_permitted() part of the check ;) The problem I hit is one of
include dependencies:

is_swap_pte() is defined (as a static inline) in
include/linux/swapops.h. However the definition depends on
pte_none()/pte_present() which are defined in pgtable.h - so there's a
circular dependency.

Open coding is_swap_pte() in set_pte_at() works, but it's a bit ugly.
Any ideas on how to improve on the below?

if (system_supports_mte() && pte_present(pte) &&
pte_access_permitted(pte, false) && !pte_special(pte)) {
pte_t old_pte = READ_ONCE(*ptep);
/*
 * We only need to synchronise if the new PTE has tags enabled
 * or if swapping in (in which case another mapping may have
 * set tags in the past even if this PTE isn't tagged).
 * (!pte_none() && !pte_present()) is an open coded version of
 * is_swap_pte()
 */
if (pte_tagged(pte) || (!pte_none(pte) && !pte_present(pte)))
mte_sync_tags(old_pte, pte);
}

Steve



Re: [PATCH v12 8/8] KVM: arm64: Document MTE capability and ioctl

2021-05-20 Thread Steven Price
On 20/05/2021 11:24, Marc Zyngier wrote:
> On Wed, 19 May 2021 15:09:23 +0100,
> Steven Price  wrote:
>>
>> On 17/05/2021 19:09, Marc Zyngier wrote:
>>> On Mon, 17 May 2021 13:32:39 +0100,
>>> Steven Price  wrote:
[...]
>>>> +bytes (i.e. 1/16th of the corresponding size). Each byte contains a single tag
>>>> +value. This matches the format of ``PTRACE_PEEKMTETAGS`` and
>>>> +``PTRACE_POKEMTETAGS``.
>>>> +
>>>>  5. The kvm_run structure
>>>>  
>>>>  
>>>> @@ -6362,6 +6396,25 @@ default.
>>>>  
>>>>  See Documentation/x86/sgx/2.Kernel-internals.rst for more details.
>>>>  
>>>> +7.26 KVM_CAP_ARM_MTE
>>>> +--------------------
>>>> +
>>>> +:Architectures: arm64
>>>> +:Parameters: none
>>>> +
>>>> +This capability indicates that KVM (and the hardware) supports exposing 
>>>> the
>>>> +Memory Tagging Extensions (MTE) to the guest. It must also be enabled by 
>>>> the
>>>> +VMM before the guest will be granted access.
>>>> +
>>>> +When enabled the guest is able to access tags associated with any memory 
>>>> given
>>>> +to the guest. KVM will ensure that the pages are flagged 
>>>> ``PG_mte_tagged`` so
>>>> +that the tags are maintained during swap or hibernation of the host; 
>>>> however
>>>> +the VMM needs to manually save/restore the tags as appropriate if the VM 
>>>> is
>>>> +migrated.
>>>> +
>>>> +When enabled the VMM may make use of the ``KVM_ARM_MTE_COPY_TAGS`` ioctl 
>>>> to
>>>> +perform a bulk copy of tags to/from the guest.
>>>> +
>>>
>>> Missing limitation to AArch64 guests.
>>
>> As mentioned previously it's not technically limited to AArch64, but
>> I'll expand this to make it clear that MTE isn't usable from an AArch32 VCPU.
> 
> I believe the architecture is quite clear that it *is* limited to
> AArch64. The clarification is welcome though.

I explained that badly. A system supporting MTE doesn't have to have all
CPUs running AArch64 - fairly obviously you can boot a 32 bit OS on a
system supporting AArch64.

Since the KVM capability is a VM capability it's not architecturally
inconsistent to enable it even if all your CPUs are running AArch32 (at
EL1 and lower) - just a bit pointless.

However, given your comment that a mixture of AArch32/AArch64 VCPUs is a
bug, we can fail creation of AArch32 VCPUs and I'll explicitly document
that this is an AArch64-only feature.

Thanks,

Steve



Re: [PATCH v12 6/8] arm64: kvm: Expose KVM_ARM_CAP_MTE

2021-05-20 Thread Steven Price
On 20/05/2021 11:09, Marc Zyngier wrote:
> On Wed, 19 May 2021 14:26:31 +0100,
> Steven Price  wrote:
>>
>> On 17/05/2021 18:40, Marc Zyngier wrote:
>>> On Mon, 17 May 2021 13:32:37 +0100,
>>> Steven Price  wrote:
>>>>
>>>> It's now safe for the VMM to enable MTE in a guest, so expose the
>>>> capability to user space.
>>>>
>>>> Signed-off-by: Steven Price 
>>>> ---
>>>>  arch/arm64/kvm/arm.c  | 9 +
>>>>  arch/arm64/kvm/sys_regs.c | 3 +++
>>>>  2 files changed, 12 insertions(+)
>>>>
>>>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>>>> index 1cb39c0803a4..e89a5e275e25 100644
>>>> --- a/arch/arm64/kvm/arm.c
>>>> +++ b/arch/arm64/kvm/arm.c
>>>> @@ -93,6 +93,12 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>>>>r = 0;
>>>>kvm->arch.return_nisv_io_abort_to_user = true;
>>>>break;
>>>> +  case KVM_CAP_ARM_MTE:
>>>> +  if (!system_supports_mte() || kvm->created_vcpus)
>>>> +  return -EINVAL;
>>>> +  r = 0;
>>>> +  kvm->arch.mte_enabled = true;
>>>
>>> As far as I can tell from the architecture, this isn't valid for a
>>> 32bit guest.
>>
>> Indeed, however the MTE flag is a property of the VM not of the vCPU.
>> And, unless I'm mistaken, it's technically possible to create a VM where
>> some CPUs are 32 bit and some 64 bit. Not that I can see much use of a
>> configuration like that.
> 
> It looks that this is indeed a bug, and I'm on my way to squash it.
> Can't believe we allowed that for so long...

Ah, well if you're going to kill off mixed 32bit/64bit VMs then...

> But the architecture clearly states:
> 
> 
> These features are supported in AArch64 state only.
> 
> 
> So I'd expect something like:
> 
> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
> index 956cdc240148..50635eacfa43 100644
> --- a/arch/arm64/kvm/reset.c
> +++ b/arch/arm64/kvm/reset.c
> @@ -220,7 +220,8 @@ int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
>   switch (vcpu->arch.target) {
>   default:
>   if (test_bit(KVM_ARM_VCPU_EL1_32BIT, vcpu->arch.features)) {
> - if (!cpus_have_const_cap(ARM64_HAS_32BIT_EL1)) {
> + if (!cpus_have_const_cap(ARM64_HAS_32BIT_EL1) ||
> + vcpu->kvm->arch.mte_enabled) {
>   ret = -EINVAL;
>   goto out;
>   }
> 
> that makes it completely impossible to create 32bit CPUs within a
> MTE-enabled guest.

... that makes complete sense, and I'll include this hunk in my next
posting.

Thanks,

Steve



Re: [PATCH v12 8/8] KVM: arm64: Document MTE capability and ioctl

2021-05-19 Thread Steven Price
On 17/05/2021 19:09, Marc Zyngier wrote:
> On Mon, 17 May 2021 13:32:39 +0100,
> Steven Price  wrote:
>>
>> A new capability (KVM_CAP_ARM_MTE) identifies that the kernel supports
>> granting a guest access to the tags, and provides a mechanism for the
>> VMM to enable it.
>>
>> A new ioctl (KVM_ARM_MTE_COPY_TAGS) provides a simple way for a VMM to
>> access the tags of a guest without having to maintain a PROT_MTE mapping
>> in userspace. The above capability gates access to the ioctl.
>>
>> Signed-off-by: Steven Price 
>> ---
>>  Documentation/virt/kvm/api.rst | 53 ++
>>  1 file changed, 53 insertions(+)
>>
>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>> index 22d077562149..a31661b870ba 100644
>> --- a/Documentation/virt/kvm/api.rst
>> +++ b/Documentation/virt/kvm/api.rst
>> @@ -5034,6 +5034,40 @@ see KVM_XEN_VCPU_SET_ATTR above.
>>  The KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST type may not be used
>>  with the KVM_XEN_VCPU_GET_ATTR ioctl.
>>  
>> +4.130 KVM_ARM_MTE_COPY_TAGS
>> +---------------------------
>> +
>> +:Capability: KVM_CAP_ARM_MTE
>> +:Architectures: arm64
>> +:Type: vm ioctl
>> +:Parameters: struct kvm_arm_copy_mte_tags
>> +:Returns: 0 on success, < 0 on error
>> +
>> +::
>> +
>> +  struct kvm_arm_copy_mte_tags {
>> +__u64 guest_ipa;
>> +__u64 length;
>> +union {
>> +void __user *addr;
>> +__u64 padding;
>> +};
>> +__u64 flags;
>> +__u64 reserved[2];
>> +  };
> 
> This doesn't exactly match the structure in the previous patch :-(.

:( I knew there was a reason I didn't include it in the documentation
for the first 9 versions... I'll fix this up, thanks for spotting it.

>> +
>> +Copies Memory Tagging Extension (MTE) tags to/from guest tag memory. The
>> +``guest_ipa`` and ``length`` fields must be ``PAGE_SIZE`` aligned. The ``addr``
>> +field must point to a buffer which the tags will be copied to or from.
>> +
>> +``flags`` specifies the direction of copy, either ``KVM_ARM_TAGS_TO_GUEST`` 
>> or
>> +``KVM_ARM_TAGS_FROM_GUEST``.
>> +
>> +The size of the buffer to store the tags is ``(length / MTE_GRANULE_SIZE)``
> 
> Should we add a UAPI definition for MTE_GRANULE_SIZE?

I wasn't sure whether to export this or not. The ioctl is based around
the existing ptrace interface (PTRACE_{PEEK,POKE}MTETAGS) which doesn't
expose a UAPI definition. Admittedly the documentation there also just
says "16-byte granule" rather than MTE_GRANULE_SIZE.

So I'll just remove the reference to MTE_GRANULE_SIZE in the
documentation unless you feel that we should have a UAPI definition.
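
(If we did want a UAPI definition it would presumably be nothing more
than the below - the architected tag granule is 16 bytes - but as above
my preference is to just reword the documentation.)

#define MTE_GRANULE_SIZE	16	/* bytes of memory covered by one tag */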

>> +bytes (i.e. 1/16th of the corresponding size). Each byte contains a single 
>> tag
>> +value. This matches the format of ``PTRACE_PEEKMTETAGS`` and
>> +``PTRACE_POKEMTETAGS``.
>> +
>>  5. The kvm_run structure
>>  
>>  
>> @@ -6362,6 +6396,25 @@ default.
>>  
>>  See Documentation/x86/sgx/2.Kernel-internals.rst for more details.
>>  
>> +7.26 KVM_CAP_ARM_MTE
>> +--------------------
>> +
>> +:Architectures: arm64
>> +:Parameters: none
>> +
>> +This capability indicates that KVM (and the hardware) supports exposing the
>> +Memory Tagging Extensions (MTE) to the guest. It must also be enabled by the
>> +VMM before the guest will be granted access.
>> +
>> +When enabled the guest is able to access tags associated with any memory 
>> given
>> +to the guest. KVM will ensure that the pages are flagged ``PG_mte_tagged`` 
>> so
>> +that the tags are maintained during swap or hibernation of the host; however
>> +the VMM needs to manually save/restore the tags as appropriate if the VM is
>> +migrated.
>> +
>> +When enabled the VMM may make use of the ``KVM_ARM_MTE_COPY_TAGS`` ioctl to
>> +perform a bulk copy of tags to/from the guest.
>> +
> 
> Missing limitation to AArch64 guests.

As mentioned previously it's not technically limited to AArch64, but
I'll expand this to make it clear that MTE isn't usable from an AArch32 VCPU.

Thanks,

Steve



Re: [PATCH v12 7/8] KVM: arm64: ioctl to fetch/store tags in a guest

2021-05-19 Thread Steven Price
On 17/05/2021 19:04, Marc Zyngier wrote:
> On Mon, 17 May 2021 13:32:38 +0100,
> Steven Price  wrote:
>>
>> The VMM may not wish to have its own mapping of guest memory mapped
>> with PROT_MTE because this causes problems if the VMM has tag checking
>> enabled (the guest controls the tags in physical RAM and it's unlikely
>> the tags are correct for the VMM).
>>
>> Instead add a new ioctl which allows the VMM to easily read/write the
>> tags from guest memory, allowing the VMM's mapping to be non-PROT_MTE
>> while the VMM can still read/write the tags for the purpose of
>> migration.
>>
>> Signed-off-by: Steven Price 
>> ---
>>  arch/arm64/include/uapi/asm/kvm.h | 11 +
>>  arch/arm64/kvm/arm.c  | 69 +++
>>  include/uapi/linux/kvm.h  |  1 +
>>  3 files changed, 81 insertions(+)
>>
>> diff --git a/arch/arm64/include/uapi/asm/kvm.h 
>> b/arch/arm64/include/uapi/asm/kvm.h
>> index 24223adae150..b3edde68bc3e 100644
>> --- a/arch/arm64/include/uapi/asm/kvm.h
>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>> @@ -184,6 +184,17 @@ struct kvm_vcpu_events {
>>  __u32 reserved[12];
>>  };
>>  
>> +struct kvm_arm_copy_mte_tags {
>> +__u64 guest_ipa;
>> +__u64 length;
>> +void __user *addr;
>> +__u64 flags;
>> +__u64 reserved[2];
>> +};
>> +
>> +#define KVM_ARM_TAGS_TO_GUEST   0
>> +#define KVM_ARM_TAGS_FROM_GUEST 1
>> +
>>  /* If you need to interpret the index values, here is the key: */
>>  #define KVM_REG_ARM_COPROC_MASK 0x0FFF
>>  #define KVM_REG_ARM_COPROC_SHIFT 16
>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index e89a5e275e25..4b6c83beb75d 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -1309,6 +1309,65 @@ static int kvm_vm_ioctl_set_device_addr(struct kvm 
>> *kvm,
>>  }
>>  }
>>  
>> +static int kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
>> +  struct kvm_arm_copy_mte_tags *copy_tags)
>> +{
>> +gpa_t guest_ipa = copy_tags->guest_ipa;
>> +size_t length = copy_tags->length;
>> +void __user *tags = copy_tags->addr;
>> +gpa_t gfn;
>> +bool write = !(copy_tags->flags & KVM_ARM_TAGS_FROM_GUEST);
>> +int ret = 0;
>> +
>> +if (copy_tags->reserved[0] || copy_tags->reserved[1])
>> +return -EINVAL;
>> +
>> +if (copy_tags->flags & ~KVM_ARM_TAGS_FROM_GUEST)
>> +return -EINVAL;
>> +
>> +if (length & ~PAGE_MASK || guest_ipa & ~PAGE_MASK)
>> +return -EINVAL;
>> +
>> +gfn = gpa_to_gfn(guest_ipa);
>> +
>> +mutex_lock(&kvm->slots_lock);
>> +
>> +while (length > 0) {
>> +kvm_pfn_t pfn = gfn_to_pfn_prot(kvm, gfn, write, NULL);
>> +void *maddr;
>> +unsigned long num_tags = PAGE_SIZE / MTE_GRANULE_SIZE;
> 
> nit: this is a compile time constant, make it a #define. This will
> avoid the confusing overloading of "num_tags" as both an input and an
> output for the mte_copy_tags-* functions.

No problem, I agree my usage of num_tags wasn't very clear.
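
Something like the below is what I have in mind (rough sketch - the
define could live alongside MTE_GRANULE_SIZE or locally in the KVM code):

#define MTE_GRANULES_PER_PAGE	(PAGE_SIZE / MTE_GRANULE_SIZE)

		num_tags = mte_copy_tags_to_user(tags, maddr,
						 MTE_GRANULES_PER_PAGE);
		...
		if (num_tags != MTE_GRANULES_PER_PAGE) {
			ret = -EFAULT;
			goto out;
		}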

>> +
>> +if (is_error_noslot_pfn(pfn)) {
>> +ret = -EFAULT;
>> +goto out;
>> +}
>> +
>> +maddr = page_address(pfn_to_page(pfn));
>> +
>> +if (!write) {
>> +num_tags = mte_copy_tags_to_user(tags, maddr, num_tags);
>> +kvm_release_pfn_clean(pfn);
>> +} else {
>> +num_tags = mte_copy_tags_from_user(maddr, tags,
>> +   num_tags);
>> +kvm_release_pfn_dirty(pfn);
>> +}
>> +
>> +if (num_tags != PAGE_SIZE / MTE_GRANULE_SIZE) {
>> +ret = -EFAULT;
>> +goto out;
>> +}
>> +
>> +gfn++;
>> +tags += num_tags;
>> +length -= PAGE_SIZE;
>> +}
>> +
>> +out:
>> +mutex_unlock(&kvm->slots_lock);
>> +return ret;
>> +}
>> +
> 
> nit again: I'd really prefer it if you moved this to guest.c, where we
> already have a bunch of the save/restore stuff.

Sure - I'll move it across.

Thanks,

Steve



Re: [PATCH v12 6/8] arm64: kvm: Expose KVM_ARM_CAP_MTE

2021-05-19 Thread Steven Price
On 17/05/2021 18:40, Marc Zyngier wrote:
> On Mon, 17 May 2021 13:32:37 +0100,
> Steven Price  wrote:
>>
>> It's now safe for the VMM to enable MTE in a guest, so expose the
>> capability to user space.
>>
>> Signed-off-by: Steven Price 
>> ---
>>  arch/arm64/kvm/arm.c  | 9 +
>>  arch/arm64/kvm/sys_regs.c | 3 +++
>>  2 files changed, 12 insertions(+)
>>
>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index 1cb39c0803a4..e89a5e275e25 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -93,6 +93,12 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>>  r = 0;
>>  kvm->arch.return_nisv_io_abort_to_user = true;
>>  break;
>> +case KVM_CAP_ARM_MTE:
>> +if (!system_supports_mte() || kvm->created_vcpus)
>> +return -EINVAL;
>> +r = 0;
>> +kvm->arch.mte_enabled = true;
> 
> As far as I can tell from the architecture, this isn't valid for a
> 32bit guest.

Indeed, however the MTE flag is a property of the VM not of the vCPU.
And, unless I'm mistaken, it's technically possible to create a VM where
some CPUs are 32 bit and some 64 bit. Not that I can see much use of a
configuration like that.

Steve



Re: [PATCH v12 5/8] arm64: kvm: Save/restore MTE registers

2021-05-19 Thread Steven Price
On 17/05/2021 18:17, Marc Zyngier wrote:
> On Mon, 17 May 2021 13:32:36 +0100,
> Steven Price  wrote:
>>
>> Define the new system registers that MTE introduces and context switch
>> them. The MTE feature is still hidden from the ID register as it isn't
>> supported in a VM yet.
>>
>> Signed-off-by: Steven Price 
>> ---
>>  arch/arm64/include/asm/kvm_host.h  |  6 ++
>>  arch/arm64/include/asm/kvm_mte.h   | 66 ++
>>  arch/arm64/include/asm/sysreg.h|  3 +-
>>  arch/arm64/kernel/asm-offsets.c|  3 +
>>  arch/arm64/kvm/hyp/entry.S |  7 +++
>>  arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 21 +++
>>  arch/arm64/kvm/sys_regs.c  | 22 ++--
>>  7 files changed, 123 insertions(+), 5 deletions(-)
>>  create mode 100644 arch/arm64/include/asm/kvm_mte.h
>>
>> diff --git a/arch/arm64/include/asm/kvm_host.h 
>> b/arch/arm64/include/asm/kvm_host.h
>> index afaa5333f0e4..309e36cc1b42 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -208,6 +208,12 @@ enum vcpu_sysreg {
>>  CNTP_CVAL_EL0,
>>  CNTP_CTL_EL0,
>>  
>> +/* Memory Tagging Extension registers */
>> +RGSR_EL1,   /* Random Allocation Tag Seed Register */
>> +GCR_EL1,/* Tag Control Register */
>> +TFSR_EL1,   /* Tag Fault Status Register (EL1) */
>> +TFSRE0_EL1, /* Tag Fault Status Register (EL0) */
>> +
>>  /* 32bit specific registers. Keep them at the end of the range */
>>  DACR32_EL2, /* Domain Access Control Register */
>>  IFSR32_EL2, /* Instruction Fault Status Register */
>> diff --git a/arch/arm64/include/asm/kvm_mte.h 
>> b/arch/arm64/include/asm/kvm_mte.h
>> new file mode 100644
>> index ..6541c7d6ce06
>> --- /dev/null
>> +++ b/arch/arm64/include/asm/kvm_mte.h
>> @@ -0,0 +1,66 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/*
>> + * Copyright (C) 2020 ARM Ltd.
>> + */
>> +#ifndef __ASM_KVM_MTE_H
>> +#define __ASM_KVM_MTE_H
>> +
>> +#ifdef __ASSEMBLY__
>> +
>> +#include 
>> +
>> +#ifdef CONFIG_ARM64_MTE
>> +
>> +.macro mte_switch_to_guest g_ctxt, h_ctxt, reg1
>> +alternative_if_not ARM64_MTE
>> +b   .L__skip_switch\@
>> +alternative_else_nop_endif
>> +mrs \reg1, hcr_el2
>> +and \reg1, \reg1, #(HCR_ATA)
>> +cbz \reg1, .L__skip_switch\@
>> +
>> +mrs_s   \reg1, SYS_RGSR_EL1
>> +str \reg1, [\h_ctxt, #CPU_RGSR_EL1]
>> +mrs_s   \reg1, SYS_GCR_EL1
>> +str \reg1, [\h_ctxt, #CPU_GCR_EL1]
>> +
>> +ldr \reg1, [\g_ctxt, #CPU_RGSR_EL1]
>> +msr_s   SYS_RGSR_EL1, \reg1
>> +ldr \reg1, [\g_ctxt, #CPU_GCR_EL1]
>> +msr_s   SYS_GCR_EL1, \reg1
>> +
>> +.L__skip_switch\@:
>> +.endm
>> +
>> +.macro mte_switch_to_hyp g_ctxt, h_ctxt, reg1
>> +alternative_if_not ARM64_MTE
>> +b   .L__skip_switch\@
>> +alternative_else_nop_endif
>> +mrs \reg1, hcr_el2
>> +and \reg1, \reg1, #(HCR_ATA)
>> +cbz \reg1, .L__skip_switch\@
>> +
>> +mrs_s   \reg1, SYS_RGSR_EL1
>> +str \reg1, [\g_ctxt, #CPU_RGSR_EL1]
>> +mrs_s   \reg1, SYS_GCR_EL1
>> +str \reg1, [\g_ctxt, #CPU_GCR_EL1]
>> +
>> +ldr \reg1, [\h_ctxt, #CPU_RGSR_EL1]
>> +msr_s   SYS_RGSR_EL1, \reg1
>> +ldr \reg1, [\h_ctxt, #CPU_GCR_EL1]
>> +msr_s   SYS_GCR_EL1, \reg1
> 
> What is the rational for not having any synchronisation here? It is
> quite uncommon to allocate memory at EL2, but VHE can perform all kind
> of tricks.

I don't follow. This is part of the __guest_exit path and there's an ISB
at the end of that - is that not sufficient? I don't see any possibility
for allocating memory before that. What am I missing?

>> +
>> +.L__skip_switch\@:
>> +.endm
>> +
>> +#else /* CONFIG_ARM64_MTE */
>> +
>> +.macro mte_switch_to_guest g_ctxt, h_ctxt, reg1
>> +.endm
>> +
>> +.macro mte_switch_to_hyp g_ctxt, h_ctxt, reg1
>> +.endm
>> +
>> +#endif /* CONFIG_ARM64_MTE */
>> +#endif /* __ASSEMBLY__ */
>> +#endif /* __ASM_KVM_MTE_H */
>> diff --git a/arch/arm64/include/asm/sysreg.h 
>> b/arch/arm64/include/asm/sysreg.h
>> index 65d15700a168..347ccac2341e 100644
>> --- a/arch/arm64/include/asm/sysreg.h
>> +++ b/arch/arm64/include/asm/sysreg.h
>> @@ -651,7 +651,8 @@

Re: [PATCH v12 4/8] arm64: kvm: Introduce MTE VM feature

2021-05-19 Thread Steven Price
On 17/05/2021 17:45, Marc Zyngier wrote:
> On Mon, 17 May 2021 13:32:35 +0100,
> Steven Price  wrote:
>>
>> Add a new VM feature 'KVM_ARM_CAP_MTE' which enables memory tagging
>> for a VM. This will expose the feature to the guest and automatically
>> tag memory pages touched by the VM as PG_mte_tagged (and clear the tag
>> storage) to ensure that the guest cannot see stale tags, and so that
>> the tags are correctly saved/restored across swap.
>>
>> Actually exposing the new capability to user space happens in a later
>> patch.
> 
> uber nit in $SUBJECT: "KVM: arm64:" is the preferred prefix (just like
> patches 7 and 8).

Good spot - I obviously got carried away with the "arm64:" prefix ;)

>>
>> Signed-off-by: Steven Price 
>> ---
>>  arch/arm64/include/asm/kvm_emulate.h |  3 +++
>>  arch/arm64/include/asm/kvm_host.h|  3 +++
>>  arch/arm64/kvm/hyp/exception.c   |  3 ++-
>>  arch/arm64/kvm/mmu.c | 37 +++-
>>  arch/arm64/kvm/sys_regs.c|  3 +++
>>  include/uapi/linux/kvm.h |  1 +
>>  6 files changed, 48 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_emulate.h 
>> b/arch/arm64/include/asm/kvm_emulate.h
>> index f612c090f2e4..6bf776c2399c 100644
>> --- a/arch/arm64/include/asm/kvm_emulate.h
>> +++ b/arch/arm64/include/asm/kvm_emulate.h
>> @@ -84,6 +84,9 @@ static inline void vcpu_reset_hcr(struct kvm_vcpu *vcpu)
>>  if (cpus_have_const_cap(ARM64_MISMATCHED_CACHE_TYPE) ||
>>  vcpu_el1_is_32bit(vcpu))
>>  vcpu->arch.hcr_el2 |= HCR_TID2;
>> +
>> +if (kvm_has_mte(vcpu->kvm))
>> +vcpu->arch.hcr_el2 |= HCR_ATA;
>>  }
>>  
>>  static inline unsigned long *vcpu_hcr(struct kvm_vcpu *vcpu)
>> diff --git a/arch/arm64/include/asm/kvm_host.h 
>> b/arch/arm64/include/asm/kvm_host.h
>> index 7cd7d5c8c4bc..afaa5333f0e4 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -132,6 +132,8 @@ struct kvm_arch {
>>  
>>  u8 pfr0_csv2;
>>  u8 pfr0_csv3;
>> +/* Memory Tagging Extension enabled for the guest */
>> +bool mte_enabled;
>>  };
>>  
>>  struct kvm_vcpu_fault_info {
>> @@ -769,6 +771,7 @@ bool kvm_arm_vcpu_is_finalized(struct kvm_vcpu *vcpu);
>>  #define kvm_arm_vcpu_sve_finalized(vcpu) \
>>  ((vcpu)->arch.flags & KVM_ARM64_VCPU_SVE_FINALIZED)
>>  
>> +#define kvm_has_mte(kvm) (system_supports_mte() && (kvm)->arch.mte_enabled)
>>  #define kvm_vcpu_has_pmu(vcpu)  \
>>  (test_bit(KVM_ARM_VCPU_PMU_V3, (vcpu)->arch.features))
>>  
>> diff --git a/arch/arm64/kvm/hyp/exception.c b/arch/arm64/kvm/hyp/exception.c
>> index 73629094f903..56426565600c 100644
>> --- a/arch/arm64/kvm/hyp/exception.c
>> +++ b/arch/arm64/kvm/hyp/exception.c
>> @@ -112,7 +112,8 @@ static void enter_exception64(struct kvm_vcpu *vcpu, 
>> unsigned long target_mode,
>>  new |= (old & PSR_C_BIT);
>>  new |= (old & PSR_V_BIT);
>>  
>> -// TODO: TCO (if/when ARMv8.5-MemTag is exposed to guests)
>> +if (kvm_has_mte(vcpu->kvm))
>> +new |= PSR_TCO_BIT;
>>  
>>  new |= (old & PSR_DIT_BIT);
>>  
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index c5d1f3c87dbd..8660f6a03f51 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -822,6 +822,31 @@ transparent_hugepage_adjust(struct kvm_memory_slot 
>> *memslot,
>>  return PAGE_SIZE;
>>  }
>>  
>> +static int sanitise_mte_tags(struct kvm *kvm, unsigned long size,
>> + kvm_pfn_t pfn)
> 
> Nit: please order the parameters as address, then size.

Sure

>> +{
>> +if (kvm_has_mte(kvm)) {
>> +/*
>> + * The page will be mapped in stage 2 as Normal Cacheable, so
>> + * the VM will be able to see the page's tags and therefore
>> + * they must be initialised first. If PG_mte_tagged is set,
>> + * tags have already been initialised.
>> + */
>> +unsigned long i, nr_pages = size >> PAGE_SHIFT;
>> +struct page *page = pfn_to_online_page(pfn);
>> +
>> +if (!page)
>> +return -EFAULT;
> 
> Under which circumstances can this happen? We already have done a GUP
> on the page, so I really can't see how the page

Re: [PATCH v12 3/8] arm64: mte: Sync tags for pages where PTE is untagged

2021-05-19 Thread Steven Price
On 17/05/2021 17:14, Marc Zyngier wrote:
> On Mon, 17 May 2021 13:32:34 +0100,
> Steven Price  wrote:
>>
>> A KVM guest could store tags in a page even if the VMM hasn't mapped
>> the page with PROT_MTE. So when restoring pages from swap we will
>> need to check to see if there are any saved tags even if !pte_tagged().
>>
>> However don't check pages for which pte_access_permitted() returns false
>> as these will not have been swapped out.
>>
>> Signed-off-by: Steven Price 
>> ---
>>  arch/arm64/include/asm/pgtable.h |  9 +++--
>>  arch/arm64/kernel/mte.c  | 16 ++--
>>  2 files changed, 21 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h 
>> b/arch/arm64/include/asm/pgtable.h
>> index 0b10204e72fc..275178a810c1 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -314,8 +314,13 @@ static inline void set_pte_at(struct mm_struct *mm, 
>> unsigned long addr,
>>  if (pte_present(pte) && pte_user_exec(pte) && !pte_special(pte))
>>  __sync_icache_dcache(pte);
>>  
>> -if (system_supports_mte() &&
>> -pte_present(pte) && pte_tagged(pte) && !pte_special(pte))
>> +/*
>> + * If the PTE would provide user space access to the tags associated
>> + * with it then ensure that the MTE tags are synchronised.  Exec-only
>> + * mappings don't expose tags (instruction fetches don't check tags).
> 
> I'm not sure I understand this comment. Of course, execution doesn't
> match tags. But the memory could still have tags associated with
> it. Does this mean such a page would lose its tags is swapped out?

Hmm, I probably should have reread that - the context of the comment is
lost.

I added the comment when changing to pte_access_permitted(), and the
comment on pte_access_permitted() explains a potential gotcha:

 * p??_access_permitted() is true for valid user mappings (PTE_USER
 * bit set, subject to the write permission check). For execute-only
 * mappings, like PROT_EXEC with EPAN (both PTE_USER and PTE_UXN bits
 * not set) must return false. PROT_NONE mappings do not have the
 * PTE_VALID bit set.

So execute-only mappings return false even though that is effectively a
type of user access. However, because MTE checks are not performed by
the PE for instruction fetches this doesn't matter. I'll update the
comment, how about:

/*
 * If the PTE would provide user space access to the tags associated
 * with it then ensure that the MTE tags are synchronised.  Although
 * pte_access_permitted() returns false for exec only mappings, they
 * don't expose tags (instruction fetches don't check tags).
 */

Thanks,

Steve

> Thanks,
> 
>   M.
> 
>> + */
>> +if (system_supports_mte() && pte_present(pte) &&
>> +pte_access_permitted(pte, false) && !pte_special(pte))
>>  mte_sync_tags(ptep, pte);
>>  
>>  __check_racy_pte_update(mm, ptep, pte);
>> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
>> index c88e778c2fa9..a604818c52c1 100644
>> --- a/arch/arm64/kernel/mte.c
>> +++ b/arch/arm64/kernel/mte.c
>> @@ -33,11 +33,15 @@ DEFINE_STATIC_KEY_FALSE(mte_async_mode);
>>  EXPORT_SYMBOL_GPL(mte_async_mode);
>>  #endif
>>  
>> -static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool 
>> check_swap)
>> +static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool 
>> check_swap,
>> +   bool pte_is_tagged)
>>  {
>>  unsigned long flags;
>>  pte_t old_pte = READ_ONCE(*ptep);
>>  
>> +if (!is_swap_pte(old_pte) && !pte_is_tagged)
>> +return;
>> +
>>  spin_lock_irqsave(&tag_sync_lock, flags);
>>  
>>  /* Recheck with the lock held */
>> @@ -53,6 +57,9 @@ static void mte_sync_page_tags(struct page *page, pte_t 
>> *ptep, bool check_swap)
>>  }
>>  }
>>  
>> +if (!pte_is_tagged)
>> +goto out;
>> +
>>  page_kasan_tag_reset(page);
>>  /*
>>   * We need smp_wmb() in between setting the flags and clearing the
>> @@ -76,10 +83,15 @@ void mte_sync_tags(pte_t *ptep, pte_t pte)
>>  bool check_swap = nr_pages == 1;
>>  bool pte_is_tagged = pte_tagged(pte);
>>  
>> +/* Early out if there's nothing to do */
>> +if (!check_swap && !pte_is_tagged)
>> +return;
>> +
>>  /* if PG_mte_tagged is set, tags have already been initialised */
>>  for (i = 0; i < nr_pages; i++, page++) {
>>  if (!test_bit(PG_mte_tagged, &page->flags))
>> -mte_sync_page_tags(page, ptep, check_swap);
>> +mte_sync_page_tags(page, ptep, check_swap,
>> +   pte_is_tagged);
>>  }
>>  }
>>  
>> -- 
>> 2.20.1
>>
>>
> 




Re: [PATCH v12 1/8] arm64: mte: Handle race when synchronising tags

2021-05-17 Thread Steven Price
On 17/05/2021 15:03, Marc Zyngier wrote:
> Hi Steven,

Hi Marc,

> On Mon, 17 May 2021 13:32:32 +0100,
> Steven Price  wrote:
>>
>> mte_sync_tags() used test_and_set_bit() to set the PG_mte_tagged flag
>> before restoring/zeroing the MTE tags. However if another thread were to
>> race and attempt to sync the tags on the same page before the first
>> thread had completed restoring/zeroing then it would see the flag is
>> already set and continue without waiting. This would potentially expose
>> the previous contents of the tags to user space, and cause any updates
>> that user space makes before the restoring/zeroing has completed to
>> potentially be lost.
>>
>> Since this code is run from atomic contexts we can't just lock the page
>> during the process. Instead implement a new (global) spinlock to protect
>> the mte_sync_page_tags() function.
>>
>> Fixes: 34bfeea4a9e9 ("arm64: mte: Clear the tags when a page is mapped in 
>> user-space with PROT_MTE")
>> Signed-off-by: Steven Price 
>> ---
>> ---
>>  arch/arm64/kernel/mte.c | 21 ++---
>>  1 file changed, 18 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
>> index 125a10e413e9..c88e778c2fa9 100644
>> --- a/arch/arm64/kernel/mte.c
>> +++ b/arch/arm64/kernel/mte.c
>> @@ -25,6 +25,7 @@
>>  u64 gcr_kernel_excl __ro_after_init;
>>  
>>  static bool report_fault_once = true;
>> +static spinlock_t tag_sync_lock;
> 
> What initialises this spinlock? Have you tried this with lockdep? I'd
> expect it to be defined with DEFINE_SPINLOCK(), which always does the
> right thing.

You are of course absolutely right, and this will blow up with lockdep.
Sorry about that. DEFINE_SPINLOCK() solves the problem.
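
i.e. for the next version (one-line change):

-static spinlock_t tag_sync_lock;
+static DEFINE_SPINLOCK(tag_sync_lock);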

Thanks,

Steve



[PATCH v12 7/8] KVM: arm64: ioctl to fetch/store tags in a guest

2021-05-17 Thread Steven Price
The VMM may not wish to have its own mapping of guest memory mapped
with PROT_MTE because this causes problems if the VMM has tag checking
enabled (the guest controls the tags in physical RAM and it's unlikely
the tags are correct for the VMM).

Instead add a new ioctl which allows the VMM to easily read/write the
tags from guest memory, allowing the VMM's mapping to be non-PROT_MTE
while the VMM can still read/write the tags for the purpose of
migration.

Signed-off-by: Steven Price 
---
 arch/arm64/include/uapi/asm/kvm.h | 11 +
 arch/arm64/kvm/arm.c  | 69 +++
 include/uapi/linux/kvm.h  |  1 +
 3 files changed, 81 insertions(+)

diff --git a/arch/arm64/include/uapi/asm/kvm.h 
b/arch/arm64/include/uapi/asm/kvm.h
index 24223adae150..b3edde68bc3e 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -184,6 +184,17 @@ struct kvm_vcpu_events {
__u32 reserved[12];
 };
 
+struct kvm_arm_copy_mte_tags {
+   __u64 guest_ipa;
+   __u64 length;
+   void __user *addr;
+   __u64 flags;
+   __u64 reserved[2];
+};
+
+#define KVM_ARM_TAGS_TO_GUEST  0
+#define KVM_ARM_TAGS_FROM_GUEST 1
+
 /* If you need to interpret the index values, here is the key: */
 #define KVM_REG_ARM_COPROC_MASK0x0FFF
 #define KVM_REG_ARM_COPROC_SHIFT   16
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index e89a5e275e25..4b6c83beb75d 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1309,6 +1309,65 @@ static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
}
 }
 
+static int kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
+ struct kvm_arm_copy_mte_tags *copy_tags)
+{
+   gpa_t guest_ipa = copy_tags->guest_ipa;
+   size_t length = copy_tags->length;
+   void __user *tags = copy_tags->addr;
+   gpa_t gfn;
+   bool write = !(copy_tags->flags & KVM_ARM_TAGS_FROM_GUEST);
+   int ret = 0;
+
+   if (copy_tags->reserved[0] || copy_tags->reserved[1])
+   return -EINVAL;
+
+   if (copy_tags->flags & ~KVM_ARM_TAGS_FROM_GUEST)
+   return -EINVAL;
+
+   if (length & ~PAGE_MASK || guest_ipa & ~PAGE_MASK)
+   return -EINVAL;
+
+   gfn = gpa_to_gfn(guest_ipa);
+
+   mutex_lock(&kvm->slots_lock);
+
+   while (length > 0) {
+   kvm_pfn_t pfn = gfn_to_pfn_prot(kvm, gfn, write, NULL);
+   void *maddr;
+   unsigned long num_tags = PAGE_SIZE / MTE_GRANULE_SIZE;
+
+   if (is_error_noslot_pfn(pfn)) {
+   ret = -EFAULT;
+   goto out;
+   }
+
+   maddr = page_address(pfn_to_page(pfn));
+
+   if (!write) {
+   num_tags = mte_copy_tags_to_user(tags, maddr, num_tags);
+   kvm_release_pfn_clean(pfn);
+   } else {
+   num_tags = mte_copy_tags_from_user(maddr, tags,
+  num_tags);
+   kvm_release_pfn_dirty(pfn);
+   }
+
+   if (num_tags != PAGE_SIZE / MTE_GRANULE_SIZE) {
+   ret = -EFAULT;
+   goto out;
+   }
+
+   gfn++;
+   tags += num_tags;
+   length -= PAGE_SIZE;
+   }
+
+out:
+   mutex_unlock(&kvm->slots_lock);
+   return ret;
+}
+
 long kvm_arch_vm_ioctl(struct file *filp,
   unsigned int ioctl, unsigned long arg)
 {
@@ -1345,6 +1404,16 @@ long kvm_arch_vm_ioctl(struct file *filp,
 
return 0;
}
+   case KVM_ARM_MTE_COPY_TAGS: {
+   struct kvm_arm_copy_mte_tags copy_tags;
+
+   if (!kvm_has_mte(kvm))
+   return -EINVAL;
+
+   if (copy_from_user(&copy_tags, argp, sizeof(copy_tags)))
+   return -EFAULT;
+   return kvm_vm_ioctl_mte_copy_tags(kvm, &copy_tags);
+   }
default:
return -EINVAL;
}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 8c95ba0fadda..4c011c60d468 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1428,6 +1428,7 @@ struct kvm_s390_ucas_mapping {
 /* Available with KVM_CAP_PMU_EVENT_FILTER */
 #define KVM_SET_PMU_EVENT_FILTER  _IOW(KVMIO,  0xb2, struct 
kvm_pmu_event_filter)
 #define KVM_PPC_SVM_OFF  _IO(KVMIO,  0xb3)
+#define KVM_ARM_MTE_COPY_TAGS	_IOR(KVMIO,  0xb4, struct kvm_arm_copy_mte_tags)
 
 /* ioctl for vm fd */
 #define KVM_CREATE_DEVICE	  _IOWR(KVMIO,  0xe0, struct kvm_create_device)
-- 
2.20.1




[PATCH v12 6/8] arm64: kvm: Expose KVM_ARM_CAP_MTE

2021-05-17 Thread Steven Price
It's now safe for the VMM to enable MTE in a guest, so expose the
capability to user space.

Signed-off-by: Steven Price 
---
 arch/arm64/kvm/arm.c  | 9 +
 arch/arm64/kvm/sys_regs.c | 3 +++
 2 files changed, 12 insertions(+)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 1cb39c0803a4..e89a5e275e25 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -93,6 +93,12 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
r = 0;
kvm->arch.return_nisv_io_abort_to_user = true;
break;
+   case KVM_CAP_ARM_MTE:
+   if (!system_supports_mte() || kvm->created_vcpus)
+   return -EINVAL;
+   r = 0;
+   kvm->arch.mte_enabled = true;
+   break;
default:
r = -EINVAL;
break;
@@ -237,6 +243,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 */
r = 1;
break;
+   case KVM_CAP_ARM_MTE:
+   r = system_supports_mte();
+   break;
case KVM_CAP_STEAL_TIME:
r = kvm_arm_pvtime_supported();
break;
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 88adbc2286f2..3a749fa0779b 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1308,6 +1308,9 @@ static bool access_ccsidr(struct kvm_vcpu *vcpu, struct 
sys_reg_params *p,
 static unsigned int mte_visibility(const struct kvm_vcpu *vcpu,
   const struct sys_reg_desc *rd)
 {
+   if (kvm_has_mte(vcpu->kvm))
+   return 0;
+
return REG_HIDDEN;
 }
 
-- 
2.20.1




[PATCH v12 8/8] KVM: arm64: Document MTE capability and ioctl

2021-05-17 Thread Steven Price
A new capability (KVM_CAP_ARM_MTE) identifies that the kernel supports
granting a guest access to the tags, and provides a mechanism for the
VMM to enable it.

A new ioctl (KVM_ARM_MTE_COPY_TAGS) provides a simple way for a VMM to
access the tags of a guest without having to maintain a PROT_MTE mapping
in userspace. The above capability gates access to the ioctl.

Signed-off-by: Steven Price 
---
 Documentation/virt/kvm/api.rst | 53 ++
 1 file changed, 53 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 22d077562149..a31661b870ba 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -5034,6 +5034,40 @@ see KVM_XEN_VCPU_SET_ATTR above.
 The KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST type may not be used
 with the KVM_XEN_VCPU_GET_ATTR ioctl.
 
+4.130 KVM_ARM_MTE_COPY_TAGS
+---------------------------
+
+:Capability: KVM_CAP_ARM_MTE
+:Architectures: arm64
+:Type: vm ioctl
+:Parameters: struct kvm_arm_copy_mte_tags
+:Returns: 0 on success, < 0 on error
+
+::
+
+  struct kvm_arm_copy_mte_tags {
+   __u64 guest_ipa;
+   __u64 length;
+   union {
+   void __user *addr;
+   __u64 padding;
+   };
+   __u64 flags;
+   __u64 reserved[2];
+  };
+
+Copies Memory Tagging Extension (MTE) tags to/from guest tag memory. The
+``guest_ipa`` and ``length`` fields must be ``PAGE_SIZE`` aligned. The ``addr``
+field must point to a buffer which the tags will be copied to or from.
+
+``flags`` specifies the direction of copy, either ``KVM_ARM_TAGS_TO_GUEST`` or
+``KVM_ARM_TAGS_FROM_GUEST``.
+
+The size of the buffer to store the tags is ``(length / MTE_GRANULE_SIZE)``
+bytes (i.e. 1/16th of the corresponding size). Each byte contains a single tag
+value. This matches the format of ``PTRACE_PEEKMTETAGS`` and
+``PTRACE_POKEMTETAGS``.
+
 5. The kvm_run structure
 
 
@@ -6362,6 +6396,25 @@ default.
 
 See Documentation/x86/sgx/2.Kernel-internals.rst for more details.
 
+7.26 KVM_CAP_ARM_MTE
+--------------------
+
+:Architectures: arm64
+:Parameters: none
+
+This capability indicates that KVM (and the hardware) supports exposing the
+Memory Tagging Extensions (MTE) to the guest. It must also be enabled by the
+VMM before the guest will be granted access.
+
+When enabled the guest is able to access tags associated with any memory given
+to the guest. KVM will ensure that the pages are flagged ``PG_mte_tagged`` so
+that the tags are maintained during swap or hibernation of the host; however
+the VMM needs to manually save/restore the tags as appropriate if the VM is
+migrated.
+
+When enabled the VMM may make use of the ``KVM_ARM_MTE_COPY_TAGS`` ioctl to
+perform a bulk copy of tags to/from the guest.
+
 8. Other capabilities.
 ==
 
-- 
2.20.1




[PATCH v12 3/8] arm64: mte: Sync tags for pages where PTE is untagged

2021-05-17 Thread Steven Price
A KVM guest could store tags in a page even if the VMM hasn't mapped
the page with PROT_MTE. So when restoring pages from swap we will
need to check to see if there are any saved tags even if !pte_tagged().

However don't check pages for which pte_access_permitted() returns false
as these will not have been swapped out.

Signed-off-by: Steven Price 
---
 arch/arm64/include/asm/pgtable.h |  9 +++--
 arch/arm64/kernel/mte.c  | 16 ++--
 2 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 0b10204e72fc..275178a810c1 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -314,8 +314,13 @@ static inline void set_pte_at(struct mm_struct *mm, 
unsigned long addr,
if (pte_present(pte) && pte_user_exec(pte) && !pte_special(pte))
__sync_icache_dcache(pte);
 
-   if (system_supports_mte() &&
-   pte_present(pte) && pte_tagged(pte) && !pte_special(pte))
+   /*
+* If the PTE would provide user space access to the tags associated
+* with it then ensure that the MTE tags are synchronised.  Exec-only
+* mappings don't expose tags (instruction fetches don't check tags).
+*/
+   if (system_supports_mte() && pte_present(pte) &&
+   pte_access_permitted(pte, false) && !pte_special(pte))
mte_sync_tags(ptep, pte);
 
__check_racy_pte_update(mm, ptep, pte);
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index c88e778c2fa9..a604818c52c1 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -33,11 +33,15 @@ DEFINE_STATIC_KEY_FALSE(mte_async_mode);
 EXPORT_SYMBOL_GPL(mte_async_mode);
 #endif
 
-static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
+static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap,
+  bool pte_is_tagged)
 {
unsigned long flags;
pte_t old_pte = READ_ONCE(*ptep);
 
+   if (!is_swap_pte(old_pte) && !pte_is_tagged)
+   return;
+
	spin_lock_irqsave(&tag_sync_lock, flags);
 
/* Recheck with the lock held */
@@ -53,6 +57,9 @@ static void mte_sync_page_tags(struct page *page, pte_t 
*ptep, bool check_swap)
}
}
 
+   if (!pte_is_tagged)
+   goto out;
+
page_kasan_tag_reset(page);
/*
 * We need smp_wmb() in between setting the flags and clearing the
@@ -76,10 +83,15 @@ void mte_sync_tags(pte_t *ptep, pte_t pte)
bool check_swap = nr_pages == 1;
bool pte_is_tagged = pte_tagged(pte);
 
+   /* Early out if there's nothing to do */
+   if (!check_swap && !pte_is_tagged)
+   return;
+
/* if PG_mte_tagged is set, tags have already been initialised */
for (i = 0; i < nr_pages; i++, page++) {
		if (!test_bit(PG_mte_tagged, &page->flags))
-   mte_sync_page_tags(page, ptep, check_swap);
+   mte_sync_page_tags(page, ptep, check_swap,
+  pte_is_tagged);
}
 }
 
-- 
2.20.1




[PATCH v12 5/8] arm64: kvm: Save/restore MTE registers

2021-05-17 Thread Steven Price
Define the new system registers that MTE introduces and context switch
them. The MTE feature is still hidden from the ID register as it isn't
supported in a VM yet.

Signed-off-by: Steven Price 
---
 arch/arm64/include/asm/kvm_host.h  |  6 ++
 arch/arm64/include/asm/kvm_mte.h   | 66 ++
 arch/arm64/include/asm/sysreg.h|  3 +-
 arch/arm64/kernel/asm-offsets.c|  3 +
 arch/arm64/kvm/hyp/entry.S |  7 +++
 arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 21 +++
 arch/arm64/kvm/sys_regs.c  | 22 ++--
 7 files changed, 123 insertions(+), 5 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_mte.h

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index afaa5333f0e4..309e36cc1b42 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -208,6 +208,12 @@ enum vcpu_sysreg {
CNTP_CVAL_EL0,
CNTP_CTL_EL0,
 
+   /* Memory Tagging Extension registers */
+   RGSR_EL1,   /* Random Allocation Tag Seed Register */
+   GCR_EL1,/* Tag Control Register */
+   TFSR_EL1,   /* Tag Fault Status Register (EL1) */
+   TFSRE0_EL1, /* Tag Fault Status Register (EL0) */
+
/* 32bit specific registers. Keep them at the end of the range */
DACR32_EL2, /* Domain Access Control Register */
IFSR32_EL2, /* Instruction Fault Status Register */
diff --git a/arch/arm64/include/asm/kvm_mte.h b/arch/arm64/include/asm/kvm_mte.h
new file mode 100644
index ..6541c7d6ce06
--- /dev/null
+++ b/arch/arm64/include/asm/kvm_mte.h
@@ -0,0 +1,66 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2020 ARM Ltd.
+ */
+#ifndef __ASM_KVM_MTE_H
+#define __ASM_KVM_MTE_H
+
+#ifdef __ASSEMBLY__
+
+#include <asm/sysreg.h>
+
+#ifdef CONFIG_ARM64_MTE
+
+.macro mte_switch_to_guest g_ctxt, h_ctxt, reg1
+alternative_if_not ARM64_MTE
+   b   .L__skip_switch\@
+alternative_else_nop_endif
+   mrs \reg1, hcr_el2
+   and \reg1, \reg1, #(HCR_ATA)
+   cbz \reg1, .L__skip_switch\@
+
+   mrs_s   \reg1, SYS_RGSR_EL1
+   str \reg1, [\h_ctxt, #CPU_RGSR_EL1]
+   mrs_s   \reg1, SYS_GCR_EL1
+   str \reg1, [\h_ctxt, #CPU_GCR_EL1]
+
+   ldr \reg1, [\g_ctxt, #CPU_RGSR_EL1]
+   msr_s   SYS_RGSR_EL1, \reg1
+   ldr \reg1, [\g_ctxt, #CPU_GCR_EL1]
+   msr_s   SYS_GCR_EL1, \reg1
+
+.L__skip_switch\@:
+.endm
+
+.macro mte_switch_to_hyp g_ctxt, h_ctxt, reg1
+alternative_if_not ARM64_MTE
+   b   .L__skip_switch\@
+alternative_else_nop_endif
+   mrs \reg1, hcr_el2
+   and \reg1, \reg1, #(HCR_ATA)
+   cbz \reg1, .L__skip_switch\@
+
+   mrs_s   \reg1, SYS_RGSR_EL1
+   str \reg1, [\g_ctxt, #CPU_RGSR_EL1]
+   mrs_s   \reg1, SYS_GCR_EL1
+   str \reg1, [\g_ctxt, #CPU_GCR_EL1]
+
+   ldr \reg1, [\h_ctxt, #CPU_RGSR_EL1]
+   msr_s   SYS_RGSR_EL1, \reg1
+   ldr \reg1, [\h_ctxt, #CPU_GCR_EL1]
+   msr_s   SYS_GCR_EL1, \reg1
+
+.L__skip_switch\@:
+.endm
+
+#else /* CONFIG_ARM64_MTE */
+
+.macro mte_switch_to_guest g_ctxt, h_ctxt, reg1
+.endm
+
+.macro mte_switch_to_hyp g_ctxt, h_ctxt, reg1
+.endm
+
+#endif /* CONFIG_ARM64_MTE */
+#endif /* __ASSEMBLY__ */
+#endif /* __ASM_KVM_MTE_H */
diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index 65d15700a168..347ccac2341e 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -651,7 +651,8 @@
 
 #define INIT_SCTLR_EL2_MMU_ON  \
(SCTLR_ELx_M  | SCTLR_ELx_C | SCTLR_ELx_SA | SCTLR_ELx_I |  \
-SCTLR_ELx_IESB | SCTLR_ELx_WXN | ENDIAN_SET_EL2 | SCTLR_EL2_RES1)
+SCTLR_ELx_IESB | SCTLR_ELx_WXN | ENDIAN_SET_EL2 |  \
+SCTLR_ELx_ITFSB | SCTLR_EL2_RES1)
 
 #define INIT_SCTLR_EL2_MMU_OFF \
(SCTLR_EL2_RES1 | ENDIAN_SET_EL2)
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 0cb34ccb6e73..6b489a8462f0 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -111,6 +111,9 @@ int main(void)
   DEFINE(VCPU_WORKAROUND_FLAGS,offsetof(struct kvm_vcpu, 
arch.workaround_flags));
   DEFINE(VCPU_HCR_EL2, offsetof(struct kvm_vcpu, arch.hcr_el2));
   DEFINE(CPU_USER_PT_REGS, offsetof(struct kvm_cpu_context, regs));
+  DEFINE(CPU_RGSR_EL1, offsetof(struct kvm_cpu_context, 
sys_regs[RGSR_EL1]));
+  DEFINE(CPU_GCR_EL1,  offsetof(struct kvm_cpu_context, 
sys_regs[GCR_EL1]));
+  DEFINE(CPU_TFSRE0_EL1,   offsetof(struct kvm_cpu_context, 
sys_regs[TFSRE0_EL1]));
   DEFINE(CPU_APIAKEYLO_EL1,offsetof(struct kvm_cpu_context, 
sys_regs[APIAKEYLO_EL1]));
   DEFINE(CPU_APIBKEYLO_EL1,offsetof(struct kvm_cpu_context, 
sys_regs[APIBKEYLO_EL1]));
   DEFINE(CPU_APDAKEYLO_EL1,offsetof(struct kvm_cpu_context, 
sys_regs

[PATCH v12 2/8] arm64: Handle MTE tags zeroing in __alloc_zeroed_user_highpage()

2021-05-17 Thread Steven Price
From: Catalin Marinas 

Currently, on an anonymous page fault, the kernel allocates a zeroed
page and maps it in user space. If the mapping is tagged (PROT_MTE),
set_pte_at() additionally clears the tags under a spinlock to avoid a
race on the page->flags. In order to optimise the lock, clear the page
tags on allocation in __alloc_zeroed_user_highpage() if the vma flags
have VM_MTE set.

Signed-off-by: Catalin Marinas 
Signed-off-by: Steven Price 
---
 arch/arm64/include/asm/page.h |  6 --
 arch/arm64/mm/fault.c | 21 +
 2 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
index 012cffc574e8..97853570d0f1 100644
--- a/arch/arm64/include/asm/page.h
+++ b/arch/arm64/include/asm/page.h
@@ -13,6 +13,7 @@
 #ifndef __ASSEMBLY__
 
 #include  /* for READ_IMPLIES_EXEC */
+#include 
 #include 
 
 struct page;
@@ -28,8 +29,9 @@ void copy_user_highpage(struct page *to, struct page *from,
 void copy_highpage(struct page *to, struct page *from);
 #define __HAVE_ARCH_COPY_HIGHPAGE
 
-#define __alloc_zeroed_user_highpage(movableflags, vma, vaddr) \
-   alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vaddr)
+struct page *__alloc_zeroed_user_highpage(gfp_t movableflags,
+ struct vm_area_struct *vma,
+ unsigned long vaddr);
 #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
 
 #define clear_user_page(page, vaddr, pg)   clear_page(page)
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 871c82ab0a30..5a03428e97f3 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -921,3 +921,24 @@ void do_debug_exception(unsigned long addr_if_watchpoint, 
unsigned int esr,
debug_exception_exit(regs);
 }
 NOKPROBE_SYMBOL(do_debug_exception);
+
+/*
+ * Used during anonymous page fault handling.
+ */
+struct page *__alloc_zeroed_user_highpage(gfp_t movableflags,
+ struct vm_area_struct *vma,
+ unsigned long vaddr)
+{
+   struct page *page;
+   bool tagged = system_supports_mte() && (vma->vm_flags & VM_MTE);
+
+   page = alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma,
+ vaddr);
+   if (tagged && page) {
+   mte_clear_page_tags(page_address(page));
+   page_kasan_tag_reset(page);
+   set_bit(PG_mte_tagged, &page->flags);
+   }
+
+   return page;
+}
-- 
2.20.1




[PATCH v12 1/8] arm64: mte: Handle race when synchronising tags

2021-05-17 Thread Steven Price
mte_sync_tags() used test_and_set_bit() to set the PG_mte_tagged flag
before restoring/zeroing the MTE tags. However if another thread were to
race and attempt to sync the tags on the same page before the first
thread had completed restoring/zeroing then it would see the flag is
already set and continue without waiting. This would potentially expose
the previous contents of the tags to user space, and cause any updates
that user space makes before the restoring/zeroing has completed to
potentially be lost.

Since this code is run from atomic contexts we can't just lock the page
during the process. Instead implement a new (global) spinlock to protect
the mte_sync_page_tags() function.

Fixes: 34bfeea4a9e9 ("arm64: mte: Clear the tags when a page is mapped in 
user-space with PROT_MTE")
Signed-off-by: Steven Price 
---
---
 arch/arm64/kernel/mte.c | 21 ++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index 125a10e413e9..c88e778c2fa9 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -25,6 +25,7 @@
 u64 gcr_kernel_excl __ro_after_init;
 
 static bool report_fault_once = true;
+static spinlock_t tag_sync_lock;
 
 #ifdef CONFIG_KASAN_HW_TAGS
 /* Whether the MTE asynchronous mode is enabled. */
@@ -34,13 +35,22 @@ EXPORT_SYMBOL_GPL(mte_async_mode);
 
 static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
 {
+   unsigned long flags;
pte_t old_pte = READ_ONCE(*ptep);
 
+   spin_lock_irqsave(&tag_sync_lock, flags);
+
+   /* Recheck with the lock held */
+   if (test_bit(PG_mte_tagged, &page->flags))
+   goto out;
+
if (check_swap && is_swap_pte(old_pte)) {
swp_entry_t entry = pte_to_swp_entry(old_pte);
 
-   if (!non_swap_entry(entry) && mte_restore_tags(entry, page))
-   return;
+   if (!non_swap_entry(entry) && mte_restore_tags(entry, page)) {
+   set_bit(PG_mte_tagged, &page->flags);
+   goto out;
+   }
}
 
page_kasan_tag_reset(page);
@@ -53,6 +63,10 @@ static void mte_sync_page_tags(struct page *page, pte_t 
*ptep, bool check_swap)
 */
smp_wmb();
mte_clear_page_tags(page_address(page));
+   set_bit(PG_mte_tagged, &page->flags);
+
+out:
+   spin_unlock_irqrestore(&tag_sync_lock, flags);
 }
 
 void mte_sync_tags(pte_t *ptep, pte_t pte)
@@ -60,10 +74,11 @@ void mte_sync_tags(pte_t *ptep, pte_t pte)
struct page *page = pte_page(pte);
long i, nr_pages = compound_nr(page);
bool check_swap = nr_pages == 1;
+   bool pte_is_tagged = pte_tagged(pte);
 
/* if PG_mte_tagged is set, tags have already been initialised */
for (i = 0; i < nr_pages; i++, page++) {
-   if (!test_and_set_bit(PG_mte_tagged, &page->flags))
+   if (!test_bit(PG_mte_tagged, &page->flags))
mte_sync_page_tags(page, ptep, check_swap);
}
 }
-- 
2.20.1




[PATCH v12 4/8] arm64: kvm: Introduce MTE VM feature

2021-05-17 Thread Steven Price
Add a new VM feature 'KVM_ARM_CAP_MTE' which enables memory tagging
for a VM. This will expose the feature to the guest and automatically
tag memory pages touched by the VM as PG_mte_tagged (and clear the tag
storage) to ensure that the guest cannot see stale tags, and so that
the tags are correctly saved/restored across swap.

Actually exposing the new capability to user space happens in a later
patch.

Signed-off-by: Steven Price 
---
 arch/arm64/include/asm/kvm_emulate.h |  3 +++
 arch/arm64/include/asm/kvm_host.h|  3 +++
 arch/arm64/kvm/hyp/exception.c   |  3 ++-
 arch/arm64/kvm/mmu.c | 37 +++-
 arch/arm64/kvm/sys_regs.c|  3 +++
 include/uapi/linux/kvm.h |  1 +
 6 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_emulate.h 
b/arch/arm64/include/asm/kvm_emulate.h
index f612c090f2e4..6bf776c2399c 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -84,6 +84,9 @@ static inline void vcpu_reset_hcr(struct kvm_vcpu *vcpu)
if (cpus_have_const_cap(ARM64_MISMATCHED_CACHE_TYPE) ||
vcpu_el1_is_32bit(vcpu))
vcpu->arch.hcr_el2 |= HCR_TID2;
+
+   if (kvm_has_mte(vcpu->kvm))
+   vcpu->arch.hcr_el2 |= HCR_ATA;
 }
 
 static inline unsigned long *vcpu_hcr(struct kvm_vcpu *vcpu)
diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 7cd7d5c8c4bc..afaa5333f0e4 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -132,6 +132,8 @@ struct kvm_arch {
 
u8 pfr0_csv2;
u8 pfr0_csv3;
+   /* Memory Tagging Extension enabled for the guest */
+   bool mte_enabled;
 };
 
 struct kvm_vcpu_fault_info {
@@ -769,6 +771,7 @@ bool kvm_arm_vcpu_is_finalized(struct kvm_vcpu *vcpu);
 #define kvm_arm_vcpu_sve_finalized(vcpu) \
((vcpu)->arch.flags & KVM_ARM64_VCPU_SVE_FINALIZED)
 
+#define kvm_has_mte(kvm) (system_supports_mte() && (kvm)->arch.mte_enabled)
 #define kvm_vcpu_has_pmu(vcpu) \
(test_bit(KVM_ARM_VCPU_PMU_V3, (vcpu)->arch.features))
 
diff --git a/arch/arm64/kvm/hyp/exception.c b/arch/arm64/kvm/hyp/exception.c
index 73629094f903..56426565600c 100644
--- a/arch/arm64/kvm/hyp/exception.c
+++ b/arch/arm64/kvm/hyp/exception.c
@@ -112,7 +112,8 @@ static void enter_exception64(struct kvm_vcpu *vcpu, 
unsigned long target_mode,
new |= (old & PSR_C_BIT);
new |= (old & PSR_V_BIT);
 
-   // TODO: TCO (if/when ARMv8.5-MemTag is exposed to guests)
+   if (kvm_has_mte(vcpu->kvm))
+   new |= PSR_TCO_BIT;
 
new |= (old & PSR_DIT_BIT);
 
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index c5d1f3c87dbd..8660f6a03f51 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -822,6 +822,31 @@ transparent_hugepage_adjust(struct kvm_memory_slot 
*memslot,
return PAGE_SIZE;
 }
 
+static int sanitise_mte_tags(struct kvm *kvm, unsigned long size,
+kvm_pfn_t pfn)
+{
+   if (kvm_has_mte(kvm)) {
+   /*
+* The page will be mapped in stage 2 as Normal Cacheable, so
+* the VM will be able to see the page's tags and therefore
+* they must be initialised first. If PG_mte_tagged is set,
+* tags have already been initialised.
+*/
+   unsigned long i, nr_pages = size >> PAGE_SHIFT;
+   struct page *page = pfn_to_online_page(pfn);
+
+   if (!page)
+   return -EFAULT;
+
+   for (i = 0; i < nr_pages; i++, page++) {
+   if (!test_and_set_bit(PG_mte_tagged, &page->flags))
+   mte_clear_page_tags(page_address(page));
+   }
+   }
+
+   return 0;
+}
+
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
  struct kvm_memory_slot *memslot, unsigned long hva,
  unsigned long fault_status)
@@ -971,8 +996,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
if (writable)
prot |= KVM_PGTABLE_PROT_W;
 
-   if (fault_status != FSC_PERM && !device)
+   if (fault_status != FSC_PERM && !device) {
+   ret = sanitise_mte_tags(kvm, vma_pagesize, pfn);
+   if (ret)
+   goto out_unlock;
+
clean_dcache_guest_page(pfn, vma_pagesize);
+   }
 
if (exec_fault) {
prot |= KVM_PGTABLE_PROT_X;
@@ -1168,12 +1198,17 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct 
kvm_gfn_range *range)
 bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
kvm_pfn_t pfn = pte_pfn(range->pte);
+   int ret;
 
if (!kvm->arch.mmu.

[PATCH v12 0/8] MTE support for KVM guest

2021-05-17 Thread Steven Price
This series adds support for using the Arm Memory Tagging Extensions
(MTE) in a KVM guest.

Changes since v11[1]:

 * Series is prefixed with a bug fix for a potential race synchronising
   tags. This is basically the same race as was recently[2] fixed for
   PG_dcache_clean where the update of the page flag cannot be done
   atomically with the work that flag represents.

   For the PG_dcache_clean case the problem is easier because extra
   cache maintenance isn't a problem, but here restoring the tags twice
   could cause data loss.

   The current solution is a global spinlock for mte_sync_page_tags().
   If we hit scalability problems then other solutions such as
   potentially using another page flag as a lock will need to be
   investigated.

 * The second patch is from Catalin to mitigate the performance impact
   of the first - by handling the page zeroing case explicitly we can
   avoid entering mte_sync_page_tags() at all in most cases. Peter
   Collingbourne has a patch which similarly improves this case using
   the DC GZVA instruction. So this patch may be dropped in favour of
   Peter's, however Catalin's is likely easier to backport.

 * Use pte_access_permitted() in set_pte_at() to identify pages which
   may be accessed by the user rather than open-coding a check for
   PTE_USER. Also add a comment documenting what's going on.
   There are also some short-cuts added in mte_sync_tags() compared to the
   previous post, to again mitigate the performance impact of the first
   patch.

 * Move the code to sanitise tags out of user_mem_abort() into its own
   function. Also call this new function from kvm_set_spte_gfn() as that
   path was missing the sanitising.

   Originally I was going to move the code all the way down to
   kvm_pgtable_stage2_map(). Sadly as that also part of the EL2
   hypervisor this breaks nVHE as the code needs to perform actions in
   the host.

 * Drop the union in struct kvm_vcpu_events - it served no purpose and
   was confusing.

 * Update CAP number (again) and other minor conflict resolutions.

[1] https://lore.kernel.org/r/20210416154309.22129-1-steven.pr...@arm.com/
[2] https://lore.kernel.org/r/20210514095001.13236-1-catalin.mari...@arm.com/
[3] 
https://lore.kernel.org/r/de812a02fd94a0dba07d43606bd893c564aa4528.1620849613.git@google.com/

Catalin Marinas (1):
  arm64: Handle MTE tags zeroing in __alloc_zeroed_user_highpage()

Steven Price (7):
  arm64: mte: Handle race when synchronising tags
  arm64: mte: Sync tags for pages where PTE is untagged
  arm64: kvm: Introduce MTE VM feature
  arm64: kvm: Save/restore MTE registers
  arm64: kvm: Expose KVM_ARM_CAP_MTE
  KVM: arm64: ioctl to fetch/store tags in a guest
  KVM: arm64: Document MTE capability and ioctl

 Documentation/virt/kvm/api.rst | 53 +++
 arch/arm64/include/asm/kvm_emulate.h   |  3 +
 arch/arm64/include/asm/kvm_host.h  |  9 +++
 arch/arm64/include/asm/kvm_mte.h   | 66 ++
 arch/arm64/include/asm/page.h  |  6 +-
 arch/arm64/include/asm/pgtable.h   |  9 ++-
 arch/arm64/include/asm/sysreg.h|  3 +-
 arch/arm64/include/uapi/asm/kvm.h  | 11 +++
 arch/arm64/kernel/asm-offsets.c|  3 +
 arch/arm64/kernel/mte.c| 37 --
 arch/arm64/kvm/arm.c   | 78 ++
 arch/arm64/kvm/hyp/entry.S |  7 ++
 arch/arm64/kvm/hyp/exception.c |  3 +-
 arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 21 ++
 arch/arm64/kvm/mmu.c   | 37 +-
 arch/arm64/kvm/sys_regs.c  | 28 ++--
 arch/arm64/mm/fault.c  | 21 ++
 include/uapi/linux/kvm.h   |  2 +
 18 files changed, 381 insertions(+), 16 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_mte.h

-- 
2.20.1




Re: [PATCH v11 2/6] arm64: kvm: Introduce MTE VM feature

2021-05-13 Thread Steven Price
On 12/05/2021 18:45, Catalin Marinas wrote:
> On Wed, May 12, 2021 at 04:46:48PM +0100, Steven Price wrote:
>> On 10/05/2021 19:35, Catalin Marinas wrote:
>>> On Fri, May 07, 2021 at 07:25:39PM +0100, Catalin Marinas wrote:
>>>> On Thu, May 06, 2021 at 05:15:25PM +0100, Steven Price wrote:
>>>>> On 04/05/2021 18:40, Catalin Marinas wrote:
>>>>>> On Thu, Apr 29, 2021 at 05:06:41PM +0100, Steven Price wrote:
>>>>>>> Given the changes to set_pte_at() which means that tags are restored 
>>>>>>> from
>>>>>>> swap even if !PROT_MTE, the only race I can see remaining is the 
>>>>>>> creation of
>>>>>>> new PROT_MTE mappings. As you mention an attempt to change mappings in 
>>>>>>> the
>>>>>>> VMM memory space should involve a mmu notifier call which I think 
>>>>>>> serialises
>>>>>>> this. So the remaining issue is doing this in a separate address space.
>>>>>>>
>>>>>>> So I guess the potential problem is:
>>>>>>>
>>>>>>>* allocate memory MAP_SHARED but !PROT_MTE
>>>>>>>* fork()
>>>>>>>* VM causes a fault in parent address space
>>>>>>>* child does a mprotect(PROT_MTE)
>>>>>>>
>>>>>>> With the last two potentially racing. Sadly I can't see a good way of
>>>>>>> handling that.
> [...]
>>> Options:
>>>
>>> 1. Change the mte_sync_tags() code path to set the flag after clearing
>>> and avoid reading stale tags. We document that mprotect() on
>>> MAP_SHARED may lead to tag loss. Maybe we can intercept this in the
>>> arch code and return an error.
>>
>> This is the best option I've come up with so far - but it's not a good
>> one! We can replace the set_bit() with a test_and_set_bit() to catch the
>> race after it has occurred - but I'm not sure what we can do about it
>> then (we've already wiped the data). Returning an error doesn't seem
>> particularly useful at that point, a message in dmesg is about the best
>> I can come up with.
> 
> What I meant about intercepting is on something like
> arch_validate_flags() to prevent VM_SHARED and VM_MTE together but only
> for mprotect(), not mmap(). However, arch_validate_flags() is currently
> called on both mmap() and mprotect() paths.

I think even if we were to restrict mprotect() there would be corner
cases around swapping in. For example if a page mapped VM_SHARED|VM_MTE
is faulted simultaneously in both processes then we have the same situation:

 * with test_and_set_bit() one process could potentially see the tags
before they have been restored - i.e. a data leak.

 * with separated test and set then one process could write to the tags
before the second restore has completed causing a lost update.

Obviously completely banning VM_SHARED|VM_MTE might work, but I don't
think that's a good idea.

> We can't do much in set_pte_at() to prevent the race with only a single
> bit.
> 
>>> 2. Figure out some other locking in the core code. However, if
>>> mprotect() in one process can race with a handle_pte_fault() in
>>> another, on the same shared mapping, it's not trivial.
>>> filemap_map_pages() would take the page lock before calling
>>> do_set_pte(), so mprotect() would need the same page lock.
>>
>> I can't see how this is going to work without harming the performance of
>> non-MTE work. Ultimately we're trying to add some sort of locking for
>> two (mostly) unrelated processes doing page table operations, which will
>> hurt scalability.
> 
> Another option is to have an arch callback to force re-faulting on the
> pte. That means we don't populate it back after the invalidation in the
> change_protection() path. We could do this only if the new pte is tagged
> and the page doesn't have PG_mte_tagged. The faulting path takes the
> page lock IIUC.

As above - I don't think this race is just on the change_protection() path.

> Well, at least for stage 1, I haven't thought much about stage 2.
> 
>>> 3. Use another PG_arch_3 bit as a lock to spin on in the arch code (i.e.
>>> set it around the other PG_arch_* bit setting).
>>
>> This is certainly tempting, although sadly the existing
>> wait_on_page_bit() is sleeping - so this would either be a literal spin,
>> or we'd need to implement a new non-sleeping wait mechanism.
> 
> Yeah, it would have to be a custom spinning mechanism, something like:
> 
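A minimal sketch of the kind of non-sleeping, spin-on-a-page-flag mechanism
being discussed. Purely illustrative: it assumes a spare bit (the PG_arch_3
from option 3 above) has been set aside to act as the lock.

#include <linux/bitops.h>
#include <linux/mm_types.h>
#include <linux/page-flags.h>
#include <linux/processor.h>

/* PG_arch_3 stands in for a new, otherwise unused page flag. */
static inline void mte_lock_page(struct page *page)
{
	/* Spin (never sleep) until we win the race for the lock bit. */
	while (test_and_set_bit_lock(PG_arch_3, &page->flags))
		cpu_relax();
}

static inline void mte_unlock_page(struct page *page)
{
	/* Release; pairs with the acquire in test_and_set_bit_lock(). */
	clear_bit_unlock(PG_arch_3, &page->flags);
}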

Re: [PATCH v11 2/6] arm64: kvm: Introduce MTE VM feature

2021-05-12 Thread Steven Price

On 10/05/2021 19:35, Catalin Marinas wrote:

On Fri, May 07, 2021 at 07:25:39PM +0100, Catalin Marinas wrote:

On Thu, May 06, 2021 at 05:15:25PM +0100, Steven Price wrote:

On 04/05/2021 18:40, Catalin Marinas wrote:

On Thu, Apr 29, 2021 at 05:06:41PM +0100, Steven Price wrote:

Given the changes to set_pte_at() which means that tags are restored from
swap even if !PROT_MTE, the only race I can see remaining is the creation of
new PROT_MTE mappings. As you mention an attempt to change mappings in the
VMM memory space should involve a mmu notifier call which I think serialises
this. So the remaining issue is doing this in a separate address space.

So I guess the potential problem is:

   * allocate memory MAP_SHARED but !PROT_MTE
   * fork()
   * VM causes a fault in parent address space
   * child does a mprotect(PROT_MTE)

With the last two potentially racing. Sadly I can't see a good way of
handling that.


Ah, the mmap lock doesn't help as they are different processes
(mprotect() acquires it as a writer).

I wonder whether this is racy even in the absence of KVM. If both parent
and child do an mprotect(PROT_MTE), one of them may be reading stale
tags for a brief period.

Maybe we should revisit whether shared MTE pages are of any use, though
it's an ABI change (not bad if no-one is relying on this). However...

[...]

Thinking about this, we have a similar problem with the PG_dcache_clean
and two processes doing mprotect(PROT_EXEC). One of them could see the
flag set and skip the I-cache maintenance while the other executes
stale instructions. change_pte_range() could acquire the page lock if
the page is VM_SHARED (my preferred core mm fix). It doesn't immediately
solve the MTE/KVM case but we could at least take the page lock via
user_mem_abort().

[...]

This is the real issue I see - the race in PROT_MTE case is either a data
leak (syncing after setting the bit) or data loss (syncing before setting
the bit).

[...]

But without serialising through a spinlock (in mte_sync_tags()) I haven't
been able to come up with any way of closing the race. But with the change
to set_pte_at() to call mte_sync_tags() even if the PTE isn't PROT_MTE that
is likely to seriously hurt performance.


Yeah. We could add another page flag as a lock though I think it should
be the core code that prevents the race.

If we are to do it in the arch code, maybe easier with a custom
ptep_modify_prot_start/end() where we check if it's VM_SHARED and
VM_MTE, take a (big) lock.


I think in the general case we don't even need VM_SHARED. For example,
we have two processes mapping a file, read-only. An mprotect() call in
both processes will race on the page->flags via the corresponding
set_pte_at(). I think an mprotect() with a page fault in different
processes can also race.

The PROT_EXEC case can be easily fixed, as you said already. The
PROT_MTE with MAP_PRIVATE I think can be made safe by a similar
approach: test flag, clear tags, set flag. A subsequent write would
trigger a CoW, so different page anyway.

Anyway, I don't think ptep_modify_prot_start/end would buy us much, it
probably makes the code even harder to read.


In the core code, something like below (well, a partial hack, not tested
and it doesn't handle huge pages but just to give an idea):

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 94188df1ee55..6ba96ff141a6 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -114,6 +113,10 @@ static unsigned long change_pte_range(struct 
vm_area_struct *vma, pmd_t *pmd,
}
  
  			oldpte = ptep_modify_prot_start(vma, addr, pte);

+   if (vma->vm_flags & VM_SHARED) {
+   page = vm_normal_page(vma, addr, oldpte);
+   lock_page(page);
+   }
ptent = pte_modify(oldpte, newprot);
if (preserve_write)
ptent = pte_mk_savedwrite(ptent);
@@ -138,6 +141,8 @@ static unsigned long change_pte_range(struct vm_area_struct 
*vma, pmd_t *pmd,
ptent = pte_mkwrite(ptent);
}
ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
+   if (page)
+   unlock_page(page);
pages++;
} else if (is_swap_pte(oldpte)) {
swp_entry_t entry = pte_to_swp_entry(oldpte);


That's bogus: lock_page() might sleep but this whole code sequence is
under the ptl spinlock. There are some lock_page_* variants but that
would involve either a busy loop on this path or some bailing out,
waiting for a release.

Options:

1. Change the mte_sync_tags() code path to set the flag after clearing
and avoid reading stale tags. We document that mprotect() on
MAP_SHARED may lead to tag loss. Maybe we can intercept this in the
arch code and return an error.


This is the best o

Re: [PATCH v11 5/6] KVM: arm64: ioctl to fetch/store tags in a guest

2021-05-07 Thread Steven Price

On 04/05/2021 18:44, Catalin Marinas wrote:

On Thu, Apr 29, 2021 at 05:06:07PM +0100, Steven Price wrote:

On 27/04/2021 18:58, Catalin Marinas wrote:

On Fri, Apr 16, 2021 at 04:43:08PM +0100, Steven Price wrote:

diff --git a/arch/arm64/include/uapi/asm/kvm.h 
b/arch/arm64/include/uapi/asm/kvm.h
index 24223adae150..2b85a047c37d 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -184,6 +184,20 @@ struct kvm_vcpu_events {
__u32 reserved[12];
   };
+struct kvm_arm_copy_mte_tags {
+   __u64 guest_ipa;
+   __u64 length;
+   union {
+   void __user *addr;
+   __u64 padding;
+   };
+   __u64 flags;
+   __u64 reserved[2];
+};

[...]

Maybe add the two reserved
values to the union in case we want to store something else in the
future.


I'm not sure what you mean here. What would the reserved fields be unioned
with? And surely they are no longer reserved in that case?


In case you want to keep the structure size the same for future
expansion and the expansion only happens via the union, you'd add some
padding in there just in case. We do this for struct siginfo with an
_si_pad[] array in the union.



Ah I see what you mean. In this case "padding" is just a sizer to ensure 
that flags is always the same alignment - it's not intended to be used. 
As I noted previously though it's completely pointless as this is only on 
arm64 and even 32 bit Arm would naturally align the following __u64.


reserved[] is for expansion and I guess we could have a union over the 
whole struct (like siginfo) but I think it's generally clearer to just 
spell out the reserved fields at the end of the struct.


TLDR; the union will be gone along with "padding" in the next version. 
"reserved" remains at the end of the struct for future use.


Thanks,

Steve



Re: [PATCH v11 2/6] arm64: kvm: Introduce MTE VM feature

2021-05-06 Thread Steven Price

On 04/05/2021 18:40, Catalin Marinas wrote:

On Thu, Apr 29, 2021 at 05:06:41PM +0100, Steven Price wrote:

On 28/04/2021 18:07, Catalin Marinas wrote:

I probably asked already but is the only way to map a standard RAM page
(not device) in stage 2 via the fault handler? One case I had in mind
was something like get_user_pages() but it looks like that one doesn't
call set_pte_at_notify(). There are a few other places where
set_pte_at_notify() is called and these may happen before we got a
chance to fault on stage 2, effectively populating the entry (IIUC). If
that's an issue, we could move the above loop and check closer to the
actual pte setting like kvm_pgtable_stage2_map().


The only call sites of kvm_pgtable_stage2_map() are in mmu.c:

  * kvm_phys_addr_ioremap() - maps as device in stage 2

  * user_mem_abort() - handled above

  * kvm_set_spte_handler() - ultimately called from the .change_pte()
callback of the MMU notifier

So the last one is potentially a problem. It's called via the MMU notifiers
in the case of set_pte_at_notify(). The users of that are:

  * uprobe_write_opcode(): Allocates a new page and performs a
copy_highpage() to copy the data to the new page (which with MTE includes
the tags and will copy across the PG_mte_tagged flag).

  * write_protect_page() (KSM): Changes the permissions on the PTE but it's
still the same page, so nothing to do regarding MTE.


My concern here is that the VMM had a stage 1 pte but we haven't yet
faulted in at stage 2 via user_mem_abort(), so we don't have any stage 2
pte set. write_protect_page() comes in and sets the new stage 2 pte via
the callback. I couldn't find any check in kvm_pgtable_stage2_map() for
the old pte, so it will set the new stage 2 pte regardless. A subsequent
guest read would no longer fault at stage 2.


  * replace_page() (KSM): If the page has MTE tags then the MTE version of
memcmp_pages() will return false, so the only caller
(try_to_merge_one_page()) will never call this on a page with tags.

  * wp_page_copy(): This one is more interesting - if we go down the
cow_user_page() path with an old page then everything is safe (tags are
copied over). The is_zero_pfn() case worries me a bit - a new page is
allocated, but I can't instantly see anything to zero out the tags (and set
PG_mte_tagged).


True, I think tag zeroing happens only if we map it as PROT_MTE in the
VMM.


  * migrate_vma_insert_page(): I think migration should be safe as the tags
should be copied.

So wp_page_copy() looks suspicious.

kvm_pgtable_stage2_map() looks like it could be a good place for the checks,
it looks like it should work and is probably a more obvious place for the
checks.


That would be my preference. It also matches the stage 1 set_pte_at().


While the set_pte_at() race on the page flags is somewhat clearer, we
may still have a race here with the VMM's set_pte_at() if the page is
mapped as tagged. KVM has its own mmu_lock but it wouldn't be held when
handling the VMM page tables (well, not always, see below).

gfn_to_pfn_prot() ends up calling get_user_pages*(). At least the slow
path (hva_to_pfn_slow()) ends up with FOLL_TOUCH in gup and the VMM pte
would be set, tags cleared (if PROT_MTE) before the stage 2 pte. I'm not
sure whether get_user_page_fast_only() does the same.

The race with an mprotect(PROT_MTE) in the VMM is fine I think as the
KVM mmu notifier is invoked before set_pte_at() and racing with another
user_mem_abort() is serialised by the KVM mmu_lock. The subsequent
set_pte_at() would see the PG_mte_tagged set either by the current CPU
or by the one it was racing with.


Given the changes to set_pte_at() which means that tags are restored from
swap even if !PROT_MTE, the only race I can see remaining is the creation of
new PROT_MTE mappings. As you mention an attempt to change mappings in the
VMM memory space should involve a mmu notifier call which I think serialises
this. So the remaining issue is doing this in a separate address space.

So I guess the potential problem is:

  * allocate memory MAP_SHARED but !PROT_MTE
  * fork()
  * VM causes a fault in parent address space
  * child does a mprotect(PROT_MTE)

With the last two potentially racing. Sadly I can't see a good way of
handling that.


Ah, the mmap lock doesn't help as they are different processes
(mprotect() acquires it as a writer).

I wonder whether this is racy even in the absence of KVM. If both parent
and child do an mprotect(PROT_MTE), one of them may be reading stale
tags for a brief period.

Maybe we should revisit whether shared MTE pages are of any use, though
it's an ABI change (not bad if no-one is relying on this). However...


Shared MTE pages are certainly hard to use correctly (e.g. see the 
discussions with the VMM accessing guest memory). But I guess that boat 
has sailed.



Thinking about this, we have a similar problem with the PG_dcache_clean
and two processes doing mprotect(PROT_EXEC). One of them could see the
flag set and skip the I

Re: [PATCH v11 2/6] arm64: kvm: Introduce MTE VM feature

2021-04-29 Thread Steven Price

On 28/04/2021 18:07, Catalin Marinas wrote:

On Fri, Apr 16, 2021 at 04:43:05PM +0100, Steven Price wrote:

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 77cb2d28f2a4..5f8e165ea053 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -879,6 +879,26 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
if (vma_pagesize == PAGE_SIZE && !force_pte)
vma_pagesize = transparent_hugepage_adjust(memslot, hva,
   &pfn, &fault_ipa);
+
+   if (fault_status != FSC_PERM && kvm_has_mte(kvm) && !device &&
+   pfn_valid(pfn)) {


In the current implementation, device == !pfn_valid(), so we could skip
the latter check.


Thanks, I'll drop that check.


+   /*
+* VM will be able to see the page's tags, so we must ensure
+* they have been initialised. if PG_mte_tagged is set, tags
+* have already been initialised.
+*/
+   unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT;
+   struct page *page = pfn_to_online_page(pfn);
+
+   if (!page)
+   return -EFAULT;


I think that's fine, though maybe adding a comment that otherwise it
would be mapped at stage 2 as Normal Cacheable and we cannot guarantee
that the memory supports MTE tags.


That's what I intended by "be able to see the page's tags", but I'll 
reword to be explicit about it being Normal Cacheable.



+
+   for (i = 0; i < nr_pages; i++, page++) {
+   if (!test_and_set_bit(PG_mte_tagged, &page->flags))
+   mte_clear_page_tags(page_address(page));
+   }
+   }
+
if (writable)
prot |= KVM_PGTABLE_PROT_W;


I probably asked already but is the only way to map a standard RAM page
(not device) in stage 2 via the fault handler? One case I had in mind
was something like get_user_pages() but it looks like that one doesn't
call set_pte_at_notify(). There are a few other places where
set_pte_at_notify() is called and these may happen before we got a
chance to fault on stage 2, effectively populating the entry (IIUC). If
that's an issue, we could move the above loop and check closer to the
actual pte setting like kvm_pgtable_stage2_map().


The only call sites of kvm_pgtable_stage2_map() are in mmu.c:

 * kvm_phys_addr_ioremap() - maps as device in stage 2

 * user_mem_abort() - handled above

 * kvm_set_spte_handler() - ultimately called from the .change_pte() 
callback of the MMU notifier


So the last one is potentially a problem. It's called via the MMU 
notifiers in the case of set_pte_at_notify(). The users of that are:


 * uprobe_write_opcode(): Allocates a new page and performs a 
copy_highpage() to copy the data to the new page (which with MTE 
includes the tags and will copy across the PG_mte_tagged flag).


 * write_protect_page() (KSM): Changes the permissions on the PTE but 
it's still the same page, so nothing to do regarding MTE.


 * replace_page() (KSM): If the page has MTE tags then the MTE version 
of memcmp_pages() will return false, so the only caller 
(try_to_merge_one_page()) will never call this on a page with tags.


 * wp_page_copy(): This one is more interesting - if we go down the 
cow_user_page() path with an old page then everything is safe (tags are 
copied over). The is_zero_pfn() case worries me a bit - a new page is 
allocated, but I can't instantly see anything to zero out the tags (and 
set PG_mte_tagged).


 * migrate_vma_insert_page(): I think migration should be safe as the 
tags should be copied.


So wp_page_copy() looks suspicious.

kvm_pgtable_stage2_map() looks like it could be a good place for the 
checks, it looks like it should work and is probably a more obvious 
place for the checks.



While the set_pte_at() race on the page flags is somewhat clearer, we
may still have a race here with the VMM's set_pte_at() if the page is
mapped as tagged. KVM has its own mmu_lock but it wouldn't be held when
handling the VMM page tables (well, not always, see below).

gfn_to_pfn_prot() ends up calling get_user_pages*(). At least the slow
path (hva_to_pfn_slow()) ends up with FOLL_TOUCH in gup and the VMM pte
would be set, tags cleared (if PROT_MTE) before the stage 2 pte. I'm not
sure whether get_user_page_fast_only() does the same.

The race with an mprotect(PROT_MTE) in the VMM is fine I think as the
KVM mmu notifier is invoked before set_pte_at() and racing with another
user_mem_abort() is serialised by the KVM mmu_lock. The subsequent
set_pte_at() would see the PG_mte_tagged set either by the current CPU
or by the one it was racing with.



Given the changes to set_pte_at() which means that tags are restored 
from swap even if !PROT_MTE, the only race I can see remaining is the 
creation of new PROT_MTE mappings. As you mention 

Re: [PATCH v11 5/6] KVM: arm64: ioctl to fetch/store tags in a guest

2021-04-29 Thread Steven Price

On 27/04/2021 18:58, Catalin Marinas wrote:

On Fri, Apr 16, 2021 at 04:43:08PM +0100, Steven Price wrote:

diff --git a/arch/arm64/include/uapi/asm/kvm.h 
b/arch/arm64/include/uapi/asm/kvm.h
index 24223adae150..2b85a047c37d 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -184,6 +184,20 @@ struct kvm_vcpu_events {
__u32 reserved[12];
  };
  
+struct kvm_arm_copy_mte_tags {

+   __u64 guest_ipa;
+   __u64 length;
+   union {
+   void __user *addr;
+   __u64 padding;
+   };
+   __u64 flags;
+   __u64 reserved[2];
+};


I know Marc asked for some reserved space in here but I'm not sure it's
the right place. And what's with the union of a 64-bit pointer and
64-bit padding, it doesn't change any layout?


Yes it's unnecessary here - habits die hard. This would ensure that the 
layout is the same for 32 bit and 64 bit. But it's irrelevant here as 
(a) we don't support 32 bit, and (b) flags has 64 bit alignment anyway. 
I'll drop the union (and 'padding').



Maybe add the two reserved
values to the union in case we want to store something else in the
future.


I'm not sure what you mean here. What would the reserved fields be 
unioned with? And surely they are no longer reserved in that case?



Or maybe I'm missing something, I haven't checked how other KVM ioctls
work.


KVM ioctls seem to (sometimes) have some reserved space at the end of 
the structure for expansion without the ioctl number changing (since the 
structure size is encoded into the ioctl).


Steve



Re: [PATCH v11 1/6] arm64: mte: Sync tags for pages where PTE is untagged

2021-04-29 Thread Steven Price

On 27/04/2021 18:43, Catalin Marinas wrote:

On Fri, Apr 16, 2021 at 04:43:04PM +0100, Steven Price wrote:

A KVM guest could store tags in a page even if the VMM hasn't mapped
the page with PROT_MTE. So when restoring pages from swap we will
need to check to see if there are any saved tags even if !pte_tagged().

However don't check pages which are !pte_valid_user() as these will
not have been swapped out.


You should remove the pte_valid_user() mention from the commit log as
well.


Good spot - sorry about that. I really must get better at reading my own 
commit messages.



diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index e17b96d0e4b5..cf4b52a33b3c 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -312,7 +312,7 @@ static inline void set_pte_at(struct mm_struct *mm, 
unsigned long addr,
__sync_icache_dcache(pte);
  
  	if (system_supports_mte() &&

-   pte_present(pte) && pte_tagged(pte) && !pte_special(pte))
+   pte_present(pte) && (pte_val(pte) & PTE_USER) && !pte_special(pte))


I would add a pte_user() macro here or, if we restore the tags only when
the page is readable, use pte_access_permitted(pte, false). Also add a
comment why we do this.


pte_access_permitted() looks like it describes what we want (user space 
can access the memory). I'll add the following comment:


 /*
 * If the PTE would provide user space access to the tags
  * associated with it then ensure that the MTE tags are synchronised.
  * Exec-only mappings don't expose tags (instruction fetches don't
  * check tags).
  */


There's also the pte_user_exec() case which may not have the PTE_USER
set (exec-only permission) but I don't think it matters. We don't do tag
checking on instruction fetches, so if the user adds a PROT_READ to it,
it would go through set_pte_at() again. I'm not sure KVM does anything
special with exec-only mappings at stage 2, I suspect they won't be
accessible by the guest (but needs checking).


It comes down to the behaviour of get_user_pages(). AFAICT that will 
fail if the memory is exec-only, so no stage 2 mapping will be created. 
Which of course means the guest can't do anything with that memory. That 
certainly seems like the only sane behaviour even without MTE.



mte_sync_tags(ptep, pte);
  
  	__check_racy_pte_update(mm, ptep, pte);

diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index b3c70a612c7a..e016ab57ea36 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -26,17 +26,23 @@ u64 gcr_kernel_excl __ro_after_init;
  
  static bool report_fault_once = true;
  
-static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)

+static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap,
+  bool pte_is_tagged)
  {
pte_t old_pte = READ_ONCE(*ptep);
  
  	if (check_swap && is_swap_pte(old_pte)) {

swp_entry_t entry = pte_to_swp_entry(old_pte);
  
-		if (!non_swap_entry(entry) && mte_restore_tags(entry, page))

+   if (!non_swap_entry(entry) && mte_restore_tags(entry, page)) {
+   set_bit(PG_mte_tagged, &page->flags);
return;
+   }
}
  
+	if (!pte_is_tagged || test_and_set_bit(PG_mte_tagged, &page->flags))

+   return;


I don't think we need another test_bit() here, it was done in the
caller (bar potential races which need more thought).


Good point - I'll change that to a straight set_bit().


+
page_kasan_tag_reset(page);
/*
 * We need smp_wmb() in between setting the flags and clearing the
@@ -54,11 +60,13 @@ void mte_sync_tags(pte_t *ptep, pte_t pte)
struct page *page = pte_page(pte);
long i, nr_pages = compound_nr(page);
bool check_swap = nr_pages == 1;
+   bool pte_is_tagged = pte_tagged(pte);
  
  	/* if PG_mte_tagged is set, tags have already been initialised */

for (i = 0; i < nr_pages; i++, page++) {
-   if (!test_and_set_bit(PG_mte_tagged, &page->flags))
-   mte_sync_page_tags(page, ptep, check_swap);
+   if (!test_bit(PG_mte_tagged, &page->flags))
+   mte_sync_page_tags(page, ptep, check_swap,
+  pte_is_tagged);
}
  }


You were right in the previous thread that if we have a race, it's
already there even without your patches KVM patches.

If it's the same pte in a multithreaded app, we should be ok as the core
code holds the ptl (the arch code also holds the mmap_lock during
exception handling but only as a reader, so you can have multiple
holders).

If there are multiple ptes to the same page, for example mapped with
MAP_ANONYMOUS | MAP_SHARED, metadata recovery is done via
arch_swap_restore() before we even set the pte

[PATCH v11 3/6] arm64: kvm: Save/restore MTE registers

2021-04-16 Thread Steven Price
Define the new system registers that MTE introduces and context switch
them. The MTE feature is still hidden from the ID register as it isn't
supported in a VM yet.

Signed-off-by: Steven Price 
---
 arch/arm64/include/asm/kvm_host.h  |  6 ++
 arch/arm64/include/asm/kvm_mte.h   | 66 ++
 arch/arm64/include/asm/sysreg.h|  3 +-
 arch/arm64/kernel/asm-offsets.c|  3 +
 arch/arm64/kvm/hyp/entry.S |  7 +++
 arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 21 +++
 arch/arm64/kvm/sys_regs.c  | 22 ++--
 7 files changed, 123 insertions(+), 5 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_mte.h

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 1170ee137096..d00cc3590f6e 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -208,6 +208,12 @@ enum vcpu_sysreg {
CNTP_CVAL_EL0,
CNTP_CTL_EL0,
 
+   /* Memory Tagging Extension registers */
+   RGSR_EL1,   /* Random Allocation Tag Seed Register */
+   GCR_EL1,/* Tag Control Register */
+   TFSR_EL1,   /* Tag Fault Status Register (EL1) */
+   TFSRE0_EL1, /* Tag Fault Status Register (EL0) */
+
/* 32bit specific registers. Keep them at the end of the range */
DACR32_EL2, /* Domain Access Control Register */
IFSR32_EL2, /* Instruction Fault Status Register */
diff --git a/arch/arm64/include/asm/kvm_mte.h b/arch/arm64/include/asm/kvm_mte.h
new file mode 100644
index ..6541c7d6ce06
--- /dev/null
+++ b/arch/arm64/include/asm/kvm_mte.h
@@ -0,0 +1,66 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2020 ARM Ltd.
+ */
+#ifndef __ASM_KVM_MTE_H
+#define __ASM_KVM_MTE_H
+
+#ifdef __ASSEMBLY__
+
+#include <asm/sysreg.h>
+
+#ifdef CONFIG_ARM64_MTE
+
+.macro mte_switch_to_guest g_ctxt, h_ctxt, reg1
+alternative_if_not ARM64_MTE
+   b   .L__skip_switch\@
+alternative_else_nop_endif
+   mrs \reg1, hcr_el2
+   and \reg1, \reg1, #(HCR_ATA)
+   cbz \reg1, .L__skip_switch\@
+
+   mrs_s   \reg1, SYS_RGSR_EL1
+   str \reg1, [\h_ctxt, #CPU_RGSR_EL1]
+   mrs_s   \reg1, SYS_GCR_EL1
+   str \reg1, [\h_ctxt, #CPU_GCR_EL1]
+
+   ldr \reg1, [\g_ctxt, #CPU_RGSR_EL1]
+   msr_s   SYS_RGSR_EL1, \reg1
+   ldr \reg1, [\g_ctxt, #CPU_GCR_EL1]
+   msr_s   SYS_GCR_EL1, \reg1
+
+.L__skip_switch\@:
+.endm
+
+.macro mte_switch_to_hyp g_ctxt, h_ctxt, reg1
+alternative_if_not ARM64_MTE
+   b   .L__skip_switch\@
+alternative_else_nop_endif
+   mrs \reg1, hcr_el2
+   and \reg1, \reg1, #(HCR_ATA)
+   cbz \reg1, .L__skip_switch\@
+
+   mrs_s   \reg1, SYS_RGSR_EL1
+   str \reg1, [\g_ctxt, #CPU_RGSR_EL1]
+   mrs_s   \reg1, SYS_GCR_EL1
+   str \reg1, [\g_ctxt, #CPU_GCR_EL1]
+
+   ldr \reg1, [\h_ctxt, #CPU_RGSR_EL1]
+   msr_s   SYS_RGSR_EL1, \reg1
+   ldr \reg1, [\h_ctxt, #CPU_GCR_EL1]
+   msr_s   SYS_GCR_EL1, \reg1
+
+.L__skip_switch\@:
+.endm
+
+#else /* CONFIG_ARM64_MTE */
+
+.macro mte_switch_to_guest g_ctxt, h_ctxt, reg1
+.endm
+
+.macro mte_switch_to_hyp g_ctxt, h_ctxt, reg1
+.endm
+
+#endif /* CONFIG_ARM64_MTE */
+#endif /* __ASSEMBLY__ */
+#endif /* __ASM_KVM_MTE_H */
diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index dfd4edbfe360..5424d195cf96 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -580,7 +580,8 @@
 #define SCTLR_ELx_M(BIT(0))
 
 #define SCTLR_ELx_FLAGS(SCTLR_ELx_M  | SCTLR_ELx_A | SCTLR_ELx_C | \
-SCTLR_ELx_SA | SCTLR_ELx_I | SCTLR_ELx_IESB)
+SCTLR_ELx_SA | SCTLR_ELx_I | SCTLR_ELx_IESB | \
+SCTLR_ELx_ITFSB)
 
 /* SCTLR_EL2 specific flags. */
 #define SCTLR_EL2_RES1 ((BIT(4))  | (BIT(5))  | (BIT(11)) | (BIT(16)) | \
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index a36e2fc330d4..944e4f1f45d9 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -108,6 +108,9 @@ int main(void)
   DEFINE(VCPU_WORKAROUND_FLAGS,offsetof(struct kvm_vcpu, 
arch.workaround_flags));
   DEFINE(VCPU_HCR_EL2, offsetof(struct kvm_vcpu, arch.hcr_el2));
   DEFINE(CPU_USER_PT_REGS, offsetof(struct kvm_cpu_context, regs));
+  DEFINE(CPU_RGSR_EL1, offsetof(struct kvm_cpu_context, 
sys_regs[RGSR_EL1]));
+  DEFINE(CPU_GCR_EL1,  offsetof(struct kvm_cpu_context, 
sys_regs[GCR_EL1]));
+  DEFINE(CPU_TFSRE0_EL1,   offsetof(struct kvm_cpu_context, 
sys_regs[TFSRE0_EL1]));
   DEFINE(CPU_APIAKEYLO_EL1,offsetof(struct kvm_cpu_context, 
sys_regs[APIAKEYLO_EL1]));
   DEFINE(CPU_APIBKEYLO_EL1,offsetof(struct kvm_cpu_context, 
sys_regs[APIBKEYLO_EL1]));
   DEFINE(CPU_APDAKEYLO_EL1,offsetof(struct kvm_cpu_context, 
sys_regs[APDAKEYLO_EL1]));
diff

[PATCH v11 2/6] arm64: kvm: Introduce MTE VM feature

2021-04-16 Thread Steven Price
Add a new VM feature 'KVM_ARM_CAP_MTE' which enables memory tagging
for a VM. This will expose the feature to the guest and automatically
tag memory pages touched by the VM as PG_mte_tagged (and clear the tag
storage) to ensure that the guest cannot see stale tags, and so that
the tags are correctly saved/restored across swap.

Actually exposing the new capability to user space happens in a later
patch.

Signed-off-by: Steven Price 
---
 arch/arm64/include/asm/kvm_emulate.h |  3 +++
 arch/arm64/include/asm/kvm_host.h|  3 +++
 arch/arm64/kvm/hyp/exception.c   |  3 ++-
 arch/arm64/kvm/mmu.c | 20 
 arch/arm64/kvm/sys_regs.c|  3 +++
 include/uapi/linux/kvm.h |  1 +
 6 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/kvm_emulate.h 
b/arch/arm64/include/asm/kvm_emulate.h
index f612c090f2e4..6bf776c2399c 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -84,6 +84,9 @@ static inline void vcpu_reset_hcr(struct kvm_vcpu *vcpu)
if (cpus_have_const_cap(ARM64_MISMATCHED_CACHE_TYPE) ||
vcpu_el1_is_32bit(vcpu))
vcpu->arch.hcr_el2 |= HCR_TID2;
+
+   if (kvm_has_mte(vcpu->kvm))
+   vcpu->arch.hcr_el2 |= HCR_ATA;
 }
 
 static inline unsigned long *vcpu_hcr(struct kvm_vcpu *vcpu)
diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 3d10e6527f7d..1170ee137096 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -132,6 +132,8 @@ struct kvm_arch {
 
u8 pfr0_csv2;
u8 pfr0_csv3;
+   /* Memory Tagging Extension enabled for the guest */
+   bool mte_enabled;
 };
 
 struct kvm_vcpu_fault_info {
@@ -767,6 +769,7 @@ bool kvm_arm_vcpu_is_finalized(struct kvm_vcpu *vcpu);
 #define kvm_arm_vcpu_sve_finalized(vcpu) \
((vcpu)->arch.flags & KVM_ARM64_VCPU_SVE_FINALIZED)
 
+#define kvm_has_mte(kvm) (system_supports_mte() && (kvm)->arch.mte_enabled)
 #define kvm_vcpu_has_pmu(vcpu) \
(test_bit(KVM_ARM_VCPU_PMU_V3, (vcpu)->arch.features))
 
diff --git a/arch/arm64/kvm/hyp/exception.c b/arch/arm64/kvm/hyp/exception.c
index 73629094f903..56426565600c 100644
--- a/arch/arm64/kvm/hyp/exception.c
+++ b/arch/arm64/kvm/hyp/exception.c
@@ -112,7 +112,8 @@ static void enter_exception64(struct kvm_vcpu *vcpu, 
unsigned long target_mode,
new |= (old & PSR_C_BIT);
new |= (old & PSR_V_BIT);
 
-   // TODO: TCO (if/when ARMv8.5-MemTag is exposed to guests)
+   if (kvm_has_mte(vcpu->kvm))
+   new |= PSR_TCO_BIT;
 
new |= (old & PSR_DIT_BIT);
 
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 77cb2d28f2a4..5f8e165ea053 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -879,6 +879,26 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
if (vma_pagesize == PAGE_SIZE && !force_pte)
vma_pagesize = transparent_hugepage_adjust(memslot, hva,
   &pfn, &fault_ipa);
+
+   if (fault_status != FSC_PERM && kvm_has_mte(kvm) && !device &&
+   pfn_valid(pfn)) {
+   /*
+* VM will be able to see the page's tags, so we must ensure
+* they have been initialised. if PG_mte_tagged is set, tags
+* have already been initialised.
+*/
+   unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT;
+   struct page *page = pfn_to_online_page(pfn);
+
+   if (!page)
+   return -EFAULT;
+
+   for (i = 0; i < nr_pages; i++, page++) {
+   if (!test_and_set_bit(PG_mte_tagged, &page->flags))
+   mte_clear_page_tags(page_address(page));
+   }
+   }
+
if (writable)
prot |= KVM_PGTABLE_PROT_W;
 
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 4f2f1e3145de..18c87500a7a8 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1047,6 +1047,9 @@ static u64 read_id_reg(const struct kvm_vcpu *vcpu,
break;
case SYS_ID_AA64PFR1_EL1:
val &= ~FEATURE(ID_AA64PFR1_MTE);
+   if (kvm_has_mte(vcpu->kvm))
+   val |= FIELD_PREP(FEATURE(ID_AA64PFR1_MTE),
+ ID_AA64PFR1_MTE);
break;
case SYS_ID_AA64ISAR1_EL1:
if (!vcpu_has_ptrauth(vcpu))
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index f6afee209620..6dc16c09a2d1 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1078,6 +1078,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_DIRTY_LOG_RING 192
 #define KVM

[PATCH v11 6/6] KVM: arm64: Document MTE capability and ioctl

2021-04-16 Thread Steven Price
A new capability (KVM_CAP_ARM_MTE) identifies that the kernel supports
granting a guest access to the tags, and provides a mechanism for the
VMM to enable it.

A new ioctl (KVM_ARM_MTE_COPY_TAGS) provides a simple way for a VMM to
access the tags of a guest without having to maintain a PROT_MTE mapping
in userspace. The above capability gates access to the ioctl.

Signed-off-by: Steven Price 
---
 Documentation/virt/kvm/api.rst | 53 ++
 1 file changed, 53 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 1a2b5210cdbf..ccc84f21ba5e 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -4938,6 +4938,40 @@ see KVM_XEN_VCPU_SET_ATTR above.
 The KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST type may not be used
 with the KVM_XEN_VCPU_GET_ATTR ioctl.
 
+4.131 KVM_ARM_MTE_COPY_TAGS
+---------------------------
+
+:Capability: KVM_CAP_ARM_MTE
+:Architectures: arm64
+:Type: vm ioctl
+:Parameters: struct kvm_arm_copy_mte_tags
+:Returns: 0 on success, < 0 on error
+
+::
+
+  struct kvm_arm_copy_mte_tags {
+   __u64 guest_ipa;
+   __u64 length;
+   union {
+   void __user *addr;
+   __u64 padding;
+   };
+   __u64 flags;
+   __u64 reserved[2];
+  };
+
+Copies Memory Tagging Extension (MTE) tags to/from guest tag memory. The
+``guest_ipa`` and ``length`` fields must be ``PAGE_SIZE`` aligned. The ``addr``
+field must point to a buffer which the tags will be copied to or from.
+
+``flags`` specifies the direction of copy, either ``KVM_ARM_TAGS_TO_GUEST`` or
+``KVM_ARM_TAGS_FROM_GUEST``.
+
+The size of the buffer to store the tags is ``(length / MTE_GRANULE_SIZE)``
+bytes (i.e. 1/16th of the corresponding size). Each byte contains a single tag
+value. This matches the format of ``PTRACE_PEEKMTETAGS`` and
+``PTRACE_POKEMTETAGS``.
+
 5. The kvm_run structure
 ========================
 
@@ -6227,6 +6261,25 @@ KVM_RUN_BUS_LOCK flag is used to distinguish between them.
 This capability can be used to check / enable 2nd DAWR feature provided
 by POWER10 processor.
 
+7.23 KVM_CAP_ARM_MTE
+--------------------
+
+:Architectures: arm64
+:Parameters: none
+
+This capability indicates that KVM (and the hardware) supports exposing the
+Memory Tagging Extensions (MTE) to the guest. It must also be enabled by the
+VMM before the guest will be granted access.
+
+When enabled the guest is able to access tags associated with any memory given
+to the guest. KVM will ensure that the pages are flagged ``PG_mte_tagged`` so
+that the tags are maintained during swap or hibernation of the host; however
+the VMM needs to manually save/restore the tags as appropriate if the VM is
+migrated.
+
+When enabled the VMM may make use of the ``KVM_ARM_MTE_COPY_TAGS`` ioctl to
+perform a bulk copy of tags to/from the guest.
+
 8. Other capabilities.
 ======================
 
-- 
2.20.1




[PATCH v11 1/6] arm64: mte: Sync tags for pages where PTE is untagged

2021-04-16 Thread Steven Price
A KVM guest could store tags in a page even if the VMM hasn't mapped
the page with PROT_MTE. So when restoring pages from swap we need to
check whether there are any saved tags, even if !pte_tagged().

However, don't check pages which are !pte_valid_user() as these will
not have been swapped out.
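
As an informal, condensed sketch of the decision this change makes
(combining the set_pte_at() guard with the per-page sync; this is not one
of the hunks below, try_restore_tags_from_swap() is a made-up stand-in for
the swap-entry handling in mte_sync_page_tags(), and the compound-page
loop is omitted):

  static void sketch_sync_tags(struct page *page, pte_t *ptep, pte_t pte)
  {
          /* Only valid user mappings can have been swapped out. */
          if (!system_supports_mte() || !pte_present(pte) ||
              !(pte_val(pte) & PTE_USER) || pte_special(pte))
                  return;

          /* Tags for this page were already initialised (or restored). */
          if (test_bit(PG_mte_tagged, &page->flags))
                  return;

          /* Saved tags may exist even though the new PTE isn't tagged. */
          if (try_restore_tags_from_swap(ptep, page))
                  return;        /* stand-in also sets PG_mte_tagged */

          /* Otherwise only initialise (clear) tags for tagged mappings. */
          if (pte_tagged(pte) &&
              !test_and_set_bit(PG_mte_tagged, &page->flags))
                  mte_clear_page_tags(page_address(page));
  }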

Signed-off-by: Steven Price 
---
 arch/arm64/include/asm/pgtable.h |  2 +-
 arch/arm64/kernel/mte.c  | 16 
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index e17b96d0e4b5..cf4b52a33b3c 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -312,7 +312,7 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
__sync_icache_dcache(pte);
 
if (system_supports_mte() &&
-   pte_present(pte) && pte_tagged(pte) && !pte_special(pte))
+   pte_present(pte) && (pte_val(pte) & PTE_USER) && !pte_special(pte))
mte_sync_tags(ptep, pte);
 
__check_racy_pte_update(mm, ptep, pte);
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index b3c70a612c7a..e016ab57ea36 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -26,17 +26,23 @@ u64 gcr_kernel_excl __ro_after_init;
 
 static bool report_fault_once = true;
 
-static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
+static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap,
+  bool pte_is_tagged)
 {
pte_t old_pte = READ_ONCE(*ptep);
 
if (check_swap && is_swap_pte(old_pte)) {
swp_entry_t entry = pte_to_swp_entry(old_pte);
 
-   if (!non_swap_entry(entry) && mte_restore_tags(entry, page))
+   if (!non_swap_entry(entry) && mte_restore_tags(entry, page)) {
+   set_bit(PG_mte_tagged, &page->flags);
return;
+   }
}
 
+   if (!pte_is_tagged || test_and_set_bit(PG_mte_tagged, &page->flags))
+   return;
+
page_kasan_tag_reset(page);
/*
 * We need smp_wmb() in between setting the flags and clearing the
@@ -54,11 +60,13 @@ void mte_sync_tags(pte_t *ptep, pte_t pte)
struct page *page = pte_page(pte);
long i, nr_pages = compound_nr(page);
bool check_swap = nr_pages == 1;
+   bool pte_is_tagged = pte_tagged(pte);
 
/* if PG_mte_tagged is set, tags have already been initialised */
for (i = 0; i < nr_pages; i++, page++) {
-   if (!test_and_set_bit(PG_mte_tagged, &page->flags))
-   mte_sync_page_tags(page, ptep, check_swap);
+   if (!test_bit(PG_mte_tagged, &page->flags))
+   mte_sync_page_tags(page, ptep, check_swap,
+  pte_is_tagged);
}
 }
 
-- 
2.20.1



