Re: [RFC PATCH 0/2] MTE support for KVM guest

2020-06-26 Thread James Morse
Hi guys,

On 24/06/2020 17:24, Catalin Marinas wrote:
> On Wed, Jun 24, 2020 at 03:59:35PM +0100, Steven Price wrote:
>> On 24/06/2020 15:21, Catalin Marinas wrote:
>>> On Wed, Jun 24, 2020 at 12:16:28PM +0100, Steven Price wrote:
 On 23/06/2020 18:48, Catalin Marinas wrote:
> This causes potential issues since we can't guarantee that all the
> Cacheable memory slots allocated by the VMM support MTE. If they do not,
> the arch behaviour is "unpredictable". We also can't trust the guest to
> not enable MTE on such Cacheable mappings.

 Architecturally it seems dodgy to export any address that isn't "normal
 memory" (i.e. with tag storage) to the guest as Normal Cacheable. Although
 I'm a bit worried this might cause a regression in some existing case.
>>>
>>> What I had in mind is some persistent memory that may be given to the
>>> guest for direct access. This is allowed to be cacheable (write-back)
>>> but may not have tag storage.
>>
>> At the moment we don't have a good idea what would happen if/when the guest
>> (or host) attempts to use that memory as tagged. If we have a relatively
>> safe hardware behaviour (e.g. the tags are silently dropped/read-as-zero)
>> then that's not a big issue. But if the accesses cause some form of abort
>> then we need to understand how that would be handled.
> 
> The architecture is not prescriptive here, the behaviour is
> "unpredictable". It could mean tags read-as-zero/write-ignored or an
> SError.

This surely is the same as treating a VFIO device as memory and performing some
unsupported operation on it.

I thought the DT 'which memory ranges' description for MTE was removed. 
Wouldn't the rules
for a guest be the same? If you enable MTE, everything described as memory must 
support
MTE. Something like persistent memory then can't be described as memory, ... we 
have the
same problem on the host.


> 1. As in your current patches, assume any Cacheable at Stage 2 can have
>  MTE enabled at Stage 1. In addition, we need to check whether the
>  physical memory supports MTE and it could be something simple like
>  pfn_valid(). Is there a way to reject a memory slot passed by the
>  VMM?

 Yes pfn_valid() should have been in there. At the moment pfn_to_page() is
 called without any checks.

 The problem with attempting to reject a memory slot is that the memory
 backing that slot can change. So checking at the time the slot is created
 isn't enough (although it might be a useful error checking feature).
>>>
>>> But isn't the slot changed as a result of another VMM call? So we could
>>> always have such check in place.
>>
>> Once you have created a memslot the guest's view of memory follows the user
>> space's address space. This is the KVM_CAP_SYNC_MMU capability. So there's
>> nothing stopping a VMM adding a memslot backed with perfectly reasonable
>> memory then mmap()ing over the top of it some memory which isn't MTE
>> compatible. KVM gets told the memory is being removed (via mmu notifiers)
>> but I think it waits for the next fault before (re)creating the stage 2
>> entries.

(indeed, stage2 is pretty lazy)


> OK, so that's where we could kill the guest if the VMM doesn't play
> nicely. It means that we need the check when setting up the stage 2
> entry. I guess it's fine if we only have the check at that point and
> ignore it on KVM_SET_USER_MEMORY_REGION. It would be nice if we returned
> on error on slot setup but

> we may not know (yet) whether the VMM intends to enable MTE for the guest.

We don't. Memory slots take the VM-fd, whereas the easy-to-add feature bits are 
per-vcpu.
Packing features into the 'type' that create-vm takes is a problem once we run 
out,
although the existing user is the IPA space size, and MTE is a property of the 
memory system.


The meaning of the flag is then "I described this as memory, only let the guest 
access
memory through this range that is MTE capable". What do we do when that is 
violated? Tell
the VMM is the nicest, but its not something we ever expect to happen. I guess 
an abort is
what real hardware would do, (if firmware magically turned off MTE while it was 
in use).

This would need to be kvm's inject_abt64(), as otherwise the vcpu may take the 
stage2
fault again, forever. For kvm_set_spte_hva() we can't inject an abort (which 
vcpu?), so
not mapping the page and waiting for the guest to access it is the only 
option...


Thanks,

James


Re: [RFC PATCH 0/2] MTE support for KVM guest

2020-06-24 Thread Catalin Marinas
On Wed, Jun 24, 2020 at 03:59:35PM +0100, Steven Price wrote:
> On 24/06/2020 15:21, Catalin Marinas wrote:
> > On Wed, Jun 24, 2020 at 12:16:28PM +0100, Steven Price wrote:
> > > On 23/06/2020 18:48, Catalin Marinas wrote:
> > > > On Wed, Jun 17, 2020 at 01:38:42PM +0100, Steven Price wrote:
> > > > > These patches add support to KVM to enable MTE within a guest. It is
> > > > > based on Catalin's v4 MTE user space series[1].
> > > > > 
> > > > > [1] 
> > > > > http://lkml.kernel.org/r/20200515171612.1020-1-catalin.marinas%40arm.com
> > > > > 
> > > > > Posting as an RFC as I'd like feedback on the approach taken. First a
> > > > > little background on how MTE fits within the architecture:
> > > > > 
> > > > > The stage 2 page tables have limited scope for controlling the
> > > > > availability of MTE. If a page is mapped as Normal and cached in 
> > > > > stage 2
> > > > > then it's the stage 1 tables that get to choose whether the memory is
> > > > > tagged or not. So the only way of forbidding tags on a page from the
> > > > > hypervisor is to change the cacheability (or make it device memory)
> > > > > which would cause other problems.  Note this restriction fits the
> > > > > intention that a system should have all (general purpose) memory
> > > > > supporting tags if it support MTE, so it's not too surprising.
> > > > > 
> > > > > However, the upshot of this is that to enable MTE within a guest all
> > > > > pages of memory mapped into the guest as normal cached pages in stage 
> > > > > 2
> > > > > *must* support MTE (i.e. we must ensure the tags are appropriately
> > > > > sanitised and save/restore the tags during swap etc).
> > > > > 
> > > > > My current approach is that KVM transparently upgrades any pages
> > > > > provided by the VMM to be tag-enabled when they are faulted in (i.e.
> > > > > sets the PG_mte_tagged flag on the page) which has the benefit of
> > > > > requiring fewer changes in the VMM. However, save/restore of the VM
> > > > > state still requires the VMM to have a PROT_MTE enabled mapping so 
> > > > > that
> > > > > it can access the tag values. A VMM which 'forgets' to enable PROT_MTE
> > > > > would lose the tag values when saving/restoring (tags are RAZ/WI when
> > > > > PROT_MTE isn't set).
> > > > > 
> > > > > An alternative approach would be to enforce the VMM provides PROT_MTE
> > > > > memory in the first place. This seems appealing to prevent the above
> > > > > potentially unexpected gotchas with save/restore, however this would
> > > > > also extend to memory that you might not expect to have PROT_MTE 
> > > > > (e.g. a
> > > > > shared frame buffer for an emulated graphics card).
> > > > 
> > > > As you mentioned above, if memory is mapped as Normal Cacheable at Stage
> > > > 2 (whether we use FWB or not), the guest is allowed to turn MTE on via
> > > > Stage 1. There is no way for KVM to prevent a guest from using MTE other
> > > > than the big HCR_EL2.ATA knob.
> > > > 
> > > > This causes potential issues since we can't guarantee that all the
> > > > Cacheable memory slots allocated by the VMM support MTE. If they do not,
> > > > the arch behaviour is "unpredictable". We also can't trust the guest to
> > > > not enable MTE on such Cacheable mappings.
> > > 
> > > Architecturally it seems dodgy to export any address that isn't "normal
> > > memory" (i.e. with tag storage) to the guest as Normal Cacheable. Although
> > > I'm a bit worried this might cause a regression in some existing case.
> > 
> > What I had in mind is some persistent memory that may be given to the
> > guest for direct access. This is allowed to be cacheable (write-back)
> > but may not have tag storage.
> 
> At the moment we don't have a good idea what would happen if/when the guest
> (or host) attempts to use that memory as tagged. If we have a relatively
> safe hardware behaviour (e.g. the tags are silently dropped/read-as-zero)
> then that's not a big issue. But if the accesses cause some form of abort
> then we need to understand how that would be handled.

The architecture is not prescriptive here, the behaviour is
"unpredictable". It could mean tags read-as-zero/write-ignored or an
SError.

> > > > 1. As in your current patches, assume any Cacheable at Stage 2 can have
> > > >  MTE enabled at Stage 1. In addition, we need to check whether the
> > > >  physical memory supports MTE and it could be something simple like
> > > >  pfn_valid(). Is there a way to reject a memory slot passed by the
> > > >  VMM?
> > > 
> > > Yes pfn_valid() should have been in there. At the moment pfn_to_page() is
> > > called without any checks.
> > > 
> > > The problem with attempting to reject a memory slot is that the memory
> > > backing that slot can change. So checking at the time the slot is created
> > > isn't enough (although it might be a useful error checking feature).
> > 
> > But isn't the slot changed as a result of another VMM call? So we could
> > always have such check in pla

Re: [RFC PATCH 0/2] MTE support for KVM guest

2020-06-24 Thread Steven Price

On 24/06/2020 15:21, Catalin Marinas wrote:

On Wed, Jun 24, 2020 at 12:16:28PM +0100, Steven Price wrote:

On 23/06/2020 18:48, Catalin Marinas wrote:

On Wed, Jun 17, 2020 at 01:38:42PM +0100, Steven Price wrote:

These patches add support to KVM to enable MTE within a guest. It is
based on Catalin's v4 MTE user space series[1].

[1] http://lkml.kernel.org/r/20200515171612.1020-1-catalin.marinas%40arm.com

Posting as an RFC as I'd like feedback on the approach taken. First a
little background on how MTE fits within the architecture:

The stage 2 page tables have limited scope for controlling the
availability of MTE. If a page is mapped as Normal and cached in stage 2
then it's the stage 1 tables that get to choose whether the memory is
tagged or not. So the only way of forbidding tags on a page from the
hypervisor is to change the cacheability (or make it device memory)
which would cause other problems.  Note this restriction fits the
intention that a system should have all (general purpose) memory
supporting tags if it support MTE, so it's not too surprising.

However, the upshot of this is that to enable MTE within a guest all
pages of memory mapped into the guest as normal cached pages in stage 2
*must* support MTE (i.e. we must ensure the tags are appropriately
sanitised and save/restore the tags during swap etc).

My current approach is that KVM transparently upgrades any pages
provided by the VMM to be tag-enabled when they are faulted in (i.e.
sets the PG_mte_tagged flag on the page) which has the benefit of
requiring fewer changes in the VMM. However, save/restore of the VM
state still requires the VMM to have a PROT_MTE enabled mapping so that
it can access the tag values. A VMM which 'forgets' to enable PROT_MTE
would lose the tag values when saving/restoring (tags are RAZ/WI when
PROT_MTE isn't set).

An alternative approach would be to enforce the VMM provides PROT_MTE
memory in the first place. This seems appealing to prevent the above
potentially unexpected gotchas with save/restore, however this would
also extend to memory that you might not expect to have PROT_MTE (e.g. a
shared frame buffer for an emulated graphics card).


As you mentioned above, if memory is mapped as Normal Cacheable at Stage
2 (whether we use FWB or not), the guest is allowed to turn MTE on via
Stage 1. There is no way for KVM to prevent a guest from using MTE other
than the big HCR_EL2.ATA knob.

This causes potential issues since we can't guarantee that all the
Cacheable memory slots allocated by the VMM support MTE. If they do not,
the arch behaviour is "unpredictable". We also can't trust the guest to
not enable MTE on such Cacheable mappings.


Architecturally it seems dodgy to export any address that isn't "normal
memory" (i.e. with tag storage) to the guest as Normal Cacheable. Although
I'm a bit worried this might cause a regression in some existing case.


What I had in mind is some persistent memory that may be given to the
guest for direct access. This is allowed to be cacheable (write-back)
but may not have tag storage.


At the moment we don't have a good idea what would happen if/when the 
guest (or host) attempts to use that memory as tagged. If we have a 
relatively safe hardware behaviour (e.g. the tags are silently 
dropped/read-as-zero) then that's not a big issue. But if the accesses 
cause some form of abort then we need to understand how that would be 
handled.



On the host kernel, mmap'ing with PROT_MTE is only allowed for anonymous
mappings and shmem. So requiring the VMM to always pass PROT_MTE mapped
ranges to KVM, irrespective of whether it's guest RAM, emulated device,
virtio etc. (as long as they are Cacheable), filters unsafe ranges that
may be mapped into guest.


That would be an easy way of doing the filtering, but it's not clear whether
PROT_MTE is actually what the VMM wants (it most likely doesn't want to have
tag checking enabled on the memory in user space).


 From the other sub-thread, yeah, we probably don't want to mandate
PROT_MTE because of potential inadvertent tag check faults in the VMM
itself.


Note that in the next revision of the MTE patches I'll drop the DT
memory nodes checking and rely only on the CPUID information (arch
updated promised by the architects).

I see two possible ways to handle this (there may be more):

1. As in your current patches, assume any Cacheable at Stage 2 can have
 MTE enabled at Stage 1. In addition, we need to check whether the
 physical memory supports MTE and it could be something simple like
 pfn_valid(). Is there a way to reject a memory slot passed by the
 VMM?


Yes pfn_valid() should have been in there. At the moment pfn_to_page() is
called without any checks.

The problem with attempting to reject a memory slot is that the memory
backing that slot can change. So checking at the time the slot is created
isn't enough (although it might be a useful error checking feature).


But isn't the slot changed as a result 

Re: [RFC PATCH 0/2] MTE support for KVM guest

2020-06-24 Thread Catalin Marinas
On Wed, Jun 24, 2020 at 12:16:28PM +0100, Steven Price wrote:
> On 23/06/2020 18:48, Catalin Marinas wrote:
> > On Wed, Jun 17, 2020 at 01:38:42PM +0100, Steven Price wrote:
> > > These patches add support to KVM to enable MTE within a guest. It is
> > > based on Catalin's v4 MTE user space series[1].
> > > 
> > > [1] 
> > > http://lkml.kernel.org/r/20200515171612.1020-1-catalin.marinas%40arm.com
> > > 
> > > Posting as an RFC as I'd like feedback on the approach taken. First a
> > > little background on how MTE fits within the architecture:
> > > 
> > > The stage 2 page tables have limited scope for controlling the
> > > availability of MTE. If a page is mapped as Normal and cached in stage 2
> > > then it's the stage 1 tables that get to choose whether the memory is
> > > tagged or not. So the only way of forbidding tags on a page from the
> > > hypervisor is to change the cacheability (or make it device memory)
> > > which would cause other problems.  Note this restriction fits the
> > > intention that a system should have all (general purpose) memory
> > > supporting tags if it support MTE, so it's not too surprising.
> > > 
> > > However, the upshot of this is that to enable MTE within a guest all
> > > pages of memory mapped into the guest as normal cached pages in stage 2
> > > *must* support MTE (i.e. we must ensure the tags are appropriately
> > > sanitised and save/restore the tags during swap etc).
> > > 
> > > My current approach is that KVM transparently upgrades any pages
> > > provided by the VMM to be tag-enabled when they are faulted in (i.e.
> > > sets the PG_mte_tagged flag on the page) which has the benefit of
> > > requiring fewer changes in the VMM. However, save/restore of the VM
> > > state still requires the VMM to have a PROT_MTE enabled mapping so that
> > > it can access the tag values. A VMM which 'forgets' to enable PROT_MTE
> > > would lose the tag values when saving/restoring (tags are RAZ/WI when
> > > PROT_MTE isn't set).
> > > 
> > > An alternative approach would be to enforce the VMM provides PROT_MTE
> > > memory in the first place. This seems appealing to prevent the above
> > > potentially unexpected gotchas with save/restore, however this would
> > > also extend to memory that you might not expect to have PROT_MTE (e.g. a
> > > shared frame buffer for an emulated graphics card).
> > 
> > As you mentioned above, if memory is mapped as Normal Cacheable at Stage
> > 2 (whether we use FWB or not), the guest is allowed to turn MTE on via
> > Stage 1. There is no way for KVM to prevent a guest from using MTE other
> > than the big HCR_EL2.ATA knob.
> > 
> > This causes potential issues since we can't guarantee that all the
> > Cacheable memory slots allocated by the VMM support MTE. If they do not,
> > the arch behaviour is "unpredictable". We also can't trust the guest to
> > not enable MTE on such Cacheable mappings.
> 
> Architecturally it seems dodgy to export any address that isn't "normal
> memory" (i.e. with tag storage) to the guest as Normal Cacheable. Although
> I'm a bit worried this might cause a regression in some existing case.

What I had in mind is some persistent memory that may be given to the
guest for direct access. This is allowed to be cacheable (write-back)
but may not have tag storage.

> > On the host kernel, mmap'ing with PROT_MTE is only allowed for anonymous
> > mappings and shmem. So requiring the VMM to always pass PROT_MTE mapped
> > ranges to KVM, irrespective of whether it's guest RAM, emulated device,
> > virtio etc. (as long as they are Cacheable), filters unsafe ranges that
> > may be mapped into guest.
> 
> That would be an easy way of doing the filtering, but it's not clear whether
> PROT_MTE is actually what the VMM wants (it most likely doesn't want to have
> tag checking enabled on the memory in user space).

>From the other sub-thread, yeah, we probably don't want to mandate
PROT_MTE because of potential inadvertent tag check faults in the VMM
itself.

> > Note that in the next revision of the MTE patches I'll drop the DT
> > memory nodes checking and rely only on the CPUID information (arch
> > updated promised by the architects).
> > 
> > I see two possible ways to handle this (there may be more):
> > 
> > 1. As in your current patches, assume any Cacheable at Stage 2 can have
> > MTE enabled at Stage 1. In addition, we need to check whether the
> > physical memory supports MTE and it could be something simple like
> > pfn_valid(). Is there a way to reject a memory slot passed by the
> > VMM?
> 
> Yes pfn_valid() should have been in there. At the moment pfn_to_page() is
> called without any checks.
> 
> The problem with attempting to reject a memory slot is that the memory
> backing that slot can change. So checking at the time the slot is created
> isn't enough (although it might be a useful error checking feature).

But isn't the slot changed as a result of another VMM call? So we could
always have such check in

Re: [RFC PATCH 0/2] MTE support for KVM guest

2020-06-24 Thread Peter Maydell
On Wed, 24 Jun 2020 at 12:18, Steven Price  wrote:
> Ah yes, similar to (1) but much lower overhead ;) That's probably the
> best option - it can be hidden in a memcpy_ignoring_tags() function.
> However it still means that the VMM can't directly touch the guest's
> memory which might cause issues for the VMM.

That's kind of awkward, since in general QEMU assumes it can
naturally just access guest RAM[*] (eg emulation of DMAing devices,
virtio, graphics display, gdb stub memory accesses). It would be
nicer to be able to do it the other way around, maybe, so that the
current APIs give you the "just the memory" and if you really want
to do tagged accesses to guest ram you can do it with tag-specific
APIs. I haven't thought about this very much though and haven't
read enough of the MTE spec recently enough to make much
sensible comment. So mostly what I'm trying to encourage here
is that the people implementing the KVM/kernel side of this API
also think about the userspace side of it, so we get one coherent
design rather than a half-a-product that turns out to be awkward
to use :-)

[*] "guest ram is encrypted" also breaks this assumption, of course;
I haven't looked at the efforts in that direction that are already in
QEMU to see how they work, though.

thanks
-- PMM


Re: [RFC PATCH 0/2] MTE support for KVM guest

2020-06-24 Thread Catalin Marinas
On Wed, Jun 24, 2020 at 12:18:46PM +0100, Steven Price wrote:
> On 24/06/2020 12:09, Catalin Marinas wrote:
> > On Wed, Jun 24, 2020 at 12:03:35PM +0100, Steven Price wrote:
> > > On 24/06/2020 11:34, Dave Martin wrote:
> > > > On Wed, Jun 24, 2020 at 10:38:48AM +0100, Catalin Marinas wrote:
> > > > > On Tue, Jun 23, 2020 at 07:05:07PM +0100, Peter Maydell wrote:
> > > > > > On Wed, 17 Jun 2020 at 13:39, Steven Price  
> > > > > > wrote:
> > > > > > > These patches add support to KVM to enable MTE within a guest. It 
> > > > > > > is
> > > > > > > based on Catalin's v4 MTE user space series[1].
> > > > > > > 
> > > > > > > [1] 
> > > > > > > http://lkml.kernel.org/r/20200515171612.1020-1-catalin.marinas%40arm.com
> > > > > > > 
> > > > > > > Posting as an RFC as I'd like feedback on the approach taken.
> > > > > > 
> > > > > > What's your plan for handling tags across VM migration?
> > > > > > Will the kernel expose the tag ram to userspace so we
> > > > > > can copy it from the source machine to the destination
> > > > > > at the same time as we copy the actual ram contents ?
> > > > > 
> > > > > Qemu can map the guest memory with PROT_MTE and access the tags 
> > > > > directly
> > > > > with LDG/STG instructions. Steven was actually asking in the cover
> > > > > letter whether we should require that the VMM maps the guest memory 
> > > > > with
> > > > > PROT_MTE as a guarantee that it can access the guest tags.
> > > > > 
> > > > > There is no architecturally visible tag ram (tag storage), that's a
> > > > > microarchitecture detail.
> > > > 
> > > > If userspace maps the guest memory with PROT_MTE for dump purposes,
> > > > isn't it going to get tag check faults when accessing the memory
> > > > (i.e., when dumping the regular memory content, not the tags
> > > > specifically).
> > > > 
> > > > Does it need to map two aliases, one with PROT_MTE and one without,
> > > > and is that architecturally valid?
> > > 
> > > Userspace would either need to have two mappings (I don't believe there 
> > > are
> > > any architectural issues with that - but this could be awkward to arrange 
> > > in
> > > some situations) or be careful to avoid faults. Basically your choices 
> > > with
> > > one mapping are:
> > > 
> > >   1. Disable tag checking (using prctl) when touching the memory. This 
> > > works
> > > but means you lose tag checking for the VMM's own accesses during this 
> > > code
> > > sequence.
> > > 
> > >   2. Read the tag values and ensure you use the correct tag. This suffers
> > > from race conditions if the VM is still running.
> > > 
> > >   3. Use one of the exceptions in the architecture that generates a Tag
> > > Unchecked access. Sadly the only remotely useful thing I can see in the v8
> > > ARM is "A base register plus immediate offset addressing form, with the SP
> > > as the base register." - but making sure SP is in range of where you want 
> > > to
> > > access would be a pain.
> > 
> > Or:
> > 
> > 4. Set PSTATE.TCO when accessing tagged memory in an unsafe way.
> 
> Ah yes, similar to (1) but much lower overhead ;) That's probably the best
> option - it can be hidden in a memcpy_ignoring_tags() function. However it
> still means that the VMM can't directly touch the guest's memory which might
> cause issues for the VMM.

You are right, I don't think it's safe for the VMM to access the guest
memory via a PROT_MTE mapping. If the guest is using memory tagging for
for a buffer and then it is passed to qemu for virtio, the tag
information may have been lost already (does qemu only get the IPA in
this case?)

So we may end up with two mappings after all, one for the normal
execution and a new on if migration is needed.

-- 
Catalin


Re: [RFC PATCH 0/2] MTE support for KVM guest

2020-06-24 Thread Steven Price

On 24/06/2020 12:09, Catalin Marinas wrote:

On Wed, Jun 24, 2020 at 12:03:35PM +0100, Steven Price wrote:

On 24/06/2020 11:34, Dave Martin wrote:

On Wed, Jun 24, 2020 at 10:38:48AM +0100, Catalin Marinas wrote:

On Tue, Jun 23, 2020 at 07:05:07PM +0100, Peter Maydell wrote:

On Wed, 17 Jun 2020 at 13:39, Steven Price  wrote:

These patches add support to KVM to enable MTE within a guest. It is
based on Catalin's v4 MTE user space series[1].

[1] http://lkml.kernel.org/r/20200515171612.1020-1-catalin.marinas%40arm.com

Posting as an RFC as I'd like feedback on the approach taken.


What's your plan for handling tags across VM migration?
Will the kernel expose the tag ram to userspace so we
can copy it from the source machine to the destination
at the same time as we copy the actual ram contents ?


Qemu can map the guest memory with PROT_MTE and access the tags directly
with LDG/STG instructions. Steven was actually asking in the cover
letter whether we should require that the VMM maps the guest memory with
PROT_MTE as a guarantee that it can access the guest tags.

There is no architecturally visible tag ram (tag storage), that's a
microarchitecture detail.


If userspace maps the guest memory with PROT_MTE for dump purposes,
isn't it going to get tag check faults when accessing the memory
(i.e., when dumping the regular memory content, not the tags
specifically).

Does it need to map two aliases, one with PROT_MTE and one without,
and is that architecturally valid?


Userspace would either need to have two mappings (I don't believe there are
any architectural issues with that - but this could be awkward to arrange in
some situations) or be careful to avoid faults. Basically your choices with
one mapping are:

  1. Disable tag checking (using prctl) when touching the memory. This works
but means you lose tag checking for the VMM's own accesses during this code
sequence.

  2. Read the tag values and ensure you use the correct tag. This suffers
from race conditions if the VM is still running.

  3. Use one of the exceptions in the architecture that generates a Tag
Unchecked access. Sadly the only remotely useful thing I can see in the v8
ARM is "A base register plus immediate offset addressing form, with the SP
as the base register." - but making sure SP is in range of where you want to
access would be a pain.


Or:

4. Set PSTATE.TCO when accessing tagged memory in an unsafe way.



Ah yes, similar to (1) but much lower overhead ;) That's probably the 
best option - it can be hidden in a memcpy_ignoring_tags() function. 
However it still means that the VMM can't directly touch the guest's 
memory which might cause issues for the VMM.


Steve


Re: [RFC PATCH 0/2] MTE support for KVM guest

2020-06-24 Thread Steven Price

On 23/06/2020 18:48, Catalin Marinas wrote:

Hi Steven,

On Wed, Jun 17, 2020 at 01:38:42PM +0100, Steven Price wrote:

These patches add support to KVM to enable MTE within a guest. It is
based on Catalin's v4 MTE user space series[1].

[1] http://lkml.kernel.org/r/20200515171612.1020-1-catalin.marinas%40arm.com

Posting as an RFC as I'd like feedback on the approach taken. First a
little background on how MTE fits within the architecture:

The stage 2 page tables have limited scope for controlling the
availability of MTE. If a page is mapped as Normal and cached in stage 2
then it's the stage 1 tables that get to choose whether the memory is
tagged or not. So the only way of forbidding tags on a page from the
hypervisor is to change the cacheability (or make it device memory)
which would cause other problems.  Note this restriction fits the
intention that a system should have all (general purpose) memory
supporting tags if it support MTE, so it's not too surprising.

However, the upshot of this is that to enable MTE within a guest all
pages of memory mapped into the guest as normal cached pages in stage 2
*must* support MTE (i.e. we must ensure the tags are appropriately
sanitised and save/restore the tags during swap etc).

My current approach is that KVM transparently upgrades any pages
provided by the VMM to be tag-enabled when they are faulted in (i.e.
sets the PG_mte_tagged flag on the page) which has the benefit of
requiring fewer changes in the VMM. However, save/restore of the VM
state still requires the VMM to have a PROT_MTE enabled mapping so that
it can access the tag values. A VMM which 'forgets' to enable PROT_MTE
would lose the tag values when saving/restoring (tags are RAZ/WI when
PROT_MTE isn't set).

An alternative approach would be to enforce the VMM provides PROT_MTE
memory in the first place. This seems appealing to prevent the above
potentially unexpected gotchas with save/restore, however this would
also extend to memory that you might not expect to have PROT_MTE (e.g. a
shared frame buffer for an emulated graphics card).


As you mentioned above, if memory is mapped as Normal Cacheable at Stage
2 (whether we use FWB or not), the guest is allowed to turn MTE on via
Stage 1. There is no way for KVM to prevent a guest from using MTE other
than the big HCR_EL2.ATA knob.

This causes potential issues since we can't guarantee that all the
Cacheable memory slots allocated by the VMM support MTE. If they do not,
the arch behaviour is "unpredictable". We also can't trust the guest to
not enable MTE on such Cacheable mappings.


Architecturally it seems dodgy to export any address that isn't "normal 
memory" (i.e. with tag storage) to the guest as Normal Cacheable. 
Although I'm a bit worried this might cause a regression in some 
existing case.



On the host kernel, mmap'ing with PROT_MTE is only allowed for anonymous
mappings and shmem. So requiring the VMM to always pass PROT_MTE mapped
ranges to KVM, irrespective of whether it's guest RAM, emulated device,
virtio etc. (as long as they are Cacheable), filters unsafe ranges that
may be mapped into guest.


That would be an easy way of doing the filtering, but it's not clear 
whether PROT_MTE is actually what the VMM wants (it most likely doesn't 
want to have tag checking enabled on the memory in user space).



Note that in the next revision of the MTE patches I'll drop the DT
memory nodes checking and rely only on the CPUID information (arch
updated promised by the architects).

I see two possible ways to handle this (there may be more):

1. As in your current patches, assume any Cacheable at Stage 2 can have
MTE enabled at Stage 1. In addition, we need to check whether the
physical memory supports MTE and it could be something simple like
pfn_valid(). Is there a way to reject a memory slot passed by the
VMM?


Yes pfn_valid() should have been in there. At the moment pfn_to_page() 
is called without any checks.


The problem with attempting to reject a memory slot is that the memory 
backing that slot can change. So checking at the time the slot is 
created isn't enough (although it might be a useful error checking feature).


It's not clear to me what we can do at fault time when we discover the 
memory isn't tag-capable and would have been mapped cacheable other than 
kill the VM.



2. Similar to 1 but instead of checking whether the pfn supports MTE, we
require the VMM to only pass PROT_MTE ranges (filtering already done
by the host kernel). We need a way to reject the slot and return an
error to the VMM.

I think rejecting a slot at the Stage 2 fault time is very late. You
probably won't be able to do much other than killing the guest.


As above, we will struggle to catch all cases during slot creation, so I 
think we're going to have to deal with this late detection as well.



Both 1 and 2 above risk breaking existing VMMs just because they happen
to start on an MTE-capable machine. So, can we also 

Re: [RFC PATCH 0/2] MTE support for KVM guest

2020-06-24 Thread Catalin Marinas
On Wed, Jun 24, 2020 at 12:03:35PM +0100, Steven Price wrote:
> On 24/06/2020 11:34, Dave Martin wrote:
> > On Wed, Jun 24, 2020 at 10:38:48AM +0100, Catalin Marinas wrote:
> > > On Tue, Jun 23, 2020 at 07:05:07PM +0100, Peter Maydell wrote:
> > > > On Wed, 17 Jun 2020 at 13:39, Steven Price  wrote:
> > > > > These patches add support to KVM to enable MTE within a guest. It is
> > > > > based on Catalin's v4 MTE user space series[1].
> > > > > 
> > > > > [1] 
> > > > > http://lkml.kernel.org/r/20200515171612.1020-1-catalin.marinas%40arm.com
> > > > > 
> > > > > Posting as an RFC as I'd like feedback on the approach taken.
> > > > 
> > > > What's your plan for handling tags across VM migration?
> > > > Will the kernel expose the tag ram to userspace so we
> > > > can copy it from the source machine to the destination
> > > > at the same time as we copy the actual ram contents ?
> > > 
> > > Qemu can map the guest memory with PROT_MTE and access the tags directly
> > > with LDG/STG instructions. Steven was actually asking in the cover
> > > letter whether we should require that the VMM maps the guest memory with
> > > PROT_MTE as a guarantee that it can access the guest tags.
> > > 
> > > There is no architecturally visible tag ram (tag storage), that's a
> > > microarchitecture detail.
> > 
> > If userspace maps the guest memory with PROT_MTE for dump purposes,
> > isn't it going to get tag check faults when accessing the memory
> > (i.e., when dumping the regular memory content, not the tags
> > specifically).
> > 
> > Does it need to map two aliases, one with PROT_MTE and one without,
> > and is that architecturally valid?
> 
> Userspace would either need to have two mappings (I don't believe there are
> any architectural issues with that - but this could be awkward to arrange in
> some situations) or be careful to avoid faults. Basically your choices with
> one mapping are:
> 
>  1. Disable tag checking (using prctl) when touching the memory. This works
> but means you lose tag checking for the VMM's own accesses during this code
> sequence.
> 
>  2. Read the tag values and ensure you use the correct tag. This suffers
> from race conditions if the VM is still running.
> 
>  3. Use one of the exceptions in the architecture that generates a Tag
> Unchecked access. Sadly the only remotely useful thing I can see in the v8
> ARM is "A base register plus immediate offset addressing form, with the SP
> as the base register." - but making sure SP is in range of where you want to
> access would be a pain.

Or:

4. Set PSTATE.TCO when accessing tagged memory in an unsafe way.

-- 
Catalin


Re: [RFC PATCH 0/2] MTE support for KVM guest

2020-06-24 Thread Steven Price

On 24/06/2020 11:34, Dave Martin wrote:

On Wed, Jun 24, 2020 at 10:38:48AM +0100, Catalin Marinas wrote:

On Tue, Jun 23, 2020 at 07:05:07PM +0100, Peter Maydell wrote:

On Wed, 17 Jun 2020 at 13:39, Steven Price  wrote:

These patches add support to KVM to enable MTE within a guest. It is
based on Catalin's v4 MTE user space series[1].

[1] http://lkml.kernel.org/r/20200515171612.1020-1-catalin.marinas%40arm.com

Posting as an RFC as I'd like feedback on the approach taken.


What's your plan for handling tags across VM migration?
Will the kernel expose the tag ram to userspace so we
can copy it from the source machine to the destination
at the same time as we copy the actual ram contents ?


Qemu can map the guest memory with PROT_MTE and access the tags directly
with LDG/STG instructions. Steven was actually asking in the cover
letter whether we should require that the VMM maps the guest memory with
PROT_MTE as a guarantee that it can access the guest tags.

There is no architecturally visible tag ram (tag storage), that's a
microarchitecture detail.


If userspace maps the guest memory with PROT_MTE for dump purposes,
isn't it going to get tag check faults when accessing the memory
(i.e., when dumping the regular memory content, not the tags
specifically).

Does it need to map two aliases, one with PROT_MTE and one without,
and is that architecturally valid?


Userspace would either need to have two mappings (I don't believe there 
are any architectural issues with that - but this could be awkward to 
arrange in some situations) or be careful to avoid faults. Basically 
your choices with one mapping are:


 1. Disable tag checking (using prctl) when touching the memory. This 
works but means you lose tag checking for the VMM's own accesses during 
this code sequence.


 2. Read the tag values and ensure you use the correct tag. This 
suffers from race conditions if the VM is still running.


 3. Use one of the exceptions in the architecture that generates a Tag 
Unchecked access. Sadly the only remotely useful thing I can see in the 
v8 ARM is "A base register plus immediate offset addressing form, with 
the SP as the base register." - but making sure SP is in range of where 
you want to access would be a pain.


The kernel could provide a mechanism to do this, but I'm not sure that 
would be better than 1.


This, however, is another argument for my current approach of 
"upgrading" the pages automatically and not forcing the VMM to set 
PROT_MTE. But in this case we probably would need a kernel interface to 
fetch the tags as the VM sees them.


Steve


Re: [RFC PATCH 0/2] MTE support for KVM guest

2020-06-24 Thread Dave Martin
On Wed, Jun 24, 2020 at 10:38:48AM +0100, Catalin Marinas wrote:
> On Tue, Jun 23, 2020 at 07:05:07PM +0100, Peter Maydell wrote:
> > On Wed, 17 Jun 2020 at 13:39, Steven Price  wrote:
> > > These patches add support to KVM to enable MTE within a guest. It is
> > > based on Catalin's v4 MTE user space series[1].
> > >
> > > [1] 
> > > http://lkml.kernel.org/r/20200515171612.1020-1-catalin.marinas%40arm.com
> > >
> > > Posting as an RFC as I'd like feedback on the approach taken.
> > 
> > What's your plan for handling tags across VM migration?
> > Will the kernel expose the tag ram to userspace so we
> > can copy it from the source machine to the destination
> > at the same time as we copy the actual ram contents ?
> 
> Qemu can map the guest memory with PROT_MTE and access the tags directly
> with LDG/STG instructions. Steven was actually asking in the cover
> letter whether we should require that the VMM maps the guest memory with
> PROT_MTE as a guarantee that it can access the guest tags.
> 
> There is no architecturally visible tag ram (tag storage), that's a
> microarchitecture detail.

If userspace maps the guest memory with PROT_MTE for dump purposes,
isn't it going to get tag check faults when accessing the memory
(i.e., when dumping the regular memory content, not the tags
specifically).

Does it need to map two aliases, one with PROT_MTE and one without,
and is that architecturally valid?

Cheers
---Dave


Re: [RFC PATCH 0/2] MTE support for KVM guest

2020-06-24 Thread Catalin Marinas
On Tue, Jun 23, 2020 at 07:05:07PM +0100, Peter Maydell wrote:
> On Wed, 17 Jun 2020 at 13:39, Steven Price  wrote:
> > These patches add support to KVM to enable MTE within a guest. It is
> > based on Catalin's v4 MTE user space series[1].
> >
> > [1] http://lkml.kernel.org/r/20200515171612.1020-1-catalin.marinas%40arm.com
> >
> > Posting as an RFC as I'd like feedback on the approach taken.
> 
> What's your plan for handling tags across VM migration?
> Will the kernel expose the tag ram to userspace so we
> can copy it from the source machine to the destination
> at the same time as we copy the actual ram contents ?

Qemu can map the guest memory with PROT_MTE and access the tags directly
with LDG/STG instructions. Steven was actually asking in the cover
letter whether we should require that the VMM maps the guest memory with
PROT_MTE as a guarantee that it can access the guest tags.

There is no architecturally visible tag ram (tag storage), that's a
microarchitecture detail.

-- 
Catalin


Re: [RFC PATCH 0/2] MTE support for KVM guest

2020-06-23 Thread Peter Maydell
On Wed, 17 Jun 2020 at 13:39, Steven Price  wrote:
>
> These patches add support to KVM to enable MTE within a guest. It is
> based on Catalin's v4 MTE user space series[1].
>
> [1] http://lkml.kernel.org/r/20200515171612.1020-1-catalin.marinas%40arm.com
>
> Posting as an RFC as I'd like feedback on the approach taken.

What's your plan for handling tags across VM migration?
Will the kernel expose the tag ram to userspace so we
can copy it from the source machine to the destination
at the same time as we copy the actual ram contents ?

thanks
-- PMM


Re: [RFC PATCH 0/2] MTE support for KVM guest

2020-06-23 Thread Catalin Marinas
Hi Steven,

On Wed, Jun 17, 2020 at 01:38:42PM +0100, Steven Price wrote:
> These patches add support to KVM to enable MTE within a guest. It is
> based on Catalin's v4 MTE user space series[1].
> 
> [1] http://lkml.kernel.org/r/20200515171612.1020-1-catalin.marinas%40arm.com
> 
> Posting as an RFC as I'd like feedback on the approach taken. First a
> little background on how MTE fits within the architecture:
> 
> The stage 2 page tables have limited scope for controlling the
> availability of MTE. If a page is mapped as Normal and cached in stage 2
> then it's the stage 1 tables that get to choose whether the memory is
> tagged or not. So the only way of forbidding tags on a page from the
> hypervisor is to change the cacheability (or make it device memory)
> which would cause other problems.  Note this restriction fits the
> intention that a system should have all (general purpose) memory
> supporting tags if it support MTE, so it's not too surprising.
> 
> However, the upshot of this is that to enable MTE within a guest all
> pages of memory mapped into the guest as normal cached pages in stage 2
> *must* support MTE (i.e. we must ensure the tags are appropriately
> sanitised and save/restore the tags during swap etc).
> 
> My current approach is that KVM transparently upgrades any pages
> provided by the VMM to be tag-enabled when they are faulted in (i.e.
> sets the PG_mte_tagged flag on the page) which has the benefit of
> requiring fewer changes in the VMM. However, save/restore of the VM
> state still requires the VMM to have a PROT_MTE enabled mapping so that
> it can access the tag values. A VMM which 'forgets' to enable PROT_MTE
> would lose the tag values when saving/restoring (tags are RAZ/WI when
> PROT_MTE isn't set).
> 
> An alternative approach would be to enforce the VMM provides PROT_MTE
> memory in the first place. This seems appealing to prevent the above
> potentially unexpected gotchas with save/restore, however this would
> also extend to memory that you might not expect to have PROT_MTE (e.g. a
> shared frame buffer for an emulated graphics card). 

As you mentioned above, if memory is mapped as Normal Cacheable at Stage
2 (whether we use FWB or not), the guest is allowed to turn MTE on via
Stage 1. There is no way for KVM to prevent a guest from using MTE other
than the big HCR_EL2.ATA knob.

This causes potential issues since we can't guarantee that all the
Cacheable memory slots allocated by the VMM support MTE. If they do not,
the arch behaviour is "unpredictable". We also can't trust the guest to
not enable MTE on such Cacheable mappings.

On the host kernel, mmap'ing with PROT_MTE is only allowed for anonymous
mappings and shmem. So requiring the VMM to always pass PROT_MTE mapped
ranges to KVM, irrespective of whether it's guest RAM, emulated device,
virtio etc. (as long as they are Cacheable), filters unsafe ranges that
may be mapped into guest.

Note that in the next revision of the MTE patches I'll drop the DT
memory nodes checking and rely only on the CPUID information (arch
updated promised by the architects).

I see two possible ways to handle this (there may be more):

1. As in your current patches, assume any Cacheable at Stage 2 can have
   MTE enabled at Stage 1. In addition, we need to check whether the
   physical memory supports MTE and it could be something simple like
   pfn_valid(). Is there a way to reject a memory slot passed by the
   VMM?

2. Similar to 1 but instead of checking whether the pfn supports MTE, we
   require the VMM to only pass PROT_MTE ranges (filtering already done
   by the host kernel). We need a way to reject the slot and return an
   error to the VMM.

I think rejecting a slot at the Stage 2 fault time is very late. You
probably won't be able to do much other than killing the guest.

Both 1 and 2 above risk breaking existing VMMs just because they happen
to start on an MTE-capable machine. So, can we also require the VMM to
explicitly opt in to MTE support in guests via some ioctl()? This in
turn would enable the additional checks in KVM for the MTE capability of
the memory slots (1 or 2 above).

An alternative to an MTE enable ioctl(), if all the memory slots are set
up prior to the VM starting, KVM could check 1 or 2 above and decide
whether to expose MTE to guests (HCR_EL2.ATA).

More questions than solutions above, mostly for the KVM and Qemu
maintainers.

Thanks.

-- 
Catalin


[RFC PATCH 0/2] MTE support for KVM guest

2020-06-17 Thread Steven Price
These patches add support to KVM to enable MTE within a guest. It is
based on Catalin's v4 MTE user space series[1].

[1] http://lkml.kernel.org/r/20200515171612.1020-1-catalin.marinas%40arm.com

Posting as an RFC as I'd like feedback on the approach taken. First a
little background on how MTE fits within the architecture:

The stage 2 page tables have limited scope for controlling the
availability of MTE. If a page is mapped as Normal and cached in stage 2
then it's the stage 1 tables that get to choose whether the memory is
tagged or not. So the only way of forbidding tags on a page from the
hypervisor is to change the cacheability (or make it device memory)
which would cause other problems.  Note this restriction fits the
intention that a system should have all (general purpose) memory
supporting tags if it support MTE, so it's not too surprising.

However, the upshot of this is that to enable MTE within a guest all
pages of memory mapped into the guest as normal cached pages in stage 2
*must* support MTE (i.e. we must ensure the tags are appropriately
sanitised and save/restore the tags during swap etc).

My current approach is that KVM transparently upgrades any pages
provided by the VMM to be tag-enabled when they are faulted in (i.e.
sets the PG_mte_tagged flag on the page) which has the benefit of
requiring fewer changes in the VMM. However, save/restore of the VM
state still requires the VMM to have a PROT_MTE enabled mapping so that
it can access the tag values. A VMM which 'forgets' to enable PROT_MTE
would lose the tag values when saving/restoring (tags are RAZ/WI when
PROT_MTE isn't set).

An alternative approach would be to enforce the VMM provides PROT_MTE
memory in the first place. This seems appealing to prevent the above
potentially unexpected gotchas with save/restore, however this would
also extend to memory that you might not expect to have PROT_MTE (e.g. a
shared frame buffer for an emulated graphics card). 

Comments on the approach (or ideas for alternative approaches) are very
welcome.

Steven Price (2):
  arm64: kvm: Save/restore MTE registers
  arm64: kvm: Introduce MTE VCPU feature

 arch/arm64/include/asm/kvm_emulate.h |  3 +++
 arch/arm64/include/asm/kvm_host.h|  9 -
 arch/arm64/include/uapi/asm/kvm.h|  1 +
 arch/arm64/kvm/hyp/sysreg-sr.c   | 12 
 arch/arm64/kvm/reset.c   |  8 
 arch/arm64/kvm/sys_regs.c|  8 
 virt/kvm/arm/mmu.c   | 11 +++
 7 files changed, 51 insertions(+), 1 deletion(-)

-- 
2.20.1