Re: [PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit

2022-11-16 Thread Andy Lutomirski



On Tue, Oct 25, 2022, at 8:13 AM, Chao Peng wrote:
> This new KVM exit allows userspace to handle memory-related errors. It
> indicates that an error occurred in KVM at guest memory range [gpa, gpa+size).
> The flags field includes additional information for userspace to handle the
> error. Currently bit 0 is defined as 'private memory', where '1' indicates
> the error happened due to a private memory access and '0' indicates it
> happened due to a shared memory access.
>
> When private memory is enabled, this new exit will be used for KVM to
> exit to userspace for shared <-> private memory conversion in memory
> encryption usage. In such usage, there are typically two kinds of memory
> conversions:
>   - explicit conversion: happens when guest explicitly calls into KVM
> to map a range (as private or shared), KVM then exits to userspace
> to perform the map/unmap operations.
>   - implicit conversion: happens in KVM page fault handler where KVM
> exits to userspace for an implicit conversion when the page is in a
> different state than requested (private or shared).
>
> Suggested-by: Sean Christopherson 
> Co-developed-by: Yu Zhang 
> Signed-off-by: Yu Zhang 
> Signed-off-by: Chao Peng 
> ---
>  Documentation/virt/kvm/api.rst | 23 +++
>  include/uapi/linux/kvm.h   |  9 +
>  2 files changed, 32 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index f3fa75649a78..975688912b8c 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6537,6 +6537,29 @@ array field represents return values. The userspace should update the return
>  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>  spec refer, https://github.com/riscv/riscv-sbi-doc.
> 
> +::
> +
> + /* KVM_EXIT_MEMORY_FAULT */
> + struct {
> + #define KVM_MEMORY_EXIT_FLAG_PRIVATE   (1 << 0)
> +         __u32 flags;
> +         __u32 padding;
> +         __u64 gpa;
> +         __u64 size;
> + } memory;
> +

Would it make sense to also have a field for the access type (read, write, 
execute, etc)?  I realize that shared <-> private conversion doesn't strictly 
need this, but it seems like it could be useful for logging failures and also 
for avoiding a second immediate fault if the type gets converted but doesn't 
have the right protection yet.

(Obviously, if this were changed, KVM would need the ability to report that it 
doesn't actually know the mode.)

--Andy



Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-09-21 Thread Andy Lutomirski
(Please excuse any formatting disasters.  My internet went out as I was 
composing this, and I did my best to rescue it.)

On Mon, Sep 19, 2022, at 12:10 PM, Sean Christopherson wrote:
> +Will, Marc and Fuad (apologies if I missed other pKVM folks)
>
> On Mon, Sep 19, 2022, David Hildenbrand wrote:
>> On 15.09.22 16:29, Chao Peng wrote:
>> > From: "Kirill A. Shutemov" 
>> > 
>> > KVM can use memfd-provided memory for guest memory. For normal userspace
>> > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
>> > virtual address space and then tells KVM to use the virtual address to
>> > setup the mapping in the secondary page table (e.g. EPT).
>> > 
>> > With confidential computing technologies like Intel TDX, the
>> > memfd-provided memory may be encrypted with special key for special
>> > software domain (e.g. KVM guest) and is not expected to be directly
>> > accessed by userspace. Precisely, userspace access to such encrypted
>> > memory may lead to host crash so it should be prevented.
>> 
>> Initially my thought was that this whole inaccessible thing is TDX specific
>> and there is no need to force that on other mechanisms. That's why I
>> suggested to not expose this to user space but handle the notifier
>> requirements internally.
>> 
>> IIUC now, protected KVM has similar demands. Either access (read/write) of
>> guest RAM would result in a fault and possibly crash the hypervisor (at
>> least not the whole machine IIUC).
>
> Yep.  The missing piece for pKVM is the ability to convert from shared 
> to private
> while preserving the contents, e.g. to hand off a large buffer 
> (hundreds of MiB)
> for processing in the protected VM.  Thoughts on this at the bottom.
>
>> > This patch introduces userspace inaccessible memfd (created with
>> > MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
>> > ordinary MMU access (e.g. read/write/mmap) but can be accessed via
>> > in-kernel interface so KVM can directly interact with core-mm without
>> > the need to map the memory into KVM userspace.
>> 
>> With secretmem we decided to not add such "concept switch" flags and instead
>> use a dedicated syscall.
>>
>
> I have no personal preference whatsoever between a flag and a dedicated 
> syscall,
> but a dedicated syscall does seem like it would give the kernel a bit more
> flexibility.

The third option is a device node, e.g. /dev/kvm_secretmem or /dev/kvm_tdxmem 
or similar.  But if we need flags or other details in the future, maybe this 
isn't ideal.

>
>> What about memfd_inaccessible()? Especially, sealing and hugetlb are not
>> even supported and it might take a while to support either.
>
> Don't know about sealing, but hugetlb support for "inaccessible" memory 
> needs to
> come sooner than later.  "inaccessible" in quotes because we might want 
> to choose
> a less binary name, e.g. "restricted"?.
>
> Regarding pKVM's use case, with the shim approach I believe this can be done 
> by
> allowing userspace mmap() the "hidden" memfd, but with a ton of restrictions
> piled on top.
>
> My first thought was to make the uAPI a set of KVM ioctls so that KVM could
> tightly control usage without taking on too much complexity in the kernel, but
> working through things, routing the behavior through the shim itself might not
> be all that horrific.
>
> IIRC, we discarded the idea of allowing userspace to map the "private" 
> fd because
> things got too complex, but with the shim it doesn't seem _that_ bad.

What's the exact use case?  Is it just to pre-populate the memory?

>
> E.g. on the memfd side:
>
>   1. The entire memfd must be mapped, and at most one mapping is allowed, i.e.
>  mapping is all or nothing.
>
>   2. Acquiring a reference via get_pfn() is disallowed if there's a mapping 
> for
>  the restricted memfd.
>
>   3. Add notifier hooks to allow downstream users to further restrict things.
>
>   4. Disallow splitting VMAs, e.g. to force userspace to munmap() everything 
> in
>  one shot.
>
>   5. Require that there are no outstanding references at munmap().  Or if this
>  can't be guaranteed by userspace, maybe add some way for userspace to 
> wait
>  until it's ok to convert to private?  E.g. so that get_pfn() doesn't need
>  to do an expensive check every time.

Hmm.  I haven't looked at the code to see if this would really work, but I 
think this could be done more in line with how the rest of the kernel works by 
using the rmap infrastructure.  When the pKVM memfd is in not-yet-private mode, 
just let it be mmapped as usual (but don't allow any form of GUP or pinning).  
Then have an ioctl to switch to private mode that takes locks or sets flags 
so that no new faults can be serviced and does unmap_mapping_range.

As long as the shim arranges to have its own vm_ops, I don't immediately see 
any reason this can't work.



Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-09-09 Thread Andy Lutomirski



On Fri, Sep 9, 2022, at 7:32 AM, Kirill A . Shutemov wrote:
> On Thu, Sep 08, 2022 at 09:48:35PM -0700, Andy Lutomirski wrote:
>> On 8/19/22 17:27, Kirill A. Shutemov wrote:
>> > On Thu, Aug 18, 2022 at 08:00:41PM -0700, Hugh Dickins wrote:
>> > > On Thu, 18 Aug 2022, Kirill A . Shutemov wrote:
>> > > > On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
>> > > > > 
>> > > > > If your memory could be swapped, that would be enough of a good 
>> > > > > reason
>> > > > > to make use of shmem.c: but it cannot be swapped; and although there
>> > > > > are some references in the mailthreads to it perhaps being swappable
>> > > > > in future, I get the impression that will not happen soon if ever.
>> > > > > 
>> > > > > If your memory could be migrated, that would be some reason to use
>> > > > > filesystem page cache (because page migration happens to understand
>> > > > > that type of memory): but it cannot be migrated.
>> > > > 
>> > > > Migration support is in pipeline. It is part of TDX 1.5 [1]. And 
>> > > > swapping
>> > > > theoretically possible, but I'm not aware of any plans as of now.
>> > > > 
>> > > > [1] 
>> > > > https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
>> > > 
>> > > I always forget, migration means different things to different audiences.
>> > > As an mm person, I was meaning page migration, whereas a virtualization
>> > > person thinks VM live migration (which that reference appears to be 
>> > > about),
>> > > a scheduler person task migration, an ornithologist bird migration, etc.
>> > > 
>> > > But you're an mm person too: you may have cited that reference in the
>> > > knowledge that TDX 1.5 Live Migration will entail page migration of the
>> > > kind I'm thinking of.  (Anyway, it's not important to clarify that here.)
>> > 
>> > TDX 1.5 brings both.
>> > 
>> > In TDX speak, mm migration is called relocation. See TDH.MEM.PAGE.RELOCATE.
>> > 
>> 
>> This seems to be a pretty bad fit for the way that the core mm migrates
>> pages.  The core mm unmaps the page, then moves (in software) the contents
>> to a new address, then faults it in.  TDH.MEM.PAGE.RELOCATE doesn't fit into
>> that workflow very well.  I'm not saying it can't be done, but it won't just
>> work.
>
> Hm. From what I see we have all necessary infrastructure in place.
>
> Unmapping is a NOP for inaccessible pages as they are never mapped, and we
> have the mapping->a_ops->migrate_folio() callback that allows replacing
> software copying with whatever is needed, like TDH.MEM.PAGE.RELOCATE.
>
> What do I miss?

Hmm, maybe this isn't as bad as I thought.

Right now, unless I've missed something, the migration workflow is to unmap 
(via try_to_migrate) all mappings, then migrate the backing store (with 
->migrate_folio(), although it seems like most callers expect the actual copy 
to happen outside of ->migrate_folio(), and then make new mappings.  With the 
*current* (vma-based, not fd-based) model for KVM memory, this won't work -- we 
can't unmap before calling TDH.MEM.PAGE.RELOCATE.

But maybe it's actually okay with some care or maybe mild modifications with 
the fd-based model.  We don't have any mmaps, per se, to unmap for secret / 
INACCESSIBLE memory.  So maybe we can get all the way to ->migrate_folio() 
without zapping anything in the secure EPT and just call TDH.MEM.PAGE.RELOCATE 
from inside migrate_folio().  And there will be nothing to fault back in.  From 
the core code's perspective, it's like migrating a memfd that doesn't happen to 
have any mappings at the time.

--Andy



Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-09-08 Thread Andy Lutomirski

On 8/24/22 02:41, Chao Peng wrote:

On Tue, Aug 23, 2022 at 04:05:27PM +, Sean Christopherson wrote:

On Tue, Aug 23, 2022, David Hildenbrand wrote:

On 19.08.22 05:38, Hugh Dickins wrote:

On Fri, 19 Aug 2022, Sean Christopherson wrote:

On Thu, Aug 18, 2022, Kirill A . Shutemov wrote:

On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:

On Wed, 6 Jul 2022, Chao Peng wrote:
But since then, TDX in particular has forced an effort into preventing
(by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.

Are any of the shmem.c mods useful to existing users of shmem.c? No.
Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.


But QEMU and other VMMs are users of shmem and memfd.  The new features 
certainly
aren't useful for _all_ existing users, but I don't think it's fair to say that
they're not useful for _any_ existing users.


Okay, I stand corrected: there exist some users of memfd_create()
who will also have use for "INACCESSIBLE" memory.


As raised in reply to the relevant patch, I'm not sure if we really have
to/want to expose MFD_INACCESSIBLE to user space. I feel like this is a
requirement of specific memfd_notifer (memfile_notifier) implementations
-- such as TDX that will convert the memory and MCE-kill the machine on
ordinary write access. We might be able to set/enforce this when
registering a notifier internally instead, and fail notifier
registration if a condition isn't met (e.g., existing mmap).

So I'd be curious, which other users of shmem/memfd would benefit from
(MMU)-"INACCESSIBLE" memory obtained via memfd_create()?


I agree that there's no need to expose the inaccessible behavior via uAPI.  
Making
it a kernel-internal thing that's negotiated/resolved when KVM binds to the fd
would align INACCESSIBLE with the UNMOVABLE and UNRECLAIMABLE flags (and any 
other
flags that get added in the future).

AFAICT, the user-visible flag is a holdover from the early RFCs and doesn't 
provide
any unique functionality.


That's also what I'm thinking. And I don't see a problem immediately if 
the user has populated the fd at binding time. Actually that looks like an 
advantage for the previously discussed guest payload pre-loading.


I think this gets awkward. Trying to define sensible semantics for what 
happens if a shmem or similar fd gets used as secret guest memory and 
that fd isn't utterly and completely empty can get quite nasty.  For 
example:


If there are already mmaps, then TDX (much more so than SEV) really 
doesn't want to also use it as guest memory.


If there is already data in the fd, then maybe some technologies can use 
this for pre-population, but TDX needs explicit instructions in order to 
get the guest's hash right.



In general, it seems like it will be much more likely to actually work 
well if the user (uAPI) is required to declare to the kernel exactly 
what the fd is for (e.g. TDX secret memory, software-only secret memory, 
etc) before doing anything at all with it other than binding it to KVM.


INACCESSIBLE is a way to achieve this.  Maybe it's not the prettiest in 
the world -- I personally would rather see an explicit request for, say, 
TDX or SEV memory or maybe the memory that works for a particular KVM 
instance instead of something generic like INACCESSIBLE, but this is a 
pretty weak preference.  But I think that just starting with a plain 
memfd is a can of worms.




Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-09-08 Thread Andy Lutomirski

On 8/19/22 17:27, Kirill A. Shutemov wrote:

On Thu, Aug 18, 2022 at 08:00:41PM -0700, Hugh Dickins wrote:

On Thu, 18 Aug 2022, Kirill A . Shutemov wrote:

On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:


If your memory could be swapped, that would be enough of a good reason
to make use of shmem.c: but it cannot be swapped; and although there
are some references in the mailthreads to it perhaps being swappable
in future, I get the impression that will not happen soon if ever.

If your memory could be migrated, that would be some reason to use
filesystem page cache (because page migration happens to understand
that type of memory): but it cannot be migrated.


Migration support is in pipeline. It is part of TDX 1.5 [1]. And swapping
theoretically possible, but I'm not aware of any plans as of now.

[1] 
https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html


I always forget, migration means different things to different audiences.
As an mm person, I was meaning page migration, whereas a virtualization
person thinks VM live migration (which that reference appears to be about),
a scheduler person task migration, an ornithologist bird migration, etc.

But you're an mm person too: you may have cited that reference in the
knowledge that TDX 1.5 Live Migration will entail page migration of the
kind I'm thinking of.  (Anyway, it's not important to clarify that here.)


TDX 1.5 brings both.

In TDX speak, mm migration is called relocation. See TDH.MEM.PAGE.RELOCATE.



This seems to be a pretty bad fit for the way that the core mm migrates 
pages.  The core mm unmaps the page, then moves (in software) the 
contents to a new address, then faults it in.  TDH.MEM.PAGE.RELOCATE 
doesn't fit into that workflow very well.  I'm not saying it can't be 
done, but it won't just work.




Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-09-08 Thread Andy Lutomirski

On 8/18/22 06:24, Kirill A . Shutemov wrote:

On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:

On Wed, 6 Jul 2022, Chao Peng wrote:

This is the v7 of this series which tries to implement the fd-based KVM
guest private memory.


Here at last are my reluctant thoughts on this patchset.

fd-based approach for supporting KVM guest private memory: fine.

Use or abuse of memfd and shmem.c: mistaken.

memfd_create() was an excellent way to put together the initial prototype.

But since then, TDX in particular has forced an effort into preventing
(by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.

Are any of the shmem.c mods useful to existing users of shmem.c? No.
Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.

What use do you have for a filesystem here?  Almost none.
IIUC, what you want is an fd through which QEMU can allocate kernel
memory, selectively free that memory, and communicate fd+offset+length
to KVM.  And perhaps an interface to initialize a little of that memory
from a template (presumably copied from a real file on disk somewhere).

You don't need shmem.c or a filesystem for that!

If your memory could be swapped, that would be enough of a good reason
to make use of shmem.c: but it cannot be swapped; and although there
are some references in the mailthreads to it perhaps being swappable
in future, I get the impression that will not happen soon if ever.

If your memory could be migrated, that would be some reason to use
filesystem page cache (because page migration happens to understand
that type of memory): but it cannot be migrated.


Migration support is in pipeline. It is part of TDX 1.5 [1]. And swapping
theoretically possible, but I'm not aware of any plans as of now.

[1] 
https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html



This thing?

https://cdrdv2.intel.com/v1/dl/getContent/733578

That looks like migration between computers, not between NUMA nodes.  Or 
am I missing something?




Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-07-22 Thread Andy Lutomirski

On 7/21/22 14:19, Sean Christopherson wrote:

On Thu, Jul 21, 2022, Gupta, Pankaj wrote:





I view it as a performance problem because nothing stops KVM from copying from
userspace into the private fd during the SEV ioctl().  What's missing is the
ability for userspace to directly initialize the private fd, which may or may not
avoid an extra memcpy() depending on how clever userspace is.

Can you please elaborate more what you see as a performance problem? And
possible ways to solve it?


Oh, I'm not saying there actually _is_ a performance problem.  What I'm saying 
is that in-place encryption is not a functional requirement, which means it's 
purely an optimization, and thus we should only bother supporting in-place 
encryption _if_ it would solve a performance bottleneck.


Even if we end up having a performance problem, I think we need to 
understand the workloads that we want to optimize before getting too 
excited about designing a speedup.


In particular, there's (depending on the specific technology, perhaps, 
and also architecture) a possible tradeoff between trying to reduce 
copying and trying to reduce unmapping and the associated flushes.  If a 
user program maps an fd, populates it, and then converts it in place 
into private memory (especially if it doesn't do it in a single shot), 
then that memory needs to get unmapped both from the user mm and 
probably from the kernel direct map.  On the flip side, it's possible to 
imagine an ioctl that does copy-and-add-to-private-fd that uses a 
private mm and doesn't need any TLB IPIs.


All of this is to say that trying to optimize right now seems quite 
premature to me.




Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-07-13 Thread Andy Lutomirski



On Wed, Jul 13, 2022, at 3:35 AM, Gupta, Pankaj wrote:
 This is the v7 of this series which tries to implement the fd-based KVM
 guest private memory. The patches are based on latest kvm/queue branch
 commit:

 b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
 split_desc_cache only by default capacity

 Introduction
 
 In general this patch series introduce fd-based memslot which provides
 guest memory through memory file descriptor fd[offset,size] instead of
 hva/size. The fd can be created from a supported memory filesystem
 like tmpfs/hugetlbfs etc., which we refer to as the memory backing store. KVM
>>>
>>> Thinking a bit: as a host-side fd on tmpfs or shmem will store memory in the
>>> host page cache instead of mapping pages into the userspace address space, can
>>> we hit a double (un-coordinated) page cache problem with this when the guest
>>> page cache is also used?
>> 
>> This is my understanding: in host it will be indeed in page cache (in
>> current shmem implementation) but that's just the way it allocates and
>> provides the physical memory for the guest. In guest, guest OS will not
>> see this fd (absolutely), it only sees guest memory, on top of which it
>> can build its own page cache system for its own file-mapped content but
>> that is unrelated to host page cache.
>
> yes. If the guest fills its page cache with file-backed memory, this will 
> also quickly fill the host page cache on the shmem fd backend at the host 
> side. This can impact the performance of guest VMs if the host reaches 
> memory pressure sooner. Or else we end up utilizing way less system RAM.

Is this in any meaningful way different from a regular VM?

--Andy



Re: [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-06-14 Thread Andy Lutomirski
On Tue, Jun 14, 2022 at 12:09 PM Sean Christopherson  wrote:
>
> On Tue, Jun 14, 2022, Andy Lutomirski wrote:
> > On Tue, Jun 14, 2022 at 12:32 AM Chao Peng  
> > wrote:
> > >
> > > On Thu, Jun 09, 2022 at 08:29:06PM +, Sean Christopherson wrote:
> > > > On Wed, Jun 08, 2022, Vishal Annapurve wrote:
> > > >
> > > > One argument is that userspace can simply rely on cgroups to detect 
> > > > misbehaving
> > > > guests, but (a) those types of OOMs will be a nightmare to debug and 
> > > > (b) an OOM
> > > > kill from the host is typically considered a _host_ issue and will be 
> > > > treated as
> > > > a missed SLO.
> > > >
> > > > An idea for handling this in the kernel without too much complexity 
> > > > would be to
> > > > add F_SEAL_FAULT_ALLOCATIONS (terrible name) that would prevent page 
> > > > faults from
> > > > allocating pages, i.e. holes can only be filled by an explicit 
> > > > fallocate().  Minor
> > > > faults, e.g. due to NUMA balancing stupidity, and major faults due to 
> > > > swap would
> > > > still work, but writes to previously unreserved/unallocated memory 
> > > > would get a
> > > > SIGSEGV on something it has mapped.  That would allow the userspace VMM 
> > > > to prevent
> > > > unintentional allocations without having to coordinate 
> > > > unmapping/remapping across
> > > > multiple processes.
> > >
> > > Since this is mainly for shared memory and the motivation is catching
> > > misbehaved access, can we use mprotect(PROT_NONE) for this? We can mark
> > > those ranges backed by the private fd as PROT_NONE during the conversion so
> > > subsequent misbehaved accesses will be blocked instead of causing double
> > > allocation silently.
>
> PROT_NONE, a.k.a. mprotect(), has the same vma downsides as munmap().
>
> > This patch series is fairly close to implementing a rather more
> > efficient solution.  I'm not familiar enough with hypervisor userspace
> > to really know if this would work, but:
> >
> > What if shared guest memory could also be file-backed, either in the
> > same fd or with a second fd covering the shared portion of a memslot?
> > This would allow changes to the backing store (punching holes, etc) to
> > be done without mmap_lock or host-userspace TLB flushes?  Depending on
> > what the guest is doing with its shared memory, userspace might need
> > the memory mapped or it might not.
>
> That's what I'm angling for with the F_SEAL_FAULT_ALLOCATIONS idea.  The 
> issue,
> unless I'm misreading code, is that punching a hole in the shared memory 
> backing
> store doesn't prevent reallocating that hole on fault, i.e. a helper process 
> that
> keeps a valid mapping of guest shared memory can silently fill the hole.
>
> What we're hoping to achieve is a way to prevent allocating memory without a 
> very
> explicit action from userspace, e.g. fallocate().

Ah, I misunderstood.  I thought your goal was to mmap it and prevent
page faults from allocating.

It is indeed the case (and has been since before quite a few of us
were born) that a hole in a sparse file is logically just a bunch of
zeros.  A way to make a file for which a hole is an actual hole seems
like it would solve this problem nicely.  It could also be solved more
specifically for KVM by making sure that the private/shared mode that
userspace programs is strict enough to prevent accidental allocations
-- if a GPA is definitively private, shared, neither, or (potentially,
on TDX only) both, then a page that *isn't* shared will never be
accidentally allocated by KVM.  If the shared backing is not mmapped,
it also won't be accidentally allocated by host userspace on a stray
or careless write.


--Andy



Re: [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-06-14 Thread Andy Lutomirski
On Tue, Jun 14, 2022 at 12:32 AM Chao Peng  wrote:
>
> On Thu, Jun 09, 2022 at 08:29:06PM +, Sean Christopherson wrote:
> > On Wed, Jun 08, 2022, Vishal Annapurve wrote:
> >
> > One argument is that userspace can simply rely on cgroups to detect 
> > misbehaving
> > guests, but (a) those types of OOMs will be a nightmare to debug and (b) an 
> > OOM
> > kill from the host is typically considered a _host_ issue and will be 
> > treated as
> > a missed SLO.
> >
> > An idea for handling this in the kernel without too much complexity would 
> > be to
> > add F_SEAL_FAULT_ALLOCATIONS (terrible name) that would prevent page faults 
> > from
> > allocating pages, i.e. holes can only be filled by an explicit fallocate(). 
> >  Minor
> > faults, e.g. due to NUMA balancing stupidity, and major faults due to swap 
> > would
> > still work, but writes to previously unreserved/unallocated memory would 
> > get a
> > SIGSEGV on something it has mapped.  That would allow the userspace VMM to 
> > prevent
> > unintentional allocations without having to coordinate unmapping/remapping 
> > across
> > multiple processes.
>
> Since this is mainly for shared memory and the motivation is catching
> misbehaved access, can we use mprotect(PROT_NONE) for this? We can mark
> those ranges backed by the private fd as PROT_NONE during the conversion so
> subsequent misbehaved accesses will be blocked instead of causing double
> allocation silently.

This patch series is fairly close to implementing a rather more
efficient solution.  I'm not familiar enough with hypervisor userspace
to really know if this would work, but:

What if shared guest memory could also be file-backed, either in the
same fd or with a second fd covering the shared portion of a memslot?
This would allow changes to the backing store (punching holes, etc) to
be done without mmap_lock or host-userspace TLB flushes?  Depending on
what the guest is doing with its shared memory, userspace might need
the memory mapped or it might not.

--Andy



Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-06-10 Thread Andy Lutomirski
On Mon, Apr 25, 2022 at 1:31 PM Sean Christopherson  wrote:
>
> On Mon, Apr 25, 2022, Andy Lutomirski wrote:
> >
> >
> > On Mon, Apr 25, 2022, at 6:40 AM, Chao Peng wrote:
> > > On Sun, Apr 24, 2022 at 09:59:37AM -0700, Andy Lutomirski wrote:
> > >>
> >
> > >>
> > >> 2. Bind the memfile to a VM (or at least to a VM technology).  Now it's 
> > >> in
> > >> the initial state appropriate for that VM.
> > >>
> > >> For TDX, this completely bypasses the cases where the data is 
> > >> prepopulated
> > >> and TDX can't handle it cleanly.
>
> I believe TDX can handle this cleanly, TDH.MEM.PAGE.ADD doesn't require that 
> the
> source and destination have different HPAs.  There's just no pressing need to
> support such behavior because userspace is highly motivated to keep the 
> initial
> image small for performance reasons, i.e. burning a few extra pages while 
> building
> the guest is a non-issue.

Following up on this, rather belatedly.  After re-reading the docs,
TDX can populate guest memory using TDH.MEM.PAGE.ADD, but see Intel®
TDX Module Base Spec v1.5, section 2.3, step D.4 substeps 1 and 2
here:

https://www.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-module-1.5-base-spec-348549001.pdf

For each TD page:

1. The host VMM specifies a TDR as a parameter and calls the
TDH.MEM.PAGE.ADD function. It copies the contents from the TD
image page into the target TD page which is encrypted with the TD
ephemeral key. TDH.MEM.PAGE.ADD also extends the TD
measurement with the page GPA.

2. The host VMM extends the TD measurement with the contents of
the new page by calling the TDH.MR.EXTEND function on each 256-
byte chunk of the new TD page.

So this is a bit like SGX.  There is a specific series of operations
that have to be done in precisely the right order to reproduce the
intended TD measurement.  Otherwise the guest will boot and run until
it tries to get a report and then it will have a hard time getting
anyone to believe its report.

So I don't think the host kernel can get away with host userspace just
providing pre-populated memory.  Userspace needs to tell the host
kernel exactly what sequence of adds, extends, etc to perform and in
what order, and the host kernel needs to do precisely what userspace
asks it to do.  "Here's the contents of memory" doesn't cut it unless
the tooling that builds the guest image matches the exact semantics
that the host kernel provides.

--Andy



Re: [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory

2022-05-21 Thread Andy Lutomirski



On Fri, May 20, 2022, at 11:31 AM, Sean Christopherson wrote:

> But a dedicated KVM ioctl() to add/remove shared ranges would be easy 
> to implement
> and wouldn't necessarily even need to interact with the memslots.  It 
> could be a
> consumer of memslots, e.g. if we wanted to disallow registering regions 
> without an
> associated memslot, but I think we'd want to avoid even that because 
> things will
> get messy during memslot updates, e.g. if dirty logging is toggled or a 
> shared
> memory region is temporarily removed then we wouldn't want to destroy 
> the tracking.
>
> I don't think we'd want to use a bitmap, e.g. for a well-behaved guest, XArray
> should be far more efficient.
>
> One benefit to explicitly tracking this in KVM is that it might be 
> useful for
> software-only protected VMs, e.g. KVM could mark a region in the XArray 
> as "pending"
> based on guest hypercalls to share/unshare memory, and then complete 
> the transaction
> when userspace invokes the ioctl() to complete the share/unshare.

That makes sense.

If KVM goes this route, perhaps there the allowed states for a GPA should 
include private, shared, and also private-and-shared.  Then anyone who wanted 
to use the same masked GPA for shared and private on TDX could do so if they 
wanted to.



Re: [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory

2022-05-20 Thread Andy Lutomirski

On 5/19/22 08:37, Chao Peng wrote:

Extend the memslot definition to provide guest private memory through a
file descriptor(fd) instead of userspace_addr(hva). Such guest private
memory(fd) may never be mapped into userspace so no userspace_addr(hva)
can be used. Instead add another two new fields
(private_fd/private_offset), plus the existing memory_size to represent
the private memory range. Such memslot can still have the existing
userspace_addr(hva). When in use, a single memslot can maintain both
private memory through the private fd(private_fd/private_offset) and shared
memory through the hva(userspace_addr). A GPA is considered private by KVM
if the memslot has a private fd and the corresponding page in the private
fd is populated; otherwise, it's shared.




So this is a strange API and, IMO, a layering violation.  I want to make 
sure that we're all actually on board with making this a permanent part 
of the Linux API.  Specifically, we end up with a multiplexing situation 
as you have described. For a given GPA, there are *two* possible host 
backings: an fd-backed one (from the fd, which is private for now but 
might end up being shared depending on future extensions) and a 
VMA-backed one.  The selection of which one backs the address is made 
internally by whatever backs the fd.


This is, IMO, a clear layering violation.  Normally, an fd has an 
associated address space, and pages in that address space can have 
contents, can be holes that appear to contain all zeros, or could have 
holes that are inaccessible.  If you try to access a hole, you get 
whatever is in the hole.


But now, with this patchset, the fd is more of an overlay and you get 
*something else* if you try to access through the hole.


This results in operations on the fd bubbling up to the KVM mapping in 
what is, IMO, a strange way.  If the user punches a hole, KVM has to 
modify its mappings such that the GPA goes to whatever VMA may be there. 
 (And update the RMP, the hypervisor's tables, or whatever else might 
actually control privateness.)  Conversely, if the user does fallocate 
to fill a hole, the guest mapping *to an unrelated page* has to be 
zapped so that the fd's page shows up.  And the RMP needs updating, etc.


I am lukewarm on this for a few reasons.

1. This is weird.  AFAIK nothing else works like this.  Obviously this 
is subjective, but "weird" and "layering violation" sometimes translate 
to "problematic locking".


2. fd-backed private memory can't have normal holes.  If I make a memfd, 
punch a hole in it, and mmap(MAP_SHARED) it, I end up with a page that 
reads as zero.  If I write to it, the page gets allocated.  But with 
this new mechanism, if I punch a hole and put it in a memslot, reads and 
writes go somewhere else.  So what if I actually wanted lazily allocated 
private zeros?


2b. For a hypothetical future extension in which an fd can also have 
shared pages (for conversion, for example, or simply because the fd 
backing might actually be more efficient than indirecting through VMAs 
and therefore get used for shared memory or entirely-non-confidential 
VMs), lazy fd-backed zeros sound genuinely useful.


3. TDX hardware capability is not fully exposed.  TDX can have a private 
page and a shared page at GPAs that differ only by the private bit. 
Sure, no one plans to use this today, but baking this into the user ABI 
throws away half the potential address space.


3b. Any software solution that works like TDX (which IMO seems like an 
eminently reasonable design to me) has the same issue.



The alternative would be to have some kind of separate table or bitmap 
(part of the memslot?) that tells KVM whether a GPA should map to the fd.


What do you all think?



Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-25 Thread Andy Lutomirski



On Mon, Apr 25, 2022, at 6:40 AM, Chao Peng wrote:
> On Sun, Apr 24, 2022 at 09:59:37AM -0700, Andy Lutomirski wrote:
>> 

>> 
>> 2. Bind the memfile to a VM (or at least to a VM technology).  Now it's in 
>> the initial state appropriate for that VM.
>> 
>> For TDX, this completely bypasses the cases where the data is prepopulated 
>> and TDX can't handle it cleanly.  For SEV, it bypasses a situation in which 
>> data might be written to the memory before we find out whether that data 
>> will be unreclaimable or unmovable.
>
> This sounds like a stricter rule to avoid unclear semantics.
>
> So userspace needs to know what exactly happens for a 'bind' operation.
> This is different when binding to different technologies. E.g. for SEV, it
> may imply after this call, the memfile can be accessed (through mmap or
> whatever) from userspace, while for current TDX this should not be allowed.

I think this is actually a good thing.  While SEV, TDX, pKVM, etc achieve 
similar goals and have broadly similar ways of achieving them, they really are 
different, and having userspace be aware of the differences seems okay to me.

(Although I don't think that allowing userspace to mmap SEV shared pages is 
particularly wise -- it will result in faults or cache incoherence depending on 
the variant of SEV in use.)

>
> And I feel we still need a third flow/operation to indicate the
> completion of the initialization on the memfile before the guest's 
> first-time launch. SEV needs to check previous mmap-ed areas are munmap-ed
> and prevent future userspace access. After this point, the memfile
> becomes a truly private fd.

Even that is technology-dependent.  For TDX, this operation doesn't really 
exist.  For SEV, I'm not sure (I haven't read the specs in nearly enough 
detail).  For pKVM, I guess it does exist and isn't quite the same as a 
shared->private conversion.

Maybe this could be generalized a bit as an operation "measure and make 
private" that would be supported by the technologies for which it's useful.


>
>> 
>> 
>> --
>> 
>> Now I have a question, since I don't think anyone has really answered it: 
>> how does this all work with SEV- or pKVM-like technologies in which private 
>> and shared pages share the same address space?  It sounds like you're 
>> proposing to have a big memfile that contains private and shared pages and 
>> to use that same memfile as pages are converted back and forth.  IO and even 
>> real physical DMA could be done on that memfile.  Am I understanding 
>> correctly?
>
> For TDX case, and probably SEV as well, this memfile contains private memory
> only. But this design at least makes it possible for usage cases like
> pKVM which wants both private/shared memory in the same memfile and rely
> on other ways like mmap/munmap or mprotect to toggle private/shared instead
> of fallocate/hole punching.

Hmm.  Then we still need some way to get KVM to generate the correct SEV 
pagetables.  For TDX, there are private memslots and shared memslots, and they 
can overlap.  If they overlap and both contain valid pages at the same address, 
then the results may not be what the guest-side ABI expects, but everything 
will work.  So, when a single logical guest page transitions between shared and 
private, no change to the memslots is needed.  For SEV, this is not the case: 
everything is in one set of pagetables, and there isn't a natural way to 
resolve overlaps.

If the memslot code becomes efficient enough, then the memslots could be 
fragmented.  Or the memfile could support private and shared data in the same 
memslot.  And if pKVM does this, I don't see why SEV couldn't also do it and 
hopefully reuse the same code.

>
>> 
>> If so, I think this makes sense, but I'm wondering if the actual memslot 
>> setup should be different.  For TDX, private memory lives in a logically 
>> separate memslot space.  For SEV and pKVM, it doesn't.  I assume the API can 
>> reflect this straightforwardly.
>
> I believe so. The flow should be similar but we do need pass different
> flags during the 'bind' to the backing store for different usages. That
> should be some new flags for pKVM but the callbacks (API here) between
> memfile_notifile and its consumers can be reused.

And also some different flag in the operation that installs the fd as a memslot?

>
>> 
>> And the corresponding TDX question: is the intent still that shared pages 
>> aren't allowed at all in a TDX memfile?  If so, that would be the most 
>> direct mapping to what the hardware actually does.
>
> Exactly. TDX will still use fallocate/hole punching to turn on/off the
> private page. Once off, the tra

Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-24 Thread Andy Lutomirski



On Fri, Apr 22, 2022, at 3:56 AM, Chao Peng wrote:
> On Tue, Apr 05, 2022 at 06:03:21PM +, Sean Christopherson wrote:
>> On Tue, Apr 05, 2022, Quentin Perret wrote:
>> > On Monday 04 Apr 2022 at 15:04:17 (-0700), Andy Lutomirski wrote:
> Only when the registration succeeds is the fd
> converted into a private fd; before that, the fd is just a normal (shared)
> one. During this conversion, the previous data is preserved so you can put
> some initial data in guest pages (whether the architecture allows this is
> architecture-specific and out of the scope of this patch).

I think this can be made to work, but it will be awkward.  On TDX, for example, 
what exactly are the semantics supposed to be?  An error code if the memory 
isn't all zero?  An error code if it has ever been written?

Fundamentally, I think this is because your proposed lifecycle for these 
memfiles results in a lightweight API but is awkward for the intended use 
cases.  You're proposing, roughly:

1. Create a memfile. 

Now it's in a shared state with an unknown virt technology.  It can be read and 
written.  Let's call this state BRAND_NEW.

2. Bind to a VM.

Now it's in a bound state.  For TDX, for example, let's call the new state 
BOUND_TDX.  In this state, the TDX rules are followed (private memory can't be 
converted, etc).

The problem here is that the BRAND_NEW state allows things that are nonsensical 
in TDX, and the binding step needs to invent some kind of semantics for what 
happens when binding a nonempty memfile.


So I would propose a somewhat different order:

1. Create a memfile.  It's in the UNBOUND state and no operations whatsoever 
are allowed except binding or closing.

2. Bind the memfile to a VM (or at least to a VM technology).  Now it's in the 
initial state appropriate for that VM.

For TDX, this completely bypasses the cases where the data is prepopulated and 
TDX can't handle it cleanly.  For SEV, it bypasses a situation in which data 
might be written to the memory before we find out whether that data will be 
unreclaimable or unmovable.


--

Now I have a question, since I don't think anyone has really answered it: how 
does this all work with SEV- or pKVM-like technologies in which private and 
shared pages share the same address space?  It sounds like you're proposing to 
have a big memfile that contains private and shared pages and to use that same 
memfile as pages are converted back and forth.  IO and even real physical DMA 
could be done on that memfile.  Am I understanding correctly?

If so, I think this makes sense, but I'm wondering if the actual memslot setup 
should be different.  For TDX, private memory lives in a logically separate 
memslot space.  For SEV and pKVM, it doesn't.  I assume the API can reflect 
this straightforwardly.

And the corresponding TDX question: is the intent still that shared pages 
aren't allowed at all in a TDX memfile?  If so, that would be the most direct 
mapping to what the hardware actually does.

--Andy



Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK

2022-04-12 Thread Andy Lutomirski
On Tue, Apr 12, 2022, at 7:36 AM, Jason Gunthorpe wrote:
> On Fri, Apr 08, 2022 at 08:54:02PM +0200, David Hildenbrand wrote:
>
>> RLIMIT_MEMLOCK was the obvious candidate, but as we discovered in the
>> past already with secretmem, it's not 100% that good of a fit (unmovable
>> is worse than mlocked). But it gets the job done for now at least.
>
> No, it doesn't. There are too many different interpretations how
> MEMLOCK is supposed to work
>
> eg VFIO accounts per-process so hostile users can just fork to go past
> it.
>
> RDMA is per-process but uses a different counter, so you can double up
>
> io_uring is per-user and uses a 3rd counter, so it can triple up on
> the above two
>
>> So I'm open for alternative to limit the amount of unmovable memory we
>> might allocate for user space, and then we could convert secretmem as well.
>
> I think it has to be cgroup based considering where we are now :\
>

So this is another situation where the actual backend (TDX, SEV, pKVM, pure 
software) makes a difference -- depending on exactly what backend we're using, 
the memory may not be unmoveable.  It might even be swappable (in the 
potentially distant future).

Anyway, here's a concrete proposal, with a bit of handwaving:

We add new cgroup limits:

memory.unmoveable
memory.locked

These can be set to an actual number or they can be set to the special value 
ROOT_CAP.  If they're set to ROOT_CAP, then anyone in the cgroup with 
capable(CAP_SYS_RESOURCE) (i.e. the global capability) can allocate unmovable or 
locked memory with this (and potentially other) new APIs.  If it's 0, then they 
can't.  If it's another value, then the memory can be allocated, charged to the 
cgroup, up to the limit, with no particular capability needed.  The default at 
boot is ROOT_CAP.  Anyone who wants to configure it differently is free to do 
so.  This avoids introducing a DoS, makes it easy to run tests without 
configuring cgroup, and lets serious users set up their cgroups.

Nothing is charged per mm.

To make this fully sensible, we need to know what the backend is for the 
private memory before allocating any so that we can charge it accordingly.



Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK

2022-04-07 Thread Andy Lutomirski



On Thu, Apr 7, 2022, at 9:05 AM, Sean Christopherson wrote:
> On Thu, Mar 10, 2022, Chao Peng wrote:
>> Since page migration / swapping is not supported yet, MFD_INACCESSIBLE
>> memory behave like longterm pinned pages and thus should be accounted to
>> mm->pinned_vm and be restricted by RLIMIT_MEMLOCK.
>> 
>> Signed-off-by: Chao Peng 
>> ---
>>  mm/shmem.c | 25 -
>>  1 file changed, 24 insertions(+), 1 deletion(-)
>> 
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 7b43e274c9a2..ae46fb96494b 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -915,14 +915,17 @@ static void notify_fallocate(struct inode *inode, 
>> pgoff_t start, pgoff_t end)
>>  static void notify_invalidate_page(struct inode *inode, struct folio *folio,
>> pgoff_t start, pgoff_t end)
>>  {
>> -#ifdef CONFIG_MEMFILE_NOTIFIER
>>  struct shmem_inode_info *info = SHMEM_I(inode);
>>  
>> +#ifdef CONFIG_MEMFILE_NOTIFIER
>>  start = max(start, folio->index);
>>  end = min(end, folio->index + folio_nr_pages(folio));
>>  
>>  memfile_notifier_invalidate(&info->memfile_notifiers, start, end);
>>  #endif
>> +
>> +	if (info->xflags & SHM_F_INACCESSIBLE)
>> +		atomic64_sub(end - start, &current->mm->pinned_vm);
>
> As Vishal's to-be-posted selftest discovered, this is broken as 
> current->mm may
> be NULL.  Or it may be a completely different mm, e.g. AFAICT there's 
> nothing that
> prevents a different process from punching hole in the shmem backing.
>

How about just not charging the mm in the first place?  There’s precedent: 
ramfs and hugetlbfs (at least sometimes — I’ve lost track of the current 
status).

In any case, for an administrator to try to assemble the various rlimits into a 
coherent policy is, and always has been, quite messy. ISTM cgroup limits, which 
can actually add across processes usefully, are much better.

So, aside from the fact that these fds aren’t in a filesystem and are thus 
available by default, I’m not convinced that this accounting is useful or 
necessary.

Maybe we could just have some switch required to enable creation of private 
memory in the first place, and anyone who flips that switch without configuring 
cgroups is subject to DoS.



Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-06 Thread Andy Lutomirski



On Tue, Apr 5, 2022, at 11:30 AM, Sean Christopherson wrote:
> On Tue, Apr 05, 2022, Andy Lutomirski wrote:

>
>> resume guest
>> *** host -> hypervisor -> guest ***
>> Guest unshares the page.
>> *** guest -> hypervisor ***
>> Hypervisor removes PTE.  TLBI.
>> *** hypervisor -> guest ***
>> 
>> Obviously considerable cleverness is needed to make a virt IOMMU like this
>> work well, but still.
>> 
>> Anyway, my suggestion is that the fd backing proposal get slightly modified
>> to get it ready for multiple subtypes of backing object, which should be a
>> pretty minimal change.  Then, if someone actually needs any of this
>> cleverness, it can be added later.  In the mean time, the
>> pread()/pwrite()/splice() scheme is pretty good.
>
> Tangentially related to getting private-fd ready for multiple things, 
> what about
> implementing the pread()/pwrite()/splice() scheme in pKVM itself?  I.e. 
> read() on
> the VM fd, with the offset corresponding to gfn in some way.
>

Hmm, could make sense.



Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-05 Thread Andy Lutomirski



On Tue, Apr 5, 2022, at 3:36 AM, Quentin Perret wrote:
> On Monday 04 Apr 2022 at 15:04:17 (-0700), Andy Lutomirski wrote:
>> 
>> 
>> On Mon, Apr 4, 2022, at 10:06 AM, Sean Christopherson wrote:
>> > On Mon, Apr 04, 2022, Quentin Perret wrote:
>> >> On Friday 01 Apr 2022 at 12:56:50 (-0700), Andy Lutomirski wrote:
>> >> FWIW, there are a couple of reasons why I'd like to have in-place
>> >> conversions:
>> >> 
>> >>  - one goal of pKVM is to migrate some things away from the Arm
>> >>Trustzone environment (e.g. DRM and the likes) and into protected VMs
>> >>instead. This will give Linux a fighting chance to defend itself
>> >>against these things -- they currently have access to _all_ memory.
>> >>And transitioning pages between Linux and Trustzone (donations and
>> >>shares) is fast and non-destructive, so we really do not want pKVM to
>> >>regress by requiring the hypervisor to memcpy things;
>> >
>> > Is there actually a _need_ for the conversion to be non-destructive?  
>> > E.g. I assume
>> > the "trusted" side of things will need to be reworked to run as a pKVM 
>> > guest, at
>> > which point reworking its logic to understand that conversions are 
>> > destructive and
>> > slow-ish doesn't seem too onerous.
>> >
>> >>  - it can be very useful for protected VMs to do shared=>private
>> >>conversions. Think of a VM receiving some data from the host in a
>> >>shared buffer, and then it wants to operate on that buffer without
>> >>risking to leak confidential informations in a transient state. In
>> >>that case the most logical thing to do is to convert the buffer back
>> >>to private, do whatever needs to be done on that buffer (decrypting a
>> >>frame, ...), and then share it back with the host to consume it;
>> >
>> > If performance is a motivation, why would the guest want to do two 
>> > conversions
>> > instead of just doing internal memcpy() to/from a private page?  I 
>> > would be quite
>> > surprised if multiple exits and TLB shootdowns is actually faster, 
>> > especially at
>> > any kind of scale where zapping stage-2 PTEs will cause lock contention 
>> > and IPIs.
>> 
>> I don't know the numbers or all the details, but this is arm64, which is a 
>> rather better architecture than x86 in this regard.  So maybe it's not so 
>> bad, at least in very simple cases, ignoring all implementation details.  
>> (But see below.)  Also the systems in question tend to have fewer CPUs than 
>> some of the massive x86 systems out there.
>
> Yep. I can try and do some measurements if that's really necessary, but
> I'm really convinced the cost of the TLBI for the shared->private
> conversion is going to be significantly smaller than the cost of memcpy
> the buffer twice in the guest for us. To be fair, although the cost for
> the CPU update is going to be low, the cost for IOMMU updates _might_ be
> higher, but that very much depends on the hardware. On systems that use
> e.g. the Arm SMMU, the IOMMUs can use the CPU page-tables directly, and
> the iotlb invalidation is done on the back of the CPU invalidation. So,
> on systems with sane hardware the overhead is *really* quite small.
>
> Also, memcpy requires double the memory, it is pretty bad for power, and
> it causes memory traffic which can't be a good thing for things running
> concurrently.
>
>> If we actually wanted to support transitioning the same page between shared 
>> and private, though, we have a bit of an awkward situation.  Private to 
>> shared is conceptually easy -- do some bookkeeping, reconstitute the direct 
>> map entry, and it's done.  The other direction is a mess: all existing uses 
>> of the page need to be torn down.  If the page has been recently used for 
>> DMA, this includes IOMMU entries.
>>
>> Quentin: let's ignore any API issues for now.  Do you have a concept of how 
>> a nondestructive shared -> private transition could work well, even in 
>> principle?
>
> I had a high level idea for the workflow, but I haven't looked into the
> implementation details.
>
> The idea would be to allow KVM *or* userspace to take a reference
> to a page in the fd in an exclusive manner. KVM could take a reference
> on a page (which would be necessary before to donating it to a guest)
> using some kind of memfile_notifier as proposed in this series, and
> userspace could do the same some other way (mmap

Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-04 Thread Andy Lutomirski



On Mon, Apr 4, 2022, at 10:06 AM, Sean Christopherson wrote:
> On Mon, Apr 04, 2022, Quentin Perret wrote:
>> On Friday 01 Apr 2022 at 12:56:50 (-0700), Andy Lutomirski wrote:
>> FWIW, there are a couple of reasons why I'd like to have in-place
>> conversions:
>> 
>>  - one goal of pKVM is to migrate some things away from the Arm
>>Trustzone environment (e.g. DRM and the likes) and into protected VMs
>>instead. This will give Linux a fighting chance to defend itself
>>against these things -- they currently have access to _all_ memory.
>>And transitioning pages between Linux and Trustzone (donations and
>>shares) is fast and non-destructive, so we really do not want pKVM to
>>regress by requiring the hypervisor to memcpy things;
>
> Is there actually a _need_ for the conversion to be non-destructive?  
> E.g. I assume
> the "trusted" side of things will need to be reworked to run as a pKVM 
> guest, at
> which point reworking its logic to understand that conversions are 
> destructive and
> slow-ish doesn't seem too onerous.
>
>>  - it can be very useful for protected VMs to do shared=>private
>>conversions. Think of a VM receiving some data from the host in a
>>shared buffer, and then it wants to operate on that buffer without
>>risking to leak confidential informations in a transient state. In
>>that case the most logical thing to do is to convert the buffer back
>>to private, do whatever needs to be done on that buffer (decrypting a
>>frame, ...), and then share it back with the host to consume it;
>
> If performance is a motivation, why would the guest want to do two 
> conversions
> instead of just doing internal memcpy() to/from a private page?  I 
> would be quite
> surprised if multiple exits and TLB shootdowns is actually faster, 
> especially at
> any kind of scale where zapping stage-2 PTEs will cause lock contention 
> and IPIs.

I don't know the numbers or all the details, but this is arm64, which is a 
rather better architecture than x86 in this regard.  So maybe it's not so bad, 
at least in very simple cases, ignoring all implementation details.  (But see 
below.)  Also the systems in question tend to have fewer CPUs than some of the 
massive x86 systems out there.

If we actually wanted to support transitioning the same page between shared and 
private, though, we have a bit of an awkward situation.  Private to shared is 
conceptually easy -- do some bookkeeping, reconstitute the direct map entry, 
and it's done.  The other direction is a mess: all existing uses of the page 
need to be torn down.  If the page has been recently used for DMA, this 
includes IOMMU entries.

Quentin: let's ignore any API issues for now.  Do you have a concept of how a 
nondestructive shared -> private transition could work well, even in principle? 
 The best I can come up with is a special type of shared page that is not 
GUP-able and maybe not even mmappable, having a clear option for transitions to 
fail, and generally preventing the nasty cases from happening in the first 
place.

Maybe there could be a special mode for the private memory fds in which 
specific pages are marked as "managed by this fd but actually shared".  pread() 
and pwrite() would work on those pages, but not mmap().  (Or maybe mmap() but 
the resulting mappings would not permit GUP.)  And transitioning them would be 
a special operation on the fd that is specific to pKVM and wouldn't work on TDX 
or SEV.

Hmm.  Sean and Chao, are we making a bit of a mistake by making these fds 
technology-agnostic?  That is, would we want to distinguish between a TDX 
backing fd, a SEV backing fd, a software-based backing fd, etc?  API-wise this 
could work by requiring the fd to be bound to a KVM VM instance and possibly 
even configured a bit before any other operations would be allowed.

(Destructive transitions nicely avoid all the nasty cases.  If something is 
still pinning a shared page when it's "transitioned" to private (really just 
replaced with a new page), then the old page continues existing for as long as 
needed as a separate object.)



Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-01 Thread Andy Lutomirski
On Fri, Apr 1, 2022, at 7:59 AM, Quentin Perret wrote:
> On Thursday 31 Mar 2022 at 09:04:56 (-0700), Andy Lutomirski wrote:


> To answer your original question about memory 'conversion', the key
> thing is that the pKVM hypervisor controls the stage-2 page-tables for
> everyone in the system, all guests as well as the host. As such, a page
> 'conversion' is nothing more than a permission change in the relevant
> page-tables.
>

So I can see two different ways to approach this.

One is that you split the whole address space in half and, just like SEV and 
TDX, allocate one bit to indicate the shared/private status of a page.  This 
makes it work a lot like SEV and TDX.

The other is to have shared and private pages be distinguished only by their 
hypercall history and the (protected) page tables.  This saves some address 
space and some page table allocations, but it opens some cans of worms too.  In 
particular, the guest and the hypervisor need to coordinate, in a way that the 
guest can trust, to ensure that the guest's idea of which pages are private 
match the host's.  This model seems a bit harder to support nicely with the 
private memory fd model, but not necessarily impossible.

Also, what are you trying to accomplish by having the host userspace mmap 
private pages?  Is the idea that multiple guests could share the same page until 
such time as one of them tries to write to it?  That would be kind of like 
having a third kind of memory that's visible to host and guests but is 
read-only for everyone.  TDX and SEV can't support this at all (a private page 
belongs to one guest and one guest only, at least in SEV and in the current TDX 
SEAM spec).  I imagine that this could be supported with private memory fds 
with some care without mmap, though -- the host could still populate the page 
with memcpy.  Or I suppose a memslot could support using MAP_PRIVATE fds and 
have approximately the right semantics.

--Andy





Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-03-31 Thread Andy Lutomirski
On Wed, Mar 30, 2022, at 10:58 AM, Sean Christopherson wrote:
> On Wed, Mar 30, 2022, Quentin Perret wrote:
>> On Wednesday 30 Mar 2022 at 09:58:27 (+0100), Steven Price wrote:
>> > On 29/03/2022 18:01, Quentin Perret wrote:
>> > > Is implicit sharing a thing? E.g., if a guest makes a memory access in
>> > > the shared gpa range at an address that doesn't have a backing memslot,
>> > > will KVM check whether there is a corresponding private memslot at the
>> > > right offset with a hole punched and report a KVM_EXIT_MEMORY_ERROR? Or
>> > > would that just generate an MMIO exit as usual?
>> > 
>> > My understanding is that the guest needs some way of tagging whether a
>> > page is expected to be shared or private. On the architectures I'm aware
>> > of this is done by effectively stealing a bit from the IPA space and
>> > pretending it's a flag bit.
>> 
>> Right, and that is in fact the main point of divergence we have I think.
>> While I understand this might be necessary for TDX and the likes, this
>> makes little sense for pKVM. This would effectively embed into the IPA a
>> purely software-defined non-architectural property/protocol although we
>> don't actually need to: we (pKVM) can reasonably expect the guest to
>> explicitly issue hypercalls to share pages in-place. So I'd be really
>> keen to avoid baking in assumptions about that model too deep in the
>> host mm bits if at all possible.
>
> There is no assumption about stealing PA bits baked into this API.  Even 
> within
> x86 KVM, I consider it a hard requirement that the common flows not assume the
> private vs. shared information is communicated through the PA.

Quentin, I think we might need a clarification.  The API in this patchset 
indeed has no requirement that a PA bit distinguish between private and shared, 
but I think it makes at least a weak assumption that *something*, a priori, 
distinguishes them.  In particular, there are private memslots and shared 
memslots, so the logical flow of resolving a guest memory access looks like:

1. guest accesses a GVA

2. read guest paging structures

3. determine whether this is a shared or private access

4. read host (KVM memslots and anything else, EPT, NPT, RMP, etc) structures 
accordingly.  In particular, the memslot to reference is different depending on 
the access type.

For TDX, this maps on to the fd-based model perfectly: the host-side paging 
structures for the shared and private slots are completely separate.  For SEV, 
the structures are shared and KVM will need to figure out what to do in case a 
private and shared memslot overlap.  Presumably it's sufficient to declare that 
one of them wins, although actually determining which one is active for a given 
GPA may involve checking whether the backing store for a given page actually 
exists.

But I don't understand pKVM well enough to understand how it fits in.  Quentin, 
how is the shared vs private mode of a memory access determined?  How do the 
paging structures work?  Can a guest switch between shared and private by 
issuing a hypercall without changing any guest-side paging structures or 
anything else?

It's plausible that SEV and (maybe) pKVM would be better served if memslots 
could be sparse or if there was otherwise a direct way for host userspace to 
indicate to KVM which address ranges are actually active (not hole-punched) in 
a given memslot or to otherwise be able to make a rule that two different 
memslots (one shared and one private) can't claim the same address.



Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-03-28 Thread Andy Lutomirski
On Thu, Mar 10, 2022 at 6:09 AM Chao Peng  wrote:
>
> This is the v5 of this series which tries to implement the fd-based KVM
> guest private memory. The patches are based on latest kvm/queue branch
> commit:
>
>   d5089416b7fb KVM: x86: Introduce KVM_CAP_DISABLE_QUIRKS2

Can this series be run and a VM booted without TDX?  A feature like
that might help push it forward.

--Andy



Re: [PATCH v4 01/12] mm/shmem: Introduce F_SEAL_INACCESSIBLE

2022-03-04 Thread Andy Lutomirski

On 2/23/22 04:05, Steven Price wrote:

On 23/02/2022 11:49, Chao Peng wrote:

On Thu, Feb 17, 2022 at 11:09:35AM -0800, Andy Lutomirski wrote:

On Thu, Feb 17, 2022, at 5:06 AM, Chao Peng wrote:

On Fri, Feb 11, 2022 at 03:33:35PM -0800, Andy Lutomirski wrote:

On 1/18/22 05:21, Chao Peng wrote:

From: "Kirill A. Shutemov" 

Introduce a new seal F_SEAL_INACCESSIBLE indicating the content of
the file is inaccessible from userspace through ordinary MMU access
(e.g., read/write/mmap). However, the file content can be accessed
via a different mechanism (e.g. KVM MMU) indirectly.

It provides semantics required for KVM guest private memory support
that a file descriptor with this seal set is going to be used as the
source of guest memory in confidential computing environments such
as Intel TDX/AMD SEV but may not be accessible from host userspace.

At this time only shmem implements this seal.



I don't dislike this *that* much, but I do dislike this. F_SEAL_INACCESSIBLE
essentially transmutes a memfd into a different type of object.  While this
can apparently be done successfully and without races (as in this code),
it's at least awkward.  I think that either creating a special inaccessible
memfd should be a single operation that create the correct type of object or
there should be a clear justification for why it's a two-step process.


Now one justification may be from Steven's comment on patch 00: for ARM
usage it can be used by creating a normal memfd, (partially) populating
it with initial guest memory content (e.g. firmware), and then
F_SEAL_INACCESSIBLE-ing it just before the first launch of the guest in
KVM (definitely the current code needs to be changed to support that).


Except we don't allow F_SEAL_INACCESSIBLE on a non-empty file, right?  So this 
won't work.


Hmm, right, if we set F_SEAL_INACCESSIBLE on a non-empty file, we will
need to make sure access to any existing mmap'ed areas is prevented,
but that is hard.



In any case, the whole confidential VM initialization story is a bit muddy.  
From the earlier emails, it sounds like ARM expects the host to fill in guest 
memory and measure it.  From my recollection of Intel's scheme (which may well 
be wrong, and I could easily be confusing it with SGX), TDX instead measures 
what is essentially a transcript of the series of operations that initializes 
the VM.  These are fundamentally not the same thing even if they accomplish the 
same end goal.  For TDX, we unavoidably need an operation (ioctl or similar) 
that initializes things according to the VM's instructions, and ARM ought to be 
able to use roughly the same mechanism.


Yes, TDX requires an ioctl. Steven may comment on the ARM part.


The Arm story is evolving so I can't give a definite answer yet. Our
current prototyping works by creating the initial VM content in a
memslot as with a normal VM and then calling an ioctl which throws the
big switch and converts all the (populated) pages to be protected. At
this point the RMM performs a measurement of the data that the VM is
being populated with.

The above (in our prototype) suffers from all the expected problems with
a malicious VMM being able to trick the host kernel into accessing those
pages after they have been protected (causing a fault detected by the
hardware).

The ideal (from our perspective) approach would be to follow the same
flow but where the VMM populates a memfd rather than normal anonymous
pages. The memfd could then be sealed and the pages converted to
protected ones (with the RMM measuring them in the process).

The question becomes how is that memfd populated? It would be nice if
that could be done using normal operations on a memfd (i.e. using
mmap()) and therefore this code could be (relatively) portable. This
would mean that any pages mapped from the memfd would either need to
block the sealing or be revoked at the time of sealing.

The other approach is we could of course implement a special ioctl which
effectively does a memcpy into the (created empty and sealed) memfd and
does the necessary dance with the RMM to measure the contents. This
would match the "transcript of the series of operations" described above
- but seems much less ideal from the viewpoint of the VMM.


A VMM that supports Other Vendors will need to understand this sort of 
model regardless.


I don't particularly mind the idea of having the kernel consume a normal 
memfd and spit out a new object, but I find the concept of changing the 
type of the object in place, even if it has other references, and trying 
to control all the resulting races to be somewhat alarming.


In pseudo-Rust, this is the difference between:

fn convert_to_private(in: &mut Memfd)

and

fn convert_to_private(in: Memfd) -> PrivateMemoryFd

This doesn't map particularly nicely to the kernel, though.

--Andy



Re: [PATCH v4 01/12] mm/shmem: Introduce F_SEAL_INACCESSIBLE

2022-02-17 Thread Andy Lutomirski
On Thu, Feb 17, 2022, at 5:06 AM, Chao Peng wrote:
> On Fri, Feb 11, 2022 at 03:33:35PM -0800, Andy Lutomirski wrote:
>> On 1/18/22 05:21, Chao Peng wrote:
>> > From: "Kirill A. Shutemov" 
>> > 
>> > Introduce a new seal F_SEAL_INACCESSIBLE indicating the content of
>> > the file is inaccessible from userspace through ordinary MMU access
>> > (e.g., read/write/mmap). However, the file content can be accessed
>> > via a different mechanism (e.g. KVM MMU) indirectly.
>> > 
>> > It provides semantics required for KVM guest private memory support
>> > that a file descriptor with this seal set is going to be used as the
>> > source of guest memory in confidential computing environments such
>> > as Intel TDX/AMD SEV but may not be accessible from host userspace.
>> > 
>> > At this time only shmem implements this seal.
>> > 
>> 
>> I don't dislike this *that* much, but I do dislike this. F_SEAL_INACCESSIBLE
>> essentially transmutes a memfd into a different type of object.  While this
>> can apparently be done successfully and without races (as in this code),
>> it's at least awkward.  I think that either creating a special inaccessible
>> memfd should be a single operation that creates the correct type of object or
>> there should be a clear justification for why it's a two-step process.
>
> Now one justification may be from Steven's comment on patch 00: for ARM
> usage it can be used by creating a normal memfd, (partially) populating
> it with initial guest memory content (e.g. firmware), and then
> F_SEAL_INACCESSIBLE-ing it just before the first launch of the guest in
> KVM (definitely the current code needs to be changed to support that).

Except we don't allow F_SEAL_INACCESSIBLE on a non-empty file, right?  So this 
won't work.

In any case, the whole confidential VM initialization story is a bit muddy.  
From the earlier emails, it sounds like ARM expects the host to fill in guest 
memory and measure it.  From my recollection of Intel's scheme (which may well 
be wrong, and I could easily be confusing it with SGX), TDX instead measures 
what is essentially a transcript of the series of operations that initializes 
the VM.  These are fundamentally not the same thing even if they accomplish the 
same end goal.  For TDX, we unavoidably need an operation (ioctl or similar) 
that initializes things according to the VM's instructions, and ARM ought to be 
able to use roughly the same mechanism.

Also, if we ever get fancy and teach the page allocator about memory with 
reduced directmap permissions, it may well be more efficient for userspace to 
shove data into a memfd via ioctl than it is to mmap it and write the data.



Re: [PATCH v4 04/12] mm/shmem: Support memfile_notifier

2022-02-11 Thread Andy Lutomirski

On 1/18/22 05:21, Chao Peng wrote:

It maintains a memfile_notifier list in shmem_inode_info structure and
implements memfile_pfn_ops callbacks defined by memfile_notifier. It
then exposes them to memfile_notifier via
shmem_get_memfile_notifier_info.

We use SGP_NOALLOC in shmem_get_lock_pfn since the pages should be
allocated by userspace for private memory. If there is no pages
allocated at the offset then error should be returned so KVM knows that
the memory is not private memory.

Signed-off-by: Kirill A. Shutemov 
Signed-off-by: Chao Peng 



  static int memfile_get_notifier_info(struct inode *inode,
 struct memfile_notifier_list **list,
 struct memfile_pfn_ops **ops)
  {
-   return -EOPNOTSUPP;
+   int ret = -EOPNOTSUPP;
+#ifdef CONFIG_SHMEM
+   ret = shmem_get_memfile_notifier_info(inode, list, ops);
+#endif
+   return ret;
  }



+int shmem_get_memfile_notifier_info(struct inode *inode,
+   struct memfile_notifier_list **list,
+   struct memfile_pfn_ops **ops)
+{
+   struct shmem_inode_info *info;
+
+   if (!shmem_mapping(inode->i_mapping))
+   return -EINVAL;
+
+   info = SHMEM_I(inode);
+   *list = &info->memfile_notifiers;
+   if (ops)
+   *ops = &shmem_pfn_ops;
+
+   return 0;
+}


I can't wrap my head around exactly who is supposed to call these 
functions and when, but there appears to be a missing check that the 
inode is actually a shmem inode.


What is this code trying to do?  It's very abstract.



Re: [PATCH v4 01/12] mm/shmem: Introduce F_SEAL_INACCESSIBLE

2022-02-11 Thread Andy Lutomirski

On 1/18/22 05:21, Chao Peng wrote:

From: "Kirill A. Shutemov" 

Introduce a new seal F_SEAL_INACCESSIBLE indicating the content of
the file is inaccessible from userspace through ordinary MMU access
(e.g., read/write/mmap). However, the file content can be accessed
via a different mechanism (e.g. KVM MMU) indirectly.

It provides semantics required for KVM guest private memory support
that a file descriptor with this seal set is going to be used as the
source of guest memory in confidential computing environments such
as Intel TDX/AMD SEV but may not be accessible from host userspace.

At this time only shmem implements this seal.



I don't dislike this *that* much, but I do dislike this. 
F_SEAL_INACCESSIBLE essentially transmutes a memfd into a different type 
of object.  While this can apparently be done successfully and without 
races (as in this code), it's at least awkward.  I think that either 
creating a special inaccessible memfd should be a single operation that 
creates the correct type of object or there should be a clear 
justification for why it's a two-step process.


(Imagine if the way to create an eventfd would be to call 
timerfd_create() and then do a special fcntl to turn it into an eventfd 
but only if it's not currently armed.  This would be weird.)




Re: [RFC v2 PATCH 01/13] mm/shmem: Introduce F_SEAL_GUEST

2021-12-02 Thread Andy Lutomirski

On 11/19/21 05:47, Chao Peng wrote:

From: "Kirill A. Shutemov" 

The new seal type provides semantics required for KVM guest private
memory support. A file descriptor with the seal set is going to be used
as source of guest memory in confidential computing environments such as
Intel TDX and AMD SEV.

F_SEAL_GUEST can only be set on empty memfd. After the seal is set
userspace cannot read, write or mmap the memfd.


I don't have a strong objection here, but, given that you're only 
supporting it for memfd, would a memfd_create() flag be more 
straightforward?  If nothing else, it would avoid any possible locking 
issue.


I'm also very very slightly nervous about a situation in which one 
program sends a memfd to an untrusted other process and that process 
truncates the memfd and then F_SEAL_GUESTs it.  This could be mostly 
mitigated by also requiring that no other seals be set when F_SEAL_GUEST 
happens, but the alternative MFD_GUEST would eliminate this issue too.




Re: [RFC v2 PATCH 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2021-12-02 Thread Andy Lutomirski

On 11/19/21 05:47, Chao Peng wrote:

This RFC series try to implement the fd-based KVM guest private memory
proposal described at [1] and an improved 'New Proposal' described at [2].


I generally like this.  Thanks!



Re: [PATCH] drivers/virt: vmgenid: add vm generation id driver

2020-10-18 Thread Andy Lutomirski
On Sun, Oct 18, 2020 at 8:59 AM Michael S. Tsirkin  wrote:
>
> On Sun, Oct 18, 2020 at 08:54:36AM -0700, Andy Lutomirski wrote:
> > On Sun, Oct 18, 2020 at 8:52 AM Michael S. Tsirkin  wrote:
> > >
> > > On Sat, Oct 17, 2020 at 03:24:08PM +0200, Jason A. Donenfeld wrote:
> > > > 4c. The guest kernel maintains an array of physical addresses that are
> > > > MADV_WIPEONFORK. The hypervisor knows about this array and its
> > > > location through whatever protocol, and before resuming a
> > > > moved/snapshotted/duplicated VM, it takes the responsibility for
> > > > memzeroing this memory. The huge pro here would be that this
> > > > eliminates all races, and reduces complexity quite a bit, because the
> > > > hypervisor can perfectly synchronize its bringup (and SMP bringup)
> > > > with this, and it can even optimize things like on-disk memory
> > > > snapshots to simply not write out those pages to disk.
> > > >
> > > > A 4c-like approach seems like it'd be a lot of bang for the buck -- we
> > > > reuse the existing mechanism (MADV_WIPEONFORK), so there's no new
> > > > userspace API to deal with, and it'd be race free, and eliminate a lot
> > > > of kernel complexity.
> > >
> > > Clearly this has a chance to break applications, right?
> > > If there's an app that uses this as a non-system-calls way
> > > to find out whether there was a fork, it will break
> > > when wipe triggers without a fork ...
> > > For example, imagine:
> > >
> > > MADV_WIPEONFORK
> > > copy secret data to MADV_DONTFORK
> > > fork
> > >
> > >
> > > used to work, with this change it gets 0s instead of the secret data.
> > >
> > >
> > > I am also not sure it's wise to expose each guest process
> > > to the hypervisor like this. E.g. each process needs a
> > > guest physical address of its own then. This is a finite resource.
> > >
> > >
> > > The mmap interface proposed here is somewhat baroque, but it is
> > > certainly simple to implement ...
> >
> > Wipe of fork/vmgenid/whatever could end up being much more problematic
> > than it naively appears -- it could be wiped in the middle of a read.
> > Either the API needs to handle this cleanly, or we need something more
> > aggressive like signal-on-fork.
> >
> > --Andy
>
>
> Right, it's not on fork, it's actually when the process is snapshotted.
>
> If we assume it's CRIU we care about, then I
> wonder what's wrong with something like
> MADV_CHANGEONPTRACE_SEIZE
> and basically say it's X bytes which change the value...

I feel like we may be approaching this from the wrong end.  Rather
than saying "what data structure can the kernel expose that might
plausibly be useful", how about we try identifying some specific
userspace needs and see what a good solution could look like.  I can
identify two major cryptographic use cases:

1. A userspace RNG.  The API exposed by the userspace end is a
function that generates random numbers.  The userspace code in turn
wants to know some things from the kernel: it wants some
best-quality-available random seed data from the kernel (and possibly
an indication of how good it is) as well as an indication of whether
the userspace memory may have been cloned or rolled back, or, failing
that, an indication of whether a reseed is needed.  Userspace could
implement a wide variety of algorithms on top depending on its goals
and compliance requirements, but the end goal is for the userspace
part to be very, very fast.

2. A userspace crypto stack that wants to avoid shooting itself in the
foot due to inadvertently doing the same thing twice.  For example, an
AES-GCM stack does not want to reuse an IV, *especially* if there is
even the slightest chance that it might reuse the IV for different
data.  This use case doesn't necessarily involve random numbers, but,
if anything, it needs to be even faster than #1.

The threats here are not really the same.  For #1, a userspace RNG
should be able to recover from a scenario in which an adversary clones
the entire process *and gets to own the clone*.  For example, in
Android, an adversary can often gain complete control of a fork of the
zygote -- this shouldn't adversely affect the security properties of
other forks.  Similarly, a server farm could operate by having one
booted server that is cloned to create more workers.  Those clones
could be provisioned with secrets and permissions post-clone, and an
attacker gaining control of a fresh clone could be considered
acceptable.  For #2, in contrast, if an adversary gains control of a
clone of an AES-GCM session, they learn the key outright -- the
relevant attack scenario is that the adversary gets to interact with
two clones without compromising either clone per se.

It's worth noting that, in both cases, there could possibly be more
than one instance of an RNG or an AES-GCM session in the same process.
This means that using signals is awkward but not necessarily
impossible.  (This is an area in which Linux, and POSIX in general, is
much weaker than Windows.)



Re: [PATCH] drivers/virt: vmgenid: add vm generation id driver

2020-10-18 Thread Andy Lutomirski
On Sun, Oct 18, 2020 at 8:52 AM Michael S. Tsirkin  wrote:
>
> On Sat, Oct 17, 2020 at 03:24:08PM +0200, Jason A. Donenfeld wrote:
> > 4c. The guest kernel maintains an array of physical addresses that are
> > MADV_WIPEONFORK. The hypervisor knows about this array and its
> > location through whatever protocol, and before resuming a
> > moved/snapshotted/duplicated VM, it takes the responsibility for
> > memzeroing this memory. The huge pro here would be that this
> > eliminates all races, and reduces complexity quite a bit, because the
> > hypervisor can perfectly synchronize its bringup (and SMP bringup)
> > with this, and it can even optimize things like on-disk memory
> > snapshots to simply not write out those pages to disk.
> >
> > A 4c-like approach seems like it'd be a lot of bang for the buck -- we
> > reuse the existing mechanism (MADV_WIPEONFORK), so there's no new
> > userspace API to deal with, and it'd be race free, and eliminate a lot
> > of kernel complexity.
>
> Clearly this has a chance to break applications, right?
> If there's an app that uses this as a non-system-calls way
> to find out whether there was a fork, it will break
> when wipe triggers without a fork ...
> For example, imagine:
>
> MADV_WIPEONFORK
> copy secret data to MADV_DONTFORK
> fork
>
>
> used to work, with this change it gets 0s instead of the secret data.
>
>
> I am also not sure it's wise to expose each guest process
> to the hypervisor like this. E.g. each process needs a
> guest physical address of its own then. This is a finite resource.
>
>
> The mmap interface proposed here is somewhat baroque, but it is
> certainly simple to implement ...

Wipe of fork/vmgenid/whatever could end up being much more problematic
than it naively appears -- it could be wiped in the middle of a read.
Either the API needs to handle this cleanly, or we need something more
aggressive like signal-on-fork.

--Andy



Re: [PATCH] drivers/virt: vmgenid: add vm generation id driver

2020-10-17 Thread Andy Lutomirski
On Fri, Oct 16, 2020 at 6:40 PM Jann Horn  wrote:
>
> [adding some more people who are interested in RNG stuff: Andy, Jason,
> Theodore, Willy Tarreau, Eric Biggers. also linux-api@, because this
> concerns some pretty fundamental API stuff related to RNG usage]
>
> On Fri, Oct 16, 2020 at 4:33 PM Catangiu, Adrian Costin
>  wrote:
> > - Background
> >
> > The VM Generation ID is a feature defined by Microsoft (paper:
> > http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
> > multiple hypervisor vendors.
> >
> > The feature is required in virtualized environments by apps that work
> > with local copies/caches of world-unique data such as random values,
> > uuids, monotonically increasing counters, etc.
> > Such apps can be negatively affected by VM snapshotting when the VM
> > is either cloned or returned to an earlier point in time.
> >
> > The VM Generation ID is a simple concept meant to alleviate the issue
> > by providing a unique ID that changes each time the VM is restored
> > from a snapshot. The hw provided UUID value can be used to
> > differentiate between VMs or different generations of the same VM.
> >
> > - Problem
> >
> > The VM Generation ID is exposed through an ACPI device by multiple
> > hypervisor vendors, but neither the vendors nor upstream Linux has a
> > default driver for it, leaving users to fend for themselves.
> >
> > Furthermore, simply finding out about a VM generation change is only
> > the starting point of a process to renew internal states of possibly
> > multiple applications across the system. This process could benefit
> > from a driver that provides an interface through which orchestration
> > can be easily done.
> >
> > - Solution
> >
> > This patch is a driver which exposes the Virtual Machine Generation ID
> > via a char-dev FS interface that provides ID update sync and async
> > notification, retrieval and confirmation mechanisms:
> >
> > When the device is 'open()'ed a copy of the current vm UUID is
> > associated with the file handle. 'read()' operations block until the
> > associated UUID is no longer up to date - until HW vm gen id changes -
> > at which point the new UUID is provided/returned. Nonblocking 'read()'
> > uses EWOULDBLOCK to signal that there is no _new_ UUID available.
> >
> > 'poll()' is implemented to allow polling for UUID updates. Such
> > updates result in 'EPOLLIN' events.
> >
> > Subsequent read()s following a UUID update no longer block, but return
> > the updated UUID. The application needs to acknowledge the UUID update
> > by confirming it through a 'write()'.
> > Only on writing back to the driver the right/latest UUID, will the
> > driver mark this "watcher" as up to date and remove EPOLLIN status.
> >
> > 'mmap()' support allows mapping a single read-only shared page which
> > will always contain the latest UUID value at offset 0.
>
> It would be nicer if that page just contained an incrementing counter,
> instead of a UUID. It's not like the application cares *what* the UUID
> changed to, just that it *did* change and all RNGs state now needs to
> be reseeded from the kernel, right? And an application can't reliably
> read the entire UUID from the memory mapping anyway, because the VM
> might be forked in the middle.
>
> So I think your kernel driver should detect UUID changes and then turn
> those into a monotonically incrementing counter. (Probably 64 bits
> wide?) (That's probably also a little bit faster than comparing an
> entire UUID.)
>
> An option might be to put that counter into the vDSO, instead of a
> separate VMA; but I don't know how the other folks feel about that.
> Andy, do you have opinions on this? That way, normal userspace code
> that uses this infrastructure wouldn't have to mess around with a
> special device at all. And it'd be usable in seccomp sandboxes and so
> on without needing special plumbing. And libraries wouldn't have to
> call open() and mess with file descriptor numbers.

The vDSO might be annoyingly slow for this.  Something like the rseq
page might make sense.  It could be a generic indication of "system
went through some form of suspend".



Re: [Qemu-devel] d_off field in struct dirent and 32-on-64 emulation

2018-12-29 Thread Andy Lutomirski
> On Dec 28, 2018, at 6:54 PM, Matthew Wilcox  wrote:
>
>> On Sat, Dec 29, 2018 at 12:12:27AM +, Peter Maydell wrote:
>> On Fri, 28 Dec 2018 at 23:16, Andreas Dilger  wrote:
>>> On Dec 28, 2018, at 4:18 AM, Peter Maydell  wrote:
 The problem is that there is no 32-bit API in some cases
 (unless I have misunderstood the kernel code) -- not all
 host architectures implement compat syscalls or allow them
 to be called from 64-bit processes or implement all the older
 syscall variants that had smaller offsets. If there was a guaranteed
 "this syscall always exists and always gives me 32-bit offsets"
 we could use it.
>>>
>>> The "32bitapi" mount option would use 32-bit hash for seekdir
>>> and telldir, regardless of what kernel API was used.  That would
>>> just set the FMODE_32BITHASH flag in the file->f_mode for all files.
>>
>> A mount option wouldn't be much use to QEMU -- we can't tell
>> our users how to mount their filesystems, which they're
>> often doing lots of other things with besides running QEMU.
>> (Otherwise we could just tell them "don't use ext4", which
>> would also solve the problem :-)) We need something we can
>> use at the individual-syscall level.
>
> Could you use a prctl to set whether you were running in 32 or 64 bit
> mode?  Or do you change which kind of task you're emulating too often
> to make this a good idea?


How would this work?  We already have the separate
COMPAT_SYSCALL_DEFINE entries *and* in_compat_syscall(). Now we’d have
a third degree of freedom.

Either the arches people care about should add reasonable ways to
issue 32-bit syscalls from 64-bit mode or there should be an explicit
way to ask for the 32-bit directory offsets.



Re: [Qemu-devel] d_off field in struct dirent and 32-on-64 emulation

2018-12-28 Thread Andy Lutomirski
[sending again, slightly edited, due to email client issues]

On Thu, Dec 27, 2018 at 9:25 AM Florian Weimer  wrote:
>
> We have a bit of an interesting problem with respect to the d_off
> field in struct dirent.
>
> When running a 64-bit kernel on certain file systems, notably ext4,
> this field uses the full 63 bits even for small directories (strace -v
> output, wrapped here for readability):
>
> getdents(3, [
>   {d_ino=1494304, d_off=3901177228673045825, d_reclen=40, 
> d_name="authorized_keys", d_type=DT_REG},
>   {d_ino=1494277, d_off=7491915799041650922, d_reclen=24, d_name=".", 
> d_type=DT_DIR},
>   {d_ino=1314655, d_off=9223372036854775807, d_reclen=24, d_name="..", 
> d_type=DT_DIR}
> ], 32768) = 88
>
> When running in 32-bit compat mode, this value is somehow truncated to
> 31 bits, for both the getdents and the getdents64 (!) system call (at
> least on i386).
>

...

>
> However, both qemu-user and the 9p file system can run in such a way
> that the kernel is entered from a 64-bit process, but the actual usage
> is from a 32-bit process:


I imagine that at least some of the problems you're seeing are due to this bug:

https://lkml.org/lkml/2018/10/18/859

Presumably the right fix involves modifying the relevant VFS file
operations to indicate the relevant ABI to the implementations.  I
would guess that 9p is triggering the “not really in the syscall you
think you’re in” issue.



Re: [Qemu-devel] d_off field in struct dirent and 32-on-64 emulation

2018-12-27 Thread Andy Lutomirski



> On Dec 27, 2018, at 10:18 AM, Florian Weimer  wrote:
> 
> We have a bit of an interesting problem with respect to the d_off
> field in struct dirent.
> 
> When running a 64-bit kernel on certain file systems, notably ext4,
> this field uses the full 63 bits even for small directories (strace -v
> output, wrapped here for readability):
> 
> getdents(3, [
>  {d_ino=1494304, d_off=3901177228673045825, d_reclen=40, 
> d_name="authorized_keys", d_type=DT_REG},
>  {d_ino=1494277, d_off=7491915799041650922, d_reclen=24, d_name=".", 
> d_type=DT_DIR},
>  {d_ino=1314655, d_off=9223372036854775807, d_reclen=24, d_name="..", 
> d_type=DT_DIR}
> ], 32768) = 88
> 
> When running in 32-bit compat mode, this value is somehow truncated to
> 31 bits, for both the getdents and the getdents64 (!) system call (at
> least on i386).

I imagine you’re encountering this bug:

https://lkml.org/lkml/2018/10/18/859

Presumably the right fix involves modifying the relevant VFS file operations to 
indicate the relevant ABI to the implementations.

I would guess that 9p is triggering the “not really in the syscall you think 
you’re in” issue.


Re: [Qemu-devel] [PATCH V2 RFC] fixup! virtio: convert to use DMA api

2016-04-27 Thread Andy Lutomirski
On Wed, Apr 27, 2016 at 7:54 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Wed, Apr 27, 2016 at 07:43:07AM -0700, Andy Lutomirski wrote:
>> On Wed, Apr 27, 2016 at 7:38 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
>> > On Wed, Apr 27, 2016 at 07:31:43AM -0700, Andy Lutomirski wrote:
>> >> On Wed, Apr 27, 2016 at 7:23 AM, Joerg Roedel <j...@8bytes.org> wrote:
>> >> > On Wed, Apr 27, 2016 at 04:37:04PM +0300, Michael S. Tsirkin wrote:
>> >> >> One correction: it's a feature of the device in the system.
>> >> >> There could be a mix of devices bypassing and not
>> >> >> bypassing the IOMMU.
>> >> >
>> >> > No, it really is not. A device can't choose to bypass the IOMMU. But the
>> >> > IOMMU can choose to let the device bypass. So any fix here belongs
>> >> > into the platform/iommu code too and not into some driver.
>> >> >
>> >> >> Sounds good. And a way to detect appropriate devices could
>> >> >> be by looking at the feature flag, perhaps?
>> >> >
>> >> > Again, no! The way to detect that is to look into the iommu description
>> >> > structures provided by the firmware. They provide everything necessary
>> >> > to tell the iommu code which devices are not translated.
>> >> >
>> >>
>> >> Except on PPC and SPARC.  As far as I know, those are the only
>> >> problematic platforms.
>> >>
>> >> Is it too late to *disable* QEMU's q35-iommu thingy until it can be
>> >> fixed to report correct data in the DMAR tables?
>> >>
>> >> --Andy
>> >
>> > Meaning virtio or assigned devices?
>> > For virtio - it's way too late since these are working configurations.
>> > For assigned devices - they don't work on x86 so it doesn't have
>> > to be disabled, it's safe to ignore.
>>
>> I mean actually prevent QEMU from running in q35-iommu mode with any
>> virtio devices attached or maybe even turn off q35-iommu mode entirely
>> [1].  Doesn't it require that the user literally pass the word
>> "experimental" into QEMU right now?  It did at some point IIRC.
>>
>> The reason I'm asking is that, other than q35-iommu, QEMU's virtio
>> devices *don't* bypass the IOMMU except on PPC and SPARC, simply
>> because there is no other configuration AFAICT that has virtio and an
>> IOMMU.  So maybe the right solution is to fix q35-iommu to use DMAR
>> correctly (thus breaking q35-iommu users with older guest kernels,
>> which hopefully don't actually exist) and to come up with a PPC- and
>> SPARC-specific solution, or maybe OpenFirmware-specific solution, to
>> handle PPC and SPARC down the road.
>>
>> [1] I'm pretty sure I emailed the QEMU list before q35-iommu ever
>> showed up in a release asking the QEMU team to please not do that
>> until this issue was resolved.  Sadly, that email was ignored :(
>>
>> --Andy
>
> Sorry, I didn't make myself clear.
> Point is, QEMU is not the only virtio implementation out there.
> So we can't know no virtio implementations have an IOMMU as long as
> linux supports this IOMMU.
> virtio always used physical addresses since it was born and if it
> changes that it must do this in a way that does not break existing
> users.

Is there any non-QEMU virtio implementation that can provide an
IOMMU-bypassing virtio device on a platform that has a nontrivial
IOMMU?

--Andy



Re: [Qemu-devel] [PATCH V2 RFC] fixup! virtio: convert to use DMA api

2016-04-27 Thread Andy Lutomirski
On Wed, Apr 27, 2016 at 7:38 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Wed, Apr 27, 2016 at 07:31:43AM -0700, Andy Lutomirski wrote:
>> On Wed, Apr 27, 2016 at 7:23 AM, Joerg Roedel <j...@8bytes.org> wrote:
>> > On Wed, Apr 27, 2016 at 04:37:04PM +0300, Michael S. Tsirkin wrote:
>> >> One correction: it's a feature of the device in the system.
>> >> There could be a mix of devices bypassing and not
>> >> bypassing the IOMMU.
>> >
>> > No, it really is not. A device can't choose to bypass the IOMMU. But the
>> > IOMMU can choose to let the device bypass. So any fix here belongs
>> > into the platform/iommu code too and not into some driver.
>> >
>> >> Sounds good. And a way to detect appropriate devices could
>> >> be by looking at the feature flag, perhaps?
>> >
>> > Again, no! The way to detect that is to look into the iommu description
>> > structures provided by the firmware. They provide everything necessary
>> > to tell the iommu code which devices are not translated.
>> >
>>
>> Except on PPC and SPARC.  As far as I know, those are the only
>> problematic platforms.
>>
>> Is it too late to *disable* QEMU's q35-iommu thingy until it can be
>> fixed to report correct data in the DMAR tables?
>>
>> --Andy
>
> Meaning virtio or assigned devices?
> For virtio - it's way too late since these are working configurations.
> For assigned devices - they don't work on x86 so it doesn't have
> to be disabled, it's safe to ignore.

I mean actually prevent QEMU from running in q35-iommu mode with any
virtio devices attached or maybe even turn off q35-iommu mode entirely
[1].  Doesn't it require that the user literally pass the word
"experimental" into QEMU right now?  It did at some point IIRC.

The reason I'm asking is that, other than q35-iommu, QEMU's virtio
devices *don't* bypass the IOMMU except on PPC and SPARC, simply
because there is no other configuration AFAICT that has virtio and an
IOMMU.  So maybe the right solution is to fix q35-iommu to use DMAR
correctly (thus breaking q35-iommu users with older guest kernels,
which hopefully don't actually exist) and to come up with a PPC- and
SPARC-specific solution, or maybe OpenFirmware-specific solution, to
handle PPC and SPARC down the road.

[1] I'm pretty sure I emailed the QEMU list before q35-iommu ever
showed up in a release asking the QEMU team to please not do that
until this issue was resolved.  Sadly, that email was ignored :(

--Andy



Re: [Qemu-devel] [PATCH V2 RFC] fixup! virtio: convert to use DMA api

2016-04-27 Thread Andy Lutomirski
On Wed, Apr 27, 2016 at 7:23 AM, Joerg Roedel  wrote:
> On Wed, Apr 27, 2016 at 04:37:04PM +0300, Michael S. Tsirkin wrote:
>> One correction: it's a feature of the device in the system.
>> There could be a mix of devices bypassing and not
>> bypassing the IOMMU.
>
> No, it really is not. A device can't choose to bypass the IOMMU. But the
> IOMMU can choose to let the device bypass. So any fix here belongs
> into the platform/iommu code too and not into some driver.
>
>> Sounds good. And a way to detect appropriate devices could
>> be by looking at the feature flag, perhaps?
>
> Again, no! The way to detect that is to look into the iommu description
> structures provided by the firmware. They provide everything necessary
> to tell the iommu code which devices are not translated.
>

Except on PPC and SPARC.  As far as I know, those are the only
problematic platforms.

Is it too late to *disable* QEMU's q35-iommu thingy until it can be
fixed to report correct data in the DMAR tables?

--Andy



[Qemu-devel] [Bug 1574346] [NEW] TCG: mov to segment register is incorrectly emulated for AMD CPUs

2016-04-24 Thread Andy Lutomirski
Public bug reported:

In TCG mode, the effect of:

xorl %eax, %eax
movl %eax, %gs

is to mark the GS segment unusable and set its base to zero.  After
doing this, reading MSR_GS_BASE will return zero and using a GS prefix
in long mode will treat the GS base as zero.

This is correct for Intel CPUs but is incorrect for AMD CPUs.  On an AMD
CPU, writing 0 to %gs using mov, pop, or (I think) lgs will leave the
base unchanged.

To make it easier to use TCG to validate behavior on different CPUs,
please consider changing the TCG behavior to match actual CPU behavior
when emulating an AMD CPU.

** Affects: qemu
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1574346

Title:
  TCG: mov to segment register is incorrectly emulated for AMD CPUs

Status in QEMU:
  New

Bug description:
  In TCG mode, the effect of:

  xorl %eax, %eax
  movl %eax, %gs

  is to mark the GS segment unusable and set its base to zero.  After
  doing this, reading MSR_GS_BASE will return zero and using a GS prefix
  in long mode will treat the GS base as zero.

  This is correct for Intel CPUs but is incorrect for AMD CPUs.  On an
  AMD CPU, writing 0 to %gs using mov, pop, or (I think) lgs will leave
  the base unchanged.

  To make it easier to use TCG to validate behavior on different CPUs,
  please consider changing the TCG behavior to match actual CPU behavior
  when emulating an AMD CPU.

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1574346/+subscriptions



Re: [Qemu-devel] [PATCH RFC] fixup! virtio: convert to use DMA api

2016-04-20 Thread Andy Lutomirski
On Apr 20, 2016 6:14 AM, "Michael S. Tsirkin" <m...@redhat.com> wrote:
>
> On Tue, Apr 19, 2016 at 02:07:01PM -0700, Andy Lutomirski wrote:
> > On Tue, Apr 19, 2016 at 1:54 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
> > > On Tue, Apr 19, 2016 at 01:27:29PM -0700, Andy Lutomirski wrote:
> > >> On Tue, Apr 19, 2016 at 1:16 PM, Michael S. Tsirkin <m...@redhat.com> 
> > >> wrote:
> > >> > On Tue, Apr 19, 2016 at 11:01:38AM -0700, Andy Lutomirski wrote:
> > >> >> On Tue, Apr 19, 2016 at 10:49 AM, Michael S. Tsirkin 
> > >> >> <m...@redhat.com> wrote:
> > >> >> > On Tue, Apr 19, 2016 at 12:26:44PM -0400, David Woodhouse wrote:
> > >> >> >> On Tue, 2016-04-19 at 19:20 +0300, Michael S. Tsirkin wrote:
> > >> >> >> >
> > >> >> >> > > I thought that PLATFORM served that purpose.  Wouldn't the host
> > >> >> >> > > advertise PLATFORM support and, if the guest doesn't ack it, 
> > >> >> >> > > the host
> > >> >> >> > > device would skip translation?  Or is that problematic for 
> > >> >> >> > > vfio?
> > >> >> >> >
> > >> >> >> > Exactly that's problematic for security.
> > >> >> >> > You can't allow guest driver to decide whether device skips 
> > >> >> >> > security.
> > >> >> >>
> > >> >> >> Right. Because fundamentally, this *isn't* a property of the 
> > >> >> >> endpoint
> > >> >> >> device, and doesn't live in virtio itself.
> > >> >> >>
> > >> >> >> It's a property of the platform IOMMU, and lives there.
> > >> >> >
> > >> >> > It's a property of the hypervisor virtio implementation, and lives 
> > >> >> > there.
> > >> >>
> > >> >> It is now, but QEMU could, in principle, change the way it thinks
> > >> >> about it so that virtio devices would use the QEMU DMA API but ask
> > >> >> QEMU to pass everything through 1:1.  This would be entirely invisible
> > >> >> to guests but would make it be a property of the IOMMU implementation.
> > >> >> At that point, maybe QEMU could find a (platform dependent) way to
> > >> >> tell the guest what's going on.
> > >> >>
> > >> >> FWIW, as far as I can tell, PPC and SPARC really could, in principle,
> > >> >> set up 1:1 mappings in the guest so that the virtio devices would work
> > >> >> regardless of whether QEMU is ignoring the IOMMU or not -- I think the
> > >> >> only obstacle is that the PPC and SPARC 1:1 mappings are currently set
> > >> >> up with an offset.  I don't know too much about those platforms, but
> > >> >> presumably the layout could be changed so that 1:1 really was 1:1.
> > >> >>
> > >> >> --Andy
> > >> >
> > >> > Sure. Do you see any reason why the decision to do this can't be
> > >> > keyed off the virtio feature bit?
> > >>
> > >> I can think of three types of virtio host:
> > >>
> > >> a) virtio always bypasses the IOMMU.
> > >>
> > >> b) virtio never bypasses the IOMMU (unless DMAR tables or similar say
> > >> it does) -- i.e. virtio works like any other device.
> > >>
> > >> c) virtio may bypass the IOMMU depending on what the guest asks it to do.
> > >
> > > d) some virtio devices bypass the IOMMU and some don't,
> > > e.g. it's harder to support IOMMU with vhost.
> > >
> > >
> > >> If this is keyed off a virtio feature bit and anyone tries to
> > >> >> implement (c), vfio is going to have a problem.  And, if it's
> > >> keyed off a virtio feature bit, then (a) won't work on Xen or similar
> > >> setups unless the Xen hypervisor adds a giant and probably unreliable
> > >> kludge to support it.  Meanwhile, 4.6-rc works fine under Xen on a
> > >> default x86 QEMU configuration, and I'd really like to keep it that
> > >> way.
> > >>
> > >> What could plausibly work using a virtio feature bit is for a device
> > >> to say "hey, I'm a new device and I support the platform-defined IOMMU
> > >> mechanism".  This bit would be *set* on default IOMMU-less QEMU
> > >> configurations and on physical virtio PCI cards.
> > >
> > > And clear on xen.
> >
> > How?  QEMU has no idea that the guest is running Xen.
>
> I was under the impression that xen_enabled() is true in QEMU.
> Am I wrong?

I'd be rather surprised, given that QEMU would have to inspect the
guest kernel to figure it out.  I'm talking about Xen under QEMU.  For
example, if you feed QEMU a guest disk image that contains Fedora with
the xen packages installed, you can boot it and get a grub menu.  If
you ask grub to boot Xen, you get Xen.  If you ask grub to boot Linux
directly, you don't get Xen.

I assume xen_enabled is for QEMU under Xen, i.e. QEMU, running under
Xen, supplying emulated devices to a Xen domU guest.  Since QEMU is
seeing the guest address space directly, this should be much the same
as QEMU !xen_enabled -- if you boot plain Linux, everything works, but
if you do Xen -> QEMU -> HVM guest running Xen PV -> Linux, then
virtio drivers in the Xen PV Linux guest need to translate addresses.
--Andy

>
> --
> MST



Re: [Qemu-devel] [PATCH RFC] fixup! virtio: convert to use DMA api

2016-04-19 Thread Andy Lutomirski
On Tue, Apr 19, 2016 at 1:54 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Tue, Apr 19, 2016 at 01:27:29PM -0700, Andy Lutomirski wrote:
>> On Tue, Apr 19, 2016 at 1:16 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
>> > On Tue, Apr 19, 2016 at 11:01:38AM -0700, Andy Lutomirski wrote:
>> >> On Tue, Apr 19, 2016 at 10:49 AM, Michael S. Tsirkin <m...@redhat.com> 
>> >> wrote:
>> >> > On Tue, Apr 19, 2016 at 12:26:44PM -0400, David Woodhouse wrote:
>> >> >> On Tue, 2016-04-19 at 19:20 +0300, Michael S. Tsirkin wrote:
>> >> >> >
>> >> >> > > I thought that PLATFORM served that purpose.  Wouldn't the host
>> >> >> > > advertise PLATFORM support and, if the guest doesn't ack it, the 
>> >> >> > > host
>> >> >> > > device would skip translation?  Or is that problematic for vfio?
>> >> >> >
>> >> >> > Exactly that's problematic for security.
>> >> >> > You can't allow guest driver to decide whether device skips security.
>> >> >>
>> >> >> Right. Because fundamentally, this *isn't* a property of the endpoint
>> >> >> device, and doesn't live in virtio itself.
>> >> >>
>> >> >> It's a property of the platform IOMMU, and lives there.
>> >> >
>> >> > It's a property of the hypervisor virtio implementation, and lives 
>> >> > there.
>> >>
>> >> It is now, but QEMU could, in principle, change the way it thinks
>> >> about it so that virtio devices would use the QEMU DMA API but ask
>> >> QEMU to pass everything through 1:1.  This would be entirely invisible
>> >> to guests but would make it be a property of the IOMMU implementation.
>> >> At that point, maybe QEMU could find a (platform dependent) way to
>> >> tell the guest what's going on.
>> >>
>> >> FWIW, as far as I can tell, PPC and SPARC really could, in principle,
>> >> set up 1:1 mappings in the guest so that the virtio devices would work
>> >> regardless of whether QEMU is ignoring the IOMMU or not -- I think the
>> >> only obstacle is that the PPC and SPARC 1:1 mappings are currently set
>> >> up with an offset.  I don't know too much about those platforms, but
>> >> presumably the layout could be changed so that 1:1 really was 1:1.
>> >>
>> >> --Andy
>> >
>> > Sure. Do you see any reason why the decision to do this can't be
>> > keyed off the virtio feature bit?
>>
>> I can think of three types of virtio host:
>>
>> a) virtio always bypasses the IOMMU.
>>
>> b) virtio never bypasses the IOMMU (unless DMAR tables or similar say
>> it does) -- i.e. virtio works like any other device.
>>
>> c) virtio may bypass the IOMMU depending on what the guest asks it to do.
>
> d) some virtio devices bypass the IOMMU and some don't,
> e.g. it's harder to support IOMMU with vhost.
>
>
>> If this is keyed off a virtio feature bit and anyone tries to
>> implement (c), vfio is going to have a problem.  And, if it's
>> keyed off a virtio feature bit, then (a) won't work on Xen or similar
>> setups unless the Xen hypervisor adds a giant and probably unreliable
>> kludge to support it.  Meanwhile, 4.6-rc works fine under Xen on a
>> default x86 QEMU configuration, and I'd really like to keep it that
>> way.
>>
>> What could plausibly work using a virtio feature bit is for a device
>> to say "hey, I'm a new device and I support the platform-defined IOMMU
>> mechanism".  This bit would be *set* on default IOMMU-less QEMU
>> configurations and on physical virtio PCI cards.
>
> And clear on xen.

How?  QEMU has no idea that the guest is running Xen.



Re: [Qemu-devel] [PATCH RFC] fixup! virtio: convert to use DMA api

2016-04-19 Thread Andy Lutomirski
On Tue, Apr 19, 2016 at 1:16 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Tue, Apr 19, 2016 at 11:01:38AM -0700, Andy Lutomirski wrote:
>> On Tue, Apr 19, 2016 at 10:49 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
>> > On Tue, Apr 19, 2016 at 12:26:44PM -0400, David Woodhouse wrote:
>> >> On Tue, 2016-04-19 at 19:20 +0300, Michael S. Tsirkin wrote:
>> >> >
>> >> > > I thought that PLATFORM served that purpose.  Wouldn't the host
>> >> > > advertise PLATFORM support and, if the guest doesn't ack it, the host
>> >> > > device would skip translation?  Or is that problematic for vfio?
>> >> >
>> >> > Exactly that's problematic for security.
>> >> > You can't allow guest driver to decide whether device skips security.
>> >>
>> >> Right. Because fundamentally, this *isn't* a property of the endpoint
>> >> device, and doesn't live in virtio itself.
>> >>
>> >> It's a property of the platform IOMMU, and lives there.
>> >
>> > It's a property of the hypervisor virtio implementation, and lives there.
>>
>> It is now, but QEMU could, in principle, change the way it thinks
>> about it so that virtio devices would use the QEMU DMA API but ask
>> QEMU to pass everything through 1:1.  This would be entirely invisible
>> to guests but would make it be a property of the IOMMU implementation.
>> At that point, maybe QEMU could find a (platform dependent) way to
>> tell the guest what's going on.
>>
>> FWIW, as far as I can tell, PPC and SPARC really could, in principle,
>> set up 1:1 mappings in the guest so that the virtio devices would work
>> regardless of whether QEMU is ignoring the IOMMU or not -- I think the
>> only obstacle is that the PPC and SPARC 1:1 mappings are currently set
>> up with an offset.  I don't know too much about those platforms, but
>> presumably the layout could be changed so that 1:1 really was 1:1.
>>
>> --Andy
>
> Sure. Do you see any reason why the decision to do this can't be
> keyed off the virtio feature bit?

I can think of three types of virtio host:

a) virtio always bypasses the IOMMU.

b) virtio never bypasses the IOMMU (unless DMAR tables or similar say
it does) -- i.e. virtio works like any other device.

c) virtio may bypass the IOMMU depending on what the guest asks it to do.

If this is keyed off a virtio feature bit and anyone tries to
implement (c), vfio is going to have a problem.  And, if it's
keyed off a virtio feature bit, then (a) won't work on Xen or similar
setups unless the Xen hypervisor adds a giant and probably unreliable
kludge to support it.  Meanwhile, 4.6-rc works fine under Xen on a
default x86 QEMU configuration, and I'd really like to keep it that
way.

What could plausibly work using a virtio feature bit is for a device
to say "hey, I'm a new device and I support the platform-defined IOMMU
mechanism".  This bit would be *set* on default IOMMU-less QEMU
configurations and on physical virtio PCI cards.  The guest could
operate accordingly.  I'm not sure I see a good way for feature
negotiation to work the other direction, though.

PPC and SPARC could only set this bit on emulated devices if they know
that new guest kernels are in use.

--Andy



Re: [Qemu-devel] [PATCH RFC] fixup! virtio: convert to use DMA api

2016-04-19 Thread Andy Lutomirski
On Tue, Apr 19, 2016 at 10:49 AM, Michael S. Tsirkin  wrote:
> On Tue, Apr 19, 2016 at 12:26:44PM -0400, David Woodhouse wrote:
>> On Tue, 2016-04-19 at 19:20 +0300, Michael S. Tsirkin wrote:
>> >
>> > > I thought that PLATFORM served that purpose.  Wouldn't the host
>> > > advertise PLATFORM support and, if the guest doesn't ack it, the host
>> > > device would skip translation?  Or is that problematic for vfio?
>> >
>> > Exactly that's problematic for security.
>> > You can't allow guest driver to decide whether device skips security.
>>
>> Right. Because fundamentally, this *isn't* a property of the endpoint
>> device, and doesn't live in virtio itself.
>>
>> It's a property of the platform IOMMU, and lives there.
>
> It's a property of the hypervisor virtio implementation, and lives there.

It is now, but QEMU could, in principle, change the way it thinks
about it so that virtio devices would use the QEMU DMA API but ask
QEMU to pass everything through 1:1.  This would be entirely invisible
to guests but would make it be a property of the IOMMU implementation.
At that point, maybe QEMU could find a (platform dependent) way to
tell the guest what's going on.

FWIW, as far as I can tell, PPC and SPARC really could, in principle,
set up 1:1 mappings in the guest so that the virtio devices would work
regardless of whether QEMU is ignoring the IOMMU or not -- I think the
only obstacle is that the PPC and SPARC 1:1 mappings are currently set
up with an offset.  I don't know too much about those platforms, but
presumably the layout could be changed so that 1:1 really was 1:1.

--Andy



Re: [Qemu-devel] [PATCH RFC] fixup! virtio: convert to use DMA api

2016-04-19 Thread Andy Lutomirski
On Tue, Apr 19, 2016 at 9:09 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Tue, Apr 19, 2016 at 09:02:14AM -0700, Andy Lutomirski wrote:
>> On Tue, Apr 19, 2016 at 3:27 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
>> > On Mon, Apr 18, 2016 at 12:24:15PM -0700, Andy Lutomirski wrote:
>> >> On Mon, Apr 18, 2016 at 11:29 AM, David Woodhouse <dw...@infradead.org> 
>> >> wrote:
>> >> > For x86, you *can* enable virtio-behind-IOMMU if your DMAR tables tell
>> >> > the truth, and even legacy kernels ought to cope with that.
>> >> > FSVO 'ought to' where I suspect some of them will actually crash with a
>> >> > NULL pointer dereference if there's no "catch-all" DMAR unit in the
>> >> > tables, which puts it back into the same camp as ARM and Power.
>> >>
>> >> I think x86 may get a bit of a free pass here.  AFAIK the QEMU IOMMU
>> >> implementation on x86 has always been "experimental", so it just might
>> >> be okay to change it in a way that causes some older kernels to OOPS.
>> >>
>> >> --Andy
>> >
>> > Since it's experimental, it might be OK to change *guest kernels*
>> > such that they oops on old QEMU.
>> > But guest kernels were not experimental - so we need a QEMU mode that
>> > makes them work fine. The more functionality is available in this QEMU
> mode, the better, because it's going to be the default for a while. For
>> > the same reason, it is preferable to also have new kernels not crash in
>> > this mode.
>> >
>>
>> People add QEMU features that need new guest kernels all the time.
>> If you enable virtio-scsi and try to boot a guest that's too old, it
>> won't work.  So I don't see anything fundamentally wrong with saying
>> that the non-experimental QEMU Q35 IOMMU mode won't boot if the guest
>> kernel is too old.  It might be annoying, since old kernels do work on
>> actual Q35 hardware, but it at least seems that it might be
>> okay.
>>
>> --Andy
>
> Yes but we need a mode that makes both old and new kernels work, and
> that should be the default for a while.  this is what the
> IOMMU_PASSTHROUGH flag was about: old kernels ignore it and bypass DMA
> API, new kernels go "oh compatibility mode" and bypass the IOMMU
> within DMA API.

I thought that PLATFORM served that purpose.  Wouldn't the host
advertise PLATFORM support and, if the guest doesn't ack it, the host
device would skip translation?  Or is that problematic for vfio?

>
> --
> MST



-- 
Andy Lutomirski
AMA Capital Management, LLC



Re: [Qemu-devel] [PATCH RFC] fixup! virtio: convert to use DMA api

2016-04-19 Thread Andy Lutomirski
On Apr 19, 2016 2:13 AM, "Michael S. Tsirkin"  wrote:
>
>
> I guess you are right in that we should split this part out.
> What I wanted is really the combination
> PASSTHROUGH && !PLATFORM so that we can say "ok we don't
> need to guess, this device actually bypasses the IOMMU".

What happens when you use a device like this on Xen or with a similar
software translation layer?

I think that a "please bypass IOMMU" feature would be better in the
PCI, IOMMU, or platform code.  For Xen, virtio would still want to use
the DMA API, just without translating at the DMAR or hardware level.
Doing it in virtio is awkward, because virtio is involved at the
device level and the driver level, but the translation might be
entirely in between.

I think a nicer long-term approach would be to have a way to ask the
guest to set up a full 1:1 mapping for performance, but to still
handle the case where the guest refuses to do so or where there's more
than one translation layer involved.

But I agree that this part shouldn't delay the other part of your series.

--Andy



Re: [Qemu-devel] [PATCH RFC] fixup! virtio: convert to use DMA api

2016-04-19 Thread Andy Lutomirski
On Tue, Apr 19, 2016 at 3:27 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Mon, Apr 18, 2016 at 12:24:15PM -0700, Andy Lutomirski wrote:
>> On Mon, Apr 18, 2016 at 11:29 AM, David Woodhouse <dw...@infradead.org> 
>> wrote:
>> > For x86, you *can* enable virtio-behind-IOMMU if your DMAR tables tell
>> > the truth, and even legacy kernels ought to cope with that.
>> > FSVO 'ought to' where I suspect some of them will actually crash with a
>> > NULL pointer dereference if there's no "catch-all" DMAR unit in the
>> > tables, which puts it back into the same camp as ARM and Power.
>>
>> I think x86 may get a bit of a free pass here.  AFAIK the QEMU IOMMU
>> implementation on x86 has always been "experimental", so it just might
>> be okay to change it in a way that causes some older kernels to OOPS.
>>
>> --Andy
>
> Since it's experimental, it might be OK to change *guest kernels*
> such that they oops on old QEMU.
> But guest kernels were not experimental - so we need a QEMU mode that
> makes them work fine. The more functionality is available in this QEMU
> mode, the better, because it's going to be the default for a while. For
> the same reason, it is preferable to also have new kernels not crash in
> this mode.
>

People add QEMU features that need new guest kernels all the time.
If you enable virtio-scsi and try to boot a guest that's too old, it
won't work.  So I don't see anything fundamentally wrong with saying
that the non-experimental QEMU Q35 IOMMU mode won't boot if the guest
kernel is too old.  It might be annoying, since old kernels do work on
actual Q35 hardware, but it at least seems that it might be
okay.

--Andy



Re: [Qemu-devel] [PATCH RFC] fixup! virtio: convert to use DMA api

2016-04-18 Thread Andy Lutomirski
On Mon, Apr 18, 2016 at 11:29 AM, David Woodhouse  wrote:
> For x86, you *can* enable virtio-behind-IOMMU if your DMAR tables tell
> the truth, and even legacy kernels ought to cope with that.
> FSVO 'ought to' where I suspect some of them will actually crash with a
> NULL pointer dereference if there's no "catch-all" DMAR unit in the
> tables, which puts it back into the same camp as ARM and Power.

I think x86 may get a bit of a free pass here.  AFAIK the QEMU IOMMU
implementation on x86 has always been "experimental", so it just might
be okay to change it in a way that causes some older kernels to OOPS.

--Andy



Re: [Qemu-devel] [PATCH v3 1/4] firmware: introduce sysfs driver for QEMU's fw_cfg device

2015-10-06 Thread Andy Lutomirski
On Sat, Oct 3, 2015 at 4:28 PM, Gabriel L. Somlo  wrote:
> From: Gabriel Somlo 
>
> Make fw_cfg entries of type "file" available via sysfs. Entries
> are listed under /sys/firmware/qemu_fw_cfg/by_key, in folders
> named after each entry's selector key. Filename, selector value,
> and size read-only attributes are included for each entry. Also,
> a "raw" attribute allows retrieval of the full binary content of
> each entry.
>
> This patch also provides a documentation file outlining the
> guest-side "hardware" interface exposed by the QEMU fw_cfg device.
>

What's the status of "by_name"?  There's a single (presumably
incorrect) mention of it in a comment in this patch.

I would prefer if the kernel populated by_name itself rather than
deferring that to udev, since I'd like to use this facility in virtme,
and I'd like to use fw_cfg very early on boot before I even start
udev.

--Andy



Re: [Qemu-devel] [PATCH 10/17] mm: rmap preparation for remap_anon_pages

2014-10-07 Thread Andy Lutomirski
On Tue, Oct 7, 2014 at 8:52 AM, Andrea Arcangeli <aarca...@redhat.com> wrote:
> On Tue, Oct 07, 2014 at 04:19:13PM +0200, Andrea Arcangeli wrote:
>> mremap like interface, or file+commands protocol interface. I tend to
>> like mremap more, that's why I opted for a remap_anon_pages syscall
>> kept orthogonal to the userfaultfd functionality (remap_anon_pages
>> could also be used standalone as an accelerated mremap in some
>> circumstances) but nothing prevents just embedding the same mechanism
>
> Sorry for the self-followup, but something else comes to mind to
> elaborate this further.
>
> In terms of interfaces, the most efficient I could think of to minimize
> the enter/exit kernel, would be to append the source address of the
> data received from the network transport, to the userfaultfd_write()
> command (by appending 8 bytes to the wakeup command). Said that,
> mixing the mechanism to be notified about userfaults with the
> mechanism to resolve a userfault looks like a complication to me. I kind
> of liked to keep the userfaultfd protocol very simple, doing
> just its thing. The userfaultfd doesn't need to know how the userfault
> was resolved; even mremap would work theoretically (until we run out
> of vmas). I thought it was simpler to keep it that way. However if we
> want to resolve the fault with a write() syscall this may be the
> most efficient way to do it, as we're already doing a write() into the
> pseudofd to wake up the page fault that contains the destination
> address; I just need to append the source address to the wakeup command.
>
> I probably grossly overestimated the benefits of resolving the
> userfault with a zerocopy page move, sorry. So if we entirely drop the
> zerocopy behavior and the TLB flush of the old page like you
> suggested, the way to keep the userfaultfd mechanism decoupled from
> the userfault resolution mechanism would be to implement an
> atomic-copy syscall. That would work for SIGBUS userfaults too without
> requiring a pseudofd then. It would be enough then to call
> mcopy_atomic(userfault_addr,tmp_addr,len) with the only constraint
> that len must be a multiple of PAGE_SIZE. Of course mcopy_atomic
> wouldn't page fault or call GUP into the destination address (it can't,
> otherwise the in-flight partial copy would be visible to the process,
> breaking the atomicity of the copy), but it would fill in the
> pte/trans_huge_pmd with the same strict behavior that remap_anon_pages
> currently has (in turn it would by design bypass the VM_USERFAULT
> check and be ideal for resolving userfaults).

At the risk of asking a possibly useless question, would it make sense
to splice data into a userfaultfd?

--Andy


> mcopy_atomic could then also be extended to tmpfs, and it would work
> without requiring the source page to be a tmpfs page too, without
> having to convert page types on the fly.
>
> If I add mcopy_atomic, the patch in subject (10/17) can be dropped of
> course, so it'd be even less intrusive than the current
> remap_anon_pages, and it would require zero TLB flushes during its
> runtime (it would just require an atomic copy).
>
> So should I try to embed a mcopy_atomic inside userfault_write or can
> I expose it to userland as a standalone new syscall? Or should I do
> something different? Comments?
>
> Thanks,
> Andrea



-- 
Andy Lutomirski
AMA Capital Management, LLC



Re: [Qemu-devel] [PATCH RFC 00/11] qemu: towards virtio-1 host support

2014-10-07 Thread Andy Lutomirski
On 10/07/2014 07:39 AM, Cornelia Huck wrote:
> This patchset aims to get us some way to implement virtio-1 compliant
> and transitional devices in qemu. Branch available at
>
> git://github.com/cohuck/qemu virtio-1
>
> I've mainly focused on:
> - endianness handling
> - extended feature bits
> - virtio-ccw new/changed commands

At the risk of some distraction, would it be worth thinking about a
solution to the IOMMU bypassing mess as part of this?

--Andy



[Qemu-devel] [PATCH qemu] i386, linux-headers: Add support for kvm_get_rng_seed

2014-07-16 Thread Andy Lutomirski
This updates x86's kvm_para.h for the feature bit definition and
target-i386/cpu.c for the feature name and default.

Signed-off-by: Andy Lutomirski <l...@amacapital.net>
---
 linux-headers/asm-x86/kvm_para.h | 2 ++
 target-i386/cpu.c| 5 +++--
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/linux-headers/asm-x86/kvm_para.h b/linux-headers/asm-x86/kvm_para.h
index e41c5c1..a9b27ce 100644
--- a/linux-headers/asm-x86/kvm_para.h
+++ b/linux-headers/asm-x86/kvm_para.h
@@ -24,6 +24,7 @@
 #define KVM_FEATURE_STEAL_TIME 5
 #define KVM_FEATURE_PV_EOI 6
 #define KVM_FEATURE_PV_UNHALT  7
+#define KVM_FEATURE_GET_RNG_SEED   8
 
 /* The last 8 bits are used to indicate how to interpret the flags field
  * in pvclock structure. If no bits are set, all flags are ignored.
@@ -40,6 +41,7 @@
 #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
 #define MSR_KVM_STEAL_TIME  0x4b564d03
 #define MSR_KVM_PV_EOI_EN  0x4b564d04
+#define MSR_KVM_GET_RNG_SEED 0x4b564d05
 
 struct kvm_steal_time {
__u64 steal;
diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 8fd1497..4ea7e6c 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -236,7 +236,7 @@ static const char *ext4_feature_name[] = {
 static const char *kvm_feature_name[] = {
 "kvmclock", "kvm_nopiodelay", "kvm_mmu", "kvmclock",
 "kvm_asyncpf", "kvm_steal_time", "kvm_pv_eoi", "kvm_pv_unhalt",
-NULL, NULL, NULL, NULL,
+"kvm_get_rng_seed", NULL, NULL, NULL,
 NULL, NULL, NULL, NULL,
 NULL, NULL, NULL, NULL,
 NULL, NULL, NULL, NULL,
@@ -368,7 +368,8 @@ static uint32_t kvm_default_features[FEATURE_WORDS] = {
 (1 << KVM_FEATURE_ASYNC_PF) |
 (1 << KVM_FEATURE_STEAL_TIME) |
 (1 << KVM_FEATURE_PV_EOI) |
-(1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT),
+(1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
+(1 << KVM_FEATURE_GET_RNG_SEED),
 [FEAT_1_ECX] = CPUID_EXT_X2APIC,
 };
 
-- 
1.9.3




Re: [Qemu-devel] [PATCH 00/10] RFC: userfault

2014-07-02 Thread Andy Lutomirski
On 07/02/2014 09:50 AM, Andrea Arcangeli wrote:
 Hello everyone,
 
 There's a large CC list for this RFC because this adds two new
 syscalls (userfaultfd and remap_anon_pages) and
 MADV_USERFAULT/MADV_NOUSERFAULT, so suggestions on changes to the API
 or on a completely different API if somebody has better ideas are
 welcome now.

cc:linux-api -- this is certainly worthy of linux-api discussion.

 
 The combination of these features are what I would propose to
 implement postcopy live migration in qemu, and in general demand
 paging of remote memory, hosted in different cloud nodes.
 
 The MADV_USERFAULT feature should be generic enough that it can
 provide the userfaults to the Android volatile range feature too, on
 access of reclaimed volatile pages.
 
 If the access could ever happen in kernel context through syscalls
 (not just from userland context), then userfaultfd has to be used
 to make the userfault unnoticeable to the syscall (no error will be
 returned). This latter feature is more advanced than what volatile
 ranges alone could do with SIGBUS so far (but it's optional, if the
 process doesn't call userfaultfd, the regular SIGBUS will fire, if the
 fd is closed SIGBUS will also fire for any blocked userfault that was
 waiting a userfaultfd_write ack).
 
 userfaultfd is also a generic enough feature, that it allows KVM to
 implement postcopy live migration without having to modify a single
 line of KVM kernel code. Guest async page faults, FOLL_NOWAIT and all
 other GUP features work just fine in combination with userfaults
 (userfaults trigger async page faults in the guest scheduler so those
 guest processes that aren't waiting for userfaults can keep running in
 the guest vcpus).
 
 remap_anon_pages is the syscall to use to resolve the userfaults (it's
 not mandatory, vmsplice will likely still be used in the case of local
 postcopy live migration just to upgrade the qemu binary, but
 remap_anon_pages is faster and ideal for transferring memory across
 the network, it's zerocopy and doesn't touch the vma: it only holds
 the mmap_sem for reading).
 
 The current behavior of remap_anon_pages is very strict to avoid any
 chance of memory corruption going unnoticed. mremap is not strict like
 that: if there's a synchronization bug it would drop the destination
 range silently resulting in subtle memory corruption for
 example. remap_anon_pages would return -EEXIST in that case. If there
 are holes in the source range remap_anon_pages will return -ENOENT.
 
 If remap_anon_pages is used always with 2M naturally aligned
 addresses, transparent hugepages will not be split. If there could
 be 4k (or any size) holes in the 2M (or any size) source range,
 remap_anon_pages should be used with the RAP_ALLOW_SRC_HOLES flag to
 relax some of its strict checks (-ENOENT won't be returned if
 RAP_ALLOW_SRC_HOLES is set, remap_anon_pages then will just behave as
 a noop on any hole in the source range). This flag is generally useful
 when implementing userfaults with THP granularity, but it shouldn't be
 set if doing the userfaults with PAGE_SIZE granularity if the
 developer wants to benefit from the strict -ENOENT behavior.
 
 The remap_anon_pages syscall API is not vectored, as I expect it to be
 used mainly for demand paging (where there can be just one faulting
 range per userfault) or for large ranges (with the THP model as an
 alternative to zapping re-dirtied pages with MADV_DONTNEED with 4k
 granularity before starting the guest in the destination node) where
 vectoring isn't going to provide much performance advantages (thanks
 to the THP coarser granularity).
 
 On the rmap side remap_anon_pages doesn't add much complexity: there's
 no need of nonlinear anon vmas to support it because I added the
 constraint that it will fail if the mapcount is more than 1. So in
 general the source range of remap_anon_pages should be marked
 MADV_DONTFORK to prevent any risk of failure if the process ever
 forks (like qemu can in some case).
 
 One part that hasn't been tested is the poll() syscall on the
 userfaultfd because the postcopy migration thread currently is more
 efficient waiting on blocking read()s (I'll write some code to test
 poll() too). I also appended below a patch to trinity to exercise
 remap_anon_pages and userfaultfd and it completes trinity
 successfully.
 
 The code can be found here:
 
 git clone --reference linux 
 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git -b userfault 
 
 The branch is rebased so you can get updates for example with:
 
 git fetch && git checkout -f origin/userfault
 
 Comments welcome, thanks!
 Andrea
 
 From cbe940e13b4cead41e0f862b3abfa3814f235ec3 Mon Sep 17 00:00:00 2001
 From: Andrea Arcangeli aarca...@redhat.com
 Date: Wed, 2 Jul 2014 18:32:35 +0200
 Subject: [PATCH] add remap_anon_pages and userfaultfd
 
 Signed-off-by: Andrea Arcangeli aarca...@redhat.com
 ---
  include/syscalls-x86_64.h   |   2 +
  syscalls/remap_anon_pages.c | 100 
 

Re: [Qemu-devel] [PATCH 08/10] userfaultfd: add new syscall to provide memory externalization

2014-07-02 Thread Andy Lutomirski
On 07/02/2014 09:50 AM, Andrea Arcangeli wrote:
 Once a userfaultfd is created, MADV_USERFAULT regions talk through
 the userfaultfd protocol with the thread responsible for doing the
 memory externalization of the process.
 
 The protocol starts by userland writing the requested/preferred
 USERFAULT_PROTOCOL version into the userfault fd (64bit write), if
 kernel knows it, it will ack it by allowing userland to read 64bit
 from the userfault fd that will contain the same 64bit
 USERFAULT_PROTOCOL version that userland asked. Otherwise userland
 will read the __u64 value -1ULL (aka USERFAULTFD_UNKNOWN_PROTOCOL) and it
 will have to try again by writing an older protocol version if
 suitable for its usage too, and read it back again until it stops
 reading -1ULL. After that the userfaultfd protocol starts.
 
 The protocol then consists of 64bit reads from the userfault fd
 providing userland the fault addresses. After a userfault address has
 been read and the fault is resolved by userland, the application must
 write back 128bits in the form of [ start, end ] range (64bit each)
 that will tell the kernel such a range has been mapped. Multiple read
 userfaults can be resolved in a single range write. poll() can be used
 to know when there are new userfaults to read (POLLIN) and when there
 are threads waiting a wakeup through a range write (POLLOUT).
 
 Signed-off-by: Andrea Arcangeli aarca...@redhat.com

 +#ifdef CONFIG_PROC_FS
 +static int userfaultfd_show_fdinfo(struct seq_file *m, struct file *f)
 +{
 + struct userfaultfd_ctx *ctx = f->private_data;
 + int ret;
 + wait_queue_t *wq;
 + struct userfaultfd_wait_queue *uwq;
 + unsigned long pending = 0, total = 0;
 +
 + spin_lock(&ctx->fault_wqh.lock);
 + list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
 + uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
 + if (uwq->pending)
 + pending++;
 + total++;
 + }
 + spin_unlock(&ctx->fault_wqh.lock);
 +
 + ret = seq_printf(m, "pending:\t%lu\ntotal:\t%lu\n", pending, total);

This should show the protocol version, too.

 +
 +SYSCALL_DEFINE1(userfaultfd, int, flags)
 +{
 + int fd, error;
 + struct file *file;

This looks like it can't be used more than once in a process.  That will
be unfortunate for libraries.  Would it be feasible to either have
userfaultfd claim a range of addresses or for a vma to be explicitly
associated with a userfaultfd?  (In the latter case, giant PROT_NONE
MAP_NORESERVE mappings could be used.)




Re: [Qemu-devel] Turning off default storage devices?

2014-04-16 Thread Andy Lutomirski
On Mon, Apr 14, 2014 at 1:15 AM, Markus Armbruster arm...@redhat.com wrote:
 Peter Crosthwaite peter.crosthwa...@xilinx.com writes:

 Hi Andy,

 On Thu, Apr 10, 2014 at 5:55 AM, Andy Lutomirski l...@amacapital.net wrote:
 Currently, -M q35 boots linux quite a bit slower than the default
 machine type.  This seems to be because it takes a few hundred ms to
 determine that there's nothing attached to the AHCI controller.

 In virtio setups, there will probably never be anything attached to
 the AHCI controller.  Would it be possible to add something like
 -machine default_storage=off to turn off default storage devices?
 This could include the AHCI on q35 and the cdrom and such on pc.

 There's precedent: -machine usb=off turns off the default USB
 controllers, which is great for setups that use xhci.


 Is there a more generic solution to your problem? Can you implement
 command line device removal in a non specific way and avoid having to
 invent AHCI or even storage specific arguments. You could
 consider bringing the xhci use case you mentioned under the same
 umbrella.

 USB has always been off by default, at least for the boards I'm familiar
 with, due to the USB emulation's non-trivial CPU use.

 There's no such thing as a Q35 board without USB in the physical world.
 Can't stop us from making a virtual one, of course.

 Likewise, there's no such thing as a Q35 board without AHCI in the
 physical world, and again that can't stop us from making a virtual one.

 The difference to USB is that our q35 machines have always had AHCI even
 with -nodefaults.  You seem to propose adding a switch to disable AHCI,
 yet leave it enabled with -nodefaults.

 -nodefaults should give you a board with all the optional components
 suppressed.

Will this break libvirt, which may expect -nodefaults to still come
with an IDE bus?


 On the one hand, I'd rather not add exceptions to -nodefaults' "give me
 the board with all its optional components suppressed" semantics.

 On the other hand, a few hundred ms are a long time.

That's why I proposed a new option.  Yes, it's ugly :/

--Andy



Re: [Qemu-devel] Turning off default storage devices?

2014-04-13 Thread Andy Lutomirski
On Wed, Apr 9, 2014 at 8:13 PM, Peter Crosthwaite
peter.crosthwa...@xilinx.com wrote:
 On Thu, Apr 10, 2014 at 9:57 AM, Andy Lutomirski l...@amacapital.net wrote:
 On Wed, Apr 9, 2014 at 4:53 PM, Peter Crosthwaite
 peter.crosthwa...@xilinx.com wrote:
 Hi Andy,

 On Thu, Apr 10, 2014 at 5:55 AM, Andy Lutomirski l...@amacapital.net 
 wrote:
 Currently, -M q35 boots linux quite a bit slower than the default
 machine type.  This seems to be because it takes a few hundred ms to
 determine that there's nothing attached to the AHCI controller.

 In virtio setups, there will probably never be anything attached to
 the AHCI controller.  Would it be possible to add something like
 -machine default_storage=off to turn off default storage devices?
 This could include the AHCI on q35 and the cdrom and such on pc.

 There's precedent: -machine usb=off turns off the default USB
 controllers, which is great for setups that use xhci.


 Is there a more generic solution to your problem? Can you implement
 command line device removal in a non specific way and avoid having to
 invent AHCI or even storage specific arguments. You could
 consider bringing the xhci use case you mentioned under the same
 umbrella.

 An option like -suppress-default-device foobar to turn off the device
 named foobar would work, but what happens if that device is a bus?

 Let's call that a misuse in the first instance. But in general, when
 attaching devices QEMU should be able to gracefully fail on unresolved
 deps. So it would be reasonable to work on that assumption given that
 every device should be able to handle a missing bus/gpio/interrupt
 etc. due to -device misuseability.

 Will this just cause QEMU to crash?  Maybe the machine code would have
 to opt in to allowing this kind of suppression, and there could be a
 general error of you try to suppress a device that can't be
 suppressed.


 I would argue that there is no such thing. You may end up with a
 useless machine but its still valid to supress something and then by
 extension all its dependants are non functional.

The q35 code is:

/* ahci and SATA device, for q35 1 ahci controller is built-in */
ahci = pci_create_simple_multifunction(host_bus,
                                       PCI_DEVFN(ICH9_SATA1_DEV,
                                                 ICH9_SATA1_FUNC),
                                       true, "ich9-ahci");
idebus[0] = qdev_get_child_bus(&ahci->qdev, "ide.0");
idebus[1] = qdev_get_child_bus(&ahci->qdev, "ide.1");

It looks like making pci_create_simple_multifunction return null will
crash quite quickly.  Even fixing the next two lines will just cause
null pointer dereferences later on.

Is there a different way to indicate that a device wasn't actually created?

--Andy



[Qemu-devel] Turning off default storage devices?

2014-04-09 Thread Andy Lutomirski
Currently, -M q35 boots linux quite a bit slower than the default
machine type.  This seems to be because it takes a few hundred ms to
determine that there's nothing attached to the AHCI controller.

In virtio setups, there will probably never be anything attached to
the AHCI controller.  Would it be possible to add something like
-machine default_storage=off to turn off default storage devices?
This could include the AHCI on q35 and the cdrom and such on pc.

There's precedent: -machine usb=off turns off the default USB
controllers, which is great for setups that use xhci.

Thanks,
Andy



Re: [Qemu-devel] Turning off default storage devices?

2014-04-09 Thread Andy Lutomirski
On Wed, Apr 9, 2014 at 4:53 PM, Peter Crosthwaite
peter.crosthwa...@xilinx.com wrote:
 Hi Andy,

 On Thu, Apr 10, 2014 at 5:55 AM, Andy Lutomirski l...@amacapital.net wrote:
 Currently, -M q35 boots linux quite a bit slower than the default
 machine type.  This seems to be because it takes a few hundred ms to
 determine that there's nothing attached to the AHCI controller.

 In virtio setups, there will probably never be anything attached to
 the AHCI controller.  Would it be possible to add something like
 -machine default_storage=off to turn off default storage devices?
 This could include the AHCI on q35 and the cdrom and such on pc.

 There's precedent: -machine usb=off turns off the default USB
 controllers, which is great for setups that use xhci.


 Is there a more generic solution to your problem? Can you implement
 command line device removal in a non specific way and avoid having to
 invent AHCI or even storage specific arguments. You could
 consider bringing the xhci use case you mentioned under the same
 umbrella.

An option like -suppress-default-device foobar to turn off the device
named foobar would work, but what happens if that device is a bus?
Will this just cause QEMU to crash?  Maybe the machine code would have
to opt in to allowing this kind of suppression, and there could be a
general error if you try to suppress a device that can't be
suppressed.

I can try to code this up, but I know nothing about QEMU internals.
I'm just a user :)

--Andy



[Qemu-devel] Framebuffer corruption in QEMU or Linux's cirrus driver

2014-04-02 Thread Andy Lutomirski
Running:

./virtme-run --installed-kernel

from this virtme commit:

https://git.kernel.org/cgit/utils/kernel/virtme/virtme.git/commit/?id=2b409a086d15b7a878c7d5204b1f44a6564a341f

results in a bunch of missing lines of text once bootup finishes.
Pressing enter a few times gradually fixes it.

I don't know whether this is a qemu bug or a Linux bug.

I'm seeing this on Fedora's 3.13.7 kernel and on a fairly recent
3.14-rc kernel.  For the latter, cirrus is built-in (not a module),
I'm running:

virtme-run --kimg arch/x86/boot/bzImage

and I see more profound corruption.

--Andy



Re: [Qemu-devel] Framebuffer corruption in QEMU or Linux's cirrus driver

2014-04-02 Thread Andy Lutomirski
On Tue, Apr 1, 2014 at 3:09 PM, Andy Lutomirski l...@amacapital.net wrote:
 Running:

 ./virtme-run --installed-kernel

 from this virtme commit:

 https://git.kernel.org/cgit/utils/kernel/virtme/virtme.git/commit/?id=2b409a086d15b7a878c7d5204b1f44a6564a341f

 results in a bunch of missing lines of text once bootup finishes.
 Pressing enter a few times gradually fixes it.

 I don't know whether this is a qemu bug or a Linux bug.

 I'm seeing this on Fedora's 3.13.7 kernel and on a fairly recent
 3.14-rc kernel.  For the latter, cirrus is built-in (not a module),
 I'm running:

 virtme-run --kimg arch/x86/boot/bzImage

 and I see more profound corruption.

I'm guessing this is a cirrus drm bug.  bochs-drm (using virtme-run
--installed-kernel --qemu-opts -vga std) does not appear to have the
same issue.  Neither does qxl.

(qxl is painfully slow, though, and it doesn't seem to be using UC memory.)

--Andy



[Qemu-devel] Re: [PATCH-V4 0/7] virtio-9p:Introducing security model for VirtFS

2010-05-26 Thread Andy Lutomirski

Venkateswararao Jujjuri (JV) wrote:

This patch series introduces the security model for VirtFS.

Brief description of this patch series:

It introduces two type of security models for VirtFS.
They are: mapped and passthrough.

The following is common to both security models.

* Client's VFS determines/enforces the access control.
  Largely, the server should never return EACCES.

* Client sends gid/mode-bit information as part of creation only.

Changes from V3
---
o Return NULL instead of exit(1) on failure in virtio_9p_init()
o Capitalized sm_passthrough, sm_mapped
o Added handling for EINTR for read/write.
o Corrected default permissions for mkdir in mapped mode.
o Added additional error handling.

Changes from V2
---
o Removed warnings resulting from chmod/chown.
o Added code to fail normally if security_model option is not specified.

Changes from V1
---
o Added support for chmod and chown.
o Used chmod/chown to set credentials instead of setuid/setgid.
o Fixed a bug where uid was used instead of gid.


Security model: mapped
--

VirtFS server(QEMU) intercepts and maps all the file object create requests.
Files on the fileserver will be created with QEMU's user credentials and the
client-user's credentials are stored in extended attributes.
During getattr() server extracts the client-user's credentials from extended
attributes and sends to the client.

Given that only the user space extended attributes are available to regular
files, special files are created as regular files on the fileserver and the
appropriate mode bits are stored in xattrs and will be extracted during
getattr.

If the extended attributes are missing, server sends back the filesystem
stat() unaltered. This provision will make the files created on the
fileserver usable to client.

Points to be considered

* Filesystem will be VirtFS'ized. Meaning, other filesystems may not
 understand the credentials of the files created under this model.


How hard would it be to make this compatible with rsync's --fake-super? 
 (--fake-super already does almost what you're doing, and if you make 
the formats compatible, then rsync could be used to translate.  OTOH, 
rsyncing a VirtFS-ified filesystem to a remote --fake-super system might 
have odd side-effects.)


--Andy