Re: [Qemu-devel] [PATCH 00/23] userfaultfd v4

2015-05-22 Thread Kirill Smelkov
Hi Andrea,

On Thu, May 21, 2015 at 05:52:51PM +0200, Andrea Arcangeli wrote:
 Hi Kirill,
 
 On Thu, May 21, 2015 at 04:11:11PM +0300, Kirill Smelkov wrote:
  Sorry for maybe speaking up too late, but here is additional real
 
 Not too late, in fact I don't think there's any change required for
 this at this stage, but it'd be great if you could help me to review.

Thanks


  Since arrays can be large, it would be slow and thus not practical to
 [..]
  So I've implemented a scheme where array data is initially PROT_READ
  protected, then we catch SIGSEGV, if it is write and area belongs to array
 
 In the case of postcopy live migration (for qemu and/or containers) and
 postcopy live snapshotting, splitting the vmas is not an option
 because we may run out of them.
 
 If your PROT_READ areas are limited perhaps this isn't an issue, but
 with hundreds-of-GB guests (currently plenty in production) that need
 to live migrate fully reliably and fast, the vmas could exceed the
 limit if we were to use mprotect. If your arrays are very large and
 the PROT_READ areas aren't limited, then for you too userfaultfd isn't
 only an optimization, it's actually a must to avoid a potential
 -ENOMEM.

I understand. To mitigate this issue, for every array/file I allocate
RAM pages from a separate file on tmpfs at the same offset. This way,
if we allocate a lot of pages and mmap them in with PROT_READ, and they
are adjacent to each other, the kernel merges the adjacent vmas into
one vma:


https://lab.nexedi.cn/kirr/wendelin.core/blob/ca064f75/include/wendelin/bigfile/ram.h#L100

https://lab.nexedi.cn/kirr/wendelin.core/blob/ca064f75/bigfile/ram_shmfs.c#L102

https://lab.nexedi.cn/kirr/wendelin.core/blob/ca064f75/bigfile/virtmem.c#L435

I agree this is only a half-measure - the file parts accessed could be
sparse, and there is also in-shmfs overhead for maintaining tables of
which real pages are allocated to which part of the file on shmfs.

So yes, if userfaultfd allows bypassing the vma layer and working
directly on page tables, that helps.


  Also, since arrays could be large - bigger than RAM, and only sparse
  parts of it could be needed to get needed information, for reading it
  also makes sense to lazily load data in SIGSEGV handler with initial
  PROT_NONE protection.
 
 Similarly I heard somebody wrote a fastresume to load the suspended
 (on-disk) guest ram using userfaultfd. That is a slightly less
 fundamental case than postcopy because you could also do it with
 MAP_SHARED, but it's still interesting in that it allows compressing
 or decompressing the suspended ram on the fly, with lz4 for example,
 something MAP_PRIVATE/MAP_SHARED wouldn't do (plus there's the
 additional benefit of not having an orphaned inode left open even if
 the file is deleted, which would prevent unmounting the filesystem
 for the whole lifetime of the guest).

I see. Just a note - transparent compression/decompression could be
achieved with MAP_SHARED if the compression is being performed by
underlying filesystem - e.g. implemented with FUSE.

( I have not measured performance though )


  This is very similar to how memory mapped files work, but adds
  transactionality which, as far as I know, is not provided by any
  currently in-kernel filesystem on Linux.
 
 That's another benefit yes.
 
  The gist of virtual memory-manager is this:
  
  
  https://lab.nexedi.cn/kirr/wendelin.core/blob/master/include/wendelin/bigfile/virtmem.h
  https://lab.nexedi.cn/kirr/wendelin.core/blob/master/bigfile/virtmem.c  
  (vma_on_pagefault)
 
 I'll check it more in detail ASAP, thanks for the pointers!
 
  For operations it currently needs
  
  - establishing virtual memory areas and connecting to tracking it
 
 That's the UFFDIO_REGISTER/UNREGISTER.

Yes

  - changing pages protection
  
  PROT_NONE or absent - initially
 
 absent is what works with -mm already. The lazy loading already works.

Yes

  PROT_NONE -> PROT_READ - after read
 
 Current UFFDIO_COPY will map it using vma->vm_page_prot.
 
 We'll need a new flag for UFFDIO_COPY to map it readonly. This is
 already contemplated:
 
   /*
    * There will be a wrprotection flag later that allows to map
    * pages wrprotected on the fly. And such a flag will be
    * available if the wrprotection ioctl are implemented for the
    * range according to the uffdio_register.ioctls.
    */
   #define UFFDIO_COPY_MODE_DONTWAKE ((__u64)1<<0)
   __u64 mode;
 
 If the memory protection framework exists (either through the
 uffdio_register.ioctl out value, or through uffdio_api.features
 out-only value) you can pass a new flag (MODE_WP) above to transition
 from absent to PROT_READ.

Yes. The same probably applies to UFFDIO_ZEROPAGE (to mmap-in the
zeropage as RO on read, if that part of the file is currently a hole)

So we settle on adding

UFFDIO_COPY_MODE_WP and
UFFDIO_ZEROPAGE_MODE_WP
?

Or 

Re: [Qemu-devel] [PATCH 00/23] userfaultfd v4

2015-05-21 Thread Kirill Smelkov
Hello up there,

On Thu, May 14, 2015 at 07:30:57PM +0200, Andrea Arcangeli wrote:
 Hello everyone,
 
 This is the latest userfaultfd patchset against mm-v4.1-rc3
 2015-05-14-10:04.
 
 The postcopy live migration feature on the qemu side is mostly ready
 to be merged and it entirely depends on the userfaultfd syscall to be
 merged as well. So it'd be great if this patchset could be reviewed
 for merging in -mm.
 
 Userfaults allow to implement on demand paging from userland and more
 generally they allow userland to more efficiently take control of the
 behavior of page faults than what was available before
 (PROT_NONE + SIGSEGV trap).
 
 The use cases are:

[...]

 Even though there wasn't a real use case requesting it yet, it also
 allows to implement distributed shared memory in a way that readonly
 shared mappings can exist simultaneously in different hosts and they
 can become exclusive at the first wrprotect fault.

Sorry for maybe speaking up too late, but here is additional real
potential use-case which in my view is overlapping with the above:

Recently we needed to implement persistency for NumPy arrays - that
is, to track changes made to array memory and transactionally either
abandon the changes on transaction abort, or store them back to
storage on transaction commit.

Since arrays can be large, it would be slow and thus not practical to
keep a copy of the original data and compare memory against it to find
which array parts have been changed.

So I've implemented a scheme where array data is initially PROT_READ
protected; then we catch SIGSEGV, and if it is a write and the area
belongs to array data, we mark that page as PROT_WRITE and continue.
At commit time we know which parts were modified.

Also, since arrays could be large - bigger than RAM - and only sparse
parts of them may be needed, for reading it also makes sense to lazily
load data in the SIGSEGV handler, with initial PROT_NONE protection.

This is very similar to how memory mapped files work, but adds
transactionality which, as far as I know, is not provided by any
currently in-kernel filesystem on Linux.

The system is implemented as files, and arrays are then built on top
of such memory-mapped files. So from now on we can forget about NumPy
arrays and only talk about files, their mapping, lazy loading, and
transactionally storing in-memory changes back to file storage.

To get this working, a custom user-space virtual memory manager is
implemented, which manages RAM pages and file mappings into the
virtual address space, tracks page protection, and handles SIGSEGV
appropriately.


The gist of virtual memory-manager is this:


https://lab.nexedi.cn/kirr/wendelin.core/blob/master/include/wendelin/bigfile/virtmem.h
https://lab.nexedi.cn/kirr/wendelin.core/blob/master/bigfile/virtmem.c  
(vma_on_pagefault)


For operations it currently needs

- establishing virtual memory areas and connecting to tracking it

- changing pages protection

PROT_NONE or absent - initially
PROT_NONE       -> PROT_READ                    - after read
PROT_READ       -> PROT_READWRITE               - after write
PROT_READWRITE  -> PROT_READ                    - after commit
PROT_READWRITE  -> PROT_NONE or absent (again)  - after abort
PROT_READ       -> PROT_NONE or absent (again)  - on reclaim

- working with aliasable memory (thus taken from tmpfs)

there could be two overlapping-in-file mappings of a file (array)
requested at different times, and changes in one mapping should
propagate to the other - for common parts only one page should be
memory-mapped into two places in the address space.

So what is currently lacking on the userfaultfd side is:

- the ability to remove / make PROT_NONE already-mapped pages
  (UFFDIO_REMAP was recently dropped)

- the ability to arbitrarily change page protection (e.g. RW -> R)

- injecting aliasable memory from tmpfs (or better hugetlbfs) into
  several places (UFFDIO_REMAP + some mapping-copy semantic).


The code is ugly because it is only a prototype. You can clone/read it
all from here:

https://lab.nexedi.cn/kirr/wendelin.core

The virtual memory manager even has tests, and from them it can be
seen how the system is supposed to work (after each access - which
pages are mapped where and how):


https://lab.nexedi.cn/kirr/wendelin.core/blob/master/bigfile/tests/test_virtmem.c

The performance currently is not great, partly because of page
clearing when getting RAM from tmpfs, and partly because of
mprotect/SIGSEGV/vma overhead and other dumb things on my side.

I still wanted to show the case, as userfaultfd here has the potential
to remove the kernel-related overhead.

Thanks beforehand for feedback,

Kirill


P.S. some context

http://www.wendelin.io/NXD-Wendelin.Core.Non.Secret/asEntireHTML



Re: [Qemu-devel] [PATCH 00/23] userfaultfd v4

2015-05-21 Thread Andrea Arcangeli
Hi Kirill,

On Thu, May 21, 2015 at 04:11:11PM +0300, Kirill Smelkov wrote:
 Sorry for maybe speaking up too late, but here is additional real

Not too late, in fact I don't think there's any change required for
this at this stage, but it'd be great if you could help me to review.

 Since arrays can be large, it would be slow and thus not practical to
[..]
 So I've implemented a scheme where array data is initially PROT_READ
 protected, then we catch SIGSEGV, if it is write and area belongs to array

In the case of postcopy live migration (for qemu and/or containers) and
postcopy live snapshotting, splitting the vmas is not an option
because we may run out of them.

If your PROT_READ areas are limited perhaps this isn't an issue, but
with hundreds-of-GB guests (currently plenty in production) that need
to live migrate fully reliably and fast, the vmas could exceed the
limit if we were to use mprotect. If your arrays are very large and
the PROT_READ areas aren't limited, then for you too userfaultfd isn't
only an optimization, it's actually a must to avoid a potential
-ENOMEM.

 Also, since arrays could be large - bigger than RAM, and only sparse
 parts of it could be needed to get needed information, for reading it
 also makes sense to lazily load data in SIGSEGV handler with initial
 PROT_NONE protection.

Similarly I heard somebody wrote a fastresume to load the suspended
(on-disk) guest ram using userfaultfd. That is a slightly less
fundamental case than postcopy because you could also do it with
MAP_SHARED, but it's still interesting in that it allows compressing
or decompressing the suspended ram on the fly, with lz4 for example,
something MAP_PRIVATE/MAP_SHARED wouldn't do (plus there's the
additional benefit of not having an orphaned inode left open even if
the file is deleted, which would prevent unmounting the filesystem for
the whole lifetime of the guest).

 This is very similar to how memory mapped files work, but adds
 transactionality which, as far as I know, is not provided by any
 currently in-kernel filesystem on Linux.

That's another benefit yes.

 The gist of virtual memory-manager is this:
 
 
 https://lab.nexedi.cn/kirr/wendelin.core/blob/master/include/wendelin/bigfile/virtmem.h
 https://lab.nexedi.cn/kirr/wendelin.core/blob/master/bigfile/virtmem.c  
 (vma_on_pagefault)

I'll check it more in detail ASAP, thanks for the pointers!

 For operations it currently needs
 
 - establishing virtual memory areas and connecting to tracking it

That's the UFFDIO_REGISTER/UNREGISTER.

 - changing pages protection
 
 PROT_NONE or absent - initially

absent is what works with -mm already. The lazy loading already works.

 PROT_NONE -> PROT_READ - after read

Current UFFDIO_COPY will map it using vma->vm_page_prot.

We'll need a new flag for UFFDIO_COPY to map it readonly. This is
already contemplated:

/*
 * There will be a wrprotection flag later that allows to map
 * pages wrprotected on the fly. And such a flag will be
 * available if the wrprotection ioctl are implemented for the
 * range according to the uffdio_register.ioctls.
 */
#define UFFDIO_COPY_MODE_DONTWAKE   ((__u64)1<<0)
__u64 mode;

If the memory protection framework exists (either through the
uffdio_register.ioctl out value, or through uffdio_api.features
out-only value) you can pass a new flag (MODE_WP) above to transition
from absent to PROT_READ.

 PROT_READ -> PROT_READWRITE - after write

This will need to add UFFDIO_MPROTECT.

 PROT_READWRITE -> PROT_READ - after commit

UFFDIO_MPROTECT again (but harder if going from rw to ro, because of a
slight mess to solve with regard to FAULT_FLAG_TRIED, in case you want
to run this UFFDIO_MPROTECT without stopping the threads that are
accessing the memory concurrently).

And this should only work if the uffdio_register.mode had MODE_WP set,
so we don't run into the races created by COWs (gup vs fork race).

 PROT_READWRITE -> PROT_NONE or absent (again) - after abort

UFFDIO_MPROTECT again, but you won't be able to read the page contents
inside the memory manager thread (the one working with
userfaultfd).

The manager is at all times forbidden to touch the memory it is
tracking with userfaultfd (if it does, it'll deadlock, but kill -9
will get rid of it). gdb, ironically, because it uses an
underoptimized access_process_vm, wouldn't hang: FAULT_FLAG_RETRY
won't be set in handle_userfault in the gdb context, and it'll just
receive a SIGBUS if by mistake the user tries to touch the memory.
Even if it hangs later once get_user_pages_locked|unlocked gets used
there too, kill -9 would solve gdb too.

Back to the problem of accessing the UFFDIO_MPROTECT(PROT_NONE)
memory: to do that, a new ioctl would be required. I'd rather not go
back to the route of UFFDIO_REMAP, but it could copy the 

Re: [Qemu-devel] [PATCH 00/23] userfaultfd v4

2015-05-20 Thread Andrea Arcangeli
Hi Andrew,

On Tue, May 19, 2015 at 02:38:01PM -0700, Andrew Morton wrote:
 On Thu, 14 May 2015 19:30:57 +0200 Andrea Arcangeli aarca...@redhat.com 
 wrote:
 
  This is the latest userfaultfd patchset against mm-v4.1-rc3
  2015-05-14-10:04.
 
 It would be useful to have some userfaultfd testcases in
 tools/testing/selftests/.  Partly as an aid to arch maintainers when
 enabling this.  And also as a standalone thing to give people a
 practical way of exercising this interface.

Agreed.

I was also thinking about writing a trinity module for it. I wrote one
for an older version, but it was much easier to do back then, before
we had ioctls; now it's more tricky because the ioctls require the fd
to be open first etc... it's not enough to just call a syscall with a
flood of supervised-random params anymore.

 What are your thoughts on enabling userfaultfd for other architectures,
 btw?  Are there good use cases, are people working on it, etc?

powerpc should be enabled and functional already. There's not much
arch dependent code in it, so in theory if the postcopy live migration
patchset is applied to qemu, it should work on powerpc out of the
box. Nobody tested it yet but I don't expect trouble on the kernel side.

Adding support for all other archs is just a few-line patch that
defines the syscall number. I didn't do that out of tree because every
time a new syscall materialized I would get more rejects during
rebase.

 Also, I assume a manpage is in the works?  Sooner rather than later
 would be good - Michael's review of proposed kernel interfaces has
 often been valuable.

Yes, the manpage was certainly planned. It would require updates as we
keep adding features (like the wrprotect tracking, the non-cooperative
usage, and extending the availability of the ioctls to tmpfs). We can
definitely write a manpage with the current features.

Ok, so I'll continue working on the testcase and on the manpage.

Thanks!!
Andrea



Re: [Qemu-devel] [PATCH 00/23] userfaultfd v4

2015-05-20 Thread Andrea Arcangeli
Hello Richard,

On Tue, May 19, 2015 at 11:59:42PM +0200, Richard Weinberger wrote:
 On Tue, May 19, 2015 at 11:38 PM, Andrew Morton
 a...@linux-foundation.org wrote:
  On Thu, 14 May 2015 19:30:57 +0200 Andrea Arcangeli aarca...@redhat.com 
  wrote:
 
  This is the latest userfaultfd patchset against mm-v4.1-rc3
  2015-05-14-10:04.
 
  It would be useful to have some userfaultfd testcases in
  tools/testing/selftests/.  Partly as an aid to arch maintainers when
  enabling this.  And also as a standalone thing to give people a
  practical way of exercising this interface.
 
  What are your thoughts on enabling userfaultfd for other architectures,
  btw?  Are there good use cases, are people working on it, etc?
 
 UML is using SIGSEGV for page faults.
 i.e. the UML processes receives a SIGSEGV, learns the faulting address
 from the mcontext
 and resolves the fault by installing a new mapping.
 
 If userfaultfd is faster than the SIGSEGV notification it could speed
 up UML a bit.
 For UML I'm only interested in the notification, not the resolving
 part. The missing
 data is present, only a new mapping is needed. No copy of data.
 
 Andrea, what do you think?

I think you need some kind of UFFDIO_MPROTECT ioctl that is the same
ioctl wrprotect tracking also needs. At the moment we focused the
future plans mostly on wrprotection tracking but it could be extended
to protnone tracking, either with the same feature flag as
wrprotection (with a generic UFFDIO_MPROTECT) or with two separate
feature flags and two separate ioctls.

Your pages are not missing, like in the postcopy live snapshotting
case the pages are not missing. The userfaultfd memory protection
ioctl will not modify the VMA, but it'll just selectively mark
pte/trans_huge_pmd wrprotected/protnone in order to get the faults. In
the case of postcopy live snapshotting a single ioctl call will mark
the entire guest address space readonly.

For live snapshotting the fault resolution is a no brainer: when you
get the fault the page is still readable and it just needs to be
copied off by the live snapshotting thread to a different location,
and then the UFFDIO_MPROTECT will be called again to make the page
writable and wake the blocked fault.

For the protnone, you need to modify the page before waking the
blocked userfault, you can't just remove the protnone or other threads
could modify it (if there are other threads). You'd need a further
ioctl to copy the page off to a different place by using its kernel
address (the userland address is not mapped) and copy it back to
overwrite the original page.

Alternatively once we extend the handle_userfault to tmpfs you could
map the page in two virtual mappings and track the faults in one
mapping (where the tracked app runs) and read/write the page contents
in the other mapping that isn't tracked by the userfault.

These are the first thoughts that comes to mind without knowing
exactly what you need to do after you get the fault address, and
without knowing exactly why you need to mark the region PROT_NONE.

There will be some complications in adding the wrprotection/protnone
feature: if faults could already happen when the wrprotect/protnone is
armed, handle_userfault() could be invoked in a retry-fault; that is
not ok without allowing the userfault to return VM_FAULT_RETRY even
during a refault (i.e. FAULT_FLAG_TRIED set but FAULT_FLAG_ALLOW_RETRY
not set). The invariants between vma->vm_page_prot and
pte/trans_huge_pmd permissions must also not break anywhere. These are
the two main reasons why these features that require flipping
protection bits are left to be implemented later, and made visible
later with uffdio_api.features flags and/or through
uffdio_register.ioctls during UFFDIO_REGISTER.



Re: [Qemu-devel] [PATCH 00/23] userfaultfd v4

2015-05-19 Thread Richard Weinberger
On Tue, May 19, 2015 at 11:38 PM, Andrew Morton
a...@linux-foundation.org wrote:
 On Thu, 14 May 2015 19:30:57 +0200 Andrea Arcangeli aarca...@redhat.com 
 wrote:

 This is the latest userfaultfd patchset against mm-v4.1-rc3
 2015-05-14-10:04.

 It would be useful to have some userfaultfd testcases in
 tools/testing/selftests/.  Partly as an aid to arch maintainers when
 enabling this.  And also as a standalone thing to give people a
 practical way of exercising this interface.

 What are your thoughts on enabling userfaultfd for other architectures,
 btw?  Are there good use cases, are people working on it, etc?

UML uses SIGSEGV for page faults, i.e. the UML process receives a
SIGSEGV, learns the faulting address from the mcontext, and resolves
the fault by installing a new mapping.

If userfaultfd is faster than the SIGSEGV notification it could speed
up UML a bit.

For UML I'm only interested in the notification, not the resolving
part. The missing data is present; only a new mapping is needed. No
copy of data.

Andrea, what do you think?

-- 
Thanks,
//richard



Re: [Qemu-devel] [PATCH 00/23] userfaultfd v4

2015-05-19 Thread Andrew Morton
On Thu, 14 May 2015 19:30:57 +0200 Andrea Arcangeli aarca...@redhat.com wrote:

 This is the latest userfaultfd patchset against mm-v4.1-rc3
 2015-05-14-10:04.

It would be useful to have some userfaultfd testcases in
tools/testing/selftests/.  Partly as an aid to arch maintainers when
enabling this.  And also as a standalone thing to give people a
practical way of exercising this interface.

What are your thoughts on enabling userfaultfd for other architectures,
btw?  Are there good use cases, are people working on it, etc?


Also, I assume a manpage is in the works?  Sooner rather than later
would be good - Michael's review of proposed kernel interfaces has
often been valuable.




Re: [Qemu-devel] [PATCH 00/23] userfaultfd v4

2015-05-18 Thread Pavel Emelyanov
On 05/14/2015 08:30 PM, Andrea Arcangeli wrote:
 Hello everyone,
 
 This is the latest userfaultfd patchset against mm-v4.1-rc3
 2015-05-14-10:04.
 
 The postcopy live migration feature on the qemu side is mostly ready
 to be merged and it entirely depends on the userfaultfd syscall to be
 merged as well. So it'd be great if this patchset could be reviewed
 for merging in -mm.
 
 Userfaults allow to implement on demand paging from userland and more
 generally they allow userland to more efficiently take control of the
 behavior of page faults than what was available before
 (PROT_NONE + SIGSEGV trap).

Not to spam with 23 e-mails, all patches are

Acked-by: Pavel Emelyanov xe...@parallels.com

Thanks!

-- Pavel




[Qemu-devel] [PATCH 00/23] userfaultfd v4

2015-05-14 Thread Andrea Arcangeli
Hello everyone,

This is the latest userfaultfd patchset against mm-v4.1-rc3
2015-05-14-10:04.

The postcopy live migration feature on the qemu side is mostly ready
to be merged and it entirely depends on the userfaultfd syscall to be
merged as well. So it'd be great if this patchset could be reviewed
for merging in -mm.

Userfaults allow implementing on-demand paging from userland and, more
generally, allow userland to take control of the behavior of page
faults more efficiently than what was available before
(PROT_NONE + SIGSEGV trap).

The use cases are:

1) KVM postcopy live migration (one form of cloud memory
   externalization).

   KVM postcopy live migration is the primary driver of this work:


http://blog.zhaw.ch/icclab/setting-up-post-copy-live-migration-in-openstack/
http://lists.gnu.org/archive/html/qemu-devel/2015-02/msg04873.html

2) postcopy live migration of binaries inside linux containers:

http://thread.gmane.org/gmane.linux.kernel.mm/132662

3) KVM postcopy live snapshotting (allowing to limit/throttle the
   memory usage, unlike fork would, plus the avoidance of fork
   overhead in the first place).

   While the wrprotect tracking is not implemented yet, the syscall API is
   already contemplating the wrprotect fault tracking and it's generic enough
   to allow its later implementation in a backwards compatible fashion.

4) KVM userfaults on shared memory. The UFFDIO_COPY lowlevel method
   should be extended to work also on tmpfs and then the
   uffdio_register.ioctls will notify userland that UFFDIO_COPY is
   available even when the registered virtual memory range is tmpfs
   backed.

5) alternate mechanism to notify web browsers or apps on embedded
   devices that volatile pages have been reclaimed. This basically
   avoids the need to run a syscall before the app can access with the
   CPU the virtual regions marked volatile. This depends on point 4)
   to be fulfilled first, as volatile pages happily apply to tmpfs.

Even though there wasn't a real use case requesting it yet, it also
allows implementing distributed shared memory in a way that readonly
shared mappings can exist simultaneously on different hosts and become
exclusive at the first wrprotect fault.

The development version can also be cloned here:

git clone --reference linux -b userfault 
git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git

Slides from LSF-MM summit (but beware that they're not uptodate):


https://www.kernel.org/pub/linux/kernel/people/andrea/userfaultfd/userfaultfd-LSFMM-2015.pdf

Comments welcome.

Thanks,
Andrea

Changelog of the major changes since the last RFC v3:

o The API has been slightly modified to avoid having to introduce a
  second revision of the API, in order to support the non cooperative
  usage.

o Various mixed fixes thanks to the feedback from Dave Hansen and
  David Gilbert.

  The most notable one is the use of mm_users instead of mm_count to
  pin the mm to avoid crashes that assumed the vma still existed (in
  the userfaultfd_release method and in the various ioctl). exit_mmap
  doesn't even set mm->mmap to NULL, so unless I introduce a
  userfaultfd_exit to call in mmput, I have to pin the mm_users to be
  safe. This is a visible change mainly for the non-cooperative usage.

o userfaults are woken immediately even if they haven't been read
  yet; this can lead to POLLIN false positives (so I only allow poll
  if the fd is open in nonblocking mode, to be sure it won't hang).


http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfaultid=f222d9de0a5302dc8ac62d6fab53a84251098751

o optimize read to return entries in O(1) and poll which was already
  O(1) becomes lockless. This required to split the waitqueue in two,
  one for pending faults and one for non pending faults, and the
  faults are refiled across the two waitqueues when they're read. Both
  waitqueues are protected by a single lock to be simpler and faster
  at runtime (the fault_pending_wqh one).


http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfaultid=9aa033ed43a1134c2223dac8c5d9e02e0100fca1

o Allocate the ctx with kmem_cache_alloc.


http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfaultid=f5a8db16d2876eed8906a4d36f1d0e06ca5490f6

o Originally qemu had two bitflags for each page and kept 3 states (of
  the 4 possible with two bits) for each page in order to deal with
  the races that can happen if one thread is reading the userfaults
  and another thread is calling the UFFDIO_COPY ioctl in the
  background. This patch solves all races in the kernel so the two
  bits per page can be dropped from qemu codebase. I started
  documenting the races that can materialize by using 2 threads
  (instead of running the workload single threaded with a single poll
  event loop) and how userland had to solve them until I decided it
  was simpler to fix the race in the kernel