Re: [Qemu-devel] [PATCH 08/10] userfaultfd: add new syscall to provide memory externalization

2014-07-03 Thread Andrea Arcangeli
Hi Andy,

thanks for CC'ing linux-api.

On Wed, Jul 02, 2014 at 06:56:03PM -0700, Andy Lutomirski wrote:
 On 07/02/2014 09:50 AM, Andrea Arcangeli wrote:
  Once a userfaultfd is created, MADV_USERFAULT regions talk through
  the userfaultfd protocol with the thread responsible for doing the
  memory externalization of the process.
  
  The protocol starts with userland writing the requested/preferred
  USERFAULT_PROTOCOL version into the userfault fd (a 64bit write). If
  the kernel knows that version, it acks it by allowing userland to read
  64bit back from the userfault fd containing the same 64bit
  USERFAULT_PROTOCOL version that userland asked for. Otherwise userland
  will read the __u64 value -1ULL (aka USERFAULTFD_UNKNOWN_PROTOCOL) and
  has to try again by writing an older protocol version that still suits
  its usage, reading it back each time, until it stops reading -1ULL.
  After that the userfaultfd protocol starts.
  
  The protocol consists of 64bit reads from the userfault fd that
  provide userland with the fault addresses. After a userfault address
  has been read and the fault has been resolved by userland, the
  application must write back 128bits in the form of a [ start, end ]
  range (64bit each), which tells the kernel that such a range has been
  mapped. Multiple read userfaults can be resolved in a single range
  write. poll() can be used to know when there are new userfaults to
  read (POLLIN) and when there are threads waiting for a wakeup through
  a range write (POLLOUT).
  
  Signed-off-by: Andrea Arcangeli aarca...@redhat.com
 
  +#ifdef CONFIG_PROC_FS
  +static int userfaultfd_show_fdinfo(struct seq_file *m, struct file *f)
  +{
  +    struct userfaultfd_ctx *ctx = f->private_data;
  +    int ret;
  +    wait_queue_t *wq;
  +    struct userfaultfd_wait_queue *uwq;
  +    unsigned long pending = 0, total = 0;
  +
  +    spin_lock(&ctx->fault_wqh.lock);
  +    list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
  +        uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
  +        if (uwq->pending)
  +            pending++;
  +        total++;
  +    }
  +    spin_unlock(&ctx->fault_wqh.lock);
  +
  +    ret = seq_printf(m, "pending:\t%lu\ntotal:\t%lu\n", pending, total);
 
 This should show the protocol version, too.

Ok, does the below look ok?

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 388553e..f9d3e9f 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -493,7 +493,13 @@ static int userfaultfd_show_fdinfo(struct seq_file *m, struct file *f)
     }
     spin_unlock(&ctx->fault_wqh.lock);
 
-    ret = seq_printf(m, "pending:\t%lu\ntotal:\t%lu\n", pending, total);
+    /*
+     * If more protocols are added later, they will all be shown
+     * here separated by a space, like this:
+     *     protocols: 0xaa 0xbb
+     */
+    ret = seq_printf(m, "pending:\t%lu\ntotal:\t%lu\nprotocols:\t%Lx\n",
+             pending, total, USERFAULTFD_PROTOCOL);
 
return ret;
 }
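
For reference, with the above applied, /proc/<pid>/fdinfo/<fd> of a
userfaultfd would read along these lines (counts illustrative; %Lx
prints the version without a 0x prefix):

    pending:    0
    total:      0
    protocols:  aa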


  +
  +SYSCALL_DEFINE1(userfaultfd, int, flags)
  +{
  +   int fd, error;
  +   struct file *file;
 
 This looks like it can't be used more than once in a process.  That will

It can't be used more than once, correct.

file = ERR_PTR(-EBUSY);
if (get_mm_slot(current->mm))
goto out_free_unlock;

If a userfaultfd is already registered for the current mm the second
one gets -EBUSY.
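
In userland terms that behaviour would look like the sketch below
(illustrative only; 318 is the x86_64 syscall number added by this
patch's syscall table, not an upstream constant):

    #include <errno.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #define __NR_userfaultfd 318    /* x86_64 number from this series */

    int main(void)
    {
        int fd1 = syscall(__NR_userfaultfd, 0);    /* first one succeeds */
        int fd2 = syscall(__NR_userfaultfd, 0);    /* same mm: fails with EBUSY */

        printf("fd1=%d fd2=%d errno=%d\n", fd1, fd2, fd2 < 0 ? errno : 0);
        return 0;
    }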

 be unfortunate for libraries.  Would it be feasible to either have

So you envision two userfaultfd memory managers for the same process?
I assume each one would claim separate ranges of memory?

For that case the demultiplexing of userfaults can be entirely managed
by userland.

A single libuserfault library could register the userfaultfd, and the
two libs would then register with libuserfault and claim their own
ranges. It could run the code of the two libs in the context of the
thread that waits on the userfaultfd with zero overhead, or message
passing across threads could be used to run both libs in parallel,
each in its own thread. The demultiplexing code wouldn't be CPU
intensive. The downside is the two scheduling events required if they
want to run their lib code in a separate thread. If we claimed the two
different ranges in the kernel for two different userfaultfds, the
kernel would be speaking directly with each library thread; that would
be the only advantage, and only if they don't want to run in the
context of the thread that waits on the userfaultfd.
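
Roughly, such a libuserfault dispatcher boils down to the sketch
below (hypothetical code, nothing here is defined by the patch;
uffd_range and handle are made-up names):

    #include <stdint.h>

    /* one entry per library-claimed range, registered with libuserfault */
    struct uffd_range {
        uint64_t start, end;            /* [start, end) claimed by one lib */
        void (*handle)(uint64_t addr);  /* lib callback: run in the uffd
                                           thread or woken via messaging */
    };

    static struct uffd_range ranges[2]; /* e.g. two client libraries */

    /* called with each fault address read from the single userfaultfd */
    static void dispatch(uint64_t addr)
    {
        unsigned int i;

        for (i = 0; i < 2; i++) {
            if (addr >= ranges[i].start && addr < ranges[i].end) {
                ranges[i].handle(addr);
                return;
            }
        }
    }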

To increase SMP scalability in the future we could also add a
UFFD_LOAD_BALANCE flag to distribute userfaults across different
userfaultfds; if used, it could relax the -EBUSY (but it still
wouldn't mean two different claimed ranges for two different libs).

Passing UFFD_LOAD_BALANCE to the current code makes sys_userfaultfd
return -EINVAL. I haven't implemented it because I'm not sure such a
thing would ever be needed. Compared to distributing the userfaults in
userland to different threads, it would only save two context
switches per event. I don't 

[Qemu-devel] [PATCH 08/10] userfaultfd: add new syscall to provide memory externalization

2014-07-02 Thread Andrea Arcangeli
Once a userfaultfd is created, MADV_USERFAULT regions talk through
the userfaultfd protocol with the thread responsible for doing the
memory externalization of the process.

The protocol starts with userland writing the requested/preferred
USERFAULT_PROTOCOL version into the userfault fd (a 64bit write). If
the kernel knows that version, it acks it by allowing userland to read
64bit back from the userfault fd containing the same 64bit
USERFAULT_PROTOCOL version that userland asked for. Otherwise userland
will read the __u64 value -1ULL (aka USERFAULTFD_UNKNOWN_PROTOCOL) and
has to try again by writing an older protocol version that still suits
its usage, reading it back each time, until it stops reading -1ULL.
After that the userfaultfd protocol starts.
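
A minimal userland sketch of this handshake (illustrative only, not
part of the patch; 318 is the x86_64 syscall number and
USERFAULTFD_PROTOCOL the 0xaa value defined further below):

    #include <stdint.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #define __NR_userfaultfd      318                 /* x86_64, from this patch */
    #define USERFAULTFD_PROTOCOL  ((uint64_t)0xaa)    /* defined in fs/userfaultfd.c */

    static int uffd_open(void)
    {
        uint64_t proto = USERFAULTFD_PROTOCOL, ack;
        int fd = syscall(__NR_userfaultfd, 0);

        if (fd < 0)
            return -1;
        /* propose our preferred protocol version (64bit write) */
        if (write(fd, &proto, sizeof(proto)) != sizeof(proto))
            goto err;
        /* the kernel acks by letting us read the same version back */
        if (read(fd, &ack, sizeof(ack)) != sizeof(ack) || ack == (uint64_t)-1)
            goto err;    /* -1ULL: unknown protocol, retry with an older one */
        return fd;
    err:
        close(fd);
        return -1;
    }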

The protocol consists of 64bit reads from the userfault fd that
provide userland with the fault addresses. After a userfault address
has been read and the fault has been resolved by userland, the
application must write back 128bits in the form of a [ start, end ]
range (64bit each), which tells the kernel that such a range has been
mapped. Multiple read userfaults can be resolved in a single range
write. poll() can be used to know when there are new userfaults to
read (POLLIN) and when there are threads waiting for a wakeup through
a range write (POLLOUT).
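
A sketch of the corresponding fault-service loop (again illustrative;
resolve_fault() stands in for whatever populates the missing page,
e.g. via remap_anon_pages() from earlier in this series, and the range
end is assumed here to be one page past the faulting address):

    #include <poll.h>
    #include <stdint.h>
    #include <unistd.h>

    extern void resolve_fault(uint64_t addr);    /* hypothetical: maps the page at addr */

    static void uffd_serve(int fd)
    {
        long page_size = sysconf(_SC_PAGESIZE);
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        uint64_t addr, range[2];

        for (;;) {
            poll(&pfd, 1, -1);    /* POLLIN: new userfaults to read */
            if (read(fd, &addr, sizeof(addr)) != sizeof(addr))
                continue;
            resolve_fault(addr);
            /*
             * Write back the [ start, end ] range that is now mapped;
             * this wakes the faulting threads.  Several read userfaults
             * can be folded into one larger range write.
             */
            range[0] = addr & ~((uint64_t)page_size - 1);
            range[1] = range[0] + page_size;
            write(fd, range, sizeof(range));
        }
    }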

Signed-off-by: Andrea Arcangeli aarca...@redhat.com
---
 arch/x86/syscalls/syscall_32.tbl |   1 +
 arch/x86/syscalls/syscall_64.tbl |   1 +
 fs/Makefile  |   1 +
 fs/userfaultfd.c | 557 +++
 include/linux/syscalls.h |   1 +
 include/linux/userfaultfd.h  |  40 +++
 init/Kconfig |  10 +
 kernel/sys_ni.c  |   1 +
 mm/huge_memory.c |  20 +-
 mm/memory.c  |   5 +-
 10 files changed, 629 insertions(+), 8 deletions(-)
 create mode 100644 fs/userfaultfd.c
 create mode 100644 include/linux/userfaultfd.h

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index 08bc856..5aa2da4 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -361,3 +361,4 @@
 352     i386    sched_getattr       sys_sched_getattr
 353     i386    renameat2           sys_renameat2
 354     i386    remap_anon_pages    sys_remap_anon_pages
+355     i386    userfaultfd         sys_userfaultfd
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 37bd179..7dca902 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -324,6 +324,7 @@
 315     common  sched_getattr       sys_sched_getattr
 316     common  renameat2           sys_renameat2
 317     common  remap_anon_pages    sys_remap_anon_pages
+318     common  userfaultfd         sys_userfaultfd
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/Makefile b/fs/Makefile
index 4030cbf..e00e243 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -27,6 +27,7 @@ obj-$(CONFIG_ANON_INODES) += anon_inodes.o
 obj-$(CONFIG_SIGNALFD) += signalfd.o
 obj-$(CONFIG_TIMERFD)  += timerfd.o
 obj-$(CONFIG_EVENTFD)  += eventfd.o
+obj-$(CONFIG_USERFAULTFD)  += userfaultfd.o
 obj-$(CONFIG_AIO)   += aio.o
 obj-$(CONFIG_FILE_LOCKING)  += locks.o
 obj-$(CONFIG_COMPAT)   += compat.o compat_ioctl.o
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
new file mode 100644
index 000..4902fa3
--- /dev/null
+++ b/fs/userfaultfd.c
@@ -0,0 +1,557 @@
+/*
+ *  fs/userfaultfd.c
+ *
+ *  Copyright (C) 2007  Davide Libenzi davi...@xmailserver.org
+ *  Copyright (C) 2008-2009 Red Hat, Inc.
+ *  Copyright (C) 2014  Red Hat, Inc.
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ *
+ *  Some part derived from fs/eventfd.c (anon inode setup) and
+ *  mm/ksm.c (mm hashing).
+ */
+
+#include <linux/kref.h>
+#include <linux/hashtable.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/seq_file.h>
+#include <linux/file.h>
+#include <linux/bug.h>
+#include <linux/anon_inodes.h>
+#include <linux/syscalls.h>
+#include <linux/userfaultfd.h>
+
+struct userfaultfd_ctx {
+   /* pseudo fd refcounting */
+   struct kref kref;
+   /* waitqueue head for the userfaultfd page faults */
+   wait_queue_head_t fault_wqh;
+   /* waitqueue head for the pseudo fd to wakeup poll/read */
+   wait_queue_head_t fd_wqh;
+   /* userfaultfd syscall flags */
+   unsigned int flags;
+   /* state machine */
+   unsigned int state;
+   /* released */
+   bool released;
+};
+
+struct userfaultfd_wait_queue {
+   unsigned long address;
+   wait_queue_t wq;
+   bool pending;
+   struct userfaultfd_ctx *ctx;
+};
+
+#define USERFAULTFD_PROTOCOL ((__u64) 0xaa)
+#define USERFAULTFD_UNKNOWN_PROTOCOL ((__u64) -1ULL)
+
+enum {
+  

Re: [Qemu-devel] [PATCH 08/10] userfaultfd: add new syscall to provide memory externalization

2014-07-02 Thread Andy Lutomirski
On 07/02/2014 09:50 AM, Andrea Arcangeli wrote:
 Once a userfaultfd is created, MADV_USERFAULT regions talk through
 the userfaultfd protocol with the thread responsible for doing the
 memory externalization of the process.
 
 The protocol starts with userland writing the requested/preferred
 USERFAULT_PROTOCOL version into the userfault fd (a 64bit write). If
 the kernel knows that version, it acks it by allowing userland to read
 64bit back from the userfault fd containing the same 64bit
 USERFAULT_PROTOCOL version that userland asked for. Otherwise userland
 will read the __u64 value -1ULL (aka USERFAULTFD_UNKNOWN_PROTOCOL) and
 has to try again by writing an older protocol version that still suits
 its usage, reading it back each time, until it stops reading -1ULL.
 After that the userfaultfd protocol starts.
 
 The protocol consists of 64bit reads from the userfault fd that
 provide userland with the fault addresses. After a userfault address
 has been read and the fault has been resolved by userland, the
 application must write back 128bits in the form of a [ start, end ]
 range (64bit each), which tells the kernel that such a range has been
 mapped. Multiple read userfaults can be resolved in a single range
 write. poll() can be used to know when there are new userfaults to
 read (POLLIN) and when there are threads waiting for a wakeup through
 a range write (POLLOUT).
 
 Signed-off-by: Andrea Arcangeli aarca...@redhat.com

 +#ifdef CONFIG_PROC_FS
 +static int userfaultfd_show_fdinfo(struct seq_file *m, struct file *f)
 +{
 +    struct userfaultfd_ctx *ctx = f->private_data;
 +    int ret;
 +    wait_queue_t *wq;
 +    struct userfaultfd_wait_queue *uwq;
 +    unsigned long pending = 0, total = 0;
 +
 +    spin_lock(&ctx->fault_wqh.lock);
 +    list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
 +        uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
 +        if (uwq->pending)
 +            pending++;
 +        total++;
 +    }
 +    spin_unlock(&ctx->fault_wqh.lock);
 +
 +    ret = seq_printf(m, "pending:\t%lu\ntotal:\t%lu\n", pending, total);

This should show the protocol version, too.

 +
 +SYSCALL_DEFINE1(userfaultfd, int, flags)
 +{
 + int fd, error;
 + struct file *file;

This looks like it can't be used more than once in a process.  That will
be unfortunate for libraries.  Would it be feasible to either have
userfaultfd claim a range of addresses or for a vma to be explicitly
associated with a userfaultfd?  (In the latter case, giant PROT_NONE
MAP_NORESERVE mappings could be used.)
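
For context, the giant reservation suggested here would be set up with
standard mmap flags roughly as in the sketch below; the per-vma
userfaultfd association itself remains hypothetical:

    #include <stddef.h>
    #include <sys/mman.h>

    /* reserve a huge, unpopulated, unaccounted address range */
    static void *reserve_region(size_t size)
    {
        return mmap(NULL, size, PROT_NONE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    }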