[PATCH 1/3] [v2] vm: add a syscall to map a process's memory into a pipe

2017-10-25, Andrei Vagin
It is a hybrid of process_vm_readv() and vmsplice().

vmsplice() can map memory from the current address space into a pipe.
process_vm_readv() can read the memory of another process.

The new system call can map the memory of another process into a pipe.

ssize_t process_vmsplice(pid_t pid, int fd, const struct iovec *iov,
unsigned long nr_segs, unsigned int flags)

All arguments are identical to vmsplice() except pid, which specifies the
target process.
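
For illustration only: until a libc wrapper exists, user space would invoke
it through syscall(2). In the sketch below __NR_process_vmsplice is a
placeholder for whatever number the syscall tables end up assigning:

  #define _GNU_SOURCE
  #include <unistd.h>
  #include <sys/uio.h>
  #include <sys/syscall.h>

  /* hypothetical wrapper; __NR_process_vmsplice is a placeholder */
  static ssize_t process_vmsplice(pid_t pid, int fd, const struct iovec *iov,
                                  unsigned long nr_segs, unsigned int flags)
  {
          return syscall(__NR_process_vmsplice, pid, fd, iov, nr_segs, flags);
  }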

Currently, if we want to dump a process's memory to a file or a socket, we
can use process_vm_readv() + write(), but this is slow because the data is
copied through a temporary user-space buffer.
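
That path looks roughly like the sketch below (illustrative only; pid,
remote_addr, len and out_fd are placeholders):

  #define _GNU_SOURCE
  #include <sys/uio.h>
  #include <unistd.h>

  /* one chunk via the copying path: remote memory -> buf -> out_fd */
  static ssize_t dump_chunk_readv(pid_t pid, void *remote_addr, size_t len,
                                  int out_fd)
  {
          static char buf[1 << 20];
          struct iovec liov = { .iov_base = buf, .iov_len = sizeof(buf) };
          struct iovec riov = { .iov_base = remote_addr, .iov_len = len };
          ssize_t n = process_vm_readv(pid, &liov, 1, &riov, 1, 0);

          /* second copy: the temporary buffer is written out */
          return n > 0 ? write(out_fd, buf, n) : n;
  }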

A second way is to use vmsplice() + splice(). It is more efficient, because
the data is not copied into a temporary buffer, but there is another
problem: vmsplice() works with the current address space, so it can be used
only if we inject our code into the target process.

The second way suffers from a few other issues:
* the process has to be stopped to run the parasite code
* the number of pipes is limited, so it may be impossible to dump all
  memory in one iteration, and we have to stop the process and inject our
  code several times.
* pages in pipes are unreclaimable, so it isn't good to hold a lot of
  memory in pipes.

The introduced syscall allows using the second way without injecting any
code into the target process.
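
With process_vmsplice() the same splice-based path can be driven entirely
from the dumping process. A sketch, reusing the hypothetical wrapper from
above (pid, remote_addr, chunk_len and out_fd are placeholders):

  /* also needs <fcntl.h> for splice() */
  int p[2];
  struct iovec riov = { .iov_base = remote_addr, .iov_len = chunk_len };
  ssize_t left;

  pipe(p);
  /* pages of the target process are attached to the pipe, no copy */
  left = process_vmsplice(pid, p[1], &riov, 1, 0);
  while (left > 0) {
          ssize_t n = splice(p[0], NULL, out_fd, NULL, left, SPLICE_F_MOVE);

          if (n <= 0)
                  break;
          left -= n;
  }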

My experiments show that process_vmsplice() + splice() works two times
faster than process_vm_readv() + write().

It is particularly useful at the pre-dump stage. At this stage we enable a
memory tracker and then dump the process memory while the process continues
to run. On the first iteration we dump all memory, and on the following
iterations we dump only the memory modified since the previous one. After a
few pre-dump operations, the process is stopped and dumped a final time.
The pre-dump operations significantly decrease the process downtime when it
is migrated to another host.

v2: move this syscall under CONFIG_CROSS_MEMORY_ATTACH
give correct flags to get_user_pages_remote()

Cc: Alexander Viro 
Cc: Arnd Bergmann 
Cc: Pavel Emelyanov 
Cc: Michael Kerrisk 
Cc: Thomas Gleixner 
Cc: Andrew Morton 
Cc: Josh Triplett 
Cc: Jann Horn 
Signed-off-by: Andrei Vagin 
---
 fs/splice.c                       | 223 +++++++++++++++++++++++++++++++++++++
 include/linux/compat.h            |   3 +
 include/linux/syscalls.h          |   4 +
 include/uapi/asm-generic/unistd.h |   5 +-
 kernel/sys_ni.c                   |   2 +
 5 files changed, 236 insertions(+), 1 deletion(-)

diff --git a/fs/splice.c b/fs/splice.c
index f3084cce0ea6..4bf37207feb9 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "internal.h"
 
@@ -1358,6 +1359,228 @@ SYSCALL_DEFINE4(vmsplice, int, fd, const struct iovec __user *, iov,
return error;
 }
 
+#ifdef CONFIG_CROSS_MEMORY_ATTACH
+/*
+ * Map pages from a specified task into a pipe
+ */
+static int remote_single_vec_to_pipe(struct task_struct *task,
+   struct mm_struct *mm,
+   const struct iovec *rvec,
+   struct pipe_inode_info *pipe,
+   unsigned int flags,
+   size_t *total)
+{
+   struct pipe_buffer buf = {
+   .ops = &user_page_pipe_buf_ops,
+   .flags = flags
+   };
+   unsigned long addr = (unsigned long) rvec->iov_base;
+   unsigned long pa = addr & PAGE_MASK;
+   unsigned long start_offset = addr - pa;
+   unsigned long nr_pages;
+   ssize_t len = rvec->iov_len;
+   struct page *process_pages[16];
+   bool failed = false;
+   int ret = 0;
+
+   nr_pages = (addr + len - 1) / PAGE_SIZE - addr / PAGE_SIZE + 1;
+   while (nr_pages) {
+   long pages = min(nr_pages, 16UL);
+   int locked = 1, n;
+   ssize_t copied;
+
+   /*
+    * Get the pages we're interested in.  We must
+    * access remotely because task/mm might not be
+    * current/current->mm
+    */
+   down_read(&mm->mmap_sem);
+   pages = get_user_pages_remote(task, mm, pa, pages, 0,
+ process_pages, NULL, &locked);
+   if (locked)
+   up_read(&mm->mmap_sem);
+   if (pages <= 0) {
+   failed = true;
+   ret = -EFAULT;
+   break;
+   }
+
+   copied = pages * PAGE_SIZE - start_offset;
+   if (copied > len)
+   copied = len;
+   len -= copied;
+
+   for (n = 0; copied; n++, start_offset = 0) {
+   int size = min_t(int, copied, PAGE_SIZE - start_offset);
+
+   if (!failed) {
+   buf.page = process_pages[n];
+