[PATCH 1/3] [v2] vm: add a syscall to map a process memory into a pipe
It is a hybrid of process_vm_readv() and vmsplice(): vmsplice() can map
memory from the current address space into a pipe, and process_vm_readv()
can read memory of another process. The new system call maps memory of
another process into a pipe:

  ssize_t process_vmsplice(pid_t pid, int fd, const struct iovec *iov,
                           unsigned long nr_segs, unsigned int flags)

All arguments are identical to vmsplice() except pid, which specifies the
target process.

Currently, if we want to dump process memory to a file or a socket, we can
use process_vm_readv() + write(), but it is slow because the data is
copied through a temporary user-space buffer.

A second way is to use vmsplice() + splice(). It is more efficient because
data is not copied into a temporary buffer, but there is another problem:
vmsplice() works on the current address space, so it can be used only by
injecting our code into the target process. This way suffers from a few
other issues as well:

* the process has to be stopped to run the parasite code
* the number of pipes is limited, so it may be impossible to dump all
  memory in one iteration, and we have to stop the process and inject our
  code several times
* pages in pipes are unreclaimable, so it isn't good to hold a lot of
  memory in pipes

The introduced syscall allows the second way to be used without injecting
any code into the target process. My experiments show that
process_vmsplice() + splice() works about twice as fast as
process_vm_readv() + write().

It is particularly useful at the pre-dump stage. At this stage we enable a
memory tracker and dump the process memory while the process continues to
run. On the first iteration we dump all memory, and on each following
iteration only the memory modified since the previous one. After a few
pre-dump operations, the process is stopped and dumped a final time. The
pre-dump operations significantly decrease process downtime when a process
is migrated to another host.

v2: move this syscall under CONFIG_CROSS_MEMORY_ATTACH
    give correct flags to get_user_pages_remote()

Cc: Alexander Viro
Cc: Arnd Bergmann
Cc: Pavel Emelyanov
Cc: Michael Kerrisk
Cc: Thomas Gleixner
Cc: Andrew Morton
Cc: Josh Triplett
Cc: Jann Horn
Signed-off-by: Andrei Vagin
---
 fs/splice.c                       | 223 ++
 include/linux/compat.h            |   3 +
 include/linux/syscalls.h          |   4 +
 include/uapi/asm-generic/unistd.h |   5 +-
 kernel/sys_ni.c                   |   2 +
 5 files changed, 236 insertions(+), 1 deletion(-)
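Not part of the patch: a minimal user-space sketch of the intended fast
path, for reviewers. The wrapper and the fallback syscall number below are
illustrative assumptions -- take the real __NR_process_vmsplice from the
patched include/uapi/asm-generic/unistd.h; dump_vmsplice() is a
hypothetical helper, not something this series adds.

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef __NR_process_vmsplice
#define __NR_process_vmsplice 0	/* placeholder, see the unistd.h hunk */
#endif

static ssize_t process_vmsplice(pid_t pid, int fd, const struct iovec *iov,
				unsigned long nr_segs, unsigned int flags)
{
	return syscall(__NR_process_vmsplice, pid, fd, iov, nr_segs, flags);
}

/* Dump "len" bytes at "addr" in task "pid" into the file "out". */
static int dump_vmsplice(pid_t pid, void *addr, size_t len, int out)
{
	struct iovec iov = { .iov_base = addr, .iov_len = len };
	int p[2];

	if (pipe(p))
		return -1;

	while (iov.iov_len > 0) {
		/* map the target's pages into the pipe: no bounce buffer */
		ssize_t n = process_vmsplice(pid, p[1], &iov, 1, 0);

		if (n <= 0)
			goto err;
		iov.iov_base = (char *)iov.iov_base + n;
		iov.iov_len -= n;

		/* move the page references from the pipe to the file */
		while (n > 0) {
			ssize_t m = splice(p[0], NULL, out, NULL, n,
					   SPLICE_F_MOVE);
			if (m <= 0)
				goto err;
			n -= m;
		}
	}
	close(p[0]);
	close(p[1]);
	return 0;
err:
	close(p[0]);
	close(p[1]);
	return -1;
}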
diff --git a/fs/splice.c b/fs/splice.c
index f3084cce0ea6..4bf37207feb9 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -34,6 +34,7 @@
 #include <linux/socket.h>
 #include <linux/compat.h>
 #include <linux/sched/signal.h>
+#include <linux/sched/mm.h>
 
 #include "internal.h"
 
@@ -1358,6 +1359,228 @@ SYSCALL_DEFINE4(vmsplice, int, fd, const struct iovec __user *, iov,
 	return error;
 }
 
+#ifdef CONFIG_CROSS_MEMORY_ATTACH
+/*
+ * Map pages from a specified task into a pipe
+ */
+static int remote_single_vec_to_pipe(struct task_struct *task,
+			struct mm_struct *mm,
+			const struct iovec *rvec,
+			struct pipe_inode_info *pipe,
+			unsigned int flags,
+			size_t *total)
+{
+	struct pipe_buffer buf = {
+		.ops = &user_page_pipe_buf_ops,
+		.flags = flags
+	};
+	unsigned long addr = (unsigned long) rvec->iov_base;
+	unsigned long pa = addr & PAGE_MASK;
+	unsigned long start_offset = addr - pa;
+	unsigned long nr_pages;
+	ssize_t len = rvec->iov_len;
+	struct page *process_pages[16];
+	bool failed = false;
+	int ret = 0;
+
+	nr_pages = (addr + len - 1) / PAGE_SIZE - addr / PAGE_SIZE + 1;
+	while (nr_pages) {
+		long pages = min(nr_pages, 16UL);
+		int locked = 1, n;
+		ssize_t copied;
+
+		/*
+		 * Get the pages we're interested in.  We must
+		 * access remotely because task/mm might not be
+		 * current/current->mm
+		 */
+		down_read(&mm->mmap_sem);
+		pages = get_user_pages_remote(task, mm, pa, pages, 0,
+					      process_pages, NULL, &locked);
+		if (locked)
+			up_read(&mm->mmap_sem);
+		if (pages <= 0) {
+			failed = true;
+			ret = -EFAULT;
+			break;
+		}
+
+		copied = pages * PAGE_SIZE - start_offset;
+		if (copied > len)
+			copied = len;
+		len -= copied;
+
+		for (n = 0; copied; n++, start_offset = 0) {
+			int size = min_t(int, copied, PAGE_SIZE - start_offset);
+
+			if (!failed) {
+				buf.page = process_pages[n];
+				buf.offset = start_offset;
+				buf.len = size;
+				ret = add_to_pipe(pipe, &buf);
+				if (unlikely(ret < 0))
+					failed = true;
+				else
+					*total += ret;
+			} else
+				put_page(process_pages[n]);
+			copied -= size;
+		}
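For comparison, and also not part of the patch: the status-quo path the
commit message benchmarks against, process_vm_readv() + write(), as the
same kind of sketch. dump_readv() is again a hypothetical helper; unlike
the sketch above the diff, it runs on unpatched kernels. Every chunk is
copied into "buf" by process_vm_readv() and copied again by write() --
the double copy process_vmsplice() avoids.

#define _GNU_SOURCE
#include <sys/uio.h>
#include <unistd.h>

static int dump_readv(pid_t pid, void *addr, size_t len, int out)
{
	static char buf[1 << 16];	/* temporary user-space bounce buffer */
	struct iovec remote = { .iov_base = addr, .iov_len = len };
	struct iovec local = { .iov_base = buf };

	while (remote.iov_len > 0) {
		ssize_t n;

		/* first copy: target memory -> buf */
		local.iov_len = remote.iov_len < sizeof(buf) ?
				remote.iov_len : sizeof(buf);
		n = process_vm_readv(pid, &local, 1, &remote, 1, 0);
		if (n <= 0)
			return -1;
		/* second copy: buf -> file */
		if (write(out, buf, n) != n)
			return -1;
		remote.iov_base = (char *)remote.iov_base + n;
		remote.iov_len -= n;
	}
	return 0;
}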