Re: [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplication a process mapping

2019-05-16 Thread Kirill Tkhai
On 16.05.2019 16:42, Adam Borowski wrote:
> On Thu, May 16, 2019 at 04:10:07PM +0300, Kirill Tkhai wrote:
>> On 15.05.2019 22:38, Adam Borowski wrote:
>>> On Wed, May 15, 2019 at 06:11:15PM +0300, Kirill Tkhai wrote:
>>>> This patchset adds a new syscall, which makes possible
>>>> to clone a mapping from a process to another process.
>>>> The syscall supplements the functionality provided
>>>> by process_vm_writev() and process_vm_readv() syscalls,
>>>> and it may be useful in many situation.
>>>>
>>>> For example, it allows to make a zero copy of data,
>>>> when process_vm_writev() was previously used:
>>>
>>> I wonder, why not optimize the existing interfaces to do zero copy if
>>> properly aligned?  No need for a new syscall, and old code would immediately
>>> benefit.
>>
>> Because, this is just not possible. You can't zero copy anonymous pages
>> of a process to pages of a remote process, when they are different pages.
> 
> fork() manages that, and so does KSM.  Like KSM, you want to make a page
> shared -- you just skip the comparison step as you want to overwrite the old
> contents.
> 
> And there's no need to touch the page, as fork() manages that fine no matter
> if the page is resident, anonymous in swap, or file-backed, all without
> reading from swap.

Yes, and if you dive into the patchset, you will find that the new syscall
manages page table entries in the same way fork() does.
 
>>>> There are several problems with process_vm_writev() in this example:
>>>>
>>>> 1)it causes pagefault on remote process memory, and it forces
>>>>   allocation of a new page (if was not preallocated);
>>>>
>>>> 2)amount of memory for this example is doubled in a moment --
>>>>   n pages in current and n pages in remote tasks are occupied
>>>>   at the same time;
>>>>
>>>> 3)received data has no a chance to be properly swapped for
>>>>   a long time.
>>>
>>> That'll handle all of your above problems, except for making pages
>>> subject to CoW if written to.  But if making pages writeably shared is
>>> desired, the old functions have a "flags" argument that doesn't yet have a
>>> single bit defined.
> 
> 
> Meow!
> 



Re: [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplication a process mapping

2019-05-16 Thread Kirill Tkhai
On 16.05.2019 16:52, Michal Hocko wrote:
> On Thu 16-05-19 15:30:34, Michal Hocko wrote:
>> [You are defining a new user visible API, please always add linux-api
>>  mailing list - now done]
>>
>> On Wed 15-05-19 18:11:15, Kirill Tkhai wrote:
> [...]
>>> The proposed syscall aims to introduce an interface, which
>>> supplements currently existing process_vm_writev() and
>>> process_vm_readv(), and allows to solve the problem with
>>> anonymous memory transfer. The above example may be rewritten as:
>>>
>>> void *buf;
>>>
>>> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>>>            MAP_PRIVATE|MAP_ANONYMOUS, ...);
>>> recv(sock, buf, n * PAGE_SIZE, 0);
>>>
>>> /* Sign of @pid is direction: "from @pid task to current" or vice versa. */
>>> process_vm_mmap(-pid, buf, n * PAGE_SIZE, remote_addr, PVMMAP_FIXED);
>>> munmap(buf, n * PAGE_SIZE);
> 
> AFAIU this means that you actually want to do an mmap of an anonymous
> memory with a COW semantic to the remote process right?

Yes.

> How does the remote process find out where and what has been mmaped?

In any way. Isn't this a trivial task? :) You may use a socket or any
other appropriate Linux feature to communicate between them.
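
As a minimal sketch of such a notification over a socket: the struct layout
and the function names below are hypothetical and not part of the patchset;
they only show that passing "where and how much" between the two tasks is a
few lines of code.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical wire format: where the mapping landed and how long it is. */
struct map_notice {
	uint64_t addr;
	uint64_t len;
};

/* Send one notice; returns 0 on success, -1 on error or short send. */
int send_notice(int fd, const void *addr, size_t len)
{
	struct map_notice n = { (uint64_t)(uintptr_t)addr, (uint64_t)len };

	assert(fd >= 0);
	return send(fd, &n, sizeof(n), 0) == (ssize_t)sizeof(n) ? 0 : -1;
}

/* Receive one notice; returns 0 on success, -1 on error or short read. */
int recv_notice(int fd, struct map_notice *out)
{
	assert(fd >= 0);
	return recv(fd, out, sizeof(*out), MSG_WAITALL) == (ssize_t)sizeof(*out) ? 0 : -1;
}
```

The mapping side would call send_notice() after the mapping is in place, and
the remote task would call recv_notice() before touching the range.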

> What if the range collides? This sounds quite scary to me TBH.

If the range collides, the overlapping part of the old VMA is unmapped,
the same way as with an ordinary mmap() with MAP_FIXED. You may intersect
a range that another thread has mapped, so you need synchronization between
them. There is no principal difference.

Also I'm going to add a flag to prevent unmapping like Kees suggested.
Please, see his message.

> Why cannot you simply use shared memory for that?

Because the remote task may want a specific type of VMA. It may not want to
share a VMA with its children.

Speaking about online migration, a task wants its anonymous private VMAs
to remain the same after the migration. Otherwise, imagine the situation
when a task's stack becomes a shared VMA after the migration.
Also, a task wants an anonymous mapping to remain anonymous.

In general, if shared memory were enough for everything, we would
never have had the process_vm_writev() and process_vm_readv() syscalls.

Kirill


Re: [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplication a process mapping

2019-05-16 Thread Kirill Tkhai
On 16.05.2019 16:32, Jann Horn wrote:
> On Wed, May 15, 2019 at 5:11 PM Kirill Tkhai  wrote:
>> This patchset adds a new syscall, which makes possible
>> to clone a mapping from a process to another process.
>> The syscall supplements the functionality provided
>> by process_vm_writev() and process_vm_readv() syscalls,
>> and it may be useful in many situation.
> [...]
>> The proposed syscall aims to introduce an interface, which
>> supplements currently existing process_vm_writev() and
>> process_vm_readv(), and allows to solve the problem with
>> anonymous memory transfer. The above example may be rewritten as:
>>
>> void *buf;
>>
>> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>>            MAP_PRIVATE|MAP_ANONYMOUS, ...);
>> recv(sock, buf, n * PAGE_SIZE, 0);
>>
>> /* Sign of @pid is direction: "from @pid task to current" or vice versa. */
>> process_vm_mmap(-pid, buf, n * PAGE_SIZE, remote_addr, PVMMAP_FIXED);
>> munmap(buf, n * PAGE_SIZE);
> 
> In this specific example, an alternative would be to splice() from the
> socket into /proc/$pid/mem, or something like that, right?
> proc_mem_operations has no ->splice_read() at the moment, and it'd
> need that to be more efficient, but that could be built without
> creating new UAPI, right?

I have never seen socket memory being pushed out to swap.
If it cannot be, there is a fundamental problem.
But, anyway, like you guessed below:
 
> But I guess maybe your workload is not that simple? What do you
> actually do with the received data between receiving it and shoving it
> over into the other process?

Data are usually sent encrypted and compressed over the socket, so there is no
possibility to go this way. You may want to do anything with the data
before passing it to another process.

Kirill


Re: [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplication a process mapping

2019-05-16 Thread Michal Hocko
On Thu 16-05-19 15:30:34, Michal Hocko wrote:
> [You are defining a new user visible API, please always add linux-api
>  mailing list - now done]
> 
> On Wed 15-05-19 18:11:15, Kirill Tkhai wrote:
[...]
> > The proposed syscall aims to introduce an interface, which
> > supplements currently existing process_vm_writev() and
> > process_vm_readv(), and allows to solve the problem with
> > anonymous memory transfer. The above example may be rewritten as:
> > 
> > void *buf;
> > 
> > buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
> >            MAP_PRIVATE|MAP_ANONYMOUS, ...);
> > recv(sock, buf, n * PAGE_SIZE, 0);
> >
> > /* Sign of @pid is direction: "from @pid task to current" or vice versa. */
> > process_vm_mmap(-pid, buf, n * PAGE_SIZE, remote_addr, PVMMAP_FIXED);
> > munmap(buf, n * PAGE_SIZE);

AFAIU this means that you actually want to do an mmap of an anonymous
memory with a COW semantic to the remote process right? How does the
remote process find out where and what has been mmaped? What if the
range collides? This sounds quite scary to me TBH. Why cannot you simply
use shared memory for that?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplication a process mapping

2019-05-16 Thread Adam Borowski
On Thu, May 16, 2019 at 04:10:07PM +0300, Kirill Tkhai wrote:
> On 15.05.2019 22:38, Adam Borowski wrote:
> > On Wed, May 15, 2019 at 06:11:15PM +0300, Kirill Tkhai wrote:
> >> This patchset adds a new syscall, which makes possible
> >> to clone a mapping from a process to another process.
> >> The syscall supplements the functionality provided
> >> by process_vm_writev() and process_vm_readv() syscalls,
> >> and it may be useful in many situation.
> >>
> >> For example, it allows to make a zero copy of data,
> >> when process_vm_writev() was previously used:
> > 
> > I wonder, why not optimize the existing interfaces to do zero copy if
> > properly aligned?  No need for a new syscall, and old code would immediately
> > benefit.
> 
> Because, this is just not possible. You can't zero copy anonymous pages
> of a process to pages of a remote process, when they are different pages.

fork() manages that, and so does KSM.  Like KSM, you want to make a page
shared -- you just skip the comparison step as you want to overwrite the old
contents.

And there's no need to touch the page, as fork() manages that fine no matter
if the page is resident, anonymous in swap, or file-backed, all without
reading from swap.

> >> There are several problems with process_vm_writev() in this example:
> >>
> >> 1)it causes pagefault on remote process memory, and it forces
> >>   allocation of a new page (if was not preallocated);
> >>
> >> 2)amount of memory for this example is doubled in a moment --
> >>   n pages in current and n pages in remote tasks are occupied
> >>   at the same time;
> >>
> >> 3)received data has no a chance to be properly swapped for
> >>   a long time.
> > 
> > That'll handle all of your above problems, except for making pages
> > subject to CoW if written to.  But if making pages writeably shared is
> > desired, the old functions have a "flags" argument that doesn't yet have a
> > single bit defined.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ Latin:   meow 4 characters, 4 columns,  4 bytes
⣾⠁⢠⠒⠀⣿⡁ Greek:   μεου 4 characters, 4 columns,  8 bytes
⢿⡄⠘⠷⠚⠋  Runes:   ᛗᛖᛟᚹ 4 characters, 4 columns, 12 bytes
⠈⠳⣄ Chinese: 喵   1 character,  2 columns,  3 bytes <-- best!


Re: [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplication a process mapping

2019-05-16 Thread Jann Horn
On Wed, May 15, 2019 at 5:11 PM Kirill Tkhai  wrote:
> This patchset adds a new syscall, which makes possible
> to clone a mapping from a process to another process.
> The syscall supplements the functionality provided
> by process_vm_writev() and process_vm_readv() syscalls,
> and it may be useful in many situation.
[...]
> The proposed syscall aims to introduce an interface, which
> supplements currently existing process_vm_writev() and
> process_vm_readv(), and allows to solve the problem with
> anonymous memory transfer. The above example may be rewritten as:
>
> void *buf;
>
> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>            MAP_PRIVATE|MAP_ANONYMOUS, ...);
> recv(sock, buf, n * PAGE_SIZE, 0);
>
> /* Sign of @pid is direction: "from @pid task to current" or vice versa. */
> process_vm_mmap(-pid, buf, n * PAGE_SIZE, remote_addr, PVMMAP_FIXED);
> munmap(buf, n * PAGE_SIZE);

In this specific example, an alternative would be to splice() from the
socket into /proc/$pid/mem, or something like that, right?
proc_mem_operations has no ->splice_read() at the moment, and it'd
need that to be more efficient, but that could be built without
creating new UAPI, right?
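
For reference, the copy-based way of writing through /proc/$pid/mem that
exists today looks roughly like the sketch below. It uses plain pwrite(), so
it is not the splice() path described above and it still copies the data;
the function name is made up for the example.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write len bytes from buf to remote_addr in task pid via /proc/<pid>/mem.
 * Returns the number of bytes written, or -1 on error. The caller needs
 * ptrace-level access to the target.
 */
ssize_t proc_mem_write(pid_t pid, const void *buf, size_t len, void *remote_addr)
{
	char path[64];
	ssize_t n;
	int fd;

	assert(pid > 0);
	snprintf(path, sizeof(path), "/proc/%ld/mem", (long)pid);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	/* The file offset is the virtual address in the target task. */
	n = pwrite(fd, buf, len, (off_t)(uintptr_t)remote_addr);
	close(fd);
	return n;
}
```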

But I guess maybe your workload is not that simple? What do you
actually do with the received data between receiving it and shoving it
over into the other process?


Re: [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplication a process mapping

2019-05-16 Thread Michal Hocko
[You are defining a new user visible API, please always add linux-api
 mailing list - now done]

On Wed 15-05-19 18:11:15, Kirill Tkhai wrote:
> This patchset adds a new syscall, which makes possible
> to clone a mapping from a process to another process.
> The syscall supplements the functionality provided
> by process_vm_writev() and process_vm_readv() syscalls,
> and it may be useful in many situation.
> 
> For example, it allows to make a zero copy of data,
> when process_vm_writev() was previously used:
> 
>   struct iovec local_iov, remote_iov;
>   void *buf;
> 
>   buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>              MAP_PRIVATE|MAP_ANONYMOUS, ...);
>   recv(sock, buf, n * PAGE_SIZE, 0);
>
>   local_iov.iov_base = buf;
>   local_iov.iov_len = n * PAGE_SIZE;
>   remote_iov = ...;
>
>   process_vm_writev(pid, &local_iov, 1, &remote_iov, 1, 0);
>   munmap(buf, n * PAGE_SIZE);
> 
>   (Note, that above completely ignores error handling)
> 
> There are several problems with process_vm_writev() in this example:
> 
> 1)it causes pagefault on remote process memory, and it forces
>   allocation of a new page (if was not preallocated);
> 
> 2)amount of memory for this example is doubled in a moment --
>   n pages in current and n pages in remote tasks are occupied
>   at the same time;
> 
> 3)received data has no a chance to be properly swapped for
>   a long time.
> 
> The third is the most critical in case of remote process touches
> the data pages some time after process_vm_writev() was made.
> Imagine, node is under memory pressure:
> 
> a)kernel moves @buf pages into swap right after recv();
> b)process_vm_writev() reads the data back from swap to pages;
> c)process_vm_writev() allocates duplicate pages in remote
>   process and populates them;
> d)munmap() unmaps @buf;
> e)5 minutes later remote task touches data.
> 
> In stages "a" and "b" kernel submits unneeded IO and makes
> system IO throughput worse. To make "b" and "c", kernel
> reclaims memory, and moves pages of some other processes
> to swap, so they have to read pages from swap back. Also,
> unneeded copying of pages is occured, while zero-copy is
> more preferred.
> 
> We observe similar problem during online migration of big enough
> containers, when after doubling of container's size, the time
> increases 100 times. The system resides under high IO and
> throwing out of useful cashes.
> 
> The proposed syscall aims to introduce an interface, which
> supplements currently existing process_vm_writev() and
> process_vm_readv(), and allows to solve the problem with
> anonymous memory transfer. The above example may be rewritten as:
> 
>   void *buf;
> 
>   buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>              MAP_PRIVATE|MAP_ANONYMOUS, ...);
>   recv(sock, buf, n * PAGE_SIZE, 0);
>
>   /* Sign of @pid is direction: "from @pid task to current" or vice versa. */
>   process_vm_mmap(-pid, buf, n * PAGE_SIZE, remote_addr, PVMMAP_FIXED);
>   munmap(buf, n * PAGE_SIZE);
> 
> It is swap-friendly: in case of memory is swapped right after recv(),
> the syscall just copies pagetable entries like we do on fork(),
> so real access to pages does not occurs, and no IO is needed.
> No excess pages are reclaimed, and number of pages is not doubled.
> Also, zero-copy takes a place, and this also reduces overhead.
> 
> The patchset does not introduce much new code, since we simply
> reuse existing copy_page_range() and copy_vma() functions.
> We extend copy_vma() to be able merge VMAs in remote task [2/5],
> and teach copy_page_range() to work with different local and
> remote addresses [3/5]. Patch [5/5] introduces the syscall logic,
> which mostly consists of sanity checks. The rest of patches
> are preparations.
> 
> This syscall may be used for page servers like in example
> above, for migration (I assume, even virtual machines may
> want something like this), for zero-copy desiring users
> of process_vm_writev() and process_vm_readv(), for debug
> purposes, etc. It requires the same permittions like
> existing proc_vm_xxx() syscalls have.
> 
> The tests I used may be obtained here:
> 
> [1]https://gist.github.com/tkhai/198d32fdc001ec7812a5e1ccf091f275
> [2]https://gist.github.com/tkhai/f52dbaeedad5a699f3fb386fda676562
> 
> ---
> 
> Kirill Tkhai (5):
>   mm: Add process_vm_mmap() syscall declaration
>   mm: Extend copy_vma()
>   mm: Extend copy_page_range()
>   mm: Export round_hint_to_min()
>   mm: Add process_vm_mmap()
> 
> 
>  arch/x86/entry/syscalls/syscall_32.tbl |    1
>  arch/x86/entry/syscalls/syscall_64.tbl |    2
>  include/linux/huge_mm.h                |    6 +
>  include/linux/mm.h                     |   11 ++
>  include/linux/mm_types.h               |    2
>  include/linux/mman.h                   |   14 +++
>  include/linux/syscalls.h               |    5 +
>  include/uapi/asm-generic/mman-common.h |    5 +
>  

Re: [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplication a process mapping

2019-05-16 Thread Jann Horn
On Thu, May 16, 2019 at 3:03 PM Kirill Tkhai  wrote:
> On 15.05.2019 21:46, Jann Horn wrote:
> > On Wed, May 15, 2019 at 5:11 PM Kirill Tkhai  wrote:
> >> This patchset adds a new syscall, which makes possible
> >> to clone a mapping from a process to another process.
> >> The syscall supplements the functionality provided
> >> by process_vm_writev() and process_vm_readv() syscalls,
> >> and it may be useful in many situation.
> >>
> >> For example, it allows to make a zero copy of data,
> >> when process_vm_writev() was previously used:
> > [...]
> >> This syscall may be used for page servers like in example
> >> above, for migration (I assume, even virtual machines may
> >> want something like this), for zero-copy desiring users
> >> of process_vm_writev() and process_vm_readv(), for debug
> >> purposes, etc. It requires the same permittions like
> >> existing proc_vm_xxx() syscalls have.
> >
> > Have you considered using userfaultfd instead? userfaultfd has
> > interfaces (UFFDIO_COPY and UFFDIO_ZERO) for directly shoving pages
> > into the VMAs of other processes. This works without the churn of
> > creating and merging VMAs all the time. userfaultfd is the interface
> > that was written to support virtual machine migration (and it supports
> > live migration, too).
>
> I know about userfaultfd, but it does not solve the discussed problem.
> It allocates new pages to make UFFDIO_COPY (see mcopy_atomic_pte()),
> and it accumulates all the disadvantages, the example from [0/5]
> message has.

Sorry, right, I misremembered that.


Re: [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplication a process mapping

2019-05-16 Thread Kirill Tkhai
Hi, Adam,

On 15.05.2019 22:38, Adam Borowski wrote:
> On Wed, May 15, 2019 at 06:11:15PM +0300, Kirill Tkhai wrote:
>> This patchset adds a new syscall, which makes possible
>> to clone a mapping from a process to another process.
>> The syscall supplements the functionality provided
>> by process_vm_writev() and process_vm_readv() syscalls,
>> and it may be useful in many situation.
>>
>> For example, it allows to make a zero copy of data,
>> when process_vm_writev() was previously used:
> 
> I wonder, why not optimize the existing interfaces to do zero copy if
> properly aligned?  No need for a new syscall, and old code would immediately
> benefit.

Because this is just not possible. You can't zero-copy anonymous pages
of a process into pages of a remote process when they are different pages.

>> There are several problems with process_vm_writev() in this example:
>>
>> 1)it causes pagefault on remote process memory, and it forces
>>   allocation of a new page (if was not preallocated);
>>
>> 2)amount of memory for this example is doubled in a moment --
>>   n pages in current and n pages in remote tasks are occupied
>>   at the same time;
>>
>> 3)received data has no a chance to be properly swapped for
>>   a long time.
> 
> That'll handle all of your above problems, except for making pages
> subject to CoW if written to.  But if making pages writeably shared is
> desired, the old functions have a "flags" argument that doesn't yet have a
> single bit defined.

Kirill


Re: [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplication a process mapping

2019-05-16 Thread Kirill Tkhai
Hi, Jann,

On 15.05.2019 21:46, Jann Horn wrote:
> On Wed, May 15, 2019 at 5:11 PM Kirill Tkhai  wrote:
>> This patchset adds a new syscall, which makes possible
>> to clone a mapping from a process to another process.
>> The syscall supplements the functionality provided
>> by process_vm_writev() and process_vm_readv() syscalls,
>> and it may be useful in many situation.
>>
>> For example, it allows to make a zero copy of data,
>> when process_vm_writev() was previously used:
> [...]
>> This syscall may be used for page servers like in example
>> above, for migration (I assume, even virtual machines may
>> want something like this), for zero-copy desiring users
>> of process_vm_writev() and process_vm_readv(), for debug
>> purposes, etc. It requires the same permittions like
>> existing proc_vm_xxx() syscalls have.
> 
> Have you considered using userfaultfd instead? userfaultfd has
> interfaces (UFFDIO_COPY and UFFDIO_ZERO) for directly shoving pages
> into the VMAs of other processes. This works without the churn of
> creating and merging VMAs all the time. userfaultfd is the interface
> that was written to support virtual machine migration (and it supports
> live migration, too).

I know about userfaultfd, but it does not solve the discussed problem.
It allocates new pages to do UFFDIO_COPY (see mcopy_atomic_pte()),
and it has all the disadvantages that the example from the [0/5]
message has.

Kirill


Re: [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplication a process mapping

2019-05-15 Thread Adam Borowski
On Wed, May 15, 2019 at 06:11:15PM +0300, Kirill Tkhai wrote:
> This patchset adds a new syscall, which makes possible
> to clone a mapping from a process to another process.
> The syscall supplements the functionality provided
> by process_vm_writev() and process_vm_readv() syscalls,
> and it may be useful in many situation.
> 
> For example, it allows to make a zero copy of data,
> when process_vm_writev() was previously used:

I wonder, why not optimize the existing interfaces to do zero copy if
properly aligned?  No need for a new syscall, and old code would immediately
benefit.

> There are several problems with process_vm_writev() in this example:
> 
> 1)it causes pagefault on remote process memory, and it forces
>   allocation of a new page (if was not preallocated);
> 
> 2)amount of memory for this example is doubled in a moment --
>   n pages in current and n pages in remote tasks are occupied
>   at the same time;
> 
> 3)received data has no a chance to be properly swapped for
>   a long time.

That'll handle all of your above problems, except for making pages
subject to CoW if written to.  But if making pages writeably shared is
desired, the old functions have a "flags" argument that doesn't yet have a
single bit defined.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ Latin:   meow 4 characters, 4 columns,  4 bytes
⣾⠁⢠⠒⠀⣿⡁ Greek:   μεου 4 characters, 4 columns,  8 bytes
⢿⡄⠘⠷⠚⠋  Runes:   ᛗᛖᛟᚹ 4 characters, 4 columns, 12 bytes
⠈⠳⣄ Chinese: 喵   1 character,  2 columns,  3 bytes <-- best!


Re: [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplication a process mapping

2019-05-15 Thread Jann Horn
On Wed, May 15, 2019 at 5:11 PM Kirill Tkhai  wrote:
> This patchset adds a new syscall, which makes possible
> to clone a mapping from a process to another process.
> The syscall supplements the functionality provided
> by process_vm_writev() and process_vm_readv() syscalls,
> and it may be useful in many situation.
>
> For example, it allows to make a zero copy of data,
> when process_vm_writev() was previously used:
[...]
> This syscall may be used for page servers like in example
> above, for migration (I assume, even virtual machines may
> want something like this), for zero-copy desiring users
> of process_vm_writev() and process_vm_readv(), for debug
> purposes, etc. It requires the same permittions like
> existing proc_vm_xxx() syscalls have.

Have you considered using userfaultfd instead? userfaultfd has
interfaces (UFFDIO_COPY and UFFDIO_ZERO) for directly shoving pages
into the VMAs of other processes. This works without the churn of
creating and merging VMAs all the time. userfaultfd is the interface
that was written to support virtual machine migration (and it supports
live migration, too).


[PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplication a process mapping

2019-05-15 Thread Kirill Tkhai
This patchset adds a new syscall, which makes it possible
to clone a mapping from one process to another.
The syscall supplements the functionality provided
by the process_vm_writev() and process_vm_readv() syscalls,
and it may be useful in many situations.

For example, it allows zero-copy transfer of data
where process_vm_writev() was previously used:

struct iovec local_iov, remote_iov;
void *buf;

buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
           MAP_PRIVATE|MAP_ANONYMOUS, ...);
recv(sock, buf, n * PAGE_SIZE, 0);

local_iov.iov_base = buf;
local_iov.iov_len = n * PAGE_SIZE;
remote_iov = ...;

process_vm_writev(pid, &local_iov, 1, &remote_iov, 1, 0);
munmap(buf, n * PAGE_SIZE);

(Note that the above completely ignores error handling.)
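
For completeness, a sketch of the copy step with error handling could look
like the following. The function name is made up for the example, and the
raw syscall() form is used only to avoid depending on feature-test macros;
glibc also provides a process_vm_writev() wrapper under _GNU_SOURCE.

```c
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Copy len bytes from buf in the caller to remote_addr in task pid,
 * retrying after partial transfers.
 * Returns len on success, or -1 with errno set.
 */
ssize_t push_copy(pid_t pid, const void *buf, size_t len, void *remote_addr)
{
	size_t done = 0;

	assert(buf != NULL && remote_addr != NULL);
	while (done < len) {
		struct iovec local = { (char *)buf + done, len - done };
		struct iovec remote = { (char *)remote_addr + done, len - done };
		long n = syscall(SYS_process_vm_writev, (long)pid,
				 &local, 1UL, &remote, 1UL, 0UL);

		if (n < 0)
			return -1;	/* errno set by the kernel */
		if (n == 0) {
			errno = EIO;	/* no progress; avoid spinning */
			return -1;
		}
		done += (size_t)n;
	}
	return (ssize_t)done;
}
```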

There are several problems with process_vm_writev() in this example:

1) it causes a page fault on remote process memory, and it forces
   allocation of a new page (if it was not preallocated);

2) the amount of memory for this example is doubled for a moment:
   n pages in the current task and n pages in the remote task are
   occupied at the same time;

3) the received data has no chance to be properly swapped out for
   a long time.

The third is the most critical when the remote process touches
the data pages only some time after process_vm_writev() was made.
Imagine the node is under memory pressure:

a) the kernel moves @buf pages into swap right after recv();
b) process_vm_writev() reads the data back from swap into pages;
c) process_vm_writev() allocates duplicate pages in the remote
   process and populates them;
d) munmap() unmaps @buf;
e) 5 minutes later the remote task touches the data.

In stages "a" and "b" the kernel submits unneeded IO and makes
system IO throughput worse. To perform "b" and "c", the kernel
reclaims memory and moves pages of some other processes
to swap, so they have to read those pages back from swap later.
Also, unneeded copying of pages occurs, while zero-copy is
preferable.

We observe a similar problem during online migration of big enough
containers: after doubling the container's size, the migration time
increases 100 times. The system is under high IO load and
throws out useful caches.

The proposed syscall aims to introduce an interface that
supplements the existing process_vm_writev() and
process_vm_readv(), and solves the problem of
anonymous memory transfer. The above example may be rewritten as:

void *buf;

buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
           MAP_PRIVATE|MAP_ANONYMOUS, ...);
recv(sock, buf, n * PAGE_SIZE, 0);

/* Sign of @pid is direction: "from @pid task to current" or vice versa. */
process_vm_mmap(-pid, buf, n * PAGE_SIZE, remote_addr, PVMMAP_FIXED);
munmap(buf, n * PAGE_SIZE);
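
The sign convention for @pid can be spelled out as a small helper. This is a
sketch only: the type and function names are hypothetical, the mapping of
signs to directions is inferred from the example above (where -pid pushes
the caller's buffer into the remote task), and treating 0 as invalid is an
assumption.

```c
#include <assert.h>
#include <sys/types.h>

/* Hypothetical names, used only for this illustration. */
enum pvm_direction {
	PVM_FROM_REMOTE,	/* from @pid task to current */
	PVM_TO_REMOTE		/* from current to @pid task */
};

struct pvm_target {
	pid_t pid;
	enum pvm_direction dir;
};

/* Split a signed @pid argument into the real pid and the direction. */
struct pvm_target pvm_decode_pid(pid_t signed_pid)
{
	struct pvm_target t;

	assert(signed_pid != 0);	/* assumption: 0 is not a valid target */
	if (signed_pid < 0) {
		t.pid = -signed_pid;
		t.dir = PVM_TO_REMOTE;
	} else {
		t.pid = signed_pid;
		t.dir = PVM_FROM_REMOTE;
	}
	return t;
}
```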

It is swap-friendly: if the memory is swapped out right after recv(),
the syscall just copies pagetable entries like we do on fork(),
so no real access to the pages occurs, and no IO is needed.
No excess pages are reclaimed, and the number of pages is not doubled.
Also, zero-copy takes place, which further reduces overhead.

The patchset does not introduce much new code, since we simply
reuse the existing copy_page_range() and copy_vma() functions.
We extend copy_vma() to be able to merge VMAs in the remote task [2/5],
and teach copy_page_range() to work with different local and
remote addresses [3/5]. Patch [5/5] introduces the syscall logic,
which mostly consists of sanity checks. The rest of the patches
are preparations.

This syscall may be used for page servers like in the example
above, for migration (I assume even virtual machines may
want something like this), for users of process_vm_writev() and
process_vm_readv() that want zero-copy, for debugging
purposes, etc. It requires the same permissions as the
existing process_vm_xxx() syscalls.

The tests I used may be obtained here:

[1] https://gist.github.com/tkhai/198d32fdc001ec7812a5e1ccf091f275
[2] https://gist.github.com/tkhai/f52dbaeedad5a699f3fb386fda676562

---

Kirill Tkhai (5):
  mm: Add process_vm_mmap() syscall declaration
  mm: Extend copy_vma()
  mm: Extend copy_page_range()
  mm: Export round_hint_to_min()
  mm: Add process_vm_mmap()


 arch/x86/entry/syscalls/syscall_32.tbl |    1
 arch/x86/entry/syscalls/syscall_64.tbl |    2
 include/linux/huge_mm.h                |    6 +
 include/linux/mm.h                     |   11 ++
 include/linux/mm_types.h               |    2
 include/linux/mman.h                   |   14 +++
 include/linux/syscalls.h               |    5 +
 include/uapi/asm-generic/mman-common.h |    5 +
 include/uapi/asm-generic/unistd.h      |    5 +
 init/Kconfig                           |    9 +-
 kernel/fork.c                          |    5 +
 kernel/sys_ni.c                        |    2
 mm/huge_memory.c                       |   30 --
 mm/memory.c                            |  165 +---
 mm/mmap.c