Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space

2021-04-16 Thread Kirill Smelkov
On Tue, Apr 13, 2021 at 10:52:13PM -0700, Andrei Vagin wrote:
> We already have process_vm_readv and process_vm_writev to read and write
> to a process memory faster than we can do this with ptrace. And now it
> is time for process_vm_exec that allows executing code in an address
> space of another process. We can do this with ptrace but it is much
> slower.

I'd like to add that there are cases where using ptrace is hardly even possible:
in my situation one process needs to modify the address space of another process
while that target process is blocked in a page fault. From
https://lab.nexedi.com/kirr/wendelin.core/blob/539ec405/wcfs/notes.txt#L149-171 and
https://lab.nexedi.com/kirr/wendelin.core/blob/539ec405/wcfs/wcfs.go#L395-397 :

 8< 
Client cannot be ptraced while under pagefault
==============================================

We cannot use ptrace to run code on client thread that is under pagefault:

The kernel sends SIGSTOP to interrupt tracee, but the signal will be
processed only when the process returns from kernel space, e.g. here

 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/entry/common.c?id=v4.19-rc8-151-g23469de647c4#n160

This way the tracer won't receive obligatory information that tracee
stopped (via wait...) and even though ptrace(ATTACH) succeeds, all other
ptrace commands will fail:

 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/ptrace.c?id=v4.19-rc8-151-g23469de647c4#n1140
 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/ptrace.c?id=v4.19-rc8-151-g23469de647c4#n207

My original idea was to use ptrace to run code in the client process to change its
memory mappings, while the triggering process is under pagefault/read
to wcfs, and the above shows it won't work: trying to ptrace the
client from under wcfs will just block forever (the kernel will be
waiting for the read operation to finish before ptrace can proceed, and the
read will first be waiting for the ptrace stop to complete = deadlock).

...

//  ( one could imagine adjusting mappings synchronously via running
//    wcfs-trusted code via ptrace that wcfs injects into clients, but ptrace
//    won't work when client thread is blocked under pagefault or syscall(^) )
 8< 
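
To illustrate the failure mode described in the note above (this is not code
from wcfs): a tracer can at least bound how long it waits for the attach stop
instead of blocking forever. A minimal sketch, assuming a 1 ms polling interval;
attach_and_wait_stop() is a hypothetical helper name, and only
ptrace(PTRACE_ATTACH), waitpid() and nanosleep() are real calls.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: attach to pid and wait up to timeout_ms for the
 * attach stop to be reported.
 * Returns 1 if the tracee stopped, 0 on timeout, -1 on error (errno set).
 */
int attach_and_wait_stop(pid_t pid, int timeout_ms)
{
    struct timespec tick = { .tv_sec = 0, .tv_nsec = 1000000 }; /* 1 ms */
    int waited_ms = 0;
    int status;

    if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) == -1)
        return -1;

    for (;;) {
        /* WNOHANG: never block, so a tracee stuck in the kernel
         * cannot hang the tracer. */
        pid_t r = waitpid(pid, &status, __WALL | WNOHANG);
        if (r == -1)
            return -1;
        if (r == pid) {
            if (WIFSTOPPED(status))
                return 1;
            errno = ESRCH; /* the target exited instead of stopping */
            return -1;
        }
        if (waited_ms >= timeout_ms)
            return 0;
        nanosleep(&tick, NULL);
        waited_ms++;
    }
}
```

A return value of 0 is the situation from the note: the attach succeeded but
the stop was never reported. The tracee then stays attached, because
PTRACE_DETACH also requires a stopped tracee.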

To work around that I need to add a special thread to the target process and
implement a custom "isolation protocol" between my filesystem and the
client processes that use it:

https://lab.nexedi.com/kirr/wendelin.core/blob/539ec405/wcfs/wcfs.go#L94-182
https://lab.nexedi.com/kirr/wendelin.core/blob/539ec405/wcfs/client/wcfs.h#L20-96
https://lab.nexedi.com/kirr/wendelin.core/blob/539ec405/wcfs/client/wcfs.cpp#L24-203

Most parts of that dance would be much easier, or completely
unnecessary, if it were possible to reliably make changes to the address
space of a target process from outside.

Kirill


Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space

2021-04-14 Thread Andrei Vagin
On Wed, Apr 14, 2021 at 08:46:40AM +0200, Jann Horn wrote:
> On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin  wrote:
> > We already have process_vm_readv and process_vm_writev to read and write
> > to a process memory faster than we can do this with ptrace. And now it
> > is time for process_vm_exec that allows executing code in an address
> > space of another process. We can do this with ptrace but it is much
> > slower.
> >
> > = Use-cases =
> 
> It seems to me like your proposed API doesn't really fit either one of
> those usecases well...

We can definitely invent more specific interfaces for each of these
problems. Sure, they would handle their use-cases a bit better than this
generic one. But do we want to have two very specific interfaces with
separate kernel implementations? My previous experience shows that the
kernel community doesn't like interfaces that are specific to only one
narrow use-case.

So when I was working on process_vm_exec, I was thinking about how to make
one interface that would be good enough for all these use-cases.

> 
> > Here are two known use-cases. The first one is “application kernel”
> > sandboxes like User-mode Linux and gVisor. In this case, we have a
> > process that runs the sandbox kernel and a set of stub processes that
> > are used to manage guest address spaces. Guest code is executed in the
> > context of stub processes but all system calls are intercepted and
> > handled in the sandbox kernel. Right now, these sort of sandboxes use
> > PTRACE_SYSEMU to trap system calls, but the process_vm_exec can
> > significantly speed them up.
> 
> In this case, since you really only want an mm_struct to run code
> under, it seems weird to create a whole task with its own PID and so
> on. It seems to me like something similar to the /dev/kvm API would be
> more appropriate here? Implementation options that I see for that
> would be:
> 
> 1. mm_struct-based:
>   a set of syscalls to create a new mm_struct,
>   change memory mappings under that mm_struct, and switch to it
> 2. pagetable-mirroring-based:
>   like /dev/kvm, an API to create a new pagetable, mirror parts of
>   the mm_struct's pagetables over into it with modified permissions
>   (like KVM_SET_USER_MEMORY_REGION),
>   and run code under that context.
>   page fault handling would first handle the fault against mm->pgd
>   as normal, then mirror the PTE over into the secondary pagetables.
>   invalidation could be handled with MMU notifiers.

We are ready to discuss this sort of interface if the community
agrees to accept it. Are there any other users besides sandboxes that would
need something like this? Would the sandbox use-case be enough to justify
the addition of this interface?

> 
> > Another use-case is CRIU (Checkpoint/Restore in User-space). Several
> > process properties can be received only from the process itself. Right
> > now, we use a parasite code that is injected into the process. We do
> > this with ptrace but it is slow, unsafe, and tricky.
> 
> But this API will only let you run code under the *mm* of the target
> process, not fully in the context of a target *task*, right? So you
> still won't be able to use this for accessing anything other than
> memory? That doesn't seem very generically useful to me.

You are right, this will not rid us of the need to run parasite code.
I wrote that it will make the process of injecting parasite code a bit
simpler.

> 
> Also, I don't doubt that anything involving ptrace is kinda tricky,
> but it would be nice to have some more detail on what exactly makes
> this slow, unsafe and tricky. Are there API additions for ptrace that
> would make this work better? I imagine you're thinking of things like
> an API for injecting a syscall into the target process without having
> to first somehow find an existing SYSCALL instruction in the target
> process?


You describe the first problem correctly. We need to find or inject a
syscall instruction into the target process.
Right now, we need to do these steps to execute a system call:

* inject the syscall instruction (PTRACE_PEEKDATA/PTRACE_POKEDATA)
* get the original registers
* set new registers
* get the signal mask
* block signals
* resume the process
* stop it on the next syscall-exit
* get registers
* restore the original registers
* restore the signal mask

One of the CRIU principles is to avoid changing process state, so if
criu is interrupted, processes must be resumed and continue running. The
procedure of injecting a system call creates a window when a process is
in an inconsistent state, and CRIU disappearing at such a moment would
be fatal for the process. We don't think that we can eliminate such
windows, but we want to make them smaller.

In CRIU, we have a self-healing parasite. The idea is to inject
parasite code with a signal frame that contains the original process
state. The parasite runs in an "RPC daemon mode" and gets commands from
criu via a unix socket. If it detects that criu 

Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space

2021-04-14 Thread Jann Horn
 On Wed, Apr 14, 2021 at 2:20 PM Florian Weimer  wrote:
>
> * Jann Horn:
>
> > On Wed, Apr 14, 2021 at 12:27 PM Florian Weimer  wrote:
> >>
> >> * Andrei Vagin:
> >>
> >> > We already have process_vm_readv and process_vm_writev to read and write
> >> > to a process memory faster than we can do this with ptrace. And now it
> >> > is time for process_vm_exec that allows executing code in an address
> >> > space of another process. We can do this with ptrace but it is much
> >> > slower.
> >> >
> >> > = Use-cases =
> >>
> >> We also have some vaguely related within the same address space: running
> >> code on another thread, without modifying its stack, while it has signal
> >> handlers blocked, and without causing system calls to fail with EINTR.
> >> This can be used to implement certain kinds of memory barriers.
> >
> > That's what the membarrier() syscall is for, right? Unless you don't
> > want to register all threads for expedited membarrier use?
>
> membarrier is not sufficiently powerful for revoking biased locks, for
> example.

But on Linux >=5.10, together with rseq, it is, right? Then lock
acquisition could look roughly like this, in pseudo-C (yes, I know,
real rseq doesn't quite look like that, you'd need inline asm for that
unless the compiler adds special support for this):


enum local_state {
  STATE_FREE_OR_BIASED,
  STATE_LOCKED
};
#define OWNER_LOCKBIT (1U<<31)
/* notify futex when OWNER_LOCKBIT is cleared */
#define OWNER_WAITER_BIT (1U<<30)
struct biased_lock {
  unsigned int owner_with_lockbit;
  enum local_state local_state;
};

void lock(struct biased_lock *L) {
  unsigned int my_tid = THREAD_SELF->tid;
  RSEQ_SEQUENCE_START(); // restart here on failure
  if (READ_ONCE(L->owner_with_lockbit) == my_tid) {
    if (READ_ONCE(L->local_state) == STATE_LOCKED) {
      RSEQ_SEQUENCE_END();
      /*
       * Deadlock, abort execution.
       * Note that we are not necessarily actually *holding* the lock;
       * this can also happen if we entered a signal handler while we
       * were in the process of acquiring the lock.
       * But in that case it could just as well have happened that we
       * already grabbed the lock, so the caller is wrong anyway.
       */
      fatal_error();
    }
    RSEQ_COMMIT(L->local_state = STATE_LOCKED);
    return; /* fastpath success */
  }
  RSEQ_SEQUENCE_END();

  /* slowpath */
  /* acquire and lock owner field */
  unsigned int old_owner_with_lockbit;
  while (1) {
    old_owner_with_lockbit = READ_ONCE(L->owner_with_lockbit);
    if (old_owner_with_lockbit & OWNER_LOCKBIT) {
      if (!__sync_bool_compare_and_swap(&L->owner_with_lockbit,
            old_owner_with_lockbit,
            my_tid | OWNER_LOCKBIT | OWNER_WAITER_BIT))
        continue;
      futex(&L->owner_with_lockbit, FUTEX_WAIT,
            old_owner_with_lockbit, NULL, NULL, 0);
      continue;
    } else {
      if (__sync_bool_compare_and_swap(&L->owner_with_lockbit,
            old_owner_with_lockbit, my_tid | OWNER_LOCKBIT))
        break;
    }
  }

  /*
   * ensure old owner won't lock local_state anymore.
   * we only have to worry about the owner that directly preceded us here;
   * it will have done this step for the owners that preceded it before clearing
   * the LOCKBIT; so if we were the old owner, we don't have to sync.
   */
  if (old_owner_with_lockbit != my_tid) {
    if (membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ, 0, 0))
      fatal_error();
  }

  /*
   * As soon as the lock becomes STATE_FREE_OR_BIASED, we own it; but
   * at this point it might still be locked.
   */
  while (READ_ONCE(L->local_state) == STATE_LOCKED) {
    futex(&L->local_state, FUTEX_WAIT, STATE_LOCKED, NULL, NULL, 0);
  }

  /* OK, now the lock is biased to us and we can grab it. */
  WRITE_ONCE(L->local_state, STATE_LOCKED);

  /* drop lockbit (old_owner_with_lockbit is reused from above) */
  while (1) {
    old_owner_with_lockbit = READ_ONCE(L->owner_with_lockbit);
    if (__sync_bool_compare_and_swap(&L->owner_with_lockbit,
          old_owner_with_lockbit, my_tid))
      break;
  }
  if (old_owner_with_lockbit & OWNER_WAITER_BIT)
    futex(&L->owner_with_lockbit, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
}

void unlock(struct biased_lock *L) {
  unsigned int my_tid = THREAD_SELF->tid;

  /*
   * If we run before the membarrier(), the lock() path will immediately
   * see the lock as uncontended, and we don't need to call futex().
   * If we run after the membarrier(), the ->owner_with_lockbit read
   * here will observe the new owner and we'll wake the futex.
   */
  RSEQ_SEQUENCE_START();
  unsigned int old_owner_with_lockbit = READ_ONCE(L->owner_with_lockbit);
  RSEQ_COMMIT(WRITE_ONCE(L->local_state, STATE_FREE_OR_BIASED));
  if (old_owner_with_lockbit != my_tid)
    futex(&L->local_state, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
}


Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space

2021-04-14 Thread Florian Weimer
* Jann Horn:

> On Wed, Apr 14, 2021 at 12:27 PM Florian Weimer  wrote:
>>
>> * Andrei Vagin:
>>
>> > We already have process_vm_readv and process_vm_writev to read and write
>> > to a process memory faster than we can do this with ptrace. And now it
>> > is time for process_vm_exec that allows executing code in an address
>> > space of another process. We can do this with ptrace but it is much
>> > slower.
>> >
>> > = Use-cases =
>>
>> We also have some vaguely related within the same address space: running
>> code on another thread, without modifying its stack, while it has signal
>> handlers blocked, and without causing system calls to fail with EINTR.
>> This can be used to implement certain kinds of memory barriers.
>
> That's what the membarrier() syscall is for, right? Unless you don't
> want to register all threads for expedited membarrier use?

membarrier is not sufficiently powerful for revoking biased locks, for
example.

For the EINTR issue,  is an
example.  I believe CIFS has since seen a few fixes (after someone
reported that tar on CIFS wouldn't work because SIGCHLD caused
utimensat to fail, and there isn't even a signal handler for SIGCHLD!),
but the time it took to get to this point doesn't give me confidence
that it is safe to send signals to a thread that is running unknown
code.

But as you explained regarding the set*id broadcast, it seems that if we
had this run-on-another-thread functionality, we would likely encounter
issues similar to those with SA_RESTART.  We don't see the issue with
set*id today because it's a rare operation, and multi-threaded file
servers that need to change credentials frequently opt out of the set*id
broadcast anyway.  (What I have in mind is a future world where any
printf call, any malloc call, can trigger such a broadcast.)

The cross-VM CRIU scenario would probably be somewhere in between (not
quite the printf/malloc level, but more frequent than set*id).

Thanks,
Florian



Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space

2021-04-14 Thread Jann Horn
On Wed, Apr 14, 2021 at 12:27 PM Florian Weimer  wrote:
>
> * Andrei Vagin:
>
> > We already have process_vm_readv and process_vm_writev to read and write
> > to a process memory faster than we can do this with ptrace. And now it
> > is time for process_vm_exec that allows executing code in an address
> > space of another process. We can do this with ptrace but it is much
> > slower.
> >
> > = Use-cases =
>
> We also have some vaguely related within the same address space: running
> code on another thread, without modifying its stack, while it has signal
> handlers blocked, and without causing system calls to fail with EINTR.
> This can be used to implement certain kinds of memory barriers.

That's what the membarrier() syscall is for, right? Unless you don't
want to register all threads for expedited membarrier use?
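
For reference, registering for and issuing an expedited membarrier looks
roughly like the sketch below. It uses the plain
MEMBARRIER_CMD_PRIVATE_EXPEDITED command rather than an rseq-aware variant. I
am not aware of a glibc wrapper, so sys_membarrier() and the two helper names
are made up for this example.

```c
#define _GNU_SOURCE
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper around the raw syscall (hypothetical name). */
static int sys_membarrier(int cmd, unsigned int flags, int cpu_id)
{
    return (int)syscall(SYS_membarrier, cmd, flags, cpu_id);
}

/*
 * Register the calling process for private expedited barriers.
 * Returns 0 on success, 1 if the command is unavailable, -1 on error.
 */
int expedited_barrier_init(void)
{
    int mask = sys_membarrier(MEMBARRIER_CMD_QUERY, 0, 0);

    if (mask < 0 || !(mask & MEMBARRIER_CMD_PRIVATE_EXPEDITED))
        return 1;
    if (sys_membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0, 0))
        return -1;
    return 0;
}

/*
 * Issue a memory barrier on all running threads of this process.
 * Only valid after expedited_barrier_init() returned 0.
 */
int expedited_barrier(void)
{
    return sys_membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0, 0);
}
```

Registration is per process and only has to happen once, before the first
expedited barrier is issued.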

> It is
> also necessary to implement set*id with POSIX semantics in userspace.
> (Linux only changes the current thread credentials, POSIX requires
> process-wide changes.)  We currently use a signal for set*id, but it has
> issues (it can be blocked, the signal could come from somewhere, etc.).
> We can't use signals for barriers because of the EINTR issue, and
> because the signal context is stored on the stack.

This essentially becomes a question of "how much is set*id allowed to
block and what level of guarantee should there be by the time it
returns that no threads will perform privileged actions anymore after
it returns", right?

Like, if some piece of kernel code grabs a pointer to the current
credentials or acquires a temporary reference to some privileged
resource, then blocks on reading an argument from userspace, and then
performs a privileged action using the previously-grabbed credentials
or resource, what behavior do you want? Should setuid() block until
that privileged action has completed? Should it abort that action
(which is kinda what you get with the signals approach)? Should it
just return immediately even though an attacker who can write to
process memory at that point might still be able to influence a
privileged operation that hasn't read all its inputs yet? Should the
kernel be designed to keep track of whether it is currently holding a
privileged resource? Or should the kernel just specifically permit
credential changes in specific places where it is known that a task
might block for a long time and it is not holding any privileged
resources (kinda like the approach taken for freezer stuff)?

If userspace wants multithreaded setuid() without syscall aborting,
things get gnarly really fast; and having an interface to remotely
perform operations under another task's context isn't really relevant
to the core problem here, I think.


Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space

2021-04-14 Thread Florian Weimer
* Andrei Vagin:

> We already have process_vm_readv and process_vm_writev to read and write
> to a process memory faster than we can do this with ptrace. And now it
> is time for process_vm_exec that allows executing code in an address
> space of another process. We can do this with ptrace but it is much
> slower.
>
> = Use-cases =

We also have some vaguely related use-cases within the same address space: running
code on another thread, without modifying its stack, while it has signal
handlers blocked, and without causing system calls to fail with EINTR.
This can be used to implement certain kinds of memory barriers.  It is
also necessary to implement set*id with POSIX semantics in userspace.
(Linux only changes the current thread credentials, POSIX requires
process-wide changes.)  We currently use a signal for set*id, but it has
issues (it can be blocked, the signal could come from somewhere, etc.).
We can't use signals for barriers because of the EINTR issue, and
because the signal context is stored on the stack.

Thanks,
Florian



Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space

2021-04-14 Thread Benjamin Berg
On Wed, 2021-04-14 at 09:34 +0200, Johannes Berg wrote:
> On Wed, 2021-04-14 at 08:22 +0100, Anton Ivanov wrote:
> > On 14/04/2021 06:52, Andrei Vagin wrote:
> > > We already have process_vm_readv and process_vm_writev to read and
> > > write
> > > to a process memory faster than we can do this with ptrace. And now
> > > it
> > > is time for process_vm_exec that allows executing code in an
> > > address
> > > space of another process. We can do this with ptrace but it is much
> > > slower.
> > > 
> > > = Use-cases =
> > > 
> > > Here are two known use-cases. The first one is “application kernel”
> > > sandboxes like User-mode Linux and gVisor. In this case, we have a
> > > process that runs the sandbox kernel and a set of stub processes
> > > that
> > > are used to manage guest address spaces. Guest code is executed in
> > > the
> > > context of stub processes but all system calls are intercepted and
> > > handled in the sandbox kernel. Right now, these sort of sandboxes
> > > use
> > > PTRACE_SYSEMU to trap system calls, but the process_vm_exec can
> > > significantly speed them up.
> > 
> > Certainly interesting, but will require um to rework most of its
> > memory 
> > management and we will most likely need extra mm support to make use
> > of 
> > it in UML. We are not likely to get away just with one syscall there.
> 
> Might help the seccomp mode though:
> 
> https://patchwork.ozlabs.org/project/linux-um/list/?series=231980

Hmm, to me it sounds like it replaces both ptrace and seccomp mode
while completely avoiding the scheduling overhead that these techniques
have. I think everything UML needs is covered:

 * The new API can do syscalls in the target memory space
   (we can modify the address space)
 * The new API can run code until the next syscall happens
   (or a signal happens, which means SIGALRM for scheduling works)
 * Single step tracing should work by setting EFLAGS

I think the memory management itself stays fundamentally the same. We
just do the initial clone() using CLONE_STOPPED. We don't need any stub
code/data and we have everything we need to modify the address space
and run the userspace process.

Benjamin




Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space

2021-04-14 Thread Johannes Berg
On Wed, 2021-04-14 at 08:22 +0100, Anton Ivanov wrote:
> On 14/04/2021 06:52, Andrei Vagin wrote:
> > We already have process_vm_readv and process_vm_writev to read and write
> > to a process memory faster than we can do this with ptrace. And now it
> > is time for process_vm_exec that allows executing code in an address
> > space of another process. We can do this with ptrace but it is much
> > slower.
> > 
> > = Use-cases =
> > 
> > Here are two known use-cases. The first one is “application kernel”
> > sandboxes like User-mode Linux and gVisor. In this case, we have a
> > process that runs the sandbox kernel and a set of stub processes that
> > are used to manage guest address spaces. Guest code is executed in the
> > context of stub processes but all system calls are intercepted and
> > handled in the sandbox kernel. Right now, these sort of sandboxes use
> > PTRACE_SYSEMU to trap system calls, but the process_vm_exec can
> > significantly speed them up.
> 
> Certainly interesting, but will require um to rework most of its memory 
> management and we will most likely need extra mm support to make use of 
> it in UML. We are not likely to get away just with one syscall there.

Might help the seccomp mode though:

https://patchwork.ozlabs.org/project/linux-um/list/?series=231980

johannes




Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space

2021-04-14 Thread Anton Ivanov

On 14/04/2021 06:52, Andrei Vagin wrote:

> We already have process_vm_readv and process_vm_writev to read and write
> to a process memory faster than we can do this with ptrace. And now it
> is time for process_vm_exec that allows executing code in an address
> space of another process. We can do this with ptrace but it is much
> slower.
>
> = Use-cases =
>
> Here are two known use-cases. The first one is “application kernel”
> sandboxes like User-mode Linux and gVisor. In this case, we have a
> process that runs the sandbox kernel and a set of stub processes that
> are used to manage guest address spaces. Guest code is executed in the
> context of stub processes but all system calls are intercepted and
> handled in the sandbox kernel. Right now, these sort of sandboxes use
> PTRACE_SYSEMU to trap system calls, but the process_vm_exec can
> significantly speed them up.


Certainly interesting, but will require um to rework most of its memory 
management and we will most likely need extra mm support to make use of 
it in UML. We are not likely to get away just with one syscall there.





Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space

2021-04-14 Thread Jann Horn
On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin  wrote:
> We already have process_vm_readv and process_vm_writev to read and write
> to a process memory faster than we can do this with ptrace. And now it
> is time for process_vm_exec that allows executing code in an address
> space of another process. We can do this with ptrace but it is much
> slower.
>
> = Use-cases =

It seems to me like your proposed API doesn't really fit either one of
those usecases well...

> Here are two known use-cases. The first one is “application kernel”
> sandboxes like User-mode Linux and gVisor. In this case, we have a
> process that runs the sandbox kernel and a set of stub processes that
> are used to manage guest address spaces. Guest code is executed in the
> context of stub processes but all system calls are intercepted and
> handled in the sandbox kernel. Right now, these sort of sandboxes use
> PTRACE_SYSEMU to trap system calls, but the process_vm_exec can
> significantly speed them up.

In this case, since you really only want an mm_struct to run code
under, it seems weird to create a whole task with its own PID and so
on. It seems to me like something similar to the /dev/kvm API would be
more appropriate here? Implementation options that I see for that
would be:

1. mm_struct-based:
  a set of syscalls to create a new mm_struct,
  change memory mappings under that mm_struct, and switch to it
2. pagetable-mirroring-based:
  like /dev/kvm, an API to create a new pagetable, mirror parts of
  the mm_struct's pagetables over into it with modified permissions
  (like KVM_SET_USER_MEMORY_REGION),
  and run code under that context.
  page fault handling would first handle the fault against mm->pgd
  as normal, then mirror the PTE over into the secondary pagetables.
  invalidation could be handled with MMU notifiers.

> Another use-case is CRIU (Checkpoint/Restore in User-space). Several
> process properties can be received only from the process itself. Right
> now, we use a parasite code that is injected into the process. We do
> this with ptrace but it is slow, unsafe, and tricky.

But this API will only let you run code under the *mm* of the target
process, not fully in the context of a target *task*, right? So you
still won't be able to use this for accessing anything other than
memory? That doesn't seem very generically useful to me.

Also, I don't doubt that anything involving ptrace is kinda tricky,
but it would be nice to have some more detail on what exactly makes
this slow, unsafe and tricky. Are there API additions for ptrace that
would make this work better? I imagine you're thinking of things like
an API for injecting a syscall into the target process without having
to first somehow find an existing SYSCALL instruction in the target
process?

> process_vm_exec can
> simplify the process of injecting a parasite code and it will allow
> pre-dump memory without stopping processes. The pre-dump here is when we
> enable a memory tracker and dump the memory while a process is continue
> running. On each interaction we dump memory that has been changed from
> the previous iteration. In the final step, we will stop processes and
> dump their full state. Right now the most effective way to dump process
> memory is to create a set of pipes and splice memory into these pipes
> from the parasite code. With process_vm_exec, we will be able to call
> vmsplice directly. It means that we will not need to stop a process to
> inject the parasite code.

Alternatively you could add splice support to /proc/$pid/mem or add a
syscall similar to process_vm_readv() that splices into a pipe, right?
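
For comparison, here is a minimal sketch of what reading another process's
memory through the existing /proc/$pid/mem file looks like today: a plain
pread() that copies into a user buffer instead of splicing into a pipe.
read_remote_mem() is a hypothetical helper name, and the caller needs
ptrace-level access to the target.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy len bytes from remote_addr in process pid
 * into out. Returns the number of bytes read, or -1 with errno set.
 * Assumes a 64-bit off_t, so that user addresses fit in the offset.
 */
ssize_t read_remote_mem(pid_t pid, const void *remote_addr,
                        void *out, size_t len)
{
    char path[64];
    int fd, saved_errno;
    ssize_t n;

    snprintf(path, sizeof path, "/proc/%d/mem", (int)pid);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* The file offset is the virtual address in the target. */
    n = pread(fd, out, len, (off_t)(uintptr_t)remote_addr);

    saved_errno = errno; /* close() must not clobber pread()'s errno */
    close(fd);
    errno = saved_errno;
    return n;
}
```

Every byte goes through a copy here, which is the cost that splice support
would avoid.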


[PATCH 0/4 POC] Allow executing code and syscalls in another address space

2021-04-13 Thread Andrei Vagin
We already have process_vm_readv and process_vm_writev to read and write
to a process memory faster than we can do this with ptrace. And now it
is time for process_vm_exec that allows executing code in an address
space of another process. We can do this with ptrace but it is much
slower.

= Use-cases =

Here are two known use-cases. The first one is “application kernel”
sandboxes like User-mode Linux and gVisor. In this case, we have a
process that runs the sandbox kernel and a set of stub processes that
are used to manage guest address spaces. Guest code is executed in the
context of stub processes but all system calls are intercepted and
handled in the sandbox kernel. Right now, these sorts of sandboxes use
PTRACE_SYSEMU to trap system calls, but process_vm_exec can
significantly speed them up.

Another use-case is CRIU (Checkpoint/Restore in User-space). Several
process properties can be received only from the process itself. Right
now, we use parasite code that is injected into the process. We do
this with ptrace but it is slow, unsafe, and tricky. process_vm_exec can
simplify the process of injecting parasite code and it will allow
pre-dumping memory without stopping processes. The pre-dump here is when we
enable a memory tracker and dump the memory while a process continues
running. On each iteration we dump memory that has changed since
the previous iteration. In the final step, we will stop processes and
dump their full state. Right now the most effective way to dump process
memory is to create a set of pipes and splice memory into these pipes
from the parasite code. With process_vm_exec, we will be able to call
vmsplice directly. It means that we will not need to stop a process to
inject the parasite code.
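
To make the "splice memory into pipes" step concrete, here is a minimal sketch
of the vmsplice() call itself, run against the caller's own memory. It is not
CRIU's parasite code; splice_region_to_pipe() is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: attach the pages backing [buf, buf + len) to the
 * write end of a pipe. Returns the number of bytes spliced, or -1 with
 * errno set. Blocks while the pipe is full.
 */
ssize_t splice_region_to_pipe(int pipe_wr, const void *buf, size_t len)
{
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    size_t done = 0;

    while (iov.iov_len > 0) {
        ssize_t n = vmsplice(pipe_wr, &iov, 1, 0);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            break;
        /* Advance past the part that the pipe accepted. */
        iov.iov_base = (char *)iov.iov_base + n;
        iov.iov_len -= (size_t)n;
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

The pipe references the pages rather than copying them, so the source memory
should not be modified until the reader has consumed the data.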

= How it works =

process_vm_exec has two modes:

* Execute code in an address space of a target process and stop on any
  signal or system call.

* Execute a system call in an address space of a target process.

int process_vm_exec(pid_t pid, struct sigcontext *uctx,
                    unsigned long flags, siginfo_t *siginfo,
                    sigset_t *sigmask, size_t sizemask)

pid - the target process identifier. We could consider using a pidfd
instead of a PID here.

sigcontext contains the process state with which the process will be
resumed after switching the address space; when the process is
stopped, its state is saved back to sigcontext.

siginfo is information about the signal that has interrupted the process.
If the process is interrupted by a system call, siginfo will contain a
synthetic siginfo of the SIGSYS signal.

sigmask is the set of signals that process_vm_exec returns via siginfo.
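
As a rough illustration of how a caller might tell the two stop reasons apart,
the sketch below inspects the returned siginfo. It assumes the synthetic SIGSYS
siginfo carries the syscall number in si_syscall, as seccomp's SIGSYS does;
classify_stop() and the stop_event type are hypothetical and not part of the
patch set.

```c
#define _GNU_SOURCE
#include <signal.h>

enum stop_reason {
    STOP_SYSCALL, /* guest code tried to make a system call */
    STOP_SIGNAL   /* a signal from sigmask interrupted execution */
};

struct stop_event {
    enum stop_reason reason;
    int number; /* syscall number or signal number */
};

/*
 * Hypothetical helper: classify why execution stopped, based on the
 * siginfo filled in by the call. Assumption: a SIGSYS siginfo means
 * "stopped on a system call" and si_syscall holds its number.
 */
struct stop_event classify_stop(const siginfo_t *info)
{
    struct stop_event ev;

    if (info->si_signo == SIGSYS) {
        ev.reason = STOP_SYSCALL;
        ev.number = info->si_syscall;
    } else {
        ev.reason = STOP_SIGNAL;
        ev.number = info->si_signo;
    }
    return ev;
}
```

A sandbox kernel would then emulate the system call in the STOP_SYSCALL case
and deliver or handle the signal in the STOP_SIGNAL case before resuming the
guest.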

= How fast is it =

In the fourth patch, you can find two benchmarks that execute a function
that calls system calls in a loop. ptrace_vm_exec uses ptrace to trap
system calls, process_vm_exec uses the process_vm_exec syscall to do the
same thing.

ptrace_vm_exec:   1446 ns/syscall
process_vm_exec:   289 ns/syscall

PS: This version is just a prototype. Its goal is to collect initial
feedback, to discuss the interfaces, and maybe to get some advice on the
implementation.

Cc: Andrew Morton 
Cc: Andy Lutomirski 
Cc: Anton Ivanov 
Cc: Christian Brauner 
Cc: Dmitry Safonov <0x7f454...@gmail.com>
Cc: Ingo Molnar 
Cc: Jeff Dike 
Cc: Mike Rapoport 
Cc: Michael Kerrisk (man-pages) 
Cc: Oleg Nesterov 
Cc: Peter Zijlstra 
Cc: Richard Weinberger 
Cc: Thomas Gleixner 

Andrei Vagin (4):
  signal: add a helper to restore a process state from sigcontex
  arch/x86: implement the process_vm_exec syscall
  arch/x86: allow to execute syscalls via process_vm_exec
  selftests: add tests for process_vm_exec

 arch/Kconfig  |  15 ++
 arch/x86/Kconfig  |   1 +
 arch/x86/entry/common.c   |  19 +++
 arch/x86/entry/syscalls/syscall_64.tbl|   1 +
 arch/x86/include/asm/sigcontext.h |   2 +
 arch/x86/kernel/Makefile  |   1 +
 arch/x86/kernel/process_vm_exec.c | 160 ++
 arch/x86/kernel/signal.c  | 125 ++
 include/linux/entry-common.h  |   2 +
 include/linux/process_vm_exec.h   |  17 ++
 include/linux/sched.h |   7 +
 include/linux/syscalls.h  |   6 +
 include/uapi/asm-generic/unistd.h |   4 +-
 include/uapi/linux/process_vm_exec.h  |   8 +
 kernel/entry/common.c |   2 +-
 kernel/fork.c |   9 +
 kernel/sys_ni.c   |   2 +
 .../selftests/process_vm_exec/Makefile|   7 +
 tools/testing/selftests/process_vm_exec/log.h |  26 +++
 .../process_vm_exec/process_vm_exec.c | 105 
 .../process_vm_exec/process_vm_exec_fault.c   | 111 
 .../process_vm_exec/process_vm_exec_syscall.c |  81 +
 .../process_vm_exec/ptrace_vm_exec.c