Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space

2021-04-14 Thread Andrei Vagin
On Wed, Apr 14, 2021 at 08:46:40AM +0200, Jann Horn wrote:
> On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin  wrote:
> > We already have process_vm_readv and process_vm_writev to read and write
> > to a process memory faster than we can do this with ptrace. And now it
> > is time for process_vm_exec that allows executing code in an address
> > space of another process. We can do this with ptrace but it is much
> > slower.
> >
> > = Use-cases =
> 
> It seems to me like your proposed API doesn't really fit either one of
> those usecases well...

We definitely can invent more specific interfaces for each of these
problems. Sure, they would handle their use-cases a bit better than this
generic one. But do we want to have two very specific interfaces with
separate kernel implementations? My previous experience has shown that the
kernel community doesn't like interfaces that are specific to only one
narrow use-case.

So when I was working on process_vm_exec, I was thinking about how to make
one interface that would be good enough for all these use-cases.

> 
> > Here are two known use-cases. The first one is “application kernel”
> > sandboxes like User-mode Linux and gVisor. In this case, we have a
> > process that runs the sandbox kernel and a set of stub processes that
> > are used to manage guest address spaces. Guest code is executed in the
> > context of stub processes but all system calls are intercepted and
> > handled in the sandbox kernel. Right now, these sort of sandboxes use
> > PTRACE_SYSEMU to trap system calls, but the process_vm_exec can
> > significantly speed them up.
> 
> In this case, since you really only want an mm_struct to run code
> under, it seems weird to create a whole task with its own PID and so
> on. It seems to me like something similar to the /dev/kvm API would be
> more appropriate here? Implementation options that I see for that
> would be:
> 
> 1. mm_struct-based:
>   a set of syscalls to create a new mm_struct,
>   change memory mappings under that mm_struct, and switch to it
> 2. pagetable-mirroring-based:
>   like /dev/kvm, an API to create a new pagetable, mirror parts of
>   the mm_struct's pagetables over into it with modified permissions
>   (like KVM_SET_USER_MEMORY_REGION),
>   and run code under that context.
>   page fault handling would first handle the fault against mm->pgd
>   as normal, then mirror the PTE over into the secondary pagetables.
>   invalidation could be handled with MMU notifiers.

We are ready to discuss this sort of interface if the community agrees to
accept it. Are there any other users besides sandboxes that would
need something like this? Is the sandbox use-case enough to justify
the addition of this interface?

> 
> > Another use-case is CRIU (Checkpoint/Restore in User-space). Several
> > process properties can be received only from the process itself. Right
> > now, we use a parasite code that is injected into the process. We do
> > this with ptrace but it is slow, unsafe, and tricky.
> 
> But this API will only let you run code under the *mm* of the target
> process, not fully in the context of a target *task*, right? So you
> still won't be able to use this for accessing anything other than
> memory? That doesn't seem very generically useful to me.

You are right, this will not rid us of the need to run parasite code.
I wrote that it will make the process of injecting parasite code a bit
simpler.

> 
> Also, I don't doubt that anything involving ptrace is kinda tricky,
> but it would be nice to have some more detail on what exactly makes
> this slow, unsafe and tricky. Are there API additions for ptrace that
> would make this work better? I imagine you're thinking of things like
> an API for injecting a syscall into the target process without having
> to first somehow find an existing SYSCALL instruction in the target
> process?


You describe the first problem correctly. We need to find or inject a
syscall instruction into the target process.
Right now, we need to do these steps to execute a system call (a
simplified sketch follows the list):

* inject the syscall instruction (PTRACE_PEEKDATA/PTRACE_POKEDATA)
* get the original registers
* set the new registers
* get the signal mask
* block signals
* resume the process
* stop it on the next syscall-exit
* get the registers
* set the original registers back
* restore the signal mask
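
A heavily simplified sketch of that sequence for x86_64 might look like
the code below. It is not CRIU's actual parasite engine; it assumes the
tracee is already ptrace-stopped at a known syscall instruction address,
and it drops error handling and the signal-mask steps:

#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>

/* Execute nr(arg0) in a ptrace-stopped tracee and return its result. */
static long inject_syscall(pid_t pid, unsigned long syscall_insn_addr,
			   long nr, long arg0)
{
	struct user_regs_struct saved, regs;
	int status;

	ptrace(PTRACE_GETREGS, pid, 0, &saved);	/* save original registers */
	regs = saved;
	regs.rip = syscall_insn_addr;		/* points at a "syscall" insn */
	regs.rax = nr;
	regs.rdi = arg0;
	ptrace(PTRACE_SETREGS, pid, 0, &regs);	/* set new registers */

	ptrace(PTRACE_SYSCALL, pid, 0, 0);	/* resume; stop at syscall-entry */
	waitpid(pid, &status, 0);
	ptrace(PTRACE_SYSCALL, pid, 0, 0);	/* stop at syscall-exit */
	waitpid(pid, &status, 0);

	ptrace(PTRACE_GETREGS, pid, 0, &regs);	/* read the return value */
	ptrace(PTRACE_SETREGS, pid, 0, &saved);	/* restore original registers */
	return regs.rax;
}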

One of the CRIU principles is to avoid changing a process's state, so if
CRIU is interrupted, processes must be resumable and able to continue
running. The procedure of injecting a system call creates a window when a
process is in an inconsistent state, and CRIU disappearing at such a
moment would be fatal for the process. We don't think we can eliminate
such windows, but we want to make them smaller.

In CRIU, we have a self

[PATCH 4/4] selftests: add tests for process_vm_exec

2021-04-13 Thread Andrei Vagin
Output:
 $ make run_tests
 TAP version 13
 1..4
 # selftests: process_vm_exec: process_vm_exec
 # 1..1
 # ok 1 275 ns/syscall
 # # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
 ok 1 selftests: process_vm_exec: process_vm_exec
 # selftests: process_vm_exec: process_vm_exec_fault
 # 1..1
 # ok 1 789 ns/signal
 # # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
 ok 2 selftests: process_vm_exec: process_vm_exec_fault
 # selftests: process_vm_exec: ptrace_vm_exec
 # 1..1
 # ok 1 1378 ns/syscall
 # # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
 ok 3 selftests: process_vm_exec: ptrace_vm_exec
 # selftests: process_vm_exec: process_vm_exec_syscall
 # 1..1
 # ok 1 write works as expectd
 # # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
 ok 4 selftests: process_vm_exec: process_vm_exec_syscall

Signed-off-by: Andrei Vagin 
---
 .../selftests/process_vm_exec/Makefile|   7 ++
 tools/testing/selftests/process_vm_exec/log.h |  26 
 .../process_vm_exec/process_vm_exec.c | 105 +
 .../process_vm_exec/process_vm_exec_fault.c   | 111 ++
 .../process_vm_exec/process_vm_exec_syscall.c |  81 +
 .../process_vm_exec/ptrace_vm_exec.c  | 111 ++
 6 files changed, 441 insertions(+)
 create mode 100644 tools/testing/selftests/process_vm_exec/Makefile
 create mode 100644 tools/testing/selftests/process_vm_exec/log.h
 create mode 100644 tools/testing/selftests/process_vm_exec/process_vm_exec.c
 create mode 100644 
tools/testing/selftests/process_vm_exec/process_vm_exec_fault.c
 create mode 100644 
tools/testing/selftests/process_vm_exec/process_vm_exec_syscall.c
 create mode 100644 tools/testing/selftests/process_vm_exec/ptrace_vm_exec.c

diff --git a/tools/testing/selftests/process_vm_exec/Makefile 
b/tools/testing/selftests/process_vm_exec/Makefile
new file mode 100644
index ..bdf7fcf0fdd3
--- /dev/null
+++ b/tools/testing/selftests/process_vm_exec/Makefile
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: GPL-2.0
+
+UNAME_M := $(shell uname -m)
+TEST_GEN_PROGS_x86_64 := process_vm_exec process_vm_exec_fault ptrace_vm_exec process_vm_exec_syscall
+TEST_GEN_PROGS += $(TEST_GEN_PROGS_$(UNAME_M))
+
+include ../lib.mk
diff --git a/tools/testing/selftests/process_vm_exec/log.h 
b/tools/testing/selftests/process_vm_exec/log.h
new file mode 100644
index ..ef268c2cf2b8
--- /dev/null
+++ b/tools/testing/selftests/process_vm_exec/log.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __SELFTEST_PROCESS_VM_EXEC_LOG_H__
+#define __SELFTEST_PROCESS_VM_EXEC_LOG_H__
+
+#define pr_msg(fmt, lvl, ...)  \
+   ksft_print_msg("[%s] (%s:%d)\t" fmt "\n",   \
+   lvl, __FILE__, __LINE__, ##__VA_ARGS__)
+
+#define pr_p(func, fmt, ...)   func(fmt ": %m", ##__VA_ARGS__)
+
+#define pr_err(fmt, ...)   \
+   ({  \
+   ksft_test_result_error(fmt "\n", ##__VA_ARGS__); \
+   -1; \
+   })
+
+#define pr_fail(fmt, ...)  \
+   ({  \
+   ksft_test_result_fail(fmt "\n", ##__VA_ARGS__); \
+   -1; \
+   })
+
+#define pr_perror(fmt, ...)pr_p(pr_err, fmt, ##__VA_ARGS__)
+
+#endif
diff --git a/tools/testing/selftests/process_vm_exec/process_vm_exec.c 
b/tools/testing/selftests/process_vm_exec/process_vm_exec.c
new file mode 100644
index ..aa4009c43e01
--- /dev/null
+++ b/tools/testing/selftests/process_vm_exec/process_vm_exec.c
@@ -0,0 +1,105 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "asm/unistd.h"
+#include 
+#include 
+
+#include "../kselftest.h"
+#include "log.h"
+
+#ifndef __NR_process_vm_exec
+#define __NR_process_vm_exec 441
+#endif
+
+#define TEST_SYSCALL 123
+#define TEST_SYSCALL_RET 456
+#define TEST_MARKER 789
+#define TEST_TIMEOUT 5
+#define TEST_STACK_SIZE 65536
+
+static inline long __syscall1(long n, long a1)
+{
+   unsigned long ret;
+
+   __asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n), "D"(a1) : "rcx", 
"r11", "memory");
+
+   return ret;
+}
+
+int marker;
+
+static void guest(void)
+{
+   while (1)
+   if (__syscall1(TEST_SYSCALL, marker) != TEST_SYSCALL_RET)
+   abort();
+}
+
+int main(int argc, char **argv)
+{
+   struct sigcontext ctx = {};
+   struct timespec start, cur;
+   int status, ret;
+ 

[PATCH 3/4] arch/x86: allow to execute syscalls via process_vm_exec

2021-04-13 Thread Andrei Vagin
process_vm_exec allows executing code in the address space of another
process. It changes the current address space to the target address
space and resumes the current process with registers from the sigcontext
that is passed in the arguments.

This change adds the PROCESS_VM_EXEC_SYSCALL flag; if it is set,
process_vm_exec will execute a system call with arguments from sigcontext.

process_vm_exec returns 0 if the system call has been executed and an
error code in other cases.

The return code of the system call can be found in the corresponding
register in sigcontext.
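
For illustration, a hypothetical userspace snippet for this mode could
look like the sketch below. It is not part of the patch; the syscall
number matches the one used elsewhere in this series, and the
PROCESS_VM_EXEC_SYSCALL definition is assumed to come from the new uapi
header:

#define _GNU_SOURCE
#include <signal.h>			/* struct sigcontext, siginfo_t */
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/process_vm_exec.h>	/* assumption: patched uapi header */

#ifndef __NR_process_vm_exec
#define __NR_process_vm_exec 441
#endif

/*
 * Sketch: execute write(2) with a buffer that is mapped only in the
 * target's address space. fd refers to the caller's descriptor table,
 * since only the mm is swapped. Error handling is trimmed.
 */
static long remote_write(pid_t pid, int fd, unsigned long remote_buf,
			 unsigned long len)
{
	struct sigcontext ctx = {};
	siginfo_t info;
	sigset_t mask;

	sigemptyset(&mask);
	ctx.rax = __NR_write;	/* syscall number */
	ctx.rdi = fd;
	ctx.rsi = remote_buf;	/* valid only in the target mm */
	ctx.rdx = len;

	if (syscall(__NR_process_vm_exec, pid, &ctx, PROCESS_VM_EXEC_SYSCALL,
		    &info, &mask, sizeof(mask)))
		return -1;

	return ctx.rax;		/* the system call's return code */
}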

Signed-off-by: Andrei Vagin 
---
 arch/x86/entry/common.c  |  5 -
 arch/x86/kernel/process_vm_exec.c| 29 +++-
 include/linux/entry-common.h |  2 ++
 include/linux/process_vm_exec.h  |  2 ++
 include/uapi/linux/process_vm_exec.h |  8 
 kernel/entry/common.c|  2 +-
 6 files changed, 45 insertions(+), 3 deletions(-)
 create mode 100644 include/uapi/linux/process_vm_exec.h

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 42eac459b25b..8de02ca19aca 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -40,7 +40,10 @@
 __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 {
 #ifdef CONFIG_PROCESS_VM_EXEC
-   if (current->exec_mm && current->exec_mm->ctx) {
+   struct exec_mm *exec_mm = current->exec_mm;
+
+   if (exec_mm && exec_mm->ctx &&
+   !(exec_mm->flags & PROCESS_VM_EXEC_SYSCALL)) {
kernel_siginfo_t info = {
.si_signo = SIGSYS,
.si_call_addr = (void __user *)KSTK_EIP(current),
diff --git a/arch/x86/kernel/process_vm_exec.c 
b/arch/x86/kernel/process_vm_exec.c
index 28b32330f744..9124b23f1e9b 100644
--- a/arch/x86/kernel/process_vm_exec.c
+++ b/arch/x86/kernel/process_vm_exec.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 static void swap_mm(struct mm_struct *prev_mm, struct mm_struct *target_mm)
@@ -73,7 +74,7 @@ SYSCALL_DEFINE6(process_vm_exec, pid_t, pid, struct 
sigcontext __user *, uctx,
 
sigset_t mask;
 
-   if (flags)
+   if (flags & ~PROCESS_VM_EXEC_SYSCALL)
return -EINVAL;
 
if (sizemask != sizeof(sigset_t))
@@ -97,6 +98,9 @@ SYSCALL_DEFINE6(process_vm_exec, pid_t, pid, struct 
sigcontext __user *, uctx,
}
 
current_pt_regs()->ax = 0;
+   if (flags & PROCESS_VM_EXEC_SYSCALL)
+   syscall_exit_to_user_mode_prepare(current_pt_regs());
+
ret = swap_vm_exec_context(uctx);
if (ret < 0)
goto err_mm_put;
@@ -117,6 +121,29 @@ SYSCALL_DEFINE6(process_vm_exec, pid_t, pid, struct 
sigcontext __user *, uctx,
mmgrab(prev_mm);
swap_mm(prev_mm, mm);
 
+   if (flags & PROCESS_VM_EXEC_SYSCALL) {
+   struct pt_regs *regs = current_pt_regs();
+   kernel_siginfo_t info;
+   int sysno;
+
+   regs->orig_ax = regs->ax;
+   regs->ax = -ENOSYS;
+   sysno = syscall_get_nr(current, regs);
+
+   do_syscall_64(sysno, regs);
+
+   restore_vm_exec_context(regs);
+   info.si_signo = SIGSYS;
+   info.si_call_addr = (void __user *)KSTK_EIP(current);
+   info.si_arch = syscall_get_arch(current);
+   info.si_syscall = sysno;
+   ret = copy_siginfo_to_user(current->exec_mm->siginfo, &info);
+   current_pt_regs()->orig_ax = __NR_process_vm_exec;
+   current_pt_regs()->ax = -ENOSYS;
+   syscall_enter_from_user_mode_work(current_pt_regs(), current_pt_regs()->orig_ax);
+   return ret;
+   }
+
ret = current_pt_regs()->ax;
 
return ret;
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 474f29638d2c..d0ebbe9ca9e4 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -285,6 +285,8 @@ static inline void arch_syscall_exit_tracehook(struct 
pt_regs *regs, bool step)
 }
 #endif
 
+void syscall_exit_to_user_mode_prepare(struct pt_regs *regs);
+
 /**
  * syscall_exit_to_user_mode - Handle work before returning to user mode
  * @regs:  Pointer to currents pt_regs
diff --git a/include/linux/process_vm_exec.h b/include/linux/process_vm_exec.h
index a02535fbd5c8..2e04b4875a92 100644
--- a/include/linux/process_vm_exec.h
+++ b/include/linux/process_vm_exec.h
@@ -2,6 +2,8 @@
 #ifndef _LINUX_PROCESS_VM_EXEC_H
 #define _LINUX_PROCESS_VM_EXEC_H
 
+#include 
+
 struct exec_mm {
struct sigcontext *ctx;
struct mm_struct *mm;
diff --git a/include/uapi/linux/process_vm_exec.h 
b/include/uapi/linux/process_vm_exec.h
new file mode 100644
index ..35465b5d3ebf
--- /dev/null
+++ b/include/uapi/linux/process_vm_exec.h
@@ -0,0 +1,8 @@
+/* SPDX-Lice

[PATCH 1/4] signal: add a helper to restore a process state from sigcontex

2021-04-13 Thread Andrei Vagin
It will be used to implement process_vm_exec.

Signed-off-by: Andrei Vagin 
---
 arch/x86/kernel/signal.c | 78 ++--
 1 file changed, 43 insertions(+), 35 deletions(-)

diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index be0d7d4152ec..cc269a20dd5f 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -79,51 +79,43 @@ static void force_valid_ss(struct pt_regs *regs)
 # define CONTEXT_COPY_SIZE sizeof(struct sigcontext)
 #endif
 
-static int restore_sigcontext(struct pt_regs *regs,
- struct sigcontext __user *usc,
+static int __restore_sigcontext(struct pt_regs *regs,
+ struct sigcontext __user *sc,
  unsigned long uc_flags)
 {
-   struct sigcontext sc;
-
-   /* Always make any pending restarted system calls return -EINTR */
-   current->restart_block.fn = do_no_restart_syscall;
-
-   if (copy_from_user(&sc, usc, CONTEXT_COPY_SIZE))
-   return -EFAULT;
-
 #ifdef CONFIG_X86_32
-   set_user_gs(regs, sc.gs);
-   regs->fs = sc.fs;
-   regs->es = sc.es;
-   regs->ds = sc.ds;
+   set_user_gs(regs, sc->gs);
+   regs->fs = sc->fs;
+   regs->es = sc->es;
+   regs->ds = sc->ds;
 #endif /* CONFIG_X86_32 */
 
-   regs->bx = sc.bx;
-   regs->cx = sc.cx;
-   regs->dx = sc.dx;
-   regs->si = sc.si;
-   regs->di = sc.di;
-   regs->bp = sc.bp;
-   regs->ax = sc.ax;
-   regs->sp = sc.sp;
-   regs->ip = sc.ip;
+   regs->bx = sc->bx;
+   regs->cx = sc->cx;
+   regs->dx = sc->dx;
+   regs->si = sc->si;
+   regs->di = sc->di;
+   regs->bp = sc->bp;
+   regs->ax = sc->ax;
+   regs->sp = sc->sp;
+   regs->ip = sc->ip;
 
 #ifdef CONFIG_X86_64
-   regs->r8 = sc.r8;
-   regs->r9 = sc.r9;
-   regs->r10 = sc.r10;
-   regs->r11 = sc.r11;
-   regs->r12 = sc.r12;
-   regs->r13 = sc.r13;
-   regs->r14 = sc.r14;
-   regs->r15 = sc.r15;
+   regs->r8 = sc->r8;
+   regs->r9 = sc->r9;
+   regs->r10 = sc->r10;
+   regs->r11 = sc->r11;
+   regs->r12 = sc->r12;
+   regs->r13 = sc->r13;
+   regs->r14 = sc->r14;
+   regs->r15 = sc->r15;
 #endif /* CONFIG_X86_64 */
 
/* Get CS/SS and force CPL3 */
-   regs->cs = sc.cs | 0x03;
-   regs->ss = sc.ss | 0x03;
+   regs->cs = sc->cs | 0x03;
+   regs->ss = sc->ss | 0x03;
 
-   regs->flags = (regs->flags & ~FIX_EFLAGS) | (sc.flags & FIX_EFLAGS);
+   regs->flags = (regs->flags & ~FIX_EFLAGS) | (sc->flags & FIX_EFLAGS);
/* disable syscall checks */
regs->orig_ax = -1;
 
@@ -136,10 +128,26 @@ static int restore_sigcontext(struct pt_regs *regs,
force_valid_ss(regs);
 #endif
 
-   return fpu__restore_sig((void __user *)sc.fpstate,
+   return fpu__restore_sig((void __user *)sc->fpstate,
   IS_ENABLED(CONFIG_X86_32));
 }
 
+static int restore_sigcontext(struct pt_regs *regs,
+ struct sigcontext __user *usc,
+ unsigned long uc_flags)
+{
+   struct sigcontext sc;
+
+   /* Always make any pending restarted system calls return -EINTR */
+   current->restart_block.fn = do_no_restart_syscall;
+
+   if (copy_from_user(&sc, usc, CONTEXT_COPY_SIZE))
+   return -EFAULT;
+
+   return __restore_sigcontext(regs, &sc, uc_flags);
+}
+
+
 static __always_inline int
 __unsafe_setup_sigcontext(struct sigcontext __user *sc, void __user *fpstate,
 struct pt_regs *regs, unsigned long mask)
-- 
2.29.2



[PATCH 2/4] arch/x86: implement the process_vm_exec syscall

2021-04-13 Thread Andrei Vagin
This change introduces the new system call:
process_vm_exec(pid_t pid, struct sigcontext *uctx, unsigned long flags,
siginfo_t * uinfo, sigset_t *sigmask, size_t sizemask)

process_vm_exec allows executing the current process in the address
space of another process.

process_vm_exec swaps the current address space with the address space of
the specified process, sets the state from sigcontext, and resumes the
process. When the process receives a signal or calls a system call,
process_vm_exec saves the process state back to sigcontext, restores the
original address space, restores the original process state, and returns
to userspace.

If it was interrupted by a signal and the signal is in the user_mask,
the signal is dequeued and information about it is saved in uinfo.
If process_vm_exec is interrupted by a system call, a synthetic siginfo
for the SIGSYS signal is generated.

The behavior of this system call is similar to PTRACE_SYSEMU, but
everything happens in the context of one process, so
process_vm_exec shows better performance.

PTRACE_SYSEMU is primarily used to implement sandboxes (application
kernels) like User-mode Linux or gVisor. These types of sandboxes
intercept application system calls and act as the guest kernel.
A simple benchmark, where a "tracee" process executes system calls in a
loop and a "tracer" process traps the syscalls and handles them by just
incrementing the tracee's instruction pointer to skip the syscall
instruction, shows that process_vm_exec works more than 5 times faster
than PTRACE_SYSEMU.

Signed-off-by: Andrei Vagin 
---
 arch/Kconfig   |  15 +++
 arch/x86/Kconfig   |   1 +
 arch/x86/entry/common.c|  16 +++
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 arch/x86/include/asm/sigcontext.h  |   2 +
 arch/x86/kernel/Makefile   |   1 +
 arch/x86/kernel/process_vm_exec.c  | 133 +
 arch/x86/kernel/signal.c   |  47 +
 include/linux/process_vm_exec.h|  15 +++
 include/linux/sched.h  |   7 ++
 include/linux/syscalls.h   |   6 ++
 include/uapi/asm-generic/unistd.h  |   4 +-
 kernel/fork.c  |   9 ++
 kernel/sys_ni.c|   2 +
 14 files changed, 258 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/kernel/process_vm_exec.c
 create mode 100644 include/linux/process_vm_exec.h

diff --git a/arch/Kconfig b/arch/Kconfig
index ba4e966484ab..3ed9b8fb1727 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -514,6 +514,21 @@ config SECCOMP_FILTER
 
  See Documentation/userspace-api/seccomp_filter.rst for details.
 
+config HAVE_ARCH_PROCESS_VM_EXEC
+   bool
+   help
+ An arch should select this symbol to support the process_vm_exec 
system call.
+
+config PROCESS_VM_EXEC
+   prompt "Enable the process_vm_exec syscall"
+   def_bool y
+   depends on HAVE_ARCH_PROCESS_VM_EXEC
+   help
+ process_vm_exec allows executing code and system calls in a specified
+ address space.
+
+ If unsure, say Y.
+
 config HAVE_ARCH_STACKLEAK
bool
help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index fbf26e0f7a6a..1c7ebb58865e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -27,6 +27,7 @@ config X86_64
select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
select ARCH_USE_CMPXCHG_LOCKREF
select HAVE_ARCH_SOFT_DIRTY
+   select HAVE_ARCH_PROCESS_VM_EXEC
select MODULES_USE_ELF_RELA
select NEED_DMA_MAP_STATE
select SWIOTLB
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 870efeec8bda..42eac459b25b 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_XEN_PV
 #include 
@@ -38,6 +39,21 @@
 #ifdef CONFIG_X86_64
 __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 {
+#ifdef CONFIG_PROCESS_VM_EXEC
+   if (current->exec_mm && current->exec_mm->ctx) {
+   kernel_siginfo_t info = {
+   .si_signo = SIGSYS,
+   .si_call_addr = (void __user *)KSTK_EIP(current),
+   .si_arch = syscall_get_arch(current),
+   .si_syscall = nr,
+   };
+   restore_vm_exec_context(regs);
+   regs->ax = copy_siginfo_to_user(current->exec_mm->siginfo, &info);
+   syscall_exit_to_user_mode(regs);
+   return;
+   }
+#endif
+
nr = syscall_enter_from_user_mode(regs, nr);
 
instrumentation_begin();
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
b/arch/x86/entry/syscalls/syscall_64.tbl
index 379819244b91..2a8e27b2d87e 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -362,6 +362,7 @@
 438common

[PATCH 0/4 POC] Allow executing code and syscalls in another address space

2021-04-13 Thread Andrei Vagin
We already have process_vm_readv and process_vm_writev to read and write
another process's memory faster than we can with ptrace. And now it
is time for process_vm_exec, which allows executing code in the address
space of another process. We can do this with ptrace, but it is much
slower.

= Use-cases =

Here are two known use-cases. The first one is “application kernel”
sandboxes like User-mode Linux and gVisor. In this case, we have a
process that runs the sandbox kernel and a set of stub processes that
are used to manage guest address spaces. Guest code is executed in the
context of stub processes, but all system calls are intercepted and
handled in the sandbox kernel. Right now, these sorts of sandboxes use
PTRACE_SYSEMU to trap system calls, but process_vm_exec can
significantly speed them up.

Another use-case is CRIU (Checkpoint/Restore in User-space). Several
process properties can be obtained only from the process itself. Right
now, we use parasite code that is injected into the process. We do
this with ptrace, but it is slow, unsafe, and tricky. process_vm_exec can
simplify the process of injecting parasite code, and it will allow
pre-dumping memory without stopping processes. The pre-dump here is when
we enable a memory tracker and dump the memory while a process continues
running. On each iteration, we dump memory that has changed since the
previous iteration. In the final step, we stop processes and
dump their full state. Right now, the most effective way to dump process
memory is to create a set of pipes and splice memory into these pipes
from the parasite code. With process_vm_exec, we will be able to call
vmsplice directly. It means that we will not need to stop a process to
inject the parasite code.

= How it works =

process_vm_exec has two modes:

* Execute code in an address space of a target process and stop on any
  signal or system call.

* Execute a system call in an address space of a target process.

int process_vm_exec(pid_t pid, struct sigcontext *uctx,
                    unsigned long flags, siginfo_t *siginfo,
                    sigset_t *sigmask, size_t sizemask)

PID - the target process identifier. We could consider using pidfd
instead of PID here.

sigcontext contains the process state with which the process will be
resumed after switching the address space; when the process is stopped,
its state will be saved back to sigcontext.

siginfo is information about a signal that has interrupted the process.
If the process is interrupted by a system call, siginfo will contain a
synthetic siginfo for the SIGSYS signal.

sigmask is the set of signals that process_vm_exec returns via siginfo.
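
To make the first mode concrete, here is a rough, hypothetical sketch of
a sandbox-style loop built on top of this interface. It is not taken
from gVisor or UML; setup_guest_context(), emulate_syscall() and
handle_guest_signal() are made-up helpers, and the register layout
assumes x86_64:

#ifndef __NR_process_vm_exec
#define __NR_process_vm_exec 441	/* number used by this series */
#endif

/* Run guest code in the stub's address space; emulate every syscall. */
void run_guest(pid_t stub_pid)
{
	struct sigcontext ctx = {};
	siginfo_t si;
	sigset_t mask;

	sigemptyset(&mask);
	setup_guest_context(&ctx);	/* made-up: guest ip/sp/registers */

	for (;;) {
		if (syscall(__NR_process_vm_exec, stub_pid, &ctx, 0,
			    &si, &mask, sizeof(mask)) < 0)
			break;		/* a real error from the kernel */

		if (si.si_signo == SIGSYS)	/* guest issued a syscall */
			ctx.rax = emulate_syscall(si.si_syscall, &ctx);
		else				/* guest received a signal */
			handle_guest_signal(&si, &ctx);
	}
}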

= How fast is it =

In the fourth patch, you can find two benchmarks that execute a function
that calls system calls in a loop. ptrace_vm_exec uses ptrace to trap
system calls; process_vm_exec uses the process_vm_exec syscall to do the
same thing.

ptrace_vm_exec:   1446 ns/syscall
process_vm_exec:   289 ns/syscall

PS: This version is just a prototype. Its goal is to collect initial
feedback, to discuss the interfaces, and maybe to get some advice on the
implementation.

Cc: Andrew Morton 
Cc: Andy Lutomirski 
Cc: Anton Ivanov 
Cc: Christian Brauner 
Cc: Dmitry Safonov <0x7f454...@gmail.com>
Cc: Ingo Molnar 
Cc: Jeff Dike 
Cc: Mike Rapoport 
Cc: Michael Kerrisk (man-pages) 
Cc: Oleg Nesterov 
Cc: Peter Zijlstra 
Cc: Richard Weinberger 
Cc: Thomas Gleixner 

Andrei Vagin (4):
  signal: add a helper to restore a process state from sigcontex
  arch/x86: implement the process_vm_exec syscall
  arch/x86: allow to execute syscalls via process_vm_exec
  selftests: add tests for process_vm_exec

 arch/Kconfig  |  15 ++
 arch/x86/Kconfig  |   1 +
 arch/x86/entry/common.c   |  19 +++
 arch/x86/entry/syscalls/syscall_64.tbl|   1 +
 arch/x86/include/asm/sigcontext.h |   2 +
 arch/x86/kernel/Makefile  |   1 +
 arch/x86/kernel/process_vm_exec.c | 160 ++
 arch/x86/kernel/signal.c  | 125 ++
 include/linux/entry-common.h  |   2 +
 include/linux/process_vm_exec.h   |  17 ++
 include/linux/sched.h |   7 +
 include/linux/syscalls.h  |   6 +
 include/uapi/asm-generic/unistd.h |   4 +-
 include/uapi/linux/process_vm_exec.h  |   8 +
 kernel/entry/common.c |   2 +-
 kernel/fork.c |   9 +
 kernel/sys_ni.c   |   2 +
 .../selftests/process_vm_exec/Makefile|   7 +
 tools/testing/selftests/process_vm_exec/log.h |  26 +++
 .../process_vm_exec/process_vm_exec.c | 105 
 .../process_vm_exec/process_vm_exec_fault.c   | 111 
 .../process_vm_exec/process_vm_exec_syscall.c |  81 +
 .../process_vm_exec/ptrace_vm_

Re: [PATCH RESEND v1 3/4] powerpc/vdso: Separate vvar vma from vdso

2021-04-04 Thread Andrei Vagin
On Wed, Mar 31, 2021 at 04:48:46PM +, Christophe Leroy wrote:
> From: Dmitry Safonov 
> 
> Since commit 511157ab641e ("powerpc/vdso: Move vdso datapage up front")
> VVAR page is in front of the VDSO area. In result it breaks CRIU
> (Checkpoint Restore In Userspace) [1], where CRIU expects that "[vdso]"
> from /proc/../maps points at ELF/vdso image, rather than at VVAR data page.
> Laurent made a patch to keep CRIU working (by reading aux vector).
> But I think it still makes sence to separate two mappings into different
> VMAs. It will also make ppc64 less "special" for userspace and as
> a side-bonus will make VVAR page un-writable by debugger (which previously
> would COW page and can be unexpected).
> 
> I opportunistically Cc stable on it: I understand that usually such
> stuff isn't a stable material, but that will allow us in CRIU have
> one workaround less that is needed just for one release (v5.11) on
> one platform (ppc64), which we otherwise have to maintain.
> I wouldn't go as far as to say that the commit 511157ab641e is ABI
> regression as no other userspace got broken, but I'd really appreciate
> if it gets backported to v5.11 after v5.12 is released, so as not
> to complicate already non-simple CRIU-vdso code. Thanks!
> 
> Cc: Andrei Vagin 

Acked-by: Andrei Vagin 

> Cc: Andy Lutomirski 
> Cc: Benjamin Herrenschmidt 
> Cc: Christophe Leroy 
> Cc: Laurent Dufour 
> Cc: Michael Ellerman 
> Cc: Paul Mackerras 
> Cc: linuxppc-...@lists.ozlabs.org
> Cc: sta...@vger.kernel.org # v5.11
> [1]: https://github.com/checkpoint-restore/criu/issues/1417
> Signed-off-by: Dmitry Safonov 
> Tested-by: Christophe Leroy 
> Signed-off-by: Christophe Leroy 


Re: [PATCH RESEND v1 2/4] lib/vdso: Add vdso_data pointer as input to __arch_get_timens_vdso_data()

2021-04-04 Thread Andrei Vagin
On Wed, Mar 31, 2021 at 04:48:45PM +, Christophe Leroy wrote:
> For the same reason as commit e876f0b69dc9 ("lib/vdso: Allow
> architectures to provide the vdso data pointer"), powerpc wants to
> avoid calculation of relative position to code.
> 
> As the timens_vdso_data is next page to vdso_data, provide
> vdso_data pointer to __arch_get_timens_vdso_data() in order
> to ease the calculation on powerpc in following patches.
>

Acked-by: Andrei Vagin 
 
> Signed-off-by: Christophe Leroy 


Re: [PATCH RESEND v1 4/4] powerpc/vdso: Add support for time namespaces

2021-04-04 Thread Andrei Vagin
On Wed, Mar 31, 2021 at 04:48:47PM +, Christophe Leroy wrote:
> This patch adds the necessary glue to provide time namespaces.
> 
> Things are mainly copied from ARM64.
> 
> __arch_get_timens_vdso_data() calculates timens vdso data position
> based on the vdso data position, knowing it is the next page in vvar.
> This avoids having to redo the mflr/bcl/mflr/mtlr dance to locate
> the page relative to running code position.
>

Acked-by: Andrei Vagin 
 
> Signed-off-by: Christophe Leroy 


Re: [PATCH] move_mount: allow to add a mount into an existing group

2021-03-28 Thread Andrei Vagin
On Thu, Mar 25, 2021 at 03:14:44PM +0300, Pavel Tikhomirov wrote:
> Previously a sharing group (shared and master ids pair) can be only
> inherited when mount is created via bindmount. This patch adds an
> ability to add an existing private mount into an existing sharing group.
> 
> With this functionality one can first create the desired mount tree from
> only private mounts (without the need to care about undesired mount
> propagation or mount creation order implied by sharing group
> dependencies), and next then setup any desired mount sharing between
> those mounts in tree as needed.
> 
> This allows CRIU to restore any set of mount namespaces, mount trees and
> sharing group trees for a container.
> 
> We have many issues with restoring mounts in CRIU related to sharing
> groups and propagation:
> - reverse sharing groups vs mount tree order requires complex mounts
>   reordering which mostly implies also using some temporary mounts
> (please see https://lkml.org/lkml/2021/3/23/569 for more info)
> 
> - mount() syscall creates tons of mounts due to propagation
> - mount re-parenting due to propagation
> - "Mount Trap" due to propagation
> - "Non Uniform" propagation, meaning that with different tricks with
>   mount order and temporary children-"lock" mounts one can create mount
>   trees which can't be restored without those tricks
> (see https://www.linuxplumbersconf.org/event/7/contributions/640/)
> 
> With this new functionality we can resolve all the problems with
> propagation at once.
> 

Thanks for picking this up. Overall it looks good to me. Here is one
comment inline.
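
As an illustration for readers of the archive, a minimal userspace
sketch of the proposed flag could look like the snippet below. It is not
part of the patch; the MOVE_MOUNT_SET_GROUP value and the mount paths
are assumptions:

#include <fcntl.h>		/* AT_FDCWD */
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MOVE_MOUNT_SET_GROUP
#define MOVE_MOUNT_SET_GROUP 0x00000100	/* assumed value from the patch */
#endif

/* Copy the sharing group of an existing mount onto a private mount,
 * without moving either of them. */
static int copy_sharing_group(void)
{
	return syscall(SYS_move_mount,
		       AT_FDCWD, "/mnt/shared-source",	/* group donor */
		       AT_FDCWD, "/mnt/private-target",	/* private mount */
		       MOVE_MOUNT_SET_GROUP);
}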

> Cc: Eric W. Biederman 
> Cc: Alexander Viro 
> Cc: Andrei Vagin 
> Cc: linux-fsde...@vger.kernel.org
> Cc: linux-...@vger.kernel.org
> Cc: lkml 
> Signed-off-by: Pavel Tikhomirov 
> 
> ---
> This is a rework of "mnt: allow to add a mount into an existing group"
> patch from Andrei. https://lkml.org/lkml/2017/4/28/20
> 
> New do_set_group is similar to do_move_mount, but with many restrictions
> of do_move_mount removed and that's why:
> 
> 1) Allow "cross-namespace" sharing group set. If we allow operation only
> with mounts from current+anon mount namespace one would still be able to
> setns(from_mntns) + open_tree(from, OPEN_TREE_CLONE) + setns(to_mntns) +
> move_mount(anon, to, MOVE_MOUNT_SET_GROUP) to set sharing group to mount
> in different mount namespace with source mount. But with this approach
> we would need to create anon mount namespace and mount copy each time,
> which is just a waste of resources. So instead lets just check if we are
> allowed to modify both mount namespaces (which looks equivalent to what
> setns-es and open_tree check).
> 
> 2) Allow operating on non-root dentry of the mount. As if we prohibit it
> this would require extra care from CRIU side in places where we wan't to
> copy sharing group from mount on host (for external mounts) and user
> gives us path to non-root dentry. I don't see any problem with
> referencing mount with any dentry for sharing group setting. Also there
> is no problem with referencing one by file and one by directory.
> 
> 3) Also checks wich only apply to actually moving mount which we have in
> do_move_mount and open_tree are skipped. We don't need to check
> MNT_LOCKED, unbindable, nsfs loops and ancestor relation as we don't
> move mounts.
> 
> Security note: there would be no (new) loops in sharing groups tree,
> because this new move_mount(MOVE_MOUNT_SET_GROUP) operation only adds
> one _private_ mount to one group (without moving between groups), the
> sharing groups tree itself stays unchanged after it.
> 
> In Virtuozzo we have "mount-v2" implementation, based with the original
> kernel patch from Andrei, tested for almost a year and it actually
> decreased number of bugs with mounts a lot. One can take a look on the
> implementation of sharing group restore in CRIU in "mount-v2" here:
> 
> https://src.openvz.org/projects/OVZ/repos/criu/browse/criu/mount-v2.c#898
> 
> This works almost the same with current version of patch if we replace
> mount(MS_SET_GROUP) to move_mount(MOVE_MOUNT_SET_GROUP), please see
> super-draft port for mainstream criu, this at least passes
> non-user-namespaced mount tests (zdtm.py --mounts-v2 -f ns).
> 
> https://github.com/Snorch/criu/commits/mount-v2-poc
> 
> ---
>  fs/namespace.c | 57 +-
>  include/uapi/linux/mount.h |  3 +-
>  2 files changed, 58 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 9d33909d0f9e..ab439d8510dd 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2660,6 +2

Re: [PATCH 1/4] arm64: expose orig_x0 in the user_pt_regs structure

2021-03-26 Thread Andrei Vagin
On Fri, Mar 26, 2021 at 11:28 AM Catalin Marinas
 wrote:
>
> On Mon, Mar 22, 2021 at 03:50:50PM -0700, Andrei Vagin wrote:
> > diff --git a/arch/arm64/include/uapi/asm/ptrace.h 
> > b/arch/arm64/include/uapi/asm/ptrace.h
> > index 758ae984ff97..3c118c5b0893 100644
> > --- a/arch/arm64/include/uapi/asm/ptrace.h
> > +++ b/arch/arm64/include/uapi/asm/ptrace.h
> > @@ -90,6 +90,7 @@ struct user_pt_regs {
> >   __u64   sp;
> >   __u64   pc;
> >   __u64   pstate;
> > + __u64   orig_x0;
> >  };
>
> That's a UAPI change, likely to go wrong. For example, a
> ptrace(PTRACE_GETREGSET, pid, REGSET_GPR, data) would write past the end
> of an old struct user_pt_regs in the debugger.

ptrace(PTRACE_GETREGSET, ...) receives an iovec:
ptrace(PTRACE_GETREGSET, pid, NT_PRSTATUS, &iov)

iov contains a pointer to a buffer and its size, and the kernel fills
only the part that fits the buffer.
I think this interface was invented to allow extending structures
without breaking backward compatibility.
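
For example, a tracer that wants to work with both old and new kernels
could do something along these lines (a sketch; it assumes the extended
user_pt_regs from this series):

#include <elf.h>		/* NT_PRSTATUS */
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <asm/ptrace.h>		/* struct user_pt_regs */

/* The kernel never writes past iov_len and shrinks iov_len to what it
 * actually filled, so an old kernel simply returns the shorter struct. */
static int get_regs(pid_t pid, struct user_pt_regs *regs, int *has_orig_x0)
{
	struct iovec iov = {
		.iov_base = regs,
		.iov_len  = sizeof(*regs),
	};

	if (ptrace(PTRACE_GETREGSET, pid, NT_PRSTATUS, &iov))
		return -1;

	/* orig_x0 is present only if the kernel filled that far. */
	*has_orig_x0 = iov.iov_len > offsetof(struct user_pt_regs, orig_x0);
	return 0;
}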

>
> --
> Catalin


[PATCH 3/4] selftest/arm64/ptrace: add a test for orig_x0

2021-03-22 Thread Andrei Vagin
The test creates two processes where one traces the other. The tracee
executes a system call, the tracer traps it, changes orig_x0, triggers a
signal, and checks that the syscall is restarted with the newly set
argument.

Test output:
 $ ./ptrace_restart_syscall_test
 1..3
 ok 1 orig_x0: 0x3
 ok 2 x0: 0x5
 ok 3 The child exited with code 0.
 # Totals: pass:3 fail:0 xfail:0 xpass:0 skip:0 error:0

Signed-off-by: Andrei Vagin 
---
 tools/testing/selftests/arm64/ptrace/Makefile |   6 +
 tools/testing/selftests/arm64/ptrace/lib.h|  36 ++
 .../ptrace/ptrace_restart_syscall_test.c  | 122 ++
 3 files changed, 164 insertions(+)
 create mode 100644 tools/testing/selftests/arm64/ptrace/Makefile
 create mode 100644 tools/testing/selftests/arm64/ptrace/lib.h
 create mode 100644 
tools/testing/selftests/arm64/ptrace/ptrace_restart_syscall_test.c

diff --git a/tools/testing/selftests/arm64/ptrace/Makefile 
b/tools/testing/selftests/arm64/ptrace/Makefile
new file mode 100644
index ..1bc10e2d2ac8
--- /dev/null
+++ b/tools/testing/selftests/arm64/ptrace/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0
+
+CFLAGS += -g -I../../../../../usr/include/
+TEST_GEN_PROGS := ptrace_restart_syscall_test
+
+include ../../lib.mk
diff --git a/tools/testing/selftests/arm64/ptrace/lib.h 
b/tools/testing/selftests/arm64/ptrace/lib.h
new file mode 100644
index ..14f4737188a3
--- /dev/null
+++ b/tools/testing/selftests/arm64/ptrace/lib.h
@@ -0,0 +1,36 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#ifndef __PTRACE_TEST_LOG_H
+#define __PTRACE_TEST_LOG_H
+
+#define pr_p(func, fmt, ...) func("%s:%d: " fmt ": %m", \
+ __func__, __LINE__, ##__VA_ARGS__)
+
+#define pr_err(fmt, ...)   \
+   ({  \
+   ksft_test_result_error(fmt "\n", ##__VA_ARGS__);\
+   -1; \
+   })
+
+#define pr_fail(fmt, ...)  \
+   ({  \
+   ksft_test_result_fail(fmt "\n", ##__VA_ARGS__); \
+   -1; \
+   })
+
+#define pr_perror(fmt, ...)pr_p(pr_err, fmt, ##__VA_ARGS__)
+
+static inline int ptrace_and_wait(pid_t pid, int cmd, int sig)
+{
+   int status;
+
+   /* Stop on syscall-exit. */
+   if (ptrace(cmd, pid, 0, 0))
+   return pr_perror("Can't resume the child %d", pid);
+   if (waitpid(pid, &status, 0) != pid)
+   return pr_perror("Can't wait for the child %d", pid);
+   if (!WIFSTOPPED(status) || WSTOPSIG(status) != sig)
+   return pr_err("Unexpected status: %x", status);
+   return 0;
+}
+
+#endif
diff --git a/tools/testing/selftests/arm64/ptrace/ptrace_restart_syscall_test.c 
b/tools/testing/selftests/arm64/ptrace/ptrace_restart_syscall_test.c
new file mode 100644
index ..ce59657f41be
--- /dev/null
+++ b/tools/testing/selftests/arm64/ptrace/ptrace_restart_syscall_test.c
@@ -0,0 +1,122 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "../../kselftest.h"
+#include "lib.h"
+
+static int child(int fd)
+{
+   char c;
+
+   if (read(fd, &c, 1) != 1)
+   return 1;
+
+   return 0;
+}
+
+int main(int argc, void **argv)
+{
+   union {
+   struct user_regs_struct r;
+   struct {
+   char __regs[272];
+   unsigned long long orig_x0;
+   unsigned long long orig_x7;
+   };
+   } regs = {};
+   struct iovec iov = {
+   .iov_base = &regs,
+   .iov_len = sizeof(regs),
+   };
+   int status;
+   pid_t pid;
+   int p[2], fdzero;
+
+   ksft_set_plan(3);
+
+   if (pipe(p))
+   return pr_perror("Can't create a pipe");
+   fdzero = open("/dev/zero", O_RDONLY);
+   if (fdzero < 0)
+   return pr_perror("Can't open /dev/zero");
+
+   pid = fork();
+   if (pid == 0) {
+   kill(getpid(), SIGSTOP);
+   return child(p[0]);
+   }
+   if (pid < 0)
+   return 1;
+
+   if (ptrace(PTRACE_ATTACH, pid, 0, 0))
+   return pr_perror("Can't attach to the child %d", pid);
+   if (waitpid(pid, &status, 0) != pid)
+   return pr_perror("Can't wait for the child %d", pid);
+   /* Skip SIGSTOP */
+   if (ptrace_and_wait(pid, PTRACE_CONT, SIGSTOP))
+   return 1;
+
+   /* Resume the

[PATCH 4/4] selftest/arm64/ptrace: add a test for orig_x7

2021-03-22 Thread Andrei Vagin
In system calls, x7 is used to indicate whether a tracee has been
stopped on syscall-enter or syscall-exit, and the original value of x7 is
saved in orig_x7.

Test output:
 $ ./ptrace_syscall_test
 1..4
 ok 1 x7: 0
 ok 2 x7: 1
 ok 3 x7: 686920776f726c64
 ok 4 The child exited with code 0.
 # Totals: pass:4 fail:0 xfail:0 xpass:0 skip:0 error:0

Signed-off-by: Andrei Vagin 
---
 tools/testing/selftests/arm64/ptrace/Makefile |   2 +-
 .../arm64/ptrace/ptrace_syscall_test.c| 158 ++
 2 files changed, 159 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/arm64/ptrace/ptrace_syscall_test.c

diff --git a/tools/testing/selftests/arm64/ptrace/Makefile 
b/tools/testing/selftests/arm64/ptrace/Makefile
index 1bc10e2d2ac8..ea02c1a63806 100644
--- a/tools/testing/selftests/arm64/ptrace/Makefile
+++ b/tools/testing/selftests/arm64/ptrace/Makefile
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0
 
 CFLAGS += -g -I../../../../../usr/include/
-TEST_GEN_PROGS := ptrace_restart_syscall_test
+TEST_GEN_PROGS := ptrace_restart_syscall_test ptrace_syscall_test
 
 include ../../lib.mk
diff --git a/tools/testing/selftests/arm64/ptrace/ptrace_syscall_test.c 
b/tools/testing/selftests/arm64/ptrace/ptrace_syscall_test.c
new file mode 100644
index ..ad55b44ae9f5
--- /dev/null
+++ b/tools/testing/selftests/arm64/ptrace/ptrace_syscall_test.c
@@ -0,0 +1,158 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "../../kselftest.h"
+#include "lib.h"
+
+#define X7_TEST_VAL 0x686920776f726c64UL
+
+static long test_syscall(long *x7)
+{
+   register long x0 __asm__("x0") = 0;
+   register long *x1 __asm__("x1") = x7;
+   register long x8 __asm__("x8") = 0x555;
+
+   __asm__ (
+   "ldr x7, [x1, 0]\n"
+   "svc 0\n"
+   "str x7, [x1, 0]\n"
+  : "=r"(x0)
+  : "r"(x0), "r"(x1), "r"(x8)
+  :
+   );
+   return x0;
+}
+
+static int child(void)
+{
+   long  val = X7_TEST_VAL;
+
+   if (test_syscall(&val)) {
+   ksft_print_msg("The test syscall failed\n");
+   return 1;
+   }
+   if (val != X7_TEST_VAL) {
+   ksft_print_msg("Unexpected x7: %lx\n", val);
+   return 1;
+   }
+
+   if (test_syscall(&val)) {
+   ksft_print_msg("The test syscall failed\n");
+   return 1;
+   }
+   if (val != ~X7_TEST_VAL) {
+   ksft_print_msg("Unexpected x7: %lx\n", val);
+   return 1;
+   }
+
+   return 0;
+}
+
+#ifndef PTRACE_SYSEMU
+#define PTRACE_SYSEMU 31
+#endif
+
+int main(int argc, void **argv)
+{
+   union {
+   struct user_regs_struct r;
+   struct {
+   char __regs[272];
+   unsigned long long orig_x0;
+   unsigned long long orig_x7;
+   };
+   } regs = {};
+   struct iovec iov = {
+   .iov_base = &regs,
+   .iov_len = sizeof(regs),
+   };
+   int status;
+   pid_t pid;
+
+   ksft_set_plan(4);
+
+   pid = fork();
+   if (pid == 0) {
+   kill(getpid(), SIGSTOP);
+   _exit(child());
+   }
+   if (pid < 0)
+   return 1;
+
+   if (ptrace_and_wait(pid, PTRACE_ATTACH, SIGSTOP))
+   return 1;
+   /* skip SIGSTOP */
+   if (ptrace_and_wait(pid, PTRACE_CONT, SIGSTOP))
+   return 1;
+
+   /* Resume the child to the next system call. */
+   if (ptrace_and_wait(pid, PTRACE_SYSCALL, SIGTRAP))
+   return 1;
+
+   /* Check that x7 is 0 on syscall-enter. */
+   if (ptrace(PTRACE_GETREGSET, pid, NT_PRSTATUS, &iov))
+   return pr_perror("Can't get child registers");
+   if (regs.orig_x7 != X7_TEST_VAL)
+   return pr_fail("Unexpected orig_x7: %lx", regs.orig_x7);
+   if (regs.r.regs[7] != 0)
+   return pr_fail("Unexpected x7: %lx", regs.r.regs[7]);
+   ksft_test_result_pass("x7: %llx\n", regs.r.regs[7]);
+
+   if (ptrace_and_wait(pid, PTRACE_SYSCALL, SIGTRAP))
+   return 1;
+
+   /* Check that x7 is 1 on syscall-exit. */
+   if (ptrace(PTRACE_GETREGSET, pid, NT_PRSTATUS, &iov))
+   return pr_perror("Can't get child registers");
+   if (regs.r.regs[7] != 1)
+   return pr_fail("Unexpected x7: %lx", regs.r.regs[7]);
+   ksft_test_result_pass("x7: %llx\n", regs.r.regs[7]);
+
+   /* Check that the child will not see a new value of x7. */
+   

[PATCH 2/4] arm64/ptrace: introduce orig_x7 in the user_pt_regs structure

2021-03-22 Thread Andrei Vagin
We have some ABI weirdness in the way that we handle syscall exit stops
because we indicate whether or not the stop has been signalled from
syscall entry or syscall exit by clobbering a general purpose register
x7 in the tracee and restoring its old value after the stop.

This behavior was inherited from ARM and it isn't common for other
architectures. Now, we have PTRACE_GET_SYSCALL_INFO that gives all
required information about system calls, so the hack with clobbering
registers isn't needed anymore.

This change introduces orig_x7 in the user_pt_regs structure; it will
contain the original value of the x7 register if the tracee is stopped in
a system call.

Signed-off-by: Andrei Vagin 
---
 arch/arm64/include/asm/ptrace.h  |  1 +
 arch/arm64/include/uapi/asm/ptrace.h |  1 +
 arch/arm64/kernel/ptrace.c   | 18 --
 3 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/include/asm/ptrace.h b/arch/arm64/include/asm/ptrace.h
index d4cdf98ac003..1008f0fbc5ea 100644
--- a/arch/arm64/include/asm/ptrace.h
+++ b/arch/arm64/include/asm/ptrace.h
@@ -184,6 +184,7 @@ struct pt_regs {
u64 pc;
u64 pstate;
u64 orig_x0;
+   u64 orig_x7;
};
};
 #ifdef __AARCH64EB__
diff --git a/arch/arm64/include/uapi/asm/ptrace.h 
b/arch/arm64/include/uapi/asm/ptrace.h
index 3c118c5b0893..be7583ff5f4d 100644
--- a/arch/arm64/include/uapi/asm/ptrace.h
+++ b/arch/arm64/include/uapi/asm/ptrace.h
@@ -91,6 +91,7 @@ struct user_pt_regs {
__u64   pc;
__u64   pstate;
__u64   orig_x0;
+   __u64   orig_x7;
 };
 
 struct user_fpsimd_state {
diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index 170f42fd6101..1ed5b4aa986b 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -1750,7 +1750,7 @@ static void tracehook_report_syscall(struct pt_regs *regs,
 enum ptrace_syscall_dir dir)
 {
int regno;
-   unsigned long saved_reg;
+   u64 _saved_reg, *saved_reg;
 
/*
 * We have some ABI weirdness here in the way that we handle syscall
@@ -1768,19 +1768,25 @@ static void tracehook_report_syscall(struct pt_regs 
*regs,
 * - Syscall stops behave differently to seccomp and pseudo-step traps
 *   (the latter do not nobble any registers).
 */
-   regno = (is_compat_task() ? 12 : 7);
-   saved_reg = regs->regs[regno];
+   if (is_compat_task()) {
+   regno = 12;
+   saved_reg = &_saved_reg;
+   } else {
+   regno = 7;
+   saved_reg = &regs->orig_x7;
+   }
+   *saved_reg = regs->regs[regno];
regs->regs[regno] = dir;
 
if (dir == PTRACE_SYSCALL_ENTER) {
if (tracehook_report_syscall_entry(regs))
forget_syscall(regs);
-   regs->regs[regno] = saved_reg;
+   regs->regs[regno] = *saved_reg;
} else if (!test_thread_flag(TIF_SINGLESTEP)) {
tracehook_report_syscall_exit(regs, 0);
-   regs->regs[regno] = saved_reg;
+   regs->regs[regno] = *saved_reg;
} else {
-   regs->regs[regno] = saved_reg;
+   regs->regs[regno] = *saved_reg;
 
/*
 * Signal a pseudo-step exception since we are stepping but
-- 
2.29.2



[PATCH 1/4] arm64: expose orig_x0 in the user_pt_regs structure

2021-03-22 Thread Andrei Vagin
orig_x0 is recorded at syscall entry and is then used to reset the
first syscall argument back to its original value during syscall
restarts.

If orig_x0 isn't available to user-space, it is tricky to manage the
arguments of restarted system calls.

Cc: Keno Fischer 
Signed-off-by: Andrei Vagin 
---
 arch/arm64/include/asm/ptrace.h  | 2 +-
 arch/arm64/include/uapi/asm/ptrace.h | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/ptrace.h b/arch/arm64/include/asm/ptrace.h
index e58bca832dff..d4cdf98ac003 100644
--- a/arch/arm64/include/asm/ptrace.h
+++ b/arch/arm64/include/asm/ptrace.h
@@ -183,9 +183,9 @@ struct pt_regs {
u64 sp;
u64 pc;
u64 pstate;
+   u64 orig_x0;
};
};
-   u64 orig_x0;
 #ifdef __AARCH64EB__
u32 unused2;
s32 syscallno;
diff --git a/arch/arm64/include/uapi/asm/ptrace.h 
b/arch/arm64/include/uapi/asm/ptrace.h
index 758ae984ff97..3c118c5b0893 100644
--- a/arch/arm64/include/uapi/asm/ptrace.h
+++ b/arch/arm64/include/uapi/asm/ptrace.h
@@ -90,6 +90,7 @@ struct user_pt_regs {
__u64   sp;
__u64   pc;
__u64   pstate;
+   __u64   orig_x0;
 };
 
 struct user_fpsimd_state {
-- 
2.29.2



[PATCH 0/4 v3] arm64/ptrace: allow to get all registers on syscall traps

2021-03-22 Thread Andrei Vagin
Here are two known problems with registers when a tracee is stopped in
syscall traps.

The first problem is about the x7 register that is used to indicate
whether or not the stop has been signalled from syscall entry or syscall
exit. This means that:

- Any writes by the tracer to this register during the stop are
  ignored/discarded.

- The actual value of the register is not available during the stop,
  so the tracer cannot save it and restore it later.

For applications like User-mode Linux or gVisor, it is critical to
have access to the full set of registers at any moment. For example,
they need to change the values of all registers to emulate rt_sigreturn or
execve, and they need the full set of registers to build a signal
frame.

The second problem is that orig_x0 isn't exposed to user-space. orig_x0
is recorded at syscall entry and is then used to reset the first
syscall argument back to its original value during syscall
restarts.

This series extends the user_pt_regs structure with orig_x0 and orig_x7.
This doesn't break backward compatibility. There are two interfaces
through which user-space can receive user_pt_regs from the kernel. The
first one is PTRACE_{GET,SET}REGSET. In this case, user-space provides a
buffer and its size, and the kernel fills only the part that fits that
size. The second interface is a core dump file, where registers are
written in a separate note and user-space can parse only the part
that it knows.

Cc: Catalin Marinas 
Cc: Dave Martin 
Cc: Keno Fischer  
Cc: Oleg Nesterov 
Cc: Will Deacon 

Andrei Vagin (4):
  arm64: expose orig_x0 in the user_pt_regs structure
  arm64/ptrace: introduce orig_x7 in the user_pt_regs structure
  selftest/arm64/ptrace: add a test for orig_x0
  selftest/arm64/ptrace: add a test for orig_x7

v2: use the ptrace option instead of adding a new regset.
v3: append orig_x0 and orig_x7 to the user_pt_regs.

 arch/arm64/include/asm/ptrace.h   |   5 +
 arch/arm64/kernel/ptrace.c| 130 +++-
 include/uapi/linux/elf.h  |   1 +
 tools/testing/selftests/arm64/Makefile|   2 +-
 tools/testing/selftests/arm64/ptrace/Makefile |   6 +
 .../arm64/ptrace/ptrace_syscall_regs_test.c   | 142 ++
 6 files changed, 246 insertions(+), 40 deletions(-)
 create mode 100644 tools/testing/selftests/arm64/ptrace/Makefile
 create mode 100644 
tools/testing/selftests/arm64/ptrace/ptrace_syscall_regs_test.c

-- 
2.29.2



Re: [PATCH 1/3] arm64/ptrace: don't clobber task registers on syscall entry/exit traps

2021-02-25 Thread Andrei Vagin
On Thu, Feb 04, 2021 at 03:23:34PM +, Will Deacon wrote:
> On Mon, Feb 01, 2021 at 11:40:10AM -0800, Andrei Vagin wrote:
> > ip/r12 for AArch32 and x7 for AArch64 is used to indicate whether or not
> > the stop has been signalled from syscall entry or syscall exit. This
> > means that:
> > 
> > - Any writes by the tracer to this register during the stop are
> >   ignored/discarded.
> > 
> > - The actual value of the register is not available during the stop,
> >   so the tracer cannot save it and restore it later.
> > 
> > Right now, these registers are clobbered in tracehook_report_syscall.
> > This change moves the logic to gpr_get and compat_gpr_get where
> > registers are copied into a user-space buffer.
> > 
> > This will allow to change these registers and to introduce a new
> > ptrace option to get the full set of registers.
> > 
> > Signed-off-by: Andrei Vagin 
> > ---
> >  arch/arm64/include/asm/ptrace.h |   5 ++
> >  arch/arm64/kernel/ptrace.c  | 104 
> >  2 files changed, 69 insertions(+), 40 deletions(-)
> > 
> > diff --git a/arch/arm64/include/asm/ptrace.h 
> > b/arch/arm64/include/asm/ptrace.h
> > index e58bca832dff..0a9552b4f61e 100644
> > --- a/arch/arm64/include/asm/ptrace.h
> > +++ b/arch/arm64/include/asm/ptrace.h
> > @@ -170,6 +170,11 @@ static inline unsigned long pstate_to_compat_psr(const 
> > unsigned long pstate)
> > return psr;
> >  }
> >  
> > +enum ptrace_syscall_dir {
> > +   PTRACE_SYSCALL_ENTER = 0,
> > +   PTRACE_SYSCALL_EXIT,
> > +};
> > +
> >  /*
> >   * This struct defines the way the registers are stored on the stack 
> > during an
> >   * exception. Note that sizeof(struct pt_regs) has to be a multiple of 16 
> > (for
> > diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
> > index 8ac487c84e37..39da03104528 100644
> > --- a/arch/arm64/kernel/ptrace.c
> > +++ b/arch/arm64/kernel/ptrace.c
> > @@ -40,6 +40,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  #define CREATE_TRACE_POINTS
> >  #include 
> > @@ -561,7 +562,31 @@ static int gpr_get(struct task_struct *target,
> >struct membuf to)
> >  {
> > struct user_pt_regs *uregs = &task_pt_regs(target)->user_regs;
> > -   return membuf_write(&to, uregs, sizeof(*uregs));
> > +   unsigned long saved_reg;
> > +   int ret;
> > +
> > +   /*
> > +* We have some ABI weirdness here in the way that we handle syscall
> > +* exit stops because we indicate whether or not the stop has been
> > +* signalled from syscall entry or syscall exit by clobbering the 
> > general
> > +* purpose register x7.
> > +*/
> 
> When you move a comment, please don't truncate it!

This is my fault. In the previous version, the other part of this
comment was irrelevant, because I always allowed changing the clobbered
registers, but then I realized that we can't do that.

> 
> > +   saved_reg = uregs->regs[7];
> > +
> > +   switch (target->ptrace_message) {
> > +   case PTRACE_EVENTMSG_SYSCALL_ENTRY:
> > +   uregs->regs[7] = PTRACE_SYSCALL_ENTER;
> > +   break;
> > +   case PTRACE_EVENTMSG_SYSCALL_EXIT:
> > +   uregs->regs[7] = PTRACE_SYSCALL_EXIT;
> > +   break;
> > +   }
> 
> I'm wary of checking target->ptrace_message here, as I seem to recall the
> regset code also being used for coredumps. What guarantees we don't break
> things there?

Registers were clobbered in tracehook_report_syscall;
task->ptrace_message is set in ptrace_report_syscall.

do_coredump() is called from get_signal and secure_computing, so we
always see the actual registers in core dumps, with and without these
changes.

> 
> > +
> > +   ret =  membuf_write(&to, uregs, sizeof(*uregs));
> > +
> > +   uregs->regs[7] = saved_reg;
> > +
> > +   return ret;
> >  }
> >  
> >  static int gpr_set(struct task_struct *target, const struct user_regset 
> > *regset,
> > @@ -575,6 +600,17 @@ static int gpr_set(struct task_struct *target, const 
> > struct user_regset *regset,
> > if (ret)
> > return ret;
> >  
> > +   /*
> > +* Historically, x7 can't be changed if the stop has been signalled
> > +* from syscall-enter of syscall-exit.
> > +*/
> > +   switch (target->ptrace_message) {
> > +   case PTRACE_EVENTMSG_SYSCALL_ENTRY:
> > +   case PTRACE_EVENTMSG_SYSCALL_EXIT:
> >

Re: [PATCH 0/3 v2] arm64/ptrace: allow to get all registers on syscall traps

2021-02-08 Thread Andrei Vagin
On Mon, Feb 01, 2021 at 07:11:12PM -0500, Keno Fischer wrote:
> Hi Andrei,
> 
> > This series introduces the PTRACE_O_ARM64_RAW_REGS option. If it is set,
> > PTRACE_GETREGSET returns values of all registers, and PTRACE_SETREGSET
> > allows to change any of them.
> 
> thanks for picking this up. I meant to work on this, but unfortunately ran out
> of time to be able to push it through, so I'm glad you're working on
> it, since it
> does absolutely need to get fixed. Besides this issue, the other problem we
> ran into when trying to port our ptracer to aarch64 is that orig_x0 is not
> accessible through the ptrace interface on aarch64, which can cause tricky
> behavior around restarts.

Could you describe the problem in more detail? I wonder whether we have
the same thing in CRIU...

> We managed to work around that in the end,
> but it's painful. If we're fixing the kernel here anyway, I'm wondering if
> we might want to address that as well while we're at it.

Sure, let's think about how to do this properly.

In this case, I think the ptrace option isn't a good choice. I don't
think it is a good idea to change the layout of a regset depending on
options...

Thanks,
Andrei

> 
> Keno


Re: [PATCH 2/3] arm64/ptrace: introduce PTRACE_O_ARM64_RAW_REGS

2021-02-08 Thread Andrei Vagin
On Thu, Feb 04, 2021 at 03:36:15PM +, Will Deacon wrote:
> On Mon, Feb 01, 2021 at 11:40:11AM -0800, Andrei Vagin wrote:
> > We have some ABI weirdness in the way that we handle syscall
> > exit stops because we indicate whether or not the stop has been
> > signalled from syscall entry or syscall exit by clobbering a general
> > purpose register (ip/r12 for AArch32, x7 for AArch64) in the tracee
> > and restoring its old value after the stop.
> > 
> > This behavior was inherited from ARM and it isn't common for other
> > architectures. Now, we have PTRACE_GET_SYSCALL_INFO that gives all
> > required information about system calls, so the hack with clobbering
> > registers isn't needed anymore.
> > 
> > This change adds the new ptrace option PTRACE_O_ARM64_RAW_REGS.  If it
> > is set, PTRACE_GETREGSET returns values of all registers without
> > clobbering r12 or x7 and PTRACE_SETREGSE sets all registers even if a
> > process has been stopped in syscall-enter or syscall-exit.
> > 
> > Signed-off-by: Andrei Vagin 
> > ---
> >  arch/arm64/include/uapi/asm/ptrace.h |  4 ++
> >  arch/arm64/kernel/ptrace.c   | 70 
> >  include/linux/ptrace.h   |  1 +
> >  include/uapi/linux/ptrace.h  |  9 +++-
> >  4 files changed, 52 insertions(+), 32 deletions(-)
> 
> Please split this up so that the arm64-specific changes are separate to
> the core changes.

ok

> 
> > diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
> > index 83ee45fa634b..bcc8c362ddd9 100644
> > --- a/include/uapi/linux/ptrace.h
> > +++ b/include/uapi/linux/ptrace.h
> > @@ -7,6 +7,7 @@
> >  /* has the defines to get at the registers. */
> >  
> >  #include 
> > +#include 
> >  
> >  #define PTRACE_TRACEME0
> >  #define PTRACE_PEEKTEXT   1
> > @@ -137,8 +138,14 @@ struct ptrace_syscall_info {
> >  #define PTRACE_O_EXITKILL  (1 << 20)
> >  #define PTRACE_O_SUSPEND_SECCOMP   (1 << 21)
> >  
> > +/* (1<<28) is reserved for arch specific options. */
> > +#ifndef _PTRACE_O_ARCH_OPTIONS
> > +#define _PTRACE_O_ARCH_OPTIONS 0
> > +#endif
> 
> It seems a bit fragile to rely on a comment here to define the user ABI;
> why not define _PTRACE_O_ARCH_OPTIONS to the right value unconditionally?

We don't want to allow setting options that are not supported.
_PTRACE_O_ARCH_OPTIONS is added to PTRACE_O_MASK, and then
ptrace_setoptions checks whether all the specified options are supported
or not.
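
For illustration, here is a minimal user-space model of the check
described above; the constants are simplified stand-ins copied from the
uapi headers and this series, not the actual ptrace_setoptions() code:

#include <stdio.h>

/* Stand-ins; the real values live in include/uapi/linux/ptrace.h and in
 * the arch uapi header added by this series. */
#define PTRACE_O_EXITKILL        (1 << 20)
#define PTRACE_O_SUSPEND_SECCOMP (1 << 21)
#define _PTRACE_O_ARCH_OPTIONS   (1 << 28)  /* 0 on arches without arch options */

#define PTRACE_O_MASK \
	(0x000000ff | PTRACE_O_EXITKILL | PTRACE_O_SUSPEND_SECCOMP | \
	 _PTRACE_O_ARCH_OPTIONS)

/* Models the ptrace_setoptions() check: any bit outside the mask makes
 * the whole PTRACE_SETOPTIONS request fail. */
static int check_options(unsigned long data)
{
	return (data & ~(unsigned long)PTRACE_O_MASK) ? -1 : 0;
}

int main(void)
{
	printf("arch bit (1 << 28):    %d\n", check_options(1UL << 28)); /* accepted */
	printf("unknown bit (1 << 27): %d\n", check_options(1UL << 27)); /* rejected */
	return 0;
}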

> 
> Also, it seems as though we immediately burn this bit on arm64, so if we
> ever wanted another option we'd have to come back here and allocate another
> bit. Could we do better, e.g. by calling into an arch hook
> (arch_ptrace_setoptions()) and passing the 'addr' parameter?

I am not sure that I understand the idea. Do you suggest having a
PTRACE_O_ARCH_OPTION and passing arch-specific options in addr? In that
case, I think it would be cleaner to introduce a new ptrace command. If
this is the idea, I think it is worth doing only if we expect to have
more than one or two options.

As for my solution, we would need to come back here to allocate a new
bit only to be sure that it doesn't intersect with the non-arch-specific
options, and anyone who adds a non-arch option has to check that they
don't reuse bits of arch-specific options.

Let's decide what interface we want to use to solve the problem, and
then, if we settle on a ptrace option, I will figure out how to improve
this code.

> 
> How do other architectures manage this sort of thing? I'm wondering whether
> a separate regset containing just "real x7" and orig_x0 would be preferable
> after all...

Yeah, it might be a good idea. We will need to do one extra ptrace
system call, but in comparison with ptrace context-switches, this is
nothing.

Dave, Keno, what do you think about this?
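
To make that concrete, fetching such a regset from a tracer would be one
extra PTRACE_GETREGSET call, roughly like this; the note type, struct
layout, and names below are hypothetical, nothing like this exists yet:

#include <stdint.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>

#define NT_ARM_SYSCALL_REGS 0x4f0	/* made-up note type, for illustration only */

struct arm64_syscall_regs {		/* made-up layout: the "real" x7 and orig_x0 */
	uint64_t x7;
	uint64_t orig_x0;
};

static long get_syscall_regs(pid_t pid, struct arm64_syscall_regs *sr)
{
	struct iovec iov = {
		.iov_base = sr,
		.iov_len  = sizeof(*sr),
	};

	/* One extra GETREGSET per stop -- cheap compared to the context
	 * switches a ptrace stop already costs. */
	return ptrace(PTRACE_GETREGSET, pid, NT_ARM_SYSCALL_REGS, &iov);
}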

> 
> Will


[PATCH 1/3] arm64/ptrace: don't clobber task registers on syscall entry/exit traps

2021-02-01 Thread Andrei Vagin
ip/r12 for AArch32 and x7 for AArch64 are used to indicate whether or not
the stop has been signalled from syscall entry or syscall exit. This
means that:

- Any writes by the tracer to this register during the stop are
  ignored/discarded.

- The actual value of the register is not available during the stop,
  so the tracer cannot save it and restore it later.

Right now, these registers are clobbered in tracehook_report_syscall.
This change moves the logic to gpr_get and compat_gpr_get where
registers are copied into a user-space buffer.

This will make it possible to change these registers and to introduce a
new ptrace option to get the full set of registers.

Signed-off-by: Andrei Vagin 
---
 arch/arm64/include/asm/ptrace.h |   5 ++
 arch/arm64/kernel/ptrace.c  | 104 
 2 files changed, 69 insertions(+), 40 deletions(-)

diff --git a/arch/arm64/include/asm/ptrace.h b/arch/arm64/include/asm/ptrace.h
index e58bca832dff..0a9552b4f61e 100644
--- a/arch/arm64/include/asm/ptrace.h
+++ b/arch/arm64/include/asm/ptrace.h
@@ -170,6 +170,11 @@ static inline unsigned long pstate_to_compat_psr(const 
unsigned long pstate)
return psr;
 }
 
+enum ptrace_syscall_dir {
+   PTRACE_SYSCALL_ENTER = 0,
+   PTRACE_SYSCALL_EXIT,
+};
+
 /*
  * This struct defines the way the registers are stored on the stack during an
  * exception. Note that sizeof(struct pt_regs) has to be a multiple of 16 (for
diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index 8ac487c84e37..39da03104528 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -40,6 +40,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define CREATE_TRACE_POINTS
 #include 
@@ -561,7 +562,31 @@ static int gpr_get(struct task_struct *target,
   struct membuf to)
 {
struct user_pt_regs *uregs = _pt_regs(target)->user_regs;
-   return membuf_write(, uregs, sizeof(*uregs));
+   unsigned long saved_reg;
+   int ret;
+
+   /*
+* We have some ABI weirdness here in the way that we handle syscall
+* exit stops because we indicate whether or not the stop has been
+* signalled from syscall entry or syscall exit by clobbering the 
general
+* purpose register x7.
+*/
+   saved_reg = uregs->regs[7];
+
+   switch (target->ptrace_message) {
+   case PTRACE_EVENTMSG_SYSCALL_ENTRY:
+   uregs->regs[7] = PTRACE_SYSCALL_ENTER;
+   break;
+   case PTRACE_EVENTMSG_SYSCALL_EXIT:
+   uregs->regs[7] = PTRACE_SYSCALL_EXIT;
+   break;
+   }
+
+   ret =  membuf_write(, uregs, sizeof(*uregs));
+
+   uregs->regs[7] = saved_reg;
+
+   return ret;
 }
 
 static int gpr_set(struct task_struct *target, const struct user_regset 
*regset,
@@ -575,6 +600,17 @@ static int gpr_set(struct task_struct *target, const 
struct user_regset *regset,
if (ret)
return ret;
 
+   /*
+* Historically, x7 can't be changed if the stop has been signalled
+* from syscall-enter of syscall-exit.
+*/
+   switch (target->ptrace_message) {
+   case PTRACE_EVENTMSG_SYSCALL_ENTRY:
+   case PTRACE_EVENTMSG_SYSCALL_EXIT:
+   newregs.regs[7] = task_pt_regs(target)->regs[7];
+   break;
+   }
+
if (!valid_user_regs(, target))
return -EINVAL;
 
@@ -1206,6 +1242,20 @@ static inline compat_ulong_t compat_get_user_reg(struct 
task_struct *task, int i
struct pt_regs *regs = task_pt_regs(task);
 
switch (idx) {
+   case 12:
+   /*
+* We have some ABI weirdness here in the way that we handle
+* syscall exit stops because we indicate whether or not the
+* stop has been signalled from syscall entry or syscall exit
+* by clobbering the general purpose register r12.
+*/
+   switch (task->ptrace_message) {
+   case PTRACE_EVENTMSG_SYSCALL_ENTRY:
+   return PTRACE_SYSCALL_ENTER;
+   case PTRACE_EVENTMSG_SYSCALL_EXIT:
+   return PTRACE_SYSCALL_EXIT;
+   }
+   return regs->regs[idx];
case 15:
return regs->pc;
case 16:
@@ -1282,6 +1332,17 @@ static int compat_gpr_set(struct task_struct *target,
 
}
 
+   /*
+* Historically, x12 can't be changed if the stop has been signalled
+* from syscall-enter of syscall-exit.
+*/
+   switch (target->ptrace_message) {
+   case PTRACE_EVENTMSG_SYSCALL_ENTRY:
+   case PTRACE_EVENTMSG_SYSCALL_EXIT:
+   newregs.regs[12] = task_pt_regs(target)->regs[12];
+   break;
+   }
+
if (valid_user_regs(_regs, target))
*task_pt_regs(target) = newregs;
else
@@ -1740,53 +1801,16 @@ long arch_ptrace(s

[PATCH 3/3] selftest/arm64/ptrace: add tests for PTRACE_O_ARM64_RAW_REGS

2021-02-01 Thread Andrei Vagin
Test output:
 TAP version 13
 1..2
 # selftests: arm64/ptrace: ptrace_syscall_raw_regs_test
 # 1..2
 # ok 1 x7: 686920776f726c64
 # ok 2 The child exited with code 0.
 # # Totals: pass:2 fail:0 xfail:0 xpass:0 skip:0 error:0
 ok 1 selftests: arm64/ptrace: ptrace_syscall_raw_regs_test
 # selftests: arm64/ptrace: ptrace_syscall_regs_test
 # 1..3
 # ok 1 x7: 0
 # ok 2 x7: 1
 # ok 3 The child exited with code 0.
 # # Totals: pass:3 fail:0 xfail:0 xpass:0 skip:0 error:0
 ok 2 selftests: arm64/ptrace: ptrace_syscall_regs_test

Signed-off-by: Andrei Vagin 
---
 tools/testing/selftests/arm64/Makefile|   2 +-
 tools/testing/selftests/arm64/ptrace/Makefile |   6 +
 .../ptrace/ptrace_syscall_raw_regs_test.c | 142 +
 .../arm64/ptrace/ptrace_syscall_regs_test.c   | 150 ++
 4 files changed, 299 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/arm64/ptrace/Makefile
 create mode 100644 
tools/testing/selftests/arm64/ptrace/ptrace_syscall_raw_regs_test.c
 create mode 100644 
tools/testing/selftests/arm64/ptrace/ptrace_syscall_regs_test.c

diff --git a/tools/testing/selftests/arm64/Makefile 
b/tools/testing/selftests/arm64/Makefile
index 2c9d012797a7..704770a60ece 100644
--- a/tools/testing/selftests/arm64/Makefile
+++ b/tools/testing/selftests/arm64/Makefile
@@ -4,7 +4,7 @@
 ARCH ?= $(shell uname -m 2>/dev/null || echo not)
 
 ifneq (,$(filter $(ARCH),aarch64 arm64))
-ARM64_SUBTARGETS ?= tags signal pauth fp mte
+ARM64_SUBTARGETS ?= tags signal pauth fp mte ptrace
 else
 ARM64_SUBTARGETS :=
 endif
diff --git a/tools/testing/selftests/arm64/ptrace/Makefile 
b/tools/testing/selftests/arm64/ptrace/Makefile
new file mode 100644
index ..84b27449f3d1
--- /dev/null
+++ b/tools/testing/selftests/arm64/ptrace/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0
+
+CFLAGS += -g -I../../../../../usr/include/
+TEST_GEN_PROGS := ptrace_syscall_raw_regs_test ptrace_syscall_regs_test
+
+include ../../lib.mk
diff --git 
a/tools/testing/selftests/arm64/ptrace/ptrace_syscall_raw_regs_test.c 
b/tools/testing/selftests/arm64/ptrace/ptrace_syscall_raw_regs_test.c
new file mode 100644
index ..78f913303a99
--- /dev/null
+++ b/tools/testing/selftests/arm64/ptrace/ptrace_syscall_raw_regs_test.c
@@ -0,0 +1,142 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "../../kselftest.h"
+
+#define TEST_VAL 0x686920776f726c64UL
+
+#define pr_p(func, fmt, ...)   func(fmt ": %m", ##__VA_ARGS__)
+
+#define pr_err(fmt, ...)   \
+   ({  \
+   ksft_test_result_error(fmt "\n", ##__VA_ARGS__);
\
+   -1; \
+   })
+
+#define pr_fail(fmt, ...)  \
+   ({  \
+   ksft_test_result_fail(fmt "\n", ##__VA_ARGS__); \
+   -1; \
+   })
+
+#define pr_perror(fmt, ...)pr_p(pr_err, fmt, ##__VA_ARGS__)
+
+static long loop(void *val)
+{
+   register long x0 __asm__("x0");
+   register void *x1 __asm__("x1") = val;
+   register long x8 __asm__("x8") = 555;
+
+   __asm__ (
+   "again:\n"
+   "ldr x7, [x1, 0]\n"
+   "svc 0\n"
+   "str x7, [x1, 0]\n"
+  : "=r"(x0)
+  : "r"(x1), "r"(x8)
+  :
+   );
+   return 0;
+}
+
+static int child(void)
+{
+   long  val = TEST_VAL;
+
+   loop();
+   if (val != ~TEST_VAL) {
+   ksft_print_msg("Unexpected x7: %lx\n", val);
+   return 1;
+   }
+
+   return 0;
+}
+
+#ifndef PTRACE_SYSEMU
+#define PTRACE_SYSEMU 31
+#endif
+
+#ifndef PTRACE_O_ARM64_RAW_REGS
+#define PTRACE_O_ARM64_RAW_REGS(1 << 28)
+#endif
+
+int main(int argc, void **argv)
+{
+   struct user_regs_struct regs = {};
+   struct iovec iov = {
+   .iov_base = ,
+   .iov_len = sizeof(struct user_regs_struct),
+   };
+   int status;
+   pid_t pid;
+
+   ksft_set_plan(2);
+
+   pid = fork();
+   if (pid == 0) {
+   kill(getpid(), SIGSTOP);
+   child();
+   _exit(0);
+   }
+   if (pid < 0)
+   return 1;
+
+   if (ptrace(PTRACE_ATTACH, pid, 0, 0))
+   return pr_perror("Can't attach to the child %d", pid);
+   if (waitpid(pid, , 0) != pid)
+   return pr_perror(&quo

[PATCH 2/3] arm64/ptrace: introduce PTRACE_O_ARM64_RAW_REGS

2021-02-01 Thread Andrei Vagin
We have some ABI weirdness in the way that we handle syscall
exit stops because we indicate whether or not the stop has been
signalled from syscall entry or syscall exit by clobbering a general
purpose register (ip/r12 for AArch32, x7 for AArch64) in the tracee
and restoring its old value after the stop.

This behavior was inherited from ARM and it isn't common for other
architectures. Now, we have PTRACE_GET_SYSCALL_INFO that gives all
required information about system calls, so the hack with clobbering
registers isn't needed anymore.
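
For reference, this is all a tracer needs today to tell entry from exit
without looking at x7 -- a minimal PTRACE_GET_SYSCALL_INFO sketch (error
handling trimmed):

#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <linux/ptrace.h>	/* struct ptrace_syscall_info */

static void print_stop_kind(pid_t pid)
{
	struct ptrace_syscall_info info;

	if (ptrace(PTRACE_GET_SYSCALL_INFO, pid,
		   (void *)sizeof(info), &info) <= 0)
		return;

	switch (info.op) {
	case PTRACE_SYSCALL_INFO_ENTRY:
		printf("syscall entry: nr=%llu\n",
		       (unsigned long long)info.entry.nr);
		break;
	case PTRACE_SYSCALL_INFO_EXIT:
		printf("syscall exit: rval=%lld\n",
		       (long long)info.exit.rval);
		break;
	default:
		printf("not a syscall stop\n");
	}
}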

This change adds the new ptrace option PTRACE_O_ARM64_RAW_REGS. If it
is set, PTRACE_GETREGSET returns values of all registers without
clobbering r12 or x7, and PTRACE_SETREGSET sets all registers even if a
process has been stopped in syscall-enter or syscall-exit.

Signed-off-by: Andrei Vagin 
---
 arch/arm64/include/uapi/asm/ptrace.h |  4 ++
 arch/arm64/kernel/ptrace.c   | 70 
 include/linux/ptrace.h   |  1 +
 include/uapi/linux/ptrace.h  |  9 +++-
 4 files changed, 52 insertions(+), 32 deletions(-)

diff --git a/arch/arm64/include/uapi/asm/ptrace.h 
b/arch/arm64/include/uapi/asm/ptrace.h
index 758ae984ff97..465cc9713895 100644
--- a/arch/arm64/include/uapi/asm/ptrace.h
+++ b/arch/arm64/include/uapi/asm/ptrace.h
@@ -109,6 +109,10 @@ struct user_hwdebug_state {
}   dbg_regs[16];
 };
 
+#define PTRACE_O_ARM64_RAW_REGS(1 << 28)
+
+#define _PTRACE_O_ARCH_OPTIONS PTRACE_O_ARM64_RAW_REGS
+
 /* SVE/FP/SIMD state (NT_ARM_SVE) */
 
 struct user_sve_header {
diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index 39da03104528..591a4478ad76 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -565,21 +565,23 @@ static int gpr_get(struct task_struct *target,
unsigned long saved_reg;
int ret;
 
-   /*
-* We have some ABI weirdness here in the way that we handle syscall
-* exit stops because we indicate whether or not the stop has been
-* signalled from syscall entry or syscall exit by clobbering the 
general
-* purpose register x7.
-*/
saved_reg = uregs->regs[7];
 
-   switch (target->ptrace_message) {
-   case PTRACE_EVENTMSG_SYSCALL_ENTRY:
-   uregs->regs[7] = PTRACE_SYSCALL_ENTER;
-   break;
-   case PTRACE_EVENTMSG_SYSCALL_EXIT:
-   uregs->regs[7] = PTRACE_SYSCALL_EXIT;
-   break;
+   if (!(target->ptrace & PT_ARM64_RAW_REGS)) {
+   /*
+* We have some ABI weirdness here in the way that we handle
+* syscall exit stops because we indicate whether or not the
+* stop has been signalled from syscall entry or syscall exit
+* by clobbering the general purpose register x7.
+*/
+   switch (target->ptrace_message) {
+   case PTRACE_EVENTMSG_SYSCALL_ENTRY:
+   uregs->regs[7] = PTRACE_SYSCALL_ENTER;
+   break;
+   case PTRACE_EVENTMSG_SYSCALL_EXIT:
+   uregs->regs[7] = PTRACE_SYSCALL_EXIT;
+   break;
+   }
}
 
ret =  membuf_write(, uregs, sizeof(*uregs));
@@ -600,15 +602,17 @@ static int gpr_set(struct task_struct *target, const 
struct user_regset *regset,
if (ret)
return ret;
 
-   /*
-* Historically, x7 can't be changed if the stop has been signalled
-* from syscall-enter of syscall-exit.
-*/
-   switch (target->ptrace_message) {
-   case PTRACE_EVENTMSG_SYSCALL_ENTRY:
-   case PTRACE_EVENTMSG_SYSCALL_EXIT:
-   newregs.regs[7] = task_pt_regs(target)->regs[7];
-   break;
+   if (!(target->ptrace & PT_ARM64_RAW_REGS)) {
+   /*
+* Historically, x7 can't be changed if the stop has been
+* signalled from syscall-enter of syscall-exit.
+*/
+   switch (target->ptrace_message) {
+   case PTRACE_EVENTMSG_SYSCALL_ENTRY:
+   case PTRACE_EVENTMSG_SYSCALL_EXIT:
+   newregs.regs[7] = task_pt_regs(target)->regs[7];
+   break;
+   }
}
 
if (!valid_user_regs(, target))
@@ -1243,6 +1247,8 @@ static inline compat_ulong_t compat_get_user_reg(struct 
task_struct *task, int i
 
switch (idx) {
case 12:
+   if (task->ptrace & PT_ARM64_RAW_REGS)
+   return regs->regs[idx];
/*
 * We have some ABI weirdness here in the way that we handle
 * syscall exit stops because we indicate whether or not the
@@ -1332,15 +1338,17 @@ static int compat_gpr_set(struct task_struct *target,
 
}
 
-   /*
-* Hi

[PATCH 0/3 v2] arm64/ptrace: allow to get all registers on syscall traps

2021-02-01 Thread Andrei Vagin
Right now, ip/r12 for AArch32 and x7 for AArch64 are used to indicate
whether or not the stop has been signalled from syscall entry or syscall
exit. This means that:

- Any writes by the tracer to this register during the stop are
  ignored/discarded.

- The actual value of the register is not available during the stop,
  so the tracer cannot save it and restore it later.

For applications like user-mode Linux or gVisor, it is critical to
have access to the full set of registers at any moment. For example,
they need to change values of all registers to emulate rt_sigreturn or
execve and they need to have the full set of registers to build a signal
frame.

This series introduces the PTRACE_O_ARM64_RAW_REGS option. If it is set,
PTRACE_GETREGSET returns values of all registers, and PTRACE_SETREGSET
allows changing any of them.
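
Tracer-side, the intended usage is a one-line change -- set the option
once and NT_PRSTATUS stops being mangled. A rough sketch, assuming the
option value proposed in patch 2 (error handling omitted):

#include <elf.h>		/* NT_PRSTATUS */
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/user.h>		/* struct user_regs_struct */

#ifndef PTRACE_O_ARM64_RAW_REGS
#define PTRACE_O_ARM64_RAW_REGS (1 << 28)	/* value proposed in this series */
#endif

static long read_raw_regs(pid_t pid, struct user_regs_struct *regs)
{
	struct iovec iov = {
		.iov_base = regs,
		.iov_len  = sizeof(*regs),
	};

	/* With the option set, x7 (r12 for compat tasks) is no longer
	 * overwritten in syscall-entry/exit stops. */
	if (ptrace(PTRACE_SETOPTIONS, pid, 0,
		   PTRACE_O_TRACESYSGOOD | PTRACE_O_ARM64_RAW_REGS))
		return -1;

	return ptrace(PTRACE_GETREGSET, pid, NT_PRSTATUS, &iov);
}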

Cc: Catalin Marinas 
Cc: Dave Martin 
Cc: Keno Fischer  
Cc: Oleg Nesterov 
Cc: Will Deacon 

Andrei Vagin (3):
  arm64/ptrace: don't clobber task registers on syscall entry/exit traps
  arm64/ptrace: introduce PTRACE_O_ARM64_RAW_REGS
  selftest/arm64/ptrace: add tests for PTRACE_O_ARM64_RAW_REGS

v2: use the ptrace option instead of adding a new regset.

 arch/arm64/include/asm/ptrace.h   |   5 +
 arch/arm64/kernel/ptrace.c| 130 +++-
 include/uapi/linux/elf.h  |   1 +
 tools/testing/selftests/arm64/Makefile|   2 +-
 tools/testing/selftests/arm64/ptrace/Makefile |   6 +
 .../arm64/ptrace/ptrace_syscall_regs_test.c   | 142 ++
 6 files changed, 246 insertions(+), 40 deletions(-)
 create mode 100644 tools/testing/selftests/arm64/ptrace/Makefile
 create mode 100644 
tools/testing/selftests/arm64/ptrace/ptrace_syscall_regs_test.c

-- 
2.29.2



Re: [PATCH 2/3] arm64/ptrace: introduce NT_ARM_PRSTATUS to get a full set of registers

2021-01-29 Thread Andrei Vagin
On Wed, Jan 27, 2021 at 02:53:07PM +, Dave Martin wrote:
> On Tue, Jan 19, 2021 at 02:06:36PM -0800, Andrei Vagin wrote:
> > This is an alternative to NT_PRSTATUS that clobbers ip/r12 on AArch32,
> > x7 on AArch64 when a tracee is stopped in syscall entry or syscall exit
> > traps.
> > 
> > Signed-off-by: Andrei Vagin 
> 
> This approach looks like it works, though I still think adding an option
> for this under PTRACE_SETOPTIONS would be less intrusive.

Dave, thank you for the feedback. I will prepare a patch with an option
and then we will see what looks better.

> 
> Adding a shadow regset like this also looks like it would cause the gp
> regs to be pointlessly be dumped twice in a core dump.  Avoiding that
> might require hacks in the core code...
> 
> 
> > ---
> >  arch/arm64/kernel/ptrace.c | 39 ++
> >  include/uapi/linux/elf.h   |  1 +
> >  2 files changed, 40 insertions(+)
> > 
> > diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
> > index 1863f080cb07..b8e4c2ddf636 100644
> > --- a/arch/arm64/kernel/ptrace.c
> > +++ b/arch/arm64/kernel/ptrace.c
> > @@ -591,6 +591,15 @@ static int gpr_get(struct task_struct *target,
> > return ret;
> >  }
> >  
> > +static int gpr_get_full(struct task_struct *target,
> > +  const struct user_regset *regset,
> > +  struct membuf to)
> > +{
> > +   struct user_pt_regs *uregs = _pt_regs(target)->user_regs;
> > +
> > +   return membuf_write(, uregs, sizeof(*uregs));
> > +}
> > +
> >  static int gpr_set(struct task_struct *target, const struct user_regset 
> > *regset,
> >unsigned int pos, unsigned int count,
> >const void *kbuf, const void __user *ubuf)
> > @@ -1088,6 +1097,7 @@ static int tagged_addr_ctrl_set(struct task_struct 
> > *target, const struct
> >  
> >  enum aarch64_regset {
> > REGSET_GPR,
> > +   REGSET_GPR_FULL,
> 
> If we go with this approach, "REGSET_GPR_RAW" might be a preferable
> name.  Both regs represent all the regs ("full"), but REGSET_GPR is
> mangled by the kernel.

I agree that REGSET_GPR_RAW looks better in this case.

> 
> > REGSET_FPR,
> > REGSET_TLS,
> >  #ifdef CONFIG_HAVE_HW_BREAKPOINT
> > @@ -1119,6 +1129,14 @@ static const struct user_regset aarch64_regsets[] = {
> > .regset_get = gpr_get,
> > .set = gpr_set
> > },
> > +   [REGSET_GPR_FULL] = {
> > +   .core_note_type = NT_ARM_PRSTATUS,

...

> > diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
> > index 30f68b42eeb5..a2086d19263a 100644
> > --- a/include/uapi/linux/elf.h
> > +++ b/include/uapi/linux/elf.h
> > @@ -426,6 +426,7 @@ typedef struct elf64_shdr {
> >  #define NT_ARM_PACA_KEYS   0x407   /* ARM pointer authentication address 
> > keys */
> >  #define NT_ARM_PACG_KEYS   0x408   /* ARM pointer authentication generic 
> > key */
> >  #define NT_ARM_TAGGED_ADDR_CTRL0x409   /* arm64 tagged address control 
> > (prctl()) */
> 
> What happened to 0x40a..0x40f?

shame on me :)

> 
> [...]
> 
> Cheers
> ---Dave


Re: [PATCH 0/3] arm64/ptrace: allow to get all registers on syscall traps

2021-01-27 Thread Andrei Vagin
On Tue, Jan 19, 2021 at 2:08 PM Andrei Vagin  wrote:
>
> Right now, ip/r12 for AArch32 and x7 for AArch64 is used to indicate
> whether or not the stop has been signalled from syscall entry or syscall
> exit. This means that:
>
> - Any writes by the tracer to this register during the stop are
>   ignored/discarded.
>
> - The actual value of the register is not available during the stop,
>   so the tracer cannot save it and restore it later.
>
> This series introduces NT_ARM_PRSTATUS to get all registers and makes it
> possible to change ip/r12 and x7 registers when tracee is stopped in
> syscall traps.
>
> For applications like the user-mode Linux or gVisor, it is critical to
> have access to the full set of registers at any moment. For example,
> they need to change values of all registers to emulate rt_sigreturn and
> they need to have the full set of registers to build a signal frame.

I have found the thread [1] where Keno, Will, and Dave discussed the same
problem. If I understand this right, the problem was not fixed, because there
were no users who needed it.

gVisor is a general-purpose sandbox to run untrusted workloads. It has a
platform interface that is responsible for syscall interception, context
switching, and managing process address spaces. Right now, we have kvm and
ptrace platforms. The ptrace platform runs a guest code in the context of stub
processes and intercepts syscalls with help of PTRACE_SYSEMU. All system calls
are handled by the gVisor kernel, including rt_sigreturn and execve. Signal
handling happens inside the gVisor kernel too. Each stub process can have
more than one thread, but we don't bind guest threads to stub threads and we
can run more than one guest thread in the context of one stub thread. Taking
into account all these facts, we need to have access to all registers at any
moment when a stub thread has been stopped.

We were able to introduce the workaround [3] for this issue. Each time a
stub process is stopped on a system call, we queue a fake signal and
resume the process so that it stops on the signal instead. It works, but
it requires extra interaction with the stub process, which is expensive.
My benchmarks show that this workaround slows down syscalls in gVisor by
more than 50%. BTW, this is one of the major reasons why PTRACE_SYSEMU was
introduced instead of emulating it via two calls of PTRACE_SYSCALL. (A
conceptual sketch of this trick follows the links below.)


[1] 
https://lore.kernel.org/lkml/CABV8kRz0mKSc=u1LeonQSLroKJLOKWOWktCoGji2nvEBc=e7=w...@mail.gmail.com/#r
[2] https://github.com/google/gvisor/issues/5238
[3] 
https://github.com/google/gvisor/commit/a44efaab6d4b815880749a39647fb3ed9634a489
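
Conceptually, the workaround trades the syscall stop for a signal stop,
where the regset is not mangled; a heavily simplified sketch of the idea
(not the actual gVisor code, which lives in [3]):

#include <signal.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* After the stub thread stops in syscall-entry under PTRACE_SYSEMU, queue
 * a signal for it and restart it, so that the next stop is a signal stop
 * where GETREGSET returns the real x7.  'tid' is assumed to be a
 * single-threaded stub here; the real code targets a specific thread. */
static int restop_on_signal(pid_t tid)
{
	int status;

	if (syscall(SYS_tgkill, tid, tid, SIGSTOP))
		return -1;
	if (ptrace(PTRACE_CONT, tid, 0, 0))
		return -1;
	return waitpid(tid, &status, 0) == tid ? 0 : -1;
}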

>
> Andrei Vagin (3):
>   arm64/ptrace: don't clobber task registers on syscall entry/exit traps
>   arm64/ptrace: introduce NT_ARM_PRSTATUS to get a full set of registers
>   selftest/arm64/ptrace: add a test for NT_ARM_PRSTATUS
>
>  arch/arm64/include/asm/ptrace.h   |   5 +
>  arch/arm64/kernel/ptrace.c| 130 +++-
>  include/uapi/linux/elf.h  |   1 +
>  tools/testing/selftests/arm64/Makefile|   2 +-
>  tools/testing/selftests/arm64/ptrace/Makefile |   6 +
>  .../arm64/ptrace/ptrace_syscall_regs_test.c   | 142 ++
>  6 files changed, 246 insertions(+), 40 deletions(-)
>  create mode 100644 tools/testing/selftests/arm64/ptrace/Makefile
>  create mode 100644 
> tools/testing/selftests/arm64/ptrace/ptrace_syscall_regs_test.c
>
> --
> 2.29.2
>


[PATCH 0/3] arm64/ptrace: allow to get all registers on syscall traps

2021-01-19 Thread Andrei Vagin
Right now, ip/r12 for AArch32 and x7 for AArch64 are used to indicate
whether or not the stop has been signalled from syscall entry or syscall
exit. This means that:

- Any writes by the tracer to this register during the stop are
  ignored/discarded.

- The actual value of the register is not available during the stop,
  so the tracer cannot save it and restore it later.

This series introduces NT_ARM_PRSTATUS to get all registers and makes it
possible to change the ip/r12 and x7 registers when the tracee is stopped
in syscall traps.

For applications like user-mode Linux or gVisor, it is critical to
have access to the full set of registers at any moment. For example,
they need to change values of all registers to emulate rt_sigreturn and
they need to have the full set of registers to build a signal frame.
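
A tracer that needs the unmangled registers would then ask for the new
note type instead of NT_PRSTATUS. A rough sketch, assuming the 0x410
value proposed in patch 2 (subject to change):

#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/user.h>		/* struct user_regs_struct */

#ifndef NT_ARM_PRSTATUS
#define NT_ARM_PRSTATUS 0x410	/* value proposed in patch 2 */
#endif

static long read_full_regs(pid_t pid, struct user_regs_struct *regs)
{
	struct iovec iov = {
		.iov_base = regs,
		.iov_len  = sizeof(*regs),
	};

	/* Unlike NT_PRSTATUS, this regset is not rewritten at
	 * syscall-entry/exit stops, so regs->regs[7] is the real x7. */
	return ptrace(PTRACE_GETREGSET, pid, NT_ARM_PRSTATUS, &iov);
}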

Andrei Vagin (3):
  arm64/ptrace: don't clobber task registers on syscall entry/exit traps
  arm64/ptrace: introduce NT_ARM_PRSTATUS to get a full set of registers
  selftest/arm64/ptrace: add a test for NT_ARM_PRSTATUS

 arch/arm64/include/asm/ptrace.h   |   5 +
 arch/arm64/kernel/ptrace.c| 130 +++-
 include/uapi/linux/elf.h  |   1 +
 tools/testing/selftests/arm64/Makefile|   2 +-
 tools/testing/selftests/arm64/ptrace/Makefile |   6 +
 .../arm64/ptrace/ptrace_syscall_regs_test.c   | 142 ++
 6 files changed, 246 insertions(+), 40 deletions(-)
 create mode 100644 tools/testing/selftests/arm64/ptrace/Makefile
 create mode 100644 
tools/testing/selftests/arm64/ptrace/ptrace_syscall_regs_test.c

-- 
2.29.2



[PATCH 2/3] arm64/ptrace: introduce NT_ARM_PRSTATUS to get a full set of registers

2021-01-19 Thread Andrei Vagin
This is an alternative to NT_PRSTATUS that clobbers ip/r12 on AArch32,
x7 on AArch64 when a tracee is stopped in syscall entry or syscall exit
traps.

Signed-off-by: Andrei Vagin 
---
 arch/arm64/kernel/ptrace.c | 39 ++
 include/uapi/linux/elf.h   |  1 +
 2 files changed, 40 insertions(+)

diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index 1863f080cb07..b8e4c2ddf636 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -591,6 +591,15 @@ static int gpr_get(struct task_struct *target,
return ret;
 }
 
+static int gpr_get_full(struct task_struct *target,
+  const struct user_regset *regset,
+  struct membuf to)
+{
+   struct user_pt_regs *uregs = _pt_regs(target)->user_regs;
+
+   return membuf_write(, uregs, sizeof(*uregs));
+}
+
 static int gpr_set(struct task_struct *target, const struct user_regset 
*regset,
   unsigned int pos, unsigned int count,
   const void *kbuf, const void __user *ubuf)
@@ -1088,6 +1097,7 @@ static int tagged_addr_ctrl_set(struct task_struct 
*target, const struct
 
 enum aarch64_regset {
REGSET_GPR,
+   REGSET_GPR_FULL,
REGSET_FPR,
REGSET_TLS,
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
@@ -1119,6 +1129,14 @@ static const struct user_regset aarch64_regsets[] = {
.regset_get = gpr_get,
.set = gpr_set
},
+   [REGSET_GPR_FULL] = {
+   .core_note_type = NT_ARM_PRSTATUS,
+   .n = sizeof(struct user_pt_regs) / sizeof(u64),
+   .size = sizeof(u64),
+   .align = sizeof(u64),
+   .regset_get = gpr_get_full,
+   .set = gpr_set
+   },
[REGSET_FPR] = {
.core_note_type = NT_PRFPREG,
.n = sizeof(struct user_fpsimd_state) / sizeof(u32),
@@ -1225,6 +1243,7 @@ static const struct user_regset_view user_aarch64_view = {
 #ifdef CONFIG_COMPAT
 enum compat_regset {
REGSET_COMPAT_GPR,
+   REGSET_COMPAT_GPR_FULL,
REGSET_COMPAT_VFP,
 };
 
@@ -1285,6 +1304,18 @@ static int compat_gpr_get(struct task_struct *target,
return 0;
 }
 
+/* compat_gpr_get_full doesn't  overwrite x12 like compat_gpr_get. */
+static int compat_gpr_get_full(struct task_struct *target,
+ const struct user_regset *regset,
+ struct membuf to)
+{
+   int i = 0;
+
+   while (to.left)
+   membuf_store(, compat_get_user_reg(target, i++));
+   return 0;
+}
+
 static int compat_gpr_set(struct task_struct *target,
  const struct user_regset *regset,
  unsigned int pos, unsigned int count,
@@ -1435,6 +1466,14 @@ static const struct user_regset aarch32_regsets[] = {
.regset_get = compat_gpr_get,
.set = compat_gpr_set
},
+   [REGSET_COMPAT_GPR_FULL] = {
+   .core_note_type = NT_ARM_PRSTATUS,
+   .n = COMPAT_ELF_NGREG,
+   .size = sizeof(compat_elf_greg_t),
+   .align = sizeof(compat_elf_greg_t),
+   .regset_get = compat_gpr_get_full,
+   .set = compat_gpr_set
+   },
[REGSET_COMPAT_VFP] = {
.core_note_type = NT_ARM_VFP,
.n = VFP_STATE_SIZE / sizeof(compat_ulong_t),
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index 30f68b42eeb5..a2086d19263a 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -426,6 +426,7 @@ typedef struct elf64_shdr {
 #define NT_ARM_PACA_KEYS   0x407   /* ARM pointer authentication address 
keys */
 #define NT_ARM_PACG_KEYS   0x408   /* ARM pointer authentication generic 
key */
 #define NT_ARM_TAGGED_ADDR_CTRL0x409   /* arm64 tagged address control 
(prctl()) */
+#define NT_ARM_PRSTATUS0x410   /* ARM general-purpose 
registers */
 #define NT_ARC_V2  0x600   /* ARCv2 accumulator/extra registers */
 #define NT_VMCOREDD0x700   /* Vmcore Device Dump Note */
 #define NT_MIPS_DSP0x800   /* MIPS DSP ASE registers */
-- 
2.29.2



[PATCH 3/3] selftest/arm64/ptrace: add a test for NT_ARM_PRSTATUS

2021-01-19 Thread Andrei Vagin
Test output:
 TAP version 13
 1..1
 # selftests: arm64/ptrace: ptrace_syscall_regs_test
 # 1..3
 # ok 1 NT_PRSTATUS: x7 0
 # ok 2 NT_ARM_PRSTATUS: x7 686920776f726c64
 # ok 3 The child exited with code 0.
 # # Totals: pass:3 fail:0 xfail:0 xpass:0 skip:0 error:0
 ok 1 selftests: arm64/ptrace: ptrace_syscall_regs_test

Signed-off-by: Andrei Vagin 
---
 tools/testing/selftests/arm64/Makefile|   2 +-
 tools/testing/selftests/arm64/ptrace/Makefile |   6 +
 .../arm64/ptrace/ptrace_syscall_regs_test.c   | 147 ++
 3 files changed, 154 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/arm64/ptrace/Makefile
 create mode 100644 
tools/testing/selftests/arm64/ptrace/ptrace_syscall_regs_test.c

diff --git a/tools/testing/selftests/arm64/Makefile 
b/tools/testing/selftests/arm64/Makefile
index 2c9d012797a7..704770a60ece 100644
--- a/tools/testing/selftests/arm64/Makefile
+++ b/tools/testing/selftests/arm64/Makefile
@@ -4,7 +4,7 @@
 ARCH ?= $(shell uname -m 2>/dev/null || echo not)
 
 ifneq (,$(filter $(ARCH),aarch64 arm64))
-ARM64_SUBTARGETS ?= tags signal pauth fp mte
+ARM64_SUBTARGETS ?= tags signal pauth fp mte ptrace
 else
 ARM64_SUBTARGETS :=
 endif
diff --git a/tools/testing/selftests/arm64/ptrace/Makefile 
b/tools/testing/selftests/arm64/ptrace/Makefile
new file mode 100644
index ..ca906ce3a581
--- /dev/null
+++ b/tools/testing/selftests/arm64/ptrace/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0
+
+CFLAGS += -g -I../../../../../usr/include/
+TEST_GEN_PROGS := ptrace_syscall_regs_test
+
+include ../../lib.mk
diff --git a/tools/testing/selftests/arm64/ptrace/ptrace_syscall_regs_test.c 
b/tools/testing/selftests/arm64/ptrace/ptrace_syscall_regs_test.c
new file mode 100644
index ..601378b7591d
--- /dev/null
+++ b/tools/testing/selftests/arm64/ptrace/ptrace_syscall_regs_test.c
@@ -0,0 +1,147 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "../../kselftest.h"
+
+#define TEST_VAL 0x686920776f726c64UL
+
+#define pr_p(func, fmt, ...)   func(fmt ": %m", ##__VA_ARGS__)
+
+#define pr_err(fmt, ...)   \
+   ({  \
+   ksft_test_result_error(fmt "\n", ##__VA_ARGS__);
\
+   -1; \
+   })
+
+#define pr_fail(fmt, ...)  \
+   ({  \
+   ksft_test_result_fail(fmt "\n", ##__VA_ARGS__); \
+   -1; \
+   })
+
+#define pr_perror(fmt, ...)pr_p(pr_err, fmt, ##__VA_ARGS__)
+
+static long loop(void *val)
+{
+   register long x0 __asm__("x0");
+   register void *x1 __asm__("x1") = val;
+   register long x8 __asm__("x8") = 555;
+
+   __asm__ (
+   "again:\n"
+   "ldr x7, [x1, 0]\n"
+   "svc 0\n"
+   "str x7, [x1, 0]\n"
+  : "=r"(x0)
+  : "r"(x1), "r"(x8)
+  :
+   );
+   return 0;
+}
+
+static int child(void)
+{
+   long  val = TEST_VAL;
+
+   loop();
+   if (val != ~TEST_VAL) {
+   ksft_print_msg("Unexpected x7: %lx\n", val);
+   return 1;
+   }
+
+   return 0;
+}
+
+#ifndef PTRACE_SYSEMU
+#define PTRACE_SYSEMU 31
+#endif
+
+#ifndef NT_ARM_PRSTATUS
+#define NT_ARM_PRSTATUS 0x410
+#endif
+
+int main(int argc, void **argv)
+{
+   struct user_regs_struct regs = {};
+   struct iovec iov = {
+   .iov_base = ,
+   .iov_len = sizeof(struct user_regs_struct),
+   };
+   int status;
+   pid_t pid;
+
+   ksft_set_plan(3);
+
+   pid = fork();
+   if (pid == 0) {
+   kill(getpid(), SIGSTOP);
+   child();
+   _exit(0);
+   }
+   if (pid < 0)
+   return 1;
+
+   if (ptrace(PTRACE_ATTACH, pid, 0, 0))
+   return pr_perror("Can't attach to the child %d", pid);
+   if (waitpid(pid, , 0) != pid)
+   return pr_perror("Can't wait for the child %d", pid);
+   /* skip SIGSTOP */
+   if (ptrace(PTRACE_CONT, pid, 0, 0))
+   return pr_perror("Can't resume the child %d", pid);
+   if (waitpid(pid, , 0) != pid)
+   return pr_perror("Can't wait for the child %d", pid);
+
+   /* Resume the child to the next system call. */
+   if (ptrace(PTRACE_SYSEMU, pid, 0, 0))
+   return pr_perror("Can

[PATCH 1/3] arm64/ptrace: don't clobber task registers on syscall entry/exit traps

2021-01-19 Thread Andrei Vagin
ip/r12 for AArch32 and x7 for AArch64 are used to indicate whether or not
the stop has been signalled from syscall entry or syscall exit. This
means that:

- Any writes by the tracer to this register during the stop are
  ignored/discarded.

- The actual value of the register is not available during the stop,
  so the tracer cannot save it and restore it later.

Right now, these registers are clobbered in tracehook_report_syscall.
This change moves this logic to gpr_get and compat_gpr_get where
registers are copied into a user-space buffer.

This will make it possible to change these registers and to introduce a
new NT_ARM_PRSTATUS command to get the full set of registers.

Signed-off-by: Andrei Vagin 
---
 arch/arm64/include/asm/ptrace.h |   5 ++
 arch/arm64/kernel/ptrace.c  | 104 +++-
 2 files changed, 67 insertions(+), 42 deletions(-)

diff --git a/arch/arm64/include/asm/ptrace.h b/arch/arm64/include/asm/ptrace.h
index e58bca832dff..0a9552b4f61e 100644
--- a/arch/arm64/include/asm/ptrace.h
+++ b/arch/arm64/include/asm/ptrace.h
@@ -170,6 +170,11 @@ static inline unsigned long pstate_to_compat_psr(const 
unsigned long pstate)
return psr;
 }
 
+enum ptrace_syscall_dir {
+   PTRACE_SYSCALL_ENTER = 0,
+   PTRACE_SYSCALL_EXIT,
+};
+
 /*
  * This struct defines the way the registers are stored on the stack during an
  * exception. Note that sizeof(struct pt_regs) has to be a multiple of 16 (for
diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index 8ac487c84e37..1863f080cb07 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -40,6 +40,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define CREATE_TRACE_POINTS
 #include 
@@ -561,7 +562,33 @@ static int gpr_get(struct task_struct *target,
   struct membuf to)
 {
struct user_pt_regs *uregs = _pt_regs(target)->user_regs;
-   return membuf_write(, uregs, sizeof(*uregs));
+   unsigned long saved_reg;
+   int ret;
+
+   /*
+* We have some ABI weirdness here in the way that we handle syscall
+* exit stops because we indicate whether or not the stop has been
+* signalled from syscall entry or syscall exit by clobbering the 
general
+* purpose register x7.
+*/
+   switch (target->ptrace_message) {
+   case PTRACE_EVENTMSG_SYSCALL_ENTRY:
+   saved_reg = uregs->regs[7];
+   uregs->regs[7] = PTRACE_SYSCALL_ENTER;
+   break;
+   case PTRACE_EVENTMSG_SYSCALL_EXIT:
+   saved_reg = uregs->regs[7];
+   uregs->regs[7] = PTRACE_SYSCALL_EXIT;
+   break;
+   }
+
+   ret =  membuf_write(, uregs, sizeof(*uregs));
+
+   if (target->ptrace_message == PTRACE_EVENTMSG_SYSCALL_ENTRY ||
+   target->ptrace_message == PTRACE_EVENTMSG_SYSCALL_EXIT)
+   uregs->regs[7] = saved_reg;
+
+   return ret;
 }
 
 static int gpr_set(struct task_struct *target, const struct user_regset 
*regset,
@@ -1221,10 +1248,40 @@ static int compat_gpr_get(struct task_struct *target,
  const struct user_regset *regset,
  struct membuf to)
 {
+   compat_ulong_t r12;
+   bool overwrite_r12;
int i = 0;
 
-   while (to.left)
-   membuf_store(, compat_get_user_reg(target, i++));
+   /*
+* We have some ABI weirdness here in the way that we handle syscall
+* exit stops because we indicate whether or not the stop has been
+* signalled from syscall entry or syscall exit by clobbering the
+* general purpose register r12.
+*/
+   switch (target->ptrace_message) {
+   case PTRACE_EVENTMSG_SYSCALL_ENTRY:
+   r12 = PTRACE_SYSCALL_ENTER;
+   overwrite_r12 = true;
+   break;
+   case PTRACE_EVENTMSG_SYSCALL_EXIT:
+   r12 = PTRACE_SYSCALL_EXIT;
+   overwrite_r12 = true;
+   break;
+   default:
+   overwrite_r12 = false;
+   break;
+   }
+
+   while (to.left) {
+   compat_ulong_t val;
+
+   if (!overwrite_r12 || i != 12)
+   val = compat_get_user_reg(target, i++);
+   else
+   val = r12;
+   membuf_store(, val);
+   }
+
return 0;
 }
 
@@ -1740,53 +1797,16 @@ long arch_ptrace(struct task_struct *child, long 
request,
return ptrace_request(child, request, addr, data);
 }
 
-enum ptrace_syscall_dir {
-   PTRACE_SYSCALL_ENTER = 0,
-   PTRACE_SYSCALL_EXIT,
-};
-
 static void tracehook_report_syscall(struct pt_regs *regs,
 enum ptrace_syscall_dir dir)
 {
-   int regno;
-   unsigned long saved_reg;
-
-   /*
-* We have some ABI weirdness here in the way that we handle syscall
-* exit stops because we indicate wh

Re: [PATCH] timens: delete no-op time_ns_init()

2020-12-29 Thread Andrei Vagin
On Mon, Dec 28, 2020 at 1:54 PM Alexey Dobriyan  wrote:
>
> Signed-off-by: Alexey Dobriyan 

Acked-by: Andrei Vagin 

Thanks,
Andrei

> ---
>
>  kernel/time/namespace.c |6 --
>  1 file changed, 6 deletions(-)
>
> --- a/kernel/time/namespace.c
> +++ b/kernel/time/namespace.c
> @@ -465,9 +465,3 @@ struct time_namespace init_time_ns = {
> .ns.ops = _operations,
> .frozen_offsets = true,
>  };
> -
> -static int __init time_ns_init(void)
> -{
> -   return 0;
> -}
> -subsys_initcall(time_ns_init);


Re: [PATCH v4 0/3] time namespace aware system boot time

2020-10-22 Thread Andrei Vagin
On Mon, Oct 19, 2020 at 09:52:54PM +0200, Michael Weiß wrote:
> Time namespaces make it possible to virtualize time inside of
> containers, e.g., it is feasible to reset the uptime of a container
> to zero by setting the time namespace offset for boottime to the
> negated current value of the CLOCK_BOOTTIME.
> 
> However, the boot time stamp provided by getboottime64() does not
> take care of time namespaces. The resulting boot time stamp 'btime'
> provided by /proc/stat does not show a plausible time stamp inside
> the time namespace of a container.
> 
> We address this by shifting the value returned by getboottime64()
> by subtracting the boottime offset of the time namespace.
> (A selftest to check the expected /proc/stat 'btime' inside the
> namespace is provided.)
> 
> Further, to avoid to show processes as time travelers inside of the
> time namespace the boottime offset then needs to be added to the
> start_boottime provided by the task_struct.
> 
> v4 Changes:
> Avoid type conversions back and forth between timespec64 and ktime_t
> in 'proc/stat.c' as suggested by Andrei.
> Introduced timens_sub_boottime() in 'time_namespace.h' to provide
> better coder readability/consistency.
> 

Reviewed-by: Andrei Vagin 

Thanks,
Andrei


[tip: timers/urgent] futex: Adjust absolute futex timeouts with per time namespace offset

2020-10-20 Thread tip-bot2 for Andrei Vagin
The following commit has been merged into the timers/urgent branch of tip:

Commit-ID: c2f7d08cccf4af2ce6992feaabb9e68e4ae0bff3
Gitweb:
https://git.kernel.org/tip/c2f7d08cccf4af2ce6992feaabb9e68e4ae0bff3
Author:Andrei Vagin 
AuthorDate:Thu, 15 Oct 2020 09:00:19 -07:00
Committer: Thomas Gleixner 
CommitterDate: Tue, 20 Oct 2020 17:02:57 +02:00

futex: Adjust absolute futex timeouts with per time namespace offset

For all commands except FUTEX_WAIT, the timeout is interpreted as an
absolute value. This absolute value is inside the task's time namespace and
has to be converted to the host's time.

Fixes: 5a590f35add9 ("posix-clocks: Wire up clock_gettime() with timens 
offsets")
Reported-by: Hans van der Laan 
Signed-off-by: Andrei Vagin 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Dmitry Safonov <0x7f454...@gmail.com>
Cc: 
Link: https://lore.kernel.org/r/20201015160020.293748-1-ava...@gmail.com
---
 kernel/futex.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/kernel/futex.c b/kernel/futex.c
index 680854d..be68ac0 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -39,6 +39,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -3797,6 +3798,8 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, 
val,
t = timespec64_to_ktime(ts);
if (cmd == FUTEX_WAIT)
t = ktime_add_safe(ktime_get(), t);
+   else if (!(op & FUTEX_CLOCK_REALTIME))
+   t = timens_ktime_to_host(CLOCK_MONOTONIC, t);
tp = 
}
/*
@@ -3989,6 +3992,8 @@ SYSCALL_DEFINE6(futex_time32, u32 __user *, uaddr, int, 
op, u32, val,
t = timespec64_to_ktime(ts);
if (cmd == FUTEX_WAIT)
t = ktime_add_safe(ktime_get(), t);
+   else if (!(op & FUTEX_CLOCK_REALTIME))
+   t = timens_ktime_to_host(CLOCK_MONOTONIC, t);
tp = 
}
if (cmd == FUTEX_REQUEUE || cmd == FUTEX_CMP_REQUEUE ||


[tip: timers/urgent] selftests/timens: Add a test for futex()

2020-10-20 Thread tip-bot2 for Andrei Vagin
The following commit has been merged into the timers/urgent branch of tip:

Commit-ID: a4fd8414659bf470e2146b352574bbd274e54b7a
Gitweb:
https://git.kernel.org/tip/a4fd8414659bf470e2146b352574bbd274e54b7a
Author:Andrei Vagin 
AuthorDate:Thu, 15 Oct 2020 09:00:20 -07:00
Committer: Thomas Gleixner 
CommitterDate: Tue, 20 Oct 2020 17:02:57 +02:00

selftests/timens: Add a test for futex()

Output on success:
 1..2
 ok 1 futex with the 0 clockid
 ok 2 futex with the 1 clockid
 # Totals: pass:2 fail:0 xfail:0 xpass:0 skip:0 error:0

Signed-off-by: Andrei Vagin 
Signed-off-by: Thomas Gleixner 
Link: https://lore.kernel.org/r/20201015160020.293748-2-ava...@gmail.com

---
 tools/testing/selftests/timens/Makefile |   2 +-
 tools/testing/selftests/timens/futex.c  | 110 +++-
 2 files changed, 111 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/timens/futex.c

diff --git a/tools/testing/selftests/timens/Makefile 
b/tools/testing/selftests/timens/Makefile
index b4fd9a9..3a5936c 100644
--- a/tools/testing/selftests/timens/Makefile
+++ b/tools/testing/selftests/timens/Makefile
@@ -1,4 +1,4 @@
-TEST_GEN_PROGS := timens timerfd timer clock_nanosleep procfs exec
+TEST_GEN_PROGS := timens timerfd timer clock_nanosleep procfs exec futex
 TEST_GEN_PROGS_EXTENDED := gettime_perf
 
 CFLAGS := -Wall -Werror -pthread
diff --git a/tools/testing/selftests/timens/futex.c 
b/tools/testing/selftests/timens/futex.c
new file mode 100644
index 000..6b2b926
--- /dev/null
+++ b/tools/testing/selftests/timens/futex.c
@@ -0,0 +1,110 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "log.h"
+#include "timens.h"
+
+#define NSEC_PER_SEC 1000000000ULL
+
+static int run_test(int clockid)
+{
+   int futex_op = FUTEX_WAIT_BITSET;
+   struct timespec timeout, end;
+   int val = 0;
+
+   if (clockid == CLOCK_REALTIME)
+   futex_op |= FUTEX_CLOCK_REALTIME;
+
+   clock_gettime(clockid, );
+   timeout.tv_nsec += NSEC_PER_SEC / 10; // 100ms
+   if (timeout.tv_nsec > NSEC_PER_SEC) {
+   timeout.tv_sec++;
+   timeout.tv_nsec -= NSEC_PER_SEC;
+   }
+
+   if (syscall(__NR_futex, , futex_op, 0,
+   , 0, FUTEX_BITSET_MATCH_ANY) >= 0) {
+   ksft_test_result_fail("futex didn't return ETIMEDOUT\n");
+   return 1;
+   }
+
+   if (errno != ETIMEDOUT) {
+   ksft_test_result_fail("futex didn't return ETIMEDOUT: %s\n",
+   strerror(errno));
+   return 1;
+   }
+
+   clock_gettime(clockid, );
+
+   if (end.tv_sec < timeout.tv_sec ||
+   (end.tv_sec == timeout.tv_sec && end.tv_nsec < timeout.tv_nsec)) {
+   ksft_test_result_fail("futex slept less than 100ms\n");
+   return 1;
+   }
+
+
+   ksft_test_result_pass("futex with the %d clockid\n", clockid);
+
+   return 0;
+}
+
+int main(int argc, char *argv[])
+{
+   int status, len, fd;
+   char buf[4096];
+   pid_t pid;
+   struct timespec mtime_now;
+
+   nscheck();
+
+   ksft_set_plan(2);
+
+   clock_gettime(CLOCK_MONOTONIC, _now);
+
+   if (unshare_timens())
+   return 1;
+
+   len = snprintf(buf, sizeof(buf), "%d %d 0",
+   CLOCK_MONOTONIC, 70 * 24 * 3600);
+   fd = open("/proc/self/timens_offsets", O_WRONLY);
+   if (fd < 0)
+   return pr_perror("/proc/self/timens_offsets");
+
+   if (write(fd, buf, len) != len)
+   return pr_perror("/proc/self/timens_offsets");
+
+   close(fd);
+
+   pid = fork();
+   if (pid < 0)
+   return pr_perror("Unable to fork");
+   if (pid == 0) {
+   int ret = 0;
+
+   ret |= run_test(CLOCK_REALTIME);
+   ret |= run_test(CLOCK_MONOTONIC);
+   if (ret)
+   ksft_exit_fail();
+   ksft_exit_pass();
+   return 0;
+   }
+
+   if (waitpid(pid, , 0) != pid)
+   return pr_perror("Unable to wait the child process");
+
+   if (WIFEXITED(status))
+   return WEXITSTATUS(status);
+
+   return 1;
+}


Re: [PATCH 1/2] futex: adjust a futex timeout with a per-timens offset

2020-10-16 Thread Andrei Vagin
On Thu, Oct 15, 2020 at 04:13:42PM +0200, Thomas Gleixner wrote:
> On Thu, Oct 15 2020 at 14:26, Dmitry Safonov wrote:
> 
> > On 10/15/20 8:29 AM, Andrei Vagin wrote:
> >> For all commands except FUTEX_WAIT, timeout is interpreted as an
> >> absolute value. This absolute value is inside the task's time namespace
> >> and has to be converted to the host's time.
> >> 
> >> Cc: 
> >> Fixes: 5a590f35add9 ("posix-clocks: Wire up clock_gettime() with timens 
> >> offsets")
> >> Reported-by: Hans van der Laan 
> >> Signed-off-by: Andrei Vagin [..]
> >> @@ -3797,6 +3798,8 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, 
> >> u32, val,
> >>t = timespec64_to_ktime(ts);
> >>if (cmd == FUTEX_WAIT)
> >>t = ktime_add_safe(ktime_get(), t);
> >> +  else if (!(cmd & FUTEX_CLOCK_REALTIME))
> >> +  t = timens_ktime_to_host(CLOCK_MONOTONIC, t);
> >
> > Err, it probably should be
> > : else if (!(op & FUTEX_CLOCK_REALTIME))

Dmitry, thank you for catching this.

> 
> Duh, you are right. I stared at it for a while and did not spot it.
> 
> > And there's also
> > : SYSCALL_DEFINE6(futex_time32, ...)
> > which wants to be patched.
> 
> Indeed. I zapped the commits.

I sent a new version. This time, I extended the test to check
FUTEX_CLOCK_REALTIME, and I compiled and ran the compat version.
Everything works as it should.

Thanks,
Andrei


[PATCH 1/2 v2] futex: adjust a futex timeout with a per-timens offset

2020-10-15 Thread Andrei Vagin
For all commands except FUTEX_WAIT, timeout is interpreted as an
absolute value. This absolute value is inside the task's time namespace
and has to be converted to the host's time.

Cc: 
Fixes: 5a590f35add9 ("posix-clocks: Wire up clock_gettime() with timens 
offsets")
Reported-by: Hans van der Laan 
Signed-off-by: Andrei Vagin 
---

v2:
* check FUTEX_CLOCK_REALTIME properly
* fix futex_time32 too

 kernel/futex.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/kernel/futex.c b/kernel/futex.c
index a5876694a60e..32056d2d4171 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -39,6 +39,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -3797,6 +3798,8 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, 
val,
t = timespec64_to_ktime(ts);
if (cmd == FUTEX_WAIT)
t = ktime_add_safe(ktime_get(), t);
+   else if (!(op & FUTEX_CLOCK_REALTIME))
+   t = timens_ktime_to_host(CLOCK_MONOTONIC, t);
tp = 
}
/*
@@ -3989,6 +3992,8 @@ SYSCALL_DEFINE6(futex_time32, u32 __user *, uaddr, int, 
op, u32, val,
t = timespec64_to_ktime(ts);
if (cmd == FUTEX_WAIT)
t = ktime_add_safe(ktime_get(), t);
+   else if (!(op & FUTEX_CLOCK_REALTIME))
+   t = timens_ktime_to_host(CLOCK_MONOTONIC, t);
tp = 
}
if (cmd == FUTEX_REQUEUE || cmd == FUTEX_CMP_REQUEUE ||
-- 
2.26.2



[PATCH 2/2 v2] selftests/timens: Add a test for futex()

2020-10-15 Thread Andrei Vagin
Output on success:
 1..2
 ok 1 futex with the 0 clockid
 ok 2 futex with the 1 clockid
 # Totals: pass:2 fail:0 xfail:0 xpass:0 skip:0 error:0

Signed-off-by: Andrei Vagin 
---
v2: extend the test to check FUTEX_CLOCK_REALTIME

 tools/testing/selftests/timens/Makefile |   2 +-
 tools/testing/selftests/timens/futex.c  | 110 
 2 files changed, 111 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/timens/futex.c

diff --git a/tools/testing/selftests/timens/Makefile 
b/tools/testing/selftests/timens/Makefile
index b4fd9a934654..3a5936cc10ab 100644
--- a/tools/testing/selftests/timens/Makefile
+++ b/tools/testing/selftests/timens/Makefile
@@ -1,4 +1,4 @@
-TEST_GEN_PROGS := timens timerfd timer clock_nanosleep procfs exec
+TEST_GEN_PROGS := timens timerfd timer clock_nanosleep procfs exec futex
 TEST_GEN_PROGS_EXTENDED := gettime_perf
 
 CFLAGS := -Wall -Werror -pthread
diff --git a/tools/testing/selftests/timens/futex.c 
b/tools/testing/selftests/timens/futex.c
new file mode 100644
index ..6b2b9264e851
--- /dev/null
+++ b/tools/testing/selftests/timens/futex.c
@@ -0,0 +1,110 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "log.h"
+#include "timens.h"
+
+#define NSEC_PER_SEC 1000000000ULL
+
+static int run_test(int clockid)
+{
+   int futex_op = FUTEX_WAIT_BITSET;
+   struct timespec timeout, end;
+   int val = 0;
+
+   if (clockid == CLOCK_REALTIME)
+   futex_op |= FUTEX_CLOCK_REALTIME;
+
+   clock_gettime(clockid, );
+   timeout.tv_nsec += NSEC_PER_SEC / 10; // 100ms
+   if (timeout.tv_nsec > NSEC_PER_SEC) {
+   timeout.tv_sec++;
+   timeout.tv_nsec -= NSEC_PER_SEC;
+   }
+
+   if (syscall(__NR_futex, , futex_op, 0,
+   , 0, FUTEX_BITSET_MATCH_ANY) >= 0) {
+   ksft_test_result_fail("futex didn't return ETIMEDOUT\n");
+   return 1;
+   }
+
+   if (errno != ETIMEDOUT) {
+   ksft_test_result_fail("futex didn't return ETIMEDOUT: %s\n",
+   strerror(errno));
+   return 1;
+   }
+
+   clock_gettime(clockid, );
+
+   if (end.tv_sec < timeout.tv_sec ||
+   (end.tv_sec == timeout.tv_sec && end.tv_nsec < timeout.tv_nsec)) {
+   ksft_test_result_fail("futex slept less than 100ms\n");
+   return 1;
+   }
+
+
+   ksft_test_result_pass("futex with the %d clockid\n", clockid);
+
+   return 0;
+}
+
+int main(int argc, char *argv[])
+{
+   int status, len, fd;
+   char buf[4096];
+   pid_t pid;
+   struct timespec mtime_now;
+
+   nscheck();
+
+   ksft_set_plan(2);
+
+   clock_gettime(CLOCK_MONOTONIC, _now);
+
+   if (unshare_timens())
+   return 1;
+
+   len = snprintf(buf, sizeof(buf), "%d %d 0",
+   CLOCK_MONOTONIC, 70 * 24 * 3600);
+   fd = open("/proc/self/timens_offsets", O_WRONLY);
+   if (fd < 0)
+   return pr_perror("/proc/self/timens_offsets");
+
+   if (write(fd, buf, len) != len)
+   return pr_perror("/proc/self/timens_offsets");
+
+   close(fd);
+
+   pid = fork();
+   if (pid < 0)
+   return pr_perror("Unable to fork");
+   if (pid == 0) {
+   int ret = 0;
+
+   ret |= run_test(CLOCK_REALTIME);
+   ret |= run_test(CLOCK_MONOTONIC);
+   if (ret)
+   ksft_exit_fail();
+   ksft_exit_pass();
+   return 0;
+   }
+
+   if (waitpid(pid, , 0) != pid)
+   return pr_perror("Unable to wait the child process");
+
+   if (WIFEXITED(status))
+   return WEXITSTATUS(status);
+
+   return 1;
+}
-- 
2.26.2



[tip: timers/urgent] selftests/timens: Add a test for futex()

2020-10-15 Thread tip-bot2 for Andrei Vagin
The following commit has been merged into the timers/urgent branch of tip:

Commit-ID: 4cdf42596216e08051c0ccc4c896dcb8b2d22f10
Gitweb:
https://git.kernel.org/tip/4cdf42596216e08051c0ccc4c896dcb8b2d22f10
Author:Andrei Vagin 
AuthorDate:Thu, 15 Oct 2020 00:29:09 -07:00
Committer: Thomas Gleixner 
CommitterDate: Thu, 15 Oct 2020 11:24:04 +02:00

selftests/timens: Add a test for futex()

Output on success:
 $ ./futex
 1..1
 ok 1 futex
 # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0

Signed-off-by: Andrei Vagin 
Signed-off-by: Thomas Gleixner 
Link: https://lore.kernel.org/r/20201015072909.271426-2-ava...@gmail.com

---
 tools/testing/selftests/timens/Makefile |   2 +-
 tools/testing/selftests/timens/futex.c  | 107 +++-
 2 files changed, 108 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/timens/futex.c

diff --git a/tools/testing/selftests/timens/Makefile 
b/tools/testing/selftests/timens/Makefile
index b4fd9a9..3a5936c 100644
--- a/tools/testing/selftests/timens/Makefile
+++ b/tools/testing/selftests/timens/Makefile
@@ -1,4 +1,4 @@
-TEST_GEN_PROGS := timens timerfd timer clock_nanosleep procfs exec
+TEST_GEN_PROGS := timens timerfd timer clock_nanosleep procfs exec futex
 TEST_GEN_PROGS_EXTENDED := gettime_perf
 
 CFLAGS := -Wall -Werror -pthread
diff --git a/tools/testing/selftests/timens/futex.c 
b/tools/testing/selftests/timens/futex.c
new file mode 100644
index 000..173760d
--- /dev/null
+++ b/tools/testing/selftests/timens/futex.c
@@ -0,0 +1,107 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "log.h"
+#include "timens.h"
+
+#define NSEC_PER_SEC 1000000000ULL
+
+static int run_test(void)
+{
+   struct timespec timeout, end;
+   int val = 0;
+
+   clock_gettime(CLOCK_MONOTONIC, );
+   timeout.tv_nsec += NSEC_PER_SEC / 10; // 100ms
+   if (timeout.tv_nsec > NSEC_PER_SEC) {
+   timeout.tv_sec++;
+   timeout.tv_nsec -= NSEC_PER_SEC;
+   }
+
+   if (syscall(__NR_futex, , FUTEX_WAIT_BITSET, 0,
+   , 0, FUTEX_BITSET_MATCH_ANY) >= 0) {
+   ksft_test_result_fail("futex didn't return ETIMEDOUT");
+   return 1;
+   }
+
+   if (errno != ETIMEDOUT) {
+   ksft_test_result_fail("futex didn't return ETIMEDOUT: %s",
+   strerror(errno));
+   return 1;
+   }
+
+   clock_gettime(CLOCK_MONOTONIC, );
+
+   if (end.tv_sec < timeout.tv_sec ||
+   (end.tv_sec == timeout.tv_sec && end.tv_nsec < timeout.tv_sec)) {
+   ksft_test_result_fail("futex slept less than 100ms");
+   return 1;
+   }
+
+
+   ksft_test_result_pass("futex\n");
+
+   return 0;
+}
+
+int main(int argc, char *argv[])
+{
+   int status, len, fd;
+   char buf[4096];
+   pid_t pid;
+   struct timespec mtime_now;
+
+   nscheck();
+
+   check_supported_timers();
+
+   ksft_set_plan(1);
+
+   clock_gettime(CLOCK_MONOTONIC, _now);
+
+   if (unshare_timens())
+   return 1;
+
+   len = snprintf(buf, sizeof(buf), "%d %d 0",
+   CLOCK_MONOTONIC, 70 * 24 * 3600);
+   fd = open("/proc/self/timens_offsets", O_WRONLY);
+   if (fd < 0)
+   return pr_perror("/proc/self/timens_offsets");
+
+   if (write(fd, buf, len) != len)
+   return pr_perror("/proc/self/timens_offsets");
+
+   close(fd);
+
+   pid = fork();
+   if (pid < 0)
+   return pr_perror("Unable to fork");
+   if (pid == 0) {
+   if (run_test())
+   ksft_exit_fail();
+   ksft_exit_pass();
+   return 0;
+   }
+
+   if (waitpid(pid, , 0) != pid)
+   return pr_perror("Unable to wait the child process");
+
+   if (WIFEXITED(status))
+   return WEXITSTATUS(status);
+
+   return 1;
+}


[tip: timers/urgent] futex: Adjust futex absolute timeouts with per-timens offset

2020-10-15 Thread tip-bot2 for Andrei Vagin
The following commit has been merged into the timers/urgent branch of tip:

Commit-ID: 06764291690f8650a9f96dea42cc0dd4138d47d5
Gitweb:
https://git.kernel.org/tip/06764291690f8650a9f96dea42cc0dd4138d47d5
Author:Andrei Vagin 
AuthorDate:Thu, 15 Oct 2020 00:29:08 -07:00
Committer: Thomas Gleixner 
CommitterDate: Thu, 15 Oct 2020 11:24:04 +02:00

futex: Adjust futex absolute timeouts with per-timens offset

For all commands except FUTEX_WAIT, timeout is interpreted as an absolute
value. This absolute value is inside the task's time namespace and has to
be converted to the host's time.

Fixes: 5a590f35add9 ("posix-clocks: Wire up clock_gettime() with timens 
offsets")
Reported-by: Hans van der Laan 
Signed-off-by: Andrei Vagin 
Signed-off-by: Thomas Gleixner 
Cc: 
Link: https://lore.kernel.org/r/20201015072909.271426-1-ava...@gmail.com

---
 kernel/futex.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/futex.c b/kernel/futex.c
index a587669..9ff2b8c 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -39,6 +39,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -3797,6 +3798,8 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, 
val,
t = timespec64_to_ktime(ts);
if (cmd == FUTEX_WAIT)
t = ktime_add_safe(ktime_get(), t);
+   else if (!(cmd & FUTEX_CLOCK_REALTIME))
+   t = timens_ktime_to_host(CLOCK_MONOTONIC, t);
tp = 
}
/*


Re: [PATCH v3 2/3] fs/proc: apply the time namespace offset to /proc/stat btime

2020-10-15 Thread Andrei Vagin
On Sun, Oct 11, 2020 at 04:59:23PM +0200, Michael Weiß wrote:
> @@ -79,6 +80,20 @@ static u64 get_iowait_time(struct kernel_cpustat *kcs, int 
> cpu)
>  
>  #endif
>  
> +static void get_boottime(struct timespec64 *ts)
> +{

> + ktime_t boottime;
> +
> + /* get kernel internal system boot timestamp */
> + getboottime64(ts);
> +
> + /* shift boot timestamp according to the timens offset */
> + boottime = timespec64_to_ktime(*ts);
> + boottime = timens_ktime_to_host(CLOCK_BOOTTIME, boottime);

timens_ktime_to_host() is used to convert a timens time to the host's time.
Here it looks like we are using it in the opposite direction. I spent some
time figuring out what is going on here. I think it is worth adding a
comment here.

> +
> + *ts = ktime_to_timespec64(boottime);

I don't like all these conversions back and forth. Maybe something like
this would look better:

#ifdef CONFIG_TIME_NS
	if (current->nsproxy->time_ns != &init_time_ns) {
		struct timens_offsets *ns_offsets;

		ns_offsets = &current->nsproxy->time_ns->offsets;
		*ts = timespec64_sub(*ts, ns_offsets->boottime);
	}
#endif


Thanks,
Andrei


[PATCH 1/2] futex: adjust a futex timeout with a per-timens offset

2020-10-15 Thread Andrei Vagin
For all commands except FUTEX_WAIT, the timeout is interpreted as an
absolute value. This absolute value is expressed in the task's time
namespace and has to be converted to the host's time.

Cc: 
Fixes: 5a590f35add9 ("posix-clocks: Wire up clock_gettime() with timens 
offsets")
Reported-by: Hans van der Laan 
Signed-off-by: Andrei Vagin 
---
 kernel/futex.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/futex.c b/kernel/futex.c
index a5876694a60e..9ff2b8c5a506 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -39,6 +39,7 @@
 #include 
 #include 
 #include 
+#include <linux/time_namespace.h>
 
 #include 
 
@@ -3797,6 +3798,8 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, 
val,
t = timespec64_to_ktime(ts);
if (cmd == FUTEX_WAIT)
t = ktime_add_safe(ktime_get(), t);
+   else if (!(cmd & FUTEX_CLOCK_REALTIME))
+   t = timens_ktime_to_host(CLOCK_MONOTONIC, t);
tp = &t;
}
/*
-- 
2.26.2



[PATCH 2/2] selftests/timens: Add a test for futex()

2020-10-15 Thread Andrei Vagin
Output on success:
 $ ./futex
 1..1
 ok 1 futex
 # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0

Signed-off-by: Andrei Vagin 
---
 tools/testing/selftests/timens/Makefile |   2 +-
 tools/testing/selftests/timens/futex.c  | 107 
 2 files changed, 108 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/timens/futex.c

diff --git a/tools/testing/selftests/timens/Makefile 
b/tools/testing/selftests/timens/Makefile
index b4fd9a934654..3a5936cc10ab 100644
--- a/tools/testing/selftests/timens/Makefile
+++ b/tools/testing/selftests/timens/Makefile
@@ -1,4 +1,4 @@
-TEST_GEN_PROGS := timens timerfd timer clock_nanosleep procfs exec
+TEST_GEN_PROGS := timens timerfd timer clock_nanosleep procfs exec futex
 TEST_GEN_PROGS_EXTENDED := gettime_perf
 
 CFLAGS := -Wall -Werror -pthread
diff --git a/tools/testing/selftests/timens/futex.c 
b/tools/testing/selftests/timens/futex.c
new file mode 100644
index ..173760d3fce6
--- /dev/null
+++ b/tools/testing/selftests/timens/futex.c
@@ -0,0 +1,107 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "log.h"
+#include "timens.h"
+
+#define NSEC_PER_SEC 1000000000ULL
+
+static int run_test(void)
+{
+   struct timespec timeout, end;
+   int val = 0;
+
+   clock_gettime(CLOCK_MONOTONIC, &timeout);
+   timeout.tv_nsec += NSEC_PER_SEC / 10; // 100ms
+   if (timeout.tv_nsec > NSEC_PER_SEC) {
+   timeout.tv_sec++;
+   timeout.tv_nsec -= NSEC_PER_SEC;
+   }
+
+   if (syscall(__NR_futex, &val, FUTEX_WAIT_BITSET, 0,
+   &timeout, 0, FUTEX_BITSET_MATCH_ANY) >= 0) {
+   ksft_test_result_fail("futex didn't return ETIMEDOUT");
+   return 1;
+   }
+
+   if (errno != ETIMEDOUT) {
+   ksft_test_result_fail("futex didn't return ETIMEDOUT: %s",
+   strerror(errno));
+   return 1;
+   }
+
+   clock_gettime(CLOCK_MONOTONIC, &end);
+
+   if (end.tv_sec < timeout.tv_sec ||
+   (end.tv_sec == timeout.tv_sec && end.tv_nsec < timeout.tv_nsec)) {
+   ksft_test_result_fail("futex slept less than 100ms");
+   return 1;
+   }
+
+
+   ksft_test_result_pass("futex\n");
+
+   return 0;
+}
+
+int main(int argc, char *argv[])
+{
+   int status, len, fd;
+   char buf[4096];
+   pid_t pid;
+   struct timespec mtime_now;
+
+   nscheck();
+
+   check_supported_timers();
+
+   ksft_set_plan(1);
+
+   clock_gettime(CLOCK_MONOTONIC, &mtime_now);
+
+   if (unshare_timens())
+   return 1;
+
+   len = snprintf(buf, sizeof(buf), "%d %d 0",
+   CLOCK_MONOTONIC, 70 * 24 * 3600);
+   fd = open("/proc/self/timens_offsets", O_WRONLY);
+   if (fd < 0)
+   return pr_perror("/proc/self/timens_offsets");
+
+   if (write(fd, buf, len) != len)
+   return pr_perror("/proc/self/timens_offsets");
+
+   close(fd);
+
+   pid = fork();
+   if (pid < 0)
+   return pr_perror("Unable to fork");
+   if (pid == 0) {
+   if (run_test())
+   ksft_exit_fail();
+   ksft_exit_pass();
+   return 0;
+   }
+
+   if (waitpid(pid, &status, 0) != pid)
+   return pr_perror("Unable to wait the child process");
+
+   if (WIFEXITED(status))
+   return WEXITSTATUS(status);
+
+   return 1;
+}
-- 
2.26.2



Re: [PATCH v2 2/4] time: make getboottime64 aware of time namespace

2020-10-10 Thread Andrei Vagin
On Fri, Oct 09, 2020 at 03:28:15PM +0200, Christian Brauner wrote:
> On Thu, Oct 08, 2020 at 07:39:42AM +0200, Michael Weiß wrote:
> > getboottime64() provides the time stamp of system boot. In case of
> > time namespaces, the offset to the boot time stamp was not applied
> > earlier. However, getboottime64 is used e.g., in /proc/stat to print
> > the system boot time to userspace. In container runtimes which utilize
> > time namespaces to virtualize boottime of a container, this leaks
> > information about the host system boot time.
> > 
> > Therefore, we make getboottime64() to respect the time namespace offset
> > for boottime by subtracting the boottime offset.
> > 
> > Signed-off-by: Michael Weiß 
> > ---
> >  kernel/time/timekeeping.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> > index 4c47f388a83f..67530cdb389e 100644
> > --- a/kernel/time/timekeeping.c
> > +++ b/kernel/time/timekeeping.c
> > @@ -17,6 +17,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -2154,6 +2155,8 @@ void getboottime64(struct timespec64 *ts)
> >  {
> > struct timekeeper *tk = _core.timekeeper;
> > ktime_t t = ktime_sub(tk->offs_real, tk->offs_boot);
> > +   /* shift boot time stamp according to the timens offset */
> > +   t = timens_ktime_to_host(CLOCK_BOOTTIME, t);
> 
> Note that getbootime64() is mostly used in net/sunrpc and I don't know
> if this change has any security implications for them.

I would prefer not to patch kernel-internal functions if they are used for
more than exposing time to userspace.

I think that when kernel developers see the getboottime64() function, they
will expect it to return the real time of the kernel boot. They will not
expect it to be aware of time namespaces, with the returned time depending
on the task context in which it is called.

IMHO, at a minimum, we need to update the documentation for this function or
even adjust the function name.

And I think we need to consider the option of not changing getboottime64()
and applying the timens offset right in show_stat() (fs/proc/stat.c).
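
For illustration only, a rough sketch (not a patch) of what that could look
like inside show_stat(); timens_add_boottime() is just an assumed helper
name here:

	getboottime64(&boottime);		/* real boot time, unchanged */
	timens_add_boottime(&boottime);		/* shift by the caller's timens offset */
	seq_put_decimal_ull(p, "\nbtime ", boottime.tv_sec);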

Thanks,
Andrei


Re: [RFC PATCH 0/5] Introduce /proc/all/ to gather stats from all processes

2020-08-20 Thread Andrei Vagin
On Fri, Aug 14, 2020 at 01:01:00AM +1000, Eugene Lubarsky wrote:
> On Wed, 12 Aug 2020 00:51:35 -0700
> Andrei Vagin  wrote:
> 
> > Maybe we need resurrect the task_diag series instead of inventing
> > another less-effective interface...
> 
> I would certainly welcome the resurrection of task_diag - it is clearly
> more efficient than this /proc/all/ idea. It would be good to find out
> if there's anything in particular that's currently blocking it.

Unfortunately, I don't have enough time to lead the process of pushing
task_diag upstream. So if you are interested, you can restart this process
and I am ready to help as much as time permits.

I think the main blocking issue was a lack of interest from a wider
audience. The slow procfs is a problem for only a few users, but task_diag
is a big subsystem that duplicates the functionality of another subsystem,
with all the problems that follow from that, such as code duplication.

Another blocking issue was the new interface. There was no consensus on it.
Initially, I suggested using netlink sockets, but developers from
non-network subsystems objected, so the transactional file interface was
introduced. The main idea, similar to netlink sockets, is that we write a
request and read a response.

There were some security concerns, but I think I fixed them.

> 
> This RFC is mainly meant to check whether such an addition would
> be acceptable from an API point of view. It currently has an obvious
> performance issue in that seq_file seems to only return one page at a
> time so lots of read syscalls are still required. However I may not
> have the time to figure out a proposed fix for this by myself.
> Regardless, text-based formats can't match the efficiency of task_diag,
> but binary ones are also possible.

I don't have objections to this series. It can be an option if we decide
that we don't want to do a major rework here.


Thanks,
Andrei


Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

2020-08-14 Thread Andrei Vagin
On Fri, Aug 14, 2020 at 06:11:58PM +0300, Kirill Tkhai wrote:
> On 14.08.2020 04:16, Andrei Vagin wrote:
> > On Thu, Aug 13, 2020 at 11:12:45AM +0300, Kirill Tkhai wrote:
> >> On 12.08.2020 20:53, Andrei Vagin wrote:
> >>> On Tue, Aug 11, 2020 at 01:23:35PM +0300, Kirill Tkhai wrote:
> >>>> On 10.08.2020 20:34, Andrei Vagin wrote:
> >>>>> On Fri, Aug 07, 2020 at 11:47:57AM +0300, Kirill Tkhai wrote:
> >>>>>> On 06.08.2020 11:05, Andrei Vagin wrote:
> >>>>>>> On Mon, Aug 03, 2020 at 01:03:17PM +0300, Kirill Tkhai wrote:
> >>>>>>>> On 31.07.2020 01:13, Eric W. Biederman wrote:
> >>>>>>>>> Kirill Tkhai  writes:
> >>>>>>>>>
> >>>>>>>>>> On 30.07.2020 17:34, Eric W. Biederman wrote:
> >>>>>>>>>>> Kirill Tkhai  writes:
> >>>>>>>>>>>
> >>>>>>>>>>>> Currently, there is no a way to list or iterate all or subset of 
> >>>>>>>>>>>> namespaces
> >>>>>>>>>>>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ 
> >>>>>>>>>>>> directories,
> >>>>>>>>>>>> but some also may be as open files, which are not attached to a 
> >>>>>>>>>>>> process.
> >>>>>>>>>>>> When a namespace open fd is sent over unix socket and then 
> >>>>>>>>>>>> closed, it is
> >>>>>>>>>>>> impossible to know whether the namespace exists or not.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Also, even if namespace is exposed as attached to a process or 
> >>>>>>>>>>>> as open file,
> >>>>>>>>>>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not 
> >>>>>>>>>>>> fast, because
> >>>>>>>>>>>> this multiplies at tasks and fds number.
> >>>>>>>>>>>
> >>>>>>>>>>> I am very dubious about this.
> >>>>>>>>>>>
> >>>>>>>>>>> I have been avoiding exactly this kind of interface because it can
> >>>>>>>>>>> create rather fundamental problems with checkpoint restart.
> >>>>>>>>>>
> >>>>>>>>>> restart/restore :)
> >>>>>>>>>>
> >>>>>>>>>>> You do have some filtering and the filtering is not based on 
> >>>>>>>>>>> current.
> >>>>>>>>>>> Which is good.
> >>>>>>>>>>>
> >>>>>>>>>>> A view that is relative to a user namespace might be ok.It 
> >>>>>>>>>>> almost
> >>>>>>>>>>> certainly does better as it's own little filesystem than as an 
> >>>>>>>>>>> extension
> >>>>>>>>>>> to proc though.
> >>>>>>>>>>>
> >>>>>>>>>>> The big thing we want to ensure is that if you migrate you can 
> >>>>>>>>>>> restore
> >>>>>>>>>>> everything.  I don't see how you will be able to restore these 
> >>>>>>>>>>> files
> >>>>>>>>>>> after migration.  Anything like this without having a complete
> >>>>>>>>>>> checkpoint/restore story is a non-starter.
> >>>>>>>>>>
> >>>>>>>>>> There is no difference between files in /proc/namespaces/ 
> >>>>>>>>>> directory and /proc/[pid]/ns/.
> >>>>>>>>>>
> >>>>>>>>>> CRIU can restore open files in /proc/[pid]/ns, the same will be 
> >>>>>>>>>> with /proc/namespaces/ files.
> >>>>>>>>>> As a person who worked deeply for pid_ns and user_ns support in 
> >>>>>>>>>> CRIU, I don't see any
> >>>>>>>>>> problem here.
> >>>>>>>>

Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

2020-08-13 Thread Andrei Vagin
On Thu, Aug 13, 2020 at 11:12:45AM +0300, Kirill Tkhai wrote:
> On 12.08.2020 20:53, Andrei Vagin wrote:
> > On Tue, Aug 11, 2020 at 01:23:35PM +0300, Kirill Tkhai wrote:
> >> On 10.08.2020 20:34, Andrei Vagin wrote:
> >>> On Fri, Aug 07, 2020 at 11:47:57AM +0300, Kirill Tkhai wrote:
> >>>> On 06.08.2020 11:05, Andrei Vagin wrote:
> >>>>> On Mon, Aug 03, 2020 at 01:03:17PM +0300, Kirill Tkhai wrote:
> >>>>>> On 31.07.2020 01:13, Eric W. Biederman wrote:
> >>>>>>> Kirill Tkhai  writes:
> >>>>>>>
> >>>>>>>> On 30.07.2020 17:34, Eric W. Biederman wrote:
> >>>>>>>>> Kirill Tkhai  writes:
> >>>>>>>>>
> >>>>>>>>>> Currently, there is no a way to list or iterate all or subset of 
> >>>>>>>>>> namespaces
> >>>>>>>>>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ 
> >>>>>>>>>> directories,
> >>>>>>>>>> but some also may be as open files, which are not attached to a 
> >>>>>>>>>> process.
> >>>>>>>>>> When a namespace open fd is sent over unix socket and then closed, 
> >>>>>>>>>> it is
> >>>>>>>>>> impossible to know whether the namespace exists or not.
> >>>>>>>>>>
> >>>>>>>>>> Also, even if namespace is exposed as attached to a process or as 
> >>>>>>>>>> open file,
> >>>>>>>>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not 
> >>>>>>>>>> fast, because
> >>>>>>>>>> this multiplies at tasks and fds number.
> >>>>>>>>>
> >>>>>>>>> I am very dubious about this.
> >>>>>>>>>
> >>>>>>>>> I have been avoiding exactly this kind of interface because it can
> >>>>>>>>> create rather fundamental problems with checkpoint restart.
> >>>>>>>>
> >>>>>>>> restart/restore :)
> >>>>>>>>
> >>>>>>>>> You do have some filtering and the filtering is not based on 
> >>>>>>>>> current.
> >>>>>>>>> Which is good.
> >>>>>>>>>
> >>>>>>>>> A view that is relative to a user namespace might be ok.It 
> >>>>>>>>> almost
> >>>>>>>>> certainly does better as it's own little filesystem than as an 
> >>>>>>>>> extension
> >>>>>>>>> to proc though.
> >>>>>>>>>
> >>>>>>>>> The big thing we want to ensure is that if you migrate you can 
> >>>>>>>>> restore
> >>>>>>>>> everything.  I don't see how you will be able to restore these files
> >>>>>>>>> after migration.  Anything like this without having a complete
> >>>>>>>>> checkpoint/restore story is a non-starter.
> >>>>>>>>
> >>>>>>>> There is no difference between files in /proc/namespaces/ directory 
> >>>>>>>> and /proc/[pid]/ns/.
> >>>>>>>>
> >>>>>>>> CRIU can restore open files in /proc/[pid]/ns, the same will be with 
> >>>>>>>> /proc/namespaces/ files.
> >>>>>>>> As a person who worked deeply for pid_ns and user_ns support in 
> >>>>>>>> CRIU, I don't see any
> >>>>>>>> problem here.
> >>>>>>>
> >>>>>>> An obvious diffference is that you are adding the inode to the inode 
> >>>>>>> to
> >>>>>>> the file name.  Which means that now you really do have to preserve 
> >>>>>>> the
> >>>>>>> inode numbers during process migration.
> >>>>>>>
> >>>>>>> Which means now we have to do all of the work to make inode number
> >>>>>>> restoration possible.  Which means now we need to have multiple
> >>>>>>> instan

Re: [RFC PATCH 0/5] Introduce /proc/all/ to gather stats from all processes

2020-08-13 Thread Andrei Vagin
On Wed, Aug 12, 2020 at 10:47:32PM -0600, David Ahern wrote:
> On 8/12/20 1:51 AM, Andrei Vagin wrote:
> > 
> > I rebased the task_diag patches on top of v5.8:
> > https://github.com/avagin/linux-task-diag/tree/v5.8-task-diag
> 
> Thanks for updating the patches.
> 
> > 
> > /proc/pid files have three major limitations:
> > * Requires at least three syscalls per process per file
> >   open(), read(), close()
> > * Variety of formats, mostly text based
> >   The kernel spent time to encode binary data into a text format and
> >   then tools like top and ps spent time to decode them back to a binary
> >   format.
> > * Sometimes slow due to extra attributes
> >   For example, /proc/PID/smaps contains a lot of useful informations
> >   about memory mappings and memory consumption for each of them. But
> >   even if we don't need memory consumption fields, the kernel will
> >   spend time to collect this information.
> 
> that's what I recall as well.
> 
> > 
> > More details and numbers are in this article:
> > https://avagin.github.io/how-fast-is-procfs
> > 
> > This new interface doesn't have only one of these limitations, but
> > task_diag doesn't have all of them.
> > 
> > And I compared how fast each of these interfaces:
> > 
> > The test environment:
> > CPU: Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz
> > RAM: 16GB
> > kernel: v5.8 with task_diag and /proc/all patches.
> > 100K processes:
> > $ ps ax | wc -l
> > 10228
> 
> 100k processes but showing 10k here??

I'm sure that one zero was lost from here. task_proc_all also shows the
number of tasks, and it shows 100230.

> 
> > 
> > $ time cat /proc/all/status > /dev/null
> > 
> > real0m0.577s
> > user0m0.017s
> > sys 0m0.559s
> > 
> > task_proc_all is used to read /proc/pid/status for all tasks:
> > https://github.com/avagin/linux-task-diag/blob/master/tools/testing/selftests/task_diag/task_proc_all.c
> > 
> > $ time ./task_proc_all status
> > tasks: 100230
> > 

Thanks,
Andrei


Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

2020-08-12 Thread Andrei Vagin
On Tue, Aug 11, 2020 at 01:23:35PM +0300, Kirill Tkhai wrote:
> On 10.08.2020 20:34, Andrei Vagin wrote:
> > On Fri, Aug 07, 2020 at 11:47:57AM +0300, Kirill Tkhai wrote:
> >> On 06.08.2020 11:05, Andrei Vagin wrote:
> >>> On Mon, Aug 03, 2020 at 01:03:17PM +0300, Kirill Tkhai wrote:
> >>>> On 31.07.2020 01:13, Eric W. Biederman wrote:
> >>>>> Kirill Tkhai  writes:
> >>>>>
> >>>>>> On 30.07.2020 17:34, Eric W. Biederman wrote:
> >>>>>>> Kirill Tkhai  writes:
> >>>>>>>
> >>>>>>>> Currently, there is no a way to list or iterate all or subset of 
> >>>>>>>> namespaces
> >>>>>>>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ 
> >>>>>>>> directories,
> >>>>>>>> but some also may be as open files, which are not attached to a 
> >>>>>>>> process.
> >>>>>>>> When a namespace open fd is sent over unix socket and then closed, 
> >>>>>>>> it is
> >>>>>>>> impossible to know whether the namespace exists or not.
> >>>>>>>>
> >>>>>>>> Also, even if namespace is exposed as attached to a process or as 
> >>>>>>>> open file,
> >>>>>>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, 
> >>>>>>>> because
> >>>>>>>> this multiplies at tasks and fds number.
> >>>>>>>
> >>>>>>> I am very dubious about this.
> >>>>>>>
> >>>>>>> I have been avoiding exactly this kind of interface because it can
> >>>>>>> create rather fundamental problems with checkpoint restart.
> >>>>>>
> >>>>>> restart/restore :)
> >>>>>>
> >>>>>>> You do have some filtering and the filtering is not based on current.
> >>>>>>> Which is good.
> >>>>>>>
> >>>>>>> A view that is relative to a user namespace might be ok.It almost
> >>>>>>> certainly does better as it's own little filesystem than as an 
> >>>>>>> extension
> >>>>>>> to proc though.
> >>>>>>>
> >>>>>>> The big thing we want to ensure is that if you migrate you can restore
> >>>>>>> everything.  I don't see how you will be able to restore these files
> >>>>>>> after migration.  Anything like this without having a complete
> >>>>>>> checkpoint/restore story is a non-starter.
> >>>>>>
> >>>>>> There is no difference between files in /proc/namespaces/ directory 
> >>>>>> and /proc/[pid]/ns/.
> >>>>>>
> >>>>>> CRIU can restore open files in /proc/[pid]/ns, the same will be with 
> >>>>>> /proc/namespaces/ files.
> >>>>>> As a person who worked deeply for pid_ns and user_ns support in CRIU, 
> >>>>>> I don't see any
> >>>>>> problem here.
> >>>>>
> >>>>> An obvious diffference is that you are adding the inode to the inode to
> >>>>> the file name.  Which means that now you really do have to preserve the
> >>>>> inode numbers during process migration.
> >>>>>
> >>>>> Which means now we have to do all of the work to make inode number
> >>>>> restoration possible.  Which means now we need to have multiple
> >>>>> instances of nsfs so that we can restore inode numbers.
> >>>>>
> >>>>> I think this is still possible but we have been delaying figuring out
> >>>>> how to restore inode numbers long enough that may be actual technical
> >>>>> problems making it happen.
> >>>>
> >>>> Yeah, this matters. But it looks like here is not a dead end. We just 
> >>>> need
> >>>> change the names the namespaces are exported to particular fs and to 
> >>>> support
> >>>> rename().
> >>>>
> >>>> Before introduction a principally new filesystem type for this, can't
> >>>> this be solved in current /proc?
> >>>
> >>> do you m

Re: [RFC PATCH 0/5] Introduce /proc/all/ to gather stats from all processes

2020-08-12 Thread Andrei Vagin
On Tue, Aug 11, 2020 at 12:58:47AM +1000, Eugene Lubarsky wrote:
> This is an idea for substantially reducing the number of syscalls needed
> by monitoring tools whilst mostly re-using the existing API.
> 
> The proposed files in this proof-of-concept patch set are:
> 
> * /proc/all/stat
>   A stat line for each process in the existing format.
> 
> * /proc/all/statm
>   statm lines but starting with a PID column.
> 
> * /proc/all/status
>   status info for all processes in the existing format.
> 
> * /proc/all/io
>   The existing /proc/pid/io data but formatted as a single line for
>   each process, similarly to stat/statm, with a PID column added.
> 
> * /proc/all/statx
>   Gathers info from stat, statm and io; the purpose is actually
>   not so much to reduce syscalls but to help userspace be more
>   efficient by not having to store data in e.g. hashtables in order
>   to gather it from separate /proc/all/ files.
> 
>   The format proposed here starts with the unchanged stat line
>   and begins the other info with a few characters, repeating for
>   each process:
> 
>   ...
>   25 (cat) R 1 1 0 0 -1 4194304 185 0 16 0 2 0 0 0 20 ...
>   m 662 188 167 5 0 112 0
>   io 4292 0 12 0 0 0 0
>   ...
> 
> 
> There has been a proposal with some overlapping goals: /proc/task-diag
> (https://github.com/avagin/linux-task-diag), but I'm not sure about
> its current status.

I rebased the task_diag patches on top of v5.8:
https://github.com/avagin/linux-task-diag/tree/v5.8-task-diag

/proc/pid files have three major limitations:
* They require at least three syscalls per process per file:
  open(), read(), close() (a minimal sketch of this per-pid pattern is
  shown right after this list).
* A variety of formats, mostly text based.
  The kernel spends time encoding binary data into a text format, and
  then tools like top and ps spend time decoding it back into a binary
  format.
* Sometimes slow due to extra attributes.
  For example, /proc/PID/smaps contains a lot of useful information
  about memory mappings and the memory consumption of each of them. But
  even if we don't need the memory consumption fields, the kernel still
  spends time collecting this information.
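
To make the first point concrete, here is a minimal sketch of that per-pid
pattern (an illustration only, not the task_proc_all tool used below):

#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        char path[64], buf[4096];
        struct dirent *de;
        DIR *proc = opendir("/proc");

        if (!proc)
                return 1;
        while ((de = readdir(proc)) != NULL) {
                int fd;

                if (de->d_name[0] < '1' || de->d_name[0] > '9')
                        continue;               /* skip non-pid entries */
                snprintf(path, sizeof(path), "/proc/%s/status", de->d_name);
                fd = open(path, O_RDONLY);      /* syscall 1 */
                if (fd < 0)
                        continue;               /* the task may have exited */
                read(fd, buf, sizeof(buf));     /* syscall 2: status fits in one read */
                close(fd);                      /* syscall 3 */
        }
        closedir(proc);
        return 0;
}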

More details and numbers are in this article:
https://avagin.github.io/how-fast-is-procfs

This new interface removes only one of these limitations, while task_diag
removes all of them.

And I compared how fast each of these interfaces is:

The test environment:
CPU: Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz
RAM: 16GB
kernel: v5.8 with task_diag and /proc/all patches.
100K processes:
$ ps ax | wc -l
10228

$ time cat /proc/all/status > /dev/null

real0m0.577s
user0m0.017s
sys 0m0.559s

task_proc_all is used to read /proc/pid/status for all tasks:
https://github.com/avagin/linux-task-diag/blob/master/tools/testing/selftests/task_diag/task_proc_all.c

$ time ./task_proc_all status
tasks: 100230

real0m0.924s
user0m0.054s
sys 0m0.858s


/proc/all/status is about 40% faster than /proc/*/status.

Now let's take a look at the perf output:

$ time perf record -g cat /proc/all/status > /dev/null
$ perf report
-   98.08% 1.38%  cat  [kernel.vmlinux]  [k] entry_SYSCALL_64
   - 96.70% entry_SYSCALL_64
  - do_syscall_64
 - 94.97% ksys_read
- 94.80% vfs_read
   - 94.58% proc_reg_read
  - seq_read
 - 87.95% proc_pid_status
+ 13.10% seq_put_decimal_ull_width
- 11.69% task_mem
   + 9.48% seq_put_decimal_ull_width
+ 10.63% seq_printf
- 10.35% cpuset_task_status_allowed
   + seq_printf
- 9.84% render_sigset_t
 1.61% seq_putc
   + 1.61% seq_puts
+ 4.99% proc_task_name
+ 4.11% seq_puts
- 3.76% render_cap_t
 2.38% seq_put_hex_ll
   + 1.25% seq_puts
  2.64% __task_pid_nr_ns
+ 1.54% get_task_mm
+ 1.34% __lock_task_sighand
+ 0.70% from_kuid_munged
  0.61% get_task_cred
  0.56% seq_putc
  0.52% hugetlb_report_usage
  0.52% from_kgid_munged
 + 4.30% proc_all_next
 + 0.82% _copy_to_user 

We can see that the kernel spends more than 50% of the time encoding binary
data into a text format.

Now let's see how fast task_diag is:

$ time ./task_diag_all all -c -q

real0m0.087s
user0m0.001s
sys 0m0.082s

Maybe we need to resurrect the task_diag series instead of inventing
another, less effective interface...

Thanks,
Andrei

> 
> 
> 
> Best Wishes,
> 
> Eugene
> 
> 
> Eugene Lubarsky (5):
>   fs/proc: Introduce /proc/all/stat
>   

Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

2020-08-10 Thread Andrei Vagin
On Fri, Aug 07, 2020 at 11:47:57AM +0300, Kirill Tkhai wrote:
> On 06.08.2020 11:05, Andrei Vagin wrote:
> > On Mon, Aug 03, 2020 at 01:03:17PM +0300, Kirill Tkhai wrote:
> >> On 31.07.2020 01:13, Eric W. Biederman wrote:
> >>> Kirill Tkhai  writes:
> >>>
> >>>> On 30.07.2020 17:34, Eric W. Biederman wrote:
> >>>>> Kirill Tkhai  writes:
> >>>>>
> >>>>>> Currently, there is no a way to list or iterate all or subset of 
> >>>>>> namespaces
> >>>>>> in the system. Some namespaces are exposed in /proc/[pid]/ns/ 
> >>>>>> directories,
> >>>>>> but some also may be as open files, which are not attached to a 
> >>>>>> process.
> >>>>>> When a namespace open fd is sent over unix socket and then closed, it 
> >>>>>> is
> >>>>>> impossible to know whether the namespace exists or not.
> >>>>>>
> >>>>>> Also, even if namespace is exposed as attached to a process or as open 
> >>>>>> file,
> >>>>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, 
> >>>>>> because
> >>>>>> this multiplies at tasks and fds number.
> >>>>>
> >>>>> I am very dubious about this.
> >>>>>
> >>>>> I have been avoiding exactly this kind of interface because it can
> >>>>> create rather fundamental problems with checkpoint restart.
> >>>>
> >>>> restart/restore :)
> >>>>
> >>>>> You do have some filtering and the filtering is not based on current.
> >>>>> Which is good.
> >>>>>
> >>>>> A view that is relative to a user namespace might be ok.It almost
> >>>>> certainly does better as it's own little filesystem than as an extension
> >>>>> to proc though.
> >>>>>
> >>>>> The big thing we want to ensure is that if you migrate you can restore
> >>>>> everything.  I don't see how you will be able to restore these files
> >>>>> after migration.  Anything like this without having a complete
> >>>>> checkpoint/restore story is a non-starter.
> >>>>
> >>>> There is no difference between files in /proc/namespaces/ directory and 
> >>>> /proc/[pid]/ns/.
> >>>>
> >>>> CRIU can restore open files in /proc/[pid]/ns, the same will be with 
> >>>> /proc/namespaces/ files.
> >>>> As a person who worked deeply for pid_ns and user_ns support in CRIU, I 
> >>>> don't see any
> >>>> problem here.
> >>>
> >>> An obvious diffference is that you are adding the inode to the inode to
> >>> the file name.  Which means that now you really do have to preserve the
> >>> inode numbers during process migration.
> >>>
> >>> Which means now we have to do all of the work to make inode number
> >>> restoration possible.  Which means now we need to have multiple
> >>> instances of nsfs so that we can restore inode numbers.
> >>>
> >>> I think this is still possible but we have been delaying figuring out
> >>> how to restore inode numbers long enough that may be actual technical
> >>> problems making it happen.
> >>
> >> Yeah, this matters. But it looks like here is not a dead end. We just need
> >> change the names the namespaces are exported to particular fs and to 
> >> support
> >> rename().
> >>
> >> Before introduction a principally new filesystem type for this, can't
> >> this be solved in current /proc?
> > 
> > do you mean to introduce names for namespaces which users will be able
> > to change? By default, this can be uuid.
> 
> Yes, I mean this.
> 
> Currently I won't give a final answer about UUID, but I planned to show some
> default names, which based on namespace type and inode num. Completely custom
> names for any /proc by default will waste too much memory.
> 
> So, I think the good way will be:
> 
> 1)Introduce a function, which returns a hash/uuid based on ino, ns type and 
> some static
> random seed, which is generated on boot;
> 
> 2)Use the hash/uuid as default names in newly create /proc/namespaces: 
> pid-{hash/uuid(ino, "pid")}
> 
> 3)Allow rename, and allocate space only for rena

Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

2020-08-06 Thread Andrei Vagin
On Mon, Aug 03, 2020 at 01:03:17PM +0300, Kirill Tkhai wrote:
> On 31.07.2020 01:13, Eric W. Biederman wrote:
> > Kirill Tkhai  writes:
> > 
> >> On 30.07.2020 17:34, Eric W. Biederman wrote:
> >>> Kirill Tkhai  writes:
> >>>
>  Currently, there is no a way to list or iterate all or subset of 
>  namespaces
>  in the system. Some namespaces are exposed in /proc/[pid]/ns/ 
>  directories,
>  but some also may be as open files, which are not attached to a process.
>  When a namespace open fd is sent over unix socket and then closed, it is
>  impossible to know whether the namespace exists or not.
> 
>  Also, even if namespace is exposed as attached to a process or as open 
>  file,
>  iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, 
>  because
>  this multiplies at tasks and fds number.
> >>>
> >>> I am very dubious about this.
> >>>
> >>> I have been avoiding exactly this kind of interface because it can
> >>> create rather fundamental problems with checkpoint restart.
> >>
> >> restart/restore :)
> >>
> >>> You do have some filtering and the filtering is not based on current.
> >>> Which is good.
> >>>
> >>> A view that is relative to a user namespace might be ok.It almost
> >>> certainly does better as it's own little filesystem than as an extension
> >>> to proc though.
> >>>
> >>> The big thing we want to ensure is that if you migrate you can restore
> >>> everything.  I don't see how you will be able to restore these files
> >>> after migration.  Anything like this without having a complete
> >>> checkpoint/restore story is a non-starter.
> >>
> >> There is no difference between files in /proc/namespaces/ directory and 
> >> /proc/[pid]/ns/.
> >>
> >> CRIU can restore open files in /proc/[pid]/ns, the same will be with 
> >> /proc/namespaces/ files.
> >> As a person who worked deeply for pid_ns and user_ns support in CRIU, I 
> >> don't see any
> >> problem here.
> > 
> > An obvious diffference is that you are adding the inode to the inode to
> > the file name.  Which means that now you really do have to preserve the
> > inode numbers during process migration.
> >
> > Which means now we have to do all of the work to make inode number
> > restoration possible.  Which means now we need to have multiple
> > instances of nsfs so that we can restore inode numbers.
> > 
> > I think this is still possible but we have been delaying figuring out
> > how to restore inode numbers long enough that may be actual technical
> > problems making it happen.
> 
> Yeah, this matters. But it looks like here is not a dead end. We just need
> change the names the namespaces are exported to particular fs and to support
> rename().
> 
> Before introduction a principally new filesystem type for this, can't
> this be solved in current /proc?

Do you mean introducing names for namespaces that users will be able to
change? By default, this could be a UUID.

And I have a suggestion about the structure of /proc/namespaces/.

Each namespace is owned by one of the user namespaces. Maybe it makes sense
to group namespaces by their owning user namespaces?

/proc/namespaces/
    user
    mnt-X
    mnt-Y
    pid-X
    uts-Z
    user-X/
        user
        mnt-A
        mnt-B
        user-C
        user-C/
            user
    user-Y/
        user

Are we trying to invent a cgroupfs for namespaces?

Thanks,
Andrei


Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary

2020-08-03 Thread Andrei Vagin
On Thu, Jul 30, 2020 at 06:01:20PM +0300, Kirill Tkhai wrote:
> On 30.07.2020 17:34, Eric W. Biederman wrote:
> > Kirill Tkhai  writes:
> > 
> >> Currently, there is no a way to list or iterate all or subset of namespaces
> >> in the system. Some namespaces are exposed in /proc/[pid]/ns/ directories,
> >> but some also may be as open files, which are not attached to a process.
> >> When a namespace open fd is sent over unix socket and then closed, it is
> >> impossible to know whether the namespace exists or not.
> >>
> >> Also, even if namespace is exposed as attached to a process or as open 
> >> file,
> >> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
> >> this multiplies at tasks and fds number.

Could you describe in more detail when you need to iterate over
namespaces?

There are three ways to hold namespaces:

* processes
* bind mounts
* file descriptors

When CRIU dumps a container, it enumerates all processes and collects their
file descriptors and mounts. This means that we will be able to collect all
namespaces, doesn't it?
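
As an illustration of the process side of that enumeration, a rough sketch
that relies only on the existing /proc/[pid]/ns/ symlinks (bind mounts and
open fds would be walked separately via /proc/[pid]/mountinfo and
/proc/[pid]/fd/):

#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

static void dump_task_namespaces(const char *pid)
{
        static const char * const types[] = {
                "mnt", "uts", "ipc", "net", "pid", "user", "cgroup", "time",
        };
        char path[PATH_MAX], target[PATH_MAX];
        unsigned int i;

        for (i = 0; i < sizeof(types) / sizeof(types[0]); i++) {
                ssize_t n;

                snprintf(path, sizeof(path), "/proc/%s/ns/%s", pid, types[i]);
                n = readlink(path, target, sizeof(target) - 1);
                if (n < 0)
                        continue;               /* ns type not supported or task gone */
                target[n] = '\0';
                printf("%s %s\n", pid, target); /* e.g. "1234 net:[4026531992]" */
        }
}

int main(void)
{
        struct dirent *de;
        DIR *proc = opendir("/proc");

        if (!proc)
                return 1;
        while ((de = readdir(proc)) != NULL)
                if (de->d_name[0] >= '1' && de->d_name[0] <= '9')
                        dump_task_namespaces(de->d_name);
        closedir(proc);
        return 0;
}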


Re: [PATCH 1/6] arm64/vdso: use the fault callback to map vvar pages

2020-07-27 Thread Andrei Vagin
On Fri, Jul 24, 2020 at 06:26:23PM +0100, Catalin Marinas wrote:
> On Wed, 24 Jun 2020 01:33:16 -0700, Andrei Vagin wrote:
> > Currently the vdso has no awareness of time namespaces, which may
> > apply distinct offsets to processes in different namespaces. To handle
> > this within the vdso, we'll need to expose a per-namespace data page.
> > 
> > As a preparatory step, this patch separates the vdso data page from
> > the code pages, and has it faulted in via its own fault callback.
> > Subsquent patches will extend this to support distinct pages per time
> > namespace.
> > 
> > [...]
> 
> Applied to arm64 (for-next/timens), provisionally.
> 
> One potential issue I did not check is the compat vDSO. The arm32 port
> does not support timens currently. IIUC, with these patches and
> COMPAT_VDSO enabled, it will allow timens for compat processes. Normally
> I'd like the arm32 support first before updating compat but I don't
> think there would be any interface incompatibility here.
> 
> However, does this still work for arm32 processes if COMPAT_VDSO is
> disabled in the arm64 kernel?

Yes, it does. I checked that the timens test passes with and without
COMPAT_VDSO:

[avagin@laptop linux]$ git describe HEAD
v5.8-rc3-6-g9614cc576d76

alpine:/tip/tools/testing/selftests/timens# readelf  -h ./timens
ELF Header:
  Magic:   7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 
  Class: ELF32
  Data:  2's complement, little endian
  Version:   1 (current)
  OS/ABI:UNIX - System V
  ABI Version:   0
  Type:  DYN (Shared object file)
  Machine:   ARM
  Version:   0x1
  Entry point address:   0x711
  Start of program headers:  52 (bytes into file)
  Start of section headers:  15444 (bytes into file)
  Flags: 0x5000400, Version5 EABI,
hard-float ABI
  Size of this header:   52 (bytes)
  Size of program headers:   32 (bytes)
  Number of program headers: 7
  Size of section headers:   40 (bytes)
  Number of section headers: 32
  Section header string table index: 31

alpine:/tip/tools/testing/selftests/timens# uname -a
Linux arm64-alpine 5.8.0-rc3+ #100 SMP Sun Jul 26 23:21:07 PDT 2020
aarch64 Linux


[avagin@laptop linux]$ cat  .config | grep VDSO
CONFIG_COMPAT_VDSO=y
CONFIG_THUMB2_COMPAT_VDSO=y
CONFIG_HAVE_GENERIC_VDSO=y
CONFIG_GENERIC_COMPAT_VDSO=y
CONFIG_GENERIC_VDSO_TIME_NS=y


alpine:/tip/tools/testing/selftests/timens# ./timens
1..10
ok 1 Passed for CLOCK_BOOTTIME (syscall)
ok 2 Passed for CLOCK_BOOTTIME (vdso)
not ok 3 # SKIP CLOCK_BOOTTIME_ALARM isn't supported
not ok 4 # SKIP CLOCK_BOOTTIME_ALARM isn't supported
ok 5 Passed for CLOCK_MONOTONIC (syscall)
ok 6 Passed for CLOCK_MONOTONIC (vdso)
ok 7 Passed for CLOCK_MONOTONIC_COARSE (syscall)
ok 8 Passed for CLOCK_MONOTONIC_COARSE (vdso)
ok 9 Passed for CLOCK_MONOTONIC_RAW (syscall)
ok 10 Passed for CLOCK_MONOTONIC_RAW (vdso)
# Pass 8 Fail 0 Xfail 0 Xpass 0 Skip 2 Error 0


[avagin@laptop linux]$ cat  .config | grep VDSO
# CONFIG_COMPAT_VDSO is not set
CONFIG_HAVE_GENERIC_VDSO=y
CONFIG_GENERIC_VDSO_TIME_NS=y

alpine:/tip/tools/testing/selftests/timens# ./timens
1..10
ok 1 Passed for CLOCK_BOOTTIME (syscall)
ok 2 Passed for CLOCK_BOOTTIME (vdso)
not ok 3 # SKIP CLOCK_BOOTTIME_ALARM isn't supported
not ok 4 # SKIP CLOCK_BOOTTIME_ALARM isn't supported
ok 5 Passed for CLOCK_MONOTONIC (syscall)
ok 6 Passed for CLOCK_MONOTONIC (vdso)
ok 7 Passed for CLOCK_MONOTONIC_COARSE (syscall)
ok 8 Passed for CLOCK_MONOTONIC_COARSE (vdso)
ok 9 Passed for CLOCK_MONOTONIC_RAW (syscall)
ok 10 Passed for CLOCK_MONOTONIC_RAW (vdso)
# Pass 8 Fail 0 Xfail 0 Xpass 0 Skip 2 Error 0

Thanks,
Andrei


Re: [PATCH v5 0/6] arm64: add the time namespace support

2020-07-23 Thread Andrei Vagin
On Wed, Jul 22, 2020 at 07:15:06PM +0100, Catalin Marinas wrote:
> On Mon, Jul 13, 2020 at 06:57:43PM -0700, Andrei Vagin wrote:
> > On Sat, Jul 04, 2020 at 11:40:55PM -0700, Andrei Vagin wrote:
> > > On Wed, Jun 24, 2020 at 01:33:15AM -0700, Andrei Vagin wrote:
> > > > Allocate the time namespace page among VVAR pages and add the logic
> > > > to handle faults on VVAR properly.
> > > > 
> > > > If a task belongs to a time namespace then the VVAR page which contains
> > > > the system wide VDSO data is replaced with a namespace specific page
> > > > which has the same layout as the VVAR page. That page has vdso_data->seq
> > > > set to 1 to enforce the slow path and vdso_data->clock_mode set to
> > > > VCLOCK_TIMENS to enforce the time namespace handling path.
> > > > 
> > > > The extra check in the case that vdso_data->seq is odd, e.g. a 
> > > > concurrent
> > > > update of the VDSO data is in progress, is not really affecting regular
> > > > tasks which are not part of a time namespace as the task is spin waiting
> > > > for the update to finish and vdso_data->seq to become even again.
> > > > 
> > > > If a time namespace task hits that code path, it invokes the 
> > > > corresponding
> > > > time getter function which retrieves the real VVAR page, reads host time
> > > > and then adds the offset for the requested clock which is stored in the
> > > > special VVAR page.
> > > > 
> > > 
> > > > v2: Code cleanups suggested by Vincenzo.
> > > > v3: add a comment in __arch_get_timens_vdso_data.
> > > > v4: - fix an issue reported by the lkp robot.
> > > > - vvar has the same size with/without CONFIG_TIME_NAMESPACE, but the
> > > >   timens page isn't allocated on !CONFIG_TIME_NAMESPACE. This
> > > >   simplifies criu/vdso migration between different kernel configs.
> > > > v5: - Code cleanups suggested by Mark Rutland.
> > > > - In vdso_join_timens, mmap_write_lock is downgraded to
> > > >   mmap_read_lock. The VMA list isn't changed there, zap_page_range
> > > >   doesn't require mmap_write_lock.
> > > > 
> > > > Reviewed-by: Vincenzo Frascino 
> > > > Reviewed-by: Dmitry Safonov 
> > > 
> > > Hello Will and Catalin,
> > > 
> > > Have you had a chance to look at this patch set? I think it is ready to be
> > > merged. Let me know if you have any questions.
> > 
> > *friendly ping*
> > 
> > If I am doing something wrong, let me know.
> 
> Not really, just haven't got around to looking into it. Mark Rutland
> raised a concern (in private) about the safety of multithreaded apps
> but I think you already replied that timens_install() checks for this
> already [1].
> 
> Maybe a similar atomicity issue to the one raised by Mark but for
> single-threaded processes: the thread is executing vdso code, gets
> interrupted and a signal handler invokes setns(). Would resuming the
> execution in the vdso code on sigreturn cause any issues?

It will not cause any issues in the kernel. In userspace, clock_gettime()
can return a clock value with an inconsistent offset if a process switches
between two non-root namespaces. And it can trigger SIGSEGV if it switches
from a non-root to the root time namespace, because the timens page isn't
mapped in the root time namespace.

I don't think we need to handle this case in the kernel. Users must
understand what they are doing and write their code to avoid these sorts of
situations. In general, I would say that in most cases it is a bad idea to
call setns() from a signal handler.
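
For illustration only, the sort of pattern I mean; this is a sketch of what
not to do, and the exact failure mode will depend on the environment (here
/proc/1/ns/time just stands in for some other task's time namespace and
assumes enough privileges):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <time.h>
#include <unistd.h>

#ifndef CLONE_NEWTIME
#define CLONE_NEWTIME 0x00000080
#endif

static int ns_fd;

static void handler(int sig)
{
        (void)sig;
        setns(ns_fd, CLONE_NEWTIME);    /* switching timens under the vDSO reader */
}

int main(void)
{
        struct timespec ts;

        ns_fd = open("/proc/1/ns/time", O_RDONLY);
        signal(SIGALRM, handler);
        alarm(1);
        for (;;)
                clock_gettime(CLOCK_MONOTONIC, &ts);    /* vDSO fast path */
}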

Thanks,
Andrei

> 
> [1] 
> https://lore.kernel.org/linux-arm-kernel/d16b5cd1-bdb1-5667-fbda-c622cc795...@arista.com/
> 
> -- 
> Catalin


Re: [PATCH v5 0/6] arm64: add the time namespace support

2020-07-13 Thread Andrei Vagin
On Sat, Jul 04, 2020 at 11:40:55PM -0700, Andrei Vagin wrote:
> On Wed, Jun 24, 2020 at 01:33:15AM -0700, Andrei Vagin wrote:
> > Allocate the time namespace page among VVAR pages and add the logic
> > to handle faults on VVAR properly.
> > 
> > If a task belongs to a time namespace then the VVAR page which contains
> > the system wide VDSO data is replaced with a namespace specific page
> > which has the same layout as the VVAR page. That page has vdso_data->seq
> > set to 1 to enforce the slow path and vdso_data->clock_mode set to
> > VCLOCK_TIMENS to enforce the time namespace handling path.
> > 
> > The extra check in the case that vdso_data->seq is odd, e.g. a concurrent
> > update of the VDSO data is in progress, is not really affecting regular
> > tasks which are not part of a time namespace as the task is spin waiting
> > for the update to finish and vdso_data->seq to become even again.
> > 
> > If a time namespace task hits that code path, it invokes the corresponding
> > time getter function which retrieves the real VVAR page, reads host time
> > and then adds the offset for the requested clock which is stored in the
> > special VVAR page.
> > 
> 
> > v2: Code cleanups suggested by Vincenzo.
> > v3: add a comment in __arch_get_timens_vdso_data.
> > v4: - fix an issue reported by the lkp robot.
> > - vvar has the same size with/without CONFIG_TIME_NAMESPACE, but the
> >   timens page isn't allocated on !CONFIG_TIME_NAMESPACE. This
> >   simplifies criu/vdso migration between different kernel configs.
> > v5: - Code cleanups suggested by Mark Rutland.
> > - In vdso_join_timens, mmap_write_lock is downgraded to
> >   mmap_read_lock. The VMA list isn't changed there, zap_page_range
> >   doesn't require mmap_write_lock.
> > 
> > Reviewed-by: Vincenzo Frascino 
> > Reviewed-by: Dmitry Safonov 
> 
> Hello Will and Catalin,
> 
> Have you had a chance to look at this patch set? I think it is ready to be
> merged. Let me know if you have any questions.

*friendly ping*

If I am doing something wrong, let me know.

> 
> Thanks,
> Andrei
> 


Re: [PATCH v5 0/6] arm64: add the time namespace support

2020-07-05 Thread Andrei Vagin
On Wed, Jun 24, 2020 at 01:33:15AM -0700, Andrei Vagin wrote:
> Allocate the time namespace page among VVAR pages and add the logic
> to handle faults on VVAR properly.
> 
> If a task belongs to a time namespace then the VVAR page which contains
> the system wide VDSO data is replaced with a namespace specific page
> which has the same layout as the VVAR page. That page has vdso_data->seq
> set to 1 to enforce the slow path and vdso_data->clock_mode set to
> VCLOCK_TIMENS to enforce the time namespace handling path.
> 
> The extra check in the case that vdso_data->seq is odd, e.g. a concurrent
> update of the VDSO data is in progress, is not really affecting regular
> tasks which are not part of a time namespace as the task is spin waiting
> for the update to finish and vdso_data->seq to become even again.
> 
> If a time namespace task hits that code path, it invokes the corresponding
> time getter function which retrieves the real VVAR page, reads host time
> and then adds the offset for the requested clock which is stored in the
> special VVAR page.
> 

> v2: Code cleanups suggested by Vincenzo.
> v3: add a comment in __arch_get_timens_vdso_data.
> v4: - fix an issue reported by the lkp robot.
> - vvar has the same size with/without CONFIG_TIME_NAMESPACE, but the
>   timens page isn't allocated on !CONFIG_TIME_NAMESPACE. This
>   simplifies criu/vdso migration between different kernel configs.
> v5: - Code cleanups suggested by Mark Rutland.
> - In vdso_join_timens, mmap_write_lock is downgraded to
>   mmap_read_lock. The VMA list isn't changed there, zap_page_range
>   doesn't require mmap_write_lock.
> 
> Reviewed-by: Vincenzo Frascino 
> Reviewed-by: Dmitry Safonov 

Hello Will and Catalin,

Have you had a chance to look at this patch set? I think it is ready to be
merged. Let me know if you have any questions.

Thanks,
Andrei



Re: [PATCH 3/3] nsproxy: support CLONE_NEWTIME with setns()

2020-06-25 Thread Andrei Vagin
On Fri, Jun 19, 2020 at 05:35:59PM +0200, Christian Brauner wrote:
> So far setns() was missing time namespace support. This was partially due
> to it simply not being implemented but also because vdso_join_timens()
> could still fail which made switching to multiple namespaces atomically
> problematic. This is now fixed so support CLONE_NEWTIME with setns()
> 
> Cc: Thomas Gleixner 
> Cc: Michael Kerrisk 
> Cc: Serge Hallyn 
> Cc: Dmitry Safonov 
> Cc: Andrei Vagin 
> Signed-off-by: Christian Brauner 

Hi Christian,

I have reviewed this series and it looks good to me.

We decided not to change the return type of vdso_join_timens() to avoid
conflicts with the arm64 timens patchset. With this change, you can add
my Reviewed-by to all patches in this series.

Reviewed-by: Andrei Vagin 

Thanks,
Andrei


Re: [PATCH 3/3] nsproxy: support CLONE_NEWTIME with setns()

2020-06-25 Thread Andrei Vagin
On Tue, Jun 23, 2020 at 01:55:21PM +0200, Christian Brauner wrote:
> On Fri, Jun 19, 2020 at 05:35:59PM +0200, Christian Brauner wrote:
> > So far setns() was missing time namespace support. This was partially due
> > to it simply not being implemented but also because vdso_join_timens()
> > could still fail which made switching to multiple namespaces atomically
> > problematic. This is now fixed so support CLONE_NEWTIME with setns()
> > 
> > Cc: Thomas Gleixner 
> > Cc: Michael Kerrisk 
> > Cc: Serge Hallyn 
> > Cc: Dmitry Safonov 
> > Cc: Andrei Vagin 
> > Signed-off-by: Christian Brauner 
> > ---
> 
> Andrei,
> Dmitry,
> 
> A little off-topic since its not related to the patch here but I've been
> going through the current time namespace semantics and i just want to
> confirm something with you:
> 
> Afaict, unshare(CLONE_NEWTIME) currently works similar to
> unshare(CLONE_NEWPID) in that it only changes {pid,time}_for_children
> but does _not_ change the {pid, time} namespace of the caller itself.
> For pid namespaces that makes a lot of sense but I'm not completely
> clear why you're doing this for time namespaces, especially since the
> setns() behavior for CLONE_NEWPID and CLONE_NEWTIME is very different:
> Similar to unshare(CLONE_NEWPID), setns(CLONE_NEWPID) doesn't change the
> pid namespace of the caller itself, it only changes it for it's
> children by setting up pid_for_children. _But_ for setns(CLONE_NEWTIME)
> both the caller's and the children's time namespace is changed, i.e.
> unshare(CLONE_NEWTIME) behaves different from setns(CLONE_NEWTIME). Why?

This scheme allows setting clock offsets for a namespace before any
processes appear in it. Changing the offsets is not allowed once a task has
joined the time namespace. We need this to avoid corner cases with timers,
and so that tasks don't need to be aware of offset changes.
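
For illustration, a minimal sketch of that sequence; it reuses the numeric
/proc/self/timens_offsets format from the timens selftests and trims error
handling:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#ifndef CLONE_NEWTIME
#define CLONE_NEWTIME 0x00000080
#endif

int main(void)
{
        char buf[64];
        int fd, len;

        if (unshare(CLONE_NEWTIME))             /* new timens, for children only */
                return 1;

        /* Offsets can only be set while no task has joined the namespace. */
        len = snprintf(buf, sizeof(buf), "%d %d 0",
                       CLOCK_MONOTONIC, 7 * 24 * 3600);
        fd = open("/proc/self/timens_offsets", O_WRONLY);
        if (fd < 0 || write(fd, buf, len) != len)
                return 1;
        close(fd);

        if (fork() == 0) {                      /* the child joins the new timens */
                struct timespec ts;

                clock_gettime(CLOCK_MONOTONIC, &ts);
                printf("monotonic in timens: %lld\n", (long long)ts.tv_sec);
                return 0;
        }
        wait(NULL);
        return 0;
}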

> 
> This also has the consequence that the unshare(CLONE_NEWTIME) +
> setns(CLONE_NEWTIME) sequence can be used to change the callers pid
> namespace. Is this intended?
> Here's some code where you can verify this (please excuse the aweful
> code I'm using to illustrate this):
> 
> int main(int argc, char *argv[])
> {
>   char buf1[4096], buf2[4096];
> 
>   if (unshare(0x0080))
>   exit(1);
> 
>   int fd = open("/proc/self/ns/time", O_RDONLY);
>   if (fd < 0)
>   exit(2);
> 
>   readlink("/proc/self/ns/time", buf1, sizeof(buf1));
>   readlink("/proc/self/ns/time_for_children", buf2, sizeof(buf2));
>   printf("unshare(CLONE_NEWTIME): time(%s) ~= 
> time_for_children(%s)\n", buf1, buf2);
> 
>   if (setns(fd, 0x0080))
>   exit(3);

And in this example, you use the right sequence of steps: unshare, set
offsets, setns. With clone3, we will be able to do this in one call.

> 
>   readlink("/proc/self/ns/time", buf1, sizeof(buf1));
>   readlink("/proc/self/ns/time_for_children", buf2, sizeof(buf2));
>   printf("setns(self, CLONE_NEWTIME): time(%s) == 
> time_for_children(%s)\n", buf1, buf2);
> 
>   exit(EXIT_SUCCESS);
> }
> 
> which gives:
> 
> root@f2-vm:/# ./test
> unshare(CLONE_NEWTIME):   time(time:[4026531834]) ~= 
> time_for_children(time:[4026532366])
> setns(self, CLONE_NEWTIME):   time(time:[4026531834]) == 
> time_for_children(time:[4026531834])
> 
> why is unshare(CLONE_NEWTIME) blocked from changing the callers pid
> namespace when setns(CLONE_NEWTIME) is allowed to do this?
> 
> Christian


Re: [PATCH 2/6] arm64/vdso: Zap vvar pages when switching to a time namespace

2020-06-25 Thread Andrei Vagin
On Wed, Jun 24, 2020 at 05:18:01PM +0200, Christian Brauner wrote:
> On Wed, Jun 24, 2020 at 01:33:17AM -0700, Andrei Vagin wrote:
> > The order of vvar pages depends on whether a task belongs to the root
> > time namespace or not. In the root time namespace, a task doesn't have a
> > per-namespace page. In a non-root namespace, the VVAR page which contains
> > the system-wide VDSO data is replaced with a namespace specific page
> > that contains clock offsets.
> > 
> > Whenever a task changes its namespace, the VVAR page tables are cleared
> > and then they will be re-faulted with a corresponding layout.
> > 
> > A task can switch its time namespace only if its ->mm isn't shared with
> > another task.
> > 
> > Reviewed-by: Vincenzo Frascino 
> > Reviewed-by: Dmitry Safonov 
> > Signed-off-by: Andrei Vagin 
> > ---
> >  arch/arm64/kernel/vdso.c | 31 +++
> >  1 file changed, 31 insertions(+)
> > 
> > diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
> > index 7c4620451fa5..bdf492a17dff 100644
> > --- a/arch/arm64/kernel/vdso.c
> > +++ b/arch/arm64/kernel/vdso.c
> > @@ -124,6 +124,37 @@ static int __vdso_init(enum vdso_abi abi)
> > return 0;
> >  }
> >  
> > +#ifdef CONFIG_TIME_NS
> > +/*
> > + * The vvar mapping contains data for a specific time namespace, so when a 
> > task
> > + * changes namespace we must unmap its vvar data for the old namespace.
> > + * Subsequent faults will map in data for the new namespace.
> > + *
> > + * For more details see timens_setup_vdso_data().
> > + */
> > +int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
> > +{
> > +   struct mm_struct *mm = task->mm;
> > +   struct vm_area_struct *vma;
> > +
> > +   mmap_read_lock(mm);
> 
> Perfect, thanks! I'll adapt my patches so that my change and this change
> don't conflict and can go in together. Once they're landed we can simply
> turn int vdso_join_timens() into void vdso_join_timens() everywhere.

Yep. Let's do it this way. Thanks!

> 
> Reviewed-by: Christian Brauner 
> 
> Thanks!
> Christian


[PATCH 5/6] arm64/vdso: Restrict splitting VVAR VMA

2020-06-24 Thread Andrei Vagin
Forbid splitting the VVAR VMA, resulting in a stricter ABI and reducing the
number of corner cases to consider while working further on VDSO time
namespace support.

As the offset from the timens page to the VVAR page is computed at compile
time, the pages in VVAR should stay together and not be partially
mremap()'ed.

Reviewed-by: Vincenzo Frascino 
Reviewed-by: Dmitry Safonov 
Signed-off-by: Andrei Vagin 
---
 arch/arm64/kernel/vdso.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
index be9ca8c46cff..8be6bd6625db 100644
--- a/arch/arm64/kernel/vdso.c
+++ b/arch/arm64/kernel/vdso.c
@@ -224,6 +224,17 @@ static vm_fault_t vvar_fault(const struct 
vm_special_mapping *sm,
return vmf_insert_pfn(vma, vmf->address, pfn);
 }
 
+static int vvar_mremap(const struct vm_special_mapping *sm,
+  struct vm_area_struct *new_vma)
+{
+   unsigned long new_size = new_vma->vm_end - new_vma->vm_start;
+
+   if (new_size != VVAR_NR_PAGES * PAGE_SIZE)
+   return -EINVAL;
+
+   return 0;
+}
+
 static int __setup_additional_pages(enum vdso_abi abi,
struct mm_struct *mm,
struct linux_binprm *bprm,
@@ -306,6 +317,7 @@ static struct vm_special_mapping aarch32_vdso_maps[] = {
[AA32_MAP_VVAR] = {
.name = "[vvar]",
.fault = vvar_fault,
+   .mremap = vvar_mremap,
},
[AA32_MAP_VDSO] = {
.name = "[vdso]",
@@ -474,6 +486,7 @@ static struct vm_special_mapping aarch64_vdso_maps[] 
__ro_after_init = {
[AA64_MAP_VVAR] = {
.name   = "[vvar]",
.fault = vvar_fault,
+   .mremap = vvar_mremap,
},
[AA64_MAP_VDSO] = {
.name   = "[vdso]",
-- 
2.24.1



[PATCH 1/6] arm64/vdso: use the fault callback to map vvar pages

2020-06-24 Thread Andrei Vagin
Currently the vdso has no awareness of time namespaces, which may
apply distinct offsets to processes in different namespaces. To handle
this within the vdso, we'll need to expose a per-namespace data page.

As a preparatory step, this patch separates the vdso data page from
the code pages, and has it faulted in via its own fault callback.
Subsequent patches will extend this to support distinct pages per time
namespace.

The vvar vma has to be installed with the VM_PFNMAP flag to handle
faults via its vma fault callback.

Reviewed-by: Vincenzo Frascino 
Reviewed-by: Dmitry Safonov 
Signed-off-by: Andrei Vagin 
---
 arch/arm64/kernel/vdso.c | 25 +++--
 1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
index 4e016574bd91..7c4620451fa5 100644
--- a/arch/arm64/kernel/vdso.c
+++ b/arch/arm64/kernel/vdso.c
@@ -107,29 +107,32 @@ static int __vdso_init(enum vdso_abi abi)
vdso_info[abi].vdso_code_start) >>
PAGE_SHIFT;
 
-   /* Allocate the vDSO pagelist, plus a page for the data. */
-   vdso_pagelist = kcalloc(vdso_info[abi].vdso_pages + 1,
+   vdso_pagelist = kcalloc(vdso_info[abi].vdso_pages,
sizeof(struct page *),
GFP_KERNEL);
if (vdso_pagelist == NULL)
return -ENOMEM;
 
-   /* Grab the vDSO data page. */
-   vdso_pagelist[0] = phys_to_page(__pa_symbol(vdso_data));
-
-
/* Grab the vDSO code pages. */
pfn = sym_to_pfn(vdso_info[abi].vdso_code_start);
 
for (i = 0; i < vdso_info[abi].vdso_pages; i++)
-   vdso_pagelist[i + 1] = pfn_to_page(pfn + i);
+   vdso_pagelist[i] = pfn_to_page(pfn + i);
 
-   vdso_info[abi].dm->pages = &vdso_pagelist[0];
-   vdso_info[abi].cm->pages = &vdso_pagelist[1];
+   vdso_info[abi].cm->pages = vdso_pagelist;
 
return 0;
 }
 
+static vm_fault_t vvar_fault(const struct vm_special_mapping *sm,
+struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+   if (vmf->pgoff == 0)
+   return vmf_insert_pfn(vma, vmf->address,
+   sym_to_pfn(vdso_data));
+   return VM_FAULT_SIGBUS;
+}
+
 static int __setup_additional_pages(enum vdso_abi abi,
struct mm_struct *mm,
struct linux_binprm *bprm,
@@ -150,7 +153,7 @@ static int __setup_additional_pages(enum vdso_abi abi,
}
 
ret = _install_special_mapping(mm, vdso_base, PAGE_SIZE,
-  VM_READ|VM_MAYREAD,
+  VM_READ|VM_MAYREAD|VM_PFNMAP,
   vdso_info[abi].dm);
if (IS_ERR(ret))
goto up_fail;
@@ -209,6 +212,7 @@ static struct vm_special_mapping aarch32_vdso_maps[] = {
 #ifdef CONFIG_COMPAT_VDSO
[AA32_MAP_VVAR] = {
.name = "[vvar]",
+   .fault = vvar_fault,
},
[AA32_MAP_VDSO] = {
.name = "[vdso]",
@@ -376,6 +380,7 @@ enum aarch64_map {
 static struct vm_special_mapping aarch64_vdso_maps[] __ro_after_init = {
[AA64_MAP_VVAR] = {
.name   = "[vvar]",
+   .fault = vvar_fault,
},
[AA64_MAP_VDSO] = {
.name   = "[vdso]",
-- 
2.24.1



[PATCH 4/6] arm64/vdso: Handle faults on timens page

2020-06-24 Thread Andrei Vagin
If a task belongs to a time namespace then the VVAR page which contains
the system wide VDSO data is replaced with a namespace specific page
which has the same layout as the VVAR page.

Reviewed-by: Vincenzo Frascino 
Reviewed-by: Dmitry Safonov 
Signed-off-by: Andrei Vagin 
---
 arch/arm64/kernel/vdso.c | 57 +---
 1 file changed, 53 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
index 18854e6c1373..be9ca8c46cff 100644
--- a/arch/arm64/kernel/vdso.c
+++ b/arch/arm64/kernel/vdso.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -164,15 +165,63 @@ int vdso_join_timens(struct task_struct *task, struct 
time_namespace *ns)
mmap_read_unlock(mm);
return 0;
 }
+
+static struct page *find_timens_vvar_page(struct vm_area_struct *vma)
+{
+   if (likely(vma->vm_mm == current->mm))
+   return current->nsproxy->time_ns->vvar_page;
+
+   /*
+* VM_PFNMAP | VM_IO protect .fault() handler from being called
+* through interfaces like /proc/$pid/mem or
+* process_vm_{readv,writev}() as long as there's no .access()
+* in special_mapping_vmops().
+* For more details check_vma_flags() and __access_remote_vm()
+*/
+
+   WARN(1, "vvar_page accessed remotely");
+
+   return NULL;
+}
+#else
+static inline struct page *find_timens_vvar_page(struct vm_area_struct *vma)
+{
+   return NULL;
+}
 #endif
 
 static vm_fault_t vvar_fault(const struct vm_special_mapping *sm,
 struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-   if (vmf->pgoff == 0)
-   return vmf_insert_pfn(vma, vmf->address,
-   sym_to_pfn(vdso_data));
-   return VM_FAULT_SIGBUS;
+   struct page *timens_page = find_timens_vvar_page(vma);
+   unsigned long pfn;
+
+   switch (vmf->pgoff) {
+   case VVAR_DATA_PAGE_OFFSET:
+   if (timens_page)
+   pfn = page_to_pfn(timens_page);
+   else
+   pfn = sym_to_pfn(vdso_data);
+   break;
+#ifdef CONFIG_TIME_NS
+   case VVAR_TIMENS_PAGE_OFFSET:
+   /*
+* If a task belongs to a time namespace then a namespace
+* specific VVAR is mapped with the VVAR_DATA_PAGE_OFFSET and
+* the real VVAR page is mapped with the VVAR_TIMENS_PAGE_OFFSET
+* offset.
+* See also the comment near timens_setup_vdso_data().
+*/
+   if (!timens_page)
+   return VM_FAULT_SIGBUS;
+   pfn = sym_to_pfn(vdso_data);
+   break;
+#endif /* CONFIG_TIME_NS */
+   default:
+   return VM_FAULT_SIGBUS;
+   }
+
+   return vmf_insert_pfn(vma, vmf->address, pfn);
 }
 
 static int __setup_additional_pages(enum vdso_abi abi,
-- 
2.24.1



[PATCH 6/6] arm64: enable time namespace support

2020-06-24 Thread Andrei Vagin
CONFIG_TIME_NS depends on GENERIC_VDSO_TIME_NS.

Reviewed-by: Vincenzo Frascino 
Reviewed-by: Dmitry Safonov 
Signed-off-by: Andrei Vagin 
---
 arch/arm64/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index a4a094bedcb2..38d3180adf78 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -118,6 +118,7 @@ config ARM64
select GENERIC_STRNLEN_USER
select GENERIC_TIME_VSYSCALL
select GENERIC_GETTIMEOFDAY
+   select GENERIC_VDSO_TIME_NS
select HANDLE_DOMAIN_IRQ
select HARDIRQS_SW_RESEND
select HAVE_PCI
-- 
2.24.1



[PATCH v5 0/6] arm64: add the time namespace support

2020-06-24 Thread Andrei Vagin
Allocate the time namespace page among VVAR pages and add the logic
to handle faults on VVAR properly.

If a task belongs to a time namespace then the VVAR page which contains
the system wide VDSO data is replaced with a namespace specific page
which has the same layout as the VVAR page. That page has vdso_data->seq
set to 1 to enforce the slow path and vdso_data->clock_mode set to
VCLOCK_TIMENS to enforce the time namespace handling path.

The extra check in the case that vdso_data->seq is odd, e.g. a concurrent
update of the VDSO data is in progress, is not really affecting regular
tasks which are not part of a time namespace as the task is spin waiting
for the update to finish and vdso_data->seq to become even again.

If a time namespace task hits that code path, it invokes the corresponding
time getter function which retrieves the real VVAR page, reads host time
and then adds the offset for the requested clock which is stored in the
special VVAR page.
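
To illustrate how the two VVAR pages interact, here is a simplified sketch
of the generic vDSO fast-path selection. It is not the exact lib/vdso code,
only the idea these arm64 patches plug into; the helper names follow the
ones added by this series:

/*
 * On the per-namespace page, vd->seq is pinned to 1 and clock_mode is
 * VCLOCK_TIMENS, so a timens task never sees an even seq here and takes
 * this branch: host time is then read through the real VVAR page returned
 * by __arch_get_timens_vdso_data(), while the per-clock offsets come from
 * the namespace page itself.
 */
static __always_inline const struct vdso_data *
select_vdso_data(const struct vdso_data *vd)
{
	if ((READ_ONCE(vd->seq) & 1) && vd->clock_mode == VCLOCK_TIMENS)
		return __arch_get_timens_vdso_data();
	return vd;
}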

v2: Code cleanups suggested by Vincenzo.
v3: add a comment in __arch_get_timens_vdso_data.
v4: - fix an issue reported by the lkp robot.
- vvar has the same size with/without CONFIG_TIME_NAMESPACE, but the
  timens page isn't allocated on !CONFIG_TIME_NAMESPACE. This
  simplifies criu/vdso migration between different kernel configs.
v5: - Code cleanups suggested by Mark Rutland.
- In vdso_join_timens, mmap_write_lock is downgraded to
  mmap_read_lock. The VMA list isn't changed there, zap_page_range
  doesn't require mmap_write_lock.

Reviewed-by: Vincenzo Frascino 
Reviewed-by: Dmitry Safonov 
Cc: Thomas Gleixner 
Cc: Dmitry Safonov 

v5 on github (if someone prefers `git pull` to `git am`):
https://github.com/avagin/linux-task-diag/tree/arm64/timens-v5

Andrei Vagin (6):
  arm64/vdso: use the fault callback to map vvar pages
  arm64/vdso: Zap vvar pages when switching to a time namespace
  arm64/vdso: Add time namespace page
  arm64/vdso: Handle faults on timens page
  arm64/vdso: Restrict splitting VVAR VMA
  arm64: enable time namespace support

--
2.24.1



[PATCH 3/6] arm64/vdso: Add time namespace page

2020-06-24 Thread Andrei Vagin
Allocate the time namespace page among VVAR pages.  Provide
__arch_get_timens_vdso_data() helper for VDSO code to get the
code-relative position of VVARs on that special page.

If a task belongs to a time namespace then the VVAR page which contains
the system wide VDSO data is replaced with a namespace specific page
which has the same layout as the VVAR page. That page has vdso_data->seq
set to 1 to enforce the slow path and vdso_data->clock_mode set to
VCLOCK_TIMENS to enforce the time namespace handling path.

The extra check in the case that vdso_data->seq is odd, e.g. a concurrent
update of the VDSO data is in progress, is not really affecting regular
tasks which are not part of a time namespace as the task is spin waiting
for the update to finish and vdso_data->seq to become even again.

If a time namespace task hits that code path, it invokes the corresponding
time getter function which retrieves the real VVAR page, reads host time
and then adds the offset for the requested clock which is stored in the
special VVAR page.

The time-namespace page isn't allocated on !CONFIG_TIME_NAMESPACE, but
the vma is the same size, which simplifies criu/vdso migration between
different kernel configs.

Cc: Mark Rutland 
Reviewed-by: Vincenzo Frascino 
Reviewed-by: Dmitry Safonov 
Signed-off-by: Andrei Vagin 
---
 arch/arm64/include/asm/vdso.h |  2 ++
 .../include/asm/vdso/compat_gettimeofday.h| 12 
 arch/arm64/include/asm/vdso/gettimeofday.h|  8 
 arch/arm64/kernel/vdso.c  | 19 ---
 arch/arm64/kernel/vdso/vdso.lds.S |  5 -
 arch/arm64/kernel/vdso32/vdso.lds.S   |  5 -
 include/vdso/datapage.h   |  1 +
 7 files changed, 47 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/vdso.h b/arch/arm64/include/asm/vdso.h
index 07468428fd29..f99dcb94b438 100644
--- a/arch/arm64/include/asm/vdso.h
+++ b/arch/arm64/include/asm/vdso.h
@@ -12,6 +12,8 @@
  */
 #define VDSO_LBASE 0x0
 
+#define __VVAR_PAGES2
+
 #ifndef __ASSEMBLY__
 
 #include 
diff --git a/arch/arm64/include/asm/vdso/compat_gettimeofday.h 
b/arch/arm64/include/asm/vdso/compat_gettimeofday.h
index b6907ae78e53..b7c549d46d18 100644
--- a/arch/arm64/include/asm/vdso/compat_gettimeofday.h
+++ b/arch/arm64/include/asm/vdso/compat_gettimeofday.h
@@ -152,6 +152,18 @@ static __always_inline const struct vdso_data 
*__arch_get_vdso_data(void)
return ret;
 }
 
+#ifdef CONFIG_TIME_NS
+static __always_inline const struct vdso_data 
*__arch_get_timens_vdso_data(void)
+{
+   const struct vdso_data *ret;
+
+   /* See __arch_get_vdso_data(). */
+   asm volatile("mov %0, %1" : "=r"(ret) : "r"(_timens_data));
+
+   return ret;
+}
+#endif
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_VDSO_GETTIMEOFDAY_H */
diff --git a/arch/arm64/include/asm/vdso/gettimeofday.h 
b/arch/arm64/include/asm/vdso/gettimeofday.h
index afba6ba332f8..cf39eae5eaaf 100644
--- a/arch/arm64/include/asm/vdso/gettimeofday.h
+++ b/arch/arm64/include/asm/vdso/gettimeofday.h
@@ -96,6 +96,14 @@ const struct vdso_data *__arch_get_vdso_data(void)
return _vdso_data;
 }
 
+#ifdef CONFIG_TIME_NS
+static __always_inline
+const struct vdso_data *__arch_get_timens_vdso_data(void)
+{
+   return _timens_data;
+}
+#endif
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_VDSO_GETTIMEOFDAY_H */
diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
index bdf492a17dff..18854e6c1373 100644
--- a/arch/arm64/kernel/vdso.c
+++ b/arch/arm64/kernel/vdso.c
@@ -40,6 +40,12 @@ enum vdso_abi {
 #endif /* CONFIG_COMPAT_VDSO */
 };
 
+enum vvar_pages {
+   VVAR_DATA_PAGE_OFFSET,
+   VVAR_TIMENS_PAGE_OFFSET,
+   VVAR_NR_PAGES,
+};
+
 struct vdso_abi_info {
const char *name;
const char *vdso_code_start;
@@ -125,6 +131,11 @@ static int __vdso_init(enum vdso_abi abi)
 }
 
 #ifdef CONFIG_TIME_NS
+struct vdso_data *arch_get_vdso_data(void *vvar_page)
+{
+   return (struct vdso_data *)(vvar_page);
+}
+
 /*
  * The vvar mapping contains data for a specific time namespace, so when a task
  * changes namespace we must unmap its vvar data for the old namespace.
@@ -173,9 +184,11 @@ static int __setup_additional_pages(enum vdso_abi abi,
unsigned long gp_flags = 0;
void *ret;
 
+   BUILD_BUG_ON(VVAR_NR_PAGES != __VVAR_PAGES);
+
vdso_text_len = vdso_info[abi].vdso_pages << PAGE_SHIFT;
/* Be sure to map the data page */
-   vdso_mapping_len = vdso_text_len + PAGE_SIZE;
+   vdso_mapping_len = vdso_text_len + VVAR_NR_PAGES * PAGE_SIZE;
 
vdso_base = get_unmapped_area(NULL, 0, vdso_mapping_len, 0, 0);
if (IS_ERR_VALUE(vdso_base)) {
@@ -183,7 +196,7 @@ static int __setup_additional_pages(enum vdso_abi abi,
goto up_fail;
}
 
-   ret = _install_special_mapping(mm, vdso_base, PAGE_SIZE,
+   ret = _inst

[PATCH 2/6] arm64/vdso: Zap vvar pages when switching to a time namespace

2020-06-24 Thread Andrei Vagin
The order of vvar pages depends on whether a task belongs to the root
time namespace or not. In the root time namespace, a task doesn't have a
per-namespace page. In a non-root namespace, the VVAR page which contains
the system-wide VDSO data is replaced with a namespace specific page
that contains clock offsets.

Whenever a task changes its namespace, the VVAR page tables are cleared
and then they will be re-faulted with a corresponding layout.

A task can switch its time namespace only if its ->mm isn't shared with
another task.

Reviewed-by: Vincenzo Frascino 
Reviewed-by: Dmitry Safonov 
Signed-off-by: Andrei Vagin 
---
 arch/arm64/kernel/vdso.c | 31 +++
 1 file changed, 31 insertions(+)

diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
index 7c4620451fa5..bdf492a17dff 100644
--- a/arch/arm64/kernel/vdso.c
+++ b/arch/arm64/kernel/vdso.c
@@ -124,6 +124,37 @@ static int __vdso_init(enum vdso_abi abi)
return 0;
 }
 
+#ifdef CONFIG_TIME_NS
+/*
+ * The vvar mapping contains data for a specific time namespace, so when a task
+ * changes namespace we must unmap its vvar data for the old namespace.
+ * Subsequent faults will map in data for the new namespace.
+ *
+ * For more details see timens_setup_vdso_data().
+ */
+int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
+{
+   struct mm_struct *mm = task->mm;
+   struct vm_area_struct *vma;
+
+   mmap_read_lock(mm);
+
+   for (vma = mm->mmap; vma; vma = vma->vm_next) {
+   unsigned long size = vma->vm_end - vma->vm_start;
+
+   if (vma_is_special_mapping(vma, vdso_info[VDSO_ABI_AA64].dm))
+   zap_page_range(vma, vma->vm_start, size);
+#ifdef CONFIG_COMPAT_VDSO
+   if (vma_is_special_mapping(vma, vdso_info[VDSO_ABI_AA32].dm))
+   zap_page_range(vma, vma->vm_start, size);
+#endif
+   }
+
+   mmap_read_unlock(mm);
+   return 0;
+}
+#endif
+
 static vm_fault_t vvar_fault(const struct vm_special_mapping *sm,
 struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-- 
2.24.1



Re: [RFC PATCH 1/3 v2] futex: introduce FUTEX_SWAP operation

2020-06-23 Thread Andrei Vagin
On Tue, Jun 23, 2020 at 11:30:30AM -0700, Peter Oskolkov wrote:
...
> > >  /**
> > > +static int futex_swap(u32 __user *uaddr, unsigned int flags, u32
> > > val,
> > > +   ktime_t *abs_time, u32 __user *uaddr2)
> > > +{
> > > + u32 bitset = FUTEX_BITSET_MATCH_ANY;
> > > + struct task_struct *next = NULL;
> > > + DEFINE_WAKE_Q(wake_q);
> > > + int ret;
> > > +
> > > + ret = prepare_wake_q(uaddr2, flags, 1, bitset, &wake_q);
> > > + if (!wake_q_empty(&wake_q)) {
> > > + /* Pull the first wakee out of the queue to swap into.
> > > */
> > > + next = container_of(wake_q.first, struct task_struct,
> > > wake_q);
> > > + wake_q.first = wake_q.first->next;
> > > + next->wake_q.next = NULL;
> > > + /*
> > > +  * Note that wake_up_q does not touch wake_q.last, so
> > > we
> > > +  * do not bother with it here.
> > > +  */
> > > + wake_up_q(&wake_q);
> > 
> > wake_up_q() doesn't seem to serve any purpose in that the above
> > assignment of wake_q.first shall make it an empty queue now?
> > Also, I don't see a need to touch wake_q.first either so I think we
> > can
> > get rid of wake_q altogether here.
> 
> The futex at uaddr2 may have more than one waiter, so we cannot assume
> that wake_q will be empty when we remove the first element.

The third argument of prepare_wake_q() is nr_wake, which is 1 in this
case, so we can be sure that wake_q will be empty, can't we?

> 
> > 
> > > + }
> > > + if (ret < 0)
> > > + return ret;
> > > +
> > > + return futex_wait(uaddr, flags, val, abs_time, bitset, next);
> > > +}
> > 


Re: [PATCH 2/6] arm64/vdso: Zap vvar pages when switching to a time namespace

2020-06-23 Thread Andrei Vagin
On Fri, Jun 19, 2020 at 05:38:12PM +0200, Christian Brauner wrote:
> On Tue, Jun 16, 2020 at 12:55:41AM -0700, Andrei Vagin wrote:
> > The VVAR page layout depends on whether a task belongs to the root or
> > non-root time namespace. Whenever a task changes its namespace, the VVAR
> > page tables are cleared and then they will be re-faulted with a
> > corresponding layout.
> > 
> > Reviewed-by: Vincenzo Frascino 
> > Reviewed-by: Dmitry Safonov 
> > Signed-off-by: Andrei Vagin 
> > ---
> >  arch/arm64/kernel/vdso.c | 32 
> >  1 file changed, 32 insertions(+)
> > 
> > diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
> > index b0aec4e8c9b4..df4bb736d28a 100644
> > --- a/arch/arm64/kernel/vdso.c
> > +++ b/arch/arm64/kernel/vdso.c
> > @@ -125,6 +125,38 @@ static int __vdso_init(enum vdso_abi abi)
> > return 0;
> >  }
> >  
> > +#ifdef CONFIG_TIME_NS
> > +/*
> > + * The vvar page layout depends on whether a task belongs to the root or
> > + * non-root time namespace. Whenever a task changes its namespace, the VVAR
> > + * page tables are cleared and then they will re-faulted with a
> > + * corresponding layout.
> > + * See also the comment near timens_setup_vdso_data() for details.
> > + */
> > +int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
> > +{
> > +   struct mm_struct *mm = task->mm;
> > +   struct vm_area_struct *vma;
> > +
> > +   if (mmap_write_lock_killable(mm))
> > +   return -EINTR;
> 
> Hey,
> 
> Just a heads-up I'm about to plumb CLONE_NEWTIME support into setns()

Hmm. I am not sure that I understand what you mean. I think setns(nsfd,
CLONE_NEWTIME) works now. For example, we use it in
tools/testing/selftests/timens/timens.c. Do you mean setns(pidfd,
CLONE_NEWTIME | CLONE_something)?

> which would mean that vdso_join_timens() ould not be allowed to fail
> anymore to make it easy to switch to multiple namespaces atomically. So
> this would probably need to be changed to mmap_write_lock() which I've
> already brought up upstream:
> https://lore.kernel.org/lkml/2020060221.pgd3r5qkjrjmfqa2@wittgenstein/
> (Assuming that people agree. I just sent the series and most people here
> are Cced.)
> 
> Thanks!
> Christian


Re: [PATCH 1/6] arm64/vdso: use the fault callback to map vvar pages

2020-06-18 Thread Andrei Vagin
On Tue, Jun 16, 2020 at 11:59:20AM +0100, Mark Rutland wrote:
> On Tue, Jun 16, 2020 at 12:55:40AM -0700, Andrei Vagin wrote:
> > This is required to support time namespaces where a time namespace data
> > page is different for each namespace.
> 
> Can you please give a bit more of an introduction to the changes here?
> As-is, this doesn't give reviewers the necessary context to understand
> the change, nor does it justify it.
> 
> Ideally, a commit message for this should look something like:

Mark, thank you for the review. I will write a more detailed message and
fix this patch according to your other comments.

> 
> | Currently the vdso has no awareness of time namespaces, which may
> | apply distinct offsets to processes in different namespaces. To handle
> | this within the vdso, we'll need to expose a per-namespace data page.
> |
> | As a preparatory step, this patch separates the vdso data page from
> | the code pages, and has it faulted in via its own fault callback.
> | Subsequent patches will extend this to support distinct pages per time
> | namespace.
> 
> Otherwise, I have a few minor comments below.
> 
> > Reviewed-by: Vincenzo Frascino 
> > Reviewed-by: Dmitry Safonov 
> > Signed-off-by: Andrei Vagin 
> > ---
> >  arch/arm64/kernel/vdso.c | 24 +++-
> >  1 file changed, 15 insertions(+), 9 deletions(-)
> > 
> > diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
> > index 4e016574bd91..b0aec4e8c9b4 100644
> > --- a/arch/arm64/kernel/vdso.c
> > +++ b/arch/arm64/kernel/vdso.c
> > @@ -108,28 +108,32 @@ static int __vdso_init(enum vdso_abi abi)
> > PAGE_SHIFT;
> >  
> > /* Allocate the vDSO pagelist, plus a page for the data. */
> > -   vdso_pagelist = kcalloc(vdso_info[abi].vdso_pages + 1,
> > +   vdso_pagelist = kcalloc(vdso_info[abi].vdso_pages,
> > sizeof(struct page *),
> > GFP_KERNEL);
> 
> The comment above is now stale. Can you please update it, or (even
> better) delete it entirely?

I will delete it.

> 
> > if (vdso_pagelist == NULL)
> > return -ENOMEM;
> >  
> > -   /* Grab the vDSO data page. */
> > -   vdso_pagelist[0] = phys_to_page(__pa_symbol(vdso_data));
> > -
> > -
> > /* Grab the vDSO code pages. */
> > pfn = sym_to_pfn(vdso_info[abi].vdso_code_start);
> >  
> > for (i = 0; i < vdso_info[abi].vdso_pages; i++)
> > -   vdso_pagelist[i + 1] = pfn_to_page(pfn + i);
> > +   vdso_pagelist[i] = pfn_to_page(pfn + i);
> >  
> > -   vdso_info[abi].dm->pages = &vdso_pagelist[0];
> > -   vdso_info[abi].cm->pages = &vdso_pagelist[1];
> > +   vdso_info[abi].cm->pages = vdso_pagelist;
> >  
> > return 0;
> >  }
> >  
> > +static vm_fault_t vvar_fault(const struct vm_special_mapping *sm,
> > +struct vm_area_struct *vma, struct vm_fault *vmf)
> > +{
> > +   if (vmf->pgoff == 0)
> > +   return vmf_insert_pfn(vma, vmf->address,
> > +   sym_to_pfn(vdso_data));
> > +   return VM_FAULT_SIGBUS;
> > +}
> 
> This might look better as:

In the next patch, this function will handle more pages and this form
will look better.

> 
> | if (vmf->pgoff != 0)
> | return VM_FAULT_SIGBUS;
> | 
> | return vmf_insert_pfn(vma, vmf->address, sym_to_pfn(vdso_data));
> 
> > +
> >  static int __setup_additional_pages(enum vdso_abi abi,
> > struct mm_struct *mm,
> > struct linux_binprm *bprm,
> > @@ -150,7 +154,7 @@ static int __setup_additional_pages(enum vdso_abi abi,
> > }
> >  
> > ret = _install_special_mapping(mm, vdso_base, PAGE_SIZE,
> > -  VM_READ|VM_MAYREAD,
> > +  VM_READ|VM_MAYREAD|VM_PFNMAP,
> 
> This change needs to be explained in the commit message. WHy is it
> necessary, and why only so for the data page?

I will update the commit message. VM_PFNMAP is required to handle faults
via a vma fault callback; in this case, it is vvar_fault().
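
For context, the constraint I am referring to looks roughly like this; it
is a simplified illustration (the real check is a BUG_ON inside the
vmf_insert_pfn() path in mm/memory.c, not a helper like this):

/*
 * vmf_insert_pfn(), which vvar_fault() uses to map the data page, only
 * accepts VMAs that are VM_PFNMAP or VM_MIXEDMAP, so the special mapping
 * has to carry VM_PFNMAP for the fault handler to work at all.
 */
static inline bool pfn_insert_allowed(const struct vm_area_struct *vma)
{
	return vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP);
}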

> 
> Thanks,
> Mark.
> 
> >vdso_info[abi].dm);
> > if (IS_ERR(ret))
> > goto up_fail;
> > @@ -209,6 +213,7 @@ static struct vm_special_mapping aarch32_vdso_maps[] = {
> >  #ifdef CONFIG_COMPAT_VDSO
> > [AA32_MAP_VVAR] = {
> > .name = "[vvar]",
> > +   .fault = vvar_fault,
> > },
> > [AA32_MAP_VDSO] = {
> > .name = "[vdso]",
> > @@ -376,6 +381,7 @@ enum aarch64_map {
> >  static struct vm_special_mapping aarch64_vdso_maps[] __ro_after_init = {
> > [AA64_MAP_VVAR] = {
> > .name   = "[vvar]",
> > +   .fault = vvar_fault,
> > },
> > [AA64_MAP_VDSO] = {
> > .name   = "[vdso]",
> > -- 
> > 2.24.1
> > 


Re: [RFC PATCH 1/3 v2] futex: introduce FUTEX_SWAP operation

2020-06-17 Thread Andrei Vagin
On Tue, Jun 16, 2020 at 10:22:26AM -0700, Peter Oskolkov wrote:
> From 6fbe0261204692a7f488261ab3c4ac696b91db5c Mon Sep 17 00:00:00 2001
> From: Peter Oskolkov 
> Date: Tue, 9 Jun 2020 16:03:14 -0700
> Subject: [RFC PATCH 1/3 v2] futex: introduce FUTEX_SWAP operation
> 
> This is an RFC!
> 
> As Paul Turner presented at LPC in 2013 ...
> - pdf: 
> http://pdxplumbers.osuosl.org/2013/ocw//system/presentations/1653/original/LPC%20-%20User%20Threading.pdf
> - video: https://www.youtube.com/watch?v=KXuZi9aeGTw
> 
> ... Google has developed an M:N userspace threading subsystem backed
> by Google-private SwitchTo Linux Kernel API (page 17 in the pdf referenced
> above). This subsystem provides latency-sensitive services at Google with
> fine-grained user-space control/scheduling over what is running when,
> and this subsystem is used widely internally (called schedulers or fibers).
> 
> This RFC patchset is the first step to open-source this work. As explained
> in the linked pdf and video, SwitchTo API has three core operations: wait,
> resume, and swap (=switch). So this patchset adds a FUTEX_SWAP operation
> that, in addition to FUTEX_WAIT and FUTEX_WAKE, will provide a foundation
> on top of which user-space threading libraries can be built.
> 
> Another common use case for FUTEX_SWAP is message passing a-la RPC
> between tasks: task/thread T1 prepares a message,
> wakes T2 to work on it, and waits for the results; when T2 is done, it
> wakes T1 and waits for more work to arrive. Currently the simplest
> way to implement this is
> 
> a. T1: futex-wake T2, futex-wait
> b. T2: wakes, does what it has been woken to do
> c. T2: futex-wake T1, futex-wait
> 
> With FUTEX_SWAP, steps a and c above can be reduced to one futex operation
> that runs 5-10 times faster.
> 

Hi Peter,

We have a good use-case in gVisor for this new futex command.

gVisor accesses the file system through a file proxy called the Gofer.
The Gofer runs as a separate process that is isolated from the sandbox
(the Sentry). Gofer instances communicate with their respective Sentry
using a 9P-like protocol. We used sockets as communication channels, but
recently we switched to the flipcall (1) library, which improves
performance by using shared memory for data (reducing memory copies) and
futexes for control signaling (which is much cheaper than
sendto/recvfrom/sendmsg/recvmsg).
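
To make the control-signaling pattern concrete, here is a minimal sketch of
one waker-side round trip. This is not flipcall code, and FUTEX_SWAP plus
the opcode value 13 come from this RFC rather than a mainline kernel, so
treat both as assumptions:

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef FUTEX_SWAP
#define FUTEX_SWAP 13	/* assumed RFC opcode, not a mainline constant */
#endif

static long futex(unsigned int *uaddr, int op, unsigned int val,
		  void *timeout, unsigned int *uaddr2)
{
	return syscall(SYS_futex, uaddr, op, val, timeout, uaddr2, 0);
}

/* Classic round trip: two syscalls and an extra scheduler wakeup. */
static void roundtrip_wake_wait(unsigned int *my_futex, unsigned int *peer_futex)
{
	futex(peer_futex, FUTEX_WAKE, 1, NULL, NULL);	/* kick the peer */
	futex(my_futex, FUTEX_WAIT, 0, NULL, NULL);	/* sleep while *my_futex == 0 */
}

/* With FUTEX_SWAP: one syscall that wakes the peer, blocks the caller and
 * lets the scheduler switch directly to the wakee. */
static void roundtrip_swap(unsigned int *my_futex, unsigned int *peer_futex)
{
	futex(my_futex, FUTEX_SWAP, 0, NULL, peer_futex);
}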

I modified the flipcall library to use FUTEX_SWAP and I see a
significant performance improvement.

A low-level benchmark (2) shows that a req/resp round trip is more than
five times faster with FUTEX_SWAP than with FUTEX_WAKE & FUTEX_WAIT. This
is more or less the same test as the one you did.

* FUTEX_WAKE & FUTEX_WAIT
BenchmarkSendRecv-8 88396 13625 ns/op

* FUTEX_SWAP
BenchmarkSendRecv-8479604  2524 ns/op

And a more high-level test (3), which benchmarks the open syscall in
gVisor, shows about a 40% improvement.

* FUTEX_WAKE & FUTEX_WAIT
BM_Open/1/real_time_mean93996 ns

* FUTEX_SWAP
BM_Open/1/real_time_mean53136 ns

I believe there are many use-cases for FUTEX_SWAP in other projects.

1. https://github.com/google/gvisor/tree/master/pkg/flipcall
2. 
https://github.com/google/gvisor/blob/master/pkg/flipcall/flipcall_test.go#L361
3. 
https://github.com/google/gvisor/blob/master/test/perf/linux/open_benchmark.cc

Thanks,
Andrei


[PATCH 4/6] arm64/vdso: Handle faults on timens page

2020-06-16 Thread Andrei Vagin
If a task belongs to a time namespace then the VVAR page which contains
the system wide VDSO data is replaced with a namespace specific page
which has the same layout as the VVAR page.

Reviewed-by: Vincenzo Frascino 
Reviewed-by: Dmitry Safonov 
Signed-off-by: Andrei Vagin 
---
 arch/arm64/kernel/vdso.c | 57 +---
 1 file changed, 53 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
index 14d5b7642b62..21bdd0d74f40 100644
--- a/arch/arm64/kernel/vdso.c
+++ b/arch/arm64/kernel/vdso.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -166,15 +167,63 @@ int vdso_join_timens(struct task_struct *task, struct 
time_namespace *ns)
mmap_write_unlock(mm);
return 0;
 }
+
+static struct page *find_timens_vvar_page(struct vm_area_struct *vma)
+{
+   if (likely(vma->vm_mm == current->mm))
+   return current->nsproxy->time_ns->vvar_page;
+
+   /*
+* VM_PFNMAP | VM_IO protect .fault() handler from being called
+* through interfaces like /proc/$pid/mem or
+* process_vm_{readv,writev}() as long as there's no .access()
+* in special_mapping_vmops().
+* For more details check_vma_flags() and __access_remote_vm()
+*/
+
+   WARN(1, "vvar_page accessed remotely");
+
+   return NULL;
+}
+#else
+static inline struct page *find_timens_vvar_page(struct vm_area_struct *vma)
+{
+   return NULL;
+}
 #endif
 
 static vm_fault_t vvar_fault(const struct vm_special_mapping *sm,
 struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-   if (vmf->pgoff == 0)
-   return vmf_insert_pfn(vma, vmf->address,
-   sym_to_pfn(vdso_data));
-   return VM_FAULT_SIGBUS;
+   struct page *timens_page = find_timens_vvar_page(vma);
+   unsigned long pfn;
+
+   switch (vmf->pgoff) {
+   case VVAR_DATA_PAGE_OFFSET:
+   if (timens_page)
+   pfn = page_to_pfn(timens_page);
+   else
+   pfn = sym_to_pfn(vdso_data);
+   break;
+#ifdef CONFIG_TIME_NS
+   case VVAR_TIMENS_PAGE_OFFSET:
+   /*
+* If a task belongs to a time namespace then a namespace
+* specific VVAR is mapped with the VVAR_DATA_PAGE_OFFSET and
+* the real VVAR page is mapped with the VVAR_TIMENS_PAGE_OFFSET
+* offset.
+* See also the comment near timens_setup_vdso_data().
+*/
+   if (!timens_page)
+   return VM_FAULT_SIGBUS;
+   pfn = sym_to_pfn(vdso_data);
+   break;
+#endif /* CONFIG_TIME_NS */
+   default:
+   return VM_FAULT_SIGBUS;
+   }
+
+   return vmf_insert_pfn(vma, vmf->address, pfn);
 }
 
 static int __setup_additional_pages(enum vdso_abi abi,
-- 
2.24.1



[PATCH v4 0/6] arm64: add the time namespace support

2020-06-16 Thread Andrei Vagin
Allocate the time namespace page among VVAR pages and add the logic
to handle faults on VVAR properly.

If a task belongs to a time namespace then the VVAR page which contains
the system wide VDSO data is replaced with a namespace specific page
which has the same layout as the VVAR page. That page has vdso_data->seq
set to 1 to enforce the slow path and vdso_data->clock_mode set to
VCLOCK_TIMENS to enforce the time namespace handling path.

The extra check in the case that vdso_data->seq is odd, e.g. a concurrent
update of the VDSO data is in progress, is not really affecting regular
tasks which are not part of a time namespace as the task is spin waiting
for the update to finish and vdso_data->seq to become even again.

If a time namespace task hits that code path, it invokes the corresponding
time getter function which retrieves the real VVAR page, reads host time
and then adds the offset for the requested clock which is stored in the
special VVAR page.

v2: Code cleanups suggested by Vincenzo.
v3: add a comment in __arch_get_timens_vdso_data.
v4: - fix an issue reported by the lkp robot.
- vvar has the same size with/without CONFIG_TIME_NAMESPACE, but the
  timens page isn't allocated on !CONFIG_TIME_NAMESPACE. This
  simplifies criu/vdso migration between different kernel configs.


Reviewed-by: Vincenzo Frascino 
Reviewed-by: Dmitry Safonov 
Cc: Thomas Gleixner 
Cc: Dmitry Safonov 

v4 on github (if someone prefers `git pull` to `git am`):
https://github.com/avagin/linux-task-diag/tree/arm64/timens-v4

Andrei Vagin (6):
  arm64/vdso: use the fault callback to map vvar pages
  arm64/vdso: Zap vvar pages when switching to a time namespace
  arm64/vdso: Add time namespace page
  arm64/vdso: Handle faults on timens page
  arm64/vdso: Restrict splitting VVAR VMA
  arm64: enable time namespace support

--
2.24.1



[PATCH 1/6] arm64/vdso: use the fault callback to map vvar pages

2020-06-16 Thread Andrei Vagin
This is required to support time namespaces where a time namespace data
page is different for each namespace.

Reviewed-by: Vincenzo Frascino 
Reviewed-by: Dmitry Safonov 
Signed-off-by: Andrei Vagin 
---
 arch/arm64/kernel/vdso.c | 24 +++-
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
index 4e016574bd91..b0aec4e8c9b4 100644
--- a/arch/arm64/kernel/vdso.c
+++ b/arch/arm64/kernel/vdso.c
@@ -108,28 +108,32 @@ static int __vdso_init(enum vdso_abi abi)
PAGE_SHIFT;
 
/* Allocate the vDSO pagelist, plus a page for the data. */
-   vdso_pagelist = kcalloc(vdso_info[abi].vdso_pages + 1,
+   vdso_pagelist = kcalloc(vdso_info[abi].vdso_pages,
sizeof(struct page *),
GFP_KERNEL);
if (vdso_pagelist == NULL)
return -ENOMEM;
 
-   /* Grab the vDSO data page. */
-   vdso_pagelist[0] = phys_to_page(__pa_symbol(vdso_data));
-
-
/* Grab the vDSO code pages. */
pfn = sym_to_pfn(vdso_info[abi].vdso_code_start);
 
for (i = 0; i < vdso_info[abi].vdso_pages; i++)
-   vdso_pagelist[i + 1] = pfn_to_page(pfn + i);
+   vdso_pagelist[i] = pfn_to_page(pfn + i);
 
-   vdso_info[abi].dm->pages = &vdso_pagelist[0];
-   vdso_info[abi].cm->pages = &vdso_pagelist[1];
+   vdso_info[abi].cm->pages = vdso_pagelist;
 
return 0;
 }
 
+static vm_fault_t vvar_fault(const struct vm_special_mapping *sm,
+struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+   if (vmf->pgoff == 0)
+   return vmf_insert_pfn(vma, vmf->address,
+   sym_to_pfn(vdso_data));
+   return VM_FAULT_SIGBUS;
+}
+
 static int __setup_additional_pages(enum vdso_abi abi,
struct mm_struct *mm,
struct linux_binprm *bprm,
@@ -150,7 +154,7 @@ static int __setup_additional_pages(enum vdso_abi abi,
}
 
ret = _install_special_mapping(mm, vdso_base, PAGE_SIZE,
-  VM_READ|VM_MAYREAD,
+  VM_READ|VM_MAYREAD|VM_PFNMAP,
   vdso_info[abi].dm);
if (IS_ERR(ret))
goto up_fail;
@@ -209,6 +213,7 @@ static struct vm_special_mapping aarch32_vdso_maps[] = {
 #ifdef CONFIG_COMPAT_VDSO
[AA32_MAP_VVAR] = {
.name = "[vvar]",
+   .fault = vvar_fault,
},
[AA32_MAP_VDSO] = {
.name = "[vdso]",
@@ -376,6 +381,7 @@ enum aarch64_map {
 static struct vm_special_mapping aarch64_vdso_maps[] __ro_after_init = {
[AA64_MAP_VVAR] = {
.name   = "[vvar]",
+   .fault = vvar_fault,
},
[AA64_MAP_VDSO] = {
.name   = "[vdso]",
-- 
2.24.1



[PATCH 5/6] arm64/vdso: Restrict splitting VVAR VMA

2020-06-16 Thread Andrei Vagin
Forbid splitting VVAR VMA resulting in a stricter ABI and reducing the
amount of corner-cases to consider while working further on VDSO time
namespace support.

As the offset from the timens page to the VVAR page is computed at compile
time, the pages in VVAR should stay together and not be partially
mremap()'ed.

Reviewed-by: Vincenzo Frascino 
Reviewed-by: Dmitry Safonov 
Signed-off-by: Andrei Vagin 
---
 arch/arm64/kernel/vdso.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
index 21bdd0d74f40..68e021324113 100644
--- a/arch/arm64/kernel/vdso.c
+++ b/arch/arm64/kernel/vdso.c
@@ -226,6 +226,17 @@ static vm_fault_t vvar_fault(const struct 
vm_special_mapping *sm,
return vmf_insert_pfn(vma, vmf->address, pfn);
 }
 
+static int vvar_mremap(const struct vm_special_mapping *sm,
+  struct vm_area_struct *new_vma)
+{
+   unsigned long new_size = new_vma->vm_end - new_vma->vm_start;
+
+   if (new_size != VVAR_NR_PAGES * PAGE_SIZE)
+   return -EINVAL;
+
+   return 0;
+}
+
 static int __setup_additional_pages(enum vdso_abi abi,
struct mm_struct *mm,
struct linux_binprm *bprm,
@@ -308,6 +319,7 @@ static struct vm_special_mapping aarch32_vdso_maps[] = {
[AA32_MAP_VVAR] = {
.name = "[vvar]",
.fault = vvar_fault,
+   .mremap = vvar_mremap,
},
[AA32_MAP_VDSO] = {
.name = "[vdso]",
@@ -476,6 +488,7 @@ static struct vm_special_mapping aarch64_vdso_maps[] 
__ro_after_init = {
[AA64_MAP_VVAR] = {
.name   = "[vvar]",
.fault = vvar_fault,
+   .mremap = vvar_mremap,
},
[AA64_MAP_VDSO] = {
.name   = "[vdso]",
-- 
2.24.1



[PATCH 6/6] arm64: enable time namespace support

2020-06-16 Thread Andrei Vagin
CONFIG_TIME_NS depends on GENERIC_VDSO_TIME_NS.

Reviewed-by: Vincenzo Frascino 
Reviewed-by: Dmitry Safonov 
Signed-off-by: Andrei Vagin 
---
 arch/arm64/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 31380da53689..e3956cbfd8a9 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -118,6 +118,7 @@ config ARM64
select GENERIC_STRNLEN_USER
select GENERIC_TIME_VSYSCALL
select GENERIC_GETTIMEOFDAY
+   select GENERIC_VDSO_TIME_NS
select HANDLE_DOMAIN_IRQ
select HARDIRQS_SW_RESEND
select HAVE_PCI
-- 
2.24.1



[PATCH 3/6] arm64/vdso: Add time namespace page

2020-06-16 Thread Andrei Vagin
Allocate the time namespace page among VVAR pages.  Provide
__arch_get_timens_vdso_data() helper for VDSO code to get the
code-relative position of VVARs on that special page.

If a task belongs to a time namespace then the VVAR page which contains
the system wide VDSO data is replaced with a namespace specific page
which has the same layout as the VVAR page. That page has vdso_data->seq
set to 1 to enforce the slow path and vdso_data->clock_mode set to
VCLOCK_TIMENS to enforce the time namespace handling path.

The extra check in the case that vdso_data->seq is odd, e.g. a concurrent
update of the VDSO data is in progress, is not really affecting regular
tasks which are not part of a time namespace as the task is spin waiting
for the update to finish and vdso_data->seq to become even again.

If a time namespace task hits that code path, it invokes the corresponding
time getter function which retrieves the real VVAR page, reads host time
and then adds the offset for the requested clock which is stored in the
special VVAR page.

Cc: Mark Rutland 
Reviewed-by: Vincenzo Frascino 
Reviewed-by: Dmitry Safonov 
Signed-off-by: Andrei Vagin 
---
 arch/arm64/include/asm/vdso.h |  2 ++
 .../include/asm/vdso/compat_gettimeofday.h| 12 
 arch/arm64/include/asm/vdso/gettimeofday.h|  8 
 arch/arm64/kernel/vdso.c  | 19 ---
 arch/arm64/kernel/vdso/vdso.lds.S |  5 -
 arch/arm64/kernel/vdso32/vdso.lds.S   |  5 -
 include/vdso/datapage.h   |  1 +
 7 files changed, 47 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/vdso.h b/arch/arm64/include/asm/vdso.h
index 07468428fd29..f99dcb94b438 100644
--- a/arch/arm64/include/asm/vdso.h
+++ b/arch/arm64/include/asm/vdso.h
@@ -12,6 +12,8 @@
  */
 #define VDSO_LBASE 0x0
 
+#define __VVAR_PAGES2
+
 #ifndef __ASSEMBLY__
 
 #include 
diff --git a/arch/arm64/include/asm/vdso/compat_gettimeofday.h 
b/arch/arm64/include/asm/vdso/compat_gettimeofday.h
index b6907ae78e53..b7c549d46d18 100644
--- a/arch/arm64/include/asm/vdso/compat_gettimeofday.h
+++ b/arch/arm64/include/asm/vdso/compat_gettimeofday.h
@@ -152,6 +152,18 @@ static __always_inline const struct vdso_data 
*__arch_get_vdso_data(void)
return ret;
 }
 
+#ifdef CONFIG_TIME_NS
+static __always_inline const struct vdso_data 
*__arch_get_timens_vdso_data(void)
+{
+   const struct vdso_data *ret;
+
+   /* See __arch_get_vdso_data(). */
+   asm volatile("mov %0, %1" : "=r"(ret) : "r"(_timens_data));
+
+   return ret;
+}
+#endif
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_VDSO_GETTIMEOFDAY_H */
diff --git a/arch/arm64/include/asm/vdso/gettimeofday.h 
b/arch/arm64/include/asm/vdso/gettimeofday.h
index afba6ba332f8..cf39eae5eaaf 100644
--- a/arch/arm64/include/asm/vdso/gettimeofday.h
+++ b/arch/arm64/include/asm/vdso/gettimeofday.h
@@ -96,6 +96,14 @@ const struct vdso_data *__arch_get_vdso_data(void)
return _vdso_data;
 }
 
+#ifdef CONFIG_TIME_NS
+static __always_inline
+const struct vdso_data *__arch_get_timens_vdso_data(void)
+{
+   return _timens_data;
+}
+#endif
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_VDSO_GETTIMEOFDAY_H */
diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
index df4bb736d28a..14d5b7642b62 100644
--- a/arch/arm64/kernel/vdso.c
+++ b/arch/arm64/kernel/vdso.c
@@ -40,6 +40,12 @@ enum vdso_abi {
 #endif /* CONFIG_COMPAT_VDSO */
 };
 
+enum vvar_pages {
+   VVAR_DATA_PAGE_OFFSET,
+   VVAR_TIMENS_PAGE_OFFSET,
+   VVAR_NR_PAGES,
+};
+
 struct vdso_abi_info {
const char *name;
const char *vdso_code_start;
@@ -126,6 +132,11 @@ static int __vdso_init(enum vdso_abi abi)
 }
 
 #ifdef CONFIG_TIME_NS
+struct vdso_data *arch_get_vdso_data(void *vvar_page)
+{
+   return (struct vdso_data *)(vvar_page);
+}
+
 /*
  * The vvar page layout depends on whether a task belongs to the root or
  * non-root time namespace. Whenever a task changes its namespace, the VVAR
@@ -175,9 +186,11 @@ static int __setup_additional_pages(enum vdso_abi abi,
unsigned long gp_flags = 0;
void *ret;
 
+   BUILD_BUG_ON(VVAR_NR_PAGES != __VVAR_PAGES);
+
vdso_text_len = vdso_info[abi].vdso_pages << PAGE_SHIFT;
/* Be sure to map the data page */
-   vdso_mapping_len = vdso_text_len + PAGE_SIZE;
+   vdso_mapping_len = vdso_text_len + VVAR_NR_PAGES * PAGE_SIZE;
 
vdso_base = get_unmapped_area(NULL, 0, vdso_mapping_len, 0, 0);
if (IS_ERR_VALUE(vdso_base)) {
@@ -185,7 +198,7 @@ static int __setup_additional_pages(enum vdso_abi abi,
goto up_fail;
}
 
-   ret = _install_special_mapping(mm, vdso_base, PAGE_SIZE,
+   ret = _install_special_mapping(mm, vdso_base, VVAR_NR_PAGES * PAGE_SIZE,
   VM_READ|VM_MAYREAD|VM_PFNMAP,
 

[PATCH 2/6] arm64/vdso: Zap vvar pages when switching to a time namespace

2020-06-16 Thread Andrei Vagin
The VVAR page layout depends on whether a task belongs to the root or
non-root time namespace. Whenever a task changes its namespace, the VVAR
page tables are cleared and then they will be re-faulted with a
corresponding layout.

Reviewed-by: Vincenzo Frascino 
Reviewed-by: Dmitry Safonov 
Signed-off-by: Andrei Vagin 
---
 arch/arm64/kernel/vdso.c | 32 
 1 file changed, 32 insertions(+)

diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
index b0aec4e8c9b4..df4bb736d28a 100644
--- a/arch/arm64/kernel/vdso.c
+++ b/arch/arm64/kernel/vdso.c
@@ -125,6 +125,38 @@ static int __vdso_init(enum vdso_abi abi)
return 0;
 }
 
+#ifdef CONFIG_TIME_NS
+/*
+ * The vvar page layout depends on whether a task belongs to the root or
+ * non-root time namespace. Whenever a task changes its namespace, the VVAR
+ * page tables are cleared and then they will re-faulted with a
+ * corresponding layout.
+ * See also the comment near timens_setup_vdso_data() for details.
+ */
+int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
+{
+   struct mm_struct *mm = task->mm;
+   struct vm_area_struct *vma;
+
+   if (mmap_write_lock_killable(mm))
+   return -EINTR;
+
+   for (vma = mm->mmap; vma; vma = vma->vm_next) {
+   unsigned long size = vma->vm_end - vma->vm_start;
+
+   if (vma_is_special_mapping(vma, vdso_info[VDSO_ABI_AA64].dm))
+   zap_page_range(vma, vma->vm_start, size);
+#ifdef CONFIG_COMPAT_VDSO
+   if (vma_is_special_mapping(vma, vdso_info[VDSO_ABI_AA32].dm))
+   zap_page_range(vma, vma->vm_start, size);
+#endif
+   }
+
+   mmap_write_unlock(mm);
+   return 0;
+}
+#endif
+
 static vm_fault_t vvar_fault(const struct vm_special_mapping *sm,
 struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-- 
2.24.1



Re: [PATCH v2 1/3] capabilities: Introduce CAP_CHECKPOINT_RESTORE

2020-06-10 Thread Andrei Vagin
On Tue, Jun 09, 2020 at 06:14:27PM +0200, Christian Brauner wrote:
> On Tue, Jun 09, 2020 at 09:06:27AM -0700, Andrei Vagin wrote:
> > On Tue, Jun 09, 2020 at 09:44:22AM +0200, Christian Brauner wrote:
> > > On Mon, Jun 08, 2020 at 08:42:21PM -0700, Andrei Vagin wrote:
...
> > > > PTRACE_O_SUSPEND_SECCOMP is needed for C/R and it is protected by
> > > > CAP_SYS_ADMIN too.
> > > 
> > > This is currently capable(CAP_SYS_ADMIN) (init_ns capable) why is it
> > > safe to allow unprivileged users to suspend security policies? That
> > > sounds like a bad idea.
> > 
...
> > I don't suggest to remove or
> > downgrade this capability check. The patch allows all c/r related
> > operations if the current has CAP_CHECKPOINT_RESTORE.
> > 
> > So in this case the check:
> >  if (!capable(CAP_SYS_ADMIN))
> >  return -EPERM;
> > 
> > will be converted in:
> >  if (!capable(CAP_SYS_ADMIN) && !capable(CAP_CHECKPOINT_RESTORE))
> >  return -EPERM;
> 
> Yeah, I got that but what's the goal here? Isn't it that you want to
> make it safe to install the criu binary with the CAP_CHECKPOINT_RESTORE
> fscap set so that unprivileged users can restore their own processes
> without creating a new user namespace or am I missing something? The
> use-cases in the cover-letter make it sound like that's what this is
> leading up to:
> > > > > * Checkpoint/Restore in an HPC environment in combination with a 
> > > > > resource
> > > > >   manager distributing jobs where users are always running as 
> > > > > non-root.
> > > > >   There is a desire to provide a way to checkpoint and restore long 
> > > > > running
> > > > >   jobs.
> > > > > * Container migration as non-root
> > > > > * We have been in contact with JVM developers who are integrating
> > > > >   CRIU into a Java VM to decrease the startup time. These 
> > > > > checkpoint/restore
> > > > >   applications are not meant to be running with CAP_SYS_ADMIN.
> 
> But maybe I'm just misunderstanding crucial bits (likely (TM)).

I think you understand this right. The goal is to make it possible for
unprivileged processes to use the C/R functionality. And for me, there are
two separate tasks here. The first one is how to allow unprivileged users
to use C/R from the root user namespace. This is what we are discussing
here.

And the other one is how to allow using the C/R functionality from
non-root user namespaces. The second task is about downgrading capable()
to ns_capable() for map_files and PTRACE_O_SUSPEND_SECCOMP.

Thanks,
Andrei


Re: [PATCH v2 1/3] capabilities: Introduce CAP_CHECKPOINT_RESTORE

2020-06-09 Thread Andrei Vagin
On Tue, Jun 09, 2020 at 09:44:22AM +0200, Christian Brauner wrote:
> On Mon, Jun 08, 2020 at 08:42:21PM -0700, Andrei Vagin wrote:
> > On Wed, Jun 03, 2020 at 06:23:26PM +0200, Adrian Reber wrote:
> > > This patch introduces CAP_CHECKPOINT_RESTORE, a new capability 
> > > facilitating
> > > checkpoint/restore for non-root users.
> > > 
> > > Over the last years, The CRIU (Checkpoint/Restore In Userspace) team has 
> > > been
> > > asked numerous times if it is possible to checkpoint/restore a process as
> > > non-root. The answer usually was: 'almost'.
> > > 
> > > The main blocker to restore a process as non-root was to control the PID 
> > > of the
> > > restored process. This feature available via the clone3 system call, or 
> > > via
> > > /proc/sys/kernel/ns_last_pid is unfortunately guarded by CAP_SYS_ADMIN.
> > > 
> > > In the past two years, requests for non-root checkpoint/restore have 
> > > increased
> > > due to the following use cases:
> > > * Checkpoint/Restore in an HPC environment in combination with a resource
> > >   manager distributing jobs where users are always running as non-root.
> > >   There is a desire to provide a way to checkpoint and restore long 
> > > running
> > >   jobs.
> > > * Container migration as non-root
> > > * We have been in contact with JVM developers who are integrating
> > >   CRIU into a Java VM to decrease the startup time. These 
> > > checkpoint/restore
> > >   applications are not meant to be running with CAP_SYS_ADMIN.
> > > 
> > ...
> > > 
> > > The introduced capability allows to:
> > > * Control PIDs when the current user is CAP_CHECKPOINT_RESTORE capable
> > >   for the corresponding PID namespace via ns_last_pid/clone3.
> > > * Open files in /proc/pid/map_files when the current user is
> > >   CAP_CHECKPOINT_RESTORE capable in the root namespace, useful for 
> > > recovering
> > >   files that are unreachable via the file system such as deleted files, 
> > > or memfd
> > >   files.
> > 
> > PTRACE_O_SUSPEND_SECCOMP is needed for C/R and it is protected by
> > CAP_SYS_ADMIN too.
> 
> This is currently capable(CAP_SYS_ADMIN) (init_ns capable) why is it
> safe to allow unprivileged users to suspend security policies? That
> sounds like a bad idea.

Why do you think so badly of me ;). I don't suggest removing or
downgrading this capability check. The patch allows all C/R-related
operations if the current task has CAP_CHECKPOINT_RESTORE.

So in this case the check:
 if (!capable(CAP_SYS_ADMIN))
 return -EPERM;

will be converted into:
 if (!capable(CAP_SYS_ADMIN) && !capable(CAP_CHECKPOINT_RESTORE))
 return -EPERM;

If we want to think about how to convert this capable() check to
ns_capable(), we need to do that in a separate series. And the logic may
be that a process can suspend only filters that have been added from the
current user namespace or its descendants. But we need to think about
this more carefully; maybe there are more pitfalls.
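
Just to sketch the direction (nothing below is from the posted patches;
seccomp filters do not currently record the user namespace they were
installed from, so the ->user_ns field here is purely hypothetical):

/* A filter may be suspended only if it was installed from the caller's
 * user namespace or one of its descendants. */
static bool may_suspend_filter(const struct seccomp_filter *filter)
{
	/* filter->user_ns: hypothetical field, see the note above */
	return in_userns(current_user_ns(), filter->user_ns);
}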


Re: [PATCH v2 1/3] capabilities: Introduce CAP_CHECKPOINT_RESTORE

2020-06-08 Thread Andrei Vagin
On Wed, Jun 03, 2020 at 06:23:26PM +0200, Adrian Reber wrote:
> This patch introduces CAP_CHECKPOINT_RESTORE, a new capability facilitating
> checkpoint/restore for non-root users.
> 
> Over the last years, The CRIU (Checkpoint/Restore In Userspace) team has been
> asked numerous times if it is possible to checkpoint/restore a process as
> non-root. The answer usually was: 'almost'.
> 
> The main blocker to restore a process as non-root was to control the PID of 
> the
> restored process. This feature available via the clone3 system call, or via
> /proc/sys/kernel/ns_last_pid is unfortunately guarded by CAP_SYS_ADMIN.
> 
> In the past two years, requests for non-root checkpoint/restore have increased
> due to the following use cases:
> * Checkpoint/Restore in an HPC environment in combination with a resource
>   manager distributing jobs where users are always running as non-root.
>   There is a desire to provide a way to checkpoint and restore long running
>   jobs.
> * Container migration as non-root
> * We have been in contact with JVM developers who are integrating
>   CRIU into a Java VM to decrease the startup time. These checkpoint/restore
>   applications are not meant to be running with CAP_SYS_ADMIN.
> 
...
> 
> The introduced capability allows to:
> * Control PIDs when the current user is CAP_CHECKPOINT_RESTORE capable
>   for the corresponding PID namespace via ns_last_pid/clone3.
> * Open files in /proc/pid/map_files when the current user is
>   CAP_CHECKPOINT_RESTORE capable in the root namespace, useful for recovering
>   files that are unreachable via the file system such as deleted files, or 
> memfd
>   files.

PTRACE_O_SUSPEND_SECCOMP is needed for C/R and it is protected by
CAP_SYS_ADMIN too.

Thanks,
Andrei


Re: [PATCH] capabilities: Introduce CAP_RESTORE

2020-06-07 Thread Andrei Vagin
> >
> > I would argue that setting the current process exe file check should just 
> > be reduced to a "can you ptrace a children" check.
> > Here's why: any process can masquerade into another executable with ptrace.
> > One can fork a child, ptrace it, have the child execve("target_exe"), then 
> > replace its memory content with an arbitrary program.
>
> Then it should probably be relaxed to CAP_SYS_PTRACE in the user
> namespace and not CAP_CHECKPOINT_RESTORE. (But apparently you also have
> a way of achieving what you want anyway. Imho, it's not necessarily
> wrong to require a bit more work when you want something like fully
> unprivileged c/r that's a rather special interest group.)
>
> > With CRIU's libcompel parasite mechanism (https://criu.org/Compel) this is 
> > fairly easy to implement.
> > In fact, we could modify CRIU to do just that (but with a fair amount of 
> > efforts due to the way CRIU is written),
> > and not rely on being able to SET_MM_EXE_FILE via prctl(). In turn, that 
> > would give an easy way to masquerade any process
> > into another one, provided that one can ptrace a child.
> >

I think you misunderstand this. In the case of malicious processes,
when only one or two processes must be hidden, they can use this trick
with execve+ptrace, and it is relatively simple. But in the case of
CRIU, where we need to restore a process tree with CoW memory
mappings, shared mappings, file descriptors, etc., this trick with
execve+ptrace doesn't work at all. We are in a weird situation where
malicious processes can do these operations, but a useful tool like
CRIU needs to run with extra capabilities, which actually reduces the
security of the entire system.


Re: [PATCH RESEND v3 0/6] arm64: add the time namespace support

2020-06-06 Thread Andrei Vagin
On Fri, Jun 05, 2020 at 11:49:04AM +0100, Mark Rutland wrote:
> Hi Andrei,
> 
> As a heads up, in mainline the arm64 vdso code has been refactored since
> v5.7, and this series will need to be rebased atop. Given we're in the
> middle of the merge window, I would suggest waiting until rc1 before
> posting a rebased series.


Ok. I will resend this series after rc1. Thanks!


[PATCH 3/6 v4] arm64/vdso: Add time namespace page

2020-06-05 Thread Andrei Vagin
Allocate the time namespace page among VVAR pages.  Provide
__arch_get_timens_vdso_data() helper for VDSO code to get the
code-relative position of VVARs on that special page.

If a task belongs to a time namespace then the VVAR page which contains
the system wide VDSO data is replaced with a namespace specific page
which has the same layout as the VVAR page. That page has vdso_data->seq
set to 1 to enforce the slow path and vdso_data->clock_mode set to
VCLOCK_TIMENS to enforce the time namespace handling path.

The extra check in the case that vdso_data->seq is odd, e.g. a concurrent
update of the VDSO data is in progress, is not really affecting regular
tasks which are not part of a time namespace as the task is spin waiting
for the update to finish and vdso_data->seq to become even again.

If a time namespace task hits that code path, it invokes the corresponding
time getter function which retrieves the real VVAR page, reads host time
and then adds the offset for the requested clock which is stored in the
special VVAR page.

Cc: Mark Rutland 
Reviewed-by: Vincenzo Frascino 
Reviewed-by: Dmitry Safonov 
Signed-off-by: Andrei Vagin 
---
v4: - fix an issue reported by the lkp robot.
- vvar has the same size with/without CONFIG_TIME_NAMESPACE, but the
  timens page isn't allocated on !CONFIG_TIME_NAMESPACE. This
  simplifies criu/vdso migration between different kernel configs.

 arch/arm64/include/asm/vdso.h |  2 ++
 .../include/asm/vdso/compat_gettimeofday.h| 12 +++
 arch/arm64/include/asm/vdso/gettimeofday.h|  8 
 arch/arm64/kernel/vdso.c  | 20 ---
 arch/arm64/kernel/vdso/vdso.lds.S |  5 -
 arch/arm64/kernel/vdso32/vdso.lds.S   |  5 -
 include/vdso/datapage.h   |  1 +
 7 files changed, 48 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/vdso.h b/arch/arm64/include/asm/vdso.h
index 07468428fd29..f99dcb94b438 100644
--- a/arch/arm64/include/asm/vdso.h
+++ b/arch/arm64/include/asm/vdso.h
@@ -12,6 +12,8 @@
  */
 #define VDSO_LBASE 0x0
 
+#define __VVAR_PAGES2
+
 #ifndef __ASSEMBLY__
 
 #include 
diff --git a/arch/arm64/include/asm/vdso/compat_gettimeofday.h 
b/arch/arm64/include/asm/vdso/compat_gettimeofday.h
index b6907ae78e53..b7c549d46d18 100644
--- a/arch/arm64/include/asm/vdso/compat_gettimeofday.h
+++ b/arch/arm64/include/asm/vdso/compat_gettimeofday.h
@@ -152,6 +152,18 @@ static __always_inline const struct vdso_data 
*__arch_get_vdso_data(void)
return ret;
 }
 
+#ifdef CONFIG_TIME_NS
+static __always_inline const struct vdso_data 
*__arch_get_timens_vdso_data(void)
+{
+   const struct vdso_data *ret;
+
+   /* See __arch_get_vdso_data(). */
+   asm volatile("mov %0, %1" : "=r"(ret) : "r"(_timens_data));
+
+   return ret;
+}
+#endif
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_VDSO_GETTIMEOFDAY_H */
diff --git a/arch/arm64/include/asm/vdso/gettimeofday.h 
b/arch/arm64/include/asm/vdso/gettimeofday.h
index afba6ba332f8..cf39eae5eaaf 100644
--- a/arch/arm64/include/asm/vdso/gettimeofday.h
+++ b/arch/arm64/include/asm/vdso/gettimeofday.h
@@ -96,6 +96,14 @@ const struct vdso_data *__arch_get_vdso_data(void)
return _vdso_data;
 }
 
+#ifdef CONFIG_TIME_NS
+static __always_inline
+const struct vdso_data *__arch_get_timens_vdso_data(void)
+{
+   return _timens_data;
+}
+#endif
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_VDSO_GETTIMEOFDAY_H */
diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
index 33df3cdf7982..fd609120386d 100644
--- a/arch/arm64/kernel/vdso.c
+++ b/arch/arm64/kernel/vdso.c
@@ -46,6 +46,12 @@ enum arch_vdso_type {
 #define VDSO_TYPES (ARM64_VDSO + 1)
 #endif /* CONFIG_COMPAT_VDSO */
 
+enum vvar_pages {
+   VVAR_DATA_PAGE_OFFSET,
+   VVAR_TIMENS_PAGE_OFFSET,
+   VVAR_NR_PAGES,
+};
+
 struct __vdso_abi {
const char *name;
const char *vdso_code_start;
@@ -81,6 +87,7 @@ static union {
 } vdso_data_store __page_aligned_data;
 struct vdso_data *vdso_data = vdso_data_store.data;
 
+
 static int __vdso_remap(enum arch_vdso_type arch_index,
const struct vm_special_mapping *sm,
struct vm_area_struct *new_vma)
@@ -132,6 +139,11 @@ static int __vdso_init(enum arch_vdso_type arch_index)
 }
 
 #ifdef CONFIG_TIME_NS
+struct vdso_data *arch_get_vdso_data(void *vvar_page)
+{
+   return (struct vdso_data *)(vvar_page);
+}
+
 /*
  * The vvar page layout depends on whether a task belongs to the root or
  * non-root time namespace. Whenever a task changes its namespace, the VVAR
@@ -180,9 +192,11 @@ static int __setup_additional_pages(enum arch_vdso_type 
arch_index,
unsigned long vdso_base, vdso_text_len, vdso_mapping_len;
void *ret;
 
+   BUILD_BUG_ON(VVAR_NR_PAGES != __VVAR_PAGES);
+
vdso_text_len = vdso_lookup[arch_index].vd

[PATCH 1/6] arm64/vdso: use the fault callback to map vvar pages

2020-06-02 Thread Andrei Vagin
This is required to support time namespaces where a time namespace data
page is different for each namespace.

Reviewed-by: Vincenzo Frascino 
Signed-off-by: Andrei Vagin 
---
 arch/arm64/kernel/vdso.c | 24 +++-
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
index 033a48f30dbb..031ee1a8d4a8 100644
--- a/arch/arm64/kernel/vdso.c
+++ b/arch/arm64/kernel/vdso.c
@@ -114,28 +114,32 @@ static int __vdso_init(enum arch_vdso_type arch_index)
PAGE_SHIFT;
 
/* Allocate the vDSO pagelist, plus a page for the data. */
-   vdso_pagelist = kcalloc(vdso_lookup[arch_index].vdso_pages + 1,
+   vdso_pagelist = kcalloc(vdso_lookup[arch_index].vdso_pages,
sizeof(struct page *),
GFP_KERNEL);
if (vdso_pagelist == NULL)
return -ENOMEM;
 
-   /* Grab the vDSO data page. */
-   vdso_pagelist[0] = phys_to_page(__pa_symbol(vdso_data));
-
-
/* Grab the vDSO code pages. */
pfn = sym_to_pfn(vdso_lookup[arch_index].vdso_code_start);
 
for (i = 0; i < vdso_lookup[arch_index].vdso_pages; i++)
-   vdso_pagelist[i + 1] = pfn_to_page(pfn + i);
+   vdso_pagelist[i] = pfn_to_page(pfn + i);
 
-   vdso_lookup[arch_index].dm->pages = &vdso_pagelist[0];
-   vdso_lookup[arch_index].cm->pages = &vdso_pagelist[1];
+   vdso_lookup[arch_index].cm->pages = vdso_pagelist;
 
return 0;
 }
 
+static vm_fault_t vvar_fault(const struct vm_special_mapping *sm,
+struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+   if (vmf->pgoff == 0)
+   return vmf_insert_pfn(vma, vmf->address,
+   sym_to_pfn(vdso_data));
+   return VM_FAULT_SIGBUS;
+}
+
 static int __setup_additional_pages(enum arch_vdso_type arch_index,
struct mm_struct *mm,
struct linux_binprm *bprm,
@@ -155,7 +159,7 @@ static int __setup_additional_pages(enum arch_vdso_type 
arch_index,
}
 
ret = _install_special_mapping(mm, vdso_base, PAGE_SIZE,
-  VM_READ|VM_MAYREAD,
+  VM_READ|VM_MAYREAD|VM_PFNMAP,
   vdso_lookup[arch_index].dm);
if (IS_ERR(ret))
goto up_fail;
@@ -215,6 +219,7 @@ static struct vm_special_mapping aarch32_vdso_spec[C_PAGES] 
= {
 #ifdef CONFIG_COMPAT_VDSO
{
.name = "[vvar]",
+   .fault = vvar_fault,
},
{
.name = "[vdso]",
@@ -385,6 +390,7 @@ static int vdso_mremap(const struct vm_special_mapping *sm,
 static struct vm_special_mapping vdso_spec[A_PAGES] __ro_after_init = {
{
.name   = "[vvar]",
+   .fault = vvar_fault,
},
{
.name   = "[vdso]",
-- 
2.24.1



[PATCH RESEND v3 0/6] arm64: add the time namespace support

2020-06-02 Thread Andrei Vagin
Allocate the time namespace page among VVAR pages and add the logic
to handle faults on VVAR properly.

If a task belongs to a time namespace then the VVAR page which contains
the system wide VDSO data is replaced with a namespace specific page
which has the same layout as the VVAR page. That page has vdso_data->seq
set to 1 to enforce the slow path and vdso_data->clock_mode set to
VCLOCK_TIMENS to enforce the time namespace handling path.

The extra check in the case that vdso_data->seq is odd, e.g. a concurrent
update of the VDSO data is in progress, is not really affecting regular
tasks which are not part of a time namespace as the task is spin waiting
for the update to finish and vdso_data->seq to become even again.

If a time namespace task hits that code path, it invokes the corresponding
time getter function which retrieves the real VVAR page, reads host time
and then adds the offset for the requested clock which is stored in the
special VVAR page.
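
For reference, this is roughly how the feature is consumed from user
space once the series is applied (a minimal sketch, assuming the
CLONE_NEWTIME flag and the /proc/self/timens_offsets interface from the
core time namespace series; error handling trimmed):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>

#ifndef CLONE_NEWTIME
#define CLONE_NEWTIME 0x00000080
#endif

int main(void)
{
	struct timespec ts;

	/* Children of this task will live in a new time namespace. */
	if (unshare(CLONE_NEWTIME))
		return 1;

	/* Shift CLOCK_MONOTONIC by 10 days for the new namespace. */
	FILE *f = fopen("/proc/self/timens_offsets", "w");
	fprintf(f, "monotonic %d 0\n", 10 * 24 * 60 * 60);
	fclose(f);

	if (fork() == 0) {
		/* The child reads the namespace-specific vvar page. */
		clock_gettime(CLOCK_MONOTONIC, &ts);
		printf("monotonic in timens: %ld\n", (long)ts.tv_sec);
		return 0;
	}
	wait(NULL);
	return 0;
}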

v2: Code cleanups suggested by Vincenzo.
v3: add a comment in __arch_get_timens_vdso_data.

Reviewed-by: Vincenzo Frascino 
Cc: Thomas Gleixner 
Cc: Dmitry Safonov 

v3 on github (if someone prefers `git pull` to `git am`):
https://github.com/avagin/linux-task-diag/tree/arm64/timens-v3

Andrei Vagin (6):
  arm64/vdso: use the fault callback to map vvar pages
  arm64/vdso: Zap vvar pages when switching to a time namespace
  arm64/vdso: Add time namespace page
  arm64/vdso: Handle faults on timens page
  arm64/vdso: Restrict splitting VVAR VMA
  arm64: enable time namespace support

 arch/arm64/Kconfig|   1 +
 .../include/asm/vdso/compat_gettimeofday.h|  11 ++
 arch/arm64/include/asm/vdso/gettimeofday.h|   8 ++
 arch/arm64/kernel/vdso.c  | 134 --
 arch/arm64/kernel/vdso/vdso.lds.S |   3 +-
 arch/arm64/kernel/vdso32/vdso.lds.S   |   3 +-
 include/vdso/datapage.h   |   1 +
 7 files changed, 147 insertions(+), 14 deletions(-)

-- 
2.24.1



[PATCH 6/6] arm64: enable time namespace support

2020-06-02 Thread Andrei Vagin
CONFIG_TIME_NS depends on GENERIC_VDSO_TIME_NS.

Reviewed-by: Vincenzo Frascino 
Signed-off-by: Andrei Vagin 
---
 arch/arm64/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 5d513f461957..27d7e4ed1c93 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -111,6 +111,7 @@ config ARM64
select GENERIC_STRNLEN_USER
select GENERIC_TIME_VSYSCALL
select GENERIC_GETTIMEOFDAY
+   select GENERIC_VDSO_TIME_NS
select HANDLE_DOMAIN_IRQ
select HARDIRQS_SW_RESEND
select HAVE_PCI
-- 
2.24.1



[PATCH 4/6] arm64/vdso: Handle faults on timens page

2020-06-02 Thread Andrei Vagin
If a task belongs to a time namespace then the VVAR page which contains
the system wide VDSO data is replaced with a namespace specific page
which has the same layout as the VVAR page.

Reviewed-by: Vincenzo Frascino 
Signed-off-by: Andrei Vagin 
---
 arch/arm64/kernel/vdso.c | 57 +---
 1 file changed, 53 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
index 1fa6f9362e56..f3baecd8edfb 100644
--- a/arch/arm64/kernel/vdso.c
+++ b/arch/arm64/kernel/vdso.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -175,15 +176,63 @@ int vdso_join_timens(struct task_struct *task, struct 
time_namespace *ns)
up_write(&mm->mmap_sem);
return 0;
 }
+
+static struct page *find_timens_vvar_page(struct vm_area_struct *vma)
+{
+   if (likely(vma->vm_mm == current->mm))
+   return current->nsproxy->time_ns->vvar_page;
+
+   /*
+* VM_PFNMAP | VM_IO protect .fault() handler from being called
+* through interfaces like /proc/$pid/mem or
+* process_vm_{readv,writev}() as long as there's no .access()
+* in special_mapping_vmops().
+* For more details check_vma_flags() and __access_remote_vm()
+*/
+
+   WARN(1, "vvar_page accessed remotely");
+
+   return NULL;
+}
+#else
+static inline struct page *find_timens_vvar_page(struct vm_area_struct *vma)
+{
+   return NULL;
+}
 #endif
 
 static vm_fault_t vvar_fault(const struct vm_special_mapping *sm,
 struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-   if (vmf->pgoff == 0)
-   return vmf_insert_pfn(vma, vmf->address,
-   sym_to_pfn(vdso_data));
-   return VM_FAULT_SIGBUS;
+   struct page *timens_page = find_timens_vvar_page(vma);
+   unsigned long pfn;
+
+   switch (vmf->pgoff) {
+   case VVAR_DATA_PAGE_OFFSET:
+   if (timens_page)
+   pfn = page_to_pfn(timens_page);
+   else
+   pfn = sym_to_pfn(vdso_data);
+   break;
+#ifdef CONFIG_TIME_NS
+   case VVAR_TIMENS_PAGE_OFFSET:
+   /*
+* If a task belongs to a time namespace then a namespace
+* specific VVAR is mapped with the VVAR_DATA_PAGE_OFFSET and
+* the real VVAR page is mapped with the VVAR_TIMENS_PAGE_OFFSET
+* offset.
+* See also the comment near timens_setup_vdso_data().
+*/
+   if (!timens_page)
+   return VM_FAULT_SIGBUS;
+   pfn = sym_to_pfn(vdso_data);
+   break;
+#endif /* CONFIG_TIME_NS */
+   default:
+   return VM_FAULT_SIGBUS;
+   }
+
+   return vmf_insert_pfn(vma, vmf->address, pfn);
 }
 
 static int __setup_additional_pages(enum arch_vdso_type arch_index,
-- 
2.24.1



[PATCH 3/6] arm64/vdso: Add time namespace page

2020-06-02 Thread Andrei Vagin
Allocate the time namespace page among VVAR pages.  Provide
__arch_get_timens_vdso_data() helper for VDSO code to get the
code-relative position of VVARs on that special page.

If a task belongs to a time namespace then the VVAR page which contains
the system wide VDSO data is replaced with a namespace specific page
which has the same layout as the VVAR page. That page has vdso_data->seq
set to 1 to enforce the slow path and vdso_data->clock_mode set to
VCLOCK_TIMENS to enforce the time namespace handling path.

The extra check in the case that vdso_data->seq is odd, e.g. a concurrent
update of the VDSO data is in progress, is not really affecting regular
tasks which are not part of a time namespace as the task is spin waiting
for the update to finish and vdso_data->seq to become even again.

If a time namespace task hits that code path, it invokes the corresponding
time getter function which retrieves the real VVAR page, reads host time
and then adds the offset for the requested clock which is stored in the
special VVAR page.
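
To illustrate how the seq == 1 / VCLOCK_TIMENS trick is consumed, here
is a rough and abbreviated sketch of the relevant part of the generic
do_hres() path in lib/vdso (see lib/vdso/gettimeofday.c for the real
code; this is not the arm64-specific part of this patch):

	do {
		/*
		 * A task in a time namespace has a special VVAR page
		 * installed with vd->seq set to 1 and vd->clock_mode set
		 * to VCLOCK_TIMENS, so it always enters the "seq is odd"
		 * branch and is redirected to the timens-aware getter.
		 * Regular tasks only get here while a concurrent update
		 * of the vdso data is in progress.
		 */
		while (unlikely((seq = READ_ONCE(vd->seq)) & 1)) {
			if (IS_ENABLED(CONFIG_TIME_NS) &&
			    vd->clock_mode == VCLOCK_TIMENS)
				return do_hres_timens(vd, clk, ts);
			cpu_relax();
		}
		smp_rmb();

		/* ... normal high-resolution read of vd->basetime[clk] ... */
	} while (unlikely(vdso_read_retry(vd, seq)));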

Cc: Mark Rutland 
Reviewed-by: Vincenzo Frascino 
Signed-off-by: Andrei Vagin 
---
 arch/arm64/include/asm/vdso.h |  6 +
 .../include/asm/vdso/compat_gettimeofday.h| 12 ++
 arch/arm64/include/asm/vdso/gettimeofday.h|  8 +++
 arch/arm64/kernel/vdso.c  | 22 ---
 arch/arm64/kernel/vdso/vdso.lds.S |  5 -
 arch/arm64/kernel/vdso32/vdso.lds.S   |  5 -
 include/vdso/datapage.h   |  1 +
 7 files changed, 54 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/vdso.h b/arch/arm64/include/asm/vdso.h
index 07468428fd29..351c145d3808 100644
--- a/arch/arm64/include/asm/vdso.h
+++ b/arch/arm64/include/asm/vdso.h
@@ -12,6 +12,12 @@
  */
 #define VDSO_LBASE 0x0
 
+#ifdef CONFIG_TIME_NS
+#define __VVAR_PAGES2
+#else
+#define __VVAR_PAGES1
+#endif
+
 #ifndef __ASSEMBLY__
 
 #include 
diff --git a/arch/arm64/include/asm/vdso/compat_gettimeofday.h 
b/arch/arm64/include/asm/vdso/compat_gettimeofday.h
index b6907ae78e53..b7c549d46d18 100644
--- a/arch/arm64/include/asm/vdso/compat_gettimeofday.h
+++ b/arch/arm64/include/asm/vdso/compat_gettimeofday.h
@@ -152,6 +152,18 @@ static __always_inline const struct vdso_data 
*__arch_get_vdso_data(void)
return ret;
 }
 
+#ifdef CONFIG_TIME_NS
+static __always_inline const struct vdso_data 
*__arch_get_timens_vdso_data(void)
+{
+   const struct vdso_data *ret;
+
+   /* See __arch_get_vdso_data(). */
+   asm volatile("mov %0, %1" : "=r"(ret) : "r"(_timens_data));
+
+   return ret;
+}
+#endif
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_VDSO_GETTIMEOFDAY_H */
diff --git a/arch/arm64/include/asm/vdso/gettimeofday.h 
b/arch/arm64/include/asm/vdso/gettimeofday.h
index afba6ba332f8..cf39eae5eaaf 100644
--- a/arch/arm64/include/asm/vdso/gettimeofday.h
+++ b/arch/arm64/include/asm/vdso/gettimeofday.h
@@ -96,6 +96,14 @@ const struct vdso_data *__arch_get_vdso_data(void)
return _vdso_data;
 }
 
+#ifdef CONFIG_TIME_NS
+static __always_inline
+const struct vdso_data *__arch_get_timens_vdso_data(void)
+{
+   return _timens_data;
+}
+#endif
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_VDSO_GETTIMEOFDAY_H */
diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
index 33df3cdf7982..1fa6f9362e56 100644
--- a/arch/arm64/kernel/vdso.c
+++ b/arch/arm64/kernel/vdso.c
@@ -46,6 +46,14 @@ enum arch_vdso_type {
 #define VDSO_TYPES (ARM64_VDSO + 1)
 #endif /* CONFIG_COMPAT_VDSO */
 
+enum vvar_pages {
+   VVAR_DATA_PAGE_OFFSET,
+#ifdef CONFIG_TIME_NS
+   VVAR_TIMENS_PAGE_OFFSET,
+#endif /* CONFIG_TIME_NS */
+   VVAR_NR_PAGES,
+};
+
 struct __vdso_abi {
const char *name;
const char *vdso_code_start;
@@ -81,6 +89,12 @@ static union {
 } vdso_data_store __page_aligned_data;
 struct vdso_data *vdso_data = vdso_data_store.data;
 
+
+struct vdso_data *arch_get_vdso_data(void *vvar_page)
+{
+   return (struct vdso_data *)(vvar_page);
+}
+
 static int __vdso_remap(enum arch_vdso_type arch_index,
const struct vm_special_mapping *sm,
struct vm_area_struct *new_vma)
@@ -180,9 +194,11 @@ static int __setup_additional_pages(enum arch_vdso_type 
arch_index,
unsigned long vdso_base, vdso_text_len, vdso_mapping_len;
void *ret;
 
+   BUILD_BUG_ON(VVAR_NR_PAGES != __VVAR_PAGES);
+
vdso_text_len = vdso_lookup[arch_index].vdso_pages << PAGE_SHIFT;
/* Be sure to map the data page */
-   vdso_mapping_len = vdso_text_len + PAGE_SIZE;
+   vdso_mapping_len = vdso_text_len + VVAR_NR_PAGES * PAGE_SIZE;
 
vdso_base = get_unmapped_area(NULL, 0, vdso_mapping_len, 0, 0);
if (IS_ERR_VALUE(vdso_base)) {
@@ -190,13 +206,13 @@ static int __setup_additional_pages(enum arch_vdso_type 
arch_index,
 

[PATCH 2/6] arm64/vdso: Zap vvar pages when switching to a time namespace

2020-06-02 Thread Andrei Vagin
The VVAR page layout depends on whether a task belongs to the root or
non-root time namespace. Whenever a task changes its namespace, the VVAR
page tables are cleared and then they will be re-faulted with a
corresponding layout.

Reviewed-by: Vincenzo Frascino 
Signed-off-by: Andrei Vagin 
---
 arch/arm64/kernel/vdso.c | 32 
 1 file changed, 32 insertions(+)

diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
index 031ee1a8d4a8..33df3cdf7982 100644
--- a/arch/arm64/kernel/vdso.c
+++ b/arch/arm64/kernel/vdso.c
@@ -131,6 +131,38 @@ static int __vdso_init(enum arch_vdso_type arch_index)
return 0;
 }
 
+#ifdef CONFIG_TIME_NS
+/*
+ * The vvar page layout depends on whether a task belongs to the root or
+ * non-root time namespace. Whenever a task changes its namespace, the VVAR
+ * page tables are cleared and then they will be re-faulted with a
+ * corresponding layout.
+ * See also the comment near timens_setup_vdso_data() for details.
+ */
+int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
+{
+   struct mm_struct *mm = task->mm;
+   struct vm_area_struct *vma;
+
+   if (down_write_killable(&mm->mmap_sem))
+   return -EINTR;
+
+   for (vma = mm->mmap; vma; vma = vma->vm_next) {
+   unsigned long size = vma->vm_end - vma->vm_start;
+
+   if (vma_is_special_mapping(vma, vdso_lookup[ARM64_VDSO].dm))
+   zap_page_range(vma, vma->vm_start, size);
+#ifdef CONFIG_COMPAT_VDSO
+   if (vma_is_special_mapping(vma, vdso_lookup[ARM64_VDSO32].dm))
+   zap_page_range(vma, vma->vm_start, size);
+#endif
+   }
+
+   up_write(&mm->mmap_sem);
+   return 0;
+}
+#endif
+
 static vm_fault_t vvar_fault(const struct vm_special_mapping *sm,
 struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-- 
2.24.1



[PATCH 5/6] arm64/vdso: Restrict splitting VVAR VMA

2020-06-02 Thread Andrei Vagin
Forbid splitting the VVAR VMA, resulting in a stricter ABI and reducing the
number of corner cases to consider while working further on VDSO time
namespace support.

As the offset from timens to VVAR page is computed at compile time, the pages
in VVAR should stay together and must not be partially mremap()'ed.

Reviewed-by: Vincenzo Frascino 
Signed-off-by: Andrei Vagin 
---
 arch/arm64/kernel/vdso.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
index f3baecd8edfb..8b17d7d10729 100644
--- a/arch/arm64/kernel/vdso.c
+++ b/arch/arm64/kernel/vdso.c
@@ -235,6 +235,17 @@ static vm_fault_t vvar_fault(const struct 
vm_special_mapping *sm,
return vmf_insert_pfn(vma, vmf->address, pfn);
 }
 
+static int vvar_mremap(const struct vm_special_mapping *sm,
+  struct vm_area_struct *new_vma)
+{
+   unsigned long new_size = new_vma->vm_end - new_vma->vm_start;
+
+   if (new_size != VVAR_NR_PAGES * PAGE_SIZE)
+   return -EINVAL;
+
+   return 0;
+}
+
 static int __setup_additional_pages(enum arch_vdso_type arch_index,
struct mm_struct *mm,
struct linux_binprm *bprm,
@@ -317,6 +328,7 @@ static struct vm_special_mapping aarch32_vdso_spec[C_PAGES] 
= {
{
.name = "[vvar]",
.fault = vvar_fault,
+   .mremap = vvar_mremap,
},
{
.name = "[vdso]",
@@ -488,6 +500,7 @@ static struct vm_special_mapping vdso_spec[A_PAGES] 
__ro_after_init = {
{
.name   = "[vvar]",
.fault = vvar_fault,
+   .mremap = vvar_mremap,
},
{
.name   = "[vdso]",
-- 
2.24.1



Re: [PATCH] capabilities: Introduce CAP_RESTORE

2020-05-22 Thread Andrei Vagin
On Fri, May 22, 2020 at 09:40:37AM -0700, Casey Schaufler wrote:
> On 5/21/2020 10:53 PM, Adrian Reber wrote:
> > There are probably a few more things guarded by CAP_SYS_ADMIN required
> > to run checkpoint/restore as non-root,
> 
> If you need CAP_SYS_ADMIN anyway you're not gaining anything by
> separating out CAP_RESTORE.
> 
> >  but by applying this patch I can
> > already checkpoint and restore processes as non-root. As there are
> > already multiple workarounds I would prefer to do it correctly in the
> > kernel to avoid that CRIU users are starting to invent more workarounds.
> 
> You've presented a couple of really inappropriate implementations
> that would qualify as workarounds. But the other two are completely
> appropriate within the system security policy. They don't "get around"
> the problem, they use existing mechanisms as they are intended.
> 

With CAP_CHECKPOINT_RESTORE, we will need to use the same mechanisms.

The problem is that CAP_SYS_ADMIN is too wide. If a process has
CAP_SYS_ADMIN, it can do a lot of things, and the operation of forking a
process with a specified pid isn't the most dangerous one in this set.
Often security policies don't allow granting CAP_SYS_ADMIN to third-party
tools even in non-root user namespaces.
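
For context, the thing that is CAP_SYS_ADMIN-guarded here is roughly the
following ns_last_pid dance that CRIU does today to fork a task with a
specific pid (a simplified sketch; the real code retries and handles
errors):

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int fork_with_pid(pid_t target)
{
	FILE *f = fopen("/proc/sys/kernel/ns_last_pid", "w");
	pid_t pid;

	if (!f)
		return -1;
	fprintf(f, "%d", target - 1);	/* needs CAP_SYS_ADMIN today */
	fclose(f);

	pid = fork();			/* racy: anyone can fork in between */
	if (pid < 0)
		return -1;
	if (pid == 0)
		_exit(getpid() == target ? 0 : 1);

	waitpid(pid, NULL, 0);
	return pid == target ? 0 : -1;
}

int main(void)
{
	printf("got pid 12345: %s\n",
	       fork_with_pid(12345) == 0 ? "yes" : "no");
	return 0;
}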



Re: [PATCH] capabilities: Introduce CAP_RESTORE

2020-05-22 Thread Andrei Vagin
On Fri, May 22, 2020 at 09:53:31AM +0200, Christian Brauner wrote:
> On Fri, May 22, 2020 at 07:53:50AM +0200, Adrian Reber wrote:
> > 
> > There are probably a few more things guarded by CAP_SYS_ADMIN required
> > to run checkpoint/restore as non-root, but by applying this patch I can
> > already checkpoint and restore processes as non-root. As there are
> > already multiple workarounds I would prefer to do it correctly in the
> > kernel to avoid that CRIU users are starting to invent more workarounds.
> 
> It sounds ok to me as long as this feature is guarded by any sensible
> capability. I don't want users to be able to randomly choose their pid
> without any capability required.
> 
> We've heard the plea for unprivileged checkpoint/restore through the
> grapevine and a few times about CAP_RESTORE at plumbers but it's one of
> those cases where nobody pushed for it so it's urgency was questionable.
> This is 5.9 material though and could you please add selftests?
> 
> It also seems you have future changes planned that would make certain
> things accessible via CAP_RESTORE that are currently guarded by other
> capabilities. Any specific things in mind? It might be worth knowing
> what we'd be getting ourselves into if you're planning on flipping
> switches in other places.

/proc/pid/map_files is one of the first candidates we need to think
about. CRIU opens files from /proc/pid/map_files to dump file mappings,
shared memory mappings, and memfd files.

Right now, it is impossible to open these files without CAP_SYS_ADMIN in
the root user namespace (see proc_map_files_get_link).
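
For the record, the access pattern in question looks roughly like this
(a sketch; the <start>-<end> values come from the target's
/proc/<pid>/maps, and today the open() fails without CAP_SYS_ADMIN in
the init user namespace):

#include <fcntl.h>
#include <stdio.h>

int main(int argc, char **argv)
{
	/* usage: ./map_files_open <pid> <start> <end> */
	char path[128];
	int fd;

	snprintf(path, sizeof(path), "/proc/%s/map_files/%s-%s",
		 argv[1], argv[2], argv[3]);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		perror("open");
	else
		printf("opened %s as fd %d\n", path, fd);
	return 0;
}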


[PATCH v3] selftests/timens: handle a case when alarm clocks are not supported

2020-05-21 Thread Andrei Vagin
This can happen if a testing node doesn't have RTC (real time clock)
hardware or it doesn't support alarms.

Fixes: 61c57676035d ("selftests/timens: Add Time Namespace test for supported 
clocks")
Acked-by: Vincenzo Frascino 
Reported-by: Vincenzo Frascino 
Signed-off-by: Andrei Vagin 
---
v2: fix timer.c and timerfd.c too.
v3: add Reported-by and Fixes tags.

 tools/testing/selftests/timens/clock_nanosleep.c |  2 +-
 tools/testing/selftests/timens/timens.c  |  2 +-
 tools/testing/selftests/timens/timens.h  | 13 -
 tools/testing/selftests/timens/timer.c   |  5 +
 tools/testing/selftests/timens/timerfd.c |  5 +
 5 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/timens/clock_nanosleep.c 
b/tools/testing/selftests/timens/clock_nanosleep.c
index 8e7b7c72ef65..72d41b955fb2 100644
--- a/tools/testing/selftests/timens/clock_nanosleep.c
+++ b/tools/testing/selftests/timens/clock_nanosleep.c
@@ -119,7 +119,7 @@ int main(int argc, char *argv[])
 
ksft_set_plan(4);
 
-   check_config_posix_timers();
+   check_supported_timers();
 
if (unshare_timens())
return 1;
diff --git a/tools/testing/selftests/timens/timens.c 
b/tools/testing/selftests/timens/timens.c
index 098be7c83be3..52b6a1185f52 100644
--- a/tools/testing/selftests/timens/timens.c
+++ b/tools/testing/selftests/timens/timens.c
@@ -155,7 +155,7 @@ int main(int argc, char *argv[])
 
nscheck();
 
-   check_config_posix_timers();
+   check_supported_timers();
 
ksft_set_plan(ARRAY_SIZE(clocks) * 2);
 
diff --git a/tools/testing/selftests/timens/timens.h 
b/tools/testing/selftests/timens/timens.h
index e09e7e39bc52..d4fc52d47146 100644
--- a/tools/testing/selftests/timens/timens.h
+++ b/tools/testing/selftests/timens/timens.h
@@ -14,15 +14,26 @@
 #endif
 
 static int config_posix_timers = true;
+static int config_alarm_timers = true;
 
-static inline void check_config_posix_timers(void)
+static inline void check_supported_timers(void)
 {
+   struct timespec ts;
+
if (timer_create(-1, 0, 0) == -1 && errno == ENOSYS)
config_posix_timers = false;
+
+   if (clock_gettime(CLOCK_BOOTTIME_ALARM, &ts) == -1 && errno == EINVAL)
+   config_alarm_timers = false;
 }
 
 static inline bool check_skip(int clockid)
 {
+   if (!config_alarm_timers && clockid == CLOCK_BOOTTIME_ALARM) {
+   ksft_test_result_skip("CLOCK_BOOTTIME_ALARM isn't supported\n");
+   return true;
+   }
+
if (config_posix_timers)
return false;
 
diff --git a/tools/testing/selftests/timens/timer.c 
b/tools/testing/selftests/timens/timer.c
index 96dba11ebe44..5e7f0051bd7b 100644
--- a/tools/testing/selftests/timens/timer.c
+++ b/tools/testing/selftests/timens/timer.c
@@ -22,6 +22,9 @@ int run_test(int clockid, struct timespec now)
timer_t fd;
int i;
 
+   if (check_skip(clockid))
+   return 0;
+
for (i = 0; i < 2; i++) {
struct sigevent sevp = {.sigev_notify = SIGEV_NONE};
int flags = 0;
@@ -74,6 +77,8 @@ int main(int argc, char *argv[])
 
nscheck();
 
+   check_supported_timers();
+
ksft_set_plan(3);
 
clock_gettime(CLOCK_MONOTONIC, &mtime_now);
diff --git a/tools/testing/selftests/timens/timerfd.c 
b/tools/testing/selftests/timens/timerfd.c
index eff1ec5ff215..9edd43d6b2c1 100644
--- a/tools/testing/selftests/timens/timerfd.c
+++ b/tools/testing/selftests/timens/timerfd.c
@@ -28,6 +28,9 @@ int run_test(int clockid, struct timespec now)
long long elapsed;
int fd, i;
 
+   if (check_skip(clockid))
+   return 0;
+
if (tclock_gettime(clockid, &now))
return pr_perror("clock_gettime(%d)", clockid);
 
@@ -81,6 +84,8 @@ int main(int argc, char *argv[])
 
nscheck();
 
+   check_supported_timers();
+
ksft_set_plan(3);
 
clock_gettime(CLOCK_MONOTONIC, &mtime_now);
-- 
2.24.1



Re: [PATCHv7 00/33] kernel: Introduce Time Namespace

2019-10-22 Thread Andrei Vagin
On Thu, Oct 17, 2019 at 04:47:48PM -0700, Andrei Vagin wrote:
> 
> In my table, the "before" column is actually for the upstream kernel
> with the 18-th patch. Here is the table with the real "before" column:
> 
> | before| with 18/33 | CONFIG_TIME_NS=n | host  | inside 
> timens
> --
> avg | 150331408 | 153345935  | 153588088| 150816637 | 139192114
> --
> diff %  |   98  |  100   | 100.1| 98.3  | 90.7
> --
> stdev % |   0.3 | 0.09   | 0.15 | 0.25  | 0.13
> 
> If we compare numbers in "before", "host" and "inside timens" columns, we
> see the same results that you had. clock_gettime() works with the
> same performance in the host namespace and 7% slower in a time
> namespace.
> 

I played with this a bit more and found that we can speed up
clock_gettime by another 5% if we mark do_hres and do_coarse as
__always_inline.

With the unlikely hint in vdso_read_begin and noinline for
do_hres_timens and do_coarse_timens:
1..8
ok 1 host:  clock: monotonic         cycles: 155278332
ok 2 host:  clock: monotonic-coarse  cycles: 662067077
ok 3 host:  clock: monotonic-raw     cycles: 151218057
ok 4 host:  clock: boottime          cycles: 154907635
ok 5 ns:    clock: monotonic         cycles: 133100433
ok 6 host:  clock: monotonic-coarse  cycles: 444170219
ok 7 host:  clock: monotonic-raw     cycles: 129550178
ok 8 ns:    clock: boottime          cycles: 130167136


With __always_inline for do_hres and do_coarse:
1..8
ok 1 host:  clock: monotonic         cycles: 163691015
ok 2 host:  clock: monotonic-coarse  cycles: 641443397
ok 3 host:  clock: monotonic-raw     cycles: 163649270
ok 4 host:  clock: boottime          cycles: 163682242
ok 5 ns:    clock: monotonic         cycles: 138759212
ok 6 host:  clock: monotonic-coarse  cycles: 486149502
ok 7 host:  clock: monotonic-raw     cycles: 134801053
ok 8 ns:    clock: boottime          cycles: 138675460
# Pass 8 Fail 0 Xfail 0 Xpass 0 Skip 0 Error 0

With __always_inline for do_hres, do_coarse, do_hres_timens,
do_coarse_timens:
1..8
ok 1 host:  clock: monotonic         cycles: 158984538
ok 2 host:  clock: monotonic-coarse  cycles: 594932695
ok 3 host:  clock: monotonic-raw     cycles: 157834511
ok 4 host:  clock: boottime          cycles: 158297691
ok 5 ns:    clock: monotonic         cycles: 148559612
ok 6 host:  clock: monotonic-coarse  cycles: 468505657
ok 7 host:  clock: monotonic-raw     cycles: 146366575
ok 8 ns:    clock: boottime          cycles: 148573015
# Pass 8 Fail 0 Xfail 0 Xpass 0 Skip 0 Error 0

Thanks,
Andrei


Re: [PATCHv7 00/33] kernel: Introduce Time Namespace

2019-10-17 Thread Andrei Vagin
On Thu, Oct 17, 2019 at 11:24:45AM +0200, Thomas Gleixner wrote:
> On Fri, 11 Oct 2019, Dmitry Safonov wrote:
> > We wrote two small benchmarks. The first one gettime_perf.c calls
> > clock_gettime() in a loop for 3 seconds. It shows us performance with
> > a hot CPU cache (more clock_gettime() cycles - the better):
> > 
> > | before| CONFIG_TIME_NS=n | host  | inside timens
> > --
> > | 153242367 | 153567617| 150933203 | 139310914
> > | 153324800 | 153115132| 150919828 | 139299761
> > | 153125401 | 153686868| 150930471 | 139273917
> > | 153399355 | 153694866| 151083410 | 139286081
> > | 153489417 | 153739716| 150997262 | 139146403
> > | 153494270 | 153724332| 150035651 | 138835612
> > ---
> > avg | 153345935 | 153588088| 150816637 | 139192114
> > diff %  | 100   | 100.1| 98.3  | 90.7
> 
> 
> That host 98.3% number is weird and does not match the tests I did with the
> fallback code I provided you. On my limited testing that fallback hidden in
> the slowpath did not show any difference to the TIME_NS=n case when not
> inside a time namespace.

You did your experiments without a small optimization that we introduced
in the 18-th patch:

[PATCHv7 18/33] lib/vdso: Add unlikely() hint into vdso_read_begin()

When I did my measurements the first time, I found that with this
timens change clock_gettime() shows better performance when
CONFIG_TIME_NS isn't set. This looked weird to me, because I didn't
expect to see this improvement. After analyzing the disassembled code of
vdso.so, I found that we can add the unlikely() hint into
vdso_read_begin(), and this gives us a 2% improvement in clock_gettime()
performance on the upstream kernel.
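
For reference, the hint in question boils down to this (roughly what the
18-th patch does to the generic helper; see include/vdso/helpers.h):

	static __always_inline u32 vdso_read_begin(const struct vdso_data *vd)
	{
		u32 seq;

		/* An odd seq (update in progress) is the rare case. */
		while (unlikely((seq = READ_ONCE(vd->seq)) & 1))
			cpu_relax();

		smp_rmb();
		return seq;
	}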

In my table, the "before" column is actually for the upstream kernel
with the 18-th patch. Here is the table with the real "before" column:

        | before    | with 18/33 | CONFIG_TIME_NS=n | host      | inside timens
--------|-----------|------------|------------------|-----------|--------------
avg     | 150331408 | 153345935  | 153588088        | 150816637 | 139192114
diff %  | 98        | 100        | 100.1            | 98.3      | 90.7
stdev % | 0.3       | 0.09       | 0.15             | 0.25      | 0.13

If we compare numbers in "before", "host" and "inside timens" columns, we
see the same results that you had. clock_gettime() works with the
same performance in the host namespace and 7% slower in a time
namespace.
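
(For reference, these numbers come from a trivial micro-benchmark along
the lines of the sketch below: call clock_gettime() in a tight loop for
~3 seconds and count iterations. This is only an approximation of
gettime_perf.c, not the exact test.)

#include <signal.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static volatile sig_atomic_t done;

static void on_alarm(int sig)
{
	done = 1;
}

int main(void)
{
	struct timespec ts;
	unsigned long long calls = 0;

	signal(SIGALRM, on_alarm);
	alarm(3);			/* run for ~3 seconds */

	while (!done) {
		clock_gettime(CLOCK_MONOTONIC, &ts);
		calls++;
	}

	printf("clock_gettime() calls in 3s: %llu\n", calls);
	return 0;
}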

Now let's look at why we have this 2% degradation in the host time
namespace. For that, we can look at the disassembled code of do_hres:

Before:
   0:   55                      push   %rbp
   1:   48 63 f6                movslq %esi,%rsi
   4:   49 89 d1                mov    %rdx,%r9
   7:   49 89 c8                mov    %rcx,%r8
   a:   48 c1 e6 04             shl    $0x4,%rsi
   e:   48 01 fe                add    %rdi,%rsi
  11:   48 89 e5                mov    %rsp,%rbp
  14:   41 54                   push   %r12
  16:   53                      push   %rbx
  17:   44 8b 17                mov    (%rdi),%r10d
  1a:   41 f6 c2 01             test   $0x1,%r10b
  1e:   0f 85 fb 00 00 00       jne    11f
  24:   8b 47 04                mov    0x4(%rdi),%eax
  27:   83 f8 01                cmp    $0x1,%eax
  2a:   74 0f                   je     3b
  2c:   83 f8 02                cmp    $0x2,%eax
  2f:   74 72                   je     a3
  31:   5b                      pop    %rbx
  32:   b8 ff ff ff ff          mov    $0xffffffff,%eax
  37:   41 5c                   pop    %r12
  39:   5d                      pop    %rbp
  3a:   c3                      retq
...

After:
   0:   55                      push   %rbp
   1:   4c 63 ce                movslq %esi,%r9
   4:   49 89 d0                mov    %rdx,%r8
   7:   49 c1 e1 04             shl    $0x4,%r9
   b:   49 01 f9                add    %rdi,%r9
   e:   48 89 e5                mov    %rsp,%rbp
  11:   41 56                   push   %r14
  13:   41 55                   push   %r13
  15:   41 54                   push   %r12
  17:   53                      push   %rbx
  18:   44 8b 17                mov    (%rdi),%r10d
  1b:   44 89 d0                mov    %r10d,%eax
  1e:   f7 d0                   not    %eax
  20:   83 e0 01                and    $0x1,%eax
  23:   89 c3                   mov    %eax,%ebx
  25:   0f 84 03 01 00 00       je     12e
  2b:   8b 47 04                mov    0x4(%rdi),%eax
  2e:   83 f8 01                cmp    $0x1,%eax
  31:   74 13                   je     46
  33:   83 f8 02                cmp    $0x2,%eax
  36:   74 7b                   je     b3
  38:   b8 ff ff 

Re: [PATCHv7 01/33] ns: Introduce Time Namespace

2019-10-16 Thread Andrei Vagin
On Wed, Oct 16, 2019 at 12:39:11PM +0200, Thomas Gleixner wrote:
> On Wed, 16 Oct 2019, Vincenzo Frascino wrote:
> 
> < Trim 250+ lines ( 3+ pages) of pointlessly wasted electrons >
> 
> > > --- a/init/Kconfig
> > > +++ b/init/Kconfig
> > > @@ -1096,6 +1096,13 @@ config UTS_NS
> > > In this namespace tasks see different info provided with the
> > > uname() system call
> > >  
> > > +config TIME_NS
> > > + bool "TIME namespace"
> > > + default y
> > 
> > Having CONFIG_TIME_NS "default y" makes so that the option is selected even 
> > on
> > the architectures that have no support for time namespaces.
> > The direct consequence is that the fallbacks defined in this patch are never
> > selected and this ends up in kernel compilation errors due to missing 
> > symbols.
> > 
> > The error below shows what happens on arm64 (similar behavior on other
> > architectures):
> > 
> > aarch64-linux-gnu-ld: kernel/time/namespace.o: in function `timens_on_fork':
> > kernel/time/namespace.c:321: undefined reference to `vdso_join_timens'
> > 
> > My proposal is to keep TIME_NS "default n" (just remove "default y"), let 
> > the
> > architectures that enable time namespaces select it and make CONFIG_TIME_NS
> > select GENERIC_VDSO_TIME_NS if arch has HAVE_GENERIC_VDSO.
> 
> Nah.
> 
> config TIME_NS
>   bool "TIME namespace"
>   depends on GENERIC_VDSO_TIME_NS

I was thinking of fixing this the same way, with a small difference.

If GENERIC_GETTIMEOFDAY isn't set, it should be safe to allow enabling
TIME_NS: in that case, clock_gettime() works via a system call and there
is no arch-specific vDSO code involved. Does this sound reasonable?

depends on (!GENERIC_GETTIMEOFDAY || GENERIC_VDSO_TIME_NS)
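
i.e. the entry would end up looking roughly like this (a sketch of the
quoted config entry with the proposed dependency; not final):

config TIME_NS
	bool "TIME namespace"
	depends on !GENERIC_GETTIMEOFDAY || GENERIC_VDSO_TIME_NS
	default y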

Thanks,
Andrei


Re: [PATCHv6 30/36] selftest/timens: Add Time Namespace test for supported clocks

2019-08-16 Thread Andrei Vagin
On Thu, Aug 15, 2019 at 05:18:21PM -0600, shuah wrote:
> Hi Dmitry,
> 
> Thanks for the patch.
> 
> On 8/15/19 10:38 AM, Dmitry Safonov wrote:
> > A test to check that all supported clocks work on host and inside
> > a new time namespace. Use both ways to get time: through VDSO and
> > by entering the kernel with implicit syscall.
> > 
> > Introduce a new timens directory in selftests framework for
> > the next timens tests.
> > 
> > Co-developed-by: Andrei Vagin 
> > Signed-off-by: Andrei Vagin 
> > Signed-off-by: Dmitry Safonov 
> > ---
> >   tools/testing/selftests/Makefile  |   1 +
> >   tools/testing/selftests/timens/.gitignore |   1 +
> >   tools/testing/selftests/timens/Makefile   |   5 +
> >   tools/testing/selftests/timens/config |   1 +
> >   tools/testing/selftests/timens/log.h  |  26 +++
> >   tools/testing/selftests/timens/timens.c   | 185 ++
> >   tools/testing/selftests/timens/timens.h   |  63 
> >   7 files changed, 282 insertions(+)
> >   create mode 100644 tools/testing/selftests/timens/.gitignore
> >   create mode 100644 tools/testing/selftests/timens/Makefile
> >   create mode 100644 tools/testing/selftests/timens/config
> >   create mode 100644 tools/testing/selftests/timens/log.h
> >   create mode 100644 tools/testing/selftests/timens/timens.c
> >   create mode 100644 tools/testing/selftests/timens/timens.h
> > 
> > diff --git a/tools/testing/selftests/Makefile 
> > b/tools/testing/selftests/Makefile
> > index 25b43a8c2b15..6fc63b84a857 100644
> > --- a/tools/testing/selftests/Makefile
> > +++ b/tools/testing/selftests/Makefile
> > @@ -47,6 +47,7 @@ TARGETS += splice
> >   TARGETS += static_keys
> >   TARGETS += sync
> >   TARGETS += sysctl
> > +TARGETS += timens
> 
> How long does this test run for?

On my laptop, it takes about 30 milliseconds.

> Does this test need to be run
> as root? If yes, please add a root check and skip the test.

Yes, it needs to be run as root. We will add this check. Thanks.
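
Something along the lines of the usual selftest guard (a sketch of what
we'd add, using the kselftest helpers):

	if (geteuid() != 0)
		ksft_exit_skip("This test needs root to run\n");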

> 
> What does the test output looks like for skip and pass/fail cases?

[avagin@laptop timens]$ ./timens 
not ok 1 # SKIP Time namespaces are not supported

[root@fc24 timens]# ./timens 
1..10
ok 1 Passed for CLOCK_BOOTTIME (syscall)
ok 2 Passed for CLOCK_BOOTTIME (vdso)
ok 3 Passed for CLOCK_BOOTTIME_ALARM (syscall)
ok 4 Passed for CLOCK_BOOTTIME_ALARM (vdso)
ok 5 Passed for CLOCK_MONOTONIC (syscall)
ok 6 Passed for CLOCK_MONOTONIC (vdso)
ok 7 Passed for CLOCK_MONOTONIC_COARSE (syscall)
ok 8 Passed for CLOCK_MONOTONIC_COARSE (vdso)
ok 9 Passed for CLOCK_MONOTONIC_RAW (syscall)
ok 10 Passed for CLOCK_MONOTONIC_RAW (vdso)
# Pass 10 Fail 0 Xfail 0 Xpass 0 Skip 0 Error 0

Thanks,
Andrei

> 
> thanks,
> -- Shuah


Re: [PATCHv6 01/36] ns: Introduce Time Namespace

2019-08-16 Thread Andrei Vagin
On Thu, Aug 15, 2019 at 07:19:12PM +0200, Thomas Gleixner wrote:
> Dmitry,
> 
> On Thu, 15 Aug 2019, Dmitry Safonov wrote:
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 420567d1519a..97b7737f5aba 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -12898,6 +12898,8 @@ T:  git 
> > git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timers/core
> >  S: Maintained
> >  F: fs/timerfd.c
> >  F: include/linux/timer*
> > +F: include/linux/time_namespace.h
> > +F: kernel/time_namespace.c
> 
> Shouldn't this be kernel/time/namespace.c so all that stuff is lumped
> together. No strong opinion though.

Sure, we can move this into kernel/time. I don't remember why I decided
to place it in kernel/.

> 
> > +++ b/kernel/time_namespace.c
> > +static struct ucounts *inc_time_namespaces(struct user_namespace *ns)
> > +{
> > +   return inc_ucount(ns, current_euid(), UCOUNT_TIME_NAMESPACES);
> > +}
> > +
> > +static void dec_time_namespaces(struct ucounts *ucounts)
> > +{
> > +   dec_ucount(ucounts, UCOUNT_TIME_NAMESPACES);
> > +}
> > +
> > +static struct time_namespace *create_time_ns(void)
> > +{
> > +   struct time_namespace *time_ns;
> > +
> > +   time_ns = kmalloc(sizeof(struct time_namespace), GFP_KERNEL);
> 
> Shouldn't this use kzalloc()? There are tons of members in that struct.

I don't think so. All of the other members are initialized in
clone_time_ns. Maybe we don't need this helper at all. When I wrote this
code, I looked at kernel/utsname.c. I will think about what we can do
here to make this code clearer.

> 
> > +   if (time_ns) {
> > +   kref_init(&time_ns->kref);
> > +   time_ns->initialized = false;
> 
> And you spare this one.

This one should be initialized in clone_time_ns too.
> 
> > +   }
> > +   return time_ns;
> > +}
> > +
> > +/*
> > + * Clone a new ns copying @old_ns, setting refcount to 1
> > + * @old_ns: namespace to clone
> > + * Return the new ns or ERR_PTR.
> 
> If you use kernel-doc style then please use te proper syntax
> 
> /*
>  * clone_time_ns - Clone a time namespace
>  * @old_ns:   Namespace to clone
>  *
>  * Clone @old_ns and set the clone refcount to 1
>  *
>  * Return: The new namespace or ERR_PTR.
>  */

Will fix. Thanks!

> 
> > + */
> > +static struct time_namespace *clone_time_ns(struct user_namespace *user_ns,
> > + struct time_namespace *old_ns)
> > +{
> > +   struct time_namespace *ns;
> > +   struct ucounts *ucounts;
> > +   int err;
> > +
> > +   err = -ENOSPC;
> > +   ucounts = inc_time_namespaces(user_ns);
> > +   if (!ucounts)
> > +   goto fail;
> > +
> > +   err = -ENOMEM;
> > +   ns = create_time_ns();
> > +   if (!ns)
> > +   goto fail_dec;
> > +
> > +   err = ns_alloc_inum(&ns->ns);
> > +   if (err)
> > +   goto fail_free;
> > +
> > +   ns->ucounts = ucounts;
> > +   ns->ns.ops = &timens_operations;
> > +   ns->user_ns = get_user_ns(user_ns);
> > +   return ns;
> > +
> > +fail_free:
> > +   kfree(ns);
> > +fail_dec:
> > +   dec_time_namespaces(ucounts);
> > +fail:
> > +   return ERR_PTR(err);
> > +}
> > +
> > +/*
> > + * Add a reference to old_ns, or clone it if @flags specify CLONE_NEWTIME.
> > + * In latter case, changes to the time of this process won't be seen by 
> > parent,
> > + * and vice versa.
> 
> Ditto

Will fix.

> 
> > + */
> > +struct time_namespace *copy_time_ns(unsigned long flags,
> > +   struct user_namespace *user_ns, struct time_namespace *old_ns)
> > +{
> > +   if (!(flags & CLONE_NEWTIME))
> > +   return get_time_ns(old_ns);
> > +
> > +   return clone_time_ns(user_ns, old_ns);
> > +}
> > +
> > +void free_time_ns(struct kref *kref)
> > +{
> > +   struct time_namespace *ns;
> > +
> > +   ns = container_of(kref, struct time_namespace, kref);
> > +   dec_time_namespaces(ns->ucounts);
> > +   put_user_ns(ns->user_ns);
> > +   ns_free_inum(>ns);
> > +   kfree(ns);
> > +}
> > +
> > +static struct time_namespace *to_time_ns(struct ns_common *ns)
> > +{
> > +   return container_of(ns, struct time_namespace, ns);
> > +}
> > +
> > +static struct ns_common *timens_get(struct task_struct *task)
> > +{
> > +   struct time_namespace *ns = NULL;
> > +   struct nsproxy *nsproxy;
> > +
> > +   task_lock(task);
> > +   nsproxy = task->nsproxy;
> > +   if (nsproxy) {
> > +   ns = nsproxy->time_ns;
> > +   get_time_ns(ns);
> > +   }
> > +   task_unlock(task);
> > +
> > +   return ns ? &ns->ns : NULL;
> > +}
> > +
> > +static struct ns_common *timens_for_children_get(struct task_struct *task)
> > +{
> > +   struct time_namespace *ns = NULL;
> > +   struct nsproxy *nsproxy;
> > +
> > +   task_lock(task);
> > +   nsproxy = task->nsproxy;
> > +   if (nsproxy) {
> > +   ns = nsproxy->time_ns_for_children;
> > +   get_time_ns(ns);
> > +   }
> > +   task_unlock(task);
> > +
> > +   return ns ? &ns->ns : NULL;
> > +}
> > +
> > +static void timens_put(struct ns_common *ns)
> > +{
> > +   put_time_ns(to_time_ns(ns));
> > +}
> > +
> > +static int timens_install(struct nsproxy 

Re: [PATCH v6 1/2] fork: extend clone3() to support setting a PID

2019-08-12 Thread Andrei Vagin
On Mon, Aug 12, 2019 at 2:03 PM Adrian Reber  wrote:
>
> On Mon, Aug 12, 2019 at 01:43:53PM -0700, Andrei Vagin wrote:
> > On Mon, Aug 12, 2019 at 1:10 PM Adrian Reber  wrote:
> > >
> > > The main motivation to add set_tid to clone3() is CRIU.
> > >
> > > To restore a process with the same PID/TID CRIU currently uses
> > > /proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
> > > ns_last_pid and then (quickly) does a clone(). This works most of the
> > > time, but it is racy. It is also slow as it requires multiple syscalls.
> > >
> > > Extending clone3() to support set_tid makes it possible restore a
> > > process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
> > > race free (as long as the desired PID/TID is available).
> > >
> > > This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
> > > on clone3() with set_tid as they are currently in place for ns_last_pid.
> > >
> > > Signed-off-by: Adrian Reber 
> > > ---
> > > v2:
> > >  - Removed (size < sizeof(struct clone_args)) as discussed with
> > >Christian and Dmitry
> > >  - Added comment to ((set_tid != 1) && idr_get_cursor() <= 1) (Oleg)
> > >  - Use idr_alloc() instead of idr_alloc_cyclic() (Oleg)
> > >
> > > v3:
> > >  - Return EEXIST if PID is already in use (Christian)
> > >  - Drop CLONE_SET_TID (Christian and Oleg)
> > >  - Use idr_is_empty() instead of idr_get_cursor() (Oleg)
> > >  - Handle different `struct clone_args` sizes (Dmitry)
> > >
> > > v4:
> > >  - Rework struct size check with defines (Christian)
> > >  - Reduce number of set_tid checks (Oleg)
> > >  - Less parentheses and more robust code (Oleg)
> > >  - Do ns_capable() on correct user_ns (Oleg, Christian)
> > >
> > > v5:
> > >  - make set_tid checks earlier in alloc_pid() (Christian)
> > >  - remove unnecessary comment and struct size check (Christian)
> > >
> > > v6:
> > >  - remove CLONE_SET_TID from description (Christian)
> > >  - add clone3() tests for different clone_args sizes (Christian)
> > >  - move more set_tid checks to alloc_pid() (Oleg)
> > >  - make all set_tid checks lockless (Oleg)
> > > ---
> > >  include/linux/pid.h|  2 +-
> > >  include/linux/sched/task.h |  1 +
> > >  include/uapi/linux/sched.h |  1 +
> > >  kernel/fork.c  | 14 --
> > >  kernel/pid.c   | 37 ++---
> > >  5 files changed, 45 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/include/linux/pid.h b/include/linux/pid.h
> > > index 2a83e434db9d..052000db0ced 100644
> > > --- a/include/linux/pid.h
> > > +++ b/include/linux/pid.h
> > > @@ -116,7 +116,7 @@ extern struct pid *find_vpid(int nr);
> > >  extern struct pid *find_get_pid(int nr);
> > >  extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
> > >
> > > -extern struct pid *alloc_pid(struct pid_namespace *ns);
> > > +extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t set_tid);
> > >  extern void free_pid(struct pid *pid);
> > >  extern void disable_pid_allocation(struct pid_namespace *ns);
> > >
> > > diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> > > index 0497091e40c1..4f2a80564332 100644
> > > --- a/include/linux/sched/task.h
> > > +++ b/include/linux/sched/task.h
> > > @@ -26,6 +26,7 @@ struct kernel_clone_args {
> > > unsigned long stack;
> > > unsigned long stack_size;
> > > unsigned long tls;
> > > +   pid_t set_tid;
> > >  };
> > >
> > >  /*
> > > diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> > > index b3105ac1381a..e1ce103a2c47 100644
> > > --- a/include/uapi/linux/sched.h
> > > +++ b/include/uapi/linux/sched.h
> > > @@ -45,6 +45,7 @@ struct clone_args {
> > > __aligned_u64 stack;
> > > __aligned_u64 stack_size;
> > > __aligned_u64 tls;
> > > +   __aligned_u64 set_tid;
> > >  };
> > >
> > >  /*
> > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > index 2852d0e76ea3..8317d408a8d6 100644
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -117,6 +117,11 @@
> > >   */
> > >  #define MAX_THREADS FUTEX_TID_

Re: [PATCH v6 1/2] fork: extend clone3() to support setting a PID

2019-08-12 Thread Andrei Vagin
On Mon, Aug 12, 2019 at 1:10 PM Adrian Reber  wrote:
>
> The main motivation to add set_tid to clone3() is CRIU.
>
> To restore a process with the same PID/TID CRIU currently uses
> /proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
> ns_last_pid and then (quickly) does a clone(). This works most of the
> time, but it is racy. It is also slow as it requires multiple syscalls.
>
> Extending clone3() to support set_tid makes it possible restore a
> process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
> race free (as long as the desired PID/TID is available).
>
> This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
> on clone3() with set_tid as they are currently in place for ns_last_pid.
>
> Signed-off-by: Adrian Reber 
> ---
> v2:
>  - Removed (size < sizeof(struct clone_args)) as discussed with
>Christian and Dmitry
>  - Added comment to ((set_tid != 1) && idr_get_cursor() <= 1) (Oleg)
>  - Use idr_alloc() instead of idr_alloc_cyclic() (Oleg)
>
> v3:
>  - Return EEXIST if PID is already in use (Christian)
>  - Drop CLONE_SET_TID (Christian and Oleg)
>  - Use idr_is_empty() instead of idr_get_cursor() (Oleg)
>  - Handle different `struct clone_args` sizes (Dmitry)
>
> v4:
>  - Rework struct size check with defines (Christian)
>  - Reduce number of set_tid checks (Oleg)
>  - Less parentheses and more robust code (Oleg)
>  - Do ns_capable() on correct user_ns (Oleg, Christian)
>
> v5:
>  - make set_tid checks earlier in alloc_pid() (Christian)
>  - remove unnecessary comment and struct size check (Christian)
>
> v6:
>  - remove CLONE_SET_TID from description (Christian)
>  - add clone3() tests for different clone_args sizes (Christian)
>  - move more set_tid checks to alloc_pid() (Oleg)
>  - make all set_tid checks lockless (Oleg)
> ---
>  include/linux/pid.h|  2 +-
>  include/linux/sched/task.h |  1 +
>  include/uapi/linux/sched.h |  1 +
>  kernel/fork.c  | 14 --
>  kernel/pid.c   | 37 ++---
>  5 files changed, 45 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/pid.h b/include/linux/pid.h
> index 2a83e434db9d..052000db0ced 100644
> --- a/include/linux/pid.h
> +++ b/include/linux/pid.h
> @@ -116,7 +116,7 @@ extern struct pid *find_vpid(int nr);
>  extern struct pid *find_get_pid(int nr);
>  extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
>
> -extern struct pid *alloc_pid(struct pid_namespace *ns);
> +extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t set_tid);
>  extern void free_pid(struct pid *pid);
>  extern void disable_pid_allocation(struct pid_namespace *ns);
>
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index 0497091e40c1..4f2a80564332 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -26,6 +26,7 @@ struct kernel_clone_args {
> unsigned long stack;
> unsigned long stack_size;
> unsigned long tls;
> +   pid_t set_tid;
>  };
>
>  /*
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index b3105ac1381a..e1ce103a2c47 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -45,6 +45,7 @@ struct clone_args {
> __aligned_u64 stack;
> __aligned_u64 stack_size;
> __aligned_u64 tls;
> +   __aligned_u64 set_tid;
>  };
>
>  /*
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 2852d0e76ea3..8317d408a8d6 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -117,6 +117,11 @@
>   */
>  #define MAX_THREADS FUTEX_TID_MASK
>
> +/*
> + * For different sizes of struct clone_args
> + */
> +#define CLONE3_ARGS_SIZE_V0 64
> +
>  /*
>   * Protected counters by write_lock_irq(&tasklist_lock)
>   */
> @@ -2031,7 +2036,7 @@ static __latent_entropy struct task_struct 
> *copy_process(
> stackleak_task_init(p);
>
> if (pid != &init_struct_pid) {
> -   pid = alloc_pid(p->nsproxy->pid_ns_for_children);
> +   pid = alloc_pid(p->nsproxy->pid_ns_for_children, 
> args->set_tid);
> if (IS_ERR(pid)) {
> retval = PTR_ERR(pid);
> goto bad_fork_cleanup_thread;
> @@ -2535,9 +2540,13 @@ noinline static int copy_clone_args_from_user(struct 
> kernel_clone_args *kargs,
> if (unlikely(size > PAGE_SIZE))
> return -E2BIG;
>
> -   if (unlikely(size < sizeof(struct clone_args)))
> +   if (unlikely(size < CLONE3_ARGS_SIZE_V0))
> return -EINVAL;
>
> +   if (size < sizeof(struct clone_args))
> +   memset((void *)&args + size, 0,
> +   sizeof(struct clone_args) - size);
> +
> if (unlikely(!access_ok(uargs, size)))
> return -EFAULT;
>
> @@ -2571,6 +2580,7 @@ noinline static int copy_clone_args_from_user(struct 
> kernel_clone_args *kargs,
> .stack  = args.stack,
> .stack_size = 

Re: [PATCHv5 06/37] alarmtimer: Provide get_timespec() callback

2019-08-08 Thread Andrei Vagin
On Wed, Aug 07, 2019 at 08:04:10AM +0200, Thomas Gleixner wrote:
> On Mon, 29 Jul 2019, Dmitry Safonov wrote:
> >  /**
> > @@ -869,8 +871,10 @@ static int __init alarmtimer_init(void)
> > /* Initialize alarm bases */
> > alarm_bases[ALARM_REALTIME].base_clockid = CLOCK_REALTIME;
> > alarm_bases[ALARM_REALTIME].get_ktime = &ktime_get_real;
> > +   alarm_bases[ALARM_REALTIME].get_timespec = posix_get_timespec,
> 
> That's just wrong:
> 
> >  /*
> >   * Get monotonic time for posix timers
> >   */
> > -static int posix_get_timespec(clockid_t which_clock, struct timespec64 *tp)
> > +int posix_get_timespec(clockid_t which_clock, struct timespec64 *tp)
> >  {
> > ktime_get_ts64(tp);
> > return 0;
> 
> Using a proper function name would have avoided this.

You are right. Will fix. Thanks!
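
The fix will probably end up looking something like this (a rough
sketch; the descriptive helper names are my assumption, not what is in
the tree yet):

	/* Initialize alarm bases */
	alarm_bases[ALARM_REALTIME].base_clockid = CLOCK_REALTIME;
	alarm_bases[ALARM_REALTIME].get_ktime = &ktime_get_real;
	alarm_bases[ALARM_REALTIME].get_timespec = posix_get_realtime_timespec;
	alarm_bases[ALARM_BOOTTIME].base_clockid = CLOCK_BOOTTIME;
	alarm_bases[ALARM_BOOTTIME].get_ktime = &ktime_get_boottime;
	alarm_bases[ALARM_BOOTTIME].get_timespec = posix_get_boottime_timespec;
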
> 

