Re: [PATCH v6 0/7] capabilities: Introduce CAP_CHECKPOINT_RESTORE

2020-07-20 Thread Adrian Reber
On Mon, Jul 20, 2020 at 01:54:52PM +0200, Christian Brauner wrote:
> On Sun, Jul 19, 2020 at 08:17:30PM +0200, Christian Brauner wrote:
> > On Sun, Jul 19, 2020 at 12:04:10PM +0200, Adrian Reber wrote:
> > > This is v6 of the 'Introduce CAP_CHECKPOINT_RESTORE' patchset. The
> > > changes to v5 are:
> > > 
> > >  * split patch dealing with /proc/self/exe into two patches:
> > >* first patch to enable changing it with CAP_CHECKPOINT_RESTORE
> > >  and detailed history in the commit message
> > >* second patch changes -EINVAL to -EPERM
> > >  * use kselftest_harness.h infrastructure for test
> > >  * replace if (!capable(CAP_SYS_ADMIN) || 
> > > !capable(CAP_CHECKPOINT_RESTORE))
> > >with if (!checkpoint_restore_ns_capable(_user_ns))
> > > 
> > > Adrian Reber (5):
> > >   capabilities: Introduce CAP_CHECKPOINT_RESTORE
> > >   pid: use checkpoint_restore_ns_capable() for set_tid
> > >   pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid
> > >   proc: allow access in init userns for map_files with
> > > CAP_CHECKPOINT_RESTORE
> > >   selftests: add clone3() CAP_CHECKPOINT_RESTORE test
> > > 
> > > Nicolas Viennot (2):
> > >   prctl: Allow local CAP_CHECKPOINT_RESTORE to change /proc/self/exe
> > >   prctl: exe link permission error changed from -EINVAL to -EPERM
> > > 
> > >  fs/proc/base.c|   8 +-
> > >  include/linux/capability.h|   6 +
> > >  include/uapi/linux/capability.h   |   9 +-
> > >  kernel/pid.c  |   2 +-
> > >  kernel/pid_namespace.c|   2 +-
> > >  kernel/sys.c  |  13 +-
> > >  security/selinux/include/classmap.h   |   5 +-
> > >  tools/testing/selftests/clone3/.gitignore |   1 +
> > >  tools/testing/selftests/clone3/Makefile   |   4 +-
> > >  .../clone3/clone3_cap_checkpoint_restore.c| 177 ++
> > >  10 files changed, 212 insertions(+), 15 deletions(-)
> > >  create mode 100644 
> > > tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
> > > 
> > > base-commit: d31958b30ea3b7b6e522d6bf449427748ad45822
> > 
> > Adrian, Nicolas thank you!
> > I grabbed the series to run the various core test-suites we've added
> > over the last year and pushed it to
> > https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=cap_checkpoint_restore
> > for now to let kbuild/ltp chew on it for a bit.
> 
> Ok, I ran the test-suite this morning and there's nothing to worry about
> it all passes _but_ the selftests had a bug using SKIP() instead of
> XFAIL() and they mixed ksft_print_msg() and TH_LOG(). I know that I
> think I mentioned to you that you can't use TH_LOG() outside of TEST*().
> Turns out I was wrong. You can do it if you pass in a specific global
> variable. Here's the diff I applied on top of the selftests you sent.
> After these changes the output looks like this:
> 
> [==] Running 1 tests from 1 test cases.
> [ RUN  ] global.clone3_cap_checkpoint_restore
> # clone3() syscall supported
> clone3_cap_checkpoint_restore.c:155:clone3_cap_checkpoint_restore:Child has 
> PID 12303
> clone3_cap_checkpoint_restore.c:88:clone3_cap_checkpoint_restore:[12302] 
> Trying clone3() with CLONE_SET_TID to 12303
> clone3_cap_checkpoint_restore.c:55:clone3_cap_checkpoint_restore:Operation 
> not permitted - Failed to create new process
> clone3_cap_checkpoint_restore.c:90:clone3_cap_checkpoint_restore:[12302] 
> clone3() with CLONE_SET_TID 12303 says:-1
> clone3_cap_checkpoint_restore.c:88:clone3_cap_checkpoint_restore:[12302] 
> Trying clone3() with CLONE_SET_TID to 12303
> clone3_cap_checkpoint_restore.c:70:clone3_cap_checkpoint_restore:I am the 
> parent (12302). My child's pid is 12303
> clone3_cap_checkpoint_restore.c:63:clone3_cap_checkpoint_restore:I am the 
> child, my PID is 12303 (expected 12303)
> clone3_cap_checkpoint_restore.c:90:clone3_cap_checkpoint_restore:[12302] 
> clone3() with CLONE_SET_TID 12303 says:0
> [   OK ] global.clone3_cap_checkpoint_restore
> [==] 1 / 1 tests passed.
> [  PASSED  ]
> 
> Ok with this below being applied on top of it?
> 
> diff --git a/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c 
> b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
> index c0d83511cd28..9562425aa0a9 100644
> --- a/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
> +++ b/tools/testing/se

[PATCH v6 5/7] prctl: Allow local CAP_CHECKPOINT_RESTORE to change /proc/self/exe

2020-07-19 Thread Adrian Reber
From: Nicolas Viennot 

Originally, only a local CAP_SYS_ADMIN could change the exe link,
making it difficult for doing checkpoint/restore without CAP_SYS_ADMIN.
This commit adds CAP_CHECKPOINT_RESTORE in addition to CAP_SYS_ADMIN
for permitting changing the exe link.

The following describes the history of the /proc/self/exe permission
checks as it may be difficult to understand what decisions lead to this
point.

* [1] May 2012: This commit introduces the ability of changing
  /proc/self/exe if the user is CAP_SYS_RESOURCE capable.
  In the related discussion [2], no clear thread model is presented for
  what could happen if the /proc/self/exe changes multiple times, or why
  would the admin be at the mercy of userspace.

* [3] Oct 2014: This commit introduces a new API to change
  /proc/self/exe. The permission no longer checks for CAP_SYS_RESOURCE,
  but instead checks if the current user is root (uid=0) in its local
  namespace. In the related discussion [4] it is said that "Controlling
  exe_fd without privileges may turn out to be dangerous. At least
  things like tomoyo examine it for making policy decisions (see
  tomoyo_manager())."

* [5] Dec 2016: This commit removes the restriction to change
  /proc/self/exe at most once. The related discussion [6] informs that
  the audit subsystem relies on the exe symlink, presumably
  audit_log_d_path_exe() in kernel/audit.c.

* [7] May 2017: This commit changed the check from uid==0 to local
  CAP_SYS_ADMIN. No discussion.

* [8] July 2020: A PoC to spoof any program's /proc/self/exe via ptrace
  is demonstrated

Overall, the concrete points that were made to retain capability checks
around changing the exe symlink is that tomoyo_manager() and
audit_log_d_path_exe() uses the exe_file path.

Christian Brauner said that relying on /proc//exe being immutable (or
guarded by caps) in a sake of security is a bit misleading. It can only
be used as a hint without any guarantees of what code is being executed
once execve() returns to userspace. Christian suggested that in the
future, we could call audit_log() or similar to inform the admin of all
exe link changes, instead of attempting to provide security guarantees
via permission checks. However, this proposed change requires the
understanding of the security implications in the tomoyo/audit subsystems.

[1] b32dfe377102 ("c/r: prctl: add ability to set new mm_struct::exe_file")
[2] https://lore.kernel.org/patchwork/patch/292515/
[3] f606b77f1a9e ("prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation")
[4] https://lore.kernel.org/patchwork/patch/479359/
[5] 3fb4afd9a504 ("prctl: remove one-shot limitation for changing exe link")
[6] https://lore.kernel.org/patchwork/patch/697304/
[7] 4d28df6152aa ("prctl: Allow local CAP_SYS_ADMIN changing exe_file")
[8] https://github.com/nviennot/run_as_exe

Signed-off-by: Nicolas Viennot 
Signed-off-by: Adrian Reber 
---
 kernel/sys.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index 00a96746e28a..a3f4ef0bbda3 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2007,11 +2007,14 @@ static int prctl_set_mm_map(int opt, const void __user 
*addr, unsigned long data
 
if (prctl_map.exe_fd != (u32)-1) {
/*
-* Make sure the caller has the rights to
-* change /proc/pid/exe link: only local sys admin should
-* be allowed to.
+* Check if the current user is checkpoint/restore capable.
+* At the time of this writing, it checks for CAP_SYS_ADMIN
+* or CAP_CHECKPOINT_RESTORE.
+* Note that a user with access to ptrace can masquerade an
+* arbitrary program as any executable, even setuid ones.
+* This may have implications in the tomoyo subsystem.
 */
-   if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN))
+   if (!checkpoint_restore_ns_capable(current_user_ns()))
return -EINVAL;
 
error = prctl_set_mm_exe_file(mm, prctl_map.exe_fd);
-- 
2.26.2



[PATCH v6 6/7] prctl: exe link permission error changed from -EINVAL to -EPERM

2020-07-19 Thread Adrian Reber
From: Nicolas Viennot 

This brings consistency with the rest of the prctl() syscall where
-EPERM is returned when failing a capability check.

Signed-off-by: Nicolas Viennot 
Signed-off-by: Adrian Reber 
---
 kernel/sys.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index a3f4ef0bbda3..ca11af9d815d 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2015,7 +2015,7 @@ static int prctl_set_mm_map(int opt, const void __user 
*addr, unsigned long data
 * This may have implications in the tomoyo subsystem.
 */
if (!checkpoint_restore_ns_capable(current_user_ns()))
-   return -EINVAL;
+   return -EPERM;
 
error = prctl_set_mm_exe_file(mm, prctl_map.exe_fd);
if (error)
-- 
2.26.2



[PATCH v6 3/7] pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid

2020-07-19 Thread Adrian Reber
Use the newly introduced capability CAP_CHECKPOINT_RESTORE to allow
writing to ns_last_pid.

Signed-off-by: Adrian Reber 
Signed-off-by: Nicolas Viennot 
Acked-by: Christian Brauner 
Reviewed-by: Serge Hallyn 
---
 kernel/pid_namespace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 0e5ac162c3a8..ac135bd600eb 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -269,7 +269,7 @@ static int pid_ns_ctl_handler(struct ctl_table *table, int 
write,
struct ctl_table tmp = *table;
int ret, next;
 
-   if (write && !ns_capable(pid_ns->user_ns, CAP_SYS_ADMIN))
+   if (write && !checkpoint_restore_ns_capable(pid_ns->user_ns))
return -EPERM;
 
/*
-- 
2.26.2



[PATCH v6 4/7] proc: allow access in init userns for map_files with CAP_CHECKPOINT_RESTORE

2020-07-19 Thread Adrian Reber
Opening files in /proc/pid/map_files when the current user is
CAP_CHECKPOINT_RESTORE capable in the root namespace is useful for
checkpointing and restoring to recover files that are unreachable via
the file system such as deleted files, or memfd files.

Signed-off-by: Adrian Reber 
Signed-off-by: Nicolas Viennot 
Reviewed-by: Cyrill Gorcunov 
---
 fs/proc/base.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 65893686d1f1..b824a8c89011 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2194,16 +2194,16 @@ struct map_files_info {
 };
 
 /*
- * Only allow CAP_SYS_ADMIN to follow the links, due to concerns about how the
- * symlinks may be used to bypass permissions on ancestor directories in the
- * path to the file in question.
+ * Only allow CAP_SYS_ADMIN and CAP_CHECKPOINT_RESTORE to follow the links, due
+ * to concerns about how the symlinks may be used to bypass permissions on
+ * ancestor directories in the path to the file in question.
  */
 static const char *
 proc_map_files_get_link(struct dentry *dentry,
struct inode *inode,
struct delayed_call *done)
 {
-   if (!capable(CAP_SYS_ADMIN))
+   if (!checkpoint_restore_ns_capable(_user_ns))
return ERR_PTR(-EPERM);
 
return proc_pid_get_link(dentry, inode, done);
-- 
2.26.2



[PATCH v6 7/7] selftests: add clone3() CAP_CHECKPOINT_RESTORE test

2020-07-19 Thread Adrian Reber
This adds a test that changes its UID, uses capabilities to
get CAP_CHECKPOINT_RESTORE and uses clone3() with set_tid to
create a process with a given PID as non-root.

Signed-off-by: Adrian Reber 
---
 tools/testing/selftests/clone3/.gitignore |   1 +
 tools/testing/selftests/clone3/Makefile   |   4 +-
 .../clone3/clone3_cap_checkpoint_restore.c| 177 ++
 3 files changed, 181 insertions(+), 1 deletion(-)
 create mode 100644 
tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c

diff --git a/tools/testing/selftests/clone3/.gitignore 
b/tools/testing/selftests/clone3/.gitignore
index a81085742d40..83c0f6246055 100644
--- a/tools/testing/selftests/clone3/.gitignore
+++ b/tools/testing/selftests/clone3/.gitignore
@@ -2,3 +2,4 @@
 clone3
 clone3_clear_sighand
 clone3_set_tid
+clone3_cap_checkpoint_restore
diff --git a/tools/testing/selftests/clone3/Makefile 
b/tools/testing/selftests/clone3/Makefile
index cf976c732906..ef7564cb7abe 100644
--- a/tools/testing/selftests/clone3/Makefile
+++ b/tools/testing/selftests/clone3/Makefile
@@ -1,6 +1,8 @@
 # SPDX-License-Identifier: GPL-2.0
 CFLAGS += -g -I../../../../usr/include/
+LDLIBS += -lcap
 
-TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid
+TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid \
+   clone3_cap_checkpoint_restore
 
 include ../lib.mk
diff --git a/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c 
b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
new file mode 100644
index ..c0d83511cd28
--- /dev/null
+++ b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
@@ -0,0 +1,177 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Based on Christian Brauner's clone3() example.
+ * These tests are assuming to be running in the host's
+ * PID namespace.
+ */
+
+/* capabilities related code based on selftests/bpf/test_verifier.c */
+
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "../kselftest_harness.h"
+#include "clone3_selftests.h"
+
+#ifndef MAX_PID_NS_LEVEL
+#define MAX_PID_NS_LEVEL 32
+#endif
+
+static void child_exit(int ret)
+{
+   fflush(stdout);
+   fflush(stderr);
+   _exit(ret);
+}
+
+static int call_clone3_set_tid(pid_t *set_tid, size_t set_tid_size)
+{
+   int status;
+   pid_t pid = -1;
+
+   struct clone_args args = {
+   .exit_signal = SIGCHLD,
+   .set_tid = ptr_to_u64(set_tid),
+   .set_tid_size = set_tid_size,
+   };
+
+   pid = sys_clone3(, sizeof(struct clone_args));
+   if (pid < 0) {
+   ksft_print_msg("%s - Failed to create new process\n", 
strerror(errno));
+   return -errno;
+   }
+
+   if (pid == 0) {
+   int ret;
+   char tmp = 0;
+
+   ksft_print_msg
+   ("I am the child, my PID is %d (expected %d)\n", getpid(), 
set_tid[0]);
+
+   if (set_tid[0] != getpid())
+   child_exit(EXIT_FAILURE);
+   child_exit(EXIT_SUCCESS);
+   }
+
+   ksft_print_msg("I am the parent (%d). My child's pid is %d\n", 
getpid(), pid);
+
+   if (waitpid(pid, , 0) < 0) {
+   ksft_print_msg("Child returned %s\n", strerror(errno));
+   return -errno;
+   }
+
+   if (!WIFEXITED(status))
+   return -1;
+
+   return WEXITSTATUS(status);
+}
+
+static int test_clone3_set_tid(pid_t *set_tid, size_t set_tid_size)
+{
+   int ret;
+
+   ksft_print_msg("[%d] Trying clone3() with CLONE_SET_TID to %d\n", 
getpid(), set_tid[0]);
+   ret = call_clone3_set_tid(set_tid, set_tid_size);
+   ksft_print_msg("[%d] clone3() with CLONE_SET_TID %d says:%d\n", 
getpid(), set_tid[0], ret);
+   return ret;
+}
+
+struct libcap {
+   struct __user_cap_header_struct hdr;
+   struct __user_cap_data_struct data[2];
+};
+
+static int set_capability(void)
+{
+   cap_value_t cap_values[] = { CAP_SETUID, CAP_SETGID };
+   struct libcap *cap;
+   int ret = -1;
+   cap_t caps;
+
+   caps = cap_get_proc();
+   if (!caps) {
+   perror("cap_get_proc");
+   return -1;
+   }
+
+   /* Drop all capabilities */
+   if (cap_clear(caps)) {
+   perror("cap_clear");
+   goto out;
+   }
+
+   cap_set_flag(caps, CAP_EFFECTIVE, 2, cap_values, CAP_SET);
+   cap_set_flag(caps, CAP_PERMITTED, 2, cap_values, CAP_SET);
+
+   cap = (struct libcap *) caps;
+
+   /* 40 -> CAP_CHECKPOINT_RESTORE */
+   cap->data[1].effective |= 1 << (40 - 32);
+   cap->data[1].permitted |= 1 << (40 - 32);
+
+   if (cap_set_proc(caps)) {
+   perror("cap

[PATCH v6 2/7] pid: use checkpoint_restore_ns_capable() for set_tid

2020-07-19 Thread Adrian Reber
Use the newly introduced capability CAP_CHECKPOINT_RESTORE to allow
using clone3() with set_tid set.

Signed-off-by: Adrian Reber 
Signed-off-by: Nicolas Viennot 
Acked-by: Christian Brauner 
Reviewed-by: Serge Hallyn 
---
 kernel/pid.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index de9d29c41d77..a9cbab0194d9 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -199,7 +199,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t 
*set_tid,
if (tid != 1 && !tmp->child_reaper)
goto out_free;
retval = -EPERM;
-   if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
+   if (!checkpoint_restore_ns_capable(tmp->user_ns))
goto out_free;
set_tid_size--;
}
-- 
2.26.2



[PATCH v6 0/7] capabilities: Introduce CAP_CHECKPOINT_RESTORE

2020-07-19 Thread Adrian Reber
This is v6 of the 'Introduce CAP_CHECKPOINT_RESTORE' patchset. The
changes to v5 are:

 * split patch dealing with /proc/self/exe into two patches:
   * first patch to enable changing it with CAP_CHECKPOINT_RESTORE
 and detailed history in the commit message
   * second patch changes -EINVAL to -EPERM
 * use kselftest_harness.h infrastructure for test
 * replace if (!capable(CAP_SYS_ADMIN) || !capable(CAP_CHECKPOINT_RESTORE))
   with if (!checkpoint_restore_ns_capable(_user_ns))

Adrian Reber (5):
  capabilities: Introduce CAP_CHECKPOINT_RESTORE
  pid: use checkpoint_restore_ns_capable() for set_tid
  pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid
  proc: allow access in init userns for map_files with
CAP_CHECKPOINT_RESTORE
  selftests: add clone3() CAP_CHECKPOINT_RESTORE test

Nicolas Viennot (2):
  prctl: Allow local CAP_CHECKPOINT_RESTORE to change /proc/self/exe
  prctl: exe link permission error changed from -EINVAL to -EPERM

 fs/proc/base.c|   8 +-
 include/linux/capability.h|   6 +
 include/uapi/linux/capability.h   |   9 +-
 kernel/pid.c  |   2 +-
 kernel/pid_namespace.c|   2 +-
 kernel/sys.c  |  13 +-
 security/selinux/include/classmap.h   |   5 +-
 tools/testing/selftests/clone3/.gitignore |   1 +
 tools/testing/selftests/clone3/Makefile   |   4 +-
 .../clone3/clone3_cap_checkpoint_restore.c| 177 ++
 10 files changed, 212 insertions(+), 15 deletions(-)
 create mode 100644 
tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c

base-commit: d31958b30ea3b7b6e522d6bf449427748ad45822
-- 
2.26.2



[PATCH v6 1/7] capabilities: Introduce CAP_CHECKPOINT_RESTORE

2020-07-19 Thread Adrian Reber
This patch introduces CAP_CHECKPOINT_RESTORE, a new capability facilitating
checkpoint/restore for non-root users.

Over the last years, The CRIU (Checkpoint/Restore In Userspace) team has
been asked numerous times if it is possible to checkpoint/restore a
process as non-root. The answer usually was: 'almost'.

The main blocker to restore a process as non-root was to control the PID
of the restored process. This feature available via the clone3 system
call, or via /proc/sys/kernel/ns_last_pid is unfortunately guarded by
CAP_SYS_ADMIN.

In the past two years, requests for non-root checkpoint/restore have
increased due to the following use cases:
* Checkpoint/Restore in an HPC environment in combination with a
  resource manager distributing jobs where users are always running as
  non-root. There is a desire to provide a way to checkpoint and
  restore long running jobs.
* Container migration as non-root
* We have been in contact with JVM developers who are integrating
  CRIU into a Java VM to decrease the startup time. These
  checkpoint/restore applications are not meant to be running with
  CAP_SYS_ADMIN.

We have seen the following workarounds:
* Use a setuid wrapper around CRIU:
  See 
https://github.com/FredHutch/slurm-examples/blob/master/checkpointer/lib/checkpointer/checkpointer-suid.c
* Use a setuid helper that writes to ns_last_pid.
  Unfortunately, this helper delegation technique is impossible to use
  with clone3, and is thus prone to races.
  See https://github.com/twosigma/set_ns_last_pid
* Cycle through PIDs with fork() until the desired PID is reached:
  This has been demonstrated to work with cycling rates of 100,000 PIDs/s
  See https://github.com/twosigma/set_ns_last_pid
* Patch out the CAP_SYS_ADMIN check from the kernel
* Run the desired application in a new user and PID namespace to provide
  a local CAP_SYS_ADMIN for controlling PIDs. This technique has limited
  use in typical container environments (e.g., Kubernetes) as /proc is
  typically protected with read-only layers (e.g., /proc/sys) for
  hardening purposes. Read-only layers prevent additional /proc mounts
  (due to proc's SB_I_USERNS_VISIBLE property), making the use of new
  PID namespaces limited as certain applications need access to /proc
  matching their PID namespace.

The introduced capability allows to:
* Control PIDs when the current user is CAP_CHECKPOINT_RESTORE capable
  for the corresponding PID namespace via ns_last_pid/clone3.
* Open files in /proc/pid/map_files when the current user is
  CAP_CHECKPOINT_RESTORE capable in the root namespace, useful for
  recovering files that are unreachable via the file system such as
  deleted files, or memfd files.

See corresponding selftest for an example with clone3().

Signed-off-by: Adrian Reber 
Signed-off-by: Nicolas Viennot 
Acked-by: Christian Brauner 
Reviewed-by: Serge Hallyn 
---
 include/linux/capability.h  | 6 ++
 include/uapi/linux/capability.h | 9 -
 security/selinux/include/classmap.h | 5 +++--
 3 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index b4345b38a6be..1e7fe311cabe 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -261,6 +261,12 @@ static inline bool bpf_capable(void)
return capable(CAP_BPF) || capable(CAP_SYS_ADMIN);
 }
 
+static inline bool checkpoint_restore_ns_capable(struct user_namespace *ns)
+{
+   return ns_capable(ns, CAP_CHECKPOINT_RESTORE) ||
+   ns_capable(ns, CAP_SYS_ADMIN);
+}
+
 /* audit system wants to get cap info from files as well */
 extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct 
cpu_vfs_cap_data *cpu_caps);
 
diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index 48ff0757ae5e..395dd0df8d08 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -408,7 +408,14 @@ struct vfs_ns_cap_data {
  */
 #define CAP_BPF39
 
-#define CAP_LAST_CAP CAP_BPF
+
+/* Allow checkpoint/restore related operations */
+/* Allow PID selection during clone3() */
+/* Allow writing to ns_last_pid */
+
+#define CAP_CHECKPOINT_RESTORE 40
+
+#define CAP_LAST_CAP CAP_CHECKPOINT_RESTORE
 
 #define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP)
 
diff --git a/security/selinux/include/classmap.h 
b/security/selinux/include/classmap.h
index e54d62d529f1..ba2e01a6955c 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -27,9 +27,10 @@
"audit_control", "setfcap"
 
 #define COMMON_CAP2_PERMS  "mac_override", "mac_admin", "syslog", \
-   "wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf"
+   "wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf&

[PATCH v5 4/6] proc: allow access in init userns for map_files with CAP_CHECKPOINT_RESTORE

2020-07-15 Thread Adrian Reber
Opening files in /proc/pid/map_files when the current user is
CAP_CHECKPOINT_RESTORE capable in the root namespace is useful for
checkpointing and restoring to recover files that are unreachable via
the file system such as deleted files, or memfd files.

Signed-off-by: Adrian Reber 
Signed-off-by: Nicolas Viennot 
---
 fs/proc/base.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 65893686d1f1..cada783f229e 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2194,16 +2194,16 @@ struct map_files_info {
 };
 
 /*
- * Only allow CAP_SYS_ADMIN to follow the links, due to concerns about how the
- * symlinks may be used to bypass permissions on ancestor directories in the
- * path to the file in question.
+ * Only allow CAP_SYS_ADMIN and CAP_CHECKPOINT_RESTORE to follow the links, due
+ * to concerns about how the symlinks may be used to bypass permissions on
+ * ancestor directories in the path to the file in question.
  */
 static const char *
 proc_map_files_get_link(struct dentry *dentry,
struct inode *inode,
struct delayed_call *done)
 {
-   if (!capable(CAP_SYS_ADMIN))
+   if (!capable(CAP_SYS_ADMIN) || !capable(CAP_CHECKPOINT_RESTORE))
return ERR_PTR(-EPERM);
 
return proc_pid_get_link(dentry, inode, done);
-- 
2.26.2



[PATCH v5 5/6] prctl: Allow checkpoint/restore capable processes to change exe link

2020-07-15 Thread Adrian Reber
From: Nicolas Viennot 

Allow CAP_CHECKPOINT_RESTORE capable users to change /proc/self/exe.

This commit also changes the permission error code from -EINVAL to
-EPERM for consistency with the rest of the prctl() syscall when
checking capabilities.

Signed-off-by: Nicolas Viennot 
Signed-off-by: Adrian Reber 
---
 kernel/sys.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index 00a96746e28a..dd59b9142b1d 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2007,12 +2007,14 @@ static int prctl_set_mm_map(int opt, const void __user 
*addr, unsigned long data
 
if (prctl_map.exe_fd != (u32)-1) {
/*
-* Make sure the caller has the rights to
-* change /proc/pid/exe link: only local sys admin should
-* be allowed to.
+* Check if the current user is checkpoint/restore capable.
+* At the time of this writing, it checks for CAP_SYS_ADMIN
+* or CAP_CHECKPOINT_RESTORE.
+* Note that a user with access to ptrace can masquerade an
+* arbitrary program as any executable, even setuid ones.
 */
-   if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN))
-   return -EINVAL;
+   if (!checkpoint_restore_ns_capable(current_user_ns()))
+   return -EPERM;
 
error = prctl_set_mm_exe_file(mm, prctl_map.exe_fd);
if (error)
-- 
2.26.2



[PATCH v5 6/6] selftests: add clone3() CAP_CHECKPOINT_RESTORE test

2020-07-15 Thread Adrian Reber
This adds a test that changes its UID, uses capabilities to
get CAP_CHECKPOINT_RESTORE and uses clone3() with set_tid to
create a process with a given PID as non-root.

Signed-off-by: Adrian Reber 
Acked-by: Serge Hallyn 
---
 tools/testing/selftests/clone3/Makefile   |   4 +-
 .../clone3/clone3_cap_checkpoint_restore.c| 203 ++
 2 files changed, 206 insertions(+), 1 deletion(-)
 create mode 100644 
tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c

diff --git a/tools/testing/selftests/clone3/Makefile 
b/tools/testing/selftests/clone3/Makefile
index cf976c732906..ef7564cb7abe 100644
--- a/tools/testing/selftests/clone3/Makefile
+++ b/tools/testing/selftests/clone3/Makefile
@@ -1,6 +1,8 @@
 # SPDX-License-Identifier: GPL-2.0
 CFLAGS += -g -I../../../../usr/include/
+LDLIBS += -lcap
 
-TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid
+TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid \
+   clone3_cap_checkpoint_restore
 
 include ../lib.mk
diff --git a/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c 
b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
new file mode 100644
index ..2cc3d57b91f2
--- /dev/null
+++ b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
@@ -0,0 +1,203 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Based on Christian Brauner's clone3() example.
+ * These tests are assuming to be running in the host's
+ * PID namespace.
+ */
+
+/* capabilities related code based on selftests/bpf/test_verifier.c */
+
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "../kselftest.h"
+#include "clone3_selftests.h"
+
+#ifndef MAX_PID_NS_LEVEL
+#define MAX_PID_NS_LEVEL 32
+#endif
+
+static void child_exit(int ret)
+{
+   fflush(stdout);
+   fflush(stderr);
+   _exit(ret);
+}
+
+static int call_clone3_set_tid(pid_t * set_tid, size_t set_tid_size)
+{
+   int status;
+   pid_t pid = -1;
+
+   struct clone_args args = {
+   .exit_signal = SIGCHLD,
+   .set_tid = ptr_to_u64(set_tid),
+   .set_tid_size = set_tid_size,
+   };
+
+   pid = sys_clone3(, sizeof(struct clone_args));
+   if (pid < 0) {
+   ksft_print_msg("%s - Failed to create new process\n",
+  strerror(errno));
+   return -errno;
+   }
+
+   if (pid == 0) {
+   int ret;
+   char tmp = 0;
+
+   ksft_print_msg
+   ("I am the child, my PID is %d (expected %d)\n",
+getpid(), set_tid[0]);
+
+   if (set_tid[0] != getpid())
+   child_exit(EXIT_FAILURE);
+   child_exit(EXIT_SUCCESS);
+   }
+
+   ksft_print_msg("I am the parent (%d). My child's pid is %d\n",
+  getpid(), pid);
+
+   if (waitpid(pid, , 0) < 0) {
+   ksft_print_msg("Child returned %s\n", strerror(errno));
+   return -errno;
+   }
+
+   if (!WIFEXITED(status))
+   return -1;
+
+   return WEXITSTATUS(status);
+}
+
+static int test_clone3_set_tid(pid_t * set_tid,
+  size_t set_tid_size, int expected)
+{
+   int ret;
+
+   ksft_print_msg("[%d] Trying clone3() with CLONE_SET_TID to %d\n",
+  getpid(), set_tid[0]);
+   ret = call_clone3_set_tid(set_tid, set_tid_size);
+
+   ksft_print_msg
+   ("[%d] clone3() with CLONE_SET_TID %d says :%d - expected %d\n",
+getpid(), set_tid[0], ret, expected);
+   if (ret != expected) {
+   ksft_test_result_fail
+   ("[%d] Result (%d) is different than expected (%d)\n",
+getpid(), ret, expected);
+   return -1;
+   }
+   ksft_test_result_pass
+   ("[%d] Result (%d) matches expectation (%d)\n", getpid(), ret,
+expected);
+
+   return 0;
+}
+
+struct libcap {
+   struct __user_cap_header_struct hdr;
+   struct __user_cap_data_struct data[2];
+};
+
+static int set_capability()
+{
+   cap_value_t cap_values[] = { CAP_SETUID, CAP_SETGID };
+   struct libcap *cap;
+   int ret = -1;
+   cap_t caps;
+
+   caps = cap_get_proc();
+   if (!caps) {
+   perror("cap_get_proc");
+   return -1;
+   }
+
+   /* Drop all capabilities */
+   if (cap_clear(caps)) {
+   perror("cap_clear");
+   goto out;
+   }
+
+   cap_set_flag(caps, CAP_EFFECTIVE, 2, cap_values, CAP_SET);
+   cap_set_flag(caps, CAP_PERMITTED, 2, cap_values, CAP_SET);
+
+   cap = (struct libcap *) caps;
+
+   /* 40 

[PATCH v5 3/6] pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid

2020-07-15 Thread Adrian Reber
Use the newly introduced capability CAP_CHECKPOINT_RESTORE to allow
writing to ns_last_pid.

Signed-off-by: Adrian Reber 
Signed-off-by: Nicolas Viennot 
---
 kernel/pid_namespace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 0e5ac162c3a8..ac135bd600eb 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -269,7 +269,7 @@ static int pid_ns_ctl_handler(struct ctl_table *table, int 
write,
struct ctl_table tmp = *table;
int ret, next;
 
-   if (write && !ns_capable(pid_ns->user_ns, CAP_SYS_ADMIN))
+   if (write && !checkpoint_restore_ns_capable(pid_ns->user_ns))
return -EPERM;
 
/*
-- 
2.26.2



[PATCH v5 2/6] pid: use checkpoint_restore_ns_capable() for set_tid

2020-07-15 Thread Adrian Reber
Use the newly introduced capability CAP_CHECKPOINT_RESTORE to allow
using clone3() with set_tid set.

Signed-off-by: Adrian Reber 
Signed-off-by: Nicolas Viennot 
---
 kernel/pid.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index de9d29c41d77..a9cbab0194d9 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -199,7 +199,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t 
*set_tid,
if (tid != 1 && !tmp->child_reaper)
goto out_free;
retval = -EPERM;
-   if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
+   if (!checkpoint_restore_ns_capable(tmp->user_ns))
goto out_free;
set_tid_size--;
}
-- 
2.26.2



[PATCH v5 1/6] capabilities: Introduce CAP_CHECKPOINT_RESTORE

2020-07-15 Thread Adrian Reber
This patch introduces CAP_CHECKPOINT_RESTORE, a new capability facilitating
checkpoint/restore for non-root users.

Over the last years, The CRIU (Checkpoint/Restore In Userspace) team has been
asked numerous times if it is possible to checkpoint/restore a process as
non-root. The answer usually was: 'almost'.

The main blocker to restore a process as non-root was to control the PID of the
restored process. This feature available via the clone3 system call, or via
/proc/sys/kernel/ns_last_pid is unfortunately guarded by CAP_SYS_ADMIN.

In the past two years, requests for non-root checkpoint/restore have increased
due to the following use cases:
* Checkpoint/Restore in an HPC environment in combination with a resource
  manager distributing jobs where users are always running as non-root.
  There is a desire to provide a way to checkpoint and restore long running
  jobs.
* Container migration as non-root
* We have been in contact with JVM developers who are integrating
  CRIU into a Java VM to decrease the startup time. These checkpoint/restore
  applications are not meant to be running with CAP_SYS_ADMIN.

We have seen the following workarounds:
* Use a setuid wrapper around CRIU:
  See 
https://github.com/FredHutch/slurm-examples/blob/master/checkpointer/lib/checkpointer/checkpointer-suid.c
* Use a setuid helper that writes to ns_last_pid.
  Unfortunately, this helper delegation technique is impossible to use with
  clone3, and is thus prone to races.
  See https://github.com/twosigma/set_ns_last_pid
* Cycle through PIDs with fork() until the desired PID is reached:
  This has been demonstrated to work with cycling rates of 100,000 PIDs/s
  See https://github.com/twosigma/set_ns_last_pid
* Patch out the CAP_SYS_ADMIN check from the kernel
* Run the desired application in a new user and PID namespace to provide
  a local CAP_SYS_ADMIN for controlling PIDs. This technique has limited use in
  typical container environments (e.g., Kubernetes) as /proc is
  typically protected with read-only layers (e.g., /proc/sys) for hardening
  purposes. Read-only layers prevent additional /proc mounts (due to proc's
  SB_I_USERNS_VISIBLE property), making the use of new PID namespaces limited as
  certain applications need access to /proc matching their PID namespace.

The introduced capability allows to:
* Control PIDs when the current user is CAP_CHECKPOINT_RESTORE capable
  for the corresponding PID namespace via ns_last_pid/clone3.
* Open files in /proc/pid/map_files when the current user is
  CAP_CHECKPOINT_RESTORE capable in the root namespace, useful for recovering
  files that are unreachable via the file system such as deleted files, or memfd
  files.

See corresponding selftest for an example with clone3().

Signed-off-by: Adrian Reber 
Signed-off-by: Nicolas Viennot 
---
 include/linux/capability.h  | 6 ++
 include/uapi/linux/capability.h | 9 -
 security/selinux/include/classmap.h | 5 +++--
 3 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index b4345b38a6be..1e7fe311cabe 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -261,6 +261,12 @@ static inline bool bpf_capable(void)
return capable(CAP_BPF) || capable(CAP_SYS_ADMIN);
 }
 
+static inline bool checkpoint_restore_ns_capable(struct user_namespace *ns)
+{
+   return ns_capable(ns, CAP_CHECKPOINT_RESTORE) ||
+   ns_capable(ns, CAP_SYS_ADMIN);
+}
+
 /* audit system wants to get cap info from files as well */
 extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct 
cpu_vfs_cap_data *cpu_caps);
 
diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index 48ff0757ae5e..395dd0df8d08 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -408,7 +408,14 @@ struct vfs_ns_cap_data {
  */
 #define CAP_BPF39
 
-#define CAP_LAST_CAP CAP_BPF
+
+/* Allow checkpoint/restore related operations */
+/* Allow PID selection during clone3() */
+/* Allow writing to ns_last_pid */
+
+#define CAP_CHECKPOINT_RESTORE 40
+
+#define CAP_LAST_CAP CAP_CHECKPOINT_RESTORE
 
 #define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP)
 
diff --git a/security/selinux/include/classmap.h 
b/security/selinux/include/classmap.h
index e54d62d529f1..ba2e01a6955c 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -27,9 +27,10 @@
"audit_control", "setfcap"
 
 #define COMMON_CAP2_PERMS  "mac_override", "mac_admin", "syslog", \
-   "wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf"
+   "wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf", \
+   "checkpoint_restore"
 
-#if CAP_LAST_CAP > CAP_BPF
+#if CAP_LAST_CAP > CAP_CHECKPOINT_RESTORE
 #error New capability defined, please update COMMON_CAP2_PERMS.
 #endif
 
-- 
2.26.2



[PATCH v5 0/6] capabilities: Introduce CAP_CHECKPOINT_RESTORE

2020-07-15 Thread Adrian Reber
This is v5 of the 'Introduce CAP_CHECKPOINT_RESTORE' patchset. The
changes to v4 are:

 * split into more patches to have the introduction of
   CAP_CHECKPOINT_RESTORE and the actual usage in different
   patches
 * reduce the /proc/self/exe patch to only be about
   CAP_CHECKPOINT_RESTORE

Adrian Reber (5):
  capabilities: Introduce CAP_CHECKPOINT_RESTORE
  pid: use checkpoint_restore_ns_capable() for set_tid
  pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid
  proc: allow access in init userns for map_files with CAP_CHECKPOINT_RESTORE
  selftests: add clone3() CAP_CHECKPOINT_RESTORE test

Nicolas Viennot (1):
  prctl: Allow checkpoint/restore capable processes to change exe link

 fs/proc/base.c|   8 +-
 include/linux/capability.h|   6 +
 include/uapi/linux/capability.h   |   9 +-
 kernel/pid.c  |   2 +-
 kernel/pid_namespace.c|   2 +-
 kernel/sys.c  |  12 +-
 security/selinux/include/classmap.h   |   5 +-
 tools/testing/selftests/clone3/Makefile   |   4 +-
 .../clone3/clone3_cap_checkpoint_restore.c| 203 ++
 9 files changed, 236 insertions(+), 15 deletions(-)
 create mode 100644 
tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c


base-commit: d31958b30ea3b7b6e522d6bf449427748ad45822
-- 
2.26.2



Re: [PATCH v4 2/3] selftests: add clone3() CAP_CHECKPOINT_RESTORE test

2020-07-03 Thread Adrian Reber
On Thu, Jul 02, 2020 at 03:53:05PM -0500, Serge E. Hallyn wrote:
> On Wed, Jul 01, 2020 at 08:49:05AM +0200, Adrian Reber wrote:
> > This adds a test that changes its UID, uses capabilities to
> > get CAP_CHECKPOINT_RESTORE and uses clone3() with set_tid to
> > create a process with a given PID as non-root.
> 
> Seems worth also verifying that it fails if you have no capabilities.
> I don't see that in the existing clone3/ test dir.

Bit confused about what you mean. This test does:

 * switch UID to 1000
 * run clone3() with set_tid set and expect EPERM
 * set CAP_CHECKPOINT_RESTORE capability
 * run clone3() with set_tid set and expect success

So it already does what I think you are asking for. Did I misunderstand
your comment?

    Adrian

> > Signed-off-by: Adrian Reber 
> > ---
> >  tools/testing/selftests/clone3/Makefile   |   4 +-
> >  .../clone3/clone3_cap_checkpoint_restore.c| 203 ++
> >  2 files changed, 206 insertions(+), 1 deletion(-)
> >  create mode 100644 
> > tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
> > 
> > diff --git a/tools/testing/selftests/clone3/Makefile 
> > b/tools/testing/selftests/clone3/Makefile
> > index cf976c732906..ef7564cb7abe 100644
> > --- a/tools/testing/selftests/clone3/Makefile
> > +++ b/tools/testing/selftests/clone3/Makefile
> > @@ -1,6 +1,8 @@
> >  # SPDX-License-Identifier: GPL-2.0
> >  CFLAGS += -g -I../../../../usr/include/
> > +LDLIBS += -lcap
> >  
> > -TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid
> > +TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid \
> > +   clone3_cap_checkpoint_restore
> >  
> >  include ../lib.mk
> > diff --git a/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c 
> > b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
> > new file mode 100644
> > index ..2cc3d57b91f2
> > --- /dev/null
> > +++ b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
> > @@ -0,0 +1,203 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +/*
> > + * Based on Christian Brauner's clone3() example.
> > + * These tests are assuming to be running in the host's
> > + * PID namespace.
> > + */
> > +
> > +/* capabilities related code based on selftests/bpf/test_verifier.c */
> > +
> > +#define _GNU_SOURCE
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +
> > +#include "../kselftest.h"
> > +#include "clone3_selftests.h"
> > +
> > +#ifndef MAX_PID_NS_LEVEL
> > +#define MAX_PID_NS_LEVEL 32
> > +#endif
> > +
> > +static void child_exit(int ret)
> > +{
> > +   fflush(stdout);
> > +   fflush(stderr);
> > +   _exit(ret);
> > +}
> > +
> > +static int call_clone3_set_tid(pid_t * set_tid, size_t set_tid_size)
> > +{
> > +   int status;
> > +   pid_t pid = -1;
> > +
> > +   struct clone_args args = {
> > +   .exit_signal = SIGCHLD,
> > +   .set_tid = ptr_to_u64(set_tid),
> > +   .set_tid_size = set_tid_size,
> > +   };
> > +
> > +   pid = sys_clone3(, sizeof(struct clone_args));
> > +   if (pid < 0) {
> > +   ksft_print_msg("%s - Failed to create new process\n",
> > +  strerror(errno));
> > +   return -errno;
> > +   }
> > +
> > +   if (pid == 0) {
> > +   int ret;
> > +   char tmp = 0;
> > +
> > +   ksft_print_msg
> > +   ("I am the child, my PID is %d (expected %d)\n",
> > +getpid(), set_tid[0]);
> > +
> > +   if (set_tid[0] != getpid())
> > +   child_exit(EXIT_FAILURE);
> > +   child_exit(EXIT_SUCCESS);
> > +   }
> > +
> > +   ksft_print_msg("I am the parent (%d). My child's pid is %d\n",
> > +  getpid(), pid);
> > +
> > +   if (waitpid(pid, , 0) < 0) {
> > +   ksft_print_msg("Child returned %s\n", strerror(errno));
> > +   return -errno;
> > +   }
> > +
> > +   if (!WIFEXITED(status))
> > +   return -1;
> > +
> > +   return WEXITSTATUS(status);
> > +}
> > +
> > +static int test_clone3_set_tid(pid_t * se

Re: [PATCH v4 1/3] capabilities: Introduce CAP_CHECKPOINT_RESTORE

2020-07-03 Thread Adrian Reber
On Wed, Jul 01, 2020 at 10:27:08AM +0200, Christian Brauner wrote:
> On Wed, Jul 01, 2020 at 08:49:04AM +0200, Adrian Reber wrote:
> > This patch introduces CAP_CHECKPOINT_RESTORE, a new capability facilitating
> > checkpoint/restore for non-root users.
> > 
> > Over the last years, The CRIU (Checkpoint/Restore In Userspace) team has 
> > been
> > asked numerous times if it is possible to checkpoint/restore a process as
> > non-root. The answer usually was: 'almost'.
> > 
> > The main blocker to restore a process as non-root was to control the PID of 
> > the
> > restored process. This feature available via the clone3 system call, or via
> > /proc/sys/kernel/ns_last_pid is unfortunately guarded by CAP_SYS_ADMIN.
> > 
> > In the past two years, requests for non-root checkpoint/restore have 
> > increased
> > due to the following use cases:
> > * Checkpoint/Restore in an HPC environment in combination with a resource
> >   manager distributing jobs where users are always running as non-root.
> >   There is a desire to provide a way to checkpoint and restore long running
> >   jobs.
> > * Container migration as non-root
> > * We have been in contact with JVM developers who are integrating
> >   CRIU into a Java VM to decrease the startup time. These checkpoint/restore
> >   applications are not meant to be running with CAP_SYS_ADMIN.
> > 
> > We have seen the following workarounds:
> > * Use a setuid wrapper around CRIU:
> >   See 
> > https://github.com/FredHutch/slurm-examples/blob/master/checkpointer/lib/checkpointer/checkpointer-suid.c
> > * Use a setuid helper that writes to ns_last_pid.
> >   Unfortunately, this helper delegation technique is impossible to use with
> >   clone3, and is thus prone to races.
> >   See https://github.com/twosigma/set_ns_last_pid
> > * Cycle through PIDs with fork() until the desired PID is reached:
> >   This has been demonstrated to work with cycling rates of 100,000 PIDs/s
> >   See https://github.com/twosigma/set_ns_last_pid
> > * Patch out the CAP_SYS_ADMIN check from the kernel
> > * Run the desired application in a new user and PID namespace to provide
> >   a local CAP_SYS_ADMIN for controlling PIDs. This technique has limited 
> > use in
> >   typical container environments (e.g., Kubernetes) as /proc is
> >   typically protected with read-only layers (e.g., /proc/sys) for hardening
> >   purposes. Read-only layers prevent additional /proc mounts (due to proc's
> >   SB_I_USERNS_VISIBLE property), making the use of new PID namespaces 
> > limited as
> >   certain applications need access to /proc matching their PID namespace.
> > 
> > The introduced capability allows to:
> > * Control PIDs when the current user is CAP_CHECKPOINT_RESTORE capable
> >   for the corresponding PID namespace via ns_last_pid/clone3.
> > * Open files in /proc/pid/map_files when the current user is
> >   CAP_CHECKPOINT_RESTORE capable in the root namespace, useful for 
> > recovering
> >   files that are unreachable via the file system such as deleted files, or 
> > memfd
> >   files.
> > 
> > See corresponding selftest for an example with clone3().
> > 
> > Signed-off-by: Adrian Reber 
> > Signed-off-by: Nicolas Viennot 
> > ---
> 
> I think that now looks reasonable. A few comments.
> 
> Before we proceed, please split the addition of
> checkpoint_restore_ns_capable() out into a separate patch.
> In fact, I think the cleanest way of doing this would be:
> - 0/n capability: add CAP_CHECKPOINT_RESTORE
> - 1/n pid: use checkpoint_restore_ns_capable() for set_tid
> - 2/n pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid
> - 3/n: proc: require checkpoint_restore_ns_capable() in init userns for 
> map_files
> 
> (commit subjects up to you of course) and a nice commit message for each
> time we relax a permissions on something so we have a clear separate
> track record for each change in case we need to revert something. Then
> the rest of the patches in this series. Testing patches probably last.

Yes, makes sense. I was thinking about this already, but I was not sure
if it I should do it or not. But I had the same idea already.

Adrian



[PATCH v4 2/3] selftests: add clone3() CAP_CHECKPOINT_RESTORE test

2020-07-01 Thread Adrian Reber
This adds a test that changes its UID, uses capabilities to
get CAP_CHECKPOINT_RESTORE and uses clone3() with set_tid to
create a process with a given PID as non-root.

Signed-off-by: Adrian Reber 
---
 tools/testing/selftests/clone3/Makefile   |   4 +-
 .../clone3/clone3_cap_checkpoint_restore.c| 203 ++
 2 files changed, 206 insertions(+), 1 deletion(-)
 create mode 100644 
tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c

diff --git a/tools/testing/selftests/clone3/Makefile 
b/tools/testing/selftests/clone3/Makefile
index cf976c732906..ef7564cb7abe 100644
--- a/tools/testing/selftests/clone3/Makefile
+++ b/tools/testing/selftests/clone3/Makefile
@@ -1,6 +1,8 @@
 # SPDX-License-Identifier: GPL-2.0
 CFLAGS += -g -I../../../../usr/include/
+LDLIBS += -lcap
 
-TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid
+TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid \
+   clone3_cap_checkpoint_restore
 
 include ../lib.mk
diff --git a/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c 
b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
new file mode 100644
index ..2cc3d57b91f2
--- /dev/null
+++ b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
@@ -0,0 +1,203 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Based on Christian Brauner's clone3() example.
+ * These tests are assuming to be running in the host's
+ * PID namespace.
+ */
+
+/* capabilities related code based on selftests/bpf/test_verifier.c */
+
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "../kselftest.h"
+#include "clone3_selftests.h"
+
+#ifndef MAX_PID_NS_LEVEL
+#define MAX_PID_NS_LEVEL 32
+#endif
+
+static void child_exit(int ret)
+{
+   fflush(stdout);
+   fflush(stderr);
+   _exit(ret);
+}
+
+static int call_clone3_set_tid(pid_t * set_tid, size_t set_tid_size)
+{
+   int status;
+   pid_t pid = -1;
+
+   struct clone_args args = {
+   .exit_signal = SIGCHLD,
+   .set_tid = ptr_to_u64(set_tid),
+   .set_tid_size = set_tid_size,
+   };
+
+   pid = sys_clone3(, sizeof(struct clone_args));
+   if (pid < 0) {
+   ksft_print_msg("%s - Failed to create new process\n",
+  strerror(errno));
+   return -errno;
+   }
+
+   if (pid == 0) {
+   int ret;
+   char tmp = 0;
+
+   ksft_print_msg
+   ("I am the child, my PID is %d (expected %d)\n",
+getpid(), set_tid[0]);
+
+   if (set_tid[0] != getpid())
+   child_exit(EXIT_FAILURE);
+   child_exit(EXIT_SUCCESS);
+   }
+
+   ksft_print_msg("I am the parent (%d). My child's pid is %d\n",
+  getpid(), pid);
+
+   if (waitpid(pid, , 0) < 0) {
+   ksft_print_msg("Child returned %s\n", strerror(errno));
+   return -errno;
+   }
+
+   if (!WIFEXITED(status))
+   return -1;
+
+   return WEXITSTATUS(status);
+}
+
+static int test_clone3_set_tid(pid_t * set_tid,
+  size_t set_tid_size, int expected)
+{
+   int ret;
+
+   ksft_print_msg("[%d] Trying clone3() with CLONE_SET_TID to %d\n",
+  getpid(), set_tid[0]);
+   ret = call_clone3_set_tid(set_tid, set_tid_size);
+
+   ksft_print_msg
+   ("[%d] clone3() with CLONE_SET_TID %d says :%d - expected %d\n",
+getpid(), set_tid[0], ret, expected);
+   if (ret != expected) {
+   ksft_test_result_fail
+   ("[%d] Result (%d) is different than expected (%d)\n",
+getpid(), ret, expected);
+   return -1;
+   }
+   ksft_test_result_pass
+   ("[%d] Result (%d) matches expectation (%d)\n", getpid(), ret,
+expected);
+
+   return 0;
+}
+
+struct libcap {
+   struct __user_cap_header_struct hdr;
+   struct __user_cap_data_struct data[2];
+};
+
+static int set_capability()
+{
+   cap_value_t cap_values[] = { CAP_SETUID, CAP_SETGID };
+   struct libcap *cap;
+   int ret = -1;
+   cap_t caps;
+
+   caps = cap_get_proc();
+   if (!caps) {
+   perror("cap_get_proc");
+   return -1;
+   }
+
+   /* Drop all capabilities */
+   if (cap_clear(caps)) {
+   perror("cap_clear");
+   goto out;
+   }
+
+   cap_set_flag(caps, CAP_EFFECTIVE, 2, cap_values, CAP_SET);
+   cap_set_flag(caps, CAP_PERMITTED, 2, cap_values, CAP_SET);
+
+   cap = (struct libcap *) caps;
+
+   /* 40 -> CAP_CHECKPOINT_RESTORE */
+ 

[PATCH v4 3/3] prctl: Allow ptrace capable processes to change /proc/self/exe

2020-07-01 Thread Adrian Reber
From: Nicolas Viennot 

Previously, the current process could only change the /proc/self/exe
link with local CAP_SYS_ADMIN.
This commit relaxes this restriction by permitting such change with
CAP_CHECKPOINT_RESTORE, and the ability to use ptrace.

With access to ptrace facilities, a process can do the following: fork a
child, execve() the target executable, and have the child use ptrace()
to replace the memory content of the current process. This technique
makes it possible to masquerade an arbitrary program as any executable,
even setuid ones.

Signed-off-by: Nicolas Viennot 
Signed-off-by: Adrian Reber 
---
 include/linux/lsm_hook_defs.h |  1 +
 include/linux/security.h  |  6 ++
 kernel/sys.c  | 12 
 security/commoncap.c  | 26 ++
 security/security.c   |  5 +
 security/selinux/hooks.c  | 14 ++
 6 files changed, 56 insertions(+), 8 deletions(-)

diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
index 0098852bb56a..90e51d5e093b 100644
--- a/include/linux/lsm_hook_defs.h
+++ b/include/linux/lsm_hook_defs.h
@@ -211,6 +211,7 @@ LSM_HOOK(int, 0, task_kill, struct task_struct *p, struct 
kernel_siginfo *info,
 int sig, const struct cred *cred)
 LSM_HOOK(int, -ENOSYS, task_prctl, int option, unsigned long arg2,
 unsigned long arg3, unsigned long arg4, unsigned long arg5)
+LSM_HOOK(int, 0, prctl_set_mm_exe_file, struct file *exe_file)
 LSM_HOOK(void, LSM_RET_VOID, task_to_inode, struct task_struct *p,
 struct inode *inode)
 LSM_HOOK(int, 0, ipc_permission, struct kern_ipc_perm *ipcp, short flag)
diff --git a/include/linux/security.h b/include/linux/security.h
index 2797e7f6418e..0f594eb7e766 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -412,6 +412,7 @@ int security_task_kill(struct task_struct *p, struct 
kernel_siginfo *info,
int sig, const struct cred *cred);
 int security_task_prctl(int option, unsigned long arg2, unsigned long arg3,
unsigned long arg4, unsigned long arg5);
+int security_prctl_set_mm_exe_file(struct file *exe_file);
 void security_task_to_inode(struct task_struct *p, struct inode *inode);
 int security_ipc_permission(struct kern_ipc_perm *ipcp, short flag);
 void security_ipc_getsecid(struct kern_ipc_perm *ipcp, u32 *secid);
@@ -1124,6 +1125,11 @@ static inline int security_task_prctl(int option, 
unsigned long arg2,
return cap_task_prctl(option, arg2, arg3, arg4, arg5);
 }
 
+static inline int security_prctl_set_mm_exe_file(struct file *exe_file)
+{
+   return cap_prctl_set_mm_exe_file(exe_file);
+}
+
 static inline void security_task_to_inode(struct task_struct *p, struct inode 
*inode)
 { }
 
diff --git a/kernel/sys.c b/kernel/sys.c
index 00a96746e28a..bb53e8408c63 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1851,6 +1851,10 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, 
unsigned int fd)
if (err)
goto exit;
 
+   err = security_prctl_set_mm_exe_file(exe.file);
+   if (err)
+   goto exit;
+
/*
 * Forbid mm->exe_file change if old file still mapped.
 */
@@ -2006,14 +2010,6 @@ static int prctl_set_mm_map(int opt, const void __user 
*addr, unsigned long data
}
 
if (prctl_map.exe_fd != (u32)-1) {
-   /*
-* Make sure the caller has the rights to
-* change /proc/pid/exe link: only local sys admin should
-* be allowed to.
-*/
-   if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN))
-   return -EINVAL;
-
error = prctl_set_mm_exe_file(mm, prctl_map.exe_fd);
if (error)
return error;
diff --git a/security/commoncap.c b/security/commoncap.c
index 59bf3c1674c8..663d00fe2ecc 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -1291,6 +1291,31 @@ int cap_task_prctl(int option, unsigned long arg2, 
unsigned long arg3,
}
 }
 
+/**
+ * cap_prctl_set_mm_exe_file - Determine whether /proc/self/exe can be changed
+ * by the current process.
+ * @exe_file: The new exe file
+ * Returns 0 if permission is granted, -ve if denied.
+ *
+ * The current process is permitted to change its /proc/self/exe link via two 
policies:
+ * 1) The current user can do checkpoint/restore. At the time of this writing,
+ *this means CAP_SYS_ADMIN or CAP_CHECKPOINT_RESTORE capable.
+ * 2) The current user can use ptrace.
+ *
+ * With access to ptrace facilities, a process can do the following:
+ * fork a child, execve() the target executable, and have the child use
+ * ptrace() to replace the memory content of the current process.
+ * This technique makes it possible to masquerade an arbitrary program as the
+ * target executable, even if it is setuid.
+ */
+int cap_prctl_set_mm_exe_file(struct file *exe_f

[PATCH v4 1/3] capabilities: Introduce CAP_CHECKPOINT_RESTORE

2020-07-01 Thread Adrian Reber
This patch introduces CAP_CHECKPOINT_RESTORE, a new capability facilitating
checkpoint/restore for non-root users.

Over the last years, The CRIU (Checkpoint/Restore In Userspace) team has been
asked numerous times if it is possible to checkpoint/restore a process as
non-root. The answer usually was: 'almost'.

The main blocker to restore a process as non-root was to control the PID of the
restored process. This feature available via the clone3 system call, or via
/proc/sys/kernel/ns_last_pid is unfortunately guarded by CAP_SYS_ADMIN.

In the past two years, requests for non-root checkpoint/restore have increased
due to the following use cases:
* Checkpoint/Restore in an HPC environment in combination with a resource
  manager distributing jobs where users are always running as non-root.
  There is a desire to provide a way to checkpoint and restore long running
  jobs.
* Container migration as non-root
* We have been in contact with JVM developers who are integrating
  CRIU into a Java VM to decrease the startup time. These checkpoint/restore
  applications are not meant to be running with CAP_SYS_ADMIN.

We have seen the following workarounds:
* Use a setuid wrapper around CRIU:
  See 
https://github.com/FredHutch/slurm-examples/blob/master/checkpointer/lib/checkpointer/checkpointer-suid.c
* Use a setuid helper that writes to ns_last_pid.
  Unfortunately, this helper delegation technique is impossible to use with
  clone3, and is thus prone to races.
  See https://github.com/twosigma/set_ns_last_pid
* Cycle through PIDs with fork() until the desired PID is reached:
  This has been demonstrated to work with cycling rates of 100,000 PIDs/s
  See https://github.com/twosigma/set_ns_last_pid
* Patch out the CAP_SYS_ADMIN check from the kernel
* Run the desired application in a new user and PID namespace to provide
  a local CAP_SYS_ADMIN for controlling PIDs. This technique has limited use in
  typical container environments (e.g., Kubernetes) as /proc is
  typically protected with read-only layers (e.g., /proc/sys) for hardening
  purposes. Read-only layers prevent additional /proc mounts (due to proc's
  SB_I_USERNS_VISIBLE property), making the use of new PID namespaces limited as
  certain applications need access to /proc matching their PID namespace.

The introduced capability allows to:
* Control PIDs when the current user is CAP_CHECKPOINT_RESTORE capable
  for the corresponding PID namespace via ns_last_pid/clone3.
* Open files in /proc/pid/map_files when the current user is
  CAP_CHECKPOINT_RESTORE capable in the root namespace, useful for recovering
  files that are unreachable via the file system such as deleted files, or memfd
  files.

See corresponding selftest for an example with clone3().

Signed-off-by: Adrian Reber 
Signed-off-by: Nicolas Viennot 
---
 fs/proc/base.c  | 8 
 include/linux/capability.h  | 6 ++
 include/uapi/linux/capability.h | 9 -
 kernel/pid.c| 2 +-
 kernel/pid_namespace.c  | 2 +-
 security/selinux/include/classmap.h | 5 +++--
 6 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index d86c0afc8a85..ad806069c778 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2189,16 +2189,16 @@ struct map_files_info {
 };
 
 /*
- * Only allow CAP_SYS_ADMIN to follow the links, due to concerns about how the
- * symlinks may be used to bypass permissions on ancestor directories in the
- * path to the file in question.
+ * Only allow CAP_SYS_ADMIN and CAP_CHECKPOINT_RESTORE to follow the links, due
+ * to concerns about how the symlinks may be used to bypass permissions on
+ * ancestor directories in the path to the file in question.
  */
 static const char *
 proc_map_files_get_link(struct dentry *dentry,
struct inode *inode,
struct delayed_call *done)
 {
-   if (!capable(CAP_SYS_ADMIN))
+   if (!capable(CAP_SYS_ADMIN) && !capable(CAP_CHECKPOINT_RESTORE))
return ERR_PTR(-EPERM);
 
return proc_pid_get_link(dentry, inode, done);
diff --git a/include/linux/capability.h b/include/linux/capability.h
index b4345b38a6be..1e7fe311cabe 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -261,6 +261,12 @@ static inline bool bpf_capable(void)
return capable(CAP_BPF) || capable(CAP_SYS_ADMIN);
 }
 
+static inline bool checkpoint_restore_ns_capable(struct user_namespace *ns)
+{
+   return ns_capable(ns, CAP_CHECKPOINT_RESTORE) ||
+   ns_capable(ns, CAP_SYS_ADMIN);
+}
+
 /* audit system wants to get cap info from files as well */
 extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct 
cpu_vfs_cap_data *cpu_caps);
 
diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index 48ff0757ae5e..395dd0df8d08 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -408,7

[PATCH v4 0/3] capabilities: Introduce CAP_CHECKPOINT_RESTORE

2020-07-01 Thread Adrian Reber
This is v4 of the 'Introduce CAP_CHECKPOINT_RESTORE' patchset. There
is only one change from v3 to address Jann's comment on patch 3/3

 (That is not necessarily true in the presence of LSMs like SELinux:
 You'd have to be able to FILE__EXECUTE_NO_TRANS the target executable
 according to the system's security policy.)

Nicolas updated the last patch (3/3). The first two patches are
unchanged from v3.

Adrian Reber (2):
  capabilities: Introduce CAP_CHECKPOINT_RESTORE
  selftests: add clone3() CAP_CHECKPOINT_RESTORE test

Nicolas Viennot (1):
  prctl: Allow ptrace capable processes to change /proc/self/exe

 fs/proc/base.c|   8 +-
 include/linux/capability.h|   6 +
 include/linux/lsm_hook_defs.h |   1 +
 include/linux/security.h  |   6 +
 include/uapi/linux/capability.h   |   9 +-
 kernel/pid.c  |   2 +-
 kernel/pid_namespace.c|   2 +-
 kernel/sys.c  |  12 +-
 security/commoncap.c  |  26 +++
 security/security.c   |   5 +
 security/selinux/hooks.c  |  14 ++
 security/selinux/include/classmap.h   |   5 +-
 tools/testing/selftests/clone3/Makefile   |   4 +-
 .../clone3/clone3_cap_checkpoint_restore.c| 203 ++
 14 files changed, 285 insertions(+), 18 deletions(-)
 create mode 100644 
tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c


base-commit: f2b92b14533e646e434523abdbafddb727c23898
-- 
2.26.2



[PATCH v3 3/3] prctl: Allow ptrace capable processes to change exe_fd

2020-06-18 Thread Adrian Reber
From: Nicolas Viennot 

The current process is authorized to change its /proc/self/exe link via
two policies:
1) The current user can do checkpoint/restore In other words is
   CAP_SYS_ADMIN or CAP_CHECKPOINT_RESTORE capable.
2) The current user can use ptrace.

With access to ptrace facilities, a process can do the following: fork a
child, execve() the target executable, and have the child use ptrace()
to replace the memory content of the current process. This technique
makes it possible to masquerade an arbitrary program as any executable,
even setuid ones.

This commit also changes the permission error code from -EINVAL to
-EPERM for consistency with the rest of the prctl() syscall when
checking capabilities.

Signed-off-by: Nicolas Viennot 
Signed-off-by: Adrian Reber 
---
 kernel/sys.c | 21 -
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index 00a96746e28a..ce77012a42d7 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2007,12 +2007,23 @@ static int prctl_set_mm_map(int opt, const void __user 
*addr, unsigned long data
 
if (prctl_map.exe_fd != (u32)-1) {
/*
-* Make sure the caller has the rights to
-* change /proc/pid/exe link: only local sys admin should
-* be allowed to.
+* The current process is authorized to change its
+* /proc/self/exe link via two policies:
+* 1) The current user can do checkpoint/restore
+*In other words is CAP_SYS_ADMIN or
+*CAP_CHECKPOINT_RESTORE capable.
+* 2) The current user can use ptrace.
+*
+* With access to ptrace facilities, a process can do the
+* following: fork a child, execve() the target executable,
+* and have the child use ptrace() to replace the memory
+* content of the current process. This technique makes it
+* possible to masquerade an arbitrary program as the target
+* executable, even if it is setuid.
 */
-   if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN))
-   return -EINVAL;
+   if (!(checkpoint_restore_ns_capable(current_user_ns()) ||
+ security_ptrace_access_check(current, 
PTRACE_MODE_ATTACH_REALCREDS)))
+   return -EPERM;
 
error = prctl_set_mm_exe_file(mm, prctl_map.exe_fd);
if (error)
-- 
2.26.2



[PATCH v3 2/3] selftests: add clone3() CAP_CHECKPOINT_RESTORE test

2020-06-18 Thread Adrian Reber
This adds a test that changes its UID, uses capabilities to
get CAP_CHECKPOINT_RESTORE and uses clone3() with set_tid to
create a process with a given PID as non-root.

Signed-off-by: Adrian Reber 
---
 tools/testing/selftests/clone3/Makefile   |   4 +-
 .../clone3/clone3_cap_checkpoint_restore.c| 203 ++
 2 files changed, 206 insertions(+), 1 deletion(-)
 create mode 100644 
tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c

diff --git a/tools/testing/selftests/clone3/Makefile 
b/tools/testing/selftests/clone3/Makefile
index cf976c732906..ef7564cb7abe 100644
--- a/tools/testing/selftests/clone3/Makefile
+++ b/tools/testing/selftests/clone3/Makefile
@@ -1,6 +1,8 @@
 # SPDX-License-Identifier: GPL-2.0
 CFLAGS += -g -I../../../../usr/include/
+LDLIBS += -lcap
 
-TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid
+TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid \
+   clone3_cap_checkpoint_restore
 
 include ../lib.mk
diff --git a/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c 
b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
new file mode 100644
index ..2cc3d57b91f2
--- /dev/null
+++ b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
@@ -0,0 +1,203 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Based on Christian Brauner's clone3() example.
+ * These tests are assuming to be running in the host's
+ * PID namespace.
+ */
+
+/* capabilities related code based on selftests/bpf/test_verifier.c */
+
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "../kselftest.h"
+#include "clone3_selftests.h"
+
+#ifndef MAX_PID_NS_LEVEL
+#define MAX_PID_NS_LEVEL 32
+#endif
+
+static void child_exit(int ret)
+{
+   fflush(stdout);
+   fflush(stderr);
+   _exit(ret);
+}
+
+static int call_clone3_set_tid(pid_t * set_tid, size_t set_tid_size)
+{
+   int status;
+   pid_t pid = -1;
+
+   struct clone_args args = {
+   .exit_signal = SIGCHLD,
+   .set_tid = ptr_to_u64(set_tid),
+   .set_tid_size = set_tid_size,
+   };
+
+   pid = sys_clone3(, sizeof(struct clone_args));
+   if (pid < 0) {
+   ksft_print_msg("%s - Failed to create new process\n",
+  strerror(errno));
+   return -errno;
+   }
+
+   if (pid == 0) {
+   int ret;
+   char tmp = 0;
+
+   ksft_print_msg
+   ("I am the child, my PID is %d (expected %d)\n",
+getpid(), set_tid[0]);
+
+   if (set_tid[0] != getpid())
+   child_exit(EXIT_FAILURE);
+   child_exit(EXIT_SUCCESS);
+   }
+
+   ksft_print_msg("I am the parent (%d). My child's pid is %d\n",
+  getpid(), pid);
+
+   if (waitpid(pid, , 0) < 0) {
+   ksft_print_msg("Child returned %s\n", strerror(errno));
+   return -errno;
+   }
+
+   if (!WIFEXITED(status))
+   return -1;
+
+   return WEXITSTATUS(status);
+}
+
+static int test_clone3_set_tid(pid_t * set_tid,
+  size_t set_tid_size, int expected)
+{
+   int ret;
+
+   ksft_print_msg("[%d] Trying clone3() with CLONE_SET_TID to %d\n",
+  getpid(), set_tid[0]);
+   ret = call_clone3_set_tid(set_tid, set_tid_size);
+
+   ksft_print_msg
+   ("[%d] clone3() with CLONE_SET_TID %d says :%d - expected %d\n",
+getpid(), set_tid[0], ret, expected);
+   if (ret != expected) {
+   ksft_test_result_fail
+   ("[%d] Result (%d) is different than expected (%d)\n",
+getpid(), ret, expected);
+   return -1;
+   }
+   ksft_test_result_pass
+   ("[%d] Result (%d) matches expectation (%d)\n", getpid(), ret,
+expected);
+
+   return 0;
+}
+
+struct libcap {
+   struct __user_cap_header_struct hdr;
+   struct __user_cap_data_struct data[2];
+};
+
+static int set_capability()
+{
+   cap_value_t cap_values[] = { CAP_SETUID, CAP_SETGID };
+   struct libcap *cap;
+   int ret = -1;
+   cap_t caps;
+
+   caps = cap_get_proc();
+   if (!caps) {
+   perror("cap_get_proc");
+   return -1;
+   }
+
+   /* Drop all capabilities */
+   if (cap_clear(caps)) {
+   perror("cap_clear");
+   goto out;
+   }
+
+   cap_set_flag(caps, CAP_EFFECTIVE, 2, cap_values, CAP_SET);
+   cap_set_flag(caps, CAP_PERMITTED, 2, cap_values, CAP_SET);
+
+   cap = (struct libcap *) caps;
+
+   /* 40 -> CAP_CHECKPOINT_RESTORE */
+ 

[PATCH v3 1/3] capabilities: Introduce CAP_CHECKPOINT_RESTORE

2020-06-18 Thread Adrian Reber
This patch introduces CAP_CHECKPOINT_RESTORE, a new capability facilitating
checkpoint/restore for non-root users.

Over the last years, The CRIU (Checkpoint/Restore In Userspace) team has been
asked numerous times if it is possible to checkpoint/restore a process as
non-root. The answer usually was: 'almost'.

The main blocker to restore a process as non-root was to control the PID of the
restored process. This feature available via the clone3 system call, or via
/proc/sys/kernel/ns_last_pid is unfortunately guarded by CAP_SYS_ADMIN.

In the past two years, requests for non-root checkpoint/restore have increased
due to the following use cases:
* Checkpoint/Restore in an HPC environment in combination with a resource
  manager distributing jobs where users are always running as non-root.
  There is a desire to provide a way to checkpoint and restore long running
  jobs.
* Container migration as non-root
* We have been in contact with JVM developers who are integrating
  CRIU into a Java VM to decrease the startup time. These checkpoint/restore
  applications are not meant to be running with CAP_SYS_ADMIN.

We have seen the following workarounds:
* Use a setuid wrapper around CRIU:
  See 
https://github.com/FredHutch/slurm-examples/blob/master/checkpointer/lib/checkpointer/checkpointer-suid.c
* Use a setuid helper that writes to ns_last_pid.
  Unfortunately, this helper delegation technique is impossible to use with
  clone3, and is thus prone to races.
  See https://github.com/twosigma/set_ns_last_pid
* Cycle through PIDs with fork() until the desired PID is reached:
  This has been demonstrated to work with cycling rates of 100,000 PIDs/s
  See https://github.com/twosigma/set_ns_last_pid
* Patch out the CAP_SYS_ADMIN check from the kernel
* Run the desired application in a new user and PID namespace to provide
  a local CAP_SYS_ADMIN for controlling PIDs. This technique has limited use in
  typical container environments (e.g., Kubernetes) as /proc is
  typically protected with read-only layers (e.g., /proc/sys) for hardening
  purposes. Read-only layers prevent additional /proc mounts (due to proc's
  SB_I_USERNS_VISIBLE property), making the use of new PID namespaces limited as
  certain applications need access to /proc matching their PID namespace.

The introduced capability allows to:
* Control PIDs when the current user is CAP_CHECKPOINT_RESTORE capable
  for the corresponding PID namespace via ns_last_pid/clone3.
* Open files in /proc/pid/map_files when the current user is
  CAP_CHECKPOINT_RESTORE capable in the root namespace, useful for recovering
  files that are unreachable via the file system such as deleted files, or memfd
  files.

See corresponding selftest for an example with clone3().

Signed-off-by: Adrian Reber 
Signed-off-by: Nicolas Viennot 
---
v2:
 - Renamed CAP_RESTORE to CAP_CHECKPOINT_RESTORE
 - Added a test
 - Added details about CRIU's use of map_files
 - Allow changing /proc/self/exe link with CAP_CHECKPOINT_RESTORE
v3:
 - made if condition easier to read
---
 fs/proc/base.c  | 8 
 include/linux/capability.h  | 6 ++
 include/uapi/linux/capability.h | 9 -
 kernel/pid.c| 2 +-
 kernel/pid_namespace.c  | 2 +-
 security/selinux/include/classmap.h | 5 +++--
 6 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index d86c0afc8a85..ad806069c778 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2189,16 +2189,16 @@ struct map_files_info {
 };
 
 /*
- * Only allow CAP_SYS_ADMIN to follow the links, due to concerns about how the
- * symlinks may be used to bypass permissions on ancestor directories in the
- * path to the file in question.
+ * Only allow CAP_SYS_ADMIN and CAP_CHECKPOINT_RESTORE to follow the links, due
+ * to concerns about how the symlinks may be used to bypass permissions on
+ * ancestor directories in the path to the file in question.
  */
 static const char *
 proc_map_files_get_link(struct dentry *dentry,
struct inode *inode,
struct delayed_call *done)
 {
-   if (!capable(CAP_SYS_ADMIN))
+   if (!capable(CAP_SYS_ADMIN) && !capable(CAP_CHECKPOINT_RESTORE))
return ERR_PTR(-EPERM);
 
return proc_pid_get_link(dentry, inode, done);
diff --git a/include/linux/capability.h b/include/linux/capability.h
index b4345b38a6be..1e7fe311cabe 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -261,6 +261,12 @@ static inline bool bpf_capable(void)
return capable(CAP_BPF) || capable(CAP_SYS_ADMIN);
 }
 
+static inline bool checkpoint_restore_ns_capable(struct user_namespace *ns)
+{
+   return ns_capable(ns, CAP_CHECKPOINT_RESTORE) ||
+   ns_capable(ns, CAP_SYS_ADMIN);
+}
+
 /* audit system wants to get cap info from files as well */
 extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct 
cpu_vfs_

[PATCH v3 0/3] capabilities: Introduce CAP_CHECKPOINT_RESTORE

2020-06-18 Thread Adrian Reber
This is v3 of the 'Introduce CAP_CHECKPOINT_RESTORE' patchset. There
is only one change from v2:

 * made if condition easier to read as requested by Cyrill

Besides that there were no further comments on the changes proposed in
this patchset.

There was the discussion from Andrei that PTRACE_O_SUSPEND_SECCOMP is
also needed for checkpointing. CRIU already has the possibility to
detect if a process is using seccomp and could so tell the user that
it cannot checkpoint a process if the process is using seccomp. As
seccomp has not come up in the requests from users to use CRIU as
non-root so far and as there was some push back from Christian to allow
PTRACE_O_SUSPEND_SECCOMP if CAP_CHECKPOINT_RESTORE is set I would like
to leave this open for the future.

Another discussion was around relaxing the existing map_files check from
capable() to ns_capable() or even completely removing it. Even if this
happens we still need CAP_CHECKPOINT_RESTORE and the removal or change
to ns_capable() is not blocked by this patchset.

Besides that there was nothing speaking against CAP_CHECKPOINT_RESTORE
during the v2 discussions.

Adrian Reber (2):
  capabilities: Introduce CAP_CHECKPOINT_RESTORE
  selftests: add clone3() CAP_CHECKPOINT_RESTORE test

Nicolas Viennot (1):
  prctl: Allow ptrace capable processes to change exe_fd

 fs/proc/base.c|   8 +-
 include/linux/capability.h|   6 +
 include/uapi/linux/capability.h   |   9 +-
 kernel/pid.c  |   2 +-
 kernel/pid_namespace.c|   2 +-
 kernel/sys.c  |  21 +-
 security/selinux/include/classmap.h   |   5 +-
 tools/testing/selftests/clone3/Makefile   |   4 +-
 .../clone3/clone3_cap_checkpoint_restore.c| 203 ++
 9 files changed, 245 insertions(+), 15 deletions(-)
 create mode 100644 
tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c


base-commit: 5fcb9628fd1227a5f11d87171cb1b8b5c414d9d9
-- 
2.26.2



[PATCH v2 2/3] selftests: add clone3() CAP_CHECKPOINT_RESTORE test

2020-06-03 Thread Adrian Reber
This adds a test that changes its UID, uses capabilities to
get CAP_CHECKPOINT_RESTORE and uses clone3() with set_tid to
create a process with a given PID as non-root.

Signed-off-by: Adrian Reber 
---
 tools/testing/selftests/clone3/Makefile   |   4 +-
 .../clone3/clone3_cap_checkpoint_restore.c| 203 ++
 2 files changed, 206 insertions(+), 1 deletion(-)
 create mode 100644 
tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c

diff --git a/tools/testing/selftests/clone3/Makefile 
b/tools/testing/selftests/clone3/Makefile
index cf976c732906..ef7564cb7abe 100644
--- a/tools/testing/selftests/clone3/Makefile
+++ b/tools/testing/selftests/clone3/Makefile
@@ -1,6 +1,8 @@
 # SPDX-License-Identifier: GPL-2.0
 CFLAGS += -g -I../../../../usr/include/
+LDLIBS += -lcap
 
-TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid
+TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid \
+   clone3_cap_checkpoint_restore
 
 include ../lib.mk
diff --git a/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c 
b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
new file mode 100644
index ..2cc3d57b91f2
--- /dev/null
+++ b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
@@ -0,0 +1,203 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Based on Christian Brauner's clone3() example.
+ * These tests are assuming to be running in the host's
+ * PID namespace.
+ */
+
+/* capabilities related code based on selftests/bpf/test_verifier.c */
+
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "../kselftest.h"
+#include "clone3_selftests.h"
+
+#ifndef MAX_PID_NS_LEVEL
+#define MAX_PID_NS_LEVEL 32
+#endif
+
+static void child_exit(int ret)
+{
+   fflush(stdout);
+   fflush(stderr);
+   _exit(ret);
+}
+
+static int call_clone3_set_tid(pid_t * set_tid, size_t set_tid_size)
+{
+   int status;
+   pid_t pid = -1;
+
+   struct clone_args args = {
+   .exit_signal = SIGCHLD,
+   .set_tid = ptr_to_u64(set_tid),
+   .set_tid_size = set_tid_size,
+   };
+
+   pid = sys_clone3(, sizeof(struct clone_args));
+   if (pid < 0) {
+   ksft_print_msg("%s - Failed to create new process\n",
+  strerror(errno));
+   return -errno;
+   }
+
+   if (pid == 0) {
+   int ret;
+   char tmp = 0;
+
+   ksft_print_msg
+   ("I am the child, my PID is %d (expected %d)\n",
+getpid(), set_tid[0]);
+
+   if (set_tid[0] != getpid())
+   child_exit(EXIT_FAILURE);
+   child_exit(EXIT_SUCCESS);
+   }
+
+   ksft_print_msg("I am the parent (%d). My child's pid is %d\n",
+  getpid(), pid);
+
+   if (waitpid(pid, , 0) < 0) {
+   ksft_print_msg("Child returned %s\n", strerror(errno));
+   return -errno;
+   }
+
+   if (!WIFEXITED(status))
+   return -1;
+
+   return WEXITSTATUS(status);
+}
+
+static int test_clone3_set_tid(pid_t * set_tid,
+  size_t set_tid_size, int expected)
+{
+   int ret;
+
+   ksft_print_msg("[%d] Trying clone3() with CLONE_SET_TID to %d\n",
+  getpid(), set_tid[0]);
+   ret = call_clone3_set_tid(set_tid, set_tid_size);
+
+   ksft_print_msg
+   ("[%d] clone3() with CLONE_SET_TID %d says :%d - expected %d\n",
+getpid(), set_tid[0], ret, expected);
+   if (ret != expected) {
+   ksft_test_result_fail
+   ("[%d] Result (%d) is different than expected (%d)\n",
+getpid(), ret, expected);
+   return -1;
+   }
+   ksft_test_result_pass
+   ("[%d] Result (%d) matches expectation (%d)\n", getpid(), ret,
+expected);
+
+   return 0;
+}
+
+struct libcap {
+   struct __user_cap_header_struct hdr;
+   struct __user_cap_data_struct data[2];
+};
+
+static int set_capability()
+{
+   cap_value_t cap_values[] = { CAP_SETUID, CAP_SETGID };
+   struct libcap *cap;
+   int ret = -1;
+   cap_t caps;
+
+   caps = cap_get_proc();
+   if (!caps) {
+   perror("cap_get_proc");
+   return -1;
+   }
+
+   /* Drop all capabilities */
+   if (cap_clear(caps)) {
+   perror("cap_clear");
+   goto out;
+   }
+
+   cap_set_flag(caps, CAP_EFFECTIVE, 2, cap_values, CAP_SET);
+   cap_set_flag(caps, CAP_PERMITTED, 2, cap_values, CAP_SET);
+
+   cap = (struct libcap *) caps;
+
+   /* 40 -> CAP_CHECKPOINT_RESTORE */
+ 

[PATCH v2 0/3] capabilities: Introduce CAP_CHECKPOINT_RESTORE

2020-06-03 Thread Adrian Reber
This is v2 of the 'Introduce CAP_CHECKPOINT_RESTORE' patchset. The
difference from v1 are:

 * Renamed CAP_RESTORE to CAP_CHECKPOINT_RESTORE
 * Added a test
 * Added details about CRIU's use of map_files
 * Allow changing /proc/self/exe link with CAP_CHECKPOINT_RESTORE

The biggest difference is that the patchset now provides all the
changes, which are necessary to use CRIU to checkpoint and restore a
process as non-root if CAP_CHECKPOINT_RESTORE is set.

Adrian Reber (2):
  capabilities: Introduce CAP_CHECKPOINT_RESTORE
  selftests: add clone3() CAP_CHECKPOINT_RESTORE test

Nicolas Viennot (1):
  prctl: Allow ptrace capable processes to change exe_fd

 fs/proc/base.c|   8 +-
 include/linux/capability.h|   6 +
 include/uapi/linux/capability.h   |   9 +-
 kernel/pid.c  |   2 +-
 kernel/pid_namespace.c|   2 +-
 kernel/sys.c  |  21 +-
 security/selinux/include/classmap.h   |   5 +-
 tools/testing/selftests/clone3/Makefile   |   4 +-
 .../clone3/clone3_cap_checkpoint_restore.c| 203 ++
 9 files changed, 245 insertions(+), 15 deletions(-)
 create mode 100644 
tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c


base-commit: 48f99181fc118d82dc8bf6c7221ad1c654cb8bc2
-- 
2.26.2



[PATCH v2 1/3] capabilities: Introduce CAP_CHECKPOINT_RESTORE

2020-06-03 Thread Adrian Reber
This patch introduces CAP_CHECKPOINT_RESTORE, a new capability facilitating
checkpoint/restore for non-root users.

Over the last years, The CRIU (Checkpoint/Restore In Userspace) team has been
asked numerous times if it is possible to checkpoint/restore a process as
non-root. The answer usually was: 'almost'.

The main blocker to restore a process as non-root was to control the PID of the
restored process. This feature available via the clone3 system call, or via
/proc/sys/kernel/ns_last_pid is unfortunately guarded by CAP_SYS_ADMIN.

In the past two years, requests for non-root checkpoint/restore have increased
due to the following use cases:
* Checkpoint/Restore in an HPC environment in combination with a resource
  manager distributing jobs where users are always running as non-root.
  There is a desire to provide a way to checkpoint and restore long running
  jobs.
* Container migration as non-root
* We have been in contact with JVM developers who are integrating
  CRIU into a Java VM to decrease the startup time. These checkpoint/restore
  applications are not meant to be running with CAP_SYS_ADMIN.

We have seen the following workarounds:
* Use a setuid wrapper around CRIU:
  See 
https://github.com/FredHutch/slurm-examples/blob/master/checkpointer/lib/checkpointer/checkpointer-suid.c
* Use a setuid helper that writes to ns_last_pid.
  Unfortunately, this helper delegation technique is impossible to use with
  clone3, and is thus prone to races.
  See https://github.com/twosigma/set_ns_last_pid
* Cycle through PIDs with fork() until the desired PID is reached:
  This has been demonstrated to work with cycling rates of 100,000 PIDs/s
  See https://github.com/twosigma/set_ns_last_pid
* Patch out the CAP_SYS_ADMIN check from the kernel
* Run the desired application in a new user and PID namespace to provide
  a local CAP_SYS_ADMIN for controlling PIDs. This technique has limited use in
  typical container environments (e.g., Kubernetes) as /proc is
  typically protected with read-only layers (e.g., /proc/sys) for hardening
  purposes. Read-only layers prevent additional /proc mounts (due to proc's
  SB_I_USERNS_VISIBLE property), making the use of new PID namespaces limited as
  certain applications need access to /proc matching their PID namespace.

The introduced capability allows to:
* Control PIDs when the current user is CAP_CHECKPOINT_RESTORE capable
  for the corresponding PID namespace via ns_last_pid/clone3.
* Open files in /proc/pid/map_files when the current user is
  CAP_CHECKPOINT_RESTORE capable in the root namespace, useful for recovering
  files that are unreachable via the file system such as deleted files, or memfd
  files.

See corresponding selftest for an example with clone3().

Signed-off-by: Adrian Reber 
Signed-off-by: Nicolas Viennot 
---
v2:
 - Renamed CAP_RESTORE to CAP_CHECKPOINT_RESTORE
 - Added a test
 - Added details about CRIU's use of map_files
 - Allow changing /proc/self/exe link with CAP_CHECKPOINT_RESTORE
---
 fs/proc/base.c  | 8 
 include/linux/capability.h  | 6 ++
 include/uapi/linux/capability.h | 9 -
 kernel/pid.c| 2 +-
 kernel/pid_namespace.c  | 2 +-
 security/selinux/include/classmap.h | 5 +++--
 6 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index d86c0afc8a85..ce02f3a4b2d7 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2189,16 +2189,16 @@ struct map_files_info {
 };
 
 /*
- * Only allow CAP_SYS_ADMIN to follow the links, due to concerns about how the
- * symlinks may be used to bypass permissions on ancestor directories in the
- * path to the file in question.
+ * Only allow CAP_SYS_ADMIN and CAP_CHECKPOINT_RESTORE to follow the links, due
+ * to concerns about how the symlinks may be used to bypass permissions on
+ * ancestor directories in the path to the file in question.
  */
 static const char *
 proc_map_files_get_link(struct dentry *dentry,
struct inode *inode,
struct delayed_call *done)
 {
-   if (!capable(CAP_SYS_ADMIN))
+   if (!(capable(CAP_SYS_ADMIN) || capable(CAP_CHECKPOINT_RESTORE)))
return ERR_PTR(-EPERM);
 
return proc_pid_get_link(dentry, inode, done);
diff --git a/include/linux/capability.h b/include/linux/capability.h
index b4345b38a6be..1e7fe311cabe 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -261,6 +261,12 @@ static inline bool bpf_capable(void)
return capable(CAP_BPF) || capable(CAP_SYS_ADMIN);
 }
 
+static inline bool checkpoint_restore_ns_capable(struct user_namespace *ns)
+{
+   return ns_capable(ns, CAP_CHECKPOINT_RESTORE) ||
+   ns_capable(ns, CAP_SYS_ADMIN);
+}
+
 /* audit system wants to get cap info from files as well */
 extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct 
cpu_vfs_cap_data *cpu_caps);
 
diff --git a/include/uapi/linux

[PATCH v2 3/3] prctl: Allow ptrace capable processes to change exe_fd

2020-06-03 Thread Adrian Reber
From: Nicolas Viennot 

The current process is authorized to change its /proc/self/exe link via
two policies:
1) The current user can do checkpoint/restore In other words is
   CAP_SYS_ADMIN or CAP_CHECKPOINT_RESTORE capable.
2) The current user can use ptrace.

With access to ptrace facilities, a process can do the following: fork a
child, execve() the target executable, and have the child use ptrace()
to replace the memory content of the current process. This technique
makes it possible to masquerade an arbitrary program as any executable,
even setuid ones.

This commit also changes the permission error code from -EINVAL to
-EPERM for consistency with the rest of the prctl() syscall when
checking capabilities.

Signed-off-by: Nicolas Viennot 
Signed-off-by: Adrian Reber 
---
 kernel/sys.c | 21 -
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index fd46865b46ba..2f34108e26e0 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1994,12 +1994,23 @@ static int prctl_set_mm_map(int opt, const void __user 
*addr, unsigned long data
 
if (prctl_map.exe_fd != (u32)-1) {
/*
-* Make sure the caller has the rights to
-* change /proc/pid/exe link: only local sys admin should
-* be allowed to.
+* The current process is authorized to change its
+* /proc/self/exe link via two policies:
+* 1) The current user can do checkpoint/restore
+*In other words is CAP_SYS_ADMIN or
+*CAP_CHECKPOINT_RESTORE capable.
+* 2) The current user can use ptrace.
+*
+* With access to ptrace facilities, a process can do the
+* following: fork a child, execve() the target executable,
+* and have the child use ptrace() to replace the memory
+* content of the current process. This technique makes it
+* possible to masquerade an arbitrary program as the target
+* executable, even if it is setuid.
 */
-   if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN))
-   return -EINVAL;
+   if (!(checkpoint_restore_ns_capable(current_user_ns()) ||
+ security_ptrace_access_check(current, 
PTRACE_MODE_ATTACH_REALCREDS)))
+   return -EPERM;
 
error = prctl_set_mm_exe_file(mm, prctl_map.exe_fd);
if (error)
-- 
2.26.2



Re: clone3: allow creation of time namespace with offset

2020-05-29 Thread Adrian Reber
On Fri, May 29, 2020 at 02:26:13PM +0200, Michael Kerrisk (man-pages) wrote:
> Hi Adrian,
> 
> If there was a revision to this patch, I missed it. Is there still a
> plan to bring CLONE_NEWTIME to clone3()?

Good that you ask. The discussion ended at the point that we do not have
a way to figure out what a syscall supports from user-space.  I talked a
bit with Christian about it and he mentioned that there were some ideas
floating around how to do that. At that point it made more sense to me
to wait for such a solution to appear before continuing the clone3()
time namespace work.

Adrian

> On Tue, 17 Mar 2020 at 09:32, Adrian Reber  wrote:
> >
> > This is an attempt to add time namespace support to clone3(). I am not
> > really sure which way clone3() should handle time namespaces. The time
> > namespace through /proc cannot be used with clone3() because the offsets
> > for the time namespace need to be written before a process has been
> > created in that time namespace. This means it is necessary to somehow
> > tell clone3() the offsets for the clocks.
> >
> > The time namespace offers the possibility to set offsets for
> > CLOCK_MONOTONIC and CLOCK_BOOTTIME. My first approach was to extend
> > 'struct clone_args` with '__aligned_u64 monotonic_offset' and
> > '__aligned_u64 boottime_offset'. The problem with this approach was that
> > it was not possible to set nanoseconds for the clocks in the time
> > namespace.
> >
> > One of the motivations for clone3() with CLONE_NEWTIME was to enable
> > CRIU to restore a process in a time namespace with the corresponding
> > offsets. And although the nanosecond value can probably never be
> > restored to the same value it had during checkpointing, because the
> > clock keeps on running between CRIU pausing all processes and CRIU
> > actually reading the value of the clocks, the nanosecond value is still
> > necessary for CRIU to not restore a process where the clock jumps back
> > due to CRIU restoring it with a nanonsecond value that is too small.
> >
> > Requiring nanoseconds as well as seconds for two clocks during clone3()
> > means that it would require 4 additional members to 'struct clone_args':
> >
> > __aligned_u64 tls;
> > __aligned_u64 set_tid;
> > __aligned_u64 set_tid_size;
> > +   __aligned_u64 boottime_offset_seconds;
> > +   __aligned_u64 boottime_offset_nanoseconds;
> > +   __aligned_u64 monotonic_offset_seconds;
> > +   __aligned_u64 monotonic_offset_nanoseconds;
> >  };
> >
> > To avoid four additional members to 'struct clone_args' this patchset
> > uses another approach:
> >
> > __aligned_u64 tls;
> > __aligned_u64 set_tid;
> > __aligned_u64 set_tid_size;
> > +   __aligned_u64 timens_offset;
> > +   __aligned_u64 timens_offset_size;
> >  };
> >
> > timens_offset is a pointer to an array just as previously done with
> > set_tid and timens_offset_size is the size of the array.
> >
> > The timens_offset array is expected to contain a struct like this:
> >
> > struct set_timens_offset {
> >int clockid;
> >struct timespec val;
> > };
> >
> > This way it is possible to pass the information of multiple clocks with
> > seconds and nanonseconds to clone3().
> >
> > To me this seems the better approach, but I am not totally convinced
> > that it is the right thing. If there are other ideas how to pass two
> > clock offsets with seconds and nanonseconds to clone3() I would be happy
> > to hear other ideas.
> >
> > Adrian
> >
> >
> 
> 
> -- 
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/
> 



Re: [PATCH] capabilities: Introduce CAP_RESTORE

2020-05-27 Thread Adrian Reber
On Tue, May 26, 2020 at 08:59:29AM -0500, Eric W. Biederman wrote:
> Adrian Reber  writes:
> 
> > On Fri, May 22, 2020 at 09:40:37AM -0700, Casey Schaufler wrote:
> 
> >> What are the other blockers? Are you going to suggest additional new
> >> capabilities to clear them?
> >
> > As mentioned somewhere else access to /proc//map_files/ would be
> > helpful. Right now I am testing with a JVM and it works without root
> > just with the attached patch. Without access to /proc//map_files/
> > not everything CRIU can do will actually work, but we are a lot closer
> > to what our users have been asking for.
> 
> The current permission checks on /proc//map_files/ are simply
> someone being over-cautious.
> 
> Someone needs to think through the threat landscape and figure out what
> permission checks are actually needed.
> 
> Making the permission check ns_capable instead of capable is a
> no-brainer.  Figuring out which user_ns to test against might be a
> we bit harder.
> 
> We could probably even allow the owner of the process to open the files
> but that requires someone doing the work of thinking through how
> being able to opening files that you have mmaped might be a problem.

As mentioned in the other thread, CRIU can work with read access to
map_files.

> >> > There are probably a few more things guarded by CAP_SYS_ADMIN required
> >> > to run checkpoint/restore as non-root,
> >> 
> >> If you need CAP_SYS_ADMIN anyway you're not gaining anything by
> >> separating out CAP_RESTORE.
> >
> > No, as described we can checkpoint and restore a JVM with this patch and
> > it also solves the problem the set_ns_last_pid fork() loop daemon tries
> > to solve. It is not enough to support the full functionality of CRIU as
> > map_files is also important, but we do not need CAP_SYS_ADMIN and
> > CAP_RESTORE. Only CAP_RESTORE would be necessary.
> >
> > With a new capability users can enable checkpoint/restore as non-root
> > without giving CRIU access to any of the other possibilities offered by
> > CAP_SYS_ADMIN. Setting a PID and map_files have been introduced for CRIU
> > and used to live behind CONFIG_CHECKPOINT_RESTORE. Having a capability
> > for checkpoint/restore would make it easier for CRIU users to run it as
> > non-root and make it very clear what is possible when giving CRIU the
> > new capability. No other things would be allowed than necessary for
> > checkpoint/restore. Setting a PID is most important for the restore part
> > and reading map_files would be helpful during checkpoint. So it actually
> > should be called CAP_CHECKPOINT_RESTORE as Christian mentioned in
> > another email.
> 
> Please if one is for checkpoint and one is for restore asking for a pair
> of capabilities is probably more appropriate.

I will send out a v2 with a renamed capability soon and also include
map_files to be readable with that capability.

> >> >  but by applying this patch I can
> >> > already checkpoint and restore processes as non-root. As there are
> >> > already multiple workarounds I would prefer to do it correctly in the
> >> > kernel to avoid that CRIU users are starting to invent more workarounds.
> >> 
> >> You've presented a couple of really inappropriate implementations
> >> that would qualify as workarounds. But the other two are completely
> >> appropriate within the system security policy. They don't "get around"
> >> the problem, they use existing mechanisms as they are intended.
> >
> > I agree with the user namespace approach to be appropriate, but not the
> > CAP_SYS_ADMIN approach as CRIU only needs a tiny subset (2 things) of
> > what CAP_SYS_ADMIN allows.
> 
> 
> If we are only talking 2 things can you please include in your patchset
> a patch enabling those 2 things?

The two things are setting a PID via ns_last_pid/clone3() and reading
map_files.

> But even more than this we need a request that asks not for the least
> you can possibly ask for but asks for what you need to do a good job.

Also in this thread Kamil mentioned that they also need calling prctl
with PR_SET_MM during restore in their production setup.

> I am having visions of a recurring discussion that says can we add one
> more permission check to CAP_RESTORE or CAP_CHECKPOINT when they are
> things we could know today.

I will prepare a new version of this patch using CAP_CHECKPOINT_RESTORE
for ns_last_pid/clone3(), map_files, and prctl with PR_SET_MM.

Adrian



Re: [PATCH] capabilities: Introduce CAP_RESTORE

2020-05-27 Thread Adrian Reber
On Mon, May 25, 2020 at 11:55:20AM -0700, Casey Schaufler wrote:
> On 5/25/2020 1:05 AM, Adrian Reber wrote:
> > On Fri, May 22, 2020 at 09:40:37AM -0700, Casey Schaufler wrote:
> >> On 5/21/2020 10:53 PM, Adrian Reber wrote:
> >>> This enables CRIU to checkpoint and restore a process as non-root.
> >> I know it sounds pedantic, but could you spell out CRIU once?
> >> While I know that everyone who cares either knows or can guess
> >> what you're talking about, it may be a mystery to some of the
> >> newer kernel developers.
> > Sure. CRIU - Checkpoint/Restore In Userspace.
> 
> Thanks. I blew out my acronym processor in the 1990's while
> working on trusted Unix system security evaluations.
> 
> >>> Over the last years CRIU upstream has been asked a couple of time if it
> >>> is possible to checkpoint and restore a process as non-root. The answer
> >>> usually was: 'almost'.
> >>>
> >>> The main blocker to restore a process was that selecting the PID of the
> >>> restored process, which is necessary for CRIU, is guarded by 
> >>> CAP_SYS_ADMIN.
> >> What are the other blockers? Are you going to suggest additional new
> >> capabilities to clear them?
> > As mentioned somewhere else access to /proc//map_files/ would be
> > helpful. Right now I am testing with a JVM and it works without root
> > just with the attached patch. Without access to /proc//map_files/
> > not everything CRIU can do will actually work, but we are a lot closer
> > to what our users have been asking for.
> 
> Are you talking about read access to map_files owned by other users
> or write access to map_files for the current user?

If I understand part of CRIU correctly, then we only need read-access
for the current user. I am sure Andrei, Pavel or Cyrill will correct me
if I am wrong concerning map_files.

> >>> In the last two years the questions about checkpoint/restore as non-root
> >>> have increased and especially in the last few months we have seen
> >>> multiple people inventing workarounds.
> >> Giving a process CAP_SYS_ADMIN is a non-root solution.
> > Yes, but like mentioned somewhere else not a solution that actually
> > works,
> 
> It's a solution that will execute and do what you're asking of it ...
> 
> >  because CAP_SYS_ADMIN allows too much.
> 
> ... but apparently not one that your users find satisfactory.
> 
> >  Especially for the
> > checkpoint/restore case, we really need one (setting the PID of a new
> > process) and to make it complete a second (reading map_files).
> >
> > Reading the comments in include/uapi/linux/capability.h concerning
> > CAP_SYS_ADMIN it allows the binary to do at least 35 things. The two
> > (three) I mentioned above (ns_last_pid (clone3) map_files) are not
> > mentioned in that list, so CAP_SYS_ADMIN allows probably much more.
> >
> > To allow checkpoint/restore as non-root nobody will give CRIU
> > CAP_SYS_ADMIN because it is too wide.
> 
> CAP_SYS_ADMIN exists for system behaviors that are not policy enforcement,
> but important to the system nonetheless. If you argue that checkpoint/restart
> is system policy enforcement rather then an administrative task it would
> be easier to sell.
> 
> Nobody likes CAP_SYS_ADMIN, but usually a process that does one of the
> things it covers will do more (sometimes many more) of the things it
> covers. The longstanding problem with breaking up CAP_SYS_ADMIN is that
> most breakouts result in programs that still need CAP_SYS_ADMIN anyway.
> 
> >>> The use-cases so far and their workarounds:
> >>>
> >>>  * Checkpoint/Restore in an HPC environment in combination with
> >>>a resource manager distributing jobs. Users are always running
> >>>as non root, but there was the desire to provide a way to
> >>>checkpoint and restore long running jobs.
> >>>Workaround: setuid wrapper to start CRIU as root as non-root
> >>>
> >>> https://github.com/FredHutch/slurm-examples/blob/master/checkpointer/lib/checkpointer/checkpointer-suid.c
> >> This is a classic and well understood mechanism for dealing with
> >> this kind of situation. You could have checkpointer-filecap-sys_admin.c
> >> instead, if you want to reduce use of the super-user.
> >>
> >>> * Another use case to checkpoint/restore processes as non-root
> >>>uses as workaround a non privileged process which cycles through
> >>>PIDs by calling fork() as fast as possible with a rate of
> >>>1

Re: [PATCH] capabilities: Introduce CAP_RESTORE

2020-05-25 Thread Adrian Reber
On Fri, May 22, 2020 at 09:40:37AM -0700, Casey Schaufler wrote:
> On 5/21/2020 10:53 PM, Adrian Reber wrote:
> > This enables CRIU to checkpoint and restore a process as non-root.
> 
> I know it sounds pedantic, but could you spell out CRIU once?
> While I know that everyone who cares either knows or can guess
> what you're talking about, it may be a mystery to some of the
> newer kernel developers.

Sure. CRIU - Checkpoint/Restore In Userspace.

> > Over the last years CRIU upstream has been asked a couple of time if it
> > is possible to checkpoint and restore a process as non-root. The answer
> > usually was: 'almost'.
> >
> > The main blocker to restore a process was that selecting the PID of the
> > restored process, which is necessary for CRIU, is guarded by CAP_SYS_ADMIN.
> 
> What are the other blockers? Are you going to suggest additional new
> capabilities to clear them?

As mentioned somewhere else access to /proc//map_files/ would be
helpful. Right now I am testing with a JVM and it works without root
just with the attached patch. Without access to /proc//map_files/
not everything CRIU can do will actually work, but we are a lot closer
to what our users have been asking for.

> > In the last two years the questions about checkpoint/restore as non-root
> > have increased and especially in the last few months we have seen
> > multiple people inventing workarounds.
> 
> Giving a process CAP_SYS_ADMIN is a non-root solution.

Yes, but like mentioned somewhere else not a solution that actually
works, because CAP_SYS_ADMIN allows too much. Especially for the
checkpoint/restore case, we really need one (setting the PID of a new
process) and to make it complete a second (reading map_files).

Reading the comments in include/uapi/linux/capability.h concerning
CAP_SYS_ADMIN it allows the binary to do at least 35 things. The two
(three) I mentioned above (ns_last_pid (clone3) map_files) are not
mentioned in that list, so CAP_SYS_ADMIN allows probably much more.

To allow checkpoint/restore as non-root nobody will give CRIU
CAP_SYS_ADMIN because it is too wide.

> > The use-cases so far and their workarounds:
> >
> >  * Checkpoint/Restore in an HPC environment in combination with
> >a resource manager distributing jobs. Users are always running
> >as non root, but there was the desire to provide a way to
> >checkpoint and restore long running jobs.
> >Workaround: setuid wrapper to start CRIU as root as non-root
> >
> > https://github.com/FredHutch/slurm-examples/blob/master/checkpointer/lib/checkpointer/checkpointer-suid.c
> 
> This is a classic and well understood mechanism for dealing with
> this kind of situation. You could have checkpointer-filecap-sys_admin.c
> instead, if you want to reduce use of the super-user.
> 
> > * Another use case to checkpoint/restore processes as non-root
> >uses as workaround a non privileged process which cycles through
> >PIDs by calling fork() as fast as possible with a rate of
> >100,000 pids/s instead of writing to ns_last_pid
> >https://github.com/twosigma/set_ns_last_pid
> 
> Oh dear.
> 
> >  * Fast Java startup using checkpoint/restore.
> >We have been in contact with JVM developers who are integrating
> >CRIU into a JVM to decrease the startup time.
> >Workaround so far: patch out CAP_SYS_ADMIN checks in the kernel
> 
> That's not a workaround, it's a policy violation.
> Bad JVM! No biscuit!

This was used as a proof of concept to see if we can checkpoint and
restore a JVM without root. Only the ns_last_pid check was removed to
see if it works and it does.

> >  * Container migration as non root. There are people already
> >using CRIU to migrate containers as non-root. The solution
> >there is to run it in a user namespace. So if you are able
> >to carefully setup your environment with the namespaces
> >it is already possible to restore a container/process as non-root.
> 
> This is exactly the kind of situation that user namespaces are
> supposed to address.
> 
> >Unfortunately it is not always possible to setup an environment
> >in such a way and for easier access to non-root based container
> >migration this patch is also required.
> 
> If a user namespace solution is impossible or (more likely) too
> expensive, there's always the checkpointer-filecap-sys_admin option.

But then again we open up all of CAP_SYS_ADMIN, which is not necessary.

> > There are probably a few more things guarded by CAP_SYS_ADMIN required
> > to run checkpoint/restore as non-root,
> 
> If you need CAP_SYS_ADMIN anyway you're not gaining anything by
> separating out CAP_RESTORE.

No, as 

[PATCH] capabilities: Introduce CAP_RESTORE

2020-05-21 Thread Adrian Reber
all(__NR_clone3, _args, sizeof(c_args));
if (new_pid == 0) {
printf("I am the child!\n");
exit(0);
} else if (new_pid == pid)
printf("I am the parent. My child got the pid %d!\n", new_pid);
else
printf("pid (%d) does not match expected pid (%d)\n", new_pid, 
pid);
    printf("Done\n");

return 0;
}
$ id -u; /home/libcap/clone3_set_tid 30
1001
Forking...
I am the parent. My child got the pid 30!
Done
I am the child!

Signed-off-by: Adrian Reber 
---
 include/linux/capability.h  | 5 +
 include/uapi/linux/capability.h | 9 -
 kernel/pid.c| 2 +-
 kernel/pid_namespace.c  | 2 +-
 security/selinux/include/classmap.h | 5 +++--
 5 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index b4345b38a6be..1278313cb2bc 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -261,6 +261,11 @@ static inline bool bpf_capable(void)
return capable(CAP_BPF) || capable(CAP_SYS_ADMIN);
 }
 
+static inline bool restore_ns_capable(struct user_namespace *ns)
+{
+   return ns_capable(ns, CAP_RESTORE) || ns_capable(ns, CAP_SYS_ADMIN);
+}
+
 /* audit system wants to get cap info from files as well */
 extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct 
cpu_vfs_cap_data *cpu_caps);
 
diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index c7372180a0a9..4bcc4e3d41ff 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -406,7 +406,14 @@ struct vfs_ns_cap_data {
  */
 #define CAP_BPF39
 
-#define CAP_LAST_CAP CAP_BPF
+
+/* Allow checkpoint/restore related operations */
+/* Allow PID selection during clone3() */
+/* Allow writing to ns_last_pid */
+
+#define CAP_RESTORE40
+
+#define CAP_LAST_CAP CAP_RESTORE
 
 #define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP)
 
diff --git a/kernel/pid.c b/kernel/pid.c
index 3122043fe364..bbc26f2bcff6 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -198,7 +198,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t 
*set_tid,
if (tid != 1 && !tmp->child_reaper)
goto out_free;
retval = -EPERM;
-   if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
+   if (!restore_ns_capable(tmp->user_ns))
goto out_free;
set_tid_size--;
}
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 0e5ac162c3a8..f58186b31ce6 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -269,7 +269,7 @@ static int pid_ns_ctl_handler(struct ctl_table *table, int 
write,
struct ctl_table tmp = *table;
int ret, next;
 
-   if (write && !ns_capable(pid_ns->user_ns, CAP_SYS_ADMIN))
+   if (write && !restore_ns_capable(pid_ns->user_ns))
return -EPERM;
 
/*
diff --git a/security/selinux/include/classmap.h 
b/security/selinux/include/classmap.h
index 98e1513b608a..f8b8f12a6ebd 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -27,9 +27,10 @@
"audit_control", "setfcap"
 
 #define COMMON_CAP2_PERMS  "mac_override", "mac_admin", "syslog", \
-   "wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf"
+   "wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf", \
+   "restore"
 
-#if CAP_LAST_CAP > CAP_BPF
+#if CAP_LAST_CAP > CAP_RESTORE
 #error New capability defined, please update COMMON_CAP2_PERMS.
 #endif
 

base-commit: e8f3274774b45b5f4e9e3d5cad7ff9f43ae3add5
-- 
2.26.2



Re: [PATCH v2 0/6] Update clone3 self-tests

2019-09-16 Thread Adrian Reber
On Mon, Sep 16, 2019 at 09:49:34AM +0200, Christian Brauner wrote:
> On Tue, Sep 10, 2019 at 07:01:30PM +0100, Eugene Syromiatnikov wrote:
> > Hello.
> > 
> > This patch set updates clone3 selftest in several aspects:
> >  - adding checks for exit_signal invalid values handling;
> >  - adding clone3 to selftests targets;
> >  - enabling clone3 tests on all architectures;
> >  - minor cleanups of the clone3 test.
> > 
> > Applied on top of brauer/linux.git/for-next.
> 
> So I like this series a lot! Testing is very important.
> And thanks for catching the clone3() exit_signal problem. This way we
> got to release a non-broken kernel. :)
> 
> Some notes: I dropped the set_tid extension from the core process
> updates for 5.4 because we ended up in a discussion that made it clear
> we potentially need the ability to restore pids in multiple pid
> namespaces. This means we need some more discussion and the patchset is
> delayed for at least one release.
> Unfortunately, this also means the test that you have based yours upon
> does not exist anymore. However, the tests should not be blocked on
> this. I'd encourage you to talk to Adrian (who is Cced here anyway) and
> come up with a clone3() test suite I can merge. You can very likely do a
> Co-Developed-by so no-ones work gets dropped. :)
> 
> Ideally I'd like to see:
> - verifying passing different struct sizes works correctly
> - verify that flag combinations work correctly
> - verify that struct members have correct values etc. pp.
> 
> We definitely want the exit_signal test as a regression test so it
> doesn't bite us again!
> 
> (Oh, please also add tool/test/selftests/clone3/ to the pidfd/core
>  process MAINTAINERS entry.)

Eugene and I have already discussed this. We will resubmit the clone3()
selftests in the next few days with all our changes combined.

Adrian


Re: [PATCH v6 1/2] fork: extend clone3() to support setting a PID

2019-08-12 Thread Adrian Reber
On Mon, Aug 12, 2019 at 01:43:53PM -0700, Andrei Vagin wrote:
> On Mon, Aug 12, 2019 at 1:10 PM Adrian Reber  wrote:
> >
> > The main motivation to add set_tid to clone3() is CRIU.
> >
> > To restore a process with the same PID/TID CRIU currently uses
> > /proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
> > ns_last_pid and then (quickly) does a clone(). This works most of the
> > time, but it is racy. It is also slow as it requires multiple syscalls.
> >
> > Extending clone3() to support set_tid makes it possible restore a
> > process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
> > race free (as long as the desired PID/TID is available).
> >
> > This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
> > on clone3() with set_tid as they are currently in place for ns_last_pid.
> >
> > Signed-off-by: Adrian Reber 
> > ---
> > v2:
> >  - Removed (size < sizeof(struct clone_args)) as discussed with
> >Christian and Dmitry
> >  - Added comment to ((set_tid != 1) && idr_get_cursor() <= 1) (Oleg)
> >  - Use idr_alloc() instead of idr_alloc_cyclic() (Oleg)
> >
> > v3:
> >  - Return EEXIST if PID is already in use (Christian)
> >  - Drop CLONE_SET_TID (Christian and Oleg)
> >  - Use idr_is_empty() instead of idr_get_cursor() (Oleg)
> >  - Handle different `struct clone_args` sizes (Dmitry)
> >
> > v4:
> >  - Rework struct size check with defines (Christian)
> >  - Reduce number of set_tid checks (Oleg)
> >  - Less parentheses and more robust code (Oleg)
> >  - Do ns_capable() on correct user_ns (Oleg, Christian)
> >
> > v5:
> >  - make set_tid checks earlier in alloc_pid() (Christian)
> >  - remove unnecessary comment and struct size check (Christian)
> >
> > v6:
> >  - remove CLONE_SET_TID from description (Christian)
> >  - add clone3() tests for different clone_args sizes (Christian)
> >  - move more set_tid checks to alloc_pid() (Oleg)
> >  - make all set_tid checks lockless (Oleg)
> > ---
> >  include/linux/pid.h|  2 +-
> >  include/linux/sched/task.h |  1 +
> >  include/uapi/linux/sched.h |  1 +
> >  kernel/fork.c  | 14 --
> >  kernel/pid.c   | 37 ++---
> >  5 files changed, 45 insertions(+), 10 deletions(-)
> >
> > diff --git a/include/linux/pid.h b/include/linux/pid.h
> > index 2a83e434db9d..052000db0ced 100644
> > --- a/include/linux/pid.h
> > +++ b/include/linux/pid.h
> > @@ -116,7 +116,7 @@ extern struct pid *find_vpid(int nr);
> >  extern struct pid *find_get_pid(int nr);
> >  extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
> >
> > -extern struct pid *alloc_pid(struct pid_namespace *ns);
> > +extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t set_tid);
> >  extern void free_pid(struct pid *pid);
> >  extern void disable_pid_allocation(struct pid_namespace *ns);
> >
> > diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> > index 0497091e40c1..4f2a80564332 100644
> > --- a/include/linux/sched/task.h
> > +++ b/include/linux/sched/task.h
> > @@ -26,6 +26,7 @@ struct kernel_clone_args {
> > unsigned long stack;
> > unsigned long stack_size;
> > unsigned long tls;
> > +   pid_t set_tid;
> >  };
> >
> >  /*
> > diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> > index b3105ac1381a..e1ce103a2c47 100644
> > --- a/include/uapi/linux/sched.h
> > +++ b/include/uapi/linux/sched.h
> > @@ -45,6 +45,7 @@ struct clone_args {
> > __aligned_u64 stack;
> > __aligned_u64 stack_size;
> > __aligned_u64 tls;
> > +   __aligned_u64 set_tid;
> >  };
> >
> >  /*
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 2852d0e76ea3..8317d408a8d6 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -117,6 +117,11 @@
> >   */
> >  #define MAX_THREADS FUTEX_TID_MASK
> >
> > +/*
> > + * For different sizes of struct clone_args
> > + */
> > +#define CLONE3_ARGS_SIZE_V0 64
> > +
> >  /*
> >   * Protected counters by write_lock_irq(_lock)
> >   */
> > @@ -2031,7 +2036,7 @@ static __latent_entropy struct task_struct 
> > *copy_process(
> > stackleak_task_init(p);
> >
> > if (pid != _struct_pid) {
> > -   pid = alloc_pid(p->nsproxy->pid_ns_for_chi

[PATCH v6 2/2] selftests: add tests for clone3()

2019-08-12 Thread Adrian Reber
This tests clone3() with and without set_tid to see if all desired PIDs
are working as expected. The test tries to clone3() with a set_tid of
-1, 1, pid_max, a PID which is already in use and an unused PID. The
same tests are also running in PID namespace.

In addition the clone3 test (without set_tid) tries to call clone3()
with different sizes of clone_args.

Signed-off-by: Adrian Reber 
---
 tools/testing/selftests/clone3/.gitignore |   2 +
 tools/testing/selftests/clone3/Makefile   |  11 +
 tools/testing/selftests/clone3/clone3.c   | 231 ++
 .../testing/selftests/clone3/clone3_set_tid.c | 161 
 4 files changed, 405 insertions(+)
 create mode 100644 tools/testing/selftests/clone3/.gitignore
 create mode 100644 tools/testing/selftests/clone3/Makefile
 create mode 100644 tools/testing/selftests/clone3/clone3.c
 create mode 100644 tools/testing/selftests/clone3/clone3_set_tid.c

diff --git a/tools/testing/selftests/clone3/.gitignore 
b/tools/testing/selftests/clone3/.gitignore
new file mode 100644
index ..c63c64a78ddf
--- /dev/null
+++ b/tools/testing/selftests/clone3/.gitignore
@@ -0,0 +1,2 @@
+clone3_set_tid
+clone3
diff --git a/tools/testing/selftests/clone3/Makefile 
b/tools/testing/selftests/clone3/Makefile
new file mode 100644
index ..4efcf45b995b
--- /dev/null
+++ b/tools/testing/selftests/clone3/Makefile
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0
+uname_M := $(shell uname -m 2>/dev/null || echo not)
+ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/)
+
+CFLAGS += -I../../../../usr/include/
+
+ifeq ($(ARCH),x86_64)
+   TEST_GEN_PROGS := clone3 clone3_set_tid
+endif
+
+include ../lib.mk
diff --git a/tools/testing/selftests/clone3/clone3.c 
b/tools/testing/selftests/clone3/clone3.c
new file mode 100644
index ..a0f1989f8b1b
--- /dev/null
+++ b/tools/testing/selftests/clone3/clone3.c
@@ -0,0 +1,231 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/* Based on Christian Brauner's clone3() example */
+
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "../kselftest.h"
+
+/*
+ * Different sizes of struct clone_args
+ */
+#define CLONE3_ARGS_SIZE_V0 64
+/* V1 includes set_tid */
+#define CLONE3_ARGS_SIZE_V1 72
+
+#define CLONE3_ARGS_NO_TEST 0
+#define CLONE3_ARGS_ALL_0 1
+#define CLONE3_ARGS_ALL_1 2
+
+static pid_t raw_clone(struct clone_args *args, size_t size)
+{
+   return syscall(__NR_clone3, args, size);
+}
+
+static int call_clone3(int flags, size_t size, int test_mode)
+{
+   struct clone_args args = {0};
+   pid_t ppid = -1;
+   pid_t pid = -1;
+   int status;
+
+   args.flags = flags;
+   args.exit_signal = SIGCHLD;
+
+   if (size == 0)
+   size = sizeof(struct clone_args);
+
+   if (test_mode == CLONE3_ARGS_ALL_0) {
+   args.flags = 0;
+   args.pidfd = 0;
+   args.child_tid = 0;
+   args.parent_tid = 0;
+   args.exit_signal = 0;
+   args.stack = 0;
+   args. stack_size = 0;
+   args.tls = 0;
+   args.set_tid = 0;
+   } else if (test_mode == CLONE3_ARGS_ALL_1) {
+   args.flags = 1;
+   args.pidfd = 1;
+   args.child_tid = 1;
+   args.parent_tid = 1;
+   args.exit_signal = 1;
+   args.stack = 1;
+   args. stack_size = 1;
+   args.tls = 1;
+   args.set_tid = 1;
+   }
+
+   pid = raw_clone(, size);
+   if (pid < 0) {
+   ksft_print_msg("%s - Failed to create new process\n",
+   strerror(errno));
+   return -errno;
+   }
+
+   if (pid == 0) {
+   ksft_print_msg("I am the child, my PID is %d\n", getpid());
+   _exit(EXIT_SUCCESS);
+   }
+
+   ppid = getpid();
+   ksft_print_msg("I am the parent (%d). My child's pid is %d\n",
+   ppid, pid);
+
+   (void)wait();
+   if (WEXITSTATUS(status))
+   return WEXITSTATUS(status);
+
+   return 0;
+}
+
+static int test_clone3(int flags, size_t size, int expected, int test_mode)
+{
+   int ret;
+
+   ksft_print_msg("[%d] Trying clone3() with flags 0x%x (size %d)\n",
+   getpid(), flags, size);
+   ret = call_clone3(flags, size, test_mode);
+   ksft_print_msg("[%d] clone3() with flags says :%d expected %d\n",
+   getpid(), ret, expected);
+   if (ret != expected)
+   ksft_exit_fail_msg(
+   "[%d] Result (%d) is different than expected (%d)\n",
+   getpid(), ret, expected);
+   ksft_test_result_pass("[%d] Result (%d) matches expectation (%d)\n",
+  

[PATCH v6 1/2] fork: extend clone3() to support setting a PID

2019-08-12 Thread Adrian Reber
The main motivation to add set_tid to clone3() is CRIU.

To restore a process with the same PID/TID CRIU currently uses
/proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
ns_last_pid and then (quickly) does a clone(). This works most of the
time, but it is racy. It is also slow as it requires multiple syscalls.

Extending clone3() to support set_tid makes it possible restore a
process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
race free (as long as the desired PID/TID is available).

This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
on clone3() with set_tid as they are currently in place for ns_last_pid.

Signed-off-by: Adrian Reber 
---
v2:
 - Removed (size < sizeof(struct clone_args)) as discussed with
   Christian and Dmitry
 - Added comment to ((set_tid != 1) && idr_get_cursor() <= 1) (Oleg)
 - Use idr_alloc() instead of idr_alloc_cyclic() (Oleg)

v3:
 - Return EEXIST if PID is already in use (Christian)
 - Drop CLONE_SET_TID (Christian and Oleg)
 - Use idr_is_empty() instead of idr_get_cursor() (Oleg)
 - Handle different `struct clone_args` sizes (Dmitry)

v4:
 - Rework struct size check with defines (Christian)
 - Reduce number of set_tid checks (Oleg)
 - Less parentheses and more robust code (Oleg)
 - Do ns_capable() on correct user_ns (Oleg, Christian)

v5:
 - make set_tid checks earlier in alloc_pid() (Christian)
 - remove unnecessary comment and struct size check (Christian)

v6:
 - remove CLONE_SET_TID from description (Christian)
 - add clone3() tests for different clone_args sizes (Christian)
 - move more set_tid checks to alloc_pid() (Oleg)
 - make all set_tid checks lockless (Oleg)
---
 include/linux/pid.h|  2 +-
 include/linux/sched/task.h |  1 +
 include/uapi/linux/sched.h |  1 +
 kernel/fork.c  | 14 --
 kernel/pid.c   | 37 ++---
 5 files changed, 45 insertions(+), 10 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 2a83e434db9d..052000db0ced 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -116,7 +116,7 @@ extern struct pid *find_vpid(int nr);
 extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
 
-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t set_tid);
 extern void free_pid(struct pid *pid);
 extern void disable_pid_allocation(struct pid_namespace *ns);
 
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 0497091e40c1..4f2a80564332 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -26,6 +26,7 @@ struct kernel_clone_args {
unsigned long stack;
unsigned long stack_size;
unsigned long tls;
+   pid_t set_tid;
 };
 
 /*
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index b3105ac1381a..e1ce103a2c47 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -45,6 +45,7 @@ struct clone_args {
__aligned_u64 stack;
__aligned_u64 stack_size;
__aligned_u64 tls;
+   __aligned_u64 set_tid;
 };
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index 2852d0e76ea3..8317d408a8d6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -117,6 +117,11 @@
  */
 #define MAX_THREADS FUTEX_TID_MASK
 
+/*
+ * For different sizes of struct clone_args
+ */
+#define CLONE3_ARGS_SIZE_V0 64
+
 /*
  * Protected counters by write_lock_irq(_lock)
  */
@@ -2031,7 +2036,7 @@ static __latent_entropy struct task_struct *copy_process(
stackleak_task_init(p);
 
if (pid != _struct_pid) {
-   pid = alloc_pid(p->nsproxy->pid_ns_for_children);
+   pid = alloc_pid(p->nsproxy->pid_ns_for_children, args->set_tid);
if (IS_ERR(pid)) {
retval = PTR_ERR(pid);
goto bad_fork_cleanup_thread;
@@ -2535,9 +2540,13 @@ noinline static int copy_clone_args_from_user(struct 
kernel_clone_args *kargs,
if (unlikely(size > PAGE_SIZE))
return -E2BIG;
 
-   if (unlikely(size < sizeof(struct clone_args)))
+   if (unlikely(size < CLONE3_ARGS_SIZE_V0))
return -EINVAL;
 
+   if (size < sizeof(struct clone_args))
+   memset((void *) + size, 0,
+   sizeof(struct clone_args) - size);
+
if (unlikely(!access_ok(uargs, size)))
return -EFAULT;
 
@@ -2571,6 +2580,7 @@ noinline static int copy_clone_args_from_user(struct 
kernel_clone_args *kargs,
.stack  = args.stack,
.stack_size = args.stack_size,
.tls= args.tls,
+   .set_tid= args.set_tid,
};
 
return 0;
diff --git a/kernel/pid.c b/kernel/pid.c
index 0a9f2e437217..5cdab73b9094 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ 

[PATCH v5 1/2] fork: extend clone3() to support CLONE_SET_TID

2019-08-11 Thread Adrian Reber
The main motivation to add set_tid to clone3() is CRIU.

To restore a process with the same PID/TID CRIU currently uses
/proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
ns_last_pid and then (quickly) does a clone(). This works most of the
time, but it is racy. It is also slow as it requires multiple syscalls.

Extending clone3() to support set_tid makes it possible restore a
process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
race free (as long as the desired PID/TID is available).

This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
on clone3() with set_tid as they are currently in place for ns_last_pid.

Signed-off-by: Adrian Reber 
---
v2:
 - Removed (size < sizeof(struct clone_args)) as discussed with
   Christian and Dmitry
 - Added comment to ((set_tid != 1) && idr_get_cursor() <= 1) (Oleg)
 - Use idr_alloc() instead of idr_alloc_cyclic() (Oleg)

v3:
 - Return EEXIST if PID is already in use (Christian)
 - Drop CLONE_SET_TID (Christian and Oleg)
 - Use idr_is_empty() instead of idr_get_cursor() (Oleg)
 - Handle different `struct clone_args` sizes (Dmitry)

v4:
 - Rework struct size check with defines (Christian)
 - Reduce number of set_tid checks (Oleg)
 - Less parentheses and more robust code (Oleg)
 - Do ns_capable() on correct user_ns (Oleg, Christian)

v5:
 - make set_tid checks earlier in alloc_pid() (Christian)
 - remove unnecessary comment and struct size check (Christian)
---
 include/linux/pid.h|  2 +-
 include/linux/sched/task.h |  1 +
 include/uapi/linux/sched.h |  1 +
 kernel/fork.c  | 22 --
 kernel/pid.c   | 36 +---
 5 files changed, 52 insertions(+), 10 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 2a83e434db9d..052000db0ced 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -116,7 +116,7 @@ extern struct pid *find_vpid(int nr);
 extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
 
-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t set_tid);
 extern void free_pid(struct pid *pid);
 extern void disable_pid_allocation(struct pid_namespace *ns);
 
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 0497091e40c1..4f2a80564332 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -26,6 +26,7 @@ struct kernel_clone_args {
unsigned long stack;
unsigned long stack_size;
unsigned long tls;
+   pid_t set_tid;
 };
 
 /*
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index b3105ac1381a..e1ce103a2c47 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -45,6 +45,7 @@ struct clone_args {
__aligned_u64 stack;
__aligned_u64 stack_size;
__aligned_u64 tls;
+   __aligned_u64 set_tid;
 };
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index 2852d0e76ea3..0365041243b1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -117,6 +117,13 @@
  */
 #define MAX_THREADS FUTEX_TID_MASK
 
+/*
+ * Different sizes of struct clone_args
+ */
+#define CLONE3_ARGS_SIZE_V0 64
+/* V1 includes set_tid */
+#define CLONE3_ARGS_SIZE_V1 72
+
 /*
  * Protected counters by write_lock_irq(_lock)
  */
@@ -2031,7 +2038,13 @@ static __latent_entropy struct task_struct *copy_process(
stackleak_task_init(p);
 
if (pid != _struct_pid) {
-   pid = alloc_pid(p->nsproxy->pid_ns_for_children);
+   if (args->set_tid && !ns_capable(
+   p->nsproxy->pid_ns_for_children->user_ns,
+   CAP_SYS_ADMIN)) {
+   retval = -EPERM;
+   goto bad_fork_cleanup_thread;
+   }
+   pid = alloc_pid(p->nsproxy->pid_ns_for_children, args->set_tid);
if (IS_ERR(pid)) {
retval = PTR_ERR(pid);
goto bad_fork_cleanup_thread;
@@ -2535,9 +2548,13 @@ noinline static int copy_clone_args_from_user(struct 
kernel_clone_args *kargs,
if (unlikely(size > PAGE_SIZE))
return -E2BIG;
 
-   if (unlikely(size < sizeof(struct clone_args)))
+   if (unlikely(size < CLONE3_ARGS_SIZE_V0))
return -EINVAL;
 
+   if (size < sizeof(struct clone_args))
+   memset((void *) + size, 0,
+   sizeof(struct clone_args) - size);
+
if (unlikely(!access_ok(uargs, size)))
return -EFAULT;
 
@@ -2571,6 +2588,7 @@ noinline static int copy_clone_args_from_user(struct 
kernel_clone_args *kargs,
.stack  = args.stack,
.stack_size = args.stack_size,
.tls= args.tls,
+   .set_tid= args.set_ti

[PATCH v5 2/2] selftests: add tests for clone3()

2019-08-11 Thread Adrian Reber
This tests clone3() with and without set_tid to see if all desired PIDs
are working as expected. The test tries to clone3() with a set_tid of
-1, 1, pid_max, a PID which is already in use and an unused PID. The
same tests are also running in PID namespace.

Signed-off-by: Adrian Reber 
---
 tools/testing/selftests/clone3/.gitignore |   2 +
 tools/testing/selftests/clone3/Makefile   |  11 ++
 tools/testing/selftests/clone3/clone3.c   | 141 +++
 .../testing/selftests/clone3/clone3_set_tid.c | 161 ++
 4 files changed, 315 insertions(+)
 create mode 100644 tools/testing/selftests/clone3/.gitignore
 create mode 100644 tools/testing/selftests/clone3/Makefile
 create mode 100644 tools/testing/selftests/clone3/clone3.c
 create mode 100644 tools/testing/selftests/clone3/clone3_set_tid.c

diff --git a/tools/testing/selftests/clone3/.gitignore 
b/tools/testing/selftests/clone3/.gitignore
new file mode 100644
index ..c63c64a78ddf
--- /dev/null
+++ b/tools/testing/selftests/clone3/.gitignore
@@ -0,0 +1,2 @@
+clone3_set_tid
+clone3
diff --git a/tools/testing/selftests/clone3/Makefile 
b/tools/testing/selftests/clone3/Makefile
new file mode 100644
index ..4efcf45b995b
--- /dev/null
+++ b/tools/testing/selftests/clone3/Makefile
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0
+uname_M := $(shell uname -m 2>/dev/null || echo not)
+ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/)
+
+CFLAGS += -I../../../../usr/include/
+
+ifeq ($(ARCH),x86_64)
+   TEST_GEN_PROGS := clone3 clone3_set_tid
+endif
+
+include ../lib.mk
diff --git a/tools/testing/selftests/clone3/clone3.c 
b/tools/testing/selftests/clone3/clone3.c
new file mode 100644
index ..55a6915566b8
--- /dev/null
+++ b/tools/testing/selftests/clone3/clone3.c
@@ -0,0 +1,141 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/* Based on Christian Brauner's clone3() example */
+
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "../kselftest.h"
+
+static pid_t raw_clone(struct clone_args *args)
+{
+   return syscall(__NR_clone3, args, sizeof(struct clone_args));
+}
+
+static int call_clone3(int flags)
+{
+   struct clone_args args = {0};
+   pid_t ppid = -1;
+   pid_t pid = -1;
+   int status;
+
+   args.flags = flags;
+   args.exit_signal = SIGCHLD;
+
+   pid = raw_clone();
+   if (pid < 0) {
+   ksft_print_msg("%s - Failed to create new process\n",
+   strerror(errno));
+   return -errno;
+   }
+
+   if (pid == 0) {
+   ksft_print_msg("I am the child, my PID is %d\n", getpid());
+   _exit(EXIT_SUCCESS);
+   }
+
+   ppid = getpid();
+   ksft_print_msg("I am the parent (%d). My child's pid is %d\n",
+   ppid, pid);
+
+   (void)wait();
+   if (WEXITSTATUS(status))
+   return WEXITSTATUS(status);
+
+   return 0;
+}
+
+static int test_clone3(int flags, int expected)
+{
+   int ret;
+
+   ksft_print_msg("[%d] Trying clone3() with flags 0x%x\n",
+   getpid(), flags);
+   ret = call_clone3(flags);
+   ksft_print_msg("[%d] clone3() with flags says :%d expected %d\n",
+   getpid(), ret, expected);
+   if (ret != expected)
+   ksft_exit_fail_msg(
+   "[%d] Result (%d) is different than expected (%d)\n",
+   getpid(), ret, expected);
+   ksft_test_result_pass("[%d] Result (%d) matches expectation (%d)\n",
+   getpid(), ret, expected);
+   return 0;
+}
+int main(int argc, char *argv[])
+{
+   int ret = -1;
+   pid_t pid;
+
+   ksft_print_header();
+   ksft_set_plan(3);
+
+   /* Just a simple clone3() should return 0.*/
+   if (test_clone3(0, 0))
+   goto on_error;
+   /* Do a clone3() in a new PID NS.*/
+   if (test_clone3(CLONE_NEWPID, 0))
+   goto on_error;
+   ksft_print_msg("First unshare\n");
+   if (unshare(CLONE_NEWPID))
+   goto on_error;
+   /*
+* Before clone3()ing in a new PID NS with
+* CLONE_NEWPID a fork() is necessary.
+*/
+   if (test_clone3(CLONE_NEWPID, -EINVAL))
+   goto on_error;
+   pid = fork();
+   if (pid < 0) {
+   ksft_print_msg("First fork() failed\n");
+   goto on_error;
+   }
+   if (pid > 0) {
+   (void)wait(NULL);
+   goto parent_out;
+   }
+   ksft_set_plan(6);
+   if (test_clone3(CLONE_NEWPID, 0))
+   goto on_error;
+   if (test_clone3(0, 0))
+   goto on_error;
+   ksft_print_msg("Second unshare\n");

Re: [PATCH v4 1/2] fork: extend clone3() to support CLONE_SET_TID

2019-08-11 Thread Adrian Reber
On Sun, Aug 11, 2019 at 09:06:59PM +0200, Christian Brauner wrote:
> On Sun, Aug 11, 2019 at 08:51:48AM +0200, Christian Brauner wrote:
> > On Sat, Aug 10, 2019 at 07:59:18AM +0200, Adrian Reber wrote:
> > > On Sat, Aug 10, 2019 at 03:10:34AM +0200, Christian Brauner wrote:
> > > > On Thu, Aug 08, 2019 at 11:22:21PM +0200, Adrian Reber wrote:
> > > > > The main motivation to add set_tid to clone3() is CRIU.
> > > > > 
> > > > > To restore a process with the same PID/TID CRIU currently uses
> > > > > /proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
> > > > > ns_last_pid and then (quickly) does a clone(). This works most of the
> > > > > time, but it is racy. It is also slow as it requires multiple 
> > > > > syscalls.
> > > > > 
> > > > > Extending clone3() to support set_tid makes it possible restore a
> > > > > process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
> > > > > race free (as long as the desired PID/TID is available).
> > > > > 
> > > > > This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
> > > > > on clone3() with set_tid as they are currently in place for 
> > > > > ns_last_pid.
> > > > > 
> > > > > Signed-off-by: Adrian Reber 
> > > > > ---
> > > > > v2:
> > > > >  - Removed (size < sizeof(struct clone_args)) as discussed with
> > > > >Christian and Dmitry
> > > > >  - Added comment to ((set_tid != 1) && idr_get_cursor() <= 1) (Oleg)
> > > > >  - Use idr_alloc() instead of idr_alloc_cyclic() (Oleg)
> > > > > 
> > > > > v3:
> > > > >  - Return EEXIST if PID is already in use (Christian)
> > > > >  - Drop CLONE_SET_TID (Christian and Oleg)
> > > > >  - Use idr_is_empty() instead of idr_get_cursor() (Oleg)
> > > > >  - Handle different `struct clone_args` sizes (Dmitry)
> > > > > 
> > > > > v4:
> > > > >  - Rework struct size check with defines (Christian)
> > > > >  - Reduce number of set_tid checks (Oleg)
> > > > >  - Less parentheses and more robust code (Oleg)
> > > > >  - Do ns_capable() on correct user_ns (Oleg, Christian)
> > > > > ---
> > > > >  include/linux/pid.h|  2 +-
> > > > >  include/linux/sched/task.h |  1 +
> > > > >  include/uapi/linux/sched.h |  1 +
> > > > >  kernel/fork.c  | 25 +++--
> > > > >  kernel/pid.c   | 34 +++---
> > > > >  5 files changed, 53 insertions(+), 10 deletions(-)
> > > > > 
> > > > > diff --git a/include/linux/pid.h b/include/linux/pid.h
> > > > > index 2a83e434db9d..052000db0ced 100644
> > > > > --- a/include/linux/pid.h
> > > > > +++ b/include/linux/pid.h
> > > > > @@ -116,7 +116,7 @@ extern struct pid *find_vpid(int nr);
> > > > >  extern struct pid *find_get_pid(int nr);
> > > > >  extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
> > > > >  
> > > > > -extern struct pid *alloc_pid(struct pid_namespace *ns);
> > > > > +extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t 
> > > > > set_tid);
> > > > >  extern void free_pid(struct pid *pid);
> > > > >  extern void disable_pid_allocation(struct pid_namespace *ns);
> > > > >  
> > > > > diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> > > > > index 0497091e40c1..4f2a80564332 100644
> > > > > --- a/include/linux/sched/task.h
> > > > > +++ b/include/linux/sched/task.h
> > > > > @@ -26,6 +26,7 @@ struct kernel_clone_args {
> > > > >   unsigned long stack;
> > > > >   unsigned long stack_size;
> > > > >   unsigned long tls;
> > > > > + pid_t set_tid;
> > > > >  };
> > > > >  
> > > > >  /*
> > > > > diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> > > > > index b3105ac1381a..e1ce103a2c47 100644
> > > > > --- a/include/uapi/linux/sched.h
> > > > > +++ b/include/uapi/linux/sched.h
> > > > > @@ -45,6 +45,7 @@ struct clone_args {
> > > > > 

Re: [PATCH v4 1/2] fork: extend clone3() to support CLONE_SET_TID

2019-08-09 Thread Adrian Reber
On Sat, Aug 10, 2019 at 03:10:34AM +0200, Christian Brauner wrote:
> On Thu, Aug 08, 2019 at 11:22:21PM +0200, Adrian Reber wrote:
> > The main motivation to add set_tid to clone3() is CRIU.
> > 
> > To restore a process with the same PID/TID CRIU currently uses
> > /proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
> > ns_last_pid and then (quickly) does a clone(). This works most of the
> > time, but it is racy. It is also slow as it requires multiple syscalls.
> > 
> > Extending clone3() to support set_tid makes it possible restore a
> > process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
> > race free (as long as the desired PID/TID is available).
> > 
> > This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
> > on clone3() with set_tid as they are currently in place for ns_last_pid.
> > 
> > Signed-off-by: Adrian Reber 
> > ---
> > v2:
> >  - Removed (size < sizeof(struct clone_args)) as discussed with
> >Christian and Dmitry
> >  - Added comment to ((set_tid != 1) && idr_get_cursor() <= 1) (Oleg)
> >  - Use idr_alloc() instead of idr_alloc_cyclic() (Oleg)
> > 
> > v3:
> >  - Return EEXIST if PID is already in use (Christian)
> >  - Drop CLONE_SET_TID (Christian and Oleg)
> >  - Use idr_is_empty() instead of idr_get_cursor() (Oleg)
> >  - Handle different `struct clone_args` sizes (Dmitry)
> > 
> > v4:
> >  - Rework struct size check with defines (Christian)
> >  - Reduce number of set_tid checks (Oleg)
> >  - Less parentheses and more robust code (Oleg)
> >  - Do ns_capable() on correct user_ns (Oleg, Christian)
> > ---
> >  include/linux/pid.h|  2 +-
> >  include/linux/sched/task.h |  1 +
> >  include/uapi/linux/sched.h |  1 +
> >  kernel/fork.c  | 25 +++--
> >  kernel/pid.c   | 34 +++---
> >  5 files changed, 53 insertions(+), 10 deletions(-)
> > 
> > diff --git a/include/linux/pid.h b/include/linux/pid.h
> > index 2a83e434db9d..052000db0ced 100644
> > --- a/include/linux/pid.h
> > +++ b/include/linux/pid.h
> > @@ -116,7 +116,7 @@ extern struct pid *find_vpid(int nr);
> >  extern struct pid *find_get_pid(int nr);
> >  extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
> >  
> > -extern struct pid *alloc_pid(struct pid_namespace *ns);
> > +extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t set_tid);
> >  extern void free_pid(struct pid *pid);
> >  extern void disable_pid_allocation(struct pid_namespace *ns);
> >  
> > diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> > index 0497091e40c1..4f2a80564332 100644
> > --- a/include/linux/sched/task.h
> > +++ b/include/linux/sched/task.h
> > @@ -26,6 +26,7 @@ struct kernel_clone_args {
> > unsigned long stack;
> > unsigned long stack_size;
> > unsigned long tls;
> > +   pid_t set_tid;
> >  };
> >  
> >  /*
> > diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> > index b3105ac1381a..e1ce103a2c47 100644
> > --- a/include/uapi/linux/sched.h
> > +++ b/include/uapi/linux/sched.h
> > @@ -45,6 +45,7 @@ struct clone_args {
> > __aligned_u64 stack;
> > __aligned_u64 stack_size;
> > __aligned_u64 tls;
> > +   __aligned_u64 set_tid;
> >  };
> >  
> >  /*
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 2852d0e76ea3..2a03f0e201e9 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -117,6 +117,13 @@
> >   */
> >  #define MAX_THREADS FUTEX_TID_MASK
> >  
> > +/*
> > + * Different sizes of struct clone_args
> > + */
> > +#define CLONE3_ARGS_SIZE_V0 64
> > +/* V1 includes set_tid */
> > +#define CLONE3_ARGS_SIZE_V1 72
> > +
> >  /*
> >   * Protected counters by write_lock_irq(_lock)
> >   */
> > @@ -2031,7 +2038,13 @@ static __latent_entropy struct task_struct 
> > *copy_process(
> > stackleak_task_init(p);
> >  
> > if (pid != _struct_pid) {
> > -   pid = alloc_pid(p->nsproxy->pid_ns_for_children);
> > +   if (args->set_tid && !ns_capable(
> > +   p->nsproxy->pid_ns_for_children->user_ns,
> > +   CAP_SYS_ADMIN)) {
> > +   retval = -EPERM;
> > +   goto bad_fork_cleanup_thread;
> > +   }
> > +   pid = alloc_pid(p->nsproxy->p

[PATCH v4 2/2] selftests: add tests for clone3()

2019-08-08 Thread Adrian Reber
This tests clone3() with and without set_tid to see if all desired PIDs
are working as expected. The test tries to clone3() with a set_tid of
-1, 1, pid_max, a PID which is already in use and an unused PID. The
same tests are also running in PID namespace.

Signed-off-by: Adrian Reber 
---
 tools/testing/selftests/clone3/.gitignore |   2 +
 tools/testing/selftests/clone3/Makefile   |  11 ++
 tools/testing/selftests/clone3/clone3.c   | 141 +++
 .../testing/selftests/clone3/clone3_set_tid.c | 161 ++
 4 files changed, 315 insertions(+)
 create mode 100644 tools/testing/selftests/clone3/.gitignore
 create mode 100644 tools/testing/selftests/clone3/Makefile
 create mode 100644 tools/testing/selftests/clone3/clone3.c
 create mode 100644 tools/testing/selftests/clone3/clone3_set_tid.c

diff --git a/tools/testing/selftests/clone3/.gitignore 
b/tools/testing/selftests/clone3/.gitignore
new file mode 100644
index ..c63c64a78ddf
--- /dev/null
+++ b/tools/testing/selftests/clone3/.gitignore
@@ -0,0 +1,2 @@
+clone3_set_tid
+clone3
diff --git a/tools/testing/selftests/clone3/Makefile 
b/tools/testing/selftests/clone3/Makefile
new file mode 100644
index ..4efcf45b995b
--- /dev/null
+++ b/tools/testing/selftests/clone3/Makefile
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0
+uname_M := $(shell uname -m 2>/dev/null || echo not)
+ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/)
+
+CFLAGS += -I../../../../usr/include/
+
+ifeq ($(ARCH),x86_64)
+   TEST_GEN_PROGS := clone3 clone3_set_tid
+endif
+
+include ../lib.mk
diff --git a/tools/testing/selftests/clone3/clone3.c 
b/tools/testing/selftests/clone3/clone3.c
new file mode 100644
index ..55a6915566b8
--- /dev/null
+++ b/tools/testing/selftests/clone3/clone3.c
@@ -0,0 +1,141 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/* Based on Christian Brauner's clone3() example */
+
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "../kselftest.h"
+
+static pid_t raw_clone(struct clone_args *args)
+{
+   return syscall(__NR_clone3, args, sizeof(struct clone_args));
+}
+
+static int call_clone3(int flags)
+{
+   struct clone_args args = {0};
+   pid_t ppid = -1;
+   pid_t pid = -1;
+   int status;
+
+   args.flags = flags;
+   args.exit_signal = SIGCHLD;
+
+   pid = raw_clone();
+   if (pid < 0) {
+   ksft_print_msg("%s - Failed to create new process\n",
+   strerror(errno));
+   return -errno;
+   }
+
+   if (pid == 0) {
+   ksft_print_msg("I am the child, my PID is %d\n", getpid());
+   _exit(EXIT_SUCCESS);
+   }
+
+   ppid = getpid();
+   ksft_print_msg("I am the parent (%d). My child's pid is %d\n",
+   ppid, pid);
+
+   (void)wait();
+   if (WEXITSTATUS(status))
+   return WEXITSTATUS(status);
+
+   return 0;
+}
+
+static int test_clone3(int flags, int expected)
+{
+   int ret;
+
+   ksft_print_msg("[%d] Trying clone3() with flags 0x%x\n",
+   getpid(), flags);
+   ret = call_clone3(flags);
+   ksft_print_msg("[%d] clone3() with flags says :%d expected %d\n",
+   getpid(), ret, expected);
+   if (ret != expected)
+   ksft_exit_fail_msg(
+   "[%d] Result (%d) is different than expected (%d)\n",
+   getpid(), ret, expected);
+   ksft_test_result_pass("[%d] Result (%d) matches expectation (%d)\n",
+   getpid(), ret, expected);
+   return 0;
+}
+int main(int argc, char *argv[])
+{
+   int ret = -1;
+   pid_t pid;
+
+   ksft_print_header();
+   ksft_set_plan(3);
+
+   /* Just a simple clone3() should return 0.*/
+   if (test_clone3(0, 0))
+   goto on_error;
+   /* Do a clone3() in a new PID NS.*/
+   if (test_clone3(CLONE_NEWPID, 0))
+   goto on_error;
+   ksft_print_msg("First unshare\n");
+   if (unshare(CLONE_NEWPID))
+   goto on_error;
+   /*
+* Before clone3()ing in a new PID NS with
+* CLONE_NEWPID a fork() is necessary.
+*/
+   if (test_clone3(CLONE_NEWPID, -EINVAL))
+   goto on_error;
+   pid = fork();
+   if (pid < 0) {
+   ksft_print_msg("First fork() failed\n");
+   goto on_error;
+   }
+   if (pid > 0) {
+   (void)wait(NULL);
+   goto parent_out;
+   }
+   ksft_set_plan(6);
+   if (test_clone3(CLONE_NEWPID, 0))
+   goto on_error;
+   if (test_clone3(0, 0))
+   goto on_error;
+   ksft_print_msg("Second unshare\n");

[PATCH v4 1/2] fork: extend clone3() to support CLONE_SET_TID

2019-08-08 Thread Adrian Reber
The main motivation to add set_tid to clone3() is CRIU.

To restore a process with the same PID/TID CRIU currently uses
/proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
ns_last_pid and then (quickly) does a clone(). This works most of the
time, but it is racy. It is also slow as it requires multiple syscalls.

Extending clone3() to support set_tid makes it possible restore a
process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
race free (as long as the desired PID/TID is available).

This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
on clone3() with set_tid as they are currently in place for ns_last_pid.

Signed-off-by: Adrian Reber 
---
v2:
 - Removed (size < sizeof(struct clone_args)) as discussed with
   Christian and Dmitry
 - Added comment to ((set_tid != 1) && idr_get_cursor() <= 1) (Oleg)
 - Use idr_alloc() instead of idr_alloc_cyclic() (Oleg)

v3:
 - Return EEXIST if PID is already in use (Christian)
 - Drop CLONE_SET_TID (Christian and Oleg)
 - Use idr_is_empty() instead of idr_get_cursor() (Oleg)
 - Handle different `struct clone_args` sizes (Dmitry)

v4:
 - Rework struct size check with defines (Christian)
 - Reduce number of set_tid checks (Oleg)
 - Less parentheses and more robust code (Oleg)
 - Do ns_capable() on correct user_ns (Oleg, Christian)
---
 include/linux/pid.h|  2 +-
 include/linux/sched/task.h |  1 +
 include/uapi/linux/sched.h |  1 +
 kernel/fork.c  | 25 +++--
 kernel/pid.c   | 34 +++---
 5 files changed, 53 insertions(+), 10 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 2a83e434db9d..052000db0ced 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -116,7 +116,7 @@ extern struct pid *find_vpid(int nr);
 extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
 
-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t set_tid);
 extern void free_pid(struct pid *pid);
 extern void disable_pid_allocation(struct pid_namespace *ns);
 
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 0497091e40c1..4f2a80564332 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -26,6 +26,7 @@ struct kernel_clone_args {
unsigned long stack;
unsigned long stack_size;
unsigned long tls;
+   pid_t set_tid;
 };
 
 /*
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index b3105ac1381a..e1ce103a2c47 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -45,6 +45,7 @@ struct clone_args {
__aligned_u64 stack;
__aligned_u64 stack_size;
__aligned_u64 tls;
+   __aligned_u64 set_tid;
 };
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index 2852d0e76ea3..2a03f0e201e9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -117,6 +117,13 @@
  */
 #define MAX_THREADS FUTEX_TID_MASK
 
+/*
+ * Different sizes of struct clone_args
+ */
+#define CLONE3_ARGS_SIZE_V0 64
+/* V1 includes set_tid */
+#define CLONE3_ARGS_SIZE_V1 72
+
 /*
  * Protected counters by write_lock_irq(_lock)
  */
@@ -2031,7 +2038,13 @@ static __latent_entropy struct task_struct *copy_process(
stackleak_task_init(p);
 
if (pid != _struct_pid) {
-   pid = alloc_pid(p->nsproxy->pid_ns_for_children);
+   if (args->set_tid && !ns_capable(
+   p->nsproxy->pid_ns_for_children->user_ns,
+   CAP_SYS_ADMIN)) {
+   retval = -EPERM;
+   goto bad_fork_cleanup_thread;
+   }
+   pid = alloc_pid(p->nsproxy->pid_ns_for_children, args->set_tid);
if (IS_ERR(pid)) {
retval = PTR_ERR(pid);
goto bad_fork_cleanup_thread;
@@ -2535,9 +2548,14 @@ noinline static int copy_clone_args_from_user(struct 
kernel_clone_args *kargs,
if (unlikely(size > PAGE_SIZE))
return -E2BIG;
 
-   if (unlikely(size < sizeof(struct clone_args)))
+   /* The struct needs to be at least the size of the original struct. */
+   if (unlikely(size < CLONE3_ARGS_SIZE_V0))
return -EINVAL;
 
+   if (size < sizeof(struct clone_args))
+   memset((void *) + size, 0,
+   sizeof(struct clone_args) - size);
+
if (unlikely(!access_ok(uargs, size)))
return -EFAULT;
 
@@ -2573,6 +2591,9 @@ noinline static int copy_clone_args_from_user(struct 
kernel_clone_args *kargs,
.tls= args.tls,
};
 
+   if (size >= CLONE3_ARGS_SIZE_V1)
+   kargs->set_tid = args.set_tid;
+
return 0;
 }
 
diff --git a/kernel/pid.c b/kernel/pid.c
index 0a9f2e437217..

[PATCH v3 2/2] selftests: add tests for clone3()

2019-08-06 Thread Adrian Reber
This tests clone3() with and without set_tid to see if all desired PIDs
are working as expected. The test tries to clone3() with a set_tid of
-1, 1, pid_max, a PID which is already in use and an unused PID. The
same tests are also running in PID namespace.

Signed-off-by: Adrian Reber 
---
 tools/testing/selftests/clone3/.gitignore |   2 +
 tools/testing/selftests/clone3/Makefile   |  11 ++
 tools/testing/selftests/clone3/clone3.c   | 141 +++
 .../testing/selftests/clone3/clone3_set_tid.c | 161 ++
 4 files changed, 315 insertions(+)
 create mode 100644 tools/testing/selftests/clone3/.gitignore
 create mode 100644 tools/testing/selftests/clone3/Makefile
 create mode 100644 tools/testing/selftests/clone3/clone3.c
 create mode 100644 tools/testing/selftests/clone3/clone3_set_tid.c

diff --git a/tools/testing/selftests/clone3/.gitignore 
b/tools/testing/selftests/clone3/.gitignore
new file mode 100644
index ..c63c64a78ddf
--- /dev/null
+++ b/tools/testing/selftests/clone3/.gitignore
@@ -0,0 +1,2 @@
+clone3_set_tid
+clone3
diff --git a/tools/testing/selftests/clone3/Makefile 
b/tools/testing/selftests/clone3/Makefile
new file mode 100644
index ..4efcf45b995b
--- /dev/null
+++ b/tools/testing/selftests/clone3/Makefile
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0
+uname_M := $(shell uname -m 2>/dev/null || echo not)
+ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/)
+
+CFLAGS += -I../../../../usr/include/
+
+ifeq ($(ARCH),x86_64)
+   TEST_GEN_PROGS := clone3 clone3_set_tid
+endif
+
+include ../lib.mk
diff --git a/tools/testing/selftests/clone3/clone3.c 
b/tools/testing/selftests/clone3/clone3.c
new file mode 100644
index ..55a6915566b8
--- /dev/null
+++ b/tools/testing/selftests/clone3/clone3.c
@@ -0,0 +1,141 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/* Based on Christian Brauner's clone3() example */
+
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "../kselftest.h"
+
+static pid_t raw_clone(struct clone_args *args)
+{
+   return syscall(__NR_clone3, args, sizeof(struct clone_args));
+}
+
+static int call_clone3(int flags)
+{
+   struct clone_args args = {0};
+   pid_t ppid = -1;
+   pid_t pid = -1;
+   int status;
+
+   args.flags = flags;
+   args.exit_signal = SIGCHLD;
+
+   pid = raw_clone();
+   if (pid < 0) {
+   ksft_print_msg("%s - Failed to create new process\n",
+   strerror(errno));
+   return -errno;
+   }
+
+   if (pid == 0) {
+   ksft_print_msg("I am the child, my PID is %d\n", getpid());
+   _exit(EXIT_SUCCESS);
+   }
+
+   ppid = getpid();
+   ksft_print_msg("I am the parent (%d). My child's pid is %d\n",
+   ppid, pid);
+
+   (void)wait();
+   if (WEXITSTATUS(status))
+   return WEXITSTATUS(status);
+
+   return 0;
+}
+
+static int test_clone3(int flags, int expected)
+{
+   int ret;
+
+   ksft_print_msg("[%d] Trying clone3() with flags 0x%x\n",
+   getpid(), flags);
+   ret = call_clone3(flags);
+   ksft_print_msg("[%d] clone3() with flags says :%d expected %d\n",
+   getpid(), ret, expected);
+   if (ret != expected)
+   ksft_exit_fail_msg(
+   "[%d] Result (%d) is different than expected (%d)\n",
+   getpid(), ret, expected);
+   ksft_test_result_pass("[%d] Result (%d) matches expectation (%d)\n",
+   getpid(), ret, expected);
+   return 0;
+}
+int main(int argc, char *argv[])
+{
+   int ret = -1;
+   pid_t pid;
+
+   ksft_print_header();
+   ksft_set_plan(3);
+
+   /* Just a simple clone3() should return 0.*/
+   if (test_clone3(0, 0))
+   goto on_error;
+   /* Do a clone3() in a new PID NS.*/
+   if (test_clone3(CLONE_NEWPID, 0))
+   goto on_error;
+   ksft_print_msg("First unshare\n");
+   if (unshare(CLONE_NEWPID))
+   goto on_error;
+   /*
+* Before clone3()ing in a new PID NS with
+* CLONE_NEWPID a fork() is necessary.
+*/
+   if (test_clone3(CLONE_NEWPID, -EINVAL))
+   goto on_error;
+   pid = fork();
+   if (pid < 0) {
+   ksft_print_msg("First fork() failed\n");
+   goto on_error;
+   }
+   if (pid > 0) {
+   (void)wait(NULL);
+   goto parent_out;
+   }
+   ksft_set_plan(6);
+   if (test_clone3(CLONE_NEWPID, 0))
+   goto on_error;
+   if (test_clone3(0, 0))
+   goto on_error;
+   ksft_print_msg("Second unshare\n");

[PATCH v3 1/2] fork: extend clone3() to support CLONE_SET_TID

2019-08-06 Thread Adrian Reber
The main motivation to add set_tid to clone3() is CRIU.

To restore a process with the same PID/TID CRIU currently uses
/proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
ns_last_pid and then (quickly) does a clone(). This works most of the
time, but it is racy. It is also slow as it requires multiple syscalls.

Extending clone3() to support set_tid makes it possible restore a
process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
race free (as long as the desired PID/TID is available).

This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
on clone3() with set_tid as they are currently in place for ns_last_pid.

Signed-off-by: Adrian Reber 
---
v2:
 - Removed (size < sizeof(struct clone_args)) as discussed with
   Christian and Dmitry
 - Added comment to ((set_tid != 1) && idr_get_cursor() <= 1) (Oleg)
 - Use idr_alloc() instead of idr_alloc_cyclic() (Oleg)

v3:
 - Return EEXIST if PID is already in use (Christian)
 - Drop CLONE_SET_TID (Christian and Oleg)
 - Use idr_is_empty() instead of idr_get_cursor() (Oleg)
 - Handle different `struct clone_args` sizes (Dmitry)
---
 include/linux/pid.h|  2 +-
 include/linux/sched/task.h |  1 +
 include/uapi/linux/sched.h |  1 +
 kernel/fork.c  | 18 --
 kernel/pid.c   | 37 ++---
 5 files changed, 49 insertions(+), 10 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 2a83e434db9d..052000db0ced 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -116,7 +116,7 @@ extern struct pid *find_vpid(int nr);
 extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
 
-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t set_tid);
 extern void free_pid(struct pid *pid);
 extern void disable_pid_allocation(struct pid_namespace *ns);
 
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 0497091e40c1..4f2a80564332 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -26,6 +26,7 @@ struct kernel_clone_args {
unsigned long stack;
unsigned long stack_size;
unsigned long tls;
+   pid_t set_tid;
 };
 
 /*
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index b3105ac1381a..e1ce103a2c47 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -45,6 +45,7 @@ struct clone_args {
__aligned_u64 stack;
__aligned_u64 stack_size;
__aligned_u64 tls;
+   __aligned_u64 set_tid;
 };
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index 2852d0e76ea3..4bb3972ebb99 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2031,7 +2031,7 @@ static __latent_entropy struct task_struct *copy_process(
stackleak_task_init(p);
 
if (pid != _struct_pid) {
-   pid = alloc_pid(p->nsproxy->pid_ns_for_children);
+   pid = alloc_pid(p->nsproxy->pid_ns_for_children, args->set_tid);
if (IS_ERR(pid)) {
retval = PTR_ERR(pid);
goto bad_fork_cleanup_thread;
@@ -2530,12 +2530,14 @@ noinline static int copy_clone_args_from_user(struct 
kernel_clone_args *kargs,
  struct clone_args __user *uargs,
  size_t size)
 {
+   struct pid_namespace *pid_ns = task_active_pid_ns(current);
struct clone_args args;
 
if (unlikely(size > PAGE_SIZE))
return -E2BIG;
 
-   if (unlikely(size < sizeof(struct clone_args)))
+   /* The struct needs to be at least the size of the original struct. */
+   if (size < (sizeof(struct clone_args) - sizeof(__aligned_u64)))
return -EINVAL;
 
if (unlikely(!access_ok(uargs, size)))
@@ -2573,6 +2575,14 @@ noinline static int copy_clone_args_from_user(struct 
kernel_clone_args *kargs,
.tls= args.tls,
};
 
+   if (size == sizeof(struct clone_args)) {
+   /* Only check permissions if set_tid is actually set. */
+   if (args.set_tid &&
+   !ns_capable(pid_ns->user_ns, CAP_SYS_ADMIN))
+   return -EPERM;
+   kargs->set_tid = args.set_tid;
+   }
+
return 0;
 }
 
@@ -2585,6 +2595,10 @@ static bool clone3_args_valid(const struct 
kernel_clone_args *kargs)
if (kargs->flags & ~CLONE_LEGACY_FLAGS)
return false;
 
+   /* Fail if set_tid is invalid */
+   if (kargs->set_tid < 0)
+   return false;
+
/*
 * - make the CLONE_DETACHED bit reuseable for clone3
 * - make the CSIGNAL bits reuseable for clone3
diff --git a/kernel/pid.c b/kernel/pid.c
index 0a9f2e437217..520b75c17790 100644
--- a/kernel/pid.c
+++

Re: [PATCH v2 1/2] fork: extend clone3() to support CLONE_SET_TID

2019-08-02 Thread Adrian Reber
On Fri, Aug 02, 2019 at 03:52:49PM +0200, Christian Brauner wrote:
> On Fri, Aug 02, 2019 at 03:46:11PM +0200, Oleg Nesterov wrote:
> > On 08/02, Oleg Nesterov wrote:
> > >
> > > So Adrian, sorry for confusion, I think your patch is fine.

Good to know.

> > Yes... but do we really need the new CLONE_SET_TID ?
> > 
> > set_tid == 0 has no effect, can't we simply check kargs->set_tid != 0
> > before ns_capable() ?
> 
> Yeah, I agree that sounds much better and aligns with exit_signal.

Let me remove CLONE_SET_TID from the patch and I will try out
idr_is_empty().

I will also address Dmitry's comment about accessing smaller parameter
structs.

Adrian


Re: [PATCH v2 1/2] fork: extend clone3() to support CLONE_SET_TID

2019-08-02 Thread Adrian Reber
On Fri, Aug 02, 2019 at 03:50:54PM +0200, Christian Brauner wrote:
> On Fri, Aug 02, 2019 at 03:30:01PM +0200, Oleg Nesterov wrote:
> > On 08/02, Christian Brauner wrote:
> > >
> > > On Wed, Jul 31, 2019 at 06:12:22PM +0200, Adrian Reber wrote:
> > > > The main motivation to add CLONE_SET_TID to clone3() is CRIU.
> > > >
> > > > To restore a process with the same PID/TID CRIU currently uses
> > > > /proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
> > > > ns_last_pid and then (quickly) does a clone(). This works most of the
> > > > time, but it is racy. It is also slow as it requires multiple syscalls.
> > >
> > > Can you elaborate how this is racy, please. Afaict, CRIU will always
> > > usually restore in a new pid namespace that it controls, right?
> > 
> > Why? No. For example you can checkpoint (not sure this is correct word)
> > a single process in your namespace, then (try to restore) it. 
> > 
> > > What is
> > > the exact race?
> > 
> > something else in the same namespace can fork() right after criu writes
> > the pid-for-restore into ns_last_pid.
> 
> Ok, that makes sense. :)
> My CRIU userspace knowledge is sporadic, so I'm not sure how exactly it
> restores process trees in pid namespaces and what workloads this would
> especially help with.

Just what Oleg said. CRIU can restore processes in a new PID namespaces
or in an existing. To restore a process into an existing PID namespace
has the possibility of a PID collision, but if the PID is not yet in use
there is no limitation from CRIU's side.

Restoring into an existing PID namespace which is used by other
processes always has the possibility that between writing to
/proc/sys/kernel/ns_last_pid and clone() something else has fork()'d and
therefore it is racy.

Adrian


Re: [PATCH v2 1/2] fork: extend clone3() to support CLONE_SET_TID

2019-08-02 Thread Adrian Reber
On Wed, Jul 31, 2019 at 07:41:36PM +0200, Oleg Nesterov wrote:
> On 07/31, Adrian Reber wrote:
> >
> > Extending clone3() to support CLONE_SET_TID makes it possible restore a
> > process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
> > race free (as long as the desired PID/TID is available).
> 
> I personally like this... but please see the question below.
> 
> > +struct pid *alloc_pid(struct pid_namespace *ns, int set_tid)
> >  {
> > struct pid *pid;
> > enum pid_type type;
> > @@ -186,12 +186,28 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> > if (idr_get_cursor(>idr) > RESERVED_PIDS)
> > pid_min = RESERVED_PIDS;
> >  
> > -   /*
> > -* Store a null pointer so find_pid_ns does not find
> > -* a partially initialized PID (see below).
> > -*/
> > -   nr = idr_alloc_cyclic(>idr, NULL, pid_min,
> > - pid_max, GFP_ATOMIC);
> > +   if (set_tid) {
> > +   /*
> > +* Also fail if a PID != 1 is requested
> > +* and no PID 1 exists.
> > +*/
> > +   if ((set_tid >= pid_max) || ((set_tid != 1) &&
> > +   (idr_get_cursor(>idr) <= 1)))
>  ^^
> 
> Ah, I forgot to mention... this should work but only because
> RESERVED_PIDS > 0. How about idr_is_empty() ?
> 
> 
> But the main question is how it can really help if ns->level > 0, unlikely
> CRIU will ever need to clone the process with the same pid_nr == set_tid
> in the ns->parent chain.

Not sure I understand what you mean. For CRIU only the PID in the PID
namespace is relevant.

> So may be kernel_clone_args->set_tid should be pid_t __user *set_tid_array?
> Or I missed something ?

Not sure why and how an array would be needed. Could you give me some
more details why you think this is needed.

Adrian


[PATCH v2 1/2] fork: extend clone3() to support CLONE_SET_TID

2019-07-31 Thread Adrian Reber
The main motivation to add CLONE_SET_TID to clone3() is CRIU.

To restore a process with the same PID/TID CRIU currently uses
/proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
ns_last_pid and then (quickly) does a clone(). This works most of the
time, but it is racy. It is also slow as it requires multiple syscalls.

Extending clone3() to support CLONE_SET_TID makes it possible restore a
process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
race free (as long as the desired PID/TID is available).

This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
on clone3() with set_tid as they are currently in place for ns_last_pid.

v2:
 - Removed (size < sizeof(struct clone_args)) as discussed with
   Christian and Dmitry
 - Added comment to ((set_tid != 1) && idr_get_cursor() <= 1) (Oleg)
 - Use idr_alloc() instead of idr_alloc_cyclic() (Oleg)

Signed-off-by: Adrian Reber 
---
 include/linux/pid.h|  2 +-
 include/linux/sched/task.h |  1 +
 include/uapi/linux/sched.h |  2 ++
 kernel/fork.c  | 25 -
 kernel/pid.c   | 30 +++---
 5 files changed, 43 insertions(+), 17 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 2a83e434db9d..052000db0ced 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -116,7 +116,7 @@ extern struct pid *find_vpid(int nr);
 extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
 
-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t set_tid);
 extern void free_pid(struct pid *pid);
 extern void disable_pid_allocation(struct pid_namespace *ns);
 
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 0497091e40c1..4f2a80564332 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -26,6 +26,7 @@ struct kernel_clone_args {
unsigned long stack;
unsigned long stack_size;
unsigned long tls;
+   pid_t set_tid;
 };
 
 /*
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index b3105ac1381a..8c4e803e8147 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -32,6 +32,7 @@
 #define CLONE_NEWPID   0x2000  /* New pid namespace */
 #define CLONE_NEWNET   0x4000  /* New network namespace */
 #define CLONE_IO   0x8000  /* Clone io context */
+#define CLONE_SET_TID  0x1ULL  /* set if the desired TID is 
set in set_tid */
 
 /*
  * Arguments for the clone3 syscall
@@ -45,6 +46,7 @@ struct clone_args {
__aligned_u64 stack;
__aligned_u64 stack_size;
__aligned_u64 tls;
+   __aligned_u64 set_tid;
 };
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index 2852d0e76ea3..405f98ce4c83 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2031,7 +2031,7 @@ static __latent_entropy struct task_struct *copy_process(
stackleak_task_init(p);
 
if (pid != _struct_pid) {
-   pid = alloc_pid(p->nsproxy->pid_ns_for_children);
+   pid = alloc_pid(p->nsproxy->pid_ns_for_children, args->set_tid);
if (IS_ERR(pid)) {
retval = PTR_ERR(pid);
goto bad_fork_cleanup_thread;
@@ -2530,14 +2530,12 @@ noinline static int copy_clone_args_from_user(struct 
kernel_clone_args *kargs,
  struct clone_args __user *uargs,
  size_t size)
 {
+   struct pid_namespace *pid_ns = task_active_pid_ns(current);
struct clone_args args;
 
if (unlikely(size > PAGE_SIZE))
return -E2BIG;
 
-   if (unlikely(size < sizeof(struct clone_args)))
-   return -EINVAL;
-
if (unlikely(!access_ok(uargs, size)))
return -EFAULT;
 
@@ -2562,6 +2560,9 @@ noinline static int copy_clone_args_from_user(struct 
kernel_clone_args *kargs,
if (copy_from_user(, uargs, size))
return -EFAULT;
 
+   if ((args.flags & CLONE_SET_TID) && !ns_capable(pid_ns->user_ns, 
CAP_SYS_ADMIN))
+   return -EPERM;
+
*kargs = (struct kernel_clone_args){
.flags  = args.flags,
.pidfd  = u64_to_user_ptr(args.pidfd),
@@ -2571,6 +2572,7 @@ noinline static int copy_clone_args_from_user(struct 
kernel_clone_args *kargs,
.stack  = args.stack,
.stack_size = args.stack_size,
.tls= args.tls,
+   .set_tid= args.set_tid,
};
 
return 0;
@@ -2578,11 +2580,16 @@ noinline static int copy_clone_args_from_user(struct 
kernel_clone_args *kargs,
 
 static bool clone3_args_valid(const struct kernel_clone_args *kargs)
 {
-   /*
- 

[PATCH v2 2/2] selftests: add test for clone3() with set_tid

2019-07-31 Thread Adrian Reber
This tests clone3() with set_tid to see if all desired PIDs are working
as expected. The test tries to clone3() with a set_tid of -1, 1,
pid_max, a PID which is already in use and an unused PID. The same
tests are also running in PID namespace.

Signed-off-by: Adrian Reber 
---
 tools/testing/selftests/clone3/.gitignore |   1 +
 tools/testing/selftests/clone3/Makefile   |  11 ++
 .../testing/selftests/clone3/clone3_set_tid.c | 148 ++
 3 files changed, 160 insertions(+)
 create mode 100644 tools/testing/selftests/clone3/.gitignore
 create mode 100644 tools/testing/selftests/clone3/Makefile
 create mode 100644 tools/testing/selftests/clone3/clone3_set_tid.c

diff --git a/tools/testing/selftests/clone3/.gitignore 
b/tools/testing/selftests/clone3/.gitignore
new file mode 100644
index ..09ccea33016c
--- /dev/null
+++ b/tools/testing/selftests/clone3/.gitignore
@@ -0,0 +1 @@
+clone3_set_tid
diff --git a/tools/testing/selftests/clone3/Makefile 
b/tools/testing/selftests/clone3/Makefile
new file mode 100644
index ..45c77b50f367
--- /dev/null
+++ b/tools/testing/selftests/clone3/Makefile
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0
+uname_M := $(shell uname -m 2>/dev/null || echo not)
+ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/)
+
+CFLAGS += -I../../../../usr/include/
+
+ifeq ($(ARCH),x86_64)
+   TEST_GEN_PROGS := clone3_set_tid
+endif
+
+include ../lib.mk
diff --git a/tools/testing/selftests/clone3/clone3_set_tid.c 
b/tools/testing/selftests/clone3/clone3_set_tid.c
new file mode 100644
index ..1ed0845aa4c5
--- /dev/null
+++ b/tools/testing/selftests/clone3/clone3_set_tid.c
@@ -0,0 +1,148 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/* Based on Christian Brauner's clone3() example */
+
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "../kselftest.h"
+
+static pid_t raw_clone(struct clone_args *args)
+{
+   return syscall(__NR_clone3, args, sizeof(struct clone_args));
+}
+
+static int call_clone3_set_tid(int set_tid, int flags)
+{
+   struct clone_args args = {0};
+   pid_t ppid = -1;
+   pid_t pid = -1;
+   int status;
+
+   args.flags = flags | CLONE_SET_TID;
+   args.exit_signal = SIGCHLD;
+   args.set_tid = set_tid;
+
+   pid = raw_clone();
+   if (pid < 0) {
+   ksft_print_msg("%s - Failed to create new process\n",
+   strerror(errno));
+   return -errno;
+   }
+
+   if (pid == 0) {
+   ksft_print_msg("I am the child, my PID is %d\n", getpid());
+   if (set_tid != getpid())
+   _exit(EXIT_FAILURE);
+   _exit(EXIT_SUCCESS);
+   }
+
+   ppid = getpid();
+   ksft_print_msg("I am the parent (%d). My child's pid is %d\n",
+   ppid, pid);
+
+   (void)wait();
+   if (WEXITSTATUS(status))
+   return WEXITSTATUS(status);
+
+   return 0;
+}
+
+static int test_clone3_set_tid(int set_tid, int flags, int expected)
+{
+   int ret;
+   ksft_print_msg("[%d] Trying clone3() with CLONE_SET_TID to %d "
+   "and 0x%x\n", getpid(), set_tid, flags);
+   ret = call_clone3_set_tid(set_tid, flags);
+   ksft_print_msg("[%d] clone3() with CLONE_SET_TID %d says :%d "
+   "- expected %d\n", getpid(), set_tid, ret, expected);
+   if (ret != expected)
+   ksft_exit_fail_msg("[%d] Result (%d) is different than "
+   "expected (%d)\n", getpid(), ret, expected);
+   ksft_test_result_pass("[%d] Result (%d) matches expectation (%d)\n",
+   getpid(), ret, expected);
+   return 0;
+}
+int main(int argc, char *argv[])
+{
+   FILE *f;
+   int pid_max = 0;
+   pid_t pid;
+   pid_t ns_pid;
+   int ret = -1;
+
+   ksft_print_header();
+   ksft_set_plan(10);
+
+   f = fopen("/proc/sys/kernel/pid_max", "r");
+   if (f == NULL)
+   ksft_exit_fail_msg("%s - Could not open 
/proc/sys/kernel/pid_max\n",
+   strerror(errno));
+   fscanf(f, "%d", _max);
+   fclose(f);
+   ksft_print_msg("/proc/sys/kernel/pid_max %d\n", pid_max);
+
+   /* First try with an invalid PID */
+   if (test_clone3_set_tid(-1, 0, -EINVAL))
+   goto on_error;
+   if (test_clone3_set_tid(-1, CLONE_NEWPID, -EINVAL))
+   goto on_error;
+   /* Then with PID 1 */
+   if (test_clone3_set_tid(1, 0, -EAGAIN))
+   goto on_error;
+   /* PID 1 should not fail in a PID namespace */
+   if (test_clone3_set_tid(1, CLONE_NEWPID, 0))
+   goto on_er

[PATCH 2/2] selftests: add test for clone3() with set_tid

2019-07-29 Thread Adrian Reber
This tests clone3() with set_tid to see if all desired PIDs are working
as expected. The test tries to clone3() with a set_tid of -1, 1,
pid_max, a PID which is already in use and an unused PID. The same
tests are also running in PID namespace.

Signed-off-by: Adrian Reber 
---
 tools/testing/selftests/clone3/.gitignore |   1 +
 tools/testing/selftests/clone3/Makefile   |  11 ++
 .../testing/selftests/clone3/clone3_set_tid.c | 148 ++
 3 files changed, 160 insertions(+)
 create mode 100644 tools/testing/selftests/clone3/.gitignore
 create mode 100644 tools/testing/selftests/clone3/Makefile
 create mode 100644 tools/testing/selftests/clone3/clone3_set_tid.c

diff --git a/tools/testing/selftests/clone3/.gitignore 
b/tools/testing/selftests/clone3/.gitignore
new file mode 100644
index ..09ccea33016c
--- /dev/null
+++ b/tools/testing/selftests/clone3/.gitignore
@@ -0,0 +1 @@
+clone3_set_tid
diff --git a/tools/testing/selftests/clone3/Makefile 
b/tools/testing/selftests/clone3/Makefile
new file mode 100644
index ..45c77b50f367
--- /dev/null
+++ b/tools/testing/selftests/clone3/Makefile
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0
+uname_M := $(shell uname -m 2>/dev/null || echo not)
+ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/)
+
+CFLAGS += -I../../../../usr/include/
+
+ifeq ($(ARCH),x86_64)
+   TEST_GEN_PROGS := clone3_set_tid
+endif
+
+include ../lib.mk
diff --git a/tools/testing/selftests/clone3/clone3_set_tid.c 
b/tools/testing/selftests/clone3/clone3_set_tid.c
new file mode 100644
index ..1ed0845aa4c5
--- /dev/null
+++ b/tools/testing/selftests/clone3/clone3_set_tid.c
@@ -0,0 +1,148 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/* Based on Christian Brauner's clone3() example */
+
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "../kselftest.h"
+
+static pid_t raw_clone(struct clone_args *args)
+{
+   return syscall(__NR_clone3, args, sizeof(struct clone_args));
+}
+
+static int call_clone3_set_tid(int set_tid, int flags)
+{
+   struct clone_args args = {0};
+   pid_t ppid = -1;
+   pid_t pid = -1;
+   int status;
+
+   args.flags = flags | CLONE_SET_TID;
+   args.exit_signal = SIGCHLD;
+   args.set_tid = set_tid;
+
+   pid = raw_clone();
+   if (pid < 0) {
+   ksft_print_msg("%s - Failed to create new process\n",
+   strerror(errno));
+   return -errno;
+   }
+
+   if (pid == 0) {
+   ksft_print_msg("I am the child, my PID is %d\n", getpid());
+   if (set_tid != getpid())
+   _exit(EXIT_FAILURE);
+   _exit(EXIT_SUCCESS);
+   }
+
+   ppid = getpid();
+   ksft_print_msg("I am the parent (%d). My child's pid is %d\n",
+   ppid, pid);
+
+   (void)wait();
+   if (WEXITSTATUS(status))
+   return WEXITSTATUS(status);
+
+   return 0;
+}
+
+static int test_clone3_set_tid(int set_tid, int flags, int expected)
+{
+   int ret;
+   ksft_print_msg("[%d] Trying clone3() with CLONE_SET_TID to %d "
+   "and 0x%x\n", getpid(), set_tid, flags);
+   ret = call_clone3_set_tid(set_tid, flags);
+   ksft_print_msg("[%d] clone3() with CLONE_SET_TID %d says :%d "
+   "- expected %d\n", getpid(), set_tid, ret, expected);
+   if (ret != expected)
+   ksft_exit_fail_msg("[%d] Result (%d) is different than "
+   "expected (%d)\n", getpid(), ret, expected);
+   ksft_test_result_pass("[%d] Result (%d) matches expectation (%d)\n",
+   getpid(), ret, expected);
+   return 0;
+}
+int main(int argc, char *argv[])
+{
+   FILE *f;
+   int pid_max = 0;
+   pid_t pid;
+   pid_t ns_pid;
+   int ret = -1;
+
+   ksft_print_header();
+   ksft_set_plan(10);
+
+   f = fopen("/proc/sys/kernel/pid_max", "r");
+   if (f == NULL)
+   ksft_exit_fail_msg("%s - Could not open 
/proc/sys/kernel/pid_max\n",
+   strerror(errno));
+   fscanf(f, "%d", _max);
+   fclose(f);
+   ksft_print_msg("/proc/sys/kernel/pid_max %d\n", pid_max);
+
+   /* First try with an invalid PID */
+   if (test_clone3_set_tid(-1, 0, -EINVAL))
+   goto on_error;
+   if (test_clone3_set_tid(-1, CLONE_NEWPID, -EINVAL))
+   goto on_error;
+   /* Then with PID 1 */
+   if (test_clone3_set_tid(1, 0, -EAGAIN))
+   goto on_error;
+   /* PID 1 should not fail in a PID namespace */
+   if (test_clone3_set_tid(1, CLONE_NEWPID, 0))
+   goto on_er

[PATCH 1/2] fork: extend clone3() to support CLONE_SET_TID

2019-07-29 Thread Adrian Reber
The main motivation to add CLONE_SET_TID to clone3() is CRIU.

To restore a process with the same PID/TID CRIU currently uses
/proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
ns_last_pid and then (quickly) does a clone(). This works most of the
time, but it is racy. It is also slow as it requires multiple syscalls.

Extending clone3() to support CLONE_SET_TID makes it possible restore a
process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
race free (as long as the desired PID/TID is available).

This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
on clone3() with set_tid as they are currently in place for ns_last_pid.

Signed-off-by: Adrian Reber 
---
 include/linux/pid.h|  2 +-
 include/linux/sched/task.h |  1 +
 include/uapi/linux/sched.h |  2 ++
 kernel/fork.c  | 22 --
 kernel/pid.c   | 21 ++---
 5 files changed, 38 insertions(+), 10 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 2a83e434db9d..052000db0ced 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -116,7 +116,7 @@ extern struct pid *find_vpid(int nr);
 extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
 
-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t set_tid);
 extern void free_pid(struct pid *pid);
 extern void disable_pid_allocation(struct pid_namespace *ns);
 
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 0497091e40c1..4f2a80564332 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -26,6 +26,7 @@ struct kernel_clone_args {
unsigned long stack;
unsigned long stack_size;
unsigned long tls;
+   pid_t set_tid;
 };
 
 /*
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index b3105ac1381a..8c4e803e8147 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -32,6 +32,7 @@
 #define CLONE_NEWPID   0x2000  /* New pid namespace */
 #define CLONE_NEWNET   0x4000  /* New network namespace */
 #define CLONE_IO   0x8000  /* Clone io context */
+#define CLONE_SET_TID  0x1ULL  /* set if the desired TID is 
set in set_tid */
 
 /*
  * Arguments for the clone3 syscall
@@ -45,6 +46,7 @@ struct clone_args {
__aligned_u64 stack;
__aligned_u64 stack_size;
__aligned_u64 tls;
+   __aligned_u64 set_tid;
 };
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index 2852d0e76ea3..8c59441ac153 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2031,7 +2031,7 @@ static __latent_entropy struct task_struct *copy_process(
stackleak_task_init(p);
 
if (pid != _struct_pid) {
-   pid = alloc_pid(p->nsproxy->pid_ns_for_children);
+   pid = alloc_pid(p->nsproxy->pid_ns_for_children, args->set_tid);
if (IS_ERR(pid)) {
retval = PTR_ERR(pid);
goto bad_fork_cleanup_thread;
@@ -2530,6 +2530,7 @@ noinline static int copy_clone_args_from_user(struct 
kernel_clone_args *kargs,
  struct clone_args __user *uargs,
  size_t size)
 {
+   struct pid_namespace *pid_ns = task_active_pid_ns(current);
struct clone_args args;
 
if (unlikely(size > PAGE_SIZE))
@@ -2562,6 +2563,9 @@ noinline static int copy_clone_args_from_user(struct 
kernel_clone_args *kargs,
if (copy_from_user(, uargs, size))
return -EFAULT;
 
+   if ((args.flags & CLONE_SET_TID) && !ns_capable(pid_ns->user_ns, 
CAP_SYS_ADMIN))
+   return -EPERM;
+
*kargs = (struct kernel_clone_args){
.flags  = args.flags,
.pidfd  = u64_to_user_ptr(args.pidfd),
@@ -2571,6 +2575,7 @@ noinline static int copy_clone_args_from_user(struct 
kernel_clone_args *kargs,
.stack  = args.stack,
.stack_size = args.stack_size,
.tls= args.tls,
+   .set_tid= args.set_tid,
};
 
return 0;
@@ -2578,11 +2583,16 @@ noinline static int copy_clone_args_from_user(struct 
kernel_clone_args *kargs,
 
 static bool clone3_args_valid(const struct kernel_clone_args *kargs)
 {
-   /*
-* All lower bits of the flag word are taken.
-* Verify that no other unknown flags are passed along.
-*/
-   if (kargs->flags & ~CLONE_LEGACY_FLAGS)
+   /* Verify that no other unknown flags are passed along. */
+   if (kargs->flags & ~(CLONE_LEGACY_FLAGS | CLONE_SET_TID))
+   return false;
+
+   /* Fail if set_tid is set without CLONE_SET_TID */
+   if (kargs->set_tid && !(k

Re: [PATCH] kconfig: remove EXPERT from CHECKPOINT_RESTORE

2018-07-13 Thread Adrian Reber
On Fri, Jul 13, 2018 at 08:46:25AM -0500, Eric W. Biederman wrote:
> Pavel Emelyanov  writes:
> 
> > On 07/12/2018 07:33 PM, Eric W. Biederman wrote:
> >> 
> >> Adrian Reber  writes:
> >> 
> >>> The CHECKPOINT_RESTORE configuration option was introduced in 2012 and
> >>> combined with EXPERT. CHECKPOINT_RESTORE is already enabled in many
> >>> distribution kernels and also part of the defconfigs of various
> >>> architectures.
> >>>
> >>> To make it easier for distributions to enable CHECKPOINT_RESTORE this
> >>> removes EXPERT and moves the configuration option out of the EXPERT
> >>> block.
> >> 
> >> I think we should change the help text at the same time, to match
> >> our improve understanding of the situation.
> >> 
> >> Does anyone remember why this option was added at all?
> >
> > Sure! Quoting Andrew's ~7 years ago akpm branch merge e-mail:
> >
> >However I'm less confident than the developers that it will all
> >eventually work! So what I'm asking them to do is to wrap each piece
> >of new code inside CONFIG_CHECKPOINT_RESTORE.  So if it all
> >eventually comes to tears and the project as a whole fails, it should
> >be a simple matter to go through and delete all trace of it.
> >
> > the best link with full e-mail I googled for is
> > https://gitlab.imag.fr/kaunetem/linux-kaunetem/commit/099469502f62fbe0d7e4f0b83a2f22538367f734
> 
> Good explanation.  Thank you.
> 
> At this point we even have not CRIU users of some of the pieces.
> The project as a whole has not failed.
> 
> The code is old enough an common enough (enabled in some distros) that
> we need to do the whole watch out for regressions if we remove any part
> of it.
> 
> Which is a long way of saying the original justifiction for
> CONFIG_CHECKPOINT_RESTORE is gone.  So please let's remove the entire
> config option and simplify everyone's lives who has to test this stuff.

Sounds good.

> Unless someone can come up with a justification for keeping some of this
> behind a config option.

I can provide a patch removing CONFIG_CHECKPOINT_RESTORE if there are no
further objections against it.

Adrian


Re: [PATCH] kconfig: remove EXPERT from CHECKPOINT_RESTORE

2018-07-13 Thread Adrian Reber
On Fri, Jul 13, 2018 at 08:46:25AM -0500, Eric W. Biederman wrote:
> Pavel Emelyanov  writes:
> 
> > On 07/12/2018 07:33 PM, Eric W. Biederman wrote:
> >> 
> >> Adrian Reber  writes:
> >> 
> >>> The CHECKPOINT_RESTORE configuration option was introduced in 2012 and
> >>> combined with EXPERT. CHECKPOINT_RESTORE is already enabled in many
> >>> distribution kernels and also part of the defconfigs of various
> >>> architectures.
> >>>
> >>> To make it easier for distributions to enable CHECKPOINT_RESTORE this
> >>> removes EXPERT and moves the configuration option out of the EXPERT
> >>> block.
> >> 
> >> I think we should change the help text at the same time, to match
> >> our improve understanding of the situation.
> >> 
> >> Does anyone remember why this option was added at all?
> >
> > Sure! Quoting Andrew's ~7 years ago akpm branch merge e-mail:
> >
> >However I'm less confident than the developers that it will all
> >eventually work! So what I'm asking them to do is to wrap each piece
> >of new code inside CONFIG_CHECKPOINT_RESTORE.  So if it all
> >eventually comes to tears and the project as a whole fails, it should
> >be a simple matter to go through and delete all trace of it.
> >
> > the best link with full e-mail I googled for is
> > https://gitlab.imag.fr/kaunetem/linux-kaunetem/commit/099469502f62fbe0d7e4f0b83a2f22538367f734
> 
> Good explanation.  Thank you.
> 
> At this point we even have not CRIU users of some of the pieces.
> The project as a whole has not failed.
> 
> The code is old enough an common enough (enabled in some distros) that
> we need to do the whole watch out for regressions if we remove any part
> of it.
> 
> Which is a long way of saying the original justifiction for
> CONFIG_CHECKPOINT_RESTORE is gone.  So please let's remove the entire
> config option and simplify everyone's lives who has to test this stuff.

Sounds good.

> Unless someone can come up with a justification for keeping some of this
> behind a config option.

I can provide a patch removing CONFIG_CHECKPOINT_RESTORE if there are no
further objections against it.

Adrian


[PATCH] kconfig: remove EXPERT from CHECKPOINT_RESTORE

2018-07-12 Thread Adrian Reber
The CHECKPOINT_RESTORE configuration option was introduced in 2012 and
combined with EXPERT. CHECKPOINT_RESTORE is already enabled in many
distribution kernels and also part of the defconfigs of various
architectures.

To make it easier for distributions to enable CHECKPOINT_RESTORE this
removes EXPERT and moves the configuration option out of the EXPERT
block.

Signed-off-by: Adrian Reber 
Cc: Oleg Nesterov 
Cc: Pavel Emelyanov 
Cc: Andrew Morton 
Cc: Eric W. Biederman 
Cc: Andrei Vagin 
Cc: Hendrik Brueckner 
---
 init/Kconfig | 24 
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index 041f3a022122..9c529c763326 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -932,6 +932,18 @@ config NET_NS
 
 endif # NAMESPACES
 
+config CHECKPOINT_RESTORE
+   bool "Checkpoint/restore support"
+   select PROC_CHILDREN
+   default n
+   help
+ Enables additional kernel features in a sake of checkpoint/restore.
+ In particular it adds auxiliary prctl codes to setup process text,
+ data and heap segment sizes, and a few additional /proc filesystem
+ entries.
+
+ If unsure, say N here.
+
 config SCHED_AUTOGROUP
bool "Automatic process group scheduling"
select CGROUPS
@@ -1348,18 +1360,6 @@ config MEMBARRIER
 
  If unsure, say Y.
 
-config CHECKPOINT_RESTORE
-   bool "Checkpoint/restore support" if EXPERT
-   select PROC_CHILDREN
-   default n
-   help
- Enables additional kernel features in a sake of checkpoint/restore.
- In particular it adds auxiliary prctl codes to setup process text,
- data and heap segment sizes, and a few additional /proc filesystem
- entries.
-
- If unsure, say N here.
-
 config KALLSYMS
 bool "Load all symbols for debugging/ksymoops" if EXPERT
 default y
-- 
2.17.1



[PATCH] kconfig: remove EXPERT from CHECKPOINT_RESTORE

2018-07-12 Thread Adrian Reber
The CHECKPOINT_RESTORE configuration option was introduced in 2012 and
combined with EXPERT. CHECKPOINT_RESTORE is already enabled in many
distribution kernels and also part of the defconfigs of various
architectures.

To make it easier for distributions to enable CHECKPOINT_RESTORE this
removes EXPERT and moves the configuration option out of the EXPERT
block.

Signed-off-by: Adrian Reber 
Cc: Oleg Nesterov 
Cc: Pavel Emelyanov 
Cc: Andrew Morton 
Cc: Eric W. Biederman 
Cc: Andrei Vagin 
Cc: Hendrik Brueckner 
---
 init/Kconfig | 24 
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index 041f3a022122..9c529c763326 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -932,6 +932,18 @@ config NET_NS
 
 endif # NAMESPACES
 
+config CHECKPOINT_RESTORE
+   bool "Checkpoint/restore support"
+   select PROC_CHILDREN
+   default n
+   help
+ Enables additional kernel features in a sake of checkpoint/restore.
+ In particular it adds auxiliary prctl codes to setup process text,
+ data and heap segment sizes, and a few additional /proc filesystem
+ entries.
+
+ If unsure, say N here.
+
 config SCHED_AUTOGROUP
bool "Automatic process group scheduling"
select CGROUPS
@@ -1348,18 +1360,6 @@ config MEMBARRIER
 
  If unsure, say Y.
 
-config CHECKPOINT_RESTORE
-   bool "Checkpoint/restore support" if EXPERT
-   select PROC_CHILDREN
-   default n
-   help
- Enables additional kernel features in a sake of checkpoint/restore.
- In particular it adds auxiliary prctl codes to setup process text,
- data and heap segment sizes, and a few additional /proc filesystem
- entries.
-
- If unsure, say N here.
-
 config KALLSYMS
 bool "Load all symbols for debugging/ksymoops" if EXPERT
 default y
-- 
2.17.1



Re: [PATCH 7/7] aio: implement io_pgetevents

2018-07-04 Thread Adrian Reber
On Wed, May 02, 2018 at 11:14:48PM +0200, Christoph Hellwig wrote:
> This is the io_getevents equivalent of ppoll/pselect and allows to
> properly mix signals and aio completions (especially with IOCB_CMD_POLL)
> and atomically executes the following sequence:
> 
>   sigset_t origmask;
> 
>   pthread_sigmask(SIG_SETMASK, , );
>   ret = io_getevents(ctx, min_nr, nr, events, timeout);
>   pthread_sigmask(SIG_SETMASK, , NULL);
> 
> Note that unlike many other signal related calls we do not pass a sigmask
> size, as that would get us to 7 arguments, which aren't easily supported
> by the syscall infrastructure.  It seems a lot less painful to just add a
> new syscall variant in the unlikely case we're going to increase the
> sigset size.

Starting with this commit following code does not compile for me
anymore:

#include 
#include 

int main()
{
return 0;
}

In file included from /usr/include/linux/signal.h:5,
 from /usr/include/linux/aio_abi.h:32,
 from include.c:2:
/usr/include/asm/signal.h:16:23: error: conflicting types for ‘sigset_t’
 typedef unsigned long sigset_t;
   ^~~~
In file included from /usr/include/signal.h:35,
 from include.c:1:
/usr/include/bits/types/sigset_t.h:7:20: note: previous declaration of 
‘sigset_t’ was here
 typedef __sigset_t sigset_t;
^~~~
In file included from /usr/include/linux/signal.h:5,
 from /usr/include/linux/aio_abi.h:32,
 from include.c:2:
/usr/include/asm/signal.h:115:8: error: redefinition of ‘struct sigaction’
 struct sigaction {
^
In file included from /usr/include/signal.h:226,
 from include.c:1:
/usr/include/bits/sigaction.h:27:8: note: originally defined here
 struct sigaction
^
[and much more]

Before this commit it compiles without errors.

Adrian


Re: [PATCH 7/7] aio: implement io_pgetevents

2018-07-04 Thread Adrian Reber
On Wed, May 02, 2018 at 11:14:48PM +0200, Christoph Hellwig wrote:
> This is the io_getevents equivalent of ppoll/pselect and allows to
> properly mix signals and aio completions (especially with IOCB_CMD_POLL)
> and atomically executes the following sequence:
> 
>   sigset_t origmask;
> 
>   pthread_sigmask(SIG_SETMASK, , );
>   ret = io_getevents(ctx, min_nr, nr, events, timeout);
>   pthread_sigmask(SIG_SETMASK, , NULL);
> 
> Note that unlike many other signal related calls we do not pass a sigmask
> size, as that would get us to 7 arguments, which aren't easily supported
> by the syscall infrastructure.  It seems a lot less painful to just add a
> new syscall variant in the unlikely case we're going to increase the
> sigset size.

Starting with this commit following code does not compile for me
anymore:

#include 
#include 

int main()
{
return 0;
}

In file included from /usr/include/linux/signal.h:5,
 from /usr/include/linux/aio_abi.h:32,
 from include.c:2:
/usr/include/asm/signal.h:16:23: error: conflicting types for ‘sigset_t’
 typedef unsigned long sigset_t;
   ^~~~
In file included from /usr/include/signal.h:35,
 from include.c:1:
/usr/include/bits/types/sigset_t.h:7:20: note: previous declaration of 
‘sigset_t’ was here
 typedef __sigset_t sigset_t;
^~~~
In file included from /usr/include/linux/signal.h:5,
 from /usr/include/linux/aio_abi.h:32,
 from include.c:2:
/usr/include/asm/signal.h:115:8: error: redefinition of ‘struct sigaction’
 struct sigaction {
^
In file included from /usr/include/signal.h:226,
 from include.c:1:
/usr/include/bits/sigaction.h:27:8: note: originally defined here
 struct sigaction
^
[and much more]

Before this commit it compiles without errors.

Adrian


Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-08 Thread Adrian Reber
On Thu, Feb 08, 2007 at 06:21:56PM +0100, Arnd Bergmann wrote:
[...]
> Moving the sample rate computation to user space sounds like the right
> idea, but why not have a more drastic version of it:
> 
> Right now, all products that support this feature run at the same clock
> rate (3.2 Ghz), with cpufreq, we can reduce this to 1.6 Ghz. If I understand
> this correctly, the value depends only on the frequency, so we could simply
> hardcode this in the kernel, and print out a warning message if we ever
> encounter a different frequency, right?

Just for the record... CAB is running with 2.8 GHz. At least all the boards
I have seen.

Adrian
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-08 Thread Adrian Reber
On Thu, Feb 08, 2007 at 06:21:56PM +0100, Arnd Bergmann wrote:
[...]
 Moving the sample rate computation to user space sounds like the right
 idea, but why not have a more drastic version of it:
 
 Right now, all products that support this feature run at the same clock
 rate (3.2 Ghz), with cpufreq, we can reduce this to 1.6 Ghz. If I understand
 this correctly, the value depends only on the frequency, so we could simply
 hardcode this in the kernel, and print out a warning message if we ever
 encounter a different frequency, right?

Just for the record... CAB is running with 2.8 GHz. At least all the boards
I have seen.

Adrian
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


2.4.0-test9 load problem

2000-10-04 Thread Adrian Reber


Hi

I have a Dual Pentium III 500 Mhz machine.
It has 1GB of memory, an internal amimegaraid controller, two additional
scsi ultra-wide controller sym53c875E (using sym53c8xx.o) and acenic
gigabit ethernet card. I'm patching every kernel to use vlan (802.1Q) and
reiserfs. Currently a 2.4.0-test4 kernel i running on it and it is really
stable. uptime 2 months. But no support for the 1GB of memory. Just the
~900 MB.

I decided to use the newest patch for reiserfs and vlan and the newest
kernel. 2.4.0-test8 and 2.4.0-test9. Both kernels had the 4GB support
enabled. They boot fine but the load keeps rising.  Sendmail didn't start
anymore and the ftpserver (proftpd), though connections were established,
didn't let you transfer anything. There was no real load on the computer
but the load kept rising until it reached 150, which was the point where i
rebooted the thing an switched back to 2.4.0-test4.



Adrian


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



2.4.0-test9 load problem

2000-10-04 Thread Adrian Reber


Hi

I have a Dual Pentium III 500 Mhz machine.
It has 1GB of memory, an internal amimegaraid controller, two additional
scsi ultra-wide controller sym53c875E (using sym53c8xx.o) and acenic
gigabit ethernet card. I'm patching every kernel to use vlan (802.1Q) and
reiserfs. Currently a 2.4.0-test4 kernel i running on it and it is really
stable. uptime 2 months. But no support for the 1GB of memory. Just the
~900 MB.

I decided to use the newest patch for reiserfs and vlan and the newest
kernel. 2.4.0-test8 and 2.4.0-test9. Both kernels had the 4GB support
enabled. They boot fine but the load keeps rising.  Sendmail didn't start
anymore and the ftpserver (proftpd), though connections were established,
didn't let you transfer anything. There was no real load on the computer
but the load kept rising until it reached 150, which was the point where i
rebooted the thing an switched back to 2.4.0-test4.



Adrian


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/