Re: seccomp: Delay filter activation

2021-03-18 Thread Sargun Dhillon

On Thu, Mar 18, 2021 at 03:54:54PM +0100, Christian Brauner wrote:
> Sorry, I just found that mail.
> 
> On Mon, Mar 01, 2021 at 03:44:06PM -0800, Kees Cook wrote:
> > On Mon, Mar 01, 2021 at 02:21:56PM +0100, Christian Brauner wrote:
> > > On Mon, Mar 01, 2021 at 12:09:09PM +0100, Christian Brauner wrote:
> > > > On Sat, Feb 20, 2021 at 01:31:57AM -0800, Sargun Dhillon wrote:
> > > > > We've run into a problem where attaching a filter can be quite messy
> > > > > business because the filter itself intercepts sendmsg, and other
> > > > > syscalls related to exfiltrating the listener FD. I believe that this
> > > > > problem set has been brought up before, and although there are
> > > > > "simpler" methods of exfiltrating the listener, like clone3 or
> > > > > pidfd_getfd, these are still less than ideal.
> > 
> > I'm trying to make sure I understand: the target process would like to
> > have a filter attached that blocks sendmsg, but that would mean it has
> > no way to send the listener FD to its manager?
> 
> With pidfd_getfd() that wouldn't be a problem, I think, which is what I
> was trying to say. Unless the supervising task doesn't have enough
> privilege over the supervised task, which seems like an odd scenario but
> is technically possible, I guess.
> 
> > 
> > And you'd want to have listening working for sendmsg (otherwise you
> > could do it with two filters, I imagine)?
> > 
> > > > int fd_filter = seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_DETACHED, );
> > > > 
> > > > BARRIER_WAIT_SETUP_DONE;
> > > > 
> > > > int ret = seccomp(SECCOMP_ATTACH_FILTER, 0, INT_TO_PTR(fd_listener));
> > > 
> > > This obviously should've been something like:
> > > 
> > > struct seccomp_filter_attach {
> > >   union {
> > >   __s32 pidfd;
> > >   __s32 pid;
> > >   };
> > >   __u32 fd_filter;
> > > };
> > > 
> > > and then
> > > 
> > > int ret = seccomp(SECCOMP_ATTACH_FILTER, 0, seccomp_filter_attach);
> > 
> > Given the difficulty with TSYNC, I'm not excited about adding an
> > "apply this filter to another process" API. :)
> 
> Just to give a more complete reason for suggesting something like this
> without trying to argue that we must have this:
> 
> seccomp() has so far been an API that is caller-centric and by that I
> mean that the caller loaded its seccomp profile and sandboxed itself. As
> such seccomp is an example of "caller-managed" security. This security
> model has obvious advantages and fits into the general fork()-like world
> of unix. But imho that self-management model breaks down as soon as a
> file descriptor that can be used to refer to the object in question
> enters into the picture. For seccomp this "breaking point" was the
> seccomp notifier fd.
> 
> Because with the introduction of that fd we have introduced the concept
> of supervisor and supervisee for seccomp which imho didn't really exist
> in the same way before. It's pretty obvious from the type of language
> that we now use both in userspace and in kernelspace when we talk about
> the seccomp notifier.
> 
> At the current point we're somewhere in the middle between caller-managed
> and supervised seccomp which brings up funny problems and edge-cases.
> One of the most obvious examples is in fact the question how to get the
> seccomp notify fd out of the supervised task. This clearly points to the
> fact that we're missing one of the fundamentals of an fd-based
> supervision model: open(). This is why I was suggesting the
> SECCOMP_ATTACH_FILTER command. It's in a sense an open-call for the
> seccomp notify fd.
> 
> That all being said, I know that it can be weird to implement this, and if
> you prefer we go with another, simpler model to work around such things
> then I fully understand.
> 
> Christian

So, beyond clone3 to get pidfds being kind of awkward, how do you see this
pattern actually working? How does the filter installer let the supervisor
know that it's ready for extraction? pause() + signaling the parent?

In the case that you're not fork-execing, how do you communicate to the 
notifier? My coworkers have been working on this code, where they need to 
connect to a daemon that does the supervision, and it's gnarly[1]. They're 
looking at adding sendmsg to the filter list, and that complicates things.

I think that pidfd_getfd works well if the child has some way to signal to the 
parent that it's ready and that the f

[PATCH 0/5] Handle seccomp notification preemption

2021-03-17 Thread Sargun Dhillon



This patchset addresses a race condition we've dealt with recently with
seccomp: programs interrupting syscalls while they're in progress. This
was exacerbated by Golang's recent adoption of "async preemption", in
which the runtime tries to interrupt any syscall that's been running for
more than 10ms during GC. Some syscalls (mount, for example) are
non-trivial to emulate in a reentrant manner in userspace.

This series makes a couple of semantic changes: it relaxes a check on
seccomp_data, and it changes the ordering of how addfd and notification
replies from the supervisor are handled.

It also follows up on the original proposal from Tycho[2] to allow
for adding an FD and returning that value atomically.

Changes since v1[1]:
 * Fix some documentation
 * Add Rata's patches to allow for direct return from addfd

[1]: https://lore.kernel.org/lkml/20210220090502.7202-1-sar...@sargun.me/
[2]: https://lore.kernel.org/lkml/202012011322.26DCBC64F2@keescook/

Rodrigo Campos (1):
  seccomp: Support atomic "addfd + send reply"

Sargun Dhillon (4):
  seccomp: Refactor notification handler to prepare for new semantics
  seccomp: Add wait_killable semantic to seccomp user notifier
  selftests/seccomp: Add test for wait killable notifier
  selftests/seccomp: Add test for atomic addfd+send

 .../userspace-api/seccomp_filter.rst  |  15 +-
 include/uapi/linux/seccomp.h  |   4 +
 kernel/seccomp.c  | 129 ++
 tools/testing/selftests/seccomp/seccomp_bpf.c | 102 ++
 4 files changed, 220 insertions(+), 30 deletions(-)

-- 
2.25.1

[PATCH 1/5] seccomp: Refactor notification handler to prepare for new semantics

2021-03-17 Thread Sargun Dhillon

This refactors the user notification code to have a do / while loop around
the completion condition. This is a small change in semantics: previously
we ignored addfd calls upon wakeup if the notification had already been
responded to, whereas with this change we check for any outstanding addfd
calls prior to returning to userspace.
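
To make the semantic difference concrete, here is a minimal userspace
model of the two loops. It is only a sketch, not kernel code: struct
wakeup and addfds_handled() are hypothetical names, and each array entry
stands for one wakeup of the waiting task.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of what the waiting task sees on one wakeup. */
struct wakeup {
	bool addfd_queued;	/* an addfd request is on the list */
	bool replied;		/* state is SECCOMP_NOTIFY_REPLIED */
};

/*
 * Count how many queued addfd requests get installed before the wait
 * ends. With new_semantics == false an addfd is skipped once a reply
 * has been recorded; with true it is always handled first.
 */
static int addfds_handled(const struct wakeup *w, size_t n, bool new_semantics)
{
	int handled = 0;

	assert(w != NULL);
	for (size_t i = 0; i < n; i++) {
		if (w[i].addfd_queued && (new_semantics || !w[i].replied))
			handled++;
		if (w[i].replied)
			break;	/* reply consumed, return to userspace */
	}
	return handled;
}
```

The only case where the two differ is a wakeup that sees both a queued
addfd and a recorded reply.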

Signed-off-by: Sargun Dhillon 
---
 kernel/seccomp.c | 30 --
 1 file changed, 16 insertions(+), 14 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 952dc1c90229..b48fb0a29455 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1098,28 +1098,30 @@ static int seccomp_do_user_notification(int this_syscall,
 
 	up(&match->notif->request);
 	wake_up_poll(&match->wqh, EPOLLIN | EPOLLRDNORM);
-	mutex_unlock(&match->notify_lock);
 
 	/*
 	 * This is where we wait for a reply from userspace.
 	 */
-wait:
-	err = wait_for_completion_interruptible(&n.ready);
-	mutex_lock(&match->notify_lock);
-	if (err == 0) {
-		/* Check if we were woken up by a addfd message */
+	do {
+		mutex_unlock(&match->notify_lock);
+		err = wait_for_completion_interruptible(&n.ready);
+		mutex_lock(&match->notify_lock);
+		if (err != 0)
+			goto interrupted;
+
 		addfd = list_first_entry_or_null(&n.addfd,
 						 struct seccomp_kaddfd, list);
-		if (addfd && n.state != SECCOMP_NOTIFY_REPLIED) {
+		/* Check if we were woken up by a addfd message */
+		if (addfd)
 			seccomp_handle_addfd(addfd);
-			mutex_unlock(&match->notify_lock);
-			goto wait;
-		}
-		ret = n.val;
-		err = n.error;
-		flags = n.flags;
-	}
 
+	}  while (n.state != SECCOMP_NOTIFY_REPLIED);
+
+	ret = n.val;
+	err = n.error;
+	flags = n.flags;
+
+interrupted:
 	/* If there were any pending addfd calls, clear them out */
 	list_for_each_entry_safe(addfd, tmp, &n.addfd, list) {
 		/* The process went away before we got a chance to handle it */
-- 
2.25.1

[PATCH 2/5] seccomp: Add wait_killable semantic to seccomp user notifier

2021-03-17 Thread Sargun Dhillon

The user notifier feature allows for filtering of seccomp notifications in
userspace. While the user notifier is handling the syscall, the notifying
process can be preempted, thus ending the notification. This has become a
growing problem, as Golang has adopted signal based async preemption[1]. In
this, it will preempt every 10ms, thus leaving the supervisor less than
10ms to respond to a given notification. If the syscall requires I/O (mount,
connect) on behalf of the process, it can easily take 10ms.

This allows the supervisor to set a flag that moves the process into a
state where it is only killable by terminating signals as opposed to all
signals. The process can still be terminated before the supervisor receives
the notification.
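
As a rough illustration of the rule described above, the choice of wait
type can be modelled as a pure function of the notification state and
the flag. This is a sketch only: the enum and struct names below are
hypothetical stand-ins, not the kernel's own definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the notification states used by the notifier. */
enum notify_state { NOTIFY_INIT, NOTIFY_SENT, NOTIFY_REPLIED };

enum wait_mode { WAIT_INTERRUPTIBLE, WAIT_KILLABLE };

struct notif_model {
	enum notify_state state;
	bool wait_killable;	/* supervisor asked for killable wait */
};

/*
 * Only a notification the supervisor has already received (SENT) and
 * flagged is shielded from non-fatal signals. Before that point the
 * task can still be interrupted as usual.
 */
static enum wait_mode pick_wait_mode(const struct notif_model *n)
{
	assert(n != NULL);
	if (n->state == NOTIFY_SENT && n->wait_killable)
		return WAIT_KILLABLE;
	return WAIT_INTERRUPTIBLE;
}
```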

Signed-off-by: Sargun Dhillon 

[1]: https://github.com/golang/go/issues/24543
---
 .../userspace-api/seccomp_filter.rst  | 15 +++---
 include/uapi/linux/seccomp.h  |  3 ++
 kernel/seccomp.c  | 54 ---
 3 files changed, 58 insertions(+), 14 deletions(-)

diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
index bd9165241b6c..75de9400d56a 100644
--- a/Documentation/userspace-api/seccomp_filter.rst
+++ b/Documentation/userspace-api/seccomp_filter.rst
@@ -251,13 +251,14 @@ seccomp notification fd to receive a ``struct seccomp_notif``, which contains
 five members: the input length of the structure, a unique-per-filter ``id``,
 the ``pid`` of the task which triggered this request (which may be 0 if the
 task is in a pid ns not visible from the listener's pid namespace), a ``flags``
-member which for now only has ``SECCOMP_NOTIF_FLAG_SIGNALED``, representing
-whether or not the notification is a result of a non-fatal signal, and the
-``data`` passed to seccomp. Userspace can then make a decision based on this
-information about what to do, and ``ioctl(SECCOMP_IOCTL_NOTIF_SEND)`` a
-response, indicating what should be returned to userspace. The ``id`` member of
-``struct seccomp_notif_resp`` should be the same ``id`` as in ``struct
-seccomp_notif``.
+member and the ``data`` passed to seccomp. Upon receiving the notification,
+the ``SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE`` flag may be set, which will
+try to put the task into a state where it will only respond to fatal signals.
+
+Userspace can then make a decision based on this information about what to do,
+and ``ioctl(SECCOMP_IOCTL_NOTIF_SEND)`` a response, indicating what should be
+returned to userspace. The ``id`` member of ``struct seccomp_notif_resp`` should
+be the same ``id`` as in ``struct seccomp_notif``.
 
 It is worth noting that ``struct seccomp_data`` contains the values of register
 arguments to the syscall, but does not contain pointers to memory. The task's
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 6ba18b82a02e..bc7fc8b04749 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -70,6 +70,9 @@ struct seccomp_notif_sizes {
__u16 seccomp_data;
 };
 
+/* Valid flags for struct seccomp_notif */
+#define SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE  (1UL << 0) /* Prevent task from being interrupted */
+
 struct seccomp_notif {
__u64 id;
__u32 pid;
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index b48fb0a29455..1a38fb1de053 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -97,6 +97,8 @@ struct seccomp_knotif {
 
/* outstanding addfd requests */
struct list_head addfd;
+
+   bool wait_killable;
 };
 
 /**
@@ -1073,6 +1075,11 @@ static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd)
 	complete(&addfd->completion);
 }
 
+static bool notification_interruptible(struct seccomp_knotif *n)
+{
+   return !(n->state == SECCOMP_NOTIFY_SENT && n->wait_killable);
+}
+
 static int seccomp_do_user_notification(int this_syscall,
struct seccomp_filter *match,
const struct seccomp_data *sd)
@@ -1082,6 +1089,7 @@ static int seccomp_do_user_notification(int this_syscall,
 	long ret = 0;
 	struct seccomp_knotif n = {};
 	struct seccomp_kaddfd *addfd, *tmp;
+	bool interruptible = true;
 
 	mutex_lock(&match->notify_lock);
 	err = -ENOSYS;
@@ -1103,11 +,31 @@ static int seccomp_do_user_notification(int this_syscall,
 	 * This is where we wait for a reply from userspace.
 	 */
 	do {
+		interruptible = notification_interruptible(&n);
+
 		mutex_unlock(&match->notify_lock);
-		err = wait_for_completion_interruptible(&n.ready);
+		if (interruptible)
+			err = wait_for_completion_interruptible(&n.ready);
+		else
+			err = wait_for_completion_killable(&n.ready);
 		mutex_lock(&match->notify_lock);
-   if (err != 0

[PATCH 3/5] selftests/seccomp: Add test for wait killable notifier

2021-03-17 Thread Sargun Dhillon

This adds a test for the positive case of the wait killable notifier,
in testing that when the feature is activated the process acts as
expected -- in not terminating on a non-fatal signal, and instead
queueing it up. There is already a test case for normal handlers
and preemption.
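
The test below detects signal delivery by having the handler write a
byte to a socketpair that the parent reads in non-blocking mode. A
standalone sketch of that "witness" pattern, without any seccomp
involved, might look like the following. The helper names
(setup_signal_witness, signal_was_handled) are made up for illustration
and are not part of the selftest.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <stddef.h>
#include <sys/socket.h>
#include <unistd.h>

/* Write end of the witness socket; set before the handler is installed. */
static int witness_fd = -1;

static void on_sigusr1(int signo)
{
	char c = 'H';

	(void)signo;
	/* write(2) is async-signal-safe; a failed write is ignored here. */
	if (write(witness_fd, &c, 1) != 1)
		return;
}

/* Create the socketpair and install the handler. Returns 0 on success. */
static int setup_signal_witness(int sk[2])
{
	assert(sk != NULL);
	if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sk) != 0)
		return -1;
	/* Non-blocking read end: "no signal yet" shows up as read() == -1. */
	if (fcntl(sk[0], F_SETFL, O_NONBLOCK) != 0)
		return -1;
	witness_fd = sk[1];
	if (signal(SIGUSR1, on_sigusr1) == SIG_ERR)
		return -1;
	return 0;
}

/* Returns 1 if the handler has run since the last check, 0 otherwise. */
static int signal_was_handled(int read_fd)
{
	char c;

	return read(read_fd, &c, 1) == 1;
}
```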

Signed-off-by: Sargun Dhillon 
---
 tools/testing/selftests/seccomp/seccomp_bpf.c | 64 +++
 1 file changed, 64 insertions(+)

diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 26c72f2b61b1..48ad53030d5a 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -235,6 +235,10 @@ struct seccomp_notif_addfd {
 };
 #endif
 
+#ifndef SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE
+#define SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE  (1UL << 0) /* Prevent task from being interrupted */
+#endif
+
 struct seccomp_notif_addfd_small {
__u64 id;
char weird[4];
@@ -4139,6 +4143,66 @@ TEST(user_notification_addfd_rlimit)
close(memfd);
 }
 
+TEST(user_notification_signal_wait_killable)
+{
+   pid_t pid;
+   long ret;
+   int status, listener, sk_pair[2];
+   struct seccomp_notif req = {
+   .flags = SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE,
+   };
+   struct seccomp_notif_resp resp = {};
+   char c;
+
+   ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+   ASSERT_EQ(0, ret) {
+   TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+   }
+
+   ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
+   ASSERT_EQ(fcntl(sk_pair[0], F_SETFL, O_NONBLOCK), 0);
+
+   listener = user_notif_syscall(__NR_gettid,
+ SECCOMP_FILTER_FLAG_NEW_LISTENER);
+   ASSERT_GE(listener, 0);
+
+   pid = fork();
+   ASSERT_GE(pid, 0);
+
+   if (pid == 0) {
+   close(sk_pair[0]);
+   handled = sk_pair[1];
+   if (signal(SIGUSR1, signal_handler) == SIG_ERR) {
+   perror("signal");
+   exit(1);
+   }
+
+   ret = syscall(__NR_gettid);
+   exit(!(ret == 42));
+   }
+   close(sk_pair[1]);
+
+   EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+   EXPECT_EQ(kill(pid, SIGUSR1), 0);
+   /* Make sure we didn't get a signal */
+   EXPECT_EQ(read(sk_pair[0], &c, 1), -1);
+   /* Make sure the notification is still alive */
+   EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ID_VALID, &req.id), 0);
+
+   resp.id = req.id;
+   resp.error = 0;
+   resp.val = 42;
+
+   EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+
+   EXPECT_EQ(waitpid(pid, &status, 0), pid);
+   EXPECT_EQ(true, WIFEXITED(status));
+   EXPECT_EQ(0, WEXITSTATUS(status));
+   /* Check we eventually received the signal */
+   EXPECT_EQ(read(sk_pair[0], &c, 1), 1);
+}
+
+
 /*
  * TODO:
  * - expand NNP testing
-- 
2.25.1

[PATCH 4/5] seccomp: Support atomic "addfd + send reply"

2021-03-17 Thread Sargun Dhillon

From: Rodrigo Campos 

Alban Crequy reported a race condition userspace faces when we want to
add some fds and make the syscall return them[1] using seccomp notify.

The problem is that currently two different ioctl() calls are needed by
the process handling the syscalls (agent) for another userspace process
(target): SECCOMP_IOCTL_NOTIF_ADDFD to allocate the fd and
SECCOMP_IOCTL_NOTIF_SEND to return that value. Therefore, it is possible
for the agent to do the first ioctl to add a file descriptor but the
target is interrupted (EINTR) before the agent does the second ioctl()
call.

Other patches in this series add a way to block signals when a syscall
is put to wait by seccomp. However, that might be a big hammer for some
cases, as the golang runtime uses SIGURG to interrupt threads for GC.
Sometimes we just don't want to interfere with the GC, for example, and
just want to either add the fd and return it or fail the syscall, without
inadvertently leaking fds into the target process.

This patch adds a flag to the ADDFD ioctl() so it adds the fd and
returns that value atomically to the target program, as suggested by
Kees Cook[2]. This is done by simply allowing
seccomp_do_user_notification() to add the fd and return it in this case.
Therefore, in this case the target wakes up from the wait in
seccomp_do_user_notification() either to interrupt the syscall or to add
the fd and return it.

This "allocate an fd and return" functionality is useful for syscalls
that return a file descriptor only, like connect(2). Other syscalls that
return a file descriptor but not as the return value (or return more than
one fd), like socketpair(), pipe(), recvmsg with SCM_RIGHTS, will not
work with this flag. The way to emulate those in cases where a signal
might interrupt is to use the functionality to block signals.

The struct seccomp_notif_resp, used when doing the SECCOMP_IOCTL_NOTIF_SEND
ioctl() to send a response to the target, has three more fields that we
don't allow to be set when using the addfd ioctl() to also return. The
reasons to disallow each field are:
 * val: This will be set to the newly allocated fd. No point taking it
   from userspace in this case.
 * error: If this is non-zero, the value is ignored. Therefore,
   it is pointless in this case as we want to return the value.
 * flags: The only flag is to let userspace continue to execute the
   syscall. This seems pointless, as we want the syscall to return the
   allocated fd.

This is why those fields cannot be set when using this new flag.
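
A small model of how the reply could be filled in on the target side
when the flag is used is sketched below. It rests on assumptions:
struct reply_model and finish_addfd_send() are hypothetical names, and
leaving the notification unanswered when fd installation fails is an
assumption made for this sketch; the description above only says the
error goes back to the agent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the reply fields of a pending notification. */
struct reply_model {
	int64_t val;
	int32_t error;
	uint32_t flags;
	bool replied;
};

/*
 * installed_fd is the result of installing the fd in the target: an fd
 * number (>= 0) or a negative errno. On success the reply is filled in
 * by the kernel side itself, which is why val, error and flags are not
 * taken from userspace. On failure nothing is recorded (assumption).
 */
static int finish_addfd_send(struct reply_model *r, int installed_fd)
{
	assert(r != NULL);
	if (installed_fd < 0)
		return installed_fd;

	r->val = installed_fd;	/* syscall returns the new fd */
	r->error = 0;		/* a non-zero error would hide val */
	r->flags = 0;		/* never "continue" the original syscall */
	r->replied = true;
	return installed_fd;
}
```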

[1]: https://lore.kernel.org/lkml/cadzs7q4sw71inhmv8eooxhukjmorpzf7thraxzyddtzsxta...@mail.gmail.com/
[2]: https://lore.kernel.org/lkml/202012011322.26DCBC64F2@keescook/

Signed-off-by: Rodrigo Campos 
Signed-off-by: Sargun Dhillon 
---
 include/uapi/linux/seccomp.h |  1 +
 kernel/seccomp.c | 49 +---
 2 files changed, 46 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index bc7fc8b04749..95dd9bab73c6 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -118,6 +118,7 @@ struct seccomp_notif_resp {
 
 /* valid flags for seccomp_notif_addfd */
 #define SECCOMP_ADDFD_FLAG_SETFD   (1UL << 0) /* Specify remote fd */
+#define SECCOMP_ADDFD_FLAG_SEND	(1UL << 1) /* Addfd and return it, atomically */
 
 /**
  * struct seccomp_notif_addfd
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 1a38fb1de053..66b3ff58469a 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -109,6 +109,7 @@ struct seccomp_knotif {
  *  installing process should allocate the fd as normal.
  * @flags: The flags for the new file descriptor. At the moment, only O_CLOEXEC
  * is allowed.
+ * @ioctl_flags: The flags used for the seccomp_addfd ioctl.
  * @ret: The return value of the installing process. It is set to the fd num
  *   upon success (>= 0).
  * @completion: Indicates that the installing process has completed fd
@@ -120,6 +121,7 @@ struct seccomp_kaddfd {
struct file *file;
int fd;
unsigned int flags;
+   __u32 ioctl_flags;
 
/* To only be set on reply */
int ret;
@@ -1064,14 +1066,35 @@ static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
 	return filter->notif->next_id++;
 }
 
-static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd)
+static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd, struct seccomp_knotif *n)
 {
+	int fd;
+
 	/*
 	 * Remove the notification, and reset the list pointers, indicating
 	 * that it has been handled.
 	 */
 	list_del_init(&addfd->list);
-	addfd->ret = receive_fd_replace(addfd->fd, addfd->file, addfd->flags);
+	fd = receive_fd_replace(addfd->fd, addfd->file, addfd->flags);
+
+	addfd->ret = fd;
+
+   if (addfd->ioctl_flags & SECCOMP_ADDFD_FLA

[PATCH 5/5] selftests/seccomp: Add test for atomic addfd+send

2021-03-17 Thread Sargun Dhillon

This just adds a test to verify that when using the newly introduced flag
to ADDFD, a valid fd is added and returned as the syscall result.

Signed-off-by: Rodrigo Campos 
Signed-off-by: Sargun Dhillon 
---
 tools/testing/selftests/seccomp/seccomp_bpf.c | 38 +++
 1 file changed, 38 insertions(+)

diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 48ad53030d5a..f7242294a2d5 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -239,6 +239,10 @@ struct seccomp_notif_addfd {
 #define SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE  (1UL << 0) /* Prevent task from being interrupted */
 #endif
 
+#ifndef SECCOMP_ADDFD_FLAG_SEND
+#define SECCOMP_ADDFD_FLAG_SEND	(1UL << 1) /* Addfd and return it, atomically */
+#endif
+
 struct seccomp_notif_addfd_small {
__u64 id;
char weird[4];
@@ -3980,8 +3984,14 @@ TEST(user_notification_addfd)
ASSERT_GE(pid, 0);
 
if (pid == 0) {
+   /* fds will be added and this value is expected */
if (syscall(__NR_getppid) != USER_NOTIF_MAGIC)
exit(1);
+
+   /* Atomic addfd+send is received here. Check it is a valid fd */
+   if (fcntl(syscall(__NR_getppid), F_GETFD) == -1)
+   exit(1);
+
exit(syscall(__NR_getppid) != USER_NOTIF_MAGIC);
}
 
@@ -4064,6 +4074,30 @@ TEST(user_notification_addfd)
ASSERT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
ASSERT_EQ(addfd.id, req.id);
 
+   /* Verify we can do an atomic addfd and send */
+   addfd.newfd = 0;
+   addfd.flags = SECCOMP_ADDFD_FLAG_SEND;
+   fd = ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd);
+
+   /* Child has fds 0-6 and 42 used, we expect the lower fd available: 7 */
+   EXPECT_EQ(fd, 7);
+   EXPECT_EQ(filecmp(getpid(), pid, memfd, fd), 0);
+
+   /*
+* This sets the ID of the ADD FD to the last request plus 1. The
+* notification ID increments 1 per notification.
+*/
+   addfd.id = req.id + 1;
+
+   /* This spins until the underlying notification is generated */
+   while (ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd) != -1 &&
+  errno != -EINPROGRESS)
+   nanosleep(&delay, NULL);
+
+   memset(&req, 0, sizeof(req));
+   ASSERT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+   ASSERT_EQ(addfd.id, req.id);
+
resp.id = req.id;
resp.error = 0;
resp.val = USER_NOTIF_MAGIC;
@@ -4124,6 +4158,10 @@ TEST(user_notification_addfd_rlimit)
EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd), -1);
EXPECT_EQ(errno, EMFILE);
 
+   addfd.flags = SECCOMP_ADDFD_FLAG_SEND;
+   EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd), -1);
+   EXPECT_EQ(errno, EMFILE);
+
addfd.newfd = 100;
addfd.flags = SECCOMP_ADDFD_FLAG_SETFD;
EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd), -1);
-- 
2.25.1

seccomp: Delay filter activation

2021-02-20 Thread Sargun Dhillon

We've run into a problem where attaching a filter can be quite messy
business because the filter itself intercepts sendmsg, and other
syscalls related to exfiltrating the listener FD. I believe that this
problem set has been brought up before, and although there are
"simpler" methods of exfiltrating the listener, like clone3 or
pidfd_getfd, these are still less than ideal.

One of the ideas that's been talked about (I want to say back at LSS
NA) is the idea of "delayed activation". I was thinking that it might
be nice to have a mechanism to do a delayed attach. This could either be
activated on execve / fork, or there could be a flag like
SECCOMP_FILTER_FLAG_NEW_LISTENER_INACTIVE, which indicates that the
listener should be set up but not enforcing, plus an ioctl on the
listener fd to activate the filter.

The latter approach is preferred due to simplicity, but I can see a
situation where you could accidentally get into a state where the
filter is not being enforced. Additionally, this may have unforeseen
implications with CRIU.

I'm curious whether this is a problem others share, and whether any of
the aforementioned approaches seem reasonable.
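
To make the flag-plus-ioctl variant a bit more concrete, here is a toy
state model of it. Nothing in it exists in the kernel: the struct and
function names are hypothetical, and the one-way transition with
-EALREADY on a second activation is just one possible choice.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum listener_state { LISTENER_INACTIVE, LISTENER_ACTIVE };

/* Hypothetical filter installed with the proposed "inactive" flag. */
struct delayed_filter {
	enum listener_state state;
};

/* Install: the listener exists, but nothing is intercepted yet. */
static void delayed_filter_install(struct delayed_filter *f)
{
	assert(f != NULL);
	f->state = LISTENER_INACTIVE;
}

/* While inactive, syscalls such as sendmsg pass through untouched. */
static bool delayed_filter_intercepts(const struct delayed_filter *f)
{
	assert(f != NULL);
	return f->state == LISTENER_ACTIVE;
}

/* Activation is one-way; a second attempt is reported as an error. */
static int delayed_filter_activate(struct delayed_filter *f)
{
	assert(f != NULL);
	if (f->state == LISTENER_ACTIVE)
		return -EALREADY;
	f->state = LISTENER_ACTIVE;
	return 0;
}
```

The intended order would be: install, hand the listener fd to the
supervisor over a socket, then activate. The window before activation
is exactly the "not being enforced" state worried about above.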

-Thanks,
Sargun

[RFC PATCH 3/3] selftests/seccomp: Add test for wait killable notifier

2021-02-20 Thread Sargun Dhillon

This adds a test for the positive case of the wait killable notifier,
in testing that when the feature is activated the process acts as
expected -- in not terminating on a non-fatal signal, and instead
queueing it up. There is already a test case for normal handlers
and preemption.

Signed-off-by: Sargun Dhillon 
---
 tools/testing/selftests/seccomp/seccomp_bpf.c | 60 +++
 1 file changed, 60 insertions(+)

diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 26c72f2b61b1..a8ef4558d673 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -4139,6 +4139,66 @@ TEST(user_notification_addfd_rlimit)
close(memfd);
 }
 
+TEST(user_notification_signal_wait_killable)
+{
+   pid_t pid;
+   long ret;
+   int status, listener, sk_pair[2];
+   struct seccomp_notif req = {
+   .flags = SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE,
+   };
+   struct seccomp_notif_resp resp = {};
+   char c;
+
+   ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+   ASSERT_EQ(0, ret) {
+   TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+   }
+
+   ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
+   ASSERT_EQ(fcntl(sk_pair[0], F_SETFL, O_NONBLOCK), 0);
+
+   listener = user_notif_syscall(__NR_gettid,
+ SECCOMP_FILTER_FLAG_NEW_LISTENER);
+   ASSERT_GE(listener, 0);
+
+   pid = fork();
+   ASSERT_GE(pid, 0);
+
+   if (pid == 0) {
+   close(sk_pair[0]);
+   handled = sk_pair[1];
+   if (signal(SIGUSR1, signal_handler) == SIG_ERR) {
+   perror("signal");
+   exit(1);
+   }
+
+   ret = syscall(__NR_gettid);
+   exit(!(ret == 42));
+   }
+   close(sk_pair[1]);
+
+   EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+   EXPECT_EQ(kill(pid, SIGUSR1), 0);
+   /* Make sure we didn't get a signal */
+   EXPECT_EQ(read(sk_pair[0], &c, 1), -1);
+   /* Make sure the notification is still alive */
+   EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ID_VALID, &req.id), 0);
+
+   resp.id = req.id;
+   resp.error = 0;
+   resp.val = 42;
+
+   EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+
+   EXPECT_EQ(waitpid(pid, &status, 0), pid);
+   EXPECT_EQ(true, WIFEXITED(status));
+   EXPECT_EQ(0, WEXITSTATUS(status));
+   /* Check we eventually received the signal */
+   EXPECT_EQ(read(sk_pair[0], &c, 1), 1);
+}
+
+
 /*
  * TODO:
  * - expand NNP testing
-- 
2.25.1

[RFC PATCH 1/3] seccomp: Refactor notification handler to prepare for new semantics

2021-02-20 Thread Sargun Dhillon

This refactors the user notification code to have a do / while loop around
the completion condition. This has a small change in semantic, in that
previously we ignored addfd calls upon wakeup if the notification had been
responded to, but instead with the new change we check for an outstanding
addfd calls prior to returning to userspace.

Signed-off-by: Sargun Dhillon 
---
 kernel/seccomp.c | 30 --
 1 file changed, 16 insertions(+), 14 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 952dc1c90229..b48fb0a29455 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1098,28 +1098,30 @@ static int seccomp_do_user_notification(int this_syscall,
 
 	up(&match->notif->request);
 	wake_up_poll(&match->wqh, EPOLLIN | EPOLLRDNORM);
-	mutex_unlock(&match->notify_lock);
 
 	/*
 	 * This is where we wait for a reply from userspace.
 	 */
-wait:
-	err = wait_for_completion_interruptible(&n.ready);
-	mutex_lock(&match->notify_lock);
-	if (err == 0) {
-		/* Check if we were woken up by a addfd message */
+	do {
+		mutex_unlock(&match->notify_lock);
+		err = wait_for_completion_interruptible(&n.ready);
+		mutex_lock(&match->notify_lock);
+		if (err != 0)
+			goto interrupted;
+
 		addfd = list_first_entry_or_null(&n.addfd,
 						 struct seccomp_kaddfd, list);
-		if (addfd && n.state != SECCOMP_NOTIFY_REPLIED) {
+		/* Check if we were woken up by a addfd message */
+		if (addfd)
 			seccomp_handle_addfd(addfd);
-			mutex_unlock(&match->notify_lock);
-			goto wait;
-		}
-		ret = n.val;
-		err = n.error;
-		flags = n.flags;
-	}
 
+	}  while (n.state != SECCOMP_NOTIFY_REPLIED);
+
+	ret = n.val;
+	err = n.error;
+	flags = n.flags;
+
+interrupted:
 	/* If there were any pending addfd calls, clear them out */
 	list_for_each_entry_safe(addfd, tmp, &n.addfd, list) {
 		/* The process went away before we got a chance to handle it */
-- 
2.25.1

[RFC PATCH 2/3] seccomp: Add wait_killable semantic to seccomp user notifier

2021-02-20 Thread Sargun Dhillon

The user notifier feature allows for filtering of seccomp notifications in
userspace. While the user notifier is handling the syscall, the notifying
process can be preempted, thus ending the notification. This has become a
growing problem, as Golang has adopted signal based async preemption[1]. In
this, it will preempt every 10ms, thus leaving the supervisor less than
10ms to respond to a given notification. If the syscall requires I/O (mount,
connect) on behalf of the process, it can easily take 10ms.

This allows the supervisor to set a flag that moves the process into a
state where it is only killable by terminating signals as opposed to all
signals.

Signed-off-by: Sargun Dhillon 

[1]: https://github.com/golang/go/issues/24543
---
 include/uapi/linux/seccomp.h | 10 ++
 kernel/seccomp.c | 35 +--
 2 files changed, 39 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 6ba18b82a02e..f9acdb58138b 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -70,6 +70,16 @@ struct seccomp_notif_sizes {
__u16 seccomp_data;
 };
 
+/*
+ * Valid flags for struct seccomp_notif
+ *
+ * SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE
+ *
+ * Prevent the notifying process from being interrupted by non-fatal, unmasked
+ * signals.
+ */
+#define SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE (1UL << 0)
+
 struct seccomp_notif {
__u64 id;
__u32 pid;
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index b48fb0a29455..f8c6c47df5d8 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -97,6 +97,8 @@ struct seccomp_knotif {
 
/* outstanding addfd requests */
struct list_head addfd;
+
+   bool wait_killable;
 };
 
 /**
@@ -1082,6 +1084,7 @@ static int seccomp_do_user_notification(int this_syscall,
long ret = 0;
struct seccomp_knotif n = {};
struct seccomp_kaddfd *addfd, *tmp;
+   bool wait_killable = false;
 
mutex_lock(&match->notify_lock);
err = -ENOSYS;
@@ -1103,8 +1106,14 @@ static int seccomp_do_user_notification(int this_syscall,
 * This is where we wait for a reply from userspace.
 */
do {
+   wait_killable = n.state == SECCOMP_NOTIFY_SENT &&
+   n.wait_killable;
+
mutex_unlock(&match->notify_lock);
-   err = wait_for_completion_interruptible(&n.ready);
+   if (wait_killable)
+   err = wait_for_completion_killable(&n.ready);
+   else
+   err = wait_for_completion_interruptible(&n.ready);
mutex_lock(&match->notify_lock);
if (err != 0)
goto interrupted;
@@ -1420,14 +1429,16 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
struct seccomp_notif unotif;
ssize_t ret;
 
+   ret = copy_from_user(&unotif, buf, sizeof(unotif));
+   if (ret)
+   return -EFAULT;
+
/* Verify that we're not given garbage to keep struct extensible. */
-   ret = check_zeroed_user(buf, sizeof(unotif));
-   if (ret < 0)
-   return ret;
-   if (!ret)
+   if (unotif.flags & ~(SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE))
return -EINVAL;
 
-   memset(&unotif, 0, sizeof(unotif));
+   if (unotif.id || unotif.pid)
+   return -EINVAL;
 
ret = down_interruptible(&filter->notif->request);
if (ret < 0)
@@ -1455,6 +1466,12 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
unotif.pid = task_pid_vnr(knotif->task);
unotif.data = *(knotif->data);
 
+   if (unotif.flags & SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE) {
+   knotif->wait_killable = true;
+   complete(&knotif->ready);
+   }
+
+
knotif->state = SECCOMP_NOTIFY_SENT;
wake_up_poll(&filter->wqh, EPOLLOUT | EPOLLWRNORM);
ret = 0;
@@ -1473,6 +1490,12 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
mutex_lock(&filter->notify_lock);
knotif = find_notification(filter, unotif.id);
if (knotif) {
+   /* Reset the waiting state */
+   if (knotif->wait_killable) {
+   knotif->wait_killable = false;
+   complete(&knotif->ready);
+   }
+
knotif->state = SECCOMP_NOTIFY_INIT;
up(&filter->notif->request);
}
-- 
2.25.1
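
For illustration, here is a rough sketch of how a supervisor might use the
flag proposed in this RFC. It is only a sketch under assumptions: the flag
exists only in this patch, so it is defined locally, and the helper name
recv_notif_wait_killable() is made up for the example. On a kernel without
this patch, the diff above suggests the receive would fail with EINVAL,
because the notification buffer is then required to be all zeroes.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>

/* Assumption: this flag comes from the RFC patch above, not from a
 * released uapi header, so define it locally when it is missing. */
#ifndef SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE
#define SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE (1UL << 0)
#endif

/*
 * Hypothetical helper: receive the next notification and ask the kernel to
 * keep the notifying task in a killable (rather than interruptible) wait.
 * Returns 0 on success or a negative errno.
 */
static int recv_notif_wait_killable(int listener_fd, struct seccomp_notif *req)
{
    assert(req != NULL);

    /* The patch rejects a non-zero id or pid, so start from zero. */
    memset(req, 0, sizeof(*req));
    req->flags = SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE;

    if (ioctl(listener_fd, SECCOMP_IOCTL_NOTIF_RECV, req) < 0)
        return -errno;
    return 0;
}
```

A supervisor would call this in place of its plain SECCOMP_IOCTL_NOTIF_RECV
call and treat a negative return as "retry or fall back".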

[RFC PATCH 0/3] Seccomp non-preemptible notifier

2021-02-20 Thread Sargun Dhillon

This patchset addresses a race condition we've dealt with recently with
seccomp: programs interrupting syscalls while they're in progress. This
was exacerbated by Golang's recent adoption of "async preemption", in
which the runtime tries to interrupt any syscall that's been running for
more than 10ms during GC. For certain syscalls (mount), it's non-trivial
to handle them in a reentrant manner in userspace.

This has a couple semantic changes, and relaxes a check on seccomp_data.
I can deal with these, but this was a first cut. I also expect that the
patch would be squashed down, but it's split out for easier review.

Sargun Dhillon (3):
  seccomp: Refactor notification handler to prepare for new semantics
  seccomp: Add wait_killable semantic to seccomp user notifier
  selftests/seccomp: Add test for wait killable notifier

 include/uapi/linux/seccomp.h  | 10 +++
 kernel/seccomp.c  | 63 +--
 tools/testing/selftests/seccomp/seccomp_bpf.c | 60 ++
 3 files changed, 114 insertions(+), 19 deletions(-)

-- 
2.25.1

Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2020-12-28 Thread Sargun Dhillon

On Mon, Dec 28, 2020 at 9:26 AM Jeff Layton  wrote:
>
> On Mon, 2020-12-28 at 15:56 +0000, Matthew Wilcox wrote:
> > On Mon, Dec 28, 2020 at 08:25:50AM -0500, Jeff Layton wrote:
> > > To be clear, the main thing you'll lose with the method above is the
> > > ability to see an unseen error on a newly opened fd, if there was an
> > > overlayfs mount using the same upper sb before your open occurred.
> > >
> > > IOW, consider two overlayfs mounts using the same upper layer sb:
> > >
> > > ovlfs1  ovlfs2
> > > --
> > > mount
> > > open fd1
> > > write to fd1
> > > 
> > > mount (upper errseq_t SEEN flag marked)
> > > open fd2
> > > syncfs(fd2)
> > > syncfs(fd1)
> > >
> > >
> > > On a "normal" (non-overlay) fs, you'd get an error back on both syncfs
> > > calls. The first one has a sample from before the error occurred, and
> > > the second one has a sample of 0, due to the fact that the error was
> > > unseen at open time.
> > >
> > > On overlayfs, with the intervening mount of ovlfs2, syncfs(fd1) will
> > > return an error and syncfs(fd2) will not. If we split the SEEN flag into
> > > two, then we can ensure that they both still get an error in this
> > > situation.
> >
> > But do we need to?  If the inode has been evicted we also lose the errno.
> > The guarantee we provide is that a fd that was open before the error
> > occurred will see the error.  An fd that's opened after the error occurred
> > may or may not see the error.
> >
>
> In principle, you can lose errors this way (which was the justification
> for making errseq_sample return 0 when there are unseen errors). E.g.,
> if you close fd1 instead of doing a syncfs on it, that error will be
> lost forever.
>
> As to whether that's OK, it's hard to say. It is a deviation from how
> this works in a non-containerized situation, and I'd argue that it's
> less than ideal. You may or may not see the error on fd2, but it's
> dependent on events that take place outside the container and that
> aren't observable from within it. That effectively makes the results
> non-deterministic, which is usually a bad thing in computing...
>
> --
> Jeff Layton 
>

I agree that predictable behaviour outweighs any benefit of complexity
cutting we might do here.

Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2020-12-24 Thread Sargun Dhillon

On Thu, Dec 24, 2020 at 11:32:55AM +0200, Amir Goldstein wrote:
> On Wed, Dec 23, 2020 at 10:44 PM Matthew Wilcox  wrote:
> >
> > > On Wed, Dec 23, 2020 at 08:21:41PM +0000, Sargun Dhillon wrote:
> > > > On Wed, Dec 23, 2020 at 08:07:46PM +0000, Matthew Wilcox wrote:
> > > > > On Wed, Dec 23, 2020 at 07:29:41PM +0000, Sargun Dhillon wrote:
> > > > > > On Wed, Dec 23, 2020 at 06:50:44PM +0000, Matthew Wilcox wrote:
> > > > > > > On Wed, Dec 23, 2020 at 06:20:27PM +0000, Sargun Dhillon wrote:
> > > > > > > > I fail to see why this is necessary if you incorporate error
> > > > > > > > reporting into the sync_fs callback. Why is this separate from
> > > > > > > > that callback? If you pick up Jeff's patch that adds the 2nd
> > > > > > > > flag to errseq for "observed", you should be able to stash the
> > > > > > > > first errseq seen in the ovl_fs struct, and do the
> > > > > > > > check-and-return in there instead of adding this new
> > > > > > > > infrastructure.
> > > > > >
> > > > > > You still haven't explained why you want to add the "observed" flag.
> > > > >
> > > > >
> > > > > > In the overlayfs model, many users may be using the same filesystem
> > > > > > (super block) for their upperdir. Let's say you have something like
> > > > > > this:
> > > > >
> > > > > /workdir [Mounted FS]
> > > > > /workdir/upperdir1 [overlayfs upperdir]
> > > > > /workdir/upperdir2 [overlayfs upperdir]
> > > > > /workdir/userscratchspace
> > > > >
> > > > > The user needs to be able to do something like:
> > > > > sync -f ${overlayfs1}/file
> > > > >
> > > > > > which in turn will call sync on the underlying filesystem (the one
> > > > > > mounted on /workdir), and can check if the errseq has changed since
> > > > > > the overlayfs was mounted, and use that to return an error to the
> > > > > > user.
> > > >
> > > > > OK, but I don't see why the current scheme doesn't work for this.  If
> > > > > (each instance of) overlayfs samples the errseq at mount time and then
> > > > > check_and_advances it at sync time, it will see any error that has
> > > > > occurred since the mount happened (and possibly also an error which
> > > > > occurred before the mount happened, but hadn't been reported to anybody
> > > > > before).
> > > >
> > >
> > > > If there is an outstanding error at mount time, and the SEEN flag is
> > > > unset, subsequent errors will not increment the counter until the user
> > > > calls sync on the upperdir's filesystem. If overlayfs calls
> > > > check_and_advance on the upperdir's super block at any point, it will
> > > > then set the SEEN flag, and if the user calls syncfs on the upperdir, it
> > > > will not return that there is an outstanding error, since overlayfs just
> > > > cleared it.
> >
> > Your concern is this case:
> >
> > fs is mounted on /workdir
> > /workdir/A is written to and then closed.
> > writeback happens and -EIO happens, but there's nobody around to care.
> > /workdir/upperdir1 becomes part of an overlayfs mount
> > overlayfs samples the error
> > a user writes to /workdir/B, another -EIO occurs, but nothing happens
> > someone calls syncfs on /workdir/upperdir/A, gets the EIO.
> > a user opens /workdir/B and calls syncfs, but sees no error
> >
> > do i have that right?  or is it something else?
> 
> IMO it is something else. Others may disagree.
> IMO the level of interference between users accessing overlay and users
> accessing upper fs directly is not well defined and it can stay this way.
> 
> Concurrent access to  /workdir/upperdir/A via overlay and underlying fs
> > is explicitly warned against in Documentation/filesystems/overlayfs.rst#
> Changes to underlying filesystems:
> "Changes to the underlying filesystems while part of a mounted overlay
> filesystem are not allowed.  If the underlying filesystem is changed,
> the behavior of the overlay is undefined, though it will not result in
> a crash or deadlock."
> 
> The question is whether syncfs(open(/workdir/B)) is considered
> "Changes to the underl

Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2020-12-23 Thread Sargun Dhillon

On Wed, Dec 23, 2020 at 08:07:46PM +0000, Matthew Wilcox wrote:
> On Wed, Dec 23, 2020 at 07:29:41PM +0000, Sargun Dhillon wrote:
> > On Wed, Dec 23, 2020 at 06:50:44PM +0000, Matthew Wilcox wrote:
> > > On Wed, Dec 23, 2020 at 06:20:27PM +0000, Sargun Dhillon wrote:
> > > > I fail to see why this is necessary if you incorporate error reporting
> > > > into the sync_fs callback. Why is this separate from that callback? If
> > > > you pick up Jeff's patch that adds the 2nd flag to errseq for
> > > > "observed", you should be able to stash the first errseq seen in the
> > > > ovl_fs struct, and do the check-and-return in there instead of adding
> > > > this new infrastructure.
> > > 
> > > You still haven't explained why you want to add the "observed" flag.
> > 
> > 
> > > In the overlayfs model, many users may be using the same filesystem (super
> > > block) for their upperdir. Let's say you have something like this:
> > 
> > /workdir [Mounted FS]
> > /workdir/upperdir1 [overlayfs upperdir]
> > /workdir/upperdir2 [overlayfs upperdir]
> > /workdir/userscratchspace
> > 
> > The user needs to be able to do something like:
> > sync -f ${overlayfs1}/file
> > 
> > > which in turn will call sync on the underlying filesystem (the one mounted
> > > on /workdir), and can check if the errseq has changed since the overlayfs
> > > was mounted, and use that to return an error to the user.
> 
> OK, but I don't see why the current scheme doesn't work for this.  If
> (each instance of) overlayfs samples the errseq at mount time and then
> check_and_advances it at sync time, it will see any error that has occurred
> since the mount happened (and possibly also an error which occurred before
> the mount happened, but hadn't been reported to anybody before).
> 

If there is an outstanding error at mount time, and the SEEN flag is unset,
subsequent errors will not increment the counter until the user calls sync on
the upperdir's filesystem. If overlayfs calls check_and_advance on the
upperdir's super block at any point, it will then set the SEEN flag, and if the
user calls syncfs on the upperdir, it will not return that there is an
outstanding error, since overlayfs just cleared it.


> > > If we do not advance the errseq on the upperdir to "mark it as seen", that
> > > means future errors will not be reported if the user calls sync -f
> > > ${overlayfs1}/file, because errseq will not increment the value if the
> > > seen bit is unset.
> > > 
> > > On the other hand, if we mark it as seen, then if the user calls sync on
> > > /workdir/userscratchspace/file, they won't see the error since we just set
> > > the SEEN flag.
> 
> While we set the SEEN flag, if the file were opened before the error
> occurred, we would still report the error because the sequence is higher
> than it was when we sampled the error.
> 

Right, this isn't a problem for people calling f(data)sync on a particular
file, because it takes its own snapshot of errseq. This is only problematic for
folks calling syncfs. In Jeff's other messages, it sounded like this behaviour
is pretty important, and the likes of postgresql depend on it.

Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2020-12-23 Thread Sargun Dhillon

On Wed, Dec 23, 2020 at 06:50:44PM +0000, Matthew Wilcox wrote:
> On Wed, Dec 23, 2020 at 06:20:27PM +0000, Sargun Dhillon wrote:
> > I fail to see why this is necessary if you incorporate error reporting
> > into the sync_fs callback. Why is this separate from that callback? If you
> > pick up Jeff's patch that adds the 2nd flag to errseq for "observed", you
> > should be able to stash the first errseq seen in the ovl_fs struct, and do
> > the check-and-return in there instead of adding this new infrastructure.
> 
> You still haven't explained why you want to add the "observed" flag.

In the overlayfs model, many users may be using the same filesystem (super
block) for their upperdir. Let's say you have something like this:

/workdir [Mounted FS]
/workdir/upperdir1 [overlayfs upperdir]
/workdir/upperdir2 [overlayfs upperdir]
/workdir/userscratchspace

The user needs to be able to do something like:
sync -f ${overlayfs1}/file

which in turn will call sync on the underlying filesystem (the one mounted
on /workdir), and can check if the errseq has changed since the overlayfs was
mounted, and use that to return an error to the user.

If we do not advance the errseq on the upperdir to "mark it as seen", that
means future errors will not be reported if the user calls sync -f
${overlayfs1}/file, because errseq will not increment the value if the seen bit
is unset.

On the other hand, if we mark it as seen, then if the user calls sync on
/workdir/userscratchspace/file, they won't see the error since we just set the
SEEN flag.

You need a new flag (observed) to differentiate between "Seen and reported to 
user" versus "seen by a second-order system, so should now increment".

One alternative is to always increment the errseq error counter, but I've
gotta imagine there's a reason that wasn't done in the first place.
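
To make the two failure modes above concrete, here is a small userspace
model of the errseq behaviour being discussed. It is a single-threaded
sketch, not the kernel code: the model_* names are made up, the bit layout
(12 errno bits, one SEEN bit, then a counter) is my reading of lib/errseq.c,
and the cmpxchg loops are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Model of errseq_t: low 12 bits hold the errno, bit 12 is SEEN, and the
 * remaining bits are a counter. (Assumed layout, see lead-in.) */
typedef uint32_t errseq_t;

#define MAX_ERRNO      4095u
#define ERRSEQ_SEEN    (1u << 12)
#define ERRSEQ_CTR_INC (1u << 13)

/* Record an error; the counter only moves if the previous value was seen. */
static void model_errseq_set(errseq_t *eseq, int err)
{
    errseq_t old = *eseq;
    errseq_t next;

    assert(err < 0 && (unsigned int)-err <= MAX_ERRNO);
    next = (old & ~(MAX_ERRNO | ERRSEQ_SEEN)) | (errseq_t)-err;
    if (old & ERRSEQ_SEEN)
        next += ERRSEQ_CTR_INC;
    *eseq = next;
}

/* Sample for a new opener: an unseen error samples as 0, so the opener
 * will still get it reported later. */
static errseq_t model_errseq_sample(const errseq_t *eseq)
{
    return (*eseq & ERRSEQ_SEEN) ? *eseq : 0;
}

/* Report whether anything changed since the sample, without marking SEEN. */
static int model_errseq_check(const errseq_t *eseq, errseq_t since)
{
    return (*eseq == since) ? 0 : -(int)(*eseq & MAX_ERRNO);
}

/* Report a change since the sample, mark it SEEN, and advance the sample. */
static int model_errseq_check_and_advance(errseq_t *eseq, errseq_t *since)
{
    if (*eseq == *since)
        return 0;
    *eseq |= ERRSEQ_SEEN;
    *since = *eseq;
    return -(int)(*eseq & MAX_ERRNO);
}
```

In this model, if overlayfs stashes the raw value of an unseen error and a
second -EIO arrives, the value does not change, so model_errseq_check()
against the stashed value reports nothing. If overlayfs instead calls
model_errseq_check_and_advance(), the error becomes SEEN, and a file opened
afterwards on the upper filesystem samples the current value and gets 0 back
from its own check.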

Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2020-12-23 Thread Sargun Dhillon

On Mon, Dec 21, 2020 at 02:50:55PM -0500, Vivek Goyal wrote:
> Currently syncfs() and fsync() seem to be two interfaces which check and
> return writeback errors on superblock to user space. fsync() should
> work fine with overlayfs as it relies on underlying filesystem to
> do the check and return error. For example, if ext4 is on upper filesystem,
> then ext4_sync_file() calls file_check_and_advance_wb_err(file) on
> upper file and returns error. So overlayfs does not have to do anything
> special.
> 
> But with syncfs(), error check happens in vfs in syncfs() w.r.t
> overlay_sb->s_wb_err. Given overlayfs is stacked filesystem, it
> does not do actual writeback and all writeback errors are recorded
> on underlying filesystem. So sb->s_wb_err is never updated hence
> syncfs() does not work with overlay.
> 
> Jeff suggested that instead of trying to propagate errors to overlay
> super block, why not simply check for errors against upper filesystem
> super block. I implemented this idea.
> 
> Overlay file has "since" value which needs to be initialized at open
> time. Overlay overrides VFS initialization and re-initializes
> f->f_sb_err w.r.t upper super block. Later when
> ovl_sb->errseq_check_advance() is called, f->f_sb_err is used as
> since value to figure out if any error on upper sb has happened since
> then.
> 
> Note, Right now this patch only deals with regular file and directories.
> Yet to deal with special files like device inodes, socket, fifo etc.
> 
> Suggested-by: Jeff Layton 
> Signed-off-by: Vivek Goyal 
> ---
>  fs/overlayfs/file.c  |  1 +
>  fs/overlayfs/overlayfs.h |  1 +
>  fs/overlayfs/readdir.c   |  1 +
>  fs/overlayfs/super.c | 23 +++
>  fs/overlayfs/util.c  | 13 +
>  5 files changed, 39 insertions(+)
> 
> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> index efccb7c1f9bc..7b58a44dcb71 100644
> --- a/fs/overlayfs/file.c
> +++ b/fs/overlayfs/file.c
> @@ -163,6 +163,7 @@ static int ovl_open(struct inode *inode, struct file *file)
>   return PTR_ERR(realfile);
>  
>   file->private_data = realfile;
> + ovl_init_file_errseq(file);
>  
>   return 0;
>  }
> diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> index f8880aa2ba0e..47838abbfb3d 100644
> --- a/fs/overlayfs/overlayfs.h
> +++ b/fs/overlayfs/overlayfs.h
> @@ -322,6 +322,7 @@ int ovl_check_metacopy_xattr(struct ovl_fs *ofs, struct dentry *dentry);
>  bool ovl_is_metacopy_dentry(struct dentry *dentry);
>  char *ovl_get_redirect_xattr(struct ovl_fs *ofs, struct dentry *dentry,
>int padding);
> +void ovl_init_file_errseq(struct file *file);
>  
>  static inline bool ovl_is_impuredir(struct super_block *sb,
>   struct dentry *dentry)
> diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> index 01620ebae1bd..0c48f1545483 100644
> --- a/fs/overlayfs/readdir.c
> +++ b/fs/overlayfs/readdir.c
> @@ -960,6 +960,7 @@ static int ovl_dir_open(struct inode *inode, struct file *file)
>   od->is_real = ovl_dir_is_real(file->f_path.dentry);
>   od->is_upper = OVL_TYPE_UPPER(type);
>   file->private_data = od;
> + ovl_init_file_errseq(file);
>  
>   return 0;
>  }
> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> index 290983bcfbb3..d99867983722 100644
> --- a/fs/overlayfs/super.c
> +++ b/fs/overlayfs/super.c
> @@ -390,6 +390,28 @@ static int ovl_remount(struct super_block *sb, int *flags, char *data)
>   return ret;
>  }
>  
> +static int ovl_errseq_check_advance(struct super_block *sb, struct file *file)
> +{
> + struct ovl_fs *ofs = sb->s_fs_info;
> + struct super_block *upper_sb;
> + int ret;
> +
> + if (!ovl_upper_mnt(ofs))
> + return 0;
> +
> + upper_sb = ovl_upper_mnt(ofs)->mnt_sb;
> +
> + if (!errseq_check(&upper_sb->s_wb_err, file->f_sb_err))
> + return 0;
> +
> + /* Something changed, must use slow path */
> + spin_lock(&file->f_lock);
> + ret = errseq_check_and_advance(&upper_sb->s_wb_err, &file->f_sb_err);
> + spin_unlock(&file->f_lock);
> +
> + return ret;
> +}
> +
>  static const struct super_operations ovl_super_operations = {
>   .alloc_inode= ovl_alloc_inode,
>   .free_inode = ovl_free_inode,
> @@ -400,6 +422,7 @@ static const struct super_operations ovl_super_operations = {
>   .statfs = ovl_statfs,
>   .show_options   = ovl_show_options,
>   .remount_fs = ovl_remount,
> + .errseq_check_advance   = ovl_errseq_check_advance,
>  };
>  
>  enum {
> diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
> index 23f475627d07..a1742847f3a8 100644
> --- a/fs/overlayfs/util.c
> +++ b/fs/overlayfs/util.c
> @@ -950,3 +950,16 @@ char *ovl_get_redirect_xattr(struct ovl_fs *ofs, struct dentry *dentry,
>   kfree(buf);
>   return ERR_PTR(res);
>  }
> +
> +void ovl_init_file_errseq(struct file *file)
> +{
> + struct super_block *sb =

[PATCH RESEND v5 2/2] NFSv4: Refactor to use user namespaces for nfs4idmap

2020-12-13 Thread Sargun Dhillon

In several patches work has been done to enable NFSv4 to use user
namespaces:
58002399da65: NFSv4: Convert the NFS client idmapper to use the container user namespace
3b7eb5e35d0f: NFS: When mounting, don't share filesystems between different user namespaces

Unfortunately, the userspace APIs were only such that the userspace facing
side of the filesystem (superblock s_user_ns) could be set to a non init
user namespace. This furthers the fs_context related refactoring, and
piggybacks on top of that logic, so the superblock user namespace, and the
NFS user namespace are the same.

Users can still use rpc.idmapd if they choose to, but there are complexities
with user namespaces and request-key that have yet to be addressed.

Eventually, we will need to at least:
  * Separate out the keyring cache by namespace
  * Come up with an upcall mechanism that can be triggered inside of the
    container, or safely triggered outside, with the requisite context to do
    the right mapping.
  * Handle whatever refactoring needs to be done in net/sunrpc.

Signed-off-by: Sargun Dhillon 
Tested-by: Alban Crequy 
---
 fs/nfs/nfs4client.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
index be7915c861ce..86acffe7335c 100644
--- a/fs/nfs/nfs4client.c
+++ b/fs/nfs/nfs4client.c
@@ -1153,7 +1153,7 @@ struct nfs_server *nfs4_create_server(struct fs_context *fc)
if (!server)
return ERR_PTR(-ENOMEM);
 
-   server->cred = get_cred(current_cred());
+   server->cred = get_cred(fc->cred);
 
auth_probe = ctx->auth_info.flavor_len < 1;
 
-- 
2.25.1

[PATCH RESEND v5 0/2] NFS: Fix interaction between fs_context and user namespaces

2020-12-13 Thread Sargun Dhillon

This is a resend[2] for consideration into the next NFS client merge window.

Right now, it is possible to mount NFS with a non-matching super block
user ns, and NFS sunrpc user ns. This (for the user) results in an awkward
set of interactions if using anything other than auth_null, where the UIDs
being sent to the server are different than the local UIDs being checked.
This can cause "breakage", where if you try to communicate with the NFS
server with any other set of mappings, it breaks.

The reason for this is that you can call fsopen("nfs4") in the unprivileged
namespace, and that configures fs_context with all the right information
for that user namespace. In addition, it also gets a cred object
associated with the caller -- which should match the user namespace.
Unfortunately, the mount has to be finished in the init_user_ns because we
currently require CAP_SYS_ADMIN in the init user namespace to call fsmount.
This means that the superblock's user namespace is set "correctly" to the
container, but there's absolutely no way for nfs4idmap to consume an
unprivileged user namespace because the cred / user_ns that's passed down
to nfs4idmap is the one at fsmount.

How this actually exhibits: let's say that UID 0 in the user
namespace is mapped to UID 1000 in the init user ns (and kuid space). What
will happen is that nfs4idmap will translate the UID 1000 into UID 0 on the
wire, even if the mount is entirely in the mount / user namespace of the
container.

So, it looks something like this:
Client in unprivileged User NS (UID: 0, KUID: 0)
->Perform open()
...VFS / NFS bits...
nfs_map_uid_to_name ->
from_kuid_munged(init_user_ns, uid) (returns 0)
RPC with UID 0

This behaviour happens "the other way" as well, where the UID in the
container may be 0, but the corresponding kuid is 1000. When a response
from an NFS server comes in we decode it according to the idmap userns.
The way this exhibits is even more odd.

Server responds with file attribute (UID: 0, GID: 0)
->nfs_map_name_to_uid(..., 0)
->make_kuid(init_user_ns, id) (returns 0)
VFS / NFS Bits...
->from_kuid(container_ns, 0) -> invalid uid
-> EOVERFLOW

This changes the nfs server to use the cred / userns from fs_context, which
is how idmap is constructed. This subsequently is used in the above
described flow of converting uids back-and-forth.

Trond gave the feedback that this behaviour [implemented by this patch] is
how the legacy sys_mount() behaviour worked[1], and that the intended
behaviour is for UIDs to be plumbed through entirely, where the user
namespaces UIDs are what is sent over the wire, and not the init user ns.

[1]: https://lore.kernel.org/linux-nfs/8feccf45f6575a204da03e796391cc135283eb88.ca...@hammerspace.com/
[2]: https://lore.kernel.org/linux-nfs/20201112100952.3514-1-sar...@sargun.me/

Sargun Dhillon (2):
  NFS: NFSv2/NFSv3: Use cred from fs_context during mount
  NFSv4: Refactor to use user namespaces for nfs4idmap

 fs/nfs/client.c | 4 ++--
 fs/nfs/nfs4client.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

-- 
2.25.1

[PATCH RESEND v5 1/2] NFS: NFSv2/NFSv3: Use cred from fs_context during mount

2020-12-13 Thread Sargun Dhillon

There was refactoring done to use the fs_context for mounting done in:
62a55d088cd87: NFS: Additional refactoring for fs_context conversion

This made it so that the net_ns is fetched from the fs_context (the netns
that fsopen is called in). This change also makes it so that the credential
fetched during fsopen is used as well as the net_ns.

NFS has already had a number of changes to prepare it for user namespaces:
1a58e8a0e5c1: NFS: Store the credential of the mount process in the nfs_server
264d948ce7d0: NFS: Convert NFSv3 to use the container user namespace
c207db2f5da5: NFS: Convert NFSv2 to use the container user namespace

Previously, different credentials could be used for creation of the
fs_context versus creation of the nfs_server, as FSCONFIG_CMD_CREATE did
the actual credential check, and that's where current_cred() was fetched.
This meant that the user namespace which fsopen was called in could be a
non-init user namespace. This still requires that the user that calls
FSCONFIG_CMD_CREATE has CAP_SYS_ADMIN in the init user ns.

This roughly allows a privileged user to mount on behalf of an unprivileged
user namespace, by forking off and calling fsopen in the unprivileged user
namespace. It can then pass back that fsfd to the privileged process which
can configure the NFS mount, and then it can call FSCONFIG_CMD_CREATE
before switching back into the mount namespace of the container, and finish
up the mounting process and call fsmount and move_mount.

Signed-off-by: Sargun Dhillon 
Tested-by: Alban Crequy 
---
 fs/nfs/client.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 4b8cc93913f7..1e6f3b3ed445 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -571,7 +571,7 @@ static int nfs_start_lockd(struct nfs_server *server)
1 : 0,
.net= clp->cl_net,
.nlmclnt_ops= clp->cl_nfs_mod->rpc_ops->nlmclnt_ops,
-   .cred   = current_cred(),
+   .cred   = server->cred,
};
 
if (nlm_init.nfs_version > 3)
@@ -985,7 +985,7 @@ struct nfs_server *nfs_create_server(struct fs_context *fc)
if (!server)
return ERR_PTR(-ENOMEM);
 
-   server->cred = get_cred(current_cred());
+   server->cred = get_cred(fc->cred);
 
error = -ENOMEM;
fattr = nfs_alloc_fattr();
-- 
2.25.1

Re: SECCOMP_IOCTL_NOTIF_ADDFD race condition

2020-12-01 Thread Sargun Dhillon

On Tue, Dec 01, 2020 at 07:41:05AM -0500, Tycho Andersen wrote:
> On Mon, Nov 30, 2020 at 06:20:09PM -0500, Tycho Andersen wrote:
> > Idea 1 sounds best to me, but maybe that's because it's the way I
> > originally did the fd support that never landed :)
> > 
> > But here's an Idea 4: we add a way to remotely close an fd (I don't
> > see that the current infra can do this, but perhaps I didn't look hard
> > enough), and then when you get ENOENT you have to close the fd. Of
> > course, this can't be via seccomp, so maybe it's even more racy.
> 
> Or better yet: what if the kernel closed everything it had added via
> ADDFD if it didn't get a valid response from the supervisor? Then
> everyone gets this bug fixed for free.
> 
> Tycho
> ___
> Containers mailing list
> contain...@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

This doesn't solve the problem universally because of the (Go) preemption
problem, unless we can guarantee that the supervisor can always handle the
request in fewer than 10ms, or it implements resumption behaviour. I know
that resumption behaviour is a requirement no matter what, but the easier we
can make it to implement resumption, the better chance we are giving users to
get this right.
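
As a rough sketch of what the supervisor side of "resumption" has to deal
with today: SECCOMP_IOCTL_NOTIF_SEND fails with ENOENT when the notification
it answers is no longer pending, which is what the agent sees after the
target has been interrupted. The helper name and return convention below are
made up for the example.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>

/*
 * Hypothetical helper: reply to the notification in *req with return value
 * val. Returns 1 if the reply was delivered, 0 if the notification is gone
 * (ENOENT, e.g. the syscall was interrupted and will show up again as a new
 * notification), or a negative errno for any other failure.
 */
static int reply_or_detect_restart(int listener_fd,
                                   const struct seccomp_notif *req,
                                   __s64 val)
{
    struct seccomp_notif_resp resp;

    assert(req != NULL);

    memset(&resp, 0, sizeof(resp));
    resp.id = req->id;
    resp.val = val;

    if (ioctl(listener_fd, SECCOMP_IOCTL_NOTIF_SEND, &resp) == 0)
        return 1;
    if (errno == ENOENT)
        return 0;
    return -errno;
}
```

A return of 0 is the case where the supervisor has to undo or remember its
side effects, since the same syscall is likely to be notified again.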

I think that the easiest solution is to extend the SECCOMP_IOCTL_NOTIF_RECV
ioctl. Either we have a flag like "block all blockable signals" or pass a
sigset_t directly of which signals to allow (and return EINVAL if they
try to block non-blockable signals).

Re: SECCOMP_IOCTL_NOTIF_ADDFD race condition

2020-11-30 Thread Sargun Dhillon

On Mon, Nov 30, 2020 at 06:20:09PM -0500, Tycho Andersen wrote:
> Hi,
> 
> On Thu, Nov 26, 2020 at 02:09:33PM +0100, Alban Crequy wrote:
> > Hi,
> > 
> > With the addfd feature (added in “seccomp: Introduce addfd ioctl to
> > seccomp user notifier”, commit 7cf97b125455), the new file is
> > installed in the target process during the SECCOMP_IOCTL_NOTIF_ADDFD
> > operation and not at the end with the SECCOMP_IOCTL_NOTIF_SEND
> > operation. This can cause race conditions when the target process is
> > interrupted by a signal (EINTR) and restarted automatically.
> > 
> > This is more noticeable in multithreaded processes like with Golang.
> > In Golang 1.14:
> > https://golang.org/doc/go1.14
> > > "A consequence of the implementation of preemption is that on Unix 
> > > systems, including Linux and macOS systems, programs built with Go 1.14 
> > > will receive more signals than programs built with earlier releases. This 
> > > means that programs that use packages like syscall or 
> > > golang.org/x/sys/unix will see more slow system calls fail with EINTR 
> > > errors. Those programs will have to handle those errors in some way, most 
> > > likely looping to try the system call again."
> > 
> > In my test, I added a seccomp policy which returns
> > SECCOMP_RET_USER_NOTIF on execve() and I added a sleep(2) in the
> > seccomp agent (using https://github.com/kinvolk/seccompagent/) between
> > SECCOMP_IOCTL_NOTIF_RECV and SECCOMP_IOCTL_NOTIF_SEND to make it a bit
> > slow to reply with SECCOMP_USER_NOTIF_FLAG_CONTINUE. I got the
> > following strace log going on in a loop:
> > 
> > [pid 2656199] execve("/bin/sh", ["sh", "-c", "sleep infinity"],
> > 0xc63b00 /* 11 vars */ 
> > [pid 2656200] <... nanosleep resumed>NULL) = 0
> > [pid 2656200] epoll_pwait(7, [], 128, 0, NULL, 0) = 0
> > [pid 2656200] getpid()  = 1
> > [pid 2656200] tgkill(1, 1, SIGURG)  = 0
> > [pid 2656199] <... execve resumed>) = ? ERESTARTSYS (To be
> > restarted if SA_RESTART is set)
> > [pid 2656200] nanosleep({tv_sec=0, tv_nsec=1000},  
> > [pid 2656199] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1,
> > si_uid=0} ---
> > [pid 2656199] rt_sigreturn({mask=[]})   = 59
> > [pid 2656199] execve("/bin/sh", ["sh", "-c", "sleep infinity"],
> > 0xc63b00 /* 11 vars */ 
> > 
> > On the seccomp agent side, the ioctl(SECCOMP_IOCTL_NOTIF_SEND) returns
> > ENOENT, and then it receives the same notification at the next
> > iteration of the loop.
> > 
> > The SIGURG signal is sent by the Golang runtime, causing the execve to
> > be interrupted, and restarted automatically, triggering the new
> > seccomp notification. In this example with execve, this is not a big
> > deal because the seccomp agent doesn't add a fd. But on a open() or
> > accept() syscall, I fear that the seccomp agent could install a file
> > descriptor without knowing that the syscall will be interrupted soon
> > after, but before the SECCOMP_IOCTL_NOTIF_SEND is completed.
> > 
> > I understand the need to have two different ioctl() to add the fd and
> > to reply to the seccomp notification because the seccomp agent needs
> > to know the fd number being assigned before specifying the return
> > value of the syscall with that number.
> > 
> > What do you think is the best way to solve this problem? Here are a few ideas:
> > 
> > - Idea 1: add a second flag for the struct seccomp_notif_resp
> > “SECCOMP_USER_NOTIF_FLAG_RETURN_FD” to instruct seccomp to override
> > the return value with the first fd to install. It would not help to
> > emulate recvfrom() with SCM_RIGHTS but it will solve the problem for
> > syscalls that return a fd because we can then implement a new ioctl
> > (“SECCOMP_IOCTL_NOTIF_SEND_WITH_FDS”?) that does the addfd and the
> > notification response in one step.
> > 
> > Other ideas but they cause more problems:
> > 
> > - Idea 2: We need some kind of transactions where the fd is sent with
> > the first ioctl() and installed in the fd table but marked somehow to
> > be closed automatically if the syscall is interrupted with EINTR
> > outside of the control of the seccomp agent. The new fd in the fd
> > table would be committed at the end if the syscall is not interrupted.
> > But this introduces other issues: another thread could call dup() on
> > the fd before it gets closed. Or another process sharing the fd table
> > with CLONE_FILES could do the same. Should the not-yet-committed fds
> > be visible in /proc//fd/? Or inherited to new processes created
> > by fork()?
> > 
> > - Idea 3: We could add fds in a temporary location but not in the
> > `struct files_struct` of the target process, and only commit at
> > SECCOMP_IOCTL_NOTIF_SEND time. In this way, threads or processes
> > sharing the fd table with CLONE_FILES would not be impacted. However,
> > this could open new race conditions if other threads are installing
> > fds in the same slots in the fd table. Also, this seems quite
> > dangerous to add this concept of

Re: [PATCH v5 0/2] NFS: Fix interaction between fs_context and user namespaces

2020-11-24 Thread Sargun Dhillon

On Thu, Nov 12, 2020 at 02:09:50AM -0800, Sargun Dhillon wrote:
> Right now, it is possible to mount NFS with a non-matching super block
> user ns, and NFS sunrpc user ns. This (for the user) results in an awkward
> set of interactions if using anything other than auth_null, where the UIDs
> being sent to the server are different than the local UIDs being checked.
> This can cause "breakage", where if you try to communicate with the NFS
> server with any other set of mappings, it breaks.
> 
> The reason for this is that you can call fsopen("nfs4") in the unprivileged
> namespace, and that configures fs_context with all the right information
> for that user namespace. In addition, it also gets a cred object
> associated with the caller -- which should match the user namespace.
> Unfortunately, the mount has to be finished in the init_user_ns because we
> currently require CAP_SYS_ADMIN in the init user namespace to call fsmount.
> This means that the superblock's user namespace is set "correctly" to the
> container, but there's absolutely no way for nfs4idmap to consume an
> unprivileged user namespace because the cred / user_ns that's passed down
> to nfs4idmap is the one at fsmount.
> 
> How this actually exhibits is let's say that the UID 0 in the user
> namespace is mapped to UID 1000 in the init user ns (and kuid space). What
> will happen is that nfs4idmap will translate the UID 1000 into UID 0 on the
> wire, even if the mount is entirely in the mount / user namespace of the
> container.
> 
> So, it looks something like this
> Client in unprivileged User NS (UID: 0, KUID: 0)
>   ->Perform open()
>   ...VFS / NFS bits...
>   nfs_map_uid_to_name ->
>   from_kuid_munged(init_user_ns, uid) (returns 0)
>   RPC with UID 0
> 
> This behaviour happens "the other way" as well, where the UID in the
> container may be 0, but the corresponding kuid is 1000. When a response
> from an NFS server comes in we decode it according to the idmap userns.
> The way this exhibits is even more odd.
> 
> Server responds with file attribute (UID: 0, GID: 0)
>   ->nfs_map_name_to_uid(..., 0)
>   ->make_kuid(init_user_ns, id) (returns 0)
>   VFS / NFS Bits...
>   ->from_kuid(container_ns, 0) -> invalid uid
>   -> EOVERFLOW
> 
> This changes the nfs server to use the cred / userns from fs_context, which
> is how idmap is constructed. This subsequently is used in the above
> described flow of converting uids back-and-forth.
> 
> Trond gave the feedback that this behaviour [implemented by this patch] is
> how the legacy sys_mount() behaviour worked[1], and that the intended
> behaviour is for UIDs to be plumbed through entirely, where the user
> namespaces UIDs are what is sent over the wire, and not the init user ns.
> 
> [1]: 
> https://lore.kernel.org/linux-nfs/8feccf45f6575a204da03e796391cc135283eb88.ca...@hammerspace.com/
> 
> Sargun Dhillon (2):
>   NFS: NFSv2/NFSv3: Use cred from fs_context during mount
>   NFSv4: Refactor to use user namespaces for nfs4idmap
> 
>  fs/nfs/client.c | 4 ++--
>  fs/nfs/nfs4client.c | 2 +-
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> 
> base-commit: 8c39076c276be0b31982e44654e2c2357473258a
> -- 
> 2.25.1
>


Trond,
Are there any other concerns you have before landing this, or do you want
to wait until the v5.11 merge window?

Re: [PATCH v5 0/2] NFS: Fix interaction between fs_context and user namespaces

2020-11-13 Thread Sargun Dhillon

On Thu, Nov 12, 2020 at 02:09:50AM -0800, Sargun Dhillon wrote:
> Right now, it is possible to mount NFS with a non-matching super block
> user ns, and NFS sunrpc user ns. This (for the user) results in an awkward
> set of interactions if using anything other than auth_null, where the UIDs
> being sent to the server are different than the local UIDs being checked.
> This can cause "breakage", where if you try to communicate with the NFS
> server with any other set of mappings, it breaks.
> 
> The reason for this is that you can call fsopen("nfs4") in the unprivileged
> namespace, and that configures fs_context with all the right information
> for that user namespace. In addition, it also gets a cred object
> associated with the caller -- which should match the user namespace.
> Unfortunately, the mount has to be finished in the init_user_ns because we
> currently require CAP_SYS_ADMIN in the init user namespace to call fsmount.
> This means that the superblock's user namespace is set "correctly" to the
> container, but there's absolutely no way for nfs4idmap to consume an
> unprivileged user namespace because the cred / user_ns that's passed down
> to nfs4idmap is the one at fsmount.
> 
> How this actually exhibits is let's say that the UID 0 in the user
> namespace is mapped to UID 1000 in the init user ns (and kuid space). What
> will happen is that nfs4idmap will translate the UID 1000 into UID 0 on the
> wire, even if the mount is entirely in the mount / user namespace of the
> container.
> 
> So, it looks something like this
> Client in unprivileged User NS (UID: 0, KUID: 0)
>   ->Perform open()
>   ...VFS / NFS bits...
>   nfs_map_uid_to_name ->
>   from_kuid_munged(init_user_ns, uid) (returns 0)
>   RPC with UID 0
> 
> This behaviour happens "the other way" as well, where the UID in the
> container may be 0, but the corresponding kuid is 1000. When a response
> from an NFS server comes in we decode it according to the idmap userns.
> The way this exhibits is even more odd.
> 
> Server responds with file attribute (UID: 0, GID: 0)
>   ->nfs_map_name_to_uid(..., 0)
>   ->make_kuid(init_user_ns, id) (returns 0)
>   VFS / NFS Bits...
>   ->from_kuid(container_ns, 0) -> invalid uid
>   -> EOVERFLOW
> 
> This changes the nfs server to use the cred / userns from fs_context, which
> is how idmap is constructed. This subsequently is used in the above
> described flow of converting uids back-and-forth.
> 
> Trond gave the feedback that this behaviour [implemented by this patch] is
> how the legacy sys_mount() behaviour worked[1], and that the intended
> behaviour is for UIDs to be plumbed through entirely, where the user
> namespaces UIDs are what is sent over the wire, and not the init user ns.
> 
> [1]: 
> https://lore.kernel.org/linux-nfs/8feccf45f6575a204da03e796391cc135283eb88.ca...@hammerspace.com/
> 
> Sargun Dhillon (2):
>   NFS: NFSv2/NFSv3: Use cred from fs_context during mount
>   NFSv4: Refactor to use user namespaces for nfs4idmap
> 
>  fs/nfs/client.c | 4 ++--
>  fs/nfs/nfs4client.c | 2 +-
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> 
> base-commit: 8c39076c276be0b31982e44654e2c2357473258a
> -- 
> 2.25.1
> 
Trond,

I was just thinking, since you said that this is the behaviour of the sys_mount 
API, would this be considered a regression? Should it go to stable (v5.9)?

[PATCH v5 1/2] NFS: NFSv2/NFSv3: Use cred from fs_context during mount

2020-11-12 Thread Sargun Dhillon

There was refactoring done to use the fs_context for mounting done in:
62a55d088cd87: NFS: Additional refactoring for fs_context conversion

This made it so that the net_ns is fetched from the fs_context (the netns
that fsopen is called in). This change also makes it so that the credential
fetched during fsopen is used as well as the net_ns.

NFS has already had a number of changes to prepare it for user namespaces:
1a58e8a0e5c1: NFS: Store the credential of the mount process in the nfs_server
264d948ce7d0: NFS: Convert NFSv3 to use the container user namespace
c207db2f5da5: NFS: Convert NFSv2 to use the container user namespace

Previously, different credentials could be used for creation of the
fs_context versus creation of the nfs_server, as FSCONFIG_CMD_CREATE did
the actual credential check, and that's where current_creds() were fetched.
This meant that the user namespace which fsopen was called in could be a
non-init user namespace. This still requires that the user that calls
FSCONFIG_CMD_CREATE has CAP_SYS_ADMIN in the init user ns.

This roughly allows a privileged user to mount on behalf of an unprivileged
usernamespace, by forking off and calling fsopen in the unprivileged user
namespace. It can then pass back that fsfd to the privileged process which
can configure the NFS mount, and then it can call FSCONFIG_CMD_CREATE
before switching back into the mount namespace of the container, and finish
up the mounting process and call fsmount and move_mount.

Signed-off-by: Sargun Dhillon 
Tested-by: Alban Crequy 
---
 fs/nfs/client.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 4b8cc93913f7..1e6f3b3ed445 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -571,7 +571,7 @@ static int nfs_start_lockd(struct nfs_server *server)
1 : 0,
.net= clp->cl_net,
.nlmclnt_ops= clp->cl_nfs_mod->rpc_ops->nlmclnt_ops,
-   .cred   = current_cred(),
+   .cred   = server->cred,
};
 
if (nlm_init.nfs_version > 3)
@@ -985,7 +985,7 @@ struct nfs_server *nfs_create_server(struct fs_context *fc)
if (!server)
return ERR_PTR(-ENOMEM);
 
-   server->cred = get_cred(current_cred());
+   server->cred = get_cred(fc->cred);
 
error = -ENOMEM;
fattr = nfs_alloc_fattr();
-- 
2.25.1

[PATCH v5 2/2] NFSv4: Refactor to use user namespaces for nfs4idmap

2020-11-12 Thread Sargun Dhillon

In several patches work has been done to enable NFSv4 to use user
namespaces:
58002399da65: NFSv4: Convert the NFS client idmapper to use the container user namespace
3b7eb5e35d0f: NFS: When mounting, don't share filesystems between different user namespaces

Unfortunately, the userspace APIs were only such that the userspace facing
side of the filesystem (superblock s_user_ns) could be set to a non init
user namespace. This furthers the fs_context related refactoring, and
piggybacks on top of that logic, so the superblock user namespace, and the
NFS user namespace are the same.

Users can still use rpc.idmapd if they choose to, but there are complexities
with user namespaces and request-key that have yet to be addressed.

Eventually, we will need to at least:
  * Separate out the keyring cache by namespace
  * Come up with an upcall mechanism that can be triggered inside of the
    container, or safely triggered outside, with the requisite context to do
    the right mapping.
  * Handle whatever refactoring needs to be done in net/sunrpc.

Signed-off-by: Sargun Dhillon 
Tested-by: Alban Crequy 
---
 fs/nfs/nfs4client.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
index be7915c861ce..86acffe7335c 100644
--- a/fs/nfs/nfs4client.c
+++ b/fs/nfs/nfs4client.c
@@ -1153,7 +1153,7 @@ struct nfs_server *nfs4_create_server(struct fs_context *fc)
if (!server)
return ERR_PTR(-ENOMEM);
 
-   server->cred = get_cred(current_cred());
+   server->cred = get_cred(fc->cred);
 
auth_probe = ctx->auth_info.flavor_len < 1;
 
-- 
2.25.1

[PATCH v5 0/2] NFS: Fix interaction between fs_context and user namespaces

2020-11-12 Thread Sargun Dhillon

Right now, it is possible to mount NFS with a non-matching super block
user ns, and NFS sunrpc user ns. This (for the user) results in an awkward
set of interactions if using anything other than auth_null, where the UIDs
being sent to the server are different than the local UIDs being checked.
This can cause "breakage", where if you try to communicate with the NFS
server with any other set of mappings, it breaks.

The reason for this is that you can call fsopen("nfs4") in the unprivileged
namespace, and that configures fs_context with all the right information
for that user namespace. In addition, it also gets a cred object
associated with the caller -- which should match the user namespace.
Unfortunately, the mount has to be finished in the init_user_ns because we
currently require CAP_SYS_ADMIN in the init user namespace to call fsmount.
This means that the superblock's user namespace is set "correctly" to the
container, but there's absolutely no way for nfs4idmap to consume an
unprivileged user namespace because the cred / user_ns that's passed down
to nfs4idmap is the one at fsmount.

How this actually exhibits is let's say that the UID 0 in the user
namespace is mapped to UID 1000 in the init user ns (and kuid space). What
will happen is that nfs4idmap will translate the UID 1000 into UID 0 on the
wire, even if the mount is entirely in the mount / user namespace of the
container.

So, it looks something like this
Client in unprivileged User NS (UID: 0, KUID: 0)
->Perform open()
...VFS / NFS bits...
nfs_map_uid_to_name ->
from_kuid_munged(init_user_ns, uid) (returns 0)
RPC with UID 0

This behaviour happens "the other way" as well, where the UID in the
container may be 0, but the corresponding kuid is 1000. When a response
from an NFS server comes in we decode it according to the idmap userns.
The way this exhibits is even more odd.

Server responds with file attribute (UID: 0, GID: 0)
->nfs_map_name_to_uid(..., 0)
->make_kuid(init_user_ns, id) (returns 0)
VFS / NFS Bits...
->from_kuid(container_ns, 0) -> invalid uid
-> EOVERFLOW

This changes the nfs server to use the cred / userns from fs_context, which
is how idmap is constructed. This subsequently is used in the above
described flow of converting uids back-and-forth.
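
To make the decode path above a bit more concrete, here is a toy userspace
model of it. This is only a sketch, not kernel code: the names (toy_userns,
toy_make_kuid, toy_from_kuid, toy_decode_wire_uid) are made up for
illustration, and each namespace is assumed to have a single mapping extent,
which is much simpler than the kernel's real uid_map handling.

```c
#include <stdint.h>

/* Hypothetical sentinel, standing in for the kernel's invalid uid. */
#define TOY_INVALID_UID ((uint32_t)-1)

/* Toy user namespace with exactly one mapping extent (an assumption). */
struct toy_userns {
	uint32_t first;       /* first UID as seen inside the namespace */
	uint32_t lower_first; /* kernel-side UID (kuid) that 'first' maps to */
	uint32_t count;       /* number of UIDs covered by the extent */
};

/* Namespace-local UID -> kuid, in the spirit of make_kuid(). */
static uint32_t toy_make_kuid(const struct toy_userns *ns, uint32_t uid)
{
	if (uid < ns->first || uid - ns->first >= ns->count)
		return TOY_INVALID_UID;
	return ns->lower_first + (uid - ns->first);
}

/* kuid -> namespace-local UID, in the spirit of from_kuid(). */
static uint32_t toy_from_kuid(const struct toy_userns *ns, uint32_t kuid)
{
	if (kuid < ns->lower_first || kuid - ns->lower_first >= ns->count)
		return TOY_INVALID_UID;
	return ns->first + (kuid - ns->lower_first);
}

/*
 * What a process in view_ns ends up seeing for a UID received from the
 * server, when the idmap decodes that UID using idmap_ns.
 */
static uint32_t toy_decode_wire_uid(const struct toy_userns *idmap_ns,
				    const struct toy_userns *view_ns,
				    uint32_t wire_uid)
{
	uint32_t kuid = toy_make_kuid(idmap_ns, wire_uid);

	if (kuid == TOY_INVALID_UID)
		return TOY_INVALID_UID;
	return toy_from_kuid(view_ns, kuid);
}
```

With an init namespace that maps everything 1:1 and a container namespace
that maps UID 0 to kuid 1000, decoding wire UID 0 through the init namespace
gives the invalid UID in the container's view, while decoding it through the
container namespace gives 0.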

Trond gave the feedback that this behaviour [implemented by this patch] is
how the legacy sys_mount() behaviour worked[1], and that the intended
behaviour is for UIDs to be plumbed through entirely, where the user
namespaces UIDs are what is sent over the wire, and not the init user ns.

[1]: 
https://lore.kernel.org/linux-nfs/8feccf45f6575a204da03e796391cc135283eb88.ca...@hammerspace.com/

Sargun Dhillon (2):
  NFS: NFSv2/NFSv3: Use cred from fs_context during mount
  NFSv4: Refactor to use user namespaces for nfs4idmap

 fs/nfs/client.c | 4 ++--
 fs/nfs/nfs4client.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)


base-commit: 8c39076c276be0b31982e44654e2c2357473258a
-- 
2.25.1

Re: [PATCH v4 0/2] NFS: Fix interaction between fs_context and user namespaces

2020-11-11 Thread Sargun Dhillon

On Thu, Nov 12, 2020 at 12:30:56AM +, Sargun Dhillon wrote:
> On Wed, Nov 11, 2020 at 08:03:18PM +, Trond Myklebust wrote:
> > On Wed, 2020-11-11 at 18:57 +0000, Sargun Dhillon wrote:
> > > On Wed, Nov 11, 2020 at 02:38:11PM +, Trond Myklebust wrote:
> > > > On Wed, 2020-11-11 at 11:12 +, Sargun Dhillon wrote:
> > > > 
> > > > The current code for setting server->cred was developed
> > > > independently
> > > > of fsopen() (and predates it actually). I'm fine with the change to
> > > > have server->cred be the cred of the user that called fsopen().
> > > > That's
> > > > in line with what we used to do for sys_mount().
> > > > 
> > > Just curious, without FS_USERNS, how were you mounting NFSv4 in an
> > > unprivileged user ns?
> > 
> > The code was originally developed on a 5.1 kernel. So all my testing
> > has been with ordinary sys_mount() calls in a container that had
> > CAP_SYS_ADMIN privileges.
> > 
> > > > However all the other stuff to throw errors when the user namespace
> > > > is
> > > > not init_user_ns introduces massive regressions.
> > > > 
> > > 
> > > > I can remove that and respin the patch. How do you feel about that? I would
> > > > still like to keep the log lines though because it is a uapi change. I am
> > > > worried that someone might exercise this path with GSS and allow for upcalls
> > > > into the main namespaces by accident -- or be confused of why they're seeing
> > > > upcalls "in a different namespace".
> > > > 
> > > > Are you okay with picking up ("NFS: NFSv2/NFSv3: Use cred from fs_context
> > > > during mount") without any changes?
> > 
> > Why do we need the dprintk()s? It seems to me that either they should
> > be reporting something that the user needs to know (in which case they
> > should be real printk()s) or they are telling us something that we
> > should already know. To me they seem to fit more in the latter
> > category.
> > 
> > > 
> > > I can respin ("NFSv4: Refactor NFS to use user namespaces") without:
> > > /*
> > >  * nfs4idmap is not fully isolated by user namespaces. It is
> > > currently
> > >  * only network namespace aware. If upcalls never happen, we do not
> > >  * need to worry as nfs_client instances aren't shared between
> > >  * user namespaces.
> > >  */
> > > > if (idmap_userns(server->nfs_client->cl_idmap) != &init_user_ns &&
> > > !(server->caps & NFS_CAP_UIDGID_NOMAP)) {
> > > error = -EINVAL;
> > > errorf(fc, "Mount credentials are from non init user
> > > namespace and ID mapping is enabled. This is not allowed.");
> > > goto error;
> > > }
> > > 
> > > (and making it so we can call idmap_userns)
> > > 
> > 
> > Yes. That would be acceptable. Again, though, I'd like to see the
> > dprintk()s gone.
> > 
> 
> I can drop the dprintks, but given this is a uapi change, does it make sense to
> pr_info_once? Especially, because this can have security impact?

After spending 5 minutes thinking about this, I think that would best go out in
another patch that I can spin, and we can discuss it there.

Re: [PATCH v4 0/2] NFS: Fix interaction between fs_context and user namespaces

2020-11-11 Thread Sargun Dhillon

On Wed, Nov 11, 2020 at 08:03:18PM +, Trond Myklebust wrote:
> On Wed, 2020-11-11 at 18:57 +0000, Sargun Dhillon wrote:
> > On Wed, Nov 11, 2020 at 02:38:11PM +, Trond Myklebust wrote:
> > > On Wed, 2020-11-11 at 11:12 +0000, Sargun Dhillon wrote:
> > > 
> > > The current code for setting server->cred was developed
> > > independently
> > > of fsopen() (and predates it actually). I'm fine with the change to
> > > have server->cred be the cred of the user that called fsopen().
> > > That's
> > > in line with what we used to do for sys_mount().
> > > 
> > Just curious, without FS_USERNS, how were you mounting NFSv4 in an
> > unprivileged user ns?
> 
> The code was originally developed on a 5.1 kernel. So all my testing
> has been with ordinary sys_mount() calls in a container that had
> CAP_SYS_ADMIN privileges.
> 
> > > However all the other stuff to throw errors when the user namespace
> > > is
> > > not init_user_ns introduces massive regressions.
> > > 
> > 
> > I can remove that and respin the patch. How do you feel about that? I would
> > still like to keep the log lines though because it is a uapi change. I am
> > worried that someone might exercise this path with GSS and allow for upcalls
> > into the main namespaces by accident -- or be confused of why they're seeing
> > upcalls "in a different namespace".
> > 
> > Are you okay with picking up ("NFS: NFSv2/NFSv3: Use cred from fs_context
> > during mount") without any changes?
> 
> Why do we need the dprintk()s? It seems to me that either they should
> be reporting something that the user needs to know (in which case they
> should be real printk()s) or they are telling us something that we
> should already know. To me they seem to fit more in the latter
> category.
> 
> > 
> > I can respin ("NFSv4: Refactor NFS to use user namespaces") without:
> > /*
> >  * nfs4idmap is not fully isolated by user namespaces. It is
> > currently
> >  * only network namespace aware. If upcalls never happen, we do not
> >  * need to worry as nfs_client instances aren't shared between
> >  * user namespaces.
> >  */
> > if (idmap_userns(server->nfs_client->cl_idmap) != &init_user_ns &&
> > !(server->caps & NFS_CAP_UIDGID_NOMAP)) {
> > error = -EINVAL;
> > errorf(fc, "Mount credentials are from non init user
> > namespace and ID mapping is enabled. This is not allowed.");
> > goto error;
> > }
> > 
> > (and making it so we can call idmap_userns)
> > 
> 
> Yes. That would be acceptable. Again, though, I'd like to see the
> dprintk()s gone.
> 

I can drop the dprintks, but given this is a uapi change, does it make sense to 
pr_info_once? Especially, because this can have security impact?

Re: [PATCH v4 0/2] NFS: Fix interaction between fs_context and user namespaces

2020-11-11 Thread Sargun Dhillon

On Wed, Nov 11, 2020 at 02:38:11PM +, Trond Myklebust wrote:
> On Wed, 2020-11-11 at 11:12 +0000, Sargun Dhillon wrote:
> > On Tue, Nov 10, 2020 at 08:12:01PM +, Trond Myklebust wrote:
> > > On Tue, 2020-11-10 at 17:43 +0100, Alban Crequy wrote:
> > > > Hi,
> > > > 
> > > > I tested the patches on top of 5.10.0-rc3+ and I could mount an
> > > > NFS
> > > > share with a different user namespace. fsopen() is done in the
> > > > container namespaces (user, mnt and net namespaces) while
> > > > fsconfig(),
> > > > fsmount() and move_mount() are done on the host namespaces. The
> > > > mount
> > > > on the host is available in the container via mount propagation
> > > > from
> > > > the host mount.
> > > > 
> > > > With this, the files on the NFS server with uid 0 are available
> > > > in
> > > > the
> > > > container with uid 0. On the host, they are available with uid
> > > > 4294967294 (make_kuid(&init_user_ns, -2)).
> > > > 
> > > 
> > > Can someone please tell me what is broken with the _current_ design
> > > before we start trying to push "fixes" that clearly break it?
> > Currently the mechanism of mounting nfs4 in a user namespace is as
> > follows:
> > 
> > Parent: fork()
> > Child: setns(userns)
> > C: fsopen("nfs4") = 3
> > C->P: Send FD 3
> > P: FSConfig...
> > P: fsmount... (This is where the CAP_SYS_ADMIN check happens))
> > 
> > 
> > Right now, when you mount an NFS filesystem in a non-init user
> > namespace, and you have UIDs / GIDs on, the UIDs / GIDs which
> > are sent to the server are not the UIDs from the mounting namespace,
> > instead they are the UIDs from the init user ns.
> > 
> > The reason for this is that you can call fsopen("nfs4") in the
> > unprivileged 
> > namespace, and that configures fs_context with all the right
> > information for 
> > that user namespace, but we currently require CAP_SYS_ADMIN in the
> > init user 
> > namespace to call fsmount. This means that the superblock's user
> > namespace is 
> > set "correctly" to the container, but there's absolutely no way
> > nfs4uidmap
> > to consume an unprivileged user namespace.
> > 
> > This behaviour happens "the other way" as well, where the UID in the
> > container
> > may be 0, but the corresponding kuid is 1000. When a response from an
> > NFS
> > server comes in we decode it according to the idmap userns[1]. The
> > userns
> > used to get create idmap is generated at fsmount time, and not as
> > fsopen
> > time. So, even if the filesystem is in the user namespace, and the
> > server
> > responds with UID 0, it'll come up with an unmapped UID.
> > 
> > This is because we do
> > Server UID 0 -> idmap make_kuid(init_user_ns, 0) -> VFS
> > from_kuid(container_ns, 0) -> invalid uid
> > 
> > This is broken behaviour, in my humble opinion as is it makes it
> > impossible to 
> > use NFSv4 (and v3 for that matter) out of the box with unprivileged
> > user 
> > namespaces. At least in our environment, using usernames / GSS isn't
> > an option,
> > so we have to rely on UIDs being set correctly [at least from the
> > container's
> > perspective].
> > 
> 
> The current code for setting server->cred was developed independently
> of fsopen() (and predates it actually). I'm fine with the change to
> have server->cred be the cred of the user that called fsopen(). That's
> in line with what we used to do for sys_mount().
> 
Just curious, without FS_USERNS, how were you mounting NFSv4 in an
unprivileged user ns?


> However all the other stuff to throw errors when the user namespace is
> not init_user_ns introduces massive regressions.
> 

I can remove that and respin the patch. How do you feel about that? I would
still like to keep the log lines though because it is a uapi change. I am
worried that someone might exercise this path with GSS and allow for upcalls
into the main namespaces by accident -- or be confused about why they're seeing
upcalls "in a different namespace".

Are you okay with picking up ("NFS: NFSv2/NFSv3: Use cred from fs_context
during mount") without any changes?

I can respin ("NFSv4: Refactor NFS to use user namespaces") without:
/*
 * nfs4idmap is not fully isolated by user namespaces. It is currently
 * only network namespace aware. If upcalls never happen, we do not
 * nee

Re: [PATCH v4 0/2] NFS: Fix interaction between fs_context and user namespaces

2020-11-11 Thread Sargun Dhillon

On Tue, Nov 10, 2020 at 08:12:01PM +, Trond Myklebust wrote:
> On Tue, 2020-11-10 at 17:43 +0100, Alban Crequy wrote:
> > Hi,
> > 
> > I tested the patches on top of 5.10.0-rc3+ and I could mount an NFS
> > share with a different user namespace. fsopen() is done in the
> > container namespaces (user, mnt and net namespaces) while fsconfig(),
> > fsmount() and move_mount() are done on the host namespaces. The mount
> > on the host is available in the container via mount propagation from
> > the host mount.
> > 
> > With this, the files on the NFS server with uid 0 are available in
> > the
> > container with uid 0. On the host, they are available with uid
> > 4294967294 (make_kuid(&init_user_ns, -2)).
> > 
> 
> Can someone please tell me what is broken with the _current_ design
> before we start trying to push "fixes" that clearly break it?
Currently the mechanism of mounting nfs4 in a user namespace is as follows:

Parent: fork()
Child: setns(userns)
C: fsopen("nfs4") = 3
C->P: Send FD 3
P: FSConfig...
P: fsmount... (This is where the CAP_SYS_ADMIN check happens)
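
In C, the steps above might look roughly like the following sketch. It uses
raw syscall() since not every libc ships wrappers for the new mount API. The
helper names (child_open_context, parent_finish_mount), the SK_ constants and
the fallback syscall numbers are my assumptions for illustration; passing the
fd from child to parent is left out, and error handling is minimal.

```c
#define _GNU_SOURCE
#include <fcntl.h>        /* AT_FDCWD */
#include <stddef.h>       /* NULL */
#include <sys/syscall.h>  /* SYS_* numbers, where the libc provides them */
#include <unistd.h>       /* syscall(), close() */

/*
 * Fallback numbers for older libc headers. Assumption: an architecture
 * that uses the unified syscall numbering (e.g. x86_64, arm64).
 */
#ifndef SYS_move_mount
#define SYS_move_mount 429
#endif
#ifndef SYS_fsopen
#define SYS_fsopen 430
#endif
#ifndef SYS_fsconfig
#define SYS_fsconfig 431
#endif
#ifndef SYS_fsmount
#define SYS_fsmount 432
#endif

/*
 * Values mirror <linux/mount.h>; the SK_ names are hypothetical and only
 * exist to avoid clashing with that header if it is also included.
 */
enum {
	SK_FSOPEN_CLOEXEC = 0x1,
	SK_FSMOUNT_CLOEXEC = 0x1,
	SK_FSCONFIG_SET_STRING = 1,
	SK_FSCONFIG_CMD_CREATE = 6,
	SK_MOVE_MOUNT_F_EMPTY_PATH = 0x4,
};

/* Child: call this after setns() into the container's user namespace. */
static int child_open_context(const char *fstype)
{
	return (int)syscall(SYS_fsopen, fstype, SK_FSOPEN_CLOEXEC);
}

/* Parent: fsfd is the context fd received from the child. */
static int parent_finish_mount(int fsfd, const char *source, const char *target)
{
	long ret;
	int mntfd;

	/* Configure the context; a real NFS mount needs more options. */
	if (syscall(SYS_fsconfig, fsfd, SK_FSCONFIG_SET_STRING,
		    "source", source, 0) < 0)
		return -1;

	/* Create the superblock from the configured context. */
	if (syscall(SYS_fsconfig, fsfd, SK_FSCONFIG_CMD_CREATE,
		    NULL, NULL, 0) < 0)
		return -1;

	/* Turn the context into a detached mount. */
	mntfd = (int)syscall(SYS_fsmount, fsfd, SK_FSMOUNT_CLOEXEC, 0);
	if (mntfd < 0)
		return -1;

	/* Attach the detached mount at the target path. */
	ret = syscall(SYS_move_mount, mntfd, "", AT_FDCWD, target,
		      SK_MOVE_MOUNT_F_EMPTY_PATH);
	close(mntfd);
	return ret < 0 ? -1 : 0;
}
```

The point of the split is that child_open_context() runs with the container's
credentials, while parent_finish_mount() runs in the privileged parent.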

Right now, when you mount an NFS filesystem in a non-init user
namespace, and you have UIDs / GIDs on, the UIDs / GIDs which
are sent to the server are not the UIDs from the mounting namespace,
instead they are the UIDs from the init user ns.

The reason for this is that you can call fsopen("nfs4") in the unprivileged 
namespace, and that configures fs_context with all the right information for 
that user namespace, but we currently require CAP_SYS_ADMIN in the init user 
namespace to call fsmount. This means that the superblock's user namespace is
set "correctly" to the container, but there's absolutely no way for nfs4idmap
to consume an unprivileged user namespace.

This behaviour happens "the other way" as well, where the UID in the container
may be 0, but the corresponding kuid is 1000. When a response from an NFS
server comes in we decode it according to the idmap userns[1]. The userns
used to create the idmap is determined at fsmount time, and not at fsopen
time. So, even if the filesystem is in the user namespace, and the server
responds with UID 0, it'll come up with an unmapped UID.

This is because we do
Server UID 0 -> idmap make_kuid(init_user_ns, 0) -> VFS
from_kuid(container_ns, 0) -> invalid uid

This is broken behaviour, in my humble opinion, as it makes it impossible to
use NFSv4 (and v3 for that matter) out of the box with unprivileged user 
namespaces. At least in our environment, using usernames / GSS isn't an option,
so we have to rely on UIDs being set correctly [at least from the container's
perspective].

> 
> The current design assumes that the user namespace being used is the one 
> where 
> the mount itself is performed. That means that the uids and gids or usernames 
> and groupnames that go on the wire match the uids and gids of the container 
> in 
> which the mount occurred.
> 

Right now, NFS does not have the ability for the fsmount() call to be
called in an unprivileged user namespace. We can change that behaviour
elsewhere if we want, but it's orthogonal to this.

> The assumption is that the server has authenticated that client as
> belonging to a domain that it recognises (either through strong
> RPCSEC_GSS/krb5 authentication, or through weaker matching of IP
> addresses to a list of acceptable clients).
> 
I added a rejection for upcalls because upcalls can happen in the init 
namespaces. We can drop that restriction from the nfs4 patch if you'd like. I
*believe* (and I'm not a little out of my depth) that the request-key
handler gets called with the *network namespace* of the NFS mount,
but the userns is a privileged one, allowing for potential hazards.

The reason I added that block there is that I didn't imagine anyone was running 
NFS in an unprivileged user namespace, and relying on upcalls (potentially into 
privileged namespaces) in order to do authz.

> If you go ahead and change the user namespace on the client without
> going through the mount process again to mount a different super block
> with a different user namespace, then you will now get the exact same
> behaviour as if you do that with any other filesystem.

Not exactly, because other filesystems *only* use the s_user_ns for conversion 
of UIDs, whereas NFS uses the current_cred() acquired at mount time, which
doesn't match s_user_ns, leading to this behaviour.

1. Mistranslated UIDs in encoding RPCs
2. The UID / GID exposed to VFS do not match the user ns.

> 
> -- 
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.mykleb...@hammerspace.com
> 
> 
-Thanks,
Sargun

[1]: https://elixir.bootlin.com/linux/v5.9.8/source/fs/nfs/nfs4idmap.c#L782
[2]: https://elixir.bootlin.com/linux/v5.9.8/source/fs/nfs/nfs4client.c#L1154

Re: [PATCH 1/3] seccomp: Return from SECCOMP_IOCTL_NOTIF_RECV when children are gone

2020-11-04 Thread Sargun Dhillon

On Mon, Nov 02, 2020 at 09:37:04PM +0100, Jann Horn wrote:
> At the moment, the seccomp notifier API is hard to use without combining
> it with APIs like poll() or epoll(); if all target processes have gone
> away, the polling APIs will raise an error indication on the file
> descriptor, but SECCOMP_IOCTL_NOTIF_RECV will keep blocking indefinitely.
> 
> This usability problem was discovered when Michael Kerrisk wrote a
> manpage for this API.
> 
> To fix it, get rid of the semaphore logic and let SECCOMP_IOCTL_NOTIF_RECV
> behave as follows:
> 
> If O_NONBLOCK is set, SECCOMP_IOCTL_NOTIF_RECV always returns
> immediately, no matter whether a notification is available or not.
> 
> If O_NONBLOCK is unset, SECCOMP_IOCTL_NOTIF_RECV blocks until either a
> notification is delivered to userspace or all users of the filter have
> gone away.
> 
> To avoid subtle breakage from eventloop-style code that doesn't set
> O_NONBLOCK, set O_NONBLOCK by default - userspace can clear it if it
> wants blocking behavior, and if blocking-style code forgets to do so,
> that will be much more obvious than the breakage we'd get the other way
> around.
> This also means that UAPI breakage from this change should be limited to
> blocking users of the API, of which, to my knowledge, there are none so far
> (outside of in-tree sample and selftest code, which this patch adjusts - in
> particular the code in samples/ has to change a bunch).
> 
> This should be backported because future userspace code might otherwise not
> work properly on old kernels.
> 
> Cc: sta...@vger.kernel.org
> Reported-by: Michael Kerrisk 
> Signed-off-by: Jann Horn 
> ---
>  kernel/seccomp.c  | 62 +--
>  samples/seccomp/user-trap.c   | 16 +
>  tools/testing/selftests/seccomp/seccomp_bpf.c | 21 +++
>  3 files changed, 79 insertions(+), 20 deletions(-)
> 
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 8ad7a293255a..b3730740515f 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -43,6 +43,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  /*
>   * When SECCOMP_IOCTL_NOTIF_ID_VALID was first introduced, it had the
> @@ -138,7 +139,6 @@ struct seccomp_kaddfd {
>   * @notifications: A list of struct seccomp_knotif elements.
>   */
>  struct notification {
> - struct semaphore request;
>   u64 next_id;
>   struct list_head notifications;
>  };
> @@ -863,7 +863,6 @@ static int seccomp_do_user_notification(int this_syscall,
>   list_add(, >notif->notifications);
>   INIT_LIST_HEAD();
>  
> - up(>notif->request);
>   wake_up_poll(>wqh, EPOLLIN | EPOLLRDNORM);
>   mutex_unlock(>notify_lock);
>  
> @@ -1179,9 +1178,10 @@ find_notification(struct seccomp_filter *filter, u64 
> id)
>  
>  
>  static long seccomp_notify_recv(struct seccomp_filter *filter,
> - void __user *buf)
> + void __user *buf, bool blocking)
>  {
>   struct seccomp_knotif *knotif = NULL, *cur;
> + DEFINE_WAIT(wait);
>   struct seccomp_notif unotif;
>   ssize_t ret;
>  
> @@ -1194,11 +1194,9 @@ static long seccomp_notify_recv(struct seccomp_filter 
> *filter,
>  
>   memset(, 0, sizeof(unotif));
>  
> - ret = down_interruptible(>notif->request);
> - if (ret < 0)
> - return ret;
> -
>   mutex_lock(>notify_lock);
> +
> +retry:
>   list_for_each_entry(cur, >notif->notifications, list) {
>   if (cur->state == SECCOMP_NOTIFY_INIT) {
>   knotif = cur;
> @@ -1206,14 +1204,40 @@ static long seccomp_notify_recv(struct seccomp_filter 
> *filter,
>   }
>   }
>  
> - /*
> -  * If we didn't find a notification, it could be that the task was
> -  * interrupted by a fatal signal between the time we were woken and
> -  * when we were able to acquire the rw lock.
> -  */
>   if (!knotif) {
> - ret = -ENOENT;
> - goto out;
> + if (!blocking) {
> + if (refcount_read(>users) == 0)
> + ret = -ENOTCONN;
> + else
> + ret = -ENOENT;
> + goto out;
> + }
> +
> + /* This has to happen before checking >users. */
> + prepare_to_wait(>wqh, , TASK_INTERRUPTIBLE);

Isn't there a subtle race condition here: if we are in blocking mode and a
notification arrives at this point, won't we miss the wakeup? We enter the
wait queue after checking for pending notifications, and wake_up_poll could
have been called in that window.

I very well might be missing something interesting about the semantics of
freezable_schedule.

Shouldn't it read something like:

retry:
mutex_lock(...);
ret = prepare_to_wait_event(>wqh, , TASK_INTERRUPTIBLE);
if (ret)
goto out;
list_for_each_entry(cur,

Re: [RFC PATCH v1 4/4] Allow to change the user namespace in which user rlimits are counted

2020-11-04 Thread Sargun Dhillon

On Mon, Nov 02, 2020 at 05:50:33PM +0100, Alexey Gladkov wrote:
> Add a new prctl to change the user namespace in which the process
> counter is located. A pointer to the user namespace is in cred struct
> to be inherited by all child processes.
> 
> Signed-off-by: Alexey Gladkov 
> ---
>  fs/exec.c  |  2 +-
>  fs/io-wq.c | 13 -
>  fs/io-wq.h |  1 +
>  fs/io_uring.c  |  1 +
>  include/linux/cred.h   |  8 
>  include/uapi/linux/prctl.h |  5 +
>  kernel/cred.c  | 35 +--
>  kernel/exit.c  |  2 +-
>  kernel/fork.c  |  4 ++--
>  kernel/sys.c   | 22 +-
>  kernel/user_namespace.c|  3 +++
>  11 files changed, 80 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index c45dfc716394..574b1381276c 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1837,7 +1837,7 @@ static int __do_execve_file(int fd, struct filename 
> *filename,
>   goto out_ret;
>   }
>  
> - processes = get_rlimit_counter(_user_ns, current_euid(), 
> UCOUNT_RLIMIT_NPROC);
> + processes = get_rlimit_counter(current_rlimit_ns(), current_euid(), 
> UCOUNT_RLIMIT_NPROC);
>  
>   /*
>* We move the actual failure in case of RLIMIT_NPROC excess from
> diff --git a/fs/io-wq.c b/fs/io-wq.c
> index c3b0843abc9b..19e43ec115cb 100644
> --- a/fs/io-wq.c
> +++ b/fs/io-wq.c
> @@ -116,6 +116,7 @@ struct io_wq {
>  
>   struct task_struct *manager;
>   struct user_struct *user;
It seems like user would be unused here, and you could use creds->user?

> + const struct cred *creds;
>   refcount_t refs;
>   struct completion done;
>  
> @@ -217,7 +218,7 @@ static void io_worker_exit(struct io_worker *worker)
>   if (worker->flags & IO_WORKER_F_RUNNING)
>   atomic_dec(>nr_running);
>   if (!(worker->flags & IO_WORKER_F_BOUND))
> - dec_rlimit_counter(_user_ns, wqe->wq->user->uid, 
> UCOUNT_RLIMIT_NPROC);
> + dec_rlimit_counter(wqe->wq->creds->rlimit_ns, 
> wqe->wq->user->uid, UCOUNT_RLIMIT_NPROC);
>   worker->flags = 0;
>   preempt_enable();
>  
> @@ -350,9 +351,9 @@ static void __io_worker_busy(struct io_wqe *wqe, struct 
> io_worker *worker,
>   worker->flags |= IO_WORKER_F_BOUND;
>   wqe->acct[IO_WQ_ACCT_UNBOUND].nr_workers--;
>   wqe->acct[IO_WQ_ACCT_BOUND].nr_workers++;
> - dec_rlimit_counter(_user_ns, wqe->wq->user->uid, 
> UCOUNT_RLIMIT_NPROC);
> + dec_rlimit_counter(wqe->wq->creds->rlimit_ns, 
> wqe->wq->user->uid, UCOUNT_RLIMIT_NPROC);
>   } else {
> - if (!inc_rlimit_counter(_user_ns, 
> wqe->wq->user->uid, UCOUNT_RLIMIT_NPROC))
> + if (!inc_rlimit_counter(wqe->wq->creds->rlimit_ns, 
> wqe->wq->user->uid, UCOUNT_RLIMIT_NPROC))
>   return;
>   worker->flags &= ~IO_WORKER_F_BOUND;
>   wqe->acct[IO_WQ_ACCT_UNBOUND].nr_workers++;
> @@ -662,7 +663,7 @@ static bool create_io_worker(struct io_wq *wq, struct 
> io_wqe *wqe, int index)
>   }
>  
>   if (index == IO_WQ_ACCT_UNBOUND &&
> - !inc_rlimit_counter(_user_ns, wq->user->uid, 
> UCOUNT_RLIMIT_NPROC)) {
> + !inc_rlimit_counter(wq->creds->rlimit_ns, wq->user->uid, 
> UCOUNT_RLIMIT_NPROC)) {
>   kfree(worker);
>   return false;
>   }
> @@ -772,7 +773,8 @@ static bool io_wq_can_queue(struct io_wqe *wqe, struct 
> io_wqe_acct *acct,
>   if (free_worker)
>   return true;
>  
> - processes = get_rlimit_counter(_user_ns, wqe->wq->user->uid, 
> UCOUNT_RLIMIT_NPROC);
> + processes = get_rlimit_counter(wqe->wq->creds->rlimit_ns, 
> wqe->wq->user->uid,
> + UCOUNT_RLIMIT_NPROC);
>  
>   if (processes >= acct->max_workers &&
>   !(capable(CAP_SYS_RESOURCE) || capable(CAP_SYS_ADMIN)))
> @@ -1049,6 +1051,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct 
> io_wq_data *data)
>  
>   /* caller must already hold a reference to this */
>   wq->user = data->user;
> + wq->creds = data->creds;
>  
>   for_each_node(node) {
>   struct io_wqe *wqe;
> diff --git a/fs/io-wq.h b/fs/io-wq.h
> index 071f1a997800..6acc3a04c38f 100644
> --- a/fs/io-wq.h
> +++ b/fs/io-wq.h
> @@ -105,6 +105,7 @@ typedef void (io_wq_work_fn)(struct io_wq_work **);
>  
>  struct io_wq_data {
>   struct user_struct *user;
> + const struct cred *creds;
>  
>   io_wq_work_fn *do_work;
>   free_work_fn *free_work;
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 493e5047e67c..e419923968b3 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -6933,6 +6933,7 @@ static int io_init_wq_offload(struct io_ring_ctx *ctx,
>   int ret = 0;
>  
>   data.user = ctx->user;
> +

Re: For review: seccomp_user_notif(2) manual page [v2]

2020-11-02 Thread Sargun Dhillon

On Mon, Nov 2, 2020 at 11:45 AM Michael Kerrisk (man-pages)
 wrote:
>
> Hello Sargun,
>
> Thanks for your reply!
>
> On 11/2/20 9:07 AM, Sargun Dhillon wrote:
> > On Sat, Oct 31, 2020 at 9:27 AM Michael Kerrisk (man-pages)
> >  wrote:
> >>
> >> Hello Sargun,
> >>
> >> Thanks for your reply.
> >>
> >> On 10/30/20 9:27 PM, Sargun Dhillon wrote:
> >>> On Thu, Oct 29, 2020 at 09:37:21PM +0100, Michael Kerrisk (man-pages)
> >>> wrote:
> >>
> >> [...]
> >>
> >>>>> I think I commented in another thread somewhere that the
> >>>>> supervisor is not notified if the syscall is preempted. Therefore
> >>>>> if it is performing a preemptible, long-running syscall, you need
> >>>>> to poll SECCOMP_IOCTL_NOTIF_ID_VALID in the background, otherwise
> >>>>> you can end up in a bad situation -- like leaking resources, or
> >>>>> holding on to file descriptors after the program under
> >>>>> supervision has intended to release them.
> >>>>
> >>>> It's been a long day, and I'm not sure I really understand this.
> >>>> Could you outline the scenario in more detail?
> >>>>
> >>> S: Sets up filter + interception for accept
> >>> T: socket(AF_INET, SOCK_STREAM, 0) = 7
> >>> T: bind(7, {127.0.0.1, }, ..)
> >>> T: listen(7, 10)
> >>> T: pidfd_getfd(T, 7) = 7 # For the sake of discussion.
> >>
> >> Presumably, the preceding line should have been:
> >>
> >> S: pidfd_getfd(T, 7) = 7 # For the sake of discussion.
> >> (s/T:/S:/)
> >>
> >> right?
> >
> > Right.
> >>
> >>
> >>> T: accept(7, ...)
> >>> S: Intercepts accept
> >>> S: Does accept in background
> >>> T: Receives signal, and accept(...) responds in EINTR
> >>> T: close(7)
> >>> S: Still running accept(7, ), holding port , so if now T
> >>> retries to bind to port , things fail.
> >>
> >> Okay -- I understand. Presumably the solution here is not to
> >> block in accept(), but rather to use poll() to monitor both the
> >> notification FD and the listening socket FD?
> >>
> > You need to have some kind of mechanism to periodically check
> > if the notification is still alive, and preempt the accept. It doesn't
> > matter how exactly you "background" the accept (threads, or
> > O_NONBLOCK + epoll).
> >
> > The thing is you need to make sure that when the process
> > cancels a syscall, you need to release the resources you
> > may have acquired on its behalf or bad things can happen.
> >
>
> Got it. I added the following text:
>
>Caveats regarding blocking system calls
>Suppose that the target performs a blocking system call (e.g.,
>accept(2)) that the supervisor should handle.  The supervisor
>might then in turn execute the same blocking system call.
>
>In this scenario, it is important to note that if the target's
>system call is now interrupted by a signal, the supervisor is not
>informed of this.  If the supervisor does not take suitable steps
>to actively discover that the target's system call has been
>canceled, various difficulties can occur.  Taking the example of
>accept(2), the supervisor might remain blocked in its accept(2)
>holding a port number that the target (which, after the
>interruption by the signal handler, perhaps closed  its listening
>socket) might expect to be able to reuse in a bind(2) call.
>
>Therefore, when the supervisor wishes to emulate a blocking system
>call, it must do so in such a way that it gets informed if the
>target's system call is interrupted by a signal handler.  For
>example, if the supervisor itself executes the same blocking
>system call, then it could employ a separate thread that uses the
>SECCOMP_IOCTL_NOTIF_ID_VALID operation to check if the target is
>still blocked in its system call.  Alternatively, in the accept(2)
>example, the supervisor might use poll(2) to monitor both the
>notification file descriptor (so as to discover when the
>target's accept(2) call has been interrupted) and the listening
>file descriptor (so as to know when a connection is available).
>
>If the target's system call is interrupted, the supervisor must
>take care to release resources (e.g., file descriptors) that it
>acquired on behalf of the target.
>
> Does

[PATCH v4 1/2] NFS: NFSv2/NFSv3: Use cred from fs_context during mount

2020-11-02 Thread Sargun Dhillon

The refactoring to use the fs_context for mounting was done in:
62a55d088cd87: NFS: Additional refactoring for fs_context conversion

That made it so that the net_ns is fetched from the fs_context (the netns
that fsopen is called in). This change makes it so that the credential
fetched during fsopen is used as well, alongside the net_ns.

NFS has already had a number of changes to prepare it for user namespaces:
1a58e8a0e5c1: NFS: Store the credential of the mount process in the nfs_server
264d948ce7d0: NFS: Convert NFSv3 to use the container user namespace
c207db2f5da5: NFS: Convert NFSv2 to use the container user namespace

Previously, different credentials could be used for creation of the
fs_context versus creation of the nfs_server, as FSCONFIG_CMD_CREATE did
the actual credential check, and that's where current_cred() was fetched.
This meant that the user namespace which fsopen was called in could be a
non-init user namespace. This still requires that the user that calls
FSCONFIG_CMD_CREATE has CAP_SYS_ADMIN in the init user ns.

This roughly allows a privileged user to mount on behalf of an unprivileged
user namespace, by forking off and calling fsopen in the unprivileged user
namespace. It can then pass back that fsfd to the privileged process which
can configure the NFS mount, and then it can call FSCONFIG_CMD_CREATE
before switching back into the mount namespace of the container, and finish
up the mounting process and call fsmount and move_mount.

Signed-off-by: Sargun Dhillon 
---
 fs/nfs/client.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 4b8cc93913f7..c3afe448a512 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -571,7 +571,7 @@ static int nfs_start_lockd(struct nfs_server *server)
1 : 0,
.net= clp->cl_net,
.nlmclnt_ops= clp->cl_nfs_mod->rpc_ops->nlmclnt_ops,
-   .cred   = current_cred(),
+   .cred   = server->cred,
};
 
if (nlm_init.nfs_version > 3)
@@ -985,7 +985,13 @@ struct nfs_server *nfs_create_server(struct fs_context *fc)
if (!server)
return ERR_PTR(-ENOMEM);
 
-   server->cred = get_cred(current_cred());
+   if (fc->cred->user_ns != &init_user_ns)
+   dprintk("%s: Using creds from non-init userns\n", __func__);
+   else if (fc->cred != current_cred())
+   dprintk("%s: Using creds from fs_context which are different than current_creds\n",
+   __func__);
+
+   server->cred = get_cred(fc->cred);
 
error = -ENOMEM;
fattr = nfs_alloc_fattr();
-- 
2.25.1

[PATCH v4 2/2] NFSv4: Refactor NFS to use user namespaces

2020-11-02 Thread Sargun Dhillon

In several patches work has been done to enable NFSv4 to use user namespaces:
58002399da65: NFSv4: Convert the NFS client idmapper to use the container user namespace
3b7eb5e35d0f: NFS: When mounting, don't share filesystems between different user namespaces

Unfortunately, the userspace APIs only allowed the userspace-facing side
of the filesystem (superblock s_user_ns) to be set to a non-init user
namespace. This furthers the fs_context-related refactoring, and
piggybacks on top of that logic, so the superblock user namespace and the
NFS user namespace are the same.

This change only allows those users who are not using ID mapping to use
user namespaces because the upcall mechanism still needs to be made fully
namespace aware. Currently, it is only network namespace aware (and this
patch doesn't impede that behaviour). Also, there is currently a limitation
that enabling / disabling ID mapping can only be done on a machine-wide
basis.

Eventually, we will need to at least:
  * Separate out the keyring cache by namespace
  * Come up with an upcall mechanism that can be triggered inside of the
    container, or safely triggered outside, with the requisite context to
    do the right mapping.
  * Handle whatever refactoring needs to be done in net/sunrpc.

Signed-off-by: Sargun Dhillon 
---
 fs/nfs/nfs4client.c | 27 ++-
 fs/nfs/nfs4idmap.c  |  2 +-
 fs/nfs/nfs4idmap.h  |  3 ++-
 3 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
index be7915c861ce..c592f1881978 100644
--- a/fs/nfs/nfs4client.c
+++ b/fs/nfs/nfs4client.c
@@ -1153,7 +1153,19 @@ struct nfs_server *nfs4_create_server(struct fs_context *fc)
if (!server)
return ERR_PTR(-ENOMEM);
 
-   server->cred = get_cred(current_cred());
+   /*
+* current_cred() must have CAP_SYS_ADMIN in init_user_ns. All non
+* init user namespaces cannot mount NFS, but the fs_context
+* can be created in any user namespace.
+*/
+   if (fc->cred->user_ns != &init_user_ns) {
+   dprintk("%s: Using creds from non-init userns\n", __func__);
+   } else if (fc->cred != current_cred()) {
+   dprintk("%s: Using creds from fs_context which are different than current_creds\n",
+   __func__);
+   }
+
+   server->cred = get_cred(fc->cred);
 
auth_probe = ctx->auth_info.flavor_len < 1;
 
@@ -1166,6 +1178,19 @@ struct nfs_server *nfs4_create_server(struct fs_context *fc)
if (error < 0)
goto error;
 
+   /*
+* nfs4idmap is not fully isolated by user namespaces. It is currently
+* only network namespace aware. If upcalls never happen, we do not
+* need to worry as nfs_client instances aren't shared between
+* user namespaces.
+*/
+   if (idmap_userns(server->nfs_client->cl_idmap) != &init_user_ns &&
+   !(server->caps & NFS_CAP_UIDGID_NOMAP)) {
+   error = -EINVAL;
+   errorf(fc, "Mount credentials are from non init user namespace and ID mapping is enabled. This is not allowed.");
+   goto error;
+   }
+
return server;
 
 error:
diff --git a/fs/nfs/nfs4idmap.c b/fs/nfs/nfs4idmap.c
index 8d8aba305ecc..33dc9b76dc17 100644
--- a/fs/nfs/nfs4idmap.c
+++ b/fs/nfs/nfs4idmap.c
@@ -73,7 +73,7 @@ struct idmap {
struct user_namespace   *user_ns;
 };
 
-static struct user_namespace *idmap_userns(const struct idmap *idmap)
+struct user_namespace *idmap_userns(const struct idmap *idmap)
 {
if (idmap && idmap->user_ns)
return idmap->user_ns;
diff --git a/fs/nfs/nfs4idmap.h b/fs/nfs/nfs4idmap.h
index de44d7330ab3..2f5296497887 100644
--- a/fs/nfs/nfs4idmap.h
+++ b/fs/nfs/nfs4idmap.h
@@ -38,7 +38,7 @@
 
 #include 
 #include 
-
+#include 
 
 /* Forward declaration to make this header independent of others */
 struct nfs_client;
@@ -50,6 +50,7 @@ int nfs_idmap_init(void);
 void nfs_idmap_quit(void);
 int nfs_idmap_new(struct nfs_client *);
 void nfs_idmap_delete(struct nfs_client *);
+struct user_namespace *idmap_userns(const struct idmap *idmap);
 
 void nfs_fattr_init_names(struct nfs_fattr *fattr,
struct nfs4_string *owner_name,
-- 
2.25.1

[PATCH v4 0/2] NFS: Fix interaction between fs_context and user namespaces

2020-11-02 Thread Sargun Dhillon

This is effectively a resend, but rebased atop Anna's current tree. I can
add the samples back in another patchset.

Right now, it is possible to mount NFS with a non-matching super block
user ns, and NFS sunrpc user ns. This (for the user) results in an awkward
set of interactions if using anything other than auth_null, where the UIDs
being sent to the server are different than the local UIDs being checked.
This can cause "breakage", where if you try to communicate with the NFS
server with any other set of mappings, it breaks.

This is after the initial v5.10 merge window, so hopefully this patchset
can be reconsidered, and maybe we can make forward progress? I think that
it takes a relatively conservative approach in enabling user namespaces,
and it prevents the case where someone is using auth_gss (for now), as the
mappings are non-trivial.

Changes since v3:
  * Rebase atop Anna's tree
Changes since v2:
  * Removed samples
  * Split out NFSv2/v3 patchset from NFSv4 patchset
  * Added restrictions around use
Changes since v1:
  * Added samples

Sargun Dhillon (2):
  NFS: NFSv2/NFSv3: Use cred from fs_context during mount
  NFSv4: Refactor NFS to use user namespaces

 fs/nfs/client.c | 10 --
 fs/nfs/nfs4client.c | 27 ++-
 fs/nfs/nfs4idmap.c  |  2 +-
 fs/nfs/nfs4idmap.h  |  3 ++-
 4 files changed, 37 insertions(+), 5 deletions(-)


base-commit: 8c39076c276be0b31982e44654e2c2357473258a
-- 
2.25.1

Re: For review: seccomp_user_notif(2) manual page [v2]

2020-11-02 Thread Sargun Dhillon

On Sat, Oct 31, 2020 at 9:27 AM Michael Kerrisk (man-pages)
 wrote:
>
> Hello Sargun,
>
> Thanks for your reply.
>
> On 10/30/20 9:27 PM, Sargun Dhillon wrote:
> > On Thu, Oct 29, 2020 at 09:37:21PM +0100, Michael Kerrisk (man-pages)
> > wrote:
>
> [...]
>
> >>> I think I commented in another thread somewhere that the
> >>> supervisor is not notified if the syscall is preempted. Therefore
> >>> if it is performing a preemptible, long-running syscall, you need
> >>> to poll SECCOMP_IOCTL_NOTIF_ID_VALID in the background, otherwise
> >>> you can end up in a bad situation -- like leaking resources, or
> >>> holding on to file descriptors after the program under
> >>> supervision has intended to release them.
> >>
> >> It's been a long day, and I'm not sure I really understand this.
> >> Could you outline the scenario in more detail?
> >>
> > S: Sets up filter + interception for accept
> > T: socket(AF_INET, SOCK_STREAM, 0) = 7
> > T: bind(7, {127.0.0.1, }, ..)
> > T: listen(7, 10)
> > T: pidfd_getfd(T, 7) = 7 # For the sake of discussion.
>
> Presumably, the preceding line should have been:
>
> S: pidfd_getfd(T, 7) = 7 # For the sake of discussion.
> (s/T:/S:/)
>
> right?

Right.
>
>
> > T: accept(7, ...)
> > S: Intercepts accept
> > S: Does accept in background
> > T: Receives signal, and accept(...) responds in EINTR
> > T: close(7)
> > S: Still running accept(7, ), holding port , so if now T
> > retries to bind to port , things fail.
>
> Okay -- I understand. Presumably the solution here is not to
> block in accept(), but rather to use poll() to monitor both the
> notification FD and the listening socket FD?
>
You need to have some kind of mechanism to periodically check
if the notification is still alive, and preempt the accept. It doesn't
matter how exactly you "background" the accept (threads, or
O_NONBLOCK + epoll).

The thing is you need to make sure that when the process
cancels a syscall, you need to release the resources you
may have acquired on its behalf or bad things can happen.
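
To make the "periodically check" part concrete, here is a rough sketch using
poll() with a short timeout plus SECCOMP_IOCTL_NOTIF_ID_VALID. The helper names
(notif_still_valid, wait_conn_or_cancel), the return conventions, and the
polling interval are all my own assumptions; threads or epoll would work just
as well.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>
#include <linux/types.h>

/*
 * Hypothetical helper: 1 if the notification is still pending, 0 if the
 * kernel reports ENOENT (target gone or syscall interrupted), -1 on any
 * other error.
 */
int notif_still_valid(int notify_fd, __u64 id)
{
	if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == 0)
		return 1;
	return errno == ENOENT ? 0 : -1;
}

/*
 * Hypothetical helper: wait until listen_fd is readable (returns 1), the
 * notification is no longer valid (returns 0), or an error occurs (-1).
 * On 0 or -1 the caller should release whatever it holds for the target.
 */
int wait_conn_or_cancel(int notify_fd, __u64 id, int listen_fd, int interval_ms)
{
	struct pollfd pfd = { .fd = listen_fd, .events = POLLIN };

	/* A non-positive interval would block forever or spin. */
	assert(interval_ms > 0);

	for (;;) {
		int n = poll(&pfd, 1, interval_ms);
		int valid;

		if (n < 0 && errno != EINTR)
			return -1;
		if (n > 0)
			return (pfd.revents & POLLIN) ? 1 : -1;

		/* Timed out (or EINTR): is the target still waiting on us? */
		valid = notif_still_valid(notify_fd, id);
		if (valid <= 0)
			return valid;
	}
}
```

After a return value of 1 the supervisor would do a non-blocking accept() on
its copy of the socket; it should still re-check the notification ID before
handing the result back to the target.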

> >>> A very specific example is if you're performing an accept on
> >>> behalf of the program generating the notification, and the
> >>> program intends to reuse the port. You can get into all sorts of
> >>> awkward situations there.
> >>
> >> [...]
> >>
> > See above
>
> [...]
>
> >>> In addition, if it is a socket, it inherits the cgroup v1 classid
> >>> and netprioidx of the receiving process.
> >>>
> >>> The argument of this is as follows:
> >>>
> >>> struct seccomp_notif_addfd {
> >>>     __u64 id;
> >>>     __u32 flags;
> >>>     __u32 srcfd;
> >>>     __u32 newfd;
> >>>     __u32 newfd_flags;
> >>> };
> >>>
> >>> id This is the cookie value that was obtained using
> >>> SECCOMP_IOCTL_NOTIF_RECV.
> >>>
> >>> flags A bitmask that includes zero or more of the
> >>> SECCOMP_ADDFD_FLAG_* bits set
> >>>
> >>> SECCOMP_ADDFD_FLAG_SETFD - Use dup2 (or dup3?) like semantics
> >>> when copying the file descriptor.
> >>>
> >>> srcfd The file descriptor number to copy in the supervisor
> >>> process.
> >>>
> >>> newfd If the SECCOMP_ADDFD_FLAG_SETFD flag is specified this will
> >>> be the file descriptor that is used in the dup2 semantics. If
> >>> this file descriptor exists in the receiving process, it is
> >>> closed and replaced by this file descriptor in an atomic fashion.
> >>> If the copy process fails due to a MAC failure, or if srcfd is
> >>> invalid, the newfd will not be closed in the receiving process.
> >>
> >> Great description!
> >>
> >>> If SECCOMP_ADDFD_FLAG_SETFD is not set, then this value must be
> >>> 0.
> >>>
> >>> newfd_flags The file descriptor flags to set on the file
> >>> descriptor after it has been received by the process. The only
> >>> flag that can currently be specified is O_CLOEXEC.
> >>>
> >>> On success, this operation returns the file descriptor number in
> >>> the receiving process. On failure, -1 is returned.
> >>>
> >>> It can fail with the following error codes:
> >>>
> >>> EINPROGRESS The cookie number specified hasn't been received by
> >>> the listener
> >>
> >> I don't understand this. Can you say more about the scenario?
> >>
> >
> > This should not real

Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-30 Thread Sargun Dhillon

On Thu, Oct 29, 2020 at 09:37:21PM +0100, Michael Kerrisk (man-pages) wrote:
> Hello Sargun,
> 
> On 10/29/20 9:53 AM, Sargun Dhillon wrote:
> > On Mon, Oct 26, 2020 at 10:55:04AM +0100, Michael Kerrisk (man-pages) wrote:
> 
> [...]
> 
> >>ioctl(2) operations
> >>The following ioctl(2) operations are provided to support seccomp
> >>user-space notification.  For each of these operations, the first
> >>(file descriptor) argument of ioctl(2) is the listening file
> >>descriptor returned by a call to seccomp(2) with the
> >>SECCOMP_FILTER_FLAG_NEW_LISTENER flag.
> >>
> >>SECCOMP_IOCTL_NOTIF_RECV
> >>   This operation is used to obtain a user-space notification
> >>   event.  If no such event is currently pending, the
> >>   operation blocks until an event occurs.  The third
> >>   ioctl(2) argument is a pointer to a structure of the
> >>   following form which contains information about the event.
> >>   This structure must be zeroed out before the call.
> >>
> >>   struct seccomp_notif {
> >>   __u64  id;  /* Cookie */
> >>   __u32  pid; /* TID of target thread */
> >>   __u32  flags;   /* Currently unused (0) */
> >>   struct seccomp_data data;   /* See seccomp(2) */
> >>   };
> >>
> >>   The fields in this structure are as follows:
> >>
> >>   id This is a cookie for the notification.  Each such
> >>  cookie is guaranteed to be unique for the
> >>  corresponding seccomp filter.
> >>
> >>  · It can be used with the
> >>SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation
> >>to verify that the target is still alive.
> >>
> >>  · When returning a notification response to the
> >>kernel, the supervisor must include the cookie
> >>value in the seccomp_notif_resp structure that is
> >>specified as the argument of the
> >>SECCOMP_IOCTL_NOTIF_SEND operation.
> >>
> >>   pid    This is the thread ID of the target thread that
> >>  triggered the notification event.
> >>
> >>   flags  This is a bit mask of flags providing further
> >>  information on the event.  In the current
> >>  implementation, this field is always zero.
> >>
> >>   data   This is a seccomp_data structure containing
> >>  information about the system call that triggered
> >>  the notification.  This is the same structure that
> >>  is passed to the seccomp filter.  See seccomp(2)
> >>  for details of this structure.
> >>
> >>   On success, this operation returns 0; on failure, -1 is
> >>   returned, and errno is set to indicate the cause of the
> >>   error.  This operation can fail with the following errors:
> >>
> >>   EINVAL (since Linux 5.5)
> >>  The seccomp_notif structure that was passed to the
> >>  call contained nonzero fields.
> >>
> >>   ENOENT The target thread was killed by a signal as the
> >>  notification information was being generated, or
> >>  the target's (blocked) system call was interrupted
> >>  by a signal handler.
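
As a small illustration of how the cookie travels back to the kernel, here is
a sketch of a response that fails the target's system call with an error. The
helper name send_error_response is my own assumption; the structure and the
SECCOMP_IOCTL_NOTIF_SEND operation are the ones named in the page.

```c
#include <assert.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>
#include <linux/types.h>

/*
 * Hypothetical helper: answer notification "id" by making the target's
 * system call fail with the positive errno value "err".
 * Returns 0 on success, -1 on failure (errno set by ioctl).
 */
int send_error_response(int notify_fd, __u64 id, int err)
{
	struct seccomp_notif_resp resp;

	assert(err > 0);

	memset(&resp, 0, sizeof(resp));
	resp.id = id;      /* cookie copied from struct seccomp_notif */
	resp.error = -err; /* the kernel expects a negated errno value */
	resp.val = 0;      /* no success return value in the error case */
	resp.flags = 0;

	return ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, &resp);
}
```

For example, send_error_response(fd, req.id, EPERM) would make the intercepted
call in the target return -1 with errno set to EPERM.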
> >>
> > 
> > I think I commented in another thread somewhere that the supervisor is not 
> > notified if the syscall is preempted. Therefore if it is performing a 
> > preemptible, long-running syscall, you need to poll
> > SECCOMP_IOCTL_NOTIF_ID_VALID in the background, otherwise you can
> > end up in a bad situation -- like leaking resources, or holding on to
> > file descriptors after the program under supervision has intended to
> > release them.
> 
> It's been a long day, and I'm not sure I really understand this.
> Could you outline the scenario in more detail?
> 
S: Sets up filter + inter

Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-29 Thread Sargun Dhillon

On Mon, Oct 26, 2020 at 10:55:04AM +0100, Michael Kerrisk (man-pages) wrote:
> Hi all (and especially Tycho and Sargun),
> 
> Following review comments on the first draft (thanks to Jann, Kees,
> Christian and Tycho), I've made a lot of changes to this page.
> I've also added a few FIXMEs relating to outstanding API issues.
> I'd like a second pass review of the page before I release it.
> But also, this mail serves as a way of noting the outstanding API
> issues.
> 
> Tycho: I still have an outstanding question for you at [2].
> 
> Sargun: can you please prepare something on SECCOMP_ADDFD_FLAG_SETFD
> and SECCOMP_IOCTL_NOTIF_ADDFD to be added to this page?
> 
> I've shown the rendered version of the page below. The page source
> currently sits in a branch at
> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=seccomp_user_notif
> 
> At this point, I'm mainly interested in feedback about the FIXMEs,
> some of which relate to the text of the page itself, while the
> others relate to the various outstanding API issues. The first 
> FIXME provides a small opportunity for some bikeshedding :-);
> 
> 
> Thanks,
> 
> Michael
> 
> [1] 
> https://lore.kernel.org/linux-man/45f07f17-18b6-d187-0914-6f341fe90...@gmail.com/
> [2] 
> https://lore.kernel.org/linux-man/8f20d586-9609-ef83-c85a-272e37e68...@gmail.com/
> 
> =
> 
> SECCOMP_USER_NOTIF(2)   Linux Programmer's Manual  SECCOMP_USER_NOTIF(2)
> 
> NAME
>seccomp_user_notif - Seccomp user-space notification mechanism
> 
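>┌─────────────────────────────────────────────────────┐
>│FIXME                                                │
>├─────────────────────────────────────────────────────┤
>│Might "seccomp_unotify(2)" be a better name for this │
>│page?  It's slightly shorter to type, and perhaps    │
>│reads better when spoken.                            │
>└─────────────────────────────────────────────────────┘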
> 
> SYNOPSIS
>#include 
>#include 
>#include 
> 
>int seccomp(unsigned int operation, unsigned int flags, void *args);
> 
>#include 
> 
>int ioctl(int fd, SECCOMP_IOCTL_NOTIF_RECV,
>  struct seccomp_notif *req);
>int ioctl(int fd, SECCOMP_IOCTL_NOTIF_SEND,
>  struct seccomp_notif_resp *resp);
>int ioctl(int fd, SECCOMP_IOCTL_NOTIF_ID_VALID, __u64 *id);
> 
> DESCRIPTION
>This page describes the user-space notification mechanism
>provided by the Secure Computing (seccomp) facility.  As well as
>the use of the SECCOMP_FILTER_FLAG_NEW_LISTENER flag, the
>SECCOMP_RET_USER_NOTIF action value, and the
>SECCOMP_GET_NOTIF_SIZES operation described in seccomp(2), this
>mechanism involves the use of a number of related ioctl(2)
>operations (described below).
> 
>Overview
>In conventional usage of a seccomp filter, the decision about how
>to treat a system call is made by the filter itself.  By
>contrast, the user-space notification mechanism allows the
>seccomp filter to delegate the handling of the system call to
>another user-space process.  Note that this mechanism is
>explicitly not intended as a method implementing security policy;
>see NOTES.
> 
>In the discussion that follows, the thread(s) on which the
>seccomp filter is installed is (are) referred to as the target,
>and the process that is notified by the user-space notification
>mechanism is referred to as the supervisor.
> 
>A suitably privileged supervisor can use the user-space
>notification mechanism to perform actions on behalf of the
>target.  The advantage of the user-space notification mechanism
>is that the supervisor will usually be able to retrieve
>information about the target and the performed system call that
>the seccomp filter itself cannot.  (A seccomp filter is limited
>in the information it can obtain and the actions that it can
>perform because it is running on a virtual machine inside the
>kernel.)
> 
>An overview of the steps performed by the target and the
>supervisor is as follows:
> 
>1. The target establishes a seccomp filter in the usual manner,
>   but with two differences:
> 
>   · The seccomp(2) flags argument includes the flag
> SECCOMP_FILTER_FLAG_NEW_LISTENER.  Consequently, the return
> value of the (successful) seccomp(2) call is a new
> "listening" file descriptor that can be used to receive
> notifications.  Only one "listening" seccomp filter can be
> installed for a thread.
> 

Re: [seccomp] Request for a "enable on execve" mode for Seccomp filters

2020-10-29 Thread Sargun Dhillon

On Wed, Oct 28, 2020 at 03:47:27PM -0700, Kees Cook wrote:
> On Wed, Oct 28, 2020 at 12:18:47PM +0100, Camille Mougey wrote:
> > (This is my first message to the kernel list, I hope I'm doing it right)
> 
> Looks good to me! The key was CCing real people. ;)
> 
> > From my understanding, there is no way to delay the activation of
> > seccomp filters, for instance "until an _execve_ call".
> > But this might be useful, especially for tools who sandbox other,
> > non-cooperative, executables, such as "systemd" or "FireJail".
> > [...]
> > I only see hackish ways to restrict the use of _execve_ in a
> > non-cooperative executable. These methods seem generally bypassable
> > and are not satisfactory from a security point of view.
> > 
> > IMHO, a way to prepare filter and enable them only on the next
> > _execve_ would have some benefit:
> > * have a way to restrict _execve_ in a non-cooperative executable;
> > * install filters atomically, i.e. before the _execve_ system call
> > returns. That would limit racy situations, and have the very first
> > instructions of potentially untrusted binaries already subject to
> > seccomp filters. It would also ensure there is only one thread running
> > at the time the filter is enabled.
> > 
> > From what I understand, there is a relative use case[2] where the
> > "enable on exec" mode would also be a solution.
> > 
> > Thanks for your attention,
> > C. Mougey
> > 
> > [1]: https://github.com/netblue30/firejail/issues/3685
> > [2]: https://lore.kernel.org/linux-man/202010250759.F9745E0B6@keescook/
> 
> Just to restate things already said in the thread and to try to illustrate
> with more clarity, I tend to organize my thinking about seccomp usage
> into three categories:
> 
> 1- self-confinement
> 2- launching external processes
>   a) cooperating
>   b) oblivious
> 
> I classify things like Chrome's complex tree of related processes and
> filters as 1, since it's all one thing together.
> 
> I think of systemd, docker, minijail, FireJail, etc all as falling into
> category 2, with some variation about how to deal with 2a or 2b. I see
> systemd as weakly covering both 2a and 2b: e.g. services are documenting
> what restrictions they want, etc. minijail has stronger 2b coverage as
> it attempts to do PRELOAD tricks (which it sounds like FireJail does
> too?) (Aside: why doesn't systemd do any self-confinement?)
> 
> We don't have much possibility for the targets in the 2a realm as far
> as cooperating over how to _manage_ confinement, but rather about simply
> expecting confinement to exist, or adding more confinement on their own.
> 
> So, what would adding delayed filters gain in the above classifications?
> 
> Both 1 and 2 would benefit from some simplification over how to apply
> filters (e.g. the referenced relative complexity of needing to pass the
> USER_NOTIF fd up to the supervisor).
> 
> Dealing with 2b is improved by allowing execve itself to be blocked.
> 
> If we turn this:
> 
>   fork
>   prepare & apply
>   exec
> 
> into this:
> 
>   fork
>   prepare
>   exec & apply
> 
> for 2a, this isn't too interesting since a 2a target could just give up
> execve after it launched. For 2b, though, it's pretty meaningful to gain
> further isolation of an oblivious (and assumingly untrusted) process
> (given all the hacks needed to try to cover the situation).
> 
> And to clarify, 2a would much prefer this to be able to separate
> initialization from runtime:
> 
> fork
> prepare
> exec
> other things
> apply
> 
> And just for completeness, none of this is useful at all for 1, which
> doesn't even "see" the fork from its perspective:
> 
> exec
> other things
> prepare & apply
> 
> How should 2a targets indicate they're ready? Can it be done passively
> (in the sense that libc would make some seccomp call to apply the
> delayed filters), or does it need to stay explicit? (e.g. can we turn
> a pre-untrusted-input 2b into a 2a just by having the libc make calls?)
> My instinct is that hiding it won't gain much over a "on-execve" case,
> but having an explicit call that means "I'm done initializing now" would
> be a meaningful synchronization point -- except I note that it just means
> the target could just as easily start doing its own confinement anyway,
> which means they effectively move from 2a to 1, and now we don't care
> about delayed filters any more.
> 
> So, lacking a clearer sync point, execve() does seem to stand out to me.
> 
> The other idea which was touched on in the thread was very direct
> management (e.g. ptrace) and the supervisor waits until some point and
> then forces the filters to apply on the target. What would be more
> light-weight than this? (Or rather, what kinds of things would such a
> ptracer be looking for to mark "I've started"?)
> 
> Since I've got bitmaps on my mind, what about a syscall bitmap that
> triggers the application of delayed filters? The supervisor is launching
> a daemon: mark NR_listen as the

Re: For review: seccomp_user_notif(2) manual page

2020-10-28 Thread Sargun Dhillon

On Wed, Oct 28, 2020 at 2:43 AM Jann Horn  wrote:
>
> On Wed, Oct 28, 2020 at 7:32 AM Sargun Dhillon  wrote:
> > On Tue, Oct 27, 2020 at 3:28 AM Jann Horn  wrote:
> > > On Tue, Oct 27, 2020 at 7:14 AM Michael Kerrisk (man-pages)
> > >  wrote:
> > > > On 10/26/20 4:54 PM, Jann Horn wrote:
> > > > > I'm a bit on the fence now on whether non-blocking mode should use
> > > > > ENOTCONN or not... I guess if we returned ENOENT even when there are
> > > > > no more listeners, you'd have to disambiguate through the poll()
> > > > > revents, which would be kinda ugly?
> > > >
> > > > I must confess, I'm not quite clear on which two cases you
> > > > are trying to distinguish. Can you elaborate?
> > >
> > > Let's say someone writes a program whose responsibilities are just to
> > > handle seccomp events and to listen on some other fd for commands. And
> > > this is implemented with an event loop. Then once all the target
> > > processes are gone (including zombie reaping), we'll start getting
> > > EPOLLERR.
> > >
> > > If NOTIF_RECV starts returning -ENOTCONN at this point, the event loop
> > > can just call into the seccomp logic without any arguments; it can
> > > just call NOTIF_RECV one more time, see the -ENOTCONN, and terminate.
> > > The downside is that there's one more error code userspace has to
> > > special-case.
> > > This would be more consistent with what we'd be doing in the blocking 
> > > case.
> > >
> > > If NOTIF_RECV keeps returning -ENOENT, the event loop has to also tell
> > > the seccomp logic what the revents are.
> > >
> > > I guess it probably doesn't really matter much.
> >
> > So, in practice, if you're emulating a blocking syscall (such as open,
> > perf_event_open, or any of a number of other syscalls), you probably
> > have to do it on a separate thread in the supervisor because you want
> > to continue to be able to receive new notifications if any other process
> > generates a seccomp notification event that you need to handle.
> >
> > In addition to that, some of these syscalls are preemptible, so you need
> > to poll SECCOMP_IOCTL_NOTIF_ID_VALID to make sure that the program
> > under supervision hasn't left the syscall.
> >
> > If we're to implement a mechanism that makes the seccomp ioctl receive
> > non-blocking, it would be valuable to address this problem as well (getting
> > a notification when the supervisor is processing a syscall and needs to
> > preempt it). In the best case, this can be a minor inconvenience, and
> > in the worst case this can result in weird errors where you're keeping
> > resources open that the container expects to be closed.
>
> Does "a notification" mean signals? Or would you want to have a second
> thread in userspace that poll()s for cancellation events on the
> seccomp fd and then somehow takes care of interrupting the first
> thread, or something like that?

I would be reluctant to prescribe that it be a signal. Right now, it's
implemented as a second thread in userspace that does an ioctl(...) to check
whether the notification is still valid / alive, and does what's required if
the notification has died (interrupting the first thread).
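
As a rough illustration of that second-thread check, here is a minimal
sketch. The helper names (notif_id_alive, watch_notification), the 10ms
polling interval and the on_gone callback are made up for this example and
are not taken from any existing supervisor; only the ioctl itself is the
real interface.

```c
#include <errno.h>
#include <time.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>
#include <linux/types.h>

/*
 * Returns 1 if the target is still blocked in the syscall that produced
 * notification `id`, 0 if the notification is gone (ENOENT), and -1 on
 * any other error (errno is left set by ioctl).
 */
static int notif_id_alive(int notify_fd, __u64 id)
{
	if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == 0)
		return 1;
	return errno == ENOENT ? 0 : -1;
}

/*
 * Body of the hypothetical watcher thread: poll until the notification
 * dies, then call on_gone (for example, something that interrupts the
 * thread emulating the syscall). Returns 0 once on_gone has been
 * invoked, -1 on an unexpected ioctl error.
 */
static int watch_notification(int notify_fd, __u64 id,
			      void (*on_gone)(void *arg), void *arg)
{
	/* Assumed polling interval of 10ms; tune for the workload. */
	const struct timespec interval = { .tv_sec = 0, .tv_nsec = 10000000 };

	for (;;) {
		int alive = notif_id_alive(notify_fd, id);

		if (alive < 0)
			return -1;
		if (alive == 0) {
			if (on_gone)
				on_gone(arg);
			return 0;
		}
		nanosleep(&interval, NULL);
	}
}
```

The cost of this shape is one extra thread (or at least one extra poller)
per in-flight notification, which is part of why a kernel-side notification
would be attractive.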

>
> Either way, I think your proposal goes beyond the scope of patching
> the existing weirdness, and should be a separate patch.

I agree it should be a separate patch, but I think that it'd be nice if there
was a way to do something like:
* opt-in to getting another message after receiving the notification
  that indicates the program has left the syscall
* when you do the RECV, you can specify a flag or some such asking
  that you get signaled / notified about the program leaving the syscall
* a multiplexed receive that can say if an existing notification in progress
  has left the valid state.

---
The reason I bring this up as part of the current thread / discussion is that
I think the two may be related in terms of how we want this to behave.

I would love to hear how people think this should work, or better suggestions
than the second thread approach above, or the alternative approach of
polling all the notifications in progress on some interval [and relying on
epoll timeout to trigger that interval].
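
For the interval-based alternative, a sketch could look like the following.
It assumes the supervisor keeps its in-flight notification IDs in a flat
array, and the names (sweep_inflight, loop_once, enum loop_event) are
invented here for illustration only.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>
#include <linux/types.h>

/*
 * Compact `ids` in place, keeping only notifications the kernel still
 * considers valid. Returns the number kept, or -1 on an ioctl error
 * other than ENOENT. A real supervisor would cancel the work
 * associated with each dropped ID.
 */
static long sweep_inflight(int notify_fd, __u64 *ids, size_t n)
{
	size_t kept = 0;

	for (size_t i = 0; i < n; i++) {
		__u64 id = ids[i];

		if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == 0)
			ids[kept++] = ids[i];
		else if (errno != ENOENT)
			return -1;
	}
	return (long)kept;
}

enum loop_event { LOOP_ERROR = -1, LOOP_SWEPT = 0, LOOP_READY = 1 };

/*
 * One event-loop iteration: wait up to interval_ms for activity on
 * epfd; if the wait times out, sweep the in-flight IDs and update *n.
 * LOOP_READY means the caller should go and service its fds.
 */
static enum loop_event loop_once(int epfd, int notify_fd,
				 __u64 *ids, size_t *n, int interval_ms)
{
	struct epoll_event ev;
	int ready = epoll_wait(epfd, &ev, 1, interval_ms);

	if (ready < 0)
		return errno == EINTR ? LOOP_SWEPT : LOOP_ERROR;
	if (ready > 0)
		return LOOP_READY;

	long kept = sweep_inflight(notify_fd, ids, *n);
	if (kept < 0)
		return LOOP_ERROR;
	*n = (size_t)kept;
	return LOOP_SWEPT;
}
```

Note that a busy supervisor may never hit the timeout, so a real loop would
probably also want to sweep after some number of LOOP_READY iterations.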

Re: For review: seccomp_user_notif(2) manual page

2020-10-28 Thread Sargun Dhillon

On Tue, Oct 27, 2020 at 3:28 AM Jann Horn  wrote:
>
> On Tue, Oct 27, 2020 at 7:14 AM Michael Kerrisk (man-pages)
>  wrote:
> > On 10/26/20 4:54 PM, Jann Horn wrote:
> > > I'm a bit on the fence now on whether non-blocking mode should use
> > > ENOTCONN or not... I guess if we returned ENOENT even when there are
> > > no more listeners, you'd have to disambiguate through the poll()
> > > revents, which would be kinda ugly?
> >
> > I must confess, I'm not quite clear on which two cases you
> > are trying to distinguish. Can you elaborate?
>
> Let's say someone writes a program whose responsibilities are just to
> handle seccomp events and to listen on some other fd for commands. And
> this is implemented with an event loop. Then once all the target
> processes are gone (including zombie reaping), we'll start getting
> EPOLLERR.
>
> If NOTIF_RECV starts returning -ENOTCONN at this point, the event loop
> can just call into the seccomp logic without any arguments; it can
> just call NOTIF_RECV one more time, see the -ENOTCONN, and terminate.
> The downside is that there's one more error code userspace has to
> special-case.
> This would be more consistent with what we'd be doing in the blocking case.
>
> If NOTIF_RECV keeps returning -ENOENT, the event loop has to also tell
> the seccomp logic what the revents are.
>
> I guess it probably doesn't really matter much.

So, in practice, if you're emulating a blocking syscall (such as open,
perf_event_open, or any of a number of other syscalls), you probably
have to do it on a separate thread in the supervisor because you want
to continue to be able to receive new notifications if any other process
generates a seccomp notification event that you need to handle.

In addition to that, some of these syscalls are preemptible, so you need
to poll SECCOMP_IOCTL_NOTIF_ID_VALID to make sure that the program
under supervision hasn't left the syscall.

If we're to implement a mechanism that makes the seccomp ioctl receive
non-blocking, it would be valuable to address this problem as well (getting
a notification when the supervisor is processing a syscall and needs to
preempt it). In the best case, this can be a minor inconvenience, and
in the worst case this can result in weird errors where you're keeping
resources open that the container expects to be closed.
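
To sketch the "keep receiving while something else emulates" structure: the
loop below only receives notifications and passes each one to a hand-off
hook. The hook type (handoff_fn) and dispatch_loop are hypothetical names;
the idea is that the hook queues the request for a worker thread and returns
immediately. For brevity this uses a fixed struct seccomp_notif, whereas
careful code would size its buffers via SECCOMP_GET_NOTIF_SIZES.

```c
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>

/*
 * Hypothetical hand-off hook: e.g. copy *req onto a queue drained by
 * worker threads. It should not block. Returns 0 on success.
 */
typedef int (*handoff_fn)(const struct seccomp_notif *req, void *ctx);

/*
 * Receive notifications forever and hand each one off, so that a slow
 * emulated syscall never stops us from picking up the next
 * notification. Returns -1 when receiving or the hand-off fails.
 */
static int dispatch_loop(int notify_fd, handoff_fn handoff, void *ctx)
{
	struct seccomp_notif req;

	for (;;) {
		/* The kernel expects the receive buffer to be zeroed. */
		memset(&req, 0, sizeof(req));

		if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, &req) < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (handoff(&req, ctx) < 0)
			return -1;
	}
}
```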

Re: [RESEND PATCH v2 0/3] NFS User Namespaces with new mount API

2020-10-17 Thread Sargun Dhillon

On Fri, Oct 16, 2020 at 05:45:47AM -0700, Sargun Dhillon wrote:
> This patchset adds some functionality to allow NFS to be used from
> user namespaces (containers).
> 
> Changes since v1:
>   * Added samples
> 
> Sargun Dhillon (3):
>   NFS: Use cred from fscontext during fsmount
>   samples/vfs: Split out common code for new syscall APIs
>   samples/vfs: Add example leveraging NFS with new APIs and user
> namespaces
> 
>  fs/nfs/client.c|   2 +-
>  fs/nfs/flexfilelayout/flexfilelayout.c |   1 +
>  fs/nfs/nfs4client.c|   2 +-
>  samples/vfs/.gitignore |   2 +
>  samples/vfs/Makefile   |   5 +-
>  samples/vfs/test-fsmount.c |  86 +---
>  samples/vfs/test-nfs-userns.c  | 181 +
>  samples/vfs/vfs-helper.c   |  43 ++
>  samples/vfs/vfs-helper.h   |  55 
>  9 files changed, 289 insertions(+), 88 deletions(-)
>  create mode 100644 samples/vfs/test-nfs-userns.c
>  create mode 100644 samples/vfs/vfs-helper.c
>  create mode 100644 samples/vfs/vfs-helper.h
> 
> -- 
> 2.25.1
> 

Digging a little deeper into this, I found that there are some problematic
aspects of the current behaviour. nfs_get_tree_common calls sget_fc, and
sget_fc sets the super block's s_user_ns (via alloc_super) to the fs_context's
user namespace unless the global flag is set (which NFS does not set). As a
result, a bunch of permission checks are done against the super block's
user_ns.

It looks like this was introduced in:
f2aedb713c28: NFS: Add fs_context support[1]

It turns out that unmapped users in the "parent" user namespace just get an
EOVERFLOW error when trying to perform a read, even if the UID sent to the NFS
server to read the file is a valid UID (the UID in the init user ns). In other
words, inode_permission checks permissions against the mapped UID in the
namespace, while the authentication credentials (UIDs, GIDs) sent to the
server are those from the init user ns.
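
To make "unmapped" concrete, here is a toy user-space model of the extent
lookup that a uid_map describes. This is not kernel code; struct id_extent
and host_id_to_ns are names invented for this illustration, and the mapping
used in the usage note is an assumed example.

```c
#include <stddef.h>
#include <stdint.h>

/* One line of a uid_map: <ID inside ns> <ID outside ns> <length>. */
struct id_extent {
	uint32_t ns_first;
	uint32_t host_first;
	uint32_t count;
};

/*
 * Translate an ID from the parent ("host") side into the namespace.
 * Returns 0 and stores the result in *ns_id when some extent covers
 * host_id, or -1 when the ID is unmapped in this namespace.
 */
static int host_id_to_ns(const struct id_extent *map, size_t n,
			 uint32_t host_id, uint32_t *ns_id)
{
	for (size_t i = 0; i < n; i++) {
		if (host_id >= map[i].host_first &&
		    host_id - map[i].host_first < map[i].count) {
			*ns_id = map[i].ns_first +
				 (host_id - map[i].host_first);
			return 0;
		}
	}
	return -1;
}
```

With an assumed map of "0 100000 65536", host UID 101001 appears as 1001
inside the namespace, while host UID 1001 has no mapping at all. The second
case is the kind of caller described above: perfectly valid in the init user
ns, but unmapped from the super block's point of view.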

[This all assumes there are no upcalls doing ID mapping]

Although I do not think this presents any security risk (because you have to
have CAP_SYS_ADMIN in the init user ns to get this far), it definitely seems
like "incorrect" behaviour.

[1]: https://lore.kernel.org/linux-nfs/20191120152750.6880-26-smay...@redhat.com/

[RESEND PATCH v2 0/3] NFS User Namespaces with new mount API

2020-10-16 Thread Sargun Dhillon

This patchset adds some functionality to allow NFS to be used from
user namespaces (containers).

Changes since v1:
  * Added samples

Sargun Dhillon (3):
  NFS: Use cred from fscontext during fsmount
  samples/vfs: Split out common code for new syscall APIs
  samples/vfs: Add example leveraging NFS with new APIs and user
namespaces

 fs/nfs/client.c|   2 +-
 fs/nfs/flexfilelayout/flexfilelayout.c |   1 +
 fs/nfs/nfs4client.c|   2 +-
 samples/vfs/.gitignore |   2 +
 samples/vfs/Makefile   |   5 +-
 samples/vfs/test-fsmount.c |  86 +---
 samples/vfs/test-nfs-userns.c  | 181 +
 samples/vfs/vfs-helper.c   |  43 ++
 samples/vfs/vfs-helper.h   |  55 
 9 files changed, 289 insertions(+), 88 deletions(-)
 create mode 100644 samples/vfs/test-nfs-userns.c
 create mode 100644 samples/vfs/vfs-helper.c
 create mode 100644 samples/vfs/vfs-helper.h

-- 
2.25.1

[RESEND PATCH v2 2/3] samples/vfs: Split out common code for new syscall APIs

2020-10-16 Thread Sargun Dhillon

There are a bunch of helper functions which make using the new
mount APIs much easier. As we add examples of leveraging the
new APIs, it probably makes sense to promote code reuse.

Signed-off-by: Sargun Dhillon 
Cc: David Howells 
Cc: Al Viro 
Cc: Kyle Anderson 
---
 samples/vfs/Makefile   |  2 +
 samples/vfs/test-fsmount.c | 86 +-
 samples/vfs/vfs-helper.c   | 43 +++
 samples/vfs/vfs-helper.h   | 55 
 4 files changed, 101 insertions(+), 85 deletions(-)
 create mode 100644 samples/vfs/vfs-helper.c
 create mode 100644 samples/vfs/vfs-helper.h

diff --git a/samples/vfs/Makefile b/samples/vfs/Makefile
index 00b6824f9237..7f76875eaa70 100644
--- a/samples/vfs/Makefile
+++ b/samples/vfs/Makefile
@@ -1,5 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
+test-fsmount-objs := test-fsmount.o vfs-helper.o
 userprogs := test-fsmount test-statx
+
 always-y := $(userprogs)
 
 userccflags += -I usr/include
diff --git a/samples/vfs/test-fsmount.c b/samples/vfs/test-fsmount.c
index 50f47b72e85f..36a4fa886200 100644
--- a/samples/vfs/test-fsmount.c
+++ b/samples/vfs/test-fsmount.c
@@ -14,91 +14,7 @@
 #include 
 #include 
 #include 
-
-#define E(x) do { if ((x) == -1) { perror(#x); exit(1); } } while(0)
-
-static void check_messages(int fd)
-{
-   char buf[4096];
-   int err, n;
-
-   err = errno;
-
-   for (;;) {
-   n = read(fd, buf, sizeof(buf));
-   if (n < 0)
-   break;
-   n -= 2;
-
-   switch (buf[0]) {
-   case 'e':
-   fprintf(stderr, "Error: %*.*s\n", n, n, buf + 2);
-   break;
-   case 'w':
-   fprintf(stderr, "Warning: %*.*s\n", n, n, buf + 2);
-   break;
-   case 'i':
-   fprintf(stderr, "Info: %*.*s\n", n, n, buf + 2);
-   break;
-   }
-   }
-
-   errno = err;
-}
-
-static __attribute__((noreturn))
-void mount_error(int fd, const char *s)
-{
-   check_messages(fd);
-   fprintf(stderr, "%s: %m\n", s);
-   exit(1);
-}
-
-/* Hope -1 isn't a syscall */
-#ifndef __NR_fsopen
-#define __NR_fsopen -1
-#endif
-#ifndef __NR_fsmount
-#define __NR_fsmount -1
-#endif
-#ifndef __NR_fsconfig
-#define __NR_fsconfig -1
-#endif
-#ifndef __NR_move_mount
-#define __NR_move_mount -1
-#endif
-
-
-static inline int fsopen(const char *fs_name, unsigned int flags)
-{
-   return syscall(__NR_fsopen, fs_name, flags);
-}
-
-static inline int fsmount(int fsfd, unsigned int flags, unsigned int ms_flags)
-{
-   return syscall(__NR_fsmount, fsfd, flags, ms_flags);
-}
-
-static inline int fsconfig(int fsfd, unsigned int cmd,
-  const char *key, const void *val, int aux)
-{
-   return syscall(__NR_fsconfig, fsfd, cmd, key, val, aux);
-}
-
-static inline int move_mount(int from_dfd, const char *from_pathname,
-int to_dfd, const char *to_pathname,
-unsigned int flags)
-{
-   return syscall(__NR_move_mount,
-  from_dfd, from_pathname,
-  to_dfd, to_pathname, flags);
-}
-
-#define E_fsconfig(fd, cmd, key, val, aux) \
-   do {\
-   if (fsconfig(fd, cmd, key, val, aux) == -1) \
-   mount_error(fd, key ?: "create");   \
-   } while (0)
+#include "vfs-helper.h"
 
 int main(int argc, char *argv[])
 {
diff --git a/samples/vfs/vfs-helper.c b/samples/vfs/vfs-helper.c
new file mode 100644
index 000000000000..136c6cb81540
--- /dev/null
+++ b/samples/vfs/vfs-helper.c
@@ -0,0 +1,43 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+#include 
+#include 
+#include 
+#include 
+#include "vfs-helper.h"
+
+void check_messages(int fd)
+{
+   char buf[4096];
+   int err, n;
+
+   err = errno;
+
+   for (;;) {
+   n = read(fd, buf, sizeof(buf));
+   if (n < 0)
+   break;
+   n -= 2;
+
+   switch (buf[0]) {
+   case 'e':
+   fprintf(stderr, "Error: %*.*s\n", n, n, buf + 2);
+   break;
+   case 'w':
+   fprintf(stderr, "Warning: %*.*s\n", n, n, buf + 2);
+   break;
+   case 'i':
+   fprintf(stderr, "Info: %*.*s\n", n, n, buf + 2);
+   break;
+   }
+   }
+
+   errno = err;
+}
+
+__attribute__((noreturn))
+void mount_error(int fd, const char *s)
+{
+   check_messages(fd);
+   fprintf(stderr, "%s: %m\n", s);
+   exit(1);
+}
diff --git a/samples/vfs/vfs-helper.h

[RESEND PATCH v2 3/3] samples/vfs: Add example leveraging NFS with new APIs and user namespaces

2020-10-16 Thread Sargun Dhillon

This adds an example which assumes you already have an NFS server set up.
It does the work of creating a user namespace and an NFS mount from that
user namespace, which then exposes different UIDs than those of the init
user namespace.
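
For readers unfamiliar with the pidfd calls the sample wraps, the hand-off
of the child's fsfd to the parent can be sketched roughly as below. This is
an assumption about how the pieces fit together, not a copy of the sample;
fetch_child_fd is a hypothetical helper name.

```c
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Duplicate file descriptor number `child_fd` out of process `child`
 * into the calling process. Returns the new local fd, or -1 with
 * errno set. The caller needs ptrace-level access to the child.
 */
static int fetch_child_fd(pid_t child, int child_fd)
{
	int pidfd, fd, saved_errno;

	pidfd = (int)syscall(__NR_pidfd_open, child, 0);
	if (pidfd < 0)
		return -1;

	fd = (int)syscall(__NR_pidfd_getfd, pidfd, child_fd, 0);

	/* Preserve the interesting errno across close(). */
	saved_errno = errno;
	close(pidfd);
	errno = saved_errno;

	return fd;
}
```

The child only has to tell the parent which descriptor number to fetch,
which is why a plain send() of the integer over the socketpair is enough
and no SCM_RIGHTS message is needed.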

Signed-off-by: Sargun Dhillon 
Cc: J. Bruce Fields 
Cc: Chuck Lever 
Cc: Trond Myklebust 
Cc: Anna Schumaker 
Cc: David Howells 
Cc: Al Viro 
Cc: Kyle Anderson 
---
 fs/nfs/flexfilelayout/flexfilelayout.c |   1 +
 samples/vfs/.gitignore |   2 +
 samples/vfs/Makefile   |   3 +-
 samples/vfs/test-nfs-userns.c  | 181 +
 4 files changed, 186 insertions(+), 1 deletion(-)
 create mode 100644 samples/vfs/test-nfs-userns.c

diff --git a/fs/nfs/flexfilelayout/flexfilelayout.c b/fs/nfs/flexfilelayout/flexfilelayout.c
index f9348ed1bcda..ee45ff7d75ac 100644
--- a/fs/nfs/flexfilelayout/flexfilelayout.c
+++ b/fs/nfs/flexfilelayout/flexfilelayout.c
@@ -361,6 +361,7 @@ ff_layout_alloc_lseg(struct pnfs_layout_hdr *lh,
 struct nfs4_layoutget_res *lgr,
 gfp_t gfp_flags)
 {
+   struct user_namespace *user_ns = lh->plh_lc_cred->user_ns;
struct pnfs_layout_segment *ret;
struct nfs4_ff_layout_segment *fls = NULL;
struct xdr_stream stream;
diff --git a/samples/vfs/.gitignore b/samples/vfs/.gitignore
index 8fdabf7e5373..1d09826b31a6 100644
--- a/samples/vfs/.gitignore
+++ b/samples/vfs/.gitignore
@@ -1,3 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0-only
 test-fsmount
 test-statx
+test-nfs-userns
+
diff --git a/samples/vfs/Makefile b/samples/vfs/Makefile
index 7f76875eaa70..6a2926080c08 100644
--- a/samples/vfs/Makefile
+++ b/samples/vfs/Makefile
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
 test-fsmount-objs := test-fsmount.o vfs-helper.o
-userprogs := test-fsmount test-statx
+test-nfs-userns-objs := test-nfs-userns.o vfs-helper.o
+userprogs := test-fsmount test-statx test-nfs-userns
 
 always-y := $(userprogs)
 
diff --git a/samples/vfs/test-nfs-userns.c b/samples/vfs/test-nfs-userns.c
new file mode 100644
index 000000000000..108af924cbdd
--- /dev/null
+++ b/samples/vfs/test-nfs-userns.c
@@ -0,0 +1,181 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "vfs-helper.h"
+
+
+#define WELL_KNOWN_FD  100
+
+static inline int pidfd_open(pid_t pid, unsigned int flags)
+{
+   return syscall(__NR_pidfd_open, pid, flags);
+}
+
+static inline int pidfd_getfd(int pidfd, int fd, int flags)
+{
+   return syscall(__NR_pidfd_getfd, pidfd, fd, flags);
+}
+
+static void write_to_path(const char *path, const char *str)
+{
+   int fd, len = strlen(str);
+
+   fd = open(path, O_WRONLY);
+   if (fd < 0) {
+   fprintf(stderr, "Can't open %s: %s\n", path, strerror(errno));
+   exit(1);
+   }
+
+   if (write(fd, str, len) != len) {
+   fprintf(stderr, "Can't write string: %s\n", strerror(errno));
+   exit(1);
+   }
+
+   E(close(fd));
+}
+
+static int do_work(int sk)
+{
+   int fsfd;
+
+   E(unshare(CLONE_NEWNS|CLONE_NEWUSER));
+
+   fsfd = fsopen("nfs4", 0);
+   E(fsfd);
+
+   E(send(sk, &fsfd, sizeof(fsfd), 0));
+   // Wait for the other side to close / finish / wrap up
+   recv(sk, &fsfd, sizeof(fsfd), 0);
+   E(close(sk));
+
+   return 0;
+}
+
+int main(int argc, char *argv[])
+{
+   int pidfd, mntfd, fsfd, fsfdnum, status, sk_pair[2];
+   struct statx statxbuf;
+   char buf[1024];
+   pid_t pid;
+
+   if (mkdir("/mnt/share", 0777) && errno != EEXIST) {
+   perror("mkdir");
+   return 1;
+   }
+
+   E(chmod("/mnt/share", 0777));
+
+   if (mkdir("/mnt/nfs", 0755) && errno != EEXIST) {
+   perror("mkdir");
+   return 1;
+   }
+
+   if (unlink("/mnt/share/newfile") && errno != ENOENT) {
+   perror("unlink");
+   return 1;
+   }
+
+   E(creat("/mnt/share/testfile", 0644));
+   E(chown("/mnt/share/testfile", 1001, 1001));
+
+   /* exportfs is idempotent, but expects nfs-server to be running */
+   if (system("exportfs -o no_root_squash,no_subtree_check,rw 127.0.0.0/8:/mnt/share")) {
+   fprintf(stderr,
+   "Could not export /mnt/share. Is the NFS server running?\n");
+   return 1;
+   }
+
+   E(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair));
+
+   pid = fork();
+   E(pid);
+   if (pid == 0) {
+   E(close(sk_pair[0]));
+   return do_work(sk_pair[1]);
+   }
+
+   E(close(sk_pair[1])

[RESEND PATCH v2 1/3] NFS: Use cred from fscontext during fsmount

2020-10-16 Thread Sargun Dhillon

In several patches, support was introduced to NFS for user namespaces:

ccfe51a5161c: SUNRPC: Fix the server AUTH_UNIX userspace mappings
e6667c73a27d: SUNRPC: rsi_parse() should use the current user namespace
1a58e8a0e5c1: NFS: Store the credential of the mount process in the nfs_server
283ebe3ec415: SUNRPC: Use the client user namespace when encoding creds
ac83228a7101: SUNRPC: Use namespace of listening daemon in the client AUTH_GSS upcall
264d948ce7d0: NFS: Convert NFSv3 to use the container user namespace
58002399da65: NFSv4: Convert the NFS client idmapper to use the container user namespace
c207db2f5da5: NFS: Convert NFSv2 to use the container user namespace
3b7eb5e35d0f: NFS: When mounting, don't share filesystems between different user namespaces

All of these commits are predicated on the NFS server being created with
credentials that are in the user namespace of interest. The new VFS
mount APIs help in this[1], in that the creation of the FSFD (fsopen)
captures a set of credentials at creation time.

Normally, the new file system API users automatically get their
super block's user_ns set to the fc->user_ns in sget_fc, but since
NFS has to do special manipulation of UIDs / GIDs on the wire,
it keeps track of credentials itself.

Unfortunately, the credentials that NFS uses are the current_cred() at the
time FSCONFIG_CMD_CREATE is called. At that same point, mount_capable is
checked, which requires the caller to have CAP_SYS_ADMIN in the init_user_ns
because NFS does not have FS_USERNS_MOUNT.

This makes a subtle change so that the struct cred from fsopen
is used instead. Since the fs_context is available at server
creation time, and it has the credentials, we can just use
those.

This roughly allows a privileged user to mount on behalf of an unprivileged
usernamespace, by forking off and calling fsopen in the unprivileged user
namespace. It can then pass back that fsfd to the privileged process which
can configure the NFS mount, and then it can call FSCONFIG_CMD_CREATE
before switching back into the mount namespace of the container, and finish
up the mounting process and call fsmount and move_mount.

This is a small user-visible change, and only for users who perform this
elaborate process of passing around file descriptors and switching
namespaces. There may be a better way to go about this, such as enabling
FS_USERNS_MOUNT on NFS, but this seems like the safest and most
straightforward approach.

[1]: https://lore.kernel.org/linux-fsdevel/155059610368.17079.2220554006494174417.st...@warthog.procyon.org.uk/

Signed-off-by: Sargun Dhillon 
Cc: J. Bruce Fields 
Cc: Chuck Lever 
Cc: Trond Myklebust 
Cc: Anna Schumaker 
Cc: David Howells 
Cc: Al Viro 
Cc: Kyle Anderson 
---
 fs/nfs/client.c | 2 +-
 fs/nfs/nfs4client.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index f1ff3076e4a4..fdefcc649884 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -967,7 +967,7 @@ struct nfs_server *nfs_create_server(struct fs_context *fc)
if (!server)
return ERR_PTR(-ENOMEM);
 
-   server->cred = get_cred(current_cred());
+   server->cred = get_cred(fc->cred);
 
error = -ENOMEM;
fattr = nfs_alloc_fattr();
diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
index 0bd77cc1f639..92ff6fb8e324 100644
--- a/fs/nfs/nfs4client.c
+++ b/fs/nfs/nfs4client.c
@@ -1120,7 +1120,7 @@ struct nfs_server *nfs4_create_server(struct fs_context *fc)
if (!server)
return ERR_PTR(-ENOMEM);
 
-   server->cred = get_cred(current_cred());
+   server->cred = get_cred(fc->cred);
 
auth_probe = ctx->auth_info.flavor_len < 1;
 
-- 
2.25.1

[PATCH v2 1/3] NFS: Use cred from fscontext during fsmount

2020-10-16 Thread Sargun Dhillon

In several patches, support was introduced to NFS for user namespaces:

ccfe51a5161c: SUNRPC: Fix the server AUTH_UNIX userspace mappings
e6667c73a27d: SUNRPC: rsi_parse() should use the current user namespace
1a58e8a0e5c1: NFS: Store the credential of the mount process in the nfs_server
283ebe3ec415: SUNRPC: Use the client user namespace when encoding creds
ac83228a7101: SUNRPC: Use namespace of listening daemon in the client AUTH_GSS upcall
264d948ce7d0: NFS: Convert NFSv3 to use the container user namespace
58002399da65: NFSv4: Convert the NFS client idmapper to use the container user namespace
c207db2f5da5: NFS: Convert NFSv2 to use the container user namespace
3b7eb5e35d0f: NFS: When mounting, don't share filesystems between different user namespaces

All of these commits are predicated on the NFS server being created with
credentials that are in the user namespace of interest. The new VFS
mount APIs help in this[1], in that the creation of the FSFD (fsopen)
captures a set of credentials at creation time.

Normally, the new file system API users automatically get their
super block's user_ns set to the fc->user_ns in sget_fc, but since
NFS has to do special manipulation of UIDs / GIDs on the wire,
it keeps track of credentials itself.

Unfortunately, the credentials that NFS uses are the current_cred() at the
time FSCONFIG_CMD_CREATE is called. At that same point, mount_capable is
checked, which requires the caller to have CAP_SYS_ADMIN in the init_user_ns
because NFS does not have FS_USERNS_MOUNT.

This makes a subtle change so that the struct cred from fsopen
is used instead. Since the fs_context is available at server
creation time, and it has the credentials, we can just use
those.

This roughly allows a privileged user to mount on behalf of an unprivileged
usernamespace, by forking off and calling fsopen in the unprivileged user
namespace. It can then pass back that fsfd to the privileged process which
can configure the NFS mount, and then it can call FSCONFIG_CMD_CREATE
before switching back into the mount namespace of the container, and finish
up the mounting process and call fsmount and move_mount.

This is a small user-visible change, and only for users who perform this
elaborate process of passing around file descriptors and switching
namespaces. There may be a better way to go about this, such as enabling
FS_USERNS_MOUNT on NFS, but this seems like the safest and most
straightforward approach.

[1]: https://lore.kernel.org/linux-fsdevel/155059610368.17079.2220554006494174417.st...@warthog.procyon.org.uk/

Signed-off-by: Sargun Dhillon 
Cc: J. Bruce Fields 
Cc: Chuck Lever 
Cc: Trond Myklebust 
Cc: Anna Schumaker 
Cc: David Howells 
Cc: Al Viro 
Cc: Kyle Anderson 
---
 fs/nfs/client.c | 2 +-
 fs/nfs/nfs4client.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index f1ff3076e4a4..fdefcc649884 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -967,7 +967,7 @@ struct nfs_server *nfs_create_server(struct fs_context *fc)
if (!server)
return ERR_PTR(-ENOMEM);
 
-   server->cred = get_cred(current_cred());
+   server->cred = get_cred(fc->cred);
 
error = -ENOMEM;
fattr = nfs_alloc_fattr();
diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
index 0bd77cc1f639..92ff6fb8e324 100644
--- a/fs/nfs/nfs4client.c
+++ b/fs/nfs/nfs4client.c
@@ -1120,7 +1120,7 @@ struct nfs_server *nfs4_create_server(struct fs_context *fc)
if (!server)
return ERR_PTR(-ENOMEM);
 
-   server->cred = get_cred(current_cred());
+   server->cred = get_cred(fc->cred);
 
auth_probe = ctx->auth_info.flavor_len < 1;
 
-- 
2.25.1

[PATCH v2 0/3] NFS User Namespaces

2020-10-16 Thread Sargun Dhillon

This patchset adds some functionality to allow NFS to be used from
user namespaces (containers).

Changes since v1:
  * Added samples

Sargun Dhillon (3):
  NFS: Use cred from fscontext during fsmount
  samples/vfs: Split out common code for new syscall APIs
  samples/vfs: Add example leveraging NFS with new APIs and user
namespaces

 fs/nfs/client.c|   2 +-
 fs/nfs/flexfilelayout/flexfilelayout.c |   1 +
 fs/nfs/nfs4client.c|   2 +-
 samples/vfs/.gitignore |   2 +
 samples/vfs/Makefile   |   5 +-
 samples/vfs/test-fsmount.c |  86 +---
 samples/vfs/test-nfs-userns.c  | 181 +
 samples/vfs/vfs-helper.c   |  43 ++
 samples/vfs/vfs-helper.h   |  55 
 9 files changed, 289 insertions(+), 88 deletions(-)
 create mode 100644 samples/vfs/test-nfs-userns.c
 create mode 100644 samples/vfs/vfs-helper.c
 create mode 100644 samples/vfs/vfs-helper.h

-- 
2.25.1

[PATCH v2 3/3] samples/vfs: Add example leveraging NFS with new APIs and user namespaces

2020-10-16 Thread Sargun Dhillon

This adds an example that assumes you already have an NFS server set up.
It does the work of creating a user namespace and an NFS mount from
that user namespace, which then exposes different UIDs than those of
the init user namespace.

Signed-off-by: Sargun Dhillon 
Cc: J. Bruce Fields 
Cc: Chuck Lever 
Cc: Trond Myklebust 
Cc: Anna Schumaker 
Cc: David Howells 
Cc: Al Viro 
Cc: Kyle Anderson 
---
 fs/nfs/flexfilelayout/flexfilelayout.c |   1 +
 samples/vfs/.gitignore |   2 +
 samples/vfs/Makefile   |   3 +-
 samples/vfs/test-nfs-userns.c  | 181 +
 4 files changed, 186 insertions(+), 1 deletion(-)
 create mode 100644 samples/vfs/test-nfs-userns.c

diff --git a/fs/nfs/flexfilelayout/flexfilelayout.c 
b/fs/nfs/flexfilelayout/flexfilelayout.c
index f9348ed1bcda..ee45ff7d75ac 100644
--- a/fs/nfs/flexfilelayout/flexfilelayout.c
+++ b/fs/nfs/flexfilelayout/flexfilelayout.c
@@ -361,6 +361,7 @@ ff_layout_alloc_lseg(struct pnfs_layout_hdr *lh,
 struct nfs4_layoutget_res *lgr,
 gfp_t gfp_flags)
 {
+   struct user_namespace *user_ns = lh->plh_lc_cred->user_ns;
struct pnfs_layout_segment *ret;
struct nfs4_ff_layout_segment *fls = NULL;
struct xdr_stream stream;
diff --git a/samples/vfs/.gitignore b/samples/vfs/.gitignore
index 8fdabf7e5373..1d09826b31a6 100644
--- a/samples/vfs/.gitignore
+++ b/samples/vfs/.gitignore
@@ -1,3 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0-only
 test-fsmount
 test-statx
+test-nfs-userns
+
diff --git a/samples/vfs/Makefile b/samples/vfs/Makefile
index 7f76875eaa70..6a2926080c08 100644
--- a/samples/vfs/Makefile
+++ b/samples/vfs/Makefile
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
 test-fsmount-objs := test-fsmount.o vfs-helper.o
-userprogs := test-fsmount test-statx
+test-nfs-userns-objs := test-nfs-userns.o vfs-helper.o
+userprogs := test-fsmount test-statx test-nfs-userns
 
 always-y := $(userprogs)
 
diff --git a/samples/vfs/test-nfs-userns.c b/samples/vfs/test-nfs-userns.c
new file mode 100644
index ..108af924cbdd
--- /dev/null
+++ b/samples/vfs/test-nfs-userns.c
@@ -0,0 +1,181 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "vfs-helper.h"
+
+
+#define WELL_KNOWN_FD  100
+
+static inline int pidfd_open(pid_t pid, unsigned int flags)
+{
+   return syscall(__NR_pidfd_open, pid, flags);
+}
+
+static inline int pidfd_getfd(int pidfd, int fd, int flags)
+{
+   return syscall(__NR_pidfd_getfd, pidfd, fd, flags);
+}
+
+static void write_to_path(const char *path, const char *str)
+{
+   int fd, len = strlen(str);
+
+   fd = open(path, O_WRONLY);
+   if (fd < 0) {
+   fprintf(stderr, "Can't open %s: %s\n", path, strerror(errno));
+   exit(1);
+   }
+
+   if (write(fd, str, len) != len) {
+   fprintf(stderr, "Can't write string: %s\n", strerror(errno));
+   exit(1);
+   }
+
+   E(close(fd));
+}
+
+static int do_work(int sk)
+{
+   int fsfd;
+
+   E(unshare(CLONE_NEWNS|CLONE_NEWUSER));
+
+   fsfd = fsopen("nfs4", 0);
+   E(fsfd);
+
+   E(send(sk, &fsfd, sizeof(fsfd), 0));
+   // Wait for the other side to close / finish / wrap up
+   recv(sk, &fsfd, sizeof(fsfd), 0);
+   E(close(sk));
+
+   return 0;
+}
+
+int main(int argc, char *argv[])
+{
+   int pidfd, mntfd, fsfd, fsfdnum, status, sk_pair[2];
+   struct statx statxbuf;
+   char buf[1024];
+   pid_t pid;
+
+   if (mkdir("/mnt/share", 0777) && errno != EEXIST) {
+   perror("mkdir");
+   return 1;
+   }
+
+   E(chmod("/mnt/share", 0777));
+
+   if (mkdir("/mnt/nfs", 0755) && errno != EEXIST) {
+   perror("mkdir");
+   return 1;
+   }
+
+   if (unlink("/mnt/share/newfile") && errno != ENOENT) {
+   perror("unlink");
+   return 1;
+   }
+
+   E(creat("/mnt/share/testfile", 0644));
+   E(chown("/mnt/share/testfile", 1001, 1001));
+
+   /* exportfs is idempotent, but expects nfs-server to be running */
+   if (system("exportfs -o no_root_squash,no_subtree_check,rw 127.0.0.0/8:/mnt/share")) {
+   fprintf(stderr,
+   "Could not export /mnt/share. Is the NFS server running?\n");
+   return 1;
+   }
+
+   E(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair));
+
+   pid = fork();
+   E(pid);
+   if (pid == 0) {
+   E(close(sk_pair[0]));
+   return do_work(sk_pair[1]);
+   }
+
+   E(close(sk_pair[1])

[PATCH v2 2/3] samples/vfs: Split out common code for new syscall APIs

2020-10-16 Thread Sargun Dhillon

There are a bunch of helper functions which make using the new
mount APIs much easier. As we add examples of leveraging the
new APIs, it probably makes sense to promote code reuse.

Cc: David Howells 
Cc: Al Viro 
Cc: Kyle Anderson 
---
 samples/vfs/Makefile   |  2 +
 samples/vfs/test-fsmount.c | 86 +-
 samples/vfs/vfs-helper.c   | 43 +++
 samples/vfs/vfs-helper.h   | 55 
 4 files changed, 101 insertions(+), 85 deletions(-)
 create mode 100644 samples/vfs/vfs-helper.c
 create mode 100644 samples/vfs/vfs-helper.h

diff --git a/samples/vfs/Makefile b/samples/vfs/Makefile
index 00b6824f9237..7f76875eaa70 100644
--- a/samples/vfs/Makefile
+++ b/samples/vfs/Makefile
@@ -1,5 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
+test-fsmount-objs := test-fsmount.o vfs-helper.o
 userprogs := test-fsmount test-statx
+
 always-y := $(userprogs)
 
 userccflags += -I usr/include
diff --git a/samples/vfs/test-fsmount.c b/samples/vfs/test-fsmount.c
index 50f47b72e85f..36a4fa886200 100644
--- a/samples/vfs/test-fsmount.c
+++ b/samples/vfs/test-fsmount.c
@@ -14,91 +14,7 @@
 #include 
 #include 
 #include 
-
-#define E(x) do { if ((x) == -1) { perror(#x); exit(1); } } while(0)
-
-static void check_messages(int fd)
-{
-   char buf[4096];
-   int err, n;
-
-   err = errno;
-
-   for (;;) {
-   n = read(fd, buf, sizeof(buf));
-   if (n < 0)
-   break;
-   n -= 2;
-
-   switch (buf[0]) {
-   case 'e':
-   fprintf(stderr, "Error: %*.*s\n", n, n, buf + 2);
-   break;
-   case 'w':
-   fprintf(stderr, "Warning: %*.*s\n", n, n, buf + 2);
-   break;
-   case 'i':
-   fprintf(stderr, "Info: %*.*s\n", n, n, buf + 2);
-   break;
-   }
-   }
-
-   errno = err;
-}
-
-static __attribute__((noreturn))
-void mount_error(int fd, const char *s)
-{
-   check_messages(fd);
-   fprintf(stderr, "%s: %m\n", s);
-   exit(1);
-}
-
-/* Hope -1 isn't a syscall */
-#ifndef __NR_fsopen
-#define __NR_fsopen -1
-#endif
-#ifndef __NR_fsmount
-#define __NR_fsmount -1
-#endif
-#ifndef __NR_fsconfig
-#define __NR_fsconfig -1
-#endif
-#ifndef __NR_move_mount
-#define __NR_move_mount -1
-#endif
-
-
-static inline int fsopen(const char *fs_name, unsigned int flags)
-{
-   return syscall(__NR_fsopen, fs_name, flags);
-}
-
-static inline int fsmount(int fsfd, unsigned int flags, unsigned int ms_flags)
-{
-   return syscall(__NR_fsmount, fsfd, flags, ms_flags);
-}
-
-static inline int fsconfig(int fsfd, unsigned int cmd,
-  const char *key, const void *val, int aux)
-{
-   return syscall(__NR_fsconfig, fsfd, cmd, key, val, aux);
-}
-
-static inline int move_mount(int from_dfd, const char *from_pathname,
-int to_dfd, const char *to_pathname,
-unsigned int flags)
-{
-   return syscall(__NR_move_mount,
-  from_dfd, from_pathname,
-  to_dfd, to_pathname, flags);
-}
-
-#define E_fsconfig(fd, cmd, key, val, aux) \
-   do {\
-   if (fsconfig(fd, cmd, key, val, aux) == -1) \
-   mount_error(fd, key ?: "create");   \
-   } while (0)
+#include "vfs-helper.h"
 
 int main(int argc, char *argv[])
 {
diff --git a/samples/vfs/vfs-helper.c b/samples/vfs/vfs-helper.c
new file mode 100644
index ..bae2bc03c923
--- /dev/null
+++ b/samples/vfs/vfs-helper.c
@@ -0,0 +1,43 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+#include 
+#include 
+#include 
+#include 
+#include "vfs-helper.h"
+
+void check_messages(int fd)
+{
+   char buf[4096];
+   int err, n;
+
+   err = errno;
+
+   for (;;) {
+   n = read(fd, buf, sizeof(buf));
+   if (n < 0)
+   break;
+   n -= 2;
+
+   switch (buf[0]) {
+   case 'e':
+   fprintf(stderr, "Error: %*.*s\n", n, n, buf + 2);
+   break;
+   case 'w':
+   fprintf(stderr, "Warning: %*.*s\n", n, n, buf + 2);
+   break;
+   case 'i':
+   fprintf(stderr, "Info: %*.*s\n", n, n, buf + 2);
+   break;
+   }
+   }
+
+   errno = err;
+}
+
+__attribute__((noreturn))
+void mount_error(int fd, const char *s)
+{
+   check_messages(fd);
+   fprintf(stderr, "%s: %m\n", s);
+   exit(1);
+}
\ No newline at end of file
diff --git a/samples/vfs/vfs-helper.h b/samples/vfs/vfs-helper.h
new file mode 100644
index ..be460ab48247
--- /dev/null
+++

[RFC PATCH] nfs: Use cred from fscontext during fsmount

2020-10-13 Thread Sargun Dhillon

This is a subtle change that is important for usage of NFS within
user namespaces. With the new mount APIs, the fscontext has an associated
struct cred. This struct cred is created at the time "fsopen" is called.
This cred object contains user namespaces, network namespaces, etc...

Right now, rather than using the cred / network namespace / user namespaces
that are all acquired at the time fsopen is called, we use some bits at the
time FSCONFIG_CMD_CREATE is called, and other bits at the time fsopen is
called. Specifically, the RPC client itself lives in the network namespace
that fsopen was called within. On the other hand, the credentials the RPC
client uses are the ones retrieved at the time of FSCONFIG_CMD_CREATE.

When FSCONFIG_CMD_CREATE is called, the vfs layer checks if the user has
CAP_SYS_ADMIN in the init user ns, as NFS does not have the FS_USERNS_MOUNT
flag enabled. Due to this, there is no way of configuring an NFS mount to
use id mappings from a user namespace.

It may make sense to switch from using clp->cl_rpcclient->cl_cred->user_ns
as the user namespace for the idmapper to clp->cl_net->user_ns, to make
sure that everything is aligned based on the net ns, and matches what has
been previously discussed [1].

Although this is a change that would affect userspace, it is very unlikely
that anyone is initializing the NFS FS FD in an unprivileged namespace,
and then calling FSCONFIG_CMD_CREATE to only get the network namespace's
effects, and not all of the effects. The fscontext API has provisions
for being able to configure specific namespaces.

[1]: https://lore.kernel.org/linux-nfs/camp4zn-mw1u3pos9k_jepieu2+owg6hdxdrq2lt3p173j_s...@mail.gmail.com/

Signed-off-by: Sargun Dhillon 
---
 fs/nfs/client.c | 2 +-
 fs/nfs/nfs4client.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 4b8cc93913f7..bd26ec6a2984 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -985,7 +985,7 @@ struct nfs_server *nfs_create_server(struct fs_context *fc)
if (!server)
return ERR_PTR(-ENOMEM);
 
-   server->cred = get_cred(current_cred());
+   server->cred = get_cred(fc->cred);
 
error = -ENOMEM;
fattr = nfs_alloc_fattr();
diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
index daacc78a3d48..818638cb10c4 100644
--- a/fs/nfs/nfs4client.c
+++ b/fs/nfs/nfs4client.c
@@ -1151,7 +1151,7 @@ struct nfs_server *nfs4_create_server(struct fs_context 
*fc)
if (!server)
return ERR_PTR(-ENOMEM);
 
-   server->cred = get_cred(current_cred());
+   server->cred = get_cred(fc->cred);
 
auth_probe = ctx->auth_info.flavor_len < 1;
 
-- 
2.25.1
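
The difference between the old and the patched behaviour can be sketched with
a toy userspace model. This is only an illustration under stated assumptions:
every `toy_*` name below is hypothetical and none of it is kernel code. It
models an fs context that records the caller's cred at "fsopen" time, and two
variants of server creation: one that takes the cred of whoever issues the
create step, and one that takes the cred recorded in the context.

```c
#include <stddef.h>

/* Toy stand-ins for kernel objects. These are NOT real kernel types. */
struct toy_user_ns { int id; };
struct toy_cred { const struct toy_user_ns *user_ns; };
struct toy_fs_context { const struct toy_cred *cred; };

/* Models fsopen(): the context records the opener's cred. */
static struct toy_fs_context toy_fsopen(const struct toy_cred *current_cred)
{
	struct toy_fs_context fc = { .cred = current_cred };

	return fc;
}

/* Old behaviour: the server cred is whoever performs the create step. */
static const struct toy_cred *
toy_create_server_old(const struct toy_fs_context *fc,
		      const struct toy_cred *current_cred)
{
	(void)fc;
	return current_cred;
}

/* Patched behaviour: the server cred is the one captured at fsopen time. */
static const struct toy_cred *
toy_create_server_new(const struct toy_fs_context *fc,
		      const struct toy_cred *current_cred)
{
	(void)current_cred;
	return fc->cred;
}
```

In this model, if an unprivileged task in a child user namespace opens the
context and a privileged task in the init namespace performs the create step,
only the second variant ends up with the child namespace's cred.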

[PATCH] NFS: Only reference user namespace from nfs4idmap struct instead of cred

2020-10-12 Thread Sargun Dhillon

The nfs4idmapper only needs access to the user namespace, and not the
entire cred struct. This replaces the struct cred* member with
struct user_namespace*. This is mostly hygiene, so we don't have to
hold onto the cred object, which has extraneous references to
things like user_struct. This also makes switching away
from init_user_ns more straightforward in the future.

Signed-off-by: Sargun Dhillon 
---
 fs/nfs/nfs4idmap.c | 15 ---
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/fs/nfs/nfs4idmap.c b/fs/nfs/nfs4idmap.c
index 62e6eea5c516..8d8aba305ecc 100644
--- a/fs/nfs/nfs4idmap.c
+++ b/fs/nfs/nfs4idmap.c
@@ -46,6 +46,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "internal.h"
 #include "netns.h"
@@ -69,13 +70,13 @@ struct idmap {
struct rpc_pipe *idmap_pipe;
struct idmap_legacy_upcalldata *idmap_upcall_data;
struct mutex idmap_mutex;
-   const struct cred   *cred;
+   struct user_namespace   *user_ns;
 };
 
 static struct user_namespace *idmap_userns(const struct idmap *idmap)
 {
-   if (idmap && idmap->cred)
-   return idmap->cred->user_ns;
+   if (idmap && idmap->user_ns)
+   return idmap->user_ns;
return &init_user_ns;
 }
 
@@ -286,7 +287,7 @@ static struct key *nfs_idmap_request_key(const char *name, 
size_t namelen,
if (ret < 0)
return ERR_PTR(ret);
 
-   if (!idmap->cred || idmap->cred->user_ns == &init_user_ns)
+   if (!idmap->user_ns || idmap->user_ns == &init_user_ns)
rkey = request_key(&key_type_id_resolver, desc, "");
if (IS_ERR(rkey)) {
mutex_lock(&idmap->idmap_mutex);
@@ -462,7 +463,7 @@ nfs_idmap_new(struct nfs_client *clp)
return -ENOMEM;
 
mutex_init(&idmap->idmap_mutex);
-   idmap->cred = get_cred(clp->cl_rpcclient->cl_cred);
+   idmap->user_ns = get_user_ns(clp->cl_rpcclient->cl_cred->user_ns);
 
rpc_init_pipe_dir_object(&idmap->idmap_pdo,
&nfs_idmap_pipe_dir_object_ops,
@@ -486,7 +487,7 @@ nfs_idmap_new(struct nfs_client *clp)
 err_destroy_pipe:
rpc_destroy_pipe_data(idmap->idmap_pipe);
 err:
-   put_cred(idmap->cred);
+   put_user_ns(idmap->user_ns);
kfree(idmap);
return error;
 }
@@ -503,7 +504,7 @@ nfs_idmap_delete(struct nfs_client *clp)
&clp->cl_rpcclient->cl_pipedir_objects,
&idmap->idmap_pdo);
rpc_destroy_pipe_data(idmap->idmap_pipe);
-   put_cred(idmap->cred);
+   put_user_ns(idmap->user_ns);
kfree(idmap);
 }
 
-- 
2.25.1

Re: For review: seccomp_user_notif(2) manual page

2020-10-01 Thread Sargun Dhillon

On Wed, Sep 30, 2020 at 4:07 AM Michael Kerrisk (man-pages)
 wrote:
>
> Hi Tycho, Sargun (and all),
>
> I knew it would be a big ask, but below is kind of the manual page
> I was hoping you might write [1] for the seccomp user-space notification
> mechanism. Since you didn't (and because 5.9 adds various new pieces
> such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD
> that also will need documenting [2]), I did :-). But of course I may
> have made mistakes...
>
> I've shown the rendered version of the page below, and would love
> to receive review comments from you and others, and acks, etc.
>
> There are a few FIXMEs sprinkled into the page, including one
> that relates to what appears to me to be a misdesign (possibly
> fixable) in the operation of the SECCOMP_IOCTL_NOTIF_RECV
> operation. I would be especially interested in feedback on that
> FIXME, and also of course the other FIXMEs.
>
> The page includes an extensive (albeit slightly contrived)
> example program, and I would be happy also to receive comments
> on that program.
>
> The page source currently sits in a branch (along with the text
> that you sent me for the seccomp(2) page) at
> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=seccomp_user_notif
>
> Thanks,
>
> Michael
>
> [1] 
> https://lore.kernel.org/linux-man/2cea5fec-e73e-5749-18af-15c35a4bd...@gmail.com/#t
> [2] Sargun, can you prepare something on SECCOMP_ADDFD_FLAG_SETFD
> and SECCOMP_IOCTL_NOTIF_ADDFD to be added to this page?
>
> 
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/

Should we consider the SECCOMP_GET_NOTIF_SIZES dance to be "deprecated" at
this point, given that the extensible ioctl mechanism works? If we add new
fields to the seccomp data structures, would we move them from fixed-size
ioctls to variable-sized ioctls that encode the data structure size / length?

-- This is mostly a question for Kees and Tycho.
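
For reference, the "dance" in question looks roughly like the sketch below: ask
the running kernel for its struct sizes, then allocate buffers that are at
least that large. This assumes kernel headers new enough to provide
SECCOMP_GET_NOTIF_SIZES and struct seccomp_notif_sizes; the helper names
query_notif_sizes() and alloc_notif() are made up for illustration.

```c
#define _GNU_SOURCE
#include <linux/seccomp.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Ask the running kernel how large its notification structs are.
 * Returns 0 on success, or -1 with errno set (for example on kernels
 * that predate SECCOMP_GET_NOTIF_SIZES).
 */
static int query_notif_sizes(struct seccomp_notif_sizes *sizes)
{
	return (int)syscall(__NR_seccomp, SECCOMP_GET_NOTIF_SIZES, 0, sizes);
}

/*
 * Allocate a zeroed notification buffer that is big enough for both the
 * kernel's idea of the struct and the one we were compiled against.
 */
static struct seccomp_notif *alloc_notif(const struct seccomp_notif_sizes *sizes)
{
	size_t len = sizes->seccomp_notif;

	if (len < sizeof(struct seccomp_notif))
		len = sizeof(struct seccomp_notif);

	return calloc(1, len);
}
```

With a size-encoding ioctl, the caller would pass its own struct size in the
command number instead, and this extra round trip would not be needed.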

Re: [PATCH v4 00/11] Add seccomp notifier ioctl that enables adding fds

2020-06-18 Thread Sargun Dhillon

On Mon, Jun 15, 2020 at 08:25:13PM -0700, Kees Cook wrote:
> Hello!
> 
> This is a bit of thread-merge between [1] and [2]. tl;dr: add a way for
> a seccomp user_notif process manager to inject files into the managed
> process in order to handle emulation of various fd-returning syscalls
> across security boundaries. Containers folks and Chrome are in need
> of the feature, and investigating this solution uncovered (and fixed)
> implementation issues with existing file sending routines.
> 
> I intend to carry this in the seccomp tree, unless someone has objections.
> :) Please review and test!
> 
> -Kees
> 
> [1] https://lore.kernel.org/lkml/20200603011044.7972-1-sar...@sargun.me/
> [2] 
> https://lore.kernel.org/lkml/20200610045214.1175600-1-keesc...@chromium.org/
> 
> Kees Cook (9):
>   net/scm: Regularize compat handling of scm_detach_fds()
>   fs: Move __scm_install_fd() to __fd_install_received()
>   fs: Add fd_install_received() wrapper for __fd_install_received()
>   pidfd: Replace open-coded partial fd_install_received()
>   fs: Expand __fd_install_received() to accept fd
>   selftests/seccomp: Make kcmp() less required
>   selftests/seccomp: Rename user_trap_syscall() to user_notif_syscall()
>   seccomp: Switch addfd to Extensible Argument ioctl
>   seccomp: Fix ioctl number for SECCOMP_IOCTL_NOTIF_ID_VALID
> 
This looks much cleaner than the original patchset. Thanks.

Reviewed-by: Sargun Dhillon 

for the pidfd and fs* changes.

> Sargun Dhillon (2):
>   seccomp: Introduce addfd ioctl to seccomp user notifier
>   selftests/seccomp: Test SECCOMP_IOCTL_NOTIF_ADDFD
> 
>  fs/file.c |  65 
>  include/linux/file.h  |  16 +
>  include/uapi/linux/seccomp.h  |  25 +-
>  kernel/pid.c  |  11 +-
>  kernel/seccomp.c  | 181 -
>  net/compat.c  |  55 ++-
>  net/core/scm.c|  50 +--
>  tools/testing/selftests/seccomp/seccomp_bpf.c | 350 +++---
>  8 files changed, 618 insertions(+), 135 deletions(-)
> 
> -- 
> 2.25.1
>

Re: [PATCH v5 3/7] fs: Add fd_install_received() wrapper for __fd_install_received()

2020-06-17 Thread Sargun Dhillon

On Wed, Jun 17, 2020 at 03:03:23PM -0700, Kees Cook wrote:
> For both pidfd and seccomp, the __user pointer is not used. Update
> __fd_install_received() to make writing to ufd optional via a NULL check.
> However, for the fd_install_received_user() wrapper, ufd is NULL checked
> so an -EFAULT can be returned to avoid changing the SCM_RIGHTS interface
> behavior. Add new wrapper fd_install_received() for pidfd and seccomp
> that does not use the ufd argument. For the new helper, the new fd needs
> to be returned on success. Update the existing callers to handle it.
> 
> Signed-off-by: Kees Cook 
> ---
>  fs/file.c| 22 ++
>  include/linux/file.h |  7 +++
>  net/compat.c |  2 +-
>  net/core/scm.c   |  2 +-
>  4 files changed, 23 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/file.c b/fs/file.c
> index f2167d6feec6..de85a42defe2 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -942,9 +942,10 @@ int replace_fd(unsigned fd, struct file *file, unsigned 
> flags)
>   * @o_flags: the O_* flags to apply to the new fd entry
>   *
>   * Installs a received file into the file descriptor table, with appropriate
> - * checks and count updates. Writes the fd number to userspace.
> + * checks and count updates. Optionally writes the fd number to userspace, if
> + * @ufd is non-NULL.
>   *
> - * Returns -ve on error.
> + * Returns newly install fd or -ve on error.
>   */
>  int __fd_install_received(struct file *file, int __user *ufd, unsigned int 
> o_flags)
>  {
> @@ -960,20 +961,25 @@ int __fd_install_received(struct file *file, int __user 
> *ufd, unsigned int o_fla
>   if (new_fd < 0)
>   return new_fd;
>  
> - error = put_user(new_fd, ufd);
> - if (error) {
> - put_unused_fd(new_fd);
> - return error;
> + if (ufd) {
> + error = put_user(new_fd, ufd);
> + if (error) {
> + put_unused_fd(new_fd);
> + return error;
> + }
>   }
>  
> - /* Bump the usage count and install the file. */
> + /* Bump the usage count and install the file. The resulting value of
> +  * "error" is ignored here since we only need to take action when
> +  * the file is a socket and testing "sock" for NULL is sufficient.
> +  */
>   sock = sock_from_file(file, &error);
>   if (sock) {
>   sock_update_netprioidx(&sock->sk->sk_cgrp_data);
>   sock_update_classid(&sock->sk->sk_cgrp_data);
>   }
>   fd_install(new_fd, get_file(file));
> - return 0;
> + return new_fd;
>  }
>  
>  static int ksys_dup3(unsigned int oldfd, unsigned int newfd, int flags)
> diff --git a/include/linux/file.h b/include/linux/file.h
> index fe18a1a0d555..e19974ed9322 100644
> --- a/include/linux/file.h
> +++ b/include/linux/file.h
> @@ -9,6 +9,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  struct file;
>  
> @@ -96,8 +97,14 @@ extern int __fd_install_received(struct file *file, int 
> __user *ufd,
>  static inline int fd_install_received_user(struct file *file, int __user 
> *ufd,
>  unsigned int o_flags)
>  {
> + if (ufd == NULL)
> + return -EFAULT;
Isn't this *technically* a behaviour change? Nonetheless, I think this is a
much better approach than forcing everyone to do NULL checking, and it avoids
at least one error case where the kernel installs FDs for SCM_RIGHTS and
they're not actually usable.

>   return __fd_install_received(file, ufd, o_flags);
>  }
> +static inline int fd_install_received(struct file *file, unsigned int 
> o_flags)
> +{
> + return __fd_install_received(file, NULL, o_flags);
> +}
>  
>  extern void flush_delayed_fput(void);
>  extern void __fput_sync(struct file *);
> diff --git a/net/compat.c b/net/compat.c
> index 94f288e8dac5..71494337cca7 100644
> --- a/net/compat.c
> +++ b/net/compat.c
> @@ -299,7 +299,7 @@ void scm_detach_fds_compat(struct msghdr *msg, struct 
> scm_cookie *scm)
>  
>   for (i = 0; i < fdmax; i++) {
>   err = fd_install_received_user(scm->fp->fp[i], cmsg_data + i, 
> o_flags);
> - if (err)
> + if (err < 0)
>   break;
>   }
>  
> diff --git a/net/core/scm.c b/net/core/scm.c
> index df190f1fdd28..b9a0442ebd26 100644
> --- a/net/core/scm.c
> +++ b/net/core/scm.c
> @@ -307,7 +307,7 @@ void scm_detach_fds(struct msghdr *msg, struct scm_cookie 
> *scm)
>  
>   for (i = 0; i < fdmax; i++) {
>   err = fd_install_received_user(scm->fp->fp[i], cmsg_data + i, 
> o_flags);
> - if (err)
> + if (err < 0)
>   break;
>   }
>  
> -- 
> 2.25.1
> 

Reviewed-by: Sargun Dhillon

Re: [PATCH v3] seccomp: Add find_notification helper

2020-06-17 Thread Sargun Dhillon

On Wed, Jun 17, 2020 at 01:08:44PM -0700, Nathan Chancellor wrote:
> On Mon, Jun 01, 2020 at 04:25:32AM -0700, Sargun Dhillon wrote:
> > This adds a helper which can iterate through a seccomp_filter to
> > find a notification matching an ID. It removes several replicated
> > chunks of code.
> > 
> > Signed-off-by: Sargun Dhillon 
> > Acked-by: Christian Brauner 
> > Reviewed-by: Tycho Andersen 
> > Cc: Matt Denton 
> > Cc: Kees Cook ,
> > Cc: Jann Horn ,
> > Cc: Robert Sesek ,
> > Cc: Chris Palmer 
> > Cc: Christian Brauner 
> > Cc: Tycho Andersen 
> > ---
> >  kernel/seccomp.c | 55 
> >  1 file changed, 28 insertions(+), 27 deletions(-)
> > 
> > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > index 55a6184f5990..cc6b47173a95 100644
> > --- a/kernel/seccomp.c
> > +++ b/kernel/seccomp.c
> > @@ -41,6 +41,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  enum notify_state {
> > SECCOMP_NOTIFY_INIT,
> > @@ -1021,10 +1022,27 @@ static int seccomp_notify_release(struct inode 
> > *inode, struct file *file)
> > return 0;
> >  }
> >  
> > +/* must be called with notif_lock held */
> > +static inline struct seccomp_knotif *
> > +find_notification(struct seccomp_filter *filter, u64 id)
> > +{
> > +   struct seccomp_knotif *cur;
> > +
> > > +   lockdep_assert_held(&filter->notify_lock);
> > > +
> > > +   list_for_each_entry(cur, &filter->notif->notifications, list) {
> > +   if (cur->id == id)
> > +   return cur;
> > +   }
> > +
> > +   return NULL;
> > +}
> > +
> > +
> >  static long seccomp_notify_recv(struct seccomp_filter *filter,
> > void __user *buf)
> >  {
> > -   struct seccomp_knotif *knotif = NULL, *cur;
> > +   struct seccomp_knotif *knotif, *cur;
> > struct seccomp_notif unotif;
> > ssize_t ret;
> >  
> > @@ -1078,15 +1096,8 @@ static long seccomp_notify_recv(struct 
> > seccomp_filter *filter,
> >  * may have died when we released the lock, so we need to make
> >  * sure it's still around.
> >  */
> > -   knotif = NULL;
> > > mutex_lock(&filter->notify_lock);
> > > -   list_for_each_entry(cur, &filter->notif->notifications, list) {
> > -   if (cur->id == unotif.id) {
> > -   knotif = cur;
> > -   break;
> > -   }
> > -   }
> > -
> > +   knotif = find_notification(filter, unotif.id);
> > if (knotif) {
> > knotif->state = SECCOMP_NOTIFY_INIT;
> > > up(&filter->notif->request);
> > @@ -1101,7 +1112,7 @@ static long seccomp_notify_send(struct seccomp_filter 
> > *filter,
> > void __user *buf)
> >  {
> > struct seccomp_notif_resp resp = {};
> > -   struct seccomp_knotif *knotif = NULL, *cur;
> > +   struct seccomp_knotif *knotif;
> > long ret;
> >  
> > > if (copy_from_user(&resp, buf, sizeof(resp)))
> > @@ -1118,13 +1129,7 @@ static long seccomp_notify_send(struct 
> > seccomp_filter *filter,
> > if (ret < 0)
> > return ret;
> >  
> > > -   list_for_each_entry(cur, &filter->notif->notifications, list) {
> > -   if (cur->id == resp.id) {
> > -   knotif = cur;
> > -   break;
> > -   }
> > -   }
> > -
> > +   knotif = find_notification(filter, resp.id);
> > if (!knotif) {
> > ret = -ENOENT;
> > goto out;
> > @@ -1150,7 +1155,7 @@ static long seccomp_notify_send(struct seccomp_filter 
> > *filter,
> >  static long seccomp_notify_id_valid(struct seccomp_filter *filter,
> > void __user *buf)
> >  {
> > -   struct seccomp_knotif *knotif = NULL;
> 
> I don't know that this should have been removed, clang now warns:
> 
> kernel/seccomp.c:1063:2: warning: variable 'knotif' is used uninitialized 
> whenever 'for' loop exits because its condition is false 
> [-Wsometimes-uninitialized]
> list_for_each_entry(cur, >notif->notifications, list) {
> ^
> include/linux/list.h:602:7: note: expanded from macro 'list_for_each_entry'
>  >member != (head);

Re: [PATCH v4 02/11] fs: Move __scm_install_fd() to __fd_install_received()

2020-06-15 Thread Sargun Dhillon

On Mon, Jun 15, 2020 at 08:25:15PM -0700, Kees Cook wrote:
> In preparation for users of the "install a received file" logic outside
> of net/ (pidfd and seccomp), relocate and rename __scm_install_fd() from
> net/core/scm.c to __fd_install_received() in fs/file.c, and provide a
> wrapper named fd_install_received_user(), as future patches will change
> the interface to __fd_install_received().
> 
> Signed-off-by: Kees Cook 
> ---
>  fs/file.c| 47 
>  include/linux/file.h |  8 
>  include/net/scm.h|  1 -
>  net/compat.c |  2 +-
>  net/core/scm.c   | 32 +-
>  5 files changed, 57 insertions(+), 33 deletions(-)
> 
> diff --git a/fs/file.c b/fs/file.c
> index abb8b7081d7a..fcfddae0d252 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -11,6 +11,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -18,6 +19,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  
>  unsigned int sysctl_nr_open __read_mostly = 1024*1024;
>  unsigned int sysctl_nr_open_min = BITS_PER_LONG;
> @@ -931,6 +934,50 @@ int replace_fd(unsigned fd, struct file *file, unsigned 
> flags)
>   return err;
>  }
>  
> +/**
> + * __fd_install_received() - Install received file into file descriptor table
> + *
> + * @fd: fd to install into (if negative, a new fd will be allocated)
> + * @file: struct file that was received from another process
> + * @ufd_required: true to use @ufd for writing fd number to userspace
> + * @ufd: __user pointer to write new fd number to
> + * @o_flags: the O_* flags to apply to the new fd entry
Probably doesn't matter, but this function doesn't take the fd or ufd_required
arguments in this patch.

> + *
> + * Installs a received file into the file descriptor table, with appropriate
> + * checks and count updates. Optionally writes the fd number to userspace.
ufd does not appear to be optional here.

> + *
> + * Returns -ve on error.
> + */
> +int __fd_install_received(struct file *file, int __user *ufd, unsigned int 
> o_flags)
> +{
> + struct socket *sock;
> + int new_fd;
> + int error;
> +
> + error = security_file_receive(file);
> + if (error)
> + return error;
> +
> + new_fd = get_unused_fd_flags(o_flags);
> + if (new_fd < 0)
> + return new_fd;
> +
> + error = put_user(new_fd, ufd);
> + if (error) {
> + put_unused_fd(new_fd);
> + return error;
> + }
> +
> + /* Bump the usage count and install the file. */
> + sock = sock_from_file(file, &error);
> + if (sock) {
> + sock_update_netprioidx(&sock->sk->sk_cgrp_data);
> + sock_update_classid(&sock->sk->sk_cgrp_data);
> + }
> + fd_install(new_fd, get_file(file));
> + return 0;
> +}
> +
>  static int ksys_dup3(unsigned int oldfd, unsigned int newfd, int flags)
>  {
>   int err = -EBADF;
> diff --git a/include/linux/file.h b/include/linux/file.h
> index 122f80084a3e..fe18a1a0d555 100644
> --- a/include/linux/file.h
> +++ b/include/linux/file.h
> @@ -91,6 +91,14 @@ extern void put_unused_fd(unsigned int fd);
>  
>  extern void fd_install(unsigned int fd, struct file *file);
>  
> +extern int __fd_install_received(struct file *file, int __user *ufd,
> +  unsigned int o_flags);
> +static inline int fd_install_received_user(struct file *file, int __user 
> *ufd,
> +unsigned int o_flags)
> +{
> + return __fd_install_received(file, ufd, o_flags);
> +}
> +
>  extern void flush_delayed_fput(void);
>  extern void __fput_sync(struct file *);
>  
> diff --git a/include/net/scm.h b/include/net/scm.h
> index 581a94d6c613..1ce365f4c256 100644
> --- a/include/net/scm.h
> +++ b/include/net/scm.h
> @@ -37,7 +37,6 @@ struct scm_cookie {
>  #endif
>  };
>  
> -int __scm_install_fd(struct file *file, int __user *ufd, unsigned int 
> o_flags);
>  void scm_detach_fds(struct msghdr *msg, struct scm_cookie *scm);
>  void scm_detach_fds_compat(struct msghdr *msg, struct scm_cookie *scm);
>  int __scm_send(struct socket *sock, struct msghdr *msg, struct scm_cookie 
> *scm);
> diff --git a/net/compat.c b/net/compat.c
> index 27d477fdcaa0..94f288e8dac5 100644
> --- a/net/compat.c
> +++ b/net/compat.c
> @@ -298,7 +298,7 @@ void scm_detach_fds_compat(struct msghdr *msg, struct 
> scm_cookie *scm)
>   int err = 0, i;
>  
>   for (i = 0; i < fdmax; i++) {
> - err = __scm_install_fd(scm->fp->fp[i], cmsg_data + i, o_flags);
> + err = fd_install_received_user(scm->fp->fp[i], cmsg_data + i, 
> o_flags);
>   if (err)
>   break;
>   }
> diff --git a/net/core/scm.c b/net/core/scm.c
> index 6151678c73ed..df190f1fdd28 100644
> --- a/net/core/scm.c
> +++ b/net/core/scm.c
> @@ -280,36 +280,6 @@ void put_cmsg_scm_timestamping(struct msghdr *msg, 
> struct scm_timestamping_inter
>  }
>

Re: [RFC PATCH] seccomp: Add extensibility mechanism to read notifications

2020-06-15 Thread Sargun Dhillon

On Mon, Jun 15, 2020 at 11:36:22AM +0200, Jann Horn wrote:
> On Sat, Jun 13, 2020 at 9:26 AM Sargun Dhillon  wrote:
> > This introduces an extensibility mechanism to receive seccomp
> > notifications. It uses read(2), as opposed to using an ioctl. The listener
> > must be first configured to write the notification via the
> > SECCOMP_IOCTL_NOTIF_CONFIG ioctl with the fields that the user is
> > interested in.
> >
> > This is different than the old SECCOMP_IOCTL_NOTIF_RECV method as it allows
> > for more flexibility. It allows the user to opt into certain fields, and
> > not others. This is nice for users who want to opt into some fields like
> > thread group leader. In the future, this mechanism can be used to expose
> > file descriptors to users,
> 
> Please don't touch the caller's file descriptor table from read/write
> handlers, only from ioctl handlers. A process should always be able to
> read from files supplied by an untrusted user without having to worry
> about new entries mysteriously popping up in its fd table.
> 
Acknowledged.

Is something like:
ioctl(listener, SECCOMP_GET_MEMORY, notification_id);

reasonable in your opinion?

> > such as a representation of the process's
> > memory. It also has good forwards and backwards compatibility guarantees.
> > Users with programs compiled against newer headers will work fine on older
> > kernels as long as they don't opt into any sizes, or optional fields that
> > are only available on newer kernels.
> >
> > The ioctl method relies on an extensible struct[1]. This extensible struct
> > is slightly misleading[2] as the ioctl number changes when we extend it.
> > This breaks backwards compatibility with older kernels even if we're not
> > asking for any fields that we do not need. In order to deal with this, the
> > ioctl number would need to be dynamic, or the user would need to pass the
> > size they're expecting, and we would need to implement "extended syscall"
> > semantics in ioctl. This potentially causes issues for future work on
> > kernel-assisted copying for ioctl user buffers.
> 
> I don't see the issue. Can't you replace "switch (cmd)" with "switch
> (cmd & ~IOCSIZE_MASK)" and then check the size separately?
It depends:
1. If we rely purely on the definitions in ioctl.h, and the user has pulled
   in a newer header file, it will fail on an older kernel. This is because
   the size is bigger, and we don't actually know whether they're interested
   in those new values.
2. We can define new seccomp IOCTL versions, and expose these to the user.
   This has some niceness to it, in that there's a simple backwards compatibility
   story. This is a little unorthodox though.
3. We do something like embed the version / size that someone is interested
   in in the struct, and the ioctl reads it in order to determine which version
   of the fields to populate. This is effectively what the read approach does
   with more steps.

There's no reason we can't do #3. Just a complexity tradeoff.
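
For reference, the suggestion quoted above (switch on the command with the
size bits masked out, then check the size separately) might look roughly like
the userspace model below. The struct layouts, NOTIF_RECV_UNSIZED and
notify_ioctl_model() are made up for illustration; only the _IOC* macros from
<linux/ioctl.h> are real. The function returns how many bytes a handler would
fill in for a given command number, or a negative errno.

```c
#include <assert.h>
#include <errno.h>
#include <linux/ioctl.h>
#include <stdint.h>

/* Hypothetical v1/v2 layouts; these are not the real seccomp_notif. */
struct notif_v1 {
	uint64_t id;
	uint64_t pid;
};

struct notif_v2 {
	uint64_t id;
	uint64_t pid;
	uint64_t fancy_new_field;
};

static_assert(sizeof(struct notif_v2) > sizeof(struct notif_v1),
	      "v2 must only append to v1");

/* Command number with the size bits cleared ('!' is the seccomp ioctl magic). */
#define NOTIF_RECV_UNSIZED _IOC(_IOC_READ | _IOC_WRITE, '!', 0, 0)

static long notify_ioctl_model(unsigned int cmd)
{
	/* The size the caller's headers encoded into the command number. */
	unsigned int size = _IOC_SIZE(cmd);

	switch (cmd & ~IOCSIZE_MASK) {
	case NOTIF_RECV_UNSIZED:
		if (size < sizeof(struct notif_v1))
			return -EINVAL;	/* smaller than any known layout */
		if (size < sizeof(struct notif_v2))
			return sizeof(struct notif_v1);
		return sizeof(struct notif_v2);
	default:
		return -EINVAL;
	}
}
```

A caller built against either layout reaches the same case label; only the
number of bytes written back differs.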

[RFC PATCH] seccomp: Add extensibility mechanism to read notifications

2020-06-13 Thread Sargun Dhillon

This introduces an extensibility mechanism to receive seccomp
notifications. It uses read(2), as opposed to using an ioctl. The listener
must be first configured to write the notification via the
SECCOMP_IOCTL_NOTIF_CONFIG ioctl with the fields that the user is
interested in.

This is different than the old SECCOMP_IOCTL_NOTIF_RECV method as it allows
for more flexibility. It allows the user to opt into certain fields, and
not others. This is nice for users who want to opt into some fields like
thread group leader. In the future, this mechanism can be used to expose
file descriptors to users, such as a representation of the process's
memory. It also has good forwards and backwards compatibility guarantees.
Users with programs compiled against newer headers will work fine on older
kernels as long as they don't opt into any sizes, or optional fields that
are only available on newer kernels.

The ioctl method relies on an extensible struct[1]. This extensible struct
is slightly misleading[2] as the ioctl number changes when we extend it.
This breaks backwards compatibility with older kernels even if we're not
asking for any fields that we do not need. In order to deal with this, the
ioctl number would need to be dynamic, or the user would need to pass the
size they're expecting, and we would need to implement "extended syscall"
semantics in ioctl. This potentially causes issues for future work on
kernel-assisted copying for ioctl user buffers.
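
To make the ioctl-number problem concrete, here is a small userspace sketch.
struct notif_v1 and struct notif_v2 are hypothetical stand-ins, not the real
uapi structs. The point is only that appending a field changes the size bits
that _IOWR() encodes, so the same logical command gets a different number,
which an older kernel's switch statement will not match.

```c
#include <assert.h>
#include <linux/ioctl.h>
#include <stdint.h>

/* Hypothetical: the struct as an older header might define it. */
struct notif_v1 {
	uint64_t id;
	uint64_t pid;
};

/* Hypothetical: the same struct after a field was appended. */
struct notif_v2 {
	uint64_t id;
	uint64_t pid;
	uint64_t fancy_new_field;
};

#define NOTIF_RECV_V1 _IOWR('!', 0, struct notif_v1)
#define NOTIF_RECV_V2 _IOWR('!', 0, struct notif_v2)

/* Same direction, type and nr, but the encoded size makes the numbers differ. */
static_assert(NOTIF_RECV_V1 != NOTIF_RECV_V2,
	      "extending the struct changes the ioctl number");

/* Returns 1 if two command numbers are identical once the size bits are ignored. */
static int same_command_ignoring_size(unsigned int a, unsigned int b)
{
	return (a & ~IOCSIZE_MASK) == (b & ~IOCSIZE_MASK);
}
```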

read(2) offers slightly simpler semantics for the user, in that they do
not need to pass in the size they're expecting to the kernel. Only the
size of the buffer they have allocated. Since this information is passed
along with the read syscall there isn't a requirement to read it back from
userspace. It also doesn't get into the EA ioctl / dynamic ioctl number
shenanigans discussed above.

Also, it plugs in nicely to Golang (or other high-level languages), as you
can just treat it like a normal file. Go will put it into the event poll
loop, and do a read on the buffer for you.
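
From C, the flow described above might look like the following sketch. The
definitions are copied from the uapi hunk of this patch under an RFC_ prefix,
because they are not in any released header; on a kernel without this patch
the ioctl should simply fail. The layout of the bytes returned by read(2) is
not assumed here, so the buffer is left uninterpreted. rfc_config_and_read()
is a hypothetical helper name.

```c
#include <assert.h>
#include <linux/ioctl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

/* Copied from this RFC's uapi changes; NOT part of any released kernel header. */
#define RFC_SECCOMP_DATA_SIZE_VER0	64
#define RFC_SECCOMP_NOTIF_FIELD_PID	(1UL << 0)

struct rfc_seccomp_notif_config {
	uint32_t optional_fields;	/* OR'd RFC_SECCOMP_NOTIF_FIELD_* */
	uint32_t seccomp_data_size;	/* RFC_SECCOMP_DATA_SIZE_* */
};

/* '!' is SECCOMP_IOC_MAGIC; nr 3 is what the patch assigns to NOTIF_CONFIG. */
#define RFC_SECCOMP_IOCTL_NOTIF_CONFIG \
	_IOW('!', 3, struct rfc_seccomp_notif_config)

/*
 * Configure the listener once, then read one notification into buf.
 * Returns the byte count from read(2), or -1 with errno set.
 */
static ssize_t rfc_config_and_read(int listener, void *buf, size_t len)
{
	struct rfc_seccomp_notif_config cfg = {
		.optional_fields = RFC_SECCOMP_NOTIF_FIELD_PID,
		.seccomp_data_size = RFC_SECCOMP_DATA_SIZE_VER0,
	};

	assert(buf != NULL);
	if (ioctl(listener, RFC_SECCOMP_IOCTL_NOTIF_CONFIG, &cfg) < 0)
		return -1;
	return read(listener, buf, len);
}
```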

[1]: https://lore.kernel.org/linux-api/20181209182414.30862-4-ty...@tycho.ws/
[2]: https://lore.kernel.org/lkml/20200610081237.GA23425@ircssh-2.c.rugged-nimbus-611.internal/

Signed-off-by: Sargun Dhillon 
---
 include/uapi/linux/seccomp.h  |  15 ++
 kernel/seccomp.c  | 245 --
 tools/testing/selftests/seccomp/seccomp_bpf.c |  61 +
 3 files changed, 299 insertions(+), 22 deletions(-)

diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index c1735455bc53..75a6cb56db84 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -77,6 +77,19 @@ struct seccomp_notif {
struct seccomp_data data;
 };
 
+enum seccomp_data_version_size {
+   SECCOMP_DATA_SIZE_NOT_PRESENT = 0,
+   SECCOMP_DATA_SIZE_VER0 = 64,
+   SECCOMP_DATA_SIZE_LATEST = SECCOMP_DATA_SIZE_VER0,
+};
+
+#define SECCOMP_NOTIF_FIELD_PID	(1UL << 0)
+
+struct seccomp_notif_config {
+   __u32 optional_fields; /* OR'd SECCOMP_NOTIF_FIELD_* */
+   __u32 seccomp_data_size; /* seccomp_notif_field_data_version */
+};
+
 /*
  * Valid flags for struct seccomp_notif_resp
  *
@@ -124,4 +137,6 @@ struct seccomp_notif_resp {
 #define SECCOMP_IOCTL_NOTIF_SEND   SECCOMP_IOWR(1, \
struct seccomp_notif_resp)
 #define SECCOMP_IOCTL_NOTIF_ID_VALID   SECCOMP_IOR(2, __u64)
+#define SECCOMP_IOCTL_NOTIF_CONFIG SECCOMP_IOW(3, \
+   struct seccomp_notif_config)
 #endif /* _UAPI_LINUX_SECCOMP_H */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 34dbf77569b3..006b387d3408 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -84,6 +84,27 @@ struct seccomp_knotif {
struct list_head list;
 };
 
+/* Returns bytes written, negative number for error code */
+typedef int (*seccomp_notification_appender_t)(struct seccomp_filter *filter,
+  void *buf,
+  struct seccomp_knotif *knotif);
+
+/**
+ * struct notification_read_config - configuration for read calls against
+ * seccomp listener FD. This is the specification of what the read size
+ * and read format is.
+ *
+ * @read_size: The size of the configured read. If it is 0, it means that the
+ * listener has not yet been configured.
+ * @optional_fields: Bitmask of enabled optional fields
+ * @seccomp_data: Callback to append seccomp_data to buffer
+ */
+struct notification_read_config {
+   u32 read_size;
+   u64 optional_fields;
+   seccomp_notification_appender_t seccomp_data;
+};
+
 /**
  * struct notification - container for seccomp userspace notifications. Since
  * most seccomp filters will not have notification listeners attached and this
@@ -100,6 +121,7 @@ struct notification {
struct semaphore request;
u64 next_id

Re: [PATCH v3 1/4] fs, net: Standardize on file_receive helper to move fds across processes

2020-06-12 Thread Sargun Dhillon

On Fri, Jun 12, 2020 at 08:36:03AM +0000, David Laight wrote:
> From: Kees Cook
> > Sent: 12 June 2020 00:50
> > > From: Sargun Dhillon
> > > > Sent: 11 June 2020 12:07
> > > > Subject: Re: [PATCH v3 1/4] fs, net: Standardize on file_receive helper 
> > > > to move fds across
> > processes
> > > >
> > > > On Thu, Jun 11, 2020 at 12:01:14PM +0200, Christian Brauner wrote:
> > > > > On Wed, Jun 10, 2020 at 07:59:55PM -0700, Kees Cook wrote:
> > > > > > > On Wed, Jun 10, 2020 at 08:12:38AM +0000, Sargun Dhillon wrote:
> > > > > > > As an aside, all of this junk should be dropped:
> > > > > > > > + ret = get_user(size, &uaddfd->size);
> > > > > > > > + if (ret)
> > > > > > > > + return ret;
> > > > > > > > +
> > > > > > > > + ret = copy_struct_from_user(&addfd, sizeof(addfd), uaddfd, size);
> > > > > > > + if (ret)
> > > > > > > + return ret;
> > > > > > >
> > > > > > > and the size member of the seccomp_notif_addfd struct. I brought 
> > > > > > > this up
> > > > > > > off-list with Tycho that ioctls have the size of the struct 
> > > > > > > embedded in them. We
> > > > > > > should just use that. The ioctl definition is based on this[2]:
> > > > > > > #define _IOC(dir,type,nr,size) \
> > > > > > >   (((dir)  << _IOC_DIRSHIFT) | \
> > > > > > >((type) << _IOC_TYPESHIFT) | \
> > > > > > >((nr)   << _IOC_NRSHIFT) | \
> > > > > > >((size) << _IOC_SIZESHIFT))
> > > > > > >
> > > > > > >
> > > > > > > We should just use copy_from_user for now. In the future, we can 
> > > > > > > either
> > > > > > > introduce new ioctl names for new structs, or extract the size 
> > > > > > > dynamically from
> > > > > > > the ioctl (and mask it out on the switch statement in 
> > > > > > > seccomp_notify_ioctl.
> > > > > >
> > > > > > Yeah, that seems reasonable. Here's the diff for that part:
> > > > >
> > > > > Why does it matter that the ioctl() has the size of the struct 
> > > > > embedded
> > > > > within? Afaik, the kernel itself doesn't do anything with that size. 
> > > > > It
> > > > > merely checks that the size is not pathological and it does so at
> > > > > compile time.
> > > > >
> > > > > #ifdef __CHECKER__
> > > > > #define _IOC_TYPECHECK(t) (sizeof(t))
> > > > > #else
> > > > > /* provoke compile error for invalid uses of size argument */
> > > > > extern unsigned int __invalid_size_argument_for_IOC;
> > > > > #define _IOC_TYPECHECK(t) \
> > > > >   ((sizeof(t) == sizeof(t[1]) && \
> > > > > sizeof(t) < (1 << _IOC_SIZEBITS)) ? \
> > > > > sizeof(t) : __invalid_size_argument_for_IOC)
> > > > > #endif
> > > > >
> > > > > The size itself is not verified at runtime. copy_struct_from_user()
> > > > > still makes sense at least if we're going to allow expanding the 
> > > > > struct
> > > > > in the future.
> > > > Right, but if we simply change our headers and extend the struct, it 
> > > > will break
> > > > all existing programs compiled against those headers. In order to avoid 
> > > > that, if
> > > > we intend on extending this struct by appending to it, we need to have a
> > > > backwards compatibility mechanism. Just having copy_struct_from_user 
> > > > isn't
> > > > enough. The data structure either must be fixed size, or we need a way 
> > > > to handle
> > > > multiple ioctl numbers derived from headers with different sized struct 
> > > > arguments
> > > >
> > > > The two approaches I see are:
> > > > 1. use more indirection. This has previous art in drm[1]. That's look
> > > > something like this:
> > > >
> > > > struct seccomp_notif_addfd_ptr {
> > > > __u64 size;
> > > > __u64 addr;
> > > > }
> >

Re: [PATCH v3 1/4] fs, net: Standardize on file_receive helper to move fds across processes

2020-06-11 Thread Sargun Dhillon

On Thu, Jun 11, 2020 at 12:01:14PM +0200, Christian Brauner wrote:
> On Wed, Jun 10, 2020 at 07:59:55PM -0700, Kees Cook wrote:
> > On Wed, Jun 10, 2020 at 08:12:38AM +0000, Sargun Dhillon wrote:
> > > As an aside, all of this junk should be dropped:
> > > + ret = get_user(size, &uaddfd->size);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + ret = copy_struct_from_user(&addfd, sizeof(addfd), uaddfd, size);
> > > + if (ret)
> > > + return ret;
> > > 
> > > and the size member of the seccomp_notif_addfd struct. I brought this up 
> > > off-list with Tycho that ioctls have the size of the struct embedded in 
> > > them. We 
> > > should just use that. The ioctl definition is based on this[2]:
> > > #define _IOC(dir,type,nr,size) \
> > >   (((dir)  << _IOC_DIRSHIFT) | \
> > >((type) << _IOC_TYPESHIFT) | \
> > >((nr)   << _IOC_NRSHIFT) | \
> > >((size) << _IOC_SIZESHIFT))
> > > 
> > > 
> > > We should just use copy_from_user for now. In the future, we can either 
> > > introduce new ioctl names for new structs, or extract the size 
> > > dynamically from 
> > > the ioctl (and mask it out on the switch statement in 
> > > seccomp_notify_ioctl.
> > 
> > Yeah, that seems reasonable. Here's the diff for that part:
> 
> Why does it matter that the ioctl() has the size of the struct embedded
> within? Afaik, the kernel itself doesn't do anything with that size. It
> merely checks that the size is not pathological and it does so at
> compile time.
> 
> #ifdef __CHECKER__
> #define _IOC_TYPECHECK(t) (sizeof(t))
> #else
> /* provoke compile error for invalid uses of size argument */
> extern unsigned int __invalid_size_argument_for_IOC;
> #define _IOC_TYPECHECK(t) \
>   ((sizeof(t) == sizeof(t[1]) && \
> sizeof(t) < (1 << _IOC_SIZEBITS)) ? \
> sizeof(t) : __invalid_size_argument_for_IOC)
> #endif
> 
> The size itself is not verified at runtime. copy_struct_from_user()
> still makes sense at least if we're going to allow expanding the struct
> in the future.
Right, but if we simply change our headers and extend the struct, it will break
all existing programs compiled against those headers. In order to avoid that, if
we intend on extending this struct by appending to it, we need to have a
backwards compatibility mechanism. Just having copy_struct_from_user isn't
enough. The data structure either must be fixed size, or we need a way to handle
multiple ioctl numbers derived from headers with different sized struct
arguments.

The two approaches I see are:
1. Use more indirection. This has prior art in drm[1]. That would look
something like this:

struct seccomp_notif_addfd_ptr {
	__u64 size;
	__u64 addr;
};

... And then it'd be up to us to dereference the addr and copy the struct from
user memory.

2. Expose one ioctl to the user, many internally

e.g., public api:

struct seccomp_notif {
	__u64 id;
	__u64 pid;
	struct seccomp_data data;
	__u64 fancy_new_field;
};

#define SECCOMP_IOCTL_NOTIF_RECV	SECCOMP_IOWR(0, struct seccomp_notif)

internally:
struct seccomp_notif_v1 {
	__u64 id;
	__u64 pid;
	struct seccomp_data data;
};

struct seccomp_notif_v2 {
	__u64 id;
	__u64 pid;
	struct seccomp_data data;
	__u64 fancy_new_field;
};

and we can switch like this:
switch (cmd) {
/*
 * For example. We actually have to do this for any struct we intend to
 * extend to get proper backwards compatibility.
 */
case SECCOMP_IOWR(0, struct seccomp_notif_v1):
	return seccomp_notify_recv(filter, buf, sizeof(struct seccomp_notif_v1));
case SECCOMP_IOWR(0, struct seccomp_notif_v2):
	return seccomp_notify_recv(filter, buf, sizeof(struct seccomp_notif_v2));
...
case SECCOMP_IOCTL_NOTIF_SEND:
	return seccomp_notify_send(filter, buf);
case SECCOMP_IOCTL_NOTIF_ID_VALID:
	return seccomp_notify_id_valid(filter, buf);
default:
	return -EINVAL;
}

This has the downside that programs compiled against more modern kernel headers 
will break on older kernels.

3. We can take the approach you suggested.

#define UNSIZED(cmd)	((cmd) & ~(_IOC_SIZEMASK << _IOC_SIZESHIFT))
static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
 unsigned long arg)
{
struct seccomp_filter *filter = file->private_data;
void __user *buf = (void __user *)arg;
int size = _IOC_SIZE(cmd);
cmd = UNSIZED(cmd);

switch (cmd) {
/* for example. We actually ha

Re: [PATCH v3 1/4] fs, net: Standardize on file_receive helper to move fds across processes

2020-06-11 Thread Sargun Dhillon

On Thu, Jun 11, 2020 at 11:19:42AM +0200, Christian Brauner wrote:
> On Wed, Jun 10, 2020 at 07:59:55PM -0700, Kees Cook wrote:
> > On Wed, Jun 10, 2020 at 08:12:38AM +0000, Sargun Dhillon wrote:
> > > As an aside, all of this junk should be dropped:
> > > + ret = get_user(size, &uaddfd->size);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + ret = copy_struct_from_user(&addfd, sizeof(addfd), uaddfd, size);
> > > + if (ret)
> > > + return ret;
> > > 
> > > and the size member of the seccomp_notif_addfd struct. I brought this up 
> > > off-list with Tycho that ioctls have the size of the struct embedded in 
> > > them. We 
> > > should just use that. The ioctl definition is based on this[2]:
> > > #define _IOC(dir,type,nr,size) \
> > >   (((dir)  << _IOC_DIRSHIFT) | \
> > >((type) << _IOC_TYPESHIFT) | \
> > >((nr)   << _IOC_NRSHIFT) | \
> > >((size) << _IOC_SIZESHIFT))
> > > 
> > > 
> > > We should just use copy_from_user for now. In the future, we can either 
> > > introduce new ioctl names for new structs, or extract the size 
> > > dynamically from 
> > > the ioctl (and mask it out on the switch statement in 
> > > seccomp_notify_ioctl.
> > 
> > Yeah, that seems reasonable. Here's the diff for that part:
> > 
> > diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> > index 7b6028b399d8..98bf19b4e086 100644
> > --- a/include/uapi/linux/seccomp.h
> > +++ b/include/uapi/linux/seccomp.h
> > @@ -118,7 +118,6 @@ struct seccomp_notif_resp {
> >  
> >  /**
> >   * struct seccomp_notif_addfd
> > - * @size: The size of the seccomp_notif_addfd datastructure
> >   * @id: The ID of the seccomp notification
> >   * @flags: SECCOMP_ADDFD_FLAG_*
> >   * @srcfd: The local fd number
> > @@ -126,7 +125,6 @@ struct seccomp_notif_resp {
> >   * @newfd_flags: The O_* flags the remote FD should have applied
> >   */
> >  struct seccomp_notif_addfd {
> > -   __u64 size;
> > __u64 id;
> > __u32 flags;
> > __u32 srcfd;
> > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > index 3c913f3b8451..00cbdad6c480 100644
> > --- a/kernel/seccomp.c
> > +++ b/kernel/seccomp.c
> > @@ -1297,14 +1297,9 @@ static long seccomp_notify_addfd(struct 
> > seccomp_filter *filter,
> > struct seccomp_notif_addfd addfd;
> > struct seccomp_knotif *knotif;
> > struct seccomp_kaddfd kaddfd;
> > -   u64 size;
> > int ret;
> >  
> > -   ret = get_user(size, &uaddfd->size);
> > -   if (ret)
> > -   return ret;
> > -
> > -   ret = copy_struct_from_user(&addfd, sizeof(addfd), uaddfd, size);
> > +   ret = copy_from_user(&addfd, uaddfd, sizeof(addfd));
> > if (ret)
> > return ret;
> >  
> > 
> > > 
> > > 
> > > +#define SECCOMP_IOCTL_NOTIF_ADDFD	SECCOMP_IOR(3,  \
> > > + struct seccomp_notif_addfd)
> > > 
> > > Lastly, what I believe to be a small mistake, it should be SECCOMP_IOW, 
> > > based on 
> > > the documentation in ioctl.h -- "_IOW means userland is writing and 
> > > kernel is 
> > > reading."
> > 
> > Oh. Yeah; good catch. Uhm, that means SECCOMP_IOCTL_NOTIF_ID_VALID
> > is wrong too, yes? Tycho, Christian, how disruptive would this be to
> > fix? (Perhaps support both and deprecate the IOR version at some point
> > in the future?)
> 
> We have custom defines in our source code, i.e.
> #define SECCOMP_IOCTL_NOTIF_ID_VALID  SECCOMP_IOR(2, __u64)
> so ideally we'd have a SECCOMP_IOCTL_NOTIF_ID_VALID_V2
> 
> Does that sound ok?
> 
> Christian
Why not change the public API in seccomp.h to:
#define SECCOMP_IOCTL_NOTIF_ID_VALID	SECCOMP_IOW(2, __u64)

And then in seccomp.c:
#define SECCOMP_IOCTL_NOTIF_ID_VALID_OLD	SECCOMP_IOR(2, __u64)
static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
				 unsigned long arg)
{
	struct seccomp_filter *filter = file->private_data;
	void __user *buf = (void __user *)arg;

	switch (cmd) {
	case SECCOMP_IOCTL_NOTIF_RECV:
		return seccomp_notify_recv(filter, buf);
	case SECCOMP_IOCTL_NOTIF_SEND:
		return seccomp_notify_send(filter, buf);
	case SECCOMP_IOCTL_NOTIF_ID_VALID_OLD:
		pr_warn_once("Detected usage of legacy (incorrect) version of seccomp notifier notif_id_valid ioctl\n");
		fallthrough;
	case SECCOMP_IOCTL_NOTIF_ID_VALID:
		return seccomp_notify_id_valid(filter, buf);
	default:
		return -EINVAL;
	}
}

So, both will work fine, and whenever anyone recompiles, or picks up new
headers, they will start calling the "right" one without a code change, and
we won't break any userspace.
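
As a userspace illustration of why both can coexist: _IOR and _IOW differ
only in the direction bits, so the two numbers are distinct and can share one
handler. The sketch below models the switch above, with a counter standing in
for pr_warn_once(). NOTIF_ID_VALID, NOTIF_ID_VALID_OLD and
id_valid_ioctl_model() are illustrative names, not kernel symbols.

```c
#include <assert.h>
#include <errno.h>
#include <linux/ioctl.h>
#include <stdint.h>

/* Illustrative numbers: nr 2 under the seccomp magic '!', new and old direction. */
#define NOTIF_ID_VALID		_IOW('!', 2, uint64_t)
#define NOTIF_ID_VALID_OLD	_IOR('!', 2, uint64_t)

/* The direction bits differ, so both can be case labels in one switch. */
static_assert(NOTIF_ID_VALID != NOTIF_ID_VALID_OLD,
	      "old and new numbers must be distinct");

static int legacy_calls;	/* stand-in for pr_warn_once() */

static long id_valid_ioctl_model(unsigned int cmd)
{
	switch (cmd) {
	case NOTIF_ID_VALID_OLD:
		legacy_calls++;
		/* fall through */
	case NOTIF_ID_VALID:
		return 0;	/* pretend the notification id was valid */
	default:
		return -EINVAL;
	}
}
```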

Re: [PATCH v3 1/4] fs, net: Standardize on file_receive helper to move fds across processes

2020-06-10 Thread Sargun Dhillon

On Wed, Jun 10, 2020 at 07:59:55PM -0700, Kees Cook wrote:
> 
> Yeah, that seems reasonable. Here's the diff for that part:
> 
> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> index 7b6028b399d8..98bf19b4e086 100644
> --- a/include/uapi/linux/seccomp.h
> +++ b/include/uapi/linux/seccomp.h
> @@ -118,7 +118,6 @@ struct seccomp_notif_resp {
>  
>  /**
>   * struct seccomp_notif_addfd
> - * @size: The size of the seccomp_notif_addfd datastructure
>   * @id: The ID of the seccomp notification
>   * @flags: SECCOMP_ADDFD_FLAG_*
>   * @srcfd: The local fd number
> @@ -126,7 +125,6 @@ struct seccomp_notif_resp {
>   * @newfd_flags: The O_* flags the remote FD should have applied
>   */
>  struct seccomp_notif_addfd {
> - __u64 size;
>   __u64 id;
>   __u32 flags;
>   __u32 srcfd;
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 3c913f3b8451..00cbdad6c480 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -1297,14 +1297,9 @@ static long seccomp_notify_addfd(struct seccomp_filter 
> *filter,
>   struct seccomp_notif_addfd addfd;
>   struct seccomp_knotif *knotif;
>   struct seccomp_kaddfd kaddfd;
> - u64 size;
>   int ret;
>  
> - ret = get_user(size, &uaddfd->size);
> - if (ret)
> - return ret;
> -
> - ret = copy_struct_from_user(&addfd, sizeof(addfd), uaddfd, size);
> + ret = copy_from_user(&addfd, uaddfd, sizeof(addfd));
>   if (ret)
>   return ret;
>  
> 
Looks good to me. If we ever change the size of this struct, we can do the work
to switch to copy_struct_from_user then.
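
For reference, the contract copy_struct_from_user() provides can be modelled
in userspace roughly as follows. This is a sketch of the semantics only
(zero-fill when the caller's struct is shorter, reject non-zero trailing bytes
with -E2BIG when it is longer), with plain memcpy standing in for the user
copy. copy_struct_model() is a made-up name, not a kernel function.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Userspace model: dst/ksize describe the struct the "kernel" knows about,
 * src/usize describe what the caller passed in.
 */
static int copy_struct_model(void *dst, size_t ksize,
			     const void *src, size_t usize)
{
	const unsigned char *bytes = src;
	size_t size = usize < ksize ? usize : ksize;
	size_t i;

	assert(dst != NULL && src != NULL);

	/* Newer caller, older "kernel": unknown trailing fields must be zero. */
	for (i = ksize; i < usize; i++)
		if (bytes[i] != 0)
			return -E2BIG;

	/* Older caller, newer "kernel": fields the caller lacks read as zero. */
	memset(dst, 0, ksize);
	memcpy(dst, src, size);
	return 0;
}
```

With this behaviour an old binary keeps working against a newer struct, and a
new binary works against an older one as long as it leaves the new fields zero.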

> > 
> > 
> > +#define SECCOMP_IOCTL_NOTIF_ADDFD  SECCOMP_IOR(3,  \
> > +   struct seccomp_notif_addfd)
> > 
> > Lastly, what I believe to be a small mistake, it should be SECCOMP_IOW, 
> > based on 
> > the documentation in ioctl.h -- "_IOW means userland is writing and kernel 
> > is 
> > reading."
> 
> Oh. Yeah; good catch. Uhm, that means SECCOMP_IOCTL_NOTIF_ID_VALID
> is wrong too, yes? Tycho, Christian, how disruptive would this be to
> fix? (Perhaps support both and deprecate the IOR version at some point
> in the future?)
I think at a minimum we should change the uapi, and accept both (for now). Maybe
a pr_warn_once telling people not to use the old one.

I can do the patch, if you want. 
> 
> Diff for just addfd's change:
> 
> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> index 7b6028b399d8..98bf19b4e086 100644
> --- a/include/uapi/linux/seccomp.h
> +++ b/include/uapi/linux/seccomp.h
> @@ -146,7 +144,7 @@ struct seccomp_notif_addfd {
>   struct seccomp_notif_resp)
>  #define SECCOMP_IOCTL_NOTIF_ID_VALID SECCOMP_IOR(2, __u64)
>  /* On success, the return value is the remote process's added fd number */
> -#define SECCOMP_IOCTL_NOTIF_ADDFD	SECCOMP_IOR(3,  \
> +#define SECCOMP_IOCTL_NOTIF_ADDFD	SECCOMP_IOW(3,  \
>   struct seccomp_notif_addfd)
>  
>  #endif /* _UAPI_LINUX_SECCOMP_H */
> 
> -- 
> Kees Cook
Looks good. Thank you.

Re: [PATCH 2/2] pidfd: Replace open-coded partial __scm_install_fd()

2020-06-10 Thread Sargun Dhillon

On Tue, Jun 09, 2020 at 09:52:14PM -0700, Kees Cook wrote:
> The sock counting (sock_update_netprioidx() and sock_update_classid())
> was missing from this implementation of fd installation, compared to
> SCM_RIGHTS. Use the new scm helper to get the work done, after adjusting
> it to return the installed fd and accept a NULL user pointer.
> 
> Fixes: 8649c322f75c ("pid: Implement pidfd_getfd syscall")
> Signed-off-by: Kees Cook 
> ---
> AFAICT, the following patches are needed for back-porting this to stable:
> 
> 0462b6bdb644 ("net: add a CMSG_USER_DATA macro")
> 2618d530dd8b ("net/scm: cleanup scm_detach_fds")
> 1f466e1f15cf ("net: cleanly handle kernel vs user buffers for ->msg_control")
> 6e8a4f9dda38 ("net: ignore sock_from_file errors in __scm_install_fd")
> ---
>  kernel/pid.c   | 12 ++--
>  net/compat.c   |  2 +-
>  net/core/scm.c | 27 ---
>  3 files changed, 23 insertions(+), 18 deletions(-)
> 
> diff --git a/kernel/pid.c b/kernel/pid.c
> index f1496b757162..a7ce4ba898d3 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -42,6 +42,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  struct pid init_struct_pid = {
>   .count  = REFCOUNT_INIT(1),
> @@ -635,18 +636,9 @@ static int pidfd_getfd(struct pid *pid, int fd)
>   if (IS_ERR(file))
>   return PTR_ERR(file);
>  
> - ret = security_file_receive(file);
> - if (ret) {
> - fput(file);
> - return ret;
> - }
> -
> - ret = get_unused_fd_flags(O_CLOEXEC);
> + ret = __scm_install_fd(file, NULL, O_CLOEXEC);
>   if (ret < 0)
>   fput(file);
> - else
> - fd_install(ret, file);
> -
>   return ret;
>  }
>  
> diff --git a/net/compat.c b/net/compat.c
> index 117f1869bf3b..f857b6d7 100644
> --- a/net/compat.c
> +++ b/net/compat.c
> @@ -299,7 +299,7 @@ void scm_detach_fds_compat(struct msghdr *msg, struct 
> scm_cookie *scm)
>  
>   for (i = 0; i < fdmax; i++) {
>   err = __scm_install_fd(scm->fp->fp[i], cmsg_data + i, o_flags);
> - if (err)
> + if (err < 0)
>   break;
>   }
>  
> diff --git a/net/core/scm.c b/net/core/scm.c
> index 86d96152646f..e80648fb4da7 100644
> --- a/net/core/scm.c
> +++ b/net/core/scm.c
> @@ -280,6 +280,14 @@ void put_cmsg_scm_timestamping(struct msghdr *msg, 
> struct scm_timestamping_inter
>  }
>  EXPORT_SYMBOL(put_cmsg_scm_timestamping);
>  
> +/**
> + * __scm_install_fd() - Install received file into file descriptor table
Any reason not to rename this remote_install_* or similar, and move it to fs/?
> + *
> + * Installs a received file into the file descriptor table, with appropriate
> + * checks and count updates.
> + *
> + * Returns fd installed or -ve on error.
> + */
>  int __scm_install_fd(struct file *file, int __user *ufd, int o_flags)
>  {
>   struct socket *sock;
> @@ -294,20 +302,25 @@ int __scm_install_fd(struct file *file, int __user 
> *ufd, int o_flags)
>   if (new_fd < 0)
>   return new_fd;
>  
> - error = put_user(new_fd, ufd);
> - if (error) {
> - put_unused_fd(new_fd);
> - return error;
> + if (ufd) {
See my comment elsewhere about not being able to use NULL here.
> + error = put_user(new_fd, ufd);
> + if (error) {
> + put_unused_fd(new_fd);
> + return error;
> + }
>   }
>  
> - /* Bump the usage count and install the file. */
> + /* Bump the usage count and install the file. The resulting value of
> +  * "error" is ignored here since we only need to take action when
> +  * the file is a socket and testing "sock" for NULL is sufficient.
> +  */
>   sock = sock_from_file(file, &error);
>   if (sock) {
>   sock_update_netprioidx(&sock->sk->sk_cgrp_data);
>   sock_update_classid(&sock->sk->sk_cgrp_data);
>   }
>   fd_install(new_fd, get_file(file));
> - return 0;
> + return new_fd;
>  }
>  
>  static int scm_max_fds(struct msghdr *msg)
> @@ -337,7 +350,7 @@ void scm_detach_fds(struct msghdr *msg, struct scm_cookie 
> *scm)
>  
>   for (i = 0; i < fdmax; i++) {
>   err = __scm_install_fd(scm->fp->fp[i], cmsg_data + i, o_flags);
> - if (err)
> + if (err < 0)
>   break;
>   }
>  
> -- 
> 2.25.1
>

Re: [PATCH v3 1/4] fs, net: Standardize on file_receive helper to move fds across processes

2020-06-10 Thread Sargun Dhillon

On Tue, Jun 09, 2020 at 10:27:54PM -0700, Kees Cook wrote:
> On Tue, Jun 09, 2020 at 11:27:30PM +0200, Christian Brauner wrote:
> > On June 9, 2020 10:55:42 PM GMT+02:00, Kees Cook  
> > wrote:
> > >LOL. And while we were debating this, hch just went and cleaned stuff up:
> > >
> > >2618d530dd8b ("net/scm: cleanup scm_detach_fds")
> > >
> > >So, um, yeah, now my proposal is actually even closer to what we already
> > >have there. We just add the replace_fd() logic to __scm_install_fd() and
> > >we're done with it.
> > 
> > Cool, you have a link? :)
> 
> How about this:
> 
Thank you.
> https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?h=devel/seccomp/addfd/v3.1&id=bb94586b9e7cc88e915536c2e9fb991a97b62416
> 
> -- 
> Kees Cook

+   if (ufd) {
+   error = put_user(new_fd, ufd);
+   if (error) {
+   put_unused_fd(new_fd);
+   return error;
+   }
+   }
I'm fairly sure this introduces a bug[1] if the user does:

struct msghdr msg = {};
struct cmsghdr *cmsg;
struct iovec io = {
.iov_base = ,
.iov_len = 1,
};

msg.msg_iov = &io;
msg.msg_iovlen = 1;
msg.msg_control = NULL;
msg.msg_controllen = sizeof(buf);

recvmsg(sock, &msg, 0);

They will have the FD installed, no error message, but the FD number won't be
written to memory AFAICT. If two FDs are passed, you will get an EFAULT. They
will both be installed, but memory won't be written to. Maybe instead of 0
(NULL), use a poison pointer or -1 as the sentinel?
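
For comparison, a receiver that supplies a real control buffer finds the fd
number in the cmsg data. The following is a minimal sketch for a connected
AF_UNIX socket, with error handling trimmed; send_fd() and recv_fd() are
helper names invented here.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Control buffer sized and aligned for exactly one fd. */
union fd_cmsg_buf {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Send one fd with a single byte of payload; returns 0 on success, -1 on error. */
static int send_fd(int sock, int fd)
{
	char byte = 'x';
	struct iovec io = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg_buf u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;

	assert(fd >= 0);
	memset(&u, 0, sizeof(u));
	msg.msg_iov = &io;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd; returns the newly installed fd, or -1 if none arrived. */
static int recv_fd(int sock)
{
	char byte;
	struct iovec io = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg_buf u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	int fd = -1;

	msg.msg_iov = &io;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;	/* non-NULL, unlike the case above */
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) <= 0)
		return -1;
	if (msg.msg_flags & MSG_CTRUNC)
		return -1;	/* control data did not fit */

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS)
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```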

-
As an aside, all of this junk should be dropped:
+   ret = get_user(size, &uaddfd->size);
+   if (ret)
+   return ret;
+
+   ret = copy_struct_from_user(&addfd, sizeof(addfd), uaddfd, size);
+   if (ret)
+   return ret;

and the size member of the seccomp_notif_addfd struct. I brought this up
off-list with Tycho that ioctls have the size of the struct embedded in them. We
should just use that. The ioctl definition is based on this[2]:
#define _IOC(dir,type,nr,size) \
(((dir)  << _IOC_DIRSHIFT) | \
 ((type) << _IOC_TYPESHIFT) | \
 ((nr)   << _IOC_NRSHIFT) | \
 ((size) << _IOC_SIZESHIFT))

We should just use copy_from_user for now. In the future, we can either
introduce new ioctl names for new structs, or extract the size dynamically from
the ioctl (and mask it out on the switch statement in seccomp_notify_ioctl).

+#define SECCOMP_IOCTL_NOTIF_ADDFD  SECCOMP_IOR(3,  \
+   struct seccomp_notif_addfd)

Lastly, what I believe to be a small mistake, it should be SECCOMP_IOW, based on
the documentation in ioctl.h -- "_IOW means userland is writing and kernel is
reading."

[1]: https://lore.kernel.org/lkml/20200604052040.GA16501@ircssh-2.c.rugged-nimbus-611.internal/
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/uapi/asm-generic/ioctl.h?id=v5.7#n69

Re: [PATCH v3 1/4] fs, net: Standardize on file_receive helper to move fds across processes

2020-06-05 Thread Sargun Dhillon

On Thu, Jun 04, 2020 at 02:52:26PM +0200, Christian Brauner wrote:
> On Wed, Jun 03, 2020 at 07:22:57PM -0700, Kees Cook wrote:
> > On Thu, Jun 04, 2020 at 03:24:52AM +0200, Christian Brauner wrote:
> > > On Tue, Jun 02, 2020 at 06:10:41PM -0700, Sargun Dhillon wrote:
> > > > Previously there were two chunks of code where the logic to receive file
> > > > descriptors was duplicated in net. The compat version of copying
> > > > file descriptors via SCM_RIGHTS did not have logic to update cgroups.
> > > > Logic to change the cgroup data was added in:
> > > > commit 48a87cc26c13 ("net: netprio: fd passed in SCM_RIGHTS datagram 
> > > > not set correctly")
> > > > commit d84295067fc7 ("net: net_cls: fd passed in SCM_RIGHTS datagram 
> > > > not set correctly")
> > > > 
> > > > This was not copied to the compat path. This commit fixes that, and thus
> > > > should be cherry-picked into stable.
> > > > 
> > > > This introduces a helper (file_receive) which encapsulates the logic for
> > > > handling calling security hooks as well as manipulating cgroup 
> > > > information.
> > > > This helper can then be used other places in the kernel where file
> > > > descriptors are copied between processes
> > > > 
> > > > I tested cgroup classid setting on both the compat (x32) path, and the
> > > > native path to ensure that when moving the file descriptor the classid
> > > > is set.
> > > > 
> > > > Signed-off-by: Sargun Dhillon 
> > > > Suggested-by: Kees Cook 
> > > > Cc: Al Viro 
> > > > Cc: Christian Brauner 
> > > > Cc: Daniel Wagner 
> > > > Cc: David S. Miller 
> > > > Cc: Jann Horn ,
> > > > Cc: John Fastabend 
> > > > Cc: Tejun Heo 
> > > > Cc: Tycho Andersen 
> > > > Cc: sta...@vger.kernel.org
> > > > Cc: cgro...@vger.kernel.org
> > > > Cc: linux-fsde...@vger.kernel.org
> > > > Cc: linux-kernel@vger.kernel.org
> > > > ---
> > > >  fs/file.c| 35 +++
> > > >  include/linux/file.h |  1 +
> > > >  net/compat.c | 10 +-
> > > >  net/core/scm.c   | 14 --
> > > >  4 files changed, 45 insertions(+), 15 deletions(-)
> > > > 
> > > 
> > > This is all just a remote version of fd_install(), yet it deviates from
> > > fd_install()'s semantics and naming. That's not great imho. What about
> > > naming this something like:
> > > 
> > > fd_install_received()
> > > 
> > > and move the get_file() out of there so it has the same semantics as
> > > fd_install(). It seems rather dangerous to have a function like
> > > fd_install() that consumes a reference once it returned and another
> > > version of this that is basically the same thing but doesn't consume a
> > > reference because it takes its own. Seems an invitation for confusion.
> > > Does that make sense?
> > 
> > We have some competing opinions on this, I guess. What I really don't
> > like is the copy/pasting of the get_unused_fd_flags() and
> > put_unused_fd() needed by (nearly) all the callers. If it's a helper, it
> > should help. Specifically, I'd like to see this:
> > 
> > int file_receive(int fd, unsigned long flags, struct file *file,
> >  int __user *fdptr)
> 
> I still fail to see what this whole put_user() handling buys us at all
> and why this function needs to be anymore complicated then simply:
> 
> fd_install_received(int fd, struct file *file)
> {
>   security_file_receive(file);
>  
>   sock = sock_from_file(fd, &err);
>   if (sock) {
>   sock_update_netprioidx(&sock->sk->sk_cgrp_data);
>   sock_update_classid(&sock->sk->sk_cgrp_data);
>   }
> 
>   fd_install();
>   return;
> }
> 
> exactly like fd_install() but for received files.
> 
> For scm you can fail somewhere in the middle of putting any number of
> file descriptors so you're left in a state with only a subset of
> requested file descriptors installed so it's not really useful there.
> And if you manage to install an fd but then fail to put_user() it
> userspace can simply check it's fds via proc and has to anyway on any
> scm message error. If you fail an scm message userspace better check
> their fds.
> For seccomp maybe but even there I doubt it and I still maintain that

Re: [PATCH RFC] seccomp: Implement syscall isolation based on memory areas

2020-06-05 Thread Sargun Dhillon

On Fri, May 29, 2020 at 11:01 PM Gabriel Krisman Bertazi
 wrote:
>
> Modern Windows applications are executing system call instructions
> directly from the application's code without going through the WinAPI.
> This breaks Wine emulation, because it doesn't have a chance to
> intercept and emulate these syscalls before they are submitted to Linux.
>
> In addition, we cannot simply trap every system call of the application
> to userspace using PTRACE_SYSEMU, because performance would suffer,
> since our main use case is to run Windows games over Linux.  Therefore,
> we need some in-kernel filtering to decide whether the syscall was
> issued by the wine code or by the windows application.
>
> The filtering cannot really be done based solely on the syscall number,
> because those could collide with existing Linux syscalls.  Instead, our
> proposed solution is to trap syscalls based on the userspace memory
> region that triggered the syscall, as wine is responsible for the
> Windows code allocations and it can apply correct memory protections to
> those areas.
>
> Therefore, this patch reuses the seccomp infrastructure to trap
> system calls, but introduces a new mode to trap based on a vma attribute
> that describes whether the userspace memory region is allowed to execute
> syscalls or not.  The protection is defined at mmap/mprotect time with a
> new protection flag PROT_NOSYSCALL.  This setting only takes effect if
> the new SECCOMP_MODE_MEMMAP is enabled through seccomp().
>
> It goes without saying that this is in no way a security mechanism
> despite being built on top of seccomp, since an evil application can
> always jump to a whitelisted memory region and run the syscall.  This
> is not a concern for Wine games.  Nevertheless, we reuse seccomp as a
> way to avoid adding a new mechanism to essentially do the same job of
> filtering system calls.
>
> * Why not SECCOMP_MODE_FILTER?
>
> We experimented with dynamically generating BPF filters for whitelisted
> memory regions and using SECCOMP_MODE_FILTER, but there are a few
> reasons why it isn't enough nor a good idea for our use case:
>
> 1. We cannot set the filters at program initialization time and forget
> about it, since there is no way of knowing which modules will be loaded,
> whether native and windows.  Filter would need a way to be updated
> frequently during game execution.
>
> 2. We cannot predict which Linux libraries will issue syscalls directly.
> Most of the time, whitelisting libc and a few other libraries is enough,
> but there are no guarantees other Linux libraries won't issue syscalls
> directly and break the execution.  Adding every linux library that is
> loaded also has a large performance cost due to the large resulting
> filter.
>
> 3. As I mentioned before, performance is critical.  In our testing with
> just a single memory segment blacklisted/whitelisted, the minimum size
> of a bpf filter would be 4 instructions.  In that scenario,
> SECCOMP_MODE_FILTER added an average overhead of 10% to the execution
> time of sysinfo(2) in comparison to seccomp disabled, while the impact
> of SECCOMP_MODE_MEMMAP was averaged around 1.5%.
>
> Indeed, points 1 and 2 could be worked around with some userspace work
> and improved SECCOMP_MODE_FILTER support, but at a high performance and
> some stability cost, to obtain the semantics we want.  Still, the
> performance would suffer, and SECCOMP_MODE_MEMMAP is non intrusive
> enough that I believe it should be considered as an upstream solution.
>
> Sending as an RFC for now to get the discussion started.  In particular:

I have a totally different question. I am experimenting with a patchset
which is designed to help with the "extended syscall" case (as Kees calls
it). Effectively, syscalls like openat2, where the syscall arguments are
passed as a (potentially mixed size) structure, need to be able to be
inspected through user notif. We can kind-of deal with this with other
syscalls with mechanisms like pidfd_getfd, addfd, and potentially being
able to (re)set the registers prior to actual invocation of the syscall.
Unfortunately, you cannot do the same trick with user memory, because it
opens you up to a time-of-check, time-of-use attack, since the kernel
copies the syscall arguments from the invoking program again.
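
To make the double-fetch problem concrete, here is a toy, single-threaded
simulation. It is only a sketch: toy_how, TOY_FORBIDDEN and the helper
names are made up for illustration and are not kernel interfaces, and the
"racing thread" is modelled as a plain assignment between the two copies.

```c
#include <assert.h>
#include <string.h>

#define TOY_FORBIDDEN 0x1UL	/* hypothetical flag the supervisor refuses */

/* Hypothetical stand-in for a structure passed by pointer to a syscall. */
struct toy_how {
	unsigned long flags;
};

/* Supervisor-side policy check, run on the copy the supervisor fetched. */
static int toy_policy_allows(const struct toy_how *seen)
{
	return !(seen->flags & TOY_FORBIDDEN);
}

/*
 * Returns 1 when the check passed and yet the second copy (the one the
 * kernel would act on) carries the forbidden flag.
 */
int toy_double_fetch(void)
{
	struct toy_how user_mem = { .flags = 0 };
	struct toy_how supervisor_copy, kernel_copy;
	int allowed;

	/* Time of check: the supervisor reads the argument and approves it. */
	memcpy(&supervisor_copy, &user_mem, sizeof(user_mem));
	allowed = toy_policy_allows(&supervisor_copy);

	/* Another thread of the target rewrites the argument in place. */
	user_mem.flags |= TOY_FORBIDDEN;

	/* Time of use: the kernel copies the argument from user memory again. */
	memcpy(&kernel_copy, &user_mem, sizeof(user_mem));

	return allowed && (kernel_copy.flags & TOY_FORBIDDEN) != 0;
}
```

The point of the sketch is only that the value checked and the value used
come from two separate reads of memory the target still controls.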

One of the things I've been experimenting with is using tricks like
userfaultfd / mprotect to try to deal with this. I think that I might have
to add some capability to the kernel to actually deal with this. In
general, the approach is:
1. Syscall is invoked, and wakes up the manager
2. The manager gets the arguments, and a handle (either the ID, or an FD).
   It then uses this ID to read memory. Either something like
   process_vm_readv, an ioctl, or read.
3. When the kernel reads these arguments, it splits the VMA for the address
   the pointer lies in, and sets up access() with a special mapping that
   checks if the page has been tampered with by userspace in the read
   ranges between the

Re: [PATCH v3 1/4] fs, net: Standardize on file_receive helper to move fds across processes

2020-06-03 Thread Sargun Dhillon

On Wed, Jun 03, 2020 at 07:22:57PM -0700, Kees Cook wrote:
> On Thu, Jun 04, 2020 at 03:24:52AM +0200, Christian Brauner wrote:
> > On Tue, Jun 02, 2020 at 06:10:41PM -0700, Sargun Dhillon wrote:
> > > Previously there were two chunks of code where the logic to receive file
> > > descriptors was duplicated in net. The compat version of copying
> > > file descriptors via SCM_RIGHTS did not have logic to update cgroups.
> > > Logic to change the cgroup data was added in:
> > > commit 48a87cc26c13 ("net: netprio: fd passed in SCM_RIGHTS datagram not 
> > > set correctly")
> > > commit d84295067fc7 ("net: net_cls: fd passed in SCM_RIGHTS datagram not 
> > > set correctly")
> > > 
> > > This was not copied to the compat path. This commit fixes that, and thus
> > > should be cherry-picked into stable.
> > > 
> > > This introduces a helper (file_receive) which encapsulates the logic for
> > > handling calling security hooks as well as manipulating cgroup 
> > > information.
> > > This helper can then be used other places in the kernel where file
> > > descriptors are copied between processes
> > > 
> > > I tested cgroup classid setting on both the compat (x32) path, and the
> > > native path to ensure that when moving the file descriptor the classid
> > > is set.
> > > 
> > > Signed-off-by: Sargun Dhillon 
> > > Suggested-by: Kees Cook 
> > > Cc: Al Viro 
> > > Cc: Christian Brauner 
> > > Cc: Daniel Wagner 
> > > Cc: David S. Miller 
> > > Cc: Jann Horn ,
> > > Cc: John Fastabend 
> > > Cc: Tejun Heo 
> > > Cc: Tycho Andersen 
> > > Cc: sta...@vger.kernel.org
> > > Cc: cgro...@vger.kernel.org
> > > Cc: linux-fsde...@vger.kernel.org
> > > Cc: linux-kernel@vger.kernel.org
> > > ---
> > >  fs/file.c| 35 +++
> > >  include/linux/file.h |  1 +
> > >  net/compat.c | 10 +-
> > >  net/core/scm.c   | 14 --
> > >  4 files changed, 45 insertions(+), 15 deletions(-)
> > > 
> > > diff --git a/fs/file.c b/fs/file.c
> > > index abb8b7081d7a..5afd76fca8c2 100644
> > > --- a/fs/file.c
> > > +++ b/fs/file.c
> > > @@ -18,6 +18,9 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > > +#include 
> > > +#include 
> > >  
> > >  unsigned int sysctl_nr_open __read_mostly = 1024*1024;
> > >  unsigned int sysctl_nr_open_min = BITS_PER_LONG;
> > > @@ -931,6 +934,38 @@ int replace_fd(unsigned fd, struct file *file, 
> > > unsigned flags)
> > >   return err;
> > >  }
> > >  
> > > +/*
> > > + * File Receive - Receive a file from another process
> > > + *
> > > + * This function is designed to receive files from other tasks. It 
> > > encapsulates
> > > + * logic around security and cgroups. The file descriptor provided must 
> > > be a
> > > + * freshly allocated (unused) file descriptor.
> > > + *
> > > + * This helper does not consume a reference to the file, so the caller 
> > > must put
> > > + * their reference.
> > > + *
> > > + * Returns 0 upon success.
> > > + */
> > > +int file_receive(int fd, struct file *file)
> > 
> > This is all just a remote version of fd_install(), yet it deviates from
> > fd_install()'s semantics and naming. That's not great imho. What about
> > naming this something like:
> > 
> > fd_install_received()
> > 
> > and move the get_file() out of there so it has the same semantics as
> > fd_install(). It seems rather dangerous to have a function like
> > fd_install() that consumes a reference once it returned and another
> > version of this that is basically the same thing but doesn't consume a
> > reference because it takes its own. Seems an invitation for confusion.
> > Does that make sense?
> 
> We have some competing opinions on this, I guess. What I really don't
> like is the copy/pasting of the get_unused_fd_flags() and
> put_unused_fd() needed by (nearly) all the callers. If it's a helper, it
> should help. Specifically, I'd like to see this:
> 
> int file_receive(int fd, unsigned long flags, struct file *file,
>int __user *fdptr)
> {
>   struct socket *sock;
>   int err;
> 
>   err = security_file_receive(file);
>   if

Re: [PATCH v3 1/4] fs, net: Standardize on file_receive helper to move fds across processes

2020-06-03 Thread Sargun Dhillon

On Thu, Jun 04, 2020 at 03:24:52AM +0200, Christian Brauner wrote:
> On Tue, Jun 02, 2020 at 06:10:41PM -0700, Sargun Dhillon wrote:
> > Previously there were two chunks of code where the logic to receive file
> > descriptors was duplicated in net. The compat version of copying
> > file descriptors via SCM_RIGHTS did not have logic to update cgroups.
> > Logic to change the cgroup data was added in:
> > commit 48a87cc26c13 ("net: netprio: fd passed in SCM_RIGHTS datagram not 
> > set correctly")
> > commit d84295067fc7 ("net: net_cls: fd passed in SCM_RIGHTS datagram not 
> > set correctly")
> > 
> > This was not copied to the compat path. This commit fixes that, and thus
> > should be cherry-picked into stable.
> > 
> > This introduces a helper (file_receive) which encapsulates the logic for
> > handling calling security hooks as well as manipulating cgroup information.
> > This helper can then be used other places in the kernel where file
> > descriptors are copied between processes
> > 
> > I tested cgroup classid setting on both the compat (x32) path, and the
> > native path to ensure that when moving the file descriptor the classid
> > is set.
> > 
> > Signed-off-by: Sargun Dhillon 
> > Suggested-by: Kees Cook 
> > Cc: Al Viro 
> > Cc: Christian Brauner 
> > Cc: Daniel Wagner 
> > Cc: David S. Miller 
> > Cc: Jann Horn ,
> > Cc: John Fastabend 
> > Cc: Tejun Heo 
> > Cc: Tycho Andersen 
> > Cc: sta...@vger.kernel.org
> > Cc: cgro...@vger.kernel.org
> > Cc: linux-fsde...@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > ---
> >  fs/file.c| 35 +++
> >  include/linux/file.h |  1 +
> >  net/compat.c | 10 +-
> >  net/core/scm.c   | 14 --
> >  4 files changed, 45 insertions(+), 15 deletions(-)
> > 
> > diff --git a/fs/file.c b/fs/file.c
> > index abb8b7081d7a..5afd76fca8c2 100644
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -18,6 +18,9 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> > +#include 
> > +#include 
> >  
> >  unsigned int sysctl_nr_open __read_mostly = 1024*1024;
> >  unsigned int sysctl_nr_open_min = BITS_PER_LONG;
> > @@ -931,6 +934,38 @@ int replace_fd(unsigned fd, struct file *file, 
> > unsigned flags)
> > return err;
> >  }
> >  
> > +/*
> > + * File Receive - Receive a file from another process
> > + *
> > + * This function is designed to receive files from other tasks. It 
> > encapsulates
> > + * logic around security and cgroups. The file descriptor provided must be 
> > a
> > + * freshly allocated (unused) file descriptor.
> > + *
> > + * This helper does not consume a reference to the file, so the caller 
> > must put
> > + * their reference.
> > + *
> > + * Returns 0 upon success.
> > + */
> > +int file_receive(int fd, struct file *file)
> 
> This is all just a remote version of fd_install(), yet it deviates from
> fd_install()'s semantics and naming. That's not great imho. What about
> naming this something like:
> 
> fd_install_received()
> 
> and move the get_file() out of there so it has the same semantics as
> fd_install(). It seems rather dangerous to have a function like
> fd_install() that consumes a reference once it returned and another
> version of this that is basically the same thing but doesn't consume a
> reference because it takes its own. Seems an invitation for confusion.
> Does that make sense?
> 
You're right. The reason for the difference in my mind is that fd_install
always succeeds, whereas file_receive can fail. It's easier to do something
like:
fd_install(fd, get_file(f))
vs.
if (file_receive(fd, get_file(f)))
	fput(f);

Alternatively, if the reference was always consumed, it is somewhat
easier.

I'm fine either way, but just explaining my reasoning for the difference
in behaviour.
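
For what it's worth, the reference rules of the helper as posted in v3 (it
can fail, it takes its own reference on success, and the caller always
puts theirs) can be modelled in userspace roughly like this. This is a toy
sketch only: toy_file, toy_receive and friends are invented names, and the
"fd table" is a single pointer.

```c
#include <assert.h>

/* Toy stand-in for struct file: nothing but a reference count. */
struct toy_file {
	int refs;
};

static struct toy_file *toy_get(struct toy_file *f)
{
	f->refs++;
	return f;
}

static void toy_put(struct toy_file *f)
{
	f->refs--;
}

/* fd_install-like: cannot fail, the slot takes over the caller's reference. */
void toy_install(struct toy_file **slot, struct toy_file *f)
{
	*slot = f;
}

/*
 * file_receive-like (v3 semantics): may fail, takes its own reference on
 * success, and never consumes the caller's reference.
 */
int toy_receive(struct toy_file **slot, struct toy_file *f, int deny)
{
	if (deny)
		return -1;	/* e.g. the security hook said no */
	*slot = toy_get(f);
	return 0;
}

/* Returns the reference count left after the caller has put its reference. */
int toy_caller(int deny)
{
	struct toy_file f = { .refs = 1 };	/* the caller's reference */
	struct toy_file *slot = 0;

	(void)toy_receive(&slot, &f, deny);
	toy_put(&f);	/* unconditional, success or failure */
	return f.refs;	/* 1: the slot owns it, 0: nothing installed */
}
```

With these rules the caller's cleanup is the same on both paths, which is
the shape pidfd_getfd ends up with in patch 2/4.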

Re: [PATCH v3 0/4] Add seccomp notifier ioctl that enables adding fds

2020-06-03 Thread Sargun Dhillon

On Wed, Jun 3, 2020 at 4:42 PM Kees Cook  wrote:
>
> On Tue, Jun 02, 2020 at 06:10:40PM -0700, Sargun Dhillon wrote:
> > Sargun Dhillon (4):
> >   fs, net: Standardize on file_receive helper to move fds across
> > processes
> >   pid: Use file_receive helper to copy FDs
>
> The fixes (that should add open-coded cgroups stuff) should be separate
> patches so they can be backported.
Patch 1/4, and 2/4 are separated so they can be backported. Patch 1 should
go into long term, and patch 2 should land in stable.

Do you see anything in 1/4, and 2/4 that shouldn't be there?

>
> The helper doesn't take the __user pointer I thought we'd agreed it
> should to avoid changing any SCM_RIGHTS behaviors?
>
It doesn't change the SCM_RIGHTS behaviour because it continues
to have the logic which allocates the file descriptor outside of the
helper.
1. Allocate FD (this happens in scm.c)
2. Copy FD # to userspace (this happens in scm.c)
3. Receive FD (this happens in the new helper)


> >   seccomp: Introduce addfd ioctl to seccomp user notifier
> >   selftests/seccomp: Test SECCOMP_IOCTL_NOTIF_ADDFD
>
> Otherwise, yeah, this should be good.
>
> --
> Kees Cook

[PATCH v3 3/4] seccomp: Introduce addfd ioctl to seccomp user notifier

2020-06-02 Thread Sargun Dhillon

This adds a seccomp notifier ioctl which allows for the listener to "add"
file descriptors to a process which originated a seccomp user
notification. This allows calls like mount, and mknod to be "implemented",
as the return value, and the arguments are data in memory. On the other
hand, calls like connect can be "implemented" using pidfd_getfd.

Unfortunately, there are calls which return file descriptors, like
open, which are vulnerable to TOC-TOU attacks, and require that the
more privileged supervisor can inspect the argument, and perform the
syscall on behalf of the process generating the notification. This
allows the file descriptor generated from that open call to be
returned to the calling process.

In addition, there is functionality to allow for replacement of
specific file descriptors, following dup2-like semantics.

This extends a previously added helper (file_receive), and introduces
a new helper built on top of it -- file_receive_replace, which is
meant to assist with calling replace_fd, with files received from
remote processes.

As a note, the seccomp_notif_addfd structure is laid out based on 8-byte
alignment without requiring packing as there have been packing issues with
uapi highlighted before [1][2]. Although we could overload the newfd field
and use -1 to indicate that it is not to be used, doing so requires
changing the size of the fd field, and introduces struct packing
complexity.
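
As a rough illustration of the layout argument, the following userspace
sketch mirrors the field order of struct seccomp_notif_addfd as it appears
in the selftest copy in this series, using fixed-width types in place of
__u64/__u32. The struct name and the helper are made up for the example;
it only checks that the members add up to the struct size, i.e. that the
compiler inserts no implicit padding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the layout; not the uapi definition itself. */
struct notif_addfd_layout {
	uint64_t size;
	uint64_t id;
	uint32_t flags;
	uint32_t srcfd;
	uint32_t newfd;
	uint32_t newfd_flags;
};

/* Two 8-byte members followed by four 4-byte members: 32 bytes, no holes. */
static_assert(sizeof(struct notif_addfd_layout) == 32, "unexpected padding");
static_assert(offsetof(struct notif_addfd_layout, id) == 8, "hole before id");
static_assert(offsetof(struct notif_addfd_layout, flags) == 16, "hole before flags");
static_assert(offsetof(struct notif_addfd_layout, newfd_flags) == 28, "hole in u32 run");

/* Returns 1 when the struct size equals the sum of its member sizes. */
int layout_has_no_padding(void)
{
	return sizeof(struct notif_addfd_layout) ==
	       2 * sizeof(uint64_t) + 4 * sizeof(uint32_t);
}
```

Because the total is a multiple of 8 and every member sits at its natural
alignment, no packing attribute is needed for this layout.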

[1]: https://lore.kernel.org/lkml/87o8w9bcaf@mid.deneb.enyo.de/
[2]: 
https://lore.kernel.org/lkml/a328b91d-fd8f-4f27-b3c2-91a9c45f1...@rasmusvillemoes.dk/

Signed-off-by: Sargun Dhillon 
Suggested-by: Matt Denton 
Cc: Al Viro 
Cc: Chris Palmer 
Cc: Christian Brauner 
Cc: Jann Horn 
Cc: Kees Cook 
Cc: Robert Sesek 
Cc: Tycho Andersen 
Cc: linux-fsde...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-...@vger.kernel.org
---
 fs/file.c|  29 +-
 include/linux/file.h |   1 +
 include/uapi/linux/seccomp.h |  25 +
 kernel/seccomp.c | 184 ++-
 4 files changed, 234 insertions(+), 5 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 5afd76fca8c2..eb413c1fdb7f 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -938,15 +938,19 @@ int replace_fd(unsigned fd, struct file *file, unsigned 
flags)
  * File Receive - Receive a file from another process
  *
  * This function is designed to receive files from other tasks. It encapsulates
- * logic around security and cgroups. The file descriptor provided must be a
- * freshly allocated (unused) file descriptor.
+ * logic around security and cgroups. It can either replace an existing file
+ * descriptor, or install the file at a new unused one. If the file is meant
+ * to be installed on a new file descriptor, it must be allocated with the
+ * right flags by the user and the flags passed must be 0 -- as anything else
+ * is ignored.
  *
  * This helper does not consume a reference to the file, so the caller must put
  * their reference.
  *
  * Returns 0 upon success.
  */
-int file_receive(int fd, struct file *file)
+static int __file_receive(int fd, unsigned int flags, struct file *file,
+ bool replace)
 {
struct socket *sock;
int err;
@@ -955,7 +959,14 @@ int file_receive(int fd, struct file *file)
if (err)
return err;
 
-   fd_install(fd, get_file(file));
+   if (replace) {
+   err = replace_fd(fd, file, flags);
+   if (err)
+   return err;
+   } else {
+   WARN_ON(flags);
+   fd_install(fd, get_file(file));
+   }
 
    sock = sock_from_file(file, &err);
if (sock) {
@@ -966,6 +977,16 @@ int file_receive(int fd, struct file *file)
return 0;
 }
 
+int file_receive_replace(int fd, unsigned int flags, struct file *file)
+{
+   return __file_receive(fd, flags, file, true);
+}
+
+int file_receive(int fd, struct file *file)
+{
+   return __file_receive(fd, 0, file, false);
+}
+
 static int ksys_dup3(unsigned int oldfd, unsigned int newfd, int flags)
 {
int err = -EBADF;
diff --git a/include/linux/file.h b/include/linux/file.h
index 7b56dc23e560..e4ca058fb559 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -94,5 +94,6 @@ extern void fd_install(unsigned int fd, struct file *file);
 extern void flush_delayed_fput(void);
 extern void __fput_sync(struct file *);
 
+extern int file_receive_replace(int fd, unsigned int flags, struct file *file);
 extern int file_receive(int fd, struct file *file);
 #endif /* __LINUX_FILE_H */
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index c1735455bc53..aec3e43c4418 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -113,6 +113,27 @@ struct seccomp_notif_resp {
__u32 flags;
 };
 
+/* valid flags for seccomp_notif_addfd */
+#define SECCOMP_ADDFD_FLAG_SETFD   (

[PATCH v3 2/4] pid: Use file_receive helper to copy FDs

2020-06-02 Thread Sargun Dhillon

The code to copy file descriptors was duplicated in pidfd_getfd.
Rather than continue to duplicate it, this hoists the code out of
kernel/pid.c and uses the newly added file_receive helper.

Earlier, when this was implemented there was some back-and-forth
about how the semantics should work around copying around file
descriptors [1], and it was decided that the default behaviour
should be to not modify cgroup data. As a matter of least surprise,
this approach follows the default semantics as presented by SCM_RIGHTS.

In the future, a flag can be added to avoid manipulating the cgroup
data on copy.

[1]: https://lore.kernel.org/lkml/20200107175927.4558-1-sar...@sargun.me/

Signed-off-by: Sargun Dhillon 
Suggested-by: Kees Cook 
Cc: Al Viro 
Cc: Christian Brauner 
Cc: Daniel Wagner 
Cc: David S. Miller 
Cc: Jann Horn 
Cc: John Fastabend 
Cc: Tejun Heo 
Cc: Tycho Andersen 
Cc: sta...@vger.kernel.org
Cc: cgro...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 kernel/pid.c | 20 +---
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index c835b844aca7..1642cf940aa1 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -606,7 +606,7 @@ static int pidfd_getfd(struct pid *pid, int fd)
 {
struct task_struct *task;
struct file *file;
-   int ret;
+   int ret, err;
 
task = get_pid_task(pid, PIDTYPE_PID);
if (!task)
@@ -617,18 +617,16 @@ static int pidfd_getfd(struct pid *pid, int fd)
if (IS_ERR(file))
return PTR_ERR(file);
 
-   ret = security_file_receive(file);
-   if (ret) {
-   fput(file);
-   return ret;
-   }
-
ret = get_unused_fd_flags(O_CLOEXEC);
-   if (ret < 0)
-   fput(file);
-   else
-   fd_install(ret, file);
+   if (ret >= 0) {
+   err = file_receive(ret, file);
+   if (err) {
+   put_unused_fd(ret);
+   ret = err;
+   }
+   }
 
+   fput(file);
return ret;
 }
 
-- 
2.25.1

[PATCH v3 4/4] selftests/seccomp: Test SECCOMP_IOCTL_NOTIF_ADDFD

2020-06-02 Thread Sargun Dhillon

Test whether we can add file descriptors in response to notifications.
This injects the file descriptors via notifications, and then uses
kcmp to determine whether or not it has been successful.
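
For readers unfamiliar with kcmp(2), here is a small standalone sketch of
that kind of comparison. It is not the selftest's helper: same_open_file()
and kcmp_demo() are names invented for this example, and the demo compares
descriptors within a single process rather than across a fork. The demo
returns 77 if kcmp is unavailable in the current environment.

```c
#include <assert.h>
#include <errno.h>
#include <linux/kcmp.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Thin wrapper around kcmp(2) with KCMP_FILE. Returns 0 when fd1 in pid1
 * and fd2 in pid2 refer to the same open file, a positive value when they
 * differ, and -1 with errno set on error.
 */
long same_open_file(pid_t pid1, pid_t pid2, int fd1, int fd2)
{
	return syscall(SYS_kcmp, pid1, pid2, KCMP_FILE,
		       (unsigned long)fd1, (unsigned long)fd2);
}

/*
 * A dup()ed descriptor shares its open file with the original, while the
 * two ends of a pipe are distinct open files. Returns 0 when kcmp reports
 * exactly that, 77 when kcmp cannot be used here, and 1 otherwise.
 */
int kcmp_demo(void)
{
	int p[2], copy, rc;
	long same, diff;
	pid_t self = getpid();

	if (pipe(p) < 0)
		return 1;
	copy = dup(p[0]);
	if (copy < 0) {
		close(p[0]);
		close(p[1]);
		return 1;
	}

	same = same_open_file(self, self, p[0], copy);
	if (same < 0 && (errno == ENOSYS || errno == EPERM)) {
		rc = 77;	/* kernel built without kcmp, or blocked */
	} else {
		diff = same_open_file(self, self, p[0], p[1]);
		rc = (same == 0 && diff > 0) ? 0 : 1;
	}

	close(copy);
	close(p[0]);
	close(p[1]);
	return rc;
}
```

In the test itself the two pids differ (supervisor and child), which is
what makes the comparison meaningful after an fd has been injected.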

It also includes some basic sanity checking for arguments.

Signed-off-by: Sargun Dhillon 
Cc: Al Viro 
Cc: Chris Palmer 
Cc: Christian Brauner 
Cc: Jann Horn 
Cc: Kees Cook 
Cc: Robert Sesek 
Cc: Tycho Andersen 
Cc: Matt Denton 
Cc: linux-fsde...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 tools/testing/selftests/seccomp/seccomp_bpf.c | 183 ++
 1 file changed, 183 insertions(+)

diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c 
b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 402ccb3a4e52..a786b1734ddd 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -45,6 +45,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -182,6 +183,12 @@ struct seccomp_metadata {
 #define SECCOMP_IOCTL_NOTIF_SEND   SECCOMP_IOWR(1, \
struct seccomp_notif_resp)
 #define SECCOMP_IOCTL_NOTIF_ID_VALID   SECCOMP_IOR(2, __u64)
+/* On success, the return value is the remote process's added fd number */
+#define SECCOMP_IOCTL_NOTIF_ADDFD  SECCOMP_IOR(3,  \
+   struct seccomp_notif_addfd)
+
+/* valid flags for seccomp_notif_addfd */
+#define SECCOMP_ADDFD_FLAG_SETFD   (1UL << 0) /* Specify remote fd */
 
 struct seccomp_notif {
__u64 id;
@@ -202,6 +209,15 @@ struct seccomp_notif_sizes {
__u16 seccomp_notif_resp;
__u16 seccomp_data;
 };
+
+struct seccomp_notif_addfd {
+   __u64 size;
+   __u64 id;
+   __u32 flags;
+   __u32 srcfd;
+   __u32 newfd;
+   __u32 newfd_flags;
+};
 #endif
 
 #ifndef PTRACE_EVENTMSG_SYSCALL_ENTRY
@@ -3822,6 +3838,173 @@ TEST(user_notification_filter_empty_threaded)
EXPECT_GT((pollfd.revents & POLLHUP) ?: 0, 0);
 }
 
+TEST(user_notification_sendfd)
+{
+   pid_t pid;
+   long ret;
+   int status, listener, memfd;
+   struct seccomp_notif_addfd addfd = {};
+   struct seccomp_notif req = {};
+   struct seccomp_notif_resp resp = {};
+   /* 100 ms */
+   struct timespec delay = { .tv_nsec = 100000000 };
+
+   memfd = memfd_create("test", 0);
+   ASSERT_GE(memfd, 0);
+
+   ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+   ASSERT_EQ(0, ret) {
+   TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+   }
+
+   /* Check that the basic notification machinery works */
+   listener = user_trap_syscall(__NR_getppid,
+SECCOMP_FILTER_FLAG_NEW_LISTENER);
+   ASSERT_GE(listener, 0);
+
+   pid = fork();
+   ASSERT_GE(pid, 0);
+
+   if (pid == 0) {
+   if (syscall(__NR_getppid) != USER_NOTIF_MAGIC)
+   exit(1);
+   exit(syscall(__NR_getppid) != USER_NOTIF_MAGIC);
+   }
+
+   ASSERT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+
+   addfd.size = sizeof(addfd);
+   addfd.srcfd = memfd;
+   addfd.newfd_flags = O_CLOEXEC;
+   addfd.newfd = 0;
+   addfd.id = req.id;
+   addfd.flags = 0xff;
+
+   /* Verify bad flags cannot be set */
+   EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd), -1);
+   EXPECT_EQ(errno, EINVAL);
+
+   /* Verify that remote_fd cannot be set without setting flags */
+   addfd.flags = 0;
+   addfd.newfd = 1;
+   EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd), -1);
+   EXPECT_EQ(errno, EINVAL);
+
+   /* Verify we can set an arbitrary remote fd */
+   addfd.newfd = 0;
+
+   ret = ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd);
+   EXPECT_GE(ret, 0);
+   EXPECT_EQ(filecmp(getpid(), pid, memfd, ret), 0);
+
+   /* Verify we can set a specific remote fd */
+   addfd.newfd = 42;
+   addfd.flags = SECCOMP_ADDFD_FLAG_SETFD;
+
+   EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd), 42);
+   EXPECT_EQ(filecmp(getpid(), pid, memfd, 42), 0);
+
+   resp.id = req.id;
+   resp.error = 0;
+   resp.val = USER_NOTIF_MAGIC;
+
+   EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+
+   /*
+* This sets the ID of the ADD FD to the last request plus 1. The
+* notification ID increments 1 per notification.
+*/
+   addfd.id = req.id + 1;
+
+   /* This spins until the underlying notification is generated */
+   while (ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd) != -1 &&
+  errno != -EINPROGRESS)
+   nanosleep(&delay, NULL);
+
+   memset(&req, 0, sizeof(req));
+   ASSERT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+   ASSERT_EQ(addfd.id, req.id);
+
+   resp.id = req.id;
+   resp.error = 0;
+   resp.val = USER_NOTIF_M

[PATCH v3 0/4] Add seccomp notifier ioctl that enables adding fds

2020-06-02 Thread Sargun Dhillon

This adds the capability for seccomp notifier listeners to add file
descriptors in response to a seccomp notification. This is useful for
syscalls in which the previous capabilities were not sufficient. The
current mechanism works well for syscalls that either have side effects
that are system / namespace wide (mount), or that operate on a specific
set of registers (reboot, mknod), and don't require dereferencing pointers.
The problem with dereferencing pointers in a supervisor is that it leaves
us vulnerable to TOC-TOU [1] style attacks. For syscalls that had a direct
effect on file descriptors pidfd_getfd was added, allowing for those file
descriptors to be directly operated upon by the supervisor [2].

Unfortunately, this leaves system calls which return file descriptors
out of the picture. These are fairly common syscalls, such as openat,
socket, and perf_event_open that return file descriptors, and have
arguments that are pointers. These require that the supervisor is able to
verify the arguments, make the call on behalf of the process on hand,
and pass back the resulting file descriptor. This is where addfd comes
into play.

There is an additional flag that allows you to "set" an FD, rather than
add it with an arbitrary number. This has dup2 style semantics, and
installs the new file at that file descriptor, and atomically closes
the old one if it existed. This is useful for a particular use case
that we have, in which we want to swap out AF_INET sockets for AF_UNIX,
AF_INET6, and sockets in another namespace when doing "upconversion".
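
For reference, the dup2 semantics being borrowed look like this from plain
userspace. This is only a local sketch with pipes standing in for sockets,
and dup2_replace_demo() is a name invented for the example; it does not
exercise the new ioctl.

```c
#include <assert.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Replace the file installed at an existing descriptor number with another
 * file, dup2-style. Returns 1 when the number is unchanged but now refers
 * to the new file, 0 when that is not what happened, -1 on setup failure.
 */
int dup2_replace_demo(void)
{
	int a[2], b[2], ok;
	struct stat before, src, after;

	if (pipe(a) < 0)
		return -1;
	if (pipe(b) < 0) {
		close(a[0]);
		close(a[1]);
		return -1;
	}

	ok = fstat(a[0], &before) == 0 &&
	     fstat(b[0], &src) == 0 &&
	     before.st_ino != src.st_ino &&	/* two different pipes to start */
	     dup2(b[0], a[0]) == a[0] &&	/* old file closed, new one installed */
	     fstat(a[0], &after) == 0 &&
	     after.st_ino == src.st_ino;	/* same number, different file */

	close(a[0]);
	close(a[1]);
	close(b[0]);
	close(b[1]);
	return ok ? 1 : 0;
}
```

The target process keeps using the same descriptor number throughout,
which is what makes the swap transparent to it.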

My specific usecase at Netflix is to enable our IPv4-IPv6 transition
mechanism, in which our namespaces have no real IPv4 reachability,
and when it comes time to do a connect(2), we get a socket from a
namespace with global IPv4 reachability.

In addition, we intend to use it for our service mesh: where our
service mesh needs to intercept ingress traffic, the addfd
capability will act as a mechanism to do socket activation.

Addfd is not implemented as a separate syscall, a la pidfd_getfd, as
VFS makes some optimizations in regards to the fdtable, and assumes
that they are not modified by external processes. Although a mechanism
that scheduled something in the context of the task could work, it is
somewhat simpler to do it in the context of the ioctl as we control
the task while in kernel. In addition, there are no obvious needs
for this beyond seccomp notifier.

This mechanism leaves a potential issue that if the manager is
interrupted while injecting FDs, the child process will be left with
leaked / dangling FDs. This may lead to undefined behaviour. A
mechanism to work around this is to extend the structure and add a
"rollback" mechanism for FDs to be closed if things fail.

This introduces a new helper -- file_receive, which is responsible
for moving fds across processes. The helper replaces code in
SCM_RIGHTS. In the SCM_RIGHTS compat codepath there was a bug that
resulted in the cgroup data not being set at all. This fixes that bug,
and should be cherry-picked into long-term. The file_receive change
should probably go into stable. The file_receive code also replaced the
receive fd logic in pidfd_getfd. This is somewhat contrary to my
original view[5], but I think it is best for the principle of
least surprise to adopt it. This should be cherry-picked into stable.

I tested this on amd64 with the x86-64 and x32 ABIs.

Given there is no testing infrastructure for cgroup v1, I opted to
forgo adding new tests there as it is considered deprecated.

Changes since v2:
 * Introduction of the file_receive helper which hoists out logic to
   manipulate file descriptors outside of seccomp.c to file.c
 * Small fix that manipulated the socket's cgroup even when the
   receive failed
 * seccomp struct layout
Changes since v1:
 * find_notification has been cleaned up slightly, and it replaces a use
   case in send as well.
 * Fixes ref counting rules to get / release references in the ioctl side,
   rather than the seccomp notifier side [3].
 * Removes the optional move flag, and opts into SCM_RIGHTS
 * Rearranges the seccomp_notif_addfd datastructure for greater user
   clarity [4]. In order to avoid unnamed padding it makes size u64,
   which is a little bit of a waste of space.
 * Changes error codes to return ESRCH upon the process going away on
   notification, and EINPROGRESS if the notification is in an unexpected
   state (and added tests for this behaviour)

[1]: 
https://lore.kernel.org/lkml/20190918084833.9369-2-christian.brau...@ubuntu.com/
[2]: https://lore.kernel.org/lkml/20200107175927.4558-1-sar...@sargun.me/
[3]: https://lore.kernel.org/lkml/20200525000537.gb23...@zeniv.linux.org.uk/
[4]: https://lore.kernel.org/lkml/20200525135036.vp2nmmx42y7dfznf@wittgenstein/
[5]: https://lore.kernel.org/lkml/20200107175927.4558-1-sar...@sargun.me/

Sargun Dhillon (4):
  fs, net: Standardize on file_receive helper to move fds across
processes

[PATCH v3 1/4] fs, net: Standardize on file_receive helper to move fds across processes

2020-06-02 Thread Sargun Dhillon

Previously there were two chunks of code where the logic to receive file
descriptors was duplicated in net. The compat version of copying
file descriptors via SCM_RIGHTS did not have logic to update cgroups.
Logic to change the cgroup data was added in:
commit 48a87cc26c13 ("net: netprio: fd passed in SCM_RIGHTS datagram not set correctly")
commit d84295067fc7 ("net: net_cls: fd passed in SCM_RIGHTS datagram not set correctly")

This was not copied to the compat path. This commit fixes that, and thus
should be cherry-picked into stable.

This introduces a helper (file_receive) which encapsulates the logic for
handling calling security hooks as well as manipulating cgroup information.
This helper can then be used in other places in the kernel where file
descriptors are copied between processes.

I tested cgroup classid setting on both the compat (x32) path, and the
native path to ensure that when moving the file descriptor the classid
is set.

Signed-off-by: Sargun Dhillon 
Suggested-by: Kees Cook 
Cc: Al Viro 
Cc: Christian Brauner 
Cc: Daniel Wagner 
Cc: David S. Miller 
Cc: Jann Horn ,
Cc: John Fastabend 
Cc: Tejun Heo 
Cc: Tycho Andersen 
Cc: sta...@vger.kernel.org
Cc: cgro...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 fs/file.c| 35 +++
 include/linux/file.h |  1 +
 net/compat.c | 10 +-
 net/core/scm.c   | 14 --
 4 files changed, 45 insertions(+), 15 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index abb8b7081d7a..5afd76fca8c2 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -18,6 +18,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 
 unsigned int sysctl_nr_open __read_mostly = 1024*1024;
 unsigned int sysctl_nr_open_min = BITS_PER_LONG;
@@ -931,6 +934,38 @@ int replace_fd(unsigned fd, struct file *file, unsigned 
flags)
return err;
 }
 
+/*
+ * File Receive - Receive a file from another process
+ *
+ * This function is designed to receive files from other tasks. It encapsulates
+ * logic around security and cgroups. The file descriptor provided must be a
+ * freshly allocated (unused) file descriptor.
+ *
+ * This helper does not consume a reference to the file, so the caller must put
+ * their reference.
+ *
+ * Returns 0 upon success.
+ */
+int file_receive(int fd, struct file *file)
+{
+   struct socket *sock;
+   int err;
+
+   err = security_file_receive(file);
+   if (err)
+   return err;
+
+   fd_install(fd, get_file(file));
+
+   sock = sock_from_file(file, &err);
+   if (sock) {
+   sock_update_netprioidx(&sock->sk->sk_cgrp_data);
+   sock_update_classid(&sock->sk->sk_cgrp_data);
+   }
+
+   return 0;
+}
+
 static int ksys_dup3(unsigned int oldfd, unsigned int newfd, int flags)
 {
int err = -EBADF;
diff --git a/include/linux/file.h b/include/linux/file.h
index 142d102f285e..7b56dc23e560 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -94,4 +94,5 @@ extern void fd_install(unsigned int fd, struct file *file);
 extern void flush_delayed_fput(void);
 extern void __fput_sync(struct file *);
 
+extern int file_receive(int fd, struct file *file);
 #endif /* __LINUX_FILE_H */
diff --git a/net/compat.c b/net/compat.c
index 4bed96e84d9a..8ac0e7e09208 100644
--- a/net/compat.c
+++ b/net/compat.c
@@ -293,9 +293,6 @@ void scm_detach_fds_compat(struct msghdr *kmsg, struct 
scm_cookie *scm)
 
for (i = 0, cmfptr = (int __user *) CMSG_COMPAT_DATA(cm); i < fdmax; 
i++, cmfptr++) {
int new_fd;
-   err = security_file_receive(fp[i]);
-   if (err)
-   break;
err = get_unused_fd_flags(MSG_CMSG_CLOEXEC & kmsg->msg_flags
  ? O_CLOEXEC : 0);
if (err < 0)
@@ -306,8 +303,11 @@ void scm_detach_fds_compat(struct msghdr *kmsg, struct 
scm_cookie *scm)
put_unused_fd(new_fd);
break;
}
-   /* Bump the usage count and install the file. */
-   fd_install(new_fd, get_file(fp[i]));
+   err = file_receive(new_fd, fp[i]);
+   if (err) {
+   put_unused_fd(new_fd);
+   break;
+   }
}
 
if (i > 0) {
diff --git a/net/core/scm.c b/net/core/scm.c
index dc6fed1f221c..ba93abf2881b 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -303,11 +303,7 @@ void scm_detach_fds(struct msghdr *msg, struct scm_cookie 
*scm)
for (i=0, cmfptr=(__force int __user *)CMSG_DATA(cm); imsg_flags
  ? O_CLOEXEC : 0);
if (err < 0)
@@ -318,13 +314,11 @@ void scm_detach_fds(struct msghdr *msg, struct scm_cookie 
*scm)
put_unused_fd(new_fd);
break;
}

Re: [PATCH v2 2/3] seccomp: Introduce addfd ioctl to seccomp user notifier

2020-06-01 Thread Sargun Dhillon

On Sat, May 30, 2020 at 9:07 AM Kees Cook  wrote:
>
> On Sat, May 30, 2020 at 03:08:37PM +0100, Al Viro wrote:
> > On Fri, May 29, 2020 at 07:43:10PM -0700, Kees Cook wrote:
> >
> > > Can anyone clarify the expected failure mode from SCM_RIGHTS? Can we
> > > move the put_user() after instead? I think cleanup would just be:
> > > replace_fd(fd, NULL, 0)
> >
> > Bollocks.
> >
> > Repeat after me: descriptor tables can be shared.  There is no
> > "cleanup" after you've put something there.
>
> Right -- this is what I was trying to ask about, and why I didn't like
> the idea of just leaving the fd in the table on failure. But yeah, there
> is a race if the process is about to fork or something.
>
> So the choice here is how to handle the put_user() failure:
>
> - add the put_user() address to the new helper, as I suggest in [1].
>   (exactly duplicates current behavior)
> - just leave the fd in place (not current behavior: dumps a fd into
>   the process without "agreed" notification).
> - do a double put_user (once before and once after), also in [1].
>   (sort of a best-effort combo of the above two. and SCM_RIGHTS is
>   hardly fast-path).
>
> -Kees
>
> [1] https://lore.kernel.org/linux-api/202005282345.573B917@keescook/
>
> --
> Kees Cook

I'm going to suggest we stick to the approach of doing[1]:
1. Allocate FD
2. put_user
3. "Receive" and install file into FD

That is the only way to preserve the current behaviour in which userspace
is notified about *every* FD that is received via SCM_RIGHTS. The
scm_detach_fds code as it reads today does effectively what is above,
in that the fd is not installed until *after* the put_user. Therefore,
if put_user gets an EFAULT or ENOMEM, it falls through to the MSG_CTRUNC
bit.

The approach suggested[2] has a "change" in behaviour, in that (all in
file_receive):
1. Allocate FD
2. Receive file
3. put_user

Based on what Al Viro said, I don't think we can simply add step #4,
being "just" uninstall the FD.

[1]: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg2179418.html
[2]: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg2179453.html
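
The ordering in the first list can be illustrated with a small userspace
model. This is only a sketch under stated assumptions: fd_table,
reserve_fd, report_fd and receive_one are hypothetical names, the table
is a plain array rather than the kernel's fdtable, and a NULL
destination stands in for put_user failing. What it tries to show is
that a failure before the install step leaves nothing to undo in a table
that may be shared.

```c
#include <stdbool.h>
#include <stddef.h>

#define TABLE_SIZE 8

/* Hypothetical stand-in for a descriptor table: slot state only. */
enum slot_state { SLOT_FREE, SLOT_RESERVED, SLOT_INSTALLED };

struct fd_table {
	enum slot_state slots[TABLE_SIZE];
};

/* Step 1: reserve a number; nothing is visible to the receiver yet. */
int reserve_fd(struct fd_table *t)
{
	for (int i = 0; i < TABLE_SIZE; i++) {
		if (t->slots[i] == SLOT_FREE) {
			t->slots[i] = SLOT_RESERVED;
			return i;
		}
	}
	return -1;
}

/* Undo a reservation; safe because the slot was never installed. */
void unreserve_fd(struct fd_table *t, int fd)
{
	t->slots[fd] = SLOT_FREE;
}

/* Step 2: report the number; a NULL destination models a faulting buffer. */
bool report_fd(int *user_slot, int fd)
{
	if (!user_slot)
		return false;
	*user_slot = fd;
	return true;
}

/* Steps 1-3 in order; returns the installed number or -1. */
int receive_one(struct fd_table *t, int *user_slot)
{
	int fd = reserve_fd(t);

	if (fd < 0)
		return -1;
	if (!report_fd(user_slot, fd)) {
		unreserve_fd(t, fd);
		return -1;
	}
	t->slots[fd] = SLOT_INSTALLED;	/* Step 3: point of no return. */
	return fd;
}
```

With the second ordering (install before reporting), the failure branch
would have to remove an already installed entry, which is the step that
cannot be done safely once the table is shared.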

[PATCH v3] seccomp: Add find_notification helper

2020-06-01 Thread Sargun Dhillon

This adds a helper which can iterate through a seccomp_filter to
find a notification matching an ID. It removes several replicated
chunks of code.

Signed-off-by: Sargun Dhillon 
Acked-by: Christian Brauner 
Reviewed-by: Tycho Andersen 
Cc: Matt Denton 
Cc: Kees Cook ,
Cc: Jann Horn ,
Cc: Robert Sesek ,
Cc: Chris Palmer 
Cc: Christian Brauner 
Cc: Tycho Andersen 
---
 kernel/seccomp.c | 55 
 1 file changed, 28 insertions(+), 27 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 55a6184f5990..cc6b47173a95 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -41,6 +41,7 @@
 #include 
 #include 
 #include 
+#include 
 
 enum notify_state {
SECCOMP_NOTIFY_INIT,
@@ -1021,10 +1022,27 @@ static int seccomp_notify_release(struct inode *inode, 
struct file *file)
return 0;
 }
 
+/* must be called with notif_lock held */
+static inline struct seccomp_knotif *
+find_notification(struct seccomp_filter *filter, u64 id)
+{
+   struct seccomp_knotif *cur;
+
+   lockdep_assert_held(&filter->notify_lock);
+
+   list_for_each_entry(cur, &filter->notif->notifications, list) {
+   if (cur->id == id)
+   return cur;
+   }
+
+   return NULL;
+}
+
+
 static long seccomp_notify_recv(struct seccomp_filter *filter,
void __user *buf)
 {
-   struct seccomp_knotif *knotif = NULL, *cur;
+   struct seccomp_knotif *knotif, *cur;
struct seccomp_notif unotif;
ssize_t ret;
 
@@ -1078,15 +1096,8 @@ static long seccomp_notify_recv(struct seccomp_filter 
*filter,
 * may have died when we released the lock, so we need to make
 * sure it's still around.
 */
-   knotif = NULL;
mutex_lock(&filter->notify_lock);
-   list_for_each_entry(cur, &filter->notif->notifications, list) {
-   if (cur->id == unotif.id) {
-   knotif = cur;
-   break;
-   }
-   }
-
+   knotif = find_notification(filter, unotif.id);
if (knotif) {
knotif->state = SECCOMP_NOTIFY_INIT;
up(&filter->notif->request);
@@ -1101,7 +1112,7 @@ static long seccomp_notify_send(struct seccomp_filter 
*filter,
void __user *buf)
 {
struct seccomp_notif_resp resp = {};
-   struct seccomp_knotif *knotif = NULL, *cur;
+   struct seccomp_knotif *knotif;
long ret;
 
if (copy_from_user(&resp, buf, sizeof(resp)))
@@ -1118,13 +1129,7 @@ static long seccomp_notify_send(struct seccomp_filter 
*filter,
if (ret < 0)
return ret;
 
-   list_for_each_entry(cur, &filter->notif->notifications, list) {
-   if (cur->id == resp.id) {
-   knotif = cur;
-   break;
-   }
-   }
-
+   knotif = find_notification(filter, resp.id);
if (!knotif) {
ret = -ENOENT;
goto out;
@@ -1150,7 +1155,7 @@ static long seccomp_notify_send(struct seccomp_filter 
*filter,
 static long seccomp_notify_id_valid(struct seccomp_filter *filter,
void __user *buf)
 {
-   struct seccomp_knotif *knotif = NULL;
+   struct seccomp_knotif *knotif;
u64 id;
long ret;
 
@@ -1161,16 +1166,12 @@ static long seccomp_notify_id_valid(struct 
seccomp_filter *filter,
if (ret < 0)
return ret;
 
-   ret = -ENOENT;
-   list_for_each_entry(knotif, &filter->notif->notifications, list) {
-   if (knotif->id == id) {
-   if (knotif->state == SECCOMP_NOTIFY_SENT)
-   ret = 0;
-   goto out;
-   }
-   }
+   knotif = find_notification(filter, id);
+   if (knotif && knotif->state == SECCOMP_NOTIFY_SENT)
+   ret = 0;
+   else
+   ret = -ENOENT;
 
-out:
mutex_unlock(&filter->notify_lock);
return ret;
 }
-- 
2.25.1

Re: [PATCH v2 2/3] seccomp: Introduce addfd ioctl to seccomp user notifier

2020-05-29 Thread Sargun Dhillon

> 
> I mean, yes, that's certainly better, but it just seems a shame that
> everyone has to do the get_unused/put_unused dance just because of how
> SCM_RIGHTS does this weird put_user() in the middle.
> 
> Can anyone clarify the expected failure mode from SCM_RIGHTS? Can we
> move the put_user() after instead? I think cleanup would just be:
> replace_fd(fd, NULL, 0)
> 
> So:
> 
> (updated to skip sock updates on failure; thank you Christian!)
> 
> int file_receive(int fd, unsigned long flags, struct file *file)
> {
>   struct socket *sock;
>   int ret;
> 
>   ret = security_file_receive(file);
>   if (ret)
>   return ret;
> 
>   /* Install the file. */
>   if (fd == -1) {
>   ret = get_unused_fd_flags(flags);
>   if (ret >= 0)
>   fd_install(ret, get_file(file));
>   } else {
>   ret = replace_fd(fd, file, flags);
>   }
> 
>   /* Bump the sock usage counts. */
>   if (ret >= 0) {
>   sock = sock_from_file(addfd->file, );
>   if (sock) {
>   sock_update_netprioidx(>sk->sk_cgrp_data);
>   sock_update_classid(>sk->sk_cgrp_data);
>   }
>   }
> 
>   return ret;
> }
> 
> scm_detach_fds()
>   ...
>   for (i=0, cmfptr=(__force int __user *)CMSG_DATA(cm); i  i++, cmfptr++)
>   {
>   int new_fd;
> 
>   err = file_receive(-1, MSG_CMSG_CLOEXEC & msg->msg_flags
>   ? O_CLOEXEC : 0, fp[i]);
>   if (err < 0)
>   break;
>   new_fd = err;
> 
Isn't the "right" way to do this to allocate a bunch of file descriptors,
and fill up the user buffer with them, and then install the files? This
seems to half-install the file descriptors and then error out.

I know that's the current behaviour, but that seems like a bad idea. Do
we really want to perpetuate this half-broken state? I guess that some
userspace programs could be depending on this -- and their recovery
semantics could rely on this. I mean this is 10+ year old code.

>   err = put_user(err, cmfptr);
>   if (err) {
>   /*
>* If we can't notify userspace that it got the
>* fd, we need to unwind and remove it again.
>*/
>   replace_fd(new_fd, NULL, 0);
>   break;
>   }
>   }
>   ...
> 
> 
> 
> -- 
> Kees Cook

Re: [PATCH v2 2/3] seccomp: Introduce addfd ioctl to seccomp user notifier

2020-05-29 Thread Sargun Dhillon

On Fri, May 29, 2020 at 12:31:37AM -0700, Kees Cook wrote:
> On Thu, May 28, 2020 at 04:08:57AM -0700, Sargun Dhillon wrote:
> > This adds a seccomp notifier ioctl which allows for the listener to "add"
> > file descriptors to a process which originated a seccomp user
> > notification. This allows calls like mount, and mknod to be "implemented",
> > as the return value, and the arguments are data in memory. On the other
> > hand, calls like connect can be "implemented" using pidfd_getfd.
> > 
> > Unfortunately, there are calls which return file descriptors, like
> > open, which are vulnerable to TOC-TOU attacks, and require that the
> > more privileged supervisor can inspect the argument, and perform the
> > syscall on behalf of the process generating the notification. This
> > allows the file descriptor generated from that open call to be
> > returned to the calling process.
> > 
> > In addition, there is functionality to allow for replacement of
> > specific file descriptors, following dup2-like semantics.
> > 
> > Signed-off-by: Sargun Dhillon 
> > Suggested-by: Matt Denton 
> 
> This looks mostly really clean. When I've got more brain tomorrow I want to
> double-check the locking, but I think the use of notify_lock and being
> in the ioctl fully protects everything from any use-after-free-like
> issues.
> 
> Notes below...
> 
> > +/* valid flags for seccomp_notif_addfd */
> > +#define SECCOMP_ADDFD_FLAG_SETFD   (1UL << 0) /* Specify remote fd */
> 
> Nit: please use BIT()
> 
> > @@ -735,6 +770,41 @@ static u64 seccomp_next_notify_id(struct 
> > seccomp_filter *filter)
> > return filter->notif->next_id++;
> >  }
> >  
> > +static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd)
> > +{
> > +   struct socket *sock;
> > +   int ret, err;
> > +
> > +   /*
> > +* Remove the notification, and reset the list pointers, indicating
> > +* that it has been handled.
> > +*/
> > +   list_del_init(>list);
> > +
> > +   ret = security_file_receive(addfd->file);
> > +   if (ret)
> > +   goto out;
> > +
> > +   if (addfd->fd == -1) {
> > +   ret = get_unused_fd_flags(addfd->flags);
> > +   if (ret >= 0)
> > +   fd_install(ret, get_file(addfd->file));
> > +   } else {
> > +   ret = replace_fd(addfd->fd, addfd->file, addfd->flags);
> > +   }
> > +
> > +   /* These are the semantics from copying FDs via SCM_RIGHTS */
> > +   sock = sock_from_file(addfd->file, );
> > +   if (sock) {
> > +   sock_update_netprioidx(>sk->sk_cgrp_data);
> > +   sock_update_classid(>sk->sk_cgrp_data);
> > +   }
> 
> This made my eye twitch. ;) I see this is borrowed from
> scm_detach_fds()... this really feels like the kind of thing that will
> quickly go out of sync. I think this "receive an fd" logic needs to be
> lifted out of scm_detach_fds() so it and seccomp can share it. I'm not
> sure how to parameterize it quite right, though. Perhaps:
> 
> int file_receive(int fd, unsigned long flags, struct file *file)
> {
>   struct socket *sock;
>   int ret;
> 
>   ret = security_file_receive(file);
>   if (ret)
>   return ret;
> 
>   /* Install the file. */
>   if (fd == -1) {
>   ret = get_unused_fd_flags(flags);
>   if (ret >= 0)
>   fd_install(ret, get_file(file));
>   } else {
>   ret = replace_fd(fd, file, flags);
>   }
> 
>   /* Bump the usage count. */
>   sock = sock_from_file(addfd->file, );
>   if (sock) {
>   sock_update_netprioidx(>sk->sk_cgrp_data);
>   sock_update_classid(>sk->sk_cgrp_data);
>   }
> 
>   return ret;
> }
> 
> 
> static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd)
> {
>   /*
>* Remove the notification, and reset the list pointers, indicating
>* that it has been handled.
>*/
>   list_del_init(>list);
>   addfd->ret = file_receive(addfd->fd, addfd->flags, addfd->file);
>   complete(>completion);
> }
> 
> scm_detach_fds()
>   ...
>   for (i=0, cmfptr=(__force int __user *)CMSG_DATA(cm); i  i++, cmfptr++)
>   {
> 
>   err = file_receive(-1, MSG_CMSG_CLOEXEC & msg->msg_flags
>   ? O_CLOEXEC : 0, fp[i]);
>

Re: [PATCH v2 2/3] seccomp: Introduce addfd ioctl to seccomp user notifier

2020-05-29 Thread Sargun Dhillon

On Fri, May 29, 2020 at 6:31 AM Christian Brauner
 wrote:
>
> > > +   /* Check if we were woken up by a addfd message */
> > > +   addfd = list_first_entry_or_null(,
> > > +struct seccomp_kaddfd, list);
> > > +   if (addfd && n.state != SECCOMP_NOTIFY_REPLIED) {
> > > +   seccomp_handle_addfd(addfd);
> > > +   mutex_unlock(>notify_lock);
> > > +   goto wait;
> > > +   }
> > > ret = n.val;
> > > err = n.error;
> > > flags = n.flags;
> > > }
> > >
> > > +   /* If there were any pending addfd calls, clear them out */
> > > +   list_for_each_entry_safe(addfd, tmp, , list) {
> > > +   /* The process went away before we got a chance to handle it 
> > > */
> > > +   addfd->ret = -ESRCH;
> > > +   list_del_init(>list);
> > > +   complete(>completion);
> > > +   }
>
> I forgot to ask this in my first review before, don't you need a
> complete(>completion) call in seccomp_notify_release() before
> freeing it?
>

When complete(&knotif->ready) is called in seccomp_notify_release,
subsequently the notifier (seccomp_do_user_notification) will be woken up and
it'll fail this check:
if (addfd && n.state != SECCOMP_NOTIFY_REPLIED)

Falling through to:
/* If there were any pending addfd calls, clear them out */
list_for_each_entry_safe(addfd, tmp, &n.addfd, list) {
	/* The process went away before we got a chance to handle it */
	addfd->ret = -ESRCH;
	list_del_init(&addfd->list);
	complete(&addfd->completion);
}

Although ESRCH isn't the "right" response, this fall through behaviour
should work.

Re: [PATCH v2 3/3] selftests/seccomp: Test SECCOMP_IOCTL_NOTIF_ADDFD

2020-05-29 Thread Sargun Dhillon

On Fri, May 29, 2020 at 12:41:51AM -0700, Kees Cook wrote:
> On Thu, May 28, 2020 at 04:08:58AM -0700, Sargun Dhillon wrote:
> > +   EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, ), 0);
> > +
> > +   nextid = req.id + 1;
> > +
> > +   /* Wait for getppid to be called for the second time */
> > +   sleep(1);
> 
> I always rebel at finding "sleep" in tests. ;) Is this needed? IIUC,
> userspace will immediately see EINPROGRESS after the NOTIF_SEND
> finishes, yes?
> 
> Otherwise, yes, this looks good.
> 
> -- 
> Kees Cook
I'm open to better suggestions, but there's a race where if getppid
is not called before the second SECCOMP_IOCTL_NOTIF_ADDFD is called,
you will just get an ENOENT, since the notification ID is not found.

The other approach is to "poll" the child, and wait for it to enter
the second syscall. Calling receive beforehand doesn't work because
it moves the state of the notification in the kernel to received,
and then the kernel doesn't error with EINPROGRESS.

Re: [PATCH v2 1/3] seccomp: Add find_notification helper

2020-05-29 Thread Sargun Dhillon

> 
> While the comment is good, let's actually enforce this with:
> 
> if (WARN_ON(!mutex_is_locked(&filter->notif_lock)))
>   return NULL;
> 
I don't see much use of lockdep in seccomp (well, any), but
wouldn't a stronger statement be to use lockdep, and just have:

lockdep_assert_held(&filter->notify_lock);

As that checks that the lock is held by the current task.
Although, that does put this check behind lockdep, which means
that running in "normal" circumstances is less safe (but faster?).
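
The difference between the two checks can be shown with a toy model. This
is a userspace sketch, not kernel code: owned_lock, is_locked and held_by
are hypothetical names, and tasks are plain integers. is_locked plays the
role of mutex_is_locked (somebody holds the lock), while held_by plays
the role of lockdep_assert_held (the caller holds the lock).

```c
#include <stdbool.h>

/* Hypothetical lock that remembers which task took it (0 = nobody). */
struct owned_lock {
	bool locked;
	int owner;
};

void owned_lock_acquire(struct owned_lock *l, int task)
{
	l->locked = true;
	l->owner = task;
}

void owned_lock_release(struct owned_lock *l)
{
	l->locked = false;
	l->owner = 0;
}

/* Analogue of mutex_is_locked(): true if anyone holds the lock. */
bool is_locked(const struct owned_lock *l)
{
	return l->locked;
}

/* Analogue of lockdep_assert_held(): true only if `task` is the holder. */
bool held_by(const struct owned_lock *l, int task)
{
	return l->locked && l->owner == task;
}
```

If task 1 holds the lock and task 2 calls the helper without it,
is_locked still reports true, whereas held_by(lock, 2) reports false.
That is the case the weaker check would miss.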

[PATCH v2 1/3] seccomp: Add find_notification helper

2020-05-28 Thread Sargun Dhillon

This adds a helper which can iterate through a seccomp_filter to
find a notification matching an ID. It removes several replicated
chunks of code.

Signed-off-by: Sargun Dhillon 
Cc: Matt Denton 
Cc: Kees Cook ,
Cc: Jann Horn ,
Cc: Robert Sesek ,
Cc: Chris Palmer 
Cc: Christian Brauner 
Cc: Tycho Andersen 
---
 kernel/seccomp.c | 51 
 1 file changed, 25 insertions(+), 26 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 55a6184f5990..94ae4c7502cc 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1021,10 +1021,25 @@ static int seccomp_notify_release(struct inode *inode, 
struct file *file)
return 0;
 }
 
+/* must be called with notif_lock held */
+static inline struct seccomp_knotif *
+find_notification(struct seccomp_filter *filter, u64 id)
+{
+   struct seccomp_knotif *cur;
+
+   list_for_each_entry(cur, &filter->notif->notifications, list) {
+   if (cur->id == id)
+   return cur;
+   }
+
+   return NULL;
+}
+
+
 static long seccomp_notify_recv(struct seccomp_filter *filter,
void __user *buf)
 {
-   struct seccomp_knotif *knotif = NULL, *cur;
+   struct seccomp_knotif *knotif, *cur;
struct seccomp_notif unotif;
ssize_t ret;
 
@@ -1078,14 +1093,8 @@ static long seccomp_notify_recv(struct seccomp_filter 
*filter,
 * may have died when we released the lock, so we need to make
 * sure it's still around.
 */
-   knotif = NULL;
mutex_lock(&filter->notify_lock);
-   list_for_each_entry(cur, &filter->notif->notifications, list) {
-   if (cur->id == unotif.id) {
-   knotif = cur;
-   break;
-   }
-   }
+   knotif = find_notification(filter, unotif.id);
 
if (knotif) {
knotif->state = SECCOMP_NOTIFY_INIT;
@@ -1101,7 +1110,7 @@ static long seccomp_notify_send(struct seccomp_filter 
*filter,
void __user *buf)
 {
struct seccomp_notif_resp resp = {};
-   struct seccomp_knotif *knotif = NULL, *cur;
+   struct seccomp_knotif *knotif;
long ret;
 
if (copy_from_user(&resp, buf, sizeof(resp)))
@@ -1118,13 +1127,7 @@ static long seccomp_notify_send(struct seccomp_filter 
*filter,
if (ret < 0)
return ret;
 
-   list_for_each_entry(cur, &filter->notif->notifications, list) {
-   if (cur->id == resp.id) {
-   knotif = cur;
-   break;
-   }
-   }
-
+   knotif = find_notification(filter, resp.id);
if (!knotif) {
ret = -ENOENT;
goto out;
@@ -1150,7 +1153,7 @@ static long seccomp_notify_send(struct seccomp_filter 
*filter,
 static long seccomp_notify_id_valid(struct seccomp_filter *filter,
void __user *buf)
 {
-   struct seccomp_knotif *knotif = NULL;
+   struct seccomp_knotif *knotif;
u64 id;
long ret;
 
@@ -1161,16 +1164,12 @@ static long seccomp_notify_id_valid(struct 
seccomp_filter *filter,
if (ret < 0)
return ret;
 
-   ret = -ENOENT;
-   list_for_each_entry(knotif, &filter->notif->notifications, list) {
-   if (knotif->id == id) {
-   if (knotif->state == SECCOMP_NOTIFY_SENT)
-   ret = 0;
-   goto out;
-   }
-   }
+   knotif = find_notification(filter, id);
+   if (knotif && knotif->state == SECCOMP_NOTIFY_SENT)
+   ret = 0;
+   else
+   ret = -ENOENT;
 
-out:
mutex_unlock(&filter->notify_lock);
return ret;
 }
-- 
2.25.1

[PATCH v2 2/3] seccomp: Introduce addfd ioctl to seccomp user notifier

2020-05-28 Thread Sargun Dhillon

This adds a seccomp notifier ioctl which allows for the listener to "add"
file descriptors to a process which originated a seccomp user
notification. This allows calls like mount, and mknod to be "implemented",
as the return value, and the arguments are data in memory. On the other
hand, calls like connect can be "implemented" using pidfd_getfd.

Unfortunately, there are calls which return file descriptors, like
open, which are vulnerable to TOC-TOU attacks, and require that the
more privileged supervisor can inspect the argument, and perform the
syscall on behalf of the process generating the notification. This
allows the file descriptor generated from that open call to be
returned to the calling process.

In addition, there is functionality to allow for replacement of
specific file descriptors, following dup2-like semantics.

Signed-off-by: Sargun Dhillon 
Suggested-by: Matt Denton 
Cc: Kees Cook ,
Cc: Jann Horn ,
Cc: Robert Sesek ,
Cc: Chris Palmer 
Cc: Christian Brauner 
Cc: Tycho Andersen 
---
 include/uapi/linux/seccomp.h |  25 +
 kernel/seccomp.c | 182 ++-
 2 files changed, 206 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index c1735455bc53..c7bfe898e7a0 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -113,6 +113,27 @@ struct seccomp_notif_resp {
__u32 flags;
 };
 
+/* valid flags for seccomp_notif_addfd */
+#define SECCOMP_ADDFD_FLAG_SETFD   (1UL << 0) /* Specify remote fd */
+
+/**
+ * struct seccomp_notif_addfd
+ * @size: The size of the seccomp_notif_addfd datastructure
+ * @id: The ID of the seccomp notification
+ * @flags: SECCOMP_ADDFD_FLAG_*
+ * @srcfd: The local fd number
+ * @newfd: Optional remote FD number if SETFD option is set, otherwise 0.
+ * @newfd_flags: Flags the remote FD should be allocated under
+ */
+struct seccomp_notif_addfd {
+   __u64 size;
+   __u64 id;
+   __u64 flags;
+   __u32 srcfd;
+   __u32 newfd;
+   __u32 newfd_flags;
+};
+
 #define SECCOMP_IOC_MAGIC  '!'
 #define SECCOMP_IO(nr) _IO(SECCOMP_IOC_MAGIC, nr)
 #define SECCOMP_IOR(nr, type)  _IOR(SECCOMP_IOC_MAGIC, nr, type)
@@ -124,4 +145,8 @@ struct seccomp_notif_resp {
 #define SECCOMP_IOCTL_NOTIF_SEND   SECCOMP_IOWR(1, \
struct seccomp_notif_resp)
 #define SECCOMP_IOCTL_NOTIF_ID_VALID   SECCOMP_IOR(2, __u64)
+/* On success, the return value is the remote process's added fd number */
+#define SECCOMP_IOCTL_NOTIF_ADDFD  SECCOMP_IOR(3,  \
+   struct seccomp_notif_addfd)
+
 #endif /* _UAPI_LINUX_SECCOMP_H */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 94ae4c7502cc..02b9ba1fbee0 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -41,6 +41,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 
 enum notify_state {
SECCOMP_NOTIFY_INIT,
@@ -77,10 +80,42 @@ struct seccomp_knotif {
long val;
u32 flags;
 
-   /* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
+   /*
+* Signals when this has changed states, such as the listener
+* dying, a new seccomp addfd message, or changing to REPLIED
+*/
struct completion ready;
 
struct list_head list;
+
+   /* outstanding addfd requests */
+   struct list_head addfd;
+};
+
+/**
+ * struct seccomp_kaddfd - container for seccomp_addfd ioctl messages
+ *
+ * @file: A reference to the file to install in the other task
+ * @fd: The fd number to install it at. If the fd number is -1, it means the
+ *  installing process should allocate the fd as normal.
+ * @flags: The flags for the new file descriptor. At the moment, only O_CLOEXEC
+ * is allowed.
+ * @ret: The return value of the installing process. It is set to the fd num
+ *   upon success (>= 0).
+ * @completion: Indicates that the installing process has completed fd
+ *  installation, or gone away (either due to successful
+ *  reply, or signal)
+ *
+ */
+struct seccomp_kaddfd {
+   struct file *file;
+   int fd;
+   unsigned int flags;
+
+   /* To only be set on reply */
+   int ret;
+   struct completion completion;
+   struct list_head list;
 };
 
 /**
@@ -735,6 +770,41 @@ static u64 seccomp_next_notify_id(struct seccomp_filter 
*filter)
return filter->notif->next_id++;
 }
 
+static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd)
+{
+   struct socket *sock;
+   int ret, err;
+
+   /*
+* Remove the notification, and reset the list pointers, indicating
+* that it has been handled.
+*/
+   list_del_init(&addfd->list);
+
+   ret = security_file_receive(addfd->file);
+   if (ret)
+   goto out;
+
+   if (addfd->fd == -1) {

[PATCH v2 0/3] Add seccomp notifier ioctl that enables adding fds

2020-05-28 Thread Sargun Dhillon

This adds the capability for seccomp notifier listeners to add file
descriptors in response to a seccomp notification. This is useful for
syscalls in which the previous capabilities were not sufficient. The
current mechanism works well for syscalls that either have side effects
that are system / namespace wide (mount), or that operate on a specific
set of registers (reboot, mknod), and don't require dereferencing pointers.
The problem with dereferencing pointers in a supervisor is that it leaves
us vulnerable to TOC-TOU [1] style attacks. For syscalls that had a direct
effect on file descriptors pidfd_getfd was added, allowing for those file
descriptors to be directly operated upon by the supervisor [2].

Unfortunately, this leaves system calls which return file descriptors
out of the picture. These are fairly common syscalls, such as openat,
socket, and perf_event_open that return file descriptors, and have
arguments that are pointers. These require that the supervisor is able to
verify the arguments, make the call on behalf of the process on hand,
and pass back the resulting file descriptor. This is where addfd comes
into play.
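
The TOC-TOU concern can be sketched in plain C. Everything here is a
simulation under stated assumptions: path_allowed is a made-up policy,
the "shared" buffer stands in for memory the supervised task can still
write to, and the racer callback models that task changing its argument
between the check and the use. It is not how seccomp or any supervisor
is implemented.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define PATH_MAX_DEMO 32

/* Hypothetical policy: only paths under /tmp/ are allowed. */
bool path_allowed(const char *path)
{
	return strncmp(path, "/tmp/", 5) == 0;
}

/* Hypothetical hook that runs between the check and the use. */
typedef void (*racer_fn)(char *shared);

/*
 * Unsafe: validates the shared buffer, then reads it again for the "use".
 * Writes the path that would actually be acted on into `used`.
 */
bool check_then_use(char *shared, racer_fn racer, char *used, size_t len)
{
	if (!path_allowed(shared))
		return false;
	if (racer)
		racer(shared);		/* the target rewrites its own memory */
	snprintf(used, len, "%s", shared);
	return true;
}

/* Safer: copy once, then validate and use only the private copy. */
bool snapshot_then_use(char *shared, racer_fn racer, char *used, size_t len)
{
	char copy[PATH_MAX_DEMO];

	snprintf(copy, sizeof(copy), "%s", shared);
	if (!path_allowed(copy))
		return false;
	if (racer)
		racer(shared);		/* too late: the copy is unaffected */
	snprintf(used, len, "%s", copy);
	return true;
}

/* Example racer: swap the path for one outside the policy. */
void swap_to_etc(char *shared)
{
	strcpy(shared, "/etc/x");
}
```

The second variant only helps if the party that validated the copy is
also the one that performs the call, which is why the supervisor has to
make the syscall itself and then hand the resulting file descriptor
back.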

There is an additional flag that allows you to "set" an FD, rather than
add it with an arbitrary number. This has dup2 style semantics, and
installs the new file at that file descriptor, and atomically closes
the old one if it existed. This is useful for a particular use case
that we have, in which we want to swap out AF_INET sockets for AF_UNIX,
AF_INET6, and sockets in another namespace when doing "upconversion".
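
For readers less familiar with the dup2 semantics referred to here, the
following userspace sketch shows them with ordinary pipes inside a
single process. swap_fd and demo_swap are hypothetical helper names; the
only real interface used is dup2(2), and nothing in it touches seccomp
or another task's descriptor table.

```c
#include <string.h>
#include <unistd.h>

/*
 * Replace descriptor `target` with `replacement`, dup2()-style, and
 * return `target` on success or -1 on error.
 */
int swap_fd(int replacement, int target)
{
	return dup2(replacement, target);
}

/*
 * Demonstration: after the swap, a write through the old number lands in
 * the replacement pipe. Returns 1 if that is what happened, 0 if not,
 * and -1 on a setup error.
 */
int demo_swap(void)
{
	int a[2], b[2];
	char buf[4] = {0};
	int ok = -1;

	if (pipe(a) < 0)
		return -1;
	if (pipe(b) < 0)
		goto close_a;

	/* b[1] now refers to pipe a's write end; its old file is closed. */
	if (swap_fd(a[1], b[1]) < 0)
		goto close_b;
	if (write(b[1], "hi", 2) != 2)
		goto close_b;
	ok = read(a[0], buf, sizeof(buf) - 1) == 2 && strcmp(buf, "hi") == 0;

close_b:
	close(b[0]);
	close(b[1]);
close_a:
	close(a[0]);
	close(a[1]);
	return ok;
}
```

The caller keeps using the same descriptor number throughout, which is
the property the socket swap described above relies on.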

My specific usecase at Netflix is to enable our IPv4-IPv6 transition
mechanism, in which our namespaces have no real IPv4 reachability,
and when it comes time to do a connect(2), we get a socket from a
namespace with global IPv4 reachability.

In addition, we intend to use it for our service mesh: where the
service mesh needs to intercept ingress traffic, the addfd
capability will act as a mechanism to do socket activation.

Google Chrome has a use case related to sandboxing.
They use SECCOMP_RET_TRAP to capture filesystem related syscalls in
their sandbox, which returns the FDs via SCM_RIGHTS. Unfortunately,
this does not work when signals are disabled, which is becoming the
default in glibc library functions. They need to switch to an
alternative before this becomes the default in Linux distros, and
they need a mechanism to addfd to move forward.

Addfd is not implemented as a separate syscall, a la pidfd_getfd, as
VFS makes some optimizations in regards to the fdtable, and assumes
that they are not modified by external processes. Although a mechanism
that scheduled something in the context of the task could work, it is
somewhat simpler to do it in the context of the ioctl as we control
the task while in kernel. In addition, there are no obvious needs
for this beyond seccomp notifier.

This mechanism leaves a potential issue that if the manager is
interrupted while injecting FDs, the child process will be left with
leaked / dangling FDs. This may lead to undefined behaviour. A
mechanism to work around this is to extend the structure and add a
"rollback" mechanism for FDs to be closed if things fail.

Changes since v1:
 * find_notification has been cleaned up slightly, and it replaces a use
   case in send as well.
 * Fixes ref counting rules to get / release references in the ioctl side,
   rather than the seccomp notifier side [3].
 * Removes the optional move flag, and opts into SCM_RIGHTS
 * Rearranges the seccomp_notif_addfd datastructure for greater user
   clarity [4]. In order to avoid unnamed padding it makes size u64,
   which is a little bit of a waste of space.
 * Changes error codes to return ESRCH upon the process going away on
   notification, and EINPROGRESS if the notification is in an unexpected
   state (and added tests for this behaviour)
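
The layout point can be checked from userspace with offsetof. The sketch
below mirrors the field order of the v2 struct seccomp_notif_addfd, but
addfd_v2 and both helpers are hypothetical names, and uint64_t/uint32_t
are assumed to stand in for __u64/__u32. It only reports what the
compiler does on the machine it is built on.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the v2 field order (fixed-width stand-in types). */
struct addfd_v2 {
	uint64_t size;
	uint64_t id;
	uint64_t flags;
	uint32_t srcfd;
	uint32_t newfd;
	uint32_t newfd_flags;
};

/* Bytes the compiler adds after the last member (ABI dependent). */
size_t addfd_v2_tail_padding(void)
{
	size_t end = offsetof(struct addfd_v2, newfd_flags) + sizeof(uint32_t);

	return sizeof(struct addfd_v2) - end;
}

/* Nonzero if no padding sits between consecutive members. */
int addfd_v2_members_packed(void)
{
	return offsetof(struct addfd_v2, id) == 8 &&
	       offsetof(struct addfd_v2, flags) == 16 &&
	       offsetof(struct addfd_v2, srcfd) == 24 &&
	       offsetof(struct addfd_v2, newfd) == 28 &&
	       offsetof(struct addfd_v2, newfd_flags) == 32;
}
```

No padding is expected between members, but the three trailing 32-bit
fields leave the struct 4 bytes short of a multiple of 8. On ABIs that
align 64-bit integers to 8 bytes, the tail-padding helper would therefore
be expected to report 4 rather than 0, so it is worth running on each
target rather than assuming.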

[1]: https://lore.kernel.org/lkml/20190918084833.9369-2-christian.brau...@ubuntu.com/
[2]: https://lore.kernel.org/lkml/20200107175927.4558-1-sar...@sargun.me/
[3]: https://lore.kernel.org/lkml/20200525000537.gb23...@zeniv.linux.org.uk/
[4]: https://lore.kernel.org/lkml/20200525135036.vp2nmmx42y7dfznf@wittgenstein/

Sargun Dhillon (3):
  seccomp: Add find_notification helper
  seccomp: Introduce addfd ioctl to seccomp user notifier
  selftests/seccomp: Test SECCOMP_IOCTL_NOTIF_ADDFD

 include/uapi/linux/seccomp.h  |  25 ++
 kernel/seccomp.c  | 231 --
 tools/testing/selftests/seccomp/seccomp_bpf.c | 180 ++
 3 files changed, 410 insertions(+), 26 deletions(-)

-- 
2.25.1

[PATCH v2 3/3] selftests/seccomp: Test SECCOMP_IOCTL_NOTIF_ADDFD

2020-05-28 Thread Sargun Dhillon

Test whether we can add file descriptors in response to notifications.
This injects the file descriptors via notifications, and then uses
kcmp to determine whether or not it has been successful.

It also includes some basic sanity checking for arguments.

Signed-off-by: Sargun Dhillon 
Cc: Matt Denton 
Cc: Kees Cook ,
Cc: Jann Horn ,
Cc: Robert Sesek ,
Cc: Chris Palmer 
Cc: Christian Brauner 
Cc: Tycho Andersen 
---
 tools/testing/selftests/seccomp/seccomp_bpf.c | 180 ++
 1 file changed, 180 insertions(+)

diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c 
b/tools/testing/selftests/seccomp/seccomp_bpf.c
index c0aa46ce14f6..05516c185d78 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -45,6 +45,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -181,6 +182,12 @@ struct seccomp_metadata {
 #define SECCOMP_IOCTL_NOTIF_SEND   SECCOMP_IOWR(1, \
struct seccomp_notif_resp)
 #define SECCOMP_IOCTL_NOTIF_ID_VALID   SECCOMP_IOR(2, __u64)
+/* On success, the return value is the remote process's added fd number */
+#define SECCOMP_IOCTL_NOTIF_ADDFD  SECCOMP_IOR(3,  \
+   struct seccomp_notif_addfd)
+
+/* valid flags for seccomp_notif_addfd */
+#define SECCOMP_ADDFD_FLAG_SETFD   (1UL << 0) /* Specify remote fd */
 
 struct seccomp_notif {
__u64 id;
@@ -201,6 +208,15 @@ struct seccomp_notif_sizes {
__u16 seccomp_notif_resp;
__u16 seccomp_data;
 };
+
+struct seccomp_notif_addfd {
+   __u64 size;
+   __u64 id;
+   __u64 flags;
+   __u32 srcfd;
+   __u32 newfd;
+   __u32 newfd_flags;
+};
 #endif
 
 #ifndef PTRACE_EVENTMSG_SYSCALL_ENTRY
@@ -3686,6 +3702,170 @@ TEST(user_notification_continue)
}
 }
 
+TEST(user_notification_sendfd)
+{
+   pid_t pid;
+   long ret;
+   int status, listener, memfd;
+   struct seccomp_notif_addfd addfd = {};
+   struct seccomp_notif req = {};
+   struct seccomp_notif_resp resp = {};
+   __u64 nextid;
+
+   memfd = memfd_create("test", 0);
+   ASSERT_GE(memfd, 0);
+
+   ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+   ASSERT_EQ(0, ret) {
+   TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+   }
+
+   /* Check that the basic notification machinery works */
+   listener = user_trap_syscall(__NR_getppid,
+SECCOMP_FILTER_FLAG_NEW_LISTENER);
+   ASSERT_GE(listener, 0);
+
+   pid = fork();
+   ASSERT_GE(pid, 0);
+
+   if (pid == 0) {
+   if (syscall(__NR_getppid) != USER_NOTIF_MAGIC)
+   exit(1);
+   exit(syscall(__NR_getppid) != USER_NOTIF_MAGIC);
+   }
+
+   ASSERT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+
+   addfd.size = sizeof(addfd);
+   addfd.srcfd = memfd;
+   addfd.newfd_flags = O_CLOEXEC;
+   addfd.newfd = 0;
+   addfd.id = req.id;
+   addfd.flags = 0xff;
+
+   /* Verify bad flags cannot be set */
+   EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd), -1);
+   EXPECT_EQ(errno, EINVAL);
+
+   /* Verify that remote_fd cannot be set without setting flags */
+   addfd.flags = 0;
+   addfd.newfd = 1;
+   EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd), -1);
+   EXPECT_EQ(errno, EINVAL);
+
+   /* Verify we can set an arbitrary remote fd */
+   addfd.newfd = 0;
+
+   ret = ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd);
+   EXPECT_GE(ret, 0);
+   EXPECT_EQ(filecmp(getpid(), pid, memfd, ret), 0);
+
+   /* Verify we can set a specific remote fd */
+   addfd.newfd = 42;
+   addfd.flags = SECCOMP_ADDFD_FLAG_SETFD;
+
+   EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd), 42);
+   EXPECT_EQ(filecmp(getpid(), pid, memfd, 42), 0);
+
+   resp.id = req.id;
+   resp.error = 0;
+   resp.val = USER_NOTIF_MAGIC;
+
+   EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+
+   nextid = req.id + 1;
+
+   /* Wait for getppid to be called for the second time */
+   sleep(1);
+
+   addfd.id = nextid;
+   EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd), -1);
+   EXPECT_EQ(errno, EINPROGRESS);
+
+   memset(&req, 0, sizeof(req));
+   ASSERT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+   ASSERT_EQ(nextid, req.id);
+
+   resp.id = req.id;
+   resp.error = 0;
+   resp.val = USER_NOTIF_MAGIC;
+   EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+
+
+   EXPECT_EQ(waitpid(pid, &status, 0), pid);
+   EXPECT_EQ(true, WIFEXITED(status));
+   EXPECT_EQ(0, WEXITSTATUS(status));
+
+   close(memfd);
+}
+
+TEST(user_notification_sendfd_rlimit)
+{
+   pid_t pid;
+   long ret;
+   int status, listener, memfd;
+   struct

Re: [PATCH 1/2] seccomp: notify user trap about unused filter

2020-05-27 Thread Sargun Dhillon

On Wed, May 27, 2020 at 01:19:01PM +0200, Christian Brauner wrote:
> +void seccomp_filter_notify(const struct task_struct *tsk)
> +{
> +        struct seccomp_filter *orig = tsk->seccomp.filter;
> +
> +        while (orig && refcount_dec_and_test(&orig->live)) {
> +                if (waitqueue_active(&orig->wqh))
> +                        wake_up_poll(&orig->wqh, EPOLLHUP);
> +                orig = orig->prev;
> +        }
> +}
> +
Any reason not to write this as:
for (orig = tsk->seccomp.filter; refcount_dec_and_test(&orig->live);
     orig = orig->prev)?
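
For reference, the loop semantics under discussion can be modelled in userspace. This is only a sketch: model_filter, model_dec_and_test and model_filter_notify are hypothetical names, and a plain int stands in for refcount_t. The for form below keeps the `orig &&` test from the while loop, since without it the walk would dereference NULL once the oldest filter in the chain is released.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct seccomp_filter; an int replaces refcount_t. */
struct model_filter {
	int live;
	struct model_filter *prev;
};

/* Returns nonzero when the counter drops to zero, like refcount_dec_and_test(). */
static int model_dec_and_test(int *count)
{
	assert(*count > 0);
	return --(*count) == 0;
}

/*
 * Walk from the newest filter towards the oldest, stopping at the first
 * filter that still has live users. Returns how many filters became unused
 * (the point where the real code would wake up pollers).
 */
static int model_filter_notify(struct model_filter *orig)
{
	int released = 0;

	for (; orig && model_dec_and_test(&orig->live); orig = orig->prev)
		released++;

	return released;
}
```

With a two-filter chain whose counters are both 1, the walk releases both filters and stops at the NULL prev pointer; with a parent shared by another task (counter 2), it stops after the first filter.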

Also, for those of us who are plumbing in the likes of Go code into the
listener, where we don't have direct access to the epoll interface (at
least not out of the box), what do you think about exposing this on the RECV
ioctl? Or, do you think we should lump that into the "v2" receive API?

Either way, this seems useful, as right now, we're intertwining process
tree lifetime with manager lifetime. This seems cleaner.

Re: [PATCH 2/5] seccomp: Introduce addfd ioctl to seccomp user notifier

2020-05-26 Thread Sargun Dhillon

On Mon, May 25, 2020 at 6:50 AM Christian Brauner
 wrote:
>
> On Sun, May 24, 2020 at 04:39:39PM -0700, Sargun Dhillon wrote:
> > This adds a seccomp notifier ioctl which allows for the listener to "add"
> > file descriptors to a process which originated a seccomp user
> > notification. This allows calls like mount, and mknod to be "implemented",
> > as the return value, and the arguments are data in memory. On the other
> > hand, calls like connect can be "implemented" using pidfd_getfd.
> >
> > Unfortunately, there are calls which return file descriptors, like
> > open, which are vulnerable to TOC-TOU attacks, and require that the
> > more privileged supervisor can inspect the argument, and perform the
> > syscall on behalf of the process generating the notification. This
> > allows the file descriptor generated from that open call to be
> > returned to the calling process.
> >
> > In addition, there is functionality to allow for replacement of
> > specific file descriptors, following dup2-like semantics.
> >
> > Signed-off-by: Sargun Dhillon 
> > Suggested-by: Matt Denton 
> > Cc: Kees Cook ,
> > Cc: Jann Horn ,
> > Cc: Robert Sesek ,
> > Cc: Chris Palmer 
> > Cc: Christian Brauner 
> > Cc: Tycho Andersen 
> > ---
> >  include/uapi/linux/seccomp.h |  25 ++
> >  kernel/seccomp.c | 169 ++-
> >  2 files changed, 193 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> > index c1735455bc53..7d450a9e4c29 100644
> > --- a/include/uapi/linux/seccomp.h
> > +++ b/include/uapi/linux/seccomp.h
> > @@ -113,6 +113,27 @@ struct seccomp_notif_resp {
> >   __u32 flags;
> >  };
> >
> > +/* valid flags for seccomp_notif_addfd */
> > +#define SECCOMP_ADDFD_FLAG_SETFD (1UL << 0) /* Specify remote fd */
> > +
> > +/**
> > + * struct seccomp_notif_addfd
> > + * @size: The size of the seccomp_notif_addfd datastructure
> > + * @fd: The local fd number
> > + * @id: The ID of the seccomp notification
> > + * @fd_flags: Flags the remote FD should be allocated under
> > + * @remote_fd: Optional remote FD number if SETFD option is set, otherwise 
> > 0.
> > + * @flags: SECCOMP_ADDFD_FLAG_*
> > + */
> > +struct seccomp_notif_addfd {
> > + __u32 size;
> > + __u32 fd;
> > + __u64 id;
> > + __u32 fd_flags;
> > + __u32 remote_fd;
> > + __u64 flags;
> > +};
>
> This was a little confusing to me at first. So fd is the fd from which
> we take the struct file and remote_fd is either -1 at which point we
> just allocate the next free fd number and if it is not we
> allocate/replace a specific one. Maybe it would be clearer if we did:
>
> struct seccomp_notif_addfd {
> __u32 size;
> __u64 id;
> __u64 flags;
> __u32 srcfd;
> __u32 newfd;
> __u32 newfd_flags;
> };
>
> No need to hide in the name that this is remote_dup2().
>
> > +
> >  #define SECCOMP_IOC_MAGIC'!'
> >  #define SECCOMP_IO(nr)   _IO(SECCOMP_IOC_MAGIC, nr)
> >  #define SECCOMP_IOR(nr, type)_IOR(SECCOMP_IOC_MAGIC, nr, 
> > type)
> > @@ -124,4 +145,8 @@ struct seccomp_notif_resp {
> >  #define SECCOMP_IOCTL_NOTIF_SEND SECCOMP_IOWR(1, \
> >   struct seccomp_notif_resp)
> >  #define SECCOMP_IOCTL_NOTIF_ID_VALID SECCOMP_IOR(2, __u64)
> > +/* On success, the return value is the remote process's added fd number */
> > +#define SECCOMP_IOCTL_NOTIF_ADDFDSECCOMP_IOR(3,  \
> > + struct seccomp_notif_addfd)
> > +
> >  #endif /* _UAPI_LINUX_SECCOMP_H */
> > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > index f6ce94b7a167..88940eeabaee 100644
> > --- a/kernel/seccomp.c
> > +++ b/kernel/seccomp.c
> > @@ -77,10 +77,42 @@ struct seccomp_knotif {
> >   long val;
> >   u32 flags;
> >
> > - /* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
> > + /*
> > +  * Signals when this has changed states, such as the listener
> > +  * dying, a new seccomp addfd message, or changing to REPLIED
> > +  */
> >   struct completion ready;
> >
> >   struct list_head list;
> > +
> > + /* outstanding addfd requests */
> > + struct list_head addfd;
> > +};
> > +
> > +/**
> &g

Re: [PATCH 4/5] seccomp: Add SECCOMP_ADDFD_FLAG_MOVE flag to add fd ioctl

2020-05-26 Thread Sargun Dhillon

> > + * they are created in. Specifically, sockets, and their interactions with the
> > + * net_cls and net_prio cgroup v1 controllers. This "moves" the file descriptor
> > + * so that it takes on the cgroup controller's configuration in the process
> > + * that the file descriptor is being added to.
> > + */
> > +#define SECCOMP_ADDFD_FLAG_MOVE  (1UL << 1)
>
> I'm not happy about the name because "moving" has much more to do with
> transferring ownership than what we are doing here. After a "move" the
> fd shouldn't be valid anymore. But that might just be my thinking.
>
> But why make this opt-in and not do it exactly like when you send around
> fds and make this mandatory?

Based upon Tycho's comments in an offline thread, I'm going to make this
the default (setting the cgroup metadata) to mirror what SCM_RIGHTS does,
and then if we come up with a good use case where we need to preserve
*cgroup v1* metadata, then we can add an opt-out flag in the future.

Re: [PATCH 2/5] seccomp: Introduce addfd ioctl to seccomp user notifier

2020-05-24 Thread Sargun Dhillon

On Sun, May 24, 2020 at 5:05 PM Al Viro  wrote:
>
> On Sun, May 24, 2020 at 04:39:39PM -0700, Sargun Dhillon wrote:
>
> Bad refcounting rules.  *IF* we go with anything of that sort (and I'm not
> convinced that the entire series makes sense), it's better to have more
> uniform rules re reference consumption/disposal.
>
> Make the destructor of addfd *ALWAYS* drop its reference.  And have this
> function go
Are you suggesting that, in both the success and error cases, the ioctl
invoker side is responsible for fput'ing the final reference? Would we
take an extra reference prior to fd_install?
>
> if (addfd->fd >= 0) {
> ret = replace_fd(addfd->fd, addfd->file, addfd->flags);
> } else {
> ret = get_unused_fd_flags(addfd->flags);
> if (ret >= 0)
> fd_install(ret, get_file(addfd->file));
> }
>
Wouldn't this result in consumption of reference in one case (fd_install),
and the fd still having a reference in the replace_fd case?

[PATCH 4/5] seccomp: Add SECCOMP_ADDFD_FLAG_MOVE flag to add fd ioctl

2020-05-24 Thread Sargun Dhillon

Certain files, when moved to another process have metadata changed, such
as netprioidx, and classid. This is the default behaviour in sending
sockets with SCM_RIGHTS over unix sockets. Depending on the usecase,
this may or may not be desirable with the addfd ioctl. This allows
the user to opt-in.

Signed-off-by: Sargun Dhillon 
Suggested-by: Tycho Andersen 
Cc: Matt Denton 
Cc: Kees Cook ,
Cc: Jann Horn ,
Cc: Robert Sesek ,
Cc: Chris Palmer 
Cc: Christian Brauner 
---
 include/uapi/linux/seccomp.h |  8 
 kernel/seccomp.c | 31 +++
 2 files changed, 35 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 7d450a9e4c29..ccd1c960372a 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -115,6 +115,14 @@ struct seccomp_notif_resp {
 
 /* valid flags for seccomp_notif_addfd */
 #define SECCOMP_ADDFD_FLAG_SETFD   (1UL << 0) /* Specify remote fd */
+/*
+ * Certain file descriptors behave differently depending on the process
+ * they are created in. Specifically, sockets, and their interactions with the
+ * net_cls and net_prio cgroup v1 controllers. This "moves" the file descriptor
+ * so that it takes on the cgroup controller's configuration in the process
+ * that the file descriptor is being added to.
+ */
+#define SECCOMP_ADDFD_FLAG_MOVE    (1UL << 1)
 
 /**
  * struct seccomp_notif_addfd
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 88940eeabaee..2e649f3cb10e 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -41,6 +41,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 
 enum notify_state {
SECCOMP_NOTIFY_INIT,
@@ -108,6 +111,7 @@ struct seccomp_kaddfd {
struct file *file;
int fd;
unsigned int flags;
+   bool move;
 
/* To only be set on reply */
int ret;
@@ -769,7 +773,8 @@ static u64 seccomp_next_notify_id(struct seccomp_filter 
*filter)
 
 static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd)
 {
-   int ret;
+   struct socket *sock;
+   int err, ret;
 
/*
 * Remove the notification, and reset the list pointers, indicating
@@ -785,12 +790,29 @@ static void seccomp_handle_addfd(struct seccomp_kaddfd 
*addfd)
ret = replace_fd(addfd->fd, addfd->file, addfd->flags);
if (ret >= 0)
fput(addfd->file);
+   else
+   goto out;
} else {
ret = get_unused_fd_flags(addfd->flags);
if (ret >= 0)
fd_install(ret, addfd->file);
+   else
+   goto out;
}
 
+   if (addfd->move) {
+           sock = sock_from_file(addfd->file, &err);
+           if (sock) {
+                   sock_update_netprioidx(&sock->sk->sk_cgrp_data);
+                   sock_update_classid(&sock->sk->sk_cgrp_data);
+           }
+   }
+   /*
+* An extra reference is taken on the ioctl side, so upon success, we
+* must consume all references (and on failure, none).
+*/
+   fput(addfd->file);
+
 out:
addfd->ret = ret;
complete(&addfd->completion);
@@ -1279,16 +1301,17 @@ static long seccomp_notify_addfd(struct seccomp_filter 
*filter,
if (addfd.fd_flags & (~O_CLOEXEC))
return -EINVAL;
 
-   if (addfd.flags & ~(SECCOMP_ADDFD_FLAG_SETFD))
+   if (addfd.flags & ~(SECCOMP_ADDFD_FLAG_SETFD|SECCOMP_ADDFD_FLAG_MOVE))
return -EINVAL;
 
if (addfd.remote_fd && !(addfd.flags & SECCOMP_ADDFD_FLAG_SETFD))
return -EINVAL;
 
-   kaddfd.file = fget(addfd.fd);
+   kaddfd.file = fget_many(addfd.fd, 2);
if (!kaddfd.file)
return -EBADF;
 
+   kaddfd.move = (addfd.flags & SECCOMP_ADDFD_FLAG_MOVE);
kaddfd.flags = addfd.fd_flags;
kaddfd.fd = (addfd.flags & SECCOMP_ADDFD_FLAG_SETFD) ?
addfd.remote_fd : -1;
@@ -1339,7 +1362,7 @@ static long seccomp_notify_addfd(struct seccomp_filter 
*filter,
mutex_unlock(&filter->notify_lock);
 out:
if (ret < 0)
-   fput(kaddfd.file);
+   fput_many(kaddfd.file, 2);
 
return ret;
 }
-- 
2.25.1

[PATCH 5/5] selftests/seccomp: Add test for addfd move semantics

2020-05-24 Thread Sargun Dhillon

This introduces another call to addfd, in which the move flag is set. It
may make sense to setup a cgroup v1 hierarchy, and check that the
netprioidx is changed.

Signed-off-by: Sargun Dhillon 
Cc: Matt Denton 
Cc: Kees Cook ,
Cc: Jann Horn ,
Cc: Robert Sesek ,
Cc: Chris Palmer 
Cc: Christian Brauner 
Cc: Tycho Andersen 
---
 tools/testing/selftests/seccomp/seccomp_bpf.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c 
b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 1ec43fef2b93..f4b50cbbde42 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -188,6 +188,8 @@ struct seccomp_metadata {
 
 /* valid flags for seccomp_notif_addfd */
 #define SECCOMP_ADDFD_FLAG_SETFD   (1UL << 0) /* Specify remote fd */
+#define SECCOMP_ADDFD_FLAG_MOVE    (1UL << 1)
+
 
 struct seccomp_notif {
__u64 id;
@@ -3756,6 +3758,12 @@ TEST(user_notification_sendfd)
EXPECT_GE(ret, 0);
EXPECT_EQ(filecmp(getpid(), pid, memfd, ret), 0);
 
+   /* Move the FD */
+   addfd.flags = SECCOMP_ADDFD_FLAG_MOVE;
+   ret = ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd);
+   EXPECT_GE(ret, 0);
+   EXPECT_EQ(filecmp(getpid(), pid, memfd, ret), 0);
+
/* Verify we can set a specific remote fd */
addfd.remote_fd = 42;
addfd.flags = SECCOMP_ADDFD_FLAG_SETFD;
-- 
2.25.1

[PATCH 2/5] seccomp: Introduce addfd ioctl to seccomp user notifier

2020-05-24 Thread Sargun Dhillon

This adds a seccomp notifier ioctl which allows for the listener to "add"
file descriptors to a process which originated a seccomp user
notification. This allows calls like mount, and mknod to be "implemented",
as the return value, and the arguments are data in memory. On the other
hand, calls like connect can be "implemented" using pidfd_getfd.

Unfortunately, there are calls which return file descriptors, like
open, which are vulnerable to TOC-TOU attacks, and require that the
more privileged supervisor can inspect the argument, and perform the
syscall on behalf of the process generating the notification. This
allows the file descriptor generated from that open call to be
returned to the calling process.

In addition, there is functionality to allow for replacement of
specific file descriptors, following dup2-like semantics.

Signed-off-by: Sargun Dhillon 
Suggested-by: Matt Denton 
Cc: Kees Cook ,
Cc: Jann Horn ,
Cc: Robert Sesek ,
Cc: Chris Palmer 
Cc: Christian Brauner 
Cc: Tycho Andersen 
---
 include/uapi/linux/seccomp.h |  25 ++
 kernel/seccomp.c | 169 ++-
 2 files changed, 193 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index c1735455bc53..7d450a9e4c29 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -113,6 +113,27 @@ struct seccomp_notif_resp {
__u32 flags;
 };
 
+/* valid flags for seccomp_notif_addfd */
+#define SECCOMP_ADDFD_FLAG_SETFD   (1UL << 0) /* Specify remote fd */
+
+/**
+ * struct seccomp_notif_addfd
+ * @size: The size of the seccomp_notif_addfd datastructure
+ * @fd: The local fd number
+ * @id: The ID of the seccomp notification
+ * @fd_flags: Flags the remote FD should be allocated under
+ * @remote_fd: Optional remote FD number if SETFD option is set, otherwise 0.
+ * @flags: SECCOMP_ADDFD_FLAG_*
+ */
+struct seccomp_notif_addfd {
+   __u32 size;
+   __u32 fd;
+   __u64 id;
+   __u32 fd_flags;
+   __u32 remote_fd;
+   __u64 flags;
+};
+
 #define SECCOMP_IOC_MAGIC  '!'
 #define SECCOMP_IO(nr) _IO(SECCOMP_IOC_MAGIC, nr)
 #define SECCOMP_IOR(nr, type)  _IOR(SECCOMP_IOC_MAGIC, nr, type)
@@ -124,4 +145,8 @@ struct seccomp_notif_resp {
 #define SECCOMP_IOCTL_NOTIF_SEND   SECCOMP_IOWR(1, \
struct seccomp_notif_resp)
 #define SECCOMP_IOCTL_NOTIF_ID_VALID   SECCOMP_IOR(2, __u64)
+/* On success, the return value is the remote process's added fd number */
+#define SECCOMP_IOCTL_NOTIF_ADDFD  SECCOMP_IOR(3,  \
+   struct seccomp_notif_addfd)
+
 #endif /* _UAPI_LINUX_SECCOMP_H */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index f6ce94b7a167..88940eeabaee 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -77,10 +77,42 @@ struct seccomp_knotif {
long val;
u32 flags;
 
-   /* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
+   /*
+* Signals when this has changed states, such as the listener
+* dying, a new seccomp addfd message, or changing to REPLIED
+*/
struct completion ready;
 
struct list_head list;
+
+   /* outstanding addfd requests */
+   struct list_head addfd;
+};
+
+/**
+ * struct seccomp_kaddfd - container for seccomp_addfd ioctl messages
+ *
+ * @file: A reference to the file to install in the other task
+ * @fd: The fd number to install it at. If the fd number is -1, it means the
+ *  installing process should allocate the fd as normal.
+ * @flags: The flags for the new file descriptor. At the moment, only O_CLOEXEC
+ * is allowed.
+ * @ret: The return value of the installing process. It is set to the fd num
+ *   upon success (>= 0).
+ * @completion: Indicates that the installing process has completed fd
+ *  installation, or gone away (either due to successful
+ *  reply, or signal)
+ *
+ */
+struct seccomp_kaddfd {
+   struct file *file;
+   int fd;
+   unsigned int flags;
+
+   /* To only be set on reply */
+   int ret;
+   struct completion completion;
+   struct list_head list;
 };
 
 /**
@@ -735,6 +767,35 @@ static u64 seccomp_next_notify_id(struct seccomp_filter 
*filter)
return filter->notif->next_id++;
 }
 
+static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd)
+{
+   int ret;
+
+   /*
+* Remove the notification, and reset the list pointers, indicating
+* that it has been handled.
+*/
+   list_del_init(&addfd->list);
+
+   ret = security_file_receive(addfd->file);
+   if (ret)
+   goto out;
+
+   if (addfd->fd >= 0) {
+   ret = replace_fd(addfd->fd, addfd->file, addfd->flags);
+   if (ret >= 0)
+   fput(addfd->file);

[PATCH 0/5] Add seccomp notifier ioctl that enables adding fds

2020-05-24 Thread Sargun Dhillon

This adds the capability for seccomp notifier listeners to add file
descriptors in response to a seccomp notification. This is useful for
syscalls in which the previous capabilities were not sufficient. The
current mechanism works well for syscalls that either have side effects
that are system / namespace wide (mount), or that operate on a specific
set of registers (reboot, mknod), and don't require dereferencing pointers.
The problem with dereferencing pointers in a supervisor is that it leaves
us vulnerable to TOC-TOU [1] style attacks. For syscalls that had a direct
effect on file descriptors pidfd_getfd was added, allowing for those file
descriptors to be directly operated upon by the supervisor [2].

Unfortunately, this leaves system calls which return file descriptors
out of the picture. These are fairly common syscalls, such as openat,
socket, and perf_event_open that return file descriptors, and have
arguments that are pointers. These require that the supervisor is able to
verify the arguments, make the call on behalf of the process on hand,
and pass back the resulting file descriptor. This is where addfd comes
into play.

There is an additional flag that allows you to "set" an FD, rather than
add it with an arbitrary number. This has dup2 style semantics, and
installs the new file at that file descriptor, and atomically closes
the old one if it existed. This is useful for a particular use case
that we have, in which we want to swap out AF_INET sockets for AF_UNIX,
AF_INET6, and sockets in another namespace when doing "upconversion".
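
As a local analogy for the "set" behaviour described above, the following sketch uses plain dup2(2) inside a single process: the target descriptor number is closed and rebound to the new file in one step. It is not the addfd ioctl itself, and same_inode and demo_dup2_replace are hypothetical helper names. Comparing inodes is only a convenient approximation of "same file" for this demonstration.

```c
#include <assert.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if both descriptors refer to the same inode, 0 if not, -1 on error. */
static int same_inode(int a, int b)
{
	struct stat sa, sb;

	if (fstat(a, &sa) < 0 || fstat(b, &sb) < 0)
		return -1;
	return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}

/*
 * Replace the read end of one pipe with the read end of another via dup2().
 * Returns 0 if the target number refers to the new pipe afterwards, -1 otherwise.
 */
static int demo_dup2_replace(void)
{
	int old_pipe[2], new_pipe[2];
	int target, ret = -1;

	if (pipe(old_pipe) < 0)
		return -1;
	if (pipe(new_pipe) < 0) {
		close(old_pipe[0]);
		close(old_pipe[1]);
		return -1;
	}

	target = old_pipe[0];
	/* Two freshly created pipes are expected to have different inodes. */
	assert(same_inode(target, new_pipe[0]) == 0);

	/* The old file at 'target' is closed and replaced in one step. */
	if (dup2(new_pipe[0], target) == target &&
	    same_inode(target, new_pipe[0]) == 1)
		ret = 0;

	close(target);
	close(old_pipe[1]);
	close(new_pipe[0]);
	close(new_pipe[1]);
	return ret;
}
```

The descriptor number seen by the caller does not change across the swap, which is the property the upconversion use case relies on.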

My specific usecase at Netflix is to enable our IPv4-IPv6 transition
mechanism, in which our namespaces have no real IPv4 reachability,
and when it comes time to do a connect(2), we get a socket from a
namespace with global IPv4 reachability.

In addition, we intend to use it for our service mesh: where the
service mesh needs to intercept ingress traffic, the addfd
capability will act as a mechanism to do socket activation.

Addfd is not implemented as a separate syscall, a la pidfd_getfd, as
VFS makes some optimizations in regards to the fdtable, and assumes
that they are not modified by external processes. Although a mechanism
that scheduled something in the context of the task could work, it is
somewhat simpler to do it in the context of the ioctl as we control
the task while in kernel.

There is an additional flag (move) that was added to enable cgroup
v1 controllers (netprio, classid), and moving sockets, as a socket
can only be associated with one cgroup at a time.

[1]: https://lore.kernel.org/lkml/20190918084833.9369-2-christian.brau...@ubuntu.com/
[2]: https://lore.kernel.org/lkml/20200107175927.4558-1-sar...@sargun.me/

Sargun Dhillon (5):
  seccomp: Add find_notification helper
  seccomp: Introduce addfd ioctl to seccomp user notifier
  selftests/seccomp: Test SECCOMP_IOCTL_NOTIF_ADDFD
  seccomp: Add SECCOMP_ADDFD_FLAG_MOVE flag to add fd ioctl
  selftests/seccomp: Add test for addfd move semantics

 include/uapi/linux/seccomp.h  |  33 +++
 kernel/seccomp.c  | 228 +++--
 tools/testing/selftests/seccomp/seccomp_bpf.c | 235 ++
 3 files changed, 479 insertions(+), 17 deletions(-)

-- 
2.25.1

[PATCH 1/5] seccomp: Add find_notification helper

2020-05-24 Thread Sargun Dhillon

This adds a helper which can iterate through a seccomp_filter to
find a notification matching an ID. It removes several replicated
chunks of code.

Signed-off-by: Sargun Dhillon 
Cc: Matt Denton 
Cc: Kees Cook ,
Cc: Jann Horn ,
Cc: Robert Sesek ,
Cc: Chris Palmer 
Cc: Christian Brauner 
Cc: Tycho Andersen 
---
 kernel/seccomp.c | 38 +-
 1 file changed, 21 insertions(+), 17 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 55a6184f5990..f6ce94b7a167 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1021,10 +1021,25 @@ static int seccomp_notify_release(struct inode *inode, 
struct file *file)
return 0;
 }
 
+/* must be called with notif_lock held */
+static inline struct seccomp_knotif *
+find_notification(struct seccomp_filter *filter, u64 id)
+{
+   struct seccomp_knotif *cur;
+
+   list_for_each_entry(cur, &filter->notif->notifications, list) {
+           if (cur->id == id)
+                   return cur;
+   }
+
+   return NULL;
+}
+
+
 static long seccomp_notify_recv(struct seccomp_filter *filter,
void __user *buf)
 {
-   struct seccomp_knotif *knotif = NULL, *cur;
+   struct seccomp_knotif *knotif, *cur;
struct seccomp_notif unotif;
ssize_t ret;
 
@@ -1078,14 +1093,8 @@ static long seccomp_notify_recv(struct seccomp_filter 
*filter,
 * may have died when we released the lock, so we need to make
 * sure it's still around.
 */
-   knotif = NULL;
mutex_lock(&filter->notify_lock);
-   list_for_each_entry(cur, &filter->notif->notifications, list) {
-   if (cur->id == unotif.id) {
-   knotif = cur;
-   break;
-   }
-   }
+   knotif = find_notification(filter, unotif.id);
 
if (knotif) {
knotif->state = SECCOMP_NOTIFY_INIT;
@@ -1150,7 +1159,7 @@ static long seccomp_notify_send(struct seccomp_filter 
*filter,
 static long seccomp_notify_id_valid(struct seccomp_filter *filter,
void __user *buf)
 {
-   struct seccomp_knotif *knotif = NULL;
+   struct seccomp_knotif *knotif;
u64 id;
long ret;
 
@@ -1162,15 +1171,10 @@ static long seccomp_notify_id_valid(struct 
seccomp_filter *filter,
return ret;
 
ret = -ENOENT;
-   list_for_each_entry(knotif, &filter->notif->notifications, list) {
-   if (knotif->id == id) {
-   if (knotif->state == SECCOMP_NOTIFY_SENT)
-   ret = 0;
-   goto out;
-   }
-   }
+   knotif = find_notification(filter, id);
+   if (knotif && knotif->state == SECCOMP_NOTIFY_SENT)
+   ret = 0;
 
-out:
mutex_unlock(&filter->notify_lock);
return ret;
 }
-- 
2.25.1

[PATCH 3/5] selftests/seccomp: Test SECCOMP_IOCTL_NOTIF_ADDFD

2020-05-24 Thread Sargun Dhillon

Test whether we can add file descriptors in response to notifications.
This injects the file descriptors via notifications, and then uses
kcmp to determine whether or not it has been successful.

It also includes some basic sanity checking for arguments.

Signed-off-by: Sargun Dhillon 
Cc: Matt Denton 
Cc: Kees Cook ,
Cc: Jann Horn ,
Cc: Robert Sesek ,
Cc: Chris Palmer 
Cc: Christian Brauner 
Cc: Tycho Andersen 
---
 tools/testing/selftests/seccomp/seccomp_bpf.c | 227 ++
 1 file changed, 227 insertions(+)

diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c 
b/tools/testing/selftests/seccomp/seccomp_bpf.c
index c0aa46ce14f6..1ec43fef2b93 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -45,6 +45,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -181,6 +182,12 @@ struct seccomp_metadata {
 #define SECCOMP_IOCTL_NOTIF_SEND   SECCOMP_IOWR(1, \
struct seccomp_notif_resp)
 #define SECCOMP_IOCTL_NOTIF_ID_VALID   SECCOMP_IOR(2, __u64)
+/* On success, the return value is the remote process's added fd number */
+#define SECCOMP_IOCTL_NOTIF_ADDFD  SECCOMP_IOR(3,  \
+   struct seccomp_notif_addfd)
+
+/* valid flags for seccomp_notif_addfd */
+#define SECCOMP_ADDFD_FLAG_SETFD   (1UL << 0) /* Specify remote fd */
 
 struct seccomp_notif {
__u64 id;
@@ -201,6 +208,15 @@ struct seccomp_notif_sizes {
__u16 seccomp_notif_resp;
__u16 seccomp_data;
 };
+
+struct seccomp_notif_addfd {
+   __u32 size;
+   __u32 fd;
+   __u64 id;
+   __u32 fd_flags;
+   __u32 remote_fd;
+   __u64 flags;
+};
 #endif
 
 #ifndef PTRACE_EVENTMSG_SYSCALL_ENTRY
@@ -3686,6 +3702,217 @@ TEST(user_notification_continue)
}
 }
 
+TEST(user_notification_sendfd)
+{
+   pid_t pid;
+   long ret;
+   int status, listener, memfd;
+   struct seccomp_notif_addfd addfd = {};
+   struct seccomp_notif req = {};
+   struct seccomp_notif_resp resp = {};
+
+   memfd = memfd_create("test", 0);
+   ASSERT_GE(memfd, 0);
+
+   ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+   ASSERT_EQ(0, ret) {
+   TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+   }
+
+   /* Check that the basic notification machinery works */
+   listener = user_trap_syscall(__NR_getppid,
+SECCOMP_FILTER_FLAG_NEW_LISTENER);
+   ASSERT_GE(listener, 0);
+
+   pid = fork();
+   ASSERT_GE(pid, 0);
+
+   if (pid == 0)
+   exit(syscall(__NR_getppid) != USER_NOTIF_MAGIC);
+
+   ASSERT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+
+   addfd.size = sizeof(addfd);
+   addfd.fd = memfd;
+   addfd.fd_flags = O_CLOEXEC;
+   addfd.remote_fd = 0;
+   addfd.id = req.id;
+   addfd.flags = 0xff;
+
+   /* Verify bad flags cannot be set */
+   EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd), -1);
+   EXPECT_EQ(errno, EINVAL);
+
+   /* Verify that remote_fd cannot be set without setting flags */
+   addfd.flags = 0;
+   addfd.remote_fd = 1;
+   EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd), -1);
+   EXPECT_EQ(errno, EINVAL);
+
+   /* Verify we can set an arbitrary remote fd */
+   addfd.remote_fd = 0;
+
+   ret = ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd);
+   EXPECT_GE(ret, 0);
+   EXPECT_EQ(filecmp(getpid(), pid, memfd, ret), 0);
+
+   /* Verify we can set a specific remote fd */
+   addfd.remote_fd = 42;
+   addfd.flags = SECCOMP_ADDFD_FLAG_SETFD;
+
+   EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd), 42);
+   EXPECT_EQ(filecmp(getpid(), pid, memfd, 42), 0);
+
+   resp.id = req.id;
+   resp.error = 0;
+   resp.val = USER_NOTIF_MAGIC;
+
+   EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+
+
+   EXPECT_EQ(waitpid(pid, &status, 0), pid);
+   EXPECT_EQ(true, WIFEXITED(status));
+   EXPECT_EQ(0, WEXITSTATUS(status));
+
+   close(memfd);
+}
+
+TEST(user_notification_sendfd_goaway)
+{
+   pid_t pid, pid2;
+   long ret;
+   int status, listener, memfd;
+   struct seccomp_notif_addfd addfd = {};
+   struct seccomp_notif req = {};
+   struct seccomp_notif_resp resp = {};
+
+   memfd = memfd_create("test", 0);
+   ASSERT_GE(memfd, 0);
+
+   ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+   ASSERT_EQ(0, ret) {
+   TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+   }
+
+   /* Check that the basic notification machinery works */
+   listener = user_trap_syscall(__NR_getppid,
+SECCOMP_FILTER_FLAG_NEW_LISTENER);
+   ASSERT_GE(listener, 0);
+
+   pid = fork();
+   ASSERT_GE(pid, 0);
+
+   if (pi

Re: seccomp feature development

2020-05-22 Thread Sargun Dhillon

On Mon, May 18, 2020 at 02:04:57PM -0700, Kees Cook wrote:
> Hi!
> 
> This is my attempt at a brain-dump on my plans for nearish-term seccomp
> features. Welcome to my TED talk... ;)
> 
> These are the things I've been thinking about:
> 
> - fd passing
> - deep argument inspection
> - changing structure sizes
> - syscall bitmasks
> 
What's your take on enabling multiple filters with listeners attached,
so that different seccomp interceptors can operate together? I'm wondering
how this would work.

One idea that I had is adding a new flag to the seccomp filter
installation -- something like NEXT_FILTER_COMPATIBLE. When a filter is
installed with a listener, it will check if all previous filters were
installed with NEXT_FILTER_COMPATIBLE.

If the call is intercepted by a listener, and the return is overridden,
then it short-circuits, and the subsequent filters are not evaluated.

On the other hand, if the continue response is sent, then the
subsequent filters are called.

What do you think?
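
As a rough illustration of that evaluation order, here is a userspace model only. All names are hypothetical, nothing here is an existing kernel interface, and the order in which stacked filters would be consulted is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical verdicts a listener could give for an intercepted syscall. */
enum model_verdict {
	MODEL_CONTINUE,	/* listener sent the "continue" response */
	MODEL_OVERRIDE,	/* listener supplied its own return value */
};

struct model_listener_filter {
	enum model_verdict verdict;
	long override_val;	/* only meaningful for MODEL_OVERRIDE */
};

/*
 * Consult filters in array order. The first override short-circuits and
 * later filters are not evaluated; if every listener continues, all filters
 * are consulted and *overridden stays 0. Returns the number of filters
 * that were consulted.
 */
static size_t model_evaluate(const struct model_listener_filter *filters,
			     size_t n, int *overridden, long *val)
{
	size_t i;

	assert(overridden && val);
	*overridden = 0;

	for (i = 0; i < n; i++) {
		if (filters[i].verdict == MODEL_OVERRIDE) {
			*overridden = 1;
			*val = filters[i].override_val;
			return i + 1;
		}
	}
	return n;
}
```

With a stack of continue, override(42), override(7), only the first two filters are consulted and the syscall returns 42; with a stack where every listener continues, each filter is consulted in turn.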
