Re: [PATCH] riscv: return -ENOSYS for syscall -1

2020-12-23 Thread Tycho Andersen
On Wed, Dec 23, 2020 at 06:54:43PM -0800, Palmer Dabbelt wrote:
> On Wed, 23 Dec 2020 00:24:04 PST (-0800), Christoph Hellwig wrote:
> > On Tue, Dec 22, 2020 at 09:22:19AM -0700, Tycho Andersen wrote:
> > > On Mon, Dec 21, 2020 at 11:52:00PM +0100, Andreas Schwab wrote:
> > > > Properly return -ENOSYS for syscall -1 instead of leaving the return 
> > > > value
> > > > uninitialized.  This fixes the strace teststuite.
> > > >
> > > > Fixes: 5340627e3fe0 ("riscv: add support for SECCOMP and 
> > > > SECCOMP_FILTER")
> > > > Signed-off-by: Andreas Schwab 
> > > > ---
> > > >  arch/riscv/kernel/entry.S | 9 +
> > > >  1 file changed, 1 insertion(+), 8 deletions(-)
> > > >
> > > > diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
> > > > index 524d918f3601..d07763001eb0 100644
> > > > --- a/arch/riscv/kernel/entry.S
> > > > +++ b/arch/riscv/kernel/entry.S
> > > > @@ -186,14 +186,7 @@ check_syscall_nr:
> > > >  * Syscall number held in a7.
> > > >  * If syscall number is above allowed value, redirect to 
> > > > ni_syscall.
> > > >  */
> > > > -   bge a7, t0, 1f
> > > > -   /*
> > > > -* Check if syscall is rejected by tracer, i.e., a7 == -1.
> > > > -* If yes, we pretend it was executed.
> > > > -*/
> > > > -   li t1, -1
> > > > -   beq a7, t1, ret_from_syscall_rejected
> > > > -   blt a7, t1, 1f
> > > > +   bgeu a7, t0, 1f
> > > 
> > > IIUC, this is all dead code anyway for the path where seccomp actually
> > > rejects the syscall, since it should do the rejection directly in
> > > handle_syscall_trace_enter(), which is called above this hunk. So it
> > > seems good to me.
> > 
> > That change really needs to be documented in the commit log, or even
> > better split into a separate patch (still documented of course!).
> 
> Unless I'm missing something, this is already how it works already?

Yes, agreed. My musing was mostly just that this is dead code that
probably should have been in removed in af33d2433b03 ("riscv: fix
seccomp reject syscall code path"), but was overlooked. Maybe this
could use a Fixes: tag for that too.

Tycho


Re: [PATCH] riscv: return -ENOSYS for syscall -1

2020-12-22 Thread Tycho Andersen
On Mon, Dec 21, 2020 at 11:52:00PM +0100, Andreas Schwab wrote:
> Properly return -ENOSYS for syscall -1 instead of leaving the return value
> uninitialized.  This fixes the strace teststuite.
> 
> Fixes: 5340627e3fe0 ("riscv: add support for SECCOMP and SECCOMP_FILTER")
> Signed-off-by: Andreas Schwab 
> ---
>  arch/riscv/kernel/entry.S | 9 +
>  1 file changed, 1 insertion(+), 8 deletions(-)
> 
> diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
> index 524d918f3601..d07763001eb0 100644
> --- a/arch/riscv/kernel/entry.S
> +++ b/arch/riscv/kernel/entry.S
> @@ -186,14 +186,7 @@ check_syscall_nr:
>* Syscall number held in a7.
>* If syscall number is above allowed value, redirect to ni_syscall.
>*/
> - bge a7, t0, 1f
> - /*
> -  * Check if syscall is rejected by tracer, i.e., a7 == -1.
> -  * If yes, we pretend it was executed.
> -  */
> - li t1, -1
> - beq a7, t1, ret_from_syscall_rejected
> - blt a7, t1, 1f
> + bgeu a7, t0, 1f

IIUC, this is all dead code anyway for the path where seccomp actually
rejects the syscall, since it should do the rejection directly in
handle_syscall_trace_enter(), which is called above this hunk. So it
seems good to me.

Reviewed-by: Tycho Andersen 


Re: SECCOMP_IOCTL_NOTIF_ADDFD race condition

2020-12-01 Thread Tycho Andersen
On Tue, Dec 01, 2020 at 01:08:25PM +, Sargun Dhillon wrote:
> On Tue, Dec 01, 2020 at 07:41:05AM -0500, Tycho Andersen wrote:
> > On Mon, Nov 30, 2020 at 06:20:09PM -0500, Tycho Andersen wrote:
> > > Idea 1 sounds best to me, but maybe that's because it's the way I
> > > originally did the fd support that never landed :)
> > > 
> > > But here's an Idea 4: we add a way to remotely close an fd (I don't
> > > see that the current infra can do this, but perhaps I didn't look hard
> > > enough), and then when you get ENOENT you have to close the fd. Of
> > > course, this can't be via seccomp, so maybe it's even more racy.
> > 
> > Or better yet: what if the kernel closed everything it had added via
> > ADDFD if it didn't get a valid response from the supervisor? Then
> > everyone gets this bug fixed for free.
> > 
> > Tycho
> > ___
> > Containers mailing list
> > contain...@lists.linux-foundation.org
> > https://lists.linuxfoundation.org/mailman/listinfo/containers
> 
> This doesn't solve the problem universally because of the (Go) preemption 
> problem. Unless we can guarantee that the supervisor can always handle the 
> request in fewer than 10ms, or if it implements resumption behaviour. I know 
> that resumption behaviour is a requirement no matter what, but the easier we 
> can 
> make it to implement resumption, the better chance we are giving users to get 
> this right.

Doesn't automatic cleanup of fds make things easier? I'm not sure I
understand the argument.

I agree it doesn't fix the problem of uncooperative userspace.

Tycho


Re: SECCOMP_IOCTL_NOTIF_ADDFD race condition

2020-12-01 Thread Tycho Andersen
On Mon, Nov 30, 2020 at 06:20:09PM -0500, Tycho Andersen wrote:
> Idea 1 sounds best to me, but maybe that's because it's the way I
> originally did the fd support that never landed :)
> 
> But here's an Idea 4: we add a way to remotely close an fd (I don't
> see that the current infra can do this, but perhaps I didn't look hard
> enough), and then when you get ENOENT you have to close the fd. Of
> course, this can't be via seccomp, so maybe it's even more racy.

Or better yet: what if the kernel closed everything it had added via
ADDFD if it didn't get a valid response from the supervisor? Then
everyone gets this bug fixed for free.

Tycho


Re: SECCOMP_IOCTL_NOTIF_ADDFD race condition

2020-11-30 Thread Tycho Andersen
Hi,

On Thu, Nov 26, 2020 at 02:09:33PM +0100, Alban Crequy wrote:
> Hi,
> 
> With the addfd feature (added in “seccomp: Introduce addfd ioctl to
> seccomp user notifier”, commit 7cf97b125455), the new file is
> installed in the target process during the SECCOMP_IOCTL_NOTIF_ADDFD
> operation and not at the end with the SECCOMP_IOCTL_NOTIF_SEND
> operation. This can cause race conditions when the target process is
> interrupted by a signal (EINTR) and restarted automatically.
> 
> This is more noticeable in multithreaded processes like with Golang.
> In Golang 1.14:
> https://golang.org/doc/go1.14
> > "A consequence of the implementation of preemption is that on Unix systems, 
> > including Linux and macOS systems, programs built with Go 1.14 will receive 
> > more signals than programs built with earlier releases. This means that 
> > programs that use packages like syscall or golang.org/x/sys/unix will see 
> > more slow system calls fail with EINTR errors. Those programs will have to 
> > handle those errors in some way, most likely looping to try the system call 
> > again."
> 
> In my test, I added a seccomp policy which returns
> SECCOMP_RET_USER_NOTIF on execve() and I added a sleep(2) in the
> seccomp agent (using https://github.com/kinvolk/seccompagent/) between
> SECCOMP_IOCTL_NOTIF_RECV and SECCOMP_IOCTL_NOTIF_SEND to make it a bit
> slow to reply with SECCOMP_USER_NOTIF_FLAG_CONTINUE. I got the
> following strace log going on in a loop:
> 
> [pid 2656199] execve("/bin/sh", ["sh", "-c", "sleep infinity"],
> 0xc63b00 /* 11 vars */ 
> [pid 2656200] <... nanosleep resumed>NULL) = 0
> [pid 2656200] epoll_pwait(7, [], 128, 0, NULL, 0) = 0
> [pid 2656200] getpid()  = 1
> [pid 2656200] tgkill(1, 1, SIGURG)  = 0
> [pid 2656199] <... execve resumed>) = ? ERESTARTSYS (To be
> restarted if SA_RESTART is set)
> [pid 2656200] nanosleep({tv_sec=0, tv_nsec=1000},  
> [pid 2656199] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1,
> si_uid=0} ---
> [pid 2656199] rt_sigreturn({mask=[]})   = 59
> [pid 2656199] execve("/bin/sh", ["sh", "-c", "sleep infinity"],
> 0xc63b00 /* 11 vars */ 
> 
> On the seccomp agent side, the ioctl(SECCOMP_IOCTL_NOTIF_SEND) returns
> ENOENT, and then it receives the same notification at the next
> iteration of the loop.
> 
> The SIGURG signal is sent by the Golang runtime, causing the execve to
> be interrupted, and restarted automatically, triggering the new
> seccomp notification. In this example with execve, this is not a big
> deal because the seccomp agent doesn't add a fd. But on a open() or
> accept() syscall, I fear that the seccomp agent could install a file
> descriptor without knowing that the syscall will be interrupted soon
> after, but before the SECCOMP_IOCTL_NOTIF_SEND is completed.
> 
> I understand the need to have two different ioctl() to add the fd and
> to reply to the seccomp notification because the seccomp agent needs
> to know the fd number being assigned before specifying the return
> value of the syscall with that number.
> 
> What do you think is the best way to solve this problem? Here are a few ideas:
> 
> - Idea 1: add a second flag for the struct seccomp_notif_resp
> “SECCOMP_USER_NOTIF_FLAG_RETURN_FD” to instruct seccomp to override
> the return value with the first fd to install. It would not help to
> emulate recvfrom() with SCM_RIGHTS but it will solve the problem for
> syscalls that return a fd because we can then implement a new ioctl
> (“SECCOMP_IOCTL_NOTIF_SEND_WITH_FDS”?) that does the addfd and the
> notification response in one step.
> 
> Other ideas but they cause more problems:
> 
> - Idea 2: We need some kind of transactions where the fd is sent with
> the first ioctl() and installed in the fd table but marked somehow to
> be closed automatically if the syscall is interrupted with EINTR
> outside of the control of the seccomp agent. The new fd in the fd
> table would be committed at the end if the syscall is not interrupted.
> But this introduces other issues: another thread could call dup() on
> the fd before it gets closed. Or another process sharing the fd table
> with CLONE_FILES could do the same. Should the not-yet-committed fds
> be visible in /proc//fd/? Or inherited to new processes created
> by fork()?
> 
> - Idea 3: We could add fds in a temporary location but not in the
> `struct files_struct` of the target process, and only commit at
> SECCOMP_IOCTL_NOTIF_SEND time. In this way, threads or processes
> sharing the fd table with CLONE_FILES would not be impacted. However,
> this could open new race conditions if other threads are installing
> fds in the same slots in the fd table. Also, this seems quite
> dangerous to add this concept of "inflight" fd for seccomp because
> there are already inflight fds for SCM_RIGHT and a garbage collector
> to clean circular references (net/unix/garbage.c). If we add an
> inflight fd mechanism on seccomp, a malicious user could just use
> SECCOMP_IOCTL_

Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-26 Thread Tycho Andersen
On Mon, Oct 26, 2020 at 10:55:04AM +0100, Michael Kerrisk (man-pages) wrote:
> Hi all (and especially Tycho and Sargun),
> 
> Following review comments on the first draft (thanks to Jann, Kees,
> Christian and Tycho), I've made a lot of changes to this page.
> I've also added a few FIXMEs relating to outstanding API issues.
> I'd like a second pass review of the page before I release it.
> But also, this mail serves as a way of noting the outstanding API
> issues.
> 
> Tycho: I still have an outstanding question for you at [2].
> [2] 
> https://lore.kernel.org/linux-man/8f20d586-9609-ef83-c85a-272e37e68...@gmail.com/

I don't have that thread in my inbox any more, but I can reply here:
no, I don't know any users of this info, but I also don't anticipate
knowing how people will all use this feature :)

Tycho


Re: For review: seccomp_user_notif(2) manual page [v2]

2020-10-26 Thread Tycho Andersen
On Mon, Oct 26, 2020 at 03:30:29PM +0100, Michael Kerrisk (man-pages) wrote:
> Hi Tycho,
> 
> Thanks for getting back to me.
> 
> On Mon, 26 Oct 2020 at 14:54, Tycho Andersen  wrote:
> >
> > On Mon, Oct 26, 2020 at 10:55:04AM +0100, Michael Kerrisk (man-pages) wrote:
> > > Hi all (and especially Tycho and Sargun),
> > >
> > > Following review comments on the first draft (thanks to Jann, Kees,
> > > Christian and Tycho), I've made a lot of changes to this page.
> > > I've also added a few FIXMEs relating to outstanding API issues.
> > > I'd like a second pass review of the page before I release it.
> > > But also, this mail serves as a way of noting the outstanding API
> > > issues.
> > >
> > > Tycho: I still have an outstanding question for you at [2].
> > > [2] 
> > > https://lore.kernel.org/linux-man/8f20d586-9609-ef83-c85a-272e37e68...@gmail.com/
> >
> > I don't have that thread in my inbox any more, but I can reply here:
> > no, I don't know any users of this info, but I also don't anticipate
> > knowing how people will all use this feature :)
> 
> Yes, but my questions were:
> 
> [[
> [1] So, I think maybe I now understand what you intended with setting
> POLLOUT: the notification has been received ("read") and now the
> FD can be used to NOTIFY_SEND ("write") a response. Right?
> 
> [2] If that's correct, I don't have a problem with it. I just wonder:
> is it useful? IOW: are there situations where the process doing the
> NOTIFY_SEND might want to test for POLLOUT because the it doesn't
> know whether a NOTIFY_RECV has occurred?
> ]]
> 
> So, do I understand right in [1]? (The implication from your reply is
> yes, but I want to be sure...)

Yes.

> For [2], my question was not about users, but *use cases*. The
> question I asked myself is: why does the feature exist? Hence my
> question [2] reworded: "when you designed this, did you have in mind
> scenarios here the process doing the NOTIFY_SEND might need to test
> for POLLOUT because it doesn't know whether a NOTIFY_RECV has
> occurred?"

I did not.

Tycho


Re: For review: seccomp_user_notif(2) manual page

2020-10-01 Thread Tycho Andersen
On Thu, Oct 01, 2020 at 02:06:10PM -0700, Sargun Dhillon wrote:
> On Wed, Sep 30, 2020 at 4:07 AM Michael Kerrisk (man-pages)
>  wrote:
> >
> > Hi Tycho, Sargun (and all),
> >
> > I knew it would be a big ask, but below is kind of the manual page
> > I was hoping you might write [1] for the seccomp user-space notification
> > mechanism. Since you didn't (and because 5.9 adds various new pieces
> > such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD
> > that also will need documenting [2]), I did :-). But of course I may
> > have made mistakes...
> >
> > I've shown the rendered version of the page below, and would love
> > to receive review comments from you and others, and acks, etc.
> >
> > There are a few FIXMEs sprinkled into the page, including one
> > that relates to what appears to me to be a misdesign (possibly
> > fixable) in the operation of the SECCOMP_IOCTL_NOTIF_RECV
> > operation. I would be especially interested in feedback on that
> > FIXME, and also of course the other FIXMEs.
> >
> > The page includes an extensive (albeit slightly contrived)
> > example program, and I would be happy also to receive comments
> > on that program.
> >
> > The page source currently sits in a branch (along with the text
> > that you sent me for the seccomp(2) page) at
> > https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=seccomp_user_notif
> >
> > Thanks,
> >
> > Michael
> >
> > [1] 
> > https://lore.kernel.org/linux-man/2cea5fec-e73e-5749-18af-15c35a4bd...@gmail.com/#t
> > [2] Sargun, can you prepare something on SECCOMP_ADDFD_FLAG_SETFD
> > and SECCOMP_IOCTL_NOTIF_ADDFD to be added to this page?
> >
> > 
> >
> > --
> > Michael Kerrisk
> > Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> > Linux/UNIX System Programming Training: http://man7.org/training/
> 
> Should we consider the SECCOMP_GET_NOTIF_SIZES dance to be "deprecated" at
> this point, given that the extensible ioctl mechanism works? If we add
> new fields to the
> seccomp datastructures, we would move them from fixed-size ioctls, to
> variable sized
> ioctls that encode the datastructure size / length?
> 
> -- This is mostly a question for Kees and Tycho.

It will tell you how big struct seccomp_data in the currently running
kernel is, so it still seems useful/necessary to me, unless there's
another way to figure that out.

But I agree, I don't think the intent is to add anything else to
struct seccomp_notif. (I don't know that it ever was.)

Tycho


Re: [PATCH v11 2/3] arch: Wire up trusted_for(2)

2020-10-01 Thread Tycho Andersen
On Thu, Oct 01, 2020 at 07:02:31PM +0200, Mickaël Salaün wrote:
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -859,9 +859,11 @@ __SYSCALL(__NR_openat2, sys_openat2)
>  __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
>  #define __NR_faccessat2 439
>  __SYSCALL(__NR_faccessat2, sys_faccessat2)
> +#define __NR_trusted_for 443
> +__SYSCALL(__NR_trusted_for, sys_trusted_for)
>  
>  #undef __NR_syscalls
> -#define __NR_syscalls 440
> +#define __NR_syscalls 444

Looks like a rebase problem here?

Tycho


Re: For review: seccomp_user_notif(2) manual page

2020-10-01 Thread Tycho Andersen
On Thu, Oct 01, 2020 at 08:18:49PM +0200, Jann Horn wrote:
> On Thu, Oct 1, 2020 at 6:58 PM Tycho Andersen  wrote:
> > On Thu, Oct 01, 2020 at 05:47:54PM +0200, Jann Horn via Containers wrote:
> > > On Thu, Oct 1, 2020 at 2:54 PM Christian Brauner
> > >  wrote:
> > > > On Wed, Sep 30, 2020 at 05:53:46PM +0200, Jann Horn via Containers 
> > > > wrote:
> > > > > On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
> > > > >  wrote:
> > > > > > NOTES
> > > > > >The file descriptor returned when seccomp(2) is employed 
> > > > > > with the
> > > > > >SECCOMP_FILTER_FLAG_NEW_LISTENER  flag  can  be  monitored  
> > > > > > using
> > > > > >poll(2), epoll(7), and select(2).  When a notification  is  
> > > > > > pend‐
> > > > > >ing,  these interfaces indicate that the file descriptor is 
> > > > > > read‐
> > > > > >able.
> > > > >
> > > > > We should probably also point out somewhere that, as
> > > > > include/uapi/linux/seccomp.h says:
> > > > >
> > > > >  * Similar precautions should be applied when stacking 
> > > > > SECCOMP_RET_USER_NOTIF
> > > > >  * or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on 
> > > > > the
> > > > >  * same syscall, the most recently added filter takes precedence. 
> > > > > This means
> > > > >  * that the new SECCOMP_RET_USER_NOTIF filter can override any
> > > > >  * SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially 
> > > > > allowing all
> > > > >  * such filtered syscalls to be executed by sending the response
> > > > >  * SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can 
> > > > > equally
> > > > >  * be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE.
> > > > >
> > > > > In other words, from a security perspective, you must assume that the
> > > > > target process can bypass any SECCOMP_RET_USER_NOTIF (or
> > > > > SECCOMP_RET_TRACE) filters unless it is completely prohibited from
> > > > > calling seccomp(). This should also be noted over in the main
> > > > > seccomp(2) manpage, especially the SECCOMP_RET_TRACE part.
> > > >
> > > > So I was actually wondering about this when I skimmed this and a while
> > > > ago but forgot about this again... Afaict, you can only ever load a
> > > > single filter with SECCOMP_FILTER_FLAG_NEW_LISTENER set. If there
> > > > already is a filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER property
> > > > in the tasks filter hierarchy then the kernel will refuse to load a new
> > > > one?
> > > >
> > > > static struct file *init_listener(struct seccomp_filter *filter)
> > > > {
> > > > struct file *ret = ERR_PTR(-EBUSY);
> > > > struct seccomp_filter *cur;
> > > >
> > > > for (cur = current->seccomp.filter; cur; cur = cur->prev) {
> > > > if (cur->notif)
> > > > goto out;
> > > > }
> > > >
> > > > shouldn't that be sufficient to guarantee that USER_NOTIF filters can't
> > > > override each other for the same task simply because there can only ever
> > > > be a single one?
> > >
> > > Good point. Excpt that that check seems ineffective because this
> > > happens before we take the locks that guard against TSYNC, and also
> > > before we decide to which existing filter we want to chain the new
> > > filter. So if two threads race with TSYNC, I think they'll be able to
> > > chain two filters with listeners together.
> >
> > Yep, seems the check needs to also be in seccomp_can_sync_threads() to
> > be totally effective,
> >
> > > I don't know whether we want to eternalize this "only one listener
> > > across all the filters" restriction in the manpage though, or whether
> > > the man page should just say that the kernel currently doesn't support
> > > it but that security-wise you should assume that it might at some
> > > point.
> >
> > This requirement originally came from Andy, arguing that the semantics
> > of this were/are confu

Re: For review: seccomp_user_notif(2) manual page

2020-10-01 Thread Tycho Andersen
On Thu, Oct 01, 2020 at 05:47:54PM +0200, Jann Horn via Containers wrote:
> On Thu, Oct 1, 2020 at 2:54 PM Christian Brauner
>  wrote:
> > On Wed, Sep 30, 2020 at 05:53:46PM +0200, Jann Horn via Containers wrote:
> > > On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
> > >  wrote:
> > > > NOTES
> > > >The file descriptor returned when seccomp(2) is employed with the
> > > >SECCOMP_FILTER_FLAG_NEW_LISTENER  flag  can  be  monitored  using
> > > >poll(2), epoll(7), and select(2).  When a notification  is  pend‐
> > > >ing,  these interfaces indicate that the file descriptor is read‐
> > > >able.
> > >
> > > We should probably also point out somewhere that, as
> > > include/uapi/linux/seccomp.h says:
> > >
> > >  * Similar precautions should be applied when stacking 
> > > SECCOMP_RET_USER_NOTIF
> > >  * or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on the
> > >  * same syscall, the most recently added filter takes precedence. This 
> > > means
> > >  * that the new SECCOMP_RET_USER_NOTIF filter can override any
> > >  * SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all
> > >  * such filtered syscalls to be executed by sending the response
> > >  * SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can 
> > > equally
> > >  * be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE.
> > >
> > > In other words, from a security perspective, you must assume that the
> > > target process can bypass any SECCOMP_RET_USER_NOTIF (or
> > > SECCOMP_RET_TRACE) filters unless it is completely prohibited from
> > > calling seccomp(). This should also be noted over in the main
> > > seccomp(2) manpage, especially the SECCOMP_RET_TRACE part.
> >
> > So I was actually wondering about this when I skimmed this and a while
> > ago but forgot about this again... Afaict, you can only ever load a
> > single filter with SECCOMP_FILTER_FLAG_NEW_LISTENER set. If there
> > already is a filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER property
> > in the tasks filter hierarchy then the kernel will refuse to load a new
> > one?
> >
> > static struct file *init_listener(struct seccomp_filter *filter)
> > {
> > struct file *ret = ERR_PTR(-EBUSY);
> > struct seccomp_filter *cur;
> >
> > for (cur = current->seccomp.filter; cur; cur = cur->prev) {
> > if (cur->notif)
> > goto out;
> > }
> >
> > shouldn't that be sufficient to guarantee that USER_NOTIF filters can't
> > override each other for the same task simply because there can only ever
> > be a single one?
> 
> Good point. Excpt that that check seems ineffective because this
> happens before we take the locks that guard against TSYNC, and also
> before we decide to which existing filter we want to chain the new
> filter. So if two threads race with TSYNC, I think they'll be able to
> chain two filters with listeners together.

Yep, seems the check needs to also be in seccomp_can_sync_threads() to
be totally effective,

> I don't know whether we want to eternalize this "only one listener
> across all the filters" restriction in the manpage though, or whether
> the man page should just say that the kernel currently doesn't support
> it but that security-wise you should assume that it might at some
> point.

This requirement originally came from Andy, arguing that the semantics
of this were/are confusing, which still makes sense to me. Perhaps we
should do something like the below?

Tycho


diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 3ee59ce0a323..7b107207c2b0 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -376,6 +376,18 @@ static int is_ancestor(struct seccomp_filter *parent,
return 0;
 }
 
+static bool has_listener_parent(struct seccomp_filter *child)
+{
+   struct seccomp_filter *cur;
+
+   for (cur = current->seccomp.filter; cur; cur = cur->prev) {
+   if (cur->notif)
+   return true;
+   }
+
+   return false;
+}
+
 /**
  * seccomp_can_sync_threads: checks if all threads can be synchronized
  *
@@ -385,7 +397,7 @@ static int is_ancestor(struct seccomp_filter *parent,
  * either not in the correct seccomp mode or did not have an ancestral
  * seccomp filter.
  */
-static inline pid_t seccomp_can_sync_threads(void)
+static inline pid_t seccomp_can_sync_threads(unsigned int flags)
 {
struct task_struct *thread, *caller;
 
@@ -407,6 +419,11 @@ static inline pid_t seccomp_can_sync_threads(void)
 caller->seccomp.filter)))
continue;
 
+   /* don't allow TSYNC to install multiple listeners */
+   if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER &&
+   !has_listener_parent(thread->seccomp.filter))
+   continue;
+
/* Return the first thread that cannot be synchronized. */
failed = task_pid_vnr(thread);
  

Re: For review: seccomp_user_notif(2) manual page

2020-09-30 Thread Tycho Andersen
On Thu, Oct 01, 2020 at 01:11:33AM +0200, Jann Horn wrote:
> On Thu, Oct 1, 2020 at 1:03 AM Tycho Andersen  wrote:
> > On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) wrote:
> > > On 9/30/20 5:03 PM, Tycho Andersen wrote:
> > > > On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) 
> > > > wrote:
> > > >>┌─┐
> > > >>│FIXME│
> > > >>├─┤
> > > >>│From my experiments,  it  appears  that  if  a  SEC‐ │
> > > >>│COMP_IOCTL_NOTIF_RECV   is  done  after  the  target │
> > > >>│process terminates, then the ioctl()  simply  blocks │
> > > >>│(rather than returning an error to indicate that the │
> > > >>│target process no longer exists).│
> > > >
> > > > Yeah, I think Christian wanted to fix this at some point,
> > >
> > > Do you have a pointer that discussion? I could not find it with a
> > > quick search.
> > >
> > > > but it's a
> > > > bit sticky to do.
> > >
> > > Can you say a few words about the nature of the problem?
> >
> > I remembered wrong, it's actually in the tree: 99cdb8b9a573 ("seccomp:
> > notify about unused filter"). So maybe there's a bug here?
> 
> That thing only notifies on ->poll, it doesn't unblock ioctls; and
> Michael's sample code uses SECCOMP_IOCTL_NOTIF_RECV to wait. So that
> commit doesn't have any effect on this kind of usage.

Yes, thanks. And the ones stuck in RECV are waiting on a semaphore so
we don't have a count of all of them, unfortunately.

We could maybe look inside the wait_list, but that will probably make
people angry :)

Tycho


Re: For review: seccomp_user_notif(2) manual page

2020-09-30 Thread Tycho Andersen
On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) wrote:
> Hi Tycho,
> 
> Thanks for taking time to look at the page!
> 
> On 9/30/20 5:03 PM, Tycho Andersen wrote:
> > On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
> >>2. In order that the supervisor process can obtain  notifications
> >>   using  the  listening  file  descriptor, (a duplicate of) that
> >>   file descriptor must be passed from the target process to  the
> >>   supervisor process.  One way in which this could be done is by
> >>   passing the file descriptor over a UNIX domain socket  connec‐
> >>   tion between the two processes (using the SCM_RIGHTS ancillary
> >>   message type described in unix(7)).   Another  possibility  is
> >>   that  the  supervisor  might  inherit  the file descriptor via
> >>   fork(2).
> > 
> > It is technically possible to inherit the fd via fork, but is it
> > really that useful? The child process wouldn't be able to actually do
> > the syscall in question, since it would have the same filter.
> 
> D'oh! Yes, of course.
> 
> I think I was reaching because in an earlier conversation
> you replied:
> 
> [[
> > 3. The "target process" passes the "listening file descriptor"
> >to the "monitoring process" via the UNIX domain socket.
> 
> or some other means, it doesn't have to be with SCM_RIGHTS.
> ]]
> 
> So, what other means?
> 
> Anyway, I removed the sentence mentioning fork().

Whatever means people want :). fork() could work (it's how some of the
tests for this feature work, but it's not particularly useful I don't
think), clone(CLONE_FILES) is similar, seccomp_putfd, or maybe even
cloning it via some pidfd interface that might be invented for
re-opening files.

> >>┌─┐
> >>│FIXME│
> >>├─┤
> >>│From my experiments,  it  appears  that  if  a  SEC‐ │
> >>│COMP_IOCTL_NOTIF_RECV   is  done  after  the  target │
> >>│process terminates, then the ioctl()  simply  blocks │
> >>│(rather than returning an error to indicate that the │
> >>│target process no longer exists).│
> > 
> > Yeah, I think Christian wanted to fix this at some point,
> 
> Do you have a pointer that discussion? I could not find it with a 
> quick search.
> 
> > but it's a
> > bit sticky to do.
> 
> Can you say a few words about the nature of the problem?

I remembered wrong, it's actually in the tree: 99cdb8b9a573 ("seccomp:
notify about unused filter"). So maybe there's a bug here?

> >>┌─┐
> >>│FIXME│
> >>├─┤
> >>│Interestingly, after the event  had  been  received, │
> >>│the  file descriptor indicates as writable (verified │
> >>│from the source code and by experiment). How is this │
> >>│useful?  │
> > 
> > You're saying it should just do EPOLLOUT and not EPOLLWRNORM? Seems
> > reasonable.
> 
> No, I'm saying something more fundamental: why is the FD indicating as
> writable? Can you write something to it? If yes, what? If not, then
> why do these APIs want to say that the FD is writable?

You can't via read(2) or write(2), but conceptually NOTIFY_RECV and
NOTIFY_SEND are reading and writing events from the fd. I don't know
that much about the poll interface though -- is it possible to
indicate "here's a pseudo-read event"? It didn't look like it, so I
just (ab-)used POLLIN and POLLOUT, but probably that's wrong.

Tycho


Re: For review: seccomp_user_notif(2) manual page

2020-09-30 Thread Tycho Andersen
On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
>2. In order that the supervisor process can obtain  notifications
>   using  the  listening  file  descriptor, (a duplicate of) that
>   file descriptor must be passed from the target process to  the
>   supervisor process.  One way in which this could be done is by
>   passing the file descriptor over a UNIX domain socket  connec‐
>   tion between the two processes (using the SCM_RIGHTS ancillary
>   message type described in unix(7)).   Another  possibility  is
>   that  the  supervisor  might  inherit  the file descriptor via
>   fork(2).

It is technically possible to inherit the fd via fork, but is it
really that useful? The child process wouldn't be able to actually do
the syscall in question, since it would have the same filter.

>   The  information  in  the notification can be used to discover
>   the values of pointer arguments for the target process's  sys‐
>   tem call.  (This is something that can't be done from within a
>   seccomp filter.)  To do this (and  assuming  it  has  suitable

s/To do this/One way to accomplish this/ perhaps, since there are
others.

>   permissions),   the   supervisor   opens   the   corresponding
>   /proc/[pid]/mem file, seeks to the memory location that corre‐
>   sponds to one of the pointer arguments whose value is supplied
>   in the notification event, and reads bytes from that location.
>   (The supervisor must be careful to avoid a race condition that
>   can occur when doing this; see the  description  of  the  SEC‐
>   COMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation below.)  In addi‐
>   tion, the supervisor can access other system information  that
>   is  visible  in  user space but which is not accessible from a
>   seccomp filter.
> 
>   ┌─┐
>   │FIXME│
>   ├─┤
>   │Suppose we are reading a pathname from /proc/PID/mem │
>   │for  a system call such as mkdir(). The pathname can │
>   │be an arbitrary length. How do we know how much (how │
>   │many pages) to read from /proc/PID/mem?  │
>   └─┘

PATH_MAX, I suppose.

>┌─┐
>│FIXME│
>├─┤
>│From my experiments,  it  appears  that  if  a  SEC‐ │
>│COMP_IOCTL_NOTIF_RECV   is  done  after  the  target │
>│process terminates, then the ioctl()  simply  blocks │
>│(rather than returning an error to indicate that the │
>│target process no longer exists).│

Yeah, I think Christian wanted to fix this at some point, but it's a
bit sticky to do. Note that if you e.g. rely on fork() above, the
filter is shared with your current process, and this notification
would never be possible. Perhaps another reason to omit that from the
man page.

>SECCOMP_IOCTL_NOTIF_ID_VALID
>   This operation can be used to check that a notification ID
>   returned by an earlier SECCOMP_IOCTL_NOTIF_RECV  operation
>   is  still  valid  (i.e.,  that  the  target  process still
>   exists).
> 
>   The third ioctl(2) argument is a  pointer  to  the  cookie
>   (id) returned by the SECCOMP_IOCTL_NOTIF_RECV operation.
> 
>   This  operation is necessary to avoid race conditions that
>   can  occur   when   the   pid   returned   by   the   SEC‐
>   COMP_IOCTL_NOTIF_RECV   operation   terminates,  and  that
>   process ID is reused by another process.   An  example  of
>   this kind of race is the following
> 
>   1. A  notification  is  generated  on  the  listening file
>  descriptor.  The returned  seccomp_notif  contains  the
>  PID of the target process.
> 
>   2. The target process terminates.
> 
>   3. Another process is created on the system that by chance
>  reuses the PID that was freed when the  target  process
>  terminates.
> 
>   4. The  supervisor  open(2)s  the /proc/[pid]/mem file for
>  the PID obtained in step 1, with the intention of (say)
>  inspecting the memory locations that contains the argu‐
>  ments of the system call that triggered  the  notifica‐
>  tion in step 1.
> 
>   In the above scenario, the risk is that the supervisor may
>

Re: For review: seccomp_user_notif(2) manual page

2020-09-30 Thread Tycho Andersen
On Wed, Sep 30, 2020 at 09:03:36AM -0600, Tycho Andersen wrote:
> On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
> >┌─┐
> >│FIXME│
> >├─┤
> >│Interestingly, after the event  had  been  received, │
> >│the  file descriptor indicates as writable (verified │
> >│from the source code and by experiment). How is this │
> >│useful?  │
> 
> You're saying it should just do EPOLLOUT and not EPOLLWRNORM? Seems
> reasonable.

If we make this change, I suppose we should also drop EPOLLRDNORM from
things which have not been received yet, since they're not really
readable.

Tycho


Re: [PATCH v6 5/6] mm: secretmem: use PMD-size pages to amortize direct map fragmentation

2020-09-25 Thread Tycho Andersen
On Fri, Sep 25, 2020 at 11:31:14AM +0100, Mark Rutland wrote:
> Hi,
> 
> Sorry to come to this so late; I've been meaning to provide feedback on
> this for a while but have been indisposed for a bit due to an injury.
> 
> On Fri, Sep 25, 2020 at 11:50:29AM +0200, Peter Zijlstra wrote:
> > On Fri, Sep 25, 2020 at 11:00:30AM +0200, David Hildenbrand wrote:
> > > On 25.09.20 09:41, Peter Zijlstra wrote:
> > > > On Thu, Sep 24, 2020 at 04:29:03PM +0300, Mike Rapoport wrote:
> > > >> From: Mike Rapoport 
> > > >>
> > > >> Removing a PAGE_SIZE page from the direct map every time such page is
> > > >> allocated for a secret memory mapping will cause severe fragmentation 
> > > >> of
> > > >> the direct map. This fragmentation can be reduced by using PMD-size 
> > > >> pages
> > > >> as a pool for small pages for secret memory mappings.
> > > >>
> > > >> Add a gen_pool per secretmem inode and lazily populate this pool with
> > > >> PMD-size pages.
> > > > 
> > > > What's the actual efficacy of this? Since the pmd is per inode, all I
> > > > need is a lot of inodes and we're in business to destroy the directmap,
> > > > no?
> > > > 
> > > > Afaict there's no privs needed to use this, all a process needs is to
> > > > stay below the mlock limit, so a 'fork-bomb' that maps a single secret
> > > > page will utterly destroy the direct map.
> > > > 
> > > > I really don't like this, at all.
> > > 
> > > As I expressed earlier, I would prefer allowing allocation of secretmem
> > > only from a previously defined CMA area. This would physically locally
> > > limit the pain.
> > 
> > Given that this thing doesn't have a migrate hook, that seems like an
> > eminently reasonable contraint. Because not only will it mess up the
> > directmap, it will also destroy the ability of the page-allocator /
> > compaction to re-form high order blocks by sprinkling holes throughout.
> > 
> > Also, this is all very close to XPFO, yet I don't see that mentioned
> > anywhere.
> 
> Agreed. I think if we really need something like this, something between
> XPFO and DEBUG_PAGEALLOC would be generally better, since:

Perhaps we can brainstorm on this? XPFO has mostly been abandoned
because there's no good/safe way to make it faster. There was work on
eliminating TLB flushes, but that waters down the protection. When I
was last thinking about it in anger, it just seemed like it was
destined to be slow, especially on $large_num_cores machines, since
you have to flush everyone else's map too.

I think the idea of "opt in to XPFO" is mostly attractive because then
people only have to pay the slowness cost for memory they really care
about. But if there's some way to make XPFO, or some alternative
design, that may be better.

Tycho


Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls

2020-09-21 Thread Tycho Andersen
On Mon, Sep 21, 2020 at 10:27:56AM -0500, YiFei Zhu wrote:
> On Mon, Sep 21, 2020 at 8:51 AM Tycho Andersen  wrote:
> > One problem with a kernel config setting is that it's for all tasks.
> > While docker and systemd may make decsisions based on syscall number,
> > other applications may have more nuanced filters, and this cache would
> > yield incorrect results.
> >
> > You could work around this by making this a filter flag instead;
> > filter authors would generally know whether their filter results can
> > be cached and probably be motivated to opt in if their users are
> > complaining about slow syscall execution.
> >
> > Tycho
> 
> Yielding incorrect results should not be possible. The purpose of the
> "emulator" (for the lack of a better term) is to determine whether the
> filter reads any syscall arguments. A read from a syscall argument
> must go through the BPF_LD | BPF_ABS instruction, where the 32 bit
> multiuse field "k" is an offset to struct seccomp_data.

I see, I missed this somehow. So is there a reason to hide this behind
a config option? Isn't it just always better?

Tycho


Re: [PATCH 1/2] seccomp: don't leak memory when filter install races

2020-09-02 Thread Tycho Andersen
On Wed, Sep 02, 2020 at 11:08:49AM +0200, Christian Brauner wrote:
> On Tue, Sep 01, 2020 at 07:40:16PM -0600, Tycho Andersen wrote:
> > In seccomp_set_mode_filter() with TSYNC | NEW_LISTENER, we first initialize
> > the listener fd, then check to see if we can actually use it later in
> > seccomp_may_assign_mode(), which can fail if anyone else in our thread
> > group has installed a filter and caused some divergence. If we can't, we
> > partially clean up the newly allocated file: we put the fd, put the file,
> > but don't actually clean up the *memory* that was allocated at
> > filter->notif. Let's clean that up too.
> > 
> > To accomplish this, let's hoist the actual "detach a notifier from a
> > filter" code to its own helper out of seccomp_notify_release(), so that in
> > case anyone adds stuff to init_listener(), they only have to add the
> > cleanup code in one spot. This does a bit of extra locking and such on the
> > failure path when the filter is not attached, but it's a slow failure path
> > anyway.
> > 
> > Fixes: 51891498f2da ("seccomp: allow TSYNC and USER_NOTIF together")
> > Reported-by: syzbot+3ad9614a12f80994c...@syzkaller.appspotmail.com
> > Signed-off-by: Tycho Andersen 
> > ---
> 
> This looks sane to me!
> Acked-by: Christian Brauner 
> 
> One thing I noticed when checking the failure paths. In init_listener we
> allocate the notifier by directly storing it into filter->notif and if
> anon_inode_getfile() fails we simply kfree(filter->notif) but don't NULL
> it and leave filter->notif pointing to freed memory.
> Since we have a few places where we check filter->notif whether it is
> initialized or not maybe we should NULL filter->notif if init_listener()
> fails or alloc the notifier into a tmp variable and only assign it to
> filter->notif after we can't fail anymore in init_listener().
> 
> Just a thought since the error path in seccomp_set_mode_filter() is a
> little more complex. :)

Yeah it seems like we definitely should, and maybe this is what Kees
was talking about and I missed it.

I'll send a patch on top of this shortly.

Tycho


[PATCH] seccomp: don't leave dangling ->notif if file allocation fails

2020-09-02 Thread Tycho Andersen
Christian and Kees both pointed out that this is a bit sloppy to open-code
both places, and Christian points out that we leave a dangling pointer to
->notif if file allocation fails. Since we check ->notif for null in order
to determine if it's ok to install a filter, this means people won't be
able to install a filter if the file allocation fails for some reason, even
if they subsequently should be able to.

To fix this, let's hoist this free+null into its own little helper and use
it.

Reported-by: Kees Cook 
Reported-by: Christian Brauner 
Signed-off-by: Tycho Andersen 
---
 kernel/seccomp.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index bb0dd9ae699a..676d4af62103 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1109,6 +1109,12 @@ static long seccomp_set_mode_strict(void)
 }
 
 #ifdef CONFIG_SECCOMP_FILTER
+static void seccomp_notify_free(struct seccomp_filter *filter)
+{
+   kfree(filter->notif);
+   filter->notif = NULL;
+}
+
 static void seccomp_notify_detach(struct seccomp_filter *filter)
 {
struct seccomp_knotif *knotif;
@@ -1138,8 +1144,7 @@ static void seccomp_notify_detach(struct seccomp_filter 
*filter)
complete(&knotif->ready);
}
 
-   kfree(filter->notif);
-   filter->notif = NULL;
+   seccomp_notify_free(filter);
mutex_unlock(&filter->notify_lock);
 }
 
@@ -1494,7 +1499,7 @@ static struct file *init_listener(struct seccomp_filter 
*filter)
 
 out_notif:
if (IS_ERR(ret))
-   kfree(filter->notif);
+   seccomp_notify_free(filter);
 out:
return ret;
 }

base-commit: 7b6aa0bb62fd6fd50f2d14136136262d28fb2dfe
-- 
2.25.1



Re: [PATCH 2/2] mailmap, MAINTAINERS: move to tycho.pizza

2020-09-01 Thread Tycho Andersen
Hi Kees,

On Tue, Sep 01, 2020 at 07:40:17PM -0600, Tycho Andersen wrote:
> I've changed my e-mail address to tycho.pizza, so let's reflect that in
> these files.

Hopefully you can pick this one up too? :D

Thanks,

Tycho


[PATCH 2/2] mailmap, MAINTAINERS: move to tycho.pizza

2020-09-01 Thread Tycho Andersen
I've changed my e-mail address to tycho.pizza, so let's reflect that in
these files.

Signed-off-by: Tycho Andersen 
---
 .mailmap| 1 +
 MAINTAINERS | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/.mailmap b/.mailmap
index 332c7833057f..50096b96c85d 100644
--- a/.mailmap
+++ b/.mailmap
@@ -308,6 +308,7 @@ Tony Luck 
 TripleX Chung  
 TripleX Chung  
 Tsuneo Yoshioka 
+Tycho Andersen  
 Uwe Kleine-König 
 Uwe Kleine-König 
 Uwe Kleine-König 
diff --git a/MAINTAINERS b/MAINTAINERS
index e4647c84c987..2c60f3ec2496 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9776,7 +9776,7 @@ F:drivers/scsi/53c700*
 
 LEAKING_ADDRESSES
 M: Tobin C. Harding 
-M: Tycho Andersen 
+M: Tycho Andersen 
 L: kernel-harden...@lists.openwall.com
 S: Maintained
 T: git git://git.kernel.org/pub/scm/linux/kernel/git/tobin/leaks.git
-- 
2.25.1



[PATCH 1/2] seccomp: don't leak memory when filter install races

2020-09-01 Thread Tycho Andersen
In seccomp_set_mode_filter() with TSYNC | NEW_LISTENER, we first initialize
the listener fd, then check to see if we can actually use it later in
seccomp_may_assign_mode(), which can fail if anyone else in our thread
group has installed a filter and caused some divergence. If we can't, we
partially clean up the newly allocated file: we put the fd, put the file,
but don't actually clean up the *memory* that was allocated at
filter->notif. Let's clean that up too.

To accomplish this, let's hoist the actual "detach a notifier from a
filter" code to its own helper out of seccomp_notify_release(), so that in
case anyone adds stuff to init_listener(), they only have to add the
cleanup code in one spot. This does a bit of extra locking and such on the
failure path when the filter is not attached, but it's a slow failure path
anyway.

Fixes: 51891498f2da ("seccomp: allow TSYNC and USER_NOTIF together")
Reported-by: syzbot+3ad9614a12f80994c...@syzkaller.appspotmail.com
Signed-off-by: Tycho Andersen 
---
 kernel/seccomp.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 3ee59ce0a323..bb0dd9ae699a 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1109,13 +1109,12 @@ static long seccomp_set_mode_strict(void)
 }
 
 #ifdef CONFIG_SECCOMP_FILTER
-static int seccomp_notify_release(struct inode *inode, struct file *file)
+static void seccomp_notify_detach(struct seccomp_filter *filter)
 {
-   struct seccomp_filter *filter = file->private_data;
struct seccomp_knotif *knotif;
 
if (!filter)
-   return 0;
+   return;
 
mutex_lock(&filter->notify_lock);
 
@@ -1142,6 +1141,13 @@ static int seccomp_notify_release(struct inode *inode, 
struct file *file)
kfree(filter->notif);
filter->notif = NULL;
mutex_unlock(&filter->notify_lock);
+}
+
+static int seccomp_notify_release(struct inode *inode, struct file *file)
+{
+   struct seccomp_filter *filter = file->private_data;
+
+   seccomp_notify_detach(filter);
__put_seccomp_filter(filter);
return 0;
 }
@@ -1581,6 +1587,7 @@ static long seccomp_set_mode_filter(unsigned int flags,
listener_f->private_data = NULL;
fput(listener_f);
put_unused_fd(listener);
+   seccomp_notify_detach(prepared);
} else {
fd_install(listener, listener_f);
ret = listener;

base-commit: f75aef392f869018f78cfedf3c320a6b3fcfda6b
-- 
2.25.1



Re: memory leak in do_seccomp

2020-09-01 Thread Tycho Andersen
On Tue, Sep 01, 2020 at 08:08:13AM -0700, Kees Cook wrote:
> On Mon, Aug 31, 2020 at 07:14:59PM -0600, Tycho Andersen wrote:
> > On Mon, Aug 31, 2020 at 06:09:15PM -0600, Tycho Andersen wrote:
> > > On Mon, Aug 31, 2020 at 04:25:35PM -0700, Kees Cook wrote:
> > > > On Sun, Aug 30, 2020 at 08:50:15PM -0700, syzbot wrote:
> > > > > syzbot has found a reproducer for the following issue on:
> > > > > 
> > > > > HEAD commit:dcc5c6f0 Merge tag 'x86-urgent-2020-08-30' of 
> > > > > git://git.ke..
> > > > > git tree:   upstream
> > > > > console output: 
> > > > > https://syzkaller.appspot.com/x/log.txt?x=10b297d590
> > > > > kernel config:  
> > > > > https://syzkaller.appspot.com/x/.config?x=903b9fecc3c6d231
> > > > > dashboard link: 
> > > > > https://syzkaller.appspot.com/bug?extid=3ad9614a12f80994c32e
> > > > > compiler:   gcc (GCC) 10.1.0-syz 20200507
> > > > > syz repro:  
> > > > > https://syzkaller.appspot.com/x/repro.syz?x=1464956190
> > > > > C reproducer:   
> > > > > https://syzkaller.appspot.com/x/repro.c?x=118aacc190
> > > > > 
> > > > > IMPORTANT: if you fix the issue, please add the following tag to the 
> > > > > commit:
> > > > > Reported-by: syzbot+3ad9614a12f80994c...@syzkaller.appspotmail.com
> > > > > 
> > > > > executing program
> > > > > executing program
> > > > > executing program
> > > > > executing program
> > > > > executing program
> > > > > BUG: memory leak
> > > > > unreferenced object 0x88811ba93600 (size 64):
> > > > >   comm "syz-executor680", pid 6503, jiffies 4294951104 (age 21.940s)
> > > > >   hex dump (first 32 bytes):
> > > > > 00 00 00 00 00 00 00 00 08 36 a9 1b 81 88 ff ff  .6..
> > > > > 08 36 a9 1b 81 88 ff ff 11 ce 98 89 3a d5 b4 8f  .6..:...
> > > > >   backtrace:
> > > > > [<896418b0>] kmalloc include/linux/slab.h:554 [inline]
> > > > > [<896418b0>] kzalloc include/linux/slab.h:666 [inline]
> > > > > [<896418b0>] init_listener kernel/seccomp.c:1473 [inline]
> > > > > [<896418b0>] seccomp_set_mode_filter 
> > > > > kernel/seccomp.c:1546 [inline]
> > > > > [<896418b0>] do_seccomp+0x8ce/0xd40 kernel/seccomp.c:1649
> > > > > [<2b04976c>] do_syscall_64+0x2d/0x70 
> > > > > arch/x86/entry/common.c:46
> > > > > [<322b4126>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > 
> > > > I haven't narrowed this down yet (and it *might* be a false positive),
> > > > but it looks like this is filter->notif. The only way that's possible is
> > > > if seccomp_notify_release() was never called *and* seccomp_filter_free()
> > > > got called... which would imply a reference counting problem. The way
> > > > there doesn't jump out at me yet, but I haven't yet decoded the C
> > > > reproducer into the actual seccomp arguments, etc.
> > > 
> > > Looks like it's just a bunch of threads in the same thread group
> > > trying to install a filter with TSYNC and NEW_LISTENER turned on. Does
> > > the patch below look reasonable?
> > > 
> > > I didn't send it separately since I'm in the process of switching my
> > > e-mail address to tycho@tycho.pizza; let this e-mail serve as proof
> > > that that e-mail really is me too :). I can send it the normal way if
> > > it looks good.
> > > 
> > > 
> > > From d497e787e8e1b3e8b9230fdc4c9802616709c920 Mon Sep 17 00:00:00 2001
> > > From: Tycho Andersen 
> > > Date: Mon, 31 Aug 2020 17:55:07 -0600
> > > Subject: [PATCH] seccomp: don't leak memory when filter install races
> > > 
> > > In seccomp_set_mode_filter() with TSYNC | NEW_LISTENER, we first 
> > > initialize
> > > the listener fd, then check to see if we can actually use it later in
> > > seccomp_may_assign_mode(), which can fail if anyone else in our thread
> > > group has installed a filter and caused some divergence. If we can't, we
> > > partially clean up the newly allocated file: we put the fd, put the file,
> &g

Re: memory leak in do_seccomp

2020-08-31 Thread Tycho Andersen
On Mon, Aug 31, 2020 at 06:09:15PM -0600, Tycho Andersen wrote:
> On Mon, Aug 31, 2020 at 04:25:35PM -0700, Kees Cook wrote:
> > On Sun, Aug 30, 2020 at 08:50:15PM -0700, syzbot wrote:
> > > syzbot has found a reproducer for the following issue on:
> > > 
> > > HEAD commit:dcc5c6f0 Merge tag 'x86-urgent-2020-08-30' of 
> > > git://git.ke..
> > > git tree:   upstream
> > > console output: https://syzkaller.appspot.com/x/log.txt?x=10b297d590
> > > kernel config:  https://syzkaller.appspot.com/x/.config?x=903b9fecc3c6d231
> > > dashboard link: 
> > > https://syzkaller.appspot.com/bug?extid=3ad9614a12f80994c32e
> > > compiler:   gcc (GCC) 10.1.0-syz 20200507
> > > syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=1464956190
> > > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=118aacc190
> > > 
> > > IMPORTANT: if you fix the issue, please add the following tag to the 
> > > commit:
> > > Reported-by: syzbot+3ad9614a12f80994c...@syzkaller.appspotmail.com
> > > 
> > > executing program
> > > executing program
> > > executing program
> > > executing program
> > > executing program
> > > BUG: memory leak
> > > unreferenced object 0x88811ba93600 (size 64):
> > >   comm "syz-executor680", pid 6503, jiffies 4294951104 (age 21.940s)
> > >   hex dump (first 32 bytes):
> > > 00 00 00 00 00 00 00 00 08 36 a9 1b 81 88 ff ff  .6..
> > > 08 36 a9 1b 81 88 ff ff 11 ce 98 89 3a d5 b4 8f  .6..:...
> > >   backtrace:
> > > [<896418b0>] kmalloc include/linux/slab.h:554 [inline]
> > > [<896418b0>] kzalloc include/linux/slab.h:666 [inline]
> > > [<896418b0>] init_listener kernel/seccomp.c:1473 [inline]
> > > [<896418b0>] seccomp_set_mode_filter kernel/seccomp.c:1546 
> > > [inline]
> > > [<896418b0>] do_seccomp+0x8ce/0xd40 kernel/seccomp.c:1649
> > > [<2b04976c>] do_syscall_64+0x2d/0x70 
> > > arch/x86/entry/common.c:46
> > > [<322b4126>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > 
> > I haven't narrowed this down yet (and it *might* be a false positive),
> > but it looks like this is filter->notif. The only way that's possible is
> > if seccomp_notify_release() was never called *and* seccomp_filter_free()
> > got called... which would imply a reference counting problem. The way
> > there doesn't jump out at me yet, but I haven't yet decoded the C
> > reproducer into the actual seccomp arguments, etc.
> 
> Looks like it's just a bunch of threads in the same thread group
> trying to install a filter with TSYNC and NEW_LISTENER turned on. Does
> the patch below look reasonable?
> 
> I didn't send it separately since I'm in the process of switching my
> e-mail address to tycho@tycho.pizza; let this e-mail serve as proof
> that that e-mail really is me too :). I can send it the normal way if
> it looks good.
> 
> 
> From d497e787e8e1b3e8b9230fdc4c9802616709c920 Mon Sep 17 00:00:00 2001
> From: Tycho Andersen 
> Date: Mon, 31 Aug 2020 17:55:07 -0600
> Subject: [PATCH] seccomp: don't leak memory when filter install races
> 
> In seccomp_set_mode_filter() with TSYNC | NEW_LISTENER, we first initialize
> the listener fd, then check to see if we can actually use it later in
> seccomp_may_assign_mode(), which can fail if anyone else in our thread
> group has installed a filter and caused some divergence. If we can't, we
> partially clean up the newly allocated file: we put the fd, put the file,
> but don't actually clean up the *memory* that was allocated at
> filter->notif. Let's clean that up too.
> 
> Fixes: 51891498f2da ("seccomp: allow TSYNC and USER_NOTIF together")
> Reported-by: syzbot+3ad9614a12f80994c...@syzkaller.appspotmail.com
> Signed-off-by: Tycho Andersen 
> ---
>  kernel/seccomp.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 3ee59ce0a323..21a76127833f 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -1581,6 +1581,8 @@ static long seccomp_set_mode_filter(unsigned int flags,
>   listener_f->private_data = NULL;
>   fput(listener_f);
>   put_unused_fd(listener);
> + kfree(filter->notif);
> + filter->notif = NULL;

Oof, actually this isn't quite right. It should be s/filter/prepared/g.
I can fix that and send out a real patch that's actually tested at
some point tomorrow.

Tycho


Re: memory leak in do_seccomp

2020-08-31 Thread Tycho Andersen
On Mon, Aug 31, 2020 at 04:25:35PM -0700, Kees Cook wrote:
> On Sun, Aug 30, 2020 at 08:50:15PM -0700, syzbot wrote:
> > syzbot has found a reproducer for the following issue on:
> > 
> > HEAD commit:dcc5c6f0 Merge tag 'x86-urgent-2020-08-30' of git://git.ke..
> > git tree:   upstream
> > console output: https://syzkaller.appspot.com/x/log.txt?x=10b297d590
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=903b9fecc3c6d231
> > dashboard link: https://syzkaller.appspot.com/bug?extid=3ad9614a12f80994c32e
> > compiler:   gcc (GCC) 10.1.0-syz 20200507
> > syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=1464956190
> > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=118aacc190
> > 
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+3ad9614a12f80994c...@syzkaller.appspotmail.com
> > 
> > executing program
> > executing program
> > executing program
> > executing program
> > executing program
> > BUG: memory leak
> > unreferenced object 0x88811ba93600 (size 64):
> >   comm "syz-executor680", pid 6503, jiffies 4294951104 (age 21.940s)
> >   hex dump (first 32 bytes):
> > 00 00 00 00 00 00 00 00 08 36 a9 1b 81 88 ff ff  .6..
> > 08 36 a9 1b 81 88 ff ff 11 ce 98 89 3a d5 b4 8f  .6..:...
> >   backtrace:
> > [<896418b0>] kmalloc include/linux/slab.h:554 [inline]
> > [<896418b0>] kzalloc include/linux/slab.h:666 [inline]
> > [<896418b0>] init_listener kernel/seccomp.c:1473 [inline]
> > [<896418b0>] seccomp_set_mode_filter kernel/seccomp.c:1546 
> > [inline]
> > [<896418b0>] do_seccomp+0x8ce/0xd40 kernel/seccomp.c:1649
> > [<2b04976c>] do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
> > [<322b4126>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> I haven't narrowed this down yet (and it *might* be a false positive),
> but it looks like this is filter->notif. The only way that's possible is
> if seccomp_notify_release() was never called *and* seccomp_filter_free()
> got called... which would imply a reference counting problem. The way
> there doesn't jump out at me yet, but I haven't yet decoded the C
> reproducer into the actual seccomp arguments, etc.

Looks like it's just a bunch of threads in the same thread group
trying to install a filter with TSYNC and NEW_LISTENER turned on. Does
the patch below look reasonable?

I didn't send it separately since I'm in the process of switching my
e-mail address to tycho@tycho.pizza; let this e-mail serve as proof
that that e-mail really is me too :). I can send it the normal way if
it looks good.


>From d497e787e8e1b3e8b9230fdc4c9802616709c920 Mon Sep 17 00:00:00 2001
From: Tycho Andersen 
Date: Mon, 31 Aug 2020 17:55:07 -0600
Subject: [PATCH] seccomp: don't leak memory when filter install races

In seccomp_set_mode_filter() with TSYNC | NEW_LISTENER, we first initialize
the listener fd, then check to see if we can actually use it later in
seccomp_may_assign_mode(), which can fail if anyone else in our thread
group has installed a filter and caused some divergence. If we can't, we
partially clean up the newly allocated file: we put the fd, put the file,
but don't actually clean up the *memory* that was allocated at
filter->notif. Let's clean that up too.

Fixes: 51891498f2da ("seccomp: allow TSYNC and USER_NOTIF together")
Reported-by: syzbot+3ad9614a12f80994c...@syzkaller.appspotmail.com
Signed-off-by: Tycho Andersen 
---
 kernel/seccomp.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 3ee59ce0a323..21a76127833f 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1581,6 +1581,8 @@ static long seccomp_set_mode_filter(unsigned int flags,
listener_f->private_data = NULL;
fput(listener_f);
put_unused_fd(listener);
+   kfree(filter->notif);
+   filter->notif = NULL;
} else {
fd_install(listener, listener_f);
ret = listener;

base-commit: b51594df17d0ce80b9f9f35394a1f42d7ac94472
-- 
2.25.1



Re: [PATCH for-next/seccomp v2 1/2] selftests/seccomp: Add SKIPs for failed unshare()

2020-07-11 Thread Tycho Andersen
On Fri, Jul 10, 2020 at 04:01:06PM -0700, Kees Cook wrote:
> Running the seccomp tests as a regular user shouldn't just fail tests
> that require CAP_SYS_ADMIN (for getting a PID namespace). Instead,
> detect those cases and SKIP them. Additionally, gracefully SKIP missing
> CONFIG_USER_NS (and add to "config" since we'd prefer to actually test
> this case).
> 
> Signed-off-by: Kees Cook 
> ---
>  tools/testing/selftests/seccomp/config|  1 +
>  tools/testing/selftests/seccomp/seccomp_bpf.c | 10 --
>  2 files changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/testing/selftests/seccomp/config 
> b/tools/testing/selftests/seccomp/config
> index db1e11b08c8a..64c19d8eba79 100644
> --- a/tools/testing/selftests/seccomp/config
> +++ b/tools/testing/selftests/seccomp/config
> @@ -1,2 +1,3 @@
>  CONFIG_SECCOMP=y
>  CONFIG_SECCOMP_FILTER=y
> +CONFIG_USER_NS=y
> diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c 
> b/tools/testing/selftests/seccomp/seccomp_bpf.c
> index c0aa46ce14f6..14b038361549 100644
> --- a/tools/testing/selftests/seccomp/seccomp_bpf.c
> +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
> @@ -3439,7 +3439,10 @@ TEST(user_notification_child_pid_ns)
>   struct seccomp_notif req = {};
>   struct seccomp_notif_resp resp = {};
>  
> - ASSERT_EQ(unshare(CLONE_NEWUSER | CLONE_NEWPID), 0);
> + ASSERT_EQ(unshare(CLONE_NEWUSER | CLONE_NEWPID), 0) {
> + if (errno == EINVAL)
> + SKIP(return, "kernel missing CLONE_NEWUSER support");
> + };
>  
>   listener = user_trap_syscall(__NR_getppid,
>SECCOMP_FILTER_FLAG_NEW_LISTENER);
> @@ -3504,7 +3507,10 @@ TEST(user_notification_sibling_pid_ns)
>   }
>  
>   /* Create the sibling ns, and sibling in it. */
> - ASSERT_EQ(unshare(CLONE_NEWPID), 0);
> + ASSERT_EQ(unshare(CLONE_NEWPID), 0) {
> + if (errno == EPERM)
> + SKIP(return, "CLONE_NEWPID requires CAP_SYS_ADMIN");
> + }
>   ASSERT_EQ(errno, 0);

For this one, I think we can just put an unshare(CLONE_NEWUSER) at
the top so the test still runs. This seems works for me unprivileged:

diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c 
b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 252140a52553..65e3642539f9 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -3482,6 +3482,11 @@ TEST(user_notification_sibling_pid_ns)
TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
}
 
+   ASSERT_EQ(unshare(CLONE_NEWUSER), 0) {
+   if (errno == EINVAL)
+   SKIP(return, "kernel missing CLONE_NEWUSER support");
+   };
+
listener = user_trap_syscall(__NR_getppid,
 SECCOMP_FILTER_FLAG_NEW_LISTENER);
ASSERT_GE(listener, 0);


Re: [PATCH for-next/seccomp 2/2] selftests/seccomp: Set NNP for TSYNC ESRCH flag test

2020-07-10 Thread Tycho Andersen
On Fri, Jul 10, 2020 at 11:51:56AM -0700, Kees Cook wrote:
> The TSYNC ESRCH flag test will fail for regular users because NNP was
> not set yet. Add NNP setting.
> 
> Fixes: 51891498f2da ("seccomp: allow TSYNC and USER_NOTIF together")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Kees Cook 

Reviewed-by: Tycho Andersen 


Re: [PATCH for-next/seccomp 1/2] selftests/seccomp: Add SKIPs for failed unshare()

2020-07-10 Thread Tycho Andersen
On Fri, Jul 10, 2020 at 11:51:55AM -0700, Kees Cook wrote:
> Running the seccomp tests as a regular user shouldn't just fail tests
> that require CAP_SYS_ADMIN (for getting a PID namespace). Instead,
> detect those cases and SKIP them.

But if we unshare NEWUSER at the same time as NEWPID, shouldn't we
always be ns_capable(CAP_SYS_ADMIN)?

Tycho


Re: [PATCH v4 10/11] seccomp: Switch addfd to Extensible Argument ioctl

2020-06-16 Thread Tycho Andersen
On Tue, Jun 16, 2020 at 09:05:29AM -0700, Kees Cook wrote:
> On Tue, Jun 16, 2020 at 08:55:46AM -0600, Tycho Andersen wrote:
> > On Mon, Jun 15, 2020 at 08:25:23PM -0700, Kees Cook wrote:
> > > This patch is based on discussions[1] with Sargun Dhillon, Christian
> > > Brauner, and David Laight. Instead of building size into the addfd
> > > structure, make it a function of the ioctl command (which is how sizes are
> > > normally passed to ioctls). To support forward and backward compatibility,
> > > just mask out the direction and size, and match everything. The size (and
> > > any future direction) checks are done along with copy_struct_from_user()
> > > logic. Also update the selftests to check size bounds.
> > > 
> > > [1] 
> > > https://lore.kernel.org/lkml/20200612104629.GA15814@ircssh-2.c.rugged-nimbus-611.internal
> > > 
> > > Signed-off-by: Kees Cook 
> > > ---
> > >  include/uapi/linux/seccomp.h  |  2 -
> > >  kernel/seccomp.c  | 21 ++
> > >  tools/testing/selftests/seccomp/seccomp_bpf.c | 40 ---
> > >  3 files changed, 49 insertions(+), 14 deletions(-)
> > > 
> > > diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> > > index c347160378e5..473a61695ac3 100644
> > > --- a/include/uapi/linux/seccomp.h
> > > +++ b/include/uapi/linux/seccomp.h
> > > @@ -118,7 +118,6 @@ struct seccomp_notif_resp {
> > >  
> > >  /**
> > >   * struct seccomp_notif_addfd
> > > - * @size: The size of the seccomp_notif_addfd structure
> > >   * @id: The ID of the seccomp notification
> > >   * @flags: SECCOMP_ADDFD_FLAG_*
> > >   * @srcfd: The local fd number
> > > @@ -126,7 +125,6 @@ struct seccomp_notif_resp {
> > >   * @newfd_flags: The O_* flags the remote FD should have applied
> > >   */
> > >  struct seccomp_notif_addfd {
> > > - __u64 size;
> > 
> > Huh? Won't this break builds?
> 
> Only if they use addfd without this patch? :) Are you saying I should
> collapse this patch into the main addfd and test patches?

Oh, derp, I see :) Yeah, maybe that would be good.

Tycho


Re: [PATCH v4 08/11] selftests/seccomp: Make kcmp() less required

2020-06-16 Thread Tycho Andersen
On Mon, Jun 15, 2020 at 08:25:21PM -0700, Kees Cook wrote:
> The seccomp tests are a bit noisy without CONFIG_CHECKPOINT_RESTORE (due
> to missing the kcmp() syscall). The seccomp tests are more accurate with
> kcmp(), but it's not strictly required. Refactor the tests to use
> alternatives (comparing fd numbers), and provide a central test for
> kcmp() so there is a single XFAIL instead of many. Continue to produce
> warnings for the other tests, though.
> 
> Additionally adds some more bad flag EINVAL tests to the addfd selftest.
> 
> Signed-off-by: Kees Cook 

This looks fine, but I wonder if this is enough motivation for taking
kcmp() out of CONFIG_CHECKPOINT_RESTORE guards?

Tycho


Re: [PATCH v4 09/11] selftests/seccomp: Rename user_trap_syscall() to user_notif_syscall()

2020-06-16 Thread Tycho Andersen
On Mon, Jun 15, 2020 at 08:25:22PM -0700, Kees Cook wrote:
> The user_trap_syscall() helper creates a filter with
> SECCOMP_RET_USER_NOTIF. To avoid confusion with SECCOMP_RET_TRAP, rename
> the helper to user_notif_syscall().
> 
> Additionally fix a redundant "return" after XFAIL.
> 
> Signed-off-by: Kees Cook 

Reviewed-by: Tycho Andersen 


Re: [PATCH v4 10/11] seccomp: Switch addfd to Extensible Argument ioctl

2020-06-16 Thread Tycho Andersen
On Mon, Jun 15, 2020 at 08:25:23PM -0700, Kees Cook wrote:
> This patch is based on discussions[1] with Sargun Dhillon, Christian
> Brauner, and David Laight. Instead of building size into the addfd
> structure, make it a function of the ioctl command (which is how sizes are
> normally passed to ioctls). To support forward and backward compatibility,
> just mask out the direction and size, and match everything. The size (and
> any future direction) checks are done along with copy_struct_from_user()
> logic. Also update the selftests to check size bounds.
> 
> [1] 
> https://lore.kernel.org/lkml/20200612104629.GA15814@ircssh-2.c.rugged-nimbus-611.internal
> 
> Signed-off-by: Kees Cook 
> ---
>  include/uapi/linux/seccomp.h  |  2 -
>  kernel/seccomp.c  | 21 ++
>  tools/testing/selftests/seccomp/seccomp_bpf.c | 40 ---
>  3 files changed, 49 insertions(+), 14 deletions(-)
> 
> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> index c347160378e5..473a61695ac3 100644
> --- a/include/uapi/linux/seccomp.h
> +++ b/include/uapi/linux/seccomp.h
> @@ -118,7 +118,6 @@ struct seccomp_notif_resp {
>  
>  /**
>   * struct seccomp_notif_addfd
> - * @size: The size of the seccomp_notif_addfd structure
>   * @id: The ID of the seccomp notification
>   * @flags: SECCOMP_ADDFD_FLAG_*
>   * @srcfd: The local fd number
> @@ -126,7 +125,6 @@ struct seccomp_notif_resp {
>   * @newfd_flags: The O_* flags the remote FD should have applied
>   */
>  struct seccomp_notif_addfd {
> - __u64 size;

Huh? Won't this break builds?

Tycho


Re: [RFC PATCH] seccomp: Add extensibility mechanism to read notifications

2020-06-15 Thread Tycho Andersen
On Sat, Jun 13, 2020 at 12:26:09AM -0700, Sargun Dhillon wrote:
> This introduces an extensibility mechanism to receive seccomp
> notifications. It uses read(2), as opposed to using an ioctl. The listener
> must be first configured to write the notification via the
> SECCOMP_IOCTL_NOTIF_CONFIG ioctl with the fields that the user is
> interested in.

I think we're not supposed to use read for control any more, because
"read is for data".

Tycho


Re: [PATCH v2 3/3] selftests/seccomp: Test SECCOMP_IOCTL_NOTIF_ADDFD

2020-05-29 Thread Tycho Andersen
On Fri, May 29, 2020 at 06:46:07PM +, Sargun Dhillon wrote:
> On Fri, May 29, 2020 at 12:41:51AM -0700, Kees Cook wrote:
> > On Thu, May 28, 2020 at 04:08:58AM -0700, Sargun Dhillon wrote:
> > > + EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
> > > +
> > > + nextid = req.id + 1;
> > > +
> > > + /* Wait for getppid to be called for the second time */
> > > + sleep(1);
> > 
> > I always rebel at finding "sleep" in tests. ;) Is this needed? IIUC,
> > userspace will immediately see EINPROGRESS after the NOTIF_SEND
> > finishes, yes?
> > 
> > Otherwise, yes, this looks good.
> > 
> > -- 
> > Kees Cook
> I'm open to better suggestions, but there's a race where if getppid
> is not called before the second SECCOMP_IOCTL_NOTIF_ADDFD is called,
> you will just get an ENOENT, since the notification ID is not found.

Ah, I see. The goal is to test the -EINPROGRESS here.

If you use write() instead of getppid(), and write to a socket, will
that work? The parent can block for the read, and once some thing has
been read it can test for -EINPROGRESS.

The user_notification_signal test does something similar.

Tycho


Re: [PATCH v2 0/3] Add seccomp notifier ioctl that enables adding fds

2020-05-29 Thread Tycho Andersen
On Thu, May 28, 2020 at 04:08:55AM -0700, Sargun Dhillon wrote:
> This adds the capability for seccomp notifier listeners to add file
> descriptors

Modulo the changes suggested by others, you can consider this series:

Reviewed-by: Tycho Andersen 


Re: [PATCH v2 3/3] selftests/seccomp: Test SECCOMP_IOCTL_NOTIF_ADDFD

2020-05-29 Thread Tycho Andersen
On Fri, May 29, 2020 at 12:41:51AM -0700, Kees Cook wrote:
> On Thu, May 28, 2020 at 04:08:58AM -0700, Sargun Dhillon wrote:
> > +   EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
> > +
> > +   nextid = req.id + 1;
> > +
> > +   /* Wait for getppid to be called for the second time */
> > +   sleep(1);
> 
> I always rebel at finding "sleep" in tests. ;) Is this needed? IIUC,
> userspace will immediately see EINPROGRESS after the NOTIF_SEND
> finishes, yes?

Yes, I think we can just drop this, and I agree it's a good idea to do
so :)

Tycho


Re: [PATCH 1/2] seccomp: notify user trap about unused filter

2020-05-27 Thread Tycho Andersen
On Wed, May 27, 2020 at 03:36:09PM -0700, Kees Cook wrote:
> On Wed, May 27, 2020 at 03:52:03PM -0600, Tycho Andersen wrote:
> > On Wed, May 27, 2020 at 02:43:49PM -0700, Kees Cook wrote:
> > > (While I'm here -- why can there be only one listener per task? The
> > > notifications are filter-specific, not task-specific?)
> > 
> > Not sure what you mean here?
> 
> tatic struct file *init_listener(struct seccomp_filter *filter)
> {
> struct file *ret = ERR_PTR(-EBUSY);
> struct seccomp_filter *cur;
> 
> for (cur = current->seccomp.filter; cur; cur = cur->prev) {
> if (cur->notif)
> goto out;
> }
> 
> ...
> 
> /* Installing a second listener in the chain should EBUSY */
> EXPECT_EQ(user_trap_syscall(__NR_getpid,
> SECCOMP_FILTER_FLAG_NEW_LISTENER),
>   -1);
> EXPECT_EQ(errno, EBUSY);
> 
> 
> Why does this limit exist? Since the fd is tied to a specific filter,
> I don't see conflicts about having multiple USER_NOTIF filters on one
> task -- the monitor's response will either fake it or continue it, so
> there is no "composition" needed? I must be missing something.

It exists because Andy asked for it :)

I agree that there's no technical reason for it to be there. I think
it's just that the semantics were potentially confusing, and it wasn't
a requirement anyone had to have multiples attached.

Tycho


Re: [PATCH 1/2] seccomp: notify user trap about unused filter

2020-05-27 Thread Tycho Andersen
On Wed, May 27, 2020 at 02:43:49PM -0700, Kees Cook wrote:
> (While I'm here -- why can there be only one listener per task? The
> notifications are filter-specific, not task-specific?)

Not sure what you mean here?

> > To fix this, we introduce a new "live" reference counter that tracks the
> > live tasks making use of a given filter and when a notifier is
> > registered waiting tasks will be notified that the filter is now empty
> > by receiving a (E)POLLHUP event.
> > The concept in this patch introduces is the same as for signal_struct,
> > i.e. reference counting for life-cycle management is decoupled from
> > reference counting live taks using the object.
> 
> I will need convincing that life-cycle ref-counting needs to be decoupled
> from usage ref-counting.

I think it does, since the refcount is no longer 1:1 with the number
of tasks that have it (a notification fd's struct file has a reference
too).

We could also do it the reverse way, and keep track of how many
notification fds point to a particular file. But somehow we need two
counts.

Maybe it's best to decouple them entirely, and have usage go back to
just being the number of tasks, and introduce a new counter for
notification fds.

> I see what you're saying here and in the other
> reply about where the notification is coming from (release vs put, etc),
> but I think it'd be better if the EPOLLHUP was handled internally to the
> VFS due to the kernel end of the file being closed.
> 
> > There's probably some trickery possible but the second counter is just
> > the correct way of doing this imho and has precedence. The patch also
> > lifts the waitqeue from struct notification into into sruct
> > seccomp_filter. This is cleaner overall and let's us avoid having to
> > take the notifier mutex since we neither need to read nor modify the
> > notifier specific aspects of the seccomp filter. In the exit path I'd
> > very much like to avoid having to take the notifier mutex for each
> > filter in the task's filter hierarchy.
> 
> I guess this is a minor size/speed trade-off (every seccomp_filter
> struct grows by 1 pointer regardless of the presence of USER_NOTIF
> rules attached...). But I think this is an optimization detail, and I
> need to understand why we can't just close the file on filter free.

That seems nicest, agreed.

Tycho


Re: [PATCH 1/2] seccomp: notify user trap about unused filter

2020-05-27 Thread Tycho Andersen
On Wed, May 27, 2020 at 01:19:01PM +0200, Christian Brauner wrote:
> +void seccomp_filter_notify(const struct task_struct *tsk)
> +{
> + struct seccomp_filter *orig = tsk->seccomp.filter;
> +
> + while (orig && refcount_dec_and_test(&orig->live)) {
> + if (waitqueue_active(&orig->wqh))
> + wake_up_poll(&orig->wqh, EPOLLHUP);
> + orig = orig->prev;
> + }
> +}

Is there a reason this can't live in put_seccomp_filter()?

Tycho


Re: [PATCH] riscv: Remove unnecessary path for syscall_trace

2020-05-26 Thread Tycho Andersen
On Tue, May 26, 2020 at 08:29:45AM +0800, Guo Ren wrote:
> Hi Tycho,
> 
> On Mon, May 25, 2020 at 10:36 PM Tycho Andersen  wrote:
> >
> > On Mon, May 25, 2020 at 02:18:26PM +, guo...@kernel.org wrote:
> > > From: Guo Ren 
> > >
> > > Obviously, there is no need to recover a0-a7 in reject path.
> > >
> > > Previous modification is from commit af33d243 by Tycho, to
> > > fixup seccomp reject syscall code path.
> >
> > Doesn't this suffer from the same problem, though? a7 is clobbered, so
> > the -ERESTARTSYS behavior won't work?
> 
> Look, the patch only affects the path of ret_from_syscall_rejected,
> and there are two possible paths:
> 1. ret_from_syscall_rejected->handle_syscall_trace_exit->ret_from_exception
> 2. ret_from_syscall_rejected->ret_from_exception
> 
> All the above skip the check_syscall_nr and ignore the current a7, in
> the C function they use the pt_regs in the stack to get proper reg's
> value.
> 
> For the -ERESTARTSYS, we only process it in:
> ret_from_exception->resume_userspace->work_notifysig->do_notify_resume:
> do_signal & handle_signal:
> 
> switch (regs->a0) {
> case -ERESTARTNOHAND:
> case -ERESTARTSYS:
> case -ERESTARTNOINTR:
> regs->a0 = regs->orig_a0;
> regs->epc -= 0x4;
> break;
> 
> All above are done in pt_regs and when returning to userspace, a7 will
> be recovered by restore_all in entry.S.

Yes, thanks for that explanation.

Reviewed-by: Tycho Andersen 


Re: [PATCH] riscv: Remove unnecessary path for syscall_trace

2020-05-25 Thread Tycho Andersen
On Mon, May 25, 2020 at 02:18:26PM +, guo...@kernel.org wrote:
> From: Guo Ren 
> 
> Obviously, there is no need to recover a0-a7 in reject path.
> 
> Previous modification is from commit af33d243 by Tycho, to
> fixup seccomp reject syscall code path.

Doesn't this suffer from the same problem, though? a7 is clobbered, so
the -ERESTARTSYS behavior won't work?

I haven't run the test case that was failing before.

Tycho


Re: [PATCH 2/5] seccomp: Introduce addfd ioctl to seccomp user notifier

2020-05-24 Thread Tycho Andersen
On Sun, May 24, 2020 at 05:57:32PM -0600, Tycho Andersen wrote:
> On Sun, May 24, 2020 at 04:39:39PM -0700, Sargun Dhillon wrote:
> > +static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd)
> > +{
> > +   int ret;
> > +
> > +   /*
> > +* Remove the notification, and reset the list pointers, indicating
> > +* that it has been handled.
> > +*/
> > +   list_del_init(&addfd->list);
> > +
> > +   ret = security_file_receive(addfd->file);
> > +   if (ret)
> > +   goto out;
> > +
> > +   if (addfd->fd >= 0) {
> > +   ret = replace_fd(addfd->fd, addfd->file, addfd->flags);
> > +   if (ret >= 0)
> > +   fput(addfd->file);
> > +   } else {
> > +   ret = get_unused_fd_flags(addfd->flags);
> > +   if (ret >= 0)
> > +   fd_install(ret, addfd->file);
> > +   }
> > +
> > +out:
> > +   addfd->ret = ret;
> > +   complete(&addfd->completion);
> > +}
> 
> My previous comment about SCM_RIGHTS still applies, right? That is, we
> should do,
> 
>   sock = sock_from_file(fp[i], &err);
>   if (sock) {
>   sock_update_netprioidx(&sock->sk->sk_cgrp_data);
>   sock_update_classid(&sock->sk->sk_cgrp_data);
>   }
> 
> and perhaps lift that into a helper.

Oh, and now I see the later patch. But is there a reason to separate
these? I can't think of one.

Tycho


Re: [PATCH 2/5] seccomp: Introduce addfd ioctl to seccomp user notifier

2020-05-24 Thread Tycho Andersen
On Sun, May 24, 2020 at 04:39:39PM -0700, Sargun Dhillon wrote:
> +static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd)
> +{
> + int ret;
> +
> + /*
> +  * Remove the notification, and reset the list pointers, indicating
> +  * that it has been handled.
> +  */
> + list_del_init(&addfd->list);
> +
> + ret = security_file_receive(addfd->file);
> + if (ret)
> + goto out;
> +
> + if (addfd->fd >= 0) {
> + ret = replace_fd(addfd->fd, addfd->file, addfd->flags);
> + if (ret >= 0)
> + fput(addfd->file);
> + } else {
> + ret = get_unused_fd_flags(addfd->flags);
> + if (ret >= 0)
> + fd_install(ret, addfd->file);
> + }
> +
> +out:
> + addfd->ret = ret;
> + complete(&addfd->completion);
> +}

My previous comment about SCM_RIGHTS still applies, right? That is, we
should do,

sock = sock_from_file(fp[i], &err);
if (sock) {
sock_update_netprioidx(&sock->sk->sk_cgrp_data);
sock_update_classid(&sock->sk->sk_cgrp_data);
}

and perhaps lift that into a helper.

Tycho


Re: [PATCH 1/5] seccomp: Add find_notification helper

2020-05-24 Thread Tycho Andersen
On Sun, May 24, 2020 at 04:39:38PM -0700, Sargun Dhillon wrote:
> This adds a helper which can iterate through a seccomp_filter to
> find a notification matching an ID. It removes several replicated
> chunks of code.
> 
> Signed-off-by: Sargun Dhillon 
> Cc: Matt Denton 
> Cc: Kees Cook ,
> Cc: Jann Horn ,
> Cc: Robert Sesek ,
> Cc: Chris Palmer 
> Cc: Christian Brauner 
> Cc: Tycho Andersen 
> ---
>  kernel/seccomp.c | 38 +-
>  1 file changed, 21 insertions(+), 17 deletions(-)
> 
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 55a6184f5990..f6ce94b7a167 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -1021,10 +1021,25 @@ static int seccomp_notify_release(struct inode 
> *inode, struct file *file)
>   return 0;
>  }
>  
> +/* must be called with notif_lock held */
> +static inline struct seccomp_knotif *
> +find_notification(struct seccomp_filter *filter, u64 id)
> +{
> + struct seccomp_knotif *cur;
> +
> + list_for_each_entry(cur, &filter->notif->notifications, list) {
> + if (cur->id == id)
> + return cur;
> + }
> +
> + return NULL;
> +}

I think there's also an instance of this in _send() that we can change
to use find_notification() as well.

Tycho


Re: seccomp feature development

2020-05-19 Thread Tycho Andersen
On Mon, May 18, 2020 at 02:04:57PM -0700, Kees Cook wrote:
> Hi!
> 
> This is my attempt at a brain-dump on my plans for nearish-term seccomp
> features. Welcome to my TED talk... ;)
> 
> These are the things I've been thinking about:
> 
> - fd passing
> - deep argument inspection
> - changing structure sizes
> - syscall bitmasks
> 
> So, diving right in:
> 
> 
> ## fd passing
> 
> Background: seccomp users want to be able to install an fd in a
> monitored process during a user_notif to emulate "open" calls (or
> similar), possibly across security boundaries, etc.
> 
> On the fd passing front, it seems that gaining pidfd_addfd() is the way
> to go as it allows for generic use not tied to seccomp in particular.
> I expect this feature will be developed orthogonally to seccomp (where
> does this stand, BTW?). However, as Sargun has shown[1], seccomp could
> be friendlier to help with using it. Things that need to be resolved:
> 
> - report pidnr, or pidfd? It seems the consensus is to pass pidnr, but
>   if we're going to step back and make some design choices here, is
>   there a place for pidfds in seccomp user_notif, in order to avoid
>   needing the user_notif cookie? I think probably not: it's a rather lot
>   of overhead for notifications. It seems it's safe to perform an fd
>   installation with these steps:
>   - get pidnr from user_notif_recv
>   - open pidfd from pidnr
>   - re-verify user_notif cookie is still valid
>   - send new fd via pidfd
>   - reply with user_notif_send
>   - close pidfd

Yep, this looks safe.

> - how to deal with changing sizes of the user_notif structures to
>   include a pidnr. (Which will be its own topic below.)
> 
> 
> ## deep argument inspection
> 
> Background: seccomp users would like to write filters that traverse
> the user pointers passed into many syscalls, but seccomp can't do this
> dereference for a variety of reasons (mostly involving race conditions and
> rearchitecting the entire kernel syscall and copy_from_user() code flows).
> 
> During the last plumbers and in conversations since, the grudging
> consensus was reached that having seccomp do this for ALL syscalls was
> likely going to be extremely disruptive for very little gain (i.e.
> many things, like pathnames, have differing lifetimes, aliases, unstable
> kernel object references, etc[6]), but that there were a small subset of
> syscalls for which this WOULD be beneficial, and those are the newly
> created "Extensible Argument" syscalls (is there a better name for this
> design? I'm calling it "EA" for the rest of the email), like clone3(),
> openat2(), etc, which pass a pointer and a size:
> 
> long clone3(struct clone_args *cl_args, size_t size);
> 
> I think it should be possible to extend seccomp to examine this structure
> by appending it to seccomp_data, and allowing filters to examine the
> contents. This means that no BPF language extensions are required for
> seccomp, as I'd still prefer to avoid making the eBPF jump (I don't think
> seccomp's design principles work well with maps, kernel helpers, etc,
> and I think the earlier the examination of using eBPF for user_notif
> bares this out).
> 
> In order for this to work, there are a number of prerequisites:
> 
> - argument caching, in two halves: syscall side and seccomp side:
>   - the EA syscalls needs to include awareness of potential seccomp
> hooking. i.e. seccomp may have done the copy_from_user() already and
> kept a cached copy.
>   - seccomp needs to potentially DO the copy_from_user() itself when it
> hits these syscalls for a given filter, and put it somewhere for
> later use by the syscall.
> - the sizes of these EA structs are, by design, growable over time.
>   seccomp and its users need to be handle this in a forward and backward
>   compatible way, similar to the design of the EA syscall interface
>   itself.
> 
> The argument caching bit is, I think, rather mechanical in nature since
> it's all "just" internal to the kernel: seccomp can likely adjust how it
> allocates seccomp_data (maybe going so far as to have it split across two
> pages with the syscall argument struct always starting on the 2nd page
> boundary), and copying the EA struct into that page, which will be both
> used by the filter and by the syscall. I imagine state tracking ("is
> there a cached EA?", "what is the address of seccomp_data?", "what is
> the address of the EA?") can be associated with the thread struct.
> 
> The growing size of the EA struct will need some API design. For filters
> to operate on the contiguous seccomp_data+EA struct, the filter will
> need to know how large seccomp_data is (more on this later), and how
> large the EA struct is. When the filter is written in userspace, it can
> do the math, point into the expected offsets, and get what it needs. For
> this to work correctly in the kernel, though, the seccomp BPF verifier
> needs to know the size of the EA struct as well, so it can corre

Re: [PATCH] seccomp: Add group_leader pid to seccomp_notif

2020-05-18 Thread Tycho Andersen
On Mon, May 18, 2020 at 02:45:00PM +0200, Christian Brauner wrote:
> On Mon, May 18, 2020 at 08:32:25AM +, Sargun Dhillon wrote:
> > On Sun, May 17, 2020 at 02:30:57PM -0700, Kees Cook wrote:
> > > On Sun, May 17, 2020 at 09:02:15AM -0600, Tycho Andersen wrote:
> > > 
> > > I'm going read this thread more carefully tomorrow, but I just wanted to
> > > mention that I'd *like* to extend seccomp_data for doing deep argument
> > > inspection of the new syscalls. I think it's the least bad of many
> > > designs, and I'll write that up in more detail. (I would *really* like
> > > to avoid extending seccomp's BPF language, and instead allow probing
> > > into the struct copied from userspace, etc.)
> > > 
> > > Anyway, it's very related to this, so, yeah, probably we need a v2 of the
> > > notif API, but I'll try to get all the ideas here collected in one place.
> > I scratched together a proposal of what I think would make a not-terrible
> > V2 API. I'm sure there's bugs in this code, but I think it's workable --
> > or at least a place to start. The biggest thing I think we should consider
> > is unrolling seccomp_data if we don't intend to add new BPF-accessible
> > fields.
> > 
> > If also uses read(2), so we get to take advantage of read(2)'s ability
> > to pass a size along with the read, as opposed to doing ioctl tricks.
> > It also makes programming from against it slightly simpler. I can imagine
> > that the send API could be similar, in that it could support write, and
> > thus making it 100% usable from Go (and the like) without requiring
> > a separate OS-thread be spun up to interact with the listener.
> 
> I don't have strong feelings about using read() and write() here but I
> think that Jann had reservations and that's why we didn't do it in the
> first version. But his reservations were specifically tied to fd passing
> which we never implemented:
> http://lkml.iu.edu/hypermail/linux/kernel/1806.2/05995.html
> 
> But still, worth considering.

There was a thread about this same time for some other API (I can't
find it now, but I can dig if you want) that suggests that "read() is
for data" and we shouldn't use it for control in APIs.

Tycho


Re: [PATCH] seccomp: Add group_leader pid to seccomp_notif

2020-05-18 Thread Tycho Andersen
On Mon, May 18, 2020 at 02:53:25PM +0200, Christian Brauner wrote:
> On Sun, May 17, 2020 at 09:02:15AM -0600, Tycho Andersen wrote:
> > On Sun, May 17, 2020 at 08:46:03AM -0600, Tycho Andersen wrote:
> > > On Sun, May 17, 2020 at 04:33:11PM +0200, Christian Brauner wrote:
> > > > struct seccomp_notif2 {
> > > > __u32 notif_size;
> > > > __u64 id;
> > > > __u32 pid;
> > > > __u32 flags;
> > > > struct seccomp_data data;
> > > > __u32 data_size;
> > > > };
> > > 
> > > I guess you need to put data_size before data, otherwise old userspace
> > > with a smaller struct seccomp_data will look in the wrong place.
> > > 
> > > But yes, that'll work if you put two sizes in, which is probably
> > > reasonable since we're talking about two structs.
> > 
> > Well, no, it doesn't either. Suppose we add a new field first to
> > struct seccomp_notif2:
> > 
> > struct seccomp_notif2 {
> > __u32 notif_size;
> > __u64 id;
> > __u32 pid;
> > __u32 flags;
> > struct seccomp_data data;
> > __u32 data_size;
> > __u32 new_field;
> > };
> > 
> > And next we add a new field to struct secccomp_data. When a userspace
> > compiled with just the new seccomp_notif2 field does:
> > 
> > seccomp_notif2.new_field = ...;
> > 
> > the compiler will put it in the wrong place for the kernel with the
> > new seccomp_data field too.
> > 
> > Sort of feels like we should do:
> > 
> > struct seccomp_notif2 {
> > struct seccomp_notif *notif;
> > struct seccomp_data *data;
> > };
> > 
> > ?
> 
> Oh yes of course, sorry that was my stupid typo. I meant:
> 
> struct seccomp_notif2 {
> __u32 notif_size;
> __u64 id;
> __u32 pid;
> __u32 flags;
> struct seccomp_data *data;
> __u32 data_size;
> __u32 new_field;
> }
> 
> at which point things should just work imho.

Are you saying that data_size is an input? Because I don't think they
Just Work otherwise.

Tycho


Re: [PATCH] seccomp: Add group_leader pid to seccomp_notif

2020-05-17 Thread Tycho Andersen
On Sun, May 17, 2020 at 08:46:03AM -0600, Tycho Andersen wrote:
> On Sun, May 17, 2020 at 04:33:11PM +0200, Christian Brauner wrote:
> > struct seccomp_notif2 {
> > __u32 notif_size;
> > __u64 id;
> > __u32 pid;
> > __u32 flags;
> > struct seccomp_data data;
> > __u32 data_size;
> > };
> 
> I guess you need to put data_size before data, otherwise old userspace
> with a smaller struct seccomp_data will look in the wrong place.
> 
> But yes, that'll work if you put two sizes in, which is probably
> reasonable since we're talking about two structs.

Well, no, it doesn't either. Suppose we add a new field first to
struct seccomp_notif2:

struct seccomp_notif2 {
__u32 notif_size;
__u64 id;
__u32 pid;
__u32 flags;
struct seccomp_data data;
__u32 data_size;
__u32 new_field;
};

And next we add a new field to struct secccomp_data. When a userspace
compiled with just the new seccomp_notif2 field does:

seccomp_notif2.new_field = ...;

the compiler will put it in the wrong place for the kernel with the
new seccomp_data field too.

Sort of feels like we should do:

struct seccomp_notif2 {
struct seccomp_notif *notif;
struct seccomp_data *data;
};

?

Tycho


Re: [PATCH] seccomp: Add group_leader pid to seccomp_notif

2020-05-17 Thread Tycho Andersen
On Sun, May 17, 2020 at 04:33:11PM +0200, Christian Brauner wrote:
> On Sun, May 17, 2020 at 08:23:16AM -0600, Tycho Andersen wrote:
> > On Sun, May 17, 2020 at 09:21:56PM +1000, Aleksa Sarai wrote:
> > > On 2020-05-17, Christian Brauner  wrote:
> > > > Or... And that's more invasive but ultimately cleaner we v2 the whole
> > > > thing so e.g. SECCOMP_IOCTL_NOTIF_RECV2, SECCOMP_IOCTL_NOTIF_SEND2, and
> > > > embedd the size argument in the structs. Userspace sets the size
> > > > argument, we use get_user() to get the size first and then
> > > > copy_struct_from_user() to handle it cleanly based on that. A similar
> > > > model as with sched (has other unrelated quirks because they messed up
> > > > something too):
> > > > 
> > > > static int sched_copy_attr(struct sched_attr __user *uattr, struct 
> > > > sched_attr *attr)
> > > > {
> > > > u32 size;
> > > > int ret;
> > > > 
> > > > /* Zero the full structure, so that a short copy will be nice: 
> > > > */
> > > > memset(attr, 0, sizeof(*attr));
> > > > 
> > > > ret = get_user(size, &uattr->size);
> > > > if (ret)
> > > > return ret;
> > > > 
> > > > /* ABI compatibility quirk: */
> > > > if (!size)
> > > > size = SCHED_ATTR_SIZE_VER0;
> > > > if (size < SCHED_ATTR_SIZE_VER0 || size > PAGE_SIZE)
> > > > goto err_size;
> > > > 
> > > > ret = copy_struct_from_user(attr, sizeof(*attr), uattr, size);
> > > > if (ret) {
> > > > if (ret == -E2BIG)
> > > > goto err_size;
> > > > return ret;
> > > > }
> > > > 
> > > > We're probably the biggest user of this right now and I'd be ok with
> > > > that change. If it's a v2 than whatever. :)
> > > 
> > > I'm :+1: on a new version and switch to copy_struct_from_user(). I was a
> > > little surprised when I found out that user_notif doesn't do it this
> > > way a while ago (and although in theory it is userspace's fault, ideally
> > > we could have an API that doesn't have built-in footguns).
> > 
> > But I thought the whole point was that we couldn't do that, because
> > there's two things that can vary in length (struct seccomp_notif and
> > struct seccomp_data)?
> 
> I may have missed that discussion you linked.
> But why wouldn't:
> 
> struct seccomp_notif2 {
>   __u32 notif_size;
>   __u64 id;
>   __u32 pid;
>   __u32 flags;
>   struct seccomp_data data;
>   __u32 data_size;
> };

I guess you need to put data_size before data, otherwise old userspace
with a smaller struct seccomp_data will look in the wrong place.

But yes, that'll work if you put two sizes in, which is probably
reasonable since we're talking about two structs.

Tycho


Re: [PATCH] seccomp: Add group_leader pid to seccomp_notif

2020-05-17 Thread Tycho Andersen
On Sun, May 17, 2020 at 09:21:56PM +1000, Aleksa Sarai wrote:
> On 2020-05-17, Christian Brauner  wrote:
> > Or... And that's more invasive but ultimately cleaner we v2 the whole
> > thing so e.g. SECCOMP_IOCTL_NOTIF_RECV2, SECCOMP_IOCTL_NOTIF_SEND2, and
> > embedd the size argument in the structs. Userspace sets the size
> > argument, we use get_user() to get the size first and then
> > copy_struct_from_user() to handle it cleanly based on that. A similar
> > model as with sched (has other unrelated quirks because they messed up
> > something too):
> > 
> > static int sched_copy_attr(struct sched_attr __user *uattr, struct 
> > sched_attr *attr)
> > {
> > u32 size;
> > int ret;
> > 
> > /* Zero the full structure, so that a short copy will be nice: */
> > memset(attr, 0, sizeof(*attr));
> > 
> > ret = get_user(size, &uattr->size);
> > if (ret)
> > return ret;
> > 
> > /* ABI compatibility quirk: */
> > if (!size)
> > size = SCHED_ATTR_SIZE_VER0;
> > if (size < SCHED_ATTR_SIZE_VER0 || size > PAGE_SIZE)
> > goto err_size;
> > 
> > ret = copy_struct_from_user(attr, sizeof(*attr), uattr, size);
> > if (ret) {
> > if (ret == -E2BIG)
> > goto err_size;
> > return ret;
> > }
> > 
> > We're probably the biggest user of this right now and I'd be ok with
> > that change. If it's a v2 than whatever. :)
> 
> I'm :+1: on a new version and switch to copy_struct_from_user(). I was a
> little surprised when I found out that user_notif doesn't do it this
> way a while ago (and although in theory it is userspace's fault, ideally
> we could have an API that doesn't have built-in footguns).

But I thought the whole point was that we couldn't do that, because
there's two things that can vary in length (struct seccomp_notif and
struct seccomp_data)?

https://lore.kernel.org/lkml/cagxu5j+zpxu6ege1fer+n9+zlx3n+sj_vbs_zzj9_hrdwrr...@mail.gmail.com/

Tycho


Re: [PATCH] seccomp: fix SECCOMP_USER_NOTIF_FLAG_CONTINUE test

2019-10-21 Thread Tycho Andersen
On Mon, Oct 21, 2019 at 11:10:55AM +0200, Christian Brauner wrote:
> The ifndef for SECCOMP_USER_NOTIF_FLAG_CONTINUE was placed under the
> ifndef for the SECCOMP_FILTER_FLAG_NEW_LISTENER feature. This will not
> work on systems that do support SECCOMP_FILTER_FLAG_NEW_LISTENER but do not
> support SECCOMP_USER_NOTIF_FLAG_CONTINUE. So move the latter ifndef out of
> the former ifndef's scope.
> 
> 2019-10-20 11:14:01 make run_tests -C seccomp
> make: Entering directory 
> '/usr/src/perf_selftests-x86_64-rhel-7.6-0eebfed2954f152259cae0ad57b91d3ea92968e8/tools/testing/selftests/seccomp'
> gcc -Wl,-no-as-needed -Wall  seccomp_bpf.c -lpthread -o seccomp_bpf
> seccomp_bpf.c: In function ‘user_notification_continue’:
> seccomp_bpf.c:3562:15: error: ‘SECCOMP_USER_NOTIF_FLAG_CONTINUE’ undeclared 
> (first use in this function)
>   resp.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
>^~~~
> seccomp_bpf.c:3562:15: note: each undeclared identifier is reported only once 
> for each function it appears in
> Makefile:12: recipe for target 'seccomp_bpf' failed
> make: *** [seccomp_bpf] Error 1
> make: Leaving directory 
> '/usr/src/perf_selftests-x86_64-rhel-7.6-0eebfed2954f152259cae0ad57b91d3ea92968e8/tools/testing/selftests/seccomp'
> 
> Reported-by: kernel test robot 
> Fixes: 0eebfed2954f ("seccomp: test SECCOMP_USER_NOTIF_FLAG_CONTINUE")
> Cc: linux-kselft...@vger.kernel.org
> Signed-off-by: Christian Brauner 

Reviewed-by: Tycho Andersen 


Re: [PATCH v2 2/3] seccomp: avoid overflow in implicit constant conversion

2019-09-20 Thread Tycho Andersen
On Fri, Sep 20, 2019 at 10:30:06AM +0200, Christian Brauner wrote:
> USER_NOTIF_MAGIC is assigned to int variables in this test so set it to 
> INT_MAX
> to avoid warnings:
> 
> seccomp_bpf.c: In function ‘user_notification_continue’:
> seccomp_bpf.c:3088:26: warning: overflow in implicit constant conversion 
> [-Woverflow]
>  #define USER_NOTIF_MAGIC 116983961184613L
>   ^
> seccomp_bpf.c:3572:15: note: in expansion of macro ‘USER_NOTIF_MAGIC’
>   resp.error = USER_NOTIF_MAGIC;
>^~~~
> 
> Fixes: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")
> Signed-off-by: Christian Brauner 
> Reviewed-by: Tyler Hicks 
> Cc: Kees Cook 
> Cc: Andy Lutomirski 
> Cc: Will Drewry 
> Cc: Shuah Khan 
> Cc: Alexei Starovoitov 
> Cc: Daniel Borkmann 
> Cc: Martin KaFai Lau 
> Cc: Song Liu 
> Cc: Yonghong Song 
> Cc: Tycho Andersen 

You can also add,

Reviewed-by: Tycho Andersen 

for this one.

Tycho


Re: [PATCH 1/4] seccomp: add SECCOMP_RET_USER_NOTIF_ALLOW

2019-09-18 Thread Tycho Andersen
On Wed, Sep 18, 2019 at 10:30:00AM -0700, Kees Cook wrote:
> On Wed, Sep 18, 2019 at 10:48:30AM +0200, Christian Brauner wrote:
> > This allows the seccomp notifier to continue a syscall. A positive
> > discussion about this feature was triggered by a post to the
> > ksummit-discuss mailing list (cf. [3]) and took place during KSummit
> > (cf. [1]) and again at the containers/checkpoint-restore
> > micro-conference at Linux Plumbers.
> > 
> > Recently we landed seccomp support for SECCOMP_RET_USER_NOTIF (cf. [4])
> > which enables a process (watchee) to retrieve an fd for its seccomp
> > filter. This fd can then be handed to another (usually more privileged)
> > process (watcher). The watcher will then be able to receive seccomp
> > messages about the syscalls having been performed by the watchee.
> > 
> > This feature is heavily used in some userspace workloads. For example,
> > it is currently used to intercept mknod() syscalls in user namespaces
> > aka in containers.
> > The mknod() syscall can be easily filtered based on dev_t. This allows
> > us to only intercept a very specific subset of mknod() syscalls.
> > Furthermore, mknod() is not possible in user namespaces toto coelo and
> > so intercepting and denying syscalls that are not in the whitelist on
> > accident is not a big deal. The watchee won't notice a difference.
> > 
> > In contrast to mknod(), a lot of other syscall we intercept (e.g.
> > setxattr()) cannot be easily filtered like mknod() because they have
> > pointer arguments. Additionally, some of them might actually succeed in
> > user namespaces (e.g. setxattr() for all "user.*" xattrs). Since we
> > currently cannot tell seccomp to continue from a user notifier we are
> > stuck with performing all of the syscalls in lieu of the container. This
> > is a huge security liability since it is extremely difficult to
> > correctly assume all of the necessary privileges of the calling task
> > such that the syscall can be successfully emulated without escaping
> > other additional security restrictions (think missing CAP_MKNOD for
> > mknod(), or MS_NODEV on a filesystem etc.). This can be solved by
> > telling seccomp to resume the syscall.
> > 
> > One thing that came up in the discussion was the problem that another
> > thread could change the memory after userspace has decided to let the
> > syscall continue which is a well known TOCTOU with seccomp which is
> > present in other ways already.
> > The discussion showed that this feature is already very useful for any
> > syscall without pointer arguments. For any accidentally intercepted
> > non-pointer syscall it is safe to continue.
> > For syscalls with pointer arguments there is a race but for any cautious
> > userspace and the main usec cases the race doesn't matter. The notifier
> > is intended to be used in a scenario where a more privileged watcher
> > supervises the syscalls of lesser privileged watchee to allow it to get
> > around kernel-enforced limitations by performing the syscall for it
> > whenever deemed save by the watcher. Hence, if a user tricks the watcher
> > into allowing a syscall they will either get a deny based on
> > kernel-enforced restrictions later or they will have changed the
> > arguments in such a way that they manage to perform a syscall with
> > arguments that they would've been allowed to do anyway.
> > In general, it is good to point out again, that the notifier fd was not
> > intended to allow userspace to implement a security policy but rather to
> > work around kernel security mechanisms in cases where the watcher knows
> > that a given action is safe to perform.
> > 
> > /* References */
> > [1]: https://linuxplumbersconf.org/event/4/contributions/560
> > [2]: https://linuxplumbersconf.org/event/4/contributions/477
> > [3]: https://lore.kernel.org/r/20190719093538.dhyopljyr5ns3...@brauner.io
> > [4]: commit 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")
> > 
> > Signed-off-by: Christian Brauner 
> > Cc: Kees Cook 
> > Cc: Andy Lutomirski 
> > Cc: Will Drewry 
> > Cc: Tycho Andersen 
> > CC: Tyler Hicks 
> > Cc: Jann Horn 
> > ---
> >  include/uapi/linux/seccomp.h |  2 ++
> >  kernel/seccomp.c | 24 
> >  2 files changed, 22 insertions(+), 4 deletions(-)
> > 
> > diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> > index 90734aa5aa36..2c23b9aa6383 100644
> > --- a/include/uapi/linux/seccomp.h
> > +++ b/include/uapi/linux/s

Re: [PATCH v2 1/5] fs: Add support for an O_MAYEXEC flag on sys_open()

2019-09-06 Thread Tycho Andersen
On Fri, Sep 06, 2019 at 08:27:31PM +0200, Florian Weimer wrote:
> * Tycho Andersen:
> 
> > On Fri, Sep 06, 2019 at 07:20:51PM +0200, Christian Brauner wrote:
> >> On Sat, Sep 07, 2019 at 03:07:39AM +1000, Aleksa Sarai wrote:
> >> > On 2019-09-06, Mickaël Salaün  wrote:
> >> > > 
> >> > > On 06/09/2019 17:56, Florian Weimer wrote:
> >> > > > Let's assume I want to add support for this to the glibc dynamic 
> >> > > > loader,
> >> > > > while still being able to run on older kernels.
> >> > > >
> >> > > > Is it safe to try the open call first, with O_MAYEXEC, and if that 
> >> > > > fails
> >> > > > with EINVAL, try again without O_MAYEXEC?
> >> > > 
> >> > > The kernel ignore unknown open(2) flags, so yes, it is safe even for
> >> > > older kernel to use O_MAYEXEC.
> >> > 
> >> > Depends on your definition of "safe" -- a security feature that you will
> >> > silently not enable on older kernels doesn't sound super safe to me.
> >> > Unfortunately this is a limitation of open(2) that we cannot change --
> >> > which is why the openat2(2) proposal I've been posting gives -EINVAL for
> >> > unknown O_* flags.
> >> > 
> >> > There is a way to probe for support (though unpleasant), by creating a
> >> > test O_MAYEXEC fd and then checking if the flag is present in
> >> > /proc/self/fdinfo/$n.
> >> 
> >> Which Florian said they can't do for various reasons.
> >> 
> >> It is a major painpoint if there's no easy way for userspace to probe
> >> for support. Especially if it's security related which usually means
> >> that you want to know whether this feature works or not.
> >
> > What about just trying to violate the policy via fexecve() instead of
> > looking around in /proc? Still ugly, though.
> 
> How would we do this?  This is about opening the main executable as part
> of an explicit loader invocation.  Typically, an fexecve will succeed
> and try to run the program, but with the wrong dynamic loader.

Yeah, fexecve() was a think-o, sorry, you don't need to go that far. I
was thinking do what the tests in this series do: create a tmpfs with
MS_NOEXEC, put an executable file in it, and try and open it with
O_MAYEXEC. If that works, the kernel doesn't support the flag, and it
should give you -EACCES if the kernel does support the flag.

Still a lot of work, though. Seems better to just use openat2.

Tycho


Re: [PATCH v2 1/5] fs: Add support for an O_MAYEXEC flag on sys_open()

2019-09-06 Thread Tycho Andersen
On Fri, Sep 06, 2019 at 07:20:51PM +0200, Christian Brauner wrote:
> On Sat, Sep 07, 2019 at 03:07:39AM +1000, Aleksa Sarai wrote:
> > On 2019-09-06, Mickaël Salaün  wrote:
> > > 
> > > On 06/09/2019 17:56, Florian Weimer wrote:
> > > > Let's assume I want to add support for this to the glibc dynamic loader,
> > > > while still being able to run on older kernels.
> > > >
> > > > Is it safe to try the open call first, with O_MAYEXEC, and if that fails
> > > > with EINVAL, try again without O_MAYEXEC?
> > > 
> > > The kernel ignore unknown open(2) flags, so yes, it is safe even for
> > > older kernel to use O_MAYEXEC.
> > 
> > Depends on your definition of "safe" -- a security feature that you will
> > silently not enable on older kernels doesn't sound super safe to me.
> > Unfortunately this is a limitation of open(2) that we cannot change --
> > which is why the openat2(2) proposal I've been posting gives -EINVAL for
> > unknown O_* flags.
> > 
> > There is a way to probe for support (though unpleasant), by creating a
> > test O_MAYEXEC fd and then checking if the flag is present in
> > /proc/self/fdinfo/$n.
> 
> Which Florian said they can't do for various reasons.
> 
> It is a major painpoint if there's no easy way for userspace to probe
> for support. Especially if it's security related which usually means
> that you want to know whether this feature works or not.

What about just trying to violate the policy via fexecve() instead of
looking around in /proc? Still ugly, though.

Tycho


Re: [PATCH] selftests/seccomp: fix build on older kernels

2019-08-30 Thread Tycho Andersen
On Fri, Aug 30, 2019 at 09:19:00AM -0600, shuah wrote:
> On 8/29/19 6:45 PM, shuah wrote:
> > On 8/29/19 11:06 AM, Kees Cook wrote:
> > > On Mon, Aug 26, 2019 at 08:43:02AM -0600, Tycho Andersen wrote:
> > > > The seccomp selftest goes to some length to build against older kernel
> > > > headers, viz. all the #ifdefs at the beginning of the file. 201766a20e30
> > > > ("ptrace: add PTRACE_GET_SYSCALL_INFO request") introduces some
> > > > additional
> > > > macros, but doesn't do the #ifdef dance. Let's add that dance here to
> > > > avoid:
> > > > 
> > > > gcc -Wl,-no-as-needed -Wall  seccomp_bpf.c -lpthread -o seccomp_bpf
> > > > In file included from seccomp_bpf.c:51:
> > > > seccomp_bpf.c: In function ‘tracer_ptrace’:
> > > > seccomp_bpf.c:1787:20: error: ‘PTRACE_EVENTMSG_SYSCALL_ENTRY’
> > > > undeclared (first use in this function); did you mean
> > > > ‘PTRACE_EVENT_CLONE’?
> > > >    EXPECT_EQ(entry ? PTRACE_EVENTMSG_SYSCALL_ENTRY
> > > >  ^
> > > > ../kselftest_harness.h:608:13: note: in definition of macro ‘__EXPECT’
> > > >    __typeof__(_expected) __exp = (_expected); \
> > > >   ^
> > > > seccomp_bpf.c:1787:2: note: in expansion of macro ‘EXPECT_EQ’
> > > >    EXPECT_EQ(entry ? PTRACE_EVENTMSG_SYSCALL_ENTRY
> > > >    ^
> > > > seccomp_bpf.c:1787:20: note: each undeclared identifier is
> > > > reported only once for each function it appears in
> > > >    EXPECT_EQ(entry ? PTRACE_EVENTMSG_SYSCALL_ENTRY
> > > >  ^
> > > > ../kselftest_harness.h:608:13: note: in definition of macro ‘__EXPECT’
> > > >    __typeof__(_expected) __exp = (_expected); \
> > > >   ^
> > > > seccomp_bpf.c:1787:2: note: in expansion of macro ‘EXPECT_EQ’
> > > >    EXPECT_EQ(entry ? PTRACE_EVENTMSG_SYSCALL_ENTRY
> > > >    ^
> > > > seccomp_bpf.c:1788:6: error: ‘PTRACE_EVENTMSG_SYSCALL_EXIT’
> > > > undeclared (first use in this function); did you mean
> > > > ‘PTRACE_EVENT_EXIT’?
> > > >  : PTRACE_EVENTMSG_SYSCALL_EXIT, msg);
> > > >    ^~~~
> > > > ../kselftest_harness.h:608:13: note: in definition of macro ‘__EXPECT’
> > > >    __typeof__(_expected) __exp = (_expected); \
> > > >   ^
> > > > seccomp_bpf.c:1787:2: note: in expansion of macro ‘EXPECT_EQ’
> > > >    EXPECT_EQ(entry ? PTRACE_EVENTMSG_SYSCALL_ENTRY
> > > >    ^
> > > > make: *** [Makefile:12: seccomp_bpf] Error 1
> > > > 
> > > > Signed-off-by: Tycho Andersen 
> > > > Fixes: 201766a20e30 ("ptrace: add PTRACE_GET_SYSCALL_INFO request")
> > > 
> > > Acked-by: Kees Cook 
> > > 
> > > Alakesh Haloi also sent a fix[1] for this. I prefer Tycho's solution
> > > (one #ifndef and a Fixes line). Shuah, can you please apply this?
> > > 
> > 
> > Kees,
> > 
> > Yes I will pick this up.
> > 
> > thanks,
> > -- Shuah
> > 
> 
> Applied after fixing the following checkpatch error in the commit log:
> 
> ERROR: Please use git commit description style 'commit <12+ chars of sha1>
> ("")' - ie: 'commit 201766a20e30 ("ptrace: add
> PTRACE_GET_SYSCALL_INFO request")'
> #82:
> 
> Now reads as follows:
> 
> Commit 201766a20e30 ("ptrace: add PTRACE_GET_SYSCALL_INFO request")
> introduces some additional macros, but doesn't do the #ifdef dance.
> Let's add that dance here to avoid:

Ah, good to know. Thanks!

Tycho


Re: [PATCH v2] riscv: add support for SECCOMP and SECCOMP_FILTER

2019-08-26 Thread Tycho Andersen
Hi,

On Fri, Aug 23, 2019 at 05:30:53PM -0700, Paul Walmsley wrote:
> On Thu, 22 Aug 2019, David Abdurachmanov wrote:
> 
> > There is one failing kernel selftest: global.user_notification_signal
> 
> Also - could you follow up with the author of this failing test to see if 
> we can get some more clarity about what might be going wrong here?  It 
> appears that the failing test was added in commit 6a21cc50f0c7f ("seccomp: 
> add a return code to trap to userspace") by Tycho Andersen 
> .

Can you post an strace and a cat of /proc/$pid/stack for both tasks
where it gets stuck? I don't have any riscv hardware, and it "works
for me" on x86 and arm64 with 100 tries.

Thanks,

Tycho


[PATCH] selftests/seccomp: fix build on older kernels

2019-08-26 Thread Tycho Andersen
The seccomp selftest goes to some length to build against older kernel
headers, viz. all the #ifdefs at the beginning of the file. 201766a20e30
("ptrace: add PTRACE_GET_SYSCALL_INFO request") introduces some additional
macros, but doesn't do the #ifdef dance. Let's add that dance here to
avoid:

gcc -Wl,-no-as-needed -Wall  seccomp_bpf.c -lpthread -o seccomp_bpf
In file included from seccomp_bpf.c:51:
seccomp_bpf.c: In function ‘tracer_ptrace’:
seccomp_bpf.c:1787:20: error: ‘PTRACE_EVENTMSG_SYSCALL_ENTRY’ undeclared (first 
use in this function); did you mean ‘PTRACE_EVENT_CLONE’?
  EXPECT_EQ(entry ? PTRACE_EVENTMSG_SYSCALL_ENTRY
^
../kselftest_harness.h:608:13: note: in definition of macro ‘__EXPECT’
  __typeof__(_expected) __exp = (_expected); \
 ^
seccomp_bpf.c:1787:2: note: in expansion of macro ‘EXPECT_EQ’
  EXPECT_EQ(entry ? PTRACE_EVENTMSG_SYSCALL_ENTRY
  ^
seccomp_bpf.c:1787:20: note: each undeclared identifier is reported only once 
for each function it appears in
  EXPECT_EQ(entry ? PTRACE_EVENTMSG_SYSCALL_ENTRY
^
../kselftest_harness.h:608:13: note: in definition of macro ‘__EXPECT’
  __typeof__(_expected) __exp = (_expected); \
 ^
seccomp_bpf.c:1787:2: note: in expansion of macro ‘EXPECT_EQ’
  EXPECT_EQ(entry ? PTRACE_EVENTMSG_SYSCALL_ENTRY
  ^
seccomp_bpf.c:1788:6: error: ‘PTRACE_EVENTMSG_SYSCALL_EXIT’ undeclared (first 
use in this function); did you mean ‘PTRACE_EVENT_EXIT’?
: PTRACE_EVENTMSG_SYSCALL_EXIT, msg);
  ^~~~
../kselftest_harness.h:608:13: note: in definition of macro ‘__EXPECT’
  __typeof__(_expected) __exp = (_expected); \
 ^
seccomp_bpf.c:1787:2: note: in expansion of macro ‘EXPECT_EQ’
  EXPECT_EQ(entry ? PTRACE_EVENTMSG_SYSCALL_ENTRY
  ^
make: *** [Makefile:12: seccomp_bpf] Error 1

Signed-off-by: Tycho Andersen 
Fixes: 201766a20e30 ("ptrace: add PTRACE_GET_SYSCALL_INFO request")
---
 tools/testing/selftests/seccomp/seccomp_bpf.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c 
b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 6ef7f16c4cf5..7f8b5c8982e3 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -199,6 +199,11 @@ struct seccomp_notif_sizes {
 };
 #endif
 
+#ifndef PTRACE_EVENTMSG_SYSCALL_ENTRY
+#define PTRACE_EVENTMSG_SYSCALL_ENTRY  1
+#define PTRACE_EVENTMSG_SYSCALL_EXIT   2
+#endif
+
 #ifndef seccomp
 int seccomp(unsigned int op, unsigned int flags, void *args)
 {
-- 
2.20.1



Re: [PATCH ghak90 V6 02/10] audit: add container id

2019-05-29 Thread Tycho Andersen
On Wed, May 29, 2019 at 11:29:05AM -0400, Paul Moore wrote:
> On Wed, May 29, 2019 at 10:57 AM Tycho Andersen  wrote:
> >
> > On Mon, Apr 08, 2019 at 11:39:09PM -0400, Richard Guy Briggs wrote:
> > > It is not permitted to unset the audit container identifier.
> > > A child inherits its parent's audit container identifier.
> >
> > ...
> >
> > >  /**
> > > + * audit_set_contid - set current task's audit contid
> > > + * @contid: contid value
> > > + *
> > > + * Returns 0 on success, -EPERM on permission failure.
> > > + *
> > > + * Called (set) from fs/proc/base.c::proc_contid_write().
> > > + */
> > > +int audit_set_contid(struct task_struct *task, u64 contid)
> > > +{
> > > + u64 oldcontid;
> > > + int rc = 0;
> > > + struct audit_buffer *ab;
> > > + uid_t uid;
> > > + struct tty_struct *tty;
> > > + char comm[sizeof(current->comm)];
> > > +
> > > + task_lock(task);
> > > + /* Can't set if audit disabled */
> > > + if (!task->audit) {
> > > + task_unlock(task);
> > > + return -ENOPROTOOPT;
> > > + }
> > > + oldcontid = audit_get_contid(task);
> > > + read_lock(&tasklist_lock);
> > > + /* Don't allow the audit containerid to be unset */
> > > + if (!audit_contid_valid(contid))
> > > + rc = -EINVAL;
> > > + /* if we don't have caps, reject */
> > > + else if (!capable(CAP_AUDIT_CONTROL))
> > > + rc = -EPERM;
> > > + /* if task has children or is not single-threaded, deny */
> > > + else if (!list_empty(&task->children))
> > > + rc = -EBUSY;
> > > + else if (!(thread_group_leader(task) && thread_group_empty(task)))
> > > + rc = -EALREADY;
> > > + read_unlock(&tasklist_lock);
> > > + if (!rc)
> > > + task->audit->contid = contid;
> > > + task_unlock(task);
> > > +
> > > + if (!audit_enabled)
> > > + return rc;
> >
> > ...but it is allowed to change it (assuming
> > capable(CAP_AUDIT_CONTROL), of course)? Seems like this might be more
> > immediately useful since we still live in the world of majority
> > privileged containers if we didn't allow changing it, in addition to
> > un-setting it.
> 
> The idea is that only container orchestrators should be able to
> set/modify the audit container ID, and since setting the audit
> container ID can have a significant effect on the records captured
> (and their routing to multiple daemons when we get there) modifying
> the audit container ID is akin to modifying the audit configuration
> which is why it is gated by CAP_AUDIT_CONTROL.  The current thinking
> is that you would only change the audit container ID from one
> set/inherited value to another if you were nesting containers, in
> which case the nested container orchestrator would need to be granted
> CAP_AUDIT_CONTROL (which everyone to date seems to agree is a workable
> compromise).

But then don't you want some kind of ns_capable() instead (probably
not the obvious one, though...)? With capable(), you can't really nest
using the audit-id and user namespaces together.

Tycho


Re: [PATCH ghak90 V6 02/10] audit: add container id

2019-05-29 Thread Tycho Andersen
On Mon, Apr 08, 2019 at 11:39:09PM -0400, Richard Guy Briggs wrote:
> It is not permitted to unset the audit container identifier.
> A child inherits its parent's audit container identifier.

...

>  /**
> + * audit_set_contid - set current task's audit contid
> + * @contid: contid value
> + *
> + * Returns 0 on success, -EPERM on permission failure.
> + *
> + * Called (set) from fs/proc/base.c::proc_contid_write().
> + */
> +int audit_set_contid(struct task_struct *task, u64 contid)
> +{
> + u64 oldcontid;
> + int rc = 0;
> + struct audit_buffer *ab;
> + uid_t uid;
> + struct tty_struct *tty;
> + char comm[sizeof(current->comm)];
> +
> + task_lock(task);
> + /* Can't set if audit disabled */
> + if (!task->audit) {
> + task_unlock(task);
> + return -ENOPROTOOPT;
> + }
> + oldcontid = audit_get_contid(task);
> + read_lock(&tasklist_lock);
> + /* Don't allow the audit containerid to be unset */
> + if (!audit_contid_valid(contid))
> + rc = -EINVAL;
> + /* if we don't have caps, reject */
> + else if (!capable(CAP_AUDIT_CONTROL))
> + rc = -EPERM;
> + /* if task has children or is not single-threaded, deny */
> + else if (!list_empty(&task->children))
> + rc = -EBUSY;
> + else if (!(thread_group_leader(task) && thread_group_empty(task)))
> + rc = -EALREADY;
> + read_unlock(&tasklist_lock);
> + if (!rc)
> + task->audit->contid = contid;
> + task_unlock(task);
> +
> + if (!audit_enabled)
> + return rc;

...but it is allowed to change it (assuming
capable(CAP_AUDIT_CONTROL), of course)? Seems like this might be more
immediately useful since we still live in the world of majority
privileged containers if we didn't allow changing it, in addition to
un-setting it.

Tycho


Re: [PATCH v1 2/2] Add selftests for pidfd polling

2019-04-25 Thread Tycho Andersen
On Thu, Apr 25, 2019 at 03:00:10PM -0400, Joel Fernandes (Google) wrote:
>
> +void *test_pidfd_poll_exec_thread(void *priv)

I think everything in this file can be static, there's this one and
3-4 below.

> +int test_pidfd_poll_exec(int use_waitpid)
> +{
> + int pid, pidfd = 0;
> + int status, ret;
> + pthread_t t1;
> + time_t prog_start = time(NULL);
> + const char *test_name = "pidfd_poll check for premature notification on 
> child thread exec";
> +
> + ksft_print_msg("Parent: pid: %d\n", getpid());
> + pid = pidfd_clone(CLONE_PIDFD, &pidfd, child_poll_exec_test);

If pidfd_clone() fails here, I think things will go haywire below.

> + ksft_print_msg("Parent: Waiting for Child (%d) to complete.\n", pid);
> +
> + if (use_waitpid) {
> + ret = waitpid(pid, &status, 0);
> + if (ret == -1)
> + ksft_print_msg("Parent: error\n");
> +
> + if (ret == pid)
> + ksft_print_msg("Parent: Child process waited for.\n");
> + } else {
> + poll_pidfd(test_name, pidfd);
> + }
> +
> + time_t prog_time = time(NULL) - prog_start;
> +
> + ksft_print_msg("Time waited for child: %lu\n", prog_time);
> +
> + close(pidfd);
> +
> + if (prog_time < CHILD_THREAD_MIN_WAIT || prog_time > 
> CHILD_THREAD_MIN_WAIT + 2)
> + ksft_exit_fail_msg("%s test: Failed\n", test_name);
> + else
> + ksft_test_result_pass("%s test: Passed\n", test_name);
> +}
> +
> +void *test_pidfd_poll_leader_exit_thread(void *priv)
> +{
> + char waittime[256];
> +
> + ksft_print_msg("Child Thread: starting. pid %d tid %d ; and sleeping\n",
> + getpid(), syscall(SYS_gettid));
> + sleep(CHILD_THREAD_MIN_WAIT);
> + ksft_print_msg("Child Thread: DONE. pid %d tid %d\n", getpid(), 
> syscall(SYS_gettid));
> + return NULL;
> +}
> +
> +static time_t *child_exit_secs;
> +static int child_poll_leader_exit_test(void *args)
> +{
> + pthread_t t1, t2;
> +
> + ksft_print_msg("Child: starting. pid %d tid %d\n", getpid(), 
> syscall(SYS_gettid));
> + pthread_create(&t1, NULL, test_pidfd_poll_leader_exit_thread, NULL);
> + pthread_create(&t2, NULL, test_pidfd_poll_leader_exit_thread, NULL);
> +
> + /*
> +  * glibc exit calls exit_group syscall, so explicity call exit only
> +  * so that only the group leader exits, leaving the threads alone.
> +  */
> + *child_exit_secs = time(NULL);
> + syscall(SYS_exit, 0);
> +}
> +
> +int test_pidfd_poll_leader_exit(int use_waitpid)
> +{
> + int pid, pidfd = 0;
> + int status, ret;
> + time_t prog_start = time(NULL);
> + const char *test_name = "pidfd_poll check for premature notification on 
> non-empty"
> + "group leader exit";
> +
> + child_exit_secs = mmap(NULL, sizeof *child_exit_secs, PROT_READ | 
> PROT_WRITE,
> + MAP_SHARED | MAP_ANONYMOUS, -1, 0);
> +
> + ksft_print_msg("Parent: pid: %d\n", getpid());
> + pid = pidfd_clone(CLONE_PIDFD, &pidfd, child_poll_leader_exit_test);

Same problem here, I think.

Tycho


Re: [PATCH 1/2] selftests/seccomp: Prepare for exclusive seccomp flags

2019-04-24 Thread Tycho Andersen
On Wed, Apr 24, 2019 at 09:37:55AM -0700, Kees Cook wrote:
> Some seccomp flags will become exclusive, so the selftest needs to
> be adjusted to mask those out and test them individually for the "all
> flags" tests.
> 
> Cc: sta...@vger.kernel.org # v5.0+
> Signed-off-by: Kees Cook 

Whoops, thanks for this too.

Reviewed-by: Tycho Andersen 


Re: SECCOMP_RET_USER_NOTIF: listener improvements

2019-04-24 Thread Tycho Andersen
On Wed, Apr 24, 2019 at 05:04:26PM +0200, Christian Brauner wrote:
> Hey everyone,
> 
> So I was working on making use of the seccomp listener stuff and I
> stumbled upon a problem. Imagine a scenario where:
> 
> 1. Task T1 installs Filter F1 and gets and listener fd for that filter FD1
> 2. T1 sends FD1 via SCM_RIGHTS to task T2
>T2 now holds a reference to the same underlying struct file as FD1 via FD2
> 3. T2 registers FD2 in an event loop and starts listening for events
> 4. T1 exits and wipes FD1
> 
> Now, T2 still holds a reference to the filter via FD2 which references
> the same underlying file as FD1 which has the seccomp filter stashed in
> private_data.
> So T2 will never get notified that the filter is essentially unused and
> doesn't know when to exit, i.e. it has no way of telling when T1 and all
> of its children using the same filter are gone.
> 
> I think we should have a way to do this

Since the only way we ever allow creating a struct file * that points
to a struct seccomp_filter *, if there is a notifier attached, the
number of tasks still being monitored by a particular filter should be
filter->usage - 1 (assuming there is a notifier attached). So we could
augment __put_seccomp_filter() to check for this and send out a
message with a SECCOMP_NOTIF_FLAG_DEAD flag or something.

> *or* alternatively have a way to attach a process to an existing
> filter.

I also think this wouldn't be too hard, since the struct file * has a
reference to the filter. So I guess the question is: which of these
makes more sense?

Tycho


Re: [PATCH 2/2] seccomp: disallow NEW_LISTENER and TSYNC flags

2019-04-23 Thread Tycho Andersen
On Tue, Apr 23, 2019 at 04:31:45PM -0700, Kees Cook wrote:
> On Tue, Apr 23, 2019 at 3:09 PM Kees Cook  wrote:
> >
> > On Wed, Mar 6, 2019 at 12:14 PM Tycho Andersen  wrote:
> > >
> > > As the comment notes, the return codes for TSYNC and NEW_LISTENER 
> > > conflict,
> > > because they both return positive values, one in the case of success and
> > > one in the case of error. So, let's disallow both of these flags together.
> > >
> > > While this is technically a userspace break, all the users I know of are
> > > still waiting on me to land this feature in libseccomp, so I think it'll 
> > > be
> > > safe. Also, at present my use case doesn't require TSYNC at all, so this
> > > isn't a big deal to disallow. If someone wanted to support this, a path
> > > forward would be to add a new flag like
> > > TSYNC_AND_LISTENER_YES_I_UNDERSTAND_THAT_TSYNC_WILL_JUST_RETURN_EAGAIN, 
> > > but
> > > the use cases are so different I don't see it really happening.
> > >
> > > Finally, it's worth noting that this does actually fix a UAF issue: at 
> > > the end
> > > of seccomp_set_mode_filter(), we have:
> > >
> > > if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> > > if (ret < 0) {
> > > listener_f->private_data = NULL;
> > > fput(listener_f);
> > > put_unused_fd(listener);
> > > } else {
> > > fd_install(listener, listener_f);
> > > ret = listener;
> > > }
> > > }
> > > out_free:
> > > seccomp_filter_free(prepared);
> > >
> > > But if ret > 0 because TSYNC raced, we'll install the listener fd and 
> > > then free
> > > the filter out from underneath it, causing a UAF when the task closes it 
> > > or
> > > dies. This patch also switches the condition to be simply if (ret), so 
> > > that
> > > if someone does add the flag mentioned above, they won't have to remember
> > > to fix this too.
> > >
> > > Signed-off-by: Tycho Andersen 
> > > Fixes: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")
> > > CC: sta...@vger.kernel.org # v5.0+
> >
> > Thanks! Sorry I missed this. James, can you take this for Linus's
> > fixes for v5.1? (Or should I send a pull request to you?)
> >
> > Acked-by: Kees Cook 
> >
> > Let's also add:
> >
> > Reported-by: syzbot+b562969adb2e04af3...@syzkaller.appspotmail.com
> >
> > > ---
> > >  kernel/seccomp.c | 17 +++--
> > >  1 file changed, 15 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > > index d0d355ded2f4..79bada51091b 100644
> > > --- a/kernel/seccomp.c
> > > +++ b/kernel/seccomp.c
> > > @@ -500,7 +500,10 @@ seccomp_prepare_user_filter(const char __user 
> > > *user_filter)
> > >   *
> > >   * Caller must be holding current->sighand->siglock lock.
> > >   *
> > > - * Returns 0 on success, -ve on error.
> > > + * Returns 0 on success, -ve on error, or
> > > + *   - in TSYNC mode: the pid of a thread which was either not in the 
> > > correct
> > > + * seccomp mode or did not have an ancestral seccomp filter
> > > + *   - in NEW_LISTENER mode: the fd of the new listener
> > >   */
> > >  static long seccomp_attach_filter(unsigned int flags,
> > >   struct seccomp_filter *filter)
> > > @@ -1256,6 +1259,16 @@ static long seccomp_set_mode_filter(unsigned int 
> > > flags,
> > > if (flags & ~SECCOMP_FILTER_FLAG_MASK)
> > > return -EINVAL;
> > >
> > > +   /*
> > > +* In the successful case, NEW_LISTENER returns the new listener 
> > > fd.
> > > +* But in the failure case, TSYNC returns the thread that died. 
> > > If you
> > > +* combine these two flags, there's no way to tell whether 
> > > something
> > > +* succeded or failed. So, let's disallow this combination.
> >
> > also a tiny typo: succeeded
> >
> > > +*/
> > > +   if ((flags & SECCOMP_FILTER_FLAG_TSYNC) &&
> > > +   (flags && SECCOMP_FILTER_FLAG_NEW_LISTENER))
> 
> also a typo: && should be &

Oh, yes. Do you want me to send another version?

Tycho


Re: [PATCH] selftests/seccomp: Handle namespace failures gracefully

2019-04-12 Thread Tycho Andersen
On Fri, Apr 12, 2019 at 11:07:11AM -0600, shuah wrote:
> On 4/12/19 9:25 AM, Tycho Andersen wrote:
> > On Thu, Apr 11, 2019 at 04:56:31PM -0700, Kees Cook wrote:
> > > When running without USERNS or PIDNS the seccomp test would hang since
> > > it was waiting forever for the child to trigger the user notification
> > > since it seems the glibc() abort handler makes a call to getpid(),
> > > which would trap again. This changes the getpid filter to getppid, and
> > > makes sure ASSERTs execute to stop from spawning the listener.
> > > 
> > > Reported-by: Shuah Khan 
> > > Fixes: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")
> > > Signed-off-by: Kees Cook 
> > 
> > Sorry for the delay, thanks for looking at this!
> > 
> > Reviewed-by: Tycho Andersen 
> > 
> 
> Thanks both. Should it go into stables. I will pull this and
> add stable if that is appropriate.

Yes, for 5.0+ that sounds good.

Thanks!

Tycho


Re: [PATCH] selftests/seccomp: Handle namespace failures gracefully

2019-04-12 Thread Tycho Andersen
On Thu, Apr 11, 2019 at 04:56:31PM -0700, Kees Cook wrote:
> When running without USERNS or PIDNS the seccomp test would hang since
> it was waiting forever for the child to trigger the user notification
> since it seems the glibc() abort handler makes a call to getpid(),
> which would trap again. This changes the getpid filter to getppid, and
> makes sure ASSERTs execute to stop from spawning the listener.
> 
> Reported-by: Shuah Khan 
> Fixes: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")
> Signed-off-by: Kees Cook 

Sorry for the delay, thanks for looking at this!

Reviewed-by: Tycho Andersen 


Re: [PATCH RFC 2/2] Add selftests for pidfd polling

2019-04-12 Thread Tycho Andersen
On Thu, Apr 11, 2019 at 01:50:43PM -0400, Joel Fernandes (Google) wrote:
> Other than verifying pidfd based polling, the tests make sure that
> wait semantics are preserved with the pidfd poll. Notably the 2 cases:
> 1. If a thread group leader exits while threads still there, then no
>pidfd poll notifcation should happen.
> 2. If a non-thread group leader does an execve, then the thread group
>leader is signaled to exit and is replaced with the execing thread
>as the new leader, however the parent is not notified in this case.
> 
> Signed-off-by: Joel Fernandes (Google) 
> ---
>  tools/testing/selftests/pidfd/Makefile |   2 +-
>  tools/testing/selftests/pidfd/pidfd_test.c | 216 -
>  2 files changed, 208 insertions(+), 10 deletions(-)
> 
> diff --git a/tools/testing/selftests/pidfd/Makefile 
> b/tools/testing/selftests/pidfd/Makefile
> index deaf8073bc06..4b31c14f273c 100644
> --- a/tools/testing/selftests/pidfd/Makefile
> +++ b/tools/testing/selftests/pidfd/Makefile
> @@ -1,4 +1,4 @@
> -CFLAGS += -g -I../../../../usr/include/
> +CFLAGS += -g -I../../../../usr/include/ -lpthread
>  
>  TEST_GEN_PROGS := pidfd_test
>  
> diff --git a/tools/testing/selftests/pidfd/pidfd_test.c 
> b/tools/testing/selftests/pidfd/pidfd_test.c
> index d59378a93782..4d5206280091 100644
> --- a/tools/testing/selftests/pidfd/pidfd_test.c
> +++ b/tools/testing/selftests/pidfd/pidfd_test.c
> @@ -4,18 +4,26 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  #include "../kselftest.h"
>  
> +#define CHILD_THREAD_MIN_WAIT 3 /* seconds */
> +#define MAX_EVENTS 5
> +#define __NR_pidfd_send_signal 424
> +
>  static inline int sys_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
>   unsigned int flags)
>  {
> @@ -30,6 +38,22 @@ static void set_signal_received_on_sigusr1(int sig)
>   signal_received = 1;
>  }
>  
> +static int open_pidfd(const char *test_name, pid_t pid)
> +{
> + char buf[256];
> + int pidfd;
> +
> + snprintf(buf, sizeof(buf), "/proc/%d", pid);
> + pidfd = open(buf, O_DIRECTORY | O_CLOEXEC);
> +
> + if (pidfd < 0)
> + ksft_exit_fail_msg(
> + "%s test: Failed to open process file descriptor\n",
> + test_name);
> +
> + return pidfd;
> +}
> +
>  /*
>   * Straightforward test to see whether pidfd_send_signal() works is to send
>   * a signal to ourself.
> @@ -87,7 +111,6 @@ static int wait_for_pid(pid_t pid)
>  static int test_pidfd_send_signal_exited_fail(void)
>  {
>   int pidfd, ret, saved_errno;
> - char buf[256];
>   pid_t pid;
>   const char *test_name = "pidfd_send_signal signal exited process";
>  
> @@ -99,17 +122,10 @@ static int test_pidfd_send_signal_exited_fail(void)
>   if (pid == 0)
>   _exit(EXIT_SUCCESS);
>  
> - snprintf(buf, sizeof(buf), "/proc/%d", pid);
> -
> - pidfd = open(buf, O_DIRECTORY | O_CLOEXEC);
> + pidfd = open_pidfd(test_name, pid);
>  
>   (void)wait_for_pid(pid);
>  
> - if (pidfd < 0)
> - ksft_exit_fail_msg(
> - "%s test: Failed to open process file descriptor\n",
> - test_name);
> -
>   ret = sys_pidfd_send_signal(pidfd, 0, NULL, 0);
>   saved_errno = errno;
>   close(pidfd);
> @@ -368,10 +384,192 @@ static int test_pidfd_send_signal_syscall_support(void)
>   return 0;
>  }
>  
> +void *test_pidfd_poll_exec_thread(void *priv)

I think you can do static here?

> +{
> + char waittime[256];
> +
> + ksft_print_msg("Child Thread: starting. pid %d tid %d ; and sleeping\n",
> + getpid(), syscall(SYS_gettid));
> + ksft_print_msg("Child Thread: doing exec of sleep\n");
> +
> + sprintf(waittime, "%d", CHILD_THREAD_MIN_WAIT);
> + execl("/bin/sleep", "sleep", waittime, (char *)NULL);
> +
> + ksft_print_msg("Child Thread: DONE. pid %d tid %d\n",
> + getpid(), syscall(SYS_gettid));

You execl(), but then print stuff after that? Might also be worth
switching to execlp().

> + return NULL;
> +}
> +
> +static int poll_pidfd(const char *test_name, int pidfd)
> +{
> + int c;
> + int epoll_fd = epoll_create1(0);

A style point, but I find it's best not to do resource allocation in
variable declarations like this. It breaks up the usual pattern of:

ret = -ENOMEM;
resource = allocate();
if (allocation_failed(resource))
goto err;

...

out:
free(resource);
err:
return ret;

You're not closing this fd on every path (they all exit [for now :D]
so it's probably ok), but it might be nice to make this match a more
regular pattern.

> + struct epoll_event event, events[MAX_EVENTS];
> +
> + if (epoll_fd == -1)
> + ksft_exit_fail_msg("%s test: Failed to create epoll file 

[PATCH v2] x86/entry: re-enable interrupts before exiting

2019-04-05 Thread Tycho Andersen
If the kernel oopses in an interrupt, nothing re-enables interrupts:

Aug 23 19:30:27 xpfo kernel: [   38.302714] BUG: sleeping function called from 
invalid context at
./include/linux/percpu-rwsem.h:33
Aug 23 19:30:27 xpfo kernel: [   38.303837] in_atomic(): 0, irqs_disabled(): 1, 
pid: 1970, name:
lkdtm_xpfo_test
Aug 23 19:30:27 xpfo kernel: [   38.304758] CPU: 3 PID: 1970 Comm: 
lkdtm_xpfo_test Tainted: G  D
4.13.0-rc5+ #228
Aug 23 19:30:27 xpfo kernel: [   38.305813] Hardware name: QEMU Standard PC 
(i440FX + PIIX, 1996), BIOS
1.10.1-1ubuntu1 04/01/2014
Aug 23 19:30:27 xpfo kernel: [   38.306926] Call Trace:
Aug 23 19:30:27 xpfo kernel: [   38.307243]  dump_stack+0x63/0x8b
Aug 23 19:30:27 xpfo kernel: [   38.307665]  ___might_sleep+0xec/0x110
Aug 23 19:30:27 xpfo kernel: [   38.308139]  __might_sleep+0x45/0x80
Aug 23 19:30:27 xpfo kernel: [   38.308593]  exit_signals+0x21/0x1c0
Aug 23 19:30:27 xpfo kernel: [   38.309046]  ? 
blocking_notifier_call_chain+0x11/0x20
Aug 23 19:30:27 xpfo kernel: [   38.309677]  do_exit+0x98/0xbf0
Aug 23 19:30:27 xpfo kernel: [   38.310078]  ? smp_reader+0x27/0x40 [lkdtm]
Aug 23 19:30:27 xpfo kernel: [   38.310604]  ? kthread+0x10f/0x150
Aug 23 19:30:27 xpfo kernel: [   38.311045]  ? read_user_with_flags+0x60/0x60 
[lkdtm]
Aug 23 19:30:27 xpfo kernel: [   38.311680]  rewind_stack_do_exit+0x17/0x20

do_exit() expects to be called in a well-defined environment, so let's
re-enable interrupts after unwinding the stack, in case they were disabled.

Note that if any spinlocks are held, etc. we'll also get the above warning,
so this isn't a silver bullet. So, let's add a C helper in case someone
wants to add fancier lock busting or if we've forgotten to unwind something
else.

Signed-off-by: Tycho Andersen 
CC: Josh Poimboeuf 
---
I split this out from the XPFO series since it's mostly unrelated, and is
just a bug I found while working on that series.

v2: whitelist __finish_rewind_stack_do_exit() as a noreturn function in
objtool (Josh)
---
 arch/x86/entry/common.c   | 10 ++
 arch/x86/entry/entry_32.S |  2 +-
 arch/x86/entry/entry_64.S |  2 +-
 tools/objtool/check.c |  1 +
 4 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 7bc105f47d21..4e9c54e0495f 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -427,3 +427,13 @@ __visible long do_fast_syscall_32(struct pt_regs *regs)
 #endif
 }
 #endif
+
+void __noreturn __finish_rewind_stack_do_exit(long code)
+{
+   /*
+* If we oopsed in an interrupt handler, interrupts may be off. Let's 
turn
+* them back on before going back to "normal" code.
+*/
+local_irq_enable();
+do_exit(code);
+}
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index d309f30cf7af..095b8770e3b0 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1507,6 +1507,6 @@ ENTRY(rewind_stack_do_exit)
movlPER_CPU_VAR(cpu_current_top_of_stack), %esi
leal-TOP_OF_KERNEL_STACK_PADDING-PTREGS_SIZE(%esi), %esp
 
-   calldo_exit
+   call__finish_rewind_stack_do_exit
 1: jmp 1b
 END(rewind_stack_do_exit)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 1f0efdb7b629..890a66d17f31 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1672,5 +1672,5 @@ ENTRY(rewind_stack_do_exit)
leaq-PTREGS_SIZE(%rax), %rsp
UNWIND_HINT_FUNC sp_offset=PTREGS_SIZE
 
-   calldo_exit
+   call__finish_rewind_stack_do_exit
 END(rewind_stack_do_exit)
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 5dde107083c6..f88f8f0b3f6e 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -156,6 +156,7 @@ static int __dead_end_function(struct objtool_file *file, 
struct symbol *func,
"__stack_chk_fail",
"panic",
"do_exit",
+   "__finish_rewind_stack_do_exit",
"do_task_dead",
"__module_put_and_exit",
"complete_and_exit",
-- 
2.19.1



Re: [PATCH] x86/entry: re-enable interrupts before exiting

2019-04-05 Thread Tycho Andersen
On Fri, Apr 05, 2019 at 10:58:33AM -0500, Josh Poimboeuf wrote:
> On Fri, Apr 05, 2019 at 09:35:24AM -0600, Tycho Andersen wrote:
> > If the kernel oopses in an interrupt, nothing re-enables interrupts:
> > 
> > Aug 23 19:30:27 xpfo kernel: [   38.302714] BUG: sleeping function called 
> > from invalid context at
> > ./include/linux/percpu-rwsem.h:33
> > Aug 23 19:30:27 xpfo kernel: [   38.303837] in_atomic(): 0, 
> > irqs_disabled(): 1, pid: 1970, name:
> > lkdtm_xpfo_test
> > Aug 23 19:30:27 xpfo kernel: [   38.304758] CPU: 3 PID: 1970 Comm: 
> > lkdtm_xpfo_test Tainted: G  D
> > 4.13.0-rc5+ #228
> > Aug 23 19:30:27 xpfo kernel: [   38.305813] Hardware name: QEMU Standard PC 
> > (i440FX + PIIX, 1996), BIOS
> > 1.10.1-1ubuntu1 04/01/2014
> > Aug 23 19:30:27 xpfo kernel: [   38.306926] Call Trace:
> > Aug 23 19:30:27 xpfo kernel: [   38.307243]  dump_stack+0x63/0x8b
> > Aug 23 19:30:27 xpfo kernel: [   38.307665]  ___might_sleep+0xec/0x110
> > Aug 23 19:30:27 xpfo kernel: [   38.308139]  __might_sleep+0x45/0x80
> > Aug 23 19:30:27 xpfo kernel: [   38.308593]  exit_signals+0x21/0x1c0
> > Aug 23 19:30:27 xpfo kernel: [   38.309046]  ? 
> > blocking_notifier_call_chain+0x11/0x20
> > Aug 23 19:30:27 xpfo kernel: [   38.309677]  do_exit+0x98/0xbf0
> > Aug 23 19:30:27 xpfo kernel: [   38.310078]  ? smp_reader+0x27/0x40 [lkdtm]
> > Aug 23 19:30:27 xpfo kernel: [   38.310604]  ? kthread+0x10f/0x150
> > Aug 23 19:30:27 xpfo kernel: [   38.311045]  ? 
> > read_user_with_flags+0x60/0x60 [lkdtm]
> > Aug 23 19:30:27 xpfo kernel: [   38.311680]  rewind_stack_do_exit+0x17/0x20
> > 
> > do_exit() expects to be called in a well-defined environment, so let's
> > re-enable interrupts after unwinding the stack, in case they were disabled.
> > 
> > Note that if any spinlocks are held, etc. we'll also get the above warning,
> > so this isn't a silver bullet. So, let's add a C helper in case someone
> > wants to add fancier lock busting or if we've forgotten to unwind something
> > else.
> > 
> > I've had to add back in the hack that Josh removed in 8c1f75587a18
> > ("x86/entry/64: Add unwind hint annotations") with the loop after the call,
> > because for whatever reason without that I get a warning:
> > 
> >   AS  arch/x86/entry/entry_64.o
> > arch/x86/entry/entry_64.o: warning: objtool: .entry.text: unexpected end of 
> > section
> > 
> > It seems to actually work fine for me though, since the new helper is also
> > __noreturn. Perhaps there's a better way to do this?
> 
> Unfortunately, objtool doesn't have a way to detect noreturn functions
> in other objects, so they're hard-coded in a list.  You can add
> __finish_rewind_stack_do_exit to the global_noreturns array in
> tools/objtool/check.c.

Awesome, this is the magic I was missing. Thanks!

Tycho


[PATCH] x86/entry: re-enable interrupts before exiting

2019-04-05 Thread Tycho Andersen
If the kernel oopses in an interrupt, nothing re-enables interrupts:

Aug 23 19:30:27 xpfo kernel: [   38.302714] BUG: sleeping function called from 
invalid context at
./include/linux/percpu-rwsem.h:33
Aug 23 19:30:27 xpfo kernel: [   38.303837] in_atomic(): 0, irqs_disabled(): 1, 
pid: 1970, name:
lkdtm_xpfo_test
Aug 23 19:30:27 xpfo kernel: [   38.304758] CPU: 3 PID: 1970 Comm: 
lkdtm_xpfo_test Tainted: G  D
4.13.0-rc5+ #228
Aug 23 19:30:27 xpfo kernel: [   38.305813] Hardware name: QEMU Standard PC 
(i440FX + PIIX, 1996), BIOS
1.10.1-1ubuntu1 04/01/2014
Aug 23 19:30:27 xpfo kernel: [   38.306926] Call Trace:
Aug 23 19:30:27 xpfo kernel: [   38.307243]  dump_stack+0x63/0x8b
Aug 23 19:30:27 xpfo kernel: [   38.307665]  ___might_sleep+0xec/0x110
Aug 23 19:30:27 xpfo kernel: [   38.308139]  __might_sleep+0x45/0x80
Aug 23 19:30:27 xpfo kernel: [   38.308593]  exit_signals+0x21/0x1c0
Aug 23 19:30:27 xpfo kernel: [   38.309046]  ? 
blocking_notifier_call_chain+0x11/0x20
Aug 23 19:30:27 xpfo kernel: [   38.309677]  do_exit+0x98/0xbf0
Aug 23 19:30:27 xpfo kernel: [   38.310078]  ? smp_reader+0x27/0x40 [lkdtm]
Aug 23 19:30:27 xpfo kernel: [   38.310604]  ? kthread+0x10f/0x150
Aug 23 19:30:27 xpfo kernel: [   38.311045]  ? read_user_with_flags+0x60/0x60 
[lkdtm]
Aug 23 19:30:27 xpfo kernel: [   38.311680]  rewind_stack_do_exit+0x17/0x20

do_exit() expects to be called in a well-defined environment, so let's
re-enable interrupts after unwinding the stack, in case they were disabled.

Note that if any spinlocks are held, etc. we'll also get the above warning,
so this isn't a silver bullet. So, let's add a C helper in case someone
wants to add fancier lock busting or if we've forgotten to unwind something
else.

I've had to add back in the hack that Josh removed in 8c1f75587a18
("x86/entry/64: Add unwind hint annotations") with the loop after the call,
because for whatever reason without that I get a warning:

  AS  arch/x86/entry/entry_64.o
arch/x86/entry/entry_64.o: warning: objtool: .entry.text: unexpected end of 
section

It seems to actually work fine for me though, since the new helper is also
__noreturn. Perhaps there's a better way to do this?

Signed-off-by: Tycho Andersen 
CC: Josh Poimboeuf 
---
I split this out from the XPFO series since it's mostly unrelated, and is
just a bug I found while working on that series.
---
 arch/x86/entry/common.c   | 10 ++
 arch/x86/entry/entry_32.S |  2 +-
 arch/x86/entry/entry_64.S |  3 ++-
 3 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 7bc105f47d21..4e9c54e0495f 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -427,3 +427,13 @@ __visible long do_fast_syscall_32(struct pt_regs *regs)
 #endif
 }
 #endif
+
+void __noreturn __finish_rewind_stack_do_exit(long code)
+{
+   /*
+* If we oopsed in an interrupt handler, interrupts may be off. Let's 
turn
+* them back on before going back to "normal" code.
+*/
+local_irq_enable();
+do_exit(code);
+}
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index d309f30cf7af..095b8770e3b0 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1507,6 +1507,6 @@ ENTRY(rewind_stack_do_exit)
movlPER_CPU_VAR(cpu_current_top_of_stack), %esi
leal-TOP_OF_KERNEL_STACK_PADDING-PTREGS_SIZE(%esi), %esp
 
-   calldo_exit
+   call__finish_rewind_stack_do_exit
 1: jmp 1b
 END(rewind_stack_do_exit)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 1f0efdb7b629..c6166133e45d 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1672,5 +1672,6 @@ ENTRY(rewind_stack_do_exit)
leaq-PTREGS_SIZE(%rax), %rsp
UNWIND_HINT_FUNC sp_offset=PTREGS_SIZE
 
-   calldo_exit
+   call__finish_rewind_stack_do_exit
+1: jmp 1b
 END(rewind_stack_do_exit)
-- 
2.19.1



Re: [RFC 02/15] slub: Add isolate() and migrate() methods

2019-03-08 Thread Tycho Andersen
On Sat, Mar 09, 2019 at 06:53:22AM +1100, Tobin C. Harding wrote:
> On Fri, Mar 08, 2019 at 09:22:37AM -0700, Tycho Andersen wrote:
> > On Fri, Mar 08, 2019 at 04:15:46PM +, Christopher Lameter wrote:
> > > On Fri, 8 Mar 2019, Tycho Andersen wrote:
> > > 
> > > > On Fri, Mar 08, 2019 at 03:14:13PM +1100, Tobin C. Harding wrote:
> > > > > diff --git a/mm/slab_common.c b/mm/slab_common.c
> > > > > index f9d89c1b5977..754acdb292e4 100644
> > > > > --- a/mm/slab_common.c
> > > > > +++ b/mm/slab_common.c
> > > > > @@ -298,6 +298,10 @@ int slab_unmergeable(struct kmem_cache *s)
> > > > >   if (!is_root_cache(s))
> > > > >   return 1;
> > > > >
> > > > > + /*
> > > > > +  * s->isolate and s->migrate imply s->ctor so no need to
> > > > > +  * check them explicitly.
> > > > > +  */
> > > >
> > > > Shouldn't this implication go the other way, i.e.
> > > > s->ctor => s->isolate & s->migrate
> > > 
> > > A cache can have a constructor but the object may not be movable (I.e.
> > > currently dentries and inodes).
> > 
> > Yep, thanks. Somehow I got confused by the comment.
> 
> I removed code here from the original RFC-v2, if this comment is
> confusing perhaps we are better off without it.

I'd say leave it, unless others have objections. I got lost in the
"no need" and return true for unmergable too-many-nots goop, but it's
definitely worth noting that one implies the other. An alternative
might be to move it to a comment on the struct member instead.

Tycho


Re: [RFC 02/15] slub: Add isolate() and migrate() methods

2019-03-08 Thread Tycho Andersen
On Fri, Mar 08, 2019 at 04:15:46PM +, Christopher Lameter wrote:
> On Fri, 8 Mar 2019, Tycho Andersen wrote:
> 
> > On Fri, Mar 08, 2019 at 03:14:13PM +1100, Tobin C. Harding wrote:
> > > diff --git a/mm/slab_common.c b/mm/slab_common.c
> > > index f9d89c1b5977..754acdb292e4 100644
> > > --- a/mm/slab_common.c
> > > +++ b/mm/slab_common.c
> > > @@ -298,6 +298,10 @@ int slab_unmergeable(struct kmem_cache *s)
> > >   if (!is_root_cache(s))
> > >   return 1;
> > >
> > > + /*
> > > +  * s->isolate and s->migrate imply s->ctor so no need to
> > > +  * check them explicitly.
> > > +  */
> >
> > Shouldn't this implication go the other way, i.e.
> > s->ctor => s->isolate & s->migrate
> 
> A cache can have a constructor but the object may not be movable (I.e.
> currently dentries and inodes).

Yep, thanks. Somehow I got confused by the comment.

Tycho


Re: [RFC 07/15] slub: Add defrag_used_ratio field and sysfs support

2019-03-08 Thread Tycho Andersen
On Fri, Mar 08, 2019 at 03:14:18PM +1100, Tobin C. Harding wrote:
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3642,6 +3642,7 @@ static int kmem_cache_open(struct kmem_cache *s, 
> slab_flags_t flags)
>  
>   set_cpu_partial(s);
>  
> + s->defrag_used_ratio = 30;
>  #ifdef CONFIG_NUMA
>   s->remote_node_defrag_ratio = 1000;
>  #endif
> @@ -5261,6 +5262,28 @@ static ssize_t destroy_by_rcu_show(struct kmem_cache 
> *s, char *buf)
>  }
>  SLAB_ATTR_RO(destroy_by_rcu);
>  
> +static ssize_t defrag_used_ratio_show(struct kmem_cache *s, char *buf)
> +{
> + return sprintf(buf, "%d\n", s->defrag_used_ratio);
> +}
> +
> +static ssize_t defrag_used_ratio_store(struct kmem_cache *s,
> +const char *buf, size_t length)
> +{
> + unsigned long ratio;
> + int err;
> +
> + err = kstrtoul(buf, 10, &ratio);
> + if (err)
> + return err;
> +
> + if (ratio <= 100)
> + s->defrag_used_ratio = ratio;
else
return -EINVAL;

maybe?

Tycho


Re: [RFC 02/15] slub: Add isolate() and migrate() methods

2019-03-08 Thread Tycho Andersen
On Fri, Mar 08, 2019 at 03:14:13PM +1100, Tobin C. Harding wrote:
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index f9d89c1b5977..754acdb292e4 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -298,6 +298,10 @@ int slab_unmergeable(struct kmem_cache *s)
>   if (!is_root_cache(s))
>   return 1;
>  
> + /*
> +  * s->isolate and s->migrate imply s->ctor so no need to
> +  * check them explicitly.
> +  */

Shouldn't this implication go the other way, i.e.
s->ctor => s->isolate & s->migrate
?

>   if (s->ctor)
>   return 1;

Tycho


Re: [PATCH 2/2] seccomp: disallow NEW_LISTENER and TSYNC flags

2019-03-06 Thread Tycho Andersen
On Wed, Mar 06, 2019 at 10:02:25PM +0100, Christian Brauner wrote:
> On Wed, Mar 6, 2019 at 9:46 PM Tycho Andersen  wrote:
> >
> > On Wed, Mar 06, 2019 at 09:39:35PM +0100, Christian Brauner wrote:
> > > > +
> > > > /* Prepare the new filter before holding any locks. */
> > > > prepared = seccomp_prepare_user_filter(filter);
> > > > if (IS_ERR(prepared))
> > > > @@ -1302,7 +1315,7 @@ static long seccomp_set_mode_filter(unsigned int 
> > > > flags,
> > > > mutex_unlock(¤t->signal->cred_guard_mutex);
> > > >  out_put_fd:
> > > > if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> > > > -   if (ret < 0) {
> > > > +   if (ret) {
> > >
> > > Why that change but keep checking if (ret < 0) further up?
> >
> > Not sure what you mean here. The only other place I see that we check
> > something is < 0 in that function is the return value of
> > get_unused_fd_flags(), which looks right to me?
> 
> The change just seemed it had nothing to do with the rest of the patch.
> Just making sure this didn't happen on accident and would cause regressions.

No, not on accident :). See the second half of the patch notes.

I can split it out into two separate patches if that makes more sense.
In fact this hunk alone fixes the UAF, but you still get non-sensical
return results even if it doesn't do anything terrible, hence the
first hunk.

Cheers,

Tycho


Re: [PATCH 2/2] seccomp: disallow NEW_LISTENER and TSYNC flags

2019-03-06 Thread Tycho Andersen
On Wed, Mar 06, 2019 at 09:39:35PM +0100, Christian Brauner wrote:
> > +
> > /* Prepare the new filter before holding any locks. */
> > prepared = seccomp_prepare_user_filter(filter);
> > if (IS_ERR(prepared))
> > @@ -1302,7 +1315,7 @@ static long seccomp_set_mode_filter(unsigned int 
> > flags,
> > mutex_unlock(¤t->signal->cred_guard_mutex);
> >  out_put_fd:
> > if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> > -   if (ret < 0) {
> > +   if (ret) {
> 
> Why that change but keep checking if (ret < 0) further up?

Not sure what you mean here. The only other place I see that we check
something is < 0 in that function is the return value of
get_unused_fd_flags(), which looks right to me?

Tycho


[PATCH 2/2] seccomp: disallow NEW_LISTENER and TSYNC flags

2019-03-06 Thread Tycho Andersen
As the comment notes, the return codes for TSYNC and NEW_LISTENER conflict,
because they both return positive values, one in the case of success and
one in the case of error. So, let's disallow both of these flags together.

While this is technically a userspace break, all the users I know of are
still waiting on me to land this feature in libseccomp, so I think it'll be
safe. Also, at present my use case doesn't require TSYNC at all, so this
isn't a big deal to disallow. If someone wanted to support this, a path
forward would be to add a new flag like
TSYNC_AND_LISTENER_YES_I_UNDERSTAND_THAT_TSYNC_WILL_JUST_RETURN_EAGAIN, but
the use cases are so different I don't see it really happening.

Finally, it's worth noting that this does actually fix a UAF issue: at the end
of seccomp_set_mode_filter(), we have:

if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
if (ret < 0) {
listener_f->private_data = NULL;
fput(listener_f);
put_unused_fd(listener);
} else {
fd_install(listener, listener_f);
ret = listener;
}
}
out_free:
seccomp_filter_free(prepared);

But if ret > 0 because TSYNC raced, we'll install the listener fd and then free
the filter out from underneath it, causing a UAF when the task closes it or
dies. This patch also switches the condition to be simply if (ret), so that
if someone does add the flag mentioned above, they won't have to remember
to fix this too.

Signed-off-by: Tycho Andersen 
Fixes: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")
CC: sta...@vger.kernel.org # v5.0+
---
 kernel/seccomp.c | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index d0d355ded2f4..79bada51091b 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -500,7 +500,10 @@ seccomp_prepare_user_filter(const char __user *user_filter)
  *
  * Caller must be holding current->sighand->siglock lock.
  *
- * Returns 0 on success, -ve on error.
+ * Returns 0 on success, -ve on error, or
+ *   - in TSYNC mode: the pid of a thread which was either not in the correct
+ * seccomp mode or did not have an ancestral seccomp filter
+ *   - in NEW_LISTENER mode: the fd of the new listener
  */
 static long seccomp_attach_filter(unsigned int flags,
  struct seccomp_filter *filter)
@@ -1256,6 +1259,16 @@ static long seccomp_set_mode_filter(unsigned int flags,
if (flags & ~SECCOMP_FILTER_FLAG_MASK)
return -EINVAL;
 
+   /*
+* In the successful case, NEW_LISTENER returns the new listener fd.
+* But in the failure case, TSYNC returns the thread that died. If you
+* combine these two flags, there's no way to tell whether something
+* succeded or failed. So, let's disallow this combination.
+*/
+   if ((flags & SECCOMP_FILTER_FLAG_TSYNC) &&
+   (flags && SECCOMP_FILTER_FLAG_NEW_LISTENER))
+   return -EINVAL;
+
/* Prepare the new filter before holding any locks. */
prepared = seccomp_prepare_user_filter(filter);
if (IS_ERR(prepared))
@@ -1302,7 +1315,7 @@ static long seccomp_set_mode_filter(unsigned int flags,
mutex_unlock(¤t->signal->cred_guard_mutex);
 out_put_fd:
if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
-   if (ret < 0) {
+   if (ret) {
listener_f->private_data = NULL;
fput(listener_f);
put_unused_fd(listener);
-- 
2.19.1



[PATCH 1/2] seccomp: fix up grammar in comment

2019-03-06 Thread Tycho Andersen
This sentence is kind of a train wreck anyway, but at least dropping the
extra pronoun helps somewhat.

Signed-off-by: Tycho Andersen 
---
 kernel/seccomp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index e815781ed751..d0d355ded2f4 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -329,7 +329,7 @@ static int is_ancestor(struct seccomp_filter *parent,
  * Expects sighand and cred_guard_mutex locks to be held.
  *
  * Returns 0 on success, -ve on error, or the pid of a thread which was
- * either not in the correct seccomp mode or it did not have an ancestral
+ * either not in the correct seccomp mode or did not have an ancestral
  * seccomp filter.
  */
 static inline pid_t seccomp_can_sync_threads(void)
-- 
2.19.1



Re: [PATCH 2/2] seccomp.2: document userspace notification

2019-03-01 Thread Tycho Andersen
On Fri, Mar 01, 2019 at 04:16:27PM +0100, Michael Kerrisk (man-pages) wrote:
> Hello Tycho,
> 
> On 3/1/19 3:53 PM, Tycho Andersen wrote:
> > On Thu, Feb 28, 2019 at 01:52:19PM +0100, Michael Kerrisk (man-pages) wrote:
> >>> +a notification will be sent to this fd. See "Userspace Notification" 
> >>> below for
> >>
> >> s/fd/file descriptor/ throughout please.
> > 
> > Will do.
> > 
> >>> +more details.
> >>
> >> I think the description here could be better worded as something like:
> >>
> >> SECCOMP_FILTER_FLAG_NEW_LISTENER
> >> Register a new filter, as usual, but on success return a
> >> new file descriptor that provides user-space notifications.
> >> When the filter returns SECCOMP_RET_USER_NOTIF, a notification
> >> will be provided via this file descriptor. The close-on-exec
> >> flag is automatically set on the new file descriptor. ...
> >>
> >>>  .RE
> >>>  .TP
> >>>  .BR SECCOMP_GET_ACTION_AVAIL " (since Linux 4.14)"
> >>> @@ -606,6 +613,17 @@ file.
> >>>  .TP
> >>>  .BR SECCOMP_RET_ALLOW
> >>>  This value results in the system call being executed.
> >>> +.TP
> >>> +.BR SECCOMP_RET_USER_NOTIF " (since Linux 4.21)"
> >>
> >> Please see the start of this hanging list in the manual page.
> >> Can you confirm that SECCOMP_RET_USER_NOTIF really is the lowest
> >> in the precedence order of all of the filter return values?
> > 
> > Oh, no, I didn't realize it was in a particular order. I'll switch it.
> 
> Just for my immediate education (I'm experimenting right now),
> where/how does it fit in the precedence order?

In between RET_ERRNO and RET_TRACE; see include/uapi/linux/seccomp.h
for details.

Tycho


Re: [PATCH 2/2] seccomp.2: document userspace notification

2019-03-01 Thread Tycho Andersen
On Thu, Feb 28, 2019 at 02:25:55PM +0100, Michael Kerrisk (man-pages) wrote:
> > 7. The monitoring process can use the information in the
> >'struct seccomp_notif' to make a determination about the
> >system call being made by the target process. This
> >structure includes a 'data' field that is the same
> >'struct seccomp_data' that is passed to a BPF filter.
> > 
> >In addition, the monitoring process may make use of other 
> >information that is available from user space. For example, 
> >it may inspect the memory of the target process (whose PID
> >is provided in the 'struct seccomp_notif') using
> >/proc/PID/mem, which includes inspecting the values
> >pointed to by system call arguments (whose location is
> >available 'seccomp_notif.data.args). However, when using
> >the target process PID in this way, one must guard against
> >PID re-use race conditions using the seccomp()
> >SECCOMP_IOCTL_NOTIF_ID_VALID operation.
> > 
> > 8. Having arrived at a decision about the target process's
> >system call, the monitoring process can inform the kernel
> >of its decision using the operation
> > 
> >ioctl(listenfd, SECCOMP_IOCTL_NOTIF_SEND, respptr)
> > 
> >where the third argument is a pointer to a
> >'struct seccomp_notif_resp'. [Some more details
> >needed here, but I still don't yet understand fully
> >the semantics of the 'error' and 'val' fields.]
> 
> So clearly, I misunderstood these last two steps.
> 
> (7) is something like: discover information in userspace
> as required; perform userspace actions if appropriate
> (perhaps doing the system call operation "on behalf of" the
> target process).
> 
> 
> (8) is something like:
>set 'error' and 'val' to return info to the target process:
> * error != 0 ==> make it look like the syscall failed,
>   with 'errno' set to that value
> * error == 0 ==> make it look like the syscall succeeded 
>   and returned 'val'
> 
> Right?

Yep, exactly.

Tycho


Re: [PATCH 2/2] seccomp.2: document userspace notification

2019-03-01 Thread Tycho Andersen
On Thu, Feb 28, 2019 at 01:52:19PM +0100, Michael Kerrisk (man-pages) wrote:
> > +a notification will be sent to this fd. See "Userspace Notification" below 
> > for
> 
> s/fd/file descriptor/ throughout please.

Will do.

> > +more details.
> 
> I think the description here could be better worded as something like:
> 
> SECCOMP_FILTER_FLAG_NEW_LISTENER
> Register a new filter, as usual, but on success return a
> new file descriptor that provides user-space notifications.
> When the filter returns SECCOMP_RET_USER_NOTIF, a notification
> will be provided via this file descriptor. The close-on-exec
> flag is automatically set on the new file descriptor. ...
> 
> >  .RE
> >  .TP
> >  .BR SECCOMP_GET_ACTION_AVAIL " (since Linux 4.14)"
> > @@ -606,6 +613,17 @@ file.
> >  .TP
> >  .BR SECCOMP_RET_ALLOW
> >  This value results in the system call being executed.
> > +.TP
> > +.BR SECCOMP_RET_USER_NOTIF " (since Linux 4.21)"
> 
> Please see the start of this hanging list in the manual page.
> Can you confirm that SECCOMP_RET_USER_NOTIF really is the lowest
> in the precedence order of all of the filter return values?

Oh, no, I didn't realize it was in a particular order. I'll switch it.

> > +Forwards the syscall to an attached listener in userspace to allow 
> > userspace to
> 
> s/syscall/system call throughout please.

Will do.

> > +decide what to do with the syscall. If there is no attached listener 
> > (either
> > +because the filter was not installed with the
> > +.BR SECCOMP_FILTER_FLAG_NEW_LISTENER
> > +or because the fd was closed), the filter returns
> > +.BR ENOSYS
> > +similar to what happens when a filter returns
> > +.BR SECCOMP_RET_TRACE
> > +and there is no tracer. See "Userspace Notification" below for more 
> > details.
> >  .PP
> >  If an action value other than one of the above is specified,
> >  then the filter action is treated as either
> > @@ -693,10 +711,75 @@ Otherwise, if kernel auditing is enabled and the 
> > process is being audited
> >  the action is logged.
> >  .IP *
> >  Otherwise, the action is not logged.
> > +.SS Userspace Notification
> > +Interactin userspace notification functionality in seccomp is primarily 
> > done
> > +via file descriptor. 
> 
> That sentence is somewhat garbled. Even if I correct the typo, 
> I still don't really understand it. Could you try again?

Maybe "Userspace interacts with the notification functionality via
a file descriptor"? Perhaps we can just delete it.

> > This file descriptor can be obtained by passing
> > +.BR SECCOMP_FILTER_FLAG_NEW_LISTENER
> > +as a filter flag when installing a new filter.
> > +.PP
> > +Once an fd is obtained, userspace can wait for events using
> > +.BR poll ()
> 
> and presumably select() and epoll?
> 
> What kind of notification event do poll(2)/epoll(7)/select(2) provide?
> It looks to be POLLIN/EPOLLIN/readable. Is that correct?
> These details should be noted here. More generally, my assumption
> is that you can use poll(2)/epoll(7)/select(2) to find out about the
> availability of an event, and then use SECCOMP_IOCTL_NOTIF_RECV
> to read that event. Correct? The text needs to be more explicit on
> this.

Yes.

> > +or
> > +.BR ioctl ().
> > +The supported
> > +.BR ioctl ()
> > +operations on a notification fd are:
> > +.TP
> > +.BR SECCOMP_IOCTL_NOTIF_RECV
> > +The argument to this command should be a pointer to a struct seccomp_notif:
> > +.IP
> > +.in +4n
> > +.EX
> > +struct seccomp_notif {
> > +__u64 id;
> > +__u32 pid;
> > +__u32 flags;
> > +struct seccomp_data data;
> > +};
> > +.EE
> > +.in
> > +.IP
> > +The id field is a filter-unique id for this syscall, and should be 
> > supplied in
> > +the response. It can additionally be used in
> > +.BR SECCOMP_IOCTL_ID_VALID
> > +to test whether or not the request is still alive. The pid here is the pid 
> > of
> > +the task as visible from the listener's pid namespace. If the pid is not
> > +visible, it is 0. Flags is unused right now. 
> 
> So, is 'flags' explicitly zeroed by the kernel? the manual page should note
> this.

Yes.

> > struct seccomp_data is the same
> > +data that would be passed to a filter running in the kernel.
> 
> What are the semantics if multiple monitoring processes are employing
> SECCOMP_IOCTL_NOTIF_RECV? Does only one of them get awoken? (Which one?)
> Or do they all get woken up and get a 'struct seccomp_notif'? The semantics
> should be detailed here.

Ok. (Only one notification is sent.)

> > +.TP
> > +.BR SECCOMP_IOCTL_NOTIF_SEND
> > +The argument to this command should be a pointer to a struct 
> > seccomp_notif_resp:
> > +.IP
> > +.in +4n
> > +.EX
> > +struct seccomp_notif_resp {
> > +__u64 id;
> > +__s64 val;
> > +__s32 error;
> > +__u32 flags;
> > +};
> > +.EE
> > +.in
> > +.IP
> > +The id should be the id from struct seccomp_notif; if error is non-zero, 
> > it is
> > +used as the return value, otherwise val is. Flags must be 0 right now.
> 
> Th

Re: [RFC PATCH 02/27] containers: Implement containers as kernel objects

2019-02-19 Thread Tycho Andersen
On Fri, Feb 15, 2019 at 04:07:33PM +, David Howells wrote:
> ==
> FUTURE DEVELOPMENT
> ==
> 
>  (1) Setting up the container.
> 
>  A container would be created with, say:
> 
>   int cfd = container_create("fred", CONTAINER_NEW_EMPTY_FS_NS);
> 

...

>  Further mounts can be added by:
> 
>   move_mount(mfd, "", cfd, "proc", MOVE_MOUNT_F_EMPTY_PATH);
> 

...

>  (2) Starting the container.
> 
>  Once all modifications are complete, the container's 'init' process
>  can be started by:
> 
>   fork_into_container(int cfd);
> 
>  This precludes further external modification of the mount tree within
>  the container.

Is there a technical reason for this? In particular, there are some
container runtimes that do this today via clever use of bind mounts
and MS_MOVE, for things like dynamically attaching volumes. It would
be useful to be able to mount things into the container after the
fact.

>  (3) Waiting for the container to complete.
> 
>  The container fd can then be polled to wait for init process therein
>  to complete and the exit code collected by:
> 
>   container_wait(int container_fd, int *_wstatus, unsigned int wait,
>  struct rusage *rusage);
> 
>  The container and everything in it can be terminated or killed off:
> 
>   container_kill(int container_fd, int initonly, int signal);
> 
>  If 'init' dies, all other processes in the container are preemptively
>  SIGKILL'd by the kernel.

Isn't this essentially how the pid ns works today? I'm not sure what
the container fd offers here (of course if it lands, then having the
same semantics makes sense).

>  (6) Running different LSM policies by container.  This might particularly
>  make sense with something like Apparmor where different path-based
>  rules might be required inside a container to inside the parent.

Apparmor supports this today, as long as the host is also running
Apparmor. For the more general case, Casey (and others) have been
working on LSM stacking for a long time.

Tycho


Re: [PATCH v7 1/2] signal: add pidfd_send_signal() syscall

2019-02-14 Thread Tycho Andersen
On Wed, Jan 02, 2019 at 05:16:53PM +0100, Christian Brauner wrote:
> The kill() syscall operates on process identifiers (pid). After a process
> ...

Fashionably late to the party, but feel free to consider these two:

Reviewed-by: Tycho Andersen 


Re: [RFC PATCH v8 07/14] arm64/mm, xpfo: temporarily map dcache regions

2019-02-14 Thread Tycho Andersen
On Thu, Feb 14, 2019 at 10:29:52AM -0700, Khalid Aziz wrote:
> On a side note, do you mind if I update your address in your
> signed-off-by from ty...@docker.com when I send the next version of this
> series?

Sure that would be great thanks. This e-mail is a good one to use.

Cheers,

Tycho


Re: [RFC PATCH v8 07/14] arm64/mm, xpfo: temporarily map dcache regions

2019-02-14 Thread Tycho Andersen
Hi,

On Wed, Feb 13, 2019 at 05:01:30PM -0700, Khalid Aziz wrote:
> From: Juerg Haefliger 
> 
> If the page is unmapped by XPFO, a data cache flush results in a fatal
> page fault, so let's temporarily map the region, flush the cache, and then
> unmap it.
> 
> v6: actually flush in the face of xpfo, and temporarily map the underlying
> memory so it can be flushed correctly
> 
> CC: linux-arm-ker...@lists.infradead.org
> Signed-off-by: Juerg Haefliger 
> Signed-off-by: Tycho Andersen 
> ---
>  arch/arm64/mm/flush.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
> index 30695a868107..fad09aafd9d5 100644
> --- a/arch/arm64/mm/flush.c
> +++ b/arch/arm64/mm/flush.c
> @@ -20,6 +20,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -28,9 +29,15 @@
>  void sync_icache_aliases(void *kaddr, unsigned long len)
>  {
>   unsigned long addr = (unsigned long)kaddr;
> + unsigned long num_pages = XPFO_NUM_PAGES(addr, len);
> + void *mapping[num_pages];

What version does this build on? Presumably -Wvla will cause an error
here, but,

>   if (icache_is_aliasing()) {
> + xpfo_temp_map(kaddr, len, mapping,
> +   sizeof(mapping[0]) * num_pages);
>   __clean_dcache_area_pou(kaddr, len);

Here, we map the pages to some random address via xpfo_temp_map(),
then pass the *original* address (which may not have been mapped) to
__clean_dcache_area_pou(). So I think this whole approach is wrong.

If we want to do it this way, it may be that we need some
xpfo_map_contiguous() type thing, but since we're just going to flush
it anyway, that seems a little crazy. Maybe someone who knows more
about arm64 knows a better way?

Tycho


Re: [PATCH 3/3] leaking_addresses: Expand tilde in output file name

2019-02-07 Thread Tycho Andersen
On Fri, Feb 08, 2019 at 09:50:26AM +1100, Tobin C. Harding wrote:
> Currently if user passes an output file to the script via
> --output-raw we do not handle expansion of tilde.
> 
> Use perl function glob() to expand tilde in output file name.
> 
> Signed-off-by: Tobin C. Harding 
> ---
>  scripts/leaking_addresses.pl | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/scripts/leaking_addresses.pl b/scripts/leaking_addresses.pl
> index ef9e5b2a1614..ce545ca3fb70 100755
> --- a/scripts/leaking_addresses.pl
> +++ b/scripts/leaking_addresses.pl
> @@ -150,7 +150,7 @@ if (!(is_supported_architecture() or $opt_32bit or 
> $page_offset_32bit)) {
>  }
>  
>  if ($output_raw) {
> - open my $fh, '>', $output_raw or die "$0: $output_raw: $!\n";
> + open my $fh, '>', glob($output_raw) or die "$0: $output_raw: $!\n";

Seems like you might also have the same problem with $input_raw? I
wonder if you can just do this in GetOptions somehow, so that all
users of these further down in the script don't have to remember to do
this.

Cheers,

Tycho


[PATCH] rcu docs: repair some whitespace damage

2019-01-29 Thread Tycho Andersen
While reading the docs I noticed some whitespace damage in diagram. Let's
fix it up to be consistent with elsewhere in the document: use one leading
tab, followed by spaces for any additional whitespace required.

Signed-off-by: Tycho Andersen 
---
 Documentation/RCU/whatisRCU.txt | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt
index 4a6854318b17..419ec18b87bd 100644
--- a/Documentation/RCU/whatisRCU.txt
+++ b/Documentation/RCU/whatisRCU.txt
@@ -310,7 +310,7 @@ reader, updater, and reclaimer.
 
 
rcu_assign_pointer()
-   ++
+   ++
+-->| reader |-+
|   ++ |
|   |  |
@@ -318,12 +318,12 @@ reader, updater, and reclaimer.
|   |  | rcu_read_lock()
|   |  | rcu_read_unlock()
|rcu_dereference()  |  |
-   +-+  |  |
-   | updater |<-+  |
-   +-+ V
+   +-+ |  |
+   | updater |<+  |
+   +-+V
|+---+
+--->| reclaimer |
-+---+
++---+
  Defer:
  synchronize_rcu() & call_rcu()
 
-- 
2.19.1



Re: 4.14 revert "seccomp: add a selftest for get_metadata"

2019-01-28 Thread Tycho Andersen
On Mon, Jan 28, 2019 at 05:13:11PM +, Rantala, Tommi T. (Nokia - FI/Espoo) 
wrote:
> Hi Greg,
> 
> Can you please revert this commit in 4.14?
> 
> commit e65cd9a20343ea90f576c24c38ee85ab6e7d5fec
> Author: Tycho Andersen 
> Date:   Tue Feb 20 19:47:47 2018 -0700
> 
> seccomp: add a selftest for get_metadata
> 
> [ Upstream commit d057dc4e35e16050befa3dda943876dab39cbf80 ]
> 
> Let's test that we get the flags correctly, and that we preserve
> the filter
> index across the ptrace(PTRACE_SECCOMP_GET_METADATA) correctly.
> 
> 
> 
> PTRACE_SECCOMP_GET_METADATA was only added in 4.16
> (26500475ac1b499d8636ff281311d633909f5d20)
> 
> 
> And it's also breaking seccomp_bpf.c compilation for me:
> 
> seccomp_bpf.c: In function ‘get_metadata’:
> seccomp_bpf.c:2878:26: error: storage size of ‘md’ isn’t known
>   struct seccomp_metadata md;

Yes, agreed. The get_metadata() stuff went in for 4.16, so anything
before that won't have the struct. Sorry if I missed the stable commit
before.

Cheers,

Tycho


[PATCH 2/6] selftests: fix typo in seccomp_bpf.c

2019-01-18 Thread Tycho Andersen
There used to be an explanation here because it could trigger lockdep
previously, but now we're not doing recursive locking, so it really is just
for grins.

Signed-off-by: Tycho Andersen 
---
 tools/testing/selftests/seccomp/seccomp_bpf.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c 
b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 9aba1b904089..912a2a5430dc 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -3035,7 +3035,7 @@ TEST(user_notification_basic)
EXPECT_EQ(true, WIFEXITED(status));
EXPECT_EQ(0, WEXITSTATUS(status));
 
-   /* Add some no-op filters so for grins. */
+   /* Add some no-op filters for grins. */
EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
-- 
2.19.1



[PATCH 1/6] selftests: don't kill child immediately in get_metadata() test

2019-01-18 Thread Tycho Andersen
This this test forks a child, and then the parent waits for a write() to a
pipe signalling the child is ready to be attached to. If something in the
child ASSERTs before it does this write, the test will hang waiting for it.
Instead, let's EXPECT, so that execution continues until we do the write.
Any failure after that is fine and can ASSERT.

Signed-off-by: Tycho Andersen 
---
 tools/testing/selftests/seccomp/seccomp_bpf.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c 
b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 496a9a8c773a..9aba1b904089 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -2943,11 +2943,11 @@ TEST(get_metadata)
};
 
/* one with log, one without */
-   ASSERT_EQ(0, seccomp(SECCOMP_SET_MODE_FILTER,
+   EXPECT_EQ(0, seccomp(SECCOMP_SET_MODE_FILTER,
 SECCOMP_FILTER_FLAG_LOG, &prog));
-   ASSERT_EQ(0, seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog));
+   EXPECT_EQ(0, seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog));
 
-   ASSERT_EQ(0, close(pipefd[0]));
+   EXPECT_EQ(0, close(pipefd[0]));
ASSERT_EQ(1, write(pipefd[1], "1", 1));
ASSERT_EQ(0, close(pipefd[1]));
 
-- 
2.19.1



[PATCH 6/6] selftests: unshare userns in seccomp pidns testcases

2019-01-18 Thread Tycho Andersen
The pid ns cannot be unshare()d as an unprivileged user without owning the
userns as well. Let's unshare the userns so that we can subsequently
unshare the pidns.

This also means that we don't need to set the no new privs bit as in the
other test cases, since we're unsharing the userns.

Signed-off-by: Tycho Andersen 
---
 tools/testing/selftests/seccomp/seccomp_bpf.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c 
b/tools/testing/selftests/seccomp/seccomp_bpf.c
index a4a7dce1a91b..8f6e95773225 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -3271,7 +3271,7 @@ TEST(user_notification_child_pid_ns)
struct seccomp_notif req = {};
struct seccomp_notif_resp resp = {};
 
-   ASSERT_EQ(unshare(CLONE_NEWPID), 0);
+   ASSERT_EQ(unshare(CLONE_NEWUSER | CLONE_NEWPID), 0);
 
listener = user_trap_syscall(__NR_getpid, 
SECCOMP_FILTER_FLAG_NEW_LISTENER);
ASSERT_GE(listener, 0);
@@ -3308,6 +3308,8 @@ TEST(user_notification_sibling_pid_ns)
struct seccomp_notif req = {};
struct seccomp_notif_resp resp = {};
 
+   ASSERT_EQ(unshare(CLONE_NEWUSER), 0);
+
listener = user_trap_syscall(__NR_getpid, 
SECCOMP_FILTER_FLAG_NEW_LISTENER);
ASSERT_GE(listener, 0);
 
-- 
2.19.1



[PATCH 4/6] selftests: skip seccomp get_metadata test if not real root

2019-01-18 Thread Tycho Andersen
The get_metadata() test requires real root, so let's skip it if we're not
real root.

Note that I used XFAIL here because that's what the test does later if
CONFIG_CHEKCKPOINT_RESTORE happens to not be enabled. After looking at the
code, there doesn't seem to be a nice way to skip tests defined as TEST(),
since there's no return code (I tried exit(KSFT_SKIP), but that didn't work
either...). So let's do it this way to be consistent, and easier to fix
when someone comes along and fixes it.

Signed-off-by: Tycho Andersen 
---
 tools/testing/selftests/seccomp/seccomp_bpf.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c 
b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 912a2a5430dc..ab6b6620f522 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -2929,6 +2929,12 @@ TEST(get_metadata)
struct seccomp_metadata md;
long ret;
 
+   /* Only real root can get metadata. */
+   if (geteuid()) {
+   XFAIL(return, "get_metadata requires real root");
+   return;
+   }
+
ASSERT_EQ(0, pipe(pipefd));
 
pid = fork();
-- 
2.19.1



[PATCH v1 0/6] seccomp test fixes

2019-01-18 Thread Tycho Andersen
Hi all,

Here are the fixes I previously mentioned I would send. I previously
assumed that the tests were mostly run as root, but it turns out
everything else besides the stuff I wrote in the seccomp tests either
sets NNP and doesn't require real root, so it all actually works. This
set of fixes should make most of the other tests work unprivileged,
while XFAIL-ing the one that requires real root.

Cheers,

Tycho

Tycho Andersen (6):
  selftests: don't kill child immediately in get_metadata() test
  selftests: fix typo in seccomp_bpf.c
  selftest: include stdio.h in kselftest.h
  selftests: skip seccomp get_metadata test if not real root
  selftests: set NO_NEW_PRIVS bit in seccomp user tests
  selftests: unshare userns in seccomp pidns testcases

 tools/testing/selftests/kselftest.h   |  1 +
 tools/testing/selftests/seccomp/seccomp_bpf.c | 42 ---
 2 files changed, 38 insertions(+), 5 deletions(-)

-- 
2.19.1



[PATCH 3/6] selftest: include stdio.h in kselftest.h

2019-01-18 Thread Tycho Andersen
While playing around with a way to skip the seccomp get_metadata test, I
noticed that this header uses printf() without defining it, leading to,

../kselftest.h: In function ‘ksft_print_header’:
../kselftest.h:61:3: warning: implicit declaration of function ‘printf’ 
[-Wimplicit-function-declaration]
   printf("TAP version 13\n");
   ^~
../kselftest.h:61:3: warning: incompatible implicit declaration of built-in 
function ‘printf’
../kselftest.h:61:3: note: include ‘’ or provide a declaration of 
‘printf’

if user code doesn't also use printf.

Signed-off-by: Tycho Andersen 
---
 tools/testing/selftests/kselftest.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/testing/selftests/kselftest.h 
b/tools/testing/selftests/kselftest.h
index a3edb2c8e43d..47e1d995c182 100644
--- a/tools/testing/selftests/kselftest.h
+++ b/tools/testing/selftests/kselftest.h
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* define kselftest exit codes */
 #define KSFT_PASS  0
-- 
2.19.1



[PATCH 5/6] selftests: set NO_NEW_PRIVS bit in seccomp user tests

2019-01-18 Thread Tycho Andersen
seccomp() doesn't allow users who aren't root in their userns to attach
filters unless they have the nnp bit set, so let's set it so that these
tests can pass when run as an unprivileged user.

This idea stolen from the other seccomp tests, which use this trick :)

Signed-off-by: Tycho Andersen 
---
 tools/testing/selftests/seccomp/seccomp_bpf.c | 24 +++
 1 file changed, 24 insertions(+)

diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c 
b/tools/testing/selftests/seccomp/seccomp_bpf.c
index ab6b6620f522..a4a7dce1a91b 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -3026,6 +3026,11 @@ TEST(user_notification_basic)
.filter = filter,
};
 
+   ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+   ASSERT_EQ(0, ret) {
+   TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+   }
+
pid = fork();
ASSERT_GE(pid, 0);
 
@@ -3107,6 +3112,11 @@ TEST(user_notification_kill_in_middle)
struct seccomp_notif req = {};
struct seccomp_notif_resp resp = {};
 
+   ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+   ASSERT_EQ(0, ret) {
+   TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+   }
+
listener = user_trap_syscall(__NR_getpid,
 SECCOMP_FILTER_FLAG_NEW_LISTENER);
ASSERT_GE(listener, 0);
@@ -3154,6 +3164,11 @@ TEST(user_notification_signal)
struct seccomp_notif_resp resp = {};
char c;
 
+   ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+   ASSERT_EQ(0, ret) {
+   TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+   }
+
ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
 
listener = user_trap_syscall(__NR_gettid,
@@ -3219,6 +3234,11 @@ TEST(user_notification_closed_listener)
long ret;
int status, listener;
 
+   ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+   ASSERT_EQ(0, ret) {
+   TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+   }
+
listener = user_trap_syscall(__NR_getpid,
 SECCOMP_FILTER_FLAG_NEW_LISTENER);
ASSERT_GE(listener, 0);
@@ -3350,6 +3370,10 @@ TEST(user_notification_fault_recv)
struct seccomp_notif req = {};
struct seccomp_notif_resp resp = {};
 
+   ASSERT_EQ(prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0), 0) {
+   TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+   }
+
listener = user_trap_syscall(__NR_getpid, 
SECCOMP_FILTER_FLAG_NEW_LISTENER);
ASSERT_GE(listener, 0);
 
-- 
2.19.1



Re: Linux 5.0-rc2 seccomp_bpf user_notification_basic test hangs

2019-01-17 Thread Tycho Andersen
On Thu, Jan 17, 2019 at 08:41:59AM -0800, Kees Cook wrote:
> On Thu, Jan 17, 2019 at 8:27 AM Tycho Andersen  wrote:
> >
> > On Thu, Jan 17, 2019 at 08:12:50AM -0800, Kees Cook wrote:
> > > On Wed, Jan 16, 2019 at 5:26 PM shuah  wrote:
> > > > I am running Linux 5.0-rc2 and not an older kernel.
> > >
> > > Weird. I couldn't reproduce this on 5.0-rc2, but I did see it on a
> > > kernel without seccomp user_notif. Does the patch I sent fix it for
> > > you? (And if so, can you take it in your tree?)
> >
> > I can reproduce it; you have to run it as non-root. I think your patch
> > is necessary to get it to at least fail. The question is: what should
> > we do about these tests that require real root? Skip them if we're not
> > real-root, I guess?
> 
> Hm, maybe use the XFAIL() bit of the harness?
> 
> Perhaps it's time to make it a root-only test and do internal
> priv-dropping to test the nnp-requiring parts? I'll add it to the TODO
> list...

Ok, I'll try to send a couple of patches soon to fix some of this up.
But at least yours should should stop things from hanging for now.

Thanks,

Tycho


Re: Linux 5.0-rc2 seccomp_bpf user_notification_basic test hangs

2019-01-17 Thread Tycho Andersen
On Thu, Jan 17, 2019 at 08:12:50AM -0800, Kees Cook wrote:
> On Wed, Jan 16, 2019 at 5:26 PM shuah  wrote:
> > I am running Linux 5.0-rc2 and not an older kernel.
> 
> Weird. I couldn't reproduce this on 5.0-rc2, but I did see it on a
> kernel without seccomp user_notif. Does the patch I sent fix it for
> you? (And if so, can you take it in your tree?)

I can reproduce it; you have to run it as non-root. I think your patch
is necessary to get it to at least fail. The question is: what should
we do about these tests that require real root? Skip them if we're not
real-root, I guess?

Tycho


Re: [PATCH] selftests/seccomp: Abort without user notification support

2019-01-16 Thread Tycho Andersen
On Wed, Jan 16, 2019 at 04:35:25PM -0800, Kees Cook wrote:
> In the face of missing user notification support, the self test needs
> to stop executing a test (ASSERT_*) instead of just reporting and
> continuing (EXPECT_*). This adjusts the user notification tests to do
> that where needed.
> 
> Reported-by: Shuah Khan 
> Fixes: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")

The gift that keeps on giving :)

Reviewed-by: Tycho Andersen 

Thanks,

Tycho


Re: Linux 5.0-rc2 seccomp_bpf user_notification_basic test hangs

2019-01-16 Thread Tycho Andersen
On Wed, Jan 16, 2019 at 04:30:26PM -0800, Kees Cook wrote:
> On Wed, Jan 16, 2019 at 4:01 PM shuah  wrote:
> >
> > Hi Kees and James,
> >
> > seccomp_bpf test hangs right after the following test passes
> > with EBUSY. Please see log at the end.
> >
> > /* Installing a second listener in the chain should EBUSY */
> >  EXPECT_EQ(user_trap_syscall(__NR_getpid,
> >  SECCOMP_FILTER_FLAG_NEW_LISTENER),
> >-1);
> >  EXPECT_EQ(errno, EBUSY);
> >
> >
> > The user_notification_basic test starts running I assume and then
> > the hang.
> >
> > The only commit I see that could be suspect is the following as
> > it talks about adding SECCOMP_RET_USER_NOTIF
> >
> > commit d9a7fa67b4bfe6ce93ee9aab23ae2e7ca0763e84
> > Merge: f218a29c25ad 55b8cbe470d1
> > Author: Linus Torvalds 
> > Date:   Wed Jan 2 09:48:13 2019 -0800
> >
> >  Merge branch 'next-seccomp' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
> >
> >  Pull seccomp updates from James Morris:
> >
> >   - Add SECCOMP_RET_USER_NOTIF
> >
> >   - seccomp fixes for sparse warnings and s390 build (Tycho)
> >
> >  * 'next-seccomp' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
> >seccomp, s390: fix build for syscall type change
> >seccomp: fix poor type promotion
> >samples: add an example of seccomp user trap
> >seccomp: add a return code to trap to userspace
> >seccomp: switch system call argument type to void *
> >seccomp: hoist struct seccomp_data recalculation higher
> >
> >
> > Any ideas on how to proceed? Here is the log. The following
> > reproduces the problem.
> >
> > make -C tools/testing/selftests/seccomp/ run_tests
> >
> >
> > seccomp_bpf.c:2947:global.get_metadata:Expected 0 (0) ==
> > seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_LOG, &prog)
> > (18446744073709551615)
> > seccomp_bpf.c:2959:global.get_metadata:Expected 1 (1) == read(pipefd[0],
> > &buf, 1) (0)
> > global.get_metadata: Test terminated by assertion
> > [ FAIL ] global.get_metadata
> > [ RUN  ] global.user_notification_basic
> > seccomp_bpf.c:3036:global.user_notification_basic:Expected 0 (0) ==
> > WEXITSTATUS(status) (1)
> > seccomp_bpf.c:3039:global.user_notification_basic:Expected
> > seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog) (18446744073709551615) == 0 (0)
> > seccomp_bpf.c:3040:global.user_notification_basic:Expected
> > seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog) (18446744073709551615) == 0 (0)
> > seccomp_bpf.c:3041:global.user_notification_basic:Expected
> > seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog) (18446744073709551615) == 0 (0)
> > seccomp_bpf.c:3042:global.user_notification_basic:Expected
> > seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog) (18446744073709551615) == 0 (0)
> > seccomp_bpf.c:3047:global.user_notification_basic:Expected listener
> > (18446744073709551615) >= 0 (0)
> > seccomp_bpf.c:3053:global.user_notification_basic:Expected errno (13) ==
> > EBUSY (16)
> 
> Looks like the test is unfriendly when running the current selftest on
> an old kernel version. A quick look seems like it's missing some
> ASSERT_* cases where EXPECT_* is used. I'll send a patch.

ASSERT will kill the test case though right? I thought we were
supposed to use EXPECT when we wanted it to keep going. In particular,
it looks like in the get_metadata test, we should be using expect
instead of assert in some places, so we can get to the write() that
does the synchronization. Something like,

diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c 
b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 067cb4607d6c..4d2508af2483 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -2943,11 +2943,11 @@ TEST(get_metadata)
};
 
/* one with log, one without */
-   ASSERT_EQ(0, seccomp(SECCOMP_SET_MODE_FILTER,
+   EXPECT_EQ(0, seccomp(SECCOMP_SET_MODE_FILTER,
 SECCOMP_FILTER_FLAG_LOG, &prog));
-   ASSERT_EQ(0, seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog));
+   EXPECT_EQ(0, seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog));
 
-   ASSERT_EQ(0, close(pipefd[0]));
+   EXPECT_EQ(0, close(pipefd[0]));
ASSERT_EQ(1, write(pipefd[1], "1", 1));
ASSERT_EQ(0, close(pipefd[1]));
 

But also, is running new tests on an old kernel expected to work? I
didn't know that :).

Tycho


Re: [PATCH] uart: Fix crash in uart_write and uart_put_char

2019-01-16 Thread Tycho Andersen
On Wed, Jan 16, 2019 at 10:28:07AM -0800, sa...@embedur.com wrote:
> From: Samir Virmani 
> 
> We were experiencing a crash similar to the one reported as part of
> commit:a5ba1d95e46e ("uart: fix race between uart_put_char() and
> uart_shutdown()") in our testbed as well. We continue to observe the same
> crash after integrating the commit a5ba1d95e46e ("uart: fix race between
> uart_put_char() and uart_shutdown()")
> 
> On reviewing the change, the port lock should be taken prior to checking for
> if (!circ->buf) in fn. __uart_put_char and other fns. that update the buffer
> uart_state->xmit.
> 
> Traceback:
> 
> [11/27/2018 06:24:32.4870] Unable to handle kernel NULL pointer dereference
>at virtual address 003b
> 
> [11/27/2018 06:24:32.4950] PC is at memcpy+0x48/0x180
> [11/27/2018 06:24:32.4950] LR is at uart_write+0x74/0x120
> [11/27/2018 06:24:32.4950] pc : []
>lr : [] pstate: 01c5
> [11/27/2018 06:24:32.4950] sp : ffc076433d30
> [11/27/2018 06:24:32.4950] x29: ffc076433d30 x28: 0140
> [11/27/2018 06:24:32.4950] x27: ffc0009b9d5e x26: ffc07ce36580
> [11/27/2018 06:24:32.4950] x25:  x24: 0140
> [11/27/2018 06:24:32.4950] x23: ffc000891200 x22: ffc01fc34000
> [11/27/2018 06:24:32.4950] x21: 0fff x20: 0076
> [11/27/2018 06:24:32.4950] x19: 0076 x18: 
> [11/27/2018 06:24:32.4950] x17: 0047cf08 x16: ffc99e68
> [11/27/2018 06:24:32.4950] x15: 0018 x14: 776d726966205948
> [11/27/2018 06:24:32.4950] x13: 50203a6c6974755f x12: 74647075205d
> [11/27/2018 06:24:32.4950] x11: 3a35323a36203831 x10: 30322f37322f3131
> [11/27/2018 06:24:32.4950] x9 : 5b205d303638342e x8 : 746164206f742070
> [11/27/2018 06:24:32.4950] x7 : 7520736920657261 x6 : 003b
> [11/27/2018 06:24:32.4950] x5 : 817a x4 : 0008
> [11/27/2018 06:24:32.4950] x3 : 2f37322f31312a5b x2 : 006e
> [11/27/2018 06:24:32.4950] x1 : ffc0009b9cf0 x0 : 003b
> 
> [11/27/2018 06:24:32.4950] CPU2: stopping
> [11/27/2018 06:24:32.4950] CPU: 2 PID: 0 Comm: swapper/2 Tainted: P  D
> O4.1.51 #3
> [11/27/2018 06:24:32.4950] Hardware name: Broadcom-v8A (DT)
> [11/27/2018 06:24:32.4950] Call trace:
> [11/27/2018 06:24:32.4950] [] dump_backtrace+0x0/0x150
> [11/27/2018 06:24:32.4950] [] show_stack+0x14/0x20
> [11/27/2018 06:24:32.4950] [] dump_stack+0x90/0xb0
> [11/27/2018 06:24:32.4950] [] handle_IPI+0x18c/0x1a0
> [11/27/2018 06:24:32.4950] [] gic_handle_irq+0x88/0x90
> 
> Fixes: a5ba1d95e46e ("uart: fix race between uart_put_char() and
>   uart_shutdown()")
> Signed-off-by: Samir Virmani 
> Cc: Tycho Andersen 

Acked-by: Tycho Andersen 

Thanks,

Tycho


  1   2   3   4   5   6   >