Re: [seccomp] Request for a "enable on execve" mode for Seccomp filters

2020-10-29 Thread Rich Felker
On Thu, Oct 29, 2020 at 07:58:42AM +, Sargun Dhillon wrote:
> A mechanism for the thing listening on the listener FD to turn itself on or
> off and indicate that it is no longer interested in receiving notifications
> and to always continue / return an error code, or that it has taken an
> interest now, and it would like to return to handling these events. The idea
> of an action other than ENOSYS (specifically SECCOMP_USER_NOTIF_FLAG_CONTINUE)
> if the listener goes away is attractive as well, in case the supervisor
> crashes. EPERM is somewhat "cleaner" of an error code than ENOSYS (most
> people don't write handling for ENOSYS on connect).

This is a common misconception that really needs to be addressed.
EPERM is not a suitable error code for as-yet-unknown seccomp-blocked
syscalls, and is not suitable for a large portion of currently known
ones either. Its use has actively broken lots of things that would have
worked fine with ENOSYS returned. This is because a caller will react
to ENOSYS by attempting to do whatever it wanted to do in another way,
one which your policy might handle better (for example, if your
filter is outdated and blocking clock_gettime64 but not clock_gettime,
or blocking statx but not stat64) while it will react to EPERM as an
indication that your filter understands the abstract operation it was
trying to perform and considers that action forbidden. We hit issues
like this with virtually every seccomp-using application while moving
to time64 because the wrong idiom is so widespread, and further vdso
prevented it from being caught right away (only users without vdso hit
it later).
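
A minimal sketch of the caller-side idiom described above, with the two
syscalls replaced by hypothetical function pointers so it runs anywhere.
The names time_op, get_time_with_fallback, newer_enosys, newer_eperm and
older_ok are illustrative only and do not come from any libc:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical shape of "one way of getting the time": returns 0 on
 * success, -1 with errno set on failure. */
typedef int (*time_op)(long long *out_seconds);

/*
 * Try the newer interface first. Only ENOSYS is read as "this interface
 * does not exist here, try another way". Any other error, EPERM
 * included, is treated as a final answer about the operation itself.
 */
int get_time_with_fallback(time_op newer, time_op older, long long *out)
{
    assert(newer != NULL && older != NULL && out != NULL);
    if (newer(out) == 0)
        return 0;
    if (errno != ENOSYS)
        return -1;      /* e.g. EPERM: caller gives up, no fallback */
    return older(out);  /* ENOSYS: fall back to the legacy interface */
}

/* Stand-ins modelling what an outdated filter might return. */
int newer_enosys(long long *out) { (void)out; errno = ENOSYS; return -1; }
int newer_eperm(long long *out)  { (void)out; errno = EPERM;  return -1; }
int older_ok(long long *out)     { *out = 42; return 0; }
```

With newer_enosys the call succeeds through older_ok; with newer_eperm it
fails even though the legacy path would have worked, which is the
breakage pattern described above.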

From a standards perspective (and thus of programming to them), POSIX
does permit implementations to define additional errors for any
interface that already has errors defined, but does not permit
overloading the standard errors defined, so if EPERM is already
defined for an interface, returning it is probably not ok. (IMO it is
defensible if the seccomp policy can be thought of as an extension of
the permission model that would cause the standard EPERM, but it's not
defensible if the policy is just "we don't know about this syscall
yet" since the syscall might just be a new/better way of implementing
some existing operation that makes the policy appear inconsistent.)
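
One way to express that distinction in a filter, as a hedged sketch
rather than a recommended policy: syscalls the policy knows and forbids
get EPERM, and anything the policy has never heard of gets ENOSYS. The
function name build_policy and its two parameters are hypothetical, and a
real filter would also have to check seccomp_data.arch first, which is
omitted here for brevity:

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

#define POLICY_LEN 6

/*
 * Fill out[0..POLICY_LEN-1] with a tiny classic-BPF program:
 *   allowed_nr   -> SECCOMP_RET_ALLOW
 *   forbidden_nr -> EPERM  (known operation, deliberately denied)
 *   anything else -> ENOSYS (unknown to this policy)
 * NOTE: the architecture check is intentionally left out of this sketch.
 */
size_t build_policy(struct sock_filter *out, int allowed_nr, int forbidden_nr)
{
    const struct sock_filter prog[POLICY_LEN] = {
        /* Load the syscall number. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 (unsigned int)offsetof(struct seccomp_data, nr)),
        /* Known and permitted. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (unsigned int)allowed_nr, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        /* Known and deliberately forbidden. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (unsigned int)forbidden_nr, 0, 1),
        BPF_STMT(BPF_RET | BPF_K,
                 SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        /* Default: the policy does not know this syscall. */
        BPF_STMT(BPF_RET | BPF_K,
                 SECCOMP_RET_ERRNO | (ENOSYS & SECCOMP_RET_DATA)),
    };
    memcpy(out, prog, sizeof(prog));
    return POLICY_LEN;
}
```

The only point of the sketch is the last instruction: the default action
carries ENOSYS, so callers of a syscall added after the filter was
written can still fall back to an older interface.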

In your example of connect, I think either is semantically defensible;
one is "you lack suitable privilege to perform the operation" and the
other is "you're running on a type of system that doesn't have socket
connect functionality" (because it's a minimal execution environment).

Rich


Re: [seccomp] Request for a "enable on execve" mode for Seccomp filters

2020-10-29 Thread Sargun Dhillon
On Wed, Oct 28, 2020 at 03:47:27PM -0700, Kees Cook wrote:
> On Wed, Oct 28, 2020 at 12:18:47PM +0100, Camille Mougey wrote:
> > (This is my first message to the kernel list, I hope I'm doing it right)
> 
> Looks good to me! The key was CCing real people. ;)
> 
> > From my understanding, there is no way to delay the activation of
> > seccomp filters, for instance "until an _execve_ call".
> > But this might be useful, especially for tools who sandbox other,
> > non-cooperative, executables, such as "systemd" or "FireJail".
> > [...]
> > I only see hackish ways to restrict the use of _execve_ in a
> > non-cooperative executable. These methods seem globally bypassables
> > and not satisfactory from a security point of view.
> > 
> > IMHO, a way to prepare filter and enable them only on the next
> > _execve_ would have some benefit:
> > * have a way to restrict _execve_ in a non-cooperative executable;
> > * install filters atomically, ie. before the _execve_ system call
> > return. That would limit racy situations, and have the very firsts
> > instructions of potentially untrusted binaries already subject to
> > seccomp filters. It would also ensure there is only one thread running
> > at the filter enabling time.
> > 
> > From what I understand, there is a relative use case[2] where the
> > "enable on exec" mode would also be a solution.
> > 
> > Thanks for your attention,
> > C. Mougey
> > 
> > [1]: https://github.com/netblue30/firejail/issues/3685
> > [2]: https://lore.kernel.org/linux-man/202010250759.F9745E0B6@keescook/
> 
> Just to restate things already said in the thread and to try to illustrate
> with more clarity, I tend to organize my thinking about seccomp usage
> into three categories:
> 
> 1- self-confinement
> 2- launching external processes
>   a) cooperating
>   b) oblivious
> 
> I classify things like Chrome's complex tree of related processes and
> filters as 1, since it's all one thing together.
> 
> I think of systemd, docker, minijail, FireJail, etc all as falling into
> category 2, with some variation about how to deal with 2a or 2b. I see
> systemd as weakly covering both 2a and 2b: e.g. services are documenting
> what restrictions they want, etc. minijail has stronger 2b coverage as
> it attempts to do PRELOAD tricks (which it sounds like FireJail does
> too?) (Aside: why doesn't systemd do any self-confinement?)
> 
> We don't have much possibility for the targets in the 2a realm as far
> as cooperating over how to _manage_ confinement, but rather about simply
> expecting confinement to exist, or adding more confinement on their own.
> 
> So, what would adding delayed filters gain in the above classifications?
> 
> Both 1 and 2 would benefit from some simplification over how to apply
> filters (e.g. the referenced relative complexity of needing to pass the
> USER_NOTIF fd up to the supervisor).
> 
> Dealing with 2b is improved by allowing execve itself to be blocked.
> 
> If we turn this:
> 
>   fork
>   prepare & apply
>   exec
> 
> into this:
> 
>   fork
>   prepare
>   exec & apply
> 
> for 2a, this isn't too interesting since a 2a target could just give up
> execve after it launched. For 2b, though, it's pretty meaningful to gain
> further isolation of an oblivious (and assumingly untrusted) process
> (given all the hacks needed to try to cover the situation).
> 
> And to clarify, 2a would much prefer this to be able to separate
> initialization from runtime:
> 
> fork
> prepare
> exec
> other things
> apply
> 
> And just for completeness, none of this is useful at all for 1, which
> doesn't even "see" the fork from its perspective:
> 
> exec
> other things
> prepare & apply
> 
> How should 2a targets indicate they're ready? Can it be done passively
> (in the sense that libc would make some seccomp call to apply the
> delayed filters), or does it need to stay explicit? (e.g. can we turn
> a pre-untrusted-input 2b into a 2a just by having the libc make calls?)
> My instinct is that hiding it won't gain much over a "on-execve" case,
> but having an explicit call that means "I'm done initializing now" would
> be a meaningful synchronization point -- except I note that it just means
> the target could just as easily start doing its own confinement anyway,
> which means they effectively move from 2a to 1, and now we don't care
> about delayed filters any more.
> 
> So, lacking a clearer sync point, execve() does seem to stand out to me.
> 
> The other idea which was touched on in the thread was very direct
> management (e.g. ptrace) and the supervisor waits until some point and
> then forces the filters to apply on the target. What would be more
> light-weight than this? (Or rather, what kinds of things would such a
> ptracer be looking for to mark "I've started"?)
> 
> Since I've got bitmaps on my mind, what about a syscall bitmap that
> triggers the application of delayed filters? The supervisor is launching
> a daemon: mark NR_listen as the 

Re: [seccomp] Request for a "enable on execve" mode for Seccomp filters

2020-10-28 Thread Rich Felker
On Wed, Oct 28, 2020 at 01:42:13PM +0100, Jann Horn wrote:
> +luto just in case he has opinions on this
> 
> On Wed, Oct 28, 2020 at 12:18 PM Camille Mougey  wrote:
> > From my understanding, there is no way to delay the activation of
> > seccomp filters, for instance "until an _execve_ call".
> 
> (FWIW, there are some tricks that you can use for this. In particular,
> you can attach to the child with ptrace before the child runs
> execve(), and then use seccomp to inject a filter after execve(), or
> something like that. The disadvantage is that this is not super pretty
> because it interferes with debugging of the parent process. IIRC e.g.
> Ubuntu's launchd did things this way.)

Yes, in principle everything seccomp does could have been done with
ptrace but the whole point was not to use ptrace as a primitive to
build hacks upon. So this is not a good solution.

> > But this might be useful, especially for tools who sandbox other,
> > non-cooperative, executables, such as "systemd" or "FireJail".
> >
> > It seems to be a caveat of seccomp specific to the system call
> > _execve_. For now, some tools such as "systemd" explicitly mention
> > this exception, and do not support it (from the man page):
> > > Note that strict system call filters may impact execution and
> > > error handling code paths of the service invocation.
> > > Specifically, access to the execve system call is required for
> > > the execution of the service binary — if it is blocked service
> > > invocation will necessarily fail
> >
> > "FireJail" takes a different approach[1], with a kind of workaround:
> > the project uses an external library to be loaded through LD_PRELOAD
> > mechanism, in order to install filters during the loader stage.
> > This approach, a bit hacky, also has several caveats:
> > * _openat_, _mmap_, etc. must be allowed in order to reach the
> > LD_PRELOAD mechanism, and for the crafted library to work ;
> 
> Those caveats are not specific to the LD_PRELOAD approach. Actually,
> the LD_PRELOAD approach is the only one which I would expect to *not*
> have that caveat. (Of course, non-executable mmap() and probably also
> openat() are anyway needed for almost any real-world service to do its
> job correctly.)
> 
> > * it doesn't work for static binaries.
> 
> IMO the important thing about LD_PRELOAD is that it is unreliable:
> When the LD_PRELOAD library can't be opened, glibc just prints a
> warning and continues execution - and an attacker may be able to cause
> opening an LD_PRELOAD library to fail by opening so many files in
> other processes that the global limit is reached. So you can't build
> reliable security infrastructure on LD_PRELOAD. This is not a
> fundamental problem though - glibc could address this.

Using LD_PRELOAD for security infrastructure is a really bad idea
anyway. There are nearly unboundedly many ways code could end up
executing before the preloaded ctors. Preinit arrays, malformed ELF
headers seizing control of execution of ldso, etc. The seccomp filters
really need to be in place *before* the untrusted code runs.
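
As a small illustration of the preinit-array point (a sketch assuming a
GCC- or Clang-style toolchain and a libc that honours .preinit_array in
the main executable, as glibc does; the hook names are made up), code
placed in the executable's preinit array runs before that executable's
own constructors. Per the ELF gABI it also runs before the initializers
of any loaded shared object, which would include a preloaded library:

```c
#include <assert.h>

static int sequence;       /* bumped by each hook as it runs */
static int preinit_order;  /* 0 means "never ran" */
static int ctor_order;     /* 0 means "never ran" */

/* Runs from the main executable's .preinit_array, if the libc supports it. */
static void early_hook(void)
{
    preinit_order = ++sequence;
}

/* Register early_hook in the pre-initialization array. */
__attribute__((section(".preinit_array"), used))
static void (*early_hook_entry)(void) = early_hook;

/* An ordinary constructor, the mechanism a preloaded library relies on. */
__attribute__((constructor))
static void late_hook(void)
{
    ctor_order = ++sequence;
}

/* Returns 1 if the preinit hook ran and ran before the constructor. */
int preinit_ran_first(void)
{
    assert(ctor_order > 0);
    return preinit_order != 0 && preinit_order < ctor_order;
}
```

On a libc without preinit-array support the early hook simply never
runs, so preinit_ran_first() reports 0 there.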

> > I only see hackish ways to restrict the use of _execve_ in a
> > non-cooperative executable. These methods seem globally bypassables
> > and not satisfactory from a security point of view.
> 
> You're just focusing on execve() - I think it's important to keep in
> mind what happens after execve() for normal, dynamically-linked
> binaries: The next step is that the dynamic linker runs, and it will
> poke around in the file system with access() and openat() and fstat(),
> it will mmap() executable libraries into memory, it will mprotect()
> some memory regions, it will set up thread-local storage (e.g. using
> arch_prctl(); even if the process is single-threaded), and so on.
> 
> The earlier you install the seccomp filter, the more of these steps
> you have to permit in the filter. And if you want the filter to take
> effect directly after execve(), the syscalls you'll be forced to
> permit are sufficient to cobble something together in userspace that
> effectively does almost the same thing as execve().

I would assume you use SECCOMP_RET_USER_NOTIF to implement policy for
controlling these operations and allowing only the ones that are valid
during dynamic linking. This also allows you to defer application of
the filter until after execve. So unless I'm missing some reason why
this doesn't work, I think the requested functionality is already
available.

If you really just want the "activate at exec" behavior, it might be
possible (depending on how SECCOMP_RET_USER_NOTIF behaves when there's
no notify fd open; I forget) to setup the filter so that the "mode
switch" happens automatically at exec by having the notify fd being
close-on-exec (notifications handled by a thread before exec). If this
works it would avoid having an extra process involved and managing its
lifetime.
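
A rough sketch of the setup this paragraph has in mind, under several
assumptions: kernel headers new enough to define SECCOMP_RET_USER_NOTIF
and SECCOMP_FILTER_FLAG_NEW_LISTENER (Linux 5.0 or later), the function
names build_notify_filter and install_with_listener are invented for
illustration, and the architecture check a real filter needs is left
out. seccomp(2) documents that the listener fd is returned with
close-on-exec set, which is what would make the fd disappear at exec:

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#define NOTIFY_LEN 4

/* Route one syscall number to the user-space listener, allow the rest. */
size_t build_notify_filter(struct sock_filter *out, int watched_nr)
{
    const struct sock_filter prog[NOTIFY_LEN] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 (unsigned int)offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (unsigned int)watched_nr, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    memcpy(out, prog, sizeof(prog));
    return NOTIFY_LEN;
}

/*
 * Install the filter and return the notification fd, or -1 on error.
 * The returned fd is close-on-exec, so an execve() in this process
 * closes it; from then on nobody is listening.
 */
int install_with_listener(struct sock_filter *insns, unsigned short len)
{
    struct sock_fprog prog = { .len = len, .filter = insns };

    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return (int)syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER,
                        SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
}
```

A helper thread would service the fd until exec. Whether the resulting
post-exec behavior is an acceptable "mode switch" depends on what the
kernel does for notified syscalls once the listener is gone, which is
the open question raised in the paragraph above.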

> Your usecase might be better served by adding a glibc feature for
> "unskippable LD_PRELOAD" 

Re: [seccomp] Request for a "enable on execve" mode for Seccomp filters

2020-10-28 Thread Rich Felker
On Wed, Oct 28, 2020 at 07:25:45PM +0100, Jann Horn wrote:
> On Wed, Oct 28, 2020 at 6:52 PM Rich Felker  wrote:
> > On Wed, Oct 28, 2020 at 06:34:56PM +0100, Jann Horn wrote:
> > > On Wed, Oct 28, 2020 at 5:49 PM Rich Felker  wrote:
> > > > On Wed, Oct 28, 2020 at 01:42:13PM +0100, Jann Horn wrote:
> > > > > > On Wed, Oct 28, 2020 at 12:18 PM Camille Mougey wrote:
> > > > > You're just focusing on execve() - I think it's important to keep in
> > > > > mind what happens after execve() for normal, dynamically-linked
> > > > > binaries: The next step is that the dynamic linker runs, and it will
> > > > > poke around in the file system with access() and openat() and fstat(),
> > > > > it will mmap() executable libraries into memory, it will mprotect()
> > > > > some memory regions, it will set up thread-local storage (e.g. using
> > > > > arch_prctl(); even if the process is single-threaded), and so on.
> > > > >
> > > > > The earlier you install the seccomp filter, the more of these steps
> > > > > you have to permit in the filter. And if you want the filter to take
> > > > > effect directly after execve(), the syscalls you'll be forced to
> > > > > permit are sufficient to cobble something together in userspace that
> > > > > effectively does almost the same thing as execve().
> > > >
> > > > I would assume you use SECCOMP_RET_USER_NOTIF to implement policy for
> > > > controlling these operations and allowing only the ones that are valid
> > > > during dynamic linking. This also allows you to defer application of
> > > > the filter until after execve. So unless I'm missing some reason why
> > > > this doesn't work, I think the requested functionality is already
> > > > available.
> > >
> > > Ah, yeah, good point.
> > >
> > > > If you really just want the "activate at exec" behavior, it might be
> > > > possible (depending on how SECCOMP_RET_USER_NOTIF behaves when there's
> > > > no notify fd open; I forget)
> > >
> > > syscall returns -ENOSYS. Yeah, that'd probably do the job. (Even
> > > though it might be a bit nicer if userspace had control over the errno
> > > there, such that it could be EPERM instead... oh well.)
> >
> > EPERM is a major bug in current sandbox implementations, so ENOSYS is
> > at least mildly better, but indeed it should be controllable, probably
> > by allowing a code path for the BPF to continue with a jump to a
> > different logic path if the notify listener is missing.
> 
> I guess we might be able to expose the listener status through a bit /
> a field in the struct seccomp_data, and then filters could branch on
> that. (And the kernel would run the filter twice if we raced with
> filter detachment.) I don't know whether it would look pretty, but I
> think it should be doable...

I was thinking the race wouldn't be salvageable, but indeed since the
filter is side-effect-free you can just re-run it if the status
changes between start of filter processing and the attempt at
notification. This sounds like it should work.
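
The re-run idea can be modelled in user space as follows. This is only a
toy sketch, not kernel code: the listener status is read through a
hypothetical callback (status_reader), the filter is any pure function
of its inputs, and evaluate_stable, demo_filter and scripted_status are
invented names. The loop re-evaluates whenever the status observed after
the run differs from the one the filter was given:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum verdict { VERDICT_NOTIFY, VERDICT_ENOSYS };

/* A filter with no side effects: the result depends only on its inputs. */
typedef enum verdict (*pure_filter)(int nr, bool listener_present);
/* Hypothetical way of sampling "is a listener attached right now?". */
typedef bool (*status_reader)(void *ctx);

/*
 * Run the filter against a snapshot of the listener status, then check
 * that the status did not change meanwhile. If it did, the verdict is
 * discarded and the filter is simply run again, which is safe because
 * the filter has no side effects. *runs counts the evaluations.
 */
enum verdict evaluate_stable(pure_filter filter, int nr,
                             status_reader read_status, void *ctx, int *runs)
{
    assert(filter != NULL && read_status != NULL && runs != NULL);
    for (;;) {
        bool before = read_status(ctx);
        enum verdict v = filter(nr, before);
        ++*runs;
        if (read_status(ctx) == before)
            return v;
    }
}

/* Example filter: notify while a listener exists, otherwise ENOSYS. */
enum verdict demo_filter(int nr, bool listener_present)
{
    (void)nr;
    return listener_present ? VERDICT_NOTIFY : VERDICT_ENOSYS;
}

/* Test stand-in: replays a fixed sequence of status observations. */
struct script { const bool *values; size_t len; size_t pos; };

bool scripted_status(void *ctx)
{
    struct script *s = ctx;
    bool v = s->values[s->pos];
    if (s->pos + 1 < s->len)
        s->pos++;
    return v;
}
```

In the real kernel the final check and the notification attempt would
have to happen under whatever lock protects the listener, so this only
illustrates the retry shape, not the locking.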

I guess it's not possible to chain two BPF filters to do this, because
that only works when the first one allows? Or am I misunderstanding
the multiple-filters case entirely? (I've never gotten that far with
programming it.)

Rich


Re: [seccomp] Request for a "enable on execve" mode for Seccomp filters

2020-10-28 Thread Rich Felker
On Wed, Oct 28, 2020 at 07:39:41PM +0100, Jann Horn wrote:
> On Wed, Oct 28, 2020 at 7:35 PM Rich Felker  wrote:
> > On Wed, Oct 28, 2020 at 07:25:45PM +0100, Jann Horn wrote:
> > > On Wed, Oct 28, 2020 at 6:52 PM Rich Felker  wrote:
> > > > On Wed, Oct 28, 2020 at 06:34:56PM +0100, Jann Horn wrote:
> > > > > On Wed, Oct 28, 2020 at 5:49 PM Rich Felker  wrote:
> > > > > > On Wed, Oct 28, 2020 at 01:42:13PM +0100, Jann Horn wrote:
> > > > > > > > On Wed, Oct 28, 2020 at 12:18 PM Camille Mougey wrote:
> > > > > > > > You're just focusing on execve() - I think it's important to keep in
> > > > > > > > mind what happens after execve() for normal, dynamically-linked
> > > > > > > > binaries: The next step is that the dynamic linker runs, and it will
> > > > > > > > poke around in the file system with access() and openat() and fstat(),
> > > > > > > > it will mmap() executable libraries into memory, it will mprotect()
> > > > > > > > some memory regions, it will set up thread-local storage (e.g. using
> > > > > > > > arch_prctl(); even if the process is single-threaded), and so on.
> > > > > > > >
> > > > > > > > The earlier you install the seccomp filter, the more of these steps
> > > > > > > > you have to permit in the filter. And if you want the filter to take
> > > > > > > > effect directly after execve(), the syscalls you'll be forced to
> > > > > > > > permit are sufficient to cobble something together in userspace that
> > > > > > > > effectively does almost the same thing as execve().
> > > > > >
> > > > > > > I would assume you use SECCOMP_RET_USER_NOTIF to implement policy for
> > > > > > > controlling these operations and allowing only the ones that are valid
> > > > > > > during dynamic linking. This also allows you to defer application of
> > > > > > > the filter until after execve. So unless I'm missing some reason why
> > > > > > > this doesn't work, I think the requested functionality is already
> > > > > > > available.
> > > > >
> > > > > Ah, yeah, good point.
> > > > >
> > > > > > If you really just want the "activate at exec" behavior, it might be
> > > > > > > possible (depending on how SECCOMP_RET_USER_NOTIF behaves when there's
> > > > > > > no notify fd open; I forget)
> > > > >
> > > > > syscall returns -ENOSYS. Yeah, that'd probably do the job. (Even
> > > > > though it might be a bit nicer if userspace had control over the errno
> > > > > there, such that it could be EPERM instead... oh well.)
> > > >
> > > > EPERM is a major bug in current sandbox implementations, so ENOSYS is
> > > > at least mildly better, but indeed it should be controllable, probably
> > > > by allowing a code path for the BPF to continue with a jump to a
> > > > different logic path if the notify listener is missing.
> > >
> > > I guess we might be able to expose the listener status through a bit /
> > > a field in the struct seccomp_data, and then filters could branch on
> > > that. (And the kernel would run the filter twice if we raced with
> > > filter detachment.) I don't know whether it would look pretty, but I
> > > think it should be doable...
> >
> > I was thinking the race wouldn't be salvageable, but indeed since the
> > filter is side-effect-free you can just re-run it if the status
> > changes between start of filter processing and the attempt at
> > notification. This sounds like it should work.
> >
> > I guess it's not possible to chain two BPF filters to do this, because
> > that only works when the first one allows? Or am I misunderstanding
> > the multiple-filters case entirely? (I've never gotten that far with
> > programming it.)
> 
> I'm not sure if I'm understanding the question correctly...
> At the moment you basically can't have multiple filters with notifiers.
> The rule with multiple filters is always that all the filters get run,
> and the actual action taken is the most restrictive result of all of
> them.

I probably just don't understand how multiple filters work then, which
is pretty much what I expected. But in any case it seems correct that
they're not a tool for solving the problem here.
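
For reference, the "most restrictive result wins" rule quoted above can
be modelled like this. It is a user-space sketch, not the kernel's code:
combine_filter_results and action_rank are invented names, and the
signed comparison on the action bits is my understanding of how the
precedence order documented in seccomp(2) falls out of the constant
values in <linux/seccomp.h>:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/seccomp.h>

/* Rank an action: a lower signed value means a more restrictive action.
 * SECCOMP_RET_KILL_PROCESS has the top bit set, so it ranks lowest. */
static int32_t action_rank(uint32_t ret)
{
    return (int32_t)(ret & SECCOMP_RET_ACTION_FULL);
}

/*
 * Every attached filter is run and each yields one return value. The
 * value that takes effect is the first-seen one with the most
 * restrictive action, starting from SECCOMP_RET_ALLOW.
 */
uint32_t combine_filter_results(const uint32_t *results, size_t n)
{
    uint32_t chosen = SECCOMP_RET_ALLOW;

    assert(results != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (action_rank(results[i]) < action_rank(chosen))
            chosen = results[i];
    }
    return chosen;
}
```

Under this model a second filter cannot "take over" when the first one
returns SECCOMP_RET_USER_NOTIF with no listener: both always run, and an
errno result from either one outranks the notification.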

Rich


[seccomp] Request for a "enable on execve" mode for Seccomp filters

2020-10-28 Thread Camille Mougey
Hello,

(This is my first message to the kernel list, I hope I'm doing it right)

From my understanding, there is no way to delay the activation of
seccomp filters, for instance "until an _execve_ call".
But this might be useful, especially for tools that sandbox other,
non-cooperative, executables, such as "systemd" or "FireJail".

It seems to be a caveat of seccomp specific to the system call
_execve_. For now, some tools such as "systemd" explicitly mention
this exception, and do not support it (from the man page):
> Note that strict system call filters may impact execution and error handling 
> code paths of the service invocation. Specifically, access to the execve 
> system call is required for the execution of the service binary — if it is 
> blocked service invocation will necessarily fail

"FireJail" takes a different approach[1], with a kind of workaround:
the project uses an external library to be loaded through LD_PRELOAD
mechanism, in order to install filters during the loader stage.
This approach, a bit hacky, also has several caveats:
* _openat_, _mmap_, etc. must be allowed in order to reach the
LD_PRELOAD mechanism, and for the crafted library to work ;
* it doesn't work for static binaries.

I only see hackish ways to restrict the use of _execve_ in a
non-cooperative executable. These methods all seem bypassable
and are not satisfactory from a security point of view.

IMHO, a way to prepare filters and enable them only on the next
_execve_ would have some benefits:
* have a way to restrict _execve_ in a non-cooperative executable;
* install filters atomically, i.e. before the _execve_ system call
returns. That would limit racy situations, and have the very first
instructions of potentially untrusted binaries already subject to
seccomp filters. It would also ensure there is only one thread running
at the time the filter is enabled.
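
For context, the sequence that is available today looks roughly like the
sketch below (launch_confined is a hypothetical helper name, and passing
prog == NULL to mean "no filter" is a convenience of this sketch). The
filter is applied in the child before execve(), so it is already active
for the execve() call itself and therefore has to permit it:

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * fork, apply the filter in the child, then exec the target.
 * Returns the target's exit status, or -1 if it could not be obtained.
 */
int launch_confined(const char *path, char *const argv[],
                    const struct sock_fprog *prog)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        if (prog != NULL) {
            /* "prepare & apply": from here on the filter is live ... */
            if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
                _exit(126);
            if (syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, prog) != 0)
                _exit(126);
        }
        /* ... so this call only works if the filter allows execve. */
        execv(path, argv);
        _exit(127);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

An "enable on execve" mode would move the apply step inside the execve()
call, so the filter would not need to allow execve at all.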

From what I understand, there is a related use case[2] where the
"enable on exec" mode would also be a solution.

Thanks for your attention,
C. Mougey

[1]: https://github.com/netblue30/firejail/issues/3685
[2]: https://lore.kernel.org/linux-man/202010250759.F9745E0B6@keescook/


Re: [seccomp] Request for a "enable on execve" mode for Seccomp filters

2020-10-28 Thread Rich Felker
On Wed, Oct 28, 2020 at 06:34:56PM +0100, Jann Horn wrote:
> On Wed, Oct 28, 2020 at 5:49 PM Rich Felker  wrote:
> > On Wed, Oct 28, 2020 at 01:42:13PM +0100, Jann Horn wrote:
> > > On Wed, Oct 28, 2020 at 12:18 PM Camille Mougey  wrote:
> > > You're just focusing on execve() - I think it's important to keep in
> > > mind what happens after execve() for normal, dynamically-linked
> > > binaries: The next step is that the dynamic linker runs, and it will
> > > poke around in the file system with access() and openat() and fstat(),
> > > it will mmap() executable libraries into memory, it will mprotect()
> > > some memory regions, it will set up thread-local storage (e.g. using
> > > arch_prctl(); even if the process is single-threaded), and so on.
> > >
> > > The earlier you install the seccomp filter, the more of these steps
> > > you have to permit in the filter. And if you want the filter to take
> > > effect directly after execve(), the syscalls you'll be forced to
> > > permit are sufficient to cobble something together in userspace that
> > > effectively does almost the same thing as execve().
> >
> > I would assume you use SECCOMP_RET_USER_NOTIF to implement policy for
> > controlling these operations and allowing only the ones that are valid
> > during dynamic linking. This also allows you to defer application of
> > the filter until after execve. So unless I'm missing some reason why
> > this doesn't work, I think the requested functionality is already
> > available.
> 
> Ah, yeah, good point.
> 
> > If you really just want the "activate at exec" behavior, it might be
> > possible (depending on how SECCOMP_RET_USER_NOTIF behaves when there's
> > no notify fd open; I forget)
> 
> syscall returns -ENOSYS. Yeah, that'd probably do the job. (Even
> though it might be a bit nicer if userspace had control over the errno
> there, such that it could be EPERM instead... oh well.)

EPERM is a major bug in current sandbox implementations, so ENOSYS is
at least mildly better, but indeed it should be controllable, probably
by allowing a code path for the BPF to continue with a jump to a
different logic path if the notify listener is missing.

Rich


Re: [seccomp] Request for a "enable on execve" mode for Seccomp filters

2020-10-28 Thread Jann Horn
+luto just in case he has opinions on this

On Wed, Oct 28, 2020 at 12:18 PM Camille Mougey  wrote:
> From my understanding, there is no way to delay the activation of
> seccomp filters, for instance "until an _execve_ call".

(FWIW, there are some tricks that you can use for this. In particular,
you can attach to the child with ptrace before the child runs
execve(), and then use seccomp to inject a filter after execve(), or
something like that. The disadvantage is that this is not super pretty
because it interferes with debugging of the parent process. IIRC e.g.
Ubuntu's launchd did things this way.)

> But this might be useful, especially for tools who sandbox other,
> non-cooperative, executables, such as "systemd" or "FireJail".
>
> It seems to be a caveat of seccomp specific to the system call
> _execve_. For now, some tools such as "systemd" explicitly mention
> this exception, and do not support it (from the man page):
> > Note that strict system call filters may impact execution and error 
> > handling code paths of the service invocation. Specifically, access to the 
> > execve system call is required for the execution of the service binary — if 
> > it is blocked service invocation will necessarily fail
>
> "FireJail" takes a different approach[1], with a kind of workaround:
> the project uses an external library to be loaded through LD_PRELOAD
> mechanism, in order to install filters during the loader stage.
> This approach, a bit hacky, also has several caveats:
> * _openat_, _mmap_, etc. must be allowed in order to reach the
> LD_PRELOAD mechanism, and for the crafted library to work ;

Those caveats are not specific to the LD_PRELOAD approach. Actually,
the LD_PRELOAD approach is the only one which I would expect to *not*
have that caveat. (Of course, non-executable mmap() and probably also
openat() are anyway needed for almost any real-world service to do its
job correctly.)

> * it doesn't work for static binaries.

IMO the important thing about LD_PRELOAD is that it is unreliable:
When the LD_PRELOAD library can't be opened, glibc just prints a
warning and continues execution - and an attacker may be able to cause
opening an LD_PRELOAD library to fail by opening so many files in
other processes that the global limit is reached. So you can't build
reliable security infrastructure on LD_PRELOAD. This is not a
fundamental problem though - glibc could address this.

> I only see hackish ways to restrict the use of _execve_ in a
> non-cooperative executable. These methods seem globally bypassables
> and not satisfactory from a security point of view.

You're just focusing on execve() - I think it's important to keep in
mind what happens after execve() for normal, dynamically-linked
binaries: The next step is that the dynamic linker runs, and it will
poke around in the file system with access() and openat() and fstat(),
it will mmap() executable libraries into memory, it will mprotect()
some memory regions, it will set up thread-local storage (e.g. using
arch_prctl(); even if the process is single-threaded), and so on.

The earlier you install the seccomp filter, the more of these steps
you have to permit in the filter. And if you want the filter to take
effect directly after execve(), the syscalls you'll be forced to
permit are sufficient to cobble something together in userspace that
effectively does almost the same thing as execve().

Your usecase might be better served by adding a glibc feature for
"unskippable LD_PRELOAD" paired with a constructor function, or
something along those lines.

> IMHO, a way to prepare filter and enable them only on the next
> _execve_ would have some benefit:
> * have a way to restrict _execve_ in a non-cooperative executable;

As I said above, I think glibc is a better place to deal with this -
they could e.g. add a new LD_PRELOAD_MANDATORY that means "you *have*
to preload this library, and if that's not possible because the
library can't be opened or because the execution is setuid, then you
have to treat that as a fatal error".

> * install filters atomically, ie. before the _execve_ system call
> return. That would limit racy situations,

If you say "racy", please also describe what the >=2 participants of
the race would be. Are you worried about a scenario where a process A
that dropped privileges becomes dumpable via execve(), and another
process B running under the same set of UIDs (but also a seccomp
filter) then attaches to A and injects code into A before A enables
its seccomp filter? That would require that the sandboxing of process
B is not strong enough to prevent it from interfering with other
processes (that's plausible if we're just externally applying a coarse
sandbox, since procfs access is sufficient for that) and that the
system is not running Yama.

I feel like Seccomp isn't the right place to address processes being
able to debug each other though - something like Yama would probably
be a better place. Or an execveat() flag that 

Re: [seccomp] Request for a "enable on execve" mode for Seccomp filters

2020-10-28 Thread Kees Cook
On Wed, Oct 28, 2020 at 01:42:13PM +0100, Jann Horn wrote:
> +luto just in case he has opinions on this
> 
> On Wed, Oct 28, 2020 at 12:18 PM Camille Mougey  wrote:
> > From my understanding, there is no way to delay the activation of
> > seccomp filters, for instance "until an _execve_ call".
> > [...]
> > It would also ensure there is only one thread running
> > at the filter enabling time.
> 
> You're alluding to cases where library constructor functions launch
> threads? Is that a thing anyone does? (And in case someone does it, we
> still have TSYNC, so I don't think this would be a real problem.)

Unfortunately, yes, it happens. TSYNC got designed specifically to
"recapture" these constructor-launched threads. :( It was a common enough
situation Chrome wanted to solve due to some weird GPU libraries that
did this during init before Chrome was running.
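
For readers unfamiliar with TSYNC, a hedged sketch of how it is requested
and how its return value is read, based on the description in seccomp(2).
install_for_all_threads and classify_tsync_result are illustrative names,
and the caller is assumed to have set no_new_privs (or to hold
CAP_SYS_ADMIN) beforehand:

```c
#define _GNU_SOURCE
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>
#include <unistd.h>

enum tsync_outcome {
    TSYNC_OK,               /* filter attached to every thread */
    TSYNC_ERROR,            /* ordinary failure, see errno */
    TSYNC_THREAD_DIVERGED   /* some thread could not be synchronized */
};

/*
 * Ask the kernel to attach the filter to all threads of the process,
 * including ones started earlier by library constructors.
 */
long install_for_all_threads(const struct sock_fprog *prog)
{
    return syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER,
                   SECCOMP_FILTER_FLAG_TSYNC, prog);
}

/*
 * With SECCOMP_FILTER_FLAG_TSYNC a positive return value is not
 * success: it is the ID of the first thread that could not be
 * synchronized, and no filter was attached.
 */
enum tsync_outcome classify_tsync_result(long ret)
{
    if (ret == 0)
        return TSYNC_OK;
    if (ret < 0)
        return TSYNC_ERROR;
    return TSYNC_THREAD_DIVERGED;
}
```

Typical use would be classify_tsync_result(install_for_all_threads(&prog)),
treating anything other than TSYNC_OK as a failure to confine.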

-- 
Kees Cook


Re: [seccomp] Request for a "enable on execve" mode for Seccomp filters

2020-10-28 Thread Kees Cook
On Wed, Oct 28, 2020 at 12:49:36PM -0400, Rich Felker wrote:
> On Wed, Oct 28, 2020 at 01:42:13PM +0100, Jann Horn wrote:
> > +luto just in case he has opinions on this
> > 
> > On Wed, Oct 28, 2020 at 12:18 PM Camille Mougey  wrote:
> > > From my understanding, there is no way to delay the activation of
> > > seccomp filters, for instance "until an _execve_ call".
> > > [...]
> > > I only see hackish ways to restrict the use of _execve_ in a
> > > non-cooperative executable. These methods seem globally bypassables
> > > and not satisfactory from a security point of view.
> > 
> > You're just focusing on execve() - I think it's important to keep in
> > mind what happens after execve() for normal, dynamically-linked
> > binaries: The next step is that the dynamic linker runs, and it will
> > poke around in the file system with access() and openat() and fstat(),
> > it will mmap() executable libraries into memory, it will mprotect()
> > some memory regions, it will set up thread-local storage (e.g. using
> > arch_prctl(); even if the process is single-threaded), and so on.
> > 
> > The earlier you install the seccomp filter, the more of these steps
> > you have to permit in the filter. And if you want the filter to take
> > effect directly after execve(), the syscalls you'll be forced to
> > permit are sufficient to cobble something together in userspace that
> > effectively does almost the same thing as execve().
> 
> I would assume you use SECCOMP_RET_USER_NOTIF to implement policy for
> controlling these operations and allowing only the ones that are valid
> during dynamic linking. This also allows you to defer application of
> the filter until after execve. So unless I'm missing some reason why
> this doesn't work, I think the requested functionality is already
> available.

Oof. Yeah, that's possible, but I view it as kind of not the point of
USER_NOTIF -- I'd rather design a workable solution for the
delayed-apply case.
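
For reference, the USER_NOTIF approach discussed above amounts to a filter
of roughly the following shape. This is only a sketch: build_notify_filter()
and its parameters are made-up names, and allowing only exit_group() is an
arbitrary choice for illustration. The BPF macros and SECCOMP_RET_* values
are the ones from the kernel UAPI headers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>

/* Fallbacks for older kernel headers; values match <linux/seccomp.h>. */
#ifndef SECCOMP_RET_USER_NOTIF
#define SECCOMP_RET_USER_NOTIF 0x7fc00000U
#endif
#ifndef SECCOMP_RET_KILL_PROCESS
#define SECCOMP_RET_KILL_PROCESS 0x80000000U
#endif

/*
 * Hypothetical helper: write a filter into out[] that allows exit_group()
 * and forwards every other syscall to the user-space listener.
 * audit_arch is the AUDIT_ARCH_* value the caller expects to run under.
 * Returns the number of instructions written, or 0 if out[] is too small.
 */
size_t build_notify_filter(struct sock_filter *out, size_t cap,
                           unsigned int audit_arch)
{
    const struct sock_filter prog[] = {
        /* Kill the process if the syscall ABI is not the expected one. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, audit_arch, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
        /* Allow exit_group(); hand everything else to the listener. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_exit_group, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
    };
    const size_t n = sizeof(prog) / sizeof(prog[0]);

    assert(out != NULL);
    if (cap < n)
        return 0;
    memcpy(out, prog, sizeof(prog));
    return n;
}
```

The resulting array would be wrapped in a struct sock_fprog and installed
with seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_NEW_LISTENER, ...)
before execve(), with the returned fd handed to the supervisor.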

-- 
Kees Cook


Re: [seccomp] Request for a "enable on execve" mode for Seccomp filters

2020-10-28 Thread Andy Lutomirski
On Wed, Oct 28, 2020 at 3:47 PM Kees Cook  wrote:
>
> On Wed, Oct 28, 2020 at 12:18:47PM +0100, Camille Mougey wrote:
> > (This is my first message to the kernel list, I hope I'm doing it right)
>
> 1- self-confinement
> 2- launching external processes
> a) cooperating
> b) oblivious

I remain quite unconvinced that delayed filters will solve a real
problem.  As you described, 2a could just confine itself.  There's an
obvious synchronization point -- sd_notify().  I bet sd_notify() could
be rigged up to apply externally-supplied filters, or sd_notify()
could interact with user notifiers to get some assistance.

2b is nasty.  In an ideal world, we would materialize a fully formed
process with filters installed.  The problem is that processes don't
generally come fully formed.  Almost all interesting processes are
dynamically linked, and they get to specify their own dynamic linkers.
Even if we limit ourselves to a known dynamic linker, we would want to
make sure that the dynamic linker is hardened against various escape
techniques.  For dynamic linking, we would probably want to start out
with one set of privileges (loading libraries) and then switch.

I have an alternative suggestion to try to address some of the above:
allow a notifier to run in a mode in which it can replace the BPF
program outright.  This would be something like:

if (fork() != 0)
  return;  // do parent stuff

// Start up.  Set a BPF program that directs pretty much everything at
// the listener.
int fd = seccomp(..., SECCOMP_FILTER_FLAG_NEW_LISTENER |
                 SECCOMP_FILTER_FLAG_ALLOW_REPLACEMENT, ...);

// Set up other things if needed.

execve();

Now, in the parent, once the child is ready for its final filters:

// Replace the filter on *all* processes using the filter to which
// we're attached.
// I think the locking for this should be straightforward.
// Optional flag here to remove the ALLOW_REPLACEMENT flag, but it's
// not really necessary since we're about to close() the listener.
ioctl(fd, SECCOMP_IOCTL_NOTIF_REPLACE_FILTER, new_filter);

// Call recv in a loop to drain and handle notifications.
for (...) {
  ioctl(fd, SECCOMP_IOCTL_NOTIF_RECV, ...);
  ...
}

close(fd);

And now we're done.  We can make the synchronization point be anything we like.


What do you all think?  For people who really want
delay-until-execve(), this can emulate it efficiently.


Re: [seccomp] Request for a "enable on execve" mode for Seccomp filters

2020-10-28 Thread Kees Cook
On Wed, Oct 28, 2020 at 12:18:47PM +0100, Camille Mougey wrote:
> (This is my first message to the kernel list, I hope I'm doing it right)

Looks good to me! The key was CCing real people. ;)

> From my understanding, there is no way to delay the activation of
> seccomp filters, for instance "until an _execve_ call".
> But this might be useful, especially for tools that sandbox other,
> non-cooperative, executables, such as "systemd" or "FireJail".
> [...]
> I only see hackish ways to restrict the use of _execve_ in a
> non-cooperative executable. These methods seem globally bypassable
> and not satisfactory from a security point of view.
> 
> IMHO, a way to prepare filter and enable them only on the next
> _execve_ would have some benefit:
> * have a way to restrict _execve_ in a non-cooperative executable;
> * install filters atomically, i.e. before the _execve_ system call
> returns. That would limit racy situations, and have the very first
> instructions of potentially untrusted binaries already subject to
> seccomp filters. It would also ensure there is only one thread running
> at the filter enabling time.
> 
> From what I understand, there is a relative use case[2] where the
> "enable on exec" mode would also be a solution.
> 
> Thanks for your attention,
> C. Mougey
> 
> [1]: https://github.com/netblue30/firejail/issues/3685
> [2]: https://lore.kernel.org/linux-man/202010250759.F9745E0B6@keescook/

Just to restate things already said in the thread and to try to illustrate
with more clarity, I tend to organize my thinking about seccomp usage
into three categories:

1- self-confinement
2- launching external processes
a) cooperating
b) oblivious

I classify things like Chrome's complex tree of related processes and
filters as 1, since it's all one thing together.

I think of systemd, docker, minijail, FireJail, etc all as falling into
category 2, with some variation about how to deal with 2a or 2b. I see
systemd as weakly covering both 2a and 2b: e.g. services are documenting
what restrictions they want, etc. minijail has stronger 2b coverage as
it attempts to do PRELOAD tricks (which it sounds like FireJail does
too?) (Aside: why doesn't systemd do any self-confinement?)

We don't have much possibility for the targets in the 2a realm as far
as cooperating over how to _manage_ confinement, but rather about simply
expecting confinement to exist, or adding more confinement on their own.

So, what would adding delayed filters gain in the above classifications?

Both 1 and 2 would benefit from some simplification over how to apply
filters (e.g. the referenced relative complexity of needing to pass the
USER_NOTIF fd up to the supervisor).

Dealing with 2b is improved by allowing execve itself to be blocked.

If we turn this:

fork
prepare & apply
exec

into this:

fork
prepare
exec & apply

for 2a, this isn't too interesting since a 2a target could just give up
execve after it launched. For 2b, though, it's pretty meaningful to gain
further isolation of an oblivious (and presumably untrusted) process
(given all the hacks needed to try to cover the situation).

And to clarify, 2a would much prefer this to be able to separate
initialization from runtime:

fork
prepare
exec
other things
apply

And just for completeness, none of this is useful at all for 1, which
doesn't even "see" the fork from its perspective:

exec
other things
prepare & apply

How should 2a targets indicate they're ready? Can it be done passively
(in the sense that libc would make some seccomp call to apply the
delayed filters), or does it need to stay explicit? (e.g. can we turn
a pre-untrusted-input 2b into a 2a just by having the libc make calls?)
My instinct is that hiding it won't gain much over an "on-execve" case,
but having an explicit call that means "I'm done initializing now" would
be a meaningful synchronization point -- except I note that it just means
the target could just as easily start doing its own confinement anyway,
which means they effectively move from 2a to 1, and now we don't care
about delayed filters any more.

So, lacking a clearer sync point, execve() does seem to stand out to me.

The other idea which was touched on in the thread was very direct
management (e.g. ptrace), where the supervisor waits until some point and
then forces the filters to apply on the target. What would be more
light-weight than this? (Or rather, what kinds of things would such a
ptracer be looking for to mark "I've started"?)

Since I've got bitmaps on my mind, what about a syscall bitmap that
triggers the application of delayed filters? The supervisor is launching
a daemon: mark NR_listen as the apply-point. The supervisor is launching
something totally unknown: mark NR_execve as the apply-point.

If we did that, what happens to non-delayed filters applied between
program start and the apply-point getting tripped?
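
To make the bitmap idea above a bit more concrete, here is a user-space
model of it. Nothing like this exists in the kernel; struct delayed_state
and the delayed_*() functions are invented names. The sketch also assumes
that the syscall which trips the apply-point still runs unfiltered and that
the delayed filters only cover later syscalls, which is one possible answer
to a question the proposal leaves open.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_SYSCALLS 512
#define SKETCH_WORD_BITS (sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_WORDS \
    ((SKETCH_MAX_SYSCALLS + SKETCH_WORD_BITS - 1) / SKETCH_WORD_BITS)

/* Hypothetical per-task state: which syscalls arm the delayed filters. */
struct delayed_state {
    unsigned long apply_points[SKETCH_WORDS];
    bool applied;
};

void delayed_state_init(struct delayed_state *s)
{
    assert(s != NULL);
    memset(s, 0, sizeof(*s));
}

/* Supervisor side: mark syscall nr as an apply-point. False if out of range. */
bool delayed_mark_apply_point(struct delayed_state *s, int nr)
{
    assert(s != NULL);
    if (nr < 0 || nr >= SKETCH_MAX_SYSCALLS)
        return false;
    s->apply_points[nr / SKETCH_WORD_BITS] |= 1UL << (nr % SKETCH_WORD_BITS);
    return true;
}

/*
 * Syscall-entry side: returns whether the delayed filters cover this call.
 * Assumption: the call that trips the apply-point is itself not filtered;
 * every call after it is.
 */
bool delayed_filters_apply(struct delayed_state *s, int nr)
{
    assert(s != NULL);
    bool active = s->applied;

    if (!active && nr >= 0 && nr < SKETCH_MAX_SYSCALLS &&
        ((s->apply_points[nr / SKETCH_WORD_BITS] >> (nr % SKETCH_WORD_BITS)) & 1UL))
        s->applied = true;
    return active;
}
```

A supervisor launching a daemon would mark the number of listen() here, and
one launching something unknown would mark the number of execve().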

-- 
Kees Cook


Re: [seccomp] Request for a "enable on execve" mode for Seccomp filters

2020-10-28 Thread Jann Horn
On Wed, Oct 28, 2020 at 7:35 PM Rich Felker  wrote:
> On Wed, Oct 28, 2020 at 07:25:45PM +0100, Jann Horn wrote:
> > On Wed, Oct 28, 2020 at 6:52 PM Rich Felker  wrote:
> > > On Wed, Oct 28, 2020 at 06:34:56PM +0100, Jann Horn wrote:
> > > > On Wed, Oct 28, 2020 at 5:49 PM Rich Felker  wrote:
> > > > > On Wed, Oct 28, 2020 at 01:42:13PM +0100, Jann Horn wrote:
> > > > > > > On Wed, Oct 28, 2020 at 12:18 PM Camille Mougey  wrote:
> > > > > > You're just focusing on execve() - I think it's important to keep in
> > > > > > mind what happens after execve() for normal, dynamically-linked
> > > > > > binaries: The next step is that the dynamic linker runs, and it will
> > > > > > > poke around in the file system with access() and openat() and fstat(),
> > > > > > it will mmap() executable libraries into memory, it will mprotect()
> > > > > > some memory regions, it will set up thread-local storage (e.g. using
> > > > > > arch_prctl(); even if the process is single-threaded), and so on.
> > > > > >
> > > > > > The earlier you install the seccomp filter, the more of these steps
> > > > > > you have to permit in the filter. And if you want the filter to take
> > > > > > effect directly after execve(), the syscalls you'll be forced to
> > > > > > permit are sufficient to cobble something together in userspace that
> > > > > > effectively does almost the same thing as execve().
> > > > >
> > > > > I would assume you use SECCOMP_RET_USER_NOTIF to implement policy for
> > > > > controlling these operations and allowing only the ones that are valid
> > > > > during dynamic linking. This also allows you to defer application of
> > > > > the filter until after execve. So unless I'm missing some reason why
> > > > > this doesn't work, I think the requested functionality is already
> > > > > available.
> > > >
> > > > Ah, yeah, good point.
> > > >
> > > > > If you really just want the "activate at exec" behavior, it might be
> > > > > possible (depending on how SECCOMP_RET_USER_NOTIF behaves when there's
> > > > > no notify fd open; I forget)
> > > >
> > > > syscall returns -ENOSYS. Yeah, that'd probably do the job. (Even
> > > > though it might be a bit nicer if userspace had control over the errno
> > > > there, such that it could be EPERM instead... oh well.)
> > >
> > > EPERM is a major bug in current sandbox implementations, so ENOSYS is
> > > at least mildly better, but indeed it should be controllable, probably
> > > by allowing a code path for the BPF to continue with a jump to a
> > > different logic path if the notify listener is missing.
> >
> > I guess we might be able to expose the listener status through a bit /
> > a field in the struct seccomp_data, and then filters could branch on
> > that. (And the kernel would run the filter twice if we raced with
> > filter detachment.) I don't know whether it would look pretty, but I
> > think it should be doable...
>
> I was thinking the race wouldn't be salvageable, but indeed since the
> filter is side-effect-free you can just re-run it if the status
> changes between start of filter processing and the attempt at
> notification. This sounds like it should work.
>
> I guess it's not possible to chain two BPF filters to do this, because
> that only works when the first one allows? Or am I misunderstanding
> the multiple-filters case entirely? (I've never gotten that far with
> programming it.)

I'm not sure if I'm understanding the question correctly...
At the moment you basically can't have multiple filters with notifiers.
The rule with multiple filters is always that all the filters get run,
and the actual action taken is the most restrictive result of all of
them.
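
As a rough illustration of that rule: the kernel compares the action part
of each filter's return value as a signed 32-bit number and keeps the
smallest one. The sketch below models this in user space. The SKETCH_*
constants are copies of the SECCOMP_RET_* values from <linux/seccomp.h>;
combine_filter_results() is a made-up name, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copies of the SECCOMP_RET_* action values from <linux/seccomp.h>. */
#define SKETCH_RET_KILL_PROCESS 0x80000000U
#define SKETCH_RET_KILL_THREAD  0x00000000U
#define SKETCH_RET_TRAP         0x00030000U
#define SKETCH_RET_ERRNO        0x00050000U
#define SKETCH_RET_USER_NOTIF   0x7fc00000U
#define SKETCH_RET_TRACE        0x7ff00000U
#define SKETCH_RET_LOG          0x7ffc0000U
#define SKETCH_RET_ALLOW        0x7fff0000U
#define SKETCH_RET_ACTION_FULL  0xffff0000U

/* Action bits only, viewed as signed so that KILL_PROCESS sorts lowest. */
static int32_t action_only(uint32_t ret)
{
    return (int32_t)(ret & SKETCH_RET_ACTION_FULL);
}

/*
 * Given the return values of all attached filters, pick the one that takes
 * effect: the most restrictive, i.e. the one with the lowest signed action.
 * With no filters at all the result is ALLOW.
 */
uint32_t combine_filter_results(const uint32_t *results, size_t n)
{
    uint32_t chosen = SKETCH_RET_ALLOW;

    assert(results != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (action_only(results[i]) < action_only(chosen))
            chosen = results[i];
    }
    return chosen;
}
```

So if one filter returns USER_NOTIF and another returns ERRNO for the same
syscall, the ERRNO result wins and the listener never sees the call.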


Re: [seccomp] Request for a "enable on execve" mode for Seccomp filters

2020-10-28 Thread Jann Horn
On Wed, Oct 28, 2020 at 5:49 PM Rich Felker  wrote:
> On Wed, Oct 28, 2020 at 01:42:13PM +0100, Jann Horn wrote:
> > On Wed, Oct 28, 2020 at 12:18 PM Camille Mougey  wrote:
> > You're just focusing on execve() - I think it's important to keep in
> > mind what happens after execve() for normal, dynamically-linked
> > binaries: The next step is that the dynamic linker runs, and it will
> > poke around in the file system with access() and openat() and fstat(),
> > it will mmap() executable libraries into memory, it will mprotect()
> > some memory regions, it will set up thread-local storage (e.g. using
> > arch_prctl(); even if the process is single-threaded), and so on.
> >
> > The earlier you install the seccomp filter, the more of these steps
> > you have to permit in the filter. And if you want the filter to take
> > effect directly after execve(), the syscalls you'll be forced to
> > permit are sufficient to cobble something together in userspace that
> > effectively does almost the same thing as execve().
>
> I would assume you use SECCOMP_RET_USER_NOTIF to implement policy for
> controlling these operations and allowing only the ones that are valid
> during dynamic linking. This also allows you to defer application of
> the filter until after execve. So unless I'm missing some reason why
> this doesn't work, I think the requested functionality is already
> available.

Ah, yeah, good point.

> If you really just want the "activate at exec" behavior, it might be
> possible (depending on how SECCOMP_RET_USER_NOTIF behaves when there's
> no notify fd open; I forget)

syscall returns -ENOSYS. Yeah, that'd probably do the job. (Even
though it might be a bit nicer if userspace had control over the errno
there, such that it could be EPERM instead... oh well.)
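
Why the choice of errno matters to callers can be sketched as follows. This
is a simplified model of the usual libc pattern of trying a newer syscall
and falling back to an older one only on ENOSYS. The stub functions stand
in for real syscalls and are invented for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* An operation returns 0 on success or a negative errno value. */
typedef int (*op_fn)(long *out);

/*
 * Try the preferred (newer) operation and fall back to the legacy one only
 * if the kernel reports that the preferred one does not exist. Any other
 * error, EPERM included, is treated as final and passed to the caller.
 */
int call_with_fallback(op_fn preferred, op_fn legacy, long *out)
{
    assert(preferred != NULL && legacy != NULL && out != NULL);

    int ret = preferred(out);
    if (ret == -ENOSYS)
        return legacy(out);
    return ret;
}

/* Stub: a newer syscall blocked by a filter that returns ENOSYS. */
int stub_blocked_enosys(long *out) { (void)out; return -ENOSYS; }

/* Stub: the same syscall blocked by a filter that returns EPERM. */
int stub_blocked_eperm(long *out) { (void)out; return -EPERM; }

/* Stub: the older syscall, which the filter allows. */
int stub_legacy_ok(long *out) { *out = 1234; return 0; }
```

With the ENOSYS stub the caller ends up with the legacy result; with the
EPERM stub it gives up, even though the legacy call would have been allowed.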


Re: [seccomp] Request for a "enable on execve" mode for Seccomp filters

2020-10-28 Thread Jann Horn
On Wed, Oct 28, 2020 at 6:52 PM Rich Felker  wrote:
> On Wed, Oct 28, 2020 at 06:34:56PM +0100, Jann Horn wrote:
> > On Wed, Oct 28, 2020 at 5:49 PM Rich Felker  wrote:
> > > On Wed, Oct 28, 2020 at 01:42:13PM +0100, Jann Horn wrote:
> > > > > On Wed, Oct 28, 2020 at 12:18 PM Camille Mougey  wrote:
> > > > You're just focusing on execve() - I think it's important to keep in
> > > > mind what happens after execve() for normal, dynamically-linked
> > > > binaries: The next step is that the dynamic linker runs, and it will
> > > > poke around in the file system with access() and openat() and fstat(),
> > > > it will mmap() executable libraries into memory, it will mprotect()
> > > > some memory regions, it will set up thread-local storage (e.g. using
> > > > arch_prctl(); even if the process is single-threaded), and so on.
> > > >
> > > > The earlier you install the seccomp filter, the more of these steps
> > > > you have to permit in the filter. And if you want the filter to take
> > > > effect directly after execve(), the syscalls you'll be forced to
> > > > permit are sufficient to cobble something together in userspace that
> > > > effectively does almost the same thing as execve().
> > >
> > > I would assume you use SECCOMP_RET_USER_NOTIF to implement policy for
> > > controlling these operations and allowing only the ones that are valid
> > > during dynamic linking. This also allows you to defer application of
> > > the filter until after execve. So unless I'm missing some reason why
> > > this doesn't work, I think the requested functionality is already
> > > available.
> >
> > Ah, yeah, good point.
> >
> > > If you really just want the "activate at exec" behavior, it might be
> > > possible (depending on how SECCOMP_RET_USER_NOTIF behaves when there's
> > > no notify fd open; I forget)
> >
> > syscall returns -ENOSYS. Yeah, that'd probably do the job. (Even
> > though it might be a bit nicer if userspace had control over the errno
> > there, such that it could be EPERM instead... oh well.)
>
> EPERM is a major bug in current sandbox implementations, so ENOSYS is
> at least mildly better, but indeed it should be controllable, probably
> by allowing a code path for the BPF to continue with a jump to a
> different logic path if the notify listener is missing.

I guess we might be able to expose the listener status through a bit /
a field in the struct seccomp_data, and then filters could branch on
that. (And the kernel would run the filter twice if we raced with
filter detachment.) I don't know whether it would look pretty, but I
think it should be doable...
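
Very roughly, and only as a user-space model of the idea: the struct below
is not the real struct seccomp_data, the listener_attached field does not
exist, and all names are made up. The filter is modelled as a pure function
so that running it a second time is harmless.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_RET_ERRNO       0x00050000U
#define SKETCH_RET_USER_NOTIF  0x7fc00000U
#define SKETCH_RET_ALLOW       0x7fff0000U
#define SKETCH_RET_ACTION_FULL 0xffff0000U

/* Hypothetical filter input with an extra "listener attached" bit. */
struct sketch_seccomp_data {
    int nr;
    bool listener_attached;
};

typedef uint32_t (*sketch_filter_fn)(const struct sketch_seccomp_data *d);
typedef bool (*sketch_status_fn)(void *ctx);

/* Example filter: notify while a listener exists, otherwise pick an errno. */
uint32_t sketch_filter(const struct sketch_seccomp_data *d)
{
    if (d->nr == 1)
        return SKETCH_RET_ALLOW;
    if (d->listener_attached)
        return SKETCH_RET_USER_NOTIF;
    return SKETCH_RET_ERRNO | EPERM;   /* the filter author's choice */
}

/*
 * Run the filter; if it asks for a notification but the listener has gone
 * away in the meantime, run it once more with the bit cleared. Detachment
 * is one-way, so two runs are enough. *runs reports how many were needed.
 */
uint32_t sketch_evaluate(sketch_filter_fn filter, int nr,
                         sketch_status_fn alive, void *ctx, int *runs)
{
    assert(filter != NULL && alive != NULL && runs != NULL);

    struct sketch_seccomp_data d = { .nr = nr, .listener_attached = alive(ctx) };
    uint32_t ret = filter(&d);
    *runs = 1;

    if ((ret & SKETCH_RET_ACTION_FULL) == SKETCH_RET_USER_NOTIF && !alive(ctx)) {
        d.listener_attached = false;
        ret = filter(&d);
        *runs = 2;
    }
    return ret;
}

/* Test double: a listener that detaches after a given number of queries. */
struct sketch_listener { int queries_until_detach; };

bool sketch_listener_alive(void *ctx)
{
    struct sketch_listener *l = ctx;
    if (l->queries_until_detach > 0) {
        l->queries_until_detach--;
        return true;
    }
    return false;
}
```

A filter that still returns USER_NOTIF on the second run would presumably
get today's behaviour, i.e. the syscall fails with ENOSYS.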