Re: [seccomp] Request for a "enable on execve" mode for Seccomp filters
On Thu, Oct 29, 2020 at 07:58:42AM +, Sargun Dhillon wrote:
> A mechanism for the thing listening on the listener FD to turn itself on or
> off and indicate that it is no longer interested in receiving notifications
> and to always continue / return an error code, or that it has taken an
> interest now, and it would like to return to handling these events. The idea
> of an action other than ENOSYS (specifically SECCOMP_USER_NOTIF_FLAG_CONTINUE)
> if the listener goes away is attractive as well, in case the supervisor
> crashes. EPERM is somewhat "cleaner" of an error code than ENOSYS (most
> people don't write handling for ENOSYS on connect).

This is a common misconception that really needs to be addressed.
EPERM is not a suitable error code for as-yet-unknown seccomp-blocked
syscalls, and is not suitable for a large portion of currently known
ones either. Its use has actively broken lots of things that would
have worked fine with ENOSYS returned.

This is because a caller will react to ENOSYS by attempting to do
whatever it wanted to do in another way, one which your policy might
handle better (for example, if your filter is outdated and blocking
clock_gettime64 but not clock_gettime, or blocking statx but not
stat64) while it will react to EPERM as an indication that your filter
understands the abstract operation it was trying to perform and
considers that action forbidden. We hit issues like this with
virtually every seccomp-using application while moving to time64
because the wrong idiom is so widespread, and further vdso prevented
it from being caught right away (only users without vdso hit it
later).

From a standards perspective (and thus of programming to them), POSIX
does permit implementations to define additional errors for any
interface that already has errors defined, but does not permit
overloading the standard errors defined, so if EPERM is already
defined for an interface, returning it is probably not ok.
(IMO it is defensible if the seccomp policy can be thought of as an
extension of the permission model that would cause the standard EPERM,
but it's not defensible if the policy is just "we don't know about
this syscall yet" since the syscall might just be a new/better way of
implementing some existing operation that makes the policy appear
inconsistent.)

In your example of connect, I think either is semantically defensible;
one is "you lack suitable privilege to perform the operation" and the
other is "you're running on a type of system that doesn't have socket
connect functionality" (because it's a minimal execution environment).

Rich
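To make the ENOSYS-versus-EPERM distinction above concrete, here is a minimal sketch of the fallback idiom a caller (typically libc) follows. The helper name file_size_with_fallback is hypothetical and not taken from any library, and the sketch assumes Linux with a libc that exposes struct statx (glibc 2.28 or later). Only ENOSYS triggers a retry through the older interface; every other error, EPERM included, is treated as a final answer.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: return the size of `path` in bytes, or -1 on error.
 * Tries the newer statx() syscall first and falls back to stat() only when
 * the kernel (or a seccomp filter) answers ENOSYS.
 */
long long file_size_with_fallback(const char *path)
{
#if defined(SYS_statx) && defined(STATX_SIZE)
    struct statx stx;

    if (syscall(SYS_statx, AT_FDCWD, path, 0, STATX_SIZE, &stx) == 0)
        return (long long)stx.stx_size;

    /*
     * ENOSYS means "this syscall does not exist here": try another way.
     * Anything else (EPERM, ENOENT, ...) is taken as a real verdict on the
     * operation itself, so no fallback is attempted.
     */
    if (errno != ENOSYS)
        return -1;
#endif

    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    return (long long)st.st_size;
}
```

With a filter that answers statx with ENOSYS this helper still succeeds through stat(); with a filter that answers EPERM it fails even though stat() may be allowed, which is the breakage described in the message.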
Re: [seccomp] Request for a "enable on execve" mode for Seccomp filters
On Wed, Oct 28, 2020 at 03:47:27PM -0700, Kees Cook wrote:
> On Wed, Oct 28, 2020 at 12:18:47PM +0100, Camille Mougey wrote:
> > (This is my first message to the kernel list, I hope I'm doing it right)
>
> Looks good to me! The key was CCing real people. ;)
>
> > From my understanding, there is no way to delay the activation of
> > seccomp filters, for instance "until an _execve_ call".
> > But this might be useful, especially for tools who sandbox other,
> > non-cooperative, executables, such as "systemd" or "FireJail".
> > [...]
> > I only see hackish ways to restrict the use of _execve_ in a
> > non-cooperative executable. These methods seem globally bypassables
> > and not satisfactory from a security point of view.
> >
> > IMHO, a way to prepare filter and enable them only on the next
> > _execve_ would have some benefit:
> > * have a way to restrict _execve_ in a non-cooperative executable;
> > * install filters atomically, ie. before the _execve_ system call
> > return. That would limit racy situations, and have the very firsts
> > instructions of potentially untrusted binaries already subject to
> > seccomp filters. It would also ensure there is only one thread running
> > at the filter enabling time.
> >
> > From what I understand, there is a relative use case[2] where the
> > "enable on exec" mode would also be a solution.
> >
> > Thanks for your attention,
> > C. Mougey
> >
> > [1]: https://github.com/netblue30/firejail/issues/3685
> > [2]: https://lore.kernel.org/linux-man/202010250759.F9745E0B6@keescook/
>
> Just to restate things already said in the thread and to try to illustrate
> with more clarity, I tend to organize my thinking about seccomp usage
> into three categories:
>
> 1- self-confinement
> 2- launching external processes
>    a) cooperating
>    b) oblivious
>
> I classify things like Chrome's complex tree of related processes and
> filters as 1, since it's all one thing together.
>
> I think of systemd, docker, minijail, FireJail, etc all as falling into
> category 2, with some variation about how to deal with 2a or 2b. I see
> systemd as weakly covering both 2a and 2b: e.g. services are documenting
> what restrictions they want, etc. minijail has stronger 2b coverage as
> it attempts to do PRELOAD tricks (which it sounds like FireJail does
> too?) (Aside: why doesn't systemd do any self-confinement?)
>
> We don't have much possibility for the targets in the 2a realm as far
> as cooperating over how to _manage_ confinement, but rather about simply
> expecting confinement to exist, or adding more confinement on their own.
>
> So, what would adding delayed filters gain in the above classifications?
>
> Both 1 and 2 would benefit from some simplification over how to apply
> filters (e.g. the referenced relative complexity of needing to pass the
> USER_NOTIF fd up to the supervisor).
>
> Dealing with 2b is improved by allowing execve itself to be blocked.
>
> If we turn this:
>
>     fork
>     prepare & apply
>     exec
>
> into this:
>
>     fork
>     prepare
>     exec & apply
>
> for 2a, this isn't too interesting since a 2a target could just give up
> execve after it launched. For 2b, though, it's pretty meaningful to gain
> further isolation of an oblivious (and assumingly untrusted) process
> (given all the hacks needed to try to cover the situation).
>
> And to clarify, 2a would much prefer this to be able to separate
> initialization from runtime:
>
>     fork
>     prepare
>     exec
>     other things
>     apply
>
> And just for completeness, none of this is useful at all for 1, which
> doesn't even "see" the fork from its perspective:
>
>     exec
>     other things
>     prepare & apply
>
> How should 2a targets indicate they're ready? Can it be done passively
> (in the sense that libc would make some seccomp call to apply the
> delayed filters), or does it need to stay explicit? (e.g. can we turn
> a pre-untrusted-input 2b into a 2a just by having the libc make calls?)
> My instinct is that hiding it won't gain much over a "on-execve" case,
> but having an explicit call that means "I'm done initializing now" would
> be a meaningful synchronization point -- except I note that it just means
> the target could just as easily start doing its own confinement anyway,
> which means they effectively move from 2a to 1, and now we don't care
> about delayed filters any more.
>
> So, lacking a clearer sync point, execve() does seem to stand out to me.
>
> The other idea which was touched on in the thread was very direct
> management (e.g. ptrace) and the supervisor waits until some point and
> then forces the filters to apply on the target. What would be more
> light-weight than this? (Or rather, what kinds of things would such a
> ptracer be looking for to mark "I've started"?)
>
> Since I've got bitmaps on my mind, what about a syscall bitmap that
> triggers the application of delayed filters? The supervisor is launching
> a daemon: mark NR_listen as the
Re: [seccomp] Request for a "enable on execve" mode for Seccomp filters
On Wed, Oct 28, 2020 at 01:42:13PM +0100, Jann Horn wrote:
> +luto just in case he has opinions on this
>
> On Wed, Oct 28, 2020 at 12:18 PM Camille Mougey wrote:
> > From my understanding, there is no way to delay the activation of
> > seccomp filters, for instance "until an _execve_ call".
>
> (FWIW, there are some tricks that you can use for this. In particular,
> you can attach to the child with ptrace before the child runs
> execve(), and then use seccomp to inject a filter after execve(), or
> something like that. The disadvantage is that this is not super pretty
> because it interferes with debugging of the parent process. IIRC e.g.
> Ubuntu's launchd did things this way.)

Yes, in principle everything seccomp does could have been done with
ptrace but the whole point was not to use ptrace as a primitive to
build hacks upon. So this is not a good solution.

> > But this might be useful, especially for tools who sandbox other,
> > non-cooperative, executables, such as "systemd" or "FireJail".
> >
> > It seems to be a caveat of seccomp specific to the system call
> > _execve_. For now, some tools such as "systemd" explicitly mention
> > this exception, and do not support it (from the man page):
> > > Note that strict system call filters may impact execution and
> > > error handling code paths of the service invocation.
> > > Specifically, access to the execve system call is required for
> > > the execution of the service binary — if it is blocked service
> > > invocation will necessarily fail
> >
> > "FireJail" takes a different approach[1], with a kind of workaround:
> > the project uses an external library to be loaded through LD_PRELOAD
> > mechanism, in order to install filters during the loader stage.
> > This approach, a bit hacky, also has several caveats:
> > * _openat_, _mmap_, etc. must be allowed in order to reach the
> > LD_PRELOAD mechanism, and for the crafted library to work ;
>
> Those caveats are not specific to the LD_PRELOAD approach. Actually,
> the LD_PRELOAD approach is the only one which I would expect to *not*
> have that caveat. (Of course, non-executable mmap() and probably also
> openat() are anyway needed for almost any real-world service to do its
> job correctly.)
>
> > * it doesn't work for static binaries.
>
> IMO the important thing about LD_PRELOAD is that it is unreliable:
> When the LD_PRELOAD library can't be opened, glibc just prints a
> warning and continues execution - and an attacker may be able to cause
> opening an LD_PRELOAD library to fail by opening so many files in
> other processes that the global limit is reached. So you can't build
> reliable security infrastructure on LD_PRELOAD. This is not a
> fundamental problem though - glibc could address this.

Using LD_PRELOAD for security infrastructure is a really bad idea
anyway. There are nearly unboundedly many ways code could end up
executing before the preloaded ctors. Preinit arrays, malformed ELF
headers seizing control of execution of ldso, etc. The seccomp filters
really need to be in place *before* the untrusted code runs.

> > I only see hackish ways to restrict the use of _execve_ in a
> > non-cooperative executable. These methods seem globally bypassables
> > and not satisfactory from a security point of view.
>
> You're just focusing on execve() - I think it's important to keep in
> mind what happens after execve() for normal, dynamically-linked
> binaries: The next step is that the dynamic linker runs, and it will
> poke around in the file system with access() and openat() and fstat(),
> it will mmap() executable libraries into memory, it will mprotect()
> some memory regions, it will set up thread-local storage (e.g. using
> arch_prctl(); even if the process is single-threaded), and so on.
>
> The earlier you install the seccomp filter, the more of these steps
> you have to permit in the filter. And if you want the filter to take
> effect directly after execve(), the syscalls you'll be forced to
> permit are sufficient to cobble something together in userspace that
> effectively does almost the same thing as execve().

I would assume you use SECCOMP_RET_USER_NOTIF to implement policy for
controlling these operations and allowing only the ones that are valid
during dynamic linking. This also allows you to defer application of
the filter until after execve. So unless I'm missing some reason why
this doesn't work, I think the requested functionality is already
available.

If you really just want the "activate at exec" behavior, it might be
possible (depending on how SECCOMP_RET_USER_NOTIF behaves when there's
no notify fd open; I forget) to set up the filter so that the "mode
switch" happens automatically at exec by having the notify fd being
close-on-exec (notifications handled by a thread before exec). If this
works it would avoid having an extra process involved and managing its
lifetime.

> Your usecase might be better served by adding a glibc feature for
> "unskippable LD_PRELOAD"
Re: [seccomp] Request for a "enable on execve" mode for Seccomp filters
On Wed, Oct 28, 2020 at 07:25:45PM +0100, Jann Horn wrote:
> On Wed, Oct 28, 2020 at 6:52 PM Rich Felker wrote:
> > On Wed, Oct 28, 2020 at 06:34:56PM +0100, Jann Horn wrote:
> > > On Wed, Oct 28, 2020 at 5:49 PM Rich Felker wrote:
> > > > On Wed, Oct 28, 2020 at 01:42:13PM +0100, Jann Horn wrote:
> > > > > On Wed, Oct 28, 2020 at 12:18 PM Camille Mougey wrote:
> > > > > You're just focusing on execve() - I think it's important to keep in
> > > > > mind what happens after execve() for normal, dynamically-linked
> > > > > binaries: The next step is that the dynamic linker runs, and it will
> > > > > poke around in the file system with access() and openat() and fstat(),
> > > > > it will mmap() executable libraries into memory, it will mprotect()
> > > > > some memory regions, it will set up thread-local storage (e.g. using
> > > > > arch_prctl(); even if the process is single-threaded), and so on.
> > > > >
> > > > > The earlier you install the seccomp filter, the more of these steps
> > > > > you have to permit in the filter. And if you want the filter to take
> > > > > effect directly after execve(), the syscalls you'll be forced to
> > > > > permit are sufficient to cobble something together in userspace that
> > > > > effectively does almost the same thing as execve().
> > > >
> > > > I would assume you use SECCOMP_RET_USER_NOTIF to implement policy for
> > > > controlling these operations and allowing only the ones that are valid
> > > > during dynamic linking. This also allows you to defer application of
> > > > the filter until after execve. So unless I'm missing some reason why
> > > > this doesn't work, I think the requested functionality is already
> > > > available.
> > >
> > > Ah, yeah, good point.
> > >
> > > > If you really just want the "activate at exec" behavior, it might be
> > > > possible (depending on how SECCOMP_RET_USER_NOTIF behaves when there's
> > > > no notify fd open; I forget)
> > >
> > > syscall returns -ENOSYS. Yeah, that'd probably do the job. (Even
> > > though it might be a bit nicer if userspace had control over the errno
> > > there, such that it could be EPERM instead... oh well.)
> >
> > EPERM is a major bug in current sandbox implementations, so ENOSYS is
> > at least mildly better, but indeed it should be controllable, probably
> > by allowing a code path for the BPF to continue with a jump to a
> > different logic path if the notify listener is missing.
>
> I guess we might be able to expose the listener status through a bit /
> a field in the struct seccomp_data, and then filters could branch on
> that. (And the kernel would run the filter twice if we raced with
> filter detachment.) I don't know whether it would look pretty, but I
> think it should be doable...

I was thinking the race wouldn't be salvageable, but indeed since the
filter is side-effect-free you can just re-run it if the status
changes between start of filter processing and the attempt at
notification. This sounds like it should work.

I guess it's not possible to chain two BPF filters to do this, because
that only works when the first one allows? Or am I misunderstanding
the multiple-filters case entirely? (I've never gotten that far with
programming it.)

Rich
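On the multiple-filters question raised above: as documented in seccomp(2), every installed filter is run (newest first) and the action with the highest precedence wins, so a second filter can only make the outcome more restrictive. The sketch below models that selection rule in userspace; the function name seccomp_combine is hypothetical, and the comparison mirrors the documented ordering (action bits compared as a signed 32-bit value, lowest wins, ties keep the newer filter's result). It is an illustration of the rule, not kernel code.

```c
#include <errno.h>
#include <stdint.h>
#include <linux/seccomp.h>

/* Fallbacks for older kernel headers (values from the seccomp UAPI). */
#ifndef SECCOMP_RET_ACTION_FULL
#define SECCOMP_RET_ACTION_FULL 0xffff0000U
#endif
#ifndef SECCOMP_RET_KILL_PROCESS
#define SECCOMP_RET_KILL_PROCESS 0x80000000U
#endif
#ifndef SECCOMP_RET_USER_NOTIF
#define SECCOMP_RET_USER_NOTIF 0x7fc00000U
#endif

/*
 * Hypothetical model of how two filter results are merged.
 * `newer` is the result of the most recently installed filter, `older`
 * the result of a filter installed before it. The older result replaces
 * the newer one only when its action is strictly more restrictive.
 */
uint32_t seccomp_combine(uint32_t newer, uint32_t older)
{
    int32_t newer_action = (int32_t)(newer & SECCOMP_RET_ACTION_FULL);
    int32_t older_action = (int32_t)(older & SECCOMP_RET_ACTION_FULL);

    return (older_action < newer_action) ? older : newer;
}
```

For example, combining SECCOMP_RET_USER_NOTIF from one filter with SECCOMP_RET_ERRNO from another yields the ERRNO result, which is why stacking a second filter does not give a "fall through if the listener is gone" path.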
Re: [seccomp] Request for a "enable on execve" mode for Seccomp filters
On Wed, Oct 28, 2020 at 07:39:41PM +0100, Jann Horn wrote:
> On Wed, Oct 28, 2020 at 7:35 PM Rich Felker wrote:
> > On Wed, Oct 28, 2020 at 07:25:45PM +0100, Jann Horn wrote:
> > > On Wed, Oct 28, 2020 at 6:52 PM Rich Felker wrote:
> > > > On Wed, Oct 28, 2020 at 06:34:56PM +0100, Jann Horn wrote:
> > > > > On Wed, Oct 28, 2020 at 5:49 PM Rich Felker wrote:
> > > > > > On Wed, Oct 28, 2020 at 01:42:13PM +0100, Jann Horn wrote:
> > > > > > > On Wed, Oct 28, 2020 at 12:18 PM Camille Mougey wrote:
> > > > > > > You're just focusing on execve() - I think it's important to
> > > > > > > keep in mind what happens after execve() for normal,
> > > > > > > dynamically-linked binaries: The next step is that the dynamic
> > > > > > > linker runs, and it will poke around in the file system with
> > > > > > > access() and openat() and fstat(), it will mmap() executable
> > > > > > > libraries into memory, it will mprotect() some memory regions,
> > > > > > > it will set up thread-local storage (e.g. using arch_prctl();
> > > > > > > even if the process is single-threaded), and so on.
> > > > > > >
> > > > > > > The earlier you install the seccomp filter, the more of these
> > > > > > > steps you have to permit in the filter. And if you want the
> > > > > > > filter to take effect directly after execve(), the syscalls
> > > > > > > you'll be forced to permit are sufficient to cobble something
> > > > > > > together in userspace that effectively does almost the same
> > > > > > > thing as execve().
> > > > > >
> > > > > > I would assume you use SECCOMP_RET_USER_NOTIF to implement policy
> > > > > > for controlling these operations and allowing only the ones that
> > > > > > are valid during dynamic linking. This also allows you to defer
> > > > > > application of the filter until after execve. So unless I'm
> > > > > > missing some reason why this doesn't work, I think the requested
> > > > > > functionality is already available.
> > > > >
> > > > > Ah, yeah, good point.
> > > > >
> > > > > > If you really just want the "activate at exec" behavior, it might
> > > > > > be possible (depending on how SECCOMP_RET_USER_NOTIF behaves when
> > > > > > there's no notify fd open; I forget)
> > > > >
> > > > > syscall returns -ENOSYS. Yeah, that'd probably do the job. (Even
> > > > > though it might be a bit nicer if userspace had control over the errno
> > > > > there, such that it could be EPERM instead... oh well.)
> > > >
> > > > EPERM is a major bug in current sandbox implementations, so ENOSYS is
> > > > at least mildly better, but indeed it should be controllable, probably
> > > > by allowing a code path for the BPF to continue with a jump to a
> > > > different logic path if the notify listener is missing.
> > >
> > > I guess we might be able to expose the listener status through a bit /
> > > a field in the struct seccomp_data, and then filters could branch on
> > > that. (And the kernel would run the filter twice if we raced with
> > > filter detachment.) I don't know whether it would look pretty, but I
> > > think it should be doable...
> >
> > I was thinking the race wouldn't be salvageable, but indeed since the
> > filter is side-effect-free you can just re-run it if the status
> > changes between start of filter processing and the attempt at
> > notification. This sounds like it should work.
> >
> > I guess it's not possible to chain two BPF filters to do this, because
> > that only works when the first one allows? Or am I misunderstanding
> > the multiple-filters case entirely? (I've never gotten that far with
> > programming it.)
>
> I'm not sure if I'm understanding the question correctly...
> At the moment you basically can't have multiple filters with notifiers.
> The rule with multiple filters is always that all the filters get run,
> and the actual action taken is the most restrictive result of all of
> them.

I probably just don't understand how multiple filters work then, which
is pretty much what I expected. But in any case it seems correct that
they're not a tool for solving the problem here.

Rich
[seccomp] Request for a "enable on execve" mode for Seccomp filters
Hello,

(This is my first message to the kernel list, I hope I'm doing it right)

From my understanding, there is no way to delay the activation of
seccomp filters, for instance "until an _execve_ call".
But this might be useful, especially for tools who sandbox other,
non-cooperative, executables, such as "systemd" or "FireJail".

It seems to be a caveat of seccomp specific to the system call
_execve_. For now, some tools such as "systemd" explicitly mention
this exception, and do not support it (from the man page):
> Note that strict system call filters may impact execution and error handling
> code paths of the service invocation. Specifically, access to the execve
> system call is required for the execution of the service binary — if it is
> blocked service invocation will necessarily fail

"FireJail" takes a different approach[1], with a kind of workaround:
the project uses an external library to be loaded through LD_PRELOAD
mechanism, in order to install filters during the loader stage.
This approach, a bit hacky, also has several caveats:
* _openat_, _mmap_, etc. must be allowed in order to reach the
LD_PRELOAD mechanism, and for the crafted library to work ;
* it doesn't work for static binaries.

I only see hackish ways to restrict the use of _execve_ in a
non-cooperative executable. These methods seem globally bypassables
and not satisfactory from a security point of view.

IMHO, a way to prepare filter and enable them only on the next
_execve_ would have some benefit:
* have a way to restrict _execve_ in a non-cooperative executable;
* install filters atomically, ie. before the _execve_ system call
return. That would limit racy situations, and have the very firsts
instructions of potentially untrusted binaries already subject to
seccomp filters. It would also ensure there is only one thread running
at the filter enabling time.

From what I understand, there is a relative use case[2] where the
"enable on exec" mode would also be a solution.

Thanks for your attention,
C. Mougey

[1]: https://github.com/netblue30/firejail/issues/3685
[2]: https://lore.kernel.org/linux-man/202010250759.F9745E0B6@keescook/
Re: [seccomp] Request for a "enable on execve" mode for Seccomp filters
On Wed, Oct 28, 2020 at 06:34:56PM +0100, Jann Horn wrote:
> On Wed, Oct 28, 2020 at 5:49 PM Rich Felker wrote:
> > On Wed, Oct 28, 2020 at 01:42:13PM +0100, Jann Horn wrote:
> > > On Wed, Oct 28, 2020 at 12:18 PM Camille Mougey wrote:
> > > You're just focusing on execve() - I think it's important to keep in
> > > mind what happens after execve() for normal, dynamically-linked
> > > binaries: The next step is that the dynamic linker runs, and it will
> > > poke around in the file system with access() and openat() and fstat(),
> > > it will mmap() executable libraries into memory, it will mprotect()
> > > some memory regions, it will set up thread-local storage (e.g. using
> > > arch_prctl(); even if the process is single-threaded), and so on.
> > >
> > > The earlier you install the seccomp filter, the more of these steps
> > > you have to permit in the filter. And if you want the filter to take
> > > effect directly after execve(), the syscalls you'll be forced to
> > > permit are sufficient to cobble something together in userspace that
> > > effectively does almost the same thing as execve().
> >
> > I would assume you use SECCOMP_RET_USER_NOTIF to implement policy for
> > controlling these operations and allowing only the ones that are valid
> > during dynamic linking. This also allows you to defer application of
> > the filter until after execve. So unless I'm missing some reason why
> > this doesn't work, I think the requested functionality is already
> > available.
>
> Ah, yeah, good point.
>
> > If you really just want the "activate at exec" behavior, it might be
> > possible (depending on how SECCOMP_RET_USER_NOTIF behaves when there's
> > no notify fd open; I forget)
>
> syscall returns -ENOSYS. Yeah, that'd probably do the job. (Even
> though it might be a bit nicer if userspace had control over the errno
> there, such that it could be EPERM instead... oh well.)

EPERM is a major bug in current sandbox implementations, so ENOSYS is
at least mildly better, but indeed it should be controllable, probably
by allowing a code path for the BPF to continue with a jump to a
different logic path if the notify listener is missing.

Rich
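As an illustration of the "ENOSYS rather than EPERM" point, here is a minimal sketch of a classic-BPF seccomp filter that answers one chosen syscall with an errno picked by the caller and allows everything else. The function names build_errno_filter and install_filter are hypothetical; the macros and constants come from the Linux UAPI headers. A real policy would cover far more than one syscall, so treat this as a shape to adapt, not a usable sandbox.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

#ifndef SECCOMP_RET_KILL_PROCESS
#define SECCOMP_RET_KILL_PROCESS 0x80000000U
#endif

/*
 * Hypothetical helper: write a 7-instruction filter into `out` (which must
 * have room for at least 7 entries) and return the instruction count.
 * `arch` is an AUDIT_ARCH_* value, `nr` the syscall number to refuse, and
 * `err` the errno to report (ENOSYS is the choice argued for above).
 */
size_t build_errno_filter(struct sock_filter *out, uint32_t arch,
                          uint32_t nr, uint32_t err)
{
    const struct sock_filter prog[] = {
        /* Refuse to run under an unexpected syscall ABI. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, arch, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
        /* Match the one syscall this sketch refuses. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, nr, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (err & SECCOMP_RET_DATA)),
        /* Everything else is allowed. */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };

    memcpy(out, prog, sizeof prog);
    return sizeof prog / sizeof prog[0];
}

/* Install the filter on the calling thread; returns 0 on success, -1 on error. */
int install_filter(struct sock_filter *insns, unsigned short len)
{
    struct sock_fprog prog = { .len = len, .filter = insns };

    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return (int)syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog);
}
```

A caller on x86-64 might use build_errno_filter(buf, AUDIT_ARCH_X86_64, nr, ENOSYS) followed by install_filter(buf, n); swapping ENOSYS for EPERM changes only the errno argument, which is exactly why the choice is easy to get wrong.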
Re: [seccomp] Request for a "enable on execve" mode for Seccomp filters
+luto just in case he has opinions on this

On Wed, Oct 28, 2020 at 12:18 PM Camille Mougey wrote:
> From my understanding, there is no way to delay the activation of
> seccomp filters, for instance "until an _execve_ call".

(FWIW, there are some tricks that you can use for this. In particular,
you can attach to the child with ptrace before the child runs
execve(), and then use seccomp to inject a filter after execve(), or
something like that. The disadvantage is that this is not super pretty
because it interferes with debugging of the parent process. IIRC e.g.
Ubuntu's launchd did things this way.)

> But this might be useful, especially for tools who sandbox other,
> non-cooperative, executables, such as "systemd" or "FireJail".
>
> It seems to be a caveat of seccomp specific to the system call
> _execve_. For now, some tools such as "systemd" explicitly mention
> this exception, and do not support it (from the man page):
> > Note that strict system call filters may impact execution and error
> > handling code paths of the service invocation. Specifically, access to the
> > execve system call is required for the execution of the service binary — if
> > it is blocked service invocation will necessarily fail
>
> "FireJail" takes a different approach[1], with a kind of workaround:
> the project uses an external library to be loaded through LD_PRELOAD
> mechanism, in order to install filters during the loader stage.
> This approach, a bit hacky, also has several caveats:
> * _openat_, _mmap_, etc. must be allowed in order to reach the
> LD_PRELOAD mechanism, and for the crafted library to work ;

Those caveats are not specific to the LD_PRELOAD approach. Actually,
the LD_PRELOAD approach is the only one which I would expect to *not*
have that caveat. (Of course, non-executable mmap() and probably also
openat() are anyway needed for almost any real-world service to do its
job correctly.)

> * it doesn't work for static binaries.

IMO the important thing about LD_PRELOAD is that it is unreliable:
When the LD_PRELOAD library can't be opened, glibc just prints a
warning and continues execution - and an attacker may be able to cause
opening an LD_PRELOAD library to fail by opening so many files in
other processes that the global limit is reached. So you can't build
reliable security infrastructure on LD_PRELOAD. This is not a
fundamental problem though - glibc could address this.

> I only see hackish ways to restrict the use of _execve_ in a
> non-cooperative executable. These methods seem globally bypassables
> and not satisfactory from a security point of view.

You're just focusing on execve() - I think it's important to keep in
mind what happens after execve() for normal, dynamically-linked
binaries: The next step is that the dynamic linker runs, and it will
poke around in the file system with access() and openat() and fstat(),
it will mmap() executable libraries into memory, it will mprotect()
some memory regions, it will set up thread-local storage (e.g. using
arch_prctl(); even if the process is single-threaded), and so on.

The earlier you install the seccomp filter, the more of these steps
you have to permit in the filter. And if you want the filter to take
effect directly after execve(), the syscalls you'll be forced to
permit are sufficient to cobble something together in userspace that
effectively does almost the same thing as execve().

Your usecase might be better served by adding a glibc feature for
"unskippable LD_PRELOAD" paired with a constructor function, or
something along those lines.

> IMHO, a way to prepare filter and enable them only on the next
> _execve_ would have some benefit:
> * have a way to restrict _execve_ in a non-cooperative executable;

As I said above, I think glibc is a better place to deal with this -
they could e.g. add a new LD_PRELOAD_MANDATORY that means "you *have*
to preload this library, and if that's not possible because the
library can't be opened or because the execution is setuid, then you
have to treat that as a fatal error".

> * install filters atomically, ie. before the _execve_ system call
> return. That would limit racy situations,

If you say "racy", please also describe what the >=2 participants of
the race would be. Are you worried about a scenario where a process A
that dropped privileges becomes dumpable via execve(), and another
process B running under the same set of UIDs (but also a seccomp
filter) then attaches to A and injects code into A before A enables
its seccomp filter? That would require that the sandboxing of process
B is not strong enough to prevent it from interfering with other
processes (that's plausible if we're just externally applying a coarse
sandbox, since procfs access is sufficient for that) and that the
system is not running Yama.

I feel like Seccomp isn't the right place to address processes being
able to debug each other though - something like Yama would probably
be a better place. Or an execveat() flag that
Re: [seccomp] Request for a "enable on execve" mode for Seccomp filters
On Wed, Oct 28, 2020 at 01:42:13PM +0100, Jann Horn wrote:
> +luto just in case he has opinions on this
>
> On Wed, Oct 28, 2020 at 12:18 PM Camille Mougey wrote:
> > From my understanding, there is no way to delay the activation of
> > seccomp filters, for instance "until an _execve_ call".
> > [...]
> > It would also ensure there is only one thread running
> > at the filter enabling time.
>
> You're alluding to cases where library constructor functions launch
> threads? Is that a thing anyone does? (And in case someone does it, we
> still have TSYNC, so I don't think this would be a real problem.)

Unfortunately, yes, it happens. TSYNC got designed specifically to
"recapture" these constructor-launched threads. :( It was a common
enough situation Chrome wanted to solve due to some weird GPU
libraries that did this during init before Chrome was running.

-- 
Kees Cook
Re: [seccomp] Request for a "enable on execve" mode for Seccomp filters
On Wed, Oct 28, 2020 at 12:49:36PM -0400, Rich Felker wrote:
> On Wed, Oct 28, 2020 at 01:42:13PM +0100, Jann Horn wrote:
> > +luto just in case he has opinions on this
> >
> > On Wed, Oct 28, 2020 at 12:18 PM Camille Mougey wrote:
> > > From my understanding, there is no way to delay the activation of
> > > seccomp filters, for instance "until an _execve_ call".
> > > [...]
> > > I only see hackish ways to restrict the use of _execve_ in a
> > > non-cooperative executable. These methods seem globally bypassables
> > > and not satisfactory from a security point of view.
> >
> > You're just focusing on execve() - I think it's important to keep in
> > mind what happens after execve() for normal, dynamically-linked
> > binaries: The next step is that the dynamic linker runs, and it will
> > poke around in the file system with access() and openat() and fstat(),
> > it will mmap() executable libraries into memory, it will mprotect()
> > some memory regions, it will set up thread-local storage (e.g. using
> > arch_prctl(); even if the process is single-threaded), and so on.
> >
> > The earlier you install the seccomp filter, the more of these steps
> > you have to permit in the filter. And if you want the filter to take
> > effect directly after execve(), the syscalls you'll be forced to
> > permit are sufficient to cobble something together in userspace that
> > effectively does almost the same thing as execve().
>
> I would assume you use SECCOMP_RET_USER_NOTIF to implement policy for
> controlling these operations and allowing only the ones that are valid
> during dynamic linking. This also allows you to defer application of
> the filter until after execve. So unless I'm missing some reason why
> this doesn't work, I think the requested functionality is already
> available.

Oof. Yeah, that's possible, but I view it as kind of not the point of
USER_NOTIF -- I'd rather design a workable solution for the
delayed-apply case.

-- 
Kees Cook
Re: [seccomp] Request for a "enable on execve" mode for Seccomp filters
On Wed, Oct 28, 2020 at 3:47 PM Kees Cook wrote:
>
> On Wed, Oct 28, 2020 at 12:18:47PM +0100, Camille Mougey wrote:
> > (This is my first message to the kernel list, I hope I'm doing it right)
>
> 1- self-confinement
> 2- launching external processes
>    a) cooperating
>    b) oblivious

I remain quite unconvinced that delayed filters will solve a real
problem. As you described, 2a could just confine itself. There's an
obvious synchronization point -- sd_notify(). I bet sd_notify() could
be rigged up to apply externally-supplied filters, or sd_notify() could
interact with user notifiers to get some assistance.

2b is nasty. In an ideal world, we would materialize a fully formed
process with filters installed. The problem is that processes don't
generally come fully formed. Almost all interesting processes are
dynamically linked, and they get to specify their own dynamic linkers.
Even if we limit ourselves to a known dynamic linker, we would want to
make sure that the dynamic linker is hardened against various escape
techniques. For dynamic linking, we would probably want to start out
with one set of privileges (loading libraries) and then switch.

I have an alternative suggestion to try to address some of the above:
allow a notifier to run in a mode in which it can replace the BPF
program outright. This would be something like:

    if (fork() != 0) return; // do parent stuff

    // Start up. Set a BPF program that directs pretty much everything
    // at the listener.
    int fd = seccomp(..., SECCOMP_FILTER_FLAG_NEW_LISTENER |
                     SECCOMP_FILTER_FLAG_ALLOW_REPLACEMENT, ...);

    // Set up other things if needed.

    execve();

Now, in the parent, once the child is ready for its final filters:

    // Replace the filter on *all* processes using the filter to which
    // we're attached.
    // I think the locking for this should be straightforward.
    // Optional flag here to remove the ALLOW_REPLACEMENT flag, but it's
    // not really necessary since we're about to close() the listener.
    ioctl(fd, SECCOMP_IOCTL_NOTIF_REPLACE_FILTER, new_filter);

    // Call recv in a loop to drain and handle notifications.
    for (...) {
        ioctl(fd, SECCOMP_IOCTL_NOTIF_RECV, ...);
        ...
    }

    close(fd);

And now we're done. We can make the synchronization point be anything
we like.

What do you all think? For people who really want
delay-until-execve(), this can emulate it efficiently.
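The filter-replacement ioctl and the ALLOW_REPLACEMENT flag in this message are proposals only, but the "drain" step can already be written against the existing user-notification API. The following is a minimal sketch, assuming kernel headers that provide struct seccomp_notif (Linux 5.0 or later). The helper name drain_pending() and the policy of answering every queued request with one fixed errno are illustrative assumptions, not something suggested in the thread.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>

/*
 * Hypothetical helper: answer every notification that is already queued
 * on listener_fd with -err, without ever blocking. Returns the number of
 * notifications answered, or -1 on an unexpected failure.
 *
 * Simplification: robust code sizes its buffers with
 * SECCOMP_GET_NOTIF_SIZES instead of using the header structs directly.
 */
static int drain_pending(int listener_fd, int err)
{
    int handled = 0;

    assert(err > 0); /* caller passes a positive errno value */

    for (;;) {
        struct pollfd pfd = { .fd = listener_fd, .events = POLLIN };
        int n = poll(&pfd, 1, 0); /* timeout 0: only look at what is queued */

        if (n < 0)
            return -1;
        if (n == 0 || !(pfd.revents & POLLIN))
            return handled; /* nothing left to drain */

        struct seccomp_notif req;
        memset(&req, 0, sizeof(req)); /* the kernel requires a zeroed buffer */
        if (ioctl(listener_fd, SECCOMP_IOCTL_NOTIF_RECV, &req) < 0)
            return -1;

        struct seccomp_notif_resp resp;
        memset(&resp, 0, sizeof(resp));
        resp.id = req.id;
        resp.error = -err; /* the target's syscall fails with this errno */

        /* ENOENT means the target went away before we could answer. */
        if (ioctl(listener_fd, SECCOMP_IOCTL_NOTIF_SEND, &resp) < 0 &&
            errno != ENOENT)
            return -1;
        handled++;
    }
}
```

A supervisor would call something like drain_pending(fd, ENOSYS) after swapping filters and before close(fd); a real one would apply per-syscall policy instead of a single errno.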
Re: [seccomp] Request for a "enable on execve" mode for Seccomp filters
On Wed, Oct 28, 2020 at 12:18:47PM +0100, Camille Mougey wrote:
> (This is my first message to the kernel list, I hope I'm doing it right)

Looks good to me! The key was CCing real people. ;)

> From my understanding, there is no way to delay the activation of
> seccomp filters, for instance "until an _execve_ call".
> But this might be useful, especially for tools who sandbox other,
> non-cooperative, executables, such as "systemd" or "FireJail".
> [...]
> I only see hackish ways to restrict the use of _execve_ in a
> non-cooperative executable. These methods seem globally bypassables
> and not satisfactory from a security point of view.
>
> IMHO, a way to prepare filter and enable them only on the next
> _execve_ would have some benefit:
> * have a way to restrict _execve_ in a non-cooperative executable;
> * install filters atomically, ie. before the _execve_ system call
> return. That would limit racy situations, and have the very firsts
> instructions of potentially untrusted binaries already subject to
> seccomp filters. It would also ensure there is only one thread running
> at the filter enabling time.
>
> From what I understand, there is a relative use case[2] where the
> "enable on exec" mode would also be a solution.
>
> Thanks for your attention,
> C. Mougey
>
> [1]: https://github.com/netblue30/firejail/issues/3685
> [2]: https://lore.kernel.org/linux-man/202010250759.F9745E0B6@keescook/

Just to restate things already said in the thread and to try to
illustrate with more clarity, I tend to organize my thinking about
seccomp usage into three categories:

1- self-confinement
2- launching external processes
   a) cooperating
   b) oblivious

I classify things like Chrome's complex tree of related processes and
filters as 1, since it's all one thing together. I think of systemd,
docker, minijail, FireJail, etc. all as falling into category 2, with
some variation about how to deal with 2a or 2b. I see systemd as weakly
covering both 2a and 2b: e.g. services are documenting what
restrictions they want, etc. minijail has stronger 2b coverage as it
attempts to do PRELOAD tricks (which it sounds like FireJail does too?)

(Aside: why doesn't systemd do any self-confinement?)

We don't have much possibility for the targets in the 2a realm as far
as cooperating over how to _manage_ confinement, but rather about
simply expecting confinement to exist, or adding more confinement on
their own.

So, what would adding delayed filters gain in the above
classifications?

Both 1 and 2 would benefit from some simplification over how to apply
filters (e.g. the referenced relative complexity of needing to pass the
USER_NOTIF fd up to the supervisor).

Dealing with 2b is improved by allowing execve itself to be blocked.
If we turn this:

    fork
    prepare & apply
    exec

into this:

    fork
    prepare
    exec & apply

for 2a, this isn't too interesting since a 2a target could just give up
execve after it launched. For 2b, though, it's pretty meaningful to
gain further isolation of an oblivious (and presumably untrusted)
process (given all the hacks needed to try to cover the situation).

And to clarify, 2a would much prefer this to be able to separate
initialization from runtime:

    fork
    prepare
    exec
    other things
    apply

And just for completeness, none of this is useful at all for 1, which
doesn't even "see" the fork from its perspective:

    exec
    other things
    prepare & apply

How should 2a targets indicate they're ready? Can it be done passively
(in the sense that libc would make some seccomp call to apply the
delayed filters), or does it need to stay explicit? (e.g. can we turn a
pre-untrusted-input 2b into a 2a just by having the libc make calls?)
My instinct is that hiding it won't gain much over an "on-execve" case,
but having an explicit call that means "I'm done initializing now"
would be a meaningful synchronization point -- except I note that it
just means the target could just as easily start doing its own
confinement anyway, which means they effectively move from 2a to 1, and
now we don't care about delayed filters any more.

So, lacking a clearer sync point, execve() does seem to stand out to
me.

The other idea which was touched on in the thread was very direct
management (e.g. ptrace), where the supervisor waits until some point
and then forces the filters to apply on the target. What would be more
light-weight than this? (Or rather, what kinds of things would such a
ptracer be looking for to mark "I've started"?)

Since I've got bitmaps on my mind, what about a syscall bitmap that
triggers the application of delayed filters? The supervisor is
launching a daemon: mark NR_listen as the apply-point. The supervisor
is launching something totally unknown: mark NR_execve as the
apply-point.

If we did that, what happens to non-delayed filters applied between
program start and the apply-point getting tripped?

--
Kees Cook
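The apply-point bitmap idea can be modelled in a few lines of userspace C. This is only a simulation under stated assumptions: struct delayed_filter, SIM_NR_SYSCALLS and all function names are hypothetical, no such kernel interface exists, and the sketch assumes the delayed filters take effect from the syscall after the one that trips the apply-point (so the tripping execve or listen itself is not filtered).

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <string.h>

#define SIM_NR_SYSCALLS 512 /* assumption: upper bound on syscall numbers */
#define BITS_PER_WORD   (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical per-task state for delayed filters. */
struct delayed_filter {
    unsigned long apply_points[SIM_NR_SYSCALLS / BITS_PER_WORD];
    bool active; /* true once an apply-point has been tripped */
};

static void delayed_filter_init(struct delayed_filter *df)
{
    memset(df, 0, sizeof(*df));
}

/* Supervisor side: mark syscall nr as a point that arms the filters. */
static void mark_apply_point(struct delayed_filter *df, int nr)
{
    assert(nr >= 0 && nr < SIM_NR_SYSCALLS);
    df->apply_points[nr / BITS_PER_WORD] |= 1UL << (nr % BITS_PER_WORD);
}

static bool is_apply_point(const struct delayed_filter *df, int nr)
{
    if (nr < 0 || nr >= SIM_NR_SYSCALLS)
        return false;
    return (df->apply_points[nr / BITS_PER_WORD] >> (nr % BITS_PER_WORD)) & 1UL;
}

/*
 * Called on every simulated syscall entry. Returns true if the delayed
 * filters apply to this syscall. Tripping an apply-point arms the
 * filters for all later syscalls, not for the tripping one.
 */
static bool on_syscall_entry(struct delayed_filter *df, int nr)
{
    bool was_active = df->active;

    if (!was_active && is_apply_point(df, nr))
        df->active = true;
    return was_active;
}
```

For the "totally unknown" case a supervisor would mark the execve number as the apply-point, and for the daemon case the listen number. The open question in the message, how non-delayed filters installed before the apply-point interact with this, is not modelled here.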
Re: [seccomp] Request for a "enable on execve" mode for Seccomp filters
On Wed, Oct 28, 2020 at 7:35 PM Rich Felker wrote:
> On Wed, Oct 28, 2020 at 07:25:45PM +0100, Jann Horn wrote:
> > On Wed, Oct 28, 2020 at 6:52 PM Rich Felker wrote:
> > > On Wed, Oct 28, 2020 at 06:34:56PM +0100, Jann Horn wrote:
> > > > On Wed, Oct 28, 2020 at 5:49 PM Rich Felker wrote:
> > > > > On Wed, Oct 28, 2020 at 01:42:13PM +0100, Jann Horn wrote:
> > > > > > On Wed, Oct 28, 2020 at 12:18 PM Camille Mougey
> > > > > > wrote:
> > > > > > You're just focusing on execve() - I think it's important to keep in
> > > > > > mind what happens after execve() for normal, dynamically-linked
> > > > > > binaries: The next step is that the dynamic linker runs, and it will
> > > > > > poke around in the file system with access() and openat() and
> > > > > > fstat(),
> > > > > > it will mmap() executable libraries into memory, it will mprotect()
> > > > > > some memory regions, it will set up thread-local storage (e.g. using
> > > > > > arch_prctl(); even if the process is single-threaded), and so on.
> > > > > >
> > > > > > The earlier you install the seccomp filter, the more of these steps
> > > > > > you have to permit in the filter. And if you want the filter to take
> > > > > > effect directly after execve(), the syscalls you'll be forced to
> > > > > > permit are sufficient to cobble something together in userspace that
> > > > > > effectively does almost the same thing as execve().
> > > > >
> > > > > I would assume you use SECCOMP_RET_USER_NOTIF to implement policy for
> > > > > controlling these operations and allowing only the ones that are valid
> > > > > during dynamic linking. This also allows you to defer application of
> > > > > the filter until after execve. So unless I'm missing some reason why
> > > > > this doesn't work, I think the requested functionality is already
> > > > > available.
> > > >
> > > > Ah, yeah, good point.
> > > >
> > > > > If you really just want the "activate at exec" behavior, it might be
> > > > > possible (depending on how SECCOMP_RET_USER_NOTIF behaves when there's
> > > > > no notify fd open; I forget)
> > > >
> > > > syscall returns -ENOSYS. Yeah, that'd probably do the job. (Even
> > > > though it might be a bit nicer if userspace had control over the errno
> > > > there, such that it could be EPERM instead... oh well.)
> > >
> > > EPERM is a major bug in current sandbox implementations, so ENOSYS is
> > > at least mildly better, but indeed it should be controllable, probably
> > > by allowing a code path for the BPF to continue with a jump to a
> > > different logic path if the notify listener is missing.
> >
> > I guess we might be able to expose the listener status through a bit /
> > a field in the struct seccomp_data, and then filters could branch on
> > that. (And the kernel would run the filter twice if we raced with
> > filter detachment.) I don't know whether it would look pretty, but I
> > think it should be doable...
>
> I was thinking the race wouldn't be salvagable, but indeed since the
> filter is side-effect-free you can just re-run it if the status
> changes between start of filter processing and the attempt at
> notification. This sounds like it should work.
>
> I guess it's not possible to chain two BPF filters to do this, because
> that only works when the first one allows? Or am I misunderstanding
> the multiple-filters case entirely? (I've never gotten that far with
> programming it.)

I'm not sure if I'm understanding the question correctly...

At the moment you basically can't have multiple filters with notifiers.
The rule with multiple filters is always that all the filters get run,
and the actual action taken is the most restrictive result of all of
them.
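The "most restrictive result wins" rule for stacked filters can be sketched as a small reduction over the filters' return values. The sketch below is a userspace model, not kernel code: the RET_* constants copy the action values from include/uapi/linux/seccomp.h so it builds without kernel headers, the function names are made up for illustration, and the signed comparison of the action bits is, to my understanding, how kernel/seccomp.c ranks results.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Action values copied from include/uapi/linux/seccomp.h. */
#define RET_KILL_PROCESS 0x80000000U
#define RET_KILL_THREAD  0x00000000U
#define RET_TRAP         0x00030000U
#define RET_ERRNO        0x00050000U
#define RET_USER_NOTIF   0x7fc00000U
#define RET_TRACE        0x7ff00000U
#define RET_LOG          0x7ffc0000U
#define RET_ALLOW        0x7fff0000U
#define RET_ACTION_FULL  0xffff0000U

/* Action bits as a signed value: lower means more restrictive. */
static int32_t action_only(uint32_t ret)
{
    return (int32_t)(ret & RET_ACTION_FULL);
}

/*
 * Every filter has already been run and its result stored in results[].
 * The result with the most restrictive action is the one that takes
 * effect; with no filters at all the syscall is allowed.
 */
static uint32_t combine_filter_results(const uint32_t *results, size_t n)
{
    uint32_t ret = RET_ALLOW;

    assert(results != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (action_only(results[i]) < action_only(ret))
            ret = results[i];
    return ret;
}
```

This also shows why chaining does not work the way the question hoped: a second filter cannot take over from a first one that returned USER_NOTIF, it can only make the outcome stricter.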
Re: [seccomp] Request for a "enable on execve" mode for Seccomp filters
On Wed, Oct 28, 2020 at 5:49 PM Rich Felker wrote:
> On Wed, Oct 28, 2020 at 01:42:13PM +0100, Jann Horn wrote:
> > On Wed, Oct 28, 2020 at 12:18 PM Camille Mougey wrote:
> > You're just focusing on execve() - I think it's important to keep in
> > mind what happens after execve() for normal, dynamically-linked
> > binaries: The next step is that the dynamic linker runs, and it will
> > poke around in the file system with access() and openat() and fstat(),
> > it will mmap() executable libraries into memory, it will mprotect()
> > some memory regions, it will set up thread-local storage (e.g. using
> > arch_prctl(); even if the process is single-threaded), and so on.
> >
> > The earlier you install the seccomp filter, the more of these steps
> > you have to permit in the filter. And if you want the filter to take
> > effect directly after execve(), the syscalls you'll be forced to
> > permit are sufficient to cobble something together in userspace that
> > effectively does almost the same thing as execve().
>
> I would assume you use SECCOMP_RET_USER_NOTIF to implement policy for
> controlling these operations and allowing only the ones that are valid
> during dynamic linking. This also allows you to defer application of
> the filter until after execve. So unless I'm missing some reason why
> this doesn't work, I think the requested functionality is already
> available.

Ah, yeah, good point.

> If you really just want the "activate at exec" behavior, it might be
> possible (depending on how SECCOMP_RET_USER_NOTIF behaves when there's
> no notify fd open; I forget)

syscall returns -ENOSYS. Yeah, that'd probably do the job. (Even though
it might be a bit nicer if userspace had control over the errno there,
such that it could be EPERM instead... oh well.)
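Whether the blocked syscall fails with ENOSYS or EPERM matters to callers, a point argued elsewhere in this thread: libc-style code typically retries an older interface only on ENOSYS. The sketch below illustrates that caller-side idiom with stand-in functions. All names are hypothetical; the stubs merely imitate a newer call (say statx) and an older one (say stat64) under different sandbox behaviours.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* An operation returns 0 on success, or -1 with errno set. */
typedef int (*op_fn)(void);

/* Stand-ins for syscall wrappers under different sandbox policies. */
static int stub_ok(void)     { return 0; }
static int stub_enosys(void) { errno = ENOSYS; return -1; }
static int stub_eperm(void)  { errno = EPERM;  return -1; }

/*
 * Try the preferred interface first. Fall back to the legacy one only
 * when the kernel (or a filter) says the preferred one does not exist.
 * Any other error, EPERM included, is taken as a real answer about the
 * operation and is passed straight back to the caller.
 */
static int call_with_fallback(op_fn preferred, op_fn legacy)
{
    assert(preferred != NULL && legacy != NULL);

    if (preferred() == 0)
        return 0;
    if (errno != ENOSYS)
        return -1;
    return legacy();
}
```

With stub_enosys as the preferred call the operation still succeeds through the legacy path; with stub_eperm it fails outright even though the legacy path would have been permitted.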
Re: [seccomp] Request for a "enable on execve" mode for Seccomp filters
On Wed, Oct 28, 2020 at 6:52 PM Rich Felker wrote:
> On Wed, Oct 28, 2020 at 06:34:56PM +0100, Jann Horn wrote:
> > On Wed, Oct 28, 2020 at 5:49 PM Rich Felker wrote:
> > > On Wed, Oct 28, 2020 at 01:42:13PM +0100, Jann Horn wrote:
> > > > On Wed, Oct 28, 2020 at 12:18 PM Camille Mougey
> > > > wrote:
> > > > You're just focusing on execve() - I think it's important to keep in
> > > > mind what happens after execve() for normal, dynamically-linked
> > > > binaries: The next step is that the dynamic linker runs, and it will
> > > > poke around in the file system with access() and openat() and fstat(),
> > > > it will mmap() executable libraries into memory, it will mprotect()
> > > > some memory regions, it will set up thread-local storage (e.g. using
> > > > arch_prctl(); even if the process is single-threaded), and so on.
> > > >
> > > > The earlier you install the seccomp filter, the more of these steps
> > > > you have to permit in the filter. And if you want the filter to take
> > > > effect directly after execve(), the syscalls you'll be forced to
> > > > permit are sufficient to cobble something together in userspace that
> > > > effectively does almost the same thing as execve().
> > >
> > > I would assume you use SECCOMP_RET_USER_NOTIF to implement policy for
> > > controlling these operations and allowing only the ones that are valid
> > > during dynamic linking. This also allows you to defer application of
> > > the filter until after execve. So unless I'm missing some reason why
> > > this doesn't work, I think the requested functionality is already
> > > available.
> >
> > Ah, yeah, good point.
> >
> > > If you really just want the "activate at exec" behavior, it might be
> > > possible (depending on how SECCOMP_RET_USER_NOTIF behaves when there's
> > > no notify fd open; I forget)
> >
> > syscall returns -ENOSYS. Yeah, that'd probably do the job. (Even
> > though it might be a bit nicer if userspace had control over the errno
> > there, such that it could be EPERM instead... oh well.)
>
> EPERM is a major bug in current sandbox implementations, so ENOSYS is
> at least mildly better, but indeed it should be controllable, probably
> by allowing a code path for the BPF to continue with a jump to a
> different logic path if the notify listener is missing.

I guess we might be able to expose the listener status through a bit /
a field in the struct seccomp_data, and then filters could branch on
that. (And the kernel would run the filter twice if we raced with
filter detachment.) I don't know whether it would look pretty, but I
think it should be doable...
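The "listener status bit plus re-run on race" idea can be modelled in plain C. Everything here is a hypothetical simulation: the real struct seccomp_data has no listener field, filters are BPF programs rather than C functions, and struct sim_listener exists only to let the sketch imitate a listener that disappears between the filter run and the notification attempt.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Action values copied from include/uapi/linux/seccomp.h. */
#define RET_ERRNO      0x00050000U
#define RET_USER_NOTIF 0x7fc00000U

/* HYPOTHETICAL: seccomp_data extended with a listener status bit. */
struct sim_seccomp_data {
    int  nr;
    bool listener_alive;
};

/* Simulated listener: alive for the next polls_left polls, then gone. */
struct sim_listener {
    int polls_left;
};

static bool sim_listener_alive(struct sim_listener *l)
{
    if (l->polls_left <= 0)
        return false;
    l->polls_left--;
    return true;
}

/* Example policy: notify while a listener exists, else fail with ENOSYS. */
static uint32_t example_filter(const struct sim_seccomp_data *sd)
{
    if (sd->listener_alive)
        return RET_USER_NOTIF;
    return RET_ERRNO | ENOSYS;
}

/*
 * Run the filter with the listener status it can branch on. If it asks
 * for a notification but the listener has vanished by the time we try
 * to notify, run it once more with the updated status. Re-running is
 * safe because the filter has no side effects.
 */
static uint32_t run_filter(struct sim_listener *l, int nr)
{
    assert(l != NULL);

    struct sim_seccomp_data sd = {
        .nr = nr,
        .listener_alive = sim_listener_alive(l),
    };
    uint32_t ret = example_filter(&sd);

    if (ret == RET_USER_NOTIF && !sim_listener_alive(l)) {
        sd.listener_alive = false;
        ret = example_filter(&sd);
    }
    return ret;
}
```

One re-run is enough in this model because a listener that has gone away never comes back, so the second run cannot ask for a notification again.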