Re: [PATCH v2 1/2] seccomp: notify user trap about unused filter
On Fri, May 29, 2020 at 01:32:03AM +0200, Jann Horn wrote: > On Fri, May 29, 2020 at 1:11 AM Kees Cook wrote: > > On Thu, May 28, 2020 at 05:14:11PM +0200, Christian Brauner wrote: > > > * @usage: reference count to manage the object lifetime. > > > * get/put helpers should be used when accessing an instance > > > * outside of a lifetime-guarded section. In general, this > > > * is only needed for handling filters shared across tasks. > > > [...] > > > + * @live: Number of tasks that use this filter directly and number > > > + * of dependent filters that have a non-zero @live counter. > > > + * Altered during fork(), exit(), and filter installation > > > [...] > > > refcount_set(>usage, 1); > > > + refcount_set(>live, 1); > [...] > > After looking at these other lifetime management examples in the kernel, > > I'm convinced that tracking these states separately is correct, but I > > remain uncomfortable about task management needing to explicitly make > > two calls to let go of the filter. > > > > I wonder if release_task() should also detach the filter from the task > > and do a put_seccomp_filter() instead of waiting for task_free(). This > > is supported by the other place where seccomp_filter_release() is > > called: > > > > > @@ -396,6 +400,7 @@ static inline void seccomp_sync_threads(unsigned long > > > flags) > > >* allows a put before the assignment.) > > > */ > > > put_seccomp_filter(thread); > > > + seccomp_filter_release(thread); > > > > This would also remove the only put_seccomp_filter() call outside of > > seccomp.c, since the free_task() call will be removed now in favor of > > the task_release() call. > > > > So, is it safe to detach the filter in release_task()? Has dethreading > > happened yet? i.e. can we race TSYNC? -- is there a possible > > inc-from-zero? > > release_task -> __exit_signal -> __unhash_process -> > list_del_rcu(>thread_node) drops us from the thread list under > siglock, which is the same lock TSYNC uses. We should move us after write_unlock_irq(_lock). We're after __exit_signal() so we're unhashed and can't be discovered by tsync too anymore and we also don't require the tasklist_lock to be held: diff --git a/kernel/exit.c b/kernel/exit.c index b332e3635eb5..5490cc04f436 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -193,8 +193,6 @@ void release_task(struct task_struct *p) cgroup_release(p); write_lock_irq(_lock); ptrace_release_task(p); thread_pid = get_pid(p->thread_pid); @@ -220,6 +218,7 @@ void release_task(struct task_struct *p) } write_unlock_irq(_lock); + seccomp_filter_release(p); proc_flush_pid(thread_pid); put_pid(thread_pid); release_thread(p); Christian
Re: [PATCH v2 1/2] seccomp: notify user trap about unused filter
On Fri, May 29, 2020 at 01:06:59AM -0700, Kees Cook wrote: > On Fri, May 29, 2020 at 09:56:41AM +0200, Christian Brauner wrote: > > On Thu, May 28, 2020 at 04:11:00PM -0700, Kees Cook wrote: > > > void seccomp_filter_release(const struct task_struct *tsk) > > > { > > > struct seccomp_filter *orig = READ_ONCE(tsk->seccomp.filter); > > > > > > smp_store_release(>seccomp.filter, NULL); > > > > I need to go through the memory ordering requirements before I can say > > yay or nay confidently to this. :) > > > > > __seccomp_filter_release(orig); > > > } > > The only caller will be release_task() after dethread, so honestly this > was just me being paranoid. I don't think it actually needs the > READ_ONCE() nor the store_release. I think I wrote all that before I'd > convinced myself it was safe to remove a filter then. But I'm still > suspicious given the various ways release_task() gets called... I just > know that if mode 2 is set and filter == NULL, seccomp will fail closed, > so I went the paranoid route. :) release_task() should only be called once per thread otherwise we'd have big problems. And every time we call release_task() we're already EXIT_DEAD iirc. So there should be no way someone else can find us (in a relevant way and especially not from userspace). exit_notify() -> if we're autoreaping we're EXIT_DEAD anyway, if we're not autoreaping we'll wait_task_zombie() eventually -> we're EXIT_DEAD (parent has reaped us) find_child_reaper() -> we couldn't find a child reaper for us, i.e. we were (namespace) init -> unlink all tasks we were ptracing (exit_ptrace()) and if they're EXIT_DEAD move them to the dead list -> release_task()'s that are EXIT_DEAD that we ptraced. and de_thread() relies on EXIT_DEAD too: /* * We are going to release_task()->ptrace_unlink() silently, * the tracer can sleep in do_wait(). EXIT_DEAD guarantees * the tracer wont't block again waiting for this thread. */ (This is a very _rough_ sketch.) So I think that's safe. Christian
Re: [PATCH v2 1/2] seccomp: notify user trap about unused filter
On Fri, May 29, 2020 at 09:56:41AM +0200, Christian Brauner wrote: > On Thu, May 28, 2020 at 04:11:00PM -0700, Kees Cook wrote: > > void seccomp_filter_release(const struct task_struct *tsk) > > { > > struct seccomp_filter *orig = READ_ONCE(tsk->seccomp.filter); > > > > smp_store_release(>seccomp.filter, NULL); > > I need to go through the memory ordering requirements before I can say > yay or nay confidently to this. :) > > > __seccomp_filter_release(orig); > > } The only caller will be release_task() after dethread, so honestly this was just me being paranoid. I don't think it actually needs the READ_ONCE() nor the store_release. I think I wrote all that before I'd convinced myself it was safe to remove a filter then. But I'm still suspicious given the various ways release_task() gets called... I just know that if mode 2 is set and filter == NULL, seccomp will fail closed, so I went the paranoid route. :) -- Kees Cook
Re: [PATCH v2 1/2] seccomp: notify user trap about unused filter
On Fri, May 29, 2020 at 09:47:44AM +0200, Christian Brauner wrote: > Well the correct way would probably be: > "usage" -> "refs" > "live" -> "users" Yeah, I like it! :) > So we'd need a first patch to convert "usage" to "refs" and then > introduce "users". Yup, sounds right. > > signal_struct has "sigcnt" and "live". I find "sigcnt" to be an > > unhelpful name too. (And why isn't it refcount_t?) > > I think I once looked that up and there was some sort of "not needed, no > gain" style rationale. hrm. it uses _inc and _dec_and_test... imo, that should make it be a refcount_t. Even if we're not protecting some clear UAF issue, it's still good to notification of potential bugs. -- Kees Cook
Re: [PATCH v2 1/2] seccomp: notify user trap about unused filter
On Fri, May 29, 2020 at 12:56:50AM -0700, Kees Cook wrote: > On Fri, May 29, 2020 at 09:51:37AM +0200, Christian Brauner wrote: > > Aside from this being not an issue now, can we please not dump seccomp > > filter contents in proc. That sounds terrible and what's the rationale, > > libseccomp already let's you dump filter contents while loading and you > > could ptrace it. But maybe I'm missing a giant need for this... > > The use-case comes from Android wanting to audit seccomp filters at > runtime. I think this is stalled until there is a good answer to "what > are you going to audit for, and how, given raw BPF?" Doing this in proc seems very suboptimal why isn't this simply an extension to the seccomp syscall (passing in a struct with the target's pid or pidfd for example) to identify the target? But yeah, if there's no real audit strategy all of that seems weird. Christian
Re: [PATCH v2 1/2] seccomp: notify user trap about unused filter
On Thu, May 28, 2020 at 04:11:00PM -0700, Kees Cook wrote: > On Thu, May 28, 2020 at 05:14:11PM +0200, Christian Brauner wrote: > > * @usage: reference count to manage the object lifetime. > > * get/put helpers should be used when accessing an instance > > * outside of a lifetime-guarded section. In general, this > > * is only needed for handling filters shared across tasks. > > [...] > > + * @live: Number of tasks that use this filter directly and number > > + * of dependent filters that have a non-zero @live counter. > > + * Altered during fork(), exit(), and filter installation > > [...] > > refcount_set(>usage, 1); > > + refcount_set(>live, 1); > > I'd like these reference counters to have more descriptive names. "usage" > by what? "live" from what perspective? At the least, I think we need > to be explicit in the comment, and at best we should do that and rename > them to be a bit more clear. > > A filter's "usage" is incremented for each directly-attached task > (task::seccomp_data.filter, via fork() or thread_sync), once for the > dependent filter (filter::prev), and once for an open user_notif file > (file::private_data). When it reaches zero, there are (should be) no more > active memory references back to the struct filter and it can be freed. > > A filter's "live" is incremented for each directly-attached task > (task::seccomp_data.filter, via fork() or thread_sync), and once for > the dependent filter (filter::prev). When it reaches zero there is no > way for new tasks to get associated with the filter, but there may still > be user_notif file::private_data references pointing at the filter. > > But we're tracking "validity lifetime" (live) and "memory reference > safety" (usage). > > signal_struct has "sigcnt" and "live". I find "sigcnt" to be an > unhelpful name too. (And why isn't it refcount_t?) > > So, perhaps leave "live", but rename "usage" -> "references". > > After looking at these other lifetime management examples in the kernel, > I'm convinced that tracking these states separately is correct, but I > remain uncomfortable about task management needing to explicitly make > two calls to let go of the filter. > > I wonder if release_task() should also detach the filter from the task > and do a put_seccomp_filter() instead of waiting for task_free(). This > is supported by the other place where seccomp_filter_release() is > called: > > > @@ -396,6 +400,7 @@ static inline void seccomp_sync_threads(unsigned long > > flags) > > * allows a put before the assignment.) > > */ > > put_seccomp_filter(thread); > > + seccomp_filter_release(thread); > > This would also remove the only put_seccomp_filter() call outside of > seccomp.c, since the free_task() call will be removed now in favor of > the task_release() call. > > So, is it safe to detach the filter in release_task()? Has dethreading > happened yet? i.e. can we race TSYNC? -- is there a possible So I've just gone through this and this is safe. I suspect that it was moved to free_task() because it was not considered high-priority work, i.e. nobody depended on the filter being released at a certain point. That's changed with the notifier so yes, we could move this. > inc-from-zero? (Actually, all our refcount_inc()s should be > refcount_inc_not_zero() just for robustness.) I *think* we can do it I think this was mentioned in another mail refcount_inc() does that check. Believe me, I know that from other debugging experience. ;) > before the release_thread() call (instead of after cgroup_release()). > > With that, then seccomp_filter_release() could assign the filter to NULL > and do the clean up: > > void seccomp_filter_release(const struct task_struct *tsk) > { > struct seccomp_filter *orig = READ_ONCE(tsk->seccomp.filter); > > smp_store_release(>seccomp.filter, NULL); I need to go through the memory ordering requirements before I can say yay or nay confidently to this. :) > __seccomp_filter_release(orig); > } > > All other refcounting is then internal to seccomp.c. Which brings me > back to TSYNC, since we don't want to write NULL to task->seccomp.filter > during TSYNC. TSYNC can use: > > void __seccomp_filter_release(struct seccomp_filter *filter) > { > while (filter && refcount_dec_and_test(>live)) { > if (waitqueue_active(>wqh)) > wake_up_poll(>wqh, EPOLLHUP); > filter = filter->prev; > } > __put_seccomp_filter(filter); > } > > Thoughts? > > -- > Kees Cook
Re: [PATCH v2 1/2] seccomp: notify user trap about unused filter
On Fri, May 29, 2020 at 09:51:37AM +0200, Christian Brauner wrote: > Aside from this being not an issue now, can we please not dump seccomp > filter contents in proc. That sounds terrible and what's the rationale, > libseccomp already let's you dump filter contents while loading and you > could ptrace it. But maybe I'm missing a giant need for this... The use-case comes from Android wanting to audit seccomp filters at runtime. I think this is stalled until there is a good answer to "what are you going to audit for, and how, given raw BPF?" -- Kees Cook
Re: [PATCH v2 1/2] seccomp: notify user trap about unused filter
On Fri, May 29, 2020 at 01:32:03AM +0200, Jann Horn wrote: > On Fri, May 29, 2020 at 1:11 AM Kees Cook wrote: > > On Thu, May 28, 2020 at 05:14:11PM +0200, Christian Brauner wrote: > > > * @usage: reference count to manage the object lifetime. > > > * get/put helpers should be used when accessing an instance > > > * outside of a lifetime-guarded section. In general, this > > > * is only needed for handling filters shared across tasks. > > > [...] > > > + * @live: Number of tasks that use this filter directly and number > > > + * of dependent filters that have a non-zero @live counter. > > > + * Altered during fork(), exit(), and filter installation > > > [...] > > > refcount_set(>usage, 1); > > > + refcount_set(>live, 1); > [...] > > After looking at these other lifetime management examples in the kernel, > > I'm convinced that tracking these states separately is correct, but I > > remain uncomfortable about task management needing to explicitly make > > two calls to let go of the filter. > > > > I wonder if release_task() should also detach the filter from the task > > and do a put_seccomp_filter() instead of waiting for task_free(). This > > is supported by the other place where seccomp_filter_release() is > > called: > > > > > @@ -396,6 +400,7 @@ static inline void seccomp_sync_threads(unsigned long > > > flags) > > >* allows a put before the assignment.) > > > */ > > > put_seccomp_filter(thread); > > > + seccomp_filter_release(thread); > > > > This would also remove the only put_seccomp_filter() call outside of > > seccomp.c, since the free_task() call will be removed now in favor of > > the task_release() call. > > > > So, is it safe to detach the filter in release_task()? Has dethreading > > happened yet? i.e. can we race TSYNC? -- is there a possible > > inc-from-zero? > > release_task -> __exit_signal -> __unhash_process -> > list_del_rcu(>thread_node) drops us from the thread list under > siglock, which is the same lock TSYNC uses. > > One other interesting thing that can look at seccomp state is > task_seccomp() in procfs - that can still happen at this point. At the > moment, procfs only lets you see the numeric filter state, not the > actual filter contents, so that's not a problem; but if we ever add a > procfs interface for dumping seccomp filters (in addition to the > ptrace interface that already exists), that's something to keep in > mind. Aside from this being not an issue now, can we please not dump seccomp filter contents in proc. That sounds terrible and what's the rationale, libseccomp already let's you dump filter contents while loading and you could ptrace it. But maybe I'm missing a giant need for this... Christian
Re: [PATCH v2 1/2] seccomp: notify user trap about unused filter
On Thu, May 28, 2020 at 04:11:00PM -0700, Kees Cook wrote: > On Thu, May 28, 2020 at 05:14:11PM +0200, Christian Brauner wrote: > > * @usage: reference count to manage the object lifetime. > > * get/put helpers should be used when accessing an instance > > * outside of a lifetime-guarded section. In general, this > > * is only needed for handling filters shared across tasks. > > [...] > > + * @live: Number of tasks that use this filter directly and number > > + * of dependent filters that have a non-zero @live counter. > > + * Altered during fork(), exit(), and filter installation > > [...] > > refcount_set(>usage, 1); > > + refcount_set(>live, 1); > > I'd like these reference counters to have more descriptive names. "usage" > by what? "live" from what perspective? At the least, I think we need > to be explicit in the comment, and at best we should do that and rename > them to be a bit more clear. Well the correct way would probably be: "usage" -> "refs" "live" -> "users" So we'd need a first patch to convert "usage" to "refs" and then introduce "users". > > A filter's "usage" is incremented for each directly-attached task > (task::seccomp_data.filter, via fork() or thread_sync), once for the > dependent filter (filter::prev), and once for an open user_notif file > (file::private_data). When it reaches zero, there are (should be) no more > active memory references back to the struct filter and it can be freed. > > A filter's "live" is incremented for each directly-attached task > (task::seccomp_data.filter, via fork() or thread_sync), and once for > the dependent filter (filter::prev). When it reaches zero there is no > way for new tasks to get associated with the filter, but there may still > be user_notif file::private_data references pointing at the filter. or - at least briefyl - ptrace or whatever, yes. > > But we're tracking "validity lifetime" (live) and "memory reference > safety" (usage). > > signal_struct has "sigcnt" and "live". I find "sigcnt" to be an > unhelpful name too. (And why isn't it refcount_t?) I think I once looked that up and there was some sort of "not needed, no gain" style rationale. > > So, perhaps leave "live", but rename "usage" -> "references". usage -> refs live -> users/active > > After looking at these other lifetime management examples in the kernel, > I'm convinced that tracking these states separately is correct, but I > remain uncomfortable about task management needing to explicitly make > two calls to let go of the filter. > > I wonder if release_task() should also detach the filter from the task > and do a put_seccomp_filter() instead of waiting for task_free(). This > is supported by the other place where seccomp_filter_release() is > called: > > > @@ -396,6 +400,7 @@ static inline void seccomp_sync_threads(unsigned long > > flags) > > * allows a put before the assignment.) > > */ > > put_seccomp_filter(thread); > > + seccomp_filter_release(thread); > > This would also remove the only put_seccomp_filter() call outside of > seccomp.c, since the free_task() call will be removed now in favor of > the task_release() call. > > So, is it safe to detach the filter in release_task()? Has dethreading > happened yet? i.e. can we race TSYNC? -- is there a possible > inc-from-zero? (Actually, all our refcount_inc()s should be > refcount_inc_not_zero() just for robustness.) I *think* we can do it > before the release_thread() call (instead of after cgroup_release()). > > With that, then seccomp_filter_release() could assign the filter to NULL > and do the clean up: > > void seccomp_filter_release(const struct task_struct *tsk) > { > struct seccomp_filter *orig = READ_ONCE(tsk->seccomp.filter); > > smp_store_release(>seccomp.filter, NULL); > __seccomp_filter_release(orig); > } > > All other refcounting is then internal to seccomp.c. Which brings me > back to TSYNC, since we don't want to write NULL to task->seccomp.filter > during TSYNC. TSYNC can use: > > void __seccomp_filter_release(struct seccomp_filter *filter) > { > while (filter && refcount_dec_and_test(>live)) { > if (waitqueue_active(>wqh)) > wake_up_poll(>wqh, EPOLLHUP); > filter = filter->prev; > } > __put_seccomp_filter(filter); > } > > Thoughts? > > -- > Kees Cook
Re: [PATCH v2 1/2] seccomp: notify user trap about unused filter
On Fri, May 29, 2020 at 01:32:03AM +0200, Jann Horn wrote: > On Fri, May 29, 2020 at 1:11 AM Kees Cook wrote: > > So, is it safe to detach the filter in release_task()? Has dethreading > > happened yet? i.e. can we race TSYNC? -- is there a possible > > inc-from-zero? > > release_task -> __exit_signal -> __unhash_process -> > list_del_rcu(>thread_node) drops us from the thread list under > siglock, which is the same lock TSYNC uses. Ah, there it is. I missed the __unhash_process() in __exit_signal, but once I saw the call to release_task(), I figured it was safe at that point. So this seems correct: > > I *think* we can do it > > before the release_thread() call (instead of after cgroup_release()). > One other interesting thing that can look at seccomp state is > task_seccomp() in procfs - that can still happen at this point. At the > moment, procfs only lets you see the numeric filter state, not the > actual filter contents, so that's not a problem; but if we ever add a > procfs interface for dumping seccomp filters (in addition to the > ptrace interface that already exists), that's something to keep in > mind. Right -- but we can just reuse the get/put to pin the filter while dumping it from proc (there IS someone working on this feature...) > > (Actually, all our refcount_inc()s should be > > refcount_inc_not_zero() just for robustness.) > > Eeeh... wouldn't that just make the code more complicated for no good reason? Sorry, ignore that. I got myself briefly confused -- we're fine; refcount_inc() already does inc-from-zero checking. -- Kees Cook
Re: [PATCH v2 1/2] seccomp: notify user trap about unused filter
On Fri, May 29, 2020 at 1:11 AM Kees Cook wrote: > On Thu, May 28, 2020 at 05:14:11PM +0200, Christian Brauner wrote: > > * @usage: reference count to manage the object lifetime. > > * get/put helpers should be used when accessing an instance > > * outside of a lifetime-guarded section. In general, this > > * is only needed for handling filters shared across tasks. > > [...] > > + * @live: Number of tasks that use this filter directly and number > > + * of dependent filters that have a non-zero @live counter. > > + * Altered during fork(), exit(), and filter installation > > [...] > > refcount_set(>usage, 1); > > + refcount_set(>live, 1); [...] > After looking at these other lifetime management examples in the kernel, > I'm convinced that tracking these states separately is correct, but I > remain uncomfortable about task management needing to explicitly make > two calls to let go of the filter. > > I wonder if release_task() should also detach the filter from the task > and do a put_seccomp_filter() instead of waiting for task_free(). This > is supported by the other place where seccomp_filter_release() is > called: > > > @@ -396,6 +400,7 @@ static inline void seccomp_sync_threads(unsigned long > > flags) > >* allows a put before the assignment.) > > */ > > put_seccomp_filter(thread); > > + seccomp_filter_release(thread); > > This would also remove the only put_seccomp_filter() call outside of > seccomp.c, since the free_task() call will be removed now in favor of > the task_release() call. > > So, is it safe to detach the filter in release_task()? Has dethreading > happened yet? i.e. can we race TSYNC? -- is there a possible > inc-from-zero? release_task -> __exit_signal -> __unhash_process -> list_del_rcu(>thread_node) drops us from the thread list under siglock, which is the same lock TSYNC uses. One other interesting thing that can look at seccomp state is task_seccomp() in procfs - that can still happen at this point. At the moment, procfs only lets you see the numeric filter state, not the actual filter contents, so that's not a problem; but if we ever add a procfs interface for dumping seccomp filters (in addition to the ptrace interface that already exists), that's something to keep in mind. > (Actually, all our refcount_inc()s should be > refcount_inc_not_zero() just for robustness.) Eeeh... wouldn't that just make the code more complicated for no good reason? > I *think* we can do it > before the release_thread() call (instead of after cgroup_release()).
Re: [PATCH v2 1/2] seccomp: notify user trap about unused filter
On Thu, May 28, 2020 at 05:14:11PM +0200, Christian Brauner wrote: > * @usage: reference count to manage the object lifetime. > * get/put helpers should be used when accessing an instance > * outside of a lifetime-guarded section. In general, this > * is only needed for handling filters shared across tasks. > [...] > + * @live: Number of tasks that use this filter directly and number > + * of dependent filters that have a non-zero @live counter. > + * Altered during fork(), exit(), and filter installation > [...] > refcount_set(>usage, 1); > + refcount_set(>live, 1); I'd like these reference counters to have more descriptive names. "usage" by what? "live" from what perspective? At the least, I think we need to be explicit in the comment, and at best we should do that and rename them to be a bit more clear. A filter's "usage" is incremented for each directly-attached task (task::seccomp_data.filter, via fork() or thread_sync), once for the dependent filter (filter::prev), and once for an open user_notif file (file::private_data). When it reaches zero, there are (should be) no more active memory references back to the struct filter and it can be freed. A filter's "live" is incremented for each directly-attached task (task::seccomp_data.filter, via fork() or thread_sync), and once for the dependent filter (filter::prev). When it reaches zero there is no way for new tasks to get associated with the filter, but there may still be user_notif file::private_data references pointing at the filter. But we're tracking "validity lifetime" (live) and "memory reference safety" (usage). signal_struct has "sigcnt" and "live". I find "sigcnt" to be an unhelpful name too. (And why isn't it refcount_t?) So, perhaps leave "live", but rename "usage" -> "references". After looking at these other lifetime management examples in the kernel, I'm convinced that tracking these states separately is correct, but I remain uncomfortable about task management needing to explicitly make two calls to let go of the filter. I wonder if release_task() should also detach the filter from the task and do a put_seccomp_filter() instead of waiting for task_free(). This is supported by the other place where seccomp_filter_release() is called: > @@ -396,6 +400,7 @@ static inline void seccomp_sync_threads(unsigned long > flags) >* allows a put before the assignment.) > */ > put_seccomp_filter(thread); > + seccomp_filter_release(thread); This would also remove the only put_seccomp_filter() call outside of seccomp.c, since the free_task() call will be removed now in favor of the task_release() call. So, is it safe to detach the filter in release_task()? Has dethreading happened yet? i.e. can we race TSYNC? -- is there a possible inc-from-zero? (Actually, all our refcount_inc()s should be refcount_inc_not_zero() just for robustness.) I *think* we can do it before the release_thread() call (instead of after cgroup_release()). With that, then seccomp_filter_release() could assign the filter to NULL and do the clean up: void seccomp_filter_release(const struct task_struct *tsk) { struct seccomp_filter *orig = READ_ONCE(tsk->seccomp.filter); smp_store_release(>seccomp.filter, NULL); __seccomp_filter_release(orig); } All other refcounting is then internal to seccomp.c. Which brings me back to TSYNC, since we don't want to write NULL to task->seccomp.filter during TSYNC. TSYNC can use: void __seccomp_filter_release(struct seccomp_filter *filter) { while (filter && refcount_dec_and_test(>live)) { if (waitqueue_active(>wqh)) wake_up_poll(>wqh, EPOLLHUP); filter = filter->prev; } __put_seccomp_filter(filter); } Thoughts? -- Kees Cook
[PATCH v2 1/2] seccomp: notify user trap about unused filter
We've been making heavy use of the seccomp notifier to intercept and handle certain syscalls for containers. This patch allows a syscall supervisor listening on a given notifier to be notified when a seccomp filter has become unused. A container is often managed by a singleton supervisor process the so-called "monitor". This monitor process has an event loop which has various event handlers registered. If the user specified a seccomp profile that included a notifier for various syscalls then we also register a seccomp notify even handler. For any container using a separate pid namespace the lifecycle of the seccomp notifier is bound to the init process of the pid namespace, i.e. when the init process exits the filter must be unused. If a new process attaches to a container we force it to assume a seccomp profile. This can either be the same seccomp profile as the container was started with or a modified one. If the attaching process makes use of the seccomp notifier we will register a new seccomp notifier handler in the monitor's event loop. However, when the attaching process exits we can't simply delete the handler since other child processes could've been created (daemons spawned etc.) that have inherited the seccomp filter and so we need to keep the seccomp notifier fd alive in the event loop. But this is problematic since we don't get a notification when the seccomp filter has become unused and so we currently never remove the seccomp notifier fd from the event loop and just keep accumulating fds in the event loop. We've had this issue for a while but it has recently become more pressing as more and larger users make use of this. To fix this, we introduce a new "live" reference counter that tracks the live tasks making use of a given filter and when a notifier is registered waiting tasks will be notified that the filter is now empty by receiving a (E)POLLHUP event. The concept in this patch introduces is the same as for signal_struct, i.e. reference counting for life-cycle management is decoupled from reference counting live taks using the object. There's probably some trickery possible but the second counter is just the correct way of doing this imho and has precedence. The patch also lifts the waitqeue from struct notification into into sruct seccomp_filter. This is cleaner overall and let's us avoid having to take the notifier mutex since we neither need to read nor modify the notifier specific aspects of the seccomp filter. In the exit path I'd very much like to avoid having to take the notifier mutex for each filter in the task's filter hierarchy. Cc: Tycho Andersen Cc: Kees Cook Cc: Matt Denton Cc: Sargun Dhillon Cc: Jann Horn Cc: Chris Palmer Cc: Aleksa Sarai Cc: Robert Sesek Cc: Jeffrey Vander Stoep Cc: Linux Containers Signed-off-by: Christian Brauner --- /* v2 */ - Jann Horn : - Use more descriptive instead of seccomp_filter_notify(). (I went with seccomp_filter_release().) --- include/linux/seccomp.h | 5 + kernel/exit.c | 2 ++ kernel/seccomp.c| 33 +++-- 3 files changed, 34 insertions(+), 6 deletions(-) diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h index 4192369b8418..d72b5b43f8ea 100644 --- a/include/linux/seccomp.h +++ b/include/linux/seccomp.h @@ -84,6 +84,7 @@ static inline int seccomp_mode(struct seccomp *s) #ifdef CONFIG_SECCOMP_FILTER extern void put_seccomp_filter(struct task_struct *tsk); extern void get_seccomp_filter(struct task_struct *tsk); +extern void seccomp_filter_release(const struct task_struct *tsk); #else /* CONFIG_SECCOMP_FILTER */ static inline void put_seccomp_filter(struct task_struct *tsk) { @@ -93,6 +94,10 @@ static inline void get_seccomp_filter(struct task_struct *tsk) { return; } +static inline void seccomp_filter_release(const struct task_struct *tsk) +{ + return; +} #endif /* CONFIG_SECCOMP_FILTER */ #if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_CHECKPOINT_RESTORE) diff --git a/kernel/exit.c b/kernel/exit.c index ce2a75bc0ade..b332e3635eb5 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -193,6 +193,8 @@ void release_task(struct task_struct *p) cgroup_release(p); + seccomp_filter_release(p); + write_lock_irq(_lock); ptrace_release_task(p); thread_pid = get_pid(p->thread_pid); diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 55a6184f5990..9fa642d6d549 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -94,13 +94,11 @@ struct seccomp_knotif { * filter->notify_lock. * @next_id: The id of the next request. * @notifications: A list of struct seccomp_knotif elements. - * @wqh: A wait queue for poll. */ struct notification { struct semaphore request; u64 next_id; struct list_head notifications; - wait_queue_head_t wqh; }; /** @@ -115,6 +113,10 @@ struct notification { * @prog: the BPF program to evaluate * @notif: the struct that