subject:"\[PATCH v2 1\/2\] seccomp\: notify user trap about unused filter"

Re: [PATCH v2 1/2] seccomp: notify user trap about unused filter

2020-05-29 Thread Christian Brauner

On Fri, May 29, 2020 at 01:32:03AM +0200, Jann Horn wrote:
> On Fri, May 29, 2020 at 1:11 AM Kees Cook  wrote:
> > On Thu, May 28, 2020 at 05:14:11PM +0200, Christian Brauner wrote:
> > >   * @usage: reference count to manage the object lifetime.
> > >   * get/put helpers should be used when accessing an instance
> > >   * outside of a lifetime-guarded section.  In general, this
> > >   * is only needed for handling filters shared across tasks.
> > > [...]
> > > + * @live: Number of tasks that use this filter directly and number
> > > + * of dependent filters that have a non-zero @live counter.
> > > + * Altered during fork(), exit(), and filter installation
> > > [...]
> > >   refcount_set(>usage, 1);
> > > + refcount_set(>live, 1);
> [...]
> > After looking at these other lifetime management examples in the kernel,
> > I'm convinced that tracking these states separately is correct, but I
> > remain uncomfortable about task management needing to explicitly make
> > two calls to let go of the filter.
> >
> > I wonder if release_task() should also detach the filter from the task
> > and do a put_seccomp_filter() instead of waiting for task_free(). This
> > is supported by the other place where seccomp_filter_release() is
> > called:
> >
> > > @@ -396,6 +400,7 @@ static inline void seccomp_sync_threads(unsigned long 
> > > flags)
> > >* allows a put before the assignment.)
> > >   */
> > >   put_seccomp_filter(thread);
> > > + seccomp_filter_release(thread);
> >
> > This would also remove the only put_seccomp_filter() call outside of
> > seccomp.c, since the free_task() call will be removed now in favor of
> > the task_release() call.
> >
> > So, is it safe to detach the filter in release_task()? Has dethreading
> > happened yet? i.e. can we race TSYNC? -- is there a possible
> > inc-from-zero?
> 
> release_task -> __exit_signal -> __unhash_process ->
> list_del_rcu(>thread_node) drops us from the thread list under
> siglock, which is the same lock TSYNC uses.

We should move us after write_unlock_irq(_lock). We're after
__exit_signal() so we're unhashed and can't be discovered by tsync too
anymore and we also don't require the tasklist_lock to be held:

diff --git a/kernel/exit.c b/kernel/exit.c
index b332e3635eb5..5490cc04f436 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -193,8 +193,6 @@ void release_task(struct task_struct *p)

cgroup_release(p);

write_lock_irq(_lock);
ptrace_release_task(p);
thread_pid = get_pid(p->thread_pid);
@@ -220,6 +218,7 @@ void release_task(struct task_struct *p)
}

write_unlock_irq(_lock);
+   seccomp_filter_release(p);
proc_flush_pid(thread_pid);
put_pid(thread_pid);
release_thread(p);

Christian

Re: [PATCH v2 1/2] seccomp: notify user trap about unused filter

2020-05-29 Thread Christian Brauner

On Fri, May 29, 2020 at 01:06:59AM -0700, Kees Cook wrote:
> On Fri, May 29, 2020 at 09:56:41AM +0200, Christian Brauner wrote:
> > On Thu, May 28, 2020 at 04:11:00PM -0700, Kees Cook wrote:
> > > void seccomp_filter_release(const struct task_struct *tsk)
> > > {
> > >   struct seccomp_filter *orig = READ_ONCE(tsk->seccomp.filter);
> > > 
> > >   smp_store_release(>seccomp.filter, NULL);
> > 
> > I need to go through the memory ordering requirements before I can say
> > yay or nay confidently to this. :)
> > 
> > >   __seccomp_filter_release(orig);
> > > }
> 
> The only caller will be release_task() after dethread, so honestly this
> was just me being paranoid. I don't think it actually needs the
> READ_ONCE() nor the store_release. I think I wrote all that before I'd
> convinced myself it was safe to remove a filter then. But I'm still
> suspicious given the various ways release_task() gets called... I just
> know that if mode 2 is set and filter == NULL, seccomp will fail closed,
> so I went the paranoid route. :)

release_task() should only be called once per thread otherwise we'd have
big problems. And every time we call release_task() we're already
EXIT_DEAD iirc. So there should be no way someone else can find us (in a
relevant way and especially not from userspace).
exit_notify() -> if we're autoreaping we're EXIT_DEAD anyway, if we're
not autoreaping we'll wait_task_zombie() eventually -> we're EXIT_DEAD
(parent has reaped us)

find_child_reaper() -> we couldn't find a child reaper for us, i.e. we
were (namespace) init -> unlink all tasks we were ptracing
(exit_ptrace()) and if they're EXIT_DEAD move them to the dead list ->
release_task()'s that are EXIT_DEAD that we ptraced.

and de_thread() relies on EXIT_DEAD too:
/*
 * We are going to release_task()->ptrace_unlink() silently,
 * the tracer can sleep in do_wait(). EXIT_DEAD guarantees
 * the tracer wont't block again waiting for this thread.
*/

(This is a very _rough_ sketch.)
So I think that's safe.

Christian

Re: [PATCH v2 1/2] seccomp: notify user trap about unused filter

2020-05-29 Thread Kees Cook

On Fri, May 29, 2020 at 09:56:41AM +0200, Christian Brauner wrote:
> On Thu, May 28, 2020 at 04:11:00PM -0700, Kees Cook wrote:
> > void seccomp_filter_release(const struct task_struct *tsk)
> > {
> > struct seccomp_filter *orig = READ_ONCE(tsk->seccomp.filter);
> > 
> > smp_store_release(>seccomp.filter, NULL);
> 
> I need to go through the memory ordering requirements before I can say
> yay or nay confidently to this. :)
> 
> > __seccomp_filter_release(orig);
> > }

The only caller will be release_task() after dethread, so honestly this
was just me being paranoid. I don't think it actually needs the
READ_ONCE() nor the store_release. I think I wrote all that before I'd
convinced myself it was safe to remove a filter then. But I'm still
suspicious given the various ways release_task() gets called... I just
know that if mode 2 is set and filter == NULL, seccomp will fail closed,
so I went the paranoid route. :)

-- 
Kees Cook

Re: [PATCH v2 1/2] seccomp: notify user trap about unused filter

2020-05-29 Thread Kees Cook

On Fri, May 29, 2020 at 09:47:44AM +0200, Christian Brauner wrote:
> Well the correct way would probably be:
> "usage" -> "refs"
> "live"  -> "users"

Yeah, I like it! :)

> So we'd need a first patch to convert "usage" to "refs" and then
> introduce "users".

Yup, sounds right.

> > signal_struct has "sigcnt" and "live". I find "sigcnt" to be an
> > unhelpful name too. (And why isn't it refcount_t?)
> 
> I think I once looked that up and there was some sort of "not needed, no
> gain" style rationale.

hrm. it uses _inc and _dec_and_test... imo, that should make it be a
refcount_t. Even if we're not protecting some clear UAF issue, it's
still good to notification of potential bugs.

-- 
Kees Cook

Re: [PATCH v2 1/2] seccomp: notify user trap about unused filter

2020-05-29 Thread Christian Brauner

On Fri, May 29, 2020 at 12:56:50AM -0700, Kees Cook wrote:
> On Fri, May 29, 2020 at 09:51:37AM +0200, Christian Brauner wrote:
> > Aside from this being not an issue now, can we please not dump seccomp
> > filter contents in proc. That sounds terrible and what's the rationale,
> > libseccomp already let's you dump filter contents while loading and you
> > could ptrace it. But maybe I'm missing a giant need for this...
> 
> The use-case comes from Android wanting to audit seccomp filters at
> runtime. I think this is stalled until there is a good answer to "what
> are you going to audit for, and how, given raw BPF?"

Doing this in proc seems very suboptimal why isn't this simply an
extension to the seccomp syscall (passing in a struct with the target's
pid or pidfd for example) to identify the target? But yeah, if there's
no real audit strategy all of that seems weird.

Christian

Re: [PATCH v2 1/2] seccomp: notify user trap about unused filter

2020-05-29 Thread Christian Brauner

On Thu, May 28, 2020 at 04:11:00PM -0700, Kees Cook wrote:
> On Thu, May 28, 2020 at 05:14:11PM +0200, Christian Brauner wrote:
> >   * @usage: reference count to manage the object lifetime.
> >   * get/put helpers should be used when accessing an instance
> >   * outside of a lifetime-guarded section.  In general, this
> >   * is only needed for handling filters shared across tasks.
> > [...]
> > + * @live: Number of tasks that use this filter directly and number
> > + *   of dependent filters that have a non-zero @live counter.
> > + *   Altered during fork(), exit(), and filter installation
> > [...]
> > refcount_set(>usage, 1);
> > +   refcount_set(>live, 1);
> 
> I'd like these reference counters to have more descriptive names. "usage"
> by what? "live" from what perspective? At the least, I think we need
> to be explicit in the comment, and at best we should do that and rename
> them to be a bit more clear.
> 
> A filter's "usage" is incremented for each directly-attached task
> (task::seccomp_data.filter, via fork() or thread_sync), once for the
> dependent filter (filter::prev), and once for an open user_notif file
> (file::private_data). When it reaches zero, there are (should be) no more
> active memory references back to the struct filter and it can be freed.
> 
> A filter's "live" is incremented for each directly-attached task
> (task::seccomp_data.filter, via fork() or thread_sync), and once for
> the dependent filter (filter::prev). When it reaches zero there is no
> way for new tasks to get associated with the filter, but there may still
> be user_notif file::private_data references pointing at the filter.
> 
> But we're tracking "validity lifetime" (live) and "memory reference
> safety" (usage).
> 
> signal_struct has "sigcnt" and "live". I find "sigcnt" to be an
> unhelpful name too. (And why isn't it refcount_t?)
> 
> So, perhaps leave "live", but rename "usage" -> "references".
> 
> After looking at these other lifetime management examples in the kernel,
> I'm convinced that tracking these states separately is correct, but I
> remain uncomfortable about task management needing to explicitly make
> two calls to let go of the filter.
> 
> I wonder if release_task() should also detach the filter from the task
> and do a put_seccomp_filter() instead of waiting for task_free(). This
> is supported by the other place where seccomp_filter_release() is
> called:
> 
> > @@ -396,6 +400,7 @@ static inline void seccomp_sync_threads(unsigned long 
> > flags)
> >  * allows a put before the assignment.)
> > */
> > put_seccomp_filter(thread);
> > +   seccomp_filter_release(thread);
> 
> This would also remove the only put_seccomp_filter() call outside of
> seccomp.c, since the free_task() call will be removed now in favor of
> the task_release() call.
> 
> So, is it safe to detach the filter in release_task()? Has dethreading
> happened yet? i.e. can we race TSYNC? -- is there a possible

So I've just gone through this and this is safe. I suspect that it was
moved to free_task() because it was not considered high-priority work,
i.e. nobody depended on the filter being released at a certain point.
That's changed with the notifier so yes, we could move this.

> inc-from-zero? (Actually, all our refcount_inc()s should be
> refcount_inc_not_zero() just for robustness.) I *think* we can do it

I think this was mentioned in another mail refcount_inc() does that
check. Believe me, I know that from other debugging experience. ;)

> before the release_thread() call (instead of after cgroup_release()).
> 
> With that, then seccomp_filter_release() could assign the filter to NULL
> and do the clean up:
> 
> void seccomp_filter_release(const struct task_struct *tsk)
> {
>   struct seccomp_filter *orig = READ_ONCE(tsk->seccomp.filter);
> 
>   smp_store_release(>seccomp.filter, NULL);

I need to go through the memory ordering requirements before I can say
yay or nay confidently to this. :)

>   __seccomp_filter_release(orig);
> }
> 
> All other refcounting is then internal to seccomp.c. Which brings me
> back to TSYNC, since we don't want to write NULL to task->seccomp.filter
> during TSYNC. TSYNC can use:
> 
> void __seccomp_filter_release(struct seccomp_filter *filter)
> {
>   while (filter && refcount_dec_and_test(>live)) {
>   if (waitqueue_active(>wqh))
>   wake_up_poll(>wqh, EPOLLHUP);
>   filter = filter->prev;
>   }
>   __put_seccomp_filter(filter);
> }
> 
> Thoughts?
> 
> -- 
> Kees Cook

Re: [PATCH v2 1/2] seccomp: notify user trap about unused filter

2020-05-29 Thread Kees Cook

On Fri, May 29, 2020 at 09:51:37AM +0200, Christian Brauner wrote:
> Aside from this being not an issue now, can we please not dump seccomp
> filter contents in proc. That sounds terrible and what's the rationale,
> libseccomp already let's you dump filter contents while loading and you
> could ptrace it. But maybe I'm missing a giant need for this...

The use-case comes from Android wanting to audit seccomp filters at
runtime. I think this is stalled until there is a good answer to "what
are you going to audit for, and how, given raw BPF?"

-- 
Kees Cook

Re: [PATCH v2 1/2] seccomp: notify user trap about unused filter

2020-05-29 Thread Christian Brauner

On Fri, May 29, 2020 at 01:32:03AM +0200, Jann Horn wrote:
> On Fri, May 29, 2020 at 1:11 AM Kees Cook  wrote:
> > On Thu, May 28, 2020 at 05:14:11PM +0200, Christian Brauner wrote:
> > >   * @usage: reference count to manage the object lifetime.
> > >   * get/put helpers should be used when accessing an instance
> > >   * outside of a lifetime-guarded section.  In general, this
> > >   * is only needed for handling filters shared across tasks.
> > > [...]
> > > + * @live: Number of tasks that use this filter directly and number
> > > + * of dependent filters that have a non-zero @live counter.
> > > + * Altered during fork(), exit(), and filter installation
> > > [...]
> > >   refcount_set(>usage, 1);
> > > + refcount_set(>live, 1);
> [...]
> > After looking at these other lifetime management examples in the kernel,
> > I'm convinced that tracking these states separately is correct, but I
> > remain uncomfortable about task management needing to explicitly make
> > two calls to let go of the filter.
> >
> > I wonder if release_task() should also detach the filter from the task
> > and do a put_seccomp_filter() instead of waiting for task_free(). This
> > is supported by the other place where seccomp_filter_release() is
> > called:
> >
> > > @@ -396,6 +400,7 @@ static inline void seccomp_sync_threads(unsigned long 
> > > flags)
> > >* allows a put before the assignment.)
> > >   */
> > >   put_seccomp_filter(thread);
> > > + seccomp_filter_release(thread);
> >
> > This would also remove the only put_seccomp_filter() call outside of
> > seccomp.c, since the free_task() call will be removed now in favor of
> > the task_release() call.
> >
> > So, is it safe to detach the filter in release_task()? Has dethreading
> > happened yet? i.e. can we race TSYNC? -- is there a possible
> > inc-from-zero?
> 
> release_task -> __exit_signal -> __unhash_process ->
> list_del_rcu(>thread_node) drops us from the thread list under
> siglock, which is the same lock TSYNC uses.
> 
> One other interesting thing that can look at seccomp state is
> task_seccomp() in procfs - that can still happen at this point. At the
> moment, procfs only lets you see the numeric filter state, not the
> actual filter contents, so that's not a problem; but if we ever add a
> procfs interface for dumping seccomp filters (in addition to the
> ptrace interface that already exists), that's something to keep in
> mind.

Aside from this being not an issue now, can we please not dump seccomp
filter contents in proc. That sounds terrible and what's the rationale,
libseccomp already let's you dump filter contents while loading and you
could ptrace it. But maybe I'm missing a giant need for this...

Christian

Re: [PATCH v2 1/2] seccomp: notify user trap about unused filter

2020-05-29 Thread Christian Brauner

On Thu, May 28, 2020 at 04:11:00PM -0700, Kees Cook wrote:
> On Thu, May 28, 2020 at 05:14:11PM +0200, Christian Brauner wrote:
> >   * @usage: reference count to manage the object lifetime.
> >   * get/put helpers should be used when accessing an instance
> >   * outside of a lifetime-guarded section.  In general, this
> >   * is only needed for handling filters shared across tasks.
> > [...]
> > + * @live: Number of tasks that use this filter directly and number
> > + *   of dependent filters that have a non-zero @live counter.
> > + *   Altered during fork(), exit(), and filter installation
> > [...]
> > refcount_set(>usage, 1);
> > +   refcount_set(>live, 1);
> 
> I'd like these reference counters to have more descriptive names. "usage"
> by what? "live" from what perspective? At the least, I think we need
> to be explicit in the comment, and at best we should do that and rename
> them to be a bit more clear.

Well the correct way would probably be:
"usage" -> "refs"
"live"  -> "users"

So we'd need a first patch to convert "usage" to "refs" and then
introduce "users".

> 
> A filter's "usage" is incremented for each directly-attached task
> (task::seccomp_data.filter, via fork() or thread_sync), once for the
> dependent filter (filter::prev), and once for an open user_notif file
> (file::private_data). When it reaches zero, there are (should be) no more
> active memory references back to the struct filter and it can be freed.
> 
> A filter's "live" is incremented for each directly-attached task
> (task::seccomp_data.filter, via fork() or thread_sync), and once for
> the dependent filter (filter::prev). When it reaches zero there is no
> way for new tasks to get associated with the filter, but there may still
> be user_notif file::private_data references pointing at the filter.

or - at least briefyl - ptrace or whatever, yes.

> 
> But we're tracking "validity lifetime" (live) and "memory reference
> safety" (usage).
> 
> signal_struct has "sigcnt" and "live". I find "sigcnt" to be an
> unhelpful name too. (And why isn't it refcount_t?)

I think I once looked that up and there was some sort of "not needed, no
gain" style rationale.

> 
> So, perhaps leave "live", but rename "usage" -> "references".

usage -> refs
live  -> users/active

> 
> After looking at these other lifetime management examples in the kernel,
> I'm convinced that tracking these states separately is correct, but I
> remain uncomfortable about task management needing to explicitly make
> two calls to let go of the filter.
> 
> I wonder if release_task() should also detach the filter from the task
> and do a put_seccomp_filter() instead of waiting for task_free(). This
> is supported by the other place where seccomp_filter_release() is
> called:
> 
> > @@ -396,6 +400,7 @@ static inline void seccomp_sync_threads(unsigned long 
> > flags)
> >  * allows a put before the assignment.)
> > */
> > put_seccomp_filter(thread);
> > +   seccomp_filter_release(thread);
> 
> This would also remove the only put_seccomp_filter() call outside of
> seccomp.c, since the free_task() call will be removed now in favor of
> the task_release() call.
> 
> So, is it safe to detach the filter in release_task()? Has dethreading
> happened yet? i.e. can we race TSYNC? -- is there a possible
> inc-from-zero? (Actually, all our refcount_inc()s should be
> refcount_inc_not_zero() just for robustness.) I *think* we can do it
> before the release_thread() call (instead of after cgroup_release()).
> 
> With that, then seccomp_filter_release() could assign the filter to NULL
> and do the clean up:
> 
> void seccomp_filter_release(const struct task_struct *tsk)
> {
>   struct seccomp_filter *orig = READ_ONCE(tsk->seccomp.filter);
> 
>   smp_store_release(>seccomp.filter, NULL);
>   __seccomp_filter_release(orig);
> }
> 
> All other refcounting is then internal to seccomp.c. Which brings me
> back to TSYNC, since we don't want to write NULL to task->seccomp.filter
> during TSYNC. TSYNC can use:
> 
> void __seccomp_filter_release(struct seccomp_filter *filter)
> {
>   while (filter && refcount_dec_and_test(>live)) {
>   if (waitqueue_active(>wqh))
>   wake_up_poll(>wqh, EPOLLHUP);
>   filter = filter->prev;
>   }
>   __put_seccomp_filter(filter);
> }
> 
> Thoughts?
> 
> -- 
> Kees Cook

Re: [PATCH v2 1/2] seccomp: notify user trap about unused filter

2020-05-28 Thread Kees Cook

On Fri, May 29, 2020 at 01:32:03AM +0200, Jann Horn wrote:
> On Fri, May 29, 2020 at 1:11 AM Kees Cook  wrote:
> > So, is it safe to detach the filter in release_task()? Has dethreading
> > happened yet? i.e. can we race TSYNC? -- is there a possible
> > inc-from-zero?
> 
> release_task -> __exit_signal -> __unhash_process ->
> list_del_rcu(>thread_node) drops us from the thread list under
> siglock, which is the same lock TSYNC uses.

Ah, there it is. I missed the __unhash_process() in __exit_signal, but
once I saw the call to release_task(), I figured it was safe at that
point. So this seems correct:

> > I *think* we can do it
> > before the release_thread() call (instead of after cgroup_release()).

> One other interesting thing that can look at seccomp state is
> task_seccomp() in procfs - that can still happen at this point. At the
> moment, procfs only lets you see the numeric filter state, not the
> actual filter contents, so that's not a problem; but if we ever add a
> procfs interface for dumping seccomp filters (in addition to the
> ptrace interface that already exists), that's something to keep in
> mind.

Right -- but we can just reuse the get/put to pin the filter while
dumping it from proc (there IS someone working on this feature...)

> > (Actually, all our refcount_inc()s should be
> > refcount_inc_not_zero() just for robustness.)
> 
> Eeeh... wouldn't that just make the code more complicated for no good reason?

Sorry, ignore that. I got myself briefly confused -- we're fine;
refcount_inc() already does inc-from-zero checking.

-- 
Kees Cook

Re: [PATCH v2 1/2] seccomp: notify user trap about unused filter

2020-05-28 Thread Jann Horn

On Fri, May 29, 2020 at 1:11 AM Kees Cook  wrote:
> On Thu, May 28, 2020 at 05:14:11PM +0200, Christian Brauner wrote:
> >   * @usage: reference count to manage the object lifetime.
> >   * get/put helpers should be used when accessing an instance
> >   * outside of a lifetime-guarded section.  In general, this
> >   * is only needed for handling filters shared across tasks.
> > [...]
> > + * @live: Number of tasks that use this filter directly and number
> > + * of dependent filters that have a non-zero @live counter.
> > + * Altered during fork(), exit(), and filter installation
> > [...]
> >   refcount_set(>usage, 1);
> > + refcount_set(>live, 1);
[...]
> After looking at these other lifetime management examples in the kernel,
> I'm convinced that tracking these states separately is correct, but I
> remain uncomfortable about task management needing to explicitly make
> two calls to let go of the filter.
>
> I wonder if release_task() should also detach the filter from the task
> and do a put_seccomp_filter() instead of waiting for task_free(). This
> is supported by the other place where seccomp_filter_release() is
> called:
>
> > @@ -396,6 +400,7 @@ static inline void seccomp_sync_threads(unsigned long 
> > flags)
> >* allows a put before the assignment.)
> >   */
> >   put_seccomp_filter(thread);
> > + seccomp_filter_release(thread);
>
> This would also remove the only put_seccomp_filter() call outside of
> seccomp.c, since the free_task() call will be removed now in favor of
> the task_release() call.
>
> So, is it safe to detach the filter in release_task()? Has dethreading
> happened yet? i.e. can we race TSYNC? -- is there a possible
> inc-from-zero?

release_task -> __exit_signal -> __unhash_process ->
list_del_rcu(>thread_node) drops us from the thread list under
siglock, which is the same lock TSYNC uses.

One other interesting thing that can look at seccomp state is
task_seccomp() in procfs - that can still happen at this point. At the
moment, procfs only lets you see the numeric filter state, not the
actual filter contents, so that's not a problem; but if we ever add a
procfs interface for dumping seccomp filters (in addition to the
ptrace interface that already exists), that's something to keep in
mind.

> (Actually, all our refcount_inc()s should be
> refcount_inc_not_zero() just for robustness.)

Eeeh... wouldn't that just make the code more complicated for no good reason?

> I *think* we can do it
> before the release_thread() call (instead of after cgroup_release()).

Re: [PATCH v2 1/2] seccomp: notify user trap about unused filter

2020-05-28 Thread Kees Cook

On Thu, May 28, 2020 at 05:14:11PM +0200, Christian Brauner wrote:
>   * @usage: reference count to manage the object lifetime.
>   * get/put helpers should be used when accessing an instance
>   * outside of a lifetime-guarded section.  In general, this
>   * is only needed for handling filters shared across tasks.
> [...]
> + * @live: Number of tasks that use this filter directly and number
> + * of dependent filters that have a non-zero @live counter.
> + * Altered during fork(), exit(), and filter installation
> [...]
>   refcount_set(>usage, 1);
> + refcount_set(>live, 1);

I'd like these reference counters to have more descriptive names. "usage"
by what? "live" from what perspective? At the least, I think we need
to be explicit in the comment, and at best we should do that and rename
them to be a bit more clear.

A filter's "usage" is incremented for each directly-attached task
(task::seccomp_data.filter, via fork() or thread_sync), once for the
dependent filter (filter::prev), and once for an open user_notif file
(file::private_data). When it reaches zero, there are (should be) no more
active memory references back to the struct filter and it can be freed.

A filter's "live" is incremented for each directly-attached task
(task::seccomp_data.filter, via fork() or thread_sync), and once for
the dependent filter (filter::prev). When it reaches zero there is no
way for new tasks to get associated with the filter, but there may still
be user_notif file::private_data references pointing at the filter.

But we're tracking "validity lifetime" (live) and "memory reference
safety" (usage).

signal_struct has "sigcnt" and "live". I find "sigcnt" to be an
unhelpful name too. (And why isn't it refcount_t?)

So, perhaps leave "live", but rename "usage" -> "references".

After looking at these other lifetime management examples in the kernel,
I'm convinced that tracking these states separately is correct, but I
remain uncomfortable about task management needing to explicitly make
two calls to let go of the filter.

I wonder if release_task() should also detach the filter from the task
and do a put_seccomp_filter() instead of waiting for task_free(). This
is supported by the other place where seccomp_filter_release() is
called:

> @@ -396,6 +400,7 @@ static inline void seccomp_sync_threads(unsigned long 
> flags)
>* allows a put before the assignment.)
>   */
>   put_seccomp_filter(thread);
> + seccomp_filter_release(thread);

This would also remove the only put_seccomp_filter() call outside of
seccomp.c, since the free_task() call will be removed now in favor of
the task_release() call.

So, is it safe to detach the filter in release_task()? Has dethreading
happened yet? i.e. can we race TSYNC? -- is there a possible
inc-from-zero? (Actually, all our refcount_inc()s should be
refcount_inc_not_zero() just for robustness.) I *think* we can do it
before the release_thread() call (instead of after cgroup_release()).

With that, then seccomp_filter_release() could assign the filter to NULL
and do the clean up:

void seccomp_filter_release(const struct task_struct *tsk)
{
struct seccomp_filter *orig = READ_ONCE(tsk->seccomp.filter);

smp_store_release(>seccomp.filter, NULL);
__seccomp_filter_release(orig);
}

All other refcounting is then internal to seccomp.c. Which brings me
back to TSYNC, since we don't want to write NULL to task->seccomp.filter
during TSYNC. TSYNC can use:

void __seccomp_filter_release(struct seccomp_filter *filter)
{
while (filter && refcount_dec_and_test(>live)) {
if (waitqueue_active(>wqh))
wake_up_poll(>wqh, EPOLLHUP);
filter = filter->prev;
}
__put_seccomp_filter(filter);
}

Thoughts?

-- 
Kees Cook

[PATCH v2 1/2] seccomp: notify user trap about unused filter

2020-05-28 Thread Christian Brauner

We've been making heavy use of the seccomp notifier to intercept and
handle certain syscalls for containers. This patch allows a syscall
supervisor listening on a given notifier to be notified when a seccomp
filter has become unused.

A container is often managed by a singleton supervisor process the
so-called "monitor". This monitor process has an event loop which has
various event handlers registered. If the user specified a seccomp
profile that included a notifier for various syscalls then we also
register a seccomp notify even handler. For any container using a
separate pid namespace the lifecycle of the seccomp notifier is bound to
the init process of the pid namespace, i.e. when the init process exits
the filter must be unused.
If a new process attaches to a container we force it to assume a seccomp
profile. This can either be the same seccomp profile as the container
was started with or a modified one. If the attaching process makes use
of the seccomp notifier we will register a new seccomp notifier handler
in the monitor's event loop. However, when the attaching process exits
we can't simply delete the handler since other child processes could've
been created (daemons spawned etc.) that have inherited the seccomp
filter and so we need to keep the seccomp notifier fd alive in the event
loop. But this is problematic since we don't get a notification when the
seccomp filter has become unused and so we currently never remove the
seccomp notifier fd from the event loop and just keep accumulating fds
in the event loop. We've had this issue for a while but it has recently
become more pressing as more and larger users make use of this.

To fix this, we introduce a new "live" reference counter that tracks the
live tasks making use of a given filter and when a notifier is
registered waiting tasks will be notified that the filter is now empty
by receiving a (E)POLLHUP event.
The concept in this patch introduces is the same as for signal_struct,
i.e. reference counting for life-cycle management is decoupled from
reference counting live taks using the object.

There's probably some trickery possible but the second counter is just
the correct way of doing this imho and has precedence. The patch also
lifts the waitqeue from struct notification into into sruct
seccomp_filter. This is cleaner overall and let's us avoid having to
take the notifier mutex since we neither need to read nor modify the
notifier specific aspects of the seccomp filter. In the exit path I'd
very much like to avoid having to take the notifier mutex for each
filter in the task's filter hierarchy.

Cc: Tycho Andersen 
Cc: Kees Cook 
Cc: Matt Denton 
Cc: Sargun Dhillon 
Cc: Jann Horn 
Cc: Chris Palmer 
Cc: Aleksa Sarai 
Cc: Robert Sesek 
Cc: Jeffrey Vander Stoep 
Cc: Linux Containers 
Signed-off-by: Christian Brauner 
---
/* v2 */
- Jann Horn :
  - Use more descriptive instead of seccomp_filter_notify().
(I went with seccomp_filter_release().)
---
 include/linux/seccomp.h |  5 +
 kernel/exit.c   |  2 ++
 kernel/seccomp.c| 33 +++--
 3 files changed, 34 insertions(+), 6 deletions(-)

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 4192369b8418..d72b5b43f8ea 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -84,6 +84,7 @@ static inline int seccomp_mode(struct seccomp *s)
 #ifdef CONFIG_SECCOMP_FILTER
 extern void put_seccomp_filter(struct task_struct *tsk);
 extern void get_seccomp_filter(struct task_struct *tsk);
+extern void seccomp_filter_release(const struct task_struct *tsk);
 #else  /* CONFIG_SECCOMP_FILTER */
 static inline void put_seccomp_filter(struct task_struct *tsk)
 {
@@ -93,6 +94,10 @@ static inline void get_seccomp_filter(struct task_struct 
*tsk)
 {
return;
 }
+static inline void seccomp_filter_release(const struct task_struct *tsk)
+{
+   return;
+}
 #endif /* CONFIG_SECCOMP_FILTER */
 
 #if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_CHECKPOINT_RESTORE)
diff --git a/kernel/exit.c b/kernel/exit.c
index ce2a75bc0ade..b332e3635eb5 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -193,6 +193,8 @@ void release_task(struct task_struct *p)
 
cgroup_release(p);
 
+   seccomp_filter_release(p);
+
write_lock_irq(_lock);
ptrace_release_task(p);
thread_pid = get_pid(p->thread_pid);
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 55a6184f5990..9fa642d6d549 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -94,13 +94,11 @@ struct seccomp_knotif {
  *   filter->notify_lock.
  * @next_id: The id of the next request.
  * @notifications: A list of struct seccomp_knotif elements.
- * @wqh: A wait queue for poll.
  */
 struct notification {
struct semaphore request;
u64 next_id;
struct list_head notifications;
-   wait_queue_head_t wqh;
 };
 
 /**
@@ -115,6 +113,10 @@ struct notification {
  * @prog: the BPF program to evaluate
  * @notif: the struct that

Re: [PATCH v2 1/2] seccomp: notify user trap about unused filter

Re: [PATCH v2 1/2] seccomp: notify user trap about unused filter

Re: [PATCH v2 1/2] seccomp: notify user trap about unused filter

Re: [PATCH v2 1/2] seccomp: notify user trap about unused filter

Re: [PATCH v2 1/2] seccomp: notify user trap about unused filter

Re: [PATCH v2 1/2] seccomp: notify user trap about unused filter

Re: [PATCH v2 1/2] seccomp: notify user trap about unused filter

Re: [PATCH v2 1/2] seccomp: notify user trap about unused filter

Re: [PATCH v2 1/2] seccomp: notify user trap about unused filter

Re: [PATCH v2 1/2] seccomp: notify user trap about unused filter

Re: [PATCH v2 1/2] seccomp: notify user trap about unused filter

Re: [PATCH v2 1/2] seccomp: notify user trap about unused filter

[PATCH v2 1/2] seccomp: notify user trap about unused filter

13 matches

Site Navigation

Mail list logo

Footer information