Re: [PATCH v9 09/11] seccomp: introduce writer locking

2014-07-10 Thread Oleg Nesterov
On 07/10, Kees Cook wrote:
>
> On Thu, Jul 10, 2014 at 8:24 AM, Oleg Nesterov  wrote:
> >
> > Just to simplify. Suppose TIF_SECCOMP was set a long ago. This thread
> > has a single filter F1 and it enters seccomp_run_filters().
> >
> > Right before it does ACCESS_ONCE() to read the pointer, another thread
> > does seccomp_sync_threads() and sets .filter = F2.
> >
> > If ACCESS_ONCE() returns F1 - everything is fine. But it can see the new
> > pointer F2, and in this case we need a barrier to ensure that, say,
> > LOAD(F2->prog) will see all the preceding changes in this memory.
>
> And the rmb() isn't sufficient for that?

But it has no effect if the pointer was changed _after_ rmb() was already
called.

And, you need a barrier _after_ ACCESS_ONCE().

(Unless, again, we know that this is the first filter, but this is only
 by accident).

> Is another barrier needed
> before assigning the filter pointer to make sure the contents it
> points to are flushed?

I think smp_store_release() should be moved from seccomp_attach_filter()
to seccomp_sync_threads(). Although it probably _should_ work either way,
at least this looks confusing because a) "current" doesn't need a
barrier to serialize with itself, and b) it is not clear why it is safe
to change the pointer dereferenced by another thread without a barrier.

> What's the least time-consuming operation I can use in run_filters?

As I said smp_read_barrier_depends() (nop unless alpha) or
smp_load_acquire() which you used in the previous version.

And to remind, afaics smp_load_acquire() in put_filter() should die ;)

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v9 09/11] seccomp: introduce writer locking

2014-07-10 Thread Kees Cook
On Thu, Jul 10, 2014 at 8:24 AM, Oleg Nesterov  wrote:
> On 07/10, Kees Cook wrote:
>>
>> On Wed, Jul 9, 2014 at 11:55 AM, Oleg Nesterov  wrote:
>> > On 07/09, Oleg Nesterov wrote:
>> >>
>> >> On 06/27, Kees Cook wrote:
>> >> >
>> >> >  static u32 seccomp_run_filters(int syscall)
>> >> >  {
>> >> > -   struct seccomp_filter *f;
>> >> > +   struct seccomp_filter *f = ACCESS_ONCE(current->seccomp.filter);
>> >>
>> >> I am not sure...
>> >>
>> >> This is fine if this ->filter is the 1st (and only) one, in this case
>> >> we can rely on rmb() in the caller.
>> >>
>> >> But the new filter can be installed at any moment. Say, right after that
>> >> rmb() although this doesn't matter. Either we need smp_read_barrier_depends()
>> >> after that, or smp_load_acquire() like the previous version did?
>> >
>> > Wait... and it seems that seccomp_sync_threads() needs smp_store_release()
>> > when it sets thread->filter = current->filter by the same reason?
>> >
>> > OTOH. smp_store_release() in seccomp_attach_filter() can die, "current"
>> > doesn't need a barrier to serialize with itself.
>>
>> I have lost track of what you're suggesting to change. :)
>
> Perhaps I am just trying to confuse you and myself ;)
>
> But,
>
>> Since rmb() happens before run_filters, isn't the ACCESS_ONCE
>> sufficient?
>
> Yes. But see above. ACCESS_ONCE is sufficient if we read the first filter
> installed by another thread, in this case rmb() pairs with mb_before_atomic()
> before set_bit(TIF_SECCOMP).
>
> IOW, if this thread sees TIF_SECCOMP, it should also see all modifications
> which were done before set_bit, including the data ->filter points to.
>
>> We only care that TIF_SECCOMP, mode, and some filter is
>> valid. In a tsync thread race, it's okay to not use the deepest
>> filter node in the list,
>
> Yes, it is fine if we miss yet another filter which was just installed by
> another thread.
>
> But, unless I missed something, the problem is that we can get this new
> filter.
>
> Just to simplify. Suppose TIF_SECCOMP was set a long ago. This thread
> has a single filter F1 and it enters seccomp_run_filters().
>
> Right before it does ACCESS_ONCE() to read the pointer, another thread
> does seccomp_sync_threads() and sets .filter = F2.
>
> If ACCESS_ONCE() returns F1 - everything is fine. But it can see the new
> pointer F2, and in this case we need a barrier to ensure that, say,
> LOAD(F2->prog) will see all the preceding changes in this memory.

And the rmb() isn't sufficient for that? Is another barrier needed
before assigning the filter pointer to make sure the contents it
points to are flushed?

What's the least time-consuming operation I can use in run_filters?

-Kees

-- 
Kees Cook
Chrome OS Security


Re: [PATCH v9 09/11] seccomp: introduce writer locking

2014-07-10 Thread Oleg Nesterov
On 07/10, Kees Cook wrote:
>
> On Wed, Jul 9, 2014 at 11:55 AM, Oleg Nesterov  wrote:
> > On 07/09, Oleg Nesterov wrote:
> >>
> >> On 06/27, Kees Cook wrote:
> >> >
> >> >  static u32 seccomp_run_filters(int syscall)
> >> >  {
> >> > -   struct seccomp_filter *f;
> >> > +   struct seccomp_filter *f = ACCESS_ONCE(current->seccomp.filter);
> >>
> >> I am not sure...
> >>
> >> This is fine if this ->filter is the 1st (and only) one, in this case
> >> we can rely on rmb() in the caller.
> >>
> >> But the new filter can be installed at any moment. Say, right after that
> >> rmb() although this doesn't matter. Either we need smp_read_barrier_depends()
> >> after that, or smp_load_acquire() like the previous version did?
> >
> > Wait... and it seems that seccomp_sync_threads() needs smp_store_release()
> > when it sets thread->filter = current->filter by the same reason?
> >
> > OTOH. smp_store_release() in seccomp_attach_filter() can die, "current"
> > doesn't need a barrier to serialize with itself.
>
> I have lost track of what you're suggesting to change. :)

Perhaps I am just trying to confuse you and myself ;)

But,

> Since rmb() happens before run_filters, isn't the ACCESS_ONCE
> sufficient?

Yes. But see above. ACCESS_ONCE is sufficient if we read the first filter
installed by another thread, in this case rmb() pairs with mb_before_atomic()
before set_bit(TIF_SECCOMP).

IOW, if this thread sees TIF_SECCOMP, it should also see all modifications
which were done before set_bit, including the data ->filter points to.

> We only care that TIF_SECCOMP, mode, and some filter is
> valid. In a tsync thread race, it's okay to not use the deepest
> filter node in the list,

Yes, it is fine if we miss yet another filter which was just installed by
another thread.

But, unless I missed something, the problem is that we can get this new
filter.

Just to simplify. Suppose TIF_SECCOMP was set a long ago. This thread
has a single filter F1 and it enters seccomp_run_filters().

Right before it does ACCESS_ONCE() to read the pointer, another thread
does seccomp_sync_threads() and sets .filter = F2.

If ACCESS_ONCE() returns F1 - everything is fine. But it can see the new
pointer F2, and in this case we need a barrier to ensure that, say,
LOAD(F2->prog) will see all the preceding changes in this memory.

Oleg.



Re: [PATCH v9 09/11] seccomp: introduce writer locking

2014-07-10 Thread Kees Cook
On Wed, Jul 9, 2014 at 11:55 AM, Oleg Nesterov  wrote:
> On 07/09, Oleg Nesterov wrote:
>>
>> On 06/27, Kees Cook wrote:
>> >
>> >  static u32 seccomp_run_filters(int syscall)
>> >  {
>> > -   struct seccomp_filter *f;
>> > +   struct seccomp_filter *f = ACCESS_ONCE(current->seccomp.filter);
>>
>> I am not sure...
>>
>> This is fine if this ->filter is the 1st (and only) one, in this case
>> we can rely on rmb() in the caller.
>>
>> But the new filter can be installed at any moment. Say, right after that
>> rmb() although this doesn't matter. Either we need smp_read_barrier_depends()
>> after that, or smp_load_acquire() like the previous version did?
>
> Wait... and it seems that seccomp_sync_threads() needs smp_store_release()
> when it sets thread->filter = current->filter by the same reason?
>
> OTOH. smp_store_release() in seccomp_attach_filter() can die, "current"
> doesn't need a barrier to serialize with itself.

I have lost track of what you're suggesting to change. :)

Since rmb() happens before run_filters, isn't the ACCESS_ONCE
sufficient? We only care that TIF_SECCOMP, mode, and some filter is
valid. In a tsync thread race, it's okay to not use the deepest
filter node in the list, it just needs A filter.

-Kees

-- 
Kees Cook
Chrome OS Security


Re: [PATCH v9 09/11] seccomp: introduce writer locking

2014-07-09 Thread Oleg Nesterov
On 06/27, Kees Cook wrote:
>
>  void put_seccomp_filter(struct task_struct *tsk)
>  {
> - struct seccomp_filter *orig = tsk->seccomp.filter;
> + struct seccomp_filter *orig = smp_load_acquire(&tsk->seccomp.filter);
>   /* Clean up single-reference branches iteratively. */
>   while (orig && atomic_dec_and_test(&orig->usage)) {

And this smp_load_acquire() looks unnecessary. atomic_dec_and_test()
is a barrier.

Oleg.



Re: [PATCH v9 09/11] seccomp: introduce writer locking

2014-07-09 Thread Oleg Nesterov
On 07/09, Oleg Nesterov wrote:
>
> On 06/27, Kees Cook wrote:
> >
> >  static u32 seccomp_run_filters(int syscall)
> >  {
> > -   struct seccomp_filter *f;
> > +   struct seccomp_filter *f = ACCESS_ONCE(current->seccomp.filter);
>
> I am not sure...
>
> This is fine if this ->filter is the 1st (and only) one, in this case
> we can rely on rmb() in the caller.
>
> But the new filter can be installed at any moment. Say, right after that
> rmb() although this doesn't matter. Either we need smp_read_barrier_depends()
> after that, or smp_load_acquire() like the previous version did?

Wait... and it seems that seccomp_sync_threads() needs smp_store_release()
when it sets thread->filter = current->filter by the same reason?

OTOH. smp_store_release() in seccomp_attach_filter() can die, "current"
doesn't need a barrier to serialize with itself.

Oleg.



Re: [PATCH v9 09/11] seccomp: introduce writer locking

2014-07-09 Thread Oleg Nesterov
On 06/27, Kees Cook wrote:
>
>  static u32 seccomp_run_filters(int syscall)
>  {
> - struct seccomp_filter *f;
> + struct seccomp_filter *f = ACCESS_ONCE(current->seccomp.filter);

I am not sure...

This is fine if this ->filter is the 1st (and only) one, in this case
we can rely on rmb() in the caller.

But the new filter can be installed at any moment. Say, right after that
rmb() although this doesn't matter. Either we need smp_read_barrier_depends()
after that, or smp_load_acquire() like the previous version did?

Oleg.



[PATCH v9 09/11] seccomp: introduce writer locking

2014-06-27 Thread Kees Cook
Normally, task_struct.seccomp.filter is only ever read or modified by
the task that owns it (current). This property aids in fast access
during system call filtering as read access is lockless.

Updating the pointer from another task, however, opens up race
conditions. To allow cross-thread filter pointer updates, writes to
the seccomp fields are now protected by the sighand spinlock (which
is unique to the thread group). Read access remains lockless because
pointer updates themselves are atomic.  However, writes (or cloning)
often entail additional checking (like maximum instruction counts)
which require locking to perform safely.

In the case of cloning threads, the child is invisible to the system
until it enters the task list. To make sure a child can't be cloned from
a thread and left in a prior state, seccomp duplication is additionally
moved under the sighand lock. Then parent and child are certain to have
the same seccomp state when they exit the lock.

Based on patches by Will Drewry and David Drysdale.

Signed-off-by: Kees Cook 
---
 include/linux/seccomp.h |6 +++---
 kernel/fork.c   |   45 -
 kernel/seccomp.c|   26 --
 3 files changed, 67 insertions(+), 10 deletions(-)

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 4054b0994071..9ff98b4bfe2e 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -14,11 +14,11 @@ struct seccomp_filter;
  *
  * @mode:  indicates one of the valid values above for controlled
  * system calls available to a process.
- * @filter: The metadata and ruleset for determining what system calls
- *  are allowed for a task.
+ * @filter: must always point to a valid seccomp-filter or NULL as it is
+ *  accessed without locking during system call entry.
  *
  *  @filter must only be accessed from the context of current as there
- *  is no locking.
+ *  is no read locking.
  */
 struct seccomp {
int mode;
diff --git a/kernel/fork.c b/kernel/fork.c
index 6a13c46cd87d..ffc1b43e351f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -315,6 +315,15 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
goto free_ti;
 
tsk->stack = ti;
+#ifdef CONFIG_SECCOMP
+   /*
+* We must handle setting up seccomp filters once we're under
+* the sighand lock in case orig has changed between now and
+* then. Until then, filter must be NULL to avoid messing up
+* the usage counts on the error path calling free_task.
+*/
+   tsk->seccomp.filter = NULL;
+#endif
 
setup_thread_stack(tsk, orig);
clear_user_return_notifier(tsk);
@@ -1081,6 +1090,35 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
return 0;
 }
 
+static void copy_seccomp(struct task_struct *p)
+{
+#ifdef CONFIG_SECCOMP
+   /*
+* Must be called with sighand->lock held, which is common to
+* all threads in the group. Holding cred_guard_mutex is not
+* needed because this new task is not yet running and cannot
+* be racing exec.
+*/
+   BUG_ON(!spin_is_locked(&current->sighand->siglock));
+
+   /* Ref-count the new filter user, and assign it. */
+   get_seccomp_filter(current);
+   p->seccomp = current->seccomp;
+
+   /*
+* Explicitly enable no_new_privs here in case it got set
+* between the task_struct being duplicated and holding the
+* sighand lock. The seccomp state and nnp must be in sync.
+*/
+   if (task_no_new_privs(current))
+   task_set_no_new_privs(p);
+
+   /* If we have a seccomp mode, enable the thread flag. */
+   if (p->seccomp.mode != SECCOMP_MODE_DISABLED)
+   set_tsk_thread_flag(p, TIF_SECCOMP);
+#endif
+}
+
 SYSCALL_DEFINE1(set_tid_address, int __user *, tidptr)
 {
current->clear_child_tid = tidptr;
@@ -1196,7 +1234,6 @@ static struct task_struct *copy_process(unsigned long clone_flags,
goto fork_out;
 
ftrace_graph_init_task(p);
-   get_seccomp_filter(p);
 
rt_mutex_init_task(p);
 
@@ -1437,6 +1474,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
	spin_lock(&current->sighand->siglock);
 
/*
+* Copy seccomp details explicitly here, in case they were changed
+* before holding sighand lock.
+*/
+   copy_seccomp(p);
+
+   /*
 * Process group and session signals need to be delivered to just the
 * parent before the fork or both the parent and the child after the
 * fork. Restart if a signal comes in before we add the new process to
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 502e54d7f86d..e1ff2c193190 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -173,12 +173,12 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned
