Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-13 Thread Christoph Hellwig
On Fri, May 12, 2017 at 10:30:44PM +0200, Peter Zijlstra wrote:
> On stack variable length arrays get implemented by the compiler doing
> alloca(), and we sadly have a few of those around.

I've just got rid of one of those and I wish they would appear
entirely as they are horrible in so many different ways.  Sparse
warns about them, btw.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-13 Thread Christoph Hellwig
On Fri, May 12, 2017 at 10:30:44PM +0200, Peter Zijlstra wrote:
> On stack variable length arrays get implemented by the compiler doing
> alloca(), and we sadly have a few of those around.

I've just got rid of one of those and I wish they would appear
entirely as they are horrible in so many different ways.  Sparse
warns about them, btw.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Andy Lutomirski
On Fri, May 12, 2017 at 12:15 AM, Al Viro  wrote:

> Folks, seriously, have you even looked through that zoo?  I have, and it's
> really, really not fun.  Sure, we can say "fuck 'em, no need to allow
> splice() on random crap".  Would be perfectly reasonable, expect that
> it's not the only place doing kernel_write() and its ilk...

Can you clarify this?  I think we really may be able to do exactly
this.  From Christoph's list, there are only two things that need
kernel_read/kernel_write to user-supplied fds that may come from a
variety of sources: splice and exec.  If you're execing a chardev from
a crappy driver, something is seriously wrong.  And returning -EINVAL
from splice() to or from files that use ->read and ->write seems find
(and splice(2) even documents -EINVAL as meaning that the target
doesn't support splicing).

--Andy


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Andy Lutomirski
On Fri, May 12, 2017 at 12:15 AM, Al Viro  wrote:

> Folks, seriously, have you even looked through that zoo?  I have, and it's
> really, really not fun.  Sure, we can say "fuck 'em, no need to allow
> splice() on random crap".  Would be perfectly reasonable, expect that
> it's not the only place doing kernel_write() and its ilk...

Can you clarify this?  I think we really may be able to do exactly
this.  From Christoph's list, there are only two things that need
kernel_read/kernel_write to user-supplied fds that may come from a
variety of sources: splice and exec.  If you're execing a chardev from
a crappy driver, something is seriously wrong.  And returning -EINVAL
from splice() to or from files that use ->read and ->write seems find
(and splice(2) even documents -EINVAL as meaning that the target
doesn't support splicing).

--Andy


Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Andy Lutomirski
On Mon, May 8, 2017 at 1:48 PM, Al Viro  wrote:
> On Mon, May 08, 2017 at 04:06:35PM +0200, Jann Horn wrote:
>
>> I think Kees might be talking about
>> https://bugs.chromium.org/p/project-zero/issues/detail?id=822, fixed in
>> commit e6978e4bf181fb3b5f8cb6f71b4fe30fbf1b655c. The issue was that
>> perf code that can run in pretty much any context called access_ok().
>
> And that commit has *NOT* solved the problem.  perf_callchain_user()
> can be called synchronously, without passing through that code.
> Tracepoint shite...
>
> That set_fs() should be done in get_perf_callchain(), just around the call of
> perf_callchain_user().  Along with pagefault_disable(), actually.
>

Even that's not quite enough because of a different issue: perf nmis
can hit during scheduling or when we're in lazy mm, leading to the
entirely wrong set of page tables being used.  We need
nmi_uaccess_begin() and nmi_uaccess_end(), and the former needs to be
allowed to fail.

AFAIK this isn't presently a security problem because it mainly
affects kernel threads, and you need to be root to profile them, but
maybe there's some race where it does matter.


Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Andy Lutomirski
On Mon, May 8, 2017 at 1:48 PM, Al Viro  wrote:
> On Mon, May 08, 2017 at 04:06:35PM +0200, Jann Horn wrote:
>
>> I think Kees might be talking about
>> https://bugs.chromium.org/p/project-zero/issues/detail?id=822, fixed in
>> commit e6978e4bf181fb3b5f8cb6f71b4fe30fbf1b655c. The issue was that
>> perf code that can run in pretty much any context called access_ok().
>
> And that commit has *NOT* solved the problem.  perf_callchain_user()
> can be called synchronously, without passing through that code.
> Tracepoint shite...
>
> That set_fs() should be done in get_perf_callchain(), just around the call of
> perf_callchain_user().  Along with pagefault_disable(), actually.
>

Even that's not quite enough because of a different issue: perf nmis
can hit during scheduling or when we're in lazy mm, leading to the
entirely wrong set of page tables being used.  We need
nmi_uaccess_begin() and nmi_uaccess_end(), and the former needs to be
allowed to fail.

AFAIK this isn't presently a security problem because it mainly
affects kernel threads, and you need to be root to profile them, but
maybe there's some race where it does matter.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Al Viro
On Fri, May 12, 2017 at 05:47:55PM -0400, Rik van Riel wrote:

> > Seriously, look at these beasts.  Overwriting ->addr_limit is nowhere
> > near
> > the top threat.  If attacker can overwrite thread_info, you have
> > lost.
> 
> That is why THREAD_INFO_IN_TASK exists. It moves
> the struct thread_info to a location away from the
> stack, which means a stack overflow will not overwrite
> the thread_info.

... in which case such attacks on ->addr_limit also become a non-issue.

AFAICS, we are mixing several unrelated issues here:
* amount of places where set_fs() is called.  Sure, reducing it
is a good idea and we want to move to primitives like kernel_write() et.al.
Fewer users => lower odds of screwing it up.
* making sure that remaining callers are properly paired.  Ditto.
* switching to ->read_iter()/->write_iter() where it makes sense.
Again, no problem with that.
* providing sane environment for places like perf/oprofile.  Again,
a good idea, and set_fs(USER_DS) is only a part of what's needed there.
* switching _everything_ to ->read_iter()/->write_iter().  Flat-out
insane and AFAICS nobody is signing up for that.
* getting rid of set_fs() entirely.  I'm afraid that it's not feasible
without the previous one and frankly, I don't see much point.
* sanity-checking on return to userland.  Maybe useful, maybe not.
* taking thread_info out of the way of stack overflows.  Reasonable,
but has very little to do with the rest of that.
* protecting against Lovecraftian horrors slithering in from the outer
space only to commit unspeakable acts against ->addr_limit and ignoring much
tastier targets next to it, but then what do you expect from degenerate
spawn of Great Old Ones - sanity?


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Al Viro
On Fri, May 12, 2017 at 05:47:55PM -0400, Rik van Riel wrote:

> > Seriously, look at these beasts.  Overwriting ->addr_limit is nowhere
> > near
> > the top threat.  If attacker can overwrite thread_info, you have
> > lost.
> 
> That is why THREAD_INFO_IN_TASK exists. It moves
> the struct thread_info to a location away from the
> stack, which means a stack overflow will not overwrite
> the thread_info.

... in which case such attacks on ->addr_limit also become a non-issue.

AFAICS, we are mixing several unrelated issues here:
* amount of places where set_fs() is called.  Sure, reducing it
is a good idea and we want to move to primitives like kernel_write() et.al.
Fewer users => lower odds of screwing it up.
* making sure that remaining callers are properly paired.  Ditto.
* switching to ->read_iter()/->write_iter() where it makes sense.
Again, no problem with that.
* providing sane environment for places like perf/oprofile.  Again,
a good idea, and set_fs(USER_DS) is only a part of what's needed there.
* switching _everything_ to ->read_iter()/->write_iter().  Flat-out
insane and AFAICS nobody is signing up for that.
* getting rid of set_fs() entirely.  I'm afraid that it's not feasible
without the previous one and frankly, I don't see much point.
* sanity-checking on return to userland.  Maybe useful, maybe not.
* taking thread_info out of the way of stack overflows.  Reasonable,
but has very little to do with the rest of that.
* protecting against Lovecraftian horrors slithering in from the outer
space only to commit unspeakable acts against ->addr_limit and ignoring much
tastier targets next to it, but then what do you expect from degenerate
spawn of Great Old Ones - sanity?


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Kees Cook
On Fri, May 12, 2017 at 2:41 PM, Al Viro  wrote:
> On Fri, May 12, 2017 at 02:17:19PM -0700, Kees Cook wrote:
>
>> Two things are at risk from stack exhaustion: thread_info (mainly
>> addr_limit) when on the stack (fixed by THREAD_INFO_IN_TASK), and
>
> Really?  Let's take a look at arm, for example:
>
> struct thread_info {
> unsigned long   flags;  /* low level flags */
> int preempt_count;  /* 0 => preemptable, <0 => 
> bug */
> mm_segment_taddr_limit; /* address limit */
> struct task_struct  *task;  /* main task structure */
>
> and current() is defined as current_thread_info()->task.
>
> Seriously, look at these beasts.  Overwriting ->addr_limit is nowhere near
> the top threat.  If attacker can overwrite thread_info, you have lost.

I don't disagree, but the type of attack is different. If the attacker
overwrites task_struct pointer, then they need to have built an false
one, and that may be made difficult by PAN, or need to know more about
kernel memory layout (rather than only stack depth), etc. Attacking
addr_limit makes it very very easy to upgrade attack capabilities. I'm
not say thread_info shouldn't be moved off the stack.

-Kees

-- 
Kees Cook
Pixel Security


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Kees Cook
On Fri, May 12, 2017 at 2:41 PM, Al Viro  wrote:
> On Fri, May 12, 2017 at 02:17:19PM -0700, Kees Cook wrote:
>
>> Two things are at risk from stack exhaustion: thread_info (mainly
>> addr_limit) when on the stack (fixed by THREAD_INFO_IN_TASK), and
>
> Really?  Let's take a look at arm, for example:
>
> struct thread_info {
> unsigned long   flags;  /* low level flags */
> int preempt_count;  /* 0 => preemptable, <0 => 
> bug */
> mm_segment_taddr_limit; /* address limit */
> struct task_struct  *task;  /* main task structure */
>
> and current() is defined as current_thread_info()->task.
>
> Seriously, look at these beasts.  Overwriting ->addr_limit is nowhere near
> the top threat.  If attacker can overwrite thread_info, you have lost.

I don't disagree, but the type of attack is different. If the attacker
overwrites task_struct pointer, then they need to have built an false
one, and that may be made difficult by PAN, or need to know more about
kernel memory layout (rather than only stack depth), etc. Attacking
addr_limit makes it very very easy to upgrade attack capabilities. I'm
not say thread_info shouldn't be moved off the stack.

-Kees

-- 
Kees Cook
Pixel Security


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Rik van Riel
On Fri, 2017-05-12 at 22:41 +0100, Al Viro wrote:
> On Fri, May 12, 2017 at 02:17:19PM -0700, Kees Cook wrote:
> 
> > Two things are at risk from stack exhaustion: thread_info (mainly
> > addr_limit) when on the stack (fixed by THREAD_INFO_IN_TASK), and
> 
> Really?  Let's take a look at arm, for example:
> 
> struct thread_info {
> unsigned long   flags;  /* low level flags */
> int preempt_count;  /* 0 => preemptable,
> <0 => bug */
> mm_segment_taddr_limit; /* address limit */
> struct task_struct  *task;  /* main task
> structure */
> 
> and current() is defined as current_thread_info()->task.
> 
> Seriously, look at these beasts.  Overwriting ->addr_limit is nowhere
> near
> the top threat.  If attacker can overwrite thread_info, you have
> lost.

That is why THREAD_INFO_IN_TASK exists. It moves
the struct thread_info to a location away from the
stack, which means a stack overflow will not overwrite
the thread_info.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Rik van Riel
On Fri, 2017-05-12 at 22:41 +0100, Al Viro wrote:
> On Fri, May 12, 2017 at 02:17:19PM -0700, Kees Cook wrote:
> 
> > Two things are at risk from stack exhaustion: thread_info (mainly
> > addr_limit) when on the stack (fixed by THREAD_INFO_IN_TASK), and
> 
> Really?  Let's take a look at arm, for example:
> 
> struct thread_info {
> unsigned long   flags;  /* low level flags */
> int preempt_count;  /* 0 => preemptable,
> <0 => bug */
> mm_segment_taddr_limit; /* address limit */
> struct task_struct  *task;  /* main task
> structure */
> 
> and current() is defined as current_thread_info()->task.
> 
> Seriously, look at these beasts.  Overwriting ->addr_limit is nowhere
> near
> the top threat.  If attacker can overwrite thread_info, you have
> lost.

That is why THREAD_INFO_IN_TASK exists. It moves
the struct thread_info to a location away from the
stack, which means a stack overflow will not overwrite
the thread_info.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Al Viro
On Fri, May 12, 2017 at 02:17:19PM -0700, Kees Cook wrote:

> Two things are at risk from stack exhaustion: thread_info (mainly
> addr_limit) when on the stack (fixed by THREAD_INFO_IN_TASK), and

Really?  Let's take a look at arm, for example:

struct thread_info {
unsigned long   flags;  /* low level flags */
int preempt_count;  /* 0 => preemptable, <0 => bug 
*/
mm_segment_taddr_limit; /* address limit */
struct task_struct  *task;  /* main task structure */

and current() is defined as current_thread_info()->task.

Seriously, look at these beasts.  Overwriting ->addr_limit is nowhere near
the top threat.  If attacker can overwrite thread_info, you have lost.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Al Viro
On Fri, May 12, 2017 at 02:17:19PM -0700, Kees Cook wrote:

> Two things are at risk from stack exhaustion: thread_info (mainly
> addr_limit) when on the stack (fixed by THREAD_INFO_IN_TASK), and

Really?  Let's take a look at arm, for example:

struct thread_info {
unsigned long   flags;  /* low level flags */
int preempt_count;  /* 0 => preemptable, <0 => bug 
*/
mm_segment_taddr_limit; /* address limit */
struct task_struct  *task;  /* main task structure */

and current() is defined as current_thread_info()->task.

Seriously, look at these beasts.  Overwriting ->addr_limit is nowhere near
the top threat.  If attacker can overwrite thread_info, you have lost.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Daniel Micay
> overflow into adjacent allocations (fixed by VMAP_STACK).

99% fixed, but it's possible to skip over the guard page without
-fstack-check enabled (plus some edge cases need to be fixed in GCC),
unless VLAs were forbidden in addition to the existing large frame size
warning.

I'm not sure about in-tree code, but Qualcomm had some of these
improperly bounded VLA vulnerabilities in their MSM kernel...


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Daniel Micay
> overflow into adjacent allocations (fixed by VMAP_STACK).

99% fixed, but it's possible to skip over the guard page without
-fstack-check enabled (plus some edge cases need to be fixed in GCC),
unless VLAs were forbidden in addition to the existing large frame size
warning.

I'm not sure about in-tree code, but Qualcomm had some of these
improperly bounded VLA vulnerabilities in their MSM kernel...


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Kees Cook
On Fri, May 12, 2017 at 2:06 PM, Al Viro  wrote:
> On Fri, May 12, 2017 at 09:21:06PM +0100, Russell King - ARM Linux wrote:
>> On Fri, May 12, 2017 at 12:30:02PM -0700, Kees Cook wrote:
>> > I'm clearly not explaining things well enough. I shouldn't say
>> > "corruption", I should say "malicious manipulation". The methodology
>> > of attacks against the stack are quite different from the other kinds
>> > of attacks like use-after-free, heap overflow, etc. Being able to
>> > exhaust the kernel stack (either due to deep recursion or unbounded
>> > alloca())
>>
>> I really hope we don't have alloca() use in the kernel.  Do you have
>> evidence to support that assertion?
>>
>> IMHO alloca() (or similar) should not be present in any kernel code
>> because we have a limited stack - we have kmalloc() etc for that kind
>> of thing.
>
> No alloca(), but there are VLAs.  Said that, the whole "what if they
> can bugger thread_info and/or task_struct and go after set_fs() state"
> is idiocy, of course - in that case the box is fucked, no matter what.

Two things are at risk from stack exhaustion: thread_info (mainly
addr_limit) when on the stack (fixed by THREAD_INFO_IN_TASK), and
overflow into adjacent allocations (fixed by VMAP_STACK). The latter
is fundamentally a heap overflow.

-Kees

-- 
Kees Cook
Pixel Security


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Kees Cook
On Fri, May 12, 2017 at 2:06 PM, Al Viro  wrote:
> On Fri, May 12, 2017 at 09:21:06PM +0100, Russell King - ARM Linux wrote:
>> On Fri, May 12, 2017 at 12:30:02PM -0700, Kees Cook wrote:
>> > I'm clearly not explaining things well enough. I shouldn't say
>> > "corruption", I should say "malicious manipulation". The methodology
>> > of attacks against the stack are quite different from the other kinds
>> > of attacks like use-after-free, heap overflow, etc. Being able to
>> > exhaust the kernel stack (either due to deep recursion or unbounded
>> > alloca())
>>
>> I really hope we don't have alloca() use in the kernel.  Do you have
>> evidence to support that assertion?
>>
>> IMHO alloca() (or similar) should not be present in any kernel code
>> because we have a limited stack - we have kmalloc() etc for that kind
>> of thing.
>
> No alloca(), but there are VLAs.  Said that, the whole "what if they
> can bugger thread_info and/or task_struct and go after set_fs() state"
> is idiocy, of course - in that case the box is fucked, no matter what.

Two things are at risk from stack exhaustion: thread_info (mainly
addr_limit) when on the stack (fixed by THREAD_INFO_IN_TASK), and
overflow into adjacent allocations (fixed by VMAP_STACK). The latter
is fundamentally a heap overflow.

-Kees

-- 
Kees Cook
Pixel Security


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Daniel Micay
On Fri, 2017-05-12 at 22:06 +0100, Al Viro wrote:
> On Fri, May 12, 2017 at 09:21:06PM +0100, Russell King - ARM Linux
> wrote:
> > On Fri, May 12, 2017 at 12:30:02PM -0700, Kees Cook wrote:
> > > I'm clearly not explaining things well enough. I shouldn't say
> > > "corruption", I should say "malicious manipulation". The
> > > methodology
> > > of attacks against the stack are quite different from the other
> > > kinds
> > > of attacks like use-after-free, heap overflow, etc. Being able to
> > > exhaust the kernel stack (either due to deep recursion or
> > > unbounded
> > > alloca())
> > 
> > I really hope we don't have alloca() use in the kernel.  Do you have
> > evidence to support that assertion?
> > 
> > IMHO alloca() (or similar) should not be present in any kernel code
> > because we have a limited stack - we have kmalloc() etc for that
> > kind
> > of thing.
> 
> No alloca(), but there are VLAs.  Said that, the whole "what if they
> can bugger thread_info and/or task_struct and go after set_fs() state"
> is idiocy, of course - in that case the box is fucked, no matter what.

VMAP_STACK + -fstack-check would prevent exploiting even an unbounded
VLA / alloca size vs. it being an arbitrary write. -fstack-check
guarantees that there's one byte per page as the stack grows, although
there are some unfortunate GCC bugs making it less than perfect right
now... but they recently started caring about it more including making
it near zero overhead as it was always supposed to be.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Daniel Micay
On Fri, 2017-05-12 at 22:06 +0100, Al Viro wrote:
> On Fri, May 12, 2017 at 09:21:06PM +0100, Russell King - ARM Linux
> wrote:
> > On Fri, May 12, 2017 at 12:30:02PM -0700, Kees Cook wrote:
> > > I'm clearly not explaining things well enough. I shouldn't say
> > > "corruption", I should say "malicious manipulation". The
> > > methodology
> > > of attacks against the stack are quite different from the other
> > > kinds
> > > of attacks like use-after-free, heap overflow, etc. Being able to
> > > exhaust the kernel stack (either due to deep recursion or
> > > unbounded
> > > alloca())
> > 
> > I really hope we don't have alloca() use in the kernel.  Do you have
> > evidence to support that assertion?
> > 
> > IMHO alloca() (or similar) should not be present in any kernel code
> > because we have a limited stack - we have kmalloc() etc for that
> > kind
> > of thing.
> 
> No alloca(), but there are VLAs.  Said that, the whole "what if they
> can bugger thread_info and/or task_struct and go after set_fs() state"
> is idiocy, of course - in that case the box is fucked, no matter what.

VMAP_STACK + -fstack-check would prevent exploiting even an unbounded
VLA / alloca size vs. it being an arbitrary write. -fstack-check
guarantees that there's one byte per page as the stack grows, although
there are some unfortunate GCC bugs making it less than perfect right
now... but they recently started caring about it more including making
it near zero overhead as it was always supposed to be.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Al Viro
On Fri, May 12, 2017 at 09:21:06PM +0100, Russell King - ARM Linux wrote:
> On Fri, May 12, 2017 at 12:30:02PM -0700, Kees Cook wrote:
> > I'm clearly not explaining things well enough. I shouldn't say
> > "corruption", I should say "malicious manipulation". The methodology
> > of attacks against the stack are quite different from the other kinds
> > of attacks like use-after-free, heap overflow, etc. Being able to
> > exhaust the kernel stack (either due to deep recursion or unbounded
> > alloca())
> 
> I really hope we don't have alloca() use in the kernel.  Do you have
> evidence to support that assertion?
> 
> IMHO alloca() (or similar) should not be present in any kernel code
> because we have a limited stack - we have kmalloc() etc for that kind
> of thing.

No alloca(), but there are VLAs.  Said that, the whole "what if they
can bugger thread_info and/or task_struct and go after set_fs() state"
is idiocy, of course - in that case the box is fucked, no matter what.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Al Viro
On Fri, May 12, 2017 at 09:21:06PM +0100, Russell King - ARM Linux wrote:
> On Fri, May 12, 2017 at 12:30:02PM -0700, Kees Cook wrote:
> > I'm clearly not explaining things well enough. I shouldn't say
> > "corruption", I should say "malicious manipulation". The methodology
> > of attacks against the stack are quite different from the other kinds
> > of attacks like use-after-free, heap overflow, etc. Being able to
> > exhaust the kernel stack (either due to deep recursion or unbounded
> > alloca())
> 
> I really hope we don't have alloca() use in the kernel.  Do you have
> evidence to support that assertion?
> 
> IMHO alloca() (or similar) should not be present in any kernel code
> because we have a limited stack - we have kmalloc() etc for that kind
> of thing.

No alloca(), but there are VLAs.  Said that, the whole "what if they
can bugger thread_info and/or task_struct and go after set_fs() state"
is idiocy, of course - in that case the box is fucked, no matter what.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Kees Cook
On Fri, May 12, 2017 at 2:00 PM, Kees Cook  wrote:
> On Fri, May 12, 2017 at 1:45 PM, Russell King - ARM Linux
>  wrote:
>> On Fri, May 12, 2017 at 10:30:44PM +0200, Peter Zijlstra wrote:
>>> On Fri, May 12, 2017 at 09:21:06PM +0100, Russell King - ARM Linux wrote:
>>> > On Fri, May 12, 2017 at 12:30:02PM -0700, Kees Cook wrote:
>>> > > I'm clearly not explaining things well enough. I shouldn't say
>>> > > "corruption", I should say "malicious manipulation". The methodology
>>> > > of attacks against the stack are quite different from the other kinds
>>> > > of attacks like use-after-free, heap overflow, etc. Being able to
>>> > > exhaust the kernel stack (either due to deep recursion or unbounded
>>> > > alloca())
>>> >
>>> > I really hope we don't have alloca() use in the kernel.  Do you have
>>> > evidence to support that assertion?
>>> >
>>> > IMHO alloca() (or similar) should not be present in any kernel code
>>> > because we have a limited stack - we have kmalloc() etc for that kind
>>> > of thing.
>>>
>>> On stack variable length arrays get implemented by the compiler doing
>>> alloca(), and we sadly have a few of those around.
>>
>> I hope their size is appropriately limited, but something tells me it
>> would be foolish to assume that.
>>
>>> But yes, fully agreed on the desirability of alloca() and things.
>>
>> Hmm, I wonder if -fno-builtin-alloca would prevent those... it looks
>> like it certainly would prevent an explicit alloca() call.
>
> Building with -Werror=vla is exciting. :)
>
> A lot of it is in crypto (which are relatively static sizes, just
> using function callbacks), but there is plenty more.

I meant to also paste an example (which is harmless, I haven't looked
extensively at other examples):

unsigned long alignmask = crypto_tfm_alg_alignmask(tfm);
unsigned int size = crypto_tfm_alg_blocksize(tfm);
u8 buffer[size + alignmask];

Looking at all the places (and having tried to remove a few of these
in pstore), I think it might be quite frustrating to eliminate them
all and then declare VLAs dead. I'm not against trying, though. :)

-Kees

-- 
Kees Cook
Pixel Security


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Kees Cook
On Fri, May 12, 2017 at 2:00 PM, Kees Cook  wrote:
> On Fri, May 12, 2017 at 1:45 PM, Russell King - ARM Linux
>  wrote:
>> On Fri, May 12, 2017 at 10:30:44PM +0200, Peter Zijlstra wrote:
>>> On Fri, May 12, 2017 at 09:21:06PM +0100, Russell King - ARM Linux wrote:
>>> > On Fri, May 12, 2017 at 12:30:02PM -0700, Kees Cook wrote:
>>> > > I'm clearly not explaining things well enough. I shouldn't say
>>> > > "corruption", I should say "malicious manipulation". The methodology
>>> > > of attacks against the stack are quite different from the other kinds
>>> > > of attacks like use-after-free, heap overflow, etc. Being able to
>>> > > exhaust the kernel stack (either due to deep recursion or unbounded
>>> > > alloca())
>>> >
>>> > I really hope we don't have alloca() use in the kernel.  Do you have
>>> > evidence to support that assertion?
>>> >
>>> > IMHO alloca() (or similar) should not be present in any kernel code
>>> > because we have a limited stack - we have kmalloc() etc for that kind
>>> > of thing.
>>>
>>> On stack variable length arrays get implemented by the compiler doing
>>> alloca(), and we sadly have a few of those around.
>>
>> I hope their size is appropriately limited, but something tells me it
>> would be foolish to assume that.
>>
>>> But yes, fully agreed on the desirability of alloca() and things.
>>
>> Hmm, I wonder if -fno-builtin-alloca would prevent those... it looks
>> like it certainly would prevent an explicit alloca() call.
>
> Building with -Werror=vla is exciting. :)
>
> A lot of it is in crypto (which are relatively static sizes, just
> using function callbacks), but there is plenty more.

I meant to also paste an example (which is harmless, I haven't looked
extensively at other examples):

unsigned long alignmask = crypto_tfm_alg_alignmask(tfm);
unsigned int size = crypto_tfm_alg_blocksize(tfm);
u8 buffer[size + alignmask];

Looking at all the places (and having tried to remove a few of these
in pstore), I think it might be quite frustrating to eliminate them
all and then declare VLAs dead. I'm not against trying, though. :)

-Kees

-- 
Kees Cook
Pixel Security


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Kees Cook
On Fri, May 12, 2017 at 1:45 PM, Russell King - ARM Linux
 wrote:
> On Fri, May 12, 2017 at 10:30:44PM +0200, Peter Zijlstra wrote:
>> On Fri, May 12, 2017 at 09:21:06PM +0100, Russell King - ARM Linux wrote:
>> > On Fri, May 12, 2017 at 12:30:02PM -0700, Kees Cook wrote:
>> > > I'm clearly not explaining things well enough. I shouldn't say
>> > > "corruption", I should say "malicious manipulation". The methodology
>> > > of attacks against the stack are quite different from the other kinds
>> > > of attacks like use-after-free, heap overflow, etc. Being able to
>> > > exhaust the kernel stack (either due to deep recursion or unbounded
>> > > alloca())
>> >
>> > I really hope we don't have alloca() use in the kernel.  Do you have
>> > evidence to support that assertion?
>> >
>> > IMHO alloca() (or similar) should not be present in any kernel code
>> > because we have a limited stack - we have kmalloc() etc for that kind
>> > of thing.
>>
>> On stack variable length arrays get implemented by the compiler doing
>> alloca(), and we sadly have a few of those around.
>
> I hope their size is appropriately limited, but something tells me it
> would be foolish to assume that.
>
>> But yes, fully agreed on the desirability of alloca() and things.
>
> Hmm, I wonder if -fno-builtin-alloca would prevent those... it looks
> like it certainly would prevent an explicit alloca() call.

Building with -Werror=vla is exciting. :)

A lot of it is in crypto (which are relatively static sizes, just
using function callbacks), but there is plenty more.

-Kees

-- 
Kees Cook
Pixel Security


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Kees Cook
On Fri, May 12, 2017 at 1:45 PM, Russell King - ARM Linux
 wrote:
> On Fri, May 12, 2017 at 10:30:44PM +0200, Peter Zijlstra wrote:
>> On Fri, May 12, 2017 at 09:21:06PM +0100, Russell King - ARM Linux wrote:
>> > On Fri, May 12, 2017 at 12:30:02PM -0700, Kees Cook wrote:
>> > > I'm clearly not explaining things well enough. I shouldn't say
>> > > "corruption", I should say "malicious manipulation". The methodology
>> > > of attacks against the stack are quite different from the other kinds
>> > > of attacks like use-after-free, heap overflow, etc. Being able to
>> > > exhaust the kernel stack (either due to deep recursion or unbounded
>> > > alloca())
>> >
>> > I really hope we don't have alloca() use in the kernel.  Do you have
>> > evidence to support that assertion?
>> >
>> > IMHO alloca() (or similar) should not be present in any kernel code
>> > because we have a limited stack - we have kmalloc() etc for that kind
>> > of thing.
>>
>> On stack variable length arrays get implemented by the compiler doing
>> alloca(), and we sadly have a few of those around.
>
> I hope their size is appropriately limited, but something tells me it
> would be foolish to assume that.
>
>> But yes, fully agreed on the desirability of alloca() and things.
>
> Hmm, I wonder if -fno-builtin-alloca would prevent those... it looks
> like it certainly would prevent an explicit alloca() call.

Building with -Werror=vla is exciting. :)

A lot of it is in crypto (which are relatively static sizes, just
using function callbacks), but there is plenty more.

-Kees

-- 
Kees Cook
Pixel Security


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Russell King - ARM Linux
On Fri, May 12, 2017 at 10:30:44PM +0200, Peter Zijlstra wrote:
> On Fri, May 12, 2017 at 09:21:06PM +0100, Russell King - ARM Linux wrote:
> > On Fri, May 12, 2017 at 12:30:02PM -0700, Kees Cook wrote:
> > > I'm clearly not explaining things well enough. I shouldn't say
> > > "corruption", I should say "malicious manipulation". The methodology
> > > of attacks against the stack are quite different from the other kinds
> > > of attacks like use-after-free, heap overflow, etc. Being able to
> > > exhaust the kernel stack (either due to deep recursion or unbounded
> > > alloca())
> > 
> > I really hope we don't have alloca() use in the kernel.  Do you have
> > evidence to support that assertion?
> > 
> > IMHO alloca() (or similar) should not be present in any kernel code
> > because we have a limited stack - we have kmalloc() etc for that kind
> > of thing.
> 
> On stack variable length arrays get implemented by the compiler doing
> alloca(), and we sadly have a few of those around.

I hope their size is appropriately limited, but something tells me it
would be foolish to assume that.

> But yes, fully agreed on the desirability of alloca() and things.

Hmm, I wonder if -fno-builtin-alloca would prevent those... it looks
like it certainly would prevent an explicit alloca() call.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Russell King - ARM Linux
On Fri, May 12, 2017 at 10:30:44PM +0200, Peter Zijlstra wrote:
> On Fri, May 12, 2017 at 09:21:06PM +0100, Russell King - ARM Linux wrote:
> > On Fri, May 12, 2017 at 12:30:02PM -0700, Kees Cook wrote:
> > > I'm clearly not explaining things well enough. I shouldn't say
> > > "corruption", I should say "malicious manipulation". The methodology
> > > of attacks against the stack are quite different from the other kinds
> > > of attacks like use-after-free, heap overflow, etc. Being able to
> > > exhaust the kernel stack (either due to deep recursion or unbounded
> > > alloca())
> > 
> > I really hope we don't have alloca() use in the kernel.  Do you have
> > evidence to support that assertion?
> > 
> > IMHO alloca() (or similar) should not be present in any kernel code
> > because we have a limited stack - we have kmalloc() etc for that kind
> > of thing.
> 
> On stack variable length arrays get implemented by the compiler doing
> alloca(), and we sadly have a few of those around.

I hope their size is appropriately limited, but something tells me it
would be foolish to assume that.

> But yes, fully agreed on the desirability of alloca() and things.

Hmm, I wonder if -fno-builtin-alloca would prevent those... it looks
like it certainly would prevent an explicit alloca() call.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Peter Zijlstra
On Fri, May 12, 2017 at 09:21:06PM +0100, Russell King - ARM Linux wrote:
> On Fri, May 12, 2017 at 12:30:02PM -0700, Kees Cook wrote:
> > I'm clearly not explaining things well enough. I shouldn't say
> > "corruption", I should say "malicious manipulation". The methodology
> > of attacks against the stack are quite different from the other kinds
> > of attacks like use-after-free, heap overflow, etc. Being able to
> > exhaust the kernel stack (either due to deep recursion or unbounded
> > alloca())
> 
> I really hope we don't have alloca() use in the kernel.  Do you have
> evidence to support that assertion?
> 
> IMHO alloca() (or similar) should not be present in any kernel code
> because we have a limited stack - we have kmalloc() etc for that kind
> of thing.

On stack variable length arrays get implemented by the compiler doing
alloca(), and we sadly have a few of those around.

But yes, fully agreed on the desirability of alloca() and things.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Peter Zijlstra
On Fri, May 12, 2017 at 09:21:06PM +0100, Russell King - ARM Linux wrote:
> On Fri, May 12, 2017 at 12:30:02PM -0700, Kees Cook wrote:
> > I'm clearly not explaining things well enough. I shouldn't say
> > "corruption", I should say "malicious manipulation". The methodology
> > of attacks against the stack are quite different from the other kinds
> > of attacks like use-after-free, heap overflow, etc. Being able to
> > exhaust the kernel stack (either due to deep recursion or unbounded
> > alloca())
> 
> I really hope we don't have alloca() use in the kernel.  Do you have
> evidence to support that assertion?
> 
> IMHO alloca() (or similar) should not be present in any kernel code
> because we have a limited stack - we have kmalloc() etc for that kind
> of thing.

On stack variable length arrays get implemented by the compiler doing
alloca(), and we sadly have a few of those around.

But yes, fully agreed on the desirability of alloca() and things.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Russell King - ARM Linux
On Fri, May 12, 2017 at 12:30:02PM -0700, Kees Cook wrote:
> I'm clearly not explaining things well enough. I shouldn't say
> "corruption", I should say "malicious manipulation". The methodology
> of attacks against the stack are quite different from the other kinds
> of attacks like use-after-free, heap overflow, etc. Being able to
> exhaust the kernel stack (either due to deep recursion or unbounded
> alloca())

I really hope we don't have alloca() use in the kernel.  Do you have
evidence to support that assertion?

IMHO alloca() (or similar) should not be present in any kernel code
because we have a limited stack - we have kmalloc() etc for that kind
of thing.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Russell King - ARM Linux
On Fri, May 12, 2017 at 12:30:02PM -0700, Kees Cook wrote:
> I'm clearly not explaining things well enough. I shouldn't say
> "corruption", I should say "malicious manipulation". The methodology
> of attacks against the stack are quite different from the other kinds
> of attacks like use-after-free, heap overflow, etc. Being able to
> exhaust the kernel stack (either due to deep recursion or unbounded
> alloca())

I really hope we don't have alloca() use in the kernel.  Do you have
evidence to support that assertion?

IMHO alloca() (or similar) should not be present in any kernel code
because we have a limited stack - we have kmalloc() etc for that kind
of thing.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Kees Cook
On Fri, May 12, 2017 at 12:08 PM, Linus Torvalds
 wrote:
> On Fri, May 12, 2017 at 12:01 PM, Kees Cook  wrote:
>> Yeah, the risk for "corrupted addr_limit" is mainly a concern for
>> archs with addr_limit on the kernel stack. If I'm reading things
>> correctly, that means, from the archs I've been paying closer
>> attention to, it's an issue for arm, mips, and powerpc:
>
> I don't understand why people are looking at addr_limit as some kind
> of special thing.
>
> If somebody is smashing the stack and corrupting thread info data, the
> game is over. addr_limit is the *least* of your problems, and it's not
> even all that likely that it will be increasing (it's much more likely
> that it would be overwritten with a smaller value).
>
> Quite frankly, this kind of idiotic discussion just makes me question
> the whole idea of the patch.
>
> Any "security" that is this specific is not real security, it's just
> masturbatory garbage.
>
> It may be worth checking that people use "set_fs()" properly. But stop
> this idiotic crap. It just makes the kernel security people look like
> the crazies.
>
> There are enough incompetent crazy security people, don't go there.
> The kinds of things it is worth protecting against are the big class
> of generic issues, not the kind of "oh, but imagine if a cosmic ray
> flips this particular word in memory" kind of crap that ignores all
> the other words of memory.
>
> Seriously, Kees. You are just making security people look bad. Stop it.

I'm clearly not explaining things well enough. I shouldn't say
"corruption", I should say "malicious manipulation". The methodology
of attacks against the stack are quite different from the other kinds
of attacks like use-after-free, heap overflow, etc. Being able to
exhaust the kernel stack (either due to deep recursion or unbounded
alloca()) means attackers can control a write to addr_limit, and then
leverage that into an actual arbitrary write via subsequent calls to
copy_to_user() pointed at kernel memory. This isn't theoretical, this
is how those attacks are performed. It may sound crazy, but it's real.

With thread_info off the stack, the whole problem goes away. It's
wonderful that this has happened for x86, arm64, and s390. There are
always going to be new methods of attack for everything, but while we
slowly address the design weakness (set_fs()), we can fix the low
hanging fruit too.

-Kees

-- 
Kees Cook
Pixel Security


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Kees Cook
On Fri, May 12, 2017 at 12:08 PM, Linus Torvalds
 wrote:
> On Fri, May 12, 2017 at 12:01 PM, Kees Cook  wrote:
>> Yeah, the risk for "corrupted addr_limit" is mainly a concern for
>> archs with addr_limit on the kernel stack. If I'm reading things
>> correctly, that means, from the archs I've been paying closer
>> attention to, it's an issue for arm, mips, and powerpc:
>
> I don't understand why people are looking at addr_limit as some kind
> of special thing.
>
> If somebody is smashing the stack and corrupting thread info data, the
> game is over. addr_limit is the *least* of your problems, and it's not
> even all that likely that it will be increasing (it's much more likely
> that it would be overwritten with a smaller value).
>
> Quite frankly, this kind of idiotic discussion just makes me question
> the whole idea of the patch.
>
> Any "security" that is this specific is not real security, it's just
> masturbatory garbage.
>
> It may be worth checking that people use "set_fs()" properly. But stop
> this idiotic crap. It just makes the kernel security people look like
> the crazies.
>
> There are enough incompetent crazy security people, don't go there.
> The kinds of things it is worth protecting against are the big class
> of generic issues, not the kind of "oh, but imagine if a cosmic ray
> flips this particular word in memory" kind of crap that ignores all
> the other words of memory.
>
> Seriously, Kees. You are just making security people look bad. Stop it.

I'm clearly not explaining things well enough. I shouldn't say
"corruption", I should say "malicious manipulation". The methodology
of attacks against the stack are quite different from the other kinds
of attacks like use-after-free, heap overflow, etc. Being able to
exhaust the kernel stack (either due to deep recursion or unbounded
alloca()) means attackers can control a write to addr_limit, and then
leverage that into an actual arbitrary write via subsequent calls to
copy_to_user() pointed at kernel memory. This isn't theoretical, this
is how those attacks are performed. It may sound crazy, but it's real.

With thread_info off the stack, the whole problem goes away. It's
wonderful that this has happened for x86, arm64, and s390. There are
always going to be new methods of attack for everything, but while we
slowly address the design weakness (set_fs()), we can fix the low
hanging fruit too.

-Kees

-- 
Kees Cook
Pixel Security


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Russell King - ARM Linux
On Fri, May 12, 2017 at 12:01:59PM -0700, Kees Cook wrote:
> Yeah, the risk for "corrupted addr_limit" is mainly a concern for
> archs with addr_limit on the kernel stack. If I'm reading things
> correctly, that means, from the archs I've been paying closer
> attention to, it's an issue for arm, mips, and powerpc:

I'd first want to uninline everything in uaccess.h first that makes
use of access_ok() - which I think is something that needs to happen
anyway.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Russell King - ARM Linux
On Fri, May 12, 2017 at 12:01:59PM -0700, Kees Cook wrote:
> Yeah, the risk for "corrupted addr_limit" is mainly a concern for
> archs with addr_limit on the kernel stack. If I'm reading things
> correctly, that means, from the archs I've been paying closer
> attention to, it's an issue for arm, mips, and powerpc:

I'd first want to uninline everything in uaccess.h first that makes
use of access_ok() - which I think is something that needs to happen
anyway.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Linus Torvalds
On Fri, May 12, 2017 at 12:01 PM, Kees Cook  wrote:
>
> Yeah, the risk for "corrupted addr_limit" is mainly a concern for
> archs with addr_limit on the kernel stack. If I'm reading things
> correctly, that means, from the archs I've been paying closer
> attention to, it's an issue for arm, mips, and powerpc:

I don't understand why people are looking at addr_limit as some kind
of special thing.

If somebody is smashing the stack and corrupting thread info data, the
game is over. addr_limit is the *least* of your problems, and it's not
even all that likely that it will be increasing (it's much more likely
that it would be overwritten with a smaller value).

Quite frankly, this kind of idiotic discussion just makes me question
the whole idea of the patch.

Any "security" that is this specific is not real security, it's just
masturbatory garbage.

It may be worth checking that people use "set_fs()" properly. But stop
this idiotic crap. It just makes the kernel security people look like
the crazies.

There are enough incompetent crazy security people, don't go there.
The kinds of things it is worth protecting against are the big class
of generic issues, not the kind of "oh, but imagine if a cosmic ray
flips this particular word in memory" kind of crap that ignores all
the other words of memory.

Seriously, Kees. You are just making security people look bad. Stop it.

   Linus


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Linus Torvalds
On Fri, May 12, 2017 at 12:01 PM, Kees Cook  wrote:
>
> Yeah, the risk for "corrupted addr_limit" is mainly a concern for
> archs with addr_limit on the kernel stack. If I'm reading things
> correctly, that means, from the archs I've been paying closer
> attention to, it's an issue for arm, mips, and powerpc:

I don't understand why people are looking at addr_limit as some kind
of special thing.

If somebody is smashing the stack and corrupting thread info data, the
game is over. addr_limit is the *least* of your problems, and it's not
even all that likely that it will be increasing (it's much more likely
that it would be overwritten with a smaller value).

Quite frankly, this kind of idiotic discussion just makes me question
the whole idea of the patch.

Any "security" that is this specific is not real security, it's just
masturbatory garbage.

It may be worth checking that people use "set_fs()" properly. But stop
this idiotic crap. It just makes the kernel security people look like
the crazies.

There are enough incompetent crazy security people, don't go there.
The kinds of things it is worth protecting against are the big class
of generic issues, not the kind of "oh, but imagine if a cosmic ray
flips this particular word in memory" kind of crap that ignores all
the other words of memory.

Seriously, Kees. You are just making security people look bad. Stop it.

   Linus


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Kees Cook
On Thu, May 11, 2017 at 10:54 PM, Martin Schwidefsky
 wrote:
> On Thu, 11 May 2017 22:34:31 -0700
> Kees Cook  wrote:
>
>> On Thu, May 11, 2017 at 10:28 PM, Martin Schwidefsky
>>  wrote:
>> > On Thu, 11 May 2017 16:44:07 -0700
>> > Linus Torvalds  wrote:
>> >
>> >> On Thu, May 11, 2017 at 4:17 PM, Thomas Garnier  
>> >> wrote:
>> >> >
>> >> > Ingo: Do you want the change as-is? Would you like it to be optional?
>> >> > What do you think?
>> >>
>> >> I'm not ingo, but I don't like that patch. It's in the wrong place -
>> >> that system call return code is too timing-critical to add address
>> >> limit checks.
>> >>
>> >> Now what I think you *could* do is:
>> >>
>> >>  - make "set_fs()" actually set a work flag in the current thread flags
>> >>
>> >>  - do the test in the slow-path (syscall_return_slowpath).
>> >>
>> >> Yes, yes, that ends up being architecture-specific, but it's fairly 
>> >> simple.
>> >>
>> >> And it only slows down the system calls that actually use "set_fs()".
>> >> Sure, it will slow those down a fair amount, but they are hopefully a
>> >> small subset of all cases.
>> >>
>> >> How does that sound to people?  Thats' where we currently do that
>> >>
>> >> if (IS_ENABLED(CONFIG_PROVE_LOCKING) &&
>> >> WARN(irqs_disabled(), "syscall %ld left IRQs disabled",
>> >> regs->orig_ax))
>> >> local_irq_enable();
>> >>
>> >> check too, which is a fairly similar issue.
>> >
>> > This is exactly what Heiko did for the s390 backend as a result of this
>> > discussion. See the _CIF_ASCE_SECONDARY bit in arch/s390/kernel/entry.S,
>> > for the hot patch the check for the bit is included in the general
>> > _CIF_WORK test. Only the slow patch gets a bit slower.
>> >
>> > git commit b5a882fcf146c87cb6b67c6df353e1c042b8773d
>> > "s390: restore address space when returning to user space".
>>
>> If I'm understanding this, it won't catch corruption of addr_limit
>> during fast-path syscalls, though (i.e. addr_limit changed without a
>> call to set_fs()). :( This addr_limit corruption is mostly only a risk
>> archs without THREAD_INFO_IN_TASK, but it would still be nice to catch
>> unbalanced set_fs() code, so I like the idea. I like getting rid of
>> addr_limit entirely even more, but that'll take some time. :)
>
> Well for s390 there is no addr_limit as we use two separate address space
> for kernel vs. user. The equivalent to the addr_limit corruption on a
> fast-path syscall would be changing CR7 outside of set_fs. This boils
> down to the question what we are protection against? Bad code with
> unbalanced set_fs or evil code that changes addr_limit/CR7 outside of
> set_fs

Yeah, the risk for "corrupted addr_limit" is mainly a concern for
archs with addr_limit on the kernel stack. If I'm reading things
correctly, that means, from the archs I've been paying closer
attention to, it's an issue for arm, mips, and powerpc:

arch/arm/include/asm/uaccess.h: current_thread_info()->addr_limit = fs;
arch/arm/include/asm/thread_info.h: (current_stack_pointer
& ~(THREAD_SIZE - 1));

arch/mips/include/asm/uaccess.h:#define set_fs(x)
(current_thread_info()->addr_limit = (x))
arch/mips/kernel/process.c:* task stacks at THREAD_SIZE - 32

arch/powerpc/include/asm/uaccess.h:#define set_fs(val)
(current->thread.fs = (val))
arch/powerpc/kernel/process.c:  struct pt_regs *regs =
task_stack_page(current) + THREAD_SIZE;

(s390 uses a register, x86 and arm64 implement THREAD_INFO_IN_TASK.)
Targeting addr_limit through arbitrary write attacks isn't too common
since ... it's an arbitrary write. The issue with addr_limit was that
it can live on the kernel stack, which meant all kinds of
stack-related bugs can lead to it getting stomped on.

So, two goals to protect addr_limit:

- get it off the stack to make the difficulty of corruption on par
with other sensitive things that would require an arbitrary write
flaw.

- detect/block unbalanced set_fs() calls.

If we can get the former addressed by the remaining architectures,
then that class of attack will go away. For the latter, it sounds like
Linus's slowpath-exit will work nicely.

To me it looks like he architectures with addr_limit still on the
stack would still benefit from always-check-addr_limit on syscall
exit, but that would be arch-specific anyway.

And then, of course, we've got the parallel task of just removing
set_fs() entirely. :)

-Kees

-- 
Kees Cook
Pixel Security


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Kees Cook
On Thu, May 11, 2017 at 10:54 PM, Martin Schwidefsky
 wrote:
> On Thu, 11 May 2017 22:34:31 -0700
> Kees Cook  wrote:
>
>> On Thu, May 11, 2017 at 10:28 PM, Martin Schwidefsky
>>  wrote:
>> > On Thu, 11 May 2017 16:44:07 -0700
>> > Linus Torvalds  wrote:
>> >
>> >> On Thu, May 11, 2017 at 4:17 PM, Thomas Garnier  
>> >> wrote:
>> >> >
>> >> > Ingo: Do you want the change as-is? Would you like it to be optional?
>> >> > What do you think?
>> >>
>> >> I'm not ingo, but I don't like that patch. It's in the wrong place -
>> >> that system call return code is too timing-critical to add address
>> >> limit checks.
>> >>
>> >> Now what I think you *could* do is:
>> >>
>> >>  - make "set_fs()" actually set a work flag in the current thread flags
>> >>
>> >>  - do the test in the slow-path (syscall_return_slowpath).
>> >>
>> >> Yes, yes, that ends up being architecture-specific, but it's fairly 
>> >> simple.
>> >>
>> >> And it only slows down the system calls that actually use "set_fs()".
>> >> Sure, it will slow those down a fair amount, but they are hopefully a
>> >> small subset of all cases.
>> >>
>> >> How does that sound to people?  Thats' where we currently do that
>> >>
>> >> if (IS_ENABLED(CONFIG_PROVE_LOCKING) &&
>> >> WARN(irqs_disabled(), "syscall %ld left IRQs disabled",
>> >> regs->orig_ax))
>> >> local_irq_enable();
>> >>
>> >> check too, which is a fairly similar issue.
>> >
>> > This is exactly what Heiko did for the s390 backend as a result of this
>> > discussion. See the _CIF_ASCE_SECONDARY bit in arch/s390/kernel/entry.S,
>> > for the hot patch the check for the bit is included in the general
>> > _CIF_WORK test. Only the slow patch gets a bit slower.
>> >
>> > git commit b5a882fcf146c87cb6b67c6df353e1c042b8773d
>> > "s390: restore address space when returning to user space".
>>
>> If I'm understanding this, it won't catch corruption of addr_limit
>> during fast-path syscalls, though (i.e. addr_limit changed without a
>> call to set_fs()). :( This addr_limit corruption is mostly only a risk
>> archs without THREAD_INFO_IN_TASK, but it would still be nice to catch
>> unbalanced set_fs() code, so I like the idea. I like getting rid of
>> addr_limit entirely even more, but that'll take some time. :)
>
> Well for s390 there is no addr_limit as we use two separate address space
> for kernel vs. user. The equivalent to the addr_limit corruption on a
> fast-path syscall would be changing CR7 outside of set_fs. This boils
> down to the question what we are protection against? Bad code with
> unbalanced set_fs or evil code that changes addr_limit/CR7 outside of
> set_fs

Yeah, the risk for "corrupted addr_limit" is mainly a concern for
archs with addr_limit on the kernel stack. If I'm reading things
correctly, that means, from the archs I've been paying closer
attention to, it's an issue for arm, mips, and powerpc:

arch/arm/include/asm/uaccess.h: current_thread_info()->addr_limit = fs;
arch/arm/include/asm/thread_info.h: (current_stack_pointer
& ~(THREAD_SIZE - 1));

arch/mips/include/asm/uaccess.h:#define set_fs(x)
(current_thread_info()->addr_limit = (x))
arch/mips/kernel/process.c:* task stacks at THREAD_SIZE - 32

arch/powerpc/include/asm/uaccess.h:#define set_fs(val)
(current->thread.fs = (val))
arch/powerpc/kernel/process.c:  struct pt_regs *regs =
task_stack_page(current) + THREAD_SIZE;

(s390 uses a register, x86 and arm64 implement THREAD_INFO_IN_TASK.)
Targeting addr_limit through arbitrary write attacks isn't too common
since ... it's an arbitrary write. The issue with addr_limit was that
it can live on the kernel stack, which meant all kinds of
stack-related bugs can lead to it getting stomped on.

So, two goals to protect addr_limit:

- get it off the stack to make the difficulty of corruption on par
with other sensitive things that would require an arbitrary write
flaw.

- detect/block unbalanced set_fs() calls.

If we can get the former addressed by the remaining architectures,
then that class of attack will go away. For the latter, it sounds like
Linus's slowpath-exit will work nicely.

To me it looks like he architectures with addr_limit still on the
stack would still benefit from always-check-addr_limit on syscall
exit, but that would be arch-specific anyway.

And then, of course, we've got the parallel task of just removing
set_fs() entirely. :)

-Kees

-- 
Kees Cook
Pixel Security


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Thomas Garnier
On Thu, May 11, 2017 at 11:58 PM, Ingo Molnar  wrote:
>
> * Linus Torvalds  wrote:
>
>> On Thu, May 11, 2017 at 4:17 PM, Thomas Garnier  wrote:
>> >
>> > Ingo: Do you want the change as-is? Would you like it to be optional?
>> > What do you think?
>>
>> I'm not ingo, but I don't like that patch. It's in the wrong place -
>> that system call return code is too timing-critical to add address
>> limit checks.
>>
>> Now what I think you *could* do is:
>>
>>  - make "set_fs()" actually set a work flag in the current thread flags
>>
>>  - do the test in the slow-path (syscall_return_slowpath).
>>
>> Yes, yes, that ends up being architecture-specific, but it's fairly simple.
>>
>> And it only slows down the system calls that actually use "set_fs()".
>> Sure, it will slow those down a fair amount, but they are hopefully a
>> small subset of all cases.
>>
>> How does that sound to people?  Thats' where we currently do that
>>
>> if (IS_ENABLED(CONFIG_PROVE_LOCKING) &&
>> WARN(irqs_disabled(), "syscall %ld left IRQs disabled",
>> regs->orig_ax))
>> local_irq_enable();
>>
>> check too, which is a fairly similar issue.
>
> I really like that idea and I'd be perfectly fine with that solution, because 
> it
> puts the overhead where the problem comes from, and adds an extra incentive 
> for
> code to move away from set_fs() facilities. Win-win.

Great, I will adapt the patch for that.

>
> Thanks,
>
> Ingo



-- 
Thomas


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Thomas Garnier
On Thu, May 11, 2017 at 11:58 PM, Ingo Molnar  wrote:
>
> * Linus Torvalds  wrote:
>
>> On Thu, May 11, 2017 at 4:17 PM, Thomas Garnier  wrote:
>> >
>> > Ingo: Do you want the change as-is? Would you like it to be optional?
>> > What do you think?
>>
>> I'm not ingo, but I don't like that patch. It's in the wrong place -
>> that system call return code is too timing-critical to add address
>> limit checks.
>>
>> Now what I think you *could* do is:
>>
>>  - make "set_fs()" actually set a work flag in the current thread flags
>>
>>  - do the test in the slow-path (syscall_return_slowpath).
>>
>> Yes, yes, that ends up being architecture-specific, but it's fairly simple.
>>
>> And it only slows down the system calls that actually use "set_fs()".
>> Sure, it will slow those down a fair amount, but they are hopefully a
>> small subset of all cases.
>>
>> How does that sound to people?  Thats' where we currently do that
>>
>> if (IS_ENABLED(CONFIG_PROVE_LOCKING) &&
>> WARN(irqs_disabled(), "syscall %ld left IRQs disabled",
>> regs->orig_ax))
>> local_irq_enable();
>>
>> check too, which is a fairly similar issue.
>
> I really like that idea and I'd be perfectly fine with that solution, because 
> it
> puts the overhead where the problem comes from, and adds an extra incentive 
> for
> code to move away from set_fs() facilities. Win-win.

Great, I will adapt the patch for that.

>
> Thanks,
>
> Ingo



-- 
Thomas


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Arnd Bergmann
On Fri, May 12, 2017 at 10:11 AM, Al Viro  wrote:
> Anyway, what's special about modules?  IDGI...

One of the arguments that came up earlier was code in external modules
being mostly unaudited, sometimes without any source code available
at all but still used in devices.

If modules can't do set_fs() any more, this could eliminate bugs with
unpaired set_fs in those modules.

Limiting factors of course are:

- embedded systems that ship come with their own kernels (as opposed
  to using whatever users have, or relying on binary distros) can just
  make it available to modules again, by reverting the patch

- As Christoph said, they could have an open-coded set_fs in the
  driver

- Whatever other method a clueless driver write might come up with
  isn't necessarily better than set_fs().

 Arnd


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Arnd Bergmann
On Fri, May 12, 2017 at 10:11 AM, Al Viro  wrote:
> Anyway, what's special about modules?  IDGI...

One of the arguments that came up earlier was code in external modules
being mostly unaudited, sometimes without any source code available
at all but still used in devices.

If modules can't do set_fs() any more, this could eliminate bugs with
unpaired set_fs in those modules.

Limiting factors of course are:

- embedded systems that ship come with their own kernels (as opposed
  to using whatever users have, or relying on binary distros) can just
  make it available to modules again, by reverting the patch

- As Christoph said, they could have an open-coded set_fs in the
  driver

- Whatever other method a clueless driver write might come up with
  isn't necessarily better than set_fs().

 Arnd


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Al Viro
On Fri, May 12, 2017 at 01:11:26AM -0700, Christoph Hellwig wrote:

> But it won't help against exploits modifying addr_limit manually.

Or the ones setting current->cred to that of init.  Your point being?


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Al Viro
On Fri, May 12, 2017 at 01:11:26AM -0700, Christoph Hellwig wrote:

> But it won't help against exploits modifying addr_limit manually.

Or the ones setting current->cred to that of init.  Your point being?


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Al Viro
On Fri, May 12, 2017 at 09:43:40AM +0200, Arnd Bergmann wrote:

> How realistic and how useful would it be to first completely eliminate
> the ones that are in loadable modules and then wrapping the definition
> in #ifndef MODULE (or even make it an extern function)?

Eliminate _what_?  ->read() and ->write() instances?

> This should be a fairly complete list of the modular users:
> 
> drivers/block/drbd/drbd_main.c: set_fs(KERNEL_DS);

Ah, set_fs()...  Sure, many of those can be killed off.  Wouldn't be
a bad idea, but I don't understand what difference does modular/built-in
make here...

This one: AFAICS doesn't give a damn about set_fs() at all.

> drivers/input/serio/hp_sdc.c:   set_fs(KERNEL_DS);

Open-coded probe_kernel_read(), apparently.

> drivers/media/v4l2-core/v4l2-compat-ioctl32.c:  set_fs(KERNEL_DS);

massive compat ioctl crap.

> drivers/misc/lkdtm_bugs.c:  set_fs(KERNEL_DS);

insane.

> drivers/s390/crypto/pkey_api.c: set_fs(KERNEL_DS);

No idea.

> drivers/staging/comedi/drivers/serial2002.c:set_fs(KERNEL_DS);

Open-coded kernel_write(); to some character device, no less...  And similar
for kernel_read(), apparently.

> drivers/staging/lustre/lnet/libcfs/tracefile.c: set_fs(get_ds());

Fuck knows; kernel_write() might do it.  Depends upon what it's writing
to.

You've missed other places in lustre, BTW - including the ioctls on
sockets, etc.

> drivers/staging/media/atomisp/pci/atomisp2/atomisp_compat_ioctl32.c:
>  set_fs(KERNEL_DS);

Compat ioctl crap, again.

> drivers/staging/rtl8723bs/os_dep/osdep_service.c:   oldfs
> = get_fs(); set_fs(get_ds());

Oh, lovely - reading an arbitrary (as in, specified by pathname) file.
Firmware (mis)handling?

> drivers/usb/gadget/function/f_mass_storage.c:   set_fs(get_ds());

No idea.

> drivers/usb/gadget/function/u_uac1.c:   set_fs(KERNEL_DS);

kernel_write(), by the look of it.  Or something similar.

> drivers/vhost/vhost.c:  set_fs(USER_DS);

kernel thread doing use_mm()

> drivers/video/fbdev/core/fbmem.c:   set_fs(KERNEL_DS);

compat ioctl.

> drivers/video/fbdev/hpfb.c: set_fs(KERNEL_DS);

probe_kernel_read()

> fs/autofs4/waitq.c: set_fs(KERNEL_DS);

kernel_write()

> fs/binfmt_aout.c:   set_fs(KERNEL_DS);
> fs/binfmt_elf.c:set_fs(USER_DS);
> fs/binfmt_elf_fdpic.c:  set_fs(KERNEL_DS);

coredump stuff.

> fs/btrfs/send.c:set_fs(KERNEL_DS);
kernel_write()

Anyway, what's special about modules?  IDGI...


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Al Viro
On Fri, May 12, 2017 at 09:43:40AM +0200, Arnd Bergmann wrote:

> How realistic and how useful would it be to first completely eliminate
> the ones that are in loadable modules and then wrapping the definition
> in #ifndef MODULE (or even make it an extern function)?

Eliminate _what_?  ->read() and ->write() instances?

> This should be a fairly complete list of the modular users:
> 
> drivers/block/drbd/drbd_main.c: set_fs(KERNEL_DS);

Ah, set_fs()...  Sure, many of those can be killed off.  Wouldn't be
a bad idea, but I don't understand what difference does modular/built-in
make here...

This one: AFAICS doesn't give a damn about set_fs() at all.

> drivers/input/serio/hp_sdc.c:   set_fs(KERNEL_DS);

Open-coded probe_kernel_read(), apparently.

> drivers/media/v4l2-core/v4l2-compat-ioctl32.c:  set_fs(KERNEL_DS);

massive compat ioctl crap.

> drivers/misc/lkdtm_bugs.c:  set_fs(KERNEL_DS);

insane.

> drivers/s390/crypto/pkey_api.c: set_fs(KERNEL_DS);

No idea.

> drivers/staging/comedi/drivers/serial2002.c:set_fs(KERNEL_DS);

Open-coded kernel_write(); to some character device, no less...  And similar
for kernel_read(), apparently.

> drivers/staging/lustre/lnet/libcfs/tracefile.c: set_fs(get_ds());

Fuck knows; kernel_write() might do it.  Depends upon what it's writing
to.

You've missed other places in lustre, BTW - including the ioctls on
sockets, etc.

> drivers/staging/media/atomisp/pci/atomisp2/atomisp_compat_ioctl32.c:
>  set_fs(KERNEL_DS);

Compat ioctl crap, again.

> drivers/staging/rtl8723bs/os_dep/osdep_service.c:   oldfs
> = get_fs(); set_fs(get_ds());

Oh, lovely - reading an arbitrary (as in, specified by pathname) file.
Firmware (mis)handling?

> drivers/usb/gadget/function/f_mass_storage.c:   set_fs(get_ds());

No idea.

> drivers/usb/gadget/function/u_uac1.c:   set_fs(KERNEL_DS);

kernel_write(), by the look of it.  Or something similar.

> drivers/vhost/vhost.c:  set_fs(USER_DS);

kernel thread doing use_mm()

> drivers/video/fbdev/core/fbmem.c:   set_fs(KERNEL_DS);

compat ioctl.

> drivers/video/fbdev/hpfb.c: set_fs(KERNEL_DS);

probe_kernel_read()

> fs/autofs4/waitq.c: set_fs(KERNEL_DS);

kernel_write()

> fs/binfmt_aout.c:   set_fs(KERNEL_DS);
> fs/binfmt_elf.c:set_fs(USER_DS);
> fs/binfmt_elf_fdpic.c:  set_fs(KERNEL_DS);

coredump stuff.

> fs/btrfs/send.c:set_fs(KERNEL_DS);
kernel_write()

Anyway, what's special about modules?  IDGI...


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Christoph Hellwig
On Fri, May 12, 2017 at 09:43:40AM +0200, Arnd Bergmann wrote:
> How realistic and how useful would it be to first completely eliminate
> the ones that are in loadable modules and then wrapping the definition
> in #ifndef MODULE (or even make it an extern function)?

Should be fairly doable and might be a nice step towards cleaning the
mess up.  In fact with my seres a large part of those are gone, and
most of the remaining handler are ioctl handlers or what seems like
opencoded versions of probe_kernel_read.

But it won't help against exploits modifying addr_limit manually.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Christoph Hellwig
On Fri, May 12, 2017 at 09:43:40AM +0200, Arnd Bergmann wrote:
> How realistic and how useful would it be to first completely eliminate
> the ones that are in loadable modules and then wrapping the definition
> in #ifndef MODULE (or even make it an extern function)?

Should be fairly doable and might be a nice step towards cleaning the
mess up.  In fact with my seres a large part of those are gone, and
most of the remaining handler are ioctl handlers or what seems like
opencoded versions of probe_kernel_read.

But it won't help against exploits modifying addr_limit manually.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Arnd Bergmann
On Fri, May 12, 2017 at 9:15 AM, Al Viro  wrote:
> On Fri, May 12, 2017 at 09:00:12AM +0200, Ingo Molnar wrote:
>
>> > How about trying to remove all of them?  If we could actually get rid
>> > of all of them, we could drop the arch support, and we'd get faster,
>> > simpler, shorter uaccess code throughout the kernel.
>>
>> I'm all for that!
>
> Oh, for...  Ingo, do you really want to go through all ->write() and ->read()
> instances, converting all of them to iov_iter?  Or, better yet, deal with
> the patch flood from Nick Krause sock puppet brigade?

How realistic and how useful would it be to first completely eliminate
the ones that are in loadable modules and then wrapping the definition
in #ifndef MODULE (or even make it an extern function)?

This should be a fairly complete list of the modular users:

drivers/block/drbd/drbd_main.c: set_fs(KERNEL_DS);
drivers/input/serio/hp_sdc.c:   set_fs(KERNEL_DS);
drivers/media/v4l2-core/v4l2-compat-ioctl32.c:  set_fs(KERNEL_DS);
drivers/misc/lkdtm_bugs.c:  set_fs(KERNEL_DS);
drivers/s390/crypto/pkey_api.c: set_fs(KERNEL_DS);
drivers/staging/comedi/drivers/serial2002.c:set_fs(KERNEL_DS);
drivers/staging/lustre/lnet/libcfs/tracefile.c: set_fs(get_ds());
drivers/staging/media/atomisp/pci/atomisp2/atomisp_compat_ioctl32.c:
 set_fs(KERNEL_DS);
drivers/staging/rtl8723bs/os_dep/osdep_service.c:   oldfs
= get_fs(); set_fs(get_ds());
drivers/usb/gadget/function/f_mass_storage.c:   set_fs(get_ds());
drivers/usb/gadget/function/u_uac1.c:   set_fs(KERNEL_DS);
drivers/vhost/vhost.c:  set_fs(USER_DS);
drivers/video/fbdev/core/fbmem.c:   set_fs(KERNEL_DS);
drivers/video/fbdev/hpfb.c: set_fs(KERNEL_DS);
fs/autofs4/waitq.c: set_fs(KERNEL_DS);
fs/binfmt_aout.c:   set_fs(KERNEL_DS);
fs/binfmt_elf.c:set_fs(USER_DS);
fs/binfmt_elf_fdpic.c:  set_fs(KERNEL_DS);
fs/btrfs/send.c:set_fs(KERNEL_DS);
fs/ext4/ioctl.c:set_fs(KERNEL_DS);
fs/nfsd/vfs.c:  set_fs(KERNEL_DS);
net/9p/trans_fd.c:  set_fs(get_ds());
net/ipv6/addrconf.c:set_fs(KERNEL_DS);
net/ipv6/exthdrs.c: set_fs(KERNEL_DS);
net/sunrpc/svcsock.c:   oldfs = get_fs(); set_fs(KERNEL_DS);
sound/core/oss/pcm_oss.c:   set_fs(get_ds());
sound/core/pcm_native.c:set_fs(get_ds());
sound/drivers/opl3/opl3_oss.c:  set_fs(get_ds());
sound/oss/dmabuf.c: set_fs(get_ds());
sound/oss/swarm_cs4297a.c:set_fs(KERNEL_DS);
sound/pci/emu10k1/emufx.c:  set_fs(get_ds());
sound/pci/hda/hda_codec.c:  set_fs(get_ds());

 Arnd


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Arnd Bergmann
On Fri, May 12, 2017 at 9:15 AM, Al Viro  wrote:
> On Fri, May 12, 2017 at 09:00:12AM +0200, Ingo Molnar wrote:
>
>> > How about trying to remove all of them?  If we could actually get rid
>> > of all of them, we could drop the arch support, and we'd get faster,
>> > simpler, shorter uaccess code throughout the kernel.
>>
>> I'm all for that!
>
> Oh, for...  Ingo, do you really want to go through all ->write() and ->read()
> instances, converting all of them to iov_iter?  Or, better yet, deal with
> the patch flood from Nick Krause sock puppet brigade?

How realistic and how useful would it be to first completely eliminate
the ones that are in loadable modules and then wrapping the definition
in #ifndef MODULE (or even make it an extern function)?

This should be a fairly complete list of the modular users:

drivers/block/drbd/drbd_main.c: set_fs(KERNEL_DS);
drivers/input/serio/hp_sdc.c:   set_fs(KERNEL_DS);
drivers/media/v4l2-core/v4l2-compat-ioctl32.c:  set_fs(KERNEL_DS);
drivers/misc/lkdtm_bugs.c:  set_fs(KERNEL_DS);
drivers/s390/crypto/pkey_api.c: set_fs(KERNEL_DS);
drivers/staging/comedi/drivers/serial2002.c:set_fs(KERNEL_DS);
drivers/staging/lustre/lnet/libcfs/tracefile.c: set_fs(get_ds());
drivers/staging/media/atomisp/pci/atomisp2/atomisp_compat_ioctl32.c:
 set_fs(KERNEL_DS);
drivers/staging/rtl8723bs/os_dep/osdep_service.c:   oldfs
= get_fs(); set_fs(get_ds());
drivers/usb/gadget/function/f_mass_storage.c:   set_fs(get_ds());
drivers/usb/gadget/function/u_uac1.c:   set_fs(KERNEL_DS);
drivers/vhost/vhost.c:  set_fs(USER_DS);
drivers/video/fbdev/core/fbmem.c:   set_fs(KERNEL_DS);
drivers/video/fbdev/hpfb.c: set_fs(KERNEL_DS);
fs/autofs4/waitq.c: set_fs(KERNEL_DS);
fs/binfmt_aout.c:   set_fs(KERNEL_DS);
fs/binfmt_elf.c:set_fs(USER_DS);
fs/binfmt_elf_fdpic.c:  set_fs(KERNEL_DS);
fs/btrfs/send.c:set_fs(KERNEL_DS);
fs/ext4/ioctl.c:set_fs(KERNEL_DS);
fs/nfsd/vfs.c:  set_fs(KERNEL_DS);
net/9p/trans_fd.c:  set_fs(get_ds());
net/ipv6/addrconf.c:set_fs(KERNEL_DS);
net/ipv6/exthdrs.c: set_fs(KERNEL_DS);
net/sunrpc/svcsock.c:   oldfs = get_fs(); set_fs(KERNEL_DS);
sound/core/oss/pcm_oss.c:   set_fs(get_ds());
sound/core/pcm_native.c:set_fs(get_ds());
sound/drivers/opl3/opl3_oss.c:  set_fs(get_ds());
sound/oss/dmabuf.c: set_fs(get_ds());
sound/oss/swarm_cs4297a.c:set_fs(KERNEL_DS);
sound/pci/emu10k1/emufx.c:  set_fs(get_ds());
sound/pci/hda/hda_codec.c:  set_fs(get_ds());

 Arnd


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Christoph Hellwig
On Fri, May 12, 2017 at 08:15:49AM +0100, Al Viro wrote:
> And converting everything to ->read_iter()/->write_iter() means an insane
> amount of code churn, not to mention coping with random bogosities in
> semantics.  ->read() and ->write() are going to stay around, pretty
> much indefinitely.

But I don't think kernel users of them have to.  I've been digging
through the calllers and will send an analysis to the list in a bit.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Christoph Hellwig
On Fri, May 12, 2017 at 08:15:49AM +0100, Al Viro wrote:
> And converting everything to ->read_iter()/->write_iter() means an insane
> amount of code churn, not to mention coping with random bogosities in
> semantics.  ->read() and ->write() are going to stay around, pretty
> much indefinitely.

But I don't think kernel users of them have to.  I've been digging
through the calllers and will send an analysis to the list in a bit.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Al Viro
On Fri, May 12, 2017 at 09:00:12AM +0200, Ingo Molnar wrote:

> > How about trying to remove all of them?  If we could actually get rid
> > of all of them, we could drop the arch support, and we'd get faster,
> > simpler, shorter uaccess code throughout the kernel.
> 
> I'm all for that!

Oh, for...  Ingo, do you really want to go through all ->write() and ->read()
instances, converting all of them to iov_iter?  Or, better yet, deal with
the patch flood from Nick Krause sock puppet brigade?

Folks, seriously, have you even looked through that zoo?  I have, and it's
really, really not fun.  Sure, we can say "fuck 'em, no need to allow
splice() on random crap".  Would be perfectly reasonable, expect that
it's not the only place doing kernel_write() and its ilk...

And converting everything to ->read_iter()/->write_iter() means an insane
amount of code churn, not to mention coping with random bogosities in
semantics.  ->read() and ->write() are going to stay around, pretty
much indefinitely.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Al Viro
On Fri, May 12, 2017 at 09:00:12AM +0200, Ingo Molnar wrote:

> > How about trying to remove all of them?  If we could actually get rid
> > of all of them, we could drop the arch support, and we'd get faster,
> > simpler, shorter uaccess code throughout the kernel.
> 
> I'm all for that!

Oh, for...  Ingo, do you really want to go through all ->write() and ->read()
instances, converting all of them to iov_iter?  Or, better yet, deal with
the patch flood from Nick Krause sock puppet brigade?

Folks, seriously, have you even looked through that zoo?  I have, and it's
really, really not fun.  Sure, we can say "fuck 'em, no need to allow
splice() on random crap".  Would be perfectly reasonable, expect that
it's not the only place doing kernel_write() and its ilk...

And converting everything to ->read_iter()/->write_iter() means an insane
amount of code churn, not to mention coping with random bogosities in
semantics.  ->read() and ->write() are going to stay around, pretty
much indefinitely.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Ingo Molnar

* Andy Lutomirski  wrote:

> On Tue, May 9, 2017 at 1:56 AM, Christoph Hellwig  wrote:
> > On Tue, May 09, 2017 at 08:45:22AM +0200, Ingo Molnar wrote:
> >> We only have ~115 code blocks in the kernel that set/restore KERNEL_DS, it 
> >> would
> >> be a pity to add a runtime check to every system call ...
> >
> > I think we should simply strive to remove all of them that aren't
> > in core scheduler / arch code.  Basically evetyytime we do the
> >
> > oldfs = get_fs();
> > set_fs(KERNEL_DS);
> > ..
> > set_fs(oldfs);
> >
> > trick we're doing something wrong, and there should always be better
> > ways to archive it.  E.g. using iov_iter with a ITER_KVEC type
> > consistently would already remove most of them.
> 
> How about trying to remove all of them?  If we could actually get rid
> of all of them, we could drop the arch support, and we'd get faster,
> simpler, shorter uaccess code throughout the kernel.

I'm all for that!

Thanks,

Ingo


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Ingo Molnar

* Andy Lutomirski  wrote:

> On Tue, May 9, 2017 at 1:56 AM, Christoph Hellwig  wrote:
> > On Tue, May 09, 2017 at 08:45:22AM +0200, Ingo Molnar wrote:
> >> We only have ~115 code blocks in the kernel that set/restore KERNEL_DS, it 
> >> would
> >> be a pity to add a runtime check to every system call ...
> >
> > I think we should simply strive to remove all of them that aren't
> > in core scheduler / arch code.  Basically evetyytime we do the
> >
> > oldfs = get_fs();
> > set_fs(KERNEL_DS);
> > ..
> > set_fs(oldfs);
> >
> > trick we're doing something wrong, and there should always be better
> > ways to archive it.  E.g. using iov_iter with a ITER_KVEC type
> > consistently would already remove most of them.
> 
> How about trying to remove all of them?  If we could actually get rid
> of all of them, we could drop the arch support, and we'd get faster,
> simpler, shorter uaccess code throughout the kernel.

I'm all for that!

Thanks,

Ingo


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Ingo Molnar

* Linus Torvalds  wrote:

> On Thu, May 11, 2017 at 4:17 PM, Thomas Garnier  wrote:
> >
> > Ingo: Do you want the change as-is? Would you like it to be optional?
> > What do you think?
> 
> I'm not ingo, but I don't like that patch. It's in the wrong place -
> that system call return code is too timing-critical to add address
> limit checks.
> 
> Now what I think you *could* do is:
> 
>  - make "set_fs()" actually set a work flag in the current thread flags
> 
>  - do the test in the slow-path (syscall_return_slowpath).
> 
> Yes, yes, that ends up being architecture-specific, but it's fairly simple.
> 
> And it only slows down the system calls that actually use "set_fs()".
> Sure, it will slow those down a fair amount, but they are hopefully a
> small subset of all cases.
> 
> How does that sound to people?  Thats' where we currently do that
> 
> if (IS_ENABLED(CONFIG_PROVE_LOCKING) &&
> WARN(irqs_disabled(), "syscall %ld left IRQs disabled",
> regs->orig_ax))
> local_irq_enable();
> 
> check too, which is a fairly similar issue.

I really like that idea and I'd be perfectly fine with that solution, because 
it 
puts the overhead where the problem comes from, and adds an extra incentive for 
code to move away from set_fs() facilities. Win-win.

Thanks,

Ingo


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Ingo Molnar

* Linus Torvalds  wrote:

> On Thu, May 11, 2017 at 4:17 PM, Thomas Garnier  wrote:
> >
> > Ingo: Do you want the change as-is? Would you like it to be optional?
> > What do you think?
> 
> I'm not ingo, but I don't like that patch. It's in the wrong place -
> that system call return code is too timing-critical to add address
> limit checks.
> 
> Now what I think you *could* do is:
> 
>  - make "set_fs()" actually set a work flag in the current thread flags
> 
>  - do the test in the slow-path (syscall_return_slowpath).
> 
> Yes, yes, that ends up being architecture-specific, but it's fairly simple.
> 
> And it only slows down the system calls that actually use "set_fs()".
> Sure, it will slow those down a fair amount, but they are hopefully a
> small subset of all cases.
> 
> How does that sound to people?  Thats' where we currently do that
> 
> if (IS_ENABLED(CONFIG_PROVE_LOCKING) &&
> WARN(irqs_disabled(), "syscall %ld left IRQs disabled",
> regs->orig_ax))
> local_irq_enable();
> 
> check too, which is a fairly similar issue.

I really like that idea and I'd be perfectly fine with that solution, because 
it 
puts the overhead where the problem comes from, and adds an extra incentive for 
code to move away from set_fs() facilities. Win-win.

Thanks,

Ingo


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Ingo Molnar

* Kees Cook  wrote:

> > git commit b5a882fcf146c87cb6b67c6df353e1c042b8773d
> > "s390: restore address space when returning to user space".
> 
> If I'm understanding this, it won't catch corruption of addr_limit
> during fast-path syscalls, though (i.e. addr_limit changed without a
> call to set_fs()). :(

Nor does it, or the patch you propose, protect against against something 
corrupting task->mm pointer, or the task->*uid values, or any of the myriads of 
security relevant values stored in the task structure!

Making sure API (set_fs()) usage is bug-free and protecting against the effects 
of 
general data corruption are two unrelated things that should not mixed.

Thanks,

Ingo


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Ingo Molnar

* Kees Cook  wrote:

> > git commit b5a882fcf146c87cb6b67c6df353e1c042b8773d
> > "s390: restore address space when returning to user space".
> 
> If I'm understanding this, it won't catch corruption of addr_limit
> during fast-path syscalls, though (i.e. addr_limit changed without a
> call to set_fs()). :(

Nor does it, or the patch you propose, protect against against something 
corrupting task->mm pointer, or the task->*uid values, or any of the myriads of 
security relevant values stored in the task structure!

Making sure API (set_fs()) usage is bug-free and protecting against the effects 
of 
general data corruption are two unrelated things that should not mixed.

Thanks,

Ingo


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Andy Lutomirski
[resending because kernel.org seems to have mangled my SMTP
credentials.  I wonder if this is a common problem.]

On Thu, May 11, 2017 at 4:44 PM, Linus Torvalds
 wrote:
> On Thu, May 11, 2017 at 4:17 PM, Thomas Garnier  wrote:
>>
>> Ingo: Do you want the change as-is? Would you like it to be optional?
>> What do you think?
>
> I'm not ingo, but I don't like that patch. It's in the wrong place -
> that system call return code is too timing-critical to add address
> limit checks.
>
> Now what I think you *could* do is:
>
>  - make "set_fs()" actually set a work flag in the current thread flags
>
>  - do the test in the slow-path (syscall_return_slowpath).
>
> Yes, yes, that ends up being architecture-specific, but it's fairly simple.
>
> And it only slows down the system calls that actually use "set_fs()".
> Sure, it will slow those down a fair amount, but they are hopefully a
> small subset of all cases.
>
> How does that sound to people?  Thats' where we currently do that
>
> if (IS_ENABLED(CONFIG_PROVE_LOCKING) &&
> WARN(irqs_disabled(), "syscall %ld left IRQs disabled",
> regs->orig_ax))
> local_irq_enable();
>
> check too, which is a fairly similar issue.
>

I like this.  It wouldn't help the problem that I suspect is a major
part of the motivation for this patch: a stack overflow could
overwrite addr_limit.  But we fixed that for real already.

Slightly off-topic: I would *love* to see syscall_return_slowpath() or
similar moved or at least mostly moved into generic code.  Aside from
the fact that it used to be written in asm, there's nothing
fundamentally arch-specific about it.

>
> And it only slows down the system calls that actually use "set_fs()".
> Sure, it will slow those down a fair amount, but they are hopefully a
> small subset of all cases.

It won't even slow them down that much.  The slow path is reasonably
fast these days.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-12 Thread Andy Lutomirski
[resending because kernel.org seems to have mangled my SMTP
credentials.  I wonder if this is a common problem.]

On Thu, May 11, 2017 at 4:44 PM, Linus Torvalds
 wrote:
> On Thu, May 11, 2017 at 4:17 PM, Thomas Garnier  wrote:
>>
>> Ingo: Do you want the change as-is? Would you like it to be optional?
>> What do you think?
>
> I'm not ingo, but I don't like that patch. It's in the wrong place -
> that system call return code is too timing-critical to add address
> limit checks.
>
> Now what I think you *could* do is:
>
>  - make "set_fs()" actually set a work flag in the current thread flags
>
>  - do the test in the slow-path (syscall_return_slowpath).
>
> Yes, yes, that ends up being architecture-specific, but it's fairly simple.
>
> And it only slows down the system calls that actually use "set_fs()".
> Sure, it will slow those down a fair amount, but they are hopefully a
> small subset of all cases.
>
> How does that sound to people?  Thats' where we currently do that
>
> if (IS_ENABLED(CONFIG_PROVE_LOCKING) &&
> WARN(irqs_disabled(), "syscall %ld left IRQs disabled",
> regs->orig_ax))
> local_irq_enable();
>
> check too, which is a fairly similar issue.
>

I like this.  It wouldn't help the problem that I suspect is a major
part of the motivation for this patch: a stack overflow could
overwrite addr_limit.  But we fixed that for real already.

Slightly off-topic: I would *love* to see syscall_return_slowpath() or
similar moved or at least mostly moved into generic code.  Aside from
the fact that it used to be written in asm, there's nothing
fundamentally arch-specific about it.

>
> And it only slows down the system calls that actually use "set_fs()".
> Sure, it will slow those down a fair amount, but they are hopefully a
> small subset of all cases.

It won't even slow them down that much.  The slow path is reasonably
fast these days.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-11 Thread Martin Schwidefsky
On Thu, 11 May 2017 22:34:31 -0700
Kees Cook  wrote:

> On Thu, May 11, 2017 at 10:28 PM, Martin Schwidefsky
>  wrote:
> > On Thu, 11 May 2017 16:44:07 -0700
> > Linus Torvalds  wrote:
> >  
> >> On Thu, May 11, 2017 at 4:17 PM, Thomas Garnier  
> >> wrote:  
> >> >
> >> > Ingo: Do you want the change as-is? Would you like it to be optional?
> >> > What do you think?  
> >>
> >> I'm not ingo, but I don't like that patch. It's in the wrong place -
> >> that system call return code is too timing-critical to add address
> >> limit checks.
> >>
> >> Now what I think you *could* do is:
> >>
> >>  - make "set_fs()" actually set a work flag in the current thread flags
> >>
> >>  - do the test in the slow-path (syscall_return_slowpath).
> >>
> >> Yes, yes, that ends up being architecture-specific, but it's fairly simple.
> >>
> >> And it only slows down the system calls that actually use "set_fs()".
> >> Sure, it will slow those down a fair amount, but they are hopefully a
> >> small subset of all cases.
> >>
> >> How does that sound to people?  Thats' where we currently do that
> >>
> >> if (IS_ENABLED(CONFIG_PROVE_LOCKING) &&
> >> WARN(irqs_disabled(), "syscall %ld left IRQs disabled",
> >> regs->orig_ax))
> >> local_irq_enable();
> >>
> >> check too, which is a fairly similar issue.  
> >
> > This is exactly what Heiko did for the s390 backend as a result of this
> > discussion. See the _CIF_ASCE_SECONDARY bit in arch/s390/kernel/entry.S,
> > for the hot patch the check for the bit is included in the general
> > _CIF_WORK test. Only the slow patch gets a bit slower.
> >
> > git commit b5a882fcf146c87cb6b67c6df353e1c042b8773d
> > "s390: restore address space when returning to user space".  
> 
> If I'm understanding this, it won't catch corruption of addr_limit
> during fast-path syscalls, though (i.e. addr_limit changed without a
> call to set_fs()). :( This addr_limit corruption is mostly only a risk
> archs without THREAD_INFO_IN_TASK, but it would still be nice to catch
> unbalanced set_fs() code, so I like the idea. I like getting rid of
> addr_limit entirely even more, but that'll take some time. :)

Well for s390 there is no addr_limit as we use two separate address space
for kernel vs. user. The equivalent to the addr_limit corruption on a
fast-path syscall would be changing CR7 outside of set_fs. This boils
down to the question what we are protection against? Bad code with 
unbalanced set_fs or evil code that changes addr_limit/CR7 outside of
set_fs

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-11 Thread Martin Schwidefsky
On Thu, 11 May 2017 22:34:31 -0700
Kees Cook  wrote:

> On Thu, May 11, 2017 at 10:28 PM, Martin Schwidefsky
>  wrote:
> > On Thu, 11 May 2017 16:44:07 -0700
> > Linus Torvalds  wrote:
> >  
> >> On Thu, May 11, 2017 at 4:17 PM, Thomas Garnier  
> >> wrote:  
> >> >
> >> > Ingo: Do you want the change as-is? Would you like it to be optional?
> >> > What do you think?  
> >>
> >> I'm not ingo, but I don't like that patch. It's in the wrong place -
> >> that system call return code is too timing-critical to add address
> >> limit checks.
> >>
> >> Now what I think you *could* do is:
> >>
> >>  - make "set_fs()" actually set a work flag in the current thread flags
> >>
> >>  - do the test in the slow-path (syscall_return_slowpath).
> >>
> >> Yes, yes, that ends up being architecture-specific, but it's fairly simple.
> >>
> >> And it only slows down the system calls that actually use "set_fs()".
> >> Sure, it will slow those down a fair amount, but they are hopefully a
> >> small subset of all cases.
> >>
> >> How does that sound to people?  Thats' where we currently do that
> >>
> >> if (IS_ENABLED(CONFIG_PROVE_LOCKING) &&
> >> WARN(irqs_disabled(), "syscall %ld left IRQs disabled",
> >> regs->orig_ax))
> >> local_irq_enable();
> >>
> >> check too, which is a fairly similar issue.  
> >
> > This is exactly what Heiko did for the s390 backend as a result of this
> > discussion. See the _CIF_ASCE_SECONDARY bit in arch/s390/kernel/entry.S,
> > for the hot patch the check for the bit is included in the general
> > _CIF_WORK test. Only the slow patch gets a bit slower.
> >
> > git commit b5a882fcf146c87cb6b67c6df353e1c042b8773d
> > "s390: restore address space when returning to user space".  
> 
> If I'm understanding this, it won't catch corruption of addr_limit
> during fast-path syscalls, though (i.e. addr_limit changed without a
> call to set_fs()). :( This addr_limit corruption is mostly only a risk
> archs without THREAD_INFO_IN_TASK, but it would still be nice to catch
> unbalanced set_fs() code, so I like the idea. I like getting rid of
> addr_limit entirely even more, but that'll take some time. :)

Well for s390 there is no addr_limit as we use two separate address space
for kernel vs. user. The equivalent to the addr_limit corruption on a
fast-path syscall would be changing CR7 outside of set_fs. This boils
down to the question what we are protection against? Bad code with 
unbalanced set_fs or evil code that changes addr_limit/CR7 outside of
set_fs

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-11 Thread Kees Cook
On Thu, May 11, 2017 at 10:28 PM, Martin Schwidefsky
 wrote:
> On Thu, 11 May 2017 16:44:07 -0700
> Linus Torvalds  wrote:
>
>> On Thu, May 11, 2017 at 4:17 PM, Thomas Garnier  wrote:
>> >
>> > Ingo: Do you want the change as-is? Would you like it to be optional?
>> > What do you think?
>>
>> I'm not ingo, but I don't like that patch. It's in the wrong place -
>> that system call return code is too timing-critical to add address
>> limit checks.
>>
>> Now what I think you *could* do is:
>>
>>  - make "set_fs()" actually set a work flag in the current thread flags
>>
>>  - do the test in the slow-path (syscall_return_slowpath).
>>
>> Yes, yes, that ends up being architecture-specific, but it's fairly simple.
>>
>> And it only slows down the system calls that actually use "set_fs()".
>> Sure, it will slow those down a fair amount, but they are hopefully a
>> small subset of all cases.
>>
>> How does that sound to people?  Thats' where we currently do that
>>
>> if (IS_ENABLED(CONFIG_PROVE_LOCKING) &&
>> WARN(irqs_disabled(), "syscall %ld left IRQs disabled",
>> regs->orig_ax))
>> local_irq_enable();
>>
>> check too, which is a fairly similar issue.
>
> This is exactly what Heiko did for the s390 backend as a result of this
> discussion. See the _CIF_ASCE_SECONDARY bit in arch/s390/kernel/entry.S,
> for the hot patch the check for the bit is included in the general
> _CIF_WORK test. Only the slow patch gets a bit slower.
>
> git commit b5a882fcf146c87cb6b67c6df353e1c042b8773d
> "s390: restore address space when returning to user space".

If I'm understanding this, it won't catch corruption of addr_limit
during fast-path syscalls, though (i.e. addr_limit changed without a
call to set_fs()). :( This addr_limit corruption is mostly only a risk
archs without THREAD_INFO_IN_TASK, but it would still be nice to catch
unbalanced set_fs() code, so I like the idea. I like getting rid of
addr_limit entirely even more, but that'll take some time. :)

-Kees


-- 
Kees Cook
Pixel Security


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-11 Thread Kees Cook
On Thu, May 11, 2017 at 10:28 PM, Martin Schwidefsky
 wrote:
> On Thu, 11 May 2017 16:44:07 -0700
> Linus Torvalds  wrote:
>
>> On Thu, May 11, 2017 at 4:17 PM, Thomas Garnier  wrote:
>> >
>> > Ingo: Do you want the change as-is? Would you like it to be optional?
>> > What do you think?
>>
>> I'm not ingo, but I don't like that patch. It's in the wrong place -
>> that system call return code is too timing-critical to add address
>> limit checks.
>>
>> Now what I think you *could* do is:
>>
>>  - make "set_fs()" actually set a work flag in the current thread flags
>>
>>  - do the test in the slow-path (syscall_return_slowpath).
>>
>> Yes, yes, that ends up being architecture-specific, but it's fairly simple.
>>
>> And it only slows down the system calls that actually use "set_fs()".
>> Sure, it will slow those down a fair amount, but they are hopefully a
>> small subset of all cases.
>>
>> How does that sound to people?  Thats' where we currently do that
>>
>> if (IS_ENABLED(CONFIG_PROVE_LOCKING) &&
>> WARN(irqs_disabled(), "syscall %ld left IRQs disabled",
>> regs->orig_ax))
>> local_irq_enable();
>>
>> check too, which is a fairly similar issue.
>
> This is exactly what Heiko did for the s390 backend as a result of this
> discussion. See the _CIF_ASCE_SECONDARY bit in arch/s390/kernel/entry.S,
> for the hot patch the check for the bit is included in the general
> _CIF_WORK test. Only the slow patch gets a bit slower.
>
> git commit b5a882fcf146c87cb6b67c6df353e1c042b8773d
> "s390: restore address space when returning to user space".

If I'm understanding this, it won't catch corruption of addr_limit
during fast-path syscalls, though (i.e. addr_limit changed without a
call to set_fs()). :( This addr_limit corruption is mostly only a risk
archs without THREAD_INFO_IN_TASK, but it would still be nice to catch
unbalanced set_fs() code, so I like the idea. I like getting rid of
addr_limit entirely even more, but that'll take some time. :)

-Kees


-- 
Kees Cook
Pixel Security


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-11 Thread Martin Schwidefsky
On Thu, 11 May 2017 16:44:07 -0700
Linus Torvalds  wrote:

> On Thu, May 11, 2017 at 4:17 PM, Thomas Garnier  wrote:
> >
> > Ingo: Do you want the change as-is? Would you like it to be optional?
> > What do you think?  
> 
> I'm not ingo, but I don't like that patch. It's in the wrong place -
> that system call return code is too timing-critical to add address
> limit checks.
> 
> Now what I think you *could* do is:
> 
>  - make "set_fs()" actually set a work flag in the current thread flags
> 
>  - do the test in the slow-path (syscall_return_slowpath).
> 
> Yes, yes, that ends up being architecture-specific, but it's fairly simple.
> 
> And it only slows down the system calls that actually use "set_fs()".
> Sure, it will slow those down a fair amount, but they are hopefully a
> small subset of all cases.
> 
> How does that sound to people?  Thats' where we currently do that
> 
> if (IS_ENABLED(CONFIG_PROVE_LOCKING) &&
> WARN(irqs_disabled(), "syscall %ld left IRQs disabled",
> regs->orig_ax))
> local_irq_enable();
> 
> check too, which is a fairly similar issue.

This is exactly what Heiko did for the s390 backend as a result of this
discussion. See the _CIF_ASCE_SECONDARY bit in arch/s390/kernel/entry.S,
for the hot patch the check for the bit is included in the general
_CIF_WORK test. Only the slow patch gets a bit slower.

git commit b5a882fcf146c87cb6b67c6df353e1c042b8773d
"s390: restore address space when returning to user space".

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-11 Thread Martin Schwidefsky
On Thu, 11 May 2017 16:44:07 -0700
Linus Torvalds  wrote:

> On Thu, May 11, 2017 at 4:17 PM, Thomas Garnier  wrote:
> >
> > Ingo: Do you want the change as-is? Would you like it to be optional?
> > What do you think?  
> 
> I'm not ingo, but I don't like that patch. It's in the wrong place -
> that system call return code is too timing-critical to add address
> limit checks.
> 
> Now what I think you *could* do is:
> 
>  - make "set_fs()" actually set a work flag in the current thread flags
> 
>  - do the test in the slow-path (syscall_return_slowpath).
> 
> Yes, yes, that ends up being architecture-specific, but it's fairly simple.
> 
> And it only slows down the system calls that actually use "set_fs()".
> Sure, it will slow those down a fair amount, but they are hopefully a
> small subset of all cases.
> 
> How does that sound to people?  Thats' where we currently do that
> 
> if (IS_ENABLED(CONFIG_PROVE_LOCKING) &&
> WARN(irqs_disabled(), "syscall %ld left IRQs disabled",
> regs->orig_ax))
> local_irq_enable();
> 
> check too, which is a fairly similar issue.

This is exactly what Heiko did for the s390 backend as a result of this
discussion. See the _CIF_ASCE_SECONDARY bit in arch/s390/kernel/entry.S,
for the hot patch the check for the bit is included in the general
_CIF_WORK test. Only the slow patch gets a bit slower.

git commit b5a882fcf146c87cb6b67c6df353e1c042b8773d
"s390: restore address space when returning to user space".

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-11 Thread Linus Torvalds
On Thu, May 11, 2017 at 4:17 PM, Thomas Garnier  wrote:
>
> Ingo: Do you want the change as-is? Would you like it to be optional?
> What do you think?

I'm not ingo, but I don't like that patch. It's in the wrong place -
that system call return code is too timing-critical to add address
limit checks.

Now what I think you *could* do is:

 - make "set_fs()" actually set a work flag in the current thread flags

 - do the test in the slow-path (syscall_return_slowpath).

Yes, yes, that ends up being architecture-specific, but it's fairly simple.

And it only slows down the system calls that actually use "set_fs()".
Sure, it will slow those down a fair amount, but they are hopefully a
small subset of all cases.

How does that sound to people?  Thats' where we currently do that

if (IS_ENABLED(CONFIG_PROVE_LOCKING) &&
WARN(irqs_disabled(), "syscall %ld left IRQs disabled",
regs->orig_ax))
local_irq_enable();

check too, which is a fairly similar issue.

   Linus


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-11 Thread Linus Torvalds
On Thu, May 11, 2017 at 4:17 PM, Thomas Garnier  wrote:
>
> Ingo: Do you want the change as-is? Would you like it to be optional?
> What do you think?

I'm not ingo, but I don't like that patch. It's in the wrong place -
that system call return code is too timing-critical to add address
limit checks.

Now what I think you *could* do is:

 - make "set_fs()" actually set a work flag in the current thread flags

 - do the test in the slow-path (syscall_return_slowpath).

Yes, yes, that ends up being architecture-specific, but it's fairly simple.

And it only slows down the system calls that actually use "set_fs()".
Sure, it will slow those down a fair amount, but they are hopefully a
small subset of all cases.

How does that sound to people?  Thats' where we currently do that

if (IS_ENABLED(CONFIG_PROVE_LOCKING) &&
WARN(irqs_disabled(), "syscall %ld left IRQs disabled",
regs->orig_ax))
local_irq_enable();

check too, which is a fairly similar issue.

   Linus


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-11 Thread Thomas Garnier
On Tue, May 9, 2017 at 7:29 AM, Thomas Garnier  wrote:
>
> On Tue, May 9, 2017 at 4:10 AM, Greg KH  wrote:
> > On Tue, May 09, 2017 at 08:56:19AM +0200, Ingo Molnar wrote:
> >>
> >> * Kees Cook  wrote:
> >>
> >> > > There's the option of using GCC plugins now that the infrastructure was
> >> > > upstreamed from grsecurity. It can be used as part of the regular build
> >> > > process and as long as the analysis is pretty simple it shouldn't hurt 
> >> > > compile
> >> > > time much.
> >> >
> >> > Well, and that the situation may arise due to memory corruption, not from
> >> > poorly-matched set_fs() calls, which static analysis won't help solve. 
> >> > We need
> >> > to catch this bad kernel state because it is a very bad state to run in.
> >>
> >> If memory corruption corrupted the task state into having addr_limit set to
> >> KERNEL_DS then there's already a fair chance that it's game over: it could 
> >> also
> >> have set *uid to 0, or changed a sensitive PF_ flag, or a number of other
> >> things...
> >>
> >> Furthermore, think about it: there's literally an infinite amount of 
> >> corrupted
> >> task states that could be a security problem and that could be checked 
> >> after every
> >> system call. Do we want to check every one of them?
> >
> > Ok, I'm all for not checking lots of stuff all the time, just to protect
> > from crappy drivers that.  Especially as we _can_ audit and run checks
> > on the source code for them in the kernel tree.
> >
> > But, and here's the problem, outside of the desktop/enterprise world,
> > there are a ton of out-of-tree code that is crap.  The number of
> > security/bug fixes and kernel crashes for out-of-tree code in systems
> > like Android phones is just so high it's laughable.
> >
> > When you have a device that is running 3.2 million lines of kernel code,
> > yet the diffstat of the tree compared to mainline adds 3 million lines
> > of code, there is bound to be a ton of issues/problems there.
> >
> > So this is an entirely different thing we need to try to protect
> > ourselves from.  A long time ago I laughed when I saw that Microsoft had
> > to do lots of "hardening" of their kernel to protect themselves from
> > crappy drivers, as I knew we didn't have to do that because we had the
> > source for them and could fix the root issues.  But that has changed and
> > now we don't all have that option.  That code is out-of-tree because the
> > vendor doesn't care, and doesn't want to take any time at all to do
> > anything resembling a real code review[1].
>
> That's a big part of why I thought would be useful. I am less worried
> about edge cases upstream right now than forks with custom codes not
> using set_fs correctly.
>
> >
> > So, how about options like the ones being proposed here, go behind a new
> > config option:
> > CONFIG_PROTECT_FROM_CRAPPY_DRIVERS
> > that device owners can enable if they do not trust their vendor-provided
> > code (hint, I sure don't.)  That way the "normal" path that all of us
> > are used to running will be fine, but if you want to take the speed hit
> > to try to protect yourself, then you can do that as well.
>
> Maybe another name but why not.

Ingo: Do you want the change as-is? Would you like it to be optional?
What do you think?

>
> >
> > Anyway, just an idea...
> >
> > thanks,
> >
> > greg k-h
> >
> > [1] I am working really hard with lots of vendors to try to fix their
> > broken development model, but that is going to take years to resolve
> > as their device pipelines are years long, and changing their
> > mindsets takes a long time...
>
>
>
> --
> Thomas




-- 
Thomas


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-11 Thread Thomas Garnier
On Tue, May 9, 2017 at 7:29 AM, Thomas Garnier  wrote:
>
> On Tue, May 9, 2017 at 4:10 AM, Greg KH  wrote:
> > On Tue, May 09, 2017 at 08:56:19AM +0200, Ingo Molnar wrote:
> >>
> >> * Kees Cook  wrote:
> >>
> >> > > There's the option of using GCC plugins now that the infrastructure was
> >> > > upstreamed from grsecurity. It can be used as part of the regular build
> >> > > process and as long as the analysis is pretty simple it shouldn't hurt 
> >> > > compile
> >> > > time much.
> >> >
> >> > Well, and that the situation may arise due to memory corruption, not from
> >> > poorly-matched set_fs() calls, which static analysis won't help solve. 
> >> > We need
> >> > to catch this bad kernel state because it is a very bad state to run in.
> >>
> >> If memory corruption corrupted the task state into having addr_limit set to
> >> KERNEL_DS then there's already a fair chance that it's game over: it could 
> >> also
> >> have set *uid to 0, or changed a sensitive PF_ flag, or a number of other
> >> things...
> >>
> >> Furthermore, think about it: there's literally an infinite amount of 
> >> corrupted
> >> task states that could be a security problem and that could be checked 
> >> after every
> >> system call. Do we want to check every one of them?
> >
> > Ok, I'm all for not checking lots of stuff all the time, just to protect
> > from crappy drivers that.  Especially as we _can_ audit and run checks
> > on the source code for them in the kernel tree.
> >
> > But, and here's the problem, outside of the desktop/enterprise world,
> > there are a ton of out-of-tree code that is crap.  The number of
> > security/bug fixes and kernel crashes for out-of-tree code in systems
> > like Android phones is just so high it's laughable.
> >
> > When you have a device that is running 3.2 million lines of kernel code,
> > yet the diffstat of the tree compared to mainline adds 3 million lines
> > of code, there is bound to be a ton of issues/problems there.
> >
> > So this is an entirely different thing we need to try to protect
> > ourselves from.  A long time ago I laughed when I saw that Microsoft had
> > to do lots of "hardening" of their kernel to protect themselves from
> > crappy drivers, as I knew we didn't have to do that because we had the
> > source for them and could fix the root issues.  But that has changed and
> > now we don't all have that option.  That code is out-of-tree because the
> > vendor doesn't care, and doesn't want to take any time at all to do
> > anything resembling a real code review[1].
>
> That's a big part of why I thought would be useful. I am less worried
> about edge cases upstream right now than forks with custom codes not
> using set_fs correctly.
>
> >
> > So, how about options like the ones being proposed here, go behind a new
> > config option:
> > CONFIG_PROTECT_FROM_CRAPPY_DRIVERS
> > that device owners can enable if they do not trust their vendor-provided
> > code (hint, I sure don't.)  That way the "normal" path that all of us
> > are used to running will be fine, but if you want to take the speed hit
> > to try to protect yourself, then you can do that as well.
>
> Maybe another name but why not.

Ingo: Do you want the change as-is? Would you like it to be optional?
What do you think?

>
> >
> > Anyway, just an idea...
> >
> > thanks,
> >
> > greg k-h
> >
> > [1] I am working really hard with lots of vendors to try to fix their
> > broken development model, but that is going to take years to resolve
> > as their device pipelines are years long, and changing their
> > mindsets takes a long time...
>
>
>
> --
> Thomas




-- 
Thomas


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-11 Thread Borislav Petkov
On Tue, May 09, 2017 at 04:31:00PM -0700, Kees Cook wrote:
> > I don't like silent fixups.  If we want to do this, we should BUG or
> > at least WARN, not just change the addr limit.  But I'm also not
> > convinced it's indicative of an actual bug here.
> 
> Nothing should enter that function with KERNEL_DS set, right?
> 
> BUG_ON(get_fs() != USER_DS);

We're feeling triggerhappy, aren't we? A nice juicy WARN-splat along
with a fixup looks much better than killing the box, to me.

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-11 Thread Borislav Petkov
On Tue, May 09, 2017 at 04:31:00PM -0700, Kees Cook wrote:
> > I don't like silent fixups.  If we want to do this, we should BUG or
> > at least WARN, not just change the addr limit.  But I'm also not
> > convinced it's indicative of an actual bug here.
> 
> Nothing should enter that function with KERNEL_DS set, right?
> 
> BUG_ON(get_fs() != USER_DS);

We're feeling triggerhappy, aren't we? A nice juicy WARN-splat along
with a fixup looks much better than killing the box, to me.

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-10 Thread Andy Lutomirski
On Wed, May 10, 2017 at 1:14 AM, Christoph Hellwig  wrote:
> On Wed, May 10, 2017 at 09:08:41AM +0100, Al Viro wrote:
>> On Wed, May 10, 2017 at 09:37:04AM +0200, Arnd Bergmann wrote:
>>
>> > > How about trying to remove all of them?  If we could actually get rid
>> > > of all of them, we could drop the arch support, and we'd get faster,
>> > > simpler, shorter uaccess code throughout the kernel.
>>
>> BTW, not all get_user() under KERNEL_DS are plain loads.  There is an
>> exception - probe_kernel_read().
>
> And various calls that looks like opencoded versions, e.g. drivers/dio
> or the ELF loader.
>
> But in the long run we'll just need a separate primitive for that,
> but that can wait until the set_fs calls outside the core code are
> gone.

I suspect that, on most arches, the primitive is called
__copy_from_user().  We could make the generic code do that except
where overridden.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-10 Thread Andy Lutomirski
On Wed, May 10, 2017 at 1:14 AM, Christoph Hellwig  wrote:
> On Wed, May 10, 2017 at 09:08:41AM +0100, Al Viro wrote:
>> On Wed, May 10, 2017 at 09:37:04AM +0200, Arnd Bergmann wrote:
>>
>> > > How about trying to remove all of them?  If we could actually get rid
>> > > of all of them, we could drop the arch support, and we'd get faster,
>> > > simpler, shorter uaccess code throughout the kernel.
>>
>> BTW, not all get_user() under KERNEL_DS are plain loads.  There is an
>> exception - probe_kernel_read().
>
> And various calls that looks like opencoded versions, e.g. drivers/dio
> or the ELF loader.
>
> But in the long run we'll just need a separate primitive for that,
> but that can wait until the set_fs calls outside the core code are
> gone.

I suspect that, on most arches, the primitive is called
__copy_from_user().  We could make the generic code do that except
where overridden.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-10 Thread Christoph Hellwig
On Wed, May 10, 2017 at 09:08:41AM +0100, Al Viro wrote:
> On Wed, May 10, 2017 at 09:37:04AM +0200, Arnd Bergmann wrote:
> 
> > > How about trying to remove all of them?  If we could actually get rid
> > > of all of them, we could drop the arch support, and we'd get faster,
> > > simpler, shorter uaccess code throughout the kernel.
> 
> BTW, not all get_user() under KERNEL_DS are plain loads.  There is an
> exception - probe_kernel_read().

And various calls that looks like opencoded versions, e.g. drivers/dio
or the ELF loader.

But in the long run we'll just need a separate primitive for that,
but that can wait until the set_fs calls outside the core code are
gone.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-10 Thread Christoph Hellwig
On Wed, May 10, 2017 at 09:08:41AM +0100, Al Viro wrote:
> On Wed, May 10, 2017 at 09:37:04AM +0200, Arnd Bergmann wrote:
> 
> > > How about trying to remove all of them?  If we could actually get rid
> > > of all of them, we could drop the arch support, and we'd get faster,
> > > simpler, shorter uaccess code throughout the kernel.
> 
> BTW, not all get_user() under KERNEL_DS are plain loads.  There is an
> exception - probe_kernel_read().

And various calls that looks like opencoded versions, e.g. drivers/dio
or the ELF loader.

But in the long run we'll just need a separate primitive for that,
but that can wait until the set_fs calls outside the core code are
gone.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-10 Thread Al Viro
On Wed, May 10, 2017 at 09:37:04AM +0200, Arnd Bergmann wrote:

> > How about trying to remove all of them?  If we could actually get rid
> > of all of them, we could drop the arch support, and we'd get faster,
> > simpler, shorter uaccess code throughout the kernel.

BTW, not all get_user() under KERNEL_DS are plain loads.  There is an
exception - probe_kernel_read().

> > The ones in kernel/compat.c are generally garbage.  They should be
> > using compat_alloc_user_space().  Ditto for kernel/power/user.c.
> 
> compat_alloc_user_space() has some problems too, it adds
> complexity to a rarely-tested code path and can add some noticeable
> overhead in cases where user space access is slow because of
> extra checks.
> 
> It's clearly better than set_fs(), but the way I prefer to convert the
> code is to avoid both and instead move compat handlers next to
> the native code, and splitting out the common code between native
> and compat mode into a helper that takes a regular kernel pointer.
> 
> I think that's what both Al has done in the past on compat_ioctl()
> and select() and what Christoph does in his latest series, but
> it seems worth pointing out for others that decide to help out here.

Folks, reducing the amount of places where we play with set_fs() is certainly
a good thing.  Getting rid of them completely is something entirely different;
I have tried to plot out patch series in this direction many times during the
last 5 years or so, but it's not going to be easy.  Tomorrow I can start
posting my notes in that direction (and there are tons of those, unfortunately
mixed with git grep results, highly unprintable personal comments, etc.);
just let me grab some sleep first...

BTW, slow userland access is not just due to extra checks; access_ok(),
in particular, is pretty much noise.  The real PITA comes from the things
like STAC/CLAC on recent x86.  Or hardware overhead of cross-address-space
block copy insn (e.g. on s390, where it's optimized for multi-cacheline
blocks).  Or things like uml, where it's a matter of walking the page
tables for each sodding __get_user().  It's not always just a matter of
address space limit...


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-10 Thread Al Viro
On Wed, May 10, 2017 at 09:37:04AM +0200, Arnd Bergmann wrote:

> > How about trying to remove all of them?  If we could actually get rid
> > of all of them, we could drop the arch support, and we'd get faster,
> > simpler, shorter uaccess code throughout the kernel.

BTW, not all get_user() under KERNEL_DS are plain loads.  There is an
exception - probe_kernel_read().

> > The ones in kernel/compat.c are generally garbage.  They should be
> > using compat_alloc_user_space().  Ditto for kernel/power/user.c.
> 
> compat_alloc_user_space() has some problems too, it adds
> complexity to a rarely-tested code path and can add some noticeable
> overhead in cases where user space access is slow because of
> extra checks.
> 
> It's clearly better than set_fs(), but the way I prefer to convert the
> code is to avoid both and instead move compat handlers next to
> the native code, and splitting out the common code between native
> and compat mode into a helper that takes a regular kernel pointer.
> 
> I think that's what both Al has done in the past on compat_ioctl()
> and select() and what Christoph does in his latest series, but
> it seems worth pointing out for others that decide to help out here.

Folks, reducing the amount of places where we play with set_fs() is certainly
a good thing.  Getting rid of them completely is something entirely different;
I have tried to plot out patch series in this direction many times during the
last 5 years or so, but it's not going to be easy.  Tomorrow I can start
posting my notes in that direction (and there are tons of those, unfortunately
mixed with git grep results, highly unprintable personal comments, etc.);
just let me grab some sleep first...

BTW, slow userland access is not just due to extra checks; access_ok(),
in particular, is pretty much noise.  The real PITA comes from the things
like STAC/CLAC on recent x86.  Or hardware overhead of cross-address-space
block copy insn (e.g. on s390, where it's optimized for multi-cacheline
blocks).  Or things like uml, where it's a matter of walking the page
tables for each sodding __get_user().  It's not always just a matter of
address space limit...


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-10 Thread Arnd Bergmann
On Tue, May 9, 2017 at 3:00 PM, Andy Lutomirski  wrote:
> On Tue, May 9, 2017 at 1:56 AM, Christoph Hellwig  wrote:
>> On Tue, May 09, 2017 at 08:45:22AM +0200, Ingo Molnar wrote:
>>> We only have ~115 code blocks in the kernel that set/restore KERNEL_DS, it 
>>> would
>>> be a pity to add a runtime check to every system call ...
>>
>> I think we should simply strive to remove all of them that aren't
>> in core scheduler / arch code.  Basically evetyytime we do the
>>
>> oldfs = get_fs();
>> set_fs(KERNEL_DS);
>> ..
>> set_fs(oldfs);
>>
>> trick we're doing something wrong, and there should always be better
>> ways to archive it.  E.g. using iov_iter with a ITER_KVEC type
>> consistently would already remove most of them.
>
> How about trying to remove all of them?  If we could actually get rid
> of all of them, we could drop the arch support, and we'd get faster,
> simpler, shorter uaccess code throughout the kernel.
>
> The ones in kernel/compat.c are generally garbage.  They should be
> using compat_alloc_user_space().  Ditto for kernel/power/user.c.

compat_alloc_user_space() has some problems too, it adds
complexity to a rarely-tested code path and can add some noticeable
overhead in cases where user space access is slow because of
extra checks.

It's clearly better than set_fs(), but the way I prefer to convert the
code is to avoid both and instead move compat handlers next to
the native code, and splitting out the common code between native
and compat mode into a helper that takes a regular kernel pointer.

I think that's what both Al has done in the past on compat_ioctl()
and select() and what Christoph does in his latest series, but
it seems worth pointing out for others that decide to help out here.

 Arnd


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-10 Thread Arnd Bergmann
On Tue, May 9, 2017 at 3:00 PM, Andy Lutomirski  wrote:
> On Tue, May 9, 2017 at 1:56 AM, Christoph Hellwig  wrote:
>> On Tue, May 09, 2017 at 08:45:22AM +0200, Ingo Molnar wrote:
>>> We only have ~115 code blocks in the kernel that set/restore KERNEL_DS, it 
>>> would
>>> be a pity to add a runtime check to every system call ...
>>
>> I think we should simply strive to remove all of them that aren't
>> in core scheduler / arch code.  Basically evetyytime we do the
>>
>> oldfs = get_fs();
>> set_fs(KERNEL_DS);
>> ..
>> set_fs(oldfs);
>>
>> trick we're doing something wrong, and there should always be better
>> ways to archive it.  E.g. using iov_iter with a ITER_KVEC type
>> consistently would already remove most of them.
>
> How about trying to remove all of them?  If we could actually get rid
> of all of them, we could drop the arch support, and we'd get faster,
> simpler, shorter uaccess code throughout the kernel.
>
> The ones in kernel/compat.c are generally garbage.  They should be
> using compat_alloc_user_space().  Ditto for kernel/power/user.c.

compat_alloc_user_space() has some problems too, it adds
complexity to a rarely-tested code path and can add some noticeable
overhead in cases where user space access is slow because of
extra checks.

It's clearly better than set_fs(), but the way I prefer to convert the
code is to avoid both and instead move compat handlers next to
the native code, and splitting out the common code between native
and compat mode into a helper that takes a regular kernel pointer.

I think that's what both Al has done in the past on compat_ioctl()
and select() and what Christoph does in his latest series, but
it seems worth pointing out for others that decide to help out here.

 Arnd


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-10 Thread Christoph Hellwig
On Wed, May 10, 2017 at 09:28:48AM +0200, Arnd Bergmann wrote:
> My older time64_t syscall series has the side-effect of doing something
> like this to the time-related compat handlers in kernel/compat.c. If nobody
> else has started looking at removing set_fs from those, I can extract
> the relevant parts from my series.

That would be great.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-10 Thread Christoph Hellwig
On Wed, May 10, 2017 at 09:28:48AM +0200, Arnd Bergmann wrote:
> My older time64_t syscall series has the side-effect of doing something
> like this to the time-related compat handlers in kernel/compat.c. If nobody
> else has started looking at removing set_fs from those, I can extract
> the relevant parts from my series.

That would be great.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-10 Thread Christoph Hellwig
On Wed, May 10, 2017 at 08:27:47AM +0100, Al Viro wrote:
> And you *still* do the same.  Christoph, this is ridiculous - the worst
> part of the area is not a couple of functions in fs/read_write.c, it's
> a fucking lot of ->read() and ->write() instances in shitty driver code,
> pardon the redundance.  And _that_ is still done under set_fs(KERNEL_DS).

For now yes.  But it provides a sane API on top, and then we can
start moving the drivers that matter to the iter variants and drop
support for the rest soon.  Most in-kernel I/O is done to files, and
the rest to a very limited set of devices. (not accounting for sockets
through their own APIs, thats another story)

> That's what I'm objecting to.  Centralized kernel_readv() et.al. - sure,
> and fs/read_write.c is the right place for those.  No arguments here.
> Conversion to those - absolutely; drivers have no fucking business touching
> set_fs() at all.  But your primitives are trouble waiting to happen.
> Let them take kvec arrays.  And let them, in case when there's no
> ->read_iter()/->write_iter(), do set_fs().  Statically, without this
> if (iter->type & ITER_KVEC) ... stuff.

Oh weĺl.  I can do that first, but I think eventually we'll have to
get back to what I've done now.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-10 Thread Christoph Hellwig
On Wed, May 10, 2017 at 08:27:47AM +0100, Al Viro wrote:
> And you *still* do the same.  Christoph, this is ridiculous - the worst
> part of the area is not a couple of functions in fs/read_write.c, it's
> a fucking lot of ->read() and ->write() instances in shitty driver code,
> pardon the redundance.  And _that_ is still done under set_fs(KERNEL_DS).

For now yes.  But it provides a sane API on top, and then we can
start moving the drivers that matter to the iter variants and drop
support for the rest soon.  Most in-kernel I/O is done to files, and
the rest to a very limited set of devices. (not accounting for sockets
through their own APIs, thats another story)

> That's what I'm objecting to.  Centralized kernel_readv() et.al. - sure,
> and fs/read_write.c is the right place for those.  No arguments here.
> Conversion to those - absolutely; drivers have no fucking business touching
> set_fs() at all.  But your primitives are trouble waiting to happen.
> Let them take kvec arrays.  And let them, in case when there's no
> ->read_iter()/->write_iter(), do set_fs().  Statically, without this
> if (iter->type & ITER_KVEC) ... stuff.

Oh weĺl.  I can do that first, but I think eventually we'll have to
get back to what I've done now.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-10 Thread Al Viro
On Tue, May 09, 2017 at 11:53:01PM -0700, Christoph Hellwig wrote:
> On Wed, May 10, 2017 at 04:12:54AM +0100, Al Viro wrote:
> > What's the point?  What's wrong with having 
> > kernel_read()/kernel_readv()/etc.?
> > You still have set_fs() in there; doing that one level up in call chain 
> > would
> > be just fine...  IDGI.
> 
> The problem is that they modify the address limit, which the whole
> subthread here wants to get rid of.

And you *still* do the same.  Christoph, this is ridiculous - the worst
part of the area is not a couple of functions in fs/read_write.c, it's
a fucking lot of ->read() and ->write() instances in shitty driver code,
pardon the redundance.  And _that_ is still done under set_fs(KERNEL_DS).

Claiming that set_fs() done one function deeper in callchain (both in
fs/read_write.c) is somehow better because it reduces the amount of code
under that thing...  Get real, please - helpers that encapsulate those
set_fs() pairs (a-la kernel_read(), etc.) absolutely make sense and
converting their open-coded instances to calls of those helpers is clearly
a good thing.  However, we are not
* getting rid of low-quality code run under KERNEL_DS
* gettind rid of set_fs() itself
* getting a generic kernel_read() variant that would really take
an iov_iter.

That's what I'm objecting to.  Centralized kernel_readv() et.al. - sure,
and fs/read_write.c is the right place for those.  No arguments here.
Conversion to those - absolutely; drivers have no fucking business touching
set_fs() at all.  But your primitives are trouble waiting to happen.
Let them take kvec arrays.  And let them, in case when there's no
->read_iter()/->write_iter(), do set_fs().  Statically, without this
if (iter->type & ITER_KVEC) ... stuff.

> > Another delicate place: you can't assume that write() always advances
> > file position by its (positive) return value.  btrfs stuff is sensitive
> > to that.
> 
> If we don't want to assume that we need to pass pointer to pos to
> kernel_read/write.  Which might be a good idea in general.

Yes.

> > ashmem probably _is_ OK with demanding ->read_iter(), but I'm not sure
> > about blind asma->file->f_pos += ret.  That's begging for races.  Actually,
> > scratch that - it *is* racy.
> 
> I think the proper fix is to not even bother to maintain f_pos of the
> backing file, as we don't ever use it - all reads from it pass in
> an explicit position anyway.

vfs_llseek() used by ashmem_llseek()...


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-10 Thread Al Viro
On Tue, May 09, 2017 at 11:53:01PM -0700, Christoph Hellwig wrote:
> On Wed, May 10, 2017 at 04:12:54AM +0100, Al Viro wrote:
> > What's the point?  What's wrong with having 
> > kernel_read()/kernel_readv()/etc.?
> > You still have set_fs() in there; doing that one level up in call chain 
> > would
> > be just fine...  IDGI.
> 
> The problem is that they modify the address limit, which the whole
> subthread here wants to get rid of.

And you *still* do the same.  Christoph, this is ridiculous - the worst
part of the area is not a couple of functions in fs/read_write.c, it's
a fucking lot of ->read() and ->write() instances in shitty driver code,
pardon the redundance.  And _that_ is still done under set_fs(KERNEL_DS).

Claiming that set_fs() done one function deeper in callchain (both in
fs/read_write.c) is somehow better because it reduces the amount of code
under that thing...  Get real, please - helpers that encapsulate those
set_fs() pairs (a-la kernel_read(), etc.) absolutely make sense and
converting their open-coded instances to calls of those helpers is clearly
a good thing.  However, we are not
* getting rid of low-quality code run under KERNEL_DS
* gettind rid of set_fs() itself
* getting a generic kernel_read() variant that would really take
an iov_iter.

That's what I'm objecting to.  Centralized kernel_readv() et.al. - sure,
and fs/read_write.c is the right place for those.  No arguments here.
Conversion to those - absolutely; drivers have no fucking business touching
set_fs() at all.  But your primitives are trouble waiting to happen.
Let them take kvec arrays.  And let them, in case when there's no
->read_iter()/->write_iter(), do set_fs().  Statically, without this
if (iter->type & ITER_KVEC) ... stuff.

> > Another delicate place: you can't assume that write() always advances
> > file position by its (positive) return value.  btrfs stuff is sensitive
> > to that.
> 
> If we don't want to assume that we need to pass pointer to pos to
> kernel_read/write.  Which might be a good idea in general.

Yes.

> > ashmem probably _is_ OK with demanding ->read_iter(), but I'm not sure
> > about blind asma->file->f_pos += ret.  That's begging for races.  Actually,
> > scratch that - it *is* racy.
> 
> I think the proper fix is to not even bother to maintain f_pos of the
> backing file, as we don't ever use it - all reads from it pass in
> an explicit position anyway.

vfs_llseek() used by ashmem_llseek()...


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-10 Thread Arnd Bergmann
On Tue, May 9, 2017 at 6:03 PM, Christoph Hellwig  wrote:
> On Tue, May 09, 2017 at 06:02:50AM -0700, Christoph Hellwig wrote:
>> On Tue, May 09, 2017 at 06:00:01AM -0700, Andy Lutomirski wrote:
>> > fs/splice.c has some, ahem, interesting uses that have been the source
>> > of nasty exploits in the past.  Converting them to use iov_iter
>> > properly would be really, really nice.  Christoph, I don't suppose
>> > you'd like to do that?
>>
>> I can take care of all the fs code including this one.
>
> I spent the afternoon hacking up where I'd like this to head.  It's
> completely untested as of now:
>
> 
> http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/setfs-elimination

My older time64_t syscall series has the side-effect of doing something
like this to the time-related compat handlers in kernel/compat.c. If nobody
else has started looking at removing set_fs from those, I can extract
the relevant parts from my series.

  Arnd


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-10 Thread Arnd Bergmann
On Tue, May 9, 2017 at 6:03 PM, Christoph Hellwig  wrote:
> On Tue, May 09, 2017 at 06:02:50AM -0700, Christoph Hellwig wrote:
>> On Tue, May 09, 2017 at 06:00:01AM -0700, Andy Lutomirski wrote:
>> > fs/splice.c has some, ahem, interesting uses that have been the source
>> > of nasty exploits in the past.  Converting them to use iov_iter
>> > properly would be really, really nice.  Christoph, I don't suppose
>> > you'd like to do that?
>>
>> I can take care of all the fs code including this one.
>
> I spent the afternoon hacking up where I'd like this to head.  It's
> completely untested as of now:
>
> 
> http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/setfs-elimination

My older time64_t syscall series has the side-effect of doing something
like this to the time-related compat handlers in kernel/compat.c. If nobody
else has started looking at removing set_fs from those, I can extract
the relevant parts from my series.

  Arnd


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-10 Thread Christoph Hellwig
On Tue, May 09, 2017 at 04:31:00PM -0700, Kees Cook wrote:
> > I don't like silent fixups.  If we want to do this, we should BUG or
> > at least WARN, not just change the addr limit.  But I'm also not
> > convinced it's indicative of an actual bug here.
> 
> Nothing should enter that function with KERNEL_DS set, right?

It might very well do.  Various drivers or the networking code mess
with the address limits for fairly broad sections of code.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-10 Thread Christoph Hellwig
On Tue, May 09, 2017 at 04:31:00PM -0700, Kees Cook wrote:
> > I don't like silent fixups.  If we want to do this, we should BUG or
> > at least WARN, not just change the addr limit.  But I'm also not
> > convinced it's indicative of an actual bug here.
> 
> Nothing should enter that function with KERNEL_DS set, right?

It might very well do.  Various drivers or the networking code mess
with the address limits for fairly broad sections of code.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-10 Thread Christoph Hellwig
On Wed, May 10, 2017 at 04:39:12AM +0100, Al Viro wrote:
> fcntl stuff: I've decided not to put something similar into work.compat
> since I couldn't decide what to do with compat stuff - word-by-word copy
> from userland converting to struct flock + conversion to posix_lock +
> actual work + conversion to flock + word-by-word copy to userland...  Smells
> like we might be better off with compat_flock_to_posix_lock() et.al.
> I'm still not sure; played a bit one way and another and dediced to drop
> it for now.  Hell knows...

My version already is an improvement in lines of code alone.  Between
that and stopping to mess with the address limit I think it's a clear
winner.  But it's pretty independent of the rest, and I'll just run
it through Jeff and Bruce and ask them what they think.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-10 Thread Christoph Hellwig
On Wed, May 10, 2017 at 04:39:12AM +0100, Al Viro wrote:
> fcntl stuff: I've decided not to put something similar into work.compat
> since I couldn't decide what to do with compat stuff - word-by-word copy
> from userland converting to struct flock + conversion to posix_lock +
> actual work + conversion to flock + word-by-word copy to userland...  Smells
> like we might be better off with compat_flock_to_posix_lock() et.al.
> I'm still not sure; played a bit one way and another and dediced to drop
> it for now.  Hell knows...

My version already is an improvement in lines of code alone.  Between
that and stopping to mess with the address limit I think it's a clear
winner.  But it's pretty independent of the rest, and I'll just run
it through Jeff and Bruce and ask them what they think.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-10 Thread Christoph Hellwig
On Wed, May 10, 2017 at 04:12:54AM +0100, Al Viro wrote:
> What's the point?  What's wrong with having kernel_read()/kernel_readv()/etc.?
> You still have set_fs() in there; doing that one level up in call chain would
> be just fine...  IDGI.

The problem is that they modify the address limit, which the whole
subthread here wants to get rid of.

> Broken commit: "net: don't play with address limits in kernel_recvmsg".
> It would be OK if it was only about data.  Unfortunately, that's not
> true in one case: svc_udp_recvfrom() wants ->msg_control.

Dropped, but we'll need to fix that eventually.

> Another delicate place: you can't assume that write() always advances
> file position by its (positive) return value.  btrfs stuff is sensitive
> to that.

If we don't want to assume that we need to pass pointer to pos to
kernel_read/write.  Which might be a good idea in general.

> ashmem probably _is_ OK with demanding ->read_iter(), but I'm not sure
> about blind asma->file->f_pos += ret.  That's begging for races.  Actually,
> scratch that - it *is* racy.

I think the proper fix is to not even bother to maintain f_pos of the
backing file, as we don't ever use it - all reads from it pass in
an explicit position anyway.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-10 Thread Christoph Hellwig
On Wed, May 10, 2017 at 04:12:54AM +0100, Al Viro wrote:
> What's the point?  What's wrong with having kernel_read()/kernel_readv()/etc.?
> You still have set_fs() in there; doing that one level up in call chain would
> be just fine...  IDGI.

The problem is that they modify the address limit, which the whole
subthread here wants to get rid of.

> Broken commit: "net: don't play with address limits in kernel_recvmsg".
> It would be OK if it was only about data.  Unfortunately, that's not
> true in one case: svc_udp_recvfrom() wants ->msg_control.

Dropped, but we'll need to fix that eventually.

> Another delicate place: you can't assume that write() always advances
> file position by its (positive) return value.  btrfs stuff is sensitive
> to that.

If we don't want to assume that we need to pass pointer to pos to
kernel_read/write.  Which might be a good idea in general.

> ashmem probably _is_ OK with demanding ->read_iter(), but I'm not sure
> about blind asma->file->f_pos += ret.  That's begging for races.  Actually,
> scratch that - it *is* racy.

I think the proper fix is to not even bother to maintain f_pos of the
backing file, as we don't ever use it - all reads from it pass in
an explicit position anyway.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-10 Thread Christoph Hellwig
On Wed, May 10, 2017 at 03:45:24AM +0100, Al Viro wrote:
> FWIW, some parts of that queue are obviously sane; it's the conversions of
> kernel_write() and friends to ->read_iter/->write_iter() that are 
> non-starters.

And that part is the main point!

> That stuff is used in too many situations; we can't guarantee that all of
> them will be for files that have those.

That's why this series handles ITER_KVEC for this case, which is all
that's really needed for kernel_read/write.  If you insiste the bvec
and pipe cases are handled as well that couod be added fairly easily.

> As for default_file_splice_read(), I seriously suspect that with your change
> we could as well just make it return -EINVAL and be done with that; places
> that have ->read_iter() tend to have explicit ->splice_read() and it looks
> like the ones that do not should simply use generic_file_read_iter().
> I hadn't checked that, but there's not a lot of those:

Making ->splice_read to default to the ->read_iter based implementation
and returning -EINVAL if neither that nor an explicit ->splice_read
is provided is useful, but wasn't the aim of this series.  Similar
on the write side.


Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode

2017-05-10 Thread Christoph Hellwig
On Wed, May 10, 2017 at 03:45:24AM +0100, Al Viro wrote:
> FWIW, some parts of that queue are obviously sane; it's the conversions of
> kernel_write() and friends to ->read_iter/->write_iter() that are 
> non-starters.

And that part is the main point!

> That stuff is used in too many situations; we can't guarantee that all of
> them will be for files that have those.

That's why this series handles ITER_KVEC for this case, which is all
that's really needed for kernel_read/write.  If you insiste the bvec
and pipe cases are handled as well that couod be added fairly easily.

> As for default_file_splice_read(), I seriously suspect that with your change
> we could as well just make it return -EINVAL and be done with that; places
> that have ->read_iter() tend to have explicit ->splice_read() and it looks
> like the ones that do not should simply use generic_file_read_iter().
> I hadn't checked that, but there's not a lot of those:

Making ->splice_read to default to the ->read_iter based implementation
and returning -EINVAL if neither that nor an explicit ->splice_read
is provided is useful, but wasn't the aim of this series.  Similar
on the write side.


  1   2   >