Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 10:30:44PM +0200, Peter Zijlstra wrote: > On stack variable length arrays get implemented by the compiler doing > alloca(), and we sadly have a few of those around. I've just got rid of one of those and I wish they would appear entirely as they are horrible in so many different ways. Sparse warns about them, btw.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 10:30:44PM +0200, Peter Zijlstra wrote: > On stack variable length arrays get implemented by the compiler doing > alloca(), and we sadly have a few of those around. I've just got rid of one of those and I wish they would appear entirely as they are horrible in so many different ways. Sparse warns about them, btw.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 12:15 AM, Al Virowrote: > Folks, seriously, have you even looked through that zoo? I have, and it's > really, really not fun. Sure, we can say "fuck 'em, no need to allow > splice() on random crap". Would be perfectly reasonable, expect that > it's not the only place doing kernel_write() and its ilk... Can you clarify this? I think we really may be able to do exactly this. From Christoph's list, there are only two things that need kernel_read/kernel_write to user-supplied fds that may come from a variety of sources: splice and exec. If you're execing a chardev from a crappy driver, something is seriously wrong. And returning -EINVAL from splice() to or from files that use ->read and ->write seems find (and splice(2) even documents -EINVAL as meaning that the target doesn't support splicing). --Andy
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 12:15 AM, Al Viro wrote: > Folks, seriously, have you even looked through that zoo? I have, and it's > really, really not fun. Sure, we can say "fuck 'em, no need to allow > splice() on random crap". Would be perfectly reasonable, expect that > it's not the only place doing kernel_write() and its ilk... Can you clarify this? I think we really may be able to do exactly this. From Christoph's list, there are only two things that need kernel_read/kernel_write to user-supplied fds that may come from a variety of sources: splice and exec. If you're execing a chardev from a crappy driver, something is seriously wrong. And returning -EINVAL from splice() to or from files that use ->read and ->write seems find (and splice(2) even documents -EINVAL as meaning that the target doesn't support splicing). --Andy
Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Mon, May 8, 2017 at 1:48 PM, Al Virowrote: > On Mon, May 08, 2017 at 04:06:35PM +0200, Jann Horn wrote: > >> I think Kees might be talking about >> https://bugs.chromium.org/p/project-zero/issues/detail?id=822, fixed in >> commit e6978e4bf181fb3b5f8cb6f71b4fe30fbf1b655c. The issue was that >> perf code that can run in pretty much any context called access_ok(). > > And that commit has *NOT* solved the problem. perf_callchain_user() > can be called synchronously, without passing through that code. > Tracepoint shite... > > That set_fs() should be done in get_perf_callchain(), just around the call of > perf_callchain_user(). Along with pagefault_disable(), actually. > Even that's not quite enough because of a different issue: perf nmis can hit during scheduling or when we're in lazy mm, leading to the entirely wrong set of page tables being used. We need nmi_uaccess_begin() and nmi_uaccess_end(), and the former needs to be allowed to fail. AFAIK this isn't presently a security problem because it mainly affects kernel threads, and you need to be root to profile them, but maybe there's some race where it does matter.
Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Mon, May 8, 2017 at 1:48 PM, Al Viro wrote: > On Mon, May 08, 2017 at 04:06:35PM +0200, Jann Horn wrote: > >> I think Kees might be talking about >> https://bugs.chromium.org/p/project-zero/issues/detail?id=822, fixed in >> commit e6978e4bf181fb3b5f8cb6f71b4fe30fbf1b655c. The issue was that >> perf code that can run in pretty much any context called access_ok(). > > And that commit has *NOT* solved the problem. perf_callchain_user() > can be called synchronously, without passing through that code. > Tracepoint shite... > > That set_fs() should be done in get_perf_callchain(), just around the call of > perf_callchain_user(). Along with pagefault_disable(), actually. > Even that's not quite enough because of a different issue: perf nmis can hit during scheduling or when we're in lazy mm, leading to the entirely wrong set of page tables being used. We need nmi_uaccess_begin() and nmi_uaccess_end(), and the former needs to be allowed to fail. AFAIK this isn't presently a security problem because it mainly affects kernel threads, and you need to be root to profile them, but maybe there's some race where it does matter.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 05:47:55PM -0400, Rik van Riel wrote: > > Seriously, look at these beasts. Overwriting ->addr_limit is nowhere > > near > > the top threat. If attacker can overwrite thread_info, you have > > lost. > > That is why THREAD_INFO_IN_TASK exists. It moves > the struct thread_info to a location away from the > stack, which means a stack overflow will not overwrite > the thread_info. ... in which case such attacks on ->addr_limit also become a non-issue. AFAICS, we are mixing several unrelated issues here: * amount of places where set_fs() is called. Sure, reducing it is a good idea and we want to move to primitives like kernel_write() et.al. Fewer users => lower odds of screwing it up. * making sure that remaining callers are properly paired. Ditto. * switching to ->read_iter()/->write_iter() where it makes sense. Again, no problem with that. * providing sane environment for places like perf/oprofile. Again, a good idea, and set_fs(USER_DS) is only a part of what's needed there. * switching _everything_ to ->read_iter()/->write_iter(). Flat-out insane and AFAICS nobody is signing up for that. * getting rid of set_fs() entirely. I'm afraid that it's not feasible without the previous one and frankly, I don't see much point. * sanity-checking on return to userland. Maybe useful, maybe not. * taking thread_info out of the way of stack overflows. Reasonable, but has very little to do with the rest of that. * protecting against Lovecraftian horrors slithering in from the outer space only to commit unspeakable acts against ->addr_limit and ignoring much tastier targets next to it, but then what do you expect from degenerate spawn of Great Old Ones - sanity?
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 05:47:55PM -0400, Rik van Riel wrote: > > Seriously, look at these beasts. Overwriting ->addr_limit is nowhere > > near > > the top threat. If attacker can overwrite thread_info, you have > > lost. > > That is why THREAD_INFO_IN_TASK exists. It moves > the struct thread_info to a location away from the > stack, which means a stack overflow will not overwrite > the thread_info. ... in which case such attacks on ->addr_limit also become a non-issue. AFAICS, we are mixing several unrelated issues here: * amount of places where set_fs() is called. Sure, reducing it is a good idea and we want to move to primitives like kernel_write() et.al. Fewer users => lower odds of screwing it up. * making sure that remaining callers are properly paired. Ditto. * switching to ->read_iter()/->write_iter() where it makes sense. Again, no problem with that. * providing sane environment for places like perf/oprofile. Again, a good idea, and set_fs(USER_DS) is only a part of what's needed there. * switching _everything_ to ->read_iter()/->write_iter(). Flat-out insane and AFAICS nobody is signing up for that. * getting rid of set_fs() entirely. I'm afraid that it's not feasible without the previous one and frankly, I don't see much point. * sanity-checking on return to userland. Maybe useful, maybe not. * taking thread_info out of the way of stack overflows. Reasonable, but has very little to do with the rest of that. * protecting against Lovecraftian horrors slithering in from the outer space only to commit unspeakable acts against ->addr_limit and ignoring much tastier targets next to it, but then what do you expect from degenerate spawn of Great Old Ones - sanity?
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 2:41 PM, Al Virowrote: > On Fri, May 12, 2017 at 02:17:19PM -0700, Kees Cook wrote: > >> Two things are at risk from stack exhaustion: thread_info (mainly >> addr_limit) when on the stack (fixed by THREAD_INFO_IN_TASK), and > > Really? Let's take a look at arm, for example: > > struct thread_info { > unsigned long flags; /* low level flags */ > int preempt_count; /* 0 => preemptable, <0 => > bug */ > mm_segment_taddr_limit; /* address limit */ > struct task_struct *task; /* main task structure */ > > and current() is defined as current_thread_info()->task. > > Seriously, look at these beasts. Overwriting ->addr_limit is nowhere near > the top threat. If attacker can overwrite thread_info, you have lost. I don't disagree, but the type of attack is different. If the attacker overwrites task_struct pointer, then they need to have built an false one, and that may be made difficult by PAN, or need to know more about kernel memory layout (rather than only stack depth), etc. Attacking addr_limit makes it very very easy to upgrade attack capabilities. I'm not say thread_info shouldn't be moved off the stack. -Kees -- Kees Cook Pixel Security
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 2:41 PM, Al Viro wrote: > On Fri, May 12, 2017 at 02:17:19PM -0700, Kees Cook wrote: > >> Two things are at risk from stack exhaustion: thread_info (mainly >> addr_limit) when on the stack (fixed by THREAD_INFO_IN_TASK), and > > Really? Let's take a look at arm, for example: > > struct thread_info { > unsigned long flags; /* low level flags */ > int preempt_count; /* 0 => preemptable, <0 => > bug */ > mm_segment_taddr_limit; /* address limit */ > struct task_struct *task; /* main task structure */ > > and current() is defined as current_thread_info()->task. > > Seriously, look at these beasts. Overwriting ->addr_limit is nowhere near > the top threat. If attacker can overwrite thread_info, you have lost. I don't disagree, but the type of attack is different. If the attacker overwrites task_struct pointer, then they need to have built an false one, and that may be made difficult by PAN, or need to know more about kernel memory layout (rather than only stack depth), etc. Attacking addr_limit makes it very very easy to upgrade attack capabilities. I'm not say thread_info shouldn't be moved off the stack. -Kees -- Kees Cook Pixel Security
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, 2017-05-12 at 22:41 +0100, Al Viro wrote: > On Fri, May 12, 2017 at 02:17:19PM -0700, Kees Cook wrote: > > > Two things are at risk from stack exhaustion: thread_info (mainly > > addr_limit) when on the stack (fixed by THREAD_INFO_IN_TASK), and > > Really? Let's take a look at arm, for example: > > struct thread_info { > unsigned long flags; /* low level flags */ > int preempt_count; /* 0 => preemptable, > <0 => bug */ > mm_segment_taddr_limit; /* address limit */ > struct task_struct *task; /* main task > structure */ > > and current() is defined as current_thread_info()->task. > > Seriously, look at these beasts. Overwriting ->addr_limit is nowhere > near > the top threat. If attacker can overwrite thread_info, you have > lost. That is why THREAD_INFO_IN_TASK exists. It moves the struct thread_info to a location away from the stack, which means a stack overflow will not overwrite the thread_info.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, 2017-05-12 at 22:41 +0100, Al Viro wrote: > On Fri, May 12, 2017 at 02:17:19PM -0700, Kees Cook wrote: > > > Two things are at risk from stack exhaustion: thread_info (mainly > > addr_limit) when on the stack (fixed by THREAD_INFO_IN_TASK), and > > Really? Let's take a look at arm, for example: > > struct thread_info { > unsigned long flags; /* low level flags */ > int preempt_count; /* 0 => preemptable, > <0 => bug */ > mm_segment_taddr_limit; /* address limit */ > struct task_struct *task; /* main task > structure */ > > and current() is defined as current_thread_info()->task. > > Seriously, look at these beasts. Overwriting ->addr_limit is nowhere > near > the top threat. If attacker can overwrite thread_info, you have > lost. That is why THREAD_INFO_IN_TASK exists. It moves the struct thread_info to a location away from the stack, which means a stack overflow will not overwrite the thread_info.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 02:17:19PM -0700, Kees Cook wrote: > Two things are at risk from stack exhaustion: thread_info (mainly > addr_limit) when on the stack (fixed by THREAD_INFO_IN_TASK), and Really? Let's take a look at arm, for example: struct thread_info { unsigned long flags; /* low level flags */ int preempt_count; /* 0 => preemptable, <0 => bug */ mm_segment_taddr_limit; /* address limit */ struct task_struct *task; /* main task structure */ and current() is defined as current_thread_info()->task. Seriously, look at these beasts. Overwriting ->addr_limit is nowhere near the top threat. If attacker can overwrite thread_info, you have lost.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 02:17:19PM -0700, Kees Cook wrote: > Two things are at risk from stack exhaustion: thread_info (mainly > addr_limit) when on the stack (fixed by THREAD_INFO_IN_TASK), and Really? Let's take a look at arm, for example: struct thread_info { unsigned long flags; /* low level flags */ int preempt_count; /* 0 => preemptable, <0 => bug */ mm_segment_taddr_limit; /* address limit */ struct task_struct *task; /* main task structure */ and current() is defined as current_thread_info()->task. Seriously, look at these beasts. Overwriting ->addr_limit is nowhere near the top threat. If attacker can overwrite thread_info, you have lost.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
> overflow into adjacent allocations (fixed by VMAP_STACK). 99% fixed, but it's possible to skip over the guard page without -fstack-check enabled (plus some edge cases need to be fixed in GCC), unless VLAs were forbidden in addition to the existing large frame size warning. I'm not sure about in-tree code, but Qualcomm had some of these improperly bounded VLA vulnerabilities in their MSM kernel...
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
> overflow into adjacent allocations (fixed by VMAP_STACK). 99% fixed, but it's possible to skip over the guard page without -fstack-check enabled (plus some edge cases need to be fixed in GCC), unless VLAs were forbidden in addition to the existing large frame size warning. I'm not sure about in-tree code, but Qualcomm had some of these improperly bounded VLA vulnerabilities in their MSM kernel...
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 2:06 PM, Al Virowrote: > On Fri, May 12, 2017 at 09:21:06PM +0100, Russell King - ARM Linux wrote: >> On Fri, May 12, 2017 at 12:30:02PM -0700, Kees Cook wrote: >> > I'm clearly not explaining things well enough. I shouldn't say >> > "corruption", I should say "malicious manipulation". The methodology >> > of attacks against the stack are quite different from the other kinds >> > of attacks like use-after-free, heap overflow, etc. Being able to >> > exhaust the kernel stack (either due to deep recursion or unbounded >> > alloca()) >> >> I really hope we don't have alloca() use in the kernel. Do you have >> evidence to support that assertion? >> >> IMHO alloca() (or similar) should not be present in any kernel code >> because we have a limited stack - we have kmalloc() etc for that kind >> of thing. > > No alloca(), but there are VLAs. Said that, the whole "what if they > can bugger thread_info and/or task_struct and go after set_fs() state" > is idiocy, of course - in that case the box is fucked, no matter what. Two things are at risk from stack exhaustion: thread_info (mainly addr_limit) when on the stack (fixed by THREAD_INFO_IN_TASK), and overflow into adjacent allocations (fixed by VMAP_STACK). The latter is fundamentally a heap overflow. -Kees -- Kees Cook Pixel Security
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 2:06 PM, Al Viro wrote: > On Fri, May 12, 2017 at 09:21:06PM +0100, Russell King - ARM Linux wrote: >> On Fri, May 12, 2017 at 12:30:02PM -0700, Kees Cook wrote: >> > I'm clearly not explaining things well enough. I shouldn't say >> > "corruption", I should say "malicious manipulation". The methodology >> > of attacks against the stack are quite different from the other kinds >> > of attacks like use-after-free, heap overflow, etc. Being able to >> > exhaust the kernel stack (either due to deep recursion or unbounded >> > alloca()) >> >> I really hope we don't have alloca() use in the kernel. Do you have >> evidence to support that assertion? >> >> IMHO alloca() (or similar) should not be present in any kernel code >> because we have a limited stack - we have kmalloc() etc for that kind >> of thing. > > No alloca(), but there are VLAs. Said that, the whole "what if they > can bugger thread_info and/or task_struct and go after set_fs() state" > is idiocy, of course - in that case the box is fucked, no matter what. Two things are at risk from stack exhaustion: thread_info (mainly addr_limit) when on the stack (fixed by THREAD_INFO_IN_TASK), and overflow into adjacent allocations (fixed by VMAP_STACK). The latter is fundamentally a heap overflow. -Kees -- Kees Cook Pixel Security
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, 2017-05-12 at 22:06 +0100, Al Viro wrote: > On Fri, May 12, 2017 at 09:21:06PM +0100, Russell King - ARM Linux > wrote: > > On Fri, May 12, 2017 at 12:30:02PM -0700, Kees Cook wrote: > > > I'm clearly not explaining things well enough. I shouldn't say > > > "corruption", I should say "malicious manipulation". The > > > methodology > > > of attacks against the stack are quite different from the other > > > kinds > > > of attacks like use-after-free, heap overflow, etc. Being able to > > > exhaust the kernel stack (either due to deep recursion or > > > unbounded > > > alloca()) > > > > I really hope we don't have alloca() use in the kernel. Do you have > > evidence to support that assertion? > > > > IMHO alloca() (or similar) should not be present in any kernel code > > because we have a limited stack - we have kmalloc() etc for that > > kind > > of thing. > > No alloca(), but there are VLAs. Said that, the whole "what if they > can bugger thread_info and/or task_struct and go after set_fs() state" > is idiocy, of course - in that case the box is fucked, no matter what. VMAP_STACK + -fstack-check would prevent exploiting even an unbounded VLA / alloca size vs. it being an arbitrary write. -fstack-check guarantees that there's one byte per page as the stack grows, although there are some unfortunate GCC bugs making it less than perfect right now... but they recently started caring about it more including making it near zero overhead as it was always supposed to be.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, 2017-05-12 at 22:06 +0100, Al Viro wrote: > On Fri, May 12, 2017 at 09:21:06PM +0100, Russell King - ARM Linux > wrote: > > On Fri, May 12, 2017 at 12:30:02PM -0700, Kees Cook wrote: > > > I'm clearly not explaining things well enough. I shouldn't say > > > "corruption", I should say "malicious manipulation". The > > > methodology > > > of attacks against the stack are quite different from the other > > > kinds > > > of attacks like use-after-free, heap overflow, etc. Being able to > > > exhaust the kernel stack (either due to deep recursion or > > > unbounded > > > alloca()) > > > > I really hope we don't have alloca() use in the kernel. Do you have > > evidence to support that assertion? > > > > IMHO alloca() (or similar) should not be present in any kernel code > > because we have a limited stack - we have kmalloc() etc for that > > kind > > of thing. > > No alloca(), but there are VLAs. Said that, the whole "what if they > can bugger thread_info and/or task_struct and go after set_fs() state" > is idiocy, of course - in that case the box is fucked, no matter what. VMAP_STACK + -fstack-check would prevent exploiting even an unbounded VLA / alloca size vs. it being an arbitrary write. -fstack-check guarantees that there's one byte per page as the stack grows, although there are some unfortunate GCC bugs making it less than perfect right now... but they recently started caring about it more including making it near zero overhead as it was always supposed to be.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 09:21:06PM +0100, Russell King - ARM Linux wrote: > On Fri, May 12, 2017 at 12:30:02PM -0700, Kees Cook wrote: > > I'm clearly not explaining things well enough. I shouldn't say > > "corruption", I should say "malicious manipulation". The methodology > > of attacks against the stack are quite different from the other kinds > > of attacks like use-after-free, heap overflow, etc. Being able to > > exhaust the kernel stack (either due to deep recursion or unbounded > > alloca()) > > I really hope we don't have alloca() use in the kernel. Do you have > evidence to support that assertion? > > IMHO alloca() (or similar) should not be present in any kernel code > because we have a limited stack - we have kmalloc() etc for that kind > of thing. No alloca(), but there are VLAs. Said that, the whole "what if they can bugger thread_info and/or task_struct and go after set_fs() state" is idiocy, of course - in that case the box is fucked, no matter what.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 09:21:06PM +0100, Russell King - ARM Linux wrote: > On Fri, May 12, 2017 at 12:30:02PM -0700, Kees Cook wrote: > > I'm clearly not explaining things well enough. I shouldn't say > > "corruption", I should say "malicious manipulation". The methodology > > of attacks against the stack are quite different from the other kinds > > of attacks like use-after-free, heap overflow, etc. Being able to > > exhaust the kernel stack (either due to deep recursion or unbounded > > alloca()) > > I really hope we don't have alloca() use in the kernel. Do you have > evidence to support that assertion? > > IMHO alloca() (or similar) should not be present in any kernel code > because we have a limited stack - we have kmalloc() etc for that kind > of thing. No alloca(), but there are VLAs. Said that, the whole "what if they can bugger thread_info and/or task_struct and go after set_fs() state" is idiocy, of course - in that case the box is fucked, no matter what.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 2:00 PM, Kees Cookwrote: > On Fri, May 12, 2017 at 1:45 PM, Russell King - ARM Linux > wrote: >> On Fri, May 12, 2017 at 10:30:44PM +0200, Peter Zijlstra wrote: >>> On Fri, May 12, 2017 at 09:21:06PM +0100, Russell King - ARM Linux wrote: >>> > On Fri, May 12, 2017 at 12:30:02PM -0700, Kees Cook wrote: >>> > > I'm clearly not explaining things well enough. I shouldn't say >>> > > "corruption", I should say "malicious manipulation". The methodology >>> > > of attacks against the stack are quite different from the other kinds >>> > > of attacks like use-after-free, heap overflow, etc. Being able to >>> > > exhaust the kernel stack (either due to deep recursion or unbounded >>> > > alloca()) >>> > >>> > I really hope we don't have alloca() use in the kernel. Do you have >>> > evidence to support that assertion? >>> > >>> > IMHO alloca() (or similar) should not be present in any kernel code >>> > because we have a limited stack - we have kmalloc() etc for that kind >>> > of thing. >>> >>> On stack variable length arrays get implemented by the compiler doing >>> alloca(), and we sadly have a few of those around. >> >> I hope their size is appropriately limited, but something tells me it >> would be foolish to assume that. >> >>> But yes, fully agreed on the desirability of alloca() and things. >> >> Hmm, I wonder if -fno-builtin-alloca would prevent those... it looks >> like it certainly would prevent an explicit alloca() call. > > Building with -Werror=vla is exciting. :) > > A lot of it is in crypto (which are relatively static sizes, just > using function callbacks), but there is plenty more. I meant to also paste an example (which is harmless, I haven't looked extensively at other examples): unsigned long alignmask = crypto_tfm_alg_alignmask(tfm); unsigned int size = crypto_tfm_alg_blocksize(tfm); u8 buffer[size + alignmask]; Looking at all the places (and having tried to remove a few of these in pstore), I think it might be quite frustrating to eliminate them all and then declare VLAs dead. I'm not against trying, though. :) -Kees -- Kees Cook Pixel Security
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 2:00 PM, Kees Cook wrote: > On Fri, May 12, 2017 at 1:45 PM, Russell King - ARM Linux > wrote: >> On Fri, May 12, 2017 at 10:30:44PM +0200, Peter Zijlstra wrote: >>> On Fri, May 12, 2017 at 09:21:06PM +0100, Russell King - ARM Linux wrote: >>> > On Fri, May 12, 2017 at 12:30:02PM -0700, Kees Cook wrote: >>> > > I'm clearly not explaining things well enough. I shouldn't say >>> > > "corruption", I should say "malicious manipulation". The methodology >>> > > of attacks against the stack are quite different from the other kinds >>> > > of attacks like use-after-free, heap overflow, etc. Being able to >>> > > exhaust the kernel stack (either due to deep recursion or unbounded >>> > > alloca()) >>> > >>> > I really hope we don't have alloca() use in the kernel. Do you have >>> > evidence to support that assertion? >>> > >>> > IMHO alloca() (or similar) should not be present in any kernel code >>> > because we have a limited stack - we have kmalloc() etc for that kind >>> > of thing. >>> >>> On stack variable length arrays get implemented by the compiler doing >>> alloca(), and we sadly have a few of those around. >> >> I hope their size is appropriately limited, but something tells me it >> would be foolish to assume that. >> >>> But yes, fully agreed on the desirability of alloca() and things. >> >> Hmm, I wonder if -fno-builtin-alloca would prevent those... it looks >> like it certainly would prevent an explicit alloca() call. > > Building with -Werror=vla is exciting. :) > > A lot of it is in crypto (which are relatively static sizes, just > using function callbacks), but there is plenty more. I meant to also paste an example (which is harmless, I haven't looked extensively at other examples): unsigned long alignmask = crypto_tfm_alg_alignmask(tfm); unsigned int size = crypto_tfm_alg_blocksize(tfm); u8 buffer[size + alignmask]; Looking at all the places (and having tried to remove a few of these in pstore), I think it might be quite frustrating to eliminate them all and then declare VLAs dead. I'm not against trying, though. :) -Kees -- Kees Cook Pixel Security
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 1:45 PM, Russell King - ARM Linuxwrote: > On Fri, May 12, 2017 at 10:30:44PM +0200, Peter Zijlstra wrote: >> On Fri, May 12, 2017 at 09:21:06PM +0100, Russell King - ARM Linux wrote: >> > On Fri, May 12, 2017 at 12:30:02PM -0700, Kees Cook wrote: >> > > I'm clearly not explaining things well enough. I shouldn't say >> > > "corruption", I should say "malicious manipulation". The methodology >> > > of attacks against the stack are quite different from the other kinds >> > > of attacks like use-after-free, heap overflow, etc. Being able to >> > > exhaust the kernel stack (either due to deep recursion or unbounded >> > > alloca()) >> > >> > I really hope we don't have alloca() use in the kernel. Do you have >> > evidence to support that assertion? >> > >> > IMHO alloca() (or similar) should not be present in any kernel code >> > because we have a limited stack - we have kmalloc() etc for that kind >> > of thing. >> >> On stack variable length arrays get implemented by the compiler doing >> alloca(), and we sadly have a few of those around. > > I hope their size is appropriately limited, but something tells me it > would be foolish to assume that. > >> But yes, fully agreed on the desirability of alloca() and things. > > Hmm, I wonder if -fno-builtin-alloca would prevent those... it looks > like it certainly would prevent an explicit alloca() call. Building with -Werror=vla is exciting. :) A lot of it is in crypto (which are relatively static sizes, just using function callbacks), but there is plenty more. -Kees -- Kees Cook Pixel Security
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 1:45 PM, Russell King - ARM Linux wrote: > On Fri, May 12, 2017 at 10:30:44PM +0200, Peter Zijlstra wrote: >> On Fri, May 12, 2017 at 09:21:06PM +0100, Russell King - ARM Linux wrote: >> > On Fri, May 12, 2017 at 12:30:02PM -0700, Kees Cook wrote: >> > > I'm clearly not explaining things well enough. I shouldn't say >> > > "corruption", I should say "malicious manipulation". The methodology >> > > of attacks against the stack are quite different from the other kinds >> > > of attacks like use-after-free, heap overflow, etc. Being able to >> > > exhaust the kernel stack (either due to deep recursion or unbounded >> > > alloca()) >> > >> > I really hope we don't have alloca() use in the kernel. Do you have >> > evidence to support that assertion? >> > >> > IMHO alloca() (or similar) should not be present in any kernel code >> > because we have a limited stack - we have kmalloc() etc for that kind >> > of thing. >> >> On stack variable length arrays get implemented by the compiler doing >> alloca(), and we sadly have a few of those around. > > I hope their size is appropriately limited, but something tells me it > would be foolish to assume that. > >> But yes, fully agreed on the desirability of alloca() and things. > > Hmm, I wonder if -fno-builtin-alloca would prevent those... it looks > like it certainly would prevent an explicit alloca() call. Building with -Werror=vla is exciting. :) A lot of it is in crypto (which are relatively static sizes, just using function callbacks), but there is plenty more. -Kees -- Kees Cook Pixel Security
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 10:30:44PM +0200, Peter Zijlstra wrote: > On Fri, May 12, 2017 at 09:21:06PM +0100, Russell King - ARM Linux wrote: > > On Fri, May 12, 2017 at 12:30:02PM -0700, Kees Cook wrote: > > > I'm clearly not explaining things well enough. I shouldn't say > > > "corruption", I should say "malicious manipulation". The methodology > > > of attacks against the stack are quite different from the other kinds > > > of attacks like use-after-free, heap overflow, etc. Being able to > > > exhaust the kernel stack (either due to deep recursion or unbounded > > > alloca()) > > > > I really hope we don't have alloca() use in the kernel. Do you have > > evidence to support that assertion? > > > > IMHO alloca() (or similar) should not be present in any kernel code > > because we have a limited stack - we have kmalloc() etc for that kind > > of thing. > > On stack variable length arrays get implemented by the compiler doing > alloca(), and we sadly have a few of those around. I hope their size is appropriately limited, but something tells me it would be foolish to assume that. > But yes, fully agreed on the desirability of alloca() and things. Hmm, I wonder if -fno-builtin-alloca would prevent those... it looks like it certainly would prevent an explicit alloca() call. -- RMK's Patch system: http://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up according to speedtest.net.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 10:30:44PM +0200, Peter Zijlstra wrote: > On Fri, May 12, 2017 at 09:21:06PM +0100, Russell King - ARM Linux wrote: > > On Fri, May 12, 2017 at 12:30:02PM -0700, Kees Cook wrote: > > > I'm clearly not explaining things well enough. I shouldn't say > > > "corruption", I should say "malicious manipulation". The methodology > > > of attacks against the stack are quite different from the other kinds > > > of attacks like use-after-free, heap overflow, etc. Being able to > > > exhaust the kernel stack (either due to deep recursion or unbounded > > > alloca()) > > > > I really hope we don't have alloca() use in the kernel. Do you have > > evidence to support that assertion? > > > > IMHO alloca() (or similar) should not be present in any kernel code > > because we have a limited stack - we have kmalloc() etc for that kind > > of thing. > > On stack variable length arrays get implemented by the compiler doing > alloca(), and we sadly have a few of those around. I hope their size is appropriately limited, but something tells me it would be foolish to assume that. > But yes, fully agreed on the desirability of alloca() and things. Hmm, I wonder if -fno-builtin-alloca would prevent those... it looks like it certainly would prevent an explicit alloca() call. -- RMK's Patch system: http://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up according to speedtest.net.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 09:21:06PM +0100, Russell King - ARM Linux wrote: > On Fri, May 12, 2017 at 12:30:02PM -0700, Kees Cook wrote: > > I'm clearly not explaining things well enough. I shouldn't say > > "corruption", I should say "malicious manipulation". The methodology > > of attacks against the stack are quite different from the other kinds > > of attacks like use-after-free, heap overflow, etc. Being able to > > exhaust the kernel stack (either due to deep recursion or unbounded > > alloca()) > > I really hope we don't have alloca() use in the kernel. Do you have > evidence to support that assertion? > > IMHO alloca() (or similar) should not be present in any kernel code > because we have a limited stack - we have kmalloc() etc for that kind > of thing. On stack variable length arrays get implemented by the compiler doing alloca(), and we sadly have a few of those around. But yes, fully agreed on the desirability of alloca() and things.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 09:21:06PM +0100, Russell King - ARM Linux wrote: > On Fri, May 12, 2017 at 12:30:02PM -0700, Kees Cook wrote: > > I'm clearly not explaining things well enough. I shouldn't say > > "corruption", I should say "malicious manipulation". The methodology > > of attacks against the stack are quite different from the other kinds > > of attacks like use-after-free, heap overflow, etc. Being able to > > exhaust the kernel stack (either due to deep recursion or unbounded > > alloca()) > > I really hope we don't have alloca() use in the kernel. Do you have > evidence to support that assertion? > > IMHO alloca() (or similar) should not be present in any kernel code > because we have a limited stack - we have kmalloc() etc for that kind > of thing. On stack variable length arrays get implemented by the compiler doing alloca(), and we sadly have a few of those around. But yes, fully agreed on the desirability of alloca() and things.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 12:30:02PM -0700, Kees Cook wrote: > I'm clearly not explaining things well enough. I shouldn't say > "corruption", I should say "malicious manipulation". The methodology > of attacks against the stack are quite different from the other kinds > of attacks like use-after-free, heap overflow, etc. Being able to > exhaust the kernel stack (either due to deep recursion or unbounded > alloca()) I really hope we don't have alloca() use in the kernel. Do you have evidence to support that assertion? IMHO alloca() (or similar) should not be present in any kernel code because we have a limited stack - we have kmalloc() etc for that kind of thing. -- RMK's Patch system: http://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up according to speedtest.net.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 12:30:02PM -0700, Kees Cook wrote: > I'm clearly not explaining things well enough. I shouldn't say > "corruption", I should say "malicious manipulation". The methodology > of attacks against the stack are quite different from the other kinds > of attacks like use-after-free, heap overflow, etc. Being able to > exhaust the kernel stack (either due to deep recursion or unbounded > alloca()) I really hope we don't have alloca() use in the kernel. Do you have evidence to support that assertion? IMHO alloca() (or similar) should not be present in any kernel code because we have a limited stack - we have kmalloc() etc for that kind of thing. -- RMK's Patch system: http://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up according to speedtest.net.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 12:08 PM, Linus Torvaldswrote: > On Fri, May 12, 2017 at 12:01 PM, Kees Cook wrote: >> Yeah, the risk for "corrupted addr_limit" is mainly a concern for >> archs with addr_limit on the kernel stack. If I'm reading things >> correctly, that means, from the archs I've been paying closer >> attention to, it's an issue for arm, mips, and powerpc: > > I don't understand why people are looking at addr_limit as some kind > of special thing. > > If somebody is smashing the stack and corrupting thread info data, the > game is over. addr_limit is the *least* of your problems, and it's not > even all that likely that it will be increasing (it's much more likely > that it would be overwritten with a smaller value). > > Quite frankly, this kind of idiotic discussion just makes me question > the whole idea of the patch. > > Any "security" that is this specific is not real security, it's just > masturbatory garbage. > > It may be worth checking that people use "set_fs()" properly. But stop > this idiotic crap. It just makes the kernel security people look like > the crazies. > > There are enough incompetent crazy security people, don't go there. > The kinds of things it is worth protecting against are the big class > of generic issues, not the kind of "oh, but imagine if a cosmic ray > flips this particular word in memory" kind of crap that ignores all > the other words of memory. > > Seriously, Kees. You are just making security people look bad. Stop it. I'm clearly not explaining things well enough. I shouldn't say "corruption", I should say "malicious manipulation". The methodology of attacks against the stack are quite different from the other kinds of attacks like use-after-free, heap overflow, etc. Being able to exhaust the kernel stack (either due to deep recursion or unbounded alloca()) means attackers can control a write to addr_limit, and then leverage that into an actual arbitrary write via subsequent calls to copy_to_user() pointed at kernel memory. This isn't theoretical, this is how those attacks are performed. It may sound crazy, but it's real. With thread_info off the stack, the whole problem goes away. It's wonderful that this has happened for x86, arm64, and s390. There are always going to be new methods of attack for everything, but while we slowly address the design weakness (set_fs()), we can fix the low hanging fruit too. -Kees -- Kees Cook Pixel Security
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 12:08 PM, Linus Torvalds wrote: > On Fri, May 12, 2017 at 12:01 PM, Kees Cook wrote: >> Yeah, the risk for "corrupted addr_limit" is mainly a concern for >> archs with addr_limit on the kernel stack. If I'm reading things >> correctly, that means, from the archs I've been paying closer >> attention to, it's an issue for arm, mips, and powerpc: > > I don't understand why people are looking at addr_limit as some kind > of special thing. > > If somebody is smashing the stack and corrupting thread info data, the > game is over. addr_limit is the *least* of your problems, and it's not > even all that likely that it will be increasing (it's much more likely > that it would be overwritten with a smaller value). > > Quite frankly, this kind of idiotic discussion just makes me question > the whole idea of the patch. > > Any "security" that is this specific is not real security, it's just > masturbatory garbage. > > It may be worth checking that people use "set_fs()" properly. But stop > this idiotic crap. It just makes the kernel security people look like > the crazies. > > There are enough incompetent crazy security people, don't go there. > The kinds of things it is worth protecting against are the big class > of generic issues, not the kind of "oh, but imagine if a cosmic ray > flips this particular word in memory" kind of crap that ignores all > the other words of memory. > > Seriously, Kees. You are just making security people look bad. Stop it. I'm clearly not explaining things well enough. I shouldn't say "corruption", I should say "malicious manipulation". The methodology of attacks against the stack are quite different from the other kinds of attacks like use-after-free, heap overflow, etc. Being able to exhaust the kernel stack (either due to deep recursion or unbounded alloca()) means attackers can control a write to addr_limit, and then leverage that into an actual arbitrary write via subsequent calls to copy_to_user() pointed at kernel memory. This isn't theoretical, this is how those attacks are performed. It may sound crazy, but it's real. With thread_info off the stack, the whole problem goes away. It's wonderful that this has happened for x86, arm64, and s390. There are always going to be new methods of attack for everything, but while we slowly address the design weakness (set_fs()), we can fix the low hanging fruit too. -Kees -- Kees Cook Pixel Security
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 12:01:59PM -0700, Kees Cook wrote: > Yeah, the risk for "corrupted addr_limit" is mainly a concern for > archs with addr_limit on the kernel stack. If I'm reading things > correctly, that means, from the archs I've been paying closer > attention to, it's an issue for arm, mips, and powerpc: I'd first want to uninline everything in uaccess.h first that makes use of access_ok() - which I think is something that needs to happen anyway. -- RMK's Patch system: http://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up according to speedtest.net.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 12:01:59PM -0700, Kees Cook wrote: > Yeah, the risk for "corrupted addr_limit" is mainly a concern for > archs with addr_limit on the kernel stack. If I'm reading things > correctly, that means, from the archs I've been paying closer > attention to, it's an issue for arm, mips, and powerpc: I'd first want to uninline everything in uaccess.h first that makes use of access_ok() - which I think is something that needs to happen anyway. -- RMK's Patch system: http://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up according to speedtest.net.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 12:01 PM, Kees Cookwrote: > > Yeah, the risk for "corrupted addr_limit" is mainly a concern for > archs with addr_limit on the kernel stack. If I'm reading things > correctly, that means, from the archs I've been paying closer > attention to, it's an issue for arm, mips, and powerpc: I don't understand why people are looking at addr_limit as some kind of special thing. If somebody is smashing the stack and corrupting thread info data, the game is over. addr_limit is the *least* of your problems, and it's not even all that likely that it will be increasing (it's much more likely that it would be overwritten with a smaller value). Quite frankly, this kind of idiotic discussion just makes me question the whole idea of the patch. Any "security" that is this specific is not real security, it's just masturbatory garbage. It may be worth checking that people use "set_fs()" properly. But stop this idiotic crap. It just makes the kernel security people look like the crazies. There are enough incompetent crazy security people, don't go there. The kinds of things it is worth protecting against are the big class of generic issues, not the kind of "oh, but imagine if a cosmic ray flips this particular word in memory" kind of crap that ignores all the other words of memory. Seriously, Kees. You are just making security people look bad. Stop it. Linus
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 12:01 PM, Kees Cook wrote: > > Yeah, the risk for "corrupted addr_limit" is mainly a concern for > archs with addr_limit on the kernel stack. If I'm reading things > correctly, that means, from the archs I've been paying closer > attention to, it's an issue for arm, mips, and powerpc: I don't understand why people are looking at addr_limit as some kind of special thing. If somebody is smashing the stack and corrupting thread info data, the game is over. addr_limit is the *least* of your problems, and it's not even all that likely that it will be increasing (it's much more likely that it would be overwritten with a smaller value). Quite frankly, this kind of idiotic discussion just makes me question the whole idea of the patch. Any "security" that is this specific is not real security, it's just masturbatory garbage. It may be worth checking that people use "set_fs()" properly. But stop this idiotic crap. It just makes the kernel security people look like the crazies. There are enough incompetent crazy security people, don't go there. The kinds of things it is worth protecting against are the big class of generic issues, not the kind of "oh, but imagine if a cosmic ray flips this particular word in memory" kind of crap that ignores all the other words of memory. Seriously, Kees. You are just making security people look bad. Stop it. Linus
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Thu, May 11, 2017 at 10:54 PM, Martin Schwidefskywrote: > On Thu, 11 May 2017 22:34:31 -0700 > Kees Cook wrote: > >> On Thu, May 11, 2017 at 10:28 PM, Martin Schwidefsky >> wrote: >> > On Thu, 11 May 2017 16:44:07 -0700 >> > Linus Torvalds wrote: >> > >> >> On Thu, May 11, 2017 at 4:17 PM, Thomas Garnier >> >> wrote: >> >> > >> >> > Ingo: Do you want the change as-is? Would you like it to be optional? >> >> > What do you think? >> >> >> >> I'm not ingo, but I don't like that patch. It's in the wrong place - >> >> that system call return code is too timing-critical to add address >> >> limit checks. >> >> >> >> Now what I think you *could* do is: >> >> >> >> - make "set_fs()" actually set a work flag in the current thread flags >> >> >> >> - do the test in the slow-path (syscall_return_slowpath). >> >> >> >> Yes, yes, that ends up being architecture-specific, but it's fairly >> >> simple. >> >> >> >> And it only slows down the system calls that actually use "set_fs()". >> >> Sure, it will slow those down a fair amount, but they are hopefully a >> >> small subset of all cases. >> >> >> >> How does that sound to people? Thats' where we currently do that >> >> >> >> if (IS_ENABLED(CONFIG_PROVE_LOCKING) && >> >> WARN(irqs_disabled(), "syscall %ld left IRQs disabled", >> >> regs->orig_ax)) >> >> local_irq_enable(); >> >> >> >> check too, which is a fairly similar issue. >> > >> > This is exactly what Heiko did for the s390 backend as a result of this >> > discussion. See the _CIF_ASCE_SECONDARY bit in arch/s390/kernel/entry.S, >> > for the hot patch the check for the bit is included in the general >> > _CIF_WORK test. Only the slow patch gets a bit slower. >> > >> > git commit b5a882fcf146c87cb6b67c6df353e1c042b8773d >> > "s390: restore address space when returning to user space". >> >> If I'm understanding this, it won't catch corruption of addr_limit >> during fast-path syscalls, though (i.e. addr_limit changed without a >> call to set_fs()). :( This addr_limit corruption is mostly only a risk >> archs without THREAD_INFO_IN_TASK, but it would still be nice to catch >> unbalanced set_fs() code, so I like the idea. I like getting rid of >> addr_limit entirely even more, but that'll take some time. :) > > Well for s390 there is no addr_limit as we use two separate address space > for kernel vs. user. The equivalent to the addr_limit corruption on a > fast-path syscall would be changing CR7 outside of set_fs. This boils > down to the question what we are protection against? Bad code with > unbalanced set_fs or evil code that changes addr_limit/CR7 outside of > set_fs Yeah, the risk for "corrupted addr_limit" is mainly a concern for archs with addr_limit on the kernel stack. If I'm reading things correctly, that means, from the archs I've been paying closer attention to, it's an issue for arm, mips, and powerpc: arch/arm/include/asm/uaccess.h: current_thread_info()->addr_limit = fs; arch/arm/include/asm/thread_info.h: (current_stack_pointer & ~(THREAD_SIZE - 1)); arch/mips/include/asm/uaccess.h:#define set_fs(x) (current_thread_info()->addr_limit = (x)) arch/mips/kernel/process.c:* task stacks at THREAD_SIZE - 32 arch/powerpc/include/asm/uaccess.h:#define set_fs(val) (current->thread.fs = (val)) arch/powerpc/kernel/process.c: struct pt_regs *regs = task_stack_page(current) + THREAD_SIZE; (s390 uses a register, x86 and arm64 implement THREAD_INFO_IN_TASK.) Targeting addr_limit through arbitrary write attacks isn't too common since ... it's an arbitrary write. The issue with addr_limit was that it can live on the kernel stack, which meant all kinds of stack-related bugs can lead to it getting stomped on. So, two goals to protect addr_limit: - get it off the stack to make the difficulty of corruption on par with other sensitive things that would require an arbitrary write flaw. - detect/block unbalanced set_fs() calls. If we can get the former addressed by the remaining architectures, then that class of attack will go away. For the latter, it sounds like Linus's slowpath-exit will work nicely. To me it looks like he architectures with addr_limit still on the stack would still benefit from always-check-addr_limit on syscall exit, but that would be arch-specific anyway. And then, of course, we've got the parallel task of just removing set_fs() entirely. :) -Kees -- Kees Cook Pixel Security
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Thu, May 11, 2017 at 10:54 PM, Martin Schwidefsky wrote: > On Thu, 11 May 2017 22:34:31 -0700 > Kees Cook wrote: > >> On Thu, May 11, 2017 at 10:28 PM, Martin Schwidefsky >> wrote: >> > On Thu, 11 May 2017 16:44:07 -0700 >> > Linus Torvalds wrote: >> > >> >> On Thu, May 11, 2017 at 4:17 PM, Thomas Garnier >> >> wrote: >> >> > >> >> > Ingo: Do you want the change as-is? Would you like it to be optional? >> >> > What do you think? >> >> >> >> I'm not ingo, but I don't like that patch. It's in the wrong place - >> >> that system call return code is too timing-critical to add address >> >> limit checks. >> >> >> >> Now what I think you *could* do is: >> >> >> >> - make "set_fs()" actually set a work flag in the current thread flags >> >> >> >> - do the test in the slow-path (syscall_return_slowpath). >> >> >> >> Yes, yes, that ends up being architecture-specific, but it's fairly >> >> simple. >> >> >> >> And it only slows down the system calls that actually use "set_fs()". >> >> Sure, it will slow those down a fair amount, but they are hopefully a >> >> small subset of all cases. >> >> >> >> How does that sound to people? Thats' where we currently do that >> >> >> >> if (IS_ENABLED(CONFIG_PROVE_LOCKING) && >> >> WARN(irqs_disabled(), "syscall %ld left IRQs disabled", >> >> regs->orig_ax)) >> >> local_irq_enable(); >> >> >> >> check too, which is a fairly similar issue. >> > >> > This is exactly what Heiko did for the s390 backend as a result of this >> > discussion. See the _CIF_ASCE_SECONDARY bit in arch/s390/kernel/entry.S, >> > for the hot patch the check for the bit is included in the general >> > _CIF_WORK test. Only the slow patch gets a bit slower. >> > >> > git commit b5a882fcf146c87cb6b67c6df353e1c042b8773d >> > "s390: restore address space when returning to user space". >> >> If I'm understanding this, it won't catch corruption of addr_limit >> during fast-path syscalls, though (i.e. addr_limit changed without a >> call to set_fs()). :( This addr_limit corruption is mostly only a risk >> archs without THREAD_INFO_IN_TASK, but it would still be nice to catch >> unbalanced set_fs() code, so I like the idea. I like getting rid of >> addr_limit entirely even more, but that'll take some time. :) > > Well for s390 there is no addr_limit as we use two separate address space > for kernel vs. user. The equivalent to the addr_limit corruption on a > fast-path syscall would be changing CR7 outside of set_fs. This boils > down to the question what we are protection against? Bad code with > unbalanced set_fs or evil code that changes addr_limit/CR7 outside of > set_fs Yeah, the risk for "corrupted addr_limit" is mainly a concern for archs with addr_limit on the kernel stack. If I'm reading things correctly, that means, from the archs I've been paying closer attention to, it's an issue for arm, mips, and powerpc: arch/arm/include/asm/uaccess.h: current_thread_info()->addr_limit = fs; arch/arm/include/asm/thread_info.h: (current_stack_pointer & ~(THREAD_SIZE - 1)); arch/mips/include/asm/uaccess.h:#define set_fs(x) (current_thread_info()->addr_limit = (x)) arch/mips/kernel/process.c:* task stacks at THREAD_SIZE - 32 arch/powerpc/include/asm/uaccess.h:#define set_fs(val) (current->thread.fs = (val)) arch/powerpc/kernel/process.c: struct pt_regs *regs = task_stack_page(current) + THREAD_SIZE; (s390 uses a register, x86 and arm64 implement THREAD_INFO_IN_TASK.) Targeting addr_limit through arbitrary write attacks isn't too common since ... it's an arbitrary write. The issue with addr_limit was that it can live on the kernel stack, which meant all kinds of stack-related bugs can lead to it getting stomped on. So, two goals to protect addr_limit: - get it off the stack to make the difficulty of corruption on par with other sensitive things that would require an arbitrary write flaw. - detect/block unbalanced set_fs() calls. If we can get the former addressed by the remaining architectures, then that class of attack will go away. For the latter, it sounds like Linus's slowpath-exit will work nicely. To me it looks like he architectures with addr_limit still on the stack would still benefit from always-check-addr_limit on syscall exit, but that would be arch-specific anyway. And then, of course, we've got the parallel task of just removing set_fs() entirely. :) -Kees -- Kees Cook Pixel Security
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Thu, May 11, 2017 at 11:58 PM, Ingo Molnarwrote: > > * Linus Torvalds wrote: > >> On Thu, May 11, 2017 at 4:17 PM, Thomas Garnier wrote: >> > >> > Ingo: Do you want the change as-is? Would you like it to be optional? >> > What do you think? >> >> I'm not ingo, but I don't like that patch. It's in the wrong place - >> that system call return code is too timing-critical to add address >> limit checks. >> >> Now what I think you *could* do is: >> >> - make "set_fs()" actually set a work flag in the current thread flags >> >> - do the test in the slow-path (syscall_return_slowpath). >> >> Yes, yes, that ends up being architecture-specific, but it's fairly simple. >> >> And it only slows down the system calls that actually use "set_fs()". >> Sure, it will slow those down a fair amount, but they are hopefully a >> small subset of all cases. >> >> How does that sound to people? Thats' where we currently do that >> >> if (IS_ENABLED(CONFIG_PROVE_LOCKING) && >> WARN(irqs_disabled(), "syscall %ld left IRQs disabled", >> regs->orig_ax)) >> local_irq_enable(); >> >> check too, which is a fairly similar issue. > > I really like that idea and I'd be perfectly fine with that solution, because > it > puts the overhead where the problem comes from, and adds an extra incentive > for > code to move away from set_fs() facilities. Win-win. Great, I will adapt the patch for that. > > Thanks, > > Ingo -- Thomas
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Thu, May 11, 2017 at 11:58 PM, Ingo Molnar wrote: > > * Linus Torvalds wrote: > >> On Thu, May 11, 2017 at 4:17 PM, Thomas Garnier wrote: >> > >> > Ingo: Do you want the change as-is? Would you like it to be optional? >> > What do you think? >> >> I'm not ingo, but I don't like that patch. It's in the wrong place - >> that system call return code is too timing-critical to add address >> limit checks. >> >> Now what I think you *could* do is: >> >> - make "set_fs()" actually set a work flag in the current thread flags >> >> - do the test in the slow-path (syscall_return_slowpath). >> >> Yes, yes, that ends up being architecture-specific, but it's fairly simple. >> >> And it only slows down the system calls that actually use "set_fs()". >> Sure, it will slow those down a fair amount, but they are hopefully a >> small subset of all cases. >> >> How does that sound to people? Thats' where we currently do that >> >> if (IS_ENABLED(CONFIG_PROVE_LOCKING) && >> WARN(irqs_disabled(), "syscall %ld left IRQs disabled", >> regs->orig_ax)) >> local_irq_enable(); >> >> check too, which is a fairly similar issue. > > I really like that idea and I'd be perfectly fine with that solution, because > it > puts the overhead where the problem comes from, and adds an extra incentive > for > code to move away from set_fs() facilities. Win-win. Great, I will adapt the patch for that. > > Thanks, > > Ingo -- Thomas
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 10:11 AM, Al Virowrote: > Anyway, what's special about modules? IDGI... One of the arguments that came up earlier was code in external modules being mostly unaudited, sometimes without any source code available at all but still used in devices. If modules can't do set_fs() any more, this could eliminate bugs with unpaired set_fs in those modules. Limiting factors of course are: - embedded systems that ship come with their own kernels (as opposed to using whatever users have, or relying on binary distros) can just make it available to modules again, by reverting the patch - As Christoph said, they could have an open-coded set_fs in the driver - Whatever other method a clueless driver write might come up with isn't necessarily better than set_fs(). Arnd
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 10:11 AM, Al Viro wrote: > Anyway, what's special about modules? IDGI... One of the arguments that came up earlier was code in external modules being mostly unaudited, sometimes without any source code available at all but still used in devices. If modules can't do set_fs() any more, this could eliminate bugs with unpaired set_fs in those modules. Limiting factors of course are: - embedded systems that ship come with their own kernels (as opposed to using whatever users have, or relying on binary distros) can just make it available to modules again, by reverting the patch - As Christoph said, they could have an open-coded set_fs in the driver - Whatever other method a clueless driver write might come up with isn't necessarily better than set_fs(). Arnd
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 01:11:26AM -0700, Christoph Hellwig wrote: > But it won't help against exploits modifying addr_limit manually. Or the ones setting current->cred to that of init. Your point being?
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 01:11:26AM -0700, Christoph Hellwig wrote: > But it won't help against exploits modifying addr_limit manually. Or the ones setting current->cred to that of init. Your point being?
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 09:43:40AM +0200, Arnd Bergmann wrote: > How realistic and how useful would it be to first completely eliminate > the ones that are in loadable modules and then wrapping the definition > in #ifndef MODULE (or even make it an extern function)? Eliminate _what_? ->read() and ->write() instances? > This should be a fairly complete list of the modular users: > > drivers/block/drbd/drbd_main.c: set_fs(KERNEL_DS); Ah, set_fs()... Sure, many of those can be killed off. Wouldn't be a bad idea, but I don't understand what difference does modular/built-in make here... This one: AFAICS doesn't give a damn about set_fs() at all. > drivers/input/serio/hp_sdc.c: set_fs(KERNEL_DS); Open-coded probe_kernel_read(), apparently. > drivers/media/v4l2-core/v4l2-compat-ioctl32.c: set_fs(KERNEL_DS); massive compat ioctl crap. > drivers/misc/lkdtm_bugs.c: set_fs(KERNEL_DS); insane. > drivers/s390/crypto/pkey_api.c: set_fs(KERNEL_DS); No idea. > drivers/staging/comedi/drivers/serial2002.c:set_fs(KERNEL_DS); Open-coded kernel_write(); to some character device, no less... And similar for kernel_read(), apparently. > drivers/staging/lustre/lnet/libcfs/tracefile.c: set_fs(get_ds()); Fuck knows; kernel_write() might do it. Depends upon what it's writing to. You've missed other places in lustre, BTW - including the ioctls on sockets, etc. > drivers/staging/media/atomisp/pci/atomisp2/atomisp_compat_ioctl32.c: > set_fs(KERNEL_DS); Compat ioctl crap, again. > drivers/staging/rtl8723bs/os_dep/osdep_service.c: oldfs > = get_fs(); set_fs(get_ds()); Oh, lovely - reading an arbitrary (as in, specified by pathname) file. Firmware (mis)handling? > drivers/usb/gadget/function/f_mass_storage.c: set_fs(get_ds()); No idea. > drivers/usb/gadget/function/u_uac1.c: set_fs(KERNEL_DS); kernel_write(), by the look of it. Or something similar. > drivers/vhost/vhost.c: set_fs(USER_DS); kernel thread doing use_mm() > drivers/video/fbdev/core/fbmem.c: set_fs(KERNEL_DS); compat ioctl. > drivers/video/fbdev/hpfb.c: set_fs(KERNEL_DS); probe_kernel_read() > fs/autofs4/waitq.c: set_fs(KERNEL_DS); kernel_write() > fs/binfmt_aout.c: set_fs(KERNEL_DS); > fs/binfmt_elf.c:set_fs(USER_DS); > fs/binfmt_elf_fdpic.c: set_fs(KERNEL_DS); coredump stuff. > fs/btrfs/send.c:set_fs(KERNEL_DS); kernel_write() Anyway, what's special about modules? IDGI...
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 09:43:40AM +0200, Arnd Bergmann wrote: > How realistic and how useful would it be to first completely eliminate > the ones that are in loadable modules and then wrapping the definition > in #ifndef MODULE (or even make it an extern function)? Eliminate _what_? ->read() and ->write() instances? > This should be a fairly complete list of the modular users: > > drivers/block/drbd/drbd_main.c: set_fs(KERNEL_DS); Ah, set_fs()... Sure, many of those can be killed off. Wouldn't be a bad idea, but I don't understand what difference does modular/built-in make here... This one: AFAICS doesn't give a damn about set_fs() at all. > drivers/input/serio/hp_sdc.c: set_fs(KERNEL_DS); Open-coded probe_kernel_read(), apparently. > drivers/media/v4l2-core/v4l2-compat-ioctl32.c: set_fs(KERNEL_DS); massive compat ioctl crap. > drivers/misc/lkdtm_bugs.c: set_fs(KERNEL_DS); insane. > drivers/s390/crypto/pkey_api.c: set_fs(KERNEL_DS); No idea. > drivers/staging/comedi/drivers/serial2002.c:set_fs(KERNEL_DS); Open-coded kernel_write(); to some character device, no less... And similar for kernel_read(), apparently. > drivers/staging/lustre/lnet/libcfs/tracefile.c: set_fs(get_ds()); Fuck knows; kernel_write() might do it. Depends upon what it's writing to. You've missed other places in lustre, BTW - including the ioctls on sockets, etc. > drivers/staging/media/atomisp/pci/atomisp2/atomisp_compat_ioctl32.c: > set_fs(KERNEL_DS); Compat ioctl crap, again. > drivers/staging/rtl8723bs/os_dep/osdep_service.c: oldfs > = get_fs(); set_fs(get_ds()); Oh, lovely - reading an arbitrary (as in, specified by pathname) file. Firmware (mis)handling? > drivers/usb/gadget/function/f_mass_storage.c: set_fs(get_ds()); No idea. > drivers/usb/gadget/function/u_uac1.c: set_fs(KERNEL_DS); kernel_write(), by the look of it. Or something similar. > drivers/vhost/vhost.c: set_fs(USER_DS); kernel thread doing use_mm() > drivers/video/fbdev/core/fbmem.c: set_fs(KERNEL_DS); compat ioctl. > drivers/video/fbdev/hpfb.c: set_fs(KERNEL_DS); probe_kernel_read() > fs/autofs4/waitq.c: set_fs(KERNEL_DS); kernel_write() > fs/binfmt_aout.c: set_fs(KERNEL_DS); > fs/binfmt_elf.c:set_fs(USER_DS); > fs/binfmt_elf_fdpic.c: set_fs(KERNEL_DS); coredump stuff. > fs/btrfs/send.c:set_fs(KERNEL_DS); kernel_write() Anyway, what's special about modules? IDGI...
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 09:43:40AM +0200, Arnd Bergmann wrote: > How realistic and how useful would it be to first completely eliminate > the ones that are in loadable modules and then wrapping the definition > in #ifndef MODULE (or even make it an extern function)? Should be fairly doable and might be a nice step towards cleaning the mess up. In fact with my seres a large part of those are gone, and most of the remaining handler are ioctl handlers or what seems like opencoded versions of probe_kernel_read. But it won't help against exploits modifying addr_limit manually.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 09:43:40AM +0200, Arnd Bergmann wrote: > How realistic and how useful would it be to first completely eliminate > the ones that are in loadable modules and then wrapping the definition > in #ifndef MODULE (or even make it an extern function)? Should be fairly doable and might be a nice step towards cleaning the mess up. In fact with my seres a large part of those are gone, and most of the remaining handler are ioctl handlers or what seems like opencoded versions of probe_kernel_read. But it won't help against exploits modifying addr_limit manually.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 9:15 AM, Al Virowrote: > On Fri, May 12, 2017 at 09:00:12AM +0200, Ingo Molnar wrote: > >> > How about trying to remove all of them? If we could actually get rid >> > of all of them, we could drop the arch support, and we'd get faster, >> > simpler, shorter uaccess code throughout the kernel. >> >> I'm all for that! > > Oh, for... Ingo, do you really want to go through all ->write() and ->read() > instances, converting all of them to iov_iter? Or, better yet, deal with > the patch flood from Nick Krause sock puppet brigade? How realistic and how useful would it be to first completely eliminate the ones that are in loadable modules and then wrapping the definition in #ifndef MODULE (or even make it an extern function)? This should be a fairly complete list of the modular users: drivers/block/drbd/drbd_main.c: set_fs(KERNEL_DS); drivers/input/serio/hp_sdc.c: set_fs(KERNEL_DS); drivers/media/v4l2-core/v4l2-compat-ioctl32.c: set_fs(KERNEL_DS); drivers/misc/lkdtm_bugs.c: set_fs(KERNEL_DS); drivers/s390/crypto/pkey_api.c: set_fs(KERNEL_DS); drivers/staging/comedi/drivers/serial2002.c:set_fs(KERNEL_DS); drivers/staging/lustre/lnet/libcfs/tracefile.c: set_fs(get_ds()); drivers/staging/media/atomisp/pci/atomisp2/atomisp_compat_ioctl32.c: set_fs(KERNEL_DS); drivers/staging/rtl8723bs/os_dep/osdep_service.c: oldfs = get_fs(); set_fs(get_ds()); drivers/usb/gadget/function/f_mass_storage.c: set_fs(get_ds()); drivers/usb/gadget/function/u_uac1.c: set_fs(KERNEL_DS); drivers/vhost/vhost.c: set_fs(USER_DS); drivers/video/fbdev/core/fbmem.c: set_fs(KERNEL_DS); drivers/video/fbdev/hpfb.c: set_fs(KERNEL_DS); fs/autofs4/waitq.c: set_fs(KERNEL_DS); fs/binfmt_aout.c: set_fs(KERNEL_DS); fs/binfmt_elf.c:set_fs(USER_DS); fs/binfmt_elf_fdpic.c: set_fs(KERNEL_DS); fs/btrfs/send.c:set_fs(KERNEL_DS); fs/ext4/ioctl.c:set_fs(KERNEL_DS); fs/nfsd/vfs.c: set_fs(KERNEL_DS); net/9p/trans_fd.c: set_fs(get_ds()); net/ipv6/addrconf.c:set_fs(KERNEL_DS); net/ipv6/exthdrs.c: set_fs(KERNEL_DS); net/sunrpc/svcsock.c: oldfs = get_fs(); set_fs(KERNEL_DS); sound/core/oss/pcm_oss.c: set_fs(get_ds()); sound/core/pcm_native.c:set_fs(get_ds()); sound/drivers/opl3/opl3_oss.c: set_fs(get_ds()); sound/oss/dmabuf.c: set_fs(get_ds()); sound/oss/swarm_cs4297a.c:set_fs(KERNEL_DS); sound/pci/emu10k1/emufx.c: set_fs(get_ds()); sound/pci/hda/hda_codec.c: set_fs(get_ds()); Arnd
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 9:15 AM, Al Viro wrote: > On Fri, May 12, 2017 at 09:00:12AM +0200, Ingo Molnar wrote: > >> > How about trying to remove all of them? If we could actually get rid >> > of all of them, we could drop the arch support, and we'd get faster, >> > simpler, shorter uaccess code throughout the kernel. >> >> I'm all for that! > > Oh, for... Ingo, do you really want to go through all ->write() and ->read() > instances, converting all of them to iov_iter? Or, better yet, deal with > the patch flood from Nick Krause sock puppet brigade? How realistic and how useful would it be to first completely eliminate the ones that are in loadable modules and then wrapping the definition in #ifndef MODULE (or even make it an extern function)? This should be a fairly complete list of the modular users: drivers/block/drbd/drbd_main.c: set_fs(KERNEL_DS); drivers/input/serio/hp_sdc.c: set_fs(KERNEL_DS); drivers/media/v4l2-core/v4l2-compat-ioctl32.c: set_fs(KERNEL_DS); drivers/misc/lkdtm_bugs.c: set_fs(KERNEL_DS); drivers/s390/crypto/pkey_api.c: set_fs(KERNEL_DS); drivers/staging/comedi/drivers/serial2002.c:set_fs(KERNEL_DS); drivers/staging/lustre/lnet/libcfs/tracefile.c: set_fs(get_ds()); drivers/staging/media/atomisp/pci/atomisp2/atomisp_compat_ioctl32.c: set_fs(KERNEL_DS); drivers/staging/rtl8723bs/os_dep/osdep_service.c: oldfs = get_fs(); set_fs(get_ds()); drivers/usb/gadget/function/f_mass_storage.c: set_fs(get_ds()); drivers/usb/gadget/function/u_uac1.c: set_fs(KERNEL_DS); drivers/vhost/vhost.c: set_fs(USER_DS); drivers/video/fbdev/core/fbmem.c: set_fs(KERNEL_DS); drivers/video/fbdev/hpfb.c: set_fs(KERNEL_DS); fs/autofs4/waitq.c: set_fs(KERNEL_DS); fs/binfmt_aout.c: set_fs(KERNEL_DS); fs/binfmt_elf.c:set_fs(USER_DS); fs/binfmt_elf_fdpic.c: set_fs(KERNEL_DS); fs/btrfs/send.c:set_fs(KERNEL_DS); fs/ext4/ioctl.c:set_fs(KERNEL_DS); fs/nfsd/vfs.c: set_fs(KERNEL_DS); net/9p/trans_fd.c: set_fs(get_ds()); net/ipv6/addrconf.c:set_fs(KERNEL_DS); net/ipv6/exthdrs.c: set_fs(KERNEL_DS); net/sunrpc/svcsock.c: oldfs = get_fs(); set_fs(KERNEL_DS); sound/core/oss/pcm_oss.c: set_fs(get_ds()); sound/core/pcm_native.c:set_fs(get_ds()); sound/drivers/opl3/opl3_oss.c: set_fs(get_ds()); sound/oss/dmabuf.c: set_fs(get_ds()); sound/oss/swarm_cs4297a.c:set_fs(KERNEL_DS); sound/pci/emu10k1/emufx.c: set_fs(get_ds()); sound/pci/hda/hda_codec.c: set_fs(get_ds()); Arnd
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 08:15:49AM +0100, Al Viro wrote: > And converting everything to ->read_iter()/->write_iter() means an insane > amount of code churn, not to mention coping with random bogosities in > semantics. ->read() and ->write() are going to stay around, pretty > much indefinitely. But I don't think kernel users of them have to. I've been digging through the calllers and will send an analysis to the list in a bit.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 08:15:49AM +0100, Al Viro wrote: > And converting everything to ->read_iter()/->write_iter() means an insane > amount of code churn, not to mention coping with random bogosities in > semantics. ->read() and ->write() are going to stay around, pretty > much indefinitely. But I don't think kernel users of them have to. I've been digging through the calllers and will send an analysis to the list in a bit.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 09:00:12AM +0200, Ingo Molnar wrote: > > How about trying to remove all of them? If we could actually get rid > > of all of them, we could drop the arch support, and we'd get faster, > > simpler, shorter uaccess code throughout the kernel. > > I'm all for that! Oh, for... Ingo, do you really want to go through all ->write() and ->read() instances, converting all of them to iov_iter? Or, better yet, deal with the patch flood from Nick Krause sock puppet brigade? Folks, seriously, have you even looked through that zoo? I have, and it's really, really not fun. Sure, we can say "fuck 'em, no need to allow splice() on random crap". Would be perfectly reasonable, expect that it's not the only place doing kernel_write() and its ilk... And converting everything to ->read_iter()/->write_iter() means an insane amount of code churn, not to mention coping with random bogosities in semantics. ->read() and ->write() are going to stay around, pretty much indefinitely.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Fri, May 12, 2017 at 09:00:12AM +0200, Ingo Molnar wrote: > > How about trying to remove all of them? If we could actually get rid > > of all of them, we could drop the arch support, and we'd get faster, > > simpler, shorter uaccess code throughout the kernel. > > I'm all for that! Oh, for... Ingo, do you really want to go through all ->write() and ->read() instances, converting all of them to iov_iter? Or, better yet, deal with the patch flood from Nick Krause sock puppet brigade? Folks, seriously, have you even looked through that zoo? I have, and it's really, really not fun. Sure, we can say "fuck 'em, no need to allow splice() on random crap". Would be perfectly reasonable, expect that it's not the only place doing kernel_write() and its ilk... And converting everything to ->read_iter()/->write_iter() means an insane amount of code churn, not to mention coping with random bogosities in semantics. ->read() and ->write() are going to stay around, pretty much indefinitely.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
* Andy Lutomirskiwrote: > On Tue, May 9, 2017 at 1:56 AM, Christoph Hellwig wrote: > > On Tue, May 09, 2017 at 08:45:22AM +0200, Ingo Molnar wrote: > >> We only have ~115 code blocks in the kernel that set/restore KERNEL_DS, it > >> would > >> be a pity to add a runtime check to every system call ... > > > > I think we should simply strive to remove all of them that aren't > > in core scheduler / arch code. Basically evetyytime we do the > > > > oldfs = get_fs(); > > set_fs(KERNEL_DS); > > .. > > set_fs(oldfs); > > > > trick we're doing something wrong, and there should always be better > > ways to archive it. E.g. using iov_iter with a ITER_KVEC type > > consistently would already remove most of them. > > How about trying to remove all of them? If we could actually get rid > of all of them, we could drop the arch support, and we'd get faster, > simpler, shorter uaccess code throughout the kernel. I'm all for that! Thanks, Ingo
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
* Andy Lutomirski wrote: > On Tue, May 9, 2017 at 1:56 AM, Christoph Hellwig wrote: > > On Tue, May 09, 2017 at 08:45:22AM +0200, Ingo Molnar wrote: > >> We only have ~115 code blocks in the kernel that set/restore KERNEL_DS, it > >> would > >> be a pity to add a runtime check to every system call ... > > > > I think we should simply strive to remove all of them that aren't > > in core scheduler / arch code. Basically evetyytime we do the > > > > oldfs = get_fs(); > > set_fs(KERNEL_DS); > > .. > > set_fs(oldfs); > > > > trick we're doing something wrong, and there should always be better > > ways to archive it. E.g. using iov_iter with a ITER_KVEC type > > consistently would already remove most of them. > > How about trying to remove all of them? If we could actually get rid > of all of them, we could drop the arch support, and we'd get faster, > simpler, shorter uaccess code throughout the kernel. I'm all for that! Thanks, Ingo
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
* Linus Torvaldswrote: > On Thu, May 11, 2017 at 4:17 PM, Thomas Garnier wrote: > > > > Ingo: Do you want the change as-is? Would you like it to be optional? > > What do you think? > > I'm not ingo, but I don't like that patch. It's in the wrong place - > that system call return code is too timing-critical to add address > limit checks. > > Now what I think you *could* do is: > > - make "set_fs()" actually set a work flag in the current thread flags > > - do the test in the slow-path (syscall_return_slowpath). > > Yes, yes, that ends up being architecture-specific, but it's fairly simple. > > And it only slows down the system calls that actually use "set_fs()". > Sure, it will slow those down a fair amount, but they are hopefully a > small subset of all cases. > > How does that sound to people? Thats' where we currently do that > > if (IS_ENABLED(CONFIG_PROVE_LOCKING) && > WARN(irqs_disabled(), "syscall %ld left IRQs disabled", > regs->orig_ax)) > local_irq_enable(); > > check too, which is a fairly similar issue. I really like that idea and I'd be perfectly fine with that solution, because it puts the overhead where the problem comes from, and adds an extra incentive for code to move away from set_fs() facilities. Win-win. Thanks, Ingo
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
* Linus Torvalds wrote: > On Thu, May 11, 2017 at 4:17 PM, Thomas Garnier wrote: > > > > Ingo: Do you want the change as-is? Would you like it to be optional? > > What do you think? > > I'm not ingo, but I don't like that patch. It's in the wrong place - > that system call return code is too timing-critical to add address > limit checks. > > Now what I think you *could* do is: > > - make "set_fs()" actually set a work flag in the current thread flags > > - do the test in the slow-path (syscall_return_slowpath). > > Yes, yes, that ends up being architecture-specific, but it's fairly simple. > > And it only slows down the system calls that actually use "set_fs()". > Sure, it will slow those down a fair amount, but they are hopefully a > small subset of all cases. > > How does that sound to people? Thats' where we currently do that > > if (IS_ENABLED(CONFIG_PROVE_LOCKING) && > WARN(irqs_disabled(), "syscall %ld left IRQs disabled", > regs->orig_ax)) > local_irq_enable(); > > check too, which is a fairly similar issue. I really like that idea and I'd be perfectly fine with that solution, because it puts the overhead where the problem comes from, and adds an extra incentive for code to move away from set_fs() facilities. Win-win. Thanks, Ingo
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
* Kees Cookwrote: > > git commit b5a882fcf146c87cb6b67c6df353e1c042b8773d > > "s390: restore address space when returning to user space". > > If I'm understanding this, it won't catch corruption of addr_limit > during fast-path syscalls, though (i.e. addr_limit changed without a > call to set_fs()). :( Nor does it, or the patch you propose, protect against against something corrupting task->mm pointer, or the task->*uid values, or any of the myriads of security relevant values stored in the task structure! Making sure API (set_fs()) usage is bug-free and protecting against the effects of general data corruption are two unrelated things that should not mixed. Thanks, Ingo
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
* Kees Cook wrote: > > git commit b5a882fcf146c87cb6b67c6df353e1c042b8773d > > "s390: restore address space when returning to user space". > > If I'm understanding this, it won't catch corruption of addr_limit > during fast-path syscalls, though (i.e. addr_limit changed without a > call to set_fs()). :( Nor does it, or the patch you propose, protect against against something corrupting task->mm pointer, or the task->*uid values, or any of the myriads of security relevant values stored in the task structure! Making sure API (set_fs()) usage is bug-free and protecting against the effects of general data corruption are two unrelated things that should not mixed. Thanks, Ingo
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
[resending because kernel.org seems to have mangled my SMTP credentials. I wonder if this is a common problem.] On Thu, May 11, 2017 at 4:44 PM, Linus Torvaldswrote: > On Thu, May 11, 2017 at 4:17 PM, Thomas Garnier wrote: >> >> Ingo: Do you want the change as-is? Would you like it to be optional? >> What do you think? > > I'm not ingo, but I don't like that patch. It's in the wrong place - > that system call return code is too timing-critical to add address > limit checks. > > Now what I think you *could* do is: > > - make "set_fs()" actually set a work flag in the current thread flags > > - do the test in the slow-path (syscall_return_slowpath). > > Yes, yes, that ends up being architecture-specific, but it's fairly simple. > > And it only slows down the system calls that actually use "set_fs()". > Sure, it will slow those down a fair amount, but they are hopefully a > small subset of all cases. > > How does that sound to people? Thats' where we currently do that > > if (IS_ENABLED(CONFIG_PROVE_LOCKING) && > WARN(irqs_disabled(), "syscall %ld left IRQs disabled", > regs->orig_ax)) > local_irq_enable(); > > check too, which is a fairly similar issue. > I like this. It wouldn't help the problem that I suspect is a major part of the motivation for this patch: a stack overflow could overwrite addr_limit. But we fixed that for real already. Slightly off-topic: I would *love* to see syscall_return_slowpath() or similar moved or at least mostly moved into generic code. Aside from the fact that it used to be written in asm, there's nothing fundamentally arch-specific about it. > > And it only slows down the system calls that actually use "set_fs()". > Sure, it will slow those down a fair amount, but they are hopefully a > small subset of all cases. It won't even slow them down that much. The slow path is reasonably fast these days.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
[resending because kernel.org seems to have mangled my SMTP credentials. I wonder if this is a common problem.] On Thu, May 11, 2017 at 4:44 PM, Linus Torvalds wrote: > On Thu, May 11, 2017 at 4:17 PM, Thomas Garnier wrote: >> >> Ingo: Do you want the change as-is? Would you like it to be optional? >> What do you think? > > I'm not ingo, but I don't like that patch. It's in the wrong place - > that system call return code is too timing-critical to add address > limit checks. > > Now what I think you *could* do is: > > - make "set_fs()" actually set a work flag in the current thread flags > > - do the test in the slow-path (syscall_return_slowpath). > > Yes, yes, that ends up being architecture-specific, but it's fairly simple. > > And it only slows down the system calls that actually use "set_fs()". > Sure, it will slow those down a fair amount, but they are hopefully a > small subset of all cases. > > How does that sound to people? Thats' where we currently do that > > if (IS_ENABLED(CONFIG_PROVE_LOCKING) && > WARN(irqs_disabled(), "syscall %ld left IRQs disabled", > regs->orig_ax)) > local_irq_enable(); > > check too, which is a fairly similar issue. > I like this. It wouldn't help the problem that I suspect is a major part of the motivation for this patch: a stack overflow could overwrite addr_limit. But we fixed that for real already. Slightly off-topic: I would *love* to see syscall_return_slowpath() or similar moved or at least mostly moved into generic code. Aside from the fact that it used to be written in asm, there's nothing fundamentally arch-specific about it. > > And it only slows down the system calls that actually use "set_fs()". > Sure, it will slow those down a fair amount, but they are hopefully a > small subset of all cases. It won't even slow them down that much. The slow path is reasonably fast these days.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Thu, 11 May 2017 22:34:31 -0700 Kees Cookwrote: > On Thu, May 11, 2017 at 10:28 PM, Martin Schwidefsky > wrote: > > On Thu, 11 May 2017 16:44:07 -0700 > > Linus Torvalds wrote: > > > >> On Thu, May 11, 2017 at 4:17 PM, Thomas Garnier > >> wrote: > >> > > >> > Ingo: Do you want the change as-is? Would you like it to be optional? > >> > What do you think? > >> > >> I'm not ingo, but I don't like that patch. It's in the wrong place - > >> that system call return code is too timing-critical to add address > >> limit checks. > >> > >> Now what I think you *could* do is: > >> > >> - make "set_fs()" actually set a work flag in the current thread flags > >> > >> - do the test in the slow-path (syscall_return_slowpath). > >> > >> Yes, yes, that ends up being architecture-specific, but it's fairly simple. > >> > >> And it only slows down the system calls that actually use "set_fs()". > >> Sure, it will slow those down a fair amount, but they are hopefully a > >> small subset of all cases. > >> > >> How does that sound to people? Thats' where we currently do that > >> > >> if (IS_ENABLED(CONFIG_PROVE_LOCKING) && > >> WARN(irqs_disabled(), "syscall %ld left IRQs disabled", > >> regs->orig_ax)) > >> local_irq_enable(); > >> > >> check too, which is a fairly similar issue. > > > > This is exactly what Heiko did for the s390 backend as a result of this > > discussion. See the _CIF_ASCE_SECONDARY bit in arch/s390/kernel/entry.S, > > for the hot patch the check for the bit is included in the general > > _CIF_WORK test. Only the slow patch gets a bit slower. > > > > git commit b5a882fcf146c87cb6b67c6df353e1c042b8773d > > "s390: restore address space when returning to user space". > > If I'm understanding this, it won't catch corruption of addr_limit > during fast-path syscalls, though (i.e. addr_limit changed without a > call to set_fs()). :( This addr_limit corruption is mostly only a risk > archs without THREAD_INFO_IN_TASK, but it would still be nice to catch > unbalanced set_fs() code, so I like the idea. I like getting rid of > addr_limit entirely even more, but that'll take some time. :) Well for s390 there is no addr_limit as we use two separate address space for kernel vs. user. The equivalent to the addr_limit corruption on a fast-path syscall would be changing CR7 outside of set_fs. This boils down to the question what we are protection against? Bad code with unbalanced set_fs or evil code that changes addr_limit/CR7 outside of set_fs -- blue skies, Martin. "Reality continues to ruin my life." - Calvin.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Thu, 11 May 2017 22:34:31 -0700 Kees Cook wrote: > On Thu, May 11, 2017 at 10:28 PM, Martin Schwidefsky > wrote: > > On Thu, 11 May 2017 16:44:07 -0700 > > Linus Torvalds wrote: > > > >> On Thu, May 11, 2017 at 4:17 PM, Thomas Garnier > >> wrote: > >> > > >> > Ingo: Do you want the change as-is? Would you like it to be optional? > >> > What do you think? > >> > >> I'm not ingo, but I don't like that patch. It's in the wrong place - > >> that system call return code is too timing-critical to add address > >> limit checks. > >> > >> Now what I think you *could* do is: > >> > >> - make "set_fs()" actually set a work flag in the current thread flags > >> > >> - do the test in the slow-path (syscall_return_slowpath). > >> > >> Yes, yes, that ends up being architecture-specific, but it's fairly simple. > >> > >> And it only slows down the system calls that actually use "set_fs()". > >> Sure, it will slow those down a fair amount, but they are hopefully a > >> small subset of all cases. > >> > >> How does that sound to people? Thats' where we currently do that > >> > >> if (IS_ENABLED(CONFIG_PROVE_LOCKING) && > >> WARN(irqs_disabled(), "syscall %ld left IRQs disabled", > >> regs->orig_ax)) > >> local_irq_enable(); > >> > >> check too, which is a fairly similar issue. > > > > This is exactly what Heiko did for the s390 backend as a result of this > > discussion. See the _CIF_ASCE_SECONDARY bit in arch/s390/kernel/entry.S, > > for the hot patch the check for the bit is included in the general > > _CIF_WORK test. Only the slow patch gets a bit slower. > > > > git commit b5a882fcf146c87cb6b67c6df353e1c042b8773d > > "s390: restore address space when returning to user space". > > If I'm understanding this, it won't catch corruption of addr_limit > during fast-path syscalls, though (i.e. addr_limit changed without a > call to set_fs()). :( This addr_limit corruption is mostly only a risk > archs without THREAD_INFO_IN_TASK, but it would still be nice to catch > unbalanced set_fs() code, so I like the idea. I like getting rid of > addr_limit entirely even more, but that'll take some time. :) Well for s390 there is no addr_limit as we use two separate address space for kernel vs. user. The equivalent to the addr_limit corruption on a fast-path syscall would be changing CR7 outside of set_fs. This boils down to the question what we are protection against? Bad code with unbalanced set_fs or evil code that changes addr_limit/CR7 outside of set_fs -- blue skies, Martin. "Reality continues to ruin my life." - Calvin.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Thu, May 11, 2017 at 10:28 PM, Martin Schwidefskywrote: > On Thu, 11 May 2017 16:44:07 -0700 > Linus Torvalds wrote: > >> On Thu, May 11, 2017 at 4:17 PM, Thomas Garnier wrote: >> > >> > Ingo: Do you want the change as-is? Would you like it to be optional? >> > What do you think? >> >> I'm not ingo, but I don't like that patch. It's in the wrong place - >> that system call return code is too timing-critical to add address >> limit checks. >> >> Now what I think you *could* do is: >> >> - make "set_fs()" actually set a work flag in the current thread flags >> >> - do the test in the slow-path (syscall_return_slowpath). >> >> Yes, yes, that ends up being architecture-specific, but it's fairly simple. >> >> And it only slows down the system calls that actually use "set_fs()". >> Sure, it will slow those down a fair amount, but they are hopefully a >> small subset of all cases. >> >> How does that sound to people? Thats' where we currently do that >> >> if (IS_ENABLED(CONFIG_PROVE_LOCKING) && >> WARN(irqs_disabled(), "syscall %ld left IRQs disabled", >> regs->orig_ax)) >> local_irq_enable(); >> >> check too, which is a fairly similar issue. > > This is exactly what Heiko did for the s390 backend as a result of this > discussion. See the _CIF_ASCE_SECONDARY bit in arch/s390/kernel/entry.S, > for the hot patch the check for the bit is included in the general > _CIF_WORK test. Only the slow patch gets a bit slower. > > git commit b5a882fcf146c87cb6b67c6df353e1c042b8773d > "s390: restore address space when returning to user space". If I'm understanding this, it won't catch corruption of addr_limit during fast-path syscalls, though (i.e. addr_limit changed without a call to set_fs()). :( This addr_limit corruption is mostly only a risk archs without THREAD_INFO_IN_TASK, but it would still be nice to catch unbalanced set_fs() code, so I like the idea. I like getting rid of addr_limit entirely even more, but that'll take some time. :) -Kees -- Kees Cook Pixel Security
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Thu, May 11, 2017 at 10:28 PM, Martin Schwidefsky wrote: > On Thu, 11 May 2017 16:44:07 -0700 > Linus Torvalds wrote: > >> On Thu, May 11, 2017 at 4:17 PM, Thomas Garnier wrote: >> > >> > Ingo: Do you want the change as-is? Would you like it to be optional? >> > What do you think? >> >> I'm not ingo, but I don't like that patch. It's in the wrong place - >> that system call return code is too timing-critical to add address >> limit checks. >> >> Now what I think you *could* do is: >> >> - make "set_fs()" actually set a work flag in the current thread flags >> >> - do the test in the slow-path (syscall_return_slowpath). >> >> Yes, yes, that ends up being architecture-specific, but it's fairly simple. >> >> And it only slows down the system calls that actually use "set_fs()". >> Sure, it will slow those down a fair amount, but they are hopefully a >> small subset of all cases. >> >> How does that sound to people? Thats' where we currently do that >> >> if (IS_ENABLED(CONFIG_PROVE_LOCKING) && >> WARN(irqs_disabled(), "syscall %ld left IRQs disabled", >> regs->orig_ax)) >> local_irq_enable(); >> >> check too, which is a fairly similar issue. > > This is exactly what Heiko did for the s390 backend as a result of this > discussion. See the _CIF_ASCE_SECONDARY bit in arch/s390/kernel/entry.S, > for the hot patch the check for the bit is included in the general > _CIF_WORK test. Only the slow patch gets a bit slower. > > git commit b5a882fcf146c87cb6b67c6df353e1c042b8773d > "s390: restore address space when returning to user space". If I'm understanding this, it won't catch corruption of addr_limit during fast-path syscalls, though (i.e. addr_limit changed without a call to set_fs()). :( This addr_limit corruption is mostly only a risk archs without THREAD_INFO_IN_TASK, but it would still be nice to catch unbalanced set_fs() code, so I like the idea. I like getting rid of addr_limit entirely even more, but that'll take some time. :) -Kees -- Kees Cook Pixel Security
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Thu, 11 May 2017 16:44:07 -0700 Linus Torvaldswrote: > On Thu, May 11, 2017 at 4:17 PM, Thomas Garnier wrote: > > > > Ingo: Do you want the change as-is? Would you like it to be optional? > > What do you think? > > I'm not ingo, but I don't like that patch. It's in the wrong place - > that system call return code is too timing-critical to add address > limit checks. > > Now what I think you *could* do is: > > - make "set_fs()" actually set a work flag in the current thread flags > > - do the test in the slow-path (syscall_return_slowpath). > > Yes, yes, that ends up being architecture-specific, but it's fairly simple. > > And it only slows down the system calls that actually use "set_fs()". > Sure, it will slow those down a fair amount, but they are hopefully a > small subset of all cases. > > How does that sound to people? Thats' where we currently do that > > if (IS_ENABLED(CONFIG_PROVE_LOCKING) && > WARN(irqs_disabled(), "syscall %ld left IRQs disabled", > regs->orig_ax)) > local_irq_enable(); > > check too, which is a fairly similar issue. This is exactly what Heiko did for the s390 backend as a result of this discussion. See the _CIF_ASCE_SECONDARY bit in arch/s390/kernel/entry.S, for the hot patch the check for the bit is included in the general _CIF_WORK test. Only the slow patch gets a bit slower. git commit b5a882fcf146c87cb6b67c6df353e1c042b8773d "s390: restore address space when returning to user space". -- blue skies, Martin. "Reality continues to ruin my life." - Calvin.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Thu, 11 May 2017 16:44:07 -0700 Linus Torvalds wrote: > On Thu, May 11, 2017 at 4:17 PM, Thomas Garnier wrote: > > > > Ingo: Do you want the change as-is? Would you like it to be optional? > > What do you think? > > I'm not ingo, but I don't like that patch. It's in the wrong place - > that system call return code is too timing-critical to add address > limit checks. > > Now what I think you *could* do is: > > - make "set_fs()" actually set a work flag in the current thread flags > > - do the test in the slow-path (syscall_return_slowpath). > > Yes, yes, that ends up being architecture-specific, but it's fairly simple. > > And it only slows down the system calls that actually use "set_fs()". > Sure, it will slow those down a fair amount, but they are hopefully a > small subset of all cases. > > How does that sound to people? Thats' where we currently do that > > if (IS_ENABLED(CONFIG_PROVE_LOCKING) && > WARN(irqs_disabled(), "syscall %ld left IRQs disabled", > regs->orig_ax)) > local_irq_enable(); > > check too, which is a fairly similar issue. This is exactly what Heiko did for the s390 backend as a result of this discussion. See the _CIF_ASCE_SECONDARY bit in arch/s390/kernel/entry.S, for the hot patch the check for the bit is included in the general _CIF_WORK test. Only the slow patch gets a bit slower. git commit b5a882fcf146c87cb6b67c6df353e1c042b8773d "s390: restore address space when returning to user space". -- blue skies, Martin. "Reality continues to ruin my life." - Calvin.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Thu, May 11, 2017 at 4:17 PM, Thomas Garnierwrote: > > Ingo: Do you want the change as-is? Would you like it to be optional? > What do you think? I'm not ingo, but I don't like that patch. It's in the wrong place - that system call return code is too timing-critical to add address limit checks. Now what I think you *could* do is: - make "set_fs()" actually set a work flag in the current thread flags - do the test in the slow-path (syscall_return_slowpath). Yes, yes, that ends up being architecture-specific, but it's fairly simple. And it only slows down the system calls that actually use "set_fs()". Sure, it will slow those down a fair amount, but they are hopefully a small subset of all cases. How does that sound to people? Thats' where we currently do that if (IS_ENABLED(CONFIG_PROVE_LOCKING) && WARN(irqs_disabled(), "syscall %ld left IRQs disabled", regs->orig_ax)) local_irq_enable(); check too, which is a fairly similar issue. Linus
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Thu, May 11, 2017 at 4:17 PM, Thomas Garnier wrote: > > Ingo: Do you want the change as-is? Would you like it to be optional? > What do you think? I'm not ingo, but I don't like that patch. It's in the wrong place - that system call return code is too timing-critical to add address limit checks. Now what I think you *could* do is: - make "set_fs()" actually set a work flag in the current thread flags - do the test in the slow-path (syscall_return_slowpath). Yes, yes, that ends up being architecture-specific, but it's fairly simple. And it only slows down the system calls that actually use "set_fs()". Sure, it will slow those down a fair amount, but they are hopefully a small subset of all cases. How does that sound to people? Thats' where we currently do that if (IS_ENABLED(CONFIG_PROVE_LOCKING) && WARN(irqs_disabled(), "syscall %ld left IRQs disabled", regs->orig_ax)) local_irq_enable(); check too, which is a fairly similar issue. Linus
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Tue, May 9, 2017 at 7:29 AM, Thomas Garnierwrote: > > On Tue, May 9, 2017 at 4:10 AM, Greg KH wrote: > > On Tue, May 09, 2017 at 08:56:19AM +0200, Ingo Molnar wrote: > >> > >> * Kees Cook wrote: > >> > >> > > There's the option of using GCC plugins now that the infrastructure was > >> > > upstreamed from grsecurity. It can be used as part of the regular build > >> > > process and as long as the analysis is pretty simple it shouldn't hurt > >> > > compile > >> > > time much. > >> > > >> > Well, and that the situation may arise due to memory corruption, not from > >> > poorly-matched set_fs() calls, which static analysis won't help solve. > >> > We need > >> > to catch this bad kernel state because it is a very bad state to run in. > >> > >> If memory corruption corrupted the task state into having addr_limit set to > >> KERNEL_DS then there's already a fair chance that it's game over: it could > >> also > >> have set *uid to 0, or changed a sensitive PF_ flag, or a number of other > >> things... > >> > >> Furthermore, think about it: there's literally an infinite amount of > >> corrupted > >> task states that could be a security problem and that could be checked > >> after every > >> system call. Do we want to check every one of them? > > > > Ok, I'm all for not checking lots of stuff all the time, just to protect > > from crappy drivers that. Especially as we _can_ audit and run checks > > on the source code for them in the kernel tree. > > > > But, and here's the problem, outside of the desktop/enterprise world, > > there are a ton of out-of-tree code that is crap. The number of > > security/bug fixes and kernel crashes for out-of-tree code in systems > > like Android phones is just so high it's laughable. > > > > When you have a device that is running 3.2 million lines of kernel code, > > yet the diffstat of the tree compared to mainline adds 3 million lines > > of code, there is bound to be a ton of issues/problems there. > > > > So this is an entirely different thing we need to try to protect > > ourselves from. A long time ago I laughed when I saw that Microsoft had > > to do lots of "hardening" of their kernel to protect themselves from > > crappy drivers, as I knew we didn't have to do that because we had the > > source for them and could fix the root issues. But that has changed and > > now we don't all have that option. That code is out-of-tree because the > > vendor doesn't care, and doesn't want to take any time at all to do > > anything resembling a real code review[1]. > > That's a big part of why I thought would be useful. I am less worried > about edge cases upstream right now than forks with custom codes not > using set_fs correctly. > > > > > So, how about options like the ones being proposed here, go behind a new > > config option: > > CONFIG_PROTECT_FROM_CRAPPY_DRIVERS > > that device owners can enable if they do not trust their vendor-provided > > code (hint, I sure don't.) That way the "normal" path that all of us > > are used to running will be fine, but if you want to take the speed hit > > to try to protect yourself, then you can do that as well. > > Maybe another name but why not. Ingo: Do you want the change as-is? Would you like it to be optional? What do you think? > > > > > Anyway, just an idea... > > > > thanks, > > > > greg k-h > > > > [1] I am working really hard with lots of vendors to try to fix their > > broken development model, but that is going to take years to resolve > > as their device pipelines are years long, and changing their > > mindsets takes a long time... > > > > -- > Thomas -- Thomas
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Tue, May 9, 2017 at 7:29 AM, Thomas Garnier wrote: > > On Tue, May 9, 2017 at 4:10 AM, Greg KH wrote: > > On Tue, May 09, 2017 at 08:56:19AM +0200, Ingo Molnar wrote: > >> > >> * Kees Cook wrote: > >> > >> > > There's the option of using GCC plugins now that the infrastructure was > >> > > upstreamed from grsecurity. It can be used as part of the regular build > >> > > process and as long as the analysis is pretty simple it shouldn't hurt > >> > > compile > >> > > time much. > >> > > >> > Well, and that the situation may arise due to memory corruption, not from > >> > poorly-matched set_fs() calls, which static analysis won't help solve. > >> > We need > >> > to catch this bad kernel state because it is a very bad state to run in. > >> > >> If memory corruption corrupted the task state into having addr_limit set to > >> KERNEL_DS then there's already a fair chance that it's game over: it could > >> also > >> have set *uid to 0, or changed a sensitive PF_ flag, or a number of other > >> things... > >> > >> Furthermore, think about it: there's literally an infinite amount of > >> corrupted > >> task states that could be a security problem and that could be checked > >> after every > >> system call. Do we want to check every one of them? > > > > Ok, I'm all for not checking lots of stuff all the time, just to protect > > from crappy drivers that. Especially as we _can_ audit and run checks > > on the source code for them in the kernel tree. > > > > But, and here's the problem, outside of the desktop/enterprise world, > > there are a ton of out-of-tree code that is crap. The number of > > security/bug fixes and kernel crashes for out-of-tree code in systems > > like Android phones is just so high it's laughable. > > > > When you have a device that is running 3.2 million lines of kernel code, > > yet the diffstat of the tree compared to mainline adds 3 million lines > > of code, there is bound to be a ton of issues/problems there. > > > > So this is an entirely different thing we need to try to protect > > ourselves from. A long time ago I laughed when I saw that Microsoft had > > to do lots of "hardening" of their kernel to protect themselves from > > crappy drivers, as I knew we didn't have to do that because we had the > > source for them and could fix the root issues. But that has changed and > > now we don't all have that option. That code is out-of-tree because the > > vendor doesn't care, and doesn't want to take any time at all to do > > anything resembling a real code review[1]. > > That's a big part of why I thought would be useful. I am less worried > about edge cases upstream right now than forks with custom codes not > using set_fs correctly. > > > > > So, how about options like the ones being proposed here, go behind a new > > config option: > > CONFIG_PROTECT_FROM_CRAPPY_DRIVERS > > that device owners can enable if they do not trust their vendor-provided > > code (hint, I sure don't.) That way the "normal" path that all of us > > are used to running will be fine, but if you want to take the speed hit > > to try to protect yourself, then you can do that as well. > > Maybe another name but why not. Ingo: Do you want the change as-is? Would you like it to be optional? What do you think? > > > > > Anyway, just an idea... > > > > thanks, > > > > greg k-h > > > > [1] I am working really hard with lots of vendors to try to fix their > > broken development model, but that is going to take years to resolve > > as their device pipelines are years long, and changing their > > mindsets takes a long time... > > > > -- > Thomas -- Thomas
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Tue, May 09, 2017 at 04:31:00PM -0700, Kees Cook wrote: > > I don't like silent fixups. If we want to do this, we should BUG or > > at least WARN, not just change the addr limit. But I'm also not > > convinced it's indicative of an actual bug here. > > Nothing should enter that function with KERNEL_DS set, right? > > BUG_ON(get_fs() != USER_DS); We're feeling triggerhappy, aren't we? A nice juicy WARN-splat along with a fixup looks much better than killing the box, to me. -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Tue, May 09, 2017 at 04:31:00PM -0700, Kees Cook wrote: > > I don't like silent fixups. If we want to do this, we should BUG or > > at least WARN, not just change the addr limit. But I'm also not > > convinced it's indicative of an actual bug here. > > Nothing should enter that function with KERNEL_DS set, right? > > BUG_ON(get_fs() != USER_DS); We're feeling triggerhappy, aren't we? A nice juicy WARN-splat along with a fixup looks much better than killing the box, to me. -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Wed, May 10, 2017 at 1:14 AM, Christoph Hellwigwrote: > On Wed, May 10, 2017 at 09:08:41AM +0100, Al Viro wrote: >> On Wed, May 10, 2017 at 09:37:04AM +0200, Arnd Bergmann wrote: >> >> > > How about trying to remove all of them? If we could actually get rid >> > > of all of them, we could drop the arch support, and we'd get faster, >> > > simpler, shorter uaccess code throughout the kernel. >> >> BTW, not all get_user() under KERNEL_DS are plain loads. There is an >> exception - probe_kernel_read(). > > And various calls that looks like opencoded versions, e.g. drivers/dio > or the ELF loader. > > But in the long run we'll just need a separate primitive for that, > but that can wait until the set_fs calls outside the core code are > gone. I suspect that, on most arches, the primitive is called __copy_from_user(). We could make the generic code do that except where overridden.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Wed, May 10, 2017 at 1:14 AM, Christoph Hellwig wrote: > On Wed, May 10, 2017 at 09:08:41AM +0100, Al Viro wrote: >> On Wed, May 10, 2017 at 09:37:04AM +0200, Arnd Bergmann wrote: >> >> > > How about trying to remove all of them? If we could actually get rid >> > > of all of them, we could drop the arch support, and we'd get faster, >> > > simpler, shorter uaccess code throughout the kernel. >> >> BTW, not all get_user() under KERNEL_DS are plain loads. There is an >> exception - probe_kernel_read(). > > And various calls that looks like opencoded versions, e.g. drivers/dio > or the ELF loader. > > But in the long run we'll just need a separate primitive for that, > but that can wait until the set_fs calls outside the core code are > gone. I suspect that, on most arches, the primitive is called __copy_from_user(). We could make the generic code do that except where overridden.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Wed, May 10, 2017 at 09:08:41AM +0100, Al Viro wrote: > On Wed, May 10, 2017 at 09:37:04AM +0200, Arnd Bergmann wrote: > > > > How about trying to remove all of them? If we could actually get rid > > > of all of them, we could drop the arch support, and we'd get faster, > > > simpler, shorter uaccess code throughout the kernel. > > BTW, not all get_user() under KERNEL_DS are plain loads. There is an > exception - probe_kernel_read(). And various calls that looks like opencoded versions, e.g. drivers/dio or the ELF loader. But in the long run we'll just need a separate primitive for that, but that can wait until the set_fs calls outside the core code are gone.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Wed, May 10, 2017 at 09:08:41AM +0100, Al Viro wrote: > On Wed, May 10, 2017 at 09:37:04AM +0200, Arnd Bergmann wrote: > > > > How about trying to remove all of them? If we could actually get rid > > > of all of them, we could drop the arch support, and we'd get faster, > > > simpler, shorter uaccess code throughout the kernel. > > BTW, not all get_user() under KERNEL_DS are plain loads. There is an > exception - probe_kernel_read(). And various calls that looks like opencoded versions, e.g. drivers/dio or the ELF loader. But in the long run we'll just need a separate primitive for that, but that can wait until the set_fs calls outside the core code are gone.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Wed, May 10, 2017 at 09:37:04AM +0200, Arnd Bergmann wrote: > > How about trying to remove all of them? If we could actually get rid > > of all of them, we could drop the arch support, and we'd get faster, > > simpler, shorter uaccess code throughout the kernel. BTW, not all get_user() under KERNEL_DS are plain loads. There is an exception - probe_kernel_read(). > > The ones in kernel/compat.c are generally garbage. They should be > > using compat_alloc_user_space(). Ditto for kernel/power/user.c. > > compat_alloc_user_space() has some problems too, it adds > complexity to a rarely-tested code path and can add some noticeable > overhead in cases where user space access is slow because of > extra checks. > > It's clearly better than set_fs(), but the way I prefer to convert the > code is to avoid both and instead move compat handlers next to > the native code, and splitting out the common code between native > and compat mode into a helper that takes a regular kernel pointer. > > I think that's what both Al has done in the past on compat_ioctl() > and select() and what Christoph does in his latest series, but > it seems worth pointing out for others that decide to help out here. Folks, reducing the amount of places where we play with set_fs() is certainly a good thing. Getting rid of them completely is something entirely different; I have tried to plot out patch series in this direction many times during the last 5 years or so, but it's not going to be easy. Tomorrow I can start posting my notes in that direction (and there are tons of those, unfortunately mixed with git grep results, highly unprintable personal comments, etc.); just let me grab some sleep first... BTW, slow userland access is not just due to extra checks; access_ok(), in particular, is pretty much noise. The real PITA comes from the things like STAC/CLAC on recent x86. Or hardware overhead of cross-address-space block copy insn (e.g. on s390, where it's optimized for multi-cacheline blocks). Or things like uml, where it's a matter of walking the page tables for each sodding __get_user(). It's not always just a matter of address space limit...
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Wed, May 10, 2017 at 09:37:04AM +0200, Arnd Bergmann wrote: > > How about trying to remove all of them? If we could actually get rid > > of all of them, we could drop the arch support, and we'd get faster, > > simpler, shorter uaccess code throughout the kernel. BTW, not all get_user() under KERNEL_DS are plain loads. There is an exception - probe_kernel_read(). > > The ones in kernel/compat.c are generally garbage. They should be > > using compat_alloc_user_space(). Ditto for kernel/power/user.c. > > compat_alloc_user_space() has some problems too, it adds > complexity to a rarely-tested code path and can add some noticeable > overhead in cases where user space access is slow because of > extra checks. > > It's clearly better than set_fs(), but the way I prefer to convert the > code is to avoid both and instead move compat handlers next to > the native code, and splitting out the common code between native > and compat mode into a helper that takes a regular kernel pointer. > > I think that's what both Al has done in the past on compat_ioctl() > and select() and what Christoph does in his latest series, but > it seems worth pointing out for others that decide to help out here. Folks, reducing the amount of places where we play with set_fs() is certainly a good thing. Getting rid of them completely is something entirely different; I have tried to plot out patch series in this direction many times during the last 5 years or so, but it's not going to be easy. Tomorrow I can start posting my notes in that direction (and there are tons of those, unfortunately mixed with git grep results, highly unprintable personal comments, etc.); just let me grab some sleep first... BTW, slow userland access is not just due to extra checks; access_ok(), in particular, is pretty much noise. The real PITA comes from the things like STAC/CLAC on recent x86. Or hardware overhead of cross-address-space block copy insn (e.g. on s390, where it's optimized for multi-cacheline blocks). Or things like uml, where it's a matter of walking the page tables for each sodding __get_user(). It's not always just a matter of address space limit...
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Tue, May 9, 2017 at 3:00 PM, Andy Lutomirskiwrote: > On Tue, May 9, 2017 at 1:56 AM, Christoph Hellwig wrote: >> On Tue, May 09, 2017 at 08:45:22AM +0200, Ingo Molnar wrote: >>> We only have ~115 code blocks in the kernel that set/restore KERNEL_DS, it >>> would >>> be a pity to add a runtime check to every system call ... >> >> I think we should simply strive to remove all of them that aren't >> in core scheduler / arch code. Basically evetyytime we do the >> >> oldfs = get_fs(); >> set_fs(KERNEL_DS); >> .. >> set_fs(oldfs); >> >> trick we're doing something wrong, and there should always be better >> ways to archive it. E.g. using iov_iter with a ITER_KVEC type >> consistently would already remove most of them. > > How about trying to remove all of them? If we could actually get rid > of all of them, we could drop the arch support, and we'd get faster, > simpler, shorter uaccess code throughout the kernel. > > The ones in kernel/compat.c are generally garbage. They should be > using compat_alloc_user_space(). Ditto for kernel/power/user.c. compat_alloc_user_space() has some problems too, it adds complexity to a rarely-tested code path and can add some noticeable overhead in cases where user space access is slow because of extra checks. It's clearly better than set_fs(), but the way I prefer to convert the code is to avoid both and instead move compat handlers next to the native code, and splitting out the common code between native and compat mode into a helper that takes a regular kernel pointer. I think that's what both Al has done in the past on compat_ioctl() and select() and what Christoph does in his latest series, but it seems worth pointing out for others that decide to help out here. Arnd
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Tue, May 9, 2017 at 3:00 PM, Andy Lutomirski wrote: > On Tue, May 9, 2017 at 1:56 AM, Christoph Hellwig wrote: >> On Tue, May 09, 2017 at 08:45:22AM +0200, Ingo Molnar wrote: >>> We only have ~115 code blocks in the kernel that set/restore KERNEL_DS, it >>> would >>> be a pity to add a runtime check to every system call ... >> >> I think we should simply strive to remove all of them that aren't >> in core scheduler / arch code. Basically evetyytime we do the >> >> oldfs = get_fs(); >> set_fs(KERNEL_DS); >> .. >> set_fs(oldfs); >> >> trick we're doing something wrong, and there should always be better >> ways to archive it. E.g. using iov_iter with a ITER_KVEC type >> consistently would already remove most of them. > > How about trying to remove all of them? If we could actually get rid > of all of them, we could drop the arch support, and we'd get faster, > simpler, shorter uaccess code throughout the kernel. > > The ones in kernel/compat.c are generally garbage. They should be > using compat_alloc_user_space(). Ditto for kernel/power/user.c. compat_alloc_user_space() has some problems too, it adds complexity to a rarely-tested code path and can add some noticeable overhead in cases where user space access is slow because of extra checks. It's clearly better than set_fs(), but the way I prefer to convert the code is to avoid both and instead move compat handlers next to the native code, and splitting out the common code between native and compat mode into a helper that takes a regular kernel pointer. I think that's what both Al has done in the past on compat_ioctl() and select() and what Christoph does in his latest series, but it seems worth pointing out for others that decide to help out here. Arnd
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Wed, May 10, 2017 at 09:28:48AM +0200, Arnd Bergmann wrote: > My older time64_t syscall series has the side-effect of doing something > like this to the time-related compat handlers in kernel/compat.c. If nobody > else has started looking at removing set_fs from those, I can extract > the relevant parts from my series. That would be great.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Wed, May 10, 2017 at 09:28:48AM +0200, Arnd Bergmann wrote: > My older time64_t syscall series has the side-effect of doing something > like this to the time-related compat handlers in kernel/compat.c. If nobody > else has started looking at removing set_fs from those, I can extract > the relevant parts from my series. That would be great.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Wed, May 10, 2017 at 08:27:47AM +0100, Al Viro wrote: > And you *still* do the same. Christoph, this is ridiculous - the worst > part of the area is not a couple of functions in fs/read_write.c, it's > a fucking lot of ->read() and ->write() instances in shitty driver code, > pardon the redundance. And _that_ is still done under set_fs(KERNEL_DS). For now yes. But it provides a sane API on top, and then we can start moving the drivers that matter to the iter variants and drop support for the rest soon. Most in-kernel I/O is done to files, and the rest to a very limited set of devices. (not accounting for sockets through their own APIs, thats another story) > That's what I'm objecting to. Centralized kernel_readv() et.al. - sure, > and fs/read_write.c is the right place for those. No arguments here. > Conversion to those - absolutely; drivers have no fucking business touching > set_fs() at all. But your primitives are trouble waiting to happen. > Let them take kvec arrays. And let them, in case when there's no > ->read_iter()/->write_iter(), do set_fs(). Statically, without this > if (iter->type & ITER_KVEC) ... stuff. Oh weĺl. I can do that first, but I think eventually we'll have to get back to what I've done now.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Wed, May 10, 2017 at 08:27:47AM +0100, Al Viro wrote: > And you *still* do the same. Christoph, this is ridiculous - the worst > part of the area is not a couple of functions in fs/read_write.c, it's > a fucking lot of ->read() and ->write() instances in shitty driver code, > pardon the redundance. And _that_ is still done under set_fs(KERNEL_DS). For now yes. But it provides a sane API on top, and then we can start moving the drivers that matter to the iter variants and drop support for the rest soon. Most in-kernel I/O is done to files, and the rest to a very limited set of devices. (not accounting for sockets through their own APIs, thats another story) > That's what I'm objecting to. Centralized kernel_readv() et.al. - sure, > and fs/read_write.c is the right place for those. No arguments here. > Conversion to those - absolutely; drivers have no fucking business touching > set_fs() at all. But your primitives are trouble waiting to happen. > Let them take kvec arrays. And let them, in case when there's no > ->read_iter()/->write_iter(), do set_fs(). Statically, without this > if (iter->type & ITER_KVEC) ... stuff. Oh weĺl. I can do that first, but I think eventually we'll have to get back to what I've done now.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Tue, May 09, 2017 at 11:53:01PM -0700, Christoph Hellwig wrote: > On Wed, May 10, 2017 at 04:12:54AM +0100, Al Viro wrote: > > What's the point? What's wrong with having > > kernel_read()/kernel_readv()/etc.? > > You still have set_fs() in there; doing that one level up in call chain > > would > > be just fine... IDGI. > > The problem is that they modify the address limit, which the whole > subthread here wants to get rid of. And you *still* do the same. Christoph, this is ridiculous - the worst part of the area is not a couple of functions in fs/read_write.c, it's a fucking lot of ->read() and ->write() instances in shitty driver code, pardon the redundance. And _that_ is still done under set_fs(KERNEL_DS). Claiming that set_fs() done one function deeper in callchain (both in fs/read_write.c) is somehow better because it reduces the amount of code under that thing... Get real, please - helpers that encapsulate those set_fs() pairs (a-la kernel_read(), etc.) absolutely make sense and converting their open-coded instances to calls of those helpers is clearly a good thing. However, we are not * getting rid of low-quality code run under KERNEL_DS * gettind rid of set_fs() itself * getting a generic kernel_read() variant that would really take an iov_iter. That's what I'm objecting to. Centralized kernel_readv() et.al. - sure, and fs/read_write.c is the right place for those. No arguments here. Conversion to those - absolutely; drivers have no fucking business touching set_fs() at all. But your primitives are trouble waiting to happen. Let them take kvec arrays. And let them, in case when there's no ->read_iter()/->write_iter(), do set_fs(). Statically, without this if (iter->type & ITER_KVEC) ... stuff. > > Another delicate place: you can't assume that write() always advances > > file position by its (positive) return value. btrfs stuff is sensitive > > to that. > > If we don't want to assume that we need to pass pointer to pos to > kernel_read/write. Which might be a good idea in general. Yes. > > ashmem probably _is_ OK with demanding ->read_iter(), but I'm not sure > > about blind asma->file->f_pos += ret. That's begging for races. Actually, > > scratch that - it *is* racy. > > I think the proper fix is to not even bother to maintain f_pos of the > backing file, as we don't ever use it - all reads from it pass in > an explicit position anyway. vfs_llseek() used by ashmem_llseek()...
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Tue, May 09, 2017 at 11:53:01PM -0700, Christoph Hellwig wrote: > On Wed, May 10, 2017 at 04:12:54AM +0100, Al Viro wrote: > > What's the point? What's wrong with having > > kernel_read()/kernel_readv()/etc.? > > You still have set_fs() in there; doing that one level up in call chain > > would > > be just fine... IDGI. > > The problem is that they modify the address limit, which the whole > subthread here wants to get rid of. And you *still* do the same. Christoph, this is ridiculous - the worst part of the area is not a couple of functions in fs/read_write.c, it's a fucking lot of ->read() and ->write() instances in shitty driver code, pardon the redundance. And _that_ is still done under set_fs(KERNEL_DS). Claiming that set_fs() done one function deeper in callchain (both in fs/read_write.c) is somehow better because it reduces the amount of code under that thing... Get real, please - helpers that encapsulate those set_fs() pairs (a-la kernel_read(), etc.) absolutely make sense and converting their open-coded instances to calls of those helpers is clearly a good thing. However, we are not * getting rid of low-quality code run under KERNEL_DS * gettind rid of set_fs() itself * getting a generic kernel_read() variant that would really take an iov_iter. That's what I'm objecting to. Centralized kernel_readv() et.al. - sure, and fs/read_write.c is the right place for those. No arguments here. Conversion to those - absolutely; drivers have no fucking business touching set_fs() at all. But your primitives are trouble waiting to happen. Let them take kvec arrays. And let them, in case when there's no ->read_iter()/->write_iter(), do set_fs(). Statically, without this if (iter->type & ITER_KVEC) ... stuff. > > Another delicate place: you can't assume that write() always advances > > file position by its (positive) return value. btrfs stuff is sensitive > > to that. > > If we don't want to assume that we need to pass pointer to pos to > kernel_read/write. Which might be a good idea in general. Yes. > > ashmem probably _is_ OK with demanding ->read_iter(), but I'm not sure > > about blind asma->file->f_pos += ret. That's begging for races. Actually, > > scratch that - it *is* racy. > > I think the proper fix is to not even bother to maintain f_pos of the > backing file, as we don't ever use it - all reads from it pass in > an explicit position anyway. vfs_llseek() used by ashmem_llseek()...
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Tue, May 9, 2017 at 6:03 PM, Christoph Hellwigwrote: > On Tue, May 09, 2017 at 06:02:50AM -0700, Christoph Hellwig wrote: >> On Tue, May 09, 2017 at 06:00:01AM -0700, Andy Lutomirski wrote: >> > fs/splice.c has some, ahem, interesting uses that have been the source >> > of nasty exploits in the past. Converting them to use iov_iter >> > properly would be really, really nice. Christoph, I don't suppose >> > you'd like to do that? >> >> I can take care of all the fs code including this one. > > I spent the afternoon hacking up where I'd like this to head. It's > completely untested as of now: > > > http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/setfs-elimination My older time64_t syscall series has the side-effect of doing something like this to the time-related compat handlers in kernel/compat.c. If nobody else has started looking at removing set_fs from those, I can extract the relevant parts from my series. Arnd
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Tue, May 9, 2017 at 6:03 PM, Christoph Hellwig wrote: > On Tue, May 09, 2017 at 06:02:50AM -0700, Christoph Hellwig wrote: >> On Tue, May 09, 2017 at 06:00:01AM -0700, Andy Lutomirski wrote: >> > fs/splice.c has some, ahem, interesting uses that have been the source >> > of nasty exploits in the past. Converting them to use iov_iter >> > properly would be really, really nice. Christoph, I don't suppose >> > you'd like to do that? >> >> I can take care of all the fs code including this one. > > I spent the afternoon hacking up where I'd like this to head. It's > completely untested as of now: > > > http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/setfs-elimination My older time64_t syscall series has the side-effect of doing something like this to the time-related compat handlers in kernel/compat.c. If nobody else has started looking at removing set_fs from those, I can extract the relevant parts from my series. Arnd
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Tue, May 09, 2017 at 04:31:00PM -0700, Kees Cook wrote: > > I don't like silent fixups. If we want to do this, we should BUG or > > at least WARN, not just change the addr limit. But I'm also not > > convinced it's indicative of an actual bug here. > > Nothing should enter that function with KERNEL_DS set, right? It might very well do. Various drivers or the networking code mess with the address limits for fairly broad sections of code.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Tue, May 09, 2017 at 04:31:00PM -0700, Kees Cook wrote: > > I don't like silent fixups. If we want to do this, we should BUG or > > at least WARN, not just change the addr limit. But I'm also not > > convinced it's indicative of an actual bug here. > > Nothing should enter that function with KERNEL_DS set, right? It might very well do. Various drivers or the networking code mess with the address limits for fairly broad sections of code.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Wed, May 10, 2017 at 04:39:12AM +0100, Al Viro wrote: > fcntl stuff: I've decided not to put something similar into work.compat > since I couldn't decide what to do with compat stuff - word-by-word copy > from userland converting to struct flock + conversion to posix_lock + > actual work + conversion to flock + word-by-word copy to userland... Smells > like we might be better off with compat_flock_to_posix_lock() et.al. > I'm still not sure; played a bit one way and another and dediced to drop > it for now. Hell knows... My version already is an improvement in lines of code alone. Between that and stopping to mess with the address limit I think it's a clear winner. But it's pretty independent of the rest, and I'll just run it through Jeff and Bruce and ask them what they think.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Wed, May 10, 2017 at 04:39:12AM +0100, Al Viro wrote: > fcntl stuff: I've decided not to put something similar into work.compat > since I couldn't decide what to do with compat stuff - word-by-word copy > from userland converting to struct flock + conversion to posix_lock + > actual work + conversion to flock + word-by-word copy to userland... Smells > like we might be better off with compat_flock_to_posix_lock() et.al. > I'm still not sure; played a bit one way and another and dediced to drop > it for now. Hell knows... My version already is an improvement in lines of code alone. Between that and stopping to mess with the address limit I think it's a clear winner. But it's pretty independent of the rest, and I'll just run it through Jeff and Bruce and ask them what they think.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Wed, May 10, 2017 at 04:12:54AM +0100, Al Viro wrote: > What's the point? What's wrong with having kernel_read()/kernel_readv()/etc.? > You still have set_fs() in there; doing that one level up in call chain would > be just fine... IDGI. The problem is that they modify the address limit, which the whole subthread here wants to get rid of. > Broken commit: "net: don't play with address limits in kernel_recvmsg". > It would be OK if it was only about data. Unfortunately, that's not > true in one case: svc_udp_recvfrom() wants ->msg_control. Dropped, but we'll need to fix that eventually. > Another delicate place: you can't assume that write() always advances > file position by its (positive) return value. btrfs stuff is sensitive > to that. If we don't want to assume that we need to pass pointer to pos to kernel_read/write. Which might be a good idea in general. > ashmem probably _is_ OK with demanding ->read_iter(), but I'm not sure > about blind asma->file->f_pos += ret. That's begging for races. Actually, > scratch that - it *is* racy. I think the proper fix is to not even bother to maintain f_pos of the backing file, as we don't ever use it - all reads from it pass in an explicit position anyway.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Wed, May 10, 2017 at 04:12:54AM +0100, Al Viro wrote: > What's the point? What's wrong with having kernel_read()/kernel_readv()/etc.? > You still have set_fs() in there; doing that one level up in call chain would > be just fine... IDGI. The problem is that they modify the address limit, which the whole subthread here wants to get rid of. > Broken commit: "net: don't play with address limits in kernel_recvmsg". > It would be OK if it was only about data. Unfortunately, that's not > true in one case: svc_udp_recvfrom() wants ->msg_control. Dropped, but we'll need to fix that eventually. > Another delicate place: you can't assume that write() always advances > file position by its (positive) return value. btrfs stuff is sensitive > to that. If we don't want to assume that we need to pass pointer to pos to kernel_read/write. Which might be a good idea in general. > ashmem probably _is_ OK with demanding ->read_iter(), but I'm not sure > about blind asma->file->f_pos += ret. That's begging for races. Actually, > scratch that - it *is* racy. I think the proper fix is to not even bother to maintain f_pos of the backing file, as we don't ever use it - all reads from it pass in an explicit position anyway.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Wed, May 10, 2017 at 03:45:24AM +0100, Al Viro wrote: > FWIW, some parts of that queue are obviously sane; it's the conversions of > kernel_write() and friends to ->read_iter/->write_iter() that are > non-starters. And that part is the main point! > That stuff is used in too many situations; we can't guarantee that all of > them will be for files that have those. That's why this series handles ITER_KVEC for this case, which is all that's really needed for kernel_read/write. If you insiste the bvec and pipe cases are handled as well that couod be added fairly easily. > As for default_file_splice_read(), I seriously suspect that with your change > we could as well just make it return -EINVAL and be done with that; places > that have ->read_iter() tend to have explicit ->splice_read() and it looks > like the ones that do not should simply use generic_file_read_iter(). > I hadn't checked that, but there's not a lot of those: Making ->splice_read to default to the ->read_iter based implementation and returning -EINVAL if neither that nor an explicit ->splice_read is provided is useful, but wasn't the aim of this series. Similar on the write side.
Re: [kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
On Wed, May 10, 2017 at 03:45:24AM +0100, Al Viro wrote: > FWIW, some parts of that queue are obviously sane; it's the conversions of > kernel_write() and friends to ->read_iter/->write_iter() that are > non-starters. And that part is the main point! > That stuff is used in too many situations; we can't guarantee that all of > them will be for files that have those. That's why this series handles ITER_KVEC for this case, which is all that's really needed for kernel_read/write. If you insiste the bvec and pipe cases are handled as well that couod be added fairly easily. > As for default_file_splice_read(), I seriously suspect that with your change > we could as well just make it return -EINVAL and be done with that; places > that have ->read_iter() tend to have explicit ->splice_read() and it looks > like the ones that do not should simply use generic_file_read_iter(). > I hadn't checked that, but there's not a lot of those: Making ->splice_read to default to the ->read_iter based implementation and returning -EINVAL if neither that nor an explicit ->splice_read is provided is useful, but wasn't the aim of this series. Similar on the write side.