Re: [PATCH] mm: larger stack guard gap, between vmas
On Thu, Jul 06, 2017 at 10:24:06AM +0200, Willy Tarreau wrote:
> On Wed, Jul 05, 2017 at 04:51:06PM -0700, Linus Torvalds wrote:
> > On Wed, Jul 5, 2017 at 4:35 PM, Ben Hutchings wrote:
> > >>
> > >> And I think your second patch breaks that "use a really large value to
> > >> approximate infinity" case that definitely has existed as a pattern.
> > >
> > > Right. Well that seems to leave us with remembering the MAP_FIXED flag
> > > and using that as the condition to ignore the previous mapping.
> >
> > I'm not particularly happy about having a MAP_FIXED special case, but
> > yeah, I'm not seeing a lot of alternatives.
>
> We can possibly refine it like this:
> - use PROT_NONE as a mark for the end of the stack and consider the
>   application doing this knows exactly what it's doing;
>
> - use other MAP_FIXED as a limit for a shorter gap (ie 4kB), considering
>   that 1) it used to work like this for many years, and 2) if an
>   application is forcing a MAP_FIXED just below the stack and at the
>   same time uses large alloca() or VLA it's definitely bogus and looking
>   for unfixable trouble. Not allowing this means we break existing
>   applications anyway.

That would probably give the following (only build-tested on x86_64).
Do you think it would make sense and/or be acceptable? That would more
easily avoid the other options like adding sysctl + warnings or making
a special case of setuid.

Willy
---
>From 56ae4e57e446bc92fd2647327da281e313930524 Mon Sep 17 00:00:00 2001
From: Willy Tarreau
Date: Thu, 6 Jul 2017 12:00:54 +0200
Subject: mm: mm, mmap: only apply a one page gap between the stack and MAP_FIXED

Some programs place a MAP_FIXED below the stack, not leaving enough room
for the stack guard.

This patch keeps track of MAP_FIXED, mirroring it in a new VM_FIXED flag,
and reduces the stack guard to a single page (as it used to be) in such a
situation, assuming that when an application places a fixed map close to
the stack, it very likely does it on purpose and is taking full
responsibility for the risk of the stack blowing up.

Cc: Ben Hutchings
Cc: Michal Hocko
Cc: Kees Cook
Cc: Andy Lutomirski
Signed-off-by: Willy Tarreau
---
 include/linux/mm.h   |  1 +
 include/linux/mman.h |  1 +
 mm/mmap.c            | 30 ++++++++++++++++++++----------
 3 files changed, 22 insertions(+), 10 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6f543a4..41492b9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -188,6 +188,7 @@ extern int overcommit_kbytes_handler(struct ctl_table *, int, void __user *,
 #define VM_ACCOUNT	0x0010	/* Is a VM accounted object */
 #define VM_NORESERVE	0x0020	/* should the VM suppress accounting */
 #define VM_HUGETLB	0x0040	/* Huge TLB Page VM */
+#define VM_FIXED	0x0080	/* MAP_FIXED was used */
 #define VM_ARCH_1	0x0100	/* Architecture-specific flag */
 #define VM_ARCH_2	0x0200
 #define VM_DONTDUMP	0x0400	/* Do not include in the core dump */
diff --git a/include/linux/mman.h b/include/linux/mman.h
index 634c4c5..3a29069 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -86,6 +86,7 @@ static inline bool arch_validate_prot(unsigned long prot)
 {
 	return _calc_vm_trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN) |
 	       _calc_vm_trans(flags, MAP_DENYWRITE, VM_DENYWRITE) |
+	       _calc_vm_trans(flags, MAP_FIXED,     VM_FIXED    ) |
 	       _calc_vm_trans(flags, MAP_LOCKED,    VM_LOCKED   );
 }
diff --git a/mm/mmap.c b/mm/mmap.c
index ece0f6d..7fc1c29 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2244,12 +2244,17 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 		gap_addr = TASK_SIZE;

 	next = vma->vm_next;
+
+	/* PROT_NONE above a MAP_GROWSUP always serves as a mark and inhibits
+	 * the stack guard gap.
+	 * MAP_FIXED above a MAP_GROWSUP only requires a single page guard.
+	 */
 	if (next && next->vm_start < gap_addr &&
-	    (next->vm_flags & (VM_WRITE|VM_READ|VM_EXEC))) {
-		if (!(next->vm_flags & VM_GROWSUP))
-			return -ENOMEM;
-		/* Check that both stack segments have the same anon_vma? */
-	}
+	    !(next->vm_flags & VM_GROWSUP) &&
+	    (next->vm_flags & (VM_WRITE|VM_READ|VM_EXEC)) &&
+	    (!(next->vm_flags & VM_FIXED) ||
+	     next->vm_start < address + PAGE_SIZE))
+		return -ENOMEM;

 	/* We must make sure the anon_vma is allocated. */
 	if (unlikely(anon_vma_prepare(vma)))
@@ -2329,12 +2334,17 @@ int expand_downwards(struct vm_area_struct *vma,
 	if (gap_addr > address)
 		return -ENOMEM;
 	prev = vma->vm_prev;
+
+	/* PROT_NONE below a MAP_GROWSDOWN always serves as a mark and inhibits
+	 * the stack guard gap.
+	 * MAP_FIXED below a MAP_GROWSDOWN
Re: [PATCH] mm: larger stack guard gap, between vmas
On Wed, Jul 05, 2017 at 04:51:06PM -0700, Linus Torvalds wrote:
> On Wed, Jul 5, 2017 at 4:35 PM, Ben Hutchings wrote:
> >>
> >> And I think your second patch breaks that "use a really large value to
> >> approximate infinity" case that definitely has existed as a pattern.
> >
> > Right. Well that seems to leave us with remembering the MAP_FIXED flag
> > and using that as the condition to ignore the previous mapping.
>
> I'm not particularly happy about having a MAP_FIXED special case, but
> yeah, I'm not seeing a lot of alternatives.

We can possibly refine it like this:

- use PROT_NONE as a mark for the end of the stack and consider the
  application doing this knows exactly what it's doing;

- use other MAP_FIXED as a limit for a shorter gap (ie 4kB), considering
  that 1) it used to work like this for many years, and 2) if an
  application is forcing a MAP_FIXED just below the stack and at the same
  time uses large alloca() or VLA it's definitely bogus and looking for
  unfixable trouble. Not allowing this means we break existing
  applications anyway.

Willy
Re: [PATCH] mm: larger stack guard gap, between vmas
On Wed 05-07-17 08:36:45, Michal Hocko wrote:
> On Tue 04-07-17 16:31:52, Linus Torvalds wrote:
> > On Tue, Jul 4, 2017 at 4:01 PM, Ben Hutchings wrote:
> > >
> > > We have:
> > >
> > > bottom = 0xff803fff
> > > sp = 0xb178
> > >
> > > The relevant mappings are:
> > >
> > > ff7fc000-ff7fd000 rwxp 00:00 0
> > > fffdd000-e000 rw-p 00:00 0  [stack]
> >
> > Ugh. So that stack is actually 8MB in size, but the alloca() is about
> > to use up almost all of it, and there's only about 28kB left between
> > "bottom" and that 'rwx' mapping.
> >
> > Still, that rwx mapping is interesting: it is a single page, and it
> > really is almost exactly 8MB below the stack.
> >
> > In fact, the top of stack (at 0xe000) is *exactly* 8MB+4kB from
> > the top of that odd one-page allocation (0xff7fd000).
>
> Very interesting! I would be really curious whether changing ulimit to
> something bigger changes the picture.

It's a public holiday today here, I haven't read all new emails and I
will be mostly offline today. I will catch up tomorrow. But before we go
to more tricky workarounds, could you double check that simply
increasing RLIMIT_STACK works around the problem here? Because if it
does, and other workarounds require some manual intervention anyway,
then changing ulimit sounds like the least tricky one to me.
-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm: larger stack guard gap, between vmas
On Wed, Jul 05, 2017 at 04:23:56PM +0200, Michal Hocko wrote:
> On Wed 05-07-17 13:19:40, Ben Hutchings wrote:
> > On Tue, 2017-07-04 at 16:31 -0700, Linus Torvalds wrote:
> > > On Tue, Jul 4, 2017 at 4:01 PM, Ben Hutchings wrote:
> > > >
> > > > We have:
> > > >
> > > > bottom = 0xff803fff
> > > > sp = 0xb178
> > > >
> > > > The relevant mappings are:
> > > >
> > > > ff7fc000-ff7fd000 rwxp 00:00 0
> > > > fffdd000-e000 rw-p 00:00 0  [stack]
> > >
> > > Ugh. So that stack is actually 8MB in size, but the alloca() is
> > > about to use up almost all of it, and there's only about 28kB left
> > > between "bottom" and that 'rwx' mapping.
> > >
> > > Still, that rwx mapping is interesting: it is a single page, and it
> > > really is almost exactly 8MB below the stack.
> > >
> > > In fact, the top of stack (at 0xe000) is *exactly* 8MB+4kB from
> > > the top of that odd one-page allocation (0xff7fd000).
> > >
> > > Can you find out where that is allocated? Perhaps a breakpoint on
> > > mmap, with a condition to catch that particular one?
> [...]
> >
> > Found it, and it's now clear why only i386 is affected:
> > http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os/linux/vm/os_linux.cpp#l4852
> > http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os_cpu/linux_x86/vm/os_linux_x86.cpp#l881
>
> This is really worrying. This doesn't look like a gap at all. It is a
> mapping which actually contains code, and so we should absolutely not
> allow scribbling over it. So I am afraid the only way forward is to
> allow a per-process stack gap and run this particular program with a
> smaller gap. We basically have two ways: either /proc//$file or
> a prctl inherited on exec. The latter is less code. What do you
> think?

On the plus side, the code in that page (a single RET) is only executed
once, when the workaround function is called. Notice that 'codebuf' is
never even returned out of that function.

The only reason they even leave that page mapped is to stop the exec
shield limit from being lowered on them again.

- Kevin
Re: [PATCH] mm: larger stack guard gap, between vmas
On Wed, Jul 05, 2017 at 05:19:47PM -0700, Andy Lutomirski wrote:
> I think it's ridiculous that you can change rlimits and then
> exec a setuid thing. It's not so easy to fix, though. Maybe track,
> per-task, inherited by clone and exec, what the rlimits were the last
> time the process had privilege and reset to those limits when running
> something setuid. But a better approach might be to have some sysctls
> that say what the rlimits become when doing setuid.

*Some* rlimits are useful and needed for the user as you mentioned.
RLIMIT_CORE definitely is one of them, especially for debugging, when
combined with suid_dumpable. Some others like RLIMIT_STACK should
probably never be configurable at all and only cause trouble. Probably
simply having a sysctl to set this one for setuid programs, ignoring the
current limit, would be enough. We could even imagine another one to set
the stack guard gap for setuid programs (this would also limit the
impact of having a large gap for everyone).

> We need per-user-ns sysctls for stuff like this, and we don't really
> have them...

I don't think we need to be this fine-grained. mmap_min_addr is global,
is used to address very similar issues, and nobody seems to complain.

Willy
Re: [PATCH] mm: larger stack guard gap, between vmas
On Wed, Jul 5, 2017 at 5:19 PM, Andy Lutomirski wrote:
> On Wed, Jul 5, 2017 at 4:50 PM, Kees Cook wrote:
>> As part of that should we put restrictions on the environment of
>> set*id exec too? Part of the risks demonstrated by Qualys was that
>> allowing a privilege-elevating binary to inherit rlimits can lead
>> to nasty memory layout side-effects. That would fall into the
>> "hardening" bucket as well. And if it turns out there is some set*id
>> binary out there that can't run with "only", e.g., 128MB of stack, we
>> can make it configurable...
>
> Yes. I think it's ridiculous that you can change rlimits and then
> exec a setuid thing. It's not so easy to fix, though. Maybe track,
> per-task, inherited by clone and exec, what the rlimits were the last
> time the process had privilege and reset to those limits when running
> something setuid. But a better approach might be to have some sysctls
> that say what the rlimits become when doing setuid.
>
> We need per-user-ns sysctls for stuff like this, and we don't really
> have them...

In userspace, the way that sensible rlimit defaults were selected by
PAM when building an initial environment is to just examine the rlimits
of init. Maybe we could just do the same thing here, which gives us
some level of namespace control.

-Kees

-- 
Kees Cook
Pixel Security
Re: [PATCH] mm: larger stack guard gap, between vmas
On Wed, Jul 5, 2017 at 5:31 PM, Andy Lutomirski wrote:
>
> I wonder if we could pull some "sane" values out of our arses and have
> it work just fine.

That approach may work, but it's pretty nasty.

But together with at least some way for the distro to set the values we
pick, it would probably be fairly reasonable.

You're right that most of the rlimits are just not very useful.

Linus
Re: [PATCH] mm: larger stack guard gap, between vmas
On Wed, Jul 5, 2017 at 4:55 PM, Linus Torvalds wrote:
> On Wed, Jul 5, 2017 at 4:50 PM, Kees Cook wrote:
>>
>> As part of that should we put restrictions on the environment of
>> set*id exec too?
>
> I'm not seeing what sane limits you could use.
>
> I think the concept of "reset as much of the environment to sane
> things when running suid binaries" is a good concept.
>
> But we simply don't have any sane values to reset things to.

I wonder if we could pull some "sane" values out of our arses and have
it work just fine.

It's worth noting that a lot of the rlimits don't meaningfully restrict
the use of any particular resource, so we could plausibly drop
requirements to have privilege to increase them if we really cared to.
I don't see why we'd make such a change, but it means that, if we reset
on set*id and therefore poke a hole that allows a program to do
"sudo -u $me whatever" and thereby reset limits, it's not so bad.

A tiny survey:

RLIMIT_AS: not a systemwide resource at all.
RLIMIT_CORE: more or less just a policy of what you do when you crash.
  I don't see how you could do much damage here.
RLIMIT_CPU: unless you're not allowed to fork(), this doesn't restrict
  anything systemwide.
RLIMIT_DATA: ***
RLIMIT_FSIZE: maybe? but I can see this being quite dangerous across
  set*id
RLIMIT_LOCKS: gone
RLIMIT_MEMLOCK: this one matters, but it also seems nearly worthless
  for exploits
RLIMIT_MSGQUEUE: privilege matters here
RLIMIT_NICE: maybe? anyone who actually cares would use cgroups instead
RLIMIT_NOFILE: great for exploits. Only sort of useful for resource
  management
RLIMIT_NPROC: privilege matters here
RLIMIT_RTTIME: privilege kind of matters. Also dangerous for exploits
  (a bit) since it lets you kill your children at controlled times.
RLIMIT_SIGPENDING: not sure
RLIMIT_STACK: ***

*** means that this is a half-arsed resource control. It's half-arsed
because this stuff doesn't cover mmap(2), which seems to me like it
defeats the purpose.

This stuff feels like a throwback to the eighties.
Re: [PATCH] mm: larger stack guard gap, between vmas
On Wed, Jul 5, 2017 at 4:50 PM, Ben Hutchings wrote:
> On Wed, 2017-07-05 at 13:53 -0700, Andy Lutomirski wrote:
>> On Jul 5, 2017, at 12:32 PM, Ben Hutchings wrote:
>> > On Wed, 2017-07-05 at 10:23 -0700, Andy Lutomirski wrote:
> [...]
>> > > - As a hardening feature, if the stack would expand within 64k or
>> > > whatever of a non-MAP_FIXED mapping, refuse to expand it. (This
>> > > might have to be a non-hinted mapping, not just a non-MAP_FIXED
>> > > mapping.) The idea being that, if you deliberately place a mapping
>> > > under the stack, you know what you're doing. If you're like
>> > > LibreOffice and do something daft and are thus exploitable, you're
>> > > on your own.
>> > > - As a hardening measure, don't let mmap without MAP_FIXED position
>> > > something within 64k or whatever of the bottom of the stack unless
>> > > a MAP_FIXED mapping is between them.
>> >
>> > Having tested patches along these lines, I think the above would
>> > avoid the reported regressions.
>>
>> FWIW, even this last part may be problematic. It'll break anything
>> that tries to allocate many small MAP_GROWSDOWN stacks on 32-bit.
>> Hopefully nothing does this, but maybe Java does.
>
> glibc (NPTL) does not. Java (at least HotSpot in OpenJDK 6, 7, 8) does
> not. LinuxThreads *does* and is used by uclibc. dietlibc *does*. I
> would be surprised if either was used for applications with very many
> threads, but then this issue has thrown up a lot of surprises.

Ugh. But yeah, I'd be a bit surprised to see heavily threaded apps
using LinuxThreads or dietlibc.

LinuxThreads still uses modify_ldt(), right? modify_ldt() performance
is abysmal, and I have no intention of even trying to optimize it.
Anyhow, you *can't* have more than 8192 threads if you use modify_ldt()
for TLS because you run out of LDT slots. 8192 * 64k fits in 32 bits
with room to spare, so this is unlikely to be a showstopper.

--Andy
Re: [PATCH] mm: larger stack guard gap, between vmas
On Wed, Jul 5, 2017 at 4:50 PM, Kees Cook wrote:
> On Wed, Jul 5, 2017 at 10:23 AM, Andy Lutomirski wrote:
>> Right. But I think the approach that we're all taking here is a bit
>> nutty. We all realize that this issue is a longstanding *GCC* bug
>> [1], but we're acting like it's a Big Deal (tm) kernel bug that Must
>> Be Fixed (tm) and therefore is allowed to break ABI. My security hat
>> is normally pretty hard-line, but I think it may be time to call BS.
>>
>> Imagine if Kees had sent some symlink hardening patch that was
>> default-on and broke a stock distro. Or if I had sent a vsyscall
>> hardening patch that broke real code. It would get reverted right
>> away, probably along with a diatribe about how we should have known
>> better. I think this stack gap stuff is the same thing. It's not a
>> security fix -- it's a hardening patch.
>>
>> Looking at it that way, I think a new inherited-on-exec flag is nucking futs.
>>
>> I'm starting to think that the right approach is to mostly revert all
>> this stuff (the execve fixes are fine). Then start over and think
>> about it as hardening. I would suggest the following approach:
>>
>> - The stack gap is one page, just like it's been for years.
>> - As a hardening feature, if the stack would expand within 64k or
>> whatever of a non-MAP_FIXED mapping, refuse to expand it. (This might
>> have to be a non-hinted mapping, not just a non-MAP_FIXED mapping.)
>> The idea being that, if you deliberately place a mapping under the
>> stack, you know what you're doing. If you're like LibreOffice and do
>> something daft and are thus exploitable, you're on your own.
>> - As a hardening measure, don't let mmap without MAP_FIXED position
>> something within 64k or whatever of the bottom of the stack unless a
>> MAP_FIXED mapping is between them.
>>
>> And that's all. It's not like a 64k gap actually fixes these bugs for
>> real -- it just makes them harder to exploit.
>>
>> [1] The code that GCC generates for char buf[big number] and alloca()
>> is flat-out wrong. Everyone who's ever thought about it at all knows
>> it and has known about it for years, but no one cared to fix it.
>
> As part of that should we put restrictions on the environment of
> set*id exec too? Part of the risks demonstrated by Qualys was that
> allowing a privilege-elevating binary to inherit rlimits can lead
> to the nasty memory layout side-effects. That would fall into the
> "hardening" bucket as well. And if it turns out there is some set*id
> binary out there that can't run with "only", e.g., 128MB of stack, we
> can make it configurable...

Yes. I think it's ridiculous that you can change rlimits and then exec
a setuid thing. It's not so easy to fix, though. Maybe track, per-task,
inherited by clone and exec, what the rlimits were the last time the
process had privilege and reset to those limits when running something
setuid.

But a better approach might be to have some sysctls that say what the
rlimits become when doing setuid. We need per-user-ns sysctls for stuff
like this, and we don't really have them...

--Andy
Re: [PATCH] mm: larger stack guard gap, between vmas
On Wed, Jul 5, 2017 at 4:50 PM, Kees Cook wrote:
>
> As part of that should we put restrictions on the environment of
> set*id exec too?

I'm not seeing what sane limits you could use.

I think the concept of "reset as much of the environment to sane things
when running suid binaries" is a good concept. But we simply don't have
any sane values to reset things to.

Linus
Re: [PATCH] mm: larger stack guard gap, between vmas
On Wed, 2017-07-05 at 13:53 -0700, Andy Lutomirski wrote:
> On Jul 5, 2017, at 12:32 PM, Ben Hutchings wrote:
> > On Wed, 2017-07-05 at 10:23 -0700, Andy Lutomirski wrote:
[...]
> > > - As a hardening feature, if the stack would expand within 64k or
> > > whatever of a non-MAP_FIXED mapping, refuse to expand it. (This might
> > > have to be a non-hinted mapping, not just a non-MAP_FIXED mapping.)
> > > The idea being that, if you deliberately place a mapping under the
> > > stack, you know what you're doing. If you're like LibreOffice and do
> > > something daft and are thus exploitable, you're on your own.
> > > - As a hardening measure, don't let mmap without MAP_FIXED position
> > > something within 64k or whatever of the bottom of the stack unless a
> > > MAP_FIXED mapping is between them.
> >
> > Having tested patches along these lines, I think the above would avoid
> > the reported regressions.
> >
>
> FWIW, even this last part may be problematic. It'll break anything
> that tries to allocate many small MAP_GROWSDOWN stacks on 32-bit.
> Hopefully nothing does this, but maybe Java does.

glibc (NPTL) does not. Java (at least Hotspot in OpenJDK 6, 7, 8) does
not. LinuxThreads *does* and is used by uclibc. dietlibc *does*. I
would be surprised if either was used for applications with very many
threads, but then this issue has thrown up a lot of surprises.

Ben.

--
Ben Hutchings
Man invented language to satisfy his deep need to complain. - Lily Tomlin
Re: [PATCH] mm: larger stack guard gap, between vmas
On Wed, Jul 5, 2017 at 4:35 PM, Ben Hutchings wrote:
>>
>> And I think your second patch breaks that "use a really large value to
>> approximate infinity" case that definitely has existed as a pattern.
>
> Right. Well that seems to leave us with remembering the MAP_FIXED flag
> and using that as the condition to ignore the previous mapping.

I'm not particularly happy about having a MAP_FIXED special case, but
yeah, I'm not seeing a lot of alternatives.

Linus
Re: [PATCH] mm: larger stack guard gap, between vmas
On Wed, Jul 5, 2017 at 10:23 AM, Andy Lutomirski wrote:
> Right. But I think the approach that we're all taking here is a bit
> nutty. We all realize that this issue is a longstanding *GCC* bug
> [1], but we're acting like it's a Big Deal (tm) kernel bug that Must
> Be Fixed (tm) and therefore is allowed to break ABI. My security hat
> is normally pretty hard-line, but I think it may be time to call BS.
>
> Imagine if Kees had sent some symlink hardening patch that was
> default-on and broke a stock distro. Or if I had sent a vsyscall
> hardening patch that broke real code. It would get reverted right
> away, probably along with a diatribe about how we should have known
> better. I think this stack gap stuff is the same thing. It's not a
> security fix -- it's a hardening patch.
>
> Looking at it that way, I think a new inherited-on-exec flag is nucking futs.
>
> I'm starting to think that the right approach is to mostly revert all
> this stuff (the execve fixes are fine). Then start over and think
> about it as hardening. I would suggest the following approach:
>
> - The stack gap is one page, just like it's been for years.
> - As a hardening feature, if the stack would expand within 64k or
> whatever of a non-MAP_FIXED mapping, refuse to expand it. (This might
> have to be a non-hinted mapping, not just a non-MAP_FIXED mapping.)
> The idea being that, if you deliberately place a mapping under the
> stack, you know what you're doing. If you're like LibreOffice and do
> something daft and are thus exploitable, you're on your own.
> - As a hardening measure, don't let mmap without MAP_FIXED position
> something within 64k or whatever of the bottom of the stack unless a
> MAP_FIXED mapping is between them.
>
> And that's all. It's not like a 64k gap actually fixes these bugs for
> real -- it just makes them harder to exploit.
>
> [1] The code that GCC generates for char buf[big number] and alloca()
> is flat-out wrong. Everyone who's ever thought about it at all knows
> it and has known about it for years, but no one cared to fix it.

As part of that should we put restrictions on the environment of
set*id exec too? Part of the risks demonstrated by Qualys was that
allowing a privilege-elevating binary to inherit rlimits can lead
to the nasty memory layout side-effects. That would fall into the
"hardening" bucket as well. And if it turns out there is some set*id
binary out there that can't run with "only", e.g., 128MB of stack, we
can make it configurable...

-Kees

--
Kees Cook
Pixel Security
Re: [PATCH] mm: larger stack guard gap, between vmas
On Wed, 2017-07-05 at 10:15 -0700, Linus Torvalds wrote:
>
> On Wed, Jul 5, 2017 at 9:58 AM, Ben Hutchings wrote:
> >
> > I ended up with the following two patches, which seem to deal with
> > both the Java and Rust regressions. These don't touch the
> > stack-grows-up paths at all because Rust doesn't run on those
> > architectures and the Java weirdness is i386-specific.
> >
> > They definitely need longer commit messages and comments, but aside
> > from that do these look reasonable?
>
> I think they both look reasonable, but I think we might still want to
> massage things a bit (cutting down the quoting to a minimum, hopefully
> leaving enough context to still make sense):
>
> > Subject: [1/2] mmap: Skip a single VM_NONE mapping when checking the stack gap
> >
> > prev = vma->vm_prev;
> > + if (prev && !(prev->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)))
> > + prev = prev->vm_prev;
> > if (prev && prev->vm_end > gap_addr) {
>
> Do we just want to ignore the user-supplied guard mapping, or do we
> want to say "if the user does a guard mapping, we use that *instead*
> of our stack gap"?
>
> IOW, instead of "prev = prev->vm_prev;" and continuing, maybe we want
> to just return "ok".

Rust effectively added a second guard page to the main thread stack.
But it does not (yet) implement stack probing
(https://github.com/rust-lang/rust/issues/16012) so I think it will
benefit from the kernel's larger stack guard gap.

> > Subject: [2/2] mmap: Avoid mapping anywhere within the full stack extent if finite
>
> This is good thinking, but no, I don't think the "if finite" is right.
>
> I've seen people use "really big values" as replacement for
> RLIM_INFINITY, for various reasons.
>
> We've had huge confusion about RLIM_INFINITY over the years - look for
> things like COMPAT_RLIM_OLD_INFINITY to see the kinds of confusions
> we've had.

That sounds familiar...

> Some people just use MAX_LONG etc, which is *not* the same as
> RLIM_INFINITY, but in practice ends up doing the same thing. Yadda
> yadda.
>
> So I'm personally leery of checking and depending on "exactly
> RLIM_INFINITY", because I've seen it go wrong so many times.
>
> And I think your second patch breaks that "use a really large value to
> approximate infinity" case that definitely has existed as a pattern.

Right. Well that seems to leave us with remembering the MAP_FIXED flag
and using that as the condition to ignore the previous mapping.

Ben.

--
Ben Hutchings
Man invented language to satisfy his deep need to complain. - Lily Tomlin
Re: [PATCH] mm: larger stack guard gap, between vmas
> On Jul 5, 2017, at 12:32 PM, Ben Hutchings wrote:
>
>> On Wed, 2017-07-05 at 10:23 -0700, Andy Lutomirski wrote:
>> [...]
>> Looking at it that way, I think a new inherited-on-exec flag is nucking futs.
>>
>> I'm starting to think that the right approach is to mostly revert all
>> this stuff (the execve fixes are fine). Then start over and think
>> about it as hardening. I would suggest the following approach:
>>
>> - The stack gap is one page, just like it's been for years.
>
> Given that in the following points you say that something sounding like
> a stack gap would be "64k or whatever", what does "the stack gap" mean
> in this first point?

I mean one page, with semantics as close to previous (4.11) behavior as
practical.

>
>> - As a hardening feature, if the stack would expand within 64k or
>> whatever of a non-MAP_FIXED mapping, refuse to expand it. (This might
>> have to be a non-hinted mapping, not just a non-MAP_FIXED mapping.)
>> The idea being that, if you deliberately place a mapping under the
>> stack, you know what you're doing. If you're like LibreOffice and do
>> something daft and are thus exploitable, you're on your own.
>> - As a hardening measure, don't let mmap without MAP_FIXED position
>> something within 64k or whatever of the bottom of the stack unless a
>> MAP_FIXED mapping is between them.
>
> Having tested patches along these lines, I think the above would avoid
> the reported regressions.

FWIW, even this last part may be problematic. It'll break anything that
tries to allocate many small MAP_GROWSDOWN stacks on 32-bit. Hopefully
nothing does this, but maybe Java does.

> Ben.
>
>> And that's all. It's not like a 64k gap actually fixes these bugs for
>> real -- it just makes them harder to exploit.
>>
>> [1] The code that GCC generates for char buf[big number] and alloca()
>> is flat-out wrong. Everyone who's ever thought about it at all knows
>> it and has known about it for years, but no one cared to fix it.
>
> --
> Ben Hutchings
> Anthony's Law of Force: Don't force it, get a larger hammer.
>
Re: [PATCH] mm: larger stack guard gap, between vmas
On Wed, Jul 05, 2017 at 08:32:43PM +0100, Ben Hutchings wrote:
> > - As a hardening feature, if the stack would expand within 64k or
> > whatever of a non-MAP_FIXED mapping, refuse to expand it. (This might
> > have to be a non-hinted mapping, not just a non-MAP_FIXED mapping.)
> > The idea being that, if you deliberately place a mapping under the
> > stack, you know what you're doing. If you're like LibreOffice and do
> > something daft and are thus exploitable, you're on your own.
> > - As a hardening measure, don't let mmap without MAP_FIXED position
> > something within 64k or whatever of the bottom of the stack unless a
> > MAP_FIXED mapping is between them.
>
> Having tested patches along these lines, I think the above would avoid
> the reported regressions.

Stuff like this has already been proposed, but Linus suspects that more
software than we imagine uses MAP_FIXED and could break. I can neither
confirm nor refute that, which probably indicates that there's nothing
fundamentally wrong with this approach from userland's perspective, and
that such software may indeed be more common than we would like.

Willy
Re: [PATCH] mm: larger stack guard gap, between vmas
On Wed, 2017-07-05 at 10:23 -0700, Andy Lutomirski wrote:
[...]
> Looking at it that way, I think a new inherited-on-exec flag is nucking futs.
>
> I'm starting to think that the right approach is to mostly revert all
> this stuff (the execve fixes are fine). Then start over and think
> about it as hardening. I would suggest the following approach:
>
> - The stack gap is one page, just like it's been for years.

Given that in the following points you say that something sounding like
a stack gap would be "64k or whatever", what does "the stack gap" mean
in this first point?

> - As a hardening feature, if the stack would expand within 64k or
> whatever of a non-MAP_FIXED mapping, refuse to expand it. (This might
> have to be a non-hinted mapping, not just a non-MAP_FIXED mapping.)
> The idea being that, if you deliberately place a mapping under the
> stack, you know what you're doing. If you're like LibreOffice and do
> something daft and are thus exploitable, you're on your own.
> - As a hardening measure, don't let mmap without MAP_FIXED position
> something within 64k or whatever of the bottom of the stack unless a
> MAP_FIXED mapping is between them.

Having tested patches along these lines, I think the above would avoid
the reported regressions.

Ben.

> And that's all. It's not like a 64k gap actually fixes these bugs for
> real -- it just makes them harder to exploit.
>
> [1] The code that GCC generates for char buf[big number] and alloca()
> is flat-out wrong. Everyone who's ever thought about it at all knows
> it and has known about it for years, but no one cared to fix it.

--
Ben Hutchings
Anthony's Law of Force: Don't force it, get a larger hammer.
Re: [PATCH] mm: larger stack guard gap, between vmas
On Wed, Jul 5, 2017 at 12:18 PM, Willy Tarreau wrote:
>
> But only if the sysctl is set. It can simply be recommended to set it
> if any program fails. We've done this for many years with other ones
> like min_mmap_addr or tcp_ecn.

Ok, fair enough. I don't hate the approach, and maybe it's simpler
overall, and would help find other potential problem spots.

*Hopefully* it was just that Rust thing and the nasty Java exec-shield
workaround, but yeah, those might just be the first ones that have
been found so far.

Linus
Re: [PATCH] mm: larger stack guard gap, between vmas
On Wed, Jul 05, 2017 at 12:17:20PM -0700, Linus Torvalds wrote:
> On Wed, Jul 5, 2017 at 11:59 AM, Willy Tarreau wrote:
> >
> > Don't you think that the option of having a sysctl to relax the check
> > per task wouldn't be easier for distros and safer overall ? Ie, emit
> > a warning the first time the gap is hit instead of segfaulting, then
> > reduce it to something that used to work (4k or 64k, I don't remember)
> > and try again ?
>
> It used to be just 4k.
>
> .. and I think that might be a valid way to find these things, but
> would it be safer? It basically disables the new stack gap entirely
> apart from the warning.

But only if the sysctl is set. It can simply be recommended to set it
if any program fails. We've done this for many years with other ones
like min_mmap_addr or tcp_ecn.

Willy
Re: [PATCH] mm: larger stack guard gap, between vmas
On Wed, Jul 5, 2017 at 11:59 AM, Willy Tarreau wrote:
>
> Don't you think that the option of having a sysctl to relax the check
> per task wouldn't be easier for distros and safer overall ? Ie, emit
> a warning the first time the gap is hit instead of segfaulting, then
> reduce it to something that used to work (4k or 64k, I don't remember)
> and try again ?

It used to be just 4k.

.. and I think that might be a valid way to find these things, but
would it be safer? It basically disables the new stack gap entirely
apart from the warning.

And maybe that's ok and distros prefer that?

Linus
Re: [PATCH] mm: larger stack guard gap, between vmas
On Wed, Jul 05, 2017 at 09:17:59AM -0700, Linus Torvalds wrote:
(...)
> The good news is that this is probably specialized enough that we can
> just keep the defaults as "will break this one case, but we give
> people the tools to work around it".
>
> I hate doing that, but distros that still support 32-bit (which is
> apparently a shrinking number) can maybe hack the libreoffice launch
> scripts up?

Don't you think that the option of having a sysctl to relax the check
per task wouldn't be easier for distros and safer overall ? Ie, emit
a warning the first time the gap is hit instead of segfaulting, then
reduce it to something that used to work (4k or 64k, I don't remember)
and try again ?

It would quickly report all these "special" programs for end-user
distros, without leaving too much room for attacks due to the warning
making it pretty obvious what's going on.

I just don't know how to place this stack gap per process but since
this was already discussed for prctl I think it's doable.

Willy
Re: [PATCH] mm: larger stack guard gap, between vmas
On Wed, 2017-07-05 at 19:05 +0200, Michal Hocko wrote:
> On Wed 05-07-17 17:58:45, Ben Hutchings wrote:
> [...]
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index c7906ae1a7a1..f8131a94e56e 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -2307,6 +2307,25 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> >  }
> >  #endif /* CONFIG_STACK_GROWSUP || CONFIG_IA64 */
> >
> > +unsigned long __vm_start_gap(struct vm_area_struct *vma)
> > +{
> > +	unsigned long stack_limit =
> > +		current->signal->rlim[RLIMIT_STACK].rlim_cur;
> > +	unsigned long vm_start;
> > +
> > +	if (stack_limit != RLIM_INFINITY &&
> > +	    vma->vm_end - vma->vm_start < stack_limit)
> > +		vm_start = vma->vm_end - PAGE_ALIGN(stack_limit);
>
> This is exactly what I was worried about in my previous email. Say
> somebody sets stack ulimit to 1G or so. Should we reduce the available
> address space that much?

It's not ideal, but why would someone set the stack limit that high
unless it's for an application that will actually use most of that
stack space? Do you think that "increase the stack limit" has been
cargo-culted?

> Say you are 32b and you have an application
> with multiple stacks each doing its MAP_GROWSDOWN. [...]

So this application is using dietlibc or uclibc? glibc uses fixed-size
mappings for new threads.

I suppose there's a risk that by doing this we would make MAP_GROWSDOWN
useful enough that it is more likely to be used for new thread stacks
in future.

Ben.

-- 
Ben Hutchings
Anthony's Law of Force: Don't force it, get a larger hammer.
Re: [PATCH] mm: larger stack guard gap, between vmas
On Wed, Jul 5, 2017 at 9:20 AM, Linus Torvalds wrote:
> On Wed, Jul 5, 2017 at 9:15 AM, Andy Lutomirski wrote:
>> On Wed, Jul 5, 2017 at 7:23 AM, Michal Hocko wrote:
>>>
>>> This is really worrying. This doesn't look like a gap at all. It is a
>>> mapping which actually contains a code and so we should absolutely not
>>> allow to scribble over it. So I am afraid the only way forward is to
>>> allow per process stack gap and run this particular program to have a
>>> smaller gap. We basically have two ways. Either /proc//$file or
>>> a prctl inherited on exec. The later is a smaller code. What do you
>>> think?
>>
>> Why inherit on exec?
>
> .. because the whole point is that you have an existing binary that breaks.
>
> So you need to be able to wrap it in "let's lower the stack gap, then
> run that known-problematic binary".
>
> If you think the problem is solved by recompiling existing binaries,
> then why are we doing this kernel hack to begin with? The *real*
> solution was always to just fix the damn compiler and ABI.

That's not what I was suggesting at all. I was suggesting that, if
we're going to suggest a new API, that the new API actually be sane.

>
> That *real* solution is simple and needs no kernel support at all.
>
> In other words, *ALL* of the kernel work in this area is purely to
> support existing binaries. Don't overlook that fact.

Right. But I think the approach that we're all taking here is a bit
nutty. We all realize that this issue is a longstanding *GCC* bug [1],
but we're acting like it's a Big Deal (tm) kernel bug that Must Be
Fixed (tm) and therefore is allowed to break ABI. My security hat is
normally pretty hard-line, but I think it may be time to call BS.

Imagine if Kees had sent some symlink hardening patch that was
default-on and broke a stock distro. Or if I had sent a vsyscall
hardening patch that broke real code. It would get reverted right
away, probably along with a diatribe about how we should have known
better.

I think this stack gap stuff is the same thing. It's not a security
fix -- it's a hardening patch. Looking at it that way, I think a new
inherited-on-exec flag is nucking futs.

I'm starting to think that the right approach is to mostly revert all
this stuff (the execve fixes are fine). Then start over and think
about it as hardening. I would suggest the following approach:

 - The stack gap is one page, just like it's been for years.

 - As a hardening feature, if the stack would expand within 64k or
whatever of a non-MAP_FIXED mapping, refuse to expand it. (This might
have to be a non-hinted mapping, not just a non-MAP_FIXED mapping.)
The idea being that, if you deliberately place a mapping under the
stack, you know what you're doing. If you're like LibreOffice and do
something daft and are thus exploitable, you're on your own.

 - As a hardening measure, don't let mmap without MAP_FIXED position
something within 64k or whatever of the bottom of the stack unless a
MAP_FIXED mapping is between them.

And that's all. It's not like a 64k gap actually fixes these bugs for
real -- it just makes them harder to exploit.

[1] The code that GCC generates for char buf[big number] and alloca()
is flat-out wrong. Everyone who's ever thought about it at all knows
it and has known about it for years, but no one cared to fix it.
Re: [PATCH] mm: larger stack guard gap, between vmas
On Wed, Jul 5, 2017 at 9:58 AM, Ben Hutchings wrote:
>
> I ended up with the following two patches, which seem to deal with
> both the Java and Rust regressions. These don't touch the
> stack-grows-up paths at all because Rust doesn't run on those
> architectures and the Java weirdness is i386-specific.
>
> They definitely need longer commit messages and comments, but aside
> from that do these look reasonable?

I think they both look reasonable, but I think we might still want to
massage things a bit (cutting down the quoting to a minimum, hopefully
leaving enough context to still make sense):

> Subject: [1/2] mmap: Skip a single VM_NONE mapping when checking the stack gap
>
> 	prev = vma->vm_prev;
> +	if (prev && !(prev->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)))
> +		prev = prev->vm_prev;
> 	if (prev && prev->vm_end > gap_addr) {

Do we just want to ignore the user-supplied guard mapping, or do we
want to say "if the user does a guard mapping, we use that *instead*
of our stack gap"?

IOW, instead of "prev = prev->vm_prev;" and continuing, maybe we want
to just return "ok".

> Subject: [2/2] mmap: Avoid mapping anywhere within the full stack extent if finite

This is good thinking, but no, I don't think the "if finite" is right.

I've seen people use "really big values" as replacement for
RLIM_INFINITY, for various reasons.

We've had huge confusion about RLIM_INFINITY over the years - look for
things like COMPAT_RLIM_OLD_INFINITY to see the kinds of confusions
we've had. Some people just use MAX_LONG etc, which is *not* the same
as RLIM_INFINITY, but in practice ends up doing the same thing. Yadda
yadda.

So I'm personally leery of checking and depending on "exactly
RLIM_INFINITY", because I've seen it go wrong so many times.

And I think your second patch breaks that "use a really large value to
approximate infinity" case that definitely has existed as a pattern.

Linus
Re: [PATCH] mm: larger stack guard gap, between vmas
On Wed 05-07-17 17:58:45, Ben Hutchings wrote:
[...]
> diff --git a/mm/mmap.c b/mm/mmap.c
> index c7906ae1a7a1..f8131a94e56e 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2307,6 +2307,25 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
>  }
>  #endif /* CONFIG_STACK_GROWSUP || CONFIG_IA64 */
>
> +unsigned long __vm_start_gap(struct vm_area_struct *vma)
> +{
> +	unsigned long stack_limit =
> +		current->signal->rlim[RLIMIT_STACK].rlim_cur;
> +	unsigned long vm_start;
> +
> +	if (stack_limit != RLIM_INFINITY &&
> +	    vma->vm_end - vma->vm_start < stack_limit)
> +		vm_start = vma->vm_end - PAGE_ALIGN(stack_limit);

This is exactly what I was worried about in my previous email. Say
somebody sets stack ulimit to 1G or so. Should we reduce the available
address space that much? Say you are 32b and you have an application
with multiple stacks each doing its MAP_GROWSDOWN. You are quickly out
of address space. That's why I've said that we would need to find a
cap for the user defined limit. How much that should be though? Few
(tens, hundreds) megs. If we can figure that up I would be of course
quite happy about such a change because MAP_GROWSDOWN doesn't work
really well these days.

> +	else
> +		vm_start = vma->vm_start;
> +
> +	vm_start -= stack_guard_gap;
> +	if (vm_start > vma->vm_start)
> +		vm_start = 0;
> +
> +	return vm_start;
> +}
> +
>  /*
>   * vma is the first one with address < vma->vm_start. Have to extend vma.
>   */

-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm: larger stack guard gap, between vmas
On Wed, Jul 05, 2017 at 04:25:00PM +0100, Ben Hutchings wrote:
[...]
> Something I noticed is that Java doesn't immediately use MAP_FIXED.
> Look at os::pd_attempt_reserve_memory_at(). If the first, hinted,
> mmap() doesn't return the hinted address it then attempts to allocate
> huge areas (I'm not sure how intentional this is) and unmaps the
> unwanted parts. Then os::workaround_expand_exec_shield_cs_limit() re-
> mmap()s the wanted part with MAP_FIXED. If this fails at any point it
> is not a fatal error.
>
> So if we change vm_start_gap() to take the stack limit into account
> (when it's finite) that should neutralise
> os::workaround_expand_exec_shield_cs_limit(). I'll try this.

I ended up with the following two patches, which seem to deal with
both the Java and Rust regressions. These don't touch the
stack-grows-up paths at all because Rust doesn't run on those
architectures and the Java weirdness is i386-specific.

They definitely need longer commit messages and comments, but aside
from that do these look reasonable?

Ben.

Subject: [1/2] mmap: Skip a single VM_NONE mapping when checking the stack gap

Signed-off-by: Ben Hutchings
---
 mm/mmap.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index a5e3dcd75e79..c7906ae1a7a1 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2323,11 +2323,16 @@ int expand_downwards(struct vm_area_struct *vma,
 	if (error)
 		return error;
 
-	/* Enforce stack_guard_gap */
+	/*
+	 * Enforce stack_guard_gap. Some applications allocate a VM_NONE
+	 * mapping just below the stack, which we can safely ignore.
+	 */
 	gap_addr = address - stack_guard_gap;
 	if (gap_addr > address)
 		return -ENOMEM;
 	prev = vma->vm_prev;
+	if (prev && !(prev->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)))
+		prev = prev->vm_prev;
 	if (prev && prev->vm_end > gap_addr) {
 		if (!(prev->vm_flags & VM_GROWSDOWN))
 			return -ENOMEM;

Subject: [2/2] mmap: Avoid mapping anywhere within the full stack extent if finite

Signed-off-by: Ben Hutchings
---
 include/linux/mm.h |  9 -
 mm/mmap.c          | 19 +++
 2 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6f543a47fc92..2240a0505072 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2223,15 +2223,14 @@ static inline struct vm_area_struct * find_vma_intersection(struct mm_struct * m
 	return vma;
 }
 
+unsigned long __vm_start_gap(struct vm_area_struct *vma);
+
 static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
 {
 	unsigned long vm_start = vma->vm_start;
 
-	if (vma->vm_flags & VM_GROWSDOWN) {
-		vm_start -= stack_guard_gap;
-		if (vm_start > vma->vm_start)
-			vm_start = 0;
-	}
+	if (vma->vm_flags & VM_GROWSDOWN)
+		vm_start = __vm_start_gap(vma);
 
 	return vm_start;
 }
diff --git a/mm/mmap.c b/mm/mmap.c
index c7906ae1a7a1..f8131a94e56e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2307,6 +2307,25 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 }
 #endif /* CONFIG_STACK_GROWSUP || CONFIG_IA64 */
 
+unsigned long __vm_start_gap(struct vm_area_struct *vma)
+{
+	unsigned long stack_limit =
+		current->signal->rlim[RLIMIT_STACK].rlim_cur;
+	unsigned long vm_start;
+
+	if (stack_limit != RLIM_INFINITY &&
+	    vma->vm_end - vma->vm_start < stack_limit)
+		vm_start = vma->vm_end - PAGE_ALIGN(stack_limit);
+	else
+		vm_start = vma->vm_start;
+
+	vm_start -= stack_guard_gap;
+	if (vm_start > vma->vm_start)
+		vm_start = 0;
+
+	return vm_start;
+}
+
 /*
  * vma is the first one with address < vma->vm_start. Have to extend vma.
  */

-- 
Ben Hutchings
For every complex problem
there is a solution that is simple, neat, and wrong.
Re: [PATCH] mm: larger stack guard gap, between vmas
On Wed, Jul 5, 2017 at 9:15 AM, Andy Lutomirski wrote:
> On Wed, Jul 5, 2017 at 7:23 AM, Michal Hocko wrote:
>>
>> This is really worrying. This doesn't look like a gap at all. It is a
>> mapping which actually contains a code and so we should absolutely not
>> allow to scribble over it. So I am afraid the only way forward is to
>> allow per process stack gap and run this particular program to have a
>> smaller gap. We basically have two ways. Either /proc//$file or
>> a prctl inherited on exec. The later is a smaller code. What do you
>> think?
>
> Why inherit on exec?

.. because the whole point is that you have an existing binary that breaks.

So you need to be able to wrap it in "let's lower the stack gap, then
run that known-problematic binary".

If you think the problem is solved by recompiling existing binaries,
then why are we doing this kernel hack to begin with? The *real*
solution was always to just fix the damn compiler and ABI.

That *real* solution is simple and needs no kernel support at all.

In other words, *ALL* of the kernel work in this area is purely to
support existing binaries. Don't overlook that fact.

Linus
Re: [PATCH] mm: larger stack guard gap, between vmas
On Wed, Jul 5, 2017 at 5:19 AM, Ben Hutchings wrote:
> On Tue, 2017-07-04 at 16:31 -0700, Linus Torvalds wrote:
>>
>> Can you find out where that is allocated? Perhaps a breakpoint on
>> mmap, with a condition to catch that particular one?
>
> Found it, and it's now clear why only i386 is affected:
> http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os/linux/vm/os_linux.cpp#l4852
> http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os_cpu/linux_x86/vm/os_linux_x86.cpp#l881

Thanks, good work.

Well, good work on *your* part. I will try very hard to refrain from
commenting too much on the f*cking stinking pile of sh*t that was
exec-shield.

But yes, I don't think we can sanely recognize this. The code clearly
very intentionally does that mapping under the stack, and it's very
intentionally not PROT_NONE, since it's meant to be both writable and
executable.

As I said earlier (and I see Michal Hocko suggested the same - sudden
email flurry going on here), I think we need to basically allow people
to set the stack gap per-process to something low.

The good news is that this is probably specialized enough that we can
just keep the defaults as "will break this one case, but we give people
the tools to work around it".

I hate doing that, but distros that still support 32-bit (which is
apparently a shrinking number) can maybe hack the libreoffice launch
scripts up?

               Linus
Re: [PATCH] mm: larger stack guard gap, between vmas
On Wed, Jul 5, 2017 at 7:23 AM, Michal Hocko wrote:
> On Wed 05-07-17 13:19:40, Ben Hutchings wrote:
>> On Tue, 2017-07-04 at 16:31 -0700, Linus Torvalds wrote:
>>> On Tue, Jul 4, 2017 at 4:01 PM, Ben Hutchings wrote:
>>>>
>>>> We have:
>>>>
>>>> bottom = 0xff803fff
>>>> sp = 0xb178
>>>>
>>>> The relevant mappings are:
>>>>
>>>> ff7fc000-ff7fd000 rwxp 00:00 0
>>>> fffdd000-e000 rw-p 00:00 0 [stack]
>>>
>>> Ugh. So that stack is actually 8MB in size, but the alloca() is about
>>> to use up almost all of it, and there's only about 28kB left between
>>> "bottom" and that 'rwx' mapping.
>>>
>>> Still, that rwx mapping is interesting: it is a single page, and it
>>> really is almost exactly 8MB below the stack.
>>>
>>> In fact, the top of stack (at 0xe000) is *exactly* 8MB+4kB from
>>> the top of that odd one-page allocation (0xff7fd000).
>>>
>>> Can you find out where that is allocated? Perhaps a breakpoint on
>>> mmap, with a condition to catch that particular one?
>> [...]
>>
>> Found it, and it's now clear why only i386 is affected:
>> http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os/linux/vm/os_linux.cpp#l4852
>> http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os_cpu/linux_x86/vm/os_linux_x86.cpp#l881
>
> This is really worrying. This doesn't look like a gap at all. It is a
> mapping which actually contains a code and so we should absolutely not
> allow to scribble over it. So I am afraid the only way forward is to
> allow per process stack gap and run this particular program to have a
> smaller gap. We basically have two ways. Either /proc//$file or
> a prctl inherited on exec. The later is a smaller code. What do you
> think?

Why inherit on exec?

I think that, if we add a new API, we should do it right rather than
making it even more hackish. Specifically, we'd add a real VMA type (via
flag or whatever) that means "this is a modern stack".
A modern stack wouldn't ever expand and would have no guard page at all.
It would, however, properly account stack space by tracking the pages
used as stack space. Users of the new VMA type would be responsible for
allocating their own guard pages, probably by mapping an extra page and
then mapping PROT_NONE over it.

Also, this doesn't even need a new API, I think. What's wrong with plain
old mmap(2) with MAP_STACK and *without* MAP_GROWSDOWN? Only new kernels
would get the accounting right, but I doubt that matters much in
practice.
Re: [PATCH] mm: larger stack guard gap, between vmas
On Wed, Jul 5, 2017 at 5:21 AM, Ben Hutchings wrote:
>
> How about, instead of looking at permissions, we remember whether vmas
> were allocated with MAP_FIXED and ignore those when evaluating the gap?

No, I think that's a bad idea. There's tons of good reasons to use
MAP_FIXED, and real programs do it all the time.

I'd much rather just do something special for the Java case, either
recognizing that particular pattern, or (and this is likely what we'll
have to do) just have a per-process stack limit that

 (a) will be reset by suid transitions etc security boundaries

 (b) you can just set back to 4kB for the specific Java case.

because I'd rather make this be a very conscious thing rather than a
random hack. The PROT_NONE thing made tons of conceptual sense ("let
people do their own guard mappings"). The MAP_FIXED thing does not.

               Linus
Re: [PATCH] mm: larger stack guard gap, between vmas
On Wed 05-07-17 16:25:00, Ben Hutchings wrote:
> On Wed, 2017-07-05 at 16:23 +0200, Michal Hocko wrote:
>> On Wed 05-07-17 13:19:40, Ben Hutchings wrote:
>>> On Tue, 2017-07-04 at 16:31 -0700, Linus Torvalds wrote:
>>>> On Tue, Jul 4, 2017 at 4:01 PM, Ben Hutchings wrote:
>>>>>
>>>>> We have:
>>>>>
>>>>> bottom = 0xff803fff
>>>>> sp = 0xb178
>>>>>
>>>>> The relevant mappings are:
>>>>>
>>>>> ff7fc000-ff7fd000 rwxp 00:00 0
>>>>> fffdd000-e000 rw-p 00:00 0 [stack]
>>>>
>>>> Ugh. So that stack is actually 8MB in size, but the alloca() is about
>>>> to use up almost all of it, and there's only about 28kB left between
>>>> "bottom" and that 'rwx' mapping.
>>>>
>>>> Still, that rwx mapping is interesting: it is a single page, and it
>>>> really is almost exactly 8MB below the stack.
>>>>
>>>> In fact, the top of stack (at 0xe000) is *exactly* 8MB+4kB from
>>>> the top of that odd one-page allocation (0xff7fd000).
>>>>
>>>> Can you find out where that is allocated? Perhaps a breakpoint on
>>>> mmap, with a condition to catch that particular one?
>>>
>>> [...]
>>>
>>> Found it, and it's now clear why only i386 is affected:
>>> http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os/linux/vm/os_linux.cpp#l4852
>>> http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os_cpu/linux_x86/vm/os_linux_x86.cpp#l881
>>
>> This is really worrying. This doesn't look like a gap at all. It is a
>> mapping which actually contains a code and so we should absolutely not
>> allow to scribble over it. So I am afraid the only way forward is to
>> allow per process stack gap and run this particular program to have a
>> smaller gap. We basically have two ways. Either /proc//$file or
>> a prctl inherited on exec. The later is a smaller code. What do you
>> think?
>
> Distributions can do that, but what about all the other apps out there
> using JNI and private copies of the JRE?

Yes this sucks. I was thinking about something like run_legacy_stack
which would do

	prctl(PR_SET_STACK_GAP, 1, 0, 0, 0);
	execve(argv[1], argv+1, environ);

so we would have a way to start applications that start crashing with
the new setup without changing the default for all other applications.
The question is what to do if the execed task is suid because we
definitely do not want to allow tricking anybody to have smaller gap.
Or maybe just start the java with increased stack rlimit?

> Something I noticed is that Java doesn't immediately use MAP_FIXED.
> Look at os::pd_attempt_reserve_memory_at(). If the first, hinted,
> mmap() doesn't return the hinted address it then attempts to allocate
> huge areas (I'm not sure how intentional this is) and unmaps the
> unwanted parts. Then os::workaround_expand_exec_shield_cs_limit() re-
> mmap()s the wanted part with MAP_FIXED. If this fails at any point it
> is not a fatal error.
>
> So if we change vm_start_gap() to take the stack limit into account
> (when it's finite) that should neutralise
> os::workaround_expand_exec_shield_cs_limit(). I'll try this.

I was already thinking about doing something like that to have a better
support for MAP_GROWSDOWN but then I just gave up because this would
require to cap RLIMIT_STACK for large values in order to not break
userspace again. The max value is not really clear to me.
--
Michal Hocko
SUSE Labs
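PR_SET_STACK_GAP in Michal's sketch is hypothetical (no such prctl was ever merged), but the other option he mentions — starting the problematic binary with an increased stack rlimit, which is inherited across execve() — works with existing APIs. A hedged sketch of that piece; the helper name is ours:

```c
#include <sys/resource.h>

/* Raise the soft RLIMIT_STACK toward `want` (capped at the hard
 * limit, which is all an unprivileged process may do). A wrapper
 * would call this and then execve() the legacy binary, which
 * inherits the new limit. Returns 0 on success. */
int raise_stack_soft_limit(rlim_t want)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_STACK, &rl) != 0)
        return -1;

    /* The soft limit may only be raised up to the hard limit. */
    if (rl.rlim_max != RLIM_INFINITY && want > rl.rlim_max)
        want = rl.rlim_max;
    rl.rlim_cur = want;

    return setrlimit(RLIMIT_STACK, &rl);
}

/* Usage in a run_legacy_stack-style wrapper (argv[1] is the target):
 *
 *     if (raise_stack_soft_limit(16UL * 1024 * 1024) == 0)
 *         execvp(argv[1], argv + 1);
 */
```

Note this sidesteps the suid question Michal raises: rlimits are already subject to the usual soft/hard-limit rules, so no new trust decision is introduced.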
Re: [PATCH] mm: larger stack guard gap, between vmas
On Wed, 2017-07-05 at 16:23 +0200, Michal Hocko wrote:
> On Wed 05-07-17 13:19:40, Ben Hutchings wrote:
>> On Tue, 2017-07-04 at 16:31 -0700, Linus Torvalds wrote:
>>> On Tue, Jul 4, 2017 at 4:01 PM, Ben Hutchings wrote:
>>>>
>>>> We have:
>>>>
>>>> bottom = 0xff803fff
>>>> sp = 0xb178
>>>>
>>>> The relevant mappings are:
>>>>
>>>> ff7fc000-ff7fd000 rwxp 00:00 0
>>>> fffdd000-e000 rw-p 00:00 0 [stack]
>>>
>>> Ugh. So that stack is actually 8MB in size, but the alloca() is about
>>> to use up almost all of it, and there's only about 28kB left between
>>> "bottom" and that 'rwx' mapping.
>>>
>>> Still, that rwx mapping is interesting: it is a single page, and it
>>> really is almost exactly 8MB below the stack.
>>>
>>> In fact, the top of stack (at 0xe000) is *exactly* 8MB+4kB from
>>> the top of that odd one-page allocation (0xff7fd000).
>>>
>>> Can you find out where that is allocated? Perhaps a breakpoint on
>>> mmap, with a condition to catch that particular one?
>>
>> [...]
>>
>> Found it, and it's now clear why only i386 is affected:
>> http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os/linux/vm/os_linux.cpp#l4852
>> http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os_cpu/linux_x86/vm/os_linux_x86.cpp#l881
>
> This is really worrying. This doesn't look like a gap at all. It is a
> mapping which actually contains a code and so we should absolutely not
> allow to scribble over it. So I am afraid the only way forward is to
> allow per process stack gap and run this particular program to have a
> smaller gap. We basically have two ways. Either /proc//$file or
> a prctl inherited on exec. The later is a smaller code. What do you
> think?

Distributions can do that, but what about all the other apps out there
using JNI and private copies of the JRE?

Something I noticed is that Java doesn't immediately use MAP_FIXED. Look
at os::pd_attempt_reserve_memory_at(). If the first, hinted, mmap()
doesn't return the hinted address it then attempts to allocate huge
areas (I'm not sure how intentional this is) and unmaps the unwanted
parts. Then os::workaround_expand_exec_shield_cs_limit() re-mmap()s the
wanted part with MAP_FIXED. If this fails at any point it is not a fatal
error.

So if we change vm_start_gap() to take the stack limit into account
(when it's finite) that should neutralise
os::workaround_expand_exec_shield_cs_limit(). I'll try this.

Ben.
--
Ben Hutchings
Anthony's Law of Force: Don't force it, get a larger hammer.
Re: [PATCH] mm: larger stack guard gap, between vmas
On Wed 05-07-17 13:19:40, Ben Hutchings wrote:
> On Tue, 2017-07-04 at 16:31 -0700, Linus Torvalds wrote:
>> On Tue, Jul 4, 2017 at 4:01 PM, Ben Hutchings wrote:
>>>
>>> We have:
>>>
>>> bottom = 0xff803fff
>>> sp = 0xb178
>>>
>>> The relevant mappings are:
>>>
>>> ff7fc000-ff7fd000 rwxp 00:00 0
>>> fffdd000-e000 rw-p 00:00 0 [stack]
>>
>> Ugh. So that stack is actually 8MB in size, but the alloca() is about
>> to use up almost all of it, and there's only about 28kB left between
>> "bottom" and that 'rwx' mapping.
>>
>> Still, that rwx mapping is interesting: it is a single page, and it
>> really is almost exactly 8MB below the stack.
>>
>> In fact, the top of stack (at 0xe000) is *exactly* 8MB+4kB from
>> the top of that odd one-page allocation (0xff7fd000).
>>
>> Can you find out where that is allocated? Perhaps a breakpoint on
>> mmap, with a condition to catch that particular one?
> [...]
>
> Found it, and it's now clear why only i386 is affected:
> http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os/linux/vm/os_linux.cpp#l4852
> http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os_cpu/linux_x86/vm/os_linux_x86.cpp#l881

This is really worrying. This doesn't look like a gap at all. It is a
mapping which actually contains a code and so we should absolutely not
allow to scribble over it. So I am afraid the only way forward is to
allow per process stack gap and run this particular program to have a
smaller gap. We basically have two ways. Either /proc//$file or
a prctl inherited on exec. The later is a smaller code. What do you
think?
--
Michal Hocko
SUSE Labs
Re: [PATCH] mm: larger stack guard gap, between vmas
On Wed 05-07-17 13:21:54, Ben Hutchings wrote:
> On Wed, 2017-07-05 at 10:14 +0200, Willy Tarreau wrote:
>> On Wed, Jul 05, 2017 at 08:36:46AM +0200, Michal Hocko wrote:
>>> PROT_NONE would explicitly fault but we would simply
>>> run over this mapping too easily and who knows what might end up below
>>> it. So to me the guard gap does its job here.
>>
>> I tend to think that applications that implement their own stack guard
>> using PROT_NONE also assume that they will never perform unchecked stack
>> allocations larger than their own guard, thus the condition above should
>> never happen. Otherwise they're bogus and/or vulnerable by design and it
>> is their responsibility to fix it.
>>
>> Thus maybe if that helps we could even relax some of the stack guard
>> checks as soon as we meet a PROT_NONE area, allowing VMAs to be tightly
>> packed if the application knows what it's doing. That wouldn't solve
>> the libreoffice issue though, given the lower page is RWX.
>
> How about, instead of looking at permissions, we remember whether vmas
> were allocated with MAP_FIXED and ignore those when evaluating the gap?

To be honest I really hate this. The same way as any other heuristics
where we try to guess the gap which will not fault to let userspace know
something is wrong. And the Java example just proves the point AFAIU.
The mapping we clash on is _not_ a gap. It is a real mapping we should
rather not scribble over. It contains a code to execute and that is even
more worrying. So I guess the _only_ sane way forward for this case is
to reduce stack gap for the particular code.
--
Michal Hocko
SUSE Labs
Re: [vs-plain] Re: [PATCH] mm: larger stack guard gap, between vmas
Hi all,

On Tue, Jul 04, 2017 at 07:16:06PM -0600, kseifr...@redhat.com wrote:
> This issue occurs post stackguard patches correct? Fixing it sounds like
> this might go beyond hardening and into CVE territory.

Since this thread is public on LKML, as it should be, it's no longer
valid to be CC'ed to linux-distros, which is for handling of embargoed
issues only. So please drop linux-distros from further CC's (I moved
linux-distros to Bcc on this reply, just so they know what happened).

If specific security issues are identified (such as with LibreOffice and
Java), then ideally those should be posted to oss-security as separate
reports. I'd appreciate it if anyone takes care of that (regardless of
CVE worthiness). In fact, I already mentioned this thread in:

http://www.openwall.com/lists/oss-security/2017/07/05/11

Thank you!

Alexander
Re: [PATCH] mm: larger stack guard gap, between vmas
On Wed, Jul 05, 2017 at 01:21:54PM +0100, Ben Hutchings wrote:
> On Wed, 2017-07-05 at 10:14 +0200, Willy Tarreau wrote:
>> On Wed, Jul 05, 2017 at 08:36:46AM +0200, Michal Hocko wrote:
>>> PROT_NONE would explicitly fault but we would simply
>>> run over this mapping too easily and who knows what might end up below
>>> it. So to me the guard gap does its job here.
>>
>> I tend to think that applications that implement their own stack guard
>> using PROT_NONE also assume that they will never perform unchecked stack
>> allocations larger than their own guard, thus the condition above should
>> never happen. Otherwise they're bogus and/or vulnerable by design and it
>> is their responsibility to fix it.
>>
>> Thus maybe if that helps we could even relax some of the stack guard
>> checks as soon as we meet a PROT_NONE area, allowing VMAs to be tightly
>> packed if the application knows what it's doing. That wouldn't solve
>> the libreoffice issue though, given the lower page is RWX.
>
> How about, instead of looking at permissions, we remember whether vmas
> were allocated with MAP_FIXED and ignore those when evaluating the gap?

I like this idea. It leaves complete control to the application. Our
usual principle of letting people shoot themselves in the foot if they
insist on doing so.

Do you think something like this could work (not even build tested)?

Willy
--

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d16f524..8ad7f40 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -90,6 +90,7 @@
 #define VM_PFNMAP	0x0400	/* Page-ranges managed without "struct page", just pure PFN */
 #define VM_DENYWRITE	0x0800	/* ETXTBSY on write attempts.. */
+#define VM_FIXED	0x1000	/* MAP_FIXED was used */
 
 #define VM_LOCKED	0x2000
 #define VM_IO		0x4000	/* Memory mapped I/O or similar */
diff --git a/include/linux/mman.h b/include/linux/mman.h
index 9aa863d..4df2659 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -79,6 +79,7 @@ static inline int arch_validate_prot(unsigned long prot)
 {
 	return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
 	       _calc_vm_trans(flags, MAP_DENYWRITE,  VM_DENYWRITE ) |
+	       _calc_vm_trans(flags, MAP_FIXED,      VM_FIXED     ) |
 	       _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    );
 }
 #endif /* _LINUX_MMAN_H */
diff --git a/mm/mmap.c b/mm/mmap.c
index 3c4e4d7..b612868 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2145,7 +2145,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 
 	next = vma->vm_next;
 	if (next && next->vm_start < gap_addr) {
-		if (!(next->vm_flags & VM_GROWSUP))
+		if (!(next->vm_flags & (VM_GROWSUP|VM_FIXED)))
 			return -ENOMEM;
 		/* Check that both stack segments have the same anon_vma? */
 	}
@@ -2225,7 +2225,7 @@ int expand_downwards(struct vm_area_struct *vma,
 		return -ENOMEM;
 	prev = vma->vm_prev;
 	if (prev && prev->vm_end > gap_addr) {
-		if (!(prev->vm_flags & VM_GROWSDOWN))
+		if (!(prev->vm_flags & (VM_GROWSDOWN|VM_FIXED)))
 			return -ENOMEM;
 		/* Check that both stack segments have the same anon_vma? */
 	}
Re: [PATCH] mm: larger stack guard gap, between vmas
On Tue, 2017-07-04 at 09:18 -0700, Linus Torvalds wrote:
> On Tue, Jul 4, 2017 at 4:36 AM, Ben Hutchings wrote:
> >
> > That's what I was thinking of. Tried the following patch:
> >
> > Subject: mmap: Ignore VM_NONE mappings when checking for space to
> > expand the stack
>
> This looks sane to me.
>
> I'm going to ignore it in this thread, and assume that it gets sent as
> a patch separately, ok?
>
> It would be good to have more acks on it.
>
> Also, separately, John Haxby kind of implied that the LibreOffice
> regression on i386 is already fixed by commit f4cb767d76cf ("mm: fix
> new crash in unmapped_area_topdown()").
>
> Or was that a separate issue?

They are separate issues.

Ben.

--
Ben Hutchings
Anthony's Law of Force: Don't force it, get a larger hammer.
Re: [PATCH] mm: larger stack guard gap, between vmas
On Tue, 2017-07-04 at 19:11 +0200, Willy Tarreau wrote:
> On Tue, Jul 04, 2017 at 12:36:11PM +0100, Ben Hutchings wrote:
> > @@ -2323,11 +2330,17 @@ int expand_downwards(struct vm_area_struct *vma,
> >  	if (error)
> >  		return error;
> >
> > -	/* Enforce stack_guard_gap */
> > +	/*
> > +	 * Enforce stack_guard_gap, but allow VM_NONE mappings in the gap
> > +	 * as some applications try to make their own stack guards
> > +	 */
> >  	gap_addr = address - stack_guard_gap;
> >  	if (gap_addr > address)
> >  		return -ENOMEM;
> > -	prev = vma->vm_prev;
> > +	for (prev = vma->vm_prev;
> > +	     prev && !(prev->vm_flags & (VM_READ | VM_WRITE | VM_EXEC));
> > +	     prev = prev->vm_prev)
> > +		;
> >  	if (prev && prev->vm_end > gap_addr) {
> >  		if (!(prev->vm_flags & VM_GROWSDOWN))
> >  			return -ENOMEM;
>
> Hmmm shouldn't we also stop looping when we're out of the gap ?

Yes, either that or only allow one such vma.

Ben.

> Something like this :
>
> 	for (prev = vma->vm_prev;
> 	     prev && !(prev->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)) &&
> 	     address - prev->vm_end < stack_guard_gap;
> 	     prev = prev->vm_prev)
> 		;
>
> This would limit the risk of runaway loops if someone is having fun
> allocating a lot of memory in small chunks (eg: 4 GB in 1 million
> independent mmap() calls).

--
Ben Hutchings
Anthony's Law of Force: Don't force it, get a larger hammer.
Re: [PATCH] mm: larger stack guard gap, between vmas
On Wed, 2017-07-05 at 10:14 +0200, Willy Tarreau wrote:
> On Wed, Jul 05, 2017 at 08:36:46AM +0200, Michal Hocko wrote:
> > PROT_NONE would explicitly fault but we would simply
> > run over this mapping too easily and who knows what might end up below
> > it. So to me the guard gap does its job here.
>
> I tend to think that applications that implement their own stack guard
> using PROT_NONE also assume that they will never perform unchecked stack
> allocations larger than their own guard, thus the condition above should
> never happen. Otherwise they're bogus and/or vulnerable by design and it
> is their responsibility to fix it.
>
> Thus maybe if that helps we could even relax some of the stack guard
> checks as soon as we meet a PROT_NONE area, allowing VMAs to be tightly
> packed if the application knows what it's doing. That wouldn't solve
> the libreoffice issue though, given the lower page is RWX.

How about, instead of looking at permissions, we remember whether vmas
were allocated with MAP_FIXED and ignore those when evaluating the gap?

Ben.

--
Ben Hutchings
Anthony's Law of Force: Don't force it, get a larger hammer.
Re: [PATCH] mm: larger stack guard gap, between vmas
On Tue, 2017-07-04 at 16:31 -0700, Linus Torvalds wrote:
> On Tue, Jul 4, 2017 at 4:01 PM, Ben Hutchings wrote:
> >
> > We have:
> >
> > bottom = 0xff803fff
> > sp = 0xb178
> >
> > The relevant mappings are:
> >
> > ff7fc000-ff7fd000 rwxp 00:00 0
> > fffdd000-e000 rw-p 00:00 0 [stack]
>
> Ugh. So that stack is actually 8MB in size, but the alloca() is about
> to use up almost all of it, and there's only about 28kB left between
> "bottom" and that 'rwx' mapping.
>
> Still, that rwx mapping is interesting: it is a single page, and it
> really is almost exactly 8MB below the stack.
>
> In fact, the top of stack (at 0xe000) is *exactly* 8MB+4kB from
> the top of that odd one-page allocation (0xff7fd000).
>
> Can you find out where that is allocated? Perhaps a breakpoint on
> mmap, with a condition to catch that particular one?
[...]

Found it, and it's now clear why only i386 is affected:

http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os/linux/vm/os_linux.cpp#l4852
http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os_cpu/linux_x86/vm/os_linux_x86.cpp#l881

Ben.

--
Ben Hutchings
Anthony's Law of Force: Don't force it, get a larger hammer.
Re: [PATCH] mm: larger stack guard gap, between vmas
On Wed, Jul 05, 2017 at 10:24:41AM +0200, Michal Hocko wrote:
> > Thus maybe if that helps we could even relax some of the stack guard
> > checks as soon as we meet a PROT_NONE area, allowing VMAs to be tightly
> > packed if the application knows what it's doing.
>
> Yes, this is what my patch does [1]. Or did I miss your point?

Sorry, you're right. I got my mind confused when looking at the libreoffice
dump and for whatever reason ended up thinking we were just considering
that page as part of the gap and not as a marker for the bottom. Never mind.

> > That wouldn't solve the libreoffice issue though, given the lower page
> > is RWX.
>
> Unfortunately yes. We only have limited room to address this issue
> though. We could add a per task (mm) stack_gap limit (controlled either
> via proc or prctl) and revert back to 1 page for the specific program
> but I would be really careful about adding more hacks into the stack
> expansion code.

Actually one option could be to have a sysctl causing a warning to be
emitted when hitting the stack guard instead of killing the process. We
could decide for example that once this warning is emitted, the guard is
reduced to 64kB (I think it was the size before) and the application can
continue to run. That could help problematic applications getting fixed
quickly. And in environments with lots of local users it would be
dissuasive enough to avoid users trying their luck on setuid binaries.

Just my two cents,
Willy
Re: [PATCH] mm: larger stack guard gap, between vmas
On Wed 05-07-17 10:14:43, Willy Tarreau wrote:
> On Wed, Jul 05, 2017 at 08:36:46AM +0200, Michal Hocko wrote:
> > PROT_NONE would explicitly fault but we would simply
> > run over this mapping too easily and who knows what might end up below
> > it. So to me the guard gap does its job here.
>
> I tend to think that applications that implement their own stack guard
> using PROT_NONE also assume that they will never perform unchecked stack
> allocations larger than their own guard, thus the condition above should
> never happen. Otherwise they're bogus and/or vulnerable by design and it
> is their responsibility to fix it.
>
> Thus maybe if that helps we could even relax some of the stack guard
> checks as soon as we meet a PROT_NONE area, allowing VMAs to be tightly
> packed if the application knows what it's doing.

Yes, this is what my patch does [1]. Or did I miss your point?

> That wouldn't solve the libreoffice issue though, given the lower page
> is RWX.

Unfortunately yes. We only have limited room to address this issue
though. We could add a per task (mm) stack_gap limit (controlled either
via proc or prctl) and revert back to 1 page for the specific program
but I would be really careful about adding more hacks into the stack
expansion code.

[1] http://lkml.kernel.org/r/20170704093538.gf14...@dhcp22.suse.cz
--
Michal Hocko
SUSE Labs
Re: [PATCH] mm: larger stack guard gap, between vmas
On Wed, Jul 05, 2017 at 08:36:46AM +0200, Michal Hocko wrote:
> PROT_NONE would explicitly fault but we would simply
> run over this mapping too easily and who knows what might end up below
> it. So to me the guard gap does its job here.

I tend to think that applications that implement their own stack guard
using PROT_NONE also assume that they will never perform unchecked stack
allocations larger than their own guard, thus the condition above should
never happen. Otherwise they're bogus and/or vulnerable by design and it
is their responsibility to fix it.

Thus maybe if that helps we could even relax some of the stack guard
checks as soon as we meet a PROT_NONE area, allowing VMAs to be tightly
packed if the application knows what it's doing. That wouldn't solve
the libreoffice issue though, given the lower page is RWX.

Willy
Re: [PATCH] mm: larger stack guard gap, between vmas
On Tue 04-07-17 16:31:52, Linus Torvalds wrote:
> On Tue, Jul 4, 2017 at 4:01 PM, Ben Hutchings wrote:
> >
> > We have:
> >
> > bottom = 0xff803fff
> > sp = 0xb178
> >
> > The relevant mappings are:
> >
> > ff7fc000-ff7fd000 rwxp 00:00 0
> > fffdd000-e000 rw-p 00:00 0 [stack]
>
> Ugh. So that stack is actually 8MB in size, but the alloca() is about
> to use up almost all of it, and there's only about 28kB left between
> "bottom" and that 'rwx' mapping.
>
> Still, that rwx mapping is interesting: it is a single page, and it
> really is almost exactly 8MB below the stack.
>
> In fact, the top of stack (at 0xe000) is *exactly* 8MB+4kB from
> the top of that odd one-page allocation (0xff7fd000).

Very interesting! I would be really curious whether changing ulimit to
something bigger changes the picture. And if this is really the case,
what are we going to do here? We can special-case a single page mapping
under the stack but that sounds quite dangerous for something that is
dubious in itself. PROT_NONE would explicitly fault but we would simply
run over this mapping too easily and who knows what might end up below
it. So to me the guard gap does its job here.

Do you want me to post the earlier patch to ignore PROT_NONE mappings,
or should we rather wait for this one to get more details?

--
Michal Hocko
SUSE Labs
Re: [vs-plain] Re: [PATCH] mm: larger stack guard gap, between vmas
This issue occurs post-stackguard patches, correct? Fixing it sounds like
this might go beyond hardening and into CVE territory.

--
Kurt Seifried -- Red Hat -- Product Security -- Cloud
PGP A90B F995 7350 148F 66BF 7554 160D 4553 5E26 7993
Red Hat Product Security contact: secal...@redhat.com
Re: [PATCH] mm: larger stack guard gap, between vmas
On Tue, Jul 4, 2017 at 4:01 PM, Ben Hutchings wrote:
>
> We have:
>
> bottom = 0xff803fff
> sp = 0xb178
>
> The relevant mappings are:
>
> ff7fc000-ff7fd000 rwxp 00:00 0
> fffdd000-e000 rw-p 00:00 0 [stack]

Ugh. So that stack is actually 8MB in size, but the alloca() is about
to use up almost all of it, and there's only about 28kB left between
"bottom" and that 'rwx' mapping.

Still, that rwx mapping is interesting: it is a single page, and it
really is almost exactly 8MB below the stack.

In fact, the top of stack (at 0xe000) is *exactly* 8MB+4kB from
the top of that odd one-page allocation (0xff7fd000).

Can you find out where that is allocated? Perhaps a breakpoint on
mmap, with a condition to catch that particular one?

Because I'm wondering if it was done explicitly as an 8MB stack
boundary allocation, with the "knowledge" that the kernel then adds a
one-page guard page.

I really don't know why somebody would do that (as opposed to just
limiting the stack with ulimit), but the 8MB+4kB distance is kind of
intriguing.

Maybe that one-page mapping is some hack to make sure that no random
mmap() will ever get too close to the stack, so it really is a "guard
mapping", except it's explicitly designed not so much to guard the
stack from growing down further (ulimit does that), but to guard the
brk() and other mmaps from growing *up* into the stack area..

Sometimes user mode does crazy things just because people are insane.
But sometimes there really is a method to the madness.

I would *not* be surprised if the way somebody allocated the stack was
to basically say:

 - let's use "mmap()" with a size of 8MB+2 pages to find a sufficiently
   sized virtual memory area

 - once we've gotten that virtual address space range, let's over-map
   the last page as the new stack using MAP_FIXED

 - finally, munmap the 8MB in between so that the new stack can grow
   down into that gap the munmap creates.

Notice how you end up with exactly the above pattern of allocations,
and how it guarantees that you get a nice 8MB stack without having to
do any locking (you rely on the kernel to just find the 8MB+8kB areas,
and once one has been allocated, it will be "safe").

And yes, it would have been much nicer to just use PROT_NONE for that
initial sizing allocation, but for somebody who is only interested in
carving out an 8MB stack in virtual space, the protections are
actually kind of immaterial, so 'rwx' might be just their mental
default.

Linus
Re: [PATCH] mm: larger stack guard gap, between vmas
On Tue, 2017-07-04 at 12:36 +0100, Ben Hutchings wrote:
[...]
> This *doesn't* fix the LibreOffice regression on i386.

gdb shows me that the crash is at the last statement in this function:

static void _expand_stack_to(address bottom) {
  address sp;
  size_t size;
  volatile char *p;

  // Adjust bottom to point to the largest address within the same page, it
  // gives us a one-page buffer if alloca() allocates slightly more memory.
  bottom = (address)align_size_down((uintptr_t)bottom, os::Linux::page_size());
  bottom += os::Linux::page_size() - 1;

  // sp might be slightly above current stack pointer; if that's the case, we
  // will alloca() a little more space than necessary, which is OK. Don't use
  // os::current_stack_pointer(), as its result can be slightly below current
  // stack pointer, causing us to not alloca enough to reach "bottom".
  sp = (address) &sp;

  if (sp > bottom) {
    size = sp - bottom;
    p = (volatile char *)alloca(size);
    assert(p != NULL && p <= (volatile char *)bottom, "alloca problem?");
    p[0] = '\0';
  }
}

We have:

bottom = 0xff803fff
sp = 0xb178

The relevant mappings are:

ff7fc000-ff7fd000 rwxp 00:00 0
fffdd000-e000 rw-p 00:00 0 [stack]

So instead of a useless guard page, we have a dangerous WX page
underneath the stack! I suppose I should find out where and why that's
being allocated.

Ben.

--
Ben Hutchings
The world is coming to an end. Please log off.
Re: [PATCH] mm: larger stack guard gap, between vmas
On Tue, Jul 04, 2017 at 11:47:37AM -0700, Linus Torvalds wrote:
> Let's say that you are using lots of threads, so that you know your stack
> space is limited. What you do is to use MAP_FIXED a lot, and you lay
> out your stacks fairly densely (with each other, but also possibly
> with other mappings), with that PROT_NONE redzoning mapping in between
> the "dense" allocations.
>
> So when the kernel wants to grow the stack, it finds the PROT_NONE
> redzone mapping - but there's possibly other maps right under it, so
> the stack_guard_gap still hits other mappings.
(...)

OK I didn't get that use case, that totally makes sense indeed! So now we
use PROT_NONE not as something that must be skipped to find the unmapped
area but as a hint that the application apparently wants the stack to
stop here.

Thanks for this clear explanation!
Willy
Re: [PATCH] mm: larger stack guard gap, between vmas
On Tue, Jul 4, 2017 at 11:39 AM, Willy Tarreau wrote:
>
> But what is wrong with stopping the loop as soon as the distance gets
> larger than the stack_guard_gap ?

Absolutely nothing. But that's not the problem with the loop.

Let's say that you are using lots of threads, so that you know your stack
space is limited. What you do is to use MAP_FIXED a lot, and you lay out
your stacks fairly densely (with each other, but also possibly with other
mappings), with that PROT_NONE redzoning mapping in between the "dense"
allocations.

So when the kernel wants to grow the stack, it finds the PROT_NONE
redzone mapping - but there's possibly other maps right under it, so the
stack_guard_gap still hits other mappings.

And the fact that this seems to trigger with (a) 32-bit x86 (b) Java
actually makes sense in the above scenario: that's _exactly_ when you'd
have dense mappings. Java is very thread-happy, and in a 32-bit VM, the
virtual address space allocation for stacks is a primary issue with lots
of threads.

Of course, the downside to this theory is that apparently the Java
problem is not confirmed to actually be due to this (Ben root-caused the
rust thing on ppc64), but it still sounds like quite a reasonable thing
to do.

The problem with the Java issue may be that they do that "dense stack
mappings in VM space" (for all the usual "lots of threads, limited VM"
reasons), but they may *not* have that PROT_NONE redzoning at all.

So the patch under discussion works for Rust exactly *because* it does
its redzone to show "this is where I expect the stack to end". The i386
java load may simply not have that marker for us to use..

Linus