Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-06 Thread Willy Tarreau
On Thu, Jul 06, 2017 at 10:24:06AM +0200, Willy Tarreau wrote:
> On Wed, Jul 05, 2017 at 04:51:06PM -0700, Linus Torvalds wrote:
> > On Wed, Jul 5, 2017 at 4:35 PM, Ben Hutchings  wrote:
> > >>
> > >> And I think your second patch breaks that "use a really large value to
> > >> approximate infinity" case that definitely has existed as a pattern.
> > >
> > > Right.  Well that seems to leave us with remembering the MAP_FIXED flag
> > > and using that as the condition to ignore the previous mapping.
> > 
> > I'm not particularly happy about having a MAP_FIXED special case, but
> > yeah, I'm not seeing a lot of alternatives.
> 
> We can possibly refine it like this:
>   - use PROT_NONE as a mark for the end of the stack and consider that the
> application doing this knows exactly what it's doing;
> 
>   - use any other MAP_FIXED as a limit for a shorter gap (i.e. 4kB), considering
> that 1) it used to work like this for many years, and 2) if an application
> is forcing a MAP_FIXED just below the stack and at the same time uses
> large alloca() or VLAs, it's definitely bogus and asking for unfixable
> trouble. Disallowing this would break existing applications anyway.

That would probably give the following (only build-tested on x86_64). Do
you think it would make sense and/or be acceptable? It would avoid the
heavier options like adding a sysctl plus warnings, or special-casing
setuid.

Willy
---

From 56ae4e57e446bc92fd2647327da281e313930524 Mon Sep 17 00:00:00 2001
From: Willy Tarreau 
Date: Thu, 6 Jul 2017 12:00:54 +0200
Subject: mm, mmap: only apply a one-page gap between the stack and
 MAP_FIXED

Some programs place a MAP_FIXED mapping below the stack, not leaving
enough room for the stack guard. This patch keeps track of MAP_FIXED by
mirroring it in a new VM_FIXED flag, and in that situation reduces the
stack guard to a single page (as it used to be), on the assumption that
an application placing a fixed mapping close to the stack very likely
does so on purpose and takes full responsibility for the risk of the
stack blowing up.
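
For illustration (not part of the patch), the two cases being
distinguished look roughly like this from userspace; the addresses are
hypothetical and the program is only a sketch:

#include <sys/mman.h>

int main(void)
{
	/* An explicit PROT_NONE mark below the stack: treated as a
	 * deliberate end-of-stack marker, so no guard gap is enforced
	 * against it. */
	void *mark = mmap((void *)0x7f0000000000UL, 4096, PROT_NONE,
			  MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

	/* Any other MAP_FIXED mapping below the stack: the guard gap
	 * shrinks to a single page instead of stack_guard_gap. */
	void *fixed = mmap((void *)0x7f0000001000UL, 4096,
			   PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

	return mark == MAP_FAILED || fixed == MAP_FAILED;
}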

Cc: Ben Hutchings 
Cc: Michal Hocko 
Cc: Kees Cook 
Cc: Andy Lutomirski 
Signed-off-by: Willy Tarreau 
---
 include/linux/mm.h   |  1 +
 include/linux/mman.h |  1 +
 mm/mmap.c            | 30 ++++++++++++++++++----------
 3 files changed, 22 insertions(+), 10 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6f543a4..41492b9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -188,6 +188,7 @@ extern int overcommit_kbytes_handler(struct ctl_table *, int, void __user *,
 #define VM_ACCOUNT	0x00100000	/* Is a VM accounted object */
 #define VM_NORESERVE	0x00200000	/* should the VM suppress accounting */
 #define VM_HUGETLB	0x00400000	/* Huge TLB Page VM */
+#define VM_FIXED	0x00800000	/* MAP_FIXED was used */
 #define VM_ARCH_1	0x01000000	/* Architecture-specific flag */
 #define VM_ARCH_2	0x02000000
 #define VM_DONTDUMP	0x04000000	/* Do not include in the core dump */
diff --git a/include/linux/mman.h b/include/linux/mman.h
index 634c4c5..3a29069 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -86,6 +86,7 @@ static inline bool arch_validate_prot(unsigned long prot)
 {
 	return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
 	       _calc_vm_trans(flags, MAP_DENYWRITE,  VM_DENYWRITE ) |
+	       _calc_vm_trans(flags, MAP_FIXED,      VM_FIXED     ) |
 	       _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    );
 }
 
diff --git a/mm/mmap.c b/mm/mmap.c
index ece0f6d..7fc1c29 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2244,12 +2244,17 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 		gap_addr = TASK_SIZE;
 
 	next = vma->vm_next;
+
+	/* PROT_NONE above a MAP_GROWSUP always serves as a mark and inhibits
+	 * the stack guard gap.
+	 * MAP_FIXED above a MAP_GROWSUP only requires a single page guard.
+	 */
 	if (next && next->vm_start < gap_addr &&
-	    (next->vm_flags & (VM_WRITE|VM_READ|VM_EXEC))) {
-		if (!(next->vm_flags & VM_GROWSUP))
-			return -ENOMEM;
-		/* Check that both stack segments have the same anon_vma? */
-	}
+	    !(next->vm_flags & VM_GROWSUP) &&
+	    (next->vm_flags & (VM_WRITE|VM_READ|VM_EXEC)) &&
+	    (!(next->vm_flags & VM_FIXED) ||
+	     next->vm_start < address + PAGE_SIZE))
+		return -ENOMEM;
 
 	/* We must make sure the anon_vma is allocated. */
 	if (unlikely(anon_vma_prepare(vma)))
@@ -2329,12 +2334,17 @@ int expand_downwards(struct vm_area_struct *vma,
 	if (gap_addr > address)
 		return -ENOMEM;
 	prev = vma->vm_prev;
+
+	/* PROT_NONE below a MAP_GROWSDOWN always serves as a mark and inhibits
+	 * the stack guard gap.
+	 * MAP_FIXED below a MAP_GROWSDOWN only requires a single page guard.
+	 */
 	if (prev && prev->vm_end > gap_addr &&
-	    (prev->vm_flags & (VM_WRITE|VM_READ|VM_EXEC))) {
-		if (!(prev->vm_flags & VM_GROWSDOWN))
-			return -ENOMEM;
-		/* Check that both stack segments have the same anon_vma? */
-	}
+	    !(prev->vm_flags & VM_GROWSDOWN) &&
+	    (prev->vm_flags & (VM_WRITE|VM_READ|VM_EXEC)) &&
+	    (!(prev->vm_flags & VM_FIXED) ||
+	     prev->vm_end > address - PAGE_SIZE))
+		return -ENOMEM;
 
 	/* We must make sure the anon_vma is allocated. */


Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-06 Thread Willy Tarreau
On Wed, Jul 05, 2017 at 04:51:06PM -0700, Linus Torvalds wrote:
> On Wed, Jul 5, 2017 at 4:35 PM, Ben Hutchings  wrote:
> >>
> >> And I think your second patch breaks that "use a really large value to
> >> approximate infinity" case that definitely has existed as a pattern.
> >
> > Right.  Well that seems to leave us with remembering the MAP_FIXED flag
> > and using that as the condition to ignore the previous mapping.
> 
> I'm not particularly happy about having a MAP_FIXED special case, but
> yeah, I'm not seeing a lot of alternatives.

We can possibly refine it like this:
  - use PROT_NONE as a mark for the end of the stack and consider that the
application doing this knows exactly what it's doing;

  - use any other MAP_FIXED as a limit for a shorter gap (i.e. 4kB), considering
that 1) it used to work like this for many years, and 2) if an application
is forcing a MAP_FIXED just below the stack and at the same time uses
large alloca() or VLAs, it's definitely bogus and asking for unfixable
trouble. Disallowing this would break existing applications anyway.

Willy


Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-06 Thread Michal Hocko
On Wed 05-07-17 08:36:45, Michal Hocko wrote:
> On Tue 04-07-17 16:31:52, Linus Torvalds wrote:
> > On Tue, Jul 4, 2017 at 4:01 PM, Ben Hutchings  wrote:
> > >
> > > We have:
> > >
> > > bottom = 0xff803fff
> > > sp = 0xffffb178
> > >
> > > The relevant mappings are:
> > >
> > > ff7fc000-ff7fd000 rwxp 00000000 00:00 0
> > > fffdd000-ffffe000 rw-p 00000000 00:00 0          [stack]
> > 
> > Ugh. So that stack is actually 8MB in size, but the alloca() is about
> > to use up almost all of it, and there's only about 28kB left between
> > "bottom" and that 'rwx' mapping.
> > 
> > Still, that rwx mapping is interesting: it is a single page, and it
> > really is almost exactly 8MB below the stack.
> > 
> > In fact, the top of stack (at 0xffffe000) is *exactly* 8MB+4kB from
> > the top of that odd one-page allocation (0xff7fd000).
> 
> Very interesting! I would be really curious whether changing ulimit to
> something bigger changes the picture.

It's a public holiday here today and I haven't read all the new emails;
I will be mostly offline today and will catch up tomorrow. But before we
go to more tricky workarounds, could you double-check that simply
increasing RLIMIT_STACK works around the problem here? Because if it
does, and the other workarounds require some manual intervention anyway,
then changing the ulimit sounds like the least tricky one to me.
-- 
Michal Hocko
SUSE Labs
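
A quick way to test that suggestion is a tiny wrapper that raises
RLIMIT_STACK before exec'ing the affected program. This is only a
sketch (the 16MB value is an arbitrary choice):

#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	struct rlimit rl;

	if (argc < 2) {
		fprintf(stderr, "usage: %s prog [args...]\n", argv[0]);
		return 1;
	}
	if (getrlimit(RLIMIT_STACK, &rl) == 0) {
		rl.rlim_cur = 16UL << 20;	/* 16MB instead of the usual 8MB */
		if (rl.rlim_max != RLIM_INFINITY && rl.rlim_cur > rl.rlim_max)
			rl.rlim_cur = rl.rlim_max;	/* stay within the hard limit */
		if (setrlimit(RLIMIT_STACK, &rl) != 0)
			perror("setrlimit");
	}
	execvp(argv[1], argv + 1);
	perror("execvp");
	return 1;
}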


Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Kevin Easton
On Wed, Jul 05, 2017 at 04:23:56PM +0200, Michal Hocko wrote:
> On Wed 05-07-17 13:19:40, Ben Hutchings wrote:
> > On Tue, 2017-07-04 at 16:31 -0700, Linus Torvalds wrote:
> > > On Tue, Jul 4, 2017 at 4:01 PM, Ben Hutchings wrote:
> > > > 
> > > > We have:
> > > > 
> > > > bottom = 0xff803fff
> > > > sp = 0xffffb178
> > > > 
> > > > The relevant mappings are:
> > > > 
> > > > ff7fc000-ff7fd000 rwxp 00000000 00:00 0
> > > > fffdd000-ffffe000 rw-p 00000000 00:00 0          [stack]
> > > 
> > > Ugh. So that stack is actually 8MB in size, but the alloca() is about
> > > to use up almost all of it, and there's only about 28kB left between
> > > "bottom" and that 'rwx' mapping.
> > > 
> > > Still, that rwx mapping is interesting: it is a single page, and it
> > > really is almost exactly 8MB below the stack.
> > > 
> > > In fact, the top of stack (at 0xffffe000) is *exactly* 8MB+4kB from
> > > the top of that odd one-page allocation (0xff7fd000).
> > > 
> > > Can you find out where that is allocated? Perhaps a breakpoint on
> > > mmap, with a condition to catch that particular one?
> > [...]
> > 
> > Found it, and it's now clear why only i386 is affected:
> > http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os/linux/vm/os_linux.cpp#l4852
> > http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os_cpu/linux_x86/vm/os_linux_x86.cpp#l881
> 
> This is really worrying. This doesn't look like a gap at all. It is a
> mapping which actually contains code, so we should absolutely not
> allow scribbling over it. So I am afraid the only way forward is to
> allow a per-process stack gap and run this particular program with a
> smaller gap. We basically have two ways: either /proc/<pid>/$file or
> a prctl inherited on exec. The latter is less code. What do you
> think?

On the plus side, the code in that page (a single RET) is only executed
once when the workaround function is called.  Notice that 'codebuf'
is never even returned out of that function.

The only reason they even leave that page mapped is to stop the exec
shield limit from being lowered on them again.

- Kevin
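
A rough sketch (not the actual HotSpot source) of the workaround being
described: map one RWX page at a fixed address below the stack, write a
single RET into it, call it once, and leave the page mapped. The
address reuses the one from the report above and is otherwise
arbitrary:

#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	uint8_t *codebuf = mmap((void *)0xff7fc000UL, 4096,
				PROT_READ | PROT_WRITE | PROT_EXEC,
				MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
				-1, 0);

	if (codebuf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	codebuf[0] = 0xC3;			/* x86 RET */
	((void (*)(void))codebuf)();		/* executed exactly once */
	/* deliberately never munmap()ed, so the exec-shield CS limit
	 * is not lowered again */
	return 0;
}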



Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Willy Tarreau
On Wed, Jul 05, 2017 at 05:19:47PM -0700, Andy Lutomirski wrote:
> I think it's ridiculous that you can change rlimits and then
> exec a setuid thing.  It's not so easy to fix, though.  Maybe track,
> per-task, inherited by clone and exec, what the rlimits were the last
> time the process had privilege and reset to those limits when running
> something setuid.  But a better approach might be to have some sysctls
> that say what the rlimits become when doing setuid.

*Some* rlimits are useful and needed by the user, as you mentioned.
RLIMIT_CORE definitely is one of them, especially for debugging when
combined with suid_dumpable. Others like RLIMIT_STACK probably should
never be configurable at all, and only cause trouble. Simply having a
sysctl that sets this one for setuid programs, ignoring the current
limit, would probably be enough. We could even imagine another one to
set the stack guard gap for setuid programs (this would also limit the
impact of having a large gap for everyone).

> We need per-user-ns sysctls for stuff like this, and we don't really
> have them...

I don't think we need to be that fine-grained. mmap_min_addr is global,
is used to address very similar issues, and nobody seems to complain.

Willy


Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Kees Cook
On Wed, Jul 5, 2017 at 5:19 PM, Andy Lutomirski  wrote:
> On Wed, Jul 5, 2017 at 4:50 PM, Kees Cook  wrote:
>> As part of that, should we put restrictions on the environment of
>> set*id exec too? Part of the risks demonstrated by Qualys was that
>> allowing a privilege-elevating binary to inherit rlimits can lead
>> to nasty memory layout side-effects. That would fall into the
>> "hardening" bucket as well. And if it turns out there is some set*id
>> binary out there that can't run with "only", e.g., 128MB of stack, we
>> can make it configurable...
>
> Yes.  I think it's ridiculous that you can change rlimits and then
> exec a setuid thing.  It's not so easy to fix, though.  Maybe track,
> per-task, inherited by clone and exec, what the rlimits were the last
> time the process had privilege and reset to those limits when running
> something setuid.  But a better approach might be to have some sysctls
> that say what the rlimits become when doing setuid.
>
> We need per-user-ns sysctls for stuff like this, and we don't really
> have them...

In userspace, the way sensible rlimit defaults are selected by PAM when
building an initial environment is to just examine the rlimits of init.
Maybe we could do the same thing here, which would give us some level
of namespace control.

-Kees

-- 
Kees Cook
Pixel Security
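
For reference, the PAM approach described above amounts to reading
init's limits; a minimal sketch that dumps them from /proc/1/limits
(readable on typical systems):

#include <stdio.h>

int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/1/limits", "r");

	if (!f) {
		perror("fopen");
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);	/* one row per rlimit */
	fclose(f);
	return 0;
}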


Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Linus Torvalds
On Wed, Jul 5, 2017 at 5:31 PM, Andy Lutomirski  wrote:
>
> I wonder if we could pull some "sane" values out of our arses and have
> it work just fine.

That approach may work, but it's pretty nasty.

But together with at least some way for the distro to set the values
we pick, it would probably be fairly reasonable.

You're right that most of the rlimits are just not very useful.

   Linus


Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Andy Lutomirski
On Wed, Jul 5, 2017 at 4:55 PM, Linus Torvalds wrote:
> On Wed, Jul 5, 2017 at 4:50 PM, Kees Cook  wrote:
>>
>> As part of that should we put restrictions on the environment of
>> set*id exec too?
>
> I'm not seeing what sane limits you could use.
>
> I think the concept of "reset as much of the environment to sane
> things when running suid binaries" is a good concept.
>
> But we simply don't have any sane values to reset things to.

I wonder if we could pull some "sane" values out of our arses and have
it work just fine.

It's worth noting that a lot of the rlimits don't meaningfully
restrict the use of any particular resource, so we could plausibly
drop requirements to have privilege to increase them if we really
cared to.  I don't see why we'd make such a change, but it means that,
if we reset on set*id and therefore poke a hole that allows a program
to do "sudo -u $me whatever" and thereby reset limits, it's not so
bad.  A tiny survey:

RLIMIT_AS: not a systemwide resource at all.
RLIMIT_CORE: more or less just a policy of what you do when you crash.
I don't see how you could do much damage here.
RLIMIT_CPU: unless you're not allowed to fork(), this doesn't restrict
anything systemwide.
RLIMIT_DATA: ***
RLIMIT_FSIZE: maybe?  but I can see this being quite dangerous across set*id
RLIMIT_LOCKS: gone
RLIMIT_MEMLOCK: this one matters, but it also seems nearly worthless
for exploits
RLIMIT_MSGQUEUE: privilege matters here
RLIMIT_NICE: maybe?  anyone who actually cares would use cgroups instead
RLIMIT_NOFILE: great for exploits.  Only sort of useful for resource management
RLIMIT_NPROC: privilege matters here
RLIMIT_RTTIME: privilege kind of matters.  Also dangerous for exploits
(a bit) since it lets you kill your children at controlled times.
RLIMIT_SIGPENDING: not sure
RLIMIT_STACK: ***

*** means that this is a half-arsed resource control.  It's half-arsed
because this stuff doesn't cover mmap(2), which seems to me like it
defeats the purpose.  This stuff feels like a throwback to the
eighties.
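
For anyone wanting to inspect these on a live system, a small helper
using getrlimit(2) prints the current soft and hard values; only a
subset of the table above is shown:

#include <stdio.h>
#include <sys/resource.h>

static void show(const char *name, int resource)
{
	struct rlimit rl;

	if (getrlimit(resource, &rl) == 0)
		printf("%-16s soft=%-20llu hard=%llu\n", name,
		       (unsigned long long)rl.rlim_cur,
		       (unsigned long long)rl.rlim_max);
}

int main(void)
{
	show("RLIMIT_AS", RLIMIT_AS);
	show("RLIMIT_CORE", RLIMIT_CORE);
	show("RLIMIT_NOFILE", RLIMIT_NOFILE);
	show("RLIMIT_NPROC", RLIMIT_NPROC);
	show("RLIMIT_STACK", RLIMIT_STACK);
	return 0;
}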


Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Andy Lutomirski
On Wed, Jul 5, 2017 at 4:50 PM, Ben Hutchings  wrote:
> On Wed, 2017-07-05 at 13:53 -0700, Andy Lutomirski wrote:
>> On Jul 5, 2017, at 12:32 PM, Ben Hutchings  wrote:
>> > On Wed, 2017-07-05 at 10:23 -0700, Andy Lutomirski wrote:
> [...]
>> > >  - As a hardening feature, if the stack would expand within 64k or
>> > > whatever of a non-MAP_FIXED mapping, refuse to expand it.  (This might
>> > > have to be a non-hinted mapping, not just a non-MAP_FIXED mapping.)
>> > > The idea being that, if you deliberately place a mapping under the
>> > > stack, you know what you're doing.  If you're like LibreOffice and do
>> > > something daft and are thus exploitable, you're on your own.
>> > >  - As a hardening measure, don't let mmap without MAP_FIXED position
>> > > something within 64k or whatever of the bottom of the stack unless a
>> > > MAP_FIXED mapping is between them.
>> >
>> > Having tested patches along these lines, I think the above would avoid
>> > the reported regressions.
>> >
>>
>> FWIW, even this last part may be problematic.  It'll break anything
>> that tries to allocate many small MAP_GROWSDOWN stacks on 32-
>> bit.  Hopefully nothing does this, but maybe Java does.
>
> glibc (NPTL) does not.  Java (at least Hotspot in OpenJDK 6,7, 8) does
> not.  LinuxThreads *does* and is used by uclibc.  dietlibc *does*.  I
> would be surprised if either was used for applications with very many
> threads, but then this issue has thrown up a lot of surprises.
>

Ugh.  But yeah, I'd be a bit surprised to see heavily threaded apps
using LinuxThreads or dietlibc.

LinuxThreads still uses modify_ldt(), right?  modify_ldt() performance
is abysmal, and I have no intention of even trying to optimize it.
Anyhow, you *can't* have more than 8192 threads if you use
modify_ldt() for TLS because you run out of LDT slots.  8192 * 64k
fits in 32 bits with room to spare, so this is unlikely to be a
showstopper.

--Andy


Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Andy Lutomirski
On Wed, Jul 5, 2017 at 4:50 PM, Kees Cook  wrote:
> On Wed, Jul 5, 2017 at 10:23 AM, Andy Lutomirski  wrote:
>> Right.  But I think the approach that we're all taking here is a bit
>> nutty.  We all realize that this issue is a longstanding *GCC* bug
>> [1], but we're acting like it's a Big Deal (tm) kernel bug that Must
>> Be Fixed (tm) and therefore is allowed to break ABI.  My security hat
>> is normally pretty hard-line, but I think it may be time to call BS.
>>
>> Imagine if Kees had sent some symlink hardening patch that was
>> default-on and broke a stock distro.  Or if I had sent a vsyscall
>> hardening patch that broke real code.  It would get reverted right
>> away, probably along with a diatribe about how we should have known
>> better.  I think this stack gap stuff is the same thing.  It's not a
>> security fix -- it's a hardening patch.
>>
>> Looking at it that way, I think a new inherited-on-exec flag is nucking futs.
>>
>> I'm starting to think that the right approach is to mostly revert all
>> this stuff (the execve fixes are fine).  Then start over and think
>> about it as hardening.  I would suggest the following approach:
>>
>>  - The stack gap is one page, just like it's been for years.
>>  - As a hardening feature, if the stack would expand within 64k or
>> whatever of a non-MAP_FIXED mapping, refuse to expand it.  (This might
>> have to be a non-hinted mapping, not just a non-MAP_FIXED mapping.)
>> The idea being that, if you deliberately place a mapping under the
>> stack, you know what you're doing.  If you're like LibreOffice and do
>> something daft and are thus exploitable, you're on your own.
>>  - As a hardening measure, don't let mmap without MAP_FIXED position
>> something within 64k or whatever of the bottom of the stack unless a
>> MAP_FIXED mapping is between them.
>>
>> And that's all.  It's not like a 64k gap actually fixes these bugs for
>> real -- it just makes them harder to exploit.
>>
>> [1] The code that GCC generates for char buf[big number] and alloca()
>> is flat-out wrong.  Everyone who's ever thought about it at all knows
>> it and has known about it for years, but no one cared to fix it.
>
> As part of that, should we put restrictions on the environment of
> set*id exec too? Part of the risks demonstrated by Qualys was that
> allowing a privilege-elevating binary to inherit rlimits can lead
> to nasty memory layout side-effects. That would fall into the
> "hardening" bucket as well. And if it turns out there is some set*id
> binary out there that can't run with "only", e.g., 128MB of stack, we
> can make it configurable...

Yes.  I think it's ridiculous that you can change rlimits and then
exec a setuid thing.  It's not so easy to fix, though.  Maybe track,
per-task, inherited by clone and exec, what the rlimits were the last
time the process had privilege and reset to those limits when running
something setuid.  But a better approach might be to have some sysctls
that say what the rlimits become when doing setuid.

We need per-user-ns sysctls for stuff like this, and we don't really
have them...

--Andy
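
To observe where stack expansion actually stops relative to a nearby
non-MAP_FIXED mapping, a throwaway experiment along these lines can be
used (hypothetical test program, expected to die with SIGSEGV once the
guard gap or the stack rlimit is hit):

#include <alloca.h>
#include <stdio.h>
#include <sys/mman.h>

static void grow(unsigned long frames)
{
	volatile char *buf = alloca(4096);	/* roughly one page per frame */

	buf[0] = 0;				/* touch it to force expansion */
	if (frames)
		grow(frames - 1);
}

int main(void)
{
	char probe;
	/* Hint (not MAP_FIXED) about 1MB below the current stack page;
	 * the kernel is free to place it elsewhere, so print where it
	 * actually landed. */
	void *hint = (void *)(((unsigned long)&probe & ~4095UL) - (1UL << 20));
	void *p = mmap(hint, 4096, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	printf("mapping at %p, stack near %p\n", p, (void *)&probe);
	grow(4096);	/* try to grow the stack by roughly 16MB */
	puts("stack grew all the way");
	return 0;
}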


Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Linus Torvalds
On Wed, Jul 5, 2017 at 4:50 PM, Kees Cook  wrote:
>
> As part of that should we put restrictions on the environment of
> set*id exec too?

I'm not seeing what sane limits you could use.

I think the concept of "reset as much of the environment to sane
things when running suid binaries" is a good concept.

But we simply don't have any sane values to reset things to.

Linus


Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Ben Hutchings
On Wed, 2017-07-05 at 13:53 -0700, Andy Lutomirski wrote:
> On Jul 5, 2017, at 12:32 PM, Ben Hutchings  wrote:
> > On Wed, 2017-07-05 at 10:23 -0700, Andy Lutomirski wrote:
[...]
> > >  - As a hardening feature, if the stack would expand within 64k or
> > > whatever of a non-MAP_FIXED mapping, refuse to expand it.  (This might
> > > have to be a non-hinted mapping, not just a non-MAP_FIXED mapping.)
> > > The idea being that, if you deliberately place a mapping under the
> > > stack, you know what you're doing.  If you're like LibreOffice and do
> > > something daft and are thus exploitable, you're on your own.
> > >  - As a hardening measure, don't let mmap without MAP_FIXED position
> > > something within 64k or whatever of the bottom of the stack unless a
> > > MAP_FIXED mapping is between them.
> > 
> > Having tested patches along these lines, I think the above would avoid
> > the reported regressions.
> > 
> 
> FWIW, even this last part may be problematic.  It'll break anything
> that tries to allocate many small MAP_GROWSDOWN stacks on 32-
> bit.  Hopefully nothing does this, but maybe Java does.

glibc (NPTL) does not.  Java (at least Hotspot in OpenJDK 6, 7, 8) does
not.  LinuxThreads *does* and is used by uclibc.  dietlibc *does*.  I
would be surprised if either was used for applications with very many
threads, but then this issue has thrown up a lot of surprises.

Ben.

-- 
Ben Hutchings
Man invented language to satisfy his deep need to complain. - Lily
Tomlin



Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Linus Torvalds
On Wed, Jul 5, 2017 at 4:35 PM, Ben Hutchings  wrote:
>>
>> And I think your second patch breaks that "use a really large value to
>> approximate infinity" case that definitely has existed as a pattern.
>
> Right.  Well that seems to leave us with remembering the MAP_FIXED flag
> and using that as the condition to ignore the previous mapping.

I'm not particularly happy about having a MAP_FIXED special case, but
yeah, I'm not seeing a lot of alternatives.

 Linus


Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Kees Cook
On Wed, Jul 5, 2017 at 10:23 AM, Andy Lutomirski  wrote:
> Right.  But I think the approach that we're all taking here is a bit
> nutty.  We all realize that this issue is a longstanding *GCC* bug
> [1], but we're acting like it's a Big Deal (tm) kernel bug that Must
> Be Fixed (tm) and therefore is allowed to break ABI.  My security hat
> is normally pretty hard-line, but I think it may be time to call BS.
>
> Imagine if Kees had sent some symlink hardening patch that was
> default-on and broke a stock distro.  Or if I had sent a vsyscall
> hardening patch that broke real code.  It would get reverted right
> away, probably along with a diatribe about how we should have known
> better.  I think this stack gap stuff is the same thing.  It's not a
> security fix -- it's a hardening patch.
>
> Looking at it that way, I think a new inherited-on-exec flag is nucking futs.
>
> I'm starting to think that the right approach is to mostly revert all
> this stuff (the execve fixes are fine).  Then start over and think
> about it as hardening.  I would suggest the following approach:
>
>  - The stack gap is one page, just like it's been for years.
>  - As a hardening feature, if the stack would expand within 64k or
> whatever of a non-MAP_FIXED mapping, refuse to expand it.  (This might
> have to be a non-hinted mapping, not just a non-MAP_FIXED mapping.)
> The idea being that, if you deliberately place a mapping under the
> stack, you know what you're doing.  If you're like LibreOffice and do
> something daft and are thus exploitable, you're on your own.
>  - As a hardening measure, don't let mmap without MAP_FIXED position
> something within 64k or whatever of the bottom of the stack unless a
> MAP_FIXED mapping is between them.
>
> And that's all.  It's not like a 64k gap actually fixes these bugs for
> real -- it just makes them harder to exploit.
>
> [1] The code that GCC generates for char buf[big number] and alloca()
> is flat-out wrong.  Everyone who's ever thought about it at all knows
> it and has known about it for years, but no one cared to fix it.

As part of that, should we put restrictions on the environment of
set*id exec too? Part of the risks demonstrated by Qualys was that
allowing a privilege-elevating binary to inherit rlimits can lead
to nasty memory layout side-effects. That would fall into the
"hardening" bucket as well. And if it turns out there is some set*id
binary out there that can't run with "only", e.g., 128MB of stack, we
can make it configurable...

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Ben Hutchings
On Wed, 2017-07-05 at 10:15 -0700, Linus Torvalds wrote:
> > On Wed, Jul 5, 2017 at 9:58 AM, Ben Hutchings  wrote:
> > 
> > I ended up with the following two patches, which seem to deal with
> > both the Java and Rust regressions.  These don't touch the
> > stack-grows-up paths at all because Rust doesn't run on those
> > architectures and the Java weirdness is i386-specific.
> > 
> > They definitely need longer commit messages and comments, but aside
> > from that do these look reasonable?
> 
> I think they both look reasonable, but I think we might still want to
> massage things a bit (cutting down the quoting to a minimum, hopefully
> leaving enough context to still make sense):
> 
> > Subject: [1/2] mmap: Skip a single VM_NONE mapping when checking the stack gap
> > 
> > 	prev = vma->vm_prev;
> > +	if (prev && !(prev->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)))
> > +		prev = prev->vm_prev;
> > 	if (prev && prev->vm_end > gap_addr) {
> 
> Do we just want to ignore the user-supplied guard mapping, or do we
> want to say "if the user does a guard mapping, we use that *instead*
> of our stack gap"?
> 
> IOW, instead of "prev = prev->vm_prev;" and continuing, maybe we want
> to just return "ok".

Rust effectively added a second guard page to the main thread stack.  
But it does not (yet) implement stack probing
(https://github.com/rust-lang/rust/issues/16012) so I think it will
benefit from the kernel's larger stack guard gap.

> > Subject: [2/2] mmap: Avoid mapping anywhere within the full stack extent if finite
> 
> This is good thinking, but no, I don't think the "if finite" is right.
> 
> I've seen people use "really big values" as a replacement for
> RLIM_INFINITY, for various reasons.
> 
> We've had huge confusion about RLIM_INFINITY over the years - look for
> things like COMPAT_RLIM_OLD_INFINITY to see the kinds of confusions
> we've had.

That sounds familiar...

> Some people just use MAX_LONG etc, which is *not* the same as
> RLIM_INFINITY, but in practice ends up doing the same thing. Yadda
> yadda.
> 
> So I'm personally leery of checking and depending on "exactly
> RLIM_INFINITY", because I've seen it go wrong so many times.
> 
> And I think your second patch breaks that "use a really large value to
> approximate infinity" case that definitely has existed as a pattern.

Right.  Well that seems to leave us with remembering the MAP_FIXED flag
and using that as the condition to ignore the previous mapping.

Ben.

-- 
Ben Hutchings
Man invented language to satisfy his deep need to complain. - Lily
Tomlin
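
For comparison, the alternative Linus raises ("use that *instead* of
our stack gap") would look something like this at the same spot; this
is only pseudo-kernel code under that assumption, not a tested patch:

	prev = vma->vm_prev;
	if (prev && !(prev->vm_flags & (VM_READ | VM_WRITE | VM_EXEC))) {
		/* the user supplied their own guard mapping:
		 * trust it and skip the kernel's gap check entirely */
	} else if (prev && prev->vm_end > gap_addr) {
		/* existing stack gap enforcement */
	}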


Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Andy Lutomirski


> On Jul 5, 2017, at 12:32 PM, Ben Hutchings  wrote:
> 
>> On Wed, 2017-07-05 at 10:23 -0700, Andy Lutomirski wrote:
>> [...]
>> Looking at it that way, I think a new inherited-on-exec flag is nucking futs.
>> 
>> I'm starting to think that the right approach is to mostly revert all
>> this stuff (the execve fixes are fine).  Then start over and think
>> about it as hardening.  I would suggest the following approach:
>> 
>>  - The stack gap is one page, just like it's been for years.
> 
> Given that in the following points you say that something sounding like
> a stack gap would be "64k or whatever", what does "the stack gap" mean
> in this first point?

I mean one page, with semantics as close to previous (4.11) behavior as 
practical.

> 
>>  - As a hardening feature, if the stack would expand within 64k or
>> whatever of a non-MAP_FIXED mapping, refuse to expand it.  (This might
>> have to be a non-hinted mapping, not just a non-MAP_FIXED mapping.)
>> The idea being that, if you deliberately place a mapping under the
>> stack, you know what you're doing.  If you're like LibreOffice and do
>> something daft and are thus exploitable, you're on your own.
>>  - As a hardening measure, don't let mmap without MAP_FIXED position
>> something within 64k or whatever of the bottom of the stack unless a
>> MAP_FIXED mapping is between them.
> 
> Having tested patches along these lines, I think the above would avoid
> the reported regressions.
> 

FWIW, even this last part may be problematic.  It'll break anything that tries 
to allocate many small MAP_GROWSDOWN stacks on 32-bit.  Hopefully nothing does 
this, but maybe Java does.

> Ben.
> 
>> And that's all.  It's not like a 64k gap actually fixes these bugs for
>> real -- it just makes them harder to exploit.
>> 
>> [1] The code that GCC generates for char buf[big number] and alloca()
>> is flat-out wrong.  Everyone who's ever thought about it at all knows
>> it and has known about it for years, but no one cared to fix it.
> -- 
> Ben Hutchings
> Anthony's Law of Force: Don't force it, get a larger hammer.
> 


Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Willy Tarreau
On Wed, Jul 05, 2017 at 08:32:43PM +0100, Ben Hutchings wrote:
> >  - As a hardening feature, if the stack would expand within 64k or
> > whatever of a non-MAP_FIXED mapping, refuse to expand it.  (This might
> > have to be a non-hinted mapping, not just a non-MAP_FIXED mapping.)
> > The idea being that, if you deliberately place a mapping under the
> > stack, you know what you're doing.  If you're like LibreOffice and do
> > something daft and are thus exploitable, you're on your own.
> >  - As a hardening measure, don't let mmap without MAP_FIXED position
> > something within 64k or whatever of the bottom of the stack unless a
> > MAP_FIXED mapping is between them.
> 
> Having tested patches along these lines, I think the above would avoid
> the reported regressions.

Stuff like this has already been proposed, but Linus suspects that more
software than we imagine uses MAP_FIXED and could break. I can neither
confirm nor refute that; there is probably nothing fundamentally wrong
with the approach from userland's perspective, it may simply be that
such software is more common than we would like.

Willy


Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Ben Hutchings
On Wed, 2017-07-05 at 10:23 -0700, Andy Lutomirski wrote:
[...]
> Looking at it that way, I think a new inherited-on-exec flag is nucking futs.
> 
> I'm starting to think that the right approach is to mostly revert all
> this stuff (the execve fixes are fine).  Then start over and think
> about it as hardening.  I would suggest the following approach:
> 
>  - The stack gap is one page, just like it's been for years.

Given that in the following points you say that something sounding like
a stack gap would be "64k or whatever", what does "the stack gap" mean
in this first point?

>  - As a hardening feature, if the stack would expand within 64k or
> whatever of a non-MAP_FIXED mapping, refuse to expand it.  (This might
> have to be a non-hinted mapping, not just a non-MAP_FIXED mapping.)
> The idea being that, if you deliberately place a mapping under the
> stack, you know what you're doing.  If you're like LibreOffice and do
> something daft and are thus exploitable, you're on your own.
>  - As a hardening measure, don't let mmap without MAP_FIXED position
> something within 64k or whatever of the bottom of the stack unless a
> MAP_FIXED mapping is between them.

Having tested patches along these lines, I think the above would avoid
the reported regressions.

Ben.

> And that's all.  It's not like a 64k gap actually fixes these bugs for
> real -- it just makes them harder to exploit.
> 
> [1] The code that GCC generates for char buf[big number] and alloca()
> is flat-out wrong.  Everyone who's ever thought about it at all knows
> it and has known about it for years, but no one cared to fix it.
-- 
Ben Hutchings
Anthony's Law of Force: Don't force it, get a larger hammer.





Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Linus Torvalds
On Wed, Jul 5, 2017 at 12:18 PM, Willy Tarreau  wrote:
>
> But only if the sysctl is set. It can simply be recommended to set it
> if any program fails. We've done this for many years with other ones
> like min_mmap_addr or tcp_ecn.

Ok, fair enough. I don't hate the approach, and maybe it's simpler
overall, and would help find other potential problem spots.

*Hopefully* it was just that Rust thing and the nasty Java exec-shield
workaround, but yeah, those might just be the first ones that have
been found so far.

  Linus


Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Willy Tarreau
On Wed, Jul 05, 2017 at 12:17:20PM -0700, Linus Torvalds wrote:
> On Wed, Jul 5, 2017 at 11:59 AM, Willy Tarreau  wrote:
> >
> > Don't you think that the option of having a sysctl to relax the check
> > per task wouldn't be easier for distros and safer overall ? Ie, emit
> > a warning the first time the gap is hit instead of segfaulting, then
> > reduce it to something that used to work (4k or 64k, I don't remember)
> > and try again ?
> 
> It used to be just 4k.
> 
> .. and I think that might be a valid way to find these things, but
> would it be safer? It basically disables the new stack gap entirely
> apart from the warning.

But only if the sysctl is set. It can simply be recommended to set it
if any program fails. We've done this for many years with other ones
like min_mmap_addr or tcp_ecn.

Willy


Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Linus Torvalds
On Wed, Jul 5, 2017 at 11:59 AM, Willy Tarreau  wrote:
>
> Don't you think that the option of having a sysctl to relax the check
> per task wouldn't be easier for distros and safer overall ? Ie, emit
> a warning the first time the gap is hit instead of segfaulting, then
> reduce it to something that used to work (4k or 64k, I don't remember)
> and try again ?

It used to be just 4k.

.. and I think that might be a valid way to find these things, but
would it be safer? It basically disables the new stack gap entirely
apart from the warning.

And maybe that's ok and distros prefer that?

 Linus


Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Willy Tarreau
On Wed, Jul 05, 2017 at 09:17:59AM -0700, Linus Torvalds wrote:
(...)
> The good news is that this is probably specialized enough that we can
> just keep the defaults as "will break this one case, but we give
> people the tools to work around it".
> 
> I hate doing that, but distros that still support 32-bit (which is
> apparently a shrinking number) can maybe hack the libreoffice launch
> scripts up?

Don't you think that the option of having a sysctl to relax the check
per task wouldn't be easier for distros and safer overall ? Ie, emit
a warning the first time the gap is hit instead of segfaulting, then
reduce it to something that used to work (4k or 64k, I don't remember)
and try again ? It would quickly report all these "special" programs
for end-user distros, without leaving too much room for attacks due
to the warning making it pretty obvious what's going on. I just don't
know how to make this stack gap per-process, but since this was already
discussed for prctl I think it's doable.

Willy
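
A minimal sketch of that idea, assuming a hypothetical
sysctl_stack_gap_relaxed knob wired into the stack-expansion check; the
surrounding logic is only approximate, and the warning here fires once
per boot rather than once per task:

/* Hypothetical: when relaxed mode is enabled and the large gap
 * would make the expansion fail, warn and fall back to the
 * historic one-page gap instead of segfaulting.
 */
gap_addr = address - stack_guard_gap;
if (prev && prev->vm_end > gap_addr &&
    !(prev->vm_flags & VM_GROWSDOWN)) {
        if (!sysctl_stack_gap_relaxed)
                return -ENOMEM;
        pr_warn_once("%s[%d]: stack gap relaxed to one page\n",
                     current->comm, task_pid_nr(current));
        if (prev->vm_end > address - PAGE_SIZE)
                return -ENOMEM;
}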


Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Ben Hutchings
On Wed, 2017-07-05 at 19:05 +0200, Michal Hocko wrote:
> On Wed 05-07-17 17:58:45, Ben Hutchings wrote:
> [...]
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index c7906ae1a7a1..f8131a94e56e 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -2307,6 +2307,25 @@ int expand_upwards(struct vm_area_struct *vma, 
> > unsigned long address)
> >  }
> >  #endif /* CONFIG_STACK_GROWSUP || CONFIG_IA64 */
> >  
> > +unsigned long __vm_start_gap(struct vm_area_struct *vma)
> > +{
> > +   unsigned long stack_limit =
> > +   current->signal->rlim[RLIMIT_STACK].rlim_cur;
> > +   unsigned long vm_start;
> > +
> > +   if (stack_limit != RLIM_INFINITY &&
> > +   vma->vm_end - vma->vm_start < stack_limit)
> > +   vm_start = vma->vm_end - PAGE_ALIGN(stack_limit);
> 
> This is exactly what I was worried about in my previous email. Say
> somebody sets stack ulimit to 1G or so. Should we reduce the available
> address space that much?

It's not ideal, but why would someone set the stack limit that high
unless it's for an application that will actually use most of that
stack space?  Do you think that "increase the stack limit" has been
cargo-culted?

> Say you are 32b and you have an application
> with multiple stacks each doing its MAP_GROWSDOWN.
[...]

So this application is using dietlibc or uclibc?  glibc uses fixed-size 
mappings for new threads.

I suppose there's a risk that by doing this we would make
MAP_GROWSDOWN useful enough that it is more likely to be used for new
thread stacks in future.

Ben.

-- 
Ben Hutchings
Anthony's Law of Force: Don't force it, get a larger hammer.





Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Andy Lutomirski
On Wed, Jul 5, 2017 at 9:20 AM, Linus Torvalds
 wrote:
> On Wed, Jul 5, 2017 at 9:15 AM, Andy Lutomirski  wrote:
>> On Wed, Jul 5, 2017 at 7:23 AM, Michal Hocko  wrote:
>>>
>>> This is really worrying. This doesn't look like a gap at all. It is a
>>> mapping which actually contains a code and so we should absolutely not
>>> allow to scribble over it. So I am afraid the only way forward is to
>>> allow per process stack gap and run this particular program to have a
>>> smaller gap. We basically have two ways. Either /proc//$file or
>>> a prctl inherited on exec. The latter is less code. What do you
>>> think?
>>
>> Why inherit on exec?
>
> .. because the whole point is that you have an existing binary that breaks.
>
> So you need to be able to wrap it in "let's lower the stack gap, then
> run that known-problematic binary".
>
> If you think the problem is solved by recompiling existing binaries,
> then why are we doing this kernel hack to begin with? The *real*
> solution was always to just fix the damn compiler and ABI.

That's not what I was suggesting at all.  I was suggesting that, if
we're going to suggest a new API, that the new API actually be sane.

>
> That *real* solution is simple and needs no kernel support at all.
>
> In other words, *ALL* of the kernel work in this area is purely to
> support existing binaries. Don't overlook that fact.

Right.  But I think the approach that we're all taking here is a bit
nutty.  We all realize that this issue is a longstanding *GCC* bug
[1], but we're acting like it's a Big Deal (tm) kernel bug that Must
Be Fixed (tm) and therefore is allowed to break ABI.  My security hat
is normally pretty hard-line, but I think it may be time to call BS.

Imagine if Kees had sent some symlink hardening patch that was
default-on and broke a stock distro.  Or if I had sent a vsyscall
hardening patch that broke real code.  It would get reverted right
away, probably along with a diatribe about how we should have known
better.  I think this stack gap stuff is the same thing.  It's not a
security fix -- it's a hardening patch.

Looking at it that way, I think a new inherited-on-exec flag is nucking futs.

I'm starting to think that the right approach is to mostly revert all
this stuff (the execve fixes are fine).  Then start over and think
about it as hardening.  I would suggest the following approach:

 - The stack gap is one page, just like it's been for years.
 - As a hardening feature, if the stack would expand within 64k or
whatever of a non-MAP_FIXED mapping, refuse to expand it.  (This might
have to be a non-hinted mapping, not just a non-MAP_FIXED mapping.)
The idea being that, if you deliberately place a mapping under the
stack, you know what you're doing.  If you're like LibreOffice and do
something daft and are thus exploitable, you're on your own.
 - As a hardening measure, don't let mmap without MAP_FIXED position
something within 64k or whatever of the bottom of the stack unless a
MAP_FIXED mapping is between them.

And that's all.  It's not like a 64k gap actually fixes these bugs for
real -- it just makes them harder to exploit.

[1] The code that GCC generates for char buf[big number] and alloca()
is flat-out wrong.  Everyone who's ever thought about it at all knows
it and has known about it for years, but no one cared to fix it.
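
A minimal sketch of the first hardening rule above, assuming the kernel
recorded a (hypothetical) VM_FIXED flag at mmap() time so that MAP_FIXED
neighbours can be told apart from hinted or unhinted ones:

#define STACK_HARDENING_GAP     (64UL << 10)    /* "64k or whatever" */

/* In the stack-expansion path: refuse to grow to within the
 * hardening gap of the next mapping down, unless the user put
 * that mapping there explicitly with MAP_FIXED.
 */
prev = vma->vm_prev;
if (prev && !(prev->vm_flags & VM_FIXED) &&
    prev->vm_end > address - STACK_HARDENING_GAP)
        return -ENOMEM;

The third rule would be the mirror image on the allocation side: have the
unmapped-area search skip candidates that end within the same distance of
the bottom of a MAP_GROWSDOWN vma.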


Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Linus Torvalds
On Wed, Jul 5, 2017 at 9:58 AM, Ben Hutchings  wrote:
>
> I ended up with the following two patches, which seem to deal with
> both the Java and Rust regressions.  These don't touch the
> stack-grows-up paths at all because Rust doesn't run on those
> architectures and the Java weirdness is i386-specific.
>
> They definitely need longer commit messages and comments, but aside
> from that do these look reasonable?

I think they both look reasonable, but I think we might still want to
massage things a bit (cutting down the quoting to a minimum, hopefully
leaving enough context to still make sense):

> Subject: [1/2] mmap: Skip a single VM_NONE mapping when checking the stack gap
>
> prev = vma->vm_prev;
> +   if (prev && !(prev->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)))
> +   prev = prev->vm_prev;
> if (prev && prev->vm_end > gap_addr) {

Do we just want to ignore the user-supplied guard mapping, or do we
want to say "if the user does a guard mapping, we use that *instead*
of our stack gap"?

IOW, instead of "prev = prev->vm_prev;" and continuing, maybe we want
to just return "ok".

> Subject: [2/2] mmap: Avoid mapping anywhere within the full stack extent  if 
> finite

This is good thinking, but no, I don't think the "if finite" is right.

I've seen people use "really big values" as replacement for
RLIM_INFINITY, for various reasons.

We've had huge confusion about RLIM_INFINITY over the years - look for
things like COMPAT_RLIM_OLD_INFINITY to see the kinds of confusions
we've had.

Some people just use MAX_LONG etc, which is *not* the same as
RLIM_INFINITY, but in practice ends up doing the same thing. Yadda
yadda.

So I'm personally leery of checking and depending on "exactly
RLIM_INIFITY", because I've seen it go wrong so many times.

And I think your second patch breaks that "use a really large value to
approximate infinity" case that definitely has existed as a pattern.

  Linus
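
The pattern being described, as a short userspace sketch (64-bit assumed
so the huge stand-in value fits in rlim_t):

#include <sys/resource.h>

/* Both calls make the stack effectively unlimited, but only the
 * first compares equal to RLIM_INFINITY inside the kernel, so a
 * check for "exactly infinity" treats the two very differently.
 */
static void set_unlimited_stack(int really_infinite)
{
        struct rlimit rl;

        rl.rlim_cur = rl.rlim_max =
                really_infinite ? RLIM_INFINITY : (1UL << 46);
        setrlimit(RLIMIT_STACK, &rl);
}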


Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Michal Hocko
On Wed 05-07-17 17:58:45, Ben Hutchings wrote:
[...]
> diff --git a/mm/mmap.c b/mm/mmap.c
> index c7906ae1a7a1..f8131a94e56e 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2307,6 +2307,25 @@ int expand_upwards(struct vm_area_struct *vma, 
> unsigned long address)
>  }
>  #endif /* CONFIG_STACK_GROWSUP || CONFIG_IA64 */
>  
> +unsigned long __vm_start_gap(struct vm_area_struct *vma)
> +{
> + unsigned long stack_limit =
> + current->signal->rlim[RLIMIT_STACK].rlim_cur;
> + unsigned long vm_start;
> +
> + if (stack_limit != RLIM_INFINITY &&
> + vma->vm_end - vma->vm_start < stack_limit)
> + vm_start = vma->vm_end - PAGE_ALIGN(stack_limit);

This is exactly what I was worried about in my previous email. Say
somebody sets stack ulimit to 1G or so. Should we reduce the available
address space that much? Say you are on 32-bit and you have an application
with multiple stacks, each doing its own MAP_GROWSDOWN. You are quickly out
of address space. That's why I've said that we would need to find a cap
for the user-defined limit. How large should that cap be, though? A few
(tens? hundreds of?) megs. If we can figure that out I would of course be
quite happy about such a change, because MAP_GROWSDOWN doesn't really work
well these days.

> + else
> + vm_start = vma->vm_start;
> +
> + vm_start -= stack_guard_gap;
> + if (vm_start > vma->vm_start)
> + vm_start = 0;
> +
> + return vm_start;
> +}
> +
>  /*
>   * vma is the first one with address < vma->vm_start.  Have to extend vma.
>   */
> 
> -- 
> Ben Hutchings
> For every complex problem
> there is a solution that is simple, neat, and wrong.



-- 
Michal Hocko
SUSE Labs
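
To put numbers on that concern, here is a userspace sketch replicating the
quoted __vm_start_gap() arithmetic for a one-page MAP_GROWSDOWN vma under
a 1 GiB RLIMIT_STACK (the 256-page gap matches the patch's default; the
overflow clamp is omitted):

#include <stdio.h>

int main(void)
{
        unsigned long page      = 4096;
        unsigned long stack_lim = 1UL << 30;       /* ulimit -s = 1G */
        unsigned long gap       = 256 * page;      /* stack_guard_gap */
        unsigned long vm_end    = 0xff000000UL;    /* sample vma end */
        unsigned long vm_start  = vm_end - page;   /* one mapped page */

        /* same logic as the quoted __vm_start_gap() */
        unsigned long start = (vm_end - vm_start < stack_lim)
                ? vm_end - stack_lim : vm_start;
        start -= gap;

        printf("each stack reserves %lu MiB of address space\n",
               (vm_end - start) >> 20);            /* prints 1025 */
        return 0;
}

On i386, with a 3 GiB user address space, two such stacks already consume
two thirds of it.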


Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Ben Hutchings
On Wed, Jul 05, 2017 at 04:25:00PM +0100, Ben Hutchings wrote:
[...]
> Something I noticed is that Java doesn't immediately use MAP_FIXED. 
> Look at os::pd_attempt_reserve_memory_at().  If the first, hinted,
> mmap() doesn't return the hinted address it then attempts to allocate
> huge areas (I'm not sure how intentional this is) and unmaps the
> unwanted parts.  Then os::workaround_expand_exec_shield_cs_limit() re-
> mmap()s the wanted part with MAP_FIXED.  If this fails at any point it
> is not a fatal error.
> 
> So if we change vm_start_gap() to take the stack limit into account
> (when it's finite) that should neutralise
> os::workaround_expand_exec_shield_cs_limit().  I'll try this.

I ended up with the following two patches, which seem to deal with
both the Java and Rust regressions.  These don't touch the
stack-grows-up paths at all because Rust doesn't run on those
architectures and the Java weirdness is i386-specific.

They definitely need longer commit messages and comments, but aside
from that do these look reasonable?

Ben.

Subject: [1/2] mmap: Skip a single VM_NONE mapping when checking the stack gap

Signed-off-by: Ben Hutchings 
---
 mm/mmap.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index a5e3dcd75e79..c7906ae1a7a1 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2323,11 +2323,16 @@ int expand_downwards(struct vm_area_struct *vma,
 	if (error)
 		return error;
 
-	/* Enforce stack_guard_gap */
+	/*
+	 * Enforce stack_guard_gap.  Some applications allocate a VM_NONE
+	 * mapping just below the stack, which we can safely ignore.
+	 */
 	gap_addr = address - stack_guard_gap;
 	if (gap_addr > address)
 		return -ENOMEM;
 	prev = vma->vm_prev;
+	if (prev && !(prev->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)))
+		prev = prev->vm_prev;
 	if (prev && prev->vm_end > gap_addr) {
 		if (!(prev->vm_flags & VM_GROWSDOWN))
 			return -ENOMEM;

Subject: [2/2] mmap: Avoid mapping anywhere within the full stack extent
 if finite

Signed-off-by: Ben Hutchings 
---
 include/linux/mm.h |  9 -
 mm/mmap.c  | 19 +++
 2 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6f543a47fc92..2240a0505072 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2223,15 +2223,14 @@ static inline struct vm_area_struct * find_vma_intersection(struct mm_struct * m
 	return vma;
 }
 
+unsigned long __vm_start_gap(struct vm_area_struct *vma);
+
 static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
 {
 	unsigned long vm_start = vma->vm_start;
 
-	if (vma->vm_flags & VM_GROWSDOWN) {
-		vm_start -= stack_guard_gap;
-		if (vm_start > vma->vm_start)
-			vm_start = 0;
-	}
+	if (vma->vm_flags & VM_GROWSDOWN)
+		vm_start = __vm_start_gap(vma);
 	return vm_start;
 }
 
diff --git a/mm/mmap.c b/mm/mmap.c
index c7906ae1a7a1..f8131a94e56e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2307,6 +2307,25 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 }
 #endif /* CONFIG_STACK_GROWSUP || CONFIG_IA64 */
 
+unsigned long __vm_start_gap(struct vm_area_struct *vma)
+{
+	unsigned long stack_limit =
+		current->signal->rlim[RLIMIT_STACK].rlim_cur;
+	unsigned long vm_start;
+
+	if (stack_limit != RLIM_INFINITY &&
+	    vma->vm_end - vma->vm_start < stack_limit)
+		vm_start = vma->vm_end - PAGE_ALIGN(stack_limit);
+	else
+		vm_start = vma->vm_start;
+
+	vm_start -= stack_guard_gap;
+	if (vm_start > vma->vm_start)
+		vm_start = 0;
+
+	return vm_start;
+}
+
 /*
  * vma is the first one with address < vma->vm_start.  Have to extend vma.
  */

-- 
Ben Hutchings
For every complex problem
there is a solution that is simple, neat, and wrong.




Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Linus Torvalds
On Wed, Jul 5, 2017 at 9:15 AM, Andy Lutomirski  wrote:
> On Wed, Jul 5, 2017 at 7:23 AM, Michal Hocko  wrote:
>>
>> This is really worrying. This doesn't look like a gap at all. It is a
>> mapping which actually contains a code and so we should absolutely not
>> allow to scribble over it. So I am afraid the only way forward is to
>> allow per process stack gap and run this particular program to have a
>> smaller gap. We basically have two ways. Either /proc//$file or
>> a prctl inherited on exec. The later is a smaller code. What do you
>> think?
>
> Why inherit on exec?

.. because the whole point is that you have an existing binary that breaks.

So you need to be able to wrap it in "let's lower the stack gap, then
run that known-problematic binary".

If you think the problem is solved by recompiling existing binaries,
then why are we doing this kernel hack to begin with? The *real*
solution was always to just fix the damn compiler and ABI.

That *real* solution is simple and needs no kernel support at all.

In other words, *ALL* of the kernel work in this area is purely to
support existing binaries. Don't overlook that fact.

 Linus


Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Linus Torvalds
On Wed, Jul 5, 2017 at 5:19 AM, Ben Hutchings  wrote:
> On Tue, 2017-07-04 at 16:31 -0700, Linus Torvalds wrote:
>>
>> Can you find out where that is allocated? Perhaps a breakpoint on
>> mmap, with a condition to catch that particular one?
>
> Found it, and it's now clear why only i386 is affected:
> http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os/linux/vm/os_linux.cpp#l4852
> http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os_cpu/linux_x86/vm/os_linux_x86.cpp#l881

Thanks, good work.

Well, good work on *your* part. I will try very hard to refrain from
commenting too much on the f*cking stinking pile of sh*t that was
exec-shield.

But yes, I don't think we can sanely recognize this. The code clearly
very intentionally does that mapping under the stack, and it's very
intentionally not PROT_NONE, since it's meant to be both writable and
executable.

As I said earlier (and I see Michal Hocko suggested the same - sudden
email flurry going on here), I think we need to basically allow people
to set the stack gap per-process to something low.

The good news is that this is probably specialized enough that we can
just keep the defaults as "will break this one case, but we give
people the tools to work around it".

I hate doing that, but distros that still support 32-bit (which is
apparently a shrinking number) can maybe hack the libreoffice launch
scripts up?

   Linus


Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Andy Lutomirski
On Wed, Jul 5, 2017 at 7:23 AM, Michal Hocko  wrote:
> On Wed 05-07-17 13:19:40, Ben Hutchings wrote:
>> On Tue, 2017-07-04 at 16:31 -0700, Linus Torvalds wrote:
>> > On Tue, Jul 4, 2017 at 4:01 PM, Ben Hutchings 
>> > wrote:
>> > >
>> > > We have:
>> > >
>> > > bottom = 0xff803fff
>> > > sp = 0xb178
>> > >
>> > > The relevant mappings are:
>> > >
>> > > ff7fc000-ff7fd000 rwxp  00:00 0
>> > > fffdd000-e000 rw-p  00:00
>> > > 0  [stack]
>> >
>> > Ugh. So that stack is actually 8MB in size, but the alloca() is about
>> > to use up almost all of it, and there's only about 28kB left between
>> > "bottom" and that 'rwx' mapping.
>> >
>> > Still, that rwx mapping is interesting: it is a single page, and it
>> > really is almost exactly 8MB below the stack.
>> >
>> > In fact, the top of stack (at 0xe000) is *exactly* 8MB+4kB from
>> > the top of that odd one-page allocation (0xff7fd000).
>> >
>> > Can you find out where that is allocated? Perhaps a breakpoint on
>> > mmap, with a condition to catch that particular one?
>> [...]
>>
>> Found it, and it's now clear why only i386 is affected:
>> http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os/linux/vm/os_linux.cpp#l4852
>> http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os_cpu/linux_x86/vm/os_linux_x86.cpp#l881
>
> This is really worrying. This doesn't look like a gap at all. It is a
> mapping which actually contains a code and so we should absolutely not
> allow to scribble over it. So I am afraid the only way forward is to
> allow per process stack gap and run this particular program to have a
> smaller gap. We basically have two ways. Either /proc//$file or
> a prctl inherited on exec. The latter is less code. What do you
> think?

Why inherit on exec?

I think that, if we add a new API, we should do it right rather than
making it even more hackish.  Specifically, we'd add a real VMA type
(via flag or whatever) that means "this is a modern stack".  A modern
stack wouldn't ever expand and would have no guard page at all.  It
would, however, properly account stack space by tracking the pages
used as stack space.  Users of the new VMA type would be responsible
for allocating their own guard pages, probably by mapping an extra
page and then mapping PROT_NONE over it.

Also, this doesn't even need a new API, I think.  What's wrong with
plain old mmap(2) with MAP_STACK and *without* MAP_GROWSDOWN?  Only
new kernels would get the accounting right, but I doubt that matters
much in practice.
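
A minimal userspace sketch of that allocation pattern, with the guard page
owned by the caller (sizes illustrative):

#include <stddef.h>
#include <sys/mman.h>

/* Fixed-size thread stack: no MAP_GROWSDOWN, so the kernel never
 * has to guess at a gap.  MAP_STACK is currently only a hint on
 * Linux; the accounting described above would be new behaviour.
 */
static void *alloc_thread_stack(size_t size, size_t page)
{
        char *p = mmap(NULL, size + page, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
        if (p == MAP_FAILED)
                return NULL;
        if (mprotect(p, page, PROT_NONE) != 0) {   /* low-end guard */
                munmap(p, size + page);
                return NULL;
        }
        return p + page;        /* lowest usable stack address */
}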


Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Linus Torvalds
On Wed, Jul 5, 2017 at 5:21 AM, Ben Hutchings  wrote:
>
> How about, instead of looking at permissions, we remember whether vmas
> were allocated with MAP_FIXED and ignore those when evaluating the gap?

No, I think that's a bad idea. There's tons of good reasons to use
MAP_FIXED, and real programs do it all the time.

I'd much rather just do something special for the Java case, either
recognizing that particular pattern, or (and this is likely what we'll
have to do) just have a per-process stack limit that

 (a) will be reset by suid transitions etc security boundaries

 (b) you can just set back to 4kB for the specific Java case.

because I'd rather make this be a very conscious thing rather than a
random hack.

The PROT_NONE thing made tons of conceptual sense ("let people do
their own guard mappings"). The MAP_FIXED thing does not.

 Linus


Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Michal Hocko
On Wed 05-07-17 16:25:00, Ben Hutchings wrote:
> On Wed, 2017-07-05 at 16:23 +0200, Michal Hocko wrote:
> > On Wed 05-07-17 13:19:40, Ben Hutchings wrote:
> > > On Tue, 2017-07-04 at 16:31 -0700, Linus Torvalds wrote:
> > > > On Tue, Jul 4, 2017 at 4:01 PM, Ben Hutchings 
> > > > wrote:
> > > > > 
> > > > > We have:
> > > > > 
> > > > > bottom = 0xff803fff
> > > > > sp = 0xb178
> > > > > 
> > > > > The relevant mappings are:
> > > > > 
> > > > > ff7fc000-ff7fd000 rwxp  00:00 0
> > > > > fffdd000-e000 rw-p  00:00
> > > > > 0  [stack]
> > > > 
> > > > Ugh. So that stack is actually 8MB in size, but the alloca() is about
> > > > to use up almost all of it, and there's only about 28kB left between
> > > > "bottom" and that 'rwx' mapping.
> > > > 
> > > > Still, that rwx mapping is interesting: it is a single page, and it
> > > > really is almost exactly 8MB below the stack.
> > > > 
> > > > In fact, the top of stack (at 0xe000) is *exactly* 8MB+4kB from
> > > > the top of that odd one-page allocation (0xff7fd000).
> > > > 
> > > > Can you find out where that is allocated? Perhaps a breakpoint on
> > > > mmap, with a condition to catch that particular one?
> > > 
> > > [...]
> > > 
> > > Found it, and it's now clear why only i386 is affected:
> > > http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os/linux/vm/os_linux.cpp#l4852
> > > http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os_cpu/linux_x86/vm/os_linux_x86.cpp#l881
> > 
> > This is really worrying. This doesn't look like a gap at all. It is a
> > mapping which actually contains a code and so we should absolutely not
> > allow to scribble over it. So I am afraid the only way forward is to
> > allow per process stack gap and run this particular program to have a
> > smaller gap. We basically have two ways. Either /proc//$file or
> > a prctl inherited on exec. The latter is less code. What do you
> > think?
> 
> Distributions can do that, but what about all the other apps out there
> using JNI and private copies of the JRE?

Yes this sucks. I was thinking about something like run_legacy_stack
which would do
prctl(PR_SET_STACK_GAP, 1, 0, 0, 0);
execve(argv[1], argv+1, environ);

so we would have a way to start applications that start crashing with
the new setup without changing the default for all other applications.
The question is what to do if the exec'd task is suid, because we
definitely do not want to let anybody be tricked into a smaller gap.

Or maybe just start the java with increased stack rlimit?
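
Fleshed out, that wrapper would be little more than the following;
PR_SET_STACK_GAP is hypothetical (it was only being discussed, never
merged), so the constant below is invented:

#include <stdio.h>
#include <sys/prctl.h>
#include <unistd.h>

#ifndef PR_SET_STACK_GAP
#define PR_SET_STACK_GAP 48     /* hypothetical prctl, made-up number */
#endif

extern char **environ;

/* run_legacy_stack: ask for the historic one-page stack gap, then
 * exec the known-problematic binary given on the command line.
 */
int main(int argc, char *argv[])
{
        if (argc < 2)
                return 1;
        if (prctl(PR_SET_STACK_GAP, 1, 0, 0, 0) != 0)
                perror("prctl");        /* kernel lacks the feature */
        execve(argv[1], argv + 1, environ);
        perror("execve");
        return 127;
}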

> Something I noticed is that Java doesn't immediately use MAP_FIXED. 
> Look at os::pd_attempt_reserve_memory_at().  If the first, hinted,
> mmap() doesn't return the hinted address it then attempts to allocate
> huge areas (I'm not sure how intentional this is) and unmaps the
> unwanted parts.  Then os::workaround_expand_exec_shield_cs_limit() re-
> mmap()s the wanted part with MAP_FIXED.  If this fails at any point it
> is not a fatal error.
> 
> So if we change vm_start_gap() to take the stack limit into account
> (when it's finite) that should neutralise
> os::workaround_expand_exec_shield_cs_limit().  I'll try this.

I was already thinking about doing something like that to have better
support for MAP_GROWSDOWN, but then I gave up because it would require
capping RLIMIT_STACK for large values in order not to break userspace
again. What the maximum should be is not really clear to me.
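
For reference, a sketch of what folding the stack limit into
vm_start_gap() could look like, modelled on the 4.12-era helper; the
rlimit handling is hypothetical and untested, and the capping question
above stays open:

/* Sketch only: extend the protected region below a MAP_GROWSDOWN vma
 * from stack_guard_gap to all the room the stack may still grow into
 * under a finite RLIMIT_STACK. Not a tested or merged patch. */
static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
{
    unsigned long vm_start = vma->vm_start;

    if (vma->vm_flags & VM_GROWSDOWN) {
        unsigned long gap = stack_guard_gap;
        unsigned long limit = rlimit(RLIMIT_STACK);
        unsigned long size = vma->vm_end - vma->vm_start;

        /* Room left before the stack hits its rlimit; huge or
         * infinite limits would have to be capped somehow. */
        if (limit != RLIM_INFINITY && limit > size &&
            limit - size > gap)
            gap = limit - size;

        vm_start -= gap;
        if (vm_start > vma->vm_start)   /* underflow check */
            vm_start = 0;
    }
    return vm_start;
}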
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Ben Hutchings
On Wed, 2017-07-05 at 16:23 +0200, Michal Hocko wrote:
> On Wed 05-07-17 13:19:40, Ben Hutchings wrote:
> > On Tue, 2017-07-04 at 16:31 -0700, Linus Torvalds wrote:
> > > On Tue, Jul 4, 2017 at 4:01 PM, Ben Hutchings 
> > > wrote:
> > > > 
> > > > We have:
> > > > 
> > > > bottom = 0xff803fff
> > > > sp = 0xffffb178
> > > > 
> > > > The relevant mappings are:
> > > > 
> > > > ff7fc000-ff7fd000 rwxp 00000000 00:00 0
> > > > fffdd000-ffffe000 rw-p 00000000 00:00 0          [stack]
> > > 
> > > Ugh. So that stack is actually 8MB in size, but the alloca() is about
> > > to use up almost all of it, and there's only about 28kB left between
> > > "bottom" and that 'rwx' mapping.
> > > 
> > > Still, that rwx mapping is interesting: it is a single page, and it
> > > really is almost exactly 8MB below the stack.
> > > 
> > > In fact, the top of stack (at 0xffffe000) is *exactly* 8MB+4kB from
> > > the top of that odd one-page allocation (0xff7fd000).
> > > 
> > > Can you find out where that is allocated? Perhaps a breakpoint on
> > > mmap, with a condition to catch that particular one?
> > 
> > [...]
> > 
> > Found it, and it's now clear why only i386 is affected:
> > http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os/linux/vm/os_linux.cpp#l4852
> > http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os_cpu/linux_x86/vm/os_linux_x86.cpp#l881
> 
> This is really worrying. This doesn't look like a gap at all. It is a
> mapping which actually contains code, so we should absolutely not
> allow scribbling over it. So I am afraid the only way forward is to
> allow a per-process stack gap and run this particular program with a
> smaller gap. We basically have two options: either /proc/<pid>/$file
> or a prctl inherited on exec. The latter is less code. What do you
> think?

Distributions can do that, but what about all the other apps out there
using JNI and private copies of the JRE?

Something I noticed is that Java doesn't immediately use MAP_FIXED. 
Look at os::pd_attempt_reserve_memory_at().  If the first, hinted,
mmap() doesn't return the hinted address, it then attempts to allocate
huge areas (I'm not sure how intentional this is) and unmaps the
unwanted parts.  Then os::workaround_expand_exec_shield_cs_limit() re-
mmap()s the wanted part with MAP_FIXED.  If this fails at any point it
is not a fatal error.
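
Schematically, that reservation dance looks like this (a simplified
userspace illustration of the pattern, not the actual HotSpot code):

#include <stddef.h>
#include <sys/mman.h>

/* Try a hinted mmap(); if the kernel picks a different address, give
 * that mapping back and pin the wanted range with MAP_FIXED. */
static void *reserve_at(void *hint, size_t size)
{
    void *p = mmap(hint, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == hint)
        return p;               /* got the hinted address */
    if (p != MAP_FAILED)
        munmap(p, size);        /* wrong place, discard it */

    /* The problematic step: force the address. If "hint" sits just
     * below a stack, this creates the mapping that the enlarged
     * guard gap later trips over. */
    p = mmap(hint, size, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}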

So if we change vm_start_gap() to take the stack limit into account
(when it's finite) that should neutralise
os::workaround_expand_exec_shield_cs_limit().  I'll try this.

Ben.

-- 
Ben Hutchings
Anthony's Law of Force: Don't force it, get a larger hammer.


signature.asc
Description: This is a digitally signed message part


Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Michal Hocko
On Wed 05-07-17 13:19:40, Ben Hutchings wrote:
> On Tue, 2017-07-04 at 16:31 -0700, Linus Torvalds wrote:
> > On Tue, Jul 4, 2017 at 4:01 PM, Ben Hutchings 
> > wrote:
> > > 
> > > We have:
> > > 
> > > bottom = 0xff803fff
> > > sp = 0xffffb178
> > > 
> > > The relevant mappings are:
> > > 
> > > ff7fc000-ff7fd000 rwxp 00000000 00:00 0
> > > fffdd000-ffffe000 rw-p 00000000 00:00 0          [stack]
> > 
> > Ugh. So that stack is actually 8MB in size, but the alloca() is about
> > to use up almost all of it, and there's only about 28kB left between
> > "bottom" and that 'rwx' mapping.
> > 
> > Still, that rwx mapping is interesting: it is a single page, and it
> > really is almost exactly 8MB below the stack.
> > 
> > In fact, the top of stack (at 0xffffe000) is *exactly* 8MB+4kB from
> > the top of that odd one-page allocation (0xff7fd000).
> > 
> > Can you find out where that is allocated? Perhaps a breakpoint on
> > mmap, with a condition to catch that particular one?
> [...]
> 
> Found it, and it's now clear why only i386 is affected:
> http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os/linux/vm/os_linux.cpp#l4852
> http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os_cpu/linux_x86/vm/os_linux_x86.cpp#l881

This is really worrying. This doesn't look like a gap at all. It is a
mapping which actually contains code, so we should absolutely not
allow scribbling over it. So I am afraid the only way forward is to
allow a per-process stack gap and run this particular program with a
smaller gap. We basically have two options: either /proc/<pid>/$file
or a prctl inherited on exec. The latter is less code. What do you
think?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Michal Hocko
On Wed 05-07-17 13:21:54, Ben Hutchings wrote:
> On Wed, 2017-07-05 at 10:14 +0200, Willy Tarreau wrote:
> > On Wed, Jul 05, 2017 at 08:36:46AM +0200, Michal Hocko wrote:
> > > PROT_NONE would explicitly fault but we would simply
> > > run over this mapping too easily and who knows what might end up below
> > > it. So to me the guard gap does its job here.
> > 
> > I tend to think that applications that implement their own stack guard
> > using PROT_NONE also assume that they will never perform unchecked stack
> > allocations larger than their own guard, thus the condition above should
> > never happen. Otherwise they're bogus and/or vulnerable by design and it
> > is their responsibility to fix it.
> > 
> > Thus maybe if that helps we could even relax some of the stack guard
> > checks as soon as we meet a PROT_NONE area, allowing VMAs to be tightly
> > packed if the application knows what it's doing. That wouldn't solve
> > the libreoffice issue though, given the lower page is RWX.
> 
> How about, instead of looking at permissions, we remember whether vmas
> were allocated with MAP_FIXED and ignore those when evaluating the gap?

To be honest I really hate this, the same way as any other heuristic
where we try to guess which part of the gap need not fault, when
faulting is exactly how userspace learns something is wrong. And the
Java example just proves the point AFAIU. The mapping we clash with is
_not_ a gap. It is a real mapping we should rather not scribble over.
It contains code to execute, and that is even more worrying. So I guess
the _only_ sane way forward for this case is to reduce the stack gap
for that particular program.
-- 
Michal Hocko
SUSE Labs


Re: [vs-plain] Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Solar Designer
Hi all,

On Tue, Jul 04, 2017 at 07:16:06PM -0600, kseifr...@redhat.com wrote:
> This issue occurs post stackguard patches correct? Fixing it sounds like
> this might go beyond hardening and into CVE territory.

Since this thread is public on LKML, as it should be, it's no longer
valid to be CC'ed to linux-distros, which is for handling of embargoed
issues only.  So please drop linux-distros from further CC's (I moved
linux-distros to Bcc on this reply, just so they know what happened).

If specific security issues are identified (such as with LibreOffice and
Java), then ideally those should be posted to oss-security as separate
reports.  I'd appreciate it if anyone takes care of that (regardless of
CVE worthiness).

In fact, I already mentioned this thread in:

http://www.openwall.com/lists/oss-security/2017/07/05/11

Thank you!

Alexander


Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Willy Tarreau
On Wed, Jul 05, 2017 at 01:21:54PM +0100, Ben Hutchings wrote:
> On Wed, 2017-07-05 at 10:14 +0200, Willy Tarreau wrote:
> > On Wed, Jul 05, 2017 at 08:36:46AM +0200, Michal Hocko wrote:
> > > PROT_NONE would explicitly fault but we would simply
> > > run over this mapping too easily and who knows what might end up below
> > > it. So to me the guard gap does its job here.
> > 
> > I tend to think that applications that implement their own stack guard
> > using PROT_NONE also assume that they will never perform unchecked stack
> > allocations larger than their own guard, thus the condition above should
> > never happen. Otherwise they're bogus and/or vulnerable by design and it
> > is their responsibility to fix it.
> > 
> > Thus maybe if that helps we could even relax some of the stack guard
> > checks as soon as we meet a PROT_NONE area, allowing VMAs to be tightly
> > packed if the application knows what it's doing. That wouldn't solve
> > the libreoffice issue though, given the lower page is RWX.
> 
> How about, instead of looking at permissions, we remember whether vmas
> were allocated with MAP_FIXED and ignore those when evaluating the gap?

I like this idea. It leaves complete control to the application. Our
usual principle of letting people shoot themselves in the foot if they
insist on doing so.

Do you think something like this could work (not even build-tested)?

Willy
--

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d16f524..8ad7f40 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -90,6 +90,7 @@
 #define VM_PFNMAP  0x0400  /* Page-ranges managed without "struct page", just pure PFN */
 #define VM_DENYWRITE   0x0800  /* ETXTBSY on write attempts.. */
 
+#define VM_FIXED   0x1000  /* MAP_FIXED was used */
 #define VM_LOCKED  0x2000
 #define VM_IO   0x4000 /* Memory mapped I/O or similar */
 
diff --git a/include/linux/mman.h b/include/linux/mman.h
index 9aa863d..4df2659 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -79,6 +79,7 @@ static inline int arch_validate_prot(unsigned long prot)
 {
return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
   _calc_vm_trans(flags, MAP_DENYWRITE,  VM_DENYWRITE ) |
+  _calc_vm_trans(flags, MAP_FIXED,  VM_FIXED ) |
   _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED);
 }
 #endif /* _LINUX_MMAN_H */
diff --git a/mm/mmap.c b/mm/mmap.c
index 3c4e4d7..b612868 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2145,7 +2145,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 
next = vma->vm_next;
if (next && next->vm_start < gap_addr) {
-   if (!(next->vm_flags & VM_GROWSUP))
+   if (!(next->vm_flags & (VM_GROWSUP|VM_FIXED)))
return -ENOMEM;
/* Check that both stack segments have the same anon_vma? */
}
@@ -2225,7 +2225,7 @@ int expand_downwards(struct vm_area_struct *vma,
return -ENOMEM;
prev = vma->vm_prev;
if (prev && prev->vm_end > gap_addr) {
-   if (!(prev->vm_flags & VM_GROWSDOWN))
+   if (!(prev->vm_flags & (VM_GROWSDOWN|VM_FIXED)))
return -ENOMEM;
/* Check that both stack segments have the same anon_vma? */
}
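
As a way to exercise the case this targets, a userspace test along
these lines deliberately places a MAP_FIXED page about one
stack-rlimit below the stack and then grows the stack toward it (a
sketch: it assumes a finite RLIMIT_STACK, and the MAP_FIXED call may
clobber whatever already lives at that address):

#include <alloca.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;
    uintptr_t sp = (uintptr_t)&rl;  /* rough current stack pointer */

    if (getrlimit(RLIMIT_STACK, &rl) || rl.rlim_cur == RLIM_INFINITY)
        return 1;

    /* One page, roughly one stack-rlimit below the current stack. */
    void *target = (void *)((sp - rl.rlim_cur) & ~4095UL);
    void *p = mmap(target, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    printf("fixed page at %p (stack pointer %p)\n", p, (void *)sp);

    /* Grow the stack until the guard logic stops us. */
    for (;;) {
        volatile char *c = alloca(4096);
        c[0] = 0;
    }
}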



Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Ben Hutchings
On Tue, 2017-07-04 at 09:18 -0700, Linus Torvalds wrote:
> > On Tue, Jul 4, 2017 at 4:36 AM, Ben Hutchings  wrote:
> > 
> > That's what I was thinking of.  Tried the following patch:
> > 
> > Subject: mmap: Ignore VM_NONE mappings when checking for space to
> >  expand the stack
> 
> This looks sane to me.
> 
> I'm going to ignore it in this thread, and assume that it gets sent as
> a patch separately, ok?
> 
> It would be good to have more acks on it.
> 
> Also, separately, John Haxby kind of implied that the LibreOffice
> regression on i386 is already fixed by commit f4cb767d76cf ("mm: fix
> new crash in unmapped_area_topdown()").
> 
> Or was that a separate issue?

They are separate issues.

Ben.

-- 
Ben Hutchings
Anthony's Law of Force: Don't force it, get a larger hammer.



signature.asc
Description: This is a digitally signed message part


Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Ben Hutchings
On Tue, 2017-07-04 at 19:11 +0200, Willy Tarreau wrote:
> On Tue, Jul 04, 2017 at 12:36:11PM +0100, Ben Hutchings wrote:
> > @@ -2323,11 +2330,17 @@ int expand_downwards(struct vm_area_struct *vma,
> >     if (error)
> >     return error;
> >  
> > -   /* Enforce stack_guard_gap */
> > +   /*
> > +    * Enforce stack_guard_gap, but allow VM_NONE mappings in the gap
> > +    * as some applications try to make their own stack guards
> > +    */
> >     gap_addr = address - stack_guard_gap;
> >     if (gap_addr > address)
> >     return -ENOMEM;
> > -   prev = vma->vm_prev;
> > +   for (prev = vma->vm_prev;
> > +    prev && !(prev->vm_flags & (VM_READ | VM_WRITE | VM_EXEC));
> > +    prev = prev->vm_prev)
> > +   ;
> >     if (prev && prev->vm_end > gap_addr) {
> >     if (!(prev->vm_flags & VM_GROWSDOWN))
> >     return -ENOMEM;
> 
> Hmmm, shouldn't we also stop looping once we're out of the gap?

Yes, either that or only allow one such vma.

Ben.

> Something like this :
> 
>   for (prev = vma->vm_prev;
>    prev && !(prev->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)) &&
>   address - prev->vm_end < stack_guard_gap;
>    prev = prev->vm_prev)
>   ;
> 
> This would limit the risk of runaway loops if someone is having fun
> allocating a lot of memory in small chunks (eg: 4 GB in 1 million
> independent mmap() calls).

-- 
Ben Hutchings
Anthony's Law of Force: Don't force it, get a larger hammer.



signature.asc
Description: This is a digitally signed message part


Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Ben Hutchings
On Wed, 2017-07-05 at 10:14 +0200, Willy Tarreau wrote:
> On Wed, Jul 05, 2017 at 08:36:46AM +0200, Michal Hocko wrote:
> > PROT_NONE would explicitly fault but we would simply
> > run over this mapping too easily and who knows what might end up below
> > it. So to me the guard gap does its job here.
> 
> I tend to think that applications that implement their own stack guard
> using PROT_NONE also assume that they will never perform unchecked stack
> allocations larger than their own guard, thus the condition above should
> never happen. Otherwise they're bogus and/or vulnerable by design and it
> is their responsibility to fix it.
> 
> Thus maybe if that helps we could even relax some of the stack guard
> checks as soon as we meet a PROT_NONE area, allowing VMAs to be tightly
> packed if the application knows what it's doing. That wouldn't solve
> the libreoffice issue though, given the lower page is RWX.

How about, instead of looking at permissions, we remember whether vmas
were allocated with MAP_FIXED and ignore those when evaluating the gap?

Ben.

-- 
Ben Hutchings
Anthony's Law of Force: Don't force it, get a larger hammer.



signature.asc
Description: This is a digitally signed message part


Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Ben Hutchings
On Tue, 2017-07-04 at 16:31 -0700, Linus Torvalds wrote:
> On Tue, Jul 4, 2017 at 4:01 PM, Ben Hutchings 
> wrote:
> > 
> > We have:
> > 
> > bottom = 0xff803fff
> > sp = 0xffffb178
> > 
> > The relevant mappings are:
> > 
> > ff7fc000-ff7fd000 rwxp 00000000 00:00 0
> > fffdd000-ffffe000 rw-p 00000000 00:00 0          [stack]
> 
> Ugh. So that stack is actually 8MB in size, but the alloca() is about
> to use up almost all of it, and there's only about 28kB left between
> "bottom" and that 'rwx' mapping.
> 
> Still, that rwx mapping is interesting: it is a single page, and it
> really is almost exactly 8MB below the stack.
> 
> In fact, the top of stack (at 0xffffe000) is *exactly* 8MB+4kB from
> the top of that odd one-page allocation (0xff7fd000).
> 
> Can you find out where that is allocated? Perhaps a breakpoint on
> mmap, with a condition to catch that particular one?
[...]

Found it, and it's now clear why only i386 is affected:
http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os/linux/vm/os_linux.cpp#l4852
http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os_cpu/linux_x86/vm/os_linux_x86.cpp#l881

Ben.

-- 
Ben Hutchings
Anthony's Law of Force: Don't force it, get a larger hammer.


signature.asc
Description: This is a digitally signed message part


Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Willy Tarreau
On Wed, Jul 05, 2017 at 10:24:41AM +0200, Michal Hocko wrote:
> > Thus maybe if that helps we could even relax some of the stack guard
> > checks as soon as we meet a PROT_NONE area, allowing VMAs to be tightly
> > packed if the application knows what it's doing.
> 
> Yes, this is what my patch does [1]. Or did I miss your point?

Sorry, you're right. I got confused when looking at the libreoffice
dump and for whatever reason ended up thinking we were treating that
page as part of the gap rather than as a marker for the bottom. Never
mind.

> > That wouldn't solve the libreoffice issue though, given the lower page
> > is RWX.
> 
> Unfortunately yes. We only have limited room to address this issue,
> though. We could add a per-task (mm) stack_gap limit (controlled either
> via proc or prctl) and revert back to 1 page for the specific program,
> but I would be really careful about adding more hacks into the stack
> expansion code.

Actually one option could be to have a sysctl causing a warning to be
emitted when the stack guard is hit, instead of killing the process. We
could decide, for example, that once this warning is emitted the guard
is reduced to 64kB (I think that was the size before) and the
application can continue to run. That could help problematic
applications get fixed quickly. And in environments with lots of local
users it would be dissuasive enough to keep users from trying their
luck on setuid binaries.

Just my two cents,
Willy



Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Michal Hocko
On Wed 05-07-17 10:14:43, Willy Tarreau wrote:
> On Wed, Jul 05, 2017 at 08:36:46AM +0200, Michal Hocko wrote:
> > PROT_NONE would explicitly fault but we would simply
> > run over this mapping too easily and who knows what might end up below
> > it. So to me the guard gap does its job here.
> 
> I tend to think that applications that implement their own stack guard
> using PROT_NONE also assume that they will never perform unchecked stack
> allocations larger than their own guard, thus the condition above should
> never happen. Otherwise they're bogus and/or vulnerable by design and it
> is their responsibility to fix it.
> 
> Thus maybe if that helps we could even relax some of the stack guard
> checks as soon as we meet a PROT_NONE area, allowing VMAs to be tightly
> packed if the application knows what it's doing.

Yes, this is what my patch does [1]. Or did I miss your point?

> That wouldn't solve the libreoffice issue though, given the lower page
> is RWX.

Unfortunately yes. We only have limited room to address this issue,
though. We could add a per-task (mm) stack_gap limit (controlled either
via proc or prctl) and revert back to 1 page for the specific program,
but I would be really careful about adding more hacks into the stack
expansion code.

[1] http://lkml.kernel.org/r/20170704093538.gf14...@dhcp22.suse.cz
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Willy Tarreau
On Wed, Jul 05, 2017 at 08:36:46AM +0200, Michal Hocko wrote:
> PROT_NONE would explicitly fault but we would simply
> run over this mapping too easily and who knows what might end up below
> it. So to me the guard gap does its job here.

I tend to think that applications that implement their own stack guard
using PROT_NONE also assume that they will never perform unchecked stack
allocations larger than their own guard, thus the condition above should
never happen. Otherwise they're bogus and/or vulnerable by design and it
is their responsibility to fix it.

Thus maybe if that helps we could even relax some of the stack guard
checks as soon as we meet a PROT_NONE area, allowing VMAs to be tightly
packed if the application knows what it's doing. That wouldn't solve
the libreoffice issue though, given the lower page is RWX.

Willy


Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-05 Thread Michal Hocko
On Tue 04-07-17 16:31:52, Linus Torvalds wrote:
> On Tue, Jul 4, 2017 at 4:01 PM, Ben Hutchings  wrote:
> >
> > We have:
> >
> > bottom = 0xff803fff
> > sp = 0xffffb178
> > 
> > The relevant mappings are:
> > 
> > ff7fc000-ff7fd000 rwxp 00000000 00:00 0
> > fffdd000-ffffe000 rw-p 00000000 00:00 0          [stack]
> 
> Ugh. So that stack is actually 8MB in size, but the alloca() is about
> to use up almost all of it, and there's only about 28kB left between
> "bottom" and that 'rwx' mapping.
> 
> Still, that rwx mapping is interesting: it is a single page, and it
> really is almost exactly 8MB below the stack.
> 
> In fact, the top of stack (at 0xffffe000) is *exactly* 8MB+4kB from
> the top of that odd one-page allocation (0xff7fd000).

Very interesting! I would be really curious whether changing ulimit to
something bigger changes the picture. And if this is really the case,
what are we going to do here? We could special-case a single-page
mapping under the stack, but that sounds quite dangerous for something
that is dubious in itself. PROT_NONE would explicitly fault, but we
would simply run over this mapping too easily, and who knows what might
end up below it. So to me the guard gap does its job here.

Do you want me to post the earlier patch to ignore PROT_NONE mappings,
or should we rather wait for more details on this one?
-- 
Michal Hocko
SUSE Labs


Re: [vs-plain] Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-04 Thread kseifr...@redhat.com
This issue occurs after the stack guard patches, correct? Fixing it sounds like
this might go beyond hardening and into CVE territory.

-- 
Kurt Seifried -- Red Hat -- Product Security -- Cloud
PGP A90B F995 7350 148F 66BF 7554 160D 4553 5E26 7993
Red Hat Product Security contact: secal...@redhat.com


Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-04 Thread Linus Torvalds
On Tue, Jul 4, 2017 at 4:01 PM, Ben Hutchings  wrote:
>
> We have:
>
> bottom = 0xff803fff
> sp = 0xffffb178
> 
> The relevant mappings are:
> 
> ff7fc000-ff7fd000 rwxp 00000000 00:00 0
> fffdd000-ffffe000 rw-p 00000000 00:00 0          [stack]

Ugh. So that stack is actually 8MB in size, but the alloca() is about
to use up almost all of it, and there's only about 28kB left between
"bottom" and that 'rwx' mapping.

Still, that rwx mapping is interesting: it is a single page, and it
really is almost exactly 8MB below the stack.

In fact, the top of stack (at 0xffffe000) is *exactly* 8MB+4kB from
the top of that odd one-page allocation (0xff7fd000).

Can you find out where that is allocated? Perhaps a breakpoint on
mmap, with a condition to catch that particular one?

Because I'm wondering if it was done explicitly as a 8MB stack
boundary allocation, with the "knowledge" that the kernel then adds a
one-page guard page.

I really don't know why somebody would do that (as opposed to just
limiting the stack with ulimit), but the 8MB+4kB distance is kind of
intriguing.

Maybe that one-page mapping is some hack to make sure that no random
mmap() will ever get too close to the stack, so it really is a "guard
mapping", except it's explicitly designed not so much to guard the
stack from growing down further (ulimit does that), but to guard the
brk() and other mmaps from growing *up* into the stack area..

Sometimes user mode does crazy things just because people are insane.
But sometimes there really is a method to the madness.

I would *not* be surprised if the way somebody allocated the stack was
to basically say:

 - let's use "mmap()" with a size of 8MB+2 pages to find a
sufficiently sized virtual memory area

 - once we've gotten that virtual address space range, let's over-map
the last page as the new stack using MAP_FIXED

 - finally, munmap the 8MB in between so that the new stack can grow
down into that gap the munmap creates.

Notice how you end up with exactly the above pattern of allocations,
and how it guarantees that you get a nice 8MB stack without having to
do any locking (you rely on the kernel to just find the 8MB+8kB areas,
and once one has been allocated, it will be "safe").

And yes, it would have been much nicer to just use PROT_NONE for that
initial sizing allocation, but for somebody who is only interested in
carving out a 8MB stack in virtual space, the protections are actually
kind of immaterial, so 'rwx' might be just their mental default.
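
Spelling that guess out as code, the sequence would be roughly this (a
reconstruction of the hypothetical pattern above, not taken from any
actual program):

#include <stddef.h>
#include <sys/mman.h>

#define PAGE  4096UL
#define STACK (8UL << 20)       /* 8MB */

/* 1. size the region, 2. over-map the top page as the stack with
 * MAP_FIXED, 3. punch out the 8MB hole for the stack to grow into.
 * The bottom page is left behind as the observed one-page 'rwx'
 * marker, exactly 8MB+4kB below the top of the stack. */
static void *carve_stack(void)
{
    char *base = mmap(NULL, STACK + 2 * PAGE,
                      PROT_READ | PROT_WRITE | PROT_EXEC,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    char *stack_page = base + PAGE + STACK;
    mmap(stack_page, PAGE, PROT_READ | PROT_WRITE,
         MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED | MAP_GROWSDOWN,
         -1, 0);

    munmap(base + PAGE, STACK);     /* the gap the stack grows into */

    return stack_page + PAGE;       /* top of the new stack */
}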

  Linus


Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-04 Thread Ben Hutchings
On Tue, 2017-07-04 at 12:36 +0100, Ben Hutchings wrote:
[...]
> This *doesn't* fix the LibreOffice regression on i386.

gdb shows me that the crash is at the last statement in this function:

static void _expand_stack_to(address bottom) {
  address sp;
  size_t size;
  volatile char *p;

  // Adjust bottom to point to the largest address within the same page, it
  // gives us a one-page buffer if alloca() allocates slightly more memory.
  bottom = (address)align_size_down((uintptr_t)bottom, os::Linux::page_size());
  bottom += os::Linux::page_size() - 1;

  // sp might be slightly above current stack pointer; if that's the case, we
  // will alloca() a little more space than necessary, which is OK. Don't use
  // os::current_stack_pointer(), as its result can be slightly below current
  // stack pointer, causing us to not alloca enough to reach "bottom".
  sp = (address) &sp;

  if (sp > bottom) {
size = sp - bottom;
p = (volatile char *)alloca(size);
assert(p != NULL && p <= (volatile char *)bottom, "alloca problem?");
p[0] = '\0';
  }
}

We have:

bottom = 0xff803fff
sp = 0xffffb178

The relevant mappings are:

ff7fc000-ff7fd000 rwxp 00000000 00:00 0
fffdd000-ffffe000 rw-p 00000000 00:00 0          [stack]

So instead of a useless guard page, we have a dangerous WX page
underneath the stack!  I suppose I should find out where and why that's
being allocated.

Ben.

-- 
Ben Hutchings
The world is coming to an end.  Please log off.


signature.asc
Description: This is a digitally signed message part


Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-04 Thread Willy Tarreau
On Tue, Jul 04, 2017 at 11:47:37AM -0700, Linus Torvalds wrote:
> Let's
> say that you are using lots of threads, so that you know your stack
> space is limited. What you do is to use MAP_FIXED a lot, and you lay
> out your stacks fairly densely (with each other, but also possibly
> with other mappings), with that PROT_NONE redzoning mapping in between
> the "dense" allocations.
> 
> So when the kernel wants to grow the stack, it finds the PROT_NONE
> redzone mapping - but there's possibly other maps right under it, so
> the stack_guard_gap still hits other mappings.
(...)

OK I didn't get that use case, that totally makes sense indeed! So
now we use PROT_NONE not as something that must be skipped to find
the unmapped area but as a hint that the application apparently
wants the stack to stop here.

Thanks for this clear explanation!

Willy


Re: [PATCH] mm: larger stack guard gap, between vmas

2017-07-04 Thread Linus Torvalds
On Tue, Jul 4, 2017 at 11:39 AM, Willy Tarreau  wrote:
>
> But what is wrong with stopping the loop as soon as the distance gets
> larger than the stack_guard_gap ?

Absolutely nothing. But that's not the problem with the loop. Let's
say that you are using lots of threads, so that you know your stack
space is limited. What you do is to use MAP_FIXED a lot, and you lay
out your stacks fairly densely (with each other, but also possibly
with other mappings), with that PROT_NONE redzoning mapping in between
the "dense" allocations.

So when the kernel wants to grow the stack, it finds the PROT_NONE
redzone mapping - but there's possibly other maps right under it, so
the stack_guard_gap still hits other mappings.

And the fact that this seems to trigger with
 (a) 32-bit x86
 (b) Java
actually makes sense in the above scenario: that's _exactly_ when
you'd have dense mappings. Java is very thread-happy, and in a 32-bit
VM, the virtual address space allocation for stacks is a primary issue
with lots of threads.

Of course, the downside to this theory is that the Java problem has
not actually been confirmed to be due to this (Ben root-caused the
Rust failure on ppc64), but it still sounds like quite a reasonable
thing to do.

The problem with the Java issue may be that they do that "dense stack
mappings in VM space" (for all the usual "lots of threads, limited VM"
reasons), but they may *not* have that PROT_NONE redzoning at all.

So the patch under discussion works for Rust exactly *because* it does
its redzone to show "this is where I expect the stack to end". The
i386 Java load may simply not have that marker for us to use..

   Linus
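
In kernel terms, the check being discussed would look roughly like
the sketch below: when the stack grows down, a neighbouring vma with
no access bits (i.e. PROT_NONE) is taken as the application's own
stack-end marker, while any other mapping must stay a full
stack_guard_gap away. This is only a paraphrase of the idea; the real
logic lives in expand_downwards() in mm/mmap.c and differs in detail
(the helper name here is hypothetical):

static int stack_gap_ok(struct vm_area_struct *vma,
                        unsigned long new_start)
{
    struct vm_area_struct *prev = vma->vm_prev;

    /* No neighbour below, or another stack segment: nothing to do. */
    if (!prev || (prev->vm_flags & VM_GROWSDOWN))
        return 1;

    /* PROT_NONE neighbour: the app marked where the stack ends. */
    if (!(prev->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)))
        return 1;

    /* Otherwise insist on the full guard gap below the new bottom
     * (new_start is assumed to be above prev->vm_end here). */
    return new_start - prev->vm_end >= stack_guard_gap;
}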


