Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR

2017-01-13 Thread H.J. Lu
On Mon, Jan 2, 2017 at 12:35 AM, Kirill A. Shutemov wrote:
> On Fri, Dec 30, 2016 at 06:08:27PM -0800, Andy Lutomirski wrote:
>> On Wed, Dec 28, 2016 at 6:53 PM, Carlos O'Donell  wrote:
>> > On 12/26/2016 09:24 PM, Kirill A. Shutemov wrote:
>> >> On Mon, Dec 26, 2016 at 06:06:01PM -0800, Andy Lutomirski wrote:
>> >>> On Mon, Dec 26, 2016 at 5:54 PM, Kirill A. Shutemov wrote:
>>  This patch introduces a new rlimit resource to manage the maximum virtual
>>  address available to userspace to map.
>> 
>>  On x86, 5-level paging enables a 56-bit userspace virtual address space.
>>  Not all userspace is ready to handle wide addresses. It's known that
>>  at least some JIT compilers use the high bits of pointers to encode their
>>  own information. That collides with valid pointers under 5-level paging
>>  and leads to crashes.
>> 
>>  The patch aims to address this compatibility issue.
>> 
>>  MM would use min(RLIMIT_VADDR, TASK_SIZE) as the upper limit of the
>>  virtual address available for userspace to map.
>> 
>>  The default hard limit will be RLIM_INFINITY, which basically means that
>>  TASK_SIZE limits the available address space.
>> 
>>  The soft limit will also be RLIM_INFINITY everywhere except on machines
>>  with 5-level paging enabled. In that case, the soft limit would be
>>  (1UL << 47) - PAGE_SIZE. That is the current x86-64 TASK_SIZE_MAX with
>>  4-level paging, which is known to be safe.
>> 
>>  The new rlimit resource would follow the usual semantics with regard to
>>  inheritance: preserved on fork(2) and exec(2). This has the potential to
>>  break applications if limits are set too wide or too narrow, but this is
>>  not uncommon for other resources (consider RLIMIT_DATA or RLIMIT_AS).
>> 
>>  As with other resources, you can set the limit lower than current usage.
>>  It would affect only future virtual address space allocations.
>> 
>>  Use-cases for the new rlimit:
>> 
>>    - Bumping the soft limit to RLIM_INFINITY allows the current process and
>>  all its children to use addresses above 47 bits.
>> 
>>    - Bumping the soft limit to RLIM_INFINITY after fork(2), but before
>>  exec(2), allows the child to use addresses above 47 bits.
>> 
>>    - Lowering the hard limit to 47 bits would prevent the current process
>>  and all its children from using addresses above 47 bits, unless a
>>  process has CAP_SYS_RESOURCE.
>> 
>>    - It can also be handy to lower the hard or soft limit to an arbitrary
>>  address. User-mode emulation in QEMU may lower the limit to 32 bits
>>  to emulate a 32-bit machine on a 64-bit host.
>> >>>
>> >>> I tend to think that this should be a personality or an ELF flag, not
>> >>> an rlimit.
>> >>
>> >> My plan was to implement ELF flag on top. Basically, ELF flag would mean
>> >> that we bump soft limit to hard limit on exec.
>> >
>> > Could you clarify what you mean by an "ELF flag?"
>>
>> Some way to mark a binary as supporting a larger address space.  I
>> don't have a precise solution in mind, but an ELF note might be a good
>> way to go here.
>
> + H.J.
>
> There's a discussion of the "Program Properties" proposal[1]. It seems to
> fit the purpose.
>
> [1] https://sourceware.org/ml/gnu-gabi/2016-q4/msg0.html
>
> --
>  Kirill A. Shutemov

There is another proposal:

https://fedoraproject.org/wiki/Toolchain/Watermark#Markup_for_ELF_objects

which covers much more than mine.

-- 
H.J.
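
The pointer-tagging problem described in the quoted patch text can be
illustrated with a small sketch. This is not taken from any particular JIT:
the 47-bit split, the macro names and the helper functions are all
assumptions made for illustration. It shows why a scheme that stores a tag
in the bits above bit 46 works for 4-level addresses and silently loses
address bits once mappings can land higher.

```c
#include <stdint.h>

/* Hypothetical JIT-style tagging: bits 47 and up of a 64-bit word hold a
 * type tag, on the assumption that user pointers never exceed 47 bits. */
#define TAG_SHIFT 47
#define ADDR_MASK ((UINT64_C(1) << TAG_SHIFT) - 1)

/* Pack a tag into the high bits; any address bits above bit 46 are lost. */
static uint64_t tag_pointer(uint64_t addr, uint64_t tag)
{
    return (tag << TAG_SHIFT) | (addr & ADDR_MASK);
}

/* Recover the address by dropping the tag bits. */
static uint64_t untag_pointer(uint64_t word)
{
    return word & ADDR_MASK;
}

/* Returns 1 if addr survives a tag/untag round trip, 0 otherwise. */
static int round_trips(uint64_t addr, uint64_t tag)
{
    return untag_pointer(tag_pointer(addr, tag)) == addr;
}
```

An address below 2^47 round-trips unchanged, while an address such as
1 << 50 comes back as a different (here: zero) pointer, which is the kind
of breakage a default 47-bit ceiling is meant to avoid.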


Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR

2017-01-11 Thread Andi Kleen
On Wed, Jan 11, 2017 at 11:31:25AM -0800, Linus Torvalds wrote:
> On Wed, Jan 11, 2017 at 11:20 AM, Andy Lutomirski  wrote:
> >
> > Taking a step back, I think it would be fantastic if we could find a
> > way to make this work without any inheritable settings at all.
> > Perhaps we could have a per-mm value that is initialized to 2^47-1 on
> > execve() and can be raised by ELF note or by prctl()?
> 
> I definitely think this is the right model. No inheritable settings,
> no suid issues, no worries. Make people who want the large address
> space (and there aren't going to be a lot of them) just mark their
> binaries at compile time.

Compile time is inconvenient if you want to test whether some existing
random binary works.

I tried to write a tool which patched ELF notes into binaries
some time ago for another project, but it ran into difficulties
and didn't work everywhere.

An inheritance scheme is much nicer for such use cases.

-Andi


Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR

2017-01-11 Thread Linus Torvalds
On Wed, Jan 11, 2017 at 11:32 AM, Kirill A. Shutemov wrote:
>
> Running a legacy binary with the full address space is a valuable option.

I disagree.

It's simply not valuable enough to worry about. Especially when there
is a fairly trivial wrapper approach: just make a full-address-space
wrapper that acts as a binary loader (think "specialized ld.so").

Sure, the wrapper may be "fairly trivial" but not necessarily
pleasant: you have to parse ELF sections etc and basically load the
binary by hand. But there are libraries for that, and loading an ELF
executable isn't rocket surgery, it's just possibly tedious.

Linus


Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR

2017-01-11 Thread Kirill A. Shutemov
On Wed, Jan 11, 2017 at 11:20:38AM -0800, Andy Lutomirski wrote:
> On Wed, Jan 11, 2017 at 10:49 AM, Dave Hansen  wrote:
> > On 01/11/2017 10:37 AM, Kirill A. Shutemov wrote:
> >>> How about preventing the max addr from being changed to too high a
> >>> value while MPX is on instead of overriding the set value?  This would
> >>> have the added benefit that it would prevent silent failures where you
> >>> think you've enabled large addresses but MPX is also on and mmap
> >>> refuses to return large addresses.
> >> Setting the rlimit high doesn't mean that you will necessarily get access
> >> to the full address space, even without MPX in the picture. TASK_SIZE
> >> limits the available address space too.
> >
> > OK, sure...  If you want to take another mechanism into account with
> > respect to MPX, we can do that.  We'd just need to change every
> > mechanism we want to support to ensure that it can't transition in ways
> > that break MPX.
> >
> > What are you arguing here, though?  That since we *might* be limited by
> > something else, we should not care about controlling the rlimit?
> >
> >> I think it's consistent with other resources in rlimit: setting RLIMIT_RSS
> >> to unlimited doesn't really mean you are not subject to other resource
> >> management.
> >
> > The farther we get into this, the more and more I think using an rlimit
> > is a horrible idea.  Its semantics aren't a great match, and you seem to
> > be resistant to making *this* rlimit differ from the others when there's
> > an entirely need to do so.  We're already being bitten by "legacy"
> > rlimit.  IOW, being consistent with *other* rlimit behavior buys us
> > nothing, only complexity.
> 
> Taking a step back, I think it would be fantastic if we could find a
> way to make this work without any inheritable settings at all.
> Perhaps we could have a per-mm value that is initialized to 2^47-1 on
> execve() and can be raised by ELF note or by prctl()?

One thing that inheritance gives us is the ability to change the available
address space from outside of the binary. Neither an ELF note nor prctl()
really works here.

Running a legacy binary with the full address space is a valuable option,
as is limiting the address space for a binary with an ELF note or prctl()
in case of breakage in the field.

Sure, we can use personality(2) or invent another interface for this. But to
me rlimit covers both normal and emergency use-cases relatively well.

-- 
 Kirill A. Shutemov
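
The "change it from outside of the binary" pattern argued for above is the
usual fork, setrlimit, exec sequence. The sketch below is a minimal
illustration of that pattern and nothing more: RLIMIT_VADDR was only a
proposal in this thread, so the resource is passed in as a parameter and
any existing resource (RLIMIT_AS, RLIMIT_CORE, ...) can stand in for it.
The function name run_with_soft_limit is made up for this example.

```c
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run argv[0] in a child whose soft limit for `resource` is set to `soft`
 * before exec. The parent's own limits are untouched, and the limit is
 * inherited by whatever the child executes.
 * Returns the child's exit status, or -1 on error.
 */
static int run_with_soft_limit(int resource, rlim_t soft, char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        struct rlimit rl;
        if (getrlimit(resource, &rl) != 0)
            _exit(126);
        rl.rlim_cur = soft;          /* hard limit stays as inherited */
        if (setrlimit(resource, &rl) != 0)
            _exit(126);
        execvp(argv[0], argv);
        _exit(127);                  /* only reached if exec failed */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

This is essentially what a setarch(1)-style launcher would do for an
unmodified binary; it is also exactly the inheritable behaviour that the
other participants in the thread object to.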


Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR

2017-01-11 Thread Linus Torvalds
On Wed, Jan 11, 2017 at 11:20 AM, Andy Lutomirski  wrote:
>
> Taking a step back, I think it would be fantastic if we could find a
> way to make this work without any inheritable settings at all.
> Perhaps we could have a per-mm value that is initialized to 2^47-1 on
> execve() and can be raised by ELF note or by prctl()?

I definitely think this is the right model. No inheritable settings,
no suid issues, no worries. Make people who want the large address
space (and there aren't going to be a lot of them) just mark their
binaries at compile time.

And as to the stack location: I think it should just be the same
regardless - up in "high" virtual memory in the 47-bit model. Because
as you say, if you actually end up having 57 bits of address space,
that still gives you basically the whole VM for data mappings -
they'll just be up above the stack.

Linus


Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR

2017-01-11 Thread Andy Lutomirski
On Wed, Jan 11, 2017 at 10:49 AM, Dave Hansen  wrote:
> On 01/11/2017 10:37 AM, Kirill A. Shutemov wrote:
>>> How about preventing the max addr from being changed to too high a
>>> value while MPX is on instead of overriding the set value?  This would
>>> have the added benefit that it would prevent silent failures where you
>>> think you've enabled large addresses but MPX is also on and mmap
>>> refuses to return large addresses.
>> Setting the rlimit high doesn't mean that you will necessarily get access
>> to the full address space, even without MPX in the picture. TASK_SIZE
>> limits the available address space too.
>
> OK, sure...  If you want to take another mechanism into account with
> respect to MPX, we can do that.  We'd just need to change every
> mechanism we want to support to ensure that it can't transition in ways
> that break MPX.
>
> What are you arguing here, though?  That since we *might* be limited by
> something else, we should not care about controlling the rlimit?
>
>> I think it's consistent with other resources in rlimit: setting RLIMIT_RSS
>> to unlimited doesn't really mean you are not subject to other resource
>> management.
>
> The farther we get into this, the more and more I think using an rlimit
> is a horrible idea.  Its semantics aren't a great match, and you seem to
> be resistant to making *this* rlimit differ from the others when there's
> an entirely need to do so.  We're already being bitten by "legacy"
> rlimit.  IOW, being consistent with *other* rlimit behavior buys us
> nothing, only complexity.

Taking a step back, I think it would be fantastic if we could find a
way to make this work without any inheritable settings at all.
Perhaps we could have a per-mm value that is initialized to 2^47-1 on
execve() and can be raised by ELF note or by prctl()?  Getting it
right for 32-bit would require a bit of thought.  The ELF note would
make a high stack possible and, without the ELF note, we'd get a low
stack but high mmap().  Then the messy bits can be glibc's problem and
a toolchain problem as it should be, given that the only reason we
need a limit at all is because of messy userspace code.

Sure, the low stack prevents the *whole* address space from being used
in one big block for databases, but 2^57 - 2^47 ought to be good
enough.

I'm not 100% sure this is workable but, if it is, it makes everyone's
life easier.  There's no need to muck around with setarch(1) or
similar hacks.
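
The non-inheritable model proposed above can be written down as a tiny
state machine. Everything in this sketch is hypothetical: struct mm_model
is not a kernel structure, the ELF note is modelled as a boolean, and the
rule that a prctl()-style call may only raise the value is an assumption
drawn from the wording "can be raised". It is meant only to make the
reset-on-exec behaviour concrete.

```c
#include <stdbool.h>
#include <stdint.h>

/* 2^47-1, the default ceiling named in the proposal. */
#define LEGACY_MAX_ADDR ((UINT64_C(1) << 47) - 1)

/* Hypothetical stand-in for the per-mm state; not a kernel structure. */
struct mm_model {
    uint64_t max_addr;
};

/*
 * exec always starts from the legacy ceiling. An ELF note (modelled as a
 * flag) opts the new image in to the hardware ceiling. Nothing is carried
 * over from the previous image.
 */
static void mm_model_exec(struct mm_model *mm, bool elf_note_wide,
                          uint64_t hw_max)
{
    mm->max_addr = elf_note_wide ? hw_max : LEGACY_MAX_ADDR;
}

/*
 * prctl()-style request (assumed semantics): may only raise the ceiling,
 * and never beyond what the hardware supports.
 * Returns 0 on success, -1 if the request is refused.
 */
static int mm_model_raise(struct mm_model *mm, uint64_t new_max,
                          uint64_t hw_max)
{
    if (new_max < mm->max_addr || new_max > hw_max)
        return -1;
    mm->max_addr = new_max;
    return 0;
}
```

Because mm_model_exec() ignores the old value, a process that raised its
ceiling cannot pass the wider address space on to a program it executes,
which is what removes the suid and inheritance concerns.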


Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR

2017-01-11 Thread Dave Hansen
On 01/11/2017 10:37 AM, Kirill A. Shutemov wrote:
>> How about preventing the max addr from being changed to too high a
>> value while MPX is on instead of overriding the set value?  This would
>> have the added benefit that it would prevent silent failures where you
>> think you've enabled large addresses but MPX is also on and mmap
>> refuses to return large addresses.
> Setting the rlimit high doesn't mean that you will necessarily get access
> to the full address space, even without MPX in the picture. TASK_SIZE
> limits the available address space too.

OK, sure...  If you want to take another mechanism into account with
respect to MPX, we can do that.  We'd just need to change every
mechanism we want to support to ensure that it can't transition in ways
that break MPX.

What are you arguing here, though?  That since we *might* be limited by
something else, we should not care about controlling the rlimit?

> I think it's consistent with other resources in rlimit: setting RLIMIT_RSS
> to unlimited doesn't really mean you are not subject to other resource
> management.

The farther we get into this, the more and more I think using an rlimit
is a horrible idea.  Its semantics aren't a great match, and you seem to
be resistant to making *this* rlimit differ from the others when there's
an entirely need to do so.  We're already being bitten by "legacy"
rlimit.  IOW, being consistent with *other* rlimit behavior buys us
nothing, only complexity.


Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR

2017-01-11 Thread Kirill A. Shutemov
On Wed, Jan 11, 2017 at 10:09:17AM -0800, Andy Lutomirski wrote:
> On Wed, Jan 11, 2017 at 6:29 AM, Kirill A. Shutemov wrote:
> > On Thu, Jan 05, 2017 at 12:49:44PM -0800, Dave Hansen wrote:
> >> On 01/05/2017 12:14 PM, Andy Lutomirski wrote:
> >> >> I'm not sure I'm comfortable with this.  Do other rlimit changes cause
> >> >> silent data corruption?  I'm pretty sure doing this to MPX would.
> >> >>
> >> > What actually goes wrong in this case?  That is, what combination of
> >> > MPX setup of subsequent allocations will cause a problem, and is the
> >> > problem worse than just a segfault?  IMO it would be really nice to
> >> > keep the messy case confined to MPX.
> >>
> >> The MPX bounds tables are indexed by virtual address.  They need to grow
> >> if the virtual address space grows.   There's an MSR that controls
> >> whether we use the 48-bit or 57-bit layout.  It basically decides
> >> whether we need a 2GB (48-bit) or 1TB (57-bit) bounds directory.
> >>
> >> The question is what we do with legacy MPX applications.  We obviously
> >> can't let them just allocate a 2GB table and then go let the hardware
> >> pretend it's 1TB in size.  We also can't hand the hardware using a 2GB
> >> table an address >48-bits.
> >>
> >> Ideally, I'd like to make sure that legacy MPX can't be enabled if this
> >> RLIMIT is set over 48-bits (really 47).  I'd also like to make sure that,
> >> if legacy MPX is active, the RLIMIT can't be raised, because all hell
> >> will break loose when the new addresses show up.
> >
> > I think we can do this. See the patch below.
> >
> > Basically, we refuse to enable MPX and issue a warning in dmesg if there's
> > anything mapped above 47-bits. Once MPX is enabled, mmap_max_addr() cannot
> > be higher than 47-bits either.
> >
> > The function call from mmap_max_addr() is unfortunate, but I don't see a
> > way around it.
> 
> How about preventing the max addr from being changed to too high a
> value while MPX is on instead of overriding the set value?  This would
> have the added benefit that it would prevent silent failures where you
> think you've enabled large addresses but MPX is also on and mmap
> refuses to return large addresses.

Setting the rlimit high doesn't mean that you will necessarily get access
to the full address space, even without MPX in the picture. TASK_SIZE
limits the available address space too.

I think it's consistent with other resources in rlimit: setting RLIMIT_RSS
to unlimited doesn't really mean you are not subject to other resource
management.

-- 
 Kirill A. Shutemov
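
The 2GB and 1TB bounds-directory sizes quoted above follow from simple
arithmetic. The sketch below assumes, as background not stated in this
thread, that the directory is indexed by the virtual address bits from
bit 20 upward and that each entry is 8 bytes; under those assumptions the
function reproduces both quoted figures. The names are made up for the
example.

```c
#include <stdint.h>

#define BD_INDEX_LOW_BIT 20   /* assumption: directory index starts at bit 20 */
#define BD_ENTRY_BYTES   8    /* assumption: one 64-bit entry per slot */

/* Size in bytes of a bounds directory covering `va_bits` of address space. */
static uint64_t mpx_bounds_dir_bytes(unsigned int va_bits)
{
    return (UINT64_C(1) << (va_bits - BD_INDEX_LOW_BIT)) * BD_ENTRY_BYTES;
}
```

With 48 bits this gives 2^28 entries, or 2^31 bytes (2GB); with 57 bits it
gives 2^37 entries, or 2^40 bytes (1TB). A directory sized for the first
layout cannot be indexed safely by addresses from the second, which is
why the thread treats the two states as mutually exclusive.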


Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR

2017-01-11 Thread Dave Hansen
On 01/11/2017 06:29 AM, Kirill A. Shutemov wrote:
> +#define mmap_max_addr() \
> +({ \
> +	unsigned long max_addr = min(TASK_SIZE, rlimit(RLIMIT_VADDR)); \
> +	/* At the moment, MPX cannot handle addresses above 47-bits */ \
> +	if (max_addr > USER_VADDR_LIM && \
> +			kernel_managing_mpx_tables(current->mm)) \
> +		max_addr = USER_VADDR_LIM; \
> +	max_addr; \
> +})

The bad part about this is that it adds code to a relatively fast path,
and the check that it's doing will not change its result for basically
the entire life of the process.

I'd much rather see this checking done at the point that MPX is enabled
and at the point the limit is changed.  Those are both super-rare paths.

>  extern u16 amd_get_nb_id(int cpu);
>  extern u32 amd_get_nodes_per_socket(void);
>  
> diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
> index 324e5713d386..04fa386a165a 100644
> --- a/arch/x86/mm/mpx.c
> +++ b/arch/x86/mm/mpx.c
> @@ -354,10 +354,22 @@ int mpx_enable_management(void)
> 	 */
> 	bd_base = mpx_get_bounds_dir();
> 	down_write(&mm->mmap_sem);
> +
> +	/*
> +	 * MPX doesn't support addresses above 47-bits yet.
> +	 * Make sure nothing is mapped there before enabling.
> +	 */
> +	if (find_vma(mm, 1UL << 47)) {
> +		pr_warn("%s (%d): MPX cannot handle addresses above 47-bits. "
> +				"Disabling.", current->comm, current->pid);
> +		ret = -ENXIO;
> +		goto out;
> +	}

I don't think allowing userspace to spam unlimited amounts of messages
into the kernel log is a good idea. :)  But a WARN_ONCE() might not kill
any puppies.
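
For illustration, the warn-once pattern looks roughly like this in plain userspace C. This is only an analogue with hypothetical names: the kernel macro keeps its flag per call site, and this version is not thread-safe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: print the message the first time only.
 * Returns true if the message was actually printed.
 */
static bool sketch_warn_once(bool *already_warned, const char *msg)
{
	assert(already_warned != NULL);
	if (*already_warned)
		return false;	/* later calls stay silent */
	*already_warned = true;
	fprintf(stderr, "warning: %s\n", msg);
	return true;
}
```

However many times userspace retries the failing operation, the log gets a single line.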



Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR

2017-01-11 Thread Andy Lutomirski
On Wed, Jan 11, 2017 at 6:29 AM, Kirill A. Shutemov
 wrote:
> On Thu, Jan 05, 2017 at 12:49:44PM -0800, Dave Hansen wrote:
>> On 01/05/2017 12:14 PM, Andy Lutomirski wrote:
>> >> I'm not sure I'm comfortable with this.  Do other rlimit changes cause
>> >> silent data corruption?  I'm pretty sure doing this to MPX would.
>> >>
>> > What actually goes wrong in this case?  That is, what combination of
>> > MPX setup of subsequent allocations will cause a problem, and is the
>> > problem worse than just a segfault?  IMO it would be really nice to
>> > keep the messy case confined to MPX.
>>
>> The MPX bounds tables are indexed by virtual address.  They need to grow
>> if the virtual address space grows.   There's an MSR that controls
>> whether we use the 48-bit or 57-bit layout.  It basically decides
>> whether we need a 2GB (48-bit) or 1TB (57-bit) bounds directory.
>>
>> The question is what we do with legacy MPX applications.  We obviously
>> can't let them just allocate a 2GB table and then go let the hardware
>> pretend it's 1TB in size.  We also can't hand the hardware using a 2GB
>> table an address >48-bits.
>>
>> Ideally, I'd like to make sure that legacy MPX can't be enabled if this
>> RLIMIT is set over 48-bits (really 47).  I'd also like to make sure that
>> legacy MPX is active, that the RLIMIT can't be raised because all hell
>> will break loose when the new addresses show up.
>
> I think we can do this. See the patch below.
>
> Basically, we refuse to enable MPX and issue warning in dmesg if there's
> anything mapped above 47-bits. Once MPX is enabled, mmap_max_addr() cannot
> be higher than 47-bits too.
>
> Function call from mmap_max_addr() is unfortunate, but I don't see a
> way around.

How about preventing the max addr from being changed to too high a
value while MPX is on instead of overriding the set value?  This would
have the added benefit that it would prevent silent failures where you
think you've enabled large addresses but MPX is also on and mmap
refuses to return large addresses.
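
Something along these lines, as a userspace sketch with hypothetical names. The -ENXIO on enable mirrors the patch; the -EINVAL on the setter is just an assumption for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative 47-bit ceiling; stands in for USER_VADDR_LIM. */
#define SKETCH_VADDR_LIM_47 ((UINT64_C(1) << 47) - 4096)

/* Hypothetical per-task state. */
struct sketch_task {
	uint64_t vaddr_limit;
	bool mpx_enabled;
};

/* Reject a too-high limit outright instead of silently clamping it. */
static int sketch_set_vaddr_limit(struct sketch_task *t, uint64_t new_limit)
{
	assert(t != NULL);
	if (t->mpx_enabled && new_limit > SKETCH_VADDR_LIM_47)
		return -EINVAL;	/* assumed error code */
	t->vaddr_limit = new_limit;
	return 0;
}

/* Refuse to enable MPX while the limit already allows high addresses. */
static int sketch_enable_mpx(struct sketch_task *t)
{
	assert(t != NULL);
	if (t->vaddr_limit > SKETCH_VADDR_LIM_47)
		return -ENXIO;
	t->mpx_enabled = true;
	return 0;
}
```

The caller gets an explicit error at the point of the request, so there is no state in which the limit reads as large while mmap quietly stays below 47 bits.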


Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR

2017-01-11 Thread Kirill A. Shutemov
On Thu, Jan 05, 2017 at 12:49:44PM -0800, Dave Hansen wrote:
> On 01/05/2017 12:14 PM, Andy Lutomirski wrote:
> >> I'm not sure I'm comfortable with this.  Do other rlimit changes cause
> >> silent data corruption?  I'm pretty sure doing this to MPX would.
> >>
> > What actually goes wrong in this case?  That is, what combination of
> > MPX setup of subsequent allocations will cause a problem, and is the
> > problem worse than just a segfault?  IMO it would be really nice to
> > keep the messy case confined to MPX.
> 
> The MPX bounds tables are indexed by virtual address.  They need to grow
> if the virtual address space grows.   There's an MSR that controls
> whether we use the 48-bit or 57-bit layout.  It basically decides
> whether we need a 2GB (48-bit) or 1TB (57-bit) bounds directory.
> 
> The question is what we do with legacy MPX applications.  We obviously
> can't let them just allocate a 2GB table and then go let the hardware
> pretend it's 1TB in size.  We also can't hand the hardware using a 2GB
> table an address >48-bits.
> 
> Ideally, I'd like to make sure that legacy MPX can't be enabled if this
> RLIMIT is set over 48-bits (really 47).  I'd also like to make sure that
> legacy MPX is active, that the RLIMIT can't be raised because all hell
> will break loose when the new addresses show up.

I think we can do this. See the patch below.

Basically, we refuse to enable MPX and issue a warning in dmesg if there's
anything mapped above 47 bits. Once MPX is enabled, mmap_max_addr() cannot
be higher than 47 bits either.

The function call from mmap_max_addr() is unfortunate, but I don't see a
way around it.

As we add support for MAWA it will get somewhat more complex, but the general
idea should be the same.

Build-tested only.

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 07cc4f27ca41..f97b149145f8 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1742,7 +1742,6 @@ config X86_SMAP
 config X86_INTEL_MPX
prompt "Intel MPX (Memory Protection Extensions)"
def_bool n
-   depends on !X86_5LEVEL
depends on CPU_SUP_INTEL
---help---
  MPX provides hardware features that can be used in
diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
index 0b416d4cf73b..ba9005f9bf87 100644
--- a/arch/x86/include/asm/mpx.h
+++ b/arch/x86/include/asm/mpx.h
@@ -56,11 +56,8 @@
 
 #ifdef CONFIG_X86_INTEL_MPX
 siginfo_t *mpx_generate_siginfo(struct pt_regs *regs);
+int kernel_managing_mpx_tables(struct mm_struct *mm);
 int mpx_handle_bd_fault(void);
-static inline int kernel_managing_mpx_tables(struct mm_struct *mm)
-{
-   return (mm->context.bd_addr != MPX_INVALID_BOUNDS_DIR);
-}
 static inline void mpx_mm_init(struct mm_struct *mm)
 {
/*
@@ -80,10 +77,6 @@ static inline int mpx_handle_bd_fault(void)
 {
return -EINVAL;
 }
-static inline int kernel_managing_mpx_tables(struct mm_struct *mm)
-{
-   return 0;
-}
 static inline void mpx_mm_init(struct mm_struct *mm)
 {
 }
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index e02917126859..589610a4f099 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -869,6 +869,7 @@ extern int set_tsc_mode(unsigned int val);
 #ifdef CONFIG_X86_INTEL_MPX
 extern int mpx_enable_management(void);
 extern int mpx_disable_management(void);
+extern int kernel_managing_mpx_tables(struct mm_struct *mm);
 #else
 static inline int mpx_enable_management(void)
 {
@@ -878,8 +879,22 @@ static inline int mpx_disable_management(void)
 {
return -EINVAL;
 }
+static inline int kernel_managing_mpx_tables(struct mm_struct *mm)
+{
+   return 0;
+}
 #endif /* CONFIG_X86_INTEL_MPX */
 
+#define mmap_max_addr() \
+({ \
+   unsigned long max_addr = min(TASK_SIZE, rlimit(RLIMIT_VADDR));  \
+   /* At the moment, MPX cannot handle addresses above 47-bits */  \
+   if (max_addr > USER_VADDR_LIM &&\
+   kernel_managing_mpx_tables(current->mm))\
+   max_addr = USER_VADDR_LIM;  \
+   max_addr;   \
+})
+
 extern u16 amd_get_nb_id(int cpu);
 extern u32 amd_get_nodes_per_socket(void);
 
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index 324e5713d386..04fa386a165a 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -354,10 +354,22 @@ int mpx_enable_management(void)
 */
bd_base = mpx_get_bounds_dir();
down_write(&mm->mmap_sem);
+
+   /*
+* MPX doesn't support addresses above 47-bits yet.
+* Make sure nothing is mapped there before enabling.
+*/
+   if (find_vma(mm, 1UL << 47)) {
+   pr_warn("%s (%d): MPX cannot handle addresses above 47-bits. "
+   "Disabling.", current->comm, current->pid);
+   ret = -ENXIO;
+ 

Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR

2017-01-05 Thread Dave Hansen
On 01/05/2017 01:27 PM, Andy Lutomirski wrote:
> On Thu, Jan 5, 2017 at 12:49 PM, Dave Hansen  wrote:
...
>> Remember, we already have (legacy MPX) binaries in the wild that have no
>> knowledge of this stuff.  So, we can implicitly have the kernel bump
>> this rlimit around, but we can't expect userspace to do it, ever.
> 
> If you s/rlimit/prctl, then I think this all makes sense with one
> exception.  It would be a bit sad if the personality-setting tool
> didn't work if compiled with MPX.

Ahh, because if you have MPX enabled you *can't* sanely switch between
the two modes because you suddenly go from having small bounds tables to
having big ones?

It's not the simplest thing in the world to do, but there's nothing
keeping the personality-setting tool from doing all the work.  It can do:

new_bd = malloc(1TB);
prctl(MPX_DISABLE_MANAGEMENT);
memcpy(new_bd, old_bd, LEGACY_MPX_BD_SIZE);
set_bounds_config(new_bd | ENABLE_BIT);
prctl(WIDER_VADDR_WIDTH);
prctl(MPX_ENABLE_MANAGEMENT);


> So what if we had a second prctl field that is the value that kicks in
> after execve()?

Yeah, that's a pretty sane way to do it too.  execve() is a nice chokepoint.


Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR

2017-01-05 Thread Andy Lutomirski
On Thu, Jan 5, 2017 at 12:49 PM, Dave Hansen  wrote:
> On 01/05/2017 12:14 PM, Andy Lutomirski wrote:
>>> I'm not sure I'm comfortable with this.  Do other rlimit changes cause
>>> silent data corruption?  I'm pretty sure doing this to MPX would.
>>>
>> What actually goes wrong in this case?  That is, what combination of
>> MPX setup of subsequent allocations will cause a problem, and is the
>> problem worse than just a segfault?  IMO it would be really nice to
>> keep the messy case confined to MPX.
>
> The MPX bounds tables are indexed by virtual address.  They need to grow
> if the virtual address space grows.   There's an MSR that controls
> whether we use the 48-bit or 57-bit layout.  It basically decides
> whether we need a 2GB (48-bit) or 1TB (57-bit) bounds directory.
>
> The question is what we do with legacy MPX applications.  We obviously
> can't let them just allocate a 2GB table and then go let the hardware
> pretend it's 1TB in size.  We also can't hand the hardware using a 2GB
> table an address >48-bits.
>
> Ideally, I'd like to make sure that legacy MPX can't be enabled if this
> RLIMIT is set over 48-bits (really 47).  I'd also like to make sure that
> legacy MPX is active, that the RLIMIT can't be raised because all hell
> will break loose when the new addresses show up.
>
> Remember, we already have (legacy MPX) binaries in the wild that have no
> knowledge of this stuff.  So, we can implicitly have the kernel bump
> this rlimit around, but we can't expect userspace to do it, ever.

If you s/rlimit/prctl, then I think this all makes sense with one
exception.  It would be a bit sad if the personality-setting tool
didn't work if compiled with MPX.

So what if we had a second prctl field that is the value that kicks in
after execve()?

--Andy


Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR

2017-01-05 Thread Dave Hansen
On 01/05/2017 12:14 PM, Andy Lutomirski wrote:
>> I'm not sure I'm comfortable with this.  Do other rlimit changes cause
>> silent data corruption?  I'm pretty sure doing this to MPX would.
>>
> What actually goes wrong in this case?  That is, what combination of
> MPX setup of subsequent allocations will cause a problem, and is the
> problem worse than just a segfault?  IMO it would be really nice to
> keep the messy case confined to MPX.

The MPX bounds tables are indexed by virtual address.  They need to grow
if the virtual address space grows.   There's an MSR that controls
whether we use the 48-bit or 57-bit layout.  It basically decides
whether we need a 2GB (48-bit) or 1TB (57-bit) bounds directory.
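
The 2GB and 1TB figures can be reproduced with a little arithmetic. This sketch assumes 8-byte directory entries and a directory index that starts at address bit 20, which is my reading of the Intel documentation; the helper name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_BD_ENTRY_BYTES	8u	/* assumption: 64-bit directory entries */
#define SKETCH_BD_LOW_BIT	20u	/* assumption: index starts at address bit 20 */

/*
 * Size of a bounds directory that covers vaddr_bits of linear address:
 * one entry per value of the address bits [vaddr_bits-1 : 20].
 */
static uint64_t sketch_bd_size_bytes(unsigned int vaddr_bits)
{
	assert(vaddr_bits > SKETCH_BD_LOW_BIT && vaddr_bits <= 64);
	return (UINT64_C(1) << (vaddr_bits - SKETCH_BD_LOW_BIT)) * SKETCH_BD_ENTRY_BYTES;
}
```

For 48 bits that is 2^28 entries * 8 bytes = 2^31 bytes (2GB); for 57 bits it is 2^37 entries * 8 bytes = 2^40 bytes (1TB), i.e. 512 times larger.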

The question is what we do with legacy MPX applications.  We obviously
can't let them just allocate a 2GB table and then go let the hardware
pretend it's 1TB in size.  We also can't hand the hardware using a 2GB
table an address >48-bits.

Ideally, I'd like to make sure that legacy MPX can't be enabled if this
RLIMIT is set over 48-bits (really 47).  I'd also like to make sure that,
if legacy MPX is active, the RLIMIT can't be raised, because all hell
will break loose when the new addresses show up.

Remember, we already have (legacy MPX) binaries in the wild that have no
knowledge of this stuff.  So, we can implicitly have the kernel bump
this rlimit around, but we can't expect userspace to do it, ever.


Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR

2017-01-05 Thread Andy Lutomirski
On Thu, Jan 5, 2017 at 11:39 AM, Dave Hansen  wrote:
> On 01/05/2017 11:29 AM, Kirill A. Shutemov wrote:
>> On Thu, Jan 05, 2017 at 11:13:57AM -0800, Dave Hansen wrote:
>>> On 12/26/2016 05:54 PM, Kirill A. Shutemov wrote:
>>>> MM would use min(RLIMIT_VADDR, TASK_SIZE) as upper limit of virtual
>>>> address available to map by userspace.
>>>
>>> What happens to existing mappings above the limit when this upper limit
>>> is dropped?
>>
>> Nothing: we only prevent creating new mappings. All existing are not
>> affected.
>>
>> The semantics here the same as with other resource limits.
>>
>>> Similarly, why do we do with an application running with something
>>> incompatible with the larger address space that tries to raise the
>>> limit?  Say, legacy MPX.
>>
>> It has to know what it does. Yes, it can change limit to the point where
>> application is unusable. But you can to the same with other limits.
>
> I'm not sure I'm comfortable with this.  Do other rlimit changes cause
> silent data corruption?  I'm pretty sure doing this to MPX would.
>

What actually goes wrong in this case?  That is, what combination of
MPX setup and subsequent allocations will cause a problem, and is the
problem worse than just a segfault?  IMO it would be really nice to
keep the messy case confined to MPX.

FWIW, this problem is kind of generic.  If you run code in a process,
MPX or otherwise, that assumes something about pointer values and then
create a pointer that violates its assumptions, you will cause
problems.  For example, some VMs use high bits to store metadata.  If
you feed a pointer that's too big to such code, boom.  This is exactly
why high addresses need to be opt-in.
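
A small illustration of the failure mode, assuming a hypothetical runtime that packs a 16-bit tag into bits 48..63 of a 64-bit word. The layout and names are made up for the example; real VMs differ in the details.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: low 48 bits hold the address, high 16 bits a tag. */
#define SKETCH_TAG_SHIFT 48u
#define SKETCH_ADDR_MASK ((UINT64_C(1) << SKETCH_TAG_SHIFT) - 1)

static uint64_t sketch_tag(uint64_t addr, uint16_t tag)
{
	/* Address bits at or above bit 48 are silently discarded here. */
	return (addr & SKETCH_ADDR_MASK) | ((uint64_t)tag << SKETCH_TAG_SHIFT);
}

static uint64_t sketch_untag(uint64_t value)
{
	return value & SKETCH_ADDR_MASK;
}

/* Non-zero if the address survives a tag/untag round trip. */
static int sketch_roundtrips(uint64_t addr, uint16_t tag)
{
	return sketch_untag(sketch_tag(addr, tag)) == addr;
}
```

An address below 2^47 round-trips; an address with bit 52 set comes back as a different pointer with no error reported at the point of tagging.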


Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR

2017-01-05 Thread Kirill A. Shutemov
On Thu, Jan 05, 2017 at 11:39:16AM -0800, Dave Hansen wrote:
> On 01/05/2017 11:29 AM, Kirill A. Shutemov wrote:
> > On Thu, Jan 05, 2017 at 11:13:57AM -0800, Dave Hansen wrote:
> >> On 12/26/2016 05:54 PM, Kirill A. Shutemov wrote:
> >>> MM would use min(RLIMIT_VADDR, TASK_SIZE) as upper limit of virtual
> >>> address available to map by userspace.
> >>
> >> What happens to existing mappings above the limit when this upper limit
> >> is dropped?
> > 
> > Nothing: we only prevent creating new mappings. All existing are not
> > affected.
> > 
> > The semantics here the same as with other resource limits.
> > 
> >> Similarly, why do we do with an application running with something
> >> incompatible with the larger address space that tries to raise the
> >> limit?  Say, legacy MPX.
> > 
> > It has to know what it does. Yes, it can change limit to the point where
> > application is unusable. But you can to the same with other limits.
> 
> I'm not sure I'm comfortable with this.  Do other rlimit changes cause
> silent data corruption?  I'm pretty sure doing this to MPX would.

Maybe it's too ugly, but MPX can set rlim_max to rlim_cur on enabling.
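
Roughly, in terms of the userspace struct rlimit. The helper names are hypothetical, and this ignores that a privileged process (CAP_SYS_RESOURCE) can raise the hard limit again.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>

/* Pin the hard limit to the current soft limit. */
static void sketch_pin_limit(struct rlimit *rl)
{
	assert(rl != NULL);
	rl->rlim_max = rl->rlim_cur;
}

/* An unprivileged process may only move the soft limit up to the hard limit. */
static int sketch_can_set_soft(const struct rlimit *rl, rlim_t want)
{
	assert(rl != NULL);
	return want <= rl->rlim_max;
}
```

After pinning, lowering the soft limit still works, but any attempt to raise it above the value it had when MPX was enabled fails.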

-- 
 Kirill A. Shutemov


Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR

2017-01-05 Thread Dave Hansen
On 01/05/2017 11:29 AM, Kirill A. Shutemov wrote:
> On Thu, Jan 05, 2017 at 11:13:57AM -0800, Dave Hansen wrote:
>> On 12/26/2016 05:54 PM, Kirill A. Shutemov wrote:
>>> MM would use min(RLIMIT_VADDR, TASK_SIZE) as upper limit of virtual
>>> address available to map by userspace.
>>
>> What happens to existing mappings above the limit when this upper limit
>> is dropped?
> 
> Nothing: we only prevent creating new mappings. All existing are not
> affected.
> 
> The semantics here the same as with other resource limits.
> 
>> Similarly, why do we do with an application running with something
>> incompatible with the larger address space that tries to raise the
>> limit?  Say, legacy MPX.
> 
> It has to know what it does. Yes, it can change limit to the point where
> application is unusable. But you can to the same with other limits.

I'm not sure I'm comfortable with this.  Do other rlimit changes cause
silent data corruption?  I'm pretty sure doing this to MPX would.



Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR

2017-01-05 Thread Kirill A. Shutemov
On Thu, Jan 05, 2017 at 11:13:57AM -0800, Dave Hansen wrote:
> On 12/26/2016 05:54 PM, Kirill A. Shutemov wrote:
> > MM would use min(RLIMIT_VADDR, TASK_SIZE) as upper limit of virtual
> > address available to map by userspace.
> 
> What happens to existing mappings above the limit when this upper limit
> is dropped?

Nothing: we only prevent creating new mappings. Existing mappings are
not affected.

The semantics here are the same as with other resource limits.

> Similarly, what do we do with an application running with something
> incompatible with the larger address space that tries to raise the
> limit?  Say, legacy MPX.

It has to know what it is doing. Yes, it can change the limit to the
point where the application is unusable. But you can do the same with
other limits.

-- 
 Kirill A. Shutemov
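
The semantics described above (the upper bound is min(RLIMIT_VADDR,
TASK_SIZE), and only new mappings are checked against it) can be sketched
as below. All names and the SKETCH_ constants are hypothetical; the 47-bit
and 56-bit values follow the patch description, and a 4 KiB page size is
assumed.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE     4096ULL
#define SKETCH_TASK_SIZE_4L  ((1ULL << 47) - SKETCH_PAGE_SIZE)  /* 4-level paging */
#define SKETCH_TASK_SIZE_5L  ((1ULL << 56) - SKETCH_PAGE_SIZE)  /* 5-level paging */
#define SKETCH_RLIM_INFINITY UINT64_MAX

/* Upper bound for new mappings: min(RLIMIT_VADDR, TASK_SIZE). */
uint64_t vaddr_ceiling(uint64_t rlim_cur, uint64_t task_size)
{
    return rlim_cur < task_size ? rlim_cur : task_size;
}

/*
 * Check a request for a NEW mapping [addr, addr + len).
 * Existing mappings are never revisited when the limit changes.
 */
bool new_mapping_allowed(uint64_t addr, uint64_t len, uint64_t ceiling)
{
    return len <= ceiling && addr <= ceiling - len;
}
```

Lowering the limit therefore only changes the result of later
new_mapping_allowed() calls; nothing is unmapped.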


Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR

2017-01-05 Thread Dave Hansen
On 12/26/2016 05:54 PM, Kirill A. Shutemov wrote:
> MM would use min(RLIMIT_VADDR, TASK_SIZE) as upper limit of virtual
> address available to map by userspace.

What happens to existing mappings above the limit when this upper limit
is dropped?

Similarly, what do we do with an application running with something
incompatible with the larger address space that tries to raise the
limit?  Say, legacy MPX.


Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR

2017-01-05 Thread Andy Lutomirski
On Wed, Jan 4, 2017 at 6:19 AM, Kirill A. Shutemov  wrote:
> On Tue, Jan 03, 2017 at 10:27:22AM -0800, Andy Lutomirski wrote:
>> On Tue, Jan 3, 2017 at 8:04 AM, Kirill A. Shutemov wrote:
>> > And what about stack? I'm not sure that everybody would be happy with
>> > stack in the middle of address space.
>>
>> I would, personally.  I think that, for very large address spaces, we
>> should allocate a large block of stack and get rid of the "stack grows
>> down forever" legacy idea.  Then we would never need to worry about
>> the stack eventually hitting some other allocation.  And 2^57 bytes is
>> hilariously large for a default stack.
>
> A stack in the middle of the address space can prevent creating other
> huuuge contiguous mappings. Databases may want this.

Fair enough.  OTOH, 2^47 is nowhere near the middle if we were to put
it near the top of the legacy address space.
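
To put a number on "nowhere near the middle": the helper below (a
hypothetical name, purely illustrative) computes what fraction of a
2^space_bits address space lies below 2^addr_bits. It assumes
addr_bits <= space_bits and a difference below 64.

```c
/* Fraction of a 2^space_bits address space that lies below 2^addr_bits. */
double fraction_below(unsigned addr_bits, unsigned space_bits)
{
    return 1.0 / (double)(1ULL << (space_bits - addr_bits));
}
```

For a 56-bit user address space, fraction_below(47, 56) is 1/512, so a
stack placed just under 2^47 sits in roughly the lowest 0.2% of the range.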


Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR

2017-01-04 Thread Kirill A. Shutemov
On Tue, Jan 03, 2017 at 10:27:22AM -0800, Andy Lutomirski wrote:
> On Tue, Jan 3, 2017 at 8:04 AM, Kirill A. Shutemov wrote:
> > And what about stack? I'm not sure that everybody would be happy with
> > stack in the middle of address space.
> 
> I would, personally.  I think that, for very large address spaces, we
> should allocate a large block of stack and get rid of the "stack grows
> down forever" legacy idea.  Then we would never need to worry about
> the stack eventually hitting some other allocation.  And 2^57 bytes is
> hilariously large for a default stack.

A stack in the middle of the address space can prevent creating other
huuuge contiguous mappings. Databases may want this.

-- 
 Kirill A. Shutemov


Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR

2017-01-04 Thread Arnd Bergmann
On Tuesday, January 3, 2017 2:09:16 PM CET Andy Lutomirski wrote:
> >
> >> When
> >> ADDR_LIMIT_EXPLICIT is in effect, prctl can set a 64-bit numeric
> >> limit.  If ADDR_LIMIT_EXPLICIT is cleared, the prctl value stops being
> >> settable and reading it via prctl returns whatever is implied by the
> >> other personality bits.
> >
> > I don't see anything wrong with it, but I'm a bit confused now
> > what this would be good for, compared to using just prctl.
> >
> > Is this about setuid clearing the personality but not the prctl,
> > or something else?
> 
> It's to avoid ambiguity as to what happens if you set ADDR_LIMIT_32BIT
> and use the prctl.  ISTM it would be nice for the semantics to be
> fully defined in all cases.
> 

Ok, got it.

Arnd


Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR

2017-01-03 Thread Arnd Bergmann
On Tuesday, January 3, 2017 10:29:33 AM CET Andy Lutomirski wrote:
> 
> Hmm.  What if we approached this a bit differently?  We could add a
> single new personality bit ADDR_LIMIT_EXPLICIT.  Setting this bit
> would cause PER_LINUX32_3GB etc. to be automatically cleared.

Both the ADDR_LIMIT_32BIT and ADDR_LIMIT_3GB flags I guess?

> When
> ADDR_LIMIT_EXPLICIT is in effect, prctl can set a 64-bit numeric
> limit.  If ADDR_LIMIT_EXPLICIT is cleared, the prctl value stops being
> settable and reading it via prctl returns whatever is implied by the
> other personality bits.

I don't see anything wrong with it, but I'm a bit confused now
what this would be good for, compared to using just prctl.

Is this about setuid clearing the personality but not the prctl,
or something else?

Arnd


Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR

2017-01-03 Thread Andy Lutomirski
On Tue, Jan 3, 2017 at 2:07 PM, Arnd Bergmann  wrote:
> On Tuesday, January 3, 2017 10:29:33 AM CET Andy Lutomirski wrote:
>>
>> Hmm.  What if we approached this a bit differently?  We could add a
>> single new personality bit ADDR_LIMIT_EXPLICIT.  Setting this bit
>> would cause PER_LINUX32_3GB etc. to be automatically cleared.
>
> Both the ADDR_LIMIT_32BIT and ADDR_LIMIT_3GB flags I guess?

Yes.

>
>> When
>> ADDR_LIMIT_EXPLICIT is in effect, prctl can set a 64-bit numeric
>> limit.  If ADDR_LIMIT_EXPLICIT is cleared, the prctl value stops being
>> settable and reading it via prctl returns whatever is implied by the
>> other personality bits.
>
> I don't see anything wrong with it, but I'm a bit confused now
> what this would be good for, compared to using just prctl.
>
> Is this about setuid clearing the personality but not the prctl,
> or something else?

It's to avoid ambiguity as to what happens if you set ADDR_LIMIT_32BIT
and use the prctl.  ISTM it would be nice for the semantics to be
fully defined in all cases.

--Andy
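
One way to read the ADDR_LIMIT_EXPLICIT proposal so that every combination
is defined is the small state model below. It is a user-space sketch only.
The SK_ names are hypothetical, the two legacy values mirror
ADDR_LIMIT_32BIT and ADDR_LIMIT_3GB in <linux/personality.h>, and the
value chosen for the new flag, the 47-bit default, and the initial explicit
limit are all assumptions that the thread does not specify.

```c
#include <stdint.h>

#define SK_ADDR_LIMIT_32BIT    0x0800000u   /* mirrors ADDR_LIMIT_32BIT */
#define SK_ADDR_LIMIT_3GB      0x8000000u   /* mirrors ADDR_LIMIT_3GB */
#define SK_ADDR_LIMIT_EXPLICIT 0x10000000u  /* hypothetical value */
#define SK_DEFAULT_LIMIT       ((uint64_t)1 << 47)  /* assumed default */

struct addr_state {
    unsigned int personality;
    uint64_t explicit_limit;  /* meaningful only while EXPLICIT is set */
};

/* Limit implied by the legacy personality bits. */
static uint64_t sk_implied_limit(unsigned int persona)
{
    if (persona & SK_ADDR_LIMIT_3GB)
        return (uint64_t)3 << 30;
    if (persona & SK_ADDR_LIMIT_32BIT)
        return (uint64_t)1 << 32;
    return SK_DEFAULT_LIMIT;
}

/* Setting EXPLICIT automatically clears the legacy limit bits. */
void sk_set_personality(struct addr_state *s, unsigned int persona)
{
    if (persona & SK_ADDR_LIMIT_EXPLICIT) {
        /* Assumption: a newly explicit limit starts at the implied value. */
        if (!(s->personality & SK_ADDR_LIMIT_EXPLICIT))
            s->explicit_limit = sk_implied_limit(s->personality);
        persona &= ~(SK_ADDR_LIMIT_32BIT | SK_ADDR_LIMIT_3GB);
    }
    s->personality = persona;
}

/* The prctl value is settable only while EXPLICIT is in effect. */
int sk_prctl_set_limit(struct addr_state *s, uint64_t limit)
{
    if (!(s->personality & SK_ADDR_LIMIT_EXPLICIT))
        return -1;
    s->explicit_limit = limit;
    return 0;
}

/* Reading returns the explicit value, or whatever the other bits imply. */
uint64_t sk_prctl_get_limit(const struct addr_state *s)
{
    if (s->personality & SK_ADDR_LIMIT_EXPLICIT)
        return s->explicit_limit;
    return sk_implied_limit(s->personality);
}
```

In this model, setting ADDR_LIMIT_32BIT and then calling the prctl setter
simply fails, which removes the ambiguous case.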


Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR

2017-01-03 Thread Andy Lutomirski
On Tue, Jan 3, 2017 at 5:18 AM, Arnd Bergmann  wrote:
> On Monday, January 2, 2017 10:08:28 PM CET Andy Lutomirski wrote:
>>
>> > This seems to nicely address the same problem on arm64, which has
>> > run into the same issue due to the various page table formats
>> > that can currently be chosen at compile time.
>>
>> On further reflection, I think this has very little to do with paging
>> formats except insofar as paging formats make us notice the problem.
>> The issue is that user code wants to be able to assume an upper limit
>> on an address, and it gets an upper limit right now that depends on
>> architecture due to paging formats.  But someone really might want to
>> write a *portable* 64-bit program that allocates memory with the high
>> 16 bits clear.  So let's add such a mechanism directly.
>>
>> As a thought experiment, what if x86_64 simply never allocated "high"
>> (above 2^47-1) addresses unless a new mmap-with-explicit-limit syscall
>> were used?  Old glibc would continue working.  Old VMs would work.
>> New programs that want to use ginormous mappings would have to use the
>> new syscall.  This would be totally stateless and would have no issues
>> with CRIU.
>
> I can see this working well for the 47-bit addressing default, but
> what about applications that actually rely on 39-bit addressing
> (I'd have to double-check, but I think this was the limit that
> people were most interested in for arm64)?
>
> 39 bits seems a little small to make that the default for everyone
> who doesn't pass the extra flag. Having to pass another flag to
> limit the addresses introduces other problems (e.g. mmap from
> library call that doesn't pass that flag).

That's a fair point.  Maybe my straw man isn't so good.

>
>> If necessary, we could also have a prctl that changes a
>> "personality-like" limit that is in effect when the old mmap was used.
>> I say "personality-like" because it would reset under exactly the same
>> conditions that personality resets itself.
>
> For "personality-like", it would still have to interact
> with the existing PER_LINUX32 and PER_LINUX32_3GB flags that
> do the exact same thing, so actually using personality might
> be better.
>
> We still have a few bits in the personality arguments, and
> we could combine them with the existing ADDR_LIMIT_3GB
> and ADDR_LIMIT_32BIT flags that are mutually exclusive by
> definition, such as
>
> ADDR_LIMIT_32BIT =  0x080, /* existing */
> ADDR_LIMIT_3GB   =  0x800, /* existing */
> ADDR_LIMIT_39BIT =  0x001, /* next free bit */
> ADDR_LIMIT_42BIT =  0x801,
> ADDR_LIMIT_47BIT =  0x081,
> ADDR_LIMIT_48BIT =  0x881,
>
> This would probably take only one or two personality bits for the
> limits that are interesting in practice.

Hmm.  What if we approached this a bit differently?  We could add a
single new personality bit ADDR_LIMIT_EXPLICIT.  Setting this bit
would cause PER_LINUX32_3GB etc. to be automatically cleared.  When
ADDR_LIMIT_EXPLICIT is in effect, prctl can set a 64-bit numeric
limit.  If ADDR_LIMIT_EXPLICIT is cleared, the prctl value stops being
settable and reading it via prctl returns whatever is implied by the
other personality bits.

--Andy
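
For comparison, the bit-combination encoding quoted above could be decoded
as sketched below. The flag values are copied exactly as they appear in
the quoted message and should be treated as illustrative; the P_ prefix,
the mask, and the function name are hypothetical.

```c
#include <stdint.h>

/* Values as written in the quoted proposal. */
enum {
    P_ADDR_LIMIT_32BIT = 0x080,
    P_ADDR_LIMIT_3GB   = 0x800,
    P_ADDR_LIMIT_39BIT = 0x001,
    P_ADDR_LIMIT_42BIT = 0x801,
    P_ADDR_LIMIT_47BIT = 0x081,
    P_ADDR_LIMIT_48BIT = 0x881,
    P_ADDR_LIMIT_MASK  = 0x881   /* union of the three bits involved */
};

/*
 * Returns the address limit in bytes, or 0 if no limit bits are set
 * or the combination is not assigned in the proposal.
 */
uint64_t proposed_limit_bytes(unsigned int persona)
{
    switch (persona & P_ADDR_LIMIT_MASK) {
    case P_ADDR_LIMIT_32BIT: return (uint64_t)1 << 32;
    case P_ADDR_LIMIT_3GB:   return (uint64_t)3 << 30;
    case P_ADDR_LIMIT_39BIT: return (uint64_t)1 << 39;
    case P_ADDR_LIMIT_42BIT: return (uint64_t)1 << 42;
    case P_ADDR_LIMIT_47BIT: return (uint64_t)1 << 47;
    case P_ADDR_LIMIT_48BIT: return (uint64_t)1 << 48;
    default:                 return 0;
    }
}
```

The table only covers a fixed set of limits, which is the trade-off against
the single flag plus numeric prctl value suggested in the reply.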


Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR

2017-01-03 Thread Andy Lutomirski
On Tue, Jan 3, 2017 at 8:04 AM, Kirill A. Shutemov  wrote:
> On Mon, Jan 02, 2017 at 10:08:28PM -0800, Andy Lutomirski wrote:
>> On Mon, Jan 2, 2017 at 12:44 AM, Arnd Bergmann  wrote:
>> > On Tuesday, December 27, 2016 4:54:13 AM CET Kirill A. Shutemov wrote:
>> >> As with other resources you can set the limit lower than current usage.
>> >> It would affect only future virtual address space allocations.
>>
>> I still don't buy all these use cases:
>>
>> >>
>> >> Use-cases for new rlimit:
>> >>
>> >>   - Bumping the soft limit to RLIM_INFINITY allows the current process
>> >> and all its children to use addresses above 47 bits.
>>
>> OK, I get this, but only as a workaround for programs that make
>> assumptions about the address space and don't use some mechanism (to
>> be designed?) to work correctly in spite of a larger address space.
>
> I guess you've misread the case. It's opt-in for the large address space,
> not the other way around.
>
> I believe 47-bit VA by default is the right way to go to make the
> transition without breaking userspace.

What I meant was: setting the rlimit to anything other than -1ULL is a
workaround, but otherwise I agree.  This still makes little sense if
set by PAM or other conventional rlimit tools.

>> >>
>> >>   - Lowering the hard limit to 47 bits would prevent the current process
>> >> and all its children from using addresses above 47 bits, unless a
>> >> process has CAP_SYS_RESOURCE.
>>
>> I've tried and I can't imagine any reason to do this.
>
> That's just in case something goes wrong and we want to stop an
> application from using addresses above 47 bits.

But CAP_SYS_RESOURCE still makes no sense in this context.

>
>> >>   - It can also be handy to lower the hard or soft limit to an arbitrary
>> >> address. User-mode emulation in QEMU may lower the limit to 32 bits
>> >> to emulate a 32-bit machine on a 64-bit host.
>>
>> I don't understand.  QEMU user-mode emulation intercepts all syscalls.
>> What QEMU would *actually* want is a way to say "allocate me some
>> memory with the high N bits clear".  mmap-via-int80 on x86 should be
>> fixed to do this, but a new syscall with an explicit parameter would
>> work, as would a prctl changing the current limit.
>
> Look at the mess in mmap_find_vma(). QEMU has to guess where free virtual
> memory is. That's unnecessarily complex.
>
> prctl would work for this too. new-mmap would *not*: there are more ways
> to allocate virtual address space: shmat(), mremap(). Changing all of them
> just for this is stupid.

Fair enough.

Except that mmap-via-int80, shmat-via-int80, etc should still work (if
I understand what qemu needs correctly), as would the prctl.

>
>> >>
>> >> TODO:
>> >>   - port to non-x86;
>> >>
>> >> Not-yet-signed-off-by: Kirill A. Shutemov 
>> >> 
>> >> Cc: linux-...@vger.kernel.org
>> >
>> > This seems to nicely address the same problem on arm64, which has
>> > run into the same issue due to the various page table formats
>> > that can currently be chosen at compile time.
>>
>> On further reflection, I think this has very little to do with paging
>> formats except insofar as paging formats make us notice the problem.
>> The issue is that user code wants to be able to assume an upper limit
>> on an address, and it gets an upper limit right now that depends on
>> architecture due to paging formats.  But someone really might want to
>> write a *portable* 64-bit program that allocates memory with the high
>> 16 bits clear.  So let's add such a mechanism directly.
>>
>> As a thought experiment, what if x86_64 simply never allocated "high"
>> (above 2^47-1) addresses unless a new mmap-with-explicit-limit syscall
>> were used?  Old glibc would continue working.  Old VMs would work.
>> New programs that want to use ginormous mappings would have to use the
>> new syscall.  This would be totally stateless and would have no issues
>> with CRIU.
>
> Except, we need more than mmap as I mentioned.
>
> And what about stack? I'm not sure that everybody would be happy with
> stack in the middle of address space.

I would, personally.  I think that, for very large address spaces, we
should allocate a large block of stack and get rid of the "stack grows
down forever" legacy idea.  Then we would never need to worry about
the stack eventually hitting some other allocation.  And 2^57 bytes is
hilariously large for a default stack.


Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR

2017-01-03 Thread Kirill A. Shutemov
On Mon, Jan 02, 2017 at 10:08:28PM -0800, Andy Lutomirski wrote:
> On Mon, Jan 2, 2017 at 12:44 AM, Arnd Bergmann  wrote:
> > On Tuesday, December 27, 2016 4:54:13 AM CET Kirill A. Shutemov wrote:
> >> As with other resources you can set the limit lower than current usage.
> >> It would affect only future virtual address space allocations.
> 
> I still don't buy all these use cases:
> 
> >>
> >> Use-cases for new rlimit:
> >>
> >>   - Bumping the soft limit to RLIM_INFINITY allows the current process
> >> and all its children to use addresses above 47 bits.
> 
> OK, I get this, but only as a workaround for programs that make
> assumptions about the address space and don't use some mechanism (to
> be designed?) to work correctly in spite of a larger address space.

I guess you've misread the case. It's opt-in for the large address space,
not the other way around.

I believe 47-bit VA by default is the right way to go to make the
transition without breaking userspace.

> >>   - Bumping the soft limit to RLIM_INFINITY after fork(2), but before
> >> exec(2) allows the child to use addresses above 47 bits.
> 
> Ditto.
> 
> >>
> >>   - Lowering the hard limit to 47 bits would prevent the current process
> >> and all its children from using addresses above 47 bits, unless a
> >> process has CAP_SYS_RESOURCE.
> 
> I've tried and I can't imagine any reason to do this.

That's just in case something goes wrong and we want to stop an
application from using addresses above 47 bits.

> >>   - It can also be handy to lower the hard or soft limit to an arbitrary
> >> address. User-mode emulation in QEMU may lower the limit to 32 bits
> >> to emulate a 32-bit machine on a 64-bit host.
> 
> I don't understand.  QEMU user-mode emulation intercepts all syscalls.
> What QEMU would *actually* want is a way to say "allocate me some
> memory with the high N bits clear".  mmap-via-int80 on x86 should be
> fixed to do this, but a new syscall with an explicit parameter would
> work, as would a prctl changing the current limit.

Look at the mess in mmap_find_vma(). QEMU has to guess where free virtual
memory is. That's unnecessarily complex.

prctl would work for this too. new-mmap would *not*: there are more ways
to allocate virtual address space: shmat(), mremap(). Changing all of them
just for this is stupid.

> >>
> >> TODO:
> >>   - port to non-x86;
> >>
> >> Not-yet-signed-off-by: Kirill A. Shutemov 
> >> Cc: linux-...@vger.kernel.org
> >
> > This seems to nicely address the same problem on arm64, which has
> > run into the same issue due to the various page table formats
> > that can currently be chosen at compile time.
> 
> On further reflection, I think this has very little to do with paging
> formats except insofar as paging formats make us notice the problem.
> The issue is that user code wants to be able to assume an upper limit
> on an address, and it gets an upper limit right now that depends on
> architecture due to paging formats.  But someone really might want to
> write a *portable* 64-bit program that allocates memory with the high
> 16 bits clear.  So let's add such a mechanism directly.
> 
> As a thought experiment, what if x86_64 simply never allocated "high"
> (above 2^47-1) addresses unless a new mmap-with-explicit-limit syscall
> were used?  Old glibc would continue working.  Old VMs would work.
> New programs that want to use ginormous mappings would have to use the
> new syscall.  This would be totally stateless and would have no issues
> with CRIU.

Except we need more than mmap, as I mentioned.

And what about the stack? I'm not sure that everybody would be happy with
the stack in the middle of the address space.

> If necessary, we could also have a prctl that changes a
> "personality-like" limit that is in effect when the old mmap was used.
> I say "personality-like" because it would reset under exactly the same
> conditions that personality resets itself.
> 
> Thoughts?
> 

-- 
 Kirill A. Shutemov


Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR

2017-01-03 Thread Arnd Bergmann
On Monday, January 2, 2017 10:08:28 PM CET Andy Lutomirski wrote:
> 
> > This seems to nicely address the same problem on arm64, which has
> > run into the same issue due to the various page table formats
> > that can currently be chosen at compile time.
> 
> On further reflection, I think this has very little to do with paging
> formats except insofar as paging formats make us notice the problem.
> The issue is that user code wants to be able to assume an upper limit
> on an address, and it gets an upper limit right now that depends on
> architecture due to paging formats.  But someone really might want to
> write a *portable* 64-bit program that allocates memory with the high
> 16 bits clear.  So let's add such a mechanism directly.
> 
> As a thought experiment, what if x86_64 simply never allocated "high"
> (above 2^47-1) addresses unless a new mmap-with-explicit-limit syscall
> were used?  Old glibc would continue working.  Old VMs would work.
> New programs that want to use ginormous mappings would have to use the
> new syscall.  This would be totally stateless and would have no issues
> with CRIU.

I can see this working well for the 47-bit addressing default, but
what about applications that actually rely on 39-bit addressing
(I'd have to double-check, but I think this was the limit that
people were most interested in for arm64)?

39 bits seems a little small to make that the default for everyone
who doesn't pass the extra flag. Having to pass another flag to
limit the addresses introduces other problems (e.g. mmap from
library call that doesn't pass that flag).

> If necessary, we could also have a prctl that changes a
> "personality-like" limit that is in effect when the old mmap was used.
> I say "personality-like" because it would reset under exactly the same
> conditions that personality resets itself.

For "personality-like", it would still have to interact
with the existing PER_LINUX32 and PER_LINUX32_3GB flags that
do the exact same thing, so actually using personality might
be better.

We still have a few bits in the personality arguments, and
we could combine them with the existing ADDR_LIMIT_3GB
and ADDR_LIMIT_32BIT flags that are mutually exclusive by
definition, such as

ADDR_LIMIT_32BIT =  0x0800000, /* existing */
ADDR_LIMIT_3GB   =  0x8000000, /* existing */
ADDR_LIMIT_39BIT =  0x0010000, /* next free bit */
ADDR_LIMIT_42BIT =  0x8010000,
ADDR_LIMIT_47BIT =  0x0810000,
ADDR_LIMIT_48BIT =  0x8810000,

This would probably take only one or two personality bits for the
limits that are interesting in practice.
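
As a rough illustration of how such a combined encoding could be decoded,
here is a minimal sketch. ADDR_LIMIT_32BIT and ADDR_LIMIT_3GB use the
values they have in <linux/personality.h>; ADDR_LIMIT_EXT (assumed to be
0x0010000) and the helper addr_limit_from_personality() are hypothetical
names made up for this example, not existing kernel interfaces.

```c
#include <stdint.h>

/* Existing flags, with the values from <linux/personality.h>. */
#define ADDR_LIMIT_32BIT 0x0800000u
#define ADDR_LIMIT_3GB   0x8000000u
/* Hypothetical: assumes 0x0010000 is still a free personality bit. */
#define ADDR_LIMIT_EXT   0x0010000u

#define ADDR_LIMIT_MASK (ADDR_LIMIT_32BIT | ADDR_LIMIT_3GB | ADDR_LIMIT_EXT)

/*
 * Map the address-limit bits of a personality value to an exclusive
 * upper bound in bytes. Returns 0 when no limit is encoded.
 */
uint64_t addr_limit_from_personality(uint32_t pers)
{
    switch (pers & ADDR_LIMIT_MASK) {
    case ADDR_LIMIT_32BIT:
        return 1ULL << 32;
    case ADDR_LIMIT_3GB:
        return 3ULL << 30;
    case ADDR_LIMIT_EXT:                                     /* 39 bit */
        return 1ULL << 39;
    case ADDR_LIMIT_EXT | ADDR_LIMIT_3GB:                    /* 42 bit */
        return 1ULL << 42;
    case ADDR_LIMIT_EXT | ADDR_LIMIT_32BIT:                  /* 47 bit */
        return 1ULL << 47;
    case ADDR_LIMIT_EXT | ADDR_LIMIT_32BIT | ADDR_LIMIT_3GB: /* 48 bit */
        return 1ULL << 48;
    default:
        /* No bits set, or 32BIT|3GB alone (an invalid pair today):
         * this sketch treats both as "no limit". */
        return 0;
    }
}
```

With this encoding, addr_limit_from_personality(ADDR_LIMIT_EXT |
ADDR_LIMIT_32BIT) yields 1ULL << 47, and unrelated personality bits are
ignored by the mask.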

Arnd


Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR

2017-01-02 Thread Andy Lutomirski
On Mon, Jan 2, 2017 at 12:44 AM, Arnd Bergmann  wrote:
> On Tuesday, December 27, 2016 4:54:13 AM CET Kirill A. Shutemov wrote:
>> As with other resources you can set the limit lower than current usage.
>> It would affect only future virtual address space allocations.

I still don't buy all these use cases:

>>
>> Use-cases for new rlimit:
>>
>>   - Bumping the soft limit to RLIM_INFINITY, allows current process all
>> its children to use addresses above 47-bits.

OK, I get this, but only as a workaround for programs that make
assumptions about the address space and don't use some mechanism (to
be designed?) to work correctly in spite of a larger address space.

>>
>>   - Bumping the soft limit to RLIM_INFINITY after fork(2), but before
>> exec(2) allows the child to use addresses above 47-bits.

Ditto.

>>
>>   - Lowering the hard limit to 47-bits would prevent current process all
>> its children to use addresses above 47-bits, unless a process has
>> CAP_SYS_RESOURCES.

I've tried and I can't imagine any reason to do this.

>>
>>   - It’s also can be handy to lower hard or soft limit to arbitrary
>> address. User-mode emulation in QEMU may lower the limit to 32-bit
>> to emulate 32-bit machine on 64-bit host.

I don't understand.  QEMU user-mode emulation intercepts all syscalls.
What QEMU would *actually* want is a way to say "allocate me some
memory with the high N bits clear".  mmap-via-int80 on x86 should be
fixed to do this, but a new syscall with an explicit parameter would
work, as would a prctl changing the current limit.
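
To make "high N bits clear" concrete, here is a small userspace sketch of
the property a JIT or emulator would want every returned address to
satisfy. The helper name high_bits_clear() is invented for this example;
it is not an existing kernel or libc interface.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true when the top n bits of a 64-bit address are all zero.
 * n == 0 places no constraint; n >= 64 only accepts address 0.
 */
bool high_bits_clear(uint64_t addr, unsigned n)
{
    if (n == 0)
        return true;
    if (n >= 64)
        return addr == 0;
    return (addr >> (64 - n)) == 0;
}
```

For example, any address below 1ULL << 47 passes with n = 17, which is
the guarantee programs get today from 4-level paging on x86-64, while an
address with bit 47 set fails that check but still passes with n = 16.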

>>
>> TODO:
>>   - port to non-x86;
>>
>> Not-yet-signed-off-by: Kirill A. Shutemov 
>> Cc: linux-...@vger.kernel.org
>
> This seems to nicely address the same problem on arm64, which has
> run into the same issue due to the various page table formats
> that can currently be chosen at compile time.

On further reflection, I think this has very little to do with paging
formats except insofar as paging formats make us notice the problem.
The issue is that user code wants to be able to assume an upper limit
on an address, and it gets an upper limit right now that depends on
architecture due to paging formats.  But someone really might want to
write a *portable* 64-bit program that allocates memory with the high
16 bits clear.  So let's add such a mechanism directly.

As a thought experiment, what if x86_64 simply never allocated "high"
(above 2^47-1) addresses unless a new mmap-with-explicit-limit syscall
were used?  Old glibc would continue working.  Old VMs would work.
New programs that want to use ginormous mappings would have to use the
new syscall.  This would be totally stateless and would have no issues
with CRIU.

If necessary, we could also have a prctl that changes a
"personality-like" limit that is in effect when the old mmap was used.
I say "personality-like" because it would reset under exactly the same
conditions that personality resets itself.

Thoughts?


Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR

2017-01-02 Thread Kirill A. Shutemov
On Mon, Dec 26, 2016 at 07:22:03PM -0800, Andy Lutomirski wrote:
> On Mon, Dec 26, 2016 at 6:24 PM, Kirill A. Shutemov
>  wrote:
> > On Mon, Dec 26, 2016 at 06:06:01PM -0800, Andy Lutomirski wrote:
> >> On Mon, Dec 26, 2016 at 5:54 PM, Kirill A. Shutemov
> >>  wrote:
> >> > This patch introduces new rlimit resource to manage maximum virtual
> >> > address available to userspace to map.
> >> >
> >> > On x86, 5-level paging enables 56-bit userspace virtual address space.
> >> > Not all user space is ready to handle wide addresses. It's known that
> >> > at least some JIT compilers use high bit in pointers to encode their
> >> > information. It collides with valid pointers with 5-level paging and
> >> > leads to crashes.
> >> >
> >> > The patch aims to address this compatibility issue.
> >> >
> >> > MM would use min(RLIMIT_VADDR, TASK_SIZE) as upper limit of virtual
> >> > address available to map by userspace.
> >> >
> >> > The default hard limit will be RLIM_INFINITY, which basically means that
> >> > TASK_SIZE limits available address space.
> >> >
> >> > The soft limit will also be RLIM_INFINITY everywhere, but the machine
> >> > with 5-level paging enabled. In this case, soft limit would be
> >> > (1UL << 47) - PAGE_SIZE. It’s current x86-64 TASK_SIZE_MAX with 4-level
> >> > paging which known to be safe
> >> >
> >> > New rlimit resource would follow usual semantics with regards to
> >> > inheritance: preserved on fork(2) and exec(2). This has potential to
> >> > break application if limits set too wide or too narrow, but this is not
> >> > uncommon for other resources (consider RLIMIT_DATA or RLIMIT_AS).
> >> >
> >> > As with other resources you can set the limit lower than current usage.
> >> > It would affect only future virtual address space allocations.
> >> >
> >> > Use-cases for new rlimit:
> >> >
> >> >   - Bumping the soft limit to RLIM_INFINITY, allows current process all
> >> > its children to use addresses above 47-bits.
> >> >
> >> >   - Bumping the soft limit to RLIM_INFINITY after fork(2), but before
> >> > exec(2) allows the child to use addresses above 47-bits.
> >> >
> >> >   - Lowering the hard limit to 47-bits would prevent current process all
> >> > its children to use addresses above 47-bits, unless a process has
> >> > CAP_SYS_RESOURCES.
> >> >
> >> >   - It’s also can be handy to lower hard or soft limit to arbitrary
> >> > address. User-mode emulation in QEMU may lower the limit to 32-bit
> >> > to emulate 32-bit machine on 64-bit host.
> >>
> >> I tend to think that this should be a personality or an ELF flag, not
> >> an rlimit.
> >
> > My plan was to implement ELF flag on top. Basically, ELF flag would mean
> > that we bump soft limit to hard limit on exec.
> >
> >> That way setuid works right.
> >
> > Um.. I probably miss background here.
> >
> 
> If a setuid program depends on the lower limit, then a malicious
> program shouldn't be able to cause it to run with the higher limit.
> The personality code should already get this case right because
> personalities are reset when setuid happens.

It would be nice to have more fine-grained control than a binary
personality flag gives. It would cover more use-cases.

Well, we could reset the limit on exec of a setuid binary too. That's not
ideal, but...

-- 
 Kirill A. Shutemov


Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR

2017-01-02 Thread Arnd Bergmann
On Tuesday, December 27, 2016 4:54:13 AM CET Kirill A. Shutemov wrote:
> This patch introduces new rlimit resource to manage maximum virtual
> address available to userspace to map.
> 
> On x86, 5-level paging enables 56-bit userspace virtual address space.
> Not all user space is ready to handle wide addresses. It's known that
> at least some JIT compilers use high bit in pointers to encode their
> information. It collides with valid pointers with 5-level paging and
> leads to crashes.
> 
> The patch aims to address this compatibility issue.
> 
> MM would use min(RLIMIT_VADDR, TASK_SIZE) as upper limit of virtual
> address available to map by userspace.
> 
> The default hard limit will be RLIM_INFINITY, which basically means that
> TASK_SIZE limits available address space.
> 
> The soft limit will also be RLIM_INFINITY everywhere, but the machine
> with 5-level paging enabled. In this case, soft limit would be
> (1UL << 47) - PAGE_SIZE. It’s current x86-64 TASK_SIZE_MAX with 4-level
> paging which known to be safe
> 
> New rlimit resource would follow usual semantics with regards to
> inheritance: preserved on fork(2) and exec(2). This has potential to
> break application if limits set too wide or too narrow, but this is not
> uncommon for other resources (consider RLIMIT_DATA or RLIMIT_AS).
> 
> As with other resources you can set the limit lower than current usage.
> It would affect only future virtual address space allocations.
> 
> Use-cases for new rlimit:
> 
>   - Bumping the soft limit to RLIM_INFINITY, allows current process all
> its children to use addresses above 47-bits.
> 
>   - Bumping the soft limit to RLIM_INFINITY after fork(2), but before
> exec(2) allows the child to use addresses above 47-bits.
> 
>   - Lowering the hard limit to 47-bits would prevent current process all
> its children to use addresses above 47-bits, unless a process has
> CAP_SYS_RESOURCES.
> 
>   - It’s also can be handy to lower hard or soft limit to arbitrary
> address. User-mode emulation in QEMU may lower the limit to 32-bit
> to emulate 32-bit machine on 64-bit host.
> 
> TODO:
>   - port to non-x86;
> 
> Not-yet-signed-off-by: Kirill A. Shutemov 
> Cc: linux-...@vger.kernel.org

This seems to nicely address the same problem on arm64, which has
run into the same issue due to the various page table formats
that can currently be chosen at compile time.

I don't see how this interacts with the existing
PER_LINUX32/PER_LINUX32_3GB personality flags, but I assume you have
either already thought of that, or we can come up with a good way
to define what happens when conflicting settings are applied.

The two reasonable ways I can think of are to either use the
minimum of the two limits, or to make the personality syscall
set the soft rlimit and use whatever limit was last set.
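
The first option (take the minimum) can be sketched as below. This is
only an illustration under assumptions: effective_vaddr_limit() and
VADDR_UNLIMITED are hypothetical names, the soft limit stands for the
proposed RLIMIT_VADDR value, and the personality limit is whatever bound
the personality flags imply (unlimited if none).

```c
#include <stdint.h>

/* Stands in for RLIM_INFINITY / "no personality limit" in this sketch. */
#define VADDR_UNLIMITED UINT64_MAX

static uint64_t min_u64(uint64_t a, uint64_t b)
{
    return a < b ? a : b;
}

/*
 * Upper bound for new userspace mappings when both mechanisms apply:
 * the smallest of the (hypothetical) RLIMIT_VADDR soft limit, the limit
 * implied by the personality, and the architecture's TASK_SIZE.
 */
uint64_t effective_vaddr_limit(uint64_t rlim_cur, uint64_t pers_limit,
                               uint64_t task_size)
{
    return min_u64(min_u64(rlim_cur, pers_limit), task_size);
}
```

With a soft limit of (1ULL << 47) - 4096 and no personality limit, the
result is the soft limit; with an unlimited rlimit and a 3GB personality
limit, the result is 3ULL << 30. The second option would instead need
state about which setting was written last, which this sketch does not
model.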

Arnd


Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR

2017-01-02 Thread Kirill A. Shutemov
On Fri, Dec 30, 2016 at 06:08:27PM -0800, Andy Lutomirski wrote:
> On Wed, Dec 28, 2016 at 6:53 PM, Carlos O'Donell  wrote:
> > On 12/26/2016 09:24 PM, Kirill A. Shutemov wrote:
> >> On Mon, Dec 26, 2016 at 06:06:01PM -0800, Andy Lutomirski wrote:
> >>> On Mon, Dec 26, 2016 at 5:54 PM, Kirill A. Shutemov
> >>>  wrote:
> >>>> This patch introduces new rlimit resource to manage maximum virtual
> >>>> address available to userspace to map.
> >>>>
> >>>> On x86, 5-level paging enables 56-bit userspace virtual address space.
> >>>> Not all user space is ready to handle wide addresses. It's known that
> >>>> at least some JIT compilers use high bit in pointers to encode their
> >>>> information. It collides with valid pointers with 5-level paging and
> >>>> leads to crashes.
> >>>>
> >>>> The patch aims to address this compatibility issue.
> >>>>
> >>>> MM would use min(RLIMIT_VADDR, TASK_SIZE) as upper limit of virtual
> >>>> address available to map by userspace.
> >>>>
> >>>> The default hard limit will be RLIM_INFINITY, which basically means that
> >>>> TASK_SIZE limits available address space.
> >>>>
> >>>> The soft limit will also be RLIM_INFINITY everywhere, but the machine
> >>>> with 5-level paging enabled. In this case, soft limit would be
> >>>> (1UL << 47) - PAGE_SIZE. It’s current x86-64 TASK_SIZE_MAX with 4-level
> >>>> paging which known to be safe
> >>>>
> >>>> New rlimit resource would follow usual semantics with regards to
> >>>> inheritance: preserved on fork(2) and exec(2). This has potential to
> >>>> break application if limits set too wide or too narrow, but this is not
> >>>> uncommon for other resources (consider RLIMIT_DATA or RLIMIT_AS).
> >>>>
> >>>> As with other resources you can set the limit lower than current usage.
> >>>> It would affect only future virtual address space allocations.
> >>>>
> >>>> Use-cases for new rlimit:
> >>>>
> >>>>   - Bumping the soft limit to RLIM_INFINITY, allows current process all
> >>>> its children to use addresses above 47-bits.
> >>>>
> >>>>   - Bumping the soft limit to RLIM_INFINITY after fork(2), but before
> >>>> exec(2) allows the child to use addresses above 47-bits.
> >>>>
> >>>>   - Lowering the hard limit to 47-bits would prevent current process all
> >>>> its children to use addresses above 47-bits, unless a process has
> >>>> CAP_SYS_RESOURCES.
> >>>>
> >>>>   - It’s also can be handy to lower hard or soft limit to arbitrary
> >>>> address. User-mode emulation in QEMU may lower the limit to 32-bit
> >>>> to emulate 32-bit machine on 64-bit host.
> >>>
> >>> I tend to think that this should be a personality or an ELF flag, not
> >>> an rlimit.
> >>
> >> My plan was to implement ELF flag on top. Basically, ELF flag would mean
> >> that we bump soft limit to hard limit on exec.
> >
> > Could you clarify what you mean by an "ELF flag?"
> 
> Some way to mark a binary as supporting a larger address space.  I
> don't have a precise solution in mind, but an ELF note might be a good
> way to go here.

+ H.J.

There's a discussion of the "Program Properties" proposal[1]. It seems to
fit the purpose.

[1] https://sourceware.org/ml/gnu-gabi/2016-q4/msg0.html

-- 
 Kirill A. Shutemov


Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR

2016-12-30 Thread Andy Lutomirski
On Wed, Dec 28, 2016 at 6:53 PM, Carlos O'Donell  wrote:
> On 12/26/2016 09:24 PM, Kirill A. Shutemov wrote:
>> On Mon, Dec 26, 2016 at 06:06:01PM -0800, Andy Lutomirski wrote:
>>> On Mon, Dec 26, 2016 at 5:54 PM, Kirill A. Shutemov
>>>  wrote:
>>>> This patch introduces new rlimit resource to manage maximum virtual
>>>> address available to userspace to map.
>>>>
>>>> On x86, 5-level paging enables 56-bit userspace virtual address space.
>>>> Not all user space is ready to handle wide addresses. It's known that
>>>> at least some JIT compilers use high bit in pointers to encode their
>>>> information. It collides with valid pointers with 5-level paging and
>>>> leads to crashes.
>>>>
>>>> The patch aims to address this compatibility issue.
>>>>
>>>> MM would use min(RLIMIT_VADDR, TASK_SIZE) as upper limit of virtual
>>>> address available to map by userspace.
>>>>
>>>> The default hard limit will be RLIM_INFINITY, which basically means that
>>>> TASK_SIZE limits available address space.
>>>>
>>>> The soft limit will also be RLIM_INFINITY everywhere, but the machine
>>>> with 5-level paging enabled. In this case, soft limit would be
>>>> (1UL << 47) - PAGE_SIZE. It’s current x86-64 TASK_SIZE_MAX with 4-level
>>>> paging which known to be safe
>>>>
>>>> New rlimit resource would follow usual semantics with regards to
>>>> inheritance: preserved on fork(2) and exec(2). This has potential to
>>>> break application if limits set too wide or too narrow, but this is not
>>>> uncommon for other resources (consider RLIMIT_DATA or RLIMIT_AS).
>>>>
>>>> As with other resources you can set the limit lower than current usage.
>>>> It would affect only future virtual address space allocations.
>>>>
>>>> Use-cases for new rlimit:
>>>>
>>>>   - Bumping the soft limit to RLIM_INFINITY, allows current process all
>>>> its children to use addresses above 47-bits.
>>>>
>>>>   - Bumping the soft limit to RLIM_INFINITY after fork(2), but before
>>>> exec(2) allows the child to use addresses above 47-bits.
>>>>
>>>>   - Lowering the hard limit to 47-bits would prevent current process all
>>>> its children to use addresses above 47-bits, unless a process has
>>>> CAP_SYS_RESOURCES.
>>>>
>>>>   - It’s also can be handy to lower hard or soft limit to arbitrary
>>>> address. User-mode emulation in QEMU may lower the limit to 32-bit
>>>> to emulate 32-bit machine on 64-bit host.
>>>
>>> I tend to think that this should be a personality or an ELF flag, not
>>> an rlimit.
>>
>> My plan was to implement ELF flag on top. Basically, ELF flag would mean
>> that we bump soft limit to hard limit on exec.
>
> Could you clarify what you mean by an "ELF flag?"

Some way to mark a binary as supporting a larger address space.  I
don't have a precise solution in mind, but an ELF note might be a good
way to go here.

--Andy


Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR

2016-12-28 Thread Carlos O'Donell
On 12/26/2016 09:24 PM, Kirill A. Shutemov wrote:
> On Mon, Dec 26, 2016 at 06:06:01PM -0800, Andy Lutomirski wrote:
>> On Mon, Dec 26, 2016 at 5:54 PM, Kirill A. Shutemov
>>  wrote:
>>> This patch introduces new rlimit resource to manage maximum virtual
>>> address available to userspace to map.
>>>
>>> On x86, 5-level paging enables 56-bit userspace virtual address space.
>>> Not all user space is ready to handle wide addresses. It's known that
>>> at least some JIT compilers use high bit in pointers to encode their
>>> information. It collides with valid pointers with 5-level paging and
>>> leads to crashes.
>>>
>>> The patch aims to address this compatibility issue.
>>>
>>> MM would use min(RLIMIT_VADDR, TASK_SIZE) as upper limit of virtual
>>> address available to map by userspace.
>>>
>>> The default hard limit will be RLIM_INFINITY, which basically means that
>>> TASK_SIZE limits available address space.
>>>
>>> The soft limit will also be RLIM_INFINITY everywhere, but the machine
>>> with 5-level paging enabled. In this case, soft limit would be
>>> (1UL << 47) - PAGE_SIZE. It’s current x86-64 TASK_SIZE_MAX with 4-level
>>> paging which known to be safe
>>>
>>> New rlimit resource would follow usual semantics with regards to
>>> inheritance: preserved on fork(2) and exec(2). This has potential to
>>> break application if limits set too wide or too narrow, but this is not
>>> uncommon for other resources (consider RLIMIT_DATA or RLIMIT_AS).
>>>
>>> As with other resources you can set the limit lower than current usage.
>>> It would affect only future virtual address space allocations.
>>>
>>> Use-cases for new rlimit:
>>>
>>>   - Bumping the soft limit to RLIM_INFINITY, allows current process all
>>> its children to use addresses above 47-bits.
>>>
>>>   - Bumping the soft limit to RLIM_INFINITY after fork(2), but before
>>> exec(2) allows the child to use addresses above 47-bits.
>>>
>>>   - Lowering the hard limit to 47-bits would prevent current process all
>>> its children to use addresses above 47-bits, unless a process has
>>> CAP_SYS_RESOURCES.
>>>
>>>   - It’s also can be handy to lower hard or soft limit to arbitrary
>>> address. User-mode emulation in QEMU may lower the limit to 32-bit
>>> to emulate 32-bit machine on 64-bit host.
>>
>> I tend to think that this should be a personality or an ELF flag, not
>> an rlimit.
> 
> My plan was to implement ELF flag on top. Basically, ELF flag would mean
> that we bump soft limit to hard limit on exec.

Could you clarify what you mean by an "ELF flag?"

-- 
Cheers,
Carlos.


Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR

2016-12-26 Thread Andy Lutomirski
On Mon, Dec 26, 2016 at 6:24 PM, Kirill A. Shutemov
 wrote:
> On Mon, Dec 26, 2016 at 06:06:01PM -0800, Andy Lutomirski wrote:
>> On Mon, Dec 26, 2016 at 5:54 PM, Kirill A. Shutemov
>>  wrote:
>> > This patch introduces new rlimit resource to manage maximum virtual
>> > address available to userspace to map.
>> >
>> > On x86, 5-level paging enables 56-bit userspace virtual address space.
>> > Not all user space is ready to handle wide addresses. It's known that
>> > at least some JIT compilers use high bit in pointers to encode their
>> > information. It collides with valid pointers with 5-level paging and
>> > leads to crashes.
>> >
>> > The patch aims to address this compatibility issue.
>> >
>> > MM would use min(RLIMIT_VADDR, TASK_SIZE) as upper limit of virtual
>> > address available to map by userspace.
>> >
>> > The default hard limit will be RLIM_INFINITY, which basically means that
>> > TASK_SIZE limits available address space.
>> >
>> > The soft limit will also be RLIM_INFINITY everywhere, but the machine
>> > with 5-level paging enabled. In this case, soft limit would be
>> > (1UL << 47) - PAGE_SIZE. It’s current x86-64 TASK_SIZE_MAX with 4-level
>> > paging which known to be safe
>> >
>> > New rlimit resource would follow usual semantics with regards to
>> > inheritance: preserved on fork(2) and exec(2). This has potential to
>> > break application if limits set too wide or too narrow, but this is not
>> > uncommon for other resources (consider RLIMIT_DATA or RLIMIT_AS).
>> >
>> > As with other resources you can set the limit lower than current usage.
>> > It would affect only future virtual address space allocations.
>> >
>> > Use-cases for new rlimit:
>> >
>> >   - Bumping the soft limit to RLIM_INFINITY, allows current process all
>> > its children to use addresses above 47-bits.
>> >
>> >   - Bumping the soft limit to RLIM_INFINITY after fork(2), but before
>> > exec(2) allows the child to use addresses above 47-bits.
>> >
>> >   - Lowering the hard limit to 47-bits would prevent current process all
>> > its children to use addresses above 47-bits, unless a process has
>> > CAP_SYS_RESOURCES.
>> >
>> >   - It’s also can be handy to lower hard or soft limit to arbitrary
>> > address. User-mode emulation in QEMU may lower the limit to 32-bit
>> > to emulate 32-bit machine on 64-bit host.
>>
>> I tend to think that this should be a personality or an ELF flag, not
>> an rlimit.
>
> My plan was to implement ELF flag on top. Basically, ELF flag would mean
> that we bump soft limit to hard limit on exec.
>
>> That way setuid works right.
>
> Um.. I probably miss background here.
>

If a setuid program depends on the lower limit, then a malicious
program shouldn't be able to cause it to run with the higher limit.
The personality code should already get this case right because
personalities are reset when setuid happens.
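
The difference described above can be illustrated with a toy model. Nothing
here is a real kernel interface: ADDR_WIDE is a hypothetical personality
flag, exec_image() is a made-up helper, and the model only encodes the two
behaviours stated in this thread (rlimits are preserved across exec(2),
personalities are reset on setuid exec).

```python
from dataclasses import dataclass, replace

ADDR_WIDE = 0x1                 # hypothetical personality flag: "wide addresses OK"
CLEAR_ON_SETID = ADDR_WIDE      # assumption: this flag is reset on setuid exec
NO_LIMIT = (1 << 64) - 1        # stand-in for RLIM_INFINITY


@dataclass(frozen=True)
class Task:
    vaddr_soft: int             # soft RLIMIT_VADDR, inherited across exec
    personality: int            # personality flags


def exec_image(task, setuid):
    """Return the task state after exec; only the personality is touched."""
    if setuid:
        return replace(task, personality=task.personality & ~CLEAR_ON_SETID)
    return task


# An unprivileged caller opts in to wide addresses both ways, then execs
# a setuid binary that relies on the 47-bit limit.
caller = Task(vaddr_soft=NO_LIMIT, personality=ADDR_WIDE)
setuid_child = exec_image(caller, setuid=True)

rlimit_survived = setuid_child.vaddr_soft == NO_LIMIT
flag_survived = bool(setuid_child.personality & ADDR_WIDE)
print(rlimit_survived, flag_survived)
```

In this model the raised rlimit reaches the setuid program while the
personality flag does not, which is the case the paragraph above is
concerned with.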

--Andy


Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR

2016-12-26 Thread Kirill A. Shutemov
On Mon, Dec 26, 2016 at 06:06:01PM -0800, Andy Lutomirski wrote:
> On Mon, Dec 26, 2016 at 5:54 PM, Kirill A. Shutemov
>  wrote:
> > This patch introduces new rlimit resource to manage maximum virtual
> > address available to userspace to map.
> >
> > On x86, 5-level paging enables 56-bit userspace virtual address space.
> > Not all user space is ready to handle wide addresses. It's known that
> > at least some JIT compilers use high bit in pointers to encode their
> > information. It collides with valid pointers with 5-level paging and
> > leads to crashes.
> >
> > The patch aims to address this compatibility issue.
> >
> > MM would use min(RLIMIT_VADDR, TASK_SIZE) as upper limit of virtual
> > address available to map by userspace.
> >
> > The default hard limit will be RLIM_INFINITY, which basically means that
> > TASK_SIZE limits available address space.
> >
> > The soft limit will also be RLIM_INFINITY everywhere, but the machine
> > with 5-level paging enabled. In this case, soft limit would be
> > (1UL << 47) - PAGE_SIZE. It’s current x86-64 TASK_SIZE_MAX with 4-level
> > paging which known to be safe
> >
> > New rlimit resource would follow usual semantics with regards to
> > inheritance: preserved on fork(2) and exec(2). This has potential to
> > break application if limits set too wide or too narrow, but this is not
> > uncommon for other resources (consider RLIMIT_DATA or RLIMIT_AS).
> >
> > As with other resources you can set the limit lower than current usage.
> > It would affect only future virtual address space allocations.
> >
> > Use-cases for new rlimit:
> >
> >   - Bumping the soft limit to RLIM_INFINITY, allows current process all
> > its children to use addresses above 47-bits.
> >
> >   - Bumping the soft limit to RLIM_INFINITY after fork(2), but before
> > exec(2) allows the child to use addresses above 47-bits.
> >
> >   - Lowering the hard limit to 47-bits would prevent current process all
> > its children to use addresses above 47-bits, unless a process has
> > CAP_SYS_RESOURCES.
> >
> >   - It’s also can be handy to lower hard or soft limit to arbitrary
> > address. User-mode emulation in QEMU may lower the limit to 32-bit
> > to emulate 32-bit machine on 64-bit host.
> 
> I tend to think that this should be a personality or an ELF flag, not
> an rlimit.

My plan was to implement an ELF flag on top. Basically, the ELF flag would
mean that we bump the soft limit to the hard limit on exec.
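
As a rough user-space illustration of "bump soft limit to hard limit", the
snippet below does the same thing by hand for an existing resource. It is
only a sketch: RLIMIT_AS stands in for the proposed RLIMIT_VADDR (which
Python's resource module does not expose), and in the plan above the kernel
would perform the bump at exec time rather than the process itself.

```python
import resource

# RLIMIT_AS is a stand-in for the proposed RLIMIT_VADDR.
soft, hard = resource.getrlimit(resource.RLIMIT_AS)

# Raising the soft limit up to the current hard limit needs no privilege.
resource.setrlimit(resource.RLIMIT_AS, (hard, hard))

new_soft, new_hard = resource.getrlimit(resource.RLIMIT_AS)
print(new_soft == new_hard)
```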

> That way setuid works right.

Um... I'm probably missing some background here.

-- 
 Kirill A. Shutemov


Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR

2016-12-26 Thread Andy Lutomirski
On Mon, Dec 26, 2016 at 5:54 PM, Kirill A. Shutemov
 wrote:
> This patch introduces new rlimit resource to manage maximum virtual
> address available to userspace to map.
>
> On x86, 5-level paging enables 56-bit userspace virtual address space.
> Not all user space is ready to handle wide addresses. It's known that
> at least some JIT compilers use high bit in pointers to encode their
> information. It collides with valid pointers with 5-level paging and
> leads to crashes.
>
> The patch aims to address this compatibility issue.
>
> MM would use min(RLIMIT_VADDR, TASK_SIZE) as upper limit of virtual
> address available to map by userspace.
>
> The default hard limit will be RLIM_INFINITY, which basically means that
> TASK_SIZE limits available address space.
>
> The soft limit will also be RLIM_INFINITY everywhere, but the machine
> with 5-level paging enabled. In this case, soft limit would be
> (1UL << 47) - PAGE_SIZE. It’s current x86-64 TASK_SIZE_MAX with 4-level
> paging which known to be safe
>
> New rlimit resource would follow usual semantics with regards to
> inheritance: preserved on fork(2) and exec(2). This has potential to
> break application if limits set too wide or too narrow, but this is not
> uncommon for other resources (consider RLIMIT_DATA or RLIMIT_AS).
>
> As with other resources you can set the limit lower than current usage.
> It would affect only future virtual address space allocations.
>
> Use-cases for new rlimit:
>
>   - Bumping the soft limit to RLIM_INFINITY, allows current process all
> its children to use addresses above 47-bits.
>
>   - Bumping the soft limit to RLIM_INFINITY after fork(2), but before
> exec(2) allows the child to use addresses above 47-bits.
>
>   - Lowering the hard limit to 47-bits would prevent current process all
> its children to use addresses above 47-bits, unless a process has
> CAP_SYS_RESOURCES.
>
>   - It’s also can be handy to lower hard or soft limit to arbitrary
> address. User-mode emulation in QEMU may lower the limit to 32-bit
> to emulate 32-bit machine on 64-bit host.

I tend to think that this should be a personality or an ELF flag, not
an rlimit.  That way setuid works right.
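
For reference, the rule from the quoted patch description,
min(RLIMIT_VADDR, TASK_SIZE), can be sketched in a few lines. This is an
illustrative model, not kernel code: it assumes 4 KiB pages, takes the
5-level TASK_SIZE to be (1 << 56) - PAGE_SIZE based on the 56-bit figure
quoted above, and max_user_address() is a made-up helper name.

```python
PAGE_SIZE = 4096                            # assumption: 4 KiB pages
RLIM_INFINITY = (1 << 64) - 1               # modelled as "no limit"

TASK_SIZE_4LEVEL = (1 << 47) - PAGE_SIZE    # value from the patch description
TASK_SIZE_5LEVEL = (1 << 56) - PAGE_SIZE    # assumption: 56-bit user space


def max_user_address(rlimit_vaddr_soft, task_size):
    """Upper bound for new mappings: min(RLIMIT_VADDR, TASK_SIZE)."""
    return min(rlimit_vaddr_soft, task_size)


# Default on a 5-level machine: the soft limit stays at the 47-bit boundary.
default_5level = max_user_address(TASK_SIZE_4LEVEL, TASK_SIZE_5LEVEL)
# After bumping the soft limit to RLIM_INFINITY, TASK_SIZE is the only bound.
opted_in_5level = max_user_address(RLIM_INFINITY, TASK_SIZE_5LEVEL)
# On a 4-level machine the soft limit defaults to RLIM_INFINITY and
# TASK_SIZE alone decides.
default_4level = max_user_address(RLIM_INFINITY, TASK_SIZE_4LEVEL)

print(hex(default_5level), hex(opted_in_5level), hex(default_4level))
```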

