Re: How should we handle variable address space sizes (Re: [RFC 3/4] x86/mm: define TASK_SIZE as current->mm->task_size)

2017-01-02 Thread hpa
On January 2, 2017 8:52:41 AM PST, Andy Lutomirski  wrote:
>On Mon, Jan 2, 2017 at 1:49 AM, Kirill A. Shutemov
> wrote:
>> On Fri, Dec 30, 2016 at 06:11:05PM -0800, Andy Lutomirski wrote:
>>> On Fri, Dec 30, 2016 at 7:56 AM, Dmitry Safonov
> wrote:
>>> > Keep task's virtual address space size as mm_struct field which
>>> > exists for a long time - it's initialized in setup_new_exec()
>>> > depending on the new task's personality.
>>> > This way TASK_SIZE will always be the same as
>current->mm->task_size.
>>> > Previously, there could be an issue about different values of
>>> > TASK_SIZE and current->mm->task_size: e.g, a 32-bit process can
>unset
>>> > ADDR_LIMIT_3GB personality (with personality syscall) and
>>> > so TASK_SIZE will be 4Gb, which is larger than mm->task_size =
>3Gb.
>>> > As TASK_SIZE *and* current->mm->task_size are used both in code
>>> > frequently, this difference creates a subtle situations, for
>example:
>>> > one can mmap addresses > 3Gb, but they will be hidden in
>>> > /proc/pid/pagemap as it checks mm->task_size.
>>> > I've moved initialization of mm->task_size earlier in
>setup_new_exec()
>>> > as arch_pick_mmap_layout() initializes mmap_legacy_base with
>>> > TASK_UNMAPPED_BASE, which depends on TASK_SIZE.
>>>
>>> I don't like this patch so much because I think that we should
>figure
>>> out how this will all work in the long run first.  I've added some
>>> more people to the thread because other arches have similar issues
>and
>>> because x86 is about to get considerably more complicated (choices
>>> include 3GB, 4GB, 47-bit, and 56-bit (the latter IIRC)).
>>>
>>> Here are a few of my thoughts on the matter.  This isn't all that
>well
>>> thought out:
>>>
>>> The address space limit, especially if CRIU is in play, isn't really
>a
>>> hard limit.  For example, you could allocate high memory then lower
>>> the limit.  Similarly, I see no reason that an x32 program should be
>>> forbidden from mapping some high addresses or, similarly, that an
>i386
>>> program can't (if it really wanted to) do a 64-bit mmap() and get a
>>> high address.
>>>
>>> On that note, can we just *delete* the task_size check from pagemap?
>>> It's been there since the very beginning:
>>>
>>> commit 85863e475e59afb027b0113290e3796ee6020b7d
>>> Author: Matt Mackall 
>>> Date:   Mon Feb 4 22:29:04 2008 -0800
>>>
>>> maps4: add /proc/pid/pagemap interface
>>>
>>> and there's no explanation for why it's needed.
>>>
>>> So maybe we should have a *number* (not a bit) that indicates the
>>> maximum address that mmap() will return unless an override is in
>use.
>>> Since common practice seems to be to stick this in the personality
>>> field, we may need some fancy encoding.  Executing a setuid binary
>>> needs to reset to the default, and personality handles that.
>>
>> If we want to be able to specify arbitrary address as maximum, a
>fancy
>> encoding would need to claim 51 bits (63 VA - 12 in-page address) on
>x86
>> from the persona flag.
>> To me, it's stretching personality interface too far.
>>
>> Maybe it's easier to reset the rlimit for suid binaries?
>
>I guess I don't see why rlimit makes any sense, though.  It's not a
>resource utilization control, hard vs soft limits make very little
>sense, requiring capabilities to exceed the hard limit doesn't help
>anything, and it's only useful to preserve it across execve() to work
>around bugs.
>
>So if it's going to be a number, let's just make it be a new number
>with a new API to control it.
>
>--Andy

It's an API that already exists, that's the plus. Hard and soft limits *do* 
make sense IMO.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: How should we handle variable address space sizes (Re: [RFC 3/4] x86/mm: define TASK_SIZE as current->mm->task_size)

2017-01-02 Thread hpa
On January 2, 2017 8:52:41 AM PST, Andy Lutomirski  wrote:
>On Mon, Jan 2, 2017 at 1:49 AM, Kirill A. Shutemov
> wrote:
>> On Fri, Dec 30, 2016 at 06:11:05PM -0800, Andy Lutomirski wrote:
>>> On Fri, Dec 30, 2016 at 7:56 AM, Dmitry Safonov
> wrote:
>>> > Keep task's virtual address space size as mm_struct field which
>>> > exists for a long time - it's initialized in setup_new_exec()
>>> > depending on the new task's personality.
>>> > This way TASK_SIZE will always be the same as
>current->mm->task_size.
>>> > Previously, there could be an issue about different values of
>>> > TASK_SIZE and current->mm->task_size: e.g, a 32-bit process can
>unset
>>> > ADDR_LIMIT_3GB personality (with personality syscall) and
>>> > so TASK_SIZE will be 4Gb, which is larger than mm->task_size =
>3Gb.
>>> > As TASK_SIZE *and* current->mm->task_size are used both in code
>>> > frequently, this difference creates a subtle situations, for
>example:
>>> > one can mmap addresses > 3Gb, but they will be hidden in
>>> > /proc/pid/pagemap as it checks mm->task_size.
>>> > I've moved initialization of mm->task_size earlier in
>setup_new_exec()
>>> > as arch_pick_mmap_layout() initializes mmap_legacy_base with
>>> > TASK_UNMAPPED_BASE, which depends on TASK_SIZE.
>>>
>>> I don't like this patch so much because I think that we should
>figure
>>> out how this will all work in the long run first.  I've added some
>>> more people to the thread because other arches have similar issues
>and
>>> because x86 is about to get considerably more complicated (choices
>>> include 3GB, 4GB, 47-bit, and 56-bit (the latter IIRC)).
>>>
>>> Here are a few of my thoughts on the matter.  This isn't all that
>well
>>> thought out:
>>>
>>> The address space limit, especially if CRIU is in play, isn't really
>a
>>> hard limit.  For example, you could allocate high memory then lower
>>> the limit.  Similarly, I see no reason that an x32 program should be
>>> forbidden from mapping some high addresses or, similarly, that an
>i386
>>> program can't (if it really wanted to) do a 64-bit mmap() and get a
>>> high address.
>>>
>>> On that note, can we just *delete* the task_size check from pagemap?
>>> It's been there since the very beginning:
>>>
>>> commit 85863e475e59afb027b0113290e3796ee6020b7d
>>> Author: Matt Mackall 
>>> Date:   Mon Feb 4 22:29:04 2008 -0800
>>>
>>> maps4: add /proc/pid/pagemap interface
>>>
>>> and there's no explanation for why it's needed.
>>>
>>> So maybe we should have a *number* (not a bit) that indicates the
>>> maximum address that mmap() will return unless an override is in
>use.
>>> Since common practice seems to be to stick this in the personality
>>> field, we may need some fancy encoding.  Executing a setuid binary
>>> needs to reset to the default, and personality handles that.
>>
>> If we want to be able to specify arbitrary address as maximum, a
>fancy
>> encoding would need to claim 51 bits (63 VA - 12 in-page address) on
>x86
>> from the persona flag.
>> To me, it's stretching personality interface too far.
>>
>> Maybe it's easier to reset the rlimit for suid binaries?
>
>I guess I don't see why rlimit makes any sense, though.  It's not a
>resource utilization control, hard vs soft limits make very little
>sense, requiring capabilities to exceed the hard limit doesn't help
>anything, and it's only useful to preserve it across execve() to work
>around bugs.
>
>So if it's going to be a number, let's just make it be a new number
>with a new API to control it.
>
>--Andy

It's an API that already exists, that's the plus. Hard and soft limits *do* 
make sense IMO.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: How should we handle variable address space sizes (Re: [RFC 3/4] x86/mm: define TASK_SIZE as current->mm->task_size)

2017-01-02 Thread Andy Lutomirski
On Mon, Jan 2, 2017 at 1:49 AM, Kirill A. Shutemov  wrote:
> On Fri, Dec 30, 2016 at 06:11:05PM -0800, Andy Lutomirski wrote:
>> On Fri, Dec 30, 2016 at 7:56 AM, Dmitry Safonov  
>> wrote:
>> > Keep task's virtual address space size as mm_struct field which
>> > exists for a long time - it's initialized in setup_new_exec()
>> > depending on the new task's personality.
>> > This way TASK_SIZE will always be the same as current->mm->task_size.
>> > Previously, there could be an issue about different values of
>> > TASK_SIZE and current->mm->task_size: e.g, a 32-bit process can unset
>> > ADDR_LIMIT_3GB personality (with personality syscall) and
>> > so TASK_SIZE will be 4Gb, which is larger than mm->task_size = 3Gb.
>> > As TASK_SIZE *and* current->mm->task_size are used both in code
>> > frequently, this difference creates a subtle situations, for example:
>> > one can mmap addresses > 3Gb, but they will be hidden in
>> > /proc/pid/pagemap as it checks mm->task_size.
>> > I've moved initialization of mm->task_size earlier in setup_new_exec()
>> > as arch_pick_mmap_layout() initializes mmap_legacy_base with
>> > TASK_UNMAPPED_BASE, which depends on TASK_SIZE.
>>
>> I don't like this patch so much because I think that we should figure
>> out how this will all work in the long run first.  I've added some
>> more people to the thread because other arches have similar issues and
>> because x86 is about to get considerably more complicated (choices
>> include 3GB, 4GB, 47-bit, and 56-bit (the latter IIRC)).
>>
>> Here are a few of my thoughts on the matter.  This isn't all that well
>> thought out:
>>
>> The address space limit, especially if CRIU is in play, isn't really a
>> hard limit.  For example, you could allocate high memory then lower
>> the limit.  Similarly, I see no reason that an x32 program should be
>> forbidden from mapping some high addresses or, similarly, that an i386
>> program can't (if it really wanted to) do a 64-bit mmap() and get a
>> high address.
>>
>> On that note, can we just *delete* the task_size check from pagemap?
>> It's been there since the very beginning:
>>
>> commit 85863e475e59afb027b0113290e3796ee6020b7d
>> Author: Matt Mackall 
>> Date:   Mon Feb 4 22:29:04 2008 -0800
>>
>> maps4: add /proc/pid/pagemap interface
>>
>> and there's no explanation for why it's needed.
>>
>> So maybe we should have a *number* (not a bit) that indicates the
>> maximum address that mmap() will return unless an override is in use.
>> Since common practice seems to be to stick this in the personality
>> field, we may need some fancy encoding.  Executing a setuid binary
>> needs to reset to the default, and personality handles that.
>
> If we want to be able to specify arbitrary address as maximum, a fancy
> encoding would need to claim 51 bits (63 VA - 12 in-page address) on x86
> from the persona flag.
> To me, it's stretching personality interface too far.
>
> Maybe it's easier to reset the rlimit for suid binaries?

I guess I don't see why rlimit makes any sense, though.  It's not a
resource utilization control, hard vs soft limits make very little
sense, requiring capabilities to exceed the hard limit doesn't help
anything, and it's only useful to preserve it across execve() to work
around bugs.

So if it's going to be a number, let's just make it be a new number
with a new API to control it.

--Andy


Re: How should we handle variable address space sizes (Re: [RFC 3/4] x86/mm: define TASK_SIZE as current->mm->task_size)

2017-01-02 Thread Andy Lutomirski
On Mon, Jan 2, 2017 at 1:49 AM, Kirill A. Shutemov  wrote:
> On Fri, Dec 30, 2016 at 06:11:05PM -0800, Andy Lutomirski wrote:
>> On Fri, Dec 30, 2016 at 7:56 AM, Dmitry Safonov  
>> wrote:
>> > Keep task's virtual address space size as mm_struct field which
>> > exists for a long time - it's initialized in setup_new_exec()
>> > depending on the new task's personality.
>> > This way TASK_SIZE will always be the same as current->mm->task_size.
>> > Previously, there could be an issue about different values of
>> > TASK_SIZE and current->mm->task_size: e.g, a 32-bit process can unset
>> > ADDR_LIMIT_3GB personality (with personality syscall) and
>> > so TASK_SIZE will be 4Gb, which is larger than mm->task_size = 3Gb.
>> > As TASK_SIZE *and* current->mm->task_size are used both in code
>> > frequently, this difference creates a subtle situations, for example:
>> > one can mmap addresses > 3Gb, but they will be hidden in
>> > /proc/pid/pagemap as it checks mm->task_size.
>> > I've moved initialization of mm->task_size earlier in setup_new_exec()
>> > as arch_pick_mmap_layout() initializes mmap_legacy_base with
>> > TASK_UNMAPPED_BASE, which depends on TASK_SIZE.
>>
>> I don't like this patch so much because I think that we should figure
>> out how this will all work in the long run first.  I've added some
>> more people to the thread because other arches have similar issues and
>> because x86 is about to get considerably more complicated (choices
>> include 3GB, 4GB, 47-bit, and 56-bit (the latter IIRC)).
>>
>> Here are a few of my thoughts on the matter.  This isn't all that well
>> thought out:
>>
>> The address space limit, especially if CRIU is in play, isn't really a
>> hard limit.  For example, you could allocate high memory then lower
>> the limit.  Similarly, I see no reason that an x32 program should be
>> forbidden from mapping some high addresses or, similarly, that an i386
>> program can't (if it really wanted to) do a 64-bit mmap() and get a
>> high address.
>>
>> On that note, can we just *delete* the task_size check from pagemap?
>> It's been there since the very beginning:
>>
>> commit 85863e475e59afb027b0113290e3796ee6020b7d
>> Author: Matt Mackall 
>> Date:   Mon Feb 4 22:29:04 2008 -0800
>>
>> maps4: add /proc/pid/pagemap interface
>>
>> and there's no explanation for why it's needed.
>>
>> So maybe we should have a *number* (not a bit) that indicates the
>> maximum address that mmap() will return unless an override is in use.
>> Since common practice seems to be to stick this in the personality
>> field, we may need some fancy encoding.  Executing a setuid binary
>> needs to reset to the default, and personality handles that.
>
> If we want to be able to specify arbitrary address as maximum, a fancy
> encoding would need to claim 51 bits (63 VA - 12 in-page address) on x86
> from the persona flag.
> To me, it's stretching personality interface too far.
>
> Maybe it's easier to reset the rlimit for suid binaries?

I guess I don't see why rlimit makes any sense, though.  It's not a
resource utilization control, hard vs soft limits make very little
sense, requiring capabilities to exceed the hard limit doesn't help
anything, and it's only useful to preserve it across execve() to work
around bugs.

So if it's going to be a number, let's just make it be a new number
with a new API to control it.

--Andy


Re: How should we handle variable address space sizes (Re: [RFC 3/4] x86/mm: define TASK_SIZE as current->mm->task_size)

2017-01-02 Thread Kirill A. Shutemov
On Fri, Dec 30, 2016 at 06:11:05PM -0800, Andy Lutomirski wrote:
> On Fri, Dec 30, 2016 at 7:56 AM, Dmitry Safonov  
> wrote:
> > Keep task's virtual address space size as mm_struct field which
> > exists for a long time - it's initialized in setup_new_exec()
> > depending on the new task's personality.
> > This way TASK_SIZE will always be the same as current->mm->task_size.
> > Previously, there could be an issue about different values of
> > TASK_SIZE and current->mm->task_size: e.g, a 32-bit process can unset
> > ADDR_LIMIT_3GB personality (with personality syscall) and
> > so TASK_SIZE will be 4Gb, which is larger than mm->task_size = 3Gb.
> > As TASK_SIZE *and* current->mm->task_size are used both in code
> > frequently, this difference creates a subtle situations, for example:
> > one can mmap addresses > 3Gb, but they will be hidden in
> > /proc/pid/pagemap as it checks mm->task_size.
> > I've moved initialization of mm->task_size earlier in setup_new_exec()
> > as arch_pick_mmap_layout() initializes mmap_legacy_base with
> > TASK_UNMAPPED_BASE, which depends on TASK_SIZE.
> 
> I don't like this patch so much because I think that we should figure
> out how this will all work in the long run first.  I've added some
> more people to the thread because other arches have similar issues and
> because x86 is about to get considerably more complicated (choices
> include 3GB, 4GB, 47-bit, and 56-bit (the latter IIRC)).
> 
> Here are a few of my thoughts on the matter.  This isn't all that well
> thought out:
> 
> The address space limit, especially if CRIU is in play, isn't really a
> hard limit.  For example, you could allocate high memory then lower
> the limit.  Similarly, I see no reason that an x32 program should be
> forbidden from mapping some high addresses or, similarly, that an i386
> program can't (if it really wanted to) do a 64-bit mmap() and get a
> high address.
> 
> On that note, can we just *delete* the task_size check from pagemap?
> It's been there since the very beginning:
> 
> commit 85863e475e59afb027b0113290e3796ee6020b7d
> Author: Matt Mackall 
> Date:   Mon Feb 4 22:29:04 2008 -0800
> 
> maps4: add /proc/pid/pagemap interface
> 
> and there's no explanation for why it's needed.
> 
> So maybe we should have a *number* (not a bit) that indicates the
> maximum address that mmap() will return unless an override is in use.
> Since common practice seems to be to stick this in the personality
> field, we may need some fancy encoding.  Executing a setuid binary
> needs to reset to the default, and personality handles that.

If we want to be able to specify arbitrary address as maximum, a fancy
encoding would need to claim 51 bits (63 VA - 12 in-page address) on x86
from the persona flag.
To me, it's stretching personality interface too far.

Maybe it's easier to reset the rlimit for suid binaries?

-- 
 Kirill A. Shutemov


Re: How should we handle variable address space sizes (Re: [RFC 3/4] x86/mm: define TASK_SIZE as current->mm->task_size)

2017-01-02 Thread Kirill A. Shutemov
On Fri, Dec 30, 2016 at 06:11:05PM -0800, Andy Lutomirski wrote:
> On Fri, Dec 30, 2016 at 7:56 AM, Dmitry Safonov  
> wrote:
> > Keep task's virtual address space size as mm_struct field which
> > exists for a long time - it's initialized in setup_new_exec()
> > depending on the new task's personality.
> > This way TASK_SIZE will always be the same as current->mm->task_size.
> > Previously, there could be an issue about different values of
> > TASK_SIZE and current->mm->task_size: e.g, a 32-bit process can unset
> > ADDR_LIMIT_3GB personality (with personality syscall) and
> > so TASK_SIZE will be 4Gb, which is larger than mm->task_size = 3Gb.
> > As TASK_SIZE *and* current->mm->task_size are used both in code
> > frequently, this difference creates a subtle situations, for example:
> > one can mmap addresses > 3Gb, but they will be hidden in
> > /proc/pid/pagemap as it checks mm->task_size.
> > I've moved initialization of mm->task_size earlier in setup_new_exec()
> > as arch_pick_mmap_layout() initializes mmap_legacy_base with
> > TASK_UNMAPPED_BASE, which depends on TASK_SIZE.
> 
> I don't like this patch so much because I think that we should figure
> out how this will all work in the long run first.  I've added some
> more people to the thread because other arches have similar issues and
> because x86 is about to get considerably more complicated (choices
> include 3GB, 4GB, 47-bit, and 56-bit (the latter IIRC)).
> 
> Here are a few of my thoughts on the matter.  This isn't all that well
> thought out:
> 
> The address space limit, especially if CRIU is in play, isn't really a
> hard limit.  For example, you could allocate high memory then lower
> the limit.  Similarly, I see no reason that an x32 program should be
> forbidden from mapping some high addresses or, similarly, that an i386
> program can't (if it really wanted to) do a 64-bit mmap() and get a
> high address.
> 
> On that note, can we just *delete* the task_size check from pagemap?
> It's been there since the very beginning:
> 
> commit 85863e475e59afb027b0113290e3796ee6020b7d
> Author: Matt Mackall 
> Date:   Mon Feb 4 22:29:04 2008 -0800
> 
> maps4: add /proc/pid/pagemap interface
> 
> and there's no explanation for why it's needed.
> 
> So maybe we should have a *number* (not a bit) that indicates the
> maximum address that mmap() will return unless an override is in use.
> Since common practice seems to be to stick this in the personality
> field, we may need some fancy encoding.  Executing a setuid binary
> needs to reset to the default, and personality handles that.

If we want to be able to specify arbitrary address as maximum, a fancy
encoding would need to claim 51 bits (63 VA - 12 in-page address) on x86
from the persona flag.
To me, it's stretching personality interface too far.

Maybe it's easier to reset the rlimit for suid binaries?

-- 
 Kirill A. Shutemov


How should we handle variable address space sizes (Re: [RFC 3/4] x86/mm: define TASK_SIZE as current->mm->task_size)

2016-12-30 Thread Andy Lutomirski
On Fri, Dec 30, 2016 at 7:56 AM, Dmitry Safonov  wrote:
> Keep task's virtual address space size as mm_struct field which
> exists for a long time - it's initialized in setup_new_exec()
> depending on the new task's personality.
> This way TASK_SIZE will always be the same as current->mm->task_size.
> Previously, there could be an issue about different values of
> TASK_SIZE and current->mm->task_size: e.g, a 32-bit process can unset
> ADDR_LIMIT_3GB personality (with personality syscall) and
> so TASK_SIZE will be 4Gb, which is larger than mm->task_size = 3Gb.
> As TASK_SIZE *and* current->mm->task_size are used both in code
> frequently, this difference creates a subtle situations, for example:
> one can mmap addresses > 3Gb, but they will be hidden in
> /proc/pid/pagemap as it checks mm->task_size.
> I've moved initialization of mm->task_size earlier in setup_new_exec()
> as arch_pick_mmap_layout() initializes mmap_legacy_base with
> TASK_UNMAPPED_BASE, which depends on TASK_SIZE.

I don't like this patch so much because I think that we should figure
out how this will all work in the long run first.  I've added some
more people to the thread because other arches have similar issues and
because x86 is about to get considerably more complicated (choices
include 3GB, 4GB, 47-bit, and 56-bit (the latter IIRC)).

Here are a few of my thoughts on the matter.  This isn't all that well
thought out:

The address space limit, especially if CRIU is in play, isn't really a
hard limit.  For example, you could allocate high memory then lower
the limit.  Similarly, I see no reason that an x32 program should be
forbidden from mapping some high addresses or, similarly, that an i386
program can't (if it really wanted to) do a 64-bit mmap() and get a
high address.

On that note, can we just *delete* the task_size check from pagemap?
It's been there since the very beginning:

commit 85863e475e59afb027b0113290e3796ee6020b7d
Author: Matt Mackall 
Date:   Mon Feb 4 22:29:04 2008 -0800

maps4: add /proc/pid/pagemap interface

and there's no explanation for why it's needed.

So maybe we should have a *number* (not a bit) that indicates the
maximum address that mmap() will return unless an override is in use.
Since common practice seems to be to stick this in the personality
field, we may need some fancy encoding.  Executing a setuid binary
needs to reset to the default, and personality handles that.

We should also probably come up with a reasonable set of getters and
setters for CRIU and otherwise.  I can think of the following
questions that might be asked:

 - What is the highest currently mapped VA?  (/proc can already answer
this.  If needed, a prctl could be added, too.  Using /proc is a bit
tricky due to out-of-range "gate areas", e.g. the x86 vsyscall page.)

 - What is the highest address that mmap() will return without being
forced?  On x86 and sparc, this could plausbly be extra complicated
because there are multiple "mmap()" syscalls (32-bit vs 64-bit), so an
i386 process could theoretically have a limit of 2^47-1, for example.
I doubt this matters much.  I'm also not sure whether this should be
per-task or per-mm.  I can see legitimate use cases to set this to
unusual numbers.  For example, some x86 program could want 2^51-1
because it wants that many high bits free, even though this doesn't
correspond to any particular hardware paging mode.  This probably
wants a getter and a setter.

 - What is the highest address that the hardware is capable of?  (By
this, I mean with the kernel's adjustments applied.  x86_64 is
currently capable of 2^47-1, but we artificially cap it to 2^47-4097
due to Intel having screwed up SYSRET.)  This is just TASK_SIZE_MAX.
I'm not sure we need a getter or a setter, since it doesn't seem all
that useful to know.

We still need to see if there should be some way for an ELF binary to
be tagged with its maximum supported virtual address.  If we do this,
I would suggest an ELF note as a straw-man initial proposal.  ELF
notes are nice because I think they can be generated cleanly from C or
asm without modifying binutils.

I wouldn't mind seeing TASK_SIZE go away completely in the long run,
since it's not clearly defined what it does.  Also, it's worth noting
that, in the absence of userspace making assumptions, no limit is
needed at all.  A 32-bit or x32 program should *automatically* get a
limited address simply by virtue of making an mmap() syscall with the
32-bit or x32 ABI.

Thoughts?

P.S. If we add an ELF note saying "supports 57 bits", I think we
should also add an ELF note saying "doesn't use vsyscalls", but that's
a bit off topic.

P.P.S. We can efficiently turn vsyscalls off per process.  I even have
code for it.  It's a bit tricky and abuses some paging bits, but it
works.


How should we handle variable address space sizes (Re: [RFC 3/4] x86/mm: define TASK_SIZE as current->mm->task_size)

2016-12-30 Thread Andy Lutomirski
On Fri, Dec 30, 2016 at 7:56 AM, Dmitry Safonov  wrote:
> Keep task's virtual address space size as mm_struct field which
> exists for a long time - it's initialized in setup_new_exec()
> depending on the new task's personality.
> This way TASK_SIZE will always be the same as current->mm->task_size.
> Previously, there could be an issue about different values of
> TASK_SIZE and current->mm->task_size: e.g, a 32-bit process can unset
> ADDR_LIMIT_3GB personality (with personality syscall) and
> so TASK_SIZE will be 4Gb, which is larger than mm->task_size = 3Gb.
> As TASK_SIZE *and* current->mm->task_size are used both in code
> frequently, this difference creates a subtle situations, for example:
> one can mmap addresses > 3Gb, but they will be hidden in
> /proc/pid/pagemap as it checks mm->task_size.
> I've moved initialization of mm->task_size earlier in setup_new_exec()
> as arch_pick_mmap_layout() initializes mmap_legacy_base with
> TASK_UNMAPPED_BASE, which depends on TASK_SIZE.

I don't like this patch so much because I think that we should figure
out how this will all work in the long run first.  I've added some
more people to the thread because other arches have similar issues and
because x86 is about to get considerably more complicated (choices
include 3GB, 4GB, 47-bit, and 56-bit (the latter IIRC)).

Here are a few of my thoughts on the matter.  This isn't all that well
thought out:

The address space limit, especially if CRIU is in play, isn't really a
hard limit.  For example, you could allocate high memory then lower
the limit.  Similarly, I see no reason that an x32 program should be
forbidden from mapping some high addresses or, similarly, that an i386
program can't (if it really wanted to) do a 64-bit mmap() and get a
high address.

On that note, can we just *delete* the task_size check from pagemap?
It's been there since the very beginning:

commit 85863e475e59afb027b0113290e3796ee6020b7d
Author: Matt Mackall 
Date:   Mon Feb 4 22:29:04 2008 -0800

maps4: add /proc/pid/pagemap interface

and there's no explanation for why it's needed.

So maybe we should have a *number* (not a bit) that indicates the
maximum address that mmap() will return unless an override is in use.
Since common practice seems to be to stick this in the personality
field, we may need some fancy encoding.  Executing a setuid binary
needs to reset to the default, and personality handles that.

We should also probably come up with a reasonable set of getters and
setters for CRIU and otherwise.  I can think of the following
questions that might be asked:

 - What is the highest currently mapped VA?  (/proc can already answer
this.  If needed, a prctl could be added, too.  Using /proc is a bit
tricky due to out-of-range "gate areas", e.g. the x86 vsyscall page.)

 - What is the highest address that mmap() will return without being
forced?  On x86 and sparc, this could plausbly be extra complicated
because there are multiple "mmap()" syscalls (32-bit vs 64-bit), so an
i386 process could theoretically have a limit of 2^47-1, for example.
I doubt this matters much.  I'm also not sure whether this should be
per-task or per-mm.  I can see legitimate use cases to set this to
unusual numbers.  For example, some x86 program could want 2^51-1
because it wants that many high bits free, even though this doesn't
correspond to any particular hardware paging mode.  This probably
wants a getter and a setter.

 - What is the highest address that the hardware is capable of?  (By
this, I mean with the kernel's adjustments applied.  x86_64 is
currently capable of 2^47-1, but we artificially cap it to 2^47-4097
due to Intel having screwed up SYSRET.)  This is just TASK_SIZE_MAX.
I'm not sure we need a getter or a setter, since it doesn't seem all
that useful to know.

We still need to see if there should be some way for an ELF binary to
be tagged with its maximum supported virtual address.  If we do this,
I would suggest an ELF note as a straw-man initial proposal.  ELF
notes are nice because I think they can be generated cleanly from C or
asm without modifying binutils.

I wouldn't mind seeing TASK_SIZE go away completely in the long run,
since it's not clearly defined what it does.  Also, it's worth noting
that, in the absence of userspace making assumptions, no limit is
needed at all.  A 32-bit or x32 program should *automatically* get a
limited address simply by virtue of making an mmap() syscall with the
32-bit or x32 ABI.

Thoughts?

P.S. If we add an ELF note saying "supports 57 bits", I think we
should also add an ELF note saying "doesn't use vsyscalls", but that's
a bit off topic.

P.P.S. We can efficiently turn vsyscalls off per process.  I even have
code for it.  It's a bit tricky and abuses some paging bits, but it
works.


[RFC 3/4] x86/mm: define TASK_SIZE as current->mm->task_size

2016-12-30 Thread Dmitry Safonov
Keep task's virtual address space size as mm_struct field which
exists for a long time - it's initialized in setup_new_exec()
depending on the new task's personality.
This way TASK_SIZE will always be the same as current->mm->task_size.
Previously, there could be an issue about different values of
TASK_SIZE and current->mm->task_size: e.g, a 32-bit process can unset
ADDR_LIMIT_3GB personality (with personality syscall) and
so TASK_SIZE will be 4Gb, which is larger than mm->task_size = 3Gb.
As TASK_SIZE *and* current->mm->task_size are used both in code
frequently, this difference creates a subtle situations, for example:
one can mmap addresses > 3Gb, but they will be hidden in
/proc/pid/pagemap as it checks mm->task_size.
I've moved initialization of mm->task_size earlier in setup_new_exec()
as arch_pick_mmap_layout() initializes mmap_legacy_base with
TASK_UNMAPPED_BASE, which depends on TASK_SIZE.

Signed-off-by: Dmitry Safonov 
---
 arch/x86/include/asm/processor.h | 17 +
 arch/x86/um/asm/segment.h|  2 +-
 arch/x86/xen/mmu.c   |  4 ++--
 fs/exec.c| 17 +++--
 4 files changed, 23 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index dbc7dec5fa84..47ebde614f06 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -768,10 +768,8 @@ static inline void spin_lock_prefetch(const void *x)
 /*
  * User space process size: 3GB (default).
  */
-#define TASK_SIZE  PAGE_OFFSET
-#define TASK_SIZE_MAX  TASK_SIZE
-#define STACK_TOP  TASK_SIZE
-#define STACK_TOP_MAX  STACK_TOP
+#define TASK_SIZE_MAX  PAGE_OFFSET
+#define INIT_TASK_SIZE TASK_SIZE_MAX
 
 #define INIT_THREAD  {   \
.sp0= TOP_OF_INIT_STACK,  \
@@ -817,12 +815,9 @@ static inline void spin_lock_prefetch(const void *x)
 #define IA32_PAGE_OFFSET   ((current->personality & ADDR_LIMIT_3GB) ? \
0xc000 : 0xe000)
 
-#define TASK_SIZE  (current->personality & ADDR_LIMIT_32BIT ? \
+#define INIT_TASK_SIZE (current->personality & ADDR_LIMIT_32BIT ? \
IA32_PAGE_OFFSET : TASK_SIZE_MAX)
 
-#define STACK_TOP  TASK_SIZE
-#define STACK_TOP_MAX  TASK_SIZE_MAX
-
 #define INIT_THREAD  { \
.sp0= TOP_OF_INIT_STACK,\
.addr_limit = KERNEL_DS,\
@@ -833,6 +828,12 @@ extern unsigned long KSTK_ESP(struct task_struct *task);
 
 #endif /* CONFIG_X86_64 */
 
+#define TASK_SIZE  \
+   ((current->mm) ? current->mm->task_size : TASK_SIZE_MAX)
+
+#define STACK_TOP  TASK_SIZE
+#define STACK_TOP_MAX  TASK_SIZE_MAX
+
 extern unsigned long thread_saved_pc(struct task_struct *tsk);
 
 extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
diff --git a/arch/x86/um/asm/segment.h b/arch/x86/um/asm/segment.h
index 41dd5e1f3cd7..3a9aa9f050df 100644
--- a/arch/x86/um/asm/segment.h
+++ b/arch/x86/um/asm/segment.h
@@ -13,6 +13,6 @@ typedef struct {
 
 #define MAKE_MM_SEG(s) ((mm_segment_t) { (s) })
 #define KERNEL_DS  MAKE_MM_SEG(~0UL)
-#define USER_DSMAKE_MM_SEG(TASK_SIZE)
+#define USER_DSMAKE_MM_SEG(TASK_SIZE_MAX)
 
 #endif
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 7d5afdb417cc..264ca3b7be58 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -830,7 +830,7 @@ static void __xen_pgd_pin(struct mm_struct *mm, pgd_t *pgd)
 #else /* CONFIG_X86_32 */
 #ifdef CONFIG_X86_PAE
/* Need to make sure unshared kernel PMD is pinnable */
-   xen_pin_page(mm, pgd_page(pgd[pgd_index(TASK_SIZE)]),
+   xen_pin_page(mm, pgd_page(pgd[pgd_index(TASK_SIZE_MAX)]),
 PT_PMD);
 #endif
xen_do_pin(MMUEXT_PIN_L3_TABLE, PFN_DOWN(__pa(pgd)));
@@ -949,7 +949,7 @@ static void __xen_pgd_unpin(struct mm_struct *mm, pgd_t 
*pgd)
 
 #ifdef CONFIG_X86_PAE
/* Need to make sure unshared kernel PMD is unpinned */
-   xen_unpin_page(mm, pgd_page(pgd[pgd_index(TASK_SIZE)]),
+   xen_unpin_page(mm, pgd_page(pgd[pgd_index(TASK_SIZE_MAX)]),
   PT_PMD);
 #endif
 
diff --git a/fs/exec.c b/fs/exec.c
index e57946610733..826b73600fc2 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1303,8 +1303,19 @@ void would_dump(struct linux_binprm *bprm, struct file 
*file)
 }
 EXPORT_SYMBOL(would_dump);
 
+#ifndef INIT_TASK_SIZE
+#define INIT_TASK_SIZE TASK_SIZE
+#endif
+
 void setup_new_exec(struct linux_binprm * bprm)
 {
+
+   /* Set the new mm task size. We have to do that late because it may
+* depend on TIF_32BIT which is only updated 

[RFC 3/4] x86/mm: define TASK_SIZE as current->mm->task_size

2016-12-30 Thread Dmitry Safonov
Keep task's virtual address space size as mm_struct field which
exists for a long time - it's initialized in setup_new_exec()
depending on the new task's personality.
This way TASK_SIZE will always be the same as current->mm->task_size.
Previously, there could be an issue about different values of
TASK_SIZE and current->mm->task_size: e.g, a 32-bit process can unset
ADDR_LIMIT_3GB personality (with personality syscall) and
so TASK_SIZE will be 4Gb, which is larger than mm->task_size = 3Gb.
As TASK_SIZE *and* current->mm->task_size are used both in code
frequently, this difference creates a subtle situations, for example:
one can mmap addresses > 3Gb, but they will be hidden in
/proc/pid/pagemap as it checks mm->task_size.
I've moved initialization of mm->task_size earlier in setup_new_exec()
as arch_pick_mmap_layout() initializes mmap_legacy_base with
TASK_UNMAPPED_BASE, which depends on TASK_SIZE.

Signed-off-by: Dmitry Safonov 
---
 arch/x86/include/asm/processor.h | 17 +
 arch/x86/um/asm/segment.h|  2 +-
 arch/x86/xen/mmu.c   |  4 ++--
 fs/exec.c| 17 +++--
 4 files changed, 23 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index dbc7dec5fa84..47ebde614f06 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -768,10 +768,8 @@ static inline void spin_lock_prefetch(const void *x)
 /*
  * User space process size: 3GB (default).
  */
-#define TASK_SIZE  PAGE_OFFSET
-#define TASK_SIZE_MAX  TASK_SIZE
-#define STACK_TOP  TASK_SIZE
-#define STACK_TOP_MAX  STACK_TOP
+#define TASK_SIZE_MAX  PAGE_OFFSET
+#define INIT_TASK_SIZE TASK_SIZE_MAX
 
 #define INIT_THREAD  {   \
.sp0= TOP_OF_INIT_STACK,  \
@@ -817,12 +815,9 @@ static inline void spin_lock_prefetch(const void *x)
 #define IA32_PAGE_OFFSET   ((current->personality & ADDR_LIMIT_3GB) ? \
0xc000 : 0xe000)
 
-#define TASK_SIZE  (current->personality & ADDR_LIMIT_32BIT ? \
+#define INIT_TASK_SIZE (current->personality & ADDR_LIMIT_32BIT ? \
IA32_PAGE_OFFSET : TASK_SIZE_MAX)
 
-#define STACK_TOP  TASK_SIZE
-#define STACK_TOP_MAX  TASK_SIZE_MAX
-
 #define INIT_THREAD  { \
.sp0= TOP_OF_INIT_STACK,\
.addr_limit = KERNEL_DS,\
@@ -833,6 +828,12 @@ extern unsigned long KSTK_ESP(struct task_struct *task);
 
 #endif /* CONFIG_X86_64 */
 
+#define TASK_SIZE  \
+   ((current->mm) ? current->mm->task_size : TASK_SIZE_MAX)
+
+#define STACK_TOP  TASK_SIZE
+#define STACK_TOP_MAX  TASK_SIZE_MAX
+
 extern unsigned long thread_saved_pc(struct task_struct *tsk);
 
 extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
diff --git a/arch/x86/um/asm/segment.h b/arch/x86/um/asm/segment.h
index 41dd5e1f3cd7..3a9aa9f050df 100644
--- a/arch/x86/um/asm/segment.h
+++ b/arch/x86/um/asm/segment.h
@@ -13,6 +13,6 @@ typedef struct {
 
 #define MAKE_MM_SEG(s) ((mm_segment_t) { (s) })
 #define KERNEL_DS  MAKE_MM_SEG(~0UL)
-#define USER_DSMAKE_MM_SEG(TASK_SIZE)
+#define USER_DSMAKE_MM_SEG(TASK_SIZE_MAX)
 
 #endif
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 7d5afdb417cc..264ca3b7be58 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -830,7 +830,7 @@ static void __xen_pgd_pin(struct mm_struct *mm, pgd_t *pgd)
 #else /* CONFIG_X86_32 */
 #ifdef CONFIG_X86_PAE
/* Need to make sure unshared kernel PMD is pinnable */
-   xen_pin_page(mm, pgd_page(pgd[pgd_index(TASK_SIZE)]),
+   xen_pin_page(mm, pgd_page(pgd[pgd_index(TASK_SIZE_MAX)]),
 PT_PMD);
 #endif
xen_do_pin(MMUEXT_PIN_L3_TABLE, PFN_DOWN(__pa(pgd)));
@@ -949,7 +949,7 @@ static void __xen_pgd_unpin(struct mm_struct *mm, pgd_t 
*pgd)
 
 #ifdef CONFIG_X86_PAE
/* Need to make sure unshared kernel PMD is unpinned */
-   xen_unpin_page(mm, pgd_page(pgd[pgd_index(TASK_SIZE)]),
+   xen_unpin_page(mm, pgd_page(pgd[pgd_index(TASK_SIZE_MAX)]),
   PT_PMD);
 #endif
 
diff --git a/fs/exec.c b/fs/exec.c
index e57946610733..826b73600fc2 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1303,8 +1303,19 @@ void would_dump(struct linux_binprm *bprm, struct file 
*file)
 }
 EXPORT_SYMBOL(would_dump);
 
+#ifndef INIT_TASK_SIZE
+#define INIT_TASK_SIZE TASK_SIZE
+#endif
+
 void setup_new_exec(struct linux_binprm * bprm)
 {
+
+   /* Set the new mm task size. We have to do that late because it may
+* depend on TIF_32BIT which is only updated in flush_thread() on
+