Re: [RFC, PATCHv1 00/28] 5-level paging

2016-12-13 Thread Dave Hansen
On 12/08/2016 09:01 PM, Ingo Molnar wrote:
>> >   - Handle opt-in wider address space for userspace.
>> > 
>> > Not all userspace is ready to handle addresses wider than current
>> > 47-bits. At least some JIT compiler make use of upper bits to encode
>> > their info.
>> > 
>> > We need to have an interface to opt-in wider addresses from userspace
>> > to avoid regressions.
>> > 
>> > For now, I've included testing-only patch which bumps TASK_SIZE to
>> > 56-bits. This can be handy for testing to see what breaks if we max-out
>> > size of virtual address space.
> So this is just a detail - but it sounds a bit limiting to me to provide an
> 'opt in' flag for something that will work just fine on the vast majority of
> 64-bit software.

MPX is going to be a real pain here.  It is relatively transparent to
applications that use it, and old MPX binaries are entirely incompatible
with the new address space size, so an opt-out wouldn't be friendly.

Because the top-level MPX bounds table is indexed by the virtual
address, a growth in vaddr space is going to require the table to grow
(or change somehow).  The solution baked into the hardware spec is to
just make the top-level table 512x larger to accommodate the 512x
increase in vaddr space.  (This behavior is controlled by a new MSR, btw...)

So, either we disable MPX on all old MPX binaries by returning an error
when the prctl() tries to enable MPX and 5-level paging is on, or we go
with some form of an opt-in.  New MPX binaries will opt-in to the larger
address space since they know to allocate the new, larger table.
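
To put numbers on the 512x growth, here is a rough sketch of the directory sizing. It assumes the usual 64-bit MPX layout (8-byte bounds directory entries, with the low 20 address bits resolved inside a bounds table rather than the directory); the constant and function names are made up for illustration and are not kernel or SDM identifiers.

```c
#include <stdint.h>

/* Assumed 64-bit MPX layout, for illustration only. */
#define MPX_BD_ENTRY_BYTES 8ULL /* one pointer-sized entry per bounds table */
#define MPX_BT_ADDR_BITS   20   /* low address bits handled by a bounds table */

/* Bytes of bounds directory needed to index va_bits of linear address. */
uint64_t mpx_bd_bytes(unsigned va_bits)
{
    return (1ULL << (va_bits - MPX_BT_ADDR_BITS)) * MPX_BD_ENTRY_BYTES;
}
```

Under those assumptions a 48-bit linear address needs a 2 GiB directory and a 57-bit one needs 1 TiB, which is the 512x factor. An old binary that only reserved the smaller table has no way to cover the rest.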


Re: [RFC, PATCHv1 00/28] 5-level paging

2016-12-09 Thread Kirill A. Shutemov
On Fri, Dec 09, 2016 at 08:40:11AM -0800, Andi Kleen wrote:
> > On other hand, large virtual address space would put more pressure on
> > cache -- at least one more page table per process, if we make 56-bit VA
> > default.
> 
> The top level page always has to be there unless you disable it at boot time
> (unless you go for a scheme where some processes share top level pages, and
> others do not, which would likely be very complicated)
> 
> But even with that it is more than one: A typical set up has at least two
> extra 4K pages overhead, one for the bottom and one for the top mappings.
> Could easily be more.

So, right, one page for pgd, which we can't easily avoid.

If we limit VA to 47-bits by default, we would have one p4d page as the
range will be covered by one entry in pgd.

If we go to 56-bit VA by default, we would have at least two p4d pages
even for small processes. This is where my "at least one more page table
per process" comes from.

That's a waste of memory and potentially cache. I don't think it's
justified.
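
A small sketch of the arithmetic, assuming 5-level paging where each pgd entry maps 2^48 bytes and each table has 512 entries. The names below are illustrative, not the kernel's macros.

```c
#include <stdint.h>

#define PGD_SHIFT_5LVL 48  /* assumed: one pgd entry covers 256 TiB */
#define PTRS_PER_TABLE 512

/* Index of the pgd entry that maps addr under 5-level paging. */
unsigned pgd_index_5lvl(uint64_t addr)
{
    return (unsigned)((addr >> PGD_SHIFT_5LVL) & (PTRS_PER_TABLE - 1));
}

/*
 * A small process has mappings near the bottom (binary, heap) and near
 * the top (stack) of its address space. Each distinct pgd entry in use
 * needs its own p4d page.
 */
unsigned p4d_pages_needed(uint64_t low_addr, uint64_t high_addr)
{
    return pgd_index_5lvl(low_addr) == pgd_index_5lvl(high_addr) ? 1 : 2;
}
```

With the stack just below 2^47 both mappings share pgd entry 0, so one p4d page is enough. With the stack just below 2^56 it lands in pgd entry 255, and a second p4d page is needed.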

-- 
 Kirill A. Shutemov


Re: [RFC, PATCHv1 00/28] 5-level paging

2016-12-09 Thread Dave Hansen
On 12/09/2016 02:37 AM, Kirill A. Shutemov wrote:
> On other hand, large virtual address space would put more pressure on
> cache -- at least one more page table per process, if we make 56-bit VA
> default.

For a process only using a small amount of its address space, the
mid-level paging structure caches will be very effective since the page
walks are all very similar.  You may take a cache miss on the extra
level on the *first* walk, but you only do that once per context switch.
I bet the CPU is also pretty aggressive about filling those things when
it sees a new CR3 and they've been forcibly emptied.  So, you may never
even _see_ the latency from that extra miss.


Re: [RFC, PATCHv1 00/28] 5-level paging

2016-12-09 Thread Andi Kleen
> On other hand, large virtual address space would put more pressure on
> cache -- at least one more page table per process, if we make 56-bit VA
> default.

The top level page always has to be there unless you disable it at boot time
(unless you go for a scheme where some processes share top level pages, and
others do not, which would likely be very complicated)

But even with that it is more than one: A typical set up has at least two extra
4K pages overhead, one for the bottom and one for the top mappings. Could
easily be more.

-Andi



Re: [RFC, PATCHv1 00/28] 5-level paging

2016-12-09 Thread Catalin Marinas
On Fri, Dec 09, 2016 at 11:24:12AM +0100, Arnd Bergmann wrote:
> On Friday, December 9, 2016 6:01:30 AM CET Ingo Molnar wrote:
> > >   - Handle opt-in wider address space for userspace.
> > > 
> > > Not all userspace is ready to handle addresses wider than current
> > > 47-bits. At least some JIT compiler make use of upper bits to encode
> > > their info.
> > > 
> > > We need to have an interface to opt-in wider addresses from userspace
> > > to avoid regressions.
> > > 
> > > For now, I've included testing-only patch which bumps TASK_SIZE to
> > > 56-bits. This can be handy for testing to see what breaks if we max-out
> > > size of virtual address space.
> > 
> > So this is just a detail - but it sounds a bit limiting to me to provide an
> > 'opt in' flag for something that will work just fine on the vast majority of
> > 64-bit software.
> > 
> > Please make this an opt out compatibility flag instead: similar to how we
> > handle address space layout limitations/quirks ABI details, such as
> > ADDR_LIMIT_32BIT, ADDR_LIMIT_3GB, ADDR_COMPAT_LAYOUT, READ_IMPLIES_EXEC, etc.
> 
> We've had a similar discussion about JIT software on ARM64, which has a wide
> range of supported page table layouts and some software wants to limit that
> to a specific number.
> 
> I don't remember the outcome of that discussion, but I'm adding a few people
> to Cc that might remember.

The arm64 kernel supports several user VA space configurations (though
commonly 39 and 48-bit) and has had these from the initial port. We
realised that certain JITs (e.g.
https://bugzilla.mozilla.org/show_bug.cgi?id=1143022) and IIRC LLVM
assume a 47-bit user VA but AFAICT, most have been fixed.

ARMv8.1 also supports 52-bit VA (though only with 64K pages and we
haven't added support for it yet). However, it's likely that if we make
a 52-bit TASK_SIZE the default, we will break some user
assumptions. While arguably that's not necessarily ABI, if a user relies
on a 47 or 48-bit VA the kernel shouldn't break it. So I'm strongly
inclined to make the 52-bit TASK_SIZE an opt-in on arm64.

-- 
Catalin


Re: [RFC, PATCHv1 00/28] 5-level paging

2016-12-09 Thread Kirill A. Shutemov
On Fri, Dec 09, 2016 at 06:01:30AM +0100, Ingo Molnar wrote:
> 
> * Kirill A. Shutemov  wrote:
> 
> > x86-64 is currently limited to 256 TiB of virtual address space and 64 TiB
> > of physical address space. We are already bumping into this limit: some
> > vendors offers servers with 64 TiB of memory today.
> > 
> > To overcome the limitation upcoming hardware will introduce support for
> > 5-level paging[1]. It is a straight-forward extension of the current page
> > table structure adding one more layer of translation.
> > 
> > It bumps the limits to 128 PiB of virtual address space and 4 PiB of
> > physical address space. This "ought to be enough for anybody" ©.
> > 
> > This patchset is still very early. There are a number of things missing
> > that we have to do before asking anyone to merge it (listed below).
> > It would be great if folks can start testing applications now (in QEMU) to
> > look for breakage.
> > Any early comments on the design or the patches would be appreciated as
> > well.
> > 
> > More details on the design and what’s left to implement are below.
> 
> The patches don't look too painful, so no big complaints from me - kudos!

Thanks.

> > There is still work to do:
> > 
> >   - Boot-time switch between 4- and 5-level paging.
> > 
> > We assume that distributions will be keen to avoid returning to the
> > i386 days where we shipped one kernel binary for each page table
> > layout.
> 
> Absolutely.
> 
> > As page table format is the same for 4- and 5-level paging it should
> > be possible to have single kernel binary and switch between them at
> > boot-time without too much hassle.
> > 
> > For now I only implemented compile-time switch.
> > 
> > I hoped to bring this feature with separate patchset once basic
> > enabling is in upstream.
> > 
> > Is it okay?
> 
> LGTM, but we would eventually want to convert this kind of crazy open coding:
> 
> pgd_t *pgd, *pgd_ref;
> p4d_t *p4d, *p4d_ref;
> pud_t *pud, *pud_ref;
> pmd_t *pmd, *pmd_ref;
> pte_t *pte, *pte_ref;
> 
> To something saner that iterates and navigates the page table hierarchy in an
> extensible fashion. That would also make it (much) easier to make the paging
> depth boot time switchable.

Yes, it would be nice to replace all these p??_t with something more
flexible. But there's no obviously right design for such a transition.

I would rather not tie it to the boot-time switch for paging, but have
a separate experimental patchset. One day...

> Somehow I'm quite certain we'll see requests for more than 4 PiB memory in
> our lifetimes.
> 
> In a decade or two once global warming really gets going, especially after
> Trump & Republicans & Old Energy implement their billionaire welfare policies
> to mine, sell and burn even more coal & oil without paying for the damage
> caused, the U.S. meteorology clusters tracking Category 6 hurricanes in the
> Atlantic (capable of 1+ trillion dollars damage) in near real time at 1 meter
> resolution will have to run on something capable, right?
> 
> >   - Handle opt-in wider address space for userspace.
> > 
> > Not all userspace is ready to handle addresses wider than current
> > 47-bits. At least some JIT compiler make use of upper bits to encode
> > their info.
> > 
> > We need to have an interface to opt-in wider addresses from userspace
> > to avoid regressions.
> > 
> > For now, I've included testing-only patch which bumps TASK_SIZE to
> > 56-bits. This can be handy for testing to see what breaks if we max-out
> > size of virtual address space.
> 
> So this is just a detail - but it sounds a bit limiting to me to provide an
> 'opt in' flag for something that will work just fine on the vast majority of
> 64-bit software.
> 
> Please make this an opt out compatibility flag instead: similar to how we
> handle address space layout limitations/quirks ABI details, such as
> ADDR_LIMIT_32BIT, ADDR_LIMIT_3GB, ADDR_COMPAT_LAYOUT, READ_IMPLIES_EXEC, etc.

Well, that's true that most userspace can handle wide addresses just fine.
But even by simply booting Fedora on QEMU I see one SIGSEGV for this
reason: libmozjs-17.0.so cannot handle it (polkitd linked with it, hell
knows why).

I think keeping software from crashing is kind of a priority in this
transition.
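
For anyone wondering how this kind of breakage happens, here is a hypothetical tagging scheme of the sort a JIT might use. It is not taken from libmozjs; the 47-bit payload width and the function names are assumptions made for illustration.

```c
#include <stdint.h>

#define PTR_BITS 47                        /* assumed user VA width */
#define PTR_MASK ((1ULL << PTR_BITS) - 1)

/* Pack a 17-bit type tag above a pointer that is assumed to fit in 47 bits. */
uint64_t box_ptr(uint64_t addr, uint64_t tag)
{
    return (tag << PTR_BITS) | (addr & PTR_MASK);
}

/* Recover the pointer by dropping the tag bits. */
uint64_t unbox_ptr(uint64_t boxed)
{
    return boxed & PTR_MASK;
}

/* 1 if the address round-trips through boxing unchanged, 0 otherwise. */
int survives_boxing(uint64_t addr)
{
    return unbox_ptr(box_ptr(addr, 0x1fff1)) == addr;
}
```

Any address with bits set at or above bit 47 is silently truncated on the way back out, so the first dereference of an mmap() result from the wider range faults.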

Beyond that, most software would not benefit much from a large virtual
address space. Okay, there are more bits for ASLR, but that's it.

On the other hand, a large virtual address space would put more pressure on
cache -- at least one more page table per process, if we make 56-bit VA
the default.

-- 
 Kirill A. Shutemov


Re: [RFC, PATCHv1 00/28] 5-level paging

2016-12-09 Thread Arnd Bergmann
On Friday, December 9, 2016 6:01:30 AM CET Ingo Molnar wrote:
> >   - Handle opt-in wider address space for userspace.
> > 
> > Not all userspace is ready to handle addresses wider than current
> > 47-bits. At least some JIT compiler make use of upper bits to encode
> > their info.
> > 
> > We need to have an interface to opt-in wider addresses from userspace
> > to avoid regressions.
> > 
> > For now, I've included testing-only patch which bumps TASK_SIZE to
> > 56-bits. This can be handy for testing to see what breaks if we max-out
> > size of virtual address space.
> 
> So this is just a detail - but it sounds a bit limiting to me to provide an
> 'opt in' flag for something that will work just fine on the vast majority of
> 64-bit software.
> 
> Please make this an opt out compatibility flag instead: similar to how we
> handle address space layout limitations/quirks ABI details, such as
> ADDR_LIMIT_32BIT, ADDR_LIMIT_3GB, ADDR_COMPAT_LAYOUT, READ_IMPLIES_EXEC, etc.

We've had a similar discussion about JIT software on ARM64, which has a wide
range of supported page table layouts and some software wants to limit that
to a specific number.

I don't remember the outcome of that discussion, but I'm adding a few people
to Cc that might remember.

There have also been some discussions in the past to make the depth of the
page table a per-task decision on s390, since you may have some tasks that
run just fine with two or three levels of paging while another task actually
wants the full 64-bit address space. I wonder how much extra work this would
be on top of the boot-time option.

Arnd


Re: [RFC, PATCHv1 00/28] 5-level paging

2016-12-08 Thread Ingo Molnar

* Kirill A. Shutemov  wrote:

> x86-64 is currently limited to 256 TiB of virtual address space and 64 TiB
> of physical address space. We are already bumping into this limit: some
> vendors offers servers with 64 TiB of memory today.
> 
> To overcome the limitation upcoming hardware will introduce support for
> 5-level paging[1]. It is a straight-forward extension of the current page
> table structure adding one more layer of translation.
> 
> It bumps the limits to 128 PiB of virtual address space and 4 PiB of
> physical address space. This "ought to be enough for anybody" ©.
> 
> This patchset is still very early. There are a number of things missing
> that we have to do before asking anyone to merge it (listed below).
> It would be great if folks can start testing applications now (in QEMU) to
> look for breakage.
> Any early comments on the design or the patches would be appreciated as
> well.
> 
> More details on the design and what’s left to implement are below.

The patches don't look too painful, so no big complaints from me - kudos!

> There is still work to do:
> 
>   - Boot-time switch between 4- and 5-level paging.
> 
> We assume that distributions will be keen to avoid returning to the
> i386 days where we shipped one kernel binary for each page table
> layout.

Absolutely.

> As page table format is the same for 4- and 5-level paging it should
> be possible to have single kernel binary and switch between them at
> boot-time without too much hassle.
> 
> For now I only implemented compile-time switch.
> 
> I hoped to bring this feature with separate patchset once basic
> enabling is in upstream.
> 
> Is it okay?

LGTM, but we would eventually want to convert this kind of crazy open coding:

pgd_t *pgd, *pgd_ref;
p4d_t *p4d, *p4d_ref;
pud_t *pud, *pud_ref;
pmd_t *pmd, *pmd_ref;
pte_t *pte, *pte_ref;

To something saner that iterates and navigates the page table hierarchy in an
extensible fashion. That would also make it (much) easier to make the paging
depth boot time switchable.
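
As a very rough illustration of the direction (a userspace sketch, not a proposal for the actual kernel interface), the levels could be described by a table and walked in a loop, with the depth as a runtime parameter. All names here are hypothetical, and the shifts assume 4K pages with 512-entry tables.

```c
#include <stddef.h>
#include <stdint.h>

/* One descriptor per level, top level first. Names are for readability only. */
struct pt_level {
    const char *name;
    unsigned shift;
};

static const struct pt_level pt_levels[] = {
    { "pgd", 48 }, { "p4d", 39 }, { "pud", 30 }, { "pmd", 21 }, { "pte", 12 },
};

#define PT_MAX_LEVELS (sizeof(pt_levels) / sizeof(pt_levels[0]))

/*
 * Split addr into per-level table indices, top level first.
 * depth is the number of levels in use (4 or 5 on x86-64); a shallower
 * walk simply starts further down the descriptor table.
 * Returns the number of indices written, or 0 for an invalid depth.
 */
size_t pt_decompose(uint64_t addr, size_t depth, unsigned idx[])
{
    if (depth == 0 || depth > PT_MAX_LEVELS)
        return 0;

    size_t first = PT_MAX_LEVELS - depth;
    for (size_t i = 0; i < depth; i++)
        idx[i] = (unsigned)((addr >> pt_levels[first + i].shift) & 511);
    return depth;
}
```

With depth 5 the walk starts at the 48-bit level, with depth 4 at the 39-bit level, and the loop body is the same either way. That is the property that would make a boot-time switch cheap.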

Somehow I'm quite certain we'll see requests for more than 4 PiB memory in our
lifetimes.

In a decade or two once global warming really gets going, especially after
Trump & Republicans & Old Energy implement their billionaire welfare policies
to mine, sell and burn even more coal & oil without paying for the damage
caused, the U.S. meteorology clusters tracking Category 6 hurricanes in the
Atlantic (capable of 1+ trillion dollars damage) in near real time at 1 meter
resolution will have to run on something capable, right?

>   - Handle opt-in wider address space for userspace.
> 
> Not all userspace is ready to handle addresses wider than current
> 47-bits. At least some JIT compiler make use of upper bits to encode
> their info.
> 
> We need to have an interface to opt-in wider addresses from userspace
> to avoid regressions.
> 
> For now, I've included testing-only patch which bumps TASK_SIZE to
> 56-bits. This can be handy for testing to see what breaks if we max-out
> size of virtual address space.

So this is just a detail - but it sounds a bit limiting to me to provide an
'opt in' flag for something that will work just fine on the vast majority of
64-bit software.

Please make this an opt out compatibility flag instead: similar to how we
handle address space layout limitations/quirks ABI details, such as
ADDR_LIMIT_32BIT, ADDR_LIMIT_3GB, ADDR_COMPAT_LAYOUT, READ_IMPLIES_EXEC, etc.
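
For reference, this is roughly how the existing flags are used from userspace today via personality(2). The sketch uses only flags that already exist; no flag for limiting the address width exists yet, so none is shown, and the helper names are made up for the example.

```c
#include <sys/personality.h>

/* Pure helpers so the flag arithmetic is easy to check. */
unsigned long persona_with_flag(unsigned long persona, unsigned long flag)
{
    return persona | flag;
}

int persona_has_flag(unsigned long persona, unsigned long flag)
{
    return (persona & flag) != 0;
}

/*
 * Add one compatibility flag to the calling process's persona.
 * personality(0xffffffff) queries the current value without changing it.
 * Returns 0 on success, -1 on failure.
 */
int persona_enable(unsigned long flag)
{
    int cur = personality(0xffffffff);

    if (cur == -1)
        return -1;
    if (personality(persona_with_flag((unsigned long)cur, flag)) == -1)
        return -1;
    return 0;
}
```

A wrapper would typically call something like persona_enable(ADDR_COMPAT_LAYOUT) and then exec the legacy binary, since the persona is inherited across execve(). An opt-out for wide addresses could plausibly follow the same pattern.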

Thanks,

Ingo


Re: [RFC, PATCHv1 00/28] 5-level paging

2016-12-08 Thread Ingo Molnar

* Kirill A. Shutemov  wrote:

> x86-64 is currently limited to 256 TiB of virtual address space and 64 TiB
> of physical address space. We are already bumping into this limit: some
> vendors offers servers with 64 TiB of memory today.
> 
> To overcome the limitation upcoming hardware will introduce support for
> 5-level paging[1]. It is a straight-forward extension of the current page
> table structure adding one more layer of translation.
> 
> It bumps the limits to 128 PiB of virtual address space and 4 PiB of
> physical address space. This "ought to be enough for anybody" ©.
> 
> This patchset is still very early. There are a number of things missing
> that we have to do before asking anyone to merge it (listed below).
> It would be great if folks can start testing applications now (in QEMU) to
> look for breakage.
> Any early comments on the design or the patches would be appreciated as
> well.
> 
> More details on the design and what’s left to implement are below.

The patches don't look too painful, so no big complaints from me - kudos!

> There is still work to do:
> 
>   - Boot-time switch between 4- and 5-level paging.
> 
> We assume that distributions will be keen to avoid returning to the
> i386 days where we shipped one kernel binary for each page table
> layout.

Absolutely.

> As page table format is the same for 4- and 5-level paging it should
> be possible to have single kernel binary and switch between them at
> boot-time without too much hassle.
> 
> For now I only implemented compile-time switch.
> 
> I hoped to bring this feature with separate patchset once basic
> enabling is in upstream.
> 
> Is it okay?

LGTM, but we would eventually want to convert this kind of crazy open coding:

pgd_t *pgd, *pgd_ref;
p4d_t *p4d, *p4d_ref;
pud_t *pud, *pud_ref;
pmd_t *pmd, *pmd_ref;
pte_t *pte, *pte_ref;

To something saner that iterates and navigates the page table hierarchy in an 
extensible fashion. That would also make it (much) easier to make the paging 
depth 
boot time switchable.

Somehow I'm quite certain we'll see requests for more than 4 PiB memory in our 
lifetimes.

In a decade or two once global warming really gets going, especially after 
Trump & 
Republicans & Old Energy implement their billionaire welfare policies to mine, 
sell and burn even more coal & oil without paying for the damage caused, the 
U.S. 
meteorology clusters tracking Category 6 hurricanes in the Atlantic (capable of 
1+ 
trillion dollars damage) in near real time at 1 meter resolution will have to 
run 
on something capable, right?

>   - Handle opt-in wider address space for userspace.
> 
> Not all userspace is ready to handle addresses wider than current
> 47-bits. At least some JIT compiler make use of upper bits to encode
> their info.
> 
> We need to have an interface to opt-in wider addresses from userspace
> to avoid regressions.
> 
> For now, I've included testing-only patch which bumps TASK_SIZE to
> 56-bits. This can be handy for testing to see what breaks if we max-out
> size of virtual address space.
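
To make the JIT concern concrete, here is a minimal sketch of the kind of
pointer tagging that assumes user addresses fit in 47 bits. TAG_SHIFT and
the helper names are made up for illustration; real JITs use a variety of
encodings.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the runtime believes user addresses never exceed 47 bits. */
#define TAG_SHIFT 47
#define ADDR_MASK ((UINT64_C(1) << TAG_SHIFT) - 1)

/* Pack a 16-bit tag into the bits above the assumed address width. */
static uint64_t tag_pointer(uint64_t addr, uint16_t tag)
{
    return (addr & ADDR_MASK) | ((uint64_t)tag << TAG_SHIFT);
}

/* Recover the address by discarding everything above bit 46. */
static uint64_t untag_pointer(uint64_t boxed)
{
    return boxed & ADDR_MASK;
}

/* True if the address comes back unchanged after a tag/untag round trip. */
static bool survives_tagging(uint64_t addr)
{
    return untag_pointer(tag_pointer(addr, 0x1234)) == addr;
}
```

An address below 2^47 round-trips intact, while one that uses any of bits
47 to 55 loses those bits, which is the sort of breakage the 56-bit
TASK_SIZE testing patch should expose.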

So this is just a detail - but it sounds a bit limiting to me to provide an
'opt in' flag for something that will work just fine on the vast majority of
64-bit software.

Please make this an opt out compatibility flag instead: similar to how we
handle address space layout limitations/quirks ABI details, such as
ADDR_LIMIT_32BIT, ADDR_LIMIT_3GB, ADDR_COMPAT_LAYOUT, READ_IMPLIES_EXEC, etc.
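
For reference, this is roughly how the existing flags are inspected and set
from userspace today through personality(2). The helper names are
hypothetical, and no flag for limiting the address space to 47 bits exists
in this sketch: it only shows the mechanism that an opt-out would
presumably plug into.

```c
#include <stdbool.h>
#include <sys/personality.h>

/* Query the current persona; 0xffffffff means "do not change anything". */
int current_persona(void)
{
    return personality(0xffffffff);
}

/* True if every bit of `flag` is set in `persona`. */
bool persona_has(unsigned long persona, unsigned long flag)
{
    return (persona & flag) == flag;
}

/*
 * Add one compatibility flag (for example ADDR_COMPAT_LAYOUT) on top of
 * the current persona. Returns 0 on success and -1 on failure. A launcher
 * would typically do this before exec so the new image inherits it.
 */
int add_persona_flag(unsigned long flag)
{
    int cur = personality(0xffffffff);

    if (cur == -1)
        return -1;
    return personality((unsigned long)cur | flag) == -1 ? -1 : 0;
}
```

With this model, software that cannot cope with wide addresses is the one
that asks for the restriction, and everything else gets the full address
space by default.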

Thanks,

Ingo


Re: [RFC, PATCHv1 00/28] 5-level paging

2016-12-08 Thread Kirill A. Shutemov
On Thu, Dec 08, 2016 at 10:16:07AM -0800, Linus Torvalds wrote:
> On Thu, Dec 8, 2016 at 8:21 AM, Kirill A. Shutemov wrote:
> >
> > This patchset is still very early. There are a number of things missing
> > that we have to do before asking anyone to merge it (listed below).
> > It would be great if folks can start testing applications now (in QEMU) to
> > look for breakage.
> > Any early comments on the design or the patches would be appreciated as
> > well.
> 
> Looks ok to me. Starting off with a compile-time config option seems fine.
> 
> I do think that the x86 cpuid part (patch 15) should be the
> first patch, so that we see "la57" as a capability in /proc/cpuinfo
> whether it's being enabled or not? We should merge that part
> regardless of any mm patches, I think.

Okay, I'll split the CPUID part out into a separate patch and move it to the
beginning of the patchset.

REQUIRED_MASK portion will stay where it is.

-- 
 Kirill A. Shutemov


Re: [RFC, PATCHv1 00/28] 5-level paging

2016-12-08 Thread hpa
On December 8, 2016 10:16:07 AM PST, Linus Torvalds wrote:
>On Thu, Dec 8, 2016 at 8:21 AM, Kirill A. Shutemov wrote:
>>
>> This patchset is still very early. There are a number of things missing
>> that we have to do before asking anyone to merge it (listed below).
>> It would be great if folks can start testing applications now (in QEMU) to
>> look for breakage.
>> Any early comments on the design or the patches would be appreciated as
>> well.
>
>Looks ok to me. Starting off with a compile-time config option seems fine.
>
>I do think that the x86 cpuid part (patch 15) should be the
>first patch, so that we see "la57" as a capability in /proc/cpuinfo
>whether it's being enabled or not? We should merge that part
>regardless of any mm patches, I think.
>
>   Linus

Definitely.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [RFC, PATCHv1 00/28] 5-level paging

2016-12-08 Thread Linus Torvalds
On Thu, Dec 8, 2016 at 8:21 AM, Kirill A. Shutemov wrote:
>
> This patchset is still very early. There are a number of things missing
> that we have to do before asking anyone to merge it (listed below).
> It would be great if folks can start testing applications now (in QEMU) to
> look for breakage.
> Any early comments on the design or the patches would be appreciated as
> well.

Looks ok to me. Starting off with a compile-time config option seems fine.

I do think that the x86 cpuid part (patch 15) should be the
first patch, so that we see "la57" as a capability in /proc/cpuinfo
whether it's being enabled or not? We should merge that part
regardless of any mm patches, I think.
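
Once the flag is exported, userspace could test for it along these lines.
This is only a sketch: the helper names are made up, and it assumes the
usual /proc/cpuinfo layout with a "flags" line of space-separated names.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* True if `flag` appears as a whole whitespace-delimited word in `line`. */
bool flags_line_has(const char *line, const char *flag)
{
    size_t n = strlen(flag);
    const char *p = line;

    if (n == 0)
        return false;
    while ((p = strstr(p, flag)) != NULL) {
        bool start_ok = (p == line) || p[-1] == ' ' || p[-1] == '\t';
        bool end_ok = p[n] == '\0' || p[n] == ' ' || p[n] == '\n';

        if (start_ok && end_ok)
            return true;
        p += n;
    }
    return false;
}

/* Check the first "flags" line of /proc/cpuinfo for `flag`. */
bool cpu_advertises(const char *flag)
{
    FILE *f = fopen("/proc/cpuinfo", "r");
    char line[8192];
    bool found = false;

    if (!f)
        return false;
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "flags", 5) == 0) {
            found = flags_line_has(line, flag);
            break;
        }
    }
    fclose(f);
    return found;
}
```

cpu_advertises("la57") then reports whether the CPU lists the capability,
independent of whether the kernel actually enabled 5-level paging.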

   Linus

