Re: [PATCH] x86/mm: Fix boot with some memory above MAXMEM

2020-06-03 Thread Dave Hansen
On 6/2/20 4:18 PM, Kirill A. Shutemov wrote:
> On Tue, May 26, 2020 at 07:27:15AM -0700, Dave Hansen wrote:
>> On 5/25/20 8:08 AM, Kirill A. Shutemov wrote:
>>>>>> +	if (not_addressable) {
>>>>>> +		pr_err("%lldGB of physical memory is not addressable in the paging mode\n",
>>>>>> +		       not_addressable >> 30);
>>>>>> +		if (!pgtable_l5_enabled())
>>>>>> +			pr_err("Consider enabling 5-level paging\n");
>>>> Could this happen at all when l5 is enabled?
>>>> Does it mean we need kmap() for 64-bit?
>>> It's future-proofing. Who knows what paging modes we would have in the
>>> future.
>>
>> Future-proofing and firmware-proofing. :)
>>
>> In any case, are we *really* limited to 52 bits of physical memory with
>> 5-level paging?
> 
> Yes. It's architectural. SDM says "MAXPHYADDR is at most 52" (Vol 3A,
> 4.1.4).

Right you are.

I'm glad it's in the architecture.  Makes all of this a lot easier!

>> So shouldn't it technically be this:
>>
>> #define MAX_PHYSMEM_BITS (pgtable_l5_enabled() ? 55 : 46)
>>
>> ?
> 
> Bits above 52 are ignored in the page table entries and accessible to
> software. Some of them got claimed by HW features (XD-bit, protection
> keys), but such features require explicit opt-in on software side.
> 
> Kernel could claim bits 53-55 for the physical address, but it doesn't get
> us anything: if future HW would provide such feature it would require
> opt-in. On other hand claiming them now means we cannot use them for other
> purposes as SW bit. I don't see a point.

Yep, agreed.


Re: [PATCH] x86/mm: Fix boot with some memory above MAXMEM

2020-06-02 Thread Kirill A. Shutemov
On Tue, May 26, 2020 at 07:27:15AM -0700, Dave Hansen wrote:
> On 5/25/20 8:08 AM, Kirill A. Shutemov wrote:
> >>>> +	if (not_addressable) {
> >>>> +		pr_err("%lldGB of physical memory is not addressable in the paging mode\n",
> >>>> +		       not_addressable >> 30);
> >>>> +		if (!pgtable_l5_enabled())
> >>>> +			pr_err("Consider enabling 5-level paging\n");
> >> Could this happen at all when l5 is enabled?
> >> Does it mean we need kmap() for 64-bit?
> > It's future-proofing. Who knows what paging modes we would have in the
> > future.
> 
> Future-proofing and firmware-proofing. :)
> 
> In any case, are we *really* limited to 52 bits of physical memory with
> 5-level paging?

Yes. It's architectural. SDM says "MAXPHYADDR is at most 52" (Vol 3A,
4.1.4).

I guess it can be extended with an opt-in feature and relevant changes to
page table structure. But as of today there's no such thing.

> Previously, we said we were limited to 46 bits, and now
> we're saying that the limit is 52 with 5-level paging:
> 
> #define MAX_PHYSMEM_BITS (pgtable_l5_enabled() ? 52 : 46)
> 
> The 46 was fine with the 48 bits of address space on 4-level paging
> systems since we need 1/2 of the address space for userspace, 1/4 for
> the direct map and 1/4 for the vmalloc-and-friends area.  At 46 bits of
> address space, we fill up the direct map.
> 
> The hardware designers know this and never enumerated a MAXPHYADDR from
> CPUID which was higher than what we could cover with 46 bits.  It was
> nice and convenient that these two separate things matched:
> 1. The amount of physical address space addressable in a direct map
>consuming 1/4 of the virtual address space.
> 2. The CPU-enumerated MAXPHYADDR which among other things dictates how
>much physical address space is addressable in a PTE.
> 
> But, with 5-level paging, things are a little different.  The limit in
> addressable memory because of running out of the direct map actually
> happens at 55 bits (57-2=55, analogous to the 4-level 48-2=46).
> 
> So shouldn't it technically be this:
> 
> #define MAX_PHYSMEM_BITS (pgtable_l5_enabled() ? 55 : 46)
> 
> ?

Bits above 52 are ignored in the page table entries and accessible to
software. Some of them got claimed by HW features (XD-bit, protection
keys), but such features require explicit opt-in on software side.

Kernel could claim bits 53-55 for the physical address, but it doesn't get
us anything: if future HW would provide such feature it would require
opt-in. On other hand claiming them now means we cannot use them for other
purposes as SW bit. I don't see a point.
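
To make the bit accounting concrete, here is a minimal sketch of how a 64-bit leaf page table entry splits up. The mask names are hypothetical (they are not the kernel's macros), and the exact ranges are my reading of the SDM layout for a 4KB PTE: bits 51:12 hold the physical address, bits 58:52 are ignored by hardware and free for software, bits 62:59 carry the protection key once that feature is opted into, and bit 63 is XD.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for illustration; not the kernel's own macros. */
#define SKETCH_PTE_ADDR_MASK 0x000ffffffffff000ULL /* bits 51:12, physical address */
#define SKETCH_PTE_SW_MASK   (0x7fULL << 52)       /* bits 58:52, ignored by hardware */
#define SKETCH_PTE_PKEY_MASK (0xfULL << 59)        /* bits 62:59, protection key (opt-in) */
#define SKETCH_PTE_XD_MASK   (1ULL << 63)          /* bit 63, execute-disable (opt-in) */

/* Extract the physical address of the mapped 4KB page. */
static uint64_t sketch_pte_phys_addr(uint64_t pte)
{
	return pte & SKETCH_PTE_ADDR_MASK;
}

/* Extract the software-available field as a small integer. */
static uint64_t sketch_pte_sw_bits(uint64_t pte)
{
	return (pte & SKETCH_PTE_SW_MASK) >> 52;
}

/* Sanity check: the four fields are disjoint and cover bits 63:12. */
static int sketch_pte_masks_consistent(void)
{
	uint64_t all = SKETCH_PTE_ADDR_MASK | SKETCH_PTE_SW_MASK |
		       SKETCH_PTE_PKEY_MASK | SKETCH_PTE_XD_MASK;
	uint64_t sum = SKETCH_PTE_ADDR_MASK + SKETCH_PTE_SW_MASK +
		       SKETCH_PTE_PKEY_MASK + SKETCH_PTE_XD_MASK;

	/* OR equals SUM only when no bit is shared between masks. */
	return all == ~0xfffULL && all == sum;
}
```

Widening the address field beyond bit 51 would eat directly into the software-available range, which is the trade-off described above.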

-- 
 Kirill A. Shutemov


Re: [PATCH] x86/mm: Fix boot with some memory above MAXMEM

2020-05-26 Thread Dave Hansen
On 5/25/20 8:08 AM, Kirill A. Shutemov wrote:
>>>> +	if (not_addressable) {
>>>> +		pr_err("%lldGB of physical memory is not addressable in the paging mode\n",
>>>> +		       not_addressable >> 30);
>>>> +		if (!pgtable_l5_enabled())
>>>> +			pr_err("Consider enabling 5-level paging\n");
>> Could this happen at all when l5 is enabled?
>> Does it mean we need kmap() for 64-bit?
> It's future-proofing. Who knows what paging modes we would have in the
> future.

Future-proofing and firmware-proofing. :)

In any case, are we *really* limited to 52 bits of physical memory with
5-level paging?  Previously, we said we were limited to 46 bits, and now
we're saying that the limit is 52 with 5-level paging:

#define MAX_PHYSMEM_BITS (pgtable_l5_enabled() ? 52 : 46)

The 46 was fine with the 48 bits of address space on 4-level paging
systems since we need 1/2 of the address space for userspace, 1/4 for
the direct map and 1/4 for the vmalloc-and-friends area.  At 46 bits of
address space, we fill up the direct map.

The hardware designers know this and never enumerated a MAXPHYADDR from
CPUID which was higher than what we could cover with 46 bits.  It was
nice and convenient that these two separate things matched:
1. The amount of physical address space addressable in a direct map
   consuming 1/4 of the virtual address space.
2. The CPU-enumerated MAXPHYADDR which among other things dictates how
   much physical address space is addressable in a PTE.

But, with 5-level paging, things are a little different.  The limit in
addressable memory because of running out of the direct map actually
happens at 55 bits (57-2=55, analogous to the 4-level 48-2=46).

So shouldn't it technically be this:

#define MAX_PHYSMEM_BITS (pgtable_l5_enabled() ? 55 : 46)

?
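
The arithmetic behind the two candidate values can be sketched as below. The helper names are hypothetical and this is not kernel code; it only encodes the split described above (1/2 userspace, 1/4 direct map, 1/4 vmalloc and friends) and takes the CPU-enumerated MAXPHYADDR as a second, independent cap.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative only: with half of the virtual address space for userspace
 * and a quarter for the direct map, the direct map covers
 * 2^(virt_bits - 2) bytes.
 */
static unsigned int sketch_direct_map_bits(unsigned int virt_bits)
{
	assert(virt_bits > 2 && virt_bits <= 64);
	return virt_bits - 2;
}

/*
 * The usable physical address width is the smaller of what the direct map
 * can cover and what a PTE can address (MAXPHYADDR).
 */
static unsigned int sketch_usable_phys_bits(unsigned int virt_bits,
					    unsigned int maxphyaddr_bits)
{
	unsigned int dm_bits = sketch_direct_map_bits(virt_bits);

	return dm_bits < maxphyaddr_bits ? dm_bits : maxphyaddr_bits;
}
```

With 48 virtual bits the direct map is the binding constraint (46). With 57 virtual bits the direct map would allow 55, so whichever MAXPHYADDR the hardware enumerates decides the result.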


Re: [PATCH] x86/mm: Fix boot with some memory above MAXMEM

2020-05-25 Thread Mike Rapoport
On Mon, May 25, 2020 at 06:08:20PM +0300, Kirill A. Shutemov wrote:
> On Mon, May 25, 2020 at 05:59:43PM +0300, Mike Rapoport wrote:
> > On Mon, May 25, 2020 at 07:49:02AM +0300, Kirill A. Shutemov wrote:
> > > On Mon, May 11, 2020 at 10:17:21PM +0300, Kirill A. Shutemov wrote:
> > > > A 5-level paging capable machine can have memory above 46-bit in the
> > > > physical address space. This memory is only addressable in the 5-level
> > > > paging mode: we don't have enough virtual address space to create direct
> > > > mapping for such memory in the 4-level paging mode.
> > > > 
> > > > Currently, we fail boot completely: NULL pointer dereference in
> > > > subsection_map_init().
> > > > 
> > > > Skip creating a memblock for such memory instead and notify user that
> > > > some memory is not addressable.
> > > > 
> > > > Signed-off-by: Kirill A. Shutemov 
> > > > Reviewed-by: Dave Hansen 
> > > > Cc: sta...@vger.kernel.org # v4.14
> > > > ---
> > > 
> > > Gentle ping.
> > > 
> > > It's not urgent, but it's a bug fix. Please consider applying.
> > > 
> > > > Tested with a hacked QEMU: 
> > > > https://gist.github.com/kiryl/d45eb54110944ff95e544972d8bdac1d
> > > > 
> > > > ---
> > > >  arch/x86/kernel/e820.c | 19 +--
> > > >  1 file changed, 17 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> > > > index c5399e80c59c..d320d37d0f95 100644
> > > > --- a/arch/x86/kernel/e820.c
> > > > +++ b/arch/x86/kernel/e820.c
> > > > @@ -1280,8 +1280,8 @@ void __init e820__memory_setup(void)
> > > >  
> > > >  void __init e820__memblock_setup(void)
> > > >  {
> > > > +   u64 size, end, not_addressable = 0;
> > > > int i;
> > > > -   u64 end;
> > > >  
> > > > /*
> > > >  * The bootstrap memblock region count maximum is 128 entries
> > > > @@ -1307,7 +1307,22 @@ void __init e820__memblock_setup(void)
> > > > if (entry->type != E820_TYPE_RAM && entry->type != 
> > > > E820_TYPE_RESERVED_KERN)
> > > > continue;
> > > >  
> > > > -   memblock_add(entry->addr, entry->size);
> > > > +   if (entry->addr >= MAXMEM) {
> > > > +   not_addressable += entry->size;
> > > > +   continue;
> > > > +   }
> > > > +
> > > > +   end = min_t(u64, end, MAXMEM - 1);
> > > > +   size = end - entry->addr;
> > > > +   not_addressable += entry->size - size;
> > > > +   memblock_add(entry->addr, size);
> > > > +   }
> > > > +
> > > > +   if (not_addressable) {
> > > > +   pr_err("%lldGB of physical memory is not addressable in 
> > > > the paging mode\n",
> > > > +  not_addressable >> 30);
> > > > +   if (!pgtable_l5_enabled())
> > > > +   pr_err("Consider enabling 5-level paging\n");
> > 
> > Could this happen at all when l5 is enabled?
> > Does it mean we need kmap() for 64-bit?
> 
> It's future-proofing. Who knows what paging modes we would have in the
> future.

Then maybe

pr_err("%lldGB of physical memory is not addressable in the %s paging mode\n",
       not_addressable >> 30, pgtable_l5_enabled() ? "5-level" : "4-level");

"the paging mode" on its own sounds a bit awkward to me.

> -- 
>  Kirill A. Shutemov

-- 
Sincerely yours,
Mike.


Re: [PATCH] x86/mm: Fix boot with some memory above MAXMEM

2020-05-25 Thread Kirill A. Shutemov
On Mon, May 25, 2020 at 05:59:43PM +0300, Mike Rapoport wrote:
> On Mon, May 25, 2020 at 07:49:02AM +0300, Kirill A. Shutemov wrote:
> > On Mon, May 11, 2020 at 10:17:21PM +0300, Kirill A. Shutemov wrote:
> > > A 5-level paging capable machine can have memory above 46-bit in the
> > > physical address space. This memory is only addressable in the 5-level
> > > paging mode: we don't have enough virtual address space to create direct
> > > mapping for such memory in the 4-level paging mode.
> > > 
> > > Currently, we fail boot completely: NULL pointer dereference in
> > > subsection_map_init().
> > > 
> > > Skip creating a memblock for such memory instead and notify user that
> > > some memory is not addressable.
> > > 
> > > Signed-off-by: Kirill A. Shutemov 
> > > Reviewed-by: Dave Hansen 
> > > Cc: sta...@vger.kernel.org # v4.14
> > > ---
> > 
> > Gentle ping.
> > 
> > It's not urgent, but it's a bug fix. Please consider applying.
> > 
> > > Tested with a hacked QEMU: 
> > > https://gist.github.com/kiryl/d45eb54110944ff95e544972d8bdac1d
> > > 
> > > ---
> > >  arch/x86/kernel/e820.c | 19 +--
> > >  1 file changed, 17 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> > > index c5399e80c59c..d320d37d0f95 100644
> > > --- a/arch/x86/kernel/e820.c
> > > +++ b/arch/x86/kernel/e820.c
> > > @@ -1280,8 +1280,8 @@ void __init e820__memory_setup(void)
> > >  
> > >  void __init e820__memblock_setup(void)
> > >  {
> > > + u64 size, end, not_addressable = 0;
> > >   int i;
> > > - u64 end;
> > >  
> > >   /*
> > >* The bootstrap memblock region count maximum is 128 entries
> > > @@ -1307,7 +1307,22 @@ void __init e820__memblock_setup(void)
> > >   if (entry->type != E820_TYPE_RAM && entry->type != 
> > > E820_TYPE_RESERVED_KERN)
> > >   continue;
> > >  
> > > - memblock_add(entry->addr, entry->size);
> > > + if (entry->addr >= MAXMEM) {
> > > + not_addressable += entry->size;
> > > + continue;
> > > + }
> > > +
> > > + end = min_t(u64, end, MAXMEM - 1);
> > > + size = end - entry->addr;
> > > + not_addressable += entry->size - size;
> > > + memblock_add(entry->addr, size);
> > > + }
> > > +
> > > + if (not_addressable) {
> > > + pr_err("%lldGB of physical memory is not addressable in the 
> > > paging mode\n",
> > > +not_addressable >> 30);
> > > + if (!pgtable_l5_enabled())
> > > + pr_err("Consider enabling 5-level paging\n");
> 
> Could this happen at all when l5 is enabled?
> Does it mean we need kmap() for 64-bit?

It's future-proofing. Who knows what paging modes we would have in the
future.

-- 
 Kirill A. Shutemov


Re: [PATCH] x86/mm: Fix boot with some memory above MAXMEM

2020-05-25 Thread Mike Rapoport
On Mon, May 25, 2020 at 07:49:02AM +0300, Kirill A. Shutemov wrote:
> On Mon, May 11, 2020 at 10:17:21PM +0300, Kirill A. Shutemov wrote:
> > A 5-level paging capable machine can have memory above 46-bit in the
> > physical address space. This memory is only addressable in the 5-level
> > paging mode: we don't have enough virtual address space to create direct
> > mapping for such memory in the 4-level paging mode.
> > 
> > Currently, we fail boot completely: NULL pointer dereference in
> > subsection_map_init().
> > 
> > Skip creating a memblock for such memory instead and notify user that
> > some memory is not addressable.
> > 
> > Signed-off-by: Kirill A. Shutemov 
> > Reviewed-by: Dave Hansen 
> > Cc: sta...@vger.kernel.org # v4.14
> > ---
> 
> Gentle ping.
> 
> It's not urgent, but it's a bug fix. Please consider applying.
> 
> > Tested with a hacked QEMU: 
> > https://gist.github.com/kiryl/d45eb54110944ff95e544972d8bdac1d
> > 
> > ---
> >  arch/x86/kernel/e820.c | 19 +--
> >  1 file changed, 17 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> > index c5399e80c59c..d320d37d0f95 100644
> > --- a/arch/x86/kernel/e820.c
> > +++ b/arch/x86/kernel/e820.c
> > @@ -1280,8 +1280,8 @@ void __init e820__memory_setup(void)
> >  
> >  void __init e820__memblock_setup(void)
> >  {
> > +   u64 size, end, not_addressable = 0;
> > int i;
> > -   u64 end;
> >  
> > /*
> >  * The bootstrap memblock region count maximum is 128 entries
> > @@ -1307,7 +1307,22 @@ void __init e820__memblock_setup(void)
> > if (entry->type != E820_TYPE_RAM && entry->type != 
> > E820_TYPE_RESERVED_KERN)
> > continue;
> >  
> > -   memblock_add(entry->addr, entry->size);
> > +   if (entry->addr >= MAXMEM) {
> > +   not_addressable += entry->size;
> > +   continue;
> > +   }
> > +
> > +   end = min_t(u64, end, MAXMEM - 1);
> > +   size = end - entry->addr;
> > +   not_addressable += entry->size - size;
> > +   memblock_add(entry->addr, size);
> > +   }
> > +
> > +   if (not_addressable) {
> > +   pr_err("%lldGB of physical memory is not addressable in the 
> > paging mode\n",
> > +  not_addressable >> 30);
> > +   if (!pgtable_l5_enabled())
> > +   pr_err("Consider enabling 5-level paging\n");

Could this happen at all when l5 is enabled?
Does it mean we need kmap() for 64-bit?

> > }
> >  
> > /* Throw away partial pages: */
> > -- 
> > 2.26.2
> > 
> > 
> 
> -- 
>  Kirill A. Shutemov
> 

-- 
Sincerely yours,
Mike.


Re: [PATCH] x86/mm: Fix boot with some memory above MAXMEM

2020-05-24 Thread Kirill A. Shutemov
On Mon, May 11, 2020 at 10:17:21PM +0300, Kirill A. Shutemov wrote:
> A 5-level paging capable machine can have memory above 46-bit in the
> physical address space. This memory is only addressable in the 5-level
> paging mode: we don't have enough virtual address space to create direct
> mapping for such memory in the 4-level paging mode.
> 
> Currently, we fail boot completely: NULL pointer dereference in
> subsection_map_init().
> 
> Skip creating a memblock for such memory instead and notify user that
> some memory is not addressable.
> 
> Signed-off-by: Kirill A. Shutemov 
> Reviewed-by: Dave Hansen 
> Cc: sta...@vger.kernel.org # v4.14
> ---

Gentle ping.

It's not urgent, but it's a bug fix. Please consider applying.

> Tested with a hacked QEMU: 
> https://gist.github.com/kiryl/d45eb54110944ff95e544972d8bdac1d
> 
> ---
>  arch/x86/kernel/e820.c | 19 +--
>  1 file changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index c5399e80c59c..d320d37d0f95 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -1280,8 +1280,8 @@ void __init e820__memory_setup(void)
>  
>  void __init e820__memblock_setup(void)
>  {
> + u64 size, end, not_addressable = 0;
>   int i;
> - u64 end;
>  
>   /*
>* The bootstrap memblock region count maximum is 128 entries
> @@ -1307,7 +1307,22 @@ void __init e820__memblock_setup(void)
>   if (entry->type != E820_TYPE_RAM && entry->type != 
> E820_TYPE_RESERVED_KERN)
>   continue;
>  
> - memblock_add(entry->addr, entry->size);
> + if (entry->addr >= MAXMEM) {
> + not_addressable += entry->size;
> + continue;
> + }
> +
> + end = min_t(u64, end, MAXMEM - 1);
> + size = end - entry->addr;
> + not_addressable += entry->size - size;
> + memblock_add(entry->addr, size);
> + }
> +
> + if (not_addressable) {
> + pr_err("%lldGB of physical memory is not addressable in the 
> paging mode\n",
> +not_addressable >> 30);
> + if (!pgtable_l5_enabled())
> + pr_err("Consider enabling 5-level paging\n");
>   }
>  
>   /* Throw away partial pages: */
> -- 
> 2.26.2
> 
> 

-- 
 Kirill A. Shutemov
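
For readers who want to follow the accounting in the patch quoted above without a kernel tree, here is a small userspace sketch of the same idea: skip ranges that start at or above the limit, clamp ranges that straddle it, and total up what was dropped. Everything here is an assumption made for illustration: `struct sketch_range`, `sketch_add_addressable()` and the `maxmem` parameter are hypothetical stand-ins for the e820 entries, `memblock_add()` and `MAXMEM`. It also uses a half-open end (clamping to `maxmem` rather than `MAXMEM - 1` as the patch does), so it is not a line-for-line copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an e820 RAM entry. */
struct sketch_range {
	uint64_t addr;
	uint64_t size;
};

/*
 * Walk the ranges, count the bytes that would be registered (below maxmem)
 * and accumulate the bytes that are not addressable into *not_addressable.
 * Returns the number of bytes that would be handed to the allocator.
 */
static uint64_t sketch_add_addressable(const struct sketch_range *r, size_t n,
				       uint64_t maxmem,
				       uint64_t *not_addressable)
{
	uint64_t added = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		uint64_t end, size;

		/* Entirely above the limit: drop the whole range. */
		if (r[i].addr >= maxmem) {
			*not_addressable += r[i].size;
			continue;
		}

		/* Clamp the exclusive end of the range to the limit. */
		end = r[i].addr + r[i].size;
		if (end > maxmem)
			end = maxmem;

		size = end - r[i].addr;
		*not_addressable += r[i].size - size;
		added += size;
	}

	return added;
}
```

Shifting the accumulated count right by 30, as the patch does, gives the figure in GB that ends up in the log message.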