RE: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem alignment

2020-07-29 Thread Justin He

> -----Original Message-----
> From: David Hildenbrand 
> Sent: Wednesday, July 29, 2020 5:35 PM
> To: Mike Rapoport ; Justin He 
> Cc: Dan Williams ; Vishal Verma
> ; Catalin Marinas ;
> Will Deacon ; Greg Kroah-Hartman
> ; Rafael J. Wysocki ; Dave
> Jiang ; Andrew Morton ;
> Steve Capper ; Mark Rutland ;
> Logan Gunthorpe ; Anshuman Khandual
> ; Hsin-Yi Wang ; Jason
> Gunthorpe ; Dave Hansen ; Kees
> Cook ; linux-arm-ker...@lists.infradead.org; linux-
> ker...@vger.kernel.org; linux-nvd...@lists.01.org; linux...@kvack.org; Wei
> Yang ; Pankaj Gupta
> ; Ira Weiny ; Kaly Xin
> 
> Subject: Re: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem
> alignment
> 
> On 29.07.20 11:31, Mike Rapoport wrote:
> > Hi Justin,
> >
> > On Wed, Jul 29, 2020 at 08:27:58AM +, Justin He wrote:
> >> Hi David
> >>>>
> >>>> Without this series, if qemu creates a 4G bytes nvdimm device, we can
> >>> only
> >>>> use 2G bytes for dax pmem(kmem) in the worst case.
> >>>> e.g.
> >>>> 24000-33fdf : Persistent Memory
> >>>> We can only use the memblock between [24000, 2] due to
> the
> >>> hard
> >>>> limitation. It wastes too much memory space.
> >>>>
> >>>> Decreasing the SECTION_SIZE_BITS on arm64 might be an alternative,
> but
> >>> there
> >>>> are too many concerns from other constraints, e.g. PAGE_SIZE, hugetlb,
> >>>> SPARSEMEM_VMEMMAP, page bits in struct page ...
> >>>>
> >>>> Beside decreasing the SECTION_SIZE_BITS, we can also relax the kmem
> >>> alignment
> >>>> with memory_block_size_bytes().
> >>>>
> >>>> Tested on arm64 guest and x86 guest, qemu creates a 4G pmem device.
> dax
> >>> pmem
> >>>> can be used as ram with smaller gap. Also the kmem hotplug add/remove
> >>> are both
> >>>> tested on arm64/x86 guest.
> >>>>
> >>>
> >>> Hi,
> >>>
> >>> I am not convinced this use case is worth such hacks (that’s what it
> is)
> >>> for now. On real machines pmem is big - your example (losing 50% is
> >>> extreme).
> >>>
> >>> I would much rather want to see the section size on arm64 reduced. I
> >>> remember there were patches and that at least with a base page size of
> 4k
> >>> it can be reduced drastically (64k base pages are more problematic due
> to
> >>> the ridiculous THP size of 512M). But could be a section size of 512
> is
> >>> possible on all configs right now.
> >>
> >> Yes, I once investigated how to reduce section size on arm64
> thoughtfully:
> >> There are many constraints for reducing SECTION_SIZE_BITS
> >> 1. Given page->flags bits is limited, SECTION_SIZE_BITS can't be
> reduced too
> >>much.
> >> 2. Once CONFIG_SPARSEMEM_VMEMMAP is enabled, section id will not be
> counted
> >>into page->flags.
> >> 3. MAX_ORDER depends on SECTION_SIZE_BITS
> >>  - 3.1 mmzone.h
> >> #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
> >> #error Allocator MAX_ORDER exceeds SECTION_SIZE
> >> #endif
> >>  - 3.2 hugepage_init()
> >> MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER >= MAX_ORDER);
> >>
> >> Hence when ARM64_4K_PAGES && CONFIG_SPARSEMEM_VMEMMAP are enabled,
> >> SECTION_SIZE_BITS can be reduced to 27.
> >> But when ARM64_64K_PAGES, given 3.2, MAX_ORDER > 29-16 = 13.
> >> Given 3.1 SECTION_SIZE_BITS >= MAX_ORDER+15 > 28. So SECTION_SIZE_BITS
> can not
> >> be reduced to 27.
> >>
> >> In one word, if we considered to reduce SECTION_SIZE_BITS on arm64, the
> Kconfig
> >> might be very complicated,e.g. we still need to consider the case for
> >> ARM64_16K_PAGES.
> >
> > It is not necessary to pollute Kconfig with that.
> > arch/arm64/include/asm/sparsemem.h can have something like
> >
> > #ifdef CONFIG_ARM64_64K_PAGES
> > #define SPARSE_SECTION_SIZE 29
> > #elif defined(CONFIG_ARM64_16K_PAGES)
> > #define SPARSE_SECTION_SIZE 28
> > #elif defined(CONFIG_ARM64_4K_PAGES)
> > #define SPARSE_SECTION_SIZE 27
> > #else
> > #error
> > #endif
> 
> ack
Thanks, David and Mike. I will discuss this further with Arm internally about
a thorough section_size change.

--
Cheers,
Justin (Jia He)
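The constraint arithmetic quoted above (points 3.1 and 3.2) can be checked with a short sketch. This is illustrative only, not kernel code, and it assumes MAX_ORDER is taken as the smallest value that still permits PMD-sized THP:

```python
def pmd_shift(page_shift):
    # On a 64-bit arch one page-table level indexes (page_shift - 3) bits,
    # so a PMD-sized huge page covers 2^(2*page_shift - 3) bytes.
    return page_shift + (page_shift - 3)

def min_section_size_bits(page_shift):
    # hugepage_init(): MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER >= MAX_ORDER)
    hpage_pmd_order = pmd_shift(page_shift) - page_shift
    max_order = hpage_pmd_order + 1          # smallest MAX_ORDER that passes
    # mmzone.h: #error if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
    return max_order - 1 + page_shift

# 4K pages (PAGE_SHIFT 12): the PMD huge page is 2 MiB, so any section of
# 2^21 bytes or more satisfies both checks; 27 (128 MiB) leaves headroom.
assert min_section_size_bits(12) == 21
# 64K pages (PAGE_SHIFT 16): the PMD huge page is 512 MiB, so
# SECTION_SIZE_BITS cannot go below 29 -- matching the observation
# "SECTION_SIZE_BITS >= MAX_ORDER + 15 > 28" in the thread.
assert min_section_size_bits(16) == 29
```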



Re: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem alignment

2020-07-29 Thread Mike Rapoport
On Wed, Jul 29, 2020 at 03:03:04PM +0200, David Hildenbrand wrote:
> On 29.07.20 15:00, Mike Rapoport wrote:
> > On Wed, Jul 29, 2020 at 11:35:20AM +0200, David Hildenbrand wrote:
> >>>  
> >>> There is still large gap with ARM64_64K_PAGES, though.
> >>>
> >>> As for SPARSEMEM without VMEMMAP, are there actual benefits to use it?
> >>
> >> I was asking myself the same question a while ago and didn't really find
> >> a compelling one.
> > 
> > Memory overhead for VMEMMAP is larger, especially for arm64 that knows
> > how to free empty parts of the memory map with "classic" SPARSEMEM.
> 
> You mean the hole punching within section memmap? (which is why their
> pfn_valid() implementation is special)

Yes, arm (both 32 and 64) does this. And for smaller systems with a few
memory banks it is very reasonable to trade a slight (if any) slowdown
in pfn_valid() for several megabytes of memory.
 
> (I do wonder why that shouldn't work with VMEMMAP, or is it simply not
> implemented?)
 
It's not implemented. There was a patch [1] recently to implement this. 

[1] https://lore.kernel.org/lkml/20200721073203.107862-1-liwei...@huawei.com/

> -- 
> Thanks,
> 
> David / dhildenb
> 

-- 
Sincerely yours,
Mike.
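The "several megabytes" at stake can be estimated with a quick sketch (illustrative only; it assumes the common 64-byte sizeof(struct page) on a 64-bit kernel):

```python
STRUCT_PAGE = 64          # bytes; assumed typical sizeof(struct page)

def memmap_bytes(span_bytes, page_size):
    # One struct page per base page across the section's span.
    return span_bytes // page_size * STRUCT_PAGE

GiB = 1 << 30
# A fully populated 1 GiB section with 4K pages needs 16 MiB of memmap,
# so every section that a small board's memory banks leave empty costs
# 16 MiB unless the memory map is freed -- which is exactly what arm's
# classic-SPARSEMEM hole punching does.
assert memmap_bytes(GiB, 4096) == 16 << 20
```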


Re: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem alignment

2020-07-29 Thread David Hildenbrand
On 29.07.20 15:00, Mike Rapoport wrote:
> On Wed, Jul 29, 2020 at 11:35:20AM +0200, David Hildenbrand wrote:
>> On 29.07.20 11:31, Mike Rapoport wrote:
>>> Hi Justin,
>>>
>>> On Wed, Jul 29, 2020 at 08:27:58AM +, Justin He wrote:
 Hi David
>>
>> Without this series, if qemu creates a 4G bytes nvdimm device, we can
> only
>> use 2G bytes for dax pmem(kmem) in the worst case.
>> e.g.
>> 24000-33fdf : Persistent Memory
>> We can only use the memblock between [24000, 2] due to the
> hard
>> limitation. It wastes too much memory space.
>>
>> Decreasing the SECTION_SIZE_BITS on arm64 might be an alternative, but
> there
>> are too many concerns from other constraints, e.g. PAGE_SIZE, hugetlb,
>> SPARSEMEM_VMEMMAP, page bits in struct page ...
>>
>> Beside decreasing the SECTION_SIZE_BITS, we can also relax the kmem
> alignment
>> with memory_block_size_bytes().
>>
>> Tested on arm64 guest and x86 guest, qemu creates a 4G pmem device. dax
> pmem
>> can be used as ram with smaller gap. Also the kmem hotplug add/remove
> are both
>> tested on arm64/x86 guest.
>>
>
> Hi,
>
> I am not convinced this use case is worth such hacks (that’s what it is)
> for now. On real machines pmem is big - your example (losing 50% is
> extreme).
>
> I would much rather want to see the section size on arm64 reduced. I
> remember there were patches and that at least with a base page size of 4k
> it can be reduced drastically (64k base pages are more problematic due to
> the ridiculous THP size of 512M). But could be a section size of 512 is
> possible on all configs right now.

 Yes, I once investigated how to reduce section size on arm64 thoughtfully:
 There are many constraints for reducing SECTION_SIZE_BITS
 1. Given page->flags bits is limited, SECTION_SIZE_BITS can't be reduced 
 too
much.
 2. Once CONFIG_SPARSEMEM_VMEMMAP is enabled, section id will not be counted
into page->flags.
 3. MAX_ORDER depends on SECTION_SIZE_BITS 
  - 3.1 mmzone.h
 #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
 #error Allocator MAX_ORDER exceeds SECTION_SIZE
 #endif
  - 3.2 hugepage_init()
 MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER >= MAX_ORDER);

 Hence when ARM64_4K_PAGES && CONFIG_SPARSEMEM_VMEMMAP are enabled,
 SECTION_SIZE_BITS can be reduced to 27.
 But when ARM64_64K_PAGES, given 3.2, MAX_ORDER > 29-16 = 13.
 Given 3.1 SECTION_SIZE_BITS >= MAX_ORDER+15 > 28. So SECTION_SIZE_BITS can 
 not
 be reduced to 27.

 In one word, if we considered to reduce SECTION_SIZE_BITS on arm64, the 
 Kconfig
 might be very complicated,e.g. we still need to consider the case for
 ARM64_16K_PAGES.
>>>
>>> It is not necessary to pollute Kconfig with that.
>>> arch/arm64/include/asm/sparsemem.h can have something like
>>>
>>> #ifdef CONFIG_ARM64_64K_PAGES
>>> #define SPARSE_SECTION_SIZE 29
>>> #elif defined(CONFIG_ARM64_16K_PAGES)
>>> #define SPARSE_SECTION_SIZE 28
>>> #elif defined(CONFIG_ARM64_4K_PAGES)
>>> #define SPARSE_SECTION_SIZE 27
>>> #else
>>> #error
>>> #endif
>>
>> ack
>>
>>>  
>>> There is still large gap with ARM64_64K_PAGES, though.
>>>
>>> As for SPARSEMEM without VMEMMAP, are there actual benefits to use it?
>>
>> I was asking myself the same question a while ago and didn't really find
>> a compelling one.
> 
> Memory overhead for VMEMMAP is larger, especially for arm64 that knows
> how to free empty parts of the memory map with "classic" SPARSEMEM.

You mean the hole punching within section memmap? (which is why their
pfn_valid() implementation is special)

(I do wonder why that shouldn't work with VMEMMAP, or is it simply not
implemented?)

>  
>> I think it's always enabled as default (SPARSEMEM_VMEMMAP_ENABLE) and
>> would require config tweaks to even disable it.
> 
> Nope, it's right there in menuconfig,
> 
> "Memory Management options" -> "Sparse Memory virtual memmap"

Ah, good to know.


-- 
Thanks,

David / dhildenb



Re: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem alignment

2020-07-29 Thread Mike Rapoport
On Wed, Jul 29, 2020 at 11:35:20AM +0200, David Hildenbrand wrote:
> On 29.07.20 11:31, Mike Rapoport wrote:
> > Hi Justin,
> > 
> > On Wed, Jul 29, 2020 at 08:27:58AM +, Justin He wrote:
> >> Hi David
> 
>  Without this series, if qemu creates a 4G bytes nvdimm device, we can
> >>> only
>  use 2G bytes for dax pmem(kmem) in the worst case.
>  e.g.
>  24000-33fdf : Persistent Memory
>  We can only use the memblock between [24000, 2] due to the
> >>> hard
>  limitation. It wastes too much memory space.
> 
>  Decreasing the SECTION_SIZE_BITS on arm64 might be an alternative, but
> >>> there
>  are too many concerns from other constraints, e.g. PAGE_SIZE, hugetlb,
>  SPARSEMEM_VMEMMAP, page bits in struct page ...
> 
>  Beside decreasing the SECTION_SIZE_BITS, we can also relax the kmem
> >>> alignment
>  with memory_block_size_bytes().
> 
>  Tested on arm64 guest and x86 guest, qemu creates a 4G pmem device. dax
> >>> pmem
>  can be used as ram with smaller gap. Also the kmem hotplug add/remove
> >>> are both
>  tested on arm64/x86 guest.
> 
> >>>
> >>> Hi,
> >>>
> >>> I am not convinced this use case is worth such hacks (that’s what it is)
> >>> for now. On real machines pmem is big - your example (losing 50% is
> >>> extreme).
> >>>
> >>> I would much rather want to see the section size on arm64 reduced. I
> >>> remember there were patches and that at least with a base page size of 4k
> >>> it can be reduced drastically (64k base pages are more problematic due to
> >>> the ridiculous THP size of 512M). But could be a section size of 512 is
> >>> possible on all configs right now.
> >>
> >> Yes, I once investigated how to reduce section size on arm64 thoughtfully:
> >> There are many constraints for reducing SECTION_SIZE_BITS
> >> 1. Given page->flags bits is limited, SECTION_SIZE_BITS can't be reduced 
> >> too
> >>much.
> >> 2. Once CONFIG_SPARSEMEM_VMEMMAP is enabled, section id will not be counted
> >>into page->flags.
> >> 3. MAX_ORDER depends on SECTION_SIZE_BITS 
> >>  - 3.1 mmzone.h
> >> #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
> >> #error Allocator MAX_ORDER exceeds SECTION_SIZE
> >> #endif
> >>  - 3.2 hugepage_init()
> >> MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER >= MAX_ORDER);
> >>
> >> Hence when ARM64_4K_PAGES && CONFIG_SPARSEMEM_VMEMMAP are enabled,
> >> SECTION_SIZE_BITS can be reduced to 27.
> >> But when ARM64_64K_PAGES, given 3.2, MAX_ORDER > 29-16 = 13.
> >> Given 3.1 SECTION_SIZE_BITS >= MAX_ORDER+15 > 28. So SECTION_SIZE_BITS can 
> >> not
> >> be reduced to 27.
> >>
> >> In one word, if we considered to reduce SECTION_SIZE_BITS on arm64, the 
> >> Kconfig
> >> might be very complicated,e.g. we still need to consider the case for
> >> ARM64_16K_PAGES.
> > 
> > It is not necessary to pollute Kconfig with that.
> > arch/arm64/include/asm/sparsemem.h can have something like
> > 
> > #ifdef CONFIG_ARM64_64K_PAGES
> > #define SPARSE_SECTION_SIZE 29
> > #elif defined(CONFIG_ARM64_16K_PAGES)
> > #define SPARSE_SECTION_SIZE 28
> > #elif defined(CONFIG_ARM64_4K_PAGES)
> > #define SPARSE_SECTION_SIZE 27
> > #else
> > #error
> > #endif
> 
> ack
> 
> >  
> > There is still large gap with ARM64_64K_PAGES, though.
> > 
> > As for SPARSEMEM without VMEMMAP, are there actual benefits to use it?
> 
> I was asking myself the same question a while ago and didn't really find
> a compelling one.

Memory overhead for VMEMMAP is larger, especially for arm64 that knows
how to free empty parts of the memory map with "classic" SPARSEMEM.
 
> I think it's always enabled as default (SPARSEMEM_VMEMMAP_ENABLE) and
> would require config tweaks to even disable it.

Nope, it's right there in menuconfig,

"Memory Management options" -> "Sparse Memory virtual memmap"

> -- 
> Thanks,
> 
> David / dhildenb
> 

-- 
Sincerely yours,
Mike.


Re: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem alignment

2020-07-29 Thread David Hildenbrand
On 29.07.20 11:31, Mike Rapoport wrote:
> Hi Justin,
> 
> On Wed, Jul 29, 2020 at 08:27:58AM +, Justin He wrote:
>> Hi David

 Without this series, if qemu creates a 4G bytes nvdimm device, we can
>>> only
 use 2G bytes for dax pmem(kmem) in the worst case.
 e.g.
 24000-33fdf : Persistent Memory
 We can only use the memblock between [24000, 2] due to the
>>> hard
 limitation. It wastes too much memory space.

 Decreasing the SECTION_SIZE_BITS on arm64 might be an alternative, but
>>> there
 are too many concerns from other constraints, e.g. PAGE_SIZE, hugetlb,
 SPARSEMEM_VMEMMAP, page bits in struct page ...

 Beside decreasing the SECTION_SIZE_BITS, we can also relax the kmem
>>> alignment
 with memory_block_size_bytes().

 Tested on arm64 guest and x86 guest, qemu creates a 4G pmem device. dax
>>> pmem
 can be used as ram with smaller gap. Also the kmem hotplug add/remove
>>> are both
 tested on arm64/x86 guest.

>>>
>>> Hi,
>>>
>>> I am not convinced this use case is worth such hacks (that’s what it is)
>>> for now. On real machines pmem is big - your example (losing 50% is
>>> extreme).
>>>
>>> I would much rather want to see the section size on arm64 reduced. I
>>> remember there were patches and that at least with a base page size of 4k
>>> it can be reduced drastically (64k base pages are more problematic due to
>>> the ridiculous THP size of 512M). But could be a section size of 512 is
>>> possible on all configs right now.
>>
>> Yes, I once investigated how to reduce section size on arm64 thoughtfully:
>> There are many constraints for reducing SECTION_SIZE_BITS
>> 1. Given page->flags bits is limited, SECTION_SIZE_BITS can't be reduced too
>>much.
>> 2. Once CONFIG_SPARSEMEM_VMEMMAP is enabled, section id will not be counted
>>into page->flags.
>> 3. MAX_ORDER depends on SECTION_SIZE_BITS 
>>  - 3.1 mmzone.h
>> #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
>> #error Allocator MAX_ORDER exceeds SECTION_SIZE
>> #endif
>>  - 3.2 hugepage_init()
>> MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER >= MAX_ORDER);
>>
>> Hence when ARM64_4K_PAGES && CONFIG_SPARSEMEM_VMEMMAP are enabled,
>> SECTION_SIZE_BITS can be reduced to 27.
>> But when ARM64_64K_PAGES, given 3.2, MAX_ORDER > 29-16 = 13.
>> Given 3.1 SECTION_SIZE_BITS >= MAX_ORDER+15 > 28. So SECTION_SIZE_BITS can 
>> not
>> be reduced to 27.
>>
>> In one word, if we considered to reduce SECTION_SIZE_BITS on arm64, the 
>> Kconfig
>> might be very complicated,e.g. we still need to consider the case for
>> ARM64_16K_PAGES.
> 
> It is not necessary to pollute Kconfig with that.
> arch/arm64/include/asm/sparsemem.h can have something like
> 
> #ifdef CONFIG_ARM64_64K_PAGES
> #define SPARSE_SECTION_SIZE 29
> #elif defined(CONFIG_ARM64_16K_PAGES)
> #define SPARSE_SECTION_SIZE 28
> #elif defined(CONFIG_ARM64_4K_PAGES)
> #define SPARSE_SECTION_SIZE 27
> #else
> #error
> #endif

ack

>  
> There is still large gap with ARM64_64K_PAGES, though.
> 
> As for SPARSEMEM without VMEMMAP, are there actual benefits to use it?

I was asking myself the same question a while ago and didn't really find
a compelling one.

I think it's always enabled by default (SPARSEMEM_VMEMMAP_ENABLE) and
would require config tweaks to even disable it.

-- 
Thanks,

David / dhildenb



Re: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem alignment

2020-07-29 Thread Mike Rapoport
Hi Justin,

On Wed, Jul 29, 2020 at 08:27:58AM +, Justin He wrote:
> Hi David
> > >
> > > Without this series, if qemu creates a 4G bytes nvdimm device, we can
> > only
> > > use 2G bytes for dax pmem(kmem) in the worst case.
> > > e.g.
> > > 24000-33fdf : Persistent Memory
> > > We can only use the memblock between [24000, 2] due to the
> > hard
> > > limitation. It wastes too much memory space.
> > >
> > > Decreasing the SECTION_SIZE_BITS on arm64 might be an alternative, but
> > there
> > > are too many concerns from other constraints, e.g. PAGE_SIZE, hugetlb,
> > > SPARSEMEM_VMEMMAP, page bits in struct page ...
> > >
> > > Beside decreasing the SECTION_SIZE_BITS, we can also relax the kmem
> > alignment
> > > with memory_block_size_bytes().
> > >
> > > Tested on arm64 guest and x86 guest, qemu creates a 4G pmem device. dax
> > pmem
> > > can be used as ram with smaller gap. Also the kmem hotplug add/remove
> > are both
> > > tested on arm64/x86 guest.
> > >
> > 
> > Hi,
> > 
> > I am not convinced this use case is worth such hacks (that’s what it is)
> > for now. On real machines pmem is big - your example (losing 50% is
> > extreme).
> > 
> > I would much rather want to see the section size on arm64 reduced. I
> > remember there were patches and that at least with a base page size of 4k
> > it can be reduced drastically (64k base pages are more problematic due to
> > the ridiculous THP size of 512M). But could be a section size of 512 is
> > possible on all configs right now.
> 
> Yes, I once investigated how to reduce section size on arm64 thoughtfully:
> There are many constraints for reducing SECTION_SIZE_BITS
> 1. Given page->flags bits is limited, SECTION_SIZE_BITS can't be reduced too
>much.
> 2. Once CONFIG_SPARSEMEM_VMEMMAP is enabled, section id will not be counted
>into page->flags.
> 3. MAX_ORDER depends on SECTION_SIZE_BITS 
>  - 3.1 mmzone.h
> #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
> #error Allocator MAX_ORDER exceeds SECTION_SIZE
> #endif
>  - 3.2 hugepage_init()
> MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER >= MAX_ORDER);
> 
> Hence when ARM64_4K_PAGES && CONFIG_SPARSEMEM_VMEMMAP are enabled,
> SECTION_SIZE_BITS can be reduced to 27.
> But when ARM64_64K_PAGES, given 3.2, MAX_ORDER > 29-16 = 13.
> Given 3.1 SECTION_SIZE_BITS >= MAX_ORDER+15 > 28. So SECTION_SIZE_BITS can not
> be reduced to 27.
> 
> In one word, if we considered to reduce SECTION_SIZE_BITS on arm64, the 
> Kconfig
> might be very complicated,e.g. we still need to consider the case for
> ARM64_16K_PAGES.

It is not necessary to pollute Kconfig with that.
arch/arm64/include/asm/sparsemem.h can have something like

#ifdef CONFIG_ARM64_64K_PAGES
#define SPARSE_SECTION_SIZE 29
#elif defined(CONFIG_ARM64_16K_PAGES)
#define SPARSE_SECTION_SIZE 28
#elif defined(CONFIG_ARM64_4K_PAGES)
#define SPARSE_SECTION_SIZE 27
#else
#error
#endif
 
There is still a large gap with ARM64_64K_PAGES, though.

As for SPARSEMEM without VMEMMAP, are there actual benefits to use it?

> > 
> > In the long term we might want to rework the memory block device model
> > (eventually supporting old/new as discussed with Michal some time ago
> > using a kernel parameter), dropping the fixed sizes
> 
> Has this been posted to Linux mm maillist? Sorry, searched and didn't find it.
> 
> 
> --
> Cheers,
> Justin (Jia He)
> 
> 
> 
> > - allowing sizes / addresses aligned with subsection size
> > - drastically reducing the number of devices for boot memory to only a
> > hand full (e.g., one per resource / DIMM we can actually unplug again.
> > 
> > Long story short, I don’t like this hack.
> > 
> > 
> > > This patch series (mainly patch6/6) is based on the fixing patch, ~v5.8-
> > rc5 [2].
> > >
> > > [1] https://lkml.org/lkml/2019/6/19/67
> > > [2] https://lkml.org/lkml/2020/7/8/1546
> > > Jia He (6):
> > >  mm/memory_hotplug: remove redundant memory block size alignment check
> > >  resource: export find_next_iomem_res() helper
> > >  mm/memory_hotplug: allow pmem kmem not to align with memory_block_size
> > >  mm/page_alloc: adjust the start,end in dax pmem kmem case
> > >  device-dax: relax the memblock size alignment for kmem_start
> > >  arm64: fall back to vmemmap_populate_basepages if not aligned  with
> > >PMD_SIZE
> > >
> > > arch/arm64/mm/mmu.c|  4 
> > > drivers/base/memory.c  | 24 
> > > drivers/dax/kmem.c | 22 +-
> > > include/linux/ioport.h |  3 +++
> > > kernel/resource.c  |  3 ++-
> > > mm/memory_hotplug.c| 39 ++-
> > > mm/page_alloc.c| 14 ++
> > > 7 files changed, 90 insertions(+), 19 deletions(-)
> > >
> > > --
> > > 2.17.1
> > >
> 

-- 
Sincerely yours,
Mike.
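For reference, the SPARSE_SECTION_SIZE values in Mike's sketch translate to the following section granularities (plain arithmetic, not kernel code; the "current arm64" value of 30 is the thread's stated status quo):

```python
MiB = 1 << 20

def section_bytes(bits):
    # A sparsemem section spans 2^SECTION_SIZE_BITS bytes.
    return 1 << bits

# Proposed per-page-size values vs. the then-current arm64 value of 30.
for bits, cfg in [(30, "current arm64"), (29, "64K pages"),
                  (28, "16K pages"), (27, "4K pages")]:
    print(f"{cfg}: 2^{bits} = {section_bytes(bits) // MiB} MiB per section")
```

So even the worst case (64K pages) halves the section size, and 4K pages shrink it from 1 GiB to 128 MiB.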


Re: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem alignment

2020-07-29 Thread David Hildenbrand
On 29.07.20 10:27, Justin He wrote:
> Hi David
> 
>> -----Original Message-----
>> From: David Hildenbrand 
>> Sent: Wednesday, July 29, 2020 2:37 PM
>> To: Justin He 
>> Cc: Dan Williams ; Vishal Verma
>> ; Mike Rapoport ; David
>> Hildenbrand ; Catalin Marinas ;
>> Will Deacon ; Greg Kroah-Hartman
>> ; Rafael J. Wysocki ; Dave
>> Jiang ; Andrew Morton ;
>> Steve Capper ; Mark Rutland ;
>> Logan Gunthorpe ; Anshuman Khandual
>> ; Hsin-Yi Wang ; Jason
>> Gunthorpe ; Dave Hansen ; Kees
>> Cook ; linux-arm-ker...@lists.infradead.org; linux-
>> ker...@vger.kernel.org; linux-nvd...@lists.01.org; linux...@kvack.org; Wei
>> Yang ; Pankaj Gupta
>> ; Ira Weiny ; Kaly Xin
>> 
>> Subject: Re: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem
>> alignment
>>
>>
>>
>>> Am 29.07.2020 um 05:35 schrieb Jia He :
>>>
>>> When enabling dax pmem as RAM device on arm64, I noticed that kmem_start
>>> addr in dev_dax_kmem_probe() should be aligned w/
>> SECTION_SIZE_BITS(30),i.e.
>>> 1G memblock size. Even Dan Williams' sub-section patch series [1] had
>> been
>>> upstream merged, it was not helpful due to hard limitation of kmem_start:
>>> $ndctl create-namespace -e namespace0.0 --mode=devdax --map=dev -s 2g -f
>> -a 2M
>>> $echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
>>> $echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
>>> $cat /proc/iomem
>>> ...
>>> 23c00-23fff : System RAM
>>>  23dd4-23fec : reserved
>>>  23fed-23fff : reserved
>>> 24000-33fdf : Persistent Memory
>>>  24000-2403f : namespace0.0
>>>  28000-2bfff : dax0.0  <- aligned with 1G boundary
>>>28000-2bfff : System RAM
>>> Hence there is a big gap between 0x2403f and 0x28000 due to the
>> 1G
>>> alignment.
>>>
>>> Without this series, if qemu creates a 4G bytes nvdimm device, we can
>> only
>>> use 2G bytes for dax pmem(kmem) in the worst case.
>>> e.g.
>>> 24000-33fdf : Persistent Memory
>>> We can only use the memblock between [24000, 2] due to the
>> hard
>>> limitation. It wastes too much memory space.
>>>
>>> Decreasing the SECTION_SIZE_BITS on arm64 might be an alternative, but
>> there
>>> are too many concerns from other constraints, e.g. PAGE_SIZE, hugetlb,
>>> SPARSEMEM_VMEMMAP, page bits in struct page ...
>>>
>>> Beside decreasing the SECTION_SIZE_BITS, we can also relax the kmem
>> alignment
>>> with memory_block_size_bytes().
>>>
>>> Tested on arm64 guest and x86 guest, qemu creates a 4G pmem device. dax
>> pmem
>>> can be used as ram with smaller gap. Also the kmem hotplug add/remove
>> are both
>>> tested on arm64/x86 guest.
>>>
>>
>> Hi,
>>
>> I am not convinced this use case is worth such hacks (that’s what it is)
>> for now. On real machines pmem is big - your example (losing 50% is
>> extreme).
>>
>> I would much rather want to see the section size on arm64 reduced. I
>> remember there were patches and that at least with a base page size of 4k
>> it can be reduced drastically (64k base pages are more problematic due to
>> the ridiculous THP size of 512M). But could be a section size of 512 is
>> possible on all configs right now.
> 
> Yes, I once investigated how to reduce section size on arm64 thoughtfully:
> There are many constraints for reducing SECTION_SIZE_BITS
> 1. Given page->flags bits is limited, SECTION_SIZE_BITS can't be reduced too
>much.
> 2. Once CONFIG_SPARSEMEM_VMEMMAP is enabled, section id will not be counted
>into page->flags.

Yep.

> 3. MAX_ORDER depends on SECTION_SIZE_BITS 
>  - 3.1 mmzone.h
> #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
> #error Allocator MAX_ORDER exceeds SECTION_SIZE
> #endif

Yep, with 4k base pages it's 4 MB. However, with 64k base pages it's
512 MB ( :( ).

>  - 3.2 hugepage_init()
> MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER >= MAX_ORDER);
> 
> Hence when ARM64_4K_PAGES && CONFIG_SPARSEMEM_VMEMMAP are enabled,
> SECTION_SIZE_BITS can be reduced to 27.
> But when ARM64_64K_PAGES, given 3.2, MAX_ORDER > 29-16 = 13.
> Given 3.1 SECTION_SIZE_BITS >= MAX_ORDER+15 > 28. So SECTION_SIZE_BITS can not
> be reduced to 27.

I think there were plans to eventually switch to 2MB THP with 64k base
pages as well (which can be emulated using some sort of consecutive PTE

RE: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem alignment

2020-07-29 Thread Justin He
Hi David

> -----Original Message-----
> From: David Hildenbrand 
> Sent: Wednesday, July 29, 2020 2:37 PM
> To: Justin He 
> Cc: Dan Williams ; Vishal Verma
> ; Mike Rapoport ; David
> Hildenbrand ; Catalin Marinas ;
> Will Deacon ; Greg Kroah-Hartman
> ; Rafael J. Wysocki ; Dave
> Jiang ; Andrew Morton ;
> Steve Capper ; Mark Rutland ;
> Logan Gunthorpe ; Anshuman Khandual
> ; Hsin-Yi Wang ; Jason
> Gunthorpe ; Dave Hansen ; Kees
> Cook ; linux-arm-ker...@lists.infradead.org; linux-
> ker...@vger.kernel.org; linux-nvd...@lists.01.org; linux...@kvack.org; Wei
> Yang ; Pankaj Gupta
> ; Ira Weiny ; Kaly Xin
> 
> Subject: Re: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem
> alignment
> 
> 
> 
> > Am 29.07.2020 um 05:35 schrieb Jia He :
> >
> > When enabling dax pmem as RAM device on arm64, I noticed that kmem_start
> > addr in dev_dax_kmem_probe() should be aligned w/
> SECTION_SIZE_BITS(30),i.e.
> > 1G memblock size. Even Dan Williams' sub-section patch series [1] had
> been
> > upstream merged, it was not helpful due to hard limitation of kmem_start:
> > $ndctl create-namespace -e namespace0.0 --mode=devdax --map=dev -s 2g -f
> -a 2M
> > $echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
> > $echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
> > $cat /proc/iomem
> > ...
> > 23c00-23fff : System RAM
> >  23dd4-23fec : reserved
> >  23fed-23fff : reserved
> > 24000-33fdf : Persistent Memory
> >  24000-2403f : namespace0.0
> >  28000-2bfff : dax0.0  <- aligned with 1G boundary
> >28000-2bfff : System RAM
> > Hence there is a big gap between 0x2403f and 0x28000 due to the
> 1G
> > alignment.
> >
> > Without this series, if qemu creates a 4G bytes nvdimm device, we can
> only
> > use 2G bytes for dax pmem(kmem) in the worst case.
> > e.g.
> > 24000-33fdf : Persistent Memory
> > We can only use the memblock between [24000, 2] due to the
> hard
> > limitation. It wastes too much memory space.
> >
> > Decreasing the SECTION_SIZE_BITS on arm64 might be an alternative, but
> there
> > are too many concerns from other constraints, e.g. PAGE_SIZE, hugetlb,
> > SPARSEMEM_VMEMMAP, page bits in struct page ...
> >
> > Beside decreasing the SECTION_SIZE_BITS, we can also relax the kmem
> alignment
> > with memory_block_size_bytes().
> >
> > Tested on arm64 guest and x86 guest, qemu creates a 4G pmem device. dax
> pmem
> > can be used as ram with smaller gap. Also the kmem hotplug add/remove
> are both
> > tested on arm64/x86 guest.
> >
> 
> Hi,
> 
> I am not convinced this use case is worth such hacks (that’s what it is)
> for now. On real machines pmem is big - your example (losing 50% is
> extreme).
> 
> I would much rather want to see the section size on arm64 reduced. I
> remember there were patches and that at least with a base page size of 4k
> it can be reduced drastically (64k base pages are more problematic due to
> the ridiculous THP size of 512M). But could be a section size of 512 is
> possible on all configs right now.

Yes, I once investigated thoroughly how to reduce the section size on arm64.
There are several constraints on reducing SECTION_SIZE_BITS:
1. Given that page->flags bits are limited, SECTION_SIZE_BITS can't be reduced
   too much.
2. Once CONFIG_SPARSEMEM_VMEMMAP is enabled, the section id is not counted
   into page->flags.
3. MAX_ORDER depends on SECTION_SIZE_BITS
 - 3.1 mmzone.h
#if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
#error Allocator MAX_ORDER exceeds SECTION_SIZE
#endif
 - 3.2 hugepage_init()
MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER >= MAX_ORDER);

Hence when ARM64_4K_PAGES && CONFIG_SPARSEMEM_VMEMMAP are enabled,
SECTION_SIZE_BITS can be reduced to 27.
But with ARM64_64K_PAGES, given 3.2, MAX_ORDER > 29-16 = 13.
Given 3.1, SECTION_SIZE_BITS >= MAX_ORDER+15 > 28, so SECTION_SIZE_BITS cannot
be reduced to 27.

In short, if we consider reducing SECTION_SIZE_BITS on arm64, the Kconfig
logic might become very complicated; e.g., we still need to consider the case
of ARM64_16K_PAGES.

> 
> In the long term we might want to rework the memory block device model
> (eventually supporting old/new as discussed with Michal some time ago
> using a kernel parameter), dropping the fixed sizes

Has this been posted to Linux mm maillist? Sorry, searched and didn't find it.


--
Cheers,
Justin (Jia He)



> - allowing sizes / addresses aligned with subsection size
> - drastically reducing the number of devices for boot memory to only a
> hand full (e.g., one per resource / DIMM we can 

Re: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem alignment

2020-07-29 Thread David Hildenbrand



> Am 29.07.2020 um 05:35 schrieb Jia He :
> 
> When enabling dax pmem as RAM device on arm64, I noticed that kmem_start
> addr in dev_dax_kmem_probe() should be aligned w/ SECTION_SIZE_BITS(30),i.e.
> 1G memblock size. Even Dan Williams' sub-section patch series [1] had been
> upstream merged, it was not helpful due to hard limitation of kmem_start:
> $ndctl create-namespace -e namespace0.0 --mode=devdax --map=dev -s 2g -f -a 2M
> $echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
> $echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
> $cat /proc/iomem
> ...
> 23c00-23fff : System RAM
>  23dd4-23fec : reserved
>  23fed-23fff : reserved
> 24000-33fdf : Persistent Memory
>  24000-2403f : namespace0.0
>  28000-2bfff : dax0.0  <- aligned with 1G boundary
>28000-2bfff : System RAM
> Hence there is a big gap between 0x2403f and 0x28000 due to the 1G
> alignment.
> 
> Without this series, if qemu creates a 4G bytes nvdimm device, we can only
> use 2G bytes for dax pmem(kmem) in the worst case.
> e.g.
> 24000-33fdf : Persistent Memory 
> We can only use the memblock between [24000, 2] due to the hard
> limitation. It wastes too much memory space.
> 
> Decreasing the SECTION_SIZE_BITS on arm64 might be an alternative, but there
> are too many concerns from other constraints, e.g. PAGE_SIZE, hugetlb,
> SPARSEMEM_VMEMMAP, page bits in struct page ...
> 
> Beside decreasing the SECTION_SIZE_BITS, we can also relax the kmem alignment
> with memory_block_size_bytes().
> 
> Tested on arm64 guest and x86 guest, qemu creates a 4G pmem device. dax pmem
> can be used as ram with smaller gap. Also the kmem hotplug add/remove are both
> tested on arm64/x86 guest.
> 

Hi,

I am not convinced this use case is worth such hacks (that’s what it is) for 
now. On real machines pmem is big - your example (losing 50% is extreme).

I would much rather want to see the section size on arm64 reduced. I remember
there were patches showing that at least with a base page size of 4k it can be
reduced drastically (64k base pages are more problematic due to the ridiculous
THP size of 512M). But it could be that a section size of 512M is possible on
all configs right now.

In the long term we might want to rework the memory block device model
(eventually supporting old/new as discussed with Michal some time ago using a
kernel parameter), dropping the fixed sizes
- allowing sizes / addresses aligned with the subsection size
- drastically reducing the number of devices for boot memory to only a
handful (e.g., one per resource / DIMM we can actually unplug again).

Long story short, I don’t like this hack.


> This patch series (mainly patch 6/6) is based on the fixup patch [2],
> against ~v5.8-rc5.
> 
> [1] https://lkml.org/lkml/2019/6/19/67
> [2] https://lkml.org/lkml/2020/7/8/1546
> Jia He (6):
>  mm/memory_hotplug: remove redundant memory block size alignment check
>  resource: export find_next_iomem_res() helper
>  mm/memory_hotplug: allow pmem kmem not to align with memory_block_size
>  mm/page_alloc: adjust the start,end in dax pmem kmem case
>  device-dax: relax the memblock size alignment for kmem_start
>  arm64: fall back to vmemmap_populate_basepages if not aligned  with
>PMD_SIZE
> 
> arch/arm64/mm/mmu.c|  4 
> drivers/base/memory.c  | 24 
> drivers/dax/kmem.c | 22 +-
> include/linux/ioport.h |  3 +++
> kernel/resource.c  |  3 ++-
> mm/memory_hotplug.c| 39 ++-
> mm/page_alloc.c| 14 ++
> 7 files changed, 90 insertions(+), 19 deletions(-)
> 
> -- 
> 2.17.1
> 



[RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem alignment

2020-07-28 Thread Jia He
When enabling dax pmem as a RAM device on arm64, I noticed that the kmem_start
addr in dev_dax_kmem_probe() must be aligned with SECTION_SIZE_BITS (30), i.e.
a 1G memory block size. Even though Dan Williams' sub-section patch series [1]
has been merged upstream, it does not help here due to the hard limitation on
kmem_start:
$ndctl create-namespace -e namespace0.0 --mode=devdax --map=dev -s 2g -f -a 2M
$echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
$echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
$cat /proc/iomem
...
23c00-23fff : System RAM
  23dd4-23fec : reserved
  23fed-23fff : reserved
24000-33fdf : Persistent Memory
  24000-2403f : namespace0.0
  28000-2bfff : dax0.0  <- aligned with 1G boundary
28000-2bfff : System RAM
Hence there is a big gap between 0x2403f and 0x28000 due to the 1G
alignment.
 
Without this series, if qemu creates a 4G nvdimm device, we can only use
2G for dax pmem (kmem) in the worst case, e.g.:
24000-33fdf : Persistent Memory 
We can only use the memblock between [24000, 2] due to the hard
limitation. It wastes too much memory space.

Decreasing SECTION_SIZE_BITS on arm64 might be an alternative, but there
are too many concerns from other constraints, e.g. PAGE_SIZE, hugetlb,
SPARSEMEM_VMEMMAP, page bits in struct page ...

Besides decreasing SECTION_SIZE_BITS, we can also relax the kmem alignment
to memory_block_size_bytes().

Tested on arm64 and x86 guests with a qemu-created 4G pmem device: dax pmem
can be used as RAM with a smaller gap, and kmem hotplug add/remove were both
tested on the arm64/x86 guests.

This patch series (mainly patch 6/6) is based on the fixup patch [2], against ~v5.8-rc5.

[1] https://lkml.org/lkml/2019/6/19/67
[2] https://lkml.org/lkml/2020/7/8/1546
Jia He (6):
  mm/memory_hotplug: remove redundant memory block size alignment check
  resource: export find_next_iomem_res() helper
  mm/memory_hotplug: allow pmem kmem not to align with memory_block_size
  mm/page_alloc: adjust the start,end in dax pmem kmem case
  device-dax: relax the memblock size alignment for kmem_start
  arm64: fall back to vmemmap_populate_basepages if not aligned with
PMD_SIZE

 arch/arm64/mm/mmu.c|  4 
 drivers/base/memory.c  | 24 
 drivers/dax/kmem.c | 22 +-
 include/linux/ioport.h |  3 +++
 kernel/resource.c  |  3 ++-
 mm/memory_hotplug.c| 39 ++-
 mm/page_alloc.c| 14 ++
 7 files changed, 90 insertions(+), 19 deletions(-)

-- 
2.17.1