RE: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem alignment
> -----Original Message-----
> From: David Hildenbrand
> Sent: Wednesday, July 29, 2020 5:35 PM
> To: Mike Rapoport; Justin He
> Cc: Dan Williams; Vishal Verma; Catalin Marinas; Will Deacon; Greg Kroah-Hartman; Rafael J. Wysocki; Dave Jiang; Andrew Morton; Steve Capper; Mark Rutland; Logan Gunthorpe; Anshuman Khandual; Hsin-Yi Wang; Jason Gunthorpe; Dave Hansen; Kees Cook; linux-arm-ker...@lists.infradead.org; linux-ker...@vger.kernel.org; linux-nvd...@lists.01.org; linux...@kvack.org; Wei Yang; Pankaj Gupta; Ira Weiny; Kaly Xin
> Subject: Re: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem alignment
>
> On 29.07.20 11:31, Mike Rapoport wrote:
>> Hi Justin,
>>
>> On Wed, Jul 29, 2020 at 08:27:58AM +0000, Justin He wrote:
>>> Hi David
>>>>>
>>>>> Without this series, if qemu creates a 4G nvdimm device, we can only use 2G of it for dax pmem (kmem) in the worst case, e.g.:
>>>>> 24000-33fdf : Persistent Memory
>>>>> We can only use the memblock range [24000, 2] due to the hard limitation. This wastes too much memory space.
>>>>>
>>>>> Decreasing SECTION_SIZE_BITS on arm64 might be an alternative, but there are too many concerns from other constraints, e.g. PAGE_SIZE, hugetlb, SPARSEMEM_VMEMMAP, page bits in struct page ...
>>>>>
>>>>> Besides decreasing SECTION_SIZE_BITS, we can also relax the requirement that the kmem range be aligned with memory_block_size_bytes().
>>>>>
>>>>> Tested on arm64 and x86 guests, with qemu creating a 4G pmem device: dax pmem can be used as RAM with a smaller gap. kmem hotplug add/remove was also tested on both arm64 and x86 guests.
>>>>>
>>>>
>>>> Hi,
>>>>
>>>> I am not convinced this use case is worth such hacks (that’s what it is) for now. On real machines pmem is big - your example (losing 50%) is extreme.
>>>>
>>>> I would much rather see the section size on arm64 reduced. I remember there were patches, and that at least with a base page size of 4k it can be reduced drastically (64k base pages are more problematic due to the ridiculous THP size of 512M). But it could be that a section size of 512 is possible on all configs right now.
>>>
>>> Yes, I once investigated thoroughly how to reduce the section size on arm64. There are many constraints on reducing SECTION_SIZE_BITS:
>>> 1. Since the number of page->flags bits is limited, SECTION_SIZE_BITS can't be reduced too much.
>>> 2. Once CONFIG_SPARSEMEM_VMEMMAP is enabled, the section id is no longer stored in page->flags.
>>> 3. MAX_ORDER depends on SECTION_SIZE_BITS:
>>>  - 3.1 mmzone.h
>>>    #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
>>>    #error Allocator MAX_ORDER exceeds SECTION_SIZE
>>>    #endif
>>>  - 3.2 hugepage_init()
>>>    MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER >= MAX_ORDER);
>>>
>>> Hence when ARM64_4K_PAGES && CONFIG_SPARSEMEM_VMEMMAP are enabled, SECTION_SIZE_BITS can be reduced to 27.
>>> But with ARM64_64K_PAGES, given 3.2, MAX_ORDER > 29-16 = 13.
>>> Given 3.1, SECTION_SIZE_BITS >= MAX_ORDER+15 > 28. So SECTION_SIZE_BITS cannot be reduced to 27.
>>>
>>> In short, if we were to reduce SECTION_SIZE_BITS on arm64, the Kconfig might become very complicated, e.g. we would still need to consider the ARM64_16K_PAGES case.
>>
>> It is not necessary to pollute Kconfig with that.
>> arch/arm64/include/asm/sparsemem.h can have something like
>>
>> #ifdef CONFIG_ARM64_64K_PAGES
>> #define SECTION_SIZE_BITS 29
>> #elif defined(CONFIG_ARM64_16K_PAGES)
>> #define SECTION_SIZE_BITS 28
>> #elif defined(CONFIG_ARM64_4K_PAGES)
>> #define SECTION_SIZE_BITS 27
>> #else
>> #error "unknown arm64 page size"
>> #endif
>
> ack

Thanks, David and Mike. I will discuss a thorough section_size change further with Arm internally.

--
Cheers,
Justin (Jia He)
Re: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem alignment
On Wed, Jul 29, 2020 at 03:03:04PM +0200, David Hildenbrand wrote:
> On 29.07.20 15:00, Mike Rapoport wrote:
>> On Wed, Jul 29, 2020 at 11:35:20AM +0200, David Hildenbrand wrote:
>>>>
>>>> There is still a large gap with ARM64_64K_PAGES, though.
>>>>
>>>> As for SPARSEMEM without VMEMMAP, are there actual benefits to using it?
>>>
>>> I was asking myself the same question a while ago and didn't really find a compelling one.
>>
>> Memory overhead for VMEMMAP is larger, especially for arm64, which knows how to free empty parts of the memory map with "classic" SPARSEMEM.
>
> You mean the hole punching within the section memmap? (which is why their pfn_valid() implementation is special)

Yes, arm (both 32 and 64) does this. And for smaller systems with a few memory banks it is very reasonable to trade a slight (if any) slowdown in pfn_valid() for several megs of memory.

> (I do wonder why that shouldn't work with VMEMMAP, or is it simply not implemented?)

It's not implemented. There was a patch [1] recently to implement this.

[1] https://lore.kernel.org/lkml/20200721073203.107862-1-liwei...@huawei.com/

> --
> Thanks,
>
> David / dhildenb

--
Sincerely yours,
Mike.
Re: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem alignment
On 29.07.20 15:00, Mike Rapoport wrote:
> On Wed, Jul 29, 2020 at 11:35:20AM +0200, David Hildenbrand wrote:
>> On 29.07.20 11:31, Mike Rapoport wrote:
>>> Hi Justin,
>>>
>>> On Wed, Jul 29, 2020 at 08:27:58AM +0000, Justin He wrote:
>>>> Hi David
>>>>>>
>>>>>> Without this series, if qemu creates a 4G nvdimm device, we can only use 2G of it for dax pmem (kmem) in the worst case, e.g.:
>>>>>> 24000-33fdf : Persistent Memory
>>>>>> We can only use the memblock range [24000, 2] due to the hard limitation. This wastes too much memory space.
>>>>>>
>>>>>> Decreasing SECTION_SIZE_BITS on arm64 might be an alternative, but there are too many concerns from other constraints, e.g. PAGE_SIZE, hugetlb, SPARSEMEM_VMEMMAP, page bits in struct page ...
>>>>>>
>>>>>> Besides decreasing SECTION_SIZE_BITS, we can also relax the requirement that the kmem range be aligned with memory_block_size_bytes().
>>>>>>
>>>>>> Tested on arm64 and x86 guests, with qemu creating a 4G pmem device: dax pmem can be used as RAM with a smaller gap. kmem hotplug add/remove was also tested on both arm64 and x86 guests.
>>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> I am not convinced this use case is worth such hacks (that’s what it is) for now. On real machines pmem is big - your example (losing 50%) is extreme.
>>>>>
>>>>> I would much rather see the section size on arm64 reduced. I remember there were patches, and that at least with a base page size of 4k it can be reduced drastically (64k base pages are more problematic due to the ridiculous THP size of 512M). But it could be that a section size of 512 is possible on all configs right now.
>>>>
>>>> Yes, I once investigated thoroughly how to reduce the section size on arm64. There are many constraints on reducing SECTION_SIZE_BITS:
>>>> 1. Since the number of page->flags bits is limited, SECTION_SIZE_BITS can't be reduced too much.
>>>> 2. Once CONFIG_SPARSEMEM_VMEMMAP is enabled, the section id is no longer stored in page->flags.
>>>> 3. MAX_ORDER depends on SECTION_SIZE_BITS:
>>>>  - 3.1 mmzone.h
>>>>    #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
>>>>    #error Allocator MAX_ORDER exceeds SECTION_SIZE
>>>>    #endif
>>>>  - 3.2 hugepage_init()
>>>>    MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER >= MAX_ORDER);
>>>>
>>>> Hence when ARM64_4K_PAGES && CONFIG_SPARSEMEM_VMEMMAP are enabled, SECTION_SIZE_BITS can be reduced to 27.
>>>> But with ARM64_64K_PAGES, given 3.2, MAX_ORDER > 29-16 = 13.
>>>> Given 3.1, SECTION_SIZE_BITS >= MAX_ORDER+15 > 28. So SECTION_SIZE_BITS cannot be reduced to 27.
>>>>
>>>> In short, if we were to reduce SECTION_SIZE_BITS on arm64, the Kconfig might become very complicated, e.g. we would still need to consider the ARM64_16K_PAGES case.
>>>
>>> It is not necessary to pollute Kconfig with that.
>>> arch/arm64/include/asm/sparsemem.h can have something like
>>>
>>> #ifdef CONFIG_ARM64_64K_PAGES
>>> #define SECTION_SIZE_BITS 29
>>> #elif defined(CONFIG_ARM64_16K_PAGES)
>>> #define SECTION_SIZE_BITS 28
>>> #elif defined(CONFIG_ARM64_4K_PAGES)
>>> #define SECTION_SIZE_BITS 27
>>> #else
>>> #error "unknown arm64 page size"
>>> #endif
>>
>> ack
>>
>>>
>>> There is still a large gap with ARM64_64K_PAGES, though.
>>>
>>> As for SPARSEMEM without VMEMMAP, are there actual benefits to using it?
>>
>> I was asking myself the same question a while ago and didn't really find a compelling one.
>
> Memory overhead for VMEMMAP is larger, especially for arm64, which knows how to free empty parts of the memory map with "classic" SPARSEMEM.

You mean the hole punching within the section memmap? (which is why their pfn_valid() implementation is special)

(I do wonder why that shouldn't work with VMEMMAP, or is it simply not implemented?)

>
>> I think it's always enabled by default (SPARSEMEM_VMEMMAP_ENABLE) and would require config tweaks to even disable it.
>
> Nope, it's right there in menuconfig:
>
> "Memory Management options" -> "Sparse Memory virtual memmap"

Ah, good to know.

--
Thanks,

David / dhildenb
Re: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem alignment
On Wed, Jul 29, 2020 at 11:35:20AM +0200, David Hildenbrand wrote:
> On 29.07.20 11:31, Mike Rapoport wrote:
>> Hi Justin,
>>
>> On Wed, Jul 29, 2020 at 08:27:58AM +0000, Justin He wrote:
>>> Hi David
>>>>>
>>>>> Without this series, if qemu creates a 4G nvdimm device, we can only use 2G of it for dax pmem (kmem) in the worst case, e.g.:
>>>>> 24000-33fdf : Persistent Memory
>>>>> We can only use the memblock range [24000, 2] due to the hard limitation. This wastes too much memory space.
>>>>>
>>>>> Decreasing SECTION_SIZE_BITS on arm64 might be an alternative, but there are too many concerns from other constraints, e.g. PAGE_SIZE, hugetlb, SPARSEMEM_VMEMMAP, page bits in struct page ...
>>>>>
>>>>> Besides decreasing SECTION_SIZE_BITS, we can also relax the requirement that the kmem range be aligned with memory_block_size_bytes().
>>>>>
>>>>> Tested on arm64 and x86 guests, with qemu creating a 4G pmem device: dax pmem can be used as RAM with a smaller gap. kmem hotplug add/remove was also tested on both arm64 and x86 guests.
>>>>>
>>>>
>>>> Hi,
>>>>
>>>> I am not convinced this use case is worth such hacks (that’s what it is) for now. On real machines pmem is big - your example (losing 50%) is extreme.
>>>>
>>>> I would much rather see the section size on arm64 reduced. I remember there were patches, and that at least with a base page size of 4k it can be reduced drastically (64k base pages are more problematic due to the ridiculous THP size of 512M). But it could be that a section size of 512 is possible on all configs right now.
>>>
>>> Yes, I once investigated thoroughly how to reduce the section size on arm64. There are many constraints on reducing SECTION_SIZE_BITS:
>>> 1. Since the number of page->flags bits is limited, SECTION_SIZE_BITS can't be reduced too much.
>>> 2. Once CONFIG_SPARSEMEM_VMEMMAP is enabled, the section id is no longer stored in page->flags.
>>> 3. MAX_ORDER depends on SECTION_SIZE_BITS:
>>>  - 3.1 mmzone.h
>>>    #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
>>>    #error Allocator MAX_ORDER exceeds SECTION_SIZE
>>>    #endif
>>>  - 3.2 hugepage_init()
>>>    MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER >= MAX_ORDER);
>>>
>>> Hence when ARM64_4K_PAGES && CONFIG_SPARSEMEM_VMEMMAP are enabled, SECTION_SIZE_BITS can be reduced to 27.
>>> But with ARM64_64K_PAGES, given 3.2, MAX_ORDER > 29-16 = 13.
>>> Given 3.1, SECTION_SIZE_BITS >= MAX_ORDER+15 > 28. So SECTION_SIZE_BITS cannot be reduced to 27.
>>>
>>> In short, if we were to reduce SECTION_SIZE_BITS on arm64, the Kconfig might become very complicated, e.g. we would still need to consider the ARM64_16K_PAGES case.
>>
>> It is not necessary to pollute Kconfig with that.
>> arch/arm64/include/asm/sparsemem.h can have something like
>>
>> #ifdef CONFIG_ARM64_64K_PAGES
>> #define SECTION_SIZE_BITS 29
>> #elif defined(CONFIG_ARM64_16K_PAGES)
>> #define SECTION_SIZE_BITS 28
>> #elif defined(CONFIG_ARM64_4K_PAGES)
>> #define SECTION_SIZE_BITS 27
>> #else
>> #error "unknown arm64 page size"
>> #endif
>
> ack
>
>>
>> There is still a large gap with ARM64_64K_PAGES, though.
>>
>> As for SPARSEMEM without VMEMMAP, are there actual benefits to using it?
>
> I was asking myself the same question a while ago and didn't really find a compelling one.

Memory overhead for VMEMMAP is larger, especially for arm64, which knows how to free empty parts of the memory map with "classic" SPARSEMEM.

> I think it's always enabled by default (SPARSEMEM_VMEMMAP_ENABLE) and would require config tweaks to even disable it.

Nope, it's right there in menuconfig:

"Memory Management options" -> "Sparse Memory virtual memmap"

> --
> Thanks,
>
> David / dhildenb

--
Sincerely yours,
Mike.
Re: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem alignment
On 29.07.20 11:31, Mike Rapoport wrote:
> Hi Justin,
>
> On Wed, Jul 29, 2020 at 08:27:58AM +0000, Justin He wrote:
>> Hi David
>>>>
>>>> Without this series, if qemu creates a 4G nvdimm device, we can only use 2G of it for dax pmem (kmem) in the worst case, e.g.:
>>>> 24000-33fdf : Persistent Memory
>>>> We can only use the memblock range [24000, 2] due to the hard limitation. This wastes too much memory space.
>>>>
>>>> Decreasing SECTION_SIZE_BITS on arm64 might be an alternative, but there are too many concerns from other constraints, e.g. PAGE_SIZE, hugetlb, SPARSEMEM_VMEMMAP, page bits in struct page ...
>>>>
>>>> Besides decreasing SECTION_SIZE_BITS, we can also relax the requirement that the kmem range be aligned with memory_block_size_bytes().
>>>>
>>>> Tested on arm64 and x86 guests, with qemu creating a 4G pmem device: dax pmem can be used as RAM with a smaller gap. kmem hotplug add/remove was also tested on both arm64 and x86 guests.
>>>>
>>>
>>> Hi,
>>>
>>> I am not convinced this use case is worth such hacks (that’s what it is) for now. On real machines pmem is big - your example (losing 50%) is extreme.
>>>
>>> I would much rather see the section size on arm64 reduced. I remember there were patches, and that at least with a base page size of 4k it can be reduced drastically (64k base pages are more problematic due to the ridiculous THP size of 512M). But it could be that a section size of 512 is possible on all configs right now.
>>
>> Yes, I once investigated thoroughly how to reduce the section size on arm64. There are many constraints on reducing SECTION_SIZE_BITS:
>> 1. Since the number of page->flags bits is limited, SECTION_SIZE_BITS can't be reduced too much.
>> 2. Once CONFIG_SPARSEMEM_VMEMMAP is enabled, the section id is no longer stored in page->flags.
>> 3. MAX_ORDER depends on SECTION_SIZE_BITS:
>>  - 3.1 mmzone.h
>>    #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
>>    #error Allocator MAX_ORDER exceeds SECTION_SIZE
>>    #endif
>>  - 3.2 hugepage_init()
>>    MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER >= MAX_ORDER);
>>
>> Hence when ARM64_4K_PAGES && CONFIG_SPARSEMEM_VMEMMAP are enabled, SECTION_SIZE_BITS can be reduced to 27.
>> But with ARM64_64K_PAGES, given 3.2, MAX_ORDER > 29-16 = 13.
>> Given 3.1, SECTION_SIZE_BITS >= MAX_ORDER+15 > 28. So SECTION_SIZE_BITS cannot be reduced to 27.
>>
>> In short, if we were to reduce SECTION_SIZE_BITS on arm64, the Kconfig might become very complicated, e.g. we would still need to consider the ARM64_16K_PAGES case.
>
> It is not necessary to pollute Kconfig with that.
> arch/arm64/include/asm/sparsemem.h can have something like
>
> #ifdef CONFIG_ARM64_64K_PAGES
> #define SECTION_SIZE_BITS 29
> #elif defined(CONFIG_ARM64_16K_PAGES)
> #define SECTION_SIZE_BITS 28
> #elif defined(CONFIG_ARM64_4K_PAGES)
> #define SECTION_SIZE_BITS 27
> #else
> #error "unknown arm64 page size"
> #endif

ack

>
> There is still a large gap with ARM64_64K_PAGES, though.
>
> As for SPARSEMEM without VMEMMAP, are there actual benefits to using it?

I was asking myself the same question a while ago and didn't really find a compelling one.

I think it's always enabled by default (SPARSEMEM_VMEMMAP_ENABLE) and would require config tweaks to even disable it.

--
Thanks,

David / dhildenb
Re: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem alignment
Hi Justin,

On Wed, Jul 29, 2020 at 08:27:58AM +0000, Justin He wrote:
> Hi David
>
>>>
>>> Without this series, if qemu creates a 4G nvdimm device, we can only use 2G of it for dax pmem (kmem) in the worst case, e.g.:
>>> 24000-33fdf : Persistent Memory
>>> We can only use the memblock range [24000, 2] due to the hard limitation. This wastes too much memory space.
>>>
>>> Decreasing SECTION_SIZE_BITS on arm64 might be an alternative, but there are too many concerns from other constraints, e.g. PAGE_SIZE, hugetlb, SPARSEMEM_VMEMMAP, page bits in struct page ...
>>>
>>> Besides decreasing SECTION_SIZE_BITS, we can also relax the requirement that the kmem range be aligned with memory_block_size_bytes().
>>>
>>> Tested on arm64 and x86 guests, with qemu creating a 4G pmem device: dax pmem can be used as RAM with a smaller gap. kmem hotplug add/remove was also tested on both arm64 and x86 guests.
>>>
>>
>> Hi,
>>
>> I am not convinced this use case is worth such hacks (that’s what it is) for now. On real machines pmem is big - your example (losing 50%) is extreme.
>>
>> I would much rather see the section size on arm64 reduced. I remember there were patches, and that at least with a base page size of 4k it can be reduced drastically (64k base pages are more problematic due to the ridiculous THP size of 512M). But it could be that a section size of 512 is possible on all configs right now.
>
> Yes, I once investigated thoroughly how to reduce the section size on arm64. There are many constraints on reducing SECTION_SIZE_BITS:
> 1. Since the number of page->flags bits is limited, SECTION_SIZE_BITS can't be reduced too much.
> 2. Once CONFIG_SPARSEMEM_VMEMMAP is enabled, the section id is no longer stored in page->flags.
> 3. MAX_ORDER depends on SECTION_SIZE_BITS:
>  - 3.1 mmzone.h
>    #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
>    #error Allocator MAX_ORDER exceeds SECTION_SIZE
>    #endif
>  - 3.2 hugepage_init()
>    MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER >= MAX_ORDER);
>
> Hence when ARM64_4K_PAGES && CONFIG_SPARSEMEM_VMEMMAP are enabled, SECTION_SIZE_BITS can be reduced to 27.
> But with ARM64_64K_PAGES, given 3.2, MAX_ORDER > 29-16 = 13.
> Given 3.1, SECTION_SIZE_BITS >= MAX_ORDER+15 > 28. So SECTION_SIZE_BITS cannot be reduced to 27.
>
> In short, if we were to reduce SECTION_SIZE_BITS on arm64, the Kconfig might become very complicated, e.g. we would still need to consider the ARM64_16K_PAGES case.

It is not necessary to pollute Kconfig with that.
arch/arm64/include/asm/sparsemem.h can have something like

#ifdef CONFIG_ARM64_64K_PAGES
#define SECTION_SIZE_BITS 29
#elif defined(CONFIG_ARM64_16K_PAGES)
#define SECTION_SIZE_BITS 28
#elif defined(CONFIG_ARM64_4K_PAGES)
#define SECTION_SIZE_BITS 27
#else
#error "unknown arm64 page size"
#endif

There is still a large gap with ARM64_64K_PAGES, though.

As for SPARSEMEM without VMEMMAP, are there actual benefits to using it?

>>
>> In the long term we might want to rework the memory block device model (eventually supporting old/new as discussed with Michal some time ago using a kernel parameter), dropping the fixed sizes
>
> Has this been posted to the linux-mm mailing list? Sorry, I searched and didn't find it.
>
> --
> Cheers,
> Justin (Jia He)
>
>> - allowing sizes / addresses aligned with subsection size
>> - drastically reducing the number of devices for boot memory to only a handful (e.g., one per resource / DIMM we can actually unplug again).
>>
>> Long story short, I don’t like this hack.
>>
>>>
>>> This patch series (mainly patch 6/6) is based on the fixing patch [2], on top of ~v5.8-rc5.
>>>
>>> [1] https://lkml.org/lkml/2019/6/19/67
>>> [2] https://lkml.org/lkml/2020/7/8/1546
>>> Jia He (6):
>>>   mm/memory_hotplug: remove redundant memory block size alignment check
>>>   resource: export find_next_iomem_res() helper
>>>   mm/memory_hotplug: allow pmem kmem not to align with memory_block_size
>>>   mm/page_alloc: adjust the start,end in dax pmem kmem case
>>>   device-dax: relax the memblock size alignment for kmem_start
>>>   arm64: fall back to vmemmap_populate_basepages if not aligned with
>>>     PMD_SIZE
>>>
>>>  arch/arm64/mm/mmu.c    |  4
>>>  drivers/base/memory.c  | 24
>>>  drivers/dax/kmem.c     | 22 +-
>>>  include/linux/ioport.h |  3 +++
>>>  kernel/resource.c      |  3 ++-
>>>  mm/memory_hotplug.c    | 39 ++-
>>>  mm/page_alloc.c        | 14 ++
>>>  7 files changed, 90 insertions(+), 19 deletions(-)
>>>
>>> --
>>> 2.17.1
>>>
>>

--
Sincerely yours,
Mike.
Re: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem alignment
On 29.07.20 10:27, Justin He wrote:
> Hi David
>
>> -----Original Message-----
>> From: David Hildenbrand
>> Sent: Wednesday, July 29, 2020 2:37 PM
>> To: Justin He
>> Cc: Dan Williams; Vishal Verma; Mike Rapoport; David Hildenbrand; Catalin Marinas; Will Deacon; Greg Kroah-Hartman; Rafael J. Wysocki; Dave Jiang; Andrew Morton; Steve Capper; Mark Rutland; Logan Gunthorpe; Anshuman Khandual; Hsin-Yi Wang; Jason Gunthorpe; Dave Hansen; Kees Cook; linux-arm-ker...@lists.infradead.org; linux-ker...@vger.kernel.org; linux-nvd...@lists.01.org; linux...@kvack.org; Wei Yang; Pankaj Gupta; Ira Weiny; Kaly Xin
>> Subject: Re: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem alignment
>>
>>
>>
>>> On 29.07.2020 at 05:35, Jia He wrote:
>>>
>>> When enabling dax pmem as a RAM device on arm64, I noticed that the kmem_start address in dev_dax_kmem_probe() must be aligned with SECTION_SIZE_BITS (30), i.e. a 1G memblock size. Even though Dan Williams' sub-section patch series [1] has been merged upstream, it does not help here because of the hard limitation on kmem_start:
>>> $ ndctl create-namespace -e namespace0.0 --mode=devdax --map=dev -s 2g -f -a 2M
>>> $ echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
>>> $ echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
>>> $ cat /proc/iomem
>>> ...
>>> 23c00-23fff : System RAM
>>>   23dd4-23fec : reserved
>>>   23fed-23fff : reserved
>>> 24000-33fdf : Persistent Memory
>>>   24000-2403f : namespace0.0
>>>   28000-2bfff : dax0.0          <- aligned with 1G boundary
>>>     28000-2bfff : System RAM
>>> Hence there is a big gap between 0x2403f and 0x28000 due to the 1G alignment.
>>>
>>> Without this series, if qemu creates a 4G nvdimm device, we can only use 2G of it for dax pmem (kmem) in the worst case, e.g.:
>>> 24000-33fdf : Persistent Memory
>>> We can only use the memblock range [24000, 2] due to the hard limitation. This wastes too much memory space.
>>>
>>> Decreasing SECTION_SIZE_BITS on arm64 might be an alternative, but there are too many concerns from other constraints, e.g. PAGE_SIZE, hugetlb, SPARSEMEM_VMEMMAP, page bits in struct page ...
>>>
>>> Besides decreasing SECTION_SIZE_BITS, we can also relax the requirement that the kmem range be aligned with memory_block_size_bytes().
>>>
>>> Tested on arm64 and x86 guests, with qemu creating a 4G pmem device: dax pmem can be used as RAM with a smaller gap. kmem hotplug add/remove was also tested on both arm64 and x86 guests.
>>>
>>
>> Hi,
>>
>> I am not convinced this use case is worth such hacks (that’s what it is) for now. On real machines pmem is big - your example (losing 50%) is extreme.
>>
>> I would much rather see the section size on arm64 reduced. I remember there were patches, and that at least with a base page size of 4k it can be reduced drastically (64k base pages are more problematic due to the ridiculous THP size of 512M). But it could be that a section size of 512 is possible on all configs right now.
>
> Yes, I once investigated thoroughly how to reduce the section size on arm64. There are many constraints on reducing SECTION_SIZE_BITS:
> 1. Since the number of page->flags bits is limited, SECTION_SIZE_BITS can't be reduced too much.
> 2. Once CONFIG_SPARSEMEM_VMEMMAP is enabled, the section id is no longer stored in page->flags.

Yep.

> 3. MAX_ORDER depends on SECTION_SIZE_BITS:
>  - 3.1 mmzone.h
>    #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
>    #error Allocator MAX_ORDER exceeds SECTION_SIZE
>    #endif

Yep, with 4k base pages it's 4 MB. However, with 64k base pages it's 512MB ( :( ).

>  - 3.2 hugepage_init()
>    MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER >= MAX_ORDER);
>
> Hence when ARM64_4K_PAGES && CONFIG_SPARSEMEM_VMEMMAP are enabled, SECTION_SIZE_BITS can be reduced to 27.
> But with ARM64_64K_PAGES, given 3.2, MAX_ORDER > 29-16 = 13.
> Given 3.1, SECTION_SIZE_BITS >= MAX_ORDER+15 > 28. So SECTION_SIZE_BITS cannot be reduced to 27.

I think there were plans to eventually switch to 2MB THP with 64k base pages as well (which can be emulated using some sort of consecutive PTE
RE: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem alignment
Hi David

> -----Original Message-----
> From: David Hildenbrand
> Sent: Wednesday, July 29, 2020 2:37 PM
> To: Justin He
> Cc: Dan Williams; Vishal Verma; Mike Rapoport; David Hildenbrand; Catalin Marinas; Will Deacon; Greg Kroah-Hartman; Rafael J. Wysocki; Dave Jiang; Andrew Morton; Steve Capper; Mark Rutland; Logan Gunthorpe; Anshuman Khandual; Hsin-Yi Wang; Jason Gunthorpe; Dave Hansen; Kees Cook; linux-arm-ker...@lists.infradead.org; linux-ker...@vger.kernel.org; linux-nvd...@lists.01.org; linux...@kvack.org; Wei Yang; Pankaj Gupta; Ira Weiny; Kaly Xin
> Subject: Re: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem alignment
>
>
>
>> On 29.07.2020 at 05:35, Jia He wrote:
>>
>> When enabling dax pmem as a RAM device on arm64, I noticed that the kmem_start address in dev_dax_kmem_probe() must be aligned with SECTION_SIZE_BITS (30), i.e. a 1G memblock size. Even though Dan Williams' sub-section patch series [1] has been merged upstream, it does not help here because of the hard limitation on kmem_start:
>> $ ndctl create-namespace -e namespace0.0 --mode=devdax --map=dev -s 2g -f -a 2M
>> $ echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
>> $ echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
>> $ cat /proc/iomem
>> ...
>> 23c00-23fff : System RAM
>>   23dd4-23fec : reserved
>>   23fed-23fff : reserved
>> 24000-33fdf : Persistent Memory
>>   24000-2403f : namespace0.0
>>   28000-2bfff : dax0.0          <- aligned with 1G boundary
>>     28000-2bfff : System RAM
>> Hence there is a big gap between 0x2403f and 0x28000 due to the 1G alignment.
>>
>> Without this series, if qemu creates a 4G nvdimm device, we can only use 2G of it for dax pmem (kmem) in the worst case, e.g.:
>> 24000-33fdf : Persistent Memory
>> We can only use the memblock range [24000, 2] due to the hard limitation. This wastes too much memory space.
>>
>> Decreasing SECTION_SIZE_BITS on arm64 might be an alternative, but there are too many concerns from other constraints, e.g. PAGE_SIZE, hugetlb, SPARSEMEM_VMEMMAP, page bits in struct page ...
>>
>> Besides decreasing SECTION_SIZE_BITS, we can also relax the requirement that the kmem range be aligned with memory_block_size_bytes().
>>
>> Tested on arm64 and x86 guests, with qemu creating a 4G pmem device: dax pmem can be used as RAM with a smaller gap. kmem hotplug add/remove was also tested on both arm64 and x86 guests.
>>
>
> Hi,
>
> I am not convinced this use case is worth such hacks (that’s what it is) for now. On real machines pmem is big - your example (losing 50%) is extreme.
>
> I would much rather see the section size on arm64 reduced. I remember there were patches, and that at least with a base page size of 4k it can be reduced drastically (64k base pages are more problematic due to the ridiculous THP size of 512M). But it could be that a section size of 512 is possible on all configs right now.

Yes, I once investigated thoroughly how to reduce the section size on arm64. There are many constraints on reducing SECTION_SIZE_BITS:
1. Since the number of page->flags bits is limited, SECTION_SIZE_BITS can't be reduced too much.
2. Once CONFIG_SPARSEMEM_VMEMMAP is enabled, the section id is no longer stored in page->flags.
3. MAX_ORDER depends on SECTION_SIZE_BITS:
 - 3.1 mmzone.h
   #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
   #error Allocator MAX_ORDER exceeds SECTION_SIZE
   #endif
 - 3.2 hugepage_init()
   MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER >= MAX_ORDER);

Hence when ARM64_4K_PAGES && CONFIG_SPARSEMEM_VMEMMAP are enabled, SECTION_SIZE_BITS can be reduced to 27.
But with ARM64_64K_PAGES, given 3.2, MAX_ORDER > 29-16 = 13.
Given 3.1, SECTION_SIZE_BITS >= MAX_ORDER+15 > 28. So SECTION_SIZE_BITS cannot be reduced to 27.

In short, if we were to reduce SECTION_SIZE_BITS on arm64, the Kconfig might become very complicated, e.g. we would still need to consider the ARM64_16K_PAGES case.

>
> In the long term we might want to rework the memory block device model (eventually supporting old/new as discussed with Michal some time ago using a kernel parameter), dropping the fixed sizes

Has this been posted to the linux-mm mailing list? Sorry, I searched and didn't find it.

--
Cheers,
Justin (Jia He)

> - allowing sizes / addresses aligned with subsection size
> - drastically reducing the number of devices for boot memory to only a handful (e.g., one per resource / DIMM we can
Re: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem alignment
> On 29.07.2020 at 05:35, Jia He wrote:
>
> When enabling dax pmem as a RAM device on arm64, I noticed that the kmem_start address in dev_dax_kmem_probe() must be aligned with SECTION_SIZE_BITS (30), i.e. a 1G memblock size. Even though Dan Williams' sub-section patch series [1] has been merged upstream, it does not help here because of the hard limitation on kmem_start:
> $ ndctl create-namespace -e namespace0.0 --mode=devdax --map=dev -s 2g -f -a 2M
> $ echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
> $ echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
> $ cat /proc/iomem
> ...
> 23c00-23fff : System RAM
>   23dd4-23fec : reserved
>   23fed-23fff : reserved
> 24000-33fdf : Persistent Memory
>   24000-2403f : namespace0.0
>   28000-2bfff : dax0.0          <- aligned with 1G boundary
>     28000-2bfff : System RAM
> Hence there is a big gap between 0x2403f and 0x28000 due to the 1G alignment.
>
> Without this series, if qemu creates a 4G nvdimm device, we can only use 2G of it for dax pmem (kmem) in the worst case, e.g.:
> 24000-33fdf : Persistent Memory
> We can only use the memblock range [24000, 2] due to the hard limitation. This wastes too much memory space.
>
> Decreasing SECTION_SIZE_BITS on arm64 might be an alternative, but there are too many concerns from other constraints, e.g. PAGE_SIZE, hugetlb, SPARSEMEM_VMEMMAP, page bits in struct page ...
>
> Besides decreasing SECTION_SIZE_BITS, we can also relax the requirement that the kmem range be aligned with memory_block_size_bytes().
>
> Tested on arm64 and x86 guests, with qemu creating a 4G pmem device: dax pmem can be used as RAM with a smaller gap. kmem hotplug add/remove was also tested on both arm64 and x86 guests.
>

Hi,

I am not convinced this use case is worth such hacks (that’s what it is) for now. On real machines pmem is big - your example (losing 50%) is extreme.

I would much rather see the section size on arm64 reduced. I remember there were patches, and that at least with a base page size of 4k it can be reduced drastically (64k base pages are more problematic due to the ridiculous THP size of 512M). But it could be that a section size of 512 is possible on all configs right now.

In the long term we might want to rework the memory block device model (eventually supporting old/new as discussed with Michal some time ago using a kernel parameter), dropping the fixed sizes
- allowing sizes / addresses aligned with subsection size
- drastically reducing the number of devices for boot memory to only a handful (e.g., one per resource / DIMM we can actually unplug again).

Long story short, I don’t like this hack.

> This patch series (mainly patch 6/6) is based on the fixing patch [2], on top of ~v5.8-rc5.
>
> [1] https://lkml.org/lkml/2019/6/19/67
> [2] https://lkml.org/lkml/2020/7/8/1546
> Jia He (6):
>   mm/memory_hotplug: remove redundant memory block size alignment check
>   resource: export find_next_iomem_res() helper
>   mm/memory_hotplug: allow pmem kmem not to align with memory_block_size
>   mm/page_alloc: adjust the start,end in dax pmem kmem case
>   device-dax: relax the memblock size alignment for kmem_start
>   arm64: fall back to vmemmap_populate_basepages if not aligned with
>     PMD_SIZE
>
>  arch/arm64/mm/mmu.c    |  4
>  drivers/base/memory.c  | 24
>  drivers/dax/kmem.c     | 22 +-
>  include/linux/ioport.h |  3 +++
>  kernel/resource.c      |  3 ++-
>  mm/memory_hotplug.c    | 39 ++-
>  mm/page_alloc.c        | 14 ++
>  7 files changed, 90 insertions(+), 19 deletions(-)
>
> --
> 2.17.1
>
[RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem alignment
When enabling dax pmem as a RAM device on arm64, I noticed that the kmem_start address in dev_dax_kmem_probe() must be aligned with SECTION_SIZE_BITS (30), i.e. a 1G memblock size. Even though Dan Williams' sub-section patch series [1] has been merged upstream, it does not help here because of the hard limitation on kmem_start:

$ ndctl create-namespace -e namespace0.0 --mode=devdax --map=dev -s 2g -f -a 2M
$ echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
$ echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
$ cat /proc/iomem
...
23c00-23fff : System RAM
  23dd4-23fec : reserved
  23fed-23fff : reserved
24000-33fdf : Persistent Memory
  24000-2403f : namespace0.0
  28000-2bfff : dax0.0          <- aligned with 1G boundary
    28000-2bfff : System RAM

Hence there is a big gap between 0x2403f and 0x28000 due to the 1G alignment.

Without this series, if qemu creates a 4G nvdimm device, we can only use 2G of it for dax pmem (kmem) in the worst case, e.g.:
24000-33fdf : Persistent Memory
We can only use the memblock range [24000, 2] due to the hard limitation. This wastes too much memory space.

Decreasing SECTION_SIZE_BITS on arm64 might be an alternative, but there are too many concerns from other constraints, e.g. PAGE_SIZE, hugetlb, SPARSEMEM_VMEMMAP, page bits in struct page ...

Besides decreasing SECTION_SIZE_BITS, we can also relax the requirement that the kmem range be aligned with memory_block_size_bytes().

Tested on arm64 and x86 guests, with qemu creating a 4G pmem device: dax pmem can be used as RAM with a smaller gap. kmem hotplug add/remove was also tested on both arm64 and x86 guests.

This patch series (mainly patch 6/6) is based on the fixing patch [2], on top of ~v5.8-rc5.

[1] https://lkml.org/lkml/2019/6/19/67
[2] https://lkml.org/lkml/2020/7/8/1546

Jia He (6):
  mm/memory_hotplug: remove redundant memory block size alignment check
  resource: export find_next_iomem_res() helper
  mm/memory_hotplug: allow pmem kmem not to align with memory_block_size
  mm/page_alloc: adjust the start,end in dax pmem kmem case
  device-dax: relax the memblock size alignment for kmem_start
  arm64: fall back to vmemmap_populate_basepages if not aligned with
    PMD_SIZE

 arch/arm64/mm/mmu.c    |  4
 drivers/base/memory.c  | 24
 drivers/dax/kmem.c     | 22 +-
 include/linux/ioport.h |  3 +++
 kernel/resource.c      |  3 ++-
 mm/memory_hotplug.c    | 39 ++-
 mm/page_alloc.c        | 14 ++
 7 files changed, 90 insertions(+), 19 deletions(-)

--
2.17.1