Re: [PATCH v6 11/12] mm/vmalloc: Hugepage vmalloc mappings

2020-08-21 Thread Nicholas Piggin
Excerpts from Eric Dumazet's message of August 22, 2020 1:38 am:
> 
> On 8/21/20 8:12 AM, Nicholas Piggin wrote:
>> Support huge page vmalloc mappings. Config option HAVE_ARCH_HUGE_VMALLOC
>> enables support on architectures that define HAVE_ARCH_HUGE_VMAP and
>> support PMD-sized vmap mappings.
>> 
>> vmalloc will attempt to allocate PMD-sized pages for allocations of PMD
>> size or larger, and will fall back to small pages if that is unsuccessful.
>> 
>> Allocations that do not use PAGE_KERNEL prot are not permitted to use huge
>> pages, because not all callers expect this (e.g., module allocations vs
>> strict module rwx).
>> 
>> This reduces TLB misses by nearly 30x on a `git diff` workload on a 2-node
>> POWER9 (59,800 -> 2,100) and reduces CPU cycles by 0.54%.
>> 
>> This can result in more internal fragmentation and memory overhead for a
>> given allocation; a boot option, nohugevmalloc, is added to disable it.
>> 
>>
> 
> Thanks for working on this stuff. I tried something similar in the past,
> but could not really do more than a hack.
> ( https://lkml.org/lkml/2016/12/21/285 )

Oh nice. It might still be possible to pick up some ideas from your
patch: higher-order pages smaller than PMD size, or the memory policy
stuff, perhaps.

> Note that __init alloc_large_system_hash() is used at boot time,
> when NUMA policy is spreading allocations over all NUMA nodes.
> 
> This means that on a dual-node system, a hash table should be spread 50/50.
> 
> With your patch, if a hashtable is exactly the size of one huge page,
> the location of this hashtable will not be balanced, which might have
> some unwanted impact.

In that case it shouldn't be, because it divides by the number of nodes,
but in general the balancing granularity will of course be a bit coarser
than with small pages.

There's probably a better way to size these important hashes on NUMA. I
suspect that on most NUMA machines you would actually prefer to use large
pages now, even if it means taking up to 2MB more memory per node per
hash. It's not a huge amount, and the allocation size is rather arbitrary
anyway.
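
To put a number on that "up to 2MB more memory per node per hash", here
is a rough back-of-the-envelope sketch (illustration only, with made-up
sizes; it is not code from the patch):

/*
 * Worst-case extra memory when each node's share of a hash table is
 * rounded up to a 2MB page, versus exact 4kB granularity.
 */
#include <stdio.h>

#define EXAMPLE_PMD_SIZE (2UL << 20)	/* assume 2MB huge pages */

static unsigned long round_up_pmd(unsigned long bytes)
{
	return (bytes + EXAMPLE_PMD_SIZE - 1) & ~(EXAMPLE_PMD_SIZE - 1);
}

int main(void)
{
	unsigned long table_bytes = 10UL << 20;		/* hypothetical 10MB hash */
	int nodes = 4;
	unsigned long per_node = table_bytes / nodes;	/* 2.5MB per node */
	unsigned long rounded = round_up_pmd(per_node);	/* rounds up to 4MB */

	printf("per-node share %lu kB -> %lu kB (overhead %lu kB)\n",
	       per_node >> 10, rounded >> 10, (rounded - per_node) >> 10);
	return 0;
}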

Thanks,
Nick


Re: [PATCH v6 11/12] mm/vmalloc: Hugepage vmalloc mappings

2020-08-21 Thread Eric Dumazet


On 8/21/20 8:12 AM, Nicholas Piggin wrote:
> Support huge page vmalloc mappings. Config option HAVE_ARCH_HUGE_VMALLOC
> enables support on architectures that define HAVE_ARCH_HUGE_VMAP and
> support PMD-sized vmap mappings.
> 
> vmalloc will attempt to allocate PMD-sized pages for allocations of PMD
> size or larger, and will fall back to small pages if that is unsuccessful.
> 
> Allocations that do not use PAGE_KERNEL prot are not permitted to use huge
> pages, because not all callers expect this (e.g., module allocations vs
> strict module rwx).
> 
> This reduces TLB misses by nearly 30x on a `git diff` workload on a 2-node
> POWER9 (59,800 -> 2,100) and reduces CPU cycles by 0.54%.
> 
> This can result in more internal fragmentation and memory overhead for a
> given allocation; a boot option, nohugevmalloc, is added to disable it.
> 
>

Thanks for working on this stuff. I tried something similar in the past,
but could not really do more than a hack.
( https://lkml.org/lkml/2016/12/21/285 )

Note that __init alloc_large_system_hash() is used at boot time,
when NUMA policy is spreading allocations over all NUMA nodes.

This means that on a dual-node system, a hash table should be spread 50/50.

With your patch, if a hashtable is exactly the size of one huge page,
the location of this hashtable will not be balanced, which might have
some unwanted impact.
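
To spell the concern out, a toy illustration (assuming 4kB base pages
and 2MB huge pages; not kernel code):

/*
 * A hash table that is exactly one huge page: with 4kB backing pages,
 * round-robin interleave can spread it across nodes; with a single 2MB
 * page there is nothing left to spread.
 */
#include <stdio.h>

int main(void)
{
	unsigned long table_bytes = 2UL << 20;	/* 2MB hash table */
	int nodes = 2;

	unsigned long small_pages = table_bytes / (4UL << 10);	/* 512 x 4kB */
	unsigned long huge_pages  = table_bytes / (2UL << 20);	/* 1 x 2MB */

	printf("4kB backing: ~%lu pages per node\n", small_pages / nodes);
	printf("2MB backing: %lu page, all on one node\n", huge_pages);
	return 0;
}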

Thanks !



[PATCH v6 11/12] mm/vmalloc: Hugepage vmalloc mappings

2020-08-21 Thread Nicholas Piggin
Support huge page vmalloc mappings. Config option HAVE_ARCH_HUGE_VMALLOC
enables support on architectures that define HAVE_ARCH_HUGE_VMAP and
support PMD-sized vmap mappings.

vmalloc will attempt to allocate PMD-sized pages for allocations of PMD
size or larger, and will fall back to small pages if that is unsuccessful.

Allocations that do not use PAGE_KERNEL prot are not permitted to use huge
pages, because not all callers expect this (e.g., module allocations vs
strict module rwx).

This reduces TLB misses by nearly 30x on a `git diff` workload on a 2-node
POWER9 (59,800 -> 2,100) and reduces CPU cycles by 0.54%.

This can result in more internal fragmentation and memory overhead for a
given allocation; a boot option, nohugevmalloc, is added to disable it.

Signed-off-by: Nicholas Piggin 
---
 arch/Kconfig            |   4 +
 include/linux/vmalloc.h |   1 +
 mm/page_alloc.c         |   5 +-
 mm/vmalloc.c            | 180 ++--
 4 files changed, 145 insertions(+), 45 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index af14a567b493..b2b89d629317 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -616,6 +616,10 @@ config HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 config HAVE_ARCH_HUGE_VMAP
bool
 
+config HAVE_ARCH_HUGE_VMALLOC
+   depends on HAVE_ARCH_HUGE_VMAP
+   bool
+
 config ARCH_WANT_HUGE_PMD_SHARE
bool
 
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 15adb9a14fb6..a7449064fe35 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -58,6 +58,7 @@ struct vm_struct {
unsigned long   size;
unsigned long   flags;
struct page **pages;
+   unsigned intpage_order;
unsigned intnr_pages;
phys_addr_t phys_addr;
const void  *caller;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0e2bab486fea..b6427cc7b838 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -69,6 +69,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -8102,6 +8103,7 @@ void *__init alloc_large_system_hash(const char *tablename,
void *table = NULL;
gfp_t gfp_flags;
bool virt;
+   bool huge;
 
/* allow the kernel cmdline to have a say */
if (!numentries) {
@@ -8169,6 +8171,7 @@ void *__init alloc_large_system_hash(const char *tablename,
} else if (get_order(size) >= MAX_ORDER || hashdist) {
table = __vmalloc(size, gfp_flags);
virt = true;
+   huge = (find_vm_area(table)->page_order > 0);
} else {
/*
 * If bucketsize is not a power-of-two, we may free
@@ -8185,7 +8188,7 @@ void *__init alloc_large_system_hash(const char *tablename,
 
pr_info("%s hash table entries: %ld (order: %d, %lu bytes, %s)\n",
tablename, 1UL << log2qty, ilog2(size) - PAGE_SHIFT, size,
-   virt ? "vmalloc" : "linear");
+   virt ? (huge ? "vmalloc hugepage" : "vmalloc") : "linear");
 
if (_hash_shift)
*_hash_shift = log2qty;
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 1d6cad16bda3..8db53c2d7f72 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -44,6 +44,19 @@
 #include "internal.h"
 #include "pgalloc-track.h"
 
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
+static bool __ro_after_init vmap_allow_huge = true;
+
+static int __init set_nohugevmalloc(char *str)
+{
+   vmap_allow_huge = false;
+   return 0;
+}
+early_param("nohugevmalloc", set_nohugevmalloc);
+#else /* CONFIG_HAVE_ARCH_HUGE_VMALLOC */
+static const bool vmap_allow_huge = false;
+#endif /* CONFIG_HAVE_ARCH_HUGE_VMALLOC */
+
 bool is_vmalloc_addr(const void *x)
 {
unsigned long addr = (unsigned long)x;
@@ -477,31 +490,12 @@ static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr,
return 0;
 }
 
-/**
- * map_kernel_range_noflush - map kernel VM area with the specified pages
- * @addr: start of the VM area to map
- * @size: size of the VM area to map
- * @prot: page protection flags to use
- * @pages: pages to map
- *
- * Map PFN_UP(@size) pages at @addr.  The VM area @addr and @size specify should
- * have been allocated using get_vm_area() and its friends.
- *
- * NOTE:
- * This function does NOT do any cache flushing.  The caller is responsible for
- * calling flush_cache_vmap() on to-be-mapped areas before calling this
- * function.
- *
- * RETURNS:
- * 0 on success, -errno on failure.
- */
-int map_kernel_range_noflush(unsigned long addr, unsigned long size,
-pgprot_t prot, struct page **pages)
+static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
+   pgprot_t prot, struct page **pages)
 {
unsigned long start = addr;
-   unsigned long end = addr + size;
-   unsigned long next;
pgd_t *pgd;
+