Re: [PATCH] Add support for choosing huge page size

2020-06-21 Thread Odin Ugedal
ry time, and it'd also be slower for
> other stupid implementation reasons).
>
> # echo never > /sys/kernel/mm/transparent_hugepage/enabled
> # echo 8500 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> # echo 17 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>
> shared_buffers=8GB
> dynamic_shared_memory_main_size=8GB
>
> create table t as select generate_series(1, 1)::int i;
> alter table t set (parallel_workers = 7);
> create extension pg_prewarm;
> select pg_prewarm('t');
> set max_parallel_workers_per_gather=7;
> set work_mem='1GB';
>
> select count(*) from t t1 join t t2 using (i);
>
> 4KB pages: 12.42 seconds
> 2MB pages:  9.12 seconds
> 1GB pages:  9.07 seconds
>
> Unfortunately I can't access the TLB miss counters on this system due
> to virtualisation restrictions, and the systems where I can don't have
> 1GB pages.  According to cpuid(1) this system has a fairly typical
> setup:
>
>    cache and TLB information (2):
>       0x63: data TLB: 2M/4M pages, 4-way, 32 entries
>             data TLB: 1G pages, 4-way, 4 entries
>       0x03: data TLB: 4K pages, 4-way, 64 entries
>
> This operation is touching about 8GB of data (scanning 3.5GB of table,
> building a 4.5GB hash table) so 4 x 1GB is not enough to do this without
> TLB misses.
>
> Let's try that again, except this time with shared_buffers=4GB,
> dynamic_shared_memory_main_size=4GB, and only half as many tuples in
> t, so it ought to fit:
>
> 4KB pages:  6.37 seconds
> 2MB pages:  4.96 seconds
> 1GB pages:  5.07 seconds
>
> Well that's disappointing.  I wondered if this was something to do
> with NUMA effects on this two node box, so I tried running that again
> with postgres under numactl --cpunodebind 0 --membind 0 and I got:
>
> 4KB pages:  5.43 seconds
> 2MB pages:  4.05 seconds
> 1GB pages:  4.00 seconds
>
> From this I can't really conclude that it's terribly useful to use
> larger page sizes, but it's certainly useful to have the ability to do
> further testing using the proposed GUC.
>
> [1] 
> https://www.postgresql.org/message-id/flat/CA%2BhUKGLAE2QBv-WgGp%2BD9P_J-%3Dyne3zof9nfMaqq1h3EGHFXYQ%40mail.gmail.com
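
The "4 x 1GB is not enough" reasoning in the quoted message can be written out as a small calculation. This is only a sketch: the function names are made up for illustration, and it looks at the first-level data TLB entries listed in the cpuid output only, ignoring any second-level TLB the CPU may have.

```c
#include <assert.h>
#include <stdint.h>

/*
 * TLB reach in bytes: how much memory the data TLB can map at once for
 * one page size (number of entries times page size).
 * Hypothetical helper, not part of the patch.
 */
static uint64_t
tlb_reach_bytes(uint64_t entries, uint64_t page_size_bytes)
{
    assert(page_size_bytes > 0);
    return entries * page_size_bytes;
}

/* Returns 1 if a working set of the given size fits within that reach. */
static int
fits_in_tlb(uint64_t working_set_bytes, uint64_t entries,
            uint64_t page_size_bytes)
{
    return working_set_bytes <= tlb_reach_bytes(entries, page_size_bytes);
}
```

With the numbers above, 64 entries of 4kB pages reach 256kB, 32 entries of 2MB pages reach 64MB, and 4 entries of 1GB pages reach 4GB, so an 8GB working set cannot be covered by the 1GB entries while a 4GB one can.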
From fa3b30a32032bf38c8dc72de9656526a5d5e8daa Mon Sep 17 00:00:00 2001
From: Odin Ugedal 
Date: Sun, 7 Jun 2020 21:04:57 +0200
Subject: [PATCH v4] Add support for choosing huge page size

This adds support for using non-default huge page sizes for shared
memory. This is achieved via the new "huge_page_size" config entry.
The config value defaults to 0, meaning it will use the system default.
---
 configure | 26 +++
 configure.in  |  4 ++
 doc/src/sgml/config.sgml  | 27 
 doc/src/sgml/runtime.sgml | 41 ++-
 src/backend/port/sysv_shmem.c | 69 ++-
 src/backend/utils/misc/guc.c  | 25 +++
 src/backend/utils/misc/postgresql.conf.sample |  2 +
 src/include/pg_config.h.in|  8 +++
 src/include/pg_config_manual.h|  6 ++
 src/include/storage/pg_shmem.h|  1 +
 src/tools/msvc/Solution.pm|  2 +
 11 files changed, 179 insertions(+), 32 deletions(-)

diff --git a/configure b/configure
index 2feff37fe3..11e3112ee4 100755
--- a/configure
+++ b/configure
@@ -15488,6 +15488,32 @@ _ACEOF
 
 fi # fi
 
+# Check if system supports mmap flags for allocating huge page memory with page sizes
+# other than the default
+ac_fn_c_check_decl "$LINENO" "MAP_HUGE_MASK" "ac_cv_have_decl_MAP_HUGE_MASK" "#include 
+"
+if test "x$ac_cv_have_decl_MAP_HUGE_MASK" = xyes; then :
+  ac_have_decl=1
+else
+  ac_have_decl=0
+fi
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE_DECL_MAP_HUGE_MASK $ac_have_decl
+_ACEOF
+ac_fn_c_check_decl "$LINENO" "MAP_HUGE_SHIFT" "ac_cv_have_decl_MAP_HUGE_SHIFT" "#include 
+"
+if test "x$ac_cv_have_decl_MAP_HUGE_SHIFT" = xyes; then :
+  ac_have_decl=1
+else
+  ac_have_decl=0
+fi
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE_DECL_MAP_HUGE_SHIFT $ac_have_decl
+_ACEOF
+
+
 ac_fn_c_check_decl "$LINENO" "fdatasync" "ac_cv_have_decl_fdatasync" "#include 
 "
 if test "x$ac_cv_have_decl_fdatasync" = xyes; then :
diff --git a/configure.in b/configure.in
index 0188c6ff07..f56c06eb3d 100644
--- a/configure.in
+++ b/configure.in
@@ -1687,6 +1687,10 @@ AC_CHECK_FUNCS(posix_fadvise)
 AC_CHECK_DECLS(posix_fadvise, [], [], [#include ])
 ]) # fi
 
+# Check if system supports mmap flags for allocating huge page memory with page sizes
+# other than the default
+AC_CHECK_DECLS

Re: [PATCH] Add support for choosing huge page size

2020-06-10 Thread Odin Ugedal
Thanks again Thomas,

> Oh, so maybe we need a configure test for them?  And if you don't have
> it, a runtime error if you try to set the page size to something other
> than 0 (like we do for effective_io_concurrency if you don't have a
> posix_fadvise() function).

Ahh, yes, that sounds reasonable. I did some fiddling with the configure
script to add a check, and I think I got it right (though I'm not 100%
sure). Attached a new v3 patch.
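
For readers following along, the kind of check being discussed could look roughly like the sketch below. It is not the code from the patch: it tests the macros directly with the preprocessor instead of using the HAVE_DECL_* symbols that configure generates, and the function name and error message are invented for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/mman.h>

/* Assumption: both macros must be visible to encode a page size. */
#if defined(MAP_HUGE_MASK) && defined(MAP_HUGE_SHIFT)
#define HUGE_PAGE_SIZE_SELECTABLE 1
#else
#define HUGE_PAGE_SIZE_SELECTABLE 0
#endif

/*
 * Hypothetical check hook: 0 (system default) is always accepted, any
 * other value is rejected when the platform cannot encode a page size
 * in the mmap() flags.
 */
static bool
check_huge_page_size_sketch(int new_size_kb)
{
    assert(new_size_kb >= 0);
    if (new_size_kb != 0 && !HUGE_PAGE_SIZE_SELECTABLE)
    {
        fprintf(stderr, "huge_page_size must be 0 on this platform\n");
        return false;
    }
    return true;
}
```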

> If you set it to an unsupported size, that seems reasonable to me.  If
> you set it to an unsupported size and have huge_pages=try, do we fall
> back to using no huge pages?

Yes, the "fallback" with huge_pages=try is the same for both
huge_page_size=0 and huge_page_size=nMB, and is the same as without
this patch.
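
As a rough illustration of that fallback (not the actual sysv_shmem.c code), a Linux-only sketch might look like this. The enum, the function name and the size_flags parameter are all hypothetical, and the caller is assumed to have rounded the size up to a multiple of the huge page size.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical stand-in for the huge_pages setting. */
typedef enum { HP_OFF, HP_TRY, HP_ON } huge_pages_mode;

/*
 * Map anonymous shared memory. With HP_TRY a failed huge page mapping
 * falls back to regular pages; with HP_ON it does not. size_flags holds
 * the encoded page size, or 0 for the system default.
 * Returns NULL on failure; *used_huge reports which path succeeded.
 */
static void *
map_shared_sketch(size_t size, huge_pages_mode mode, int size_flags,
                  int *used_huge)
{
    void *ptr = MAP_FAILED;

    assert(size > 0);
    *used_huge = 0;

#ifdef MAP_HUGETLB
    if (mode != HP_OFF)
    {
        ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB | size_flags,
                   -1, 0);
        if (ptr != MAP_FAILED)
            *used_huge = 1;
    }
#endif

    /* Same fallback whether or not a specific size was requested. */
    if (ptr == MAP_FAILED && mode != HP_ON)
        ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    return (ptr == MAP_FAILED) ? NULL : ptr;
}
```

An unsupported size makes the first mmap() fail, so with HP_TRY the second call runs exactly as it would for the default size.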

> For what it's worth, here's what I know about this on other operating systems:

Thanks for all the background info!

> 1.  AIX can do huge pages, but only if you use System V shared memory
> (not for mmap() anonymous shared).  In
> https://commitfest.postgresql.org/25/1960/ we got as far as adding
> support for shared_memory_type=sysv, but to go further we'll need
> someone willing to hack on the patch on an AIX system, preferably with
> root access so they can grant the postgres user wired memory
> privileges (or whatever they call that over there).  But at a glance,
> they don't have a way to ask for a specific page size, just "large".

Interesting. I might get access to some AIX systems at university this fall,
so maybe I will get some time to dive into the patch.


Odin
From 8cb876bf73258646044a6a99d72e7c12d1d03e3a Mon Sep 17 00:00:00 2001
From: Odin Ugedal 
Date: Sun, 7 Jun 2020 21:04:57 +0200
Subject: [PATCH v3] Add support for choosing huge page size

This adds support for using non-default huge page sizes for shared
memory. This is achieved via the new "huge_page_size" config entry.
The config value defaults to 0, meaning it will use the system default.
---
 configure | 26 +++
 configure.in  |  4 ++
 doc/src/sgml/config.sgml  | 27 
 doc/src/sgml/runtime.sgml | 41 +++-
 src/backend/port/sysv_shmem.c | 67 ++-
 src/backend/utils/misc/guc.c  | 25 +++
 src/backend/utils/misc/postgresql.conf.sample |  2 +
 src/include/pg_config.h.in|  8 +++
 src/include/pg_config_manual.h|  6 ++
 src/include/storage/pg_shmem.h|  1 +
 10 files changed, 176 insertions(+), 31 deletions(-)

diff --git a/configure b/configure
index 2feff37fe3..11e3112ee4 100755
--- a/configure
+++ b/configure
@@ -15488,6 +15488,32 @@ _ACEOF
 
 fi # fi
 
+# Check if system supports mmap flags for allocating huge page memory with page sizes
+# other than the default
+ac_fn_c_check_decl "$LINENO" "MAP_HUGE_MASK" "ac_cv_have_decl_MAP_HUGE_MASK" "#include 
+"
+if test "x$ac_cv_have_decl_MAP_HUGE_MASK" = xyes; then :
+  ac_have_decl=1
+else
+  ac_have_decl=0
+fi
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE_DECL_MAP_HUGE_MASK $ac_have_decl
+_ACEOF
+ac_fn_c_check_decl "$LINENO" "MAP_HUGE_SHIFT" "ac_cv_have_decl_MAP_HUGE_SHIFT" "#include 
+"
+if test "x$ac_cv_have_decl_MAP_HUGE_SHIFT" = xyes; then :
+  ac_have_decl=1
+else
+  ac_have_decl=0
+fi
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE_DECL_MAP_HUGE_SHIFT $ac_have_decl
+_ACEOF
+
+
 ac_fn_c_check_decl "$LINENO" "fdatasync" "ac_cv_have_decl_fdatasync" "#include 
 "
 if test "x$ac_cv_have_decl_fdatasync" = xyes; then :
diff --git a/configure.in b/configure.in
index 0188c6ff07..f56c06eb3d 100644
--- a/configure.in
+++ b/configure.in
@@ -1687,6 +1687,10 @@ AC_CHECK_FUNCS(posix_fadvise)
 AC_CHECK_DECLS(posix_fadvise, [], [], [#include ])
 ]) # fi
 
+# Check if system supports mmap flags for allocating huge page memory with page sizes
+# other than the default
+AC_CHECK_DECLS([MAP_HUGE_MASK, MAP_HUGE_SHIFT], [], [], [#include ])
+
 AC_CHECK_DECLS(fdatasync, [], [], [#include ])
 AC_CHECK_DECLS([strlcat, strlcpy, strnlen])
 # This is probably only present on macOS, but may as well check always
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index aca8f73a50..42f06a41cb 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1582,6 +1582,33 @@ include_dir 'conf.d'
   
  
 
+ 
+  huge_page_size (integer)
+  
+   huge_page_size configuration parameter
+  
+  
+  
+   
+Controls what size of huge pages is used in conjunction with
+.
+The default is zero (0).
+When set to 0, the default huge page size on the system will
+be

Re: [PATCH] Add support for choosing huge page size

2020-06-09 Thread Odin Ugedal
Hi,

Thank you so much for the feedback David and Thomas!

Attached is v2 of the patch, updated with the comments from Thomas (again,
thanks). I also changed the mmap flags to only set the size if the
selected huge page size is not the default one (on Linux). Support
for this functionality was added in Linux 3.8, so it is not available
on older kernels. Should we add that to the docs, or what do
you think? The definitions of MAP_HUGE_MASK and MAP_HUGE_SHIFT were
added in Linux 3.8 too, but since they are part of libc/musl and
are only "used" at compile time, that shouldn't be a problem, right?
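
For context, the size is passed to mmap() by putting log2 of the page size into the bits selected by MAP_HUGE_SHIFT and MAP_HUGE_MASK. A minimal sketch of that encoding follows; the function name is hypothetical, and the fallback values 26 and 0x3f are the ones from the Linux headers, defined here only so the snippet compiles where libc does not expose them.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/mman.h>

/* Linux values, used only if the libc headers lack the macros. */
#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_MASK
#define MAP_HUGE_MASK 0x3f
#endif

/*
 * Encode a huge page size (a power of two, in bytes) as extra mmap()
 * flags to be OR'ed together with MAP_HUGETLB.
 */
static int
huge_page_size_flags(uint64_t size_bytes)
{
    unsigned int shift = 0;

    /* The encoding only works for powers of two. */
    assert(size_bytes != 0 && (size_bytes & (size_bytes - 1)) == 0);

    while ((size_bytes >> shift) != 1)
        shift++;

    /* Do the shift unsigned so sizes such as 16GB do not overflow. */
    return (int) ((shift & (unsigned int) MAP_HUGE_MASK) << MAP_HUGE_SHIFT);
}
```

For 2MB this gives 21 << MAP_HUGE_SHIFT and for 1GB 30 << MAP_HUGE_SHIFT, which should match the kernel's MAP_HUGE_2MB and MAP_HUGE_1GB.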

If a huge page size that is not supported on the system is chosen via
huge_page_size (and huge_pages = on), it will result in "FATAL:  could
not map anonymous shared memory: Invalid argument". This is the same
that happens today when huge pages aren't supported at all, so I guess
it is ok for now (and then we can consider verifying that it is
supported at a later stage).
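
If we do want to verify support later, one possible approach (just a sketch, nothing like this is in the patch) would be to look for the matching hugepages-<size>kB directory under /sys/kernel/mm/hugepages. Both function names below are made up.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>

/*
 * Parse a sysfs entry name of the form "hugepages-<N>kB" and return N,
 * or 0 if the name does not have exactly that form.
 */
static unsigned long
parse_hugepages_dirname(const char *name)
{
    unsigned long kb = 0;
    int consumed = 0;

    /* %n is only stored if the trailing "kB" literal matched too. */
    if (sscanf(name, "hugepages-%lukB%n", &kb, &consumed) == 1 &&
        consumed > 0 && name[consumed] == '\0')
        return kb;
    return 0;
}

/*
 * Returns 1 if the kernel lists the given size (in kB), 0 if it does
 * not, and -1 if the directory cannot be read.
 */
static int
huge_page_size_supported(const char *sysfs_dir, unsigned long size_kb)
{
    DIR *dir;
    struct dirent *de;
    int found = 0;

    assert(size_kb > 0);
    dir = opendir(sysfs_dir);
    if (dir == NULL)
        return -1;

    while ((de = readdir(dir)) != NULL)
    {
        if (parse_hugepages_dirname(de->d_name) == size_kb)
        {
            found = 1;
            break;
        }
    }
    closedir(dir);
    return found;
}
```

Calling huge_page_size_supported("/sys/kernel/mm/hugepages", 1048576) would then tell whether 1GB pages are available before mmap() is attempted.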

Also, thanks for the information about Windows. I have been
searching for info on huge pages in Windows and "superpages" in BSD,
without much luck. I only have experience with Linux, so I think we
can do as you said and let someone else look at it. :)

Odin
From 5cf1af94337523c2dcd6427d70ca5c589942a64c Mon Sep 17 00:00:00 2001
From: Odin Ugedal 
Date: Sun, 7 Jun 2020 21:04:57 +0200
Subject: [PATCH v2] Add support for choosing huge page size

This adds support for using non-default huge page sizes for shared
memory. This is achieved via the new "huge_page_size" config entry.
The config value defaults to 0, meaning it will use the system default.
---
 doc/src/sgml/config.sgml  | 27 
 doc/src/sgml/runtime.sgml | 41 +++-
 src/backend/port/sysv_shmem.c | 67 ++-
 src/backend/utils/misc/guc.c  | 11 +++
 src/backend/utils/misc/postgresql.conf.sample |  2 +
 src/include/storage/pg_shmem.h|  1 +
 6 files changed, 118 insertions(+), 31 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index aca8f73a50..42f06a41cb 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1582,6 +1582,33 @@ include_dir 'conf.d'
   
  
 
+ 
+  huge_page_size (integer)
+  
+   huge_page_size configuration parameter
+  
+  
+  
+   
+Controls what size of huge pages is used in conjunction with
+.
+The default is zero (0).
+When set to 0, the default huge page size on the system will
+be used.
+   
+   
+Some commonly available page sizes on modern 64 bit server architectures include:
+2MB and 1GB (Intel and AMD), 16MB and
+16GB (IBM POWER), and 64kB, 2MB,
+32MB and 1GB (ARM). For more information
+about usage and support, see .
+   
+   
+Controlling huge page size is currently not supported on Windows.
+   
+  
+ 
+
  
   temp_buffers (integer)
   
diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml
index 88210c4a5d..cbdbcb4fdf 100644
--- a/doc/src/sgml/runtime.sgml
+++ b/doc/src/sgml/runtime.sgml
@@ -1391,41 +1391,50 @@ export PG_OOM_ADJUST_VALUE=0
 using large values of .  To use this
 feature in PostgreSQL you need a kernel
 with CONFIG_HUGETLBFS=y and
-CONFIG_HUGETLB_PAGE=y. You will also have to adjust
-the kernel setting vm.nr_hugepages. To estimate the
-number of huge pages needed, start PostgreSQL
-without huge pages enabled and check the
-postmaster's anonymous shared memory segment size, as well as the system's
-huge page size, using the /proc file system.  This might
-look like:
+CONFIG_HUGETLB_PAGE=y. You will also have to pre-allocate
+huge pages with the desired huge page size. To estimate the number of
+huge pages needed, start PostgreSQL without huge
+pages enabled and check the postmaster's anonymous shared memory segment size,
+as well as the system's supported huge page sizes, using the
+/sys file system.  This might look like:
 
 $ head -1 $PGDATA/postmaster.pid
 4170
 $ pmap 4170 | awk '/rw-s/ && /zero/ {print $2}'
 6490428K
+$ ls /sys/kernel/mm/hugepages
+hugepages-1048576kB  hugepages-2048kB
+
+
+ You can now choose between the supported sizes, 2MiB and 1GiB in this case.
+ By default PostgreSQL will use the default huge
+ page size on the system, but that can be configured via
+ .
+ The default huge page size can be found with:
+
 $ grep ^Hugepagesize /proc/meminfo
 Hugepagesize:   2048 kB
 
+
+ For 2MiB,
  6490428 / 2048 gives approximately
  3169.154, so in this example we need at
  least 3170 huge pages, which we can set with:
 
-$ sysctl -w vm.nr_hugepages=3170
+$ echo 3170 | tee /sys/kernel/mm/hugepages/h
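
The page-count arithmetic from the documentation hunk above (6490428 / 2048, rounded up to 3170) generalizes to any huge page size as a ceiling division. A minimal sketch, with a hypothetical function name:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of huge pages needed to hold a shared memory segment, both
 * sizes given in kB. Rounds up, since a partially used page still has
 * to be allocated.
 */
static uint64_t
huge_pages_needed(uint64_t segment_kb, uint64_t huge_page_kb)
{
    assert(huge_page_kb > 0);
    return (segment_kb + huge_page_kb - 1) / huge_page_kb;
}
```

For the 6490428K segment in the example this gives 3170 pages of 2MB, or 7 pages of 1GB.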

[PATCH] Add support for choosing huge page size

2020-06-08 Thread Odin Ugedal
This adds support for using non-default huge page sizes for shared
memory. This is achieved via the new "huge_page_size" config entry.
The config value defaults to 0, meaning it will use the system default.
---

This would be very helpful when running in kubernetes since nodes may
support multiple huge page sizes, and have pre-allocated huge page memory
for each size. This lets the user select huge page size without having
to change the default huge page size on the node. This will also be
useful when doing benchmarking with different huge page sizes, since it
wouldn't require a full system reboot.

Since the default value of the new config is 0 (resulting in using the
default huge page size) this should be backwards compatible with old
configs.

Feel free to comment on the phrasing (both in docs and code) and on the
overall change.

 doc/src/sgml/config.sgml  | 25 ++
 doc/src/sgml/runtime.sgml | 41 +
 src/backend/port/sysv_shmem.c | 88 ---
 src/backend/utils/misc/guc.c  | 11 +++
 src/backend/utils/misc/postgresql.conf.sample |  2 +
 src/include/storage/pg_shmem.h|  1 +
 6 files changed, 120 insertions(+), 48 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index aca8f73a50..6177b819ce 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1582,6 +1582,31 @@ include_dir 'conf.d'
   
  
 
+ 
+  huge_page_size (integer)
+  
+   huge_page_size configuration parameter
+  
+  
+  
+   
+Controls what size of huge pages is used in conjunction with
+.
+The default is zero (0).
+When set to 0, the default huge page size on the system will
+be used.
+   
+   
+Most modern Linux systems support 2MB and 1GB
+huge pages, and some architectures support other sizes as well. For more information
+on how to check for support and usage, see .
+   
+   
+Controlling huge page size is not supported on Windows.
+   
+  
+ 
+
  
   temp_buffers (integer)
   
diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml
index 88210c4a5d..cbdbcb4fdf 100644
--- a/doc/src/sgml/runtime.sgml
+++ b/doc/src/sgml/runtime.sgml
@@ -1391,41 +1391,50 @@ export PG_OOM_ADJUST_VALUE=0
 using large values of .  To use this
 feature in PostgreSQL you need a kernel
 with CONFIG_HUGETLBFS=y and
-CONFIG_HUGETLB_PAGE=y. You will also have to adjust
-the kernel setting vm.nr_hugepages. To estimate the
-number of huge pages needed, start PostgreSQL
-without huge pages enabled and check the
-postmaster's anonymous shared memory segment size, as well as the system's
-huge page size, using the /proc file system.  This might
-look like:
+CONFIG_HUGETLB_PAGE=y. You will also have to pre-allocate
+huge pages with the desired huge page size. To estimate the number of
+huge pages needed, start PostgreSQL without huge
+pages enabled and check the postmaster's anonymous shared memory segment size,
+as well as the system's supported huge page sizes, using the
+/sys file system.  This might look like:
 
 $ head -1 $PGDATA/postmaster.pid
 4170
 $ pmap 4170 | awk '/rw-s/ && /zero/ {print $2}'
 6490428K
+$ ls /sys/kernel/mm/hugepages
+hugepages-1048576kB  hugepages-2048kB
+
+
+ You can now choose between the supported sizes, 2MiB and 1GiB in this case.
+ By default PostgreSQL will use the default huge
+ page size on the system, but that can be configured via
+ .
+ The default huge page size can be found with:
+
 $ grep ^Hugepagesize /proc/meminfo
 Hugepagesize:   2048 kB
 
+
+ For 2MiB,
  6490428 / 2048 gives approximately
  3169.154, so in this example we need at
  least 3170 huge pages, which we can set with:
 
-$ sysctl -w vm.nr_hugepages=3170
+$ echo 3170 | tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
 
 A larger setting would be appropriate if other programs on the machine
-also need huge pages.  Don't forget to add this setting
-to /etc/sysctl.conf so that it will be reapplied
-after reboots.
+also need huge pages. It is also possible to pre allocate huge pages on boot
+by adding the kernel parameters hugepagesz=2M hugepages=3170.

 

 Sometimes the kernel is not able to allocate the desired number of huge
-pages immediately, so it might be necessary to repeat the command or to
-reboot.  (Immediately after a reboot, most of the machine's memory
-should be available to convert into huge pages.)  To verify the huge
-page allocation situation, use:
+pages immediately due to external fragmentation, so it might be necessary to
+repeat the command or to reboot. To verify the huge page allocation situation
+for a given size, use:
 
-$ grep Huge /proc/memin