Re: Fast DSM segments
On Mon, Jul 27, 2020 at 2:45 PM Thomas Munro wrote:
> Here's a new version, using the name min_dynamic_shared_memory, which
> sounds better to me.  Any objections?  I also fixed the GUC's maximum
> setting so that it's sure to fit in size_t.

I pushed it like that.  Happy to rename the GUC if someone has a better
idea.

I don't really love the way dsm_create()'s code flows, but I didn't see
another way to do this within the existing constraints.  I think it'd be
nice to rewrite this thing to get rid of the random number-based handles
that are directly convertible to key_t/pathname, and instead use
something holding {slot number, generation number}.  Then you could
improve that code flow and get rid of several cases of linear array
scans under an exclusive lock.  The underlying key_t/pathname would live
in the slot.  You'd need a new way to find the control segment itself
after a restart, where dsm_cleanup_using_control_segment() cleans up
after the previous incarnation, but I think that just requires putting
the key_t/pathname directly in PGShmemHeader, instead of a new {slot
number, generation number} style handle.  Or maybe a separate mapped
file opened by a well-known pathname, or something like that.
Re: Fast DSM segments
On Sat, Jun 20, 2020 at 7:17 AM Andres Freund wrote:
> On 2020-06-19 17:42:41 +1200, Thomas Munro wrote:
> > On Thu, Jun 18, 2020 at 6:05 PM Thomas Munro wrote:
> > > Here's a version that adds some documentation.
> >
> > I jumped on a dual socket machine with 36 cores/72 threads and 144GB
> > of RAM (Azure F72s_v2) running Linux, configured with 50GB of huge
> > pages available, and I ran a very simple test: select count(*) from t
> > t1 join t t2 using (i), where the table was created with create table
> > t as select generate_series(1, 4)::int i, and then prewarmed
> > into 20GB of shared_buffers.
>
> I assume all the data fits into 20GB?

Yep.

> Which kernel version is this?

Tested on 4.19 (Debian stable/10).

> How much of the benefit comes from huge pages being used, how much from
> avoiding the dsm overhead, and how much from the page table being shared
> for that mapping? Do you have a rough idea?

Without huge pages, the 36 process version of the test mentioned above
shows around a 1.1x speedup, which is in line with the numbers from my
first message (which was from a much smaller computer).  The rest of the
speedup (2x) is due to huge pages.  Further speedups are available by
increasing the hash chunk size, and probably doing NUMA-aware
allocation, in later work.

Here's a new version, using the name min_dynamic_shared_memory, which
sounds better to me.  Any objections?  I also fixed the GUC's maximum
setting so that it's sure to fit in size_t.

From 3a77e8ab67e5492340425f1ce4a36a45a8c80d05 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Mon, 27 Jul 2020 12:07:05 +1200
Subject: [PATCH v3] Preallocate some DSM space at startup.

Create an optional region in the main shared memory segment that can be
used to acquire and release "fast" DSM segments, and can benefit from
huge pages allocated at cluster startup time, if configured.  Fall back
to the existing mechanisms when that space is full.
The size is controlled by a new GUC min_dynamic_shared_memory,
defaulting to 0.

Main region DSM segments initially contain whatever garbage the memory
held last time they were used, rather than zeroes.  That change revealed
that DSA areas failed to initialize themselves correctly in memory that
wasn't zeroed first, so fix that problem.

Discussion: https://postgr.es/m/CA%2BhUKGLAE2QBv-WgGp%2BD9P_J-%3Dyne3zof9nfMaqq1h3EGHFXYQ%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  26 +++
 src/backend/storage/ipc/dsm.c                 | 184 --
 src/backend/storage/ipc/dsm_impl.c            |   3 +
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/utils/misc/guc.c                  |  11 ++
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/backend/utils/mmgr/dsa.c                  |   5 +-
 src/include/storage/dsm.h                     |   3 +
 src/include/storage/dsm_impl.h                |   1 +
 9 files changed, 212 insertions(+), 25 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6ce5907896..48fef11041 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1864,6 +1864,32 @@ include_dir 'conf.d'
+
+ min_dynamic_shared_memory (integer)
+
+ min_dynamic_shared_memory configuration parameter
+
+
+
+
+Specifies the amount of memory that should be allocated at server
+startup time for use by parallel queries. When this memory region is
+insufficient or exhausted by concurrent parallel queries, new
+parallel queries try to allocate extra shared memory temporarily from
+the operating system using the method configured with
+dynamic_shared_memory_type, which may be slower
+due to memory management overheads.
+Memory that is allocated at startup time with
+min_dynamic_shared_memory is affected by the
+huge_pages setting on operating systems where that
+is supported, and may be more likely to benefit from larger pages on
+operating systems where page size is managed automatically. Larger
+memory pages can improve the performance of parallel hash joins.
+The default value is 0 (none).
+
+
+
+
diff --git a/src/backend/storage/ipc/dsm.c b/src/backend/storage/ipc/dsm.c
index ef64d08357..b9941496a3 100644
--- a/src/backend/storage/ipc/dsm.c
+++ b/src/backend/storage/ipc/dsm.c
@@ -35,10 +35,12 @@
 #include "lib/ilist.h"
 #include "miscadmin.h"
+#include "port/pg_bitutils.h"
 #include "storage/dsm.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
 #include "storage/pg_shmem.h"
+#include "utils/freepage.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
 #include "utils/resowner_
Re: Fast DSM segments
Hi,

On 2020-06-19 17:42:41 +1200, Thomas Munro wrote:
> On Thu, Jun 18, 2020 at 6:05 PM Thomas Munro wrote:
> > Here's a version that adds some documentation.
>
> I jumped on a dual socket machine with 36 cores/72 threads and 144GB
> of RAM (Azure F72s_v2) running Linux, configured with 50GB of huge
> pages available, and I ran a very simple test: select count(*) from t
> t1 join t t2 using (i), where the table was created with create table
> t as select generate_series(1, 4)::int i, and then prewarmed
> into 20GB of shared_buffers.

I assume all the data fits into 20GB?

Which kernel version is this?

How much of the benefit comes from huge pages being used, how much from
avoiding the dsm overhead, and how much from the page table being shared
for that mapping?  Do you have a rough idea?

Greetings,

Andres Freund
Re: Fast DSM segments
On Thu, Jun 18, 2020 at 6:05 PM Thomas Munro wrote:
> Here's a version that adds some documentation.

I jumped on a dual socket machine with 36 cores/72 threads and 144GB of
RAM (Azure F72s_v2) running Linux, configured with 50GB of huge pages
available, and I ran a very simple test: select count(*) from t t1 join
t t2 using (i), where the table was created with create table t as
select generate_series(1, 4)::int i, and then prewarmed into 20GB of
shared_buffers.

I compared the default behaviour to
preallocate_dynamic_shared_memory=20GB, with work_mem set sky high so
that there would be no batching (you get a hash table of around 16GB),
and I set things up so that I could test with a range of worker
processes, and computed the speedup compared to a serial hash join.
Here's what I got:

 Processes   Default          Preallocated
  1          627.6s
  9          101.3s =  6.1x    68.1s =  9.2x
 18           56.1s = 11.1x    34.9s = 17.9x
 27           42.5s = 14.7x    23.5s = 26.7x
 36           36.0s = 17.4x    18.2s = 34.4x
 45           33.5s = 18.7x    15.5s = 40.5x
 54           35.6s = 17.6x    13.6s = 46.1x
 63           35.4s = 17.7x    12.2s = 51.4x
 72           33.8s = 18.5x    11.3s = 55.5x

It scaled nearly perfectly up to somewhere just under 36 threads, and
then the slope tapered off a bit so that each extra process was
supplying somewhere a bit over half of its potential.  I can improve the
slope after the halfway point a bit by cranking HASH_CHUNK_SIZE up to
128KB (and it doesn't get much better after that):

 Processes   Default          Preallocated
  1          627.6s
  9          102.7s =  6.1x    67.7s =  9.2x
 18           56.8s = 11.1x    34.8s = 18.0x
 27           41.0s = 15.3x    23.4s = 26.8x
 36           33.9s = 18.5x    18.2s = 34.4x
 45           30.1s = 20.8x    15.4s = 40.7x
 54           27.2s = 23.0x    13.3s = 47.1x
 63           25.1s = 25.0x    11.9s = 52.7x
 72           23.8s = 26.3x    10.8s = 58.1x

I don't claim that this is representative of any particular workload or
server configuration, but it's a good way to show that bottleneck, and
it's pretty cool to be able to run a query that previously took over 10
minutes in 10 seconds.
(I can shave a further 10% off these times with my experimental hash join prefetching patch, but I'll probably write about that separately when I've figured out why it's not doing better than that...).
Re: Fast DSM segments
On Thu, Jun 11, 2020 at 5:37 AM Robert Haas wrote:
> On Tue, Jun 9, 2020 at 6:03 PM Thomas Munro wrote:
> > That all makes sense.  Now I'm wondering if I should use exactly that
> > word in the GUC...  dynamic_shared_memory_preallocate?
>
> I tend to prefer verb-object rather than object-verb word ordering,
> because that's how English normally works, but I realize this is not a
> unanimous view.

It's pretty much just me and Yoda against all the rest of you, so let's
try preallocate_dynamic_shared_memory.  I guess it could also be
min_dynamic_shared_memory to drop the verb.  Other ideas welcome.

> It's a little strange because the fact of preallocating it makes it
> not dynamic any more.  I don't know what to do about that.

Well, it's not dynamic at the operating system level, but it's still
dynamic in the sense that PostgreSQL code can get some and give it back,
and there's no change from the point of view of any DSM client code.

Admittedly, the shared memory architecture is a bit confusing.  We have
main shared memory, DSM memory, DSA memory that is inside main shared
memory with extra DSMs as required, DSA memory that is inside a DSM and
creates extra DSMs as required, and with this patch also DSMs that are
inside main shared memory.  Not to mention palloc and MemoryContexts and
all that.  As you probably remember I once managed to give an internal
presentation at EDB for one hour of solid talking about all the
different kinds of allocators and what they're good for.  It was like a
Möbius slide deck already.

Here's a version that adds some documentation.

From 8f222062b60d6674cd9f46e716a56201ef498f84 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Thu, 9 Apr 2020 14:12:38 +1200
Subject: [PATCH v2] Preallocate some DSM space at startup.

Create an optional region in the main shared memory segment that can be
used to acquire and release "fast" DSM segments, and can benefit from
huge pages allocated at cluster startup time, if configured.
Fall back to the existing mechanisms when that space is full.  The size
is controlled by preallocate_dynamic_shared_memory, defaulting to 0.

Main region DSM segments initially contain whatever garbage the memory
held last time they were used, rather than zeroes.  That change revealed
that DSA areas failed to initialize themselves correctly in memory that
wasn't zeroed first, so fix that problem.

Reviewed-by: Robert Haas
Discussion: https://postgr.es/m/CA%2BhUKGLAE2QBv-WgGp%2BD9P_J-%3Dyne3zof9nfMaqq1h3EGHFXYQ%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  26 +++
 src/backend/storage/ipc/dsm.c                 | 184 --
 src/backend/storage/ipc/dsm_impl.c            |   3 +
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/utils/misc/guc.c                  |  11 ++
 src/backend/utils/misc/postgresql.conf.sample |   2 +
 src/backend/utils/mmgr/dsa.c                  |   5 +-
 src/include/storage/dsm.h                     |   3 +
 src/include/storage/dsm_impl.h                |   1 +
 9 files changed, 213 insertions(+), 25 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 783bf7a12b..35d342a694 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1831,6 +1831,32 @@ include_dir 'conf.d'
+
+ preallocate_dynamic_shared_memory (integer)
+
+ preallocate_dynamic_shared_memory configuration parameter
+
+
+
+
+Specifies the amount of memory that should be allocated at server
+startup time for use by parallel queries. When this memory region is
+insufficient or exhausted by concurrent parallel queries, new
+parallel queries try to allocate extra shared memory temporarily from
+the operating system using the method configured with
+dynamic_shared_memory_type, which may be slower
+due to memory management overheads.
+Memory that is allocated with
+preallocate_dynamic_shared_memory is affected by the
+huge_pages setting on operating systems where that
+is supported, and may be more likely to benefit from larger pages on
+operating systems where page size is managed automatically. Larger
+memory pages can improve the performance of parallel hash joins.
+The default value is 0 (none).
+
+
+
+
diff --git a/src/backend/storage/ipc/dsm.c b/src/backend/storage/ipc/dsm.c
index ef64d08357..4f87ece3b3 100644
--- a/src/backend/storage/ipc/dsm.c
+++ b/src/backend/storage/ipc/dsm.c
@@ -35,10 +35,12 @@
 #include "lib/ilist.h"
 #include "miscadmin.h"
+#include "port/pg_bitutils.h"
 #include "storage/dsm.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
 #include "storage/pg_shmem.h"
+#include "utils/freepage.h"
 #include "utils/guc.h"
Re: Fast DSM segments
On Tue, Jun 9, 2020 at 6:03 PM Thomas Munro wrote:
> That all makes sense.  Now I'm wondering if I should use exactly that
> word in the GUC...  dynamic_shared_memory_preallocate?

I tend to prefer verb-object rather than object-verb word ordering,
because that's how English normally works, but I realize this is not a
unanimous view.

It's a little strange because the fact of preallocating it makes it not
dynamic any more.  I don't know what to do about that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Fast DSM segments
On Sat, Apr 11, 2020 at 1:55 AM Robert Haas wrote:
> On Thu, Apr 9, 2020 at 1:46 AM Thomas Munro wrote:
> > The attached highly experimental patch adds a new GUC
> > dynamic_shared_memory_main_size.  If you set it > 0, it creates a
> > fixed sized shared memory region that supplies memory for "fast" DSM
> > segments.  When there isn't enough free space, dsm_create() falls back
> > to the traditional approach using eg shm_open().
>
> I think this is a reasonable option to have available for people who
> want to use it.  I didn't want to have parallel query be limited to a
> fixed-size amount of shared memory because I think there are some
> cases where efficient performance really requires a large chunk of
> memory, and it seemed impractical to keep the largest amount of memory
> that any query might need to use permanently allocated, let alone that
> amount multiplied by the maximum possible number of parallel queries
> that could be running at the same time.  But none of that is any
> argument against giving people the option to preallocate some memory
> for parallel query.

That all makes sense.  Now I'm wondering if I should use exactly that
word in the GUC...  dynamic_shared_memory_preallocate?
Re: Fast DSM segments
On Thu, Apr 9, 2020 at 1:46 AM Thomas Munro wrote:
> The attached highly experimental patch adds a new GUC
> dynamic_shared_memory_main_size.  If you set it > 0, it creates a
> fixed sized shared memory region that supplies memory for "fast" DSM
> segments.  When there isn't enough free space, dsm_create() falls back
> to the traditional approach using eg shm_open().

I think this is a reasonable option to have available for people who
want to use it.  I didn't want to have parallel query be limited to a
fixed-size amount of shared memory because I think there are some cases
where efficient performance really requires a large chunk of memory, and
it seemed impractical to keep the largest amount of memory that any
query might need to use permanently allocated, let alone that amount
multiplied by the maximum possible number of parallel queries that could
be running at the same time.  But none of that is any argument against
giving people the option to preallocate some memory for parallel query.

My guess is that on smaller boxes this won't find a lot of use, but on
bigger ones it will be handy.  It's hard to imagine setting aside 1GB of
memory for this if you only have 8GB total, but if you have 512GB total,
it's pretty easy to imagine.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Fast DSM segments
Hello PostgreSQL 14 hackers,

FreeBSD is much faster than Linux (and probably Windows) at parallel
hash joins on the same hardware, primarily because its DSM segments run
in huge pages out of the box.  There are various ways to convince
recent-ish Linux to put our DSMs on huge pages (see below for one), but
that's not the only problem I wanted to attack.

The attached highly experimental patch adds a new GUC
dynamic_shared_memory_main_size.  If you set it > 0, it creates a fixed
sized shared memory region that supplies memory for "fast" DSM segments.
When there isn't enough free space, dsm_create() falls back to the
traditional approach using eg shm_open().  This allows parallel queries
to run faster, because:

* no more expensive system calls
* no repeated VM allocation (whether explicit posix_fallocate() or
  first-touch)
* can be in huge pages on Linux and Windows

This makes lots of parallel queries measurably faster, especially
parallel hash join.  To demonstrate with a very simple query:

  create table t (i int);
  insert into t select generate_series(1, 1000);
  select pg_prewarm('t');
  set work_mem = '1GB';
  select count(*) from t t1 join t t2 using (i);

Here are some quick and dirty results from a Linux 4.19 laptop.  The
first column is the new GUC, and the last column is from "perf stat -e
dTLB-load-misses -p ".

 size  huge_pages  time    speedup  TLB misses
 0     off         2.595s           9,131,285
 0     on          2.571s    1%     8,951,595
 1GB   off         2.398s    8%     9,082,803
 1GB   on          1.898s   37%       169,867

You can get some of this speedup unpatched on a Linux 4.7+ system by
putting "huge=always" in your /etc/fstab options for /dev/shm (= where
shm_open() lives).  For comparison, that gives me:

 size  huge_pages  time    speedup  TLB misses
 0     on          2.007s   29%       221,910

That still leaves the other 8% on the table, and in fact that 8%
explodes to a much larger number as you throw more cores at the problem
(here I was using defaults, 2 workers).
Unfortunately, dsa.c -- used by parallel hash join to allocate vast
amounts of memory really fast during the build phase -- holds a lock
while creating new segments, as you'll soon discover if you test very
large hash join builds on a 72-way box.  I considered allowing
concurrent segment creation, but as far as I could see that would lead
to terrible fragmentation problems, especially in combination with our
geometric growth policy for segment sizes due to limited slots.  I think
this is the main factor that causes parallel hash join scalability to
fall off around 8 cores.  The present patch should really help with that
(more digging in that area needed; there are other ways to improve that
situation, possibly including something smarter than a stream of
dsa_allocate(32kB) calls).

A competing idea would be to keep freelists of lingering DSM segments
for reuse.  Among other problems, you'd probably have fragmentation
problems due to their differing sizes.  Perhaps there could be a hybrid
of these two ideas, putting a region for "fast" DSM segments inside many
OS-supplied segments, though it's obviously much more complicated.

As for what a reasonable setting would be for this patch, well, erm, it
depends.  Obviously that's RAM that the system can't use for other
purposes while you're not running parallel queries, and if it's huge
pages, it can't be swapped out; if it's not huge pages, then it can be
swapped out, and that'd be terrible for performance next time you need
it.  So you wouldn't want to set it too large.  If you set it too small,
it falls back to the traditional behaviour.

One argument I've heard in favour of creating fresh segments every time
is that NUMA systems configured to prefer local memory allocation (as
opposed to interleaved allocation) probably avoid cross node traffic.  I
haven't looked into that topic yet; I suppose one way to deal with it in
this scheme would be to have one such region per node, and prefer to
allocate from the local one.
From 703b3ed55e7a0f1d895800943bfa64fb52a0fec1 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Thu, 9 Apr 2020 14:12:38 +1200
Subject: [PATCH] Support DSM segments in the main shmem area.

Create an optional region in the main shared memory segment that can be
used to acquire and release "fast" DSM segments, and can benefit from
huge pages allocated at cluster startup time, if configured.  Fall back
to the existing mechanisms when that space is full.  The size is
controlled by dynamic_shared_memory_main_size, defaulting to 0
(disabled).

Main region DSM segments initially contain whatever garbage the memory
held last time they were used, rather than zeroes.  That change revealed
that DSA areas failed to initialize themselves correctly in memory that
wasn't zeroed first, so fix that problem.
---
 src/backend/storage/ipc/dsm.c | 185 +
 src/backend/storage/i