Re: Fast DSM segments

2020-07-30 Thread Thomas Munro
On Mon, Jul 27, 2020 at 2:45 PM Thomas Munro  wrote:
> Here's a new version, using the name min_dynamic_shared_memory, which
> sounds better to me.  Any objections?  I also fixed the GUC's maximum
> setting so that it's sure to fit in size_t.

I pushed it like that.  Happy to rename the GUC if someone has a better idea.

I don't really love the way dsm_create()'s code flows, but I didn't
see another way to do this within the existing constraints.  I think
it'd be nice to rewrite this thing to get rid of the random
number-based handles that are directly convertible to key_t/pathname,
and instead use something holding {slot number, generation number}.
Then you could improve that code flow and get rid of several cases of
linear array scans under an exclusive lock.  The underlying
key_t/pathname would live in the slot.  You'd need a new way to find
the control segment itself after a restart, where
dsm_cleanup_using_control_segment() cleans up after the previous
incarnation, but I think that just requires putting the key_t/pathname
directly in PGShmemHeader, instead of a new {slot number, generation
number} style handle.  Or maybe a separate mapped file opened by a
well-known pathname, or something like that.
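
To make that concrete, here is a rough sketch of what a {slot number,
generation number} handle could look like.  The field widths and names
below are made up for illustration; this isn't anything in the tree today:

  /* Hypothetical handle layout: low 10 bits = slot, high 22 bits = generation. */
  #include <stdint.h>
  #include <stdio.h>

  typedef uint32_t demo_dsm_handle;        /* assumed 32 bits, like dsm_handle */

  #define DEMO_SLOT_BITS 10
  #define DEMO_SLOT_MASK ((1u << DEMO_SLOT_BITS) - 1)

  static demo_dsm_handle
  demo_make_handle(uint32_t slot, uint32_t generation)
  {
      return (generation << DEMO_SLOT_BITS) | (slot & DEMO_SLOT_MASK);
  }

  int
  main(void)
  {
      demo_dsm_handle h = demo_make_handle(42, 7);

      /* dsm_attach() could index straight into slot 42 and check that its
       * generation is still 7, instead of scanning for a random handle. */
      printf("slot = %u, generation = %u\n",
             (unsigned) (h & DEMO_SLOT_MASK),
             (unsigned) (h >> DEMO_SLOT_BITS));
      return 0;
  }

The slot's key_t/pathname would be generated when the slot is (re)used,
which is what lets the linear scans mentioned above go away.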




Re: Fast DSM segments

2020-07-26 Thread Thomas Munro
On Sat, Jun 20, 2020 at 7:17 AM Andres Freund  wrote:
> On 2020-06-19 17:42:41 +1200, Thomas Munro wrote:
> > On Thu, Jun 18, 2020 at 6:05 PM Thomas Munro  wrote:
> > > Here's a version that adds some documentation.
> >
> > I jumped on a dual socket machine with 36 cores/72 threads and 144GB
> > of RAM (Azure F72s_v2) running Linux, configured with 50GB of huge
> > pages available, and I ran a very simple test: select count(*) from t
> > t1 join t t2 using (i), where the table was created with create table
> > t as select generate_series(1, 4)::int i, and then prewarmed
> > into 20GB of shared_buffers.
>
> I assume all the data fits into 20GB?

Yep.

> Which kernel version is this?

Tested on 4.19 (Debian stable/10).

> How much of the benefit comes from huge pages being used, how much from
> avoiding the dsm overhead, and how much from the page table being shared
> for that mapping? Do you have a rough idea?

Without huge pages, the 36-process version of the test mentioned above
shows around a 1.1x speedup, which is in line with the numbers from my
first message (which was from a much smaller computer).  The rest of
the speedup (2x) is due to huge pages.

Further speedups should be available in later work, by increasing the
hash chunk size and probably by doing NUMA-aware allocation.

Here's a new version, using the name min_dynamic_shared_memory, which
sounds better to me.  Any objections?  I also fixed the GUC's maximum
setting so that it's sure to fit in size_t.
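
(For anyone wondering what "fit in size_t" means here: assuming the GUC is
expressed in megabytes, the upper bound just needs to be low enough that
converting the setting back to bytes can't overflow.  A standalone sketch of
that arithmetic, not the patch's exact expression:)

  /* Largest MB value whose byte count still fits in size_t (sketch only). */
  #include <limits.h>
  #include <stdint.h>
  #include <stdio.h>

  int
  main(void)
  {
      size_t max_mb = SIZE_MAX / (1024 * 1024);  /* MB values above this overflow */
      int    guc_max = (max_mb > INT_MAX) ? INT_MAX : (int) max_mb;

      printf("maximum allowed setting: %d MB\n", guc_max);
      return 0;
  }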
From 3a77e8ab67e5492340425f1ce4a36a45a8c80d05 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 27 Jul 2020 12:07:05 +1200
Subject: [PATCH v3] Preallocate some DSM space at startup.

Create an optional region in the main shared memory segment that can be
used to acquire and release "fast" DSM segments, and can benefit from
huge pages allocated at cluster startup time, if configured.  Fall back
to the existing mechanisms when that space is full.  The size is
controlled by a new GUC min_dynamic_shared_memory, defaulting to 0.

Main region DSM segments initially contain whatever garbage the memory
held last time they were used, rather than zeroes.  That change revealed
that DSA areas failed to initialize themselves correctly in memory that
wasn't zeroed first, so fix that problem.

Discussion: https://postgr.es/m/CA%2BhUKGLAE2QBv-WgGp%2BD9P_J-%3Dyne3zof9nfMaqq1h3EGHFXYQ%40mail.gmail.com
---
 doc/src/sgml/config.sgml  |  26 +++
 src/backend/storage/ipc/dsm.c | 184 --
 src/backend/storage/ipc/dsm_impl.c|   3 +
 src/backend/storage/ipc/ipci.c|   3 +
 src/backend/utils/misc/guc.c  |  11 ++
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/backend/utils/mmgr/dsa.c  |   5 +-
 src/include/storage/dsm.h |   3 +
 src/include/storage/dsm_impl.h|   1 +
 9 files changed, 212 insertions(+), 25 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6ce5907896..48fef11041 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1864,6 +1864,32 @@ include_dir 'conf.d'
   
  
 
+ 
+  min_dynamic_shared_memory (integer)
+  
+   min_dynamic_shared_memory configuration parameter
+  
+  
+  
+   
+Specifies the amount of memory that should be allocated at server
+startup time for use by parallel queries.  When this memory region is
+insufficient or exhausted by concurrent parallel queries, new
+parallel queries try to allocate extra shared memory temporarily from
+the operating system using the method configured with
+dynamic_shared_memory_type, which may be slower
+due to memory management overheads.
+Memory that is allocated at startup time with
+min_dynamic_shared_memory is affected by the
+huge_pages setting on operating systems where that
+is supported, and may be more likely to benefit from larger pages on
+operating systems where page size is managed automatically.  Larger
+memory pages can improve the performance of parallel hash joins.
+The default value is 0 (none).
+   
+  
+ 
+
  
  
 
diff --git a/src/backend/storage/ipc/dsm.c b/src/backend/storage/ipc/dsm.c
index ef64d08357..b9941496a3 100644
--- a/src/backend/storage/ipc/dsm.c
+++ b/src/backend/storage/ipc/dsm.c
@@ -35,10 +35,12 @@
 
 #include "lib/ilist.h"
 #include "miscadmin.h"
+#include "port/pg_bitutils.h"
 #include "storage/dsm.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
 #include "storage/pg_shmem.h"
+#include "utils/freepage.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
 #include "utils/resowner_

Re: Fast DSM segments

2020-06-19 Thread Andres Freund
Hi,

On 2020-06-19 17:42:41 +1200, Thomas Munro wrote:
> On Thu, Jun 18, 2020 at 6:05 PM Thomas Munro  wrote:
> > Here's a version that adds some documentation.
> 
> I jumped on a dual socket machine with 36 cores/72 threads and 144GB
> of RAM (Azure F72s_v2) running Linux, configured with 50GB of huge
> pages available, and I ran a very simple test: select count(*) from t
> t1 join t t2 using (i), where the table was created with create table
> t as select generate_series(1, 4)::int i, and then prewarmed
> into 20GB of shared_buffers.

I assume all the data fits into 20GB?

Which kernel version is this?

How much of the benefit comes from huge pages being used, how much from
avoiding the dsm overhead, and how much from the page table being shared
for that mapping? Do you have a rough idea?

Greetings,

Andres Freund




Re: Fast DSM segments

2020-06-18 Thread Thomas Munro
On Thu, Jun 18, 2020 at 6:05 PM Thomas Munro  wrote:
> Here's a version that adds some documentation.

I jumped on a dual socket machine with 36 cores/72 threads and 144GB
of RAM (Azure F72s_v2) running Linux, configured with 50GB of huge
pages available, and I ran a very simple test: select count(*) from t
t1 join t t2 using (i), where the table was created with create table
t as select generate_series(1, 4)::int i, and then prewarmed
into 20GB of shared_buffers.  I compared the default behaviour to
preallocate_dynamic_shared_memory=20GB, with work_mem set sky high so
that there would be no batching (you get a hash table of around 16GB).
I set things up so that I could test with a range of worker process
counts, and computed the speedup compared to a serial hash join.

Here's what I got:

Processes   Default          Preallocated
        1   627.6s
        9   101.3s =  6.1x    68.1s =  9.2x
       18    56.1s = 11.1x    34.9s = 17.9x
       27    42.5s = 14.7x    23.5s = 26.7x
       36    36.0s = 17.4x    18.2s = 34.4x
       45    33.5s = 18.7x    15.5s = 40.5x
       54    35.6s = 17.6x    13.6s = 46.1x
       63    35.4s = 17.7x    12.2s = 51.4x
       72    33.8s = 18.5x    11.3s = 55.5x

It scaled nearly perfectly up to somewhere just under 36 threads, and
then the slope tapered off a bit, so that each extra process
contributed a bit over half of its potential speedup.  I can improve
the slope after the halfway point a bit by cranking HASH_CHUNK_SIZE up
to 128KB (and it doesn't get much better after that):

Processes   Default          Preallocated
        1   627.6s
        9   102.7s =  6.1x    67.7s =  9.2x
       18    56.8s = 11.1x    34.8s = 18.0x
       27    41.0s = 15.3x    23.4s = 26.8x
       36    33.9s = 18.5x    18.2s = 34.4x
       45    30.1s = 20.8x    15.4s = 40.7x
       54    27.2s = 23.0x    13.3s = 47.1x
       63    25.1s = 25.0x    11.9s = 52.7x
       72    23.8s = 26.3x    10.8s = 58.1x

I don't claim that this is representative of any particular workload
or server configuration, but it's a good way to show that bottleneck,
and it's pretty cool to be able to run a query that previously took
over 10 minutes in 10 seconds.  (I can shave a further 10% off these
times with my experimental hash join prefetching patch, but I'll
probably write about that separately when I've figured out why it's
not doing better than that...).


Re: Fast DSM segments

2020-06-18 Thread Thomas Munro
On Thu, Jun 11, 2020 at 5:37 AM Robert Haas  wrote:
> On Tue, Jun 9, 2020 at 6:03 PM Thomas Munro  wrote:
> > That all makes sense.  Now I'm wondering if I should use exactly that
> > word in the GUC... dynamic_shared_memory_preallocate?
>
> I tend to prefer verb-object rather than object-verb word ordering,
> because that's how English normally works, but I realize this is not a
> unanimous view.

It's pretty much just me and Yoda against all the rest of you, so
let's try preallocate_dynamic_shared_memory.  I guess it could also be
min_dynamic_shared_memory to drop the verb.  Other ideas welcome.

> It's a little strange because the fact of preallocating it makes it
> not dynamic any more. I don't know what to do about that.

Well, it's not dynamic at the operating system level, but it's still
dynamic in the sense that PostgreSQL code can get some and give it
back, and there's no change from the point of view of any DSM client
code.
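
For example, this kind of client code works identically either way; the
only difference is where the memory comes from (a sketch using the existing
dsm.h calls, error handling omitted):

  /* Ordinary DSM client code, unchanged by this patch. */
  #include "postgres.h"
  #include "storage/dsm.h"

  static void
  demo(void)
  {
      /* May be carved out of the preallocated region, or not. */
      dsm_segment *seg = dsm_create(1024 * 1024, 0);
      void        *addr = dsm_segment_address(seg);
      dsm_handle   handle = dsm_segment_handle(seg);  /* workers dsm_attach() this */

      /* ... use addr, share handle with other backends ... */
      (void) addr;
      (void) handle;
      dsm_detach(seg);
  }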

Admittedly, the shared memory architecture is a bit confusing.  We
have main shared memory, DSM memory, DSA memory that is inside main
shared memory with extra DSMs as required, DSA memory that is inside a
DSM and creates extra DSMs as required, and with this patch also DSMs
that are inside main shared memory.  Not to mention palloc and
MemoryContexts and all that.  As you probably remember I once managed
to give an internal presentation at EDB for one hour of solid talking
about all the different kinds of allocators and what they're good for.
It was like a Möbius slide deck already.
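
(The two DSA flavours above correspond to dsa_create_in_place() with and
without a backing DSM segment; a rough sketch, with the lock tranche and
sizing details simplified:)

  /* Sketch of the DSA flavours mentioned above. */
  #include "postgres.h"
  #include "storage/lwlock.h"
  #include "storage/shmem.h"
  #include "utils/dsa.h"

  static void
  demo(dsm_segment *some_dsm, void *space_inside_that_dsm)
  {
      size_t      size = dsa_minimum_size() + 64 * 1024;

      /* DSA living inside the main shared memory segment, spilling into
       * extra DSM segments as required. */
      void       *place = ShmemAlloc(size);
      dsa_area   *a1 = dsa_create_in_place(place, size,
                                           LWLockNewTrancheId(), NULL);

      /* DSA living inside an existing DSM segment (as parallel query does),
       * again creating extra DSM segments as required. */
      dsa_area   *a2 = dsa_create_in_place(space_inside_that_dsm, size,
                                           LWLockNewTrancheId(), some_dsm);

      (void) a1;
      (void) a2;
  }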

Here's a version that adds some documentation.
From 8f222062b60d6674cd9f46e716a56201ef498f84 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 9 Apr 2020 14:12:38 +1200
Subject: [PATCH v2] Preallocate some DSM space at startup.

Create an optional region in the main shared memory segment that can be
used to acquire and release "fast" DSM segments, and can benefit from
huge pages allocated at cluster startup time, if configured.  Fall back
to the existing mechanisms when that space is full.  The size is
controlled by preallocate_dynamic_shared_memory, defaulting to 0.

Main region DSM segments initially contain whatever garbage the memory
held last time they were used, rather than zeroes.  That change revealed
that DSA areas failed to initialize themselves correctly in memory that
wasn't zeroed first, so fix that problem.

Reviewed-by: Robert Haas 
Discussion: https://postgr.es/m/CA%2BhUKGLAE2QBv-WgGp%2BD9P_J-%3Dyne3zof9nfMaqq1h3EGHFXYQ%40mail.gmail.com
---
 doc/src/sgml/config.sgml  |  26 +++
 src/backend/storage/ipc/dsm.c | 184 --
 src/backend/storage/ipc/dsm_impl.c|   3 +
 src/backend/storage/ipc/ipci.c|   3 +
 src/backend/utils/misc/guc.c  |  11 ++
 src/backend/utils/misc/postgresql.conf.sample |   2 +
 src/backend/utils/mmgr/dsa.c  |   5 +-
 src/include/storage/dsm.h |   3 +
 src/include/storage/dsm_impl.h|   1 +
 9 files changed, 213 insertions(+), 25 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 783bf7a12b..35d342a694 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1831,6 +1831,32 @@ include_dir 'conf.d'
   
  
 
+ 
+  preallocate_dynamic_shared_memory (integer)
+  
+   preallocate_dynamic_shared_memory configuration parameter
+  
+  
+  
+   
+Specifies the amount of memory that should be allocated at server
+startup time for use by parallel queries.  When this memory region is
+insufficient or exhausted by concurrent parallel queries, new
+parallel queries try to allocate extra shared memory temporarily from
+the operating system using the method configured with
+dynamic_shared_memory_type, which may be slower
+due to memory management overheads.
+Memory that is allocated with
+preallocate_dynamic_shared_memory is affected by the
+huge_pages setting on operating systems where that
+is supported, and may be more likely to benefit from larger pages on
+operating systems where page size is managed automatically.  Larger
+memory pages can improve the performance of parallel hash joins.
+The default value is 0 (none).
+   
+  
+ 
+
  
  
 
diff --git a/src/backend/storage/ipc/dsm.c b/src/backend/storage/ipc/dsm.c
index ef64d08357..4f87ece3b3 100644
--- a/src/backend/storage/ipc/dsm.c
+++ b/src/backend/storage/ipc/dsm.c
@@ -35,10 +35,12 @@
 
 #include "lib/ilist.h"
 #include "miscadmin.h"
+#include "port/pg_bitutils.h"
 #include "storage/dsm.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
 #include "storage/pg_shmem.h"
+#include "utils/freepage.h"
 #include "utils/guc.h"

Re: Fast DSM segments

2020-06-10 Thread Robert Haas
On Tue, Jun 9, 2020 at 6:03 PM Thomas Munro  wrote:
> That all makes sense.  Now I'm wondering if I should use exactly that
> word in the GUC... dynamic_shared_memory_preallocate?

I tend to prefer verb-object rather than object-verb word ordering,
because that's how English normally works, but I realize this is not a
unanimous view.

It's a little strange because the fact of preallocating it makes it
not dynamic any more. I don't know what to do about that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Fast DSM segments

2020-06-09 Thread Thomas Munro
On Sat, Apr 11, 2020 at 1:55 AM Robert Haas  wrote:
> On Thu, Apr 9, 2020 at 1:46 AM Thomas Munro  wrote:
> > The attached highly experimental patch adds a new GUC
> > dynamic_shared_memory_main_size.  If you set it > 0, it creates a
> > fixed sized shared memory region that supplies memory for "fast" DSM
> > segments.  When there isn't enough free space, dsm_create() falls back
> > to the traditional approach using eg shm_open().
>
> I think this is a reasonable option to have available for people who
> want to use it. I didn't want to have parallel query be limited to a
> fixed-size amount of shared memory because I think there are some
> cases where efficient performance really requires a large chunk of
> memory, and it seemed impractical to keep the largest amount of memory
> that any query might need to use permanently allocated, let alone that
> amount multiplied by the maximum possible number of parallel queries
> that could be running at the same time. But none of that is any
> argument against giving people the option to preallocate some memory
> for parallel query.

That all makes sense.  Now I'm wondering if I should use exactly that
word in the GUC... dynamic_shared_memory_preallocate?




Re: Fast DSM segments

2020-04-10 Thread Robert Haas
On Thu, Apr 9, 2020 at 1:46 AM Thomas Munro  wrote:
> The attached highly experimental patch adds a new GUC
> dynamic_shared_memory_main_size.  If you set it > 0, it creates a
> fixed sized shared memory region that supplies memory for "fast" DSM
> segments.  When there isn't enough free space, dsm_create() falls back
> to the traditional approach using eg shm_open().

I think this is a reasonable option to have available for people who
want to use it. I didn't want to have parallel query be limited to a
fixed-size amount of shared memory because I think there are some
cases where efficient performance really requires a large chunk of
memory, and it seemed impractical to keep the largest amount of memory
that any query might need to use permanently allocated, let alone that
amount multiplied by the maximum possible number of parallel queries
that could be running at the same time. But none of that is any
argument against giving people the option to preallocate some memory
for parallel query.

My guess is that on smaller boxes this won't find a lot of use, but on
bigger ones it will be handy. It's hard to imagine setting aside 1GB
of memory for this if you only have 8GB total, but if you have 512GB
total, it's pretty easy to imagine.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Fast DSM segments

2020-04-08 Thread Thomas Munro
Hello PostgreSQL 14 hackers,

FreeBSD is much faster than Linux (and probably Windows) at parallel
hash joins on the same hardware, primarily because its DSM segments
run in huge pages out of the box.  There are various ways to convince
recent-ish Linux to put our DSMs on huge pages (see below for one),
but that's not the only problem I wanted to attack.

The attached highly experimental patch adds a new GUC
dynamic_shared_memory_main_size.  If you set it > 0, it creates a
fixed sized shared memory region that supplies memory for "fast" DSM
segments.  When there isn't enough free space, dsm_create() falls back
to the traditional approach using eg shm_open().  This allows parallel
queries to run faster, because:

* no more expensive system calls
* no repeated VM allocation (whether explicit posix_fallocate() or first-touch)
* can be in huge pages on Linux and Windows

This makes lots of parallel queries measurably faster, especially
parallel hash join.  To demonstrate with a very simple query:

  create table t (i int);
  insert into t select generate_series(1, 1000);
  select pg_prewarm('t');
  set work_mem = '1GB';

  select count(*) from t t1 join t t2 using (i);

Here are some quick and dirty results from a Linux 4.19 laptop.  The
first column is the new GUC, and the last column is from "perf stat -e
dTLB-load-misses -p ".

  size  huge_pages  time    speedup  TLB misses
  0     off         2.595s            9,131,285
  0     on          2.571s    1%      8,951,595
  1GB   off         2.398s    8%      9,082,803
  1GB   on          1.898s   37%        169,867

You can get some of this speedup unpatched on a Linux 4.7+ system by
putting "huge=always" in your /etc/fstab options for /dev/shm (= where
shm_open() lives).  For comparison, that gives me:

  size  huge_pages  time    speedup  TLB misses
  0     on          2.007s   29%        221,910

That still leaves the other 8% on the table, and in fact that 8%
explodes to a much larger number as you throw more cores at the
problem (here I was using defaults, 2 workers).  Unfortunately, dsa.c
-- used by parallel hash join to allocate vast amounts of memory
really fast during the build phase -- holds a lock while creating new
segments, as you'll soon discover if you test very large hash join
builds on a 72-way box.  I considered allowing concurrent segment
creation, but as far as I could see that would lead to terrible
fragmentation problems, especially in combination with our geometric
growth policy for segment sizes due to limited slots.  I think this is
the main factor that causes parallel hash join scalability to fall off
around 8 cores.  The present patch should really help with that (more
digging in that area needed; there are other ways to improve that
situation, possibly including something smarter than a stream of
dsa_allocate(32kB) calls).
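
To see why that lock matters, here is roughly the allocation pattern during
the build phase, simplified from what nodeHash.c really does (the names are
illustrative): each backend takes one 32kB chunk at a time from the shared
area and carves tuples out of it privately, so dsa.c is entered once per
chunk -- and whenever it has to create a new segment while holding its lock,
everyone else's next chunk waits behind it.

  /* Simplified sketch of per-backend chunked allocation from a shared dsa_area. */
  #include "postgres.h"
  #include "utils/dsa.h"

  #define DEMO_CHUNK_SIZE (32 * 1024)            /* cf. HASH_CHUNK_SIZE */

  typedef struct demo_chunk
  {
      size_t used;
      char   data[FLEXIBLE_ARRAY_MEMBER];
  } demo_chunk;

  /* Allocate 'size' bytes of tuple space, taking a fresh chunk from the
   * shared area only when the current one is full. */
  static void *
  demo_alloc(dsa_area *area, demo_chunk **chunk, size_t size)
  {
      size_t usable = DEMO_CHUNK_SIZE - offsetof(demo_chunk, data);

      if (*chunk == NULL || (*chunk)->used + size > usable)
      {
          /* The contended path: one dsa_allocate() per 32kB, not per tuple. */
          dsa_pointer dp = dsa_allocate(area, DEMO_CHUNK_SIZE);

          *chunk = dsa_get_address(area, dp);
          (*chunk)->used = 0;
      }

      {
          void *result = (*chunk)->data + (*chunk)->used;

          (*chunk)->used += size;
          return result;
      }
  }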

A competing idea would be to keep freelists of lingering DSM segments for
reuse.  Among other problems, you'd probably have fragmentation
problems due to their differing sizes.  Perhaps there could be a
hybrid of these two ideas, putting a region for "fast" DSM segments
inside many OS-supplied segments, though it's obviously much more
complicated.

As for what a reasonable setting would be for this patch, well, erm,
it depends.  Obviously that's RAM that the system can't use for other
purposes while you're not running parallel queries, and if it's huge
pages, it can't be swapped out; if it's not huge pages, then it can be
swapped out, and that'd be terrible for performance next time you need
it.  So you wouldn't want to set it too large.  If you set it too
small, it falls back to the traditional behaviour.

One argument I've heard in favour of creating fresh segments every
time is that NUMA systems configured to prefer local memory allocation
(as opposed to interleaved allocation) probably avoid cross-node
traffic.  I haven't looked into that topic yet; I suppose one way to
deal with it in this scheme would be to have one such region per node,
and prefer to allocate from the local one.
From 703b3ed55e7a0f1d895800943bfa64fb52a0fec1 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 9 Apr 2020 14:12:38 +1200
Subject: [PATCH] Support DSM segments in the main shmem area.

Create an optional region in the main shared memory segment that can be
used to acquire and release "fast" DSM segments, and can benefit from
huge pages allocated at cluster startup time, if configured.  Fall back
to the existing mechanisms when that space is full.  The size is
controlled by dynamic_shared_memory_main_size, defaulting to 0
(disabled).

Main region DSM segments initially contain whatever garbage the memory
held last time they were used, rather than zeroes.  That change revealed
that DSA areas failed to initialize themselves correctly in memory that
wasn't zeroed first, so fix that problem.
---
 src/backend/storage/ipc/dsm.c  | 185 +
 src/backend/storage/i