Re: [HACKERS] WIP: dynahash replacement for buffer table

2015-02-01 Thread Amit Kapila
On Tue, Jan 27, 2015 at 9:50 AM, Robert Haas robertmh...@gmail.com wrote:

 This developed a slight merge conflict.  I've rebased the attached
 version, and I also took the step of getting rid of buf_table.c, as I
 think I proposed somewhere upthread.  This avoids the overhead of
 constructing a BufferTag only to copy it into a BufferLookupEnt, plus
 some function calls and so forth.  A quick-and-dirty test suggests
 this might not have cut down on the 1-client overhead much, but I
 think it's worth doing anyway: it's certainly saving a few cycles, and
 I don't think it's complicating anything measurably.


Performance data for some of the configurations is given below.

Configuration and Db Details
--
IBM POWER-8 24 cores, 192 hardware threads
RAM = 492GB
checkpoint_segments=300
checkpoint_timeout=25min
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5min
Scale_factor - 5000
HEAD (commit id - 168a809d)

Below is the data for the median of 3 runs with pgbench read-only
(using -M prepared) configuration


Shared_buffers = 8GB (tps)

Client Count       1       8      16      32      64     128     256
HEAD           17748  119106  164949  246632  216763  183177  173055
HEAD + patch   17802  119721  167422  298746  457863  422621  356756


Shared_buffers = 16GB (tps)

Client Count       1       8      16      32      64     128     256
HEAD           18139  113265  169217  270172  310936  238490  215308
HEAD + patch   17900  119960  174196  314866  448238  425916  347919


Observations as per data
--
a. It improves the tps by a great margin at higher client counts.
b. It is evident that as we increase shared_buffers, the gain is
relatively smaller (the gain at shared_buffers = 16GB is less than
at shared_buffers = 8GB).

I think the patch is valuable for such loads even though it shows
less benefit at higher shared_buffers values, although we might want
to verify once that it doesn't regress at configurations such as
shared_buffers = scale_factor.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] WIP: dynahash replacement for buffer table

2015-01-27 Thread Andres Freund
On 2015-01-27 10:36:35 -0500, Robert Haas wrote:
 On Mon, Jan 26, 2015 at 11:20 PM, Robert Haas robertmh...@gmail.com wrote:
  This developed a slight merge conflict.  I've rebased the attached
  version, and I also took the step of getting rid of buf_table.c, as I
  think I proposed somewhere upthread.  This avoids the overhead of
  constructing a BufferTag only to copy it into a BufferLookupEnt, plus
  some function calls and so forth.  A quick-and-dirty test suggests
  this might not have cut down on the 1-client overhead much, but I
  think it's worth doing anyway: it's certainly saving a few cycles, and
  I don't think it's complicating anything measurably.
 
 So here are median-of-three results for 5-minute read-only pgbench
 runs at scale factor 1000, shared_buffers = 8GB, on hydra (POWER7) at
 various client counts.

 That's extremely unimpressive.  You (Andres) showed a much bigger
 benefit on a different machine with a much higher scale factor (5000)
 so I don't know whether the modest benefit here is because of the
 different machine, the different scale factor, or what.

Based on my test on hydra some time back the bottleneck is nearly
entirely in GetSnapshotData() at higher client counts. So I'm not that
surprised you don't see the big benefit here.  IIRC on hydra the results
for using a large enough shared_buffers setting for a fully cached run
at that scale are pretty close to your master results - so there's really
not much performance increase to be expected by making the buf table
lockless.

I guess you would see a slightly bigger difference at a different
shared_buffers/scale combination. IIRC a scale-1000 database is about 15GB?
So about half of the data set fits in shared buffers. In the scale-5k
results I posted it was about 1/5.

I had also tested on a four-socket x86 machine where GetSnapshotData()
was a big, but not the sole, bottleneck.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] WIP: dynahash replacement for buffer table

2015-01-27 Thread Robert Haas
On Mon, Jan 26, 2015 at 11:20 PM, Robert Haas robertmh...@gmail.com wrote:
 This developed a slight merge conflict.  I've rebased the attached
 version, and I also took the step of getting rid of buf_table.c, as I
 think I proposed somewhere upthread.  This avoids the overhead of
 constructing a BufferTag only to copy it into a BufferLookupEnt, plus
 some function calls and so forth.  A quick-and-dirty test suggests
 this might not have cut down on the 1-client overhead much, but I
 think it's worth doing anyway: it's certainly saving a few cycles, and
 I don't think it's complicating anything measurably.

So here are median-of-three results for 5-minute read-only pgbench
runs at scale factor 1000, shared_buffers = 8GB, on hydra (POWER7) at
various client counts.

clients - master@4b2a25 - master+chash-buftable-v2
1 8319.254199 8105.479632
2 15485.561948 15138.227533
3 23690.186576 23264.702784
4 32157.346740 31536.870881
5 40879.532747 40455.794841
6 49063.279970 49625.681573
7 57518.374517 57492.275197
8 65351.415323 65622.211763
16 126166.416498 126668.793798
24 155727.916112 155587.414299
32 180329.012019 179543.424754
40 201384.186317 200109.614362
48 222352.265595 225688.574611
56 240400.659338 254398.144976
64 253699.031075 266624.224699
72 261198.336625 270652.578322
80 264569.862257 270332.738084

That's extremely unimpressive.  You (Andres) showed a much bigger
benefit on a different machine with a much higher scale factor (5000)
so I don't know whether the modest benefit here is because of the
different machine, the different scale factor, or what.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WIP: dynahash replacement for buffer table

2015-01-26 Thread Robert Haas
This developed a slight merge conflict.  I've rebased the attached
version, and I also took the step of getting rid of buf_table.c, as I
think I proposed somewhere upthread.  This avoids the overhead of
constructing a BufferTag only to copy it into a BufferLookupEnt, plus
some function calls and so forth.  A quick-and-dirty test suggests
this might not have cut down on the 1-client overhead much, but I
think it's worth doing anyway: it's certainly saving a few cycles, and
I don't think it's complicating anything measurably.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/contrib/Makefile b/contrib/Makefile
index 195d447..141ef0a 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
 		earthdistance	\
 		file_fdw	\
 		fuzzystrmatch	\
+		hashtest	\
 		hstore		\
 		intagg		\
 		intarray	\
diff --git a/contrib/hashtest/Makefile b/contrib/hashtest/Makefile
new file mode 100644
index 000..3ee42f8
--- /dev/null
+++ b/contrib/hashtest/Makefile
@@ -0,0 +1,18 @@
+# contrib/hashtest/Makefile
+
+MODULE_big = hashtest
+OBJS = hashtest.o
+
+EXTENSION = hashtest
+DATA = hashtest--1.0.sql
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/hashtest
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/hashtest/hashtest--1.0.sql b/contrib/hashtest/hashtest--1.0.sql
new file mode 100644
index 000..e271baf
--- /dev/null
+++ b/contrib/hashtest/hashtest--1.0.sql
@@ -0,0 +1,52 @@
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION hashtest" to load this file. \quit
+
+CREATE FUNCTION chash_insert_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'chash_insert_test'
+LANGUAGE C;
+
+CREATE FUNCTION chash_search_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'chash_search_test'
+LANGUAGE C;
+
+CREATE FUNCTION chash_delete_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'chash_delete_test'
+LANGUAGE C;
+
+CREATE FUNCTION chash_concurrent_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'chash_concurrent_test'
+LANGUAGE C;
+
+CREATE FUNCTION chash_collision_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'chash_collision_test'
+LANGUAGE C;
+
+CREATE FUNCTION dynahash_insert_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'dynahash_insert_test'
+LANGUAGE C;
+
+CREATE FUNCTION dynahash_search_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'dynahash_search_test'
+LANGUAGE C;
+
+CREATE FUNCTION dynahash_delete_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'dynahash_delete_test'
+LANGUAGE C;
+
+CREATE FUNCTION dynahash_concurrent_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'dynahash_concurrent_test'
+LANGUAGE C;
+
+CREATE FUNCTION dynahash_collision_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'dynahash_collision_test'
+LANGUAGE C;
diff --git a/contrib/hashtest/hashtest.c b/contrib/hashtest/hashtest.c
new file mode 100644
index 000..172a5bb
--- /dev/null
+++ b/contrib/hashtest/hashtest.c
@@ -0,0 +1,527 @@
+/*-
+ * hashtest.c
+ *-
+ */
+
+#include "postgres.h"
+
+#include "funcapi.h"
+#include "libpq/auth.h"
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "portability/instr_time.h"
+#include "storage/ipc.h"
+#include "utils/chash.h"
+
+PG_MODULE_MAGIC;
+
+void		_PG_init(void);
+Datum		chash_insert_test(PG_FUNCTION_ARGS);
+Datum		chash_search_test(PG_FUNCTION_ARGS);
+Datum		chash_delete_test(PG_FUNCTION_ARGS);
+Datum		chash_concurrent_test(PG_FUNCTION_ARGS);
+Datum		chash_collision_test(PG_FUNCTION_ARGS);
+Datum		dynahash_insert_test(PG_FUNCTION_ARGS);
+Datum		dynahash_search_test(PG_FUNCTION_ARGS);
+Datum		dynahash_delete_test(PG_FUNCTION_ARGS);
+Datum		dynahash_concurrent_test(PG_FUNCTION_ARGS);
+Datum		dynahash_collision_test(PG_FUNCTION_ARGS);
+static void hashtest_shmem_startup(void);
+
+PG_FUNCTION_INFO_V1(chash_insert_test);
+PG_FUNCTION_INFO_V1(chash_search_test);
+PG_FUNCTION_INFO_V1(chash_delete_test);
+PG_FUNCTION_INFO_V1(chash_concurrent_test);
+PG_FUNCTION_INFO_V1(chash_collision_test);
+PG_FUNCTION_INFO_V1(dynahash_insert_test);
+PG_FUNCTION_INFO_V1(dynahash_search_test);
+PG_FUNCTION_INFO_V1(dynahash_delete_test);
+PG_FUNCTION_INFO_V1(dynahash_concurrent_test);
+PG_FUNCTION_INFO_V1(dynahash_collision_test);
+
+typedef struct
+{
+	uint32	key;
+	uint32	val;
+} hentry;
+
+static CHashDescriptor cdesc = {
+	"hashtest-chash",	/* name */
+	1048576,			/* capacity */
+	sizeof(hentry),		/* element size */
+	sizeof(uint32)		/* key size */
+};
+
+#define DYNAHASH_PARTITIONS		16
+
+static shmem_startup_hook_type prev_shmem_startup_hook = NULL;
+static CHashTable chash;
+static HTAB *dynahash;
+static LWLockId dynahash_lock[DYNAHASH_PARTITIONS];
+static ClientAuthentication_hook_type original_client_auth_hook = NULL;
+
+static 

Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-12-15 Thread Robert Haas
On Sun, Nov 9, 2014 at 3:58 PM, Andres Freund and...@2ndquadrant.com wrote:
  * There's several cases where it's noticeable that chash creates more
cacheline misses - that's imo a good part of why the single threaded
performance regresses. There's good reasons for the current design
without a inline bucket, namely that it makes the concurrency handling
easier. But I think that can be countered by still storing a pointer -
just to an element inside the bucket. Afaics that'd allow this to be
introduced quite easily?

 It's not obvious to me how that would work.  If you just throw those
 elements on the garbage lists with everything else, it will soon be
 the case that one bucket header ends up using the cell stored in some
 other bucket, which isn't going to be awesome.  So you need some kind
 of more complex scheme to preserve locality.

 Treating the element inside the bucket as a kind of one element freelist
 seems quite workable to me. Test whether it's marked deleted, scan the
 hazard pointer list to see whether it can be used. If yes, grand. If
 not, go to the current code. The fact that the element is cacheline
 local will make the test for its deletedness almost free. Yes, the
 hazard pointer scan surely ain't free - but at least for cases like
 bufmgr where reads are never less frequent than writes and very often
 *much* more frequent I'm pretty sure it'd be a noticeable win.

Maybe. I'd expect that to radically increase the time spent doing
hazard pointer scans.  The performance of this system depends heavily
on garbage collection not happening too often, and ISTR that the
performance changes significantly if you vary the amount of
overallocation.

 I'm not sure.  We're talking about something on the order of half a
 percent on my tests, and lots of changes cause things to bounce around
 by that much.  I'm not sure it makes sense to get too worked up about
 that if the alternative is to add a big pile of new complexity.

 I saw things in the range of 3-4% on my laptop.

Mrmpf.  Well, that sucks.

  * I don't understand right now (but then I'm in a Hotel bar) why it's
safe for CHashAddToGarbage() to clobber the hash value?
CHashBucketScan() relies the chain being ordered by hashcode. No?
Another backend might just have been past the IsMarked() bit and look
at the hashcode?

 I think the worst case scenario is that we falsely conclude that
 there's a hash match when there really isn't.  The ensuing memcmp()
 will clarify matters.

 Hm. Let me think about it for a while.

 I was thinking that a spurious cmp  0 could cause us to miss a
 match - but that could only happen if the match has just been removed
 from the list. Which is fine. Perhaps that case deserves to be mentioned
 in the comment?

Maybe.  I mean, the general principle here is that there may be some
difference between the apparent order of execution on one CPU as
opposed to another, and the code that uses the hash table has to be OK
with that - at least, unless it has independent methods of assuring
that things happen in the right order, but in that case that other
thing may well become the bottleneck.  This is just one example of
that.  You might also fail to see an insert that's just happened on
some other node but is not committed to main memory yet, which is not
really an issue we need to fix; it's just how things are.  A general
discussion of this somewhere might be worthwhile, but it's pretty much
the same as any other highly-concurrent hashtable implementation you'll
find out there.

(It's also not really different from surrounding the hash table
operation, and only the hash table operation, with a big lock.  Then
things can't change while you are scanning the bucket list, but they
can change by the time you can do anything with the returned value,
which is the same problem we have to cope with here.)

 * Another thing I'm wondering about here is whether it's possible that
 somebody is currently walking an older version of the bucket list
 (i.e. is currently at an already unlinked element) and then visits a
 newly inserted element with an 'apparently' out of order hash
 value. That seems possible because the inserter might not have seen the
 unlinked element. Afaics that's not a problem for searches - but I'm not
 sure whether it couldn't cause a problem for concurrent inserts and
 deletes.

This is why the delete-mark bit has to be part of the same atomic word
as the next-pointer.  If somebody applies a delete-mark, a subsequent
attempt to insert an entry at that position will fail because the
CAS() of the next-word won't find the right value there - it will find
a delete-marked value, which will never be the value it's expecting.
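
To illustrate why that works, here is a small stand-alone C11 sketch of the
idea (not the patch's actual code; the names and memory-order choices here
are mine): the delete mark lives in the low bit of the same atomic word as
the next pointer, so an insert's CAS can never succeed once the mark is set.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct node node;
struct node
{
	_Atomic uintptr_t next;		/* successor pointer | MARK_BIT */
	uint32_t	hashcode;
	/* ... item payload would follow ... */
};

#define MARK_BIT	((uintptr_t) 1)
#define MARKED(w)	(((w) & MARK_BIT) != 0)

/*
 * Link newnode between prev and its current successor succ.  The CAS only
 * succeeds if prev->next still holds the *unmarked* pointer to succ; if a
 * concurrent delete has set the mark bit, the word differs and the insert
 * fails, so nothing is ever spliced in behind a node being unlinked.
 */
static bool
insert_after(node *prev, node *succ, node *newnode)
{
	uintptr_t	expected = (uintptr_t) succ;	/* mark bit clear */

	atomic_store_explicit(&newnode->next, (uintptr_t) succ,
						  memory_order_relaxed);
	return atomic_compare_exchange_strong(&prev->next, &expected,
										  (uintptr_t) newnode);
}

/*
 * Delete-mark a node.  Once the low bit is set, every insert CAS at this
 * position is doomed to fail, which is the property described above.
 */
static bool
delete_mark(node *n)
{
	uintptr_t	old = atomic_load_explicit(&n->next, memory_order_relaxed);

	while (!MARKED(old))
	{
		if (atomic_compare_exchange_weak(&n->next, &old, old | MARK_BIT))
			return true;		/* we won the race to mark it */
	}
	return false;				/* someone else marked it first */
}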

 * One thing I noticed while looking at that part of code is:
 +   h = target_node-un.hashcode;
 +   if (h == hashcode)
 +   cmp = memcmp(CHashNodeGetItem(target_node), key,
 +  

Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-12-14 Thread Michael Paquier
There hasn't been much activity on this patch these days, and Andres
has provided some feedback, hence switching to Returned with feedback.
Regards,
-- 
Michael




Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-11-09 Thread Andres Freund
On 2014-11-07 11:08:57 -0500, Robert Haas wrote:
 On Wed, Nov 5, 2014 at 6:19 PM, Andres Freund and...@2ndquadrant.com wrote:
  * In benchmarks it becomes apparent that the dynamic element width makes
some macros like CHashTableGetNode() and
CHashTableGetGarbageByBucket() quite expensive. At least gcc on x86
can't figure how to build an efficient expression for the
target. There's two ways to address this:
a) Actually morph chash into something that will be included into the
   user sites. Since I think there'll only be a limited number of those
   that's probably acceptable.
 
 Have you benchmarked this?  How much does it help?

I've done quick benchmarks, and it helped in the 2-3% range in some
tests, and was a wash in others. What I did wasn't actually including it
in buf_table.c, but hardcoding the size in chash. That's obviously not
the real solution.

b) Simplify the access rules. I quite seriously doubt that the
  interleaving of garbage and freelists is actually necessary/helpful.
 
 That wasn't added for nothing.  I did a bunch of performance testing
 on this vs. dynahash (I think the test code is included in the patch)
 and the region of memory containing the freelists practically caught
 fire.  Spreading them out like this greatly improved the performance
 under heavy concurrency, but even with those changes the freelists
 were extremely hot.  Now, since this is the buffer mapping table
 rather than a tight loop around hash table manipulation, the
 difference may not be enough to measure.  But on a microbenchmark it
 *definitely* was.

I think it'd be much less visible for the buffer manager, but it might
still be visible...

  * There's several cases where it's noticeable that chash creates more
cacheline misses - that's imo a good part of why the single threaded
performance regresses. There's good reasons for the current design
without a inline bucket, namely that it makes the concurrency handling
easier. But I think that can be countered by still storing a pointer -
just to an element inside the bucket. Afaics that'd allow this to be
introduced quite easily?
 
 It's not obvious to me how that would work.  If you just throw those
 elements on the garbage lists with everything else, it will soon be
 the case that one bucket header ends up using the cell stored in some
 other bucket, which isn't going to be awesome.  So you need some kind
 of more complex scheme to preserve locality.

Treating the element inside the bucket as a kind of one element freelist
seems quite workable to me. Test whether it's marked deleted, scan the
hazard pointer list to see whether it can be used. If yes, grand. If
not, go to the current code. The fact that the element is cacheline
local will make the test for its deletedness almost free. Yes, the
hazard pointer scan surely ain't free - but at least for cases like
bufmgr where reads are never less frequent than writes and very often
*much* more frequent I'm pretty sure it'd be a noticeable win.
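
Very roughly, something like the following stand-alone sketch (purely
illustrative; it glosses over how the inline node is actually unlinked and
reinitialised, and all names here are made up):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_BACKENDS 64
#define MARK_BIT ((uintptr_t) 1)

typedef struct
{
	_Atomic uintptr_t next;			/* low bit doubles as the delete mark */
	uint32_t	hashcode;
	/* ... item payload ... */
} HashNode;

typedef struct
{
	HashNode	inline_node;		/* cache-line-local "one element freelist" */
	_Atomic uintptr_t head;			/* rest of the chain */
} Bucket;

static _Atomic uintptr_t hazard_slot[MAX_BACKENDS];

/* Stand-in for the current, slower allocation path. */
static HashNode *
allocate_from_freelists(void)
{
	return (HashNode *) 0;
}

/*
 * Try to reuse the bucket's inline element: it must be delete-marked (no
 * longer reachable) and no backend may still hold a hazard pointer to it.
 * The deletedness test is nearly free because the node shares a cache line
 * with the bucket header; the hazard-pointer scan is the price paid, which
 * the argument above says is acceptable for read-mostly uses like bufmgr.
 */
static HashNode *
bucket_alloc(Bucket *bucket)
{
	HashNode   *n = &bucket->inline_node;

	if (atomic_load_explicit(&n->next, memory_order_acquire) & MARK_BIT)
	{
		bool		in_use = false;

		for (int i = 0; i < MAX_BACKENDS; i++)
		{
			if (atomic_load(&hazard_slot[i]) == (uintptr_t) n)
			{
				in_use = true;
				break;
			}
		}
		if (!in_use)
			return n;				/* reuse the cache-line-local element */
	}
	return allocate_from_freelists();	/* fall back to the current code */
}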

I'm afraid that I can't see us going forward unless we address the
noticeable single threaded penalties. Do you see that differently?
 
 I'm not sure.  We're talking about something on the order of half a
 percent on my tests, and lots of changes cause things to bounce around
 by that much.  I'm not sure it makes sense to get too worked up about
 that if the alternative is to add a big pile of new complexity.

I saw things in the range of 3-4% on my laptop.

  * There's some whitespace damage. (space followed by tab, followed by
actual contents)
 
 That's the least of our problems at this point.

Sure, but still worth cleaning up ;)

  * I still think it's a good idea to move the full memory barriers into
the cleanup/writing process by doing write memory barriers when
acquiring a hazard pointer and RMW atomic operations (e.g. atomic add)
in the process testing for the hazard pointer.
 
 Can you cite any existing precedent in other system software?  Does
 Linux do anything like that, for example?

No, I can't right now. I think it follows from the cache coherency
rules, but I can understand suspicion there.

  * Shouldn't we relax the CPU in a couple places like CHashAllocate(),
CHashBucketScan()'s loops?
 
 You mean like, have it sleep the way SpinLockAcquire() does?

Not actually like in s_lock(), but rather like the SPIN_DELAY() used in
the spinlock code for retries.

 Yeah, possibly, but that adds non-trivial code complexity which may
 not be worth it if - as is hoped for - there's no real contention
 there.

I think just adding a pg_spin_delay() before retrying should be trivial.

  * I don't understand right now (but then I'm in a Hotel bar) why it's
safe for CHashAddToGarbage() to clobber the hash value?
CHashBucketScan() relies the chain being ordered by hashcode. No?
Another backend might just have been past the IsMarked() bit and look
at the hashcode?
 
 I think 

Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-11-07 Thread Robert Haas
On Wed, Nov 5, 2014 at 6:19 PM, Andres Freund and...@2ndquadrant.com wrote:
 * In benchmarks it becomes apparent that the dynamic element width makes
   some macros like CHashTableGetNode() and
   CHashTableGetGarbageByBucket() quite expensive. At least gcc on x86
   can't figure how to build an efficient expression for the
   target. There's two ways to address this:
   a) Actually morph chash into something that will be included into the
  user sites. Since I think there'll only be a limited number of those
  that's probably acceptable.

Have you benchmarked this?  How much does it help?

   b) Simplify the access rules. I quite seriously doubt that the
 interleaving of garbage and freelists is actually necessary/helpful.

That wasn't added for nothing.  I did a bunch of performance testing
on this vs. dynahash (I think the test code is included in the patch)
and the region of memory containing the freelists practically caught
fire.  Spreading them out like this greatly improved the performance
under heavy concurrency, but even with those changes the freelists
were extremely hot.  Now, since this is the buffer mapping table
rather than a tight loop around hash table manipulation, the
difference may not be enough to measure.  But on a microbenchmark it
*definitely* was.

 * There's several cases where it's noticeable that chash creates more
   cacheline misses - that's imo a good part of why the single threaded
   performance regresses. There's good reasons for the current design
   without a inline bucket, namely that it makes the concurrency handling
   easier. But I think that can be countered by still storing a pointer -
   just to an element inside the bucket. Afaics that'd allow this to be
   introduced quite easily?

It's not obvious to me how that would work.  If you just throw those
elements on the garbage lists with everything else, it will soon be
the case that one bucket header ends up using the cell stored in some
other bucket, which isn't going to be awesome.  So you need some kind
of more complex scheme to preserve locality.

   I'm afraid that I can't see us going forward unless we address the
   noticeable single threaded penalties. Do you see that differently?

I'm not sure.  We're talking about something on the order of half a
percent on my tests, and lots of changes cause things to bounce around
by that much.  I'm not sure it makes sense to get too worked up about
that if the alternative is to add a big pile of new complexity.

 * There's some whitespace damage. (space followed by tab, followed by
   actual contents)

That's the least of our problems at this point.

 * I still think it's a good idea to move the full memory barriers into
   the cleanup/writing process by doing write memory barriers when
   acquiring a hazard pointer and RMW atomic operations (e.g. atomic add)
   in the process testing for the hazard pointer.

Can you cite any existing precedent in other system software?  Does
Linux do anything like that, for example?

 * Shouldn't we relax the CPU in a couple places like CHashAllocate(),
   CHashBucketScan()'s loops?

You mean like, have it sleep the way SpinLockAcquire() does?  Yeah,
possibly, but that adds non-trivial code complexity which may not be
worth it if - as is hoped for - there's no real contention there.

 * I don't understand right now (but then I'm in a Hotel bar) why it's
   safe for CHashAddToGarbage() to clobber the hash value?
   CHashBucketScan() relies the chain being ordered by hashcode. No?
   Another backend might just have been past the IsMarked() bit and look
   at the hashcode?

I think the worst case scenario is that we falsely conclude that
there's a hash match when there really isn't.  The ensuing memcmp()
will clarify matters.

 * We really should find a way to sensibly print the stats.

Yep.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-11-05 Thread Andres Freund
Hi,

 git branch also available at:
 http://git.postgresql.org/gitweb/?p=users/rhaas/postgres.git;a=shortlog;h=refs/heads/chash2014

A minor review of this:
* Should be rebased ontop of the atomics API

* In benchmarks it becomes apparent that the dynamic element width makes
  some macros like CHashTableGetNode() and
  CHashTableGetGarbageByBucket() quite expensive. At least gcc on x86
  can't figure how to build an efficient expression for the
  target. There's two ways to address this:
  a) Actually morph chash into something that will be included into the
 user sites. Since I think there'll only be a limited number of those
 that's probably acceptable.
  b) Simplify the access rules. I quite seriously doubt that the
interleaving of garbage and freelists is actually necessary/helpful.

* There's several cases where it's noticeable that chash creates more
  cacheline misses - that's imo a good part of why the single threaded
  performance regresses. There's good reasons for the current design
  without an inline bucket, namely that it makes the concurrency handling
  easier. But I think that can be countered by still storing a pointer -
  just to an element inside the bucket. Afaics that'd allow this to be
  introduced quite easily?

  I'm afraid that I can't see us going forward unless we address the
  noticeable single threaded penalties. Do you see that differently?

* There's some whitespace damage. (space followed by tab, followed by
  actual contents)

* I still think it's a good idea to move the full memory barriers into
  the cleanup/writing process by doing write memory barriers when
  acquiring a hazard pointer and RMW atomic operations (e.g. atomic add)
  in the process testing for the hazard pointer.

* Shouldn't we relax the CPU in a couple places like CHashAllocate(),
  CHashBucketScan()'s loops?

* I don't understand right now (but then I'm in a Hotel bar) why it's
  safe for CHashAddToGarbage() to clobber the hash value?
  CHashBucketScan() relies on the chain being ordered by hashcode. No?
  Another backend might just have been past the IsMarked() bit and look
  at the hashcode?

* We really should find a way to sensibly print the stats.


Greetings,

Andres Freund




Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-17 Thread Andres Freund
On 2014-10-16 20:22:24 -0400, Robert Haas wrote:
 On Thu, Oct 16, 2014 at 6:53 PM, Andres Freund and...@2ndquadrant.com wrote:
  When using shared_buffers = 96GB there's a performance benefit, but not
  huge:
  master (f630b0dd5ea2de52972d456f5978a012436115e):   153621.8
  master + LW_SHARED + lockless StrategyGetBuffer():  477118.4
  master + LW_SHARED + lockless StrategyGetBuffer() + chash:  496788.6
  master + LW_SHARED + lockless StrategyGetBuffer() + chash-nomb: 499562.7
 
  But with shared_buffers = 16GB:
  master (f630b0dd5ea2de52972d456f5978a012436115e):   177302.9
  master + LW_SHARED + lockless StrategyGetBuffer():  206172.4
  master + LW_SHARED + lockless StrategyGetBuffer() + chash:  413344.1
  master + LW_SHARED + lockless StrategyGetBuffer() + chash-nomb: 426405.8
 
 Very interesting.  This doesn't show that chash is the right solution,
 but it definitely shows that doing nothing is the wrong solution.

Absolutely.


The thing worrying me most (but not all that much on an absolute scale)
about chash is that lots of the solutions to memory management it builds
are specific to it... And generalizing afterwards will be hard because
we'll have to prove that that general solution is as performant as the
special case code...

 It shows that, even with the recent bump to 128 lock manager
 partitions, and LW_SHARED on top of that, workloads that actually
 update the buffer mapping table still produce a lot of contention
 there. 

FWIW, I couldn't see much of a benefit without LW_SHARED even though I
tried a *few* things. The bottleneck simply is completely elsewhere.

 This hasn't been obvious to me from profiling, but the numbers
 above make it pretty clear.

So I'm not super surprised about that. There very well might be cases
where it's a bad bottleneck before, but I've not seen them.

 It also seems to suggest that trying to get rid of the memory barriers
 isn't a very useful optimization project.

I'm not yet convinced of that. Yes, it's not showing up hugely in the
numbers here, but that's simply because the workload is again completely
dominated by the kernel copying data for the read()s, GetSnapshotData(),
the buffer mapping cache misses and a few other familiar friends.

 We might get a couple of
 percent out of it, but it's pretty small potatoes, so unless it can be
 done more easily than I suspect, it's probably not worth bothering
 with.

I still think that moving the barrier to the reading side would be
simple (implementation wise) and correct for x86. If you think about it,
otherwise our spinlock implementation for x86 would be broken. As an
unlock is just a compiler barrier, the full barrier on acquire had better
be a full synchronization point. Am I missing something?

 An approach I think might have more promise is to have bufmgr.c
 call the CHash stuff directly instead of going through buf_table.c.

I don't see all that much point in buf_table.c currently - on the other
hand it has made replacing the buffer mapping slightly easier...
Maybe it should just go to a header...

 Right now, for example, BufferAlloc() creates and initializes a
 BufferTag and passes a pointer to that buffer tag to BufTableLookup,
 which copies it into a BufferLookupEnt.  But it would be just as easy
 for BufferAlloc() to put the BufferLookupEnt on its own stack, and
 then you wouldn't need to copy the data an extra time.  Now a 20-byte
 copy isn't a lot, but it's completely unnecessary and looks easy to
 get rid of.

Worthwhile to try.
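
For concreteness, a rough stand-alone sketch of what the stack-allocated
lookup entry could look like (the struct layouts and search entry point
below are simplified stand-ins, not the actual definitions from the patch):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-ins for the real BufferTag/BufferLookupEnt. */
typedef struct
{
	uint32_t	spcNode;		/* tablespace */
	uint32_t	dbNode;			/* database */
	uint32_t	relNode;		/* relation */
	int32_t		forkNum;
	uint32_t	blockNum;
} BufferTag;					/* the ~20-byte key mentioned above */

typedef struct
{
	BufferTag	key;
	int			id;				/* buffer number */
} BufferLookupEnt;

/* Stand-in for the concurrent hash table's search routine: on a hit it
 * would fill in ent->id and return true. */
static bool
chash_search_stub(BufferLookupEnt *ent)
{
	(void) ent;
	return false;
}

/*
 * The idea being discussed: build the BufferLookupEnt directly on the
 * caller's stack and fill its key fields in place, instead of building a
 * separate BufferTag that buf_table.c then copies into a BufferLookupEnt.
 */
static int
buffer_lookup_direct(uint32_t spc, uint32_t db, uint32_t rel,
					 int32_t fork, uint32_t block)
{
	BufferLookupEnt ent;

	memset(&ent, 0, sizeof(ent));	/* keep padding deterministic */
	ent.key.spcNode = spc;
	ent.key.dbNode = db;
	ent.key.relNode = rel;
	ent.key.forkNum = fork;
	ent.key.blockNum = block;

	return chash_search_stub(&ent) ? ent.id : -1;
}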

  I had to play with setting max_connections+1 sometimes to get halfway
  comparable results for master - unaligned data otherwise causes weird
  results. Without doing that the performance gap between master
  96/16G was even bigger. We really need to fix that...
 
  This is pretty awesome.
 
 Thanks.  I wasn't quite sure how to test this or where the workloads
 that it would benefit would be found, so I appreciate you putting time
 into it.  And I'm really glad to hear that it delivers good results.

I wasn't sure either ;). Lucky that it played out so impressively. After
the first results I was nearly ready to send out a 'Meh.' type of
message ;)


Btw, CHashTableGetNode isn't exactly cheap. It shows up noticeably in
profiles. It results in several non-pipelineable instructions in an
already penalized section of the code... Replacing arena_stride by a
constant provided measurable improvements (no imul is required anymore,
instead you get shr;lea or so). I'm not sure how to deal with that. If
it shows up even on my quite new laptop, it'll be even more noticeable on
older x86 platforms.

 I think it would be useful to plumb the chash statistics into the
 stats collector or at least a debugging dump of some kind for testing.

I'm not sure it's solid enough at this point to be in the stats
collector. But I surely would like to access it somehow. I'm
e.g. absolutely not sure that your loadfactor is good and it'd be 

Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-16 Thread Andres Freund
On 2014-10-14 17:53:10 -0400, Robert Haas wrote:
 On Tue, Oct 14, 2014 at 4:42 PM, Andres Freund and...@2ndquadrant.com wrote:
  The code in CHashSearch shows the problem there: you need to STORE the
  hazard pointer before you begin to do the LOAD operations to scan the
  bucket, and you must finish all of those LOADs before you STORE the
  NULL hazard pointer.  A read or write barrier won't provide store/load
  or load/store ordering.
 
  I'm not sure I understand the problem here - but a read + write barrier
  should suffice? To avoid falling back to two full barriers, we'd need a
  separate define pg_read_write_barrier(), but that's not a problem. IIUC
  that should allow us to avoid emitting any actual code on x86.
 
 Well, thinking about x86 specifically, it definitely needs at least
 one mfence, after setting the hazard pointer and before beginning to
 read the bucket chain.  It probably doesn't need the second mfence,
 before clearing the the hazard pointer, because x86 allows loads to be
 reordered before stores but not stores before loads.  We could
 certainly try removing the second fence on x86 for benchmarking
 purposes, and we could also check whether mfence outperforms lock;
 addl in that scenario.

Hm. So I thought about this for a while this morning after I wasn't able
to unprogram my hotel room's alarm clock that a previous occupant set
way too early. Given that premise, take the following with a grain of
salt.

The reason for needing an mfence is that a read/write barrier doesn't
guarantee that the store is visible to other processes - just in which
order they become visible. Right? And that's essentially because the
write might sit in the process's store buffer.

So, how about *not* using a full barrier on the reader side (i.e. the
side that does the hazard pointer writes), but instead do something like
an atomic_fetch_add(hazard_ptr, 0) (i.e. lock; xaddl) on the side that
needs to scan the hazard pointers. That, combined with the write memory
barrier, should guarantee that the store buffers are emptied. Which is
pretty nice, because it moves the overhead to the rather infrequent
scanning of the hazard pointers - which needs to do lots of other atomic
ops anyway.

Sounds sane?

That's something that should best be simulated in an exhaustive x86
simulator, but it sounds sane - and it'd be quite awesome imo :)
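
To make the proposed division of labour concrete, a stand-alone C11 sketch
(hypothetical names; whether this ordering is actually sufficient on x86 is
exactly the open question above):

#include <stdatomic.h>
#include <stdint.h>

#define MAX_BACKENDS 64

/* One hazard-pointer slot per backend (made-up layout). */
static _Atomic uintptr_t hazard_slot[MAX_BACKENDS];

/*
 * Reader side (hot path): publish the node about to be dereferenced with
 * only a release store, i.e. no mfence / lock-prefixed instruction here.
 */
static inline void
set_hazard_pointer(int my_backend, void *node)
{
	atomic_store_explicit(&hazard_slot[my_backend], (uintptr_t) node,
						  memory_order_release);
	/* ... scan the bucket chain, then store 0 to clear the slot ... */
}

/*
 * Garbage-collecting side (cold path): read each slot with a locked RMW
 * (fetch_add of 0, i.e. lock; xadd).  The hope expressed above is that
 * paying the expensive instruction here - where plenty of atomics happen
 * anyway - lets the reader side stay cheap.
 */
static int
node_is_hazardous(void *node)
{
	for (int i = 0; i < MAX_BACKENDS; i++)
	{
		uintptr_t	v = atomic_fetch_add(&hazard_slot[i], (uintptr_t) 0);

		if (v == (uintptr_t) node)
			return 1;			/* some backend still references it */
	}
	return 0;					/* nobody holds it; candidate for recycling */
}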

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-16 Thread Ryan Johnson

On 15/10/2014 11:41 AM, Robert Haas wrote:

On Wed, Oct 15, 2014 at 2:03 AM, Andres Freund and...@2ndquadrant.com wrote:

On 2014-10-14 17:53:10 -0400, Robert Haas wrote:

On Tue, Oct 14, 2014 at 4:42 PM, Andres Freund and...@2ndquadrant.com wrote:

The code in CHashSearch shows the problem there: you need to STORE the
hazard pointer before you begin to do the LOAD operations to scan the
bucket, and you must finish all of those LOADs before you STORE the
NULL hazard pointer.  A read or write barrier won't provide store/load
or load/store ordering.

I'm not sure I understand the problem here - but a read + write barrier
should suffice? To avoid falling back to two full barriers, we'd need a
separate define pg_read_write_barrier(), but that's not a problem. IIUC
that should allow us to avoid emitting any actual code on x86.

Well, thinking about x86 specifically, it definitely needs at least
one mfence, after setting the hazard pointer and before beginning to
read the bucket chain.

Yes, I can see that now. I do wonder if that's not going to prove quite
expensive... I think I can see some ways around needing that. But I
think that needs some benchmarking first - no need to build an even more
complex solution if not needed.

The solution I'm thinking of is to essentially move away from hazard
pointers and store something like a generation counter per
backend. Which is updated less often, and in combination with other
operations. When a backend needs to clean up and sees that there's a
straggler with an old generation it sends that backend a signal to ensure
it sets the latest generation.

It's possible that might work ... but on the timescale we're talking
about here, that's asking the garbage collecting process to wait for
practically geologic time.

Back when I first wrote this code, I spent a fair amount of time
looking at existing work in the area of lock-free hash tables.
Essentially all of the work I was able to find on this topic assumes a
threaded model - or more precisely, it assumes that shared memory
allocation is cheap and easy and you'll have no problem getting as
much of it as you need whenever you need it.  This assumption often
isn't even spelled out explicitly: it's just assumed that that's the
programming environment you're working in.  Finding highly parallel
algorithms that don't rely on memory allocation as a primitive is
hard.  Hazard pointers are one of the few techniques I found that
seems like it can work in our architecture.  I'm quite reluctant to
leave aside techniques that have been proven to work well in favor of
inventing something entirely novel to PostgreSQL.

That having been said, there is some literature on generation numbers,
and I think something like that could be made to work.  It does have
some significant disadvantages, though.  One is that a single process
which fails to update its generation number prevents all reclamation,
system-wide.  In my algorithm, a process whose execution is
suspended while a hazard pointer is set prevents recycling of only one
of many garbage lists.  A process searching for a reusable element can
most likely find some other garbage list to reclaim instead.  Also,
a generation number implies that we update the value periodically,
rather than before and after each critical section.  Anything that
forces garbage collection to be postponed longer than absolutely
necessary seems to me likely to be a loser.  It might be a little
faster as long as we have free elements to allocate, but as soon as
we're out and have to wait longer than we otherwise would for garbage
collection, and all system activity halts as a result, even for a few
milliseconds, it's going to be a whole lot slower.  Or at least, I
think.

That having been said, I don't know what to do about the fact that the
fence is too expensive.  I don't know that we've really established
that that's the true root of the problem rather than some other
pedestrian optimization failure.  But the existing code is using an
atomic operation to acquire a spinlock, then releasing it, walking the
bucket chain, and then using another atomic operation to acquire a
spinlock and then releasing it again.  Surely a pure fence shouldn't
cost more than a spinlock cycle?  Even with arguably one extra cache
line touch, that seems like it ought to be a win.  But my intuitions
in this area are shaky.
Why not use an RCU mechanism [1] and ditch the hazard pointers? Seems 
like an ideal fit...


In brief, RCU has the following requirements:

 * Read-heavy access pattern
 * Writers must be able to make dead objects unreachable to new readers
   (easily done for most data structures)
 * Writers must be able to mark dead objects in such a way that
   existing readers know to ignore their contents but can still
   traverse the data structure properly (usually straightforward)
 * Readers must occasionally inform the system that they are not
   currently using any RCU-protected pointers (to allow resource
   

Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-16 Thread Robert Haas
On Thu, Oct 16, 2014 at 8:03 AM, Ryan Johnson
ryan.john...@cs.utoronto.ca wrote:
 Why not use an RCU mechanism [1] and ditch the hazard pointers? Seems like
 an ideal fit...

 In brief, RCU has the following requirements:

 Read-heavy access pattern
 Writers must be able to make dead objects unreachable to new readers (easily
 done for most data structures)
 Writers must be able to mark dead objects in such a way that existing
 readers know to ignore their contents but can still traverse the data
 structure properly (usually straightforward)
 Readers must occasionally inform the system that they are not currently
 using any RCU-protected pointers (to allow resource reclamation)

Have a look at http://lwn.net/Articles/573424/ and specifically the
URCU overview section.  Basically, that last requirement - that
readers inform the system that they are not currently using any
RCU-protected pointers - turns out to require either memory barriers
or signals.

All of the many techniques that have been developed in this area are
merely minor variations on a very old theme: set some kind of flag
variable in shared memory to let people know that you are reading a
shared data structure, and clear it when you are done.  Then, other
people can figure out when it's safe to recycle memory that was
previously part of that data structure.  In Linux's RCU, the flag
variable is whether the process is currently scheduled on a CPU,
which is obviously not workable from user-space.  Lacking that, you
need an explicit flag variable, which means you need memory barriers,
since the protected operation is a load and the flag variable is
updated via a store.  You can try to avoid some of the overhead by
updating the flag variable less often (say, when a signal arrives) or
you can make it more fine-grained (in my case, we only prevent reclaim
of a fraction of the data structure at a time, rather than all of it)
or various other variants, but none of this is unfortunately so simple
as "apply technique X and your problem just goes away".

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-16 Thread Andres Freund
On 2014-10-16 09:19:16 -0400, Robert Haas wrote:
 On Thu, Oct 16, 2014 at 8:03 AM, Ryan Johnson
 ryan.john...@cs.utoronto.ca wrote:
  Why not use an RCU mechanism [1] and ditch the hazard pointers? Seems like
  an ideal fit...
 
  In brief, RCU has the following requirements:
 
  Read-heavy access pattern
  Writers must be able to make dead objects unreachable to new readers (easily
  done for most data structures)
  Writers must be able to mark dead objects in such a way that existing
  readers know to ignore their contents but can still traverse the data
  structure properly (usually straightforward)
  Readers must occasionally inform the system that they are not currently
  using any RCU-protected pointers (to allow resource reclamation)
 
 Have a look at http://lwn.net/Articles/573424/ and specifically the
 URCU overview section.  Basically, that last requirement - that
  readers inform the system that they are not currently using any
 RCU-protected pointers - turns out to require either memory barriers
 or signals.

Well, there's the quiescent-state-based RCU - that's actually
something that could reasonably be used inside postgres. Put something
around socket reads (syscalls are synchronization points) and non-nested
lwlock acquires. That'd mean it's nearly no additional overhead.
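
A toy sketch of that quiescent-state scheme, just to fix ideas (nothing
like this exists in the tree; idle backends and straggler signalling are
waved away here):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_BACKENDS 64

static _Atomic uint64_t global_epoch = 1;
static _Atomic uint64_t backend_epoch[MAX_BACKENDS];

/*
 * Called at quiescent points - e.g. just before blocking in a socket read
 * or around a non-nested lwlock acquire.  Cost: one load and one store.
 */
static inline void
rcu_quiescent_point(int my_backend)
{
	uint64_t	e = atomic_load_explicit(&global_epoch, memory_order_acquire);

	atomic_store_explicit(&backend_epoch[my_backend], e,
						  memory_order_release);
}

/* Whoever wants to reclaim garbage bumps the epoch ... */
static uint64_t
rcu_begin_reclaim(void)
{
	return atomic_fetch_add(&global_epoch, 1) + 1;
}

/*
 * ... and everything unlinked before the bump may be freed once every
 * backend has reported reaching the new epoch.  A backend that never
 * quiesces (sleeping, long-running query) holds up reclamation until it
 * is dealt with somehow, e.g. by signalling it.
 */
static bool
rcu_all_quiescent(uint64_t target_epoch)
{
	for (int i = 0; i < MAX_BACKENDS; i++)
	{
		if (atomic_load_explicit(&backend_epoch[i], memory_order_acquire)
			< target_epoch)
			return false;
	}
	return true;
}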

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-16 Thread Robert Haas
On Thu, Oct 16, 2014 at 9:30 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-10-16 09:19:16 -0400, Robert Haas wrote:
 On Thu, Oct 16, 2014 at 8:03 AM, Ryan Johnson
 ryan.john...@cs.utoronto.ca wrote:
  Why not use an RCU mechanism [1] and ditch the hazard pointers? Seems like
  an ideal fit...
 
  In brief, RCU has the following requirements:
 
  Read-heavy access pattern
  Writers must be able to make dead objects unreachable to new readers 
  (easily
  done for most data structures)
  Writers must be able to mark dead objects in such a way that existing
  readers know to ignore their contents but can still traverse the data
  structure properly (usually straightforward)
  Readers must occasionally inform the system that they are not currently
  using any RCU-protected pointers (to allow resource reclamation)

 Have a look at http://lwn.net/Articles/573424/ and specifically the
 URCU overview section.  Basically, that last requirement - that
  readers inform the system that they are not currently using any
 RCU-protected pointers - turns out to require either memory barriers
 or signals.

 Well, there's the quiescent-state-based RCU - that's actually
 something that could reasonably be used inside postgres. Put something
 around socket reads (syscalls are synchronization points) and non-nested
 lwlock acquires. That'd mean it's nearly no additional overhead.

Sure, so, you reuse your existing barriers instead of adding new ones.
Making it work sounds like a lot of work for an uncertain benefit
though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-16 Thread Ryan Johnson

On 16/10/2014 7:19 AM, Robert Haas wrote:

On Thu, Oct 16, 2014 at 8:03 AM, Ryan Johnson
ryan.john...@cs.utoronto.ca wrote:

Why not use an RCU mechanism [1] and ditch the hazard pointers? Seems like
an ideal fit...

In brief, RCU has the following requirements:

Read-heavy access pattern
Writers must be able to make dead objects unreachable to new readers (easily
done for most data structures)
Writers must be able to mark dead objects in such a way that existing
readers know to ignore their contents but can still traverse the data
structure properly (usually straightforward)
Readers must occasionally inform the system that they are not currently
using any RCU-protected pointers (to allow resource reclamation)

Have a look at http://lwn.net/Articles/573424/ and specifically the
URCU overview section.  Basically, that last requirement - that
readers inform the system that they are not currently using any
RCU-protected pointers - turns out to require either memory barriers
or signals.
All of the many techniques that have been developed in this area are
merely minor variations on a very old theme: set some kind of flag
variable in shared memory to let people know that you are reading a
shared data structure, and clear it when you are done.  Then, other
people can figure out when it's safe to recycle memory that was
previously part of that data structure.
Sure, but RCU has the key benefit of decoupling its machinery (esp. that 
flag update) from the actual critical section(s) it protects. In a DBMS 
setting, for example, once per transaction or SQL statement would do 
just fine. The notification can be much better than a simple flag---you 
want to know whether the thread has ever quiesced since the last reclaim 
cycle began, not whether it is currently quiesced (which it usually 
isn't). In the implementation I use, a busy thread (e.g. not about to go 
idle) can chain its RCU transactions. In the common case, a chained 
quiesce call comes when the RCU epoch is not trying to change, and the 
flag update degenerates to a simple load. Further, the only time it's 
critical to have that memory barrier is if the quiescing thread is about 
to go idle. Otherwise, missing a flag just imposes a small delay on 
resource reclamation (and that's assuming the flag in question even 
belonged to a straggler process). How you implement epoch management, 
especially the handling of stragglers, is the deciding factor in whether 
RCU works well. The early URCU techniques were pretty terrible, and 
maybe general-purpose URCU is doomed to stay that way, but in a DBMS 
core it can be done very cleanly and efficiently because we can easily 
add the quiescent points at appropriate locations in the code.



  In Linux's RCU, the flag
variable is whether the process is currently scheduled on a CPU,
which is obviously not workable from user-space.  Lacking that, you
need an explicit flag variable, which means you need memory barriers,
since the protected operation is a load and the flag variable is
updated via a store.  You can try to avoid some of the overhead by
updating the flag variable less often (say, when a signal arrives) or
you can make it more fine-grained (in my case, we only prevent reclaim
of a fraction of the data structure at a time, rather than all of it)
or various other variants, but none of this is unfortunately so simple
as apply technique X and your problem just goes away.
Magic wand, no (does nothing for update contention, for example, and 
requires some care to apply). But from a practical perspective RCU, 
properly implemented, does make an awful lot of problems an awful lot 
simpler to tackle. Especially for the readers.


Ryan





Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-16 Thread Ryan Johnson

On 16/10/2014 7:59 AM, Robert Haas wrote:

On Thu, Oct 16, 2014 at 9:30 AM, Andres Freund and...@2ndquadrant.com wrote:

On 2014-10-16 09:19:16 -0400, Robert Haas wrote:

On Thu, Oct 16, 2014 at 8:03 AM, Ryan Johnson
ryan.john...@cs.utoronto.ca wrote:

Why not use an RCU mechanism [1] and ditch the hazard pointers? Seems like
an ideal fit...

In brief, RCU has the following requirements:

Read-heavy access pattern
Writers must be able to make dead objects unreachable to new readers (easily
done for most data structures)
Writers must be able to mark dead objects in such a way that existing
readers know to ignore their contents but can still traverse the data
structure properly (usually straightforward)
Readers must occasionally inform the system that they are not currently
using any RCU-protected pointers (to allow resource reclamation)

Have a look at http://lwn.net/Articles/573424/ and specifically the
URCU overview section.  Basically, that last requirement - that
readers inform the system tat they are not currently using any
RCU-protected pointers - turns out to require either memory barriers
or signals.

Well, there's the quiescent-state-based RCU - that's actually
something that could reasonably be used inside postgres. Put something
around socket reads (syscalls are synchronization points) and non-nested
lwlock acquires. That'd mean it's nearly no additional overhead.

Sure, so, you reuse your existing barriers instead of adding new ones.
Making it work sounds like a lot of work for an uncertain benefit
though.
Perhaps RCU is too much work and/or too little benefit to justify 
replacing the current latch-based code. I personally am not convinced, 
but have no hard data to offer other than to point at the rapid (even 
accelerating) uptake of RCU throughout  the Linux kernel (cf. Fig. 1 of 
http://www2.rdrop.com/users/paulmck/techreports/RCUUsage.2013.02.24a.pdf).


However---
If we are convinced the latch-based code needs to go and the question is 
whether to replace it with RCU or hazard pointers, then RCU wins 
hands-down IMO. It's comparable work to implement, easier to reason 
about (RCU read protocol is significantly simpler), and a clearer 
performance benefit (hazard pointers require more barriers, more atomic 
ops, more validating, and more pointer chasing than RCU). The only 
metric where RCU loses to hazard pointers is if you have really tight 
timing constraints on resource reclamation. Hazard pointers give a tight 
bound on the number of dead objects that cannot be reclaimed at any 
given moment, while RCU does not. I've not heard that this is a major 
concern, though, and in practice it doesn't seem to be a problem unless 
you forget to annotate an important quiescent point (like a blocking 
syscall).


Ryan







Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-16 Thread Ants Aasma
On Oct 15, 2014 7:32 PM, Ants Aasma a...@cybertec.at wrote:
 I'm imagining a bucketized cuckoo hash with 5 item buckets (5-way
 associativity). This allows us to fit the bucket onto 2 regular sized
 cache lines and have 8 bytes left over. Buckets would be protected by
 seqlocks stored in the extra space. On the read side we would only
 need 2 read barriers (basically free on x86), and we are guaranteed to
 have an answer by fetching two pairs of cache lines. We can trade
 memory bandwidth for latency by issuing prefetches (once we add
 primitives for this). Alternatively we can trade insert speed for
 lookup speed by using asymmetrically sized tables.

I was thinking about this. It came to me that we might not even need
BufferTag's to be present in the hash table. In the hash table itself
we store just a combination of a 4-byte buffer tag hash and a buffer id.
This way we can store 8 pairs in one cacheline. Lookup must account for
the fact that due to hash collisions we may find multiple matches, one of
which may be the buffer we are looking for.  We can make this condition
exceedingly unlikely by taking advantage of the fact that we have two
hashes anyway, in each table we use the lookup hash of the other table
for tagging. (32bit hash collision within 8 items). By having a
reasonably high load factor (75% should work) and 8 bytes per key we
even have a pretty good chance of fitting the hotter parts of the
buffer lookup table in CPU caches.

We use a separate array for the locks (spinlocks should be ok here)
for fine grained locking, this may be 1:1 with the buckets or we can
map many buckets to a single lock. Otherwise the scheme should work
mostly the same. Using this scheme we can get by on the lookup side
using 64 bit atomic reads with no barriers, we only need atomic ops
for pinning the found buffer.
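
A rough sketch of the entry layout that makes this work (field names are
made up; the write side with its spinlocks, and the re-check against the
buffer header before pinning, are omitted):

#include <stdatomic.h>
#include <stdint.h>

#define ENTRIES_PER_BUCKET 8	/* 8 x 8 bytes = one 64-byte cache line */

/*
 * One entry: a 4-byte buffer-tag hash (per the tagging trick above, the
 * hash used by the *other* cuckoo table) plus a 4-byte buffer id, packed
 * so the pair can be read with a single 64-bit load.
 */
typedef union
{
	struct
	{
		uint32_t	tag_hash;
		uint32_t	buf_id;
	}			f;
	uint64_t	word;
} BufMapEntry;

typedef struct
{
	_Atomic uint64_t slot[ENTRIES_PER_BUCKET];
} BufMapBucket;

/*
 * Lookup side: touches one cache line and needs no barriers on x86 beyond
 * the loads themselves.  Because 32-bit hashes can collide, the caller
 * must still verify that the candidate buffer really holds the wanted
 * page (the buffer_matches callback here) before pinning it.
 */
static int
bucket_probe(BufMapBucket *b, uint32_t tag_hash,
			 int (*buffer_matches)(uint32_t buf_id))
{
	for (int i = 0; i < ENTRIES_PER_BUCKET; i++)
	{
		BufMapEntry e;

		e.word = atomic_load_explicit(&b->slot[i], memory_order_relaxed);
		if (e.f.tag_hash == tag_hash && buffer_matches(e.f.buf_id))
			return (int) e.f.buf_id;
	}
	return -1;					/* not in this bucket; try the other table */
}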

I haven't had the chance to check out how second-chance hashing works
and if this scheme would be applicable to it.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-16 Thread Robert Haas
On Thu, Oct 16, 2014 at 12:26 PM, Ryan Johnson
ryan.john...@cs.utoronto.ca wrote:
 The only metric where RCU loses to
 hazard pointers is if you have really tight timing constraints on resource
 reclamation.

I think we do have that problem.  It's certainly completely
unacceptable for a query to prevent buffer reclaim on any significant
number of buffers even until the end of the query, let alone the end
of the transaction.

But, hey, if somebody wants to try writing a patch different than the
one that I wrote and see whether it works better than mine, I'm
totally cool with that.  This is something I came up with, and we're
here to evaluate whether it works better than any other option that we
have now or that someone wants to develop.  I'm not saying this is the
best solution; it's just something I've got that seems to basically
work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-16 Thread Robert Haas
On Wed, Oct 15, 2014 at 2:03 AM, Andres Freund and...@2ndquadrant.com wrote:
 Yes, I can see that now. I do wonder if that's not going to prove quite
 expensive... I think I can see some ways around needing that. But I
 think that needs some benchmarking first - no need to build a even more
 complex solution if not needed.

I did a bit of testing on my MacBook Pro last night.  Unfortunately, I
don't have access to a large x86 machine at the moment.[1]  I tried four
configurations:

(1) master
(2) master with chash patch
(3) master with chash patch, pg_memory_barrier() changed from lock
addl to mfence
(4) same as (3), plus barrier at the end of CHashSearch changed to a
compiler barrier (which should be safe on x86)

I tested pgbench -S, scale factor 100, shared_buffers 400MB.  3 trials
per configuration; runs were interleaved.  Percentages indicate the
average difference between that line and master.

At 1 client:

(1) 11689.388986 11426.653176 11297.775005
(2) 11618.306822 11331.805396 11273.272945 -0.55%
(3) 11813.664402 11300.082928 11231.265030 -0.20%
(4) 11428.097384 11266.979165 11294.422376 -1.23%

At 16 clients:

(1) 56716.465893 56992.590587 56755.298362
(2) 57126.787712 56848.527712 57351.559138 +0.51%
(3) 56779.624757 57036.610925 56878.036445 +0.13%
(4) 56818.522369 56885.544509 56977.810413 +0.13%

I think this tends to confirm that the patch is a small loss (for
unknown reasons) at 1 client, but that it picks up speed with more
contention, even on a machine with only 4 cores.  But if there's an
important difference between the different fencing techniques, it
doesn't show up in this test.
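
For reference, the three barrier flavours compared in configurations
(1)/(2), (3) and (4) look roughly like this on x86-64; this is a standalone
illustration of the instruction sequences involved, not the actual
pg_memory_barrier() definitions.

/* barriers_x86.c - standalone illustration (gcc/clang, x86-64 only). */

/* (1)/(2) default full barrier: a locked read-modify-write on the stack */
static inline void
barrier_lock_addl(void)
{
    __asm__ __volatile__("lock; addl $0,0(%%rsp)" : : : "memory", "cc");
}

/* (3) full barrier via the dedicated fence instruction */
static inline void
barrier_mfence(void)
{
    __asm__ __volatile__("mfence" : : : "memory");
}

/* (4) compiler barrier only: no instruction emitted, it just stops the
 * compiler from reordering.  x86 hardware already preserves load/load,
 * load/store and store/store ordering, so this covers everything except
 * store->load ordering. */
static inline void
barrier_compiler_only(void)
{
    __asm__ __volatile__("" : : : "memory");
}

int
main(void)
{
    barrier_lock_addl();
    barrier_mfence();
    barrier_compiler_only();
    return 0;
}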

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] Yes, this is a hint.




Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-16 Thread Andres Freund
Hi,

On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
 A few years ago I started working on a concurrent hash table for
 PostgreSQL.  The hash table part of it worked, but I never did
 anything with it, really.  Amit mentioned to me earlier this week that
 he was seeing contention inside the dynahash machinery, which inspired
 me to go back and update the patch.  I took the basic infrastructure
 from before and used it to replace the buffer table.  Patch is
 attached.

So. I ran a quick tests on a larger x86 machine. 4xE5-4620, 256GB RAM.

The hotel wifi is too flaky to get detailed results, but some tasty
bits.

The test I used is readonly pgbench on a scale 5000 DB - about 73GB. I'm
benchmarking chash on top of my LW_SHARED and StrategyGetBuffer()
optimizations because otherwise the bottleneck is completely elsewhere.

When using shared_buffers = 96GB there's a performance benefit, but not
huge:
master (f630b0dd5ea2de52972d456f5978a012436115e):   153621.8
master + LW_SHARED + lockless StrategyGetBuffer():  477118.4
master + LW_SHARED + lockless StrategyGetBuffer() + chash:  496788.6
master + LW_SHARED + lockless StrategyGetBuffer() + chash-nomb: 499562.7

But with shared_buffers = 16GB:
master (f630b0dd5ea2de52972d456f5978a012436115e):   177302.9
master + LW_SHARED + lockless StrategyGetBuffer():  206172.4
master + LW_SHARED + lockless StrategyGetBuffer() + chash:  413344.1
master + LW_SHARED + lockless StrategyGetBuffer() + chash-nomb: 426405.8

All the numbers here are -P5 output from when it looks like it has stabilized -
it takes a couple of minutes to get to that point, so pgbench runs would
have to be really long to not be skewed by the startup. I don't think my
manual selection of measurements matters much at this scale of
differences ;)

I had to play with setting max_connections+1 sometimes to get halfway
comparable results for master - unaligned data otherwise causes weird
results. Without doing that the performance gap between master
96/16G was even bigger. We really need to fix that...

This is pretty awesome.

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-16 Thread Robert Haas
On Thu, Oct 16, 2014 at 6:53 PM, Andres Freund and...@2ndquadrant.com wrote:
 When using shared_buffers = 96GB there's a performance benefit, but not
 huge:
 master (f630b0dd5ea2de52972d456f5978a012436115e):   153621.8
 master + LW_SHARED + lockless StrategyGetBuffer():  477118.4
 master + LW_SHARED + lockless StrategyGetBuffer() + chash:  496788.6
 master + LW_SHARED + lockless StrategyGetBuffer() + chash-nomb: 499562.7

 But with shared_buffers = 16GB:
 master (f630b0dd5ea2de52972d456f5978a012436115e):   177302.9
 master + LW_SHARED + lockless StrategyGetBuffer():  206172.4
 master + LW_SHARED + lockless StrategyGetBuffer() + chash:  413344.1
 master + LW_SHARED + lockless StrategyGetBuffer() + chash-nomb: 426405.8

Very interesting.  This doesn't show that chash is the right solution,
but it definitely shows that doing nothing is the wrong solution.  It
shows that, even with the recent bump to 128 lock manager partitions,
and LW_SHARED on top of that, workloads that actually update the
buffer mapping table still produce a lot of contention there.  This
hasn't been obvious to me from profiling, but the numbers above make
it pretty clear.

It also seems to suggest that trying to get rid of the memory barriers
isn't a very useful optimization project.  We might get a couple of
percent out of it, but it's pretty small potatoes, so unless it can be
done more easily than I suspect, it's probably not worth bothering
with.  An approach I think might have more promise is to have bufmgr.c
call the CHash stuff directly instead of going through buf_table.c.
Right now, for example, BufferAlloc() creates and initializes a
BufferTag and passes a pointer to that buffer tag to BufTableLookup,
which copies it into a BufferLookupEnt.  But it would be just as easy
for BufferAlloc() to put the BufferLookupEnt on its own stack, and
then you wouldn't need to copy the data an extra time.  Now a 20-byte
copy isn't a lot, but it's completely unnecessary and looks easy to
get rid of.

 I had to play with setting max_connections+1 sometimes to get halfway
 comparable results for master - unaligned data otherwise causes wierd
 results otherwise. Without doing that the performance gap between master
 96/16G was even bigger. We really need to fix that...

 This is pretty awesome.

Thanks.  I wasn't quite sure how to test this or where the workloads
that it would benefit would be found, so I appreciate you putting time
into it.  And I'm really glad to hear that it delivers good results.

I think it would be useful to plumb the chash statistics into the
stats collector or at least a debugging dump of some kind for testing.
They include a number of useful contention measures, and I'd be
interested to see what those look like on this workload.  (If we're
really desperate for every last ounce of performance, we could also
disable those statistics in production builds.  That's probably worth
testing at least once to see if it matters much, but I kind of hope it
doesn't.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-15 Thread Andres Freund
On 2014-10-14 17:53:10 -0400, Robert Haas wrote:
 On Tue, Oct 14, 2014 at 4:42 PM, Andres Freund and...@2ndquadrant.com wrote:
  The code in CHashSearch shows the problem there: you need to STORE the
  hazard pointer before you begin to do the LOAD operations to scan the
  bucket, and you must finish all of those LOADs before you STORE the
  NULL hazard pointer.  A read or write barrier won't provide store/load
  or load/store ordering.
 
  I'm not sure I understand the problem here - but a read + write barrier
  should suffice? To avoid falling back to two full barriers, we'd need a
  separate define pg_read_write_barrier(), but that's not a problem. IIUC
  that should allow us to avoid emitting any actual code on x86.
 
 Well, thinking about x86 specifically, it definitely needs at least
 one mfence, after setting the hazard pointer and before beginning to
 read the bucket chain.

Yes, I can see that now. I do wonder if that's not going to prove quite
expensive... I think I can see some ways around needing that. But I
think that needs some benchmarking first - no need to build an even more
complex solution if not needed.

The solution I'm thinking of is to essentially move away from hazard
pointers and store something like a generation counter per backend,
which is updated less often, and in combination with other
operations. When a backend needs to clean up and sees that there's a
straggler with an old generation, it sends that backend a signal to ensure
it sets the latest generation.
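
For illustration, a minimal standalone C11 sketch of what such a per-backend
generation scheme could look like. The names, the array-of-backends layout
and the signalling step are all assumptions; the only point is that readers
publish a generation at points that are already barriers, and the reclaimer
waits until every active backend has caught up.

/* generation_reclaim.c - illustrative sketch of the idea, not patch code. */
#include <stdatomic.h>
#include <stdint.h>
#include <stdbool.h>

#define MAX_BACKENDS 128

/* in shared memory: the global generation and each backend's last-seen one */
static _Atomic uint64_t global_gen;
static _Atomic uint64_t backend_gen[MAX_BACKENDS];   /* 0 = backend idle */

/* Called at existing full-barrier points (e.g. around lock acquisition):
 * publish that this backend has caught up with the current generation. */
static void
note_quiescent(int my_backend_id)
{
    uint64_t g = atomic_load_explicit(&global_gen, memory_order_acquire);

    atomic_store_explicit(&backend_gen[my_backend_id], g,
                          memory_order_release);
}

/* Reclaimer side: bump the generation, queue garbage under the new value,
 * and only free it once all_backends_past() reports true.  A real
 * implementation would signal stragglers instead of waiting indefinitely. */
static uint64_t
begin_reclaim(void)
{
    return atomic_fetch_add_explicit(&global_gen, 1,
                                     memory_order_acq_rel) + 1;
}

static bool
all_backends_past(uint64_t g)
{
    for (int i = 0; i < MAX_BACKENDS; i++)
    {
        uint64_t seen = atomic_load_explicit(&backend_gen[i],
                                             memory_order_acquire);

        if (seen != 0 && seen < g)
            return false;       /* straggler: would get a wakeup signal */
    }
    return true;
}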

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-15 Thread Ants Aasma
On Tue, Oct 14, 2014 at 6:19 PM, Robert Haas robertmh...@gmail.com wrote:
 With regard for using a hash table for the buffer mapping lock I'm
 doubtful that any form of separate chaining is the right one. We
 currently have a quite noticeable problem with the number of cache
 misses in the buffer mapping hash (and also the heavyweight lock mgr) -
 if we stay with hashes that's only going to be fundamentally lower than
 dynahash if we change the type of hashing. I've had good, *preliminary*,
 results using an open addressing + linear probing approach.

 I'm very skeptical of open addressing + linear probing.  Such hash
 tables tend to degrade over time, because you have to leave tombstones
 behind to ensure that future scans advance far enough.  There's really
 no way to recover without rebuilding the whole hash table, and
 eventually it degrades to linear search.  If we're spending too much
 time walking hash chains, I think the solution is to increase the
 number of buckets so that the chains get shorter.

What about cuckoo hashing? There was a recent paper on how to do fine
grained locking with cuckoo hashes. [1]

I'm imagining a bucketized cuckoo hash with 5 item buckets (5-way
associativity). This allows us to fit the bucket onto 2 regular sized
cache lines and have 8 bytes left over. Buckets would be protected by
seqlocks stored in the extra space. On the read side we would only
need 2 read barriers (basically free on x86), and we are guaranteed to
have an answer by fetching two pairs of cache lines. We can trade
memory bandwidth for latency by issuing prefetches (once we add
primitives for this). Alternatively we can trade insert speed for
lookup speed by using asymmetrically sized tables.

[1] https://www.cs.princeton.edu/~mfreed/docs/cuckoo-eurosys14.pdf
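
To sketch the read side of that: a standalone C11 illustration of a
seqlock-protected, 5-way, two-cache-line bucket. The layout and type names
are assumptions for illustration only; the two barriers mentioned above are
the acquire ordering around the two reads of the sequence counter.

/* cuckoo_read.c - illustrative sketch of the seqlock read protocol only.
 * A production version would also avoid the formal data race on the
 * non-atomic fields; here the retry loop papers over it.
 */
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

typedef struct BufTag { uint32_t v[5]; } BufTag;   /* stand-in, 20 bytes */

typedef struct Bucket
{
    _Atomic uint32_t seq;       /* even = stable, odd = writer in progress */
    uint32_t  pad;
    BufTag    tag[5];           /* 5 x 20 = 100 bytes */
    int32_t   buf_id[5];        /* 5 x 4  =  20 bytes; 128 bytes in total */
} Bucket;

/* Read one bucket consistently without taking a lock: writers bump seq to
 * odd, update, then bump it back to even; readers retry if seq changed. */
static bool
bucket_lookup(const Bucket *b, const BufTag *want, int32_t *buf_id)
{
    for (;;)
    {
        uint32_t start = atomic_load_explicit(&b->seq, memory_order_acquire);

        if (start & 1)
            continue;           /* writer active, retry */

        bool    found = false;
        int32_t id = -1;

        for (int i = 0; i < 5; i++)
        {
            if (memcmp(&b->tag[i], want, sizeof(BufTag)) == 0)
            {
                id = b->buf_id[i];
                found = true;
                break;
            }
        }

        /* order the data reads before re-reading the sequence counter */
        atomic_thread_fence(memory_order_acquire);
        if (atomic_load_explicit(&b->seq, memory_order_relaxed) == start)
        {
            *buf_id = id;
            return found;
        }
        /* a writer overlapped our read: retry */
    }
}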

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-15 Thread Ryan Johnson

On 15/10/2014 10:32 AM, Ants Aasma wrote:

On Tue, Oct 14, 2014 at 6:19 PM, Robert Haas robertmh...@gmail.com wrote:

With regard for using a hash table for the buffer mapping lock I'm
doubtful that any form of separate chaining is the right one. We
currently have a quite noticeable problem with the number of cache
misses in the buffer mapping hash (and also the heavyweight lock mgr) -
if we stay with hashes that's only going to be fundamentally lower than
dynahash if we change the type of hashing. I've had good, *preliminary*,
results using an open addressing + linear probing approach.

I'm very skeptical of open addressing + linear probing.  Such hash
tables tend to degrade over time, because you have to leave tombstones
behind to ensure that future scans advance far enough.  There's really
no way to recover without rebuilding the whole hash table, and
eventually it degrades to linear search.  If we're spending too much
time walking hash chains, I think the solution is to increase the
number of buckets so that the chains get shorter.

What about cuckoo hashing? There was a recent paper on how to do fine
grained locking with cuckoo hashes. [1]

I'm imagining a bucketized cuckoo hash with 5 item buckets (5-way
associativity). This allows us to fit the bucket onto 2 regular sized
cache lines and have 8 bytes left over. Buckets would be protected by
seqlocks stored in the extra space. On the read side we would only
need 2 read barriers (basically free on x86), and we are guaranteed to
have an answer by fetching two pairs of cache lines. We can trade
memory bandwidth for latency by issuing prefetches (once we add
primitives for this). Alternatively we can trade insert speed for
lookup speed by using asymmetrically sized tables.

[1] https://www.cs.princeton.edu/~mfreed/docs/cuckoo-eurosys14.pdf
Actually, I'd go with second-chance hashing [2], same number of hash 
functions but it's more stable (no infinite loops, for example). Most 
probably the techniques from [1] would apply equally well.


[2] 
http://www.eecs.harvard.edu/~michaelm/postscripts/infocom_hardware_submit.pdf


Ryan





Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-15 Thread Robert Haas
On Wed, Oct 15, 2014 at 2:03 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-10-14 17:53:10 -0400, Robert Haas wrote:
 On Tue, Oct 14, 2014 at 4:42 PM, Andres Freund and...@2ndquadrant.com 
 wrote:
  The code in CHashSearch shows the problem there: you need to STORE the
  hazard pointer before you begin to do the LOAD operations to scan the
  bucket, and you must finish all of those LOADs before you STORE the
  NULL hazard pointer.  A read or write barrier won't provide store/load
  or load/store ordering.
 
  I'm not sure I understand the problem here - but a read + write barrier
  should suffice? To avoid falling back to two full barriers, we'd need a
  separate define pg_read_write_barrier(), but that's not a problem. IIUC
  that should allow us to avoid emitting any actual code on x86.

 Well, thinking about x86 specifically, it definitely needs at least
 one mfence, after setting the hazard pointer and before beginning to
 read the bucket chain.

 Yes, I can see that now. I do wonder if that's not going to prove quite
 expensive... I think I can see some ways around needing that. But I
 think that needs some benchmarking first - no need to build a even more
 complex solution if not needed.

 The solution I'm thinking of is to essentially move away from hazard
 pointers and store something like a generation counter per
 backend. Which is updated less often, and in combination with other
 operations. When a backend need to clean up and sees that there's a
 straggler with a old generation it sends that backend a signal to ensure
 it sets the latest generation.

It's possible that might work ... but on the timescale we're talking
about here, that's asking the garbage collecting process to wait for
practically geologic time.

Back when I first wrote this code, I spent a fair amount of time
looking at existing work in the area of lock-free hash tables.
Essentially all of the work I was able to find on this topic assumes a
threaded model - or more precisely, it assumes that shared memory
allocation is cheap and easy and you'll have no problem getting as
much of it as you need whenever you need it.  This assumption often
isn't even spelled out explicitly: it's just assumed that that's the
programming environment you're working in.  Finding highly parallel
algorithms that don't rely on memory allocation as a primitive is
hard.  Hazard pointers are one of the few techniques I found that
seems like it can work in our architecture.  I'm quite reluctant to
leave aside techniques that have been proven to work well in favor of
inventing something entirely novel to PostgreSQL.

That having been said, there is some literature on generation numbers,
and I think something like that could be made to work.  It does have
some significant disadvantages, though.  One is that a single process
which fails to update its generation number prevents all reclamation,
system-wide.  In my algorithm, a process whose execution is
suspended while a hazard pointer is set prevents recycling of only one
of many garbage lists.  A process searching for a reusable element can
most likely find some other garbage list to reclaim instead.  Also,
a generation number implies that we update the value periodically,
rather than before and after each critical section.  Anything that
forces garbage collection to be postponed longer than absolutely
necessary seems to me likely to be a loser.  It might be a little
faster as long as we have free elements to allocate, but as soon as
we're out and have to wait longer than we otherwise would for garbage
collection, and all system activity halts as a result, even for a few
milliseconds, it's going to be a whole lot slower.  Or at least, I
think.

That having been said, I don't know what to do about the fact that the
fence is too expensive.  I don't know that we've really established
that that's the true root of the problem rather than some other
pedestrian optimization failure.  But the existing code is using an
atomic operation to acquire a spinlock, then releasing it, walking the
bucket chain, and then using another atomic operation to acquire a
spinlock and then releasing it again.  Surely a pure fence shouldn't
cost more than a spinlock cycle?  Even with arguably one extra cache
line touch, that seems like it ought to be a win.  But my intuitions
in this area are shaky.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-15 Thread Andres Freund
On 2014-10-15 13:41:25 -0400, Robert Haas wrote:
 On Wed, Oct 15, 2014 at 2:03 AM, Andres Freund and...@2ndquadrant.com wrote:
  The solution I'm thinking of is to essentially move away from hazard
  pointers and store something like a generation counter per
  backend. Which is updated less often, and in combination with other
  operations. When a backend need to clean up and sees that there's a
  straggler with a old generation it sends that backend a signal to ensure
  it sets the latest generation.
 
 It's possible that might work ... but on the timescale we're talking
 about here, that's asking the garbage collecting process to wait for
 practically geologic time.

I think it depends on what we're tying the generation increase to. We
very well could add an implicit generation increment to, say, lwlock
acquisition - which is a full barrier anyway. And I don't think we'll
normally have that high a rate of garbage collection anyway - there
should be plenty of headroom.

 Back when I first wrote this code, I spent a fair amount of time
 looking at existing work in the area of lock-free hash tables.
 Essentially all of the work I was able to find on this topic assumes a
 threaded model - or more precisely, it assumes that shared memory
 allocation is cheap and easy and you'll have no problem getting as
 much of it as you need whenever you need it.

FWIW, I think many of the solutions that are actually used in practice
don't rely on that heavily. I know at least some that require memory to
be reserved beforehand, in special contexts.

 This assumption often
 isn't even spelled out explicitly: it's just assumed that that's the
 programming environment you're working in.  Finding highly parallel
 algorithms that don't rely on memory allocation as a primitive is
 hard.  Hazard pointers are one of the few techniques I found that
 seems like it can work in our architecture.  I'm quite reluctant to
 leave aside techniques that have been proven to work well in favor of
 inventing something entirely novel to PostgreSQL.

I don't think something like generation numbers is really that new and
unproven - it's essentially a more trivial version of RCU, which is used
quite heavily, and under different names.

That said, I'm far from convinced that they are the solution - they just
were the first thing that came to my mind thinking about the problem.

 That having been said, there is some literature on generation numbers,
 and I think something like that could be made to work.  It does have
 some significant disadvantages, though.  One is that a single process
 which fails to update its generation number prevents all reclamation,
 system-wide.In my algorithm, a process whose execution is
 suspended while a hazard pointer is set prevents recycling of only one
 of many garbage lists.  A process searching for a reusable element can
 mostly likely find some other garbage list to reclaim instead.

Hm. Couldn't a similar scheme with several lists be used with generation
numbers?

 Also, a generation number implies that we update the value
 periodically, rather than before and after each critical section.

Hm. Don't think it necessarily has to do that.

 Anything that forces garbage collection to be postponed longer than
 absolutely necessary seems to me likely to be a loser.  It might be a
 little faster as long as we have free elements to allocate, but as
 soon as we're out and have to wait longer than we otherwise would for
 garbage collection, and all system activity halts as a result, even
 for a few milliseconds, it's going to be a whole lot slower.  Or at
 least, I think.

I think it really depends on the user of the facility. If the generation
were e.g. also tied to lwlocks I couldn't see bufmgr.c outrunning that
realistically.

 That having been said, I don't know what to do about the fact that the
 fence is too expensive.

I'm far from sure that it's the problem at hand here - the reason I'm
wondering about the barriers is primarily that the buffer mapping hash
table is one of the top profile entries in all concurrent
workloads - often even the top one, after removing some other locking
bottleneck. Nearly all of the time is spent there due to cache
misses. While I think we can get the table smaller and more efficient
for the same NBuffers value, realistically we need to cope with much
bigger NBuffers values.

Since cache misses are a problem that we're going to have to deal with,
restricting the processor's tools for efficiently dealing with them (out
of order execution, SMT) doesn't seem like a wise choice.

 I don't know that we've really established
 that that's the true root of the problem rather than some other
 pedestrian optimization failure.

Absolutely.

 But the existing code is using an
 atomic operation to acquire a spinlock, then releasing it, walking the
 bucket chain, and then using another atomic operation to acquire a
 spinlock and then releasing it again.

Well, we don't do that for lookups right now. 

[HACKERS] WIP: dynahash replacement for buffer table

2014-10-14 Thread Robert Haas
A few years ago I started working on a concurrent hash table for
PostgreSQL.  The hash table part of it worked, but I never did
anything with it, really.  Amit mentioned to me earlier this week that
he was seeing contention inside the dynahash machinery, which inspired
me to go back and update the patch.  I took the basic infrastructure
from before and used it to replace the buffer table.  Patch is
attached.

The key idea here is that lookups are done without any locks, only
memory barriers; and inserts and deletes are done using atomic ops.
The algorithm is not strictly lock-free for reasons explained in the
comments in chash.c, but it's a lot less locky than what we have now,
so in theory you might think that would be a good thing.

I haven't had time to do much performance testing yet, but it looks
like this may be slower at low client counts and faster at high client
counts.  However, my results weren't really reproducible, and I haven't
done comprehensive testing yet.  What was really bizarre is that I
couldn't really pin down the cause of the slowness at low client
counts; a quick perf profile showed overhead concentrated in
CHashBucketScan, basically memory access latency for walking the
bucket chain.  But the table is built to have a load factor of 1, so I
can't see why that should amount to much, or why it should be
significantly worse than for dynahash.

This patch contains assorted leftovers and is grotty in various ways,
but I'm sharing it anyway just to get it out there.

git branch also available at:
http://git.postgresql.org/gitweb/?p=users/rhaas/postgres.git;a=shortlog;h=refs/heads/chash2014

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/contrib/Makefile b/contrib/Makefile
index b37d0dd..0b91ac1 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -20,6 +20,7 @@ SUBDIRS = \
 		earthdistance	\
 		file_fdw	\
 		fuzzystrmatch	\
+		hashtest	\
 		hstore		\
 		intagg		\
 		intarray	\
diff --git a/contrib/hashtest/Makefile b/contrib/hashtest/Makefile
new file mode 100644
index 000..3ee42f8
--- /dev/null
+++ b/contrib/hashtest/Makefile
@@ -0,0 +1,18 @@
+# contrib/hashtest/Makefile
+
+MODULE_big = hashtest
+OBJS = hashtest.o
+
+EXTENSION = hashtest
+DATA = hashtest--1.0.sql
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/hashtest
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/hashtest/hashtest--1.0.sql b/contrib/hashtest/hashtest--1.0.sql
new file mode 100644
index 000..e271baf
--- /dev/null
+++ b/contrib/hashtest/hashtest--1.0.sql
@@ -0,0 +1,52 @@
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION hashtest" to load this file. \quit
+
+CREATE FUNCTION chash_insert_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'chash_insert_test'
+LANGUAGE C;
+
+CREATE FUNCTION chash_search_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'chash_search_test'
+LANGUAGE C;
+
+CREATE FUNCTION chash_delete_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'chash_delete_test'
+LANGUAGE C;
+
+CREATE FUNCTION chash_concurrent_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'chash_concurrent_test'
+LANGUAGE C;
+
+CREATE FUNCTION chash_collision_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'chash_collision_test'
+LANGUAGE C;
+
+CREATE FUNCTION dynahash_insert_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'dynahash_insert_test'
+LANGUAGE C;
+
+CREATE FUNCTION dynahash_search_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'dynahash_search_test'
+LANGUAGE C;
+
+CREATE FUNCTION dynahash_delete_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'dynahash_delete_test'
+LANGUAGE C;
+
+CREATE FUNCTION dynahash_concurrent_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'dynahash_concurrent_test'
+LANGUAGE C;
+
+CREATE FUNCTION dynahash_collision_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'dynahash_collision_test'
+LANGUAGE C;
diff --git a/contrib/hashtest/hashtest.c b/contrib/hashtest/hashtest.c
new file mode 100644
index 000..172a5bb
--- /dev/null
+++ b/contrib/hashtest/hashtest.c
@@ -0,0 +1,527 @@
+/*-
+ * hashtest.c
+ *-
+ */
+
+#include "postgres.h"
+
+#include "funcapi.h"
+#include "libpq/auth.h"
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "portability/instr_time.h"
+#include "storage/ipc.h"
+#include "utils/chash.h"
+
+PG_MODULE_MAGIC;
+
+void		_PG_init(void);
+Datum		chash_insert_test(PG_FUNCTION_ARGS);
+Datum		chash_search_test(PG_FUNCTION_ARGS);
+Datum		chash_delete_test(PG_FUNCTION_ARGS);
+Datum		chash_concurrent_test(PG_FUNCTION_ARGS);
+Datum		chash_collision_test(PG_FUNCTION_ARGS);
+Datum		dynahash_insert_test(PG_FUNCTION_ARGS);
+Datum		dynahash_search_test(PG_FUNCTION_ARGS);
+Datum		dynahash_delete_test(PG_FUNCTION_ARGS);

Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-14 Thread Andres Freund
On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
 A few years ago I started working on a concurrent hash table for
 PostgreSQL.  The hash table part of it worked, but I never did
 anything with it, really.  Amit mentioned to me earlier this week that
 he was seeing contention inside the dynahash machinery, which inspired
 me to go back and update the patch.

Interestingly I've benchmarked similar loads, even on the same machine
as Amit, and I do see tremendous time spent in dynahash.c. It's nearly
all cache misses in my tests though.

 I took the basic infrastructure
 from before and used it to replace the buffer table.  Patch is
 attached.

That's pretty cool. The algorithm is complex enough that I haven't fully
understood it yet, but it sounds sane at first glance.

 The key idea here is that lookups are done without any locks, only
 memory barriers; and inserts and deletes are done using atomic ops.

Hm. I quickly looked and I see that you use two full barriers for every
lookup. That's pretty expensive. I think this should be doable using
only read/write barriers on the lookup side? Even on architectures where
read/write memory barriers have to be something more than a compiler barrier,
they're still usually a magnitude or so cheaper than full barriers.

 The algorithm is not strictly lock-free for reasons explained in the
 comments in chash.c, but it's a lot less locky than what we have now,
 so in theory you might think that would be a good thing.

I don't see much reason to strive for full lock-freeness. If the locks
aren't noticeable in real world scenarios - who cares?

 I haven't had time to do much performance testing yet, but it looks
 like this may be slower at low client counts and faster at high client
 counts.  However, my results weren't real reproducible, and I haven't
 done comprehensive testing yet.  What was really bizarre is that I
 couldn't really pin down the cause of the slowness at low client
 counts; a quick perf profile showed overhead concentrated in
 CHashBucketScan, basically memory access latency for walking the
 bucket chain.  But the table is built to have a load factor of 1, so I
 can't see why that should amount to much, or why it should be
 significantly worse than for dynahash.

With regard to using a hash table for the buffer mapping lock I'm
doubtful that any form of separate chaining is the right one. We
currently have a quite noticeable problem with the number of cache
misses in the buffer mapping hash (and also the heavyweight lock mgr) -
if we stay with hashes that's only going to be fundamentally lower than
dynahash if we change the type of hashing. I've had good, *preliminary*,
results using an open addressing + linear probing approach.

My guess is that the additional indirection via the arena explains the
difference in cache misses? But I might be completely off here.

It'd be quite interesting to see a perf stat -ddd of builds employing
dynahash.c vs. chash for some comparable load.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-14 Thread Andres Freund
On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
 I took the basic infrastructure from before and used it to replace the
 buffer table.  Patch is attached.

Hm. Unless I missed something you pretty much completely eradicated
locks from the buffer mapping code? That'd be pretty cool, but it's also
scary.

How confident are you that your conversion is correct? Not in the sense
that there aren't any buglets, but fundamentally.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-14 Thread Amit Kapila
On Tue, Oct 14, 2014 at 7:55 PM, Andres Freund and...@2ndquadrant.com
wrote:

 On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
  A few years ago I started working on a concurrent hash table for
  PostgreSQL.  The hash table part of it worked, but I never did
  anything with it, really.  Amit mentioned to me earlier this week that
  he was seeing contention inside the dynahash machinery, which inspired
  me to go back and update the patch.

 Interestingly I've benchmarked similar loads, even on the same machine
 as Amit,

There is one catch here: for these profiles I am using a Power-8 machine
and the load is slightly higher (5000 scale factor).



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-14 Thread Andres Freund
On 2014-10-14 20:30:45 +0530, Amit Kapila wrote:
 On Tue, Oct 14, 2014 at 7:55 PM, Andres Freund and...@2ndquadrant.com
 wrote:
 
  On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
   A few years ago I started working on a concurrent hash table for
   PostgreSQL.  The hash table part of it worked, but I never did
   anything with it, really.  Amit mentioned to me earlier this week that
   he was seeing contention inside the dynahash machinery, which inspired
   me to go back and update the patch.
 
  Interestingly I've benchmarked similar loads, even on the same machine
  as Amit,
 
 There is one catch here, for these profiles I am using Power-8 m/c
 and the load is slightly higher (5000 scale factor).

Ah, right. I don't think the scale factor changes much, but the
different architecture certainly does. As I said elsewhere, I would not
believe these profiles much until they're actually done with optimized
code...

I also think we need to be a bit careful about optimizing too much for
stock pgbench with working set > s_b. The uniform readonly access
pattern you're profiling here isn't something super realistic. Still
valuable, but we should take it with a grain of salt.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-14 Thread Robert Haas
On Tue, Oct 14, 2014 at 10:47 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
 I took the basic infrastructure from before and used it to replace the
 buffer table.  Patch is attached.

 Hm. Unless I missed something you pretty much completely eradicated
 locks from the buffer mapping code? That'd be pretty cool, but it's also
 scary.

 How confident are you that your conversion is correct? Not in the sense
 that there aren't any buglets, but fundamentally.

It doesn't look particularly dangerous to me.  Famous last words.

Basically, I think what we're doing right now is holding the buffer
mapping lock so that the buffer can't be renamed under us while we're
pinning it.  If we don't do that, I think the only consequence is
that, by the time we get the buffer pin, the buffer might no longer be
the one we looked up.  So we need to recheck.  But assuming we do
that, I don't see an issue.  In fact, it occurred to me while I was
cobbling this together that we might want to experiment with that
change independently of chash.  We already know (from the
StrategyGetBuffer locking changes) that holding system-wide locks to
prevent people from twaddling the state of individual buffers can be a
loser.  This case isn't nearly as bad, because we're only pinning one
buffer, rather than potentially all of them multiple times, and the
lock we're holding only affects 1/128th of the buffers, not all of
them.
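
In outline, the lookup-then-pin-then-recheck flow being described looks like
this; the function names are stand-ins for illustration, not the actual
bufmgr.c or chash interfaces.

/* pin_recheck.c - illustrative control flow only; names are stand-ins. */
#include <stdbool.h>
#include <string.h>
#include <stdint.h>

typedef struct BufTag { uint32_t v[5]; } BufTag;            /* stand-in tag */

/* stand-ins for the real operations */
extern bool lookup_buffer(const BufTag *tag, int *buf_id);  /* lock-free */
extern bool pin_buffer(int buf_id);
extern void unpin_buffer(int buf_id);
extern void read_buffer_tag(int buf_id, BufTag *tag);

/*
 * Without a mapping lock held across lookup+pin, the buffer found by the
 * lookup may have been evicted and renamed before we manage to pin it.
 * So: pin first, then re-check that the tag still matches, else retry.
 */
static int
find_and_pin(const BufTag *want)
{
    for (;;)
    {
        int buf_id;

        if (!lookup_buffer(want, &buf_id))
            return -1;                      /* not in the buffer table */

        if (!pin_buffer(buf_id))
            continue;                       /* buffer went away, retry */

        BufTag now;

        read_buffer_tag(buf_id, &now);
        if (memcmp(&now, want, sizeof(BufTag)) == 0)
            return buf_id;                  /* still the buffer we wanted */

        /* renamed between lookup and pin: drop the pin and retry */
        unpin_buffer(buf_id);
    }
}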

The other application I had in mind for chash was SLRU stuff.  I
haven't tried to write the code yet, but fundamentally, the same
considerations apply there.  You do the lookup locklessly to find out
which buffer is thought to contain the SLRU page you want, but by the
time you lock the page the answer might be out of date.  Assuming that
this doesn't happen many times in a row and that lookups are
relatively cheap, you could get rid of having any centralized lock for
SLRU, and just have per-page locks.

I'm not quite sure what fundamental dangers you're thinking about
here, but from what I understand from reading the literature, doing
lookups in a lock-free hash table needs to be thought about in a sort
of relativistic way.  The different CPUs have slightly different views
of the world, so any answer you get may be obsolete by the time you
get it, and may according to some other CPU's view of the world have
been obsolete even at the time that it was copied.  That's OK, because
it's just equivalent to some other CPU doing its hash table update
slightly sooner or later than it actually did it, within the bounds
imposed by memory barriers, which it could have done anyway.  There is
no one source of truth.  The result of all this is that some of the
synchronization responsibility is transferred to the caller.  That's
obviously a bit more complex to reason about, but it hasn't stopped
lock-free hash tables from being wildly popular data structures.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-14 Thread Amit Kapila
On Tue, Oct 14, 2014 at 8:38 PM, Andres Freund and...@2ndquadrant.com
wrote:

 On 2014-10-14 20:30:45 +0530, Amit Kapila wrote:
  On Tue, Oct 14, 2014 at 7:55 PM, Andres Freund and...@2ndquadrant.com
  wrote:
  
   On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
A few years ago I started working on a concurrent hash table for
PostgreSQL.  The hash table part of it worked, but I never did
anything with it, really.  Amit mentioned to me earlier this week
that
he was seeing contention inside the dynahash machinery, which
inspired
me to go back and update the patch.
  
   Interestingly I've benchmarked similar loads, even on the same machine
   as Amit,
 
  There is one catch here, for these profiles I am using Power-8 m/c
  and the load is slightly higher (5000 scale factor).

 Ah, right. I don't think the scale factor changes much, but the
 different architecture certainly does. As I said elsewhere, I would not
 believe these profiles much until they're actually done with optimized
 code...

Today that machine is not accessible, so it will take a day or so to get the
optimized profiles; I will post them once I am able to take them.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-14 Thread Robert Haas
On Tue, Oct 14, 2014 at 10:25 AM, Andres Freund and...@2ndquadrant.com wrote:
 The key idea here is that lookups are done without any locks, only
 memory barriers; and inserts and deletes are done using atomic ops.

 Hm. I quickly looked and I see that you use two full barriers for every
 lookup. That's pretty expensive. I think this should be doable using
 only read/write barriers on the lookup side? Even on architectures where
 read/write memory barriers have to be something but a compiler barrier,
 they're still usually a magnitude or so cheaper than full barriers.

The code in CHashSearch shows the problem there: you need to STORE the
hazard pointer before you begin to do the LOAD operations to scan the
bucket, and you must finish all of those LOADs before you STORE the
NULL hazard pointer.  A read or write barrier won't provide store/load
or load/store ordering.
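
A standalone C11 sketch of that constraint, with the two full barriers on
the lookup side made explicit. Names are illustrative, not the patch's code,
and a complete hazard-pointer protocol would also re-validate the protected
pointer after publishing it; only the barrier placement is the point here.

/* hazard_read.c - illustrative sketch of the ordering problem only. */
#include <stdatomic.h>
#include <stddef.h>
#include <stdbool.h>

typedef struct Node
{
    int                     key;
    int                     value;
    _Atomic(struct Node *)  next;
} Node;

/* one hazard-pointer slot per backend; reduced to a single slot here */
static _Atomic(Node *) hazard_slot;

static bool
lookup(_Atomic(Node *) *bucket_head, int key, int *value)
{
    Node   *head = atomic_load_explicit(bucket_head, memory_order_relaxed);
    bool    found = false;

    /* STORE the hazard pointer so a concurrent reclaimer can see us ... */
    atomic_store_explicit(&hazard_slot, head, memory_order_relaxed);

    /* ... and force that store to become visible before any of the LOADs
     * that walk the bucket: store->load ordering, which only a full
     * barrier provides (mfence or a lock-prefixed instruction on x86). */
    atomic_thread_fence(memory_order_seq_cst);

    for (Node *n = head; n != NULL;
         n = atomic_load_explicit(&n->next, memory_order_acquire))
    {
        if (n->key == key)
        {
            *value = n->value;
            found = true;
            break;
        }
    }

    /* all LOADs of the chain must finish before the hazard pointer is
     * cleared: load->store ordering.  x86 never reorders a store ahead of
     * an earlier load, so a compiler barrier would do there, but weaker
     * architectures need a real barrier. */
    atomic_thread_fence(memory_order_seq_cst);
    atomic_store_explicit(&hazard_slot, NULL, memory_order_relaxed);

    return found;
}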

 I don't see much reason to strive for full lock-free ness. If the locks
 aren't noticeable in real world scenarios - who cares?

That's my feeling too.  ISTR that when I stress-tested the hash table
back in 2012 when I started this code, the concurrency was far better
than dynahash with 16 lwlock partitions.  I don't remember off hand
exactly how I tested that, but it was with the code in hashtest.c.  I
also seem to recall that it was possible to make the freelists get
VERY hot, but of course that was a workload where hash table updates
were the only thing happening, so I assume that on a workload where
we're actually doing work (like, copying the data in and out of kernel
buffers) that's not going to be a real problem unless you have
thousands of cores.  But we'll see.

 With regard for using a hash table for the buffer mapping lock I'm
 doubtful that any form of separate chaining is the right one. We
 currently have a quite noticeable problem with the number of cache
 misses in the buffer mapping hash (and also the heavyweight lock mgr) -
 if we stay with hashes that's only going to be fundamentally lower than
 dynahash if we change the type of hashing. I've had good, *preliminary*,
 results using an open addressing + linear probing approach.

I'm very skeptical of open addressing + linear probing.  Such hash
tables tend to degrade over time, because you have to leave tombstones
behind to ensure that future scans advance far enough.  There's really
no way to recover without rebuilding the whole hash table, and
eventually it degrades to linear search.  If we're spending too much
time walking hash chains, I think the solution is to increase the
number of buckets so that the chains get shorter.
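
As a concrete illustration of the tombstone problem described above, a small
standalone sketch of deletion and lookup under linear probing; purely
illustrative, not anyone's proposed implementation.

/* linear_probe.c - illustrative sketch of tombstones in linear probing. */
#include <stdint.h>
#include <stdbool.h>

#define TABLE_SIZE 1024                 /* power of two, for the mask */
#define EMPTY      0
#define DELETED    1                    /* the "tombstone" */
#define OCCUPIED   2

typedef struct Slot
{
    uint8_t  state;
    uint32_t key;
    int32_t  value;
} Slot;

static Slot table[TABLE_SIZE];

static bool
probe_delete(uint32_t hash, uint32_t key)
{
    for (uint32_t i = 0; i < TABLE_SIZE; i++)
    {
        Slot *s = &table[(hash + i) & (TABLE_SIZE - 1)];

        if (s->state == EMPTY)
            return false;               /* key not present */
        if (s->state == OCCUPIED && s->key == key)
        {
            /* cannot simply mark EMPTY: later keys that probed past this
             * slot would become unreachable, so leave a tombstone instead */
            s->state = DELETED;
            return true;
        }
        /* DELETED and non-matching OCCUPIED slots: keep probing */
    }
    return false;
}

static bool
probe_lookup(uint32_t hash, uint32_t key, int32_t *value)
{
    for (uint32_t i = 0; i < TABLE_SIZE; i++)
    {
        Slot *s = &table[(hash + i) & (TABLE_SIZE - 1)];

        if (s->state == EMPTY)
            return false;               /* a genuine miss stops here ... */
        if (s->state == OCCUPIED && s->key == key)
        {
            *value = s->value;
            return true;
        }
        /* ... but every tombstone forces the scan to continue, which is
         * how heavily-churned tables degrade toward linear search */
    }
    return false;
}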

 My guess is that the additional indirection via the arena explains the
 difference in cache misses? But I might be completely off here.

The arena makes the calculation of the pointer address less
predictable, which might make things more difficult for the CPU
pipeline.  But aside from that, I think it's the same basic idea: you
bounce from some sort of bucket header to the first element of the
chain and then, hopefully, you're done.  Am I missing something?

 It'd be quite interesting to see a perf stat -ddd of dynahash.c/chash
 employing builds for some comparable load.

I'm hoping you can test this out on x86.  All I have to work with are
the POWER machines, and the characteristics of those are somewhat
different.  I can test it there, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-14 Thread Andres Freund
On 2014-10-14 11:08:08 -0400, Robert Haas wrote:
 On Tue, Oct 14, 2014 at 10:47 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
  On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
  I took the basic infrastructure from before and used it to replace the
  buffer table.  Patch is attached.
 
  Hm. Unless I missed something you pretty much completely eradicated
  locks from the buffer mapping code? That'd be pretty cool, but it's also
  scary.
 
  How confident are you that your conversion is correct? Not in the sense
  that there aren't any buglets, but fundamentally.
 
 It doesn't look particularly dangerous to me.  Famous last words.

 Basically, I think what we're doing right now is holding the buffer
 mapping lock so that the buffer can't be renamed under us while we're
 pinning it.

What I'm afraid of is that there are hidden assumptions about the
consistency provided by the mapping locks.

 If we don't do that, I think the only consequence is
 that, by the time we get the buffer pin, the buffer might no longer be
 the one we looked up.  So we need to recheck.  But assuming we do
 that, I don't see an issue.  In fact, it occurred to me while I was
 cobbling this together that we might want to experiment with that
 change independently of chash.  We already know (from the
 StrategyGetBuffer locking changes) that holding system-wide locks to
 prevent people from twaddling the state of individual buffers can be a
 loser.  This case isn't nearly as bad, because we're only pinning one
 buffer, rather than potentially all of them multiple times, and the
 lock we're holding only affects 1/128th of the buffers, not all of
 them.

Also IIRC at least Linux has targeted wakeup/time transfer logic when
using semaphores - and doesn't for spinlocks. So if you don't happen
to sleep while the lwlock's spinlock is held, the result will be
better. Except that we'll frequently sleep within that section for very
frequently taken locks...

 The other application I had in mind for chash was SLRU stuff.  I
 haven't tried to write the code yet, but fundamentally, the same
 considerations apply there.  You do the lookup locklessly to find out
 which buffer is thought to contain the SLRU page you want, but by the
 time you lock the page the answer might be out of date.  Assuming that
 this doesn't happen many times in a row and that lookups are
 relatively cheap, you could get rid of having any centralized lock for
 SLRU, and just have per-page locks.

Hm. I have to admit I haven't really looked enough into the slru code to
judge this. My impression so far wasn't that the locking itself was the
problem in most scenarios - that's not what you've seen?

 I'm not quite sure what fundamental dangers you're thinking about
 here

Oh, only in the context of the bufmgr.c conversion. Not more
generally. I agree that a lockfree hashtable is something quite
worthwhile to have.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-14 Thread Robert Haas
On Tue, Oct 14, 2014 at 11:31 AM, Andres Freund and...@2ndquadrant.com wrote:
 It doesn't look particularly dangerous to me.  Famous last words.

 Basically, I think what we're doing right now is holding the buffer
 mapping lock so that the buffer can't be renamed under us while we're
 pinning it.

 What I'm afraid of is that there's hidden assumptions about the
 consistency provided by the mapping locks.

That's certainly worth checking for, but I think the only code that
needs to be checked is the code that would formerly have run while
holding said lock.  And there isn't that much of that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-14 Thread Andres Freund
On 2014-10-14 11:19:16 -0400, Robert Haas wrote:
 On Tue, Oct 14, 2014 at 10:25 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
  The key idea here is that lookups are done without any locks, only
  memory barriers; and inserts and deletes are done using atomic ops.
 
  Hm. I quickly looked and I see that you use two full barriers for every
  lookup. That's pretty expensive. I think this should be doable using
  only read/write barriers on the lookup side? Even on architectures where
  read/write memory barriers have to be something but a compiler barrier,
  they're still usually a magnitude or so cheaper than full barriers.
 
 The code in CHashSearch shows the problem there: you need to STORE the
 hazard pointer before you begin to do the LOAD operations to scan the
 bucket, and you must finish all of those LOADs before you STORE the
 NULL hazard pointer.  A read or write barrier won't provide store/load
 or load/store ordering.

I'm not sure I understand the problem here - but a read + write barrier
should suffice? To avoid falling back to two full barriers, we'd need a
separate define pg_read_write_barrier(), but that's not a problem. IIUC
that should allow us to avoid emitting any actual code on x86.

  With regard for using a hash table for the buffer mapping lock I'm
  doubtful that any form of separate chaining is the right one. We
  currently have a quite noticeable problem with the number of cache
  misses in the buffer mapping hash (and also the heavyweight lock mgr) -
  if we stay with hashes that's only going to be fundamentally lower than
  dynahash if we change the type of hashing. I've had good, *preliminary*,
  results using an open addressing + linear probing approach.
 
 I'm very skeptical of open addressing + linear probing.  Such hash
 tables tend to degrade over time, because you have to leave tombstones
 behind to ensure that future scans advance far enough.

You can try to be a bit smarter than a plain open addressing
approach. But yes, it has disadvantages.

 There's really
 no way to recover without rebuilding the whole hash table, and
 eventually it degrades to linear search.  If we're spending too much
 time walking hash chains, I think the solution is to increase the
 number of buckets so that the chains get shorter.

Might be worthwhile to try. I know that both my hand-rolled open
addressing and my hand-rolled chaining approach have significantly fewer
cache misses than dynahash for the same amount of work.

Hm. Possibly that's at least partially because of the segment stuff in
dynahash?

Anyway, you can get the size of the entire hashtable down quite a
bit. Primarily because there's no need to store a next pointer. There's
also not really any need for the 'hashvalue' in the bufmgr case
imo.
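
Roughly, the per-entry saving being described (layouts simplified; the real
dynahash element header and the 20-byte buffer tag are only approximated
here):

/* entry_sizes.c - rough illustration only; layouts are simplified. */
#include <stdio.h>
#include <stdint.h>

typedef struct BufTag { uint32_t v[5]; } BufTag;       /* 20 bytes */

/* dynahash-style element: chain pointer + cached hash value + key + payload */
typedef struct ChainedEnt
{
    struct ChainedEnt *next;        /* 8 bytes */
    uint32_t           hashvalue;   /* 4 bytes */
    BufTag             key;         /* 20 bytes */
    int32_t            id;          /* 4 bytes, padded to pointer alignment */
} ChainedEnt;

/* open-addressing style element: just key + payload, no next/hashvalue */
typedef struct OpenEnt
{
    BufTag  key;                    /* 20 bytes */
    int32_t id;                     /* 4 bytes */
} OpenEnt;

int
main(void)
{
    printf("chained entry: %zu bytes\n", sizeof(ChainedEnt));  /* ~40 */
    printf("open entry:    %zu bytes\n", sizeof(OpenEnt));     /* 24 */
    return 0;
}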

  My guess is that the additional indirection via the arena explains the
  difference in cache misses? But I might be completely off here.
 
 The arena makes the calculation of the pointer address less
 predictable, which might make things more difficult for the CPU
 pipeline.  But aside from that, I think it's the same basic idea: you
 bounce from some sort of bucket header to the first element of the
 chain and then, hopefully, you're done.  Am I missing something?

I haven't really read much of the code yet (wrote most of this email on
my workstation before embarking on a plane), but afaics there's one
indirection to the bucket, and then one to the first node in the linked
list. Right?
In dynahash it's first to the segment, but there are only a few, so it's
likely to be cached. Then to the bucket - which, afaics, already can
contain the key.

So it's not really the arena, but that you chose not to store the first
element in a chain inline...

Greetings,

Andres Freund




Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-14 Thread Robert Haas
On Tue, Oct 14, 2014 at 4:42 PM, Andres Freund and...@2ndquadrant.com wrote:
 The code in CHashSearch shows the problem there: you need to STORE the
 hazard pointer before you begin to do the LOAD operations to scan the
 bucket, and you must finish all of those LOADs before you STORE the
 NULL hazard pointer.  A read or write barrier won't provide store/load
 or load/store ordering.

 I'm not sure I understand the problem here - but a read + write barrier
 should suffice? To avoid falling back to two full barriers, we'd need a
 separate define pg_read_write_barrier(), but that's not a problem. IIUC
 that should allow us to avoid emitting any actual code on x86.

Well, thinking about x86 specifically, it definitely needs at least
one mfence, after setting the hazard pointer and before beginning to
read the bucket chain.  It probably doesn't need the second mfence,
before clearing the hazard pointer, because x86 allows loads to be
reordered before stores but not stores before loads.  We could
certainly try removing the second fence on x86 for benchmarking
purposes, and we could also check whether mfence outperforms lock;
addl in that scenario.

I tested PPC, because that's what I have.  I think we're emitting two
sync instructions there, but maybe lwsync or isync would be enough in
one or both cases.  The first of these links indicates that lwsync
ought to be enough for both cases; the second says that we need a sync
after setting the hazard pointer but that an lwsync is enough before
clearing it.   Or that's my reading anyway.

http://www.ibm.com/developerworks/systems/articles/powerpc.html
http://peeterjoot.wordpress.com/2010/07/23/use-of-lwsync-instead-of-isync-as-a-mutex-acquire-memory-barrier/

  My guess is that the additional indirection via the arena explains the
  difference in cache misses? But I might be completely off here.

 The arena makes the calculation of the pointer address less
 predictable, which might make things more difficult for the CPU
 pipeline.  But aside from that, I think it's the same basic idea: you
 bounce from some sort of bucket header to the first element of the
 chain and then, hopefully, you're done.  Am I missing something?

 I haven't really read much of the code yet (wrote most of this email on
 my workstation before embarking on a plane), but afaics there's one
 indirection to the bucket, and then one to the first node in the linked
 list. Right?
 In dynahash it's first to the segment, but there's only few, so it's
 likely to be cached. Then to the bucket - which, afaics, already can
 contain the key.

 So it's not really the arena, but that you chose not to store the first
 element in a chain inline...

Hmm, you have a point.  I think the reason I did it that way is
because I don't know how to make the memory management work out
otherwise.  If you store the first element inline, then even an empty
bucket uses up as much memory space as one with a single element.
More seriously, it breaks the deletion algorithm.  There's no good way
to atomically swap out the bucket header if it is located in a fixed
position and bigger than the largest object the machine can manipulate
atomically.
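
Spelling that constraint out with a standalone sketch: with an out-of-line
chain the bucket header is just a pointer, so unlinking the first element is
a single pointer-sized CAS, and that is exactly what stops working once a
full entry (tag plus id, roughly 24 bytes) lives inline in the header. This
is illustrative only, not the patch's deletion code, and it ignores the
extra care concurrent deletions of interior nodes need.

/* bucket_head_cas.c - illustrative sketch only. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Node
{
    int                     key;
    _Atomic(struct Node *)  next;
} Node;

/*
 * With a pointer-only bucket header, removing the first element of the
 * chain is one compare-and-swap: if someone else changed the head first,
 * the CAS fails and the caller retries.
 */
static bool
unlink_head(_Atomic(Node *) *bucket_head, Node *victim)
{
    Node *expected = victim;
    Node *next = atomic_load_explicit(&victim->next, memory_order_acquire);

    return atomic_compare_exchange_strong_explicit(bucket_head,
                                                   &expected, next,
                                                   memory_order_acq_rel,
                                                   memory_order_acquire);
}

/*
 * If the bucket header instead held a whole inline entry, "removing" it
 * would mean atomically replacing ~24 bytes in place, which common hardware
 * cannot do as a single operation - that is the problem described above.
 */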

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

