Re: [HACKERS] Macro customizable hashtable / bitmapscan & aggregation perf

2016-12-09 Thread Andres Freund
Hi,

On 2016-12-09 15:21:36 -0800, Mark Dilger wrote:
> Andres,
> 
> Your patch, below, appears to have been applied to master in commit
> 5dfc198146b49ce7ecc8a1fc9d5e171fb75f6ba5.  It makes mention of a
> function, tuplehash_start_iterate, in a macro, but the function is not
> defined or declared, and its signature and intention are not clear.  Is there
> any chance you could add some documentation about how this function
> is intended to be used and defined?
> 
> See InitTupleHashIterator in src/include/nodes/execnodes.h

The patch generates functions based on the defined prefix. E.g.

/* define parameters necessary to generate the tuple hash table interface */
#define SH_PREFIX tuplehash
#define SH_ELEMENT_TYPE TupleHashEntryData
#define SH_KEY_TYPE MinimalTuple
#define SH_SCOPE extern
#define SH_DECLARE
#include "lib/simplehash.h"

makes tuplehash_start_iterate out of
#define SH_START_ITERATE SH_MAKE_NAME(start_iterate)
...
SH_SCOPE void SH_START_ITERATE(SH_TYPE *tb, SH_ITERATOR *iter);
SH_SCOPE void
SH_START_ITERATE(SH_TYPE *tb, SH_ITERATOR *iter)
{
...
}

See the header comment in simplehash.h for some explanation.


Hope that helps,

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Macro customizable hashtable / bitmapscan & aggregation perf

2016-12-09 Thread Mark Dilger
Andres,

Your patch, below, appears to have been applied to master in commit
5dfc198146b49ce7ecc8a1fc9d5e171fb75f6ba5.  It makes mention of a
function, tuplehash_start_iterate, in a macro, but the function is not
defined or declared, and its signature and intention are not clear.  Is there
any chance you could add some documentation about how this function
is intended to be used and defined?

See InitTupleHashIterator in src/include/nodes/execnodes.h

Thanks,

Mark Dilger

> On Oct 9, 2016, at 4:38 PM, Andres Freund  wrote:
> 
> Hi,
> 
> Attached is an updated version of the patchset. The main changes are
> - address most of Tomas' feedback
> - address regression test output changes by adding explicit ORDER BYs,
>  in a separate commit.
> - fix issues around hash tables sized up to PG_UINT32_MAX
> - fix a bug in iteration with "concurrent" deletions
> 
>>> We didn't really for dynahash (as it basically assumed a fillfactor of 1
>>> all the time), not sure if changing it won't also cause issues.
>>> 
>> 
>> That's a case of glass half-full vs. half-empty, I guess. If we assumed load
>> factor 1.0, then I see it as accounting for load factor (while you see it as
>> not accounting for it).
> 
> Well, that load factor is almost never achieved, because we'd have grown
> since...  I added a comment about disregarding fill factor and growth
> policy to estimate_hashagg_tablesize, which is actually the place where
> it'd seemingly make sense to handle it.
> 
> Tomas, did you have time to run a benchmark?
> 
> I think this is starting to look good.
> 
> 
> Regards,
> 
> Andres
> <0001-Add-likely-unlikely-branch-hint-macros.patch><0002-Make-regression-tests-less-dependent-on-hash-table-o.patch><0003-Add-a-macro-customizable-hashtable.patch><0004-Use-more-efficient-hashtable-for-tidbitmap.c-to-spee.patch><0005-Use-more-efficient-hashtable-for-execGrouping.c-to-s.patch>





Re: [HACKERS] Macro customizable hashtable / bitmapscan & aggregation perf

2016-10-11 Thread Tomas Vondra

On 10/11/2016 05:56 PM, Andres Freund wrote:

On 2016-10-11 04:29:31 +0200, Tomas Vondra wrote:

On 10/11/2016 04:07 AM, Andres Freund wrote:

On 2016-10-10 17:46:22 -0700, Andres Freund wrote:

TPC-DS (tpcds.ods)
--

In this case, I'd say the results are less convincing. There are quite a few
queries that got slower by ~10%, which is well above - for example queries
22 and 67. There are of course queries that got ~10% faster, and in total
the patched version executed more queries (so overall the result is slightly
positive, but not significantly).


That's interesting. I wonder whether that's plan changes just due to the
changing memory estimates, or what's causing that. I'll look into it.


Hm. Based on an initial look those queries aren't planned with any of
the affected codepaths.  Could this primarily be a question of
randomness? Would it perhaps make sense to run the tests in a comparable
order? Looking at tpcds.py and the output files, it seems that the query
order differs between the runs; that can easily explain a bigger
difference than the above. For me a scale=1 run creates a database of
approximately 4.5GB, thus with shared_buffers=1GB execution order is
likely to have a significant performance impact.



Yes, I see similar plans (no bitmap index scans or hash aggregates). But the
difference is there, even when running the query alone (so it's not merely
due to the randomized ordering).



I wonder whether this is again due to compiler moving stuff around.


It looks like that. I looked through a significant set of plans where
there were time differences (generated on my machine), and none of them
had bitmap or hash groupings to any significant degree.  Comparing
profiles in those cases usually shows a picture like:
24.98%   +0.32%  postgres[.] slot_deform_tuple
16.58%   -0.05%  postgres[.] ExecMakeFunctionResultNoSets
12.41%   -0.01%  postgres[.] slot_getattr
 6.10%   +0.39%  postgres[.] heap_getnext
 4.41%   -0.37%  postgres[.] ExecQual
 3.08%   +0.12%  postgres[.] ExecEvalScalarVarFast
 2.85%   -0.11%  postgres[.] check_stack_depth
 2.48%   +0.42%  postgres[.] ExecEvalConst
 2.44%   -0.33%  postgres[.] heapgetpage
 2.34%   +0.11%  postgres[.] ExecScan
 2.14%   -0.20%  postgres[.] ExecStoreTuple

I.e. pretty random performance changes. This indeed looks like binary
layout changes.  Looking at these plans and at profiles spanning a run
of all queries shows that bitmap scans and hash aggregations, while
present, account for a very small amount of time in total. So tpc-ds
doesn't look particularly interesting to evaluate these patches - but
very interesting for my slot deforming and qual evaluation patches.



Meh, that's annoying. Anyway, the reason why many of the TPC-DS queries 
can't really benefit from the patch is that a lot of the queries use 
grouping sets, and we only have sorted implementation for that.


I wonder whether the TPC-H differences are also affected by this ...


Btw, query_14.sql as generated by your templates (in pgperffarm) doesn't
seem to work here. And I never had the patience to run query_1.sql to
completion... Looks like we could very well use some planner
improvements here.



Yeah, planner improvements would be great ;-)

Query 14 needs an explicit cast to text, otherwise you get something 
like "can't cast unknown to text" error. Not sure if that's an expected 
issue or not, but the cast fixes it. I haven't pushed that to the 
pgperffarm repo yet, along with several other tooling fixes.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Macro customizable hashtable / bitmapscan & aggregation perf

2016-10-11 Thread Andres Freund
On 2016-10-11 04:29:31 +0200, Tomas Vondra wrote:
> On 10/11/2016 04:07 AM, Andres Freund wrote:
> > On 2016-10-10 17:46:22 -0700, Andres Freund wrote:
> > > > TPC-DS (tpcds.ods)
> > > > --
> > > > 
> > > > In this case, I'd say the results are less convincing. There are quite 
> > > > a few
> > > > queries that got slower by ~10%, which is well above - for example 
> > > > queries
> > > > 22 and 67. There are of course queries that got ~10% faster, and in 
> > > > total
> > > > the patched version executed more queries (so overall the result is 
> > > > slightly
> > > > positive, but not significantly).
> > > 
> > > That's interesting. I wonder whether that's plan changes just due to the
> > > changing memory estimates, or what's causing that. I'll look into it.
> > 
> > Hm. Based on an initial look those queries aren't planned with any of
> > the affected codepaths.  Could this primarily be a question of
> > randomness? Would it perhaps make sense to run the tests in a comparable
> > order? Looking at tpcds.py and the output files, it seems that the query
> > order differs between the runs; that can easily explain a bigger
> > difference than the above. For me a scale=1 run creates a database of
> > approximately 4.5GB, thus with shared_buffers=1GB execution order is
> > likely to have a significant performance impact.
> > 
> 
> Yes, I see similar plans (no bitmap index scans or hash aggregates). But the
> difference is there, even when running the query alone (so it's not merely
> due to the randomized ordering).

> I wonder whether this is again due to compiler moving stuff around.

It looks like that. I looked through a significant set of plans where
there were time differences (generated on my machine), and none of them
had bitmap or hash groupings to any significant degree.  Comparing
profiles in those cases usually shows a picture like:
24.98%   +0.32%  postgres[.] slot_deform_tuple
16.58%   -0.05%  postgres[.] ExecMakeFunctionResultNoSets
12.41%   -0.01%  postgres[.] slot_getattr
 6.10%   +0.39%  postgres[.] heap_getnext
 4.41%   -0.37%  postgres[.] ExecQual
 3.08%   +0.12%  postgres[.] ExecEvalScalarVarFast
 2.85%   -0.11%  postgres[.] check_stack_depth
 2.48%   +0.42%  postgres[.] ExecEvalConst
 2.44%   -0.33%  postgres[.] heapgetpage
 2.34%   +0.11%  postgres[.] ExecScan
 2.14%   -0.20%  postgres[.] ExecStoreTuple

I.e. pretty random performance changes. This indeed looks like binary
layout changes.  Looking at these plans and at profiles spanning a run
of all queries shows that bitmap scans and hash aggregations, while
present, account for a very small amount of time in total. So tpc-ds
doesn't look particularly interesting to evaluate these patches - but
very interesting for my slot deforming and qual evaluation patches.

Btw, query_14.sql as generated by your templates (in pgperffarm) doesn't
seem to work here. And I never had the patience to run query_1.sql to
completion... Looks like we could very well use some planner
improvements here.


Greetings,

Andres Freund




Re: [HACKERS] Macro customizable hashtable / bitmapscan & aggregation perf

2016-10-10 Thread Tomas Vondra

On 10/11/2016 04:07 AM, Andres Freund wrote:

On 2016-10-10 17:46:22 -0700, Andres Freund wrote:

TPC-DS (tpcds.ods)
--

In this case, I'd say the results are less convincing. There are quite a few
queries that got slower by ~10%, which is well above - for example queries
22 and 67. There are of course queries that got ~10% faster, and in total
the patched version executed more queries (so overall the result is slightly
positive, but not significantly).


That's interesting. I wonder whether that's plan changes just due to the
changing memory estimates, or what's causing that. I'll look into it.


Hm. Based on an initial look those queries aren't planned with any of
the affected codepaths.  Could this primarily be a question of
randomness? Would it perhaps make sense to run the tests in a comparable
order? Looking at tpcds.py and the output files, it seems that the query
order differs between the runs; that can easily explain a bigger
difference than the above. For me a scale=1 run creates a database of
approximately 4.5GB, thus with shared_buffers=1GB execution order is
likely to have a significant performance impact.



Yes, I see similar plans (no bitmap index scans or hash aggregates). But 
the difference is there, even when running the query alone (so it's not 
merely due to the randomized ordering).


I wonder whether this is again due to compiler moving stuff around.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Macro customizable hashtable / bitmapscan & aggregation perf

2016-10-10 Thread Andres Freund
On 2016-10-10 17:46:22 -0700, Andres Freund wrote:
> > TPC-DS (tpcds.ods)
> > --
> > 
> > In this case, I'd say the results are less convincing. There are quite a few
> > queries that got slower by ~10%, which is well above - for example queries
> > 22 and 67. There are of course queries that got ~10% faster, and in total
> > the patched version executed more queries (so overall the result is slightly
> > positive, but not significantly).
> 
> That's interesting. I wonder whether that's plan changes just due to the
> changing memory estimates, or what's causing that. I'll look into it.

Hm. Based on an initial look those queries aren't planned with any of
the affected codepaths.  Could this primarily be a question of
randomness? Would it perhaps make sense to run the tests in a comparable
order? Looking at tpcds.py and the output files, it seems that the query
order differs between the runs; that can easily explain a bigger
difference than the above. For me a scale=1 run creates a database of
approximately 4.5GB, thus with shared_buffers=1GB execution order is
likely to have a significant performance impact.

Greetings,

Andres Freund




Re: [HACKERS] Macro customizable hashtable / bitmapscan & aggregation perf

2016-10-10 Thread Andres Freund
Hi,

On 2016-10-11 02:38:26 +0200, Tomas Vondra wrote:
> Yes, I've done a bunch of TPC-H and TPC-DS tests on the patch, but it took
> quite a bit of time. These tests were done on a fairly small machine (the
> usual i5-2500k with 8GB of RAM), with only 1GB data sets for both
> benchmarks, to keep it completely in memory as I presume once we start
> hitting I/O, it becomes the dominant part.

Thanks!


> TPC-DS (tpcds.ods)
> --
> 
> In this case, I'd say the results are less convincing. There are quite a few
> queries that got slower by ~10%, which is well above - for example queries
> 22 and 67. There are of course queries that got ~10% faster, and in total
> the patched version executed more queries (so overall the result is slightly
> positive, but not significantly).

That's interesting. I wonder whether that's plan changes just due to the
changing memory estimates, or what's causing that. I'll look into it.

Greetings,

Andres Freund




Re: [HACKERS] Macro customizable hashtable / bitmapscan & aggregation perf

2016-10-09 Thread Andres Freund
Hi,

Attached is an updated version of the patchset. The main changes are
- address most of Tomas' feedback
- address regression test output changes by adding explicit ORDER BYs,
  in a separate commit.
- fix issues around hash tables sized up to PG_UINT32_MAX
- fix a bug in iteration with "concurrent" deletions

> > We didn't really for dynahash (as it basically assumed a fillfactor of 1
> > all the time), not sure if changing it won't also cause issues.
> > 
> 
> That's a case of glass half-full vs. half-empty, I guess. If we assumed load
> factor 1.0, then I see it as accounting for load factor (while you see it as
> not accounting for it).

Well, that load factor is almost never achieved, because we'd have grown
since...  I added a comment about disregarding fill factor and growth
policy to estimate_hashagg_tablesize, which is actually the place where
it'd seemingly make sense to handle it.

Tomas, did you have time to run a benchmark?

I think this is starting to look good.


Regards,

Andres
From 0cf78f283af702d2e938e249de8c09733d2f5f77 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Sun, 3 Jul 2016 15:05:18 -0700
Subject: [PATCH 1/5] Add likely/unlikely() branch hint macros.

These are useful for hot code paths. Because it's easy to guess wrongly
about likelihood, and because such likelihoods change over time, they
should be used sparingly.

Discussion: <20160727004333.r3e2k2y6fvk2n...@alap3.anarazel.de>
---
 src/include/c.h | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/src/include/c.h b/src/include/c.h
index 4ab3f80..3a77107 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -939,6 +939,22 @@ typedef NameData *Name;
 #endif
 
 
+/*
+ * Hints to the compiler about the likelihood of a branch. Both likely() and
+ * unlikely() return the boolean value of the contained expression.
+ *
+ * These should only be used sparingly, in very hot code paths. It's very easy
+ * to mis-estimate likelihoods.
+ */
+#if __GNUC__ >= 3
+#define likely(x)	__builtin_expect((x) != 0, 1)
+#define unlikely(x)	__builtin_expect((x) != 0, 0)
+#else
+#define likely(x)	((x) != 0)
+#define unlikely(x)	((x) != 0)
+#endif
+
+
 /* 
  *Section 8:	random stuff
  * 
-- 
2.9.3

From b9bee24439cd26cf7d3ccd5d02ff52ac579646f5 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Sun, 9 Oct 2016 15:34:13 -0700
Subject: [PATCH 2/5] Make regression tests less dependent on hash table order.

Upcoming changes to the hash table code used, among others, for grouping
and set operations will change the output order for a few queries. To
make it less likely that actual bugs are hidden between ordering
changes, and to make the tests robust against platform dependant
ordering, add ORDER BYs guaranteeing the output order.

Discussion: <20160727004333.r3e2k2y6fvk2n...@alap3.anarazel.de>
---
 src/test/regress/expected/matview.out |   6 +-
 src/test/regress/expected/psql.out|   2 +-
 src/test/regress/expected/tsrf.out|   8 +-
 src/test/regress/expected/union.out   | 133 ++
 src/test/regress/expected/with.out|   8 +-
 src/test/regress/sql/matview.sql  |   4 +-
 src/test/regress/sql/psql.sql |   2 +-
 src/test/regress/sql/tsrf.sql |   2 +-
 src/test/regress/sql/union.sql|  63 
 src/test/regress/sql/with.sql |   8 +-
 10 files changed, 125 insertions(+), 111 deletions(-)

diff --git a/src/test/regress/expected/matview.out b/src/test/regress/expected/matview.out
index e7d0ad1..08cffcf 100644
--- a/src/test/regress/expected/matview.out
+++ b/src/test/regress/expected/matview.out
@@ -33,7 +33,7 @@ SELECT relispopulated FROM pg_class WHERE oid = 'mvtest_tm'::regclass;
  f
 (1 row)
 
-SELECT * FROM mvtest_tm;
+SELECT * FROM mvtest_tm ORDER BY type;
 ERROR:  materialized view "mvtest_tm" has not been populated
 HINT:  Use the REFRESH MATERIALIZED VIEW command.
 REFRESH MATERIALIZED VIEW mvtest_tm;
@@ -44,12 +44,12 @@ SELECT relispopulated FROM pg_class WHERE oid = 'mvtest_tm'::regclass;
 (1 row)
 
 CREATE UNIQUE INDEX mvtest_tm_type ON mvtest_tm (type);
-SELECT * FROM mvtest_tm;
+SELECT * FROM mvtest_tm ORDER BY type;
  type | totamt 
 --+
+ x|  5
  y| 12
  z| 11
- x|  5
 (3 rows)
 
 -- create various views
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index 017b79e..464436a 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -123,7 +123,7 @@ unicode_header_linestyle single
 prepare q as select array_to_string(array_agg(repeat('x',2*n)),E'\n') as "ab
 
 c", array_to_string(array_agg(repeat('y',20-2*n)),E'\n') as "a
-bc" from generate_series(1,10) as n(n) group by n>1 ;
+bc" from generate_series(1,10) as n(n) group by n>1 order by n>1;
 \pset linestyle 

Re: [HACKERS] Macro customizable hashtable / bitmapscan & aggregation perf

2016-10-03 Thread Andres Freund
Hi,


On 2016-10-03 13:26:09 +0200, Arthur Silva wrote:
> On Sat, Oct 1, 2016 at 2:44 AM, Andres Freund  wrote:
> A couple of comments.
> * 80% occupancy is a bit conservative for RH hashing, it works well up to
> 90% if you use the early stops due to distance. So that TODO is worth
> pursuing.

I found 90% a tiny bit slower during modifications, due to the increased
moving of cells around.  I actually implemented that TODO at some point,
it's not actually a very clear win for narrow elements and mid-sized
tables - the additional instructions for distance computations cost.

I've played with a different version of robin hood hashing, where the
buckets are ordered by hash-value. I.e. the hashvalue is shifted right,
instead of being masked, to get the hash bucket. That allows having a
hashtable which is ordered by the hash-value, thus distance checks
simply become >=.  The problem with that is that it only works if you
have an "overflow" area at the end of the table, which is hard to size
correctly.


> * An optimized memory layout for RH hashmap is along the lines of
> HHHKVKVKV, using one bit of the hash as the present/absent flag. That plays
> specially well with large occupancies (~90%). Due to RH the probings on the
> H array are very short and within a single cache line. Even with a 31bit
> hash it's reallyyy rare to have to probe more than 1 entry in the KV array.
> Essentially guaranteeing that the 99% percentile of lookups will only hit a
> couple of cache lines (if you ignore crossing cache lines and other
> details).

That seems likely to end up being noticeably more complicated - I'm not
sure the cpu overhead of the more complex lookups weighs of the
benefits.  I'd like to get the basics in, we can optimize further later
on.  Based on my instrumentation the majority of time is currently spent
in the cache-miss for the initial bucket, everything else inside the
hash table code is quite small. After that it's hash value computations
in the execGrouping case.

Greetings,

Andres Freund




Re: [HACKERS] Macro customizable hashtable / bitmapscan & aggregation perf

2016-10-03 Thread Arthur Silva
On Sat, Oct 1, 2016 at 2:44 AM, Andres Freund  wrote:

> Hi,
>
> On 2016-07-26 17:43:33 -0700, Andres Freund wrote:
> > In the attached patch I've attached simplehash.h, which can be
> > customized by a bunch of macros, before being inlined.  There's also a
> > patch using this for tidbitmap.c and nodeAgg/nodeSubplan/... via
> > execGrouping.c.
>
> Attached is a significantly improved version of this series.  The main
> changes are:
>
> - The hash table now uses robin-hood style hash collision handling. See the
>   commit message of 0002 (or simplehash.h) for an introduction. That
>   significantly reduces the impact of "clustering" due to linear
>   addressing.
> - Significant comment and correctness fixes, both in simplehash.h
> - itself, and 0003/0004.
> - a lot of other random performance improvements for the hashing code.
>
>
> Unfortunately I'm running out battery right now, so I don't want to
> re-run the benchmarks posted upthread (loading takes a while). But the
> last time I ran them all the results after the patches were better than
> before.
>
>
> This patch series currently consists out of four patches:
> - 0001 - add unlikely/likely() macros. The use here is not entirely
>   mandatory, but they do provide a quite consistent improvement. There
>   are other threads where they proved to be beneficial, so I see little
>   reason not to introduce them.
> - 0002 - the customizable hashtable itself. It now contains documentation.
> - 0003 - convert tidbitmap.c to use the new hashmap. This includes a fix
>   for the issue mentioned in [1], to improve performance in heavily lossy
>   scenarios (otherwise we could regress in some extreme cases)
> - 0004 - convert execGrouping.c, e.g. used by hash aggregates
>
>
> While not quite perfect yet, I do think this is at a state where input
> is needed.  I don't want to go a lot deeper into this rabbit hole,
> before we have some agreement that this is a sensible course of action.
>
>
> I think there's several possible additional users of simplehash.h,
> e.g. catcache.c - which has an open coded, not particularly fast hash
> table and frequently shows up in profiles - but I think the above two
> conversions are plenty to start with.
>
>
> Comments?
>
>
> Greetings,
>
> Andres Freund
>
> [1] http://archives.postgresql.org/message-id/20160923205843.zcs
> 533sqdzlh4cpo%40alap3.anarazel.de
>
>
>
>
This is a great addition.

A couple of comments.

* 80% occupancy is a bit conservative for RH hashing, it works well up to
90% if you use the early stops due to distance. So that TODO is worth
pursuing.

* An optimized memory layout for RH hashmap is along the lines of
HHHKVKVKV, using one bit of the hash as the present/absent flag. That plays
specially well with large occupancies (~90%). Due to RH the probings on the
H array are very short and within a single cache line. Even with a 31bit
hash it's reallyyy rare to have to probe more than 1 entry in the KV array.
Essentially guaranteeing that the 99% percentile of lookups will only hit a
couple of cache lines (if you ignore crossing cache lines and other
details).


Re: [HACKERS] Macro customizable hashtable / bitmapscan & aggregation perf

2016-10-02 Thread Andres Freund
On 2016-10-02 02:59:04 +0200, Tomas Vondra wrote:
> On 10/02/2016 02:17 AM, Andres Freund wrote:
> > Hi,
> > ...
> > 
> > > > > I think a crucial part of the benchmarking will be identifying and
> > > > > measuring corner cases, e.g. a lot of rows with the same key, etc.
> > > > > Although that probably is not a major issue for the two places
> > > > > switched to the new implementation (e.g. execGrouping merges the
> > > > > duplicates to a single group, by definition).
> > > > 
> > > > Hm. I don't really see a case where that's going to cause issues - all
> > > > the execGrouping.c users only store one key, and either ignore later
> > > > ones, or add the data to the initial tuple in the group.  I don't really
> > > > see any case where repeated keys should cause an issue for a hash table?
> > > > 
> > > 
> > > Yep, that's pretty much what I suggested. I was wondering more about the
> > > other places where this hash table might be used - I've been thinking 
> > > about
> > > hashjoins, but now that I think of it - that's a fairly specialized and
> > > tuned code. In any case, let's not complicate the discussion for now.
> > 
> > My point is that I don't really see any potential usecase where
> > that's an issue.
> > 
> 
> OK, I think we agree the two places modified by the submitted patch series
> are fine. Let's leave discussion about places modified by future patches for
> the future ;-)

I think my problem here is that I just can't see a potential use-case
for hashtables where that's an issue ;)


> > > 5) SH_RESIZE
> > > 
> > > Do we expect to shrink hash tables?
> > 
> > Not at the moment. Should be doable, but I'm not sure it's worth
> > developing and testing - I don't see any usecases in pg atm.
> > 
> 
> OK, sure. Then what about adding this to SH_RESIZE?
> 
> Assert(oldsize <= newsize);

Yes. And potentially rename it to SH_GROW.


> I assumed we might shrink the catcache after invalidation, for example (but
> maybe I don't really know how that works).

They don't shrink so far, and in most invalidation cases we don't delete
entries, just mark them as invalid (as they might be referenced on the
outside at that moment).


> > > It's not entirely clear why is it guaranteed that there's always
> > > an element with optimal position (when load factor < 1)? Adding an
> > >  explanation or a link to a paper would be helpful, I guess.
> > 
> > As the load factor always has to be below 1, there always has to be
> > an empty bucket. Every bucket after an empty bucket has to be at
> > its optimal position.

> Hmmm, thanks to moving the entries after delete?

Pretty much, yes.


> Adding an assert() enforcing this seems like a good idea.

Probably requires some additional state to be meaningfully enforced. Not
sure that's worth the effort. If that invariant is broken, not a lot of
stuff works (believe me, I have broken it ;)). Otherwise searches won't
necessarily work anymore, if they didn't end up in their original
position (as the cell would now be empty, a lookup/insert/delete would
not find the existing element).


> > > 7) Should hash agg size estimation for the planner consider the 
> > > fillfactor?
> > > 
> > > I think we should account for fillfactor - we should account for the
> > > allocated memory, even if there's free space in the hash table. After all,
> > > that's what hashjoins do.
> > 
> > We didn't really for dynahash (as it basically assumed a fillfactor of 1
> > all the time), not sure if changing it won't also cause issues.
> > 
> 
> That's a case of glass half-full vs. half-empty, I guess. If we assumed load
> factor 1.0, then I see it as accounting for load factor (while you see it as
> not accounting for it).

Well, even before we grow by factors of two, so 1 wasn't accurate for
most of the time.  My issue isn't really that I don't want to do it, but
that I'm not entirely sure that there's a good way. TupleHashEntryData
itself isn't exactly large, and we'd have to multiply it by the load
factor. At the same time, after the patch, AggStatePerGroupData isn't
allocated for empty elements anymore, and that's the largest part of the
memory.  So I'm kind of inclined to just disregard the fillfactor (and
document that).

> Also, with load factor 0.8 the table is only ~20% larger, so even if we
> don't include that into work_mem it's a reasonably small error (easily
> smaller than errors in the pre-9.5 HashJoin accounting, which did not
> include chunk headers etc.).

I think in either cases the entries themselves are smaller after the
patch by enough that the overhead doesn't matter once you crossed a few
dozen entries.

Regards,

Andres




Re: [HACKERS] Macro customizable hashtable / bitmapscan & aggregation perf

2016-10-01 Thread Tomas Vondra

On 10/02/2016 02:17 AM, Andres Freund wrote:

Hi,

> ...



I think a crucial part of the benchmarking will be identifying and
measuring corner cases, e.g. a lot of rows with the same key, etc.
Although that probably is not a major issue for the two places
switched to the new implementation (e.g. execGrouping merges the
duplicates to a single group, by definition).


Hm. I don't really see a case where that's going to cause issues - all
the execGrouping.c users only store one key, and either ignore later
ones, or add the data to the initial tuple in the group.  I don't really
see any case where repeated keys should cause an issue for a hash table?



Yep, that's pretty much what I suggested. I was wondering more about the
other places where this hash table might be used - I've been thinking about
hashjoins, but now that I think of it - that's a fairly specialized and
tuned code. In any case, let's not complicate the discussion for now.


My point is that I don't really see any potential usecase where
that's an issue.



OK, I think we agree the two places modified by the submitted patch 
series are fine. Let's leave discussion about places modified by future 
patches for the future ;-)





A few comments, after quickly skimming through the first two patches:

1) SH_CREATE does this:

/* round up size to the next power of 2, eases lookups */
if (size < 2)
size = 2;
else
size = sh_pow2_int(size);

I very much doubt a hash table with 2 buckets is very useful. What about
using some reasonable lower boundary - IIRC hashjoins use 1024 buckets,
which might be a bit too much, but perhaps 256 would work?


For some cases that'll waste a good bit of space. Consider
e.g. catcache.c. Callers can (and do) specify more appropriate sizes for
the individual uses.



Hmmm, OK. I have my doubts about those hardcoded constants, but you're 
right, 256 is probably overkill.
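As an aside, the power-of-two rounding in SH_CREATE quoted above is what lets lookups use a mask instead of a modulo. The thread doesn't show sh_pow2_int's body; a minimal sketch of such a helper (hypothetical name and implementation) could look like:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for simplehash's sh_pow2_int(): round a
 * requested bucket count up to the next power of two, so that
 * "hash & (size - 1)" can replace a modulo during lookups. */
static uint64_t
next_pow2(uint64_t v)
{
	/* Smear the highest set bit of (v - 1) into all lower bits,
	 * then add one; powers of two map to themselves, v == 1 maps to 1. */
	v--;
	v |= v >> 1;
	v |= v >> 2;
	v |= v >> 4;
	v |= v >> 8;
	v |= v >> 16;
	v |= v >> 32;
	return v + 1;
}
```

Note that SH_CREATE's `size < 2` branch clamps tiny requests to 2 before this rounding is applied.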





5) SH_RESIZE

Do we expect to shrink hash tables?


Not at the moment. Should be doable, but I'm not sure it's worth
developing and testing - I don't see any usecases in pg atm.



OK, sure. Then what about adding this to SH_RESIZE?

Assert(oldsize <= newsize);

I assumed we might shrink the catcache after invalidation, for example 
(but maybe I don't really know how that works).





It's not entirely clear why it is guaranteed that there's always
an element at its optimal position (when load factor < 1). Adding an
explanation or a link to a paper would be helpful, I guess.


As the load factor always has to be below 1, there always has to be
an empty bucket. Every bucket after an empty bucket has to be at
its optimal position.



Hmmm, thanks to moving the entries after delete? Adding an assert() 
enforcing this seems like a good idea.





7) Should hash agg size estimation for the planner consider the fillfactor?

I think we should account for fillfactor - we should account for the
allocated memory, even if there's free space in the hash table. After all,
that's what hashjoins do.


We didn't really do that for dynahash (as it basically assumed a fillfactor
of 1 all the time), and I'm not sure that changing it won't also cause issues.



That's a case of glass half-full vs. half-empty, I guess. If we assumed 
load factor 1.0, then I see it as accounting for load factor (while you 
see it as not accounting of it).


I don't see why this should cause issues - of course, the hash table may 
be a bit larger, exceed work_mem and make the tidbitmap lossy, or switch 
HashAggregate to GroupAggregate. But I don't think that's a major issue, 
as those decisions depend on estimates anyway, so we can't really 
guarantee them.


Also, with load factor 0.8 the table is only ~20% larger, so even if we 
don't include that into work_mem it's a reasonably small error (easily 
smaller than errors in the pre-9.5 HashJoin accounting, which did not 
include chunk headers etc.).


So I won't fight for this, but I don't see why not to account for it.
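The accounting being debated here can be made concrete with a small sketch (names and constants are illustrative, not PostgreSQL's actual planner code): charging for allocated buckets means dividing the expected entry count by the fillfactor and rounding up to a power of two, rather than charging only for occupied entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round up to the next power of two (simple loop variant). */
static uint64_t
pow2_ceil(uint64_t v)
{
	uint64_t p = 1;

	while (p < v)
		p <<= 1;
	return p;
}

/* Hedged sketch of fillfactor-aware memory accounting: estimate the
 * allocated table size for nentries entries of entrysize bytes at the
 * given fillfactor (e.g. 0.8), including buckets that stay empty. */
static uint64_t
hash_mem_estimate(uint64_t nentries, size_t entrysize, double fillfactor)
{
	uint64_t nbuckets = pow2_ceil((uint64_t) (nentries / fillfactor) + 1);

	return nbuckets * (uint64_t) entrysize;
}
```

For 1000 entries at fillfactor 0.8 this allocates 2048 buckets, illustrating why power-of-two rounding can inflate the gap well beyond the nominal 1/fillfactor factor.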

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Macro customizable hashtable / bitmapscan & aggregation perf

2016-10-01 Thread Andres Freund
Hi,


On 2016-10-02 01:37:35 +0200, Tomas Vondra wrote:
> On 10/01/2016 09:59 PM, Andres Freund wrote:
> >>What about running a bigger benchmark - say, TPC-DS - and evaluating
> >>the impact?
> >
> >Worthwhile, although obviously the impact will be a lot smaller,
> >since they're not entirely bottlenecked on hash-aggs and bitmap
> >scans.
> >
> 
> Sure, the improvement won't be as significant as on the simple queries, but
> it's interesting IMHO.

Agreed.


> >>I think a crucial part of the benchmarking will be identifying and
> >>measuring corner cases, e.g. a lot of rows with the same key, etc.
> >>Although that probably is not a major issue for the two places
> >>switched to the new implementation (e.g. execGrouping merges the
> >>duplicates to a single group, by definition).
> >
> >Hm. I don't really see a case where that's going to cause issues - all
> >the execGrouping.c users only store one key, and either ignore later
> >ones, or add the data to the initial tuple in the group.  I don't really
> >see any case where repeated keys should cause an issue for a hash table?
> >
> 
> Yep, that's pretty much what I suggested. I was wondering more about the
> other places where this hash table might be used - I've been thinking about
> hashjoins, but now that I think of it, that's fairly specialized and
> tuned code. In any case, let's not complicate the discussion for now.

My point is that I don't really see any potential usecase where that's
an issue.


> A few comments, after quickly skimming through the first two patches:
> 
> 1) SH_CREATE does this:
> 
>   /* round up size to the next power of 2, eases lookups */
>   if (size < 2)
>   size = 2;
>   else
>   size = sh_pow2_int(size);
> 
> I very much doubt a hash table with 2 buckets is very useful. What about
> using some reasonable lower boundary - IIRC hashjoins use 1024 buckets,
> which might be a bit too much, but perhaps 256 would work?

For some cases that'll waste a good bit of space. Consider
e.g. catcache.c. Callers can (and do) specify more appropriate sizes for
the individual uses.


> 2) tb->resize_above
> 
> I'd say 'resize_threshold' is a better name. Also, 0.8 should be defined as
> a constant somewhere (not sure if it makes sense to make it part of SH_TYPE,
> so that hash tables may use different load factors).

Fair point.


> 3) 'size' is a rather poor name for parameter (e.g. in SH_CREATE or
> SH_RESIZE), as it's unclear whether it's 'element size in bytes', 'total
> size in bytes' or 'number of entries'.

Ok.


> 4) SH_CONTAINS sounds more like a function checking whether a hash table
> contains a key, not as a 'type of hash table entry'. What about
> SH_ENTRY_TYPE instead? Also, SH_KEYTYPE should be SH_KEY_TYPE I guess.

Hm, ok.


> 5) SH_RESIZE
> 
> Do we expect to shrink hash tables?

Not at the moment. Should be doable, but I'm not sure it's worth
developing and testing - I don't see any usecases in pg atm.


> It's not entirely clear why it is guaranteed that there's always an element
> at its optimal position (when load factor < 1). Adding an explanation or a
> link to a paper would be helpful, I guess.

As the load factor always has to be below 1, there always has to be an
empty bucket. Every bucket after an empty bucket has to be at its
optimal position.
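That invariant can be illustrated with a toy robin-hood table (identity hash over small integers; names and layout are illustrative, not simplehash.h's actual code): on collision, the incoming entry evicts any resident that is closer to its home bucket, so probe distances stay tightly bounded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define RH_SIZE 16				/* power of two, so masking works */

typedef struct { uint32_t key; bool used; } RhEntry;
typedef struct { RhEntry buckets[RH_SIZE]; } RhTable;

/* wrap-around distance of key's current position from its home bucket */
static uint32_t
rh_dist(uint32_t key, uint32_t pos)
{
	uint32_t home = key & (RH_SIZE - 1);

	return (pos - home) & (RH_SIZE - 1);
}

/* Robin-hood insert: assumes load factor < 1, i.e. an empty bucket exists. */
static void
rh_insert(RhTable *tb, uint32_t key)
{
	uint32_t pos = key & (RH_SIZE - 1);
	uint32_t dist = 0;

	for (;;)
	{
		RhEntry *e = &tb->buckets[pos];

		if (!e->used)
		{
			e->key = key;
			e->used = true;
			return;
		}
		/* Rob from the rich: if the resident is closer to its home than
		 * we are to ours, swap and keep inserting the evicted entry. */
		if (rh_dist(e->key, pos) < dist)
		{
			uint32_t tmp = e->key;

			e->key = key;
			key = tmp;
			dist = rh_dist(key, pos);
		}
		pos = (pos + 1) & (RH_SIZE - 1);
		dist++;
	}
}

static bool
rh_contains(const RhTable *tb, uint32_t key)
{
	uint32_t pos = key & (RH_SIZE - 1);
	uint32_t dist = 0;

	while (tb->buckets[pos].used)
	{
		if (tb->buckets[pos].key == key)
			return true;
		/* A resident closer to home than our probe distance means the
		 * sought key cannot be further along: stop early. */
		if (rh_dist(tb->buckets[pos].key, pos) < dist)
			return false;
		pos = (pos + 1) & (RH_SIZE - 1);
		dist++;
	}
	return false;
}
```

The early-exit in the lookup is exactly what the invariant buys: an empty bucket, or a resident at a smaller displacement, terminates the probe.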

> 
> 6) SH_STAT
> 
> This bit is a bit suspicious:
> 
> uint32 max_collisions = 0;
> 
> ...
> 
> /* single contained element is not a collision */
> curcoll--;
> total_collisions += curcoll;
> if (curcoll > max_collisions)
> max_collisions = curcoll - 1;
> 
> Shouldn't the last line be just "max_collisions = curcoll"?

Hm. That's probably a refactoring artefact. Nice catch.


> 7) Should hash agg size estimation for the planner consider the fillfactor?
> 
> I think we should account for fillfactor - we should account for the
> allocated memory, even if there's free space in the hash table. After all,
> that's what hashjoins do.

We didn't really do that for dynahash (as it basically assumed a fillfactor
of 1 all the time), and I'm not sure that changing it won't also cause issues.


Thanks!


Greetings,

Andres Freund




Re: [HACKERS] Macro customizable hashtable / bitmapscan & aggregation perf

2016-10-01 Thread Tomas Vondra

On 10/01/2016 09:59 PM, Andres Freund wrote:

Hi,

On 2016-10-01 20:19:21 +0200, Tomas Vondra wrote:

On 10/01/2016 02:44 AM, Andres Freund wrote:

Hi,

On 2016-07-26 17:43:33 -0700, Andres Freund wrote:

In the attached patch I've attached simplehash.h, which can be
customized by a bunch of macros, before being inlined.  There's also a
patch using this for tidbitmap.c and nodeAgg/nodeSubplan/... via
execGrouping.c.


Attached is a significantly improved version of this series.  The main
changes are:

- The hash table now uses robin-hood style hash collision handling. See the
  commit message of 0002 (or simplehash.h) for an introduction. That
  significantly reduces the impact of "clustering" due to linear
  addressing.


Interesting. Have you looked at cuckoo hashing?


Yes. I don't think it's a good fit for modern CPUs. Because it
doesn't probe linearly, you constantly have random uncached memory
accesses.

I've played with a few schemes, and they all seemed to be slower
than linear-probing variants, unless you intentionally used a
wrong hash distribution, or intentionally "unbalanced" the hash space
by iterating over either end of the keyspace and moving the entries
around.


OK, understood.




- Significant comment and correctness fixes, both in simplehash.h
- itself, and 0003/0004.
- a lot of other random performance improvements for the hashing code.


Unfortunately I'm running out of battery right now, so I don't want to
re-run the benchmarks posted upthread (loading takes a while). But the
last time I ran them all the results after the patches were better than
before.



Well, I have rather bad experience with running benchmarks on laptops anyway
- a lot of noise due to power management, etc.


Well, with factor ~2x improvements that's not really a big detractor.
Using a new CPU makes some forms of analysis easier (better PMUs).



True.

>

For longrunning tests I agree.



What about running a bigger benchmark - say, TPC-DS - and evaluating
the impact?


Worthwhile, although obviously the impact will be a lot smaller,
since they're not entirely bottlenecked on hash-aggs and bitmap
scans.



Sure, the improvement won't be as significant as on the simple queries, 
but it's interesting IMHO.





I think a crucial part of the benchmarking will be identifying and
measuring corner cases, e.g. a lot of rows with the same key, etc.
Although that probably is not a major issue for the two places
switched to the new implementation (e.g. execGrouping merges the
duplicates to a single group, by definition).


Hm. I don't really see a case where that's going to cause issues - all
the execGrouping.c users only store one key, and either ignore later
ones, or add the data to the initial tuple in the group.  I don't really
see any case where repeated keys should cause an issue for a hash table?



Yep, that's pretty much what I suggested. I was wondering more about the 
other places where this hash table might be used - I've been thinking 
about hashjoins, but now that I think of it, that's fairly 
specialized and tuned code. In any case, let's not complicate the 
discussion for now.





This patch series currently consists of four patches:
- 0001 - add unlikely/likely() macros. The use here is not entirely
  mandatory, but they do provide a quite consistent improvement. There
  are other threads where they proved to be beneficial, so I see little
  reason not to introduce them.
- 0002 - the customizable hashtable itself. It now contains documentation.
- 0003 - convert tidbitmap.c to use the new hashmap. This includes a fix
  for the issue mentioned in [1], to improve performance in heavily lossy
  scenarios (otherwise we could regress in some extreme cases)
- 0004 - convert execGrouping.c, e.g. used by hash aggregates


While not quite perfect yet, I do think this is at a state where input
is needed.  I don't want to go a lot deeper into this rabbit hole,
before we have some agreement that this is a sensible course of action.



So, is it the right time to do some benchmarking?


That's one thing that's required, yes. The other is whether we can live
with the ugliness of implementing templates via macros. I do think we
can, but...



Hmmm ... not sure. If one of the points is to get rid of function calls 
determined at runtime (which make it impossible for the compiler to 
optimize the code), then I can think of three options:


(a) copy-paste the code for each place
(b) use some templating
(c) use JIT

I think (b) is way better than (a), and I don't think we're using JIT 
anywhere at this point. So while the macro-based templates look a bit 
awkward, I'm not aware of a better C thing.


A few comments, after quickly skimming through the first two patches:

1) SH_CREATE does this:

/* round up size to the next power of 2, eases lookups */
if (size < 2)
size = 2;
else
size = sh_pow2_int(size);

I very much doubt a hash table with 2 buckets is very useful. What about
using some reasonable lower boundary - IIRC hashjoins use 1024 buckets,
which might be a bit too much, but perhaps 256 would work?

Re: [HACKERS] Macro customizable hashtable / bitmapscan & aggregation perf

2016-10-01 Thread Andres Freund
On 2016-10-01 20:04:18 +0100, Greg Stark wrote:
> On Sat, Oct 1, 2016 at 1:44 AM, Andres Freund  wrote:
> >
> > Unfortunately I'm running out of battery right now, so I don't want to
> > re-run the benchmarks posted upthread (loading takes a while). But the
> > last time I ran them all the results after the patches were better than
> > before.

> I have a machine sitting idle now too if you have specific ideas of
> what to benchmark.

Last time I just extracted interesting parts of queries from tpc-h (to
be able to look at bitmapscans/hash-aggs in isolation) and ran tpc-h as
a whole.  During development I also tried to come up with adverse cases
(e.g. huge bitmapscans, with very low work_mem).  I don't really have a
lot better ideas than that.

Andres




Re: [HACKERS] Macro customizable hashtable / bitmapscan & aggregation perf

2016-10-01 Thread Andres Freund
Hi,

On 2016-10-01 20:19:21 +0200, Tomas Vondra wrote:
> On 10/01/2016 02:44 AM, Andres Freund wrote:
> > Hi,
> > 
> > On 2016-07-26 17:43:33 -0700, Andres Freund wrote:
> > > In the attached patch I've attached simplehash.h, which can be
> > > customized by a bunch of macros, before being inlined.  There's also a
> > > patch using this for tidbitmap.c and nodeAgg/nodeSubplan/... via
> > > execGrouping.c.
> > 
> > Attached is a significantly improved version of this series.  The main
> > changes are:
> > 
> > - The hash table now uses robin-hood style hash collision handling. See the
> >   commit message of 0002 (or simplehash.h) for an introduction. That
> >   significantly reduces the impact of "clustering" due to linear
> >   addressing.
> 
> Interesting. Have you looked at cuckoo hashing?

Yes. I don't think it's a good fit for modern CPUs. Because it doesn't
probe linearly, you constantly have random uncached memory accesses.

I've played with a few schemes, and they all seemed to be slower than
linear-probing variants, unless you intentionally used a wrong
hash distribution, or intentionally "unbalanced" the hash space by
iterating over either end of the keyspace and moving the entries around.


> > - Significant comment and correctness fixes, both in simplehash.h
> > - itself, and 0003/0004.
> > - a lot of other random performance improvements for the hashing code.
> > 
> > 
> > Unfortunately I'm running out of battery right now, so I don't want to
> > re-run the benchmarks posted upthread (loading takes a while). But the
> > last time I ran them all the results after the patches were better than
> > before.
> > 
> 
> Well, I have rather bad experience with running benchmarks on laptops anyway
> - a lot of noise due to power management, etc.

Well, with factor ~2x improvements that's not really a big
detractor. Using a new CPU makes some forms of analysis easier (better
PMUs).

For longrunning tests I agree.


> What about running a bigger benchmark - say, TPC-DS - and evaluating
> the impact?

Worthwhile, although obviously the impact will be a lot smaller, since
they're not entirely bottlenecked on hash-aggs and bitmap scans.


> I think a crucial part of the benchmarking will be identifying and measuring
> corner cases, e.g. a lot of rows with the same key, etc.
> Although that probably is not a major issue for the two places
> switched to the new implementation (e.g. execGrouping merges the
> duplicates to a single group, by definition).

Hm. I don't really see a case where that's going to cause issues - all
the execGrouping.c users only store one key, and either ignore later
ones, or add the data to the initial tuple in the group.  I don't really
see any case where repeated keys should cause an issue for a hash table?


> > This patch series currently consists of four patches:
> > - 0001 - add unlikely/likely() macros. The use here is not entirely
> >   mandatory, but they do provide a quite consistent improvement. There
> >   are other threads where they proved to be beneficial, so I see little
> >   reason not to introduce them.
> > - 0002 - the customizable hashtable itself. It now contains documentation.
> > - 0003 - convert tidbitmap.c to use the new hashmap. This includes a fix
> >   for the issue mentioned in [1], to improve performance in heavily lossy
> >   scenarios (otherwise we could regress in some extreme cases)
> > - 0004 - convert execGrouping.c, e.g. used by hash aggregates
> > 
> > 
> > While not quite perfect yet, I do think this is at a state where input
> > is needed.  I don't want to go a lot deeper into this rabbit hole,
> > before we have some agreement that this is a sensible course of action.
> > 
> 
> So, is it the right time to do some benchmarking?

That's one thing that's required, yes. The other is whether we can live
with the ugliness of implementing templates via macros. I do think we
can, but...

Greetings,

Andres Freund




Re: [HACKERS] Macro customizable hashtable / bitmapscan & aggregation perf

2016-10-01 Thread Greg Stark
On Sat, Oct 1, 2016 at 1:44 AM, Andres Freund  wrote:
>
> Unfortunately I'm running out of battery right now, so I don't want to
> re-run the benchmarks posted upthread (loading takes a while). But the
> last time I ran them all the results after the patches were better than
> before.


I have a machine sitting idle now too if you have specific ideas of
what to benchmark.

-- 
greg




Re: [HACKERS] Macro customizable hashtable / bitmapscan & aggregation perf

2016-10-01 Thread Tomas Vondra

On 10/01/2016 02:44 AM, Andres Freund wrote:

Hi,

On 2016-07-26 17:43:33 -0700, Andres Freund wrote:

In the attached patch I've attached simplehash.h, which can be
customized by a bunch of macros, before being inlined.  There's also a
patch using this for tidbitmap.c and nodeAgg/nodeSubplan/... via
execGrouping.c.


Attached is a significantly improved version of this series.  The main
changes are:

- The hash table now uses robin-hood style hash collision handling. See the
  commit message of 0002 (or simplehash.h) for an introduction. That
  significantly reduces the impact of "clustering" due to linear
  addressing.


Interesting. Have you looked at cuckoo hashing? That seems to be the 
go-to hashing in several database-related papers I've read recently, so 
I guess there's a reason for that (IIRC it has constant worst case for 
lookups, constant amortized time for inserts/deletes). Not sure how it 
compares to robin-hood hashing regarding cache-friendliness though.



- Significant comment and correctness fixes, both in simplehash.h
- itself, and 0003/0004.
- a lot of other random performance improvements for the hashing code.


Unfortunately I'm running out of battery right now, so I don't want to
re-run the benchmarks posted upthread (loading takes a while). But the
last time I ran them all the results after the patches were better than
before.



Well, I have rather bad experience with running benchmarks on laptops 
anyway - a lot of noise due to power management, etc. What about running 
a bigger benchmark - say, TPC-DS - and evaluating the impact?


I think a crucial part of the benchmarking will be identifying and 
measuring corner cases, e.g. a lot of rows with the same key, etc. 
Although that probably is not a major issue for the two places switched 
to the new implementation (e.g. execGrouping merges the duplicates to a 
single group, by definition).




This patch series currently consists of four patches:
- 0001 - add unlikely/likely() macros. The use here is not entirely
  mandatory, but they do provide a quite consistent improvement. There
  are other threads where they proved to be beneficial, so I see little
  reason not to introduce them.
- 0002 - the customizable hashtable itself. It now contains documentation.
- 0003 - convert tidbitmap.c to use the new hashmap. This includes a fix
  for the issue mentioned in [1], to improve performance in heavily lossy
  scenarios (otherwise we could regress in some extreme cases)
- 0004 - convert execGrouping.c, e.g. used by hash aggregates


While not quite perfect yet, I do think this is at a state where input
is needed.  I don't want to go a lot deeper into this rabbit hole,
before we have some agreement that this is a sensible course of action.



So, is it the right time to do some benchmarking?



I think there are several possible additional users of simplehash.h,
e.g. catcache.c - which has an open-coded, not particularly fast hash
table and frequently shows up in profiles - but I think the above two
conversions are plenty to start with.



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Macro customizable hashtable / bitmapscan & aggregation perf

2016-09-30 Thread Andres Freund
Hi,

On 2016-07-26 17:43:33 -0700, Andres Freund wrote:
> In the attached patch I've attached simplehash.h, which can be
> customized by a bunch of macros, before being inlined.  There's also a
> patch using this for tidbitmap.c and nodeAgg/nodeSubplan/... via
> execGrouping.c.

Attached is a significantly improved version of this series.  The main
changes are:

- The hash table now uses robin-hood style hash collision handling. See the
  commit message of 0002 (or simplehash.h) for an introduction. That
  significantly reduces the impact of "clustering" due to linear
  addressing.
- Significant comment and correctness fixes, both in simplehash.h
- itself, and 0003/0004.
- a lot of other random performance improvements for the hashing code.


Unfortunately I'm running out of battery right now, so I don't want to
re-run the benchmarks posted upthread (loading takes a while). But the
last time I ran them all the results after the patches were better than
before.


This patch series currently consists of four patches:
- 0001 - add unlikely/likely() macros. The use here is not entirely
  mandatory, but they do provide a quite consistent improvement. There
  are other threads where they proved to be beneficial, so I see little
  reason not to introduce them.
- 0002 - the customizable hashtable itself. It now contains documentation.
- 0003 - convert tidbitmap.c to use the new hashmap. This includes a fix
  for the issue mentioned in [1], to improve performance in heavily lossy
  scenarios (otherwise we could regress in some extreme cases)
- 0004 - convert execGrouping.c, e.g. used by hash aggregates


While not quite perfect yet, I do think this is at a state where input
is needed.  I don't want to go a lot deeper into this rabbit hole,
before we have some agreement that this is a sensible course of action.


I think there are several possible additional users of simplehash.h,
e.g. catcache.c - which has an open-coded, not particularly fast hash
table and frequently shows up in profiles - but I think the above two
conversions are plenty to start with.


Comments?


Greetings,

Andres Freund

[1] 
http://archives.postgresql.org/message-id/20160923205843.zcs533sqdzlh4cpo%40alap3.anarazel.de
>From 6e64ea6fa79c265ebadf2260b7cc1ad05befd801 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Sun, 3 Jul 2016 15:05:18 -0700
Subject: [PATCH 1/4] Add likely/unlikely() branch hint macros.

These are useful for hot code paths. Because it's easy to guess wrongly
about likelihood, and because such likelihoods change over time, they
should be used sparingly.
---
 src/include/c.h | 16 
 1 file changed, 16 insertions(+)

diff --git a/src/include/c.h b/src/include/c.h
index 4ab3f80..3a77107 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -939,6 +939,22 @@ typedef NameData *Name;
 #endif
 
 
+/*
+ * Hints to the compiler about the likelihood of a branch. Both likely() and
+ * unlikely() return the boolean value of the contained expression.
+ *
+ * These should only be used sparingly, in very hot code paths. It's very easy
+ * to mis-estimate likelihoods.
+ */
+#if __GNUC__ >= 3
+#define likely(x)	__builtin_expect((x) != 0, 1)
+#define unlikely(x)	__builtin_expect((x) != 0, 0)
+#else
+#define likely(x)	((x) != 0)
+#define unlikely(x)	((x) != 0)
+#endif
+
+
 /* 
  *Section 8:	random stuff
  * 
-- 
2.9.3
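As a usage illustration of the macros added by patch 0001 (the `sum_buffer` function below is hypothetical, not from the patch), the intended pattern is to annotate rarely-taken branches, such as error checks on hot paths, so the compiler lays out the common case as the fall-through:

```c
#include <assert.h>
#include <stddef.h>

/* The macros from the patch, reproduced so this sketch is self-contained. */
#if __GNUC__ >= 3
#define likely(x)	__builtin_expect((x) != 0, 1)
#define unlikely(x)	__builtin_expect((x) != 0, 0)
#else
#define likely(x)	((x) != 0)
#define unlikely(x)	((x) != 0)
#endif

/* Hypothetical hot-path function: the NULL check is expected cold, so
 * it is marked unlikely(); the loop body stays on the straight-line path. */
static int
sum_buffer(const int *buf, size_t len)
{
	int sum = 0;

	if (unlikely(buf == NULL))
		return -1;				/* cold error path */
	for (size_t i = 0; i < len; i++)
		sum += buf[i];
	return sum;
}
```

On non-GCC compilers the macros reduce to a plain boolean test, so the annotation is free to sprinkle but, as the commit message warns, easy to mis-estimate.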

>From 2e9425196a203fde28b022e8d742c44d0d6a5e70 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Tue, 19 Jul 2016 14:10:28 -0700
Subject: [PATCH 2/4] Add a macro customizable hashtable.

dynahash.c hash tables aren't quite fast enough for some
use-cases. There are several reasons for lacking performance:
- the use of chaining for collision handling makes them cache
  inefficient, that's especially an issue when the tables get bigger.
- as the element sizes for dynahash are only determined at runtime,
  offset computations are somewhat expensive
- hash and element comparisons are indirect function calls, causing
  unnecessary pipeline stalls
- its two-level structure has some benefits (somewhat natural
  partitioning), but increases the number of indirections

To fix several of these, the hash tables have to be adjusted to the
individual use-case at compile-time. C unfortunately doesn't provide a
good way to do compile-time code generation (unlike e.g. C++'s templates,
for all their weaknesses).  Thus the somewhat ugly approach taken here is
to allow for code generation using a macro-templatized header file,
which generates functions and types based on a prefix and other
parameters.
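A stripped-down illustration of that macro-template technique (invented names, not simplehash.h itself): token pasting builds prefix-qualified type and function names, so each instantiation gets concretely-typed, inlinable code with compile-time-known element sizes and no indirect calls.

```c
#include <assert.h>

/* Two-level expansion so macro arguments are expanded before pasting. */
#define MAKE_NAME_(a, b) a##_##b
#define MAKE_NAME(prefix, name) MAKE_NAME_(prefix, name)

/* --- what would live in the template header ------------------------ */
#define DEFINE_PAIR(PREFIX, TYPE)						\
	typedef struct MAKE_NAME(PREFIX, pair)				\
	{													\
		TYPE		first;								\
		TYPE		second;								\
	} MAKE_NAME(PREFIX, pair);							\
														\
	static inline TYPE									\
	MAKE_NAME(PREFIX, sum)(MAKE_NAME(PREFIX, pair) p)	\
	{													\
		return p.first + p.second;						\
	}

/* --- what a caller like tidbitmap.c or execGrouping.c would do ----- */
DEFINE_PAIR(intpair, int)		/* generates intpair_pair, intpair_sum */
DEFINE_PAIR(dblpair, double)	/* generates dblpair_pair, dblpair_sum */
```

simplehash.h does the same thing at larger scale, with the caller defining SH_PREFIX, SH_ELEMENT_TYPE, SH_SCOPE etc. before including the header rather than invoking one big macro.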

Later patches use this infrastructure to use such hash tables for
tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation,
...). In queries where these use up a large fraction of the time, this
has been measured to lead to performance improvements of over 

Re: [HACKERS] Macro customizable hashtable / bitmapscan & aggregation perf

2016-07-27 Thread Andres Freund
On 2016-07-27 10:04:52 -0400, Robert Haas wrote:
> On Tue, Jul 26, 2016 at 8:43 PM, Andres Freund  wrote:
> > As previously mentioned, dynahash isn't particularly fast. In many cases
> > that doesn't particularly matter. But e.g. hash aggregation and bitmap
> > index scans are quite sensitive to hash performance.
> >
> > The biggest problems with dynahash (also discussed at [1]) are
> >
> > a) two level structure doubling the amount of indirect lookups
> > b) indirect function calls
> > c) using separate chaining based conflict resolution
> > d) being too general.
> 
> I am ... skeptical about open addressing.  It's less unappealing for
> this application than in general because we don't actually need to
> delete anything, but that wouldn't be true for the catcache.

I don't think deletions really change the picture for open addressing -
you don't need tombstones; you can instead move elements closer to
their origin at deletion. I just hadn't implemented that yet, because it
didn't seem like a crucial point.
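The tombstone-free deletion described here (backward-shift deletion) can be sketched with a toy linear-probing table (illustrative layout, not the patch's code): after emptying the deleted bucket, each following entry in the probe cluster shifts one slot back toward its home, stopping at an empty bucket or an entry already at its home.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TB_SIZE 8				/* power of two, masking works */

typedef struct { uint32_t key; bool used; } Bucket;

static uint32_t home_of(uint32_t key) { return key & (TB_SIZE - 1); }

/* Plain linear-probing insert; assumes the table never fills up. */
static void
table_insert(Bucket *tb, uint32_t key)
{
	uint32_t pos = home_of(key);

	while (tb[pos].used)
		pos = (pos + 1) & (TB_SIZE - 1);
	tb[pos].key = key;
	tb[pos].used = true;
}

static bool
table_contains(const Bucket *tb, uint32_t key)
{
	uint32_t pos = home_of(key);

	while (tb[pos].used)
	{
		if (tb[pos].key == key)
			return true;
		pos = (pos + 1) & (TB_SIZE - 1);
	}
	return false;
}

/* Backward-shift delete: no tombstone, the cluster is compacted instead. */
static void
table_delete(Bucket *tb, uint32_t key)
{
	uint32_t pos = home_of(key);

	while (tb[pos].used && tb[pos].key != key)
		pos = (pos + 1) & (TB_SIZE - 1);
	if (!tb[pos].used)
		return;					/* key not present */

	tb[pos].used = false;
	for (;;)
	{
		uint32_t next = (pos + 1) & (TB_SIZE - 1);

		/* stop at cluster end, or at an entry already in its home slot */
		if (!tb[next].used || home_of(tb[next].key) == next)
			break;
		tb[pos] = tb[next];
		tb[next].used = false;
		pos = next;
	}
}
```

Because lookups still terminate at the first empty bucket, probe sequences stay correct after any number of deletions, which is the property tombstones exist to preserve.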


Unfortunately open addressing, particularly the linear kind, has such vastly
superior cache access patterns that it's hard to come close with a
chained hash table. I've previously tried a chained hash table, and the
performance gains were a lot smaller.  For that reason many hash-table
implementations used in performance-critical paths appear to use open
addressing.


Don't get me wrong, I do certainly think that there are good cases for
using separate chaining in places where performance is less of a
concern, and possibly when you want lock-free concurrency.


> All the same, I feel that we need to assess the risk that we're
> improving average-case performance while creating large regressions in
> other cases (i.e. where there is clustering).

The actual algorithmic worst-case complexity isn't different. It's
obviously important to resize appropriately. It's not that hard to beat
dynahash's memory efficiency, so fillfactors don't have to be kept super
high to not regress in memory usage.


> Do likely() and unlikely() actually buy us anything here?

It's only a few percent here, but overall, especially when used around
error checks, it's above 10%. The reason I used it here is primarily that
it made the assembly easier to read for me... ;)

Greetings,

Andres Freund




Re: [HACKERS] Macro customizable hashtable / bitmapscan & aggregation perf

2016-07-27 Thread Robert Haas
On Tue, Jul 26, 2016 at 8:43 PM, Andres Freund  wrote:
> As previously mentioned, dynahash isn't particularly fast. In many cases
> that doesn't particularly matter. But e.g. hash aggregation and bitmap
> index scans are quite sensitive to hash performance.
>
> The biggest problems with dynahash (also discussed at [1]) are
>
> a) two level structure doubling the amount of indirect lookups
> b) indirect function calls
> c) using separate chaining based conflict resolution
> d) being too general.

I am ... skeptical about open addressing.  It's less unappealing for
this application than in general because we don't actually need to
delete anything, but that wouldn't be true for the catcache.  All the
same, I feel that we need to assess the risk that we're improving
average-case performance while creating large regressions in other
cases (i.e. where there is clustering).

Do likely() and unlikely() actually buy us anything here?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

