Re: Default setting for enable_hashagg_disk

2020-07-29 Thread Peter Geoghegan
On Mon, Jul 27, 2020 at 5:55 PM Peter Geoghegan  wrote:
> Pushed the hashagg_avoid_disk_plan patch -- thanks!

Pushed the hash_mem_multiplier patch as well -- thanks again!

As I've said before, I am not totally opposed to adding a true escape
hatch. That has not proven truly necessary just yet. For now, my
working assumption is that the problem on the table has been
addressed.
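For the archives, here is a minimal illustration of how the new GUC
interacts with work_mem (the values are arbitrary, and lineitem is just the
TPC-H table used elsewhere in this thread):

    -- hash-based operations may now use work_mem * hash_mem_multiplier
    -- before spilling; sort-based operations still get plain work_mem
    SET work_mem = '64MB';
    SET hash_mem_multiplier = 2.0;  -- hash agg budget becomes ~128MB
    EXPLAIN (ANALYZE)
    SELECT l_partkey, avg(l_quantity) FROM lineitem GROUP BY l_partkey;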

-- 
Peter Geoghegan




Re: Default setting for enable_hashagg_disk

2020-07-29 Thread Jeff Davis
On Sat, 2020-07-25 at 17:52 -0700, Peter Geoghegan wrote:
> BTW, your HLL patch ameliorates the problem with my extreme "sorted vs
> random input" test case from this morning [1] (the thing that I just
> discussed with Tomas). Without the HLL patch the sorted case had 2424
> batches. With the HLL patch it has 20. That at least seems like a
> notable improvement.

Committed.

Though I did notice some overhead for spilled-but-still-in-memory cases
due to addHyperLogLog() itself. It seems that it can be mostly
eliminated with [1], though I'll wait to see if there's an objection
because that would affect other users of HLL.

Regards,
Jeff Davis

[1] 
https://www.postgresql.org/message-id/17068336d300fab76dd6131cbe1996df450dde38.ca...@j-davis.com






Re: Default setting for enable_hashagg_disk

2020-07-27 Thread Peter Geoghegan
On Mon, Jul 27, 2020 at 5:10 PM Jeff Davis  wrote:
> I noticed that one of the conditionals, "cheapest_total_path != NULL",
> was already redundant with the outer conditional before your patch. I
> guess that was just a mistake which your patch corrects along the way?

Makes sense.

> Anyway, the patch looks good to me. We can have a separate discussion
> about pessimizing the costing, if necessary.

Pushed the hashagg_avoid_disk_plan patch -- thanks!

-- 
Peter Geoghegan




Re: Default setting for enable_hashagg_disk

2020-07-27 Thread Jeff Davis
On Mon, 2020-07-27 at 11:30 -0700, Peter Geoghegan wrote:
> The v4-0001-Remove-hashagg_avoid_disk_plan-GUC.patch changes are
> surprisingly complicated. It would be nice if you could take a look at
> that aspect (or confirm that it's included in your review).

I noticed that one of the conditionals, "cheapest_total_path != NULL",
was already redundant with the outer conditional before your patch. I
guess that was just a mistake which your patch corrects along the way?

Anyway, the patch looks good to me. We can have a separate discussion
about pessimizing the costing, if necessary.

Regards,
Jeff Davis






Re: Default setting for enable_hashagg_disk

2020-07-27 Thread Peter Geoghegan
On Mon, Jul 27, 2020 at 12:52 PM Alvaro Herrera
 wrote:
> On 2020-Jul-27, Peter Geoghegan wrote:
> > The v4-0001-Remove-hashagg_avoid_disk_plan-GUC.patch changes are
> > surprisingly complicated. It would be nice if you could take a look at
> > that aspect (or confirm that it's included in your review).
>
> I think you mean "it replaces surprisingly complicated code with
> straightforward code".  Right?  Because in the previous code, there was
> a lot of effort going into deciding whether the path needed to be
> generated; the new code just generates the path always.

Yes, that's what I meant.

It's a bit tricky. For example, I have removed a redundant
"cheapest_total_path != NULL" test in create_partial_grouping_paths()
(two, actually). But these two tests were always redundant. I have to
wonder if I missed the point. Though it seems likely that that was
just an accident. Accretions of code over time made the code work like
that; nothing more.

-- 
Peter Geoghegan




Re: Default setting for enable_hashagg_disk

2020-07-27 Thread Alvaro Herrera
On 2020-Jul-27, Peter Geoghegan wrote:

> The v4-0001-Remove-hashagg_avoid_disk_plan-GUC.patch changes are
> surprisingly complicated. It would be nice if you could take a look at
> that aspect (or confirm that it's included in your review).

I think you mean "it replaces surprisingly complicated code with
straightforward code".  Right?  Because in the previous code, there was
a lot of effort going into deciding whether the path needed to be
generated; the new code just generates the path always.

Similarly the code to decide allow_hash in create_distinct_path, which
used to be nontrivial, could (if you wanted) be simplified down to a
single boolean condition.  Previously, it was nontrivial only because
it needed to consider memory usage -- not anymore.

But maybe you're talking about something more subtle that I'm just too
blind to see.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Default setting for enable_hashagg_disk

2020-07-27 Thread Peter Geoghegan
On Mon, Jul 27, 2020 at 11:24 AM Alvaro Herrera
 wrote:
> > Are you proposing that I just put the prototype in miscadmin.h, while
> > leaving the implementation where it is (in nodeHash.c)?
>
> Yes, that's in the part of my reply you didn't quote:
>
> : It remains strange to have the function in executor
> : implementation, but I don't offhand see a better place, so maybe it's
> : okay where it is.

Got it.

I tried putting the prototype in miscadmin.h, and I now agree that
that's the best way to do it -- that's how I do it in the attached
revision. No other changes.

The v4-0001-Remove-hashagg_avoid_disk_plan-GUC.patch changes are
surprisingly complicated. It would be nice if you could take a look at
that aspect (or confirm that it's included in your review).

-- 
Peter Geoghegan


v4-0001-Remove-hashagg_avoid_disk_plan-GUC.patch
Description: Binary data


v4-0002-Add-hash_mem_multiplier-GUC.patch
Description: Binary data


Re: Default setting for enable_hashagg_disk

2020-07-27 Thread Alvaro Herrera
On 2020-Jul-27, Peter Geoghegan wrote:

> On Mon, Jul 27, 2020 at 10:30 AM Alvaro Herrera
>  wrote:
> > On 2020-Jul-23, Peter Geoghegan wrote:
> > I notice you put the prototype for get_hash_mem in nodeHash.h.  This
> > would be fine if not for the fact that the optimizer needs to call the
> > function too, which means the optimizer now has to include executor headers
> > -- not a great thing.  I'd move the prototype elsewhere to avoid this,
> > and I think miscadmin.h is a decent place for the prototype, next to
> > work_mem and m_w_m.
> 
> The location of get_hash_mem() is awkward,

Yes.

> but there is no obvious alternative.

Agreed.

> Are you proposing that I just put the prototype in miscadmin.h, while
> leaving the implementation where it is (in nodeHash.c)?

Yes, that's in the part of my reply you didn't quote:

: It remains strange to have the function in executor
: implementation, but I don't offhand see a better place, so maybe it's
: okay where it is.

> [...] moving the implementation of get_hash_mem() to either of those
> two files seems worse to me.

Sure.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Default setting for enable_hashagg_disk

2020-07-27 Thread Peter Geoghegan
On Mon, Jul 27, 2020 at 10:30 AM Alvaro Herrera
 wrote:
> On 2020-Jul-23, Peter Geoghegan wrote:
> I notice you put the prototype for get_hash_mem in nodeHash.h.  This
> would be fine if not for the fact that the optimizer needs to call the
> function too, which means the optimizer now has to include executor headers
> -- not a great thing.  I'd move the prototype elsewhere to avoid this,
> and I think miscadmin.h is a decent place for the prototype, next to
> work_mem and m_w_m.

The location of get_hash_mem() is awkward, but there is no obvious alternative.

Are you proposing that I just put the prototype in miscadmin.h, while
leaving the implementation where it is (in nodeHash.c)? Maybe that
sounds like an odd question, but bear in mind that the natural place
to put the implementation of a function declared in miscadmin.h is
either utils/init/postinit.c or utils/init/miscinit.c -- moving the
implementation of get_hash_mem() to either of those two files seems
worse to me.

That said, there is an existing oddball case in miscadmin.h, right at
the end -- the two functions whose implementation is in
access/transam/xlog.c. So I can see an argument for adding another
oddball case (i.e. moving the prototype to the end of miscadmin.h
without changing anything else).

> Other than that admittedly trivial complaint, I found nothing to
> complain about in this patch.

Great. Thanks for the review.

My intention is to commit hash_mem_multiplier on Wednesday. We need to
move on from this, and get the release out the door.

-- 
Peter Geoghegan




Re: Default setting for enable_hashagg_disk

2020-07-27 Thread Alvaro Herrera
On 2020-Jul-23, Peter Geoghegan wrote:

> Attached is v3 of the hash_mem_multiplier patch series, which now has
> a preparatory patch that removes hashagg_avoid_disk_plan.

I notice you put the prototype for get_hash_mem in nodeHash.h.  This
would be fine if not for the fact that the optimizer needs to call the
function too, which means the optimizer now has to include executor headers
-- not a great thing.  I'd move the prototype elsewhere to avoid this,
and I think miscadmin.h is a decent place for the prototype, next to
work_mem and m_w_m.  It remains strange to have the function in executor
implementation, but I don't offhand see a better place, so maybe it's
okay where it is.

Other than that admittedly trivial complaint, I found nothing to
complain about in this patch.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Default setting for enable_hashagg_disk

2020-07-27 Thread Peter Geoghegan
On Sun, Jul 26, 2020 at 11:34 AM Tomas Vondra
 wrote:
> That's 1.6GB, if I read it right. Which is more than 200MB ;-)

Sigh. That solves that "mystery": the behavior that my sorted vs
random example exhibited is a known limitation in hash aggs that spill
(and an acceptable one). The memory usage is reported accurately by
EXPLAIN ANALYZE.

-- 
Peter Geoghegan




Re: Default setting for enable_hashagg_disk

2020-07-26 Thread Tomas Vondra

On Sat, Jul 25, 2020 at 05:13:00PM -0700, Peter Geoghegan wrote:

On Sat, Jul 25, 2020 at 5:05 PM Tomas Vondra
 wrote:

I'm not sure what you mean by "reported memory usage doesn't reflect the
space used for transition state"? Surely it does include that, we've
built the memory accounting stuff pretty much exactly to do that.

I think it's pretty clear what's happening - in the sorted case there's
only a single group getting new values at any moment, so when we decide
to spill we'll only add rows to that group and everything else will be
spilled to disk.


Right.


In the unsorted case however we manage to initialize all groups in the
hash table, but at that point the groups are tiny and fit into work_mem.
As we process more and more data the groups grow, but we can't evict
them - at the moment we don't have that capability. So we end up
processing everything in memory, but significantly exceeding work_mem.


work_mem was set to 200MB, which is more than the reported "Peak
Memory Usage: 1605334kB". So either the random case significantly


That's 1.6GB, if I read it right. Which is more than 200MB ;-)


exceeds work_mem and the "Peak Memory Usage" accounting is wrong
(because it doesn't report this excess), or the random case really
doesn't exceed work_mem but has a surprising advantage over the sorted
case.

--
Peter Geoghegan


--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Default setting for enable_hashagg_disk

2020-07-25 Thread Peter Geoghegan
On Sat, Jul 25, 2020 at 5:31 PM Peter Geoghegan  wrote:
> I'm glad that this better principled approach is possible. It's hard
> to judge how much of a problem this really is, though. We'll need to
> think about this aspect some more.

BTW, your HLL patch ameliorates the problem with my extreme "sorted vs
random input" test case from this morning [1] (the thing that I just
discussed with Tomas). Without the HLL patch the sorted case had 2424
batches. With the HLL patch it has 20. That at least seems like a
notable improvement.

[1] 
https://postgr.es/m/CAH2-Wz=ur7MQKpaUZJP=Adtg0TPMx5M_WoNE=ke2vUU=amd...@mail.gmail.com
-- 
Peter Geoghegan




Re: Default setting for enable_hashagg_disk

2020-07-25 Thread Peter Geoghegan
On Sat, Jul 25, 2020 at 4:56 PM Jeff Davis  wrote:
> I wrote a quick patch to use HyperLogLog to estimate the number of
> groups contained in a spill file. It seems to reduce the
> overpartitioning effect, and is a more principled approach than what I
> was doing before.

This pretty much fixes the issue that I observed with overpartitioning.
At least in the sense that the number of partitions grows more
predictably -- even when the number of partitions planned is reduced
the change in the number of batches seems smooth-ish. It "looks nice".

> It does seem to hurt the runtime slightly when spilling to disk in some
> cases. I haven't narrowed down whether this is because we end up
> recursing multiple times, or if it's just more efficient to
> overpartition, or if the cost of doing the HLL itself is significant.

I'm glad that this better principled approach is possible. It's hard
to judge how much of a problem this really is, though. We'll need to
think about this aspect some more.

Thanks
-- 
Peter Geoghegan




Re: Default setting for enable_hashagg_disk

2020-07-25 Thread Peter Geoghegan
On Sat, Jul 25, 2020 at 5:05 PM Tomas Vondra
 wrote:
> I'm not sure what you mean by "reported memory usage doesn't reflect the
> space used for transition state"? Surely it does include that, we've
> built the memory accounting stuff pretty much exactly to do that.
>
> I think it's pretty clear what's happening - in the sorted case there's
> only a single group getting new values at any moment, so when we decide
> to spill we'll only add rows to that group and everything else will be
> spilled to disk.

Right.

> In the unsorted case however we manage to initialize all groups in the
> hash table, but at that point the groups are tiny and fit into work_mem.
> As we process more and more data the groups grow, but we can't evict
> them - at the moment we don't have that capability. So we end up
> processing everything in memory, but significantly exceeding work_mem.

work_mem was set to 200MB, which is more than the reported "Peak
Memory Usage: 1605334kB". So either the random case significantly
exceeds work_mem and the "Peak Memory Usage" accounting is wrong
(because it doesn't report this excess), or the random case really
doesn't exceed work_mem but has a surprising advantage over the sorted
case.

-- 
Peter Geoghegan




Re: Default setting for enable_hashagg_disk

2020-07-25 Thread Tomas Vondra

On Sat, Jul 25, 2020 at 10:07:37AM -0700, Peter Geoghegan wrote:

On Sat, Jul 25, 2020 at 9:39 AM Peter Geoghegan  wrote:

"Peak Memory Usage: 1605334kB"

Hash agg avoids spilling entirely (so the planner gets it right this
time around). It even uses notably less memory.


I guess that this is because the reported memory usage doesn't reflect
the space used for transition state, which is presumably most of the
total -- array_agg() is used in the query.



I'm not sure what you mean by "reported memory usage doesn't reflect the
space used for transition state"? Surely it does include that, we've
built the memory accounting stuff pretty much exactly to do that.

I think it's pretty clear what's happening - in the sorted case there's
only a single group getting new values at any moment, so when we decide
to spill we'll only add rows to that group and everything else will be
spilled to disk.

In the unsorted case however we manage to initialize all groups in the
hash table, but at that point the groups are tiny and fit into work_mem.
As we process more and more data the groups grow, but we can't evict
them - at the moment we don't have that capability. So we end up
processing everything in memory, but significantly exceeding work_mem.


FWIW all my tests are done on the same TPC-H data set clustered by
l_shipdate (so probably random with respect to other columns).


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Default setting for enable_hashagg_disk

2020-07-25 Thread Jeff Davis
On Sat, 2020-07-25 at 13:27 -0700, Peter Geoghegan wrote:
> > It's not clear to me that overpartitioning is a real problem in this
> > case -- but I think the fact that it's causing confusion is enough
> > reason to see if we can fix it.
> 
> I'm not sure about that either.
> 
> FWIW I notice that when I reduce work_mem a little further (to 3MB)
> with the same query, the number of partitions is still 128, while the
> number of run time batches is 16,512 (an increase from 11,456 from 6MB
> work_mem). I notice that 16512/128 is 129, which hints at the nature
> of what's going on with the recursion. I guess it would be ideal if
> the growth in batches was more gradual as I subtract memory.

I wrote a quick patch to use HyperLogLog to estimate the number of
groups contained in a spill file. It seems to reduce the
overpartitioning effect, and is a more principled approach than what I
was doing before.

It does seem to hurt the runtime slightly when spilling to disk in some
cases. I haven't narrowed down whether this is because we end up
recursing multiple times, or if it's just more efficient to
overpartition, or if the cost of doing the HLL itself is significant.

Regards,
Jeff Davis

From 424f611f442e36a91747fd39cd28acba42eb958b Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Sat, 25 Jul 2020 14:29:59 -0700
Subject: [PATCH] HLL

---
 src/backend/executor/nodeAgg.c | 64 ++
 src/include/executor/nodeAgg.h |  2 +-
 2 files changed, 43 insertions(+), 23 deletions(-)

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index b79c845a6b7..25771cb4869 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -245,9 +245,11 @@
 #include "catalog/pg_aggregate.h"
 #include "catalog/pg_proc.h"
 #include "catalog/pg_type.h"
+#include "common/hashfn.h"
 #include "executor/execExpr.h"
 #include "executor/executor.h"
 #include "executor/nodeAgg.h"
+#include "lib/hyperloglog.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
@@ -295,6 +297,14 @@
 #define HASHAGG_READ_BUFFER_SIZE BLCKSZ
 #define HASHAGG_WRITE_BUFFER_SIZE BLCKSZ
 
+/*
+ * HyperLogLog is used for estimating the cardinality of the spilled tuples in
+ * a given partition. 5 bits corresponds to a size of about 32 bytes and a
+ * worst-case error of around 18%. That's effective enough to choose a
+ * reasonable number of partitions when recursing.
+ */
+#define HASHAGG_HLL_BIT_WIDTH 5
+
 /*
  * Estimate chunk overhead as a constant 16 bytes. XXX: should this be
  * improved?
@@ -339,6 +349,7 @@ typedef struct HashAggSpill
 	int64	   *ntuples;		/* number of tuples in each partition */
 	uint32		mask;			/* mask to find partition from hash value */
 	int			shift;			/* after masking, shift by this amount */
+	hyperLogLogState *hll_card;	/* cardinality estimate for contents */
 } HashAggSpill;
 
 /*
@@ -357,6 +368,7 @@ typedef struct HashAggBatch
 	LogicalTapeSet *tapeset;	/* borrowed reference to tape set */
 	int			input_tapenum;	/* input partition tape */
 	int64		input_tuples;	/* number of tuples in this batch */
+	double		input_card;		/* estimated group cardinality */
 } HashAggBatch;
 
 /* used to find referenced colnos */
@@ -409,7 +421,7 @@ static void hashagg_recompile_expressions(AggState *aggstate, bool minslot,
 static long hash_choose_num_buckets(double hashentrysize,
 	long estimated_nbuckets,
 	Size memory);
-static int	hash_choose_num_partitions(uint64 input_groups,
+static int	hash_choose_num_partitions(double input_groups,
 	   double hashentrysize,
 	   int used_bits,
 	   int *log2_npartittions);
@@ -429,10 +441,11 @@ static void hashagg_finish_initial_spills(AggState *aggstate);
 static void hashagg_reset_spill_state(AggState *aggstate);
 static HashAggBatch *hashagg_batch_new(LogicalTapeSet *tapeset,
 	   int input_tapenum, int setno,
-	   int64 input_tuples, int used_bits);
+	   int64 input_tuples, double input_card,
+	   int used_bits);
 static MinimalTuple hashagg_batch_read(HashAggBatch *batch, uint32 *hashp);
 static void hashagg_spill_init(HashAggSpill *spill, HashTapeInfo *tapeinfo,
-			   int used_bits, uint64 input_tuples,
+			   int used_bits, double input_groups,
 			   double hashentrysize);
 static Size hashagg_spill_tuple(AggState *aggstate, HashAggSpill *spill,
 TupleTableSlot *slot, uint32 hash);
@@ -1775,7 +1788,7 @@ hashagg_recompile_expressions(AggState *aggstate, bool minslot, bool nullcheck)
  * substantially larger than the initial value.
  */
 void
-hash_agg_set_limits(double hashentrysize, uint64 input_groups, int used_bits,
+hash_agg_set_limits(double hashentrysize, double input_groups, int used_bits,
 	Size *mem_limit, uint64 *ngroups_limit,
 	int *num_partitions)
 {
@@ -1967,7 +1980,7 @@ hash_choose_num_buckets(double hashentrysize, long ngroups, Size memory)
  * *log2_npartitions to the 

Re: Default setting for enable_hashagg_disk

2020-07-25 Thread Peter Geoghegan
On Sat, Jul 25, 2020 at 1:10 PM Jeff Davis  wrote:
> On Sat, 2020-07-25 at 11:05 -0700, Peter Geoghegan wrote:
> > What worries me a bit is the sharp discontinuities when spilling with
> > significantly less work_mem than the "optimal" amount. For example,
> > with Tomas' TPC-H query (against my smaller TPC-H dataset), I find
> > that setting work_mem to 6MB looks like this:
>
> ...
>
> >  Planned Partitions: 128  Peak Memory Usage: 6161kB  Disk Usage: 2478080kB  HashAgg Batches: 128
>
> ...
>
> >  Planned Partitions: 128  Peak Memory Usage: 5393kB  Disk Usage: 2482152kB  HashAgg Batches: 11456

> It's not clear to me that overpartitioning is a real problem in this
> case -- but I think the fact that it's causing confusion is enough
> reason to see if we can fix it.

I'm not sure about that either.

FWIW I notice that when I reduce work_mem a little further (to 3MB)
with the same query, the number of partitions is still 128, while the
number of run time batches is 16,512 (an increase from 11,456 from 6MB
work_mem). I notice that 16512/128 is 129, which hints at the nature
of what's going on with the recursion. I guess it would be ideal if
the growth in batches was more gradual as I subtract memory.

-- 
Peter Geoghegan




Re: Default setting for enable_hashagg_disk

2020-07-25 Thread Jeff Davis
On Sat, 2020-07-25 at 11:05 -0700, Peter Geoghegan wrote:
> What worries me a bit is the sharp discontinuities when spilling with
> significantly less work_mem than the "optimal" amount. For example,
> with Tomas' TPC-H query (against my smaller TPC-H dataset), I find
> that setting work_mem to 6MB looks like this:

...

>  Planned Partitions: 128  Peak Memory Usage: 6161kB  Disk Usage: 2478080kB  HashAgg Batches: 128

...

>  Planned Partitions: 128  Peak Memory Usage: 5393kB  Disk Usage: 2482152kB  HashAgg Batches: 11456

...

> My guess is that this is because the recursive hash aggregation
> misbehaves in a self-similar fashion once a certain tipping point has
> been reached.

It looks like it might be fairly easy to use HyperLogLog as an
estimator for the recursive step. That should reduce the
overpartitioning, which I believe is the cause of this discontinuity.

It's not clear to me that overpartitioning is a real problem in this
case -- but I think the fact that it's causing confusion is enough
reason to see if we can fix it.

Regards,
Jeff Davis






Re: Default setting for enable_hashagg_disk

2020-07-25 Thread Peter Geoghegan
On Sat, Jul 25, 2020 at 10:23 AM Jeff Davis  wrote:
> There's also another effect at work that can cause the total number of
> batches to be higher for larger work_mem values: when we do recurse, we
> again need to estimate the number of partitions needed. Right now, we
> overestimate the number of partitions needed (to be conservative),
> which leads to a wider fan-out and lots of tiny partitions, and
> therefore more batches.

What worries me a bit is the sharp discontinuities when spilling with
significantly less work_mem than the "optimal" amount. For example,
with Tomas' TPC-H query (against my smaller TPC-H dataset), I find
that setting work_mem to 6MB looks like this:

   ->  HashAggregate  (cost=2700529.47..3020654.22 rows=1815500 width=40) (actual time=21039.788..32278.703 rows=200 loops=1)
         Output: lineitem.l_partkey, (0.2 * avg(lineitem.l_quantity))
         Group Key: lineitem.l_partkey
         Planned Partitions: 128  Peak Memory Usage: 6161kB  Disk Usage: 2478080kB  HashAgg Batches: 128

(And we have a sensible looking number of batches that match the
number of planned partitions with higher work_mem settings, too.)

However, if I set work_mem to 5MB (or less), it looks like this:

   ->  HashAggregate  (cost=2700529.47..3020654.22 rows=1815500 width=40) (actual time=20849.490..37027.533 rows=200 loops=1)
         Output: lineitem.l_partkey, (0.2 * avg(lineitem.l_quantity))
         Group Key: lineitem.l_partkey
         Planned Partitions: 128  Peak Memory Usage: 5393kB  Disk Usage: 2482152kB  HashAgg Batches: 11456

So the number of partitions is still 128, but the number of batches
explodes to 11,456 all at once. My guess is that this is because the
recursive hash aggregation misbehaves in a self-similar fashion once a
certain tipping point has been reached. I expect that the exact nature
of that tipping point is very complicated, and generally dependent on
the dataset, clustering, etc. But I don't think that this kind of
effect will be uncommon.

(FWIW this example requires ~620MB work_mem to complete without
spilling at all -- so it's kind of extreme, though not quite as
extreme as many of the similar test results from Tomas.)

-- 
Peter Geoghegan




Re: Default setting for enable_hashagg_disk

2020-07-25 Thread Jeff Davis
On Thu, 2020-07-23 at 19:33 -0700, Peter Geoghegan wrote:
> That does make it sound like the costs of the hash agg aren't being
> represented. I suppose it isn't clear if this is a costing issue
> because it isn't clear if the execution time performance itself is
> pathological or is instead something that must be accepted as the
> cost
> of spilling the hash agg in a general kind of way.

I have a feeling that this is mostly a costing problem. Sort uses its
memory in two different phases:

  1. when writing the sorted runs, it needs the memory to hold the run
before sorting it, and only a single buffer for the output tape; and
  2. when merging, it needs a lot of read buffers

But in HashAgg, it needs to hold all of the groups in memory *at the
same time* as it needs a lot of output buffers (one for each
partition). This doesn't matter a lot at high values of work_mem,
because the buffers will only be 8MB at most.

I did attempt to cost this properly: hash_agg_set_limits() takes into
account the memory the partitions will use, and the remaining memory is
what's used in cost_agg(). But there's a lot of room for error in
there.

If someone sees an obvious error in the costing, please let me know.
Otherwise, I think it will just take some time to make it better
reflect reality in a variety of cases. For v13, and we will either need
to live with it, or pessimize the costing for HashAgg until we get it
right.

Many costing issues can deal with a lot of slop -- e.g. HashJoin vs
MergeJoin -- because a small factor often doesn't make the difference
between plans. But HashAgg and Sort are more competitive with each
other, so costing needs to be more precise.
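For what it's worth, a rough way to eyeball how close the two estimates are
on a given query is just to force each plan shape in turn and compare the
estimated cost with the actual runtime (only a sketch; the query is the
TPC-H aggregate used elsewhere in this thread):

    SET enable_hashagg = on;
    EXPLAIN (ANALYZE)
    SELECT l_partkey, avg(l_quantity) FROM lineitem GROUP BY l_partkey;

    SET enable_hashagg = off;  -- force the group aggregate + sort plan
    EXPLAIN (ANALYZE)
    SELECT l_partkey, avg(l_quantity) FROM lineitem GROUP BY l_partkey;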

Regards,
Jeff Davis






Re: Default setting for enable_hashagg_disk

2020-07-25 Thread Jeff Davis
On Fri, 2020-07-24 at 10:40 +0200, Tomas Vondra wrote:
> FWIW one more suspicious thing that I forgot to mention is the behavior
> of the "planned partitions" depending on work_mem, which looks like
> this:
> 
>    2MB   Planned Partitions:  64    HashAgg Batches:  4160
>    4MB   Planned Partitions: 128    HashAgg Batches: 16512
>    8MB   Planned Partitions: 256    HashAgg Batches: 21488
>   64MB   Planned Partitions:  32    HashAgg Batches:  2720
>  256MB   Planned Partitions:   8    HashAgg Batches:     8
> 
> I'd expect the number of planned partitions to decrease (slowly) as
> work_mem increases, but it seems to increase initially. Seems a bit
> strange, but maybe it's expected.

The space for open-partition buffers is also limited to about 25% of
memory. Each open partition takes BLCKSZ memory, so those numbers are
exactly what I'd expect (64*8192 = 512kB).

There's also another effect at work that can cause the total number of
batches to be higher for larger work_mem values: when we do recurse, we
again need to estimate the number of partitions needed. Right now, we
overestimate the number of partitions needed (to be conservative),
which leads to a wider fan-out and lots of tiny partitions, and
therefore more batches.

I think we can improve this by using something like a HyperLogLog on
the hash values of the spilled tuples to get a better estimate for the
number of groups (and therefore the number of partitions) that we need
when we recurse, which would reduce the number of overall batches at
higher work_mem settings. But I didn't get a chance to implement that
yet.
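To spell out the arithmetic behind the first three rows of the table quoted
above (a back-of-the-envelope check only, assuming one BLCKSZ = 8kB write
buffer per open partition and the ~25% cap mentioned above):

    SELECT wm_kb AS work_mem_kb,
           (wm_kb * 0.25 * 1024 / 8192)::int AS partition_cap
    FROM (VALUES (2048), (4096), (8192)) AS t(wm_kb);
    -- gives 64, 128 and 256, matching the Planned Partitions above; at
    -- 64MB and 256MB the planned counts (32 and 8) are below the cap, so
    -- presumably the group-count estimate decides there instead.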

Regards,
Jeff Davis






Re: Default setting for enable_hashagg_disk

2020-07-25 Thread Peter Geoghegan
On Sat, Jul 25, 2020 at 9:39 AM Peter Geoghegan  wrote:
> "Peak Memory Usage: 1605334kB"
>
> Hash agg avoids spilling entirely (so the planner gets it right this
> time around). It even uses notably less memory.

I guess that this is because the reported memory usage doesn't reflect
the space used for transition state, which is presumably most of the
total -- array_agg() is used in the query.

-- 
Peter Geoghegan




Re: Default setting for enable_hashagg_disk

2020-07-25 Thread Peter Geoghegan
On Fri, Jul 24, 2020 at 12:55 PM Peter Geoghegan  wrote:
> Could that be caused by clustering in the data?
>
> If the input data is in totally random order then we have a good
> chance of never having to spill skewed "common" values. That is, we're
> bound to encounter common values before entering spill mode, and so
> those common values will continue to be usefully aggregated until
> we're done with the initial groups (i.e. until the in-memory hash
> table is cleared in order to process spilled input tuples). This is
> great because the common values get aggregated without ever spilling,
> and most of the work is done before we even begin with spilled tuples.
>
> If, on the other hand, the common values are concentrated together in
> the input...

I still don't know if that was a factor in your example, but I can
clearly demonstrate that the clustering of data can matter a lot to
hash aggs in Postgres 13. I attach a contrived example where it makes
a *huge* difference.

I find that the sorted version of the aggregate query takes
significantly longer to finish, and has the following spill
characteristics:

"Peak Memory Usage: 205086kB  Disk Usage: 2353920kB  HashAgg Batches: 2424"

Note that the planner doesn't expect any partitions here, but we still
get 2424 batches -- so the planner seems to get it totally wrong.
OTOH, the same query against a randomized version of the same data (no
longer in sorted order, no clustering) works perfectly with the same
work_mem (200MB):

"Peak Memory Usage: 1605334kB"

Hash agg avoids spilling entirely (so the planner gets it right this
time around). It even uses notably less memory.
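(The attached test-agg-sorted.sql is not inlined here; the following is
only a hypothetical sketch of the general shape of such a test -- the table
names, row counts and group sizes are made up:)

    -- clustered input: all rows for a given key arrive consecutively
    CREATE TABLE agg_sorted AS
      SELECT g / 100 AS k, repeat('x', 30) AS v
      FROM generate_series(1, 10000000) g
      ORDER BY g;

    -- the same rows, shuffled
    CREATE TABLE agg_random AS
      SELECT k, v FROM agg_sorted ORDER BY random();

    SET work_mem = '200MB';
    EXPLAIN ANALYZE SELECT k, array_agg(v) FROM agg_sorted GROUP BY k;
    EXPLAIN ANALYZE SELECT k, array_agg(v) FROM agg_random GROUP BY k;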

-- 
Peter Geoghegan


test-agg-sorted.sql
Description: Binary data


Re: Default setting for enable_hashagg_disk

2020-07-25 Thread Jeff Davis
On Fri, 2020-07-24 at 21:16 +0200, Tomas Vondra wrote:
> Surely more recursive spilling should do more I/O, but the Disk Usage
> reported by explain analyze does not show anything like ...

I suspect that's because of disk reuse in logtape.c. 

Regards,
Jeff Davis






Re: Default setting for enable_hashagg_disk

2020-07-24 Thread Peter Geoghegan
On Fri, Jul 24, 2020 at 12:16 PM Tomas Vondra
 wrote:
> Maybe, but we're nowhere close to these limits. See this table which I
> posted earlier:
>
>    2MB   Planned Partitions:  64    HashAgg Batches:  4160
>    4MB   Planned Partitions: 128    HashAgg Batches: 16512
>    8MB   Planned Partitions: 256    HashAgg Batches: 21488
>   64MB   Planned Partitions:  32    HashAgg Batches:  2720
>  256MB   Planned Partitions:   8    HashAgg Batches:     8
>
> This is from the non-parallel runs on the i5 machine with 32GB data set,
> the first column is work_mem. We're nowhere near the 1024 limit, and the
> cardinality estimates are pretty good.
>
> OTOH the number of batches is much higher, so clearly there was some
> recursive spilling happening. What I find strange is that this grows
> with work_mem and only starts dropping after 64MB.

Could that be caused by clustering in the data?

If the input data is in totally random order then we have a good
chance of never having to spill skewed "common" values. That is, we're
bound to encounter common values before entering spill mode, and so
those common values will continue to be usefully aggregated until
we're done with the initial groups (i.e. until the in-memory hash
table is cleared in order to process spilled input tuples). This is
great because the common values get aggregated without ever spilling,
and most of the work is done before we even begin with spilled tuples.

If, on the other hand, the common values are concentrated together in
the input...

Assuming that I have this right, then I would also expect simply
having more memory to ameliorate the problem. If you only have/need 4
or 8 partitions then you can fit a higher proportion of the total
number of groups for the whole dataset in the hash table (at the point
when you first enter spill mode). I think it follows that the "nailed"
hash table entries/groupings will "better characterize" the dataset as
a whole.

> Also, how could the amount of I/O be almost constant in all these cases?
> Surely more recursive spilling should do more I/O, but the Disk Usage
> reported by explain analyze does not show anything like ...

Not sure, but might that just be because of the fact that logtape.c
can recycle disk space?

As I said in my last e-mail, it's pretty reasonable to assume that the
vast majority of external sorts are one-pass. It follows that disk
usage can be thought of as almost the same thing as total I/O for
tuplesort. But the same heuristic isn't reasonable when thinking about
hash agg. Hash agg might write out much less data than the total
memory used for the equivalent "peak optimal nospill" hash agg case --
or much more. (Again, reiterating what I said in my last e-mail.)

-- 
Peter Geoghegan




Re: Default setting for enable_hashagg_disk

2020-07-24 Thread Tomas Vondra

On Fri, Jul 24, 2020 at 11:03:54AM -0700, Peter Geoghegan wrote:

On Fri, Jul 24, 2020 at 8:19 AM Robert Haas  wrote:

This is all really good analysis, I think, but this seems like the key
finding. It seems like we don't really understand what's actually
getting written. Whether we use hash or sort doesn't seem like it
should have this kind of impact on how much data gets written, and
whether we use CP_SMALL_TLIST or project when needed doesn't seem like
it should matter like this either.


Isn't this more or less the expected behavior in the event of
partitions that are spilled recursively? The cases that Tomas tested
were mostly cases where work_mem was tiny relative to the data being
aggregated.

The following is an extract from commit 1f39bce0215 showing some stuff
added to the beginning of nodeAgg.c:

+ * We also specify a min and max number of partitions per spill. Too few might
+ * mean a lot of wasted I/O from repeated spilling of the same tuples. Too
+ * many will result in lots of memory wasted buffering the spill files (which
+ * could instead be spent on a larger hash table).
+ */
+#define HASHAGG_PARTITION_FACTOR 1.50
+#define HASHAGG_MIN_PARTITIONS 4
+#define HASHAGG_MAX_PARTITIONS 1024



Maybe, but we're nowhere close to these limits. See this table which I
posted earlier:

   2MB   Planned Partitions:  64    HashAgg Batches:  4160
   4MB   Planned Partitions: 128    HashAgg Batches: 16512
   8MB   Planned Partitions: 256    HashAgg Batches: 21488
  64MB   Planned Partitions:  32    HashAgg Batches:  2720
 256MB   Planned Partitions:   8    HashAgg Batches:     8

This is from the non-parallel runs on the i5 machine with 32GB data set,
the first column is work_mem. We're nowhere near the 1024 limit, and the
cardinality estimates are pretty good.

OTOH the number of batches is much higher, so clearly there was some
recursive spilling happening. What I find strange is that this grows
with work_mem and only starts dropping after 64MB.

Also, how could the amount of I/O be almost constant in all these cases?
Surely more recursive spilling should do more I/O, but the Disk Usage
reported by explain analyze does not show anything like ...


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Default setting for enable_hashagg_disk

2020-07-24 Thread Peter Geoghegan
On Fri, Jul 24, 2020 at 1:40 AM Tomas Vondra
 wrote:
> Maybe, not sure what exactly you think is pathological? The trouble is
> hashagg has to spill input tuples but the memory used in no-spill case
> represents aggregated groups, so I'm not sure how you could extrapolate
> from that ...

Yeah, but when hash agg enters spill mode it will continue to advance
the transition states for groups already in the hash table, which
could be quite a significant effect. The peak memory usage for an
equivalent no-spill hash agg is therefore kind of related to the
amount of I/O needed for spilling.

I suppose that you mostly tested cases where memory was in very short
supply, where that breaks down completely. Perhaps it wasn't helpful
for me to bring that factor into this discussion -- it's not as if
there is any doubt that hash agg is spilling a lot more here in any
case.

> Not sure, but I think we need to spill roughly as much as sort, so it
> seems a bit strange that we're spilling 2x as much data and yet the
> cost is so much lower.

ISTM that the amount of I/O that hash agg performs can vary *very*
widely for the same data. This is mostly determined by work_mem, but
there are second order effects. OTOH, the amount of I/O that a sort
must do is practically fixed. You can quibble with that
characterisation a bit because of multi-pass sorts, but not really --
multi-pass sorts are generally quite rare.

I think that we need a more sophisticated cost model for this in
cost_agg(). Maybe the "pages_written" calculation could be pessimized.
However, it doesn't seem like this is precisely an issue with I/O
costs.

--
Peter Geoghegan




Re: Default setting for enable_hashagg_disk

2020-07-24 Thread Peter Geoghegan
On Fri, Jul 24, 2020 at 8:19 AM Robert Haas  wrote:
> This is all really good analysis, I think, but this seems like the key
> finding. It seems like we don't really understand what's actually
> getting written. Whether we use hash or sort doesn't seem like it
> should have this kind of impact on how much data gets written, and
> whether we use CP_SMALL_TLIST or project when needed doesn't seem like
> it should matter like this either.

Isn't this more or less the expected behavior in the event of
partitions that are spilled recursively? The cases that Tomas tested
were mostly cases where work_mem was tiny relative to the data being
aggregated.

The following is an extract from commit 1f39bce0215 showing some stuff
added to the beginning of nodeAgg.c:

+ * We also specify a min and max number of partitions per spill. Too few might
+ * mean a lot of wasted I/O from repeated spilling of the same tuples. Too
+ * many will result in lots of memory wasted buffering the spill files (which
+ * could instead be spent on a larger hash table).
+ */
+#define HASHAGG_PARTITION_FACTOR 1.50
+#define HASHAGG_MIN_PARTITIONS 4
+#define HASHAGG_MAX_PARTITIONS 1024

-- 
Peter Geoghegan




Re: Default setting for enable_hashagg_disk

2020-07-24 Thread Tomas Vondra

On Fri, Jul 24, 2020 at 11:18:48AM -0400, Robert Haas wrote:

On Thu, Jul 23, 2020 at 9:22 PM Tomas Vondra
 wrote:

                         2MB     4MB     8MB    64MB   256MB
 ------------------------------------------------------------
 hash                   6.71    6.70    6.73    6.44    5.81
 hash CP_SMALL_TLIST    5.28    5.26    5.24    5.04    4.54
 sort                   3.41    3.41    3.41    3.57    3.45

So sort writes ~3.4GB of data, give or take. But hashagg/master writes
almost 6-7GB of data, i.e. almost twice as much. Meanwhile, with the
original CP_SMALL_TLIST we'd write "only" ~5GB of data. That's still
much more than the 3.4GB of data written by sort (which has to spill
everything, while hashagg only spills rows not covered by the groups
that fit into work_mem).

I initially assumed this is due to writing the hash value to the tapes,
and the rows are fairly narrow (only about 40B per row), so a 4B hash
could make a difference - but certainly not this much. Moreover, that
does not explain the difference between master and the now-reverted
CP_SMALL_TLIST, I think.


This is all really good analysis, I think, but this seems like the key
finding. It seems like we don't really understand what's actually
getting written. Whether we use hash or sort doesn't seem like it
should have this kind of impact on how much data gets written, and
whether we use CP_SMALL_TLIST or project when needed doesn't seem like
it should matter like this either.



I think for CP_SMALL_TLIST at least some of the extra data can be
attributed to writing the hash value along with the tuple, which sort
obviously does not do. With the 32GB data set (the i5 machine), there
are ~20M rows in the lineitem table, and with 4B hash values that's
about 732MB of extra data. That's about 50% of the difference
between sort and CP_SMALL_TLIST, and I'd dare to speculate the other 50%
is due to LogicalTape internals (pointers to the next block, etc.)

The question is why master has 2x the overhead of CP_SMALL_TLIST, if
it's meant to write the same set of columns etc.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Default setting for enable_hashagg_disk

2020-07-24 Thread Robert Haas
On Thu, Jul 23, 2020 at 9:22 PM Tomas Vondra
 wrote:
>                        2MB     4MB     8MB    64MB   256MB
> -----------------------------------------------------------
>  hash                  6.71    6.70    6.73    6.44    5.81
>  hash CP_SMALL_TLIST   5.28    5.26    5.24    5.04    4.54
>  sort                  3.41    3.41    3.41    3.57    3.45
>
> So sort writes ~3.4GB of data, give or take. But hashagg/master writes
> almost 6-7GB of data, i.e. almost twice as much. Meanwhile, with the
> original CP_SMALL_TLIST we'd write "only" ~5GB of data. That's still
> much more than the 3.4GB of data written by sort (which has to spill
> everything, while hashagg only spills rows not covered by the groups
> that fit into work_mem).
>
> I initially assumed this is due to writing the hash value to the tapes,
> and the rows are fairly narrow (only about 40B per row), so a 4B hash
> could make a difference - but certainly not this much. Moreover, that
> does not explain the difference between master and the now-reverted
> CP_SMALL_TLIST, I think.

This is all really good analysis, I think, but this seems like the key
finding. It seems like we don't really understand what's actually
getting written. Whether we use hash or sort doesn't seem like it
should have this kind of impact on how much data gets written, and
whether we use CP_SMALL_TLIST or project when needed doesn't seem like
it should matter like this either.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Default setting for enable_hashagg_disk

2020-07-24 Thread Tomas Vondra

On Fri, Jul 24, 2020 at 10:40:47AM +0200, Tomas Vondra wrote:

On Thu, Jul 23, 2020 at 07:33:45PM -0700, Peter Geoghegan wrote:

On Thu, Jul 23, 2020 at 6:22 PM Tomas Vondra
 wrote:

So let me share some fresh I/O statistics collected on the current code
using iosnoop. I've done the tests on two different machines using the
"aggregate part" of TPC-H Q17, i.e. essentially this:

SELECT * FROM (
   SELECT
   l_partkey AS agg_partkey,
   0.2 * avg(l_quantity) AS avg_quantity
   FROM lineitem GROUP BY l_partkey OFFSET 10
) part_agg;

The OFFSET is there just to ensure we don't need to send anything to
the client, etc.


Thanks for testing this.


So sort writes ~3.4GB of data, give or take. But hashagg/master writes
almost 6-7GB of data, i.e. almost twice as much. Meanwhile, with the
original CP_SMALL_TLIST we'd write "only" ~5GB of data. That's still
much more than the 3.4GB of data written by sort (which has to spill
everything, while hashagg only spills rows not covered by the groups
that fit into work_mem).


What I find when I run your query (with my own TPC-H DB that is
smaller than what you used here -- 59,986,052 lineitem tuples) is that
the sort required about 7x more memory than the hash agg to do
everything in memory: 4,384,711KB for the quicksort vs 630,801KB peak
hash agg memory usage. I'd be surprised if the ratio was very
different for you -- but can you check?



I can check, but it's not quite clear to me what we are looking for.
Increase work_mem until there's no need to spill in either case?



FWIW the hashagg needs about 4775953kB and the sort 33677586kB. So yeah,
that's about 7x more. I think that's probably built into the TPC-H data
set. It'd be easy to construct cases with much higher/lower factors.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
                                    QUERY PLAN
---------------------------------------------------------------------------------
 Limit  (cost=15493271.13..15493271.13 rows=1 width=36) (actual time=331351.099..331351.099 rows=0 loops=1)
   ->  HashAggregate  (cost=15186647.64..15493271.13 rows=20441566 width=36) (actual time=318190.465..330956.383 rows=1500 loops=1)
         Group Key: lineitem.l_partkey
         Peak Memory Usage: 4775953kB
         ->  Seq Scan on lineitem  (cost=0.00..12936556.76 rows=450018176 width=9) (actual time=0.156..56051.850 rows=450019701 loops=1)
 Planning Time: 0.151 ms
 Execution Time: 331931.239 ms
(7 rows)


                                    QUERY PLAN
---------------------------------------------------------------------------------
 Limit  (cost=81298097.02..81298097.02 rows=1 width=36) (actual time=415344.639..415344.639 rows=0 loops=1)
   ->  GroupAggregate  (cost=77616337.21..81298097.02 rows=20441566 width=36) (actual time=209292.469..414951.954 rows=1500 loops=1)
         Group Key: lineitem.l_partkey
         ->  Sort  (cost=77616337.21..78741382.65 rows=450018176 width=9) (actual time=209292.435..329583.999 rows=450019701 loops=1)
               Sort Key: lineitem.l_partkey
               Sort Method: quicksort  Memory: 33677586kB
               ->  Seq Scan on lineitem  (cost=0.00..12936556.76 rows=450018176 width=9) (actual time=0.096..72474.733 rows=450019701 loops=1)
 Planning Time: 0.145 ms
 Execution Time: 417157.598 ms
(9 rows)


Re: Default setting for enable_hashagg_disk

2020-07-24 Thread Tomas Vondra

On Thu, Jul 23, 2020 at 07:33:45PM -0700, Peter Geoghegan wrote:

On Thu, Jul 23, 2020 at 6:22 PM Tomas Vondra
 wrote:

So let me share some fresh I/O statistics collected on the current code
using iosnoop. I've done the tests on two different machines using the
"aggregate part" of TPC-H Q17, i.e. essentially this:

 SELECT * FROM (
SELECT
l_partkey AS agg_partkey,
0.2 * avg(l_quantity) AS avg_quantity
FROM lineitem GROUP BY l_partkey OFFSET 10
 ) part_agg;

The OFFSET is there just to ensure we don't need to send anything to
the client, etc.


Thanks for testing this.


So sort writes ~3.4GB of data, give or take. But hashagg/master writes
almost 6-7GB of data, i.e. almost twice as much. Meanwhile, with the
original CP_SMALL_TLIST we'd write "only" ~5GB of data. That's still
much more than the 3.4GB of data written by sort (which has to spill
everything, while hashagg only spills rows not covered by the groups
that fit into work_mem).


What I find when I run your query (with my own TPC-H DB that is
smaller than what you used here -- 59,986,052 lineitem tuples) is that
the sort required about 7x more memory than the hash agg to do
everything in memory: 4,384,711KB for the quicksort vs 630,801KB peak
hash agg memory usage. I'd be surprised if the ratio was very
different for you -- but can you check?



I can check, but it's not quite clear to me what we are looking for.
Increase work_mem until there's no need to spill in either case?


I think that there is something pathological about this spill
behavior, because it sounds like the precise opposite of what you
might expect when you make a rough extrapolation of what disk I/O will
be based on the memory used in no-spill cases (as reported by EXPLAIN
ANALYZE).



Maybe, not sure what exactly you think is pathological? The trouble is
hashagg has to spill input tuples but the memory used in no-spill case
represents aggregated groups, so I'm not sure how you could extrapolate
from that ...

FWIW one more suspicious thing that I forgot to mention is the behavior
of the "planned partitions" depending on work_mem, which looks like
this:

   2MB   Planned Partitions:  64    HashAgg Batches:  4160
   4MB   Planned Partitions: 128    HashAgg Batches: 16512
   8MB   Planned Partitions: 256    HashAgg Batches: 21488
  64MB   Planned Partitions:  32    HashAgg Batches:  2720
 256MB   Planned Partitions:   8    HashAgg Batches:     8

I'd expect the number of planned partitions to decrease (slowly) as
work_mem increases, but it seems to increase initially. Seems a bit
strange, but maybe it's expected.


What I find really surprising is the costing - despite writing about
twice as much data, the hashagg cost is estimated to be much lower than
the sort. For example on the i5 machine, the hashagg cost is ~10M, while
sort cost is almost 42M. Despite using almost twice as much disk. And
the costing is exactly the same for master and the CP_SMALL_TLIST.


That does make it sound like the costs of the hash agg aren't being
represented. I suppose it isn't clear if this is a costing issue
because it isn't clear if the execution time performance itself is
pathological or is instead something that must be accepted as the cost
of spilling the hash agg in a general kind of way.



Not sure, but I think we need to spill roughly as much as sort, so it
seems a bit strange that we're spilling 2x as much data and yet the
cost is so much lower.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Default setting for enable_hashagg_disk

2020-07-23 Thread Peter Geoghegan
On Thu, Jul 23, 2020 at 6:22 PM Tomas Vondra
 wrote:
> So let me share some fresh I/O statistics collected on the current code
> using iosnoop. I've done the tests on two different machines using the
> "aggregate part" of TPC-H Q17, i.e. essentially this:
>
>  SELECT * FROM (
> SELECT
> l_partkey AS agg_partkey,
> 0.2 * avg(l_quantity) AS avg_quantity
> FROM lineitem GROUP BY l_partkey OFFSET 10
>  ) part_agg;
>
> The OFFSET is there just to ensure we don't need to send anything to
> the client, etc.

Thanks for testing this.

> So sort writes ~3.4GB of data, give or take. But hashagg/master writes
> almost 6-7GB of data, i.e. almost twice as much. Meanwhile, with the
> original CP_SMALL_TLIST we'd write "only" ~5GB of data. That's still
> much more than the 3.4GB of data written by sort (which has to spill
> everything, while hashagg only spills rows not covered by the groups
> that fit into work_mem).

What I find when I run your query (with my own TPC-H DB that is
smaller than what you used here -- 59,986,052 lineitem tuples) is that
the sort required about 7x more memory than the hash agg to do
everything in memory: 4,384,711KB for the quicksort vs 630,801KB peak
hash agg memory usage. I'd be surprised if the ratio was very
different for you -- but can you check?

I think that there is something pathological about this spill
behavior, because it sounds like the precise opposite of what you
might expect when you make a rough extrapolation of what disk I/O will
be based on the memory used in no-spill cases (as reported by EXPLAIN
ANALYZE).

> What I find really surprising is the costing - despite writing about
> twice as much data, the hashagg cost is estimated to be much lower than
> the sort. For example on the i5 machine, the hashagg cost is ~10M, while
> sort cost is almost 42M. Despite using almost twice as much disk. And
> the costing is exactly the same for master and the CP_SMALL_TLIST.

That does make it sound like the costs of the hash agg aren't being
represented. I suppose it isn't clear if this is a costing issue
because it isn't clear if the execution time performance itself is
pathological or is instead something that must be accepted as the cost
of spilling the hash agg in a general kind of way.

-- 
Peter Geoghegan




Re: Default setting for enable_hashagg_disk

2020-07-23 Thread Peter Geoghegan
On Fri, Jul 17, 2020 at 5:13 PM Jeff Davis  wrote:
> The patch itself looks reasonable to me. I don't see a lot of obvious
> dangers, but perhaps someone would like to take a closer look at the
> planner changes as you suggest.

Attached is v3 of the hash_mem_multiplier patch series, which now has
a preparatory patch that removes hashagg_avoid_disk_plan. What do you
think of this approach, Jeff?

It seems as if removing hashagg_avoid_disk_plan will necessitate
removing various old bits of planner.c that were concerned with
avoiding hash aggs that spill (the bits that hashagg_avoid_disk_plan
skips in the common case where it's turned off). This makes v3-0001-*
a bit trickier than I imagined it would have to be. At least it lowers
the footprint of the hash_mem_multiplier code added by v3-0002-*
(compared to the last version of the patch).

I find the partial group paths stuff added to planner.c by commit
4f15e5d09de rather confusing (that commit was preparatory work for the
main feature commit e2f1eb0e). Hopefully the
hashagg_avoid_disk_plan-removal patch didn't get anything wrong in this
area. Perhaps Robert can comment on this as the committer of record
for partition-wise grouping/aggregation.

I would like to commit this patch series by next week, and close out
the two relevant open items. Separately, I suspect that we'll also
need to update the cost model for hash aggs that spill, but that now
seems like a totally unrelated matter. I'm waiting to hear back from
Tomas about that. Tomas?

Thanks
--
Peter Geoghegan


v3-0001-Remove-hashagg_avoid_disk_plan-GUC.patch
Description: Binary data


v3-0002-Add-hash_mem_multiplier-GUC.patch
Description: Binary data


Re: Default setting for enable_hashagg_disk

2020-07-22 Thread Peter Geoghegan
On Tue, Jul 21, 2020 at 1:30 PM Bruce Momjian  wrote:
> On Tue, Jul 14, 2020 at 03:49:40PM -0700, Peter Geoghegan wrote:
> > Maybe I missed your point here. The problem is not so much that we'll
> > get HashAggs that spill -- there is nothing intrinsically wrong with
> > that. While it's true that the I/O pattern is not as sequential as a
> > similar group agg + sort, that doesn't seem like the really important
> > factor here. The really important factor is that in-memory HashAggs
> > can be blazingly fast relative to *any* alternative strategy -- be it
> > a HashAgg that spills, or a group aggregate + sort that doesn't spill,
> > whatever. We're more concerned about keeping the one available fast
> > strategy than we are about getting a new, generally slow strategy.
>
> Do we have any data that in-memory HashAggs are "blazingly fast relative
> to *any* alternative strategy?"  The data I have tested myself and what
> I saw from Tomas was that spilling sort or spilling hash are both 2.5x
> slower.  Are we sure the quoted statement is true?

I admit that I was unclear in the remarks you quote here. I placed too
much emphasis on the precise cross-over point at which a hash agg that
didn't spill in Postgres 12 spills now. That was important to Andres,
who was concerned about the added I/O, especially with things like
cloud providers [1] -- it's not desirable to go from no I/O to lots of
I/O when upgrading, regardless of how fast your disks for temp files
are. But that was not the point I was trying to make (though it's a
good point, and one that I agree with).

I'll take another shot at it. I'll use with Andres' test case in [1].
Specifically this query (I went with this example because it was
convenient):

SELECT a, array_agg(b) FROM (SELECT generate_series(1, 1)) a(a),
(SELECT generate_series(1, 1)) b(b) GROUP BY a HAVING
array_length(array_agg(b), 1) = 0;

The planner generally prefers a hashagg here, though it's not a
particularly sympathetic case for hash agg. For one thing the input to
the sort is already sorted. For another, there isn't skew. But the
planner seems to have it right, at least when everything fits in
memory, because that takes ~17.6 seconds with a group agg + sort vs
~13.2 seconds with an in-memory hash agg. Importantly, hash agg's peak
memory usage is 1443558kB (once we get to the point that no spilling
is required), whereas for the sort we're using 7833229kB for the
quicksort. Don't forget that in-memory hash agg is using ~5.4x less
memory in this case on account of the way hash agg represents things.
It's faster, and much much more efficient once you take a holistic
view (i.e. something like work done per second per KB of memory).
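
To reproduce the comparison, the recipe is essentially just the
following (the work_mem values are only illustrative -- the point is to
pick settings that let each strategy fit in memory, and then to take
memory away):

SET work_mem = '10GB';     -- enough for the in-memory hash agg (and the quicksort)
EXPLAIN (ANALYZE, BUFFERS)
SELECT a, array_agg(b) FROM (SELECT generate_series(1, 10000)) a(a),
(SELECT generate_series(1, 10000)) b(b) GROUP BY a HAVING
array_length(array_agg(b), 1) = 0;

SET enable_hashagg = off;  -- repeat the EXPLAIN to get the group agg + sort plan
-- then lower work_mem step by step to see where each strategy starts to spill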

Clearly the precise "query operation spills" cross-over point isn't
that relevant to query execution time (on my server with a fast nvme
SSD), because if I give the sort 95% - 99% of the memory it needs to
be an in-memory quicksort then it makes a noticeable difference, but
not a huge difference. I get one big run and one tiny run in the
tuplesort. The query itself takes ~23.4 seconds -- higher than 17.6
seconds, but not all that much higher considering we have to write and
read ~7GB of data. If I try to do approximately the same thing with
hash agg (give it very slightly less than optimal memory) I find that
the difference is smaller -- it takes ~14.1 seconds (up from ~13.2
seconds). It looks like my original remarks are totally wrong so far,
because it's as if the performance hit is entirely explainable as the
extra temp file I/O (right down to the fact that hash agg takes a
smaller hit because it has much less to write out to disk). But let's
keep going.

= Sort vs Hash =

We'll focus on how the group agg + sort case behaves as we take memory
away. What I notice is that it literally doesn't matter how much
memory I take away any more (now that the sort has started to spill).
I said that it was ~23.4 seconds when we have two runs, but if I keep
taking memory away so that we get 10 runs it takes 23.2 seconds. If
there are 36 runs it takes 22.8 seconds. And if there are 144 runs
(work_mem is 50MB, down from the "optimal" required for the sort to be
internal, ~7GB) then it takes 21.9 seconds. So it gets slightly
faster, not slower. We really don't need very much memory to do the
sort in one pass, and it pretty much doesn't matter how many runs we
need to merge provided it doesn't get into the thousands, which is
quite rare (when random I/O from multiple passes finally starts to
bite).

Now for hash agg -- this is where it gets interesting. If we give it
about half the memory it needs (work_mem 700MB) we still have 4
batches and it hardly changes -- it takes 19.8 seconds, which is
slower than the 4 batch case that took 14.1 seconds but not that
surprising. 300MB still gets 4 batches which now takes ~23.5 seconds.
200MB gets 2424 batches and takes ~27.7 seconds -- a big jump! With
100MB it takes ~31.1 seconds (3340 batches). 50MB it's ~32.8 seconds

Re: Default setting for enable_hashagg_disk

2020-07-22 Thread Robert Haas
On Tue, Jul 14, 2020 at 6:49 PM Peter Geoghegan  wrote:
> Maybe I missed your point here. The problem is not so much that we'll
> get HashAggs that spill -- there is nothing intrinsically wrong with
> that. While it's true that the I/O pattern is not as sequential as a
> similar group agg + sort, that doesn't seem like the really important
> factor here. The really important factor is that in-memory HashAggs
> can be blazingly fast relative to *any* alternative strategy -- be it
> a HashAgg that spills, or a group aggregate + sort that doesn't spill,
> whatever. We're mostly concerned about keeping the one available fast
> strategy than we are about getting a new, generally slow strategy.

I don't know; it depends. Like, if the less-sequential I/O pattern
that is caused by a HashAgg is not really any slower than a
Sort+GroupAgg, then whatever. The planner might as well try a HashAgg
- because it will be fast if it stays in memory - and if it doesn't
work out, we've lost little by trying. But if a Sort+GroupAgg is
noticeably faster than a HashAgg that ends up spilling, then there is
a potential regression. I thought we had evidence that this was a real
problem, but if that's not the case, then I think we're fine as far as
v13 goes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Default setting for enable_hashagg_disk

2020-07-21 Thread Bruce Momjian
On Tue, Jul 14, 2020 at 03:49:40PM -0700, Peter Geoghegan wrote:
> Maybe I missed your point here. The problem is not so much that we'll
> get HashAggs that spill -- there is nothing intrinsically wrong with
> that. While it's true that the I/O pattern is not as sequential as a
> similar group agg + sort, that doesn't seem like the really important
> factor here. The really important factor is that in-memory HashAggs
> can be blazingly fast relative to *any* alternative strategy -- be it
> a HashAgg that spills, or a group aggregate + sort that doesn't spill,
> whatever. We're mostly concerned about keeping the one available fast
> strategy than we are about getting a new, generally slow strategy.

Do we have any data that in-memory HashAggs are "blazingly fast relative
to *any* alternative strategy?"  The data I have tested myself and what
I saw from Tomas was that spilling sort or spilling hash are both 2.5x
slower.  Are we sure the quoted statement is true?

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Default setting for enable_hashagg_disk

2020-07-20 Thread Tomas Vondra

On Mon, Jul 20, 2020 at 09:17:21AM -0400, Tom Lane wrote:

Tomas Vondra  writes:

There's a minor problem here, though - these stats were collected before
we fixed the tlist issue, so hashagg was spilling about 10x the amount
of data compared to sort+groupagg. So maybe that's the first thing we
should do, before contemplating changes to the costing - collecting
fresh data. I can do that, if needed.


+1.  I'm not sure if we still need to do anything, but we definitely
can't tell on the basis of data that doesn't reliably reflect what
the code does now.



OK, will do. The hardware is busy doing something else at the moment,
but I'll do the tests and report results in a couple days.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Default setting for enable_hashagg_disk

2020-07-20 Thread Tom Lane
Tomas Vondra  writes:
> There's a minor problem here, though - these stats were collected before
> we fixed the tlist issue, so hashagg was spilling about 10x the amount
> of data compared to sort+groupagg. So maybe that's the first thing we
> should do, before contemplating changes to the costing - collecting
> fresh data. I can do that, if needed.

+1.  I'm not sure if we still need to do anything, but we definitely
can't tell on the basis of data that doesn't reliably reflect what
the code does now.

regards, tom lane




Re: Default setting for enable_hashagg_disk

2020-07-19 Thread Tomas Vondra

On Sun, Jul 19, 2020 at 02:17:15PM -0700, Jeff Davis wrote:

On Sat, 2020-07-18 at 21:15 -0400, Tom Lane wrote:

Jeff Davis  writes:
> What is your opinion about pessimizing the HashAgg disk costs (not
> affecting HashAgg plans expected to stay in memory)? Tomas Vondra
> presented some evidence that Sort had some better IO patterns in
> some
> cases that weren't easily reflected in a principled way in the cost
> model.

Hm, was that in some other thread?  I didn't find any such info
in a quick look through this one.



https://www.postgresql.org/message-id/2df2e0728d48f498b9d6954b5f9080a34535c385.camel%40j-davis.com



FWIW the two messages to look at are these two:

1) report with initial data
https://www.postgresql.org/message-id/20200519151202.u2p2gpiawoaznsv2%40development

2) updated stats, with the block pre-allocation and tlist projection
https://www.postgresql.org/message-id/20200521001255.kfaihp3afv6vy6uq%40development

But I'm not convinced we actually need to tweak the costing - we've
ended up fixing two things, and I think a lot of the differences in I/O
patterns disappeared thanks to this.

For sort, the stats of request sizes look like this:

  type |  bytes  | count |   pct
 ------+---------+-------+-------
  RA   |  131072 | 26034 | 59.92
  RA   |   16384 |  6160 | 14.18
  RA   |8192 |  3636 |  8.37
  RA   |   32768 |  3406 |  7.84
  RA   |   65536 |  3270 |  7.53
  RA   |   24576 |   361 |  0.83
  ...
  W| 1310720 |  8070 | 34.26
  W|  262144 |  1213 |  5.15
  W|  524288 |  1056 |  4.48
  W| 1056768 |   689 |  2.93
  W|  786432 |   292 |  1.24
  W|  802816 |   199 |  0.84
  ...

And for the hashagg, it looks like this:

  type |  bytes  | count  |  pct
 ------+---------+--------+-------
  RA   |  131072 | 200816 |  70.93
  RA   |8192 |  23640 |   8.35
  RA   |   16384 |  19324 |   6.83
  RA   |   32768 |  19279 |   6.81
  RA   |   65536 |  19273 |   6.81
  ...
  W| 1310720 |  18000 |  65.91
  W|  524288 |   2074 |   7.59
  W| 1048576 |660 |   2.42
  W|8192 |409 |   1.50
  W|  786432 |354 |   1.30
  ...

so it's actually a tad better than sort, because a larger proportion of
both reads and writes is in larger chunks (reads 128kB, writes 1280kB).
I think the device had the default read-ahead setting, which I assume
explains the 128kB.

For the statistics of deltas between requests - for sort

  type | block_delta | count |   pct
 ------+-------------+-------+-------
  RA   | 256 | 13432 | 30.91
  RA   |  16 |  3291 |  7.57
  RA   |  32 |  3272 |  7.53
  RA   |  64 |  3266 |  7.52
  RA   | 128 |  2877 |  6.62
  RA   |1808 |  1278 |  2.94
  RA   |   -2320 |   483 |  1.11
  RA   |   28928 |   386 |  0.89
  ...
  W|2560 |  7856 | 33.35
  W|2064 |  4921 | 20.89
  W|2080 |   586 |  2.49
  W|   30960 |   300 |  1.27
  W|2160 |   253 |  1.07
  W|1024 |   248 |  1.05
  ...

and for hashagg:

  type | block_delta | count  |  pct
 ------+-------------+--------+-------
  RA   | 256 | 180955 | 63.91
  RA   |  32 |  19274 |  6.81
  RA   |  64 |  19273 |  6.81
  RA   | 128 |  19264 |  6.80
  RA   |  16 |  19203 |  6.78
  RA   |   30480 |   9835 |  3.47

At first this might look worse than sort, but 256 sectors matches the
128kB from the request size stats, and it's a good match (64% vs. 70%).


There's a minor problem here, though - these stats were collected before
we fixed the tlist issue, so hashagg was spilling about 10x the amount
of data compared to sort+groupagg. So maybe that's the first thing we
should do, before contemplating changes to the costing - collecting
fresh data. I can do that, if needed.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Default setting for enable_hashagg_disk

2020-07-19 Thread Peter Geoghegan
On Sat, Jul 18, 2020 at 3:04 PM Jeff Davis  wrote:
> > I think the entire discussion
> > is way out ahead of any field evidence that we need such a knob.
> > In the absence of evidence, our default position ought to be to
> > keep it simple, not to accumulate backwards-compatibility kluges.
>
> Fair enough. I think that was where Stephen and Amit were coming from,
> as well.

> That would lessen the number of changed plans, but we could easily
> remove the pessimization without controversy later if it turned out to
> be unnecessary, or if we further optimize HashAgg IO.

Does this mean that we've reached a final conclusion on
hashagg_avoid_disk_plan for Postgres 13, which is that it should be
removed? If so, I'd appreciate it if you took care of it. I don't
think that we need to delay its removal until the details of the
HashAgg cost pessimization are finalized. (I expect that that will be
totally uncontroversial.)

Thanks
-- 
Peter Geoghegan




Re: Default setting for enable_hashagg_disk

2020-07-19 Thread Jeff Davis
On Sat, 2020-07-18 at 21:15 -0400, Tom Lane wrote:
> Jeff Davis  writes:
> > What is your opinion about pessimizing the HashAgg disk costs (not
> > affecting HashAgg plans expected to stay in memory)? Tomas Vondra
> > presented some evidence that Sort had some better IO patterns in
> > some
> > cases that weren't easily reflected in a principled way in the cost
> > model.
> 
> Hm, was that in some other thread?  I didn't find any such info
> in a quick look through this one.


https://www.postgresql.org/message-id/2df2e0728d48f498b9d6954b5f9080a34535c385.camel%40j-davis.com

Regards,
Jeff Davis






Re: Default setting for enable_hashagg_disk

2020-07-19 Thread David G. Johnston
On Sun, Jul 19, 2020 at 4:38 AM Stephen Frost  wrote:

> > (The only reason I'm in favor of hash_mem[_multiplier] is that it
> > seems like it might be possible to use it to get *better* plans
> > than before.  I do not see it as a backwards-compatibility knob.)
>
> I still don't think a hash_mem-type thing is really the right direction
> to go in, even if making a distinction between memory used for sorting
> and memory used for hashing is, and I'm of the general opinion that we'd
> be thinking about doing something better and more appropriate- except
> for the fact that we're talking about adding this in during beta.
>
> In other words, if we'd stop trying to shoehorn something in, which
> we're doing because we're in beta, we'd very likely be talking about all
> of this in a very different way and probably be contemplating something
> like a query_mem that provides for an overall memory limit and which
> favors memory for hashing over memory for sorting, etc.
>

At minimum we'd need a patch we would be happy with dropping in should
there be user complaints.  And once this conversation ends with that in
hand I have my doubts whether there will be interest, or even project
desirability, in working toward a "better" solution should this one prove
itself "good enough".  And as it seems unlikely that this patch would
foreclose on other promising solutions, combined with there being a
non-trivial behavioral change that we've made, suggests to me that we might
as well just deploy whatever short-term solution we come up with now.

As for hashagg_avoid_disk_plan...

The physical processes we are modelling here:
1. Processing D amount of records takes M amount of memory
2. Processing D amount of records in-memory takes T time per record while
doing the same on-disk takes V time per record
3. Processing D amount of records via some other plan has an effective cost
U
4. V >> T (V is strictly greater than T)
5. Having chosen a value for M that should ensure T, it is still possible
for V to end up being used

Thus:

If we get D wrong the user can still tweak the system by changing the
hash_mem_multiplier (this is strictly better than v12 which used work_mem)

Setting hashagg_avoid_disk_plan = on provides a means to move V infinitely
far away from T (it is off by default; on gives the v12 planner behavior).

There is no way for the user to move V's relative position toward T (n/a in
v12)

The only way to move T is to make it infinitely large by setting
enable_hashagg = off (same as in v12)

Is hashagg_disk_cost_multiplier = [0.0, 1,000,000,000.0] i.e., (T *
hashagg_disk_cost_multiplier == V) doable?

It has a nice symmetry with hash_mem_multiplier and can move V both toward
and away from T.  To the extent T is tunable or not in v12 it can remain
the same in v13.
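
Spelled out, and assuming both proposed GUCs existed (to be clear,
hashagg_disk_cost_multiplier is hypothetical -- it is the knob being
proposed here, not an existing setting):

SET hash_mem_multiplier = 2.0;            -- moves M: more memory before a spill happens
SET hashagg_disk_cost_multiplier = 10.0;  -- moves V away from T: the planner costs a
                                          -- spilling HashAgg at 10x its in-memory rate
-- values near 1.0 would cost spilling about the same as staying in memory;
-- large values push the planner toward plans that avoid a spilling HashAgg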

David J.


Re: Default setting for enable_hashagg_disk

2020-07-19 Thread Tom Lane
Stephen Frost  writes:
> In other words, if we'd stop trying to shoehorn something in, which
> we're doing because we're in beta, we'd very likely be talking about all
> of this in a very different way and probably be contemplating something
> like a query_mem that provides for an overall memory limit and which
> favors memory for hashing over memory for sorting, etc.

Even if we were at the start of the dev cycle rather than its end,
I'm not sure I agree.  Yes, replacing work_mem with some more-holistic
approach would be great.  But that's a research project, one that
we can't be sure will yield fruit on any particular schedule.  (Seeing
that we've understood this to be a problem for *decades*, I would tend
to bet on a longer not shorter time frame for a solution.)

I think that if we are worried about hashagg-spill behavior in the near
term, we have to have some fix that's not conditional on solving that
very large problem.  The only other practical alternative is "do
nothing for v13", and I agree with the camp that doesn't like that.

regards, tom lane




Re: Default setting for enable_hashagg_disk

2020-07-19 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Jeff Davis  writes:
> > On Fri, 2020-07-17 at 18:38 -0700, Peter Geoghegan wrote:
> >> There is also the separate question of what to do about the
> >> hashagg_avoid_disk_plan GUC (this is a separate open item that
> >> requires a separate resolution). Tom leans slightly towards removing
> >> it now. Is your position about the same as before?
> 
> > Yes, I think we should have that GUC (hashagg_avoid_disk_plan) for at
> > least one release.
> 
> You're being optimistic about it being possible to remove a GUC once
> we ship it.  That seems to be a hard sell most of the time.

Agreed.

> I'm honestly a bit baffled about the level of fear being expressed
> around this feature.  We have *frequently* made changes that would
> change query plans, perhaps not 100.00% for the better, and never
> before have we had this kind of bikeshedding about whether it was
> necessary to be able to turn it off.  I think the entire discussion
> is way out ahead of any field evidence that we need such a knob.
> In the absence of evidence, our default position ought to be to
> keep it simple, not to accumulate backwards-compatibility kluges.

+100

> (The only reason I'm in favor of hash_mem[_multiplier] is that it
> seems like it might be possible to use it to get *better* plans
> than before.  I do not see it as a backwards-compatibility knob.)

I still don't think a hash_mem-type thing is really the right direction
to go in, even if making a distinction between memory used for sorting
and memory used for hashing is, and I'm of the general opinion that we'd
be thinking about doing something better and more appropriate- except
for the fact that we're talking about adding this in during beta.

In other words, if we'd stop trying to shoehorn something in, which
we're doing because we're in beta, we'd very likely be talking about all
of this in a very different way and probably be contemplating something
like a query_mem that provides for an overall memory limit and which
favors memory for hashing over memory for sorting, etc.

Thanks,

Stephen




Re: Default setting for enable_hashagg_disk

2020-07-18 Thread Tom Lane
Jeff Davis  writes:
> What is your opinion about pessimizing the HashAgg disk costs (not
> affecting HashAgg plans expected to stay in memory)? Tomas Vondra
> presented some evidence that Sort had some better IO patterns in some
> cases that weren't easily reflected in a principled way in the cost
> model.

Hm, was that in some other thread?  I didn't find any such info
in a quick look through this one.

> That would lessen the number of changed plans, but we could easily
> remove the pessimization without controversy later if it turned out to
> be unnecessary, or if we further optimize HashAgg IO.

Trying to improve our cost models under-the-hood seems like a
perfectly reasonable activity to me.

regards, tom lane




Re: Default setting for enable_hashagg_disk

2020-07-18 Thread Jeff Davis
On Sat, 2020-07-18 at 14:30 -0400, Tom Lane wrote:
> You're being optimistic about it being possible to remove a GUC once
> we ship it.  That seems to be a hard sell most of the time.

If nothing else, a repeat of this thread in a year or two to discuss
removing a GUC doesn't seem appealing.

> I think the entire discussion
> is way out ahead of any field evidence that we need such a knob.
> In the absence of evidence, our default position ought to be to
> keep it simple, not to accumulate backwards-compatibility kluges.

Fair enough. I think that was where Stephen and Amit were coming from,
as well.

What is your opinion about pessimizing the HashAgg disk costs (not
affecting HashAgg plans expected to stay in memory)? Tomas Vondra
presented some evidence that Sort had some better IO patterns in some
cases that weren't easily reflected in a principled way in the cost
model.

That would lessen the number of changed plans, but we could easily
remove the pessimization without controversy later if it turned out to
be unnecessary, or if we further optimize HashAgg IO.

Regards,
Jeff Davis






Re: Default setting for enable_hashagg_disk

2020-07-18 Thread Peter Geoghegan
On Sat, Jul 18, 2020 at 11:30 AM Tom Lane  wrote:
> Jeff Davis  writes:
> > Yes, I think we should have that GUC (hashagg_avoid_disk_plan) for at
> > least one release.
>
> You're being optimistic about it being possible to remove a GUC once
> we ship it.  That seems to be a hard sell most of the time.

You've said that you're +0.5 on removing this GUC, while Jeff seems to
be about -0.5 (at least that's my take). It's hard to see a way
towards closing out the hashagg_avoid_disk_plan open item if that's
our starting point.

The "do we need to keep hashagg_avoid_disk_plan?" question is
fundamentally a value judgement IMV. I believe that you both
understand each other's perspectives. I also suspect that no pragmatic
compromise will be possible -- we can either have the
hashagg_avoid_disk_plan GUC or not have it. ISTM that we're
deadlocked, at least in a technical or procedural sense.

Does that understanding seem accurate to you both?

-- 
Peter Geoghegan




Re: Default setting for enable_hashagg_disk

2020-07-18 Thread Tom Lane
Jeff Davis  writes:
> On Fri, 2020-07-17 at 18:38 -0700, Peter Geoghegan wrote:
>> There is also the separate question of what to do about the
>> hashagg_avoid_disk_plan GUC (this is a separate open item that
>> requires a separate resolution). Tom leans slightly towards removing
>> it now. Is your position about the same as before?

> Yes, I think we should have that GUC (hashagg_avoid_disk_plan) for at
> least one release.

You're being optimistic about it being possible to remove a GUC once
we ship it.  That seems to be a hard sell most of the time.

I'm honestly a bit baffled about the level of fear being expressed
around this feature.  We have *frequently* made changes that would
change query plans, perhaps not 100.00% for the better, and never
before have we had this kind of bikeshedding about whether it was
necessary to be able to turn it off.  I think the entire discussion
is way out ahead of any field evidence that we need such a knob.
In the absence of evidence, our default position ought to be to
keep it simple, not to accumulate backwards-compatibility kluges.

(The only reason I'm in favor of hash_mem[_multiplier] is that it
seems like it might be possible to use it to get *better* plans
than before.  I do not see it as a backwards-compatibility knob.)

regards, tom lane




Re: Default setting for enable_hashagg_disk

2020-07-18 Thread Jeff Davis
On Fri, 2020-07-17 at 18:38 -0700, Peter Geoghegan wrote:
> There is also the separate question of what to do about the
> hashagg_avoid_disk_plan GUC (this is a separate open item that
> requires a separate resolution). Tom leans slightly towards removing
> it now. Is your position about the same as before?

Yes, I think we should have that GUC (hashagg_avoid_disk_plan) for at
least one release.

Clearly, a lot of plans will change. For any GROUP BY where there are a
lot of groups, there was only one choice in v12 and now there are two
choices in v13. Obviously I think most of those changes will be for the
better, but some regressions are bound to happen. Giving users some
time to adjust, and for us to tune the cost model based on user
feedback, seems prudent.

Are there other examples of widespread changes in plans where we
*didn't* have a GUC? There are many GUCs for controlling parallelism,
JIT, etc.

Regards,
Jeff Davis






Re: Default setting for enable_hashagg_disk

2020-07-17 Thread Peter Geoghegan
On Fri, Jul 17, 2020 at 5:13 PM Jeff Davis  wrote:
> The idea is growing on me a bit. It doesn't give exactly v12 behavior,
> but it does offer another lever that might tackle a lot of the
> practical cases.

Cool.

> If I were making the decision alone, I'd still choose
> the escape hatch based on simplicity, but I'm fine with this approach
> as well.

There is also the separate question of what to do about the
hashagg_avoid_disk_plan GUC (this is a separate open item that
requires a separate resolution). Tom leans slightly towards removing
it now. Is your position about the same as before?

> The patch itself looks reasonable to me. I don't see a lot of obvious
> dangers, but perhaps someone would like to take a closer look at the
> planner changes as you suggest.

It would be good to get further input on the patch from somebody else,
particularly the planner aspects.

My intention is to commit the patch myself. I was the primary advocate
for hash_mem_multiplier, so it seems as if I should own it. (You may
have noticed that I just pushed the preparatory
local-variable-renaming patch, to get that piece out of the way.)

-- 
Peter Geoghegan




Re: Default setting for enable_hashagg_disk

2020-07-17 Thread Jeff Davis
On Tue, 2020-07-14 at 21:12 -0700, Peter Geoghegan wrote:
> Attached is a WIP patch implementing hash_mem_multiplier, with 1.0 as
> the GUC's default value (i.e. the patch introduces no behavioral
> changes by default). The first patch in the series renames some local
> variables whose name is made ambiguous by the second, main patch.

The idea is growing on me a bit. It doesn't give exactly v12 behavior,
but it does offer another lever that might tackle a lot of the
practical cases. If I were making the decision alone, I'd still choose
the escape hatch based on simplicity, but I'm fine with this approach
as well.

The patch itself looks reasonable to me. I don't see a lot of obvious
dangers, but perhaps someone would like to take a closer look at the
planner changes as you suggest.

Regards,
Jeff Davis






Re: Default setting for enable_hashagg_disk

2020-07-14 Thread Peter Geoghegan
On Mon, Jul 13, 2020 at 9:47 AM Alvaro Herrera  wrote:
> I'm in favor of hash_mem_multiplier.  I think a >1 default is more
> sensible than =1 in the long run, but if strategic vote is what we're
> doing, then I support the =1 option.

Attached is a WIP patch implementing hash_mem_multiplier, with 1.0 as
the GUC's default value (i.e. the patch introduces no behavioral
changes by default). The first patch in the series renames some local
variables whose name is made ambiguous by the second, main patch.

Since the patch doesn't add a new work_mem-style GUC, but existing
consumers of work_mem expect something like that, the code is
structured in a way that allows the planner and executor to pretend
that there really is a work_mem-style GUC called hash_mem, which they
can determine the value of by calling the get_hash_mem() function.
This seemed like the simplest approach overall. I placed the
get_hash_mem() function in nodeHash.c, which is a pretty random place
for it. If anybody has any better ideas about where it should live,
please say so.
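
In terms of usage, the intent is simply this (values are illustrative
only):

SET work_mem = '64MB';
SET hash_mem_multiplier = 2.0;
-- sorts and other work_mem consumers are still capped at 64MB, while
-- hash-based nodes (hash agg, hash join) may use work_mem * 2.0 = 128MB,
-- which is the value that get_hash_mem() hands back to the planner and
-- executor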

ISTM that the planner changes are where there's most likely to be
problems. Reviewers should examine consider_groupingsets_paths() in
detail.

--
Peter Geoghegan


v2-0001-Rename-hash_mem-local-variable.patch
Description: Binary data


v2-0002-Add-hash_mem_multiplier-GUC.patch
Description: Binary data


Re: Default setting for enable_hashagg_disk

2020-07-14 Thread Peter Geoghegan
On Tue, Jul 14, 2020 at 12:46 PM Robert Haas  wrote:
> - I thought the problem we were trying to solve here was that, in v12,
> if the planner thinks that your hashagg will fit in memory when really
> it doesn't, you will get good performance because we'll cheat; in v13,
> you'll get VERY bad performance because we won't.

That is the problem we started out with. I propose to solve a broader
problem that I believe mostly encompasses the original problem (it's
an "inventor's paradox" situation). Although the exact degree to which
it truly addresses the original problem will vary across
installations, I believe that it will go a very long way towards
cutting down on problems for users upgrading to Postgres 13 generally.

> - So, if hash_mem_multiplier affects both planning and execution, it
> doesn't really solve the problem. Neither does adjusting the existing
> work_mem setting. Imagine that you have two queries. The planner
> thinks Q1 will use 1GB of memory for a HashAgg but it will actually
> need 2GB. It thinks Q2 will use 1.5GB for a HashAgg but it will
> actually need 3GB. If you plan using a 1GB memory limit, Q1 will pick
> a HashAgg and perform terribly when it spills. Q2 will pick a
> GroupAggregate which will be OK but not great. If you plan with a 2GB
> memory limit, Q1 will pick a HashAgg and will not spill so now it will
> be in great shape. But Q2 will pick a HashAgg and then spill so it
> will stink. Oops.

Maybe I missed your point here. The problem is not so much that we'll
get HashAggs that spill -- there is nothing intrinsically wrong with
that. While it's true that the I/O pattern is not as sequential as a
similar group agg + sort, that doesn't seem like the really important
factor here. The really important factor is that in-memory HashAggs
can be blazingly fast relative to *any* alternative strategy -- be it
a HashAgg that spills, or a group aggregate + sort that doesn't spill,
whatever. We're mostly concerned about keeping the one available fast
strategy than we are about getting a new, generally slow strategy.

There will be no problems at all unless and until we're short on
memory, because you can just increase work_mem and everything works
out, regardless of the details. Obviously the general problems we
anticipate only crop up when increasing work_mem stops being a viable
DBA strategy.

By teaching the system to have at least a crude appreciation of the
value of memory when hashing vs when sorting, the system is often able
to give much more memory to Hash aggs (and hash joins). Increasing
hash_mem_multiplier (maybe while also decreasing work_mem) will be
beneficial when we take memory from things that don't really need so
much, like sorts (or even CTE tuplestores) -- we reduce the memory
pressure without paying a commensurate price in system throughput
(maybe even only a very small hit). As a bonus, everything going
faster may actually *reduce* the memory usage for the system as a
whole, even as individual queries use more memory.
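
Concretely, the kind of trade I have in mind might look like this (the
numbers are made up, and the right values are obviously workload
dependent):

ALTER SYSTEM SET work_mem = '32MB';          -- shrink the general per-node budget
ALTER SYSTEM SET hash_mem_multiplier = 4.0;  -- but let hash aggs/joins use up to ~128MB
SELECT pg_reload_conf();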

Under this scheme, it may well not matter that you cannot cheat
(Postgres 12 style) anymore, because you'll be able to use the memory
that is available sensibly -- regardless of whether or not the group
estimates are very good (we have to have more than zero faith in the
estimates -- they can be bad without being terrible). Maybe no amount
of tuning can ever restore the desirable Postgres 12 performance
characteristics you came to rely on, but remaining "regressions" are
probably cases where the user was flying pretty close to the sun
OOM-wise all along. They may have been happy with Postgres 12, but at
a certain point that really is something that you have to view as a
fool's paradise, even if like me you happen to be a memory Keynesian.

Really big outliers tend to be rare and therefore something that the
user can afford to have go slower. It's the constant steady stream of
medium-sized hash aggs that we mostly need to worry about. To the
extent that that's true, hash_mem_multiplier addresses the problem on
the table.

> - An escape hatch that prevents spilling at execution time *does*
> solve this problem, but now suppose we add a Q3 which the planner
> thinks will use 512MB of memory but at execution time it will actually
> consume 512GB due to the row count estimate being 1024x off. So if you
> enable the escape hatch to get back to a situation where Q1 and Q2
> both perform acceptably, then Q3 makes your system OOM.

Right. Nothing stops these two things from being true at the same time.

> - If you were to instead introduce a GUC like what I proposed before,
> which allows the execution-time memory usage to exceed what was
> planned, but only by a certain margin, then you can set
> hash_mem_execution_overrun_multiplier_thingy=2.5 and call it a day.
> Now, no matter how you set work_mem, you're fine. Depending on the
> value you choose for work_mem, you may get group aggregates for some
> of the queries. If you set it large enough that you get 

Re: Default setting for enable_hashagg_disk

2020-07-14 Thread Robert Haas
On Mon, Jul 13, 2020 at 2:50 PM Peter Geoghegan  wrote:
> Primarily in favor of escape hatch:
>
> Jeff,
> DavidR,
> Pavel,
> Andres,
> Robert ??,
> Amit ??
>
> Primarily in favor of hash_mem/hash_mem_multiplier:
>
> PeterG,
> Tom,
> Alvaro,
> Tomas,
> Justin,
> DavidG,
> Jonathan Katz
>
> There are clear problems with this summary, including for example the
> fact that Robert weighed in before the hash_mem/hash_mem_multiplier
> proposal was even on the table. What he actually said about it [1]
> seems closer to hash_mem, so I feel that putting him in that bucket is
> a conservative assumption on my part. Same goes for Amit, who warmed
> to the idea of hash_mem_multiplier recently. (Though I probably got
> some detail wrong, in which case please correct me.)

My view is:

- I thought the problem we were trying to solve here was that, in v12,
if the planner thinks that your hashagg will fit in memory when really
it doesn't, you will get good performance because we'll cheat; in v13,
you'll get VERY bad performance because we won't.

- So, if hash_mem_multiplier affects both planning and execution, it
doesn't really solve the problem. Neither does adjusting the existing
work_mem setting. Imagine that you have two queries. The planner
thinks Q1 will use 1GB of memory for a HashAgg but it will actually
need 2GB. It thinks Q2 will use 1.5GB for a HashAgg but it will
actually need 3GB. If you plan using a 1GB memory limit, Q1 will pick
a HashAgg and perform terribly when it spills. Q2 will pick a
GroupAggregate which will be OK but not great. If you plan with a 2GB
memory limit, Q1 will pick a HashAgg and will not spill so now it will
be in great shape. But Q2 will pick a HashAgg and then spill so it
will stink. Oops.

- An escape hatch that prevents spilling at execution time *does*
solve this problem, but now suppose we add a Q3 which the planner
thinks will use 512MB of memory but at execution time it will actually
consume 512GB due to the row count estimate being 1024x off. So if you
enable the escape hatch to get back to a situation where Q1 and Q2
both perform acceptably, then Q3 makes your system OOM.

- If you were to instead introduce a GUC like what I proposed before,
which allows the execution-time memory usage to exceed what was
planned, but only by a certain margin, then you can set
hash_mem_execution_overrun_multiplier_thingy=2.5 and call it a day.
Now, no matter how you set work_mem, you're fine. Depending on the
value you choose for work_mem, you may get group aggregates for some
of the queries. If you set it large enough that you get hash
aggregates, then Q1 and Q2 will avoid spilling (which works but is
slow) because the overrun is less than 2x. Q3 will spill, so you won't
OOM. Wahoo!

- I do agree in general that it makes more sense to allow
hash_work_mem > sort_work_mem, and even to make that the default.
Allowing the same budget for both is unreasonable, because I think we
have good evidence that inadequate memory has a severe impact on
hashing operations but usually only a fairly mild effect on sorting
operations, except in the case where the underrun is severe. That is,
if you need 1GB of memory for a sort and you only get 768MB, the
slowdown is much much less severe than if the same thing happens for a
hash. If you have 10MB of memory, both are going to suck, but that's
kinda unavoidable.

- If you hold my feet to the fire and ask me to choose between a
Boolean escape hatch (rather than a multiplier-based one) and
hash_mem_multiplier, gosh, I don't know. I guess the Boolean escape
hatch? I mean it's a pretty bad solution, but at least if I have that
I can get both Q1 and Q2 to perform well at the same time, and I guess
I'm no worse off than I was in v12. The hash_mem_multiplier thing,
assuming it affects both planning and execution, seems like a very
good idea in general, but I guess I don't see how it helps with this
problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Default setting for enable_hashagg_disk

2020-07-13 Thread Amit Kapila
On Mon, Jul 13, 2020 at 9:50 PM Peter Geoghegan  wrote:
>
> On Mon, Jul 13, 2020 at 6:13 AM Stephen Frost  wrote:
> > Yes, increasing work_mem isn't unusual, at all.
>
> It's unusual as a way of avoiding OOMs!
>
> > Eh?  That's not at all what it looks like- they were getting OOM's
> > because they set work_mem to be higher than the actual amount of memory
> > they had and the Sort before the GroupAgg was actually trying to use all
> > that memory.  The HashAgg ended up not needing that much memory because
> > the aggregated set wasn't actually that large.  If anything, this shows
> > exactly what Jeff's fine work here is (hopefully) going to give us- the
> > option to plan a HashAgg in such cases, since we can accept spilling to
> > disk if we end up underestimating, or take advantage of that HashAgg
> > being entirely in memory if we overestimate.
>
> I very specifically said that it wasn't a case where something like
> hash_mem would be expected to make all the difference.
>
> > Having looked back, I'm not sure that I'm really in the minority
> > regarding the proposal to add this at this time either- there's been a
> > few different comments that it's too late for v13 and/or that we should
> > see if we actually end up with users seriously complaining about the
> > lack of a separate way to specify the memory for a given node type,
> > and/or that if we're going to do this then we should have a broader set
> > of options covering other nodes types too, all of which are positions
> > that I agree with.
>
> By proposing to do nothing at all, you are very clearly in a small
> minority. While (for example) I might have debated the details with
> David Rowley a lot recently, and you couldn't exactly say that we're
> in agreement, our two positions are nevertheless relatively close
> together.
>
> AFAICT, the only other person that has argued that we should do
> nothing (have no new GUC) is Bruce, which was a while ago now. (Amit
> said something similar, but has since softened his opinion [1]).
>

To be clear, my vote for PG13 is not to do anything till we have clear
evidence of regressions.  In the email you quoted, I was trying to say
that due to parallelism we might not have the problem for which we are
planning to provide an escape-hatch or hash_mem GUC.  I think the
reason for the delay in getting to the agreement is that there is no
clear evidence for the problem (user-reported cases or results of some
benchmarks like TPC-H) unless I have missed something.

Having said that, I understand that we have to reach some conclusion
to close this open item, and if the majority of people are in favor of
the escape-hatch or hash_mem solution then we have to do one of those.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Default setting for enable_hashagg_disk

2020-07-13 Thread Tom Lane
"David G. Johnston"  writes:
> To be clear, by "escape hatch" you mean "add a GUC that instructs the
> PostgreSQL executor to ignore hash_mem when deciding whether to spill the
> contents of the hash table to disk - IOW to never spill the contents of a
> hash table to disk"?  If so that seems separate from whether to add a
> hash_mem GUC to provide finer grained control - people may well want both.

If we define the problem narrowly as "allow people to get back exactly
the pre-v13 behavior", then yeah you'd need an escape hatch of that
sort.  We should not, however, forget that the pre-v13 behavior is
pretty darn problematic.  It's hard to see why anyone would really
want to get back exactly "never spill even if it leads to OOM".

The proposals for allowing a higher-than-work_mem, but not infinite,
spill boundary seem to me to be a reasonable way to accommodate cases
where the old behavior is accidentally preferable to what v13 does
right now.  Moreover, such a knob seems potentially useful in its
own right, at least as a stopgap until we figure out how to generalize
or remove work_mem.  (Which might be a long time.)

I'm not unalterably opposed to providing an escape hatch of the other
sort, but right now I think the evidence for needing it isn't there.
If we get field complaints that can't be resolved with the "raise the
spill threshold by X" approach, we could reconsider.  But that approach
seems a whole lot less brittle than "raise the spill threshold to
infinity", so I think we should start with the former type of fix.

regards, tom lane




Re: Default setting for enable_hashagg_disk

2020-07-13 Thread Peter Geoghegan
On Mon, Jul 13, 2020 at 12:57 PM David G. Johnston
 wrote:
> To be clear, by "escape hatch" you mean "add a GUC that instructs the 
> PostgreSQL executor to ignore hash_mem when deciding whether to spill the 
> contents of the hash table to disk - IOW to never spill the contents of a 
> hash table to disk"?

Yes, that's what that means.

> If so that seems separate from whether to add a hash_mem GUC to provide finer 
> grained control - people may well want both.

They might want the escape hatch too, as an additional measure, but my
assumption is that anybody in favor of the
hash_mem/hash_mem_multiplier proposal takes that position because they
think that it's the principled solution. That's the kind of subtlety
that is bound to get lost when summarizing general sentiment at a high
level. In any case no individual has seriously argued that there is a
simultaneous need for both -- at least not yet.

This thread is already enormous, and very hard to keep up with. I'm
trying to draw a line under the discussion. For my part, I have
compromised on the important question of the default value of
hash_mem_multiplier -- I am writing a new version of the patch that
makes the default 1.0 (i.e. no behavioral changes by default).

> I would prefer DavidJ as an abbreviation - my middle initial can be dropped 
> when referring to me.

Sorry about that.

-- 
Peter Geoghegan




Re: Default setting for enable_hashagg_disk

2020-07-13 Thread David G. Johnston
On Mon, Jul 13, 2020 at 11:50 AM Peter Geoghegan  wrote:

>
> Primarily in favor of escape hatch:
>
> Jeff,
> DavidR,
> Pavel,
> Andres,
> Robert ??,
> Amit ??
>
>
To be clear, by "escape hatch" you mean "add a GUC that instructs the
PostgreSQL executor to ignore hash_mem when deciding whether to spill the
contents of the hash table to disk - IOW to never spill the contents of a
hash table to disk"?  If so that seems separate from whether to add a
hash_mem GUC to provide finer grained control - people may well want both.

Primarily in favor of hash_mem/hash_mem_multiplier:
>
> PeterG,
> Tom,
> Alvaro,
> Tomas,
> Justin,
> DavidG,
> Jonathan Katz
>
>
I would prefer DavidJ as an abbreviation - my middle initial can be dropped
when referring to me.

David J.


Re: Default setting for enable_hashagg_disk

2020-07-13 Thread Peter Geoghegan
On Mon, Jul 13, 2020 at 7:25 AM David Rowley  wrote:
> I think it would be good if we could try to move towards getting
> consensus here rather than reiterating our arguments over and over.

+1

> Updated summary:
> * For hash_mem = Tomas [7], Justin [16]
> * For hash_mem_multiplier with a default > 1.0 = DavidG [21]
> * For hash_mem_multiplier with default = 1.0 =  PeterG [15][0], Tom [20][24]
> * hash_mem out of scope for PG13 = Bruce [8], Andres [9]
> * hashagg_mem default to -1 meaning use work_mem = DavidR [23] (2nd 
> preference)
> * Escape hatch that can be removed later when we get something better
> = Jeff [11], DavidR [12], Pavel [13], Andres [14], Justin [1]
> * Add enable_hashagg_spill = Tom [2] (I'm unclear on this proposal.
> Does it affect the planner or executor or both?) (updated opinion in
> [20])
> * Maybe do nothing until we see how things go during beta = Bruce [3], Amit 
> [10]
> * Just let users set work_mem = Stephen [21], Alvaro [4] (Alvaro
> changed his mind after Andres pointed out that changes other nodes in
> the plan too [25])
> * Swap enable_hashagg for a GUC that specifies when spilling should
> occur. -1 means work_mem = Robert [17], Amit [18]
> * hash_mem does not solve the problem = Tomas [6] (changed his mind in [7])

I don't think that hashagg_mem needs to be considered here, because
you were the only one that spoke out in favor of that idea, and it's
your second preference in any case (maybe Tom was in favor of such a
thing at one point, but he clearly favors hash_mem/hash_mem_multiplier
now so it hardly matters). I don't think that hashagg_mem represents a
meaningful compromise between the escape hatch and
hash_mem/hash_mem_multiplier in any case. (I would *prefer* the escape
hatch to hashagg_mem, since at least the escape hatch is an "honest"
escape hatch.)

ISTM that there are three basic approaches to resolving this open item
that remain:

1. Do nothing.

2. Add an escape hatch.

3. Add hash_mem/hash_mem_multiplier.

Many people (e.g., Tom, Jeff, you, Andres, myself) have clearly
indicated that doing nothing is simply a non-starter. It's not just
that it doesn't get a lot of votes -- it's something that is strongly
opposed. We can rule it out right away.

This is where it gets harder. Many of us have views that won't
easily fit into buckets. For example, even though I myself proposed
hash_mem/hash_mem_multiplier, I've said that I can live with the
escape hatch. Similarly, Jeff favors the escape hatch, but has said
that he can live with hash_mem/hash_mem_multiplier. And, Andres said
to me privately that he thinks that hash_mem could be a good idea,
even though he opposes it now due to release management
considerations.

Even still, I think that it's possible to divide people into two camps
on this without grossly misrepresenting anybody.

Primarily in favor of escape hatch:

Jeff,
DavidR,
Pavel,
Andres,
Robert ??,
Amit ??

Primarily in favor of hash_mem/hash_mem_multiplier:

PeterG,
Tom,
Alvaro,
Tomas,
Justin,
DavidG,
Jonathan Katz

There are clear problems with this summary, including for example the
fact that Robert weighed in before the hash_mem/hash_mem_multiplier
proposal was even on the table. What he actually said about it [1]
seems closer to hash_mem, so I feel that putting him in that bucket is
a conservative assumption on my part. Same goes for Amit, who warmed
to the idea of hash_mem_multiplier recently. (Though I probably got
some detail wrong, in which case please correct me.)

ISTM that there is a majority of opinion in favor of
hash_mem/hash_mem_multiplier. If you assume that I have this wrong,
and that we're simply deadlocked, then it becomes a matter for the
RMT. I strongly doubt that that changes the overall outcome, since
this year's RMT members happen to all be in favor of the
hash_mem/hash_mem_multiplier proposal on an individual basis.

[1] 
https://www.postgresql.org/message-id/ca+tgmobyv9+t-wjx-ctpdqurcgt1thz1ml3v1nxc4m4g-h6...@mail.gmail.com
-- 
Peter Geoghegan




Re: Default setting for enable_hashagg_disk

2020-07-13 Thread Justin Pryzby
On Mon, Jul 13, 2020 at 12:47:36PM -0400, Alvaro Herrera wrote:
> On 2020-Jul-13, Jeff Davis wrote:
> 
> > On Tue, 2020-07-14 at 02:25 +1200, David Rowley wrote:
> > > Updated summary:
> > > * For hash_mem = Tomas [7], Justin [16]
> > > * For hash_mem_multiplier with a default > 1.0 = DavidG [21]
> > > * For hash_mem_multiplier with default = 1.0 =  PeterG [15][0], Tom
> > > [20][24]
> > 
> > I am OK with these options, but I still prefer a simple escape hatch.
> 
> I'm in favor of hash_mem_multiplier.  I think a >1 default is more
> sensible than =1 in the long run, but if strategic vote is what we're
> doing, then I support the =1 option.

I recanted and support hash_mem_multiplier (or something supporting that
behavior, even if it also supports an absolute/scalar value).
https://www.postgresql.org/message-id/20200703145620.gk4...@telsasoft.com

1.0 (or -1) is fine, possibly to be >= 1.0 in master at a later date.

-- 
Justin




Re: Default setting for enable_hashagg_disk

2020-07-13 Thread Tom Lane
Alvaro Herrera  writes:
> I'm in favor of hash_mem_multiplier.  I think a >1 default is more
> sensible than =1 in the long run, but if strategic vote is what we're
> doing, then I support the =1 option.

FWIW, I also think that we'll eventually end up with >1 default.
But the evidence to support that isn't really there yet, so
I'm good with 1.0 default to start with.

regards, tom lane




Re: Default setting for enable_hashagg_disk

2020-07-13 Thread Alvaro Herrera
On 2020-Jul-13, Jeff Davis wrote:

> On Tue, 2020-07-14 at 02:25 +1200, David Rowley wrote:
> > Updated summary:
> > * For hash_mem = Tomas [7], Justin [16]
> > * For hash_mem_multiplier with a default > 1.0 = DavidG [21]
> > * For hash_mem_multiplier with default = 1.0 =  PeterG [15][0], Tom
> > [20][24]
> 
> I am OK with these options, but I still prefer a simple escape hatch.

I'm in favor of hash_mem_multiplier.  I think a >1 default is more
sensible than =1 in the long run, but if strategic vote is what we're
doing, then I support the =1 option.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Default setting for enable_hashagg_disk

2020-07-13 Thread Peter Geoghegan
On Mon, Jul 13, 2020 at 6:13 AM Stephen Frost  wrote:
> Yes, increasing work_mem isn't unusual, at all.

It's unusual as a way of avoiding OOMs!

> Eh?  That's not at all what it looks like- they were getting OOM's
> because they set work_mem to be higher than the actual amount of memory
> they had and the Sort before the GroupAgg was actually trying to use all
> that memory.  The HashAgg ended up not needing that much memory because
> the aggregated set wasn't actually that large.  If anything, this shows
> exactly what Jeff's fine work here is (hopefully) going to give us- the
> option to plan a HashAgg in such cases, since we can accept spilling to
> disk if we end up underestimating, or take advantage of that HashAgg
> being entirely in memory if we overestimate.

I very specifically said that it wasn't a case where something like
hash_mem would be expected to make all the difference.

> Having looked back, I'm not sure that I'm really in the minority
> regarding the proposal to add this at this time either- there's been a
> few different comments that it's too late for v13 and/or that we should
> see if we actually end up with users seriously complaining about the
> lack of a separate way to specify the memory for a given node type,
> and/or that if we're going to do this then we should have a broader set
> of options covering other nodes types too, all of which are positions
> that I agree with.

By proposing to do nothing at all, you are very clearly in a small
minority. While (for example) I might have debated the details with
David Rowley a lot recently, and you couldn't exactly say that we're
in agreement, our two positions are nevertheless relatively close
together.

AFAICT, the only other person that has argued that we should do
nothing (have no new GUC) is Bruce, which was a while ago now. (Amit
said something similar, but has since softened his opinion [1]).

[1] 
https://postgr.es/m/CAA4eK1+KMSQuOq5Gsj-g-pYec_8zgGb4K=xrznbcccnaumf...@mail.gmail.com
-- 
Peter Geoghegan




Re: Default setting for enable_hashagg_disk

2020-07-13 Thread Peter Eisentraut

On 2020-07-13 16:11, Tomas Vondra wrote:

Why is running out of disk space worse experience than running out of
memory?

Sure, it'll take longer and ultimately the query fails (and if it fills
the device used by the WAL then it may also cause shutdown of the main
instance due to inability to write WAL). But that can be prevented by
moving the temp tablespace and/or setting the temp file limit, as
already mentioned.

With OOM, if the kernel OOM killer decides to act, it may easily bring
down the instance too, and there are much less options to prevent that.


Well, that's an interesting point.  Depending on the OS setup, an
out-of-memory condition might actually be worse by default, if the OOM
killer strikes in an unfortunate way.  That didn't happen to me in my
tests, so the OS must have been configured differently by default.


So maybe a lesson here is that just like we have been teaching users to 
adjust the OOM killer, we have to teach them now that setting the temp 
file limit might become more important.
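
For example, assuming a dedicated temp tablespace (called tmp_ts here)
has already been created on a device that can safely fill up:

ALTER SYSTEM SET temp_file_limit = '100GB';   -- cap per-session temp file usage
ALTER SYSTEM SET temp_tablespaces = 'tmp_ts'; -- keep spill files off the data/WAL device
SELECT pg_reload_conf();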


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Default setting for enable_hashagg_disk

2020-07-13 Thread Jeff Davis
On Tue, 2020-07-14 at 02:25 +1200, David Rowley wrote:
> Updated summary:
> * For hash_mem = Tomas [7], Justin [16]
> * For hash_mem_multiplier with a default > 1.0 = DavidG [21]
> * For hash_mem_multiplier with default = 1.0 =  PeterG [15][0], Tom
> [20][24]

I am OK with these options, but I still prefer a simple escape hatch.

> * Maybe do nothing until we see how things go during beta = Bruce
> [3], Amit [10]
> * Just let users set work_mem = Stephen [21], Alvaro [4] (Alvaro
> changed his mind after Andres pointed out that changes other nodes in
> the plan too [25])

I am not on board with these options.

Regards,
Jeff Davis






Re: Default setting for enable_hashagg_disk

2020-07-13 Thread David Rowley
On Tue, 14 Jul 2020 at 01:13, Stephen Frost  wrote:
> Yes, increasing work_mem isn't unusual, at all.  What that tweet shows,
> and what I don't think folks who are suggesting things like setting this
> factor to 2.0 realize, is that people may have a work_mem configured in the
> gigabytes- meaning that a 2.0 value would result in a work_mem of 5GB
> and a hash_mem of 10GB.  Now, I'm all for telling people to review their
> configurations between major versions, but that's a large difference
> that's going to be pretty deeply hidden in a 'multiplier' setting.

I think Peter seems to be fine with setting the default to 1.0, per [0].

This thread did split off a while back into "Default setting for
enable_hashagg_disk (hash_mem)", I did try and summarise who sits
where on this in [19].

I think it would be good if we could try to move towards getting
consensus here rather than reiterating our arguments over and over.

Updated summary:
* For hash_mem = Tomas [7], Justin [16]
* For hash_mem_multiplier with a default > 1.0 = DavidG [21]
* For hash_mem_multiplier with default = 1.0 =  PeterG [15][0], Tom [20][24]
* hash_mem out of scope for PG13 = Bruce [8], Andres [9]
* hashagg_mem default to -1 meaning use work_mem = DavidR [23] (2nd preference)
* Escape hatch that can be removed later when we get something better
= Jeff [11], DavidR [12], Pavel [13], Andres [14], Justin [1]
* Add enable_hashagg_spill = Tom [2] (I'm unclear on this proposal.
Does it affect the planner or executor or both?) (updated opinion in
[20])
* Maybe do nothing until we see how things go during beta = Bruce [3], Amit [10]
* Just let users set work_mem = Stephen [21], Alvaro [4] (Alvaro
changed his mind after Andres pointed out that changes other nodes in
the plan too [25])
* Swap enable_hashagg for a GUC that specifies when spilling should
occur. -1 means work_mem = Robert [17], Amit [18]
* hash_mem does not solve the problem = Tomas [6] (changed his mind in [7])

Perhaps people who have managed to follow this thread but not chip in
yet can reply quoting the option above that they'd be voting for. Or
if you're ok changing your mind to some option that has more votes
than the one your name is already against. That might help move this
along.

David

[0] 
https://www.postgresql.org/message-id/CAH2-Wz=VV6EKFGUJDsHEqyvRk7pCO36BvEoF5sBQry_O6R2=n...@mail.gmail.com
[1] https://www.postgresql.org/message-id/20200624031443.gv4...@telsasoft.com
[2] https://www.postgresql.org/message-id/2214502.1593019...@sss.pgh.pa.us
[3] https://www.postgresql.org/message-id/20200625182512.gc12...@momjian.us
[4] https://www.postgresql.org/message-id/20200625224422.GA9653@alvherre.pgsql
[5] 
https://www.postgresql.org/message-id/caa4ek1k0cgk_8hryxsvppgoh_z-ny+uztcfwb2we6baj9dx...@mail.gmail.com
[6] 
https://www.postgresql.org/message-id/20200627104141.gq7d3hm2tvoqgjjs@development
[7] 
https://www.postgresql.org/message-id/20200629212229.n3afgzq6xpxrr4cu@development
[8] https://www.postgresql.org/message-id/20200703030001.gd26...@momjian.us
[9] 
https://www.postgresql.org/message-id/20200707171216.jqxrld2jnxwf5...@alap3.anarazel.de
[10] 
https://www.postgresql.org/message-id/CAA4eK1KfPi6iz0hWxBLZzfVOG_NvOVJL=9UQQirWLpaN=ka...@mail.gmail.com
[11] 
https://www.postgresql.org/message-id/8bff2e4e8020c3caa16b61a46918d21b573eaf78.ca...@j-davis.com
[12] 
https://www.postgresql.org/message-id/CAApHDvqFZikXhAGW=ukzkq1_fzhy+xzmuzajinj6rwythh4...@mail.gmail.com
[13] 
https://www.postgresql.org/message-id/cafj8prbf1w4ndz-ynd+muptfizfbs7+cpjc4ob8v9d3x0ms...@mail.gmail.com
[14] 
https://www.postgresql.org/message-id/20200624191433.5gnqgrxfmucex...@alap3.anarazel.de
[15] 
https://www.postgresql.org/message-id/CAH2-WzmD+i1pG6rc1+Cjc4V6EaFJ_qSuKCCHVnH=oruqd-z...@mail.gmail.com
[16] https://www.postgresql.org/message-id/20200703024649.gj4...@telsasoft.com
[17] 
https://www.postgresql.org/message-id/ca+tgmobyv9+t-wjx-ctpdqurcgt1thz1ml3v1nxc4m4g-h6...@mail.gmail.com
[18] 
https://www.postgresql.org/message-id/caa4ek1k0cgk_8hryxsvppgoh_z-ny+uztcfwb2we6baj9dx...@mail.gmail.com
[19] 
https://www.postgresql.org/message-id/caaphdvrp1fiev4aql2zscbhi32w+gp01j+qnhwou7y7p-qf...@mail.gmail.com
[20] https://www.postgresql.org/message-id/2107841.1594403...@sss.pgh.pa.us
[21] 
https://www.postgresql.org/message-id/20200710141714.gi12...@tamriel.snowman.net
[22] 
https://www.postgresql.org/message-id/CAKFQuwa2gwLa0b%2BmQv5r5A_Q0XWsA2%3D1zQ%2BZ5m4pQprxh-aM4Q%40mail.gmail.com
[23] 
https://www.postgresql.org/message-id/caaphdvpxbhhp566rrjjwgnfs0yoxr53eztz5lhh-jcekvqd...@mail.gmail.com
[24] https://www.postgresql.org/message-id/2463591.1594514...@sss.pgh.pa.us
[25] 
https://www.postgresql.org/message-id/20200625225853.GA11137%40alvherre.pgsql




Re: Default setting for enable_hashagg_disk

2020-07-13 Thread Tomas Vondra

On Mon, Jul 13, 2020 at 01:51:42PM +0200, Peter Eisentraut wrote:

On 2020-04-07 20:20, Jeff Davis wrote:

Now that we have Disk-based Hash Aggregation, there are a lot more
situations where the planner can choose HashAgg. The
enable_hashagg_disk GUC, if set to true, chooses HashAgg based on
costing. If false, it only generates a HashAgg path if it thinks it
will fit in work_mem, similar to the old behavior (though it will now
spill to disk if the planner was wrong about it fitting in work_mem).
The current default is true.


I have an anecdote that might be related to this discussion.

I was running an unrelated benchmark suite.  With PostgreSQL 12, one 
query ran out of memory.  With PostgreSQL 13, the same query instead 
ran out of disk space.  I bisected this to the introduction of 
disk-based hash aggregation.  Of course, the very point of that 
feature is to eliminate the out of memory and make use of disk space 
instead.  But running out of disk space is likely to be a worse 
experience than running out of memory.  Also, while it's relatively 
easy to limit memory use both in PostgreSQL and in the kernel, it is 
difficult or impossible to limit disk space use in a similar way.




Why is running out of disk space a worse experience than running out of
memory?

Sure, it'll take longer and ultimately the query fails (and if it fills
the device used by the WAL then it may also cause shutdown of the main
instance due to inability to write WAL). But that can be prevented by
moving the temp tablespace and/or setting the temp file limit, as
already mentioned.
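
For example, a minimal sketch of those guardrails (values and the tablespace
name are illustrative only, and the tablespace must already exist):

    -- cap per-session temporary file usage
    SET temp_file_limit = '20GB';
    -- and/or point temporary files at a dedicated volume
    SET temp_tablespaces = 'temp_ts';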

With OOM, if the kernel OOM killer decides to act, it may easily bring
down the instance too, and there are far fewer options to prevent that.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Default setting for enable_hashagg_disk

2020-07-13 Thread Stephen Frost
Greetings,

* Peter Geoghegan (p...@bowt.ie) wrote:
> On Sat, Jul 11, 2020 at 7:22 AM Stephen Frost  wrote:
> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
> > > Have you got a better proposal that is reasonably implementable for v13?
> > > (I do not accept the argument that "do nothing" is a better proposal.)
> 
> > So, no, I don't agree that 'do nothing' (except ripping out the one GUC
> > that was already added) is a worse proposal than adding another work_mem
> > like thing that's only for some node types.
> 
> The question was "Have you got a better proposal that is reasonably
> implementable for v13?".
> 
> This is anecdotal, but just today somebody on Twitter reported
> *increasing* work_mem to stop getting OOMs from group aggregate +
> sort:
> 
> https://twitter.com/theDressler/status/1281942941133615104

Yes, increasing work_mem isn't unusual, at all.  What that tweet shows,
and what I don't think folks who are suggesting things like setting this
factor to 2.0 appreciate, is that people may have a work_mem configured in the
gigabytes- meaning that a 2.0 value would result in a work_mem of 5GB
and a hash_mem of 10GB.  Now, I'm all for telling people to review their
configurations between major versions, but that's a large difference
that's going to be pretty deeply hidden in a 'multiplier' setting.
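
To spell out that arithmetic (illustrative values, using the proposed
multiplier):

    SET work_mem = '5GB';
    SET hash_mem_multiplier = 2.0;
    -- hash-based nodes would each be allowed roughly 5GB * 2.0 = 10GB,
    -- while sorts and other work_mem consumers stay at 5GB per node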

I'm still wholly unconvinced that we need such a setting, just to be
clear, but I don't think there's any way it'd be reasonable to have it
set to something other than "whatever work_mem is" by default- and it
needs to actually be "what work_mem is" and not "have the same default
value" or *everyone* would have to configure it.

> It was possible to fix the problem in this instance, since evidently
> there wasn't anything else that really did try to consume ~5 GB of
> work_mem memory. Evidently the memory isn't available in any general
> sense, so there are no OOMs now. Nevertheless, we can expect OOMs on
> this server just as soon as there is a real need to do a ~5GB sort,
> regardless of anything else.

Eh?  That's not at all what it looks like- they were getting OOM's
because they set work_mem to be higher than the actual amount of memory
they had and the Sort before the GroupAgg was actually trying to use all
that memory.  The HashAgg ended up not needing that much memory because
the aggregated set wasn't actually that large.  If anything, this shows
exactly what Jeff's fine work here is (hopefully) going to give us- the
option to plan a HashAgg in such cases, since we can accept spilling to
disk if we end up underestimate, or take advantage of that HashAgg
being entirely in memory if we overestimate.

> I don't think that this kind of perverse effect is uncommon. Hash
> aggregate can naturally be far faster than group agg + sort, Hash agg
> can naturally use a lot less memory in many cases, and we have every
> reason to think that grouping estimates are regularly totally wrong.

I'm confused as to why we're talking about the relative performance of a
HashAgg vs. a Sort+GroupAgg- of course the HashAgg is going to be faster
if it's got enough memory, but that's a constraint we have to consider
and deal with because, otherwise, the query can end up failing and
potentially impacting other queries or activity on the system, including
resulting in the entire database system falling over due to the OOM
Killer firing and killing a process and the database ending up
restarting and going through crash recovery, which is going to be quite
a bit worse than performance maybe not being great.

> You're significantly underestimating the risk.

Of... what?  That we'll end up getting worse performance because we
underestimated the size of the result set and we end up spilling to
disk with the HashAgg?  I think I'm giving that risk the amount of
concern it deserves- which is, frankly, not very much.  Users who run
into that issue, as this tweet *also* showed, are familiar with work_mem
and can tune it to address that.  This reaction to demand a new GUC to
break up work_mem into pieces strikes me as unjustified, and doing so
during beta makes it that much worse.

Having looked back, I'm not sure that I'm really in the minority
regarding the proposal to add this at this time either- there's been a
few different comments that it's too late for v13 and/or that we should
see if we actually end up with users seriously complaining about the
lack of a separate way to specify the memory for a given node type,
and/or that if we're going to do this then we should have a broader set
of options covering other nodes types too, all of which are positions
that I agree with.

Thanks,

Stephen




Re: Default setting for enable_hashagg_disk

2020-07-13 Thread Peter Eisentraut

On 2020-07-13 14:16, David Rowley wrote:

Isn't that what temp_file_limit is for?


Yeah, I guess that is so rarely used that I had forgotten about it.  So 
maybe that is also something that more users will want to be aware of.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Default setting for enable_hashagg_disk

2020-07-13 Thread David Rowley
On Mon, 13 Jul 2020 at 23:51, Peter Eisentraut
 wrote:
> I have an anecdote that might be related to this discussion.
>
> I was running an unrelated benchmark suite.  With PostgreSQL 12, one
> query ran out of memory.  With PostgreSQL 13, the same query instead ran
> out of disk space.  I bisected this to the introduction of disk-based
> hash aggregation.  Of course, the very point of that feature is to
> eliminate the out of memory and make use of disk space instead.  But
> running out of disk space is likely to be a worse experience than
> running out of memory.  Also, while it's relatively easy to limit memory
> use both in PostgreSQL and in the kernel, it is difficult or impossible
> to limit disk space use in a similar way.

Isn't that what temp_file_limit is for?

David




Re: Default setting for enable_hashagg_disk

2020-07-13 Thread Peter Eisentraut

On 2020-04-07 20:20, Jeff Davis wrote:

Now that we have Disk-based Hash Aggregation, there are a lot more
situations where the planner can choose HashAgg. The
enable_hashagg_disk GUC, if set to true, chooses HashAgg based on
costing. If false, it only generates a HashAgg path if it thinks it
will fit in work_mem, similar to the old behavior (though it will now
spill to disk if the planner was wrong about it fitting in work_mem).
The current default is true.


I have an anecdote that might be related to this discussion.

I was running an unrelated benchmark suite.  With PostgreSQL 12, one 
query ran out of memory.  With PostgreSQL 13, the same query instead ran 
out of disk space.  I bisected this to the introduction of disk-based 
hash aggregation.  Of course, the very point of that feature is to 
eliminate the out of memory and make use of disk space instead.  But 
running out of disk space is likely to be a worse experience than 
running out of memory.  Also, while it's relatively easy to limit memory 
use both in PostgreSQL and in the kernel, it is difficult or impossible 
to limit disk space use in a similar way.


I don't have a solution or proposal here, I just want to mention this as 
a possibility and suggest that we look out for similar experiences.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Default setting for enable_hashagg_disk

2020-07-13 Thread Amit Kapila
On Sat, Jul 11, 2020 at 3:30 AM Tom Lane  wrote:
>
> Stephen Frost  writes:
> > I don't see hash_mem as being any kind of proper fix- it's just punting
> > to the user saying "we can't figure this out, how about you do it" and,
> > worse, it's in conflict with how we already ask the user that question.
> > Turning it into a multiplier doesn't change that either.
>
> Have you got a better proposal that is reasonably implementable for v13?
> (I do not accept the argument that "do nothing" is a better proposal.)
>
> I agree that hash_mem is a stopgap, whether it's a multiplier or no,
> but at this point it seems difficult to avoid inventing a stopgap.
> Getting rid of the process-global work_mem setting is a research project,
> and one I wouldn't even count on having results from for v14.  In the
> meantime, it seems dead certain that there are applications for which
> the current behavior will be problematic.
>

If this is true then certainly it adds more weight to the argument for
having a solution like hash_mem or some other escape-hatch.  I know it
would be difficult to get the real-world data but why not try TPC-H or
similar workloads at a few different scale_factor/size?  I was
checking some old TPC-H results I had, and I found that many
of the plans were using Finalize GroupAggregate and Partial
GroupAggregate kinds of plans. There were a few where I saw Partial
HashAggregate being used, but on a random check it appears that
GroupAggregate is used more often.  It could be that after
parallelism GroupAggregate plans are getting preference but I am not
sure about this.  However, even if that is not true, I think after the
parallel aggregates the memory-related thing is taken care of to some
extent automatically because I think after that each worker doing
partial aggregation can be allowed to consume work_mem memory.  So,
probably the larger aggregates which are going to give better
performance by consuming more memory would already be parallelized and
would have given the desired results.  Now, allowing aggregates to use
more memory via hash_mem kind of thing is beneficial in non-parallel
cases, but for cases where parallelism is used it could be worse
because now each worker will be entitled to use more memory.
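
To make that concern concrete (illustrative values only):

    SET max_parallel_workers_per_gather = 4;
    SET work_mem = '256MB';
    -- each of the 4 workers plus the leader doing partial hash aggregation
    -- may use up to work_mem, i.e. roughly 5 * 256MB = 1.25GB in total;
    -- doubling the hash allowance would raise that ceiling to about 2.5GB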

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Default setting for enable_hashagg_disk

2020-07-12 Thread Tomas Vondra

On Sat, Jul 11, 2020 at 10:26:22PM -0700, David G. Johnston wrote:

On Saturday, July 11, 2020, Tom Lane  wrote:


"David G. Johnston"  writes:
> On Sat, Jul 11, 2020 at 5:47 PM Tom Lane  wrote:
>> It seems like a lot of the disagreement here is focused on Peter's
>> proposal to make hash_mem_multiplier default to 2.0.  But it doesn't
>> seem to me that that's a critical element of the proposal.  Why not just
>> make it default to 1.0, thus keeping the default behavior identical
>> to what it is now?

> If we don't default it to something other than 1.0 we might as well just
> make it memory units and let people decide precisely what they want to
use
> instead of adding the complexity of a multiplier.

Not sure how that follows?  The advantage of a multiplier is that it
tracks whatever people might do to work_mem automatically.






I was thinking that setting -1 would basically do that.



I think Tom meant that the multiplier would automatically track any
changes to work_mem, and adjust the hash_mem accordingly. With -1 (and
the GUC in units) you could only keep it exactly equal to work_mem, but
then as soon as you change it you'd have to update both.
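
A quick sketch of the difference (illustrative values; the units-based
hash_mem mentioned in the comments is the hypothetical alternative):

    -- multiplier form: the hash limit follows work_mem automatically
    SET work_mem = '64MB';
    SET hash_mem_multiplier = 1.5;   -- hash-based nodes get ~96MB
    SET work_mem = '128MB';          -- hash-based nodes now get ~192MB, no extra change
    -- units form: -1 would pin it to work_mem exactly; any other value has
    -- to be updated by hand whenever work_mem changes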


  In general
I'd view work_mem as the base value that people twiddle to control
executor memory consumption.  Having to also twiddle this other value
doesn't seem especially user-friendly.



I’ll admit I don’t have a feel for what is or is not user-friendly when
setting these GUCs in a session to override the global defaults.  But as
far as the global defaults I say it’s a wash between (32mb, -1) -> (32mb,
48mb) and (32mb, 1.0) -> (32mb, 1.5)

If you want 96mb for the session/query hash setting it to 96mb is
invariant, while setting it to 3.0 means it can change in the future if the
system work_mem changes.  Knowing the multiplier is 1.5 and choosing 64mb
for work_mem in the session is possible but also mutable and has
side-effects.  If the user is going to set both values to make it invariant
we are back to it being a wash.

I don’t believe using a multiplier will promote better comprehension for
why this setting exists compared to “-1 means use work_mem but you can
override a subset if you want.”

Is having a session level memory setting be mutable something we want to
introduce?

Is it more user-friendly?



I still think it should be in simple units, TBH. We already have
somewhat similar situation with cost parameters, where we often say that
seq_page_cost = 1.0 is the baseline for the other cost parameters, yet
we have not coded that as multipliers.


If we find that's a poor default, we can always change it later;

>> but it seems to me that the evidence for a higher default is
>> a bit thin at this point.

> So "your default is 1.0 unless you installed the new database on or after
> 13.4 in which case it's 2.0"?

What else would be new?  See e.g. 848ae330a.  (Note I'm not suggesting
that we'd change it in a minor release.)



A minor release update is what I had thought of, and to an extent that was
made possible by not using the multiplier upfront.

I agree options are wide open come v14 and beyond.

David J.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Default setting for enable_hashagg_disk

2020-07-12 Thread Tomas Vondra

On Sat, Jul 11, 2020 at 08:47:54PM -0400, Tom Lane wrote:

Tomas Vondra  writes:

I don't know, but one of the main arguments against simply suggesting
people to bump up work_mem (if they're hit by the hashagg spill in v13)
was that it'd increase overall memory usage for them. It seems strange
to then propose a new GUC set to a default that would result in higher
memory usage *for everyone*.


It seems like a lot of the disagreement here is focused on Peter's
proposal to make hash_mem_multiplier default to 2.0.  But it doesn't
seem to me that that's a critical element of the proposal.  Why not just
make it default to 1.0, thus keeping the default behavior identical
to what it is now?

If we find that's a poor default, we can always change it later;
but it seems to me that the evidence for a higher default is
a bit thin at this point.



You're right, I was specifically pushing against that aspect of the
proposal. Sorry for not making that clearer, I assumed it's clear from
the context of this (sub)thread.

I agree that making it 1.0 (or equal to work_mem, if it's not a multiplier)
by default, while allowing it to be increased if needed, would address most
of the spilling issues.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Default setting for enable_hashagg_disk

2020-07-11 Thread David G. Johnston
On Saturday, July 11, 2020, Tom Lane  wrote:

> "David G. Johnston"  writes:
> > On Sat, Jul 11, 2020 at 5:47 PM Tom Lane  wrote:
> >> It seems like a lot of the disagreement here is focused on Peter's
> >> proposal to make hash_mem_multiplier default to 2.0.  But it doesn't
> >> seem to me that that's a critical element of the proposal.  Why not just
> >> make it default to 1.0, thus keeping the default behavior identical
> >> to what it is now?
>
> > If we don't default it to something other than 1.0 we might as well just
> > make it memory units and let people decide precisely what they want to
> use
> > instead of adding the complexity of a multiplier.
>
> Not sure how that follows?  The advantage of a multiplier is that it
> tracks whatever people might do to work_mem automatically.


>
I was thinking that setting -1 would basically do that.


>   In general
> I'd view work_mem as the base value that people twiddle to control
> executor memory consumption.  Having to also twiddle this other value
> doesn't seem especially user-friendly.


I’ll admit I don’t have a feel for what is or is not user-friendly when
setting these GUCs in a session to override the global defaults.  But as
far as the global defaults I say it’s a wash between (32mb, -1) -> (32mb,
48mb) and (32mb, 1.0) -> (32mb, 1.5)

If you want 96mb for the session/query hash setting it to 96mb is
invariant, while setting it to 3.0 means it can change in the future if the
system work_mem changes.  Knowing the multiplier is 1.5 and choosing 64mb
for work_mem in the session is possible but also mutable and has
side-effects.  If the user is going to set both values to make it invariant
we are back to it being a wash.

I don’t believe using a multiplier will promote better comprehension for
why this setting exists compared to “-1 means use work_mem but you can
override a subset if you want.”

Is having a session level memory setting be mutable something we want to
introduce?

Is it more user-friendly?

>> If we find that's a poor default, we can always change it later;
> >> but it seems to me that the evidence for a higher default is
> >> a bit thin at this point.
>
> > So "your default is 1.0 unless you installed the new database on or after
> > 13.4 in which case it's 2.0"?
>
> What else would be new?  See e.g. 848ae330a.  (Note I'm not suggesting
> that we'd change it in a minor release.)
>

A minor release update is what I had thought of, and to an extent that was
made possible by not using the multiplier upfront.

I agree options are wide open come v14 and beyond.

David J.


Re: Default setting for enable_hashagg_disk

2020-07-11 Thread Tom Lane
"David G. Johnston"  writes:
> On Sat, Jul 11, 2020 at 5:47 PM Tom Lane  wrote:
>> It seems like a lot of the disagreement here is focused on Peter's
>> proposal to make hash_mem_multiplier default to 2.0.  But it doesn't
>> seem to me that that's a critical element of the proposal.  Why not just
>> make it default to 1.0, thus keeping the default behavior identical
>> to what it is now?

> If we don't default it to something other than 1.0 we might as well just
> make it memory units and let people decide precisely what they want to use
> instead of adding the complexity of a multiplier.

Not sure how that follows?  The advantage of a multiplier is that it
tracks whatever people might do to work_mem automatically.  In general
I'd view work_mem as the base value that people twiddle to control
executor memory consumption.  Having to also twiddle this other value
doesn't seem especially user-friendly.

>> If we find that's a poor default, we can always change it later;
>> but it seems to me that the evidence for a higher default is
>> a bit thin at this point.

> So "your default is 1.0 unless you installed the new database on or after
> 13.4 in which case it's 2.0"?

What else would be new?  See e.g. 848ae330a.  (Note I'm not suggesting
that we'd change it in a minor release.)

regards, tom lane




Re: Default setting for enable_hashagg_disk

2020-07-11 Thread David G. Johnston
On Sat, Jul 11, 2020 at 5:47 PM Tom Lane  wrote:

> Tomas Vondra  writes:
> > I don't know, but one of the main arguments against simply suggesting
> > people to bump up work_mem (if they're hit by the hashagg spill in v13)
> > was that it'd increase overall memory usage for them. It seems strange
> > to then propose a new GUC set to a default that would result in higher
> > memory usage *for everyone*.
>
> It seems like a lot of the disagreement here is focused on Peter's
> proposal to make hash_mem_multiplier default to 2.0.  But it doesn't
> seem to me that that's a critical element of the proposal.  Why not just
> make it default to 1.0, thus keeping the default behavior identical
> to what it is now?
>

If we don't default it to something other than 1.0 we might as well just
make it memory units and let people decide precisely what they want to use
instead of adding the complexity of a multiplier.


> If we find that's a poor default, we can always change it later;
> but it seems to me that the evidence for a higher default is
> a bit thin at this point.
>

So "your default is 1.0 unless you installed the new database on or after
13.4 in which case it's 2.0"?

I'd rather have it be just memory units defaulting to -1 meaning "use
work_mem".  In the unlikely scenario we decide post-release to want a
multiplier > 1.0 we can add the GUC with that default at that point.  The
multiplier would want to be ignored if hash_mem if set to anything other
than -1.

David J.


Re: Default setting for enable_hashagg_disk

2020-07-11 Thread Thomas Munro
On Sun, Jul 12, 2020 at 2:27 AM Tom Lane  wrote:
> David Rowley  writes:
> > hmm yeah. It's unfortunate, but I'm not sure how I'd have implemented
> > it differently.  The problem is made worse by the fact that we'll only
> > release the memory for the hash table during ExecEndHashJoin(). If the
> > planner had some ability to provide the executor with knowledge that
> > the node would never be rescanned, then the executor could release the
> > memory for the hash table after the join is complete.
>
> EXEC_FLAG_REWIND seems to fit the bill already?

FWIW I have a patch that does exactly that, which I was planning to
submit for CF2 along with some other patches that estimate and measure
peak executor memory usage.




Re: Default setting for enable_hashagg_disk

2020-07-11 Thread Peter Geoghegan
I would be okay with a default of 1.0.

Peter Geoghegan
(Sent from my phone)


Re: Default setting for enable_hashagg_disk

2020-07-11 Thread Tom Lane
Tomas Vondra  writes:
> I don't know, but one of the main arguments against simply suggesting
> people to bump up work_mem (if they're hit by the hashagg spill in v13)
> was that it'd increase overall memory usage for them. It seems strange
> to then propose a new GUC set to a default that would result in higher
> memory usage *for everyone*.

It seems like a lot of the disagreement here is focused on Peter's
proposal to make hash_mem_multiplier default to 2.0.  But it doesn't
seem to me that that's a critical element of the proposal.  Why not just
make it default to 1.0, thus keeping the default behavior identical
to what it is now?

If we find that's a poor default, we can always change it later;
but it seems to me that the evidence for a higher default is
a bit thin at this point.

regards, tom lane




Re: Default setting for enable_hashagg_disk

2020-07-11 Thread Tomas Vondra

On Sat, Jul 11, 2020 at 09:02:43AM -0700, David G. Johnston wrote:

On Sat, Jul 11, 2020 at 7:22 AM Stephen Frost  wrote:


There now seems to be some suggestions that not only should we have a
new GUC, but we should default to having it not be equal to work_mem (or
1.0 or whatever) and instead by higher, to be *twice* or larger whatever
the existing work_mem setting is- meaning that people whose systems are
working just fine and have good estimates that represent their workload
and who get the plans they want may then start seeing differences and
increased memory utilization in places that they *don't* want that, all
because we're scared that someone, somewhere, might see a regression due
to HashAgg spilling to disk.



If that increased memory footprint allows the planner to give me a better
plan with faster execution and with no OOM I'd be very happy that this
change happened. While having a more flexible memory allocation framework
is not a primary goal in and of itself it is a nice side-effect.  I'm not
going to say "let's only set work_mem to 32MB instead of 48MB so I can
avoid this faster HashAgg node and instead execute a nested loop (or
whatever)".  More probable is the user whose current nested loop plan is
fast enough and doesn't even realize that with a bit more memory they could
get a HashAgg that performs 15% faster.  For them this is a win on its
face.

I don't believe this negatively impacts the super-admin in our user-base
and is a decent win for the average and below average admin.

Do we really have an issue with plans being chosen while having access to
more memory being slower than plans chosen while having less memory?

The main risk here is that we choose for a user to consume more memory than
they expected and they report OOM issues to us.  We tell them to set this
new GUC to 1.0.  But that implies they are getting many non-HashAgg plans
produced when with a bit more memory those HashAgg plans would have been
chosen.  If they get those faster plans without OOM it's a win, if it OOMs
it's a loss.  I'm feeling optimistic here that we'll get considerably more
wins than losses.  How loss-averse do we need to be here though?  Note we
can give the upgrading user advance notice of our loss-aversion level and
they can simply disagree and set it to 1.0 and/or perform more thorough
testing.  So being optimistic feels like the right choice.



I don't know, but one of the main arguments against simply suggesting
people to bump up work_mem (if they're hit by the hashagg spill in v13)
was that it'd increase overall memory usage for them. It seems strange
to then propose a new GUC set to a default that would result in higher
memory usage *for everyone*.

Of course, having such GUC with a default a multiple of work_mem might
be a win overall - or maybe not. I don't have a very good idea how many
people will get bitten by this, and how many will get speedup (and how
significant the speedup would be).


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Default setting for enable_hashagg_disk

2020-07-11 Thread Peter Geoghegan
On Sat, Jul 11, 2020 at 4:23 PM Tomas Vondra
 wrote:
> I find that example rather suspicious. I mean, what exactly in the
> GroupAgg plan would consume this memory? Surely it'd have to be some
> node below the grouping, but sort shouldn't do that, no?
>
> Seems strange.

Well, I imagine hash aggregate manages to use much less memory than
the equivalent groupagg's sort, even though to the optimizer it
appears as if hash agg should end up using more memory (which is not
allowed by the optimizer when it exceeds work_mem, regardless of
whether or not it's faster). It may also be relevant that Hash agg can
use less memory simply by being faster. Going faster could easily
reduce the memory usage for the system as a whole, even when you
assume individual group agg nodes use more memory for as long as they
run. So in-memory hash agg is effectively less memory hungry.

It's not a great example of a specific case that we'd regress by not
having hash_mem/hash_mem_multiplier. It's an overestimate where older
releases accidentally got a bad, slow plan, not an underestimate where
older releases "lived beyond their means but got away with it" by
getting a good, fast plan. ISTM that the example is a good example of
the strange dynamics involved.

> I agree grouping estimates are often quite off, and I kinda agree with
> introducing hash_mem (or at least with the concept that hashing is more
> sensitive to the amount of memory than sort). Not sure it's the right escape
> hatch for the hashagg spill problem, but maybe it is.

The hash_mem/hash_mem_multiplier proposal aims to fix the problem
directly, and not be an escape hatch, because we don't like escape
hatches. I think that that probably fixes many or most of the problems
in practice, at least assuming that the admin is willing to tune it.
But a small number of remaining installations may still need a "true"
escape hatch. There is an argument for having both, though I hope that
the escape hatch can be avoided.

-- 
Peter Geoghegan




Re: Default setting for enable_hashagg_disk

2020-07-11 Thread Peter Geoghegan
On Fri, Jul 10, 2020 at 10:00 PM David Rowley  wrote:
> hmm yeah. It's unfortunate, but I'm not sure how I'd have implemented
> it differently.  The problem is made worse by the fact that we'll only
> release the memory for the hash table during ExecEndHashJoin(). If the
> planner had some ability to provide the executor with knowledge that
> the node would never be rescanned, then the executor could release the
> memory for the hash table after the join is complete. For now, we'll
> need to live with the fact that an Append containing many children
> doing hash joins will mean holding onto all that memory until the
> executor is shutdown :-(
>
> There's room to make improvements there, for sure, but not for PG13.

I think that we're stuck with the idea that partitionwise join uses up
to one work_mem allocation per partition until we deprecate work_mem
as a concept.

Anyway, I only talked about partitionwise join because that was your
example. I could just as easily have picked on parallel hash join
instead, which is something that I was involved in myself (kind of).
This is more or less a negative consequence of the incremental
approach we have taken here, which is a collective failure.

I have difficulty accepting that something like hash_mem_multiplier
cannot be accepted because it risks making the consequence of
questionable designs even worse. The problem remains that the original
assumption just isn't very robust, and isn't something that the user
has granular control over. In general it makes sense that a design in
a stable branch is assumed to be the norm that new things need to
respect, and not the other way around. But there must be some limit to
how far that's taken.

> It sounds interesting, but it also sounds like a new feature
> post-beta. Perhaps it's better we minimise the scope of the change to
> be a minimal fix just for the behaviour we predict some users might
> not like.

That's an understandable interpretation of the
hash_mem/hash_mem_multiplier proposal on the table, and yet one that I
disagree with. I consider it highly desirable to have a GUC that can
be tuned in a generic and high level way, on general principle. We
don't really do escape hatches, and I'd rather avoid adding one now
(though it's far preferable to doing nothing, which I consider totally
out of the question).

Pursuing what you called hashagg_mem is a compromise that will make
neither of us happy. It seems like an escape hatch by another name. I
would rather just go with your original proposal instead, especially
if that's the only thing that'll resolve the problem in front of us.

-- 
Peter Geoghegan




Re: Default setting for enable_hashagg_disk

2020-07-11 Thread Tomas Vondra

On Sat, Jul 11, 2020 at 09:49:43AM -0700, Peter Geoghegan wrote:

On Sat, Jul 11, 2020 at 7:22 AM Stephen Frost  wrote:

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Have you got a better proposal that is reasonably implementable for v13?
> (I do not accept the argument that "do nothing" is a better proposal.)



So, no, I don't agree that 'do nothing' (except ripping out the one GUC
that was already added) is a worse proposal than adding another work_mem
like thing that's only for some node types.


The question was "Have you got a better proposal that is reasonably
implementable for v13?".

This is anecdotal, but just today somebody on Twitter reported
*increasing* work_mem to stop getting OOMs from group aggregate +
sort:

https://twitter.com/theDressler/status/1281942941133615104

It was possible to fix the problem in this instance, since evidently
there wasn't anything else that really did try to consume ~5 GB of
work_mem memory. Evidently the memory isn't available in any general
sense, so there are no OOMs now. Nevertheless, we can expect OOMs on
this server just as soon as there is a real need to do a ~5GB sort,
regardless of anything else.



I find that example rather suspicious. I mean, what exactly in the
GroupAgg plan would consume this memory? Surely it'd have to be some
node below the grouping, but sort shouldn't do that, no?

Seems strange.


I don't think that this kind of perverse effect is uncommon. Hash
aggregate can naturally be far faster than group agg + sort, Hash agg
can naturally use a lot less memory in many cases, and we have every
reason to think that grouping estimates are regularly totally wrong.
You're significantly underestimating the risk.



I agree grouping estimates are often quite off, and I kinda agree with
introducing hash_mem (or at least with the concept that hashing is more
sensitive to the amount of memory than sort). Not sure it's the right escape
hatch for the hashagg spill problem, but maybe it is.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Default setting for enable_hashagg_disk

2020-07-11 Thread Peter Geoghegan
On Sat, Jul 11, 2020 at 7:22 AM Stephen Frost  wrote:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
> > Have you got a better proposal that is reasonably implementable for v13?
> > (I do not accept the argument that "do nothing" is a better proposal.)

> So, no, I don't agree that 'do nothing' (except ripping out the one GUC
> that was already added) is a worse proposal than adding another work_mem
> like thing that's only for some node types.

The question was "Have you got a better proposal that is reasonably
implementable for v13?".

This is anecdotal, but just today somebody on Twitter reported
*increasing* work_mem to stop getting OOMs from group aggregate +
sort:

https://twitter.com/theDressler/status/1281942941133615104

It was possible to fix the problem in this instance, since evidently
there wasn't anything else that really did try to consume ~5 GB of
work_mem memory. Evidently the memory isn't available in any general
sense, so there are no OOMs now. Nevertheless, we can expect OOMs on
this server just as soon as there is a real need to do a ~5GB sort,
regardless of anything else.

I don't think that this kind of perverse effect is uncommon. Hash
aggregate can naturally be far faster than group agg + sort, Hash agg
can naturally use a lot less memory in many cases, and we have every
reason to think that grouping estimates are regularly totally wrong.
You're significantly underestimating the risk.

--
Peter Geoghegan




Re: Default setting for enable_hashagg_disk

2020-07-11 Thread David G. Johnston
On Sat, Jul 11, 2020 at 7:22 AM Stephen Frost  wrote:

> There now seems to be some suggestions that not only should we have a
> new GUC, but we should default to having it not be equal to work_mem (or
> 1.0 or whatever) and instead by higher, to be *twice* or larger whatever
> the existing work_mem setting is- meaning that people whose systems are
> working just fine and have good estimates that represent their workload
> and who get the plans they want may then start seeing differences and
> increased memory utilization in places that they *don't* want that, all
> because we're scared that someone, somewhere, might see a regression due
> to HashAgg spilling to disk.
>

If that increased memory footprint allows the planner to give me a better
plan with faster execution and with no OOM I'd be very happy that this
change happened. While having a more flexible memory allocation framework
is not a primary goal in and of itself it is a nice side-effect.  I'm not
going to say "let's only set work_mem to 32MB instead of 48MB so I can
avoid this faster HashAgg node and instead execute a nested loop (or
whatever)".  More probable is the user whose current nested loop plan is
fast enough and doesn't even realize that with a bit more memory they could
get a HashAgg that performs 15% faster.  For them this is a win on its
face.

I don't believe this negatively impacts the super-admin in our user-base
and is a decent win for the average and below average admin.

Do we really have an issue with plans being chosen while having access to
more memory being slower than plans chosen while having less memory?

The main risk here is that we choose for a user to consume more memory than
they expected and they report OOM issues to us.  We tell them to set this
new GUC to 1.0.  But that implies they are getting many non-HashAgg plans
produced when with a bit more memory those HashAgg plans would have been
chosen.  If they get those faster plans without OOM it's a win, if it OOMs
it's a loss.  I'm feeling optimistic here that we'll get considerably more
wins than losses.  How loss-averse do we need to be here though?  Note we
can give the upgrading user advance notice of our loss-aversion level and
they can simply disagree and set it to 1.0 and/or perform more thorough
testing.  So being optimistic feels like the right choice.

David J.


Re: Default setting for enable_hashagg_disk

2020-07-11 Thread Tom Lane
David Rowley  writes:
> hmm yeah. It's unfortunate, but I'm not sure how I'd have implemented
> it differently.  The problem is made worse by the fact that we'll only
> release the memory for the hash table during ExecEndHashJoin(). If the
> planner had some ability to provide the executor with knowledge that
> the node would never be rescanned, then the executor could release the
> memory for the hash table after the join is complete.

EXEC_FLAG_REWIND seems to fit the bill already?

regards, tom lane




Re: Default setting for enable_hashagg_disk

2020-07-11 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > I don't see hash_mem as being any kind of proper fix- it's just punting
> > to the user saying "we can't figure this out, how about you do it" and,
> > worse, it's in conflict with how we already ask the user that question.
> > Turning it into a multiplier doesn't change that either.
> 
> Have you got a better proposal that is reasonably implementable for v13?
> (I do not accept the argument that "do nothing" is a better proposal.)
> 
> I agree that hash_mem is a stopgap, whether it's a multiplier or no,
> but at this point it seems difficult to avoid inventing a stopgap.
> Getting rid of the process-global work_mem setting is a research project,
> and one I wouldn't even count on having results from for v14.  In the
> meantime, it seems dead certain that there are applications for which
> the current behavior will be problematic.  hash_mem seems like a cleaner
> and more useful stopgap than the "escape hatch" approach, at least to me.

Have we heard from people running actual applications where there is a
problem with raising work_mem to simply match what's already happening
with the v12 behavior?

Sure, there's been some examples on this thread of people who know the
backend well showing how the default work_mem will cause the v13 HashAgg
to spill to disk when given a query which has poor estimates, and that's
slower than v12 where it ignored work_mem and used a bunch of memory,
but it was also shown that raising work_mem addresses that issue and
brings v12 and v13 back in line.

There was a concern raised that other nodes might then use more memory-
but there's nothing new there, if you wanted to avoid batching with a
HashJoin in v12 you'd have exactly the same issue, and yet folks raise
work_mem all the time to address this, and to get that HashAgg plan in
the first place too when the estimates aren't so far off.

There now seems to be some suggestions that not only should we have a
new GUC, but we should default to having it not be equal to work_mem (or
1.0 or whatever) and instead by higher, to be *twice* or larger whatever
the existing work_mem setting is- meaning that people whose systems are
working just fine and have good estimates that represent their workload
and who get the plans they want may then start seeing differences and
increased memory utilization in places that they *don't* want that, all
because we're scared that someone, somewhere, might see a regression due
to HashAgg spilling to disk.

So, no, I don't agree that 'do nothing' (except ripping out the one GUC
that was already added) is a worse proposal than adding another work_mem
like thing that's only for some node types.  There's no way that we'd
even be considering such an approach during the regular development
cycle either- there would be calls for a proper holistic view, at least
to the point where every node type that could possibly allocate a
reasonable chunk of memory would be covered.

Thanks,

Stephen




Re: Default setting for enable_hashagg_disk

2020-07-10 Thread David Rowley
On Sat, 11 Jul 2020 at 14:02, Peter Geoghegan  wrote:
>
> On Fri, Jul 10, 2020 at 6:19 PM David Rowley  wrote:
> > If we have to have a new GUC, my preference would be hashagg_mem,
> > where -1 means use work_mem and a value between 64 and MAX_KILOBYTES
> > would mean use that value.  We'd need some sort of check hook to
> > disallow 0-63. I really am just failing to comprehend why we're
> > contemplating changing the behaviour of Hash Join here.
>
> I don't understand why parititonwise hash join consumes work_mem in
> the way it does. I assume that the reason is something like "because
> that behavior was the easiest to explain", or perhaps "people that use
> partitioning ought to be able to tune their database well". Or even
> "this design avoids an epic pgsql-hackers thread, because of course
> every hash table should get its own work_mem".

hmm yeah. It's unfortunate, but I'm not sure how I'd have implemented
it differently.  The problem is made worse by the fact that we'll only
release the memory for the hash table during ExecEndHashJoin(). If the
planner had some ability to provide the executor with knowledge that
the node would never be rescanned, then the executor could release the
memory for the hash table after the join is complete. For now, we'll
need to live with the fact that an Append containing many children
doing hash joins will mean holding onto all that memory until the
executor is shutdown :-(

There's room to make improvements there, for sure, but not for PG13.
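
To put a rough number on that (illustrative values, assuming each child join
builds its own hash table):

    SET enable_partitionwise_join = on;
    SET work_mem = '64MB';
    -- an Append over, say, 32 partitionwise hash joins can hold on the order
    -- of 32 * 64MB = 2GB of hash tables until the executor shuts down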

> > Of course, I
> > understand that that node type also uses a hash table, but why does
> > that give it the right to be involved in a change that we're making to
> > try and give users the ability to avoid possible regressions with Hash
> > Agg?
>
> It doesn't, exactly. The idea of hash_mem came from similar settings
> in another database system that you'll have heard of, that affect all
> nodes that use a hash table. I read about this long ago, and thought
> that it might make sense to do something similar as a way of improving
> work_mem

It sounds interesting, but it also sounds like a new feature
post-beta. Perhaps it's better we minimise the scope of the change to
be a minimal fix just for the behaviour we predict some users might
not like.

David




Re: Default setting for enable_hashagg_disk

2020-07-10 Thread Peter Geoghegan
On Fri, Jul 10, 2020 at 6:19 PM David Rowley  wrote:
> If we get hash_mem
> or some variant that is a multiplier of work_mem, then that user is in
> exactly the same situation for that plan. i.e there's no ability to
> increase the memory allowances for Hash Agg alone.

That's true, of course.

> If we have to have a new GUC, my preference would be hashagg_mem,
> where -1 means use work_mem and a value between 64 and MAX_KILOBYTES
> would mean use that value.  We'd need some sort of check hook to
> disallow 0-63. I really am just failing to comprehend why we're
> contemplating changing the behaviour of Hash Join here.

I don't understand why parititonwise hash join consumes work_mem in
the way it does. I assume that the reason is something like "because
that behavior was the easiest to explain", or perhaps "people that use
partitioning ought to be able to tune their database well". Or even
"this design avoids an epic pgsql-hackers thread, because of course
every hash table should get its own work_mem".

> Of course, I
> understand that that node type also uses a hash table, but why does
> that give it the right to be involved in a change that we're making to
> try and give users the ability to avoid possible regressions with Hash
> Agg?

It doesn't, exactly. The idea of hash_mem came from similar settings
in another database system that you'll have heard of, that affect all
nodes that use a hash table. I read about this long ago, and thought
that it might make sense to do something similar as a way of improving
work_mem (without replacing it with something completely different to
enable things like the "hash teams" design, which should be the long
term goal). It's unusual that it took this hashaggs-that-spill issue
to make the work_mem situation come to a head, and it's unusual that
the proposal on the table doesn't just target hash agg. But it's not
*that* unusual.

I believe that it makes sense on balance to lump together hash
aggregate and hash join, with the expectation that the user might want
to tune them for the system as a whole. This is not an escape hatch --
it's something that adds granularity to how work_mem can be tuned in a
way that makes sense (but doesn't make perfect sense). It doesn't
reflect reality, but I think that it comes closer to reflecting
reality than other variations that I can think of, including your
hashagg_mem compromise proposal (which is still much better than plain
work_mem). In short, hash_mem is relatively conceptually clean, and
doesn't unduly burden the user.

I understand that you only want to add an escape hatch, which is what
hashagg_mem still amounts to. There are negative consequences to the
setting affecting hash join, which I am not unconcerned about. On the
other hand, hashagg_mem is an escape hatch, and that's ugly in a way
that hash_mem isn't. I'm also concerned about that.

In the end, I think that the "hash_mem vs. hashagg_mem" question is
fundamentally a matter of opinion.

-- 
Peter Geoghegan




Re: Default setting for enable_hashagg_disk

2020-07-10 Thread David G. Johnston
On Fri, Jul 10, 2020 at 6:43 PM David Rowley  wrote:

> On Sat, 11 Jul 2020 at 13:36, David G. Johnston
>  wrote:
> > If we add a setting that defaults to work_mem then the benefit is
> severely reduced.  You still have to modify individual queries, but the
> change can simply be more targeted than changing work_mem alone.
>
> I think the idea is that this is an escape hatch to allow users to get
> something closer to what PG12 did, but only if they really need it.  I
> can't quite understand why we need to leave the escape hatch open and
> push them halfway through it.  I find escape hatches are best left
> closed until you really have no choice but to use them.
>
>
The escape hatch dynamic is "the user senses a problem, goes into their
query, and modifies some GUCs to make the problem go away".  As a user
I'd much rather have the odds of my needing to use that escape hatch
reduced - especially if that reduction can be done without risk and without
any action on my part.

It's like having someone in a box right now, and then turning up the heat.
We can give them an opening to get out of the box if they need it but we
can also give them A/C.  For some the A/C may be unnecessary, but also not
harmful, while a smaller group will stay in the margin, and for the
others it's not enough and they use the opening (which they would have done
anyway without the A/C).

David J.


Re: Default setting for enable_hashagg_disk

2020-07-10 Thread David Rowley
On Sat, 11 Jul 2020 at 13:36, David G. Johnston
 wrote:
> If we add a setting that defaults to work_mem then the benefit is severely 
> reduced.  You still have to modify individual queries, but the change can 
> simply be more targeted than changing work_mem alone.

I think the idea is that this is an escape hatch to allow users to get
something closer to what PG12 did, but only if they really need it.  I
can't quite understand why we need to leave the escape hatch open and
push them halfway through it.  I find escape hatches are best left
closed until you really have no choice but to use them.

David




Re: Default setting for enable_hashagg_disk

2020-07-10 Thread David G. Johnston
On Fri, Jul 10, 2020 at 6:19 PM David Rowley  wrote:

> If we have to have a new GUC, my preference would be hashagg_mem,
> where -1 means use work_mem and a value between 64 and MAX_KILOBYTES
> would mean use that value.  We'd need some sort of check hook to
> disallow 0-63. I really am just failing to comprehend why we're
> contemplating changing the behaviour of Hash Join here.


If we add a setting that defaults to work_mem then the benefit is severely
reduced.  You still have to modify individual queries, but the change can
simply be more targeted than changing work_mem alone.  I truly want whatever
we do to provide that ability, as well as a default value that is
greater than the current work_mem value - which in v12 was being ignored,
and thus production usage saw memory consumption greater than work_mem.
Only a multiplier does this.  A multiplier-only solution fixes the problem
at hand.  A multiplier-or-memory solution adds complexity but provides
flexibility.  If adding that flexibility is straight-forward I don't see
any serious downside other than the complexity of having the meaning of a
single GUC's value dependent upon its magnitude.

Of course, I
> understand that that node type also uses a hash table, but why does
> that give it the right to be involved in a change that we're making to
> try and give users the ability to avoid possible regressions with Hash
> Agg?
>

If Hash Join isn't affected by the "was allowed to use unlimited amounts of
execution memory but now isn't" change then it probably should continue to
consult work_mem instead of being changed to use the calculated value
(work_mem x multiplier).

David J.


Re: Default setting for enable_hashagg_disk

2020-07-10 Thread David Rowley
On Sat, 11 Jul 2020 at 12:47, David G. Johnston
 wrote:
> The multiplier seems strictly better than "rely on work_mem alone, i.e., do 
> nothing"; the detracting factor being one more GUC.  Even if one wants to 
> argue the solution is ugly or imperfect the current state seems worse and a 
> more perfect option doesn't seem worth waiting for.  The multiplier won't 
> make every single upgrade a non-event but it provides a more than sufficient 
> amount of control and in the worse case can be effectively ignored by setting 
> it to 1.0.

My argument wasn't related to if the new GUC should be a multiplier of
work_mem or an absolute amount of memory. The point I was trying to
make was that the solution of adding a GUC to allow users to increase the
memory for Hash Join and Hash Agg, for plans which don't contain any other
node types that use work_mem, is the same as doing nothing. As of
today, those people could just increase work_mem. If we get hash_mem
or some variant that is a multiplier of work_mem, then that user is in
exactly the same situation for that plan. i.e there's no ability to
increase the memory allowances for Hash Agg alone.

If we have to have a new GUC, my preference would be hashagg_mem,
where -1 means use work_mem and a value between 64 and MAX_KILOBYTES
would mean use that value.  We'd need some sort of check hook to
disallow 0-63. I really am just failing to comprehend why we're
contemplating changing the behaviour of Hash Join here. Of course, I
understand that that node type also uses a hash table, but why does
that give it the right to be involved in a change that we're making to
try and give users the ability to avoid possible regressions with Hash
Agg?

David




Re: Default setting for enable_hashagg_disk

2020-07-10 Thread David G. Johnston
On Fri, Jul 10, 2020 at 5:16 PM David Rowley  wrote:

> Stephen mentions in [1] that:
> > Users who are actually hit by this in a negative way
> > have an option- increase work_mem to reflect what was actually happening
> > already.
>
> Peter is not a fan of that idea, which can only be due to the fact
> that it will also increase the maximum memory consumption allowed by
> other nodes in the plan too.


That isn't the only reason for me - the main advantage of hash_mem is that
we get to set a default to some multiple greater than 1.0 so that an
upgrade to v13 has a region where behavior similar to v12 is effectively
maintained.  I have no feel for whether that should be 2.0, 4.0, or
something else, but 2.0 seemed small and I chose to use a power of 2.

My concern is that if we do hash_mem and
> have that control the memory allowances for Hash Joins and Hash Aggs,
> then that solution is just as good as Stephen's idea when the plan
> only contains Hash Joins and Hash Aggs.
>
> As much as I do want to see us get something to allow users some
> reasonable way to get the same performance as they're used to, I'm
> concerned that giving users something that works for many of the use
> cases is not really going to be as good as giving them something that
> works in all their use cases.   A user who has a partitioned table
> with a good number of partitions and partition-wise joins enabled
> might not like it if their Hash Join plan suddenly consumes hash_mem *
> nPartitions when they've set hash_mem to 10x of work_mem due to some
> other plan that requires that to maintain PG12's performance in PG13.
>

I don't know enough about the hash join dynamics to comment there, but
if an admin goes in and changes the system-wide default to 10x in lieu
of a targeted fix for a query that actually needs ten times its current
work_mem to work properly, I'd say that would be a poor decision.
Absent hash_mem they wouldn't update work_mem on their system to 10x
its current value in order to upgrade to v13; they'd set work_mem for
that query specifically.  The same should happen here.

Frankly, admins who are on top of their game, measuring and monitoring
query performance and memory consumption, could operate in our "do
nothing" mode by setting the default for hash_mem to 1.0 and doling out
memory via work_mem as they always have.  That said, setting hash_mem
to 10x for that single query would reduce their risk of OOM (none of
the work_mem-consuming nodes would be increased), so having the GUC
would be a net win should they avail themselves of it.

The multiplier seems strictly better than "rely on work_mem alone, i.e., do
nothing"; the detracting factor being one more GUC.  Even if one wants to
argue the solution is ugly or imperfect, the current state seems worse and a
more perfect option doesn't seem worth waiting for.  The multiplier won't
make every single upgrade a non-event, but it provides a more than
sufficient amount of control and in the worst case can be effectively
ignored by setting it to 1.0.

Is there some reason to think that having this multiplier with a
conservative default of 2.0 would cause an actual problem - and would that
scenario have likely caused an OOM anyway in v12?  Given that "work_mem can
be used many many times in a single query" I'm having trouble imagining
such a problem.

David J.


Re: Default setting for enable_hashagg_disk

2020-07-10 Thread David Rowley
On Sat, 11 Jul 2020 at 10:00, Tom Lane  wrote:
>
> Stephen Frost  writes:
> > I don't see hash_mem as being any kind of proper fix- it's just punting
> > to the user saying "we can't figure this out, how about you do it" and,
> > worse, it's in conflict with how we already ask the user that question.
> > Turning it into a multiplier doesn't change that either.
>
> Have you got a better proposal that is reasonably implementable for v13?
> (I do not accept the argument that "do nothing" is a better proposal.)
>
> I agree that hash_mem is a stopgap, whether it's a multiplier or no,
> but at this point it seems difficult to avoid inventing a stopgap.
> Getting rid of the process-global work_mem setting is a research project,
> and one I wouldn't even count on having results from for v14.  In the
> meantime, it seems dead certain that there are applications for which
> the current behavior will be problematic.  hash_mem seems like a cleaner
> and more useful stopgap than the "escape hatch" approach, at least to me.

If we're going to end up going down the route of something like
hash_mem for PG13, wouldn't it be better to have something more like
hashagg_mem that adjusts the memory limits for Hash Agg only?

Stephen mentions in [1] that:
> Users who are actually hit by this in a negative way
> have an option- increase work_mem to reflect what was actually happening
> already.

Peter is not a fan of that idea, which can only be due to the fact
that it will also increase the maximum memory consumption allowed by
other nodes in the plan too. My concern is that if we do hash_mem and
have that control the memory allowances for Hash Joins and Hash Aggs,
then that solution is just as good as Stephen's idea when the plan
only contains Hash Joins and Hash Aggs.

As much as I do want to see us get something to allow users some
reasonable way to get the same performance as they're used to, I'm
concerned that giving users something that works for many of the use
cases is not really going to be as good as giving them something that
works in all their use cases.   A user who has a partitioned table
with a good number of partitions and partition-wise joins enabled
might not like it if their Hash Join plan suddenly consumes hash_mem *
nPartitions when they've set hash_mem to 10x of work_mem due to some
other plan that requires that to maintain PG12's performance in PG13.
 If that user is unable to adjust hash_mem because of that, then
they're not going to be very satisfied that we've added hash_mem to
allow their query to perform as well as it did in PG12. They'll be
exposed to the same OOM risk they had in PG12 if they do increase
hash_mem here.

David

[1] 
https://www.postgresql.org/message-id/20200710143415.gj12...@tamriel.snowman.net




Re: Default setting for enable_hashagg_disk

2020-07-10 Thread Peter Geoghegan
On Fri, Jul 10, 2020 at 2:50 PM Stephen Frost  wrote:
> Nothing of what you've said thus far has shown me that there were
> material bits of the discussion that I've missed.

Maybe that's just because you missed those bits too?

> No, that other people
> feel differently or have made comments supporting one thing or another
> isn't what I would consider material- I'm as allowed my opinions as much
> as others, even when I disagree with the majority (or so claimed anyhow-
> I've not gone back to count, but I don't claim it to be otherwise
> either).

You are of course entitled to your opinion.

The problem we're trying to address here is paradoxical, in a certain
sense. The HashAggs-that-spill patch is somehow not at fault on the
one hand, but on the other hand has created this urgent need to
ameliorate what is for all intents and purposes a regression.
Everything is intertwined. Yes -- this *is* weird! And, I admit that
the hash_mem proposal is unorthodox, even ugly -- in fact, I've said
words to that effect on perhaps a dozen occasions at this point. This
is also weird.

I pointed out that my hash_mem proposal was popular because it seemed
like it might save time. When I see somebody I know proposing
something strange, my first thought is "why are they proposing that?".
I might only realize some time later that there are special
circumstances that make the proposal much more reasonable than it
seemed at first (maybe even completely reasonable). There is no
inherent reason why other people supporting the proposal makes it more
valid, but in general it does suggest that special circumstances might
apply. It guides me in the direction of looking for and understanding
what they might be sooner.

-- 
Peter Geoghegan




Re: Default setting for enable_hashagg_disk

2020-07-10 Thread Tom Lane
Stephen Frost  writes:
> I don't see hash_mem as being any kind of proper fix- it's just punting
> to the user saying "we can't figure this out, how about you do it" and,
> worse, it's in conflict with how we already ask the user that question.
> Turning it into a multiplier doesn't change that either.

Have you got a better proposal that is reasonably implementable for v13?
(I do not accept the argument that "do nothing" is a better proposal.)

I agree that hash_mem is a stopgap, whether it's a multiplier or no,
but at this point it seems difficult to avoid inventing a stopgap.
Getting rid of the process-global work_mem setting is a research project,
and one I wouldn't even count on having results from for v14.  In the
meantime, it seems dead certain that there are applications for which
the current behavior will be problematic.  hash_mem seems like a cleaner
and more useful stopgap than the "escape hatch" approach, at least to me.

regards, tom lane



