Re: Spilling hashed SetOps and aggregates to disk

2018-06-21 Thread Jeff Davis
On Thu, 2018-06-21 at 13:44 -0700, David Gershuni wrote:
> To handle hash collisions, we can do the following:
> 
> 1) We track the current hash code we’re processing, in ascending
> order.
> 
> 2) Instead of combining one group at a time, we’ll maintain a list
> of
> all groups we’ve seen that match the current hash code.
> 
> 3) When we peek at a spill file, if its next group’s hash code
> matches
> our current hash code, we’ll check to see if that group matches any
> of the
> groups in our list. If so, we’ll combine it with the matching group.
> If not,
> we’ll add this group to our list.
> 
> 4) Once the head of each spill file exceeds the current hash code,
> we’ll
> emit our list as output, empty it, and advance to the next hash code.
> 
> Does that seem reasonable?

It seems algorithmically reasonable but awfully complex. I don't think
that's a good path to take.

I still only see two viable approaches: 

(1) Spill tuples into hash partitions: works and is a reasonably simple
extension of our existing code. This is basically what the last patch I
posted does (posted before grouping sets, so needs to be updated).

(2) Spill tuples into a sort: I like this approach because it can be
done simply (potentially less code than #1), and could be further
improved without adding a ton of complexity. It may even eliminate the
need for the planner to choose between hashagg and sort. The problem
here is data types that have a hash opclass but no btree opclass. I
might try to sidestep this problem by saying that data types with no
btree opclass will not obey work_mem.

Additionally, there are two other ideas that we might want to pursue
regardless of which approach we take:

* support evicting transition states from the hash table, so that we
can handle clustered input better

* refactor grouping sets so that phases are separate executor nodes so
that we can undo some of the complexity in nodeAgg.c

Regards,
Jeff Davis




Re: Spilling hashed SetOps and aggregates to disk

2018-06-21 Thread David Gershuni
On Jun 21, 2018, at 1:04 PM, Jeff Davis  wrote:
> On Thu, 2018-06-21 at 11:04 -0700, David Gershuni wrote:
>> This approach seems functionally correct, but I don't like the idea
>> of
>> transforming O(N) tuples of disk I/O into O(S*N) tuples of disk I/O
>> (in the worst case).
> 
> It's the same amount of I/O as the idea you suggested of putting the
> hash tables into separate phases, but it may consume more disk space at
> once in the worst case.
> 
> We can mitigate that if it becomes a problem later.

That’s true. Both fall-back approaches would suffer from high I/O.
However, the HashSort-based approach would offer much better
performance, assuming the sorting issue is overcome.


>> b) is accomplished by not sorting groups by their actual grouping
>> keys, but
>> instead sorting by their hash values. This works because we don't
>> need a 
>> true sort. We just need to process identical groups in a consistent
>> order
>> during the merge step. As long as we're careful about collisions
>> during the
>> merge, this should work.
> 
> Can you expand on how we should handle collisions? If all we can do is
> hash and compare equality, then it seems complex to draw group
> boundaries.

In the HashSort algorithm, the sort order is only used as a tool in the
merge step. In the merge step, we will process one grouping set
at a time.

By sorting before merging, we ensure that we’ll encounter
all partial aggregations of a group at once by looking at the head of each
spill file. This allows us to maintain a ‘current group’ and combine all
instances of it from each spill file. Then when no more instances of that
group are present, we emit the group as output. So the order can be
anything, as long as it’s consistent.

To handle hash collisions, we can do the following:

1) We track the current hash code we’re processing, in ascending
order.

2) Instead of combining one group at a time, we’ll maintain a list of
all groups we’ve seen that match the current hash code.

3) When we peek at a spill file, if its next group’s hash code matches
our current hash code, we’ll check to see if that group matches any of the
groups in our list. If so, we’ll combine it with the matching group. If not,
we’ll add this group to our list.

4) Once the head of each spill file exceeds the current hash code, we’ll
emit our list as output, empty it, and advance to the next hash code.
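
To make the merge step concrete, here is a rough standalone sketch in C
(made-up types -- SpillEntry/SpillRun -- with the group state reduced to a
count and the spill files modeled as pre-sorted in-memory arrays; it only
illustrates the idea, it is not executor code):

/* Standalone sketch of the hash-ordered merge step (simplified types). */
#include <stdio.h>
#include <stdint.h>

typedef struct SpillEntry { uint32_t hash; int key; long count; } SpillEntry;
typedef struct SpillRun { const SpillEntry *entries; int nentries; int pos; } SpillRun;

/* Combine every group that shares the current hash code, then emit. */
static void merge_runs(SpillRun *runs, int nruns)
{
    SpillEntry current[64];        /* groups colliding on the current hash */
    int ncurrent;

    for (;;)
    {
        uint32_t minhash = UINT32_MAX;
        int i;

        /* Find the smallest hash code at the head of any run. */
        for (i = 0; i < nruns; i++)
            if (runs[i].pos < runs[i].nentries &&
                runs[i].entries[runs[i].pos].hash < minhash)
                minhash = runs[i].entries[runs[i].pos].hash;
        if (minhash == UINT32_MAX)
            break;              /* all runs exhausted */

        /* Pull every entry with this hash code from every run. */
        ncurrent = 0;
        for (i = 0; i < nruns; i++)
        {
            SpillRun *r = &runs[i];

            while (r->pos < r->nentries && r->entries[r->pos].hash == minhash)
            {
                const SpillEntry *e = &r->entries[r->pos++];
                int j;

                for (j = 0; j < ncurrent; j++)
                    if (current[j].key == e->key)   /* true group match */
                        break;
                if (j < ncurrent)
                    current[j].count += e->count;   /* combine */
                else
                    current[ncurrent++] = *e;       /* new colliding group */
            }
        }

        /* Every run has advanced past this hash code: emit and move on. */
        for (i = 0; i < ncurrent; i++)
            printf("key=%d count=%ld\n", current[i].key, current[i].count);
    }
}

int main(void)
{
    /* Two runs, each sorted by hash; keys 7 and 9 collide on hash 42. */
    static const SpillEntry a[] = {{10, 1, 3}, {42, 7, 2}, {42, 9, 1}};
    static const SpillEntry b[] = {{42, 7, 5}, {99, 4, 4}};
    SpillRun runs[2] = {{a, 3, 0}, {b, 2, 0}};

    merge_runs(runs, 2);
    return 0;
}

The list only ever holds the groups that collide on a single hash code, so
it should stay tiny in practice.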

Does that seem reasonable?

Best,
David
Salesforce

Re: Spilling hashed SetOps and aggregates to disk

2018-06-21 Thread Jeff Davis
On Thu, 2018-06-21 at 11:04 -0700, David Gershuni wrote:
> This approach seems functionally correct, but I don't like the idea
> of
> transforming O(N) tuples of disk I/O into O(S*N) tuples of disk I/O
> (in the worst case).

It's the same amount of I/O as the idea you suggested of putting the
hash tables into separate phases, but it may consume more disk space at
once in the worst case.

We can mitigate that if it becomes a problem later.

> b) is accomplished by not sorting groups by their actual grouping
> keys, but
> instead sorting by their hash values. This works because we don't
> need a 
> true sort. We just need to process identical groups in a consistent
> order
> during the merge step. As long as we're careful about collisions
> during the
> merge, this should work.

Can you expand on how we should handle collisions? If all we can do is
hash and compare equality, then it seems complex to draw group
boundaries.

Regards,
Jeff Davis




Re: Spilling hashed SetOps and aggregates to disk

2018-06-21 Thread David Gershuni


> On Jun 19, 2018, at 10:36 PM, Jeff Davis  wrote:
> 
> But I am worried that I am missing something, because it appears that
> for AGG_MIXED, we wait until the end of the last phase to emit the
> contents of the hash tables. Wouldn't they be complete after the first
> phase?

You're right. They're complete after the first phase, but I believe
they're not emitted until the end in order to minimize the complexity of
the code. I don't think the possibility of exceeding work_mem was taken
into account when this decision was made.


> I was imagining that (in the simplest approach) each hash table would
> have its own set of spilled tuples. If grouping set A can be advanced
> (because it's present in the hash table) but B cannot (because it's not
> in the hash table and we are at the memory limit), then it goes into
> the spill file for B but not A. That means that A is still finished
> after the first pass.
...
> I realize that this simple approach could be inefficient if multiple
> hash tables are spilling because we'd duplicate the spilled tuples. But
> let me know if it won't work at all.

This approach seems functionally correct, but I don't like the idea of
transforming O(N) tuples of disk I/O into O(S*N) tuples of disk I/O
(in the worst case). As of yesterday, I think I have a better approach
that will work even for non-sortable grouping keys. (See below).


> I am having a hard time trying to satisfy all of the constraints that
> have been raised so far:
> 
> * handle data types with hash ops but no btree ops
> * handle many different group size distributions and input orders
> efficiently
> * avoid regressions or inefficiencies for grouping sets
> * handle variable-size (particularly O(n)-space) transition states
> 
> without taking on a lot of complexity. I like to work toward the right
> design, but I think simplicity also factors in to the right design.
> NodeAgg.c is already one of the larger and more complicated files that
> we have.

Yes, there are a lot of moving pieces here! And it will be non-trivial to
satisfy them all with one approach. I'm working on a new aggstrategy
that could be used in conjunction with AGG_SORT. This approach could
replace the current AGG_HASH and would satisfy all of the requirements
you mentioned for aggregates that have serial/deserial functions. 

My approach is based on the HashSort algorithm that I mentioned before
but with new modifications to a) handle multiple grouping sets and b) not
require btree ops on the grouping keys.

a) is accomplished by hashing all grouping sets simultaneously and putting
each set in its own memory context. Whenever work_mem is exceeded, we
sort and spill the hash table with the largest memory context.

b) is accomplished by not sorting groups by their actual grouping keys, but
instead sorting by their hash values. This works because we don't need a 
true sort. We just need to process identical groups in a consistent order
during the merge step. As long as we're careful about collisions during the
merge, this should work.
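
As a rough illustration of (a), the bookkeeping could look something like
this standalone sketch (plain C with made-up types; each grouping set gets
its own table with separately tracked memory, and the largest one is sorted
by hash value and spilled when we exceed the budget):

/* Standalone sketch: one hash table per grouping set, spill the largest,
 * writing its entries in hash-value order (no btree opclass needed). */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

typedef struct Entry { uint32_t hash; int key; long count; } Entry;

typedef struct SetTable {
    const char *name;       /* which grouping set this table belongs to */
    Entry   entries[16];    /* stand-in for the real hash table */
    int     nentries;
    size_t  mem_used;       /* what a per-set memory context would track */
} SetTable;

static int cmp_hash(const void *a, const void *b)
{
    uint32_t ha = ((const Entry *) a)->hash;
    uint32_t hb = ((const Entry *) b)->hash;

    return (ha > hb) - (ha < hb);
}

/* When the total exceeds the budget, sort and spill the biggest table. */
static void maybe_spill(SetTable *sets, int nsets, size_t budget)
{
    size_t  total = 0;
    int     biggest = 0;
    int     i;

    for (i = 0; i < nsets; i++)
    {
        total += sets[i].mem_used;
        if (sets[i].mem_used > sets[biggest].mem_used)
            biggest = i;
    }
    if (total <= budget)
        return;

    qsort(sets[biggest].entries, sets[biggest].nentries,
          sizeof(Entry), cmp_hash);

    printf("spilling set %s (%zu bytes), hash-ordered:\n",
           sets[biggest].name, sets[biggest].mem_used);
    for (i = 0; i < sets[biggest].nentries; i++)
        printf("  hash=%u key=%d count=%ld\n",
               sets[biggest].entries[i].hash,
               sets[biggest].entries[i].key,
               sets[biggest].entries[i].count);

    sets[biggest].nentries = 0;     /* table is emptied after the spill */
    sets[biggest].mem_used = 0;
}

int main(void)
{
    SetTable sets[2] = {
        {"(a,b)", {{7, 1, 2}, {3, 2, 5}, {9, 3, 1}}, 3, 3000},
        {"(a)",   {{5, 1, 4}},                       1, 1000},
    };

    maybe_spill(sets, 2, 2048);     /* pretend work_mem is 2kB */
    return 0;
}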

> A lot of the complexity we already have is that grouping sets are
> performed in one giant executor node rather than exposing the
> individual phases as separate executor nodes. I suppose the reason
> (correct me if I'm wrong) is that there are two outputs from a phase:
> the aggregated groups, and the original input tuples in a new sort
> order. That seems solvable -- the executor passes bitmaps around until
> BitmapHeapScan turns them into tuples.

That is an interesting idea, but it's somewhat orthogonal to the spill
implementation.
Still, it could be a nice way to decompose aggregate nodes that require multiple
phases/strategies. I'm curious to hear what others think.

Best,
David


Re: Spilling hashed SetOps and aggregates to disk

2018-06-19 Thread Jeff Davis
On Fri, 2018-06-15 at 12:30 -0700, David Gershuni wrote:
> For example, imagine a tuple that belongs to a group G in grouping
> set A had to be
> spilled because it also belonged to an evicted group from grouping
> set B. Then group
> G would remain in set A’s hash table at the end of the phase, but it
> wouldn’t have
> aggregated the values from the spilled tuple. Of course, work-arounds 
> are possible,
> but the current approach would not work as is.

I was imagining that (in the simplest approach) each hash table would
have its own set of spilled tuples. If grouping set A can be advanced
(because it's present in the hash table) but B cannot (because it's not
in the hash table and we are at the memory limit), then it goes into
the spill file for B but not A. That means that A is still finished
after the first pass.

But I am worried that I am missing something, because it appears that
for AGG_MIXED, we wait until the end of the last phase to emit the
contents of the hash tables. Wouldn't they be complete after the first
phase?

I realize that this simple approach could be inefficient if multiple
hash tables are spilling because we'd duplicate the spilled tuples. But
let me know if it won't work at all.

> But as you mentioned, this solution relies on all grouping keys being
> sortable, so we
> would need a fallback plan. For this, we could use your hash-based
> approach, but we
> would have to make modifications. One idea would be to split each
> grouping set into its
> own aggregate phase, so that only one hash table is in memory at a
> time. This would
> eliminate the performance benefits of grouping sets when keys aren’t
> sortable, but at
> least they would still work. 

I am having a hard time trying to satisfy all of the constraints that
have been raised so far:

* handle data types with hash ops but no btree ops
* handle many different group size distributions and input orders
efficiently
* avoid regressions or inefficiencies for grouping sets
* handle variable-size (particularly O(n)-space) transition states

without taking on a lot of complexity. I like to work toward the right
design, but I think simplicity also factors in to the right design.
NodeAgg.c is already one of the larger and more complicated files that
we have.

A lot of the complexity we already have is that grouping sets are
performed in one giant executor node rather than exposing the
individual phases as separate executor nodes. I suppose the reason
(correct me if I'm wrong) is that there are two outputs from a phase:
the aggregated groups, and the original input tuples in a new sort
order. That seems solvable -- the executor passes bitmaps around until
BitmapHeapScan turns them into tuples.

Regards,
Jeff Davis




Re: Spilling hashed SetOps and aggregates to disk

2018-06-15 Thread David Gershuni


> On Jun 13, 2018, at 12:53 PM, Jeff Davis  wrote:
> 
>> 
>> An adaptive hash agg node would start as a hash agg, and turn into a
>> "partial hash agg + sort + final group agg" when OOM is detected.
>> 
>> The benefit over ordinary sort+group agg is that the sort is
>> happening
>> on a potentially much smaller data set. When the buffer is large
>> enough to capture enough key repetitions, the output of the partial
>> hash agg can be orders of magnitude smaller than the scan.
>> 
>> My proposal was to use this for all group aggs, not only when the
>> planner chooses a hash agg, because it can speed up the sort and
>> reduce temp storage considerably. I guess the trick lies in
>> estimating
>> that cardinality reduction to avoid applying this when there's no
>> hope
>> of getting a payoff. The overhead of such a lazy hash table isn't
>> much, really. But yes, its cache locality is terrible, so it needs to
>> be considered.
> 
> I think this is a promising approach because it means the planner has
> one less decision to make. And the planner doesn't have great information
> to make its decision on, anyway (ndistinct estimates are hard enough on
> plain tables, and all bets are off a few levels up in the plan).
> 
> Unfortunately, it means that either all data types must support hashing
> and sorting[1], or we need to have a fallback path that can get by with
> hashing alone (which would not be tested nearly as well as more typical
> paths).

Jeff, I agree with you, but we should also take grouping sets into
consideration because they can cause the executor to create multiple
hash tables in memory simultaneously, each growing at a different rate.
Not only does this make eviction more complicated, but it actually
prevents your original approach from working because it violates the
assumption that all entries left in the hash table at the end of a
phase are complete and can be flushed to output.

For example, imagine a tuple that belongs to a group G in grouping set
A had to be spilled because it also belonged to an evicted group from
grouping set B. Then group G would remain in set A’s hash table at the
end of the phase, but it wouldn’t have aggregated the values from the
spilled tuple. Of course, work-arounds are possible, but the current
approach would not work as is.

Claudio’s proposal mostly solves this problem. In fact, his proposal is
very similar to the HashSort algorithm from [1] that I mentioned. We
would still need to think about how to choose which hash table to
evict from. For example, we could evict a group from the largest hash
table (if we tracked memory usage independently for each one).

But as you mentioned, this solution relies on all grouping keys being
sortable, so we would need a fallback plan. For this, we could use your
hash-based approach, but we would have to make modifications. One idea
would be to split each grouping set into its own aggregate phase, so
that only one hash table is in memory at a time. This would eliminate
the performance benefits of grouping sets when keys aren’t sortable,
but at least they would still work.

Best,
David
Salesforce

[1] Revisiting Aggregation for Data Intensive Applications: A Performance Study
https://arxiv.org/pdf/1311.0059.pdf





Re: Spilling hashed SetOps and aggregates to disk

2018-06-13 Thread Jeff Davis
On Wed, 2018-06-13 at 11:50 -0300, Claudio Freire wrote:
> In essence, the technique I've been using always uses a fixed-size
> hash table to do partial grouping. The table is never grown, when
> collisions do happen, the existing entry in the hash table is flushed
> to disk and the aggregate state in that bucket reset for the incoming
> key. To avoid excessive spilling due to frequent collisions, we use a
> kind of "lazy cuckoo" hash table. Lazy in the sense that it does no
> relocations, it just checks 2 hash values, and if it has to evict, it
> evicts the "oldest" value (with some cheap definition of "old").

...

> An adaptive hash agg node would start as a hash agg, and turn into a
> "partial hash agg + sort + final group agg" when OOM is detected.
> 
> The benefit over ordinary sort+group agg is that the sort is
> happening
> on a potentially much smaller data set. When the buffer is large
> enough to capture enough key repetitions, the output of the partial
> hash agg can be orders of magnitude smaller than the scan.
> 
> My proposal was to use this for all group aggs, not only when the
> planner chooses a hash agg, because it can speed up the sort and
> reduce temp storage considerably. I guess the trick lies in
> estimating
> that cardinality reduction to avoid applying this when there's no
> hope
> of getting a payoff. The overhead of such a lazy hash table isn't
> much, really. But yes, its cache locality is terrible, so it needs to
> be considered.

I think this is a promising approach because it means the planner has
one less decision to make. And the planner doesn't have great information
to make its decision on, anyway (ndistinct estimates are hard enough on
plain tables, and all bets are off a few levels up in the plan).

Unfortunately, it means that either all data types must support hashing
and sorting[1], or we need to have a fallback path that can get by with
hashing alone (which would not be tested nearly as well as more typical
paths).

I still like this idea, though.

Regards,
Jeff Davis


[1] https://www.postgresql.org/message-id/9584.1528739975%40sss.pgh.pa.us




Re: Spilling hashed SetOps and aggregates to disk

2018-06-13 Thread Jeff Davis
On Wed, 2018-06-13 at 10:08 -0400, Robert Haas wrote:
> I wasn't using the term "average" in a mathematical sense.  I suppose
> I really meant "typical".  I agree that thinking about cases where
> the
> group size is nonuniform is a good idea, but I don't think I agree
> that all uniform distributions are created equal.  Uniform
> distributions with 1 row per group are very different than uniform
> distributions with 1000 rows per group.

The only mechanism we have for dealing efficiently with skewed groups
is hashing; no alternative has been proposed.

I think what you are saying is that the case of medium-large groups
with clustered input are kind of like skewed groups because they have
enough locality to benefit from grouping before spilling. I can see
that.

So how do we handle both of these cases (skewed groups and clustered
groups) well?

I tend toward a simple approach, which is to just put whatever we can
in the hash table. Once it's full, if the hit rate on the hash table
(the whole table) drops below a certain threshold, we just dump all the
transition states out to disk and empty the hash table.

That would handle both skewed groups and clustering fairly well.
Clustering would cause the hit rate to rapidly go to zero when we are
past the current batch of groups, causing fairly quick switching to new
groups which should handle Tomas's case[1]. And it would also handle
ordinary skewed groups fairly well -- it may write them out sometimes
(depending on how smart our eviction strategy is), but the cost of that
is not necessarily bad because it will go back into the hash table
quickly.
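
To illustrate, a toy version of that policy might look like the following
(standalone C, with a counter standing in for the transition state and
stdout standing in for the spill file; the table size and threshold are
arbitrary):

/* Toy sketch of "flush the whole table when the hit rate drops". */
#include <stdio.h>
#include <string.h>

#define TABLE_SIZE   8          /* pretend this is all work_mem allows */
#define MIN_HIT_RATE 0.5

typedef struct Slot { int used; int key; long count; } Slot;

static Slot   table_[TABLE_SIZE];
static long   lookups, hits;

static void flush_table(void)
{
    int i;

    for (i = 0; i < TABLE_SIZE; i++)
        if (table_[i].used)
            printf("spill transition state: key=%d count=%ld\n",
                   table_[i].key, table_[i].count);
    memset(table_, 0, sizeof(table_));
    lookups = hits = 0;
}

static void advance(int key)
{
    int i, free_slot = -1;

    lookups++;
    for (i = 0; i < TABLE_SIZE; i++)
    {
        if (table_[i].used && table_[i].key == key)
        {
            table_[i].count++;          /* group already present: a hit */
            hits++;
            return;
        }
        if (!table_[i].used && free_slot < 0)
            free_slot = i;
    }

    if (free_slot >= 0)
    {
        table_[free_slot].used = 1;     /* room left: start a new group */
        table_[free_slot].key = key;
        table_[free_slot].count = 1;
        return;
    }

    /* Table is full and this key missed.  If the hit rate has collapsed,
     * the input has probably moved on to a new batch of groups: dump
     * everything and let the new groups take over. */
    if (lookups >= 32 && (double) hits / lookups < MIN_HIT_RATE)
    {
        flush_table();
        advance(key);                   /* now there is room for this group */
        return;
    }
    printf("spill input tuple: key=%d\n", key);
}

int main(void)
{
    int i;

    for (i = 0; i < 200; i++)
        advance(i / 10);        /* clustered input: ten tuples per group */
    flush_table();              /* emit whatever is still in memory */
    return 0;
}

The point is just that a single whole-table hit-rate check is enough to
notice that the input has moved on to a new cluster of groups.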

It also offers an incremental implementation strategy: something
resembling my patch could be first (which doesn't dump transition
states at all), followed by something that can dump transition states,
followed by some tweaking to make it smarter.

That still leaves the question about what to do with the small groups:
partition them (like my patch does) or feed them into a sort and group
them?

Regards,
Jeff Davis

[1] https://www.postgresql.org/message-id/46734151-3245-54ac-76fc-17742fb0e6d9%402ndquadrant.com



Re: Spilling hashed SetOps and aggregates to disk

2018-06-13 Thread Claudio Freire
On Tue, Jun 5, 2018 at 5:05 AM Tomas Vondra
 wrote:
>
> On 06/05/2018 07:46 AM, Jeff Davis wrote:
> > On Tue, 2018-06-05 at 07:04 +0200, Tomas Vondra wrote:
> >> I expect the eviction strategy to be the primary design challenge of
> >> this patch. The other bits will be mostly determined by this one
> >> piece.
> >
> > Not sure I agree that this is the primary challenge.
> >
> > The cases that benefit from eviction are probably a minority. I see two
> > categories that would benefit:
> >
> >* Natural clustering in the heap. This sounds fairly common, but a
> > lot of the cases that come to mind are too low-cardinality to be
> > compelling; e.g. timestamps grouped by hour/day/month. If someone has
> > run into a high-cardinality natural grouping case, let me know.
> >* ARRAY_AGG (or similar): individual state values can be large enough
> > that we need to evict to avoid exceeding work_mem even if not adding
> > any new groups.
> >
> > In either case, it seems like a fairly simple eviction strategy would
> > work. For instance, we could just evict the entire hash table if
> > work_mem is exceeded or if the hit rate on the hash table falls below a
> > certain threshold. If there was really something important that should
> > have stayed in the hash table, it will go back in soon anyway.
> >
> > So why should eviction be a major driver for the entire design? I agree
> > it should be an area of improvement for the future, so let me know if
> > you see a major problem, but I haven't been as focused on eviction.
> >
>
> My concern is more about what happens when the input tuple ordering is
> inherently incompatible with the eviction strategy, greatly increasing
> the amount of data written to disk during evictions.
>
> Say for example that we can fit 1000 groups into work_mem, and that
> there are 2000 groups in the input data set. If the input is correlated
> with the groups, everything is peachy because we'll evict the first
> batch, and then group the remaining 1000 groups (or something like
> that). But if the input data is random (which can easily happen, e.g.
> for IP addresses, UUIDs and such) we'll hit the limit repeatedly and
> will evict much sooner.
>
> I know you suggested simply dumping the whole hash table and starting
> from scratch while we talked about this at pgcon, but ISTM it has
> exactly this issue.
>
> But I don't know if there actually is a better option - maybe we simply
> have to accept this problem. After all, running slowly is still better
> than OOM (which may or may not happen now).
>
> I wonder if we can somehow detect this at run-time and maybe fall-back
> to groupagg. E.g. we could compare number of groups vs. number of input
> tuples when we first hit the limit. It's a rough heuristics, but maybe
> sufficient.

I've been applying a strategy like that to do massive streaming
aggregation quite successfully.

The code I have is in python and in a private repo. There have been
talks of both opensourcing it, and of porting it into postgres as a
kind of aggregate node, because it sounds like a good idea. It seems
very related to this thread.

In essence, the technique I've been using always uses a fixed-size
hash table to do partial grouping. The table is never grown, when
collisions do happen, the existing entry in the hash table is flushed
to disk and the aggregate state in that bucket reset for the incoming
key. To avoid excessive spilling due to frequent collisions, we use a
kind of "lazy cuckoo" hash table. Lazy in the sense that it does no
relocations, it just checks 2 hash values, and if it has to evict, it
evicts the "oldest" value (with some cheap definition of "old").

The technique works very well to reduce temporary data size with a
fixed amount of working memory. The resulting spill file can then be
processed again to finalize the computation.
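
For illustration, a stripped-down version of that table might look like this
(standalone C, counts instead of real aggregate states, spilling to stdout;
the two hash functions, the table size, and the "age" counter are arbitrary
choices for the sketch):

/* Toy fixed-size "lazy cuckoo" partial aggregation: two candidate buckets
 * per key, no relocation, evict the older candidate on collision. */
#include <stdio.h>
#include <stdint.h>

#define NBUCKETS 8      /* fixed size: this never grows */

typedef struct Bucket {
    int      used;
    uint32_t key;
    long     count;     /* stand-in for the aggregate state */
    long     stamp;     /* "age": when this bucket was last touched */
} Bucket;

static Bucket buckets[NBUCKETS];
static long   clock_;

static uint32_t hash1(uint32_t k) { return (k * 2654435761u) % NBUCKETS; }
static uint32_t hash2(uint32_t k) { return (k * 40503u + 1) % NBUCKETS; }

static void spill(const Bucket *b)
{
    printf("spill partial group: key=%u count=%ld\n", b->key, b->count);
}

static void advance(uint32_t key)
{
    uint32_t h1 = hash1(key), h2 = hash2(key);
    Bucket  *b1 = &buckets[h1], *b2 = &buckets[h2];
    Bucket  *victim;

    clock_++;

    /* Already present in either candidate bucket? Just advance it. */
    if (b1->used && b1->key == key) { b1->count++; b1->stamp = clock_; return; }
    if (b2->used && b2->key == key) { b2->count++; b2->stamp = clock_; return; }

    /* An empty candidate bucket? Start the group there. */
    victim = !b1->used ? b1 : (!b2->used ? b2 : NULL);

    /* Otherwise evict the "older" of the two candidates to the spill. */
    if (victim == NULL)
    {
        victim = (b1->stamp <= b2->stamp) ? b1 : b2;
        spill(victim);
    }

    victim->used = 1;
    victim->key = key;
    victim->count = 1;
    victim->stamp = clock_;
}

int main(void)
{
    uint32_t keys[] = {1, 2, 1, 3, 1, 4, 5, 6, 7, 8, 9, 1, 2, 10, 1};
    size_t   i;

    for (i = 0; i < sizeof(keys) / sizeof(keys[0]); i++)
        advance(keys[i]);

    /* The spill output plus these leftovers feed the final sort+combine. */
    for (i = 0; i < NBUCKETS; i++)
        if (buckets[i].used)
            printf("in-memory group: key=%u count=%ld\n",
                   buckets[i].key, buckets[i].count);
    return 0;
}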

What I was pondering PG could do, is feed the spilled tuples to a sort
node, using the partial hash aggregation as a data-reducing node only.

scan -> partial hash agg -> sort -> final group agg

The group agg would have to know to consume and combine aggregate
states instead of producing them, but in essence it seems a relatively
efficient process.

An adaptive hash agg node would start as a hash agg, and turn into a
"partial hash agg + sort + final group agg" when OOM is detected.

The benefit over ordinary sort+group agg is that the sort is happening
on a potentially much smaller data set. When the buffer is large
enough to capture enough key repetitions, the output of the partial
hash agg can be orders of magnitude smaller than the scan.

My proposal was to use this for all group aggs, not only when the
planner chooses a hash agg, because it can speed up the sort and
reduce temp storage considerably. I guess the trick lies in estimating
that cardinality reduction to avoid applying this when there's no hope
of getting a payoff. The overhead of such a lazy hash table isn't
much, really. But yes, its cache locality is terrible, so it needs to
be considered.

Re: Spilling hashed SetOps and aggregates to disk

2018-06-13 Thread Robert Haas
On Mon, Jun 11, 2018 at 1:50 PM, Jeff Davis  wrote:
> I think average group size is the wrong way to look at it, because
> averages are only useful for a normal distribution. The two most
> interesting cases are:
>
> 1. Fairly uniform group size (e.g. normal dist)
> 2. Skew values, power law distribution, etc., where some groups are
> very large and most are tiny by comparison. I am calling the very large
> groups "skewed groups".

I wasn't using the term "average" in a mathematical sense.  I suppose
I really meant "typical".  I agree that thinking about cases where the
group size is nonuniform is a good idea, but I don't think I agree
that all uniform distributions are created equal.  Uniform
distributions with 1 row per group are very different than uniform
distributions with 1000 rows per group.

> If we get the skewed groups into the hash table, and avoid OOM, I think
> that is a success. My patch did that, except it didn't account for two
> cases:
>
>   (a) clustered input, where skewed groups may not appear until the
> hash table is already full; and
>   (b) variable-sized transition values that grow with the group size

I think that many of the algorithms under consideration could be
implemented without worrying about variable-sized transition values,
and then the approach could be extended later.  However, whether or
not a given algorithm can handle clustered input seems like a fairly
basic property of the algorithm.  I don't think we should set the bar
too high because no algorithm is going to be perfect in every case; at
the same time, clustered input is pretty common in real-world
scenarios, and an algorithm that handles such cases better has a
significant leg up over one that can't, all other things being equal.
I'm not sure I remember precisely what your proposal was any more.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Spilling hashed SetOps and aggregates to disk

2018-06-11 Thread Tomas Vondra
On 06/11/2018 08:13 PM, Jeff Davis wrote:
> On Mon, 2018-06-11 at 19:33 +0200, Tomas Vondra wrote:
>> For example we hit the work_mem limit after processing 10% tuples,
>>  switching to sort would mean spill+sort of 900GB of data. Or we 
>> might say - hmm, we're 10% through, so we expect hitting the limit
>> 10x, so let's spill the hash table and then do sort on that,
>> writing and sorting only 10GB of data. (Or merging it in some
>> hash-based way, per Robert's earlier message.)
> 
> Your example depends on large groups and a high degree of group
> clustering. That's fine, but it's a special case,
> 

True, it's a special case and it won't work for other cases. It was
merely an example for Andres.

OTOH it's not entirely unrealistic, I think. Consider something like

  SELECT
extract(year from ts) AS y,
extract(month from ts) AS m,
extract(day from ts) AS d,
    string_agg(x, ','),
array_agg(y)
  FROM fact_table
  GROUP BY y, m, d;

which is likely very correlated (assuming new data is appended to the
table), and the string_agg/array_agg are likely to produce fairly large
groups (about proportional to the number of tuples in the group).

Another example might be an HLL aggregate, although in that case the
transition state does not grow, so it may not be that bad (and the
default estimate of 1kB would work pretty nicely). But there certainly
are other aggregates with large transition state, where this might not
be the case, and we currently have no way to communicate that to the
planner - except for setting work_mem much lower :-/

However, I now realize I've ignored the fact that we typically don't
sort the whole table but only a very few columns, so the example was not
entirely fair - we would not sort the whole remaining 900GB but likely
much less.

> and complexity does have a cost, too.

Sure.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Spilling hashed SetOps and aggregates to disk

2018-06-11 Thread Tom Lane
Jeff Davis  writes:
> Also, I am not sure we should really be designing around data types
> where it makes sense to group and then don't supply a btree opclass.
> Seems like they are likely to hit a problem soon anyway.

It's not that unreasonable to have a hash opclass and no btree opclass;
the datatype might not have a natural linear ordering.

But in any case, I think Robert's point was that he'd prefer to avoid
having a poorly-tested special-case code path for that situation, which
seems like a good idea independently of performance considerations.

regards, tom lane



Re: Spilling hashed SetOps and aggregates to disk

2018-06-11 Thread Jeff Davis
On Mon, 2018-06-11 at 11:55 -0400, Robert Haas wrote:
> performance degradation is not necessarily much better than OOM.  I
> suspect that part of the reason why both Andres and David proposed
> algorithms that combined hashing and sorting is out of a desire to
> cater somewhat to both few-tuples-per-group and many-tuples-per-
> group.
> 

I think average group size is the wrong way to look at it, because
averages are only useful for a normal distribution. The two most
interesting cases are:

1. Fairly uniform group size (e.g. normal dist)
2. Skew values, power law distribution, etc., where some groups are
very large and most are tiny by comparison. I am calling the very large
groups "skewed groups".

For #1, hashing and sorting are both reasonable, and it depends on a
lot of factors (number of groups, clustering, available memory).

For #2, hashing is really good and sorting is really bad. That's
because (at present) we sort all of the tuples before doing any
aggregation, so it expends a lot of effort on the skewed groups.
Hashing can keep skewed groups in memory and avoid spilling a large
fraction of the input tuples at all.

If we get the skewed groups into the hash table, and avoid OOM, I think
that is a success. My patch did that, except it didn't account for two
cases:

  (a) clustered input, where skewed groups may not appear until the
hash table is already full; and
  (b) variable-sized transition values that grow with the group size

> One big advantage to just partitioning the input tuples by hash code
> and then proceeding batch by batch is that it works for any aggregate
> that can support hash aggregation in the first place.  You do not

Agreed, but I think we should evaluate Andres's idea of feeding spilled
tuples to a Sort, because the overall complexity might be lower even
accounting for the special cases you're worried about.

Also, I am not sure we should really be designing around data types
where it makes sense to group and then don't supply a btree opclass.
Seems like they are likely to hit a problem soon anyway.

Regards,
Jeff Davis





Re: Spilling hashed SetOps and aggregates to disk

2018-06-11 Thread Tomas Vondra




On 06/11/2018 07:14 PM, Andres Freund wrote:
> Hi,
> 
> On 2018-06-11 17:29:52 +0200, Tomas Vondra wrote:
>> It would be great to get something that performs better than just falling
>> back to sort (and I was advocating for that), but I'm worried we might be
>> moving the goalposts way too far.
> 
> I'm unclear on why that'd have that bad performance in relevant
> cases. You're not going to hit the path unless the number of groups is
> pretty large (or work_mem is ridiculously small, in which case we don't
> care). With a large number of groups the sorting path isn't particularly
> inefficient, because repeatedly storing the input values isn't such a
> large fraction in comparison to the number of groups (and their
> transition values).  Which scenarios are you concerned about?

Say you have a 1TB table, and keeping the groups in memory would require 
 work_mem=2GB. After hitting the work_mem limit, there may be pretty 
large amount of tuples you'd have to spill to disk and sort.


For example we hit the work_mem limit after processing 10% tuples, 
switching to sort would mean spill+sort of 900GB of data. Or we might 
say - hmm, we're 10% through, so we expect hitting the limit 10x, so 
let's spill the hash table and then do sort on that, writing and sorting 
only 10GB of data. (Or merging it in some hash-based way, per Robert's 
earlier message.)


I don't quite understand the argument that the number of groups needs to 
be pretty large for us to hit this. So what if the groups need 2x or 10x
more memory than work_mem? It can still be cheaper than switching to a
sort-based approach, no?



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Spilling hashed SetOps and aggregates to disk

2018-06-11 Thread Andres Freund
Hi,

On 2018-06-11 17:29:52 +0200, Tomas Vondra wrote:
> It would be great to get something that performs better than just falling
> back to sort (and I was advocating for that), but I'm worried we might be
> moving the goalposts way too far.

I'm unclear on why that'd have that bad performance in relevant
cases. You're not going to hit the path unless the number of groups is
pretty large (or work_mem is ridiculously small, in which case we don't
care). With a large number of groups the sorting path isn't particularly
inefficient, because repeatedly storing the input values isn't such a
large fraction in comparison to the number of groups (and their
transition values).  Which scenarios are you concerned about?

Greetings,

Andres Freund



Re: Spilling hashed SetOps and aggregates to disk

2018-06-11 Thread Robert Haas
On Mon, Jun 11, 2018 at 11:29 AM, Tomas Vondra
 wrote:
> I think the underlying question is whether we want to treat this as an
> emergency safety against OOM (which should be a rare thing in most cases) or
> something allowing us to pick hash aggregate more often.
>
> It would be great to get something that performs better than just falling
> back to sort (and I was advocating for that), but I'm worried we might be
> moving the goalposts way too far.

Well, the scope of the first patch is always negotiable, but I think
we should try to think beyond the first patch in design discussions,
so that we don't paint ourselves into a corner.  I also think it's
worth noting that an emergency safety which also causes a 10x
performance degradation is not necessarily much better than OOM.  I
suspect that part of the reason why both Andres and David proposed
algorithms that combined hashing and sorting is out of a desire to
cater somewhat to both few-tuples-per-group and many-tuples-per-group.
AFAICS, Andres's algorithm will be better for few-tuples-per-group
(maybe with a few large groups mixed in) and David's algorithm will be
better for many-tuples-per-group (maybe with some small groups mixed
in), but in each case the inclusion of a sorting component seems like
it will ease the pain in the opposite use case.  However, I wonder
whether it will be better still to just acknowledge that those two
cases really need separate algorithms and then think about the best
approach for each one individually.

One big advantage to just partitioning the input tuples by hash code
and then proceeding batch by batch is that it works for any aggregate
that can support hash aggregation in the first place.  You do not need
a btree opclass.  You do not need support for serialize/deserialize or
combine.  If you have serialize/deserialize, it's better, because now
you can spill growing transition values out of the hash table on the
fly.  But the basic contours of the algorithm don't change even if
such support is lacking.  That's a significant advantage in my view,
because with some of these more complex strategies, we'll end up with
code paths for data types lacking the necessary support which are
almost never exercised in practice.  That's likely to lead to bugs.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Spilling hashed SetOps and aggregates to disk

2018-06-11 Thread Tomas Vondra

On 06/11/2018 04:25 PM, Robert Haas wrote:
> ...
>
> Maybe that's not exactly what Tomas (or you) meant by eviction
> strategy, though.  Not sure.  But it does seem to me that we need to
> pick the algorithm we're going to use, more or less, in order to
> decide what infrastructure we need, and at least to some extent, we
> ought to let our choice of algorithm be informed by the desire not to
> need too much infrastructure.



I was using eviction strategy in a somewhat narrower sense - pretty much 
just "Which groups to evict from the hash table?" but you're right the 
question is more general and depends on which scheme we end up using 
(just hashagg, hash+sort, something else ...)


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Spilling hashed SetOps and aggregates to disk

2018-06-11 Thread Tomas Vondra




On 06/11/2018 05:16 PM, Robert Haas wrote:
> On Wed, Jun 6, 2018 at 8:16 PM, Tomas Vondra
>  wrote:
>> ... and this is pretty much what Jeff Davis suggested, I think. The
>> trouble is, each of those cases behaves nicely/terribly in different
>> corner cases.
> 
> That's a really good point.  If the number of groups is pretty small
> compared to the number of input tuples, then you really only ever want
> to dump out transition values.  By doing so, you minimize the amount
> of data you have to write.  But if the number of groups is, say, half
> the number of input tuples, then computing transition values before
> you have all the values that belong to that group is probably a waste
> of time.  I wonder if we could decide what to do by comparing the
> number of input tuples consumed to the number of groups created at the
> time we run out of memory.  If we've got less than, I dunno, five
> tuples per group, then assume most groups are small.  Pick a strategy
> that (mainly) spools input tuples.  Otherwise, pick a strategy that
> spools transition tuples.
> 
> In either case, it seems like we can pick a pure hashing strategy or
> switch to using both hashing and sorting.  For example, IIUC, Andres's
> proposal involves spooling mostly input tuples, but can also spool
> transition tuples if the transition values consume more memory as they
> absorb more tuples.  However, you could do a similar kind of thing
> without needing sort support.  When you see a value that doesn't
> fit in your in-memory hash table, use the hash code to route it to 1
> of N batch files.  Have a second set of batch files for transition
> tuples in case you need to kick things out of the in-memory hash
> table.  Once you finish reading the input, emit all the values that
> remain in the in-memory hash table and then process each batch file
> separately.
> 
> Similarly, David's strategy involves spooling only transition tuples
> and then sorting on the group key, but it's again possible to do
> something similar without relying on sorting.  Instead of flushing the
> in-memory hash table to a tuple store, split the transition tuples it
> contains among N batch files based on the hash code.  Once you've read
> all of the input, go back and reprocess each batch file, combining
> transition values whenever the same group keys appear in more than one
> transition tuple.

Yeah, essentially something like the batching in hash joins.

> To me, the pure-hashing strategies look a little simpler, but maybe
> there's enough performance benefit from combining hashing and sorting
> that it's worth the complexity, or maybe we should just accept
> whichever variant somebody's willing to code.  But I think we almost
> have to have separate handling for many-row-per-group and
> few-rows-per-group, because those seem fundamentally different.



I think the underlying question is whether we want to treat this as an 
emergency safety against OOM (which should be a rare thing in most 
cases) or something allowing us to pick hash aggregate more often.


It would be great to get something that performs better than just 
falling back to sort (and I was advocating for that), but I'm worried we 
might be moving the goalposts way too far.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Spilling hashed SetOps and aggregates to disk

2018-06-11 Thread Robert Haas
On Wed, Jun 6, 2018 at 8:16 PM, Tomas Vondra
 wrote:
> ... and this is pretty much what Jeff Davis suggested, I think. The
> trouble is, each of those cases behaves nicely/terribly in different
> corner cases.

That's a really good point.  If the number of groups is pretty small
compared to the number of input tuples, then you really only ever want
to dump out transition values.  By doing so, you minimize the amount
of data you have to write.  But if the number of groups is, say, half
the number of input tuples, then computing transition values before
you have all the values that belong to that group is probably a waste
of time.  I wonder if we could decide what to do by comparing the
number of input tuples consumed to the number of groups created at the
time we run out of memory.  If we've got less than, I dunno, five
tuples per group, then assume most groups are small.  Pick a strategy
that (mainly) spools input tuples.  Otherwise, pick a strategy that
spools transition tuples.

In either case, it seems like we can pick a pure hashing strategy or
switch to using both hashing and sorting.  For example, IIUC, Andres's
proposal involves spooling mostly input tuples, but can also spool
transition tuples if the transition values consume more memory as they
absorb more tuples.  However, you could do a similar kind of thing
without needing sort support.  When you see a value that doesn't
fit in your in-memory hash table, use the hash code to route it to 1
of N batch files.  Have a second set of batch files for transition
tuples in case you need to kick things out of the in-memory hash
table.  Once you finish reading the input, emit all the values that
remain in the in-memory hash table and then process each batch file
separately.

Similarly, David's strategy involves spooling only transition tuples
and then sorting on the group key, but it's again possible to do
something similar without relying on sorting.  Instead of flushing the
in-memory hash table to a tuple store, split the transition tuples it
contains among N batch files based on the hash code.  Once you've read
all of the input, go back and reprocess each batch file, combining
transition values whenever the same group keys appear in more than one
transition tuple.
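
Something like the following standalone sketch (plain C, made-up types,
counts for transition values, in-memory arrays standing in for the batch
files) shows the shape of the pure-hashing variant.  It only shows the
input-tuple side -- the second set of batch files for evicted transition
values is left out -- and the point is that nothing in it needs ordering,
only hashing:

/* Minimal sketch of hash-partitioned spilling: anything that misses the
 * full in-memory hash table is routed to one of N batches by hash code,
 * and each batch is re-aggregated on its own in a second pass. */
#include <stdio.h>
#include <stdint.h>

#define NBATCHES 4
#define MAXGROUPS 4         /* pretend work_mem only fits four groups */

typedef struct Group { uint32_t key; long count; } Group;

static Group    mem[MAXGROUPS];
static int      nmem;
static uint32_t batches[NBATCHES][64];  /* spooled input "tuples" (keys) */
static int      nbatch[NBATCHES];

static uint32_t hash_key(uint32_t k) { return k * 2654435761u; }

static int aggregate(Group *groups, int ngroups, int maxgroups, uint32_t key)
{
    int i;

    for (i = 0; i < ngroups; i++)
        if (groups[i].key == key) { groups[i].count++; return ngroups; }
    if (ngroups < maxgroups)
    {
        groups[ngroups].key = key;
        groups[ngroups].count = 1;
        return ngroups + 1;
    }
    return -1;                  /* no room: caller must spool the tuple */
}

int main(void)
{
    uint32_t input[] = {1, 2, 3, 1, 4, 5, 6, 2, 7, 5, 8, 1, 9, 6};
    size_t   i;
    int      b, g;

    /* First pass: aggregate what fits, route the rest by hash code. */
    for (i = 0; i < sizeof(input) / sizeof(input[0]); i++)
    {
        int r = aggregate(mem, nmem, MAXGROUPS, input[i]);

        if (r >= 0)
            nmem = r;
        else
        {
            b = hash_key(input[i]) % NBATCHES;  /* equal keys share a batch */
            batches[b][nbatch[b]++] = input[i];
        }
    }
    for (g = 0; g < nmem; g++)
        printf("pass 1: key=%u count=%ld\n", mem[g].key, mem[g].count);

    /* Second pass: each batch gets its own (small) hash table. */
    for (b = 0; b < NBATCHES; b++)
    {
        Group batchgroups[64];
        int   n = 0;

        for (i = 0; i < (size_t) nbatch[b]; i++)
            n = aggregate(batchgroups, n, 64, batches[b][i]);
        for (g = 0; g < n; g++)
            printf("batch %d: key=%u count=%ld\n", b, batchgroups[g].key,
                   batchgroups[g].count);
    }
    return 0;
}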

To me, the pure-hashing strategies look a little simpler, but maybe
there's enough performance benefit from combining hashing and sorting
that it's worth the complexity, or maybe we should just accept
whichever variant somebody's willing to code.  But I think we almost
have to have separate handling for many-row-per-group and
few-rows-per-group, because those seem fundamentally different.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Spilling hashed SetOps and aggregates to disk

2018-06-11 Thread Robert Haas
On Tue, Jun 5, 2018 at 1:46 AM, Jeff Davis  wrote:
> On Tue, 2018-06-05 at 07:04 +0200, Tomas Vondra wrote:
>> I expect the eviction strategy to be the primary design challenge of
>> this patch. The other bits will be mostly determined by this one
>> piece.
>
> Not sure I agree that this is the primary challenge.
>
> The cases that benefit from eviction are probably a minority. I see two
> categories that would benefit:
>
>   * Natural clustering in the heap. This sounds fairly common, but a
> lot of the cases that come to mind are too low-cardinality to be
> compelling; e.g. timestamps grouped by hour/day/month. If someone has
> run into a high-cardinality natural grouping case, let me know.
>   * ARRAY_AGG (or similar): individual state values can be large enough
> that we need to evict to avoid exceeding work_mem even if not adding
> any new groups.
>
> In either case, it seems like a fairly simple eviction strategy would
> work. For instance, we could just evict the entire hash table if
> work_mem is exceeded or if the hit rate on the hash table falls below a
> certain threshold. If there was really something important that should
> have stayed in the hash table, it will go back in soon anyway.
>
> So why should eviction be a major driver for the entire design? I agree
> it should be an area of improvement for the future, so let me know if
> you see a major problem, but I haven't been as focused on eviction.

Hmm, I think the eviction strategy is pretty important.  For example,
as we discussed in the hallway at PGCon, if we're flushing the whole
hash table (or one of several hash tables), we can account for memory
usage chunk-by-chunk instead of palloc-by-palloc, which as you pointed
out is more accurate and substantially less expensive.  If we are
evicting individual tuples, though, it's quite possible that evicting
even a large number of tuples might not free up any chunks, so we'd
probably have to do accounting palloc-by-palloc.  Working harder to
get a less accurate answer isn't real appealing even if the absolute
overhead is low.

As Andres mentioned, it also matters what kind of thing we evict.  If
we push raw input tuples to a spool file in lieu of creating new
groups and deal with those in a second pass with a new hash table, we
don't need anything extra at all.  If the second pass uses sort +
group, we need the aggregates to be both hashable and sortable.  If we
evict transition tuples rather than input tuples, we need
serialization + deserialization functions.  If we do that in a way
that might create multiple transition values for the same combination
of grouping values, then we also need combine functions.

Maybe that's not exactly what Tomas (or you) meant by eviction
strategy, though.  Not sure.  But it does seem to me that we need to
pick the algorithm we're going to use, more or less, in order to
decide what infrastructure we need, and at least to some extent, we
ought to let our choice of algorithm be informed by the desire not to
need too much infrastructure.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Spilling hashed SetOps and aggregates to disk

2018-06-07 Thread David Gershuni
As Serge mentioned, we’ve implemented spill-to-disk for SetOps and Aggregates 
at Salesforce. We were hitting OOMs often enough that this became a high 
priority for us. However, our current spill implementation is based on dynahash 
from 9.6, and we’re not happy with its performance (it was primarily an OOM 
stop-gap and was not focused on optimizing performance).  

Because of this, we’ve spec’d out a new spill-to-disk design for hash-based 
aggregates (and later SetOps), and we plan to begin implementation very soon.  
Since this is an important fix for the community as well, we would be happy 
(and would prefer) to share our spill-to-disk implementation. 

Our new spill approach is very similar to Jeff’s and Heikki’s and is designed 
to use simplehash.h. It’s based on an algorithm called “HybridHash with 
Pre-Partitioning” found in [1]. It may later make sense to implement the 
“HashSort” Algorithm from [1] as well, which works better for highly skewed 
grouping keys. The optimizer could eventually choose between the two based on 
the stats available. We also like Heikki’s suggestions to use logtape.c to 
reduce disk usage and a trie-based approach to control the size of partitions 
dynamically.

We’ve also been grappling with how to handle the implementation challenges 
pointed out in this thread. These include:
1) tracking memory usage
2) choosing a smart eviction policy (which is touched on in [2])
3) serializing opaque user-defined transition values when eviction is
   required

For 1), we plan to use our WithStats memory context, which Serge mentioned.
For 2), we plan to start with a simple non-eviction policy since we don’t have 
the stats to do anything smarter (i.e. insert until we fill the hash table, then 
spill until we finish processing the batch, with evictions only happening if a 
group’s transition value grows too large).
For 3), we don’t have a good solution yet. We could serialize/deserialize for 
built-in types and rely on users to provide serialize/deserialize functions for 
user-defined aggregates going forward.
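
As a small standalone sketch, the policy in 2) boils down to a decision
function like this (made-up types and names, just to pin down the cases):

/* Sketch of the planned no-eviction policy as a pure decision function. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef enum { POLICY_ADVANCE, POLICY_INSERT, POLICY_SPILL_TUPLE,
               POLICY_EVICT_GROUP } SpillAction;

typedef struct PolicyState {
    bool    table_full;         /* have we hit work_mem for this batch? */
    size_t  state_size_limit;   /* per-group cap on transition value size */
} PolicyState;

static SpillAction
choose_action(const PolicyState *p, bool group_in_table, size_t group_state_size)
{
    if (group_in_table)
    {
        /* Only evict if this group's transition value has grown too large. */
        if (group_state_size > p->state_size_limit)
            return POLICY_EVICT_GROUP;
        return POLICY_ADVANCE;
    }
    /* New group: insert until the table is full, then spill the tuple and
     * keep spilling until the current batch is finished. */
    return p->table_full ? POLICY_SPILL_TUPLE : POLICY_INSERT;
}

int main(void)
{
    PolicyState p = {true, 1024};

    printf("%d\n", choose_action(&p, false, 0));      /* spill the tuple  */
    printf("%d\n", choose_action(&p, true, 4096));    /* evict that group */
    return 0;
}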

But we’re open to suggestions :)

Regards,
David
Salesforce

[1] Revisiting Aggregation for Data Intensive Applications: A Performance Study
https://arxiv.org/pdf/1311.0059.pdf

[2] DB2 with BLU acceleration: so much more than just a column store
Proc. VLDB Endow. 6(11), 2013.


Re: Spilling hashed SetOps and aggregates to disk

2018-06-07 Thread Tomas Vondra

On 06/07/2018 02:18 AM, Andres Freund wrote:
> On 2018-06-06 17:17:52 -0700, Andres Freund wrote:
>> On 2018-06-07 12:11:37 +1200, David Rowley wrote:
>>> On 7 June 2018 at 08:11, Tomas Vondra  wrote:
>>>> On 06/06/2018 04:11 PM, Andres Freund wrote:
>>>>> Consider e.g. a scheme where we'd switch from hashed aggregation to
>>>>> sorted aggregation due to memory limits, but already have a number of
>>>>> transition values in the hash table. Whenever the size of the transition
>>>>> values in the hashtable exceeds memory size, we write one of them to the
>>>>> tuplesort (with serialized transition value). From then on further input
>>>>> rows for that group would only be written to the tuplesort, as the group
>>>>> isn't present in the hashtable anymore.
>>>>
>>>> Ah, so you're suggesting that during the second pass we'd deserialize
>>>> the transition value and then add the tuples to it, instead of building
>>>> a new transition value. Got it.
>>>
>>> Having to deserialize every time we add a new tuple sounds terrible
>>> from a performance point of view.
>>
>> I didn't mean that we do that, and I don't think David understood it as
>> that either. I was talking about the approach where the second pass is a
>> sort rather than hash based aggregation.  Then we would *not* need to
>> deserialize more than exactly once.
> 
> s/David/Tomas/, obviously. Sorry, it's been a long day.



Solution is simple: drink more coffee. ;-)

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Spilling hashed SetOps and aggregates to disk

2018-06-06 Thread David Rowley
On 6 June 2018 at 01:17, David Rowley  wrote:
> On 6 June 2018 at 01:09, Andres Freund  wrote:
>> On 2018-06-06 01:06:39 +1200, David Rowley wrote:
>>> My concern is that only accounting memory for the group and not the
>>> state is only solving half the problem. It might be fine for
>>> aggregates that don't stray far from their aggtransspace, but for the
>>> other ones, we could still see OOM.
>>
>>> If solving the problem completely is too hard, then a half fix (maybe
>>> 3/4) is better than nothing, but if we can get a design for a full fix
>>> before too much work is done, then isn't that better?
>>
>> I don't think we actually disagree.  I was really primarily talking
>> about the case where we can't really do better because we don't have
>> serialization support.  I mean we could just rescan from scratch, using
>> a groupagg, but that obviously sucks.
>
> I don't think we do. To take yours to the 100% solution might just
> take adding the memory accounting to palloc that Jeff proposed a few
> years ago and use that accounting to decide when we should switch
> method.
>
> However, I don't quite fully recall how the patch accounted for memory
> consumed by sub-contexts and if getting the entire consumption
> required recursively looking at subcontexts. If that's the case then
> checking the consumption would likely cost too much if it was done
> after each tuple was aggregated.

I wonder if the whole internal state memory accounting problem could
be solved by just adding an aggregate supporting function for internal
state aggregates that returns the number of bytes consumed by the
state. It might be good enough to fall back on aggtransspace when the
function is not defined. Such a function would be about 3 lines long
for string_agg and array_agg, and these are the problem aggregates.
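
For instance, for string_agg such a function might look roughly like the
sketch below. To be clear, this is hypothetical -- no such support function
exists today -- and it assumes the transition state is the StringInfo that
string_agg accumulates into:

/* Hypothetical "state size" support function for string_agg; not part of
 * PostgreSQL today.  Assumes the transition state is a StringInfo. */
#include "postgres.h"
#include "fmgr.h"
#include "lib/stringinfo.h"

PG_FUNCTION_INFO_V1(string_agg_statesize);

Datum
string_agg_statesize(PG_FUNCTION_ARGS)
{
    StringInfo  state = (StringInfo) PG_GETARG_POINTER(0);
    int64       size = sizeof(StringInfoData);

    if (state != NULL)
        size += state->maxlen;      /* allocated buffer, not just ->len */

    PG_RETURN_INT64(size);
}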


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Spilling hashed SetOps and aggregates to disk

2018-06-06 Thread Andres Freund
On 2018-06-06 17:17:52 -0700, Andres Freund wrote:
> On 2018-06-07 12:11:37 +1200, David Rowley wrote:
> > On 7 June 2018 at 08:11, Tomas Vondra  wrote:
> > > On 06/06/2018 04:11 PM, Andres Freund wrote:
> > >> Consider e.g. a scheme where we'd switch from hashed aggregation to
> > >> sorted aggregation due to memory limits, but already have a number of
> > >> transition values in the hash table. Whenever the size of the transition
> > >> values in the hashtable exceeds memory size, we write one of them to the
> > >> tuplesort (with serialized transition value). From then on further input
> > >> rows for that group would only be written to the tuplesort, as the group
> > >> isn't present in the hashtable anymore.
> > >>
> > >
> > > Ah, so you're suggesting that during the second pass we'd deserialize
> > > the transition value and then add the tuples to it, instead of building
> > > a new transition value. Got it.
> > 
> > Having to deserialize every time we add a new tuple sounds terrible
> > from a performance point of view.
> 
> I didn't mean that we do that, and I don't think David understood it as
> that either. I was talking about the approach where the second pass is a
> sort rather than hash based aggregation.  Then we would *not* need to
> deserialize more than exactly once.

s/David/Tomas/, obviously. Sorry, it's been a long day.

Greetings,

Andres Freund



Re: Spilling hashed SetOps and aggregates to disk

2018-06-06 Thread Andres Freund
On 2018-06-07 12:11:37 +1200, David Rowley wrote:
> On 7 June 2018 at 08:11, Tomas Vondra  wrote:
> > On 06/06/2018 04:11 PM, Andres Freund wrote:
> >> Consider e.g. a scheme where we'd switch from hashed aggregation to
> >> sorted aggregation due to memory limits, but already have a number of
> >> transition values in the hash table. Whenever the size of the transition
> >> values in the hashtable exceeds memory size, we write one of them to the
> >> tuplesort (with serialized transition value). From then on further input
> >> rows for that group would only be written to the tuplesort, as the group
> >> isn't present in the hashtable anymore.
> >>
> >
> > Ah, so you're suggesting that during the second pass we'd deserialize
> > the transition value and then add the tuples to it, instead of building
> > a new transition value. Got it.
> 
> Having to deserialize every time we add a new tuple sounds terrible
> from a performance point of view.

I didn't mean that we do that, and I don't think David understood it as
that either. I was talking about the approach where the second pass is a
sort rather than hash based aggregation.  Then we would *not* need to
deserialize more than exactly once.


> Can't we just:
> 
> 1. HashAgg until the hash table reaches work_mem.
> 2. Spill the entire table to disk.
> 3. Destroy the table and create a new one.
> 4. If more tuples: goto 1
> 5. Merge sort and combine each dumped set of tuples.

Well, that requires sorting without really much advantage.  I was more
thinking of

1.  HashAgg until the hash table reaches work_mem.
2.  If entry is in HashTable advance() there, goto 3a. Otherwise put into
tuplestore, goto 2.
3.  If this increases the size of transition values too much, spill
tuple into tuplestore.
4.  Project tuples from hashtable, destroy.
5.  Sort tuplestore with a comparator that sorts extra column with
serialized states first. Project.
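
To make that flow concrete, a standalone toy sketch in plain C: a sum
aggregate, a tiny fixed-size table standing in for the work_mem-bounded
hash table, and an array standing in for the tuplestore/tuplesort. None of
the real executor or tuplesort APIs are used, and the "transition values
too big" case is not modeled.

#include <stdio.h>
#include <stdlib.h>

#define MAX_GROUPS 4            /* pretend this is work_mem */

typedef struct { int key; long sum; int used; } Group;        /* hashed groups   */
typedef struct { int key; long val; int is_state; } Spilled;  /* "tuplesort" row */

static Group groups[MAX_GROUPS];
static Spilled spill[64];
static int nspill;

static Group *lookup(int key, int create)
{
    for (int i = 0; i < MAX_GROUPS; i++)
        if (groups[i].used && groups[i].key == key)
            return &groups[i];
    if (!create) return NULL;
    for (int i = 0; i < MAX_GROUPS; i++)
        if (!groups[i].used) { groups[i] = (Group){key, 0, 1}; return &groups[i]; }
    return NULL;                /* table "full": caller must spill */
}

static int cmp(const void *a, const void *b)
{
    const Spilled *x = a, *y = b;
    if (x->key != y->key) return x->key - y->key;
    return y->is_state - x->is_state;   /* serialized states sort first */
}

int main(void)
{
    int input[][2] = {{1,10},{2,20},{3,30},{4,40},{5,50},{1,11},{5,51},{2,21}};

    /* Step 1/2: known groups keep advancing in the table; new groups spill. */
    for (int i = 0; i < 8; i++)
    {
        Group *g = lookup(input[i][0], 1);
        if (g) g->sum += input[i][1];
        else   spill[nspill++] = (Spilled){input[i][0], input[i][1], 0};
    }

    /* Step 4: project the in-memory states into the spill as "serialized" rows. */
    for (int i = 0; i < MAX_GROUPS; i++)
        if (groups[i].used)
            spill[nspill++] = (Spilled){groups[i].key, groups[i].sum, 1};

    /* Step 5: sort and make one combining pass per key. */
    qsort(spill, nspill, sizeof(Spilled), cmp);
    for (int i = 0; i < nspill; )
    {
        int key = spill[i].key;
        long sum = 0;
        for (; i < nspill && spill[i].key == key; i++)
            sum += spill[i].val;        /* state rows and raw rows combine alike for sum */
        printf("key=%d sum=%ld\n", key, sum);
    }
    return 0;
}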

Greetings,

Andres Freund



Re: Spilling hashed SetOps and aggregates to disk

2018-06-06 Thread Tomas Vondra



On 06/07/2018 02:11 AM, David Rowley wrote:
> On 7 June 2018 at 08:11, Tomas Vondra  wrote:
>> On 06/06/2018 04:11 PM, Andres Freund wrote:
>>> Consider e.g. a scheme where we'd switch from hashed aggregation to
>>> sorted aggregation due to memory limits, but already have a number of
>>> transition values in the hash table. Whenever the size of the transition
>>> values in the hashtable exceeds memory size, we write one of them to the
>>> tuplesort (with serialized transition value). From then on further input
>>> rows for that group would only be written to the tuplesort, as the group
>>> isn't present in the hashtable anymore.
>>>
>>
>> Ah, so you're suggesting that during the second pass we'd deserialize
>> the transition value and then add the tuples to it, instead of building
>> a new transition value. Got it.
> 
> Having to deserialize every time we add a new tuple sounds terrible
> from a performance point of view.
> 

I don't think Andres was suggesting that. I think the scheme was to sort
the tuples and read all the tuples for a group at once (so we would
deserialize just once).

> Can't we just:
> 
> 1. HashAgg until the hash table reaches work_mem.
> 2. Spill the entire table to disk.
> 3. Destroy the table and create a new one.
> 4. If more tuples: goto 1
> 5. Merge sort and combine each dumped set of tuples.
> 

... and this is pretty much what Jeff Davis suggested, I think. The
trouble is, each of those cases behaves nicely/terribly in different
corner cases.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Spilling hashed SetOps and aggregates to disk

2018-06-06 Thread David Rowley
On 7 June 2018 at 08:11, Tomas Vondra  wrote:
> On 06/06/2018 04:11 PM, Andres Freund wrote:
>> Consider e.g. a scheme where we'd switch from hashed aggregation to
>> sorted aggregation due to memory limits, but already have a number of
>> transition values in the hash table. Whenever the size of the transition
>> values in the hashtable exceeds memory size, we write one of them to the
>> tuplesort (with serialized transition value). From then on further input
>> rows for that group would only be written to the tuplesort, as the group
>> isn't present in the hashtable anymore.
>>
>
> Ah, so you're suggesting that during the second pass we'd deserialize
> the transition value and then add the tuples to it, instead of building
> a new transition value. Got it.

Having to deserialize every time we add a new tuple sounds terrible
from a performance point of view.

Can't we just:

1. HashAgg until the hash table reaches work_mem.
2. Spill the entire table to disk.
3. Destroy the table and create a new one.
4. If more tuples: goto 1
5. Merge sort and combine each dumped set of tuples.
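
A minimal standalone sketch of that loop (plain C, sum aggregate, toy
arrays instead of the real hash table and spill files), just to show the
dump-and-restart control flow and the final combine pass:

#include <stdio.h>
#include <stdlib.h>

#define TABLE_CAP 3                       /* stand-in for work_mem */

typedef struct { int key; long sum; } Partial;

static Partial table[TABLE_CAP];
static int ngroups;
static Partial dumped[64];
static int ndumped;

static int cmp_key(const void *a, const void *b)
{
    return ((const Partial *) a)->key - ((const Partial *) b)->key;
}

static void dump_and_reset(void)
{
    for (int i = 0; i < ngroups; i++)
        dumped[ndumped++] = table[i];     /* step 2: spill the whole table */
    ngroups = 0;                          /* step 3: destroy, start fresh  */
}

int main(void)
{
    int input[][2] = {{1,10},{2,20},{3,30},{4,40},{1,11},{5,50},{2,21},{4,41}};

    for (int i = 0; i < 8; i++)           /* steps 1 and 4 */
    {
        int j;
        for (j = 0; j < ngroups; j++)
            if (table[j].key == input[i][0]) { table[j].sum += input[i][1]; break; }
        if (j == ngroups)
        {
            if (ngroups == TABLE_CAP)
                dump_and_reset();
            table[ngroups++] = (Partial){input[i][0], input[i][1]};
        }
    }
    dump_and_reset();

    /* step 5: "merge sort and combine each dumped set" */
    qsort(dumped, ndumped, sizeof(Partial), cmp_key);
    for (int i = 0; i < ndumped; )
    {
        long sum = 0;
        int key = dumped[i].key;
        for (; i < ndumped && dumped[i].key == key; i++)
            sum += dumped[i].sum;         /* the combinefn for sum is just addition */
        printf("key=%d sum=%ld\n", key, sum);
    }
    return 0;
}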

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Spilling hashed SetOps and aggregates to disk

2018-06-06 Thread Tomas Vondra
On 06/06/2018 04:11 PM, Andres Freund wrote:
> On 2018-06-06 16:06:18 +0200, Tomas Vondra wrote:
>> On 06/06/2018 04:01 PM, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2018-06-06 15:58:16 +0200, Tomas Vondra wrote:
 The other issue is that serialize/deserialize is only a part of a problem -
 you also need to know how to do "combine", and not all aggregates can do
 that ... (certainly not in universal way).
>>>
>>> There are several schemes where only serialize/deserialize are needed,
>>> no?  There are a number of fairly sensible schemes where there won't be
>>> multiple transition values for the same group, no?
>>>
>>
>> Possibly, not sure what schemes you have in mind exactly ...
>>
>> But if you know there's only a single transition value, why would you need
>> serialize/deserialize at all. Why couldn't you just finalize the value and
>> serialize that?
> 
> Because you don't necessarily have all the necessary input rows
> yet.
> 
> Consider e.g. a scheme where we'd switch from hashed aggregation to
> sorted aggregation due to memory limits, but already have a number of
> transition values in the hash table. Whenever the size of the transition
> values in the hashtable exceeds memory size, we write one of them to the
> tuplesort (with serialized transition value). From then on further input
> rows for that group would only be written to the tuplesort, as the group
> isn't present in the hashtable anymore.
> 

Ah, so you're suggesting that during the second pass we'd deserialize
the transition value and then add the tuples to it, instead of building
a new transition value. Got it.

That being said, I'm not sure if such generic serialize/deserialize can
work, but I'd guess no, otherwise we'd probably use it when implementing
the parallel aggregate.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Spilling hashed SetOps and aggregates to disk

2018-06-06 Thread Andres Freund
On 2018-06-06 16:06:18 +0200, Tomas Vondra wrote:
> On 06/06/2018 04:01 PM, Andres Freund wrote:
> > Hi,
> > 
> > On 2018-06-06 15:58:16 +0200, Tomas Vondra wrote:
> > > The other issue is that serialize/deserialize is only a part of a problem 
> > > -
> > > you also need to know how to do "combine", and not all aggregates can do
> > > that ... (certainly not in universal way).
> > 
> > There are several schemes where only serialize/deserialize are needed,
> > no?  There are a number of fairly sensible schemes where there won't be
> > multiple transition values for the same group, no?
> > 
> 
> Possibly, not sure what schemes you have in mind exactly ...
> 
> But if you know there's only a single transition value, why would you need
> serialize/deserialize at all. Why couldn't you just finalize the value and
> serialize that?

Because you don't necessarily have all the necessary input rows
yet.

Consider e.g. a scheme where we'd switch from hashed aggregation to
sorted aggregation due to memory limits, but already have a number of
transition values in the hash table. Whenever the size of the transition
values in the hashtable exceeds memory size, we write one of them to the
tuplesort (with serialized transition value). From then on further input
rows for that group would only be written to the tuplesort, as the group
isn't present in the hashtable anymore.

Greetings,

Andres Freund



Re: Spilling hashed SetOps and aggregates to disk

2018-06-06 Thread Tomas Vondra

On 06/06/2018 04:01 PM, Andres Freund wrote:

Hi,

On 2018-06-06 15:58:16 +0200, Tomas Vondra wrote:

The other issue is that serialize/deserialize is only a part of a problem -
you also need to know how to do "combine", and not all aggregates can do
that ... (certainly not in universal way).


There are several schemes where only serialize/deserialize are needed,
no?  There are a number of fairly sensible schemes where there won't be
multiple transition values for the same group, no?



Possibly, not sure what schemes you have in mind exactly ...

But if you know there's only a single transition value, why would you 
need serialize/deserialize at all. Why couldn't you just finalize the 
value and serialize that?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Spilling hashed SetOps and aggregates to disk

2018-06-06 Thread Tomas Vondra




On 06/05/2018 07:39 PM, David Fetter wrote:

On Tue, Jun 05, 2018 at 01:27:01PM -0400, Tom Lane wrote:

David Fetter  writes:

On Tue, Jun 05, 2018 at 02:56:23PM +1200, David Rowley wrote:

True. Although not all built in aggregates have those defined.



Just out of curiosity, which ones don't? As of
3f85c62d9e825eedd1315d249ef1ad793ca78ed4, pg_aggregate has both of
those as NOT NULL.


NOT NULL isn't too relevant; that's just protecting the fixed-width
nature of the catalog rows.  What's important is which ones are zero.


Thanks for helping me understand this better.


# select aggfnoid::regprocedure, aggkind from pg_aggregate where (aggserialfn=0 
or aggdeserialfn=0) and aggtranstype = 'internal'::regtype;
aggfnoid   | aggkind
--+-
[snip]
(19 rows)

Probably the ordered-set/hypothetical ones aren't relevant for this
issue.

Whether or not we feel like fixing the above "normal" aggs for this,
the patch would have to not fail on extension aggregates that don't
support serialization.


Could there be some kind of default serialization with reasonable
properties?



Not really, because the aggregates often use "internal" i.e. a pointer 
referencing whothehellknowswhat, and how do you serialize/deserialize 
that? The other issue is that serialize/deserialize is only a part of a 
problem - you also need to know how to do "combine", and not all 
aggregates can do that ... (certainly not in universal way).
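
To make the objection concrete, a toy sketch of the three aggregate-specific
pieces an "internal" state would need before any generic spill scheme can
use it (hypothetical avg-style state, plain C, nothing from the real
numeric aggregates):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* The state lives behind an opaque pointer, so only the aggregate itself
 * knows how to flatten it, rebuild it, or merge two of them. */
typedef struct AvgState { int64_t sum; int64_t count; } AvgState;

/* serialize: internal pointer -> flat bytes that could be written to disk */
size_t avg_serialize(const void *internal_state, unsigned char *out)
{
    memcpy(out, internal_state, sizeof(AvgState));
    return sizeof(AvgState);
}

/* deserialize: flat bytes -> freshly allocated internal pointer */
void *avg_deserialize(const unsigned char *in)
{
    AvgState *s = malloc(sizeof(AvgState));
    memcpy(s, in, sizeof(AvgState));
    return s;
}

/* combine: merge a second partial state into the first */
void *avg_combine(void *state1, const void *state2)
{
    AvgState *a = state1;
    const AvgState *b = state2;
    a->sum += b->sum;
    a->count += b->count;
    return a;
}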


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Spilling hashed SetOps and aggregates to disk

2018-06-05 Thread Andres Freund
Hi,

On 2018-06-05 10:47:49 -0700, Jeff Davis wrote:
> The thing I don't like about it is that it requires running two memory-
> hungry operations at once. How much of work_mem do we use for sorted
> runs, and how much do we use for the hash table?

Is that necessarily true? I'd assume that we'd use a small amount of
memory for the tuplesort, enough to avoid unnecessary disk spills for
each tuple. But a few kB should be enough - I think it's fine to
aggressively spill to disk, since we have already handled the case of a
smaller number of input rows.  Then at the end of the run, we empty
out the hashtable and free it. Only then do we go to the sort.

One thing this wouldn't handle is datatypes that support hashing, but
no sorting. Not exactly common.

Greetings,

Andres Freund



Re: Spilling hashed SetOps and aggregates to disk

2018-06-05 Thread Jeff Davis
On Tue, 2018-06-05 at 05:42 -0700, Andres Freund wrote:
> > That's an interesting idea, but it seems simpler to stick to hashing
> > rather than using a combination strategy. It also seems like it would
> > take less CPU effort.
> 
> Isn't the locality of access going to be considerably better with the
> sort based approach?

I think I see what you mean, but without measuring it's hard to say.
You may be able to achieve similar locality in the hashing approach by
using a smaller hash table -- an effect that I think I observed
previously but I'd have to rerun the numbers.

> > What advantages do you have in mind? My patch partitions the spilled
> > data, so it should have similar disk costs as a sort approach.
> 
> I think one part of it is that I think the amount of code is going to be
> lower - we essentially already have all the code to handle sort based
> aggs, and to have both sort and hash based aggs in the same query. We'd
> mostly need a way to scan the hashtable and stuff it into a tuplesort,
> that's not hard.  nodeAgg.c is already more than complex enough, I'm not
> sure that full blown partitioning is worth the cost.

The thing I like about your idea is that we wouldn't need to make a
choice at plan time, we just always do the hybrid hashing/sorting
unless we know that it's already sorted.

The thing I don't like about it is that it requires running two memory-
hungry operations at once. How much of work_mem do we use for sorted
runs, and how much do we use for the hash table?

Regards,
Jeff Davis




Re: Re: Spilling hashed SetOps and aggregates to disk

2018-06-05 Thread Jeff Davis
On Tue, 2018-06-05 at 08:39 -0700, se...@rielau.com wrote:
> Strange. We don't see this overhead and measure a lot more than just
> a single metric:

Let's investigate again. I wasn't able to repro the overhead on x86;
Robert saw it on a POWER machine, and it was somewhat noisy. I don't
think we were ever very sure the overhead existed.

My basic opinion is that we make small changes all the time that may
have a small performance impact, and we shouldn't let that become a
blocker for an important feature. Nor should we let it make the design
overly complicated or awkward. We should just see what we can
reasonably do to understand and mitigate it.

Regards,
Jeff Davis




Re: Spilling hashed SetOps and aggregates to disk

2018-06-05 Thread Tom Lane
David Fetter  writes:
> On Tue, Jun 05, 2018 at 02:56:23PM +1200, David Rowley wrote:
>> True. Although not all built in aggregates have those defined.

> Just out of curiosity, which ones don't? As of
> 3f85c62d9e825eedd1315d249ef1ad793ca78ed4, pg_aggregate has both of
> those as NOT NULL.

NOT NULL isn't too relevant; that's just protecting the fixed-width
nature of the catalog rows.  What's important is which ones are zero.

# select aggfnoid::regprocedure, aggkind from pg_aggregate where (aggserialfn=0 
or aggdeserialfn=0) and aggtranstype = 'internal'::regtype;
   aggfnoid   | aggkind 
--+-
 array_agg(anynonarray)   | n
 array_agg(anyarray)  | n
 string_agg(text,text)| n
 string_agg(bytea,bytea)  | n
 json_agg(anyelement) | n
 json_object_agg("any","any") | n
 jsonb_agg(anyelement)| n
 jsonb_object_agg("any","any")| n
 percentile_disc(double precision,anyelement) | o
 percentile_cont(double precision,double precision)   | o
 percentile_cont(double precision,interval)   | o
 percentile_disc(double precision[],anyelement)   | o
 percentile_cont(double precision[],double precision) | o
 percentile_cont(double precision[],interval) | o
 mode(anyelement) | o
 rank("any")  | h
 percent_rank("any")  | h
 cume_dist("any") | h
 dense_rank("any")| h
(19 rows)

Probably the ordered-set/hypothetical ones aren't relevant for this
issue.

Whether or not we feel like fixing the above "normal" aggs for this,
the patch would have to not fail on extension aggregates that don't
support serialization.

regards, tom lane



Re: Spilling hashed SetOps and aggregates to disk

2018-06-05 Thread Andres Freund
On 2018-06-05 19:04:11 +0200, David Fetter wrote:
> On Tue, Jun 05, 2018 at 02:56:23PM +1200, David Rowley wrote:
> > On 5 June 2018 at 06:52, Andres Freund  wrote:
> > > That part has gotten a bit easier since, because we have serialize
> > > / deserialize operations for aggregates these days.
> > 
> > True. Although not all built in aggregates have those defined.
> 
> Just out of curiosity, which ones don't? As of
> 3f85c62d9e825eedd1315d249ef1ad793ca78ed4, pg_aggregate has both of
> those as NOT NULL.

That doesn't mean much. We commonly store 0 / InvalidOid for such things
in the catalog. Allows us to still map the tuples to structs (due to
fixed width).

Greetings,

Andres Freund



Re: Spilling hashed SetOps and aggregates to disk

2018-06-05 Thread David Fetter
On Tue, Jun 05, 2018 at 02:56:23PM +1200, David Rowley wrote:
> On 5 June 2018 at 06:52, Andres Freund  wrote:
> > That part has gotten a bit easier since, because we have serialize
> > / deserialize operations for aggregates these days.
> 
> True. Although not all built in aggregates have those defined.

Just out of curiosity, which ones don't? As of
3f85c62d9e825eedd1315d249ef1ad793ca78ed4, pg_aggregate has both of
those as NOT NULL.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: Spilling hashed SetOps and aggregates to disk

2018-06-05 Thread David Fetter
On Tue, Jun 05, 2018 at 02:56:23PM +1200, David Rowley wrote:
> On 5 June 2018 at 06:52, Andres Freund  wrote:
> > That part has gotten a bit easier since, because we have serialize /
> > deserialize operations for aggregates these days.
> 
> True. Although not all built in aggregates have those defined.

That is a SMoP which we could, at some point, enforce by requiring
that they always be defined.  Is there something other than round
tuits that's preventing that?

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: Spilling hashed SetOps and aggregates to disk

2018-06-05 Thread Tomas Vondra

On 06/05/2018 03:41 PM, se...@rielau.com wrote:
In our code base we have added a WithStats-Flavor for creating memory 
contexts.
This api accepts a pointer to metric for accounting and it is inherited 
by all subcontexts unless overridden.
So we only needed to change context creation API where we wanted (such 
as TopTransactionContext, Message Context, ..)

That's quite trivial, actually.


I think that's pretty much what we tried when this patch was first 
discussed in 2015, but it added measurable overhead (1-3%) even when the 
accounting was disabled. Which is not quite desirable, of course.


Also we have fixed all those missing hash spills - albeit based on the 
9.6 hash table design I think.


I don't think this part of the code changed all that much.


If there is interest by the community we are very willing to share.


Absolutely! I think it'd be great to share both the code and the 
reasoning behind the design.



Cheers
Serge
Salesforce


cheers

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



RE: Re: Spilling hashed SetOps and aggregates to disk

2018-06-05 Thread serge
In our code base we have added a WithStats-Flavor for creating memory contexts.
This api accepts a pointer to metric for accounting and it is inherited by all 
subcontexts unless overridden.
So we only needed to change context creation API where we wanted (such as 
TopTransactionContext, Message Context, ..)
That's quite trivial, actually.
 
Also we have fixed all those missing hash spills - albeit based on the 9.6 hash 
table design I think.
 
If there is interest by the community we are very willing to share.
 
Cheers
Serge
Salesforce


Re: Spilling hashed SetOps and aggregates to disk

2018-06-05 Thread Tomas Vondra

On 06/05/2018 03:17 PM, David Rowley wrote:

On 6 June 2018 at 01:09, Andres Freund  wrote:

On 2018-06-06 01:06:39 +1200, David Rowley wrote:

My concern is that only accounting memory for the group and not the
state is only solving half the problem. It might be fine for
aggregates that don't stray far from their aggtransspace, but for the
other ones, we could still see OOM.



If solving the problem completely is too hard, then a half fix (maybe
3/4) is better than nothing, but if we can get a design for a full fix
before too much work is done, then isn't that better?


I don't think we actually disagree.  I was really primarily talking
about the case where we can't really do better because we don't have
serialization support.  I mean we could just rescan from scratch, using
a groupagg, but that obviously sucks.


I don't think we do. To take yours to the 100% solution might just
take adding the memory accounting to palloc that Jeff proposed a few
years ago and use that accounting to decide when we should switch
method.

However, I don't quite fully recall how the patch accounted for
memory consumed by sub-contexts and if getting the entire
consumption required recursively looking at subcontexts. If that's
the case then checking the consumption would likely cost too much if
it was done after each tuple was aggregated.


Yeah, a simple wrapper would not really fix the issue, because 
allocating memory in the subcontext would not update the accounting 
information. So we'd be oblivious of possibly large amounts of memory. I 
don't quite see how is this related to (not) having serialization 
support, as it's more about not knowing we already hit the memory limit.



That being said, I don't think array_agg actually has the issue, at 
least since b419865a814abbca12bdd6eef6a3d5ed67f432e1 (it does behave 
differently in aggregate and non-aggregate contexts, IIRC).


I don't know how common this issue is, though. I don't think any 
built-in aggregates create additional contexts, but I may be wrong. But 
we have this damn extensibility thing, where users can write their own 
aggregates ...


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Spilling hashed SetOps and aggregates to disk

2018-06-05 Thread David Rowley
On 6 June 2018 at 01:09, Andres Freund  wrote:
> On 2018-06-06 01:06:39 +1200, David Rowley wrote:
>> My concern is that only accounting memory for the group and not the
>> state is only solving half the problem. It might be fine for
>> aggregates that don't stray far from their aggtransspace, but for the
>> other ones, we could still see OOM.
>
>> If solving the problem completely is too hard, then a half fix (maybe
>> 3/4) is better than nothing, but if we can get a design for a full fix
>> before too much work is done, then isn't that better?
>
> I don't think we actually disagree.  I was really primarily talking
> about the case where we can't really do better because we don't have
> serialization support.  I mean we could just rescan from scratch, using
> a groupagg, but that obviously sucks.

I don't think we do. To take yours to the 100% solution might just
take adding the memory accounting to palloc that Jeff proposed a few
years ago and use that accounting to decide when we should switch
method.

However, I don't quite fully recall how the patch accounted for memory
consumed by sub-contexts and if getting the entire consumption
required recursively looking at subcontexts. If that's the case then
checking the consumption would likely cost too much if it was done
after each tuple was aggregated.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Spilling hashed SetOps and aggregates to disk

2018-06-05 Thread Tomas Vondra

On 06/05/2018 02:49 PM, Andres Freund wrote:

Hi,

On 2018-06-05 10:05:35 +0200, Tomas Vondra wrote:

My concern is more about what happens when the input tuple ordering is
inherently incompatible with the eviction strategy, greatly increasing the
amount of data written to disk during evictions.

Say for example that we can fit 1000 groups into work_mem, and that there
are 2000 groups in the input data set. If the input is correlated with the
groups, everything is peachy because we'll evict the first batch, and then
group the remaining 1000 groups (or something like that). But if the input
data is random (which can easily happen, e.g. for IP addresses, UUIDs and
such) we'll hit the limit repeatedly and will evict much sooner.



I know you suggested simply dumping the whole hash table and starting from
scratch while we talked about this at pgcon, but ISTM it has exactly this
issue.


Yea, that's the case I was thinking of where going to sorting will very
likely have better performance.

I think it'd even be sensible to have a "skew tuple" like
optimization. When detecting getting closer to memory exhaustion, move
new groups to the tuplesort, but already hashed tuples stay in the
hashtable.  That'd still need tuples being moved to the sort in the
cases where the transition values get too big (say array_agg), but that
should be comparatively rare.  I'm sure we could do better in selecting
the hash-tabled values than just taking the first incoming ones, but
that shouldn't be too bad.



Not sure. I'd imagine the "compression" due to aggregating many tuples 
into much smaller amount of memory would be a clear win, so I don't find 
the "let's just dump all input tuples into a file" very attractive.


I think evicting a fraction of the aggregate state (say, ~25%) would 
work better - choosing the "oldest" items seems OK, as those likely 
represent many input tuples (thus having a high "compaction" ratio). Or 
we could evict items representing rare groups, to make space for the 
common ones. Perhaps a combined eviction strategy (10% of each type) 
would work well. I'm conveniently ignoring the fact that we don't have 
information to determine this, at the moment, of course.
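
A toy sketch of the "evict a fraction, oldest first" idea (plain C,
arbitrary 25% fraction, printing standing in for writing to a spill file,
and insertion order standing in for "has already absorbed many rows"):

#include <stdio.h>

typedef struct { int key; long sum; } Group;

static int
evict_oldest_quarter(Group *groups, int ngroups)
{
    int nevict = ngroups / 4 > 0 ? ngroups / 4 : 1;

    /* write out the oldest quarter of the aggregate states */
    for (int i = 0; i < nevict; i++)
        printf("spill key=%d partial_sum=%ld\n", groups[i].key, groups[i].sum);

    /* compact the survivors to the front */
    for (int i = nevict; i < ngroups; i++)
        groups[i - nevict] = groups[i];

    return ngroups - nevict;    /* new group count */
}

int main(void)
{
    Group g[] = {{1, 100}, {2, 50}, {3, 7}, {4, 9}, {5, 3}, {6, 1}, {7, 2}, {8, 4}};
    int n = 8;

    n = evict_oldest_quarter(g, n);
    printf("groups remaining: %d\n", n);
    return 0;
}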


I'm sure it's possible to make better decisions based on some cost 
estimates, both at plan time and then during execution.


That being said, I don't want to over-think / over-engineer this. 
Getting something that reduces the risk of OOM in the first step is a 
good enough improvement. If we can do something better next, great.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Spilling hashed SetOps and aggregates to disk

2018-06-05 Thread Andres Freund
Hi,

On 2018-06-06 01:06:39 +1200, David Rowley wrote:
> On 6 June 2018 at 00:57, Andres Freund  wrote:
> > I think it's ok to only handle this gracefully if serialization is
> > supported.
> >
> > But I think my proposal to continue to use a hashtable for the already
> > known groups, and sorting for additional groups, would largely address
> > that, right?  We couldn't deal with groups becoming too large, but
> > easily with the number of groups becoming too large.
> 
> My concern is that only accounting memory for the group and not the
> state is only solving half the problem. It might be fine for
> aggregates that don't stray far from their aggtransspace, but for the
> other ones, we could still see OOM.

> If solving the problem completely is too hard, then a half fix (maybe
> 3/4) is better than nothing, but if we can get a design for a full fix
> before too much work is done, then isn't that better?

I don't think we actually disagree.  I was really primarily talking
about the case where we can't really do better because we don't have
serialization support.  I mean we could just rescan from scratch, using
a groupagg, but that obviously sucks.

Greetings,

Andres Freund



Re: Spilling hashed SetOps and aggregates to disk

2018-06-05 Thread David Rowley
On 6 June 2018 at 00:57, Andres Freund  wrote:
> On 2018-06-06 00:53:42 +1200, David Rowley wrote:
>> On 6 June 2018 at 00:45, Andres Freund  wrote:
>> > On 2018-06-05 09:35:13 +0200, Tomas Vondra wrote:
>> >> I wonder if an aggregate might use a custom context
>> >> internally (I don't recall anything like that). The accounting capability
>> >> seems potentially useful for other places, and those might not use 
>> >> AllocSet
>> >> (or at least not directly).
>> >
>> > Yea, that seems like a big issue.
>>
>> Unfortunately, at least one of the built-in ones do. See initArrayResultArr.
>
> I think it's ok to only handle this gracefully if serialization is
> supported.
>
> But I think my proposal to continue to use a hashtable for the already
> known groups, and sorting for additional groups, would largely address
> that, right?  We couldn't deal with groups becoming too large, but
> easily with the number of groups becoming too large.

My concern is that only accounting memory for the group and not the
state is only solving half the problem. It might be fine for
aggregates that don't stray far from their aggtransspace, but for the
other ones, we could still see OOM.

If solving the problem completely is too hard, then a half fix (maybe
3/4) is better than nothing, but if we can get a design for a full fix
before too much work is done, then isn't that better?


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Spilling hashed SetOps and aggregates to disk

2018-06-05 Thread Andres Freund
On 2018-06-06 00:53:42 +1200, David Rowley wrote:
> On 6 June 2018 at 00:45, Andres Freund  wrote:
> > On 2018-06-05 09:35:13 +0200, Tomas Vondra wrote:
> >> I wonder if an aggregate might use a custom context
> >> internally (I don't recall anything like that). The accounting capability
> >> seems potentially useful for other places, and those might not use AllocSet
> >> (or at least not directly).
> >
> > Yea, that seems like a big issue.
> 
> Unfortunately, at least one of the built-in ones do. See initArrayResultArr.

I think it's ok to only handle this gracefully if serialization is
supported.

But I think my proposal to continue to use a hashtable for the already
known groups, and sorting for additional groups, would largely address
that, right?  We couldn't deal with groups becoming too large, but
easily with the number of groups becoming too large.

Greetings,

Andres Freund



Re: Spilling hashed SetOps and aggregates to disk

2018-06-05 Thread David Rowley
On 6 June 2018 at 00:45, Andres Freund  wrote:
> On 2018-06-05 09:35:13 +0200, Tomas Vondra wrote:
>> I wonder if an aggregate might use a custom context
>> internally (I don't recall anything like that). The accounting capability
>> seems potentially useful for other places, and those might not use AllocSet
>> (or at least not directly).
>
> Yea, that seems like a big issue.

Unfortunately, at least one of the built-in ones do. See initArrayResultArr.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Spilling hashed SetOps and aggregates to disk

2018-06-05 Thread Andres Freund
Hi,

On 2018-06-05 10:05:35 +0200, Tomas Vondra wrote:
> My concern is more about what happens when the input tuple ordering is
> inherently incompatible with the eviction strategy, greatly increasing the
> amount of data written to disk during evictions.
> 
> Say for example that we can fit 1000 groups into work_mem, and that there
> are 2000 groups in the input data set. If the input is correlated with the
> groups, everything is peachy because we'll evict the first batch, and then
> group the remaining 1000 groups (or something like that). But if the input
> data is random (which can easily happen, e.g. for IP addresses, UUIDs and
> such) we'll hit the limit repeatedly and will evict much sooner.

> I know you suggested simply dumping the whole hash table and starting from
> scratch while we talked about this at pgcon, but ISTM it has exactly this
> issue.

Yea, that's the case I was thinking of where going to sorting will very
likely have better performance.

I think it'd even be sensible to have a "skew tuple" like
optimization. When detecting getting closer to memory exhaustion, move
new groups to the tuplesort, but already hashed tuples stay in the
hashtable.  That'd still need tuples being moved to the sort in the
cases where the transition values get to big (say array_agg), but that
should be comparatively rare.  I'm sure we could do better in selecting
the hash-tabled values than just taking the first incoming ones, but
that shouldn't be too bad.

Greetings,

Andres Freund



Re: Spilling hashed SetOps and aggregates to disk

2018-06-05 Thread Tomas Vondra

On 06/05/2018 07:46 AM, Jeff Davis wrote:

On Tue, 2018-06-05 at 07:04 +0200, Tomas Vondra wrote:

I expect the eviction strategy to be the primary design challenge of
this patch. The other bits will be mostly determined by this one
piece.


Not sure I agree that this is the primary challenge.

The cases that benefit from eviction are probably a minority. I see two
categories that would benefit:

   * Natural clustering in the heap. This sounds fairly common, but a
lot of the cases that come to mind are too low-cardinality to be
compelling; e.g. timestamps grouped by hour/day/month. If someone has
run into a high-cardinality natural grouping case, let me know.
   * ARRAY_AGG (or similar): individual state values can be large enough
that we need to evict to avoid exceeding work_mem even if not adding
any new groups.

In either case, it seems like a fairly simple eviction strategy would
work. For instance, we could just evict the entire hash table if
work_mem is exceeded or if the hit rate on the hash table falls below a
certain threshold. If there was really something important that should
have stayed in the hash table, it will go back in soon anyway.

So why should eviction be a major driver for the entire design? I agree
it should be an area of improvement for the future, so let me know if
you see a major problem, but I haven't been as focused on eviction.



My concern is more about what happens when the input tuple ordering is 
inherently incompatible with the eviction strategy, greatly increasing 
the amount of data written to disk during evictions.


Say for example that we can fit 1000 groups into work_mem, and that 
there are 2000 groups in the input data set. If the input is correlated 
with the groups, everything is peachy because we'll evict the first 
batch, and then group the remaining 1000 groups (or something like 
that). But if the input data is random (which can easily happen, e.g. 
for IP addresses, UUIDs and such) we'll hit the limit repeatedly and 
will evict much sooner.


I know you suggested simply dumping the whole hash table and starting 
from scratch while we talked about this at pgcon, but ISTM it has 
exactly this issue.


But I don't know if there actually is a better option - maybe we simply 
have to accept this problem. After all, running slowly is still better 
than OOM (which may or may not happen now).


I wonder if we can somehow detect this at run-time and maybe fall-back 
to groupagg. E.g. we could compare number of groups vs. number of input 
tuples when we first hit the limit. It's a rough heuristic, but maybe
sufficient.
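
Sketched as code, the heuristic could be as simple as the following (the
0.75 cutoff is a placeholder, not a tuned constant):

/* At the moment the hash table first hits the memory limit, compare how
 * many distinct groups we have against how many input tuples produced
 * them.  A ratio near 1 means hashing is buying little "compression",
 * so falling back to a sort-based aggregate is likely cheaper. */
int
should_fall_back_to_sort(long ngroups_so_far, long ntuples_so_far)
{
    if (ntuples_so_far == 0)
        return 0;
    return (double) ngroups_so_far / (double) ntuples_so_far > 0.75;
}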



While the primary goal of the patch is addressing the OOM risks in
hash
aggregate, I think it'd be a mistake to see it just that way. I see
it
could allow us to perform hash aggregate more often, even if we know
the
groups won't fit into work_mem. If we could estimate how much of the
aggregate state we'll have to spill to disk, we could still prefer
hashagg over groupagg. We would pay the price for eviction, but on
large
data sets that can be massively cheaper than having to do sort.


Agreed. I ran some tests of my patch in the last round, and they
strongly supported choosing HashAgg a lot more often. A lot of sort
improvements have been made though, so I really need to run some new
numbers.



Yeah, Peter made the sort stuff a lot faster. But I think there still is 
a lot of cases where the hashagg would greatly reduce the amount of data 
that needs to be written to disk, and that's not something the sort 
improvements could improve. If you need to aggregate a 1TB table, it's 
still going to be ~1TB of data you need to write to disk during on-disk 
sort. But if there is only a reasonable number of groups, that greatly 
simplifies things.



far away), but it would be unfortunate to make this improvement
impossible/more difficult in the future.


If you see anything that would make this difficult in the future, let
me know.



Sure. Sorry for being too hand-wavy in this thread, and possibly moving 
the goal posts further away :-/ I got excited by you planning to work on 
this again and perhaps a bit carried away ;-)



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Spilling hashed SetOps and aggregates to disk

2018-06-05 Thread David Rowley
On 5 June 2018 at 17:04, Tomas Vondra  wrote:
> On 06/05/2018 04:56 AM, David Rowley wrote:
>> Isn't there still a problem determining when the memory exhaustion
>> actually happens though? As far as I know, we still have little
>> knowledge of how much memory each aggregate state occupies.
>>
>> Jeff tried to solve this in [1], but from what I remember, there was
>> too much concern about the overhead of the additional accounting code.
>>
>> [1] 
>> https://www.postgresql.org/message-id/flat/CAKJS1f8yvvvj-sVDv_bcxkzcZKq0ZOTVhX0dHfnYDct2Mycq5Q%40mail.gmail.com#cakjs1f8yvvvj-svdv_bcxkzczkq0zotvhx0dhfnydct2myc...@mail.gmail.com
>>
>
> I had a chat with Jeff Davis at pgcon about this, and IIRC he suggested
> a couple of people who were originally worried about the overhead seem
> to be accepting it now.

Is there any great need to make everything pay the small price for
this? Couldn't we just create a new MemoryContextMethod that
duplicates aset.c, but has the required additional accounting built in
at the implementation level, then make execGrouping.c use that
allocator for its hash table? The code would not really need to be
duplicated, we could just do things the same way as like.c does with
like_match.c and include a .c file. We'd need another interface
function in MemoryContextMethods to support getting the total memory
allocated in the context, but that does not seem like a problem.
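
A toy sketch of such an accounting wrapper, using a per-context pointer to
a shared byte counter (inherited by child contexts unless overridden) and
plain malloc underneath instead of aset.c:

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

typedef struct AcctContext
{
    int64_t *total_bytes;       /* shared with the parent unless overridden */
} AcctContext;

typedef struct ChunkHeader { size_t size; } ChunkHeader;

static void *
acct_alloc(AcctContext *cxt, size_t size)
{
    ChunkHeader *h = malloc(sizeof(ChunkHeader) + size);

    h->size = size;             /* remember the size for the free path */
    *cxt->total_bytes += size;
    return h + 1;
}

static void
acct_free(AcctContext *cxt, void *ptr)
{
    ChunkHeader *h = (ChunkHeader *) ptr - 1;

    *cxt->total_bytes -= h->size;
    free(h);
}

int main(void)
{
    int64_t agg_mem = 0;
    AcctContext parent = {&agg_mem};
    AcctContext child = {&agg_mem};     /* child inherits the same counter */

    void *a = acct_alloc(&parent, 128);
    void *b = acct_alloc(&child, 512);

    printf("accounted bytes: %lld\n", (long long) agg_mem);  /* 640 */
    acct_free(&child, b);
    acct_free(&parent, a);
    return 0;
}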


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Spilling hashed SetOps and aggregates to disk

2018-06-05 Thread Tomas Vondra



On 06/05/2018 09:22 AM, David Rowley wrote:

On 5 June 2018 at 17:04, Tomas Vondra  wrote:

On 06/05/2018 04:56 AM, David Rowley wrote:

Isn't there still a problem determining when the memory exhaustion
actually happens though? As far as I know, we still have little
knowledge of how much memory each aggregate state occupies.

Jeff tried to solve this in [1], but from what I remember, there was
too much concern about the overhead of the additional accounting code.

[1] 
https://www.postgresql.org/message-id/flat/CAKJS1f8yvvvj-sVDv_bcxkzcZKq0ZOTVhX0dHfnYDct2Mycq5Q%40mail.gmail.com#cakjs1f8yvvvj-svdv_bcxkzczkq0zotvhx0dhfnydct2myc...@mail.gmail.com



I had a chat with Jeff Davis at pgcon about this, and IIRC he suggested
a couple of people who were originally worried about the overhead seem
to be accepting it now.


Is there any great need to make everything pay the small price for
this? Couldn't we just create a new MemoryContextMethod that
duplicates aset.c, but has the required additional accounting built in
at the implementation level, then make execGrouping.c use that
allocator for its hash table? The code would not really need to be
duplicated, we could just do things the same way as like.c does with
like_match.c and include a .c file. We'd need another interface
function in MemoryContextMethods to support getting the total memory
allocated in the context, but that does not seem like a problem.



There probably is not a need, but there was not a great way to enable it 
explicitly only for some contexts. IIRC what was originally considered 
back in 2015 was some sort of a flag in the memory context, and the 
overhead was about the same as with the int64 arithmetic alone.


But I don't think we've considered copying the whole AllocSet. Maybe 
that would work, not sure. I wonder if an aggregate might use a custom 
context internally (I don't recall anything like that). The accounting 
capability seems potentially useful for other places, and those might 
not use AllocSet (or at least not directly).


All of this of course assumes the overhead is still there. I sure will 
do some new measurements.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Spilling hashed SetOps and aggregates to disk

2018-06-04 Thread Jeff Davis
On Tue, 2018-06-05 at 07:04 +0200, Tomas Vondra wrote:
> I expect the eviction strategy to be the primary design challenge of
> this patch. The other bits will be mostly determined by this one
> piece.

Not sure I agree that this is the primary challenge.

The cases that benefit from eviction are probably a minority. I see two
categories that would benefit:

  * Natural clustering in the heap. This sounds fairly common, but a
lot of the cases that come to mind are too low-cardinality to be
compelling; e.g. timestamps grouped by hour/day/month. If someone has
run into a high-cardinality natural grouping case, let me know.
  * ARRAY_AGG (or similar): individual state values can be large enough
that we need to evict to avoid exceeding work_mem even if not adding
any new groups.

In either case, it seems like a fairly simple eviction strategy would
work. For instance, we could just evict the entire hash table if
work_mem is exceeded or if the hit rate on the hash table falls below a
certain threshold. If there was really something important that should
have stayed in the hash table, it will go back in soon anyway.
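
Sketched as code, such a trigger might look like this (all thresholds
illustrative):

/* Flush the whole hash table either when it exceeds the memory budget or
 * when the recent hit rate drops below some threshold. */
typedef struct HashAggStats
{
    long   mem_used;        /* bytes currently held by the hash table  */
    long   mem_limit;       /* work_mem equivalent                     */
    long   lookups;         /* lookups since the last flush            */
    long   hits;            /* lookups that found an existing group    */
} HashAggStats;

int
should_flush_hash_table(const HashAggStats *s)
{
    if (s->mem_used > s->mem_limit)
        return 1;
    if (s->lookups >= 1000 &&
        (double) s->hits / (double) s->lookups < 0.10)
        return 1;               /* mostly new groups: hashing isn't helping */
    return 0;
}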

So why should eviction be a major driver for the entire design? I agree
it should be an area of improvement for the future, so let me know if
you see a major problem, but I haven't been as focused on eviction.

> While the primary goal of the patch is addressing the OOM risks in
> hash
> aggregate, I think it'd be a mistake to see it just that way. I see
> it
> could allow us to perform hash aggregate more often, even if we know
> the
> groups won't fit into work_mem. If we could estimate how much of the
> aggregate state we'll have to spill to disk, we could still prefer
> hashagg over groupagg. We would pay the price for eviction, but on
> large
> data sets that can be massively cheaper than having to do sort.

Agreed. I ran some tests of my patch in the last round, and they
strongly supported choosing HashAgg a lot more often. A lot of sort
improvements have been made though, so I really need to run some new
numbers.

> far away), but it would be unfortunate to make this improvement
> impossible/more difficult in the future.

If you see anything that would make this difficult in the future, let
me know.

Regards,
     Jeff Davis




Re: Spilling hashed SetOps and aggregates to disk

2018-06-04 Thread Jeff Davis
On Mon, 2018-06-04 at 11:52 -0700, Andres Freund wrote:
> I wonder whether, at least for aggregates, the better fix wouldn't be
> to
> switch to feeding the tuples into tuplesort upon memory exhaustion
> and
> doing a sort based aggregate.  We have most of the infrastructure to
> do

That's an interesting idea, but it seems simpler to stick to hashing
rather than using a combination strategy. It also seems like it would
take less CPU effort.

What advantages do you have in mind? My patch partitions the spilled
data, so it should have similar disk costs as a sort approach.
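
For readers following along, the partitioning amounts to routing each
spilled tuple to one of N spill files by reusing bits of its group hash; a
toy sketch with illustrative constants (not the patch itself):

#include <stdio.h>

#define NUM_SPILL_PARTITIONS 4   /* power of two keeps the masking cheap */

static int
spill_partition_for(unsigned int group_hash)
{
    /* use high bits so they stay independent of the in-memory bucket bits */
    return (group_hash >> 27) & (NUM_SPILL_PARTITIONS - 1);
}

int main(void)
{
    unsigned int some_hashes[] = {0x12345678u, 0x9abcdef0u, 0x0fedcba9u, 0xdeadbeefu};

    for (int i = 0; i < 4; i++)
        printf("hash %#x -> partition %d\n",
               some_hashes[i], spill_partition_for(some_hashes[i]));
    return 0;
}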

Regards,
Jeff Davis




Re: Spilling hashed SetOps and aggregates to disk

2018-06-04 Thread Tomas Vondra
On 06/05/2018 04:56 AM, David Rowley wrote:
> On 5 June 2018 at 06:52, Andres Freund  wrote:
>> That part has gotten a bit easier since, because we have serialize /
>> deserialize operations for aggregates these days.
> 
> True. Although not all built in aggregates have those defined.

Not sure what to do about those, unfortunately. We could stop adding
more groups and hope the aggregate state does not grow further, but
that's about it.

> 
>> I wonder whether, at least for aggregates, the better fix wouldn't be to
>> switch to feeding the tuples into tuplesort upon memory exhaustion and
>> doing a sort based aggregate.  We have most of the infrastructure to do
>> that due to grouping sets. It's just the pre-existing in-memory tuples
>> that'd be problematic, in that the current transition values would need
>> to serialized as well.  But with a stable sort that'd not be
>> particularly problematic, and that could easily be achieved.

That's one of the possible solutions, yes. It's hard to say if it's
better or worse than the other approaches, because that depends on the
number of tuples vs. number of groups etc.

Evicting some (or all) of the in-memory groups can be easily better or
worse, depending on the data set. And I'm not sure the code complexity
would be significantly different.

I expect the eviction strategy to be the primary design challenge of
this patch. The other bits will be mostly determined by this one piece.

While the primary goal of the patch is addressing the OOM risks in hash
aggregate, I think it'd be a mistake to see it just that way. I see it
could allow us to perform hash aggregate more often, even if we know the
groups won't fit into work_mem. If we could estimate how much of the
aggregate state we'll have to spill to disk, we could still prefer
hashagg over groupagg. We would pay the price for eviction, but on large
data sets that can be massively cheaper than having to do sort.

I admit this is a bit hand-wavy, and the costing depends on us being
able to estimate the amount of evicted data. I certainly don't expect to
get this right away (that would move the goal posts for the patch very
far away), but it would be unfortunate to make this improvement
impossible/more difficult in the future.

> 
> Isn't there still a problem determining when the memory exhaustion
> actually happens though? As far as I know, we still have little
> knowledge of how much memory each aggregate state occupies.
> 
> Jeff tried to solve this in [1], but from what I remember, there was
> too much concern about the overhead of the additional accounting code.
> 
> [1] 
> https://www.postgresql.org/message-id/flat/CAKJS1f8yvvvj-sVDv_bcxkzcZKq0ZOTVhX0dHfnYDct2Mycq5Q%40mail.gmail.com#cakjs1f8yvvvj-svdv_bcxkzczkq0zotvhx0dhfnydct2myc...@mail.gmail.com
> 

I had a chat with Jeff Davis at pgcon about this, and IIRC he suggested
a couple of people who were originally worried about the overhead seem
to be accepting it now.

IMHO we do want a memory-bound hash aggregate, and doing some sort of
memory accounting is a must-have part of that. I don't see a way around
this, really. We need to minimize the overhead, of course.

I do not recall the exact approach we ended up with in 2015, though, or
what the measurements with that version were. I see 1-3% mentioned early
in the thread, and there were some additional improvements in subsequent
patch version I think.

I don't think we can realistically improve this (accounting at block
level), and there was a discussion if this is actually an overhead or
merely due to different binary layout.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Spilling hashed SetOps and aggregates to disk

2018-06-04 Thread David Rowley
On 5 June 2018 at 06:52, Andres Freund  wrote:
> That part has gotten a bit easier since, because we have serialize /
> deserialize operations for aggregates these days.

True. Although not all built in aggregates have those defined.

> I wonder whether, at least for aggregates, the better fix wouldn't be to
> switch to feeding the tuples into tuplesort upon memory exhaustion and
> doing a sort based aggregate.  We have most of the infrastructure to do
> that due to grouping sets. It's just the pre-existing in-memory tuples
> that'd be problematic, in that the current transition values would need
> to serialized as well.  But with a stable sort that'd not be
> particularly problematic, and that could easily be achieved.

Isn't there still a problem determining when the memory exhaustion
actually happens though? As far as I know, we still have little
knowledge of how much memory each aggregate state occupies.

Jeff tried to solve this in [1], but from what I remember, there was
too much concern about the overhead of the additional accounting code.

[1] 
https://www.postgresql.org/message-id/flat/CAKJS1f8yvvvj-sVDv_bcxkzcZKq0ZOTVhX0dHfnYDct2Mycq5Q%40mail.gmail.com#cakjs1f8yvvvj-svdv_bcxkzczkq0zotvhx0dhfnydct2myc...@mail.gmail.com

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Spilling hashed SetOps and aggregates to disk

2018-06-04 Thread Andres Freund
Hi,

On 2018-06-04 10:32:47 +0200, Heikki Linnakangas wrote:
> Hash Aggs and SetOps are currently not spilled to disk. If the planner's
> estimate on the number of entries is badly off, you might run out of memory
> at execution time, if all the entries don't fit in memory.
> 
> For HashAggs, this was discussed in depth a couple of years ago at [1].
> SetOps have the same issue, but fixing that is simpler, as you don't need to
> handle arbitrary aggregate transition values and functions.

That part has gotten a bit easier since, because we have serialize /
deserialize operations for aggregates these days.

I wonder whether, at least for aggregates, the better fix wouldn't be to
switch to feeding the tuples into tuplesort upon memory exhaustion and
doing a sort based aggregate.  We have most of the infrastructure to do
that due to grouping sets. It's just the pre-existing in-memory tuples
that'd be problematic, in that the current transition values would need
to serialized as well.  But with a stable sort that'd not be
particularly problematic, and that could easily be achieved.

Greetings,

Andres Freund



Spilling hashed SetOps and aggregates to disk

2018-06-04 Thread Heikki Linnakangas

Hi,

Hash Aggs and SetOps are currently not spilled to disk. If the planner's 
estimate on the number of entries is badly off, you might run out of 
memory at execution time, if all the entries don't fit in memory.


For HashAggs, this was discussed in depth a couple of years ago at [1]. 
SetOps have the same issue, but fixing that is simpler, as you don't 
need to handle arbitrary aggregate transition values and functions.


So a while back, I started hacking on spilling SetOps, with the idea 
that the code to deal with that could later be reused to deal with 
HashAggs, too. I didn't get very far, but I'm posting this in this very 
unfinished form to show what I've got, because I had a chat on this with 
Jeff Davis and some others last week.


The logtape.c interface would be very useful for this. When you start 
spilling, you want to create many spill files, so that when reloaded, 
each file will fit comfortably in memory. With logtape.c, you can have 
many logical tapes, without the overhead of real files. Furthermore, if 
you need to re-spill because a spill file grows too big in the first
pass, logtape.c allows reusing the space "on-the-fly". The only problem 
with the current logtape interface is that it requires specifying the 
number of "tapes" upfront, when the tapeset is created. However, I was 
planning to change that, anyway [2].
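
A back-of-the-envelope sketch of why cheap logical tapes matter here: the
natural number of spill targets is roughly the expected spill size divided
by work_mem, which can easily run into the hundreds (illustrative numbers
only):

#include <stdio.h>

static int
choose_num_spill_tapes(long long estimated_spill_bytes, long long work_mem_bytes)
{
    /* one tape per work_mem-sized chunk, so each reloads comfortably */
    long long n = (estimated_spill_bytes + work_mem_bytes - 1) / work_mem_bytes;

    return n < 1 ? 1 : (int) n;
}

int main(void)
{
    long long spill = 10LL * 1024 * 1024 * 1024;   /* ~10 GB expected to spill */
    long long work_mem = 64LL * 1024 * 1024;       /* 64 MB */

    printf("logical tapes needed: %d\n", choose_num_spill_tapes(spill, work_mem));
    return 0;
}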


[1] 
https://www.postgresql.org/message-id/1407706010.6623.16.camel%40jeff-desktop


[2] 
https://www.postgresql.org/message-id/420a0ec7-602c-d406-1e75-1ef7ddc58d83%40iki.fi


- Heikki

>From 1513a777ca1aa1df57f054ed8d15db9a734adf91 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Thu, 3 May 2018 09:44:13 +0300
Subject: [PATCH 1/2] Optimize memory usage in SetOp executor node.

---
 src/backend/executor/nodeSetOp.c | 68 +++-
 src/include/nodes/execnodes.h|  4 ++-
 2 files changed, 56 insertions(+), 16 deletions(-)

diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index 3fa4a5fcc6..8dd017b2ef 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -51,6 +51,8 @@
 #include "utils/memutils.h"
 
 
+#define NUM_PERGROUPS_PER_ALLOC 100
+
 /*
  * SetOpStatePerGroupData - per-group working state
  *
@@ -60,11 +62,19 @@
  * In SETOP_SORTED mode, we need only one of these structs, and it's kept in
  * the plan state node.  In SETOP_HASHED mode, the hash table contains one
  * of these for each tuple group.
+ *
+ * Unused per-group structs are kept in a linked list, in
+ * SetOpState.free_pergroups.  In that case, 'next' points to the next struct
+ * in the free-list.
  */
-typedef struct SetOpStatePerGroupData
+typedef union SetOpStatePerGroupData
 {
-	long		numLeft;		/* number of left-input dups in group */
-	long		numRight;		/* number of right-input dups in group */
+	struct
+	{
+		uint64		numLeft;		/* number of left-input dups in group */
+		uint64		numRight;		/* number of right-input dups in group */
+	} data;
+	SetOpStatePerGroup next;		/* next unused entry in the free list */
 }			SetOpStatePerGroupData;
 
 
@@ -79,7 +89,7 @@ static TupleTableSlot *setop_retrieve_hash_table(SetOpState *setopstate);
 static inline void
 initialize_counts(SetOpStatePerGroup pergroup)
 {
-	pergroup->numLeft = pergroup->numRight = 0;
+	pergroup->data.numLeft = pergroup->data.numRight = 0;
 }
 
 /*
@@ -89,9 +99,39 @@ static inline void
 advance_counts(SetOpStatePerGroup pergroup, int flag)
 {
 	if (flag)
-		pergroup->numRight++;
+		pergroup->data.numRight++;
 	else
-		pergroup->numLeft++;
+		pergroup->data.numLeft++;
+}
+
+/*
+ * Allocate a new per-group struct.
+ *
+ * To save on memory and palloc() call overhead, we allocate the per-group
+ * structs in batches.
+ */
+static SetOpStatePerGroup
+alloc_pergroup(SetOpState *setopstate)
+{
+	SetOpStatePerGroup pergroup;
+
+	if (!setopstate->free_pergroups)
+	{
+		int			i;
+
+		setopstate->free_pergroups =
+			MemoryContextAlloc(setopstate->hashtable->tablecxt,
+			   NUM_PERGROUPS_PER_ALLOC * sizeof(SetOpStatePerGroupData));
+
+		for (i = 0; i < NUM_PERGROUPS_PER_ALLOC - 1; i++)
+			setopstate->free_pergroups[i].next = &setopstate->free_pergroups[i + 1];
+		setopstate->free_pergroups[NUM_PERGROUPS_PER_ALLOC - 1].next = NULL;
+	}
+
+	pergroup = setopstate->free_pergroups;
+	setopstate->free_pergroups = pergroup->next;
+
+	return pergroup;
 }
 
 /*
@@ -152,26 +192,26 @@ set_output_count(SetOpState *setopstate, SetOpStatePerGroup pergroup)
 	switch (plannode->cmd)
 	{
 		case SETOPCMD_INTERSECT:
-			if (pergroup->numLeft > 0 && pergroup->numRight > 0)
+			if (pergroup->data.numLeft > 0 && pergroup->data.numRight > 0)
 setopstate->numOutput = 1;
 			else
 setopstate->numOutput = 0;
 			break;
 		case SETOPCMD_INTERSECT_ALL:
 			setopstate->numOutput =
-(pergroup->numLeft < pergroup->numRight) ?
-pergroup->numLeft : pergroup->numRight;
+(pergroup->data.numLeft < pergroup->data.numRight) ?
+pergroup->data.numLeft :