Re: [HACKERS] Large Scale Aggregation (HashAgg Enhancement)

2006-01-20 Thread Simon Riggs
On Thu, 2006-01-19 at 18:38 -0500, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  This seems to lead to a super-geometric progression in the number of
  files required,
 
 But we double the number of batches at each step, so there are going to
 be at most 20 or so levels, and that's only assuming a *horridly* wrong
 initial guess by the planner.  In practice I think it's reasonable to
 assume at most a couple rounds of doubling.  If you have more than that,
 the extra data-shuffling is going to exhaust your patience anyway.

What I'm saying is that if we start from 1 batch and move dynamically
upwards we quickly get an unmanageable number of files. However, if we
start at a particular number N, then we start with N-1 files, then move
to at most 2N(N-1) files, etc.

So we can only get it wrong and double the number of batches about
twice before we get swamped with files, i.e. if we start at 1 we can
only reasonably get to 8 batches.

So we should start at a number higher than 1, attempting to make an
accurate guess about the number of batches (N) required. If we have R rows
to aggregate and we get N correct, then the cost of the HashAgg is
2*R*(N-1)/N I/Os, which is cheaper than a sort, for *any* value of R for
both CPU and I/O costs. If we get it wrong, we have to read and re-write
more and more rows, which could eventually surpass the sort costs,
especially if we have growing transition state data from the aggregate.
I think the cost will be to re-write half of all rows already written
when we double N. If we fail early because we got Ndistinct wrong then
this could be cheap, though if we fail later on because of a growing
aggregate then this could easily be very expensive and quickly exceed
the cost of a sort.
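
As a rough standalone illustration of that 2*R*(N-1)/N arithmetic (this is
not planner code; the 3 billion row count and the batch numbers below are
just made-up example values):

#include <stdio.h>

/* Estimated tuple I/Os for a spilling HashAgg with R input rows and N
 * batches, per the formula above: the (N-1)/N fraction of rows belonging
 * to later batches is written out once and read back once. */
static double hashagg_io(double rows, double batches)
{
    return 2.0 * rows * (batches - 1.0) / batches;
}

int main(void)
{
    double rows = 3e9;                      /* e.g. a 3 billion row input */
    int    batches[] = {1, 2, 4, 8, 16, 100};

    for (int i = 0; i < 6; i++)
        printf("N = %3d  ->  %.2e tuple I/Os (approaches 2R = %.2e)\n",
               batches[i], hashagg_io(rows, batches[i]), 2.0 * rows);
    return 0;
}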

My thought is to collect statistics about an aggregate at CREATE
AGGREGATE time. Simply send the aggregate 100 data values and see if the
output varies in size according to the input; if it does, we take much
greater care about selecting HashAgg plans with that aggregate. ...and
that way we don't need the user to define the aggregate type directly.
This would only work with aggregates that return well-known datatypes
such as int or char.
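
A toy sketch of what such a probe could look like, entirely outside the
backend: the two states below just stand in for a fixed-size aggregate like
sum() and a growing one like a concatenation, and the 100-value probe count
follows the suggestion above; everything else is made up.

#include <stdio.h>
#include <string.h>

#define PROBE_VALUES 100

int main(void)
{
    long sum = 0;               /* fixed-size state: never grows */
    char concat[4096] = "";     /* growing state: tracks the input count */

    for (int i = 1; i <= PROBE_VALUES; i++)
    {
        char buf[16];

        sum += i;
        snprintf(buf, sizeof(buf), "%d,", i);
        strncat(concat, buf, sizeof(concat) - strlen(concat) - 1);

        if (i == 1 || i == PROBE_VALUES)
            printf("after %3d values: sum state = %zu bytes, "
                   "concat state = %zu bytes\n",
                   i, sizeof(sum), strlen(concat));
    }

    /* a state (or result) that grows with the input would be the hint to be
     * far more conservative about choosing HashAgg for that aggregate */
    return sum > 0 ? 0 : 1;
}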

So getting the number of groups correct would be critical to making this
work, but HashAgg could be effective for even very large aggregates.

Any holes in that thinking?

Best Regards, Simon Riggs





Re: [HACKERS] Large Scale Aggregation (HashAgg Enhancement)

2006-01-20 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 Any holes in that thinking?

Only that it's about five times more complicated than is currently known
to be necessary ;-).  How about we just implement the dynamic spill to
disk first, and not bother with the other stuff until we see problems in
the field?  Saying we have to do all this is a good recipe for not
getting any of it done.

regards, tom lane



Re: [HACKERS] Large Scale Aggregation (HashAgg Enhancement)

2006-01-19 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 This seems to lead to a super-geometric progression in the number of
 files required,

But we double the number of batches at each step, so there are going to
be at most 20 or so levels, and that's only assuming a *horridly* wrong
initial guess by the planner.  In practice I think it's reasonable to
assume at most a couple rounds of doubling.  If you have more than that,
the extra data-shuffling is going to exhaust your patience anyway.

regards, tom lane



Re: [HACKERS] Large Scale Aggregation (HashAgg Enhancement)

2006-01-19 Thread Simon Riggs
On Tue, 2006-01-17 at 21:43 +, Simon Riggs wrote:
 On Tue, 2006-01-17 at 09:52 -0500, Tom Lane wrote:
  I was thinking along the lines of having multiple temp files per hash
  bucket.  If you have a tuple that needs to migrate from bucket M to
  bucket N, you know that it arrived before every tuple that was
  assigned
  to bucket N originally, so put such tuples into a separate temp file
  and process them before the main bucket-N temp file.  This might get a
  little tricky to manage after multiple hash resizings, but in
  principle
  it seems doable.

 You can manage that with file naming. Rows moved from batch N to batch M
 would be renamed N.M, so you'd be able to use file ordering to retrieve
 all files for *.M
 That scheme would work for multiple splits too, so that filenames could
 grow yet retain their sort order and final target batch properties.

This seems to lead to a super-geometric progression in the number of
files required, if we assume that the current batch could be
redistributed to all future batches each of which could be similarly
redistributed.

batches   files
1         none
2         1
4         7
8         64
16        64,000
32        ~4 billion

So it does seem important whether we demand sorted input or not.

Or at least it requires us to provide the executor with a starting point
for the number of batches, so we could manage that.

Best Regards, Simon Riggs




Re: [HACKERS] Large Scale Aggregation (HashAgg Enhancement)

2006-01-17 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 On Mon, 2006-01-16 at 20:02 -0500, Tom Lane wrote:
 But our idea of the number of batches needed can change during that
 process, resulting in some inner tuples being initially assigned to the
 wrong temp file.  This would also be true for hashagg.

 So we correct that before we start reading the outer table.

Why?  That would require a useless additional pass over the data.  With
the current design, we can process and discard at least *some* of the
data in a temp file when we read it, but a reorganization pass would
mean that it *all* goes back out to disk a second time.

Also, you assume that we can accurately tell how many tuples will fit in
memory in advance of actually processing them --- a presumption clearly
false in the hashagg case, and not that easy to do even for hashjoin.
(You can tell the overall size of a temp file, sure, but how do you know
how it will split when the batch size changes?  A perfectly even split
is unlikely.)

 OK, I see what you mean. Sounds like we should have a new definition for
 Aggregates, Sort Insensitive that allows them to work when the input
 ordering does not affect the result, since that case can be optimised
 much better when using HashAgg.

Please don't propose pushing this problem onto the user until it's
demonstrated that there's no other way.  I don't want to become the
next Oracle, with forty zillion knobs that it takes a highly trained
DBA to deal with.

 But all of them sound ugly.

I was thinking along the lines of having multiple temp files per hash
bucket.  If you have a tuple that needs to migrate from bucket M to
bucket N, you know that it arrived before every tuple that was assigned
to bucket N originally, so put such tuples into a separate temp file
and process them before the main bucket-N temp file.  This might get a
little tricky to manage after multiple hash resizings, but in principle
it seems doable.

regards, tom lane



Re: [HACKERS] Large Scale Aggregation (HashAgg Enhancement)

2006-01-17 Thread Simon Riggs
On Mon, 2006-01-16 at 12:36 -0500, Tom Lane wrote:

 The tricky part is to preserve the existing guarantee that tuples are
 merged into their aggregate in arrival order.  (This does not matter for
 the standard aggregates but it definitely does for custom aggregates,
 and there will be unhappy villagers appearing on our doorsteps if we
 break it.)  I think this can work correctly under the above sketch but
 it needs to be verified.  It might require different handling of the
 TODO files than what hashjoin does.

You almost had me there... but there isn't any arrival order. The sort
that precedes an aggregation only sorts on the GROUP BY columns, not on
additional columns - so by the SQL standard there is not a guaranteed
ordering of the data into a aggregate. That is exactly what windowed
aggregates are for. (There isn't any way of specifying an ORDER BY yet
either).

The only way of doing this is by using a derived table, e.g.
select a, sum(b) from (select a, b from foo order by a, b) as s group by a;
but AFAICS this is not part of the standard??

It is highly likely that rows are clumped together, but there just isn't
any guarantee that is the case. Any update of any row would change the
arrival order. Should we support something that has worked by luck?

I've been looking into windowed aggregates; these will provide this
functionality should people require it. I don't see how we'd be able to
do windowed aggregates and hashAgg at the same time, so this seems less
relevant. 

Best Regards, Simon Riggs




Re: [HACKERS] Large Scale Aggregation (HashAgg Enhancement)

2006-01-17 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 On Mon, 2006-01-16 at 12:36 -0500, Tom Lane wrote:
 The tricky part is to preserve the existing guarantee that tuples are
 merged into their aggregate in arrival order.

 You almost had me there... but there isn't any arrival order.

The fact that it's not in the spec doesn't mean we don't support it.
Here are a couple of threads on the subject:
http://archives.postgresql.org/pgsql-general/2005-11/msg00304.php
http://archives.postgresql.org/pgsql-sql/2003-06/msg00135.php

Per the second message, this has worked since 7.4, and it was requested
fairly often before that.

 Should we support something that has worked by luck?

No luck about it, and yes people are depending on it.  You don't get to
break it just because it's not in the spec.

regards, tom lane



Re: [HACKERS] Large Scale Aggregation (HashAgg Enhancement)

2006-01-17 Thread Simon Riggs
On Tue, 2006-01-17 at 14:41 -0500, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  On Mon, 2006-01-16 at 12:36 -0500, Tom Lane wrote:
  The tricky part is to preserve the existing guarantee that tuples are
  merged into their aggregate in arrival order.
 
  You almost had me there... but there isn't any arrival order.
 
 The fact that it's not in the spec doesn't mean we don't support it.
 Here are a couple of threads on the subject:
 http://archives.postgresql.org/pgsql-general/2005-11/msg00304.php
 http://archives.postgresql.org/pgsql-sql/2003-06/msg00135.php
 
 Per the second message, this has worked since 7.4, and it was requested
 fairly often before that.

OK. My interest was in expanding the role of HashAgg, which as Rod
says can be used to avoid the sort, so the overlap between those ideas
was low anyway.

On Tue, 2006-01-17 at 09:52 -0500, Tom Lane wrote:
 I was thinking along the lines of having multiple temp files per hash
 bucket.  If you have a tuple that needs to migrate from bucket M to
 bucket N, you know that it arrived before every tuple that was
 assigned
 to bucket N originally, so put such tuples into a separate temp file
 and process them before the main bucket-N temp file.  This might get a
 little tricky to manage after multiple hash resizings, but in
 principle
 it seems doable.

OK, so we do need to do this when we have a defined arrival order: this
idea is the best one so far. I don't see any optimization of this by
ignoring the arrival order, so it seems best to preserve the ordering
this way in all cases.

You can manage that with file naming. Rows moved from batch N to batch M
would be renamed N.M, so you'd be able to use file ordering to retrieve
all files for *.M
That scheme would work for multiple splits too, so that filenames could
grow yet retain their sort order and final target batch properties.
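
A standalone sketch of how that ordering falls out, using a made-up set of
file names (this is not the hashjoin temp-file code; it only shows that
sorting the names puts the moved-in files ahead of each batch's own file):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* the final component of a name like "1.5" or "5" is the target batch */
static int target_batch(const char *name)
{
    const char *dot = strrchr(name, '.');

    return atoi(dot ? dot + 1 : name);
}

/* moved-in files ("N.M") sort before the batch's own file ("M") */
static int cmp_names(const void *a, const void *b)
{
    const char *na = *(const char *const *) a;
    const char *nb = *(const char *const *) b;
    int         own_a = (strchr(na, '.') == NULL);
    int         own_b = (strchr(nb, '.') == NULL);

    if (own_a != own_b)
        return own_a - own_b;
    return strcmp(na, nb);
}

int main(void)
{
    const char *files[] = {"5", "1.5", "7", "3.7", "1.5.13", "13"};
    int         nfiles = sizeof(files) / sizeof(files[0]);
    int         targets[] = {5, 13};

    qsort(files, nfiles, sizeof(files[0]), cmp_names);

    for (int t = 0; t < 2; t++)
    {
        printf("files read for batch %d, in order:", targets[t]);
        for (int i = 0; i < nfiles; i++)
            if (target_batch(files[i]) == targets[t])
                printf("  %s", files[i]);
        printf("\n");
    }
    return 0;
}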

Best Regards, Simon Riggs




Re: [HACKERS] Large Scale Aggregation (HashAgg Enhancement)

2006-01-17 Thread Simon Riggs
On Tue, 2006-01-17 at 21:43 +, Simon Riggs wrote:
 OK. My interest was in expanding the role of HashAgg, which as Rod
 says can be used to avoid the sort, so the overlap between those ideas
 was low anyway.

Am I right in thinking that HashAgg would almost always be quicker than
SortAgg, even for large (> memory) aggregation sets? (Except where the
prior ordering has already been forced via an ORDER BY).

If that is so, then I will probably look to work on this sooner,
especially since we seem to have a clear design.

I'd originally viewed the spill-to-disk logic as a safety measure rather
than as a performance feature.

Best Regards, Simon Riggs




Re: [HACKERS] Large Scale Aggregation (HashAgg Enhancement)

2006-01-16 Thread Simon Riggs
On Mon, 2006-01-16 at 00:07 -0500, Rod Taylor wrote:
 A couple of days ago I found myself wanting to aggregate 3 Billion
 tuples down to 100 Million tuples based on an integer key with six
 integer values -- six sum()'s.
 
 PostgreSQL ran out of memory with its Hash Aggregator and doing an old
style Sort & Sum took a fair amount of time to complete (cancelled the
 process after 24 hours -- small machine).

 Spilling to disk would be nice but I suspect the obvious method would
 thrash quite badly with non-sorted input.

There is already hash table overflow (spill to disk) logic in HashJoins,
so this should be possible by reusing that code for HashAggs. That's on
my todo list, but I'd welcome any assistance.

A question: Are the rows in your 3 B row table clumped together based
upon the 100M row key? (or *mostly* so) We might also be able to
pre-aggregate the rows using a plan like
HashAgg
  SortedAgg
or
SortedAgg
  Sort
    SortedAgg

The first SortedAgg seems superfluous, but would reduce the row volume
considerably if incoming rows were frequently naturally adjacent, even
if the values were not actually sorted. (This could also be done during
sorting, but it's much easier to slot the extra executor step into the
plan). That might then reduce the size of the later sort, or allow it to
become a HashAgg.
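
A minimal sketch of that pre-aggregation step on made-up data; note that the
same key can still come out more than once, which is exactly why the later
Sort or HashAgg step is still needed:

#include <stdio.h>

typedef struct { int key; long val; } Row;

int main(void)
{
    /* rows that are clumped, but not sorted, by key */
    Row in[] = {{7, 1}, {7, 2}, {7, 5}, {3, 4}, {3, 6}, {9, 1}, {7, 8}};
    int n = sizeof(in) / sizeof(in[0]);
    int emitted = 0;
    Row cur = in[0];

    for (int i = 1; i <= n; i++)
    {
        if (i < n && in[i].key == cur.key)
            cur.val += in[i].val;           /* extend the current run */
        else
        {
            printf("partial: key=%d sum=%ld\n", cur.key, cur.val);
            emitted++;
            if (i < n)
                cur = in[i];                /* start a new run */
        }
    }
    printf("%d input rows reduced to %d partial rows\n", n, emitted);
    return 0;
}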

I could make that manually enabled using enable_pre_agg to allow us to
measure the effectiveness of that technique and decide what cost model
we'd use to make it automatic. Would that help?

 I've written something similar using a client and COPY with temporary
 tables. Even with the Export/Import copy I still beat the Sort & Sum
 method PostgreSQL falls back to.

You can get round this now by chopping the larger table into pieces with
a WHERE clause and then putting them back together with a UNION. If the
table is partitioned, then do this by partitions.

This should also help when it comes to recalculating the sums again in
the future, since you'll only need to rescan the rows that have been
added since the last summation.

Best Regards, Simon Riggs




Re: [HACKERS] Large Scale Aggregation (HashAgg Enhancement)

2006-01-16 Thread Rod Taylor
On Mon, 2006-01-16 at 08:32 +, Simon Riggs wrote:
 On Mon, 2006-01-16 at 00:07 -0500, Rod Taylor wrote:
  A couple of days ago I found myself wanting to aggregate 3 Billion
  tuples down to 100 Million tuples based on an integer key with six
  integer values -- six sum()'s.
  
  PostgreSQL ran out of memory with its Hash Aggregator and doing an old
  style Sort & Sum took a fair amount of time to complete (cancelled the
  process after 24 hours -- small machine).
 
  Spilling to disk would be nice but I suspect the obvious method would
  thrash quite badly with non-sorted input.
 
 There is already hash table overflow (spill to disk) logic in HashJoins,
 so this should be possible by reusing that code for HashAggs. That's on
 my todo list, but I'd welcome any assistance.

 A question: Are the rows in your 3 B row table clumped together based
 upon the 100M row key? (or *mostly* so) We might also be able to

They are randomly distributed. Fully sorting the data is quite painful.

 pre-aggregate the rows using a plan like
   HashAgg
   SortedAgg
 or
   SortedAgg
   Sort
   SortedAgg
 
 The first SortedAgg seems superfluous, but would reduce the row volume
 considerably if incoming rows were frequently naturally adjacent, even
 if the values were not actually sorted. (This could also be done during
 sorting, but it's much easier to slot the extra executor step into the
 plan). That might then reduce the size of the later sort, or allow it to
 become a HashAgg.
 
 I could make that manually enabled using enable_pre_agg to allow us to
 measure the effectiveness of that technique and decide what cost model
 we'd use to make it automatic. Would that help?

I don't understand how this helps. The problem isn't the 3B data source
rows but rather the 100M destination keys that are being aggregated
against.

The memory constraints of HashAgg are a result of the large number of
target keys and should be the same if it was 100M rows or 10B rows.

I think I need something closer to:

HashAgg
- HashSort (to disk)

HashSort would create a number of files on disk with similar data,
grouping all similar keys into a single temporary file which HashAgg can
deal with individually (100 loops by 1M target keys instead of 1 loop by
100M target keys). The results would be the same as partitioning by
keyblock and running a HashAgg on each partition, but it would be
handled by the Executor rather than by client side code.
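
Something along these lines, as a standalone sketch with arrays standing in
for the temp files, a simple modulo in place of a real hash, and made-up
data; the point is that only about NKEYS/NPART groups have to be in memory
at any one time:

#include <stdio.h>
#include <stdlib.h>

#define NPART 4             /* number of spill partitions ("temp files") */
#define NKEYS 100           /* distinct target keys in the made-up input */
#define NROWS 1000          /* input rows */

typedef struct { int key; long val; } Row;

int main(void)
{
    static Row  part[NPART][NROWS];         /* the partition "files" */
    int         partlen[NPART] = {0};
    static long sums[NKEYS];                /* per-partition work area */
    long        grand_total = 0;

    /* pass 1: scan the input once, writing each row to its partition */
    for (int i = 0; i < NROWS; i++)
    {
        Row r = { rand() % NKEYS, 1 };
        int p = r.key % NPART;              /* stand-in for hash(key) % NPART */

        part[p][partlen[p]++] = r;
    }

    /* pass 2: aggregate one partition at a time */
    for (int p = 0; p < NPART; p++)
    {
        int groups = 0;

        for (int k = 0; k < NKEYS; k++)
            sums[k] = 0;
        for (int i = 0; i < partlen[p]; i++)
            sums[part[p][i].key] += part[p][i].val;
        for (int k = 0; k < NKEYS; k++)
            if (sums[k] != 0)
            {
                groups++;
                grand_total += sums[k];
            }
        printf("partition %d: %d rows, %d groups\n", p, partlen[p], groups);
    }
    printf("total of all sums = %ld (should be %d)\n", grand_total, NROWS);
    return 0;
}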

  I've written something similar using a client and COPY with temporary
   tables. Even with the Export/Import copy I still beat the Sort & Sum
  method PostgreSQL falls back to.
 
 You can get round this now by chopping the larger table into pieces with
 a WHERE clause and then putting them back together with a UNION. If the
 table is partitioned, then do this by partitions.

True, except this results in several sequential scans over the source
data. I can extract and sort in a single pass on the client side but it
would be far better if I could get PostgreSQL to do the same. I could
probably write a plpgsql function to do that logic but it would be quite
messy.

 This should also help when it comes to recalculating the sums again in
 the future, since you'll only need to rescan the rows that have been
 added since the last summation.

We store the aggregated results and never do this type of calculation on
that dataset again. The original dataset comes from about 300 partitions
(time and source) and they are removed upon completion. While this
calculation is being performed additional partitions are added.

I suppose I could store source data in 300 * 1000 partitions (Approx 300
batches times 1000 segments) but that would probably run into other
problems. PostgreSQL probably has issues with that many tables.

-- 




Re: [HACKERS] Large Scale Aggregation (HashAgg Enhancement)

2006-01-16 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 On Mon, 2006-01-16 at 00:07 -0500, Rod Taylor wrote:
 A couple of days ago I found myself wanting to aggregate 3 Billion
 tuples down to 100 Million tuples based on an integer key with six
 integer values -- six sum()'s.

 There is already hash table overflow (spill to disk) logic in HashJoins,
 so this should be possible by reusing that code for HashAggs. That's on
 my todo list, but I'd welcome any assistance.

Yeah, I proposed something similar awhile back in conjunction with
fixing the spill logic for hash joins (which was always there, but was
not adaptive until recently).  I got the join part done but got
distracted before fixing HashAgg :-(

In principle, you just reduce the range of currently-in-memory hash
codes whenever you run low on memory.  The so-far-accumulated working
state for aggregates that are not in the range anymore goes into a temp
file, and subsequently any incoming tuples that hash outside the range
go into another temp file.  After you've completed the scan, you
finalize and emit the aggregates that are still in memory, then you pick
up the first set of dropped aggregates, rescan the associated TODO
file of unprocessed tuples, lather rinse repeat till done.
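
A much-simplified standalone sketch of that loop, for a plain sum() over
made-up integer keys: real code would keep the dumped states and deferred
tuples in temp files, handle arbitrary transition states, and deal with the
arrival-order question below, but for sum() a dumped partial can simply be
re-added on a later pass.

#include <stdio.h>
#include <stdlib.h>

#define NKEYS    64         /* distinct keys in the made-up input */
#define NROWS    2000       /* input rows */
#define MAXGROUP 16         /* stand-in for work_mem: max groups in memory */

typedef struct { int key; long val; } Row;

static unsigned hash_key(int key) { return (unsigned) key * 2654435761u; }

int main(void)
{
    static Row input[NROWS], todo[NROWS];
    long   sums[NKEYS];
    int    ninput = NROWS, pass = 0;
    long   total = 0;

    for (int i = 0; i < NROWS; i++)
        input[i] = (Row) { rand() % NKEYS, 1 };

    while (ninput > 0)
    {
        unsigned range = ~0u;   /* accept every hash code to begin with */
        int      ntodo = 0, ngroups = 0;

        for (int k = 0; k < NKEYS; k++)
            sums[k] = 0;

        for (int i = 0; i < ninput; i++)
        {
            Row r = input[i];

            if (hash_key(r.key) > range)
            {
                todo[ntodo++] = r;  /* outside the range: defer to next pass */
                continue;
            }
            if (sums[r.key] == 0)
                ngroups++;          /* first tuple of a new group */
            sums[r.key] += r.val;

            if (ngroups > MAXGROUP) /* "memory" exhausted: narrow the range */
            {
                unsigned lowest = ~0u;

                for (int k = 0; k < NKEYS; k++)
                    if (sums[k] != 0 && hash_key(k) < lowest)
                        lowest = hash_key(k);
                /* never drop below the lowest in-memory hash code, so at
                 * least one group always survives and every pass finishes
                 * some work */
                range = range / 2 > lowest ? range / 2 : lowest;

                /* dump the accumulated state of every group that fell out
                 * of the range */
                for (int k = 0; k < NKEYS; k++)
                    if (sums[k] != 0 && hash_key(k) > range)
                    {
                        todo[ntodo++] = (Row) { k, sums[k] };
                        sums[k] = 0;
                        ngroups--;
                    }
            }
        }

        for (int k = 0; k < NKEYS; k++)
            total += sums[k];       /* finalize what stayed in memory */

        printf("pass %d: %d rows in, %d deferred\n", pass++, ninput, ntodo);
        for (int i = 0; i < ntodo; i++)
            input[i] = todo[i];
        ninput = ntodo;
    }
    printf("grand total = %ld (expected %d)\n", total, NROWS);
    return 0;
}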

The tricky part is to preserve the existing guarantee that tuples are
merged into their aggregate in arrival order.  (This does not matter for
the standard aggregates but it definitely does for custom aggregates,
and there will be unhappy villagers appearing on our doorsteps if we
break it.)  I think this can work correctly under the above sketch but
it needs to be verified.  It might require different handling of the
TODO files than what hashjoin does.

regards, tom lane



Re: [HACKERS] Large Scale Aggregation (HashAgg Enhancement)

2006-01-16 Thread Simon Riggs
On Mon, 2006-01-16 at 09:42 -0500, Rod Taylor wrote:
 On Mon, 2006-01-16 at 08:32 +, Simon Riggs wrote:
  On Mon, 2006-01-16 at 00:07 -0500, Rod Taylor wrote:
   
  A question: Are the rows in your 3 B row table clumped together based
  upon the 100M row key? (or *mostly* so) We might also be able to
 
 They are randomly distributed. Fully sorting the data is quite painful.

...

 I don't understand how this helps. 

It wouldn't since your rows are randomly distributed. The idea was not
related to improving HashAgg, but to improving Aggregation for the case
of naturally grouped data.

 I think I need something closer to:
 
 HashAgg
   - HashSort (to disk)
 
 HashSort would create a number of files on disk with similar data,
 grouping all similar keys into a single temporary file which HashAgg can
 deal with individually (100 loops by 1M target keys instead of 1 loop by
 100M target keys). The results would be the same as partitioning by
 keyblock and running a HashAgg on each partition, but it would be
 handled by the Executor rather than by client side code.
 
   I've written something similar using a client and COPY with temporary
   tables. Even with the Export/Import copy I still beat the Sort & Sum
   method PostgreSQL falls back to.

That is exactly how the spill to disk logic works for HashJoin (and
incidentally, identical to an Oracle one-pass hash join since both are
based upon the hybrid hash join algorithm). 

Multi-pass would only be required to handle very skewed hash
distributions, which HJ doesn't do yet.

So yes, this can be done. 

Best Regards, Simon Riggs




Re: [HACKERS] Large Scale Aggregation (HashAgg Enhancement)

2006-01-16 Thread Simon Riggs
On Mon, 2006-01-16 at 12:36 -0500, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  On Mon, 2006-01-16 at 00:07 -0500, Rod Taylor wrote:
  A couple of days ago I found myself wanting to aggregate 3 Billion
  tuples down to 100 Million tuples based on an integer key with six
  integer values -- six sum()'s.
 
  There is already hash table overflow (spill to disk) logic in HashJoins,
  so this should be possible by reusing that code for HashAggs. That's on
  my todo list, but I'd welcome any assistance.
 
 Yeah, I proposed something similar awhile back in conjunction with
 fixing the spill logic for hash joins (which was always there, but was
 not adaptive until recently).  I got the join part done but got
 distracted before fixing HashAgg :-(

You've done the main work. :-)

 The tricky part is to preserve the existing guarantee that tuples are
 merged into their aggregate in arrival order.  (This does not matter for
 the standard aggregates but it definitely does for custom aggregates,
 and there will be unhappy villagers appearing on our doorsteps if we
 break it.)  I think this can work correctly under the above sketch but
 it needs to be verified.  It might require different handling of the
 TODO files than what hashjoin does.

For HJ we write each outer tuple to its own file-per-batch in the order
they arrive. Reading them back in preserves the original ordering. So
yes, caution required, but I see no difficulty, just reworking the HJ
code (nodeHashjoin and nodeHash). What else do you see?

Best Regards, Simon Riggs




Re: [HACKERS] Large Scale Aggregation (HashAgg Enhancement)

2006-01-16 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 For HJ we write each outer tuple to its own file-per-batch in the order
 they arrive. Reading them back in preserves the original ordering. So
 yes, caution required, but I see no difficulty, just reworking the HJ
 code (nodeHashjoin and nodeHash). What else do you see?

With dynamic adjustment of the hash partitioning, some tuples will go
through multiple temp files before they ultimately get eaten, and
different tuples destined for the same aggregate may take different
paths through the temp files depending on when they arrive.  It's not
immediately obvious that ordering is preserved when that happens.
I think it can be made to work but it may take different management of
the temp files than hashjoin uses.  (Worst case, we could use just a
single temp file for all unprocessed tuples, but this would result in
extra I/O.)

regards, tom lane



Re: [HACKERS] Large Scale Aggregation (HashAgg Enhancement)

2006-01-16 Thread Simon Riggs
On Mon, 2006-01-16 at 14:43 -0500, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  For HJ we write each outer tuple to its own file-per-batch in the order
  they arrive. Reading them back in preserves the original ordering. So
  yes, caution required, but I see no difficulty, just reworking the HJ
  code (nodeHashjoin and nodeHash). What else do you see?
 
 With dynamic adjustment of the hash partitioning, some tuples will go
 through multiple temp files before they ultimately get eaten, and
 different tuples destined for the same aggregate may take different
 paths through the temp files depending on when they arrive.  It's not
 immediately obvious that ordering is preserved when that happens.
 I think it can be made to work but it may take different management of
 the temp files than hashjoin uses.  (Worst case, we could use just a
 single temp file for all unprocessed tuples, but this would result in
 extra I/O.)

Sure, the hash table is dynamic, but we read all inner rows to create the
hash table (nodeHash) before we get the outer rows (nodeHJ).
Why would we continue to dynamically build the hash table after the
start of the outer scan? (I see that we do this, as you say, but I am
surprised).

Best Regards, Simon Riggs








Re: [HACKERS] Large Scale Aggregation (HashAgg Enhancement)

2006-01-16 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 Sure, the hash table is dynamic, but we read all inner rows to create the
 hash table (nodeHash) before we get the outer rows (nodeHJ).

But our idea of the number of batches needed can change during that
process, resulting in some inner tuples being initially assigned to the
wrong temp file.  This would also be true for hashagg.

 Why would we continue to dynamically build the hash table after the
 start of the outer scan?

The number of tuples written to a temp file might exceed what we want to
hold in memory; we won't detect this until the batch is read back in,
and in that case we have to split the batch at that time.  For hashagg
this point would apply to the aggregate states not the input tuples, but
it's still a live problem (especially if the aggregate states aren't
fixed-size values ... consider a concat aggregate for instance).

regards, tom lane



Re: [HACKERS] Large Scale Aggregation (HashAgg Enhancement)

2006-01-16 Thread Greg Stark

Tom Lane [EMAIL PROTECTED] writes:

  Why would we continue to dynamically build the hash table after the
  start of the outer scan?
 
 The number of tuples written to a temp file might exceed what we want to
 hold in memory; we won't detect this until the batch is read back in,
 and in that case we have to split the batch at that time.  For hashagg
 this point would apply to the aggregate states not the input tuples, but
 it's still a live problem (especially if the aggregate states aren't
 fixed-size values ... consider a concat aggregate for instance).

For a hash aggregate would it be possible to rescan the original table instead
of spilling to temporary files? Then when you run out of working memory you
simply throw out half the hash table and ignore subsequent tuples that fall in
those hash buckets. Then you rescan for the discarded hash bucket regions.
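
A standalone sketch of that idea for a sum() over made-up data; it divides
the hash space into fixed slices up front rather than discarding buckets on
the fly, but the I/O pattern is the point: several reads of the input and
no writes at all.

#include <stdio.h>
#include <stdlib.h>

#define NKEYS   64          /* distinct keys in the made-up input */
#define NROWS   2000        /* input rows */
#define NSLICES 4           /* pretend only 1/4 of the groups fit in memory */

typedef struct { int key; long val; } Row;

static unsigned hash_key(int key) { return (unsigned) key * 2654435761u; }

int main(void)
{
    static Row input[NROWS];
    long   sums[NKEYS];
    long   total = 0;

    for (int i = 0; i < NROWS; i++)
        input[i] = (Row) { rand() % NKEYS, 1 };

    for (int slice = 0; slice < NSLICES; slice++)
    {
        int groups = 0;

        for (int k = 0; k < NKEYS; k++)
            sums[k] = 0;

        /* rescan the whole input, keeping only this slice's groups */
        for (int i = 0; i < NROWS; i++)
            if (hash_key(input[i].key) % NSLICES == (unsigned) slice)
                sums[input[i].key] += input[i].val;

        for (int k = 0; k < NKEYS; k++)
            if (sums[k] != 0)
            {
                groups++;
                total += sums[k];
            }
        printf("pass %d: %d groups finalized\n", slice, groups);
    }
    printf("grand total = %ld (expected %d) after %d scans, no writes\n",
           total, NROWS, NSLICES);
    return 0;
}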

This avoids having to do any disk writes at the possible expense of additional
reads. I think in terms of i/o it would be much faster in most cases.

The downsides are: a) volatile aggregates or aggregates with side-effects
would be confused by being executed twice. I'm not clear that volatile
aggregate functions make any sense anyways though. b) I'm unclear whether
rescanning the table could potentially find tuples in a different state than
previous scans. If so then the idea doesn't work at all. But I don't think
that's possible is it?

The main problem is c) it may lose in terms of i/o for cases where the
cardinality is low (ie, it's overflowing despite having low cardinality
because the table is really really big too). But most cases will be spilling
because the cardinality is high. So the table may be big but the spill files
are nearly as big anyways and having to write and then read them means double
the i/o.

The upside of not having to write out temporary files is big. I find queries
that require temporary sort files get hit with a *huge* performance penalty.
Often an order of magnitude. Part of that could probably be mitigated by
having the sort files on a separate spindle but I think it's always going to
hurt especially if there are multiple operations spilling to disk
simultaneously.

-- 
greg




Re: [HACKERS] Large Scale Aggregation (HashAgg Enhancement)

2006-01-16 Thread Tom Lane
Greg Stark [EMAIL PROTECTED] writes:
 For a hash aggregate would it be possible to rescan the original table
 instead of spilling to temporary files?

Sure, but the possible performance gain is finite and the possible
performance loss is not.  The original table could be an extremely
expensive join.  We'd like to think that the planner gets the input size
estimate approximately right and so the amount of extra I/O caused by
hash table resizing should normally be minimal.  The cases where it is
not right are *especially* not likely to be a trivial table scan as you
are supposing.

regards, tom lane



Re: [HACKERS] Large Scale Aggregation (HashAgg Enhancement)

2006-01-16 Thread Simon Riggs
On Mon, 2006-01-16 at 20:02 -0500, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  Sure, the hash table is dynamic, but we read all inner rows to create the
  hash table (nodeHash) before we get the outer rows (nodeHJ).
 
 But our idea of the number of batches needed can change during that
 process, resulting in some inner tuples being initially assigned to the
 wrong temp file.  This would also be true for hashagg.

So we correct that before we start reading the outer table.

  Why would we continue to dynamically build the hash table after the
  start of the outer scan?
 
 The number of tuples written to a temp file might exceed what we want to
 hold in memory; we won't detect this until the batch is read back in,
 and in that case we have to split the batch at that time.  For hashagg
 this point would apply to the aggregate states not the input tuples, but
 it's still a live problem (especially if the aggregate states aren't
 fixed-size values ... consider a concat aggregate for instance).

OK, I see what you mean. Sounds like we should have a new definition for
Aggregates, Sort Insensitive that allows them to work when the input
ordering does not affect the result, since that case can be optimised
much better when using HashAgg. Since we know that applies to the common
cases of SUM, AVG etc. this will certainly help people.

For sort-sensitive aggregates sounds like we either:
1. Write to a single file, while we remember the start offset of the
first row of each batch.
2. Write to multiple files, adding a globally incrementing sequenceid.
Batches are then resorted on the sequenceid before processing.
3. We give up, delete the existing batches and restart the scan from the
beginning of the outer table.

Sounds like (1) is best, since the overflow just becomes a SortedAgg.
But all of them sound ugly.
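
For what it's worth, a standalone sketch of option (1) with made-up data: a
single overflow "file" (an array here) holds every spilled tuple in arrival
order, tagged with its target batch, so reading it front to back for each
batch preserves the ordering, at the price of scanning the whole file once
per batch (the remembered start offsets would let real code skip some of
that scanning).

#include <stdio.h>

typedef struct { int batch; int key; long val; } SpilledTuple;

int main(void)
{
    /* tuples in the order they were spilled */
    SpilledTuple file[] = {
        {1, 10, 1}, {2, 20, 1}, {1, 11, 1}, {1, 10, 1}, {2, 21, 1}, {2, 20, 1},
    };
    int nfile = sizeof(file) / sizeof(file[0]);

    for (int batch = 1; batch <= 2; batch++)
    {
        printf("batch %d, in arrival order:", batch);
        for (int i = 0; i < nfile; i++)
            if (file[i].batch == batch)
                printf("  (key %d, val %ld)", file[i].key, file[i].val);
        printf("\n");
    }
    return 0;
}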

Best Regards, Simon Riggs




[HACKERS] Large Scale Aggregation (HashAgg Enhancement)

2006-01-15 Thread Rod Taylor
A couple of days ago I found myself wanting to aggregate 3 Billion
tuples down to 100 Million tuples based on an integer key with six
integer values -- six sum()'s.

PostgreSQL ran out of memory with its Hash Aggregator and doing an old
style Sort & Sum took a fair amount of time to complete (cancelled the
process after 24 hours -- small machine).


Spilling to disk would be nice but I suspect the obvious method would
thrash quite badly with non-sorted input.


One solution is to partially sort the data into various buckets. If we
know how many keys can fit into sort_mem and what the upper and lower
bounds of our keys are, then roughly (total keys / keys that fit in
sort_mem) temporary files can be created. A sequential scan of the source
data would sort each tuple
into the appropriate temporary file.  From there we can loop through a
temporary file, HashAgg the contents, present the results, and move to
the next temporary file.

For my particular problem the lower bound is 1 and the upper bound is
about 100M. The sort_mem setting allows HashAgg to handle 1M keys at a
time.  The first pass through the 3B tuples would create 100 temporary
files on disk. Temp file 1 would get keys 1 through 1M, temp file 2 gets keys
1M + 1 through 2M, etc. From there it is pretty easy.
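
The file arithmetic for those numbers, as a tiny standalone sketch (the key
bounds and per-file capacity are the ones from the paragraph above; the
sample keys are made up):

#include <stdio.h>

int main(void)
{
    long min_key = 1, max_key = 100000000, keys_per_file = 1000000;
    long nfiles = (max_key - min_key) / keys_per_file + 1;
    long sample[] = {1, 999999, 1000000, 1000001, 54321987, 100000000};

    printf("%ld temporary files needed\n", nfiles);
    for (int i = 0; i < 6; i++)
        printf("key %ld -> temp file %ld\n",
               sample[i], (sample[i] - min_key) / keys_per_file + 1);
    return 0;
}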

This would allow for a 1000 fold increase in the number of distinct keys
PostgreSQL can simultaneously HashAgg in the default configuration at a
reasonable speed.


I've written something similar using a client and COPY with temporary
tables. Even with the Export/Import copy I still beat the Sort & Sum
method PostgreSQL falls back to.


--

