Recently, I've been researching and putting together a proposal for window
functions. I haven't finished it yet, but when I do, I will post it. A nice
list of examples can be found here[1].
Rather than spend a lot of time talking about the problems window
functions present to the planner and executor, I'd like to bring up the
topic of an existing piece of SQL which is well understood and presents a
similar problem. Consider the following query:
select saledate, count(distinct prodid), count(distinct sellerid),
       sum(price)
from sales
group by 1;
The point here is that aggregates usually just receive the input that the
lower level of the plan generates. When an aggregate is qualified with
distinct, however, that changes. Notice that the counts on prodid and
sellerid receive de-duplicated input while the sum on price does not.
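A simplified sketch of what this means per group (illustration only, not
PostgreSQL's actual aggregation code): the distinct aggregates each
de-duplicate their own input, while sum(price) consumes every row as-is.

```python
# Simplified sketch (not PostgreSQL's actual code) of how one group's rows
# feed the three aggregates in the example query: the DISTINCT aggregates
# de-duplicate their own input, while sum(price) sees every row.
def aggregate_group(rows):
    """rows: (prodid, sellerid, price) tuples for one saledate group."""
    distinct_prodids = set()    # count(distinct prodid): unique values only
    distinct_sellerids = set()  # count(distinct sellerid): likewise
    total_price = 0.0           # sum(price): the raw, duplicated input
    for prodid, sellerid, price in rows:
        distinct_prodids.add(prodid)
        distinct_sellerids.add(sellerid)
        total_price += price
    return len(distinct_prodids), len(distinct_sellerids), total_price

# Two sales of the same product by the same seller, plus one other sale:
print(aggregate_group([(1, 10, 5.0), (1, 10, 7.0), (2, 11, 3.0)]))
# -> (2, 2, 15.0)
```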
We do not create separate plan nodes to achieve this. Rather, it is a
property of the aggregation code inside the executor[2].
It seems to me that by creating actual plan nodes for this distinct step
we can improve the range of options for executing these types of queries.
For example, consider a more complex query than the one above that
groups over a join using a join key of (saledate, prodid), and that the
planner implements with a merge join. The sort order is then preserved,
so count(distinct prodid) will receive sorted input. As far as I can
tell, however, the executor doesn't know this, even though the planner
does. That is, the sort step inside the aggregate code is redundant.
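To illustrate why that inner sort is redundant: when the input already
arrives ordered on the distinct key, a distinct count only needs to
compare each value with its predecessor. A hypothetical sketch:

```python
# Hypothetical sketch: counting distinct prodids for one group when the
# input already arrives sorted on (saledate, prodid), e.g. from a merge
# join. No extra sort is needed; comparing to the previous value suffices.
def count_distinct_presorted(sorted_values):
    """sorted_values: prodids for one group, already in sorted order."""
    count = 0
    prev = object()  # sentinel that compares unequal to any real value
    for v in sorted_values:
        if v != prev:
            count += 1
            prev = v
    return count

print(count_distinct_presorted([1, 1, 2, 2, 2, 5]))  # -> 3
```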
Another way it could be improved is if we ever introduce a 'hash distinct'
execution node. This has been discussed before.
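The idea behind a hash distinct is to de-duplicate without sorting at
all; a minimal sketch (not an actual PostgreSQL node) might look like:

```python
# Sketch of a hash-based distinct step: de-duplicate a stream without
# sorting it. Output order follows arrival order, not sort order.
def hash_distinct(rows):
    seen = set()
    for row in rows:
        if row not in seen:
            seen.add(row)
            yield row

print(list(hash_distinct([3, 1, 3, 2, 1])))  # -> [3, 1, 2]
```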
My interest here is not so much to do with distinct aggregates but,
rather, window functions. Window functions have this same problem as the
input to the functions is generally sorted by different keys.
So, hypothetically, let's say we wanted to create a plan for the above
query which had an explicit Sort -> Unique 'branch' for each of the
distinct aggregates. As it turns out, this is actually difficult to
represent with the existing plan constructs.
Currently, the plan looks something like:
GroupAggregate
^
|
Sort Op
^
|
Scan
What we want to do is have a kind of 'sub plan' for each aggregate. In
effect, the plan might start looking like a directed graph. Here is part
of the plan as a directed graph.
GroupAggregate
/-^---...
| |
| |
^ |
| Unique
| ^
| |
   Sort         Sort
(saledate)  (saledate, prodid)
^ ^
| |
-- Fan Out ...
^
|
Scan
This idea was presented by Brian Hagenbuch at Greenplum. He calls it a
'Fan Out' plan. It is trivial to rejoin the data because all input to
the aggregates is sorted by the same primary key. Also, we could
optimize this to perform a single sort for those parts of the query
which require the same sort order (the grouping key saledate and the
input to sum(price) in the example query). An extra step would be
required for (some) window functions because they won't necessarily
preserve order. That's not important now.
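A toy model of the Fan Out shape, as an illustration rather than real
executor code: the scan output fans out to one branch per aggregate
(here, Python sets stand in for the Sort -> Unique branches), and the
branches rejoin trivially on the shared group key, saledate.

```python
from collections import defaultdict

# Toy model of a Fan Out plan for the example query (illustration only).
# Each aggregate gets its own de-duplicating branch over the scanned rows;
# sets stand in for the Sort(...) -> Unique branches of the diagram.
# Rejoining is trivial because every branch groups on the same key.
def fan_out_aggregate(scan_rows):
    """scan_rows: (saledate, prodid, sellerid, price) tuples, any order."""
    prodids = defaultdict(set)    # branch: Sort(saledate, prodid) -> Unique
    sellerids = defaultdict(set)  # branch: Sort(saledate, sellerid) -> Unique
    totals = defaultdict(float)   # branch: Sort(saledate) only, no Unique
    for saledate, prodid, sellerid, price in scan_rows:
        prodids[saledate].add(prodid)
        sellerids[saledate].add(sellerid)
        totals[saledate] += price
    # Rejoin the branches on the shared group key.
    return {d: (len(prodids[d]), len(sellerids[d]), totals[d])
            for d in sorted(totals)}

rows = [("d1", 1, 10, 5.0), ("d1", 1, 11, 2.0), ("d2", 2, 10, 4.0)]
print(fan_out_aggregate(rows))
# -> {'d1': (1, 2, 7.0), 'd2': (1, 1, 4.0)}
```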
An alternative approach would be a 'pipeline' design. This would fit into
the existing plan structure better. In this approach, each aggregate
would be a distinct step in the plan.
Finalize (like Result)
^
|
Agg (sum(price))
^
|
Sort (saledate)
^
|
Agg (count(sellerid))
^
|
Sort (saledate, sellerid)/Unique
^
|
Agg (count(prodid))
^
|
Sort (saledate, prodid)/Unique
^
|
Scan
Now, this would actually work as follows: the input is received from the
Scan node. We sort the input by the key (saledate, prodid) and produce a
unique result. This is input to the aggregate, which produces a result
for a distinct grouping expression. We then pass the output of the
aggregate for this grouping expression up the tree along with a pointer
to the scan data. We do not discard the