Re: grouping pushdown

2023-01-04 Thread Antonin Houska
David Rowley  wrote:

> On Wed, 4 Jan 2023 at 23:21, Spring Zhong  wrote:
> > The plan is apparently inefficient, since the hash aggregate goes after the
> > Cartesian product. We could expect the query's performance to improve greatly
> > if the HashAggregate node could be pushed down to the SCAN node.
> 
> > Does anyone have suggestions on this?
> 
> I think this is being worked on. See [1].

Well, the current version of that patch requires the query to contain at least
one aggregate. It shouldn't be a big deal to modify it. However, note that this
feature pushes the aggregate/grouping only to one side of the join (a "fake"
count(*) aggregate added to the query):

SET enable_agg_pushdown TO on;

EXPLAIN select i1,i2, count(*) from t1, t2 group by i1,i2;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=440.02..440.04 rows=1 width=16)
   Group Key: t1.i1, t2.i2
   ->  Sort  (cost=440.02..440.02 rows=1 width=16)
         Sort Key: t1.i1, t2.i2
         ->  Nested Loop  (cost=195.00..440.01 rows=1 width=16)
               ->  Partial HashAggregate  (cost=195.00..195.01 rows=1 width=12)
                     Group Key: t1.i1
                     ->  Seq Scan on t1  (cost=0.00..145.00 rows=1 width=4)
               ->  Seq Scan on t2  (cost=0.00..145.00 rows=1 width=4)

If both sides should be grouped, finalization of the partial aggregates would
be more difficult, and I'm not sure it'd be worth the effort.
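To see why one-sided pushdown still produces correct count(*) results, consider
that the partial aggregate emits one row per t1.i1 group carrying a partial
count, the join then pairs it with each t2 row, and the finalize step sums the
partial counts per (i1, i2) group. A toy sketch of that arithmetic in Python
(illustration only, not executor code; the data values are made up):

```python
from collections import Counter

# Toy data: t1 has duplicate i1 values, t2 has duplicate i2 values.
t1 = [1, 1, 1, 2]          # column i1
t2 = [10, 10, 20]          # column i2

# Partial aggregate pushed below the join: one row per i1 value,
# carrying a partial count(*) of the rows it absorbed.
partial = Counter(t1)

# Nested loop join of the partial result with the un-grouped t2 side.
joined = [((i1, i2), pcount) for i1, pcount in partial.items() for i2 in t2]

# Finalize: sum the partial counts within each (i1, i2) group.
final = Counter()
for key, pcount in joined:
    final[key] += pcount

# Reference result: count(*) computed over the full Cartesian product.
reference = Counter((i1, i2) for i1 in t1 for i2 in t2)

assert final == reference
```

The same trick does not extend cheaply to grouping both sides, because the
finalize step would then have to combine two independent partial states per
group, which is the extra effort mentioned above.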

> [1] https://commitfest.postgresql.org/41/3764/

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: grouping pushdown

2023-01-04 Thread David Rowley
On Wed, 4 Jan 2023 at 23:21, Spring Zhong  wrote:
> The plan is apparently inefficient, since the hash aggregate goes after the
> Cartesian product. We could expect the query's performance to improve greatly
> if the HashAggregate node could be pushed down to the SCAN node.

> Does anyone have suggestions on this?

I think this is being worked on. See [1].

David

[1] https://commitfest.postgresql.org/41/3764/




grouping pushdown

2023-01-04 Thread Spring Zhong
Hi hackers,

I came across the problem of how to improve the performance of queries with a
GROUP BY clause when the grouping columns contain many duplicate values. For
example:

create table t1(i1) as select 1 from generate_series(1,1);
create table t2(i2) as select 2 from generate_series(1,1);

select i1,i2 from t1, t2 group by i1,i2;
 i1 | i2
----+----
  1 |  2

                 QUERY PLAN
---------------------------------------------
 HashAggregate
   Group Key: t1.i1, t2.i2
   Batches: 1  Memory Usage: 24kB
   ->  Nested Loop
         ->  Seq Scan on t1
         ->  Materialize
               ->  Seq Scan on t2
 Planning Time: 0.067 ms
 Execution Time: 15864.585 ms


The plan is apparently inefficient, since the hash aggregate goes after the
Cartesian product. We could expect the query's performance to improve greatly if
the HashAggregate node could be pushed down to the SCAN node. For example, the
plan may look like:

          expected QUERY PLAN
---------------------------------------
 Group
   Group Key: t1.i1, t2.i2
   ->  Sort
         Sort Key: t1.i1, t2.i2
         ->  Nested Loop
               ->  HashAggregate
                     Group Key: t1.i1
                     ->  Seq Scan on t1
               ->  HashAggregate
                     Group Key: t2.i2
                     ->  Seq Scan on t2
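The transformation sketched above relies on the fact that, with no aggregate
functions present, GROUP BY over a cross join only needs the distinct values of
each side: de-duplicating each input before the join cannot change the set of
output groups, but it can shrink the join input dramatically. A toy Python
sketch of that equivalence (illustration only, with made-up data mirroring the
heavy duplication in the example):

```python
from itertools import product

t1 = [1] * 5                 # column i1: every row is the same value
t2 = [2] * 5                 # column i2: likewise

# Naive plan: build the full Cartesian product first, then group.
naive = set(product(t1, t2))             # 25 join rows collapse to 1 group

# Pushed-down plan: de-duplicate each side before joining,
# so the join processes 1 x 1 rows instead of 5 x 5.
pushed = set(product(set(t1), set(t2)))

assert naive == pushed
```

Both plans produce the single group (1, 2); the pushed-down variant simply does
far less join work.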

Moreover, queries with expressions as GROUP BY columns may also take advantage 
of this feature, e.g.

select i1+i2 from t1, t2 group by i1+i2;
 ?column?
----------
        3

         expected QUERY PLAN
--------------------------------------
 Group
   Group Key: ((t1.i1 + t2.i2))
   ->  Sort
         Sort Key: ((t1.i1 + t2.i2))
         ->  Nested Loop
               ->  HashAggregate
                     Group Key: t1.i1
                     ->  Seq Scan on t1
               ->  HashAggregate
                     Group Key: t2.i2
                     ->  Seq Scan on t2

Does anyone have suggestions on this?