Re: Remove useless GROUP BY columns considering unique index

2023-12-31 Thread jian he
On Fri, Dec 29, 2023 at 11:05 PM Zhang Mingli  wrote:
>
> Hi,
>
> This idea first came from remove_useless_groupby_columns does not need to 
> record constraint dependencie[0] which points out that
> unique index whose columns all have NOT NULL constraints  could also take the 
> work with primary key when removing useless GROUP BY columns.
> I study it and implement the idea.
>
> [0]https://www.postgresql.org/message-id/flat/CAApHDvrdYa%3DVhOoMe4ZZjZ-G4ALnD-xuAeUNCRTL%2BPYMVN8OnQ%40mail.gmail.com
>

cosmetic issues:
+--
+-- Test removal of redundant GROUP BY columns using unique not null index.
+-- materialized view
+--
+create temp table t1 (a int, b int, c int, primary key (a, b), unique(c));
+create temp table t2 (a int, b int, c int not null, primary key (a,
b), unique(c));
+create temp table t3 (a int, b int, c int not null, d int not null,
primary key (a, b), unique(c, d));
+create temp table t4 (a int, b int not null, c int not null, d int
not null, primary key (a, b), unique(b, c), unique(d));
+create temp table t5 (a int not null, b int not null, c int not null,
d int not null, unique (a, b), unique(b, c), unique(a, c, d));
+
+-- Test unique index without not null constraint should not be used.
+explain (costs off) select * from t1 group by a,b,c;
+
+-- Test unique not null index beats primary key.
+explain (costs off) select * from t2 group by a,b,c;
+
+-- Test primary key beats unique not null index.
+explain (costs off) select * from t3 group by a,b,c,d;
+
+-- Test unique not null index with less columns wins.
+explain (costs off) select * from t4 group by a,b,c,d;
+
+-- Test unique not null indices have overlap.
+explain (costs off) select * from t5 group by a,b,c,d;
+
+drop table t1;
+drop table t2;
+drop table t3;
+drop table t4;
+

line `materailzed view` is unnecessary?
you forgot to drop table t5.

create temp table p_t2 (
  a int not null,
  b int not null,
  c int not null,
  d int,
  unique(a), unique(a,b),unique(a,b,c)
) partition by list(a);
create temp table p_t2_1 partition of p_t2 for values in(1);
create temp table p_t2_2 partition of p_t2 for values in(2);
explain (costs off, verbose) select * from p_t2 group by a,b,c,d;
your patch output:
  QUERY PLAN
--
 HashAggregate
   Output: p_t2.a, p_t2.b, p_t2.c, p_t2.d
   Group Key: p_t2.a
   ->  Append
 ->  Seq Scan on pg_temp.p_t2_1
   Output: p_t2_1.a, p_t2_1.b, p_t2_1.c, p_t2_1.d
 ->  Seq Scan on pg_temp.p_t2_2
   Output: p_t2_2.a, p_t2_2.b, p_t2_2.c, p_t2_2.d
(8 rows)

so it seems to work with partitioned tables, maybe you should add some
test cases with partition tables.


- * XXX This is only used to create derived tables, so NO INHERIT constraints
- * are always skipped.
+ * XXX When used to create derived tables, set skip_no_inherit tp true,
+ * so that NO INHERIT constraints will be skipped.
  */
 List *
-RelationGetNotNullConstraints(Oid relid, bool cooked)
+RelationGetNotNullConstraints(Oid relid, bool cooked, bool skip_no_inherit)
 {
  List   *notnulls = NIL;
  Relation constrRel;
@@ -815,7 +815,7 @@ RelationGetNotNullConstraints(Oid relid, bool cooked)

  if (conForm->contype != CONSTRAINT_NOTNULL)
  continue;
- if (conForm->connoinherit)
+ if (skip_no_inherit && conForm->connoinherit)
  continue;

I don't think you need to refactor RelationGetNotNullConstraints.
however i found it hard to explain it, (which one is parent, which one
is children is very confusing for me).
so i use the following script to test it:
DROP TABLE ATACC1, ATACC2;
CREATE TABLE ATACC1 (a int);
CREATE TABLE ATACC2 (b int,c int, unique(c)) INHERITS (ATACC1);
ALTER TABLE ATACC1 ADD NOT NULL a;
-- ALTER TABLE ATACC1 ADD NOT NULL a NO INHERIT;
ALTER TABLE ATACC2 ADD unique(a);
explain (costs off, verbose) select * from ATACC2 group by a,b,c;
-

create table t_0_3 (a int not null, b int not null, c int not null, d
int not null, unique (a, b), unique(b, c) DEFERRABLE, unique(d));
explain (costs off, verbose) select * from t_0_3 group by a,b,c,d;
   QUERY PLAN

 HashAggregate
   Output: a, b, c, d
   Group Key: t_0_3.a, t_0_3.b
   ->  Seq Scan on public.t_0_3
 Output: a, b, c, d
(5 rows)

the above part seems not right, it should be `Group Key: t_0_3.d`
so i changed to:

/* Skip constraints that are not UNIQUE */
+ if (con->contype != CONSTRAINT_UNIQUE)
+ continue;
+
+ /* Skip unique constraints that are condeferred */
+ if (con->condeferrable && con->condeferred)
+ continue;

I guess you probably have noticed, in the function
remove_useless_groupby_columns,
get_primary_key_attnos followed by get_min_unique_not_null_attnos.
Also, get_min_unique_not_null_attnos main code is very similar to
get_primary_key_attnos.

They both do `pg_constraint = table_open(ConstraintRelationId,
AccessShareLock);`
you might need to explain why we must open pg_constraint twice.
either 

Remove useless GROUP BY columns considering unique index

2023-12-29 Thread Zhang Mingli
Hi,

This idea first came from remove_useless_groupby_columns does not need to 
record constraint dependencie[0] which points out that
unique index whose columns all have NOT NULL constraints  could also take the 
work with primary key when removing useless GROUP BY columns.
I study it and implement the idea.

Ex:

create temp table t2 (a int, b int, c int not null, primary key (a, b), 
unique(c));

 explain (costs off) select * from t2 group by a,b,c;
 QUERY PLAN
 --
 HashAggregate
 Group Key: c
 -> Seq Scan on t2

The plan drop column a, b as I did a little more.

For the query, as t2 has primary key (a, b), before this patch, we could drop 
column c because {a, b} are PK.
And  we have an unique  index(c) with NOT NULL constraint now, we could drop 
column {a, b}, just keep {c}.

While we have multiple choices, group by a, b (c is removed  by PK) and group 
by c (a, b are removed by unique not null index)
And  I implement it to choose the one with less columns so that we can drop as 
more columns as possible.
I think it’s good for planner to save some cost like Sort less columns.
There may be better one for some reason like: try to keep PK for planner?
I’m not sure about that and it seems not worth much complex.

The NOT NULL constraint may also be computed from primary keys, ex:
create temp table t2 (a int, b int, c int not null, primary key (a, b), 
unique(a));
Primary key(a, b) ensure a is NOT NULL and we have a unique index(a), but it 
will introduce more complex to check if a unique index could be used.
I also doubt it worths doing that..
So my patch make it easy: check unique index’s columns, it’s a valid candidate 
if all of that have NOT NULL constraint.
And we choose a best one who has the least column numbers in 
get_min_unique_not_null_attnos(), as the reason: less columns mean that more 
group by columns could be removed.

create temp table t3 (a int, b int, c int not null, d int not null, primary key 
(a, b), unique(c, d));
-- Test primary key beats unique not null index.
explain (costs off) select * from t3 group by a,b,c,d;
 QUERY PLAN
--
 HashAggregate
 Group Key: a, b
 -> Seq Scan on t3
(3 rows)

create temp table t4 (a int, b int not null, c int not null, d int not null, 
primary key (a, b), unique(b, c), unique(d));
-- Test unique not null index with less columns wins.
explain (costs off) select * from t4 group by a,b,c,d;
 QUERY PLAN
--
 HashAggregate
 Group Key: d
 -> Seq Scan on t4
(3 rows)

The unique Indices could have overlaps with primary keys and indices themselves.

create temp table t5 (a int not null, b int not null, c int not null, d int not 
null, unique (a, b), unique(b, c), unique(a, c, d));
-- Test unique not null indices have overlap.
explain (costs off) select * from t5 group by a,b,c,d;
 QUERY PLAN
--
 HashAggregate
 Group Key: a, b
 -> Seq Scan on t5
(3 rows)





[0]https://www.postgresql.org/message-id/flat/CAApHDvrdYa%3DVhOoMe4ZZjZ-G4ALnD-xuAeUNCRTL%2BPYMVN8OnQ%40mail.gmail.com


Zhang Mingli
www.hashdata.xyz


v1-0001-Remove-useless-GROUP-BY-columns-considering-unique-i.patch
Description: Binary data