Re: UniqueKey v2
>> Consider join of tables "a", "b" and "c". My understanding is that >> make_join_rel() is called once with rel1={a} and rel2={b join c}, then with >> rel1={a join b} and rel2={c}, etc. I wanted to say that each call should >> produce the same set of unique keys. >> >> I need to check this part more in detail. > > I think I understand now. By calling populate_joinrel_uniquekeys() for various > orderings, you can find out that various input relation unique keys can > represent the whole join. For example, if the ordering is > > A JOIN (B JOIN C) > > you can prove that the unique keys of A can be used for the whole join, while > for the ordering > > B JOIN (A JOIN C) > > you can prove the same for the unique keys of B, and so on. Yes. >> > We don't create the component uniquekey if any one side of the boths >> > sides is unique already. For example: >> > >> > "(t1.id) in joinrel(t1, t2) is unique" OR "(t2.id) in joinrel is >> > unique", there is no need to create component UniqueKey (t1.id, t2.id); >> >> ok, I need to check more in detail how this part works. > > This optimization makes sense to me. OK. >> > > >> > > Of course one problem is that the number of combinations can grow >> > > exponentially as new relations are joined. >> > >> > Yes, that's why "rule 1" needed and "How to reduce the overhead" in >> > UniqueKey.README is introduced. > > I think there should yet be some guarantee that the number of unique keys does > not grow exponentially. Perhaps a constant that allows a relation (base or > join) to have at most N unique keys. (I imagine N to be rather small, e.g. 3 > or 4.) And when picking the "best N keys", one criterion could be the number > of expressions in the key (the shorter key the better). I don't want to introduce this complextity right now. I'm more inerested with how to work with them effectivity. main effort includes: - the design of bitmapset which is memory usage friendly and easy for combinations. - Optimize the singlerow cases to reduce N UnqiueKeys to 1 UniqueKey. I hope we can pay more attention to this optimization (at most N UniqueKeys) when the major inforastruce has been done. >> > You are correct and IMO my current code are able to tell it is a single >> > row as well. >> > >> > 1. Since t1.id = 1, so t1 is single row, so t1.d is unqiuekey as a >> > consequence. >> > 2. Given t2.id is unique, t1.e = t2.id so t1's unqiuekey can be kept >> > after the join because of rule 1 on joinrel. and t1 is singlerow, so the >> > joinrel is singlerow as well. > > ok, I think I understand now. OK. At last, this probably is my first non-trival patchs which has multiple authors, I don't want myself is the bottleneck for the coorperation, so if you need something to do done sooner, please don't hesitate to ask me for it explicitly. Here is my schedule about this. I can provide the next version based our discussion and your patches at the eariler of next week. and update the UniqueKey.README to make sure the overall design clearer. What I hope you to pay more attention is the UniqueKey.README besides the code. I hope the UniqueKey.README can reduce the effort for others to understand the overall design enormously. -- Best Regards Andy Fan
Re: UniqueKey v2
Antonin Houska writes: >> Could you make the reason clearer for adding 'List *opfamily_lists;' >> into UniqueKey? You said "This is needed to create ECs in the parent >> query if the upper relation represents a subquery." but I didn't get the >> it. Since we need to maintain the UniqueKey in the many places, I'd like >> to keep it as simple as possbile. Of course, anything essentical should >> be added for sure. > > If unique keys are generated for a subquery output, they also need to be > created for the corresponding relation in the upper query ("sub" in the > following example): OK. > > select * from tab1 left join (select * from tab2) sub; > > However, to create an unique key for "sub", you need an EC for each expression > of the key. OK. > And to create an EC, you in turn need the list of operator > families. I'm thinking if we need to "create" any EC. Can you find out a user case where the outer EC is missed and the UniqueKey is still interesting? I don't have an example now. convert_subquery_pathkeys has a similar sistuation and has the following codes: outer_ec = get_eclass_for_sort_expr(root, (Expr *) outer_var, sub_eclass->ec_opfamilies, sub_member->em_datatype, sub_eclass->ec_collation, 0, rel->relids, NULL, false); /* * If we don't find a matching EC, sub-pathkey isn't * interesting to the outer query */ if (outer_ec) best_pathkey = make_canonical_pathkey(root, outer_ec, sub_pathkey->pk_opfamily, sub_pathkey->pk_strategy, sub_pathkey->pk_nulls_first); } > Even if the parent query already had ECs for the columns of "sub" which are > contained in the unique key, you need to make sure that those ECs are > "compatible" with the ECs of the subquery which generated the unique key. That > is, if an EC of the subquery considers certain input values equal, the EC of > the parent query must also be able to determine if they are equal or not. > >> > * RelOptInfo.notnullattrs >> > >> > My understanding is that this field contains the set attributes whose >> > uniqueness is guaranteed by the unique key. They are acceptable because >> > they >> > are either 1) not allowed to be NULL due to NOT NULL constraint or 2) >> > NULL >> > value makes the containing row filtered out, so the row cannot break >> > uniqueness of the output. Am I right? >> > >> > If so, I suggest to change the field name to something less generic, >> > maybe >> > 'uniquekey_attrs' or 'uniquekey_candidate_attrs', and adding a comment >> > that >> > more checks are needed before particular attribute can actually be used >> > in >> > UniqueKey. >> >> I don't think so, UniqueKey is just one of the places to use this >> not-null property, see 3af704098 for the another user case of it. >> >> (Because of 3af704098, we should leverage notnullattnums somehow in this >> patch, which will be included in the next version as well). > > In your patch you modify 'notnullattrs' in add_base_clause_to_rel(), but that > does not happen to 'notnullattnums' in the current master branch. Thus I think > that 'notnullattrs' is specific to the unique keys feature, so the field name > should be less generic. OK. >> > >> > * uniquekey_useful_for_merging() >> > >> > How does uniqueness relate to merge join? In README.uniquekey you seem to >> > point out that a single row is always sorted, but I don't think this >> > function is related to that fact. (Instead, I'd expect that pathkeys are >> > added to all paths for a single-row relation, but I'm not sure you do >> > that >> > in the current version of
Re: UniqueKey v2
Antonin Houska wrote: > Andy Fan wrote: > > > > > > * Combining the UKs > > > > > > IMO this is the most problematic part of the patch. You call > > > populate_joinrel_uniquekeys() for the same join multiple times, > > > > Why do you think so? The below code is called in "make_join_rel" > > Consider join of tables "a", "b" and "c". My understanding is that > make_join_rel() is called once with rel1={a} and rel2={b join c}, then with > rel1={a join b} and rel2={c}, etc. I wanted to say that each call should > produce the same set of unique keys. > > I need to check this part more in detail. I think I understand now. By calling populate_joinrel_uniquekeys() for various orderings, you can find out that various input relation unique keys can represent the whole join. For example, if the ordering is A JOIN (B JOIN C) you can prove that the unique keys of A can be used for the whole join, while for the ordering B JOIN (A JOIN C) you can prove the same for the unique keys of B, and so on. > > Is your original question is about populate_joinrel_uniquekey_for_rel > > rather than populate_joinrel_uniquekeys? We have the below codes: > > > > outeruk_still_valid = populate_joinrel_uniquekey_for_rel(root, joinrel, > > outerrel, > > > > innerrel, restrictlist); > > inneruk_still_valid = populate_joinrel_uniquekey_for_rel(root, joinrel, > > innerrel, > > > > outerrel, restrictlist); > > > > This is mainly because of the following theory. Quoted from > > README.uniquekey. Let's called this as "rule 1". > > > > """ > > How to maintain the uniquekey? > > --- > > .. From the join relation, it is maintained with two rules: > > > > - the uniquekey in one side is still unique if it can't be duplicated > > after the join. for example: > > > > SELECT t1.pk FROM t1 JOIN t2 ON t1.a = t2.pk; > > UniqueKey on t1: t1.pk > > UniqueKey on t1 Join t2: t1.pk > > """ > > > > AND the blow codes: > > > > > > if (outeruk_still_valid || inneruk_still_valid) > > > > /* > > * the uniquekey on outers or inners have been added into > > joinrel so > > * the combined uniuqekey from both sides is not needed. > > */ > > return; > > > > > > We don't create the component uniquekey if any one side of the boths > > sides is unique already. For example: > > > > "(t1.id) in joinrel(t1, t2) is unique" OR "(t2.id) in joinrel is > > unique", there is no need to create component UniqueKey (t1.id, t2.id); > > ok, I need to check more in detail how this part works. This optimization makes sense to me. > > > > > > Of course one problem is that the number of combinations can grow > > > exponentially as new relations are joined. > > > > Yes, that's why "rule 1" needed and "How to reduce the overhead" in > > UniqueKey.README is introduced. I think there should yet be some guarantee that the number of unique keys does not grow exponentially. Perhaps a constant that allows a relation (base or join) to have at most N unique keys. (I imagine N to be rather small, e.g. 3 or 4.) And when picking the "best N keys", one criterion could be the number of expressions in the key (the shorter key the better). > > > > > > 2) Check if the join relation is single-row > > > > > > I in order to get rid of the dependency on 'restrictlist', I think you > > > can > > > use ECs. Consider a query from your regression tests: > > > > > > CREATE TABLE uk_t (id int primary key, a int not null, b int not null, c > > > int, d int, e int); > > > > > > SELECT distinct t1.d FROM uk_t t1 JOIN uk_t t2 ON t1.e = t2.id and t1.id > > > = 1; > > > > > > The idea here seems to be that no more than one row of t1 matches the > > > query > > > clauses. Therefore, if t2(id) is unique, the clause t1.e=t2.id ensures > > > that > > > no more than one row of t2 matches the query (because t1 cannot provide > > > the > > > clause with more than one input value of 'e'). And therefore, the join > > > also > > > produces at most one row. > > > > You are correct and IMO my current code are able to tell it is a single > > row as well. > > > > 1. Since t1.id = 1, so t1 is single row, so t1.d is unqiuekey as a > > consequence. > > 2. Given t2.id is unique, t1.e = t2.id so t1's unqiuekey can be kept > > after the join because of rule 1 on joinrel. and t1 is singlerow, so the > > joinrel is singlerow as well. ok, I think I understand now. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: UniqueKey v2
Andy Fan wrote: > > * I think that, before EC is considered suitable for an UK, its > > ec_opfamilies > > field needs to be checked. I try to do that in > > find_ec_position_matching_expr(), see 0004. > > Could you make the reason clearer for adding 'List *opfamily_lists;' > into UniqueKey? You said "This is needed to create ECs in the parent > query if the upper relation represents a subquery." but I didn't get the > it. Since we need to maintain the UniqueKey in the many places, I'd like > to keep it as simple as possbile. Of course, anything essentical should > be added for sure. If unique keys are generated for a subquery output, they also need to be created for the corresponding relation in the upper query ("sub" in the following example): select * from tab1 left join (select * from tab2) sub; However, to create an unique key for "sub", you need an EC for each expression of the key. And to create an EC, you in turn need the list of operator families. Even if the parent query already had ECs for the columns of "sub" which are contained in the unique key, you need to make sure that those ECs are "compatible" with the ECs of the subquery which generated the unique key. That is, if an EC of the subquery considers certain input values equal, the EC of the parent query must also be able to determine if they are equal or not. > > * RelOptInfo.notnullattrs > > > > My understanding is that this field contains the set attributes whose > > uniqueness is guaranteed by the unique key. They are acceptable because > > they > > are either 1) not allowed to be NULL due to NOT NULL constraint or 2) NULL > > value makes the containing row filtered out, so the row cannot break > > uniqueness of the output. Am I right? > > > > If so, I suggest to change the field name to something less generic, maybe > > 'uniquekey_attrs' or 'uniquekey_candidate_attrs', and adding a comment > > that > > more checks are needed before particular attribute can actually be used in > > UniqueKey. > > I don't think so, UniqueKey is just one of the places to use this > not-null property, see 3af704098 for the another user case of it. > > (Because of 3af704098, we should leverage notnullattnums somehow in this > patch, which will be included in the next version as well). In your patch you modify 'notnullattrs' in add_base_clause_to_rel(), but that does not happen to 'notnullattnums' in the current master branch. Thus I think that 'notnullattrs' is specific to the unique keys feature, so the field name should be less generic. > > > > * uniquekey_useful_for_merging() > > > > How does uniqueness relate to merge join? In README.uniquekey you seem to > > point out that a single row is always sorted, but I don't think this > > function is related to that fact. (Instead, I'd expect that pathkeys are > > added to all paths for a single-row relation, but I'm not sure you do that > > in the current version of the patch.) > > The merging is for "mergejoinable join clauses", see function > eclass_useful_for_merging. Usually I think it as operator "t1.a = t2.a"; My question is: why is the uniqueness important specifically to merge join? I understand that join evaluation can be more efficient if we know that one input relation is unique (i.e. we only scan that relation until we find the first match), but this is not specific to merge join. > > * is_uniquekey_useful_afterjoin() > > > > Now that my patch (0004) allows propagation of the unique keys from a > > subquery to the upper query, I was wondering if the UniqueKey structure > > needs the 'use_for_distinct field' I mean we should now propagate the > > unique > > keys to the parent query whether the subquery has DISTINCT clause or not. > > I > > noticed that the field is checked by is_uniquekey_useful_afterjoin(), so I > > changed the function to always returned true. However nothing changed in > > the > > output of regression tests (installcheck). Do you insist that the > > 'use_for_distinct' field is needed? I miss your answer to this comment. > > * uniquekey_contains_multinulls() > > > > ** Instead of calling the function when trying to use the UK, how about > > checking the ECs when considering creation of the UK? If the tests > > fail, > > just don't create the UK. > > I don't think so since we maintain the UniqueKey from bottom to top, you > can double check if my reason is appropriate. > > CREATE TABLE t1(a int); > CREATE INDEX ON t1(a); > > SELECT distinct t1.a FROM t1 JOIN t2 using(a); > > We need to create the UniqueKey on the baserel for t1 and the NULL > values is filtered out in the joinrel. so we have to creating it with > allowing NULL values first. ok > > ** What does the 'multi' word in the function name mean? > > multi means multiple, I thought we use this short name in the many > places, for ex bt_multi_page_stats after a quick search. Why not simply uniquekey_contains_nulls() ? Actually
Re: UniqueKey v2
Hello Antonin, Thanks for interesting with this topic! > * I think that, before EC is considered suitable for an UK, its ec_opfamilies > field needs to be checked. I try to do that in > find_ec_position_matching_expr(), see 0004. Could you make the reason clearer for adding 'List *opfamily_lists;' into UniqueKey? You said "This is needed to create ECs in the parent query if the upper relation represents a subquery." but I didn't get the it. Since we need to maintain the UniqueKey in the many places, I'd like to keep it as simple as possbile. Of course, anything essentical should be added for sure. > > * Set-returning functions (SRFs) can make a "distinct path" necessary even if > the join output is unique. You are right at this point, I will fix it in the coming version. > > * RelOptInfo.notnullattrs > > My understanding is that this field contains the set attributes whose > uniqueness is guaranteed by the unique key. They are acceptable because they > are either 1) not allowed to be NULL due to NOT NULL constraint or 2) NULL > value makes the containing row filtered out, so the row cannot break > uniqueness of the output. Am I right? > > If so, I suggest to change the field name to something less generic, maybe > 'uniquekey_attrs' or 'uniquekey_candidate_attrs', and adding a comment that > more checks are needed before particular attribute can actually be used in > UniqueKey. I don't think so, UniqueKey is just one of the places to use this not-null property, see 3af704098 for the another user case of it. (Because of 3af704098, we should leverage notnullattnums somehow in this patch, which will be included in the next version as well). > * add_uniquekey_for_uniqueindex() > > I'd appreciate an explanation why the "single-row UK" is created. I think > the reason for unique_exprs==NIL is that a restriction clause VAR=CONST > exists for each column of the unique index. Whether I'm right or not, a > comment should state clearly what the reason is. You are understanding it correctly. I will add comments in the next version. > > * uniquekey_useful_for_merging() > > How does uniqueness relate to merge join? In README.uniquekey you seem to > point out that a single row is always sorted, but I don't think this > function is related to that fact. (Instead, I'd expect that pathkeys are > added to all paths for a single-row relation, but I'm not sure you do that > in the current version of the patch.) The merging is for "mergejoinable join clauses", see function eclass_useful_for_merging. Usually I think it as operator "t1.a = t2.a"; > * is_uniquekey_useful_afterjoin() > > Now that my patch (0004) allows propagation of the unique keys from a > subquery to the upper query, I was wondering if the UniqueKey structure > needs the 'use_for_distinct field' I mean we should now propagate the unique > keys to the parent query whether the subquery has DISTINCT clause or not. I > noticed that the field is checked by is_uniquekey_useful_afterjoin(), so I > changed the function to always returned true. However nothing changed in the > output of regression tests (installcheck). Do you insist that the > 'use_for_distinct' field is needed? > > > * uniquekey_contains_multinulls() > > ** Instead of calling the function when trying to use the UK, how about > checking the ECs when considering creation of the UK? If the tests fail, > just don't create the UK. I don't think so since we maintain the UniqueKey from bottom to top, you can double check if my reason is appropriate. CREATE TABLE t1(a int); CREATE INDEX ON t1(a); SELECT distinct t1.a FROM t1 JOIN t2 using(a); We need to create the UniqueKey on the baserel for t1 and the NULL values is filtered out in the joinrel. so we have to creating it with allowing NULL values first. > ** What does the 'multi' word in the function name mean? multi means multiple, I thought we use this short name in the many places, for ex bt_multi_page_stats after a quick search. > * relation_is_distinct_for() > > The function name is too similar to rel_is_distinct_for(). I think the name > should indicate that you are checking the relation against a set of > pathkeys. Maybe rel_is_distinct_for_pathkeys() (and remove 'distinct' from > the argument name)? At the same time, it might be good to rename > rel_is_distinct_for() to rel_is_distinct_for_clauses(). OK. > * uniquekey_contains_in() > > Shouldn't this be uniquekey_contained_in()? And likewise, shouldn't the > comment be " ... if UniqueKey is contained in the list of EquivalenceClass" > ? OK. > > (In general, even though I'm not English native speaker, I think I see quite > a few grammar issues, which often make reading of the comments/documentation Your English is really good:) > > > * Combining the UKs > > IMO this is the most problematic part of the patch. You call > populate_joinrel_uniquekeys() for the same join multiple
Re: UniqueKey v2
Hello Antonin, > zhihuifan1...@163.com wrote: > >> Here is the v3, ... > > I'm trying to enhance the join removal functionality (will post my patch in a > separate thread soon) and I consider your patch very helpful in this > area. Thanks for these words. The point 2) and 3) is pretty interesting to me at [1] and "enhance join removal" is another interesting user case. > Following is my review. Attached are also some fixes and one enhancement: > propagation of the unique keys (UK) from a subquery to the parent query > (0004). (Note that 0001 is just your patch rebased). Thanks for that! more enhancment like uniquekey in partitioned table is needed. This post is mainly used to check if more people is still interested with this. > Are you going to submit the patch to the first CF of PG 18? Since there are still known work to do, I'm not sure if it is OK to submit in CF. What do you think about this part? > > Please let me know if I can contribute to the effort by reviewing or writing > some code. Absolutely yes! please feel free to review / writing any of them and do remember add yourself into the author list if you do that. Thanks for your review suggestion, I will get to this very soon if once I get time, I hope it is in 4 weeks. [1] https://www.postgresql.org/message-id/7mlamswjp81p.fsf%40e18c07352.et15sqa -- Best Regards Andy Fan
Re: UniqueKey v2
zhihuifan1...@163.com wrote: > Here is the v3, ... I'm trying to enhance the join removal functionality (will post my patch in a separate thread soon) and I consider your patch very helpful in this area. Following is my review. Attached are also some fixes and one enhancement: propagation of the unique keys (UK) from a subquery to the parent query (0004). (Note that 0001 is just your patch rebased). * I think that, before EC is considered suitable for an UK, its ec_opfamilies field needs to be checked. I try to do that in find_ec_position_matching_expr(), see 0004. * Set-returning functions (SRFs) can make a "distinct path" necessary even if the join output is unique. * RelOptInfo.notnullattrs My understanding is that this field contains the set attributes whose uniqueness is guaranteed by the unique key. They are acceptable because they are either 1) not allowed to be NULL due to NOT NULL constraint or 2) NULL value makes the containing row filtered out, so the row cannot break uniqueness of the output. Am I right? If so, I suggest to change the field name to something less generic, maybe 'uniquekey_attrs' or 'uniquekey_candidate_attrs', and adding a comment that more checks are needed before particular attribute can actually be used in UniqueKey. * add_uniquekey_for_uniqueindex() I'd appreciate an explanation why the "single-row UK" is created. I think the reason for unique_exprs==NIL is that a restriction clause VAR=CONST exists for each column of the unique index. Whether I'm right or not, a comment should state clearly what the reason is. * uniquekey_useful_for_merging() How does uniqueness relate to merge join? In README.uniquekey you seem to point out that a single row is always sorted, but I don't think this function is related to that fact. (Instead, I'd expect that pathkeys are added to all paths for a single-row relation, but I'm not sure you do that in the current version of the patch.) * is_uniquekey_useful_afterjoin() Now that my patch (0004) allows propagation of the unique keys from a subquery to the upper query, I was wondering if the UniqueKey structure needs the 'use_for_distinct field' I mean we should now propagate the unique keys to the parent query whether the subquery has DISTINCT clause or not. I noticed that the field is checked by is_uniquekey_useful_afterjoin(), so I changed the function to always returned true. However nothing changed in the output of regression tests (installcheck). Do you insist that the 'use_for_distinct' field is needed? * uniquekey_contains_multinulls() ** Instead of calling the function when trying to use the UK, how about checking the ECs when considering creation of the UK? If the tests fail, just don't create the UK. ** What does the 'multi' word in the function name mean? * relation_is_distinct_for() The function name is too similar to rel_is_distinct_for(). I think the name should indicate that you are checking the relation against a set of pathkeys. Maybe rel_is_distinct_for_pathkeys() (and remove 'distinct' from the argument name)? At the same time, it might be good to rename rel_is_distinct_for() to rel_is_distinct_for_clauses(). * uniquekey_contains_in() Shouldn't this be uniquekey_contained_in()? And likewise, shouldn't the comment be " ... if UniqueKey is contained in the list of EquivalenceClass" ? (In general, even though I'm not English native speaker, I think I see quite a few grammar issues, which often make reading of the comments/documentation a bit difficult.) * Combining the UKs IMO this is the most problematic part of the patch. You call populate_joinrel_uniquekeys() for the same join multiple times, each time with a different 'restrictlist', and you try to do two things at the same time: 1) combine the UKs of the input relations into the UKs of the join relation, 2) check if the join relation can be marked single-row. I think that both 1) and 2) should be independent from join order, and thus both computations should only take place once for given set of input relations. And I think they should be done separately: 1) Compute the join UKs As you admit in a comment in populate_joinrel_uniquekeys(), neither join method nor clauses should matter. So I think you only need to pick the "component UKs" (i.e. UKs of the input relations) which are usable above that join (i.e. neither the join itself nor any join below sets any column of the UK to NULL) and combine them. Of course one problem is that the number of combinations can grow exponentially as new relations are joined. I'm not sure it's necessary to combine the UKs (and to discard some of them) immediately. Instead, maybe we can keep lists of UKs only for base relations, and postpone picking the suitable UKs and combining them until we actually need to check the relation uniqueness. 2) Check if the join relation
Re: UniqueKey v2
zhihuifan1...@163.com writes: Hi, Here is the v3, the mainly changes is it maintains the UniqueKey on joinrel level, which probabaly is the most important part of this feature. It shows how the UnqiueKey on joinrel is generated and how it is discarded due to non-interesting-uniquekey and also show much details about the single-row case. I will always maintain README.uniquekey under src/backend/optimizer/path/ to include the latest state of this feature to save the time for reviewer from going through from the begining. I also use the word "BAD CASE" in uniquekey.sql to demo which sistuation is not handled well so far, that probably needs more attention at the first review. >From 987c1d08bedc0c9c7436ad2daddbab2202b52425 Mon Sep 17 00:00:00 2001 From: "yizhi.fzh" Date: Thu, 9 Nov 2023 17:25:35 +0800 Subject: [PATCH v3 1/1] maintain uniquekey on baserel/joinrel level and use it for marking distinct-as-noop. --- src/backend/nodes/list.c| 17 + src/backend/optimizer/path/Makefile | 3 +- src/backend/optimizer/path/README.uniquekey | 220 +++ src/backend/optimizer/path/allpaths.c | 2 + src/backend/optimizer/path/equivclass.c | 48 ++ src/backend/optimizer/path/joinrels.c | 2 + src/backend/optimizer/path/uniquekey.c | 654 src/backend/optimizer/plan/initsplan.c | 8 + src/backend/optimizer/plan/planner.c| 33 + src/backend/optimizer/util/plancat.c| 10 + src/backend/optimizer/util/relnode.c| 2 + src/include/nodes/pathnodes.h | 19 + src/include/nodes/pg_list.h | 2 + src/include/optimizer/paths.h | 13 + src/test/regress/expected/join.out | 11 +- src/test/regress/expected/uniquekey.out | 268 src/test/regress/parallel_schedule | 2 +- src/test/regress/sql/uniquekey.sql | 104 18 files changed, 1409 insertions(+), 9 deletions(-) create mode 100644 src/backend/optimizer/path/README.uniquekey create mode 100644 src/backend/optimizer/path/uniquekey.c create mode 100644 src/test/regress/expected/uniquekey.out create mode 100644 src/test/regress/sql/uniquekey.sql diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c index 750ee5a7e5..eaeaa19ad2 100644 --- a/src/backend/nodes/list.c +++ b/src/backend/nodes/list.c @@ -694,6 +694,23 @@ list_member_ptr(const List *list, const void *datum) return false; } +/* + * Return index of the datum in list if found. otherwise return -1. + */ +int +list_member_ptr_pos(const List *list, const void *datum) +{ + ListCell *lc; + + foreach(lc, list) + { + if (lfirst(lc) == datum) + return foreach_current_index(lc); + } + + return -1; +} + /* * Return true iff the integer 'datum' is a member of the list. */ diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile index 1e199ff66f..63cc1505d9 100644 --- a/src/backend/optimizer/path/Makefile +++ b/src/backend/optimizer/path/Makefile @@ -21,6 +21,7 @@ OBJS = \ joinpath.o \ joinrels.o \ pathkeys.o \ - tidpath.o + tidpath.o \ + uniquekey.o include $(top_srcdir)/src/backend/common.mk diff --git a/src/backend/optimizer/path/README.uniquekey b/src/backend/optimizer/path/README.uniquekey new file mode 100644 index 00..be13b113b9 --- /dev/null +++ b/src/backend/optimizer/path/README.uniquekey @@ -0,0 +1,220 @@ +Here is a design document and a part of implementation. + +What is UniqueKey? +- + +UniqueKey represents a uniqueness information for a RelOptInfo. for +example: + +SELECT id FROM t; + +where the ID is the UniqueKey for the RelOptInfo (t). In the real world, +it has the following attributes: + +1). It should be EquivalenceClass based. for example: + +SELECT a FROM t WHERE a = id; + +In this case, the UniqueKey should be 1 EC with two members +- EC(EM(a), EM(id)). + + +2). Each UniqueKey may be made up with 1+ EquivalenceClass. for example: + +CREATE TABLE t(a int not null, b int not null); +CREATE UNIQUE INDEX on t(a, b); +SELECT * FROM t; + +Where the UniqueKey for RelOptInfo (t) will be 2 ECs with each one has 1 +member. + +- EC(em=a), EC(em=b) + +3). Each RelOptInfo may have 1+ UniqueKeys. + +CREATE TABLE t(a int not null, b int not null, c int not null); +CREATE UNIQUE INDEX on t(a, b); +CREATE UNIQUE INDEX on t(c); + +SELECT * FROM t; + +Where the UniqueKey for RelOptInfo (t) will be +- [EC(em=a), EC(em=b)]. +- [EC(em=c)] + +4). A special case is about the one-row case. It works like: +SELECT * FROM t WHERE id = 1; +Here every single expression in the RelOptInfo (t) is unique since only +one row here. + +Where can we use it? + +1. mark the distinct as no-op. SELECT DISTINCT uniquekey FROM v; This + optimization has been required several times in our threads. + +2. Figure out more pathkey within the onerow case, then some planning + time can be reduced to be big extend. This user case is not
Re: UniqueKey v2
jian he writes: > On Fri, Oct 20, 2023 at 4:33 PM wrote: >> >> >> > i did some simple tests using text data type. >> > >> > it works with the primary key, not with unique indexes. >> > it does not work when the column is unique, not null. >> > >> > The following is my test. >> >> Can you simplify your test case please? I can't undertand what "doesn't >> work" mean here and for which case. FWIW, this feature has nothing with >> the real data, I don't think inserting any data is helpful unless I >> missed anything. > > Sorry for not explaining it very well. > "make distinct as no-op." > my understanding: it means: if fewer rows meet the criteria "columnX < > const_a;" , after analyze the table, it should use index only scan No, "mark distinct as no-op" means the distinct node can be discarded automatically since it is not needed any more. The simplest case would be "select distinct pk from t", where it should be same as "select pk from t". You can check the testcase for the more cases. -- Best Regards Andy Fan
Re: UniqueKey v2
On Fri, Oct 20, 2023 at 4:33 PM wrote: > > > > i did some simple tests using text data type. > > > > it works with the primary key, not with unique indexes. > > it does not work when the column is unique, not null. > > > > The following is my test. > > Can you simplify your test case please? I can't undertand what "doesn't > work" mean here and for which case. FWIW, this feature has nothing with > the real data, I don't think inserting any data is helpful unless I > missed anything. Sorry for not explaining it very well. "make distinct as no-op." my understanding: it means: if fewer rows meet the criteria "columnX < const_a;" , after analyze the table, it should use index only scan for the queryA? --queryA: select distinct columnX from the_table where columnX < const_a; There are several ways for columnX to be unique: primark key, unique key, unique key nulls distinct, unique key nulls not distinct, unique key and not null. After applying your patch, only the primary key case will make the queryA explain output using the index-only scan.
Re: UniqueKey v2
> i did some simple tests using text data type. > > it works with the primary key, not with unique indexes. > it does not work when the column is unique, not null. > > The following is my test. Can you simplify your test case please? I can't undertand what "doesn't work" mean here and for which case. FWIW, this feature has nothing with the real data, I don't think inserting any data is helpful unless I missed anything. -- Best Regards Andy Fan
Re: UniqueKey v2
On Tue, Oct 17, 2023 at 11:21 AM wrote: > > > thanks for the really good suggestion. Here is the newer version: > --- a/src/backend/optimizer/path/meson.build +++ b/src/backend/optimizer/path/meson.build @@ -10,4 +10,5 @@ backend_sources += files( 'joinrels.c', 'pathkeys.c', 'tidpath.c', + 'uniquekey.c' ) diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h index 3ac25d47..5ed550ca 100644 --- a/src/include/optimizer/paths.h +++ b/src/include/optimizer/paths.h @@ -264,7 +264,10 @@ extern PathKey *make_canonical_pathkey(PlannerInfo *root, int strategy, bool nulls_first); extern void add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel, List *live_childrels); - +/* + * uniquekey.c + * uniquekey.c related functions. + */ - i did some simple tests using text data type. it works with the primary key, not with unique indexes. it does not work when the column is unique, not null. The following is my test. begin; CREATE COLLATION case_insensitive (provider = icu, locale = 'und-u-ks-level2', deterministic = false); CREATE COLLATION upper_first (provider = icu, locale = 'und-u-kf-upper'); commit; begin; CREATE TABLE test_uniquekey3(a text, b text); CREATE TABLE test_uniquekey4(a text, b text); CREATE TABLE test_uniquekey5(a text, b text); CREATE TABLE test_uniquekey6(a text, b text); CREATE TABLE test_uniquekey7(a text not null, b text not null); CREATE TABLE test_uniquekey8(a text not null, b text not null); CREATE TABLE test_uniquekey9(a text primary key COLLATE upper_first, b text not null); CREATE TABLE test_uniquekey10(a text primary key COLLATE case_insensitive, b text not null); create unique index on test_uniquekey3 (a COLLATE case_insensitive nulls first) nulls distinct with (fillfactor = 80); create unique index on test_uniquekey4 (a COLLATE case_insensitive nulls first) nulls not distinct with (fillfactor = 80); create unique index on test_uniquekey5 (a COLLATE upper_first nulls first) nulls distinct; create unique index on test_uniquekey6 (a COLLATE upper_first nulls first) nulls not distinct; create unique index on test_uniquekey7 (a COLLATE upper_first nulls first) nulls distinct; create unique index on test_uniquekey8 (a COLLATE case_insensitive nulls first) nulls not distinct; insert into test_uniquekey3(a,b) select g::text, (g+10)::text from generate_series(1,1e5) g; insert into test_uniquekey4(a,b) select g::text, (g+10)::text from generate_series(1,1e5) g; insert into test_uniquekey5(a,b) select g::text, (g+10)::text from generate_series(1,1e5) g; insert into test_uniquekey6(a,b) select g::text, (g+10)::text from generate_series(1,1e5) g; insert into test_uniquekey7(a,b) select g::text, (g+10)::text from generate_series(1,1e5) g; insert into test_uniquekey8(a,b) select g::text, (g+10)::text from generate_series(1,1e5) g; insert into test_uniquekey9(a,b) select g::text, (g+10)::text from generate_series(1,1e5) g; insert into test_uniquekey10(a,b) select g::text, (g+10)::text from generate_series(1,1e5) g; insert into test_uniquekey3(a) VALUES(null),(null),(null); insert into test_uniquekey4(a) VALUES(null); insert into test_uniquekey5(a) VALUES(null),(null),(null); insert into test_uniquekey6(a) VALUES(null); commit; ANALYZE test_uniquekey3, test_uniquekey4, test_uniquekey5 ,test_uniquekey6,test_uniquekey7, test_uniquekey8 ,test_uniquekey9, test_uniquekey10; explain (costs off) select distinct a from test_uniquekey3; explain (costs off) select distinct a from test_uniquekey4; explain (costs off) select distinct a from test_uniquekey5; explain (costs off) select distinct a from test_uniquekey6; explain (costs off) select distinct a from test_uniquekey7; explain (costs off) select distinct a from test_uniquekey8; explain (costs off) select distinct a from test_uniquekey9; explain (costs off) select distinct a from test_uniquekey10; explain (costs off) select distinct a from test_uniquekey3 where a < '2000'; explain (costs off) select distinct a from test_uniquekey4 where a < '2000'; explain (costs off) select distinct a from test_uniquekey5 where a < '2000'; explain (costs off) select distinct a from test_uniquekey6 where a < '2000'; explain (costs off) select distinct a from test_uniquekey7 where a < '2000'; explain (costs off) select distinct a from test_uniquekey8 where a < '2000'; explain (costs off) select distinct a from test_uniquekey9 where a < '2000'; explain (costs off) select distinct a from test_uniquekey10 where a < '2000'; --very high selectivity explain (costs off) select distinct a from test_uniquekey3 where a < '1001'; explain (costs off) select distinct a from test_uniquekey4 where a < '1001'; explain (costs off) select distinct a from test_uniquekey5 where a < '1001'; explain (costs off) select distinct a from test_uniquekey6 where a < '1001'; explain (costs off) select distinct a from test_uniquekey7 where a < '1001'; explain (costs off) select distinct a from
Re: UniqueKey v2
jian he writes: Hi jian, > hi. > After `git am`, I still cannot build. > > ../../Desktop/pg_sources/main/postgres/src/backend/optimizer/path/uniquekey.c:125:45: > error: variable ‘var’ set but not used > [-Werror=unused-but-set-variable] > 125 | Var*var; > | ^~~ Thanks for this report, looks clang 11 can't capture this error. I have switched to clang 17 which would report this issue at the first place. > > You also need to change src/backend/optimizer/path/meson.build. Great thanks. > > git apply failed. > > git am warning: > Applying: uniquekey on base relation and used it for mark-distinct-as-op. > .git/rebase-apply/patch:876: new blank line at EOF. > + > warning: 1 line adds whitespace errors. > > I think you can use `git diff --check` > (https://git-scm.com/docs/git-diff) to check for whitespace related > errors. thanks for the really good suggestion. Here is the newer version: >From c2c752a3f66353d7d391976b0152c353a239ce60 Mon Sep 17 00:00:00 2001 From: Andy Fan Date: Tue, 17 Oct 2023 11:06:53 +0800 Subject: [PATCH v2 1/1] uniquekey on base relation and used it for mark-distinct-as-noop. --- src/backend/nodes/list.c| 17 ++ src/backend/optimizer/path/Makefile | 3 +- src/backend/optimizer/path/allpaths.c | 2 + src/backend/optimizer/path/equivclass.c | 48 +++ src/backend/optimizer/path/uniquekey.c | 384 src/backend/optimizer/plan/initsplan.c | 8 + src/backend/optimizer/plan/planner.c| 33 ++ src/backend/optimizer/util/plancat.c| 10 + src/backend/optimizer/util/relnode.c| 2 + src/include/nodes/pathnodes.h | 19 ++ src/include/nodes/pg_list.h | 2 + src/include/optimizer/paths.h | 10 + src/test/regress/expected/uniquekey.out | 69 + src/test/regress/parallel_schedule | 2 +- src/test/regress/sql/uniquekey.sql | 28 ++ 15 files changed, 635 insertions(+), 2 deletions(-) create mode 100644 src/backend/optimizer/path/uniquekey.c create mode 100644 src/test/regress/expected/uniquekey.out create mode 100644 src/test/regress/sql/uniquekey.sql diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c index 750ee5a7e55..eaeaa19ad2d 100644 --- a/src/backend/nodes/list.c +++ b/src/backend/nodes/list.c @@ -694,6 +694,23 @@ list_member_ptr(const List *list, const void *datum) return false; } +/* + * Return index of the datum in list if found. otherwise return -1. + */ +int +list_member_ptr_pos(const List *list, const void *datum) +{ + ListCell *lc; + + foreach(lc, list) + { + if (lfirst(lc) == datum) + return foreach_current_index(lc); + } + + return -1; +} + /* * Return true iff the integer 'datum' is a member of the list. */ diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile index 1e199ff66f7..63cc1505d96 100644 --- a/src/backend/optimizer/path/Makefile +++ b/src/backend/optimizer/path/Makefile @@ -21,6 +21,7 @@ OBJS = \ joinpath.o \ joinrels.o \ pathkeys.o \ - tidpath.o + tidpath.o \ + uniquekey.o include $(top_srcdir)/src/backend/common.mk diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c index 3cda88e..815a74a9dfa 100644 --- a/src/backend/optimizer/path/allpaths.c +++ b/src/backend/optimizer/path/allpaths.c @@ -458,6 +458,8 @@ set_rel_size(PlannerInfo *root, RelOptInfo *rel, } } + populate_baserel_uniquekeys(root, rel); + /* * We insist that all non-dummy rels have a nonzero rowcount estimate. */ diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c index 7fa502d6e25..29edc8d1b5a 100644 --- a/src/backend/optimizer/path/equivclass.c +++ b/src/backend/optimizer/path/equivclass.c @@ -744,6 +744,54 @@ get_eclass_for_sort_expr(PlannerInfo *root, return newec; } +/* + * find_ec_position_matching_expr + * Locate the position of EquivalenceClass whose members matching + * the given expr, if any; return -1 if no match. + */ +int +find_ec_position_matching_expr(PlannerInfo *root, + Expr *expr, + RelOptInfo *baserel) +{ + int i = -1; + + while ((i = bms_next_member(baserel->eclass_indexes, i)) >= 0) + { + EquivalenceClass *ec = list_nth(root->eq_classes, i); + + if (find_ec_member_matching_expr(ec, expr, baserel->relids)) + return i; + } + return -1; +} + +/* + * build_ec_positions_for_exprs + * + * Given a list of exprs, find the related EC positions for each of + * them. if any exprs has no EC related, return NULL; + */ +Bitmapset * +build_ec_positions_for_exprs(PlannerInfo *root, List *exprs, RelOptInfo *rel) +{ + ListCell *lc; + Bitmapset *res = NULL; + + foreach(lc, exprs) + { + int pos = find_ec_position_matching_expr(root, lfirst(lc), rel); + + if (pos < 0) + { + bms_free(res); + return NULL; + } + res = bms_add_member(res, pos); + } + return res; +} + /* *
Re: UniqueKey v2
hi. After `git am`, I still cannot build. ../../Desktop/pg_sources/main/postgres/src/backend/optimizer/path/uniquekey.c:125:45: error: variable ‘var’ set but not used [-Werror=unused-but-set-variable] 125 | Var*var; | ^~~ You also need to change src/backend/optimizer/path/meson.build. git apply failed. git am warning: Applying: uniquekey on base relation and used it for mark-distinct-as-op. .git/rebase-apply/patch:876: new blank line at EOF. + warning: 1 line adds whitespace errors. I think you can use `git diff --check` (https://git-scm.com/docs/git-diff) to check for whitespace related errors.