Re: [PATCH] Erase the distinctClause if the result is unique by definition
I have started the new thread [1] to continue talking about this. Mr. cfbot is happy now.

[1] https://www.postgresql.org/message-id/flat/CAKU4AWrwZMAL%3DuaFUDMf4WGOVkEL3ONbatqju9nSXTUucpp_pw%40mail.gmail.com

Thanks
Re: [PATCH] Erase the distinctClause if the result is unique by definition
Hi David:

On Wed, Mar 18, 2020 at 12:13 PM David Rowley wrote:
> On Wed, 18 Mar 2020 at 15:57, Andy Fan wrote:
> > I'm now writing the code for the partition index stuff, which
> > is a bit boring, since every partition may have a different unique index.
>
> Why is that case so different?
>
> For a partitioned table to have a valid unique index, a unique index
> must exist on each partition having columns that are a superset of the
> partition key columns. An IndexOptInfo will exist on the partitioned
> table's RelOptInfo, in this case.

The main differences are:

1. We can create a unique index on only some of the partitions:

    create table q100 (a int, b int, c int) partition by range (b);
    create table q100_1 partition of q100 for values from (1) to (10);
    create table q100_2 partition of q100 for values from (11) to (20);
    create unique index q100_1_c on q100_1(c);  -- user may create this index on q100_1 only

2. The unique index may not contain the partition key, as above. In that
case, even if the same index exists on every partition, we still can't use
it, since it is unique within each partition only.

3. So a unique index on a partitioned table can be used only if it contains
the partition key AND exists on all the partitions.

4. When we apply uniquekey_is_useful_for_rel, I compare the information
between ind->indextlist and rel->reltarget, but the indextlist has the
wrong varno, so we have to change the varno with ChangeVarNodes for the
indextlist taken from the childrel, since its varno is for the childrel.

5. When we detect the "uk = 1" case, the uk is also expressed with the
parentrel->relid information, so we may need to run ChangeVarNodes on
childrel->indexinfo->indextlist as well.

Even though the rules look long, the run time should be very short, since
we usually don't have many unique indexes on a partitioned table.

> At the leaf partition level, wouldn't you just add the uniquekeys the
> same as we do for base rels?
Yes, but since the uk of a childrel may not be useful for the parent rel
(the uk may exist in only one partition), I think we can skip that step in
the child rel case?

Best Regards
Andy Fan
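The three-part rule above (unique index must cover the partition key and exist on every partition) can be sketched as set logic. This is plain illustrative Python, not the patch's C code; the function and parameter names are hypothetical.

```python
def unique_index_valid_for_partitioned_table(index_cols, part_key_cols,
                                             partition_indexes):
    """index_cols: frozenset of columns of the candidate unique index.
    part_key_cols: frozenset of partition key columns.
    partition_indexes: one entry per partition, each a set of frozensets
    giving the column sets of that partition's unique indexes."""
    # Rule 2/3: the index must contain the partition key; otherwise it
    # proves uniqueness only within each partition.
    if not part_key_cols <= index_cols:
        return False
    # Rule 1/3: an equivalent unique index must exist on every partition.
    return all(index_cols in idxs for idxs in partition_indexes)


# The q100 example from the mail: unique index on (c) for q100_1 only,
# partition key is (b), so the index cannot be used for the parent.
part_key = frozenset({'b'})
q100_1_indexes = {frozenset({'c'})}   # q100_1 has a unique index on (c)
q100_2_indexes = set()                # q100_2 has no unique index
usable = unique_index_valid_for_partitioned_table(
    frozenset({'c'}), part_key, [q100_1_indexes, q100_2_indexes])
```

A unique index on (b, c) present on both partitions would pass both checks.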
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Wed, 18 Mar 2020 at 15:57, Andy Fan wrote:
> I'm now writing the code for partition index stuff, which
> is a bit of boring, since every partition may have different unique index.

Why is that case so different?

For a partitioned table to have a valid unique index, a unique index
must exist on each partition having columns that are a superset of the
partition key columns. An IndexOptInfo will exist on the partitioned
table's RelOptInfo, in this case.

At the leaf partition level, wouldn't you just add the uniquekeys the
same as we do for base rels? Maybe only do it if
enable_partitionwise_aggregation is on. Otherwise, I don't think we'll
currently have a need for them. Currently, we don't do unique joins for
partition-wise joins. Perhaps uniquekeys will be a good way to fix that
omission in the future.
Re: [PATCH] Erase the distinctClause if the result is unique by definition
Hi David:

Thanks for your time.

On Wed, Mar 18, 2020 at 9:56 AM David Rowley wrote:
> On Mon, 16 Mar 2020 at 06:01, Andy Fan wrote:
> > Hi All:
> >
> > I have re-implemented the patch based on David's suggestion/code. Looks
> > it works well. The updated patch mainly includes:
> >
> > 1. Maintain the not_null_colno in RelOptInfo, which includes the not
> > null from catalog and the not null from vars.
>
> What about non-nullability that we can derive from means other than
> NOT NULL constraints. Where will you track that now that you've
> removed the UniqueKeySet type?

I track it in deconstruct_recurse, just before the distribute_qual_to_rels
call:

+   ListCell   *lc;
+   foreach(lc, find_nonnullable_vars(qual))
+   {
+       Var        *var = lfirst_node(Var, lc);
+       RelOptInfo *rel = root->simple_rel_array[var->varno];
+       if (var->varattno > InvalidAttrNumber)
+           rel->not_null_cols = bms_add_member(rel->not_null_cols,
+                                               var->varattno);
+   }

> Traditionally we use attno or attnum rather than colno for variable
> names containing attribute numbers

Currently I use a list of Vars for a UniqueKey; I guess that is ok?

> > 3. postpone the propagate_unique_keys_to_joinrel call to
> > populate_joinrel_with_paths, since we know the jointype at that time,
> > so we can handle the semi/anti join specially.
>
> ok, but the join type was known already where I was calling the
> function from. It just wasn't passed to the function.
>
> > 4. Add the rule I suggested above: if both of the 2 relations yield a
> > unique result, the join result will be unique as well. The UK can be
> > ( (rel1_uk1, rel1_uk2).. )
>
> I see. So basically you're saying that the joinrel's uniquekeys should
> be the cartesian product of the unique rels from either side of the
> join. I wonder if that's a special case we need to worry about too
> much. Surely it only applies for clauseless joins

There are some other cases where we may need this as well :)
like

    select m1.pk, m2.pk from m1, m2 where m1.b = m2.b;

The cartesian product of the unique keys can make the unique keys too long,
so I maintain a UniqueKeyContext to keep them short. The idea is: if (UK1)
is unique already, there is no need to add another UK like (UK1, UK2),
which is just a superset of it.

> > 5. If the unique key is impossible to be referenced by others, we can
> > safely ignore it in order to keep the (join)rel->uniquekeys short.
>
> You could probably have an equivalent of has_useful_pathkeys() and
> pathkeys_useful_for_ordering()

Thanks for the suggestion, I will do so in the v5-xx patch.

> > 6. I only consider the not null check/opfamily check for the uniquekey
> > which comes from UniqueIndex. I think that should be correct.
> >
> > 7. I defined each uniquekey as a List of Expr, so I didn't introduce a
> > new node type.
>
> Where will you store the collation Oid? I left comments to mention
> that needed to be checked but just didn't wire it up.

This is embarrassing; I am not sure it is safe to ignore it. I removed it
for the following reasons (sorry that I didn't explain this carefully in
the last email).

1. When we check whether a UK is usable, we have a chance to compare the
collation info from the restrictinfo (where uk = 1) or the target list
(select uk from t) with the index's collation. The targetlist case is more
troublesome, since we need to figure out the default collation for it.
However, relation_has_unique_index_for is in the same situation as us, and
it ignores collation as well; see the comment /* XXX at some point we may
need to check collations here too. */. I think if there are reasons it can
ignore that, perhaps we can too.

2. What we expect from a UK is:

a). WHERE m1.uniquekey = m2.b: m2 will not be duplicated by this
joinclause. Here, if m1.uk has a different collation, it will raise a
run-time error.

b). WHERE m1.uniquekey collate '' = m2.b: we can't depend on the run-time
error this time.
But if we are sure that *if uk is unique at the default collation, then (uk
collate 'other-collation') is unique as well*, then we may safely ignore it
here too.

c). select uniquekey from t / select uniquekey collate '' from t. This has
the same requirement as item b).

3). It looks like maintaining the collation for our case is a big effort,
and users rarely use it. If my expectation for 2.b is not true, I'd prefer
to detect such a case (the user using a different collation) and simply
ignore the UK for it. But after all, I think this should be an open
question for now.

---

At last, I am so grateful for your suggestions/feedback; they are really
instructive and constructive. And thanks to Tom for the quick review and
the suggestion to add new fields to RelOptInfo; without it I don't think I
would have dared to add a new field to such an important struct. And
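The UniqueKeyContext pruning idea mentioned above (don't add (UK1, UK2) when (UK1) is already known unique) amounts to subset checks on the key sets. A minimal sketch in plain Python, not the patch's C code; `add_unique_key` is a hypothetical name:

```python
def add_unique_key(uniquekeys, candidate):
    """uniquekeys: list of frozensets of exprs already proven unique."""
    # If some existing key is a subset of the candidate, the candidate is
    # redundant: uniqueness on fewer columns implies uniqueness on more.
    if any(uk <= candidate for uk in uniquekeys):
        return uniquekeys
    # Conversely, drop any existing keys the candidate makes redundant.
    kept = [uk for uk in uniquekeys if not candidate <= uk]
    return kept + [candidate]


uks = []
uks = add_unique_key(uks, frozenset({'m1.pk', 'm2.pk'}))
uks = add_unique_key(uks, frozenset({'m1.pk'}))          # replaces the pair
uks = add_unique_key(uks, frozenset({'m1.pk', 'm1.b'}))  # superset, ignored
```

After the three calls only the single-column key (m1.pk) remains, keeping the list short.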
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Mon, 16 Mar 2020 at 06:01, Andy Fan wrote:
> Hi All:
>
> I have re-implemented the patch based on David's suggestion/code, Looks it
> works well. The updated patch mainly includes:
>
> 1. Maintain the not_null_colno in RelOptInfo, which includes the not null
> from catalog and the not null from vars.

What about non-nullability that we can derive from means other than
NOT NULL constraints. Where will you track that now that you've
removed the UniqueKeySet type?

Traditionally we use attno or attnum rather than colno for variable
names containing attribute numbers

> 3. postpone the propagate_unique_keys_to_joinrel call to
> populate_joinrel_with_paths since we know jointype at that time. so we
> can handle the semi/anti join specially.

ok, but the join type was known already where I was calling the
function from. It just wasn't passed to the function.

> 4. Add the rule I suggested above, if both of the 2 relation yields the a
> unique result, the join result will be unique as well. the UK can be
> ( (rel1_uk1, rel1_uk2).. )

I see. So basically you're saying that the joinrel's uniquekeys should
be the cartesian product of the unique rels from either side of the
join. I wonder if that's a special case we need to worry about too
much. Surely it only applies for clauseless joins.

> 5. If the unique key is impossible to be referenced by others, we can
> safely ignore it in order to keep the (join)rel->uniquekeys short.

You could probably have an equivalent of has_useful_pathkeys() and
pathkeys_useful_for_ordering()

> 6. I only consider the not null check/opfamily check for the uniquekey
> which comes from UniqueIndex. I think that should be correct.
>
> 7. I defined each uniquekey as List of Expr, so I didn't introduce new
> node type.

Where will you store the collation Oid? I left comments to mention
that needed to be checked but just didn't wire it up.
Re: [PATCH] Erase the distinctClause if the result is unique by definition
Hi All:

I have re-implemented the patch based on David's suggestion/code. Looks it
works well. The updated patch mainly includes:

1. Maintain the not_null_colno in RelOptInfo, which includes the not null
from the catalog and the not null from vars.

2. Add the restrictinfo check at populate_baserel_uniquekeys. If we are
sure about only 1 row being returned, I add each expr in
rel->reltarget->exprs as a unique key. E.g. for (select a, b, c from t
where pk = 1), the uk will be ( (a), (b), (c) ).

3. Postpone the propagate_unique_keys_to_joinrel call to
populate_joinrel_with_paths, since we know the jointype at that time, so we
can handle the semi/anti join specially.

4. Add the rule I suggested above: if both of the 2 relations yield a
unique result, the join result will be unique as well. The UK can be
( (rel1_uk1, rel1_uk2).. ).

5. If the unique key is impossible to be referenced by others, we can
safely ignore it in order to keep the (join)rel->uniquekeys short.

6. I only consider the not null check/opfamily check for the uniquekey
which comes from UniqueIndex. I think that should be correct.

7. I defined each uniquekey as a List of Expr, so I didn't introduce a new
node type.

8. Check the uniquekeys' information before create_distinct_paths and
create_group_paths: skip creating the new paths if the sortgroupclauses are
unique already, or else create them and add the new uniquekey to the
distinctrel/grouprel.

There are some things still in progress, like:

1. Partitioned tables.
2. union/union all.
3. Maybe refactor is_innerrel_unique_for/query_is_distinct_for to use
UniqueKey.
4. If we are sure the group-by clause is unique and we have an aggregation
call, maybe we should try Bapat's suggestion that we can use sort rather
than hash. The strategy sounds awesome, but I didn't check the details so
far.
5. A clearer commit message.
6. Anything more?

Any feedback is welcome. Thanks for any ideas, suggestions, or demo code!
Best Regards Andy Fan v4-0001-Patch-Bypass-distinctClause-groupbyClause-if-the-.patch Description: Binary data
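Item 2 of the patch summary above can be sketched very simply: when a relation is known to return at most one row (e.g. "pk = const"), every individual target expression is trivially a unique key on its own. Illustrative Python only; the function name is hypothetical.

```python
def uniquekeys_for_single_row_rel(target_exprs):
    """When the rel returns at most one row, each target expr by itself
    distinguishes that (at most one) row, so each is a unique key."""
    return [frozenset({e}) for e in target_exprs]


# select a, b, c from t where pk = 1  ->  uks = ( (a), (b), (c) )
uks = uniquekeys_for_single_row_rel(['a', 'b', 'c'])
```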
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Fri, Mar 13, 2020 at 11:46 AM David Rowley wrote:
> On Fri, 13 Mar 2020 at 14:47, Andy Fan wrote:
> > 1. for populate_baserel_uniquekeys, we need to handle the "pk = Const"
> > case as well (relation_has_unique_index_for has similar logic);
> > currently the following distinct path is still there.
>
> Yeah, I left a comment in propagate_unique_keys_to_joinrel() to
> mention that still needs to be done.
>
> > postgres=# explain select distinct b from t100 where pk = 1;
> >                                 QUERY PLAN
> > --------------------------------------------------------------------------
> >  Unique  (cost=8.18..8.19 rows=1 width=4)
> >    ->  Sort  (cost=8.18..8.19 rows=1 width=4)
> >          Sort Key: b
> >          ->  Index Scan using t100_pkey on t100  (cost=0.15..8.17 rows=1 width=4)
> >                Index Cond: (pk = 1)
> > (5 rows)
> >
> > I think in this case, we can add both (pk) and (b) as the UniquePaths.
> > If so we can get more opportunities to reach our goal.
>
> The UniqueKeySet containing "b" could only be added in the
> distinct_rel in the upper planner. It must not change the input_rel
> for the distinct.

I think we maintain UniqueKeys even without a distinct_rel, so at this
stage, can we say b is unique for this (is "no" possible)? If yes, we
probably need to set that information without considering the distinct
clause.

> It's likely best to steer clear of calling UniqueKeys UniquePaths as
> it might confuse people. The term "path" is used in PostgreSQL as a
> lightweight representation containing all the information required to
> build a plan node in createplan.c. More details in
> src/backend/optimizer/README.

OK.

> > 2. As for propagate_unique_keys_to_joinrel, we can add 1 more
> > UniquePath as (rel1_unique_paths, rel2_unique_paths) if the current
> > rules don't apply, or else the following cases can't be handled.
> > postgres=# explain select distinct t100.pk, t101.pk from t100, t101;
> >                                   QUERY PLAN
> > ------------------------------------------------------------------------------
> >  Unique  (cost=772674.11..810981.11 rows=5107600 width=8)
> >    ->  Sort  (cost=772674.11..785443.11 rows=5107600 width=8)
> >          Sort Key: t100.pk, t101.pk
> >          ->  Nested Loop  (cost=0.00..63915.85 rows=5107600 width=8)
> >                ->  Seq Scan on t100  (cost=0.00..32.60 rows=2260 width=4)
> >                ->  Materialize  (cost=0.00..43.90 rows=2260 width=4)
> >                      ->  Seq Scan on t101  (cost=0.00..32.60 rows=2260 width=4)
> > (7 rows)
>
> I don't really follow what you mean here. It seems to me there's no
> way we can skip doing DISTINCT in the case above. If you've just
> missed out the join clause and you meant to have "WHERE t100.pk =
> t101.pk", then we can likely fix that later with some sort of
> functional dependency tracking.

In the above case the result should be unique. The knowledge behind that
is: if *we join 2 unique results with any join method, the result is
unique as well*. In the above example, the final unique key is
(t100.pk, t101.pk).

> Likely we can just add a Relids field
> to UniqueKeySet to track which relids are functionally dependant on a
> row from the UniqueKeySet's uk_exprs. That might be as simple as just
> pull_varnos from the non-matched exprs and checking to ensure the
> result is a subset of functionally dependant rels. I'd need to give
> that more thought.
>
> Was this a case you had working in your patch?

I think we can do that now that I have your UniqueKey idea, so no, my
previous patch is not as smart as yours :)

> > But if we add such a rule, the unique keys probably become much longer,
> > so we need a strategy to tell if a UniqueKey is useful for our query;
> > if not, we can ignore it. rel->reltarget may be good info for such an
> > optimization. I think we can take this into consideration for
> > populate_baserel_uniquekeys as well.
>
> I don't really think the number of unique indexes in a base rel will
> really ever get out of hand for legitimate cases.
> propagate_unique_keys_to_joinrel is just concatenating baserel
> UniqueKeySets to the joinrel. They're not copied, so it's just tagging
> pointers onto the end of an array, which is at best a memcpy(), or at
> worst a realloc() then memcpy(). That's not so costly.

The memcpy is not the key concern here. My main point is that we need to
watch the length of RelOptInfo->uniquekeys. For example, suppose t has 3
uks like this: (uk1), (uk2), (uk3), and the query is

    select b from t where m = 1;

If so, there is no need to add these 3 to the UniqueKeys, so that we can
keep rel->uniquekeys shorter. The length of rel->uniquekeys may be a
concern if we add the rule I suggested above, the (t100.pk, t101.pk) case.
Think about this, for example:

1. select .. from t1, t2, t3, t4...;
2. suppose each table has 2 UniqueKeys, named (t{m}_uk{n});
3. follow my above rule (t1.pk1,
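The growth concern sketched in the t1..t4 example above is simple arithmetic: if a joinrel's unique keys included every combination of one key from each input, an n-way clauseless join of tables with k keys each would carry k**n composite keys. A toy illustration (plain Python, not planner code):

```python
def composite_key_count(num_tables, keys_per_table):
    """Number of composite unique keys if every joinrel key set is the
    cartesian combination of one key from each joined table."""
    return keys_per_table ** num_tables


# With 2 unique keys per table, the count doubles with each table joined.
growth = [composite_key_count(n, 2) for n in range(1, 5)]
```

This is why pruning keys that the query can never reference (the `select b from t where m = 1` example) matters.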
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Fri, 13 Mar 2020 at 14:47, Andy Fan wrote:
> 1. for populate_baserel_uniquekeys, we need handle the "pk = Const" as
> well. (relation_has_unique_index_for has a similar logic) currently the
> following distinct path is still there.

Yeah, I left a comment in propagate_unique_keys_to_joinrel() to
mention that still needs to be done.

> postgres=# explain select distinct b from t100 where pk = 1;
>                                 QUERY PLAN
> --------------------------------------------------------------------------
>  Unique  (cost=8.18..8.19 rows=1 width=4)
>    ->  Sort  (cost=8.18..8.19 rows=1 width=4)
>          Sort Key: b
>          ->  Index Scan using t100_pkey on t100  (cost=0.15..8.17 rows=1 width=4)
>                Index Cond: (pk = 1)
> (5 rows)
>
> I think in this case, we can add both (pk) and (b) as the UniquePaths.
> If so we can get more opportunities to reach our goal.

The UniqueKeySet containing "b" could only be added in the
distinct_rel in the upper planner. It must not change the input_rel
for the distinct.

It's likely best to steer clear of calling UniqueKeys UniquePaths as
it might confuse people. The term "path" is used in PostgreSQL as a
lightweight representation containing all the information required to
build a plan node in createplan.c. More details in
src/backend/optimizer/README.

> 2. As for the propagate_unique_keys_to_joinrel, we can add 1 more
> UniquePath as (rel1_unique_paths, rel2_unique_paths) if the current
> rules doesn't apply. or else the following cases can't be handled.
>
> postgres=# explain select distinct t100.pk, t101.pk from t100, t101;
>                                   QUERY PLAN
> ------------------------------------------------------------------------------
>  Unique  (cost=772674.11..810981.11 rows=5107600 width=8)
>    ->  Sort  (cost=772674.11..785443.11 rows=5107600 width=8)
>          Sort Key: t100.pk, t101.pk
>          ->  Nested Loop  (cost=0.00..63915.85 rows=5107600 width=8)
>                ->  Seq Scan on t100  (cost=0.00..32.60 rows=2260 width=4)
>                ->  Materialize  (cost=0.00..43.90 rows=2260 width=4)
>                      ->  Seq Scan on t101  (cost=0.00..32.60 rows=2260 width=4)
> (7 rows)

I don't really follow what you mean here. It seems to me there's no
way we can skip doing DISTINCT in the case above.
If you've just missed out the join clause and you meant to have "WHERE
t100.pk = t101.pk", then we can likely fix that later with some sort of
functional dependency tracking. Likely we can just add a Relids field to
UniqueKeySet to track which relids are functionally dependant on a row
from the UniqueKeySet's uk_exprs. That might be as simple as just
pull_varnos from the non-matched exprs and checking to ensure the result
is a subset of functionally dependant rels. I'd need to give that more
thought.

Was this a case you had working in your patch?

> But if we add such rule, the unique paths probably become much longer,
> so we need a strategy to tell if the UniquePath is useful for our query;
> if not, we can ignore that. rel->reltarget maybe a good info for such
> optimization. I think we can take this into consideration for
> populate_baserel_uniquekeys as well.

I don't really think the number of unique indexes in a base rel will
really ever get out of hand for legitimate cases.

propagate_unique_keys_to_joinrel is just concatenating baserel
UniqueKeySets to the joinrel. They're not copied, so it's just tagging
pointers onto the end of an array, which is at best a memcpy(), or at
worst a realloc() then memcpy(). That's not so costly.

> For the non_null info, Tom suggested to add maintain such info in
> RelOptInfo. I have done that for the not_null_info for the basic
> relation catalog. I think we can maintain the same flag for joinrel and
> the not null info from find_nonnullable_vars as well, but I still didn't
> find a good place to add that so far.

I'd considered just adding a get_notnull() function to lsyscache.c.
Just below get_attname() looks like a good spot. I imagined just setting
the bit in the UniqueKeySet's non_null_keys field corresponding to the
column position from the index.
I could see the benefit of having a field in RelOptInfo if there was some
way to determine the not-null properties of all columns in the table at
once, but there's not, so we're likely best just looking at the ones that
there are unique indexes on.

> A small question about the following code:
>
> + if (relation_has_uniquekeys_for(root, input_rel,
> +                                 get_sortgrouplist_exprs(parse->distinctClause,
> +                                                         parse->targetList),
> +                                 false))
> + {
> +     add_path(distinct_rel, (Path *) cheapest_input_path);
> +
> +     /* XXX yeah yeah, need to call the hooks etc. */
> +
> +     /* Now choose the best path(s) */
> +     set_cheapest(distinct_rel);
> +
> +     return distinct_rel;
> + }
>
> Since we don't create
Re: [PATCH] Erase the distinctClause if the result is unique by definition
Hi David:

On Thu, Mar 12, 2020 at 3:51 PM David Rowley wrote:
> On Wed, 11 Mar 2020 at 17:23, Andy Fan wrote:
> > Now I am convinced that we should maintain UniquePath on RelOptInfo,
> > I would see how to work with "Index Skip Scan" patch.
>
> I've attached a very early proof of concept patch for unique keys.
> The NULL detection stuff is not yet hooked up, so it'll currently do
> the wrong thing for NULLable columns. I've left some code in there
> with my current idea of how to handle that, but I'll need to add more
> code both to look at the catalogue tables to see if there's a NOT NULL
> constraint and also to check for strict quals that filter out NULLs.
>
> Additionally, I've not hooked up the collation checking stuff yet. I
> just wanted to see if it would work ok for non-collatable types first.
>
> I've added a couple of lines to create_distinct_paths() to check if
> the input_rel has the required UniqueKeys to skip doing the DISTINCT.
> It seems to work, but my tests so far are limited to:
>
> create table t1(a int primary key, b int);
> create table t2(a int primary key, b int);
>
> postgres=# -- t2 could duplicate t1, don't remove DISTINCT
> postgres=# explain (costs off) select distinct t1.a from t1 inner join
> t2 on t1.a = t2.b;
>            QUERY PLAN
> ----------------------------------
>  HashAggregate
>    Group Key: t1.a
>    ->  Hash Join
>          Hash Cond: (t2.b = t1.a)
>          ->  Seq Scan on t2
>          ->  Hash
>                ->  Seq Scan on t1
> (7 rows)
>
> postgres=# -- neither rel can duplicate the other due to join on PK.
> Remove DISTINCT
> postgres=# explain (costs off) select distinct t1.a from t1 inner join
> t2 on t1.a = t2.a;
>          QUERY PLAN
> -----------------------------
>  Hash Join
>    Hash Cond: (t1.a = t2.a)
>    ->  Seq Scan on t1
>    ->  Hash
>          ->  Seq Scan on t2
> (5 rows)
>
> postgres=# -- t2.a cannot duplicate t1 and t1.a is unique.
> Remove DISTINCT
> postgres=# explain (costs off) select distinct t1.a from t1 inner join
> t2 on t1.b = t2.a;
>          QUERY PLAN
> -----------------------------
>  Hash Join
>    Hash Cond: (t1.b = t2.a)
>    ->  Seq Scan on t1
>    ->  Hash
>          ->  Seq Scan on t2
> (5 rows)
>
> postgres=# -- t1.b can duplicate t2.a. Don't remove DISTINCT
> postgres=# explain (costs off) select distinct t2.a from t1 inner join
> t2 on t1.b = t2.a;
>            QUERY PLAN
> ----------------------------------
>  HashAggregate
>    Group Key: t2.a
>    ->  Hash Join
>          Hash Cond: (t1.b = t2.a)
>          ->  Seq Scan on t1
>          ->  Hash
>                ->  Seq Scan on t2
> (7 rows)
>
> postgres=# -- t1.a cannot duplicate t2.a. Remove DISTINCT.
> postgres=# explain (costs off) select distinct t2.a from t1 inner join
> t2 on t1.a = t2.b;
>          QUERY PLAN
> -----------------------------
>  Hash Join
>    Hash Cond: (t2.b = t1.a)
>    ->  Seq Scan on t2
>    ->  Hash
>          ->  Seq Scan on t1
> (5 rows)
>
> I've also left a bunch of XXX comments for things that I know need more
> thought.
>
> I believe we can propagate the joinrel's unique keys where the patch
> is currently doing it. I understand that in
> populate_joinrel_with_paths() we do things like swapping LEFT JOINs
> for RIGHT JOINs and switch the input rels around, but we do so only
> because it's equivalent, so I don't currently see why we can't take
> the jointype for the SpecialJoinInfo. I need to know that as I'll need
> to ignore pushed down RestrictInfos for outer joins.
>
> I'm posting now as I know I've been mentioning this UniqueKeys idea
> for quite a while and if it's not something that's going to get off
> the ground, then it's better to figure that out now.

Thanks for the code! Here are some points from me.

1. For populate_baserel_uniquekeys, we need to handle the "pk = Const"
case as well (relation_has_unique_index_for has similar logic); currently
the following distinct path is still there.
postgres=# explain select distinct b from t100 where pk = 1;
                                QUERY PLAN
--------------------------------------------------------------------------
 Unique  (cost=8.18..8.19 rows=1 width=4)
   ->  Sort  (cost=8.18..8.19 rows=1 width=4)
         Sort Key: b
         ->  Index Scan using t100_pkey on t100  (cost=0.15..8.17 rows=1 width=4)
               Index Cond: (pk = 1)
(5 rows)

I think in this case, we can add both (pk) and (b) as the UniquePaths. If
so we can get more opportunities to reach our goal.

2. As for propagate_unique_keys_to_joinrel, we can add 1 more UniquePath
as (rel1_unique_paths, rel2_unique_paths) if the current rules don't
apply, or else the following cases can't be handled.

postgres=# explain select distinct t100.pk, t101.pk from t100, t101;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Unique  (cost=772674.11..810981.11 rows=5107600 width=8)
   ->  Sort
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Wed, 11 Mar 2020 at 17:23, Andy Fan wrote:
> Now I am convinced that we should maintain UniquePath on RelOptInfo,
> I would see how to work with "Index Skip Scan" patch.

I've attached a very early proof of concept patch for unique keys.
The NULL detection stuff is not yet hooked up, so it'll currently do
the wrong thing for NULLable columns. I've left some code in there
with my current idea of how to handle that, but I'll need to add more
code both to look at the catalogue tables to see if there's a NOT NULL
constraint and also to check for strict quals that filter out NULLs.

Additionally, I've not hooked up the collation checking stuff yet. I
just wanted to see if it would work ok for non-collatable types first.

I've added a couple of lines to create_distinct_paths() to check if
the input_rel has the required UniqueKeys to skip doing the DISTINCT.
It seems to work, but my tests so far are limited to:

create table t1(a int primary key, b int);
create table t2(a int primary key, b int);

postgres=# -- t2 could duplicate t1, don't remove DISTINCT
postgres=# explain (costs off) select distinct t1.a from t1 inner join
t2 on t1.a = t2.b;
           QUERY PLAN
----------------------------------
 HashAggregate
   Group Key: t1.a
   ->  Hash Join
         Hash Cond: (t2.b = t1.a)
         ->  Seq Scan on t2
         ->  Hash
               ->  Seq Scan on t1
(7 rows)

postgres=# -- neither rel can duplicate the other due to join on PK.
Remove DISTINCT
postgres=# explain (costs off) select distinct t1.a from t1 inner join
t2 on t1.a = t2.a;
         QUERY PLAN
-----------------------------
 Hash Join
   Hash Cond: (t1.a = t2.a)
   ->  Seq Scan on t1
   ->  Hash
         ->  Seq Scan on t2
(5 rows)

postgres=# -- t2.a cannot duplicate t1 and t1.a is unique. Remove DISTINCT
postgres=# explain (costs off) select distinct t1.a from t1 inner join
t2 on t1.b = t2.a;
         QUERY PLAN
-----------------------------
 Hash Join
   Hash Cond: (t1.b = t2.a)
   ->  Seq Scan on t1
   ->  Hash
         ->  Seq Scan on t2
(5 rows)

postgres=# -- t1.b can duplicate t2.a.
Don't remove DISTINCT
postgres=# explain (costs off) select distinct t2.a from t1 inner join
t2 on t1.b = t2.a;
           QUERY PLAN
----------------------------------
 HashAggregate
   Group Key: t2.a
   ->  Hash Join
         Hash Cond: (t1.b = t2.a)
         ->  Seq Scan on t1
         ->  Hash
               ->  Seq Scan on t2
(7 rows)

postgres=# -- t1.a cannot duplicate t2.a. Remove DISTINCT.
postgres=# explain (costs off) select distinct t2.a from t1 inner join
t2 on t1.a = t2.b;
         QUERY PLAN
-----------------------------
 Hash Join
   Hash Cond: (t2.b = t1.a)
   ->  Seq Scan on t2
   ->  Hash
         ->  Seq Scan on t1
(5 rows)

I've also left a bunch of XXX comments for things that I know need more
thought.

I believe we can propagate the joinrel's unique keys where the patch
is currently doing it. I understand that in
populate_joinrel_with_paths() we do things like swapping LEFT JOINs
for RIGHT JOINs and switch the input rels around, but we do so only
because it's equivalent, so I don't currently see why we can't take
the jointype for the SpecialJoinInfo. I need to know that as I'll need
to ignore pushed down RestrictInfos for outer joins.

I'm posting now as I know I've been mentioning this UniqueKeys idea
for quite a while and if it's not something that's going to get off
the ground, then it's better to figure that out now.

bearly_poc_uniquekeys_v0.patch
Description: Binary data
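The check the proof-of-concept performs in create_distinct_paths() can be sketched as set containment: DISTINCT is a no-op when the distinct expressions are a superset of some unique key of the input relation. Illustrative Python mirroring the idea only; the names here are hypothetical, not the patch's API.

```python
def distinct_is_noop(distinct_exprs, uniquekeys):
    """DISTINCT can be skipped if some unique key of the input rel is
    contained in the distinct expression list: rows already cannot
    repeat on those expressions."""
    exprs = frozenset(distinct_exprs)
    return any(uk <= exprs for uk in uniquekeys)


# Join on t1.b = t2.a: t2.a is unique, so t2 cannot duplicate t1's rows
# and t1's key (t1.a) survives into the joinrel; t2's does not.
join_keys = [frozenset({'t1.a'})]
removable = distinct_is_noop({'t1.a'}, join_keys)   # DISTINCT t1.a: skip it
kept = distinct_is_noop({'t2.a'}, join_keys)        # DISTINCT t2.a: keep it
```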
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Wed, Mar 11, 2020 at 4:19 AM David Rowley wrote:
> On Wed, 11 Mar 2020 at 02:50, Ashutosh Bapat wrote:
> > On Tue, Mar 10, 2020 at 1:49 PM Andy Fan wrote:
> > > In my current implementation, it calculates the uniqueness for each
> > > BaseRel only, but in your way, looks we need to calculate the
> > > UniquePathKey for both BaseRel and JoinRel. This makes more
> > > difference for multi table join.
> >
> > I didn't understand this concern. I think, it would be better to do it
> > for all kinds of relation types base, join etc. This way we are sure
> > that one method works across the planner to eliminate the need for
> > Distinct or grouping. If we just implement something for base
> > relations right now and don't do that for joins, there is a chance
> > that that method may not work for joins when we come to implement it.
>
> Yeah, it seems to me that we're seeing more and more features that
> require knowledge of uniqueness of a RelOptInfo. The skip scans patch
> needs to know if a join will cause row duplication so it knows if the
> skip scan path can be joined to without messing up the uniqueness of
> the skip scan. Adding more and more places that loop over the rel's
> indexlist just does not seem the right way to do it, especially so
> when you have to dissect the join rel down to its base rel components
> to check which indexes there are. Having the knowledge on-hand at the
> RelOptInfo level means we no longer have to look at indexes for unique
> proofs.

+1. Yep. When we break a join down to its base relations, partitioned
relations pose another challenge: a partitioned relation may not have an
index on it per se, but each partition may have one whose index key
happens to be part of the partition key. That case would be easier to
track through RelOptInfo than by breaking a base rel down into its child
rels.
> > > Another concern is UniquePathKey
> > > is designed for a general purpose, we need to maintain it no matter
> > > distinctClause/groupbyClause.
> >
> > This should be ok. The time spent in annotating a RelOptInfo about
> > uniqueness is not going to be a lot. But doing so would help generic
> > elimination of Distinct/Group/Unique operations. What is
> > UniquePathKey; I didn't find this in your patch or in the code.
>
> Possibly a misinterpretation. There is some overlap between this patch
> and the skip scan patch; both would like to skip doing explicit work
> to implement DISTINCT. Skip scans just go about it by adding new path
> types that scan the index and only gather up unique values. In that
> case, the RelOptInfo won't contain the unique keys, but the skip scan
> path will. How I imagine both of these patches working together in
> create_distinct_paths() is that we first check if the DISTINCT clause
> is a superset of some of the RelOptInfo's unique keys (this patch),
> else we check if there are any paths with uniquekeys that we can use
> to perform a no-op on the DISTINCT clause (the skip scan patch); if
> none of those apply, we create the required paths to uniquify the
> results.

Looks good to me. But I have not seen the index skip patch.

--
Best Wishes,
Ashutosh Bapat
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Tue, Mar 10, 2020 at 9:12 PM Andy Fan wrote: > > > Hi Tom & David & Bapat: > > Thanks for your review so far. I want to summarize the current issues to help > our following discussion. > > 1. Shall we bypass the AggNode as well with the same logic. > > I think yes, since the rules to bypass a AggNode and UniqueNode is exactly > same. > The difficulty of bypassing AggNode is the current aggregation function call > is closely > coupled with AggNode. In the past few days, I have make the aggregation call > can > run without AggNode (at least I tested sum(without finalized fn), avg (with > finalized fn)). > But there are a few things to do, like acl check, anynull check and maybe > more check. > also there are some MemoryContext mess up need to fix. > I still need some time for this goal, so I think the complex of it deserves > another thread > to discuss it, any thought? I think that if the relation underlying an Agg node is known to be unique for the given groupByClause, we could safely use the AGG_SORTED strategy. Even though the input is not ordered, it is unique, so for every input row the Agg node can combine and finalize the aggregate result immediately. -- Best Wishes, Ashutosh Bapat
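To illustrate Ashutosh's point: if the input is already unique on the grouping columns, every row forms its own group, so an AGG_SORTED-style aggregate can run the transition step once per row and finalize immediately, with no sort or hash table. A toy Python sketch of that idea (the function names and row format are illustrative, not PostgreSQL code):

```python
def agg_over_unique_input(rows, group_col, init, trans, final):
    """Aggregate rows known to be unique on group_col: each row is a
    complete group, so the state can be finalized per row, streaming."""
    for row in rows:
        state = trans(init, row)          # single transition per group
        yield (row[group_col], final(state))

# sum(v) GROUP BY id, where id is unique: no grouping work is needed
rows = [{"id": 1, "v": 10}, {"id": 2, "v": 20}]
result = list(agg_over_unique_input(rows, "id", 0,
                                    lambda s, r: s + r["v"], lambda s: s))
print(result)  # [(1, 10), (2, 20)]
```

The point of the sketch is only that no row ever needs to wait for another row of the same group, which is exactly what makes the sorted-aggregate strategy safe on unique input.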
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Wed, Mar 11, 2020 at 6:49 AM David Rowley wrote: > On Wed, 11 Mar 2020 at 02:50, Ashutosh Bapat > wrote: > > > > On Tue, Mar 10, 2020 at 1:49 PM Andy Fan > wrote: > > > In my current implementation, it calculates the uniqueness for each > > > BaseRel only, but in your way, looks we need to calculate the > > > UniquePathKey for both BaseRel and JoinRel. This makes more > > > difference for multi table join. > > > > I didn't understand this concern. I think, it would be better to do it > > for all kinds of relation types base, join etc. This way we are sure > > that one method works across the planner to eliminate the need for > > Distinct or grouping. If we just implement something for base > > relations right now and don't do that for joins, there is a chance > > that that method may not work for joins when we come to implement it. > > Yeah, it seems to me that we're seeing more and more features that > require knowledge of uniqueness of a RelOptInfo. The skip scans patch > needs to know if a join will cause row duplication so it knows if the > skip scan path can be joined to without messing up the uniqueness of > the skip scan. Adding more and more places that loop over the rel's > indexlist just does not seem the right way to do it, especially so > when you have to dissect the join rel down to its base rel components > to check which indexes there are. Having the knowledge on-hand at the > RelOptInfo level means we no longer have to look at indexes for unique > proofs. > > > > Another concern is UniquePathKey > > > is designed for a general purpose, we need to maintain it no matter > > > distinctClause/groupbyClause. > > > > This should be ok. The time spent in annotating a RelOptInfo about > > uniqueness is not going to be a lot. But doing so would help generic > > elimination of Distinct/Group/Unique operations. What is > > UniquePathKey; I didn't find this in your patch or in the code. > > Possibly a misinterpretation. 
There is some overlap between this patch > and the skip scan patch, both would like to skip doing explicit work > to implement DISTINCT. Skip scans just go about it by adding new path > types that scan the index and only gather up unique values. In that > case, the RelOptInfo won't contain the unique keys, but the skip scan > path will. How I imagine both of these patches working together in > create_distinct_paths() is that we first check if the DISTINCT clause > is a subset of the set of the RelOptInfo's unique keys (this patch), > else we check if there are any paths with uniquekeys that we can use > to perform a no-op on the DISTINCT clause (the skip scan patch); if > none of those apply, we create the required paths to uniquify the > results. > Now I am convinced that we should maintain UniquePath on RelOptInfo; I will see how to make it work with the "Index Skip Scan" patch.
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Tue, 10 Mar 2020 at 21:19, Andy Fan wrote: > >> SELECT DISTINCT max(non_unique) FROM t1; to skip doing the DISTINCT part. > > > Actually I want to keep the distinct for this case now. One reason is there > are only 1 > row returned, so the distinct erasing can't help much. The more important > reason is > Query->hasAggs is true for "select distinct (select count(*) filter (where > t2.c2 = 6 > and t2.c1 < 10) from ft1 t1 where t1.c1 = 6) from ft2 t2 where t2.c2 % 6 = 0 > order by 1;" > (this sql came from postgres_fdw.sql). I think that sort of view is part of the problem here. If you invent some new way to detect uniqueness that does not cover that case, then we have more code with more possible places to have bugs. FWIW, query_is_distinct_for() does detect that case with:

/*
 * If we have no GROUP BY, but do have aggregates or HAVING, then the
 * result is at most one row so it's surely unique, for any operators.
 */
if (query->hasAggs || query->havingQual)
    return true;

which can be seen by the fact that the following finds the unique join on t2:

postgres=# explain verbose select * from t1 inner join (select count(*) c from t1) t2 on t1.a=t2.c;
                                  QUERY PLAN
 Hash Join  (cost=41.91..84.25 rows=13 width=12)
   Output: t1.a, (count(*))
   Inner Unique: true
   Hash Cond: (t1.a = (count(*)))
   ->  Seq Scan on public.t1  (cost=0.00..35.50 rows=2550 width=4)
         Output: t1.a
   ->  Hash  (cost=41.89..41.89 rows=1 width=8)
         Output: (count(*))
         ->  Aggregate  (cost=41.88..41.88 rows=1 width=8)
               Output: count(*)
               ->  Seq Scan on public.t1 t1_1  (cost=0.00..35.50 rows=2550 width=0)
                     Output: t1_1.a
(12 rows)

It will be very simple to add an empty List of UniqueKeys to the GROUP BY's RelOptInfo to indicate that all expressions are unique. That way, any code that checks whether some of the RelOptInfo's unique keys are a subset of the expressions it would like to prove unique will get a match. It does not really matter how much effort is saved in your example above.
The UniqueKey infrastructure cannot know in advance how much effort properly adding all the uniquekeys will save. It should just add all the keys it can and let whichever code cares about them reap the benefits.
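The two rules David appeals to here can be captured in a few lines: a DISTINCT is a no-op when some known unique key is a subset of the DISTINCT clause, and an empty unique key (the at-most-one-row case, e.g. an aggregate without GROUP BY) is trivially a subset of everything. A hypothetical Python sketch of that subset test (PostgreSQL would of course do this in C over expression lists):

```python
def distinct_is_noop(unique_keys, distinct_exprs):
    """True if some unique key of the rel is a subset of the DISTINCT
    clause.  An empty key means the rel returns at most one row, and the
    empty set is a subset of anything, so DISTINCT is always a no-op."""
    return any(set(key) <= set(distinct_exprs) for key in unique_keys)

assert distinct_is_noop([["pk"]], ["pk", "b"])     # key covered by clause
assert not distinct_is_noop([["pk"]], ["b"])       # key not in the clause
assert distinct_is_noop([[]], ["max_col"])         # <= 1 row: always no-op
assert not distinct_is_noop([], ["b"])             # nothing proven unique
```

The third assertion is the "empty List of UniqueKeys" trick from the message above: attaching an empty key to the GROUP BY's RelOptInfo makes every later subset check succeed.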
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Wed, 11 Mar 2020 at 02:50, Ashutosh Bapat wrote: > > On Tue, Mar 10, 2020 at 1:49 PM Andy Fan wrote: > > In my current implementation, it calculates the uniqueness for each > > BaseRel only, but in your way, looks we need to calculate the > > UniquePathKey for both BaseRel and JoinRel. This makes more > > difference for multi table join. > > I didn't understand this concern. I think, it would be better to do it > for all kinds of relation types base, join etc. This way we are sure > that one method works across the planner to eliminate the need for > Distinct or grouping. If we just implement something for base > relations right now and don't do that for joins, there is a chance > that that method may not work for joins when we come to implement it. Yeah, it seems to me that we're seeing more and more features that require knowledge of uniqueness of a RelOptInfo. The skip scans patch needs to know if a join will cause row duplication so it knows if the skip scan path can be joined to without messing up the uniqueness of the skip scan. Adding more and more places that loop over the rel's indexlist just does not seem the right way to do it, especially so when you have to dissect the join rel down to its base rel components to check which indexes there are. Having the knowledge on-hand at the RelOptInfo level means we no longer have to look at indexes for unique proofs. > > Another concern is UniquePathKey > > is designed for a general purpose, we need to maintain it no matter > > distinctClause/groupbyClause. > > This should be ok. The time spent in annotating a RelOptInfo about > uniqueness is not going to be a lot. But doing so would help generic > elimination of Distinct/Group/Unique operations. What is > UniquePathKey; I didn't find this in your patch or in the code. Possibly a misinterpretation. There is some overlap between this patch and the skip scan patch, both would like to skip doing explicit work to implement DISTINCT. 
Skip scans just go about it by adding new path types that scan the index and only gather up unique values. In that case, the RelOptInfo won't contain the unique keys, but the skip scan path will. How I imagine both of these patches working together in create_distinct_paths() is that we first check if the DISTINCT clause is a subset of the set of the RelOptInfo's unique keys (this patch), else we check if there are any paths with uniquekeys that we can use to perform a no-op on the DISTINCT clause (the skip scan patch); if none of those apply, we create the required paths to uniquify the results.
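The three-step order David describes for create_distinct_paths() can be sketched as a simple cascade (a hypothetical Python outline; names like a `uniquekeys` field on a path dict are illustrative, not the actual patch API):

```python
def plan_distinct(rel_unique_keys, paths, distinct_exprs):
    """Order of checks discussed for create_distinct_paths():
    1. rel-level unique keys already prove the result distinct (this patch),
    2. else some path (e.g. a skip scan) already yields unique rows,
    3. else build explicit uniquification paths."""
    if any(set(k) <= set(distinct_exprs) for k in rel_unique_keys):
        return "no-op: rel is unique on the DISTINCT columns"
    for path in paths:
        keys = path.get("uniquekeys")
        if keys and set(keys) <= set(distinct_exprs):
            return "no-op: reuse path " + path["name"]
    return "uniquify: add Sort+Unique or HashAggregate paths"

print(plan_distinct([["pk"]], [], ["pk"]))   # step 1 applies
print(plan_distinct([], [{"name": "skipscan", "uniquekeys": ["a"]}], ["a"]))
print(plan_distinct([], [], ["a"]))          # must uniquify
```

The rel-level check comes first because it is a property of the whole relation; the path-level check is weaker, since only that particular path may be chosen by the planner.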
Re: [PATCH] Erase the distinctClause if the result is unique by definition
Hi Tom & David & Bapat: Thanks for your review so far. I want to summarize the current issues to help our following discussion. 1. Shall we bypass the AggNode as well with the same logic? I think yes, since the rules for bypassing an AggNode and a UniqueNode are exactly the same. The difficulty in bypassing AggNode is that the current aggregation function call is closely coupled with AggNode. In the past few days, I have made the aggregation call able to run without AggNode (at least I tested sum (without a finalize fn) and avg (with a finalize fn)). But there are a few things left to do, like the ACL check, the any-null check and maybe more checks; there is also some MemoryContext mess-up to fix. I still need some time for this goal, so I think its complexity deserves another thread to discuss it; any thoughts? 2. Shall we use the UniquePath as David suggested? Actually I am leaning toward this way now. David, can you share more insights about the benefits of UniquePath? Costing should be one of them; another may be changing a semi join to a normal join, as the current innerrel_is_unique does. Any others? 3. Can we make the rule more general? Currently it requires every relation to yield a unique result. David & Bapat provided another example: select m2.pk from m1, m2 where m1.pk = m2.non_unique_key. That's interesting and not easy to handle in my current framework. This is another reason I want to take the UniquePath framework. Do we have any other rules to think about before implementing it? Thanks for your feedback. > This should be ok. The time spent in annotating a RelOptInfo about > uniqueness is not going to be a lot. But doing so would help generic > elimination of Distinct/Group/Unique operations. What is > UniquePathKey; I didn't find this in your patch or in the code. > > This is a proposal from David, so it is not in the current patch/code :) Regards Andy Fan
Re: [PATCH] Erase the distinctClause if the result is unique by definition
Hi Andy, On Tue, Mar 10, 2020 at 1:49 PM Andy Fan wrote: > > Hi David: > >> >> 3. Perhaps in add_paths_to_joinrel(), or maybe when creating the join >> rel itself (I've not looked for the best location in detail), >> determine if the join can cause rows to be duplicated. If it can't, >> then add the UniqueKeys from that rel. > > > I have some concerns about this method, maybe I misunderstand > something, if so, please advise. > > In my current implementation, it calculates the uniqueness for each > BaseRel only, but in your way, looks we need to calculate the > UniquePathKey for both BaseRel and JoinRel. This makes more > difference for multi table join. I didn't understand this concern. I think, it would be better to do it for all kinds of relation types base, join etc. This way we are sure that one method works across the planner to eliminate the need for Distinct or grouping. If we just implement something for base relations right now and don't do that for joins, there is a chance that that method may not work for joins when we come to implement it. > Another concern is UniquePathKey > is designed for a general purpose, we need to maintain it no matter > distinctClause/groupbyClause. This should be ok. The time spent in annotating a RelOptInfo about uniqueness is not going to be a lot. But doing so would help generic elimination of Distinct/Group/Unique operations. What is UniquePathKey; I didn't find this in your patch or in the code. > > >> >> For example: SELECT * FROM t1 >> INNER JOIN t2 ON t1.unique = t2.not_unique; would have the joinrel for >> {t1,t2} only take the unique keys from t2 (t1 can't duplicate t2 rows >> since it's an eqijoin and t1.unique has a unique index). > > > Thanks for raising this. My current rule requires *every* relation yields a > unique result and *no matter* with the join method. 
Actually I want to make > the rule less strict, for example, we may just need 1 table yields unique > result > and the final result will be unique as well under some join type. That is desirable. > > As for the t1 INNER JOIN t2 ON t1.unique = t2.not_unique; looks we can't > remove the distinct based on this. > > create table m1(a int primary key, b int); > create table m2(a int primary key, b int); > insert into m1 values(1, 1), (2, 1); > insert into m2 values(1, 1), (2, 1); > select distinct m1.a from m1, m2 where m1.a = m2.b; IIUC, David's rule is the other way round. "select distinct m2.a from m1, m2 where m1.a = m2.b" won't need a DISTINCT node, since the result of joining m1 and m2 has a unique value of m2.a for each row. In your example the join will produce two rows (m1.a, m1.b, m2.a, m2.b): (1, 1, 1, 1) and (1, 1, 2, 1), where m2.a is the unique key. -- Best Wishes, Ashutosh Bapat
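Ashutosh's asymmetry is easy to verify by simulating the equijoin by hand (a toy Python re-run of the m1/m2 example above, not planner code): the join on m1.a = m2.b duplicates m1.a, but every output row still carries a distinct m2.a.

```python
m1 = [(1, 1), (2, 1)]          # (a, b); a is the primary key
m2 = [(1, 1), (2, 1)]

# inner join on m1.a = m2.b
joined = [(r1, r2) for r1 in m1 for r2 in m2 if r1[0] == r2[1]]

m1_a = [r1[0] for r1, _ in joined]
m2_a = [r2[0] for _, r2 in joined]
print(m1_a)  # [1, 1]  -> m1.a is duplicated, DISTINCT m1.a is still needed
print(m2_a)  # [1, 2]  -> m2.a stays unique, DISTINCT m2.a would be a no-op
```

The side whose unique column appears in the join condition (m1.a) is exactly the side that cannot be duplicated by the join, which is why only the *other* side's unique keys survive.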
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Tue, Mar 10, 2020 at 3:51 AM David Rowley wrote: > On Sat, 7 Mar 2020 at 00:47, Andy Fan wrote: > > Upload the newest patch so that the cfbot can pass. The last patch > failed > > because some explain without the (cost off). > > I've only really glanced at this patch, but I think we need to do this > in a completely different way. > > I've been mentioning UniqueKeys around this mailing list for quite a > while now [1]. To summarise the idea: > > 1. Add a new List field to RelOptInfo named unique_keys > 2. During get_relation_info() process the base relation's unique > indexes and add to the RelOptInfo's unique_keys list the indexed > expressions from each unique index (this may need to be delayed until > check_index_predicates() since predOK is only set there) > 3. Perhaps in add_paths_to_joinrel(), or maybe when creating the join > rel itself (I've not looked for the best location in detail), > build_*_join_rel() will be a good place for this. The paths created might take advantage of this information for costing. > determine if the join can cause rows to be duplicated. If it can't, > then add the UniqueKeys from that rel. For example: SELECT * FROM t1 > INNER JOIN t2 ON t1.unique = t2.not_unique; would have the joinrel for > {t1,t2} only take the unique keys from t2 (t1 can't duplicate t2 rows > since it's an eqijoin and t1.unique has a unique index). this is interesting. > If the > condition was t1.unique = t2.unique then we could take the unique keys > from both sides of the join, and with t1.non_unique = t2.non_unique, > we can take neither. > 4. When creating the GROUP BY paths (when there are no aggregates), > don't bother doing anything if the input rel's unique keys are a > subset of the GROUP BY clause. Otherwise, create the group by paths > and tag the new unique keys onto the GROUP BY rel. > 5. When creating the DISTINCT paths, don't bother if the input rel has > unique keys are a subset of the distinct clause. 
> Thanks for laying this out in more detail. A few more cases can be added to this: 6. When creating a RelOptInfo for a grouped/aggregated result, if all the columns of the group by clause are part of the result, i.e. the targetlist, the columns in the group by clause serve as the unique keys of the result. So the corresponding RelOptInfo can be marked as such. 7. The result of DISTINCT is unique for the columns contained in the DISTINCT clause. Hence we can add those columns to the unique_key of the RelOptInfo representing the result of the distinct clause. 8. If each partition of a partitioned table has a unique key with the same columns in it, and that happens to be a superset of the partition key, then the whole partitioned table gets that unique key as well. With this we could actually pass the uniqueness information through Subquery scans as well, and the overall query will benefit from that. > > 4 and 5 will mean that: SELECT DISTINCT non_unique FROM t1 GROUP BY > non_unique will just uniquify once for the GROUP BY and not for the > distinct. SELECT DISTINCT unique FROM t1 GROUP BY unique; won't do > anything to uniquify the results. > > Because both 4 and 5 require that the uniquekeys are a subset of the > distinct/group by clause, an empty uniquekey set would mean that the > RelOptInfo returns no more than 1 row. That would allow your: > > SELECT DISTINCT max(non_unique) FROM t1; to skip doing the DISTINCT part. > > There's a separate effort in > https://commitfest.postgresql.org/27/1741/ to implement some parts of > the uniquekeys idea. However the implementation currently only covers > adding the unique keys to Paths, not to RelOptInfos. > I haven't looked at that patch, but as discussed upthread, in this case we want the uniqueness associated with the RelOptInfo and not the path. > > I also believe that the existing code in analyzejoins.c should be > edited to make use of unique keys rather than how it looks at unique > indexes and group by / distinct clauses. > +1.
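Case 8 boils down to a containment check: a unique key lifts from the partitions to the partitioned table only when the same key exists on every partition and it contains the partition key, so that no two partitions can hold the same key value. A hypothetical sketch, using plain column-name lists in place of index expressions:

```python
def lift_unique_key(partition_key, per_partition_unique_keys):
    """Return a unique key valid for the whole partitioned table, or None.
    per_partition_unique_keys: for each partition, the list of its unique
    index column sets.  A key lifts only if it exists on every partition
    and is a superset of the partition key."""
    if not per_partition_unique_keys:
        return None
    common = set.intersection(
        *({frozenset(k) for k in keys} for keys in per_partition_unique_keys))
    for key in common:
        if set(partition_key) <= key:
            return sorted(key)
    return None

# table partitioned by (b); a unique index on c for one partition only:
assert lift_unique_key(["b"], [[["c"]], []]) is None
# unique (c) on every partition, but it lacks the partition key:
assert lift_unique_key(["b"], [[["c"]], [["c"]]]) is None
# unique (b, c) everywhere, containing the partition key: it lifts
assert lift_unique_key(["b"], [[["b", "c"]], [["b", "c"]]]) == ["b", "c"]
```

The second case is Andy's q100 example from the head of the thread: a per-partition unique index on (c) proves uniqueness only locally, because the same c value may appear in several partitions.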
-- Best Wishes, Ashutosh Bapat
Re: [PATCH] Erase the distinctClause if the result is unique by definition
Hi David: > 3. Perhaps in add_paths_to_joinrel(), or maybe when creating the join > rel itself (I've not looked for the best location in detail), > determine if the join can cause rows to be duplicated. If it can't, > then add the UniqueKeys from that rel. I have some concerns about this method; maybe I misunderstand something, and if so, please advise. In my current implementation, it calculates the uniqueness for each BaseRel only, but in your way, it looks like we need to calculate the UniquePathKey for both BaseRel and JoinRel. This makes more difference for multi-table joins. Another concern is that UniquePathKey is designed for a general purpose, so we need to maintain it whether or not there is a distinctClause/groupbyClause. > For example: SELECT * FROM t1 > INNER JOIN t2 ON t1.unique = t2.not_unique; would have the joinrel for > {t1,t2} only take the unique keys from t2 (t1 can't duplicate t2 rows > since it's an equijoin and t1.unique has a unique index). > Thanks for raising this. My current rule requires that *every* relation yield a unique result, *regardless* of the join method. Actually I want to make the rule less strict; for example, we may need just one table to yield a unique result, and the final result will be unique as well under some join types. As for t1 INNER JOIN t2 ON t1.unique = t2.not_unique, it looks like we can't remove the distinct based on this alone:

create table m1(a int primary key, b int);
create table m2(a int primary key, b int);
insert into m1 values(1, 1), (2, 1);
insert into m2 values(1, 1), (2, 1);
select distinct m1.a from m1, m2 where m1.a = m2.b;

> SELECT DISTINCT max(non_unique) FROM t1; to skip doing the DISTINCT part. > Actually I want to keep the distinct for this case now. One reason is that only 1 row is returned, so erasing the distinct can't help much.
The more important reason is that Query->hasAggs is true for "select distinct (select count(*) filter (where t2.c2 = 6 and t2.c1 < 10) from ft1 t1 where t1.c1 = 6) from ft2 t2 where t2.c2 % 6 = 0 order by 1;" (this SQL came from postgres_fdw.sql). > There's a separate effort in > https://commitfest.postgresql.org/27/1741/ to implement some parts of > the uniquekeys idea. However the implementation currently only covers > adding the unique keys to Paths, not to RelOptInfos Thanks for this info. I guess this patch is not merged so far, but it looks like the cfbot run for my patch [1] failed due to this :( Search for "explain (costs off) select distinct on(pk1) pk1, pk2 from select_distinct_a;". > I also believe that the existing code in analyzejoins.c should be > edited to make use of unique keys rather than how it looks at unique > indexes and group by / distinct clauses. > > I can do this after we have agreement on the UniquePath. As for my cfbot failure, another strange thing is that "A" appears ahead of "a" after the order by; I still haven't found out why. [1] https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.83298 Regards Andy Fan
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Sat, 7 Mar 2020 at 00:47, Andy Fan wrote: > Upload the newest patch so that the cfbot can pass. The last patch failed > because some explain without the (cost off). I've only really glanced at this patch, but I think we need to do this in a completely different way. I've been mentioning UniqueKeys around this mailing list for quite a while now [1]. To summarise the idea: 1. Add a new List field to RelOptInfo named unique_keys 2. During get_relation_info() process the base relation's unique indexes and add to the RelOptInfo's unique_keys list the indexed expressions from each unique index (this may need to be delayed until check_index_predicates() since predOK is only set there) 3. Perhaps in add_paths_to_joinrel(), or maybe when creating the join rel itself (I've not looked for the best location in detail), determine if the join can cause rows to be duplicated. If it can't, then add the UniqueKeys from that rel. For example: SELECT * FROM t1 INNER JOIN t2 ON t1.unique = t2.not_unique; would have the joinrel for {t1,t2} only take the unique keys from t2 (t1 can't duplicate t2 rows since it's an equijoin and t1.unique has a unique index). If the condition was t1.unique = t2.unique then we could take the unique keys from both sides of the join, and with t1.non_unique = t2.non_unique, we can take neither. 4. When creating the GROUP BY paths (when there are no aggregates), don't bother doing anything if the input rel's unique keys are a subset of the GROUP BY clause. Otherwise, create the group by paths and tag the new unique keys onto the GROUP BY rel. 5. When creating the DISTINCT paths, don't bother if the input rel has unique keys that are a subset of the distinct clause. 4 and 5 will mean that: SELECT DISTINCT non_unique FROM t1 GROUP BY non_unique will just uniquify once for the GROUP BY and not for the distinct. SELECT DISTINCT unique FROM t1 GROUP BY unique; won't do anything to uniquify the results.
Because both 4 and 5 require that the uniquekeys are a subset of the distinct/group by clause, an empty uniquekey set would mean that the RelOptInfo returns no more than 1 row. That would allow your: SELECT DISTINCT max(non_unique) FROM t1; to skip doing the DISTINCT part. There's a separate effort in https://commitfest.postgresql.org/27/1741/ to implement some parts of the uniquekeys idea. However the implementation currently only covers adding the unique keys to Paths, not to RelOptInfos. I also believe that the existing code in analyzejoins.c should be edited to make use of unique keys rather than how it looks at unique indexes and group by / distinct clauses. [1] https://www.postgresql.org/search/?m=1=pgsql-hackers=uniquekeys
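Steps 2 and 3 of the proposal amount to: seed each base rel's unique keys from its unique indexes, then at a join keep a side's keys only when the *other* side joins on one of its own unique keys (so it cannot contribute more than one matching row). A hypothetical Python sketch, with plain column names standing in for index expressions and EquivalenceClasses:

```python
def join_unique_keys(outer_keys, inner_keys, equi_quals):
    """equi_quals: list of (outer_col, inner_col) equijoin conditions.
    A side's unique keys survive the join iff the other side's join
    columns cover one of that other side's unique keys, i.e. the other
    side cannot duplicate its rows."""
    outer_cols = {o for o, _ in equi_quals}
    inner_cols = {i for _, i in equi_quals}
    result = []
    if any(set(k) <= outer_cols for k in outer_keys):
        result.extend(inner_keys)   # outer side joins on its unique key
    if any(set(k) <= inner_cols for k in inner_keys):
        result.extend(outer_keys)   # inner side joins on its unique key
    return result

# t1.unique = t2.not_unique: only t2's keys survive (David's example)
print(join_unique_keys([["u1"]], [["pk2"]], [("u1", "x")]))   # [['pk2']]
# t1.unique = t2.unique: both sides' keys survive
print(join_unique_keys([["u1"]], [["u2"]], [("u1", "u2")]))
# t1.non_unique = t2.non_unique: neither survives
print(join_unique_keys([["u1"]], [["u2"]], [("a", "b")]))     # []
```

The three calls mirror the three join cases in point 3 above; the real planner would also have to handle outer joins, non-equijoin quals, and expression equivalence, which this sketch deliberately ignores.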
Re: [PATCH] Erase the distinctClause if the result is unique by definition
Uploading the newest patch so that the cfbot can pass; the last patch failed because some EXPLAINs were run without (costs off). I'm still working out how to handle aggregation calls without an aggregation path. Probably we can get there by hacking some ExprEvalPushStep for the Aggref node. But since the current patch is not closely tied to this, I am putting it up for review first. On Wed, Mar 4, 2020 at 9:13 PM Andy Fan wrote: > > >> >>> * There are some changes in existing regression cases that aren't >>> visibly related to the stated purpose of the patch, eg it now >>> notices that "select distinct max(unique2) from tenk1" doesn't >>> require an explicit DISTINCT step. That's not wrong, but I wonder >>> if maybe you should subdivide this patch into more than one patch, >>> because that must be coming from some separate change. I'm also >>> wondering what caused the plan change in expected/join.out. >>> >> >> Per my purpose it should be in the same patch, the logical here is we >> have distinct in the sql and the query is distinct already since the max >> function (the rule is defined in query_is_distinct_agg which is split >> from >> the original query_is_distinct_for clause). >> > > I think I was right until I come > into contrib/postgres_fdw/sql/postgres_fdw.sql. > Per my understanding, the result of "select max(a) from t" is > unique > since the aggregation function and has no group clause there. But in the > postgres_fdw.sql case, the Query->hasAggs is true for "select distinct > (select count(*) filter (where t2.c2 = 6 and t2.c1 < 10) from ft1 t1 where > t1.c1 = 6) > from ft2 t2 where t2.c2 % 6 = 0 order by 1;" This looks very strange to > me. > Is my understanding wrong or there is a bug here? > > query->hasAggs was set to true in the following call stack. > > pstate->p_hasAggs = true; > > ..
> > qry->hasAggs = pstate->p_hasAggs; > > > 0 in check_agglevels_and_constraints of parse_agg.c:343 > 1 in transformAggregateCall of parse_agg.c:236 > 2 in ParseFuncOrColumn of parse_func.c:805 > 3 in transformFuncCall of parse_expr.c:1558 > 4 in transformExprRecurse of parse_expr.c:265 > 5 in transformExpr of parse_expr.c:155 > 6 in transformTargetEntry of parse_target.c:105 > 7 in transformTargetList of parse_target.c:193 > 8 in transformSelectStmt of analyze.c:1224 > 9 in transformStmt of analyze.c:301 > > You can see the new updated patch which should fix all the issues you > point out > except the one for supporting group by. The another reason for this > patch will > not be the final one is because the changes for postgres_fdw.out is too > arbitrary. > uploading it now just for reference. (The new introduced guc variable can > be > removed at last, keeping it now just make sure the testing is easier.) > > > At a high level, I'm a bit disturbed that this focuses only on DISTINCT and doesn't (appear to) have any equivalent intelligence for GROUP BY, though surely that offers much the same opportunities for optimization. It seems like it'd be worthwhile to take a couple steps back and see if we couldn't recast the logic to work for both. >>> OK, Looks group by is a bit harder than distinct since the aggregation >>> function. >>> I will go through the code to see where to add this logic. >>> >>> >> >> Can we grantee any_aggr_func(a) == a if only 1 row returned, if so, we >> can do >> some work on the pathtarget/reltarget by transforming the Aggref to raw >> expr. >> I checked the execution path of the aggregation call, looks it depends on >> Agg node >> which is the thing we want to remove. >> > > We can't grantee any_aggr_func(a) == a when only 1 row returned, so the > above > method doesn't work. do you have any suggestion for this? > v3-0001-PATCH-Erase-the-distinct-path-if-the-result-is-un.patch Description: Binary data
Re: [PATCH] Erase the distinctClause if the result is unique by definition
> > >> * There are some changes in existing regression cases that aren't >> visibly related to the stated purpose of the patch, eg it now >> notices that "select distinct max(unique2) from tenk1" doesn't >> require an explicit DISTINCT step. That's not wrong, but I wonder >> if maybe you should subdivide this patch into more than one patch, >> because that must be coming from some separate change. I'm also >> wondering what caused the plan change in expected/join.out. >> > > Per my purpose it should be in the same patch, the logical here is we > have distinct in the sql and the query is distinct already since the max > function (the rule is defined in query_is_distinct_agg which is split > from > the original query_is_distinct_for clause). > I thought I was right until I came to contrib/postgres_fdw/sql/postgres_fdw.sql. Per my understanding, the result of "select max(a) from t" is unique, since there is an aggregate function and no group clause. But in the postgres_fdw.sql case, Query->hasAggs is true for "select distinct (select count(*) filter (where t2.c2 = 6 and t2.c1 < 10) from ft1 t1 where t1.c1 = 6) from ft2 t2 where t2.c2 % 6 = 0 order by 1;" This looks very strange to me. Is my understanding wrong, or is there a bug here? query->hasAggs was set to true in the following call stack:

pstate->p_hasAggs = true;
..
qry->hasAggs = pstate->p_hasAggs;

0 in check_agglevels_and_constraints of parse_agg.c:343
1 in transformAggregateCall of parse_agg.c:236
2 in ParseFuncOrColumn of parse_func.c:805
3 in transformFuncCall of parse_expr.c:1558
4 in transformExprRecurse of parse_expr.c:265
5 in transformExpr of parse_expr.c:155
6 in transformTargetEntry of parse_target.c:105
7 in transformTargetList of parse_target.c:193
8 in transformSelectStmt of analyze.c:1224
9 in transformStmt of analyze.c:301

You can see the new updated patch, which should fix all the issues you pointed out except the one for supporting GROUP BY.
Another reason this patch will not be the final one is that the changes to postgres_fdw.out are too arbitrary. I am uploading it now just for reference. (The newly introduced GUC variable can be removed at the end; keeping it now just makes testing easier.) At a high level, I'm a bit disturbed that this focuses only on DISTINCT >>> and doesn't (appear to) have any equivalent intelligence for GROUP BY, >>> though surely that offers much the same opportunities for optimization. >>> It seems like it'd be worthwhile to take a couple steps back and see >>> if we couldn't recast the logic to work for both. >>> >>> >> OK, looks group by is a bit harder than distinct since the aggregation >> function. >> I will go through the code to see where to add this logic. >> >> > > Can we guarantee any_aggr_func(a) == a if only 1 row is returned? If so, we > can do > some work on the pathtarget/reltarget by transforming the Aggref to a raw > expr. > I checked the execution path of the aggregation call; looks it depends on > the Agg node > which is the thing we want to remove. > We can't guarantee any_aggr_func(a) == a when only 1 row is returned, so the above method doesn't work. Do you have any suggestions for this?

From 9449c09688d542c4dc201ee866f67d67304cff98 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=E4=B8=80=E6=8C=83?=
Date: Wed, 4 Mar 2020 14:33:56 +0800
Subject: [PATCH v2] [PATCH] Erase the distinct path if the result is unique by catalog

For a single relation, we can tell it if any one of the following is true:
1. The pk is in the target list.
2. The uk is in the target list and the columns are not null.
3. The columns in the group-by clause are also in the target list.

For a relation join, we can tell it by: if every relation in the jointree
yields a unique result set, then the final result is unique as well,
regardless of the join method.
---
 .../postgres_fdw/expected/postgres_fdw.out    |  28 +-
 src/backend/optimizer/path/costsize.c         |   1 +
 src/backend/optimizer/plan/analyzejoins.c     | 184 +++-
 src/backend/optimizer/plan/planner.c          |  27 ++
 src/backend/optimizer/util/plancat.c          |   9 +
 src/backend/utils/misc/guc.c                  |  10 +
 src/include/nodes/pathnodes.h                 |   1 +
 src/include/optimizer/cost.h                  |   1 +
 src/include/optimizer/planmain.h              |   2 +
 src/test/regress/expected/aggregates.out      |  13 +-
 src/test/regress/expected/join.out            |  16 +-
 src/test/regress/expected/select_distinct.out | 276 ++
 src/test/regress/expected/sysviews.out        |   3 +-
 src/test/regress/sql/select_distinct.sql      |  84 ++
 14 files changed, 619 insertions(+), 36 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 84fd3ad2e0..215f10bf7d 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -2902,22 +2902,20 @@
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Tue, Mar 3, 2020 at 1:24 AM Andy Fan wrote: > Thank you Tom for the review! > > On Mon, Mar 2, 2020 at 4:46 AM Tom Lane wrote: > >> Andy Fan writes: >> > Please see if you have any comments. Thanks >> >> The cfbot isn't at all happy with this. Its linux build is complaining >> about a possibly-uninitialized variable, and then giving up: >> >> https://travis-ci.org/postgresql-cfbot/postgresql/builds/656722993 >> >> The Windows build isn't using -Werror, but it is crashing in at least >> two different spots in the regression tests: >> >> >> https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.81778 >> >> I've not attempted to identify the cause of that. >> >> > Before I submit a patch, I make sure "make check-world" succeeds, but > since the compile options are not the same, I didn't catch > the possibly-uninitialized > variable. As for the crash on Windows, I don't have enough > information > yet; I will find a Windows server and reproduce the cases. > > I just found the link http://commitfest.cputube.org/ this morning; I will > make sure > the next patch passes this test. > > > At a high level, I'm a bit disturbed that this focuses only on DISTINCT >> and doesn't (appear to) have any equivalent intelligence for GROUP BY, >> though surely that offers much the same opportunities for optimization. >> It seems like it'd be worthwhile to take a couple steps back and see >> if we couldn't recast the logic to work for both. >> >> > OK, it looks like GROUP BY is a bit harder than DISTINCT because of the aggregation > functions. > I will go through the code to see where to add this logic. > > Can we guarantee any_aggr_func(a) == a if only 1 row is returned? If so, we can do some work on the pathtarget/reltarget by transforming the Aggref to a raw expr. I checked the execution path of the aggregation call; it looks like it depends on the Agg node, which is the very thing we want to remove.
> >> * There seem to be some pointless #include additions, eg in planner.c >> the added code doesn't look to justify any of them. Please also >> avoid unnecessary whitespace changes, those also slow down reviewing. >> > Fixed some typos. That may be because I added the header file at one point and then refactored the code but forgot to remove the header file when it was no longer necessary. Do we need to rely on experience to tell whether a header file is needed, or do we have any tool that can tell automatically? regards, Andy Fan
Re: [PATCH] Erase the distinctClause if the result is unique by definition
Thank you Tom for the review! On Mon, Mar 2, 2020 at 4:46 AM Tom Lane wrote: > Andy Fan writes: > > Please see if you have any comments. Thanks > > The cfbot isn't at all happy with this. Its linux build is complaining > about a possibly-uninitialized variable, and then giving up: > > https://travis-ci.org/postgresql-cfbot/postgresql/builds/656722993 > > The Windows build isn't using -Werror, but it is crashing in at least > two different spots in the regression tests: > > https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.81778 > > I've not attempted to identify the cause of that. > > Before I submit a patch, I make sure "make check-world" succeeds, but since the compile options are not the same, I didn't catch the possibly-uninitialized variable. As for the crash on Windows, I don't have enough information yet; I will find a Windows server and reproduce the cases. I just found the link http://commitfest.cputube.org/ this morning; I will make sure the next patch passes this test. At a high level, I'm a bit disturbed that this focuses only on DISTINCT > and doesn't (appear to) have any equivalent intelligence for GROUP BY, > though surely that offers much the same opportunities for optimization. > It seems like it'd be worthwhile to take a couple steps back and see > if we couldn't recast the logic to work for both. > > OK, it looks like GROUP BY is a bit harder than DISTINCT because of the aggregation functions. I will go through the code to see where to add this logic. > Some other random comments: > > * Don't especially like the way you broke query_is_distinct_for() > into two functions, especially when you then introduced a whole > lot of other code in between. I didn't expect this until you pointed it out. In this case, I had to break query_is_distinct_for into two functions, but it is true that we should put the two functions together. That's just making reviewer's job > harder to see what changed.
It makes the comments a bit disjointed > too, that is, where you even had any. (Zero introductory comment > for query_is_distinct_agg is *not* up to project coding standards. > There are a lot of other undercommented places in this patch, too.) > > * Definitely don't like having query_distinct_through_join re-open > all the relations. The data needed for this should get absorbed > while plancat.c has the relations open at the beginning. (Non-nullness > of columns, in particular, seems like it'll be useful for other > purposes; I'm a bit surprised the planner isn't using that already.) > I can add new attributes to RelOptInfo and fill in their values in the get_relation_info call. > * In general, query_distinct_through_join seems hugely messy, expensive, > and not clearly correct. If it is correct, the existing comments sure > aren't enough to explain what it is doing or why. > Removing the relation_open call can make it a bit simpler; I will add more comments to make it clearer in the next patch. > * There seem to be some pointless #include additions, eg in planner.c > the added code doesn't look to justify any of them. Please also > avoid unnecessary whitespace changes, those also slow down reviewing. > > That may be because I added the header file at one point and then refactored the code later but forgot to remove the header file accordingly. Do we need to rely on experience to tell whether a header file is needed, or do we have any tool that can tell automatically? > * I see you decided to add a new regression test file select_distinct_2. > That's a poor choice of name because it conflicts with our rules for the > naming of alternative output files. Besides which, you forgot to plug > it into the test schedule files, so it isn't actually getting run. > Is there a reason not to just add the new test cases to select_distinct? > > Adding the new test cases to select_distinct.sql is OK with me as well; actually I have no obvious reason to add a new file.
> * There are some changes in existing regression cases that aren't > visibly related to the stated purpose of the patch, eg it now > notices that "select distinct max(unique2) from tenk1" doesn't > require an explicit DISTINCT step. That's not wrong, but I wonder > if maybe you should subdivide this patch into more than one patch, > because that must be coming from some separate change. I'm also > wondering what caused the plan change in expected/join.out. > In my view it belongs in the same patch; the logic here is that we have DISTINCT in the SQL, and the query is already distinct because of the max function (the rule is defined in query_is_distinct_agg, which is split out from the original query_is_distinct_for function). > regards, tom lane >
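The rule behind query_is_distinct_agg mentioned above can be sketched minimally (a Python illustration with simplified inputs, not PostgreSQL's real Query structure): a query with aggregate functions and no GROUP BY returns at most one row, so its result is trivially distinct.

```python
# Sketch only: an aggregated query with no GROUP BY clause returns at
# most one row, so e.g. SELECT DISTINCT max(unique2) FROM tenk1 is
# already unique and needs no explicit DISTINCT step.
def query_is_distinct_agg(has_aggs: bool, group_clause: list) -> bool:
    return has_aggs and not group_clause
```

So `query_is_distinct_agg(True, [])` covers the `max(unique2)` case, while a query grouped on some column would still need the real per-group uniqueness checks.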
Re: [PATCH] Erase the distinctClause if the result is unique by definition
Andy Fan writes: > Please see if you have any comments. Thanks The cfbot isn't at all happy with this. Its linux build is complaining about a possibly-uninitialized variable, and then giving up: https://travis-ci.org/postgresql-cfbot/postgresql/builds/656722993 The Windows build isn't using -Werror, but it is crashing in at least two different spots in the regression tests: https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.81778 I've not attempted to identify the cause of that. At a high level, I'm a bit disturbed that this focuses only on DISTINCT and doesn't (appear to) have any equivalent intelligence for GROUP BY, though surely that offers much the same opportunities for optimization. It seems like it'd be worthwhile to take a couple steps back and see if we couldn't recast the logic to work for both. Some other random comments: * Don't especially like the way you broke query_is_distinct_for() into two functions, especially when you then introduced a whole lot of other code in between. That's just making reviewer's job harder to see what changed. It makes the comments a bit disjointed too, that is where you even had any. (Zero introductory comment for query_is_distinct_agg is *not* up to project coding standards. There are a lot of other undercommented places in this patch, too.) * Definitely don't like having query_distinct_through_join re-open all the relations. The data needed for this should get absorbed while plancat.c has the relations open at the beginning. (Non-nullness of columns, in particular, seems like it'll be useful for other purposes; I'm a bit surprised the planner isn't using that already.) * In general, query_distinct_through_join seems hugely messy, expensive, and not clearly correct. If it is correct, the existing comments sure aren't enough to explain what it is doing or why. * Not entirely convinced that a new GUC is needed for this, but if it is, you have to document it. 
* I wouldn't bother with bms_array_free(), nor with any of the other cleanup you've got at the bottom of query_distinct_through_join. The planner leaks *lots* of memory, and this function isn't going to be called so many times that it'll move the needle. * There seem to be some pointless #include additions, eg in planner.c the added code doesn't look to justify any of them. Please also avoid unnecessary whitespace changes, those also slow down reviewing. * I see you decided to add a new regression test file select_distinct_2. That's a poor choice of name because it conflicts with our rules for the naming of alternative output files. Besides which, you forgot to plug it into the test schedule files, so it isn't actually getting run. Is there a reason not to just add the new test cases to select_distinct? * There are some changes in existing regression cases that aren't visibly related to the stated purpose of the patch, eg it now notices that "select distinct max(unique2) from tenk1" doesn't require an explicit DISTINCT step. That's not wrong, but I wonder if maybe you should subdivide this patch into more than one patch, because that must be coming from some separate change. I'm also wondering what caused the plan change in expected/join.out. regards, tom lane
Re: [PATCH] Erase the distinctClause if the result is unique by definition
Hi All: Here is the updated patch. It uses some functions from query_is_distinct_for. I check the query's distinctness in create_distinct_paths; if the result is distinct already, the paths for the Distinct step are not generated, so in the end the query tree is left untouched. Please see if you have any comments. Thanks On Mon, Feb 24, 2020 at 8:38 PM Andy Fan wrote: > > > On Wed, Feb 12, 2020 at 12:36 AM Ashutosh Bapat < > ashutosh.bapat@gmail.com> wrote: > >> >> >> On Tue, Feb 11, 2020 at 8:27 AM Andy Fan >> wrote: > >>> >>> >>> On Tue, Feb 11, 2020 at 12:22 AM Ashutosh Bapat < >>> ashutosh.bapat@gmail.com> wrote: >>> > > [PATCH] Erase the distinctClause if the result is unique by > definition > I forgot to mention this in the last round of comments. Your patch was actually removing distictClause from the Query structure. Please avoid doing that. If you remove it, you are also removing the evidence that this Query had a DISTINCT clause in it. >>> >>> Yes, I removed it because that was the easiest way to do it. What is the >>> purpose of keeping the evidence? >>> >> >> Julien's example provides an explanation for this. The Query structure is >> serialised into a view definition. Removing distinctClause from there means >> that the view will never try to produce unique results. >> >>> >>> >> > Actually that is not true. If a view is used in a query, the definition > is *copied* > into the query tree, so if we modify that query tree, the definition of > the view is never > touched. The issue Julien reported was caused by a typo.
> > -- session 2 >>> postgres=# alter table t alter column b drop not null; >>> ALTER TABLE >>> >>> -- session 1: >>> postgres=# explain execute st(1); >>> QUERY PLAN >>> - >>> Unique (cost=1.03..1.04 rows=1 width=4) >>>-> Sort (cost=1.03..1.04 rows=1 width=4) >>> Sort Key: b >>> -> Seq Scan on t (cost=0.00..1.02 rows=1 width=4) >>>Filter: (c = $1) >>> (5 rows) >>> >> >> Since this prepared statement is parameterised PostgreSQL is replanning >> it every time it gets executed. It's not using a stored prepared plan. Try >> without parameters. Also make sure that a prepared plan is used for >> execution and not a new plan. >> > > Even for a parameterised prepared statement, it is still possible to > generate a generic > plan, so it will not be replanned every time. But generic plan or > not, after a DDL such as > dropping the NOT NULL constraint, PostgreSQL will generate a plan based on the > stored query > tree. However, the query tree is *copied* again to generate a new > plan, so even if I > modified the query tree, everything would still be OK. > > In the end, I agree that modifying the query tree is not a good > idea, > so my updated patch doesn't do it any more. > 0001-Erase-the-distinctClause-if-the-result-is-unique-by-.patch Description: Binary data
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Wed, Feb 12, 2020 at 12:36 AM Ashutosh Bapat < ashutosh.bapat@gmail.com> wrote: > > > On Tue, Feb 11, 2020 at 8:27 AM Andy Fan wrote: > >> >> >> On Tue, Feb 11, 2020 at 12:22 AM Ashutosh Bapat < >> ashutosh.bapat@gmail.com> wrote: >> >>> >>> [PATCH] Erase the distinctClause if the result is unique by definition >>> >>> I forgot to mention this in the last round of comments. Your patch was >>> actually removing distictClause from the Query structure. Please avoid >>> doing that. If you remove it, you are also removing the evidence that this >>> Query had a DISTINCT clause in it. >>> >> >> Yes, I removed it because that was the easiest way to do it. What is the >> purpose of keeping the evidence? >> > > Julien's example provides an explanation for this. The Query structure is > serialised into a view definition. Removing distinctClause from there means > that the view will never try to produce unique results. > >> >> > Actually that is not true. If a view is used in a query, the definition is *copied* into the query tree, so if we modify that query tree, the definition of the view is never touched. The issue Julien reported was caused by a typo. -- session 2 >> postgres=# alter table t alter column b drop not null; >> ALTER TABLE >> >> -- session 1: >> postgres=# explain execute st(1); >> QUERY PLAN >> - >> Unique (cost=1.03..1.04 rows=1 width=4) >>-> Sort (cost=1.03..1.04 rows=1 width=4) >> Sort Key: b >> -> Seq Scan on t (cost=0.00..1.02 rows=1 width=4) >>Filter: (c = $1) >> (5 rows) >> > > Since this prepared statement is parameterised PostgreSQL is replanning it > every time it gets executed. It's not using a stored prepared plan. Try > without parameters. Also make sure that a prepared plan is used for > execution and not a new plan. > Even for a parameterised prepared statement, it is still possible to generate a generic plan, so it will not be replanned every time. But generic plan or not, after a DDL such as dropping the NOT NULL constraint,
PostgreSQL will generate a plan based on the stored query tree. However, the query tree is *copied* again to generate a new plan, so even if I modified the query tree, everything would still be OK. In the end, I agree that modifying the query tree is not a good idea, so my updated patch doesn't do it any more.
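The replanning behaviour described above can be modelled with a toy sketch (hypothetical structures, not PostgreSQL's actual plancache API): planning works on a copy of the stored query tree, so the planner may strip a redundant Distinct step from the copy, while DDL invalidation forces a fresh replan from the pristine stored tree.

```python
import copy

# Hypothetical catalog state the planner consults; DDL changes this.
catalog = {'b_not_null': True}

class PreparedStmt:
    """Toy model of a cached prepared statement (not the real plancache)."""
    def __init__(self, query_tree):
        self.stored_tree = query_tree   # kept pristine across replans
        self.cached_plan = None

    def execute(self):
        if self.cached_plan is None:    # first run, or invalidated by DDL
            work_tree = copy.deepcopy(self.stored_tree)
            self.cached_plan = plan(work_tree)  # planner may mutate the copy
        return self.cached_plan

    def invalidate(self):               # fired when DDL touches the table
        self.cached_plan = None

def plan(tree):
    # Safe to strip DISTINCT from the *copy* when uniqueness is proven;
    # the stored tree still records that the query had DISTINCT.
    if catalog['b_not_null']:
        tree.pop('distinct', None)
    return 'SeqScan' if 'distinct' not in tree else 'Unique over SeqScan'
```

After a simulated `DROP NOT NULL` (flipping the catalog flag) and an invalidation, the next `execute()` replans from the untouched stored tree and brings the Unique node back.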
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Thu, Feb 13, 2020 at 5:39 PM Julien Rouhaud wrote: > On Tue, Feb 11, 2020 at 10:06:17PM +0530, Ashutosh Bapat wrote: > > On Tue, Feb 11, 2020 at 8:27 AM Andy Fan > wrote: > > > > > On Tue, Feb 11, 2020 at 12:22 AM Ashutosh Bapat < > > > ashutosh.bapat@gmail.com> wrote: > > > > > >>> > > >>> [PATCH] Erase the distinctClause if the result is unique by > > >>> definition > > >> > > >> I forgot to mention this in the last round of comments. Your patch was > > >> actually removing distictClause from the Query structure. Please avoid > > >> doing that. If you remove it, you are also removing the evidence that > this > > >> Query had a DISTINCT clause in it. > > > > > > Yes, I removed it because it is the easiest way to do it. what is the > > > purpose of keeping the evidence? > > > > > > > Julien's example provides an explanation for this. The Query structure is > > serialised into a view definition. Removing distinctClause from there > means > > that the view will never try to produce unique results. > > And also I think that this approach will have a lot of other unexpected > side > effects. Isn't changing the Query going to affect pg_stat_statements > queryid > computing for instance? > Thanks, the two factors above are quite compelling, so erasing the distinctClause is not reasonable. I will try another way.
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Tue, Feb 11, 2020 at 10:06:17PM +0530, Ashutosh Bapat wrote: > On Tue, Feb 11, 2020 at 8:27 AM Andy Fan wrote: > > > On Tue, Feb 11, 2020 at 12:22 AM Ashutosh Bapat < > > ashutosh.bapat@gmail.com> wrote: > > > >>> > >>> [PATCH] Erase the distinctClause if the result is unique by > >>> definition > >> > >> I forgot to mention this in the last round of comments. Your patch was > >> actually removing distictClause from the Query structure. Please avoid > >> doing that. If you remove it, you are also removing the evidence that this > >> Query had a DISTINCT clause in it. > > > > Yes, I removed it because it is the easiest way to do it. what is the > > purpose of keeping the evidence? > > > > Julien's example provides an explanation for this. The Query structure is > serialised into a view definition. Removing distinctClause from there means > that the view will never try to produce unique results. And also I think that this approach will have a lot of other unexpected side effects. Isn't changing the Query going to affect pg_stat_statements queryid computing for instance?
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Tue, Feb 11, 2020 at 8:27 AM Andy Fan wrote: > > > On Tue, Feb 11, 2020 at 12:22 AM Ashutosh Bapat < > ashutosh.bapat@gmail.com> wrote: > >> >> >>> >>> [PATCH] Erase the distinctClause if the result is unique by >>> definition >>> >> >> I forgot to mention this in the last round of comments. Your patch was >> actually removing distictClause from the Query structure. Please avoid >> doing that. If you remove it, you are also removing the evidence that this >> Query had a DISTINCT clause in it. >> > > Yes, I removed it because it is the easiest way to do it. what is the > purpose of keeping the evidence? > Julien's example provides an explanation for this. The Query structure is serialised into a view definition. Removing distinctClause from there means that the view will never try to produce unique results. > > > > Suppose after a DDL, the prepared statement need to be re-parsed/planned > if it is not executed or it will prevent the DDL to happen. > The query will be replanned. I am not sure about reparsed though. > > > -- session 2 > postgres=# alter table t alter column b drop not null; > ALTER TABLE > > -- session 1: > postgres=# explain execute st(1); > QUERY PLAN > - > Unique (cost=1.03..1.04 rows=1 width=4) >-> Sort (cost=1.03..1.04 rows=1 width=4) > Sort Key: b > -> Seq Scan on t (cost=0.00..1.02 rows=1 width=4) >Filter: (c = $1) > (5 rows) > Since this prepared statement is parameterised PostgreSQL is replanning it every time it gets executed. It's not using a stored prepared plan. Try without parameters. Also make sure that a prepared plan is used for execution and not a new plan. -- Best Wishes, Ashutosh Bapat
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Mon, Feb 10, 2020 at 10:57 PM Tom Lane wrote: > Ashutosh Bapat writes: > >> On Sat, Feb 8, 2020 at 12:53 PM Andy Fan > wrote: > >> Do you mean adding some information into PlannerInfo, and when we > create > >> a node for Unique/HashAggregate/Group, we can just create a dummy node? > > > Not so much as PlannerInfo but something on lines of PathKey. See PathKey > > structure and related code. What I envision is PathKey class is also > > annotated with the information whether that PathKey implies uniqueness. > > E.g. a PathKey derived from a Primary index would imply uniqueness also. > A > > PathKey derived from say Group operation also implies uniqueness. Then > just > > by looking at the underlying Path we would be able to say whether we need > > Group/Unique node on top of it or not. I think that would make it much > > wider usecase and a very useful optimization. > > FWIW, that doesn't seem like a very prudent approach to me, because it > confuses sorted-ness with unique-ness. PathKeys are about sorting, > but it's possible to have uniqueness guarantees without having sorted > anything, for instance via hashed grouping. > > I haven't looked at this patch, but I'd expect it to use infrastructure > related to query_is_distinct_for(), and that doesn't deal in PathKeys. > > Thanks for the pointer. I think there's another problem with my approach. PathKeys are specific to paths, since the order of the result depends upon the Path. But uniqueness is a property of the result, i.e. the relation, and thus should be attached to RelOptInfo, as query_is_distinct_for() does. I think uniqueness should bubble up the RelOptInfo tree, annotating each RelOptInfo with the minimum set of TLEs which makes the result from that relation unique. Thus we could eliminate the extra Group/Unique node if the underlying RelOptInfo's unique column set is a subset of the required unique columns. -- Best Wishes, Ashutosh Bapat
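The bubbling-up idea can be sketched in a few lines (a toy Python model with hypothetical names, not planner code): if each side of a join is unique on some key set, the join result is unique on the union of those key sets regardless of the join method, and the Unique/Group node can be dropped when such a key set is covered by the DISTINCT columns.

```python
def join_unique_keys(outer_keys, inner_keys):
    """outer_keys/inner_keys: sets of columns each input is unique on,
    or None if that input is not provably unique."""
    if outer_keys is None or inner_keys is None:
        return None          # one side may duplicate the other's rows
    return outer_keys | inner_keys

def distinct_is_redundant(rel_unique_keys, distinct_cols):
    # The Unique/HashAggregate node can be skipped when the relation's
    # unique column set is covered by the DISTINCT column list.
    return rel_unique_keys is not None and rel_unique_keys <= set(distinct_cols)
```

For instance, joining a relation unique on `t1.a` with one unique on `t2.b` yields a result unique on `{t1.a, t2.b}`, so `SELECT DISTINCT t1.a, t2.b, t2.c ...` would need no Unique node.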
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Tue, Feb 11, 2020 at 08:14:14PM +0800, Andy Fan wrote: > On Tue, Feb 11, 2020 at 3:56 PM Julien Rouhaud wrote: > > > > > > > and if we prepare sql outside a transaction, and execute it in the > > > transaction, the other session can't drop the constraint until the > > > transaction is ended. > > > > And what if you create a view on top of a query containing a distinct > > clause > > rather than using prepared statements? FWIW your patch doesn't handle such > > case at all, without even needing to drop constraints: > > > > CREATE TABLE t (a int primary key, b int not null, c int); > > INSERT INTO t VALUEs(1, 1, 1), (2, 2, 2); > > CREATE UNIQUE INDEX t_idx1 on t(b); > > CREATE VIEW v1 AS SELECT DISTINCT b FROM t; > > EXPLAIN SELECT * FROM v1; > > server closed the connection unexpectedly > > This probably means the server terminated abnormally > > before or while processing the request. > > > > > This error can be fixed with > > - num_of_rtables = bms_num_members(non_semi_anti_relids); > + num_of_rtables = list_length(query->rtable); > > This test case has also been added to the patch. > > > > I also think this is not the right way to handle this optimization. > > > > Do you have any other concerns?
Yes, it seems to be broken as soon as you alter the view's underlying table: =# CREATE TABLE t (a int primary key, b int not null, c int); CREATE TABLE =# INSERT INTO t VALUEs(1, 1, 1), (2, 2, 2); INSERT 0 2 =# CREATE UNIQUE INDEX t_idx1 on t(b); CREATE INDEX =# CREATE VIEW v1 AS SELECT DISTINCT b FROM t; CREATE VIEW =# EXPLAIN SELECT * FROM v1; QUERY PLAN - Seq Scan on t (cost=0.00..1.02 rows=2 width=4) (1 row) =# EXPLAIN SELECT DISTINCT b FROM t; QUERY PLAN - Seq Scan on t (cost=0.00..1.02 rows=2 width=4) (1 row) =# ALTER TABLE t ALTER COLUMN b DROP NOT NULL; ALTER TABLE =# EXPLAIN SELECT * FROM v1; QUERY PLAN - Seq Scan on t (cost=0.00..1.02 rows=2 width=4) (1 row) =# EXPLAIN SELECT DISTINCT b FROM t; QUERY PLAN - Unique (cost=1.03..1.04 rows=2 width=4) -> Sort (cost=1.03..1.03 rows=2 width=4) Sort Key: b -> Seq Scan on t (cost=0.00..1.02 rows=2 width=4) (4 rows)
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Tue, Feb 11, 2020 at 3:56 PM Julien Rouhaud wrote: > > > > and if we prepare sql outside a transaction, and execute it in the > > transaction, the other session can't drop the constraint until the > > transaction is ended. > > And what if you create a view on top of a query containing a distinct > clause > rather than using prepared statements? FWIW your patch doesn't handle such > case at all, without even needing to drop constraints: > > CREATE TABLE t (a int primary key, b int not null, c int); > INSERT INTO t VALUEs(1, 1, 1), (2, 2, 2); > CREATE UNIQUE INDEX t_idx1 on t(b); > CREATE VIEW v1 AS SELECT DISTINCT b FROM t; > EXPLAIN SELECT * FROM v1; > server closed the connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request. > > This error can be fixed with - num_of_rtables = bms_num_members(non_semi_anti_relids); + num_of_rtables = list_length(query->rtable); This test case has also been added to the patch. > I also think this is not the right way to handle this optimization. > Do you have any other concerns? 0001-Erase-the-distinctClause-if-the-result-is-unique-by-.patch Description: Binary data
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Tue, Feb 11, 2020 at 3:56 PM Julien Rouhaud wrote: > On Tue, Feb 11, 2020 at 10:57:26AM +0800, Andy Fan wrote: > > On Tue, Feb 11, 2020 at 12:22 AM Ashutosh Bapat < > > ashutosh.bapat@gmail.com> wrote: > > > > > I forgot to mention this in the last round of comments. Your patch was > > > actually removing distictClause from the Query structure. Please avoid > > > doing that. If you remove it, you are also removing the evidence that > this > > > Query had a DISTINCT clause in it. > > > > > > > Yes, I removed it because it is the easiest way to do it. what is the > > purpose of keeping the evidence? > > > > >> However the patch as presented has some problems > > >> 1. What happens if the primary key constraint or NOT NULL constraint > gets > > >> dropped between a prepare and execute? The plan will no more be valid > and > > >> thus execution may produce non-distinct results. > > > > > But that doesn't matter since a query can be prepared outside a > > > transaction and executed within one or more subsequent transactions. > > > > > > > Suppose after a DDL, the prepared statement need to be re-parsed/planned > > if it is not executed or it will prevent the DDL to happen. > > > > The following is my test. > > > > postgres=# create table t (a int primary key, b int not null, c int); > > CREATE TABLE > > postgres=# insert into t values(1, 1, 1), (2, 2, 2); > > INSERT 0 2 > > postgres=# create unique index t_idx1 on t(b); > > CREATE INDEX > > > > postgres=# prepare st as select distinct b from t where c = $1; > > PREPARE > > postgres=# explain execute st(1); > >QUERY PLAN > > - > > Seq Scan on t (cost=0.00..1.02 rows=1 width=4) > >Filter: (c = 1) > > (2 rows) > > ... 
> > postgres=# explain execute st(1); > >QUERY PLAN > > - > > Seq Scan on t (cost=0.00..1.02 rows=1 width=4) > >Filter: (c = $1) > > (2 rows) > > > > -- session 2 > > postgres=# alter table t alter column b drop not null; > > ALTER TABLE > > > > -- session 1: > > postgres=# explain execute st(1); > > QUERY PLAN > > - > > Unique (cost=1.03..1.04 rows=1 width=4) > >-> Sort (cost=1.03..1.04 rows=1 width=4) > > Sort Key: b > > -> Seq Scan on t (cost=0.00..1.02 rows=1 width=4) > >Filter: (c = $1) > > (5 rows) > > > > -- session 2 > > postgres=# insert into t values (3, null, 3), (4, null, 3); > > INSERT 0 2 > > > > -- session 1 > > postgres=# execute st(3); > > b > > --- > > > > (1 row) > > > > and if we prepare sql outside a transaction, and execute it in the > > transaction, the other session can't drop the constraint until the > > transaction is ended. > > And what if you create a view on top of a query containing a distinct > clause > rather than using prepared statements? FWIW your patch doesn't handle such > case at all, without even needing to drop constraints: > CREATE TABLE t (a int primary key, b int not null, c int); > INSERT INTO t VALUEs(1, 1, 1), (2, 2, 2); > CREATE UNIQUE INDEX t_idx1 on t(b); > CREATE VIEW v1 AS SELECT DISTINCT b FROM t; > EXPLAIN SELECT * FROM v1; > server closed the connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request. > > Thanks for pointing it out. This is unexpected based on my current knowledge; I will check it. > I also think this is not the right way to handle this optimization. > I started looking into query_is_distinct_for when Tom pointed it out, but I still don't understand the context fully. I will investigate your finding along with this.
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Tue, Feb 11, 2020 at 10:57:26AM +0800, Andy Fan wrote: > On Tue, Feb 11, 2020 at 12:22 AM Ashutosh Bapat < > ashutosh.bapat@gmail.com> wrote: > > > I forgot to mention this in the last round of comments. Your patch was > > actually removing distictClause from the Query structure. Please avoid > > doing that. If you remove it, you are also removing the evidence that this > > Query had a DISTINCT clause in it. > > > > Yes, I removed it because it is the easiest way to do it. what is the > purpose of keeping the evidence? > > >> However the patch as presented has some problems > >> 1. What happens if the primary key constraint or NOT NULL constraint gets > >> dropped between a prepare and execute? The plan will no more be valid and > >> thus execution may produce non-distinct results. > > > But that doesn't matter since a query can be prepared outside a > > transaction and executed within one or more subsequent transactions. > > > > Suppose after a DDL, the prepared statement need to be re-parsed/planned > if it is not executed or it will prevent the DDL to happen. > > The following is my test. > > postgres=# create table t (a int primary key, b int not null, c int); > CREATE TABLE > postgres=# insert into t values(1, 1, 1), (2, 2, 2); > INSERT 0 2 > postgres=# create unique index t_idx1 on t(b); > CREATE INDEX > > postgres=# prepare st as select distinct b from t where c = $1; > PREPARE > postgres=# explain execute st(1); >QUERY PLAN > - > Seq Scan on t (cost=0.00..1.02 rows=1 width=4) >Filter: (c = 1) > (2 rows) > ... 
> postgres=# explain execute st(1); >QUERY PLAN > - > Seq Scan on t (cost=0.00..1.02 rows=1 width=4) >Filter: (c = $1) > (2 rows) > > -- session 2 > postgres=# alter table t alter column b drop not null; > ALTER TABLE > > -- session 1: > postgres=# explain execute st(1); > QUERY PLAN > - > Unique (cost=1.03..1.04 rows=1 width=4) >-> Sort (cost=1.03..1.04 rows=1 width=4) > Sort Key: b > -> Seq Scan on t (cost=0.00..1.02 rows=1 width=4) >Filter: (c = $1) > (5 rows) > > -- session 2 > postgres=# insert into t values (3, null, 3), (4, null, 3); > INSERT 0 2 > > -- session 1 > postgres=# execute st(3); > b > --- > > (1 row) > > and if we prepare sql outside a transaction, and execute it in the > transaction, the other session can't drop the constraint until the > transaction is ended. And what if you create a view on top of a query containing a distinct clause rather than using prepared statements? FWIW your patch doesn't handle such case at all, without even needing to drop constraints: CREATE TABLE t (a int primary key, b int not null, c int); INSERT INTO t VALUEs(1, 1, 1), (2, 2, 2); CREATE UNIQUE INDEX t_idx1 on t(b); CREATE VIEW v1 AS SELECT DISTINCT b FROM t; EXPLAIN SELECT * FROM v1; server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. I also think this is not the right way to handle this optimization.
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Tue, Feb 11, 2020 at 12:22 AM Ashutosh Bapat
<ashutosh.bapat@gmail.com> wrote:
>
>>> [PATCH] Erase the distinctClause if the result is unique by
>>> definition
>
> I forgot to mention this in the last round of comments. Your patch was
> actually removing distinctClause from the Query structure. Please avoid
> doing that. If you remove it, you are also removing the evidence that
> this Query had a DISTINCT clause in it.

Yes, I removed it because it is the easiest way to do it. What is the
purpose of keeping the evidence?

>>> However the patch as presented has some problems
>>> 1. What happens if the primary key constraint or NOT NULL constraint
>>> gets dropped between a prepare and execute? The plan will no longer be
>>> valid and thus execution may produce non-distinct results.
>>
>> Will this still be an issue if the user doesn't use a "read uncommitted"
>> isolation level? I suppose it should be OK for this case. But even so,
>> I should add an isolation level check for this. I just added that to
>> the patch to continue discussing this issue.
>
> In PostgreSQL there's no "read uncommitted".

Thanks for the hint. I just noticed read uncommitted is treated as read
committed in PostgreSQL.

> But that doesn't matter since a query can be prepared outside a
> transaction and executed within one or more subsequent transactions.

After a DDL, the prepared statement needs to be re-parsed/re-planned if
it has not been executed yet, or else it will prevent the DDL from
happening. The following is my test.

postgres=# create table t (a int primary key, b int not null, c int);
CREATE TABLE
postgres=# insert into t values(1, 1, 1), (2, 2, 2);
INSERT 0 2
postgres=# create unique index t_idx1 on t(b);
CREATE INDEX

postgres=# prepare st as select distinct b from t where c = $1;
PREPARE
postgres=# explain execute st(1);
                    QUERY PLAN
-------------------------------------------------
 Seq Scan on t  (cost=0.00..1.02 rows=1 width=4)
   Filter: (c = 1)
(2 rows)
...
postgres=# explain execute st(1);
                    QUERY PLAN
-------------------------------------------------
 Seq Scan on t  (cost=0.00..1.02 rows=1 width=4)
   Filter: (c = $1)
(2 rows)

-- session 2
postgres=# alter table t alter column b drop not null;
ALTER TABLE

-- session 1:
postgres=# explain execute st(1);
                       QUERY PLAN
--------------------------------------------------------
 Unique  (cost=1.03..1.04 rows=1 width=4)
   ->  Sort  (cost=1.03..1.04 rows=1 width=4)
         Sort Key: b
         ->  Seq Scan on t  (cost=0.00..1.02 rows=1 width=4)
               Filter: (c = $1)
(5 rows)

-- session 2
postgres=# insert into t values (3, null, 3), (4, null, 3);
INSERT 0 2

-- session 1
postgres=# execute st(3);
 b
---

(1 row)

And if we prepare the SQL outside a transaction and execute it inside a
transaction, the other session can't drop the constraint until the
transaction has ended.

> --
> Best Wishes,
> Ashutosh Bapat
Re: [PATCH] Erase the distinctClause if the result is unique by definition
Ashutosh Bapat writes:
>> On Sat, Feb 8, 2020 at 12:53 PM Andy Fan wrote:
>> Do you mean adding some information into PlannerInfo, and when we create
>> a node for Unique/HashAggregate/Group, we can just create a dummy node?

> Not so much as PlannerInfo but something on the lines of PathKey. See the
> PathKey structure and related code. What I envision is that PathKey is
> also annotated with the information whether that PathKey implies
> uniqueness. E.g. a PathKey derived from a primary index would imply
> uniqueness also. A PathKey derived from, say, a Group operation also
> implies uniqueness. Then just by looking at the underlying Path we would
> be able to say whether we need a Group/Unique node on top of it or not.
> I think that would make it a much wider use case and a very useful
> optimization.

FWIW, that doesn't seem like a very prudent approach to me, because it
confuses sorted-ness with unique-ness. PathKeys are about sorting, but
it's possible to have uniqueness guarantees without having sorted
anything, for instance via hashed grouping.

I haven't looked at this patch, but I'd expect it to use infrastructure
related to query_is_distinct_for(), and that doesn't deal in PathKeys.

			regards, tom lane
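[Editor's illustration: the distinction drawn above — uniqueness without sortedness — can be seen outside the planner. Hashed grouping guarantees unique output keys while producing no sort order at all. A toy Python sketch, not PostgreSQL code:]

```python
def hashed_group(rows, key):
    """Group rows with a hash table: the output keys are unique,
    but emitted in arbitrary (here: insertion) order, not sorted."""
    groups = {}
    for row in rows:
        groups.setdefault(key(row), []).append(row)
    return list(groups.items())

rows = [(3, 'c'), (1, 'a'), (3, 'd'), (2, 'b')]
out = hashed_group(rows, key=lambda r: r[0])
keys = [k for k, _ in out]

# Uniqueness holds even though nothing was sorted.
assert len(keys) == len(set(keys))
assert keys != sorted(keys)
```

This is why annotating PathKeys (a sort-order concept) with uniqueness would miss paths like HashAggregate that are unique but unordered.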
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Sat, Feb 8, 2020 at 12:53 PM Andy Fan wrote:
> Hi Ashutosh:
>    Thanks for your time.
>
> On Fri, Feb 7, 2020 at 11:54 PM Ashutosh Bapat
> <ashutosh.bapat@gmail.com> wrote:
>> Hi Andy,
>> What might help is to add more description to your email message, like
>> giving examples to explain your idea.
>>
>> Anyway, I looked at the testcases you added for examples.
>> +create table select_distinct_a(a int, b char(20), c char(20) not null,
>> d int, e int, primary key(a, b));
>> +set enable_mergejoin to off;
>> +set enable_hashjoin to off;
>> +-- no node for distinct.
>> +explain (costs off) select distinct * from select_distinct_a;
>> +           QUERY PLAN
>> +--------------------------------
>> + Seq Scan on select_distinct_a
>> +(1 row)
>>
>> From this example, it seems that the distinct operation can be dropped
>> because (a, b) is a primary key. Is my understanding correct?
>
> Yes, you are correct. Actually I added them to the commit message, but
> it's true that I should have copied them in this email body as well, so
> I copy them now.
>
>     [PATCH] Erase the distinctClause if the result is unique by
>     definition

I forgot to mention this in the last round of comments. Your patch was
actually removing distinctClause from the Query structure. Please avoid
doing that. If you remove it, you are also removing the evidence that
this Query had a DISTINCT clause in it.

>> However the patch as presented has some problems
>> 1. What happens if the primary key constraint or NOT NULL constraint
>> gets dropped between a prepare and execute? The plan will no longer be
>> valid and thus execution may produce non-distinct results.
>
> Will this still be an issue if the user doesn't use a "read uncommitted"
> isolation level? I suppose it should be OK for this case. But even so,
> I should add an isolation level check for this. I just added that to
> the patch to continue discussing this issue.

In PostgreSQL there's no "read uncommitted".

But that doesn't matter since a query can be prepared outside a
transaction and executed within one or more subsequent transactions.

>> PostgreSQL has a similar concept of allowing non-grouping expressions
>> as part of the targetlist when those expressions can be proved to be
>> functionally dependent on the GROUP BY clause. See
>> check_functional_grouping() and its caller. I think DISTINCT
>> elimination should work on similar lines.
>>
>> 2. For the same reason described in check_functional_grouping(), using
>> unique indexes for eliminating DISTINCT should be discouraged.
>
> I checked the comments of check_functional_grouping; the reason is
>
>  * Currently we only check to see if the rel has a primary key that is a
>  * subset of the grouping_columns.  We could also use plain unique
>  * constraints if all their columns are known not null, but there's a
>  * problem: we need to be able to represent the not-null-ness as part of
>  * the constraints added to *constraintDeps.  FIXME whenever not-null
>  * constraints get represented in pg_constraint.
>
> Actually I am doubtful about the pg_constraint reason, since we are
> still able to get the not-null information from
> relation->rd_attr->attrs[n].attnotnull, which is just what this patch
> did.

The problem isn't whether not-null-ness can be inferred or not; the
problem is whether it can be guaranteed across planning and execution of
the query (prepare and execute, for example). The constraintDep
machinery registers the constraints used while preparing a plan and
invalidates the plan if any of those constraints change after the plan
is created.

>> 3. If you could eliminate DISTINCT you could similarly eliminate
>> GROUP BY as well
>
> This is a good point. The rules may be somewhat different for joins, so
> I prefer to focus on the current one so far.

I doubt that, since DISTINCT is ultimately carried out as a Grouping
operation. But anyway, I won't hang upon that.

>> 4. The patch works only at the query level, but that functionality can
>> be expanded generally to other places which add
>> Unique/HashAggregate/Group nodes if the underlying relation can be
>> proved to produce distinct rows. But that's probably more work since we
>> will have to label paths with unique keys similar to pathkeys.
>
> Do you mean adding some information into PlannerInfo, and when we create
> a node for Unique/HashAggregate/Group, we can just create a dummy node?

Not so much as PlannerInfo but something on the lines of PathKey. See the
PathKey structure and related code. What I envision is that PathKey is
also annotated with the information whether that PathKey implies
uniqueness. E.g. a PathKey derived from a primary index would imply
uniqueness also. A PathKey derived from, say, a Group operation also
implies uniqueness. Then just by looking at the underlying Path we would
be able to say whether we need a Group/Unique node on top of it or not.
I think that would make it a much wider use case and a very useful
optimization.
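[Editor's illustration: the path-labelling idea sketched above might look roughly like this in miniature. All names here are hypothetical Python stand-ins, not PostgreSQL's actual Path/PathKey structures:]

```python
from dataclasses import dataclass, field

@dataclass
class Path:
    """Toy stand-in for a planner Path (hypothetical, for illustration)."""
    rel: str
    # Each entry is a set of column names known to be unique in this
    # path's output -- analogous to labelling paths with "unique keys".
    unique_keys: list = field(default_factory=list)

def needs_unique_node(path, distinct_cols):
    """A DISTINCT over distinct_cols is redundant when some unique key
    of the path is a subset of the DISTINCT column list."""
    target = set(distinct_cols)
    return not any(set(uk) <= target for uk in path.unique_keys)

# Scan of a table whose primary key is (a, b): output unique on {a, b}.
p = Path(rel="select_distinct_a", unique_keys=[{"a", "b"}])
assert not needs_unique_node(p, ["a", "b", "c"])  # Unique node skippable
assert needs_unique_node(p, ["c"])                # still needed
```

Unlike a PathKey annotation, this label carries no ordering claim, which sidesteps the sortedness/uniqueness confusion raised in the adjacent message.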
Re: [PATCH] Erase the distinctClause if the result is unique by definition
Hi Ashutosh:
   Thanks for your time.

On Fri, Feb 7, 2020 at 11:54 PM Ashutosh Bapat wrote:
> Hi Andy,
> What might help is to add more description to your email message, like
> giving examples to explain your idea.
>
> Anyway, I looked at the testcases you added for examples.
> +create table select_distinct_a(a int, b char(20), c char(20) not null,
> d int, e int, primary key(a, b));
> +set enable_mergejoin to off;
> +set enable_hashjoin to off;
> +-- no node for distinct.
> +explain (costs off) select distinct * from select_distinct_a;
> +           QUERY PLAN
> +--------------------------------
> + Seq Scan on select_distinct_a
> +(1 row)
>
> From this example, it seems that the distinct operation can be dropped
> because (a, b) is a primary key. Is my understanding correct?

Yes, you are correct. Actually I added them to the commit message, but
it's true that I should have copied them in this email body as well, so
I copy them now.

    [PATCH] Erase the distinctClause if the result is unique by
    definition

    For a single relation, we can tell the result is unique if any one
    of the following is true:
    1. The pk is in the target list.
    2. The uk is in the target list and its columns are not null.
    3. The columns in the group-by clause are also in the target list.

    For a relation join: if every relation in the jointree yields a
    unique result set, then the final result is unique as well,
    regardless of the join method. For semi/anti joins, we ignore the
    right table.

> I like the idea since it eliminates one expensive operation.
>
> However the patch as presented has some problems
> 1. What happens if the primary key constraint or NOT NULL constraint
> gets dropped between a prepare and execute? The plan will no longer be
> valid and thus execution may produce non-distinct results.

Will this still be an issue if the user doesn't use a "read uncommitted"
isolation level? I suppose it should be OK for this case. But even so, I
should add an isolation level check for this. I just added that to the
patch to continue discussing this issue.

> PostgreSQL has a similar concept of allowing non-grouping expressions
> as part of the targetlist when those expressions can be proved to be
> functionally dependent on the GROUP BY clause. See
> check_functional_grouping() and its caller. I think DISTINCT
> elimination should work on similar lines.
>
> 2. For the same reason described in check_functional_grouping(), using
> unique indexes for eliminating DISTINCT should be discouraged.

I checked the comments of check_functional_grouping; the reason is

 * Currently we only check to see if the rel has a primary key that is a
 * subset of the grouping_columns.  We could also use plain unique
 * constraints if all their columns are known not null, but there's a
 * problem: we need to be able to represent the not-null-ness as part of
 * the constraints added to *constraintDeps.  FIXME whenever not-null
 * constraints get represented in pg_constraint.

Actually I am doubtful about the pg_constraint reason, since we are
still able to get the not-null information from
relation->rd_attr->attrs[n].attnotnull, which is just what this patch
did.

> 3. If you could eliminate DISTINCT you could similarly eliminate
> GROUP BY as well

This is a good point. The rules may be somewhat different for joins, so
I prefer to focus on the current one so far.

> 4. The patch works only at the query level, but that functionality can
> be expanded generally to other places which add
> Unique/HashAggregate/Group nodes if the underlying relation can be
> proved to produce distinct rows. But that's probably more work since we
> will have to label paths with unique keys similar to pathkeys.

Do you mean adding some information into PlannerInfo, and when we create
a node for Unique/HashAggregate/Group, we can just create a dummy node?

> 5. Have you tested this with OUTER joins, which can render the inner
> side nullable?

Yes, that part was missed in the test case. I just added them.
On Thu, Feb 6, 2020 at 11:31 AM Andy Fan wrote:
>
>> Updated the patch to consider semi/anti joins.
>>
>> Can anyone help to review this patch?
>>
>> Thanks
>>
>> On Fri, Jan 31, 2020 at 8:39 PM Andy Fan wrote:
>>> Hi:
>>>
>>> I wrote a patch to erase the distinctClause if the result is unique
>>> by definition. I noticed this because a user switched some code from
>>> Oracle to PG and found the performance was bad due to this, so I
>>> adapted PG for this as well.
>>>
>>> This patch doesn't help a well-written SQL query, but some drawbacks
>>> of a query may not be very obvious; since the cost of the check is
>>> pretty low as well, I think it would be OK to add.
>>>
>>> Please see the patch for details.
>>>
>>> Thank you.

> --
> Best Wishes,
> Ashutosh Bapat

0001-Erase-the-distinctClause-if-the-result-is-unique-by-.patch
Description: Binary data
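[Editor's illustration: the uniqueness rules from the commit message quoted earlier in this message can be sketched as follows. This is a Python toy with hypothetical names; the real patch operates on PostgreSQL's Query and catalog structures:]

```python
from dataclasses import dataclass, field

@dataclass
class Rel:
    """Toy relation descriptor (hypothetical, for illustration)."""
    pk: tuple = ()
    unique_keys: list = field(default_factory=list)
    not_null: set = field(default_factory=set)
    group_by: tuple = ()

def single_rel_result_is_unique(rel, targetlist):
    """Rules 1-3: pk in the target list; or a unique key in the target
    list whose columns are all NOT NULL; or the GROUP BY columns all in
    the target list."""
    tl = set(targetlist)
    if rel.pk and set(rel.pk) <= tl:
        return True
    for uk in rel.unique_keys:
        if set(uk) <= tl and all(c in rel.not_null for c in uk):
            return True
    if rel.group_by and set(rel.group_by) <= tl:
        return True
    return False

def join_result_is_unique(rels, targetlists, semi_anti_right=None):
    """Join rule: if every relation in the jointree yields a unique
    result, the join result is unique regardless of join method; the
    right table of a semi/anti join is ignored."""
    return all(single_rel_result_is_unique(r, t)
               for r, t in zip(rels, targetlists)
               if r is not semi_anti_right)

# Mirrors the select_distinct_a example: pk (a, b), plus a NOT NULL
# unique index on b.
r = Rel(pk=("a", "b"), unique_keys=[("b",)], not_null={"b", "c"})
assert single_rel_result_is_unique(r, ["a", "b", "c"])  # rule 1
assert single_rel_result_is_unique(r, ["b"])            # rule 2
assert not single_rel_result_is_unique(r, ["c"])
```

Note that rule 2 is exactly the case Ashutosh flags as fragile: the NOT NULL property can be dropped after planning unless a dependency is recorded.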
Re: [PATCH] Erase the distinctClause if the result is unique by definition
Hi Andy,
What might help is to add more description to your email message, like
giving examples to explain your idea.

Anyway, I looked at the testcases you added for examples.

+create table select_distinct_a(a int, b char(20), c char(20) not null,
d int, e int, primary key(a, b));
+set enable_mergejoin to off;
+set enable_hashjoin to off;
+-- no node for distinct.
+explain (costs off) select distinct * from select_distinct_a;
+           QUERY PLAN
+--------------------------------
+ Seq Scan on select_distinct_a
+(1 row)

From this example, it seems that the distinct operation can be dropped
because (a, b) is a primary key. Is my understanding correct?

I like the idea since it eliminates one expensive operation.

However the patch as presented has some problems:

1. What happens if the primary key constraint or NOT NULL constraint gets
dropped between a prepare and execute? The plan will no longer be valid
and thus execution may produce non-distinct results. PostgreSQL has a
similar concept of allowing non-grouping expressions as part of the
targetlist when those expressions can be proved to be functionally
dependent on the GROUP BY clause. See check_functional_grouping() and its
caller. I think DISTINCT elimination should work on similar lines.

2. For the same reason described in check_functional_grouping(), using
unique indexes for eliminating DISTINCT should be discouraged.

3. If you could eliminate DISTINCT you could similarly eliminate GROUP BY
as well.

4. The patch works only at the query level, but that functionality can be
expanded generally to other places which add Unique/HashAggregate/Group
nodes if the underlying relation can be proved to produce distinct rows.
But that's probably more work since we will have to label paths with
unique keys similar to pathkeys.

5. Have you tested this with OUTER joins, which can render the inner side
nullable?

On Thu, Feb 6, 2020 at 11:31 AM Andy Fan wrote:
> Updated the patch to consider semi/anti joins.
>
> Can anyone help to review this patch?
>
> Thanks
>
> On Fri, Jan 31, 2020 at 8:39 PM Andy Fan wrote:
>> Hi:
>>
>> I wrote a patch to erase the distinctClause if the result is unique by
>> definition. I noticed this because a user switched some code from
>> Oracle to PG and found the performance was bad due to this, so I
>> adapted PG for this as well.
>>
>> This patch doesn't help a well-written SQL query, but some drawbacks
>> of a query may not be very obvious; since the cost of the check is
>> pretty low as well, I think it would be OK to add.
>>
>> Please see the patch for details.
>>
>> Thank you.

--
Best Wishes,
Ashutosh Bapat
Re: [PATCH] Erase the distinctClause if the result is unique by definition
Updated the patch to consider semi/anti joins.

Can anyone help to review this patch?

Thanks

On Fri, Jan 31, 2020 at 8:39 PM Andy Fan wrote:
> Hi:
>
> I wrote a patch to erase the distinctClause if the result is unique by
> definition. I noticed this because a user switched some code from
> Oracle to PG and found the performance was bad due to this, so I
> adapted PG for this as well.
>
> This patch doesn't help a well-written SQL query, but some drawbacks
> of a query may not be very obvious; since the cost of the check is
> pretty low as well, I think it would be OK to add.
>
> Please see the patch for details.
>
> Thank you.

0001-Erase-the-distinctClause-if-the-result-is-unique-by-.patch
Description: Binary data