Re: UniqueKey v2

2024-05-13 Thread Andy Fan


>> Consider join of tables "a", "b" and "c". My understanding is that
>> make_join_rel() is called once with rel1={a} and rel2={b join c}, then with
>> rel1={a join b} and rel2={c}, etc. I wanted to say that each call should
>> produce the same set of unique keys.
>> 
>> I need to check this part more in detail.
>
> I think I understand now. By calling populate_joinrel_uniquekeys() for various
> orderings, you can find out that various input relation unique keys can
> represent the whole join. For example, if the ordering is
>
> A JOIN (B JOIN C)
>
> you can prove that the unique keys of A can be used for the whole join, while
> for the ordering
>
> B JOIN (A JOIN C)
>
> you can prove the same for the unique keys of B, and so on.

Yes.

>> > We don't create the component uniquekey if any one side of the boths
>> > sides is unique already. For example:
>> > 
>> > "(t1.id) in joinrel(t1, t2) is unique" OR "(t2.id) in joinrel is
>> > unique", there is no need to create component UniqueKey (t1.id, t2.id);  
>> 
>> ok, I need to check more in detail how this part works.
>
> This optimization makes sense to me.

OK.

>> > >
>> > >   Of course one problem is that the number of combinations can grow
>> > >   exponentially as new relations are joined.
>> > 
>> > Yes, that's why "rule 1" needed and "How to reduce the overhead" in
>> > UniqueKey.README is introduced. 
>
> I think there should yet be some guarantee that the number of unique keys does
> not grow exponentially. Perhaps a constant that allows a relation (base or
> join) to have at most N unique keys. (I imagine N to be rather small, e.g. 3
> or 4.) And when picking the "best N keys", one criterion could be the number
> of expressions in the key (the shorter key the better).

I don't want to introduce this complextity right now. I'm more
inerested with how to work with them effectivity. main effort includes: 

- the design of bitmapset which is memory usage friendly and easy for
combinations.
- Optimize the singlerow cases to reduce N UnqiueKeys to 1 UniqueKey.

I hope we can pay more attention to this optimization (at most N
UniqueKeys) when the major inforastruce has been done. 

>> > You are correct and IMO my current code are able to tell it is a single
>> > row as well.
>> > 
>> > 1. Since t1.id = 1, so t1 is single row, so t1.d is unqiuekey as a
>> > consequence.
>> > 2. Given t2.id is unique, t1.e = t2.id so t1's unqiuekey can be kept
>> > after the join because of rule 1 on joinrel. and t1 is singlerow, so the
>> > joinrel is singlerow as well.
>
> ok, I think I understand now.

OK.

At last, this probably is my first non-trival patchs which has multiple
authors, I don't want myself is the bottleneck for the coorperation, so
if you need something to do done sooner, please don't hesitate to ask me
for it explicitly.

Here is my schedule about this. I can provide the next version based
our discussion and your patches at the eariler of next week. and update
the UniqueKey.README to make sure the overall design clearer. What I
hope you to pay more attention is the UniqueKey.README besides the
code. I hope the UniqueKey.README can reduce the effort for others to
understand the overall design enormously.

-- 
Best Regards
Andy Fan





Re: UniqueKey v2

2024-05-13 Thread Andy Fan


Antonin Houska  writes:

>> Could you make the reason clearer for adding 'List *opfamily_lists;'
>> into UniqueKey?  You said "This is needed to create ECs in the parent
>> query if the upper relation represents a subquery." but I didn't get the
>> it. Since we need to maintain the UniqueKey in the many places, I'd like
>> to keep it as simple as possbile. Of course, anything essentical should
>> be added for sure. 
>
> If unique keys are generated for a subquery output, they also need to be
> created for the corresponding relation in the upper query ("sub" in the
> following example):

OK.
>
> select * from tab1 left join (select * from tab2) sub;
>
> However, to create an unique key for "sub", you need an EC for each expression
> of the key.

OK.
> And to create an EC, you in turn need the list of operator
> families.

I'm thinking if we need to "create" any EC. Can you find out a user case
where the outer EC is missed and the UniqueKey is still interesting? I
don't have an example now.  

convert_subquery_pathkeys has a similar sistuation and has the following
codes:

outer_ec =
get_eclass_for_sort_expr(root,

 (Expr *) outer_var,

 sub_eclass->ec_opfamilies,

 sub_member->em_datatype,

 sub_eclass->ec_collation,

 0,

 rel->relids,

 NULL,

 false);

/*
 * If we don't find a matching EC, sub-pathkey 
isn't
 * interesting to the outer query
 */
if (outer_ec)
best_pathkey =
make_canonical_pathkey(root,

   outer_ec,

   sub_pathkey->pk_opfamily,

   sub_pathkey->pk_strategy,

   sub_pathkey->pk_nulls_first);
}

> Even if the parent query already had ECs for the columns of "sub" which are
> contained in the unique key, you need to make sure that those ECs are
> "compatible" with the ECs of the subquery which generated the unique key. That
> is, if an EC of the subquery considers certain input values equal, the EC of
> the parent query must also be able to determine if they are equal or not.
>
>> > * RelOptInfo.notnullattrs
>> >
>> >   My understanding is that this field contains the set attributes whose
>> >   uniqueness is guaranteed by the unique key. They are acceptable because 
>> > they
>> >   are either 1) not allowed to be NULL due to NOT NULL constraint or 2) 
>> > NULL
>> >   value makes the containing row filtered out, so the row cannot break
>> >   uniqueness of the output. Am I right?
>> >
>> >   If so, I suggest to change the field name to something less generic, 
>> > maybe
>> >   'uniquekey_attrs' or 'uniquekey_candidate_attrs', and adding a comment 
>> > that
>> >   more checks are needed before particular attribute can actually be used 
>> > in
>> >   UniqueKey.
>> 
>> I don't think so, UniqueKey is just one of the places to use this
>> not-null property, see 3af704098 for the another user case of it. 
>> 
>> (Because of 3af704098, we should leverage notnullattnums somehow in this
>> patch, which will be included in the next version as well).
>
> In your patch you modify 'notnullattrs' in add_base_clause_to_rel(), but that
> does not happen to 'notnullattnums' in the current master branch. Thus I think
> that 'notnullattrs' is specific to the unique keys feature, so the field name
> should be less generic.

OK.

>> >
>> > * uniquekey_useful_for_merging()
>> >
>> >   How does uniqueness relate to merge join? In README.uniquekey you seem to
>> >   point out that a single row is always sorted, but I don't think this
>> >   function is related to that fact. (Instead, I'd expect that pathkeys are
>> >   added to all paths for a single-row relation, but I'm not sure you do 
>> > that
>> >   in the current version of 

Re: UniqueKey v2

2024-05-13 Thread Antonin Houska
Antonin Houska  wrote:

> Andy Fan  wrote:
> > >
> > > * Combining the UKs
> > >
> > >   IMO this is the most problematic part of the patch. You call
> > >   populate_joinrel_uniquekeys() for the same join multiple times,
> > 
> > Why do you think so? The below code is called in "make_join_rel"
> 
> Consider join of tables "a", "b" and "c". My understanding is that
> make_join_rel() is called once with rel1={a} and rel2={b join c}, then with
> rel1={a join b} and rel2={c}, etc. I wanted to say that each call should
> produce the same set of unique keys.
> 
> I need to check this part more in detail.

I think I understand now. By calling populate_joinrel_uniquekeys() for various
orderings, you can find out that various input relation unique keys can
represent the whole join. For example, if the ordering is

A JOIN (B JOIN C)

you can prove that the unique keys of A can be used for the whole join, while
for the ordering

B JOIN (A JOIN C)

you can prove the same for the unique keys of B, and so on.

> > Is your original question is about populate_joinrel_uniquekey_for_rel
> > rather than populate_joinrel_uniquekeys? We have the below codes:
> > 
> > outeruk_still_valid = populate_joinrel_uniquekey_for_rel(root, joinrel, 
> > outerrel,
> > 
> >  innerrel, restrictlist);
> > inneruk_still_valid = populate_joinrel_uniquekey_for_rel(root, joinrel, 
> > innerrel,
> > 
> >  outerrel, restrictlist);
> > 
> > This is mainly because of the following theory. Quoted from
> > README.uniquekey. Let's called this as "rule 1".
> > 
> > """
> > How to maintain the uniquekey?
> > ---
> > .. From the join relation, it is maintained with two rules:
> > 
> > - the uniquekey in one side is still unique if it can't be duplicated
> >   after the join. for example:
> > 
> >   SELECT t1.pk FROM t1 JOIN t2 ON t1.a = t2.pk;
> >   UniqueKey on t1:  t1.pk
> >   UniqueKey on t1 Join t2:  t1.pk
> > """
> > 
> > AND the blow codes:
> > 
> > 
> > if (outeruk_still_valid || inneruk_still_valid)
> > 
> > /*
> >  * the uniquekey on outers or inners have been added into 
> > joinrel so
> >  * the combined uniuqekey from both sides is not needed.
> >  */
> > return;
> > 
> > 
> > We don't create the component uniquekey if any one side of the boths
> > sides is unique already. For example:
> > 
> > "(t1.id) in joinrel(t1, t2) is unique" OR "(t2.id) in joinrel is
> > unique", there is no need to create component UniqueKey (t1.id, t2.id);  
> 
> ok, I need to check more in detail how this part works.

This optimization makes sense to me.

> > >
> > >   Of course one problem is that the number of combinations can grow
> > >   exponentially as new relations are joined.
> > 
> > Yes, that's why "rule 1" needed and "How to reduce the overhead" in
> > UniqueKey.README is introduced. 

I think there should yet be some guarantee that the number of unique keys does
not grow exponentially. Perhaps a constant that allows a relation (base or
join) to have at most N unique keys. (I imagine N to be rather small, e.g. 3
or 4.) And when picking the "best N keys", one criterion could be the number
of expressions in the key (the shorter key the better).

> > >
> > >   2) Check if the join relation is single-row
> > >
> > >   I in order to get rid of the dependency on 'restrictlist', I think you 
> > > can
> > >   use ECs. Consider a query from your regression tests:
> > >
> > > CREATE TABLE uk_t (id int primary key, a int not null, b int not null, c 
> > > int, d int, e int);
> > >
> > > SELECT distinct t1.d FROM uk_t t1 JOIN uk_t t2 ON t1.e = t2.id and t1.id 
> > > = 1;
> > >
> > >   The idea here seems to be that no more than one row of t1 matches the 
> > > query
> > >   clauses. Therefore, if t2(id) is unique, the clause t1.e=t2.id ensures 
> > > that
> > >   no more than one row of t2 matches the query (because t1 cannot provide 
> > > the
> > >   clause with more than one input value of 'e'). And therefore, the join 
> > > also
> > >   produces at most one row.
> > 
> > You are correct and IMO my current code are able to tell it is a single
> > row as well.
> > 
> > 1. Since t1.id = 1, so t1 is single row, so t1.d is unqiuekey as a
> > consequence.
> > 2. Given t2.id is unique, t1.e = t2.id so t1's unqiuekey can be kept
> > after the join because of rule 1 on joinrel. and t1 is singlerow, so the
> > joinrel is singlerow as well.

ok, I think I understand now.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: UniqueKey v2

2024-05-13 Thread Antonin Houska
Andy Fan  wrote:

> > * I think that, before EC is considered suitable for an UK, its 
> > ec_opfamilies
> >   field needs to be checked. I try to do that in
> >   find_ec_position_matching_expr(), see 0004.
> 
> Could you make the reason clearer for adding 'List *opfamily_lists;'
> into UniqueKey?  You said "This is needed to create ECs in the parent
> query if the upper relation represents a subquery." but I didn't get the
> it. Since we need to maintain the UniqueKey in the many places, I'd like
> to keep it as simple as possbile. Of course, anything essentical should
> be added for sure. 

If unique keys are generated for a subquery output, they also need to be
created for the corresponding relation in the upper query ("sub" in the
following example):

select * from tab1 left join (select * from tab2) sub;

However, to create an unique key for "sub", you need an EC for each expression
of the key. And to create an EC, you in turn need the list of operator
families.

Even if the parent query already had ECs for the columns of "sub" which are
contained in the unique key, you need to make sure that those ECs are
"compatible" with the ECs of the subquery which generated the unique key. That
is, if an EC of the subquery considers certain input values equal, the EC of
the parent query must also be able to determine if they are equal or not.

> > * RelOptInfo.notnullattrs
> >
> >   My understanding is that this field contains the set attributes whose
> >   uniqueness is guaranteed by the unique key. They are acceptable because 
> > they
> >   are either 1) not allowed to be NULL due to NOT NULL constraint or 2) NULL
> >   value makes the containing row filtered out, so the row cannot break
> >   uniqueness of the output. Am I right?
> >
> >   If so, I suggest to change the field name to something less generic, maybe
> >   'uniquekey_attrs' or 'uniquekey_candidate_attrs', and adding a comment 
> > that
> >   more checks are needed before particular attribute can actually be used in
> >   UniqueKey.
> 
> I don't think so, UniqueKey is just one of the places to use this
> not-null property, see 3af704098 for the another user case of it. 
> 
> (Because of 3af704098, we should leverage notnullattnums somehow in this
> patch, which will be included in the next version as well).

In your patch you modify 'notnullattrs' in add_base_clause_to_rel(), but that
does not happen to 'notnullattnums' in the current master branch. Thus I think
that 'notnullattrs' is specific to the unique keys feature, so the field name
should be less generic.

> >
> > * uniquekey_useful_for_merging()
> >
> >   How does uniqueness relate to merge join? In README.uniquekey you seem to
> >   point out that a single row is always sorted, but I don't think this
> >   function is related to that fact. (Instead, I'd expect that pathkeys are
> >   added to all paths for a single-row relation, but I'm not sure you do that
> >   in the current version of the patch.)
> 
> The merging is for "mergejoinable join clauses", see function
> eclass_useful_for_merging. Usually I think it as operator "t1.a = t2.a";

My question is: why is the uniqueness important specifically to merge join? I
understand that join evaluation can be more efficient if we know that one
input relation is unique (i.e. we only scan that relation until we find the
first match), but this is not specific to merge join.

> > * is_uniquekey_useful_afterjoin()
> >
> >   Now that my patch (0004) allows propagation of the unique keys from a
> >   subquery to the upper query, I was wondering if the UniqueKey structure
> >   needs the 'use_for_distinct field' I mean we should now propagate the 
> > unique
> >   keys to the parent query whether the subquery has DISTINCT clause or not. 
> > I
> >   noticed that the field is checked by is_uniquekey_useful_afterjoin(), so I
> >   changed the function to always returned true. However nothing changed in 
> > the
> >   output of regression tests (installcheck). Do you insist that the
> >   'use_for_distinct' field is needed?

I miss your answer to this comment.

> > * uniquekey_contains_multinulls()
> >
> >   ** Instead of calling the function when trying to use the UK, how about
> >  checking the ECs when considering creation of the UK? If the tests 
> > fail,
> >  just don't create the UK.
> 
> I don't think so since we maintain the UniqueKey from bottom to top, you
> can double check if my reason is appropriate.  
> 
> CREATE TABLE t1(a int);
> CREATE INDEX ON t1(a);
> 
> SELECT distinct t1.a FROM t1 JOIN t2 using(a);
> 
> We need to create the UniqueKey on the baserel for t1 and the NULL
> values is filtered out in the joinrel. so we have to creating it with
> allowing NULL values first. 

ok

> >   ** What does the 'multi' word in the function name mean?
> 
> multi means multiple, I thought we use this short name in the many
> places, for ex bt_multi_page_stats after a quick search. 

Why not simply uniquekey_contains_nulls() ?

Actually 

Re: UniqueKey v2

2024-05-06 Thread Andy Fan


Hello Antonin,

Thanks for interesting with this topic!

> * I think that, before EC is considered suitable for an UK, its ec_opfamilies
>   field needs to be checked. I try to do that in
>   find_ec_position_matching_expr(), see 0004.

Could you make the reason clearer for adding 'List *opfamily_lists;'
into UniqueKey?  You said "This is needed to create ECs in the parent
query if the upper relation represents a subquery." but I didn't get the
it. Since we need to maintain the UniqueKey in the many places, I'd like
to keep it as simple as possbile. Of course, anything essentical should
be added for sure. 

>
> * Set-returning functions (SRFs) can make a "distinct path" necessary even if
>   the join output is unique.

You are right at this point, I will fix it in the coming version.

>
> * RelOptInfo.notnullattrs
>
>   My understanding is that this field contains the set attributes whose
>   uniqueness is guaranteed by the unique key. They are acceptable because they
>   are either 1) not allowed to be NULL due to NOT NULL constraint or 2) NULL
>   value makes the containing row filtered out, so the row cannot break
>   uniqueness of the output. Am I right?
>
>   If so, I suggest to change the field name to something less generic, maybe
>   'uniquekey_attrs' or 'uniquekey_candidate_attrs', and adding a comment that
>   more checks are needed before particular attribute can actually be used in
>   UniqueKey.

I don't think so, UniqueKey is just one of the places to use this
not-null property, see 3af704098 for the another user case of it. 

(Because of 3af704098, we should leverage notnullattnums somehow in this
patch, which will be included in the next version as well).

> * add_uniquekey_for_uniqueindex()
>
>   I'd appreciate an explanation why the "single-row UK" is created. I think
>   the reason for unique_exprs==NIL is that a restriction clause VAR=CONST
>   exists for each column of the unique index. Whether I'm right or not, a
>   comment should state clearly what the reason is.

You are understanding it correctly. I will add comments in the next
version.

>
> * uniquekey_useful_for_merging()
>
>   How does uniqueness relate to merge join? In README.uniquekey you seem to
>   point out that a single row is always sorted, but I don't think this
>   function is related to that fact. (Instead, I'd expect that pathkeys are
>   added to all paths for a single-row relation, but I'm not sure you do that
>   in the current version of the patch.)

The merging is for "mergejoinable join clauses", see function
eclass_useful_for_merging. Usually I think it as operator "t1.a = t2.a";

> * is_uniquekey_useful_afterjoin()
>
>   Now that my patch (0004) allows propagation of the unique keys from a
>   subquery to the upper query, I was wondering if the UniqueKey structure
>   needs the 'use_for_distinct field' I mean we should now propagate the unique
>   keys to the parent query whether the subquery has DISTINCT clause or not. I
>   noticed that the field is checked by is_uniquekey_useful_afterjoin(), so I
>   changed the function to always returned true. However nothing changed in the
>   output of regression tests (installcheck). Do you insist that the
>   'use_for_distinct' field is needed?
>
>
> * uniquekey_contains_multinulls()
>
>   ** Instead of calling the function when trying to use the UK, how about
>  checking the ECs when considering creation of the UK? If the tests fail,
>  just don't create the UK.

I don't think so since we maintain the UniqueKey from bottom to top, you
can double check if my reason is appropriate.  

CREATE TABLE t1(a int);
CREATE INDEX ON t1(a);

SELECT distinct t1.a FROM t1 JOIN t2 using(a);

We need to create the UniqueKey on the baserel for t1 and the NULL
values is filtered out in the joinrel. so we have to creating it with
allowing NULL values first. 

>   ** What does the 'multi' word in the function name mean?

multi means multiple, I thought we use this short name in the many
places, for ex bt_multi_page_stats after a quick search. 

> * relation_is_distinct_for()
>
>   The function name is too similar to rel_is_distinct_for(). I think the name
>   should indicate that you are checking the relation against a set of
>   pathkeys. Maybe rel_is_distinct_for_pathkeys() (and remove 'distinct' from
>   the argument name)? At the same time, it might be good to rename
>   rel_is_distinct_for() to rel_is_distinct_for_clauses().

OK.

> * uniquekey_contains_in()
>
>   Shouldn't this be uniquekey_contained_in()? And likewise, shouldn't the
>   comment be " ... if UniqueKey is contained in the list of EquivalenceClass"
>   ?

OK.
>
>   (In general, even though I'm not English native speaker, I think I see quite
>   a few grammar issues, which often make reading of the comments/documentation

Your English is really good:)

>
>
> * Combining the UKs
>
>   IMO this is the most problematic part of the patch. You call
>   populate_joinrel_uniquekeys() for the same join multiple 

Re: UniqueKey v2

2024-04-19 Thread Andy Fan


Hello Antonin,

> zhihuifan1...@163.com wrote:
>
>> Here is the v3, ...
>
> I'm trying to enhance the join removal functionality (will post my patch in a
> separate thread soon) and I consider your patch very helpful in this
> area.

Thanks for these words.  The point 2) and 3) is pretty interesting to me
at [1] and "enhance join removal" is another interesting user case. 

> Following is my review. Attached are also some fixes and one enhancement:
> propagation of the unique keys (UK) from a subquery to the parent query
> (0004). (Note that 0001 is just your patch rebased).

Thanks for that! more enhancment like uniquekey in partitioned table is
needed. This post is mainly used to check if more people is still
interested with this. 

> Are you going to submit the patch to the first CF of PG 18?

Since there are still known work to do, I'm not sure if it is OK to
submit in CF. What do you think about this part?

>
> Please let me know if I can contribute to the effort by reviewing or writing
> some code.

Absolutely yes! please feel free to review / writing any of them and do
remember add yourself into the author list if you do that. 

Thanks for your review suggestion, I will get to this very soon if once
I get time, I hope it is in 4 weeks. 

[1]
https://www.postgresql.org/message-id/7mlamswjp81p.fsf%40e18c07352.et15sqa

-- 
Best Regards
Andy Fan





Re: UniqueKey v2

2024-04-19 Thread Antonin Houska
zhihuifan1...@163.com wrote:

> Here is the v3, ...

I'm trying to enhance the join removal functionality (will post my patch in a
separate thread soon) and I consider your patch very helpful in this
area.

Following is my review. Attached are also some fixes and one enhancement:
propagation of the unique keys (UK) from a subquery to the parent query
(0004). (Note that 0001 is just your patch rebased).


* I think that, before EC is considered suitable for an UK, its ec_opfamilies
  field needs to be checked. I try to do that in
  find_ec_position_matching_expr(), see 0004.


* Set-returning functions (SRFs) can make a "distinct path" necessary even if
  the join output is unique.


* RelOptInfo.notnullattrs

  My understanding is that this field contains the set attributes whose
  uniqueness is guaranteed by the unique key. They are acceptable because they
  are either 1) not allowed to be NULL due to NOT NULL constraint or 2) NULL
  value makes the containing row filtered out, so the row cannot break
  uniqueness of the output. Am I right?

  If so, I suggest to change the field name to something less generic, maybe
  'uniquekey_attrs' or 'uniquekey_candidate_attrs', and adding a comment that
  more checks are needed before particular attribute can actually be used in
  UniqueKey.


* add_uniquekey_for_uniqueindex()

  I'd appreciate an explanation why the "single-row UK" is created. I think
  the reason for unique_exprs==NIL is that a restriction clause VAR=CONST
  exists for each column of the unique index. Whether I'm right or not, a
  comment should state clearly what the reason is.


* uniquekey_useful_for_merging()

  How does uniqueness relate to merge join? In README.uniquekey you seem to
  point out that a single row is always sorted, but I don't think this
  function is related to that fact. (Instead, I'd expect that pathkeys are
  added to all paths for a single-row relation, but I'm not sure you do that
  in the current version of the patch.)


* is_uniquekey_useful_afterjoin()

  Now that my patch (0004) allows propagation of the unique keys from a
  subquery to the upper query, I was wondering if the UniqueKey structure
  needs the 'use_for_distinct field' I mean we should now propagate the unique
  keys to the parent query whether the subquery has DISTINCT clause or not. I
  noticed that the field is checked by is_uniquekey_useful_afterjoin(), so I
  changed the function to always returned true. However nothing changed in the
  output of regression tests (installcheck). Do you insist that the
  'use_for_distinct' field is needed?


* uniquekey_contains_multinulls()

  ** Instead of calling the function when trying to use the UK, how about
 checking the ECs when considering creation of the UK? If the tests fail,
 just don't create the UK.

  ** What does the 'multi' word in the function name mean?


* relation_is_distinct_for()

  The function name is too similar to rel_is_distinct_for(). I think the name
  should indicate that you are checking the relation against a set of
  pathkeys. Maybe rel_is_distinct_for_pathkeys() (and remove 'distinct' from
  the argument name)? At the same time, it might be good to rename
  rel_is_distinct_for() to rel_is_distinct_for_clauses().


* uniquekey_contains_in()

  Shouldn't this be uniquekey_contained_in()? And likewise, shouldn't the
  comment be " ... if UniqueKey is contained in the list of EquivalenceClass"
  ?

  (In general, even though I'm not English native speaker, I think I see quite
  a few grammar issues, which often make reading of the comments/documentation
  a bit difficult.)


* Combining the UKs

  IMO this is the most problematic part of the patch. You call
  populate_joinrel_uniquekeys() for the same join multiple times, each time
  with a different 'restrictlist', and you try to do two things at the same
  time: 1) combine the UKs of the input relations into the UKs of the join
  relation, 2) check if the join relation can be marked single-row.

  I think that both 1) and 2) should be independent from join order, and thus
  both computations should only take place once for given set of input
  relations. And I think they should be done separately:

  1) Compute the join UKs

  As you admit in a comment in populate_joinrel_uniquekeys(), neither join
  method nor clauses should matter. So I think you only need to pick the
  "component UKs" (i.e. UKs of the input relations) which are usable above
  that join (i.e. neither the join itself nor any join below sets any column
  of the UK to NULL) and combine them.

  Of course one problem is that the number of combinations can grow
  exponentially as new relations are joined. I'm not sure it's necessary to
  combine the UKs (and to discard some of them) immediately. Instead, maybe we
  can keep lists of UKs only for base relations, and postpone picking the
  suitable UKs and combining them until we actually need to check the relation
  uniqueness.

  2) Check if the join relation 

Re: UniqueKey v2

2023-11-09 Thread zhihuifan1213

zhihuifan1...@163.com writes:

Hi,

Here is the v3, the mainly changes is it maintains the UniqueKey on
joinrel level, which probabaly is the most important part of this
feature. It shows how the UnqiueKey on joinrel is generated and how it
is discarded due to non-interesting-uniquekey and also show much details
about the single-row case.

I will always maintain README.uniquekey under src/backend/optimizer/path/
to include the latest state of this feature to save the time for
reviewer from going through from the begining. I also use the word "BAD
CASE" in uniquekey.sql to demo which sistuation is not handled well so
far, that probably needs more attention at the first review. 

>From 987c1d08bedc0c9c7436ad2daddbab2202b52425 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Thu, 9 Nov 2023 17:25:35 +0800
Subject: [PATCH v3 1/1] maintain uniquekey on baserel/joinrel level and use it
 for marking

distinct-as-noop.
---
 src/backend/nodes/list.c|  17 +
 src/backend/optimizer/path/Makefile |   3 +-
 src/backend/optimizer/path/README.uniquekey | 220 +++
 src/backend/optimizer/path/allpaths.c   |   2 +
 src/backend/optimizer/path/equivclass.c |  48 ++
 src/backend/optimizer/path/joinrels.c   |   2 +
 src/backend/optimizer/path/uniquekey.c  | 654 
 src/backend/optimizer/plan/initsplan.c  |   8 +
 src/backend/optimizer/plan/planner.c|  33 +
 src/backend/optimizer/util/plancat.c|  10 +
 src/backend/optimizer/util/relnode.c|   2 +
 src/include/nodes/pathnodes.h   |  19 +
 src/include/nodes/pg_list.h |   2 +
 src/include/optimizer/paths.h   |  13 +
 src/test/regress/expected/join.out  |  11 +-
 src/test/regress/expected/uniquekey.out | 268 
 src/test/regress/parallel_schedule  |   2 +-
 src/test/regress/sql/uniquekey.sql  | 104 
 18 files changed, 1409 insertions(+), 9 deletions(-)
 create mode 100644 src/backend/optimizer/path/README.uniquekey
 create mode 100644 src/backend/optimizer/path/uniquekey.c
 create mode 100644 src/test/regress/expected/uniquekey.out
 create mode 100644 src/test/regress/sql/uniquekey.sql

diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c
index 750ee5a7e5..eaeaa19ad2 100644
--- a/src/backend/nodes/list.c
+++ b/src/backend/nodes/list.c
@@ -694,6 +694,23 @@ list_member_ptr(const List *list, const void *datum)
 	return false;
 }
 
+/*
+ * Return index of the datum in list if found. otherwise return -1.
+ */
+int
+list_member_ptr_pos(const List *list, const void *datum)
+{
+	ListCell   *lc;
+
+	foreach(lc, list)
+	{
+		if (lfirst(lc) == datum)
+			return foreach_current_index(lc);
+	}
+
+	return -1;
+}
+
 /*
  * Return true iff the integer 'datum' is a member of the list.
  */
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 1e199ff66f..63cc1505d9 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -21,6 +21,7 @@ OBJS = \
 	joinpath.o \
 	joinrels.o \
 	pathkeys.o \
-	tidpath.o
+	tidpath.o \
+	uniquekey.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/README.uniquekey b/src/backend/optimizer/path/README.uniquekey
new file mode 100644
index 00..be13b113b9
--- /dev/null
+++ b/src/backend/optimizer/path/README.uniquekey
@@ -0,0 +1,220 @@
+Here is a design document and a part of implementation.
+
+What is UniqueKey?
+-
+
+UniqueKey represents a uniqueness information for a RelOptInfo. for
+example:
+
+SELECT id FROM t;
+
+where the ID is the UniqueKey for the RelOptInfo (t).  In the real world,
+it has the following attributes:
+
+1). It should be EquivalenceClass based. for example:
+
+SELECT a FROM t WHERE a = id;
+
+In this case, the UniqueKey should be 1 EC with two members
+- EC(EM(a), EM(id)).
+
+
+2). Each UniqueKey may be made up with 1+ EquivalenceClass. for example:
+
+CREATE TABLE t(a int not null,  b int not null);
+CREATE UNIQUE INDEX on t(a, b);
+SELECT * FROM t;
+
+Where the UniqueKey for RelOptInfo (t) will be 2 ECs with each one has 1
+member.
+
+- EC(em=a), EC(em=b)
+
+3). Each RelOptInfo may have 1+ UniqueKeys.
+
+CREATE TABLE t(a int not null,  b int not null, c int not null);
+CREATE UNIQUE INDEX on t(a, b);
+CREATE UNIQUE INDEX on t(c);
+
+SELECT * FROM t;
+
+Where the UniqueKey for RelOptInfo (t) will be
+- [EC(em=a), EC(em=b)].
+- [EC(em=c)]
+
+4). A special case is about the one-row case. It works like:
+SELECT * FROM t WHERE id = 1;
+Here every single expression in the RelOptInfo (t) is unique since only
+one row here.
+
+Where can we use it?
+
+1. mark the distinct as no-op.  SELECT DISTINCT uniquekey FROM v; This
+  optimization has been required several times in our threads.
+
+2. Figure out more pathkey within the onerow case, then some planning
+  time can be reduced to be big extend. This user case is not 

Re: UniqueKey v2

2023-11-06 Thread zhihuifan1213


jian he  writes:

> On Fri, Oct 20, 2023 at 4:33 PM  wrote:
>>
>>
>> > i did some simple tests using text data type.
>> >
>> > it works with the primary key, not with unique indexes.
>> > it does not work when the column is unique, not null.
>> >
>> > The following is my test.
>>
>> Can you simplify your test case please? I can't undertand what "doesn't
>> work" mean here and for which case. FWIW, this feature has nothing with
>> the real data, I don't think inserting any data is helpful unless I
>> missed anything.
>
> Sorry for not explaining it very well.
> "make distinct as no-op."
> my understanding: it means: if fewer rows meet the criteria "columnX <
>  const_a;" , after analyze the table, it should use index only scan

No, "mark distinct as no-op" means the distinct node can be discarded
automatically since it is not needed any more. The simplest case would
be "select distinct pk from t", where it should be same as "select pk
from t".  You can check the testcase for the more cases. 

-- 
Best Regards
Andy Fan





Re: UniqueKey v2

2023-10-22 Thread jian he
On Fri, Oct 20, 2023 at 4:33 PM  wrote:
>
>
> > i did some simple tests using text data type.
> >
> > it works with the primary key, not with unique indexes.
> > it does not work when the column is unique, not null.
> >
> > The following is my test.
>
> Can you simplify your test case please? I can't undertand what "doesn't
> work" mean here and for which case. FWIW, this feature has nothing with
> the real data, I don't think inserting any data is helpful unless I
> missed anything.

Sorry for not explaining it very well.
"make distinct as no-op."
my understanding: it means: if fewer rows meet the criteria "columnX <
 const_a;" , after analyze the table, it should use index only scan
for the queryA?
--queryA:
select distinct columnX from the_table where columnX <  const_a;

There are several ways for columnX to be unique: primark key, unique
key, unique key nulls distinct, unique key nulls not distinct, unique
key and not null.

After applying your patch, only the primary key case will make the
queryA explain output using the index-only scan.




Re: UniqueKey v2

2023-10-20 Thread zhihuifan1213


> i did some simple tests using text data type.
>
> it works with the primary key, not with unique indexes.
> it does not work when the column is unique, not null.
>
> The following is my test.

Can you simplify your test case please? I can't undertand what "doesn't
work" mean here and for which case. FWIW, this feature has nothing with
the real data, I don't think inserting any data is helpful unless I
missed anything. 


-- 
Best Regards
Andy Fan





Re: UniqueKey v2

2023-10-18 Thread jian he
On Tue, Oct 17, 2023 at 11:21 AM  wrote:
>
>
> thanks for the really good suggestion.  Here is the newer version:
>

--- a/src/backend/optimizer/path/meson.build
+++ b/src/backend/optimizer/path/meson.build
@@ -10,4 +10,5 @@ backend_sources += files(
   'joinrels.c',
   'pathkeys.c',
   'tidpath.c',
+  'uniquekey.c'
 )
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 3ac25d47..5ed550ca 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -264,7 +264,10 @@ extern PathKey *make_canonical_pathkey(PlannerInfo *root,

int strategy, bool nulls_first);
 extern void add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,

 List *live_childrels);
-
+/*
+ * uniquekey.c
+ * uniquekey.c related functions.
+ */

-
i did some simple tests using text data type.

it works with the primary key, not with unique indexes.
it does not work when the column is unique, not null.

The following is my test.

begin;
CREATE COLLATION case_insensitive (provider = icu, locale =
'und-u-ks-level2', deterministic = false);
CREATE COLLATION upper_first (provider = icu, locale = 'und-u-kf-upper');
commit;
begin;
CREATE TABLE test_uniquekey3(a text, b text);
CREATE TABLE test_uniquekey4(a text, b text);
CREATE TABLE test_uniquekey5(a text, b text);
CREATE TABLE test_uniquekey6(a text, b text);
CREATE TABLE test_uniquekey7(a text not null, b text not null);
CREATE TABLE test_uniquekey8(a text not null, b text not null);
CREATE TABLE test_uniquekey9(a text primary key COLLATE upper_first, b
text not null);
CREATE TABLE test_uniquekey10(a text primary key COLLATE
case_insensitive, b text not null);
create unique index on test_uniquekey3 (a COLLATE case_insensitive nulls first)
nulls distinct
with (fillfactor = 80);
create unique index on test_uniquekey4 (a COLLATE case_insensitive nulls first)
nulls not distinct
with (fillfactor = 80);
create unique index on test_uniquekey5 (a COLLATE upper_first nulls first)
nulls distinct;
create unique index on test_uniquekey6 (a COLLATE upper_first nulls first)
nulls not distinct;
create unique index on test_uniquekey7 (a COLLATE upper_first nulls
first) nulls distinct;
create unique index on test_uniquekey8 (a COLLATE case_insensitive
nulls first) nulls not distinct;
insert into test_uniquekey3(a,b) select g::text, (g+10)::text from
generate_series(1,1e5) g;
insert into test_uniquekey4(a,b) select g::text, (g+10)::text from
generate_series(1,1e5) g;
insert into test_uniquekey5(a,b) select g::text, (g+10)::text from
generate_series(1,1e5) g;
insert into test_uniquekey6(a,b) select g::text, (g+10)::text from
generate_series(1,1e5) g;
insert into test_uniquekey7(a,b) select g::text, (g+10)::text from
generate_series(1,1e5) g;
insert into test_uniquekey8(a,b) select g::text, (g+10)::text from
generate_series(1,1e5) g;
insert into test_uniquekey9(a,b) select g::text, (g+10)::text from
generate_series(1,1e5) g;
insert into test_uniquekey10(a,b) select g::text, (g+10)::text from
generate_series(1,1e5) g;
insert into test_uniquekey3(a) VALUES(null),(null),(null);
insert into test_uniquekey4(a) VALUES(null);
insert into test_uniquekey5(a) VALUES(null),(null),(null);
insert into test_uniquekey6(a) VALUES(null);
commit;

ANALYZE test_uniquekey3, test_uniquekey4, test_uniquekey5
,test_uniquekey6,test_uniquekey7, test_uniquekey8
,test_uniquekey9, test_uniquekey10;

explain (costs off) select distinct a from test_uniquekey3;
explain (costs off) select distinct a from test_uniquekey4;
explain (costs off) select distinct a from test_uniquekey5;
explain (costs off) select distinct a from test_uniquekey6;
explain (costs off) select distinct a from test_uniquekey7;
explain (costs off) select distinct a from test_uniquekey8;
explain (costs off) select distinct a from test_uniquekey9;
explain (costs off) select distinct a from test_uniquekey10;
explain (costs off) select distinct a from test_uniquekey3 where a < '2000';
explain (costs off) select distinct a from test_uniquekey4 where a < '2000';
explain (costs off) select distinct a from test_uniquekey5 where a < '2000';
explain (costs off) select distinct a from test_uniquekey6 where a < '2000';
explain (costs off) select distinct a from test_uniquekey7 where a < '2000';
explain (costs off) select distinct a from test_uniquekey8 where a < '2000';
explain (costs off) select distinct a from test_uniquekey9 where a < '2000';
explain (costs off) select distinct a from test_uniquekey10 where a < '2000';

--very high selectivity
explain (costs off) select distinct a from test_uniquekey3 where a < '1001';
explain (costs off) select distinct a from test_uniquekey4 where a < '1001';
explain (costs off) select distinct a from test_uniquekey5 where a < '1001';
explain (costs off) select distinct a from test_uniquekey6 where a < '1001';
explain (costs off) select distinct a from test_uniquekey7 where a < '1001';
explain (costs off) select distinct a from 

Re: UniqueKey v2

2023-10-16 Thread zhihuifan1213

jian he  writes:

Hi jian,

> hi.
> After `git am`, I still cannot build.
>
> ../../Desktop/pg_sources/main/postgres/src/backend/optimizer/path/uniquekey.c:125:45:
> error: variable ‘var’ set but not used
> [-Werror=unused-but-set-variable]
>   125 | Var*var;
>   | ^~~

Thanks for this report, looks clang 11 can't capture this error.  I have
switched to clang 17 which would report this issue at the first place. 

>
> You also need to change src/backend/optimizer/path/meson.build.

Great thanks.

>
> git apply failed.
>
> git am warning:
> Applying: uniquekey on base relation and used it for mark-distinct-as-op.
> .git/rebase-apply/patch:876: new blank line at EOF.
> +
> warning: 1 line adds whitespace errors.
>
> I think you can use `git diff --check`
> (https://git-scm.com/docs/git-diff) to check for whitespace related
> errors.

thanks for the really good suggestion.  Here is the newer version:

>From c2c752a3f66353d7d391976b0152c353a239ce60 Mon Sep 17 00:00:00 2001
From: Andy Fan 
Date: Tue, 17 Oct 2023 11:06:53 +0800
Subject: [PATCH v2 1/1] uniquekey on base relation and used it for
 mark-distinct-as-noop.

---
 src/backend/nodes/list.c|  17 ++
 src/backend/optimizer/path/Makefile |   3 +-
 src/backend/optimizer/path/allpaths.c   |   2 +
 src/backend/optimizer/path/equivclass.c |  48 +++
 src/backend/optimizer/path/uniquekey.c  | 384 
 src/backend/optimizer/plan/initsplan.c  |   8 +
 src/backend/optimizer/plan/planner.c|  33 ++
 src/backend/optimizer/util/plancat.c|  10 +
 src/backend/optimizer/util/relnode.c|   2 +
 src/include/nodes/pathnodes.h   |  19 ++
 src/include/nodes/pg_list.h |   2 +
 src/include/optimizer/paths.h   |  10 +
 src/test/regress/expected/uniquekey.out |  69 +
 src/test/regress/parallel_schedule  |   2 +-
 src/test/regress/sql/uniquekey.sql  |  28 ++
 15 files changed, 635 insertions(+), 2 deletions(-)
 create mode 100644 src/backend/optimizer/path/uniquekey.c
 create mode 100644 src/test/regress/expected/uniquekey.out
 create mode 100644 src/test/regress/sql/uniquekey.sql

diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c
index 750ee5a7e55..eaeaa19ad2d 100644
--- a/src/backend/nodes/list.c
+++ b/src/backend/nodes/list.c
@@ -694,6 +694,23 @@ list_member_ptr(const List *list, const void *datum)
 	return false;
 }
 
+/*
+ * Return index of the datum in list if found. otherwise return -1.
+ */
+int
+list_member_ptr_pos(const List *list, const void *datum)
+{
+	ListCell   *lc;
+
+	foreach(lc, list)
+	{
+		if (lfirst(lc) == datum)
+			return foreach_current_index(lc);
+	}
+
+	return -1;
+}
+
 /*
  * Return true iff the integer 'datum' is a member of the list.
  */
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 1e199ff66f7..63cc1505d96 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -21,6 +21,7 @@ OBJS = \
 	joinpath.o \
 	joinrels.o \
 	pathkeys.o \
-	tidpath.o
+	tidpath.o \
+	uniquekey.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 3cda88e..815a74a9dfa 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -458,6 +458,8 @@ set_rel_size(PlannerInfo *root, RelOptInfo *rel,
 		}
 	}
 
+	populate_baserel_uniquekeys(root, rel);
+
 	/*
 	 * We insist that all non-dummy rels have a nonzero rowcount estimate.
 	 */
diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index 7fa502d6e25..29edc8d1b5a 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -744,6 +744,54 @@ get_eclass_for_sort_expr(PlannerInfo *root,
 	return newec;
 }
 
+/*
+ * find_ec_position_matching_expr
+ *		Locate the position of EquivalenceClass whose members matching
+ *		the given expr, if any; return -1 if no match.
+ */
+int
+find_ec_position_matching_expr(PlannerInfo *root,
+			   Expr *expr,
+			   RelOptInfo *baserel)
+{
+	int			i = -1;
+
+	while ((i = bms_next_member(baserel->eclass_indexes, i)) >= 0)
+	{
+		EquivalenceClass *ec = list_nth(root->eq_classes, i);
+
+		if (find_ec_member_matching_expr(ec, expr, baserel->relids))
+			return i;
+	}
+	return -1;
+}
+
+/*
+ * build_ec_positions_for_exprs
+ *
+ *		Given a list of exprs, find the related EC positions for each of
+ *		them. if any exprs has no EC related, return NULL;
+ */
+Bitmapset *
+build_ec_positions_for_exprs(PlannerInfo *root, List *exprs, RelOptInfo *rel)
+{
+	ListCell   *lc;
+	Bitmapset  *res = NULL;
+
+	foreach(lc, exprs)
+	{
+		int			pos = find_ec_position_matching_expr(root, lfirst(lc), rel);
+
+		if (pos < 0)
+		{
+			bms_free(res);
+			return NULL;
+		}
+		res = bms_add_member(res, pos);
+	}
+	return res;
+}
+
 /*
  * 

Re: UniqueKey v2

2023-10-15 Thread jian he
hi.
After `git am`, I still cannot build.

../../Desktop/pg_sources/main/postgres/src/backend/optimizer/path/uniquekey.c:125:45:
error: variable ‘var’ set but not used
[-Werror=unused-but-set-variable]
  125 | Var*var;
  | ^~~


You also need to change src/backend/optimizer/path/meson.build.

git apply failed.

git am warning:
Applying: uniquekey on base relation and used it for mark-distinct-as-op.
.git/rebase-apply/patch:876: new blank line at EOF.
+
warning: 1 line adds whitespace errors.

I think you can use `git diff --check`
(https://git-scm.com/docs/git-diff) to check for whitespace related
errors.