Re: [PATCH] Erase the distinctClause if the result is unique by definition
I have started the new thread [1] to continue talking about this. Mr. cfbot is happy now.

[1] https://www.postgresql.org/message-id/flat/CAKU4AWrwZMAL%3DuaFUDMf4WGOVkEL3ONbatqju9nSXTUucpp_pw%40mail.gmail.com

Thanks
Re: [PATCH] Erase the distinctClause if the result is unique by definition
Hi David:

On Wed, Mar 18, 2020 at 12:13 PM David Rowley wrote:
> On Wed, 18 Mar 2020 at 15:57, Andy Fan wrote:
> > I'm now writing the code for the partition index stuff, which
> > is a bit boring, since every partition may have a different unique index.
>
> Why is that case so different?
>
> For a partitioned table to have a valid unique index, a unique index
> must exist on each partition having columns that are a superset of the
> partition key columns. An IndexOptInfo will exist on the partitioned
> table's RelOptInfo, in this case.

The main differences are:

1. We can create a unique index on only some of the partitions:

    create table q100 (a int, b int, c int) partition by range (b);
    create table q100_1 partition of q100 for values from (1) to (10);
    create table q100_2 partition of q100 for values from (11) to (20);
    create unique index q100_1_c on q100_1(c);  -- user may create this index on q100_1 only

2. The unique index may not contain the partition key, as above. In that
case, even if the same index exists on every partition, we still can't use
it, since it is unique within each partition only.

3. So a unique index on a partitioned table can be used only if it contains
the partition key AND exists on all the partitions.

4. When we apply uniquekey_is_useful_for_rel, I compare the information
between ind->indextlist and rel->reltarget, but the indextlist has the
wrong varno, so we have to change the varno with ChangeVarNodes for the
indextlist taken from the childrel, since its varno is for the childrel.

5. When we detect the "uk = 1" case, the uk is also expressed with the
parentrel->relid information, so we may need to run ChangeVarNodes on
childrel->indexinfo->indextlist as well.

Even though the rules look long, the run time should be very short, since
we usually don't have many unique indexes on a partitioned table.

> At the leaf partition level, wouldn't you just add the uniquekeys the
> same as we do for base rels?
Yes, but since the uk of a childrel may not be useful for the parent rel
(the uk may exist in only one partition), I think we can skip that step in
the child rel case?

Best Regards
Andy Fan
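The three-part rule above (unique index must cover the partition key and exist on every partition) can be sketched as set logic. This is plain illustrative Python, not the patch's C code; the function and parameter names are hypothetical.

```python
def unique_index_valid_for_partitioned_table(index_cols, part_key_cols,
                                             partition_indexes):
    """index_cols: frozenset of columns of the candidate unique index.
    part_key_cols: frozenset of partition key columns.
    partition_indexes: one entry per partition, each a set of frozensets
    giving the column sets of that partition's unique indexes."""
    # Rule 2/3: the index must contain the partition key; otherwise it
    # proves uniqueness only within each partition.
    if not part_key_cols <= index_cols:
        return False
    # Rule 1/3: an equivalent unique index must exist on every partition.
    return all(index_cols in idxs for idxs in partition_indexes)


# The q100 example from the mail: unique index on (c) for q100_1 only,
# partition key is (b), so the index cannot be used for the parent.
part_key = frozenset({'b'})
q100_1_indexes = {frozenset({'c'})}   # q100_1 has a unique index on (c)
q100_2_indexes = set()                # q100_2 has no unique index
usable = unique_index_valid_for_partitioned_table(
    frozenset({'c'}), part_key, [q100_1_indexes, q100_2_indexes])
```

A unique index on (b, c) present on both partitions would pass both checks.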
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Wed, 18 Mar 2020 at 15:57, Andy Fan wrote:
> I'm now writing the code for partition index stuff, which
> is a bit of boring, since every partition may have different unique index.

Why is that case so different?

For a partitioned table to have a valid unique index, a unique index
must exist on each partition having columns that are a superset of the
partition key columns. An IndexOptInfo will exist on the partitioned
table's RelOptInfo, in this case.

At the leaf partition level, wouldn't you just add the uniquekeys the
same as we do for base rels? Maybe only do it if
enable_partitionwise_aggregation is on. Otherwise, I don't think we'll
currently have a need for them. Currently, we don't do unique joins for
partition-wise joins. Perhaps uniquekeys will be a good way to fix that
omission in the future.
Re: [PATCH] Erase the distinctClause if the result is unique by definition
Hi David:

Thanks for your time.

On Wed, Mar 18, 2020 at 9:56 AM David Rowley wrote:
> On Mon, 16 Mar 2020 at 06:01, Andy Fan wrote:
> > Hi All:
> >
> > I have re-implemented the patch based on David's suggestion/code. Looks
> > it works well. The updated patch mainly includes:
> >
> > 1. Maintain the not_null_colno in RelOptInfo, which includes the not
> > null from catalog and the not null from vars.
>
> What about non-nullability that we can derive from means other than
> NOT NULL constraints. Where will you track that now that you've
> removed the UniqueKeySet type?

I track it in deconstruct_recurse, just before the distribute_qual_to_rels
call:

+   ListCell   *lc;
+   foreach(lc, find_nonnullable_vars(qual))
+   {
+       Var        *var = lfirst_node(Var, lc);
+       RelOptInfo *rel = root->simple_rel_array[var->varno];
+       if (var->varattno > InvalidAttrNumber)
+           rel->not_null_cols = bms_add_member(rel->not_null_cols,
+                                               var->varattno);
+   }

> Traditionally we use attno or attnum rather than colno for variable
> names containing attribute numbers

Currently I use a list of Vars for a UniqueKey; I guess that is ok?

> > 3. postpone the propagate_unique_keys_to_joinrel call to
> > populate_joinrel_with_paths, since we know the jointype at that time,
> > so we can handle the semi/anti join specially.
>
> ok, but the join type was known already where I was calling the
> function from. It just wasn't passed to the function.
>
> > 4. Add the rule I suggested above: if both of the 2 relations yield a
> > unique result, the join result will be unique as well. The UK can be
> > ( (rel1_uk1, rel1_uk2).. )
>
> I see. So basically you're saying that the joinrel's uniquekeys should
> be the cartesian product of the unique rels from either side of the
> join. I wonder if that's a special case we need to worry about too
> much. Surely it only applies for clauseless joins

There are some other cases where we may need this as well :)
like

    select m1.pk, m2.pk from m1, m2 where m1.b = m2.b;

The cartesian product of the unique keys can make the unique keys too long,
so I maintain a UniqueKeyContext to keep them short. The idea is: if (UK1)
is unique already, there is no need to add another UK like (UK1, UK2),
which is just a superset of it.

> > 5. If the unique key is impossible to be referenced by others, we can
> > safely ignore it in order to keep the (join)rel->uniquekeys short.
>
> You could probably have an equivalent of has_useful_pathkeys() and
> pathkeys_useful_for_ordering()

Thanks for the suggestion, I will do so in the v5-xx patch.

> > 6. I only consider the not null check/opfamily check for the uniquekey
> > which comes from UniqueIndex. I think that should be correct.
> >
> > 7. I defined each uniquekey as a List of Expr, so I didn't introduce a
> > new node type.
>
> Where will you store the collation Oid? I left comments to mention
> that needed to be checked but just didn't wire it up.

This is embarrassing; I am not sure it is safe to ignore it. I removed it
for the following reasons (sorry that I didn't explain this carefully in
the last email).

1. When we check whether a UK is usable, we have a chance to compare the
collation info from the restrictinfo (where uk = 1) or the target list
(select uk from t) with the index's collation. The targetlist case is more
troublesome, since we need to figure out the default collation for it.
However, relation_has_unique_index_for is in the same situation as us, and
it ignores collation as well; see the comment /* XXX at some point we may
need to check collations here too. */. I think if there are reasons it can
ignore that, perhaps we can too.

2. What we expect from a UK is:

a). WHERE m1.uniquekey = m2.b: m2 will not be duplicated by this
joinclause. Here, if m1.uk has a different collation, it will raise a
run-time error.

b). WHERE m1.uniquekey collate '' = m2.b: we can't depend on the run-time
error this time.
But if we are sure that *if uk is unique at the default collation, then (uk
collate 'other-collation') is unique as well*, then we may safely ignore it
here too.

c). select uniquekey from t / select uniquekey collate '' from t. This has
the same requirement as item b).

3). It looks like maintaining the collation for our case is a big effort,
and users rarely use it. If my expectation for 2.b is not true, I'd prefer
to detect such a case (the user using a different collation) and simply
ignore the UK for it. But after all, I think this should be an open
question for now.

---

At last, I am so grateful for your suggestions/feedback; they are really
instructive and constructive. And thanks to Tom for the quick review and
the suggestion to add new fields to RelOptInfo; without it I don't think I
would have dared to add a new field to such an important struct. And
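The UniqueKeyContext pruning idea mentioned above (don't add (UK1, UK2) when (UK1) is already known unique) amounts to subset checks on the key sets. A minimal sketch in plain Python, not the patch's C code; `add_unique_key` is a hypothetical name:

```python
def add_unique_key(uniquekeys, candidate):
    """uniquekeys: list of frozensets of exprs already proven unique."""
    # If some existing key is a subset of the candidate, the candidate is
    # redundant: uniqueness on fewer columns implies uniqueness on more.
    if any(uk <= candidate for uk in uniquekeys):
        return uniquekeys
    # Conversely, drop any existing keys the candidate makes redundant.
    kept = [uk for uk in uniquekeys if not candidate <= uk]
    return kept + [candidate]


uks = []
uks = add_unique_key(uks, frozenset({'m1.pk', 'm2.pk'}))
uks = add_unique_key(uks, frozenset({'m1.pk'}))          # replaces the pair
uks = add_unique_key(uks, frozenset({'m1.pk', 'm1.b'}))  # superset, ignored
```

After the three calls only the single-column key (m1.pk) remains, keeping the list short.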
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Mon, 16 Mar 2020 at 06:01, Andy Fan wrote:
> Hi All:
>
> I have re-implemented the patch based on David's suggestion/code, Looks it
> works well. The updated patch mainly includes:
>
> 1. Maintain the not_null_colno in RelOptInfo, which includes the not null
> from catalog and the not null from vars.

What about non-nullability that we can derive from means other than
NOT NULL constraints. Where will you track that now that you've
removed the UniqueKeySet type?

Traditionally we use attno or attnum rather than colno for variable
names containing attribute numbers

> 3. postpone the propagate_unique_keys_to_joinrel call to
> populate_joinrel_with_paths since we know jointype at that time. so we
> can handle the semi/anti join specially.

ok, but the join type was known already where I was calling the
function from. It just wasn't passed to the function.

> 4. Add the rule I suggested above, if both of the 2 relation yields the a
> unique result, the join result will be unique as well. the UK can be
> ( (rel1_uk1, rel1_uk2).. )

I see. So basically you're saying that the joinrel's uniquekeys should
be the cartesian product of the unique rels from either side of the
join. I wonder if that's a special case we need to worry about too
much. Surely it only applies for clauseless joins.

> 5. If the unique key is impossible to be referenced by others, we can
> safely ignore it in order to keep the (join)rel->uniquekeys short.

You could probably have an equivalent of has_useful_pathkeys() and
pathkeys_useful_for_ordering()

> 6. I only consider the not null check/opfamily check for the uniquekey
> which comes from UniqueIndex. I think that should be correct.
>
> 7. I defined each uniquekey as List of Expr, so I didn't introduce new
> node type.

Where will you store the collation Oid? I left comments to mention
that needed to be checked but just didn't wire it up.
Re: [PATCH] Erase the distinctClause if the result is unique by definition
Hi All:

I have re-implemented the patch based on David's suggestion/code. Looks it
works well. The updated patch mainly includes:

1. Maintain the not_null_colno in RelOptInfo, which includes the not null
from the catalog and the not null from vars.

2. Add the restrictinfo check at populate_baserel_uniquekeys. If we are
sure about only 1 row being returned, I add each expr in
rel->reltarget->exprs as a unique key. E.g. for (select a, b, c from t
where pk = 1), the uk will be ( (a), (b), (c) ).

3. Postpone the propagate_unique_keys_to_joinrel call to
populate_joinrel_with_paths, since we know the jointype at that time, so we
can handle the semi/anti join specially.

4. Add the rule I suggested above: if both of the 2 relations yield a
unique result, the join result will be unique as well. The UK can be
( (rel1_uk1, rel1_uk2).. ).

5. If the unique key is impossible to be referenced by others, we can
safely ignore it in order to keep the (join)rel->uniquekeys short.

6. I only consider the not null check/opfamily check for the uniquekey
which comes from UniqueIndex. I think that should be correct.

7. I defined each uniquekey as a List of Expr, so I didn't introduce a new
node type.

8. Check the uniquekeys' information before create_distinct_paths and
create_group_paths: skip creating the new paths if the sortgroupclauses are
unique already, or else create them and add the new uniquekey to the
distinctrel/grouprel.

There are some things still in progress, like:

1. Partitioned tables.
2. union/union all.
3. Maybe refactor is_innerrel_unique_for/query_is_distinct_for to use
UniqueKey.
4. If we are sure the group-by clause is unique and we have an aggregation
call, maybe we should try Bapat's suggestion that we can use sort rather
than hash. The strategy sounds awesome, but I didn't check the details so
far.
5. A clearer commit message.
6. Anything more?

Any feedback is welcome. Thanks for any ideas, suggestions, or demo code!
Best Regards Andy Fan v4-0001-Patch-Bypass-distinctClause-groupbyClause-if-the-.patch Description: Binary data
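Item 2 of the patch summary above can be sketched very simply: when a relation is known to return at most one row (e.g. "pk = const"), every individual target expression is trivially a unique key on its own. Illustrative Python only; the function name is hypothetical.

```python
def uniquekeys_for_single_row_rel(target_exprs):
    """When the rel returns at most one row, each target expr by itself
    distinguishes that (at most one) row, so each is a unique key."""
    return [frozenset({e}) for e in target_exprs]


# select a, b, c from t where pk = 1  ->  uks = ( (a), (b), (c) )
uks = uniquekeys_for_single_row_rel(['a', 'b', 'c'])
```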
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Fri, Mar 13, 2020 at 11:46 AM David Rowley wrote:
> On Fri, 13 Mar 2020 at 14:47, Andy Fan wrote:
> > 1. for populate_baserel_uniquekeys, we need to handle the "pk = Const"
> > case as well (relation_has_unique_index_for has similar logic);
> > currently the following distinct path is still there.
>
> Yeah, I left a comment in propagate_unique_keys_to_joinrel() to
> mention that still needs to be done.
>
> > postgres=# explain select distinct b from t100 where pk = 1;
> >                                 QUERY PLAN
> > --------------------------------------------------------------------------
> >  Unique  (cost=8.18..8.19 rows=1 width=4)
> >    ->  Sort  (cost=8.18..8.19 rows=1 width=4)
> >          Sort Key: b
> >          ->  Index Scan using t100_pkey on t100  (cost=0.15..8.17 rows=1 width=4)
> >                Index Cond: (pk = 1)
> > (5 rows)
> >
> > I think in this case, we can add both (pk) and (b) as the UniquePaths.
> > If so we can get more opportunities to reach our goal.
>
> The UniqueKeySet containing "b" could only be added in the
> distinct_rel in the upper planner. It must not change the input_rel
> for the distinct.

I think we maintain UniqueKeys even without a distinct_rel, so at this
stage, can we say b is unique for this (is "no" possible)? If yes, we
probably need to set that information without considering the distinct
clause.

> It's likely best to steer clear of calling UniqueKeys UniquePaths as
> it might confuse people. The term "path" is used in PostgreSQL as a
> lightweight representation containing all the information required to
> build a plan node in createplan.c. More details in
> src/backend/optimizer/README.

OK.

> > 2. As for propagate_unique_keys_to_joinrel, we can add 1 more
> > UniquePath as (rel1_unique_paths, rel2_unique_paths) if the current
> > rules don't apply, or else the following cases can't be handled.
> > postgres=# explain select distinct t100.pk, t101.pk from t100, t101;
> >                                   QUERY PLAN
> > ------------------------------------------------------------------------------
> >  Unique  (cost=772674.11..810981.11 rows=5107600 width=8)
> >    ->  Sort  (cost=772674.11..785443.11 rows=5107600 width=8)
> >          Sort Key: t100.pk, t101.pk
> >          ->  Nested Loop  (cost=0.00..63915.85 rows=5107600 width=8)
> >                ->  Seq Scan on t100  (cost=0.00..32.60 rows=2260 width=4)
> >                ->  Materialize  (cost=0.00..43.90 rows=2260 width=4)
> >                      ->  Seq Scan on t101  (cost=0.00..32.60 rows=2260 width=4)
> > (7 rows)
>
> I don't really follow what you mean here. It seems to me there's no
> way we can skip doing DISTINCT in the case above. If you've just
> missed out the join clause and you meant to have "WHERE t100.pk =
> t101.pk", then we can likely fix that later with some sort of
> functional dependency tracking.

In the above case the result should be unique. The knowledge behind that
is: if *we join 2 unique results with any join method, the result is
unique as well*. In the above example, the final unique key is
(t100.pk, t101.pk).

> Likely we can just add a Relids field
> to UniqueKeySet to track which relids are functionally dependant on a
> row from the UniqueKeySet's uk_exprs. That might be as simple as just
> pull_varnos from the non-matched exprs and checking to ensure the
> result is a subset of functionally dependant rels. I'd need to give
> that more thought.
>
> Was this a case you had working in your patch?

I think we can do that now that I have your UniqueKey idea, so no, my
previous patch is not as smart as yours :)

> > But if we add such a rule, the unique keys probably become much longer,
> > so we need a strategy to tell if a UniqueKey is useful for our query;
> > if not, we can ignore it. rel->reltarget may be good info for such an
> > optimization. I think we can take this into consideration for
> > populate_baserel_uniquekeys as well.
>
> I don't really think the number of unique indexes in a base rel will
> really ever get out of hand for legitimate cases.
> propagate_unique_keys_to_joinrel is just concatenating baserel
> UniqueKeySets to the joinrel. They're not copied, so it's just tagging
> pointers onto the end of an array, which is at best a memcpy(), or at
> worst a realloc() then memcpy(). That's not so costly.

The memcpy is not the key concern here. My main point is that we need to
watch the length of RelOptInfo->uniquekeys. For example, suppose t has 3
uks like this: (uk1), (uk2), (uk3), and the query is

    select b from t where m = 1;

If so, there is no need to add these 3 to the UniqueKeys, so that we can
keep rel->uniquekeys shorter. The length of rel->uniquekeys may be a
concern if we add the rule I suggested above, the (t100.pk, t101.pk) case.
Think about this, for example:

1. select .. from t1, t2, t3, t4...;
2. suppose each table has 2 UniqueKeys, named (t{m}_uk{n});
3. follow my above rule (t1.pk1,
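The growth concern sketched in the t1..t4 example above is simple arithmetic: if a joinrel's unique keys included every combination of one key from each input, an n-way clauseless join of tables with k keys each would carry k**n composite keys. A toy illustration (plain Python, not planner code):

```python
def composite_key_count(num_tables, keys_per_table):
    """Number of composite unique keys if every joinrel key set is the
    cartesian combination of one key from each joined table."""
    return keys_per_table ** num_tables


# With 2 unique keys per table, the count doubles with each table joined.
growth = [composite_key_count(n, 2) for n in range(1, 5)]
```

This is why pruning keys that the query can never reference (the `select b from t where m = 1` example) matters.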
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Fri, 13 Mar 2020 at 14:47, Andy Fan wrote:
> 1. for populate_baserel_uniquekeys, we need handle the "pk = Const" as
> well. (relation_has_unique_index_for has a similar logic) currently the
> following distinct path is still there.

Yeah, I left a comment in propagate_unique_keys_to_joinrel() to
mention that still needs to be done.

> postgres=# explain select distinct b from t100 where pk = 1;
>                                 QUERY PLAN
> --------------------------------------------------------------------------
>  Unique  (cost=8.18..8.19 rows=1 width=4)
>    ->  Sort  (cost=8.18..8.19 rows=1 width=4)
>          Sort Key: b
>          ->  Index Scan using t100_pkey on t100  (cost=0.15..8.17 rows=1 width=4)
>                Index Cond: (pk = 1)
> (5 rows)
>
> I think in this case, we can add both (pk) and (b) as the UniquePaths.
> If so we can get more opportunities to reach our goal.

The UniqueKeySet containing "b" could only be added in the
distinct_rel in the upper planner. It must not change the input_rel
for the distinct.

It's likely best to steer clear of calling UniqueKeys UniquePaths as
it might confuse people. The term "path" is used in PostgreSQL as a
lightweight representation containing all the information required to
build a plan node in createplan.c. More details in
src/backend/optimizer/README.

> 2. As for the propagate_unique_keys_to_joinrel, we can add 1 more
> UniquePath as (rel1_unique_paths, rel2_unique_paths) if the current
> rules doesn't apply. or else the following cases can't be handled.
>
> postgres=# explain select distinct t100.pk, t101.pk from t100, t101;
>                                   QUERY PLAN
> ------------------------------------------------------------------------------
>  Unique  (cost=772674.11..810981.11 rows=5107600 width=8)
>    ->  Sort  (cost=772674.11..785443.11 rows=5107600 width=8)
>          Sort Key: t100.pk, t101.pk
>          ->  Nested Loop  (cost=0.00..63915.85 rows=5107600 width=8)
>                ->  Seq Scan on t100  (cost=0.00..32.60 rows=2260 width=4)
>                ->  Materialize  (cost=0.00..43.90 rows=2260 width=4)
>                      ->  Seq Scan on t101  (cost=0.00..32.60 rows=2260 width=4)
> (7 rows)

I don't really follow what you mean here. It seems to me there's no
way we can skip doing DISTINCT in the case above.
If you've just missed out the join clause and you meant to have "WHERE
t100.pk = t101.pk", then we can likely fix that later with some sort of
functional dependency tracking. Likely we can just add a Relids field to
UniqueKeySet to track which relids are functionally dependant on a row
from the UniqueKeySet's uk_exprs. That might be as simple as just
pull_varnos from the non-matched exprs and checking to ensure the result
is a subset of functionally dependant rels. I'd need to give that more
thought.

Was this a case you had working in your patch?

> But if we add such rule, the unique paths probably become much longer,
> so we need a strategy to tell if the UniquePath is useful for our query;
> if not, we can ignore that. rel->reltarget maybe a good info for such
> optimization. I think we can take this into consideration for
> populate_baserel_uniquekeys as well.

I don't really think the number of unique indexes in a base rel will
really ever get out of hand for legitimate cases.

propagate_unique_keys_to_joinrel is just concatenating baserel
UniqueKeySets to the joinrel. They're not copied, so it's just tagging
pointers onto the end of an array, which is at best a memcpy(), or at
worst a realloc() then memcpy(). That's not so costly.

> For the non_null info, Tom suggested to add maintain such info in
> RelOptInfo. I have done that for the not_null_info for the basic
> relation catalog. I think we can maintain the same flag for joinrel and
> the not null info from find_nonnullable_vars as well, but I still didn't
> find a good place to add that so far.

I'd considered just adding a get_notnull() function to lsyscache.c.
Just below get_attname() looks like a good spot. I imagined just setting
the bit in the UniqueKeySet's non_null_keys field corresponding to the
column position from the index.
I could see the benefit of having a field in RelOptInfo if there was some
way to determine the not-null properties of all columns in the table at
once, but there's not, so we're likely best just looking at the ones that
there are unique indexes on.

> A small question about the following code:
>
> + if (relation_has_uniquekeys_for(root, input_rel,
> +                                 get_sortgrouplist_exprs(parse->distinctClause,
> +                                                         parse->targetList),
> +                                 false))
> + {
> +     add_path(distinct_rel, (Path *) cheapest_input_path);
> +
> +     /* XXX yeah yeah, need to call the hooks etc. */
> +
> +     /* Now choose the best path(s) */
> +     set_cheapest(distinct_rel);
> +
> +     return distinct_rel;
> + }
>
> Since we don't create
Re: [PATCH] Erase the distinctClause if the result is unique by definition
Hi David:

On Thu, Mar 12, 2020 at 3:51 PM David Rowley wrote:
> On Wed, 11 Mar 2020 at 17:23, Andy Fan wrote:
> > Now I am convinced that we should maintain UniquePath on RelOptInfo,
> > I would see how to work with "Index Skip Scan" patch.
>
> I've attached a very early proof of concept patch for unique keys.
> The NULL detection stuff is not yet hooked up, so it'll currently do
> the wrong thing for NULLable columns. I've left some code in there
> with my current idea of how to handle that, but I'll need to add more
> code both to look at the catalogue tables to see if there's a NOT NULL
> constraint and also to check for strict quals that filter out NULLs.
>
> Additionally, I've not hooked up the collation checking stuff yet. I
> just wanted to see if it would work ok for non-collatable types first.
>
> I've added a couple of lines to create_distinct_paths() to check if
> the input_rel has the required UniqueKeys to skip doing the DISTINCT.
> It seems to work, but my tests so far are limited to:
>
> create table t1(a int primary key, b int);
> create table t2(a int primary key, b int);
>
> postgres=# -- t2 could duplicate t1, don't remove DISTINCT
> postgres=# explain (costs off) select distinct t1.a from t1 inner join
> t2 on t1.a = t2.b;
>            QUERY PLAN
> ----------------------------------
>  HashAggregate
>    Group Key: t1.a
>    ->  Hash Join
>          Hash Cond: (t2.b = t1.a)
>          ->  Seq Scan on t2
>          ->  Hash
>                ->  Seq Scan on t1
> (7 rows)
>
> postgres=# -- neither rel can duplicate the other due to join on PK.
> Remove DISTINCT
> postgres=# explain (costs off) select distinct t1.a from t1 inner join
> t2 on t1.a = t2.a;
>          QUERY PLAN
> -----------------------------
>  Hash Join
>    Hash Cond: (t1.a = t2.a)
>    ->  Seq Scan on t1
>    ->  Hash
>          ->  Seq Scan on t2
> (5 rows)
>
> postgres=# -- t2.a cannot duplicate t1 and t1.a is unique.
> Remove DISTINCT
> postgres=# explain (costs off) select distinct t1.a from t1 inner join
> t2 on t1.b = t2.a;
>          QUERY PLAN
> -----------------------------
>  Hash Join
>    Hash Cond: (t1.b = t2.a)
>    ->  Seq Scan on t1
>    ->  Hash
>          ->  Seq Scan on t2
> (5 rows)
>
> postgres=# -- t1.b can duplicate t2.a. Don't remove DISTINCT
> postgres=# explain (costs off) select distinct t2.a from t1 inner join
> t2 on t1.b = t2.a;
>            QUERY PLAN
> ----------------------------------
>  HashAggregate
>    Group Key: t2.a
>    ->  Hash Join
>          Hash Cond: (t1.b = t2.a)
>          ->  Seq Scan on t1
>          ->  Hash
>                ->  Seq Scan on t2
> (7 rows)
>
> postgres=# -- t1.a cannot duplicate t2.a. Remove DISTINCT.
> postgres=# explain (costs off) select distinct t2.a from t1 inner join
> t2 on t1.a = t2.b;
>          QUERY PLAN
> -----------------------------
>  Hash Join
>    Hash Cond: (t2.b = t1.a)
>    ->  Seq Scan on t2
>    ->  Hash
>          ->  Seq Scan on t1
> (5 rows)
>
> I've also left a bunch of XXX comments for things that I know need more
> thought.
>
> I believe we can propagate the joinrel's unique keys where the patch
> is currently doing it. I understand that in
> populate_joinrel_with_paths() we do things like swapping LEFT JOINs
> for RIGHT JOINs and switch the input rels around, but we do so only
> because it's equivalent, so I don't currently see why we can't take
> the jointype for the SpecialJoinInfo. I need to know that as I'll need
> to ignore pushed down RestrictInfos for outer joins.
>
> I'm posting now as I know I've been mentioning this UniqueKeys idea
> for quite a while and if it's not something that's going to get off
> the ground, then it's better to figure that out now.

Thanks for the code! Here are some points from me.

1. For populate_baserel_uniquekeys, we need to handle the "pk = Const"
case as well (relation_has_unique_index_for has similar logic); currently
the following distinct path is still there.
postgres=# explain select distinct b from t100 where pk = 1;
                                QUERY PLAN
--------------------------------------------------------------------------
 Unique  (cost=8.18..8.19 rows=1 width=4)
   ->  Sort  (cost=8.18..8.19 rows=1 width=4)
         Sort Key: b
         ->  Index Scan using t100_pkey on t100  (cost=0.15..8.17 rows=1 width=4)
               Index Cond: (pk = 1)
(5 rows)

I think in this case, we can add both (pk) and (b) as the UniquePaths. If
so we can get more opportunities to reach our goal.

2. As for propagate_unique_keys_to_joinrel, we can add 1 more UniquePath
as (rel1_unique_paths, rel2_unique_paths) if the current rules don't
apply, or else the following cases can't be handled.

postgres=# explain select distinct t100.pk, t101.pk from t100, t101;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Unique  (cost=772674.11..810981.11 rows=5107600 width=8)
   ->  Sort
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Wed, 11 Mar 2020 at 17:23, Andy Fan wrote:
> Now I am convinced that we should maintain UniquePath on RelOptInfo,
> I would see how to work with "Index Skip Scan" patch.

I've attached a very early proof of concept patch for unique keys.
The NULL detection stuff is not yet hooked up, so it'll currently do
the wrong thing for NULLable columns. I've left some code in there
with my current idea of how to handle that, but I'll need to add more
code both to look at the catalogue tables to see if there's a NOT NULL
constraint and also to check for strict quals that filter out NULLs.

Additionally, I've not hooked up the collation checking stuff yet. I
just wanted to see if it would work ok for non-collatable types first.

I've added a couple of lines to create_distinct_paths() to check if
the input_rel has the required UniqueKeys to skip doing the DISTINCT.
It seems to work, but my tests so far are limited to:

create table t1(a int primary key, b int);
create table t2(a int primary key, b int);

postgres=# -- t2 could duplicate t1, don't remove DISTINCT
postgres=# explain (costs off) select distinct t1.a from t1 inner join
t2 on t1.a = t2.b;
           QUERY PLAN
----------------------------------
 HashAggregate
   Group Key: t1.a
   ->  Hash Join
         Hash Cond: (t2.b = t1.a)
         ->  Seq Scan on t2
         ->  Hash
               ->  Seq Scan on t1
(7 rows)

postgres=# -- neither rel can duplicate the other due to join on PK.
Remove DISTINCT
postgres=# explain (costs off) select distinct t1.a from t1 inner join
t2 on t1.a = t2.a;
         QUERY PLAN
-----------------------------
 Hash Join
   Hash Cond: (t1.a = t2.a)
   ->  Seq Scan on t1
   ->  Hash
         ->  Seq Scan on t2
(5 rows)

postgres=# -- t2.a cannot duplicate t1 and t1.a is unique. Remove DISTINCT
postgres=# explain (costs off) select distinct t1.a from t1 inner join
t2 on t1.b = t2.a;
         QUERY PLAN
-----------------------------
 Hash Join
   Hash Cond: (t1.b = t2.a)
   ->  Seq Scan on t1
   ->  Hash
         ->  Seq Scan on t2
(5 rows)

postgres=# -- t1.b can duplicate t2.a.
Don't remove DISTINCT
postgres=# explain (costs off) select distinct t2.a from t1 inner join
t2 on t1.b = t2.a;
           QUERY PLAN
----------------------------------
 HashAggregate
   Group Key: t2.a
   ->  Hash Join
         Hash Cond: (t1.b = t2.a)
         ->  Seq Scan on t1
         ->  Hash
               ->  Seq Scan on t2
(7 rows)

postgres=# -- t1.a cannot duplicate t2.a. Remove DISTINCT.
postgres=# explain (costs off) select distinct t2.a from t1 inner join
t2 on t1.a = t2.b;
         QUERY PLAN
-----------------------------
 Hash Join
   Hash Cond: (t2.b = t1.a)
   ->  Seq Scan on t2
   ->  Hash
         ->  Seq Scan on t1
(5 rows)

I've also left a bunch of XXX comments for things that I know need more
thought.

I believe we can propagate the joinrel's unique keys where the patch
is currently doing it. I understand that in
populate_joinrel_with_paths() we do things like swapping LEFT JOINs
for RIGHT JOINs and switch the input rels around, but we do so only
because it's equivalent, so I don't currently see why we can't take
the jointype for the SpecialJoinInfo. I need to know that as I'll need
to ignore pushed down RestrictInfos for outer joins.

I'm posting now as I know I've been mentioning this UniqueKeys idea
for quite a while and if it's not something that's going to get off
the ground, then it's better to figure that out now.

bearly_poc_uniquekeys_v0.patch
Description: Binary data
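The check the proof-of-concept performs in create_distinct_paths() can be sketched as set containment: DISTINCT is a no-op when the distinct expressions are a superset of some unique key of the input relation. Illustrative Python mirroring the idea only; the names here are hypothetical, not the patch's API.

```python
def distinct_is_noop(distinct_exprs, uniquekeys):
    """DISTINCT can be skipped if some unique key of the input rel is
    contained in the distinct expression list: rows already cannot
    repeat on those expressions."""
    exprs = frozenset(distinct_exprs)
    return any(uk <= exprs for uk in uniquekeys)


# Join on t1.b = t2.a: t2.a is unique, so t2 cannot duplicate t1's rows
# and t1's key (t1.a) survives into the joinrel; t2's does not.
join_keys = [frozenset({'t1.a'})]
removable = distinct_is_noop({'t1.a'}, join_keys)   # DISTINCT t1.a: skip it
kept = distinct_is_noop({'t2.a'}, join_keys)        # DISTINCT t2.a: keep it
```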
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Wed, Mar 11, 2020 at 4:19 AM David Rowley wrote:
> On Wed, 11 Mar 2020 at 02:50, Ashutosh Bapat wrote:
> > On Tue, Mar 10, 2020 at 1:49 PM Andy Fan wrote:
> > > In my current implementation, it calculates the uniqueness for each
> > > BaseRel only, but in your way, looks we need to calculate the
> > > UniquePathKey for both BaseRel and JoinRel. This makes more
> > > difference for multi table join.
> >
> > I didn't understand this concern. I think, it would be better to do it
> > for all kinds of relation types base, join etc. This way we are sure
> > that one method works across the planner to eliminate the need for
> > Distinct or grouping. If we just implement something for base
> > relations right now and don't do that for joins, there is a chance
> > that that method may not work for joins when we come to implement it.
>
> Yeah, it seems to me that we're seeing more and more features that
> require knowledge of uniqueness of a RelOptInfo. The skip scans patch
> needs to know if a join will cause row duplication so it knows if the
> skip scan path can be joined to without messing up the uniqueness of
> the skip scan. Adding more and more places that loop over the rel's
> indexlist just does not seem the right way to do it, especially so
> when you have to dissect the join rel down to its base rel components
> to check which indexes there are. Having the knowledge on-hand at the
> RelOptInfo level means we no longer have to look at indexes for unique
> proofs.

+1. Yep. When we break a join down to its base relations, partitioned
relations pose another challenge: a partitioned relation may not have an
index on it per se, but each partition may have one whose index key
happens to be part of the partition key. That case would be easier to
track through RelOptInfo than by breaking a base rel down into its child
rels.
> > > Another concern is UniquePathKey
> > > is designed for a general purpose, we need to maintain it no matter
> > > distinctClause/groupbyClause.
> >
> > This should be ok. The time spent in annotating a RelOptInfo about
> > uniqueness is not going to be a lot. But doing so would help generic
> > elimination of Distinct/Group/Unique operations. What is
> > UniquePathKey; I didn't find this in your patch or in the code.
>
> Possibly a misinterpretation. There is some overlap between this patch
> and the skip scan patch; both would like to skip doing explicit work
> to implement DISTINCT. Skip scans just go about it by adding new path
> types that scan the index and only gather up unique values. In that
> case, the RelOptInfo won't contain the unique keys, but the skip scan
> path will. How I imagine both of these patches working together in
> create_distinct_paths() is that we first check if the DISTINCT clause
> is a superset of some of the RelOptInfo's unique keys (this patch),
> else we check if there are any paths with uniquekeys that we can use
> to perform a no-op on the DISTINCT clause (the skip scan patch); if
> none of those apply, we create the required paths to uniquify the
> results.

Looks good to me. But I have not seen the index skip patch.

--
Best Wishes,
Ashutosh Bapat
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Tue, Mar 10, 2020 at 9:12 PM Andy Fan wrote: > > > Hi Tom & David & Bapat: > > Thanks for your review so far. I want to summarize the current issues to help > our following discussion. > > 1. Shall we bypass the AggNode as well with the same logic. > > I think yes, since the rules to bypass a AggNode and UniqueNode is exactly > same. > The difficulty of bypassing AggNode is the current aggregation function call > is closely > coupled with AggNode. In the past few days, I have make the aggregation call > can > run without AggNode (at least I tested sum(without finalized fn), avg (with > finalized fn)). > But there are a few things to do, like acl check, anynull check and maybe > more check. > also there are some MemoryContext mess up need to fix. > I still need some time for this goal, so I think the complex of it deserves > another thread > to discuss it, any thought? I think that if the relation underlying an Agg node is known to be unique for the given groupByClause, we could safely use the AGG_SORTED strategy. Even though the input is not ordered, it is unique, so for every input row the Agg node can combine and finalize the aggregate result immediately. -- Best Wishes, Ashutosh Bapat
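To illustrate Ashutosh's point: if the input is already unique on the grouping columns, every row forms its own group, so an AGG_SORTED-style aggregate can run the transition step once per row and finalize immediately, with no sort or hash table. A toy Python sketch of that idea (the function names and row format are illustrative, not PostgreSQL code):

```python
def agg_over_unique_input(rows, group_col, init, trans, final):
    """Aggregate rows known to be unique on group_col: each row is a
    complete group, so the state can be finalized per row, streaming."""
    for row in rows:
        state = trans(init, row)          # single transition per group
        yield (row[group_col], final(state))

# sum(v) GROUP BY id, where id is unique: no grouping work is needed
rows = [{"id": 1, "v": 10}, {"id": 2, "v": 20}]
result = list(agg_over_unique_input(rows, "id", 0,
                                    lambda s, r: s + r["v"], lambda s: s))
print(result)  # [(1, 10), (2, 20)]
```

The point of the sketch is only that no row ever needs to wait for another row of the same group, which is exactly what makes the sorted-aggregate strategy safe on unique input.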
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Wed, Mar 11, 2020 at 6:49 AM David Rowley wrote: > On Wed, 11 Mar 2020 at 02:50, Ashutosh Bapat > wrote: > > > > On Tue, Mar 10, 2020 at 1:49 PM Andy Fan > wrote: > > > In my current implementation, it calculates the uniqueness for each > > > BaseRel only, but in your way, looks we need to calculate the > > > UniquePathKey for both BaseRel and JoinRel. This makes more > > > difference for multi table join. > > > > I didn't understand this concern. I think, it would be better to do it > > for all kinds of relation types base, join etc. This way we are sure > > that one method works across the planner to eliminate the need for > > Distinct or grouping. If we just implement something for base > > relations right now and don't do that for joins, there is a chance > > that that method may not work for joins when we come to implement it. > > Yeah, it seems to me that we're seeing more and more features that > require knowledge of uniqueness of a RelOptInfo. The skip scans patch > needs to know if a join will cause row duplication so it knows if the > skip scan path can be joined to without messing up the uniqueness of > the skip scan. Adding more and more places that loop over the rel's > indexlist just does not seem the right way to do it, especially so > when you have to dissect the join rel down to its base rel components > to check which indexes there are. Having the knowledge on-hand at the > RelOptInfo level means we no longer have to look at indexes for unique > proofs. > > > > Another concern is UniquePathKey > > > is designed for a general purpose, we need to maintain it no matter > > > distinctClause/groupbyClause. > > > > This should be ok. The time spent in annotating a RelOptInfo about > > uniqueness is not going to be a lot. But doing so would help generic > > elimination of Distinct/Group/Unique operations. What is > > UniquePathKey; I didn't find this in your patch or in the code. > > Possibly a misinterpretation. 
There is some overlap between this patch > and the skip scan patch, both would like to skip doing explicit work > to implement DISTINCT. Skip scans just go about it by adding new path > types that scan the index and only gather up unique values. In that > case, the RelOptInfo won't contain the unique keys, but the skip scan > path will. How I imagine both of these patches working together in > create_distinct_paths() is that we first check if the DISTINCT clause > is a subset of the set of the RelOptInfo's unique keys (this patch), > else we check if there are any paths with uniquekeys that we can use > to perform a no-op on the DISTINCT clause (the skip scan patch); if > none of those apply, we create the required paths to uniquify the > results. > Now I am convinced that we should maintain UniquePath on RelOptInfo; I will see how to make it work with the "Index Skip Scan" patch.
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Tue, 10 Mar 2020 at 21:19, Andy Fan wrote: > >> SELECT DISTINCT max(non_unique) FROM t1; to skip doing the DISTINCT part. > > > Actually I want to keep the distinct for this case now. One reason is there > are only 1 > row returned, so the distinct erasing can't help much. The more important > reason is > Query->hasAggs is true for "select distinct (select count(*) filter (where > t2.c2 = 6 > and t2.c1 < 10) from ft1 t1 where t1.c1 = 6) from ft2 t2 where t2.c2 % 6 = 0 > order by 1;" > (this sql came from postgres_fdw.sql). I think that sort of view is part of the problem here. If you invent some new way to detect uniqueness that does not cover that case, then we have more code with more possible places to have bugs. FWIW, query_is_distinct_for() does detect that case with:

/*
 * If we have no GROUP BY, but do have aggregates or HAVING, then the
 * result is at most one row so it's surely unique, for any operators.
 */
if (query->hasAggs || query->havingQual)
    return true;

which can be seen by the fact that the following finds the unique join on t2:

postgres=# explain verbose select * from t1 inner join (select count(*) c from t1) t2 on t1.a=t2.c;
                                  QUERY PLAN
 Hash Join  (cost=41.91..84.25 rows=13 width=12)
   Output: t1.a, (count(*))
   Inner Unique: true
   Hash Cond: (t1.a = (count(*)))
   ->  Seq Scan on public.t1  (cost=0.00..35.50 rows=2550 width=4)
         Output: t1.a
   ->  Hash  (cost=41.89..41.89 rows=1 width=8)
         Output: (count(*))
         ->  Aggregate  (cost=41.88..41.88 rows=1 width=8)
               Output: count(*)
               ->  Seq Scan on public.t1 t1_1  (cost=0.00..35.50 rows=2550 width=0)
                     Output: t1_1.a
(12 rows)

It will be very simple to add an empty List of UniqueKeys to the GROUP BY's RelOptInfo to indicate that all expressions are unique. That way, any code that checks whether some of the RelOptInfo's unique keys are a subset of the expressions it would like to prove unique will get a match. It does not really matter how much effort is saved in your example above.
The UniqueKey infrastructure cannot know in advance how much effort properly adding all the uniquekeys will save. It should just add all the keys it can and let whichever code cares about them reap the benefits.
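The two rules David appeals to here can be captured in a few lines: a DISTINCT is a no-op when some known unique key is a subset of the DISTINCT clause, and an empty unique key (the at-most-one-row case, e.g. an aggregate without GROUP BY) is trivially a subset of everything. A hypothetical Python sketch of that subset test (PostgreSQL would of course do this in C over expression lists):

```python
def distinct_is_noop(unique_keys, distinct_exprs):
    """True if some unique key of the rel is a subset of the DISTINCT
    clause.  An empty key means the rel returns at most one row, and the
    empty set is a subset of anything, so DISTINCT is always a no-op."""
    return any(set(key) <= set(distinct_exprs) for key in unique_keys)

assert distinct_is_noop([["pk"]], ["pk", "b"])     # key covered by clause
assert not distinct_is_noop([["pk"]], ["b"])       # key not in the clause
assert distinct_is_noop([[]], ["max_col"])         # <= 1 row: always no-op
assert not distinct_is_noop([], ["b"])             # nothing proven unique
```

The third assertion is the "empty List of UniqueKeys" trick from the message above: attaching an empty key to the GROUP BY's RelOptInfo makes every later subset check succeed.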
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Wed, 11 Mar 2020 at 02:50, Ashutosh Bapat wrote: > > On Tue, Mar 10, 2020 at 1:49 PM Andy Fan wrote: > > In my current implementation, it calculates the uniqueness for each > > BaseRel only, but in your way, looks we need to calculate the > > UniquePathKey for both BaseRel and JoinRel. This makes more > > difference for multi table join. > > I didn't understand this concern. I think, it would be better to do it > for all kinds of relation types base, join etc. This way we are sure > that one method works across the planner to eliminate the need for > Distinct or grouping. If we just implement something for base > relations right now and don't do that for joins, there is a chance > that that method may not work for joins when we come to implement it. Yeah, it seems to me that we're seeing more and more features that require knowledge of uniqueness of a RelOptInfo. The skip scans patch needs to know if a join will cause row duplication so it knows if the skip scan path can be joined to without messing up the uniqueness of the skip scan. Adding more and more places that loop over the rel's indexlist just does not seem the right way to do it, especially so when you have to dissect the join rel down to its base rel components to check which indexes there are. Having the knowledge on-hand at the RelOptInfo level means we no longer have to look at indexes for unique proofs. > > Another concern is UniquePathKey > > is designed for a general purpose, we need to maintain it no matter > > distinctClause/groupbyClause. > > This should be ok. The time spent in annotating a RelOptInfo about > uniqueness is not going to be a lot. But doing so would help generic > elimination of Distinct/Group/Unique operations. What is > UniquePathKey; I didn't find this in your patch or in the code. Possibly a misinterpretation. There is some overlap between this patch and the skip scan patch, both would like to skip doing explicit work to implement DISTINCT. 
Skip scans just go about it by adding new path types that scan the index and only gather up unique values. In that case, the RelOptInfo won't contain the unique keys, but the skip scan path will. How I imagine both of these patches working together in create_distinct_paths() is that we first check if the DISTINCT clause is a subset of the set of the RelOptInfo's unique keys (this patch), else we check if there are any paths with uniquekeys that we can use to perform a no-op on the DISTINCT clause (the skip scan patch); if none of those apply, we create the required paths to uniquify the results.
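The three-step order David describes for create_distinct_paths() can be sketched as a simple cascade (a hypothetical Python outline; names like a `uniquekeys` field on a path dict are illustrative, not the actual patch API):

```python
def plan_distinct(rel_unique_keys, paths, distinct_exprs):
    """Order of checks discussed for create_distinct_paths():
    1. rel-level unique keys already prove the result distinct (this patch),
    2. else some path (e.g. a skip scan) already yields unique rows,
    3. else build explicit uniquification paths."""
    if any(set(k) <= set(distinct_exprs) for k in rel_unique_keys):
        return "no-op: rel is unique on the DISTINCT columns"
    for path in paths:
        keys = path.get("uniquekeys")
        if keys and set(keys) <= set(distinct_exprs):
            return "no-op: reuse path " + path["name"]
    return "uniquify: add Sort+Unique or HashAggregate paths"

print(plan_distinct([["pk"]], [], ["pk"]))   # step 1 applies
print(plan_distinct([], [{"name": "skipscan", "uniquekeys": ["a"]}], ["a"]))
print(plan_distinct([], [], ["a"]))          # must uniquify
```

The rel-level check comes first because it is a property of the whole relation; the path-level check is weaker, since only that particular path may be chosen by the planner.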
Re: [PATCH] Erase the distinctClause if the result is unique by definition
Hi Tom & David & Bapat: Thanks for your review so far. I want to summarize the current issues to help our following discussion. 1. Shall we bypass the AggNode as well with the same logic? I think yes, since the rules for bypassing an AggNode and a UniqueNode are exactly the same. The difficulty in bypassing AggNode is that the current aggregation function call is closely coupled with AggNode. In the past few days, I have made the aggregation call able to run without AggNode (at least I tested sum (without a finalize fn) and avg (with a finalize fn)). But there are a few things left to do, like the ACL check, the any-null check and maybe more checks; there is also some MemoryContext mess-up to fix. I still need some time for this goal, so I think its complexity deserves another thread to discuss it; any thoughts? 2. Shall we use the UniquePath as David suggested? Actually I am leaning toward this way now. David, can you share more insights about the benefits of UniquePath? Costing should be one of them; another may be changing a semi join to a normal join, as the current innerrel_is_unique does. Any others? 3. Can we make the rule more general? Currently it requires every relation to yield a unique result. David & Bapat provided another example: select m2.pk from m1, m2 where m1.pk = m2.non_unique_key. That's interesting and not easy to handle in my current framework. This is another reason I want to take the UniquePath framework. Do we have any other rules to think about before implementing it? Thanks for your feedback. > This should be ok. The time spent in annotating a RelOptInfo about > uniqueness is not going to be a lot. But doing so would help generic > elimination of Distinct/Group/Unique operations. What is > UniquePathKey; I didn't find this in your patch or in the code. > > This is a proposal from David, so it is not in the current patch/code :) Regards Andy Fan
Re: [PATCH] Erase the distinctClause if the result is unique by definition
Hi Andy, On Tue, Mar 10, 2020 at 1:49 PM Andy Fan wrote: > > Hi David: > >> >> 3. Perhaps in add_paths_to_joinrel(), or maybe when creating the join >> rel itself (I've not looked for the best location in detail), >> determine if the join can cause rows to be duplicated. If it can't, >> then add the UniqueKeys from that rel. > > > I have some concerns about this method, maybe I misunderstand > something, if so, please advise. > > In my current implementation, it calculates the uniqueness for each > BaseRel only, but in your way, looks we need to calculate the > UniquePathKey for both BaseRel and JoinRel. This makes more > difference for multi table join. I didn't understand this concern. I think, it would be better to do it for all kinds of relation types base, join etc. This way we are sure that one method works across the planner to eliminate the need for Distinct or grouping. If we just implement something for base relations right now and don't do that for joins, there is a chance that that method may not work for joins when we come to implement it. > Another concern is UniquePathKey > is designed for a general purpose, we need to maintain it no matter > distinctClause/groupbyClause. This should be ok. The time spent in annotating a RelOptInfo about uniqueness is not going to be a lot. But doing so would help generic elimination of Distinct/Group/Unique operations. What is UniquePathKey; I didn't find this in your patch or in the code. > > >> >> For example: SELECT * FROM t1 >> INNER JOIN t2 ON t1.unique = t2.not_unique; would have the joinrel for >> {t1,t2} only take the unique keys from t2 (t1 can't duplicate t2 rows >> since it's an eqijoin and t1.unique has a unique index). > > > Thanks for raising this. My current rule requires *every* relation yields a > unique result and *no matter* with the join method. 
Actually I want to make > the rule less strict, for example, we may just need 1 table yields unique > result > and the final result will be unique as well under some join type. That is desirable. > > As for the t1 INNER JOIN t2 ON t1.unique = t2.not_unique; looks we can't > remove the distinct based on this. > > create table m1(a int primary key, b int); > create table m2(a int primary key, b int); > insert into m1 values(1, 1), (2, 1); > insert into m2 values(1, 1), (2, 1); > select distinct m1.a from m1, m2 where m1.a = m2.b; IIUC, David's rule is the other way round. "select distinct m2.a from m1, m2 where m1.a = m2.b" won't need a DISTINCT node, since the result of joining m1 and m2 has a unique value of m2.a for each row. In your example the join will produce two rows (m1.a, m1.b, m2.a, m2.b): (1, 1, 1, 1) and (1, 1, 2, 1), where m2.a is the unique key. -- Best Wishes, Ashutosh Bapat
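Ashutosh's asymmetry is easy to verify by simulating the equijoin by hand (a toy Python re-run of the m1/m2 example above, not planner code): the join on m1.a = m2.b duplicates m1.a, but every output row still carries a distinct m2.a.

```python
m1 = [(1, 1), (2, 1)]          # (a, b); a is the primary key
m2 = [(1, 1), (2, 1)]

# inner join on m1.a = m2.b
joined = [(r1, r2) for r1 in m1 for r2 in m2 if r1[0] == r2[1]]

m1_a = [r1[0] for r1, _ in joined]
m2_a = [r2[0] for _, r2 in joined]
print(m1_a)  # [1, 1]  -> m1.a is duplicated, DISTINCT m1.a is still needed
print(m2_a)  # [1, 2]  -> m2.a stays unique, DISTINCT m2.a would be a no-op
```

The side whose unique column appears in the join condition (m1.a) is exactly the side that cannot be duplicated by the join, which is why only the *other* side's unique keys survive.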
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Tue, Mar 10, 2020 at 3:51 AM David Rowley wrote: > On Sat, 7 Mar 2020 at 00:47, Andy Fan wrote: > > Upload the newest patch so that the cfbot can pass. The last patch > failed > > because some explain without the (cost off). > > I've only really glanced at this patch, but I think we need to do this > in a completely different way. > > I've been mentioning UniqueKeys around this mailing list for quite a > while now [1]. To summarise the idea: > > 1. Add a new List field to RelOptInfo named unique_keys > 2. During get_relation_info() process the base relation's unique > indexes and add to the RelOptInfo's unique_keys list the indexed > expressions from each unique index (this may need to be delayed until > check_index_predicates() since predOK is only set there) > 3. Perhaps in add_paths_to_joinrel(), or maybe when creating the join > rel itself (I've not looked for the best location in detail), > build_*_join_rel() will be a good place for this. The paths created might take advantage of this information for costing. > determine if the join can cause rows to be duplicated. If it can't, > then add the UniqueKeys from that rel. For example: SELECT * FROM t1 > INNER JOIN t2 ON t1.unique = t2.not_unique; would have the joinrel for > {t1,t2} only take the unique keys from t2 (t1 can't duplicate t2 rows > since it's an eqijoin and t1.unique has a unique index). this is interesting. > If the > condition was t1.unique = t2.unique then we could take the unique keys > from both sides of the join, and with t1.non_unique = t2.non_unique, > we can take neither. > 4. When creating the GROUP BY paths (when there are no aggregates), > don't bother doing anything if the input rel's unique keys are a > subset of the GROUP BY clause. Otherwise, create the group by paths > and tag the new unique keys onto the GROUP BY rel. > 5. When creating the DISTINCT paths, don't bother if the input rel has > unique keys are a subset of the distinct clause. 
> Thanks for laying this out in more detail. A few more cases can be added to this: 6. When creating a RelOptInfo for a grouped/aggregated result, if all the columns of the group by clause are part of the result, i.e. the targetlist, the columns in the group by clause serve as the unique keys of the result. So the corresponding RelOptInfo can be marked as such. 7. The result of DISTINCT is unique for the columns contained in the DISTINCT clause. Hence we can add those columns to the unique_key of the RelOptInfo representing the result of the distinct clause. 8. If each partition of a partitioned table has a unique key with the same columns in it, and that happens to be a superset of the partition key, then the whole partitioned table gets that unique key as well. With this we could actually pass the uniqueness information through Subquery scans as well, and the overall query will benefit from that. > > 4 and 5 will mean that: SELECT DISTINCT non_unique FROM t1 GROUP BY > non_unique will just uniquify once for the GROUP BY and not for the > distinct. SELECT DISTINCT unique FROM t1 GROUP BY unique; won't do > anything to uniquify the results. > > Because both 4 and 5 require that the uniquekeys are a subset of the > distinct/group by clause, an empty uniquekey set would mean that the > RelOptInfo returns no more than 1 row. That would allow your: > > SELECT DISTINCT max(non_unique) FROM t1; to skip doing the DISTINCT part. > > There's a separate effort in > https://commitfest.postgresql.org/27/1741/ to implement some parts of > the uniquekeys idea. However the implementation currently only covers > adding the unique keys to Paths, not to RelOptInfos. > I haven't looked at that patch, but as discussed upthread, in this case we want the uniqueness associated with the RelOptInfo and not the path. > > I also believe that the existing code in analyzejoins.c should be > edited to make use of unique keys rather than how it looks at unique > indexes and group by / distinct clauses. > +1.
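Case 8 boils down to a containment check: a unique key lifts from the partitions to the partitioned table only when the same key exists on every partition and it contains the partition key, so that no two partitions can hold the same key value. A hypothetical sketch, using plain column-name lists in place of index expressions:

```python
def lift_unique_key(partition_key, per_partition_unique_keys):
    """Return a unique key valid for the whole partitioned table, or None.
    per_partition_unique_keys: for each partition, the list of its unique
    index column sets.  A key lifts only if it exists on every partition
    and is a superset of the partition key."""
    if not per_partition_unique_keys:
        return None
    common = set.intersection(
        *({frozenset(k) for k in keys} for keys in per_partition_unique_keys))
    for key in common:
        if set(partition_key) <= key:
            return sorted(key)
    return None

# table partitioned by (b); a unique index on c for one partition only:
assert lift_unique_key(["b"], [[["c"]], []]) is None
# unique (c) on every partition, but it lacks the partition key:
assert lift_unique_key(["b"], [[["c"]], [["c"]]]) is None
# unique (b, c) everywhere, containing the partition key: it lifts
assert lift_unique_key(["b"], [[["b", "c"]], [["b", "c"]]]) == ["b", "c"]
```

The second case is Andy's q100 example from the head of the thread: a per-partition unique index on (c) proves uniqueness only locally, because the same c value may appear in several partitions.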
-- Best Wishes, Ashutosh Bapat
Re: [PATCH] Erase the distinctClause if the result is unique by definition
Hi David: > 3. Perhaps in add_paths_to_joinrel(), or maybe when creating the join > rel itself (I've not looked for the best location in detail), > determine if the join can cause rows to be duplicated. If it can't, > then add the UniqueKeys from that rel. I have some concerns about this method; maybe I misunderstand something, and if so, please advise. In my current implementation, it calculates the uniqueness for each BaseRel only, but in your way, it looks like we need to calculate the UniquePathKey for both BaseRel and JoinRel. This makes more difference for multi-table joins. Another concern is that UniquePathKey is designed for a general purpose, so we need to maintain it whether or not there is a distinctClause/groupbyClause. > For example: SELECT * FROM t1 > INNER JOIN t2 ON t1.unique = t2.not_unique; would have the joinrel for > {t1,t2} only take the unique keys from t2 (t1 can't duplicate t2 rows > since it's an equijoin and t1.unique has a unique index). > Thanks for raising this. My current rule requires that *every* relation yield a unique result, *regardless* of the join method. Actually I want to make the rule less strict; for example, we may need just one table to yield a unique result, and the final result will be unique as well under some join types. As for t1 INNER JOIN t2 ON t1.unique = t2.not_unique, it looks like we can't remove the distinct based on this alone:

create table m1(a int primary key, b int);
create table m2(a int primary key, b int);
insert into m1 values(1, 1), (2, 1);
insert into m2 values(1, 1), (2, 1);
select distinct m1.a from m1, m2 where m1.a = m2.b;

> SELECT DISTINCT max(non_unique) FROM t1; to skip doing the DISTINCT part. > Actually I want to keep the distinct for this case now. One reason is that only 1 row is returned, so erasing the distinct can't help much.
The more important reason is that Query->hasAggs is true for "select distinct (select count(*) filter (where t2.c2 = 6 and t2.c1 < 10) from ft1 t1 where t1.c1 = 6) from ft2 t2 where t2.c2 % 6 = 0 order by 1;" (this SQL came from postgres_fdw.sql). > There's a separate effort in > https://commitfest.postgresql.org/27/1741/ to implement some parts of > the uniquekeys idea. However the implementation currently only covers > adding the unique keys to Paths, not to RelOptInfos Thanks for this info. I guess this patch is not merged so far, but it looks like the cfbot run for my patch [1] failed due to this :( Search for "explain (costs off) select distinct on(pk1) pk1, pk2 from select_distinct_a;". > I also believe that the existing code in analyzejoins.c should be > edited to make use of unique keys rather than how it looks at unique > indexes and group by / distinct clauses. > > I can do this after we have agreement on the UniquePath. As for my cfbot failure, another strange thing is that "A" appears ahead of "a" after the order by; I still haven't found out why. [1] https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.83298 Regards Andy Fan
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Sat, 7 Mar 2020 at 00:47, Andy Fan wrote: > Upload the newest patch so that the cfbot can pass. The last patch failed > because some explain without the (cost off). I've only really glanced at this patch, but I think we need to do this in a completely different way. I've been mentioning UniqueKeys around this mailing list for quite a while now [1]. To summarise the idea: 1. Add a new List field to RelOptInfo named unique_keys 2. During get_relation_info() process the base relation's unique indexes and add to the RelOptInfo's unique_keys list the indexed expressions from each unique index (this may need to be delayed until check_index_predicates() since predOK is only set there) 3. Perhaps in add_paths_to_joinrel(), or maybe when creating the join rel itself (I've not looked for the best location in detail), determine if the join can cause rows to be duplicated. If it can't, then add the UniqueKeys from that rel. For example: SELECT * FROM t1 INNER JOIN t2 ON t1.unique = t2.not_unique; would have the joinrel for {t1,t2} only take the unique keys from t2 (t1 can't duplicate t2 rows since it's an equijoin and t1.unique has a unique index). If the condition was t1.unique = t2.unique then we could take the unique keys from both sides of the join, and with t1.non_unique = t2.non_unique, we can take neither. 4. When creating the GROUP BY paths (when there are no aggregates), don't bother doing anything if the input rel's unique keys are a subset of the GROUP BY clause. Otherwise, create the group by paths and tag the new unique keys onto the GROUP BY rel. 5. When creating the DISTINCT paths, don't bother if the input rel has unique keys that are a subset of the distinct clause. 4 and 5 will mean that: SELECT DISTINCT non_unique FROM t1 GROUP BY non_unique will just uniquify once for the GROUP BY and not for the distinct. SELECT DISTINCT unique FROM t1 GROUP BY unique; won't do anything to uniquify the results.
Because both 4 and 5 require that the uniquekeys are a subset of the distinct/group by clause, an empty uniquekey set would mean that the RelOptInfo returns no more than 1 row. That would allow your: SELECT DISTINCT max(non_unique) FROM t1; to skip doing the DISTINCT part. There's a separate effort in https://commitfest.postgresql.org/27/1741/ to implement some parts of the uniquekeys idea. However the implementation currently only covers adding the unique keys to Paths, not to RelOptInfos. I also believe that the existing code in analyzejoins.c should be edited to make use of unique keys rather than how it looks at unique indexes and group by / distinct clauses. [1] https://www.postgresql.org/search/?m=1=pgsql-hackers=uniquekeys
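Steps 2 and 3 of the proposal amount to: seed each base rel's unique keys from its unique indexes, then at a join keep a side's keys only when the *other* side joins on one of its own unique keys (so it cannot contribute more than one matching row). A hypothetical Python sketch, with plain column names standing in for index expressions and EquivalenceClasses:

```python
def join_unique_keys(outer_keys, inner_keys, equi_quals):
    """equi_quals: list of (outer_col, inner_col) equijoin conditions.
    A side's unique keys survive the join iff the other side's join
    columns cover one of that other side's unique keys, i.e. the other
    side cannot duplicate its rows."""
    outer_cols = {o for o, _ in equi_quals}
    inner_cols = {i for _, i in equi_quals}
    result = []
    if any(set(k) <= outer_cols for k in outer_keys):
        result.extend(inner_keys)   # outer side joins on its unique key
    if any(set(k) <= inner_cols for k in inner_keys):
        result.extend(outer_keys)   # inner side joins on its unique key
    return result

# t1.unique = t2.not_unique: only t2's keys survive (David's example)
print(join_unique_keys([["u1"]], [["pk2"]], [("u1", "x")]))   # [['pk2']]
# t1.unique = t2.unique: both sides' keys survive
print(join_unique_keys([["u1"]], [["u2"]], [("u1", "u2")]))
# t1.non_unique = t2.non_unique: neither survives
print(join_unique_keys([["u1"]], [["u2"]], [("a", "b")]))     # []
```

The three calls mirror the three join cases in point 3 above; the real planner would also have to handle outer joins, non-equijoin quals, and expression equivalence, which this sketch deliberately ignores.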
Re: [PATCH] Erase the distinctClause if the result is unique by definition
Uploading the newest patch so that the cfbot can pass; the last patch failed because some EXPLAINs were run without (costs off). I'm still working out how to handle aggregation calls without an aggregation path. Probably we can get there by hacking some ExprEvalPushStep for the Aggref node. But since the current patch is not closely tied to this, I am putting it up for review first. On Wed, Mar 4, 2020 at 9:13 PM Andy Fan wrote: > > >> >>> * There are some changes in existing regression cases that aren't >>> visibly related to the stated purpose of the patch, eg it now >>> notices that "select distinct max(unique2) from tenk1" doesn't >>> require an explicit DISTINCT step. That's not wrong, but I wonder >>> if maybe you should subdivide this patch into more than one patch, >>> because that must be coming from some separate change. I'm also >>> wondering what caused the plan change in expected/join.out. >>> >> >> Per my purpose it should be in the same patch, the logical here is we >> have distinct in the sql and the query is distinct already since the max >> function (the rule is defined in query_is_distinct_agg which is split >> from >> the original query_is_distinct_for clause). >> > > I think I was right until I come > into contrib/postgres_fdw/sql/postgres_fdw.sql. > Per my understanding, the result of "select max(a) from t" is > unique > since the aggregation function and has no group clause there. But in the > postgres_fdw.sql case, the Query->hasAggs is true for "select distinct > (select count(*) filter (where t2.c2 = 6 and t2.c1 < 10) from ft1 t1 where > t1.c1 = 6) > from ft2 t2 where t2.c2 % 6 = 0 order by 1;" This looks very strange to > me. > Is my understanding wrong or there is a bug here? > > query->hasAggs was set to true in the following call stack. > > pstate->p_hasAggs = true; > > ..
> > qry->hasAggs = pstate->p_hasAggs; > > > 0 in check_agglevels_and_constraints of parse_agg.c:343 > 1 in transformAggregateCall of parse_agg.c:236 > 2 in ParseFuncOrColumn of parse_func.c:805 > 3 in transformFuncCall of parse_expr.c:1558 > 4 in transformExprRecurse of parse_expr.c:265 > 5 in transformExpr of parse_expr.c:155 > 6 in transformTargetEntry of parse_target.c:105 > 7 in transformTargetList of parse_target.c:193 > 8 in transformSelectStmt of analyze.c:1224 > 9 in transformStmt of analyze.c:301 > > You can see the new updated patch which should fix all the issues you > point out > except the one for supporting group by. The another reason for this > patch will > not be the final one is because the changes for postgres_fdw.out is too > arbitrary. > uploading it now just for reference. (The new introduced guc variable can > be > removed at last, keeping it now just make sure the testing is easier.) > > > At a high level, I'm a bit disturbed that this focuses only on DISTINCT and doesn't (appear to) have any equivalent intelligence for GROUP BY, though surely that offers much the same opportunities for optimization. It seems like it'd be worthwhile to take a couple steps back and see if we couldn't recast the logic to work for both. >>> OK, Looks group by is a bit harder than distinct since the aggregation >>> function. >>> I will go through the code to see where to add this logic. >>> >>> >> >> Can we grantee any_aggr_func(a) == a if only 1 row returned, if so, we >> can do >> some work on the pathtarget/reltarget by transforming the Aggref to raw >> expr. >> I checked the execution path of the aggregation call, looks it depends on >> Agg node >> which is the thing we want to remove. >> > > We can't grantee any_aggr_func(a) == a when only 1 row returned, so the > above > method doesn't work. do you have any suggestion for this? > v3-0001-PATCH-Erase-the-distinct-path-if-the-result-is-un.patch Description: Binary data
Re: [PATCH] Erase the distinctClause if the result is unique by definition
> > >> * There are some changes in existing regression cases that aren't >> visibly related to the stated purpose of the patch, eg it now >> notices that "select distinct max(unique2) from tenk1" doesn't >> require an explicit DISTINCT step. That's not wrong, but I wonder >> if maybe you should subdivide this patch into more than one patch, >> because that must be coming from some separate change. I'm also >> wondering what caused the plan change in expected/join.out. >> > > Per my purpose it should be in the same patch, the logical here is we > have distinct in the sql and the query is distinct already since the max > function (the rule is defined in query_is_distinct_agg which is split > from > the original query_is_distinct_for clause). > I thought I was right until I came to contrib/postgres_fdw/sql/postgres_fdw.sql. Per my understanding, the result of "select max(a) from t" is unique, since there is an aggregate function and no group clause. But in the postgres_fdw.sql case, Query->hasAggs is true for "select distinct (select count(*) filter (where t2.c2 = 6 and t2.c1 < 10) from ft1 t1 where t1.c1 = 6) from ft2 t2 where t2.c2 % 6 = 0 order by 1;" This looks very strange to me. Is my understanding wrong, or is there a bug here? query->hasAggs was set to true in the following call stack:

pstate->p_hasAggs = true;
..
qry->hasAggs = pstate->p_hasAggs;

0 in check_agglevels_and_constraints of parse_agg.c:343
1 in transformAggregateCall of parse_agg.c:236
2 in ParseFuncOrColumn of parse_func.c:805
3 in transformFuncCall of parse_expr.c:1558
4 in transformExprRecurse of parse_expr.c:265
5 in transformExpr of parse_expr.c:155
6 in transformTargetEntry of parse_target.c:105
7 in transformTargetList of parse_target.c:193
8 in transformSelectStmt of analyze.c:1224
9 in transformStmt of analyze.c:301

You can see the new updated patch, which should fix all the issues you pointed out except the one for supporting GROUP BY.
Another reason this patch will not be the final one is that the changes to postgres_fdw.out are too arbitrary. I am uploading it now just for reference. (The newly introduced GUC variable can be removed at the end; keeping it now just makes testing easier.) At a high level, I'm a bit disturbed that this focuses only on DISTINCT >>> and doesn't (appear to) have any equivalent intelligence for GROUP BY, >>> though surely that offers much the same opportunities for optimization. >>> It seems like it'd be worthwhile to take a couple steps back and see >>> if we couldn't recast the logic to work for both. >>> >>> >> OK, looks group by is a bit harder than distinct since the aggregation >> function. >> I will go through the code to see where to add this logic. >> >> > > Can we guarantee any_aggr_func(a) == a if only 1 row is returned? If so, we > can do > some work on the pathtarget/reltarget by transforming the Aggref to a raw > expr. > I checked the execution path of the aggregation call; looks it depends on > the Agg node > which is the thing we want to remove. > We can't guarantee any_aggr_func(a) == a when only 1 row is returned, so the above method doesn't work. Do you have any suggestions for this?

From 9449c09688d542c4dc201ee866f67d67304cff98 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=E4=B8=80=E6=8C=83?=
Date: Wed, 4 Mar 2020 14:33:56 +0800
Subject: [PATCH v2] [PATCH] Erase the distinct path if the result is unique by catalog

For a single relation, we can tell it if any one of the following is true:
1. The pk is in the target list.
2. The uk is in the target list and the columns are not null.
3. The columns in the group-by clause are also in the target list.

For a relation join, we can tell it by: if every relation in the jointree
yields a unique result set, then the final result is unique as well,
regardless of the join method.
---
 .../postgres_fdw/expected/postgres_fdw.out    |  28 +-
 src/backend/optimizer/path/costsize.c         |   1 +
 src/backend/optimizer/plan/analyzejoins.c     | 184 +++-
 src/backend/optimizer/plan/planner.c          |  27 ++
 src/backend/optimizer/util/plancat.c          |   9 +
 src/backend/utils/misc/guc.c                  |  10 +
 src/include/nodes/pathnodes.h                 |   1 +
 src/include/optimizer/cost.h                  |   1 +
 src/include/optimizer/planmain.h              |   2 +
 src/test/regress/expected/aggregates.out      |  13 +-
 src/test/regress/expected/join.out            |  16 +-
 src/test/regress/expected/select_distinct.out | 276 ++
 src/test/regress/expected/sysviews.out        |   3 +-
 src/test/regress/sql/select_distinct.sql      |  84 ++
 14 files changed, 619 insertions(+), 36 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 84fd3ad2e0..215f10bf7d 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -2902,22 +2902,20 @@
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Tue, Mar 3, 2020 at 1:24 AM Andy Fan wrote: > Thank you Tom for the review! > > On Mon, Mar 2, 2020 at 4:46 AM Tom Lane wrote: > >> Andy Fan writes: >> > Please see if you have any comments. Thanks >> >> The cfbot isn't at all happy with this. Its linux build is complaining >> about a possibly-uninitialized variable, and then giving up: >> >> https://travis-ci.org/postgresql-cfbot/postgresql/builds/656722993 >> >> The Windows build isn't using -Werror, but it is crashing in at least >> two different spots in the regression tests: >> >> >> https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.81778 >> >> I've not attempted to identify the cause of that. >> >> > Before I submit a patch, I make sure "make check-world" succeeds, but > since the compile options are not the same, I didn't catch > the possibly-uninitialized > variable. As for the crash on Windows, I don't have enough > information > yet; I will find a Windows server and reproduce the cases. > > I just found the link http://commitfest.cputube.org/ this morning; I will > make sure > the next patch passes this test. > > > At a high level, I'm a bit disturbed that this focuses only on DISTINCT >> and doesn't (appear to) have any equivalent intelligence for GROUP BY, >> though surely that offers much the same opportunities for optimization. >> It seems like it'd be worthwhile to take a couple steps back and see >> if we couldn't recast the logic to work for both. >> >> > OK, it looks like GROUP BY is a bit harder than DISTINCT because of the aggregation > functions. > I will go through the code to see where to add this logic. > > Can we guarantee any_aggr_func(a) == a if only 1 row is returned? If so, we can do some work on the pathtarget/reltarget by transforming the Aggref to a raw expr. I checked the execution path of the aggregation call; it looks like it depends on the Agg node, which is the very thing we want to remove.
> >> * There seem to be some pointless #include additions, eg in planner.c >> the added code doesn't look to justify any of them. Please also >> avoid unnecessary whitespace changes, those also slow down reviewing. >> > Fixed some typos. That may be because I added the header file at one point and then refactored the code but forgot to remove the header file when it was no longer necessary. Do we need to rely on experience to tell whether a header file is needed, or do we have any tool that can tell automatically? regards, Andy Fan
Re: [PATCH] Erase the distinctClause if the result is unique by definition
Thank you Tom for the review! On Mon, Mar 2, 2020 at 4:46 AM Tom Lane wrote: > Andy Fan writes: > > Please see if you have any comments. Thanks > > The cfbot isn't at all happy with this. Its linux build is complaining > about a possibly-uninitialized variable, and then giving up: > > https://travis-ci.org/postgresql-cfbot/postgresql/builds/656722993 > > The Windows build isn't using -Werror, but it is crashing in at least > two different spots in the regression tests: > > https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.81778 > > I've not attempted to identify the cause of that. > > Before I submit a patch, I make sure "make check-world" succeeds, but since the compile options are not the same, I didn't catch the possibly-uninitialized variable. As for the crash on Windows, I don't have enough information yet; I will find a Windows server and reproduce the cases. I just found the link http://commitfest.cputube.org/ this morning; I will make sure the next patch passes this test. At a high level, I'm a bit disturbed that this focuses only on DISTINCT > and doesn't (appear to) have any equivalent intelligence for GROUP BY, > though surely that offers much the same opportunities for optimization. > It seems like it'd be worthwhile to take a couple steps back and see > if we couldn't recast the logic to work for both. > > OK, it looks like GROUP BY is a bit harder than DISTINCT because of the aggregation functions. I will go through the code to see where to add this logic. > Some other random comments: > > * Don't especially like the way you broke query_is_distinct_for() > into two functions, especially when you then introduced a whole > lot of other code in between. I didn't expect this until you pointed it out. In this case, I had to break query_is_distinct_for into two functions, but it is true that we should put the two functions together. That's just making reviewer's job > harder to see what changed.
It makes the comments a bit disjointed > too, that is, where you even had any. (Zero introductory comment > for query_is_distinct_agg is *not* up to project coding standards. > There are a lot of other undercommented places in this patch, too.) > > * Definitely don't like having query_distinct_through_join re-open > all the relations. The data needed for this should get absorbed > while plancat.c has the relations open at the beginning. (Non-nullness > of columns, in particular, seems like it'll be useful for other > purposes; I'm a bit surprised the planner isn't using that already.) > I can add new attributes to RelOptInfo and fill in their values in the get_relation_info call. > * In general, query_distinct_through_join seems hugely messy, expensive, > and not clearly correct. If it is correct, the existing comments sure > aren't enough to explain what it is doing or why. > Removing the relation_open call can make it a bit simpler; I will add more comments to make it clearer in the next patch. > * There seem to be some pointless #include additions, eg in planner.c > the added code doesn't look to justify any of them. Please also > avoid unnecessary whitespace changes, those also slow down reviewing. > > That may be because I added the header file at one point and then refactored the code later but forgot to remove the header file accordingly. Do we need to rely on experience to tell whether a header file is needed, or do we have any tool that can tell automatically? > * I see you decided to add a new regression test file select_distinct_2. > That's a poor choice of name because it conflicts with our rules for the > naming of alternative output files. Besides which, you forgot to plug > it into the test schedule files, so it isn't actually getting run. > Is there a reason not to just add the new test cases to select_distinct? > > Adding the new test cases to select_distinct.sql is OK with me as well; actually I have no obvious reason to add a new file.
> * There are some changes in existing regression cases that aren't > visibly related to the stated purpose of the patch, eg it now > notices that "select distinct max(unique2) from tenk1" doesn't > require an explicit DISTINCT step. That's not wrong, but I wonder > if maybe you should subdivide this patch into more than one patch, > because that must be coming from some separate change. I'm also > wondering what caused the plan change in expected/join.out. > In my view it belongs in the same patch; the logic here is that we have DISTINCT in the SQL, and the query is already distinct because of the max function (the rule is defined in query_is_distinct_agg, which is split out from the original query_is_distinct_for function). > regards, tom lane >
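The rule behind query_is_distinct_agg mentioned above can be sketched minimally (a Python illustration with simplified inputs, not PostgreSQL's real Query structure): a query with aggregate functions and no GROUP BY returns at most one row, so its result is trivially distinct.

```python
# Sketch only: an aggregated query with no GROUP BY clause returns at
# most one row, so e.g. SELECT DISTINCT max(unique2) FROM tenk1 is
# already unique and needs no explicit DISTINCT step.
def query_is_distinct_agg(has_aggs: bool, group_clause: list) -> bool:
    return has_aggs and not group_clause
```

So `query_is_distinct_agg(True, [])` covers the `max(unique2)` case, while a query grouped on some column would still need the real per-group uniqueness checks.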
Re: [PATCH] Erase the distinctClause if the result is unique by definition
Andy Fan writes: > Please see if you have any comments. Thanks The cfbot isn't at all happy with this. Its linux build is complaining about a possibly-uninitialized variable, and then giving up: https://travis-ci.org/postgresql-cfbot/postgresql/builds/656722993 The Windows build isn't using -Werror, but it is crashing in at least two different spots in the regression tests: https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.81778 I've not attempted to identify the cause of that. At a high level, I'm a bit disturbed that this focuses only on DISTINCT and doesn't (appear to) have any equivalent intelligence for GROUP BY, though surely that offers much the same opportunities for optimization. It seems like it'd be worthwhile to take a couple steps back and see if we couldn't recast the logic to work for both. Some other random comments: * Don't especially like the way you broke query_is_distinct_for() into two functions, especially when you then introduced a whole lot of other code in between. That's just making reviewer's job harder to see what changed. It makes the comments a bit disjointed too, that is where you even had any. (Zero introductory comment for query_is_distinct_agg is *not* up to project coding standards. There are a lot of other undercommented places in this patch, too.) * Definitely don't like having query_distinct_through_join re-open all the relations. The data needed for this should get absorbed while plancat.c has the relations open at the beginning. (Non-nullness of columns, in particular, seems like it'll be useful for other purposes; I'm a bit surprised the planner isn't using that already.) * In general, query_distinct_through_join seems hugely messy, expensive, and not clearly correct. If it is correct, the existing comments sure aren't enough to explain what it is doing or why. * Not entirely convinced that a new GUC is needed for this, but if it is, you have to document it. 
* I wouldn't bother with bms_array_free(), nor with any of the other cleanup you've got at the bottom of query_distinct_through_join. The planner leaks *lots* of memory, and this function isn't going to be called so many times that it'll move the needle. * There seem to be some pointless #include additions, eg in planner.c the added code doesn't look to justify any of them. Please also avoid unnecessary whitespace changes, those also slow down reviewing. * I see you decided to add a new regression test file select_distinct_2. That's a poor choice of name because it conflicts with our rules for the naming of alternative output files. Besides which, you forgot to plug it into the test schedule files, so it isn't actually getting run. Is there a reason not to just add the new test cases to select_distinct? * There are some changes in existing regression cases that aren't visibly related to the stated purpose of the patch, eg it now notices that "select distinct max(unique2) from tenk1" doesn't require an explicit DISTINCT step. That's not wrong, but I wonder if maybe you should subdivide this patch into more than one patch, because that must be coming from some separate change. I'm also wondering what caused the plan change in expected/join.out. regards, tom lane
Re: [PATCH] Erase the distinctClause if the result is unique by definition
Hi All: Here is the updated patch. It uses some functions from query_is_distinct_for. I check the query's distinctness in create_distinct_paths; if the result is distinct already, the paths for the Distinct step are not generated, so in the end the query tree is left untouched. Please see if you have any comments. Thanks On Mon, Feb 24, 2020 at 8:38 PM Andy Fan wrote: > > > On Wed, Feb 12, 2020 at 12:36 AM Ashutosh Bapat < > ashutosh.bapat@gmail.com> wrote: > >> >> >> On Tue, Feb 11, 2020 at 8:27 AM Andy Fan >> wrote: > >>> >>> >>> On Tue, Feb 11, 2020 at 12:22 AM Ashutosh Bapat < >>> ashutosh.bapat@gmail.com> wrote: >>> > > [PATCH] Erase the distinctClause if the result is unique by > definition > I forgot to mention this in the last round of comments. Your patch was actually removing distictClause from the Query structure. Please avoid doing that. If you remove it, you are also removing the evidence that this Query had a DISTINCT clause in it. >>> >>> Yes, I removed it because that was the easiest way to do it. What is the >>> purpose of keeping the evidence? >>> >> >> Julien's example provides an explanation for this. The Query structure is >> serialised into a view definition. Removing distinctClause from there means >> that the view will never try to produce unique results. >> >>> >>> >> > Actually that is not true. If a view is used in a query, the definition > is *copied* > into the query tree, so if we modify that query tree, the definition of > the view is never > touched. The issue Julien reported was caused by a typo.
> > -- session 2 >>> postgres=# alter table t alter column b drop not null; >>> ALTER TABLE >>> >>> -- session 1: >>> postgres=# explain execute st(1); >>> QUERY PLAN >>> - >>> Unique (cost=1.03..1.04 rows=1 width=4) >>>-> Sort (cost=1.03..1.04 rows=1 width=4) >>> Sort Key: b >>> -> Seq Scan on t (cost=0.00..1.02 rows=1 width=4) >>>Filter: (c = $1) >>> (5 rows) >>> >> >> Since this prepared statement is parameterised PostgreSQL is replanning >> it every time it gets executed. It's not using a stored prepared plan. Try >> without parameters. Also make sure that a prepared plan is used for >> execution and not a new plan. >> > > Even for a parameterised prepared statement, it is still possible to > generate a generic > plan, so it will not be replanned every time. But generic plan or > not, after a DDL such as > dropping the NOT NULL constraint, PostgreSQL will generate a plan based on the > stored query > tree. However, the query tree is *copied* again to generate a new > plan, so even if I > modified the query tree, everything would still be OK. > > In the end, I agree that modifying the query tree is not a good > idea, > so my updated patch doesn't do it any more. > 0001-Erase-the-distinctClause-if-the-result-is-unique-by-.patch Description: Binary data
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Wed, Feb 12, 2020 at 12:36 AM Ashutosh Bapat < ashutosh.bapat@gmail.com> wrote: > > > On Tue, Feb 11, 2020 at 8:27 AM Andy Fan wrote: > >> >> >> On Tue, Feb 11, 2020 at 12:22 AM Ashutosh Bapat < >> ashutosh.bapat@gmail.com> wrote: >> >>> >>> [PATCH] Erase the distinctClause if the result is unique by definition >>> >>> I forgot to mention this in the last round of comments. Your patch was >>> actually removing distictClause from the Query structure. Please avoid >>> doing that. If you remove it, you are also removing the evidence that this >>> Query had a DISTINCT clause in it. >>> >> >> Yes, I removed it because that was the easiest way to do it. What is the >> purpose of keeping the evidence? >> > > Julien's example provides an explanation for this. The Query structure is > serialised into a view definition. Removing distinctClause from there means > that the view will never try to produce unique results. > >> >> > Actually that is not true. If a view is used in a query, the definition is *copied* into the query tree, so if we modify that query tree, the definition of the view is never touched. The issue Julien reported was caused by a typo. -- session 2 >> postgres=# alter table t alter column b drop not null; >> ALTER TABLE >> >> -- session 1: >> postgres=# explain execute st(1); >> QUERY PLAN >> - >> Unique (cost=1.03..1.04 rows=1 width=4) >>-> Sort (cost=1.03..1.04 rows=1 width=4) >> Sort Key: b >> -> Seq Scan on t (cost=0.00..1.02 rows=1 width=4) >>Filter: (c = $1) >> (5 rows) >> > > Since this prepared statement is parameterised PostgreSQL is replanning it > every time it gets executed. It's not using a stored prepared plan. Try > without parameters. Also make sure that a prepared plan is used for > execution and not a new plan. > Even for a parameterised prepared statement, it is still possible to generate a generic plan, so it will not be replanned every time. But generic plan or not, after a DDL such as dropping the NOT NULL constraint,
PostgreSQL will generate a plan based on the stored query tree. However, the query tree is *copied* again to generate a new plan, so even if I modified the query tree, everything would still be OK. In the end, I agree that modifying the query tree is not a good idea, so my updated patch doesn't do it any more.
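The replanning behaviour described above can be modelled with a toy sketch (hypothetical structures, not PostgreSQL's actual plancache API): planning works on a copy of the stored query tree, so the planner may strip a redundant Distinct step from the copy, while DDL invalidation forces a fresh replan from the pristine stored tree.

```python
import copy

# Hypothetical catalog state the planner consults; DDL changes this.
catalog = {'b_not_null': True}

class PreparedStmt:
    """Toy model of a cached prepared statement (not the real plancache)."""
    def __init__(self, query_tree):
        self.stored_tree = query_tree   # kept pristine across replans
        self.cached_plan = None

    def execute(self):
        if self.cached_plan is None:    # first run, or invalidated by DDL
            work_tree = copy.deepcopy(self.stored_tree)
            self.cached_plan = plan(work_tree)  # planner may mutate the copy
        return self.cached_plan

    def invalidate(self):               # fired when DDL touches the table
        self.cached_plan = None

def plan(tree):
    # Safe to strip DISTINCT from the *copy* when uniqueness is proven;
    # the stored tree still records that the query had DISTINCT.
    if catalog['b_not_null']:
        tree.pop('distinct', None)
    return 'SeqScan' if 'distinct' not in tree else 'Unique over SeqScan'
```

After a simulated `DROP NOT NULL` (flipping the catalog flag) and an invalidation, the next `execute()` replans from the untouched stored tree and brings the Unique node back.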
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Thu, Feb 13, 2020 at 5:39 PM Julien Rouhaud wrote: > On Tue, Feb 11, 2020 at 10:06:17PM +0530, Ashutosh Bapat wrote: > > On Tue, Feb 11, 2020 at 8:27 AM Andy Fan > wrote: > > > > > On Tue, Feb 11, 2020 at 12:22 AM Ashutosh Bapat < > > > ashutosh.bapat@gmail.com> wrote: > > > > > >>> > > >>> [PATCH] Erase the distinctClause if the result is unique by > > >>> definition > > >> > > >> I forgot to mention this in the last round of comments. Your patch was > > >> actually removing distictClause from the Query structure. Please avoid > > >> doing that. If you remove it, you are also removing the evidence that > this > > >> Query had a DISTINCT clause in it. > > > > > > Yes, I removed it because it is the easiest way to do it. what is the > > > purpose of keeping the evidence? > > > > > > > Julien's example provides an explanation for this. The Query structure is > > serialised into a view definition. Removing distinctClause from there > means > > that the view will never try to produce unique results. > > And also I think that this approach will have a lot of other unexpected > side > effects. Isn't changing the Query going to affect pg_stat_statements > queryid > computing for instance? > Thanks, the two factors above are quite compelling, so erasing the distinctClause is not reasonable. I will try another way.
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Tue, Feb 11, 2020 at 10:06:17PM +0530, Ashutosh Bapat wrote: > On Tue, Feb 11, 2020 at 8:27 AM Andy Fan wrote: > > > On Tue, Feb 11, 2020 at 12:22 AM Ashutosh Bapat < > > ashutosh.bapat@gmail.com> wrote: > > > >>> > >>> [PATCH] Erase the distinctClause if the result is unique by > >>> definition > >> > >> I forgot to mention this in the last round of comments. Your patch was > >> actually removing distictClause from the Query structure. Please avoid > >> doing that. If you remove it, you are also removing the evidence that this > >> Query had a DISTINCT clause in it. > > > > Yes, I removed it because it is the easiest way to do it. what is the > > purpose of keeping the evidence? > > > > Julien's example provides an explanation for this. The Query structure is > serialised into a view definition. Removing distinctClause from there means > that the view will never try to produce unique results. And also I think that this approach will have a lot of other unexpected side effects. Isn't changing the Query going to affect pg_stat_statements queryid computing for instance?
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Tue, Feb 11, 2020 at 8:27 AM Andy Fan wrote: > > > On Tue, Feb 11, 2020 at 12:22 AM Ashutosh Bapat < > ashutosh.bapat@gmail.com> wrote: > >> >> >>> >>> [PATCH] Erase the distinctClause if the result is unique by >>> definition >>> >> >> I forgot to mention this in the last round of comments. Your patch was >> actually removing distictClause from the Query structure. Please avoid >> doing that. If you remove it, you are also removing the evidence that this >> Query had a DISTINCT clause in it. >> > > Yes, I removed it because it is the easiest way to do it. what is the > purpose of keeping the evidence? > Julien's example provides an explanation for this. The Query structure is serialised into a view definition. Removing distinctClause from there means that the view will never try to produce unique results. > > > > Suppose after a DDL, the prepared statement need to be re-parsed/planned > if it is not executed or it will prevent the DDL to happen. > The query will be replanned. I am not sure about reparsed though. > > > -- session 2 > postgres=# alter table t alter column b drop not null; > ALTER TABLE > > -- session 1: > postgres=# explain execute st(1); > QUERY PLAN > - > Unique (cost=1.03..1.04 rows=1 width=4) >-> Sort (cost=1.03..1.04 rows=1 width=4) > Sort Key: b > -> Seq Scan on t (cost=0.00..1.02 rows=1 width=4) >Filter: (c = $1) > (5 rows) > Since this prepared statement is parameterised PostgreSQL is replanning it every time it gets executed. It's not using a stored prepared plan. Try without parameters. Also make sure that a prepared plan is used for execution and not a new plan. -- Best Wishes, Ashutosh Bapat
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Mon, Feb 10, 2020 at 10:57 PM Tom Lane wrote: > Ashutosh Bapat writes: > >> On Sat, Feb 8, 2020 at 12:53 PM Andy Fan > wrote: > >> Do you mean adding some information into PlannerInfo, and when we > create > >> a node for Unique/HashAggregate/Group, we can just create a dummy node? > > > Not so much as PlannerInfo but something on lines of PathKey. See PathKey > > structure and related code. What I envision is PathKey class is also > > annotated with the information whether that PathKey implies uniqueness. > > E.g. a PathKey derived from a Primary index would imply uniqueness also. > A > > PathKey derived from say Group operation also implies uniqueness. Then > just > > by looking at the underlying Path we would be able to say whether we need > > Group/Unique node on top of it or not. I think that would make it much > > wider usecase and a very useful optimization. > > FWIW, that doesn't seem like a very prudent approach to me, because it > confuses sorted-ness with unique-ness. PathKeys are about sorting, > but it's possible to have uniqueness guarantees without having sorted > anything, for instance via hashed grouping. > > I haven't looked at this patch, but I'd expect it to use infrastructure > related to query_is_distinct_for(), and that doesn't deal in PathKeys. > > Thanks for the pointer. I think there's another problem with my approach. PathKeys are specific to paths, since the order of the result depends upon the Path. But uniqueness is a property of the result, i.e. the relation, and thus should be attached to RelOptInfo, as query_is_distinct_for() does. I think uniqueness should bubble up the RelOptInfo tree, annotating each RelOptInfo with the minimum set of TLEs which makes the result from that relation unique. Thus we could eliminate the extra Group/Unique node if the underlying RelOptInfo's unique column set is a subset of the required unique columns. -- Best Wishes, Ashutosh Bapat
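The bubbling-up idea can be sketched in a few lines (a toy Python model with hypothetical names, not planner code): if each side of a join is unique on some key set, the join result is unique on the union of those key sets regardless of the join method, and the Unique/Group node can be dropped when such a key set is covered by the DISTINCT columns.

```python
def join_unique_keys(outer_keys, inner_keys):
    """outer_keys/inner_keys: sets of columns each input is unique on,
    or None if that input is not provably unique."""
    if outer_keys is None or inner_keys is None:
        return None          # one side may duplicate the other's rows
    return outer_keys | inner_keys

def distinct_is_redundant(rel_unique_keys, distinct_cols):
    # The Unique/HashAggregate node can be skipped when the relation's
    # unique column set is covered by the DISTINCT column list.
    return rel_unique_keys is not None and rel_unique_keys <= set(distinct_cols)
```

For instance, joining a relation unique on `t1.a` with one unique on `t2.b` yields a result unique on `{t1.a, t2.b}`, so `SELECT DISTINCT t1.a, t2.b, t2.c ...` would need no Unique node.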
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Tue, Feb 11, 2020 at 08:14:14PM +0800, Andy Fan wrote: > On Tue, Feb 11, 2020 at 3:56 PM Julien Rouhaud wrote: > > > > > > > and if we prepare sql outside a transaction, and execute it in the > > > transaction, the other session can't drop the constraint until the > > > transaction is ended. > > > > And what if you create a view on top of a query containing a distinct > > clause > > rather than using prepared statements? FWIW your patch doesn't handle such > > case at all, without even needing to drop constraints: > > > > CREATE TABLE t (a int primary key, b int not null, c int); > > INSERT INTO t VALUEs(1, 1, 1), (2, 2, 2); > > CREATE UNIQUE INDEX t_idx1 on t(b); > > CREATE VIEW v1 AS SELECT DISTINCT b FROM t; > > EXPLAIN SELECT * FROM v1; > > server closed the connection unexpectedly > > This probably means the server terminated abnormally > > before or while processing the request. > > > > > This error can be fixed with > > - num_of_rtables = bms_num_members(non_semi_anti_relids); > + num_of_rtables = list_length(query->rtable); > > This test case has also been added to the patch. > > > > I also think this is not the right way to handle this optimization. > > > > Do you have any other concerns?
Yes, it seems to be broken as soon as you alter the view's underlying table: =# CREATE TABLE t (a int primary key, b int not null, c int); CREATE TABLE =# INSERT INTO t VALUEs(1, 1, 1), (2, 2, 2); INSERT 0 2 =# CREATE UNIQUE INDEX t_idx1 on t(b); CREATE INDEX =# CREATE VIEW v1 AS SELECT DISTINCT b FROM t; CREATE VIEW =# EXPLAIN SELECT * FROM v1; QUERY PLAN - Seq Scan on t (cost=0.00..1.02 rows=2 width=4) (1 row) =# EXPLAIN SELECT DISTINCT b FROM t; QUERY PLAN - Seq Scan on t (cost=0.00..1.02 rows=2 width=4) (1 row) =# ALTER TABLE t ALTER COLUMN b DROP NOT NULL; ALTER TABLE =# EXPLAIN SELECT * FROM v1; QUERY PLAN - Seq Scan on t (cost=0.00..1.02 rows=2 width=4) (1 row) =# EXPLAIN SELECT DISTINCT b FROM t; QUERY PLAN - Unique (cost=1.03..1.04 rows=2 width=4) -> Sort (cost=1.03..1.03 rows=2 width=4) Sort Key: b -> Seq Scan on t (cost=0.00..1.02 rows=2 width=4) (4 rows)
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Tue, Feb 11, 2020 at 3:56 PM Julien Rouhaud wrote: > > > > and if we prepare sql outside a transaction, and execute it in the > > transaction, the other session can't drop the constraint until the > > transaction is ended. > > And what if you create a view on top of a query containing a distinct > clause > rather than using prepared statements? FWIW your patch doesn't handle such > case at all, without even needing to drop constraints: > > CREATE TABLE t (a int primary key, b int not null, c int); > INSERT INTO t VALUEs(1, 1, 1), (2, 2, 2); > CREATE UNIQUE INDEX t_idx1 on t(b); > CREATE VIEW v1 AS SELECT DISTINCT b FROM t; > EXPLAIN SELECT * FROM v1; > server closed the connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request. > > This error can be fixed with - num_of_rtables = bms_num_members(non_semi_anti_relids); + num_of_rtables = list_length(query->rtable); This test case has also been added to the patch. > I also think this is not the right way to handle this optimization. > Do you have any other concerns? 0001-Erase-the-distinctClause-if-the-result-is-unique-by-.patch Description: Binary data
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Tue, Feb 11, 2020 at 3:56 PM Julien Rouhaud wrote: > On Tue, Feb 11, 2020 at 10:57:26AM +0800, Andy Fan wrote: > > On Tue, Feb 11, 2020 at 12:22 AM Ashutosh Bapat < > > ashutosh.bapat@gmail.com> wrote: > > > > > I forgot to mention this in the last round of comments. Your patch was > > > actually removing distictClause from the Query structure. Please avoid > > > doing that. If you remove it, you are also removing the evidence that > this > > > Query had a DISTINCT clause in it. > > > > > > > Yes, I removed it because it is the easiest way to do it. what is the > > purpose of keeping the evidence? > > > > >> However the patch as presented has some problems > > >> 1. What happens if the primary key constraint or NOT NULL constraint > gets > > >> dropped between a prepare and execute? The plan will no more be valid > and > > >> thus execution may produce non-distinct results. > > > > > But that doesn't matter since a query can be prepared outside a > > > transaction and executed within one or more subsequent transactions. > > > > > > > Suppose after a DDL, the prepared statement need to be re-parsed/planned > > if it is not executed or it will prevent the DDL to happen. > > > > The following is my test. > > > > postgres=# create table t (a int primary key, b int not null, c int); > > CREATE TABLE > > postgres=# insert into t values(1, 1, 1), (2, 2, 2); > > INSERT 0 2 > > postgres=# create unique index t_idx1 on t(b); > > CREATE INDEX > > > > postgres=# prepare st as select distinct b from t where c = $1; > > PREPARE > > postgres=# explain execute st(1); > >QUERY PLAN > > - > > Seq Scan on t (cost=0.00..1.02 rows=1 width=4) > >Filter: (c = 1) > > (2 rows) > > ... 
> > postgres=# explain execute st(1); > >QUERY PLAN > > - > > Seq Scan on t (cost=0.00..1.02 rows=1 width=4) > >Filter: (c = $1) > > (2 rows) > > > > -- session 2 > > postgres=# alter table t alter column b drop not null; > > ALTER TABLE > > > > -- session 1: > > postgres=# explain execute st(1); > > QUERY PLAN > > - > > Unique (cost=1.03..1.04 rows=1 width=4) > >-> Sort (cost=1.03..1.04 rows=1 width=4) > > Sort Key: b > > -> Seq Scan on t (cost=0.00..1.02 rows=1 width=4) > >Filter: (c = $1) > > (5 rows) > > > > -- session 2 > > postgres=# insert into t values (3, null, 3), (4, null, 3); > > INSERT 0 2 > > > > -- session 1 > > postgres=# execute st(3); > > b > > --- > > > > (1 row) > > > > and if we prepare sql outside a transaction, and execute it in the > > transaction, the other session can't drop the constraint until the > > transaction is ended. > > And what if you create a view on top of a query containing a distinct > clause > rather than using prepared statements? FWIW your patch doesn't handle such > case at all, without even needing to drop constraints: > CREATE TABLE t (a int primary key, b int not null, c int); > INSERT INTO t VALUEs(1, 1, 1), (2, 2, 2); > CREATE UNIQUE INDEX t_idx1 on t(b); > CREATE VIEW v1 AS SELECT DISTINCT b FROM t; > EXPLAIN SELECT * FROM v1; > server closed the connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request. > > Thanks for pointing it out. This is unexpected based on my current knowledge; I will check it. > I also think this is not the right way to handle this optimization. > I started looking into query_is_distinct_for when Tom pointed it out, but I still don't understand the context fully. I will investigate your finding along with this.
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Tue, Feb 11, 2020 at 10:57:26AM +0800, Andy Fan wrote: > On Tue, Feb 11, 2020 at 12:22 AM Ashutosh Bapat < > ashutosh.bapat@gmail.com> wrote: > > > I forgot to mention this in the last round of comments. Your patch was > > actually removing distictClause from the Query structure. Please avoid > > doing that. If you remove it, you are also removing the evidence that this > > Query had a DISTINCT clause in it. > > > > Yes, I removed it because it is the easiest way to do it. what is the > purpose of keeping the evidence? > > >> However the patch as presented has some problems > >> 1. What happens if the primary key constraint or NOT NULL constraint gets > >> dropped between a prepare and execute? The plan will no more be valid and > >> thus execution may produce non-distinct results. > > > But that doesn't matter since a query can be prepared outside a > > transaction and executed within one or more subsequent transactions. > > > > Suppose after a DDL, the prepared statement need to be re-parsed/planned > if it is not executed or it will prevent the DDL to happen. > > The following is my test. > > postgres=# create table t (a int primary key, b int not null, c int); > CREATE TABLE > postgres=# insert into t values(1, 1, 1), (2, 2, 2); > INSERT 0 2 > postgres=# create unique index t_idx1 on t(b); > CREATE INDEX > > postgres=# prepare st as select distinct b from t where c = $1; > PREPARE > postgres=# explain execute st(1); >QUERY PLAN > - > Seq Scan on t (cost=0.00..1.02 rows=1 width=4) >Filter: (c = 1) > (2 rows) > ... 
> postgres=# explain execute st(1); >QUERY PLAN > - > Seq Scan on t (cost=0.00..1.02 rows=1 width=4) >Filter: (c = $1) > (2 rows) > > -- session 2 > postgres=# alter table t alter column b drop not null; > ALTER TABLE > > -- session 1: > postgres=# explain execute st(1); > QUERY PLAN > - > Unique (cost=1.03..1.04 rows=1 width=4) >-> Sort (cost=1.03..1.04 rows=1 width=4) > Sort Key: b > -> Seq Scan on t (cost=0.00..1.02 rows=1 width=4) >Filter: (c = $1) > (5 rows) > > -- session 2 > postgres=# insert into t values (3, null, 3), (4, null, 3); > INSERT 0 2 > > -- session 1 > postgres=# execute st(3); > b > --- > > (1 row) > > and if we prepare sql outside a transaction, and execute it in the > transaction, the other session can't drop the constraint until the > transaction is ended. And what if you create a view on top of a query containing a distinct clause rather than using prepared statements? FWIW your patch doesn't handle such case at all, without even needing to drop constraints: CREATE TABLE t (a int primary key, b int not null, c int); INSERT INTO t VALUEs(1, 1, 1), (2, 2, 2); CREATE UNIQUE INDEX t_idx1 on t(b); CREATE VIEW v1 AS SELECT DISTINCT b FROM t; EXPLAIN SELECT * FROM v1; server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. I also think this is not the right way to handle this optimization.
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Tue, Feb 11, 2020 at 12:22 AM Ashutosh Bapat
<ashutosh.bapat@gmail.com> wrote:
>
>>> [PATCH] Erase the distinctClause if the result is unique by
>>> definition
>
> I forgot to mention this in the last round of comments. Your patch was
> actually removing distinctClause from the Query structure. Please avoid
> doing that. If you remove it, you are also removing the evidence that
> this Query had a DISTINCT clause in it.

Yes, I removed it because it is the easiest way to do it. What is the
purpose of keeping the evidence?

>>> However the patch as presented has some problems
>>> 1. What happens if the primary key constraint or NOT NULL constraint
>>> gets dropped between a prepare and execute? The plan will no longer be
>>> valid and thus execution may produce non-distinct results.
>>
>> Will this still be an issue if the user doesn't use a "read uncommitted"
>> isolation level? I suppose it should be OK for this case. But even so,
>> I should add an isolation level check for this. I just added that to
>> the patch to continue discussing this issue.
>
> In PostgreSQL there's no "read uncommitted".

Thanks for the hint. I just noticed read uncommitted is treated as read
committed in PostgreSQL.

> But that doesn't matter since a query can be prepared outside a
> transaction and executed within one or more subsequent transactions.

After a DDL, the prepared statement needs to be re-parsed/re-planned if
it has not been executed yet, or else it will prevent the DDL from
happening. The following is my test.

postgres=# create table t (a int primary key, b int not null, c int);
CREATE TABLE
postgres=# insert into t values(1, 1, 1), (2, 2, 2);
INSERT 0 2
postgres=# create unique index t_idx1 on t(b);
CREATE INDEX

postgres=# prepare st as select distinct b from t where c = $1;
PREPARE
postgres=# explain execute st(1);
                    QUERY PLAN
-------------------------------------------------
 Seq Scan on t  (cost=0.00..1.02 rows=1 width=4)
   Filter: (c = 1)
(2 rows)
...
postgres=# explain execute st(1);
                    QUERY PLAN
-------------------------------------------------
 Seq Scan on t  (cost=0.00..1.02 rows=1 width=4)
   Filter: (c = $1)
(2 rows)

-- session 2
postgres=# alter table t alter column b drop not null;
ALTER TABLE

-- session 1:
postgres=# explain execute st(1);
                       QUERY PLAN
--------------------------------------------------------
 Unique  (cost=1.03..1.04 rows=1 width=4)
   ->  Sort  (cost=1.03..1.04 rows=1 width=4)
         Sort Key: b
         ->  Seq Scan on t  (cost=0.00..1.02 rows=1 width=4)
               Filter: (c = $1)
(5 rows)

-- session 2
postgres=# insert into t values (3, null, 3), (4, null, 3);
INSERT 0 2

-- session 1
postgres=# execute st(3);
 b
---

(1 row)

And if we prepare the SQL outside a transaction and execute it inside a
transaction, the other session can't drop the constraint until the
transaction has ended.

> --
> Best Wishes,
> Ashutosh Bapat
Re: [PATCH] Erase the distinctClause if the result is unique by definition
Ashutosh Bapat writes:
>> On Sat, Feb 8, 2020 at 12:53 PM Andy Fan wrote:
>> Do you mean adding some information into PlannerInfo, and when we create
>> a node for Unique/HashAggregate/Group, we can just create a dummy node?

> Not so much as PlannerInfo but something on the lines of PathKey. See the
> PathKey structure and related code. What I envision is that PathKey is
> also annotated with the information whether that PathKey implies
> uniqueness. E.g. a PathKey derived from a primary index would imply
> uniqueness also. A PathKey derived from, say, a Group operation also
> implies uniqueness. Then just by looking at the underlying Path we would
> be able to say whether we need a Group/Unique node on top of it or not.
> I think that would make it a much wider use case and a very useful
> optimization.

FWIW, that doesn't seem like a very prudent approach to me, because it
confuses sorted-ness with unique-ness. PathKeys are about sorting, but
it's possible to have uniqueness guarantees without having sorted
anything, for instance via hashed grouping.

I haven't looked at this patch, but I'd expect it to use infrastructure
related to query_is_distinct_for(), and that doesn't deal in PathKeys.

			regards, tom lane
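[Editor's illustration: the distinction drawn above — uniqueness without sortedness — can be seen outside the planner. Hashed grouping guarantees unique output keys while producing no sort order at all. A toy Python sketch, not PostgreSQL code:]

```python
def hashed_group(rows, key):
    """Group rows with a hash table: the output keys are unique,
    but emitted in arbitrary (here: insertion) order, not sorted."""
    groups = {}
    for row in rows:
        groups.setdefault(key(row), []).append(row)
    return list(groups.items())

rows = [(3, 'c'), (1, 'a'), (3, 'd'), (2, 'b')]
out = hashed_group(rows, key=lambda r: r[0])
keys = [k for k, _ in out]

# Uniqueness holds even though nothing was sorted.
assert len(keys) == len(set(keys))
assert keys != sorted(keys)
```

This is why annotating PathKeys (a sort-order concept) with uniqueness would miss paths like HashAggregate that are unique but unordered.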
Re: [PATCH] Erase the distinctClause if the result is unique by definition
On Sat, Feb 8, 2020 at 12:53 PM Andy Fan wrote:
> Hi Ashutosh:
>    Thanks for your time.
>
> On Fri, Feb 7, 2020 at 11:54 PM Ashutosh Bapat
> <ashutosh.bapat@gmail.com> wrote:
>> Hi Andy,
>> What might help is to add more description to your email message, like
>> giving examples to explain your idea.
>>
>> Anyway, I looked at the testcases you added for examples.
>> +create table select_distinct_a(a int, b char(20), c char(20) not null,
>> d int, e int, primary key(a, b));
>> +set enable_mergejoin to off;
>> +set enable_hashjoin to off;
>> +-- no node for distinct.
>> +explain (costs off) select distinct * from select_distinct_a;
>> +           QUERY PLAN
>> +--------------------------------
>> + Seq Scan on select_distinct_a
>> +(1 row)
>>
>> From this example, it seems that the distinct operation can be dropped
>> because (a, b) is a primary key. Is my understanding correct?
>
> Yes, you are correct. Actually I added them to the commit message, but
> it's true that I should have copied them in this email body as well, so
> I copy them now.
>
>     [PATCH] Erase the distinctClause if the result is unique by
>     definition

I forgot to mention this in the last round of comments. Your patch was
actually removing distinctClause from the Query structure. Please avoid
doing that. If you remove it, you are also removing the evidence that
this Query had a DISTINCT clause in it.

>> However the patch as presented has some problems
>> 1. What happens if the primary key constraint or NOT NULL constraint
>> gets dropped between a prepare and execute? The plan will no longer be
>> valid and thus execution may produce non-distinct results.
>
> Will this still be an issue if the user doesn't use a "read uncommitted"
> isolation level? I suppose it should be OK for this case. But even so,
> I should add an isolation level check for this. I just added that to
> the patch to continue discussing this issue.

In PostgreSQL there's no "read uncommitted".

But that doesn't matter since a query can be prepared outside a
transaction and executed within one or more subsequent transactions.

>> PostgreSQL has a similar concept of allowing non-grouping expressions
>> as part of the targetlist when those expressions can be proved to be
>> functionally dependent on the GROUP BY clause. See
>> check_functional_grouping() and its caller. I think DISTINCT
>> elimination should work on similar lines.
>>
>> 2. For the same reason described in check_functional_grouping(), using
>> unique indexes for eliminating DISTINCT should be discouraged.
>
> I checked the comments of check_functional_grouping; the reason is
>
>  * Currently we only check to see if the rel has a primary key that is a
>  * subset of the grouping_columns.  We could also use plain unique
>  * constraints if all their columns are known not null, but there's a
>  * problem: we need to be able to represent the not-null-ness as part of
>  * the constraints added to *constraintDeps.  FIXME whenever not-null
>  * constraints get represented in pg_constraint.
>
> Actually I am doubtful about the pg_constraint reason, since we are
> still able to get the not-null information from
> relation->rd_attr->attrs[n].attnotnull, which is just what this patch
> did.

The problem isn't whether not-null-ness can be inferred or not; the
problem is whether it can be guaranteed across planning and execution of
the query (prepare and execute, for example). The constraintDep
machinery registers the constraints used while preparing a plan and
invalidates the plan if any of those constraints change after the plan
is created.

>> 3. If you could eliminate DISTINCT you could similarly eliminate
>> GROUP BY as well
>
> This is a good point. The rules may be somewhat different for joins, so
> I prefer to focus on the current one so far.

I doubt that, since DISTINCT is ultimately carried out as a Grouping
operation. But anyway, I won't hang upon that.

>> 4. The patch works only at the query level, but that functionality can
>> be expanded generally to other places which add
>> Unique/HashAggregate/Group nodes if the underlying relation can be
>> proved to produce distinct rows. But that's probably more work since we
>> will have to label paths with unique keys similar to pathkeys.
>
> Do you mean adding some information into PlannerInfo, and when we create
> a node for Unique/HashAggregate/Group, we can just create a dummy node?

Not so much as PlannerInfo but something on the lines of PathKey. See the
PathKey structure and related code. What I envision is that PathKey is
also annotated with the information whether that PathKey implies
uniqueness. E.g. a PathKey derived from a primary index would imply
uniqueness also. A PathKey derived from, say, a Group operation also
implies uniqueness. Then just by looking at the underlying Path we would
be able to say whether we need a Group/Unique node on top of it or not.
I think that would make it a much wider use case and a very useful
optimization.
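[Editor's illustration: the path-labelling idea sketched above might look roughly like this in miniature. All names here are hypothetical Python stand-ins, not PostgreSQL's actual Path/PathKey structures:]

```python
from dataclasses import dataclass, field

@dataclass
class Path:
    """Toy stand-in for a planner Path (hypothetical, for illustration)."""
    rel: str
    # Each entry is a set of column names known to be unique in this
    # path's output -- analogous to labelling paths with "unique keys".
    unique_keys: list = field(default_factory=list)

def needs_unique_node(path, distinct_cols):
    """A DISTINCT over distinct_cols is redundant when some unique key
    of the path is a subset of the DISTINCT column list."""
    target = set(distinct_cols)
    return not any(set(uk) <= target for uk in path.unique_keys)

# Scan of a table whose primary key is (a, b): output unique on {a, b}.
p = Path(rel="select_distinct_a", unique_keys=[{"a", "b"}])
assert not needs_unique_node(p, ["a", "b", "c"])  # Unique node skippable
assert needs_unique_node(p, ["c"])                # still needed
```

Unlike a PathKey annotation, this label carries no ordering claim, which sidesteps the sortedness/uniqueness confusion raised in the adjacent message.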
Re: [PATCH] Erase the distinctClause if the result is unique by definition
Hi Ashutosh:
   Thanks for your time.

On Fri, Feb 7, 2020 at 11:54 PM Ashutosh Bapat wrote:
> Hi Andy,
> What might help is to add more description to your email message, like
> giving examples to explain your idea.
>
> Anyway, I looked at the testcases you added for examples.
> +create table select_distinct_a(a int, b char(20), c char(20) not null,
> d int, e int, primary key(a, b));
> +set enable_mergejoin to off;
> +set enable_hashjoin to off;
> +-- no node for distinct.
> +explain (costs off) select distinct * from select_distinct_a;
> +           QUERY PLAN
> +--------------------------------
> + Seq Scan on select_distinct_a
> +(1 row)
>
> From this example, it seems that the distinct operation can be dropped
> because (a, b) is a primary key. Is my understanding correct?

Yes, you are correct. Actually I added them to the commit message, but
it's true that I should have copied them in this email body as well, so
I copy them now.

    [PATCH] Erase the distinctClause if the result is unique by
    definition

    For a single relation, we can tell the result is unique if any one
    of the following is true:
    1. The pk is in the target list.
    2. The uk is in the target list and its columns are not null.
    3. The columns in the group-by clause are also in the target list.

    For a relation join: if every relation in the jointree yields a
    unique result set, then the final result is unique as well,
    regardless of the join method. For semi/anti joins, we ignore the
    right table.

> I like the idea since it eliminates one expensive operation.
>
> However the patch as presented has some problems
> 1. What happens if the primary key constraint or NOT NULL constraint
> gets dropped between a prepare and execute? The plan will no longer be
> valid and thus execution may produce non-distinct results.

Will this still be an issue if the user doesn't use a "read uncommitted"
isolation level? I suppose it should be OK for this case. But even so, I
should add an isolation level check for this. I just added that to the
patch to continue discussing this issue.

> PostgreSQL has a similar concept of allowing non-grouping expressions
> as part of the targetlist when those expressions can be proved to be
> functionally dependent on the GROUP BY clause. See
> check_functional_grouping() and its caller. I think DISTINCT
> elimination should work on similar lines.
>
> 2. For the same reason described in check_functional_grouping(), using
> unique indexes for eliminating DISTINCT should be discouraged.

I checked the comments of check_functional_grouping; the reason is

 * Currently we only check to see if the rel has a primary key that is a
 * subset of the grouping_columns.  We could also use plain unique
 * constraints if all their columns are known not null, but there's a
 * problem: we need to be able to represent the not-null-ness as part of
 * the constraints added to *constraintDeps.  FIXME whenever not-null
 * constraints get represented in pg_constraint.

Actually I am doubtful about the pg_constraint reason, since we are
still able to get the not-null information from
relation->rd_attr->attrs[n].attnotnull, which is just what this patch
did.

> 3. If you could eliminate DISTINCT you could similarly eliminate
> GROUP BY as well

This is a good point. The rules may be somewhat different for joins, so
I prefer to focus on the current one so far.

> 4. The patch works only at the query level, but that functionality can
> be expanded generally to other places which add
> Unique/HashAggregate/Group nodes if the underlying relation can be
> proved to produce distinct rows. But that's probably more work since we
> will have to label paths with unique keys similar to pathkeys.

Do you mean adding some information into PlannerInfo, and when we create
a node for Unique/HashAggregate/Group, we can just create a dummy node?

> 5. Have you tested this with OUTER joins, which can render the inner
> side nullable?

Yes, that part was missed in the test case. I just added them.
On Thu, Feb 6, 2020 at 11:31 AM Andy Fan wrote:
>
>> Updated the patch to consider semi/anti joins.
>>
>> Can anyone help to review this patch?
>>
>> Thanks
>>
>> On Fri, Jan 31, 2020 at 8:39 PM Andy Fan wrote:
>>> Hi:
>>>
>>> I wrote a patch to erase the distinctClause if the result is unique
>>> by definition. I noticed this because a user switched some code from
>>> Oracle to PG and found the performance was bad due to this, so I
>>> adapted PG for this as well.
>>>
>>> This patch doesn't help a well-written SQL query, but some drawbacks
>>> of a query may not be very obvious; since the cost of the check is
>>> pretty low as well, I think it would be OK to add.
>>>
>>> Please see the patch for details.
>>>
>>> Thank you.

> --
> Best Wishes,
> Ashutosh Bapat

0001-Erase-the-distinctClause-if-the-result-is-unique-by-.patch
Description: Binary data
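[Editor's illustration: the uniqueness rules from the commit message quoted earlier in this message can be sketched as follows. This is a Python toy with hypothetical names; the real patch operates on PostgreSQL's Query and catalog structures:]

```python
from dataclasses import dataclass, field

@dataclass
class Rel:
    """Toy relation descriptor (hypothetical, for illustration)."""
    pk: tuple = ()
    unique_keys: list = field(default_factory=list)
    not_null: set = field(default_factory=set)
    group_by: tuple = ()

def single_rel_result_is_unique(rel, targetlist):
    """Rules 1-3: pk in the target list; or a unique key in the target
    list whose columns are all NOT NULL; or the GROUP BY columns all in
    the target list."""
    tl = set(targetlist)
    if rel.pk and set(rel.pk) <= tl:
        return True
    for uk in rel.unique_keys:
        if set(uk) <= tl and all(c in rel.not_null for c in uk):
            return True
    if rel.group_by and set(rel.group_by) <= tl:
        return True
    return False

def join_result_is_unique(rels, targetlists, semi_anti_right=None):
    """Join rule: if every relation in the jointree yields a unique
    result, the join result is unique regardless of join method; the
    right table of a semi/anti join is ignored."""
    return all(single_rel_result_is_unique(r, t)
               for r, t in zip(rels, targetlists)
               if r is not semi_anti_right)

# Mirrors the select_distinct_a example: pk (a, b), plus a NOT NULL
# unique index on b.
r = Rel(pk=("a", "b"), unique_keys=[("b",)], not_null={"b", "c"})
assert single_rel_result_is_unique(r, ["a", "b", "c"])  # rule 1
assert single_rel_result_is_unique(r, ["b"])            # rule 2
assert not single_rel_result_is_unique(r, ["c"])
```

Note that rule 2 is exactly the case Ashutosh flags as fragile: the NOT NULL property can be dropped after planning unless a dependency is recorded.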
Re: [PATCH] Erase the distinctClause if the result is unique by definition
Hi Andy,
What might help is to add more description to your email message, like
giving examples to explain your idea.

Anyway, I looked at the testcases you added for examples.

+create table select_distinct_a(a int, b char(20), c char(20) not null,
d int, e int, primary key(a, b));
+set enable_mergejoin to off;
+set enable_hashjoin to off;
+-- no node for distinct.
+explain (costs off) select distinct * from select_distinct_a;
+           QUERY PLAN
+--------------------------------
+ Seq Scan on select_distinct_a
+(1 row)

From this example, it seems that the distinct operation can be dropped
because (a, b) is a primary key. Is my understanding correct?

I like the idea since it eliminates one expensive operation.

However the patch as presented has some problems:

1. What happens if the primary key constraint or NOT NULL constraint gets
dropped between a prepare and execute? The plan will no longer be valid
and thus execution may produce non-distinct results. PostgreSQL has a
similar concept of allowing non-grouping expressions as part of the
targetlist when those expressions can be proved to be functionally
dependent on the GROUP BY clause. See check_functional_grouping() and its
caller. I think DISTINCT elimination should work on similar lines.

2. For the same reason described in check_functional_grouping(), using
unique indexes for eliminating DISTINCT should be discouraged.

3. If you could eliminate DISTINCT you could similarly eliminate GROUP BY
as well.

4. The patch works only at the query level, but that functionality can be
expanded generally to other places which add Unique/HashAggregate/Group
nodes if the underlying relation can be proved to produce distinct rows.
But that's probably more work since we will have to label paths with
unique keys similar to pathkeys.

5. Have you tested this with OUTER joins, which can render the inner side
nullable?

On Thu, Feb 6, 2020 at 11:31 AM Andy Fan wrote:
> Updated the patch to consider semi/anti joins.
>
> Can anyone help to review this patch?
>
> Thanks
>
> On Fri, Jan 31, 2020 at 8:39 PM Andy Fan wrote:
>> Hi:
>>
>> I wrote a patch to erase the distinctClause if the result is unique by
>> definition. I noticed this because a user switched some code from
>> Oracle to PG and found the performance was bad due to this, so I
>> adapted PG for this as well.
>>
>> This patch doesn't help a well-written SQL query, but some drawbacks
>> of a query may not be very obvious; since the cost of the check is
>> pretty low as well, I think it would be OK to add.
>>
>> Please see the patch for details.
>>
>> Thank you.

--
Best Wishes,
Ashutosh Bapat
Re: [PATCH] Erase the distinctClause if the result is unique by definition
Updated the patch to consider semi/anti joins.

Can anyone help to review this patch?

Thanks

On Fri, Jan 31, 2020 at 8:39 PM Andy Fan wrote:
> Hi:
>
> I wrote a patch to erase the distinctClause if the result is unique by
> definition. I noticed this because a user switched some code from
> Oracle to PG and found the performance was bad due to this, so I
> adapted PG for this as well.
>
> This patch doesn't help a well-written SQL query, but some drawbacks
> of a query may not be very obvious; since the cost of the check is
> pretty low as well, I think it would be OK to add.
>
> Please see the patch for details.
>
> Thank you.

0001-Erase-the-distinctClause-if-the-result-is-unique-by-.patch
Description: Binary data