Re: Removing useless DISTINCT clauses

2018-08-26 Thread David Rowley
On 25 August 2018 at 04:05, Finnerty, Jim  wrote:
> I'll explore a proper way to test that it's in the single-relation case, and 
> will post a separate thread for the 'remove unnecessary DISTINCT' 
> optimization.

See: has_multiple_baserels() in allpaths.c

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Removing useless DISTINCT clauses

2018-08-24 Thread Finnerty, Jim
I feel strongly that eliminating the entire DISTINCT or GROUP BY clause (when 
there are no aggs) is an important optimization, especially when the 
incremental cost to test for it is so tiny.  I'm happy to submit that as a 
separate thread.

My goal here was to move the original proposal along and contribute a little 
something back to the community in the process.  DISTINCT optimization is 
currently quite poor compared to the leading commercial RDBMS alternatives, and 
doing unnecessary DISTINCT in the single-table case is an example of that.  
There are other missing DISTINCT optimizations.

I'll explore a proper way to test that it's in the single-relation case, and 
will post a separate thread for the 'remove unnecessary DISTINCT' optimization.

Cheers,

/Jim


On 8/23/18, 11:12 PM, "David Rowley"  wrote:

You might be confusing #1 and #2. My concern is with #2.  The existing
GROUP BY clause optimisation is almost identical to #1. I just wanted
to also apply it to the DISTINCT clause.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Removing useless DISTINCT clauses

2018-08-23 Thread David Rowley
On 24 August 2018 at 14:12, Stephen Frost  wrote:
> This is happening at the same time as some optimizations around GROUP
> BY, so either there's something different about what's happening there
> and I didn't appreciate it, or does that optimization suffer from a
> similar issue?

There are two separate optimisations in Jim's version of the patch:

1) Attempting to remove useless DISTINCT clause items in cases where a
true subset of the clauses make up the entire table's primary key.
This can be applied no matter how many joins are in the query.  If a
table's PK is a,b then it's not going to be any more distinct if we
distinctify on a,b,c. Distification on a,b is the most we'll never
need.  We already apply this optimisation to the GROUP BY clause.

2) Attempt to remove the entire DISTINCT clause if the query is to a
single base table when the DISTINCT clause contains the entire table's
PK. Here there's need uniquify the results.

I started this thread for #1 and Jim would like to add #2 as an
additional optimisation.  I feel that #2 is different enough from #1
that they should be considered independently of each other. The area
where #2 should be implemented is far away at the other end of
planning from where #1 is being done.  I believe that two separate
patches are the best way forward here as it does not really make sense
to reject #1 because there are concerns with #2, vice versa.  I think
if Jim would like to propose #2, then I think a separate thread is a
better option than tagging along on this one. This might also reduce
confusion.

I'd also rather not turn this thread into people discussing/reviewing
Jim's work that he plans to implement in some fork taken from Postgres
10.5. I don't really think even this mailing list is the right place
for that, let alone this thread.

> > I've not read the patch, but David's reaction makes it sound like its
> > processing is done too early.  There are right places and wrong places
> > to do most everything in the planner, and I do not wish to accept a
> > patch that does something in the wrong place.
>
> Right, I definitely agree with you there.  This seemed like a reasonable
> place given the similar optimization (at least in appearance to me)
> being done there for the GROUP BY case.  I'm happy to admit that I
> haven't looked at it in very much depth (hence my question to David) and
> I'm not an expert in this area, but I did want to bring up that the
> general idea and the relative trade-offs at least sounded reasonable.

You might be confusing #1 and #2. My concern is with #2.  The existing
GROUP BY clause optimisation is almost identical to #1. I just wanted
to also apply it to the DISTINCT clause.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Removing useless DISTINCT clauses

2018-08-23 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > * David Rowley (david.row...@2ndquadrant.com) wrote:
> >> On 24 August 2018 at 11:34, Stephen Frost  wrote:
> >>> * David Rowley (david.row...@2ndquadrant.com) wrote:
>  My personal opinion of only being able to completely remove the
>  DISTINCT when there's a single item in the rtable (or a single base
>  table) is that it's just too poor to bother with.
> 
> > Hm, so you're suggesting that this isn't the right place for this
> > optimization to be implemented, even now, with the single-relation
> > caveat?
> 
> There is no case where planner optimizations should depend on the length
> of the rtable.  Full stop.
> 
> It could make sense to optimize if there is just one baserel in the join
> tree --- although even that is best checked only after join removal.

Hm, that's certainly a fair point.

> As an example of the difference, such an optimization should be able to
> optimize "select * from view" if the view contains just one base table.
> The rtable will list both the view and the base table, but the view
> is only hanging around for permissions-checking purposes; it should not
> affect the planner's behavior.

This is happening at the same time as some optimizations around GROUP
BY, so either there's something different about what's happening there
and I didn't appreciate it, or does that optimization suffer from a
similar issue?

> I've not read the patch, but David's reaction makes it sound like its
> processing is done too early.  There are right places and wrong places
> to do most everything in the planner, and I do not wish to accept a
> patch that does something in the wrong place.

Right, I definitely agree with you there.  This seemed like a reasonable
place given the similar optimization (at least in appearance to me)
being done there for the GROUP BY case.  I'm happy to admit that I
haven't looked at it in very much depth (hence my question to David) and
I'm not an expert in this area, but I did want to bring up that the
general idea and the relative trade-offs at least sounded reasonable.

I'll also note that I didn't see these concerned raised earlier on the
thread when I re-read your remarks on it, so I'm a bit concerned that
perhaps either this isn't an actual concern to be realized or perhaps it
was missed previously.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: Removing useless DISTINCT clauses

2018-08-23 Thread Tom Lane
Stephen Frost  writes:
> * David Rowley (david.row...@2ndquadrant.com) wrote:
>> On 24 August 2018 at 11:34, Stephen Frost  wrote:
>>> * David Rowley (david.row...@2ndquadrant.com) wrote:
 My personal opinion of only being able to completely remove the
 DISTINCT when there's a single item in the rtable (or a single base
 table) is that it's just too poor to bother with.

> Hm, so you're suggesting that this isn't the right place for this
> optimization to be implemented, even now, with the single-relation
> caveat?

There is no case where planner optimizations should depend on the length
of the rtable.  Full stop.

It could make sense to optimize if there is just one baserel in the join
tree --- although even that is best checked only after join removal.
As an example of the difference, such an optimization should be able to
optimize "select * from view" if the view contains just one base table.
The rtable will list both the view and the base table, but the view
is only hanging around for permissions-checking purposes; it should not
affect the planner's behavior.

I've not read the patch, but David's reaction makes it sound like its
processing is done too early.  There are right places and wrong places
to do most everything in the planner, and I do not wish to accept a
patch that does something in the wrong place.

regards, tom lane



Re: Removing useless DISTINCT clauses

2018-08-23 Thread Stephen Frost
Greetings,

* David Rowley (david.row...@2ndquadrant.com) wrote:
> On 24 August 2018 at 11:34, Stephen Frost  wrote:
> > * David Rowley (david.row...@2ndquadrant.com) wrote:
> >> My personal opinion of only being able to completely remove the
> >> DISTINCT when there's a single item in the rtable (or a single base
> >> table) is that it's just too poor to bother with.  I think such a
> >> solution is best left to another patch and it should be much more
> >> complete and be able to track the unique properties through joins.
> >
> > Doesn't that move the goalposts for this pretty far off?  Is there
> > some reason that doing this for the single-item case will make it
> > difficult to implement the additional improvements you're suggesting
> > later?
> 
> The DISTINCT clause removal would likely need to wait until around
> where the distinct paths are created and all the join processing has
> been done.
> 
> If the optimisation is deemed worthwhile, then maybe that would be the
> place to put this code along with a comment that explains how it can
> be improved to support the removal with multiple relations.

Hm, so you're suggesting that this isn't the right place for this
optimization to be implemented, even now, with the single-relation
caveat?

I do agree that including comments about how it could/should be improved
would be good for future developers.

> > Given the cost of a DISTINCT and that we know pretty easily if one
> > exists or not, trying to eliminate it, even in the single-item case,
> > seems pretty worthwhile to me.  If it makes it difficult to improve on
> > this later then I could see pushing back on it, but this patch doesn't
> > seem like its doing that.
> 
> If people truly think it's worthwhile, then maybe it is, but if it's
> being done because it can be done then perhaps it's best not to
> bother.  I was just thinking of the bug reports we'd get when people
> see it does not work with multiple relations. There's already no
> shortage of bug reports that are not bugs. I also wanted to keep this
> patch just doing the simple case that we already apply to GROUP BY,
> but control over the patch may well be out of my hands now.

I'm not really sure I'm following along this part of the discussion.
As long as it isn't making the code particularly ugly or violating the
seperation of various components or making later hacking in this area
particularly worse, then the question primairly comes down to making
sure that the code is of quality and that the performance optimization
is worth the cycles to check for the opportunity and that it's not too
expensive compared to the cost of the operation to implement it, all of
which can be pretty objectively assessed and provide the answer to if
it's worthwhile or not.  Given the cost of a Unique node, and that we
will only check for this optimization when we are going to have to
introduce one, it certainly seems like it's worthwhile (again, proviso
on code quality, et al, above, but I don't think that's what you were
arguing a concern about here).

In short, I don't buy the concern about having to deal with "bug reports
that aren't bug reports" or that they should be a barrier to feature
work.  We already see complaints pretty regularly about optimizations we
don't have and this would at least be a step in the right direction for
dealing with these kinds of queries and that does seem better than doing
nothing, imv.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: Removing useless DISTINCT clauses

2018-08-23 Thread Stephen Frost
Greetings,

* Jim Finnerty (jfinn...@amazon.com) wrote:
> I was thinking about this last night and I realized that putting the new
> hasModifiedDistinct flag on the PlannerInfo struct eliminates the need to
> deal with the serialization issues, and makes it simpler.
> 
> Here's a new patch (v7) that puts the bool on the PlannerInfo struct, and
> adds a couple of tests.
> 
> re: why did you apply the patch on v10?
> 
> I'm developing on a v10.5 codebase at the moment, though this will change
> soon.  If the v7 patch doesn't apply cleanly on later versions, please let
> me know and I'll fix it.

Just going back through this thread to check that I didn't miss
anything, I saw this and felt it deserved a comment- please don't post
feature patches like this against existing released versions and then
ask everyone else to see if it works against current head or not.  We're
all for having new contributors of features, but those contributions
should be against the current HEAD.

> re: if you're proposing the patch for v12, why do you care about catversion?
> 
> Only because it would be a problem to test the patch on 10.5 with a
> catversion change that wouldn't come until v12.  With the v7 patch this
> issue is moot because it no longer requires a catversion change.

This would be an example of concerns that really shouldn't be getting
raised on this list for new features.  The project has a very clear
policy regarding features vs. bug-fixes and I really don't think it's
necessary or desirable to have patches being floated with subsequent
threads here which go directly against that policy.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: Removing useless DISTINCT clauses

2018-08-23 Thread David Rowley
On 24 August 2018 at 11:34, Stephen Frost  wrote:
> Greetings,
>
> * David Rowley (david.row...@2ndquadrant.com) wrote:
>> My personal opinion of only being able to completely remove the
>> DISTINCT when there's a single item in the rtable (or a single base
>> table) is that it's just too poor to bother with.  I think such a
>> solution is best left to another patch and it should be much more
>> complete and be able to track the unique properties through joins.
>
> Doesn't that move the goalposts for this pretty far off?  Is there
> some reason that doing this for the single-item case will make it
> difficult to implement the additional improvements you're suggesting
> later?

The DISTINCT clause removal would likely need to wait until around
where the distinct paths are created and all the join processing has
been done.

If the optimisation is deemed worthwhile, then maybe that would be the
place to put this code along with a comment that explains how it can
be improved to support the removal with multiple relations.

> Given the cost of a DISTINCT and that we know pretty easily if one
> exists or not, trying to eliminate it, even in the single-item case,
> seems pretty worthwhile to me.  If it makes it difficult to improve on
> this later then I could see pushing back on it, but this patch doesn't
> seem like its doing that.

If people truly think it's worthwhile, then maybe it is, but if it's
being done because it can be done then perhaps it's best not to
bother.  I was just thinking of the bug reports we'd get when people
see it does not work with multiple relations. There's already no
shortage of bug reports that are not bugs. I also wanted to keep this
patch just doing the simple case that we already apply to GROUP BY,
but control over the patch may well be out of my hands now.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Removing useless DISTINCT clauses

2018-08-23 Thread Stephen Frost
Greetings,

* David Rowley (david.row...@2ndquadrant.com) wrote:
> My personal opinion of only being able to completely remove the
> DISTINCT when there's a single item in the rtable (or a single base
> table) is that it's just too poor to bother with.  I think such a
> solution is best left to another patch and it should be much more
> complete and be able to track the unique properties through joins.

Doesn't that move the goalposts for this pretty far off?  Is there
some reason that doing this for the single-item case will make it
difficult to implement the additional improvements you're suggesting
later?

Given the cost of a DISTINCT and that we know pretty easily if one
exists or not, trying to eliminate it, even in the single-item case,
seems pretty worthwhile to me.  If it makes it difficult to improve on
this later then I could see pushing back on it, but this patch doesn't
seem like its doing that.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: Removing useless DISTINCT clauses

2018-08-23 Thread David Rowley
On 24 August 2018 at 03:29, Finnerty, Jim  wrote:
> Thank you Álvaro.  Here's the distinct_opt_v7 patch.

Determining if there's just 1 base relation by checking the length the
rtable is not a good way. If you want to see why, try creating a
primary key on a partitioned table.

My personal opinion of only being able to completely remove the
DISTINCT when there's a single item in the rtable (or a single base
table) is that it's just too poor to bother with.  I think such a
solution is best left to another patch and it should be much more
complete and be able to track the unique properties through joins.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Removing useless DISTINCT clauses

2018-08-23 Thread Alvaro Herrera
On 2018-Aug-23, Jim Finnerty wrote:

> distinct_opt_v7.patch
>   

Are you posting this via postgresql-archive.org?  Please don't do that.
These posts of yours are mangled by postgresql-archive and do not
contain the file you're attaching.  Use direct posting by email to
pgsql-hackers@lists.postgresql.org instead.

Thanks

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Removing useless DISTINCT clauses

2018-08-22 Thread David Rowley
On 23 August 2018 at 08:11, Jim Finnerty  wrote:
> Here is an update to this thread, for potential inclusion in v12.  I
> couldn't get the most recent 'v5' patch to apply cleanly, so I recreated a
> v6 patch on PG10.5 by hand, and made a few changes and improvements:

Well, the patch I wrote was never intended for v10. Why did you go to
the effort of doing that when you're proposing the patch for v12? It
seems my v5 patch and Tom's v6 version apply fine to today's master
without any rejected hunks.

> The pg_rewrite catalog contains a serialized representation of the Query
> node in its ev_action column.  If there is a way to recreate the contents of
> the pg_rewrite relation without bumping the catversion, can someone please
> explain how?  If not, then this change is incomplete and would require a new
> catalog version (catversion.h) too.

If you're proposing for v12, why do you care about that? catversion
changes are commonplace during major version development.  Or are you
proposing this gets committed to PostgreSQL 10? ... That's not going
to happen... Plus, it does not seem well aligned with the intent of
this thread to discuss ways to do that either.

> Additional work on this patch would be desirable.  It should check for
> unique + not null, in addition to just the pk constraint.  The DISTINCT
> could be eliminated in cases with multiple relations if all the joins are
> 1:1, although that would arguably belong in a different patch.

Certainly. Patches are welcome, but I think I've mentioned to you
previously that I'm not planning that for this patch.

> p.s. the v6 patch works for the problem case that Tom Lane reported with the
> v5 patch

That would be less ambiguous if there wasn't now two v6 version of the patch.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Removing useless DISTINCT clauses

2018-04-07 Thread David Rowley
On 8 April 2018 at 08:55, Tom Lane  wrote:
> regression=# create table t1 (a int,b int, c int, d int, primary key(a,b));
> CREATE TABLE
> regression=# explain verbose select distinct * from t1 order by a,c,b;
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.

Ouch!

> The reason is we're hitting this assert, a bit further down in
> create_distinct_paths:
>
> /* Assert checks that parser didn't mess up... */
> Assert(pathkeys_contained_in(root->distinct_pathkeys,
>  needed_pathkeys));
>
>
> I don't think that that makes this patch unsalvageable.  The way forward
> probably involves remembering that we removed some distinctClause items
> (so that we know the correspondence to the sortClause no longer holds)
> and then working harder in create_distinct_paths when that's the case.
>
> However, that's not something that's going to happen on the last day
> of the commitfest.  So I'm going to mark this Returned With Feedback
> and encourage you to return to the matter in v12.
>
> In the meantime, attached is the version of the patch that I was about to
> commit before getting cold feet.  It has some cosmetic improvements
> over yours, notably comment additions in allpaths.c.

Thanks a lot for spending time on this.

I'll no doubt have some time over this coming winter to see if it can
be fixed and re-submitted in the PG12 cycle.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Removing useless DISTINCT clauses

2018-04-07 Thread Tom Lane
I wrote:
> My gripe about
> this as it stands is mainly that it seems like it's going to do
> a lot of catalog-lookup work twice over, at least in cases that
> have both DISTINCT and GROUP BY --- or is that too small a set to
> worry about?

I convinced myself that that was a silly objection, and started looking
through the patch with intent to commit.  However, re-reading the thread
re-awakened my concern about whether deleting items from the
distinctClause is actually safe.

You're right that remove_unused_subquery_outputs isn't really in trouble
with this; it might fail to remove some things it could remove, but that
seems OK, and at best deserving of a comment.

However, that doesn't mean we're out of the woods.  See also
preprocess_groupclause, particularly this comment:
 * Note: we need no comparable processing of the distinctClause because
 * the parser already enforced that that matches ORDER BY.

as well as the interesting assumptions in create_distinct_paths about
what it means to compare the lengths of the pathkey lists for these
clauses.  Once I noticed that, I became convinced that this patch actually
breaks planning of sort/unique-based DISTINCT: if we have a pkey a,b,
but the ORDER BY list is a,c,b, then we will sort by a,c,b but the
Unique node will be told it only needs to unique-ify on the first two
sort columns.

That led me to try this test case, and it blew up even better than
I expected:

regression=# create table t1 (a int,b int, c int, d int, primary key(a,b));
CREATE TABLE
regression=# explain verbose select distinct * from t1 order by a,c,b;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

The reason is we're hitting this assert, a bit further down in
create_distinct_paths:

/* Assert checks that parser didn't mess up... */
Assert(pathkeys_contained_in(root->distinct_pathkeys,
 needed_pathkeys));


I don't think that that makes this patch unsalvageable.  The way forward
probably involves remembering that we removed some distinctClause items
(so that we know the correspondence to the sortClause no longer holds)
and then working harder in create_distinct_paths when that's the case.

However, that's not something that's going to happen on the last day
of the commitfest.  So I'm going to mark this Returned With Feedback
and encourage you to return to the matter in v12.

In the meantime, attached is the version of the patch that I was about to
commit before getting cold feet.  It has some cosmetic improvements
over yours, notably comment additions in allpaths.c.

regards, tom lane

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 65a34a2..345ed17 100644
*** a/src/backend/optimizer/path/allpaths.c
--- b/src/backend/optimizer/path/allpaths.c
*** remove_unused_subquery_outputs(Query *su
*** 3306,3311 
--- 3306,3315 
  	/*
  	 * If subquery has regular DISTINCT (not DISTINCT ON), we're wasting our
  	 * time: all its output columns must be used in the distinctClause.
+ 	 * (Note: the latter is not necessarily true anymore, because planner.c
+ 	 * might have found some of the DISTINCT columns to be redundant and
+ 	 * dropped them.  But they'd still have sortgroupref markings, so unless
+ 	 * we improve the heuristic below, we would not recognize them as unused.)
  	 */
  	if (subquery->distinctClause && !subquery->hasDistinctOn)
  		return;
*** remove_unused_subquery_outputs(Query *su
*** 3348,3358 
  
  		/*
  		 * If it has a sortgroupref number, it's used in some sort/group
! 		 * clause so we'd better not remove it.  Also, don't remove any
! 		 * resjunk columns, since their reason for being has nothing to do
! 		 * with anybody reading the subquery's output.  (It's likely that
! 		 * resjunk columns in a sub-SELECT would always have ressortgroupref
! 		 * set, but even if they don't, it seems imprudent to remove them.)
  		 */
  		if (tle->ressortgroupref || tle->resjunk)
  			continue;
--- 3352,3365 
  
  		/*
  		 * If it has a sortgroupref number, it's used in some sort/group
! 		 * clause so we'd better not remove it.  (This is a conservative
! 		 * heuristic, since it might not actually be used by any surviving
! 		 * sort/group clause; but we don't bother to expend the cycles needed
! 		 * for a more accurate test.)  Also, don't remove any resjunk columns,
! 		 * since their reason for being has nothing to do with anybody reading
! 		 * the subquery's output.  (It's likely that resjunk columns in a
! 		 * sub-SELECT would always have ressortgroupref set, but even if they
! 		 * don't, it seems imprudent to remove them.)
  		 */
  		if (tle->ressortgroupref || tle->resjunk)
  			continue;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 008492b..a9ccfed 100644

Re: Removing useless DISTINCT clauses

2018-04-07 Thread Tom Lane
David Rowley  writes:
> It's certainly possible to do more here.  I'm uncertain what needs to
> be done in regards to cached plan invalidation, but we'd certainly
> need to ensure cached plans are invalidated whenever the attnotnull is
> changed.

They already are; any change at all in a pg_class or pg_attribute
row results in a relcache inval for the associated table, cf
CacheInvalidateHeapTuple.  As far as the invalidation machinery is
concerned, changing attnotnull is indistinguishable from changing
atttypid, which I'm sure you'll agree must force plan regeneration.

So this line of thought leads to the conclusion that we were too
conservative in not allowing unique constraints to be used along
with actual pkeys in remove_useless_groupby_columns.  We do have
the additional requirement of checking attnotnull, but I think
we can assume that a plan inval will occur if that changes.

In fact, now that I think about it, I wonder why we're insisting
on tracking the constraint OID either.  Don't index drops force
relcache invals regardless?  If not, how do we ensure that an
indexscan plan relying on that index gets redone?

Part of this probably comes from looking at the parser's
check_ungrouped_columns(), but that is operating under different
constraints, since it is generating what might be a long-lived
parse tree, eg something that gets stored as a view.  We do need
to be able to register an actual dependency on the uniqueness
constraint in that situation.  But plans get tossed on the basis
of relcache invals, so I think they don't need it.

Anyway, that's clearly a future improvement rather than something
I'd insist on this patch tackling immediately.  My gripe about
this as it stands is mainly that it seems like it's going to do
a lot of catalog-lookup work twice over, at least in cases that
have both DISTINCT and GROUP BY --- or is that too small a set to
worry about?

regards, tom lane



Re: Removing useless DISTINCT clauses

2018-04-05 Thread David Rowley
On 6 April 2018 at 03:36, Melanie Plageman  wrote:
> The updated patch looks good to me.
> I've changed the status to "ready for committer"

Great. Thanks for reviewing this!


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Removing useless DISTINCT clauses

2018-04-05 Thread Melanie Plageman
Hi David,
The updated patch looks good to me.
I've changed the status to "ready for committer"
Thanks


Re: Removing useless DISTINCT clauses

2018-03-23 Thread David Rowley
On 24 March 2018 at 01:42, Jim Finnerty  wrote:
> Distinctness can also be preserved across joins, so if you have a 'snowflake
> query' type join, where all the joins are to a unique key, then the
> distinctness of the other side of the join is preserved.  For example, a
> SELECT DISTINCT * FROM fact_table ... that joins from each column in its
> compound primary key to a unique key of another (dimension) table would
> remain distinct, and so you could drop the DISTINCT from the query.

I'm aware. It is something I'm interested in but would require several
orders of magnitude more work than what I've done for this patch. You
may have noticed the other work I did a while back to detect if joins
cause row duplicate or not, so it's certainly something I've thought
about.

If Amazon would like to sponsor work in this area then please see [1].

It certainly would be great to see that happen.

[1] https://wiki.postgresql.org/wiki/How_to_sponsor_a_feature

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Removing useless DISTINCT clauses

2018-03-23 Thread David Rowley
On 24 March 2018 at 05:55, Melanie Plageman  wrote:
> I was just wondering, since get_primary_key_attnos opens pg_constraint and
> just skips attributes with other constraint types than primary, what would
> be the reason we wouldn't just also open pg_attribute and check for the
> non-nullness of the non-primary key unique attributes?
> Couldn't we add a function which opens both pg_attribute and pg_constraint
> to get non-null unique attributes?
> I suppose that would have a performance impact.
> And, I did notice the FIXME in the comment before check_functional_grouping
> which seems to make the argument that there are other use cases for wanting
> the non-nullness represented in pg_constraint.

It's certainly possible to do more here.  I'm uncertain what needs to
be done in regards to cached plan invalidation, but we'd certainly
need to ensure cached plans are invalidated whenever the attnotnull is
changed.

There would be no need to look at the pg_attribute table directly. A
syscache function would just need added named get_attnotnull() which
would be akin to something like get_atttypmod().

I'm personally not planning on doing anything in that regard for this
patch. We have 1 week to go before PG11 feature freeze. My goals for
this patch was to apply the same optimization to DISTINCT as I did for
GROUP BY a while back, which is exactly what the patch does. I agree
that more would be nice, but unfortunately, time is finite.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Removing useless DISTINCT clauses

2018-03-23 Thread David Rowley
On 22 March 2018 at 18:58, David Rowley  wrote:
> On 21 March 2018 at 16:29, Melanie Plageman  wrote:
>> The new status of this patch is: Waiting on Author
>
> Thanks for reviewing this.  I've attached an updated patch. I'll set
> back to waiting for review again.

I've attached another version of the patch. The only change is a
thinko in a comment which claimed it may be possible to do something
with DISTINCT ON clauses, but on further thought, I don't think that's
possible, so I've updated the comment to reflect my new thoughts.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


remove_useless_distinct_clauses_v5.patch
Description: Binary data


Re: Removing useless DISTINCT clauses

2018-03-23 Thread Melanie Plageman
On Thu, Mar 22, 2018 at 5:20 PM, David Rowley 
wrote:

> The problem is that in order to properly invalidate cached plans we
> must record the constraint OIDs which the plan depends on.  We can do
> that for PK and UNIQUE constraints, unfortunately, we can't do it for
> NOT NULL constraints as, at the moment, these are just properties of
> pg_attribute.  For this to be made to work we'd need to store those in
> pg_constraint too, which is more work that I'm going to do for this
> patch.
>

I was just wondering, since get_primary_key_attnos opens pg_constraint and
just skips attributes with other constraint types than primary, what would
be the reason we wouldn't just also open pg_attribute and check for the
non-nullness of the non-primary key unique attributes?
Couldn't we add a function which opens both pg_attribute and pg_constraint
to get non-null unique attributes?
I suppose that would have a performance impact.
And, I did notice the FIXME in the comment before check_functional_grouping
which seems to make the argument that there are other use cases for wanting
the non-nullness represented in pg_constraint.

 * Currently we only check to see if the rel has a primary key that is a
 * subset of the grouping_columns.  We could also use plain unique
constraints
 * if all their columns are known not null, but there's a problem: we need
 * to be able to represent the not-null-ness as part of the constraints
added
 * to *constraintDeps.  FIXME whenever not-null constraints get represented
 * in pg_constraint.
 */
-- 
Melanie Plageman


Re: Removing useless DISTINCT clauses

2018-03-23 Thread Laurenz Albe
David Rowley wrote:
> > Would it be very difficult to extend that to "if any unique constraints are
> > contained in the DISTINCT clause"?
> 
> Unfortunately, it's quite a bit more work to make that happen. It's
> not just unique constraints that are required to make this work. We
> also need to pay attention to NOT NULL constraints too, as we're
> unable to prove function dependency when the keys have NULLs, as there
> may be any number of rows containing NULL values.
> 
> The problem is that in order to properly invalidate cached plans we
> must record the constraint OIDs which the plan depends on.  We can do
> that for PK and UNIQUE constraints, unfortunately, we can't do it for
> NOT NULL constraints as, at the moment, these are just properties of
> pg_attribute.  For this to be made to work we'd need to store those in
> pg_constraint too, which is more work that I'm going to do for this
> patch.

That makes sense; thanks for explaining.

Yours,
Laurenz Albe



Re: Removing useless DISTINCT clauses

2018-03-22 Thread David Rowley
On 22 March 2018 at 21:16, Laurenz Albe  wrote:
> Would it be very difficult to extend that to "if any unique constraints are
> contained in the DISTINCT clause"?

Unfortunately, it's quite a bit more work to make that happen. It's
not just unique constraints that are required to make this work. We
also need to pay attention to NOT NULL constraints too, as we're
unable to prove function dependency when the keys have NULLs, as there
may be any number of rows containing NULL values.

The problem is that in order to properly invalidate cached plans we
must record the constraint OIDs which the plan depends on.  We can do
that for PK and UNIQUE constraints, unfortunately, we can't do it for
NOT NULL constraints as, at the moment, these are just properties of
pg_attribute.  For this to be made to work we'd need to store those in
pg_constraint too, which is more work that I'm going to do for this
patch.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Removing useless DISTINCT clauses

2018-03-22 Thread Laurenz Albe
Melanie Plageman wrote:
> Contents & Purpose
> ==
> 
> This patch removes any additional columns in the DISTINCT clause when the
> table's primary key columns are entirely present in the DISTINCT clause. This
> optimization works because the PK columns functionally determine the other
> columns in the relation. The patch includes regression test cases.

Would it be very difficult to extend that to "if any unique constraints are
contained in the DISTINCT clause"?

Yours,
Laurenz Albe



Re: Removing useless DISTINCT clauses

2018-03-21 Thread David Rowley
On 21 March 2018 at 16:29, Melanie Plageman  wrote:
> For a small performance hit but an improvement in readability, the length 
> check
> can be moved from the individual group by and distinct clause checks into the
> helper function
>
> if (list_length(parse->distinctClause) < 2)
>   return;
>
> and
>
> if (list_length(parse->groupClause) < 2)
>   return;
>
> can be moved into `remove_functionally_dependent_clauses`

I remember thinking about this when writing, and I think I ended up
doing the check earlier as I thought that having 1 GROUP BY clause
item would be the most common case, so it seems like a good idea to
abort as early as possible in that case. That's most likely not the
case for DISTINCT.

I imagine it probably does not make that much difference anyway, so
I've moved it into the remove_functionally_dependent_clauses()
function, as mentioned.

> The main helper function that is added `remove_functionally_dependent_clauses`
> uses one style of comment--with the name of the function and then the rest of
> the description indented which is found some places in the code,
> /*
>  * remove_functionally_dependent_clauses
>  *  Processes clauselist and removes any items which are deemed 
> to be
>  *  functionally dependent on other clauselist items.
>  *
>  * If any item from the list can be removed, then a new list is built which
>  * does not contain the removed items.  If no item can be removed then the
>  * original list is returned.
>  */
>
> while other helper functions in the same file use a different style--all lines
> flush to the side with a space. I'm not sure which is preferred

You might have a point, but remove_useless_groupby_columns already
does it this way, so I don't think changing that is a good idea.

> The new status of this patch is: Waiting on Author

Thanks for reviewing this.  I've attached an updated patch. I'll set
back to waiting for review again.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


remove_useless_distinct_clauses_v4.patch
Description: Binary data


Re: Removing useless DISTINCT clauses

2018-03-20 Thread Melanie Plageman
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

This is a review of the patch to remove useless DISTINCT columns from the 
DISTINCT clause
https://www.postgresql.org/message-id/CAKJS1f8UMJ137sRuVSnEMDDpa57Q71JuLZi4yLCFMekNYVYqaQ%40mail.gmail.com


Contents & Purpose
==

This patch removes any additional columns in the DISTINCT clause when the
table's primary key columns are entirely present in the DISTINCT clause. This
optimization works because the PK columns functionally determine the other
columns in the relation. The patch includes regression test cases.

Initial Run
===
The patch applies cleanly to HEAD. All tests which are run as part of the
`installcheck` target pass correctly (all existing tests pass, all new tests
pass with the patch applied and fail without it). All TAP tests pass. All tests
which are run as part of the `installcheck-world` target pass except one of the
Bloom contrib module tests in `contrib/bloom/sql/bloom.sql`,
 `CREATE INDEX bloomidx ON tst USING bloom (i, t) WITH (col1 = 3);`
which is currently also failing (crashing) for me on master, so it is unrelated 
to the
patch. The test cases seem to cover the new behavior.

Functionality
=
Given that this optimization is safe for the GROUP BY clause (you can remove
functionally dependent attributes from the clause there as well), the logic
does seem to follow for DISTINCT. It seems semantically correct to do this. As
for the implementation, the author seems to have reasonably addressed concerns
expressed over the distinction between DISTINCT and DISTINCT ON. As far as I
can see, this should be fine.

Code Review
===
For a small performance hit but an improvement in readability, the length check
can be moved from the individual group by and distinct clause checks into the
helper function

if (list_length(parse->distinctClause) < 2)
  return;

and

if (list_length(parse->groupClause) < 2)
  return;

can be moved into `remove_functionally_dependent_clauses`


Comments/Documentation
===

The main helper function that is added `remove_functionally_dependent_clauses`
uses one style of comment--with the name of the function and then the rest of
the description indented which is found some places in the code, 
/*
 * remove_functionally_dependent_clauses
 *  Processes clauselist and removes any items which are deemed to 
be
 *  functionally dependent on other clauselist items.
 *
 * If any item from the list can be removed, then a new list is built which
 * does not contain the removed items.  If no item can be removed then the
 * original list is returned.
 */

while other helper functions in the same file use a different style--all lines
flush to the side with a space. I'm not sure which is preferred

/*
 * limit_needed - do we actually need a Limit plan node?
 *
 * If we have constant-zero OFFSET and constant-null LIMIT, we can skip adding
 * a Limit node.  This is worth checking for because "OFFSET 0" is a common
 * locution for an optimization fence.  (Because other places in the planner
 ...

it looks like the non-indented ones are from older commits (2013 vs 2016).

The list length change is totally optional, but I'll leave it as Waiting On 
Author in case the author wants to make that change.

Overall, LGTM.

The new status of this patch is: Waiting on Author