Re: [HACKERS] Discussion on missing optimizations

2017-10-22 Thread Andreas Seltenreich
Alvaro Herrera writes:

> Andres Freund wrote:
>> Unfortunately it won't help in this specific case (no support for UNION,
>> just UNION ALL), but I thought it might be interesting to reference
>> https://medium.com/@uwdb/introducing-cosette-527898504bd6
>> here.
>
> Interesting.  I thought about a completely different approach -- use a
> fuzzer that runs each generated query on two servers, one patched, one
> not, and compares the results.  Would it be possible to tweak sqlsmith
> to do this?

I think the tweaking needed would be:

1. Log successful queries as well, along with a result description.
   Maybe logging returned/affected rows is sufficient.

2. Make it avoid nondeterministic things such as joining
   pg_stat_activity or calling non-immutable functions.

The second one is a bit harder, and I can't think of a more elegant
solution than adding a blacklisting/whitelisting feature and letting
the user do the hard work…

If these are solved, though, one could make multiple runs with the same
random seed and query the logging database for differences in the
result descriptions.
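
For illustration, assuming a hypothetical logging table
run_log(run, seed, query_id, result_rows), the comparison could then be
a simple self-join (all names here are made up):

    -- find queries whose result description differs between two runs
    SELECT p.query_id,
           p.result_rows AS patched,
           u.result_rows AS unpatched
    FROM run_log p
    JOIN run_log u USING (seed, query_id)
    WHERE p.run = 'patched'
      AND u.run = 'unpatched'
      AND p.result_rows IS DISTINCT FROM u.result_rows;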

regards,
Andreas




Re: [HACKERS] Discussion on missing optimizations

2017-10-13 Thread Andres Freund
Hi,

On 2017-10-14 10:38:13 +1300, David Rowley wrote:
> On 12 October 2017 at 04:50, Robert Haas  wrote:
> > We haven't really done a very good job figuring out what to do about
> > optimizations that some people (mostly you) think are marginal.  It's
> > obviously true that we can't just burn infinite planner cycles on
> > things that don't usually help, but at the same time, we can't just
> > keep ignoring requests by users for the same features and saying
> > "you'll be sad if we give this to you".  Those people don't give up on
> > wanting the optimization; they just don't use PostgreSQL.  I think we
> > need to stop just saying "no" and start thinking about what we could
> > do that would let us say "yes".

+1 to the general sentiment.


> I'm with Robert on this.  Planning time *is* important, but if we were
> to do a quick tally on the number of patches that we've seen in the
> last few years to improve execution time by either improving the
> executor code or adding some smarts to the planner to reduce executor
> work vs patches aimed to reduce planning time, I think we'd find the
> general focus is on the executor.

That's true. But I think that's partially because people benchmarking
with the goal of identifying bottlenecks just habitually use prepared
statements to remove planning overhead.

That works well enough in benchmarking scenarios, but in practice I
think that's less clearly a good tradeoff - e.g. in OLTP workloads
you'll often see custom plans defeating the intent of using prepared
statements.
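
To illustrate (a minimal sketch; the table is hypothetical): the first
several executions of a prepared statement are still planned with the
actual parameter values (custom plans), so the planning overhead
doesn't simply disappear:

    PREPARE get_orders(int) AS
        SELECT * FROM orders WHERE customer_id = $1;
    -- early executions get custom plans built around the constant;
    -- only later may the planner settle on a cached generic plan
    EXECUTE get_orders(42);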


> Personally, I don't recall a single patch aimed to just speed up the
> planner.

There were some a couple years back.


> It looks like there's plenty we could do in there, just nobody seems
> interested enough to go and do it, everyone who cares about
> performance is too busy trying to make execution run faster.

I think that's largely because our executor speed is quite bad, to some
degree because we paid much less attention to speed degradations there
for a couple of years, absurdly enough.

Greetings,

Andres Freund




Re: [HACKERS] Discussion on missing optimizations

2017-10-13 Thread David Rowley
On 12 October 2017 at 04:50, Robert Haas  wrote:
> We haven't really done a very good job figuring out what to do about
> optimizations that some people (mostly you) think are marginal.  It's
> obviously true that we can't just burn infinite planner cycles on
> things that don't usually help, but at the same time, we can't just
> keep ignoring requests by users for the same features and saying
> "you'll be sad if we give this to you".  Those people don't give up on
> wanting the optimization; they just don't use PostgreSQL.  I think we
> need to stop just saying "no" and start thinking about what we could
> do that would let us say "yes".

I'm with Robert on this.  Planning time *is* important, but if we were
to do a quick tally on the number of patches that we've seen in the
last few years to improve execution time by either improving the
executor code or adding some smarts to the planner to reduce executor
work vs patches aimed to reduce planning time, I think we'd find the
general focus is on the executor. Personally, I don't recall a single
patch aimed to just speed up the planner. Perhaps I missed some? It
looks like there's plenty we could do in there, just nobody seems
interested enough to go and do it, everyone who cares about
performance is too busy trying to make execution run faster.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Discussion on missing optimizations

2017-10-13 Thread Adam Brusselback
So from following this discussion and others focused on making the
planner "smarter", there is always an argument to be had over wasting
planner cycles, and it's always a hard-fought battle to get any changes made.

Now, I'm speaking without any knowledge of the Postgres internals, so
please bear with me.  It seems like the underlying "feature" that would
allow more edge-case optimizations into the planner would be some
framework for knowing when to consider using those optimizations, and
when not to waste the time and instead skip expensive planning
optimizations.

It seems that with a system like that, Postgres could still maintain
(or even improve) its hard-fought quick planning time for OLTP queries,
while being able to spend more time planning OLAP queries, where
spending an extra 500ms planning may save minutes of execution time.

I know that in my database I have analytical queries which I wouldn't
mind spending multiple seconds planning if necessary to produce an
optimal plan, because execution time dominates everything.

Now, what that system looks like is something I have no real opinion
or authority on.  It seems like there are three options though:
automatic, manual, and automatic with manual override.  Two of those
(manual, and automatic with manual override) seem a little too close to
query hints to be considered, judging from all previous discussions
I've heard (correct me if I'm wrong).  So that leaves automatic as the
only option I can see being considered viable.

Now comes the question of whether it's possible to automatically
classify queries in such a way that we can cheaply know which
optimizations we should attempt, and which should not even be
considered because the planning time would dominate.


I'll leave my thoughts at that, and await comments from the smarter
people in the room who know exactly why this idea wouldn't work.  The
general thought behind it is to make it easier for more query
optimizations to make it into the planner, while not hurting queries
which shouldn't spend their time on them.

If there are better ways of going about it, great!

Thanks for listening to me ramble,
-Adam




Re: [HACKERS] Discussion on missing optimizations

2017-10-13 Thread Laurenz Albe
Stephen Frost wrote:
> * Laurenz Albe (laurenz.a...@cybertec.at) wrote:
> > Robert Haas wrote:
> > > One trick that some systems use is to avoid replanning as much as we do
> > > by, for example, saving plans in a shared cache and reusing them even
> > > in other sessions.  That's hard to do in our architecture because the
> > > controlling GUCs can be different in every session and there's not
> > > even any explicit labeling of which GUCs control planner behavior. But
> > > if you had it, then extra planning cycles would be, perhaps, more
> > > tolerable.
> >
> > From my experience with Oracle I would say that that is a can of worms.
> > 
> > Perhaps it really brings the performance benefits they claim, but
> > a) there have been a number of bugs where the wrong plan got used
> >(you have to keep several plans for the same statement around,
> >since - as you say - different sessions have different environments)
> 
> I'm not sure this is really a fair strike against this concept; bugs
> happen, even bugs in planning, and what we need is more testing, imv, to
> reduce the number and minimize the risk as much as we can.

Right, I guess I didn't express my concern properly.
Bugs can certainly happen, but if a certain feature is particularly
rich in them, I take it as an indication that it is something difficult
to get right.

> > b) it is a frequent problem that this shared memory area grows
> >too large if the application does not use prepared statements
> >but dynamic SQL with varying constants.
> 
> This just requires that the memory area be managed somehow, not unlike
> how our shared buffers have to deal with evicting old pages.
> There's a challenge there around making sure that it doesn't make the
> performance of the system be much worse than not having a cache at all,
> naturally, but given that a lot of people use pg_stat_statements to good
> effect and without much in the way of complaints, it seems like it might
> be possible to make it work reasonably (just imagining a generic plan being
> attached to pg_stat_statements with some information about if the
> generic plan works well or not, blah blah, hand waving goes here).

pg_stat_statements is quite different, since it ignores constants.

I don't know how often I have seen the advice to use dynamic SQL
because a generic plan was not so great, and in OLAP you usually
want each individual query to be planned.

So I think it is important that a plan is only reused if it matches
the query exactly.  Oracle has an option CURSOR_SHARING = FORCE to
reuse plans even if they were planned with different constants, but
their own documentation warns against using it:
http://docs.oracle.com/database/122/TGSQL/improving-rwp-cursor-sharing.htm#GUID-7FF4E133-06A7-401E-9BFC-3B0B9C902346

Maybe it is an idea to only cache generic plans in a shared area?

Yours,
Laurenz Albe





Re: [HACKERS] Discussion on missing optimizations

2017-10-12 Thread Stephen Frost
Laurenz,

* Laurenz Albe (laurenz.a...@cybertec.at) wrote:
> Robert Haas wrote:
> > One trick that some systems use is to avoid replanning as much as we do
> > by, for example, saving plans in a shared cache and reusing them even
> > in other sessions.  That's hard to do in our architecture because the
> > controlling GUCs can be different in every session and there's not
> > even any explicit labeling of which GUCs control planner behavior. But
> > if you had it, then extra planning cycles would be, perhaps, more
> > tolerable.
> 
> From my experience with Oracle I would say that that is a can of worms.
> 
> Perhaps it really brings the performance benefits they claim, but
> a) there have been a number of bugs where the wrong plan got used
>(you have to keep several plans for the same statement around,
>since - as you say - different sessions have different environments)

I'm not sure this is really a fair strike against this concept; bugs
happen, even bugs in planning, and what we need is more testing, imv, to
reduce the number and minimize the risk as much as we can.

> b) it is a frequent problem that this shared memory area grows
>too large if the application does not use prepared statements
>but dynamic SQL with varying constants.

This just requires that the memory area be managed somehow, not unlike
how our shared buffers have to deal with evicting old pages.
There's a challenge there around making sure that it doesn't make the
performance of the system be much worse than not having a cache at all,
naturally, but given that a lot of people use pg_stat_statements to good
effect and without much in the way of complaints, it seems like it might
be possible to make it work reasonably (just imagining a generic plan being
attached to pg_stat_statements with some information about if the
generic plan works well or not, blah blah, hand waving goes here).

Thanks!

Stephen




Re: [HACKERS] Discussion on missing optimizations

2017-10-12 Thread David Rowley
On 13 October 2017 at 03:00, Tom Lane  wrote:
> Laurenz Albe  writes:
>> Robert Haas wrote:
>>> One trick that some systems use is to avoid replanning as much as we do
>>> by, for example, saving plans in a shared cache and reusing them even
>>> in other sessions.  That's hard to do in our architecture because the
>>> controlling GUCs can be different in every session and there's not
>>> even any explicit labeling of which GUCs control planner behavior. But
>>> if you had it, then extra planning cycles would be, perhaps, more
>>> tolerable.
>
>> From my experience with Oracle I would say that that is a can of worms.
>
> Yeah, I'm pretty suspicious of the idea too.  We've had an awful lot of
> bad experience with local plan caching, to the point where people wonder
> why we don't just auto-replan every time.  How would a shared cache
> make that better?  (Even assuming it was otherwise free, which it
> surely won't be.)

One idea I had when working on unique joins was that we could use it
to determine if a plan could only return 0 or 1 rows, by additionally
testing the baserestrictinfo of each rel to see if an equality OpExpr
with a Const matches a unique constraint, which would serve as a
guarantee that only 0 or 1 rows would match. I thought that these
single-row plans could be useful as a start for some pro-active plan
cache. The plans here should be stable, as they're not prone to any
data skew that could cause a plan change. You might think that very
few plans would fit the bill, but you'd likely find that the majority
of UPDATE and DELETE queries run in an OLTP application would be
eligible, and likely some decent percentage of SELECTs too.
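
For example (a sketch; table and column names are hypothetical), a
query like the following provably touches at most one row:

    -- orders.order_id is the primary key: at most one row can match,
    -- so the plan is immune to data skew and could be cached pro-actively
    UPDATE orders SET status = 'shipped' WHERE order_id = 42;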

I had imagined this would be some backend local cache that saved MRU
plans up to some size of memory defined by a GUC, where 0 would
disable the feature. I never got any further than those thoughts.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Discussion on missing optimizations

2017-10-12 Thread Robert Haas
On Thu, Oct 12, 2017 at 10:00 AM, Tom Lane  wrote:
>> From my experience with Oracle I would say that that is a can of worms.
>
> Yeah, I'm pretty suspicious of the idea too.  We've had an awful lot of
> bad experience with local plan caching, to the point where people wonder
> why we don't just auto-replan every time.  How would a shared cache
> make that better?  (Even assuming it was otherwise free, which it
> surely won't be.)

Obviously it wouldn't.  But it might make other things better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Discussion on missing optimizations

2017-10-12 Thread Tom Lane
Laurenz Albe  writes:
> Robert Haas wrote:
>> One trick that some systems use is to avoid replanning as much as we do
>> by, for example, saving plans in a shared cache and reusing them even
>> in other sessions.  That's hard to do in our architecture because the
>> controlling GUCs can be different in every session and there's not
>> even any explicit labeling of which GUCs control planner behavior. But
>> if you had it, then extra planning cycles would be, perhaps, more
>> tolerable.

> From my experience with Oracle I would say that that is a can of worms.

Yeah, I'm pretty suspicious of the idea too.  We've had an awful lot of
bad experience with local plan caching, to the point where people wonder
why we don't just auto-replan every time.  How would a shared cache
make that better?  (Even assuming it was otherwise free, which it
surely won't be.)

regards, tom lane




Re: [HACKERS] Discussion on missing optimizations

2017-10-12 Thread Alvaro Herrera
Andres Freund wrote:
> On 2017-10-08 11:28:09 -0400, Tom Lane wrote:
> > Adam Brusselback  writes:
> > > On another note:
> > >> turning ORs into UNIONs
> > 
> > > This is another one which would be incredibly useful for me.  I've had
> > > to do this manually for performance reasons far too often.
> > 
> > Well, maybe you could sign up to help review the open patch for that then:
> > https://commitfest.postgresql.org/15/1001/
> > 
> > The reason that's not in v10 is we haven't been able to convince
> > ourselves whether it's 100% correct.
> 
> Unfortunately it won't help in this specific case (no support for UNION,
> just UNION ALL), but I thought it might be interesting to reference
> https://medium.com/@uwdb/introducing-cosette-527898504bd6
> here.

Interesting.  I thought about a completely different approach -- use a
fuzzer that runs each generated query on two servers, one patched, one
not, and compares the results.  Would it be possible to tweak sqlsmith
to do this?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Discussion on missing optimizations

2017-10-12 Thread Laurenz Albe
Robert Haas wrote:
> One trick that some systems use is to avoid replanning as much as we do
> by, for example, saving plans in a shared cache and reusing them even
> in other sessions.  That's hard to do in our architecture because the
> controlling GUCs can be different in every session and there's not
> even any explicit labeling of which GUCs control planner behavior. But
> if you had it, then extra planning cycles would be, perhaps, more
> tolerable.

From my experience with Oracle I would say that that is a can of worms.

Perhaps it really brings the performance benefits they claim, but
a) there have been a number of bugs where the wrong plan got used
   (you have to keep several plans for the same statement around,
   since - as you say - different sessions have different environments)
b) it is a frequent problem that this shared memory area grows
   too large if the application does not use prepared statements
   but dynamic SQL with varying constants.

Yours,
Laurenz Albe




Re: [HACKERS] Discussion on missing optimizations

2017-10-11 Thread Robert Haas
On Fri, Oct 6, 2017 at 10:19 PM, Tom Lane  wrote:
>> 9. Unneeded Self JOIN
>
>> Can't remember discussions of this.
>
> I can't get very excited about that one either.

My memories of being a PostgreSQL user rather than a developer are
getting a bit distant by now, but I definitely remember hitting this
problem, especially in cases involving DELETE ... USING and UPDATE ... FROM.
You can't specify a left join between the result relation and some
other table directly, so you end up having to do DELETE FROM x USING x
LEFT JOIN y ... which then has lousy performance.
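
For example (a sketch with hypothetical tables), deleting rows with no
match in another table ends up with the target table joined to itself:

    -- the extra self join on orders exists only because the result
    -- relation can't be left-joined to customers directly
    DELETE FROM orders o
    USING orders o2
          LEFT JOIN customers c ON o2.customer_id = c.customer_id
    WHERE o.order_id = o2.order_id
      AND c.customer_id IS NULL;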

I think you could hit it in other cases, too, e.g. a join between two
views which share a common base table.  Of course maybe you wouldn't
have two views in the first place if join removal worked better than
it currently does, but without, at least, inner join removal you can't
really rely on the query planner to prune things down to what's really
needed.
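
A sketch of that situation (hypothetical schema): two views sharing a
base table, joined on its primary key, which expands into a self join
that could in principle be pruned:

    CREATE VIEW active_orders AS
        SELECT order_id, status FROM orders WHERE status = 'active';
    CREATE VIEW order_notes AS
        SELECT order_id, note FROM orders;
    -- after view expansion this is orders joined to itself on its
    -- primary key; unneeded-self-join removal could drop one copy
    SELECT a.order_id, n.note
    FROM active_orders a
    JOIN order_notes n ON a.order_id = n.order_id;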

> In the end, what the article fails to consider is that all of these are
> tradeoffs, not unalloyed goods.  If you spend planner cycles on every
> query to look for cases that only the most unabashedly brain-dead ORMs
> ever generate, you're not really doing your users a favor on balance.

I think what you're failing to consider is that the author of the
article is a user.  When he says these optimizations are cool, he
means that they would benefit his use cases (and presumably that he's
willing to pay some number of additional cycles to get them).

We haven't really done a very good job figuring out what to do about
optimizations that some people (mostly you) think are marginal.  It's
obviously true that we can't just burn infinite planner cycles on
things that don't usually help, but at the same time, we can't just
keep ignoring requests by users for the same features and saying
"you'll be sad if we give this to you".  Those people don't give up on
wanting the optimization; they just don't use PostgreSQL.  I think we
need to stop just saying "no" and start thinking about what we could
do that would let us say "yes".

One trick that some systems use is to avoid replanning as much as we do
by, for example, saving plans in a shared cache and reusing them even
in other sessions.  That's hard to do in our architecture because the
controlling GUCs can be different in every session and there's not
even any explicit labeling of which GUCs control planner behavior. But
if you had it, then extra planning cycles would be, perhaps, more
tolerable.

Another trick is to plan harder when the cost or complexity of the
query exceeds some threshold, but that's hard to do because we don't
really know what the cost or complexity is until after we get done
planning, by which time it's too late to (cheaply) go back and apply
those optimizations.  This problem of not really knowing whether we've
got an expensive query is also a problem in terms of being smart about
how many parallel workers to pick or whether to consider parallelism
at all; we'd like to consider more options for expensive queries than
cheap ones.  But there's not really any good way to do this as things
stand today.

Maybe one way to attack this problem is to make it more general:
what's the next big thing for the planner?  It wasn't altogether clear
to me that the path->plan conversion was going to be a huge
improvement, but it definitely has been.  Nothing really got much
faster as a direct result of that work, but it made the code simpler
and more understandable and thus easier to enhance, unblocking
improvements like postgres_fdw aggregate pushdown.  I think we're a
long way from reaching the end of the benefits of that commit - there
is a lot more that can usefully be done.  But, what's next?  You told
me a few years back when I was thinking about what to do next that
there were still a lot of gains to be had from improving the planner,
and I wasn't sure I quite believed it.  But it's becoming more clear
to me that this is true, and that the kinds of things this article is
talking about are one approach.  I don't really have the answers here,
but I don't think our planner is anywhere close to as good as it can
be, even if in certain respects it is stuck at some kind of local
maximum.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Discussion on missing optimizations

2017-10-09 Thread Ants Aasma
On Sun, Oct 8, 2017 at 7:24 PM, Tom Lane  wrote:
> Petr Jelinek  writes:
>> Okay, that makes sense, thanks for the explanation. Your patch is the
>> way to go then.
>
> Hearing no further comment, pushed.  Thanks for reviewing it.

The tautological col = col comparison is occasionally used as a
planner "hint" to correct for row-count overestimates. Not a great
solution, but PostgreSQL doesn't really have a better way to guide the
planner. Those queries will now have to do something else, like col =
col + 0, which still works.
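
For illustration (hypothetical table and column; the workaround relies
on the planner not recognizing the expression, which is not a
documented interface):

    -- before: tautological equality used to damp a row-count overestimate
    SELECT * FROM t WHERE col = col;
    -- after this change, col = col becomes col IS NOT NULL, so the
    -- "hint" needs an opaque variant such as:
    SELECT * FROM t WHERE col = col + 0;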

Regards,
Ants Aasma




Re: [HACKERS] Discussion on missing optimizations

2017-10-08 Thread David Rowley
On 9 October 2017 at 17:41, David Rowley  wrote:
> Thoughts?

Actually, I was a little inconsistent with my List NULL/NIL checks in
that last one.

I've attached an updated patch.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


remove_left_join_distinct_v2.patch
Description: Binary data



Re: [HACKERS] Discussion on missing optimizations

2017-10-08 Thread David Rowley
On 7 October 2017 at 14:48, Andres Freund  wrote:
> 3. JOIN Elimination
>
> There's been a lot of discussion and several patches. There's a bunch of
> problems here, one being that there's cases (during trigger firing,
> before the constraint checks) where foreign keys don't hold true, so we
> can't quite generally make these optimizations.  Another concern is
> whether the added plan time is actually worthwhile.

I looked over this and it seems there's some pretty low-hanging fruit
in there that we can add with just a handful of lines of new code.

This is the case for LEFT JOINs with a DISTINCT clause. Normally we
can only remove an unused LEFT JOINed relation if there are some
unique properties that ensure that the join does not duplicate any
outer row. We don't need to worry about this when there's a DISTINCT
clause, as the DISTINCT would remove any duplicates anyway. If I'm not
mistaken, we also don't need to bother looking at the actual distinct
clause's exprs since we'll already know that nothing is in there
regarding the relation being removed. The benefit to this could be
two-fold, as 1) we don't need to join to the unused relation and 2) we
don't need to remove any row duplication caused by the join.
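
A sketch of the case in question (hypothetical tables; b contributes
nothing to the target list):

    -- without DISTINCT the join must stay, since b could duplicate rows
    -- of a; with DISTINCT the duplicates would be removed anyway, so
    -- the LEFT JOIN to b can be dropped even without a unique index
    SELECT DISTINCT a.col
    FROM a
    LEFT JOIN b ON a.id = b.a_id;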

While someone might argue that this is not going to be that common a
case to hit, if we consider how cheap this is to implement, it does
seem to be worth doing a couple of NULL checks in the planner for it.

The only slight downside I can see to this is that we're letting a few
more cases through rel_supports_distinctness(), which is also used for
unique joins, and these proofs are not valid there. However, it may not
be worth the trouble doing anything about that, as relations without
unique indexes are pretty uncommon (at least in my world).

Thoughts?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


remove_left_join_distinct.patch
Description: Binary data



Re: [HACKERS] Discussion on missing optimizations

2017-10-08 Thread Andres Freund
On 2017-10-08 17:11:44 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2017-10-08 11:28:09 -0400, Tom Lane wrote:
> >> https://commitfest.postgresql.org/15/1001/
> >> The reason that's not in v10 is we haven't been able to convince
> >> ourselves whether it's 100% correct.
> 
> > Unfortunately it won't help in this specific case (no support for UNION,
> > just UNION ALL), but I thought it might be interesting to reference
> > https://medium.com/@uwdb/introducing-cosette-527898504bd6
> > here.
> 
> Huh, that is an interesting project indeed.  Although I'm not sure that
> it quite addresses the question of whether an optimization transform
> is valid.  IIUC, it could prove that a particular query having been fed
> through the transform didn't change semantics, but that offers only
> limited insight into whether some other query fed through the transform
> might change.

According to the guide it offers some support for more general
transformations:
http://cosette.cs.washington.edu/guide#24-symbolic-predicates
That's still only going to be sufficient for some of the interesting
cases, but still...

I wonder about pinging them about the OR -> UNION case; they've been
responsive to problems in some threads I found online.

Greetings,

Andres Freund




Re: [HACKERS] Discussion on missing optimizations

2017-10-08 Thread Tom Lane
Andres Freund  writes:
> On 2017-10-08 11:28:09 -0400, Tom Lane wrote:
>> https://commitfest.postgresql.org/15/1001/
>> The reason that's not in v10 is we haven't been able to convince
>> ourselves whether it's 100% correct.

> Unfortunately it won't help in this specific case (no support for UNION,
> just UNION ALL), but I thought it might be interesting to reference
> https://medium.com/@uwdb/introducing-cosette-527898504bd6
> here.

Huh, that is an interesting project indeed.  Although I'm not sure that
it quite addresses the question of whether an optimization transform
is valid.  IIUC, it could prove that a particular query having been fed
through the transform didn't change semantics, but that offers only
limited insight into whether some other query fed through the transform
might change.

regards, tom lane




Re: [HACKERS] Discussion on missing optimizations

2017-10-08 Thread Tom Lane
Petr Jelinek  writes:
> Okay, that makes sense, thanks for the explanation. Your patch is the
> way to go then.

Hearing no further comment, pushed.  Thanks for reviewing it.

regards, tom lane




Re: [HACKERS] Discussion on missing optimizations

2017-10-08 Thread Andres Freund
On 2017-10-08 11:28:09 -0400, Tom Lane wrote:
> Adam Brusselback  writes:
> > On another note:
> >> turning ORs into UNIONs
> 
> > This is another one which would be incredibly useful for me.  I've had
> > to do this manually for performance reasons far too often.
> 
> Well, maybe you could sign up to help review the open patch for that then:
> https://commitfest.postgresql.org/15/1001/
> 
> The reason that's not in v10 is we haven't been able to convince
> ourselves whether it's 100% correct.

Unfortunately it won't help in this specific case (no support for UNION,
just UNION ALL), but I thought it might be interesting to reference
https://medium.com/@uwdb/introducing-cosette-527898504bd6
here.

Greetings,

Andres Freund




Re: [HACKERS] Discussion on missing optimizations

2017-10-08 Thread Tom Lane
Adam Brusselback  writes:
> On another note:
>> turning ORs into UNIONs

> This is another one which would be incredibly useful for me.  I've had
> to do this manually for performance reasons far too often.

Well, maybe you could sign up to help review the open patch for that then:
https://commitfest.postgresql.org/15/1001/

The reason that's not in v10 is we haven't been able to convince
ourselves whether it's 100% correct.

regards, tom lane




Re: [HACKERS] Discussion on missing optimizations

2017-10-08 Thread Adam Brusselback
> I can't get very excited about this one either, though I do believe it
> can arise as the author says, "when you build complex views and JOIN
> them to each other".  Maybe I'm not excited about it because I've not
> needed it :)

This is one that I know would help with my database.  There is a ton
of logic stored in views, which get joined to the original table to
filter the set, rather than imposing that set of conditions in every
separate query.

It would be really nice if the optimizer could simplify those to
eliminate the self join.  The join almost always happens on the primary
key of a table, and if not, it'd be on a non-nullable column for sure.


On another note:
> turning ORs into UNIONs
This is another one which would be incredibly useful for me.  I've had
to do this manually for performance reasons far too often.
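
For reference, a sketch of the manual rewrite being described
(hypothetical table; UNION rather than UNION ALL, since a row may
match both arms):

    -- an OR across different columns often defeats index usage:
    SELECT * FROM t WHERE a = 1 OR b = 2;
    -- manual rewrite: each arm can use its own index
    SELECT * FROM t WHERE a = 1
    UNION
    SELECT * FROM t WHERE b = 2;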


> Partially agreed. A comment to the article also mentions that some other
> database performs more optimizations depending on the cost of the
> plan. That's not easy to do in our current plan structure, but I think
> it's quite a worthwhile concept.

I would love to see this in Postgres.  It would allow the planner not
to waste cycles unnecessarily on queries where it's just not needed,
and to potentially spend a few more cycles planning very costly queries
to save a ton while executing.




Re: [HACKERS] Discussion on missing optimizations

2017-10-07 Thread Petr Jelinek
On 07/10/17 19:59, Tom Lane wrote:
> Petr Jelinek  writes:
>> On 07/10/17 18:15, Tom Lane wrote:
>>> No, I'm afraid you didn't read that comment closely enough.  This will
>>> flat out fail for cases like "select ... where x=x order by x", because
>>> there will already be a single-element EC for x and so the clause will
>>> just disappear entirely.  If that doesn't happen, then instead you're
>>> creating an EC with duplicate entries, which is an idea I seriously
>>> dislike; the semantics of that don't seem clear at all to me.
> 
>> Hmm it did not disappear (and worked fine in SQL level tests).
> 
> I may not be correctly remembering what the query would need to look
> like for there to be single-element ECs in existence at this point; but
> I surely would not like this code to assume that none will exist till
> later.
> 
>> I don't
>> think I fully understand the "EC with duplicate entries" part and what's
>> the problem with it so I'll defer to your wisdom there.
> 
> Well, as one example, assume that we use your patch and consider what
> happens with
>   where x = y and x = x
> vs
>   where x = x and x = y
> 
> In the first case we end up with an EC that's just {x,y}, because the
> second process_equivalence() will find both sides in the same EC and
> conclude that the second clause is redundant.  (Which it is, if the
> equality operators have the same notion of what to do with nulls.)
> In the second case we end up with an EC containing {x,x,y}, which
> at minimum will result in emitting redundant generated quals.  I'm
> not sure if it could have any worse consequences than that, but I'm
> not sure it couldn't either.  But this is bogus in any case, because
> those WHERE clauses surely should produce identical results.
> 
> Looking further ahead, if ECs have to support being multisets rather
> than pure sets, that would put a crimp in ever improving this logic to
> use a smarter UNION-FIND algorithm.  (I've not yet seen queries where the
> number of EC members is large enough to make that a serious issue, but
> I think it could happen, particularly with inheritance/partitioning.)
> 

Okay, that makes sense, thanks for the explanation. Your patch is the
way to go then.

>>> This passes the smell test for me in the sense of not adding any
>>> significant number of planner cycles except when the weird case occurs.
>>> It leaves something on the table in that there are some situations
>>> where X=X could be converted, but we won't do it because we don't see
>>> the clause as being a potential EC (because it's not at top level),
>>> as in the second new regression test case below.  I think that's
>>> probably all right; I don't see any way to be more thorough without
>>> adding a lot of new cycles all the time, and I don't believe this is
>>> worth that.
> 
>> My code had the same issue. I think going deeper would require quite a
>> bit of new code (and cycles) for something that's even less likely to
>> happen than simple X=X, while the current patch is quite a cheap win.
> 
> Yeah.  I'm not really convinced it's a big win, but it's so cheap we
> might as well do it.  The only case where it would expend cycles and
> fail to give an improvement is if we have X=X with a non-strict operator,
> which I think is a case that never arises in practice at present,
> because btree effectively assumes that all btree operators are strict.
> (Someday we might want to relax that, though, which is why I don't
> want to let the planner just assume it.)
> 

In real life probably not, but it seems like these small bits can be
useful marketing-wise, given articles like the one which started this.
And it should not hurt anything either.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Discussion on missing optimizations

2017-10-07 Thread Nico Williams
On Fri, Oct 06, 2017 at 10:19:54PM -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2017-10-06 21:33:16 -0400, Adam Brusselback wrote:
> >> The article in question is here:
> >> https://blog.jooq.org/2017/09/28/10-cool-sql-optimisations-that-do-not-depend-on-the-cost-model/
> 
> > That's interesting.
> 
> The impression I have in a quick scan is that probably hardly any of these
> are cases that any of the DB designers think are important in themselves.

That's true for some of those.  But some of them might become important
when you start pushing WHERE constraints from outside into inner table
sources and subqueries, as dumb-looking constraints can simply appear
from pushing non-dumb-looking constraints.

More than the operator-level optimizations, the following would make a
big difference for me:

 - turning subqueries into joins

 - turning ORs into UNIONs

   It is easy enough to work around the lack of this optimization in
   many cases, but it does make queries more verbose.

 - pushing WHERE constraints from outer queries into the table source
   queries (_including_ VIEWs)

 - determining that some table in a query that had WHERE constraints
   pushed into it... now has a very well-filled-out lookup key, and
   therefore is the one that should be the table source to start the
   plan with, i.e., that it should be first in the outermost loop of a
   nested loop join

   For me these two would be huge wins.  I have to resort to functions
   with roughly the same body as views just so that I can have the
   optimizer pick the correct plan.  This causes a lot of code
   duplication in my schemas.

 - pushing WHERE constraints from outer queries down into HAVING and
   thence WHERE constraints on GROUP BY queries, where the outer
   constraints are on columns used to GROUP BY (see the sketch after
   this list)

   I find myself making two versions of views that do aggregation: one
   that does not, and one that does.  This allows me to use the
   non-aggregating view in contexts where I need this optimization, but
   then I have to re-code the aggregation at that layer.  Again, lots of
   duplication.
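
As an illustration of that last item, the transform being asked for
looks roughly like this (view and column names are hypothetical):

    -- an outer constraint on a grouping column...
    SELECT * FROM (SELECT region, sum(amount) AS total
                   FROM sales GROUP BY region) s
    WHERE s.region = 'EMEA';
    -- ...should be planned as if the filter had been written inside:
    SELECT region, sum(amount) AS total
    FROM sales
    WHERE region = 'EMEA'
    GROUP BY region;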

These sorts of optimizations are huge.

> Rather, they fall out of more general optimization attempts, or not,
> depending on the optimization mechanisms in use in a particular DB.
> For example, reducing "WHERE 1=1" to "WHERE TRUE" and then to nothing
> comes out of a constant-subexpression-precalculation mechanism for us,
> whereas "WHERE column=column" doesn't fall to that approach.  ISTM it
> would be really dumb to expend planner cycles looking specifically for
> that case, so I guess that DB2 et al are finding it as a side-effect of
> some more general optimization ... I wonder what that is?

If you can reduce the number of compilations / optimization passes for
statements, then spending more time in the optimizer is not a big deal.
So, when invoked via PREPARE I would say spending more cycles looking
for this sort of thing is OK, but in many other cases it's not.

Also, sometimes these cases crop up due to pushing constraints into
VIEWs and sub-queries.  In those cases constant sub-expression
elimination can be a win.

> (edit: a few minutes later, I seem to remember that equivclass.c has
> to do something special with the X=X case, so maybe it could do
> something else special instead, with little new overhead.)

I'd expect that column = column is not trivial to turn into TRUE, not
unless those columns are NOT NULLable.

> > 9. Unneeded Self JOIN
> 
> > Can't remember discussions of this.
> 
> I can't get very excited about that one either.
> 
> In the end, what the article fails to consider is that all of these are
> tradeoffs, not unalloyed goods.  If you spend planner cycles on every
> query to look for cases that only the most unabashedly brain-dead ORMs
> ever generate, you're not really doing your users a favor on balance.

I can't get very excited about this one either, though I do believe it
can arise as the author says, "when you build complex views and JOIN
them to each other".  Maybe I'm not excited about it because I've not
needed it :)

Nico


Re: [HACKERS] Discussion on missing optimizations

2017-10-07 Thread Tom Lane
Petr Jelinek  writes:
> On 07/10/17 18:15, Tom Lane wrote:
>> No, I'm afraid you didn't read that comment closely enough.  This will
>> flat out fail for cases like "select ... where x=x order by x", because
>> there will already be a single-element EC for x and so the clause will
>> just disappear entirely.  If that doesn't happen, then instead you're
>> creating an EC with duplicate entries, which is an idea I seriously
>> dislike; the semantics of that don't seem clear at all to me.

> Hmm it did not disappear (and worked fine in SQL level tests).

I may not be correctly remembering what the query would need to look
like for there to be single-element ECs in existence at this point; but
I surely would not like this code to assume that none will exist till
later.

> I don't
> think I fully understand the "EC with duplicate entries" part and what's
> the problem with it so I'll defer to your wisdom there.

Well, as one example, assume that we use your patch and consider what
happens with
   where x = y and x = x
vs
   where x = x and x = y

In the first case we end up with an EC that's just {x,y}, because the
second process_equivalence() will find both sides in the same EC and
conclude that the second clause is redundant.  (Which it is, if the
equality operators have the same notion of what to do with nulls.)
In the second case we end up with an EC containing {x,x,y}, which
at minimum will result in emitting redundant generated quals.  I'm
not sure if it could have any worse consequences than that, but I'm
not sure it couldn't either.  But this is bogus in any case, because
those WHERE clauses surely should produce identical results.

Looking further ahead, if ECs have to support being multisets rather
than pure sets, that would put a crimp in ever improving this logic to
use a smarter UNION-FIND algorithm.  (I've not yet seen queries where the
number of EC members is large enough to make that a serious issue, but
I think it could happen, particularly with inheritance/partitioning.)

>> This passes the smell test for me in the sense of not adding any
>> significant number of planner cycles except when the weird case occurs.
>> It leaves something on the table in that there are some situations
>> where X=X could be converted, but we won't do it because we don't see
>> the clause as being a potential EC (because it's not at top level),
>> as in the second new regression test case below.  I think that's
>> probably all right; I don't see any way to be more thorough without
>> adding a lot of new cycles all the time, and I don't believe this is
>> worth that.

> My code had the same issue. I think going deeper would require quite a
> bit of new code (and cycles) for something that's even less likely to
> happen than simple X=X, while the current patch is quite a cheap win.

Yeah.  I'm not really convinced it's a big win, but it's so cheap we
might as well do it.  The only case where it would expend cycles and
fail to give an improvement is if we have X=X with a non-strict operator,
which I think is a case that never arises in practice at present,
because btree effectively assumes that all btree operators are strict.
(Someday we might want to relax that, though, which is why I don't
want to let the planner just assume it.)

regards, tom lane




Re: [HACKERS] Discussion on missing optimizations

2017-10-07 Thread Petr Jelinek
On 07/10/17 18:15, Tom Lane wrote:
> Petr Jelinek  writes:
>> On 07/10/17 04:19, Tom Lane wrote:
>>> (edit: a few minutes later, I seem to remember that equivclass.c has
>>> to do something special with the X=X case, so maybe it could do
>>> something else special instead, with little new overhead.)
> 
>> So I wrote a prototype achieving this optimization, and it seems to be
>> really quite simple code-wise (see attached). I did only limited manual
>> testing of this, but I don't see any negative impact on planning time.
> 
> No, I'm afraid you didn't read that comment closely enough.  This will
> flat out fail for cases like "select ... where x=x order by x", because
> there will already be a single-element EC for x and so the clause will
> just disappear entirely.  If that doesn't happen, then instead you're
> creating an EC with duplicate entries, which is an idea I seriously
> dislike; the semantics of that don't seem clear at all to me.

Hmm it did not disappear (and worked fine in SQL level tests). I don't
think I fully understand the "EC with duplicate entries" part and what's
the problem with it so I'll defer to your wisdom there.

> What I was imagining was that having detected X=X, instead of "throwing
> back" the clause as-is we could throw back an X IS NOT NULL clause,
> along the lines of the attached.

Right, I wanted to avoid messing with the incoming result info, but if
we don't want to call the code below or write tons of code for this, I
guess it's the best we can do.

> This passes the smell test for me in the sense of not adding any
> significant number of planner cycles except when the weird case occurs.
> It leaves something on the table in that there are some situations
> where X=X could be converted, but we won't do it because we don't see
> the clause as being a potential EC (because it's not at top level),
> as in the second new regression test case below.  I think that's
> probably all right; I don't see any way to be more thorough without
> adding a lot of new cycles all the time, and I don't believe this is
> worth that.
> 

My code had the same issue. I think going deeper would require quite a
bit of new code (and cycles) for something that's even less likely to
happen than simple X=X, while the current patch is quite a cheap win.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Discussion on missing optimizations

2017-10-07 Thread Tom Lane
David Rowley  writes:
> It would be much nicer if you'd at least wait for benchmarks before
> shooting.

Benchmarks of what?  We'd have to expend quite a bit of effort just
to get to a place where we'd have something to benchmark.  I do not
think it's unreasonable of me to express an opinion that that would
likely be wasted effort.  If you disagree, and are willing to expend
such effort speculatively, I'm not stopping you.

regards, tom lane




Re: [HACKERS] Discussion on missing optimizations

2017-10-07 Thread Tom Lane
Petr Jelinek  writes:
> On 07/10/17 04:19, Tom Lane wrote:
>> (edit: a few minutes later, I seem to remember that equivclass.c has
>> to do something special with the X=X case, so maybe it could do
>> something else special instead, with little new overhead.)

> So I wrote a prototype achieving this optimization, and it seems to be
> really quite simple code-wise (see attached). I did only limited manual
> testing of this, but I don't see any negative impact on planning time.

No, I'm afraid you didn't read that comment closely enough.  This will
flat out fail for cases like "select ... where x=x order by x", because
there will already be a single-element EC for x and so the clause will
just disappear entirely.  If that doesn't happen, then instead you're
creating an EC with duplicate entries, which is an idea I seriously
dislike; the semantics of that don't seem clear at all to me.

What I was imagining was that having detected X=X, instead of "throwing
back" the clause as-is we could throw back an X IS NOT NULL clause,
along the lines of the attached.
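
In other words, the intended rewrite is roughly the following (a
sketch, assuming the equality operator is strict):

    -- as written:
    SELECT * FROM t WHERE x = x ORDER BY x;
    -- what gets thrown back in place of the X=X clause:
    SELECT * FROM t WHERE x IS NOT NULL ORDER BY x;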

This passes the smell test for me in the sense of not adding any
significant number of planner cycles except when the weird case occurs.
It leaves something on the table in that there are some situations
where X=X could be converted, but we won't do it because we don't see
the clause as being a potential EC (because it's not at top level),
as in the second new regression test case below.  I think that's
probably all right; I don't see any way to be more thorough without
adding a lot of new cycles all the time, and I don't believe this is
worth that.

regards, tom lane

diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index 7997f50..a225414 100644
*** a/src/backend/optimizer/path/equivclass.c
--- b/src/backend/optimizer/path/equivclass.c
***
*** 27,32 
--- 27,33 
  #include "optimizer/paths.h"
  #include "optimizer/planmain.h"
  #include "optimizer/prep.h"
+ #include "optimizer/restrictinfo.h"
  #include "optimizer/var.h"
  #include "utils/lsyscache.h"
  
*** static bool reconsider_full_join_clause(
*** 71,78 
   *	  any delay by an outer join, so its two sides can be considered equal
   *	  anywhere they are both computable; moreover that equality can be
   *	  extended transitively.  Record this knowledge in the EquivalenceClass
!  *	  data structure.  Returns TRUE if successful, FALSE if not (in which
!  *	  case caller should treat the clause as ordinary, not an equivalence).
   *
   * If below_outer_join is true, then the clause was found below the nullable
   * side of an outer join, so its sides might validly be both NULL rather than
--- 72,85 
   *	  any delay by an outer join, so its two sides can be considered equal
   *	  anywhere they are both computable; moreover that equality can be
   *	  extended transitively.  Record this knowledge in the EquivalenceClass
!  *	  data structure, if applicable.  Returns TRUE if successful, FALSE if not
!  *	  (in which case caller should treat the clause as ordinary, not an
!  *	  equivalence).
!  *
!  * In some cases, although we cannot convert a clause into EquivalenceClass
!  * knowledge, we can still modify it to a more useful form than the original.
!  * Then, *p_restrictinfo will be replaced by a new RestrictInfo, which is what
!  * the caller should use for further processing.
   *
   * If below_outer_join is true, then the clause was found below the nullable
   * side of an outer join, so its sides might validly be both NULL rather than
*** static bool reconsider_full_join_clause(
*** 104,112 
   * memory context.
   */
  bool
! process_equivalence(PlannerInfo *root, RestrictInfo *restrictinfo,
  	bool below_outer_join)
  {
  	Expr	   *clause = restrictinfo->clause;
  	Oid			opno,
  collation,
--- 111,121 
   * memory context.
   */
  bool
! process_equivalence(PlannerInfo *root,
! 	RestrictInfo **p_restrictinfo,
  	bool below_outer_join)
  {
+ 	RestrictInfo *restrictinfo = *p_restrictinfo;
  	Expr	   *clause = restrictinfo->clause;
  	Oid			opno,
  collation,
*** process_equivalence(PlannerInfo *root, R
*** 154,169 
  	   collation);
  
  	/*
! 	 * Reject clauses of the form X=X.  These are not as redundant as they
! 	 * might seem at first glance: assuming the operator is strict, this is
! 	 * really an expensive way to write X IS NOT NULL.  So we must not risk
! 	 * just losing the clause, which would be possible if there is already a
! 	 * single-element EquivalenceClass containing X.  The case is not common
! 	 * enough to be worth contorting the EC machinery for, so just reject the
! 	 * clause and let it be processed as a normal restriction clause.
  	 */
  	if (equal(item1, item2))
! 		return false;			/* X=X is not a useful equivalence */
  
  	/*
  	 * If below outer join, check for 

Re: [HACKERS] Discussion on missing optimizations

2017-10-07 Thread David Rowley
On 7 October 2017 at 15:19, Tom Lane  wrote:
>> 9. Unneeded Self JOIN
>
>> Can't remember discussions of this.
>
> I can't get very excited about that one either.
>
> In the end, what the article fails to consider is that all of these are
> tradeoffs, not unalloyed goods.  If you spend planner cycles on every
> query to look for cases that only the most unabashedly brain-dead ORMs
> ever generate, you're not really doing your users a favor on balance.

I think that final sentence lacks imagination.

I've seen plenty of views being called where some column is
unavailable, but the caller joins the very same table again on the
primary key to add the column. There was no brain-dead ORM involved,
just perhaps questionable design. This was very common in my last job,
where we had a rat's nest of views several layers deep, the core of
which often had some complex logic that nobody dared to try and
replicate.

It would be fairly cheap to check if any of the rtekind==RTE_RELATION
joinlist items have more than one RangeTblEntry with the same relid.
The joinlist is never that big anyway, and if it were, the join search
would be slow. The more expensive part would be to build the join
clauses, check that the expressions on either side of all OpExprs match
and that nothing else will cause a non-match, then perform the
uniqueness check on those OpExpr operands. We do have some
infrastructure to do the unique checks.  Likely the slowdown in
planning would only be for cases with a legitimately useful self-join;
I doubt checking for a duplicate RangeTblEntry->relid would cause much
of a concern.

Anyway, this is giving me some feeling of déjà vu. Perhaps we need
some pg_stats view that shows us planning time and execution time so
that we can get a better idea of how much these things matter in the
average case. We tend never to fare so well in these sorts of
comparisons with commercial databases. It's not hard to imagine
someone with migration plans loading a rat's nest of views into
Postgres, taking it for a spin, and finding performance is not quite
what they need. It's a bit sad that often the people with the loudest
voices are so fast to stomp on ideas for improvements. It would be
much nicer if you'd at least wait for benchmarks before shooting.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Discussion on missing optimizations

2017-10-07 Thread Petr Jelinek
On 07/10/17 04:19, Tom Lane wrote:
> Andres Freund  writes:
>> On 2017-10-06 21:33:16 -0400, Adam Brusselback wrote:
>>> The article in question is here:
>>> https://blog.jooq.org/2017/09/28/10-cool-sql-optimisations-that-do-not-depend-on-the-cost-model/
> 
>> That's interesting.
> 
> The impression I have in a quick scan is that probably hardly any of these
> are cases that any of the DB designers think are important in themselves.
> Rather, they fall out of more general optimization attempts, or not,
> depending on the optimization mechanisms in use in a particular DB.
> For example, reducing "WHERE 1=1" to "WHERE TRUE" and then to nothing
> comes out of a constant-subexpression-precalculation mechanism for us,
> whereas "WHERE column=column" doesn't fall to that approach.  ISTM it
> would be really dumb to expend planner cycles looking specifically for
> that case, so I guess that DB2 et al are finding it as a side-effect of
> some more general optimization ... I wonder what that is?
> 
> (edit: a few minutes later, I seem to remember that equivclass.c has
> to do something special with the X=X case, so maybe it could do
> something else special instead, with little new overhead.)
> 

What it actually does is specifically skip the processing for X=X
(the const expression will be simplified by
estimate_expression_value/eval_const_expressions separately). There is
a comment there stating that it's not worth processing this, as it's a
rare clause that is equivalent to X IS NOT NULL.

I don't actually agree with the argument in that comment: in practice,
if the "silly" equality is not there, we just waste an equal() call,
and if it is there, the optimization seems worth it, since it can lead
to orders of magnitude better estimation in many cases (X=X currently
gets a generic equality estimate, while X IS NOT NULL is estimated
from the column's null fraction).

So I wrote a prototype of this optimization, and it turns out to be
really quite simple code-wise (see attached). I did only limited
manual testing, but I don't see any negative impact on planning time.

Thoughts?

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services
From a7d6f3fe0a5d42a96c711cc33e6962b9a2f6511f Mon Sep 17 00:00:00 2001
From: Petr Jelinek 
Date: Sat, 7 Oct 2017 15:10:54 +0200
Subject: [PATCH] Transform X=X expressions into X IS NOT NULL

---
 src/backend/optimizer/path/equivclass.c | 12 ------------
 src/backend/optimizer/plan/initsplan.c  | 11 +++++++++++
 2 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index 7997f50c18..d4ffdf66f2 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -154,18 +154,6 @@ process_equivalence(PlannerInfo *root, RestrictInfo *restrictinfo,
 	   collation);
 
 	/*
-	 * Reject clauses of the form X=X.  These are not as redundant as they
-	 * might seem at first glance: assuming the operator is strict, this is
-	 * really an expensive way to write X IS NOT NULL.  So we must not risk
-	 * just losing the clause, which would be possible if there is already a
-	 * single-element EquivalenceClass containing X.  The case is not common
-	 * enough to be worth contorting the EC machinery for, so just reject the
-	 * clause and let it be processed as a normal restriction clause.
-	 */
-	if (equal(item1, item2))
-		return false;			/* X=X is not a useful equivalence */
-
-	/*
 	 * If below outer join, check for strictness, else reject.
 	 */
 	if (below_outer_join)
diff --git a/src/backend/optimizer/plan/initsplan.c b/src/backend/optimizer/plan/initsplan.c
index 9931dddba4..1c3672a978 100644
--- a/src/backend/optimizer/plan/initsplan.c
+++ b/src/backend/optimizer/plan/initsplan.c
@@ -2347,6 +2347,17 @@ process_implied_equality(PlannerInfo *root,
 return;
 		}
 	}
+	else if (equal(item1, item2))
+	{
+		NullTest	   *ntest;
+		ntest = makeNode(NullTest);
+		ntest->arg = item1;
+		ntest->nulltesttype = IS_NOT_NULL;
+		ntest->argisrow = false;
+		ntest->location = exprLocation((Node *) clause);
+
+		clause = (Expr *) ntest;
+	}
 
 	/*
 	 * Push the new clause into all the appropriate restrictinfo lists.
-- 
2.11.0


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Discussion on missing optimizations

2017-10-06 Thread Andres Freund
On 2017-10-06 22:19:54 -0400, Tom Lane wrote:
> The impression I have in a quick scan is that probably hardly any of these
> are cases that any of the DB designers think are important in
> themselves.

> In the end, what the article fails to consider is that all of these are
> tradeoffs, not unalloyed goods.  If you spend planner cycles on every
> query to look for cases that only the most unabashedly brain-dead ORMs
> ever generate, you're not really doing your users a favor on balance.

Partially agreed. A comment on the article also mentions that some
other database performs more optimizations depending on the cost of
the plan. That's not easy to do in our current plan structure, but I
think it's quite a worthwhile concept.
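
(Sketch only, to make the concept concrete: every type and function
below is invented for the example, this is not PostgreSQL code. The
idea is to plan cheaply first and spend the extra rewrite cycles only
on plans costly enough that they might pay off.)

/* Sketch only: all types and functions here are invented. */
typedef struct Query Query;
typedef struct Plan
{
	double		total_cost;
} Plan;

extern Plan *plan_query_basic(Query *query);
extern Query *apply_expensive_rewrites(Query *query);

#define REWRITE_COST_THRESHOLD	10000.0

static Plan *
plan_query_cost_gated(Query *query)
{
	Plan	   *plan = plan_query_basic(query);

	/*
	 * Only when the cheap plan is costly enough that removing executor
	 * work could pay for the extra planner cycles do we retry with the
	 * aggressive rewrites (join elimination, predicate merging, ...).
	 */
	if (plan->total_cost > REWRITE_COST_THRESHOLD)
	{
		Plan	   *replan = plan_query_basic(apply_expensive_rewrites(query));

		if (replan->total_cost < plan->total_cost)
			plan = replan;
	}

	return plan;
}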


> Rather, they fall out of more general optimization attempts, or not,
> depending on the optimization mechanisms in use in a particular DB.
> For example, reducing "WHERE 1=1" to "WHERE TRUE" and then to nothing
> comes out of a constant-subexpression-precalculation mechanism for us,
> whereas "WHERE column=column" doesn't fall to that approach.  ISTM it
> would be really dumb to expend planner cycles looking specifically for
> that case, so I guess that DB2 et al are finding it as a side-effect of
> some more general optimization ... I wonder what that is?
> 
> (edit: a few minutes later, I seem to remember that equivclass.c has
> to do something special with the X=X case, so maybe it could do
> something else special instead, with little new overhead.)

Yea, I think this should be inferrable from information we essentially
already compute for equivclasses.


> > (i.e. combining a IN (a,b,c) and a IN (c,d,f) into a IN (c), similar
> > with OR)
> 
> > I can't remember proposals about adding this, but it seems worthwhile to
> > consider. I think we should be able to check for this without a lot of
> > planner overhead.
> 
> It would be enormously expensive to check that in the general case with
> a bunch of entries in each IN list.  Perhaps it would be OK to add on
> the presumption that few queries would contain multiple IN's on the same
> column in the first place, though.  Not sure.

Merging of ORs should be near free if you leave duplicates in there,
and the duplicates aren't a huge concern if the alternative is to keep
them but also have two clauses to evaluate.

I think the IN AND IN case should usually end up a clear win as
well. Both lists are going to be evaluated for each row anyway - you
don't need many rows where the clauses are evaluated to make it worth
the O(n*log(n)) of sorting the lists.  Sorting them would be
beneficial for other reasons as well, e.g. it improves access patterns
for ScalarArrayOp (SAO) index scans.
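
(A standalone illustration in plain C, not planner code, of why the
combined check is cheap once both lists are sorted: a single merge
pass computes the intersection, so a IN (1,2,3) AND a IN (3,4,6)
collapses to a IN (3).)

#include <stdio.h>
#include <stdlib.h>

static int
cmp_int(const void *a, const void *b)
{
	int			x = *(const int *) a;
	int			y = *(const int *) b;

	return (x > y) - (x < y);
}

/* Intersect two IN lists: O(n log n) sorts, then an O(n + m) merge. */
static int
intersect_in_lists(int *a, int na, int *b, int nb, int *out)
{
	int			i = 0,
				j = 0,
				n = 0;

	qsort(a, na, sizeof(int), cmp_int);
	qsort(b, nb, sizeof(int), cmp_int);

	while (i < na && j < nb)
	{
		if (a[i] < b[j])
			i++;
		else if (a[i] > b[j])
			j++;
		else
		{
			/* keep each match once, skipping duplicates */
			if (n == 0 || out[n - 1] != a[i])
				out[n++] = a[i];
			i++;
			j++;
		}
	}

	return n;
}

int
main(void)
{
	int			a[] = {1, 2, 3};	/* a IN (1,2,3) */
	int			b[] = {3, 4, 6};	/* a IN (3,4,6) */
	int			out[3];
	int			n = intersect_in_lists(a, 3, b, 3, out);

	for (int i = 0; i < n; i++)
		printf("%d\n", out[i]);		/* prints 3, i.e. a IN (3) */

	return 0;
}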


Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Discussion on missing optimizations

2017-10-06 Thread Tom Lane
Andres Freund  writes:
> On 2017-10-06 21:33:16 -0400, Adam Brusselback wrote:
>> The article in question is here:
>> https://blog.jooq.org/2017/09/28/10-cool-sql-optimisations-that-do-not-depend-on-the-cost-model/

> That's interesting.

The impression I have in a quick scan is that probably hardly any of these
are cases that any of the DB designers think are important in themselves.
Rather, they fall out of more general optimization attempts, or not,
depending on the optimization mechanisms in use in a particular DB.
For example, reducing "WHERE 1=1" to "WHERE TRUE" and then to nothing
comes out of a constant-subexpression-precalculation mechanism for us,
whereas "WHERE column=column" doesn't fall to that approach.  ISTM it
would be really dumb to expend planner cycles looking specifically for
that case, so I guess that DB2 et al are finding it as a side-effect of
some more general optimization ... I wonder what that is?

(edit: a few minutes later, I seem to remember that equivclass.c has
to do something special with the X=X case, so maybe it could do
something else special instead, with little new overhead.)

> (i.e. combining a IN (a,b,c) and a IN (c,d,f) into a IN (c), similar
> with OR)

> I can't remember proposals about adding this, but it seems worthwhile to
> consider. I think we should be able to check for this without a lot of
> planner overhead.

It would be enormously expensive to check that in the general case with
a bunch of entries in each IN list.  Perhaps it would be OK to add on
the presumption that few queries would contain multiple IN's on the same
column in the first place, though.  Not sure.

> 9. Unneeded Self JOIN

> Can't remember discussions of this.

I can't get very excited about that one either.

In the end, what the article fails to consider is that all of these are
tradeoffs, not unalloyed goods.  If you spend planner cycles on every
query to look for cases that only the most unabashedly brain-dead ORMs
ever generate, you're not really doing your users a favor on balance.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Discussion on missing optimizations

2017-10-06 Thread Andres Freund
Hi,

On 2017-10-06 21:33:16 -0400, Adam Brusselback wrote:
> Hopefully it's alright for me to post this here, please let me know if not.

> I ran across an article on blog.jooq.org comparing all the major RDBMS'
> with their ability to optimize away unnecessary work with queries which are
> less than optimal, and saw some discussion on hackernews and reddit, but I
> hadn't seen any discussion here.
> 
> The article in question is here:
> https://blog.jooq.org/2017/09/28/10-cool-sql-optimisations-that-do-not-depend-on-the-cost-model/

That's interesting.


> Commercial databases seem to have a serious leg up in this area, and there
> are even some that MySQL has that Postgres doesn't.
> 
> I was wondering which of these are planned, which have had discussion
> before and decided not to support, and which just haven't been thought of?

Hm, going through the ones we don't [fully] support:

3. JOIN Elimination

There's been a lot of discussion and several patches. There are a
bunch of problems here, one being that there are cases (during trigger
firing, before the constraint checks) where foreign keys don't hold
true, so we can't quite make this optimization in general.  Another
concern is whether the added plan time is actually worthwhile.


4. Removing “Silly” Predicates

(i.e stuff like column = column)

This has been deemed not to be a valuable use of plan time. I'm
doubtful about that argument, but IIRC that was the consensus the last
time this came up.


6. Predicate Merging

(i.e. combining a IN (a,b,c) and a IN (c,d,f) into a IN (c), similar
with OR)

I can't remember proposals about adding this, but it seems worthwhile to
consider. I think we should be able to check for this without a lot of
planner overhead.


7. Provably Empty Sets
8. CHECK Constraints

I think some of this should actually work with constraint exclusion
turned on.


9. Unneeded Self JOIN

Can't remember discussions of this.


Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers