Re: [HACKERS] [PATCH] Simplify EXISTS subqueries containing LIMIT

2014-11-22 Thread Tom Lane
David Rowley  writes:
> I'm reasonably happy with the patch now, the only small niggles are maybe
> the Assert() is probably not so much needed as transformLimitClause() seems
> to coerce to int8 anyway, and recompute_limits() does not bother with the
> Assert() when it does the same thing, either way, it's certainly doing no
> harm. The other thing was the regression tests, I'd rather see the tests
> use 2 separate tables for the EXISTS checks, but this is likely only
> because I'm thinking of doing some work in the future to remove duplicate
> joins from queries, so perhaps it's fine for now.

> Good work. Marking as ready for committer.

I've committed this patch with small adjustments; mostly, making the
comments more explicit.  But I also reordered the operations so that the
LIMIT-ectomy can be performed even if we decide the targetlist is unsafe.
This is within simplify_EXISTS_query's stated charter: it does not have
to be all-or-nothing.

(On reflection, we could zap a constant limit even when there are GROUP BY
etc, but it's probably not worth worrying about.)

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Simplify EXISTS subqueries containing LIMIT

2014-10-26 Thread David Rowley
On Mon, Oct 27, 2014 at 1:56 AM, Marti Raudsepp  wrote:

> On Wed, Oct 22, 2014 at 1:37 PM, David Rowley 
> wrote:
> > I've had a bit of a look at this and here's a couple of things:
>
> Thanks. Sorry I didn't back to you earlier, I almost forgot about the
> review.
>
> > /*
> > +* LIMIT clause can be removed if it's a positive constant or
> ALL, to
> > +* prevent it from being an optimization barrier. It's a common
> meme to put
> > +* LIMIT 1 within EXISTS subqueries.
> > +*/
> >
> > I think this comment may be better explained along the lines of:
> >
> > "A subquery which has a LIMIT clause with a positive value is
> effectively a
> > no-op in this scenario. With EXISTS we only care about the first row
> anyway,
> > so any positive limit value will have no behavioral change to the query,
> so
> > we'll simply remove the LIMIT clause. If we're unable to prove that the
> > LIMIT value is a positive number then we'd better not touch it."
>
> That's a bit redundant. Here's what I wrote instead, does this sound
> better?
>
> * A subquery that has a LIMIT clause with a positive value or NULL causes
> * no behavioral change to the query. With EXISTS we only care about the
> * first row anyway, so we'll simply remove the LIMIT clause. If the LIMIT
> * value does not reduce to a constant that's positive or NULL then do not
> * touch it.
>
>
Sounds better.


> > + /* Checking for negative values is done later; 0 is just silly */
> > + if (! limit->constisnull && DatumGetInt64(limit->constvalue) <= 0)
> >
> > I'm a bit confused by the comment here. You're saying that we'll check
> for
> > negatives later... but you're checking for <= 0 on the next line... Did I
> > miss something or is this a mistake?
>
> I meant that the case when a negative value raises an error is checked
> later in the execution pass. But you're right, this code has nothing
> to do with that so I've removed the mention about negative values. It
> now says:
>
> /* If it's not NULL and not positive, keep it as a regular subquery */
>
>
Also better.


> > I guess here you're just testing to ensure that you've not broken this
> and
> > that the new code does not accidentally remove the LIMIT clause. I think
> it
> > also might be worth adding some tests to ensure that co-related EXISTS
> > clauses with a positive LIMIT value get properly changed into a SEMI JOIN
> > instead of remaining as a subplan. And also make sure that a LIMIT 0 or
> > LIMIT -1 gets left as a subplan. I'd say such a test would supersede the
> > above test, and I think it's also the case you were talking about
> optimising
> > anyway?
>
> Sure, this is a planner-only change so it makes sense to test EXPLAIN
> output.
>
> The dilemma I always have is: wouldn't this cause failures due to plan
> instability, if for example autoanalyze gets around to this table
> earlier or later and nudges it from a hash join to merge etc? Or is
> this not a problem?
>
>
I guess it could be a problem. You could write your tests to use tenk1
instead, it seems to be analyzed in copy.sql

I'm reasonably happy with the patch now, the only small niggles are maybe
the Assert() is probably not so much needed as transformLimitClause() seems
to coerce to int8 anyway, and recompute_limits() does not bother with the
Assert() when it does the same thing, either way, it's certainly doing no
harm. The other thing was the regression tests, I'd rather see the tests
use 2 separate tables for the EXISTS checks, but this is likely only
because I'm thinking of doing some work in the future to remove duplicate
joins from queries, so perhaps it's fine for now.

Good work. Marking as ready for committer.

Regards

David Rowley




Patch attached with all above changes.
>
> Regards,
> Marti
>


Re: [HACKERS] [PATCH] Simplify EXISTS subqueries containing LIMIT

2014-10-26 Thread Marti Raudsepp
On Wed, Oct 22, 2014 at 1:37 PM, David Rowley  wrote:
> I've had a bit of a look at this and here's a couple of things:

Thanks. Sorry I didn't back to you earlier, I almost forgot about the review.

> /*
> +* LIMIT clause can be removed if it's a positive constant or ALL, to
> +* prevent it from being an optimization barrier. It's a common meme 
> to put
> +* LIMIT 1 within EXISTS subqueries.
> +*/
>
> I think this comment may be better explained along the lines of:
>
> "A subquery which has a LIMIT clause with a positive value is effectively a
> no-op in this scenario. With EXISTS we only care about the first row anyway,
> so any positive limit value will have no behavioral change to the query, so
> we'll simply remove the LIMIT clause. If we're unable to prove that the
> LIMIT value is a positive number then we'd better not touch it."

That's a bit redundant. Here's what I wrote instead, does this sound better?

* A subquery that has a LIMIT clause with a positive value or NULL causes
* no behavioral change to the query. With EXISTS we only care about the
* first row anyway, so we'll simply remove the LIMIT clause. If the LIMIT
* value does not reduce to a constant that's positive or NULL then do not
* touch it.

> + /* Checking for negative values is done later; 0 is just silly */
> + if (! limit->constisnull && DatumGetInt64(limit->constvalue) <= 0)
>
> I'm a bit confused by the comment here. You're saying that we'll check for
> negatives later... but you're checking for <= 0 on the next line... Did I
> miss something or is this a mistake?

I meant that the case when a negative value raises an error is checked
later in the execution pass. But you're right, this code has nothing
to do with that so I've removed the mention about negative values. It
now says:

/* If it's not NULL and not positive, keep it as a regular subquery */

> I guess here you're just testing to ensure that you've not broken this and
> that the new code does not accidentally remove the LIMIT clause. I think it
> also might be worth adding some tests to ensure that co-related EXISTS
> clauses with a positive LIMIT value get properly changed into a SEMI JOIN
> instead of remaining as a subplan. And also make sure that a LIMIT 0 or
> LIMIT -1 gets left as a subplan. I'd say such a test would supersede the
> above test, and I think it's also the case you were talking about optimising
> anyway?

Sure, this is a planner-only change so it makes sense to test EXPLAIN output.

The dilemma I always have is: wouldn't this cause failures due to plan
instability, if for example autoanalyze gets around to this table
earlier or later and nudges it from a hash join to merge etc? Or is
this not a problem?

Patch attached with all above changes.

Regards,
Marti
From 28543dda9febe8d8b5fc91060a4323c08f3c4a8a Mon Sep 17 00:00:00 2001
From: Marti Raudsepp 
Date: Wed, 1 Oct 2014 02:17:21 +0300
Subject: [PATCH] Simplify EXISTS subqueries containing LIMIT

---
 src/backend/optimizer/plan/subselect.c  | 42 ++-
 src/test/regress/expected/subselect.out | 51 +
 src/test/regress/sql/subselect.sql  | 14 +
 3 files changed, 100 insertions(+), 7 deletions(-)

diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 3e7dc85..66d1b90 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -70,7 +70,7 @@ static Node *convert_testexpr_mutator(Node *node,
 static bool subplan_is_hashable(Plan *plan);
 static bool testexpr_is_hashable(Node *testexpr);
 static bool hash_ok_operator(OpExpr *expr);
-static bool simplify_EXISTS_query(Query *query);
+static bool simplify_EXISTS_query(PlannerInfo *root, Query *query);
 static Query *convert_EXISTS_to_ANY(PlannerInfo *root, Query *subselect,
 	  Node **testexpr, List **paramIds);
 static Node *replace_correlation_vars_mutator(Node *node, PlannerInfo *root);
@@ -452,7 +452,7 @@ make_subplan(PlannerInfo *root, Query *orig_subquery,
 	 * If it's an EXISTS subplan, we might be able to simplify it.
 	 */
 	if (subLinkType == EXISTS_SUBLINK)
-		simple_exists = simplify_EXISTS_query(subquery);
+		simple_exists = simplify_EXISTS_query(root, subquery);
 
 	/*
 	 * For an EXISTS subplan, tell lower-level planner to expect that only the
@@ -518,7 +518,7 @@ make_subplan(PlannerInfo *root, Query *orig_subquery,
 		/* Make a second copy of the original subquery */
 		subquery = (Query *) copyObject(orig_subquery);
 		/* and re-simplify */
-		simple_exists = simplify_EXISTS_query(subquery);
+		simple_exists = simplify_EXISTS_query(root, subquery);
 		Assert(simple_exists);
 		/* See if it can be converted to an ANY query */
 		subquery = convert_EXISTS_to_ANY(root, subquery,
@@ -1359,7 +1359,7 @@ convert_EXISTS_sublink_to_join(PlannerInfo *root, SubLink *sublink,
 	 * targetlist, we have to fail, because the pullup operation leaves us
 	 * with no

Re: [HACKERS] [PATCH] Simplify EXISTS subqueries containing LIMIT

2014-10-22 Thread David Rowley
On Tue, Oct 21, 2014 at 11:00 PM, Marti Raudsepp  wrote:

> On Sun, Oct 19, 2014 at 1:22 PM, David Rowley 
> wrote:
> > the argument for this would
> > have been much stronger if anti join support had just been added last
> week.
> > It's been quite a few years now and the argument for this must be getting
> > weaker with every release.
>
> I see your point, but I would put it another way: we've had this for a
> few years, but people haven't learned and are *still* using LIMIT 1.
>
>
I've had a bit of a look at this and here's a couple of things:

/*
+* LIMIT clause can be removed if it's a positive constant or ALL,
to
+* prevent it from being an optimization barrier. It's a common
meme to put
+* LIMIT 1 within EXISTS subqueries.
+*/

I think this comment may be better explained along the lines of:

"A subquery which has a LIMIT clause with a positive value is effectively a
no-op in this scenario. With EXISTS we only care about the first row
anyway, so any positive limit value will have no behavioral change to the
query, so we'll simply remove the LIMIT clause. If we're unable to prove
that the LIMIT value is a positive number then we'd better not touch it."


+ /* Checking for negative values is done later; 0 is just silly */
+ if (! limit->constisnull && DatumGetInt64(limit->constvalue) <= 0)

I'm a bit confused by the comment here. You're saying that we'll check for
negatives later... but you're checking for <= 0 on the next line... Did I
miss something or is this a mistake?


This test:

+select * from int4_tbl o where exists (select 1 limit 0);
+ f1
+
+(0 rows)

I guess here you're just testing to ensure that you've not broken this and
that the new code does not accidentally remove the LIMIT clause. I think it
also might be worth adding some tests to ensure that co-related EXISTS
clauses with a positive LIMIT value get properly changed into a SEMI JOIN
instead of remaining as a subplan. And also make sure that a LIMIT 0 or
LIMIT -1 gets left as a subplan. I'd say such a test would supersede the
above test, and I think it's also the case you were talking about
optimising anyway?

You can use EXPLAIN (COSTS OFF) to get a stable explain output.

Regards

David Rowley


Re: [HACKERS] [PATCH] Simplify EXISTS subqueries containing LIMIT

2014-10-21 Thread Marti Raudsepp
Hi

Thanks for taking a look.

On Sun, Oct 19, 2014 at 1:22 PM, David Rowley  wrote:
> the argument for this would
> have been much stronger if anti join support had just been added last week.
> It's been quite a few years now and the argument for this must be getting
> weaker with every release.

I see your point, but I would put it another way: we've had this for a
few years, but people haven't learned and are *still* using LIMIT 1.


Actually I thought of a cleverer approach to solve this, that doesn't
require calling eval_const_expressions and works with any expression.

Given query:
  EXISTS (SELECT ... WHERE foo LIMIT any_expr())
we could turn it into the almost-equivalent form:
  EXISTS (SELECT ... WHERE foo AND any_expr() > 0)

The only problem is that we'd no longer be able to throw an error for
negative values and that seems like a deal-breaker.

Regards,
Marti


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Simplify EXISTS subqueries containing LIMIT

2014-10-19 Thread David Rowley
On Fri, Oct 3, 2014 at 10:41 AM, Marti Raudsepp  wrote:

> Hi list,
>
> Attached patch allows semijoin/antijoin/hashed SubPlan optimization
> when an EXISTS subquery contains a LIMIT clause with a positive
> constant. It seems to be a fairly common meme to put LIMIT 1 into
> EXISTS() subqueries, and it even makes sense when you're not aware
> that the database already does this optimization.
>
>
I had a quick look at this, and the code looks fairly simple. Although,
I've got mixed feelings about it;

I guess there's not really any real performance penalty in planning time
for everyone else who does not put LIMIT clauses into their exists
subqueries, so maybe it's worth it as it seems there could still be a few
people out there suffering from this, but at the same time, the argument
for this would have been much stronger if anti join support had just been
added last week. It's been quite a few years now and the argument for this
must be getting weaker with every release.

I think I'm leaning towards a +1 on this as it seems a shame for people who
have no control over the queries sent to their database to have to be
excluded from the benefits of semi join and anti join.

Regards

David Rowley

Do we want this?
>
> It has come up in #postgresql, and at twice times on mailing lists:
> http://www.postgresql.org/message-id/53279529.2070...@freemail.hu
> http://www.postgresql.org/message-id/50a36820.4030...@pingpong.net
>
> And there may even be good reasons, such as writing performant
> portable SQL code for Other Databases:
> https://dev.mysql.com/doc/refman/5.1/en/optimizing-subqueries.html
>
>


[HACKERS] [PATCH] Simplify EXISTS subqueries containing LIMIT

2014-10-02 Thread Marti Raudsepp
Hi list,

Attached patch allows semijoin/antijoin/hashed SubPlan optimization
when an EXISTS subquery contains a LIMIT clause with a positive
constant. It seems to be a fairly common meme to put LIMIT 1 into
EXISTS() subqueries, and it even makes sense when you're not aware
that the database already does this optimization.

Do we want this?

It has come up in #postgresql, and at twice times on mailing lists:
http://www.postgresql.org/message-id/53279529.2070...@freemail.hu
http://www.postgresql.org/message-id/50a36820.4030...@pingpong.net

And there may even be good reasons, such as writing performant
portable SQL code for Other Databases:
https://dev.mysql.com/doc/refman/5.1/en/optimizing-subqueries.html


The code is fairly straightforward. The only ugly part is that I need
to call eval_const_expressions() on the LIMIT expression because
subquery_planner() does subquery optimizations before constant
folding. A "LIMIT 1" clause will actually produce an int8(1)
expression. And I have to drag along PlannerInfo for that.

If it fails to yield a constant we've done some useless work, but it
should be nothing compared to the caller doing a deep copy of the
whole subquery.

Regards,
Marti
From 3448408121e7e32a12fc16403c9d48bce63503f5 Mon Sep 17 00:00:00 2001
From: Marti Raudsepp 
Date: Wed, 1 Oct 2014 02:17:21 +0300
Subject: [PATCH] Simplify EXISTS subqueries containing LIMIT

---
 src/backend/optimizer/plan/subselect.c  | 40 +++--
 src/test/regress/expected/subselect.out | 14 
 src/test/regress/sql/subselect.sql  |  7 ++
 3 files changed, 54 insertions(+), 7 deletions(-)

diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 3e7dc85..09b153e 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -70,7 +70,7 @@ static Node *convert_testexpr_mutator(Node *node,
 static bool subplan_is_hashable(Plan *plan);
 static bool testexpr_is_hashable(Node *testexpr);
 static bool hash_ok_operator(OpExpr *expr);
-static bool simplify_EXISTS_query(Query *query);
+static bool simplify_EXISTS_query(PlannerInfo *root, Query *query);
 static Query *convert_EXISTS_to_ANY(PlannerInfo *root, Query *subselect,
 	  Node **testexpr, List **paramIds);
 static Node *replace_correlation_vars_mutator(Node *node, PlannerInfo *root);
@@ -452,7 +452,7 @@ make_subplan(PlannerInfo *root, Query *orig_subquery,
 	 * If it's an EXISTS subplan, we might be able to simplify it.
 	 */
 	if (subLinkType == EXISTS_SUBLINK)
-		simple_exists = simplify_EXISTS_query(subquery);
+		simple_exists = simplify_EXISTS_query(root, subquery);
 
 	/*
 	 * For an EXISTS subplan, tell lower-level planner to expect that only the
@@ -518,7 +518,7 @@ make_subplan(PlannerInfo *root, Query *orig_subquery,
 		/* Make a second copy of the original subquery */
 		subquery = (Query *) copyObject(orig_subquery);
 		/* and re-simplify */
-		simple_exists = simplify_EXISTS_query(subquery);
+		simple_exists = simplify_EXISTS_query(root, subquery);
 		Assert(simple_exists);
 		/* See if it can be converted to an ANY query */
 		subquery = convert_EXISTS_to_ANY(root, subquery,
@@ -1359,7 +1359,7 @@ convert_EXISTS_sublink_to_join(PlannerInfo *root, SubLink *sublink,
 	 * targetlist, we have to fail, because the pullup operation leaves us
 	 * with noplace to evaluate the targetlist.
 	 */
-	if (!simplify_EXISTS_query(subselect))
+	if (!simplify_EXISTS_query(root, subselect))
 		return NULL;
 
 	/*
@@ -1486,11 +1486,11 @@ convert_EXISTS_sublink_to_join(PlannerInfo *root, SubLink *sublink,
  * Returns TRUE if was able to discard the targetlist, else FALSE.
  */
 static bool
-simplify_EXISTS_query(Query *query)
+simplify_EXISTS_query(PlannerInfo *root, Query *query)
 {
 	/*
 	 * We don't try to simplify at all if the query uses set operations,
-	 * aggregates, modifying CTEs, HAVING, LIMIT/OFFSET, or FOR UPDATE/SHARE;
+	 * aggregates, modifying CTEs, HAVING, OFFSET, or FOR UPDATE/SHARE;
 	 * none of these seem likely in normal usage and their possible effects
 	 * are complex.
 	 */
@@ -1501,7 +1501,6 @@ simplify_EXISTS_query(Query *query)
 		query->hasModifyingCTE ||
 		query->havingQual ||
 		query->limitOffset ||
-		query->limitCount ||
 		query->rowMarks)
 		return false;
 
@@ -1513,6 +1512,33 @@ simplify_EXISTS_query(Query *query)
 		return false;
 
 	/*
+	 * LIMIT clause can be removed if it's a positive constant or ALL, to
+	 * prevent it from being an optimization barrier. It's a common meme to put
+	 * LIMIT 1 within EXISTS subqueries.
+	 */
+	if (query->limitCount)
+	{
+		/*
+		 * eval_const_expressions has not been called yet by subquery_planner,
+		 * may still contain int64 coercions etc.
+		 */
+		Node	   *node = eval_const_expressions(root, query->limitCount);
+		Const	   *limit;
+
+		if (! IsA(node, Const))
+			return false;
+
+		limit = (Const *) node;
+		Assert(limit->consttype == INT8OID);
+
+		/* Checking for negative val