Re: [HACKERS] Reference to parent query from ANY sublink

2013-12-13 Thread Antonin Houska
On 12/12/2013 04:36 PM, Tom Lane wrote:
 BTW, on further thought, I'm afraid this is a bigger can of worms than
 it appears.  The remarks above presume that the subquery is simple enough
 to be pulled up, which is the case in this example.  It might not be too
 hard to make that case work.  But what if the subquery *isn't* simple
 enough to be pulled up --- for instance, it includes grouping or
 aggregation?  Then there's no way to unify its WHERE clause with the upper
 semijoin qual.  At the very least, this breaks the uniqueify-then-do-a-
 plain-join implementation strategy for semijoins.

After having thought about it further, I think I understand.

 So I'm now thinking this patch isn't worth pursuing.  Getting all the
 corner cases right would be a significant amount of work, and in the
 end it would only benefit strangely-written queries.

Originally it seemed to me that I just (luckily) found a new opportunity
for the existing infrastructure. To change the infrastructure because of
this small feature would be exactly the opposite. Thanks for having
taken a look at it.

// Antonin Houska (Tony)



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reference to parent query from ANY sublink

2013-12-12 Thread Dimitri Fontaine
Tom Lane t...@sss.pgh.pa.us writes:
 Yeah, I found myself wishing for an EXPLAIN option that would show
 that.

 It's not hard to do ... how about the attached?

+1

 I chose to print grouping keys for both Agg and Group nodes, and to
 show them unconditionally.  There's some case maybe for only including
 them in verbose mode, but since sort keys are shown unconditionally,
 it seemed more consistent this way.

+1

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reference to parent query from ANY sublink

2013-12-12 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote:

 Hm, that means there's only one grouping column (and it's the second
 tlist entry of the child plan node).  So that seems conclusive that
 the unique-ification is being done wrong.

Further confirmation using the EXPLAIN patch with Antonin's v2
patch against the table before any EXPLAIN or ANALYZE:

 Hash Join  (cost=37.12..80.40 rows=442 width=12)
   Hash Cond: (((upper.f2)::double precision = lower.f3) AND (upper.f1 = 
lower.f2))
   -  Seq Scan on subselect_tbl upper  (cost=0.00..27.70 rows=1770 width=16)
   -  Hash  (cost=34.12..34.12 rows=200 width=12)
 -  HashAggregate  (cost=32.12..34.12 rows=200 width=12)
   Group Key: lower.f2
   -  Seq Scan on subselect_tbl lower  (cost=0.00..27.70 rows=1770 
width=12)

The additional information is so useful, I'm all for committing
that patch.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reference to parent query from ANY sublink

2013-12-12 Thread Tom Lane
Kevin Grittner kgri...@ymail.com writes:
 Further confirmation using the EXPLAIN patch with Antonin's v2
 patch against the table before any EXPLAIN or ANALYZE:

  Hash Join  (cost=37.12..80.40 rows=442 width=12)
    Hash Cond: (((upper.f2)::double precision = lower.f3) AND (upper.f1 = 
 lower.f2))
    -  Seq Scan on subselect_tbl upper  (cost=0.00..27.70 rows=1770 width=16)
    -  Hash  (cost=34.12..34.12 rows=200 width=12)
  -  HashAggregate  (cost=32.12..34.12 rows=200 width=12)
    Group Key: lower.f2
    -  Seq Scan on subselect_tbl lower  (cost=0.00..27.70 
 rows=1770 width=12)

That's about what I thought: it's unique-ifying according to the original
semijoin qual, without realizing that the pulled-up clause from the lower
WHERE would need to be treated as part of the semijoin qual.  This isn't
a bug in the existing code, because the case can never arise, since we
don't treat an IN/=ANY as a semijoin if the sub-select contains any
outer-level Vars.  But if you remove that check from
convert_ANY_sublink_to_join then you've got to deal with the problem.

That said, I'm not too sure where the problem is in detail.  I'd have
thought that subquery pullup would stick the subquery's WHERE clause
into the join quals of the parent JoinExpr node.  Is that not happening,
or is it perhaps not sufficient to cue the UniquePath machinery?

 The additional information is so useful, I'm all for committing
 that patch.

Will do.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reference to parent query from ANY sublink

2013-12-12 Thread Tom Lane
I wrote:
 That's about what I thought: it's unique-ifying according to the original
 semijoin qual, without realizing that the pulled-up clause from the lower
 WHERE would need to be treated as part of the semijoin qual.  This isn't
 a bug in the existing code, because the case can never arise, since we
 don't treat an IN/=ANY as a semijoin if the sub-select contains any
 outer-level Vars.  But if you remove that check from
 convert_ANY_sublink_to_join then you've got to deal with the problem.

 That said, I'm not too sure where the problem is in detail.  I'd have
 thought that subquery pullup would stick the subquery's WHERE clause
 into the join quals of the parent JoinExpr node.  Is that not happening,
 or is it perhaps not sufficient to cue the UniquePath machinery?

BTW, on further thought, I'm afraid this is a bigger can of worms than
it appears.  The remarks above presume that the subquery is simple enough
to be pulled up, which is the case in this example.  It might not be too
hard to make that case work.  But what if the subquery *isn't* simple
enough to be pulled up --- for instance, it includes grouping or
aggregation?  Then there's no way to unify its WHERE clause with the upper
semijoin qual.  At the very least, this breaks the uniqueify-then-do-a-
plain-join implementation strategy for semijoins.

So I'm now thinking this patch isn't worth pursuing.  Getting all the
corner cases right would be a significant amount of work, and in the
end it would only benefit strangely-written queries.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reference to parent query from ANY sublink

2013-12-11 Thread Kevin Grittner
Kevin Grittner kgri...@ymail.com wrote:

 I applied it to master and ran the regression tests, and one of
 the subselect tests failed.

 This query:

 SELECT '' AS six, f1 AS Correlated Field, f3 AS Second
 Field
   FROM SUBSELECT_TBL upper
   WHERE f1 IN
 (SELECT f2 FROM SUBSELECT_TBL WHERE CAST(upper.f2 AS float) = f3);

 [ ... ] during the `make check` or `make install-check` [ ... ]
 is missing the last two rows.  Oddly, if I go into the database
 later and try it, the rows show up.  It's not immediately
 apparent to me what's wrong.

Using the v2 patch, with the default statistics from table
creation, the query modified with an alias of lower for the
second reference, just for clarity, yields a plan which generates
incorrect results:

 Hash Join  (cost=37.12..80.40 rows=442 width=12) (actual time=0.059..0.064 
rows=3 loops=1)
   Hash Cond: (((upper.f2)::double precision = lower.f3) AND (upper.f1 = 
lower.f2))
   -  Seq Scan on subselect_tbl upper  (cost=0.00..27.70 rows=1770 width=16) 
(actual time=0.006..0.007 rows=8 loops=1)
   -  Hash  (cost=34.12..34.12 rows=200 width=12) (actual time=0.020..0.020 
rows=5 loops=1)
 Buckets: 1024  Batches: 1  Memory Usage: 1kB
 -  HashAggregate  (cost=32.12..34.12 rows=200 width=12) (actual 
time=0.014..0.018 rows=6 loops=1)
   -  Seq Scan on subselect_tbl lower  (cost=0.00..27.70 rows=1770 
width=12) (actual time=0.002..0.004 rows=8 loops=1)
 Total runtime: 0.111 ms

As soon as there is a VACUUM and/or ANALYZE it generates a plan
more like what the OP was hoping for:

 Hash Semi Join  (cost=1.20..2.43 rows=6 width=12) (actual time=0.031..0.036 
rows=5 loops=1)
   Hash Cond: (((upper.f2)::double precision = lower.f3) AND (upper.f1 = 
lower.f2))
   -  Seq Scan on subselect_tbl upper  (cost=0.00..1.08 rows=8 width=16) 
(actual time=0.004..0.007 rows=8 loops=1)
   -  Hash  (cost=1.08..1.08 rows=8 width=12) (actual time=0.012..0.012 rows=7 
loops=1)
 Buckets: 1024  Batches: 1  Memory Usage: 1kB
 -  Seq Scan on subselect_tbl lower  (cost=0.00..1.08 rows=8 width=12) 
(actual time=0.003..0.005 rows=8 loops=1)
 Total runtime: 0.074 ms

By comparison, without the patch this is the plan:

 Seq Scan on subselect_tbl upper  (cost=0.00..5.59 rows=4 width=12) (actual 
time=0.022..0.037 rows=5 loops=1)
   Filter: (SubPlan 1)
   Rows Removed by Filter: 3
   SubPlan 1
 -  Seq Scan on subselect_tbl lower  (cost=0.00..1.12 rows=1 width=4) 
(actual time=0.002..0.003 rows=1 loops=8)
   Filter: ((upper.f2)::double precision = f3)
   Rows Removed by Filter: 4
 Total runtime: 0.066 ms

When I run the query with fresh statistics and without EXPLAIN both
ways, the unpatched is consistently about 10% faster.

So pulling up the subquery can yield an incorrect plan, and even
when it yields the desired plan with the semi-join it is
marginally slower than using the subplan, at least for this small
data set.  I think it's an interesting idea, but it still needs
work.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reference to parent query from ANY sublink

2013-12-11 Thread Tom Lane
Kevin Grittner kgri...@ymail.com writes:
 Kevin Grittner kgri...@ymail.com wrote:
 I applied it to master and ran the regression tests, and one of
 the subselect tests failed.
 
 This query:
 
 SELECT '' AS six, f1 AS Correlated Field, f3 AS Second
 Field
    FROM SUBSELECT_TBL upper
    WHERE f1 IN
  (SELECT f2 FROM SUBSELECT_TBL WHERE CAST(upper.f2 AS float) = f3);

 [ ... ] during the `make check` or `make install-check` [ ... ]
 is missing the last two rows.  Oddly, if I go into the database
 later and try it, the rows show up.  It's not immediately
 apparent to me what's wrong.

 Using the v2 patch, with the default statistics from table
 creation, the query modified with an alias of lower for the
 second reference, just for clarity, yields a plan which generates
 incorrect results:

  Hash Join  (cost=37.12..80.40 rows=442 width=12) (actual time=0.059..0.064 
 rows=3 loops=1)
    Hash Cond: (((upper.f2)::double precision = lower.f3) AND (upper.f1 = 
 lower.f2))
    -  Seq Scan on subselect_tbl upper  (cost=0.00..27.70 rows=1770 width=16) 
 (actual time=0.006..0.007 rows=8 loops=1)
    -  Hash  (cost=34.12..34.12 rows=200 width=12) (actual time=0.020..0.020 
 rows=5 loops=1)
  Buckets: 1024  Batches: 1  Memory Usage: 1kB
  -  HashAggregate  (cost=32.12..34.12 rows=200 width=12) (actual 
 time=0.014..0.018 rows=6 loops=1)
    -  Seq Scan on subselect_tbl lower  (cost=0.00..27.70 
 rows=1770 width=12) (actual time=0.002..0.004 rows=8 loops=1)
  Total runtime: 0.111 ms

FWIW, that plan isn't obviously wrong; if it is broken, most likely the
reason is that the HashAggregate is incorrectly unique-ifying the lower
table.  (Unfortunately, EXPLAIN doesn't show enough about the HashAgg
to know what it's doing exactly.)  The given query is, I think, in
principle equivalent to

 SELECT ...
  FROM SUBSELECT_TBL upper
  WHERE (f1, f2::float) IN
(SELECT f2, f3 FROM SUBSELECT_TBL);

and if you ask unmodified HEAD to plan that you get

 Hash Join  (cost=41.55..84.83 rows=442 width=16)
   Hash Cond: ((upper.f1 = subselect_tbl.f2) AND ((upper.f2)::double precision 
= subselect_tbl.f3))
   -  Seq Scan on subselect_tbl upper  (cost=0.00..27.70 rows=1770 width=16)
   -  Hash  (cost=38.55..38.55 rows=200 width=12)
 -  HashAggregate  (cost=36.55..38.55 rows=200 width=12)
   -  Seq Scan on subselect_tbl  (cost=0.00..27.70 rows=1770 
width=12)

which is the same thing at the visible level of detail ... but this
version computes the correct result.  The cost of the HashAggregate is
estimated higher, though, which suggests that maybe it's distinct'ing on
two columns where the bogus plan only does one.

Not sure about where Antonin's patch is going off the rails.  I suspect
it's too simple somehow, but it's also possible that it's OK and the
real issue is some previously undetected bug in LATERAL processing.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reference to parent query from ANY sublink

2013-12-11 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote:

 FWIW, that plan isn't obviously wrong; if it is broken, most
 likely the reason is that the HashAggregate is incorrectly
 unique-ifying the lower table.  (Unfortunately, EXPLAIN doesn't
 show enough about the HashAgg to know what it's doing exactly.)

Yeah, I found myself wishing for an EXPLAIN option that would show
that.

 The cost of the HashAggregate is estimated higher, though, which
 suggests that maybe it's distinct'ing on two columns where the
 bogus plan only does one.

FWIW, I noticed that the actual row counts suggested that, too.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reference to parent query from ANY sublink

2013-12-11 Thread Tom Lane
Kevin Grittner kgri...@ymail.com writes:
 Tom Lane t...@sss.pgh.pa.us wrote:
 FWIW, that plan isn't obviously wrong; if it is broken, most
 likely the reason is that the HashAggregate is incorrectly
 unique-ifying the lower table.  (Unfortunately, EXPLAIN doesn't
 show enough about the HashAgg to know what it's doing exactly.)

 Yeah, I found myself wishing for an EXPLAIN option that would show
 that.

It's not hard to do ... how about the attached?

I chose to print grouping keys for both Agg and Group nodes, and to
show them unconditionally.  There's some case maybe for only including
them in verbose mode, but since sort keys are shown unconditionally,
it seemed more consistent this way.

regards, tom lane

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index bd5428d..9969a25 100644
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
*** static void show_sort_keys(SortState *so
*** 76,84 
  			   ExplainState *es);
  static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
  	   ExplainState *es);
! static void show_sort_keys_common(PlanState *planstate,
! 	  int nkeys, AttrNumber *keycols,
! 	  List *ancestors, ExplainState *es);
  static void show_sort_info(SortState *sortstate, ExplainState *es);
  static void show_hash_info(HashState *hashstate, ExplainState *es);
  static void show_instrumentation_count(const char *qlabel, int which,
--- 76,88 
  			   ExplainState *es);
  static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
  	   ExplainState *es);
! static void show_agg_keys(AggState *astate, List *ancestors,
! 			  ExplainState *es);
! static void show_group_keys(GroupState *gstate, List *ancestors,
! ExplainState *es);
! static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
! 	 int nkeys, AttrNumber *keycols,
! 	 List *ancestors, ExplainState *es);
  static void show_sort_info(SortState *sortstate, ExplainState *es);
  static void show_hash_info(HashState *hashstate, ExplainState *es);
  static void show_instrumentation_count(const char *qlabel, int which,
*** ExplainNode(PlanState *planstate, List *
*** 1341,1347 
--- 1345,1358 
  		   planstate, es);
  			break;
  		case T_Agg:
+ 			show_agg_keys((AggState *) planstate, ancestors, es);
+ 			show_upper_qual(plan-qual, Filter, planstate, ancestors, es);
+ 			if (plan-qual)
+ show_instrumentation_count(Rows Removed by Filter, 1,
+ 		   planstate, es);
+ 			break;
  		case T_Group:
+ 			show_group_keys((GroupState *) planstate, ancestors, es);
  			show_upper_qual(plan-qual, Filter, planstate, ancestors, es);
  			if (plan-qual)
  show_instrumentation_count(Rows Removed by Filter, 1,
*** show_sort_keys(SortState *sortstate, Lis
*** 1693,1701 
  {
  	Sort	   *plan = (Sort *) sortstate-ss.ps.plan;
  
! 	show_sort_keys_common((PlanState *) sortstate,
! 		  plan-numCols, plan-sortColIdx,
! 		  ancestors, es);
  }
  
  /*
--- 1704,1712 
  {
  	Sort	   *plan = (Sort *) sortstate-ss.ps.plan;
  
! 	show_sort_group_keys((PlanState *) sortstate, Sort Key,
! 		 plan-numCols, plan-sortColIdx,
! 		 ancestors, es);
  }
  
  /*
*** show_merge_append_keys(MergeAppendState 
*** 1707,1720 
  {
  	MergeAppend *plan = (MergeAppend *) mstate-ps.plan;
  
! 	show_sort_keys_common((PlanState *) mstate,
! 		  plan-numCols, plan-sortColIdx,
! 		  ancestors, es);
  }
  
  static void
! show_sort_keys_common(PlanState *planstate, int nkeys, AttrNumber *keycols,
! 	  List *ancestors, ExplainState *es)
  {
  	Plan	   *plan = planstate-plan;
  	List	   *context;
--- 1718,1773 
  {
  	MergeAppend *plan = (MergeAppend *) mstate-ps.plan;
  
! 	show_sort_group_keys((PlanState *) mstate, Sort Key,
! 		 plan-numCols, plan-sortColIdx,
! 		 ancestors, es);
  }
  
+ /*
+  * Show the grouping keys for an Agg node.
+  */
  static void
! show_agg_keys(AggState *astate, List *ancestors,
! 			  ExplainState *es)
! {
! 	Agg		   *plan = (Agg *) astate-ss.ps.plan;
! 
! 	if (plan-numCols  0)
! 	{
! 		/* The key columns refer to the tlist of the child plan */
! 		ancestors = lcons(astate, ancestors);
! 		show_sort_group_keys(outerPlanState(astate), Group Key,
! 			 plan-numCols, plan-grpColIdx,
! 			 ancestors, es);
! 		ancestors = list_delete_first(ancestors);
! 	}
! }
! 
! /*
!  * Show the grouping keys for a Group node.
!  */
! static void
! show_group_keys(GroupState *gstate, List *ancestors,
! ExplainState *es)
! {
! 	Group	   *plan = (Group *) gstate-ss.ps.plan;
! 
! 	/* The key columns refer to the tlist of the child plan */
! 	ancestors = lcons(gstate, ancestors);
! 	show_sort_group_keys(outerPlanState(gstate), Group Key,
! 		 plan-numCols, plan-grpColIdx,
! 		 ancestors, es);
! 	ancestors = list_delete_first(ancestors);
! }
! 
! /*
!  * Common code to show 

Re: [HACKERS] Reference to parent query from ANY sublink

2013-12-11 Thread Antonin Houska
On 12/11/2013 10:15 PM, Tom Lane wrote:
 
 FWIW, that plan isn't obviously wrong; if it is broken, most likely the
 reason is that the HashAggregate is incorrectly unique-ifying the lower
 table.  (Unfortunately, EXPLAIN doesn't show enough about the HashAgg
 to know what it's doing exactly.)  The given query is, I think, in
 principle equivalent to
 
  SELECT ...
   FROM SUBSELECT_TBL upper
   WHERE (f1, f2::float) IN
 (SELECT f2, f3 FROM SUBSELECT_TBL);
 
 and if you ask unmodified HEAD to plan that you get
 
  Hash Join  (cost=41.55..84.83 rows=442 width=16)
Hash Cond: ((upper.f1 = subselect_tbl.f2) AND ((upper.f2)::double 
 precision = subselect_tbl.f3))
-  Seq Scan on subselect_tbl upper  (cost=0.00..27.70 rows=1770 width=16)
-  Hash  (cost=38.55..38.55 rows=200 width=12)
  -  HashAggregate  (cost=36.55..38.55 rows=200 width=12)
-  Seq Scan on subselect_tbl  (cost=0.00..27.70 rows=1770 
 width=12)

Before I opened your mail, I also recalled the technique that I noticed
in the planner code, to evaluate SEMI JOIN as INNER JOIN with the RHS
uniquified, so also thought it could be about the uniquification.

 which is the same thing at the visible level of detail ... but this
 version computes the correct result.  The cost of the HashAggregate is
 estimated higher, though, which suggests that maybe it's distinct'ing on
 two columns where the bogus plan only does one.

debug_print_plan output contains
:grpColIdx 2
in the AGG node. I think this corresponds to the join condition, which
IMO should be
(upper.f1 = subselect_tbl.f2)
while the other condition was not in the list of join clauses and
therefore ignored for the uniquification's purpose.

And gdb tells me that create_unique_path() never gets more than 1
clause. I can't tell whether it should do just for this special purpose.

 Not sure about where Antonin's patch is going off the rails.  I suspect
 it's too simple somehow, but it's also possible that it's OK and the
 real issue is some previously undetected bug in LATERAL processing.

So far I have no idea how to achieve such conditions without this patch.
Thanks for your comments.

// Antonin Houska (Tony)




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reference to parent query from ANY sublink

2013-12-11 Thread Tom Lane
Antonin Houska antonin.hou...@gmail.com writes:
 debug_print_plan output contains
 :grpColIdx 2
 in the AGG node.

Hm, that means there's only one grouping column (and it's the second
tlist entry of the child plan node).  So that seems conclusive that
the unique-ification is being done wrong.  It's not very clear why
though.  It doesn't seem like your patch is doing anything that
would directly affect that.

For comparison purposes, using the patch I just posted, I get
this description of a correct plan:

regression=# explain verbose SELECT *
  FROM SUBSELECT_TBL upper
  WHERE (f1, f2::float) IN
(SELECT f2, f3 FROM SUBSELECT_TBL);
 QUERY PLAN 


 Hash Join  (cost=41.55..84.83 rows=442 width=16)
   Output: upper.f1, upper.f2, upper.f3
   Hash Cond: ((upper.f1 = subselect_tbl.f2) AND ((upper.f2)::double precision 
= subselect_tbl.f3))
   -  Seq Scan on public.subselect_tbl upper  (cost=0.00..27.70 rows=1770 
width=16)
 Output: upper.f1, upper.f2, upper.f3
   -  Hash  (cost=38.55..38.55 rows=200 width=12)
 Output: subselect_tbl.f2, subselect_tbl.f3
 -  HashAggregate  (cost=36.55..38.55 rows=200 width=12)
   Output: subselect_tbl.f2, subselect_tbl.f3
   Group Key: subselect_tbl.f2, subselect_tbl.f3
   -  Seq Scan on public.subselect_tbl  (cost=0.00..27.70 
rows=1770 width=12)
 Output: subselect_tbl.f1, subselect_tbl.f2, 
subselect_tbl.f3
(12 rows)

so it's unique-ifying on both f2 and f3, which is clearly necessary
for executing the IN with a plain join.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reference to parent query from ANY sublink

2013-12-10 Thread Antonin Houska
On 12/06/2013 03:33 PM, Kevin Grittner wrote:
 Antonin Houska antonin.hou...@gmail.com wrote:
 
 SELECT *
 FROMtab1 a
  LEFT JOIN
  tab2 b
  ON a.i = ANY (
  SELECT  k
  FROMtab3 c
  WHEREk = a.i);
 
 This query works with k in any or all tables, but the semantics
 certainly vary depending on where k happens to be.  It would help a
 lot if you showed SQL statements to create and populate the tables
 involved and/or if you qualified all referenced column names with
 the table alias to avoid ambiguity.

I used the DDLs attached (tables.ddl) for this query too, not only for
the queries in quaries.sql. Yes, if I had mentioned it and/or qualified
the 'k' column reference, it wouldn't have broken anything.

 If I assume that the k reference is supposed to be a column in
 tab3, what you have is a query where you always get all rows from
 tab1, and for each row from tab1 you either match it to all rows
 from tab2 or no rows from tab2 depending on whether the tab1 row
 has a match in tab3.

I concede this particular query is not useful. But the important thing
to consider here is which side of the LEFT JOIN the subquery references.

 SELECT  *
 FROMtab1 a
  LEFT JOIN
  (
SELECT *
tab2 b
SEMI JOIN
(  SELECT  k
FROMtab3 c
WHERE  k = a.i
) AS ANY_subquery
ON a.i = ANY_subquery.k
  ) AS SJ_subquery
  ON true;
 
 It is hard to see what you intend here, since this is not valid
 syntax.

This is what I - after having read the related source code - imagine to
happen internally when the ANY predicate of the first query is being
processed. In fact it should become something like this (also internal
stuff)

SELECT  *
FROMtab1 a
LEFT JOIN
(
  tab2 b
  SEMI JOIN
  (  SELECT  k
  FROMtab3 c
  WHERE  k = a.i
  ) AS ANY_subquery
  ON a.i = ANY_subquery.k
)
ON true;

that is, SEMI JOIN node inserted into the tree rather than a subquery
(SJ_subquery). I posted the construct with SJ_subquery to show how I
thought about the problem: I thought it's safe (even though not
necessarily beautiful) to wrap the SEMI JOIN into the SJ_subquery and
let the existing infrastructure decide whether it's legal to turn it
into a join node. I concluded that the subquery's references to the tab1
ensure that SJ_subquery won't be flattened, so the patch does nothing if
such a reference exists.

 PostgreSQL supports semi-joins; but that is an implementation detail
 for the EXISTS or IN syntax.

... and for ANY, see subselect.c:convert_ANY_sublink_to_join()

 Could you clarify your intent?

To get rid of a subplan in some cases that require it so far: when the
subquery references table exactly 1 level higher (i.e. the immediate
parent query).

(I got the idea while reading the source code, as opposed to query
tuning.)

// Antonin Houska (Tony)

 --
 Kevin Grittner
 EDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company
 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reference to parent query from ANY sublink

2013-12-10 Thread Kevin Grittner
Antonin Houska antonin.hou...@gmail.com wrote:

 I used the DDLs attached (tables.ddl) for this query too, not
 only for the queries in quaries.sql. Yes, if I had mentioned it
 and/or qualified the 'k' column reference, it wouldn't have
 broken anything.

Apologies; I missed the attachments.  It makes a lot more sense now
that I see those.

I see this was a patch originally posted on 2013-10-31 and not
added to the CommitFest.

I applied it to master and ran the regression tests, and one of the
subselect tests failed.

This query:

SELECT '' AS six, f1 AS Correlated Field, f3 AS Second Field
  FROM SUBSELECT_TBL upper
  WHERE f1 IN
    (SELECT f2 FROM SUBSELECT_TBL WHERE CAST(upper.f2 AS float) = f3);

Should have this for a result set:

 six | Correlated Field | Second Field
-+--+--
 |    2 |    4
 |    3 |    5
 |    1 |    1
 |    2 |    2
 |    3 |    3
(5 rows)

But during the `make check` or `make install-check` it is missing
the last two rows.  Oddly, if I go into the database later and try
it, the rows show up.  It's not immediately apparent to me what's
wrong.

Will look again later, or maybe someone more familiar with the
planner can spot the problem.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reference to parent query from ANY sublink

2013-12-06 Thread Kevin Grittner
Antonin Houska antonin.hou...@gmail.com wrote:

 SELECT *
 FROM    tab1 a
 LEFT JOIN
 tab2 b
 ON a.i = ANY (
 SELECT  k
 FROM    tab3 c
 WHERE    k = a.i);

This query works with k in any or all tables, but the semantics
certainly vary depending on where k happens to be.  It would help a
lot if you showed SQL statements to create and populate the tables
involved and/or if you qualified all referenced column names with
the table alias to avoid ambiguity.

If I assume that the k reference is supposed to be a column in
tab3, what you have is a query where you always get all rows from
tab1, and for each row from tab1 you either match it to all rows
from tab2 or no rows from tab2 depending on whether the tab1 row
has a match in tab3.

test=# create table tab1 (i int);
CREATE TABLE
test=# create table tab2 (j int);
CREATE TABLE
test=# create table tab3 (k int);
CREATE TABLE
test=# insert into tab1 values (1), (2), (3);
INSERT 0 3
test=# insert into tab2 values (4), (5), (6);
INSERT 0 3
test=# insert into tab3 values (1), (3);
INSERT 0 2
test=# SELECT *
FROM    tab1 a
    LEFT JOIN
    tab2 b
    ON a.i = ANY (
    SELECT  k    
    FROM    tab3 c
    WHERE    k = a.i);
 i | j
---+---
 1 | 4
 1 | 5
 1 | 6
 2 | 
 3 | 4
 3 | 5
 3 | 6
(7 rows)

 SELECT  *
 FROM    tab1 a
 LEFT JOIN
 (
   SELECT *
   tab2 b
   SEMI JOIN
   (  SELECT  k
   FROM    tab3 c
   WHERE  k = a.i
   ) AS ANY_subquery
   ON a.i = ANY_subquery.k
 ) AS SJ_subquery
 ON true;

It is hard to see what you intend here, since this is not valid
syntax.  I assume you want a FROM keyword before the tab2
reference, but it's less clear what you intend with the SEMI JOIN
syntax.  PostgreSQL supports semi-joins; but that is an
implementation detail for the EXISTS or IN syntax.  Could you
clarify your intent?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reference to parent query from ANY sublink

2013-12-06 Thread Kevin Grittner
Kevin Grittner kgri...@ymail.com wrote:

 test=# SELECT *
 FROM    tab1 a
 LEFT JOIN
 tab2 b
 ON a.i = ANY (
 SELECT  k    
 FROM    tab3 c
 WHERE    k = a.i);
  i | j
 ---+---
  1 | 4
  1 | 5
  1 | 6
  2 |
  3 | 4
  3 | 5
  3 | 6
 (7 rows)

 SELECT  *
 FROM    tab1 a
  LEFT JOIN
  (
    SELECT *
    tab2 b
    SEMI JOIN
    (  SELECT  k
    FROM    tab3 c
    WHERE  k = a.i
    ) AS ANY_subquery
    ON a.i = ANY_subquery.k
  ) AS SJ_subquery
  ON true;

 It is hard to see what you intend here

Perhaps you were looking for a way to formulate it something like
this?:

test=# SELECT *
test-#   FROM tab1 a
test-#   LEFT JOIN LATERAL
test-# (
test(#   SELECT *
test(# FROM tab2 b
test(# WHERE EXISTS
test(# (
test(#   SELECT *
test(# FROM tab3 c
test(# WHERE c.k = a.i
test(# )
test(# ) AS SJ_subquery
test-# ON true;
 i | j 
---+---
 1 | 4
 1 | 5
 1 | 6
 2 |  
 3 | 4
 3 | 5
 3 | 6
(7 rows)

Without LATERAL you get an error:

ERROR:  invalid reference to FROM-clause entry for table a
LINE 11: WHERE c.k = a.i
 ^

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reference to parent query from ANY sublink

2013-12-05 Thread Antonin Houska
On 10/31/2013 09:37 PM, Antonin Houska wrote:
 On 10/31/2013 03:46 PM, Antonin Houska wrote:
 I'm not sure if it's legal for the WHERE clause to reference LHS of the
 original outer join (a.j). Some more restriction may be needed. I need
 to think about it a bit more.

For a subquery or sublink expression referencing the outer table of an
OJ (see tab1)

SELECT *
FROMtab1 a
LEFT JOIN
tab2 b
ON a.i = ANY (
SELECT  k
FROMtab3 c
WHERE   k = a.i);

I started my considerations by inserting the SEMI JOIN in a form of
subquery, instead of a join node - see SJ_subquery here:

SELECT  *
FROMtab1 a
LEFT JOIN
(
   SELECT *
   tab2 b
   SEMI JOIN
   (  SELECT  k
  FROMtab3 c
  WHERE   k = a.i
   ) AS ANY_subquery
   ON a.i = ANY_subquery.k
) AS SJ_subquery
ON true;

(To allow a.i in the sublink expression, we'd only need to pass both
tab1 and tab2 to pull_up_sublinks_qual_recurse() in available_rels1.)

However it seem to be these lateral references (from the subquery and/or
the sublink expression) to tab1 that make it impossible for
SJ_subquery to be pulled up into the parent query's join tree - see
jointree_contains_lateral_outer_refs(). I'm not sure if it makes much
sense to pull up the sublink in such a case, does it?

I ended up with this logic: if the join is INNER, both the subquery and
sublink expression can reference either side. If the join is OUTER, only
the inner side can be referenced. Otherwise no attempt to introduce the
SEMI JOIN.

Can this be considered a patch, or is it wrong/incomplete?

// Antonin Houska (Tony)



diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index d8cabbd..6095008 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -1167,6 +1167,7 @@ convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
 	JoinExpr   *result;
 	Query	   *parse = root-parse;
 	Query	   *subselect = (Query *) sublink-subselect;
+	bool	subselectLateral;
 	Relids		upper_varnos;
 	int			rtindex;
 	RangeTblEntry *rte;
@@ -1177,11 +1178,10 @@ convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
 	Assert(sublink-subLinkType == ANY_SUBLINK);
 
 	/*
-	 * The sub-select must not refer to any Vars of the parent query. (Vars of
-	 * higher levels should be okay, though.)
+	 * If the subselect refers to any Vars of the parent query, it must be
+	 * marked LATERAL.
 	 */
-	if (contain_vars_of_level((Node *) subselect, 1))
-		return NULL;
+	subselectLateral = contain_vars_of_level((Node *) subselect, 1);
 
 	/*
 	 * The test expression must contain some Vars of the parent query, else
@@ -1199,6 +1199,20 @@ convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
 		return NULL;
 
 	/*
+	 * Neither the subselect can refer to outside available_rels.
+	 * (Such lateral references would prevent the resulting subquery from
+	 * being reached by pull_up_simple_subquery.)
+	 */
+	if (subselectLateral)
+	{
+		Relids sub_varnos = pull_varnos_of_level((Node *) subselect, 1);
+
+		if (!bms_is_subset(sub_varnos, available_rels))
+			return NULL;
+	}
+
+
+	/*
 	 * The combining operators and left-hand expressions mustn't be volatile.
 	 */
 	if (contain_volatile_functions(sublink-testexpr))
@@ -1215,7 +1229,7 @@ convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
 	rte = addRangeTableEntryForSubquery(NULL,
 		subselect,
 		makeAlias(ANY_subquery, NIL),
-		false,
+		subselectLateral,
 		false);
 	parse-rtable = lappend(parse-rtable, rte);
 	rtindex = list_length(parse-rtable);
CREATE TABLE tab1(i int);
CREATE TABLE tab2(j int);
CREATE TABLE tab3(k int);
EXPLAIN
SELECT *
FROM	tab1 a
	JOIN
	tab2 b
	ON b.j = ANY (
		SELECT  k
		FROM	tab3
		WHERE k = b.j);

EXPLAIN
SELECT *
FROM	tab1 a
	JOIN
	tab2 b
	-- Available_rels1 contains both 'a' and 'b'.
	-- Both sublink expression and the subselect can refer to either side
	-- of the JOIN.
	ON a.i = ANY (
		SELECT  k
		FROM	tab3
		WHERE k = a.i);

EXPLAIN
SELECT *
FROM	tab1 a
	LEFT JOIN
	tab2 b
	ON b.j = ANY (
		SELECT  k
		FROM	tab3
		WHERE k = b.j);

EXPLAIN
SELECT *
FROM	tab1 a
	LEFT JOIN
	tab2 b
	ON b.j = ANY (
		SELECT  k
		FROM	tab3
		-- Lateral reference to 'a', no sublink pull-up
	 	WHERE k = a.i); 

EXPLAIN
SELECT *
FROM	tab1 a
	LEFT JOIN
	tab2 b
	-- Lateral reference to 'a', no sublink pull-up
	ON a.i = ANY (
		SELECT  k
		FROM	tab3
	 	WHERE k = b.j); 
QUERY PLAN
--
 Nested Loop  (cost=0.00..84113.00 rows=288 width=8)
   -  Seq Scan on tab1 a  (cost=0.00..34.00 rows=2400 width=4)
   -  Materialize  (cost=0.00..48082.00 rows=1200 width=4)
 -  Seq Scan on tab2 b  (cost=0.00..48076.00 

Re: [HACKERS] Reference to parent query from ANY sublink

2013-10-31 Thread Antonin Houska
On 10/31/2013 03:46 PM, Antonin Houska wrote:
 Can the change be as simple as this or do I neglect anything?

Well, the example of outer join is wrong. Instead I think query

SELECT *
FROMtab1 a
LEFT JOIN
tab1 b
ON b.i = ANY (
SELECT  tab2.k
FROMtab2
WHERE   k = a.j);


should be converted to

SELECT  *
FROMtab1 a
LEFT JOIN
(  tab1 b
   LATERAL SEMI JOIN
   (  SELECT  tab2.k
  FROM  tab2
  WHERE k = a.j
   ) AS ANY_subquery
   ON b.i = sub.k
)

I'm not sure if it's legal for the WHERE clause to reference LHS of the
original outer join (a.j). Some more restriction may be needed. I need
to think about it a bit more.

// Tony



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers