Re: Question about pull_up_sublinks_qual_recurse

2023-03-20 Thread Andy Fan
On Sat, Oct 15, 2022 at 3:27 AM Tom Lane  wrote:

> Andy Fan  writes:
> > After some more self review,  I find my proposal has the following side
> > effects.
>
> Yeah, I do not think this works at all.  

The discussion of outer join
> reordering in optimizer/README says that that doesn't work, and while
> I'm too lazy to construct an example right now, I believe it's true.
>

I came to this topic again recently and have finally figured out the
reason.  It looks to me that semi join is slightly different with outer
join in this case.

The following test case can show why we have to
pull_up_sublinks_qual_recurse to either LHS or RHS rather than the
JoinExpr.

create table t1(a int, b int, c int);
create table t2(a int, b int, c int);
create table t3(a int, b int, c int);

insert into t1 select 1, 1, 2;
insert into t2 select 1, 2, 1;
insert into t2 select 1, 1, 2;
insert into t3 select 1, 1, 2;

select * from t1
where exists (select 1 from t2
-- below references to t1 and t2 at the same time
where exists (select 1 from t3
   where t1.c = t2.c and t2.b = t3.b)
and t1.a = t2.a);

which can be transformed to

SELECT * FROM t1 SEMI JOIN t2
ON t1.a = t2.a
AND exists (select 1 from t3
where t1.c = t2.c
and t2.b = t3.b)

Here the semantics of the query is return the rows in T1 iff there is a
row in t2 matches the whole clause (t1.a = t2.a AND exists..);

But If we transform it to

SELECT * FROM (t1 SEMI JOIN t2
ON t1.a = t2.a) SEMI JOIN t3
on t1.c = t2.c and t2.b = t3.b;

The scan on T2 would stop if ONLY (t1.a = t2.a) matches and the following
rows will be ignored. However the matched rows may doesn't match the
exists clause! So in the above example, the correct result set will be 1
row. If we pull up the sublink above the JoinExpr, no row would be found.

The attached is just a comment and a test case to help understand why we
have to do things like this.

-- 
Best Regards
Andy Fan


v1-0001-test-case-and-comments-to-help-understand-why-we-.patch
Description: Binary data


Re: Question about pull_up_sublinks_qual_recurse

2022-10-17 Thread Andy Fan
Hi Tom:

Thanks for your reply! I have self reviewed the below message at 3 different
time periods to prevent from too inaccurate replies. It may be more detailed
than it really needed, but it probably can show where I am lost.

On Sat, Oct 15, 2022 at 3:27 AM Tom Lane  wrote:

>
> If the pulled-up join doesn't go into the nullable side of the upper
> join then you've changed semantics.  In this case, it'd amount to

reassociating a semijoin that was within the righthand side of another
> semijoin to above that other semijoin.


I understand your reply as:

select * from t1 *left join* t2 on exists (select 1 from t3 where t3.a =
t2.a);
= select * from t1 *left join* (t2 semi join t3 on t3.a = t2.a) on true;
-- go to nullable side
!=  select * from (t1 *left join* t2 on true) semi join t3 on (t3.a =
t2.a);  -- go to above the JoinExpr

I CAN follow the above.  And for this case it is controlled by below code:

pull_up_sublinks_qual_recurse

switch (j->jointype)
{
  case JOIN_INNER:
   ...
  case JOIN_LEFT:
j->quals = pull_up_sublinks_qual_recurse(root, j->quals,

   >rarg,

   rightrelids,

   NULL, NULL);
  break;
...
}

and I didn't change this.   My question is could we assume

A *semijoin* B ON EXISTS (SELECT 1 FROM C on (Pbc))
=  (A *semijoin* (B *semijoin* C on (Pbc))) on TRUE.  (current master did)
=  (A *semijoin* B ON true) *semijoin* C on (Pbc)(my current
thinking)

Note that there is no 'left outer join' at this place.   Since there are too
many places called pull_up_sublinks_qual_recurse, to make things
less confused, I prepared a patch for this one line change to show where
exactly I changed (see patch 2);  I think this is the first place I lost.

The discussion of outer join
> reordering in optimizer/README says that that doesn't work,


I think you are talking about the graph "Valid OUTER JOIN Optimizations".
I can follow until below.

"
SEMI joins work a little bit differently.  A semijoin can be reassociated
into or out of the lefthand side of another semijoin, left join, or
antijoin, but not into or out of the righthand side.  ..
"
I am unclear why
 (A semijoin B on (Pab)) semijoin C on (Pbc)
!= A semijoin (B semijoin C on (Pbc)) on (Pab);

Seems both return rows from A which match both semijoin (Pab) and
(Pbc).  or I misunderstand the above words in the first place?

At last, when I checked optimizer/README, it looks like we used
a 'nullable side'  while it should be 'nonnullable side'? see patch 1
for details.

-- 
Best Regards
Andy Fan
From 8327cfcf969704e9db73bc43e8f270b70263e493 Mon Sep 17 00:00:00 2001
From: Andy Fan 
Date: Mon, 17 Oct 2022 08:38:41 +0800
Subject: [PATCH v1 1/2] a typo error? I think it should be nonnullable side
 rather thanks

nullable side.
---
 src/backend/optimizer/README | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/optimizer/README b/src/backend/optimizer/README
index 41c120e0cdf..0e4232f409f 100644
--- a/src/backend/optimizer/README
+++ b/src/backend/optimizer/README
@@ -217,7 +217,7 @@ RIGHT JOIN is equivalent to LEFT JOIN after switching the two input
 tables, so the same identities work for right joins.
 
 An example of a case that does *not* work is moving an innerjoin into or
-out of the nullable side of an outer join:
+out of the nonnullable side of an outer join:
 
 	A leftjoin (B join C on (Pbc)) on (Pab)
 	!= (A leftjoin B on (Pab)) join C on (Pbc)
-- 
2.21.0

From ab226166518936d81029fbaab7c8aba9fe1550ec Mon Sep 17 00:00:00 2001
From: Andy Fan 
Date: Mon, 17 Oct 2022 09:08:37 +0800
Subject: [PATCH v1 2/2] This patch doesn't mean I insist on this at all, it
 just shows where I'm lost.

---
 src/backend/optimizer/prep/prepjointree.c | 20 
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/src/backend/optimizer/prep/prepjointree.c b/src/backend/optimizer/prep/prepjointree.c
index 41c7066d90a..581ff2db36b 100644
--- a/src/backend/optimizer/prep/prepjointree.c
+++ b/src/backend/optimizer/prep/prepjointree.c
@@ -551,12 +551,24 @@ pull_up_sublinks_qual_recurse(PlannerInfo *root, Node *node,
  * joins can get stacked onto either j->larg or j->rarg,
  * depending on which rels they reference.
  */
+
+/*
+ * Here the j comes from convert_EXISTS_sublink_to_join, so
+ * j->jointype must be SEMIJOIN, so no left/right/full outer
+ * join involved. I prefer to believe that:
+ *
+ * A semi join B on exists (select 1 from C on (Pbc))  -- id1
+ * = A semi join (B semi join C on (Pbc) ) on true  --  id2 current master does.
+ * = (A semi join B on true) semi join C on (Pbc) -- id3 this patch does.
+ *
+ * I think this is the place I lost:(
+ */
 j->quals = pull_up_sublinks_qual_recurse(root,
 		 j->quals,
-		 >larg,
-		 available_rels1,
-		 >rarg,
-		 child_rels);
+	

Re: Question about pull_up_sublinks_qual_recurse

2022-10-14 Thread Tom Lane
Andy Fan  writes:
> After some more self review,  I find my proposal has the following side
> effects.

Yeah, I do not think this works at all.  The mechanism as it stands
right now is that we can insert pulled-up semijoins above j->larg
if they only need variables from relations in j->larg, and we can
insert them above j->rarg if they only need variables from relations
in j->rarg.  You can't just ignore that distinction and insert them
somewhere further up the tree.  Per the comment in
pull_up_sublinks_jointree_recurse:

 * Now process qual, showing appropriate child relids as available,
 * and attach any pulled-up jointree items at the right place. In the
 * inner-join case we put new JoinExprs above the existing one (much
 * as for a FromExpr-style join).  In outer-join cases the new
 * JoinExprs must go into the nullable side of the outer join. The
 * point of the available_rels machinations is to ensure that we only
 * pull up quals for which that's okay.

If the pulled-up join doesn't go into the nullable side of the upper
join then you've changed semantics.  In this case, it'd amount to
reassociating a semijoin that was within the righthand side of another
semijoin to above that other semijoin.  The discussion of outer join
reordering in optimizer/README says that that doesn't work, and while
I'm too lazy to construct an example right now, I believe it's true.

regards, tom lane




Re: Question about pull_up_sublinks_qual_recurse

2022-10-14 Thread Andy Fan
>
> Later we tried to pull up the EXISTS sublink to t1 OR t2 *separately*,
> since
> this subselect referenced to t1 *AND* t2, so we CAN'T pull up the
> sublink. I
> am thinking why we have to pull up it t1 OR t2 rather than JoinExpr(t1,
> t2),
> I think the latter one is better.
>

After some more self review,  I find my proposal has the following side
effects.

select * from t1
where exists (select 1 from t2
  where exists (select 1 from t3
   where t3.c = t1.c)
and t2.a = t1.a);

In the above example,  the innermost sublink will be joined with
SemiJoin (t1 t2) in the patched version,  but joined with t1 in the current
master.  However, even if we set the JoinTree with
SemiJoin(SemiJoin(t1 t2), t3),  the join reorder functions can generate a
path which joins t1 with t3 first and then t2 still.  So any hint about this
patch's self-review is welcome.

-- 
Best Regards
Andy Fan


Question about pull_up_sublinks_qual_recurse

2022-10-13 Thread Andy Fan
Hi:

When I was working on another task, the following case caught my mind.

create table t1(a int, b int, c int);
create table t2(a int, b int, c int);
create table t3(a int, b int, c int);

explain (costs off) select * from t1
where exists (select 1 from t2
  where exists (select 1 from t3
   where t3.c = t1.c
   and t2.b = t3.b)
and t2.a = t1.a);


I got the plan like this:

QUERY PLAN
---
 Hash Semi Join
   Hash Cond: (t1.a = t2.a)
   Join Filter: (hashed SubPlan 2)
   ->  Seq Scan on t1
   ->  Hash
 ->  Seq Scan on t2
   SubPlan 2
 ->  Seq Scan on t3
(8 rows)

Note we CAN'T pull up the inner sublink which produced the SubPlan 2.


I traced the reason is after we pull up the outer sublink, we got:

select * from t1 semi join t2 on t2.a = t1.a AND
exists (select 1 from t3
where t3.c = t1.c
and t2.b = t3.b);

Later we tried to pull up the EXISTS sublink to t1 OR t2 *separately*, since
this subselect referenced to t1 *AND* t2, so we CAN'T pull up the sublink. I
am thinking why we have to pull up it t1 OR t2 rather than JoinExpr(t1, t2),
I think the latter one is better.

So I changed the code like this,  I got the plan I wanted and 'make
installcheck' didn't find any exception.


   QUERY PLAN

 Hash Semi Join
   Hash Cond: ((t2.b = t3.b) AND (t1.c = t3.c))
   ->  Hash Semi Join
 Hash Cond: (t1.a = t2.a)
 ->  Seq Scan on t1
 ->  Hash
   ->  Seq Scan on t2
   ->  Hash
 ->  Seq Scan on t3
(9 rows)

@@ -553,10 +553,10 @@ pull_up_sublinks_qual_recurse(PlannerInfo *root, Node
*node,
  */
  j->quals = pull_up_sublinks_qual_recurse(root,
  j->quals,
- >larg,
- available_rels1,
- >rarg,
- child_rels);
+ jtlink1,
+ bms_union(available_rels1, child_rels),
+ NULL,
+ NULL);
  /* Return NULL representing constant TRUE */
  return NULL;
  }

Any feedback is welcome.

-- 
Best Regards
Andy Fan