Re: [Maria-developers] Review input for: MDEV-22534 Decorrelate IN subquery

2023-06-01 Thread Yuchen Pei
Hi Sergei,

Thanks again for your comments.

I have added a comment on the jira ticket including the link to a patch
that addresses your comments. I will take one more look at it tomorrow
morning before sending it to you for review, but I thought maybe it
makes sense to respond now (see below) given the timezone difference.

> Hi Yuchen,
> 
> Please find below review input for 
> 
> > commit ba9cf8efeb0b1e1c132d565adb79c98156783f15
> > Author: Yuchen Pei 
> > Date:   Fri May 19 20:06:31 2023 +1000
> >
> > MDEV-22534 Decorrelate IN subquery
> > 
> 
> First, general questions:
> 
> Q1: It seems there is some overlap between
> Item_in_subselect::exists2in_processor() and 
> Item_exists_subselect::exists2in_processor() Did you consider factoring
> out common code rather than copying it? (if yes, I'd like to know why
> it couldn't be done).

Done. I factored out the common code into three methods:

- Item_exists_subselect::exists2in_prepare()
- Item_exists_subselect::exists2in_create_or_update_in()
- Item_exists_subselect::exists2in_and_is_not_nulls()

In doing so I tried to not alter the logic of the exists2in
transformation, while keeping the decorrelate-in transformation
simple. For things done in exists2in but not in my decorrelate-in
implementation, I add a substype() check.

> Q2: Where is the coe that removes the correlated equalities from the 
> subquery's WHERE? (I assume Item_exists_subselect code does it?)

Naively, the best place to remove the equalities would be in
find_inner_outer_equalities(), when it iterates over the conds and one
could simply call the remove() method on the iterator. But we can't just
do that because after the call to find_inner_outer_equalities() we need
to do more checks, as well as replacing the equalities with `IS NOT
NULL`s in case of upper_not to ensure the 3vl value of the subquery
remains the same (see my previous response about three value logic).

> > diff --git a/sql/item_subselect.cc b/sql/item_subselect.cc
> > index 9e6c205ca76..8d893794da3 100644
> > --- a/sql/item_subselect.cc
> > +++ b/sql/item_subselect.cc
> > @@ -3099,6 +3099,132 @@ static bool find_inner_outer_equalities(Item 
> > **conds,
> >return TRUE;
> >  }
> >  
> > +/* Checks whether item tree intersects with the free list */
> > +static bool intersects_free_list(Item *item, THD *thd)
> > +{
> > +  for (const Item *to_find= thd->free_list; to_find; to_find= 
> > to_find->next)
> > +if (item->walk(::find_item_processor, 1, (void *) to_find))
> > +  return true;
> > +  return false;
> > +}
> > +
> 
> Please try moving this function into a separate file. First name that came to
> my mind: opt_subq_decorrelate.cc Feel free to suggest better names. The idea
> is that we should aim for better structure.
> (This request is only valid if the factoring out suggested above doesn't
> work out).

Replied previously. BTW why should this function be in a separate file?
Is it because it is not a method of any of the Item_subselect classes?
 
> > +/* De-correlate where conditions in an IN subquery */
> 
> This needs a bigger comment describing what the rewrite does.

Done and replied previously

> > +bool Item_in_subselect::exists2in_processor(void *opt_arg)
> > +{
> > +  THD *thd= (THD *)opt_arg;
> > +  SELECT_LEX *first_select= unit->first_select();
> > +  JOIN *join= first_select->join;
> > +  bool will_be_correlated;
> > +  Dynamic_array eqs(PSI_INSTRUMENT_MEM, 5, 5);
> > +  List outer;
> > +  Query_arena *arena= NULL, backup;
> > +  int res= FALSE;
> > +  DBUG_ENTER("Item_in_subselect::exists2in_processor");
> > +
> 
> Please move comments out of the if statement and number the conditions.
> 
> For examples, see best_access_path(), large if-statement starting with
>   "Don't test table scan if it can't be better."
> or 
>   "EXISTS-to-IN coversion and ORDER BY ... LIMIT clause"
> in Item_exists_subselect::exists2in_processor().

Done and replied previously

> > +  if (!optimizer_flag(thd, OPTIMIZER_SWITCH_EXISTS_TO_IN) ||
> > +  /* proceed only if I'm a toplevel IN or a toplevel NOT IN */
> > +  !(is_top_level_item() ||
> > +(upper_not && upper_not->is_top_level_item())) ||
> 
> what is the reason behind this condition?
> If we're able to handle top-level NOT IN, why can't we handle subquery in any
> context?
> 
> Can we really handle the NOT IN? (I suspect not, it will suffer from the
> semantics of IN subqueries with NULLs.. 
> I can't find the name for this, it's described e.g. here:
> https://petrunia.net/2006/11/16/inany-subqueries-null-woes/ )

Done and replied previously

> 
> > +  first_select->is_part_of_union() ||  /* skip if part of a union */
> > +  first_select->group_list.elements || /* skip if with group by */
> > +  first_select->with_sum_func ||   /* skip if aggregation */
> > +  join->having ||  /* skip if with having */
> > +  !join->conds ||  /* skip if no conds */
> > +  /* skip if left 

Re: [Maria-developers] Review input for: MDEV-22534 Decorrelate IN subquery

2023-05-26 Thread Yuchen Pei
Hi Sergey,

Thanks for your comments. Please find some quick responses to some of
the comments while they are still fresh in my mind before the weekend
comes, and more to follow next week.
On Thu 2023-05-25 22:14:13 +0300, Sergey Petrunia wrote:

> Hi Yuchen,
>
> Please find below review input for 
>
>> commit ba9cf8efeb0b1e1c132d565adb79c98156783f15
>> Author: Yuchen Pei 
>> Date:   Fri May 19 20:06:31 2023 +1000
>>
>> MDEV-22534 Decorrelate IN subquery
>> 
>
> First, general questions:
>
> Q1: It seems there is some overlap between
> Item_in_subselect::exists2in_processor() and 
> Item_exists_subselect::exists2in_processor() Did you consider factoring
> out common code rather than copying it? (if yes, I'd like to know why
> it couldn't be done).

Indeed Item_in_subselect::exists2in_processor() is modeled after
Item_exists_subselect::exists2in_processor(). The issue with the latter
is that it is a large function that could have grown over time and I
want to build Item_in_subselect::exists2in_processor() with only what is
necessary. I will try to factor out common code. Also please take a look
at the added test file in the patch - I tried to come up with tests for
every line, including the skip condition involving upper_not mentioned
below.

>
> Q2: Where is the coe that removes the correlated equalities from the 
> subquery's WHERE? (I assume Item_exists_subselect code does it?)

Like the processor for `exists`, I replaces the correlated equalities in
`where` with int item 1 rather than removing them. Is it better to
remove them?

>
>> diff --git a/sql/item_subselect.cc b/sql/item_subselect.cc
>> index 9e6c205ca76..8d893794da3 100644
>> --- a/sql/item_subselect.cc
>> +++ b/sql/item_subselect.cc
>> @@ -3099,6 +3099,132 @@ static bool find_inner_outer_equalities(Item **conds,
>>return TRUE;
>>  }
>>  
>> +/* Checks whether item tree intersects with the free list */
>> +static bool intersects_free_list(Item *item, THD *thd)
>> +{
>> +  for (const Item *to_find= thd->free_list; to_find; to_find= to_find->next)
>> +if (item->walk(::find_item_processor, 1, (void *) to_find))
>> +  return true;
>> +  return false;
>> +}
>> +
>
> Please try moving this function into a separate file. First name that came to
> my mind: opt_subq_decorrelate.cc Feel free to suggest better names. The idea
> is that we should aim for better structure.
> (This request is only valid if the factoring out suggested above doesn't
> work out).

Sanja commented on a similar commit to MDEV-31269 that this way of
finding intersection is too costly. What do you think? Before moving
this function, either we need to agree this is an acceptable function or
figure out an acceptable way of finding free items in the item tree from
the equalities. Failing both, then I suggest (as I did in a comment[1]
on MDEV-22534), we mark this function as out of scope for the review of
MDEV-22534 and as part of MDEV-31269 which may be fixed by MDEV-30073
(see a comment[2] in MDEV-31269).

[1]
https://jira.mariadb.org/browse/MDEV-22534?focusedCommentId=259406=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-259406
[2]
https://jira.mariadb.org/browse/MDEV-31269?focusedCommentId=259516=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-259516

>
>> +/* De-correlate where conditions in an IN subquery */
>
> This needs a bigger comment describing what the rewrite does.
>
>> +bool Item_in_subselect::exists2in_processor(void *opt_arg)
>> +{
>> +  THD *thd= (THD *)opt_arg;
>> +  SELECT_LEX *first_select= unit->first_select();
>> +  JOIN *join= first_select->join;
>> +  bool will_be_correlated;
>> +  Dynamic_array eqs(PSI_INSTRUMENT_MEM, 5, 5);
>> +  List outer;
>> +  Query_arena *arena= NULL, backup;
>> +  int res= FALSE;
>> +  DBUG_ENTER("Item_in_subselect::exists2in_processor");
>> +
>
> Please move comments out of the if statement and number the conditions.
>
> For examples, see best_access_path(), large if-statement starting with
>   "Don't test table scan if it can't be better."
> or 
>   "EXISTS-to-IN coversion and ORDER BY ... LIMIT clause"
> in Item_exists_subselect::exists2in_processor().
>
>
>> +  if (!optimizer_flag(thd, OPTIMIZER_SWITCH_EXISTS_TO_IN) ||
>> +  /* proceed only if I'm a toplevel IN or a toplevel NOT IN */
>> +  !(is_top_level_item() ||
>> +(upper_not && upper_not->is_top_level_item())) ||
>
> what is the reason behind this condition?

Without these conditions, the following testcase in
`subslect_decorrelate_in.test` will fail (produce wrong results IIRC):

--8<---cut here---start->8---
--echo # skip transformation when neither toplevel IN nor toplevel NOT IN; 
testcase adapated from main.subslect4
CREATE TABLE t1 ( a INT, b INT );
INSERT INTO t1 VALUES ( 1, NULL ), ( 2, NULL );
CREATE TABLE t2 ( c INT, d INT );
INSERT INTO t2 VALUES ( NULL, 3 ), ( NULL, 4 );
# not top level, upper_not is not top level
SELECT * FROM t1 WHERE a NOT 

[Maria-developers] Review input for: MDEV-22534 Decorrelate IN subquery

2023-05-25 Thread Sergey Petrunia
Hi Yuchen,

Please find below review input for 

> commit ba9cf8efeb0b1e1c132d565adb79c98156783f15
> Author: Yuchen Pei 
> Date:   Fri May 19 20:06:31 2023 +1000
>
> MDEV-22534 Decorrelate IN subquery
> 

First, general questions:

Q1: It seems there is some overlap between
Item_in_subselect::exists2in_processor() and 
Item_exists_subselect::exists2in_processor() Did you consider factoring
out common code rather than copying it? (if yes, I'd like to know why
it couldn't be done).

Q2: Where is the coe that removes the correlated equalities from the 
subquery's WHERE? (I assume Item_exists_subselect code does it?)

> diff --git a/sql/item_subselect.cc b/sql/item_subselect.cc
> index 9e6c205ca76..8d893794da3 100644
> --- a/sql/item_subselect.cc
> +++ b/sql/item_subselect.cc
> @@ -3099,6 +3099,132 @@ static bool find_inner_outer_equalities(Item **conds,
>return TRUE;
>  }
>  
> +/* Checks whether item tree intersects with the free list */
> +static bool intersects_free_list(Item *item, THD *thd)
> +{
> +  for (const Item *to_find= thd->free_list; to_find; to_find= to_find->next)
> +if (item->walk(::find_item_processor, 1, (void *) to_find))
> +  return true;
> +  return false;
> +}
> +

Please try moving this function into a separate file. First name that came to
my mind: opt_subq_decorrelate.cc Feel free to suggest better names. The idea
is that we should aim for better structure.
(This request is only valid if the factoring out suggested above doesn't
work out).

> +/* De-correlate where conditions in an IN subquery */

This needs a bigger comment describing what the rewrite does.

> +bool Item_in_subselect::exists2in_processor(void *opt_arg)
> +{
> +  THD *thd= (THD *)opt_arg;
> +  SELECT_LEX *first_select= unit->first_select();
> +  JOIN *join= first_select->join;
> +  bool will_be_correlated;
> +  Dynamic_array eqs(PSI_INSTRUMENT_MEM, 5, 5);
> +  List outer;
> +  Query_arena *arena= NULL, backup;
> +  int res= FALSE;
> +  DBUG_ENTER("Item_in_subselect::exists2in_processor");
> +

Please move comments out of the if statement and number the conditions.

For examples, see best_access_path(), large if-statement starting with
  "Don't test table scan if it can't be better."
or 
  "EXISTS-to-IN coversion and ORDER BY ... LIMIT clause"
in Item_exists_subselect::exists2in_processor().


> +  if (!optimizer_flag(thd, OPTIMIZER_SWITCH_EXISTS_TO_IN) ||
> +  /* proceed only if I'm a toplevel IN or a toplevel NOT IN */
> +  !(is_top_level_item() ||
> +(upper_not && upper_not->is_top_level_item())) ||

what is the reason behind this condition?
If we're able to handle top-level NOT IN, why can't we handle subquery in any
context?

Can we really handle the NOT IN? (I suspect not, it will suffer from the
semantics of IN subqueries with NULLs.. 
I can't find the name for this, it's described e.g. here:
https://petrunia.net/2006/11/16/inany-subqueries-null-woes/ )

> +  first_select->is_part_of_union() ||  /* skip if part of a union */
> +  first_select->group_list.elements || /* skip if with group by */
> +  first_select->with_sum_func ||   /* skip if aggregation */
> +  join->having ||  /* skip if with having */
> +  !join->conds ||  /* skip if no conds */
> +  /* skip if left expr is a single nonscalar subselect */
> +  (left_expr->type() == Item::SUBSELECT_ITEM &&
> +   !left_expr->type_handler()->is_scalar_type()))

The above seems to be some exotic SQL. Is it something like 
  
  (select col1,lol2 from t1) IN (select col3,col4 from t2 where ..)

and does MariaDB really support this? Please add a comment about this.

> +DBUG_RETURN(FALSE);
> +  /* iterate over conditions, and check whether they can be moved out. */
> +  if (find_inner_outer_equalities(>conds, eqs))
> +DBUG_RETURN(FALSE);
> +  /* If we are in the execution of a prepared statement, check for
> +  intersection with the free list to avoid segfault. Note that the
> +  check for prepared statement execution is necessary, otherwise it
> +  will likely always find intersection thus abort the
> +  transformation.
> +
> +  fixme: this only works when HAVE_PSI_STATEMENT_INTERFACE is
> +  defined. It causes CI's like amd64-debian-10-debug-embedded to
> +  fail. Are there other ways to find out we are in the execution of a
> +  prepared statement? */

I'm looking at the code of activate_stmt_arena_if_needed(). It seems
"stmt_arena->is_conventional()" is the check you're looking for.

> +  if (thd->m_statement_state.m_parent_prepared_stmt)
> +  {
> +for (uint i= 0; i < (uint) eqs.elements(); i++)
> +{
> +  if (intersects_free_list(*eqs.at(i).eq_ref, thd))
> +DBUG_RETURN(FALSE);

This check doesn't look good. 
As far as I understand, it is a temporary measure until related
re-execution bug(s) are fixed by Igor (and/or Sanja)?
> +}
> +  }

> +  /* Determines whether the result will be correlated */
> +  {
> +List unused;