Re: [Maria-developers] On constructing ref access from mis-matched charset comparisons

2017-02-27 Thread Alexander Barkov
Hi Sergey,


On 02/24/2017 04:23 AM, Sergey Petrunia wrote:
> Hi Alexander,
> 
> I've got a question about mis-matched charset comparisons and ref access 
> method.
> 
> == Short form ==
> I know that VARCHAR comparisons over mis-matching charsets cannot be used for 
> constructing index lookups into the index over the "narrower" character set
> column. 
> But is this a real limitation or just an optimizer deficiency?
> 
> If it is the latter, do people hit it in the real world sufficiently often to
> warrant lifting it?
> 
> == Long form ==
> 
> === Example dataset ===
> 
> create table t0 (
>   a varchar(32) character set utf8 collate utf8_general_ci
> );
> 
> create table t1 (
>   a varchar(32) character set latin1 collate latin1_swedish_ci,
>   col1 varchar(32),
>   key(a)
> );
> 
> create table t2 (
>   a varchar(32) character set utf8 collate utf8_general_ci,
>   col1 varchar(32),
>   key(a)
> );
> 
> insert into t0 select a from ten;
> insert into t1 select a,a from one_k;
> insert into t2 select a,a from one_k;
> 
> === Queries ===
> 
> MariaDB [test]> explain 
> -> select * from t0, t2 where t0.a=t2.a;
> +--+-+---+--+---+--+-+---+--+-+
> | id   | select_type | table | type | possible_keys | key  | key_len | ref
>| rows | Extra   |
> +--+-+---+--+---+--+-+---+--+-+
> |1 | SIMPLE  | t0| ALL  | NULL  | NULL | NULL| NULL   
>|   10 | Using where |
> |1 | SIMPLE  | t2| ref  | a | a| 99  | 
> test.t0.a |1 | |
> +--+-+---+--+---+--+-+---+--+-+
> 
> Ok, good.
> 
> MariaDB [test]> explain 
> -> select * from t0, t1 where t0.a=t1.a;
> +--+-+---+--+---+--+-+--+--+-+
> | id   | select_type | table | type | possible_keys | key  | key_len | ref  | 
> rows | Extra   |
> +--+-+---+--+---+--+-+--+--+-+
> |1 | SIMPLE  | t0| ALL  | NULL  | NULL | NULL| NULL | 
>   10 | |
> |1 | SIMPLE  | t1| ALL  | NULL  | NULL | NULL| NULL | 
> 1000 | Using where; Using join buffer (flat, BNL join) |
> +--+-+---+--+---+--+-+--+--+-+
> 2 rows in set (0.00 sec)
> 
> Oops, the index t1(a) is not used. FORMAT=JSON shows the reason:
> 
>   "attached_condition": "(t0.a = convert(t1.a using utf8))"
> 
> Still it feels like the optimizer could
> 
> * Try to convert the value of t0.a into latin1.  
> * If it can't be represented in latin1, then we know that no row
>   in t1 has t1.a=t0.a.

This is correct, if it cannot be represented in latin1, then there are
no rows that match the equality.

> * make a lookup on t1.a= convert(t0.a using latin1) , using t1.a's collation
>   rules.

This is not generally correct, because you propose to compare
using the collation of t1.a instead of the collation of t0.a,
and the two collations can have different rules.



> 
> Is this true [for some charsets] ?

It can work in some cases:

If the collation of t0.a is binary (e.g. utf8_bin),
then it can be done. But index lookup on t1.a will return
some false positives, so you'll need to filter them
out after index lookup. So it will involve two
character set conversions:

- Lookup on t1.a= convert(t0.a using latin1)
  It will return some values.
  Let's call them T0A0, T0A1, T0A2.
  The number of distinct values will depend on the t1.a's collation
  style (e.g. case and accent sensitivity).


- Test the found values using the original condition:
   t0.a=convert(T0A0 using utf8)
   t0.a=convert(T0A1 using utf8)
   t0.a=convert(T0A2 using utf8)
  and choose the one that matches
  (In case if t0.a has a _bin collation,
   not more than one value can match).


Note, the same trick can actually work for some non-binary collations
as well, but I need some time to formalize the rules when it's possible.

> 
> If yes, is it worth doing?

Index lookup is better than a full table scan.
So if it does not take too much implementation efforts,
then I'd do this.


> 
> BR
>  Sergei
> 

___
Mailing list: https://launchpad.net/~maria-developers
Post to : maria-developers@lists.launchpad.net
Unsubscribe : https://launchpad.net/~maria-developers
More help   : https://help.launchpad.net/ListHelp


Re: [Maria-developers] On constructing ref access from mis-matched charset comparisons

2017-02-24 Thread Sergei Golubchik
Hi, Sergey!

On Feb 24, Sergey Petrunia wrote:
> == Long form ==
> 
> === Example dataset ===
> 
> create table t0 (
>   a varchar(32) character set utf8 collate utf8_general_ci
> );
> 
> create table t1 (
>   a varchar(32) character set latin1 collate latin1_swedish_ci,
>   col1 varchar(32),
>   key(a)
> );
> 
> insert into t0 select a from ten;
> insert into t1 select a,a from one_k;
> 
> === Queries ===
> 
> MariaDB [test]> explain 
> -> select * from t0, t1 where t0.a=t1.a;
> +--+-+---+--+---+--+-+--+
> | id   | select_type | table | type | possible_keys | key  | key_len | ref  |
> +--+-+---+--+---+--+-+--+
> |1 | SIMPLE  | t0| ALL  | NULL  | NULL | NULL| NULL |
> |1 | SIMPLE  | t1| ALL  | NULL  | NULL | NULL| NULL |
> +--+-+---+--+---+--+-+--+
> 2 rows in set (0.00 sec)
> 
> Oops, the index t1(a) is not used.  Still it feels like the optimizer could
> 
> * Try to convert the value of t0.a into latin1.  
> * If it can't be represented in latin1, then we know that no row
>   in t1 has t1.a=t0.a.
> * make a lookup on t1.a= convert(t0.a using latin1) , using t1.a's collation
>   rules.
> 
> Is this true [for some charsets] ?

I suppose it's true for *some charsets*, it's not true generally.
It is certainly not true that if you can convert, you can compare. For
example, let's compare t1.a with a latin1_german2_ci literal, for
example, "ö". You can convert "ö" to latin1_swedish_ci. But

  t1.a= "ö" collate latin1_german2_ci

is not the same as

  t1.a= "ö" collate latin1_swedish_ci

In the first case you'll find 'oe', in the second - you won't.

Regards,
Sergei
Chief Architect MariaDB
and secur...@mariadb.org

___
Mailing list: https://launchpad.net/~maria-developers
Post to : maria-developers@lists.launchpad.net
Unsubscribe : https://launchpad.net/~maria-developers
More help   : https://help.launchpad.net/ListHelp