[sqlite] [bug] Optimizer chooses suboptimal plan for correlated subqueries containing tables with zero rows

Stephen Broberg Thu, 17 Jul 2014 15:51:13 -0700

Hi,

We encountered an issue in our system where we had two nearly identical 
subqueries in a statement, which differed only in one table (both of which had 
the same definition).  One table had 4 rows, the other table had zero rows.  
The subquery with the 4-row table ran about 10,000 times faster than the 
subquery with the zero-row table.  ANALYZE had been run on this schema, and the 
problem existed on sqlite version 3.7.12 up through 3.8.5 (running on Mac OS 
10.8), and exists whether compiling with SQLITE_ENABLE_STAT3 or not.


After much analysis, it turns out the problem was "fixed" by inserting one row 
into the empty table, running ANALYZE, then removing the row.  Without this 
step, there was no row in sqlite_stat1, and the optimizer used assumed 100,000 
rows for the empty table.

My naïve understanding of the sqlite engine would make me think that the 
problem lies in not storing a stats row for zero-sized tables, resulting in the 
optimizer making the unnecessary assumption.

I've included a script below that will reproduce the problem.  The poor 
performance isn't as dramatic as our production case, likely because the tables 
are not as wide as ours, so the cost of visiting rows is reduced, so I added 
some nonsensical string operations to the select clause to burn cpu.  In our 
code, the timing for the optimal query was "CPU Time: user 0.000142 sys 
0.000072" and the suboptimal one was "CPU Time: user 1.021453 sys 0.036893"

Note that several conditions must exist to produce this perfect storm:

1.  The subquery needs to contain several tables besides the empty one
2.  The stats computed for the schema are as described above
3.  Covering indexes exist for the correlation keys.

Here are query plans & times produced my script:

==== Query against the subquery with 4-row table ====
selectid    order       from        detail
----------  ----------  ----------  
---------------------------------------------------------------------------------------------
0           0           0           SEARCH TABLE child AS mc USING INTEGER 
PRIMARY KEY (rowid=?) (~1 rows)
0           1           1           SEARCH TABLE parent AS mp USING INTEGER 
PRIMARY KEY (rowid=?) (~1 rows)
0           0           0           EXECUTE CORRELATED SCALAR SUBQUERY 1
1           0           0           SCAN TABLE somerows AS s (~4 rows)
1           1           1           SEARCH TABLE child AS c USING INTEGER 
PRIMARY KEY (rowid=?) (~1 rows)
1           2           2           SEARCH TABLE parent AS p USING INTEGER 
PRIMARY KEY (rowid=?) (~1 rows)
CPU Time: user 0.000183 sys 0.000106

==== Query against the subquery with zero-row table Note that it runs about 150 
times slower than the first case ====
selectid    order       from        detail
----------  ----------  ----------  
---------------------------------------------------------------------------------------------
0           0           0           SEARCH TABLE child AS mc USING INTEGER 
PRIMARY KEY (rowid=?) (~1 rows)
0           1           1           SEARCH TABLE parent AS mp USING INTEGER 
PRIMARY KEY (rowid=?) (~1 rows)
0           0           0           EXECUTE CORRELATED SCALAR SUBQUERY 1
1           0           2           SEARCH TABLE parent AS p USING COVERING 
INDEX ak_p (gp_id=?) (~100 rows)
1           1           1           SEARCH TABLE child AS c USING COVERING 
INDEX ak_c (p_id=?) (~50 rows)
1           2           0           SEARCH TABLE norows AS n USING COVERING 
INDEX ak_norows (c_id=?) (~1 rows)
CPU Time: user 0.026749 sys 0.000396

==== Query against the subquery with zero-row table after stats were generated 
with a single row in the table  ====
==== Note that now the query plan is the same as in the "correct" case  ====
selectid    order       from        detail
----------  ----------  ----------  
---------------------------------------------------------------------------------------------
0           0           0           SEARCH TABLE child AS mc USING INTEGER 
PRIMARY KEY (rowid=?) (~1 rows)
0           1           1           SEARCH TABLE parent AS mp USING INTEGER 
PRIMARY KEY (rowid=?) (~1 rows)
0           0           0           EXECUTE CORRELATED SCALAR SUBQUERY 1
1           0           0           SCAN TABLE norows AS n (~1 rows)
1           1           1           SEARCH TABLE child AS c USING INTEGER 
PRIMARY KEY (rowid=?) (~1 rows)
1           2           2           SEARCH TABLE parent AS p USING INTEGER 
PRIMARY KEY (rowid=?) (~1 rows)
CPU Time: user 0.000173 sys 0.000099



==========================
=== Steps to Reproduce ===
==========================

Run the attached bash script and redirect to sqlite, like so:

./testdata.sh | sqlite3 test.db

The script generates a db with a million-row table; 150000 rows (by choosing 
100/50/30 values for the i/j/k loops) is enough to illustrate the problem as 
well, but the results are not as dramatic.

=================
== testdata.sh ==
=================
#!/bin/bash
echo 'create table grandparent (gp_id integer not null primary key, gp_name 
text not null);'
echo 'create table parent (p_id integer not null primary key, p_name text not 
null, gp_id integernot null);'
echo 'create table child (c_id integer not null primary key, c_name text not 
null, p_id integernot null);'

echo 'create temp table if not exists variables (name text primary key, value 
text);'

for i in `seq 1 100`; do
    echo insert into grandparent values \(null, \"gp $i\"\)\;
    echo 'insert or replace into variables values ("gp", last_insert_rowid());'
    for j in `seq 1 100`; do
        echo insert into parent select null, \"gp-\" \|\| value \|\| \" p-$j\", 
value from variables where name = \'gp\'\;
        echo 'insert or replace into variables values ("p", 
last_insert_rowid());'
        for k in `seq 1 100`; do
            echo insert into child select null, \"p-\" \|\| value \|\| \" 
c-$k\", value from variables where name = \'p\'\;
        done
    done
done

echo 'create table somerows (name text not null, c_id integer not null);'
echo 'create table norows (name text not null, c_id integer not null);'

echo 'delete from somerows;'
echo 'insert into somerows values ("apple", 15001);'
echo 'insert into somerows values ("peach", 15003);'
echo 'insert into somerows values ("pear", 15005);'
echo 'insert into somerows values ("plum", 15007);'

echo 'create unique index pk_gp on grandparent (gp_id);'
echo 'create unique index pk_p on parent (p_id);'
echo 'create unique index pk_c on child (c_id);'
echo 'create unique index pk_somerows on somerows (name);'
echo 'create unique index pk_norows on norows (name);'

echo 'create unique index ak_p on parent (gp_id, p_id);'
echo 'create unique index ak_c on child (p_id, c_name);'
echo 'create unique index ak_norows on norows (c_id);'
echo 'create unique index ak_somerows on somerows (c_id);'

echo 'analyze;'

echo '.width 0 0 0 200'

echo 'explain query plan ' \
    'select' \
    '(select substr(replace(c_name || " aa aa aa aa aa aa " || p.p_id, "aa", 
"a"), ' \
    '               length(replace(c_name || " aa aa aa aa aa aa " || p.p_id, 
"aa", "a")) - 8, ' \
    '               4) || c_name expensive_col' \
    'from somerows s' \
    ' join child c' \
    ' on s.c_id = c.c_id' \
    ' join parent p' \
    ' on c.p_id = p.p_id' \
    'where p.gp_id = mp.gp_id and expensive_col != "nothing")' \
    'from child mc, parent mp' \
    'where mc.p_id = mp.p_id and c_id = 15010;'

echo '.timer on'
echo \
    'select' \
    '(select substr(replace(c_name || " aa aa aa aa aa aa " || p.p_id, "aa", 
"a"), ' \
    '               length(replace(c_name || " aa aa aa aa aa aa " || p.p_id, 
"aa", "a")) - 8, ' \
    '               4) || c_name expensive_col' \
    'from somerows s' \
    ' join child c' \
    ' on s.c_id = c.c_id' \
    ' join parent p' \
    ' on c.p_id = p.p_id' \
    'where p.gp_id = mp.gp_id and expensive_col != "nothing")' \
    'from child mc, parent mp' \
    'where mc.p_id = mp.p_id and c_id = 15010;'
echo '.timer off'

echo 'explain query plan ' \
    'select' \
    '(select substr(replace(c_name || " aa aa aa aa aa aa " || p.p_id, "aa", 
"a"), ' \
    '               length(replace(c_name || " aa aa aa aa aa aa " || p.p_id, 
"aa", "a")) - 8, ' \
    '               4) || c_name expensive_col' \
    'from norows n' \
    ' join child c' \
    ' on n.c_id = c.c_id' \
    ' join parent p' \
    ' on c.p_id = p.p_id' \
    'where p.gp_id = mp.gp_id and expensive_col != "nothing")' \
    'from child mc, parent mp' \
    'where mc.p_id = mp.p_id and c_id = 15010;'

echo '.timer on'
echo \
    'select' \
    '(select substr(replace(c_name || " aa aa aa aa aa aa " || p.p_id, "aa", 
"a"), ' \
    '               length(replace(c_name || " aa aa aa aa aa aa " || p.p_id, 
"aa", "a")) - 8, ' \
    '               4) || c_name expensive_col' \
    'from norows n' \
    ' join child c' \
    ' on n.c_id = c.c_id' \
    ' join parent p' \
    ' on c.p_id = p.p_id' \
    'where p.gp_id = mp.gp_id and expensive_col != "nothing")' \
    'from child mc, parent mp' \
    'where mc.p_id = mp.p_id and c_id = 15010;'
echo '.timer off'

echo 'insert into norows values ("dummy", 1);'
echo 'analyze;'
echo 'delete from norows;'

echo 'explain query plan ' \
    'select' \
    '(select substr(replace(c_name || " aa aa aa aa aa aa " || p.p_id, "aa", 
"a"), ' \
    '               length(replace(c_name || " aa aa aa aa aa aa " || p.p_id, 
"aa", "a")) - 8, ' \
    '               4) || c_name expensive_col' \
    'from norows n' \
    ' join child c' \
    ' on n.c_id = c.c_id' \
    ' join parent p' \
    ' on c.p_id = p.p_id' \
    'where p.gp_id = mp.gp_id and expensive_col != "nothing")' \
    'from child mc, parent mp' \
    'where mc.p_id = mp.p_id and c_id = 15010;'

echo '.timer on'
echo \
    'select' \
    '(select substr(replace(c_name || " aa aa aa aa aa aa " || p.p_id, "aa", 
"a"), ' \
    '               length(replace(c_name || " aa aa aa aa aa aa " || p.p_id, 
"aa", "a")) - 8, ' \
    '               4) || c_name expensive_col' \
    'from norows n' \
    ' join child c' \
    ' on n.c_id = c.c_id' \
    ' join parent p' \
    ' on c.p_id = p.p_id' \
    'where p.gp_id = mp.gp_id and expensive_col != "nothing")' \
    'from child mc, parent mp' \
    'where mc.p_id = mp.p_id and c_id = 15010;'
echo '.timer off'

echo 'analyze;'

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

[sqlite] [bug] Optimizer chooses suboptimal plan for correlated subqueries containing tables with zero rows

Reply via email to