On 23/01/2012 3:09 PM, Ryan Johnson wrote:
On 23/01/2012 12:51 PM, Richard Hipp wrote:
On Mon, Jan 23, 2012 at 12:48 PM, Simon Slavin<slav...@bigfraud.org> wrote:
I don't know if Dr Hipp is pursuing this privately or expecting it to be
solved collaboratively on this list.


I don't have a test case to work on.
My database file is ~150MB, but was generated by the TPC-H data generator program. Assuming a linux-like environment (including cygwin), the following will reproduce the setup in under five minutes: The problem persists with a freshly-generated database on my machine, using a just-compiled sqlite-3.7.10.

OK, it looks like I didn't install the 3.7.10 binary properly, because the situation is significantly different once ANALYZE has run.

The first query from the OP is remarkably similar to a query mentioned in where.c (from the 3.7.5 sources, which I happened to have handy):
Previous versions of SQLite [did not always find the lowest-cost plan] for scripts such as the following:
    **
    **   CREATE TABLE t1(a, b);
    **   CREATE TABLE t2(c, d);
    **   SELECT * FROM t2, t1 WHERE t2.rowid = t1.a;
    **
The best strategy is to iterate through table t1 first. However it is not possible to determine this with a simple greedy algorithm. Since the cost of a linear scan through table t2 is the same as the cost of a linear scan through table t1, a simple greedy algorithm may choose to use t2 for the outer loop, which is a much costlier approach.

Re-casting my first query as the above gives:
sqlite> explain query plan select * from customer t2, orders t1 where t2.rowid = t1.custkey;
0|0|0|SCAN TABLE customer AS t2 (~15000 rows)
0|1|1|SEARCH TABLE orders AS t1 USING INDEX OrderCustomers (custKey=?) (~15 rows)

The same query plan is chosen if t1 comes first; execution times and stats confirm the improvement.

For query 2, the situation is less clear. First of all, it appears to execute more than 4x faster overall (10s and .2s -- WOW!) but the optimizer still seems to choose the wrong plan (13.6M rows examined):
1|0|0|SCAN TABLE orders (~37500 rows)
1|0|0|USE TEMP B-TREE FOR DISTINCT
0|0|2|SCAN SUBQUERY 1 AS Y (~37500 rows)
0|1|0|SCAN TABLE orders AS O (~75000 rows)
0|2|1|SEARCH TABLE lineitem AS L USING INDEX LineItemOrders (orderKey=?) (~2 rows)

If I force the proper ordering, "X cross join Y", then the following, better-than-expected plan is used instead (only 300k rows examined!):
1|0|0|SCAN TABLE orders (~37500 rows)
1|0|0|USE TEMP B-TREE FOR DISTINCT
0|0|0|SCAN TABLE orders AS O (~75000 rows)
0|1|1|SEARCH TABLE lineitem AS L USING INDEX LineItemOrders (orderKey=?) (~5 rows)
0|2|2|SCAN SUBQUERY 1 AS Y (~18750 rows)

The main difference seems to be that Y should be the innermost loop but isn't.

Thoughts?
Ryan


_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to