On Sat, Sep 25, 2010 at 10:54 AM, Max Vlasov <[email protected]> wrote:
> On Sat, Sep 25, 2010 at 6:35 PM, Richard Hipp <[email protected]> wrote:
>>
>> You have a very unusual data distribution in your tables. SQLite does not
>> know this and so it chooses a query plan (in 3.7.1 and later) that assumes
>> a different and more typical data distribution. The new query plan works
>> better for most cases, but is much worse in your unusual case.
>
> Richard,
> I think it's something different.
> .....
> So my guess is that an optimization introduced in 3.6.20 and existing in
> .23 was removed in 3.7 for some unknown reason. Maybe actually there was a
> reason, but the difference between INNER JOIN and LEFT JOIN leaves a chance
> it's some side effect. But I may be wrong.

You are indeed wrong. A simplification of the schema is this:

    CREATE TABLE a(w, x, PRIMARY KEY(w,x));
    CREATE TABLE b(y, z, PRIMARY KEY(y,z));

And the corresponding query is this:

    SELECT * FROM a JOIN b ON w=3 AND x=z;

Because ANALYZE has not been run, SQLite has no statistics on the content of
the two tables, so it guesses that both tables contain 1,000,000 rows and
that the w=3 constraint on table A will narrow the search down to just 10
entries. Given this information, the best plan is to look up entries in
table A where w=3 in the outer loop, and for each match do a full table scan
of B looking for entries where x=z. The query planner considers creating a
temporary index on table B to help it find entries where z=x, but because it
guesses that the inner loop will only run 10 times, the cost of creating an
index on 1 million entries exceeds (very slightly) the cost of doing 10 full
table scans. And so it chooses to do the full table scan. The other
approach, doing a single full scan of B in the outer loop and then looking
up entries in table A where w=3 and x=z, is rejected because that would
involve 1 million binary searches of the A table.
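The effect is easy to observe directly. Below is a small sketch (not from the
original database) using Python's built-in sqlite3 module: the table and column
names come from the simplified schema above, but the row counts and the skewed
data (every row of A sharing the same w) are invented for illustration, and the
exact EXPLAIN QUERY PLAN text varies by SQLite version:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE a(w, x, PRIMARY KEY(w, x));
    CREATE TABLE b(y, z, PRIMARY KEY(y, z));
""")

# Skewed data as described in the thread: every row of a has the same w,
# so the w=3 constraint does not narrow the search at all.
db.executemany("INSERT INTO a VALUES (3, ?)", [(i,) for i in range(1000)])
db.executemany("INSERT INTO b VALUES (?, ?)", [(i, i) for i in range(1000)])

query = "SELECT * FROM a JOIN b ON w=3 AND x=z"

# Without statistics, the planner must guess the data distribution.
print("-- plan before ANALYZE --")
for row in db.execute("EXPLAIN QUERY PLAN " + query):
    print(row)

# ANALYZE gathers the real distribution into sqlite_stat1,
# letting the planner see that w=3 is not selective.
db.execute("ANALYZE")

print("-- plan after ANALYZE --")
for row in db.execute("EXPLAIN QUERY PLAN " + query):
    print(row)
```

Comparing the two printed plans on a skewed data set like this shows how the
statistics change the planner's cost estimates, which is the whole story of
this thread.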
The cost of 1 million binary searches is slightly greater than the cost of
doing 10 full table scans of B, so the first plan, with A in the outer loop
and B in the inner loop, is considered slightly cheaper.

All of the above is correct, and the most efficient query plan (the first
plan) is chosen, IF the content of the tables is as SQLite guesses it to be:
1 million rows in each table, with the w=3 constraint restricting the search
to 10 rows.

In reality the data is distributed very differently. There are 12366 rows in
table A, and w=3 does not restrict the search at all because every row has
the same value for w. Hence, using the first term of the primary key is
utterly useless in helping to restrict the amount of work done. With
SQLite's guess of the data distribution, a search on the w=3 constraint cuts
the amount of work by a factor of about 40,000 or so, when in reality it
does not restrict the search space at all. Hence, SQLite is underestimating
the amount of work done by the first plan by a factor of 40,000 or so.

After you run ANALYZE, SQLite has better information about the content of
the tables. It knows, for example, that the w=3 constraint is not useful in
restricting the search, and hence it is better able to estimate the work
required to run each proposed query plan. With better information, the query
planner is therefore able to choose the second, faster query plan.

Without ANALYZE, SQLite should have been choosing the first plan all along.
The fact that the second plan was chosen by version 3.6.23 was the result of
a problem in the query planner that got fixed in version 3.7.1. The fact
that the second plan turned out to be faster for the peculiar content of
this database was just an accident.

> Max

-- 
D. Richard Hipp
[email protected]
_______________________________________________
sqlite-users mailing list
[email protected]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

