On Sat, Sep 25, 2010 at 10:54 AM, Max Vlasov <[email protected]> wrote:
> On Sat, Sep 25, 2010 at 6:35 PM, Richard Hipp <[email protected]> wrote:
>>
>> You have a very unusual data distribution in your tables. SQLite does not
>> know this and so it chooses a query plan (in 3.7.1 and later) that assumes
>> a different and more typical data distribution. The new query plan works
>> better for most cases, but is much worse in your unusual case.
>
> Richard,
> I think it's something different.
> .....
> So my guess is that an optimization introduced in 3.6.20 and existing in
> .23 was removed in 3.7 for some unknown reason. Maybe actually there was a
> reason, but the difference between INNER JOIN and LEFT JOIN leaves a chance
> it's some side effect. But I may be wrong.

You are indeed wrong. A simplification of the schema is this:

    CREATE TABLE a(w, x, PRIMARY KEY(w,x));
    CREATE TABLE b(y, z, PRIMARY KEY(y,z));

And the corresponding query is this:

    SELECT * FROM a JOIN b ON w=3 AND x=z;

Because ANALYZE has not been run, SQLite has no statistics on the content of
the two tables, so it guesses that both tables contain 1,000,000 rows and
that the w=3 constraint on table A will narrow the search down to just 10
entries. Given this information, the best plan is to look up entries in
table A where w=3 in the outer loop, and for each match do a full table scan
of B looking for entries where x=z. The query planner considers creating a
temporary index on table B to help it find entries where z=x, but because it
guesses that the inner loop will only run 10 times, the cost of creating an
index on 1 million entries exceeds (very slightly) the cost of doing 10 full
table scans. And so it chooses to do the full table scan. The other
approach, doing a single full scan of B in the outer loop and then looking
up entries in table A where w=3 and x=z, is rejected because that would
involve 1 million binary searches of the A table.
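The effect is easy to observe directly. Below is a small sketch (not from the
original database) using Python's built-in sqlite3 module: the table and column
names come from the simplified schema above, but the row counts and the skewed
data (every row of A sharing the same w) are invented for illustration, and the
exact EXPLAIN QUERY PLAN text varies by SQLite version:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE a(w, x, PRIMARY KEY(w, x));
    CREATE TABLE b(y, z, PRIMARY KEY(y, z));
""")

# Skewed data as described in the thread: every row of a has the same w,
# so the w=3 constraint does not narrow the search at all.
db.executemany("INSERT INTO a VALUES (3, ?)", [(i,) for i in range(1000)])
db.executemany("INSERT INTO b VALUES (?, ?)", [(i, i) for i in range(1000)])

query = "SELECT * FROM a JOIN b ON w=3 AND x=z"

# Without statistics, the planner must guess the data distribution.
print("-- plan before ANALYZE --")
for row in db.execute("EXPLAIN QUERY PLAN " + query):
    print(row)

# ANALYZE gathers the real distribution into sqlite_stat1,
# letting the planner see that w=3 is not selective.
db.execute("ANALYZE")

print("-- plan after ANALYZE --")
for row in db.execute("EXPLAIN QUERY PLAN " + query):
    print(row)
```

Comparing the two printed plans on a skewed data set like this shows how the
statistics change the planner's cost estimates, which is the whole story of
this thread.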
The cost of 1 million binary searches is slightly greater than the cost of
doing 10 full table scans of B, so the first plan, with A in the outer loop
and B in the inner loop, is considered slightly cheaper.

All of the above is correct, and the most efficient query plan (the first
plan) is chosen, IF the content of the tables is as SQLite guesses it to be:
1 million rows in each table, with the w=3 constraint restricting the search
to 10 rows.

In reality the data is distributed very differently. There are 12366 rows in
table A, and w=3 does not restrict the search at all because every row has
the same value for w. Hence, using the first term of the primary key is
utterly useless in helping to restrict the amount of work done. With
SQLite's guess of the data distribution, a search on the w=3 constraint cuts
the amount of work by a factor of about 40,000 or so, when in reality it
does not restrict the search space at all. Hence, SQLite is underestimating
the amount of work done by the first plan by a factor of 40,000 or so.

After you run ANALYZE, SQLite has better information about the content of
the tables. It knows, for example, that the w=3 constraint is not useful in
restricting the search, and hence it is better able to estimate the work
required to run each proposed query plan. With better information, the query
planner is therefore able to choose the second, faster query plan.

Without ANALYZE, SQLite should have been choosing the first plan all along.
The fact that the second plan was chosen by version 3.6.23 was the result of
a problem in the query planner that got fixed in version 3.7.1. The fact
that the second plan turned out to be faster for the peculiar content of
this database was just an accident.

> Max

-- 
D. Richard Hipp
[email protected]
_______________________________________________
sqlite-users mailing list
[email protected]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

