Aaron Schneider wrote:
Hello All,
I've using SQLite 2.8.16 with a music management application, and I'm
trying out why certain queries take a long time and to figure out how
SQLite uses my indexes. I've got a master "media" table with a couple
of auxiliary tables like "artists", "albums", and "genres". Each
table's primary key is ____id (mediaid, artistid, albumid, and genreid),
and the media table contains columns to each of the side tables' primary
key (and an index for each one, too).
I found that a particular query takes forever when run on a particular
database:
SELECT DISTINCT albums.albumid, albums.name_lower FROM artists, albums,
media WHERE albums.media_count > 0 AND media.albumid=albums.albumid AND
media.artistid=artists.artistid ORDER BY artists.name_key,
albums.name_key;
In this query sqlite does a table scan through the first table, artists,
to select the records. This table is joined to the media table by the
third condition. I suspect that you do not have an index on the artistid
column in the media table, so sqlite must do a full table scan of the
media table for each record in the artists table (this is what is
slowing it down). This generates a set of virtual records that is the
size of the artists table multiplied by the size of the media table.
Each of these virtual records are then joined to the albums table using
the second condition. In this case I believe that the albumid field is
the key for the albums table, so sqlite can go directly to each matching
record in the albums table. Finally it checks the first condition for
each of these extended records to see if it should be selected. All the
matching records are collected and then sorted before being returned to
you.
The lack of an index on the media table artistid field forces many scans
of your largest table.
Well, the largest table (by far) in these queries is always the media
table, and I found that by moving the media table to the beginning of
the FROM list, it runs almost instantly:
SELECT DISTINCT albums.albumid, albums.name_lower FROM media, artists,
albums WHERE albums.media_count > 0 AND media.albumid=albums.albumid AND
media.artistid=artists.artistid ORDER BY artists.name_key,
albums.name_key
Thus, when "media" is first, SQLite scans each record in the media table
and tests the smaller tables with the NotExists command which uses the
primary key to locate the record immediately.
Yes, I believe this is correct. This requires only one table scan of the
media table thought.
I've been reading up on the virtual machine, op-codes, and performance
tips in the wiki for the past few days, and I've got a few questions
about what was happening:
1) Shouldn't it be faster to iterate through one of the smaller tables
and then use an index on the media table to join with the other small
table?
Yes, see above. You just need the correct index.
2) It's unclear to me what order my FROM tables should be in. In the
first query, did I accidently choose the absolute worst order for the
FROM list?
Yes, I think you did. If you had used the albums table first, it would
have been able to apply the first condition to each record as it scanned
that table. This would have eliminated some records immediately. It
would have then done a slow join by scanning the media table to find
matching albumids (assuming there is no index on the albumid field in
the media table). This would have been just as bad as the first case.
The relative time would depend on the size of the artists table vs the
number of records in the albums table that meet the first criterion.
Again this can be corrected by creating the required index.
So the tables in the FROM clause should appear in the order that tables
are introduced, with special consideration to the first table in the
list be the primary value we are SELECTing? In the query above, that
would mean "FROM albums, media, artists". (Both "FROM media, albums,
artists" and "FROM albums, media, artists" run too fast for eye-ball
speed compare.)
3) What role, if any, do the ORDER BY columns play in the FROM order?
(I would assume none since by the time you've selected a row, you have
all of the data for the sorting of that row.)
I believe sqlite will optimize the order by clause if it can use an
index to scan the first table in the the order that is required. If not
it just does a table scan through the first table and then sorts the
collected records. In your case, you are ordering by columns from two
tables, so you can't create an index, and it can't be optimized.
4) Is there a query optimizer for sqlite? A program that automagically
discovered which queries were not in optimal form that could propose a
different order or alternate values for certain terms in FROM and WHERE
clauses? I'm thinking that the input would be a query and an already
existing database (with indexes), and the output would be a (better)
optimized query. This would be especially useful for SQLite since
everything must be hand-tuned.
I believe that Richard is currently working on improving the query
optimization of sqlite, but your best bet right now is using the explain
command to examine how sqlite will execute various queries.
Thanks!
Aaron
HTH
Dennis Cote