Dennis Cote wrote:
The fast query does a table scan of the Timesheet table looking for
records that are between the specified dates. For each such record, it
looks up the client, project, and employee records by matching the
timesheet fields to the primary key of those tables. These extended
records are collected and sorted.
Assuming 10% of the timesheet records match the date range, you have
one scan of the 4000 record table, and 400 direct record lookups in
each of the other three tables. Then a sort of 400 records.
This query would execute even faster if there was an index on the date
field of the timesheet table. Then it could go directly to the first
record that matches the date range and scan only the 400 records until
the first date that is outside the date range.
The slow query on the other hand is not very good. It does cross join
of the project and employee tables to produce 15 x 200 = 3000 records
with every combination of project and employee. Then for each of those
records it does a lookup of the client records that corresponds to the
project using the primary key of the client table. Now for each of
these 3000 extended records, it scans the 4000 record timesheet table
looking for records that match the employee id, client id, and project
id. That is 3000 scans of 4000 records, or 12,000,000 sets of
comparisons. Each one that matches produces an extended record with
all the data from the four tables. Finally, checks if the date of the
extended record is between the specified dates. This query does not
sort the final results. It should locate the same number of records as
the first query, but will be much slower since it starts at the wrong
end.
Thank you for the very thorough explanation. Indeed, adding an index on
timesheet.date makes the slow query quite fast. The worst case
scenario, in which the date range is wide enough that all records are
selected, is unaffected, which is fine.
When I moved this change to the server which is actually going to be
running this (which was Sqlite 3.1.x) the query was still slow even with
the index available. However, upgrading it to 3.3.3 and ANALYZE'ing it
fixed the problem there, as well.
BTW, the "slow" query is not supposed to sort the final results, since
it works better with SQLObject if I sort the result in Python.
At some point I will be switching to SQLAlchemy as my ORM and will have
much better flexibility, I think.
Thanks again,
Steve