Hi All,
This is more of an academic question as I've come up with a better
query, but I was wondering why SQLite doesn't optimise this query.
Lets say I have two tables, simplified here. One contains webpage
contents and a unique hash of those contents (the primary key), the
other contains a history of lookups. The hash is used as a foreign key
between the two tables.
Table: webpage_contents
content_hash -- Primary key
post_processing_info
page_content
This table has about 10,000 rows
crawling_lookups
content_hash -- foreign key, has an index
is_json -- a flag indicating if it's JSON
is_html -- a flag indicating if it's HTML
This table has 500,000 rows
(Note: I appreciate this is bad design now and the flags should be in
webpage content; it evolved organically).
Now, I want to query the content table post_processing_info, and join in
the flags from the lookups at the same time. After a bit of fiddling
around to stop duplication happening, I got this:
SELECT
w.post_processing_info,
l.is_json
FROM
webpage_contents w
JOIN
(SELECT * from lookup_tables.lookups GROUP BY content_xxhash) l
USING (content_hash)
WHERE
content_hash = 'abc'
This takes about 2 seconds to run, which is quite slow given
content_hash is indexed in both tables and it's running from an SSD.
I appreciate there are better ways to write this (and I'm using one now,
I was a wee bit meandering getting there though), but it I was left
wondering - why did SQLite not optimise the subquery?
The Explain Query Plan for that is:
100SCAN TABLE crawling_lookups USING INDEX content_hash_idx
000SEARCH TABLE webpage_contents AS c USING INDEX
sqlite_autoindex_page_contents_1 (content_hash=?)
011SCAN SUBQUERY 1 AS l
It seems to be doing the group on the subselect using the index (fast -
about 0.1 seconds), then at step 3 doing a scan of the subquery results
for the content_hash. This would be the slow part as that doesn't have
an index.
I figured SQL would spot that the where clause in the outermost scope
would also apply to the subselect and put it in there automatically so
the group-by would only be for the things the WHERE clause affected.
Is this an optimisation opportunity? Or is my SQL so bad it was
inevitable (more likely)?
Cheers,
Jonathan
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users