Hi All,
This is more of an academic question as I've come up with a better query, but I was wondering why SQLite doesn't optimise this query.

Lets say I have two tables, simplified here. One contains webpage contents and a unique hash of those contents (the primary key), the other contains a history of lookups. The hash is used as a foreign key between the two tables.

Table: webpage_contents
content_hash -- Primary key
post_processing_info
page_content

This table has about 10,000 rows

crawling_lookups
content_hash -- foreign key, has an index
is_json -- a flag indicating if it's JSON
is_html -- a flag indicating if it's HTML

This table has 500,000 rows

(Note: I appreciate this is bad design now and the flags should be in webpage content; it evolved organically).

Now, I want to query the content table post_processing_info, and join in the flags from the lookups at the same time. After a bit of fiddling around to stop duplication happening, I got this:

SELECT
    w.post_processing_info,
    l.is_json
FROM
    webpage_contents w
JOIN
    (SELECT * from lookup_tables.lookups GROUP BY content_xxhash) l
    USING (content_hash)
WHERE
    content_hash = 'abc'


This takes about 2 seconds to run, which is quite slow given content_hash is indexed in both tables and it's running from an SSD.

I appreciate there are better ways to write this (and I'm using one now, I was a wee bit meandering getting there though), but it I was left wondering - why did SQLite not optimise the subquery?

The Explain Query Plan for that is:

1    0    0    SCAN TABLE crawling_lookups USING INDEX content_hash_idx
0 0 0 SEARCH TABLE webpage_contents AS c USING INDEX sqlite_autoindex_page_contents_1 (content_hash=?)
0    1    1    SCAN SUBQUERY 1 AS l

It seems to be doing the group on the subselect using the index (fast - about 0.1 seconds), then at step 3 doing a scan of the subquery results for the content_hash. This would be the slow part as that doesn't have an index. I figured SQL would spot that the where clause in the outermost scope would also apply to the subselect and put it in there automatically so the group-by would only be for the things the WHERE clause affected.

Is this an optimisation opportunity? Or is my SQL so bad it was inevitable (more likely)?

Cheers,
Jonathan

_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to