Re: [sqlite] Why doesn't SQLite optimise this subselect?

2018-01-06 Thread Dinu
If you want just 1 value (any) from lookup, you can use:

SELECT
 post_processing_info, 
 (
SELECT
is_json
FROM
tables_lookup
WHERE
tables_lookup.content_hash=webpage_contents.content_hash
LIMIT 1
 ) AS is_json
FROM 
 webpage_contents
WHERE 
 content_hash = 'abc' 

If you want only one value,
or

SELECT
post_processing_info, 
x.is_json
FROM
 webpage_contents
JOIN
 (
SELECT 
content_hash,
MIN(is_json)
FROM
tables_lookup
GROUP BY
content_hash
 ) AS x ON
 x.content_hash=webpage_contents.content_hash
FROM 
 webpage_contents

if you want the full join... I'm using MIN() here as an arbitrary function
to get just one value from the group.



--
Sent from: http://sqlite.1065341.n5.nabble.com/
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Why doesn't SQLite optimise this subselect?

2018-01-06 Thread Dinu
Could you describe your intent in more detail?

SELECT * FROM ... GROUP BY is unstable at best, so I cannot really grasp
your intention. Which lookup record's 

If I guess right, you might want something in the lines of:

SELECT
 w.post_processing_info, 
 l.is_json 
FROM 
 webpage_contents w 
JOIN 
 (SELECT DISTINCT is_json from lookup_tables.lookups) l 
 USING (content_hash) 
WHERE 
 content_hash = 'abc' 

(this might return 0-3 rows if lookup_tables contains 3 values (0,1,and
NULL) for is_json)

which in turn I think would optimize best if written as:

SELECT DISTINCT
 w.content_hash,
 w.post_processing_info, 
 l.is_json 
FROM 
 webpage_contents w 
JOIN 
 lookup_tables.lookups l 
 USING (content_hash) 
WHERE 
 content_hash = 'abc' 




--
Sent from: http://sqlite.1065341.n5.nabble.com/
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


[sqlite] Why doesn't SQLite optimise this subselect?

2018-01-06 Thread Jonathan Moules

Hi All,
  This is more of an academic question as I've come up with a better 
query, but I was wondering why SQLite doesn't optimise this query.


Lets say I have two tables, simplified here. One contains webpage 
contents and a unique hash of those contents (the primary key), the 
other contains a history of lookups. The hash is used as a foreign key 
between the two tables.


Table: webpage_contents
content_hash -- Primary key
post_processing_info
page_content

This table has about 10,000 rows

crawling_lookups
content_hash -- foreign key, has an index
is_json -- a flag indicating if it's JSON
is_html -- a flag indicating if it's HTML

This table has 500,000 rows

(Note: I appreciate this is bad design now and the flags should be in 
webpage content; it evolved organically).


Now, I want to query the content table post_processing_info, and join in 
the flags from the lookups at the same time. After a bit of fiddling 
around to stop duplication happening, I got this:


SELECT
w.post_processing_info,
l.is_json
FROM
webpage_contents w
JOIN
(SELECT * from lookup_tables.lookups GROUP BY content_xxhash) l
USING (content_hash)
WHERE
content_hash = 'abc'


This takes about 2 seconds to run, which is quite slow given 
content_hash is indexed in both tables and it's running from an SSD.


I appreciate there are better ways to write this (and I'm using one now, 
I was a wee bit meandering getting there though), but it I was left 
wondering - why did SQLite not optimise the subquery?


The Explain Query Plan for that is:

100SCAN TABLE crawling_lookups USING INDEX content_hash_idx
000SEARCH TABLE webpage_contents AS c USING INDEX 
sqlite_autoindex_page_contents_1 (content_hash=?)

011SCAN SUBQUERY 1 AS l

It seems to be doing the group on the subselect using the index (fast - 
about 0.1 seconds), then at step 3 doing a scan of the subquery results 
for the content_hash. This would be the slow part as that doesn't have 
an index.
I figured SQL would spot that the where clause in the outermost scope 
would also apply to the subselect and put it in there automatically so 
the group-by would only be for the things the WHERE clause affected.


Is this an optimisation opportunity? Or is my SQL so bad it was 
inevitable (more likely)?


Cheers,
Jonathan

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users