Hi List,
So, I thought I'd solved my little problem, but upon testing it in my
application it subjectively didn't seem any faster.
I upgraded the SQLite in my IDE to 3.22, and it is confirming my
suspicions. It turns out that when I run the below in 3.22, it takes
about 0.150s. But in 3.15 it was taking 0.004s!
The original query also takes 0.15s in 3.22 - so that has been mildly
optimised (from ~0.2s). My general thinking-aloud notion is that my
"fix" is getting optimised away in 3.22.
I can provide a small replication database if desired.
Thanks,
Jonathan
On 2018-03-19 00:24, Jonathan Moules wrote:
Thanks Simon and Quan.
I'm not sure it's the view itself per-se - It takes 0.000s (time too
small to measure) for just the full View to be run on this dataset.
It turns out the problem is simpler than that and no data changes are
needed. I did consider Quan Yong Zhai's option and gave it a try, but
couldn't fathom out the necessary SQL to get what I wanted (it's
getting late).
Instead prompted by the replies here, I've changed the query very
slightly to the below which solves the problem:
SELECT
u.url, l.error_code
FROM
urls u
LEFT JOIN
lookups l
USING(url_id)
LEFT JOIN
(select * from v_most_recent_lookup_per_url where url_id in (
select url_id from urls where url = 'example.com'
)) recent
-- By definition url's can) recent
-- This is the important bit
-- Here we use the most recent lookup url_id to link to the
source_seed_id, so we only find its children
-- Or alternatively itself
ON u.source_seed_id = recent.url_id
OR u.url_id = recent.url_id
WHERE
-- For JSON-spider at least, Generic's are guaranteed to be
generic pages.
l.is_generic_flag = 1
AND
-- Must be "or equal to" so we can get the lookup of the very
base url.
l.retrieval_datetime >= recent.retrieval_datetime
AND
DATETIME(recent.retrieval_datetime) > DATETIME('now', '-5 days')
ORDER BY
u.url_id DESC
LIMIT 1;
To save readers having to compare manually, the difference is this: I
turned the "recent" alias item from the View into a subquery (still
using the view), and then moved the "where url = example.com" part in
to there.
The query is now literally two orders of magnitude faster, from 0.2s
to 0.004s. No new indexes or anything, just that change.
Hopefully this will scale to full datasets; if it doesn't I may have
to consider the other suggestions, but for now this is a
minimum-effort solution.
I'm not actually sure what SQLite was doing in the previous query to
make it take so long. , so I imagine there was some hideous recursing
going on or something.
Scope for optimisation?
Thanks again,
Jonathan
On 2018-03-18 23:37, Simon Slavin wrote:
On 18 Mar 2018, at 11:13pm, Jonathan Moules
<jonathan-li...@lightpear.com> wrote:
Given there's such a small amount of data at this point, I suspect
the issue is more related to the recursion. I've tried creating
these two indexes to facilicate that
Nice idea but I can see why it's not working.
You have an underlying problem: the format you're using to store your
data makes it extremely difficult to extract the figures you want.
Quan Yong Zhai has the best idea I've seen: get rid of almost ¾ of
the work you're doing by storing the last retrieval date in your
"urls" table.
As an alternative, store the start your retrieval process by JOINing
the two tables together. Consider the result of this query
SELECT url_id, lookups.error_code
FROM urls
JOIN lookups ON lookups.url_id = urls.url_id AND
lookup.retrieval_datetime = urls.retrieval_datetime
WHERE DATETIME(urls.retrieval_datetime) > DATETIME('now', '-5
days')
and figure out what you would add to that to get your desired result.
Simon.
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users