Hi List,
I have a query that's quite slow on a small sample of data and I'd like
to optimise it before larger datasets get in there. With just 1000 rows
in each table I'm currently seeing query times of ~0.2 seconds. My
attempts to optimise it so far by putting in carefully thought out
indexes have _increased_ the time to 0.4s! Hmmmm, so much for "carefully
thought out". So I've removed all the indexes again and am asking the
experts here for some guidance.
The reduced table structure is as below. It's for some web-crawling -
basically each URL will be stored once in the "urls" table. It also
keeps a reference to what the original source url that ended up coming
to this url was (that's source_seed_id - it's used in this query), and
the direct parent (parent_id).
Then any time the application scrapes one of the url's, an entry is
written into the lookups table.
The idea behind the query is to see whether during the last crawl the
application got to the last page in the paginated "chain" (it will have
a specific error code which the application is looking for). This means
it's somewhat recursive.
SQLite IDE version (where the timings are coming from): 3.15.0
SQlite version behind the application itself (not timed the query here,
but it's still slow): 3.22.0
CREATE TABLE urls (
url_id INTEGER PRIMARY KEY AUTOINCREMENT,
url TEXT UNIQUE,
parent_id INTEGER REFERENCES urls (url_id)
source_seed_id INTEGER REFERENCES urls (url_id)
);
CREATE TABLE lookups (
url_id INTEGER REFERENCES urls (url_id),
retrieval_datetime DATETIME,
error_code INTEGER,
is_generic_flag BOOLEAN -- about one in 100 will have this
flagged
);
-- This view is used as a convenience, including in the problem query below
CREATE VIEW v_most_recent_lookup_per_url AS
SELECT url_id,
MAX(retrieval_datetime) AS retrieval_datetime
FROM lookups
GROUP BY url_id
ORDER BY url_id;
------------
The query itself:
------------
SELECT
u.url, l.error_code
FROM
urls u
LEFT JOIN
lookups l
USING(url_id)
LEFT JOIN
v_most_recent_lookup_per_url recent
-- This is the important bit
-- Here we use the most recent lookup url_id to link to the
source_seed_id, so we only find its children
-- Or alternatively itself
ON u.source_seed_id = recent.url_id
OR u.url_id = recent.url_id
WHERE
l.is_generic_flag = 1
AND
-- Must be "or equal to" so we can get the lookup of the very
base url.
l.retrieval_datetime >= recent.retrieval_datetime
AND
DATETIME(recent.retrieval_datetime) > DATETIME('now', '-5 days')
AND
recent.url_id in (
select url_id from urls where url = 'example.com'
)
-- By definition url's can only be inserted in pagination order, so
we can order by url_id.
ORDER BY
u.url_id DESC
LIMIT 1;
The current EXPLAIN QUERY PLAN:
1 0 0 SCAN TABLE lookups
1 0 0 USE TEMP B-TREE FOR GROUP BY
0 0 0 SCAN TABLE urls AS u
0 1 1 SEARCH TABLE lookups AS l USING AUTOMATIC COVERING INDEX
(is_generic_flag=? AND url_id=?)
0 2 2 SCAN SUBQUERY 1 AS recent
0 0 0 EXECUTE LIST SUBQUERY 2
2 0 0 SEARCH TABLE urls USING COVERING INDEX
sqlite_autoindex_urls_1 (url=?)
And the EXPLAIN:
0 Init 0 118 0 00
1 Integer 50 1 0 00
2 Once 0 50 0 00
3 Noop 5 4 0 00
4 OpenEphemeral 2 2 0 00
5 SorterOpen 6 2 0 k(1,B) 00
6 Integer 0 6 0 00
7 Integer 0 5 0 00
8 Null 0 9 9 00
9 Gosub 8 48 0 00
10 OpenRead 3 7 0 3 00
11 Rewind 3 17 0 00
12 Column 3 0 11 00
13 Column 3 2 12 00
14 MakeRecord 11 2 13 00
15 SorterInsert 6 13 0 00
16 Next 3 12 0 01
17 Close 3 0 0 00
18 OpenPseudo 7 13 2 00
19 SorterSort 6 50 0 00
20 SorterData 6 13 7 00
21 Column 7 0 10 00
22 Compare 9 10 1 k(1,B) 00
23 Jump 24 28 24 00
24 Move 10 9 1 00
25 Gosub 7 39 0 00
26 IfPos 6 50 0 00
27 Gosub 8 48 0 00
28 Column 7 1 14 00
29 CollSeq 15 0 0 (BINARY) 00
30 AggStep0 0 14 3 max(1) 01
31 If 15 33 0 00
32 Column 7 0 2 00
33 Integer 1 5 0 00
34 SorterNext 6 20 0 00
35 Gosub 7 39 0 00
36 Goto 0 50 0 00
37 Integer 1 6 0 00
38 Return 7 0 0 00
39 IfPos 5 41 0 00
40 Return 7 0 0 00
41 AggFinal 3 1 0 max(1) 00
42 SCopy 2 16 0 00
43 SCopy 3 17 0 00
44 MakeRecord 16 2 14 00
45 NewRowid 2 18 0 00
46 Insert 2 14 18 08
47 Return 7 0 0 00
48 Null 0 2 4 00
49 Return 8 0 0 00
50 Return 1 0 0 00
51 Noop 8 4 0 00
52 Integer 1 19 0 00
53 OpenRead 0 8 0 5 00
54 OpenRead 1 7 0 7 00
55 Last 0 116 0 00
56 Once 0 67 0 00
57 OpenAutoindex 9 5 0 k(5,B,B,,,) 00
58 Rewind 1 67 0 00
59 Column 1 6 21 00
60 Column 1 0 22 00
61 Column 1 2 23 00
62 Column 1 5 24 00
63 Rowid 1 25 0 00
64 MakeRecord 21 5 20 00
65 IdxInsert 9 20 0 10
66 Next 1 59 0 03
67 Integer 0 26 0 00
68 Integer 1 27 0 00
69 Rowid 0 28 0 00
70 SeekGE 9 112 27 2 00
71 IdxGT 9 112 27 2 00
72 Integer 1 26 0 00
73 Column 9 0 20 00
74 Ne 29 111 20 (BINARY) 53
75 Integer 0 30 0 00
76 Rewind 2 108 0 00
77 Column 0 4 31 00
78 Column 2 0 32 00
79 Eq 32 82 31 (BINARY) 43
80 Rowid 0 33 0 00
81 Ne 32 107 33 (BINARY) 53
82 Integer 1 30 0 00
83 Column 9 2 31 00
84 Column 2 1 20 00
85 Lt 20 107 31 (BINARY) 53
86 Copy 20 34 0 00
87 Function0 0 34 33 datetime(-1) 01
88 Le 35 107 33 51
89 Once 0 99 0 00
90 OpenEphemeral 11 1 0 k(1,B) 00
91 OpenRead 12 9 0 k(2,,) 02
92 String8 0 36 0 example.com 00
93 SeekGE 12 98 36 1 00
94 IdxGT 12 98 36 1 00
95 IdxRowid 12 37 0 00
96 MakeRecord 37 1 31 C 00
97 IdxInsert 11 31 0 00
98 Close 12 0 0 00
99 Column 2 0 31 00
100 IsNull 31 107 0 00
101 Affinity 31 1 0 C 00
102 NotFound 11 107 31 1 00
103 Column 0 1 38 00
104 Column 9 3 39 00
105 ResultRow 38 2 0 00
106 DecrJumpZero 19 116 0 00
107 Next 2 77 0 01
108 IfPos 30 111 0 00
109 NullRow 2 0 0 00
110 Goto 0 82 0 00
111 Next 9 71 0 00
112 IfPos 26 115 0 00
113 NullRow 9 0 0 00
114 Goto 0 72 0 00
115 Prev 0 56 0 01
116 Close 0 0 0 00
117 Halt 0 0 0 00
118 Transaction 0 0 338 1 01
119 TableLock 0 7 0 lookups 00
120 TableLock 0 8 0 urls 00
121 Integer 1 29 0 00
122 String8 0 40 0 now 00
123 String8 0 41 0 -5 days 00
124 Function0 3 40 35 datetime(-1) 02
125 Goto 0 1 0 00
Any thoughts/suggestions/feedback welcome.
Many thanks,
Jonathan
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users