Hi List,
I have a query that's quite slow on a small sample of data and I'd like to optimise it before larger datasets get in there. With just 1000 rows in each table I'm currently seeing query times of ~0.2 seconds. My attempts to optimise it so far by putting in carefully thought out indexes have _increased_ the time to 0.4s! Hmmmm, so much for "carefully thought out". So I've removed all the indexes again and am asking the experts here for some guidance.

The reduced table structure is as below. It's for some web-crawling - basically each URL will be stored once in the "urls" table. It also keeps a reference to what the original source url that ended up coming to this url was (that's source_seed_id - it's used in this query), and the direct parent (parent_id). Then any time the application scrapes one of the url's, an entry is written into the lookups table.

The idea behind the query is to see whether during the last crawl the application got to the last page in the paginated "chain" (it will have a specific error code which the application is looking for). This means it's somewhat recursive.
SQLite IDE version (where the timings are coming from): 3.15.0
SQlite version behind the application itself (not timed the query here, but it's still slow): 3.22.0


CREATE TABLE urls (
    url_id         INTEGER  PRIMARY KEY AUTOINCREMENT,
    url            TEXT     UNIQUE,
    parent_id        INTEGER  REFERENCES urls (url_id)
    source_seed_id INTEGER  REFERENCES urls (url_id)
);


CREATE TABLE lookups (
    url_id               INTEGER  REFERENCES urls (url_id),
    retrieval_datetime   DATETIME,
    error_code           INTEGER,
is_generic_flag BOOLEAN -- about one in 100 will have this flagged
);


-- This view is used as a convenience, including in the problem query below

CREATE VIEW v_most_recent_lookup_per_url AS
    SELECT url_id,
           MAX(retrieval_datetime) AS retrieval_datetime
      FROM lookups
     GROUP BY url_id
     ORDER BY url_id;

------------
The query itself:
------------

    SELECT
        u.url, l.error_code
    FROM
        urls u
    LEFT JOIN
        lookups l
        USING(url_id)
    LEFT JOIN
        v_most_recent_lookup_per_url recent
        -- This is the important bit
-- Here we use the most recent lookup url_id to link to the source_seed_id, so we only find its children
        -- Or alternatively itself
        ON u.source_seed_id = recent.url_id
            OR u.url_id = recent.url_id
    WHERE
        l.is_generic_flag = 1
        AND
-- Must be "or equal to" so we can get the lookup of the very base url.
        l.retrieval_datetime >= recent.retrieval_datetime
        AND
        DATETIME(recent.retrieval_datetime) > DATETIME('now', '-5 days')
        AND
        recent.url_id in (
            select url_id from urls where url = 'example.com'
            )
-- By definition url's can only be inserted in pagination order, so we can order by url_id.
    ORDER BY
        u.url_id DESC
    LIMIT 1;



The current EXPLAIN QUERY PLAN:

1    0    0    SCAN TABLE lookups
1    0    0    USE TEMP B-TREE FOR GROUP BY
0    0    0    SCAN TABLE urls AS u
0 1 1 SEARCH TABLE lookups AS l USING AUTOMATIC COVERING INDEX (is_generic_flag=? AND url_id=?)
0    2    2    SCAN SUBQUERY 1 AS recent
0    0    0    EXECUTE LIST SUBQUERY 2
2 0 0 SEARCH TABLE urls USING COVERING INDEX sqlite_autoindex_urls_1 (url=?)


And the EXPLAIN:


0    Init    0    118    0        00
1    Integer    50    1    0        00
2    Once    0    50    0        00
3    Noop    5    4    0        00
4    OpenEphemeral    2    2    0        00
5    SorterOpen    6    2    0    k(1,B)    00
6    Integer    0    6    0        00
7    Integer    0    5    0        00
8    Null    0    9    9        00
9    Gosub    8    48    0        00
10    OpenRead    3    7    0    3    00
11    Rewind    3    17    0        00
12    Column    3    0    11        00
13    Column    3    2    12        00
14    MakeRecord    11    2    13        00
15    SorterInsert    6    13    0        00
16    Next    3    12    0        01
17    Close    3    0    0        00
18    OpenPseudo    7    13    2        00
19    SorterSort    6    50    0        00
20    SorterData    6    13    7        00
21    Column    7    0    10        00
22    Compare    9    10    1    k(1,B)    00
23    Jump    24    28    24        00
24    Move    10    9    1        00
25    Gosub    7    39    0        00
26    IfPos    6    50    0        00
27    Gosub    8    48    0        00
28    Column    7    1    14        00
29    CollSeq    15    0    0    (BINARY)    00
30    AggStep0    0    14    3    max(1)    01
31    If    15    33    0        00
32    Column    7    0    2        00
33    Integer    1    5    0        00
34    SorterNext    6    20    0        00
35    Gosub    7    39    0        00
36    Goto    0    50    0        00
37    Integer    1    6    0        00
38    Return    7    0    0        00
39    IfPos    5    41    0        00
40    Return    7    0    0        00
41    AggFinal    3    1    0    max(1)    00
42    SCopy    2    16    0        00
43    SCopy    3    17    0        00
44    MakeRecord    16    2    14        00
45    NewRowid    2    18    0        00
46    Insert    2    14    18        08
47    Return    7    0    0        00
48    Null    0    2    4        00
49    Return    8    0    0        00
50    Return    1    0    0        00
51    Noop    8    4    0        00
52    Integer    1    19    0        00
53    OpenRead    0    8    0    5    00
54    OpenRead    1    7    0    7    00
55    Last    0    116    0        00
56    Once    0    67    0        00
57    OpenAutoindex    9    5    0    k(5,B,B,,,)    00
58    Rewind    1    67    0        00
59    Column    1    6    21        00
60    Column    1    0    22        00
61    Column    1    2    23        00
62    Column    1    5    24        00
63    Rowid    1    25    0        00
64    MakeRecord    21    5    20        00
65    IdxInsert    9    20    0        10
66    Next    1    59    0        03
67    Integer    0    26    0        00
68    Integer    1    27    0        00
69    Rowid    0    28    0        00
70    SeekGE    9    112    27    2    00
71    IdxGT    9    112    27    2    00
72    Integer    1    26    0        00
73    Column    9    0    20        00
74    Ne    29    111    20    (BINARY)    53
75    Integer    0    30    0        00
76    Rewind    2    108    0        00
77    Column    0    4    31        00
78    Column    2    0    32        00
79    Eq    32    82    31    (BINARY)    43
80    Rowid    0    33    0        00
81    Ne    32    107    33    (BINARY)    53
82    Integer    1    30    0        00
83    Column    9    2    31        00
84    Column    2    1    20        00
85    Lt    20    107    31    (BINARY)    53
86    Copy    20    34    0        00
87    Function0    0    34    33    datetime(-1)    01
88    Le    35    107    33        51
89    Once    0    99    0        00
90    OpenEphemeral    11    1    0    k(1,B)    00
91    OpenRead    12    9    0    k(2,,)    02
92    String8    0    36    0    example.com    00
93    SeekGE    12    98    36    1    00
94    IdxGT    12    98    36    1    00
95    IdxRowid    12    37    0        00
96    MakeRecord    37    1    31    C    00
97    IdxInsert    11    31    0        00
98    Close    12    0    0        00
99    Column    2    0    31        00
100    IsNull    31    107    0        00
101    Affinity    31    1    0    C    00
102    NotFound    11    107    31    1    00
103    Column    0    1    38        00
104    Column    9    3    39        00
105    ResultRow    38    2    0        00
106    DecrJumpZero    19    116    0        00
107    Next    2    77    0        01
108    IfPos    30    111    0        00
109    NullRow    2    0    0        00
110    Goto    0    82    0        00
111    Next    9    71    0        00
112    IfPos    26    115    0        00
113    NullRow    9    0    0        00
114    Goto    0    72    0        00
115    Prev    0    56    0        01
116    Close    0    0    0        00
117    Halt    0    0    0        00
118    Transaction    0    0    338    1    01
119    TableLock    0    7    0    lookups    00
120    TableLock    0    8    0    urls    00
121    Integer    1    29    0        00
122    String8    0    40    0    now    00
123    String8    0    41    0    -5 days    00
124    Function0    3    40    35    datetime(-1)    02
125    Goto    0    1    0        00


Any thoughts/suggestions/feedback welcome.
Many thanks,
Jonathan

_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to