Did you try retrieving the data "directly", or do you need the subselect in
order to maintain compatibility with other SQL dialects that cannot retrieve
data from the row on which the max() was found?

CREATE TABLE entrys
(
    logid       INTEGER NOT NULL,
    entrynumber INTEGER NOT NULL,
    region      TEXT NOT NULL,
    key         TEXT NOT NULL,
    timestamp   INTEGER NOT NULL,
    PRIMARY KEY (logid, entrynumber)
);

CREATE INDEX a ON entrys (region, logid, key, entrynumber);

  SELECT entrys.logid            AS logid,
         max(entrys.entrynumber) AS entrynumber,
         entrys.region           AS region,
         entrys.key              AS key,
         entrys.timestamp        AS timestamp
    FROM entrys
   WHERE entrys.region = ?
     AND entrys.key > ?
     AND entrys.logid = ?
GROUP BY key
;
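In case the bare-column behaviour is unfamiliar: since SQLite 3.7.11, when a
query contains a single min() or max() aggregate, the non-aggregated columns
are taken from the row on which that extreme was found. A quick sketch using
Python's built-in sqlite3 module with made-up sample rows (the ORDER BY is
added here only to make the output deterministic):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE entrys (
    logid       INTEGER NOT NULL,
    entrynumber INTEGER NOT NULL,
    region      TEXT NOT NULL,
    key         TEXT NOT NULL,
    timestamp   INTEGER NOT NULL,
    PRIMARY KEY (logid, entrynumber)
);
CREATE INDEX a ON entrys (region, logid, key, entrynumber);
""")

# Hypothetical data: two versions of key 'K1', one version of key 'K2'.
con.executemany("INSERT INTO entrys VALUES (?,?,?,?,?)", [
    (1, 1, "user", "K1", 100),
    (1, 2, "user", "K1", 200),   # latest version of K1
    (1, 3, "user", "K2", 300),
])

rows = con.execute("""
  SELECT logid, max(entrynumber) AS entrynumber, region, key, timestamp
    FROM entrys
   WHERE region = ? AND key > ? AND logid = ?
GROUP BY key
ORDER BY key
""", ("user", "", 1)).fetchall()

# The bare timestamp column comes from the same row as the max entrynumber.
print(rows)  # [(1, 2, 'user', 'K1', 200), (1, 3, 'user', 'K2', 300)]
```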

NB:  I changed the ill-conceived column names to ones that do not require 
quoting, and replaced the identifier-quoted items that are not column names 
with parameter markers.
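For the pagination question: the key > ? parameter already gives you keyset
pagination -- feed the last key of the previous batch back in, and the next
batch resumes the index scan from there with no materialization. A sketch
(Python's sqlite3, hypothetical sample data, LIMIT bound as a parameter):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE entrys (
    logid       INTEGER NOT NULL,
    entrynumber INTEGER NOT NULL,
    region      TEXT NOT NULL,
    key         TEXT NOT NULL,
    timestamp   INTEGER NOT NULL,
    PRIMARY KEY (logid, entrynumber)
);
CREATE INDEX a ON entrys (region, logid, key, entrynumber);
""")

# Hypothetical data: five distinct keys, one version each.
con.executemany("INSERT INTO entrys VALUES (?,?,?,?,?)",
                [(1, i, "user", "K%d" % i, i * 10) for i in range(1, 6)])

def page(last_key, limit=2):
    # Resume the scan just past the last key of the previous batch.
    return con.execute("""
      SELECT logid, max(entrynumber) AS entrynumber, region, key, timestamp
        FROM entrys
       WHERE region = ? AND key > ? AND logid = ?
    GROUP BY key
    ORDER BY key
       LIMIT ?""", ("user", last_key, 1, limit)).fetchall()

first = page("")             # keys K1, K2
second = page(first[-1][3])  # keys K3, K4
```

Per-batch memory is then bounded by the LIMIT, and each batch is equally
cheap whether it is at the start or the end of the full result set.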

-- 
The fact that there's a Highway to Hell but only a Stairway to Heaven says a 
lot about anticipated traffic volume.

>-----Original Message-----
>From: sqlite-users <sqlite-users-boun...@mailinglists.sqlite.org> On
>Behalf Of Andy Bennett
>Sent: Wednesday, 20 November, 2019 09:49
>To: sqlite-users@mailinglists.sqlite.org
>Subject: [sqlite] Optimising query with aggregate in subselect.
>
>Hi,
>
>I'm trying to implement a "streaming" version of the classic "select the
>latest version of a record" query.
>
>
>By "streaming" I mean a query that executes by streaming what it needs
>out
>of tables and indexes as it needs it rather than using temporary b-trees
>or
>materializing anything up front.
>
>I'm looking for a query that I can run and then just consume as many
>results as I want without worrying about the size of the entire result
>set.
>
>
>Here's the schema I'm working with:
>
>-----
>CREATE TABLE "entrys" ("log-id" INTEGER NOT NULL , "entry-number" INTEGER
>NOT NULL , "region" TEXT NOT NULL , "key" TEXT NOT NULL , "timestamp"
>INTEGER NOT NULL , PRIMARY KEY ("log-id", "entry-number"))
>
>CREATE UNIQUE INDEX "entrys-log-id-region-key-entry-number" ON "entrys" (
>"log-id" ASC, "region" ASC, "key" ASC, "entry-number" ASC)
>-----
>
>There's only a couple of million rows in "entrys", yet my queries take 2 or
>3 seconds of startup time before the first row is returned.
>
>
>
>Here's my query:
>
>-----
>-- explain query plan
>SELECT
>"entrys"."log-id"       AS "log-id",
>"entrys"."entry-number" AS "entry-number",
>"entrys"."region"       AS "region",
>"entrys"."key"          AS "key",
>"entrys"."timestamp"    AS "timestamp"
>
>FROM
>       (SELECT
>       MAX("entry-number") AS "entry-number",
>       "key"
>       FROM "entrys"
>       WHERE
>       "log-id" = 1 AND
>       "region" = "user" AND
>       "entry-number" <= 1700108
>       AND key > "G"
>       GROUP BY "key"
>       ORDER BY "key" DESC
>       limit 20 -- (1)
>       ) AS "specific-entrys"
>
>INNER JOIN "entrys"
>ON
>1 = "entrys"."log-id" AND
>"specific-entrys"."key" = "entrys"."key" AND
>"user" = "entrys"."region" AND
>"specific-entrys"."entry-number" = "entrys"."entry-number"
>AND "entrys"."key" > "G"
>
>WHERE
>"entrys"."log-id" = 1
>
>ORDER BY "key" ASC
>;
>-----
>
>...which has this query plan in SQLite version 3.31.0
>
>-----
>QUERY PLAN
>|--MATERIALIZE 1
>|  `--SEARCH TABLE entrys USING COVERING INDEX
>entrys-log-id-region-key-entry-number (log-id=? AND region=? AND key<?)
>|--SEARCH TABLE entrys USING INDEX entrys-log-id-region-key-entry-number
>(log-id=? AND region=? AND key<?)
>`--SEARCH SUBQUERY 1 AS specific-entrys USING AUTOMATIC COVERING INDEX
>(key=?)
>-----
>
>
>My problem is with the MATERIALIZE.
>
>The query produces just shy of 2 million rows.
>
>It takes several seconds to start up but is then pretty quick when
>fetching
>each row.
>
>What I want to do is get rid of the startup costs so that I can paginate
>it
>efficiently.
>
>If I run the subselect on its own then there is no startup cost. The
>results just get streamed straight out of the index.
>Its query plan is
>
>-----
>QUERY PLAN
>`--SEARCH TABLE entrys USING COVERING INDEX
>entrys-log-id-region-key-entry-number (log-id=? AND region=? AND key<?)
>-----
>
>
>Is there anything I can do to make the original version of the query
>stream
>the results in this way?
>
>The best I have come up with is to insert a LIMIT clause at the point
>denoted with "-- (1)". This keeps the subselect small and then
>materialising the subselect and generating the automatic covering index
>becomes cheap.
>
>For pagination I then feed in the key from the last row of the previous
>batch at the points denoted with "-- (2)" and "-- (3)".
>
>If I do this then it seems equally cheap to access batches at the start
>and
>end of the complete result set. The per-query cost is determined by the
>batch size as set by the LIMIT clause.
>
>
>However, I have been under the impression that LIMIT is supposed to be a
>"debugging" extension to the language and not recommended for use in
>queries that end up in one's program. LIMIT also only hides the latency;
>it
>amortises it over each batch, but I still end up with memory requirements
>in each "client" thread that are larger than I'd like; ideally I'd just
>store the current row.
>
>
>
>Thanks for any tips!
>
>
>
>
>Best wishes,
>@ndy
>
>--
>andy...@ashurst.eu.org
>http://www.ashurst.eu.org/
>0x7EBA75FF
>_______________________________________________
>sqlite-users mailing list
>sqlite-users@mailinglists.sqlite.org
>http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


