[sqlite] Min/Max and skip-scan optimizations

Gerlando Falauto Sun, 27 Jan 2019 11:29:01 -0800

Hi,

I have a database table looking like this:


CREATE TABLE `rolling` (
    `source1`    TEXT NOT NULL,
    `source2`    TEXT NOT NULL,
    `ts`    INTEGER NOT NULL,
    `value`    TEXT
);

CREATE INDEX `sources` ON `rolling` (
    `source1`,
    `source2`,
    `ts`
);

INSERT INTO rolling
    WITH RECURSIVE
      src1( source1 ) AS ( VALUES("aaa") UNION ALL VALUES("bbb") ),
      src2( source2 ) AS ( VALUES("X1") UNION ALL VALUES("X2") UNION ALL
VALUES("X3") UNION ALL VALUES("X4") ),
      cnt( ts, value) AS (
      VALUES( 0, "ZZZZ")
        UNION ALL
      SELECT ts+1, value FROM cnt LIMIT 1000000)

    select src1.source1, src2.source2, cnt.* from src1, src2, cnt;


So the table looks like the following

aaa|X1|0|ZZZZ
aaa|X1|1|ZZZZ
....
aaa|X1|999998|ZZZZ
aaa|X1|999999|ZZZZ
aaa|X2|0|ZZZZ
aaa|X2|1|ZZZZ
...
aaa|X2|999998|ZZZZ
aaa|X2|999999|ZZZZ
...
aaa|X4|999999|ZZZZ
bbb|X1|0|ZZZZ
bbb|X1|1|ZZZZ
bbb|X1|2|ZZZZ
...
bbb|X4|999999|ZZZZ

So we have 2*4*1M = 8M rows, all indexed. The first two text columns
(source1, source2) have a very limited set of possible values, while the
third numeric column is a timestamp.
So it's essentially a partitioned logbook, with a very small number of
partitions.

Given this particular structure, I would expect indexing to be extremely
helpful, thanks to covering indexes, min/max and skip-scan optimizations.

Below as some example queries that I would expect to be taking advantage of
indexing.
They're however sometimes not, so I'd like some help understanding if I'm
getting something wrong.

1)  SELECT DISTINCT source1, source2 FROM rolling;

Provided the table has been analyzed, this will run real fast.
Here I'm essentially try to get a recap of all distinct values for the
first two columns of an index.
So ideally I would just scan the covering index, skipping to the next
values for the first two columns.
I don't really understand why analyze is a precondition for this
optimization to kick in, however I'm pretty happy about this.

SUGGESTION: Perhaps this could also be added as an example in the
description of the skip-scan optimization?

2) SELECT * FROM rolling WHERE ts < 100;

Provided the table has been analyzed, this will run real fast as well.
I believe the skip-scan optimization is responsible for making this
instantaneous, since this looks like one of the canonical examples.

3) SELECT MIN(ts) FROM rolling WHERE (source1,source2)=("aaa", "X2")

Again, this will be almost instantaneous, and works regardless of whether
the table has been analyzed or not.

So it looks like the min/max optimization:

 >> Queries that contain a single MIN() or MAX() aggregate function whose
argument is the left-most column of an index
 >> might be satisfied by doing a single index lookup rather than by
scanning the entire table.

Here the argument of the MIN() function is technically *not* the left-most
column of an index, though imposing a certain
value for the first two columns kind-of makes it the left-most one.

SUGGESTION: Perhaps this could also be added as an example in the
description of the min-max optimization?

QUESTION: this feature -- locking tha value for left-most columns of an
index, I believe it's WHERE CLAUSE analysis --
looks like a general feature does it apply to other cases as well?

4) SELECT MIN(ts), MAX(ts) FROM rolling WHERE (source1,source2)=("aaa",
"X2")

This will *NOT* be optimized. I understand by reading from previous
messages in this group that this is a
known limitation of the Min/Max optimization. There are however known
workarounds for it, like:

SELECT (SELECT MIN(ts) FROM rolling WHERE (source1,source2)=("aaa", "X2")),
(SELECT MAX(ts) FROM rolling WHERE (source1,source2)=("aaa", "X2"));

SUGGESTION: Perhaps this could be added in the documentation?

5) SELECT source1,source2,MIN(ts) FROM rolling GROUP BY source1, source2

Here I would expect the query to be optimized somehow by taking advantage
of both skip-scan and min/max optimizations,
but I could not manage to find a way.

This could be easily done in the application code by just nesting two
separate queries:

SELECT DISTINCT source1, source2 FROM rolling; -- This will return the set
of sources available and will be quite fast

SELECT MIN(ts) FROM rolling WHERE (source1,source2)=(?, ?) -- This will
manually filter by the results returned by the previous one

I tried everything that came to mind to run this as a single query,
including ORDER BY, INDEXED BY, subqueries, joins, window functions,
but everything just seemed to make things worse.

QUESTION: Is there any way to force the query engine to take advantage of
both optimizations?

Thanks in advance!
Gerlando
_______________________________________________
sqlite-users mailing list
[email protected]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

[sqlite] Min/Max and skip-scan optimizations

Reply via email to