I'm looking for an optimal way to query the maximum value of one (integer) 
column as grouped by another (string) column. For example, say I've got a table 
of timestamped data points from many named sources, and I want to find each 
source's latest timestamp. I'm seeing decent performance when the database is 
hot in the filesystem cache, but not when it's cold: in the worst case this 
query takes about 7 seconds on a 2018 MacBook Pro with a decent SSD.

My table schema includes something like `… name text, timestamp integer …`.
The table has about 500,000 rows, with 1700 distinct `name` values, and 
occupies about 500MB on disk.
I'm querying `SELECT name, max(timestamp) FROM table GROUP BY name`.
I've created an index on `name, timestamp desc`.
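For concreteness, here's a minimal reproduction of the setup (the table name
`points` and the tiny data set are stand-ins; my real schema has more columns):

```python
import sqlite3

# In-memory reproduction of the schema, index, and query described above.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE points (name TEXT, timestamp INTEGER)")
db.execute("CREATE INDEX namesAndTimestamps ON points (name, timestamp DESC)")
db.executemany("INSERT INTO points VALUES (?, ?)",
               [("a", 1), ("a", 5), ("b", 3), ("b", 2)])

# The query in question: latest timestamp per name.
rows = db.execute(
    "SELECT name, max(timestamp) FROM points GROUP BY name").fetchall()
print(rows)  # [('a', 5), ('b', 3)]
```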

EXPLAIN QUERY PLAN says:
        7|0|0| SCAN TABLE table USING INDEX namesAndTimestamps
Note the dreaded `SCAN`. :-(

I think I understand what's going on here: even if SQLite can tell it only 
needs to use the first index entry for any given name (because that will have 
the largest timestamp), it still has to scan through the index to find where 
the next name starts. (Or is there some optimization to jump rapidly through 
the B-tree to do this?)
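To illustrate the kind of B-tree jumping I mean: as far as I can tell it can be
done manually with a recursive CTE that seeks to each successive distinct name,
plus one index seek per name for its max(timestamp). A sketch (again with the
placeholder table name `points`):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE points (name TEXT, timestamp INTEGER)")
db.execute("CREATE INDEX namesAndTimestamps ON points (name, timestamp DESC)")
db.executemany("INSERT INTO points VALUES (?, ?)",
               [("a", 1), ("a", 5), ("b", 3), ("b", 2)])

# Each recursive step is an index seek for the next distinct name; the
# correlated max(timestamp) subquery is another single seek per name,
# rather than a scan over every index entry.
rows = db.execute("""
    WITH RECURSIVE names(name) AS (
        SELECT min(name) FROM points
        UNION ALL
        SELECT (SELECT min(name) FROM points WHERE name > names.name)
        FROM names WHERE names.name IS NOT NULL
    )
    SELECT name,
           (SELECT max(timestamp) FROM points WHERE points.name = names.name)
    FROM names WHERE name IS NOT NULL
""").fetchall()
print(rows)  # [('a', 5), ('b', 3)]
```

But that feels like a lot of ceremony for what the GROUP BY query expresses in
one line, so I'm hoping there's something simpler.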

Is there anything I can do to the schema, index or query to make this more 
efficient?

One simplifying assumption: I basically never delete rows, and when I do 
they're old rows, so max(timestamp) for any given name only ever increases.
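Given that assumption, one alternative I've considered is a trigger-maintained
summary table, so reads become a tiny table scan. A sketch (all names here are
placeholders, and the UPSERT syntax needs SQLite 3.24 or later):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE points (name TEXT, timestamp INTEGER)")
db.execute("CREATE TABLE latest (name TEXT PRIMARY KEY, timestamp INTEGER)")

# Keep one row per name holding its largest timestamp seen so far.
# Safe only because timestamps for a name never decrease (no deletes).
db.execute("""
    CREATE TRIGGER track_latest AFTER INSERT ON points BEGIN
        INSERT INTO latest VALUES (new.name, new.timestamp)
        ON CONFLICT(name) DO UPDATE SET timestamp = excluded.timestamp
        WHERE excluded.timestamp > latest.timestamp;
    END
""")
db.executemany("INSERT INTO points VALUES (?, ?)",
               [("a", 1), ("a", 5), ("b", 3), ("b", 2)])

rows = db.execute(
    "SELECT name, timestamp FROM latest ORDER BY name").fetchall()
print(rows)  # [('a', 5), ('b', 3)]
```

I'd rather avoid maintaining a second table if the original query can be made
fast, though.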

Thanks for any advice…

—Jens
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users