The OpenStack Object Storage system uses sqlite3 databases as the
"metadata" storage layer for its "containers" and "accounts" - similar in
function to AWS S3 "buckets".

Each "object" in the system is recorded by a row in its "container".

Given the API URI for an object:

https://storage.host/v1/<account>/<container>/<path/to/some/object>

The container database would have a row for each object in its object
table:

"path/to/some/object"

These container databases are normally small (<1-4M rows) - and you can
have many of them (>10M) - but depending on the usage pattern they can
also grow big (100+ GB) when many object rows are stored in a
*single* container.

These databases are primarily used to provide object listings as part of
the API - but also account usage information: how many objects, how many
bytes, etc. (these counts are maintained via triggers)
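For anyone curious, a minimal sketch of what such counter-maintaining triggers might look like - the table and column names here are illustrative, not Swift's actual container schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Illustrative schema, not Swift's actual container schema.
CREATE TABLE object (name TEXT PRIMARY KEY, size INTEGER,
                     deleted INTEGER DEFAULT 0);
CREATE TABLE policy_stat (object_count INTEGER, bytes_used INTEGER);
INSERT INTO policy_stat VALUES (0, 0);

-- Keep the aggregate counts in sync on insert/delete.
CREATE TRIGGER object_insert AFTER INSERT ON object
BEGIN
    UPDATE policy_stat SET object_count = object_count + 1,
                           bytes_used = bytes_used + new.size;
END;
CREATE TRIGGER object_delete AFTER DELETE ON object
BEGIN
    UPDATE policy_stat SET object_count = object_count - 1,
                           bytes_used = bytes_used - old.size;
END;
""")
con.execute("INSERT INTO object (name, size) VALUES (?, ?)",
            ("path/to/some/object", 1024))
print(con.execute("SELECT object_count, bytes_used FROM policy_stat").fetchone())
# -> (1, 1024)
```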

We're investigating strategies to improve the way this container
information is stored at scale, one approach we're researching is to
"split" the database such that objects before and after "pivot points" are
stored in different underlying databases.

An analogy might be something akin to an "encyclopedia" or "card catalog" -
if you're storing log files from the last year your db may pivot at
"2016/01/db01/var/log/postgres/9.4/data/pg_log/postgresql.log": objects
named like "2015/05/24/web01/var/log/nginx.log" might go in one db, while
objects named like "2016/03/worker10/var/log/beanstalkd.log" might be in the
other.  For databases that currently have 100's of millions of rows we
would expect many such pivot points - functioning essentially like bookends
- but ideally, in the future, as soon as a database grows sufficiently
large it would automatically be split in half.

Hey, it's an idea. ;)
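To make the pivot idea concrete, here's a minimal sketch of how object names might be routed to the right underlying database given a sorted list of pivot names (the pivot values and the routing function are hypothetical):

```python
import bisect

# Hypothetical pivot points, kept sorted; N pivots split the
# name space into N+1 underlying databases (db 0 .. db N).
pivots = ["2016/01/db01/var/log/postgres/9.4/data/pg_log/postgresql.log"]

def db_for(name):
    """Return the index of the underlying db an object name belongs in."""
    return bisect.bisect_left(pivots, name)

print(db_for("2015/05/24/web01/var/log/nginx.log"))       # -> 0
print(db_for("2016/03/worker10/var/log/beanstalkd.log"))  # -> 1
```

Splitting a db in half would then just mean inserting its median name into the pivot list.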

This brings us to the interesting SQL: what's the best way to ask a table
indexed on name for the name that falls (roughly) in the middle?  In case
it's helpful, you can assume we have a rough idea of the cardinality [1]
of rows in the database through a combination of triggers which update the
object count and, to a lesser extent, the min and max ROWID.

One attempt:

SELECT name
FROM object
ORDER BY name
LIMIT 1 OFFSET (
    SELECT object_count / 2
    FROM policy_stat);

Unfortunately - this is not a fast query as OFFSET gets larger, e.g. 350M :)
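For reference, here's a self-contained reproduction of that query on a toy table (the schema and stat table are illustrative stand-ins); it returns the right answer, but on a real 350M-row container the OFFSET forces a walk over half the index:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE object (name TEXT)")
con.execute("CREATE INDEX ix_object_name ON object (name)")
con.execute("CREATE TABLE policy_stat (object_count INTEGER)")

# 1001 names so the "middle" is unambiguous.
names = ["obj/%08d" % i for i in range(1001)]
con.executemany("INSERT INTO object (name) VALUES (?)",
                [(n,) for n in names])
con.execute("INSERT INTO policy_stat VALUES (?)", (len(names),))

middle = con.execute("""
    SELECT name FROM object
    ORDER BY name
    LIMIT 1 OFFSET (SELECT object_count / 2 FROM policy_stat)
""").fetchone()[0]
print(middle)  # -> obj/00000500
```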

Is there a more obvious way to find this "middle" of a large ordered query
in an indexed table?  Just interpolating between the min and max doesn't
always produce a good split, since the distribution of prefixes is not
always uniform.  Could one imagine "estimating" the middle of the index by
traversing the B-Tree in a non-standard way!?  Would any of that be
"exposed" in a "supported" fashion that would be likely to be maintainable
across a non-trivial set of sqlite3 versions?  If not maintainable - would
it even be *possible* - given the current storage format - to estimate
*quickly* by parsing the db file and traversing the B-Tree by hand, even
if just purely for the geek factor!?  ;)
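On the geek-factor angle: the sqlite3 file format is documented and stable, so parsing the file by hand is at least feasible.  Here's a minimal sketch that reads the page size and in-header page count from the 100-byte database header (a real B-tree walk would continue from page 1's cell pointer array; the in-header page count can be stale on files written by very old sqlite3 versions):

```python
import os
import sqlite3
import struct
import tempfile

# Create a small throwaway database on disk.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
con = sqlite3.connect(path)
con.execute("CREATE TABLE object (name TEXT PRIMARY KEY)")
con.executemany("INSERT INTO object VALUES (?)",
                [("obj/%05d" % i,) for i in range(1000)])
con.commit()
con.close()

# Parse fields from the documented 100-byte file header.
with open(path, "rb") as f:
    header = f.read(100)

magic = header[:16]                                 # b"SQLite format 3\x00"
(page_size,) = struct.unpack(">H", header[16:18])   # big-endian; 1 means 65536
(page_count,) = struct.unpack(">I", header[28:32])  # in-header db size, in pages

print(magic, page_size, page_count)
```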

I'm interested in your thoughts!

-Clay

1. It might be worth noting these databases are replicated and the object
table is kept in sync via a custom daemon that runs as part of the object
storage consistency engine.  In addition to the name (and some other
*non*-indexed metadata associated with the object) we also store a
"deleted" flag on each row - so the index is actually "(deleted, name)".
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
