Re: [sqlite] [EXTERNAL] Sqlite Sharding HOWTO

Hick Gunter Mon, 30 Jul 2018 00:07:06 -0700

We almost exclusively use virtual tables in our application, and this includes 
virtual table code to access Faircom CTree files and in-memory data 
dictionaries. The structure (fields, indexes) of these tables is fixed (and 
identical for corresponding CTree and DD tables), with sharding achieved by 
"cloning" the common structure into separate tables, with the "clone 
parameters" that describe the value(s) of certain key field(s) being present in 
the name. E.g. the table named customer_PA would contain only customers from 
Pennsylvania. Creating a simple view (select * from customer_PA union all 
customer_NY ...) has the drawback of acessing all member tables, even if the 
constraints would require searching only one table. It also requires that all 
tables be contained in the same database.


Our solution is a "partition" provider that knows about "member tables" and 
"clone parameters" and can handle "partition constraints" as well as ordered 
(merge) and unorded (sequential) retrieval. The name of the "partition" table 
does not include any "clone parameters" (e.g customer).

So "SELECT * FROM customer;" will internally do "SELECT * FROM customer_NY;" 
followed by "SELECT * FROM customer_PA;" because the member table has 2 entries 
('customer','customer_NY'), ('customer','customer_PA').

But "SELECT * FROM customer WHERE ... state = 'NY';" would determine that the 
"clone parameter" state only matches table customer_NY and therefore only query 
that table.

And "SELECT * FROM customer ... ORDER BY name;" would prepare identical 
statements against both tables, fetch a record from each and return the 
"smaller" one (because the virtual table supports indexing by name, the 
xBestIndex method can tell SQLite that it can handle this kind of query and 
sets the "orderByConsumed" flag; if the ODER BY expression cannot be handled 
via an index, it goes back to sequential execution and lets SQLite do the 
sorting). An n-way merge is implemented as a binary tree to minimize 
comparisons.

A smilar approach may be possible with native tables that reside in different 
native database files (limited by the maximum number of concurrently attached 
databases).


-----Ursprüngliche Nachricht-----
Von: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org] Im 
Auftrag von Gerlando Falauto
Gesendet: Sonntag, 29. Juli 2018 10:34
An: sqlite-users@mailinglists.sqlite.org
Betreff: [EXTERNAL] [sqlite] Sqlite Sharding HOWTO

Hi,

I'm totally new to sqlite and I'd like to use it for some logging application 
on an embedded linux-based device.  Data comes from multiple (~10), similar 
sources at a steady rate.
The rolling data set would be in the size of 20 GB. Such an amount of storage 
would suffice to retain data from the previous week or so.

Reading the documentation https://www.sqlite.org/whentouse.html somehow 
suggests the usage of sharding:

>Concurrency is also improved by "database sharding": using separate
database files for
> different subdomains. For example, the server might have a separate
SQLite database for each
> user, so that the server can handle hundreds or thousands of
> simultaneous
connections, but
> each SQLite database is only used by one connection.

In my case I would be doing sharding on the data source and/or the day of the 
timestamp, so to have individual files in the size of a few hundreds MB.
This way, deleting the oldest data would be as easy as deleting the 
corresponding file.

However, I did not find any reference whatsoever on sharding being available 
_within_ sqlite.
Ideally, I would like to have a way of "seeing" the whole dataset with a single 
query spanning all  available databases.

Would that be at all feasible? I saw the "attach database" statement which 
seems closely related but whose use-case I honestly don't get.
If not, is there any state-of-the-art adapter layer that would be performing 
(and hide) the underlying sharding? I don't really care about query performance 
(e.g. if such a global query spanning 20 different databases is indeed 
performed serially, thereby take 20 times longer), I just need a way of hiding 
this detail.

I saw some reference to SPHiveDB
https://www.mail-archive.com/sqlite-users@mailinglists.sqlite.org/msg43575.html
but the project looks stale (9 years since the last commit).

I also looked into AtomDB but it looks overly complicated for my use-case 
(single, embedded server), plus it somehow requires the underlying sharding to 
be totally exposed.

Any ideas?
Gerlando
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


___________________________________________
 Gunter Hick | Software Engineer | Scientific Games International GmbH | 
Klitschgasse 2-4, A-1130 Vienna | FN 157284 a, HG Wien, DVR: 0430013 | (O) +43 
1 80100 - 0

May be privileged. May be confidential. Please delete if not the addressee.
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] [EXTERNAL] Sqlite Sharding HOWTO

Reply via email to