On 2018/07/29 10:34 AM, Gerlando Falauto wrote:
Hi,
I'm totally new to SQLite and I'd like to use it for a logging
application on an embedded Linux-based device. Data comes from multiple
(~10) similar sources at a steady rate.

Welcome Gerlando. :)
The rolling data set would be in the size of 20 GB. Such an amount of
storage would suffice to retain data from the previous week or so.
Reading the documentation https://www.sqlite.org/whentouse.html somehow
suggests the usage of sharding:
    Concurrency is also improved by "database sharding": using separate
    database files for different subdomains. For example, the server
    might have a separate SQLite database for each user, so that the
    server can handle hundreds or thousands of simultaneous connections,
    but each SQLite database is only used by one connection.
In my case I would shard on the data source and/or the day of the
timestamp, so as to have individual files of a few hundred MB each.
This way, deleting the oldest data would be as easy as deleting the
corresponding file.
I think you are perhaps missing a core idea here - the only use-case
that requires sharding is where you have very high write-concurrency
from multiple sources, and even then, to have any helpful effect, the
sharding needs to split by "write source", not by event type or
time-frame or such.
SQLite will very happily run a 20GB (or much larger) database written to
from many sources, very happily delete old data from it and pump new
data in with very little "help" needed, and then produce fast queries
without much fanfare.
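To make that concrete, here is a minimal sketch (in Python, with a hypothetical "logs" table and column names of my own invention) of keeping a rolling window in one database file by deleting old rows, rather than deleting shard files:

```python
import sqlite3
import time

# Hypothetical schema: one "logs" table holding entries from all sources.
db = sqlite3.connect(":memory:")  # a real file path in practice
db.execute("CREATE TABLE logs (ts REAL NOT NULL, source TEXT, payload TEXT)")
db.execute("CREATE INDEX logs_ts ON logs(ts)")  # makes the age-based DELETE cheap

now = time.time()
week = 7 * 24 * 3600
# Simulate one expired entry and one recent one.
db.execute("INSERT INTO logs VALUES (?, 'sensor1', 'old')", (now - 8 * 24 * 3600,))
db.execute("INSERT INTO logs VALUES (?, 'sensor1', 'new')", (now,))

# Rolling cleanup: drop everything older than a week in one statement.
db.execute("DELETE FROM logs WHERE ts < ?", (now - week,))
db.commit()
remaining = db.execute("SELECT payload FROM logs").fetchall()
```

With an index on the timestamp column, that DELETE only touches the expired rows, so it stays fast even on a 20GB file.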
The question that needs a specific answer is: how many data input
sources are there? That is, how many processes will attempt to write to
the database at the same time? Two processes obviously can NOT write at
the same time, so when writes coincide, one process has to wait a few
milliseconds. This gets compounded as more write sources are added or as
write-frequency increases.
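That few-milliseconds wait can be made automatic with a busy timeout, so a writer that hits a locked database retries for a while instead of failing immediately with SQLITE_BUSY. A minimal sketch:

```python
import sqlite3

# Tell this connection to keep retrying a locked database for up to
# 5000 ms before giving up with SQLITE_BUSY. (Python's connect() also
# accepts a timeout= argument that does the same thing.)
db = sqlite3.connect(":memory:")
db.execute("PRAGMA busy_timeout = 5000")

# Reading the pragma back confirms the setting took effect.
val = db.execute("PRAGMA busy_timeout").fetchone()[0]
```

On a real multi-process setup, WAL mode (PRAGMA journal_mode=WAL) also helps a lot, since readers no longer block the writer.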
If a single process is writing data to a single DB from many different
sources, there is zero reason for sharding. If many processes are
running, each with its own connection to the DB, AND they have high
concurrency (i.e. high-frequency updates from many DB connections, which
heightens the incidence of simultaneous write attempts on a single DB
file), then it starts becoming a good idea to allocate two or more DB
files and split the connections between them, effectively lowering the
write-collision frequency for any single file.
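The "single writing process" arrangement can be sketched like this: the data sources only put rows on a queue, and one loop owns the only DB connection, so write collisions simply never happen. Table and column names here are invented for illustration:

```python
import sqlite3
import queue

# Shared queue: every data source enqueues rows; nobody else touches SQLite.
q = queue.Queue()

def writer(db_path=":memory:"):
    """Own the single DB connection and drain the queue until a sentinel."""
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS logs (source TEXT, payload TEXT)")
    while True:
        item = q.get()
        if item is None:          # sentinel: shut down cleanly
            break
        db.execute("INSERT INTO logs VALUES (?, ?)", item)
        db.commit()
    return db

# Sources (in real life: ~10 threads or feeds) just enqueue their data.
q.put(("sensor1", "hello"))
q.put(("sensor2", "world"))
q.put(None)

db = writer()
count = db.execute("SELECT COUNT(*) FROM logs").fetchone()[0]
```

In a production version you would batch several rows per transaction instead of committing each insert, but the single-writer shape is the point here.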
Incidentally, all DBs offer some form of concurrency alleviation (load
balancing, partitioning, etc.) which often also serves other purposes.
To get to the point: with the above in mind, do you still feel like you
need to go the sharding route? Could you perhaps share some figures:
how many bytes would a typical data update be? How many updates per
second, from how many different processes? Do you have a maintenance
window (as in, is there a time of day, or on a weekend or such, when you
can go a few minutes without logging, so one can clean up the old logs
and perhaps Vacuum and re-Analyze)?
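A maintenance-window pass might look like the following sketch (again with an invented "logs" table): delete the expired rows, then VACUUM to hand the freed pages back and ANALYZE to refresh the query planner's statistics:

```python
import sqlite3

db = sqlite3.connect(":memory:")  # a real file path in practice
db.execute("CREATE TABLE logs (ts REAL, payload TEXT)")
db.executemany("INSERT INTO logs VALUES (?, ?)", [(i, "x") for i in range(1000)])
db.commit()

# Maintenance pass: purge old rows first...
db.execute("DELETE FROM logs WHERE ts < 900")
db.commit()  # VACUUM refuses to run inside an open transaction

# ...then compact the file and refresh planner statistics.
db.execute("VACUUM")
db.execute("ANALYZE")
left = db.execute("SELECT COUNT(*) FROM logs").fetchone()[0]
```

Note that VACUUM rewrites the whole database file, so on a 20GB DB it needs a genuinely quiet window and as much free disk space as the database itself.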
This will allow for much better advice, and someone here is bound to
have something just like that running already and will be able to offer
some quick hints from experience.
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users