On 2018/07/29 10:34 AM, Gerlando Falauto wrote:
Hi,
I'm totally new to SQLite and I'd like to use it for a logging
application on an embedded Linux-based device. Data comes from multiple
(~10) similar sources at a steady rate.

Welcome Gerlando. :)
The rolling data set would be in the size of 20 GB. Such an amount of
storage would suffice to retain data from the previous week or so.
Reading the documentation https://www.sqlite.org/whentouse.html somehow
suggests the usage of sharding:
    Concurrency is also improved by "database sharding": using separate
    database files for different subdomains. For example, the server
    might have a separate SQLite database for each user, so that the
    server can handle hundreds or thousands of simultaneous connections,
    but each SQLite database is only used by one connection.
In my case I would shard on the data source and/or the day of the
timestamp, so as to have individual files of a few hundred MB each.
This way, deleting the oldest data would be as easy as deleting the
corresponding file.
I think you are perhaps missing a core idea here - the only use-case
that requires sharding is where you have very high write-concurrency
from multiple sources, and even then, to have any helpful effect, the
sharding needs to split by "write source", not by event type or
time-frame or such.
SQLite will very happily run a 20GB (or much larger) database written to
from many sources, very happily delete old data from it and pump new
data in with very little "help" needed, and then produce fast queries
without much fanfare.
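To make that concrete, here is a minimal sketch (in Python, with a hypothetical "logs" table and column names of my own invention) of keeping a rolling window in one database file by deleting old rows, rather than deleting shard files:

```python
import sqlite3
import time

# Hypothetical schema: one "logs" table holding entries from all sources.
db = sqlite3.connect(":memory:")  # a real file path in practice
db.execute("CREATE TABLE logs (ts REAL NOT NULL, source TEXT, payload TEXT)")
db.execute("CREATE INDEX logs_ts ON logs(ts)")  # makes the age-based DELETE cheap

now = time.time()
week = 7 * 24 * 3600
# Simulate one expired entry and one recent one.
db.execute("INSERT INTO logs VALUES (?, 'sensor1', 'old')", (now - 8 * 24 * 3600,))
db.execute("INSERT INTO logs VALUES (?, 'sensor1', 'new')", (now,))

# Rolling cleanup: drop everything older than a week in one statement.
db.execute("DELETE FROM logs WHERE ts < ?", (now - week,))
db.commit()
remaining = db.execute("SELECT payload FROM logs").fetchall()
```

With an index on the timestamp column, that DELETE only touches the expired rows, so it stays fast even on a 20GB file.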
The question that needs a specific answer is: how many data input
sources are there? That is, how many processes will attempt to write to
the database at the same time? Two processes obviously can NOT write at
the same time, so when writes coincide, one process has to wait a few
milliseconds. This gets compounded as more write sources are added or as
write-frequency increases.
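That few-milliseconds wait can be made automatic with a busy timeout, so a writer that hits a locked database retries for a while instead of failing immediately with SQLITE_BUSY. A minimal sketch:

```python
import sqlite3

# Tell this connection to keep retrying a locked database for up to
# 5000 ms before giving up with SQLITE_BUSY. (Python's connect() also
# accepts a timeout= argument that does the same thing.)
db = sqlite3.connect(":memory:")
db.execute("PRAGMA busy_timeout = 5000")

# Reading the pragma back confirms the setting took effect.
val = db.execute("PRAGMA busy_timeout").fetchone()[0]
```

On a real multi-process setup, WAL mode (PRAGMA journal_mode=WAL) also helps a lot, since readers no longer block the writer.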
If a single process is writing data to a single DB from many different
sources, there is zero reason for sharding. If many processes are
running, each with its own connection to the DB, AND they have high
concurrency (i.e. high-frequency updates from many DB connections, which
heightens the incidence of simultaneous write attempts on a single DB
file), then it starts becoming a good idea to allocate two or more DB
files and split the connections between them, effectively lowering the
write-collision frequency for any single file.
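The "single writing process" arrangement can be sketched like this: the data sources only put rows on a queue, and one loop owns the only DB connection, so write collisions simply never happen. Table and column names here are invented for illustration:

```python
import sqlite3
import queue

# Shared queue: every data source enqueues rows; nobody else touches SQLite.
q = queue.Queue()

def writer(db_path=":memory:"):
    """Own the single DB connection and drain the queue until a sentinel."""
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS logs (source TEXT, payload TEXT)")
    while True:
        item = q.get()
        if item is None:          # sentinel: shut down cleanly
            break
        db.execute("INSERT INTO logs VALUES (?, ?)", item)
        db.commit()
    return db

# Sources (in real life: ~10 threads or feeds) just enqueue their data.
q.put(("sensor1", "hello"))
q.put(("sensor2", "world"))
q.put(None)

db = writer()
count = db.execute("SELECT COUNT(*) FROM logs").fetchone()[0]
```

In a production version you would batch several rows per transaction instead of committing each insert, but the single-writer shape is the point here.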
Incidentally, all DBs offer some form of concurrency alleviation (load
balancing, partitioning, etc.) which often also serves other purposes.
To get to the point: with the above in mind, do you still feel like you
need to go the sharding route? Could you perhaps share some figures:
how many bytes would a typical data update be? How many updates per
second, from how many different processes? Do you have a maintenance
window (as in, is there a time of day, or on a weekend or such, when you
can go a few minutes without logging, so one can clean up the old logs
and perhaps Vacuum and re-Analyze)?
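A maintenance-window pass might look like the following sketch (again with an invented "logs" table): delete the expired rows, then VACUUM to hand the freed pages back and ANALYZE to refresh the query planner's statistics:

```python
import sqlite3

db = sqlite3.connect(":memory:")  # a real file path in practice
db.execute("CREATE TABLE logs (ts REAL, payload TEXT)")
db.executemany("INSERT INTO logs VALUES (?, ?)", [(i, "x") for i in range(1000)])
db.commit()

# Maintenance pass: purge old rows first...
db.execute("DELETE FROM logs WHERE ts < 900")
db.commit()  # VACUUM refuses to run inside an open transaction

# ...then compact the file and refresh planner statistics.
db.execute("VACUUM")
db.execute("ANALYZE")
left = db.execute("SELECT COUNT(*) FROM logs").fetchone()[0]
```

Note that VACUUM rewrites the whole database file, so on a 20GB DB it needs a genuinely quiet window and as much free disk space as the database itself.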
This will allow for much better advice, and someone here is bound to
have something just like that running already and will be able to offer
some quick hints from experience.
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users