Hi Ryan,

thank you for your reply.

>I think you are perhaps missing a core idea here - the only use-case
>that requires sharding is where you have very high write-concurrency
>from multiple sources, and even then, the sharding, in order to have any
>helpful effect, needs to distinguish "write sources", not events or
>time-frames or such.

"Write sources" *could* be different in my case (not in the first
version though).

>SQLite will very happily run a 20GB (or much larger) database written to
>from many sources, and very happily delete old data from it and pump new
>data in without much in the way of "help" needed, AND then produce fast
>queries without much fanfare.

I sort of realized that in the meantime (at least for the fast queries part) ;-)

>The question that needs to be answered specifically is: How many data
>input sources are there? as in how many Processes will attempt to write
>to the database at the same time? Two processes can obviously NOT write
>at the same time, so if a concurrent write happens, one process has to
>wait a few milliseconds. This gets compounded as more and more write
>sources are added or as write-frequency increases.

When you say "a few milliseconds", how exactly do you measure that?

What is the timeframe where the second process has to wait if the
first one is doing a batch of a thousand writes?


>If a single process is writing data to a single DB from many different
>sources, there is zero reason for sharding. If many processes are
>running all with their own connection to the DB, AND they have high
>concurrency (i.e. high frequency updates from many DB connections which
>heightens the incidence of simultaneous write attempts to a single DB
>file) then it starts becoming a good idea to allocate two or more DB
>files so that we split the connections between those files, effectively
>lowering the write-collision frequency for a single file.

>Incidentally, all DBs offer some form of concurrency alleviation (load
>balancing, partitioning, etc.) which often also serves other purposes.

>To get to the point: With the above in mind, do you still feel like you
>need to go the sharding route? Could you perhaps quote figures such as
>how many bytes would a typical data update be? How many updates per
>second, from how many different Processes?  Do you have a maintenance
>Window (as in, is there a time of day or on a weekend or such that you
>can go a few minutes without logging so one can clean up the old logs,
>perhaps Vacuum and re-Analyze?

In the current use case thre's a single process. The way I see it, in
the near future it would probably increase to 3-4 processes,
each doing 10-100 writes per second or so. Each write would be around
1KB-20KB (one single text field, I guess).
I wonder if writing data in batches would be helpful though.

But yeah, I guess sharding makes much less sense right now. :-)

>This will allow much better advice, and someone on here is bound to
>already have something just like that running and will be able to
>quickly give some experience hints.

Thank you again!

Gerlando
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to