On Tue, Aug 08, 2017 at 10:30:33AM -0700, Jens Alfke wrote:
> > On Aug 4, 2017, at 11:28 AM, Nico Williams <n...@cryptonector.com> wrote:
> > Imagine a mode where there is only a WAL, and to checkpoint is to write
> > a new WAL with only live contents and... rename(2) into place.
> What you’re describing is exactly how CouchDB’s storage engine works,
> as well as descendants like Couchbase Server’s CouchStore and
> ForestDB. (Note: I work for Couchbase.)
Also how ZFS works.
> Efficient lookups in a file like this require the existence of a bunch
> of extraneous metadata like interior B-tree nodes. This metadata
> changes all the time as records are written*, so a lot of it has to be
> written out too along with every transaction, resulting in substantial
> write amplification.
Yes. It's called write magnification. ZFS deals with this by first
committing to an intent log (ZIL) without writing all those interior
nodes, then it eventually writes a proper transaction.
One could do the same sort of thing for a single-file DB: write a number
of transactions as intents, then once in a while "checkpoint" them by
writing a b-tree transaction that includes all those updates. For
readers this means always processing all intent log entries since the
last b-tree-updating transaction.
LSMs are... a lot lik this, IIUC. Are they not?
> The other big drawback is that compaction (the checkpoint step you
> describe) is very expensive in terms of I/O. I’ve known of CouchDB
> systems that took many hours to compact their databases, and since
> every write that occurs during a compaction has to be replayed onto
> the new file after the copy before compaction completes, one can get
> into a state where a busy database either never actually finishes
> compacting, or has to temporarily block all writers just so it can get
> the damn job done without interruption. (It’s a similar problem to GC
There's definitely trade-offs. This is already the case for SQLite3's
WAL. You have to checkpoint often, but when you do you lose read
concurrency. When you value read concurrency highly you might tune the
WAL max size up to reduce checkpoint frequence, and now you slow down
Another thing that can be done is to write to the "next" DB as soon as
possible, possibly synchronously with writes to the "current" DB. Then
the checkpoint process is very simple: write the "close" TX to the
"current" DB, rename the "next" into place, create the next "next".
> We’ve also seen that, on low-end hardware like mobile devices, I/O
> bandwidth is limited enough that a running compaction can really harm
> the responsiveness of the _entire OS_, as well as cause significant
> battery drain.
Yes. But there you don't really need read concurrency, so SQLite3
without a WAL (or with a WALL but frequent checkpointing) will suffice.
> * Modifying/rewriting a single record requires rewriting the leaf node
> that points to it, which requires rewriting the parent node that
> points to the leaf, and this ripples all the way up to the root node.
I'm well aware :) Others on the list might not, naturally.
One of the very nice features of an append-only single-file DB file
format is that you can just "tail -f" (well, a binary tail -f) to
replicate. If you add in the intent log concept, it's not so bad in
terms of I/O, but you slow down readers somewhat.
Another very nice feature of an append-only single-file DB file is high
read concurrency. And yet another is fast recovery (throw away a
truncated transaction and "checkpoint").
sqlite-users mailing list