On Tue, Aug 08, 2017 at 10:30:33AM -0700, Jens Alfke wrote:
> > On Aug 4, 2017, at 11:28 AM, Nico Williams <n...@cryptonector.com> wrote:
> > Imagine a mode where there is only a WAL, and to checkpoint is to
> > write a new WAL with only live contents and... rename(2) into place.
>
> What you’re describing is exactly how CouchDB’s storage engine works,
> as well as descendants like Couchbase Server’s CouchStore and
> ForestDB. (Note: I work for Couchbase.)
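The "rewrite live contents and rename(2) into place" checkpoint could look something like the sketch below. All names and the record format are invented for illustration; this is not CouchDB's or SQLite's actual code, just the atomic-replace pattern under a simple length-prefixed layout.

```python
import os

def checkpoint(db_path, live_records):
    """Write only the live records to a fresh file, then atomically
    replace the old database via rename(2)."""
    tmp_path = db_path + ".compact"  # hypothetical temp-file convention
    with open(tmp_path, "wb") as f:
        for key, value in sorted(live_records.items()):
            # Length-prefixed key/value records, appended in order.
            f.write(len(key).to_bytes(4, "big") + key)
            f.write(len(value).to_bytes(4, "big") + value)
        f.flush()
        os.fsync(f.fileno())      # make the new file durable first
    os.rename(tmp_path, db_path)  # atomic replacement on POSIX
```

Readers that opened the old file keep a consistent snapshot (their file descriptor still points at the unlinked inode), which is where the high read concurrency discussed below comes from.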
Also how ZFS works.

> Efficient lookups in a file like this require the existence of a bunch
> of extraneous metadata like interior B-tree nodes. This metadata
> changes all the time as records are written*, so a lot of it has to be
> written out too along with every transaction, resulting in substantial
> write amplification.

Yes.  It's called write amplification.  ZFS deals with this by first
committing to an intent log (ZIL) without writing all those interior
nodes, then eventually writing a proper transaction.

One could do the same sort of thing for a single-file DB: write a number
of transactions as intents, then once in a while "checkpoint" them by
writing a b-tree transaction that includes all those updates.  For
readers this means always processing all intent log entries since the
last b-tree-updating transaction.

LSMs are... a lot like this, IIUC.  Are they not?

> The other big drawback is that compaction (the checkpoint step you
> describe) is very expensive in terms of I/O. I’ve known of CouchDB
> systems that took many hours to compact their databases, and since
> every write that occurs during a compaction has to be replayed onto
> the new file after the copy before compaction completes, one can get
> into a state where a busy database either never actually finishes
> compacting, or has to temporarily block all writers just so it can get
> the damn job done without interruption. (It’s a similar problem to GC
> thrash.)

There are definitely trade-offs.  This is already the case for SQLite3's
WAL: you have to checkpoint often, but when you do you lose read
concurrency.  When you value read concurrency highly you might tune the
WAL max size up to reduce checkpoint frequency, and now you slow down
checkpointing.

Another thing that can be done is to write to the "next" DB as soon as
possible, possibly synchronously with writes to the "current" DB.
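The intent-log scheme described above can be sketched in a few lines. This is a toy model with invented names: an in-memory dict stands in for the checkpointed b-tree (a real engine would be rewriting interior nodes on disk at checkpoint time), and the key points are that writes only append, checkpoints fold all intents into the tree at once, and readers must replay every intent since the last checkpoint.

```python
class IntentLogDB:
    """Toy model of 'cheap intent writes, occasional b-tree checkpoint'."""

    def __init__(self):
        self.tree = {}      # state as of the last b-tree transaction
        self.intents = []   # intent log entries appended since then

    def write(self, key, value):
        # Cheap: one appended record, no interior nodes rewritten.
        self.intents.append((key, value))

    def checkpoint(self):
        # Rare and expensive: one b-tree transaction covering all
        # accumulated intents.
        for key, value in self.intents:
            self.tree[key] = value
        self.intents.clear()

    def read(self, key):
        # Readers start from the checkpointed state, then replay every
        # intent logged since the last checkpoint (last write wins).
        value = self.tree.get(key)
        for k, v in self.intents:
            if k == key:
                value = v
        return value
```

The reader-side replay loop is the cost referred to below: the longer you defer the checkpoint, the more intents every read has to scan.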
Then the checkpoint process is very simple: write the "close" TX to the
"current" DB, rename the "next" into place, create the next "next".

> We’ve also seen that, on low-end hardware like mobile devices, I/O
> bandwidth is limited enough that a running compaction can really harm
> the responsiveness of the _entire OS_, as well as cause significant
> battery drain.

Yes.  But there you don't really need read concurrency, so SQLite3
without a WAL (or with a WAL but frequent checkpointing) will suffice.

> * Modifying/rewriting a single record requires rewriting the leaf node
> that points to it, which requires rewriting the parent node that
> points to the leaf, and this ripples all the way up to the root node.

I'm well aware :)  Others on the list might not be, naturally.

One of the very nice features of an append-only single-file DB format is
that you can just "tail -f" (well, a binary tail -f) to replicate.  If
you add in the intent log concept, it's not so bad in terms of I/O, but
you slow down readers somewhat.

Another very nice feature of an append-only single-file DB is high read
concurrency.  And yet another is fast recovery (throw away any truncated
trailing transaction and "checkpoint").

Nico
--
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users