Re: (ITS#8475) Feature request: MDB low durability transactions
I'll use this ITS to summarize details for "volatile commits".
Hopefully I've managed to keep it all straight.

"Volatile" vs. "durable" are the most accurate names I can think of.
Not sure if that's more instructive than simply "soft" / "hard".

Description:

* Volatile commits omit fdatasync() without losing consistency.
  To survive, they must be checkpointed *before* all processes close
  the env. Un-checkpointed volatiles are lost when the env closes.
  Thus a separate checkpointing daemon can keep the env open to
  protect volatiles from application crash, at least if Robust locks
  are supported. (The lmdb.h doc seems a bit unclear about Robust.)

* Checkpointing == committing a durable (non-volatile) write-txn.
  (If there is nothing to do, commit writes nothing.)
  mdb_env_sync() will not checkpoint volatiles, since existing
  programs do not expect it to wait for the write mutex. It does
  "checkpoint" MDB_NOMETASYNC/MDB_NOSYNC. Maybe mdb_checkpoint()
  will have a special case which obsoletes mdb_env_sync().

* Volatiles are unsupported with MDB_NOLOCK and pointless with
  MDB_NOSYNC. OTOH it makes sense to enable MDB_NOMETASYNC.

* Volatiles need a bigger datafile, because it takes two durable
  commits to make a freed page reusable (plus awaiting old readers).

Configuration. Too many options, ouch:

LMDB can be configured to auto-checkpoint after X volatile commits
and/or Y written kbytes (pages?). Programs can also checkpoint every
Z minutes (seconds?) - configured in LMDB to mimic Berkeley DB's
"checkpoint" directive, but regular LMDB ops ignore that.

The lockfile gives the current config. An MDB_env could override it
for its particular process, e.g. with an MDB_NO_VOLATILE flag. Maybe
resetting the lockfile should keep the previous config? OTOH I
suppose MDB_meta can have default params the way it has mm_mapsize.
That survives a backup/restore.

Implementation - plain version first:

* Bump MDB_LOCK_FORMAT, MDB_DATA_VERSION (or make MDB_DATA_FORMAT).
* Keep 2 'MDB_meta's in the lockfile, for volatile commits.
  MDB_env.me_metas[] gets 4 elements: durable + volatile.

* mdb_env_open() throws away volatiles if it re-inits the lockfile.

* Add field MDB_meta.mm_oldest: 1 + (previous durable meta).mm_txnid
  in durable metas, and (previous meta).mm_oldest in volatile metas.
  Init 'oldest' in mdb_find_oldest() to new field MDB_env.me_oldest,
  which mdb_txn_renew0(write txn) sets to MDB_meta.mm_oldest.
  When there are no volatiles, this ends up initing 'oldest' to the
  same value as today. Usually we could just have used
  1 + (oldest durable meta).mm_txnid, but a failed write_meta() may
  have clobbered that.

* Replace MDB_txninfo.mti_txnid with mti_metaref = txnid*16 + P*4 + M:
  M = index into MDB_env.me_metas[], P = previous M during this
  session, initialized to the same as M, so we can get this info
  atomically. P may prove unnecessary, but it's simplest to just
  include it for now: for when meta M fails a checksum so we want an
  older meta, for mdb_mutex_failed(), and maybe so we can see if
  there are volatiles yet.

* Never use mdb_env_pick_meta() when the current metapage is known:
  Use it in mdb_env_open(), in mdb_mutex_failed(), and if MDB_NOLOCK.
  Or rather, I guess it gets a "force" param for those cases.

* Add config in the lockfile. Maybe per-env config overriding it, and
  defaults in the datafile. Txn flags "prefer volatile", "checkpoint".

* Track the number of pages and volatiles since the last durable
  commit. write_meta() compares these with the config limits and
  makes the final decision of whether the new meta will be volatile.
  Add MDB_pgstate.mf_pgcount with #pages used so far. The rest goes
  in a lockfile array[4] indexed by mti_metaref % 4, or in MDB_meta.
  That way, switching to the next snapshot stays atomic - just
  update mti_metaref.

* txn_begin must verify metas, since we have no fdatasync barriers.
  Re-read and compare, or checksum. write_meta() and mutex_failed()
  need a memory barrier between making a volatile meta and updating
  mti_metaref.
  Most modern compilers have that. Maybe a fallback implementation is
  lock;unlock of an otherwise unused mutex. Should also include
  CACHEFLUSH(). It may make sense to have more than 2 volatile metas,
  so read-only txns will have more time to read a meta before it gets
  overwritten. MDB_WRITEMAP (and MDB_VL32?) has non-atomic issues we
  should deal with anyway.

* We can keep (durable metapage).mp_pgno == (txnid & 1) as before:
  mdb_txn_renew0() steps txnid by 2 instead of 1 if 'meta' is
  volatile. But note that the txnid doesn't say whether the snapshot
  is durable.

* "Try to checkpoint" feature, which does not await the write mutex:
  Trylock the write mutex in mdb_txn_begin(). If it fails, set a
  lockfile flag "Please checkpoint" and return. Hopefully someone
  will obey and clear the flag. mdb_env_commit(writer) does.

Variants:

* Put volatile MDB_metas in the datafile, behind the usual MDB
Re: (ITS#8475) Feature request: MDB low durability transactions
Okay, thanks for taking the time to discuss, and of course for all
your work on MDB!

Ben

On 08/10/2016 08:50 AM, Hallvard Breien Furuseth wrote:
> Nope, you're as confused as I was originally :-)  LMDB doesn't know
> or care when a page was written. A page can be reused when the
> snapshot which _freed_ it is known to be durable and there are no
> older readers. (We could improve that by tracking page history
> better. Maybe later.)
>
> "Known to be durable" = sync datapages, write metapage, sync
> metapage, note that the metapage was synced. (We implicitly note
> that when writing the next txn's metapage, since we must have
> synced first.) From a data safety point of view, txns which do all
> that are the real txns. Anything else is fluff, like no-sync txns
> if we implement them. Their metapages must go somewhere they
> *won't* be confused with durable ones.
>
> Think of such a fluffy commit as saving an intermediate stage of a
> real txn. That's irrelevant to a later write-txn wanting to not
> touch the last two durable snapshots. It's only relevant vs. the
> oldest reader.
>
> So. 3rd metapage and variants - I've tried and Howard pointed out
> the flaws, Howard tried and I said here we go again. We do not need
> another round, but it's just as well to have it summarized here.
>
> (This discussion ignores MDB_NOSYNC and partly MDB_NOLOCK - if the
> user enables either, it's his responsibility to compensate.)
Re: (ITS#8475) Feature request: MDB low durability transactions
Nope, you're as confused as I was originally :-)  LMDB doesn't know or
care when a page was written. A page can be reused when the snapshot
which _freed_ it is known to be durable and there are no older
readers. (We could improve that by tracking page history better.
Maybe later.)

"Known to be durable" = sync datapages, write metapage, sync metapage,
note that the metapage was synced. (We implicitly note that when
writing the next txn's metapage, since we must have synced first.)
From a data safety point of view, txns which do all that are the real
txns. Anything else is fluff, like no-sync txns if we implement them.
Their metapages must go somewhere they *won't* be confused with
durable ones.

Think of such a fluffy commit as saving an intermediate stage of a
real txn. That's irrelevant to a later write-txn wanting to not touch
the last two durable snapshots. It's only relevant vs. the oldest
reader.

So. 3rd metapage and variants - I've tried and Howard pointed out the
flaws, Howard tried and I said here we go again. We do not need
another round, but it's just as well to have it summarized here.

(This discussion ignores MDB_NOSYNC and partly MDB_NOLOCK - if the
user enables either, it's his responsibility to compensate.)

-- 
Hallvard
Re: (ITS#8475) Feature request: MDB low durability transactions
On 08/08/2016 05:41 AM, Hallvard Breien Furuseth wrote:
> A transaction must not reuse data pages visible in the last snapshot
> known to be durable, since that's how far back LMDB may need to
> revert after abnormal termination. Like a crash after MDB_NOMETASYNC
> may do.
>
> Sync the data pages from a txn, write the metapage, eventually sync
> that metapage, wait out any older read-only transactions, and *then*
> you can reuse the pages the txn freed. Not before. So when you don't
> sync, or a read-only txn won't die, LMDB degenerates to append-only.
>
> ...except if you sync the metapage and exit, the next LMDB run may
> not know you synced it and must assume the metapage isn't yet
> durable. So it might not reuse pages visible to the _previous_
> durable metapage, until it syncs. I'm rather losing track at this
> point, but I think it may mean twice as many not-yet-usable pages
> as one might expect.

Concretely: say the current write transaction is number 10, and a
long-lived reader is on number 7. Currently, MDB will be unable to
reuse any pages used in transactions 7+ until the reader ends.

Now say a 3rd, durable root is added. For the sake of argument, no
checksums are used, and in the event of a crash only the last durable
state is recovered. Say the durable transaction is number 2. Pages
used in transaction 2 need to be preserved, obviously. 7+ still need
to be preserved for the slow reader. But pages from transactions 3-6
can be reused.

Note that the last durable transaction is controlled purely by the
single writer, so tracking it is actually easier than tracking which
readers are where.

If a crash happens before a durable root is fully synced, then there
should be a second, older durable root that hasn't been reused yet.
In that case MDB recovers the way it does currently.

Does this make sense? Thanks for bearing with me.
Re: (ITS#8475) Feature request: MDB low durability transactions
On 08/08/16 03:51, bentr...@comcast.net wrote:
> The idea was that the two "floating" roots would reuse pages the way
> MDB does now. The 3rd durable root would have its pages preserved
> separately. I can see why this would cause up to a ~2X storage
> increase as the two sets diverged, but I don't see why it would need
> to grow unbounded. Apologies for this stupid question.

A transaction must not reuse data pages visible in the last snapshot
known to be durable, since that's how far back LMDB may need to revert
after abnormal termination. Like a crash after MDB_NOMETASYNC may do.

Sync the data pages from a txn, write the metapage, eventually sync
that metapage, wait out any older read-only transactions, and *then*
you can reuse the pages the txn freed. Not before. So when you don't
sync, or a read-only txn won't die, LMDB degenerates to append-only.

...except if you sync the metapage and exit, the next LMDB run may not
know you synced it and must assume the metapage isn't yet durable.
So it might not reuse pages visible to the _previous_ durable
metapage, until it syncs. I'm rather losing track at this point, but
I think it may mean twice as many not-yet-usable pages as one might
expect.
Re: (ITS#8475) Feature request: MDB low durability transactions
On 08/07/2016 09:29 PM, Howard Chu wrote:
> Knowing whether or not the root pages are pristine still doesn't
> tell you anything about whether the data pages are intact. The only
> way to make any of these schemes work is to avoid
> overwriting/reusing any data pages for the last N transactions.
> I.e., reverting to append-only behavior. So the underlying question
> (which we have wrestled with internally for quite some time) which
> you haven't asked or answered - how many of these non-durable
> transactions will you support at any given time?

The idea was that the two "floating" roots would reuse pages the way
MDB does now. The 3rd durable root would have its pages preserved
separately. I can see why this would cause up to a ~2X storage
increase as the two sets diverged, but I don't see why it would need
to grow unbounded. Apologies for this stupid question.
Re: (ITS#8475) Feature request: MDB low durability transactions
Ben Trask wrote:
> On 08/07/2016 05:44 PM, Howard Chu wrote:
>> The only way to guarantee integrity is with ordered writes. All
>> SCSI devices support this feature, but e.g. the Linux kernel does
>> not (and neither does SATA, and no idea about PCIe SSDs...).
>>
>> Lacking a portable mechanism for ordered writes, you have two
>> choices for preserving integrity - append-only operation (which
>> forces ordered writes anyway) or at least one synchronous write
>> somewhere.
>>
>> Whenever you decide to reuse existing pages rather than operating
>> as append-only, you create the possibility of overwriting some
>> required data before it was safe to do so. Your 3-root checksum
>> scheme *might* let you detect that the DB is corrupted, but it
>> *won't* let you recover to a clean state. Given that writes occur
>> in unpredictable order, without fsyncs there is no way you can
>> guarantee that anything sane is on the disk.
>
> Consider three roots without any checksums. Each root has a simple
> flag indicating whether it was written durably (fsync write
> barrier). During recovery, non-durable roots are simply
> ignored/discarded. This is equivalent to Hallvard's suggestion for
> volatile meta-pages. I think it's pretty clear this is workable.
>
> From there, checksums just give you slightly stronger guarantees,
> although they might not be worth the overhead (CPU/storage) and
> recovery complexity.

Knowing whether or not the root pages are pristine still doesn't tell
you anything about whether the data pages are intact. The only way to
make any of these schemes work is to avoid overwriting/reusing any
data pages for the last N transactions. I.e., reverting to
append-only behavior. So the underlying question (which we have
wrestled with internally for quite some time) which you haven't asked
or answered - how many of these non-durable transactions will you
support at any given time?

-- 
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
Re: (ITS#8475) Feature request: MDB low durability transactions
On 08/07/2016 05:44 PM, Howard Chu wrote:
> The only way to guarantee integrity is with ordered writes. All SCSI
> devices support this feature, but e.g. the Linux kernel does not
> (and neither does SATA, and no idea about PCIe SSDs...).
>
> Lacking a portable mechanism for ordered writes, you have two
> choices for preserving integrity - append-only operation (which
> forces ordered writes anyway) or at least one synchronous write
> somewhere.
>
> Whenever you decide to reuse existing pages rather than operating as
> append-only, you create the possibility of overwriting some required
> data before it was safe to do so. Your 3-root checksum scheme
> *might* let you detect that the DB is corrupted, but it *won't* let
> you recover to a clean state. Given that writes occur in
> unpredictable order, without fsyncs there is no way you can
> guarantee that anything sane is on the disk.

Consider three roots without any checksums. Each root has a simple
flag indicating whether it was written durably (fsync write barrier).
During recovery, non-durable roots are simply ignored/discarded. This
is equivalent to Hallvard's suggestion for volatile meta-pages. I
think it's pretty clear this is workable.

From there, checksums just give you slightly stronger guarantees,
although they might not be worth the overhead (CPU/storage) and
recovery complexity.
Re: (ITS#8475) Feature request: MDB low durability transactions
bentr...@comcast.net wrote:
> Thanks for the replies, Hallvard and Howard!
>
> I was mistaken in thinking that NOMETASYNC didn't guarantee
> integrity. However, my proposal would allow fsync to be omitted
> entirely.
>
> I think my approach with three roots is better than a WAL because it
> keeps the read and write paths simpler and more uniform. It also
> doesn't force periodic fsyncs when the log wraps, or consume
> unbounded space. In fact it's very similar to the basic design of
> MDB.
>
> You're right that you'd actually need to record the page's checksum
> in the parent, rather than in the page itself. I guess this would
> hurt the branching factor.

And then it's turtles all the way down. What you're suggesting won't
work. Trust me when I say we have spent far more time thinking about
this question than you have.

The only way to guarantee integrity is with ordered writes. All SCSI
devices support this feature, but e.g. the Linux kernel does not (and
neither does SATA, and no idea about PCIe SSDs...).

Lacking a portable mechanism for ordered writes, you have two choices
for preserving integrity - append-only operation (which forces
ordered writes anyway) or at least one synchronous write somewhere.

Whenever you decide to reuse existing pages rather than operating as
append-only, you create the possibility of overwriting some required
data before it was safe to do so. Your 3-root checksum scheme *might*
let you detect that the DB is corrupted, but it *won't* let you
recover to a clean state. Given that writes occur in unpredictable
order, without fsyncs there is no way you can guarantee that anything
sane is on the disk.

-- 
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
Re: (ITS#8475) Feature request: MDB low durability transactions
Thanks for the replies, Hallvard and Howard!

I was mistaken in thinking that NOMETASYNC didn't guarantee integrity.
However, my proposal would allow fsync to be omitted entirely.

I think my approach with three roots is better than a WAL because it
keeps the read and write paths simpler and more uniform. It also
doesn't force periodic fsyncs when the log wraps, or consume unbounded
space. In fact it's very similar to the basic design of MDB.

You're right that you'd actually need to record the page's checksum in
the parent, rather than in the page itself. I guess this would hurt
the branching factor.

Thanks again,
Ben

On 08/06/2016 12:42 PM, Hallvard Breien Furuseth wrote:
> On 06. aug. 2016 17:38, bentr...@comcast.net wrote:
>> Transaction commits are one of the few bottlenecks in MDB, because
>> it has to fsync twice, sequentially.
>>
>> I think MDB could support mixed low and high durability
>> transactions in the same database by adding per-page checksums and
>> a third root page. The idea is that when committing a
>> low-durability transaction, no fsyncs are performed. (...)
>
> Yesno. We can get rid of fsyncs, but not that way. Checksumming each
> page isn't enough. We must know it's the right version of the page
> and not e.g. a similar page from a previous aborted transaction. To
> commit a branch or meta page, we'd need to scan its children and
> checksum the page headers (thus including their checksum) of each.
> Expensive.
>
> IIRC there are three things we can do:
>
> - Use and fsync a WAL (write-ahead log) instead of the database
>   pages. That can be cheaper because it writes one contiguous
>   region instead of a lot of random-access pages. Requires recovery
>   after a crash.
>
> - Volatile metapages which mdb_env_open() _always_ throws away if no
>   other environment is already open. They are lost if the
>   application crashes/exits without doing a final checkpoint.
>
> - Improve that a bit: Put them in a shared memory region, since that
>   won't survive a system crash (unlike if we put them in the
>   lockfile). That way they'll survive application crash provided
>   something does a checkpoint before the next system crash.
>
> We've discussed these sometimes and there are caveats for some of
> them which I don't quite remember. One issue is that a "system
> crash" isn't the only thing which can lose unsynced pages. Another
> is unmounting and re-mounting the disk (e.g. a USB disk).
Re: (ITS#8475) Feature request: MDB low durability transactions
On 06/08/16 18:56, h...@symas.com wrote:
> We already have low durability transactions, that's what
> MDB_NOMETASYNC is for. It only does one fsync per commit instead of
> two.

Haha, I didn't notice that, since this reminded me of other
discussions. The stuff I was talking about was for fewer fsyncs than
one per commit.
Re: (ITS#8475) Feature request: MDB low durability transactions
bentr...@comcast.net wrote:
> Full_Name: Ben Trask
> Version:
> OS:
> URL: ftp://ftp.openldap.org/incoming/
> Submission from: (NULL) (192.241.251.11)
>
>
> Hi,
>
> Transaction commits are one of the few bottlenecks in MDB, because
> it has to fsync twice, sequentially.
>
> I think MDB could support mixed low and high durability transactions
> in the same

We already have low durability transactions, that's what
MDB_NOMETASYNC is for. It only does one fsync per commit instead of
two.

-- 
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
Re: (ITS#8475) Feature request: MDB low durability transactions
On 06. aug. 2016 17:38, bentr...@comcast.net wrote:
> Transaction commits are one of the few bottlenecks in MDB, because
> it has to fsync twice, sequentially.
>
> I think MDB could support mixed low and high durability transactions
> in the same database by adding per-page checksums and a third root
> page. The idea is that when committing a low-durability transaction,
> no fsyncs are performed. (...)

Yesno. We can get rid of fsyncs, but not that way. Checksumming each
page isn't enough. We must know it's the right version of the page
and not e.g. a similar page from a previous aborted transaction. To
commit a branch or meta page, we'd need to scan its children and
checksum the page headers (thus including their checksum) of each.
Expensive.

IIRC there are three things we can do:

- Use and fsync a WAL (write-ahead log) instead of the database
  pages. That can be cheaper because it writes one contiguous region
  instead of a lot of random-access pages. Requires recovery after a
  crash.

- Volatile metapages which mdb_env_open() _always_ throws away if no
  other environment is already open. They are lost if the application
  crashes/exits without doing a final checkpoint.

- Improve that a bit: Put them in a shared memory region, since that
  won't survive a system crash (unlike if we put them in the
  lockfile). That way they'll survive application crash provided
  something does a checkpoint before the next system crash.

We've discussed these sometimes and there are caveats for some of
them which I don't quite remember. One issue is that a "system crash"
isn't the only thing which can lose unsynced pages. Another is
unmounting and re-mounting the disk (e.g. a USB disk).

-- 
Hallvard