Re: (ITS#8475) Feature request: MDB low durability transactions

2016-09-16 Thread h . b . furuseth
I'll use this ITS to summarize details for "volatile commits".
Hopefully I've managed to keep it all straight.

"Volatile" vs. "durable" are the most accurate names I can think of.
Not sure if that's more instructive than simply "soft" / "hard".


* Volatile commits omit fdatasync() without losing consistency.  To
  survive, they must be checkpointed *before* all processes close the
  env.  Un-checkpointed volatiles are lost when the env closes.

  Thus a separate checkpointing daemon can keep the env open to
  protect volatiles from application crash, at least if Robust locks
  are supported.  (The lmdb.h doc seems a bit unclear about Robust.)

* Checkpointing == committing a durable (non-volatile) write-txn.
  (If there is nothing to do, Commit writes nothing.)

  mdb_env_sync() will not checkpoint volatiles, since existing
  programs do not expect it to wait for the write mutex.  It
  "checkpoints" MDB_NOMETASYNC/MDB_NOSYNC.  Maybe mdb_checkpoint()
  will have a special case which obsoletes mdb_env_sync().

* Volatiles are unsupported with MDB_NOLOCK and pointless with
  MDB_NOSYNC.  OTOH it makes sense to enable MDB_NOMETASYNC.

* Volatiles need a bigger datafile, because it takes two durable
  commits to make a freed page reusable. (Plus awaiting old readers).

Configuration.  Too many options, ouch:

  LMDB can be configured to auto-checkpoint after X volatile commits
  and/or Y written kbytes(pages?).  Programs can also checkpoint every
  Z minutes(seconds?) - configured in LMDB to mimic Berkeley DB's
  "checkpoint  ", but regular LMDB ops ignore that.

  The lockfile gives the current config.  An MDB_env could override
  for its particular process, e.g. with an MDB_NO_VOLATILE flag.
  Maybe resetting the lockfile should keep the previous config?
  OTOH I suppose MDB_meta can have default params the way it has
  mm_mapsize.  That survives a backup/restore.
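
The X/Y auto-checkpoint limits above could be checked roughly as follows. This is only a sketch under assumptions: the struct and function names are hypothetical (nothing like them exists in LMDB today), Y is counted in pages, and a limit of 0 is taken to mean "disabled".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical limits; 0 disables a limit. Names are illustrative only. */
typedef struct {
    size_t max_volatile_commits;  /* "X": volatile commits between checkpoints */
    size_t max_volatile_pages;    /* "Y": counted in pages in this sketch */
} sketch_ckpt_config;

/* Hypothetical counters, reset by every durable commit. */
typedef struct {
    size_t volatile_commits;  /* volatile commits since the last durable one */
    size_t pages_written;     /* pages written since the last durable commit */
} sketch_ckpt_state;

/* Return true if the commit being written must be durable (a checkpoint). */
static bool sketch_must_checkpoint(const sketch_ckpt_config *cfg,
                                   const sketch_ckpt_state *st,
                                   size_t pages_in_this_txn)
{
    /* Would this commit be volatile commit number X+1? */
    if (cfg->max_volatile_commits &&
        st->volatile_commits + 1 > cfg->max_volatile_commits)
        return true;
    /* Would this commit push the unsynced page count past Y? */
    if (cfg->max_volatile_pages &&
        st->pages_written + pages_in_this_txn > cfg->max_volatile_pages)
        return true;
    return false;
}
```

The time-based limit (Z) is left out on purpose, since regular LMDB operations would ignore it.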

Implementation - plain version first:


* Keep 2 'MDB_meta's in the lockfile, for volatile commits.
  MDB_env.me_metas[] gets 4 elements: durable + volatile.
* mdb_env_open() throws away volatiles if it re-inits the lockfile.

* Add field MDB_meta.mm_oldest: 1 + (previous durable meta).mm_txnid
  in durable metas, and (previous meta).mm_oldest in volatile metas.

  Init 'oldest' in mdb_find_oldest() to new field MDB_env.me_oldest,
  which mdb_txn_renew0(write txn) sets to MDB_meta.mm_oldest.

  When there are no volatiles, this ends up initing 'oldest' = same value
  as today.  Usually we could just have used 1 + (oldest durable
  meta).mm_txnid, but a failed write_meta() may have clobbered that.
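
The mm_oldest rule above, written out as a small sketch. The reduced struct is a hypothetical stand-in for MDB_meta with only the two fields the rule needs, and the function name is made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for MDB_meta. */
typedef struct {
    uint64_t mm_txnid;   /* txnid of the snapshot this meta describes */
    uint64_t mm_oldest;  /* the proposed new field */
} sketch_meta;

/* Compute mm_oldest for a meta that is about to be written.
 *   prev         - the meta written immediately before this one
 *   prev_durable - the most recent durable meta before this one
 *   durable      - true if the new meta is a durable commit
 */
static uint64_t sketch_next_oldest(const sketch_meta *prev,
                                   const sketch_meta *prev_durable,
                                   bool durable)
{
    if (durable)
        return prev_durable->mm_txnid + 1;  /* 1 + (previous durable meta).mm_txnid */
    return prev->mm_oldest;                 /* volatile: inherit from the previous meta */
}
```

A run of volatile commits therefore keeps the same mm_oldest until the next durable commit moves it forward.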

* Replace MDB_txninfo.mti_txnid with mti_metaref = txnid*16 + P*4 + M:
    M = index to MDB_env.me_metas[],
    P = previous M during this session, initialized to the same as M,
  so we can get this info atomically.

  P may prove unnecessary, but it's simplest to just include it for
  now.  It could help when meta M fails a checksum and we want an older
  meta, in mdb_mutex_failed(), and maybe to see if there are volatiles yet.
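
A minimal sketch of the mti_metaref packing described above. The helper names are hypothetical. It assumes M and P each fit in 2 bits (me_metas[] has 4 elements), which leaves the top 4 bits of a 64-bit txnid unavailable.

```c
#include <stdint.h>

/* mti_metaref = txnid*16 + P*4 + M.  M and P index the 4-element
 * me_metas[] array, so each is masked to 2 bits. */
static uint64_t metaref_pack(uint64_t txnid, unsigned p, unsigned m)
{
    return txnid * 16 + (uint64_t)(p & 3) * 4 + (m & 3);
}

/* Field extraction: one word holds all three values, so a single
 * atomic load of that word yields a consistent (txnid, P, M). */
static uint64_t metaref_txnid(uint64_t ref) { return ref / 16; }
static unsigned metaref_prev(uint64_t ref)  { return (unsigned)(ref / 4 % 4); }
static unsigned metaref_meta(uint64_t ref)  { return (unsigned)(ref % 4); }
```

For example, txnid 5 with P = 1 and M = 2 packs to 5*16 + 1*4 + 2 = 86.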

* Never use mdb_env_pick_meta() when the current metapage is known:
  Use it in mdb_env_open(), in mdb_mutex_failed(), and if MDB_NOLOCK.
  Or rather, I guess it gets a "force" param for those cases.

* Add config in the lockfile. Maybe per-env config overriding it and
  defaults in the datafile. Txn flags "prefer volatile", "checkpoint".
* Track the number of pages and volatiles since last durable commit.
  write_meta() compares with the config limits and makes the final
  decision of whether the new meta will be volatile.

  Add MDB_pgstate.mf_pgcount with #pages used so far.  The rest goes
  in a lockfile array[4] indexed by mti_metaref % 4, or in MDB_meta.
  That way, switching to next snapshot stays atomic - just update

* txn_begin must verify metas, since we have no fdatasync barriers.
  Re-read and compare, or checksum.

  write_meta() and mutex_failed(): memory barrier between making
  a volatile meta and updating mti_metaref.  Most modern compilers
  have that.  Maybe a fallback implementation is lock;unlock an
  otherwise unused mutex.  Should also include CACHEFLUSH().
  It may make sense to have more than 2 volatile metas, so read-only
  txns will have more time to read a meta before it gets overwritten.

  MDB_WRITEMAP (and MDB_VL32?) has non-atomic issues we should deal
  with anyway.
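
One way the "barrier, then re-read and compare" idea could look with C11 atomics. This is a sketch under assumptions, not a proposed patch: the types and function names are hypothetical, the meta is reduced to two fields, the writer is assumed to never overwrite the slot that mti_metaref currently points at, and CACHEFLUSH() and the fallback mutex are not shown.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reduced meta; the real MDB_meta has more fields. */
typedef struct {
    _Atomic uint64_t txnid;
    _Atomic uint64_t root;
} sketch_meta;

typedef struct {
    sketch_meta metas[4];
    _Atomic uint64_t metaref;   /* txnid*16 + P*4 + M */
} sketch_shared;

static void sketch_shared_init(sketch_shared *s)
{
    for (int i = 0; i < 4; i++) {
        atomic_init(&s->metas[i].txnid, 0);
        atomic_init(&s->metas[i].root, 0);
    }
    atomic_init(&s->metaref, 0);
}

/* Writer (holds the write mutex): fill slot m, then publish it. */
static void sketch_publish(sketch_shared *s, unsigned m,
                           uint64_t txnid, uint64_t root)
{
    uint64_t old = atomic_load_explicit(&s->metaref, memory_order_relaxed);
    unsigned p = (unsigned)(old % 4);           /* previous M */
    /* Keep the earlier metaref update ordered before we reuse a slot. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&s->metas[m].txnid, txnid, memory_order_relaxed);
    atomic_store_explicit(&s->metas[m].root, root, memory_order_relaxed);
    /* The barrier between making the meta and updating metaref. */
    atomic_store_explicit(&s->metaref, txnid * 16 + p * 4 + m,
                          memory_order_release);
}

/* Reader: copy the current meta, then re-read metaref and compare.
 * Returns false if the copy may be torn; the caller should retry. */
static bool sketch_read(sketch_shared *s, uint64_t *txnid, uint64_t *root)
{
    uint64_t before = atomic_load_explicit(&s->metaref, memory_order_acquire);
    sketch_meta *meta = &s->metas[before % 4];
    uint64_t t = atomic_load_explicit(&meta->txnid, memory_order_relaxed);
    uint64_t r = atomic_load_explicit(&meta->root, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);
    uint64_t after = atomic_load_explicit(&s->metaref, memory_order_relaxed);
    if (before != after || t != before / 16)
        return false;
    *txnid = t;
    *root = r;
    return true;
}
```

With more volatile slots, a reader is less likely to see metaref move during the copy, which is the point about giving read-only txns more time.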

* We can have (durable metapage).mp_pgno == (txnid & 1) as before:
  mdb_txn_renew0() steps txnid by 2 instead of 1 if 'meta' is volatile.
  But note that the txnid doesn't say if the snapshot is durable.
* "Try to checkpoint" feature, which does not await the write mutex:

  Trylock the write mutex in mdb_txn_begin().  If it fails, set a
  lockfile flag "Please checkpoint" and return.  Hopefully someone
  will obey and clear the flag.  mdb_txn_commit(writer) does.
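
A rough sketch of the "try to checkpoint" flow, assuming a lockfile flag and using a C11 atomic_flag as a stand-in for the real write mutex. All names here are hypothetical, and the counter only stands in for an actual durable commit.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_flag write_lock;         /* stand-in for the write mutex */
    atomic_bool please_checkpoint;  /* hypothetical "Please checkpoint" flag */
    unsigned checkpoints_done;      /* protected by write_lock */
} sketch_lockfile;

static void sketch_lockfile_init(sketch_lockfile *lf)
{
    atomic_flag_clear(&lf->write_lock);
    atomic_init(&lf->please_checkpoint, false);
    lf->checkpoints_done = 0;
}

/* Returns true if we checkpointed ourselves, false if we only left a
 * request for whoever holds the write lock.  Never blocks. */
static bool sketch_try_checkpoint(sketch_lockfile *lf)
{
    if (atomic_flag_test_and_set_explicit(&lf->write_lock,
                                          memory_order_acquire)) {
        /* Trylock failed: ask the current writer to do it. */
        atomic_store(&lf->please_checkpoint, true);
        return false;
    }
    lf->checkpoints_done++;   /* stands in for a durable commit */
    atomic_store(&lf->please_checkpoint, false);
    atomic_flag_clear_explicit(&lf->write_lock, memory_order_release);
    return true;
}

/* Called by a writer at commit time, with the write lock held:
 * consume the request so that this commit is made durable. */
static bool sketch_commit_should_be_durable(sketch_lockfile *lf)
{
    return atomic_exchange(&lf->please_checkpoint, false);
}
```

The request is only a hint: if no writer commits afterwards, nothing clears the flag, which matches the "hopefully someone will obey" wording.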


* Put volatile MDB_metas in the datafile, behind the usual MDB

Re: (ITS#8475) Feature request: MDB low durability transactions

2016-08-10 Thread bentrask
Okay, thanks for taking the time to discuss, and of course for all your 
work on MDB!


On 08/10/2016 08:50 AM, Hallvard Breien Furuseth wrote:
> Nope, you're as confused as I was originally:-) LMDB doesn't know or
> care when a page was written.  A page can be reused when the snapshot
> which _freed_ it is known to be durable and there are no older readers.
> (We could improve that by tracking page history better.  Maybe later.)
> "Known to be durable" = sync datapages, write metapage, sync metapage,
> note that the metapage was synced.  (We implicitly note that when
> writing next txn's metapage, since we must have synced first.)  From
> a data safety point of view, txns which do all that are the real txns.
> Anything else is fluff, like no-sync txns if we implement them.  Their
> metapages must go somewhere they *won't* be confused with durable ones.
> Think of such a fluffy commit as saving an intermediate stage of a
> real txn.  That's irrelevant to a later write-txn wanting to not touch
> the last two durable snapshots.  It's only relevant vs. oldest reader.
> So.  3rd metapage and variants - I've tried and Howard pointed out
> the flaws, Howard tried and I said here we go again.  We do not need
> another round, but it's just as well to have it summarized here.
> (This discussion ignores MDB_NOSYNC and partly MDB_NOLOCK - if the
> user enables either, it's his responsibility to compensate.)

Re: (ITS#8475) Feature request: MDB low durability transactions

2016-08-10 Thread h . b . furuseth
Nope, you're as confused as I was originally:-) LMDB doesn't know or
care when a page was written.  A page can be reused when the snapshot
which _freed_ it is known to be durable and there are no older readers.
(We could improve that by tracking page history better.  Maybe later.)

"Known to be durable" = sync datapages, write metapage, sync metapage,
note that the metapage was synced.  (We implicitly note that when
writing next txn's metapage, since we must have synced first.)  From
a data safety point of view, txns which do all that are the real txns.
Anything else is fluff, like no-sync txns if we implement them.  Their
metapages must go somewhere they *won't* be confused with durable ones.

Think of such a fluffy commit as saving an intermediate stage of a
real txn.  That's irrelevant to a later write-txn wanting to not touch
the last two durable snapshots.  It's only relevant vs. oldest reader.

So.  3rd metapage and variants - I've tried and Howard pointed out
the flaws, Howard tried and I said here we go again.  We do not need
another round, but it's just as well to have it summarized here.

(This discussion ignores MDB_NOSYNC and partly MDB_NOLOCK - if the
user enables either, it's his responsibility to compensate.)


Re: (ITS#8475) Feature request: MDB low durability transactions

2016-08-08 Thread bentrask
On 08/08/2016 05:41 AM, Hallvard Breien Furuseth wrote:
> A transaction must not reuse data pages visible in the last snapshot
> known to be durable, since that's how far back LMDB may need to revert
> after abnormal termination.  Like a crash after MDB_NOMETASYNC may do.
> Sync the data pages from a txn, write the metapage, eventually sync
> that metapage, wait out any older read-only transactions, and *then*
> you can reuse the pages the txn freed.  Not before.  So when you don't
> sync, or a read-only txn won't die, LMDB degenerates to append-only.
> ...except if you sync the metapage and exit, next LMDB run may not
> know you synced it and must assume the metapage isn't yet durable.
> So it might not reuse pages visible to the _previous_ durable
> metapage, until it syncs.  I'm rather losing track at this point,
> but I think it may mean twice as many not-yet-usable pages as one
> might expect.

Concretely: say the current write transaction is number 10, and a 
long-lived reader is on number 7. Currently, MDB will be unable to reuse 
any pages used in transactions 7+ until the reader ends.

Now say a 3rd, durable root is added. For the sake of argument, no 
checksums are used and in the event of a crash, only the last durable 
state is recovered. Say the durable transaction is number 2. Pages used 
in transaction 2 need to be preserved, obviously. 7+ still need to be 
preserved for the slow reader. But pages from transactions 3-6 can be 

Note that the last durable transaction is controlled purely by the 
single writer, so tracking it is actually easier than tracking which 
readers are where.

If a crash happens before a durable root is fully synced, then there 
should be a second, older durable root that hasn't been reused yet. In 
that case MDB recovers the way it does currently.

Does this make sense? Thanks for bearing with me.

Re: (ITS#8475) Feature request: MDB low durability transactions

2016-08-08 Thread h . b . furuseth
On 08/08/16 03:51, wrote:
> The idea was that the two "floating" roots would reuse pages the way MDB
> does now. The 3rd durable root would have its pages preserved
> separately. I can see why this would cause up to a ~2X storage increase
> as the two sets diverged, but I don't see why it would need to grow
> unbounded. Apologies for this stupid question.

A transaction must not reuse data pages visible in the last snapshot
known to be durable, since that's how far back LMDB may need to revert
after abnormal termination.  Like a crash after MDB_NOMETASYNC may do.

Sync the data pages from a txn, write the metapage, eventually sync
that metapage, wait out any older read-only transactions, and *then*
you can reuse the pages the txn freed.  Not before.  So when you don't
sync, or a read-only txn won't die, LMDB degenerates to append-only.

...except if you sync the metapage and exit, next LMDB run may not
know you synced it and must assume the metapage isn't yet durable.
So it might not reuse pages visible to the _previous_ durable
metapage, until it syncs.  I'm rather losing track at this point,
but I think it may mean twice as many not-yet-usable pages as one
might expect.
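
The basic reuse rule from this message, reduced to a predicate. This is a simplification for illustration, with hypothetical names: it is not LMDB's actual freelist code (which is more conservative), and it does not model the restart caveat in the last paragraph.

```c
#include <stdbool.h>
#include <stdint.h>

/* freed_by      - txnid of the snapshot that freed the page
 * last_durable  - newest snapshot known to be durable
 * oldest_reader - snapshot txnid of the oldest live reader;
 *                 pass UINT64_MAX when there are no readers
 */
static bool sketch_page_reusable(uint64_t freed_by,
                                 uint64_t last_durable,
                                 uint64_t oldest_reader)
{
    /* The snapshot which freed the page must be known to be durable. */
    bool freeing_snapshot_durable = freed_by <= last_durable;
    /* No reader may still be on a snapshot older than the freeing one,
     * since such a snapshot can still reference the page. */
    bool no_older_readers = oldest_reader >= freed_by;
    return freeing_snapshot_durable && no_older_readers;
}
```

If last_durable never advances (no sync) or oldest_reader never advances (a reader that won't die), the predicate stays false for every newly freed page, which is the append-only degeneration described above.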

Re: (ITS#8475) Feature request: MDB low durability transactions

2016-08-07 Thread bentrask
On 08/07/2016 09:29 PM, Howard Chu wrote:
> Knowing whether or not the root pages are pristine still doesn't tell
> you anything about whether the data pages are intact. The only way to
> make any of these schemes work is to avoid overwriting/reusing any data
> pages for the last N transactions. I.e., reverting to append-only
> behavior. So the underlying question (which we have wrestled with
> internally for quite some time) which you haven't asked or answered -
> how many of these non-durable transactions will you support at any given
> time?

The idea was that the two "floating" roots would reuse pages the way MDB 
does now. The 3rd durable root would have its pages preserved 
separately. I can see why this would cause up to a ~2X storage increase 
as the two sets diverged, but I don't see why it would need to grow 
unbounded. Apologies for this stupid question.

Re: (ITS#8475) Feature request: MDB low durability transactions

2016-08-07 Thread hyc
Ben Trask wrote:
> On 08/07/2016 05:44 PM, Howard Chu wrote:
>> The only way to guarantee integrity is with ordered writes. All SCSI
>> devices support this feature, but e.g. the Linux kernel does not (and
>> neither does SATA, and no idea about PCIe SSDs...).
>> Lacking a portable mechanism for ordered writes, you have two choices
>> for preserving integrity - append-only operation (which forces ordered
>> writes anyway) or at least one synchronous write somewhere.
>> Whenever you decide to reuse existing pages rather than operating as
>> append-only, you create the possibility of overwriting some required
>> data before it was safe to do so. Your 3-root checksum scheme *might*
>> let you detect that the DB is corrupted, but it *won't* let you recover
>> to a clean state. Given that writes occur in unpredictable order,
>> without fsyncs there is no way you can guarantee that anything sane is
>> on the disk.
> Consider three roots without any checksums. Each root has a simple flag
> indicating whether it was written durably (fsync write barrier). During
> recovery, non-durable roots are simply ignored/discarded. This is equivalent
> to Hallvard's suggestion for volatile meta-pages. I think it's pretty clear
> this is workable.
>  From there, checksums just give you slightly stronger guarantees, although
> they might not be worth the overhead (CPU/storage) and recovery complexity.

Knowing whether or not the root pages are pristine still doesn't tell you 
anything about whether the data pages are intact. The only way to make any of 
these schemes work is to avoid overwriting/reusing any data pages for the last 
N transactions. I.e., reverting to append-only behavior. So the underlying 
question (which we have wrestled with internally for quite some time) which 
you haven't asked or answered - how many of these non-durable transactions 
will you support at any given time?

   -- Howard Chu
   CTO, Symas Corp.
   Director, Highland Sun
   Chief Architect, OpenLDAP

Re: (ITS#8475) Feature request: MDB low durability transactions

2016-08-07 Thread bentrask
On 08/07/2016 05:44 PM, Howard Chu wrote:
> The only way to guarantee integrity is with ordered writes. All SCSI
> devices support this feature, but e.g. the Linux kernel does not (and
> neither does SATA, and no idea about PCIe SSDs...).
> Lacking a portable mechanism for ordered writes, you have two choices
> for preserving integrity - append-only operation (which forces ordered
> writes anyway) or at least one synchronous write somewhere.
> Whenever you decide to reuse existing pages rather than operating as
> append-only, you create the possibility of overwriting some required
> data before it was safe to do so. Your 3-root checksum scheme *might*
> let you detect that the DB is corrupted, but it *won't* let you recover
> to a clean state. Given that writes occur in unpredictable order,
> without fsyncs there is no way you can guarantee that anything sane is
> on the disk.

Consider three roots without any checksums. Each root has a simple flag 
indicating whether it was written durably (fsync write barrier). During 
recovery, non-durable roots are simply ignored/discarded. This is 
equivalent to Hallvard's suggestion for volatile meta-pages. I think 
it's pretty clear this is workable.
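
The recovery step described here (ignore non-durable roots, take the newest durable one) might look like the sketch below. The struct and function are hypothetical, and this covers only root selection: it says nothing about whether the data pages behind the chosen root are still intact, which is the objection raised elsewhere in the thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical root record: a txnid plus the "written durably" flag. */
typedef struct {
    uint64_t txnid;
    bool durable;
} sketch_root;

/* Return the index of the newest root flagged durable, or -1 if none. */
static int sketch_recover_root(const sketch_root roots[], int n)
{
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (!roots[i].durable)
            continue;                      /* non-durable roots are discarded */
        if (best < 0 || roots[i].txnid > roots[best].txnid)
            best = i;
    }
    return best;
}
```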

 From there, checksums just give you slightly stronger guarantees, 
although they might not be worth the overhead (CPU/storage) and recovery 

Re: (ITS#8475) Feature request: MDB low durability transactions

2016-08-07 Thread hyc wrote:
> Thanks for the replies, Hallvard and Howard!
> I was mistaken in thinking that NOMETASYNC didn't guarantee integrity.
> However, my proposal would allow fsync to be omitted entirely.
> I think my approach with three roots is better than a WAL because it
> keeps the read and write paths simpler and more uniform. It also doesn't
> force periodic fsyncs when the log wraps, or consume unbounded space. In
> fact it's very similar to the basic design of MDB.
> You're right that you'd actually need to record the page's checksum in
> the parent, rather than in the page itself. I guess this would hurt the
> branching factor.

And then it's turtles all the way down.

What you're suggesting won't work. Trust me when I say we have spent far more 
time thinking about this question than you have.

The only way to guarantee integrity is with ordered writes. All SCSI devices 
support this feature, but e.g. the Linux kernel does not (and neither does 
SATA, and no idea about PCIe SSDs...).

Lacking a portable mechanism for ordered writes, you have two choices for 
preserving integrity - append-only operation (which forces ordered writes 
anyway) or at least one synchronous write somewhere.

Whenever you decide to reuse existing pages rather than operating as 
append-only, you create the possibility of overwriting some required data 
before it was safe to do so. Your 3-root checksum scheme *might* let you 
detect that the DB is corrupted, but it *won't* let you recover to a clean 
state. Given that writes occur in unpredictable order, without fsyncs there is 
no way you can guarantee that anything sane is on the disk.

   -- Howard Chu
   CTO, Symas Corp.
   Director, Highland Sun
   Chief Architect, OpenLDAP

Re: (ITS#8475) Feature request: MDB low durability transactions

2016-08-07 Thread bentrask
Thanks for the replies, Hallvard and Howard!

I was mistaken in thinking that NOMETASYNC didn't guarantee integrity. 
However, my proposal would allow fsync to be omitted entirely.

I think my approach with three roots is better than a WAL because it 
keeps the read and write paths simpler and more uniform. It also doesn't 
force periodic fsyncs when the log wraps, or consume unbounded space. In 
fact it's very similar to the basic design of MDB.

You're right that you'd actually need to record the page's checksum in 
the parent, rather than in the page itself. I guess this would hurt the 
branching factor.

Thanks again,

On 08/06/2016 12:42 PM, Hallvard Breien Furuseth wrote:
> On 06. aug. 2016 17:38, wrote:
>> Transaction commits are one of the few bottlenecks in MDB, because it
>> has to
>> fsync twice, sequentially.
>> I think MDB could support mixed low and high durability transactions
>> in the same
>> database by adding per-page checksums and a third root page. The idea
>> is that
>> when committing a low-durability transaction, no fsyncs are performed.
>> (...)
> Yesno.  We can get rid of fsyncs, but not that way.  Checksumming each
> page isn't enough.  We must know it's the right version of the page and
> not e.g. a similar page from a previous aborted transaction.  To commit
> a branch or meta page, we'd need to scan its children and checksum the
> page headers (thus including their checksum) of each.  Expensive.
> IIRC there are three things we can do:
> - Use and fsync a WAL (write-ahead log) instead of the database pages.
>   That can be cheaper because it writes one contiguous region instead
>   of a lot of random-access pages.  Requires recovery after a crash.
> - Volatile metapages which mdb_env_open() _always_ throws away if no
>   other environment is already open.  They are lost if the application
>   crashes/exits without doing a final checkpoint.
> - Improve that a bit: Put them in a shared memory region, since that
>   won't survive a system crash (unlike if we put them in the lockfile).
>   That way they'll survive application crash provided something does
>   a checkpoint before next system crash.
> We've discussed these sometimes and there are caveats for some of them,
> I don't quite remember.  One issue is that a "system crash" isn't the
> only thing which can lose unsynced pages.  Another is unmounting and
> re-mounting the disk (e.g. a USB disk).

Re: (ITS#8475) Feature request: MDB low durability transactions

2016-08-06 Thread h . b . furuseth
On 06/08/16 18:56, wrote:
> We already have low durability transactions, that's what MDB_NOMETASYNC is
> for. It only does one fsync per commit instead of two.

Haha, I didn't notice that since this reminded me of other discussions.
The stuff I was talking about was for fewer fsyncs than one per commit.

Re: (ITS#8475) Feature request: MDB low durability transactions

2016-08-06 Thread hyc wrote:
> Full_Name: Ben Trask
> Version:
> OS:
> URL:
> Submission from: (NULL) (
> Hi,
> Transaction commits are one of the few bottlenecks in MDB, because it has to
> fsync twice, sequentially.
> I think MDB could support mixed low and high durability transactions in the 
> same

We already have low durability transactions, that's what MDB_NOMETASYNC is 
for. It only does one fsync per commit instead of two.

   -- Howard Chu
   CTO, Symas Corp.
   Director, Highland Sun
   Chief Architect, OpenLDAP

Re: (ITS#8475) Feature request: MDB low durability transactions

2016-08-06 Thread h . b . furuseth
On 06. aug. 2016 17:38, wrote:
> Transaction commits are one of the few bottlenecks in MDB, because it has to
> fsync twice, sequentially.
> I think MDB could support mixed low and high durability transactions in the 
> same
> database by adding per-page checksums and a third root page. The idea is that
> when committing a low-durability transaction, no fsyncs are performed. (...)

Yesno.  We can get rid of fsyncs, but not that way.  Checksumming each
page isn't enough.  We must know it's the right version of the page and
not e.g. a similar page from a previous aborted transaction.  To commit
a branch or meta page, we'd need to scan its children and checksum the
page headers (thus including their checksum) of each.  Expensive.
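
To illustrate why this is expensive: a branch page's checksum would have to cover the header of every child, including each child's own checksum, so a change anywhere below propagates up to the meta page. The sketch below uses 64-bit FNV-1a purely as a placeholder hash (the thread does not name one), and the header struct is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical page header: just the fields the parent would cover. */
typedef struct {
    uint64_t pgno;
    uint64_t checksum;   /* the child's own checksum */
} sketch_page_header;

/* 64-bit FNV-1a, continued from state h. */
static uint64_t fnv1a64(uint64_t h, const void *data, size_t len)
{
    const unsigned char *p = data;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;          /* FNV-1a 64-bit prime */
    }
    return h;
}

/* Checksum for a branch page over its children's headers.  Hashes the
 * in-memory bytes, so the result depends on host byte order. */
static uint64_t sketch_branch_checksum(const sketch_page_header *children,
                                       size_t n)
{
    uint64_t h = 0xcbf29ce484222325ULL;  /* FNV-1a 64-bit offset basis */
    for (size_t i = 0; i < n; i++) {
        h = fnv1a64(h, &children[i].pgno, sizeof children[i].pgno);
        h = fnv1a64(h, &children[i].checksum, sizeof children[i].checksum);
    }
    return h;
}
```

Every commit would need one such pass per dirty branch page on the path to the root, which is the scan called "Expensive" above.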

IIRC there are three things we can do:

- Use and fsync a WAL (write-ahead log) instead of the database pages.
   That can be cheaper because it writes one contiguous region instead
   of a lot of random-access pages.  Requires recovery after a crash.

- Volatile metapages which mdb_env_open() _always_ throws away if no
   other environment is already open.  They are lost if the application
   crashes/exits without doing a final checkpoint.

- Improve that a bit: Put them in a shared memory region, since that
   won't survive a system crash (unlike if we put them in the lockfile).
   That way they'll survive application crash provided something does
   a checkpoint before next system crash.

We've discussed these sometimes and there are caveats for some of them,
I don't quite remember.  One issue is that a "system crash" isn't the
only thing which can lose unsynced pages.  Another is unmounting and
re-mounting the disk (e.g. a USB disk).
