Hi Matt,

This note examines specifics of the Tux3 btree index block life cycle, 
including caching, redirection, flushing and reconstruction during log replay.  
This addresses some points that you raised during our earlier discussion, for 
which my response at that time could fairly be described as hand waving.  One 
such point was potential complexity compared to working with a more "uniform" 
btree design like the one you have adopted for Hammer.

Hirofumi is CCed because he has bravely volunteered to complete the 
implementation of this part of the Tux3 atomic commit mechanism for userspace 
and kernel, thus providing me with the necessary motivation to state the 
details precisely.

I think that we can see at this point that the complexity required to implement 
the Tux3 atomic commit strategy is well bounded, especially as it is now mostly 
implemented and can be seen not to amount to a great deal of code.  Mind you, 
some of it has not yet seen the light of day.  For example, I still have not 
addressed all the details of allocation bitmap replay, which will require an 
additional, logical replay pass not described here.  Oh well, that will be 
another note, and hopefully all of this will be up and running fairly soon.

By the way, congratulations on your apparent success with Hammer.  I hope to 
foment a movement to port it to Linux, as it would seem to address a niche that 
is not well covered in Penguin land and is unlikely to be covered in the 
foreseeable future by my project or any other I know of, which is to say: 
cluster replication and continuous fine-grained delta logging for high capacity 
servers.

Natural metadata position

For file index blocks, the "natural" position is near the file's inode table 
block, and near the beginning of the file data blocks.  Roughly speaking, we 
would like to place index blocks on the same track as both the associated inode 
table block and the first few file data blocks, so that the disk drive will 
pick these up together into its track cache when it first seeks to that track.  
Similarly, we want each inode table index block to lie near the beginning of 
the group of inode table leaf blocks it references.  This is the ideal position 
for read access.

On the other hand, the ideal position for writing is wherever file data is 
currently being written.  Thus, the ideal metadata block position for reading 
conflicts with that for writing.  We resolve this conflict using a scheme I 
call metadata "kiting".

Pinned dirty metadata, or "Kiting"

As an optimization, dirty btree index blocks are not written to disk at each 
delta cycle, but instead flushed in relatively infrequent log rollup cycles.  
This strategy is primarily intended to reduce the number of out-of-line seeks 
during writing, while also placing the metadata close to its natural position 
for fast read access.  Instead of writing a dirty metadata block to disk at 
each delta, we log details of the changes that have been made to it since the 
last time it was flushed, and write out the log block as part of the associated 
delta, at a location near the other blocks of the delta.  It does not matter if 
this is far from the referenced metadata blocks, because the log block will 
only ever be read on startup, and then only if no log rollup was done before 
shutdown.

A dirty metadata block held for flushing in this way is said to be pinned, 
because it records the only references to at least some of its child blocks.  
It may not be evicted from cache until after we either flush the block to disk 
or log enough information to reconstruct the dirty block in cache on log 
replay.  I coined the term "kiting" for this scheme: such dirty metadata blocks 
normally live "up there in cache" and only "come down to earth" at relatively 
rare log flush intervals.
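
To make the pinning rule concrete, here is a rough C sketch of the idea.  All 
structure and function names are hypothetical stand-ins, not the actual Tux3 
code; the point is only that a change record goes into the current delta's log 
block while the dirty index block itself stays in cache, marked so that it 
cannot be evicted:

    #include <stdint.h>

    /* Hypothetical cache buffer states, not the real Tux3 enum */
    enum bufstate { BUF_CLEAN, BUF_DIRTY, BUF_PINNED };

    struct buffer {
            uint64_t index;             /* position in the volume map */
            enum bufstate state;
            char data[4096];
    };

    /* A change record destined for the current delta's log block */
    struct logent {
            uint8_t type;               /* insert, update, delete, ... */
            uint64_t key, parent, child;
    };

    static struct logent logblock[256]; /* stand-in for the real log block */
    static unsigned logtail;

    /* Record a change to a pinned index block instead of queuing the
     * block itself for writeout: the log block travels with the delta,
     * the dirty index block stays up in cache until the next rollup. */
    static void log_change(struct buffer *buf, struct logent ent)
    {
            logblock[logtail++] = ent;
            buf->state = BUF_PINNED;    /* may not be evicted until flushed */
    }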

Besides saving seeks on read and write, kiting should reduce the total number 
of metadata block updates.  The realized savings may be anywhere from nothing 
to a factor of several hundred, or roughly the branching factor of a btree 
index block.  We will eventually measure this effect under various filesystem 
loads.

Block redirect

A btree block loaded into cache as part of a btree probe for read access or 
leaf update is initially marked clean.  If a btree leaf is to be updated, for 
example by storing new inode attributes or new file data extents, then unless 
it is already dirty in the current delta, it is copied to a newly allocated 
position in the Tux3 volmap (formerly bdev, aka buffer) cache and marked dirty, 
to be flushed as part of the current delta.  
However, its parent, a btree index node, will not be flushed immediately, even 
though it must be altered to record the new position of the leaf block.  The 
parent is changed and redirected in cache, and its parent as well, right up to 
the first block in the btree cursor path that is already dirty in the current 
delta, which may be the root of the btree.

Thus, altering a file data leaf may entail redirecting the entire path from the 
leaf to the root of the btree.  If the root is redirected, then the cached 
inode must be marked dirty so that its data btree attribute will be updated in 
the current delta.  This may in turn cause the entire path from the affected 
inode table block to the root of the inode table btree to be redirected when 
the affected inode is flushed during delta staging.  So a single change may 
cause a minor storm of volume map cache activity.  However, it does not 
immediately cause a great deal of write activity, because changes to all the 
btree index nodes affected are recorded in a log block and the dirty index 
block itself is merely held dirty in cache until the next log rollup flush 
cycle some time in the future.   (Note: btree leaf nodes are flushed as part of 
the current delta, unlike btree index nodes.  This is due to the inherently 
greater complexity of logging and replaying leaf node operations, and because 
it is much easier to place leaf nodes close to their associated data without 
introducing much extra read or write seeking.)
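
Roughly, the redirect walk up the cursor path could look like the sketch 
below.  Every name here is again a hypothetical stand-in, and the real code 
tracks per-delta dirty state and handles the leaf versus index node flush 
distinction noted above, but it shows the shape of the loop: copy each clean 
block to a newly allocated volmap position, point its parent at the copy, log 
the redirect, and stop at the first block that is already dirty in the current 
delta:

    #include <stdint.h>
    #include <string.h>

    #define MAXDEPTH 8
    #define BLOCKSIZE 4096

    /* Hypothetical cache buffer and btree cursor */
    struct buffer {
            uint64_t index;             /* volmap position */
            unsigned dirty_delta;       /* delta this buffer is dirty in, 0 = clean */
            char data[BLOCKSIZE];
    };

    struct cursor {
            int depth;                  /* path[0] = root ... path[depth] = leaf */
            struct buffer *path[MAXDEPTH];
    };

    /* Assumed primitives, declarations only: allocate a volmap position,
     * get a cache buffer at that position, point a parent entry at a new
     * child position, mark the owning inode dirty, and emit a redirect
     * log record. */
    uint64_t balloc_one(void);
    struct buffer *blockget(uint64_t index);
    void parent_update(struct buffer *parent, uint64_t oldchild, uint64_t newchild);
    void mark_inode_dirty(struct cursor *cursor);
    void log_redirect(uint64_t oldblock, uint64_t newblock);

    /* Walk from the leaf toward the root, copying each clean block to a
     * fresh volmap position and fixing up its parent, stopping at the
     * first block already dirty in the current delta. */
    void cursor_redirect(struct cursor *cursor, unsigned delta)
    {
            for (int level = cursor->depth; level >= 0; level--) {
                    struct buffer *buf = cursor->path[level];
                    if (buf->dirty_delta == delta)
                            break;      /* already redirected this delta */
                    struct buffer *copy = blockget(balloc_one());
                    memcpy(copy->data, buf->data, BLOCKSIZE);
                    copy->dirty_delta = delta;
                    if (level)
                            parent_update(cursor->path[level - 1], buf->index, copy->index);
                    else
                            mark_inode_dirty(cursor);   /* root moved: update btree attribute */
                    log_redirect(buf->index, copy->index);
                    cursor->path[level] = copy; /* clean original may now be dropped */
            }
    }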

After redirecting a chain of blocks in a btree cursor path, the original clean 
versions may be discarded, since they will no longer be accessed in cache.  
(When we allow multiple deltas in the update pipeline, this rule will be 
slightly elaborated to accommodate redirections from in-flight dirty blocks in 
earlier deltas.)  Currently, the only case where we need to worry about exactly 
when to discard a clean metadata block is during bitmap flushing, which might 
trigger a read of an out of cache bitmap block, and thus a probe of at least 
part of the access path that we are currently modifying.  The locking strategy 
to handle this unusual case is interesting, to say the least.

When to redirect?

It is necessary to redirect btree nodes when first dirtied, rather than at 
delta staging time, because it is only at the time of leaf dirtying that we 
have the entire path from the root of the btree to the leaf available in the 
cursor path, and thus are able to easily locate and redirect the successive 
parent blocks.  Finding the parent later, when we only know that some volmap 
block is dirty, would require some means of tracking the parent, an additional 
level of complexity compared to doing the redirect at the time we know the 
parent chain.  Additionally, we not only know the parent from the cursor, we 
have it locked, which will be quite helpful as we refine our SMP locking 
strategy in the direction of finer granularity.

Log output records

To guarantee that all changes to a pinned btree index node can be accurately 
reconstructed if it becomes necessary to replay the log, the following types of 
change are logged to log blocks that will be committed as part of the delta in 
progress:

   insert (logical key, parent block, child block)
   update (logical key, parent block, child block)
   delete (logical key, parent block)
   redirect (old block, new block)
   split (logical key, block, new block)
   merge (logical key, block, old block)
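
For concreteness, one possible in-memory encoding of these records could look 
like the sketch below.  The record names follow the list above, but the field 
layout is illustrative only, not the actual Tux3 log block format:

    #include <stdint.h>

    enum logtype {
            LOG_INSERT, LOG_UPDATE, LOG_DELETE,
            LOG_REDIRECT, LOG_SPLIT, LOG_MERGE,
    };

    struct logrec {
            uint8_t type;
            union {
                    struct { uint64_t key, parent, child; } insert;   /* also update */
                    struct { uint64_t key, parent; } delete;
                    struct { uint64_t oldblock, newblock; } redirect;
                    struct { uint64_t key, block, newblock; } split;  /* merge: old block */
            };
    };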

Log replay

On log replay, each logged operation must be replayed in cache, yielding 
exactly the same result as the original operation.  One interesting subtlety is 
that insert, update or delete operations are recorded against redirected 
blocks, which have not necessarily ever been written to disk.  In fact, these 
operations must be replayed against the contents of predecessor versions of the 
block.  The redirect records in the delta log chain, when replayed in proper 
order with other operations affecting a given tree block, should produce the 
correct dirty volmap block in every case.  We can verify this with a checksum, 
which suggests the following additional form of log output record:

   check (checksum, block)

which will trigger failure if a dirty metadata block does not match after being 
reconstructed at log replay time.
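
Implementing the check record is straightforward.  The sketch below uses a 
trivial placeholder checksum and hypothetical names; the actual choice of hash, 
and how the logged checksum reaches the replay code, are open questions:

    #include <stdint.h>
    #include <stddef.h>

    /* Placeholder checksum; any reasonable block hash would do */
    static uint32_t block_sum(const unsigned char *data, size_t size)
    {
            uint32_t sum = 0;
            for (size_t i = 0; i < size; i++)
                    sum = sum * 31 + data[i];
            return sum;
    }

    /* Verify a dirty metadata block reconstructed at replay time against
     * the checksum recorded in its check log record; nonzero means the
     * replay went wrong and the filesystem should refuse to proceed. */
    static int replay_check(const unsigned char *rebuilt, size_t size,
                            uint32_t logged)
    {
            return block_sum(rebuilt, size) != logged;
    }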

Log rollup

An excessively long log chain could cause long replay times on restart (planned 
or unplanned) and might consume an excessive amount of cache.  Therefore from 
time to time a log rollup flush cycle is performed, triggered by current cache 
conditions, block device bandwidth availability, and length of the log.  Log 
rollup consists of moving all dirty pinned metadata blocks to the current delta 
writeout list.  After that delta has committed, the log blocks of the 
now-superseded log chain can be marked free in the allocation map.
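
In code, a rollup pass might amount to little more than the following sketch, 
which assumes a simple linked list of pinned buffers and a hypothetical 
delta_queue() primitive; the real bookkeeping will of course be somewhat more 
involved:

    struct buffer;                      /* pinned dirty metadata block */

    /* Hypothetical list node linking pinned buffers awaiting rollup */
    struct pinned {
            struct pinned *next;
            struct buffer *buf;
    };

    /* Assumed primitive: queue a buffer on the current delta writeout list */
    void delta_queue(struct buffer *buf);

    /* Log rollup: move every pinned dirty metadata block to the current
     * delta writeout list.  Once that delta commits, the log blocks that
     * described these buffers can be freed in the allocation bitmap.
     * (Recycling of the list nodes themselves is omitted here.) */
    void log_rollup(struct pinned **list)
    {
            struct pinned *entry;
            while ((entry = *list)) {
                    *list = entry->next;
                    delta_queue(entry->buf);
            }
    }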
