This Christmas, instead of sugar plums dancing through my head, there were thoughts of logging and atomic commit. As I would much prefer the plums[1] the best way to put this situation right is to commit some code, which has begun. However, writing things down is also an effective way to put those geeky thoughts out of mind, and it helps make good code, so here goes.
The atomic commit design has been adjusted a little to take advantage of the fact that we are now running stable using ->get_block, the traditional Ext2-style block library interface. As a big fan of incremental development, I prefer to add bits and pieces to working code and move incrementally in the direction of Ext3-class recovery characteristics, rather than embarking on a big rewrite that only becomes useful when every last bit of it is in place. We can call the way Tux3 is running now "immediate mode", because blocks are allocated and write transfers start immediately inside sys_write. We get this "for free" just by hooking up our tux3_get_block method, which Hirofumi did last month. After pondering a while, I concluded that we can make some fairly small additions to turn immediate mode writing into a real atomic commit, leaving the much nicer cache layering design for later. The first of these additions is a logging mechanism, like I have described for Tux3 from the beginning, but pared down a little. We do not need forward logging, which is just an optimization, but we do need the concept of log rollup, which is not just an optimization, it solves some fundamental operational issues. The first thing we will log is changes to the allocation bitmap. Bitmap blocks will not be written to disk at the time blocks are allocated in sys_write, sys_create, etc. Instead, allocated and freed extents will be logged, while the bitmap blocks themselves are only updated in cache. From time to time, the cached blocks will be written out, fullfilling the "promises" in the log blocks. This mechanism helps solve two fundamental problems. The first is delayed freeing: when a delta update frees some blocks those blocks must not become available for reallocation before the delta commit has finished, otherwise some blocks of the previous consistent filesystem state may be overwritten, and a crash at that point would leave the filesystem with the old state corrupted and the new state not completed. So any blocks freed in a delta are entered into the allocation map after the delta has completed. The problem is, if we crash just after the commit, nobody knows about those freed blocks any more, they have leaked. This issue is solved tidily by logging the freed blocks (extents) as part of the same delta that frees them. The second fundamental problem is "allocate in allocate": we flush the bitmap to disk, which causes a physical block to be allocated for some dirty bitmap buffer, which dirties a different bitmap block, which then needs a physical block allocated for it, and so on, potentially for many iterations. This makes it hard to know the upper bound of metadata needed to perform a disk update, which in turn makes it hard to handle ENOSPC accurately. Another issue is, an allocation map might be in flight to disk when a bit is changed, making the state of the bit on disk unpredictable. And furthermore, logging a bitmap allocate record for a bit that is already set will cause an error on replay, because replay will check that the bitmap bits it sets are not already set. So in short, this is a big mess. Now I will introduce a technique I call "buffer forking" to solve the problem cleanly. Buffer forking is a form of copy-on-write, in which a copy is made of a buffer's underlying data and substituted for the original. Then the buffer contents can be altered freely without affecting the original. Forking makes flushing the allocation bitmap to disk easy. We just walk all the dirty buffers of the bitmap inode calling our tux3_get_block method to do physical allocation. Any bitmap block that is altered during the flush will be forked, and any allocations will be logged. Then we walk the same dirty buffers to submit the IO. The allocate-in-allocate issue comes up because we use dynamically allocated bitmaps as opposed to the Ext filesystem family, which has them at fixed locations. However, it isn't too hard to deal with and there are big advantages: for one thing, we can implement atomic update for bitmap blocks by redirecting them to new locations by the same mechanism as any other Tux3 filesystem object. If they had fixed locations we would need a different, update in place method that is more code and more writes. [1] http://www.christmas-tree.com/stories/nightbeforechristmas.html http://earthmother-intheraw.blogspot.com/2008/12/visions-of-sugar-plums.html _______________________________________________ Tux3 mailing list [email protected] http://mailman.tux3.org/cgi-bin/mailman/listinfo/tux3
