In a previous design note, I described Tux3's frontend/backend cache model:
http://mailman.tux3.org/pipermail/tux3/2008-November/000303.html
"Deferred Namespace Operations"

The subject line is a little inaccurate - that post is really more about Tux3's cache layering, out of which came the conclusion that deferred namespace operations are necessary in order to implement Tux3's layered model optimally. (The above needs a cleanup and repost as a design note specifically about the cache model.)

This post is about an important piece of the cache layering model: the "delta staging" operation (called "delta setup" in the post above), which takes place at each delta transition. Delta staging takes the changes that the user has made to the front end cache and formats them into disk blocks, ready to be transferred to disk. Since these disk blocks are all buffered in cache, this is a cache-to-cache operation.

Delta staging does the following (a code sketch of the whole sequence appears further down):

 - Flushes deferred name creates and deletes to directory blocks (*)

 - Flushes deferred inode creates and deletes to inode table blocks (*)

 - Allocates and assigns physical disk addresses for dirty data blocks and dirty directory entry blocks

 - Updates inode table blocks to point to the new locations of the above blocks

 - Assigns new locations for any modified inode table blocks

 - Assigns new locations for split index blocks (note: but not for index blocks that just have a new pointer added; that is handled by the "promises" and rollup described earlier)

 - Allocates and formats one or more log commit blocks to record the physical locations of all blocks now contained by the delta, including dirty file data, changed directory entry blocks and split btree nodes

 - Chooses one of the log commit blocks to be the delta commit block and adds delta commit information to it

Steps marked (*) above are done only if we have deferred namespace operations available. Otherwise, the front end will directly modify directory entry and inode table blocks, using a lock against delta staging to avoid the situation where the front end wants to modify an inode table block that delta staging has already modified but not yet written (see the locking sketch below).

Note that a delta does not include changed bitmaps or changed index blocks, other than split index blocks. That information can be derived from the delta commit blocks at replay time (the next mount, whether explicit or after an unexpected interruption), so there is no need to write it to disk.

Delta staging will normally be a quick, cache-to-cache operation with running time measured in microseconds. However, if it needs to read metadata blocks from disk, its running time will sometimes stretch into milliseconds. With deferred namespace operations, this causes no visible interruption to the front end, provided the disk is not backlogged. Without deferred namespace operations, these latency spikes will be visible to the user, which sounds bad, but is actually normal behavior for the current generation of Linux filesystems. With deferred namespace operations, we will push the envelope a little in Tux3.

When delta staging completes, we have a set of block images ready to transfer to disk. As soon as the previous delta has completed (that is, its delta commit block write has completed), all the blocks of the new delta are submitted for writeout, except for the delta commit block. When the other blocks have completed writeout, the delta commit block is written and transfer of the next delta can begin. This is a pretty simple strategy.
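To pin down the ordering, here is a minimal sketch of that simple strategy in C. Every name in it (struct delta, submit_write and so on) is invented for this post - none of these are actual Tux3 interfaces - and the stubs just print what the real code would do asynchronously:

    /* Minimal sketch of the simple commit strategy described above.
     * All names are hypothetical, none taken from the Tux3 source. */

    #include <stdio.h>

    struct block { int blocknum; };

    struct delta {
            struct block *blocks;   /* staged block images for this delta */
            int nblocks;            /* count, not including the commit block */
            struct block commit;    /* the delta commit block */
    };

    static void submit_write(struct block *b)
    {
            printf("submit block %d\n", b->blocknum);
    }

    static void wait_for_writes(void)
    {
            printf("wait: all delta blocks on disk\n");
    }

    static void wait_for_previous_commit(void)
    {
            printf("wait: previous delta commit block on disk\n");
    }

    static void write_delta(struct delta *d)
    {
            /* Transfer may not begin until the previous delta's commit
             * block write has completed. */
            wait_for_previous_commit();

            /* Submit everything except the delta commit block... */
            for (int i = 0; i < d->nblocks; i++)
                    submit_write(&d->blocks[i]);

            /* ...and only when all of that has landed is the commit
             * block written, after which the next delta may begin. */
            wait_for_writes();
            submit_write(&d->commit);
    }

    int main(void)
    {
            struct block staged[] = { { 101 }, { 102 }, { 103 } };
            struct delta d = { staged, 3, { 104 } };

            write_delta(&d);
            return 0;
    }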
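Looking back at the staging steps listed above, the whole sequence amounts to the outline below. Again, every function here is a made-up name standing in for the cache-to-cache work described in the list:

    /* Hypothetical outline of the delta staging sequence. */

    #include <stdio.h>

    static void step(const char *what) { printf("%s\n", what); }

    static void flush_deferred_dirents(void)  { step("name creates/deletes -> directory blocks"); }
    static void flush_deferred_inodes(void)   { step("inode creates/deletes -> inode table blocks"); }
    static void allocate_leaf_blocks(void)    { step("disk addresses for dirty data/dirent blocks"); }
    static void repoint_inode_table(void)     { step("itable points at new block locations"); }
    static void allocate_itable_blocks(void)  { step("new locations for dirty itable blocks"); }
    static void allocate_split_index(void)    { step("new locations for split index blocks"); }
    static void format_log_blocks(void)       { step("log commit blocks record new locations"); }
    static void choose_commit_block(void)     { step("one log block gets the delta commit info"); }

    static void stage_delta(int deferred_namespace_ops)
    {
            if (deferred_namespace_ops) {
                    flush_deferred_dirents();       /* (*) */
                    flush_deferred_inodes();        /* (*) */
            }
            allocate_leaf_blocks();
            repoint_inode_table();
            allocate_itable_blocks();
            allocate_split_index();         /* index blocks that merely gain
                                               a pointer are left to promises
                                               and rollup instead */
            format_log_blocks();
            choose_commit_block();
    }

    int main(void)
    {
            stage_delta(1);
            return 0;
    }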
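And for the fallback case without deferred namespace operations, the lock mentioned above comes down to something like this. POSIX threads are assumed purely for illustration; the real primitive may well be different:

    /* Sketch of the fallback lock between front end and delta staging. */

    #include <pthread.h>

    static pthread_mutex_t staging_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Front end path: directory entry and inode table blocks are modified
     * directly, so the front end must exclude delta staging while it works. */
    static void frontend_modify_itable(void)
    {
            pthread_mutex_lock(&staging_lock);
            /* ... modify the inode table block in cache ... */
            pthread_mutex_unlock(&staging_lock);
    }

    /* Back end path: staging holds the same lock, so the front end can never
     * modify a block that staging has modified but not yet written. */
    static void stage_delta(void)
    {
            pthread_mutex_lock(&staging_lock);
            /* ... stage the delta as outlined above ... */
            pthread_mutex_unlock(&staging_lock);
    }

    int main(void)
    {
            frontend_modify_itable();
            stage_delta();
            return 0;
    }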
In time we will elaborate this simple strategy with measures to eliminate the two synchronous waits at each delta: 1) the commit block waits for the other block writeouts to complete; 2) the next delta waits for the commit block writeout to complete. These waits can be eliminated with some simple tricks. Even if we are lazy and do nothing about them, performance will be respectable, because under load the deltas will be quite large: ten milliseconds spent waiting for a 500 millisecond delta writeout will be scarcely noticeable. But a synchronous write load, like we have with NFS, will benefit visibly from a smarter commit pipeline strategy. More on that later.
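A little arithmetic makes that concrete. The 10 and 500 millisecond figures are the ones above; the 1 millisecond delta is only an assumed size for a delta forced out by a synchronous writer:

    /* Back-of-envelope cost of the two synchronous waits per delta. */

    #include <stdio.h>

    int main(void)
    {
            double wait_ms = 10;            /* one synchronous wait, as above */
            double big_delta_ms = 500;      /* large delta writeout under load */
            double tiny_delta_ms = 1;       /* assumed: delta forced out by a
                                               synchronous writer */

            printf("large delta: waits are %.0f%% of the cycle\n",
                    100 * 2 * wait_ms / (big_delta_ms + 2 * wait_ms));
            printf("tiny sync delta: waits are %.0f%% of the cycle\n",
                    100 * 2 * wait_ms / (tiny_delta_ms + 2 * wait_ms));
            return 0;
    }

Roughly 4% overhead in the first case and 95% in the second, which is why the synchronous load is the one that stands to gain.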
Regards,

Daniel