A userspace prototype for logging and replay is committed: http://hg.tux3.org/tux3/rev/b329f332ec12
This code demonstrates a nice efficient approach to logging and replay. The encode/decode primitives used for inode attributes are re-used here, which is pleasant. This code is already endian-correct, which was easy because we had already developed the primitives. I ran sparse checking, which was easy thanks to Hirofumi's makefile improvements, and this code came up endian clean on the first try. Except that sparse found a bug in balloc.c I had introduced witt the new (untested) all_clear function. Which means I should run sparse more often. Log blocks are kept in a mapping, for which it is probably easiest just to allocate an inode in kernel. Having them in their own mapping lets us treat the log blocks as an array, and by not using the buffer mapping for logs, we can delay deciding where each log block should be physically located until it is time to commit them to disk. Variable sized log records are serially encoded into the highest log block in the log mapping. The only example so far is log_alloc, which logs bitmap allocations and frees to implement the deferred bitmap update idea I described earlier. We will also have log entries to implement the "promise" idea where we store a block and do not update its parent index on disk, but just update the parent in cache and add a log entry so that the cached parent can be reconstructed on replay. The user space prototype replays just by re-walking the list of log blocks. Replay in production will be a little more complex. Later log entries may invalidate earlier ones, for example, storing a bitmap block invalidates all earlier sets and clears for that bitmap block because they are based on the previously recorded version of the block. So when we update a bitmap block, redirecting it to a new location, the log receives a DEADMAP record that means all previous log entries for the indicated bitmap block should be ignored on replay. So log replay is going to be a multi-pass affair: first we walk the log blocks and find all the places where things are invalidated, then do the replay using that information to ignore what must be ignored. Suggestions are welcome for how to do this elegantly. Each delta will have one or more log blocks. Each log block points at the previous log block, so we can load them in reverse order. The last log block for a delta has a commit entry that includes a pointer to the previous delta commit block, and a count of how many delta commits blocks there are in the chain so that rollup can shorten the log chain just by writing a new commit block. For the time being, the disksuper will maintain a pointer to the most recent delta commit block. This idea should be improved eventually, because repeatedly updating the superblock and hoping that there will never be a partial update is not good practice. However, it is extremely unlikely to cause a problem in practice, so we can defer this issue. On startup, we load the log mapping by chaining back through the commit chain blocks, and chaining backward through each delta log chain. Being lazy, we can start at some high index in the log mapping and work backwards. Loading the log chain this way is perhaps not the greatest idea for the long term, because each log block has to be read synchronously, with an average access time around 6 ms for rotating media. It does not take many log blocks to add up to a significant delay on startup. We can think about improving this later. With this logging code, we officially enter the next phase of the project: robustness and crash recovery. Regards, Daniel _______________________________________________ Tux3 mailing list [email protected] http://mailman.tux3.org/cgi-bin/mailman/listinfo/tux3
