As requested, here are some more details on the recursive allocation problem, and how block forking solves it. Or nearly solves it.
I indulged in a little hubris when I claimed that the two shiny new techniques of allocation logging and block forking would instantly solve the allocation-during-bitmap-flush problem. While these techniques do in themselves give us the stable bitmap table snapshot that we require, the details of getting the right data on disk at the right time turned out to be a little devilish. I implemented a new unit test in user/commit.c to try out bitmap flushing, and the initial runs showed just the right results. Nearly ready to declare victory, I suddenly had new reservations, wrote another unit test, and found a flaw: in fact the bitmap data arriving on disk was the same as if there were no fancy forking and logging at all. So back to the drawing board.

The lazy solution would be just to ignore the problem. The difference between an absolutely correct on-disk bitmap block snapshot and a slightly later version into which additional allocations have leaked is just a few bits. These are represented in the log by allocation promises, and may or may not be set in the on-disk bitmap blocks depending on the order in which the blocks were written. On replay, the promised allocations are entered into the bitmaps to reconstruct the current bitmap cache. If we quietly leave out the check that the affected bits were not already set, everything just works and the world is a happy place, right?

Except that I found it a little disturbing to leave out a safety check just because we were not smart enough to place exactly the right data on disk. Even though Tux3 is not a nuclear reactor, we still hate the idea of going Chernobyl on our disk files, so one thing we do not want to do is leave out a perfectly good safety check.

In the end, a precise solution only cost 15 new lines of code, although determining exactly which 15 lines were the right ones took a couple of days. A flock of Tux3 techniques came into play: block forking, the log, allocation promises, index update promises, polymorphic block IO transfers, per-delta dirty lists, per-inode dirty lists, and delta transitions, all of which are also needed for other reasons.

The winning check-in is here:

   http://hg.tux3.org/tux3/rev/42fe4237d34b
   "A reworked fix to the allocate-during-bitmap-flush problem"

This solution is more or less as simple as I originally claimed. Forking leaves the clone buffer carrying the original data on the dirty list for the delta. The clone buffer points at the same allocation map as the original and has the same index, but is not entered in the hash map.

The problem I was having was due to mapping a buffer to disk and submitting IO for it at the same time. The IO submitted would be for the forked copy of the buffer data, sometimes with unwanted bits in it. Also, buffers appearing later in the bitmap table dirty list might already be forked, and again we would write out the current version of the bitmap instead of the desired read-only clone.

This was solved by dividing the bitmap block flush into two steps:

   1) map the buffer to disk

   2) write out the buffer unless its dirty state is now the current
      delta, indicating that it was forked during the mapping

In short, we just skip any forked blocks during the bitmap flush, then flush the delta dirty list, and we are done.

Flushing the delta dirty list is a minor subproblem in itself. This list may contain blocks from several kinds of structures, which are mapped to disk in different ways.
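Before getting to those variants, here is roughly what the two-step flush above looks like, as a minimal sketch in plain C. This is not the real Tux3 code: struct buffer is stripped down to the two fields the sketch needs, and map_buffer_to_disk and write_buffer are empty stubs standing in for the real mapping and block IO paths, there only so the fragment stands on its own.

struct buffer {
	unsigned delta;		/* delta in which this buffer was dirtied */
	struct buffer *next;	/* link in the bitmap dirty list */
};

/* Empty stubs standing in for the real mapping and block IO paths */
int map_buffer_to_disk(struct buffer *buf) { (void)buf; return 0; }
int write_buffer(struct buffer *buf) { (void)buf; return 0; }

int flush_bitmap_dirty_list(struct buffer *dirty, unsigned current_delta)
{
	for (struct buffer *buf = dirty; buf; buf = buf->next) {
		/* Step 1: map the buffer to disk.  Mapping may allocate
		   blocks, which can redirty and fork this very buffer. */
		int err = map_buffer_to_disk(buf);
		if (err)
			return err;

		/* Step 2: if the buffer now belongs to the current,
		   uncommitted delta, it was forked (during the mapping
		   or earlier), so skip it here.  Its read-only clone
		   sits on the delta dirty list and is written later. */
		if (buf->delta == current_delta)
			continue;

		err = write_buffer(buf);
		if (err)
			return err;
	}
	return 0;
}

The only thing that matters here is the ordering: every dirty bitmap buffer gets mapped, but the write is suppressed for any buffer that the mapping itself, or an earlier allocation, pushed into the current delta.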
The main variants are logically indexed buffers such as dirent blocks, and physically indexed buffers such as btree nodes. Now we have a third variant: unhashed logically indexed buffers, that is, clones resulting from buffer forks.

The normal method for writing logically indexed buffers (filemap_extent_io) is not suitable because it looks up the buffer that will be written in the mapping hash as part of an optimization to generate larger transfers. This would write out the wrong data. So instead, the allocation map gets its own IO mapping method, which uses the existing file mapping method for read, but a simpler, single block method for write. This method includes a check that the data being written does not belong to the current delta, which has not yet been committed.

So here we find a useful application of the polymorphic mapping method feature I have implemented in user space, and which I was never sure had a solid reason to exist. Now it has made itself sufficiently useful that it deserves to be implemented in kernel as well.

Speaking of kernel, this whole solution still needs to be implemented there. I am really glad I was able to bang away at this issue in user space instead of kernel, with its longer development cycles and near total lack of unit testing. All the while, I have been mindful that the techniques employed must be possible to implement in kernel, with its stringent SMP locking requirements and highly optimized internal interfaces, which in some cases do not resemble what we do in user space very much at all.

In particular, block forking is a new and untried technique for kernel, and far more difficult than in user space because of multiple blocks in different, asynchronously changing states sharing the same underlying data page. We have to rip the data page out from underneath a bunch of buffers and slip in a new one without any of them noticing. Kind of like the trick where you pull the table cloth out from underneath the dinner plates, so fast that nothing crashes to the floor. Except that we also have to copy the table cloth and slip the copy back underneath the dinnerware before it settles back onto the table. Fun. We think this is possible, and pretty soon we will find out for sure.

Regards,

Daniel
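P.S. To make the per-inode polymorphic IO method above a little more concrete, here is a rough sketch in plain C of what such a method could look like. This is not the actual Tux3 code: the structures are pared down to almost nothing, every signature is invented (including the one shown for filemap_extent_io), and write_single_block and current_delta are hypothetical stand-ins for the real machinery, stubbed out so the fragment compiles by itself.

#include <errno.h>

enum map_mode { MAP_READ, MAP_WRITE };

struct buffer { unsigned delta; };	/* pared down to what the sketch needs */

struct inode {
	/* the polymorphic part: each inode carries its own IO method */
	int (*block_io)(struct inode *inode, struct buffer *buf, enum map_mode mode);
};

/* Placeholder for the ordinary logically indexed write path; the real
   method batches neighboring blocks into larger transfers. */
int filemap_extent_io(struct inode *inode, struct buffer *buf, enum map_mode mode)
{
	(void)inode; (void)buf; (void)mode;
	return 0;
}

/* Placeholders for helpers assumed by this sketch */
int write_single_block(struct inode *inode, struct buffer *buf)
{
	(void)inode; (void)buf;
	return 0;
}

unsigned current_delta(void) { return 0; }

/* Allocation map IO: reads reuse the generic extent path, writes go one
   block at a time so a forked clone is written directly instead of being
   looked up (wrongly) in the buffer hash. */
int bitmap_block_io(struct inode *inode, struct buffer *buf, enum map_mode mode)
{
	if (mode == MAP_READ)
		return filemap_extent_io(inode, buf, mode);

	/* Safety check: refuse to write data dirtied in the current,
	   not yet committed delta. */
	if (buf->delta == current_delta())
		return -EINVAL;

	return write_single_block(inode, buf);
}

The shape is the whole point: the allocation map inode would point its block_io at bitmap_block_io while ordinary file inodes keep the extent method, so the bitmap write path sidesteps the hash lookup without disturbing the common code.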
