Tux3 now has extent support, for now and evermore. I flipped over the #defines to use extents for more than just unit testing. The tux3 user command now runs with extents and so do both of the Fuse versions. I did not test any of these! Somebody, kindly check and see what breaks.

I did test extents very lightly using the inode.c unit test, which successfully writes and reads back "hello world", and somewhat more exhaustively using the filemap.c unit tests, but this is still very green code. I expect a number of bugs to turn up.

Developing the extent support was by no means a cut and paste job. On the contrary, it was nearly three weeks of slow, grinding work. This is new territory, quite unlike any filesystem code I have written before, and there was precious little guidance out there on the net. The combinatorics are fairly horrifying, as I touched on in an earlier post. There will be a longish post coming soon on the extent machinery and the new api I created for processing and editing extents, which worked out pretty well. For now I will briefly describe how the filemap operations are structured.

Both buffer flush and buffer read are handled by the same function, filemap_extent_io, because much of the code is exactly the same. No sense in letting the common parts drift apart and end up making us fix bugs twice.

The extent io code is driven by the block-at-a-time bread and bwrite interfaces, which is perverse, but it is much the way things work in kernel at the moment. Naturally, we would really like big reads and writes to drive the extent interface directly, but if we choose to go that route we will basically get stuck with the task of re-implementing the whole kernel generic_read/write family of functions. A big job, and then we will be stuck with maintaining it. Maybe we will eventually do that in the interest of yet more performance, but for now we put up with being called one buffer at a time, and while handling the one buffer we check to see if some neighboring buffers can also be included to form extents. This opportunistic extent formation is handled by the guess_extent function.
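To make the idea of opportunistic extent formation concrete, here is a minimal sketch. It is not the actual guess_extent code: the buffer states, the blockrange struct, the guess_extent_sketch name and the limit parameter are all made up for illustration, and the real function works against the buffer cache rather than a flat array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical buffer states; a real buffer cache tracks more than this. */
enum bufstate { BUF_EMPTY, BUF_CLEAN, BUF_DIRTY };

/* A block range [start, start + count), in units of blocks. */
struct blockrange { size_t start, count; };

/*
 * Grow a range around the one block we were asked to transfer by absorbing
 * neighbors that are in the wanted state (BUF_DIRTY when flushing, BUF_EMPTY
 * when reading). Growth stops at the first neighbor in any other state, at
 * either end of the file, or once the range holds `limit` blocks. The
 * requested block itself is always included.
 */
struct blockrange guess_extent_sketch(const enum bufstate *state, size_t nblocks,
                                      size_t index, enum bufstate want, size_t limit)
{
    assert(index < nblocks && limit > 0);
    size_t lo = index, hi = index + 1;

    /* Extend forward over matching neighbors. */
    while (hi < nblocks && hi - lo < limit && state[hi] == want)
        hi++;
    /* Then extend backward with whatever room is left under the limit. */
    while (lo > 0 && hi - lo < limit && state[lo - 1] == want)
        lo--;

    return (struct blockrange){ lo, hi - lo };
}
```

Called for a dirty block sitting in a run of dirty neighbors, this returns the whole run (up to the limit), so a single bwrite call can end up flushing several buffers as one extent.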

Once we have the extent, we probe into the btree, 64 blocks below the beginning of the extent, which ensures that no existing extents that may overlap the io extent are missed. (I just realized there is a bug here: we might have to advance to the next block and there is no code to do that yet.) We walk forward through the leading extents until we get to one that overlaps or begins exactly at the io extent. Then a slightly tricky bit of code walks across the pre-existing extents, taking note of any gaps between them. For write, blocks are allocated to fill the gaps, and for read, the unmapped extents are noted. In either case, all prior extents found and the new extents created to fill the gaps are saved in a "segs" vector.

Up till here, read and write are nearly identical. The next few steps are specific to write. We add the remaining extents all the way to the end of the dleaf block to the segs vector. (This will be inefficient in the general case, but does not really matter for now because the tail of the dleaf will usually be empty. This lazy approach will eventually be fixed, but at the moment it would be a lot of distracting work when there are more pressing issues to address.)

Continuing in the write-specific code, we use the dwalk_mock function to figure out how much additional space will be needed for the new extents. Dwalk_mock works just like dwalk_pack, but only calculates the space that will be required without modifying the leaf. Then, if insufficient space is available, we split the leaf and retry the whole write up to this point. (And I just noticed... repeating the probe, which is unnecessary.)

Continuing on in the write-specific code, we truncate the dleaf at the point just before the io extent and append the new set of extents from the segs list, which for write includes all the extents above the io extent to the end of the leaf.
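The walk across pre-existing extents that fills the segs vector might look roughly like the sketch below. This is an illustration only, not the Tux3 code: struct seg, its hole flag and map_range_sketch are hypothetical names, the existing extents are assumed to arrive as a sorted, non-overlapping array rather than through the dleaf walker, and existing extents are clipped to the io range for simplicity. Gaps are only marked here; a write path would then allocate blocks for them, and a read path would zero fill the corresponding buffers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical segment: `count` blocks at logical block `logical`, stored at
 * `physical`. hole = 1 marks a gap with no mapping yet. */
struct seg { unsigned long logical, physical; unsigned count; int hole; };

/*
 * Cover the io range [start, start + count) with segments. `map` holds the
 * existing extents, sorted by logical address and non-overlapping. Returns
 * the number of segments stored in `segs`, or -1 if `max` is too small.
 */
int map_range_sketch(const struct seg *map, size_t nmap,
                     unsigned long start, unsigned count,
                     struct seg *segs, size_t max)
{
    assert(count > 0);
    unsigned long pos = start, end = start + count;
    size_t n = 0;

    for (size_t i = 0; i < nmap && pos < end; i++) {
        unsigned long ext_end = map[i].logical + map[i].count;
        if (ext_end <= pos)
            continue;               /* leading extent, entirely below the io range */
        if (map[i].logical >= end)
            break;                  /* entirely above the io range */
        if (map[i].logical > pos) { /* gap before this extent */
            if (n == max)
                return -1;
            segs[n++] = (struct seg){ pos, 0, (unsigned)(map[i].logical - pos), 1 };
            pos = map[i].logical;
        }
        /* The part of the existing extent that overlaps the io range. */
        unsigned long stop = ext_end < end ? ext_end : end;
        if (n == max)
            return -1;
        segs[n++] = (struct seg){ pos, map[i].physical + (pos - map[i].logical),
                                  (unsigned)(stop - pos), 0 };
        pos = stop;
    }
    if (pos < end) {                /* trailing gap up to the end of the io range */
        if (n == max)
            return -1;
        segs[n++] = (struct seg){ pos, 0, (unsigned)(end - pos), 1 };
    }
    return (int)n;
}
```

With existing extents at logical blocks 0-3, 10-13 and 16-23, an io range of blocks 8-17 comes back as four segments: a hole at 8-9, the mapped blocks 10-13, a hole at 14-15, and the first two blocks of the extent at 16.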

(Once we get really good at this we will just memmove the tail of the leaf instead of adding it to the segs vector and then packing the segs back in one at a time, though that code will be a little tricky.)

Then we do the actual IO, a loop across the segs list that differs only slightly between read and write, mainly in that buffers mapping to empty segments for read are zero filled here, and of course the io direction differs. In kernel we would be setting up a bio here, which can do the whole job in one transfer. In userspace we would use preadv/pwritev if Linux had them, but Linux does not, so we use our trusty diskread/diskwrite routines instead. (We are not really aiming for performance in this userspace code, just correctness, and code that will be fast in kernel. Still, if preadv/pwritev had been available I would have used them.)

With this new extent code enabled, I broke truncate, which still thinks it is dealing with block pointers. It will do the wrong thing if the truncate lands in the middle of an extent. Probably nobody will notice even with Fuse, because truncate to zero should still work just fine.

So sigh, this is done. Nearly. Extents are nothing more than a performance hack, but Tux3 needs to benchmark well in order to thrive in the filesystem jungle, and it will be helpful if it benchmarks well right from the day it lands in kernel. Plus, I would rather not do the versioning work twice, once for pointers and again for extents. With the arrival of extents we also gained a nice api for dleaf editing, which will be a theme to build on when the even trickier versioning code starts to land.

Speaking of performance, extents actually reduce it for files that are only one block long. We will have to get busy and do some optimization for the one block file case. I think it is quite optimizable, and so we should eventually get it to the point where the overhead vs single block IO is not noticeable.
It will not be a huge difference anyway, but as it stands I think it could be measured and might put us a little behind a non-extent filesystem like Ext3 for some loads, until we fix it. For two block files and larger, the extent code is a clear winner by any measure: cpu, cache or disk space.

Regards,

Daniel

_______________________________________________
Tux3 mailing list
Tux3@tux3.org
http://tux3.org/cgi-bin/mailman/listinfo/tux3