On Fri, Aug 22, 2025 at 06:09:31PM -0700, Konrad Schroder wrote:
> I've been thinking about LFS off and on for a while now, and I'd like to
> run a few of my thoughts by everyone else. Since the last time I looked
> closely at the code base, there have been quite a number of improvements
> by some very good people! but it still has some issues.
>
> 1) The most vexing outstanding issue, in my mind, is the fact that the
> cleaner often cannot improve the amount of available space on disk.
> This is largely due to volatile metadata, in particular index file
> blocks, being written into the segments while cleaning. These blocks
> almost immediately become stale again, leaving the newly compacted
> segment looking as if it needs cleaning again. (When the filesystem is
> empty, this is not a big deal, but when it approaches full it's a
> killer.) The same is true of inode blocks and indirect blocks, though to
> a lesser extent. If the index file could be segregated from the regular
> file data, it would help the situation immensely.
Hmm. I hadn't been aware that's a problem, or rather, that that's the
state it gets into that causes this problem. Doing anything about it is
going to be hard, though.

On the one hand, the idea that the segments are in any kind of order and
that it matters where new segment data gets written is a fiction, and we
could probably clean out the last remnants of pretending it matters. On
the other hand, all of this is going to make recovery a lot harder.

The current LFS code is pretty vague on the concept of filesystem
transactions and already doesn't really handle multiple-metadata
operations ("dirops") well (this is something I had been meaning to work
on). Basically, as things stand you need all the blocks for a single
operation to end up in the same segment so that every segment is
complete; this is handled reasonably well for indirect blocks but is a
mess for anything that touches more than one inode. Adding more
complexity to that tracking without cleaning it out thoroughly first
seems like a bad plan.

Once you start putting pieces of operations in multiple segments, there
has to be enough on-disk structure to keep track of which pairs or groups
of segments are required to be taken or discarded together. If you always
write an ifile segment and a data segment at the same time, and make sure
each has only and exactly the data corresponding to the other one, it's
probably sufficient to add some info to the segment summaries so that if
only one of the segments makes it out you just drop the other (a rough
sketch of what that might look like is below). But even to the extent
that's feasible to implement, it's going to generate bazillions of little
partial ifile segments, and that doesn't seem like a great idea. However,
anything other than a 1-1 correspondence is going to incur a lot of
on-disk complexity that seems like it would require major format changes.

I suppose one could also just entirely drop the ability to roll forward
from a checkpoint, but that also doesn't seem terribly desirable.

For a _separate_ ifile ("Ibis") you'd have to reconstruct the ifile
during roll-forward by scanning each segment. That might be possible (I
forget to what extent the current metadata supports that, but it'd
require at most minor format changes), and with a reasonable checkpoint
frequency it shouldn't be that expensive. However, this scheme does
require writing out the whole ifile twice for every checkpoint, and on
what constitute reasonable-size volumes these days that'd be hundreds of
megabytes. That seems like a stopper. (I don't see any way to update a
fixed ifile on the fly without some way to journal the updates, which we
don't have without major changes.)

And, I never did understand why the ifile is a file of inode locations
instead of a file of inodes. Granted, it lets you write out exactly the
inodes that have changed, and not others that just happen to be
physically next to them. But it seems like there are other ways to
arrange that.
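To make the pairing idea above concrete, here's a very rough sketch of
the kind of tag each segment summary would need to carry. The names are
invented for illustration; this is not the existing SEGSUM layout, just
the minimum information roll-forward would need to take or drop the two
halves of a write together:

#include <stdint.h>

/*
 * Hypothetical pairing tag, one per segment summary.  Both halves of a
 * checkpoint write carry the same transaction id; roll-forward replays
 * a half only if its partner made it to disk with a matching id.
 */
struct pair_info {
	uint64_t pi_txid;	/* shared by the ifile half and the data half */
	uint32_t pi_role;	/* PI_ROLE_DATA or PI_ROLE_IFILE */
	uint32_t pi_partner;	/* segment number holding the other half */
};

#define PI_ROLE_DATA	1
#define PI_ROLE_IFILE	2

/* Roll-forward check: keep both halves or neither. */
int
pair_is_complete(const struct pair_info *a, const struct pair_info *b)
{
	return (a->pi_txid == b->pi_txid && a->pi_role != b->pi_role);
}

Anything beyond the strict 1-1 case would need a list of partners rather
than a single segment number, which is where the on-disk complexity (and
the format changes) start.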
> 2) Connecting dirty pages directly to buffer headers when writing might
> be resulting in incorrect partial-segment checksums. I can't be sure
> that that is the cause, but the checksums are definitely sometimes
> incorrect even when the segments were written (for all I can tell)
> properly. This would interfere with roll-forward, but more importantly,
> if the cleaner is paying attention to the checksums as it ought, then
> those segments might become uncleanable. Before UBC, lfs_writeseg()
> freed data buffers by copying their data into larger, pre-reserved
> buffers before checksumming the lot and sending it to disk. This also
> frees up the buffers/pages very quickly compared to waiting for the
> disk, though of course at the expense of CPU and reserved memory.

I have no idea about this.

> 3) Roll-forward and some form of cleaning should be moved in-kernel. I
> already have code for in-kernel roll forward past the second checkpoint
> that I need to dust off, test and commit. Cleaning is trickier because
> an in-kernel cleaner would be less flexible, but the basic cleaning and
> defragmenting functionality should be there.

As I've said before, I think the part of the cleaner that cleans a
segment should be in the kernel, for both robustness and performance
reasons. The part that decides which segments to clean when can and
should stay in userland.

> There has been quite a lot of work on LFS in the last 20 years, some
> with hints of a roadmap. Does anyone else have specific ideas about the
> most glaring issues, or what should be done next?

I had some but it's going to take a while to page them in...

-- 
David A. Holland
dholl...@netbsd.org