On Tue, Dec 04, 2012 at 09:59:46AM +0300, Alan Barrett wrote:
 > >the genfs code also never writes clean pages to disk, even though for
 > >RAID5 storage it would likely be more efficient to write clean pages
 > >that are in the same stripe as dirty pages if that would avoid issuing
 > >partial-stripe writes. (which is basically another way of saying
 > >what david said.)
 >
 > Perhaps there should be a way for block devices to report at least three
 > block sizes:
 >
 > a) smallest possible block size (512 for almost all disks)
 >
 > b) smallest efficient block size and alignment (4k for modern disks,
 >    stripe size for raid)
 >
 > c) largest possible size (a device and bus-dependent variant of MAXPHYS)
 >
 > Then the file system could use (b) to know when it's a good idea to
 > combine dirty and clean pages into the same write.
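To make the quoted proposal concrete, here is a rough sketch in C of what
reporting the three sizes and using (b) might look like. Everything in it
(struct blk_sizes, its field names, widen_to_efficient) is made up for
illustration and is not an existing NetBSD interface. It only shows the
arithmetic of rounding a dirty byte range out to (b) boundaries while
staying within (c).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: the three sizes from the quoted proposal, in bytes. */
struct blk_sizes {
	uint32_t bs_min;	/* (a) smallest possible block size */
	uint32_t bs_eff;	/* (b) smallest efficient size and alignment */
	uint32_t bs_max;	/* (c) largest possible transfer size */
};

/*
 * Widen the dirty byte range [*start, *end) outward to bs_eff
 * boundaries, so that the write covers whole "efficient" units
 * (e.g. whole stripes) instead of parts of them.
 *
 * Returns 1 and updates the range if the widened range fits in one
 * transfer of at most bs_max bytes.  Returns 0 and leaves the range
 * untouched otherwise, or if the range is empty.
 */
static int
widen_to_efficient(const struct blk_sizes *bs, uint64_t *start, uint64_t *end)
{
	uint64_t s, e;

	/* Assumed sanity condition on what the device reports. */
	assert(bs->bs_eff > 0 && bs->bs_eff <= bs->bs_max);

	if (*end <= *start)
		return 0;

	s = *start - (*start % bs->bs_eff);	/* round start down */
	e = *end + (bs->bs_eff - 1);		/* round end up */
	e -= e % bs->bs_eff;

	if (e - s > bs->bs_max)
		return 0;

	*start = s;
	*end = e;
	return 1;
}
```

With a 64k stripe as (b) and a 1M limit as (c), a single dirty 4k page at
offset 68k would be widened to the range 64k..128k. The caller would still
need the clean pages in that range to be resident in memory, which this
sketch does not check.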
As I was saying in the other thread, what filesystems really want to
know is the atomic write size. E.g. in ffs this affects the way
directories are laid out and is necessary (AFAIK including with wapbl)
for ~safe operation. This is not (a), and as far as I know it is also
not (b); see below.

I don't see (a) as useful. It is conceivable that a journaled FS might
want to know about it to allow packing journal records as tightly as
possible, but doing so is rather dubious from a recovery POV: the point
of flushing a journal is to get it physically onto disk safely, and if
you later let the disk rewrite part of what you thought was safely on
disk, it might cease to be safely on disk and break your recovery
scheme.

What guarantees do we actually get in practice for RAID5? Do you have
to commit journals in units of a whole stripe or stripe group to avoid
having them rewritten unsafely later? Or is the parity logging code
sufficient to make that safe? This matters for wapbl...

-- 
David A. Holland
[email protected]
