Re: [RFC][PATCH] block: Isolate the buffer cache in its own mappings.
On Sun, Oct 21, 2007 at 02:24:46PM +1000, Nick Piggin wrote:
> On Saturday 20 October 2007 07:27, Eric W. Biederman wrote:
> > Currently only
> > metadata is more or less in sync with the contents of /dev/hda1.
>
> It either is or it isn't, right? And it is, isn't it? (at least
> for the common filesystems).

It is not true for XFS - its metadata is not in sync with /dev/ at all,
as all the cached metadata is kept in a different address space to the
raw block device.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH] block: Isolate the buffer cache in its own mappings.
Nick Piggin <[EMAIL PROTECTED]> writes:

> On Sunday 21 October 2007 14:53, Eric W. Biederman wrote:
>> Nick Piggin <[EMAIL PROTECTED]> writes:
>> > On Saturday 20 October 2007 07:27, Eric W. Biederman wrote:
>> >> Andrew Morton <[EMAIL PROTECTED]> writes:
>> >> > I don't think we little angels want to tread here. There are so many
>> >> > weirdo things out there which will break if we bust the coherence
>> >> > between the fs and /dev/hda1.
>> >>
>> >> We broke coherence between the fs and /dev/hda1 when we introduced
>> >> the page cache years ago,
>> >
>> > Not for metadata. And I wouldn't expect many filesystem analysis
>> > tools to care about data.
>>
>> Well tools like dump certainly weren't happy when we made the change.
>
> Doesn't that give you any suspicion that other tools mightn't
> be happy if we make this change, then?

I read a representative sample of the relevant tools before replying
to Andrew.

>> >> and weird hacky cases like
>> >> unmap_underlying_metadata don't change that.
>> >
>> > unmap_underlying_metadata isn't about raw block device access at
>> > all, though (if you write to the filesystem via the blockdevice
>> > when it isn't expecting it, it's going to blow up regardless).
>>
>> Well my goal with separating things is so that we could decouple two
>> pieces of code that have different usage scenarios, and where
>> supporting both scenarios simultaneously appears to me to needlessly
>> complicate the code.
>>
>> Added to that we could then tune the two pieces of code for their
>> different users.
>
> I don't see too much complication from it. If we can actually
> simplify things or make useful tuning, maybe it will be worth
> doing.

That was my feeling: we could simplify things. The block layer page
cache operations certainly. I know that the filesystems that use the
buffer cache, like reiserfs and JBD, could stop worrying about the
buffers becoming mysteriously dirty.

Beyond that I think there is a lot of opportunity; I just haven't
looked much yet.

>> >> Currently only
>> >> metadata is more or less in sync with the contents of /dev/hda1.
>> >
>> > It either is or it isn't, right? And it is, isn't it? (at least
>> > for the common filesystems).
>>
>> ext2 doesn't store directories in the buffer cache.
>
> Oh that's what you mean. OK, agreed there. But for the filesystems
> and types of metadata that can now expect to have coherency, doing
> this will break that expectation.
>
> Again, I have no opinions either way on whether we should do that
> in the long run. But doing it as a kneejerk response to braindead
> rd.c code is wrong because of what *might* go wrong and we don't
> know about.

The rd.c code is perfectly valid if someone isn't forcing buffer
heads on its pages. It is a conflict of expectations. Regardless, I
didn't do it as a kneejerk, and I don't think that patch should be
merged at this time. I proposed it because, as I see it, it starts
untangling the mess that is the buffer cache. rd.c was just my entry
point into understanding how all of those pieces work. I was doing
my best to completely explore my options and what the code was doing
before settling on the fix for rd.c.

>> Journaling filesystems and filesystems that do ordered writes
>> game the buffer cache, putting in data that should not yet
>> be written to disk. That gaming is where reiserfs goes BUG
>> and where JBD moves the dirty bit to a different dirty bit.
>
> Filesystems really want better control of writeback, I think.
> This isn't really a consequence of the unified blockdev pagecache
> / metadata buffer cache, it is just that most of the important
> things they do are with metadata.

Yes.

> If they have their own metadata inode, then they'll need to game
> the cache for it, or the writeback code for that inode somehow
> too.

Yes. Although they will at least get the guarantee that no one else
is dirtying their pages at strange times.

>> So as far as I can tell what is in the buffer cache is not really
>> in sync with what should be on disk at any given moment except
>> when everything is clean.
>
> Naturally. It is a writeback cache.

Not that so much as the order in which things go into the cache does
not match the order the blocks go to disk.

>> My suspicion is that actually reading from disk is likely to
>> give a more coherent view of things. Because there at least
>> we have the writes as they are expected to be seen by fsck
>> to recover the data, and a snapshot there should at least
>> be recoverable. Whereas a snapshot provides no such guarantees.
>
> ext3 fsck I don't think is supposed to be run under a read/write
> filesystem, so it's going to explode if you do that regardless.

Yes. I was thinking of dump or something like that here, where we
simply read out the data and try to make some coherent sense of it.
If we see a version of the metadata that points to things that have
not been finished yet, or that is in the process of being written to,
that could be a problem.
Re: [RFC][PATCH] block: Isolate the buffer cache in its own mappings.
On Sunday 21 October 2007 14:53, Eric W. Biederman wrote:
> Nick Piggin <[EMAIL PROTECTED]> writes:
> > On Saturday 20 October 2007 07:27, Eric W. Biederman wrote:
> >> Andrew Morton <[EMAIL PROTECTED]> writes:
> >> > I don't think we little angels want to tread here. There are so many
> >> > weirdo things out there which will break if we bust the coherence
> >> > between the fs and /dev/hda1.
> >>
> >> We broke coherence between the fs and /dev/hda1 when we introduced
> >> the page cache years ago,
> >
> > Not for metadata. And I wouldn't expect many filesystem analysis
> > tools to care about data.
>
> Well tools like dump certainly weren't happy when we made the change.

Doesn't that give you any suspicion that other tools mightn't
be happy if we make this change, then?

> >> and weird hacky cases like
> >> unmap_underlying_metadata don't change that.
> >
> > unmap_underlying_metadata isn't about raw block device access at
> > all, though (if you write to the filesystem via the blockdevice
> > when it isn't expecting it, it's going to blow up regardless).
>
> Well my goal with separating things is so that we could decouple two
> pieces of code that have different usage scenarios, and where
> supporting both scenarios simultaneously appears to me to needlessly
> complicate the code.
>
> Added to that we could then tune the two pieces of code for their
> different users.

I don't see too much complication from it. If we can actually
simplify things or make useful tuning, maybe it will be worth
doing.

> >> Currently only
> >> metadata is more or less in sync with the contents of /dev/hda1.
> >
> > It either is or it isn't, right? And it is, isn't it? (at least
> > for the common filesystems).
>
> ext2 doesn't store directories in the buffer cache.

Oh, that's what you mean. OK, agreed there. But for the filesystems
and types of metadata that can now expect to have coherency, doing
this will break that expectation.

Again, I have no opinions either way on whether we should do that
in the long run. But doing it as a kneejerk response to braindead
rd.c code is wrong because of what *might* go wrong that we don't
know about.

> Journaling filesystems and filesystems that do ordered writes
> game the buffer cache, putting in data that should not yet
> be written to disk. That gaming is where reiserfs goes BUG
> and where JBD moves the dirty bit to a different dirty bit.

Filesystems really want better control of writeback, I think.
This isn't really a consequence of the unified blockdev pagecache
/ metadata buffer cache; it is just that most of the important
things they do are with metadata.

If they have their own metadata inode, then they'll need to game
the cache for it, or the writeback code for that inode, somehow
too.

> So as far as I can tell what is in the buffer cache is not really
> in sync with what should be on disk at any given moment except
> when everything is clean.

Naturally. It is a writeback cache.

> My suspicion is that actually reading from disk is likely to
> give a more coherent view of things. Because there at least
> we have the writes as they are expected to be seen by fsck
> to recover the data, and a snapshot there should at least
> be recoverable. Whereas a snapshot provides no such guarantees.

ext3 fsck I don't think is supposed to be run under a read/write
filesystem, so it's going to explode if you do that regardless.
Re: [RFC][PATCH] block: Isolate the buffer cache in its own mappings.
Nick Piggin <[EMAIL PROTECTED]> writes:

> On Saturday 20 October 2007 07:27, Eric W. Biederman wrote:
>> Andrew Morton <[EMAIL PROTECTED]> writes:
>> > I don't think we little angels want to tread here. There are so many
>> > weirdo things out there which will break if we bust the coherence between
>> > the fs and /dev/hda1.
>>
>> We broke coherence between the fs and /dev/hda1 when we introduced
>> the page cache years ago,
>
> Not for metadata. And I wouldn't expect many filesystem analysis
> tools to care about data.

Well tools like dump certainly weren't happy when we made the change.

>> and weird hacky cases like
>> unmap_underlying_metadata don't change that.
>
> unmap_underlying_metadata isn't about raw block device access at
> all, though (if you write to the filesystem via the blockdevice
> when it isn't expecting it, it's going to blow up regardless).

Well my goal with separating things is so that we could decouple two
pieces of code that have different usage scenarios, and where
supporting both scenarios simultaneously appears to me to needlessly
complicate the code.

Added to that we could then tune the two pieces of code for their
different users.

>> Currently only
>> metadata is more or less in sync with the contents of /dev/hda1.
>
> It either is or it isn't, right? And it is, isn't it? (at least
> for the common filesystems).

ext2 doesn't store directories in the buffer cache.

Journaling filesystems and filesystems that do ordered writes
game the buffer cache, putting in data that should not yet
be written to disk. That gaming is where reiserfs goes BUG
and where JBD moves the dirty bit to a different dirty bit.

So as far as I can tell what is in the buffer cache is not really
in sync with what should be on disk at any given moment except
when everything is clean.

My suspicion is that actually reading from disk is likely to
give a more coherent view of things. Because there at least
we have the writes as they are expected to be seen by fsck
to recover the data, and a snapshot there should at least
be recoverable. Whereas a snapshot provides no such guarantees.

Eric
Re: [RFC][PATCH] block: Isolate the buffer cache in its own mappings.
On Saturday 20 October 2007 07:27, Eric W. Biederman wrote:
> Andrew Morton <[EMAIL PROTECTED]> writes:
> > I don't think we little angels want to tread here. There are so many
> > weirdo things out there which will break if we bust the coherence between
> > the fs and /dev/hda1.
>
> We broke coherence between the fs and /dev/hda1 when we introduced
> the page cache years ago,

Not for metadata. And I wouldn't expect many filesystem analysis
tools to care about data.

> and weird hacky cases like
> unmap_underlying_metadata don't change that.

unmap_underlying_metadata isn't about raw block device access at
all, though (if you write to the filesystem via the blockdevice
when it isn't expecting it, it's going to blow up regardless).

> Currently only
> metadata is more or less in sync with the contents of /dev/hda1.

It either is or it isn't, right? And it is, isn't it? (at least
for the common filesystems).
Re: [RFC][PATCH] block: Isolate the buffer cache in its own mappings.
Nick Piggin <[EMAIL PROTECTED]> writes:
>
> [*] The ramdisk code is simply buggy, right? (and not the buffer
> cache)

From the perspective of the ramdisk, it expects the buffer cache to
simply be a user of the page cache, and thus the buffer cache is
horribly buggy.

From the perspective of the buffer cache, it is still the block
device cache in the kernel, the way it works defines the rules for
how caching should be done in the kernel, and it doesn't give any
mind to this newfangled page cache thingy.

> The idea of your patch in theory is OK, but Andrew raises valid
> points about potential coherency problems, I think.

There are certainly implementation issues in various filesystems to
overcome before remounting read-write after doing a fsck on a
read-only filesystem will work as it does today. So my patch is
incomplete.

Eric
Re: [RFC][PATCH] block: Isolate the buffer cache in its own mappings.
Andrew Morton <[EMAIL PROTECTED]> writes:

> I don't think we little angels want to tread here. There are so many
> weirdo things out there which will break if we bust the coherence between
> the fs and /dev/hda1.

We broke coherence between the fs and /dev/hda1 when we introduced
the page cache years ago, and weird hacky cases like
unmap_underlying_metadata don't change that. Currently only
metadata is more or less in sync with the contents of /dev/hda1.

> Online resize, online fs checkers, various local
> tools which people have hacked up to look at metadata in a live fs,
> direct-io access to the underlying fs, heaven knows how many boot loader
> installers, etc. Certainly I couldn't enumerate them all.

Well I took a look at ext3. For online resize all of the writes are
done by the fs, not by the user space tool. For e2fsck of a read-only
filesystem, currently we do cache the buffers for the super block and
reexamine those blocks when we remount read-write. Which makes my
patch by itself unsafe. If, however, ext3 and anyone else who does
things like that were to reread the data, rather than merely
reexamine it, we should be fine.

Fundamentally doing anything like this requires some form of
synchronization, and if that synchronization does not exist today
there will be bugs. Further decoupling things only makes that
requirement clearer. Unfortunately, because of things like the ext3
handling of remounting from ro to rw, this doesn't fall into the
quick trivial fix category :(

> I don't actually see what the conceptual problem is with the existing
> implementation. The buffer_head is a finer-grained view onto the
> blockdev's pagecache: it provides additional states and additional locking
> against a finer-grained section of the page. It works well.

The buffer_head itself seems to be a reasonable entity. The buffer
cache is a monster. It does not follow the ordinary rules of the page
cache, making it extremely hard to reason about.

Currently in the buffer cache there are buffer_heads we are not
allowed to make dirty which hold dirty data. Some filesystems panic
the kernel when they notice this. Others, like ext3, use a different
bit to remember that the buffer is dirty.

Because of ordering considerations the buffer cache does not hold a
consistent view of what has been scheduled for being written to
disk. It instead holds partially complete pages.

The only place we should ever clear the dirty bit is just before
calling writepage, but try_to_free_buffers clears the dirty bit!

We have buffers on pages without a mapping!

In general the buffer cache violates a primary rule of comprehensible
programming: it does not have a clear enough definition that it is
clear what things are bugs and what things are features. 99% of the
weird strange behavior in rd.c is because of the buffer cache not
following the normal rules.

> Yeah, the highmem thing is a bit of a problem (but waning in importance).
> But we can fix that by teaching individual filesystems about kmap and then
> tweak the blockdev's caching policy with mapping_set_gfp_mask() at mount
> time. If anyone cares, which they don't.

This presumes I want to use a filesystem on my block device. Where I
would care most is when I am doing things like fsck or mkfs on an
unmounted filesystem, where having buffer_heads is just extra memory
pressure slowing things down, and similarly for highmem.

We have to sync the filesystem before mounting, but we have to do
that anyway for all of the non-metadata, so that isn't new.

Anyway my main objective was to get a good grasp on the buffer cache
and the mm layer again, which I now more or less have. While I think
the buffer cache needs a bunch of tender loving care before it
becomes sane, I have other projects that I intend to complete before
I try anything in this area.

Eric
Re: [RFC][PATCH] block: Isolate the buffer cache in its own mappings.
On Thursday 18 October 2007 13:59, Eric W. Biederman wrote:
> If filesystems care at all they want absolute control over the buffer
> cache, controlling which buffers are dirty and when. Because we
> keep the buffer cache in the page cache for the block device we have
> not quite been giving filesystems that control, leading to really weird
> bugs.

Mmm. Like I said, when a live filesystem is mounted on a bdev, it
isn't like you want userspace to go dancing around on it without
knowing exactly what it is doing. The kernel more or less does the
right thing here with respect to the *state* of the data[*] (that is,
buffer heads and pagecache). It's when you actually start changing
the data itself around that you'll blow up the filesystem.

[*] The ramdisk code is simply buggy, right? (and not the buffer
cache)

The idea of your patch in theory is OK, but Andrew raises valid
points about potential coherency problems, I think.
Re: [RFC][PATCH] block: Isolate the buffer cache in its own mappings.
On Wed, 17 Oct 2007 21:59:02 -0600 [EMAIL PROTECTED] (Eric W. Biederman) wrote:

> If filesystems care at all they want absolute control over the buffer
> cache, controlling which buffers are dirty and when. Because we
> keep the buffer cache in the page cache for the block device we have
> not quite been giving filesystems that control, leading to really weird
> bugs.
>
> In addition this tying of the implementation of block device caching
> and the buffer cache has resulted in a much more complicated and
> limited implementation than necessary. Block devices for example
> don't need buffer_heads, and it is perfectly reasonable to cache
> block devices in high memory.
>
> To start untangling the worst of this mess this patch introduces a
> second block device inode for the buffer cache. All buffer cache
> operations are diverted to use the new bd_metadata_inode, which
> keeps the weirdness of the metadata requirements isolated in its
> own little world.

I don't think we little angels want to tread here. There are so many
weirdo things out there which will break if we bust the coherence
between the fs and /dev/hda1. Online resize, online fs checkers,
various local tools which people have hacked up to look at metadata
in a live fs, direct-io access to the underlying fs, heaven knows how
many boot loader installers, etc. Certainly I couldn't enumerate them
all. The mere thought of all this scares the crap out of me.

I don't actually see what the conceptual problem is with the existing
implementation. The buffer_head is a finer-grained view onto the
blockdev's pagecache: it provides additional states and additional
locking against a finer-grained section of the page. It works well.

Yeah, the highmem thing is a bit of a problem (but waning in
importance). But we can fix that by teaching individual filesystems
about kmap and then tweak the blockdev's caching policy with
mapping_set_gfp_mask() at mount time. If anyone cares, which they
don't.
[RFC][PATCH] block: Isolate the buffer cache in its own mappings.
If filesystems care at all they want absolute control over the buffer
cache, controlling which buffers are dirty and when. Because we keep
the buffer cache in the page cache for the block device we have not
quite been giving filesystems that control, leading to really weird
bugs.

In addition this tying of the implementation of block device caching
and the buffer cache has resulted in a much more complicated and
limited implementation than necessary. Block devices for example
don't need buffer_heads, and it is perfectly reasonable to cache
block devices in high memory.

To start untangling the worst of this mess this patch introduces a
second block device inode for the buffer cache. All buffer cache
operations are diverted to use the new bd_metadata_inode, which keeps
the weirdness of the metadata requirements isolated in its own little
world.

This should enable future cleanups to diverge and simplify the
address_space_operations of the buffer cache and block device page
cache.

As a side effect of this cleanup the current ramdisk code should be
safe from dropping pages, because we never place any buffer heads on
ramdisk pages.

Signed-off-by: Eric W. Biederman <[EMAIL PROTECTED]>
---
 fs/block_dev.c     |   45 -
 fs/buffer.c        |   17 -
 fs/ext3/dir.c      |    2 +-
 fs/ext4/dir.c      |    2 +-
 fs/fat/inode.c     |    2 +-
 include/linux/fs.h |    1 +
 6 files changed, 48 insertions(+), 21 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 379a446..87a5760 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -59,10 +59,12 @@ static sector_t max_block(struct block_device *bdev)
 /* Kill _all_ buffers and pagecache, dirty or not..
  */
 static void kill_bdev(struct block_device *bdev)
 {
-	if (bdev->bd_inode->i_mapping->nrpages == 0)
-		return;
-	invalidate_bh_lrus();
-	truncate_inode_pages(bdev->bd_inode->i_mapping, 0);
+	if (bdev->bd_inode->i_mapping->nrpages)
+		truncate_inode_pages(bdev->bd_inode->i_mapping, 0);
+	if (bdev->bd_metadata_inode->i_mapping->nrpages) {
+		truncate_inode_pages(bdev->bd_metadata_inode->i_mapping, 0);
+		invalidate_bh_lrus();
+	}
 }
 
 int set_blocksize(struct block_device *bdev, int size)
@@ -80,6 +82,7 @@ int set_blocksize(struct block_device *bdev, int size)
 		sync_blockdev(bdev);
 		bdev->bd_block_size = size;
 		bdev->bd_inode->i_blkbits = blksize_bits(size);
+		bdev->bd_metadata_inode->i_blkbits = blksize_bits(size);
 		kill_bdev(bdev);
 	}
 	return 0;
@@ -114,7 +117,7 @@ static int
 blkdev_get_block(struct inode *inode, sector_t iblock,
 		struct buffer_head *bh, int create)
 {
-	if (iblock >= max_block(I_BDEV(inode))) {
+	if (iblock >= max_block(inode->i_bdev)) {
 		if (create)
 			return -EIO;
@@ -126,7 +129,7 @@
 		 */
 		return 0;
 	}
-	bh->b_bdev = I_BDEV(inode);
+	bh->b_bdev = inode->i_bdev;
 	bh->b_blocknr = iblock;
 	set_buffer_mapped(bh);
 	return 0;
@@ -136,7 +139,7 @@ static int
 blkdev_get_blocks(struct inode *inode, sector_t iblock,
 		struct buffer_head *bh, int create)
 {
-	sector_t end_block = max_block(I_BDEV(inode));
+	sector_t end_block = max_block(inode->i_bdev);
 	unsigned long max_blocks = bh->b_size >> inode->i_blkbits;
 
 	if ((iblock + max_blocks) > end_block) {
@@ -152,7 +155,7 @@
 		}
 	}
 
-	bh->b_bdev = I_BDEV(inode);
+	bh->b_bdev = inode->i_bdev;
 	bh->b_blocknr = iblock;
 	bh->b_size = max_blocks << inode->i_blkbits;
 	if (max_blocks)
@@ -167,7 +170,7 @@
 blkdev_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file->f_mapping->host;
 
-	return blockdev_direct_IO_no_locking(rw, iocb, inode, I_BDEV(inode),
+	return blockdev_direct_IO_no_locking(rw, iocb, inode, inode->i_bdev,
 				iov, offset, nr_segs, blkdev_get_blocks, NULL);
 }
@@ -244,7 +247,7 @@ blkdev_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
 			loff_t pos, unsigned long nr_segs)
 {
 	struct inode *inode = iocb->ki_filp->f_mapping->host;
-	unsigned blkbits = blksize_bits(bdev_hardsect_size(I_BDEV(inode)));
+	unsigned blkbits = blksize_bits(bdev_hardsect_size(inode->i_bdev));
 	unsigned blocksize_mask = (1 << blkbits) - 1;
 	unsigned long seg = 0;	/* iov segment iterator */
 	unsigned long nvec;	/* number of bio vec needed */
@@ -292,7 +295,7 @@ blkdev_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,