Re: [RFC] basic delayed allocation in VFS
On Sun, 2007-07-29 at 20:24 +0100, Christoph Hellwig wrote:
> On Sun, Jul 29, 2007 at 11:30:36AM -0600, Andreas Dilger wrote:
> > Sigh, we HAVE a patch that was only adding delalloc to ext4, but it was rejected because that functionality should go into the VFS. Since the performance improvement of delalloc is quite large, we'd like to get this into the kernel one way or another. Can we make a decision if the ext4-specific delalloc is acceptable?
>
> I'm a big proponent of having proper common delalloc code, but the one proposed here is not generic for the existing filesystem using delalloc.

To be fair, what Alex has so far is probably good enough for ext2/3 delayed allocation.

> It's still on my todo list to revamp the xfs code to get rid of some of the existing mess and make it usable generically. If the ext4 users are fine with the end result we could move to generic code.

Are you okay with having an ext4 delayed allocation implementation (i.e. moving the code proposed in this thread to fs/ext4) first? Then later, when you come up with a generic delayed allocation for both ext4 and xfs, we could make use of that generic implementation. Is that an acceptable approach?

Andrew, what do you think?

Regards,
Mingming

- To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] basic delayed allocation in VFS
On Jul 28, 2007 20:51 +0100, Christoph Hellwig wrote:
> That doesn't mean I want to argue against Alex's code although I'd of course be more happy if we could actually share code between multiple filesystems. Of course the code in its current form should not go into mpage.c but rather into ext4 so that it doesn't bloat the kernel for everyone.

Sigh, we HAVE a patch that was only adding delalloc to ext4, but it was rejected because that functionality should go into the VFS. Since the performance improvement of delalloc is quite large, we'd like to get this into the kernel one way or another. Can we make a decision if the ext4-specific delalloc is acceptable?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Re: [RFC] basic delayed allocation in VFS
Andreas Dilger wrote:
> Sigh, we HAVE a patch that was only adding delalloc to ext4, but it was rejected because that functionality should go into the VFS. Since the performance improvement of delalloc is quite large, we'd like to get this into the kernel one way or another. Can we make a decision if the ext4-specific delalloc is acceptable?

I think the latter one is better because it supports bs < pagesize (though I'm not sure about data=ordered yet). I'm not against putting most of the patch into fs/ext4/, but at least a few bits would have to be changed in fs/: the exports in fs/mpage.c and one "if" in __block_write_full_page().

thanks, Alex
Re: [RFC] basic delayed allocation in VFS
On Sun, Jul 29, 2007 at 09:48:10PM +0400, Alex Tomas wrote:
> I think the latter one is better because it supports bs < pagesize (though I'm not sure about data=ordered yet). I'm not against putting most of the patch into fs/ext4/, but at least a few bits would have to be changed in fs/: the exports in fs/mpage.c and one "if" in __block_write_full_page().

The change to __block_write_full_page() is obviously fine, and exporting the mpage.c bits sounds fine to me as well, although I'd like to take a look at the final patch.
Re: [RFC] basic delayed allocation in VFS
On Sun, Jul 29, 2007 at 11:30:36AM -0600, Andreas Dilger wrote:
> Sigh, we HAVE a patch that was only adding delalloc to ext4, but it was rejected because that functionality should go into the VFS. Since the performance improvement of delalloc is quite large, we'd like to get this into the kernel one way or another. Can we make a decision if the ext4-specific delalloc is acceptable?

I'm a big proponent of having proper common delalloc code, but the one proposed here is not generic for the existing filesystem using delalloc. It's still on my todo list to revamp the xfs code to get rid of some of the existing mess and make it usable generically. If the ext4 users are fine with the end result we could move to generic code.

Note that moving to VFS is bullshit either way, writeback code is nowhere near the VFS nor should it be.
Re: [RFC] basic delayed allocation in VFS
I'm a bit worried about one thing ... it looks like XFS and ext4 use different techniques to order data and the metadata referencing it. Now I'm not that optimistic that we can separate ordering from delalloc itself in a clean and reasonable way.

In general, I'd prefer common code in fs/ (mm/?) of course, for a number of reasons.

thanks, Alex

Christoph Hellwig wrote:
> I'm a big proponent of having proper common delalloc code, but the one proposed here is not generic for the existing filesystem using delalloc. It's still on my todo list to revamp the xfs code to get rid of some of the existing mess and make it usable generically. If the ext4 users are fine with the end result we could move to generic code.
>
> Note that moving to VFS is bullshit either way, writeback code is nowhere near the VFS nor should it be.
Re: [RFC] basic delayed allocation in VFS
On Sun, Jul 29, 2007 at 08:24:37PM +0100, Christoph Hellwig wrote:
> I'm a big proponent of having proper common delalloc code, but the one proposed here is not generic for the existing filesystem using delalloc. It's still on my todo list to revamp the xfs code to get rid of some of the existing mess and make it usable generically. If the ext4 users are fine with the end result we could move to generic code.

Do you think it would be faster for you to revamp the code, or to give instructions about how you'd like to clean up the code and what has to be preserved in order to keep XFS happy, so someone else could give it a try? Or do you think the code is too grotty and/or tricky for someone else to attempt this?

> Note that moving to VFS is bullshit either way, writeback code is nowhere near the VFS nor should it be.

Agreed. I would think that something like mm/delayed_alloc.c would be preferable. Ideally it would be like the filemap.c code, where it would be relatively easy for most standard filesystems to hook into it and get the advantages of delayed allocation. (Although granted it will probably require more effort on the part of a filesystem author than filemap!)

- Ted
Re: [RFC] basic delayed allocation in VFS
On Sun, Jul 29, 2007 at 04:09:20PM +0400, Alex Tomas wrote:
> David Chinner wrote:
> > But this is really irrelevant - the issue at hand is what we want for VFS level delalloc support. IMO, that mechanism needs to support both XFS and ext4, and I'd prefer if it doesn't perpetuate the bufferhead abuses of the past (i.e. define an iomap structure instead of overloading bufferheads yet again).
>
> I'm not sure I understand very well.

->get_blocks abuses bufferheads to provide an offset/length/state mapping. That's all it needs. That's what the iomap structure is used for. It's smaller than a bufferhead, it's descriptive of its use and you don't get it confused with the other 10 ways bufferheads are used and abused.

> where would you track uptodate, dirty and other states then? do you propose to separate block states from block mapping?

No. They still get tracked in the bufferheads attached to the page. That's what bufferheads were originally intended for(*).

Cheers, Dave.

(*) I recently proposed a separate block map tree for this rather than using buffer heads, because of the memory footprint of N bufferheads per page on contiguous mappings. That's future work, not something we really need to consider here. Chris Mason's extent map tree patches are a start on this concept.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
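The distinction drawn above, one offset/length/state record for a whole range versus per-block state, can be illustrated with a toy userspace model. This is only a sketch: `ExtentMap`, `MapState` and `per_block_records` are hypothetical names invented for the illustration and are not the XFS iomap structure or any kernel API.

```python
from dataclasses import dataclass
from enum import Enum


class MapState(Enum):
    HOLE = "hole"
    DELALLOC = "delalloc"    # space reserved, no disk blocks assigned yet
    UNWRITTEN = "unwritten"  # blocks assigned, contents not valid yet
    MAPPED = "mapped"        # blocks assigned and valid


@dataclass(frozen=True)
class ExtentMap:
    """Hypothetical iomap-like record: one entry describes a whole range."""
    offset: int      # first file block covered
    length: int      # number of blocks covered
    state: MapState

    def covers(self, block: int) -> bool:
        return self.offset <= block < self.offset + self.length


def per_block_records(extent: ExtentMap) -> list[tuple[int, MapState]]:
    """Expand the same mapping into one record per block, for comparison."""
    end = extent.offset + extent.length
    return [(block, extent.state) for block in range(extent.offset, end)]


extent = ExtentMap(offset=1024, length=256, state=MapState.DELALLOC)
records = per_block_records(extent)
```

A single `ExtentMap` answers "what is the mapping of block N" for the whole range, while `records` needs one entry per block to carry the same information. Per-block state such as uptodate or dirty is deliberately left out of the model, matching the reply that those states stay in the bufferheads attached to the page.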
Re: [RFC] basic delayed allocation in VFS
On Fri, Jul 27, 2007 at 03:07:14PM +1000, David Chinner wrote:
> > It duplicates fs/mpage.c in bio building and introduces new generic API (iomap, map_blocks_t, etc).
>
> Using a new API for new functionality is a bad thing?

Depends on what you do. This patch is just a quick hack to shoe-horn delalloc support into ext4. Introducing a new abstraction is overkill. If we really want an overhaul of the writeback path that's extent-aware and efficient for delalloc and unwritten extents, introducing a proper iomap-like data structure would make sense.

That being said, I personally hate the buffer_head abuse for bmap data that we have in various places, as it's utterly confusing and wastes stack space, but that's a different discussion.
Re: [RFC] basic delayed allocation in VFS
On Fri, Jul 27, 2007 at 11:51:56AM +0400, Alex Tomas wrote:
> > Secondly, apart from delalloc, XFS cannot use the generic code paths for writeback because unwritten extent conversion also requires custom I/O completion handlers. Given that __mpage_writepage() only calls ->writepage when it is confused, XFS simply cannot use this API.
>
> this doesn't mean fs/mpage.c should go, right?

The mpage.c read side is fine for every block based filesystem I know. The mpage.c write side is fine for every simple (non-delalloc, non-unwritten extent, etc) filesystem. So it surely shouldn't go.

> I didn't say generic, see Subject: :)

Then it shouldn't be in generic code.
Re: [RFC] basic delayed allocation in VFS
Christoph Hellwig wrote:
> This is not based on my attempt to make the xfs writeout path generic. Alex's variant is a lot simpler and thus missed various bits required for high sustained writeout performance or xfs functionality.

I'd very much appreciate any details about high writeout performance.

> That doesn't mean I want to argue against Alex's code although I'd of course be more happy if we could actually share code between multiple filesystems.

I'm not against it at all, of course. But the xfs writeout code looks .. hmm .. very xfs :)

thanks, Alex
Re: [RFC] basic delayed allocation in VFS
David Chinner wrote:
> Using a new API for new functionality is a bad thing?

If the existing API can be used ...

> No, it doesn't provide the same functionality. Firstly, XFS attaches a different I/O completion to delalloc writes to allow us to update the file size when the write is beyond the current on disk EOF. This code cannot do that as all it does is allocation and present normal looking buffers to the generic code path.

Good point, I was going to take care of it in a separate patch to support data=ordered.

> Secondly, apart from delalloc, XFS cannot use the generic code paths for writeback because unwritten extent conversion also requires custom I/O completion handlers. Given that __mpage_writepage() only calls ->writepage when it is confused, XFS simply cannot use this API.

This doesn't mean fs/mpage.c should go, right?

> Also, looking at the way mpage_da_map_blocks() is done - if we have a 128MB delalloc extent - ext4 will allocate it in one go, right? What happens if we then crash after only writing a few megabytes of that extent? Stale data exposure? XFS can allocate multiple gigabytes in a single get_blocks call so even if ext4 can't do this, it's a problem for XFS.

What happens if IO to the 2nd MB is completed while IO to the 1st MB is not (probably sitting in the queue)? Do you update the on-disk size in this case? How do you track this?

> So without the ability to attach specific I/O completions to bios or support for unwritten extents directly in __mpage_writepage, there is no way XFS can use this generic delayed allocation code.

I didn't say generic, see Subject: :)

thanks, Alex
Re: [RFC] basic delayed allocation in VFS
Jeff Garzik wrote:
> Alex Tomas wrote:
> > > So without the ability to attach specific I/O completions to bios or support for unwritten extents directly in __mpage_writepage, there is no way XFS can use this generic delayed allocation code.
> >
> > I didn't say generic, see Subject: :)
>
> Well, it shouldn't even be in the VFS layer if it's only usable by one filesystem.

Sorry, but it seems I can say the same about iomap/ioend. I think mpage_da_writepages() is simple enough to be adopted by other filesystems, ext2 for example.

thanks, Alex
Re: [RFC] basic delayed allocation in VFS
David Chinner wrote:
> Firstly, XFS attaches a different I/O completion to delalloc writes to allow us to update the file size when the write is beyond the current on disk EOF. This code cannot do that as all it does is allocation and present normal looking buffers to the generic code path.

How do you implement fsync(2)? You'd have to wait for such IO to complete, then update the inode and write it through the log?

> Also, looking at the way mpage_da_map_blocks() is done - if we have a 128MB delalloc extent - ext4 will allocate it in one go, right? What happens if we then crash after only writing a few megabytes of that extent? Stale data exposure? XFS can allocate multiple gigabytes in a single get_blocks call so even if ext4 can't do this, it's a problem for XFS.

I just realized that you're talking about data=ordered mode in ext4, where care is taken to prevent on-disk references to not-yet-written blocks. The solution is to wait for such IO to complete before the metadata commit. And the key thing here is to allocate and attach to the inode the blocks we're writing immediately. IOW, there are no unwritten blocks attached to the inode (except in the fallocate(2) case), but there may be blocks preallocated for this inode in-core. Same gigabytes, but a different way ;)

I have no objection at all to a custom IO completion callback per mpage_writepages().

thanks, Alex
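The ordering rule described above (no metadata commit while it would reference blocks whose data has not reached disk) can be sketched as a toy model. This is an illustration under stated assumptions, not ext4 or JBD code: `OrderedTransaction` and its methods are hypothetical names, and a real journal would block and wait for the I/O rather than return `False`.

```python
class OrderedTransaction:
    """Toy model: metadata may commit only after the data it references is written."""

    def __init__(self):
        self.pending_data = set()  # allocated blocks whose data I/O is still in flight
        self.committed = False

    def allocate_and_submit(self, block):
        """Attach a freshly allocated block to the inode and start its data write."""
        self.pending_data.add(block)

    def data_io_done(self, block):
        """Record that the data write for this block has completed."""
        self.pending_data.discard(block)

    def try_commit(self):
        """Commit metadata only if no referenced block is still unwritten."""
        if self.pending_data:
            return False
        self.committed = True
        return True


txn = OrderedTransaction()
txn.allocate_and_submit(7)
txn.allocate_and_submit(8)
first_attempt = txn.try_commit()   # data for blocks 7 and 8 is still in flight
txn.data_io_done(7)
txn.data_io_done(8)
second_attempt = txn.try_commit()  # nothing pending any more
```

In this model a crash before the second commit leaves no on-disk reference to blocks 7 and 8, which is the stale-data protection the message attributes to data=ordered.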
Re: [RFC] basic delayed allocation in VFS
Alex Tomas wrote:
> > So without the ability to attach specific I/O completions to bios or support for unwritten extents directly in __mpage_writepage, there is no way XFS can use this generic delayed allocation code.
>
> I didn't say generic, see Subject: :)

Well, it shouldn't even be in the VFS layer if it's only usable by one filesystem.

Jeff
Re: [RFC] basic delayed allocation in VFS
Alex Tomas wrote:
> Jeff Garzik wrote:
> > Is this based on Christoph's work? Christoph, or some other XFS hacker, already did generic delalloc, modeled on the XFS delalloc code.
>
> nope, this one is simple (something I'd prefer for ext4).

The XFS one is proven and the work was already completed. What were the specific technical issues that made it unsuitable for ext4? I would rather not reinvent the wheel, particularly if the reinvention is less capable than the existing work.

Jeff
Re: [RFC] basic delayed allocation in VFS
Alex Tomas wrote:
> Good day,
>
> please review ...
>
> thanks, Alex
>
> basic delayed allocation in VFS:
>
> * block_prepare_write() can be passed a special ->get_block() which doesn't allocate blocks, but reserves them and marks the bh delayed
> * a filesystem can use mpage_da_writepages() with another ->get_block() which doesn't defer allocation. mpage_da_writepages() finds all non-allocated blocks and tries to allocate them with minimal calls to ->get_block(), then submits IO using __mpage_writepage()
>
> Signed-off-by: Alex Tomas [EMAIL PROTECTED]

Is this based on Christoph's work? Christoph, or some other XFS hacker, already did generic delalloc, modeled on the XFS delalloc code.

Jeff
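The second bullet of the patch description, allocating all delayed blocks "with minimal calls to ->get_block()", can be sketched as grouping delayed blocks into contiguous runs and calling the allocator once per run. This is a toy userspace model, not the patch itself: `contiguous_runs`, `ToyAllocator` and `write_out` are hypothetical names, and the starting disk block 5000 is an arbitrary value chosen for the example.

```python
def contiguous_runs(blocks):
    """Group block numbers into (start, length) runs of consecutive blocks."""
    runs = []
    for block in sorted(set(blocks)):
        if runs and block == runs[-1][0] + runs[-1][1]:
            start, length = runs[-1]
            runs[-1] = (start, length + 1)  # extend the current run
        else:
            runs.append((block, 1))         # start a new run
    return runs


class ToyAllocator:
    """Hypothetical stand-in for a filesystem's allocating get_block callback."""

    def __init__(self):
        self.calls = 0
        self.next_free = 5000  # arbitrary first free disk block

    def allocate(self, start, length):
        """Map `length` logical blocks from `start`; returns {logical: physical}."""
        self.calls += 1
        first = self.next_free
        self.next_free += length
        return {start + i: first + i for i in range(length)}


def write_out(delayed_blocks, allocator):
    """Allocate every delayed block using one allocator call per contiguous run."""
    mapping = {}
    for start, length in contiguous_runs(delayed_blocks):
        mapping.update(allocator.allocate(start, length))
    return mapping


allocator = ToyAllocator()
mapping = write_out([0, 1, 2, 3, 10, 11], allocator)
```

Six delayed blocks in two contiguous runs cost two allocator calls here instead of six, and each run lands on consecutive disk blocks. Submitting the I/O afterwards is outside the scope of the sketch.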
Re: [RFC] basic delayed allocation in VFS
[please don't top post!]

On Thu, Jul 26, 2007 at 05:33:08PM +0400, Alex Tomas wrote:
> Jeff Garzik wrote:
> > The XFS one is proven and the work was already completed. What were the specific technical issues that made it unsuitable for ext4? I would rather not reinvent the wheel, particularly if the reinvention is less capable than the existing work.
>
> It duplicates fs/mpage.c in bio building and introduces new generic API (iomap, map_blocks_t, etc).

Using a new API for new functionality is a bad thing?

> In contrast, my trivial implementation re-uses existing code in fs/mpage.c, doesn't introduce a new API and I tend to think provides quite the same functionality. I can be wrong, of course ...

No, it doesn't provide the same functionality.

Firstly, XFS attaches a different I/O completion to delalloc writes to allow us to update the file size when the write is beyond the current on disk EOF. This code cannot do that as all it does is allocation and present normal looking buffers to the generic code path.

Secondly, apart from delalloc, XFS cannot use the generic code paths for writeback because unwritten extent conversion also requires custom I/O completion handlers. Given that __mpage_writepage() only calls ->writepage when it is confused, XFS simply cannot use this API.

Also, looking at the way mpage_da_map_blocks() is done - if we have a 128MB delalloc extent - ext4 will allocate it in one go, right? What happens if we then crash after only writing a few megabytes of that extent? Stale data exposure? XFS can allocate multiple gigabytes in a single get_blocks call so even if ext4 can't do this, it's a problem for XFS.

So without the ability to attach specific I/O completions to bios or support for unwritten extents directly in __mpage_writepage, there is no way XFS can use this generic delayed allocation code.

Cheers, Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
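The first objection above, that a delalloc write needs its own I/O completion so the on-disk file size moves only after the data is written, can be sketched as a write request that carries an optional callback. This is a simplified model and not XFS code: `ToyInode`, `ToyWrite` and `extend_size_on_completion` are hypothetical names, and the sketch ignores the out-of-order completion question raised elsewhere in the thread.

```python
class ToyInode:
    """Minimal inode model tracking only the size recorded on disk."""

    def __init__(self, on_disk_size=0):
        self.on_disk_size = on_disk_size


class ToyWrite:
    """Hypothetical write request carrying an optional completion callback."""

    def __init__(self, offset, length, on_complete=None):
        self.offset = offset
        self.length = length
        self.on_complete = on_complete

    def complete(self, inode):
        """Called when the data has reached disk; runs the callback if one is attached."""
        if self.on_complete is not None:
            self.on_complete(inode, self)


def extend_size_on_completion(inode, write):
    """Move the on-disk EOF forward only once the data is known to be written."""
    end = write.offset + write.length
    if end > inode.on_disk_size:
        inode.on_disk_size = end


inode = ToyInode(on_disk_size=4096)
plain = ToyWrite(offset=4096, length=4096)  # generic path: no callback attached
delalloc = ToyWrite(offset=4096, length=8192,
                    on_complete=extend_size_on_completion)

plain.complete(inode)
size_after_plain = inode.on_disk_size
delalloc.complete(inode)
size_after_delalloc = inode.on_disk_size
```

The write without a callback leaves the on-disk size untouched, which mirrors the complaint that a path presenting "normal looking buffers" has no hook for the size update.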