Re: [RFC] basic delayed allocation in VFS

2007-07-30 Thread Mingming Cao
On Sun, 2007-07-29 at 20:24 +0100, Christoph Hellwig wrote:
 On Sun, Jul 29, 2007 at 11:30:36AM -0600, Andreas Dilger wrote:
  Sigh, we HAVE a patch that was only adding delalloc to ext4, but it
  was rejected because that functionality should go into the VFS.
  Since the performance improvement of delalloc is quite large, we'd
  like to get this into the kernel one way or another.  Can we make a
  decision if the ext4-specific delalloc is acceptable?
 
 I'm a big proponent of having proper common delalloc code, but the
 one proposed here is not generic for the existing filesystem using
 delalloc.  

To be fair, what Alex has so far is probably good enough for ext2/3
delayed allocation.

 It's still on my todo list to revamp the xfs code to get
 rid of some of the existing mess and make it usable generically.  If
 the ext4 users are fine with the end result we could move to generic
 code.
 

Are you okay with having an ext4 delayed allocation implementation (i.e.
moving the code proposed in this thread to fs/ext4) first?  Then later
when you come up with a generic delayed allocation for both ext4 and xfs
we could make use of that generic implementation. Is that an acceptable
approach?

Andrew, what do you think?


Regards,
Mingming

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] basic delayed allocation in VFS

2007-07-29 Thread Andreas Dilger
On Jul 28, 2007  20:51 +0100, Christoph Hellwig wrote:
 That doesn't mean I want to argue against Alex's code although I'd of
 course be more happy if we could actually share code between multiple
 filesystems.
 
 Of course the code in its current form should not go into mpage.c but
 rather into ext4 so that it doesn't bloat the kernel for everyone.

Sigh, we HAVE a patch that was only adding delalloc to ext4, but it
was rejected because that functionality should go into the VFS.
Since the performance improvement of delalloc is quite large, we'd
like to get this into the kernel one way or another.  Can we make a
decision if the ext4-specific delalloc is acceptable?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



Re: [RFC] basic delayed allocation in VFS

2007-07-29 Thread Alex Tomas

Andreas Dilger wrote:

Sigh, we HAVE a patch that was only adding delalloc to ext4, but it
was rejected because that functionality should go into the VFS.
Since the performance improvement of delalloc is quite large, we'd
like to get this into the kernel one way or another.  Can we make a
decision if the ext4-specific delalloc is acceptable?


I think the latter one is better because it supports bs < pagesize
(though I'm not sure about data=ordered yet). I'm not against putting
most of the patch into fs/ext4/, but at least a few bits need to be changed
in fs/ - exports in fs/mpage.c and one if in __block_write_full_page().

thanks, Alex



Re: [RFC] basic delayed allocation in VFS

2007-07-29 Thread Christoph Hellwig
On Sun, Jul 29, 2007 at 09:48:10PM +0400, Alex Tomas wrote:
 I think the latter one is better because it supports bs < pagesize
 (though I'm not sure about data=ordered yet). I'm not against putting
 most of the patch into fs/ext4/, but at least few bits to be changed
 in fs/ - exports in  fs/mpage.c and one if in __block_write_full_page().

The change to __block_write_full_page() is obviously fine, and exporting
the mpage.c bits sounds fine to me as well, although I'd like to take a look
at the final patch.


Re: [RFC] basic delayed allocation in VFS

2007-07-29 Thread Christoph Hellwig
On Sun, Jul 29, 2007 at 11:30:36AM -0600, Andreas Dilger wrote:
 Sigh, we HAVE a patch that was only adding delalloc to ext4, but it
 was rejected because that functionality should go into the VFS.
 Since the performance improvement of delalloc is quite large, we'd
 like to get this into the kernel one way or another.  Can we make a
 decision if the ext4-specific delalloc is acceptable?

I'm a big proponent of having proper common delalloc code, but the
one proposed here is not generic for the existing filesystem using
delalloc.  It's still on my todo list to revamp the xfs code to get
rid of some of the existing mess and make it usable generically.  If
the ext4 users are fine with the end result we could move to generic
code.

Note that moving to VFS is bullshit either way, writeback code is
nowhere near the VFS nor should it be.


Re: [RFC] basic delayed allocation in VFS

2007-07-29 Thread Alex Tomas

I'm a bit worried about one thing ... it looks like XFS and ext4
use different techniques to order data and metadata referencing
them. Now I'm not that optimistic that we can separate ordering
from delalloc itself in a clean and reasonable way. In general, I'd
prefer common code in fs/ (mm/?) of course, for a number of reasons.

thanks, Alex


Christoph Hellwig wrote:

I'm a big proponent of having proper common delalloc code, but the
one proposed here is not generic for the existing filesystem using
delalloc.  It's still on my todo list to revamp the xfs code to get
rid of some of the existing mess and make it usable generically.  If
the ext4 users are fine with the end result we could move to generic
code.

Note that moving to VFS is bullshit either way, writeback code is
nowhere near the VFS nor should it be.





Re: [RFC] basic delayed allocation in VFS

2007-07-29 Thread Theodore Tso
On Sun, Jul 29, 2007 at 08:24:37PM +0100, Christoph Hellwig wrote:
 I'm a big proponent of having proper common delalloc code, but the
 one proposed here is not generic for the existing filesystem using
 delalloc.  It's still on my todo list to revamp the xfs code to get
 rid of some of the existing mess and make it usable generically.  If
 the ext4 users are fine with the end result we could move to generic
 code.

Do you think it would be faster for you to revamp the code or to give
instructions about how you'd like to clean up the code and what has to
be preserved in order to keep XFS happy, so someone else could give it
a try?  Or do you think the code is too grotty and/or tricky for
someone else to attempt this?

 Note that moving to VFS is bullshit either way, writeback code is
 nowhere near the VFS nor should it be.

Agreed.  I would think that something like mm/delayed_alloc.c would be
preferable.  Ideally it would be like the filemap.c code, where it
would be relatively easy for most standard filesystems to hook into it
and get the advantages of delayed allocation.  (Although granted it
will probably require more effort on the part of a filesystem author
than filemap!)

- Ted


Re: [RFC] basic delayed allocation in VFS

2007-07-29 Thread David Chinner
On Sun, Jul 29, 2007 at 04:09:20PM +0400, Alex Tomas wrote:
 David Chinner wrote:
 On Fri, Jul 27, 2007 at 11:51:56AM +0400, Alex Tomas wrote:
 But this is really irrelevant - the issue at hand is what we want
 for VFS level delalloc support. IMO, that mechanism needs to support
 both XFS and ext4, and I'd prefer if it doesn't perpetuate the
 bufferhead abuses of the past (i.e. define an iomap structure
 instead of overloading bufferheads yet again).
 
 I'm not sure I understand very well.

->get_blocks abuses bufferheads to provide an offset/length/state
mapping. That's all it needs. That's what the iomap structure is used
for. It's smaller than a bufferhead, it's descriptive of its use
and you don't get it confused with the other 10 ways bufferheads
are used and abused.
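
For readers following along, the offset/length/state mapping described above could look roughly like the sketch below. The names iomap_sketch, map_state, map_contains() and map_needs_alloc() are hypothetical and made up for illustration; this is not the XFS iomap structure nor any existing kernel API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mapping states; the names are illustrative only. */
enum map_state {
    MAP_HOLE,       /* nothing allocated or reserved */
    MAP_DELALLOC,   /* space reserved, allocation deferred */
    MAP_UNWRITTEN,  /* allocated, but no data written yet */
    MAP_MAPPED      /* allocated and written */
};

/* Hypothetical descriptor: one contiguous file range and its state. */
struct iomap_sketch {
    uint64_t offset;      /* byte offset of the range in the file */
    uint64_t length;      /* length of the range in bytes */
    uint64_t disk_block;  /* first disk block; unused for HOLE/DELALLOC */
    enum map_state state;
};

/* Return 1 if byte position pos falls inside the mapping. */
static int map_contains(const struct iomap_sketch *m, uint64_t pos)
{
    assert(m);
    return pos >= m->offset && pos - m->offset < m->length;
}

/* Return 1 if writeback of this range has to allocate blocks first. */
static int map_needs_alloc(const struct iomap_sketch *m)
{
    assert(m);
    return m->state == MAP_HOLE || m->state == MAP_DELALLOC;
}
```

The point of the sketch is only that a mapping answers "which range, in which state"; per-block uptodate/dirty tracking is a separate concern and is not modelled here.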

 where would you track uptodate, dirty and other states then?
 do you propose to separate block states from block mapping?

No. They still get tracked in the bufferheads attached to the page.
That's what bufferheads were originally intended for(*).

Cheers,

Dave.

(*) I recently proposed a separate block map tree for this rather
than using buffer heads for this because of the memory footprint of
N bufferheads per page on contiguous mappings. That's future work,
not something we really need to consider here. Chris Mason's extent
map tree patches are a start on this concept.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [RFC] basic delayed allocation in VFS

2007-07-28 Thread Christoph Hellwig
On Fri, Jul 27, 2007 at 03:07:14PM +1000, David Chinner wrote:
  It duplicates fs/mpage.c in bio building and introduces new generic API
  (iomap, map_blocks_t, etc).
 
 Using a new API for new functionality is a bad thing?

Depends on what you do.  This patch is just a quick hack to shoe-horn
delalloc support into ext4.  Introducing a new abstraction is overkill.
If we really want an overhaul of the writeback path that's extent-aware,
and efficient for delalloc and unwritten extents, introducing a proper
iomap-like data structure would make sense.  That being said I personally
hate the buffer_head abuse for bmap data that we have in various places
as it's utterly confusing and wastes stack space, but that's a different
discussion.



Re: [RFC] basic delayed allocation in VFS

2007-07-28 Thread Christoph Hellwig
On Fri, Jul 27, 2007 at 11:51:56AM +0400, Alex Tomas wrote:
 Secondly, apart from delalloc, XFS cannot use the generic code paths
 for writeback because unwritten extent conversion also requires
 custom I/O completion handlers. Given that __mpage_writepage() only
 calls ->writepage when it is confused, XFS simply cannot use this
 API.
 
 this doesn't mean fs/mpage.c should go, right?

mpage.c read side is fine for every block based filesystem I know.
mpage.c write side is fine for every simple (non-delalloc, non-unwritten
extent, etc) filesystem.  So it surely shouldn't go.

 I didn't say generic, see Subject: :)

then it shouldn't be in generic code.



Re: [RFC] basic delayed allocation in VFS

2007-07-28 Thread Alex Tomas

Christoph Hellwig wrote:

This is not based on my attempt to make the xfs writeout path generic.
Alex's variant is a lot simpler and thus missed various bits required
for high sustained writeout performance or xfs functionality.


I'd very much appreciate any details about high writeout performance.


That doesn't mean I want to argue against Alex's code although I'd of
course be more happy if we could actually share code between multiple
filesystems.


I'm not against it at all, of course, but the xfs writeout code looks .. hmm ..
very xfs :)

thanks, Alex




Re: [RFC] basic delayed allocation in VFS

2007-07-27 Thread Alex Tomas

David Chinner wrote:

Using a new API for new functionality is a bad thing?


if existing API can be used ...


No, it doesn't provide the same functionality.

Firstly, XFS attaches a different I/O completion to delalloc writes
to allow us to update the file size when the write is beyond the
current on disk EOF. This code cannot do that as all it does is
allocation and present normal looking buffers to the generic code
path.


good point, I was going to take care of it in a separate patch
to support data=ordered.


Secondly, apart from delalloc, XFS cannot use the generic code paths
for writeback because unwritten extent conversion also requires
custom I/O completion handlers. Given that __mpage_writepage() only
calls ->writepage when it is confused, XFS simply cannot use this
API.


this doesn't mean fs/mpage.c should go, right?


Also, looking at the way mpage_da_map_blocks() is done - if we have
a 128MB delalloc extent - ext4 will allocate it
in one go, right? What happens if we then crash after only writing a
few megabytes of that extent? stale data exposure? XFS can allocate
multiple gigabytes in a single get_blocks call so even if ext4 can't
do this, it's a problem for XFS.


what happens if IO to 2nd MB is completed, while IO to 1st MB is not
(probably sitting in queue) ? do you update on-disk size in this case?
how do you track this?


So without the ability to attach specific I/O completions to bios
or support for unwritten extents directly in __mpage_writepage,
there is no way XFS can use this generic delayed allocation code.


I didn't say generic, see Subject: :)

thanks, Alex



Re: [RFC] basic delayed allocation in VFS

2007-07-27 Thread Alex Tomas

Jeff Garzik wrote:

Alex Tomas wrote:

So without the ability to attach specific I/O completions to bios
or support for unwritten extents directly in __mpage_writepage,
there is no way XFS can use this generic delayed allocation code.


I didn't say generic, see Subject: :)


Well, it shouldn't even be in the VFS layer if it's only usable by one 
filesystem.


sorry, but it seems I can say the same about iomap/ioend. I think
mpage_da_writepages() is simple enough to be adopted by other
filesystems, ext2 for example.

thanks, Alex



Re: [RFC] basic delayed allocation in VFS

2007-07-27 Thread Alex Tomas

David Chinner wrote:

Firstly, XFS attaches a different I/O completion to delalloc writes
to allow us to update the file size when the write is beyond the
current on disk EOF. This code cannot do that as all it does is
allocation and present normal looking buffers to the generic code
path.


how do you implement fsync(2)? you'd have to wait for such IO to complete,
then update the inode and write it through the log?


Also, looking at the way mpage_da_map_blocks() is done - if we have
a 128MB delalloc extent - ext4 will allocate it
in one go, right? What happens if we then crash after only writing a
few megabytes of that extent? stale data exposure? XFS can allocate
multiple gigabytes in a single get_blocks call so even if ext4 can't
do this, it's a problem for XFS.


I just realized that you're talking about data=ordered mode in ext4,
where care is taken to prevent on-disk references to not-yet-written
blocks. The solution is to wait for such IO to complete before the metadata
commit. And the key thing here is to allocate and attach to the inode
the blocks we're writing immediately. IOW, there are no unwritten blocks
attached to the inode (except the fallocate(2) case), but there may be blocks
preallocated for this inode in-core. same gigabytes, but different
way ;)
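
The ordering rule described above (data IO must complete before the metadata commit) can be sketched as a simple counter. This is only an illustration of the rule under simplifying assumptions; txn_sketch and the three helpers are hypothetical names and are not the ext4 or jbd implementation.

```c
#include <assert.h>

/* Hypothetical transaction: tracks data writes it must wait for. */
struct txn_sketch {
    int pending_data_io;  /* data writes submitted but not yet completed */
    int committed;        /* 1 once the metadata commit has happened */
};

/* A data write that the metadata in this transaction will reference. */
static void submit_data_io(struct txn_sketch *t)
{
    assert(!t->committed);
    t->pending_data_io++;
}

/* Completion of one previously submitted data write. */
static void data_io_done(struct txn_sketch *t)
{
    assert(t->pending_data_io > 0);
    t->pending_data_io--;
}

/*
 * Commit metadata only when no data write is outstanding, so the disk
 * never holds a reference to a block whose data has not been written.
 * Returns 1 on commit, 0 if the caller still has to wait.
 */
static int try_commit(struct txn_sketch *t)
{
    if (t->pending_data_io > 0)
        return 0;
    t->committed = 1;
    return 1;
}
```

A real implementation would block or be woken on IO completion instead of polling try_commit(); the sketch only shows the invariant.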

I have no objection at all to a custom IO completion callback per
mpage_writepages().


thanks, Alex




Re: [RFC] basic delayed allocation in VFS

2007-07-27 Thread Jeff Garzik

Alex Tomas wrote:

So without the ability to attach specific I/O completions to bios
or support for unwritten extents directly in __mpage_writepage,
there is no way XFS can use this generic delayed allocation code.


I didn't say generic, see Subject: :)


Well, it shouldn't even be in the VFS layer if it's only usable by one 
filesystem.


Jeff




Re: [RFC] basic delayed allocation in VFS

2007-07-26 Thread Jeff Garzik

Alex Tomas wrote:

Jeff Garzik wrote:

Is this based on Christoph's work?

Christoph, or some other XFS hacker, already did generic delalloc, 
modeled on the XFS delalloc code.


nope, this one is simple (something I'd prefer for ext4).


The XFS one is proven and the work was already completed.

What were the specific technical issues that made it unsuitable for ext4?

I would rather not reinvent the wheel, particularly if the reinvention 
is less capable than the existing work.


Jeff





Re: [RFC] basic delayed allocation in VFS

2007-07-26 Thread Jeff Garzik

Alex Tomas wrote:

Good day,

please review ...

thanks, Alex


basic delayed allocation in VFS:

 * block_prepare_write() can be passed a special ->get_block() which
   doesn't allocate blocks, but reserves them and marks the bh delayed
 * a filesystem can use mpage_da_writepages() with another ->get_block()
   which doesn't defer allocation. mpage_da_writepages() finds all
   non-allocated blocks and tries to allocate them with minimal calls
   to ->get_block(), then submits IO using __mpage_writepage()
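
The two steps in the list above can be modelled in user space roughly as follows. This is a toy sketch, not the patch: file_sketch, write_block(), alloc_extent() and writeback() are hypothetical names, the allocator is a trivial bump counter, and no IO is submitted.

```c
#include <assert.h>
#include <stddef.h>

#define NBLOCKS 16

enum blk_state { BLK_HOLE, BLK_DELAYED, BLK_MAPPED };

/* Hypothetical in-memory model of one small file. */
struct file_sketch {
    enum blk_state state[NBLOCKS];
    long disk[NBLOCKS];   /* disk block per file block, -1 if none */
    long reserved;        /* blocks reserved but not yet allocated */
    long next_free;       /* cursor of a trivial bump allocator */
    int alloc_calls;      /* how many times the allocator was called */
};

static void file_init(struct file_sketch *f)
{
    for (size_t i = 0; i < NBLOCKS; i++) {
        f->state[i] = BLK_HOLE;
        f->disk[i] = -1;
    }
    f->reserved = 0;
    f->next_free = 100;
    f->alloc_calls = 0;
}

/* Write path: reserve space and mark the block delayed, no allocation. */
static void write_block(struct file_sketch *f, size_t blk)
{
    assert(blk < NBLOCKS);
    if (f->state[blk] == BLK_HOLE) {
        f->state[blk] = BLK_DELAYED;
        f->reserved++;
    }
}

/* Allocate one contiguous run of len blocks with a single call. */
static void alloc_extent(struct file_sketch *f, size_t start, size_t len)
{
    f->alloc_calls++;
    for (size_t i = 0; i < len; i++) {
        f->disk[start + i] = f->next_free++;
        f->state[start + i] = BLK_MAPPED;
    }
    f->reserved -= (long)len;
}

/* Writeback: find each run of delayed blocks, allocate once per run. */
static void writeback(struct file_sketch *f)
{
    size_t i = 0;
    while (i < NBLOCKS) {
        if (f->state[i] != BLK_DELAYED) {
            i++;
            continue;
        }
        size_t start = i;
        while (i < NBLOCKS && f->state[i] == BLK_DELAYED)
            i++;
        alloc_extent(f, start, i - start);
    }
}
```

Writing blocks 2, 3, 4 and 9 and then calling writeback() results in two allocator calls, one per contiguous run, instead of four; that batching is where the benefit of deferring allocation comes from in this model.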


Signed-off-by: Alex Tomas [EMAIL PROTECTED]


Is this based on Christoph's work?

Christoph, or some other XFS hacker, already did generic delalloc, 
modeled on the XFS delalloc code.


Jeff




Re: [RFC] basic delayed allocation in VFS

2007-07-26 Thread David Chinner
[please don't top post!]

On Thu, Jul 26, 2007 at 05:33:08PM +0400, Alex Tomas wrote:
 Jeff Garzik wrote:
 The XFS one is proven and the work was already completed.
 
 What were the specific technical issues that made it unsuitable for ext4?
 
 I would rather not reinvent the wheel, particularly if the reinvention 
 is less capable than the existing work.

 It duplicates fs/mpage.c in bio building and introduces new generic API
 (iomap, map_blocks_t, etc).

Using a new API for new functionality is a bad thing?

 In contrast, my trivial implementation re-uses
 existing code in fs/mpage.c, doesn't introduce a new API and I tend to think
 provides quite the same functionality. I can be wrong, of course ...

No, it doesn't provide the same functionality.

Firstly, XFS attaches a different I/O completion to delalloc writes
to allow us to update the file size when the write is beyond the
current on disk EOF. This code cannot do that as all it does is
allocation and present normal looking buffers to the generic code
path.
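
As an illustration of the point above, a per-IO completion hook that moves the on-disk size forward might be sketched like this. All names (inode_sketch, io_sketch, end_io_plain, end_io_delalloc, complete_io) are hypothetical; this is not the XFS ioend code, and it does not address ordering between completions that finish out of order.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical inode: in-core size versus size recorded on disk. */
struct inode_sketch {
    uint64_t mem_size;
    uint64_t disk_size;
};

struct io_sketch;
typedef void (*io_end_fn)(struct io_sketch *io);

/* Hypothetical IO descriptor carrying its own completion handler. */
struct io_sketch {
    struct inode_sketch *inode;
    uint64_t offset;
    uint64_t length;
    io_end_fn end_io;
};

/* Generic completion: nothing filesystem-specific happens. */
static void end_io_plain(struct io_sketch *io)
{
    (void)io;
}

/* Delalloc completion: extend the on-disk size if the write went past it. */
static void end_io_delalloc(struct io_sketch *io)
{
    uint64_t end = io->offset + io->length;

    /* Never record more than the in-core file size. */
    if (end > io->inode->mem_size)
        end = io->inode->mem_size;
    if (end > io->inode->disk_size)
        io->inode->disk_size = end;
}

/* Called when the IO finishes; dispatches to the attached handler. */
static void complete_io(struct io_sketch *io)
{
    assert(io && io->end_io);
    io->end_io(io);
}
```

A write path that only ever attaches end_io_plain never updates disk_size, which is the limitation being described: the filesystem needs a way to choose the handler per IO.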

Secondly, apart from delalloc, XFS cannot use the generic code paths
for writeback because unwritten extent conversion also requires
custom I/O completion handlers. Given that __mpage_writepage() only
calls ->writepage when it is confused, XFS simply cannot use this
API.

Also, looking at the way mpage_da_map_blocks() is done - if we have
a 128MB delalloc extent - ext4 will allocate it
in one go, right? What happens if we then crash after only writing a
few megabytes of that extent? stale data exposure? XFS can allocate
multiple gigabytes in a single get_blocks call so even if ext4 can't
do this, it's a problem for XFS.

So without the ability to attach specific I/O completions to bios
or support for unwritten extents directly in __mpage_writepage,
there is no way XFS can use this generic delayed allocation code.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group