Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling to use bmap to file blocks.

2006-05-15 Thread Neil Brown
On Monday May 15, [EMAIL PROTECTED] wrote:
> 
> Ho hum, I give up.

Thank you :-)  I found our debate very valuable - it helped me clarify
my understanding of some areas of linux filesystem semantics (and as I
am trying to write a filesystem in my 'spare time', that will turn out
to be very useful).  It also revealed some problems in the code!

> I don't think, in practice, this code fixes any
> demonstrable bug though.

I thought it was our job to kill the bugs *before* they were
demonstrated :-)

I'm still convinced that the previous code could lead to deadlocks or
worse under sufficiently sustained high memory pressure and fs
activity.

I'll send a patch shortly that fixes the known problems and
awkwardnesses in the new code.

Thanks again,
NeilBrown


Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling to use bmap to file blocks.

2006-05-15 Thread Andrew Morton
Neil Brown <[EMAIL PROTECTED]> wrote:
>
> ...
> 
> > >  I have a patch which did that,
> > > but decided that the possibility of kmalloc failure at awkward times
> > > would make that not suitable.
> > 
> > submit_bh() can and will allocate memory, although most decent device
> > drivers should be OK.
> > 
> 
> submit_bh (like all decent device drivers) uses a mempool for memory
> allocation so we can be sure that the delay in getting memory is
> bounded by the delay for a few IO requests to complete, and we can be
> sure the allocation won't fail.  This is perfectly fine.

My point is that some device drivers will apparently call into the memory
allocator (should be GFP_NOIO) on the request submission path.  I don't
recall which drivers - but not mainstream ones.

> > > 
> > > I don't think a_ops really provides an interface that I can use, partly
> > > because, as I said in a previous email, it isn't really a public
> > > interface to a filesystem.
> > 
> > It's publicer than bmap+submit_bh!
> > 
> 
> I don't know how you can say that.

bmap() is a userspace API, not really a kernel one.  And it assumes a
block-backed filesystem.

> bmap is so public that it is exported to userspace through an IOCTL
> and is used by lilo (admittedly only for reading, not writing).  More
> significantly it is used by swapfile which is a completely independent
> subsystem from the filesystem.  Contrast this with a_ops.  The primary
> users of a_ops are routines like generic_file_{read,write} and
> friends.  These are tools - library routines - that are used by
> filesystems to implement their 'file_operations' which are the real
> public interface.  As far as these uses go, it is not a public
> interface.  Where a filesystem doesn't use some library routines, it
> does not need to implement the matching functionality in the a_op
> interface.

Well yeah.  If one wants to do filesystem I/O in-kernel then one should use
the filesystem I/O entry points: read() and write() (and get flamed ;)).

If one wants to cut in at a lower level than that then we get into a pickle
because different types of filesystems do quite different things to
implement read() and write().

> The other main user is the 'VM' which might try to flush out or
> invalidate pages.  However the VM isn't using this interface to
> interact with files, but only to interact with pages, and it doesn't
> much care what is done with the pages providing they get clean, or get
> released, or whatever.
> 
> The way I perceive Linux design/development, active usage is far more
> significant than documented design.  If some feature of an interface
> isn't being actively used - by in-kernel code - then you cannot be
> sure that feature will be uniformly implemented, or that it won't
> change subtly next week.
> 
> So when I went looking for the best way to get md/bitmap to write to a
> file, I didn't just look at the interface specs (which are pretty
> poorly documented anyway), I looked at existing code.
> I can find 3 different parts of the kernel that write to a file.
> They are
>swap-file
>loop
>nfsd
> 
> nfsd uses vfs_read/vfs_write  which have too many failure/delay modes
>   for me to safely use.
> loop uses prepare_write/commit_write (if available) or f_op->write
>   (not vfs_write - I wonder why) which is not much better than what
>   nfsd uses.  And as far as I can tell loop never actually flushes data to 
>   storage (hope no-one thinks a journalling filesystem on loop is a
>   good idea) so it isn't a good model to follow.
>   [[If loop was a good model to follow, I would have rejected the
>   code for writing to a file in the first place, only accepted code
>   for writing to a device, and insisted on using loop to close the gap]]
> 
> That leaves swap-file as the only in-kernel user of a filesystem that
> accesses the file in a way similar to what I need.  Is it any surprise
> that I chose to copy it?

swapfile doesn't use vfs_read() for swapin...

> 
> > 
> > But if the pagecache is dirty then invalidate_inode_pages() will leave it
> > dirty and the VM will later write it back, corrupting your bitmap file. 
> > You should get i_writecount, fsync the file and then run
> > invalidate_inode_pages().
> 
> Well, I am currently reading the file through the pagecache, but
> yes, I should put an fsync in there (and the invalidate before the read is
> fairly pointless; it is the invalidate after close that is important).

umm, the code in its present form _will_ corrupt MD bitmap files. 
invalidate_inode_pages() will not invalidate dirty pagecache, and that
dirty pagecache will later get written back, undoing any of your
intervening direct-io writes.  Most of the time, MD will do another
direct-io write _after_ the VM has done that writeback and things will get
fixed up.  But there is a window.  So that fsync() is required.
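
As a concrete sketch of that ordering (illustrative only - the function
name is invented, error handling is elided, and the calls are the 2.6-era
APIs):

    /* flush dirty pagecache and wait, *then* invalidate, so no later
     * writeback can undo intervening direct-io writes */
    static void bitmap_release_file(struct file *file)
    {
            struct address_space *mapping = file->f_mapping;

            filemap_fdatawrite(mapping);     /* start writeback of dirty pages */
            filemap_fdatawait(mapping);      /* the fsync step: wait for it */
            invalidate_inode_pages(mapping); /* now nothing dirty is left */
    }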

> > 
> > You might as well use kernel_read() too, if you insist on being oddball ;)
> 
> I didn't know about that one..
> 
> If you

Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling to use bmap to file blocks.

2006-05-14 Thread Neil Brown

(replying to bits of several emails)

On Friday May 12, [EMAIL PROTECTED] wrote:
> Neil Brown <[EMAIL PROTECTED]> wrote:

> > However some IO requests cannot complete until the filesystem I/O
> > completes, so we need to be sure that the filesystem I/O won't block
> > waiting for memory, or fail with -ENOMEM.
> 
> That sounds like a complex deadlock.  Suppose the bitmap writeout requires
> some writeback to happen before it can get enough memory to proceed.
> 

Exactly. Bitmap writeout must not block on fs-writeback.  It can block
on device writeout (e.g. queue congestion or mempool exhaustion) but
it must complete without waiting in the fs layer or above, and without
the possibility of any error other than -EIO.  Otherwise we can get
deadlocked writing to the raid array.  submit_bh (or submit_bio) is
certain to be safe in this respect.  I'm not so confident about
anything at the fs level.

> > > Read it with O_DIRECT :(
> > 
> > Which is exactly what the next release of mdadm does.
> > As the patch comment said:
> > 
> > : With this approach the pagecache may contain data which is inconsistent
> > : with what is on disk.  To alleviate the problems this can cause, md
> > : invalidates the pagecache when releasing the file.  If the file is to be
> > : examined while the array is active (a non-critical but occasionally
> > : useful function), O_DIRECT io must be used.  And a new version of mdadm
> > : will have support for this.
> 
> Which doesn't help `od -x' and is going to cause older mdadm userspace to
> mysteriously and subtly fail.  Or does the user<->kernel interface have
> versioning which will prevent this?
> 

As I said: 'non-critical'.  Nothing important breaks if reading the
file gets old data.  Reading the file while the array is active is
purely a curiosity thing.  There is information in /proc/mdstat which
gives a fairly coarse view of the same data.  It could lead to some
confusion, but if a compliant mdadm comes out before this gets into a
mainline kernel, I doubt there will be any significant issue.

Reading/writing the bitmap needs to work reliably when the array is not
active, but suitable sync/invalidate calls in the kernel should make
that work perfectly.

I know this is technically a regression in the user-space interface, and
you don't like such regressions, with good reason.  Maybe I could call
invalidate_inode_pages every few seconds, or whenever the atime
changes, just to be on the safe side :-)

> >  I have a patch which did that,
> > but decided that the possibility of kmalloc failure at awkward times
> > would make that not suitable.
> 
> submit_bh() can and will allocate memory, although most decent device
> drivers should be OK.
> 

submit_bh (like all decent device drivers) uses a mempool for memory
allocation so we can be sure that the delay in getting memory is
bounded by the delay for a few IO requests to complete, and we can be
sure the allocation won't fail.  This is perfectly fine.
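
For reference, a minimal sketch of that mempool pattern (illustrative; the
cache, request type and function names are all invented):

    #include <linux/mempool.h>
    #include <linux/slab.h>

    struct my_req { int dummy; };           /* invented payload */

    static kmem_cache_t *req_cache;
    static mempool_t *req_pool;

    static int __init req_pool_init(void)
    {
            req_cache = kmem_cache_create("my_req", sizeof(struct my_req),
                                          0, 0, NULL, NULL);
            /* keep a small reserve for the I/O submission path */
            req_pool = mempool_create(2, mempool_alloc_slab,
                                      mempool_free_slab, req_cache);
            return req_pool ? 0 : -ENOMEM;
    }

    static struct my_req *get_req(void)
    {
            /* may wait (bounded by I/O completion), but never fails */
            return mempool_alloc(req_pool, GFP_NOIO);
    }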

> > 
> > I don't think a_ops really provides an interface that I can use, partly
> > because, as I said in a previous email, it isn't really a public
> > interface to a filesystem.
> 
> It's publicer than bmap+submit_bh!
> 

I don't know how you can say that.

bmap is so public that it is exported to userspace through an IOCTL
and is used by lilo (admittedly only for reading, not writing).  More
significantly it is used by swapfile which is a completely independent
subsystem from the filesystem.  Contrast this with a_ops.  The primary
users of a_ops are routines like generic_file_{read,write} and
friends.  These are tools - library routines - that are used by
filesystems to implement their 'file_operations' which are the real
public interface.  As far as these uses go, it is not a public
interface.  Where a filesystem doesn't use some library routines, it
does not need to implement the matching functionality in the a_op
interface.

The other main user is the 'VM' which might try to flush out or
invalidate pages.  However the VM isn't using this interface to
interact with files, but only to interact with pages, and it doesn't
much care what is done with the pages providing they get clean, or get
released, or whatever.

The way I perceive Linux design/development, active usage is far more
significant than documented design.  If some feature of an interface
isn't being actively used - by in-kernel code - then you cannot be
sure that feature will be uniformly implemented, or that it won't
change subtly next week.

So when I went looking for the best way to get md/bitmap to write to a
file, I didn't just look at the interface specs (which are pretty
poorly documented anyway), I looked at existing code.
I can find 3 different parts of the kernel that write to a file.
They are
   swap-file
   loop
   nfsd

nfsd uses vfs_read/vfs_write  which have too many failure/delay modes
  for me to safely use.
loop uses prepare_write/commit_write (if available) or f_op->write
  (not vfs_write - I wonder why) which is not much better than what
  nfsd uses.  And as far as I can tell loop never actually flushes data to
  storage (hope no-one thinks a journalling filesystem on loop is a
  good idea) so it isn't a good model to follow.
  [[If loop was a good model to follow, I would have rejected the
  code for writing to a file in the first place, only accepted code
  for writing to a device, and insisted on using loop to close the gap]]

That leaves swap-file as the only in-kernel user of a filesystem that
accesses the file in a way similar to what I need.  Is it any surprise
that I chose to copy it?
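
For reference, the knfsd-style vfs_read pattern mentioned above looks
roughly like this (a sketch; file, buf and count are assumed to be in
scope, and the set_fs(KERNEL_DS) dance is what lets a kernel buffer pass
the user-pointer checks):

    mm_segment_t old_fs = get_fs();
    loff_t pos = 0;
    ssize_t ret;

    set_fs(KERNEL_DS);              /* allow a kernel-space buffer */
    ret = vfs_read(file, (char __user *)buf, count, &pos);
    set_fs(old_fs);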

Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling to use bmap to file blocks.

2006-05-14 Thread Andrew Morton
Neil Brown <[EMAIL PROTECTED]> wrote:
>
> On Saturday May 13, [EMAIL PROTECTED] wrote:
> > Paul Clements <[EMAIL PROTECTED]> wrote:
> > >
> > > Andrew Morton wrote:
> > > 
> > > > The loss of pagecache coherency seems sad.  I assume there's never a
> > > > requirement for userspace to read this file.
> > > 
> > > Actually, there is. mdadm reads the bitmap file, so that would be 
> > > broken. Also, it's just useful for a user to be able to read the bitmap 
> > > (od -x, or similar) to figure out approximately how much more he's got 
> > > to resync to get an array in-sync. Other than reading the bitmap file, I 
> > > don't know of any way to determine that.
> > 
> > Read it with O_DIRECT :(
> 
> Which is exactly what the next release of mdadm does.
> As the patch comment said:
> 
> : With this approach the pagecache may contain data which is inconsistent
> : with what is on disk.  To alleviate the problems this can cause, md
> : invalidates the pagecache when releasing the file.  If the file is to be
> : examined while the array is active (a non-critical but occasionally
> : useful function), O_DIRECT io must be used.  And a new version of mdadm
> : will have support for this.

Which doesn't help `od -x' and is going to cause older mdadm userspace to
mysteriously and subtly fail.  Or does the user<->kernel interface have
versioning which will prevent this?




Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling to use bmap to file blocks.

2006-05-14 Thread Neil Brown
On Saturday May 13, [EMAIL PROTECTED] wrote:
> Paul Clements <[EMAIL PROTECTED]> wrote:
> >
> > Andrew Morton wrote:
> > 
> > > The loss of pagecache coherency seems sad.  I assume there's never a
> > > requirement for userspace to read this file.
> > 
> > Actually, there is. mdadm reads the bitmap file, so that would be 
> > broken. Also, it's just useful for a user to be able to read the bitmap 
> > (od -x, or similar) to figure out approximately how much more he's got 
> > to resync to get an array in-sync. Other than reading the bitmap file, I 
> > don't know of any way to determine that.
> 
> Read it with O_DIRECT :(

Which is exactly what the next release of mdadm does.
As the patch comment said:

: With this approach the pagecache may contain data which is inconsistent
: with what is on disk.  To alleviate the problems this can cause, md
: invalidates the pagecache when releasing the file.  If the file is to be
: examined while the array is active (a non-critical but occasionally
: useful function), O_DIRECT io must be used.  And a new version of mdadm
: will have support for this.



Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling to use bmap to file blocks.

2006-05-13 Thread Andrew Morton
Paul Clements <[EMAIL PROTECTED]> wrote:
>
> Andrew Morton wrote:
> 
> > The loss of pagecache coherency seems sad.  I assume there's never a
> > requirement for userspace to read this file.
> 
> Actually, there is. mdadm reads the bitmap file, so that would be 
> broken. Also, it's just useful for a user to be able to read the bitmap 
> (od -x, or similar) to figure out approximately how much more he's got 
> to resync to get an array in-sync. Other than reading the bitmap file, I 
> don't know of any way to determine that.

Read it with O_DIRECT :(


Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling to use bmap to file blocks.

2006-05-13 Thread Paul Clements

Andrew Morton wrote:


> The loss of pagecache coherency seems sad.  I assume there's never a
> requirement for userspace to read this file.


Actually, there is. mdadm reads the bitmap file, so that would be 
broken. Also, it's just useful for a user to be able to read the bitmap 
(od -x, or similar) to figure out approximately how much more he's got 
to resync to get an array in-sync. Other than reading the bitmap file, I 
don't know of any way to determine that.


--
Paul



Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling to use bmap to file blocks.

2006-05-13 Thread Andrew Morton
Neil Brown <[EMAIL PROTECTED]> wrote:
>
> On Friday May 12, [EMAIL PROTECTED] wrote:
> > NeilBrown <[EMAIL PROTECTED]> wrote:
> > >
> > > If md is asked to store a bitmap in a file, it tries to hold onto the
> > > page cache pages for that file, manipulate them directly, and call a
> > > cocktail of operations to write the file out.  I don't believe this is
> > > a supportable approach.
> > 
> > erk.  I think it's better than...
> > 
> > > This patch changes the approach to use the same approach as swap files.
> > > i.e. bmap is used to enumerate all the block addresses of parts of the file
> > > and we write directly to those blocks of the device.
> > 
> > That's going in at a much lower level.  Even swapfiles don't assume
> > buffer_heads.
> 
> I'm not "assuming" buffer_heads.  I'm creating buffer heads and using
> them for my own purposes.  These are my pages and my buffer heads.
> None of them belong to the filesystem.

Right, so it's incoherent with pagecache and userspace can no longer
usefully read this file.

> The buffer_heads are simply a convenient data-structure to record the
> several block addresses for each page.  I could have equally created
> an array storing all the addresses, and built the required bios by
> hand at write time.  But buffer_heads did most of the work for me, so
> I used them.

OK.

> Yes, it is a lower level, but
>  1/ I am certain that there will be no kmalloc problems and
>  2/ Because it is exactly the level used by swapfile, I know that it
> is sufficiently 'well defined' that no-one is going to break it.

It would be nicer of course to actually use the mm/page_io.c code.  That
would involve implementing swap_bmap() and reimplementing the
get_swap_bio() stuff in terms of a_ops->bmap().
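
For reference, the get_swap_bio() shape is roughly this (a sketch, not the
swap code itself; blocknr, inode and page are assumed in scope, and
my_end_io is an invented completion handler):

    struct bio *bio = bio_alloc(GFP_NOIO, 1);

    bio->bi_sector = blocknr * (inode->i_sb->s_blocksize >> 9);
    bio->bi_bdev = inode->i_sb->s_bdev;
    bio_add_page(bio, page, PAGE_SIZE, 0);
    bio->bi_end_io = my_end_io;
    submit_bio(WRITE, bio);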

But the swap code can afford to skip blockruns which aren't page-sized and
it uses that capability nicely.  You cannot do that.
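
What "skip blockruns which aren't page-sized" amounts to is roughly this
(a sketch of the probe done by setup_swap_extents() in mm/swapfile.c;
inode and probe_block are assumed in scope):

    int blocks_per_page = PAGE_SIZE >> inode->i_blkbits;
    sector_t first = bmap(inode, probe_block);
    int i, contiguous = (first != 0);

    for (i = 1; i < blocks_per_page && contiguous; i++)
            if (bmap(inode, probe_block + i) != first + i)
                    contiguous = 0;
    /* swap simply declines to use a !contiguous page; md has to
     * write it anyway */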

> > 
> > All this (and a set_fs(KERNEL_DS), ug) looks like a step backwards to me. 
> > Operating at the pagecache a_ops level looked better, and more
> > filesystem-independent.
> 
> If you really want filesystem independence, you need to use vfs_read
> and vfs_write to read/write the file.

yup.

>  I have a patch which did that,
> but decided that the possibility of kmalloc failure at awkward times
> would make that not suitable.

submit_bh() can and will allocate memory, although most decent device
drivers should be OK.

There are tricks we can do with writepage.  If the backing filesystem uses
buffer_heads and if you hold a ref on the page then we know that there
won't be any buffer_head allocations nor any disk reads in the writepage
path.  It'll go direct into bio_alloc+submit_bio, just as you're doing now.
IOW: no gain.

> So I now use vfs_read to read in the file (just like knfsd) and
> bmap/submit_bh to write out the file (just like swapfile).
> 
> I don't think a_ops really provides an interface that I can use, partly
> because, as I said in a previous email, it isn't really a public
> interface to a filesystem.

It's publicer than bmap+submit_bh!

> > 
> > I haven't looked at this patch at all closely yet.  Do I really need to?
> 
> I assume you are asking that because you hope I will retract the
> patch.

Was kinda hoping that would be the outcome.  It's rather gruesome, using
set_fs()+vfs_read() on one side and submit_bh() on the other.

Are you sure the handling at EOF for a non-multiple-of-PAGE_SIZE file
is OK?

The loss of pagecache coherency seems sad.  I assume there's never a
requirement for userspace to read this file.

invalidate_inode_pages() is best-effort.  If someone else has the page
locked or if the page is mmapped then the attempt to take down the
pagecache will fail.  That's relatively-OK, because it'll just lead to
userspace seeing wrong stuff, and we've already conceded that.

But if the pagecache is dirty then invalidate_inode_pages() will leave it
dirty and the VM will later write it back, corrupting your bitmap file. 
You should get i_writecount, fsync the file and then run
invalidate_inode_pages().

Or not run invalidate_inode_pages() - it doesn't gain anything and will
just reduce the observability of bugs.  Better off leaving the pagecache
there all the time so that any rarely-occurring bugs become all-the-time
bugs.

You might as well use kernel_read() too, if you insist on being oddball ;)


Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling to use bmap to file blocks.

2006-05-12 Thread Neil Brown
On Friday May 12, [EMAIL PROTECTED] wrote:
> NeilBrown <[EMAIL PROTECTED]> wrote:
> >
> > If md is asked to store a bitmap in a file, it tries to hold onto the
> > page cache pages for that file, manipulate them directly, and call a
> > cocktail of operations to write the file out.  I don't believe this is
> > a supportable approach.
> 
> erk.  I think it's better than...
> 
> > This patch changes the approach to use the same approach as swap files.
> > i.e. bmap is used to enumerate all the block addresses of parts of the file
> > and we write directly to those blocks of the device.
> 
> That's going in at a much lower level.  Even swapfiles don't assume
> buffer_heads.

I'm not "assuming" buffer_heads.  I'm creating buffer heads and using
them for my own purposes.  These are my pages and my buffer heads.
None of them belong to the filesystem.
The buffer_heads are simply a convenient data-structure to record the
several block addresses for each page.  I could have equally created
an array storing all the addresses, and built the required bios by
hand at write time.  But buffer_heads did most of the work for me, so
I used them.
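
Concretely, the enumeration looks something like this (a sketch of the
idea, not the patch code; map_page_blocks is an invented name, and
attaching the buffers to the page is omitted):

    static void map_page_blocks(struct file *file, struct page *page)
    {
            struct inode *inode = file->f_dentry->d_inode;
            struct buffer_head *bh;
            sector_t block = (sector_t)page->index <<
                             (PAGE_CACHE_SHIFT - inode->i_blkbits);

            /* one buffer_head per fs block in this page */
            bh = alloc_page_buffers(page, 1 << inode->i_blkbits, 0);
            for (; bh; bh = bh->b_this_page, block++) {
                    bh->b_blocknr = bmap(inode, block);
                    /* b_blocknr == 0 would mean a hole (or past EOF) */
                    bh->b_bdev = inode->i_sb->s_bdev;
            }
    }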

Yes, it is a lower level, but
 1/ I am certain that there will be no kmalloc problems and
 2/ Because it is exactly the level used by swapfile, I know that it
is sufficiently 'well defined' that no-one is going to break it.

> 
> When playing with bmap() one needs to be careful that nobody has truncated
> the file on you, else you end up writing to disk blocks which aren't part
> of the file any more.

Well we currently play games with i_writecount to ensure that no-one
else has the file open for write.  And if no-one else can get write
access, then it cannot be truncated.
I did think about adding the S_SWAPFILE flag, but decided to leave
that for a separate patch and review different approaches to
preventing write access first (e.g. can I use a lease?).
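
For reference, the i_writecount game looks like this (a sketch;
deny_write_access() is the same helper the kernel uses to protect
running executables):

    int err = deny_write_access(file);  /* -ETXTBSY if already open
                                           for write */
    if (err)
            return err;

    /* ... no-one can now open the file for write, or truncate it ... */

    allow_write_access(file);           /* when releasing the bitmap file */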

> 
> All this (and a set_fs(KERNEL_DS), ug) looks like a step backwards to me. 
> Operating at the pagecache a_ops level looked better, and more
> filesystem-independent.

If you really want filesystem independence, you need to use vfs_read
and vfs_write to read/write the file.  I have a patch which did that,
but decided that the possibility of kmalloc failure at awkward times
would make that not suitable.
So I now use vfs_read to read in the file (just like knfsd) and
bmap/submit_bh to write out the file (just like swapfile).

I don't think a_ops really provides an interface that I can use, partly
because, as I said in a previous email, it isn't really a public
interface to a filesystem.

> 
> I haven't looked at this patch at all closely yet.  Do I really need to?

I assume you are asking that because you hope I will retract the
patch.  While I'm always open to being educated, I am not yet
convinced that there is any better way, or even any other usable way,
to do what needs to be done, so I am not inclined to retract the
patch.

I'd like to say that you don't need to read it because it is perfect,
but unfortunately history suggests that is unlikely to be true.

Whether you look more closely is of course up to you, but I'm convinced
that the patch is in the right direction, and your review and comments are
always very valuable.

NeilBrown


Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling to use bmap to file blocks.

2006-05-12 Thread Andrew Morton
NeilBrown <[EMAIL PROTECTED]> wrote:
>
> If md is asked to store a bitmap in a file, it tries to hold onto the
> page cache pages for that file, manipulate them directly, and call a
> cocktail of operations to write the file out.  I don't believe this is
> a supportable approach.

erk.  I think it's better than...

> This patch changes the approach to use the same approach as swap files.
> i.e. bmap is used to enumerate all the block addresses of parts of the file
> and we write directly to those blocks of the device.

That's going in at a much lower level.  Even swapfiles don't assume
buffer_heads.

When playing with bmap() one needs to be careful that nobody has truncated
the file on you, else you end up writing to disk blocks which aren't part
of the file any more.

All this (and a set_fs(KERNEL_DS), ug) looks like a step backwards to me. 
Operating at the pagecache a_ops level looked better, and more
filesystem-independent.

I haven't looked at this patch at all closely yet.  Do I really need to?


[PATCH 008 of 8] md/bitmap: Change md/bitmap file handling to use bmap to file blocks.

2006-05-11 Thread NeilBrown

If md is asked to store a bitmap in a file, it tries to hold onto the
page cache pages for that file, manipulate them directly, and call a
cocktail of operations to write the file out.  I don't believe this is
a supportable approach.

This patch changes the approach to use the same approach as swap files.
i.e. bmap is used to enumerate all the block addresses of parts of the file
and we write directly to those blocks of the device.

swapfile only uses parts of the file that provide full pages at
contiguous addresses.  We don't have that luxury so we have to cope
with pages that are non-contiguous in storage.  To handle this
we attach buffers to each page, and store the addresses in those buffers.

With this approach the pagecache may contain data which is inconsistent
with what is on disk.  To alleviate the problems this can cause, md
invalidates the pagecache when releasing the file.  If the file is to be
examined while the array is active (a non-critical but occasionally
useful function), O_DIRECT io must be used.  And a new version of mdadm
will have support for this.
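
On the userspace side, an O_DIRECT read looks roughly like this (a sketch;
the path is a placeholder and 4096-byte alignment is an assumption about
the underlying device):

    #define _GNU_SOURCE                     /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
            void *buf;
            int fd = open("/path/to/bitmap-file", O_RDONLY | O_DIRECT);

            posix_memalign(&buf, 4096, 4096); /* O_DIRECT wants aligned I/O */
            read(fd, buf, 4096);              /* bypasses the stale pagecache */
            close(fd);
            return 0;
    }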

This approach simplifies a lot of code:
 - we no longer need to keep a list of pages which we need to wait for,
   as the b_end_io function can keep track of how many outstanding
   writes there are.  This saves a mempool.
 - -EAGAIN returns from write_page are no longer possible (not sure if
   they ever were actually).



Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/bitmap.c |  283 --
 ./include/linux/raid/bitmap.h |7 -
 2 files changed, 139 insertions(+), 151 deletions(-)

diff ./drivers/md/bitmap.c~current~ ./drivers/md/bitmap.c
--- ./drivers/md/bitmap.c~current~  2006-05-12 16:00:03.0 +1000
+++ ./drivers/md/bitmap.c   2006-05-12 16:01:06.0 +1000
@@ -67,7 +67,6 @@ static inline char * bmname(struct bitma
return bitmap->mddev ? mdname(bitmap->mddev) : "mdX";
 }
 
-#define WRITE_POOL_SIZE 256
 
 /*
  * just a placeholder - calls kmalloc for bitmap pages
@@ -279,75 +278,137 @@ static int write_sb_page(mddev_t *mddev,
  */
 static int write_page(struct bitmap *bitmap, struct page *page, int wait)
 {
-   int ret = -ENOMEM;
+   struct buffer_head *bh;
 
if (bitmap->file == NULL)
return write_sb_page(bitmap->mddev, bitmap->offset, page, wait);
 
-   flush_dcache_page(page); /* make sure visible to anyone reading the file */
+   bh = page_buffers(page);
 
-   if (wait)
-   lock_page(page);
-   else {
-   if (TestSetPageLocked(page))
-   return -EAGAIN; /* already locked */
-   if (PageWriteback(page)) {
-   unlock_page(page);
-   return -EAGAIN;
-   }
+   while (bh && bh->b_blocknr) {
+   atomic_inc(&bitmap->pending_writes);
+   set_buffer_locked(bh);
+   set_buffer_mapped(bh);
+   submit_bh(WRITE, bh);
+   bh = bh->b_this_page;
+   }
+
+   if (wait) {
+   wait_event(bitmap->write_wait,
+  atomic_read(&bitmap->pending_writes)==0);
+   return (bitmap->flags & BITMAP_WRITE_ERROR) ? -EIO : 0;
}
+   return 0;
+}
+
+static void end_bitmap_write(struct buffer_head *bh, int uptodate)
+{
+   struct bitmap *bitmap = bh->b_private;
+   unsigned long flags;
 
-   ret = page->mapping->a_ops->prepare_write(bitmap->file, page, 0, PAGE_SIZE);
-   if (!ret)
-   ret = page->mapping->a_ops->commit_write(bitmap->file, page, 0,
-   PAGE_SIZE);
-   if (ret) {
-   unlock_page(page);
-   return ret;
+   if (!uptodate) {
+   spin_lock_irqsave(&bitmap->lock, flags);
+   bitmap->flags |= BITMAP_WRITE_ERROR;
+   spin_unlock_irqrestore(&bitmap->lock, flags);
}
+   if (atomic_dec_and_test(&bitmap->pending_writes))
+   wake_up(&bitmap->write_wait);
+}
 
-   set_page_dirty(page); /* force it to be written out */
+/* copied from buffer.c */
+static void
+__clear_page_buffers(struct page *page)
+{
+   ClearPagePrivate(page);
+   set_page_private(page, 0);
+   page_cache_release(page);
+}
+static void free_buffers(struct page *page)
+{
+   struct buffer_head *bh = page_buffers(page);
 
-   if (!wait) {
-   /* add to list to be waited for */
-   struct page_list *item = mempool_alloc(bitmap->write_pool, GFP_NOIO);
-   item->page = page;
-   spin_lock(&bitmap->write_lock);
-   list_add(&item->list, &bitmap->complete_pages);
-   spin_unlock(&bitmap->write_lock);
+   while (bh) {
+   struct buffer_head *next = bh->b_this_page;
+   free_buffer_head(bh);
+   bh = next;
}
-   return write_one_pag