[PATCH] Fix private_list handling

2008-02-04 Thread Jan Kara
There are two possible races in handling of private_list in buffer cache.
1) When fsync_buffers_list() processes a private_list, it clears
b_assoc_mapping and moves buffer to its private list. Now drop_buffers() comes,
sees a buffer is on list so it calls __remove_assoc_queue() which complains
about b_assoc_mapping being cleared (as it cannot propagate possible IO error).
This race has been actually observed in the wild.
2) When fsync_buffers_list() processes a private_list,
mark_buffer_dirty_inode() can be called on bh which is already on the private
list of fsync_buffers_list(). As buffer is on some list (note that the check is
performed without private_lock), it is not readded to the mapping's
private_list and after fsync_buffers_list() finishes, we have a dirty buffer
which should be on private_list but it isn't. This race has not been reported,
probably because most (but not all) callers of mark_buffer_dirty_inode() hold
i_mutex and thus are serialized with fsync().

Fix these issues by not clearing b_assoc_map when fsync_buffers_list() moves
buffer to a dedicated list and by reinserting buffer in private_list when
it is found dirty after we have submitted buffer for IO. We also change the
tests whether a buffer is on a private list from
 !list_empty(&bh->b_assoc_buffers) to bh->b_assoc_map
so that they are single word reads and hence lockless checks are safe.

Signed-off-by: Jan Kara <[EMAIL PROTECTED]>
CC: Nick Piggin <[EMAIL PROTECTED]>

---
Hi Andrew,

this is the latest version of the fix for private_list handling races. The
changes are now smaller so I hope you'll like the patch more... Please
apply if you now think the patch is fine. Thanks.

Honza

diff --git a/fs/buffer.c b/fs/buffer.c
index 7249e01..76e1ab4 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -678,7 +678,7 @@ void mark_buffer_dirty_inode(struct buffer_head *bh, struct inode *inode)
} else {
BUG_ON(mapping->assoc_mapping != buffer_mapping);
}
-   if (list_empty(&bh->b_assoc_buffers)) {
+   if (!bh->b_assoc_map) {
spin_lock(&buffer_mapping->private_lock);
list_move_tail(&bh->b_assoc_buffers,
&mapping->private_list);
@@ -794,6 +794,7 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
 {
struct buffer_head *bh;
struct list_head tmp;
+   struct address_space *mapping;
int err = 0, err2;
 
INIT_LIST_HEAD(&tmp);
@@ -801,9 +802,14 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
spin_lock(lock);
while (!list_empty(list)) {
bh = BH_ENTRY(list->next);
+   mapping = bh->b_assoc_map;
__remove_assoc_queue(bh);
+   /* Avoid race with mark_buffer_dirty_inode() which does
+* a lockless check and we rely on seeing the dirty bit */
+   smp_mb();
if (buffer_dirty(bh) || buffer_locked(bh)) {
list_add(&bh->b_assoc_buffers, &tmp);
+   bh->b_assoc_map = mapping;
if (buffer_dirty(bh)) {
get_bh(bh);
spin_unlock(lock);
@@ -822,8 +828,17 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
 
while (!list_empty(&tmp)) {
bh = BH_ENTRY(tmp.prev);
-   list_del_init(&bh->b_assoc_buffers);
get_bh(bh);
+   mapping = bh->b_assoc_map;
+   __remove_assoc_queue(bh);
+   /* Avoid race with mark_buffer_dirty_inode() which does
+* a lockless check and we rely on seeing the dirty bit */
+   smp_mb();
+   if (buffer_dirty(bh)) {
+   list_add(&bh->b_assoc_buffers,
+&mapping->private_list);
+   bh->b_assoc_map = mapping;
+   }
spin_unlock(lock);
wait_on_buffer(bh);
if (!buffer_uptodate(bh))
@@ -1195,7 +1210,7 @@ void __brelse(struct buffer_head * buf)
 void __bforget(struct buffer_head *bh)
 {
clear_buffer_dirty(bh);
-   if (!list_empty(&bh->b_assoc_buffers)) {
+   if (bh->b_assoc_map) {
struct address_space *buffer_mapping = bh->b_page->mapping;
 
spin_lock(&buffer_mapping->private_lock);
@@ -3037,7 +3052,7 @@ drop_buffers(struct page *page, struct buffer_head **buffers_to_free)
do {
struct buffer_head *next = bh->b_this_page;
 
-   if (bh->b_assoc_map)
+   if (bh->b_assoc_map)
__remove_assoc_queue(bh);
bh = next;
} while (bh != head);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Fix private_list handling

2008-01-14 Thread Jan Kara
On Fri 11-01-08 15:33:54, Andrew Morton wrote:
> On Fri, 11 Jan 2008 15:21:31 +0100
> Jan Kara <[EMAIL PROTECTED]> wrote:
> 
> > On Thu 10-01-08 16:36:35, Andrew Morton wrote:
> > > On Thu, 10 Jan 2008 16:55:13 +0100
> > > Jan Kara <[EMAIL PROTECTED]> wrote:
> > > 
> > > >   Hi,
> > > > 
> > > >   sorry for the previous empty email...
> > > > 
> > > >   Supriya noted in his testing that sometimes buffers removed by
> > > > __remove_assoc_queue() don't have b_assoc_mapping set (and thus IO error
> > > > won't be properly propagated). Actually, looking more into the code I 
> > > > found
> > > > there are some more races. The patch below should fix them. It survived
> > > > beating with LTP and fsstress on ext2 filesystem on my testing machine 
> > > > so
> > > > it should be reasonably bugfree... Andrew, would you put the patch into
> > > > -mm? Thanks.
> > > > 
> > > > Honza
> > > > -- 
> > > > Jan Kara <[EMAIL PROTECTED]>
> > > > SUSE Labs, CR
> > > > ---
> > > > 
> > > > There are two possible races in handling of private_list in buffer 
> > > > cache.
> > > > 1) When fsync_buffers_list() processes a private_list, it clears
> > > > b_assoc_mapping and moves buffer to its private list. Now 
> > > > drop_buffers() comes,
> > > > sees a buffer is on list so it calls __remove_assoc_queue() which 
> > > > complains
> > > > about b_assoc_mapping being cleared (as it cannot propagate possible IO 
> > > > error).
> > > > This race has been actually observed in the wild.
> > > 
> > > private_lock should prevent this race.
> > > 
> > > Which call to drop_buffers() is the culprit?  The first one in
> > > try_to_free_buffers(), I assume?  The "can this still happen?" one?
> > > 
> > > If so, it can happen.  How?  Perhaps this is a bug.
> >   Good question but I don't think so. The problem is that
> > fsync_buffers_list() drops the private_lock() e.g. when it does
> > wait_on_buffer(). And buffers on the list fsync_buffers_list() constructs
> > have b_assoc_mapping set to NULL but the test
> > !list_empty(&bh->b_assoc_buffers) succeeds for them and thus
> > __remove_assoc_queue() is called and it complains.
> >   We could also silence the warning by leaving b_assoc_mapping set when we
> > move the buffer to the constructed list.
> 
> Could fsync_buffers_list() leave the buffer not on a list when it does the
> ll_rw_block() and only add it to tmp afterwards if it sees that the buffer
> is still not on a list?
> 
> I guess not, as it still needs to find the buffer to wait upon it.
> 
> > But given the problem below
> > I've decided to do a more complete cleanup of the code.
> 
> Well.  Large-scale changes to long-established code like this are a last
> resort.  We should fully investigate little local fixups first.
> 
> > > > 2) When fsync_buffers_list() processes a private_list,
> > > > mark_buffer_dirty_inode() can be called on bh which is already on the 
> > > > private
> > > > list of fsync_buffers_list(). As buffer is on some list (note that the 
> > > > check is
> > > > performed without private_lock), it is not readded to the mapping's
> > > > private_list and after fsync_buffers_list() finishes, we have a dirty 
> > > > buffer
> > > > which should be on private_list but it isn't. This race has not been 
> > > > reported,
> > > > probably because most (but not all) callers of 
> > > > mark_buffer_dirty_inode() hold
> > > > i_mutex and thus are serialized with fsync().
> > > 
> > > Maybe fsync_buffers_list should put the buffer back onto private_list if 
> > > it
> > > got dirtied again.
> >   Yes, that's what it does in my new version. Only the locking is a bit
> > subtle if we want to avoid taking private_lock in mark_buffer_dirty_inode()
> > when the buffer is already on some list...
> 
> I fear rewrites in there.  It took five years to find this bug.  How long
> will it take to find new ones which get added?
> 
> Sigh.  lists do suck for this sort of thing.
  OK, below is a minimal fix for the problems...


Honza
-- 
Jan Kara <[EMAIL PROTECTED]>
SUSE Labs, CR
---

There are two possible races in handling of private_list in buffer cache.
1) When fsync_buffers_list() processes a private_list, it clears
b_assoc_mapping and moves buffer to its private list. Now drop_buffers() comes,
sees a buffer is on list so it calls __remove_assoc_queue() which complains
about b_assoc_mapping being cleared (as it cannot propagate possible IO error).
This race has been actually observed in the wild.
2) When fsync_buffers_list() processes a private_list,
mark_buffer_dirty_inode() can be called on bh which is already on the private
list of fsync_buffers_list(). As buffer is on some list (note that the check is
performed without private_lock), it is not readded to the mapping's
private_list and after fsync_buffers_list() finishes, we have a dirty buffer
which should be on private_list but 

Re: [PATCH] Fix private_list handling

2008-01-14 Thread Jan Kara
On Fri 11-01-08 15:33:54, Andrew Morton wrote:
> On Fri, 11 Jan 2008 15:21:31 +0100
> Jan Kara <[EMAIL PROTECTED]> wrote:
> 
> > On Thu 10-01-08 16:36:35, Andrew Morton wrote:
> > > On Thu, 10 Jan 2008 16:55:13 +0100
> > > Jan Kara <[EMAIL PROTECTED]> wrote:
> > > 
> > > >   Hi,
> > > > 
> > > >   sorry for the previous empty email...
> > > > 
> > > >   Supriya noted in his testing that sometimes buffers removed by
> > > > __remove_assoc_queue() don't have b_assoc_mapping set (and thus IO error
> > > > won't be properly propagated). Actually, looking more into the code I 
> > > > found
> > > > there are some more races. The patch below should fix them. It survived
> > > > beating with LTP and fsstress on ext2 filesystem on my testing machine 
> > > > so
> > > > it should be reasonably bugfree... Andrew, would you put the patch into
> > > > -mm? Thanks.
> > > > 
> > > > Honza
> > > > -- 
> > > > Jan Kara <[EMAIL PROTECTED]>
> > > > SUSE Labs, CR
> > > > ---
> > > > 
> > > > There are two possible races in handling of private_list in buffer 
> > > > cache.
> > > > 1) When fsync_buffers_list() processes a private_list, it clears
> > > > b_assoc_mapping and moves buffer to its private list. Now 
> > > > drop_buffers() comes,
> > > > sees a buffer is on list so it calls __remove_assoc_queue() which 
> > > > complains
> > > > about b_assoc_mapping being cleared (as it cannot propagate possible IO 
> > > > error).
> > > > This race has been actually observed in the wild.
> > > 
> > > private_lock should prevent this race.
> > > 
> > > Which call to drop_buffers() is the culprit?  The first one in
> > > try_to_free_buffers(), I assume?  The "can this still happen?" one?
> > > 
> > > If so, it can happen.  How?  Perhaps this is a bug.
> >   Good question but I don't think so. The problem is that
> > fsync_buffers_list() drops the private_lock() e.g. when it does
> > wait_on_buffer(). And buffers on the list fsync_buffers_list() constructs
> > have b_assoc_mapping set to NULL but the test
> > !list_empty(&bh->b_assoc_buffers) succeeds for them and thus
> > __remove_assoc_queue() is called and it complains.
> >   We could also silence the warning by leaving b_assoc_mapping set when we
> > move the buffer to the constructed list.
> 
> Could fsync_buffers_list() leave the buffer not on a list when it does the
> ll_rw_block() and only add it to tmp afterwards if it sees that the buffer
> is still not on a list?
> 
> I guess not, as it still needs to find the buffer to wait upon it.
  Exactly... We have to have submitted buffers on some list.

> > But given the problem below
> > I've decided to do a more complete cleanup of the code.
> 
> Well.  Large-scale changes to long-established code like this are a last
> resort.  We should fully investigate little local fixups first.
  I see. Well, I could do the following minimal changes if it'd make you more
comfortable about it:
  1) Don't clear b_assoc_map in fsync_buffers_list() (that would get rid of
race with drop_buffers())
  2) Change fsync_buffers_list() to readd buffer into private_list when it
is dirty after we've waited on it. That should get rid of the second
race.
  But I still think that unnecessary temporary list is worth a cleanup ;).

> > > > 2) When fsync_buffers_list() processes a private_list,
> > > > mark_buffer_dirty_inode() can be called on bh which is already on the 
> > > > private
> > > > list of fsync_buffers_list(). As buffer is on some list (note that the 
> > > > check is
> > > > performed without private_lock), it is not readded to the mapping's
> > > > private_list and after fsync_buffers_list() finishes, we have a dirty 
> > > > buffer
> > > > which should be on private_list but it isn't. This race has not been 
> > > > reported,
> > > > probably because most (but not all) callers of 
> > > > mark_buffer_dirty_inode() hold
> > > > i_mutex and thus are serialized with fsync().
> > > 
> > > Maybe fsync_buffers_list should put the buffer back onto private_list if 
> > > it
> > > got dirtied again.
> >   Yes, that's what it does in my new version. Only the locking is a bit
> > subtle if we want to avoid taking private_lock in mark_buffer_dirty_inode()
> > when the buffer is already on some list...
> 
> I fear rewrites in there.  It took five years to find this bug.  How long
> will it take to find new ones which get added?
  Well, I've added that WARN_ON() which warned us about the problem about
14 months ago so it was not that bad. But I agree that bugs are hard to
detect here as they result either in lost IO error or in not writing all
buffers on fsync both of which go mostly unnoticed.

Honza
-- 
Jan Kara <[EMAIL PROTECTED]>
SUSE Labs, CR

Re: [PATCH] Fix private_list handling

2008-01-11 Thread Andrew Morton
On Fri, 11 Jan 2008 15:21:31 +0100
Jan Kara <[EMAIL PROTECTED]> wrote:

> On Thu 10-01-08 16:36:35, Andrew Morton wrote:
> > On Thu, 10 Jan 2008 16:55:13 +0100
> > Jan Kara <[EMAIL PROTECTED]> wrote:
> > 
> > >   Hi,
> > > 
> > >   sorry for the previous empty email...
> > > 
> > >   Supriya noted in his testing that sometimes buffers removed by
> > > __remove_assoc_queue() don't have b_assoc_mapping set (and thus IO error
> > > won't be properly propagated). Actually, looking more into the code I 
> > > found
> > > there are some more races. The patch below should fix them. It survived
> > > beating with LTP and fsstress on ext2 filesystem on my testing machine so
> > > it should be reasonably bugfree... Andrew, would you put the patch into
> > > -mm? Thanks.
> > > 
> > >   Honza
> > > -- 
> > > Jan Kara <[EMAIL PROTECTED]>
> > > SUSE Labs, CR
> > > ---
> > > 
> > > There are two possible races in handling of private_list in buffer cache.
> > > 1) When fsync_buffers_list() processes a private_list, it clears
> > > b_assoc_mapping and moves buffer to its private list. Now drop_buffers() 
> > > comes,
> > > sees a buffer is on list so it calls __remove_assoc_queue() which 
> > > complains
> > > about b_assoc_mapping being cleared (as it cannot propagate possible IO 
> > > error).
> > > This race has been actually observed in the wild.
> > 
> > private_lock should prevent this race.
> > 
> > Which call to drop_buffers() is the culprit?  The first one in
> > try_to_free_buffers(), I assume?  The "can this still happen?" one?
> > 
> > If so, it can happen.  How?  Perhaps this is a bug.
>   Good question but I don't think so. The problem is that
> fsync_buffers_list() drops the private_lock() e.g. when it does
> wait_on_buffer(). And buffers on the list fsync_buffers_list() constructs
> have b_assoc_mapping set to NULL but the test
> !list_empty(&bh->b_assoc_buffers) succeeds for them and thus
> __remove_assoc_queue() is called and it complains.
>   We could also silence the warning by leaving b_assoc_mapping set when we
> move the buffer to the constructed list.

Could fsync_buffers_list() leave the buffer not on a list when it does the
ll_rw_block() and only add it to tmp afterwards if it sees that the buffer
is still not on a list?

I guess not, as it still needs to find the buffer to wait upon it.

> But given the problem below
> I've decided to do a more complete cleanup of the code.

Well.  Large-scale changes to long-established code like this are a last
resort.  We should fully investigate little local fixups first.

> > > 2) When fsync_buffers_list() processes a private_list,
> > > mark_buffer_dirty_inode() can be called on bh which is already on the 
> > > private
> > > list of fsync_buffers_list(). As buffer is on some list (note that the 
> > > check is
> > > performed without private_lock), it is not readded to the mapping's
> > > private_list and after fsync_buffers_list() finishes, we have a dirty 
> > > buffer
> > > which should be on private_list but it isn't. This race has not been 
> > > reported,
> > > probably because most (but not all) callers of mark_buffer_dirty_inode() 
> > > hold
> > > i_mutex and thus are serialized with fsync().
> > 
> > Maybe fsync_buffers_list should put the buffer back onto private_list if it
> > got dirtied again.
>   Yes, that's what it does in my new version. Only the locking is a bit
> subtle if we want to avoid taking private_lock in mark_buffer_dirty_inode()
> when the buffer is already on some list...

I fear rewrites in there.  It took five years to find this bug.  How long
will it take to find new ones which get added?

Sigh.  lists do suck for this sort of thing.


Re: [PATCH] Fix private_list handling

2008-01-11 Thread Jan Kara
On Thu 10-01-08 16:36:35, Andrew Morton wrote:
> On Thu, 10 Jan 2008 16:55:13 +0100
> Jan Kara <[EMAIL PROTECTED]> wrote:
> 
> >   Hi,
> > 
> >   sorry for the previous empty email...
> > 
> >   Supriya noted in his testing that sometimes buffers removed by
> > __remove_assoc_queue() don't have b_assoc_mapping set (and thus IO error
> > won't be properly propagated). Actually, looking more into the code I found
> > there are some more races. The patch below should fix them. It survived
> > beating with LTP and fsstress on ext2 filesystem on my testing machine so
> > it should be reasonably bugfree... Andrew, would you put the patch into
> > -mm? Thanks.
> > 
> > Honza
> > -- 
> > Jan Kara <[EMAIL PROTECTED]>
> > SUSE Labs, CR
> > ---
> > 
> > There are two possible races in handling of private_list in buffer cache.
> > 1) When fsync_buffers_list() processes a private_list, it clears
> > b_assoc_mapping and moves buffer to its private list. Now drop_buffers()
> > comes, sees a buffer is on list so it calls __remove_assoc_queue() which
> > complains about b_assoc_mapping being cleared (as it cannot propagate
> > possible IO error). This race has been actually observed in the wild.
> 
> private_lock should prevent this race.
> 
> Which call to drop_buffers() is the culprit?  The first one in
> try_to_free_buffers(), I assume?  The "can this still happen?" one?
> 
> If so, it can happen.  How?  Perhaps this is a bug.
  Good question but I don't think so. The problem is that
fsync_buffers_list() drops the private_lock() e.g. when it does
wait_on_buffer(). And buffers on the list fsync_buffers_list() constructs
have b_assoc_mapping set to NULL but the test
!list_empty(&bh->b_assoc_buffers) succeeds for them and thus
__remove_assoc_queue() is called and it complains.
  We could also silence the warning by leaving b_assoc_mapping set when we
move the buffer to the constructed list. But given the problem below
I've decided to do a more complete cleanup of the code.

> > 2) When fsync_buffers_list() processes a private_list,
> > mark_buffer_dirty_inode() can be called on bh which is already on the
> > private list of fsync_buffers_list(). As buffer is on some list (note
> > that the check is performed without private_lock), it is not readded to
> > the mapping's private_list and after fsync_buffers_list() finishes, we
> > have a dirty buffer which should be on private_list but it isn't. This
> > race has not been reported, probably because most (but not all) callers
> > of mark_buffer_dirty_inode() hold i_mutex and thus are serialized with
> > fsync().
> 
> Maybe fsync_buffers_list should put the buffer back onto private_list if it
> got dirtied again.
  Yes, that's what it does in my new version. Only the locking is a bit
subtle if we want to avoid taking private_lock in mark_buffer_dirty_inode()
when the buffer is already on some list...

Honza
-- 
Jan Kara <[EMAIL PROTECTED]>
SUSE Labs, CR


Re: [PATCH] Fix private_list handling

2008-01-11 Thread Andrew Morton
On Fri, 11 Jan 2008 15:21:31 +0100
Jan Kara <[EMAIL PROTECTED]> wrote:

> On Thu 10-01-08 16:36:35, Andrew Morton wrote:
> > On Thu, 10 Jan 2008 16:55:13 +0100
> > Jan Kara <[EMAIL PROTECTED]> wrote:
> > 
> > >   Hi,
> > > 
> > >   sorry for the previous empty email...
> > > 
> > >   Supriya noted in his testing that sometimes buffers removed by
> > > __remove_assoc_queue() don't have b_assoc_mapping set (and thus IO error
> > > won't be properly propagated). Actually, looking more into the code I
> > > found there are some more races. The patch below should fix them. It
> > > survived beating with LTP and fsstress on ext2 filesystem on my testing
> > > machine so it should be reasonably bugfree... Andrew, would you put the
> > > patch into -mm? Thanks.
> > > 
> > >   Honza
> > > -- 
> > > Jan Kara <[EMAIL PROTECTED]>
> > > SUSE Labs, CR
> > > ---
> > > 
> > > There are two possible races in handling of private_list in buffer cache.
> > > 1) When fsync_buffers_list() processes a private_list, it clears
> > > b_assoc_mapping and moves buffer to its private list. Now drop_buffers()
> > > comes, sees a buffer is on list so it calls __remove_assoc_queue() which
> > > complains about b_assoc_mapping being cleared (as it cannot propagate
> > > possible IO error). This race has been actually observed in the wild.
> > 
> > private_lock should prevent this race.
> > 
> > Which call to drop_buffers() is the culprit?  The first one in
> > try_to_free_buffers(), I assume?  The "can this still happen?" one?
> > 
> > If so, it can happen.  How?  Perhaps this is a bug.
>   Good question but I don't think so. The problem is that
> fsync_buffers_list() drops the private_lock() e.g. when it does
> wait_on_buffer(). And buffers on the list fsync_buffers_list() constructs
> have b_assoc_mapping set to NULL but the test
> !list_empty(&bh->b_assoc_buffers) succeeds for them and thus
> __remove_assoc_queue() is called and it complains.
>   We could also silence the warning by leaving b_assoc_mapping set when we
> move the buffer to the constructed list.

Could fsync_buffers_list() leave the buffer not on a list when it does the
ll_rw_block() and only add it to tmp afterwards if it sees that the buffer
is still not on a list?

I guess not, as it still needs to find the buffer to wait upon it.

> But given the problem below
> I've decided to do a more complete cleanup of the code.

Well.  Large-scale changes to long-established code like this are a last
resort.  We should fully investigate little local fixups first.

> > > 2) When fsync_buffers_list() processes a private_list,
> > > mark_buffer_dirty_inode() can be called on bh which is already on the
> > > private list of fsync_buffers_list(). As buffer is on some list (note
> > > that the check is performed without private_lock), it is not readded to
> > > the mapping's private_list and after fsync_buffers_list() finishes, we
> > > have a dirty buffer which should be on private_list but it isn't. This
> > > race has not been reported, probably because most (but not all) callers
> > > of mark_buffer_dirty_inode() hold i_mutex and thus are serialized with
> > > fsync().
> > 
> > Maybe fsync_buffers_list should put the buffer back onto private_list if it
> > got dirtied again.
>   Yes, that's what it does in my new version. Only the locking is a bit
> subtle if we want to avoid taking private_lock in mark_buffer_dirty_inode()
> when the buffer is already on some list...

I fear rewrites in there.  It took five years to find this bug.  How long
will it take to find new ones which get added?

Sigh.  lists do suck for this sort of thing.


Re: [PATCH] Fix private_list handling

2008-01-10 Thread Andrew Morton
On Thu, 10 Jan 2008 16:55:13 +0100
Jan Kara <[EMAIL PROTECTED]> wrote:

>   Hi,
> 
>   sorry for the previous empty email...
> 
>   Supriya noted in his testing that sometimes buffers removed by
> __remove_assoc_queue() don't have b_assoc_mapping set (and thus IO error
> won't be properly propagated). Actually, looking more into the code I found
> there are some more races. The patch below should fix them. It survived
> beating with LTP and fsstress on ext2 filesystem on my testing machine so
> it should be reasonably bugfree... Andrew, would you put the patch into
> -mm? Thanks.
> 
>   Honza
> -- 
> Jan Kara <[EMAIL PROTECTED]>
> SUSE Labs, CR
> ---
> 
> There are two possible races in handling of private_list in buffer cache.
> 1) When fsync_buffers_list() processes a private_list, it clears
> b_assoc_mapping and moves buffer to its private list. Now drop_buffers()
> comes, sees a buffer is on list so it calls __remove_assoc_queue() which
> complains about b_assoc_mapping being cleared (as it cannot propagate
> possible IO error). This race has been actually observed in the wild.

private_lock should prevent this race.

Which call to drop_buffers() is the culprit?  The first one in
try_to_free_buffers(), I assume?  The "can this still happen?" one?

If so, it can happen.  How?  Perhaps this is a bug.

> 2) When fsync_buffers_list() processes a private_list,
> mark_buffer_dirty_inode() can be called on bh which is already on the
> private list of fsync_buffers_list(). As buffer is on some list (note that
> the check is performed without private_lock), it is not readded to the
> mapping's private_list and after fsync_buffers_list() finishes, we have a
> dirty buffer which should be on private_list but it isn't. This race has
> not been reported, probably because most (but not all) callers of
> mark_buffer_dirty_inode() hold i_mutex and thus are serialized with
> fsync().

Maybe fsync_buffers_list should put the buffer back onto private_list if it
got dirtied again.



Re: [PATCH] Fix private_list handling

2008-01-10 Thread Jan Kara
  Hi,

  sorry for the previous empty email...

  Supriya noted in his testing that sometimes buffers removed by
__remove_assoc_queue() don't have b_assoc_mapping set (and thus IO error
won't be properly propagated). Actually, looking more into the code I found
there are some more races. The patch below should fix them. It survived
beating with LTP and fsstress on ext2 filesystem on my testing machine so
it should be reasonably bugfree... Andrew, would you put the patch into
-mm? Thanks.

Honza
-- 
Jan Kara <[EMAIL PROTECTED]>
SUSE Labs, CR
---

There are two possible races in handling of private_list in buffer cache.
1) When fsync_buffers_list() processes a private_list, it clears
b_assoc_mapping and moves buffer to its private list. Now drop_buffers() comes,
sees a buffer is on list so it calls __remove_assoc_queue() which complains
about b_assoc_mapping being cleared (as it cannot propagate possible IO error).
This race has been actually observed in the wild.
2) When fsync_buffers_list() processes a private_list,
mark_buffer_dirty_inode() can be called on bh which is already on the private
list of fsync_buffers_list(). As buffer is on some list (note that the check is
performed without private_lock), it is not readded to the mapping's
private_list and after fsync_buffers_list() finishes, we have a dirty buffer
which should be on private_list but it isn't. This race has not been reported,
probably because most (but not all) callers of mark_buffer_dirty_inode() hold
i_mutex and thus are serialized with fsync().

Fix these issues by rewriting fsync_buffers_list() to not move buffers to
dedicated private list (which has an additional bonus of reducing code size)
and scan original private_list instead (some care has to be taken to avoid
racing with drop_buffers()). Also mark_buffer_dirty_inode() is modified
to avoid race 2).

Signed-off-by: Jan Kara <[EMAIL PROTECTED]>

diff --git a/fs/buffer.c b/fs/buffer.c
index 7249e01..8691fa5 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -563,15 +563,6 @@ EXPORT_SYMBOL(mark_buffer_async_write);
  * take an address_space, not an inode.  And it should be called
  * mark_buffer_dirty_fsync() to clearly define why those buffers are being
  * queued up.
- *
- * FIXME: mark_buffer_dirty_inode() doesn't need to add the buffer to the
- * list if it is already on a list.  Because if the buffer is on a list,
- * it *must* already be on the right one.  If not, the filesystem is being
- * silly.  This will save a ton of locking.  But first we have to ensure
- * that buffers are taken *off* the old inode's list when they are freed
- * (presumably in truncate).  That requires careful auditing of all
- * filesystems (do it inside bforget()).  It could also be done by bringing
- * b_inode back.
  */
 
 /*
@@ -591,41 +582,6 @@ int inode_has_buffers(struct inode *inode)
return !list_empty(&inode->i_data.private_list);
 }
 
-/*
- * osync is designed to support O_SYNC io.  It waits synchronously for
- * all already-submitted IO to complete, but does not queue any new
- * writes to the disk.
- *
- * To do O_SYNC writes, just queue the buffer writes with ll_rw_block as
- * you dirty the buffers, and then use osync_inode_buffers to wait for
- * completion.  Any other dirty buffers which are not yet queued for
- * write will not be flushed to disk by the osync.
- */
-static int osync_buffers_list(spinlock_t *lock, struct list_head *list)
-{
-   struct buffer_head *bh;
-   struct list_head *p;
-   int err = 0;
-
-   spin_lock(lock);
-repeat:
-   list_for_each_prev(p, list) {
-   bh = BH_ENTRY(p);
-   if (buffer_locked(bh)) {
-   get_bh(bh);
-   spin_unlock(lock);
-   wait_on_buffer(bh);
-   if (!buffer_uptodate(bh))
-   err = -EIO;
-   brelse(bh);
-   spin_lock(lock);
-   goto repeat;
-   }
-   }
-   spin_unlock(lock);
-   return err;
-}
-
 /**
  * sync_mapping_buffers - write out and wait upon a mapping's "associated"
  *buffers
@@ -634,6 +590,11 @@ repeat:
  * Starts I/O against the buffers at mapping->private_list, and waits upon
  * that I/O.
  *
+ * The caller must make sure (usually by holding i_mutex) that we cannot race
+ * with invalidate_inode_buffers() or remove_inode_buffers() calling
+ * __remove_assoc_queue(). Racing with drop_buffers() is fine and is taken care
+ * of.
+ *
  * Basically, this is a convenience function for fsync().
  * @mapping is a file or directory which needs those buffers to be written for
  * a successful fsync().
@@ -667,22 +628,28 @@ void write_boundary_block(struct block_device *bdev,
}
 }
 
+/* Mark buffer as dirty and add it to inode mapping's private_list. We play
+ * tricks to avoid locking private_lock when not needed and 

[PATCH] Fix private_list handling

2008-01-10 Thread Jan Kara
  Hi,


-- 
Jan Kara <[EMAIL PROTECTED]>
SUSE Labs, CR