[PATCH] Fix private_list handling
There are two possible races in handling of private_list in buffer cache.

1) When fsync_buffers_list() processes a private_list, it clears
b_assoc_mapping and moves buffer to its private list. Now drop_buffers()
comes, sees a buffer is on list so it calls __remove_assoc_queue() which
complains about b_assoc_mapping being cleared (as it cannot propagate
possible IO error). This race has been actually observed in the wild.

2) When fsync_buffers_list() processes a private_list,
mark_buffer_dirty_inode() can be called on bh which is already on the
private list of fsync_buffers_list(). As buffer is on some list (note that
the check is performed without private_lock), it is not readded to the
mapping's private_list and after fsync_buffers_list() finishes, we have a
dirty buffer which should be on private_list but it isn't. This race has
not been reported, probably because most (but not all) callers of
mark_buffer_dirty_inode() hold i_mutex and thus are serialized with
fsync().

Fix these issues by not clearing b_assoc_map when fsync_buffers_list()
moves buffer to a dedicated list and by reinserting buffer in private_list
when it is found dirty after we have submitted buffer for IO. We also
change the tests whether a buffer is on a private list from
!list_empty(&bh->b_assoc_buffers) to bh->b_assoc_map so that they are
single word reads and hence lockless checks are safe.

Signed-off-by: Jan Kara <[EMAIL PROTECTED]>
CC: Nick Piggin <[EMAIL PROTECTED]>
---
Hi Andrew,

this is the latest version of the fix for private_list handling races.
The changes are now smaller so I hope you'll like the patch more...
Please apply if you now think the patch is fine. Thanks.
								Honza

diff --git a/fs/buffer.c b/fs/buffer.c
index 7249e01..76e1ab4 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -678,7 +678,7 @@ void mark_buffer_dirty_inode(struct buffer_head *bh, struct inode *inode)
 	} else {
 		BUG_ON(mapping->assoc_mapping != buffer_mapping);
 	}
-	if (list_empty(&bh->b_assoc_buffers)) {
+	if (!bh->b_assoc_map) {
 		spin_lock(&buffer_mapping->private_lock);
 		list_move_tail(&bh->b_assoc_buffers,
 				&mapping->private_list);
@@ -794,6 +794,7 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
 {
 	struct buffer_head *bh;
 	struct list_head tmp;
+	struct address_space *mapping;
 	int err = 0, err2;
 
 	INIT_LIST_HEAD(&tmp);
@@ -801,9 +802,14 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
 	spin_lock(lock);
 	while (!list_empty(list)) {
 		bh = BH_ENTRY(list->next);
+		mapping = bh->b_assoc_map;
 		__remove_assoc_queue(bh);
+		/* Avoid race with mark_buffer_dirty_inode() which does
+		 * a lockless check and we rely on seeing the dirty bit */
+		smp_mb();
 		if (buffer_dirty(bh) || buffer_locked(bh)) {
 			list_add(&bh->b_assoc_buffers, &tmp);
+			bh->b_assoc_map = mapping;
 			if (buffer_dirty(bh)) {
 				get_bh(bh);
 				spin_unlock(lock);
@@ -822,8 +828,17 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
 
 	while (!list_empty(&tmp)) {
 		bh = BH_ENTRY(tmp.prev);
-		list_del_init(&bh->b_assoc_buffers);
 		get_bh(bh);
+		mapping = bh->b_assoc_map;
+		__remove_assoc_queue(bh);
+		/* Avoid race with mark_buffer_dirty_inode() which does
+		 * a lockless check and we rely on seeing the dirty bit */
+		smp_mb();
+		if (buffer_dirty(bh)) {
+			list_add(&bh->b_assoc_buffers,
+				 &bh->b_assoc_map->private_list);
+			bh->b_assoc_map = mapping;
+		}
 		spin_unlock(lock);
 		wait_on_buffer(bh);
 		if (!buffer_uptodate(bh))
@@ -1195,7 +1210,7 @@ void __brelse(struct buffer_head * buf)
 void __bforget(struct buffer_head *bh)
 {
 	clear_buffer_dirty(bh);
-	if (!list_empty(&bh->b_assoc_buffers)) {
+	if (bh->b_assoc_map) {
 		struct address_space *buffer_mapping = bh->b_page->mapping;
 
 		spin_lock(&buffer_mapping->private_lock);
@@ -3037,7 +3052,7 @@ drop_buffers(struct page *page, struct buffer_head **buffers_to_free)
 	do {
 		struct buffer_head *next = bh->b_this_page;
-		if (!list_empty(&bh->b_assoc_buffers))
+		if (bh->b_assoc_map)
 			__remove_assoc_queue(bh);
 		bh = next;
 	} while (bh != head);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
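[Editorial illustration: the patch's switch from `!list_empty(&bh->b_assoc_buffers)` to `bh->b_assoc_map` can be modelled in a small, hypothetical userspace sketch. The structs and helpers below are toy stand-ins, not the kernel's fs/buffer.c; the point is that `b_assoc_map` is a single pointer-sized field that stays meaningful even while the buffer sits on the temporary list.]

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the kernel's list_head / address_space / buffer_head. */
struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }
static int list_empty(const struct list_head *h) { return h->next == h; }

static void list_move_tail(struct list_head *n, struct list_head *h)
{
	/* unlink from wherever n currently is... */
	n->prev->next = n->next;
	n->next->prev = n->prev;
	/* ...and append to h */
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

struct address_space { struct list_head private_list; };

struct buffer_head {
	struct list_head b_assoc_buffers;
	struct address_space *b_assoc_map;	/* single-word association flag */
};

/* mark_buffer_dirty_inode()-style attach: bh goes on mapping's private_list. */
static void attach(struct buffer_head *bh, struct address_space *mapping)
{
	list_move_tail(&bh->b_assoc_buffers, &mapping->private_list);
	bh->b_assoc_map = mapping;
}

/* fsync_buffers_list()-style move to a temporary list.  The patch's key
 * change: b_assoc_map is deliberately *kept*, so a lockless observer
 * reading that one word still sees the buffer as associated. */
static void move_to_tmp(struct buffer_head *bh, struct list_head *tmp)
{
	list_move_tail(&bh->b_assoc_buffers, tmp);
	/* b_assoc_map left set */
}
```

A `list_empty()` check, by contrast, dereferences the node's `next` pointer while another CPU may be relinking it, which is why the commit message prefers the single-word read for lockless tests.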
Re: [PATCH] Fix private_list handling
On Fri 11-01-08 15:33:54, Andrew Morton wrote: > On Fri, 11 Jan 2008 15:21:31 +0100 > Jan Kara <[EMAIL PROTECTED]> wrote: > > > On Thu 10-01-08 16:36:35, Andrew Morton wrote: > > > On Thu, 10 Jan 2008 16:55:13 +0100 > > > Jan Kara <[EMAIL PROTECTED]> wrote: > > > > > > > Hi, > > > > > > > > sorry for the previous empty email... > > > > > > > > Supriya noted in his testing that sometimes buffers removed by > > > > __remove_assoc_queue() don't have b_assoc_mapping set (and thus IO error > > > > won't be properly propagated). Actually, looking more into the code I > > > > found > > > > there are some more races. The patch below should fix them. It survived > > > > beating with LTP and fsstress on ext2 filesystem on my testing machine > > > > so > > > > it should be reasonably bugfree... Andrew, would you put the patch into > > > > -mm? Thanks. > > > > > > > > Honza > > > > -- > > > > Jan Kara <[EMAIL PROTECTED]> > > > > SUSE Labs, CR > > > > --- > > > > > > > > There are two possible races in handling of private_list in buffer > > > > cache. > > > > 1) When fsync_buffers_list() processes a private_list, it clears > > > > b_assoc_mapping and moves buffer to its private list. Now > > > > drop_buffers() comes, > > > > sees a buffer is on list so it calls __remove_assoc_queue() which > > > > complains > > > > about b_assoc_mapping being cleared (as it cannot propagate possible IO > > > > error). > > > > This race has been actually observed in the wild. > > > > > > private_lock should prevent this race. > > > > > > Which call to drop_buffers() is the culprit? The first one in > > > try_to_free_buffers(), I assume? The "can this still happen?" one? > > > > > > If so, it can happen. How? Perhaps this is a bug. > > Good question but I don't think so. The problem is that > > fsync_buffers_list() drops the private_lock() e.g. when it does > > wait_on_buffer(). 
And buffers on the list fsync_buffers_list() constructs > > have b_assoc_mapping set to NULL but the test > > !list_empty(bh->b_assoc_buffers) succeeds for them and thus > > __remove_assoc_queue() is called and it complains. > > We could also silence the warning by leaving b_assoc_mapping set when we > > move the buffer to the constructed list. > > Could fsync_buffers_list() leave the buffer not on a list when it does the > ll_rw_block() and only add it to tmp afterwards if it sees that the buffer > is still not on a list? > > I guess not, as it still needs to find the buffer to wait upon it. > > > But given the problem below > > I've decided to do a more complete cleanup of the code. > > Well. Large-scale changes to long-established code like this are a last > resort. We should fully investigate little local fixups first. > > > > > 2) When fsync_buffers_list() processes a private_list, > > > > mark_buffer_dirty_inode() can be called on bh which is already on the > > > > private > > > > list of fsync_buffers_list(). As buffer is on some list (note that the > > > > check is > > > > performed without private_lock), it is not readded to the mapping's > > > > private_list and after fsync_buffers_list() finishes, we have a dirty > > > > buffer > > > > which should be on private_list but it isn't. This race has not been > > > > reported, > > > > probably because most (but not all) callers of > > > > mark_buffer_dirty_inode() hold > > > > i_mutex and thus are serialized with fsync(). > > > > > > Maybe fsync_buffers_list should put the buffer back onto private_list if > > > it > > > got dirtied again. > > Yes, that's what it does in my new version. Only the locking is a bit > > subtle if we want to avoid taking private_lock in mark_buffer_dirty_inode() > > when the buffer is already on some list... > > I fear rewrites in there. It took five years to find this bug. How long > will it take to find new ones which get added? > > Sigh. lists do suck for this sort of thing. 
OK, below is a minimal fix for the problems... Honza -- Jan Kara <[EMAIL PROTECTED]> SUSE Labs, CR --- There are two possible races in handling of private_list in buffer cache. 1) When fsync_buffers_list() processes a private_list, it clears b_assoc_mapping and moves buffer to its private list. Now drop_buffers() comes, sees a buffer is on list so it calls __remove_assoc_queue() which complains about b_assoc_mapping being cleared (as it cannot propagate possible IO error). This race has been actually observed in the wild. 2) When fsync_buffers_list() processes a private_list, mark_buffer_dirty_inode() can be called on bh which is already on the private list of fsync_buffers_list(). As buffer is on some list (note that the check is performed without private_lock), it is not readded to the mapping's private_list and after fsync_buffers_list() finishes, we have a dirty buffer which should be on private_list but it isn't. This race has not been reported, probably because most (but not all) callers of mark_buffer_dirty_inode() hold i_mutex and thus are serialized with fsync(). Fix these issues by not clearing b_assoc_map when fsync_buffers_list() moves buffer to a dedicated list and by reinserting buffer in private_list when it is found dirty after we have submitted buffer for IO.
Re: [PATCH] Fix private_list handling
On Fri 11-01-08 15:33:54, Andrew Morton wrote: > On Fri, 11 Jan 2008 15:21:31 +0100 > Jan Kara <[EMAIL PROTECTED]> wrote: > > > On Thu 10-01-08 16:36:35, Andrew Morton wrote: > > > On Thu, 10 Jan 2008 16:55:13 +0100 > > > Jan Kara <[EMAIL PROTECTED]> wrote: > > > > > > > Hi, > > > > > > > > sorry for the previous empty email... > > > > > > > > Supriya noted in his testing that sometimes buffers removed by > > > > __remove_assoc_queue() don't have b_assoc_mapping set (and thus IO error > > > > won't be properly propagated). Actually, looking more into the code I > > > > found > > > > there are some more races. The patch below should fix them. It survived > > > > beating with LTP and fsstress on ext2 filesystem on my testing machine > > > > so > > > > it should be reasonably bugfree... Andrew, would you put the patch into > > > > -mm? Thanks. > > > > > > > > Honza > > > > -- > > > > Jan Kara <[EMAIL PROTECTED]> > > > > SUSE Labs, CR > > > > --- > > > > > > > > There are two possible races in handling of private_list in buffer > > > > cache. > > > > 1) When fsync_buffers_list() processes a private_list, it clears > > > > b_assoc_mapping and moves buffer to its private list. Now > > > > drop_buffers() comes, > > > > sees a buffer is on list so it calls __remove_assoc_queue() which > > > > complains > > > > about b_assoc_mapping being cleared (as it cannot propagate possible IO > > > > error). > > > > This race has been actually observed in the wild. > > > > > > private_lock should prevent this race. > > > > > > Which call to drop_buffers() is the culprit? The first one in > > > try_to_free_buffers(), I assume? The "can this still happen?" one? > > > > > > If so, it can happen. How? Perhaps this is a bug. > > Good question but I don't think so. The problem is that > > fsync_buffers_list() drops the private_lock() e.g. when it does > > wait_on_buffer(). 
And buffers on the list fsync_buffers_list() constructs > > have b_assoc_mapping set to NULL but the test > > !list_empty(bh->b_assoc_buffers) succeeds for them and thus > > __remove_assoc_queue() is called and it complains. > > We could also silence the warning by leaving b_assoc_mapping set when we > > move the buffer to the constructed list. > > Could fsync_buffers_list() leave the buffer not on a list when it does the > ll_rw_block() and only add it to tmp afterwards if it sees that the buffer > is still not on a list? > > I guess not, as it still needs to find the buffer to wait upon it. Exactly... We have to have submitted buffers on some list. > > But given the problem below > > I've decided to do a more complete cleanup of the code. > > Well. Large-scale changes to long-established code like this are a last > resort. We should fully investigate little local fixups first. I see. Well, I could do the following minimal changes if it'd make you more comfortable about it): 1) Don't clear b_assoc_map in fsync_buffers_list() (that would get rid of race with drop_buffers()) 2) Change fsync_buffers_list() to readd buffer into private_list when it is dirty after we've waited on it. That should get rid of the second race. But I still think that unnecessary temporary list is worth a cleanup ;). > > > > 2) When fsync_buffers_list() processes a private_list, > > > > mark_buffer_dirty_inode() can be called on bh which is already on the > > > > private > > > > list of fsync_buffers_list(). As buffer is on some list (note that the > > > > check is > > > > performed without private_lock), it is not readded to the mapping's > > > > private_list and after fsync_buffers_list() finishes, we have a dirty > > > > buffer > > > > which should be on private_list but it isn't. This race has not been > > > > reported, > > > > probably because most (but not all) callers of > > > > mark_buffer_dirty_inode() hold > > > > i_mutex and thus are serialized with fsync(). 
> > > Maybe fsync_buffers_list should put the buffer back onto private_list if
> > > it got dirtied again.
> > Yes, that's what it does in my new version. Only the locking is a bit
> > subtle if we want to avoid taking private_lock in mark_buffer_dirty_inode()
> > when the buffer is already on some list...
> I fear rewrites in there. It took five years to find this bug. How long
> will it take to find new ones which get added?
Well, I've added that WARN_ON() which warned us about the problem about 14 months ago so it was not that bad. But I agree that bugs are hard to detect here as they result either in a lost IO error or in not writing all buffers on fsync, both of which go mostly unnoticed.

								Honza
--
Jan Kara <[EMAIL PROTECTED]>
SUSE Labs, CR
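[Editorial illustration: the second of the two minimal changes Jan enumerates above — re-adding a buffer to `private_list` when it is found dirty after the wait — can be sketched with a hypothetical userspace model. Toy structs and a plain `dirty` flag stand in for the kernel's buffer state; `finish_wait_phase()` is an invented name mirroring the second while-loop of fsync_buffers_list() in the patch.]

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for kernel structures. */
struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }
static int list_empty(const struct list_head *h) { return h->next == h; }

static void list_del_init(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	INIT_LIST_HEAD(n);
}

static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

struct address_space { struct list_head private_list; };

struct buffer_head {
	struct list_head b_assoc_buffers;
	struct address_space *b_assoc_map;
	int dirty;			/* stand-in for buffer_dirty(bh) */
};

/* Change (2) from the mail above: once the wait phase for this buffer is
 * done, a buffer that was re-dirtied while we slept goes back on the
 * mapping's private_list so the next fsync() still sees it; a clean
 * buffer is fully detached. */
static void finish_wait_phase(struct buffer_head *bh)
{
	struct address_space *mapping = bh->b_assoc_map;

	list_del_init(&bh->b_assoc_buffers);	/* leave the tmp list */
	bh->b_assoc_map = NULL;
	if (bh->dirty && mapping) {
		list_add_tail(&bh->b_assoc_buffers, &mapping->private_list);
		bh->b_assoc_map = mapping;
	}
}
```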
Re: [PATCH] Fix private_list handling
On Fri, 11 Jan 2008 15:21:31 +0100 Jan Kara <[EMAIL PROTECTED]> wrote: > On Thu 10-01-08 16:36:35, Andrew Morton wrote: > > On Thu, 10 Jan 2008 16:55:13 +0100 > > Jan Kara <[EMAIL PROTECTED]> wrote: > > > > > Hi, > > > > > > sorry for the previous empty email... > > > > > > Supriya noted in his testing that sometimes buffers removed by > > > __remove_assoc_queue() don't have b_assoc_mapping set (and thus IO error > > > won't be properly propagated). Actually, looking more into the code I > > > found > > > there are some more races. The patch below should fix them. It survived > > > beating with LTP and fsstress on ext2 filesystem on my testing machine so > > > it should be reasonably bugfree... Andrew, would you put the patch into > > > -mm? Thanks. > > > > > > Honza > > > -- > > > Jan Kara <[EMAIL PROTECTED]> > > > SUSE Labs, CR > > > --- > > > > > > There are two possible races in handling of private_list in buffer cache. > > > 1) When fsync_buffers_list() processes a private_list, it clears > > > b_assoc_mapping and moves buffer to its private list. Now drop_buffers() > > > comes, > > > sees a buffer is on list so it calls __remove_assoc_queue() which > > > complains > > > about b_assoc_mapping being cleared (as it cannot propagate possible IO > > > error). > > > This race has been actually observed in the wild. > > > > private_lock should prevent this race. > > > > Which call to drop_buffers() is the culprit? The first one in > > try_to_free_buffers(), I assume? The "can this still happen?" one? > > > > If so, it can happen. How? Perhaps this is a bug. > Good question but I don't think so. The problem is that > fsync_buffers_list() drops the private_lock() e.g. when it does > wait_on_buffer(). And buffers on the list fsync_buffers_list() constructs > have b_assoc_mapping set to NULL but the test > !list_empty(bh->b_assoc_buffers) succeeds for them and thus > __remove_assoc_queue() is called and it complains. 
> We could also silence the warning by leaving b_assoc_mapping set when we > move the buffer to the constructed list. Could fsync_buffers_list() leave the buffer not on a list when it does the ll_rw_block() and only add it to tmp afterwards if it sees that the buffer is still not on a list? I guess not, as it still needs to find the buffer to wait upon it. > But given the problem below > I've decided to do a more complete cleanup of the code. Well. Large-scale changes to long-established code like this are a last resort. We should fully investigate little local fixups first. > > > 2) When fsync_buffers_list() processes a private_list, > > > mark_buffer_dirty_inode() can be called on bh which is already on the > > > private > > > list of fsync_buffers_list(). As buffer is on some list (note that the > > > check is > > > performed without private_lock), it is not readded to the mapping's > > > private_list and after fsync_buffers_list() finishes, we have a dirty > > > buffer > > > which should be on private_list but it isn't. This race has not been > > > reported, > > > probably because most (but not all) callers of mark_buffer_dirty_inode() > > > hold > > > i_mutex and thus are serialized with fsync(). > > > > Maybe fsync_buffers_list should put the buffer back onto private_list if it > > got dirtied again. > Yes, that's what it does in my new version. Only the locking is a bit > subtle if we want to avoid taking private_lock in mark_buffer_dirty_inode() > when the buffer is already on some list... I fear rewrites in there. It took five years to find this bug. How long will it take to find new ones which get added? Sigh. lists do suck for this sort of thing.
Re: [PATCH] Fix private_list handling
On Thu 10-01-08 16:36:35, Andrew Morton wrote: > On Thu, 10 Jan 2008 16:55:13 +0100 > Jan Kara <[EMAIL PROTECTED]> wrote: > > > Hi, > > > > sorry for the previous empty email... > > > > Supriya noted in his testing that sometimes buffers removed by > > __remove_assoc_queue() don't have b_assoc_mapping set (and thus IO error > > won't be properly propagated). Actually, looking more into the code I found > > there are some more races. The patch below should fix them. It survived > > beating with LTP and fsstress on ext2 filesystem on my testing machine so > > it should be reasonably bugfree... Andrew, would you put the patch into > > -mm? Thanks. > > > > Honza > > -- > > Jan Kara <[EMAIL PROTECTED]> > > SUSE Labs, CR > > --- > > > > There are two possible races in handling of private_list in buffer cache. > > 1) When fsync_buffers_list() processes a private_list, it clears > > b_assoc_mapping and moves buffer to its private list. Now drop_buffers() > > comes, > > sees a buffer is on list so it calls __remove_assoc_queue() which complains > > about b_assoc_mapping being cleared (as it cannot propagate possible IO > > error). > > This race has been actually observed in the wild. > > private_lock should prevent this race. > > Which call to drop_buffers() is the culprit? The first one in > try_to_free_buffers(), I assume? The "can this still happen?" one? > > If so, it can happen. How? Perhaps this is a bug. Good question but I don't think so. The problem is that fsync_buffers_list() drops the private_lock() e.g. when it does wait_on_buffer(). And buffers on the list fsync_buffers_list() constructs have b_assoc_mapping set to NULL but the test !list_empty(bh->b_assoc_buffers) succeeds for them and thus __remove_assoc_queue() is called and it complains. We could also silence the warning by leaving b_assoc_mapping set when we move the buffer to the constructed list. But given the problem below I've decided to do a more complete cleanup of the code. 
> > 2) When fsync_buffers_list() processes a private_list,
> > mark_buffer_dirty_inode() can be called on a bh which is already on the
> > private list of fsync_buffers_list(). As the buffer is on some list (note
> > that the check is performed without private_lock), it is not re-added to
> > the mapping's private_list, and after fsync_buffers_list() finishes, we
> > have a dirty buffer which should be on private_list but isn't. This race
> > has not been reported, probably because most (but not all) callers of
> > mark_buffer_dirty_inode() hold i_mutex and thus are serialized with
> > fsync().
>
> Maybe fsync_buffers_list should put the buffer back onto private_list if it
> got dirtied again.
  Yes, that's what it does in my new version. Only the locking is a bit
subtle if we want to avoid taking private_lock in mark_buffer_dirty_inode()
when the buffer is already on some list...

								Honza
--
Jan Kara <[EMAIL PROTECTED]>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Fix private_list handling
On Fri, 11 Jan 2008 15:21:31 +0100 Jan Kara <[EMAIL PROTECTED]> wrote:
> On Thu 10-01-08 16:36:35, Andrew Morton wrote:
> > On Thu, 10 Jan 2008 16:55:13 +0100 Jan Kara <[EMAIL PROTECTED]> wrote:
> > > There are two possible races in handling of private_list in the buffer cache.
> > > 1) When fsync_buffers_list() processes a private_list, it clears
> > > b_assoc_mapping and moves the buffer to its private list. Now drop_buffers()
> > > comes, sees the buffer is on a list, and so calls __remove_assoc_queue(),
> > > which complains about b_assoc_mapping being cleared (as it cannot propagate
> > > a possible IO error). This race has actually been observed in the wild.
> >
> > private_lock should prevent this race.
> >
> > Which call to drop_buffers() is the culprit? The first one in
> > try_to_free_buffers(), I assume? The "can this still happen?" one?
> >
> > If so, it can happen. How? Perhaps this is a bug.
>
>   Good question but I don't think so. The problem is that
> fsync_buffers_list() drops the private_lock, e.g. when it does
> wait_on_buffer(). The buffers on the list fsync_buffers_list() constructs
> have b_assoc_mapping set to NULL, but the test
> !list_empty(&bh->b_assoc_buffers) succeeds for them, and thus
> __remove_assoc_queue() is called and it complains.
>
>   We could also silence the warning by leaving b_assoc_mapping set when we
> move the buffer to the constructed list.

Could fsync_buffers_list() leave the buffer not on a list when it does the
ll_rw_block() and only add it to tmp afterwards if it sees that the buffer is
still not on a list? I guess not, as it still needs to find the buffer to
wait upon it.

> But given the problem below I've decided to do a more complete cleanup of
> the code.

Well. Large-scale changes to long-established code like this are a last
resort. We should fully investigate little local fixups first.

> > > 2) When fsync_buffers_list() processes a private_list,
> > > mark_buffer_dirty_inode() can be called on a bh which is already on the
> > > private list of fsync_buffers_list(). As the buffer is on some list (note
> > > that the check is performed without private_lock), it is not re-added to
> > > the mapping's private_list, and after fsync_buffers_list() finishes, we
> > > have a dirty buffer which should be on private_list but isn't. This race
> > > has not been reported, probably because most (but not all) callers of
> > > mark_buffer_dirty_inode() hold i_mutex and thus are serialized with
> > > fsync().
> >
> > Maybe fsync_buffers_list should put the buffer back onto private_list if
> > it got dirtied again.
>
>   Yes, that's what it does in my new version. Only the locking is a bit
> subtle if we want to avoid taking private_lock in mark_buffer_dirty_inode()
> when the buffer is already on some list...

I fear rewrites in there. It took five years to find this bug. How long will
it take to find new ones which get added?

Sigh. Lists do suck for this sort of thing.
Re: [PATCH] Fix private_list handling
On Thu, 10 Jan 2008 16:55:13 +0100 Jan Kara <[EMAIL PROTECTED]> wrote:
> Hi,
>
>   sorry for the previous empty email...
>
>   Supriya noted in his testing that sometimes buffers removed by
> __remove_assoc_queue() don't have b_assoc_mapping set (and thus an IO error
> won't be properly propagated). Actually, looking more into the code I found
> there are some more races. The patch below should fix them. It survived
> beating with LTP and fsstress on an ext2 filesystem on my testing machine,
> so it should be reasonably bug-free... Andrew, would you put the patch into
> -mm? Thanks.
>
> 								Honza
> --
> Jan Kara <[EMAIL PROTECTED]>
> SUSE Labs, CR
> ---
>
> There are two possible races in handling of private_list in the buffer cache.
> 1) When fsync_buffers_list() processes a private_list, it clears
> b_assoc_mapping and moves the buffer to its private list. Now drop_buffers()
> comes, sees the buffer is on a list, and so calls __remove_assoc_queue(),
> which complains about b_assoc_mapping being cleared (as it cannot propagate
> a possible IO error). This race has actually been observed in the wild.

private_lock should prevent this race.

Which call to drop_buffers() is the culprit? The first one in
try_to_free_buffers(), I assume? The "can this still happen?" one?

If so, it can happen. How? Perhaps this is a bug.

> 2) When fsync_buffers_list() processes a private_list,
> mark_buffer_dirty_inode() can be called on a bh which is already on the
> private list of fsync_buffers_list(). As the buffer is on some list (note
> that the check is performed without private_lock), it is not re-added to
> the mapping's private_list, and after fsync_buffers_list() finishes, we
> have a dirty buffer which should be on private_list but isn't. This race
> has not been reported, probably because most (but not all) callers of
> mark_buffer_dirty_inode() hold i_mutex and thus are serialized with fsync().

Maybe fsync_buffers_list should put the buffer back onto private_list if it
got dirtied again.
Re: [PATCH] Fix private_list handling
Hi,

  sorry for the previous empty email...

  Supriya noted in his testing that sometimes buffers removed by __remove_assoc_queue() don't have b_assoc_mapping set (and thus an IO error won't be properly propagated). Actually, looking more into the code I found there are some more races. The patch below should fix them. It survived beating with LTP and fsstress on an ext2 filesystem on my testing machine, so it should be reasonably bug-free... Andrew, would you put the patch into -mm? Thanks.

								Honza
--
Jan Kara <[EMAIL PROTECTED]>
SUSE Labs, CR
---

There are two possible races in handling of private_list in the buffer cache.

1) When fsync_buffers_list() processes a private_list, it clears b_assoc_mapping and moves the buffer to its private list. Now drop_buffers() comes, sees the buffer is on a list, and so calls __remove_assoc_queue(), which complains about b_assoc_mapping being cleared (as it cannot propagate a possible IO error). This race has actually been observed in the wild.

2) When fsync_buffers_list() processes a private_list, mark_buffer_dirty_inode() can be called on a bh which is already on the private list of fsync_buffers_list(). As the buffer is on some list (note that the check is performed without private_lock), it is not re-added to the mapping's private_list, and after fsync_buffers_list() finishes, we have a dirty buffer which should be on private_list but isn't. This race has not been reported, probably because most (but not all) callers of mark_buffer_dirty_inode() hold i_mutex and thus are serialized with fsync().

Fix these issues by rewriting fsync_buffers_list() to not move buffers to a dedicated private list (which has the additional bonus of reducing code size) and to scan the original private_list instead (some care has to be taken to avoid racing with drop_buffers()). Also, mark_buffer_dirty_inode() is modified to avoid race 2).
Signed-off-by: Jan Kara <[EMAIL PROTECTED]>

diff --git a/fs/buffer.c b/fs/buffer.c
index 7249e01..8691fa5 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -563,15 +563,6 @@ EXPORT_SYMBOL(mark_buffer_async_write);
  * take an address_space, not an inode.  And it should be called
  * mark_buffer_dirty_fsync() to clearly define why those buffers are being
  * queued up.
- *
- * FIXME: mark_buffer_dirty_inode() doesn't need to add the buffer to the
- * list if it is already on a list.  Because if the buffer is on a list,
- * it *must* already be on the right one.  If not, the filesystem is being
- * silly.  This will save a ton of locking.  But first we have to ensure
- * that buffers are taken *off* the old inode's list when they are freed
- * (presumably in truncate).  That requires careful auditing of all
- * filesystems (do it inside bforget()).  It could also be done by bringing
- * b_inode back.
  */
 
 /*
@@ -591,41 +582,6 @@ int inode_has_buffers(struct inode *inode)
 	return !list_empty(&inode->i_data.private_list);
 }
 
-/*
- * osync is designed to support O_SYNC io.  It waits synchronously for
- * all already-submitted IO to complete, but does not queue any new
- * writes to the disk.
- *
- * To do O_SYNC writes, just queue the buffer writes with ll_rw_block as
- * you dirty the buffers, and then use osync_inode_buffers to wait for
- * completion.  Any other dirty buffers which are not yet queued for
- * write will not be flushed to disk by the osync.
- */
-static int osync_buffers_list(spinlock_t *lock, struct list_head *list)
-{
-	struct buffer_head *bh;
-	struct list_head *p;
-	int err = 0;
-
-	spin_lock(lock);
-repeat:
-	list_for_each_prev(p, list) {
-		bh = BH_ENTRY(p);
-		if (buffer_locked(bh)) {
-			get_bh(bh);
-			spin_unlock(lock);
-			wait_on_buffer(bh);
-			if (!buffer_uptodate(bh))
-				err = -EIO;
-			brelse(bh);
-			spin_lock(lock);
-			goto repeat;
-		}
-	}
-	spin_unlock(lock);
-	return err;
-}
-
 /**
  * sync_mapping_buffers - write out and wait upon a mapping's "associated"
  *                        buffers
@@ -634,6 +590,11 @@ repeat:
  * Starts I/O against the buffers at mapping->private_list, and waits upon
  * that I/O.
  *
+ * The caller must make sure (usually by holding i_mutex) that we cannot race
+ * with invalidate_inode_buffers() or remove_inode_buffers() calling
+ * __remove_assoc_queue(). Racing with drop_buffers() is fine and is taken care
+ * of.
+ *
  * Basically, this is a convenience function for fsync().
  * @mapping is a file or directory which needs those buffers to be written for
  * a successful fsync().
@@ -667,22 +628,28 @@ void write_boundary_block(struct block_device *bdev,
 	}
 }
 
+/* Mark buffer as dirty and add it to inode mapping's private_list. We play
+ * tricks to avoid locking private_lock when not needed and
[PATCH] Fix private_list handling
Hi,

--
Jan Kara <[EMAIL PROTECTED]>
SUSE Labs, CR