Re: [PATCH v7 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang
On Wed, Mar 08, 2017 at 10:25:52AM +0800, Qu Wenruo wrote: > [BUG] > If run_delalloc_range() returns error and there is already some ordered > extents created, btrfs will be hanged with the following backtrace: > > Call Trace: > __schedule+0x2d4/0xae0 > schedule+0x3d/0x90 > btrfs_start_ordered_extent+0x160/0x200 [btrfs] > ? wake_atomic_t_function+0x60/0x60 > btrfs_run_ordered_extent_work+0x25/0x40 [btrfs] > btrfs_scrubparity_helper+0x1c1/0x620 [btrfs] > btrfs_flush_delalloc_helper+0xe/0x10 [btrfs] > process_one_work+0x2af/0x720 > ? process_one_work+0x22b/0x720 > worker_thread+0x4b/0x4f0 > kthread+0x10f/0x150 > ? process_one_work+0x720/0x720 > ? kthread_create_on_node+0x40/0x40 > ret_from_fork+0x2e/0x40 > > [CAUSE] > > |<-- delalloc range --->| > | OE 1 | OE 2 | ... | OE n | > |<>| |<-- cleanup range ->| > || > \_=> First page handled by end_extent_writepage() in __extent_writepage() > > The problem is caused by error handler of run_delalloc_range(), which > doesn't handle any created ordered extents, leaving them waiting on > btrfs_finish_ordered_io() to finish. > > However after run_delalloc_range() returns error, __extent_writepage() > won't submit bio, so btrfs_writepage_end_io_hook() won't be triggered > except the first page, and btrfs_finish_ordered_io() won't be triggered > for created ordered extents either. > > So OE 2~n will hang forever, and if OE 1 is larger than one page, it > will also hang. > > [FIX] > Introduce btrfs_cleanup_ordered_extents() function to cleanup created > ordered extents and finish them manually. > > The function is based on existing > btrfs_endio_direct_write_update_ordered() function, and modify it to > act just like btrfs_writepage_endio_hook() but handles specified range > other than one page. > > After fix, delalloc error will be handled like: > > |<-- delalloc range --->| > | OE 1 | OE 2 | ... | OE n | > |<>|< --->|<-- old error handler ->| > || || > || \_=> Cleaned up by cleanup_ordered_extents() > \_=> First page handled by end_extent_writepage() in __extent_writepage() Reviewed-by: Liu BoThanks, -liubo > > Signed-off-by: Qu Wenruo > Signed-off-by: Filipe Manana > --- > v2: > Add BTRFS_ORDERED_SKIP_METADATA flag to avoid double reducing > outstanding extents, which is already done by > extent_clear_unlock_delalloc() with EXTENT_DO_ACCOUNT control bit > v3: > Skip first page to avoid underflow ordered->bytes_left. > Fix range passed in cow_file_range() which doesn't cover the whole > extent. > Expend extent_clear_unlock_delalloc() range to allow them to handle > metadata release. > v4: > Don't use extra bit to skip metadata freeing for ordered extent, > but only handle btrfs_reloc_clone_csums() error just before processing > next extent. > This makes error handle much easier for run_delalloc_nocow(). > v5: > Variant gramma and comment fixes suggested by Filipe Manana > Enhanced commit message to focus on the generic error handler bug, > pointed out by Filipe Manana > Adding missing wq/func user in __endio_write_update_ordered(), pointed > out by Filipe Manana. > Enhanced commit message with ascii art to explain the bug more easily. > Fix a bug which can leads to corrupted extent allocation, exposed by > Filipe Manana. > v6: > Split the metadata underflow fix to another patch. > Use better comment and btrfs_cleanup_ordered_extent() implementation > from Filipe Manana. > v7: > Add back const prefix > --- > fs/btrfs/inode.c | 63 > ++-- > 1 file changed, 48 insertions(+), 15 deletions(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index fe588bf30d02..d3bc68bbe0e7 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -115,6 +115,31 @@ static struct extent_map *create_io_em(struct inode > *inode, u64 start, u64 len, > u64 ram_bytes, int compress_type, > int type); > > +static void __endio_write_update_ordered(struct inode *inode, > + const u64 offset, const u64 bytes, > + const bool uptodate); > + > +/* > + * Cleanup all submitted ordered extents in specified range to handle errors > + * from the fill_dellaloc() callback. > + * > + * NOTE: caller must ensure that when an error happens, it can not call > + * extent_clear_unlock_delalloc() to clear both the bits EXTENT_DO_ACCOUNTING > + * and EXTENT_DELALLOC simultaneously, because that causes the reserved > metadata > + * to be released, which we want to happen only when finishing the ordered > + * extent (btrfs_finish_ordered_io()). Also note that the caller of the > + * fill_delalloc()
Re: [PATCH v7 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang
On Wed, Mar 8, 2017 at 2:25 AM, Qu Wenruowrote: > [BUG] > If run_delalloc_range() returns error and there is already some ordered > extents created, btrfs will be hanged with the following backtrace: > > Call Trace: > __schedule+0x2d4/0xae0 > schedule+0x3d/0x90 > btrfs_start_ordered_extent+0x160/0x200 [btrfs] > ? wake_atomic_t_function+0x60/0x60 > btrfs_run_ordered_extent_work+0x25/0x40 [btrfs] > btrfs_scrubparity_helper+0x1c1/0x620 [btrfs] > btrfs_flush_delalloc_helper+0xe/0x10 [btrfs] > process_one_work+0x2af/0x720 > ? process_one_work+0x22b/0x720 > worker_thread+0x4b/0x4f0 > kthread+0x10f/0x150 > ? process_one_work+0x720/0x720 > ? kthread_create_on_node+0x40/0x40 > ret_from_fork+0x2e/0x40 > > [CAUSE] > > |<-- delalloc range --->| > | OE 1 | OE 2 | ... | OE n | > |<>| |<-- cleanup range ->| > || > \_=> First page handled by end_extent_writepage() in __extent_writepage() > > The problem is caused by error handler of run_delalloc_range(), which > doesn't handle any created ordered extents, leaving them waiting on > btrfs_finish_ordered_io() to finish. > > However after run_delalloc_range() returns error, __extent_writepage() > won't submit bio, so btrfs_writepage_end_io_hook() won't be triggered > except the first page, and btrfs_finish_ordered_io() won't be triggered > for created ordered extents either. > > So OE 2~n will hang forever, and if OE 1 is larger than one page, it > will also hang. > > [FIX] > Introduce btrfs_cleanup_ordered_extents() function to cleanup created > ordered extents and finish them manually. > > The function is based on existing > btrfs_endio_direct_write_update_ordered() function, and modify it to > act just like btrfs_writepage_endio_hook() but handles specified range > other than one page. > > After fix, delalloc error will be handled like: > > |<-- delalloc range --->| > | OE 1 | OE 2 | ... | OE n | > |<>|< --->|<-- old error handler ->| > || || > || \_=> Cleaned up by cleanup_ordered_extents() > \_=> First page handled by end_extent_writepage() in __extent_writepage() > > Signed-off-by: Qu Wenruo Reviewed-by: Filipe Manana thanks > Signed-off-by: Filipe Manana > --- > v2: > Add BTRFS_ORDERED_SKIP_METADATA flag to avoid double reducing > outstanding extents, which is already done by > extent_clear_unlock_delalloc() with EXTENT_DO_ACCOUNT control bit > v3: > Skip first page to avoid underflow ordered->bytes_left. > Fix range passed in cow_file_range() which doesn't cover the whole > extent. > Expend extent_clear_unlock_delalloc() range to allow them to handle > metadata release. > v4: > Don't use extra bit to skip metadata freeing for ordered extent, > but only handle btrfs_reloc_clone_csums() error just before processing > next extent. > This makes error handle much easier for run_delalloc_nocow(). > v5: > Variant gramma and comment fixes suggested by Filipe Manana > Enhanced commit message to focus on the generic error handler bug, > pointed out by Filipe Manana > Adding missing wq/func user in __endio_write_update_ordered(), pointed > out by Filipe Manana. > Enhanced commit message with ascii art to explain the bug more easily. > Fix a bug which can leads to corrupted extent allocation, exposed by > Filipe Manana. > v6: > Split the metadata underflow fix to another patch. > Use better comment and btrfs_cleanup_ordered_extent() implementation > from Filipe Manana. > v7: > Add back const prefix > --- > fs/btrfs/inode.c | 63 > ++-- > 1 file changed, 48 insertions(+), 15 deletions(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index fe588bf30d02..d3bc68bbe0e7 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -115,6 +115,31 @@ static struct extent_map *create_io_em(struct inode > *inode, u64 start, u64 len, >u64 ram_bytes, int compress_type, >int type); > > +static void __endio_write_update_ordered(struct inode *inode, > +const u64 offset, const u64 bytes, > +const bool uptodate); > + > +/* > + * Cleanup all submitted ordered extents in specified range to handle errors > + * from the fill_dellaloc() callback. > + * > + * NOTE: caller must ensure that when an error happens, it can not call > + * extent_clear_unlock_delalloc() to clear both the bits EXTENT_DO_ACCOUNTING > + * and EXTENT_DELALLOC simultaneously, because that causes the reserved > metadata > + * to be released, which we want to happen only when finishing the ordered > + * extent (btrfs_finish_ordered_io()). Also note that the caller of the > + *
[PATCH v7 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang
[BUG] If run_delalloc_range() returns error and there is already some ordered extents created, btrfs will be hanged with the following backtrace: Call Trace: __schedule+0x2d4/0xae0 schedule+0x3d/0x90 btrfs_start_ordered_extent+0x160/0x200 [btrfs] ? wake_atomic_t_function+0x60/0x60 btrfs_run_ordered_extent_work+0x25/0x40 [btrfs] btrfs_scrubparity_helper+0x1c1/0x620 [btrfs] btrfs_flush_delalloc_helper+0xe/0x10 [btrfs] process_one_work+0x2af/0x720 ? process_one_work+0x22b/0x720 worker_thread+0x4b/0x4f0 kthread+0x10f/0x150 ? process_one_work+0x720/0x720 ? kthread_create_on_node+0x40/0x40 ret_from_fork+0x2e/0x40 [CAUSE] |<-- delalloc range --->| | OE 1 | OE 2 | ... | OE n | |<>| |<-- cleanup range ->| || \_=> First page handled by end_extent_writepage() in __extent_writepage() The problem is caused by error handler of run_delalloc_range(), which doesn't handle any created ordered extents, leaving them waiting on btrfs_finish_ordered_io() to finish. However after run_delalloc_range() returns error, __extent_writepage() won't submit bio, so btrfs_writepage_end_io_hook() won't be triggered except the first page, and btrfs_finish_ordered_io() won't be triggered for created ordered extents either. So OE 2~n will hang forever, and if OE 1 is larger than one page, it will also hang. [FIX] Introduce btrfs_cleanup_ordered_extents() function to cleanup created ordered extents and finish them manually. The function is based on existing btrfs_endio_direct_write_update_ordered() function, and modify it to act just like btrfs_writepage_endio_hook() but handles specified range other than one page. After fix, delalloc error will be handled like: |<-- delalloc range --->| | OE 1 | OE 2 | ... | OE n | |<>|< --->|<-- old error handler ->| || || || \_=> Cleaned up by cleanup_ordered_extents() \_=> First page handled by end_extent_writepage() in __extent_writepage() Signed-off-by: Qu WenruoSigned-off-by: Filipe Manana --- v2: Add BTRFS_ORDERED_SKIP_METADATA flag to avoid double reducing outstanding extents, which is already done by extent_clear_unlock_delalloc() with EXTENT_DO_ACCOUNT control bit v3: Skip first page to avoid underflow ordered->bytes_left. Fix range passed in cow_file_range() which doesn't cover the whole extent. Expend extent_clear_unlock_delalloc() range to allow them to handle metadata release. v4: Don't use extra bit to skip metadata freeing for ordered extent, but only handle btrfs_reloc_clone_csums() error just before processing next extent. This makes error handle much easier for run_delalloc_nocow(). v5: Variant gramma and comment fixes suggested by Filipe Manana Enhanced commit message to focus on the generic error handler bug, pointed out by Filipe Manana Adding missing wq/func user in __endio_write_update_ordered(), pointed out by Filipe Manana. Enhanced commit message with ascii art to explain the bug more easily. Fix a bug which can leads to corrupted extent allocation, exposed by Filipe Manana. v6: Split the metadata underflow fix to another patch. Use better comment and btrfs_cleanup_ordered_extent() implementation from Filipe Manana. v7: Add back const prefix --- fs/btrfs/inode.c | 63 ++-- 1 file changed, 48 insertions(+), 15 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index fe588bf30d02..d3bc68bbe0e7 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -115,6 +115,31 @@ static struct extent_map *create_io_em(struct inode *inode, u64 start, u64 len, u64 ram_bytes, int compress_type, int type); +static void __endio_write_update_ordered(struct inode *inode, +const u64 offset, const u64 bytes, +const bool uptodate); + +/* + * Cleanup all submitted ordered extents in specified range to handle errors + * from the fill_dellaloc() callback. + * + * NOTE: caller must ensure that when an error happens, it can not call + * extent_clear_unlock_delalloc() to clear both the bits EXTENT_DO_ACCOUNTING + * and EXTENT_DELALLOC simultaneously, because that causes the reserved metadata + * to be released, which we want to happen only when finishing the ordered + * extent (btrfs_finish_ordered_io()). Also note that the caller of the + * fill_delalloc() callback already does proper cleanup for the first page of + * the range, that is, it invokes the callback writepage_end_io_hook() for the + * range of the first page. + */ +static inline void btrfs_cleanup_ordered_extents(struct inode *inode, +const u64 offset, +