Re: [PATCH] btrfs: Handle delalloc error correctly to avoid deadlock
Please ignore this patch. This patch doesn't handle outstanding_extents well.

Thanks,
Qu
[PATCH] btrfs: Handle delalloc error correctly to avoid deadlock
If we run btrfs/125 with the nospace_cache or space_cache=v2 mount option, btrfs blocks with the following backtrace:

Call Trace:
 __schedule+0x2d4/0xae0
 schedule+0x3d/0x90
 btrfs_start_ordered_extent+0x160/0x200 [btrfs]
 ? wake_atomic_t_function+0x60/0x60
 btrfs_run_ordered_extent_work+0x25/0x40 [btrfs]
 btrfs_scrubparity_helper+0x1c1/0x620 [btrfs]
 btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
 process_one_work+0x2af/0x720
 ? process_one_work+0x22b/0x720
 worker_thread+0x4b/0x4f0
 kthread+0x10f/0x150
 ? process_one_work+0x720/0x720
 ? kthread_create_on_node+0x40/0x40
 ret_from_fork+0x2e/0x40

The direct cause is that the error handler in run_delalloc_nocow() doesn't handle errors from btrfs_reloc_clone_csums() well. The error handler of run_delalloc_nocow() clears the dirty flag and finishes IO for the pages in that extent. However, we have already inserted one ordered extent, and that ordered extent relies on endio hooks to wait for all its pages to finish, while only the first page will finish. This makes that ordered extent never finish, blocking the file system.

Although the root cause is still in RAID5/6, it won't hurt to fix the error routine first.

This patch slightly modifies one existing function, btrfs_endio_direct_write_update_ordered(), to handle the free space inode, just like btrfs_writepage_end_io_hook() does, and uses it as the base to implement one inline function, btrfs_cleanup_ordered_extents(), to handle the error in run_delalloc_nocow() and cow_file_range().

Compression calls writepage_end_io_hook() itself to handle its errors, and any submitted ordered extent will have its bio submitted, so there is no need to worry about the compression part.
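The failure mode described above (an ordered extent whose bios are never submitted, so its endio accounting never reaches zero and waiters block forever) can be illustrated with a small toy model. The struct and function names below are hypothetical stand-ins, not the real btrfs types:

```c
#include <assert.h>

/*
 * Toy model of an ordered extent: it only completes once every byte of
 * its range has gone through the endio hook.  If a bio is never
 * submitted for part of the range, bytes_left never reaches zero and
 * anything waiting in btrfs_start_ordered_extent() hangs.
 */
struct toy_ordered_extent {
	unsigned long bytes_left; /* bytes still waiting for endio */
	int ioerr;                /* models BTRFS_ORDERED_IOERR */
	int done;                 /* waiters can make progress */
};

/* endio hook: called once per completed page range */
static void toy_endio(struct toy_ordered_extent *oe, unsigned long bytes,
		      int uptodate)
{
	if (!uptodate)
		oe->ioerr = 1;
	oe->bytes_left -= bytes;
	if (oe->bytes_left == 0)
		oe->done = 1; /* wake up waiters */
}

/* error-path cleanup: force-complete the remaining range with an error */
static void toy_cleanup(struct toy_ordered_extent *oe)
{
	toy_endio(oe, oe->bytes_left, 0);
}
```

In this model, finishing only the first page leaves `done == 0` forever, which is exactly the hang; the cleanup helper plays the role of btrfs_cleanup_ordered_extents(), completing the rest of the range with the error bit set.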
Suggested-by: Filipe Manana
Signed-off-by: Qu Wenruo
---
 fs/btrfs/inode.c | 58 ++--
 1 file changed, 44 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1e861a063721..e2e2267b9a73 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -113,6 +113,24 @@ static struct extent_map *create_pinned_em(struct inode *inode, u64 start,
 					   u64 block_start, u64 block_len,
 					   u64 orig_block_len, u64 ram_bytes,
 					   int type);
+static void btrfs_endio_write_update_ordered(struct inode *inode,
+					     const u64 offset,
+					     const u64 bytes,
+					     const int uptodate);
+
+/*
+ * Set error bit and cleanup all ordered extents in specified range of @inode.
+ *
+ * This is for error case where ordered extent(s) is submitted but
+ * corresponding bio is not submitted.
+ * This can make waiter on such ordered extent never finish, as there is no
+ * endio hook called to finish such ordered extent.
+ */
+static inline void btrfs_cleanup_ordered_extents(struct inode *inode,
+						 const u64 start,
+						 const u64 len)
+{
+	return btrfs_endio_write_update_ordered(inode, start, len, 0);
+}
 
 static int btrfs_dirty_inode(struct inode *inode);
 
@@ -1096,6 +1114,7 @@ static noinline int cow_file_range(struct inode *inode,
					     EXTENT_DELALLOC | EXTENT_DEFRAG,
					     PAGE_UNLOCK | PAGE_CLEAR_DIRTY |
					     PAGE_SET_WRITEBACK | PAGE_END_WRITEBACK);
+		btrfs_cleanup_ordered_extents(inode, start, end - start + 1);
 		goto out;
 	}
 
@@ -1538,7 +1557,7 @@ static noinline int run_delalloc_nocow(struct inode *inode,
 	if (!ret)
 		ret = err;
 
-	if (ret && cur_offset < end)
+	if (ret && cur_offset < end) {
 		extent_clear_unlock_delalloc(inode, cur_offset, end, end,
					     locked_page, EXTENT_LOCKED |
					     EXTENT_DELALLOC | EXTENT_DEFRAG |
@@ -1546,6 +1565,8 @@ static noinline int run_delalloc_nocow(struct inode *inode,
					     PAGE_CLEAR_DIRTY |
					     PAGE_SET_WRITEBACK |
					     PAGE_END_WRITEBACK);
+		btrfs_cleanup_ordered_extents(inode, start, end - start + 1);
+	}
 	btrfs_free_path(path);
 	return ret;
 }
@@ -8185,17 +8206,27 @@ static void btrfs_endio_direct_read(struct bio *bio)
 	bio_put(bio);
 }
 
-static void btrfs_endio_direct_write_update_ordered(struct inode *inode,
-						    const u64 offset,
-						    const u64 bytes,
-						    const int uptodate)
+static void btrfs_endio_write_update_ordered(struct inode *inode,
+
Re: mount troubles after crash
At 02/20/2017 10:13 PM, Patrick Schmid wrote:
> Hi together,
> during a balance run, the server crashed. After the crash I got the
> attached error messages.
> Kernel: 4.9.11

A known bug when qgroup is enabled. Fixed in the v4.10-rcs.

Please mount using a v4.10 kernel and it will mount without problem.

Thanks,
Qu

> Btrfs-Tools: v4.9.1
> btrfs check --repair failed with:
> couldn't open RDWR because of unsupported option features (3).
> Open ctree failed
> Is there a way to fix this 140TB filesystem?
> Regards
> Patrick

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] qgroup: Retry after commit on getting EDQUOT
At 02/20/2017 08:51 PM, Goldwyn Rodrigues wrote:
On 02/19/2017 11:35 PM, Qu Wenruo wrote:
At 02/20/2017 12:35 PM, Goldwyn Rodrigues wrote:
Hi Qu,
On 02/19/2017 09:07 PM, Qu Wenruo wrote:
At 02/19/2017 09:30 PM, Goldwyn Rodrigues wrote:
From: Goldwyn Rodrigues

We are facing the same problem with EDQUOT which was experienced with ENOSPC. Not sure if we require a full ticketing system such as ENOSPC, but here is a fix. Let me know if it is too big a hammer.

Quotas are reserved during the start of an operation, incrementing qg->reserved. However, it is written to disk in a commit_transaction, which could take as long as commit_interval. In the meantime there could be deletions which are not accounted for, because deletions are accounted only when committed (free_refroot). So, when we get an EDQUOT, flush the data to disk and try again.

IIRC Jeff submitted a qgroup patch to allow unlink to be done even when we have already hit EDQUOT: https://patchwork.kernel.org/patch/9537193/
I think that's a better solution, which is quite like what we do to handle ENOSPC.

These are two separate issues. This issue is where qg->reserved gets over-inflated because of repeated deletions and recreations within a commit_interval.

My fault, those are indeed two different bugs. And I succeeded in reproducing the bug.

Jeff's patch deals with allowing removal even if the quota is exceeded so that eventually space can be freed. I would suggest you apply Jeff's patch and try to run the script I have presented.

Thanks,
Qu

I combined the conditions of rfer and excl to reduce code lines, though the condition looks uglier. Here is a sample script which shows this issue.
DEVICE=/dev/vdb
MOUNTPOINT=/mnt
TESTVOL=$MOUNTPOINT/tmp
QUOTA=5
PROG=btrfs
DD_BS="4k"
DD_COUNT="256"
RUN_TIMES=5000

mkfs.btrfs -f $DEVICE
mount -o commit=240 $DEVICE $MOUNTPOINT
$PROG subvolume create $TESTVOL
$PROG quota enable $TESTVOL
$PROG qgroup limit ${QUOTA}G $TESTVOL

typeset -i DD_RUN_GOOD
typeset -i QUOTA

function _check_cmd() {
	if [[ ${?} -gt 0 ]]; then
		echo -n "$(date) E: Running previous command"
		echo ${*}
		echo "Without sync"
		$PROG qgroup show -pcreFf ${TESTVOL}
		echo "With sync"
		$PROG qgroup show -pcreFf --sync ${TESTVOL}
		exit 1
	fi
}

while true; do
	DD_RUN_GOOD=$RUN_TIMES
	while (( ${DD_RUN_GOOD} != 0 )); do
		dd if=/dev/zero of=${TESTVOL}/quotatest${DD_RUN_GOOD} bs=${DD_BS} count=${DD_COUNT}
		_check_cmd "dd if=/dev/zero of=${TESTVOL}/quotatest${DD_RUN_GOOD} bs=${DD_BS} count=${DD_COUNT}"
		DD_RUN_GOOD=$(( ${DD_RUN_GOOD} - 1 ))
	done
	$PROG qgroup show -pcref $TESTVOL
	echo "--- Cleanup -- "
	rm $TESTVOL/quotatest*
done

It would be better if we could reduce the reproduction time and submit it as an fstests test case.

Yes, unfortunately, it is more probabilistic than deterministic. But yes, reducing the size and increasing the commit_interval can improve the time.
Signed-off-by: Goldwyn Rodrigues
---
 fs/btrfs/qgroup.c | 28 +---
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 4700cac..9ace407 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2093,6 +2093,7 @@ static int qgroup_reserve(struct btrfs_root *root, u64 num_bytes)
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	u64 ref_root = root->root_key.objectid;
 	int ret = 0;
+	int retried = 0;
 	struct ulist_node *unode;
 	struct ulist_iterator uiter;
@@ -2101,7 +2102,7 @@ static int qgroup_reserve(struct btrfs_root *root, u64 num_bytes)
 	if (num_bytes == 0)
 		return 0;
-
+retry:
 	spin_lock(&fs_info->qgroup_lock);
 	quota_root = fs_info->quota_root;
 	if (!quota_root)
@@ -2127,16 +2128,21 @@ static int qgroup_reserve(struct btrfs_root *root, u64 num_bytes)
 		qg = u64_to_ptr(unode->aux);
 
-		if ((qg->lim_flags & BTRFS_QGROUP_LIMIT_MAX_RFER) &&
-		    qg->reserved + (s64)qg->rfer + num_bytes >
-		    qg->max_rfer) {
-			ret = -EDQUOT;
-			goto out;
-		}
-
-		if ((qg->lim_flags & BTRFS_QGROUP_LIMIT_MAX_EXCL) &&
-		    qg->reserved + (s64)qg->excl + num_bytes >
-		    qg->max_excl) {
+		if (((qg->lim_flags & BTRFS_QGROUP_LIMIT_MAX_RFER) &&
+		     qg->reserved + (s64)qg->rfer + num_bytes > qg->max_rfer) ||
+		    ((qg->lim_flags & BTRFS_QGROUP_LIMIT_MAX_EXCL) &&
+		     qg->reserved + (s64)qg->excl + num_bytes > qg->max_excl)) {
+			if (!retried && qg->reserved > 0) {

Does the qg->reserved > 0 check have any extra meaning here?

qg->reserved greater than zero means the qgroup tree is dirty and needs to be committed to disk. It may or may not have quota reductions, which can only be confirmed after free_ref_root is called in commit. While it may be a small probability, it will save I/O processing when the
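The retry-after-commit idea in this patch can be sketched in miniature. Everything below is a hypothetical userspace model, not kernel code: try_reserve() stands in for qgroup_reserve()'s limit check, and commit() for the transaction commit where pending deletions are finally subtracted from the qgroup's committed usage:

```c
#include <assert.h>
#include <errno.h>

/* Toy qgroup: committed usage, in-flight reservations, uncommitted frees. */
struct toy_qgroup {
	long long max;          /* models qg->max_rfer */
	long long rfer;         /* committed usage */
	long long reserved;     /* in-flight reservations */
	long long pending_free; /* deletions not yet accounted (until commit) */
};

static int try_reserve(struct toy_qgroup *qg, long long bytes)
{
	if (qg->reserved + qg->rfer + bytes > qg->max)
		return -EDQUOT;
	qg->reserved += bytes;
	return 0;
}

static void commit(struct toy_qgroup *qg)
{
	/* deletions are only accounted at commit time (free_refroot) */
	qg->rfer -= qg->pending_free;
	qg->pending_free = 0;
}

/* On EDQUOT, commit once to pick up pending frees, then retry once. */
static int reserve_with_retry(struct toy_qgroup *qg, long long bytes)
{
	int ret = try_reserve(qg, bytes);

	if (ret == -EDQUOT && qg->reserved > 0) {
		commit(qg);
		ret = try_reserve(qg, bytes);
	}
	return ret;
}
```

The model shows why the first attempt can spuriously hit EDQUOT: deletions made since the last commit haven't been subtracted from the committed usage yet, so flushing once and retrying can succeed without raising the limit.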
Re: [PATCH] btrfs: Handle btrfs_reloc_clone_csums error correctly to avoid deadlock
At 02/20/2017 09:43 PM, Filipe Manana wrote:
On Mon, Feb 20, 2017 at 12:31 AM, Qu Wenruo wrote:
At 02/17/2017 11:25 PM, Filipe Manana wrote:
On Fri, Feb 17, 2017 at 12:37 AM, Qu Wenruo wrote:

In case of an error, the endio handler is called only for the first page, and not for the whole range for which one or more ordered extents were created. So the ordered extents will be around forever.

That's the same thing I'm trying to fix. What I'm asking is the reason to clean up previously created ordered extents which didn't encounter any error.

Any error in the loop requires cleaning up previously created ordered extents, because if that function returns an error no bio is submitted, and therefore the ordered extents won't be removed, except for the case where only one ordered extent is created and the delalloc range is a single page.

Or is this designed behavior for things like fsync? Success for all, or failure if any fails?

This is unrelated to fsync. It's about leaving ordered extents around, which can make any task get them and hang on them forever.

However, in the case of btrfs_reloc_clone_csum(), if it fails, there is only one ordered extent that really needs to be cleaned.

The problem is more generic and you can forget about btrfs_reloc_clone_csum(). It's about the delalloc callback returning an error when one or more ordered extents were created by past iterations of the loop before the error happened.

While the only possible problem we can hit after submitting an ordered extent is btrfs_reloc_clone_csum(). Previously created ordered extents can finish without problem, so we only need to clean the last created one, and there is no need to clean up all the ordered extents created before it.

For a more specific example of this case: we are relocating one data block group, whose size is 1G, containing 8 128M file extents. run_delalloc_nocow() is called on root DATA_RELOC, inode 257, range [0, 1G). In the first loop iteration, it handles the first [0, 128M) without problem. The 1st ordered extent is submitted and can finish even before the 2nd ordered extent is submitted.

The ordered extent is submitted, but not a bio for it. It's when the bio completes (writeback finishes) that the endio callback is invoked, which wakes up every task waiting on the ordered extent, sets the IO error bit if needed, removes the ordered extent, etc.

Then, when we are submitting the 2nd ordered extent, ranged [128M, 256M), btrfs_reloc_clone_csum() fails. We must clean up at least the 2nd ordered extent or it will hang forever. If I understand you right, you mean we must clean up all ordered extents in the range [0, 1G).

Yes.

My point is, since the 1st ordered extent can be submitted and finished before the 2nd ordered extent's submission without problem, I didn't see the point of cleaning up the 1st ordered extent and the non-existent 3rd~8th ordered extents.

As said before in this reply, if the delalloc callback (cow_file_range, run_delalloc_nocow) returns an error, no bios are submitted for any ordered extents it might have created before the error happened. Check __extent_writepage(): it calls writepage_delalloc(), which calls our delalloc callback, and if that returns an error, it won't call __extent_writepage_io(), which is what submits bios.

Or did I miss or misunderstand something else?

You missed a lot of things, I suppose, namely that bios aren't submitted if the delalloc function returns an error. That's the point.

I misunderstood the bio submission timing, so in that case you're right about cleaning up the whole range. I'll update the patch to address it.

Thanks,
Qu

Honestly I don't know how to explain things better to you, and was hoping this was easier to understand using the direct IO write error path as an example.

In that case, since we haven't unlocked the pages/extent, there will be no race nor a new ordered extent, and we are ensured to have only one ordered extent needing cleanup.

Also, for any created ordered extent, you want to set the bit BTRFS_ORDERED_IOERR on it before removing it, since there might be concurrent tasks waiting for it that do error handling (like an fsync, since when running delalloc the inode isn't locked).

Fsync is not affected, IIRC, as btrfs_reloc_clone_csum() is only called for the DATA_RELOC tree inode, which we don't call fsync on.

Look at the direct IO write path error handling and btrfs_endio_direct_write_update_ordered() for a very similar scenario. And the same problem happens in the cow case (inode.c:cow_file_range()).

BTW, I have already done it in cow_file_range(); check the beginning lines of inode.c in this patch.

Indeed. I missed it. Thanks.

Thanks,
Qu

+	}
 	btrfs_free_path(path);
 	return ret;
 }
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 041c3326d109..dba1cf3464a7 100644
--- a/fs/btrfs/ordered-data.c
+++
[PATCH 2/4] btrfs: document existence of extent_io ops callbacks
Some of the callbacks defined in btree_extent_io_ops and btrfs_extent_io_ops always exist, so we don't need to check their existence before each call. This patch just reorders the definitions and documents which are mandatory/optional.

Signed-off-by: David Sterba
---
 fs/btrfs/disk-io.c   |  7 +--
 fs/btrfs/extent_io.h | 23 ---
 fs/btrfs/inode.c     |  7 +--
 3 files changed, 26 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 2b06f557c176..0715b6f3f686 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4653,9 +4653,12 @@ static int btrfs_cleanup_transaction(struct btrfs_fs_info *fs_info)
 }
 
 static const struct extent_io_ops btree_extent_io_ops = {
-	.readpage_end_io_hook = btree_readpage_end_io_hook,
-	.readpage_io_failed_hook = btree_io_failed_hook,
+	/* mandatory callbacks */
 	.submit_bio_hook = btree_submit_bio_hook,
+	.readpage_end_io_hook = btree_readpage_end_io_hook,
 	/* note we're sharing with inode.c for the merge bio hook */
 	.merge_bio_hook = btrfs_merge_bio_hook,
+
+	/* optional callbacks */
+	.readpage_io_failed_hook = btree_io_failed_hook,
 };
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index fbc92315b503..5c5e2e6cfb9e 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -91,18 +91,27 @@ typedef int (extent_submit_bio_hook_t)(struct inode *inode, struct bio *bio,
				       int mirror_num, unsigned long bio_flags,
				       u64 bio_offset);
 struct extent_io_ops {
-	int (*fill_delalloc)(struct inode *inode, struct page *locked_page,
-			     u64 start, u64 end, int *page_started,
-			     unsigned long *nr_written);
-	int (*writepage_start_hook)(struct page *page, u64 start, u64 end);
+	/*
+	 * The following callbacks must be always defined, the function
+	 * pointer will be called unconditionally.
+	 */
 	extent_submit_bio_hook_t *submit_bio_hook;
+	int (*readpage_end_io_hook)(struct btrfs_io_bio *io_bio, u64 phy_offset,
+				    struct page *page, u64 start, u64 end,
+				    int mirror);
 	int (*merge_bio_hook)(struct page *page, unsigned long offset,
			      size_t size, struct bio *bio,
			      unsigned long bio_flags);
+
+	/*
+	 * Optional hooks, called if the pointer is not NULL
+	 */
+	int (*fill_delalloc)(struct inode *inode, struct page *locked_page,
+			     u64 start, u64 end, int *page_started,
+			     unsigned long *nr_written);
 	int (*readpage_io_failed_hook)(struct page *page, int failed_mirror);
-	int (*readpage_end_io_hook)(struct btrfs_io_bio *io_bio, u64 phy_offset,
-				    struct page *page, u64 start, u64 end,
-				    int mirror);
+
+	int (*writepage_start_hook)(struct page *page, u64 start, u64 end);
 	void (*writepage_end_io_hook)(struct page *page, u64 start, u64 end,
				      struct extent_state *state, int uptodate);
 	void (*set_bit_hook)(struct inode *inode, struct extent_state *state,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index eafadf0851d1..72faf9b5616a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10541,10 +10541,13 @@ static const struct file_operations btrfs_dir_file_operations = {
 };
 
 static const struct extent_io_ops btrfs_extent_io_ops = {
-	.fill_delalloc = run_delalloc_range,
+	/* mandatory callbacks */
 	.submit_bio_hook = btrfs_submit_bio_hook,
-	.merge_bio_hook = btrfs_merge_bio_hook,
 	.readpage_end_io_hook = btrfs_readpage_end_io_hook,
+	.merge_bio_hook = btrfs_merge_bio_hook,
+
+	/* optional callbacks */
+	.fill_delalloc = run_delalloc_range,
 	.writepage_end_io_hook = btrfs_writepage_end_io_hook,
 	.writepage_start_hook = btrfs_writepage_start_hook,
 	.set_bit_hook = btrfs_set_bit_hook,
--
2.10.1
[PATCH 4/4] btrfs: add dummy callback for readpage_io_failed and drop checks
Make the extent_io_ops::readpage_io_failed_hook callback mandatory and define a dummy function for btrfs_extent_io_ops. As the failed IO callback is not performance critical, trading the NULL-pointer branch for an unconditional indirect call does not hurt.

Signed-off-by: David Sterba
---
 fs/btrfs/disk-io.c   | 2 +-
 fs/btrfs/extent_io.c | 2 +-
 fs/btrfs/extent_io.h | 2 +-
 fs/btrfs/inode.c     | 7 +++
 4 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0715b6f3f686..fbf4921f4d60 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4658,7 +4658,7 @@ static const struct extent_io_ops btree_extent_io_ops = {
 	.readpage_end_io_hook = btree_readpage_end_io_hook,
 	/* note we're sharing with inode.c for the merge bio hook */
 	.merge_bio_hook = btrfs_merge_bio_hook,
+	.readpage_io_failed_hook = btree_io_failed_hook,
 
 	/* optional callbacks */
-	.readpage_io_failed_hook = btree_io_failed_hook,
 };
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index f5cff93ab152..eaee7bb2ff7c 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2578,7 +2578,7 @@ static void end_bio_extent_readpage(struct bio *bio)
 		if (likely(uptodate))
 			goto readpage_ok;
 
-		if (tree->ops && tree->ops->readpage_io_failed_hook) {
+		if (tree->ops) {
 			ret = tree->ops->readpage_io_failed_hook(page, mirror);
 			if (!ret && !bio->bi_error)
 				uptodate = 1;
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 5c5e2e6cfb9e..63c8cc970b1c 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -102,6 +102,7 @@ struct extent_io_ops {
 	int (*merge_bio_hook)(struct page *page, unsigned long offset,
			      size_t size, struct bio *bio,
			      unsigned long bio_flags);
+	int (*readpage_io_failed_hook)(struct page *page, int failed_mirror);
 
 	/*
	 * Optional hooks, called if the pointer is not NULL
@@ -109,7 +110,6 @@ struct extent_io_ops {
 	int (*fill_delalloc)(struct inode *inode, struct page *locked_page,
			     u64 start, u64 end, int *page_started,
			     unsigned long *nr_written);
-	int (*readpage_io_failed_hook)(struct page *page, int failed_mirror);
 
 	int (*writepage_start_hook)(struct page *page, u64 start, u64 end);
 	void (*writepage_end_io_hook)(struct page *page, u64 start, u64 end,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 72faf9b5616a..a74191fa3934 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10503,6 +10503,12 @@ static int btrfs_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
 }
 
+__attribute__((const))
+static int dummy_readpage_io_failed_hook(struct page *page, int failed_mirror)
+{
+	return 0;
+}
+
 static const struct inode_operations btrfs_dir_inode_operations = {
 	.getattr	= btrfs_getattr,
 	.lookup		= btrfs_lookup,
@@ -10545,6 +10551,7 @@ static const struct extent_io_ops btrfs_extent_io_ops = {
 	.submit_bio_hook = btrfs_submit_bio_hook,
 	.readpage_end_io_hook = btrfs_readpage_end_io_hook,
 	.merge_bio_hook = btrfs_merge_bio_hook,
+	.readpage_io_failed_hook = dummy_readpage_io_failed_hook,
 
 	/* optional callbacks */
 	.fill_delalloc = run_delalloc_range,
--
2.10.1
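The idiom this patch applies is general: guarantee a callback pointer is always set, supplying a no-op dummy where a backend has nothing to do, so call sites can drop their NULL checks. A minimal userspace sketch with hypothetical names, not the btrfs types:

```c
#include <assert.h>

/* A tiny ops table with one "mandatory" hook. */
struct io_ops {
	int (*readpage_io_failed_hook)(int failed_mirror);
};

/* No-op dummy: nothing to retry, report the failure as final. */
static int dummy_readpage_io_failed_hook(int failed_mirror)
{
	(void)failed_mirror;
	return 0;
}

/* A real implementation: pretend mirrors 0 and 1 can be retried. */
static int real_failed_hook(int failed_mirror)
{
	return failed_mirror < 2 ? 1 : 0;
}

static const struct io_ops with_retry = {
	.readpage_io_failed_hook = real_failed_hook,
};
static const struct io_ops no_retry = {
	.readpage_io_failed_hook = dummy_readpage_io_failed_hook,
};

/* The caller no longer needs "if (ops->readpage_io_failed_hook)". */
static int handle_read_failure(const struct io_ops *ops, int mirror)
{
	return ops->readpage_io_failed_hook(mirror);
}
```

The design choice is exactly the trade-off named in the commit message: on a cold error path, one always-taken indirect call is cheaper to maintain (and no slower in practice) than a NULL check at every call site.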
[PATCH 0/4] Cleanups in extent_io callbacks
Some of the checks for extent_io callbacks can be safely dropped as they're always defined, plus some dummy callback additions so more checks can be dropped. There's more potential for the same cleanup in other callbacks, but this would need more evaluation of whether dummy callbacks vs existence checks are really worth it.

David Sterba (4):
  btrfs: let writepage_end_io_hook return void
  btrfs: document existence of extent_io ops callbacks
  btrfs: drop checks for mandatory extent_io_ops callbacks
  btrfs: add dummy callback for readpage_io_failed and drop checks

 fs/btrfs/disk-io.c   |  7 +--
 fs/btrfs/extent_io.c | 18 +++---
 fs/btrfs/extent_io.h | 25 +
 fs/btrfs/inode.c     | 20 ++--
 4 files changed, 43 insertions(+), 27 deletions(-)
[PATCH 3/4] btrfs: drop checks for mandatory extent_io_ops callbacks
We know that readpage_end_io_hook, submit_bio_hook and merge_bio_hook are always defined, so we can drop the checks before we call them.

Signed-off-by: David Sterba
---
 fs/btrfs/extent_io.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 8de29aa4d1a2..f5cff93ab152 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2565,8 +2565,7 @@ static void end_bio_extent_readpage(struct bio *bio)
 		len = bvec->bv_len;
 		mirror = io_bio->mirror_num;
 
-		if (likely(uptodate && tree->ops &&
-			   tree->ops->readpage_end_io_hook)) {
+		if (likely(uptodate && tree->ops)) {
 			ret = tree->ops->readpage_end_io_hook(io_bio, offset,
							      page, start, end,
							      mirror);
@@ -2728,7 +2727,7 @@ static int __must_check submit_one_bio(struct bio *bio, int mirror_num,
 	bio->bi_private = NULL;
 	bio_get(bio);
 
-	if (tree->ops && tree->ops->submit_bio_hook)
+	if (tree->ops)
 		ret = tree->ops->submit_bio_hook(page->mapping->host, bio,
						 mirror_num, bio_flags, start);
 	else
@@ -2743,7 +2742,7 @@ static int merge_bio(struct extent_io_tree *tree, struct page *page,
		     unsigned long bio_flags)
 {
 	int ret = 0;
-	if (tree->ops && tree->ops->merge_bio_hook)
+	if (tree->ops)
 		ret = tree->ops->merge_bio_hook(page, offset, size, bio,
						bio_flags);
 	return ret;
--
2.10.1
[PATCH 1/4] btrfs: let writepage_end_io_hook return void
There's no error path in any of the instances, always return 0.

Signed-off-by: David Sterba
---
 fs/btrfs/extent_io.c | 9 +++--
 fs/btrfs/extent_io.h | 2 +-
 fs/btrfs/inode.c     | 6 ++
 3 files changed, 6 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d15b5ddb6732..8de29aa4d1a2 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2435,12 +2435,9 @@ void end_extent_writepage(struct page *page, int err, u64 start, u64 end)
 
 	tree = &BTRFS_I(page->mapping->host)->io_tree;
 
-	if (tree->ops && tree->ops->writepage_end_io_hook) {
-		ret = tree->ops->writepage_end_io_hook(page, start,
-					       end, NULL, uptodate);
-		if (ret)
-			uptodate = 0;
-	}
+	if (tree->ops && tree->ops->writepage_end_io_hook)
+		tree->ops->writepage_end_io_hook(page, start, end, NULL,
+						 uptodate);
 
 	if (!uptodate) {
 		ClearPageUptodate(page);
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 270d03be290e..fbc92315b503 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -103,7 +103,7 @@ struct extent_io_ops {
 	int (*readpage_end_io_hook)(struct btrfs_io_bio *io_bio, u64 phy_offset,
				    struct page *page, u64 start, u64 end,
				    int mirror);
-	int (*writepage_end_io_hook)(struct page *page, u64 start, u64 end,
+	void (*writepage_end_io_hook)(struct page *page, u64 start, u64 end,
				      struct extent_state *state, int uptodate);
 	void (*set_bit_hook)(struct inode *inode, struct extent_state *state,
			     unsigned *bits);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index dae2734a725b..eafadf0851d1 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2977,7 +2977,7 @@ static void finish_ordered_fn(struct btrfs_work *work)
 	btrfs_finish_ordered_io(ordered_extent);
 }
 
-static int btrfs_writepage_end_io_hook(struct page *page, u64 start, u64 end,
+static void btrfs_writepage_end_io_hook(struct page *page, u64 start, u64 end,
				      struct extent_state *state, int uptodate)
 {
 	struct inode *inode = page->mapping->host;
@@ -2991,7 +2991,7 @@ static int btrfs_writepage_end_io_hook(struct page *page, u64 start, u64 end,
 	ClearPagePrivate2(page);
 	if (!btrfs_dec_test_ordered_pending(inode, &ordered_extent, start,
					    end - start + 1, uptodate))
-		return 0;
+		return;
 
 	if (btrfs_is_free_space_inode(inode)) {
 		wq = fs_info->endio_freespace_worker;
@@ -3004,8 +3004,6 @@ static int btrfs_writepage_end_io_hook(struct page *page, u64 start, u64 end,
 	btrfs_init_work(&ordered_extent->work, func, finish_ordered_fn,
			NULL, NULL);
 	btrfs_queue_work(wq, &ordered_extent->work);
-
-	return 0;
 }
--
2.10.1
[PATCH v3] btrfs-progs: RAID5: Inject data stripe corruption and verify scrub fixes it and retains correct parity.
Test script that creates a file with a specific data-stripe layout and computes stripe locations. It injects corruption into a data stripe and verifies that the scrubbing process fixes the corrupted block. It also validates the corresponding parity stripe.

Signed-off-by: Lakshmipathi.G
---
 .../test.sh | 402 +
 1 file changed, 402 insertions(+)
 create mode 100755 tests/misc-tests/020-raid5-datastripe-corruption-parity-check/test.sh

diff --git a/tests/misc-tests/020-raid5-datastripe-corruption-parity-check/test.sh b/tests/misc-tests/020-raid5-datastripe-corruption-parity-check/test.sh
new file mode 100755
index 000..0129a75
--- /dev/null
+++ b/tests/misc-tests/020-raid5-datastripe-corruption-parity-check/test.sh
@@ -0,0 +1,402 @@
+#!/bin/bash
+#
+# Raid5: Inject data stripe corruption and fix them using scrub.
+#
+# Script will perform the following:
+# 1) Create Raid5 using 3 loopback devices.
+# 2) Ensure file layout is created in a predictable manner.
+#    Each data stripe(64KB) should uniquely start with 'DN',
+#    where N represents the data stripe number.(ex:D0,D1 etc)
+# 3) Once file is created with specific layout, check whether file
+#    has single extent. At the moment, script wont handle multi-extent
+#    files.
+# 4) If file has single extent, with the help of 'dump-tree' compute data and
+#    parity stripe details like devicename, position and actual on-disk data.
+# 5) $STRIPEINFO_COMPLETE file will have all necessary data at this stage.
+# 6) Inject corruption into given data-stripe by zero'ing out its first 4 bytes.
+# 7) After injecting corruption, running online-scrub is expected to fix
+#    the corrupted data stripe with the help of parity block and
+#    corresponding data stripe.
+# 8) If scrub successful, verify the data stripe has original un-corrupted value.
+# 9) If scrub successful, verify parity stripe is valid, otherwise its a parity bug.
+# 10) If no issues found, cleanup files and devices. Repeat the process for
+#    different file size and data-stripe.
+#
+# Note: This script corrupts only data-stripe blocks.
+# Known Limitations (will be addressed later):
+# - Script expects even number of data-stripes in file.
+# - Script expects the file to have single extent.
+
+source $TOP/tests/common
+
+check_prereq btrfs
+check_prereq mkfs.btrfs
+check_prereq btrfs-debugfs
+
+setup_root_helper
+prepare_test_dev 1024M
+
+ndevs=3
+declare -a devs
+declare -a parity_offsets
+stripe_entry=""
+device_name=""
+stripe_offset=""
+stripe_content=""
+
+#Complete stripe layout
+STRIPEINFO_COMPLETE=$(mktemp --tmpdir btrfs-progs-raid5-stripe-complete.infoXX)
+#dump-tree output file
+DUMPTREE_OUTPUT=$(mktemp --tmpdir btrfs-progs-raid5-tree-dump.infoXX)
+#tmp files
+STRIPEINFO_PARTIAL=$(mktemp --tmpdir btrfs-progs-raid5-stripe-partial.infoXX)
+STRIPE_TMP=$(mktemp --tmpdir btrfs-progs-raid5-stripetmp.infoXX)
+MULTI_EXTENT_CHECK=$(mktemp --tmpdir btrfs-progs-raid5-extent-check.infoXX)
+EXTENT_WITH_SIZE=$(mktemp --tmpdir btrfs-progs-raid5-extent-size.infoXX)
+PARITY_LOC1=$(mktemp --tmpdir btrfs-progs-raid5-parity-loc1.infoXX)
+PARITY_LOCATION=$(mktemp --tmpdir btrfs-progs-raid5-parity-locations.infoXX)
+
+
+prepare_devices()
+{
+	for i in `seq $ndevs`; do
+		touch img$i
+		chmod a+rw img$i
+		truncate -s0 img$i
+		truncate -s512M img$i
+		devs[$i]=`run_check_stdout $SUDO_HELPER losetup --find --show img$i`
+	done
+	truncate -s0 $STRIPE_TMP
+	truncate -s0 $STRIPEINFO_PARTIAL
+	truncate -s0 $STRIPEINFO_COMPLETE
+}
+
+cleanup_devices()
+{
+	for dev in ${devs[@]}; do
+		run_check $SUDO_HELPER losetup -d $dev
+	done
+	for i in `seq $ndevs`; do
+		truncate -s0 img$i
+	done
+	run_check $SUDO_HELPER losetup --all
+}
+
+test_do_mkfs()
+{
+	run_check $SUDO_HELPER $TOP/mkfs.btrfs -f \
+		$@
+}
+
+test_mkfs_multi()
+{
+	test_do_mkfs $@ ${devs[@]}
+}
+
+#$1 Filename
+#$2 Expected no. of data stripes for the file.
+create_layout(){
+	fname=$1
+	size=$(( $2 * 65536 ))
+	n=0
+	bs_value=1
+	stripe=0
+	while (( $n < $size ))
+	do
+		if [ $(( $n % 65536 )) -eq 0 ]; then
+			val='D'$stripe
+			echo -n $val
+			stripe=$(( $stripe+1 ))
+			# ensure proper value
+			bs_value=`echo "${#val}"`
+		else
+			echo -n 'x'
+			bs_value=1
+		fi
+		n=$(( $n+$bs_value ))
+	done | dd of=/tmp/$fname bs=$bs_value conv=notrunc &> /dev/null
+	cp /tmp/$fname $TEST_MNT
+	rm -f /tmp/$fname
+}
+
+# $1,$2,$3 - available device offsets
+find_parity_device(){
+
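The patch chunk above ends inside find_parity_device(). Conceptually, the script needs to know which device holds the parity block for each horizontal 64KB stripe; RAID5 rotates parity across the member devices. The following is only an illustrative round-robin model (the rotation rule shown is an assumption, not btrfs's exact on-disk placement — the test script correctly derives the real locations from 'dump-tree' output instead of assuming a formula):

```python
def stripe_layout(stripe_nr, ndevs=3):
    """Toy RAID5 layout: for each horizontal stripe, one device holds
    parity and the rest hold data. The modulo rotation below is an
    assumed scheme for illustration only; real placement must be read
    from btrfs dump-tree, as the test script does."""
    parity = stripe_nr % ndevs
    data = [d for d in range(ndevs) if d != parity]
    return parity, data

# Parity cycles over all devices, so corrupting data stripes at
# successive stripe numbers exercises every device's parity eventually.
parity_cycle = [stripe_layout(n)[0] for n in range(6)]
```

With this model, stripe 1 on three devices would put parity on device 1 and data on devices 0 and 2 — which is why the script must recompute locations per stripe rather than hardcoding one device.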
[PATCH 2/3] btrfs: handle allocation error in update_dev_stat_item
Signed-off-by: David Sterba
---
 fs/btrfs/volumes.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 1fac98728814..64d6665f6eda 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6954,7 +6954,8 @@ static int update_dev_stat_item(struct btrfs_trans_handle *trans,
 	key.offset = device->devid;
 
 	path = btrfs_alloc_path();
-	BUG_ON(!path);
+	if (!path)
+		return -ENOMEM;
 	ret = btrfs_search_slot(trans, dev_root, &key, path, -1, 1);
 	if (ret < 0) {
 		btrfs_warn_in_rcu(fs_info,
-- 
2.10.1
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/3] btrfs: remove BUG_ON from __tree_mod_log_insert
All callers dereference the 'tm' parameter before it gets to this function, so the NULL check does not make much sense here.

Signed-off-by: David Sterba
---
 fs/btrfs/ctree.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 1192bc7d2ee7..2c3c943bfcdc 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -453,8 +453,6 @@ __tree_mod_log_insert(struct btrfs_fs_info *fs_info, struct tree_mod_elem *tm)
 	struct rb_node *parent = NULL;
 	struct tree_mod_elem *cur;
 
-	BUG_ON(!tm);
-
 	tm->seq = btrfs_inc_tree_mod_seq(fs_info);
 
 	tm_root = &fs_info->tree_mod_log;
-- 
2.10.1
[PATCH 3/3] btrfs: do proper error handling in btrfs_insert_xattr_item
The space check in btrfs_insert_xattr_item is duplicated in its caller (do_setxattr), so we won't hit the BUG_ON. Continuing without any check could be disastrous, so turn it into proper error handling.

Signed-off-by: David Sterba
---
 fs/btrfs/dir-item.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/dir-item.c b/fs/btrfs/dir-item.c
index 724504a2d7ac..640801082533 100644
--- a/fs/btrfs/dir-item.c
+++ b/fs/btrfs/dir-item.c
@@ -80,7 +80,8 @@ int btrfs_insert_xattr_item(struct btrfs_trans_handle *trans,
 	struct extent_buffer *leaf;
 	u32 data_size;
 
-	BUG_ON(name_len + data_len > BTRFS_MAX_XATTR_SIZE(root->fs_info));
+	if (name_len + data_len > BTRFS_MAX_XATTR_SIZE(root->fs_info))
+		return -ENOSPC;
 
 	key.objectid = objectid;
 	key.type = BTRFS_XATTR_ITEM_KEY;
-- 
2.10.1
[PATCH 0/3] Remove some BUG_ONs
Minor cleanup of the BUG_ONs, which could be either removed or replaced by proper error handling where the call chain is ready for that. For 4.11.

David Sterba (3):
  btrfs: remove BUG_ON from __tree_mod_log_insert
  btrfs: handle allocation error in update_dev_stat_item
  btrfs: do proper error handling in btrfs_insert_xattr_item

 fs/btrfs/ctree.c    | 2 --
 fs/btrfs/dir-item.c | 3 ++-
 fs/btrfs/volumes.c  | 3 ++-
 3 files changed, 4 insertions(+), 4 deletions(-)
Re: Why is btrfs_run_delayed_refs() run multiple times
On Mon, Feb 20, 2017 at 5:08 PM, Goldwyn Rodrigues wrote:
> Hi,
>
> Why do we call btrfs_run_delayed_refs multiple times in a function? Some
> of the examples are:
>
> btrfs_commit_transaction()
> commit_cowonly_roots()
>
> Is it because one call can generate more delayed refs and hence we need
> to run them again? Under what scenarios is this possible?

All extents, for both data and metadata, are reference counted and have back references in the extent tree. Adding/updating/removing things to/in/from a tree implies COWing metadata extents (tree nodes and leaves), which implies generating more delayed references. Eventually this loop stops, as metadata extents are COWed only once per transaction, or more times if writeback was triggered. So it's possible in all scenarios.

> --
> Goldwyn

--
Filipe David Manana,

"People will forget what you said,
 people will forget what you did,
 but people will never forget how you made them feel."
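[Editorial note: the loop described above — running delayed refs COWs tree blocks, which enqueues more delayed refs, until the COW-once-per-transaction rule makes the queue drain — can be sketched with a toy model. This is not kernel code; the node numbering and parent relation are made up purely to show why the loop terminates.]

```python
from collections import deque

def run_delayed_refs(queue, cowed_this_transaction):
    """Toy model: processing a delayed ref may COW the parent metadata
    node, which enqueues another delayed ref. Each node is COWed at
    most once per transaction, so the loop always terminates."""
    processed = 0
    while queue:
        node = queue.popleft()
        processed += 1
        parent = node // 2  # hypothetical parent relation (binary-tree style)
        if parent and parent not in cowed_this_transaction:
            cowed_this_transaction.add(parent)
            queue.append(parent)  # COW generates a new delayed ref
    return processed

q = deque([8, 12])       # two initial delayed refs
cowed = set()
total = run_delayed_refs(q, cowed)
```

Here the two initial refs ripple up through their (toy) ancestors, each COWed once, and the queue empties — mirroring why btrfs_run_delayed_refs() is simply called again until no new refs appear.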
Re: [PATCH 0/5] block subsystem refcounter conversions
On Mon, 2017-02-20 at 17:56 +0100, Peter Zijlstra wrote:
> On Mon, Feb 20, 2017 at 07:41:01AM -0800, James Bottomley wrote:
> > On Mon, 2017-02-20 at 08:15 -0700, Jens Axboe wrote:
> > > On 02/20/2017 04:16 AM, Elena Reshetova wrote:
> > > > Now when the new refcount_t type and API are finally merged
> > > > (see include/linux/refcount.h), the following
> > > > patches convert various refcounters in the block subsystem from
> > > > atomic_t to refcount_t. By doing this we prevent intentional or
> > > > accidental underflows or overflows that can lead to
> > > > use-after-free vulnerabilities.
> >
> > This description isn't right ... nothing is prevented; we get
> > warnings on saturation and use after free with this.
>
> The thing that is prevented is overflow and then a use-after-free by
> making it a leak.
>
> Modular stuff, you put and free at: (n+1) mod n, by saturating at n-1
> we'll never get there.
>
> So you lose use-after-free, you gain a resource leak. The general
> idea being that use-after-free is a nice trampoline for exploits,
> leaks are 'only' a DoS.

OK, I see the intention: it's protection from outside influence. It still doesn't prevent *us* from screwing up in the kernel and inducing a use after free by doing too many puts (or too few gets) ... that's what the message suggests to me (me coding wrongly is accidental underflows or overflows as I read it).

James
Why is btrfs_run_delayed_refs() run multiple times
Hi,

Why do we call btrfs_run_delayed_refs multiple times in a function? Some of the examples are:

btrfs_commit_transaction()
commit_cowonly_roots()

Is it because one call can generate more delayed refs and hence we need to run them again? Under what scenarios is this possible?

--
Goldwyn
Re: [PATCH 0/5] block subsystem refcounter conversions
On Mon, Feb 20, 2017 at 07:41:01AM -0800, James Bottomley wrote:
> On Mon, 2017-02-20 at 08:15 -0700, Jens Axboe wrote:
> > On 02/20/2017 04:16 AM, Elena Reshetova wrote:
> > > Now when the new refcount_t type and API are finally merged
> > > (see include/linux/refcount.h), the following
> > > patches convert various refcounters in the block subsystem from
> > > atomic_t to refcount_t. By doing this we prevent intentional or
> > > accidental underflows or overflows that can lead to use-after-free
> > > vulnerabilities.
>
> This description isn't right ... nothing is prevented; we get warnings
> on saturation and use after free with this.

The thing that is prevented is overflow and then a use-after-free by making it a leak.

Modular stuff, you put and free at: (n+1) mod n, by saturating at n-1 we'll never get there.

So you lose use-after-free, you gain a resource leak. The general idea being that use-after-free is a nice trampoline for exploits, leaks are 'only' a DoS.
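[Editorial note: the modular-arithmetic point above — a wrapping counter can pass back through zero and enable a premature free, while a saturating counter turns that into a leak — can be shown with a small model. This is an illustration only: the constants and the exact saturation value are assumptions, not the kernel's real refcount_t implementation.]

```python
UINT_MAX = 2**32 - 1
SATURATED = UINT_MAX  # model: pin the counter here on misuse

def atomic_inc(v):
    """Plain atomic_t model: wraps modulo 2^32."""
    return (v + 1) & UINT_MAX

def refcount_inc(v):
    """refcount_t model: saturate instead of wrapping; also catches
    increment-from-zero (a ref taken on an already-freed object)."""
    if v == 0 or v >= UINT_MAX:
        return SATURATED
    return v + 1

# atomic_t: enough increments wrap the counter back through zero, so a
# later put can drop it to zero and free an object still in use.
wrapped = atomic_inc(UINT_MAX)

# refcount_t: the counter sticks at the saturation value; the object is
# never freed -- a resource leak (DoS) instead of a use-after-free.
saturated = refcount_inc(UINT_MAX)
```

Note this matches the thread's conclusion: too many puts (an underflow bug in kernel code) is still a bug either way; saturation only blocks the overflow-then-free path.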
Re: [PATCH 0/5] block subsystem refcounter conversions
On 02/20/2017 08:41 AM, James Bottomley wrote: > On Mon, 2017-02-20 at 08:15 -0700, Jens Axboe wrote: >> On 02/20/2017 04:16 AM, Elena Reshetova wrote: >>> Now when new refcount_t type and API are finally merged >>> (see include/linux/refcount.h), the following >>> patches convert various refcounters in the block susystem from >>> atomic_t to refcount_t. By doing this we prevent intentional or >>> accidental underflows or overflows that can led to use-after-free >>> vulnerabilities. > > This description isn't right ... nothing is prevented; we get warnings > on saturation and use after free with this. > >>> The below patches are fully independent and can be cherry-picked >>> separately. Since we convert all kernel subsystems in the same >>> fashion, resulting in about 300 patches, we have to group them for >>> sending at least in some fashion to be manageable. Please excuse >>> the long cc list. >>> >>> Elena Reshetova (5): >>> block: convert bio.__bi_cnt from atomic_t to refcount_t >>> block: convert blk_queue_tag.refcnt from atomic_t to refcount_t >>> block: convert blkcg_gq.refcnt from atomic_t to refcount_t >>> block: convert io_context.active_ref from atomic_t to refcount_t >>> block: convert bsg_device.ref_count from atomic_t to refcount_t >> >> I went to look at the implementation, and the size of a refcount_t. >> But the code is not there?! You say it's finally merged, where is >> it merged? Don't send code like this without the necessary >> infrastructure being in mainline. > > It's one of those no discussion except notice by tipbot things because > Ingo liked it. > > The size is atomic_t, but the primitives check for overflow and check > inc from zero and things, so in a true refcounting situation we get > automatic warnings of problems inside the primitives. 
>
> That said, if we were going to convert the block layer to this
> semantic, surely the benefit of the conversion would be getting rid of
> the now unnecessary BUG_ON and WARN_ON in the code, which do exactly
> the same thing the refcount primitives now do?

Well, I have no idea what it does, which is why I went to look at the code. But it's not there, so any talk of converting the block layer is premature. I'll defer judgment until we have something in mainline; until then I've archived this thread.

And I agree, keeping warn/bug for cases that should be handled with this framework would be counterproductive and pointless.

--
Jens Axboe
Re: [PATCH 0/5] block subsystem refcounter conversions
On Mon, 2017-02-20 at 08:15 -0700, Jens Axboe wrote: > On 02/20/2017 04:16 AM, Elena Reshetova wrote: > > Now when new refcount_t type and API are finally merged > > (see include/linux/refcount.h), the following > > patches convert various refcounters in the block susystem from > > atomic_t to refcount_t. By doing this we prevent intentional or > > accidental underflows or overflows that can led to use-after-free > > vulnerabilities. This description isn't right ... nothing is prevented; we get warnings on saturation and use after free with this. > > The below patches are fully independent and can be cherry-picked > > separately. Since we convert all kernel subsystems in the same > > fashion, resulting in about 300 patches, we have to group them for > > sending at least in some fashion to be manageable. Please excuse > > the long cc list. > > > > Elena Reshetova (5): > > block: convert bio.__bi_cnt from atomic_t to refcount_t > > block: convert blk_queue_tag.refcnt from atomic_t to refcount_t > > block: convert blkcg_gq.refcnt from atomic_t to refcount_t > > block: convert io_context.active_ref from atomic_t to refcount_t > > block: convert bsg_device.ref_count from atomic_t to refcount_t > > I went to look at the implementation, and the size of a refcount_t. > But the code is not there?! You say it's finally merged, where is > it merged? Don't send code like this without the necessary > infrastructure being in mainline. It's one of those no discussion except notice by tipbot things because Ingo liked it. The size is atomic_t, but the primitives check for overflow and check inc from zero and things, so in a true refcounting situation we get automatic warnings of problems inside the primitives. That said, if we were going to convert the block layer to this semantic, surely the benefit of the conversion would be getting rid of the now unnecessary BUG_ON and WARN_ON in the code, which do exactly the same thing the refcount primitives now do? 
James
Re: [PATCH 0/5] block subsystem refcounter conversions
On 02/20/2017 04:16 AM, Elena Reshetova wrote:
> Now when the new refcount_t type and API are finally merged
> (see include/linux/refcount.h), the following
> patches convert various refcounters in the block subsystem from atomic_t
> to refcount_t. By doing this we prevent intentional or accidental
> underflows or overflows that can lead to use-after-free vulnerabilities.
>
> The below patches are fully independent and can be cherry-picked separately.
> Since we convert all kernel subsystems in the same fashion, resulting
> in about 300 patches, we have to group them for sending at least in some
> fashion to be manageable. Please excuse the long cc list.
>
> Elena Reshetova (5):
>   block: convert bio.__bi_cnt from atomic_t to refcount_t
>   block: convert blk_queue_tag.refcnt from atomic_t to refcount_t
>   block: convert blkcg_gq.refcnt from atomic_t to refcount_t
>   block: convert io_context.active_ref from atomic_t to refcount_t
>   block: convert bsg_device.ref_count from atomic_t to refcount_t

I went to look at the implementation, and the size of a refcount_t. But the code is not there?! You say it's finally merged, where is it merged? Don't send code like this without the necessary infrastructure being in mainline.

--
Jens Axboe
Re: [PATCH] generic/311: Disable dmesg check
Hi Chandan, On 07/17/15 12:56, Chandan Rajendra wrote: When running generic/311 on Btrfs' subpagesize-blocksize patchset (on ppc64 with 4k sectorsize and 16k node/leaf size) I noticed the following call trace, BTRFS (device dm-0): parent transid verify failed on 29720576 wanted 160 found 158 BTRFS (device dm-0): parent transid verify failed on 29720576 wanted 160 found 158 BTRFS: Transaction aborted (error -5) WARNING: at /root/repos/linux/fs/btrfs/super.c:260 Modules linked in: CPU: 3 PID: 30769 Comm: umount Tainted: GWL 4.0.0-rc5-11671-g8b82e73e #63 task: c00079aaddb0 ti: c00079a48000 task.ti: c00079a48000 NIP: c0499aa0 LR: c0499a9c CTR: c0779630 REGS: c00079a4b480 TRAP: 0700 Tainted: GW L (4.0.0-rc5-11671-g8b82e73e) MSR: 800100029032CR: 28008828 XER: 2000 CFAR: c0a23914 SOFTE: 1 GPR00: c0499a9c c00079a4b700 c103bdf8 0025 GPR04: 0001 0502 c107e918 0cda GPR08: 0007 0007 0001 c10f5044 GPR12: 28008822 cfdc0d80 2000 10152e00 GPR16: 010002979380 10140724 GPR20: GPR24: c000151f61a8 c00055e5e800 c0aac270 GPR28: 04a4 fffb c00055e5e800 c000679204d0 NIP [c0499aa0] .__btrfs_abort_transaction+0x180/0x190 LR [c0499a9c] .__btrfs_abort_transaction+0x17c/0x190 Call Trace: [c00079a4b700] [c0499a9c] .__btrfs_abort_transaction+0x17c/0x190 (unreliable) [c00079a4b7a0] [c0541678] .__btrfs_run_delayed_items+0xe8/0x220 [c00079a4b850] [c04d5b3c] .btrfs_commit_transaction+0x37c/0xca0 [c00079a4b960] [c049824c] .btrfs_sync_fs+0x6c/0x1a0 [c00079a4ba00] [c0255270] .sync_filesystem+0xd0/0x100 [c00079a4ba80] [c0218070] .generic_shutdown_super+0x40/0x170 [c00079a4bb10] [c0218598] .kill_anon_super+0x18/0x30 [c00079a4bb90] [c0498418] .btrfs_kill_super+0x18/0xc0 [c00079a4bc10] [c0218ac8] .deactivate_locked_super+0x98/0xe0 [c00079a4bc90] [c023e744] .cleanup_mnt+0x54/0xa0 [c00079a4bd10] [c00b7d14] .task_work_run+0x114/0x150 [c00079a4bdb0] [c0015f84] .do_notify_resume+0x74/0x80 [c00079a4be30] [c0009838] .ret_from_except_lite+0x64/0x68 Instruction dump: ebc1fff0 ebe1fff8 4bfffb28 6000 3ce2ffcd 38e7e818 
4bbc 3c62ffd2 7fa4eb78 3863b808 48589e1d 6000 <0fe0> 4bfffedc 6000 6000

BTRFS: error (device dm-0) in __btrfs_run_delayed_items:1188: errno=-5 IO failure

The call trace is seen when executing _run_test() for the 8th time. The above trace is actually a false-positive failure, as indicated below:

fsync-tester
  fsync(fd)
    Write delayed inode item to fs tree
    (assume transid to be 160)
    (assume tree block to start at logical address 29720576)
  md5sum $testfile
    This causes a delayed inode to be added
  Load flakey table, i.e. drop writes that are initiated from now onwards
  Unmount filesystem
    btrfs_sync_fs is invoked
      Write 29720576 metadata block to disk
      free_extent_buffer(29720576)
        release_extent_buffer(29720576)
    Start writing delayed inode
      Traverse the fs tree
      (assume the parent tree block of 29720576 is still in memory)
      When reading 29720576 from disk, parent's blkptr will have
      generation set to 160. But the on-disk tree block will have an
      older generation (say, 158). Transid verification fails and
      hence the transaction gets aborted

The test only cares about the FS instance before the unmount operation (i.e. the synced FS). Hence to get the test to pass, ignore the false-positive trace that could be generated.

Looks like this patch didn't make it. Is there any kernel patch which fixed this bug? Or any hints on how to reproduce this bug?

Thanks, Anand
mount troubles after crash
Hi together,

During a balance run, the server crashed. After the crash I got the attached error messages.

Kernel: 4.9.11
Btrfs-Tools: v4.9.1

btrfs check --repair failed with:
couldn't open RDWR because of unsupported option features (3).
Open ctree failed

Is there a way to fix this 140TB filesystem?

Regards
Patrick

--
Patrick Schmid   support: +41 44 633 2668
IT Services Group, HPT H 8   voice: +41 44 633 3997
Departement Physik, ETH Zurich
CH-8093 Zurich, Switzerland

Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.622208] BUG: unable to handle kernel paging request at fe10
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.630033] IP: [] qgroup_fix_relocated_data_extents+0x2b/0x290 [btrfs]
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.639224] PGD 1c0a067
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.641864] PUD 1c0c067
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.644708] PMD 0
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.645306]
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.646983] Oops: [#1] SMP
Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.650495] Modules linked in: nfsv3(E) ipt_REJECT(E) nf_reject_ipv4(E) xt_tcpudp(E) xt_multiport(E) iptable_filter(E) ip_tables(E) x_tables(E) autofs4(E) ib_iser(E) rdma_cm(E) iw_cm(E) ib_cm(E) ib_core(E) configfs(E) iscsi_tcp(E) libiscsi_tcp(E) libiscsi(E) scsi_transport_iscsi(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) nfs(E) lockd(E) grace(E) sunrpc(E) fscache(E) btrfs(E) xor(E) raid6_pq(E) sb_edac(E) edac_core(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) aesni_intel(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ablk_helper(E) cryptd(E) bonding(E) ipmi_ssif(E) mousedev(E) lpc_ich(E) wmi(E) mei_me(E) mei(E) ioatdma(E) shpchp(E) tpm_tis(E) tpm_tis_core(E) ipmi_si(E) ipmi_poweroff(E) tcp_nv(E) ipmi_devintf(E) ipmi_msghandler(E)
arcmsr(OE) libahci(E) hid(E) mdio(E) pps_core(E) Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.747985] CPU: 6 PID: 86374 Comm: mount Tainted: G OE 4.9.11-stable-blkmq #41 Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.757389] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.03.0003.041920141333 04/19/2014 Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.768937] task: 88010323 task.stack: c90026238000 Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.775601] RIP: 0010:[] [] qgroup_fix_relocated_data_extents+0x2b/0x290 [btrfs] Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.787571] RSP: 0018:c9002623ba00 EFLAGS: 00010246 Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.793542] RAX: RBX: 880b6b60c000 RCX: Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.801559] RDX: 88010323 RSI: 881f18e2d000 RDI: 881f48535f40 Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.809572] RBP: c9002623ba78 R08: 60e0008071d0 R09: 881f48535f40 Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.817588] R10: 9cd50001 R11: 880d9cd508c0 R12: 881ff59b5800 Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.825604] R13: c9002623baa0 R14: R15: Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.833619] FS: 7fb6570ee880() GS:881ffe40() knlGS: Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.842731] CS: 0010 DS: ES: CR0: 80050033 Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.849189] CR2: fe10 CR3: 000c93c9d000 CR4: 000406e0 Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.857211] Stack: Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.859498] 881f48535f40 0801 c9002623ba68 a08ea66b Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.867884] 881f48535f40 88010323 Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.876269] 880dc489a1c0 881ff59b5800 880dc489a1c0 881ff59b5800 Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.884655] Call Trace: Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.887444] [] ? 
start_transaction+0x9b/0x4b0 [btrfs] Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.894986] [] btrfs_recover_relocation+0x378/0x420 [btrfs] Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.903136] [] open_ctree+0x22e7/0x27f0 [btrfs] Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.910089] [] btrfs_mount+0xca4/0xec0 [btrfs] Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.916942] [] ? find_next_zero_bit+0x1e/0x20 Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.923703] [] ? pcpu_next_unpop+0x3e/0x50 Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.930164] [] ? find_next_bit+0x19/0x20 Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.936432] [] mount_fs+0x39/0x160 Feb 20 14:42:01 phd-bkp-gw kernel: [ 6011.942114] [] ? __alloc_percpu+0x15/0x20 Feb 20 14:42:01 phd-bkp-gw
Re: Large no.of extents for a single small file with a fresh RAID5 setup
On Mon, Feb 20, 2017 at 03:53:15PM +0800, Qu Wenruo wrote:
>
> At 02/18/2017 10:07 AM, Lakshmipathi.G wrote:
> > Hi,
> >
> > With RAID5, using 3 loop-back devices, each of 4GB. On a fresh setup,
> > created a 2mb file.
> > While checking the no. of extents, it shows:
> >
> > # ./btrfs-debugfs -f /mnt/file2m.txt
> > (257 0): ram 110592 disk 145227776 disk_size 114688
> > (257 110592): ram 4096 disk 145461248 disk_size 4096
> > (257 114688): ram 114688 disk 145342464 disk_size 118784
> > (257 229376): ram 712704 disk 145620992 disk_size 716800
> > (257 942080): ram 4096 disk 145096704 disk_size 4096
> > (257 946176): ram 737280 disk 146407424 disk_size 741376
> > (257 1683456): ram 413696 disk 147193856 disk_size 413696
> > file: /mnt/file2m.txt extents 7 disk size 2113536 logical size 2097152 ratio 0.99
> >
> > Is this expected behaviour? Is there any mount option to minimize the
> > no. of extents for a smaller file?
>
> What mount option did you use?
> Are you using compression?

Used the default mount options (mount /dev/loop0 /mnt) and compression is not enabled.

> How did you create the 2m file?

Yes, this is the problem. The script invokes 'dd' with 'notrunc' for each byte. If the file is 1kb, 'dd' is invoked 1024 times with notrunc. This seems to be creating the issue. Tried using the open-fd method (exec FD<>/path/f, write, finally close FD), but even this runs into possible syncs and creates multiple extents. So finally found a simple trick: create the file outside btrfs, then copy it into the btrfs mount point :) This workaround solves the script's file-creation issue.

> This seems quite strange, if you go through normal buffered write without
> compression, 2m can easily be embedded into one file extent.
>
> And according to the extents at offset 0, 112K and 224K, their data
> extents are larger than their file extents, which means they went
> through CoW.
>
> Either you did fsync/sync between writes or you went through Direct IO.
>
> Anyway, the way you create the 2M file is quite important here now.
>
> Thanks,
> Qu

Thanks for the help.

Cheers.
Lakshmipathi.G
Re: [PATCH] Btrfs: remove unnecessary constraint when splitting a leaf
On Sat, Feb 18, 2017 at 12:34 AM, Liu Bowrote: > On Fri, Feb 17, 2017 at 07:16:04PM +, fdman...@kernel.org wrote: >> From: Filipe Manana >> >> If we want to append an item to a leaf we were trying to move off from the >> leaf into its neighbors an amount of space corresponding to the item's >> size. That amount of space is too much and can be reduced to the item's >> size minus the amount of free space in the leaf, like we do for all the >> other cases (item inserted at the beginning or somewhere in the middle of >> the leaf). >> >> Signed-off-by: Filipe Manana >> --- >> fs/btrfs/ctree.c | 12 +--- >> 1 file changed, 5 insertions(+), 7 deletions(-) >> >> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c >> index a426dc8..86be619 100644 >> --- a/fs/btrfs/ctree.c >> +++ b/fs/btrfs/ctree.c >> @@ -4130,11 +4130,11 @@ static noinline int push_for_double_split(struct >> btrfs_trans_handle *trans, >> int progress = 0; >> int slot; >> u32 nritems; >> - int space_needed = data_size; >> + int space_needed; >> >> slot = path->slots[0]; >> - if (slot < btrfs_header_nritems(path->nodes[0])) >> - space_needed -= btrfs_leaf_free_space(fs_info, path->nodes[0]); >> + space_needed = data_size - >> + btrfs_leaf_free_space(fs_info, path->nodes[0]); >> >> /* >>* try to push all the items after our slot into the >> @@ -4205,11 +4205,9 @@ static noinline int split_leaf(struct >> btrfs_trans_handle *trans, >> >> /* first try to make some room by pushing left and right */ >> if (data_size && path->nodes[1]) { >> - int space_needed = data_size; >> - >> - if (slot < btrfs_header_nritems(l)) >> - space_needed -= btrfs_leaf_free_space(fs_info, l); >> + int space_needed; > > One doubt, if (slot == nritems(l)), and push_leaf_right() is with @emtpy == 0, > so it is possible to use the right sibling as the node for the new item, so if > we minus l's free space from space_needed, then it's possible that > > (the original space_needed) > (free space of right node) > (the subtracted > space_needed) > > 
in that case, we're going to use the right node as path->nodes[0] while the right
> node doesn't have enough space.

Right. And I forgot why I added this constraint in the first place, and the optimization at push_leaf_right() that uses the right leaf and its first slot.

The idea here (this patch) was to cause fewer splits when we fall back to trying to move items into the left sibling. That is, if the new item is for a slot > 0 and the right sibling is full (or does not have enough space to move items into it), we try to push into the left sibling with the goal of freeing new_item_size - leaf_free_space bytes from our leaf instead of new_item_size bytes. Clearly while doing that I made the first optimization, of trying to use the right leaf, less likely to succeed.

I've reworked this with a new subject and changelog at:
https://patchwork.kernel.org/patch/9582901/

Thanks for reminding me about that.

> Thanks,
>
> -liubo
> >
> > +	space_needed = data_size - btrfs_leaf_free_space(fs_info, l);
> >  	wret = push_leaf_right(trans, root, path, space_needed,
> >  			       space_needed, 0, 0);
> >  	if (wret < 0)
> > --
> > 2.7.0.rc3
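[Editorial note: the inequality Liu Bo raises — (original space_needed) > (free space of right node) > (subtracted space_needed) — can be made concrete with made-up byte counts. The numbers below are hypothetical, chosen only to land inside that window; they do not come from a real leaf.]

```python
# Hypothetical sizes (bytes), not taken from a real btrfs leaf:
leaf_free  = 100   # free space left in our (nearly full) leaf
item_size  = 300   # data_size of the new item to insert
right_free = 250   # free space in the right sibling leaf

space_needed_original   = item_size              # 300
space_needed_subtracted = item_size - leaf_free  # 200

# With the subtraction, push_leaf_right() would report that the right
# sibling can absorb the request...
fits_after_subtraction = space_needed_subtracted <= right_free   # True

# ...even though the right leaf cannot actually hold the whole new item,
# which is what the right-leaf/first-slot optimization requires:
holds_whole_item = space_needed_original <= right_free           # False
```

This is exactly the window in which the optimization would wrongly pick the right leaf as path->nodes[0], motivating the reworked patch that applies the subtraction only on the left-push fallback.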
[PATCH] Btrfs: try harder to migrate items to left sibling before splitting a leaf
From: Filipe Manana

Before attempting to split a leaf we try to migrate items from the leaf to its right and left siblings. We start by trying to move items into the right sibling and, if the new item is meant to be inserted at the end of our leaf, we try to free from our leaf an amount of bytes equal to the number of bytes used by the new item, by setting the variable space_needed to the byte size of that new item. However if we fail to move enough items to the right sibling due to lack of space in that sibling, we then try to move items into the left sibling, and in that case we try to free an amount equal to the size of the new item from our leaf, when we need only to free an amount corresponding to the size of the new item minus the current free space of our leaf. So make sure that before we try to move items to the left sibling we do set the variable space_needed with a value corresponding to the new item's size minus the leaf's current free space.

Signed-off-by: Filipe Manana
---
 fs/btrfs/ctree.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index a426dc8..1d66761 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -4160,6 +4160,9 @@ static noinline int push_for_double_split(struct btrfs_trans_handle *trans,
 
 	/* try to push all the items before our slot into the next leaf */
 	slot = path->slots[0];
+	space_needed = data_size;
+	if (slot > 0)
+		space_needed -= btrfs_leaf_free_space(fs_info, path->nodes[0]);
 	ret = push_leaf_left(trans, root, path, 1, space_needed, 0, slot);
 	if (ret < 0)
 		return ret;
@@ -4215,6 +4218,10 @@ static noinline int split_leaf(struct btrfs_trans_handle *trans,
 		if (wret < 0)
 			return wret;
 		if (wret) {
+			space_needed = data_size;
+			if (slot > 0)
+				space_needed -= btrfs_leaf_free_space(fs_info,
+								      l);
 			wret = push_leaf_left(trans, root, path, space_needed,
 					      space_needed, 0, (u32)-1);
 			if (wret < 0)
-- 
2.7.0.rc3
Re: [PATCH] btrfs: Handle btrfs_reloc_clone_csums error correctly to avoid deadlock
On Mon, Feb 20, 2017 at 12:31 AM, Qu Wenruowrote: > > > At 02/17/2017 11:25 PM, Filipe Manana wrote: >> >> On Fri, Feb 17, 2017 at 12:37 AM, Qu Wenruo >> wrote: >>> >>> >>> >>> At 02/16/2017 06:03 PM, Filipe Manana wrote: On Thu, Feb 16, 2017 at 12:39 AM, Qu Wenruo wrote: > > > > > At 02/15/2017 11:59 PM, Filipe Manana wrote: >> >> >> >> On Wed, Feb 15, 2017 at 8:49 AM, Qu Wenruo >> wrote: >>> >>> >>> >>> If run btrfs/125 with nospace_cache or space_cache=v2 mount option, >>> btrfs will block with the following backtrace: >>> >>> Call Trace: >>> __schedule+0x2d4/0xae0 >>> schedule+0x3d/0x90 >>> btrfs_start_ordered_extent+0x160/0x200 [btrfs] >>> ? wake_atomic_t_function+0x60/0x60 >>> btrfs_run_ordered_extent_work+0x25/0x40 [btrfs] >>> btrfs_scrubparity_helper+0x1c1/0x620 [btrfs] >>> btrfs_flush_delalloc_helper+0xe/0x10 [btrfs] >>> process_one_work+0x2af/0x720 >>> ? process_one_work+0x22b/0x720 >>> worker_thread+0x4b/0x4f0 >>> kthread+0x10f/0x150 >>> ? process_one_work+0x720/0x720 >>> ? kthread_create_on_node+0x40/0x40 >>> ret_from_fork+0x2e/0x40 >>> >>> The direct cause is the error handler in run_delalloc_nocow() doesn't >>> handle error from btrfs_reloc_clone_csums() well. >>> >>> The related part call path will be: >>> __extent_writepage >>> |- writepage_delalloc() >>> | |- run_delalloc_range() >>> | |- run_delalloc_nocow() >>> ||- btrfs_add_ordered_extent() >>> || Now one ordered extent for file range, e.g [0, 1M) is >>> inserted >>> || >>> ||- btrfs_reloc_clone_csums() >>> || Fails with -EIO, as RAID5/6 doesn't repair some csum tree >>> || blocks >>> || >>> ||- extent_clear_unlock_delalloc() >>> | Error routine, unlock and clear page DIRTY, end page >>> writeback >>> | So the remaining 255 pages will not go through writeback >>> | >>> |- __extent_writepage_io() >>>|- writepage_end_io_hook() >>> |- btrfs_dev_test_ordered_pending() >>> Reduce ordered_extent->bytes_left by 4K. >>> Still have (1M - 4K) to finish. 
>>> >>> While the remaining 255 pages will not go through IO nor trigger >>> writepage_end_io_hook(), the ordered extent for [0, 1M) will >>> never finish, and blocking current transaction forever. >>> >>> Although the root cause is still in RAID5/6, it won't hurt to fix the >>> error routine first. >>> >>> This patch will cleanup the ordered extent in error routine, so at >>> least >>> we won't cause deadlock. >>> >>> Signed-off-by: Qu Wenruo >>> --- >>> fs/btrfs/extent_io.c| 1 - >>> fs/btrfs/inode.c| 10 -- >>> fs/btrfs/ordered-data.c | 25 + >>> fs/btrfs/ordered-data.h | 10 ++ >>> 4 files changed, 43 insertions(+), 3 deletions(-) >>> >>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c >>> index 4ac383a3a649..a14d1b0840c5 100644 >>> --- a/fs/btrfs/extent_io.c >>> +++ b/fs/btrfs/extent_io.c >>> @@ -3258,7 +3258,6 @@ static noinline_for_stack int >>> writepage_delalloc(struct inode *inode, >>>delalloc_end, >>>_started, >>>nr_written); >>> - /* File system has been set read-only */ >>> if (ret) { >>> SetPageError(page); >>> /* fill_delalloc should be return < 0 for >>> error >>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c >>> index 1e861a063721..3c3ade58afd7 100644 >>> --- a/fs/btrfs/inode.c >>> +++ b/fs/btrfs/inode.c >>> @@ -1052,8 +1052,11 @@ static noinline int cow_file_range(struct >>> inode >>> *inode, >>> BTRFS_DATA_RELOC_TREE_OBJECTID) { >>> ret = btrfs_reloc_clone_csums(inode, start, >>> >>> cur_alloc_size); >>> - if (ret) >>> + if (ret) { >>> + btrfs_clean_ordered_extent(inode, >>> start, >>> + ram_size); >>> goto out_drop_extent_cache; >>> + } >>> } >>> >>> btrfs_dec_block_group_reservations(fs_info, >>> ins.objectid); >>> @@ -1538,7 +1541,7 @@ static noinline int run_delalloc_nocow(struct >>> inode >>>
Re: [PATCH] qgroup: Retry after commit on getting EDQUOT
On 02/19/2017 11:35 PM, Qu Wenruo wrote:
>
> At 02/20/2017 12:35 PM, Goldwyn Rodrigues wrote:
>> Hi Qu,
>>
>> On 02/19/2017 09:07 PM, Qu Wenruo wrote:
>>>
>>> At 02/19/2017 09:30 PM, Goldwyn Rodrigues wrote:
>>>> From: Goldwyn Rodrigues
>>>>
>>>> We are facing the same problem with EDQUOT which was experienced with
>>>> ENOSPC. Not sure if we require a full ticketing system such as ENOSPC,
>>>> but here is a fix. Let me know if it is too big a hammer.
>>>>
>>>> Quotas are reserved during the start of an operation, incrementing
>>>> qg->reserved. However, it is written to disk in a commit_transaction
>>>> which could take as long as commit_interval. In the meantime there
>>>> could be deletions which are not accounted for, because deletions are
>>>> accounted for only when committed (free_refroot). So, when we get an
>>>> EDQUOT, flush the data to disk and try again.
>>>
>>> IIRC Jeff submitted a qgroup patch to allow unlink to be done even if we
>>> already hit EDQUOT.
>>> https://patchwork.kernel.org/patch/9537193/
>>>
>>> I think that's a better solution, which is quite like what we do to
>>> handle ENOSPC.
>>
>> These are two separate issues. This issue is where qg->reserved gets
>> over-inflated because of repeated deletions and recreations within a
>> commit_interval.
>
> My fault, that's indeed two different bugs.
>
> And I succeeded to reproduce the bug.
>
>> Jeff's patch deals with allowing removal even if the quota is exceeded
>> so that eventually there can be space freed.
>>
>> I would suggest you apply Jeff's patch and try to run the script I have
>> presented.
>>
>>> Thanks,
>>> Qu
>>>
>>>> I combined the conditions of rfer and excl to reduce code lines,
>>>> though the condition looks uglier.
>>>>
>>>> Here is a sample script which shows this issue.
DEVICE=/dev/vdb
MOUNTPOINT=/mnt
TESTVOL=$MOUNTPOINT/tmp
QUOTA=5
PROG=btrfs
DD_BS="4k"
DD_COUNT="256"
RUN_TIMES=5000

mkfs.btrfs -f $DEVICE
mount -o commit=240 $DEVICE $MOUNTPOINT
$PROG subvolume create $TESTVOL
$PROG quota enable $TESTVOL
$PROG qgroup limit ${QUOTA}G $TESTVOL

typeset -i DD_RUN_GOOD
typeset -i QUOTA

function _check_cmd() {
        if [[ ${?} > 0 ]]; then
                echo -n "$(date) E: Running previous command"
                echo ${*}
                echo "Without sync"
                $PROG qgroup show -pcreFf ${TESTVOL}
                echo "With sync"
                $PROG qgroup show -pcreFf --sync ${TESTVOL}
                exit 1
        fi
}

while true; do
        DD_RUN_GOOD=$RUN_TIMES
        while (( ${DD_RUN_GOOD} != 0 )); do
                dd if=/dev/zero of=${TESTVOL}/quotatest${DD_RUN_GOOD} bs=${DD_BS} count=${DD_COUNT}
                _check_cmd "dd if=/dev/zero of=${TESTVOL}/quotatest${DD_RUN_GOOD} bs=${DD_BS} count=${DD_COUNT}"
                DD_RUN_GOOD=$((${DD_RUN_GOOD}-1))
        done
        $PROG qgroup show -pcref $TESTVOL
        echo "--- Cleanup -- "
        rm $TESTVOL/quotatest*
done

> It would be better if we can reduce the reproduce time and submit it as
> a fstest test case.

Yes, unfortunately, it is more probabilistic than deterministic. But yes, reducing the size and increasing the commit_interval can improve the time.
> Signed-off-by: Goldwyn Rodrigues --- fs/btrfs/qgroup.c | 28 +--- 1 file changed, 17 insertions(+), 11 deletions(-) diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c index 4700cac..9ace407 100644 --- a/fs/btrfs/qgroup.c +++ b/fs/btrfs/qgroup.c @@ -2093,6 +2093,7 @@ static int qgroup_reserve(struct btrfs_root *root, u64 num_bytes) struct btrfs_fs_info *fs_info = root->fs_info; u64 ref_root = root->root_key.objectid; int ret = 0; +int retried = 0; struct ulist_node *unode; struct ulist_iterator uiter; @@ -2101,7 +2102,7 @@ static int qgroup_reserve(struct btrfs_root *root, u64 num_bytes) if (num_bytes == 0) return 0; - +retry: spin_lock(_info->qgroup_lock); quota_root = fs_info->quota_root; if (!quota_root) @@ -2127,16 +2128,21 @@ static int qgroup_reserve(struct btrfs_root *root, u64 num_bytes) qg = u64_to_ptr(unode->aux); -if ((qg->lim_flags & BTRFS_QGROUP_LIMIT_MAX_RFER) && -qg->reserved + (s64)qg->rfer + num_bytes > -qg->max_rfer) { -ret = -EDQUOT; -goto out; -} - -if ((qg->lim_flags & BTRFS_QGROUP_LIMIT_MAX_EXCL) && -qg->reserved + (s64)qg->excl + num_bytes > -qg->max_excl) { +if (((qg->lim_flags & BTRFS_QGROUP_LIMIT_MAX_RFER) && +
[PATCH 0/5] block subsystem refcounter conversions
Now that the new refcount_t type and API are finally merged (see include/linux/refcount.h), the following patches convert various refcounters in the block subsystem from atomic_t to refcount_t. By doing this we prevent intentional or accidental underflows or overflows that can lead to use-after-free vulnerabilities.

The below patches are fully independent and can be cherry-picked separately. Since we convert all kernel subsystems in the same fashion, resulting in about 300 patches, we have to group them for sending at least in some fashion to be manageable. Please excuse the long cc list.

Elena Reshetova (5):
  block: convert bio.__bi_cnt from atomic_t to refcount_t
  block: convert blk_queue_tag.refcnt from atomic_t to refcount_t
  block: convert blkcg_gq.refcnt from atomic_t to refcount_t
  block: convert io_context.active_ref from atomic_t to refcount_t
  block: convert bsg_device.ref_count from atomic_t to refcount_t

 block/bio.c                |  6 +++---
 block/blk-cgroup.c         |  2 +-
 block/blk-ioc.c            |  4 ++--
 block/blk-tag.c            |  8 ++++----
 block/bsg.c                |  9 +++++----
 block/cfq-iosched.c        |  4 ++--
 fs/btrfs/volumes.c         |  2 +-
 include/linux/bio.h        |  4 ++--
 include/linux/blk-cgroup.h | 11 ++++++-----
 include/linux/blk_types.h  |  3 ++-
 include/linux/blkdev.h     |  3 ++-
 include/linux/iocontext.h  |  7 ++++---
 12 files changed, 34 insertions(+), 29 deletions(-)

--
2.7.4
[PATCH 1/5] block: convert bio.__bi_cnt from atomic_t to refcount_t
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows us to avoid accidental refcounter overflows that might lead to use-after-free situations.

Signed-off-by: Elena Reshetova
Signed-off-by: Hans Liljestrand
Signed-off-by: Kees Cook
Signed-off-by: David Windsor
---
 block/bio.c               | 6 +++---
 fs/btrfs/volumes.c        | 2 +-
 include/linux/bio.h       | 4 ++--
 include/linux/blk_types.h | 3 ++-
 4 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 5eec5e0..3dffc17 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -275,7 +275,7 @@ void bio_init(struct bio *bio, struct bio_vec *table,
 {
 	memset(bio, 0, sizeof(*bio));
 	atomic_set(&bio->__bi_remaining, 1);
-	atomic_set(&bio->__bi_cnt, 1);
+	refcount_set(&bio->__bi_cnt, 1);
 
 	bio->bi_io_vec = table;
 	bio->bi_max_vecs = max_vecs;
@@ -543,12 +543,12 @@ void bio_put(struct bio *bio)
 	if (!bio_flagged(bio, BIO_REFFED))
 		bio_free(bio);
 	else {
-		BIO_BUG_ON(!atomic_read(&bio->__bi_cnt));
+		BIO_BUG_ON(!refcount_read(&bio->__bi_cnt));
 
 		/*
 		 * last put frees it
 		 */
-		if (atomic_dec_and_test(&bio->__bi_cnt))
+		if (refcount_dec_and_test(&bio->__bi_cnt))
 			bio_free(bio);
 	}
 }
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 13e55d1..ff1fe9d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -441,7 +441,7 @@ static noinline void run_scheduled_bios(struct btrfs_device *device)
 		    waitqueue_active(&fs_info->async_submit_wait))
 			wake_up(&fs_info->async_submit_wait);
 
-		BUG_ON(atomic_read(&cur->__bi_cnt) == 0);
+		BUG_ON(refcount_read(&cur->__bi_cnt) == 0);
 
 		/*
 		 * if we're doing the sync list, record that our
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 8e52119..44ac678 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -234,7 +234,7 @@ static inline void bio_get(struct bio *bio)
 {
 	bio->bi_flags |= (1 << BIO_REFFED);
 	smp_mb__before_atomic();
-	atomic_inc(&bio->__bi_cnt);
+	refcount_inc(&bio->__bi_cnt);
 }
 
 static inline void bio_cnt_set(struct bio *bio, unsigned int count)
@@ -243,7 +243,7 @@ static inline void bio_cnt_set(struct bio *bio, unsigned int count)
 		bio->bi_flags |= (1 << BIO_REFFED);
 		smp_mb__before_atomic();
 	}
-	atomic_set(&bio->__bi_cnt, count);
+	refcount_set(&bio->__bi_cnt, count);
 }
 
 static inline bool bio_flagged(struct bio *bio, unsigned int bit)
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index d703acb..c41407f 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -7,6 +7,7 @@
 
 #include
 #include
+#include <linux/refcount.h>
 
 struct bio_set;
 struct bio;
@@ -73,7 +74,7 @@ struct bio {
 
 	unsigned short		bi_max_vecs;	/* max bvl_vecs we can hold */
 
-	atomic_t		__bi_cnt;	/* pin count */
+	refcount_t		__bi_cnt;	/* pin count */
 
 	struct bio_vec		*bi_io_vec;	/* the actual vec list */
--
2.7.4
[PATCH 2/5] block: convert blk_queue_tag.refcnt from atomic_t to refcount_t
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows us to avoid accidental refcounter overflows that might lead to use-after-free situations.

Signed-off-by: Elena Reshetova
Signed-off-by: Hans Liljestrand
Signed-off-by: Kees Cook
Signed-off-by: David Windsor
---
 block/blk-tag.c        | 8 ++++----
 include/linux/blkdev.h | 3 ++-
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/block/blk-tag.c b/block/blk-tag.c
index 07cc329..d83555e 100644
--- a/block/blk-tag.c
+++ b/block/blk-tag.c
@@ -35,7 +35,7 @@ EXPORT_SYMBOL(blk_queue_find_tag);
  */
 void blk_free_tags(struct blk_queue_tag *bqt)
 {
-	if (atomic_dec_and_test(&bqt->refcnt)) {
+	if (refcount_dec_and_test(&bqt->refcnt)) {
 		BUG_ON(find_first_bit(bqt->tag_map, bqt->max_depth) <
 							bqt->max_depth);
 
@@ -130,7 +130,7 @@ static struct blk_queue_tag *__blk_queue_init_tags(struct request_queue *q,
 	if (init_tag_map(q, tags, depth))
 		goto fail;
 
-	atomic_set(&tags->refcnt, 1);
+	refcount_set(&tags->refcnt, 1);
 	tags->alloc_policy = alloc_policy;
 	tags->next_tag = 0;
 	return tags;
@@ -180,7 +180,7 @@ int blk_queue_init_tags(struct request_queue *q, int depth,
 		queue_flag_set(QUEUE_FLAG_QUEUED, q);
 		return 0;
 	} else
-		atomic_inc(&bqt->refcnt);
+		refcount_inc(&bqt->refcnt);
 
 	/*
 	 * assign it, all done
@@ -225,7 +225,7 @@ int blk_queue_resize_tags(struct request_queue *q, int new_depth)
 	 * Currently cannot replace a shared tag map with a new
 	 * one, so error out if this is the case
 	 */
-	if (atomic_read(&bqt->refcnt) != 1)
+	if (refcount_read(&bqt->refcnt) != 1)
 		return -EBUSY;
 
 	/*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index aecca0e..95ef11c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -25,6 +25,7 @@
 #include
 #include
 #include
+#include <linux/refcount.h>
 
 struct module;
 struct scsi_ioctl_command;
@@ -288,7 +289,7 @@ struct blk_queue_tag {
 	unsigned long *tag_map;		/* bit map of free/busy tags */
 	int max_depth;			/* what we will send to device */
 	int real_max_depth;		/* what the array can hold */
-	atomic_t refcnt;		/* map can be shared */
+	refcount_t refcnt;		/* map can be shared */
 	int alloc_policy;		/* tag allocation policy */
 	int next_tag;			/* next tag */
 };
--
2.7.4
[PATCH 4/5] block: convert io_context.active_ref from atomic_t to refcount_t
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows us to avoid accidental refcounter overflows that might lead to use-after-free situations.

Signed-off-by: Elena Reshetova
Signed-off-by: Hans Liljestrand
Signed-off-by: Kees Cook
Signed-off-by: David Windsor
---
 block/blk-ioc.c           | 4 ++--
 block/cfq-iosched.c       | 4 ++--
 include/linux/iocontext.h | 7 ++++---
 3 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index b12f9c8..130dc23 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -173,7 +173,7 @@ void put_io_context_active(struct io_context *ioc)
 	unsigned long flags;
 	struct io_cq *icq;
 
-	if (!atomic_dec_and_test(&ioc->active_ref)) {
+	if (!refcount_dec_and_test(&ioc->active_ref)) {
 		put_io_context(ioc);
 		return;
 	}
@@ -256,7 +256,7 @@ int create_task_io_context(struct task_struct *task, gfp_t gfp_flags, int node)
 	/* initialize */
 	atomic_long_set(&ioc->refcount, 1);
 	atomic_set(&ioc->nr_tasks, 1);
-	atomic_set(&ioc->active_ref, 1);
+	refcount_set(&ioc->active_ref, 1);
 	spin_lock_init(&ioc->lock);
 	INIT_RADIX_TREE(&ioc->icq_tree, GFP_ATOMIC | __GFP_HIGH);
 	INIT_HLIST_HEAD(&ioc->icq_list);
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 9212627..2871bb9 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -2959,7 +2959,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 	 * task has exited, don't wait
 	 */
 	cic = cfqd->active_cic;
-	if (!cic || !atomic_read(&cic->icq.ioc->active_ref))
+	if (!cic || !refcount_read(&cic->icq.ioc->active_ref))
 		return;
 
 	/*
@@ -3959,7 +3959,7 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	if (cfqq->next_rq && req_noidle(cfqq->next_rq))
 		enable_idle = 0;
-	else if (!atomic_read(&cic->icq.ioc->active_ref) ||
+	else if (!refcount_read(&cic->icq.ioc->active_ref) ||
 		 !cfqd->cfq_slice_idle ||
 		 (!cfq_cfqq_deep(cfqq) && CFQQ_SEEKY(cfqq)))
 		enable_idle = 0;
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index df38db2..a1e28c3 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -3,6 +3,7 @@
 
 #include
 #include
+#include <linux/refcount.h>
 #include
 
 enum {
@@ -96,7 +97,7 @@ struct io_cq {
  */
 struct io_context {
 	atomic_long_t refcount;
-	atomic_t active_ref;
+	refcount_t active_ref;
 	atomic_t nr_tasks;
 
 	/* all the fields below are protected by this lock */
@@ -128,9 +129,9 @@ struct io_context {
 static inline void get_io_context_active(struct io_context *ioc)
 {
 	WARN_ON_ONCE(atomic_long_read(&ioc->refcount) <= 0);
-	WARN_ON_ONCE(atomic_read(&ioc->active_ref) <= 0);
+	WARN_ON_ONCE(refcount_read(&ioc->active_ref) == 0);
 	atomic_long_inc(&ioc->refcount);
-	atomic_inc(&ioc->active_ref);
+	refcount_inc(&ioc->active_ref);
 }
 
 static inline void ioc_task_link(struct io_context *ioc)
--
2.7.4
[PATCH 3/5] block: convert blkcg_gq.refcnt from atomic_t to refcount_t
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows us to avoid accidental refcounter overflows that might lead to use-after-free situations.

Signed-off-by: Elena Reshetova
Signed-off-by: Hans Liljestrand
Signed-off-by: Kees Cook
Signed-off-by: David Windsor
---
 block/blk-cgroup.c         |  2 +-
 include/linux/blk-cgroup.h | 11 ++++++-----
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 295e98c2..75de844 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -106,7 +106,7 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
 	blkg->q = q;
 	INIT_LIST_HEAD(&blkg->q_node);
 	blkg->blkcg = blkcg;
-	atomic_set(&blkg->refcnt, 1);
+	refcount_set(&blkg->refcnt, 1);
 
 	/* root blkg uses @q->root_rl, init rl only for !root blkgs */
 	if (blkcg != &blkcg_root) {
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 01b62e7..0d3efa9 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -19,6 +19,7 @@
 #include
 #include
 #include
+#include <linux/refcount.h>
 
 /* percpu_counter batch for blkg_[rw]stats, per-cpu drift doesn't matter */
 #define BLKG_STAT_CPU_BATCH	(INT_MAX / 2)
@@ -122,7 +123,7 @@ struct blkcg_gq {
 	struct request_list		rl;
 
 	/* reference count */
-	atomic_t			refcnt;
+	refcount_t			refcnt;
 
 	/* is this blkg online? protected by both blkcg and q locks */
 	bool				online;
@@ -354,8 +355,8 @@ static inline int blkg_path(struct blkcg_gq *blkg, char *buf, int buflen)
  */
 static inline void blkg_get(struct blkcg_gq *blkg)
 {
-	WARN_ON_ONCE(atomic_read(&blkg->refcnt) <= 0);
-	atomic_inc(&blkg->refcnt);
+	WARN_ON_ONCE(refcount_read(&blkg->refcnt) == 0);
+	refcount_inc(&blkg->refcnt);
 }
 
 void __blkg_release_rcu(struct rcu_head *rcu);
@@ -366,8 +367,8 @@ void __blkg_release_rcu(struct rcu_head *rcu);
  */
 static inline void blkg_put(struct blkcg_gq *blkg)
 {
-	WARN_ON_ONCE(atomic_read(&blkg->refcnt) <= 0);
-	if (atomic_dec_and_test(&blkg->refcnt))
+	WARN_ON_ONCE(refcount_read(&blkg->refcnt) == 0);
+	if (refcount_dec_and_test(&blkg->refcnt))
 		call_rcu(&blkg->rcu_head, __blkg_release_rcu);
 }
--
2.7.4
[PATCH 5/5] block: convert bsg_device.ref_count from atomic_t to refcount_t
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows us to avoid accidental refcounter overflows that might lead to use-after-free situations.

Signed-off-by: Elena Reshetova
Signed-off-by: Hans Liljestrand
Signed-off-by: Kees Cook
Signed-off-by: David Windsor
---
 block/bsg.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/block/bsg.c b/block/bsg.c
index 74835db..6d0ceb9 100644
--- a/block/bsg.c
+++ b/block/bsg.c
@@ -21,6 +21,7 @@
 #include
 #include
 #include
+#include <linux/refcount.h>
 #include
 #include
 
@@ -38,7 +39,7 @@ struct bsg_device {
 	struct list_head busy_list;
 	struct list_head done_list;
 	struct hlist_node dev_list;
-	atomic_t ref_count;
+	refcount_t ref_count;
 	int queued_cmds;
 	int done_cmds;
 	wait_queue_head_t wq_done;
@@ -711,7 +712,7 @@ static int bsg_put_device(struct bsg_device *bd)
 
 	mutex_lock(&bsg_mutex);
 
-	do_free = atomic_dec_and_test(&bd->ref_count);
+	do_free = refcount_dec_and_test(&bd->ref_count);
 	if (!do_free) {
 		mutex_unlock(&bsg_mutex);
 		goto out;
@@ -763,7 +764,7 @@ static struct bsg_device *bsg_add_device(struct inode *inode,
 	bsg_set_block(bd, file);
 
-	atomic_set(&bd->ref_count, 1);
+	refcount_set(&bd->ref_count, 1);
 	mutex_lock(&bsg_mutex);
 	hlist_add_head(&bd->dev_list, bsg_dev_idx_hash(iminor(inode)));
 
@@ -783,7 +784,7 @@ static struct bsg_device *__bsg_get_device(int minor, struct request_queue *q)
 	hlist_for_each_entry(bd, bsg_dev_idx_hash(minor), dev_list) {
 		if (bd->queue == q) {
-			atomic_inc(&bd->ref_count);
+			refcount_inc(&bd->ref_count);
 			goto found;
 		}
 	}
--
2.7.4
Re: [PATCH 3/5] btrfs: export compression buffer limits in a header
At 02/18/2017 02:05 AM, David Sterba wrote:

Move the buffer limit definitions out of compress_file_range.

Nice one. No longer any need to dig into the code to find the 128K limit.

Reviewed-by: Qu Wenruo

Thanks,
Qu

Signed-off-by: David Sterba
---
 fs/btrfs/compression.h | 15 +++++++++++++++
 fs/btrfs/inode.c       | 10 ----------
 2 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
index e453f42b3bbf..c9d7e552cfa8 100644
--- a/fs/btrfs/compression.h
+++ b/fs/btrfs/compression.h
@@ -19,6 +19,21 @@
 #ifndef __BTRFS_COMPRESSION_
 #define __BTRFS_COMPRESSION_
 
+/*
+ * We want to make sure that amount of RAM required to uncompress an extent is
+ * reasonable, so we limit the total size in ram of a compressed extent to
+ * 128k. This is a crucial number because it also controls how easily we can
+ * spread reads across cpus for decompression.
+ *
+ * We also want to make sure the amount of IO required to do a random read is
+ * reasonably small, so we limit the size of a compressed extent to 128k.
+ */
+
+/* Maximum length of compressed data stored on disk */
+#define BTRFS_MAX_COMPRESSED		(SZ_128K)
+/* Maximum size of data before compression */
+#define BTRFS_MAX_UNCOMPRESSED		(SZ_128K)
+
 void btrfs_init_compress(void);
 void btrfs_exit_compress(void);
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 801d1b3fd9d7..bc547608ff1a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -470,16 +470,6 @@ static noinline void compress_file_range(struct inode *inode,
 	    (start > 0 || end + 1 < BTRFS_I(inode)->disk_i_size))
 		goto cleanup_and_bail_uncompressed;
 
-	/* we want to make sure that amount of ram required to uncompress
-	 * an extent is reasonable, so we limit the total size in ram
-	 * of a compressed extent to 128k.  This is a crucial number
-	 * because it also controls how easily we can spread reads across
-	 * cpus for decompression.
-	 *
-	 * We also want to make sure the amount of IO required to do
-	 * a random read is reasonably small, so we limit the size of
-	 * a compressed extent to 128k.
-	 */
 	total_compressed = min(total_compressed, max_uncompressed);
 	num_bytes = ALIGN(end - start + 1, blocksize);
 	num_bytes = max(blocksize, num_bytes);