Re: [PATCH v2] Btrfs: fix null pointer dereference on compressed write path error
On Fri, Oct 12, 2018 at 4:38 PM wrote: > > From: Filipe Manana > > At inode.c:compress_file_range(), under the "free_pages_out" label, we can > end up dereferencing the "pages" pointer when it has a NULL value. This > case happens when "start" has a value of 0 and we fail to allocate memory > for the "pages" pointer. When that happens we jump to the "cont" label and > then enter the "if (start == 0)" branch where we immediately call the > cow_file_range_inline() function. If that function returns 0 (success > creating an inline extent) or an error (like -ENOMEM for example) we jump > to the "free_pages_out" label and then access "pages[i]" leading to a NULL > pointer dereference, since "nr_pages" has a value greater than zero at > that point. > > Fix this by setting "nr_pages" to 0 when we fail to allocate memory for > the "pages" pointer. > Looks good. Reviewed-by: Liu Bo thanks, liubo > Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201119 > Fixes: 771ed689d2cd ("Btrfs: Optimize compressed writeback and reads") > Signed-off-by: Filipe Manana > --- > > V2: Updated changelog. > > fs/btrfs/inode.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 66c6c4103d2f..d6b61b1facdd 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -503,6 +503,7 @@ static noinline void compress_file_range(struct inode > *inode, > pages = kcalloc(nr_pages, sizeof(struct page *), GFP_NOFS); > if (!pages) { > /* just bail out to the uncompressed code */ > + nr_pages = 0; > goto cont; > } > > -- > 2.11.0 >
[PATCH v2] Btrfs: fix null pointer dereference on compressed write path error
From: Filipe Manana At inode.c:compress_file_range(), under the "free_pages_out" label, we can end up dereferencing the "pages" pointer when it has a NULL value. This case happens when "start" has a value of 0 and we fail to allocate memory for the "pages" pointer. When that happens we jump to the "cont" label and then enter the "if (start == 0)" branch where we immediately call the cow_file_range_inline() function. If that function returns 0 (success creating an inline extent) or an error (like -ENOMEM for example) we jump to the "free_pages_out" label and then access "pages[i]" leading to a NULL pointer dereference, since "nr_pages" has a value greater than zero at that point. Fix this by setting "nr_pages" to 0 when we fail to allocate memory for the "pages" pointer. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201119 Fixes: 771ed689d2cd ("Btrfs: Optimize compressed writeback and reads") Signed-off-by: Filipe Manana --- V2: Updated changelog. fs/btrfs/inode.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 66c6c4103d2f..d6b61b1facdd 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -503,6 +503,7 @@ static noinline void compress_file_range(struct inode *inode, pages = kcalloc(nr_pages, sizeof(struct page *), GFP_NOFS); if (!pages) { /* just bail out to the uncompressed code */ + nr_pages = 0; goto cont; } -- 2.11.0
Re: [PATCH v7 2/6] mm: export add_swap_extent()
On Tue, 11 Sep 2018 15:34:45 -0700 Omar Sandoval wrote: > From: Omar Sandoval > > Btrfs will need this for swap file support. > Acked-by: Andrew Morton
Re: [PATCH v7 1/6] mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS
On Tue, 11 Sep 2018 15:34:44 -0700 Omar Sandoval wrote: > From: Omar Sandoval > > The SWP_FILE flag serves two purposes: to make swap_{read,write}page() > go through the filesystem, and to make swapoff() call > ->swap_deactivate(). For Btrfs, we want the latter but not the former, > so split this flag into two. This makes us always call > ->swap_deactivate() if ->swap_activate() succeeded, not just if it > didn't add any swap extents itself. > > This also resolves the issue of the very misleading name of SWP_FILE, > which is only used for swap files over NFS. > Acked-by: Andrew Morton
Re: [PATCH 42/42] btrfs: don't run delayed_iputs in commit
On Fri, Oct 12, 2018 at 8:35 PM Josef Bacik wrote: > > This could result in a really bad case where we do something like > > evict > evict_refill_and_join > btrfs_commit_transaction > btrfs_run_delayed_iputs > evict > evict_refill_and_join > btrfs_commit_transaction > ... forever > > We have plenty of other places where we run delayed iputs that are much > safer, let those do the work. > > Signed-off-by: Josef Bacik Reviewed-by: Filipe Manana > --- > fs/btrfs/transaction.c | 9 - > 1 file changed, 9 deletions(-) > > diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c > index 9168efaca37e..c91dc36fccae 100644 > --- a/fs/btrfs/transaction.c > +++ b/fs/btrfs/transaction.c > @@ -2265,15 +2265,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle > *trans) > > kmem_cache_free(btrfs_trans_handle_cachep, trans); > > - /* > -* If fs has been frozen, we can not handle delayed iputs, otherwise > -* it'll result in deadlock about SB_FREEZE_FS. > -*/ > - if (current != fs_info->transaction_kthread && > - current != fs_info->cleaner_kthread && > - !test_bit(BTRFS_FS_FROZEN, _info->flags)) > - btrfs_run_delayed_iputs(fs_info); > - > return ret; > > scrub_continue: > -- > 2.14.3 > -- Filipe David Manana, “Whether you think you can, or you think you can't — you're right.”
[PATCH] Btrfs: fix null pointer dereference on compressed write path error
From: Filipe Manana At inode.c:compress_file_range(), under the "free_pages_out" label, we can end up dereferencing the "pages" pointer when it has a NULL value. This case happens when "start" has a value of 0 and we fail to allocate memory for the "pages" pointer. When that happens we jump to the "cont" label and then enter the "if (start == 0)" branch where we immediately call the cow_file_range_inline() function. If that function returns an error (like -ENOMEM for example) we jump to the "free_pages_out" label and then access "pages[i]" leading to a NULL pointer dereference, since "nr_pages" has a value greater than zero at that point. Fix this by setting "nr_pages" to 0 when we fail to allocate memory for the "pages" pointer. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201119 Fixes: 771ed689d2cd ("Btrfs: Optimize compressed writeback and reads") Signed-off-by: Filipe Manana --- fs/btrfs/inode.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 66c6c4103d2f..d6b61b1facdd 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -503,6 +503,7 @@ static noinline void compress_file_range(struct inode *inode, pages = kcalloc(nr_pages, sizeof(struct page *), GFP_NOFS); if (!pages) { /* just bail out to the uncompressed code */ + nr_pages = 0; goto cont; } -- 2.11.0
[PATCH 33/42] btrfs: fix insert_reserved error handling
We were not handling the reserved byte accounting properly for data references. Metadata was fine, if it errored out the error paths would free the bytes_reserved count and pin the extent, but it even missed one of the error cases. So instead move this handling up into run_one_delayed_ref so we are sure that both cases are properly cleaned up in case of a transaction abort. Reviewed-by: David Sterba Reviewed-by: Nikolay Borisov Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 12 1 file changed, 4 insertions(+), 8 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 933cba06c9fb..9b5366932298 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2405,6 +2405,9 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans, insert_reserved); else BUG(); + if (ret && insert_reserved) + btrfs_pin_extent(trans->fs_info, node->bytenr, +node->num_bytes, 1); return ret; } @@ -8257,21 +8260,14 @@ static int alloc_reserved_tree_block(struct btrfs_trans_handle *trans, } path = btrfs_alloc_path(); - if (!path) { - btrfs_free_and_pin_reserved_extent(fs_info, - extent_key.objectid, - fs_info->nodesize); + if (!path) return -ENOMEM; - } path->leave_spinning = 1; ret = btrfs_insert_empty_item(trans, fs_info->extent_root, path, _key, size); if (ret) { btrfs_free_path(path); - btrfs_free_and_pin_reserved_extent(fs_info, - extent_key.objectid, - fs_info->nodesize); return ret; } -- 2.14.3
[PATCH 31/42] btrfs: cleanup pending bgs on transaction abort
We may abort the transaction during a commit and not have a chance to run the pending bgs stuff, which will leave block groups on our list and cause us accounting issues and leaked memory. Fix this by running the pending bgs when we cleanup a transaction. Reviewed-by: Omar Sandoval Signed-off-by: Josef Bacik --- fs/btrfs/transaction.c | 4 1 file changed, 4 insertions(+) diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 46ca775a709e..9168efaca37e 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -2280,6 +2280,10 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) btrfs_scrub_continue(fs_info); cleanup_transaction: btrfs_trans_release_metadata(trans); + /* This cleans up the pending block groups list properly. */ + if (!trans->aborted) + trans->aborted = ret; + btrfs_create_pending_block_groups(trans); btrfs_trans_release_chunk_metadata(trans); trans->block_rsv = NULL; btrfs_warn(fs_info, "Skipping commit of aborted transaction."); -- 2.14.3
[PATCH 27/42] btrfs: make btrfs_destroy_delayed_refs use btrfs_delete_ref_head
Instead of open coding this stuff use the helper instead. Reviewed-by: Nikolay Borisov Signed-off-by: Josef Bacik --- fs/btrfs/disk-io.c | 7 +-- 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 121ab180a78a..fe1f229320ef 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -4232,12 +4232,7 @@ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans, if (head->must_insert_reserved) pin_bytes = true; btrfs_free_delayed_extent_op(head->extent_op); - delayed_refs->num_heads--; - if (head->processing == 0) - delayed_refs->num_heads_ready--; - atomic_dec(_refs->num_entries); - rb_erase(>href_node, _refs->href_root); - RB_CLEAR_NODE(>href_node); + btrfs_delete_ref_head(delayed_refs, head); spin_unlock(>lock); spin_unlock(_refs->lock); mutex_unlock(>mutex); -- 2.14.3
[PATCH 40/42] btrfs: drop min_size from evict_refill_and_join
We don't need it, rsv->size is set once and never changes throughout its lifetime, so just use that for the reserve size. Reviewed-by: David Sterba Signed-off-by: Josef Bacik --- fs/btrfs/inode.c | 16 ++-- 1 file changed, 6 insertions(+), 10 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index ab8242b10601..dbcca915e681 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -5339,8 +5339,7 @@ static void evict_inode_truncate_pages(struct inode *inode) } static struct btrfs_trans_handle *evict_refill_and_join(struct btrfs_root *root, - struct btrfs_block_rsv *rsv, - u64 min_size) + struct btrfs_block_rsv *rsv) { struct btrfs_fs_info *fs_info = root->fs_info; struct btrfs_block_rsv *global_rsv = _info->global_block_rsv; @@ -5350,7 +5349,7 @@ static struct btrfs_trans_handle *evict_refill_and_join(struct btrfs_root *root, struct btrfs_trans_handle *trans; int ret; - ret = btrfs_block_rsv_refill(root, rsv, min_size, + ret = btrfs_block_rsv_refill(root, rsv, rsv->size, BTRFS_RESERVE_FLUSH_LIMIT); if (ret && ++failures > 2) { @@ -5368,7 +5367,7 @@ static struct btrfs_trans_handle *evict_refill_and_join(struct btrfs_root *root, * it. */ if (!btrfs_check_space_for_delayed_refs(fs_info) && - !btrfs_block_rsv_migrate(global_rsv, rsv, min_size, 0)) + !btrfs_block_rsv_migrate(global_rsv, rsv, rsv->size, 0)) return trans; /* If not, commit and try again. */ @@ -5384,7 +5383,6 @@ void btrfs_evict_inode(struct inode *inode) struct btrfs_trans_handle *trans; struct btrfs_root *root = BTRFS_I(inode)->root; struct btrfs_block_rsv *rsv; - u64 min_size; int ret; trace_btrfs_inode_evict(inode); @@ -5394,8 +5392,6 @@ void btrfs_evict_inode(struct inode *inode) return; } - min_size = btrfs_calc_trunc_metadata_size(fs_info, 1); - evict_inode_truncate_pages(inode); if (inode->i_nlink && @@ -5428,13 +5424,13 @@ void btrfs_evict_inode(struct inode *inode) rsv = btrfs_alloc_block_rsv(fs_info, BTRFS_BLOCK_RSV_TEMP); if (!rsv) goto no_delete; - rsv->size = min_size; + rsv->size = btrfs_calc_trunc_metadata_size(fs_info, 1); rsv->failfast = 1; btrfs_i_size_write(BTRFS_I(inode), 0); while (1) { - trans = evict_refill_and_join(root, rsv, min_size); + trans = evict_refill_and_join(root, rsv); if (IS_ERR(trans)) goto free_rsv; @@ -5459,7 +5455,7 @@ void btrfs_evict_inode(struct inode *inode) * If it turns out that we are dropping too many of these, we might want * to add a mechanism for retrying these after a commit. */ - trans = evict_refill_and_join(root, rsv, min_size); + trans = evict_refill_and_join(root, rsv); if (!IS_ERR(trans)) { trans->block_rsv = rsv; btrfs_orphan_del(trans, BTRFS_I(inode)); -- 2.14.3
[PATCH 39/42] btrfs: replace cleaner_delayed_iput_mutex with a waitqueue
The throttle path doesn't take cleaner_delayed_iput_mutex, which means we could think we're done flushing iputs in the data space reservation path when we could have a throttler doing an iput. There's no real reason to serialize the delayed iput flushing, so instead of taking the cleaner_delayed_iput_mutex whenever we flush the delayed iputs just replace it with an atomic counter and a waitqueue. This removes the short (or long depending on how big the inode is) window where we think there are no more pending iputs when there really are some. Signed-off-by: Josef Bacik --- fs/btrfs/ctree.h | 4 +++- fs/btrfs/disk-io.c | 5 ++--- fs/btrfs/extent-tree.c | 9 + fs/btrfs/inode.c | 21 + 4 files changed, 31 insertions(+), 8 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index e40356ca0295..1ef0b1649cad 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -894,7 +894,8 @@ struct btrfs_fs_info { spinlock_t delayed_iput_lock; struct list_head delayed_iputs; - struct mutex cleaner_delayed_iput_mutex; + atomic_t nr_delayed_iputs; + wait_queue_head_t delayed_iputs_wait; /* this protects tree_mod_seq_list */ spinlock_t tree_mod_seq_lock; @@ -3212,6 +3213,7 @@ int btrfs_orphan_cleanup(struct btrfs_root *root); int btrfs_cont_expand(struct inode *inode, loff_t oldsize, loff_t size); void btrfs_add_delayed_iput(struct inode *inode); void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info); +int btrfs_wait_on_delayed_iputs(struct btrfs_fs_info *fs_info); int btrfs_prealloc_file_range(struct inode *inode, int mode, u64 start, u64 num_bytes, u64 min_size, loff_t actual_len, u64 *alloc_hint); diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 51b2a5bf25e5..3dce9ff72e41 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1692,9 +1692,7 @@ static int cleaner_kthread(void *arg) goto sleep; } - mutex_lock(_info->cleaner_delayed_iput_mutex); btrfs_run_delayed_iputs(fs_info); - mutex_unlock(_info->cleaner_delayed_iput_mutex); again = btrfs_clean_one_deleted_snapshot(root); mutex_unlock(_info->cleaner_mutex); @@ -2677,7 +2675,6 @@ int open_ctree(struct super_block *sb, mutex_init(_info->delete_unused_bgs_mutex); mutex_init(_info->reloc_mutex); mutex_init(_info->delalloc_root_mutex); - mutex_init(_info->cleaner_delayed_iput_mutex); seqlock_init(_info->profiles_lock); INIT_LIST_HEAD(_info->dirty_cowonly_roots); @@ -2699,6 +2696,7 @@ int open_ctree(struct super_block *sb, atomic_set(_info->defrag_running, 0); atomic_set(_info->qgroup_op_seq, 0); atomic_set(_info->reada_works_cnt, 0); + atomic_set(_info->nr_delayed_iputs, 0); atomic64_set(_info->tree_mod_seq, 0); fs_info->sb = sb; fs_info->max_inline = BTRFS_DEFAULT_MAX_INLINE; @@ -2776,6 +2774,7 @@ int open_ctree(struct super_block *sb, init_waitqueue_head(_info->transaction_wait); init_waitqueue_head(_info->transaction_blocked_wait); init_waitqueue_head(_info->async_submit_wait); + init_waitqueue_head(_info->delayed_iputs_wait); INIT_LIST_HEAD(_info->pinned_chunks); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 2056d5d37dfe..d0a7b0589826 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4258,8 +4258,9 @@ int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes) * operations. Wait for it to finish so that * more space is released. */ - mutex_lock(_info->cleaner_delayed_iput_mutex); - mutex_unlock(_info->cleaner_delayed_iput_mutex); + ret = btrfs_wait_on_delayed_iputs(fs_info); + if (ret) + return ret; goto again; } else { btrfs_end_transaction(trans); @@ -4829,9 +4830,9 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info, * pinned space, so make sure we run the iputs before we do our pinned * bytes check below. */ - mutex_lock(_info->cleaner_delayed_iput_mutex); btrfs_run_delayed_iputs(fs_info); - mutex_unlock(_info->cleaner_delayed_iput_mutex); + wait_event(fs_info->delayed_iputs_wait, + atomic_read(_info->nr_delayed_iputs) == 0); trans = btrfs_join_transaction(fs_info->extent_root); if (IS_ERR(trans)) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 0a1671fb03bf..ab8242b10601 100644 --- a/fs/btrfs/inode.c +++
[PATCH 34/42] btrfs: wait on ordered extents on abort cleanup
If we flip read-only before we initiate writeback on all dirty pages for ordered extents we've created then we'll have ordered extents left over on umount, which results in all sorts of bad things happening. Fix this by making sure we wait on ordered extents if we have to do the aborted transaction cleanup stuff. Reviewed-by: Nikolay Borisov Signed-off-by: Josef Bacik --- fs/btrfs/disk-io.c | 8 1 file changed, 8 insertions(+) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 54fbdc944a3f..51b2a5bf25e5 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -4188,6 +4188,14 @@ static void btrfs_destroy_all_ordered_extents(struct btrfs_fs_info *fs_info) spin_lock(_info->ordered_root_lock); } spin_unlock(_info->ordered_root_lock); + + /* +* We need this here because if we've been flipped read-only we won't +* get sync() from the umount, so we need to make sure any ordered +* extents that haven't had their dirty pages IO start writeout yet +* actually get run and error out properly. +*/ + btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1); } static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans, -- 2.14.3
[PATCH 42/42] btrfs: don't run delayed_iputs in commit
This could result in a really bad case where we do something like evict evict_refill_and_join btrfs_commit_transaction btrfs_run_delayed_iputs evict evict_refill_and_join btrfs_commit_transaction ... forever We have plenty of other places where we run delayed iputs that are much safer, let those do the work. Signed-off-by: Josef Bacik --- fs/btrfs/transaction.c | 9 - 1 file changed, 9 deletions(-) diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 9168efaca37e..c91dc36fccae 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -2265,15 +2265,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) kmem_cache_free(btrfs_trans_handle_cachep, trans); - /* -* If fs has been frozen, we can not handle delayed iputs, otherwise -* it'll result in deadlock about SB_FREEZE_FS. -*/ - if (current != fs_info->transaction_kthread && - current != fs_info->cleaner_kthread && - !test_bit(BTRFS_FS_FROZEN, _info->flags)) - btrfs_run_delayed_iputs(fs_info); - return ret; scrub_continue: -- 2.14.3
[PATCH 26/42] btrfs: make btrfs_destroy_delayed_refs use btrfs_delayed_ref_lock
We have this open coded in btrfs_destroy_delayed_refs, use the helper instead. Reviewed-by: Nikolay Borisov Signed-off-by: Josef Bacik --- fs/btrfs/disk-io.c | 11 ++- 1 file changed, 2 insertions(+), 9 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 39bd158466cd..121ab180a78a 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -4214,16 +4214,9 @@ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans, head = rb_entry(node, struct btrfs_delayed_ref_head, href_node); - if (!mutex_trylock(>mutex)) { - refcount_inc(>refs); - spin_unlock(_refs->lock); - - mutex_lock(>mutex); - mutex_unlock(>mutex); - btrfs_put_delayed_ref_head(head); - spin_lock(_refs->lock); + if (btrfs_delayed_ref_lock(delayed_refs, head)) continue; - } + spin_lock(>lock); while ((n = rb_first(>ref_tree)) != NULL) { ref = rb_entry(n, struct btrfs_delayed_ref_node, -- 2.14.3
[PATCH 22/42] btrfs: only run delayed refs if we're committing
I noticed in a giant dbench run that we spent a lot of time on lock contention while running transaction commit. This is because dbench results in a lot of fsync()'s that do a btrfs_transaction_commit(), and they all run the delayed refs first thing, so they all contend with each other. This leads to seconds of 0 throughput. Change this to only run the delayed refs if we're the ones committing the transaction. This makes the latency go away and we get no more lock contention. Reviewed-by: Omar Sandoval Signed-off-by: Josef Bacik --- fs/btrfs/transaction.c | 24 +--- 1 file changed, 9 insertions(+), 15 deletions(-) diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index a0f19ca0bd6c..39a2bddb0b29 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -1925,15 +1925,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) btrfs_trans_release_metadata(trans); trans->block_rsv = NULL; - /* make a pass through all the delayed refs we have so far -* any runnings procs may add more while we are here -*/ - ret = btrfs_run_delayed_refs(trans, 0); - if (ret) { - btrfs_end_transaction(trans); - return ret; - } - cur_trans = trans->transaction; /* @@ -1946,12 +1937,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) if (!list_empty(>new_bgs)) btrfs_create_pending_block_groups(trans); - ret = btrfs_run_delayed_refs(trans, 0); - if (ret) { - btrfs_end_transaction(trans); - return ret; - } - if (!test_bit(BTRFS_TRANS_DIRTY_BG_RUN, _trans->flags)) { int run_it = 0; @@ -2022,6 +2007,15 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) spin_unlock(_info->trans_lock); } + /* +* We are now the only one in the commit area, we can run delayed refs +* without hitting a bunch of lock contention from a lot of people +* trying to commit the transaction at once. +*/ + ret = btrfs_run_delayed_refs(trans, 0); + if (ret) + goto cleanup_transaction; + extwriter_counter_dec(cur_trans, trans->type); ret = btrfs_start_delalloc_flush(fs_info); -- 2.14.3
[PATCH 41/42] btrfs: reserve extra space during evict()
We could generate a lot of delayed refs in evict but never have any left over space from our block rsv to make up for that fact. So reserve some extra space and give it to the transaction so it can be used to refill the delayed refs rsv every loop through the truncate path. Signed-off-by: Josef Bacik --- fs/btrfs/inode.c | 25 +++-- 1 file changed, 23 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index dbcca915e681..9f7da5e3c741 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -5343,13 +5343,15 @@ static struct btrfs_trans_handle *evict_refill_and_join(struct btrfs_root *root, { struct btrfs_fs_info *fs_info = root->fs_info; struct btrfs_block_rsv *global_rsv = _info->global_block_rsv; + u64 delayed_refs_extra = btrfs_calc_trans_metadata_size(fs_info, 1); int failures = 0; for (;;) { struct btrfs_trans_handle *trans; int ret; - ret = btrfs_block_rsv_refill(root, rsv, rsv->size, + ret = btrfs_block_rsv_refill(root, rsv, +rsv->size + delayed_refs_extra, BTRFS_RESERVE_FLUSH_LIMIT); if (ret && ++failures > 2) { @@ -5358,9 +5360,28 @@ static struct btrfs_trans_handle *evict_refill_and_join(struct btrfs_root *root, return ERR_PTR(-ENOSPC); } + /* +* Evict can generate a large amount of delayed refs without +* having a way to add space back since we exhaust our temporary +* block rsv. We aren't allowed to do FLUSH_ALL in this case +* because we could deadlock with so many things in the flushing +* code, so we have to try and hold some extra space to +* compensate for our delayed ref generation. If we can't get +* that space then we need see if we can steal our minimum from +* the global reserve. We will be ratelimited by the amount of +* space we have for the delayed refs rsv, so we'll end up +* committing and trying again. +*/ trans = btrfs_join_transaction(root); - if (IS_ERR(trans) || !ret) + if (IS_ERR(trans) || !ret) { + if (!IS_ERR(trans)) { + trans->block_rsv = _info->trans_block_rsv; + trans->bytes_reserved = delayed_refs_extra; + btrfs_block_rsv_migrate(rsv, trans->block_rsv, + delayed_refs_extra, 1); + } return trans; + } /* * Try to steal from the global reserve if there is space for -- 2.14.3
[PATCH 38/42] btrfs: be more explicit about allowed flush states
For FLUSH_LIMIT flushers we really can only allocate chunks and flush delayed inode items, everything else is problematic. I added a bunch of new states and it lead to weirdness in the FLUSH_LIMIT case because I forgot about how it worked. So instead explicitly declare the states that are ok for flushing with FLUSH_LIMIT and use that for our state machine. Then as we add new things that are safe we can just add them to this list. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 21 ++--- 1 file changed, 10 insertions(+), 11 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 4bacb5042d08..2056d5d37dfe 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -5110,12 +5110,18 @@ void btrfs_init_async_reclaim_work(struct work_struct *work) INIT_WORK(work, btrfs_async_reclaim_metadata_space); } +static const enum btrfs_flush_state priority_flush_states[] = { + FLUSH_DELAYED_ITEMS_NR, + FLUSH_DELAYED_ITEMS, + ALLOC_CHUNK, +}; + static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info, struct btrfs_space_info *space_info, struct reserve_ticket *ticket) { u64 to_reclaim; - int flush_state = FLUSH_DELAYED_ITEMS_NR; + int flush_state = 0; spin_lock(_info->lock); to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info, space_info, @@ -5127,7 +5133,8 @@ static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info, spin_unlock(_info->lock); do { - flush_space(fs_info, space_info, to_reclaim, flush_state); + flush_space(fs_info, space_info, to_reclaim, + priority_flush_states[flush_state]); flush_state++; spin_lock(_info->lock); if (ticket->bytes == 0) { @@ -5135,15 +5142,7 @@ static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info, return; } spin_unlock(_info->lock); - - /* -* Priority flushers can't wait on delalloc without -* deadlocking. -*/ - if (flush_state == FLUSH_DELALLOC || - flush_state == FLUSH_DELALLOC_WAIT) - flush_state = ALLOC_CHUNK; - } while (flush_state < COMMIT_TRANS); + } while (flush_state < ARRAY_SIZE(priority_flush_states)); } static int wait_reserve_ticket(struct btrfs_fs_info *fs_info, -- 2.14.3
[PATCH 35/42] MAINTAINERS: update my email address for btrfs
My work email is completely useless, switch it to my personal address so I get emails on a account I actually pay attention to. Signed-off-by: Josef Bacik --- MAINTAINERS | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/MAINTAINERS b/MAINTAINERS index 32fbc6f732d4..7723dc958e99 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -3095,7 +3095,7 @@ F:drivers/gpio/gpio-bt8xx.c BTRFS FILE SYSTEM M: Chris Mason -M: Josef Bacik +M: Josef Bacik M: David Sterba L: linux-btrfs@vger.kernel.org W: http://btrfs.wiki.kernel.org/ -- 2.14.3
[PATCH 36/42] btrfs: wait on caching when putting the bg cache
While testing my backport I noticed there was a panic if I ran generic/416 generic/417 generic/418 all in a row. This just happened to uncover a race where we had outstanding IO after we destroy all of our workqueues, and then we'd go to queue the endio work on those free'd workqueues. This is because we aren't waiting for the caching threads to be done before freeing everything up, so to fix this make sure we wait on any outstanding caching that's being done before we free up the block group, so we're sure to be done with all IO by the time we get to btrfs_stop_all_workers(). This fixes the panic I was seeing consistently in testing. [ cut here ] kernel BUG at fs/btrfs/volumes.c:6112! SMP PTI Modules linked in: CPU: 1 PID: 27165 Comm: kworker/u4:7 Not tainted 4.16.0-02155-g3553e54a578d-dirty #875 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 Workqueue: btrfs-cache btrfs_cache_helper RIP: 0010:btrfs_map_bio+0x346/0x370 RSP: :c900061e79d0 EFLAGS: 00010202 RAX: RBX: 880071542e00 RCX: 00533000 RDX: 88006bb74380 RSI: 0008 RDI: 88007816 RBP: 0001 R08: 8800781cd200 R09: 00503000 R10: 88006cd21200 R11: R12: R13: R14: 8800781cd200 R15: 880071542e00 FS: () GS:88007fd0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 0817ffc4 CR3: 78314000 CR4: 06e0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 Call Trace: btree_submit_bio_hook+0x8a/0xd0 submit_one_bio+0x5d/0x80 read_extent_buffer_pages+0x18a/0x320 btree_read_extent_buffer_pages+0xbc/0x200 ? alloc_extent_buffer+0x359/0x3e0 read_tree_block+0x3d/0x60 read_block_for_search.isra.30+0x1a5/0x360 btrfs_search_slot+0x41b/0xa10 btrfs_next_old_leaf+0x212/0x470 caching_thread+0x323/0x490 normal_work_helper+0xc5/0x310 process_one_work+0x141/0x340 worker_thread+0x44/0x3c0 kthread+0xf8/0x130 ? process_one_work+0x340/0x340 ? kthread_bind+0x10/0x10 ret_from_fork+0x35/0x40 Code: ff ff 48 8b 4c 24 28 48 89 de 48 8b 7c 24 08 e8 d1 e5 04 00 89 c3 e9 08 ff ff ff 4d 89 c6 49 89 df e9 27 fe ff ff e8 5a 3a bb ff <0f> 0b 0f 0b e9 57 ff ff ff 48 8b 7c 24 08 4c 89 f9 4c 89 ea 48 RIP: btrfs_map_bio+0x346/0x370 RSP: c900061e79d0 ---[ end trace 827eb13e50846033 ]--- Kernel panic - not syncing: Fatal exception Kernel Offset: disabled ---[ end Kernel panic - not syncing: Fatal exception Signed-off-by: Josef Bacik Reviewed-by: Omar Sandoval --- fs/btrfs/extent-tree.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 9b5366932298..4bacb5042d08 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -9894,6 +9894,7 @@ void btrfs_put_block_group_cache(struct btrfs_fs_info *info) block_group = btrfs_lookup_first_block_group(info, last); while (block_group) { + wait_block_group_cache_done(block_group); spin_lock(_group->lock); if (block_group->iref) break; -- 2.14.3
[PATCH 37/42] btrfs: wakeup cleaner thread when adding delayed iput
The cleaner thread usually takes care of delayed iputs, with the exception of the btrfs_end_transaction_throttle path. The cleaner thread only gets woken up every 30 seconds, so instead wake it up to do it's work so that we can free up that space as quickly as possible. Reviewed-by: Filipe Manana Signed-off-by: Josef Bacik --- fs/btrfs/inode.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 2b257d14bd3d..0a1671fb03bf 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -3323,6 +3323,7 @@ void btrfs_add_delayed_iput(struct inode *inode) ASSERT(list_empty(>delayed_iput)); list_add_tail(>delayed_iput, _info->delayed_iputs); spin_unlock(_info->delayed_iput_lock); + wake_up_process(fs_info->cleaner_kthread); } void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info) -- 2.14.3
[PATCH 30/42] btrfs: just delete pending bgs if we are aborted
We still need to do all of the accounting cleanup for pending block groups if we abort. So set the ret to trans->aborted so if we aborted the cleanup happens and everybody is happy. Reviewed-by: Omar Sandoval Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 14495c030301..933cba06c9fb 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -10363,9 +10363,15 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans) struct btrfs_root *extent_root = fs_info->extent_root; struct btrfs_block_group_item item; struct btrfs_key key; - int ret = 0; + int ret; bool can_flush_pending_bgs = trans->can_flush_pending_bgs; + /* +* If we aborted the transaction with pending bg's we need to just +* cleanup the list and carry on. +*/ + ret = trans->aborted; + trans->can_flush_pending_bgs = false; while (!list_empty(>new_bgs)) { block_group = list_first_entry(>new_bgs, -- 2.14.3
[PATCH 19/42] btrfs: set max_extent_size properly
From: Josef Bacik We can't use entry->bytes if our entry is a bitmap entry, we need to use entry->max_extent_size in that case. Fix up all the logic to make this consistent. Signed-off-by: Josef Bacik --- fs/btrfs/free-space-cache.c | 30 -- 1 file changed, 20 insertions(+), 10 deletions(-) diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c index e077ad3b4549..2dd773e96530 100644 --- a/fs/btrfs/free-space-cache.c +++ b/fs/btrfs/free-space-cache.c @@ -1770,6 +1770,13 @@ static int search_bitmap(struct btrfs_free_space_ctl *ctl, return -1; } +static inline u64 get_max_extent_size(struct btrfs_free_space *entry) +{ + if (entry->bitmap) + return entry->max_extent_size; + return entry->bytes; +} + /* Cache the size of the max extent in bytes */ static struct btrfs_free_space * find_free_space(struct btrfs_free_space_ctl *ctl, u64 *offset, u64 *bytes, @@ -1791,8 +1798,8 @@ find_free_space(struct btrfs_free_space_ctl *ctl, u64 *offset, u64 *bytes, for (node = >offset_index; node; node = rb_next(node)) { entry = rb_entry(node, struct btrfs_free_space, offset_index); if (entry->bytes < *bytes) { - if (entry->bytes > *max_extent_size) - *max_extent_size = entry->bytes; + *max_extent_size = max(get_max_extent_size(entry), + *max_extent_size); continue; } @@ -1810,8 +1817,8 @@ find_free_space(struct btrfs_free_space_ctl *ctl, u64 *offset, u64 *bytes, } if (entry->bytes < *bytes + align_off) { - if (entry->bytes > *max_extent_size) - *max_extent_size = entry->bytes; + *max_extent_size = max(get_max_extent_size(entry), + *max_extent_size); continue; } @@ -1823,8 +1830,10 @@ find_free_space(struct btrfs_free_space_ctl *ctl, u64 *offset, u64 *bytes, *offset = tmp; *bytes = size; return entry; - } else if (size > *max_extent_size) { - *max_extent_size = size; + } else { + *max_extent_size = + max(get_max_extent_size(entry), + *max_extent_size); } continue; } @@ -2684,8 +2693,8 @@ static u64 btrfs_alloc_from_bitmap(struct btrfs_block_group_cache *block_group, err = search_bitmap(ctl, entry, _start, _bytes, true); if (err) { - if (search_bytes > *max_extent_size) - *max_extent_size = search_bytes; + *max_extent_size = max(get_max_extent_size(entry), + *max_extent_size); return 0; } @@ -2722,8 +2731,9 @@ u64 btrfs_alloc_from_cluster(struct btrfs_block_group_cache *block_group, entry = rb_entry(node, struct btrfs_free_space, offset_index); while (1) { - if (entry->bytes < bytes && entry->bytes > *max_extent_size) - *max_extent_size = entry->bytes; + if (entry->bytes < bytes) + *max_extent_size = max(get_max_extent_size(entry), + *max_extent_size); if (entry->bytes < bytes || (!entry->bitmap && entry->offset < min_start)) { -- 2.14.3
[PATCH 25/42] btrfs: pass delayed_refs_root to btrfs_delayed_ref_lock
We don't need the trans except to get the delayed_refs_root, so just pass the delayed_refs_root into btrfs_delayed_ref_lock and call it a day. Reviewed-by: Nikolay Borisov Signed-off-by: Josef Bacik --- fs/btrfs/delayed-ref.c | 5 + fs/btrfs/delayed-ref.h | 2 +- fs/btrfs/extent-tree.c | 2 +- 3 files changed, 3 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index 96ce087747b2..87778645bf4a 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -197,12 +197,9 @@ find_ref_head(struct rb_root *root, u64 bytenr, return NULL; } -int btrfs_delayed_ref_lock(struct btrfs_trans_handle *trans, +int btrfs_delayed_ref_lock(struct btrfs_delayed_ref_root *delayed_refs, struct btrfs_delayed_ref_head *head) { - struct btrfs_delayed_ref_root *delayed_refs; - - delayed_refs = >transaction->delayed_refs; lockdep_assert_held(_refs->lock); if (mutex_trylock(>mutex)) return 0; diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h index 7769177b489e..ee636d7a710a 100644 --- a/fs/btrfs/delayed-ref.h +++ b/fs/btrfs/delayed-ref.h @@ -255,7 +255,7 @@ void btrfs_merge_delayed_refs(struct btrfs_trans_handle *trans, struct btrfs_delayed_ref_head * btrfs_find_delayed_ref_head(struct btrfs_delayed_ref_root *delayed_refs, u64 bytenr); -int btrfs_delayed_ref_lock(struct btrfs_trans_handle *trans, +int btrfs_delayed_ref_lock(struct btrfs_delayed_ref_root *delayed_refs, struct btrfs_delayed_ref_head *head); static inline void btrfs_delayed_ref_unlock(struct btrfs_delayed_ref_head *head) { diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 0d7fc07c34ab..529c2aefbbf4 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2600,7 +2600,7 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, /* grab the lock that says we are going to process * all the refs for this head */ - ret = btrfs_delayed_ref_lock(trans, locked_ref); + ret = btrfs_delayed_ref_lock(delayed_refs, locked_ref); spin_unlock(_refs->lock); /* * we may have dropped the spin lock to get the head -- 2.14.3
[PATCH 16/42] btrfs: loop in inode_rsv_refill
With severe fragmentation we can end up with our inode rsv size being huge during writeout, which would cause us to need to make very large metadata reservations. However we may not actually need that much once writeout is complete. So instead try to make our reservation, and if we couldn't make it re-calculate our new reservation size and try again. If our reservation size doesn't change between tries then we know we are actually out of space and can error out. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 56 -- 1 file changed, 41 insertions(+), 15 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 19449ee93693..3757ca42b8e0 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -5766,6 +5766,21 @@ int btrfs_block_rsv_refill(struct btrfs_root *root, return ret; } +static inline void __get_refill_bytes(struct btrfs_block_rsv *block_rsv, + u64 *metadata_bytes, u64 *qgroup_bytes) +{ + *metadata_bytes = 0; + *qgroup_bytes = 0; + + spin_lock(_rsv->lock); + if (block_rsv->reserved < block_rsv->size) + *metadata_bytes = block_rsv->size - block_rsv->reserved; + if (block_rsv->qgroup_rsv_reserved < block_rsv->qgroup_rsv_size) + *qgroup_bytes = block_rsv->qgroup_rsv_size - + block_rsv->qgroup_rsv_reserved; + spin_unlock(_rsv->lock); +} + /** * btrfs_inode_rsv_refill - refill the inode block rsv. * @inode - the inode we are refilling. @@ -5781,25 +5796,37 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode *inode, { struct btrfs_root *root = inode->root; struct btrfs_block_rsv *block_rsv = >block_rsv; - u64 num_bytes = 0; + u64 num_bytes = 0, last = 0; u64 qgroup_num_bytes = 0; int ret = -ENOSPC; - spin_lock(_rsv->lock); - if (block_rsv->reserved < block_rsv->size) - num_bytes = block_rsv->size - block_rsv->reserved; - if (block_rsv->qgroup_rsv_reserved < block_rsv->qgroup_rsv_size) - qgroup_num_bytes = block_rsv->qgroup_rsv_size - - block_rsv->qgroup_rsv_reserved; - spin_unlock(_rsv->lock); - + __get_refill_bytes(block_rsv, _bytes, _num_bytes); if (num_bytes == 0) return 0; - ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_num_bytes, true); - if (ret) - return ret; - ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush); + do { + ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_num_bytes, true); + if (ret) + return ret; + ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush); + if (ret) { + btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes); + last = num_bytes; + /* +* If we are fragmented we can end up with a lot of +* outstanding extents which will make our size be much +* larger than our reserved amount. If we happen to +* try to do a reservation here that may result in us +* trying to do a pretty hefty reservation, which we may +* not need once delalloc flushing happens. If this is +* the case try and do the reserve again. +*/ + if (flush == BTRFS_RESERVE_FLUSH_ALL) + __get_refill_bytes(block_rsv, _bytes, + _num_bytes); + } + } while (ret && last != num_bytes); + if (!ret) { block_rsv_add_bytes(block_rsv, num_bytes, 0); trace_btrfs_space_reservation(root->fs_info, "delalloc", @@ -5809,8 +5836,7 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode *inode, spin_lock(_rsv->lock); block_rsv->qgroup_rsv_reserved += qgroup_num_bytes; spin_unlock(_rsv->lock); - } else - btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes); + } return ret; } -- 2.14.3
[PATCH 32/42] btrfs: only free reserved extent if we didn't insert it
When we insert the file extent once the ordered extent completes we free the reserved extent reservation as it'll have been migrated to the bytes_used counter. However if we error out after this step we'll still clear the reserved extent reservation, resulting in a negative accounting of the reserved bytes for the block group and space info. Fix this by only doing the free if we didn't successfully insert a file extent for this extent. Signed-off-by: Josef Bacik Reviewed-by: Omar Sandoval --- fs/btrfs/inode.c | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 5a91055a13b2..2b257d14bd3d 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -2992,6 +2992,7 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent) bool truncated = false; bool range_locked = false; bool clear_new_delalloc_bytes = false; + bool clear_reserved_extent = true; if (!test_bit(BTRFS_ORDERED_NOCOW, _extent->flags) && !test_bit(BTRFS_ORDERED_PREALLOC, _extent->flags) && @@ -3095,10 +3096,12 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent) logical_len, logical_len, compress_type, 0, 0, BTRFS_FILE_EXTENT_REG); - if (!ret) + if (!ret) { + clear_reserved_extent = false; btrfs_release_delalloc_bytes(fs_info, ordered_extent->start, ordered_extent->disk_len); + } } unpin_extent_cache(_I(inode)->extent_tree, ordered_extent->file_offset, ordered_extent->len, @@ -3159,8 +3162,13 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent) * wrong we need to return the space for this ordered extent * back to the allocator. We only free the extent in the * truncated case if we didn't write out the extent at all. +* +* If we made it past insert_reserved_file_extent before we +* errored out then we don't need to do this as the accounting +* has already been done. */ if ((ret || !logical_len) && + clear_reserved_extent && !test_bit(BTRFS_ORDERED_NOCOW, _extent->flags) && !test_bit(BTRFS_ORDERED_PREALLOC, _extent->flags)) btrfs_free_reserved_extent(fs_info, -- 2.14.3
[PATCH 21/42] btrfs: reset max_extent_size on clear in a bitmap
From: Josef Bacik We need to clear the max_extent_size when we clear bits from a bitmap since it could have been from the range that contains the max_extent_size. Reviewed-by: Liu Bo Signed-off-by: Josef Bacik --- fs/btrfs/free-space-cache.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c index 2dd773e96530..411acffbd45d 100644 --- a/fs/btrfs/free-space-cache.c +++ b/fs/btrfs/free-space-cache.c @@ -1687,6 +1687,8 @@ static inline void __bitmap_clear_bits(struct btrfs_free_space_ctl *ctl, bitmap_clear(info->bitmap, start, count); info->bytes -= bytes; + if (info->max_extent_size > ctl->unit) + info->max_extent_size = 0; } static void bitmap_clear_bits(struct btrfs_free_space_ctl *ctl, -- 2.14.3
[PATCH 28/42] btrfs: handle delayed ref head accounting cleanup in abort
We weren't doing any of the accounting cleanup when we aborted transactions. Fix this by making cleanup_ref_head_accounting global and calling it from the abort code, this fixes the issue where our accounting was all wrong after the fs aborts. Signed-off-by: Josef Bacik --- fs/btrfs/ctree.h | 5 + fs/btrfs/disk-io.c | 1 + fs/btrfs/extent-tree.c | 13 ++--- 3 files changed, 12 insertions(+), 7 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 29db902511c1..e40356ca0295 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -35,6 +35,7 @@ struct btrfs_trans_handle; struct btrfs_transaction; struct btrfs_pending_snapshot; +struct btrfs_delayed_ref_root; extern struct kmem_cache *btrfs_trans_handle_cachep; extern struct kmem_cache *btrfs_bit_radix_cachep; extern struct kmem_cache *btrfs_path_cachep; @@ -2623,6 +2624,10 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, unsigned long count); int btrfs_async_run_delayed_refs(struct btrfs_fs_info *fs_info, unsigned long count, u64 transid, int wait); +void +btrfs_cleanup_ref_head_accounting(struct btrfs_fs_info *fs_info, + struct btrfs_delayed_ref_root *delayed_refs, + struct btrfs_delayed_ref_head *head); int btrfs_lookup_data_extent(struct btrfs_fs_info *fs_info, u64 start, u64 len); int btrfs_lookup_extent_info(struct btrfs_trans_handle *trans, struct btrfs_fs_info *fs_info, u64 bytenr, diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index fe1f229320ef..54fbdc944a3f 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -4240,6 +4240,7 @@ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans, if (pin_bytes) btrfs_pin_extent(fs_info, head->bytenr, head->num_bytes, 1); + btrfs_cleanup_ref_head_accounting(fs_info, delayed_refs, head); btrfs_put_delayed_ref_head(head); cond_resched(); spin_lock(_refs->lock); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 529c2aefbbf4..5b5a28b69fa9 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2475,12 +2475,11 @@ static int run_and_cleanup_extent_op(struct btrfs_trans_handle *trans, return ret ? ret : 1; } -static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans, - struct btrfs_delayed_ref_head *head) +void +btrfs_cleanup_ref_head_accounting(struct btrfs_fs_info *fs_info, + struct btrfs_delayed_ref_root *delayed_refs, + struct btrfs_delayed_ref_head *head) { - struct btrfs_fs_info *fs_info = trans->fs_info; - struct btrfs_delayed_ref_root *delayed_refs = - >transaction->delayed_refs; int nr_items = 1; if (head->total_ref_mod < 0) { @@ -2558,7 +2557,7 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans, } } - cleanup_ref_head_accounting(trans, head); + btrfs_cleanup_ref_head_accounting(fs_info, delayed_refs, head); trace_run_delayed_ref_head(fs_info, head, 0); btrfs_delayed_ref_unlock(head); @@ -7227,7 +7226,7 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans, if (head->must_insert_reserved) ret = 1; - cleanup_ref_head_accounting(trans, head); + btrfs_cleanup_ref_head_accounting(trans->fs_info, delayed_refs, head); mutex_unlock(>mutex); btrfs_put_delayed_ref_head(head); return ret; -- 2.14.3
[PATCH 29/42] btrfs: call btrfs_create_pending_block_groups unconditionally
The first thing we do is loop through the list, this if (!list_empty()) btrfs_create_pending_block_groups(); thing is just wasted space. Reviewed-by: Nikolay Borisov Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 3 +-- fs/btrfs/transaction.c | 6 ++ 2 files changed, 3 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 5b5a28b69fa9..14495c030301 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2978,8 +2978,7 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, } if (run_all) { - if (!list_empty(>new_bgs)) - btrfs_create_pending_block_groups(trans); + btrfs_create_pending_block_groups(trans); spin_lock(_refs->lock); node = rb_first(_refs->href_root); diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 39a2bddb0b29..46ca775a709e 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -846,8 +846,7 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans, btrfs_trans_release_metadata(trans); trans->block_rsv = NULL; - if (!list_empty(>new_bgs)) - btrfs_create_pending_block_groups(trans); + btrfs_create_pending_block_groups(trans); btrfs_trans_release_chunk_metadata(trans); @@ -1934,8 +1933,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) cur_trans->delayed_refs.flushing = 1; smp_wmb(); - if (!list_empty(>new_bgs)) - btrfs_create_pending_block_groups(trans); + btrfs_create_pending_block_groups(trans); if (!test_bit(BTRFS_TRANS_DIRTY_BG_RUN, _trans->flags)) { int run_it = 0; -- 2.14.3
[PATCH 24/42] btrfs: assert on non-empty delayed iputs
I ran into an issue where there was some reference being held on an inode that I couldn't track. This assert wasn't triggered, but it at least rules out we're doing something stupid. Reviewed-by: Omar Sandoval Reviewed-by: David Sterba Signed-off-by: Josef Bacik --- fs/btrfs/disk-io.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 377ad9c1cb17..39bd158466cd 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -3979,6 +3979,7 @@ void close_ctree(struct btrfs_fs_info *fs_info) kthread_stop(fs_info->transaction_kthread); kthread_stop(fs_info->cleaner_kthread); + ASSERT(list_empty(_info->delayed_iputs)); set_bit(BTRFS_FS_CLOSING_DONE, _info->flags); btrfs_free_qgroup_config(fs_info); -- 2.14.3
[PATCH 23/42] btrfs: make sure we create all new bgs
Allocating new chunks modifies both the extent and chunk tree, which can trigger new chunk allocations. So instead of doing list_for_each_safe, just do while (!list_empty()) so we make sure we don't exit with other pending bg's still on our list. Reviewed-by: Omar Sandoval Reviewed-by: Liu Bo Reviewed-by: David Sterba Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 6721698ab8aa..0d7fc07c34ab 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -10361,7 +10361,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info) void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans) { struct btrfs_fs_info *fs_info = trans->fs_info; - struct btrfs_block_group_cache *block_group, *tmp; + struct btrfs_block_group_cache *block_group; struct btrfs_root *extent_root = fs_info->extent_root; struct btrfs_block_group_item item; struct btrfs_key key; @@ -10369,7 +10369,10 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans) bool can_flush_pending_bgs = trans->can_flush_pending_bgs; trans->can_flush_pending_bgs = false; - list_for_each_entry_safe(block_group, tmp, >new_bgs, bg_list) { + while (!list_empty(>new_bgs)) { + block_group = list_first_entry(>new_bgs, + struct btrfs_block_group_cache, + bg_list); if (ret) goto next; -- 2.14.3
[PATCH 01/42] btrfs: add btrfs_delete_ref_head helper
From: Josef Bacik We do this dance in cleanup_ref_head and check_ref_cleanup, unify it into a helper and cleanup the calling functions. Signed-off-by: Josef Bacik Reviewed-by: Omar Sandoval --- fs/btrfs/delayed-ref.c | 14 ++ fs/btrfs/delayed-ref.h | 3 ++- fs/btrfs/extent-tree.c | 22 +++--- 3 files changed, 19 insertions(+), 20 deletions(-) diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index 62ff545ba1f7..3a9e4ac21794 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -393,6 +393,20 @@ btrfs_select_ref_head(struct btrfs_trans_handle *trans) return head; } +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs, + struct btrfs_delayed_ref_head *head) +{ + lockdep_assert_held(_refs->lock); + lockdep_assert_held(>lock); + + rb_erase(>href_node, _refs->href_root); + RB_CLEAR_NODE(>href_node); + atomic_dec(_refs->num_entries); + delayed_refs->num_heads--; + if (head->processing == 0) + delayed_refs->num_heads_ready--; +} + /* * Helper to insert the ref_node to the tail or merge with tail. * diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h index d9f2a4ebd5db..7769177b489e 100644 --- a/fs/btrfs/delayed-ref.h +++ b/fs/btrfs/delayed-ref.h @@ -261,7 +261,8 @@ static inline void btrfs_delayed_ref_unlock(struct btrfs_delayed_ref_head *head) { mutex_unlock(>mutex); } - +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs, + struct btrfs_delayed_ref_head *head); struct btrfs_delayed_ref_head * btrfs_select_ref_head(struct btrfs_trans_handle *trans); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index f77226d8020a..d24a0de4a2e7 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2492,12 +2492,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans, spin_unlock(_refs->lock); return 1; } - delayed_refs->num_heads--; - rb_erase(>href_node, _refs->href_root); - RB_CLEAR_NODE(>href_node); + btrfs_delete_ref_head(delayed_refs, head); spin_unlock(>lock); spin_unlock(_refs->lock); - atomic_dec(_refs->num_entries); trace_run_delayed_ref_head(fs_info, head, 0); @@ -6984,22 +6981,9 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans, if (!mutex_trylock(>mutex)) goto out; - /* -* at this point we have a head with no other entries. Go -* ahead and process it. -*/ - rb_erase(>href_node, _refs->href_root); - RB_CLEAR_NODE(>href_node); - atomic_dec(_refs->num_entries); - - /* -* we don't take a ref on the node because we're removing it from the -* tree, so we just steal the ref the tree was holding. -*/ - delayed_refs->num_heads--; - if (head->processing == 0) - delayed_refs->num_heads_ready--; + btrfs_delete_ref_head(delayed_refs, head); head->processing = 0; + spin_unlock(>lock); spin_unlock(_refs->lock); -- 2.14.3
[PATCH 00/42][v5] My current patch queue
v3->v4: - added stacktraces to all the changelogs - added the various reviewed-by's. - fixed the loop in inode_rsv_refill to not use goto again; v2->v3: - reworked the truncate/evict throttling, we were still occasionally hitting enospc aborts in production in these paths because we were too aggressive with space usage. - reworked the delayed iput stuff to be a little less racey and less deadlocky. - Addressed the comments from Dave and Omar. - A lot of production testing. v1->v2: - addressed all of the issues brought up. - added more comments. - split up some patches. original message: This is the current queue of things that I've been working on. The main thing these patches are doing is separating out the delayed refs reservations from the global reserve into their own block rsv. We have been consistently hitting issues in production where we abort a transaction because we run out of the global reserve either while running delayed refs or while updating dirty block groups. This is because the math around global reserves is made up bullshit magic that has been tweaked more and more throughout the years. The result is something that is inconsistent across the board and sometimes wrong. So instead we need a way to know exactly how much space we need to keep around in order to satisfy our outstanding delayed refs and our dirty block groups. Since we don't know how many delayed refs we need at the start of any modification we simply use the nr_items passed into btrfs_start_transaction() as a guess for what we may need. This has the side effect of putting more pressure on the ENOSPC system, but it's pressure we can deal with more intelligently because we always know how much space we have outstanding, instead of guessing with weird global reserve math. This works similar to every other reservation we have, we reserve the worst case up front, and then at transaction end time we free up any space we didn't actually use for delayed refs. My performance tests show that we are bit faster now since we can do more intelligent flushing and don't have to fall back on simply committing the transaction in hopes that we have enough space for everything we need to do. That leads me to the 2nd part of this pull, there's a bunch of fixes around ENOSPC. Because we are a bit faster now there were a bunch of things uncovered in testing, but they seem to be all resolved now. The final chunk of fixes are around transaction aborts. There were a lot of accounting bugs I was running into while running generic/435, so I fixed a bunch of those up so now it runs cleanly. I have been running these patches through xfstests on multiple machines for a while, they are pretty solid and ready for wider testing and review. Thanks, Josef
[PATCH 05/42] btrfs: only count ref heads run in __btrfs_run_delayed_refs
We pick the number of ref's to run based on the number of ref heads, and only make the decision to stop once we've processed entire ref heads, so only count the ref heads we've run and bail once we've hit the number of ref heads we wanted to process. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 98f36dfeccb0..b32bd38390dd 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2592,6 +2592,7 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, spin_unlock(_refs->lock); break; } + count++; /* grab the lock that says we are going to process * all the refs for this head */ @@ -2605,7 +2606,6 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, */ if (ret == -EAGAIN) { locked_ref = NULL; - count++; continue; } } @@ -2633,7 +2633,6 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, unselect_delayed_ref_head(delayed_refs, locked_ref); locked_ref = NULL; cond_resched(); - count++; continue; } @@ -2651,7 +2650,6 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, return ret; } locked_ref = NULL; - count++; continue; } @@ -2702,7 +2700,6 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, } btrfs_put_delayed_ref(ref); - count++; cond_resched(); } -- 2.14.3
[PATCH 14/42] btrfs: reset max_extent_size properly
If we use up our block group before allocating a new one we'll easily get a max_extent_size that's set really really low, which will result in a lot of fragmentation. We need to make sure we're resetting the max_extent_size when we add a new chunk or add new space. Reviewed-by: Filipe Manana Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 640ec640b21c..2b1704331d21 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4573,6 +4573,7 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags, goto out; } else { ret = 1; + space_info->max_extent_size = 0; } space_info->force_alloc = CHUNK_ALLOC_NO_FORCE; @@ -6671,6 +6672,7 @@ static int btrfs_free_reserved_bytes(struct btrfs_block_group_cache *cache, space_info->bytes_readonly += num_bytes; cache->reserved -= num_bytes; space_info->bytes_reserved -= num_bytes; + space_info->max_extent_size = 0; if (delalloc) cache->delalloc_bytes -= num_bytes; -- 2.14.3
[PATCH 11/42] btrfs: fix truncate throttling
We have a bunch of magic to make sure we're throttling delayed refs when truncating a file. Now that we have a delayed refs rsv and a mechanism for refilling that reserve simply use that instead of all of this magic. Signed-off-by: Josef Bacik --- fs/btrfs/inode.c | 79 1 file changed, 17 insertions(+), 62 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index cd00ec869c96..5a91055a13b2 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -4493,31 +4493,6 @@ static int btrfs_rmdir(struct inode *dir, struct dentry *dentry) return err; } -static int truncate_space_check(struct btrfs_trans_handle *trans, - struct btrfs_root *root, - u64 bytes_deleted) -{ - struct btrfs_fs_info *fs_info = root->fs_info; - int ret; - - /* -* This is only used to apply pressure to the enospc system, we don't -* intend to use this reservation at all. -*/ - bytes_deleted = btrfs_csum_bytes_to_leaves(fs_info, bytes_deleted); - bytes_deleted *= fs_info->nodesize; - ret = btrfs_block_rsv_add(root, _info->trans_block_rsv, - bytes_deleted, BTRFS_RESERVE_NO_FLUSH); - if (!ret) { - trace_btrfs_space_reservation(fs_info, "transaction", - trans->transid, - bytes_deleted, 1); - trans->bytes_reserved += bytes_deleted; - } - return ret; - -} - /* * Return this if we need to call truncate_block for the last bit of the * truncate. @@ -4562,7 +4537,6 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans, u64 bytes_deleted = 0; bool be_nice = false; bool should_throttle = false; - bool should_end = false; BUG_ON(new_size > 0 && min_type != BTRFS_EXTENT_DATA_KEY); @@ -4775,15 +4749,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans, btrfs_abort_transaction(trans, ret); break; } - if (btrfs_should_throttle_delayed_refs(trans, fs_info)) - btrfs_async_run_delayed_refs(fs_info, - trans->delayed_ref_updates * 2, - trans->transid, 0); if (be_nice) { - if (truncate_space_check(trans, root, -extent_num_bytes)) { - should_end = true; - } if (btrfs_should_throttle_delayed_refs(trans, fs_info)) should_throttle = true; @@ -4795,7 +4761,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans, if (path->slots[0] == 0 || path->slots[0] != pending_del_slot || - should_throttle || should_end) { + should_throttle) { if (pending_del_nr) { ret = btrfs_del_items(trans, root, path, pending_del_slot, @@ -4807,23 +4773,24 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans, pending_del_nr = 0; } btrfs_release_path(path); - if (should_throttle) { - unsigned long updates = trans->delayed_ref_updates; - if (updates) { - trans->delayed_ref_updates = 0; - ret = btrfs_run_delayed_refs(trans, - updates * 2); - if (ret) - break; - } - } + /* -* if we failed to refill our space rsv, bail out -* and let the transaction restart +* We can generate a lot of delayed refs, so we need to +* throttle every once and a while and make sure we're +* adding enough space to keep up with the work we are +* generating. Since we hold a transaction here we +* can't flush, and we don't want to FLUSH_LIMIT because +* we could have generated too many delayed refs to +* actually allocate, so just bail if we're short and +* let the
[PATCH 06/42] btrfs: introduce delayed_refs_rsv
From: Josef Bacik Traditionally we've had voodoo in btrfs to account for the space that delayed refs may take up by having a global_block_rsv. This works most of the time, except when it doesn't. We've had issues reported and seen in production where sometimes the global reserve is exhausted during transaction commit before we can run all of our delayed refs, resulting in an aborted transaction. Because of this voodoo we have equally dubious flushing semantics around throttling delayed refs which we often get wrong. So instead give them their own block_rsv. This way we can always know exactly how much outstanding space we need for delayed refs. This allows us to make sure we are constantly filling that reservation up with space, and allows us to put more precise pressure on the enospc system. Instead of doing math to see if its a good time to throttle, the normal enospc code will be invoked if we have a lot of delayed refs pending, and they will be run via the normal flushing mechanism. For now the delayed_refs_rsv will hold the reservations for the delayed refs, the block group updates, and deleting csums. We could have a separate rsv for the block group updates, but the csum deletion stuff is still handled via the delayed_refs so that will stay there. Signed-off-by: Josef Bacik --- fs/btrfs/ctree.h | 27 +++-- fs/btrfs/delayed-ref.c | 28 - fs/btrfs/disk-io.c | 4 + fs/btrfs/extent-tree.c | 279 +++ fs/btrfs/inode.c | 2 +- fs/btrfs/transaction.c | 77 ++-- include/trace/events/btrfs.h | 2 + 7 files changed, 312 insertions(+), 107 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 66f1d3895bca..1a2c3b629af2 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -452,8 +452,9 @@ struct btrfs_space_info { #defineBTRFS_BLOCK_RSV_TRANS 3 #defineBTRFS_BLOCK_RSV_CHUNK 4 #defineBTRFS_BLOCK_RSV_DELOPS 5 -#defineBTRFS_BLOCK_RSV_EMPTY 6 -#defineBTRFS_BLOCK_RSV_TEMP7 +#define BTRFS_BLOCK_RSV_DELREFS6 +#defineBTRFS_BLOCK_RSV_EMPTY 7 +#defineBTRFS_BLOCK_RSV_TEMP8 struct btrfs_block_rsv { u64 size; @@ -794,6 +795,8 @@ struct btrfs_fs_info { struct btrfs_block_rsv chunk_block_rsv; /* block reservation for delayed operations */ struct btrfs_block_rsv delayed_block_rsv; + /* block reservation for delayed refs */ + struct btrfs_block_rsv delayed_refs_rsv; struct btrfs_block_rsv empty_block_rsv; @@ -2608,8 +2611,7 @@ static inline u64 btrfs_calc_trunc_metadata_size(struct btrfs_fs_info *fs_info, int btrfs_should_throttle_delayed_refs(struct btrfs_trans_handle *trans, struct btrfs_fs_info *fs_info); -int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans, - struct btrfs_fs_info *fs_info); +bool btrfs_check_space_for_delayed_refs(struct btrfs_fs_info *fs_info); void btrfs_dec_block_group_reservations(struct btrfs_fs_info *fs_info, const u64 start); void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg); @@ -2723,10 +2725,12 @@ enum btrfs_reserve_flush_enum { enum btrfs_flush_state { FLUSH_DELAYED_ITEMS_NR = 1, FLUSH_DELAYED_ITEMS = 2, - FLUSH_DELALLOC = 3, - FLUSH_DELALLOC_WAIT = 4, - ALLOC_CHUNK = 5, - COMMIT_TRANS= 6, + FLUSH_DELAYED_REFS_NR = 3, + FLUSH_DELAYED_REFS = 4, + FLUSH_DELALLOC = 5, + FLUSH_DELALLOC_WAIT = 6, + ALLOC_CHUNK = 7, + COMMIT_TRANS= 8, }; int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes); @@ -2777,6 +2781,13 @@ int btrfs_cond_migrate_bytes(struct btrfs_fs_info *fs_info, void btrfs_block_rsv_release(struct btrfs_fs_info *fs_info, struct btrfs_block_rsv *block_rsv, u64 num_bytes); +void btrfs_delayed_refs_rsv_release(struct btrfs_fs_info *fs_info, int nr); +void btrfs_update_delayed_refs_rsv(struct btrfs_trans_handle *trans); +int btrfs_throttle_delayed_refs(struct btrfs_fs_info *fs_info, + enum btrfs_reserve_flush_enum flush); +void btrfs_migrate_to_delayed_refs_rsv(struct btrfs_fs_info *fs_info, + struct btrfs_block_rsv *src, + u64 num_bytes); int btrfs_inc_block_group_ro(struct btrfs_block_group_cache *cache); void btrfs_dec_block_group_ro(struct btrfs_block_group_cache *cache); void btrfs_put_block_group_cache(struct btrfs_fs_info *info); diff --git
[PATCH 03/42] btrfs: cleanup extent_op handling
From: Josef Bacik The cleanup_extent_op function actually would run the extent_op if it needed running, which made the name sort of a misnomer. Change it to run_and_cleanup_extent_op, and move the actual cleanup work to cleanup_extent_op so it can be used by check_ref_cleanup() in order to unify the extent op handling. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 36 +++- 1 file changed, 23 insertions(+), 13 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index a44d55e36e11..98f36dfeccb0 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2442,19 +2442,33 @@ static void unselect_delayed_ref_head(struct btrfs_delayed_ref_root *delayed_ref btrfs_delayed_ref_unlock(head); } -static int cleanup_extent_op(struct btrfs_trans_handle *trans, -struct btrfs_delayed_ref_head *head) +static struct btrfs_delayed_extent_op * +cleanup_extent_op(struct btrfs_trans_handle *trans, + struct btrfs_delayed_ref_head *head) { struct btrfs_delayed_extent_op *extent_op = head->extent_op; - int ret; if (!extent_op) - return 0; - head->extent_op = NULL; + return NULL; + if (head->must_insert_reserved) { + head->extent_op = NULL; btrfs_free_delayed_extent_op(extent_op); - return 0; + return NULL; } + return extent_op; +} + +static int run_and_cleanup_extent_op(struct btrfs_trans_handle *trans, +struct btrfs_delayed_ref_head *head) +{ + struct btrfs_delayed_extent_op *extent_op = + cleanup_extent_op(trans, head); + int ret; + + if (!extent_op) + return 0; + head->extent_op = NULL; spin_unlock(>lock); ret = run_delayed_extent_op(trans, head, extent_op); btrfs_free_delayed_extent_op(extent_op); @@ -2506,7 +2520,7 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans, delayed_refs = >transaction->delayed_refs; - ret = cleanup_extent_op(trans, head); + ret = run_and_cleanup_extent_op(trans, head); if (ret < 0) { unselect_delayed_ref_head(delayed_refs, head); btrfs_debug(fs_info, "run_delayed_extent_op returned %d", ret); @@ -6977,12 +6991,8 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans, if (!RB_EMPTY_ROOT(>ref_tree)) goto out; - if (head->extent_op) { - if (!head->must_insert_reserved) - goto out; - btrfs_free_delayed_extent_op(head->extent_op); - head->extent_op = NULL; - } + if (cleanup_extent_op(trans, head) != NULL) + goto out; /* * waiting for the lock here would deadlock. If someone else has it -- 2.14.3
[PATCH 02/42] btrfs: add cleanup_ref_head_accounting helper
From: Josef Bacik We were missing some quota cleanups in check_ref_cleanup, so break the ref head accounting cleanup into a helper and call that from both check_ref_cleanup and cleanup_ref_head. This will hopefully ensure that we don't screw up accounting in the future for other things that we add. Reviewed-by: Omar Sandoval Reviewed-by: Liu Bo Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 67 +- 1 file changed, 39 insertions(+), 28 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index d24a0de4a2e7..a44d55e36e11 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2461,6 +2461,41 @@ static int cleanup_extent_op(struct btrfs_trans_handle *trans, return ret ? ret : 1; } +static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans, + struct btrfs_delayed_ref_head *head) +{ + struct btrfs_fs_info *fs_info = trans->fs_info; + struct btrfs_delayed_ref_root *delayed_refs = + >transaction->delayed_refs; + + if (head->total_ref_mod < 0) { + struct btrfs_space_info *space_info; + u64 flags; + + if (head->is_data) + flags = BTRFS_BLOCK_GROUP_DATA; + else if (head->is_system) + flags = BTRFS_BLOCK_GROUP_SYSTEM; + else + flags = BTRFS_BLOCK_GROUP_METADATA; + space_info = __find_space_info(fs_info, flags); + ASSERT(space_info); + percpu_counter_add_batch(_info->total_bytes_pinned, + -head->num_bytes, + BTRFS_TOTAL_BYTES_PINNED_BATCH); + + if (head->is_data) { + spin_lock(_refs->lock); + delayed_refs->pending_csums -= head->num_bytes; + spin_unlock(_refs->lock); + } + } + + /* Also free its reserved qgroup space */ + btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root, + head->qgroup_reserved); +} + static int cleanup_ref_head(struct btrfs_trans_handle *trans, struct btrfs_delayed_ref_head *head) { @@ -2496,31 +2531,6 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans, spin_unlock(>lock); spin_unlock(_refs->lock); - trace_run_delayed_ref_head(fs_info, head, 0); - - if (head->total_ref_mod < 0) { - struct btrfs_space_info *space_info; - u64 flags; - - if (head->is_data) - flags = BTRFS_BLOCK_GROUP_DATA; - else if (head->is_system) - flags = BTRFS_BLOCK_GROUP_SYSTEM; - else - flags = BTRFS_BLOCK_GROUP_METADATA; - space_info = __find_space_info(fs_info, flags); - ASSERT(space_info); - percpu_counter_add_batch(_info->total_bytes_pinned, - -head->num_bytes, - BTRFS_TOTAL_BYTES_PINNED_BATCH); - - if (head->is_data) { - spin_lock(_refs->lock); - delayed_refs->pending_csums -= head->num_bytes; - spin_unlock(_refs->lock); - } - } - if (head->must_insert_reserved) { btrfs_pin_extent(fs_info, head->bytenr, head->num_bytes, 1); @@ -2530,9 +2540,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans, } } - /* Also free its reserved qgroup space */ - btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root, - head->qgroup_reserved); + cleanup_ref_head_accounting(trans, head); + + trace_run_delayed_ref_head(fs_info, head, 0); btrfs_delayed_ref_unlock(head); btrfs_put_delayed_ref_head(head); return 0; @@ -6991,6 +7001,7 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans, if (head->must_insert_reserved) ret = 1; + cleanup_ref_head_accounting(trans, head); mutex_unlock(>mutex); btrfs_put_delayed_ref_head(head); return ret; -- 2.14.3
[PATCH 07/42] btrfs: check if free bgs for commit
may_commit_transaction will skip committing the transaction if we don't have enough pinned space or if we're trying to find space for a SYSTEM chunk. However if we have pending free block groups in this transaction we still want to commit as we may be able to allocate a chunk to make our reservation. So instead of just returning ENOSPC, check if we have free block groups pending, and if so commit the transaction to allow us to use that free space. Signed-off-by: Josef Bacik Reviewed-by: Omar Sandoval --- fs/btrfs/extent-tree.c | 33 +++-- 1 file changed, 19 insertions(+), 14 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 1213f573eea2..4aab49debfc3 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4830,10 +4830,18 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info, if (!bytes) return 0; - /* See if there is enough pinned space to make this reservation */ - if (__percpu_counter_compare(_info->total_bytes_pinned, - bytes, - BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0) + trans = btrfs_join_transaction(fs_info->extent_root); + if (IS_ERR(trans)) + return PTR_ERR(trans); + + /* +* See if there is enough pinned space to make this reservation, or if +* we have bg's that are going to be freed, allowing us to possibly do a +* chunk allocation the next loop through. +*/ + if (test_bit(BTRFS_TRANS_HAVE_FREE_BGS, >transaction->flags) || + __percpu_counter_compare(_info->total_bytes_pinned, bytes, +BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0) goto commit; /* @@ -4841,7 +4849,7 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info, * this reservation. */ if (space_info != delayed_rsv->space_info) - return -ENOSPC; + goto enospc; spin_lock(_rsv->lock); reclaim_bytes += delayed_rsv->reserved; @@ -4855,17 +4863,14 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info, bytes -= reclaim_bytes; if (__percpu_counter_compare(_info->total_bytes_pinned, - bytes, - BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) { - return -ENOSPC; - } - +bytes, +BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) + goto enospc; commit: - trans = btrfs_join_transaction(fs_info->extent_root); - if (IS_ERR(trans)) - return -ENOSPC; - return btrfs_commit_transaction(trans); +enospc: + btrfs_end_transaction(trans); + return -ENOSPC; } /* -- 2.14.3
[PATCH 10/42] btrfs: protect space cache inode alloc with nofs
If we're allocating a new space cache inode it's likely going to be under a transaction handle, so we need to use memalloc_nofs_save() in order to avoid deadlocks, and more importantly lockdep messages that make xfstests fail. Reviewed-by: Omar Sandoval Signed-off-by: Josef Bacik Reviewed-by: David Sterba --- fs/btrfs/free-space-cache.c | 8 1 file changed, 8 insertions(+) diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c index c3888c113d81..e077ad3b4549 100644 --- a/fs/btrfs/free-space-cache.c +++ b/fs/btrfs/free-space-cache.c @@ -10,6 +10,7 @@ #include #include #include +#include #include "ctree.h" #include "free-space-cache.h" #include "transaction.h" @@ -47,6 +48,7 @@ static struct inode *__lookup_free_space_inode(struct btrfs_root *root, struct btrfs_free_space_header *header; struct extent_buffer *leaf; struct inode *inode = NULL; + unsigned nofs_flag; int ret; key.objectid = BTRFS_FREE_SPACE_OBJECTID; @@ -68,7 +70,13 @@ static struct inode *__lookup_free_space_inode(struct btrfs_root *root, btrfs_disk_key_to_cpu(, _key); btrfs_release_path(path); + /* +* We are often under a trans handle at this point, so we need to make +* sure NOFS is set to keep us from deadlocking. +*/ + nofs_flag = memalloc_nofs_save(); inode = btrfs_iget(fs_info->sb, , root, NULL); + memalloc_nofs_restore(nofs_flag); if (IS_ERR(inode)) return inode; -- 2.14.3
[PATCH 04/42] btrfs: only track ref_heads in delayed_ref_updates
From: Josef Bacik We use this number to figure out how many delayed refs to run, but __btrfs_run_delayed_refs really only checks every time we need a new delayed ref head, so we always run at least one ref head completely no matter what the number of items on it. Fix the accounting to only be adjusted when we add/remove a ref head. Signed-off-by: Josef Bacik --- fs/btrfs/delayed-ref.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index 3a9e4ac21794..27f7dd4e3d52 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -234,8 +234,6 @@ static inline void drop_delayed_ref(struct btrfs_trans_handle *trans, ref->in_tree = 0; btrfs_put_delayed_ref(ref); atomic_dec(_refs->num_entries); - if (trans->delayed_ref_updates) - trans->delayed_ref_updates--; } static bool merge_ref(struct btrfs_trans_handle *trans, @@ -460,7 +458,6 @@ static int insert_delayed_ref(struct btrfs_trans_handle *trans, if (ref->action == BTRFS_ADD_DELAYED_REF) list_add_tail(>add_list, >ref_add_list); atomic_inc(>num_entries); - trans->delayed_ref_updates++; spin_unlock(>lock); return ret; } -- 2.14.3
[PATCH 18/42] btrfs: move the dio_sem higher up the callchain
We're getting a lockdep splat because we take the dio_sem under the log_mutex. What we really need is to protect fsync() from logging an extent map for an extent we never waited on higher up, so just guard the whole thing with dio_sem. == WARNING: possible circular locking dependency detected 4.18.0-rc4-xfstests-00025-g5de5edbaf1d4 #411 Not tainted -- aio-dio-invalid/30928 is trying to acquire lock: 92621cfd (>mmap_sem){}, at: get_user_pages_unlocked+0x5a/0x1e0 but task is already holding lock: cefe6b35 (>dio_sem){}, at: btrfs_direct_IO+0x3be/0x400 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #5 (>dio_sem){}: lock_acquire+0xbd/0x220 down_write+0x51/0xb0 btrfs_log_changed_extents+0x80/0xa40 btrfs_log_inode+0xbaf/0x1000 btrfs_log_inode_parent+0x26f/0xa80 btrfs_log_dentry_safe+0x50/0x70 btrfs_sync_file+0x357/0x540 do_fsync+0x38/0x60 __ia32_sys_fdatasync+0x12/0x20 do_fast_syscall_32+0x9a/0x2f0 entry_SYSENTER_compat+0x84/0x96 -> #4 (>log_mutex){+.+.}: lock_acquire+0xbd/0x220 __mutex_lock+0x86/0xa10 btrfs_record_unlink_dir+0x2a/0xa0 btrfs_unlink+0x5a/0xc0 vfs_unlink+0xb1/0x1a0 do_unlinkat+0x264/0x2b0 do_fast_syscall_32+0x9a/0x2f0 entry_SYSENTER_compat+0x84/0x96 -> #3 (sb_internal#2){.+.+}: lock_acquire+0xbd/0x220 __sb_start_write+0x14d/0x230 start_transaction+0x3e6/0x590 btrfs_evict_inode+0x475/0x640 evict+0xbf/0x1b0 btrfs_run_delayed_iputs+0x6c/0x90 cleaner_kthread+0x124/0x1a0 kthread+0x106/0x140 ret_from_fork+0x3a/0x50 -> #2 (_info->cleaner_delayed_iput_mutex){+.+.}: lock_acquire+0xbd/0x220 __mutex_lock+0x86/0xa10 btrfs_alloc_data_chunk_ondemand+0x197/0x530 btrfs_check_data_free_space+0x4c/0x90 btrfs_delalloc_reserve_space+0x20/0x60 btrfs_page_mkwrite+0x87/0x520 do_page_mkwrite+0x31/0xa0 __handle_mm_fault+0x799/0xb00 handle_mm_fault+0x7c/0xe0 __do_page_fault+0x1d3/0x4a0 async_page_fault+0x1e/0x30 -> #1 (sb_pagefaults){.+.+}: lock_acquire+0xbd/0x220 __sb_start_write+0x14d/0x230 btrfs_page_mkwrite+0x6a/0x520 do_page_mkwrite+0x31/0xa0 __handle_mm_fault+0x799/0xb00 handle_mm_fault+0x7c/0xe0 __do_page_fault+0x1d3/0x4a0 async_page_fault+0x1e/0x30 -> #0 (>mmap_sem){}: __lock_acquire+0x42e/0x7a0 lock_acquire+0xbd/0x220 down_read+0x48/0xb0 get_user_pages_unlocked+0x5a/0x1e0 get_user_pages_fast+0xa4/0x150 iov_iter_get_pages+0xc3/0x340 do_direct_IO+0xf93/0x1d70 __blockdev_direct_IO+0x32d/0x1c20 btrfs_direct_IO+0x227/0x400 generic_file_direct_write+0xcf/0x180 btrfs_file_write_iter+0x308/0x58c aio_write+0xf8/0x1d0 io_submit_one+0x3a9/0x620 __ia32_compat_sys_io_submit+0xb2/0x270 do_int80_syscall_32+0x5b/0x1a0 entry_INT80_compat+0x88/0xa0 other info that might help us debug this: Chain exists of: >mmap_sem --> >log_mutex --> >dio_sem Possible unsafe locking scenario: CPU0CPU1 lock(>dio_sem); lock(>log_mutex); lock(>dio_sem); lock(>mmap_sem); *** DEADLOCK *** 1 lock held by aio-dio-invalid/30928: #0: cefe6b35 (>dio_sem){}, at: btrfs_direct_IO+0x3be/0x400 stack backtrace: CPU: 0 PID: 30928 Comm: aio-dio-invalid Not tainted 4.18.0-rc4-xfstests-00025-g5de5edbaf1d4 #411 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 Call Trace: dump_stack+0x7c/0xbb print_circular_bug.isra.37+0x297/0x2a4 check_prev_add.constprop.45+0x781/0x7a0 ? __lock_acquire+0x42e/0x7a0 validate_chain.isra.41+0x7f0/0xb00 __lock_acquire+0x42e/0x7a0 lock_acquire+0xbd/0x220 ? get_user_pages_unlocked+0x5a/0x1e0 down_read+0x48/0xb0 ? get_user_pages_unlocked+0x5a/0x1e0 get_user_pages_unlocked+0x5a/0x1e0 get_user_pages_fast+0xa4/0x150 iov_iter_get_pages+0xc3/0x340 do_direct_IO+0xf93/0x1d70 ? __alloc_workqueue_key+0x358/0x490 ? __blockdev_direct_IO+0x14b/0x1c20 __blockdev_direct_IO+0x32d/0x1c20 ? btrfs_run_delalloc_work+0x40/0x40 ? can_nocow_extent+0x490/0x490 ? kvm_clock_read+0x1f/0x30 ? can_nocow_extent+0x490/0x490 ? btrfs_run_delalloc_work+0x40/0x40 btrfs_direct_IO+0x227/0x400 ? btrfs_run_delalloc_work+0x40/0x40 generic_file_direct_write+0xcf/0x180 btrfs_file_write_iter+0x308/0x58c aio_write+0xf8/0x1d0 ? kvm_clock_read+0x1f/0x30 ? __might_fault+0x3e/0x90 io_submit_one+0x3a9/0x620 ? io_submit_one+0xe5/0x620 __ia32_compat_sys_io_submit+0xb2/0x270 do_int80_syscall_32+0x5b/0x1a0 entry_INT80_compat+0x88/0xa0
[PATCH 09/42] btrfs: release metadata before running delayed refs
We want to release the unused reservation we have since it refills the delayed refs reserve, which will make everything go smoother when running the delayed refs if we're short on our reservation. Reviewed-by: Omar Sandoval Reviewed-by: Liu Bo Reviewed-by: Nikolay Borisov Signed-off-by: Josef Bacik --- fs/btrfs/transaction.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 117e0c4a914a..a0f19ca0bd6c 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -1922,6 +1922,9 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) return ret; } + btrfs_trans_release_metadata(trans); + trans->block_rsv = NULL; + /* make a pass through all the delayed refs we have so far * any runnings procs may add more while we are here */ @@ -1931,9 +1934,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) return ret; } - btrfs_trans_release_metadata(trans); - trans->block_rsv = NULL; - cur_trans = trans->transaction; /* -- 2.14.3
[PATCH 12/42] btrfs: don't use global rsv for chunk allocation
We've done this forever because of the voodoo around knowing how much space we have. However we have better ways of doing this now, and on normal file systems we'll easily have a global reserve of 512MiB, and since metadata chunks are usually 1GiB that means we'll allocate metadata chunks more readily. Instead use the actual used amount when determining if we need to allocate a chunk or not. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 9 - 1 file changed, 9 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 8b00c658deb3..828b82db439c 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4374,21 +4374,12 @@ static inline u64 calc_global_rsv_need_space(struct btrfs_block_rsv *global) static int should_alloc_chunk(struct btrfs_fs_info *fs_info, struct btrfs_space_info *sinfo, int force) { - struct btrfs_block_rsv *global_rsv = _info->global_block_rsv; u64 bytes_used = btrfs_space_info_used(sinfo, false); u64 thresh; if (force == CHUNK_ALLOC_FORCE) return 1; - /* -* We need to take into account the global rsv because for all intents -* and purposes it's used space. Don't worry about locking the -* global_rsv, it doesn't change except when the transaction commits. -*/ - if (sinfo->flags & BTRFS_BLOCK_GROUP_METADATA) - bytes_used += calc_global_rsv_need_space(global_rsv); - /* * in limited mode, we want to have some free space up to * about 1% of the FS size. -- 2.14.3
[PATCH 20/42] btrfs: don't use ctl->free_space for max_extent_size
From: Josef Bacik max_extent_size is supposed to be the largest contiguous range for the space info, and ctl->free_space is the total free space in the block group. We need to keep track of these separately and _only_ use the max_free_space if we don't have a max_extent_size, as that means our original request was too large to search any of the block groups for and therefore wouldn't have a max_extent_size set. Reviewed-by: Filipe Manana Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index ef852d1911d1..6721698ab8aa 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -7496,6 +7496,7 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info, struct btrfs_block_group_cache *block_group = NULL; u64 search_start = 0; u64 max_extent_size = 0; + u64 max_free_space = 0; u64 empty_cluster = 0; struct btrfs_space_info *space_info; int loop = 0; @@ -7791,8 +7792,8 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info, spin_lock(>tree_lock); if (ctl->free_space < num_bytes + empty_cluster + empty_size) { - if (ctl->free_space > max_extent_size) - max_extent_size = ctl->free_space; + max_free_space = max(max_free_space, +ctl->free_space); spin_unlock(>tree_lock); goto loop; } @@ -7959,6 +7960,8 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info, } out: if (ret == -ENOSPC) { + if (!max_extent_size) + max_extent_size = max_free_space; spin_lock(_info->lock); space_info->max_extent_size = max_extent_size; spin_unlock(_info->lock); -- 2.14.3
[PATCH 13/42] btrfs: add ALLOC_CHUNK_FORCE to the flushing code
With my change to no longer take into account the global reserve for metadata allocation chunks we have this side-effect for mixed block group fs'es where we are no longer allocating enough chunks for the data/metadata requirements. To deal with this add a ALLOC_CHUNK_FORCE step to the flushing state machine. This will only get used if we've already made a full loop through the flushing machinery and tried committing the transaction. If we have then we can try and force a chunk allocation since we likely need it to make progress. This resolves the issues I was seeing with the mixed bg tests in xfstests with my previous patch. Signed-off-by: Josef Bacik --- fs/btrfs/ctree.h | 3 ++- fs/btrfs/extent-tree.c | 18 +- include/trace/events/btrfs.h | 1 + 3 files changed, 20 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 1a2c3b629af2..29db902511c1 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -2730,7 +2730,8 @@ enum btrfs_flush_state { FLUSH_DELALLOC = 5, FLUSH_DELALLOC_WAIT = 6, ALLOC_CHUNK = 7, - COMMIT_TRANS= 8, + ALLOC_CHUNK_FORCE = 8, + COMMIT_TRANS= 9, }; int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 828b82db439c..640ec640b21c 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4914,6 +4914,7 @@ static void flush_space(struct btrfs_fs_info *fs_info, btrfs_end_transaction(trans); break; case ALLOC_CHUNK: + case ALLOC_CHUNK_FORCE: trans = btrfs_join_transaction(root); if (IS_ERR(trans)) { ret = PTR_ERR(trans); @@ -4921,7 +4922,9 @@ static void flush_space(struct btrfs_fs_info *fs_info, } ret = do_chunk_alloc(trans, btrfs_metadata_alloc_profile(fs_info), -CHUNK_ALLOC_NO_FORCE); +(state == ALLOC_CHUNK) ? +CHUNK_ALLOC_NO_FORCE : +CHUNK_ALLOC_FORCE); btrfs_end_transaction(trans); if (ret > 0 || ret == -ENOSPC) ret = 0; @@ -5057,6 +5060,19 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work) commit_cycles--; } + /* +* We don't want to force a chunk allocation until we've tried +* pretty hard to reclaim space. Think of the case where we +* free'd up a bunch of space and so have a lot of pinned space +* to reclaim. We would rather use that than possibly create a +* underutilized metadata chunk. So if this is our first run +* through the flushing state machine skip ALLOC_CHUNK_FORCE and +* commit the transaction. If nothing has changed the next go +* around then we can force a chunk allocation. +*/ + if (flush_state == ALLOC_CHUNK_FORCE && !commit_cycles) + flush_state++; + if (flush_state > COMMIT_TRANS) { commit_cycles++; if (commit_cycles > 2) { diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h index 7d205e50b09c..fdb23181b5b7 100644 --- a/include/trace/events/btrfs.h +++ b/include/trace/events/btrfs.h @@ -1051,6 +1051,7 @@ TRACE_EVENT(btrfs_trigger_flush, { FLUSH_DELAYED_REFS_NR,"FLUSH_DELAYED_REFS_NR"}, \ { FLUSH_DELAYED_REFS, "FLUSH_ELAYED_REFS"}, \ { ALLOC_CHUNK, "ALLOC_CHUNK"}, \ + { ALLOC_CHUNK_FORCE,"ALLOC_CHUNK_FORCE"}, \ { COMMIT_TRANS, "COMMIT_TRANS"}) TRACE_EVENT(btrfs_flush_space, -- 2.14.3
[PATCH 08/42] btrfs: dump block_rsv whe dumping space info
For enospc_debug having the block rsvs is super helpful to see if we've done something wrong. Signed-off-by: Josef Bacik Reviewed-by: Omar Sandoval Reviewed-by: David Sterba --- fs/btrfs/extent-tree.c | 15 +++ 1 file changed, 15 insertions(+) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 4aab49debfc3..8b00c658deb3 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -7918,6 +7918,15 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info, return ret; } +#define DUMP_BLOCK_RSV(fs_info, rsv_name) \ +do { \ + struct btrfs_block_rsv *__rsv = &(fs_info)->rsv_name; \ + spin_lock(&__rsv->lock);\ + btrfs_info(fs_info, #rsv_name ": size %llu reserved %llu", \ + __rsv->size, __rsv->reserved); \ + spin_unlock(&__rsv->lock); \ +} while (0) + static void dump_space_info(struct btrfs_fs_info *fs_info, struct btrfs_space_info *info, u64 bytes, int dump_block_groups) @@ -7937,6 +7946,12 @@ static void dump_space_info(struct btrfs_fs_info *fs_info, info->bytes_readonly); spin_unlock(>lock); + DUMP_BLOCK_RSV(fs_info, global_block_rsv); + DUMP_BLOCK_RSV(fs_info, trans_block_rsv); + DUMP_BLOCK_RSV(fs_info, chunk_block_rsv); + DUMP_BLOCK_RSV(fs_info, delayed_block_rsv); + DUMP_BLOCK_RSV(fs_info, delayed_refs_rsv); + if (!dump_block_groups) return; -- 2.14.3
[PATCH 17/42] btrfs: run delayed iputs before committing
Delayed iputs means we can have final iputs of deleted inodes in the queue, which could potentially generate a lot of pinned space that could be free'd. So before we decide to commit the transaction for ENOPSC reasons, run the delayed iputs so that any potential space is free'd up. If there is and we freed enough we can then commit the transaction and potentially be able to make our reservation. Signed-off-by: Josef Bacik Reviewed-by: Omar Sandoval --- fs/btrfs/extent-tree.c | 9 + 1 file changed, 9 insertions(+) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 3757ca42b8e0..ef852d1911d1 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4823,6 +4823,15 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info, if (!bytes) return 0; + /* +* If we have pending delayed iputs then we could free up a bunch of +* pinned space, so make sure we run the iputs before we do our pinned +* bytes check below. +*/ + mutex_lock(_info->cleaner_delayed_iput_mutex); + btrfs_run_delayed_iputs(fs_info); + mutex_unlock(_info->cleaner_delayed_iput_mutex); + trans = btrfs_join_transaction(fs_info->extent_root); if (IS_ERR(trans)) return PTR_ERR(trans); -- 2.14.3
[PATCH 15/42] btrfs: don't enospc all tickets on flush failure
With the introduction of the per-inode block_rsv it became possible to have really really large reservation requests made because of data fragmentation. Since the ticket stuff assumed that we'd always have relatively small reservation requests it just killed all tickets if we were unable to satisfy the current request. However this is generally not the case anymore. So fix this logic to instead see if we had a ticket that we were able to give some reservation to, and if we were continue the flushing loop again. Likewise we make the tickets use the space_info_add_old_bytes() method of returning what reservation they did receive in hopes that it could satisfy reservations down the line. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 45 + 1 file changed, 25 insertions(+), 20 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 2b1704331d21..19449ee93693 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4779,6 +4779,7 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info, u64 to_reclaim, } struct reserve_ticket { + u64 orig_bytes; u64 bytes; int error; struct list_head list; @@ -5000,7 +5001,7 @@ static inline int need_do_async_reclaim(struct btrfs_fs_info *fs_info, !test_bit(BTRFS_FS_STATE_REMOUNTING, _info->fs_state)); } -static void wake_all_tickets(struct list_head *head) +static bool wake_all_tickets(struct list_head *head) { struct reserve_ticket *ticket; @@ -5009,7 +5010,10 @@ static void wake_all_tickets(struct list_head *head) list_del_init(>list); ticket->error = -ENOSPC; wake_up(>wait); + if (ticket->bytes != ticket->orig_bytes) + return true; } + return false; } /* @@ -5077,8 +5081,12 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work) if (flush_state > COMMIT_TRANS) { commit_cycles++; if (commit_cycles > 2) { - wake_all_tickets(_info->tickets); - space_info->flush = 0; + if (wake_all_tickets(_info->tickets)) { + flush_state = FLUSH_DELAYED_ITEMS_NR; + commit_cycles--; + } else { + space_info->flush = 0; + } } else { flush_state = FLUSH_DELAYED_ITEMS_NR; } @@ -5130,10 +5138,11 @@ static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info, static int wait_reserve_ticket(struct btrfs_fs_info *fs_info, struct btrfs_space_info *space_info, - struct reserve_ticket *ticket, u64 orig_bytes) + struct reserve_ticket *ticket) { DEFINE_WAIT(wait); + u64 reclaim_bytes = 0; int ret = 0; spin_lock(_info->lock); @@ -5154,14 +5163,12 @@ static int wait_reserve_ticket(struct btrfs_fs_info *fs_info, ret = ticket->error; if (!list_empty(>list)) list_del_init(>list); - if (ticket->bytes && ticket->bytes < orig_bytes) { - u64 num_bytes = orig_bytes - ticket->bytes; - space_info->bytes_may_use -= num_bytes; - trace_btrfs_space_reservation(fs_info, "space_info", - space_info->flags, num_bytes, 0); - } + if (ticket->bytes && ticket->bytes < ticket->orig_bytes) + reclaim_bytes = ticket->orig_bytes - ticket->bytes; spin_unlock(_info->lock); + if (reclaim_bytes) + space_info_add_old_bytes(fs_info, space_info, reclaim_bytes); return ret; } @@ -5187,6 +5194,7 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info, { struct reserve_ticket ticket; u64 used; + u64 reclaim_bytes = 0; int ret = 0; ASSERT(orig_bytes); @@ -5222,6 +5230,7 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info, * the list and we will do our own flushing further down. */ if (ret && flush != BTRFS_RESERVE_NO_FLUSH) { + ticket.orig_bytes = orig_bytes; ticket.bytes = orig_bytes; ticket.error = 0; init_waitqueue_head(); @@ -5262,25 +5271,21 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info, return ret; if (flush == BTRFS_RESERVE_FLUSH_ALL) - return wait_reserve_ticket(fs_info, space_info, , - orig_bytes); + return wait_reserve_ticket(fs_info,
Re: [PATCH 33/42] btrfs: fix insert_reserved error handling
On Thu, Oct 11, 2018 at 03:54:22PM -0400, Josef Bacik wrote: > We were not handling the reserved byte accounting properly for data > references. Metadata was fine, if it errored out the error paths would > free the bytes_reserved count and pin the extent, but it even missed one > of the error cases. So instead move this handling up into > run_one_delayed_ref so we are sure that both cases are properly cleaned > up in case of a transaction abort. > > Reviewed-by: Nikolay Borisov > Signed-off-by: Josef Bacik Reviewed-by: David Sterba
Re: [PATCH 19/42] btrfs: set max_extent_size properly
On Thu, Oct 11, 2018 at 03:54:08PM -0400, Josef Bacik wrote: > From: Josef Bacik > > We can't use entry->bytes if our entry is a bitmap entry, we need to use > entry->max_extent_size in that case. Fix up all the logic to make this > consistent. > > Signed-off-by: Josef Bacik > --- > fs/btrfs/free-space-cache.c | 29 +++-- > 1 file changed, 19 insertions(+), 10 deletions(-) > > diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c > index e077ad3b4549..2e96ee7da3ec 100644 > --- a/fs/btrfs/free-space-cache.c > +++ b/fs/btrfs/free-space-cache.c > @@ -1770,6 +1770,18 @@ static int search_bitmap(struct btrfs_free_space_ctl > *ctl, > return -1; > } > > +static void set_max_extent_size(struct btrfs_free_space *entry, I find this a bit confusing, in the code it's like set_max_extent_size(entry, max_extent_size) so it reads like 'set max extent size of the entry to value', while the function does the opposite. That the 2nd parameter is a pointer does not help either. A function comment explaining that would not help as it's confusing at the callsites. *max_extent_size = get_max_extent_size(entry, *max_extent_size) This is more obvious what it does but can't be called an improvement regarding readability. Other ideas?
Re: [PATCH 07/42] btrfs: check if free bgs for commit
On Thu, Oct 11, 2018 at 02:33:55PM -0400, Josef Bacik wrote: > On Thu, Oct 04, 2018 at 01:24:24PM +0200, David Sterba wrote: > > On Fri, Sep 28, 2018 at 07:17:46AM -0400, Josef Bacik wrote: > > > may_commit_transaction will skip committing the transaction if we don't > > > have enough pinned space or if we're trying to find space for a SYSTEM > > > chunk. However if we have pending free block groups in this transaction > > > we still want to commit as we may be able to allocate a chunk to make > > > our reservation. So instead of just returning ENOSPC, check if we have > > > free block groups pending, and if so commit the transaction to allow us > > > to use that free space. > > > > > > Signed-off-by: Josef Bacik > > > Reviewed-by: Omar Sandoval > > > --- > > > fs/btrfs/extent-tree.c | 33 +++-- > > > 1 file changed, 19 insertions(+), 14 deletions(-) > > > > > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > > > index 1213f573eea2..da73b3e5bc39 100644 > > > --- a/fs/btrfs/extent-tree.c > > > +++ b/fs/btrfs/extent-tree.c > > > @@ -4830,10 +4830,18 @@ static int may_commit_transaction(struct > > > btrfs_fs_info *fs_info, > > > if (!bytes) > > > return 0; > > > > > > - /* See if there is enough pinned space to make this reservation */ > > > - if (__percpu_counter_compare(_info->total_bytes_pinned, > > > -bytes, > > > -BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0) > > > + trans = btrfs_join_transaction(fs_info->extent_root); > > > + if (IS_ERR(trans)) > > > + return -ENOSPC; > > > > Why do you set override the error to ENOSPC, instead of passing > > PTR_ERR(trans)? There are more reasons why btrfs_join_transaction could > > fail: EROFS after error, EDQUOT when quotas say no, ENOMEM from > > ulist_add, that's just a sample I quickly found. > > (argh I hit r instead of g, sending again) > > Because may_commit_transaction is only called during flushing, we don't > actually > care about the value, just that it failed. Thanks, No, we do care about error values. If anything, here it's used in the tracepoint as the reason why the flushing failed. Conflating them into one without a good reason is a bad practice I don't want to let spread in the code.
Re: [PATCH v2 1/2] btrfs: relocation: Cleanup while() loop using rbtree_postorder_for_each_entry_safe()
On Fri, Sep 21, 2018 at 03:20:29PM +0800, Qu Wenruo wrote: > And add one line comment explaining what we're doing for each loop. > > Signed-off-by: Qu Wenruo > --- > changelog: > v2: > Use rbtree_postorder_for_each_entry_safe() to replace for() loop. 1-2 reviewed and added to 4.20 queue, thanks.
Re: [PATCH v3 4/4] btrfs: Refactor find_free_extent() loops update into find_free_extent_update_loop()
On 2018/10/12 下午9:52, Josef Bacik wrote: > On Fri, Oct 12, 2018 at 02:18:19PM +0800, Qu Wenruo wrote: >> We have a complex loop design for find_free_extent(), that has different >> behavior for each loop, some even includes new chunk allocation. >> >> Instead of putting such a long code into find_free_extent() and makes it >> harder to read, just extract them into find_free_extent_update_loop(). >> >> With all the cleanups, the main find_free_extent() should be pretty >> barebone: >> >> find_free_extent() >> |- Iterate through all block groups >> | |- Get a valid block group >> | |- Try to do clustered allocation in that block group >> | |- Try to do unclustered allocation in that block group >> | |- Check if the result is valid >> | | |- If valid, then exit >> | |- Jump to next block group >> | >> |- Push harder to find free extents >>|- If not found, re-iterate all block groups >> >> Signed-off-by: Qu Wenruo >> Reviewed-by: Su Yue >> --- >> fs/btrfs/extent-tree.c | 217 ++--- >> 1 file changed, 117 insertions(+), 100 deletions(-) >> >> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >> index e6bfa91af41c..938569d2c583 100644 >> --- a/fs/btrfs/extent-tree.c >> +++ b/fs/btrfs/extent-tree.c >> @@ -7236,7 +7236,9 @@ struct find_free_extent_ctl { >> /* RAID index, converted from flags */ >> int index; >> >> -/* Current loop number */ >> +/* >> + * Current loop number, check find_free_extent_update_loop() for details >> + */ >> int loop; >> >> /* >> @@ -7433,6 +7435,117 @@ static int find_free_extent_unclustered(struct >> btrfs_block_group_cache *bg, >> return 0; >> } >> >> +/* >> + * Return >0 means caller needs to re-search for free extent >> + * Return 0 means we have the needed free extent. >> + * Return <0 means we failed to locate any free extent. >> + */ >> +static int find_free_extent_update_loop(struct btrfs_fs_info *fs_info, >> +struct btrfs_free_cluster *last_ptr, >> +struct btrfs_key *ins, >> +struct find_free_extent_ctl *ffe_ctl, >> +int full_search, bool use_cluster) >> +{ >> +struct btrfs_root *root = fs_info->extent_root; >> +int ret; >> + >> +if ((ffe_ctl->loop == LOOP_CACHING_NOWAIT) && >> +ffe_ctl->have_caching_bg && !ffe_ctl->orig_have_caching_bg) >> +ffe_ctl->orig_have_caching_bg = true; >> + >> +if (!ins->objectid && ffe_ctl->loop >= LOOP_CACHING_WAIT && >> + ffe_ctl->have_caching_bg) >> +return 1; >> + >> +if (!ins->objectid && ++(ffe_ctl->index) < BTRFS_NR_RAID_TYPES) >> +return 1; >> + >> +/* >> + * LOOP_CACHING_NOWAIT, search partially cached block groups, kicking >> + * caching kthreads as we move along >> + * LOOP_CACHING_WAIT, search everything, and wait if our bg is caching >> + * LOOP_ALLOC_CHUNK, force a chunk allocation and try again >> + * LOOP_NO_EMPTY_SIZE, set empty_size and empty_cluster to 0 and try >> + * again >> + */ >> +if (!ins->objectid && ffe_ctl->loop < LOOP_NO_EMPTY_SIZE) { >> +ffe_ctl->index = 0; >> +if (ffe_ctl->loop == LOOP_CACHING_NOWAIT) { >> +/* >> + * We want to skip the LOOP_CACHING_WAIT step if we >> + * don't have any uncached bgs and we've already done a >> + * full search through. >> + */ >> +if (ffe_ctl->orig_have_caching_bg || !full_search) >> +ffe_ctl->loop = LOOP_CACHING_WAIT; >> +else >> +ffe_ctl->loop = LOOP_ALLOC_CHUNK; >> +} else { >> +ffe_ctl->loop++; >> +} >> + >> +if (ffe_ctl->loop == LOOP_ALLOC_CHUNK) { >> +struct btrfs_trans_handle *trans; >> +int exist = 0; >> + >> +trans = current->journal_info; >> +if (trans) >> +exist = 1; >> +else >> +trans = btrfs_join_transaction(root); >> + >> +if (IS_ERR(trans)) { >> +ret = PTR_ERR(trans); >> +return ret; >> +} >> + >> +ret = do_chunk_alloc(trans, ffe_ctl->flags, >> + CHUNK_ALLOC_FORCE); >> + >> +/* >> + * If we can't allocate a new chunk we've already looped >> + * through at least once, move on to the NO_EMPTY_SIZE >> + * case. >> + */ >> +if (ret == -ENOSPC) >> +
Re: [PATCH v3 3/4] btrfs: Refactor unclustered extent allocation into find_free_extent_unclustered()
On Fri, Oct 12, 2018 at 02:18:18PM +0800, Qu Wenruo wrote: > This patch will extract unclsutered extent allocation code into > find_free_extent_unclustered(). > > And this helper function will use return value to indicate what to do > next. > > This should make find_free_extent() a little easier to read. > > Signed-off-by: Qu Wenruo > Reviewed-by: Su Yue Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH v3 2/4] btrfs: Refactor clustered extent allocation into find_free_extent_clustered()
On Fri, Oct 12, 2018 at 02:18:17PM +0800, Qu Wenruo wrote: > We have two main methods to find free extents inside a block group: > 1) clustered allocation > 2) unclustered allocation > > This patch will extract the clustered allocation into > find_free_extent_clustered() to make it a little easier to read. > > Instead of jumping between different labels in find_free_extent(), the > helper function will use return value to indicate different behavior. > > Signed-off-by: Qu Wenruo > Reviewed-by: Su Yue Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH v3 1/4] btrfs: Introduce find_free_extent_ctl structure for later rework
On Fri, Oct 12, 2018 at 02:18:16PM +0800, Qu Wenruo wrote: > Instead of tons of different local variables in find_free_extent(), > extract them into find_free_extent_ctl structure, and add better > explanation for them. > > Some modification may looks redundant, but will later greatly simplify > function parameter list during find_free_extent() refactor. > > Signed-off-by: Qu Wenruo Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH v3 4/4] btrfs: Refactor find_free_extent() loops update into find_free_extent_update_loop()
On Fri, Oct 12, 2018 at 02:18:19PM +0800, Qu Wenruo wrote: > We have a complex loop design for find_free_extent(), that has different > behavior for each loop, some even includes new chunk allocation. > > Instead of putting such a long code into find_free_extent() and makes it > harder to read, just extract them into find_free_extent_update_loop(). > > With all the cleanups, the main find_free_extent() should be pretty > barebone: > > find_free_extent() > |- Iterate through all block groups > | |- Get a valid block group > | |- Try to do clustered allocation in that block group > | |- Try to do unclustered allocation in that block group > | |- Check if the result is valid > | | |- If valid, then exit > | |- Jump to next block group > | > |- Push harder to find free extents >|- If not found, re-iterate all block groups > > Signed-off-by: Qu Wenruo > Reviewed-by: Su Yue > --- > fs/btrfs/extent-tree.c | 217 ++--- > 1 file changed, 117 insertions(+), 100 deletions(-) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index e6bfa91af41c..938569d2c583 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -7236,7 +7236,9 @@ struct find_free_extent_ctl { > /* RAID index, converted from flags */ > int index; > > - /* Current loop number */ > + /* > + * Current loop number, check find_free_extent_update_loop() for details > + */ > int loop; > > /* > @@ -7433,6 +7435,117 @@ static int find_free_extent_unclustered(struct > btrfs_block_group_cache *bg, > return 0; > } > > +/* > + * Return >0 means caller needs to re-search for free extent > + * Return 0 means we have the needed free extent. > + * Return <0 means we failed to locate any free extent. > + */ > +static int find_free_extent_update_loop(struct btrfs_fs_info *fs_info, > + struct btrfs_free_cluster *last_ptr, > + struct btrfs_key *ins, > + struct find_free_extent_ctl *ffe_ctl, > + int full_search, bool use_cluster) > +{ > + struct btrfs_root *root = fs_info->extent_root; > + int ret; > + > + if ((ffe_ctl->loop == LOOP_CACHING_NOWAIT) && > + ffe_ctl->have_caching_bg && !ffe_ctl->orig_have_caching_bg) > + ffe_ctl->orig_have_caching_bg = true; > + > + if (!ins->objectid && ffe_ctl->loop >= LOOP_CACHING_WAIT && > + ffe_ctl->have_caching_bg) > + return 1; > + > + if (!ins->objectid && ++(ffe_ctl->index) < BTRFS_NR_RAID_TYPES) > + return 1; > + > + /* > + * LOOP_CACHING_NOWAIT, search partially cached block groups, kicking > + * caching kthreads as we move along > + * LOOP_CACHING_WAIT, search everything, and wait if our bg is caching > + * LOOP_ALLOC_CHUNK, force a chunk allocation and try again > + * LOOP_NO_EMPTY_SIZE, set empty_size and empty_cluster to 0 and try > + * again > + */ > + if (!ins->objectid && ffe_ctl->loop < LOOP_NO_EMPTY_SIZE) { > + ffe_ctl->index = 0; > + if (ffe_ctl->loop == LOOP_CACHING_NOWAIT) { > + /* > + * We want to skip the LOOP_CACHING_WAIT step if we > + * don't have any uncached bgs and we've already done a > + * full search through. > + */ > + if (ffe_ctl->orig_have_caching_bg || !full_search) > + ffe_ctl->loop = LOOP_CACHING_WAIT; > + else > + ffe_ctl->loop = LOOP_ALLOC_CHUNK; > + } else { > + ffe_ctl->loop++; > + } > + > + if (ffe_ctl->loop == LOOP_ALLOC_CHUNK) { > + struct btrfs_trans_handle *trans; > + int exist = 0; > + > + trans = current->journal_info; > + if (trans) > + exist = 1; > + else > + trans = btrfs_join_transaction(root); > + > + if (IS_ERR(trans)) { > + ret = PTR_ERR(trans); > + return ret; > + } > + > + ret = do_chunk_alloc(trans, ffe_ctl->flags, > + CHUNK_ALLOC_FORCE); > + > + /* > + * If we can't allocate a new chunk we've already looped > + * through at least once, move on to the NO_EMPTY_SIZE > + * case. > + */ > + if (ret == -ENOSPC) > + ffe_ctl->loop = LOOP_NO_EMPTY_SIZE; > + > + /* Do not bail out on ENOSPC since we
Re: [PATCH] Btrfs: fix deadlock when writing out free space caches
On Fri, Oct 12, 2018 at 10:03:55AM +0100, fdman...@kernel.org wrote: > From: Filipe Manana > > When writing out a block group free space cache we can end deadlocking > with ourseves on an extent buffer lock resulting in a warning like the > following: > > [245043.379979] WARNING: CPU: 4 PID: 2608 at fs/btrfs/locking.c:251 > btrfs_tree_lock+0x1be/0x1d0 [btrfs] > [245043.392792] CPU: 4 PID: 2608 Comm: btrfs-transacti Tainted: G > W I 4.16.8 #1 > [245043.395489] RIP: 0010:btrfs_tree_lock+0x1be/0x1d0 [btrfs] > [245043.396791] RSP: 0018:c9000424b840 EFLAGS: 00010246 > [245043.398093] RAX: 0a30 RBX: 8807e20a3d20 RCX: > 0001 > [245043.399414] RDX: 0001 RSI: 0002 RDI: > 8807e20a3d20 > [245043.400732] RBP: 0001 R08: 88041f39a700 R09: > 8800 > [245043.402021] R10: 0040 R11: 8807e20a3d20 R12: > 8807cb220630 > [245043.403296] R13: 0001 R14: 8807cb220628 R15: > 88041fbdf000 > [245043.404780] FS: () GS:88082fc8() > knlGS: > [245043.406050] CS: 0010 DS: ES: CR0: 80050033 > [245043.407321] CR2: 7fffdbdb9f10 CR3: 01c09005 CR4: > 000206e0 > [245043.408670] Call Trace: > [245043.409977] btrfs_search_slot+0x761/0xa60 [btrfs] > [245043.411278] btrfs_insert_empty_items+0x62/0xb0 [btrfs] > [245043.412572] btrfs_insert_item+0x5b/0xc0 [btrfs] > [245043.413922] btrfs_create_pending_block_groups+0xfb/0x1e0 [btrfs] > [245043.415216] do_chunk_alloc+0x1e5/0x2a0 [btrfs] > [245043.416487] find_free_extent+0xcd0/0xf60 [btrfs] > [245043.417813] btrfs_reserve_extent+0x96/0x1e0 [btrfs] > [245043.419105] btrfs_alloc_tree_block+0xfb/0x4a0 [btrfs] > [245043.420378] __btrfs_cow_block+0x127/0x550 [btrfs] > [245043.421652] btrfs_cow_block+0xee/0x190 [btrfs] > [245043.422979] btrfs_search_slot+0x227/0xa60 [btrfs] > [245043.424279] ? btrfs_update_inode_item+0x59/0x100 [btrfs] > [245043.425538] ? iput+0x72/0x1e0 > [245043.426798] write_one_cache_group.isra.49+0x20/0x90 [btrfs] > [245043.428131] btrfs_start_dirty_block_groups+0x102/0x420 [btrfs] > [245043.429419] btrfs_commit_transaction+0x11b/0x880 [btrfs] > [245043.430712] ? start_transaction+0x8e/0x410 [btrfs] > [245043.432006] transaction_kthread+0x184/0x1a0 [btrfs] > [245043.433341] kthread+0xf0/0x130 > [245043.434628] ? btrfs_cleanup_transaction+0x4e0/0x4e0 [btrfs] > [245043.435928] ? kthread_create_worker_on_cpu+0x40/0x40 > [245043.437236] ret_from_fork+0x1f/0x30 > [245043.441054] ---[ end trace 15abaa2aaf36827f ]--- > > This is because at write_one_cache_group() when we are COWing a leaf from > the extent tree we end up allocating a new block group (chunk) and, > because we have hit a threshold on the number of bytes reserved for system > chunks, we attempt to finalize the creation of new block groups from the > current transaction, by calling btrfs_create_pending_block_groups(). > However here we also need to modify the extent tree in order to insert > a block group item, and if the location for this new block group item > happens to be in the same leaf that we were COWing earlier, we deadlock > since btrfs_search_slot() tries to write lock the extent buffer that we > locked before at write_one_cache_group(). > > We have already hit similar cases in the past and commit d9a0540a79f8 > ("Btrfs: fix deadlock when finalizing block group creation") fixed some > of those cases by delaying the creation of pending block groups at the > known specific spots that could lead to a deadlock. This change reworks > that commit to be more generic so that we don't have to add similar logic > to every possible path that can lead to a deadlock. This is done by > making __btrfs_cow_block() disallowing the creation of new block groups > (setting the transaction's can_flush_pending_bgs to false) before it > attempts to allocate a new extent buffer for either the extent, chunk or > device trees, since those are the trees that pending block creation > modifies. Once the new extent buffer is allocated, it allows creation of > pending block groups to happen again. > > This change depends on a recent patch from Josef which is not yet in > Linus' tree, named "btrfs: make sure we create all new block groups" in > order to avoid occasional warnings at btrfs_trans_release_chunk_metadata(). > > Fixes: d9a0540a79f8 ("Btrfs: fix deadlock when finalizing block group > creation") > Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199753 > Link: > https://lore.kernel.org/linux-btrfs/CAJtFHUTHna09ST-_EEiyWmDH6gAqS6wa=zmnmbsifj8abu9...@mail.gmail.com/ > Reported-by: E V > Signed-off-by: Filipe Manana Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH] Btrfs: fix use-after-free during inode eviction
On 2018/10/12 下午8:02, fdman...@kernel.org wrote: > From: Filipe Manana > > At inode.c:evict_inode_truncate_pages(), when we iterate over the inode's > extent states, we access an extent state record's "state" field after we > unlocked the inode's io tree lock. This can lead to a use-after-free issue > because after we unlock the io tree that extent state record might have > been freed due to being merged into another adjacent extent state > record (a previous inflight bio for a read operation finished in the > meanwhile which unlocked a range in the io tree and cause a merge of > extent state records, as explained in the comment before the while loop > added in commit 6ca0709756710 ("Btrfs: fix hang during inode eviction due > to concurrent readahead")). > > Fix this by keeping a copy of the extent state's flags in a local variable > and using it after unlocking the io tree. > > Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201189 > Fixes: b9d0b38928e2 ("btrfs: Add handler for invalidate page") > CC: sta...@vger.kernel.org # 4.4+ > Signed-off-by: Filipe Manana Reviewed-by: Qu Wenruo Indeed my fault, state should only be accessed with spinlock hold. Thanks, Qu > --- > fs/btrfs/inode.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 3ea5339603cf..66c6c4103d2f 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -5274,11 +5274,13 @@ static void evict_inode_truncate_pages(struct inode > *inode) > struct extent_state *cached_state = NULL; > u64 start; > u64 end; > + unsigned state_flags; > > node = rb_first(_tree->state); > state = rb_entry(node, struct extent_state, rb_node); > start = state->start; > end = state->end; > + state_flags = state->state; > spin_unlock(_tree->lock); > > lock_extent_bits(io_tree, start, end, _state); > @@ -5291,7 +5293,7 @@ static void evict_inode_truncate_pages(struct inode > *inode) >* >* Note, end is the bytenr of last byte, so we need + 1 here. >*/ > - if (state->state & EXTENT_DELALLOC) > + if (state_flags & EXTENT_DELALLOC) > btrfs_qgroup_free_data(inode, NULL, start, end - start > + 1); > > clear_extent_bit(io_tree, start, end, > signature.asc Description: OpenPGP digital signature
[PATCH] Btrfs: fix use-after-free during inode eviction
From: Filipe Manana At inode.c:evict_inode_truncate_pages(), when we iterate over the inode's extent states, we access an extent state record's "state" field after we unlocked the inode's io tree lock. This can lead to a use-after-free issue because after we unlock the io tree that extent state record might have been freed due to being merged into another adjacent extent state record (a previous inflight bio for a read operation finished in the meanwhile which unlocked a range in the io tree and cause a merge of extent state records, as explained in the comment before the while loop added in commit 6ca0709756710 ("Btrfs: fix hang during inode eviction due to concurrent readahead")). Fix this by keeping a copy of the extent state's flags in a local variable and using it after unlocking the io tree. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201189 Fixes: b9d0b38928e2 ("btrfs: Add handler for invalidate page") CC: sta...@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana --- fs/btrfs/inode.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 3ea5339603cf..66c6c4103d2f 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -5274,11 +5274,13 @@ static void evict_inode_truncate_pages(struct inode *inode) struct extent_state *cached_state = NULL; u64 start; u64 end; + unsigned state_flags; node = rb_first(_tree->state); state = rb_entry(node, struct extent_state, rb_node); start = state->start; end = state->end; + state_flags = state->state; spin_unlock(_tree->lock); lock_extent_bits(io_tree, start, end, _state); @@ -5291,7 +5293,7 @@ static void evict_inode_truncate_pages(struct inode *inode) * * Note, end is the bytenr of last byte, so we need + 1 here. */ - if (state->state & EXTENT_DELALLOC) + if (state_flags & EXTENT_DELALLOC) btrfs_qgroup_free_data(inode, NULL, start, end - start + 1); clear_extent_bit(io_tree, start, end, -- 2.11.0
Re: errors reported by btrfs-check
On 2018/10/12 下午6:35, Jürgen Herrmann wrote: > Am 12.10.2018 10:19, schrieb Qu Wenruo: > > [snip] > >> Please run the following command: >> >> # btrfs ins dump-tree --follow -b 166456229888 >> >> It could be caused by the fact that btrfs-progs --repair doesn't handle >> log tree well. >> >> If that's the case, "btrfs rescue zero-log" should help. >> >> But anyway, feel free to re-create the fs if zero-log doesn't help. >> >> Thanks, >> Qu > > First of all, thanks for your help! > > Did that: > > root@mint:/home/mint# /tmp/btrfs ins dump-tree --follow -b 166456229888 > /dev/mapper/sda3c > /tmp/btrfs-ins.txt > parent transid verify failed on 166475366400 wanted 127054 found 127060 > Ignoring transid failure > WARNING: eb corrupted: parent bytenr 166456229888 slot 22 level 2 child > bytenr 166475366400 level has 0 expect 1, skipping the slot > > The created btrfs-ins.txt is ~300MB in size, so i did not attach that > here. do you need id? then I can upload it to my owncloud and post a > link here. > > after that i tried to zero the log: > parent transid verify failed on 166475366400 wanted 127054 found 127060 > parent transid verify failed on 166475366400 wanted 127054 found 127060 > Ignoring transid failure > ERROR: child eb corrupted: parent bytenr=166456229888 item=22 parent > level=2 child level=0 > ERROR: could not open ctree > > mounting is still not possible: > [ 2726.355134] BTRFS info (device dm-0): disk space caching is enabled > [ 2726.355138] BTRFS info (device dm-0): has skinny extents > [ 2726.395638] BTRFS error (device dm-0): parent transid verify failed > on 166475366400 wanted 127054 found 127060 > [ 2726.395647] BTRFS error (device dm-0): failed to read block groups: -5 This means the extent tree is corrupted. This is not a good thing at all. Please go ahead and recreate the fs. I'm not sure how is --repair contribute to this bug in current stage. The biggest problem is, after your first --repair attempt, it doesn't report any extent tree corruption. > [ 2726.451907] BTRFS error (device dm-0): open_ctree failed > > Anything left to try? No, please go ahead to do what you need to. Thanks, Qu > > Best regards, > Jürgen signature.asc Description: OpenPGP digital signature
Re: errors reported by btrfs-check
Am 12.10.2018 10:19, schrieb Qu Wenruo: [snip] Please run the following command: # btrfs ins dump-tree --follow -b 166456229888 It could be caused by the fact that btrfs-progs --repair doesn't handle log tree well. If that's the case, "btrfs rescue zero-log" should help. But anyway, feel free to re-create the fs if zero-log doesn't help. Thanks, Qu First of all, thanks for your help! Did that: root@mint:/home/mint# /tmp/btrfs ins dump-tree --follow -b 166456229888 /dev/mapper/sda3c > /tmp/btrfs-ins.txt parent transid verify failed on 166475366400 wanted 127054 found 127060 Ignoring transid failure WARNING: eb corrupted: parent bytenr 166456229888 slot 22 level 2 child bytenr 166475366400 level has 0 expect 1, skipping the slot The created btrfs-ins.txt is ~300MB in size, so i did not attach that here. do you need id? then I can upload it to my owncloud and post a link here. after that i tried to zero the log: parent transid verify failed on 166475366400 wanted 127054 found 127060 parent transid verify failed on 166475366400 wanted 127054 found 127060 Ignoring transid failure ERROR: child eb corrupted: parent bytenr=166456229888 item=22 parent level=2 child level=0 ERROR: could not open ctree mounting is still not possible: [ 2726.355134] BTRFS info (device dm-0): disk space caching is enabled [ 2726.355138] BTRFS info (device dm-0): has skinny extents [ 2726.395638] BTRFS error (device dm-0): parent transid verify failed on 166475366400 wanted 127054 found 127060 [ 2726.395647] BTRFS error (device dm-0): failed to read block groups: -5 [ 2726.451907] BTRFS error (device dm-0): open_ctree failed Anything left to try? Best regards, Jürgen -- Jürgen Herrmann https://t-5.eu ALbertstraße 2 94327 Bogen
Re: errors reported by btrfs-check
[snip] > > Hi there! > > I ran btrfs check --repair on the filesystem. I dont' have this log > anymore, > as it was then sitting on the repaired fs), which is now dead. > after repairing it I could still mount the fs. > > as my btrfs send problem still persists (another thread), I decided to > run check > again... this time i saved the logs to a different partition as it was more > convenient at that time. here's the output: > > Checking filesystem on /dev/mapper/sda3c > UUID: a914c141-72bf-448b-847f-d64ee82d8b7b > [1/7] checking root items > Fixed 0 roots. > [2/7] checking extents > No device size related problem found > [3/7] checking free space cache > cache and super generation don't match, space cache will be invalidated > [4/7] checking fs roots > root 1387 inode 3082380 errors 100, file extent discount > Found file extent holes: > start: 0, len: 233472 > ERROR: errors found in fs roots > found 470636822528 bytes used, error(s) found > total csum bytes: 452176524 > total tree bytes: 4579753984 > total fs tree bytes: 3822665728 > total extent tree bytes: 245301248 > btree space waste bytes: 847442809 > file data blocks allocated: 9670623588352 > referenced 932405829632 > > I did a repair once again but the error did not go away. > > after rebooting the fs is no longer mountable: > mint@mint:/tmp$ sudo /tmp/btrfs check /dev/mapper/sda3c > Opening filesystem to check... > parent transid verify failed on 166475366400 wanted 127054 found 127060 > parent transid verify failed on 166475366400 wanted 127054 found 127060 > Ignoring transid failure > ERROR: child eb corrupted: parent bytenr=166456229888 item=22 parent > level=2 child level=0 Please run the following command: # btrfs ins dump-tree --follow -b 166456229888 It could be caused by the fact that btrfs-progs --repair doesn't handle log tree well. If that's the case, "btrfs rescue zero-log" should help. But anyway, feel free to re-create the fs if zero-log doesn't help. Thanks, Qu > ERROR: cannot open file system > > If anyone wants me to help debugging this issue you will have to respond > fast, because i need my laptop back in working condition and will soon > nuke the fs. after all I will try one more time with a freshly formatted > btrfs partition. if i still get these kinds of errors i will have to > move back to > the tried and true xfs :/ > > best regards, > Jürgen signature.asc Description: OpenPGP digital signature
Re: errors reported by btrfs-check
Am 12.10.2018 01:56, schrieb Qu Wenruo: On 2018/10/12 上午4:30, Jürgen Herrmann wrote: Hi! I just did a btrfs check on my laptop's btrfs filesystem while i was on the usb stick rescue system. the following errors where reported: root@mint:/home/mint# btrfs check /dev/mapper/sda3crypt Checking filesystem on /dev/mapper/sda3crypt UUID: a914c141-72bf-448b-847f-d64ee82d8b7b checking extents checking free space cache checking fs roots root 258 inode 3082368 errors 100, file extent discount Found file extent holes: start: 0, len: 8192 root 258 inode 3082370 errors 100, file extent discount Found file extent holes: start: 0, len: 8192 root 258 inode 3082371 errors 100, file extent discount Found file extent holes: start: 0, len: 8192 root 258 inode 3082373 errors 100, file extent discount Found file extent holes: start: 0, len: 8192 root 258 inode 3082414 errors 100, file extent discount Found file extent holes: start: 0, len: 8192 root 258 inode 3082415 errors 100, file extent discount Found file extent holes: start: 0, len: 8192 root 258 inode 3082421 errors 100, file extent discount Found file extent holes: start: 0, len: 4096 root 1387 inode 3082368 errors 100, file extent discount Found file extent holes: start: 0, len: 8192 root 1387 inode 3082370 errors 100, file extent discount Found file extent holes: start: 0, len: 8192 root 1387 inode 3082371 errors 100, file extent discount Found file extent holes: start: 0, len: 8192 root 1387 inode 3082372 errors 100, file extent discount Found file extent holes: start: 8192, len: 4096 start: 16384, len: 4096 start: 24576, len: 4096 start: 32768, len: 4096 start: 40960, len: 4096 start: 49152, len: 20480 start: 73728, len: 4096 start: 81920, len: 4096 start: 90112, len: 8192 root 1387 inode 3082373 errors 100, file extent discount Found file extent holes: start: 0, len: 8192 root 1387 inode 3082374 errors 100, file extent discount Found file extent holes: start: 8192, len: 4096 start: 16384, len: 20480 start: 40960, len: 12288 start: 57344, len: 4096 start: 65536, len: 8192 root 1387 inode 3082380 errors 100, file extent discount Found file extent holes: start: 0, len: 233472 root 1387 inode 3082386 errors 100, file extent discount Found file extent holes: start: 0, len: 4096 root 1387 inode 3082398 errors 100, file extent discount Found file extent holes: start: 20480, len: 16384 root 1387 inode 3082414 errors 100, file extent discount Found file extent holes: start: 0, len: 8192 root 1387 inode 3082415 errors 100, file extent discount Found file extent holes: start: 0, len: 8192 root 1387 inode 3082421 errors 100, file extent discount Found file extent holes: start: 0, len: 4096 root 1391 inode 3082368 errors 100, file extent discount Found file extent holes: start: 0, len: 8192 root 1391 inode 3082370 errors 100, file extent discount Found file extent holes: start: 0, len: 8192 root 1391 inode 3082371 errors 100, file extent discount Found file extent holes: start: 0, len: 8192 root 1391 inode 3082373 errors 100, file extent discount Found file extent holes: start: 0, len: 8192 root 1391 inode 3082386 errors 100, file extent discount Found file extent holes: start: 0, len: 4096 root 1391 inode 3082414 errors 100, file extent discount Found file extent holes: start: 0, len: 8192 root 1391 inode 3082415 errors 100, file extent discount Found file extent holes: start: 0, len: 8192 root 1391 inode 3082421 errors 100, file extent discount Found file extent holes: start: 0, len: 4096 root 1394 inode 3082368 errors 100, file extent discount Found file extent holes: start: 0, len: 8192 root 1394 inode 3082370 errors 100, file extent discount Found file extent holes: start: 0, len: 8192 root 1394 inode 3082371 errors 100, file extent discount Found file extent holes: start: 0, len: 8192 root 1394 inode 3082373 errors 100, file extent discount Found file extent holes: start: 0, len: 8192 root 1394 inode 3082386 errors 100, file extent discount Found file extent holes: start: 0, len: 4096 root 1394 inode 3082414 errors 100, file extent discount Found file extent holes: start: 0, len: 8192 root 1394 inode 3082415 errors 100, file extent discount Found file extent holes: start: 0, len: 8192 root 1394 inode 3082421 errors 100, file extent discount Found file extent holes: start: 0, len: 4096 ERROR: errors found in fs roots found 469458231296 bytes used, error(s) found total csum bytes: 451180560 total tree bytes: 4558831616 total fs tree bytes: 3802955776 total extent tree bytes: 245055488 btree space waste bytes: 842802897 file data blocks allocated: 9656815640576 referenced 929225080832 Scrub completes ok though. I'm prepared to wipe the fs if needed, more than one backup is ready :) No need. File extent discount is not a big problem. It only means btrfs lacks some hole file
Re: [PATCH 2/2] btrfs-progs: Deprecate unused super block member log_root_transid
On 2018/10/12 下午5:13, Nikolay Borisov wrote: > > > On 12.10.2018 11:46, Qu Wenruo wrote: >> >> >> On 2018/10/12 下午2:53, Nikolay Borisov wrote: >>> >>> >>> On 12.10.2018 09:42, Qu Wenruo wrote: The only user of it is "btrfs inspect dump-super". Signed-off-by: Qu Wenruo --- cmds-inspect-dump-super.c | 4 ++-- ctree.h | 6 ++ 2 files changed, 4 insertions(+), 6 deletions(-) diff --git a/cmds-inspect-dump-super.c b/cmds-inspect-dump-super.c index e965267c5d96..3f33931ed9bc 100644 --- a/cmds-inspect-dump-super.c +++ b/cmds-inspect-dump-super.c @@ -387,8 +387,8 @@ static void dump_superblock(struct btrfs_super_block *sb, int full) (unsigned long long)btrfs_super_chunk_root_level(sb)); printf("log_root\t\t%llu\n", (unsigned long long)btrfs_super_log_root(sb)); - printf("log_root_transid\t%llu\n", - (unsigned long long)btrfs_super_log_root_transid(sb)); + printf("log_root_transid (deprecated)\t%llu\n", + le64_to_cpu(sb->__unused_log_root_transid)); >>> >>> This should be entirely removed. >> >> It looks OK to me. >> Just like the old leafsize. >> >> And if we try to use this member again, even old progs could show it >> without problem. > > But what is the point of having something which always shows 0 and is > essentially unused? Personally speaking, even it's in fact never used, it is still part of the specification of btrfs super blocks. Even it's a bad decision we made a long time ago, we still need to mark it as unused, other than converting it back to "reserved". To make it short and clear, if some member is specified in early specification, even it's never used, we still need to mark it unused and keep part of naming. Thanks, Qu > >> >> Thanks, >> Qu >> >>> printf("log_root_level\t\t%llu\n", (unsigned long long)btrfs_super_log_root_level(sb)); printf("total_bytes\t\t%llu\n", diff --git a/ctree.h b/ctree.h index 4719962df67d..a314fdb102b0 100644 --- a/ctree.h +++ b/ctree.h @@ -427,8 +427,8 @@ struct btrfs_super_block { __le64 chunk_root; __le64 log_root; - /* this will help find the new super based on the log root */ - __le64 log_root_transid; + /* This member is never touched, should always be 0 */ + __le64 __unused_log_root_transid; __le64 total_bytes; __le64 bytes_used; __le64 root_dir_objectid; @@ -2203,8 +2203,6 @@ BTRFS_SETGET_STACK_FUNCS(super_chunk_root_level, struct btrfs_super_block, chunk_root_level, 8); BTRFS_SETGET_STACK_FUNCS(super_log_root, struct btrfs_super_block, log_root, 64); -BTRFS_SETGET_STACK_FUNCS(super_log_root_transid, struct btrfs_super_block, - log_root_transid, 64); BTRFS_SETGET_STACK_FUNCS(super_log_root_level, struct btrfs_super_block, log_root_level, 8); BTRFS_SETGET_STACK_FUNCS(super_total_bytes, struct btrfs_super_block, >>
Re: [PATCH 2/2] btrfs-progs: Deprecate unused super block member log_root_transid
On 12.10.2018 11:46, Qu Wenruo wrote: > > > On 2018/10/12 下午2:53, Nikolay Borisov wrote: >> >> >> On 12.10.2018 09:42, Qu Wenruo wrote: >>> The only user of it is "btrfs inspect dump-super". >>> >>> Signed-off-by: Qu Wenruo >>> --- >>> cmds-inspect-dump-super.c | 4 ++-- >>> ctree.h | 6 ++ >>> 2 files changed, 4 insertions(+), 6 deletions(-) >>> >>> diff --git a/cmds-inspect-dump-super.c b/cmds-inspect-dump-super.c >>> index e965267c5d96..3f33931ed9bc 100644 >>> --- a/cmds-inspect-dump-super.c >>> +++ b/cmds-inspect-dump-super.c >>> @@ -387,8 +387,8 @@ static void dump_superblock(struct btrfs_super_block >>> *sb, int full) >>>(unsigned long long)btrfs_super_chunk_root_level(sb)); >>> printf("log_root\t\t%llu\n", >>>(unsigned long long)btrfs_super_log_root(sb)); >>> - printf("log_root_transid\t%llu\n", >>> - (unsigned long long)btrfs_super_log_root_transid(sb)); >>> + printf("log_root_transid (deprecated)\t%llu\n", >>> + le64_to_cpu(sb->__unused_log_root_transid)); >> >> This should be entirely removed. > > It looks OK to me. > Just like the old leafsize. > > And if we try to use this member again, even old progs could show it > without problem. But what is the point of having something which always shows 0 and is essentially unused? > > Thanks, > Qu > >> >>> printf("log_root_level\t\t%llu\n", >>>(unsigned long long)btrfs_super_log_root_level(sb)); >>> printf("total_bytes\t\t%llu\n", >>> diff --git a/ctree.h b/ctree.h >>> index 4719962df67d..a314fdb102b0 100644 >>> --- a/ctree.h >>> +++ b/ctree.h >>> @@ -427,8 +427,8 @@ struct btrfs_super_block { >>> __le64 chunk_root; >>> __le64 log_root; >>> >>> - /* this will help find the new super based on the log root */ >>> - __le64 log_root_transid; >>> + /* This member is never touched, should always be 0 */ >>> + __le64 __unused_log_root_transid; >>> __le64 total_bytes; >>> __le64 bytes_used; >>> __le64 root_dir_objectid; >>> @@ -2203,8 +2203,6 @@ BTRFS_SETGET_STACK_FUNCS(super_chunk_root_level, >>> struct btrfs_super_block, >>> chunk_root_level, 8); >>> BTRFS_SETGET_STACK_FUNCS(super_log_root, struct btrfs_super_block, >>> log_root, 64); >>> -BTRFS_SETGET_STACK_FUNCS(super_log_root_transid, struct btrfs_super_block, >>> -log_root_transid, 64); >>> BTRFS_SETGET_STACK_FUNCS(super_log_root_level, struct btrfs_super_block, >>> log_root_level, 8); >>> BTRFS_SETGET_STACK_FUNCS(super_total_bytes, struct btrfs_super_block, >>> >
[PATCH] Btrfs: fix deadlock when writing out free space caches
From: Filipe Manana When writing out a block group free space cache we can end deadlocking with ourseves on an extent buffer lock resulting in a warning like the following: [245043.379979] WARNING: CPU: 4 PID: 2608 at fs/btrfs/locking.c:251 btrfs_tree_lock+0x1be/0x1d0 [btrfs] [245043.392792] CPU: 4 PID: 2608 Comm: btrfs-transacti Tainted: G W I 4.16.8 #1 [245043.395489] RIP: 0010:btrfs_tree_lock+0x1be/0x1d0 [btrfs] [245043.396791] RSP: 0018:c9000424b840 EFLAGS: 00010246 [245043.398093] RAX: 0a30 RBX: 8807e20a3d20 RCX: 0001 [245043.399414] RDX: 0001 RSI: 0002 RDI: 8807e20a3d20 [245043.400732] RBP: 0001 R08: 88041f39a700 R09: 8800 [245043.402021] R10: 0040 R11: 8807e20a3d20 R12: 8807cb220630 [245043.403296] R13: 0001 R14: 8807cb220628 R15: 88041fbdf000 [245043.404780] FS: () GS:88082fc8() knlGS: [245043.406050] CS: 0010 DS: ES: CR0: 80050033 [245043.407321] CR2: 7fffdbdb9f10 CR3: 01c09005 CR4: 000206e0 [245043.408670] Call Trace: [245043.409977] btrfs_search_slot+0x761/0xa60 [btrfs] [245043.411278] btrfs_insert_empty_items+0x62/0xb0 [btrfs] [245043.412572] btrfs_insert_item+0x5b/0xc0 [btrfs] [245043.413922] btrfs_create_pending_block_groups+0xfb/0x1e0 [btrfs] [245043.415216] do_chunk_alloc+0x1e5/0x2a0 [btrfs] [245043.416487] find_free_extent+0xcd0/0xf60 [btrfs] [245043.417813] btrfs_reserve_extent+0x96/0x1e0 [btrfs] [245043.419105] btrfs_alloc_tree_block+0xfb/0x4a0 [btrfs] [245043.420378] __btrfs_cow_block+0x127/0x550 [btrfs] [245043.421652] btrfs_cow_block+0xee/0x190 [btrfs] [245043.422979] btrfs_search_slot+0x227/0xa60 [btrfs] [245043.424279] ? btrfs_update_inode_item+0x59/0x100 [btrfs] [245043.425538] ? iput+0x72/0x1e0 [245043.426798] write_one_cache_group.isra.49+0x20/0x90 [btrfs] [245043.428131] btrfs_start_dirty_block_groups+0x102/0x420 [btrfs] [245043.429419] btrfs_commit_transaction+0x11b/0x880 [btrfs] [245043.430712] ? start_transaction+0x8e/0x410 [btrfs] [245043.432006] transaction_kthread+0x184/0x1a0 [btrfs] [245043.433341] kthread+0xf0/0x130 [245043.434628] ? btrfs_cleanup_transaction+0x4e0/0x4e0 [btrfs] [245043.435928] ? kthread_create_worker_on_cpu+0x40/0x40 [245043.437236] ret_from_fork+0x1f/0x30 [245043.441054] ---[ end trace 15abaa2aaf36827f ]--- This is because at write_one_cache_group() when we are COWing a leaf from the extent tree we end up allocating a new block group (chunk) and, because we have hit a threshold on the number of bytes reserved for system chunks, we attempt to finalize the creation of new block groups from the current transaction, by calling btrfs_create_pending_block_groups(). However here we also need to modify the extent tree in order to insert a block group item, and if the location for this new block group item happens to be in the same leaf that we were COWing earlier, we deadlock since btrfs_search_slot() tries to write lock the extent buffer that we locked before at write_one_cache_group(). We have already hit similar cases in the past and commit d9a0540a79f8 ("Btrfs: fix deadlock when finalizing block group creation") fixed some of those cases by delaying the creation of pending block groups at the known specific spots that could lead to a deadlock. This change reworks that commit to be more generic so that we don't have to add similar logic to every possible path that can lead to a deadlock. This is done by making __btrfs_cow_block() disallowing the creation of new block groups (setting the transaction's can_flush_pending_bgs to false) before it attempts to allocate a new extent buffer for either the extent, chunk or device trees, since those are the trees that pending block creation modifies. Once the new extent buffer is allocated, it allows creation of pending block groups to happen again. This change depends on a recent patch from Josef which is not yet in Linus' tree, named "btrfs: make sure we create all new block groups" in order to avoid occasional warnings at btrfs_trans_release_chunk_metadata(). Fixes: d9a0540a79f8 ("Btrfs: fix deadlock when finalizing block group creation") Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199753 Link: https://lore.kernel.org/linux-btrfs/CAJtFHUTHna09ST-_EEiyWmDH6gAqS6wa=zmnmbsifj8abu9...@mail.gmail.com/ Reported-by: E V Signed-off-by: Filipe Manana --- fs/btrfs/ctree.c | 17 + fs/btrfs/extent-tree.c | 16 ++-- 2 files changed, 23 insertions(+), 10 deletions(-) diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c index d436fb4c002e..089b46c4d97f 100644 --- a/fs/btrfs/ctree.c +++ b/fs/btrfs/ctree.c @@ -1050,9 +1050,26 @@ static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans, if ((root->root_key.objectid ==
Re: [PATCH] btrfs-progs: add cli to forget one or all scanned devices
On 12.10.2018 07:06, Anand Jain wrote: > This patch adds cli > btrfs device forget [dev] > to remove the given device structure in the kernel if the device > is unmounted. If no argument is given it shall remove all stale > (device which are not mounted) from the kernel. > > Signed-off-by: Anand Jain > --- > cmds-device.c | 63 > ++- > ioctl.h | 2 ++ > 2 files changed, 56 insertions(+), 9 deletions(-) > > diff --git a/cmds-device.c b/cmds-device.c > index 2a05f70a76a9..ecc391ea01d8 100644 > --- a/cmds-device.c > +++ b/cmds-device.c > @@ -254,10 +254,32 @@ static int cmd_device_delete(int argc, char **argv) > return _cmd_device_remove(argc, argv, cmd_device_delete_usage); > } > > +static int btrfs_forget_devices(char *path) > +{ > + struct btrfs_ioctl_vol_args args; > + int ret; > + int fd; > + > + fd = open("/dev/btrfs-control", O_RDWR); > + if (fd < 0) > + return -errno; > + > + memset(, 0, sizeof(args)); > + if (path) > + strncpy_null(args.name, path); > + ret = ioctl(fd, BTRFS_IOC_FORGET_DEV, ); > + if (ret) > + ret = -errno; > + close(fd); > + return ret; > +} > + > static const char * const cmd_device_scan_usage[] = { > - "btrfs device scan [(-d|--all-devices)| [...]]", > - "Scan devices for a btrfs filesystem", > + "btrfs device scan [(-d|--all-devices)|-u|--forget| "\ > + "[...]]", > + "Scan or forget (deregister) devices for a btrfs filesystem", > " -d|--all-devices (deprecated)", > + " -u|--forget [ ..]", > NULL > }; > > @@ -267,32 +289,40 @@ static int cmd_device_scan(int argc, char **argv) > int devstart; > int all = 0; > int ret = 0; > + int forget = 0; > > optind = 0; > while (1) { > int c; > static const struct option long_options[] = { > { "all-devices", no_argument, NULL, 'd'}, > + { "forget", no_argument, NULL, 'u'}, > { NULL, 0, NULL, 0} > }; > > - c = getopt_long(argc, argv, "d", long_options, NULL); > + c = getopt_long(argc, argv, "du", long_options, NULL); > if (c < 0) > break; > switch (c) { > case 'd': > all = 1; > break; > + case 'u': > + forget = 1; > + break; > default: > usage(cmd_device_scan_usage); > } > } > devstart = optind; > > + if (all && forget) > + usage(cmd_device_scan_usage); > + > if (all && check_argc_max(argc - optind, 1)) > usage(cmd_device_scan_usage); > > - if (all || argc - optind == 0) { > + if (!forget && (all || argc - optind == 0)) { > printf("Scanning for Btrfs filesystems\n"); > ret = btrfs_scan_devices(); > error_on(ret, "error %d while scanning", ret); > @@ -301,6 +331,13 @@ static int cmd_device_scan(int argc, char **argv) > goto out; > } > > + if (forget && (all || argc - optind == 0)) { You cannot ever have (forget && all) since you specifically call usage() above, just remove the 'all'. Furthermore, I think the code will be more legible if it's part of an else if rather than a plain if construct. In fact the whole function will be more readable. I.e you will have the error conditions at the to, followed by "all device scan" case, followed by "all device forget" and in the final else branch you will have the specific device handling. The increased indentation level won't be a problem since the longest lines are string prints and we are not wrapping those. > + ret = btrfs_forget_devices(NULL); > + if (ret) > + error("Can't forget: %s", strerror(-ret)); > + goto out; > + } > + > for( i = devstart ; i < argc ; i++ ){ > char *path; > > @@ -315,11 +352,19 @@ static int cmd_device_scan(int argc, char **argv) > ret = 1; > goto out; > } > - printf("Scanning for Btrfs filesystems in '%s'\n", path); > - if (btrfs_register_one_device(path) != 0) { > - ret = 1; > - free(path); > - goto out; > + if (forget) { > + ret = btrfs_forget_devices(path); > + if (ret) > + error("Can't forget '%s': %s", > + path, strerror(-ret)); For consistency sake I think this error printout should be moved inside btrfs_forget_devices() similarly to what btrfs_register_one_device does on
Re: [PATCH 14/42] btrfs: reset max_extent_size properly
On Thu, Oct 11, 2018 at 8:57 PM Josef Bacik wrote: > > If we use up our block group before allocating a new one we'll easily > get a max_extent_size that's set really really low, which will result in > a lot of fragmentation. We need to make sure we're resetting the > max_extent_size when we add a new chunk or add new space. > > Signed-off-by: Josef Bacik Reviewed-by: Filipe Manana > --- > fs/btrfs/extent-tree.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index cd2280962c8c..f84537a1d7eb 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -4573,6 +4573,7 @@ static int do_chunk_alloc(struct btrfs_trans_handle > *trans, u64 flags, > goto out; > } else { > ret = 1; > + space_info->max_extent_size = 0; > } > > space_info->force_alloc = CHUNK_ALLOC_NO_FORCE; > @@ -6671,6 +6672,7 @@ static int btrfs_free_reserved_bytes(struct > btrfs_block_group_cache *cache, > space_info->bytes_readonly += num_bytes; > cache->reserved -= num_bytes; > space_info->bytes_reserved -= num_bytes; > + space_info->max_extent_size = 0; > > if (delalloc) > cache->delalloc_bytes -= num_bytes; > -- > 2.14.3 > -- Filipe David Manana, “Whether you think you can, or you think you can't — you're right.”
Re: [PATCH 20/42] btrfs: don't use ctl->free_space for max_extent_size
On Thu, Oct 11, 2018 at 8:57 PM Josef Bacik wrote: > > From: Josef Bacik > > max_extent_size is supposed to be the largest contiguous range for the > space info, and ctl->free_space is the total free space in the block > group. We need to keep track of these separately and _only_ use the > max_free_space if we don't have a max_extent_size, as that means our > original request was too large to search any of the block groups for and > therefore wouldn't have a max_extent_size set. > > Signed-off-by: Josef Bacik Reviewed-by: Filipe Manana > --- > fs/btrfs/extent-tree.c | 7 +-- > 1 file changed, 5 insertions(+), 2 deletions(-) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 6e7bc3197737..4f48d047a1ec 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -7496,6 +7496,7 @@ static noinline int find_free_extent(struct > btrfs_fs_info *fs_info, > struct btrfs_block_group_cache *block_group = NULL; > u64 search_start = 0; > u64 max_extent_size = 0; > + u64 max_free_space = 0; > u64 empty_cluster = 0; > struct btrfs_space_info *space_info; > int loop = 0; > @@ -7791,8 +7792,8 @@ static noinline int find_free_extent(struct > btrfs_fs_info *fs_info, > spin_lock(>tree_lock); > if (ctl->free_space < > num_bytes + empty_cluster + empty_size) { > - if (ctl->free_space > max_extent_size) > - max_extent_size = ctl->free_space; > + max_free_space = max(max_free_space, > +ctl->free_space); > spin_unlock(>tree_lock); > goto loop; > } > @@ -7959,6 +7960,8 @@ static noinline int find_free_extent(struct > btrfs_fs_info *fs_info, > } > out: > if (ret == -ENOSPC) { > + if (!max_extent_size) > + max_extent_size = max_free_space; > spin_lock(_info->lock); > space_info->max_extent_size = max_extent_size; > spin_unlock(_info->lock); > -- > 2.14.3 > -- Filipe David Manana, “Whether you think you can, or you think you can't — you're right.”
Re: [PATCH 19/42] btrfs: set max_extent_size properly
On Thu, Oct 11, 2018 at 8:57 PM Josef Bacik wrote: > > From: Josef Bacik > > We can't use entry->bytes if our entry is a bitmap entry, we need to use > entry->max_extent_size in that case. Fix up all the logic to make this > consistent. > > Signed-off-by: Josef Bacik Reviewed-by: Filipe Manana > --- > fs/btrfs/free-space-cache.c | 29 +++-- > 1 file changed, 19 insertions(+), 10 deletions(-) > > diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c > index e077ad3b4549..2e96ee7da3ec 100644 > --- a/fs/btrfs/free-space-cache.c > +++ b/fs/btrfs/free-space-cache.c > @@ -1770,6 +1770,18 @@ static int search_bitmap(struct btrfs_free_space_ctl > *ctl, > return -1; > } > > +static void set_max_extent_size(struct btrfs_free_space *entry, > + u64 *max_extent_size) > +{ > + if (entry->bitmap) { > + if (entry->max_extent_size > *max_extent_size) > + *max_extent_size = entry->max_extent_size; > + } else { > + if (entry->bytes > *max_extent_size) > + *max_extent_size = entry->bytes; > + } > +} > + > /* Cache the size of the max extent in bytes */ > static struct btrfs_free_space * > find_free_space(struct btrfs_free_space_ctl *ctl, u64 *offset, u64 *bytes, > @@ -1791,8 +1803,7 @@ find_free_space(struct btrfs_free_space_ctl *ctl, u64 > *offset, u64 *bytes, > for (node = >offset_index; node; node = rb_next(node)) { > entry = rb_entry(node, struct btrfs_free_space, offset_index); > if (entry->bytes < *bytes) { > - if (entry->bytes > *max_extent_size) > - *max_extent_size = entry->bytes; > + set_max_extent_size(entry, max_extent_size); > continue; > } > > @@ -1810,8 +1821,7 @@ find_free_space(struct btrfs_free_space_ctl *ctl, u64 > *offset, u64 *bytes, > } > > if (entry->bytes < *bytes + align_off) { > - if (entry->bytes > *max_extent_size) > - *max_extent_size = entry->bytes; > + set_max_extent_size(entry, max_extent_size); > continue; > } > > @@ -1823,8 +1833,8 @@ find_free_space(struct btrfs_free_space_ctl *ctl, u64 > *offset, u64 *bytes, > *offset = tmp; > *bytes = size; > return entry; > - } else if (size > *max_extent_size) { > - *max_extent_size = size; > + } else { > + set_max_extent_size(entry, max_extent_size); > } > continue; > } > @@ -2684,8 +2694,7 @@ static u64 btrfs_alloc_from_bitmap(struct > btrfs_block_group_cache *block_group, > > err = search_bitmap(ctl, entry, _start, _bytes, true); > if (err) { > - if (search_bytes > *max_extent_size) > - *max_extent_size = search_bytes; > + set_max_extent_size(entry, max_extent_size); > return 0; > } > > @@ -2722,8 +2731,8 @@ u64 btrfs_alloc_from_cluster(struct > btrfs_block_group_cache *block_group, > > entry = rb_entry(node, struct btrfs_free_space, offset_index); > while (1) { > - if (entry->bytes < bytes && entry->bytes > *max_extent_size) > - *max_extent_size = entry->bytes; > + if (entry->bytes < bytes) > + set_max_extent_size(entry, max_extent_size); > > if (entry->bytes < bytes || > (!entry->bitmap && entry->offset < min_start)) { > -- > 2.14.3 > -- Filipe David Manana, “Whether you think you can, or you think you can't — you're right.”
Re: [PATCH] btrfs: introduce feature to forget a btrfs device
On 12.10.2018 07:06, Anand Jain wrote: > Support for a new command 'btrfs dev forget [dev]' is proposed here > to undo the effects of 'btrfs dev scan [dev]'. For this purpose > this patch proposes to use ioctl #5 as it was empty. > IOW(BTRFS_IOCTL_MAGIC, 5, ..) > This patch adds new ioctl BTRFS_IOC_FORGET_DEV which can be sent from > the /dev/btrfs-control to forget one or all devices, (devices which are > not mounted) from the btrfs kernel. > > The argument it takes is struct btrfs_ioctl_vol_args, and ::name can be > set to specify the device path. And all unmounted devices can be removed > from the kernel if no device path is provided. > > Again, the devices are removed only if the relevant fsid aren't mounted. > > This new cli can provide.. > . Release of unwanted btrfs_fs_devices and btrfs_devices memory if the >device is not going to be mounted. > . Ability to mount the device in degraded mode when one of the other >device is corrupted like in split brain raid1. > . Running test cases which requires btrfs.ko-reload if the rootfs >is btrfs. > > Signed-off-by: Anand Jain > --- > fs/btrfs/super.c | 3 +++ > fs/btrfs/volumes.c | 9 + > fs/btrfs/volumes.h | 1 + > include/uapi/linux/btrfs.h | 2 ++ > 4 files changed, 15 insertions(+) > > diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c > index 6601c9aa5e35..f14610ef1c66 100644 > --- a/fs/btrfs/super.c > +++ b/fs/btrfs/super.c > @@ -2244,6 +2244,9 @@ static long btrfs_control_ioctl(struct file *file, > unsigned int cmd, > ret = PTR_ERR_OR_ZERO(device); > mutex_unlock(_mutex); > break; > + case BTRFS_IOC_FORGET_DEV: > + ret = btrfs_forget_devices(vol->name); > + break; > case BTRFS_IOC_DEVICES_READY: > mutex_lock(_mutex); > device = btrfs_scan_one_device(vol->name, FMODE_READ, > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index f4405e430da6..e2ab203d9c1d 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -1208,6 +1208,15 @@ static int btrfs_read_disk_super(struct block_device > *bdev, u64 bytenr, > return 0; > } > > +int btrfs_forget_devices(const char *path) > +{ > + mutex_lock(_mutex); > + btrfs_free_stale_devices(strlen(path) ? path:NULL, NULL); Checking the len of path is redundant since userspace will pass a zeroed struct which will be NULL in any case. Additionally in btrfs_free_stalce_devices there is existing check whether path is set or not. > + mutex_unlock(_mutex); > + > + return 0; > +} > + > /* > * Look for a btrfs signature on a device. This may be called out of the > mount path > * and we are not allowed to call set_blocksize during the scan. The > superblock > diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h > index 23e9285d88de..394079d5d08d 100644 > --- a/fs/btrfs/volumes.h > +++ b/fs/btrfs/volumes.h > @@ -406,6 +406,7 @@ int btrfs_open_devices(struct btrfs_fs_devices > *fs_devices, > fmode_t flags, void *holder); > struct btrfs_device *btrfs_scan_one_device(const char *path, > fmode_t flags, void *holder); > +int btrfs_forget_devices(const char *path); > int btrfs_close_devices(struct btrfs_fs_devices *fs_devices); > void btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices, int step); > void btrfs_assign_next_active_device(struct btrfs_device *device, > diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h > index 5ca1d21fc4a7..b1be7f828cb4 100644 > --- a/include/uapi/linux/btrfs.h > +++ b/include/uapi/linux/btrfs.h > @@ -836,6 +836,8 @@ enum btrfs_err_code { > struct btrfs_ioctl_vol_args) > #define BTRFS_IOC_SCAN_DEV _IOW(BTRFS_IOCTL_MAGIC, 4, \ > struct btrfs_ioctl_vol_args) > +#define BTRFS_IOC_FORGET_DEV _IOW(BTRFS_IOCTL_MAGIC, 5, \ > +struct btrfs_ioctl_vol_args) > /* trans start and trans end are dangerous, and only for > * use by applications that know how to avoid the > * resulting deadlocks >
Re: [PATCH 2/2] btrfs-progs: Deprecate unused super block member log_root_transid
On 2018/10/12 下午2:53, Nikolay Borisov wrote: > > > On 12.10.2018 09:42, Qu Wenruo wrote: >> The only user of it is "btrfs inspect dump-super". >> >> Signed-off-by: Qu Wenruo >> --- >> cmds-inspect-dump-super.c | 4 ++-- >> ctree.h | 6 ++ >> 2 files changed, 4 insertions(+), 6 deletions(-) >> >> diff --git a/cmds-inspect-dump-super.c b/cmds-inspect-dump-super.c >> index e965267c5d96..3f33931ed9bc 100644 >> --- a/cmds-inspect-dump-super.c >> +++ b/cmds-inspect-dump-super.c >> @@ -387,8 +387,8 @@ static void dump_superblock(struct btrfs_super_block >> *sb, int full) >> (unsigned long long)btrfs_super_chunk_root_level(sb)); >> printf("log_root\t\t%llu\n", >> (unsigned long long)btrfs_super_log_root(sb)); >> -printf("log_root_transid\t%llu\n", >> - (unsigned long long)btrfs_super_log_root_transid(sb)); >> +printf("log_root_transid (deprecated)\t%llu\n", >> + le64_to_cpu(sb->__unused_log_root_transid)); > > This should be entirely removed. It looks OK to me. Just like the old leafsize. And if we try to use this member again, even old progs could show it without problem. Thanks, Qu > >> printf("log_root_level\t\t%llu\n", >> (unsigned long long)btrfs_super_log_root_level(sb)); >> printf("total_bytes\t\t%llu\n", >> diff --git a/ctree.h b/ctree.h >> index 4719962df67d..a314fdb102b0 100644 >> --- a/ctree.h >> +++ b/ctree.h >> @@ -427,8 +427,8 @@ struct btrfs_super_block { >> __le64 chunk_root; >> __le64 log_root; >> >> -/* this will help find the new super based on the log root */ >> -__le64 log_root_transid; >> +/* This member is never touched, should always be 0 */ >> +__le64 __unused_log_root_transid; >> __le64 total_bytes; >> __le64 bytes_used; >> __le64 root_dir_objectid; >> @@ -2203,8 +2203,6 @@ BTRFS_SETGET_STACK_FUNCS(super_chunk_root_level, >> struct btrfs_super_block, >> chunk_root_level, 8); >> BTRFS_SETGET_STACK_FUNCS(super_log_root, struct btrfs_super_block, >> log_root, 64); >> -BTRFS_SETGET_STACK_FUNCS(super_log_root_transid, struct btrfs_super_block, >> - log_root_transid, 64); >> BTRFS_SETGET_STACK_FUNCS(super_log_root_level, struct btrfs_super_block, >> log_root_level, 8); >> BTRFS_SETGET_STACK_FUNCS(super_total_bytes, struct btrfs_super_block, >>
Re: [PATCH v2] btrfs: Deprecate unused super block member, log_root_transid
On 2018/10/12 下午2:52, Nikolay Borisov wrote: > > > On 12.10.2018 09:37, Qu Wenruo wrote: >> The member log_root_transid is never used. >> It's always kept untouched even when updating log tree root. >> >> And populating it without introducing new incompat flags could easily >> cause back-compatibility problems. >> So just mark it unused. >> >> Signed-off-by: Qu Wenruo >> --- >> changelog: >> v2: >> Remove the redundant comment. >> --- >> fs/btrfs/ctree.h | 5 + >> 1 file changed, 1 insertion(+), 4 deletions(-) >> >> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h >> index 53af9f5253f4..7adf5f4dcda4 100644 >> --- a/fs/btrfs/ctree.h >> +++ b/fs/btrfs/ctree.h >> @@ -214,8 +214,7 @@ struct btrfs_super_block { >> __le64 chunk_root; >> __le64 log_root; >> >> -/* this will help find the new super based on the log root */ >> -__le64 log_root_transid; >> +__le64 __unused_log_root_transid; > > why do you keep insisting on having "log_root_transid" in the name of > this member? It was never used on-disk so for all intents and purposes > log_root_transid doesn't mean anything? No need to resend, I'm sure > David will change it on commit, nut I'm just curious. We have a similar case, leaf_size. Although it differs from this case and it's still used, I prefer to have something showing that we at least planned to use this member. Just in case one day we tries to use it again. Thanks, Qu > > >> __le64 total_bytes; >> __le64 bytes_used; >> __le64 root_dir_objectid; >> @@ -2317,8 +2316,6 @@ BTRFS_SETGET_STACK_FUNCS(super_chunk_root_level, >> struct btrfs_super_block, >> chunk_root_level, 8); >> BTRFS_SETGET_STACK_FUNCS(super_log_root, struct btrfs_super_block, >> log_root, 64); >> -BTRFS_SETGET_STACK_FUNCS(super_log_root_transid, struct btrfs_super_block, >> - log_root_transid, 64); >> BTRFS_SETGET_STACK_FUNCS(super_log_root_level, struct btrfs_super_block, >> log_root_level, 8); >> BTRFS_SETGET_STACK_FUNCS(super_total_bytes, struct btrfs_super_block, >>
[PATCH] btrfs-progs: super-recover: fix double free fs_devices memory
From: Robbie Ko super-recover collects btrfs devices information using existed functions scan_one_devices(). Problem is fs_devices is freed twice. One in __open_ctree_fd() when error happens and the other in btrfs_close_devices(recover.fs_devices) when root is NULL. Commit "30fd6f2e92695c355c8f76b8887cd4fade60cdac" add force-close all opened device before program exit, to avoid memory leak in all btrfs sub-command. Therefore, there is an unnecessary freed of fs_devices in btrfs_recover_superblocks. Fix this problem by remove unnecessary freed of fs_devices. Signed-off-by: Robbie Ko --- super-recover.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/super-recover.c b/super-recover.c index 880fd77..86b3df9 100644 --- a/super-recover.c +++ b/super-recover.c @@ -292,9 +292,6 @@ int btrfs_recover_superblocks(const char *dname, no_recover: recover_err_str(ret); free_recover_superblock(); - /* check if we have freed fs_devices in close_ctree() */ - if (!root) - btrfs_close_devices(recover.fs_devices); return ret; } -- 1.9.1
Re: [PATCH v3 1/4] btrfs: Introduce find_free_extent_ctl structure for later rework
On 10/12/18 2:18 PM, Qu Wenruo wrote: Instead of tons of different local variables in find_free_extent(), extract them into find_free_extent_ctl structure, and add better explanation for them. Some modification may looks redundant, but will later greatly simplify function parameter list during find_free_extent() refactor. Signed-off-by: Qu Wenruo Reviewed-by: Su Yue --- fs/btrfs/extent-tree.c | 253 ++--- 1 file changed, 161 insertions(+), 92 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index de6f75f5547b..dc10f6fd26af 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -7212,6 +7212,55 @@ btrfs_release_block_group(struct btrfs_block_group_cache *cache, btrfs_put_block_group(cache); } +/* + * Internal used structure for find_free_extent() function. + * Wraps needed parameters. + */ +struct find_free_extent_ctl { + /* Basic allocation info */ + u64 ram_bytes; + u64 num_bytes; + u64 empty_size; + u64 flags; + int delalloc; + + /* Where to start the search inside the bg */ + u64 search_start; + + /* For clustered allocation */ + u64 empty_cluster; + + bool have_caching_bg; + bool orig_have_caching_bg; + + /* RAID index, converted from flags */ + int index; + + /* Current loop number */ + int loop; + + /* +* Whether we're refilling a cluster, if true we need to re-search +* current block group but don't try to refill the cluster again. +*/ + bool retry_clustered; + + /* +* Whether we're updating free space cache, if true we need to re-search +* current block group but don't try updating free space cache again. +*/ + bool retry_unclustered; + + /* If current block group is cached */ + int cached; + + /* Max extent size found */ + u64 max_extent_size; + + /* Found result */ + u64 found_offset; +}; + /* * walks the btree of allocated extents and find a hole of a given size. * The key ins is changed to record the hole: @@ -7232,20 +7281,26 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info, struct btrfs_root *root = fs_info->extent_root; struct btrfs_free_cluster *last_ptr = NULL; struct btrfs_block_group_cache *block_group = NULL; - u64 search_start = 0; - u64 max_extent_size = 0; - u64 empty_cluster = 0; + struct find_free_extent_ctl ffe_ctl = {0}; struct btrfs_space_info *space_info; - int loop = 0; - int index = btrfs_bg_flags_to_raid_index(flags); - bool failed_cluster_refill = false; - bool failed_alloc = false; bool use_cluster = true; - bool have_caching_bg = false; - bool orig_have_caching_bg = false; bool full_search = false; WARN_ON(num_bytes < fs_info->sectorsize); + + ffe_ctl.ram_bytes = ram_bytes; + ffe_ctl.num_bytes = num_bytes; + ffe_ctl.empty_size = empty_size; + ffe_ctl.flags = flags; + ffe_ctl.search_start = 0; + ffe_ctl.retry_clustered = false; + ffe_ctl.retry_unclustered = false; + ffe_ctl.delalloc = delalloc; + ffe_ctl.index = btrfs_bg_flags_to_raid_index(flags); + ffe_ctl.have_caching_bg = false; + ffe_ctl.orig_have_caching_bg = false; + ffe_ctl.found_offset = 0; + ins->type = BTRFS_EXTENT_ITEM_KEY; ins->objectid = 0; ins->offset = 0; @@ -7281,7 +7336,8 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info, spin_unlock(_info->lock); } - last_ptr = fetch_cluster_info(fs_info, space_info, _cluster); + last_ptr = fetch_cluster_info(fs_info, space_info, + _ctl.empty_cluster); if (last_ptr) { spin_lock(_ptr->lock); if (last_ptr->block_group) @@ -7298,10 +7354,12 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info, spin_unlock(_ptr->lock); } - search_start = max(search_start, first_logical_byte(fs_info, 0)); - search_start = max(search_start, hint_byte); - if (search_start == hint_byte) { - block_group = btrfs_lookup_block_group(fs_info, search_start); + ffe_ctl.search_start = max(ffe_ctl.search_start, + first_logical_byte(fs_info, 0)); + ffe_ctl.search_start = max(ffe_ctl.search_start, hint_byte); + if (ffe_ctl.search_start == hint_byte) { + block_group = btrfs_lookup_block_group(fs_info, + ffe_ctl.search_start); /* * we don't want to use the block group if it doesn't match our * allocation bits, or if its not cached. @@ -7323,7 +7381,7 @@ static noinline int find_free_extent(struct
[PATCH] btrfs-progs: fix compile warning when using gcc8 to compile btrfs-progs
When using gcc8 to compile btrfs-progs, it complains as below: ctree.c: In function 'btrfs_search_slot_for_read': ctree.c:1249:45: warning: passing argument 3 of 'btrfs_search_slot' discards 'const' qualifier from pointer target type [-Wdiscarded-qualifiers] ret = btrfs_search_slot(NULL, root, key, p, 0, 0); Change btrfs_search_slot prototype with 'const' qualifier for argument 3. Also fix similar problems as above change. Signed-off-by: Su Yanjun --- ctree.c | 19 ++- ctree.h | 10 +- 2 files changed, 15 insertions(+), 14 deletions(-) diff --git a/ctree.c b/ctree.c index aa1568620205..c8dd73cf2ce2 100644 --- a/ctree.c +++ b/ctree.c @@ -27,7 +27,7 @@ static int split_node(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct btrfs_path *path, int level); static int split_leaf(struct btrfs_trans_handle *trans, struct btrfs_root - *root, struct btrfs_key *ins_key, + *root, const struct btrfs_key *ins_key, struct btrfs_path *path, int data_size, int extend); static int push_node_left(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct extent_buffer *dst, @@ -389,7 +389,7 @@ int btrfs_cow_block(struct btrfs_trans_handle *trans, return ret; } -int btrfs_comp_cpu_keys(struct btrfs_key *k1, struct btrfs_key *k2) +int btrfs_comp_cpu_keys(const struct btrfs_key *k1, const struct btrfs_key *k2) { if (k1->objectid > k2->objectid) return 1; @@ -409,7 +409,8 @@ int btrfs_comp_cpu_keys(struct btrfs_key *k1, struct btrfs_key *k2) /* * compare two keys in a memcmp fashion */ -static int btrfs_comp_keys(struct btrfs_disk_key *disk, struct btrfs_key *k2) +static int btrfs_comp_keys(struct btrfs_disk_key *disk, + const struct btrfs_key *k2) { struct btrfs_key k1; @@ -602,7 +603,7 @@ static int noinline check_block(struct btrfs_root *root, * slot may point to max if the key is bigger than all of the keys */ static int generic_bin_search(struct extent_buffer *eb, unsigned long p, - int item_size, struct btrfs_key *key, + int item_size, const struct btrfs_key *key, int max, int *slot) { int low = 0; @@ -636,7 +637,7 @@ static int generic_bin_search(struct extent_buffer *eb, unsigned long p, * simple bin_search frontend that does the right thing for * leaves vs nodes */ -static int bin_search(struct extent_buffer *eb, struct btrfs_key *key, +static int bin_search(struct extent_buffer *eb, const struct btrfs_key *key, int level, int *slot) { if (level == 0) @@ -1129,9 +1130,9 @@ out: * tree. if ins_len < 0, nodes will be merged as we walk down the tree (if * possible) */ -int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root - *root, struct btrfs_key *key, struct btrfs_path *p, int - ins_len, int cow) +int btrfs_search_slot(struct btrfs_trans_handle *trans, + struct btrfs_root *root, const struct btrfs_key *key, + struct btrfs_path *p, int ins_len, int cow) { struct extent_buffer *b; int slot; @@ -2150,7 +2151,7 @@ static noinline int copy_for_split(struct btrfs_trans_handle *trans, */ static noinline int split_leaf(struct btrfs_trans_handle *trans, struct btrfs_root *root, - struct btrfs_key *ins_key, + const struct btrfs_key *ins_key, struct btrfs_path *path, int data_size, int extend) { diff --git a/ctree.h b/ctree.h index 2a2437070ef9..cf0efae9c185 100644 --- a/ctree.h +++ b/ctree.h @@ -1973,7 +1973,7 @@ static inline void btrfs_disk_key_to_cpu(struct btrfs_key *cpu, } static inline void btrfs_cpu_key_to_disk(struct btrfs_disk_key *disk, -struct btrfs_key *cpu) +const struct btrfs_key *cpu) { disk->offset = cpu_to_le64(cpu->offset); disk->type = cpu->type; @@ -2552,7 +2552,7 @@ u64 add_new_free_space(struct btrfs_block_group_cache *block_group, u64 hash_extent_data_ref(u64 root_objectid, u64 owner, u64 offset); /* ctree.c */ -int btrfs_comp_cpu_keys(struct btrfs_key *k1, struct btrfs_key *k2); +int btrfs_comp_cpu_keys(const struct btrfs_key *k1, const struct btrfs_key *k2); int btrfs_del_ptr(struct btrfs_root *root, struct btrfs_path *path, int level, int slot); enum btrfs_tree_block_status @@ -2595,9 +2595,9 @@ int btrfs_split_item(struct btrfs_trans_handle *trans, struct btrfs_path *path, struct btrfs_key *new_key, unsigned long split_offset); -int btrfs_search_slot(struct
Re: [PATCH 2/2] btrfs-progs: Deprecate unused super block member log_root_transid
On 12.10.2018 09:42, Qu Wenruo wrote: > The only user of it is "btrfs inspect dump-super". > > Signed-off-by: Qu Wenruo > --- > cmds-inspect-dump-super.c | 4 ++-- > ctree.h | 6 ++ > 2 files changed, 4 insertions(+), 6 deletions(-) > > diff --git a/cmds-inspect-dump-super.c b/cmds-inspect-dump-super.c > index e965267c5d96..3f33931ed9bc 100644 > --- a/cmds-inspect-dump-super.c > +++ b/cmds-inspect-dump-super.c > @@ -387,8 +387,8 @@ static void dump_superblock(struct btrfs_super_block *sb, > int full) > (unsigned long long)btrfs_super_chunk_root_level(sb)); > printf("log_root\t\t%llu\n", > (unsigned long long)btrfs_super_log_root(sb)); > - printf("log_root_transid\t%llu\n", > -(unsigned long long)btrfs_super_log_root_transid(sb)); > + printf("log_root_transid (deprecated)\t%llu\n", > +le64_to_cpu(sb->__unused_log_root_transid)); This should be entirely removed. > printf("log_root_level\t\t%llu\n", > (unsigned long long)btrfs_super_log_root_level(sb)); > printf("total_bytes\t\t%llu\n", > diff --git a/ctree.h b/ctree.h > index 4719962df67d..a314fdb102b0 100644 > --- a/ctree.h > +++ b/ctree.h > @@ -427,8 +427,8 @@ struct btrfs_super_block { > __le64 chunk_root; > __le64 log_root; > > - /* this will help find the new super based on the log root */ > - __le64 log_root_transid; > + /* This member is never touched, should always be 0 */ > + __le64 __unused_log_root_transid; > __le64 total_bytes; > __le64 bytes_used; > __le64 root_dir_objectid; > @@ -2203,8 +2203,6 @@ BTRFS_SETGET_STACK_FUNCS(super_chunk_root_level, struct > btrfs_super_block, >chunk_root_level, 8); > BTRFS_SETGET_STACK_FUNCS(super_log_root, struct btrfs_super_block, >log_root, 64); > -BTRFS_SETGET_STACK_FUNCS(super_log_root_transid, struct btrfs_super_block, > - log_root_transid, 64); > BTRFS_SETGET_STACK_FUNCS(super_log_root_level, struct btrfs_super_block, >log_root_level, 8); > BTRFS_SETGET_STACK_FUNCS(super_total_bytes, struct btrfs_super_block, >
Re: [PATCH v2] btrfs: Deprecate unused super block member, log_root_transid
On 12.10.2018 09:37, Qu Wenruo wrote: > The member log_root_transid is never used. > It's always kept untouched even when updating log tree root. > > And populating it without introducing new incompat flags could easily > cause back-compatibility problems. > So just mark it unused. > > Signed-off-by: Qu Wenruo > --- > changelog: > v2: > Remove the redundant comment. > --- > fs/btrfs/ctree.h | 5 + > 1 file changed, 1 insertion(+), 4 deletions(-) > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > index 53af9f5253f4..7adf5f4dcda4 100644 > --- a/fs/btrfs/ctree.h > +++ b/fs/btrfs/ctree.h > @@ -214,8 +214,7 @@ struct btrfs_super_block { > __le64 chunk_root; > __le64 log_root; > > - /* this will help find the new super based on the log root */ > - __le64 log_root_transid; > + __le64 __unused_log_root_transid; why do you keep insisting on having "log_root_transid" in the name of this member? It was never used on-disk so for all intents and purposes log_root_transid doesn't mean anything? No need to resend, I'm sure David will change it on commit, nut I'm just curious. > __le64 total_bytes; > __le64 bytes_used; > __le64 root_dir_objectid; > @@ -2317,8 +2316,6 @@ BTRFS_SETGET_STACK_FUNCS(super_chunk_root_level, struct > btrfs_super_block, >chunk_root_level, 8); > BTRFS_SETGET_STACK_FUNCS(super_log_root, struct btrfs_super_block, >log_root, 64); > -BTRFS_SETGET_STACK_FUNCS(super_log_root_transid, struct btrfs_super_block, > - log_root_transid, 64); > BTRFS_SETGET_STACK_FUNCS(super_log_root_level, struct btrfs_super_block, >log_root_level, 8); > BTRFS_SETGET_STACK_FUNCS(super_total_bytes, struct btrfs_super_block, >
[PATCH 1/2] btrfs-progs: find-root: Fix wrong generation for log tree
When searching for log tree, we should use btrfs_super_generation + 1 other than log_root_transid. Since log_root_transid is never touched and will be deprecated soon. Signed-off-by: Qu Wenruo --- btrfs-find-root.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/btrfs-find-root.c b/btrfs-find-root.c index e2d2e70c408c..376bbc4c47fa 100644 --- a/btrfs-find-root.c +++ b/btrfs-find-root.c @@ -73,7 +73,7 @@ static void get_root_gen_and_level(u64 objectid, struct btrfs_fs_info *fs_info, break; case BTRFS_TREE_LOG_OBJECTID: level = btrfs_super_log_root_level(super); - gen = btrfs_super_log_root_transid(super); + gen = btrfs_super_generation(super) + 1; break; case BTRFS_UUID_TREE_OBJECTID: gen = btrfs_super_uuid_tree_generation(super); -- 2.19.1
[PATCH 2/2] btrfs-progs: Deprecate unused super block member log_root_transid
The only user of it is "btrfs inspect dump-super". Signed-off-by: Qu Wenruo --- cmds-inspect-dump-super.c | 4 ++-- ctree.h | 6 ++ 2 files changed, 4 insertions(+), 6 deletions(-) diff --git a/cmds-inspect-dump-super.c b/cmds-inspect-dump-super.c index e965267c5d96..3f33931ed9bc 100644 --- a/cmds-inspect-dump-super.c +++ b/cmds-inspect-dump-super.c @@ -387,8 +387,8 @@ static void dump_superblock(struct btrfs_super_block *sb, int full) (unsigned long long)btrfs_super_chunk_root_level(sb)); printf("log_root\t\t%llu\n", (unsigned long long)btrfs_super_log_root(sb)); - printf("log_root_transid\t%llu\n", - (unsigned long long)btrfs_super_log_root_transid(sb)); + printf("log_root_transid (deprecated)\t%llu\n", + le64_to_cpu(sb->__unused_log_root_transid)); printf("log_root_level\t\t%llu\n", (unsigned long long)btrfs_super_log_root_level(sb)); printf("total_bytes\t\t%llu\n", diff --git a/ctree.h b/ctree.h index 4719962df67d..a314fdb102b0 100644 --- a/ctree.h +++ b/ctree.h @@ -427,8 +427,8 @@ struct btrfs_super_block { __le64 chunk_root; __le64 log_root; - /* this will help find the new super based on the log root */ - __le64 log_root_transid; + /* This member is never touched, should always be 0 */ + __le64 __unused_log_root_transid; __le64 total_bytes; __le64 bytes_used; __le64 root_dir_objectid; @@ -2203,8 +2203,6 @@ BTRFS_SETGET_STACK_FUNCS(super_chunk_root_level, struct btrfs_super_block, chunk_root_level, 8); BTRFS_SETGET_STACK_FUNCS(super_log_root, struct btrfs_super_block, log_root, 64); -BTRFS_SETGET_STACK_FUNCS(super_log_root_transid, struct btrfs_super_block, -log_root_transid, 64); BTRFS_SETGET_STACK_FUNCS(super_log_root_level, struct btrfs_super_block, log_root_level, 8); BTRFS_SETGET_STACK_FUNCS(super_total_bytes, struct btrfs_super_block, -- 2.19.1
[PATCH v2] btrfs: Deprecate unused super block member, log_root_transid
The member log_root_transid is never used. It's always kept untouched even when updating log tree root. And populating it without introducing new incompat flags could easily cause back-compatibility problems. So just mark it unused. Signed-off-by: Qu Wenruo --- changelog: v2: Remove the redundant comment. --- fs/btrfs/ctree.h | 5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 53af9f5253f4..7adf5f4dcda4 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -214,8 +214,7 @@ struct btrfs_super_block { __le64 chunk_root; __le64 log_root; - /* this will help find the new super based on the log root */ - __le64 log_root_transid; + __le64 __unused_log_root_transid; __le64 total_bytes; __le64 bytes_used; __le64 root_dir_objectid; @@ -2317,8 +2316,6 @@ BTRFS_SETGET_STACK_FUNCS(super_chunk_root_level, struct btrfs_super_block, chunk_root_level, 8); BTRFS_SETGET_STACK_FUNCS(super_log_root, struct btrfs_super_block, log_root, 64); -BTRFS_SETGET_STACK_FUNCS(super_log_root_transid, struct btrfs_super_block, -log_root_transid, 64); BTRFS_SETGET_STACK_FUNCS(super_log_root_level, struct btrfs_super_block, log_root_level, 8); BTRFS_SETGET_STACK_FUNCS(super_total_bytes, struct btrfs_super_block, -- 2.19.1
Re: [PATCH] btrfs: Deprecate unused super block member, log_root_transid
On 2018/10/12 下午2:31, Nikolay Borisov wrote: > > > On 12.10.2018 09:26, Qu Wenruo wrote: >> The member log_root_transid is never used. >> It's always kept untouched even when updating log tree root. >> >> And populating it without introducing new incompat flags could easily >> cause back-compatibility problems. >> So just mark it unused. >> >> Signed-off-by: Qu Wenruo >> --- >> fs/btrfs/ctree.h | 6 ++ >> 1 file changed, 2 insertions(+), 4 deletions(-) >> >> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h >> index 53af9f5253f4..9adc53db679a 100644 >> --- a/fs/btrfs/ctree.h >> +++ b/fs/btrfs/ctree.h >> @@ -214,8 +214,8 @@ struct btrfs_super_block { >> __le64 chunk_root; >> __le64 log_root; >> >> -/* this will help find the new super based on the log root */ >> -__le64 log_root_transid; >> +/* This member is never touched, should always be 0 */ >> +__le64 __unused_log_root_transid; > > no need to be that descriptive, that's the whole idea of switching the > parameter's name. Just use "reserved1" or "unused" or "padding", also > the name is eloquent enough that the comment is redundant. Indeed, that comment will be gone. Thanks, Qu > >> __le64 total_bytes; >> __le64 bytes_used; >> __le64 root_dir_objectid; >> @@ -2317,8 +2317,6 @@ BTRFS_SETGET_STACK_FUNCS(super_chunk_root_level, >> struct btrfs_super_block, >> chunk_root_level, 8); >> BTRFS_SETGET_STACK_FUNCS(super_log_root, struct btrfs_super_block, >> log_root, 64); >> -BTRFS_SETGET_STACK_FUNCS(super_log_root_transid, struct btrfs_super_block, >> - log_root_transid, 64); >> BTRFS_SETGET_STACK_FUNCS(super_log_root_level, struct btrfs_super_block, >> log_root_level, 8); >> BTRFS_SETGET_STACK_FUNCS(super_total_bytes, struct btrfs_super_block, >>
Re: [PATCH] btrfs: Deprecate unused super block member, log_root_transid
On 12.10.2018 09:26, Qu Wenruo wrote: > The member log_root_transid is never used. > It's always kept untouched even when updating log tree root. > > And populating it without introducing new incompat flags could easily > cause back-compatibility problems. > So just mark it unused. > > Signed-off-by: Qu Wenruo > --- > fs/btrfs/ctree.h | 6 ++ > 1 file changed, 2 insertions(+), 4 deletions(-) > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > index 53af9f5253f4..9adc53db679a 100644 > --- a/fs/btrfs/ctree.h > +++ b/fs/btrfs/ctree.h > @@ -214,8 +214,8 @@ struct btrfs_super_block { > __le64 chunk_root; > __le64 log_root; > > - /* this will help find the new super based on the log root */ > - __le64 log_root_transid; > + /* This member is never touched, should always be 0 */ > + __le64 __unused_log_root_transid; no need to be that descriptive, that's the whole idea of switching the parameter's name. Just use "reserved1" or "unused" or "padding", also the name is eloquent enough that the comment is redundant. > __le64 total_bytes; > __le64 bytes_used; > __le64 root_dir_objectid; > @@ -2317,8 +2317,6 @@ BTRFS_SETGET_STACK_FUNCS(super_chunk_root_level, struct > btrfs_super_block, >chunk_root_level, 8); > BTRFS_SETGET_STACK_FUNCS(super_log_root, struct btrfs_super_block, >log_root, 64); > -BTRFS_SETGET_STACK_FUNCS(super_log_root_transid, struct btrfs_super_block, > - log_root_transid, 64); > BTRFS_SETGET_STACK_FUNCS(super_log_root_level, struct btrfs_super_block, >log_root_level, 8); > BTRFS_SETGET_STACK_FUNCS(super_total_bytes, struct btrfs_super_block, >
[PATCH] btrfs: Deprecate unused super block member, log_root_transid
The member log_root_transid is never used. It's always kept untouched even when updating log tree root. And populating it without introducing new incompat flags could easily cause back-compatibility problems. So just mark it unused. Signed-off-by: Qu Wenruo --- fs/btrfs/ctree.h | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 53af9f5253f4..9adc53db679a 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -214,8 +214,8 @@ struct btrfs_super_block { __le64 chunk_root; __le64 log_root; - /* this will help find the new super based on the log root */ - __le64 log_root_transid; + /* This member is never touched, should always be 0 */ + __le64 __unused_log_root_transid; __le64 total_bytes; __le64 bytes_used; __le64 root_dir_objectid; @@ -2317,8 +2317,6 @@ BTRFS_SETGET_STACK_FUNCS(super_chunk_root_level, struct btrfs_super_block, chunk_root_level, 8); BTRFS_SETGET_STACK_FUNCS(super_log_root, struct btrfs_super_block, log_root, 64); -BTRFS_SETGET_STACK_FUNCS(super_log_root_transid, struct btrfs_super_block, -log_root_transid, 64); BTRFS_SETGET_STACK_FUNCS(super_log_root_level, struct btrfs_super_block, log_root_level, 8); BTRFS_SETGET_STACK_FUNCS(super_total_bytes, struct btrfs_super_block, -- 2.19.1
[PATCH v3 4/4] btrfs: Refactor find_free_extent() loops update into find_free_extent_update_loop()
We have a complex loop design for find_free_extent(), that has different behavior for each loop, some even includes new chunk allocation. Instead of putting such a long code into find_free_extent() and makes it harder to read, just extract them into find_free_extent_update_loop(). With all the cleanups, the main find_free_extent() should be pretty barebone: find_free_extent() |- Iterate through all block groups | |- Get a valid block group | |- Try to do clustered allocation in that block group | |- Try to do unclustered allocation in that block group | |- Check if the result is valid | | |- If valid, then exit | |- Jump to next block group | |- Push harder to find free extents |- If not found, re-iterate all block groups Signed-off-by: Qu Wenruo Reviewed-by: Su Yue --- fs/btrfs/extent-tree.c | 217 ++--- 1 file changed, 117 insertions(+), 100 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index e6bfa91af41c..938569d2c583 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -7236,7 +7236,9 @@ struct find_free_extent_ctl { /* RAID index, converted from flags */ int index; - /* Current loop number */ + /* +* Current loop number, check find_free_extent_update_loop() for details +*/ int loop; /* @@ -7433,6 +7435,117 @@ static int find_free_extent_unclustered(struct btrfs_block_group_cache *bg, return 0; } +/* + * Return >0 means caller needs to re-search for free extent + * Return 0 means we have the needed free extent. + * Return <0 means we failed to locate any free extent. + */ +static int find_free_extent_update_loop(struct btrfs_fs_info *fs_info, + struct btrfs_free_cluster *last_ptr, + struct btrfs_key *ins, + struct find_free_extent_ctl *ffe_ctl, + int full_search, bool use_cluster) +{ + struct btrfs_root *root = fs_info->extent_root; + int ret; + + if ((ffe_ctl->loop == LOOP_CACHING_NOWAIT) && + ffe_ctl->have_caching_bg && !ffe_ctl->orig_have_caching_bg) + ffe_ctl->orig_have_caching_bg = true; + + if (!ins->objectid && ffe_ctl->loop >= LOOP_CACHING_WAIT && +ffe_ctl->have_caching_bg) + return 1; + + if (!ins->objectid && ++(ffe_ctl->index) < BTRFS_NR_RAID_TYPES) + return 1; + + /* +* LOOP_CACHING_NOWAIT, search partially cached block groups, kicking +* caching kthreads as we move along +* LOOP_CACHING_WAIT, search everything, and wait if our bg is caching +* LOOP_ALLOC_CHUNK, force a chunk allocation and try again +* LOOP_NO_EMPTY_SIZE, set empty_size and empty_cluster to 0 and try +* again +*/ + if (!ins->objectid && ffe_ctl->loop < LOOP_NO_EMPTY_SIZE) { + ffe_ctl->index = 0; + if (ffe_ctl->loop == LOOP_CACHING_NOWAIT) { + /* +* We want to skip the LOOP_CACHING_WAIT step if we +* don't have any uncached bgs and we've already done a +* full search through. +*/ + if (ffe_ctl->orig_have_caching_bg || !full_search) + ffe_ctl->loop = LOOP_CACHING_WAIT; + else + ffe_ctl->loop = LOOP_ALLOC_CHUNK; + } else { + ffe_ctl->loop++; + } + + if (ffe_ctl->loop == LOOP_ALLOC_CHUNK) { + struct btrfs_trans_handle *trans; + int exist = 0; + + trans = current->journal_info; + if (trans) + exist = 1; + else + trans = btrfs_join_transaction(root); + + if (IS_ERR(trans)) { + ret = PTR_ERR(trans); + return ret; + } + + ret = do_chunk_alloc(trans, ffe_ctl->flags, +CHUNK_ALLOC_FORCE); + + /* +* If we can't allocate a new chunk we've already looped +* through at least once, move on to the NO_EMPTY_SIZE +* case. +*/ + if (ret == -ENOSPC) + ffe_ctl->loop = LOOP_NO_EMPTY_SIZE; + + /* Do not bail out on ENOSPC since we can do more. */ + if (ret < 0 && ret != -ENOSPC) + btrfs_abort_transaction(trans, ret); + else
[PATCH v3 0/4] btrfs: Refactor find_free_extent()
Can be fetched from github: https://github.com/adam900710/linux/tree/refactor_find_free_extent Which is based on v4.19-rc1. extent-tree.c::find_free_extent() could be one of the most ill-structured functions, it has at least 6 non-exit tags and jumps between them. Refactor it into 4 parts: 1) find_free_extent() The main entrance, does the main work of block group iteration and block group selection. Now this function doesn't care nor handles free extent search by itself. 2) find_free_extent_clustered() Do clustered free extent search. May try to build/re-fill cluster. 3) find_free_extent_unclustered() Do unclustered free extent search. May try to fill free space cache. 4) find_free_extent_update_loop() Do the loop based black magic. May allocate new chunk. With this patch, at least we should make find_free_extent() a little easier to read, and provides the basis for later work on this function. Current refactor is trying not to touch the original functionality, thus the helper structure find_free_extent_ctl still contains a lot of unrelated members. But it should not change how find_free_extent() works at all. changelog: v2: Split into 4 patches. Minor comment newline change. v3: Mostly cosmetic update. Rebased to v4.19-rc1 Rename find_free_extent_ctrl to find_free_extent_ctl to keep the naming scheme the same. Fix some comment spelling error. Enhance comment for find_free_extent_unclustered(). Add reviewed-by tag. Qu Wenruo (4): btrfs: Introduce find_free_extent_ctl structure for later rework btrfs: Refactor clustered extent allocation into find_free_extent_clustered() btrfs: Refactor unclustered extent allocation into find_free_extent_unclustered() btrfs: Refactor find_free_extent() loops update into find_free_extent_update_loop() fs/btrfs/extent-tree.c | 702 - 1 file changed, 404 insertions(+), 298 deletions(-) -- 2.19.1
[PATCH v3 2/4] btrfs: Refactor clustered extent allocation into find_free_extent_clustered()
We have two main methods to find free extents inside a block group: 1) clustered allocation 2) unclustered allocation This patch will extract the clustered allocation into find_free_extent_clustered() to make it a little easier to read. Instead of jumping between different labels in find_free_extent(), the helper function will use return value to indicate different behavior. Signed-off-by: Qu Wenruo Reviewed-by: Su Yue --- fs/btrfs/extent-tree.c | 244 - 1 file changed, 121 insertions(+), 123 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index dc10f6fd26af..896d54b3c554 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -7261,6 +7261,115 @@ struct find_free_extent_ctl { u64 found_offset; }; + +/* + * Helper function for find_free_extent(). + * + * Return -ENOENT to inform caller that we need fallback to unclustered mode. + * Return -EAGAIN to inform caller that we need to re-search this block group + * Return >0 to inform caller that we find nothing + * Return 0 means we have found a location and set ffe_ctl->found_offset. + */ +static int find_free_extent_clustered(struct btrfs_block_group_cache *bg, + struct btrfs_free_cluster *last_ptr, + struct find_free_extent_ctl *ffe_ctl, + struct btrfs_block_group_cache **cluster_bg_ret) +{ + struct btrfs_fs_info *fs_info = bg->fs_info; + struct btrfs_block_group_cache *cluster_bg; + u64 aligned_cluster; + u64 offset; + int ret; + + cluster_bg = btrfs_lock_cluster(bg, last_ptr, ffe_ctl->delalloc); + if (!cluster_bg) + goto refill_cluster; + if (cluster_bg != bg && (cluster_bg->ro || + !block_group_bits(cluster_bg, ffe_ctl->flags))) + goto release_cluster; + + offset = btrfs_alloc_from_cluster(cluster_bg, last_ptr, + ffe_ctl->num_bytes, cluster_bg->key.objectid, + _ctl->max_extent_size); + if (offset) { + /* we have a block, we're done */ + spin_unlock(_ptr->refill_lock); + trace_btrfs_reserve_extent_cluster(cluster_bg, + ffe_ctl->search_start, ffe_ctl->num_bytes); + *cluster_bg_ret = cluster_bg; + ffe_ctl->found_offset = offset; + return 0; + } + WARN_ON(last_ptr->block_group != cluster_bg); +release_cluster: + /* If we are on LOOP_NO_EMPTY_SIZE, we can't set up a new clusters, so +* lets just skip it and let the allocator find whatever block it can +* find. If we reach this point, we will have tried the cluster +* allocator plenty of times and not have found anything, so we are +* likely way too fragmented for the clustering stuff to find anything. +* +* However, if the cluster is taken from the current block group, +* release the cluster first, so that we stand a better chance of +* succeeding in the unclustered allocation. +*/ + if (ffe_ctl->loop >= LOOP_NO_EMPTY_SIZE && cluster_bg != bg) { + spin_unlock(_ptr->refill_lock); + btrfs_release_block_group(cluster_bg, ffe_ctl->delalloc); + return -ENOENT; + } + + /* This cluster didn't work out, free it and start over */ + btrfs_return_cluster_to_free_space(NULL, last_ptr); + + if (cluster_bg != bg) + btrfs_release_block_group(cluster_bg, ffe_ctl->delalloc); + +refill_cluster: + if (ffe_ctl->loop >= LOOP_NO_EMPTY_SIZE) { + spin_unlock(_ptr->refill_lock); + return -ENOENT; + } + + aligned_cluster = max_t(u64, + ffe_ctl->empty_cluster + ffe_ctl->empty_size, + bg->full_stripe_len); + ret = btrfs_find_space_cluster(fs_info, bg, last_ptr, + ffe_ctl->search_start, ffe_ctl->num_bytes, + aligned_cluster); + if (ret == 0) { + /* now pull our allocation out of this cluster */ + offset = btrfs_alloc_from_cluster(bg, last_ptr, + ffe_ctl->num_bytes, + ffe_ctl->search_start, + _ctl->max_extent_size); + if (offset) { + /* we found one, proceed */ + spin_unlock(_ptr->refill_lock); + trace_btrfs_reserve_extent_cluster(bg, + ffe_ctl->search_start, ffe_ctl->num_bytes); + ffe_ctl->found_offset = offset; + return 0; + } + } else if (!ffe_ctl->cached && ffe_ctl->loop > LOOP_CACHING_NOWAIT && + !ffe_ctl->retry_clustered) { + spin_unlock(_ptr->refill_lock); + + ffe_ctl->retry_clustered =
[PATCH v3 3/4] btrfs: Refactor unclustered extent allocation into find_free_extent_unclustered()
This patch will extract unclsutered extent allocation code into find_free_extent_unclustered(). And this helper function will use return value to indicate what to do next. This should make find_free_extent() a little easier to read. Signed-off-by: Qu Wenruo Reviewed-by: Su Yue --- fs/btrfs/extent-tree.c | 114 - 1 file changed, 68 insertions(+), 46 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 896d54b3c554..e6bfa91af41c 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -7370,6 +7370,69 @@ static int find_free_extent_clustered(struct btrfs_block_group_cache *bg, return 1; } +/* + * Return >0 to inform caller that we find nothing + * Return 0 when we found an free extent and set ffe_ctrl->found_offset + * Return -EAGAIN to inform caller that we need to re-search this block group + */ +static int find_free_extent_unclustered(struct btrfs_block_group_cache *bg, + struct btrfs_free_cluster *last_ptr, + struct find_free_extent_ctl *ffe_ctl) +{ + u64 offset; + + /* +* We are doing an unclustered alloc, set the fragmented flag so we +* don't bother trying to setup a cluster again until we get more space. +*/ + if (unlikely(last_ptr)) { + spin_lock(_ptr->lock); + last_ptr->fragmented = 1; + spin_unlock(_ptr->lock); + } + if (ffe_ctl->cached) { + struct btrfs_free_space_ctl *free_space_ctl; + + free_space_ctl = bg->free_space_ctl; + spin_lock(_space_ctl->tree_lock); + if (free_space_ctl->free_space < + ffe_ctl->num_bytes + ffe_ctl->empty_cluster + + ffe_ctl->empty_size) { + ffe_ctl->max_extent_size = max_t(u64, + ffe_ctl->max_extent_size, + free_space_ctl->free_space); + spin_unlock(_space_ctl->tree_lock); + return 1; + } + spin_unlock(_space_ctl->tree_lock); + } + + offset = btrfs_find_space_for_alloc(bg, ffe_ctl->search_start, + ffe_ctl->num_bytes, ffe_ctl->empty_size, + _ctl->max_extent_size); + + /* +* If we didn't find a chunk, and we haven't failed on this block group +* before, and this block group is in the middle of caching and we are +* ok with waiting, then go ahead and wait for progress to be made, and +* set @retry_unclustered to true. +* +* If @retry_unclustered is true then we've already waited on this block +* group once and should move on to the next block group. +*/ + if (!offset && !ffe_ctl->retry_unclustered && !ffe_ctl->cached && + ffe_ctl->loop > LOOP_CACHING_NOWAIT) { + wait_block_group_cache_progress(bg, ffe_ctl->num_bytes + + ffe_ctl->empty_size); + ffe_ctl->retry_unclustered = true; + return -EAGAIN; + } else if (!offset) { + return 1; + } + ffe_ctl->found_offset = offset; + return 0; +} + /* * walks the btree of allocated extents and find a hole of a given size. * The key ins is changed to record the hole: @@ -7572,54 +7635,13 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info, /* ret == -ENOENT case falls through */ } - /* -* We are doing an unclustered alloc, set the fragmented flag so -* we don't bother trying to setup a cluster again until we get -* more space. -*/ - if (unlikely(last_ptr)) { - spin_lock(_ptr->lock); - last_ptr->fragmented = 1; - spin_unlock(_ptr->lock); - } - if (ffe_ctl.cached) { - struct btrfs_free_space_ctl *ctl = - block_group->free_space_ctl; - - spin_lock(>tree_lock); - if (ctl->free_space < - num_bytes + ffe_ctl.empty_cluster + empty_size) { - if (ctl->free_space > ffe_ctl.max_extent_size) - ffe_ctl.max_extent_size = ctl->free_space; - spin_unlock(>tree_lock); - goto loop; - } - spin_unlock(>tree_lock); - } - - ffe_ctl.found_offset = btrfs_find_space_for_alloc(block_group, - ffe_ctl.search_start, num_bytes, empty_size, - _ctl.max_extent_size); - /* -* If we
[PATCH v3 1/4] btrfs: Introduce find_free_extent_ctl structure for later rework
Instead of tons of different local variables in find_free_extent(), extract them into find_free_extent_ctl structure, and add better explanation for them. Some modification may looks redundant, but will later greatly simplify function parameter list during find_free_extent() refactor. Signed-off-by: Qu Wenruo --- fs/btrfs/extent-tree.c | 253 ++--- 1 file changed, 161 insertions(+), 92 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index de6f75f5547b..dc10f6fd26af 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -7212,6 +7212,55 @@ btrfs_release_block_group(struct btrfs_block_group_cache *cache, btrfs_put_block_group(cache); } +/* + * Internal used structure for find_free_extent() function. + * Wraps needed parameters. + */ +struct find_free_extent_ctl { + /* Basic allocation info */ + u64 ram_bytes; + u64 num_bytes; + u64 empty_size; + u64 flags; + int delalloc; + + /* Where to start the search inside the bg */ + u64 search_start; + + /* For clustered allocation */ + u64 empty_cluster; + + bool have_caching_bg; + bool orig_have_caching_bg; + + /* RAID index, converted from flags */ + int index; + + /* Current loop number */ + int loop; + + /* +* Whether we're refilling a cluster, if true we need to re-search +* current block group but don't try to refill the cluster again. +*/ + bool retry_clustered; + + /* +* Whether we're updating free space cache, if true we need to re-search +* current block group but don't try updating free space cache again. +*/ + bool retry_unclustered; + + /* If current block group is cached */ + int cached; + + /* Max extent size found */ + u64 max_extent_size; + + /* Found result */ + u64 found_offset; +}; + /* * walks the btree of allocated extents and find a hole of a given size. * The key ins is changed to record the hole: @@ -7232,20 +7281,26 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info, struct btrfs_root *root = fs_info->extent_root; struct btrfs_free_cluster *last_ptr = NULL; struct btrfs_block_group_cache *block_group = NULL; - u64 search_start = 0; - u64 max_extent_size = 0; - u64 empty_cluster = 0; + struct find_free_extent_ctl ffe_ctl = {0}; struct btrfs_space_info *space_info; - int loop = 0; - int index = btrfs_bg_flags_to_raid_index(flags); - bool failed_cluster_refill = false; - bool failed_alloc = false; bool use_cluster = true; - bool have_caching_bg = false; - bool orig_have_caching_bg = false; bool full_search = false; WARN_ON(num_bytes < fs_info->sectorsize); + + ffe_ctl.ram_bytes = ram_bytes; + ffe_ctl.num_bytes = num_bytes; + ffe_ctl.empty_size = empty_size; + ffe_ctl.flags = flags; + ffe_ctl.search_start = 0; + ffe_ctl.retry_clustered = false; + ffe_ctl.retry_unclustered = false; + ffe_ctl.delalloc = delalloc; + ffe_ctl.index = btrfs_bg_flags_to_raid_index(flags); + ffe_ctl.have_caching_bg = false; + ffe_ctl.orig_have_caching_bg = false; + ffe_ctl.found_offset = 0; + ins->type = BTRFS_EXTENT_ITEM_KEY; ins->objectid = 0; ins->offset = 0; @@ -7281,7 +7336,8 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info, spin_unlock(_info->lock); } - last_ptr = fetch_cluster_info(fs_info, space_info, _cluster); + last_ptr = fetch_cluster_info(fs_info, space_info, + _ctl.empty_cluster); if (last_ptr) { spin_lock(_ptr->lock); if (last_ptr->block_group) @@ -7298,10 +7354,12 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info, spin_unlock(_ptr->lock); } - search_start = max(search_start, first_logical_byte(fs_info, 0)); - search_start = max(search_start, hint_byte); - if (search_start == hint_byte) { - block_group = btrfs_lookup_block_group(fs_info, search_start); + ffe_ctl.search_start = max(ffe_ctl.search_start, + first_logical_byte(fs_info, 0)); + ffe_ctl.search_start = max(ffe_ctl.search_start, hint_byte); + if (ffe_ctl.search_start == hint_byte) { + block_group = btrfs_lookup_block_group(fs_info, + ffe_ctl.search_start); /* * we don't want to use the block group if it doesn't match our * allocation bits, or if its not cached. @@ -7323,7 +7381,7 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,