Re: IO errors when building RAID1.... ?
Chris Murphy posted on Fri, 31 Aug 2018 13:02:16 -0600 as excerpted: > If you want you can post the output from 'sudo smartctl -x /dev/sda' > which will contain more information... but this is in some sense > superfluous. The problem is very clearly a bad drive, the drive > explicitly report to libata a write error, and included the sector LBA > affected, and only the drive firmware would know that. It's not likely a > cable problem or something like. And that the write error is reported at > all means it's persistent, not transient. Two points: 1) Does this happen to be an archive/SMR (shingled magnetic recording) device? If so that might be the problem as such devices really aren't suited to normal usage (they really are designed for archiving), and btrfs' COW patterns can exacerbate the issue. It's quite possible that the original install didn't load up the IO as heavily as the balance- convert does, so the problem appears with convert but not for install. 2) Assuming it's /not/ an SMR issue, and smartctl doesn't say it's dying, I'd suggest running badblocks -w (make sure the device doesn't have anything valuable on it!) on the device -- note that this will take awhile, probably a couple days perhaps longer, as it writes four different patterns to the entire device one at a time, reading everything back to verify the pattern was written correctly, so it's actually going over the entire device 8 times, alternating write and read, but it should settle the issue of the reliability of the device. Or if you'd rather spend the money than the time and it's not under warrantee still, just replace it, or at least buy a new one to use while you run the tests on that one. I fully understand that tying up the thing running tests on it for days straight may not be viable. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: [PATCH 30/35] btrfs: cleanup pending bgs on transaction abort
On Thu, Aug 30, 2018 at 01:42:20PM -0400, Josef Bacik wrote: > We may abort the transaction during a commit and not have a chance to > run the pending bgs stuff, which will leave block groups on our list and > cause us accounting issues and leaked memory. Fix this by running the > pending bgs when we cleanup a transaction. Reviewed-by: Omar Sandoval Again, I think it's fine to reuse the same function as long as there's a comment here. > Signed-off-by: Josef Bacik > --- > fs/btrfs/transaction.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c > index 89d14f135837..0f39a0d302d3 100644 > --- a/fs/btrfs/transaction.c > +++ b/fs/btrfs/transaction.c > @@ -2273,6 +2273,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle > *trans) > btrfs_scrub_continue(fs_info); > cleanup_transaction: > btrfs_trans_release_metadata(trans); > + btrfs_create_pending_block_groups(trans); > btrfs_trans_release_chunk_metadata(trans); > trans->block_rsv = NULL; > btrfs_warn(fs_info, "Skipping commit of aborted transaction."); > -- > 2.14.3 >
Re: [PATCH 29/35] btrfs: just delete pending bgs if we are aborted
On Thu, Aug 30, 2018 at 01:42:19PM -0400, Josef Bacik wrote: > We still need to do all of the accounting cleanup for pending block > groups if we abort. So set the ret to trans->aborted so if we aborted > the cleanup happens and everybody is happy. Reviewed-by: Omar Sandoval Reusing the loop is fine IMO, but a comment would be appreciated. > Signed-off-by: Josef Bacik > --- > fs/btrfs/extent-tree.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 90f267f4dd0f..132a1157982c 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -10333,7 +10333,7 @@ void btrfs_create_pending_block_groups(struct > btrfs_trans_handle *trans) > struct btrfs_root *extent_root = fs_info->extent_root; > struct btrfs_block_group_item item; > struct btrfs_key key; > - int ret = 0; > + int ret = trans->aborted; > bool can_flush_pending_bgs = trans->can_flush_pending_bgs; > > trans->can_flush_pending_bgs = false; > -- > 2.14.3 >
Re: [PATCH 21/35] btrfs: only run delayed refs if we're committing
On Thu, Aug 30, 2018 at 01:42:11PM -0400, Josef Bacik wrote: > I noticed in a giant dbench run that we spent a lot of time on lock > contention while running transaction commit. This is because dbench > results in a lot of fsync()'s that do a btrfs_transaction_commit(), and > they all run the delayed refs first thing, so they all contend with > each other. This leads to seconds of 0 throughput. Change this to only > run the delayed refs if we're the ones committing the transaction. This > makes the latency go away and we get no more lock contention. This means that we're going to spend more time running delayed refs while in TRANS_STATE_COMMIT_START, so couldn't we end up blocking new transactions more than before? > Signed-off-by: Josef Bacik > --- > fs/btrfs/transaction.c | 24 +--- > 1 file changed, 9 insertions(+), 15 deletions(-) > > diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c > index ebb0c0405598..2bb19e2ded5e 100644 > --- a/fs/btrfs/transaction.c > +++ b/fs/btrfs/transaction.c > @@ -1918,15 +1918,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle > *trans) > btrfs_trans_release_metadata(trans); > trans->block_rsv = NULL; > > - /* make a pass through all the delayed refs we have so far > - * any runnings procs may add more while we are here > - */ > - ret = btrfs_run_delayed_refs(trans, 0); > - if (ret) { > - btrfs_end_transaction(trans); > - return ret; > - } > - > cur_trans = trans->transaction; > > /* > @@ -1939,12 +1930,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle > *trans) > if (!list_empty(>new_bgs)) > btrfs_create_pending_block_groups(trans); > > - ret = btrfs_run_delayed_refs(trans, 0); > - if (ret) { > - btrfs_end_transaction(trans); > - return ret; > - } > - > if (!test_bit(BTRFS_TRANS_DIRTY_BG_RUN, _trans->flags)) { > int run_it = 0; > > @@ -2015,6 +2000,15 @@ int btrfs_commit_transaction(struct btrfs_trans_handle > *trans) > spin_unlock(_info->trans_lock); > } > > + /* > + * We are now the only one in the commit area, we can run delayed refs > + * without hitting a bunch of lock contention from a lot of people > + * trying to commit the transaction at once. > + */ > + ret = btrfs_run_delayed_refs(trans, 0); > + if (ret) > + goto cleanup_transaction; > + > extwriter_counter_dec(cur_trans, trans->type); > > ret = btrfs_start_delalloc_flush(fs_info); > -- > 2.14.3 >
Re: [PATCH 23/35] btrfs: assert on non-empty delayed iputs
On Thu, Aug 30, 2018 at 01:42:13PM -0400, Josef Bacik wrote: > I ran into an issue where there was some reference being held on an > inode that I couldn't track. This assert wasn't triggered, but it at > least rules out we're doing something stupid. Reviewed-by: Omar Sandoval > Signed-off-by: Josef Bacik > --- > fs/btrfs/disk-io.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c > index 0e42401756b8..11ea2ea7439e 100644 > --- a/fs/btrfs/disk-io.c > +++ b/fs/btrfs/disk-io.c > @@ -3979,6 +3979,7 @@ void close_ctree(struct btrfs_fs_info *fs_info) > kthread_stop(fs_info->transaction_kthread); > kthread_stop(fs_info->cleaner_kthread); > > + ASSERT(list_empty(_info->delayed_iputs)); > set_bit(BTRFS_FS_CLOSING_DONE, _info->flags); > > btrfs_free_qgroup_config(fs_info); > -- > 2.14.3 >
Re: [PATCH 09/35] btrfs: protect space cache inode alloc with nofs
On Thu, Aug 30, 2018 at 01:41:59PM -0400, Josef Bacik wrote: > If we're allocating a new space cache inode it's likely going to be > under a transaction handle, so we need to use memalloc_nofs_save() in > order to avoid deadlocks, and more importantly lockdep messages that > make xfstests fail. Could use a comment where we call memalloc_nofs_save(). Otherwise, Reviewed-by: Omar Sandoval > Signed-off-by: Josef Bacik > --- > fs/btrfs/free-space-cache.c | 4 > 1 file changed, 4 insertions(+) > > diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c > index c3888c113d81..db93a5f035a0 100644 > --- a/fs/btrfs/free-space-cache.c > +++ b/fs/btrfs/free-space-cache.c > @@ -10,6 +10,7 @@ > #include > #include > #include > +#include > #include "ctree.h" > #include "free-space-cache.h" > #include "transaction.h" > @@ -47,6 +48,7 @@ static struct inode *__lookup_free_space_inode(struct > btrfs_root *root, > struct btrfs_free_space_header *header; > struct extent_buffer *leaf; > struct inode *inode = NULL; > + unsigned nofs_flag; > int ret; > > key.objectid = BTRFS_FREE_SPACE_OBJECTID; > @@ -68,7 +70,9 @@ static struct inode *__lookup_free_space_inode(struct > btrfs_root *root, > btrfs_disk_key_to_cpu(, _key); > btrfs_release_path(path); > > + nofs_flag = memalloc_nofs_save(); > inode = btrfs_iget(fs_info->sb, , root, NULL); > + memalloc_nofs_restore(nofs_flag); > if (IS_ERR(inode)) > return inode; > > -- > 2.14.3 >
Re: [PATCH 08/35] btrfs: release metadata before running delayed refs
On Thu, Aug 30, 2018 at 01:41:58PM -0400, Josef Bacik wrote: > We want to release the unused reservation we have since it refills the > delayed refs reserve, which will make everything go smoother when > running the delayed refs if we're short on our reservation. Reviewed-by: Omar Sandoval > Signed-off-by: Josef Bacik > --- > fs/btrfs/transaction.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c > index 99741254e27e..ebb0c0405598 100644 > --- a/fs/btrfs/transaction.c > +++ b/fs/btrfs/transaction.c > @@ -1915,6 +1915,9 @@ int btrfs_commit_transaction(struct btrfs_trans_handle > *trans) > return ret; > } > > + btrfs_trans_release_metadata(trans); > + trans->block_rsv = NULL; > + > /* make a pass through all the delayed refs we have so far >* any runnings procs may add more while we are here >*/ > @@ -1924,9 +1927,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle > *trans) > return ret; > } > > - btrfs_trans_release_metadata(trans); > - trans->block_rsv = NULL; > - > cur_trans = trans->transaction; > > /* > -- > 2.14.3 >
Re: [PATCH 22/35] btrfs: make sure we create all new bgs
On Thu, Aug 30, 2018 at 01:42:12PM -0400, Josef Bacik wrote: > We can actually allocate new chunks while we're creating our bg's, so > instead of doing list_for_each_safe, just do while (!list_empty()) so we > make sure to catch any new bg's that get added to the list. Reviewed-by: Omar Sandoval Since Nikolay pointed it out, might as well mention in the commit message that this can happen because we modify the chunk and extent trees. > Signed-off-by: Josef Bacik > --- > fs/btrfs/extent-tree.c | 7 +-- > 1 file changed, 5 insertions(+), 2 deletions(-) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index ca98c39308f6..fc30ff96f0d6 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -10331,7 +10331,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info > *info) > void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans) > { > struct btrfs_fs_info *fs_info = trans->fs_info; > - struct btrfs_block_group_cache *block_group, *tmp; > + struct btrfs_block_group_cache *block_group; > struct btrfs_root *extent_root = fs_info->extent_root; > struct btrfs_block_group_item item; > struct btrfs_key key; > @@ -10339,7 +10339,10 @@ void btrfs_create_pending_block_groups(struct > btrfs_trans_handle *trans) > bool can_flush_pending_bgs = trans->can_flush_pending_bgs; > > trans->can_flush_pending_bgs = false; > - list_for_each_entry_safe(block_group, tmp, >new_bgs, bg_list) { > + while (!list_empty(>new_bgs)) { > + block_group = list_first_entry(>new_bgs, > +struct btrfs_block_group_cache, > +bg_list); > if (ret) > goto next; > > -- > 2.14.3 >
Re: [PATCH 06/35] btrfs: check if free bgs for commit
On Thu, Aug 30, 2018 at 01:41:56PM -0400, Josef Bacik wrote: > may_commit_transaction will skip committing the transaction if we don't > have enough pinned space or if we're trying to find space for a SYSTEM > chunk. However if we have pending free block groups in this transaction > we still want to commit as we may be able to allocate a chunk to make > our reservation. So instead of just returning ENOSPC, check if we have > free block groups pending, and if so commit the transaction to allow us > to use that free space. This makes sense. > Signed-off-by: Josef Bacik > --- > fs/btrfs/extent-tree.c | 18 -- > 1 file changed, 12 insertions(+), 6 deletions(-) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 6e7f350754d2..80615a579b18 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -4804,6 +4804,7 @@ static int may_commit_transaction(struct btrfs_fs_info > *fs_info, > struct btrfs_trans_handle *trans; > u64 bytes; > u64 reclaim_bytes = 0; > + bool do_commit = true; I find this naming a little mind bending when I read do_commit = false; goto commit; Since the end result is that we always join the transaction if we make it past the (!bytes) check anyways, can we do the pending bgs check first? I find the following easier to follow, fwiw. diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index de6f75f5547b..dd7aeb5fb6bf 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4779,18 +4779,25 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info, if (!bytes) return 0; - /* See if there is enough pinned space to make this reservation */ - if (__percpu_counter_compare(_info->total_bytes_pinned, - bytes, - BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0) - goto commit; + trans = btrfs_join_transaction(fs_info->extent_root); + if (IS_ERR(trans)) + return -ENOSPC; + + /* +* See if we have a pending bg or there is enough pinned space to make +* this reservation. +*/ + if (test_bit(BTRFS_TRANS_HAVE_FREE_BGS, >transaction->flags) || + __percpu_counter_compare(_info->total_bytes_pinned, bytes, +BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0) + return btrfs_commit_transaction(trans); /* * See if there is some space in the delayed insertion reservation for * this reservation. */ if (space_info != delayed_rsv->space_info) - return -ENOSPC; + goto enospc; spin_lock(_rsv->lock); if (delayed_rsv->size > bytes) @@ -4801,16 +4808,14 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info, if (__percpu_counter_compare(_info->total_bytes_pinned, bytes, - BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) { - return -ENOSPC; - } - -commit: - trans = btrfs_join_transaction(fs_info->extent_root); - if (IS_ERR(trans)) - return -ENOSPC; + BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) + goto enospc; return btrfs_commit_transaction(trans); + +enospc: + btrfs_end_transaction(trans); + return -ENOSPC; } /*
Re: [PATCH 03/35] btrfs: use cleanup_extent_op in check_ref_cleanup
On Thu, Aug 30, 2018 at 01:41:53PM -0400, Josef Bacik wrote: > From: Josef Bacik > > Unify the extent_op handling as well, just add a flag so we don't > actually run the extent op from check_ref_cleanup and instead return a > value so that we can skip cleaning up the ref head. > > Signed-off-by: Josef Bacik > --- > fs/btrfs/extent-tree.c | 17 + > 1 file changed, 9 insertions(+), 8 deletions(-) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 4c9fd35bca07..87c42a2c45b1 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -2443,18 +2443,23 @@ static void unselect_delayed_ref_head(struct > btrfs_delayed_ref_root *delayed_ref > } > > static int cleanup_extent_op(struct btrfs_trans_handle *trans, > - struct btrfs_delayed_ref_head *head) > + struct btrfs_delayed_ref_head *head, > + bool run_extent_op) > { > struct btrfs_delayed_extent_op *extent_op = head->extent_op; > int ret; > > if (!extent_op) > return 0; > + > head->extent_op = NULL; > if (head->must_insert_reserved) { > btrfs_free_delayed_extent_op(extent_op); > return 0; > + } else if (!run_extent_op) { > + return 1; > } > + > spin_unlock(>lock); > ret = run_delayed_extent_op(trans, head, extent_op); > btrfs_free_delayed_extent_op(extent_op); So if cleanup_extent_op() returns 1, then the head was unlocked, unless run_extent_op was true. That's pretty confusing. Can we make it always unlock in the !must_insert_reserved case? > @@ -2506,7 +2511,7 @@ static int cleanup_ref_head(struct btrfs_trans_handle > *trans, > > delayed_refs = >transaction->delayed_refs; > > - ret = cleanup_extent_op(trans, head); > + ret = cleanup_extent_op(trans, head, true); > if (ret < 0) { > unselect_delayed_ref_head(delayed_refs, head); > btrfs_debug(fs_info, "run_delayed_extent_op returned %d", ret); > @@ -6977,12 +6982,8 @@ static noinline int check_ref_cleanup(struct > btrfs_trans_handle *trans, > if (!RB_EMPTY_ROOT(>ref_tree)) > goto out; > > - if (head->extent_op) { > - if (!head->must_insert_reserved) > - goto out; > - btrfs_free_delayed_extent_op(head->extent_op); > - head->extent_op = NULL; > - } > + if (cleanup_extent_op(trans, head, false)) > + goto out; > > /* >* waiting for the lock here would deadlock. If someone else has it > -- > 2.14.3 >
Re: [PATCH 02/35] btrfs: add cleanup_ref_head_accounting helper
On Thu, Aug 30, 2018 at 01:41:52PM -0400, Josef Bacik wrote: > From: Josef Bacik > > We were missing some quota cleanups in check_ref_cleanup, so break the > ref head accounting cleanup into a helper and call that from both > check_ref_cleanup and cleanup_ref_head. This will hopefully ensure that > we don't screw up accounting in the future for other things that we add. Reviewed-by: Omar Sandoval > Signed-off-by: Josef Bacik > --- > fs/btrfs/extent-tree.c | 67 > +- > 1 file changed, 39 insertions(+), 28 deletions(-) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 6799950fa057..4c9fd35bca07 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -2461,6 +2461,41 @@ static int cleanup_extent_op(struct btrfs_trans_handle > *trans, > return ret ? ret : 1; > } > > +static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans, > + struct btrfs_delayed_ref_head *head) > +{ > + struct btrfs_fs_info *fs_info = trans->fs_info; > + struct btrfs_delayed_ref_root *delayed_refs = > + >transaction->delayed_refs; > + > + if (head->total_ref_mod < 0) { > + struct btrfs_space_info *space_info; > + u64 flags; > + > + if (head->is_data) > + flags = BTRFS_BLOCK_GROUP_DATA; > + else if (head->is_system) > + flags = BTRFS_BLOCK_GROUP_SYSTEM; > + else > + flags = BTRFS_BLOCK_GROUP_METADATA; > + space_info = __find_space_info(fs_info, flags); > + ASSERT(space_info); > + percpu_counter_add_batch(_info->total_bytes_pinned, > +-head->num_bytes, > +BTRFS_TOTAL_BYTES_PINNED_BATCH); While you're here, could you fix this botched whitespace?
[PATCH v5 3/6] vfs: update swap_{,de}activate documentation
From: Omar Sandoval The documentation for these functions is wrong in several ways: - swap_activate() is called with the inode locked - swap_activate() takes a swap_info_struct * and a sector_t * - swap_activate() can also return a positive number of extents it added itself - swap_deactivate() does not return anything Reviewed-by: Nikolay Borisov Signed-off-by: Omar Sandoval --- Documentation/filesystems/Locking | 17 +++-- Documentation/filesystems/vfs.txt | 12 2 files changed, 15 insertions(+), 14 deletions(-) diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index efea228ccd8a..b970c8c2ee22 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -210,8 +210,9 @@ prototypes: int (*launder_page)(struct page *); int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long); int (*error_remove_page)(struct address_space *, struct page *); - int (*swap_activate)(struct file *); - int (*swap_deactivate)(struct file *); + int (*swap_activate)(struct swap_info_struct *, struct file *, +sector_t *); + void (*swap_deactivate)(struct file *); locking rules: All except set_page_dirty and freepage may block @@ -235,8 +236,8 @@ putback_page: yes launder_page: yes is_partially_uptodate: yes error_remove_page: yes -swap_activate: no -swap_deactivate: no +swap_activate: yes +swap_deactivate: no ->write_begin(), ->write_end() and ->readpage() may be called from the request handler (/dev/loop). @@ -333,14 +334,10 @@ cleaned, or an error value if not. Note that in order to prevent the page getting mapped back in and redirtied, it needs to be kept locked across the entire operation. - ->swap_activate will be called with a non-zero argument on -files backing (non block device backed) swapfiles. A return value -of zero indicates success, in which case this file can be used for -backing swapspace. The swapspace operations will be proxied to the -address space operations. + ->swap_activate is called from sys_swapon() with the inode locked. ->swap_deactivate() will be called in the sys_swapoff() -path after ->swap_activate() returned success. +path after ->swap_activate() returned success. The inode is not locked. --- file_lock_operations -- prototypes: diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index 4b2084d0f1fb..40d6d6d4b76b 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -652,8 +652,9 @@ struct address_space_operations { unsigned long); void (*is_dirty_writeback) (struct page *, bool *, bool *); int (*error_remove_page) (struct mapping *mapping, struct page *page); - int (*swap_activate)(struct file *); - int (*swap_deactivate)(struct file *); + int (*swap_activate)(struct swap_info_struct *, struct file *, +sector_t *); + void (*swap_deactivate)(struct file *); }; writepage: called by the VM to write a dirty page to backing store. @@ -830,8 +831,11 @@ struct address_space_operations { swap_activate: Called when swapon is used on a file to allocate space if necessary and pin the block lookup information in - memory. A return value of zero indicates success, - in which case this file can be used to back swapspace. + memory. If this returns zero, the swap system will call the address + space operations ->readpage() and ->direct_IO(). Alternatively, this + may call add_swap_extent() and return the number of extents added, in + which case the swap system will use the provided blocks directly + instead of going through the filesystem. swap_deactivate: Called during swapoff on files where swap_activate was successful. -- 2.18.0
[PATCH v5 0/6] Btrfs: implement swap file support
From: Omar Sandoval Hi, This series implements swap file support for Btrfs. Changes since v4 [1]: - Added a kernel doc for btrfs_get_chunk_map() - Got rid of "Btrfs: push EXCL_OP set into btrfs_rm_device()" - Made activate error messages more clear and consistent - Changed clear vs unlock order in activate error case - Added "mm: export add_swap_extent()" as a separate patch - Added a btrfs_wait_ordered_range() at the beginning of btrfs_swap_activate() to catch newly created files - Added some Reviewed-bys from Nikolay I took a stab at adding support for balance when a swap file is active, but it's a major pain: we need to mark block groups which contain swap file extents, check the block group counter in relocate/scrub, then unmark the block groups when the swap file is deactivated, which gets really messy because the file can grow while it is an active swap file. If this is a deal breaker, I can work something out, but I don't think it's worth the trouble. This was tested with the swap tests in xfstests plus my new tests here [2]. Additionally, I used my swapme test program [3] and ran a few memory-intensive workloads (e.g., a highly parallel kernel build), verifying that swap was being used. All of this was done with lockdep enabled. This series is based on v4.19-rc1. Please take a look. Thanks! 1: https://www.spinics.net/lists/linux-btrfs/msg78731.html 2: https://github.com/osandov/xfstests/tree/btrfs-swap 3: https://github.com/osandov/osandov-linux/blob/master/scripts/swapme.c Omar Sandoval (6): mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS mm: export add_swap_extent() vfs: update swap_{,de}activate documentation Btrfs: prevent ioctls from interfering with a swap file Btrfs: rename get_chunk_map() and make it non-static Btrfs: support swap files Documentation/filesystems/Locking | 17 +-- Documentation/filesystems/vfs.txt | 12 +- fs/btrfs/ctree.h | 6 + fs/btrfs/disk-io.c| 3 + fs/btrfs/inode.c | 232 ++ fs/btrfs/ioctl.c | 51 ++- fs/btrfs/volumes.c| 28 ++-- fs/btrfs/volumes.h| 9 ++ include/linux/swap.h | 13 +- mm/page_io.c | 6 +- mm/swapfile.c | 14 +- 11 files changed, 348 insertions(+), 43 deletions(-) -- 2.18.0
[PATCH v5 4/6] Btrfs: prevent ioctls from interfering with a swap file
From: Omar Sandoval When a swap file is active, we must make sure that the extents of the file are not moved and that they don't become shared. That means that the following are not safe: - chattr +c (enable compression) - reflink - dedupe - snapshot - defrag - balance - device remove/replace/resize Don't allow those to happen on an active swap file. Balance and device remove/replace/resize in particular are disallowed entirely; in the future, we can relax this so that relocation skips/errors out only on chunks containing an active swap file. Note that we don't have to worry about chattr -C (disable nocow), which we ignore for non-empty files, because an active swapfile must be non-empty and can't be truncated. We also don't have to worry about autodefrag because it's only done on COW files. Truncate and fallocate are already taken care of by the generic code. Device add doesn't do relocation so it's not an issue, either. Signed-off-by: Omar Sandoval --- fs/btrfs/ctree.h | 6 ++ fs/btrfs/disk-io.c | 3 +++ fs/btrfs/ioctl.c | 51 ++ fs/btrfs/volumes.c | 6 ++ 4 files changed, 62 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 53af9f5253f4..1c767a6394ae 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1121,6 +1121,9 @@ struct btrfs_fs_info { u32 sectorsize; u32 stripesize; + /* Number of active swapfiles */ + atomic_t nr_swapfiles; + #ifdef CONFIG_BTRFS_FS_REF_VERIFY spinlock_t ref_verify_lock; struct rb_root block_tree; @@ -1285,6 +1288,9 @@ struct btrfs_root { spinlock_t qgroup_meta_rsv_lock; u64 qgroup_meta_rsv_pertrans; u64 qgroup_meta_rsv_prealloc; + + /* Number of active swapfiles */ + atomic_t nr_swapfiles; }; struct btrfs_file_private { diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 5124c15705ce..50ee5cd3efae 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1187,6 +1187,7 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info, atomic_set(>log_batch, 0); refcount_set(>refs, 1); atomic_set(>will_be_snapshotted, 0); + atomic_set(>nr_swapfiles, 0); root->log_transid = 0; root->log_transid_committed = -1; root->last_log_commit = 0; @@ -2781,6 +2782,8 @@ int open_ctree(struct super_block *sb, fs_info->sectorsize = 4096; fs_info->stripesize = 4096; + atomic_set(_info->nr_swapfiles, 0); + ret = btrfs_alloc_stripe_hash_table(fs_info); if (ret) { err = ret; diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 63600dc2ac4c..cc230dcd32a4 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -290,6 +290,11 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg) } else if (fsflags & FS_COMPR_FL) { const char *comp; + if (IS_SWAPFILE(inode)) { + ret = -ETXTBSY; + goto out_unlock; + } + binode->flags |= BTRFS_INODE_COMPRESS; binode->flags &= ~BTRFS_INODE_NOCOMPRESS; @@ -751,6 +756,12 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir, if (!test_bit(BTRFS_ROOT_REF_COWS, >state)) return -EINVAL; + if (atomic_read(>nr_swapfiles)) { + btrfs_info(fs_info, + "cannot snapshot subvolume with active swapfile"); + return -ETXTBSY; + } + pending_snapshot = kzalloc(sizeof(*pending_snapshot), GFP_KERNEL); if (!pending_snapshot) return -ENOMEM; @@ -1487,9 +1498,13 @@ int btrfs_defrag_file(struct inode *inode, struct file *file, } inode_lock(inode); - if (do_compress) - BTRFS_I(inode)->defrag_compress = compress_type; - ret = cluster_pages_for_defrag(inode, pages, i, cluster); + if (IS_SWAPFILE(inode)) { + ret = -ETXTBSY; + } else { + if (do_compress) + BTRFS_I(inode)->defrag_compress = compress_type; + ret = cluster_pages_for_defrag(inode, pages, i, cluster); + } if (ret < 0) { inode_unlock(inode); goto out_ra; @@ -1585,6 +1600,12 @@ static noinline int btrfs_ioctl_resize(struct file *file, return BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS; } + if (atomic_read(_info->nr_swapfiles)) { + btrfs_info(fs_info, "cannot resize with active swapfile"); + ret = -ETXTBSY; + goto out; + } + vol_args = memdup_user(arg, sizeof(*vol_args)); if (IS_ERR(vol_args)) { ret = PTR_ERR(vol_args); @@ -3538,6 +3559,11
[PATCH v5 6/6] Btrfs: support swap files
From: Omar Sandoval Implement the swap file a_ops on Btrfs. Activation needs to make sure that the file can be used as a swap file, which currently means it must be fully allocated as nocow with no compression on one device. It also sets up the swap extents directly with add_swap_extent(), so export it. Signed-off-by: Omar Sandoval --- fs/btrfs/inode.c | 232 +++ 1 file changed, 232 insertions(+) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 9357a19d2bff..c0409e632768 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -27,6 +27,7 @@ #include #include #include +#include #include #include "ctree.h" #include "disk-io.h" @@ -10437,6 +10438,235 @@ void btrfs_set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end) } } +struct btrfs_swap_info { + u64 start; + u64 block_start; + u64 block_len; + u64 lowest_ppage; + u64 highest_ppage; + unsigned long nr_pages; + int nr_extents; +}; + +static int btrfs_add_swap_extent(struct swap_info_struct *sis, +struct btrfs_swap_info *bsi) +{ + unsigned long nr_pages; + u64 first_ppage, first_ppage_reported, next_ppage; + int ret; + + first_ppage = ALIGN(bsi->block_start, PAGE_SIZE) >> PAGE_SHIFT; + next_ppage = ALIGN_DOWN(bsi->block_start + bsi->block_len, + PAGE_SIZE) >> PAGE_SHIFT; + + if (first_ppage >= next_ppage) + return 0; + nr_pages = next_ppage - first_ppage; + + first_ppage_reported = first_ppage; + if (bsi->start == 0) + first_ppage_reported++; + if (bsi->lowest_ppage > first_ppage_reported) + bsi->lowest_ppage = first_ppage_reported; + if (bsi->highest_ppage < (next_ppage - 1)) + bsi->highest_ppage = next_ppage - 1; + + ret = add_swap_extent(sis, bsi->nr_pages, nr_pages, first_ppage); + if (ret < 0) + return ret; + bsi->nr_extents += ret; + bsi->nr_pages += nr_pages; + return 0; +} + +static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file, + sector_t *span) +{ + struct inode *inode = file_inode(file); + struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info; + struct extent_io_tree *io_tree = _I(inode)->io_tree; + struct extent_state *cached_state = NULL; + struct extent_map *em = NULL; + struct btrfs_device *device = NULL; + struct btrfs_swap_info bsi = { + .lowest_ppage = (sector_t)-1ULL, + }; + int ret = 0; + u64 isize = inode->i_size; + u64 start; + + /* +* If the swap file was just created, make sure delalloc is done. If the +* file changes again after this, the user is doing something stupid and +* we don't really care. +*/ + ret = btrfs_wait_ordered_range(inode, 0, (u64)-1); + if (ret) + return ret; + + /* +* The inode is locked, so these flags won't change after we check them. +*/ + if (BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS) { + btrfs_err(fs_info, "swapfile must not be compressed"); + return -EINVAL; + } + if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW)) { + btrfs_err(fs_info, "swapfile must not be copy-on-write"); + return -EINVAL; + } + + /* +* Balance or device remove/replace/resize can move stuff around from +* under us. The EXCL_OP flag makes sure they aren't running/won't run +* concurrently while we are mapping the swap extents, and the fs_info +* nr_swapfiles counter prevents them from running while the swap file +* is active and moving the extents. Note that this also prevents a +* concurrent device add which isn't actually necessary, but it's not +* really worth the trouble to allow it. +*/ + if (test_and_set_bit(BTRFS_FS_EXCL_OP, _info->flags)) + return -EBUSY; + atomic_inc(_info->nr_swapfiles); + /* +* Snapshots can create extents which require COW even if NODATACOW is +* set. We use this counter to prevent snapshots. We must increment it +* before walking the extents because we don't want a concurrent +* snapshot to run after we've already checked the extents. +*/ + atomic_inc(_I(inode)->root->nr_swapfiles); + + lock_extent_bits(io_tree, 0, isize - 1, _state); + start = 0; + while (start < isize) { + u64 end, logical_block_start, physical_block_start; + u64 len = isize - start; + + em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, start, len, 0); + if (IS_ERR(em)) { + ret = PTR_ERR(em); + goto out; +
[PATCH v5 2/6] mm: export add_swap_extent()
From: Omar Sandoval Btrfs will need this for swap file support. Signed-off-by: Omar Sandoval --- mm/swapfile.c | 1 + 1 file changed, 1 insertion(+) diff --git a/mm/swapfile.c b/mm/swapfile.c index d3f95833d12e..51cb30de17bc 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -2365,6 +2365,7 @@ add_swap_extent(struct swap_info_struct *sis, unsigned long start_page, list_add_tail(_se->list, >first_swap_extent.list); return 1; } +EXPORT_SYMBOL_GPL(add_swap_extent); /* * A `swap extent' is a simple thing which maps a contiguous range of pages -- 2.18.0
[PATCH v5 1/6] mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS
From: Omar Sandoval The SWP_FILE flag serves two purposes: to make swap_{read,write}page() go through the filesystem, and to make swapoff() call ->swap_deactivate(). For Btrfs, we want the latter but not the former, so split this flag into two. This makes us always call ->swap_deactivate() if ->swap_activate() succeeded, not just if it didn't add any swap extents itself. This also resolves the issue of the very misleading name of SWP_FILE, which is only used for swap files over NFS. Reviewed-by: Nikolay Borisov Signed-off-by: Omar Sandoval --- include/linux/swap.h | 13 +++-- mm/page_io.c | 6 +++--- mm/swapfile.c| 13 - 3 files changed, 18 insertions(+), 14 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 8e2c11e692ba..0fda0aa743f0 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -167,13 +167,14 @@ enum { SWP_SOLIDSTATE = (1 << 4), /* blkdev seeks are cheap */ SWP_CONTINUED = (1 << 5), /* swap_map has count continuation */ SWP_BLKDEV = (1 << 6), /* its a block device */ - SWP_FILE= (1 << 7), /* set after swap_activate success */ - SWP_AREA_DISCARD = (1 << 8),/* single-time swap area discards */ - SWP_PAGE_DISCARD = (1 << 9),/* freed swap page-cluster discards */ - SWP_STABLE_WRITES = (1 << 10), /* no overwrite PG_writeback pages */ - SWP_SYNCHRONOUS_IO = (1 << 11), /* synchronous IO is efficient */ + SWP_ACTIVATED = (1 << 7), /* set after swap_activate success */ + SWP_FS = (1 << 8), /* swap file goes through fs */ + SWP_AREA_DISCARD = (1 << 9),/* single-time swap area discards */ + SWP_PAGE_DISCARD = (1 << 10), /* freed swap page-cluster discards */ + SWP_STABLE_WRITES = (1 << 11), /* no overwrite PG_writeback pages */ + SWP_SYNCHRONOUS_IO = (1 << 12), /* synchronous IO is efficient */ /* add others here before... */ - SWP_SCANNING= (1 << 12),/* refcount in scan_swap_map */ + SWP_SCANNING= (1 << 13),/* refcount in scan_swap_map */ }; #define SWAP_CLUSTER_MAX 32UL diff --git a/mm/page_io.c b/mm/page_io.c index aafd19ec1db4..e8653c368069 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -283,7 +283,7 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc, struct swap_info_struct *sis = page_swap_info(page); VM_BUG_ON_PAGE(!PageSwapCache(page), page); - if (sis->flags & SWP_FILE) { + if (sis->flags & SWP_FS) { struct kiocb kiocb; struct file *swap_file = sis->swap_file; struct address_space *mapping = swap_file->f_mapping; @@ -365,7 +365,7 @@ int swap_readpage(struct page *page, bool synchronous) goto out; } - if (sis->flags & SWP_FILE) { + if (sis->flags & SWP_FS) { struct file *swap_file = sis->swap_file; struct address_space *mapping = swap_file->f_mapping; @@ -423,7 +423,7 @@ int swap_set_page_dirty(struct page *page) { struct swap_info_struct *sis = page_swap_info(page); - if (sis->flags & SWP_FILE) { + if (sis->flags & SWP_FS) { struct address_space *mapping = sis->swap_file->f_mapping; VM_BUG_ON_PAGE(!PageSwapCache(page), page); diff --git a/mm/swapfile.c b/mm/swapfile.c index d954b71c4f9c..d3f95833d12e 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -989,7 +989,7 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size) goto nextsi; } if (size == SWAPFILE_CLUSTER) { - if (!(si->flags & SWP_FILE)) + if (!(si->flags & SWP_FS)) n_ret = swap_alloc_cluster(si, swp_entries); } else n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE, @@ -2310,12 +2310,13 @@ static void destroy_swap_extents(struct swap_info_struct *sis) kfree(se); } - if (sis->flags & SWP_FILE) { + if (sis->flags & SWP_ACTIVATED) { struct file *swap_file = sis->swap_file; struct address_space *mapping = swap_file->f_mapping; - sis->flags &= ~SWP_FILE; - mapping->a_ops->swap_deactivate(swap_file); + sis->flags &= ~SWP_ACTIVATED; + if (mapping->a_ops->swap_deactivate) + mapping->a_ops->swap_deactivate(swap_file); } } @@ -2411,8 +2412,10 @@ static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span) if (mapping->a_ops->swap_activate) { ret = mapping->a_ops->swap_activate(sis, swap_file, span); + if (ret >= 0) + sis->flags |= SWP_ACTIVATED; if (!ret) { -
[PATCH v5 5/6] Btrfs: rename get_chunk_map() and make it non-static
From: Omar Sandoval The Btrfs swap code is going to need it, so give it a btrfs_ prefix and make it non-static. Signed-off-by: Omar Sandoval --- fs/btrfs/volumes.c | 22 +++--- fs/btrfs/volumes.h | 9 + 2 files changed, 20 insertions(+), 11 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index c03ef5322689..0aa8aff6774b 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -2712,8 +2712,8 @@ static int btrfs_del_sys_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset) return ret; } -static struct extent_map *get_chunk_map(struct btrfs_fs_info *fs_info, - u64 logical, u64 length) +struct extent_map *btrfs_get_chunk_map(struct btrfs_fs_info *fs_info, + u64 logical, u64 length) { struct extent_map_tree *em_tree; struct extent_map *em; @@ -2750,7 +2750,7 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset) int i, ret = 0; struct btrfs_fs_devices *fs_devices = fs_info->fs_devices; - em = get_chunk_map(fs_info, chunk_offset, 1); + em = btrfs_get_chunk_map(fs_info, chunk_offset, 1); if (IS_ERR(em)) { /* * This is a logic error, but we don't want to just rely on the @@ -4884,7 +4884,7 @@ int btrfs_finish_chunk_alloc(struct btrfs_trans_handle *trans, int i = 0; int ret = 0; - em = get_chunk_map(fs_info, chunk_offset, chunk_size); + em = btrfs_get_chunk_map(fs_info, chunk_offset, chunk_size); if (IS_ERR(em)) return PTR_ERR(em); @@ -5026,7 +5026,7 @@ int btrfs_chunk_readonly(struct btrfs_fs_info *fs_info, u64 chunk_offset) int miss_ndevs = 0; int i; - em = get_chunk_map(fs_info, chunk_offset, 1); + em = btrfs_get_chunk_map(fs_info, chunk_offset, 1); if (IS_ERR(em)) return 1; @@ -5086,7 +5086,7 @@ int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 logical, u64 len) struct map_lookup *map; int ret; - em = get_chunk_map(fs_info, logical, len); + em = btrfs_get_chunk_map(fs_info, logical, len); if (IS_ERR(em)) /* * We could return errors for these cases, but that could get @@ -5132,7 +5132,7 @@ unsigned long btrfs_full_stripe_len(struct btrfs_fs_info *fs_info, struct map_lookup *map; unsigned long len = fs_info->sectorsize; - em = get_chunk_map(fs_info, logical, len); + em = btrfs_get_chunk_map(fs_info, logical, len); if (!WARN_ON(IS_ERR(em))) { map = em->map_lookup; @@ -5149,7 +5149,7 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len) struct map_lookup *map; int ret = 0; - em = get_chunk_map(fs_info, logical, len); + em = btrfs_get_chunk_map(fs_info, logical, len); if(!WARN_ON(IS_ERR(em))) { map = em->map_lookup; @@ -5308,7 +5308,7 @@ static int __btrfs_map_block_for_discard(struct btrfs_fs_info *fs_info, /* discard always return a bbio */ ASSERT(bbio_ret); - em = get_chunk_map(fs_info, logical, length); + em = btrfs_get_chunk_map(fs_info, logical, length); if (IS_ERR(em)) return PTR_ERR(em); @@ -5634,7 +5634,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, return __btrfs_map_block_for_discard(fs_info, logical, *length, bbio_ret); - em = get_chunk_map(fs_info, logical, *length); + em = btrfs_get_chunk_map(fs_info, logical, *length); if (IS_ERR(em)) return PTR_ERR(em); @@ -5933,7 +5933,7 @@ int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start, u64 rmap_len; int i, j, nr = 0; - em = get_chunk_map(fs_info, chunk_start, 1); + em = btrfs_get_chunk_map(fs_info, chunk_start, 1); if (IS_ERR(em)) return -EIO; diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h index 23e9285d88de..d68c8a05a774 100644 --- a/fs/btrfs/volumes.h +++ b/fs/btrfs/volumes.h @@ -465,6 +465,15 @@ unsigned long btrfs_full_stripe_len(struct btrfs_fs_info *fs_info, int btrfs_finish_chunk_alloc(struct btrfs_trans_handle *trans, u64 chunk_offset, u64 chunk_size); int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset); +/** + * btrfs_get_chunk_map() - Find the mapping containing the given logical extent. + * @logical: Logical block offset in bytes. + * @length: Length of extent in bytes. + * + * Return: Chunk mapping or ERR_PTR. + */ +struct extent_map *btrfs_get_chunk_map(struct btrfs_fs_info *fs_info, + u64 logical, u64 length); static inline void btrfs_dev_stat_inc(struct btrfs_device *dev, int
Re: IO errors when building RAID1.... ?
If you want you can post the output from 'sudo smartctl -x /dev/sda' which will contain more information... but this is in some sense superfluous. The problem is very clearly a bad drive, the drive explicitly report to libata a write error, and included the sector LBA affected, and only the drive firmware would know that. It's not likely a cable problem or something like. And that the write error is reported at all means it's persistent, not transient. Chris Murphy
Re: IO errors when building RAID1.... ?
On Fri, Aug 31, 2018 at 10:35 AM, Pierre Couderc wrote: > > Aug 31 17:34:55 server su[559]: Successful su for root by nous > Aug 31 17:34:55 server su[559]: + /dev/pts/1 nous:root > Aug 31 17:34:55 server su[559]: pam_unix(su:session): session opened for > user root by nous(uid=1000) > Aug 31 17:34:55 server su[559]: pam_systemd(su:session): Cannot create > session: Already running in a session > Aug 31 17:35:03 server kernel: BTRFS info (device sda1): disk added > /dev/sdb1 > Aug 31 17:35:40 server kernel: BTRFS info (device sda1): relocating block > group 1103101952 flags 1 > Aug 31 17:36:12 server sshd[572]: Accepted password for nous from > 2a01:e34:eeaf:c5f0:e54:15ff:feb1:b1c9 port 49308 ssh2 > Aug 31 17:36:12 server sshd[572]: pam_unix(sshd:session): session opened for > user nous by (uid=0) > Aug 31 17:36:12 server systemd-logind[415]: New session 4 of user nous. > Aug 31 17:36:12 server systemd[1]: Started Session 4 of user nous. > Aug 31 17:36:16 server kernel: ata1: lost interrupt (Status 0x50) > Aug 31 17:36:16 server kernel: ata1.00: exception Emask 0x50 SAct 0x0 SErr > 0x40d0802 action 0xe frozen > Aug 31 17:36:16 server kernel: ata1.00: SError: { RecovComm HostInt > PHYRdyChg CommWake 10B8B DevExch } > Aug 31 17:36:16 server kernel: ata1.00: failed command: READ DMA > Aug 31 17:36:16 server kernel: ata1.00: cmd > c8/00:60:00:cd:02/00:00:00:00:00/e0 tag 0 dma 49152 in > res > 40/00:01:00:00:00/00:00:00:00:00/40 Emask 0x54 (ATA bus error) > Aug 31 17:36:16 server kernel: ata1.00: status: { DRDY } > Aug 31 17:36:16 server kernel: ata1.00: hard resetting link > Aug 31 17:36:17 server kernel: ata1.01: hard resetting link > Aug 31 17:36:18 server kernel: ata1.01: failed to resume link (SControl 0) > Aug 31 17:36:18 server kernel: ata1.00: SATA link up 6.0 Gbps (SStatus 133 > SControl 300) > Aug 31 17:36:18 server kernel: ata1.01: SATA link down (SStatus 4 SControl > 0) > Aug 31 17:36:18 server kernel: ata1.00: NODEV after polling detection > Aug 31 17:36:18 server kernel: ata1.00: revalidation failed (errno=-2) > Aug 31 17:36:20 server su[590]: Successful su for root by nous > Aug 31 17:36:20 server su[590]: + /dev/pts/2 nous:root > Aug 31 17:36:20 server su[590]: pam_unix(su:session): session opened for > user root by nous(uid=1000) > Aug 31 17:36:20 server su[590]: pam_systemd(su:session): Cannot create > session: Already running in a session > Aug 31 17:36:23 server kernel: ata1.00: hard resetting link > Aug 31 17:36:23 server kernel: ata1.01: hard resetting link > Aug 31 17:36:24 server kernel: ata1.01: failed to resume link (SControl 0) > Aug 31 17:36:25 server kernel: ata1.00: SATA link up 6.0 Gbps (SStatus 133 > SControl 300) > Aug 31 17:36:25 server kernel: ata1.01: SATA link down (SStatus 4 SControl > 0) > Aug 31 17:36:25 server kernel: ata1.00: NODEV after polling detection > Aug 31 17:36:25 server kernel: ata1.00: revalidation failed (errno=-2) > Aug 31 17:36:30 server kernel: ata1.00: hard resetting link > Aug 31 17:36:30 server kernel: ata1.01: hard resetting link > Aug 31 17:36:31 server kernel: ata1.01: failed to resume link (SControl 0) > Aug 31 17:36:31 server kernel: ata1.00: SATA link up 6.0 Gbps (SStatus 133 > SControl 300) > Aug 31 17:36:31 server kernel: ata1.01: SATA link down (SStatus 4 SControl > 0) > Aug 31 17:36:31 server kernel: ata1.00: NODEV after polling detection > Aug 31 17:36:31 server kernel: ata1.00: revalidation failed (errno=-2) > Aug 31 17:36:31 server kernel: ata1.00: disabled > Aug 31 17:36:36 server kernel: ata1.00: hard resetting link > Aug 31 17:36:37 server kernel: ata1.01: hard resetting link > Aug 31 17:36:38 server kernel: ata1.01: failed to resume link (SControl 0) > Aug 31 17:36:38 server kernel: ata1.00: SATA link up 6.0 Gbps (SStatus 133 > SControl 300) > Aug 31 17:36:38 server kernel: ata1.01: SATA link down (SStatus 4 SControl > 0) > Aug 31 17:36:38 server kernel: ata1.00: NODEV after polling detection > Aug 31 17:36:38 server kernel: sd 0:0:0:0: [sda] tag#0 FAILED Result: > hostbyte=DID_OK driverbyte=DRIVER_SENSE > Aug 31 17:36:38 server kernel: sd 0:0:0:0: [sda] tag#0 Sense Key : Illegal > Request [current] > Aug 31 17:36:38 server kernel: sd 0:0:0:0: [sda] tag#0 Add. Sense: Unaligned > write command > Aug 31 17:36:38 server kernel: sd 0:0:0:0: [sda] tag#0 CDB: Read(10) 28 00 > 00 02 cd 00 00 00 60 00 > Aug 31 17:36:38 server kernel: blk_update_request: I/O error, dev sda, > sector 183552 > Aug 31 17:36:38 server kernel: BTRFS error (device sda1): bdev /dev/sda1 > errs: wr 0, rd 1, flush 0, corrupt 0, gen 0 > Aug 31 17:36:38 server kernel: BTRFS error (device sda1): bdev /dev/sda1 > errs: wr 0, rd 2, flush 0, corrupt 0, gen 0 > Aug 31 17:36:38 server kernel: BTRFS error (device sda1): bdev /dev/sda1 > errs: wr 0, rd 3, flush 0, corrupt 0, gen 0 > Aug 31 17:36:38 server kernel: sd 0:0:0:0: rejecting I/O to offline device > Aug 31 17:36:38 server kernel: sd 0:0:0:0: [sda] killing request > Aug 31
bug: btrfs-progs scrub -R flag doesn't show per device stats
btrfs-progs v4.17.1 man btrfs-scrub: -R print raw statistics per-device instead of a summary However, on a two device Btrfs volume, -R does not show per device statistics. See screenshot: https://drive.google.com/open?id=1xmt_NHGlNJPc8I0F4_OZxgGe9b3quCnD Additionally, the description of -d and -R doesn't help me distinquish between the two. -R says "instead of a summary" so that suggests -d will summarize but isn't explicitly stated. -- Chris Murphy
Re: How to erase a RAID1 (+++)?
Alberto Bursi posted on Fri, 31 Aug 2018 14:54:46 + as excerpted: > I just keep around a USB drive with a full Linux system on it, to act as > "recovery". If the btrfs raid fails I boot into that and I can do > maintenance with a full graphical interface and internet access so I can > google things. I do very similar, except my "recovery boot" is my backup (with normally including for root two levels of backup/recovery available, three for some things). I've actually gone so far as to have /etc/fstab be a symlink to one of several files, depending on what version of root vs. the off-root filesystems I'm booting, with a set of modular files that get assembled by scripts to build the fstabs as appropriate. So updating fstab is a process of updating the modules, then running the scripts to create the actual fstabs, and after I update a root backup the last step is changing the symlink to point to the appropriate fstab for that backup, so it's correct if I end up booting from it. Meanwhile, each root, working and two backups, is its own set of two device partitions in btrfs raid1 mode. (One set of backups is on separate physical devices, covering the device death scenario, the other is on different partitions on the same, newer and larger pair of physical devices as the working set, so it won't cover device death but still covers fat-fingering, filesystem fubaring, bad upgrades, etc.) /boot is separate and there's four of those (working and three backups), one each on each device of the two physical pairs, with the bios able to point to any of the four. I run grub2, so once the bios loads that, I can interactively load kernels from any of the other three /boots and choose to boot any of the three roots. And I build my own kernels, with an initrd attached as an initramfs to each, and test that they boot. So selecting a kernel by definition selects its attached initramfs as well, meaning the initr*s are backed up and selected with the kernels. (As I said earlier it'd sure be nice to be able to do away with the initr*s again. I was actually thinking about testing that today, which was supposed to be a day off, but got called in to work, so the test will have to wait once again...) What's nice about all that is that just as you said, each recovery/backup is a snapshot of the working system at the time I took the backup, so it's not a limited recovery boot at all, it has the same access to tools, manpages, net, X/plasma, browsers, etc, that my normal system does, because it /is/ my normal system from whenever I took the backup. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
IO errors when building RAID1.... ?
When trying to build a RAID1 on main fs. After normal debian stretch install : root@server:/home/nous# btrfs device add /dev/sdb1 / root@server:/home/nous# btrfs fi show Label: none uuid: ef0b9dad-c0eb-4a3b-9b41-e5e249363abc Total devices 2 FS bytes used 824.60MiB devid 1 size 1.82TiB used 3.02GiB path /dev/sda1 devid 2 size 1.82TiB used 0.00B path /dev/sdb1 root@server:/home/nous# btrfs balance start -v -mconvert=raid1 -dconvert=raid1 / Dumping filters: flags 0x7, state 0x0, force is off DATA (flags 0x100): converting, target=16, soft is off METADATA (flags 0x100): converting, target=16, soft is off SYSTEM (flags 0x100): converting, target=16, soft is off Killed root@server:/home/nous# btrfs fi show Label: none uuid: ef0b9dad-c0eb-4a3b-9b41-e5e249363abc Total devices 2 FS bytes used 1.29GiB devid 2 size 1.82TiB used 1.00GiB path /dev/sdb1 *** Some devices missing Some IO errors on /dev/sda are found in journalctl (see them below) I cannot believe that /dev/sda has no hard disk errors when installing without problems, but has many ones when I "btrfs device add /dev/sdb1 /". I can reproduce the problem : reinstall (3times...) and try "btrfs device add /dev/sdb1 /" with the same results... Aug 31 17:34:55 server su[559]: Successful su for root by nous Aug 31 17:34:55 server su[559]: + /dev/pts/1 nous:root Aug 31 17:34:55 server su[559]: pam_unix(su:session): session opened for user root by nous(uid=1000) Aug 31 17:34:55 server su[559]: pam_systemd(su:session): Cannot create session: Already running in a session Aug 31 17:35:03 server kernel: BTRFS info (device sda1): disk added /dev/sdb1 Aug 31 17:35:40 server kernel: BTRFS info (device sda1): relocating block group 1103101952 flags 1 Aug 31 17:36:12 server sshd[572]: Accepted password for nous from 2a01:e34:eeaf:c5f0:e54:15ff:feb1:b1c9 port 49308 ssh2 Aug 31 17:36:12 server sshd[572]: pam_unix(sshd:session): session opened for user nous by (uid=0) Aug 31 17:36:12 server systemd-logind[415]: New session 4 of user nous. Aug 31 17:36:12 server systemd[1]: Started Session 4 of user nous. Aug 31 17:36:16 server kernel: ata1: lost interrupt (Status 0x50) Aug 31 17:36:16 server kernel: ata1.00: exception Emask 0x50 SAct 0x0 SErr 0x40d0802 action 0xe frozen Aug 31 17:36:16 server kernel: ata1.00: SError: { RecovComm HostInt PHYRdyChg CommWake 10B8B DevExch } Aug 31 17:36:16 server kernel: ata1.00: failed command: READ DMA Aug 31 17:36:16 server kernel: ata1.00: cmd c8/00:60:00:cd:02/00:00:00:00:00/e0 tag 0 dma 49152 in res 40/00:01:00:00:00/00:00:00:00:00/40 Emask 0x54 (ATA bus error) Aug 31 17:36:16 server kernel: ata1.00: status: { DRDY } Aug 31 17:36:16 server kernel: ata1.00: hard resetting link Aug 31 17:36:17 server kernel: ata1.01: hard resetting link Aug 31 17:36:18 server kernel: ata1.01: failed to resume link (SControl 0) Aug 31 17:36:18 server kernel: ata1.00: SATA link up 6.0 Gbps (SStatus 133 SControl 300) Aug 31 17:36:18 server kernel: ata1.01: SATA link down (SStatus 4 SControl 0) Aug 31 17:36:18 server kernel: ata1.00: NODEV after polling detection Aug 31 17:36:18 server kernel: ata1.00: revalidation failed (errno=-2) Aug 31 17:36:20 server su[590]: Successful su for root by nous Aug 31 17:36:20 server su[590]: + /dev/pts/2 nous:root Aug 31 17:36:20 server su[590]: pam_unix(su:session): session opened for user root by nous(uid=1000) Aug 31 17:36:20 server su[590]: pam_systemd(su:session): Cannot create session: Already running in a session Aug 31 17:36:23 server kernel: ata1.00: hard resetting link Aug 31 17:36:23 server kernel: ata1.01: hard resetting link Aug 31 17:36:24 server kernel: ata1.01: failed to resume link (SControl 0) Aug 31 17:36:25 server kernel: ata1.00: SATA link up 6.0 Gbps (SStatus 133 SControl 300) Aug 31 17:36:25 server kernel: ata1.01: SATA link down (SStatus 4 SControl 0) Aug 31 17:36:25 server kernel: ata1.00: NODEV after polling detection Aug 31 17:36:25 server kernel: ata1.00: revalidation failed (errno=-2) Aug 31 17:36:30 server kernel: ata1.00: hard resetting link Aug 31 17:36:30 server kernel: ata1.01: hard resetting link Aug 31 17:36:31 server kernel: ata1.01: failed to resume link (SControl 0) Aug 31 17:36:31 server kernel: ata1.00: SATA link up 6.0 Gbps (SStatus 133 SControl 300) Aug 31 17:36:31 server kernel: ata1.01: SATA link down (SStatus 4 SControl 0) Aug 31 17:36:31 server kernel: ata1.00: NODEV after polling detection Aug 31 17:36:31 server kernel: ata1.00: revalidation failed (errno=-2) Aug 31 17:36:31 server kernel: ata1.00: disabled Aug 31 17:36:36 server kernel: ata1.00: hard resetting link Aug 31 17:36:37 server kernel: ata1.01: hard resetting link Aug 31 17:36:38 server kernel: ata1.01: failed to resume link (SControl 0) Aug 31 17:36:38 server kernel: ata1.00: SATA link up 6.0 Gbps (SStatus 133 SControl 300) Aug 31 17:36:38 server kernel: ata1.01: SATA
Re: How to erase a RAID1 (+++)?
On 08/31/2018 04:54 PM, Alberto Bursi wrote: On 8/31/2018 8:53 AM, Pierre Couderc wrote: OK, I have understood the message... I was planning that as you said "semi-routinely", and I understand btrfs is not soon enough ready, and I am very very far to be a specialist as you are. So, I shall mount my RAID1 very standard, and I shall expect the disaster, hoping it does not occur Now, I shall try to absorb all that... Thank you very much ! I just keep around a USB drive with a full Linux system on it, to act as "recovery". If the btrfs raid fails I boot into that and I can do maintenance with a full graphical interface and internet access so I can google things. Of course on a home server you can't do that without some automation that will switch boot device after some amount of boot failures of the main OS. While if your server has BMC (lights-out management) you can switch boot device through that. -Alberto Thank you Alberto, yes,I have some other computers to react in case of failure.
Re: How to erase a RAID1 (+++)?
On 8/31/2018 8:53 AM, Pierre Couderc wrote: > > OK, I have understood the message... I was planning that as you said > "semi-routinely", and I understand btrfs is not soon enough ready, and > I am very very far to be a specialist as you are. > So, I shall mount my RAID1 very standard, and I shall expect the > disaster, hoping it does not occur > Now, I shall try to absorb all that... > > Thank you very much ! > I just keep around a USB drive with a full Linux system on it, to act as "recovery". If the btrfs raid fails I boot into that and I can do maintenance with a full graphical interface and internet access so I can google things. Of course on a home server you can't do that without some automation that will switch boot device after some amount of boot failures of the main OS. While if your server has BMC (lights-out management) you can switch boot device through that. -Alberto
Re: [PATCH 01/35] btrfs: add btrfs_delete_ref_head helper
On Fri, Aug 31, 2018 at 10:57:45AM +0300, Nikolay Borisov wrote: > > > On 30.08.2018 20:41, Josef Bacik wrote: > > From: Josef Bacik > > > > We do this dance in cleanup_ref_head and check_ref_cleanup, unify it > > into a helper and cleanup the calling functions. > > > > Signed-off-by: Josef Bacik > > --- > > fs/btrfs/delayed-ref.c | 14 ++ > > fs/btrfs/delayed-ref.h | 3 ++- > > fs/btrfs/extent-tree.c | 24 > > 3 files changed, 20 insertions(+), 21 deletions(-) > > > > diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c > > index 62ff545ba1f7..3a9e4ac21794 100644 > > --- a/fs/btrfs/delayed-ref.c > > +++ b/fs/btrfs/delayed-ref.c > > @@ -393,6 +393,20 @@ btrfs_select_ref_head(struct btrfs_trans_handle *trans) > > return head; > > } > > > > +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs, > > + struct btrfs_delayed_ref_head *head) > > +{ > > + lockdep_assert_held(_refs->lock); > > + lockdep_assert_held(>lock); > > + > > + rb_erase(>href_node, _refs->href_root); > > + RB_CLEAR_NODE(>href_node); > > + atomic_dec(_refs->num_entries); > > + delayed_refs->num_heads--; > > + if (head->processing == 0) > > + delayed_refs->num_heads_ready--; > > +} > > + > > /* > > * Helper to insert the ref_node to the tail or merge with tail. > > * > > diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h > > index d9f2a4ebd5db..7769177b489e 100644 > > --- a/fs/btrfs/delayed-ref.h > > +++ b/fs/btrfs/delayed-ref.h > > @@ -261,7 +261,8 @@ static inline void btrfs_delayed_ref_unlock(struct > > btrfs_delayed_ref_head *head) > > { > > mutex_unlock(>mutex); > > } > > - > > +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs, > > + struct btrfs_delayed_ref_head *head); > > > > struct btrfs_delayed_ref_head * > > btrfs_select_ref_head(struct btrfs_trans_handle *trans); > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > > index f77226d8020a..6799950fa057 100644 > > --- a/fs/btrfs/extent-tree.c > > +++ b/fs/btrfs/extent-tree.c > > @@ -2492,12 +2492,9 @@ static int cleanup_ref_head(struct > > btrfs_trans_handle *trans, > > spin_unlock(_refs->lock); > > return 1; > > } > > - delayed_refs->num_heads--; > > - rb_erase(>href_node, _refs->href_root); > > - RB_CLEAR_NODE(>href_node); > > - spin_unlock(>lock); > > + btrfs_delete_ref_head(delayed_refs, head); > > spin_unlock(_refs->lock); > > - atomic_dec(_refs->num_entries); > > + spin_unlock(>lock); > > > > Again, the feedback of reversed lock-order is not addressed: > > https://www.spinics.net/lists/linux-btrfs/msg80482.html > Oops, I'll fix this. Josef
Re: [PATCH 15/35] btrfs: run delayed iputs before committing
On Fri, Aug 31, 2018 at 10:55:22AM +0300, Nikolay Borisov wrote: > > > On 30.08.2018 20:42, Josef Bacik wrote: > > We want to have a complete picture of any delayed inode updates before > > we make the decision to commit or not, so make sure we run delayed iputs > > before making the decision to commit or not. > > Again, there was request for more detail which is not addressed: > > https://www.spinics.net/lists/linux-btrfs/msg81237.html I'll make the changelog more explicit, thanks, Josef
Re: [PATCH 07/35] btrfs: dump block_rsv whe dumping space info
On Fri, Aug 31, 2018 at 10:53:54AM +0300, Nikolay Borisov wrote: > > > On 30.08.2018 20:41, Josef Bacik wrote: > > For enospc_debug having the block rsvs is super helpful to see if we've > > done something wrong. > > > > Signed-off-by: Josef Bacik > > --- > > fs/btrfs/extent-tree.c | 15 +++ > > 1 file changed, 15 insertions(+) > > > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > > index 80615a579b18..df826f713034 100644 > > --- a/fs/btrfs/extent-tree.c > > +++ b/fs/btrfs/extent-tree.c > > @@ -7910,6 +7910,15 @@ static noinline int find_free_extent(struct > > btrfs_fs_info *fs_info, > > return ret; > > } > > > > +static void dump_block_rsv(struct btrfs_block_rsv *rsv) > > +{ > > + spin_lock(>lock); > > + printk(KERN_ERR "%d: size %llu reserved %llu\n", > > + rsv->type, (unsigned long long)rsv->size, > > + (unsigned long long)rsv->reserved); > > + spin_unlock(>lock); > > +} > > There was feedback on this hunk which you seem to have ignored: > > https://www.spinics.net/lists/linux-btrfs/msg80473.html > Yup good point, I'll fix this up. Thanks, Josef
Re: [PATCH 04/35] btrfs: only track ref_heads in delayed_ref_updates
On Fri, Aug 31, 2018 at 10:52:55AM +0300, Nikolay Borisov wrote: > > > On 30.08.2018 20:41, Josef Bacik wrote: > > From: Josef Bacik > > > > We use this number to figure out how many delayed refs to run, but > > __btrfs_run_delayed_refs really only checks every time we need a new > > delayed ref head, so we always run at least one ref head completely no > > matter what the number of items on it. So instead track only the ref > > heads added by this trans handle and adjust the counting appropriately > > in __btrfs_run_delayed_refs. > > > > Signed-off-by: Josef Bacik > > --- > > fs/btrfs/delayed-ref.c | 3 --- > > fs/btrfs/extent-tree.c | 5 + > > 2 files changed, 1 insertion(+), 7 deletions(-) > > > > diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c > > index 3a9e4ac21794..27f7dd4e3d52 100644 > > --- a/fs/btrfs/delayed-ref.c > > +++ b/fs/btrfs/delayed-ref.c > > @@ -234,8 +234,6 @@ static inline void drop_delayed_ref(struct > > btrfs_trans_handle *trans, > > ref->in_tree = 0; > > btrfs_put_delayed_ref(ref); > > atomic_dec(_refs->num_entries); > > - if (trans->delayed_ref_updates) > > - trans->delayed_ref_updates--; > > There was feedback on this particular hunk and you've completely ignored > it, that's not nice: > > https://www.spinics.net/lists/linux-btrfs/msg80514.html I just missed it in the last go around (as is the case for the other ones). I'm not sure what part is confusing, we only want delayed_ref_updates to be how many delayed ref heads there are, which is what this patch is changing. I could probably split this between these two changes and the count changing below since they are slightly different things, I'll do that. Thanks, Josef
Re: [PATCH 30/35] btrfs: cleanup pending bgs on transaction abort
On Fri, Aug 31, 2018 at 10:48:58AM +0300, Nikolay Borisov wrote: > > > On 30.08.2018 20:42, Josef Bacik wrote: > > We may abort the transaction during a commit and not have a chance to > > run the pending bgs stuff, which will leave block groups on our list and > > cause us accounting issues and leaked memory. Fix this by running the > > pending bgs when we cleanup a transaction. > > > > Signed-off-by: Josef Bacik > > --- > > fs/btrfs/transaction.c | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c > > index 89d14f135837..0f39a0d302d3 100644 > > --- a/fs/btrfs/transaction.c > > +++ b/fs/btrfs/transaction.c > > @@ -2273,6 +2273,7 @@ int btrfs_commit_transaction(struct > > btrfs_trans_handle *trans) > > btrfs_scrub_continue(fs_info); > > cleanup_transaction: > > btrfs_trans_release_metadata(trans); > > + btrfs_create_pending_block_groups(trans); > > And now you've basically hi-jacked btrfs_create_pending_block_groups to > just act as "delete all bg" in case transaction is aborted. Considering > this and the previous patch I'd rather you replace them with a single > one which introduces a new function delete_pending_bgs or whatever and > use that. This will be more explicit and self-documenting. > I haven't hi-jacked it and I'm not adding another helper when we already have code that does the right thing. Remember if we abort a transaction in a path that doesn't commit we still end up calling btrfs_create_pending_block_groups() in btrfs_end_transaction which does the cleanup there, I'm not adding a bunch of new code for a case that's easily handled with the previous fix and works the same way in other paths. Thanks, Josef
Re: [PATCH 29/35] btrfs: just delete pending bgs if we are aborted
On Fri, Aug 31, 2018 at 10:46:36AM +0300, Nikolay Borisov wrote: > > > On 30.08.2018 20:42, Josef Bacik wrote: > > We still need to do all of the accounting cleanup for pending block > > groups if we abort. So set the ret to trans->aborted so if we aborted > > the cleanup happens and everybody is happy. > > > > Signed-off-by: Josef Bacik > > --- > > fs/btrfs/extent-tree.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > > index 90f267f4dd0f..132a1157982c 100644 > > --- a/fs/btrfs/extent-tree.c > > +++ b/fs/btrfs/extent-tree.c > > @@ -10333,7 +10333,7 @@ void btrfs_create_pending_block_groups(struct > > btrfs_trans_handle *trans) > > struct btrfs_root *extent_root = fs_info->extent_root; > > struct btrfs_block_group_item item; > > struct btrfs_key key; > > - int ret = 0; > > + int ret = trans->aborted; > > This is really subtle and magical and not obvious from the context of > the patch, but if the transaction is aborted this will change the loop > to actually just delete all block groups in ->new_bgs. I'd rather have > an explicit loop for that honestly. We need it this way in case creating the bg's errors out anyway, there's no sense in adding a bunch of code to do something we have to handle already anyway. Thanks, Josef
Re: [PATCH 27/35] btrfs: handle delayed ref head accounting cleanup in abort
On Fri, Aug 31, 2018 at 10:42:13AM +0300, Nikolay Borisov wrote: > > > On 30.08.2018 20:42, Josef Bacik wrote: > > We weren't doing any of the accounting cleanup when we aborted > > transactions. Fix this by making cleanup_ref_head_accounting global and > > calling it from the abort code, this fixes the issue where our > > accounting was all wrong after the fs aborts. > > > > Signed-off-by: Josef Bacik > > --- > > fs/btrfs/ctree.h | 5 + > > fs/btrfs/disk-io.c | 1 + > > fs/btrfs/extent-tree.c | 13 ++--- > > 3 files changed, 12 insertions(+), 7 deletions(-) > > > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > > index 791e287c2292..67923b2030b8 100644 > > --- a/fs/btrfs/ctree.h > > +++ b/fs/btrfs/ctree.h > > @@ -35,6 +35,7 @@ > > struct btrfs_trans_handle; > > struct btrfs_transaction; > > struct btrfs_pending_snapshot; > > +struct btrfs_delayed_ref_root; > > extern struct kmem_cache *btrfs_trans_handle_cachep; > > extern struct kmem_cache *btrfs_bit_radix_cachep; > > extern struct kmem_cache *btrfs_path_cachep; > > @@ -2624,6 +2625,10 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle > > *trans, > >unsigned long count); > > int btrfs_async_run_delayed_refs(struct btrfs_fs_info *fs_info, > > unsigned long count, u64 transid, int wait); > > +void > > +btrfs_cleanup_ref_head_accounting(struct btrfs_fs_info *fs_info, > > + struct btrfs_delayed_ref_root *delayed_refs, > > + struct btrfs_delayed_ref_head *head); > > int btrfs_lookup_data_extent(struct btrfs_fs_info *fs_info, u64 start, u64 > > len); > > int btrfs_lookup_extent_info(struct btrfs_trans_handle *trans, > > struct btrfs_fs_info *fs_info, u64 bytenr, > > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c > > index 1d3f5731d616..caaca8154a1a 100644 > > --- a/fs/btrfs/disk-io.c > > +++ b/fs/btrfs/disk-io.c > > @@ -4240,6 +4240,7 @@ static int btrfs_destroy_delayed_refs(struct > > btrfs_transaction *trans, > > if (pin_bytes) > > btrfs_pin_extent(fs_info, head->bytenr, > > head->num_bytes, 1); > > + btrfs_cleanup_ref_head_accounting(fs_info, delayed_refs, head); > > btrfs_put_delayed_ref_head(head); > > cond_resched(); > > spin_lock(_refs->lock); > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > > index 32579221d900..031d2b11ddee 100644 > > --- a/fs/btrfs/extent-tree.c > > +++ b/fs/btrfs/extent-tree.c > > @@ -2466,12 +2466,11 @@ static int cleanup_extent_op(struct > > btrfs_trans_handle *trans, > > return ret ? ret : 1; > > } > > > > -static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans, > > - struct btrfs_delayed_ref_head *head) > > +void > > +btrfs_cleanup_ref_head_accounting(struct btrfs_fs_info *fs_info, > > + struct btrfs_delayed_ref_root *delayed_refs, > > + struct btrfs_delayed_ref_head *head) > > { > I don't see any reason to change the signature of the function, the new > call sites have valid transaction handles where you can obtain > references to fs_info/delayed_refs. Just stick with adding btrfs_ prefix > and exporting it. > We don't have a valid transaction handle in btrfs_destroy_delayed_refs because we can call it at umount time when we no longer have a trans handle. Thanks, Josef
Re: [PATCH 22/35] btrfs: make sure we create all new bgs
On Fri, Aug 31, 2018 at 10:31:49AM +0300, Nikolay Borisov wrote: > > > On 30.08.2018 20:42, Josef Bacik wrote: > > We can actually allocate new chunks while we're creating our bg's, so > > instead of doing list_for_each_safe, just do while (!list_empty()) so we > > make sure to catch any new bg's that get added to the list. > > HOw can this occur, please elaborate and put an example callstack in the > commit log. > Eh? We're modifying the extent tree and chunk tree, which can cause bg's to be allocated, it's just common sense. Josef
Re: [PATCH 01/35] btrfs: add btrfs_delete_ref_head helper
On 30.08.2018 20:41, Josef Bacik wrote: > From: Josef Bacik > > We do this dance in cleanup_ref_head and check_ref_cleanup, unify it > into a helper and cleanup the calling functions. > > Signed-off-by: Josef Bacik > --- > fs/btrfs/delayed-ref.c | 14 ++ > fs/btrfs/delayed-ref.h | 3 ++- > fs/btrfs/extent-tree.c | 24 > 3 files changed, 20 insertions(+), 21 deletions(-) > > diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c > index 62ff545ba1f7..3a9e4ac21794 100644 > --- a/fs/btrfs/delayed-ref.c > +++ b/fs/btrfs/delayed-ref.c > @@ -393,6 +393,20 @@ btrfs_select_ref_head(struct btrfs_trans_handle *trans) > return head; > } > > +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs, > +struct btrfs_delayed_ref_head *head) > +{ > + lockdep_assert_held(_refs->lock); > + lockdep_assert_held(>lock); > + > + rb_erase(>href_node, _refs->href_root); > + RB_CLEAR_NODE(>href_node); > + atomic_dec(_refs->num_entries); > + delayed_refs->num_heads--; > + if (head->processing == 0) > + delayed_refs->num_heads_ready--; > +} > + > /* > * Helper to insert the ref_node to the tail or merge with tail. > * > diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h > index d9f2a4ebd5db..7769177b489e 100644 > --- a/fs/btrfs/delayed-ref.h > +++ b/fs/btrfs/delayed-ref.h > @@ -261,7 +261,8 @@ static inline void btrfs_delayed_ref_unlock(struct > btrfs_delayed_ref_head *head) > { > mutex_unlock(>mutex); > } > - > +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs, > +struct btrfs_delayed_ref_head *head); > > struct btrfs_delayed_ref_head * > btrfs_select_ref_head(struct btrfs_trans_handle *trans); > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index f77226d8020a..6799950fa057 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -2492,12 +2492,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle > *trans, > spin_unlock(_refs->lock); > return 1; > } > - delayed_refs->num_heads--; > - rb_erase(>href_node, _refs->href_root); > - RB_CLEAR_NODE(>href_node); > - spin_unlock(>lock); > + btrfs_delete_ref_head(delayed_refs, head); > spin_unlock(_refs->lock); > - atomic_dec(_refs->num_entries); > + spin_unlock(>lock); > Again, the feedback of reversed lock-order is not addressed: https://www.spinics.net/lists/linux-btrfs/msg80482.html > trace_run_delayed_ref_head(fs_info, head, 0); > > @@ -6984,22 +6981,9 @@ static noinline int check_ref_cleanup(struct > btrfs_trans_handle *trans, > if (!mutex_trylock(>mutex)) > goto out; > > - /* > - * at this point we have a head with no other entries. Go > - * ahead and process it. > - */ > - rb_erase(>href_node, _refs->href_root); > - RB_CLEAR_NODE(>href_node); > - atomic_dec(_refs->num_entries); > - > - /* > - * we don't take a ref on the node because we're removing it from the > - * tree, so we just steal the ref the tree was holding. > - */ > - delayed_refs->num_heads--; > - if (head->processing == 0) > - delayed_refs->num_heads_ready--; > + btrfs_delete_ref_head(delayed_refs, head); > head->processing = 0; > + > spin_unlock(>lock); > spin_unlock(_refs->lock); > >
Re: [PATCH 15/35] btrfs: run delayed iputs before committing
On 30.08.2018 20:42, Josef Bacik wrote: > We want to have a complete picture of any delayed inode updates before > we make the decision to commit or not, so make sure we run delayed iputs > before making the decision to commit or not. Again, there was request for more detail which is not addressed: https://www.spinics.net/lists/linux-btrfs/msg81237.html > > Signed-off-by: Josef Bacik > --- > fs/btrfs/extent-tree.c | 4 > 1 file changed, 4 insertions(+) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 7c0e99e1f56c..064db7ebaf67 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -4831,6 +4831,10 @@ static int may_commit_transaction(struct btrfs_fs_info > *fs_info, > goto commit; > } > > + mutex_lock(_info->cleaner_delayed_iput_mutex); > + btrfs_run_delayed_iputs(fs_info); > + mutex_unlock(_info->cleaner_delayed_iput_mutex); > + > spin_lock(_rsv->lock); > reclaim_bytes += delayed_rsv->reserved; > spin_unlock(_rsv->lock); >
Re: [PATCH 07/35] btrfs: dump block_rsv whe dumping space info
On 30.08.2018 20:41, Josef Bacik wrote: > For enospc_debug having the block rsvs is super helpful to see if we've > done something wrong. > > Signed-off-by: Josef Bacik > --- > fs/btrfs/extent-tree.c | 15 +++ > 1 file changed, 15 insertions(+) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 80615a579b18..df826f713034 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -7910,6 +7910,15 @@ static noinline int find_free_extent(struct > btrfs_fs_info *fs_info, > return ret; > } > > +static void dump_block_rsv(struct btrfs_block_rsv *rsv) > +{ > + spin_lock(>lock); > + printk(KERN_ERR "%d: size %llu reserved %llu\n", > +rsv->type, (unsigned long long)rsv->size, > +(unsigned long long)rsv->reserved); > + spin_unlock(>lock); > +} There was feedback on this hunk which you seem to have ignored: https://www.spinics.net/lists/linux-btrfs/msg80473.html > + > static void dump_space_info(struct btrfs_fs_info *fs_info, > struct btrfs_space_info *info, u64 bytes, > int dump_block_groups) > @@ -7929,6 +7938,12 @@ static void dump_space_info(struct btrfs_fs_info > *fs_info, > info->bytes_readonly); > spin_unlock(>lock); > > + dump_block_rsv(_info->global_block_rsv); > + dump_block_rsv(_info->trans_block_rsv); > + dump_block_rsv(_info->chunk_block_rsv); > + dump_block_rsv(_info->delayed_block_rsv); > + dump_block_rsv(_info->delayed_refs_rsv); > + > if (!dump_block_groups) > return; > >
Re: [PATCH 04/35] btrfs: only track ref_heads in delayed_ref_updates
On 30.08.2018 20:41, Josef Bacik wrote: > From: Josef Bacik > > We use this number to figure out how many delayed refs to run, but > __btrfs_run_delayed_refs really only checks every time we need a new > delayed ref head, so we always run at least one ref head completely no > matter what the number of items on it. So instead track only the ref > heads added by this trans handle and adjust the counting appropriately > in __btrfs_run_delayed_refs. > > Signed-off-by: Josef Bacik > --- > fs/btrfs/delayed-ref.c | 3 --- > fs/btrfs/extent-tree.c | 5 + > 2 files changed, 1 insertion(+), 7 deletions(-) > > diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c > index 3a9e4ac21794..27f7dd4e3d52 100644 > --- a/fs/btrfs/delayed-ref.c > +++ b/fs/btrfs/delayed-ref.c > @@ -234,8 +234,6 @@ static inline void drop_delayed_ref(struct > btrfs_trans_handle *trans, > ref->in_tree = 0; > btrfs_put_delayed_ref(ref); > atomic_dec(_refs->num_entries); > - if (trans->delayed_ref_updates) > - trans->delayed_ref_updates--; There was feedback on this particular hunk and you've completely ignored it, that's not nice: https://www.spinics.net/lists/linux-btrfs/msg80514.html > } > > static bool merge_ref(struct btrfs_trans_handle *trans, > @@ -460,7 +458,6 @@ static int insert_delayed_ref(struct btrfs_trans_handle > *trans, > if (ref->action == BTRFS_ADD_DELAYED_REF) > list_add_tail(>add_list, >ref_add_list); > atomic_inc(>num_entries); > - trans->delayed_ref_updates++; > spin_unlock(>lock); > return ret; > } > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 87c42a2c45b1..20531389a20a 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -2583,6 +2583,7 @@ static noinline int __btrfs_run_delayed_refs(struct > btrfs_trans_handle *trans, > spin_unlock(_refs->lock); > break; > } > + count++; > > /* grab the lock that says we are going to process >* all the refs for this head */ > @@ -2596,7 +2597,6 @@ static noinline int __btrfs_run_delayed_refs(struct > btrfs_trans_handle *trans, >*/ > if (ret == -EAGAIN) { > locked_ref = NULL; > - count++; > continue; > } > } > @@ -2624,7 +2624,6 @@ static noinline int __btrfs_run_delayed_refs(struct > btrfs_trans_handle *trans, > unselect_delayed_ref_head(delayed_refs, locked_ref); > locked_ref = NULL; > cond_resched(); > - count++; > continue; > } > > @@ -2642,7 +2641,6 @@ static noinline int __btrfs_run_delayed_refs(struct > btrfs_trans_handle *trans, > return ret; > } > locked_ref = NULL; > - count++; > continue; > } > > @@ -2693,7 +2691,6 @@ static noinline int __btrfs_run_delayed_refs(struct > btrfs_trans_handle *trans, > } > > btrfs_put_delayed_ref(ref); > - count++; > cond_resched(); > } > >
Re: [PATCH 30/35] btrfs: cleanup pending bgs on transaction abort
On 30.08.2018 20:42, Josef Bacik wrote: > We may abort the transaction during a commit and not have a chance to > run the pending bgs stuff, which will leave block groups on our list and > cause us accounting issues and leaked memory. Fix this by running the > pending bgs when we cleanup a transaction. > > Signed-off-by: Josef Bacik > --- > fs/btrfs/transaction.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c > index 89d14f135837..0f39a0d302d3 100644 > --- a/fs/btrfs/transaction.c > +++ b/fs/btrfs/transaction.c > @@ -2273,6 +2273,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle > *trans) > btrfs_scrub_continue(fs_info); > cleanup_transaction: > btrfs_trans_release_metadata(trans); > + btrfs_create_pending_block_groups(trans); And now you've basically hi-jacked btrfs_create_pending_block_groups to just act as "delete all bg" in case transaction is aborted. Considering this and the previous patch I'd rather you replace them with a single one which introduces a new function delete_pending_bgs or whatever and use that. This will be more explicit and self-documenting. > btrfs_trans_release_chunk_metadata(trans); > trans->block_rsv = NULL; > btrfs_warn(fs_info, "Skipping commit of aborted transaction."); >
Re: [PATCH 29/35] btrfs: just delete pending bgs if we are aborted
On 30.08.2018 20:42, Josef Bacik wrote: > We still need to do all of the accounting cleanup for pending block > groups if we abort. So set the ret to trans->aborted so if we aborted > the cleanup happens and everybody is happy. > > Signed-off-by: Josef Bacik > --- > fs/btrfs/extent-tree.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 90f267f4dd0f..132a1157982c 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -10333,7 +10333,7 @@ void btrfs_create_pending_block_groups(struct > btrfs_trans_handle *trans) > struct btrfs_root *extent_root = fs_info->extent_root; > struct btrfs_block_group_item item; > struct btrfs_key key; > - int ret = 0; > + int ret = trans->aborted; This is really subtle and magical and not obvious from the context of the patch, but if the transaction is aborted this will change the loop to actually just delete all block groups in ->new_bgs. I'd rather have an explicit loop for that honestly. > bool can_flush_pending_bgs = trans->can_flush_pending_bgs; > > trans->can_flush_pending_bgs = false; >
Re: [PATCH 28/35] btrfs: call btrfs_create_pending_block_groups unconditionally
On 30.08.2018 20:42, Josef Bacik wrote: > The first thing we do is loop through the list, this > > if (!list_empty()) > btrfs_create_pending_block_groups(); > > thing is just wasted space. > > Signed-off-by: Josef Bacik Makes sense, although it would have been ideal if this patch followed directly your " btrfs: make sure we create all new bgs" one. Anyway: Reviewed-by: Nikolay Borisov > --- > fs/btrfs/extent-tree.c | 3 +-- > fs/btrfs/transaction.c | 6 ++ > 2 files changed, 3 insertions(+), 6 deletions(-) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 031d2b11ddee..90f267f4dd0f 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -2970,8 +2970,7 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle > *trans, > } > > if (run_all) { > - if (!list_empty(>new_bgs)) > - btrfs_create_pending_block_groups(trans); > + btrfs_create_pending_block_groups(trans); > > spin_lock(_refs->lock); > node = rb_first(_refs->href_root); > diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c > index 2bb19e2ded5e..89d14f135837 100644 > --- a/fs/btrfs/transaction.c > +++ b/fs/btrfs/transaction.c > @@ -839,8 +839,7 @@ static int __btrfs_end_transaction(struct > btrfs_trans_handle *trans, > btrfs_trans_release_metadata(trans); > trans->block_rsv = NULL; > > - if (!list_empty(>new_bgs)) > - btrfs_create_pending_block_groups(trans); > + btrfs_create_pending_block_groups(trans); > > btrfs_trans_release_chunk_metadata(trans); > > @@ -1927,8 +1926,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle > *trans) > cur_trans->delayed_refs.flushing = 1; > smp_wmb(); > > - if (!list_empty(>new_bgs)) > - btrfs_create_pending_block_groups(trans); > + btrfs_create_pending_block_groups(trans); > > if (!test_bit(BTRFS_TRANS_DIRTY_BG_RUN, _trans->flags)) { > int run_it = 0; >
Re: [PATCH 27/35] btrfs: handle delayed ref head accounting cleanup in abort
On 30.08.2018 20:42, Josef Bacik wrote: > We weren't doing any of the accounting cleanup when we aborted > transactions. Fix this by making cleanup_ref_head_accounting global and > calling it from the abort code, this fixes the issue where our > accounting was all wrong after the fs aborts. > > Signed-off-by: Josef Bacik > --- > fs/btrfs/ctree.h | 5 + > fs/btrfs/disk-io.c | 1 + > fs/btrfs/extent-tree.c | 13 ++--- > 3 files changed, 12 insertions(+), 7 deletions(-) > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > index 791e287c2292..67923b2030b8 100644 > --- a/fs/btrfs/ctree.h > +++ b/fs/btrfs/ctree.h > @@ -35,6 +35,7 @@ > struct btrfs_trans_handle; > struct btrfs_transaction; > struct btrfs_pending_snapshot; > +struct btrfs_delayed_ref_root; > extern struct kmem_cache *btrfs_trans_handle_cachep; > extern struct kmem_cache *btrfs_bit_radix_cachep; > extern struct kmem_cache *btrfs_path_cachep; > @@ -2624,6 +2625,10 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle > *trans, > unsigned long count); > int btrfs_async_run_delayed_refs(struct btrfs_fs_info *fs_info, >unsigned long count, u64 transid, int wait); > +void > +btrfs_cleanup_ref_head_accounting(struct btrfs_fs_info *fs_info, > + struct btrfs_delayed_ref_root *delayed_refs, > + struct btrfs_delayed_ref_head *head); > int btrfs_lookup_data_extent(struct btrfs_fs_info *fs_info, u64 start, u64 > len); > int btrfs_lookup_extent_info(struct btrfs_trans_handle *trans, >struct btrfs_fs_info *fs_info, u64 bytenr, > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c > index 1d3f5731d616..caaca8154a1a 100644 > --- a/fs/btrfs/disk-io.c > +++ b/fs/btrfs/disk-io.c > @@ -4240,6 +4240,7 @@ static int btrfs_destroy_delayed_refs(struct > btrfs_transaction *trans, > if (pin_bytes) > btrfs_pin_extent(fs_info, head->bytenr, >head->num_bytes, 1); > + btrfs_cleanup_ref_head_accounting(fs_info, delayed_refs, head); > btrfs_put_delayed_ref_head(head); > cond_resched(); > spin_lock(_refs->lock); > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 32579221d900..031d2b11ddee 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -2466,12 +2466,11 @@ static int cleanup_extent_op(struct > btrfs_trans_handle *trans, > return ret ? ret : 1; > } > > -static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans, > - struct btrfs_delayed_ref_head *head) > +void > +btrfs_cleanup_ref_head_accounting(struct btrfs_fs_info *fs_info, > + struct btrfs_delayed_ref_root *delayed_refs, > + struct btrfs_delayed_ref_head *head) > { I don't see any reason to change the signature of the function, the new call sites have valid transaction handles where you can obtain references to fs_info/delayed_refs. Just stick with adding btrfs_ prefix and exporting it. > - struct btrfs_fs_info *fs_info = trans->fs_info; > - struct btrfs_delayed_ref_root *delayed_refs = > - >transaction->delayed_refs; > int nr_items = 1; > > if (head->total_ref_mod < 0) { > @@ -2549,7 +2548,7 @@ static int cleanup_ref_head(struct btrfs_trans_handle > *trans, > } > } > > - cleanup_ref_head_accounting(trans, head); > + btrfs_cleanup_ref_head_accounting(fs_info, delayed_refs, head); > > trace_run_delayed_ref_head(fs_info, head, 0); > btrfs_delayed_ref_unlock(head); > @@ -7191,7 +7190,7 @@ static noinline int check_ref_cleanup(struct > btrfs_trans_handle *trans, > if (head->must_insert_reserved) > ret = 1; > > - cleanup_ref_head_accounting(trans, head); > + btrfs_cleanup_ref_head_accounting(trans->fs_info, delayed_refs, head); > mutex_unlock(>mutex); > btrfs_put_delayed_ref_head(head); > return ret; >
Re: [PATCH 26/35] btrfs: make btrfs_destroy_delayed_refs use btrfs_delete_ref_head
On 30.08.2018 20:42, Josef Bacik wrote: > Instead of open coding this stuff use the helper instead. > > Signed-off-by: Josef Bacik Reviewed-by: Nikolay Borisov > --- > fs/btrfs/disk-io.c | 7 +-- > 1 file changed, 1 insertion(+), 6 deletions(-) > > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c > index c72ab2ca7627..1d3f5731d616 100644 > --- a/fs/btrfs/disk-io.c > +++ b/fs/btrfs/disk-io.c > @@ -4232,12 +4232,7 @@ static int btrfs_destroy_delayed_refs(struct > btrfs_transaction *trans, > if (head->must_insert_reserved) > pin_bytes = true; > btrfs_free_delayed_extent_op(head->extent_op); > - delayed_refs->num_heads--; > - if (head->processing == 0) > - delayed_refs->num_heads_ready--; > - atomic_dec(_refs->num_entries); > - rb_erase(>href_node, _refs->href_root); > - RB_CLEAR_NODE(>href_node); > + btrfs_delete_ref_head(delayed_refs, head); > spin_unlock(>lock); > spin_unlock(_refs->lock); > mutex_unlock(>mutex); >
Re: [PATCH 25/35] btrfs: make btrfs_destroy_delayed_refs use btrfs_delayed_ref_lock
On 30.08.2018 20:42, Josef Bacik wrote: > We have this open coded in btrfs_destroy_delayed_refs, use the helper > instead. > > Signed-off-by: Josef Bacik Reviewed-by: Nikolay Borisov > --- > fs/btrfs/disk-io.c | 11 ++- > 1 file changed, 2 insertions(+), 9 deletions(-) > > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c > index 11ea2ea7439e..c72ab2ca7627 100644 > --- a/fs/btrfs/disk-io.c > +++ b/fs/btrfs/disk-io.c > @@ -4214,16 +4214,9 @@ static int btrfs_destroy_delayed_refs(struct > btrfs_transaction *trans, > > head = rb_entry(node, struct btrfs_delayed_ref_head, > href_node); > - if (!mutex_trylock(>mutex)) { > - refcount_inc(>refs); > - spin_unlock(_refs->lock); > - > - mutex_lock(>mutex); > - mutex_unlock(>mutex); > - btrfs_put_delayed_ref_head(head); > - spin_lock(_refs->lock); > + if (btrfs_delayed_ref_lock(delayed_refs, head)) > continue; > - } > + > spin_lock(>lock); > while ((n = rb_first(>ref_tree)) != NULL) { > ref = rb_entry(n, struct btrfs_delayed_ref_node, >
Re: [PATCH 24/35] btrfs: pass delayed_refs_root to btrfs_delayed_ref_lock
On 30.08.2018 20:42, Josef Bacik wrote: > We don't need the trans except to get the delayed_refs_root, so just > pass the delayed_refs_root into btrfs_delayed_ref_lock and call it a > day. > > Signed-off-by: Josef Bacik Reviewed-by: Nikolay Borisov > --- > fs/btrfs/delayed-ref.c | 5 + > fs/btrfs/delayed-ref.h | 2 +- > fs/btrfs/extent-tree.c | 2 +- > 3 files changed, 3 insertions(+), 6 deletions(-) > > diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c > index 96ce087747b2..87778645bf4a 100644 > --- a/fs/btrfs/delayed-ref.c > +++ b/fs/btrfs/delayed-ref.c > @@ -197,12 +197,9 @@ find_ref_head(struct rb_root *root, u64 bytenr, > return NULL; > } > > -int btrfs_delayed_ref_lock(struct btrfs_trans_handle *trans, > +int btrfs_delayed_ref_lock(struct btrfs_delayed_ref_root *delayed_refs, > struct btrfs_delayed_ref_head *head) > { > - struct btrfs_delayed_ref_root *delayed_refs; > - > - delayed_refs = >transaction->delayed_refs; > lockdep_assert_held(_refs->lock); > if (mutex_trylock(>mutex)) > return 0; > diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h > index 7769177b489e..ee636d7a710a 100644 > --- a/fs/btrfs/delayed-ref.h > +++ b/fs/btrfs/delayed-ref.h > @@ -255,7 +255,7 @@ void btrfs_merge_delayed_refs(struct btrfs_trans_handle > *trans, > struct btrfs_delayed_ref_head * > btrfs_find_delayed_ref_head(struct btrfs_delayed_ref_root *delayed_refs, > u64 bytenr); > -int btrfs_delayed_ref_lock(struct btrfs_trans_handle *trans, > +int btrfs_delayed_ref_lock(struct btrfs_delayed_ref_root *delayed_refs, > struct btrfs_delayed_ref_head *head); > static inline void btrfs_delayed_ref_unlock(struct btrfs_delayed_ref_head > *head) > { > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index fc30ff96f0d6..32579221d900 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -2591,7 +2591,7 @@ static noinline int __btrfs_run_delayed_refs(struct > btrfs_trans_handle *trans, > > /* grab the lock that says we are going to process >* all the refs for this head */ > - ret = btrfs_delayed_ref_lock(trans, locked_ref); > + ret = btrfs_delayed_ref_lock(delayed_refs, locked_ref); > spin_unlock(_refs->lock); > /* >* we may have dropped the spin lock to get the head >
Re: [PATCH 22/35] btrfs: make sure we create all new bgs
On 30.08.2018 20:42, Josef Bacik wrote: > We can actually allocate new chunks while we're creating our bg's, so > instead of doing list_for_each_safe, just do while (!list_empty()) so we > make sure to catch any new bg's that get added to the list. HOw can this occur, please elaborate and put an example callstack in the commit log. > > Signed-off-by: Josef Bacik > --- > fs/btrfs/extent-tree.c | 7 +-- > 1 file changed, 5 insertions(+), 2 deletions(-) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index ca98c39308f6..fc30ff96f0d6 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -10331,7 +10331,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info > *info) > void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans) > { > struct btrfs_fs_info *fs_info = trans->fs_info; > - struct btrfs_block_group_cache *block_group, *tmp; > + struct btrfs_block_group_cache *block_group; > struct btrfs_root *extent_root = fs_info->extent_root; > struct btrfs_block_group_item item; > struct btrfs_key key; > @@ -10339,7 +10339,10 @@ void btrfs_create_pending_block_groups(struct > btrfs_trans_handle *trans) > bool can_flush_pending_bgs = trans->can_flush_pending_bgs; > > trans->can_flush_pending_bgs = false; > - list_for_each_entry_safe(block_group, tmp, >new_bgs, bg_list) { > + while (!list_empty(>new_bgs)) { > + block_group = list_first_entry(>new_bgs, > +struct btrfs_block_group_cache, > +bg_list); > if (ret) > goto next; > >
Re: How to erase a RAID1 (+++)?
On 08/31/2018 04:29 AM, Duncan wrote: Chris Murphy posted on Thu, 30 Aug 2018 11:08:28 -0600 as excerpted: My purpose is a simple RAID1 main fs, with bootable flag on the 2 disks in prder to start in degraded mode Good luck with this. The Btrfs archives are full of various limitations of Btrfs raid1. There is no automatic degraded mount for Btrfs. And if you persistently ask for degraded mount, you run the risk of other problems if there's merely a delayed discovery of one of the devices. Once a Btrfs volume is degraded, it does not automatically resume normal operation just because the formerly missing device becomes available. So... this is flat out not suitable for use cases where you need unattended raid1 degraded boot. Agreeing in general and adding some detail... 1) Are you intending to use an initr*? I'm not sure the current status (I actually need to test again for myself), but at least in the past, booting a btrfs raid1 rootfs required an initr*, and I have and use one here, for that purpose alone (until switching to btrfs raid1 root, I went initr*-less, and would prefer that again, due to the complications of maintaining an initr*). The base problem is that with raid1 (or other forms of multi-device btrfs, but it happens to be raid1 that's in question for both you and I) the filesystem needs multiple devices to complete the filesystem and the kernel's root= parameter takes only one. When mounting after userspace is up, a btrfs device scan is normally run (often automatically by udev) before the mount, that lets btrfs in the kernel track what devices belong to what filesystems, so pointing to just one of the devices is enough because the kernel knows from that what filesystem is intended and can match up the others that go with it from the earlier scan. Now there's a btrfs mount option, device=/dev/*, that can be provided more than once for additional devices, that can /normally/ be used to tell the kernel what specific devices to use, bypassing the need for btrfs device scan, and in /theory/, passing that like other mount options in the kernel commandline via rootflags= /should/ "just work". But for reasons I as a btrfs user (not dev, and definitely not kernel or btrfs dev) don't fully understand, passing device= via rootflags= is, or at least was, broken, so properly mounting a multi-device btrfs required (and may still require) userspace, thus for a multi-device btrfs rootfs, an initr*. So direct-booting to a multi-device btrfs rootfs didn't normally work. It would if you passed rootflags=degraded (at least with a two-device raid1 so the one device passed in root= contained one copy of everything), but then it was unclear if the additional device was successfully added to the raid1 later, or not. And with no automatic sync and bringing back to undegraded status, it was a risk I didn't want to take. So unfortunately, initr* it was! But I originally tested that when I setup my own btrfs raid1 rootfs very long ago in kernel and btrfs terms, kernel 3.6 or so IIRC, and while I've not /seen/ anything definitive on-list to suggest rootflags=device= is unbroken now (I asked recently and got an affirmative reply, but I asked for clarification and I've not seen it, tho perhaps it's there and I've not read it yet), perhaps I missed it. And I've not retested lately, tho I really should as while I asked I guess the only real way to know is to try it for myself, and it'd definitely be nice to be direct-booting without having to bother with an initr*, again. 2) As both Chris and I alluded to, unlike say mdraid, btrfs doesn't (yet) have an automatic mechanism to re-sync and "undegrade" after having been mounted degraded,rw. A btrfs scrub can be run to re-sync raid1 chunks, but single chunks may have been added while in the degraded state as well, and those need a balance convert to raid1 mode, before the filesystem and data on it can be be considered reliably able to withstand device loss once again. In fact, while the problem has been fixed now, for quite awhile if the filesystem was mounted degraded,rw, you often had exactly that one mount to fix the problem, as new chunks would be written in single mode and after that the filesystem would refuse to mount writable,degraded, and would only let you mount degraded,ro, which would let you get data off it but not let you fix the problem. Word to the wise if you're planning on running stable-debian (which tend to be older) kernels, or even just trying to use them for recovery if you need to! (The fix was to have a mount check if at least one copy of all chunks were available and allow rw mounting if so, instead of simply assuming that any single-mode chunks at all meant some wouldn't be available on a multi-device filesystem with a device missing, thus forcing read-only mode only mounting, as it used to do.) 3) If a btrfs raid1 is mounted degraded,rw with one device missing, then mounted again degraded,rw, with a different device
Re: How to erase a RAID1 (+++)?
On 08/30/2018 07:08 PM, Chris Murphy wrote: On Thu, Aug 30, 2018 at 3:13 AM, Pierre Couderc wrote: Trying to install a RAID1 on a debian stretch, I made some mistake and got this, after installing on disk1 and trying to add second disk : root@server:~# fdisk -l Disk /dev/sda: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors Units: sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disklabel type: dos Disk identifier: 0x2a799300 Device Boot StartEndSectors Size Id Type /dev/sda1 * 2048 3907028991 3907026944 1.8T 83 Linux Disk /dev/sdb: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors Units: sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disklabel type: dos Disk identifier: 0x9770f6fa Device Boot StartEndSectors Size Id Type /dev/sdb1 * 2048 3907029167 3907027120 1.8T 5 Extended Extended partition type is not a problem if you're using GRUB as the bootloader; other bootloaders may not like this. Strictly speaking the type code 0x05 is incorrect, GRUB ignores type code, as does the kernel. GRUB also ignores the active bit (boot flag). And : root@server:~# btrfs fi show Label: none uuid: eed65d24-6501-4991-94bd-6c3baf2af1ed Total devices 2 FS bytes used 1.10GiB devid1 size 1.82TiB used 4.02GiB path /dev/sda1 devid2 size 1.00KiB used 0.00B path /dev/sdb1 That's odd; and I know you've moved on from this problem but I would have liked to see the super for /dev/sdb1 and also the installer log for what commands were used for partitioning, including mkfs and device add commands. For what it's worth, 'btrfs dev add' formats the device being added, it does not need to be formatted in advance, and also it resizes the file system properly. Thank you, in fact my system seems more and more broken, so I cannot go further.. For example, a simple df gives me an IO error. I prefer reinstall the whole. But to avoid the same problems, is there somewhere an howto "install a basic RAID1 btrfs debian stretch system" ? OK, I install strech on 1 disk and then how do I "format" and add the second disk ? My purpose is a simple RAID1 main fs, with bootable flag on the 2 disks in prder to start in degraded mode Good luck with this. The Btrfs archives are full of various limitations of Btrfs raid1. There is no automatic degraded mount for Btrfs. And if you persistently ask for degraded mount, you run the risk of other problems if there's merely a delayed discovery of one of the devices. Once a Btrfs volume is degraded, it does not automatically resume normal operation just because the formerly missing device becomes available. So... this is flat out not suitable for use cases where you need unattended raid1 degraded boot. Well, I understnad the lesson that there is no hope currently to boot a degraded system... Thank you very much