Re: [PATCH 1/2] btrfs: catch cow on deleting snapshots
On 12/1/18 1:19 AM, Qu Wenruo wrote: > > > On 2018/12/1 上午12:52, Josef Bacik wrote: >> From: Josef Bacik >> >> When debugging some weird extent reference bug I suspected that we were >> changing a snapshot while we were deleting it, which could explain my >> bug. > > May I ask under which case we're going to modify an unlinked snapshot? > > Maybe metadata relocation? I have some old logging from a filesystem that got corrupted when doing a subvolume delete while doing metadata balance: https://syrinx.knorrie.org/~knorrie/btrfs/domokun-balance-skinny-lockup.txt This is one of the very few moments when I really lost a btrfs filesystem and had to mkfs again to move on. I'm wondering if this one looks related. I never found an explanation for it. Hans > > Thanks, > Qu > >> This was indeed what was happening, and this patch helped me >> verify my theory. It is never correct to modify the snapshot once it's >> being deleted, so mark the root when we are deleting it and make sure we >> complain about it when it happens. >> >> Signed-off-by: Josef Bacik >> --- >> fs/btrfs/ctree.c | 3 +++ >> fs/btrfs/ctree.h | 1 + >> fs/btrfs/extent-tree.c | 9 + >> 3 files changed, 13 insertions(+) >> >> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c >> index 5912a97b07a6..5f82f86085e8 100644 >> --- a/fs/btrfs/ctree.c >> +++ b/fs/btrfs/ctree.c >> @@ -1440,6 +1440,9 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle >> *trans, >> u64 search_start; >> int ret; >> >> +if (test_bit(BTRFS_ROOT_DELETING, &root->state)) >> +WARN(1, KERN_CRIT "cow'ing blocks on a fs root thats being >> dropped\n"); >> + >> if (trans->transaction != fs_info->running_transaction) >> WARN(1, KERN_CRIT "trans %llu running %llu\n", >> trans->transid, >> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h >> index facde70c15ed..5a3a94ccb65c 100644 >> --- a/fs/btrfs/ctree.h >> +++ b/fs/btrfs/ctree.h >> @@ -1199,6 +1199,7 @@ enum { >> BTRFS_ROOT_FORCE_COW, >> BTRFS_ROOT_MULTI_LOG_TASKS, >> BTRFS_ROOT_DIRTY, >> +BTRFS_ROOT_DELETING, >> }; >> >> /* >> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >> index 581c2a0b2945..dcb699dd57f3 100644 >> --- a/fs/btrfs/extent-tree.c >> +++ b/fs/btrfs/extent-tree.c >> @@ -9333,6 +9333,15 @@ int btrfs_drop_snapshot(struct btrfs_root *root, >> if (block_rsv) >> trans->block_rsv = block_rsv; >> >> +/* >> + * This will help us catch people modifying the fs tree while we're >> + * dropping it. It is unsafe to mess with the fs tree while it's being >> + * dropped as we unlock the root node and parent nodes as we walk down >> + * the tree, assuming nothing will change. If something does change >> + * then we'll have stale information and drop references to blocks we've >> + * already dropped. >> + */ >> +set_bit(BTRFS_ROOT_DELETING, &root->state); >> if (btrfs_disk_key_objectid(&root_item->drop_progress) == 0) { >> level = btrfs_header_level(root->node); >> path->nodes[level] = btrfs_lock_root_node(root); >> > -- Hans van Kranenburg signature.asc Description: OpenPGP digital signature
Re: [PATCH 1/2] btrfs: catch cow on deleting snapshots
On 2018/12/1 上午12:52, Josef Bacik wrote: > From: Josef Bacik > > When debugging some weird extent reference bug I suspected that we were > changing a snapshot while we were deleting it, which could explain my > bug. May I ask under which case we're going to modify an unlinked snapshot? Maybe metadata relocation? Thanks, Qu > This was indeed what was happening, and this patch helped me > verify my theory. It is never correct to modify the snapshot once it's > being deleted, so mark the root when we are deleting it and make sure we > complain about it when it happens. > > Signed-off-by: Josef Bacik > --- > fs/btrfs/ctree.c | 3 +++ > fs/btrfs/ctree.h | 1 + > fs/btrfs/extent-tree.c | 9 + > 3 files changed, 13 insertions(+) > > diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c > index 5912a97b07a6..5f82f86085e8 100644 > --- a/fs/btrfs/ctree.c > +++ b/fs/btrfs/ctree.c > @@ -1440,6 +1440,9 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle > *trans, > u64 search_start; > int ret; > > + if (test_bit(BTRFS_ROOT_DELETING, &root->state)) > + WARN(1, KERN_CRIT "cow'ing blocks on a fs root thats being > dropped\n"); > + > if (trans->transaction != fs_info->running_transaction) > WARN(1, KERN_CRIT "trans %llu running %llu\n", > trans->transid, > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > index facde70c15ed..5a3a94ccb65c 100644 > --- a/fs/btrfs/ctree.h > +++ b/fs/btrfs/ctree.h > @@ -1199,6 +1199,7 @@ enum { > BTRFS_ROOT_FORCE_COW, > BTRFS_ROOT_MULTI_LOG_TASKS, > BTRFS_ROOT_DIRTY, > + BTRFS_ROOT_DELETING, > }; > > /* > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 581c2a0b2945..dcb699dd57f3 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -9333,6 +9333,15 @@ int btrfs_drop_snapshot(struct btrfs_root *root, > if (block_rsv) > trans->block_rsv = block_rsv; > > + /* > + * This will help us catch people modifying the fs tree while we're > + * dropping it. It is unsafe to mess with the fs tree while it's being > + * dropped as we unlock the root node and parent nodes as we walk down > + * the tree, assuming nothing will change. If something does change > + * then we'll have stale information and drop references to blocks we've > + * already dropped. > + */ > + set_bit(BTRFS_ROOT_DELETING, &root->state); > if (btrfs_disk_key_objectid(&root_item->drop_progress) == 0) { > level = btrfs_header_level(root->node); > path->nodes[level] = btrfs_lock_root_node(root); > signature.asc Description: OpenPGP digital signature
Re: Need help with potential ~45TB dataloss
On 2018/11/30 下午9:53, Patrick Dijkgraaf wrote: > Hi all, > > I have been a happy BTRFS user for quite some time. But now I'm facing > a potential ~45TB dataloss... :-( > I hope someone can help! > > I have Server A and Server B. Both having a 20-devices BTRFS RAID6 > filesystem. Because of known RAID5/6 risks, Server B was a backup of > Server A. > After applying updates to server B and reboot, the FS would not mount > anymore. Because it was "just" a backup. I decided to recreate the FS > and perform a new backup. Later, I discovered that the FS was not > broken, but I faced this issue: > https://patchwork.kernel.org/patch/10694997/ Sorry for the inconvenience. I didn't realize the max_chunk_size limit isn't reliable at that timing. > > Anyway, the FS was already recreated, so I needed to do a new backup. > During the backup (using rsync -vah), Server A (the source) encountered > an I/O error and my rsync failed. In an attempt to "quick fix" the > issue, I rebooted Server A after which the FS would not mount anymore. Did you have any dmesg about that IO error? And how is the reboot scheduled? Forced power off or normal reboot command? > > I documented what I have tried, below. I have not yet tried anything > except what is shown, because I am afraid of causing more harm to > the FS. Pretty clever, no btrfs check --repair is a pretty good move. > I hope somebody here can give me advice on how to (hopefully) > retrieve my data... > > Thanks in advance! > > == > > [root@cornelis ~]# btrfs fi show > Label: 'cornelis-btrfs' uuid: ac643516-670e-40f3-aa4c-f329fc3795fd > Total devices 1 FS bytes used 463.92GiB > devid1 size 800.00GiB used 493.02GiB path > /dev/mapper/cornelis-cornelis--btrfs > > Label: 'data' uuid: 4c66fa8b-8fc6-4bba-9d83-02a2a1d69ad5 > Total devices 20 FS bytes used 44.85TiB > devid1 size 3.64TiB used 3.64TiB path /dev/sdn2 > devid2 size 3.64TiB used 3.64TiB path /dev/sdp2 > devid3 size 3.64TiB used 3.64TiB path /dev/sdu2 > devid4 size 3.64TiB used 3.64TiB path /dev/sdx2 > devid5 size 3.64TiB used 3.64TiB path /dev/sdh2 > devid6 size 3.64TiB used 3.64TiB path /dev/sdg2 > devid7 size 3.64TiB used 3.64TiB path /dev/sdm2 > devid8 size 3.64TiB used 3.64TiB path /dev/sdw2 > devid9 size 3.64TiB used 3.64TiB path /dev/sdj2 > devid 10 size 3.64TiB used 3.64TiB path /dev/sdt2 > devid 11 size 3.64TiB used 3.64TiB path /dev/sdk2 > devid 12 size 3.64TiB used 3.64TiB path /dev/sdq2 > devid 13 size 3.64TiB used 3.64TiB path /dev/sds2 > devid 14 size 3.64TiB used 3.64TiB path /dev/sdf2 > devid 15 size 7.28TiB used 588.80GiB path /dev/sdr2 > devid 16 size 7.28TiB used 588.80GiB path /dev/sdo2 > devid 17 size 7.28TiB used 588.80GiB path /dev/sdv2 > devid 18 size 7.28TiB used 588.80GiB path /dev/sdi2 > devid 19 size 7.28TiB used 588.80GiB path /dev/sdl2 > devid 20 size 7.28TiB used 588.80GiB path /dev/sde2 > > [root@cornelis ~]# mount /dev/sdn2 /mnt/data > mount: /mnt/data: wrong fs type, bad option, bad superblock on > /dev/sdn2, missing codepage or helper program, or other error. What is the dmesg of the mount failure? And have you tried -o ro,degraded ? > > [root@cornelis ~]# btrfs check /dev/sdn2 > Opening filesystem to check... > parent transid verify failed on 46451963543552 wanted 114401 found > 114173 > parent transid verify failed on 46451963543552 wanted 114401 found > 114173 > checksum verify failed on 46451963543552 found A8F2A769 wanted 4C111ADF > checksum verify failed on 46451963543552 found 32153BE8 wanted 8B07ABE4 > checksum verify failed on 46451963543552 found 32153BE8 wanted 8B07ABE4 > bad tree block 46451963543552, bytenr mismatch, want=46451963543552, > have=75208089814272 > Couldn't read tree root Would you please also paste the output of "btrfs ins dump-super /dev/sdn2" ? It looks like your tree root (or at least some tree root nodes/leaves get corrupted) > ERROR: cannot open file system And since it's your tree root corrupted, you could also try "btrfs-find-root " to try to get a good old copy of your tree root. But I suspect the corruption happens before you noticed, thus the old tree root may not help much. Also, the output of "btrfs ins dump-tree -t root " will help. Thanks, Qu > > [root@cornelis ~]# btrfs restore /dev/sdn2 /mnt/data/ > parent transid verify failed on 46451963543552 wanted 114401 found > 114173 > parent transid verify failed on 46451963543552 wanted 114401 found > 114173 > checksum verify failed on 46451963543552 found A8F2A769 wanted 4C111ADF > checksum verify failed on 46451963543552 found 32153BE8 wanted 8B07ABE4 > checksum verify failed on 46451963543552 found 32153BE8 wanted 8B07ABE4 > bad tree block 46451963543552, bytenr mismatch, want=46451963543552, > have=75208089814272 > Couldn't read tree
Re: [PATCH 2/2] btrfs: run delayed items before dropping the snapshot
On Fri, Nov 30, 2018 at 5:12 PM Filipe Manana wrote: > > On Fri, Nov 30, 2018 at 4:53 PM Josef Bacik wrote: > > > > From: Josef Bacik > > > > With my delayed refs patches in place we started seeing a large amount > > of aborts in __btrfs_free_extent > > > > BTRFS error (device sdb1): unable to find ref byte nr 91947008 parent 0 > > root 35964 owner 1 offset 0 > > Call Trace: > > ? btrfs_merge_delayed_refs+0xaf/0x340 > > __btrfs_run_delayed_refs+0x6ea/0xfc0 > > ? btrfs_set_path_blocking+0x31/0x60 > > btrfs_run_delayed_refs+0xeb/0x180 > > btrfs_commit_transaction+0x179/0x7f0 > > ? btrfs_check_space_for_delayed_refs+0x30/0x50 > > ? should_end_transaction.isra.19+0xe/0x40 > > btrfs_drop_snapshot+0x41c/0x7c0 > > btrfs_clean_one_deleted_snapshot+0xb5/0xd0 > > cleaner_kthread+0xf6/0x120 > > kthread+0xf8/0x130 > > ? btree_invalidatepage+0x90/0x90 > > ? kthread_bind+0x10/0x10 > > ret_from_fork+0x35/0x40 > > > > This was because btrfs_drop_snapshot depends on the root not being modified > > while it's dropping the snapshot. It will unlock the root node (and really > > every node) as it walks down the tree, only to re-lock it when it needs to > > do > > something. This is a problem because if we modify the tree we could cow a > > block > > in our path, which free's our reference to that block. Then once we get > > back to > > that shared block we'll free our reference to it again, and get ENOENT when > > trying to lookup our extent reference to that block in __btrfs_free_extent. > > > > This is ultimately happening because we have delayed items left to be > > processed > > for our deleted snapshot _after_ all of the inodes are closed for the > > snapshot. > > We only run the delayed inode item if we're deleting the inode, and even > > then we > > do not run the delayed insertions or delayed removals. These can be run at > > any > > point after our final inode does it's last iput, which is what triggers the > > snapshot deletion. We can end up with the snapshot deletion happening and > > then > > have the delayed items run on that file system, resulting in the above > > problem. > > > > This problem has existed forever, however my patches made it much easier to > > hit > > as I wake up the cleaner much more often to deal with delayed iputs, which > > made > > us more likely to start the snapshot dropping work before the transaction > > commits, which is when the delayed items would generally be run. Before, > > generally speaking, we would run the delayed items, commit the transaction, > > and > > wakeup the cleaner thread to start deleting snapshots, which means we were > > less > > likely to hit this problem. You could still hit it if you had multiple > > snapshots to be deleted and ended up with lots of delayed items, but it was > > definitely harder. > > > > Fix for now by simply running all the delayed items before starting to drop > > the > > snapshot. We could make this smarter in the future by making the delayed > > items > > per-root, and then simply drop any delayed items for roots that we are > > going to > > delete. But for now just a quick and easy solution is the safest. > > > > Cc: sta...@vger.kernel.org > > Signed-off-by: Josef Bacik > > Reviewed-by: Filipe Manana > > Great catch! > I've hit this error from __btrfs_free_extent() a handful of times > over the years, but never managed > to reproduce it on demand or figure out it was related to snapshot deletion. > > > --- > > fs/btrfs/extent-tree.c | 2 ++ > > 1 file changed, 2 insertions(+) > > > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > > index dcb699dd57f3..965702034b22 100644 > > --- a/fs/btrfs/extent-tree.c > > +++ b/fs/btrfs/extent-tree.c > > @@ -9330,6 +9330,8 @@ int btrfs_drop_snapshot(struct btrfs_root *root, > > goto out_free; > > } > > > > + btrfs_run_delayed_items(trans); > > + Btw, we should check the return value of this and return it if it's an error? We can't do nothing with it in the context of the cleaner thread, which is why, I suppose, you chose to ignore the value (besides that the error might have been for some other root). But this can be used in the context of relocation, where we can return the error back to userspace. Thanks. > > if (block_rsv) > > trans->block_rsv = block_rsv; > > > > -- > > 2.14.3 > > > > > -- > Filipe David Manana, > > “Whether you think you can, or you think you can't — you're right.” -- Filipe David Manana, “Whether you think you can, or you think you can't — you're right.”
Re: [PATCH 1/2] btrfs: catch cow on deleting snapshots
On Fri, Nov 30, 2018 at 05:14:54PM +, Filipe Manana wrote: > On Fri, Nov 30, 2018 at 4:53 PM Josef Bacik wrote: > > > > From: Josef Bacik > > > > When debugging some weird extent reference bug I suspected that we were > > changing a snapshot while we were deleting it, which could explain my > > bug. This was indeed what was happening, and this patch helped me > > verify my theory. It is never correct to modify the snapshot once it's > > being deleted, so mark the root when we are deleting it and make sure we > > complain about it when it happens. > > > > Signed-off-by: Josef Bacik > > --- > > fs/btrfs/ctree.c | 3 +++ > > fs/btrfs/ctree.h | 1 + > > fs/btrfs/extent-tree.c | 9 + > > 3 files changed, 13 insertions(+) > > > > diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c > > index 5912a97b07a6..5f82f86085e8 100644 > > --- a/fs/btrfs/ctree.c > > +++ b/fs/btrfs/ctree.c > > @@ -1440,6 +1440,9 @@ noinline int btrfs_cow_block(struct > > btrfs_trans_handle *trans, > > u64 search_start; > > int ret; > > > > + if (test_bit(BTRFS_ROOT_DELETING, &root->state)) > > + WARN(1, KERN_CRIT "cow'ing blocks on a fs root thats being > > dropped\n"); > > Please use btrfs_warn(), it makes sure we use a consistent message > style, identifies the fs, etc. > Also, "thats" should be "that is" or "that's". > Ah yeah, I was following the other convention in there but we should probably convert all of those to btrfs_warn. I'll fix the grammer thing as well, just a leftover from the much less code of conduct friendly message I originally had there. Thanks, Josef
Re: [PATCH 1/2] btrfs: catch cow on deleting snapshots
On Fri, Nov 30, 2018 at 4:53 PM Josef Bacik wrote: > > From: Josef Bacik > > When debugging some weird extent reference bug I suspected that we were > changing a snapshot while we were deleting it, which could explain my > bug. This was indeed what was happening, and this patch helped me > verify my theory. It is never correct to modify the snapshot once it's > being deleted, so mark the root when we are deleting it and make sure we > complain about it when it happens. > > Signed-off-by: Josef Bacik > --- > fs/btrfs/ctree.c | 3 +++ > fs/btrfs/ctree.h | 1 + > fs/btrfs/extent-tree.c | 9 + > 3 files changed, 13 insertions(+) > > diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c > index 5912a97b07a6..5f82f86085e8 100644 > --- a/fs/btrfs/ctree.c > +++ b/fs/btrfs/ctree.c > @@ -1440,6 +1440,9 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle > *trans, > u64 search_start; > int ret; > > + if (test_bit(BTRFS_ROOT_DELETING, &root->state)) > + WARN(1, KERN_CRIT "cow'ing blocks on a fs root thats being > dropped\n"); Please use btrfs_warn(), it makes sure we use a consistent message style, identifies the fs, etc. Also, "thats" should be "that is" or "that's". With that fixed, Reviewed-by: Filipe Manana > + > if (trans->transaction != fs_info->running_transaction) > WARN(1, KERN_CRIT "trans %llu running %llu\n", >trans->transid, > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > index facde70c15ed..5a3a94ccb65c 100644 > --- a/fs/btrfs/ctree.h > +++ b/fs/btrfs/ctree.h > @@ -1199,6 +1199,7 @@ enum { > BTRFS_ROOT_FORCE_COW, > BTRFS_ROOT_MULTI_LOG_TASKS, > BTRFS_ROOT_DIRTY, > + BTRFS_ROOT_DELETING, > }; > > /* > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 581c2a0b2945..dcb699dd57f3 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -9333,6 +9333,15 @@ int btrfs_drop_snapshot(struct btrfs_root *root, > if (block_rsv) > trans->block_rsv = block_rsv; > > + /* > +* This will help us catch people modifying the fs tree while we're > +* dropping it. It is unsafe to mess with the fs tree while it's > being > +* dropped as we unlock the root node and parent nodes as we walk down > +* the tree, assuming nothing will change. If something does change > +* then we'll have stale information and drop references to blocks > we've > +* already dropped. > +*/ > + set_bit(BTRFS_ROOT_DELETING, &root->state); > if (btrfs_disk_key_objectid(&root_item->drop_progress) == 0) { > level = btrfs_header_level(root->node); > path->nodes[level] = btrfs_lock_root_node(root); > -- > 2.14.3 > -- Filipe David Manana, “Whether you think you can, or you think you can't — you're right.”
Re: [PATCH 2/2] btrfs: run delayed items before dropping the snapshot
On Fri, Nov 30, 2018 at 4:53 PM Josef Bacik wrote: > > From: Josef Bacik > > With my delayed refs patches in place we started seeing a large amount > of aborts in __btrfs_free_extent > > BTRFS error (device sdb1): unable to find ref byte nr 91947008 parent 0 root > 35964 owner 1 offset 0 > Call Trace: > ? btrfs_merge_delayed_refs+0xaf/0x340 > __btrfs_run_delayed_refs+0x6ea/0xfc0 > ? btrfs_set_path_blocking+0x31/0x60 > btrfs_run_delayed_refs+0xeb/0x180 > btrfs_commit_transaction+0x179/0x7f0 > ? btrfs_check_space_for_delayed_refs+0x30/0x50 > ? should_end_transaction.isra.19+0xe/0x40 > btrfs_drop_snapshot+0x41c/0x7c0 > btrfs_clean_one_deleted_snapshot+0xb5/0xd0 > cleaner_kthread+0xf6/0x120 > kthread+0xf8/0x130 > ? btree_invalidatepage+0x90/0x90 > ? kthread_bind+0x10/0x10 > ret_from_fork+0x35/0x40 > > This was because btrfs_drop_snapshot depends on the root not being modified > while it's dropping the snapshot. It will unlock the root node (and really > every node) as it walks down the tree, only to re-lock it when it needs to do > something. This is a problem because if we modify the tree we could cow a > block > in our path, which free's our reference to that block. Then once we get back > to > that shared block we'll free our reference to it again, and get ENOENT when > trying to lookup our extent reference to that block in __btrfs_free_extent. > > This is ultimately happening because we have delayed items left to be > processed > for our deleted snapshot _after_ all of the inodes are closed for the > snapshot. > We only run the delayed inode item if we're deleting the inode, and even then > we > do not run the delayed insertions or delayed removals. These can be run at > any > point after our final inode does it's last iput, which is what triggers the > snapshot deletion. We can end up with the snapshot deletion happening and > then > have the delayed items run on that file system, resulting in the above > problem. > > This problem has existed forever, however my patches made it much easier to > hit > as I wake up the cleaner much more often to deal with delayed iputs, which > made > us more likely to start the snapshot dropping work before the transaction > commits, which is when the delayed items would generally be run. Before, > generally speaking, we would run the delayed items, commit the transaction, > and > wakeup the cleaner thread to start deleting snapshots, which means we were > less > likely to hit this problem. You could still hit it if you had multiple > snapshots to be deleted and ended up with lots of delayed items, but it was > definitely harder. > > Fix for now by simply running all the delayed items before starting to drop > the > snapshot. We could make this smarter in the future by making the delayed > items > per-root, and then simply drop any delayed items for roots that we are going > to > delete. But for now just a quick and easy solution is the safest. > > Cc: sta...@vger.kernel.org > Signed-off-by: Josef Bacik Reviewed-by: Filipe Manana Great catch! I've hit this error from __btrfs_free_extent() a handful of times over the years, but never managed to reproduce it on demand or figure out it was related to snapshot deletion. > --- > fs/btrfs/extent-tree.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index dcb699dd57f3..965702034b22 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -9330,6 +9330,8 @@ int btrfs_drop_snapshot(struct btrfs_root *root, > goto out_free; > } > > + btrfs_run_delayed_items(trans); > + > if (block_rsv) > trans->block_rsv = block_rsv; > > -- > 2.14.3 > -- Filipe David Manana, “Whether you think you can, or you think you can't — you're right.”
[PATCH 0/2] Fix aborts when dropping snapshots
With more machines running my recent delayed ref rsv patches we started seeing a spike in boxes aborting in __btrfs_free_extent with being unable to find the extent ref. The full details are in 2/2, but the summary is there's been a bug ever since the original delayed inode item stuff was introduced where it could run once the snapshot was being deleted, which will result in all sorts of extent reference shenanigans. This was tricky to hit before, but with my iput changes it's become much easier to hit on our build boxes that are heavy users of snapshot creation/deletion. Thanks, Josef
[PATCH 1/2] btrfs: catch cow on deleting snapshots
From: Josef Bacik When debugging some weird extent reference bug I suspected that we were changing a snapshot while we were deleting it, which could explain my bug. This was indeed what was happening, and this patch helped me verify my theory. It is never correct to modify the snapshot once it's being deleted, so mark the root when we are deleting it and make sure we complain about it when it happens. Signed-off-by: Josef Bacik --- fs/btrfs/ctree.c | 3 +++ fs/btrfs/ctree.h | 1 + fs/btrfs/extent-tree.c | 9 + 3 files changed, 13 insertions(+) diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c index 5912a97b07a6..5f82f86085e8 100644 --- a/fs/btrfs/ctree.c +++ b/fs/btrfs/ctree.c @@ -1440,6 +1440,9 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle *trans, u64 search_start; int ret; + if (test_bit(BTRFS_ROOT_DELETING, &root->state)) + WARN(1, KERN_CRIT "cow'ing blocks on a fs root thats being dropped\n"); + if (trans->transaction != fs_info->running_transaction) WARN(1, KERN_CRIT "trans %llu running %llu\n", trans->transid, diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index facde70c15ed..5a3a94ccb65c 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1199,6 +1199,7 @@ enum { BTRFS_ROOT_FORCE_COW, BTRFS_ROOT_MULTI_LOG_TASKS, BTRFS_ROOT_DIRTY, + BTRFS_ROOT_DELETING, }; /* diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 581c2a0b2945..dcb699dd57f3 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -9333,6 +9333,15 @@ int btrfs_drop_snapshot(struct btrfs_root *root, if (block_rsv) trans->block_rsv = block_rsv; + /* +* This will help us catch people modifying the fs tree while we're +* dropping it. It is unsafe to mess with the fs tree while it's being +* dropped as we unlock the root node and parent nodes as we walk down +* the tree, assuming nothing will change. If something does change +* then we'll have stale information and drop references to blocks we've +* already dropped. +*/ + set_bit(BTRFS_ROOT_DELETING, &root->state); if (btrfs_disk_key_objectid(&root_item->drop_progress) == 0) { level = btrfs_header_level(root->node); path->nodes[level] = btrfs_lock_root_node(root); -- 2.14.3
[PATCH 2/2] btrfs: run delayed items before dropping the snapshot
From: Josef Bacik With my delayed refs patches in place we started seeing a large amount of aborts in __btrfs_free_extent BTRFS error (device sdb1): unable to find ref byte nr 91947008 parent 0 root 35964 owner 1 offset 0 Call Trace: ? btrfs_merge_delayed_refs+0xaf/0x340 __btrfs_run_delayed_refs+0x6ea/0xfc0 ? btrfs_set_path_blocking+0x31/0x60 btrfs_run_delayed_refs+0xeb/0x180 btrfs_commit_transaction+0x179/0x7f0 ? btrfs_check_space_for_delayed_refs+0x30/0x50 ? should_end_transaction.isra.19+0xe/0x40 btrfs_drop_snapshot+0x41c/0x7c0 btrfs_clean_one_deleted_snapshot+0xb5/0xd0 cleaner_kthread+0xf6/0x120 kthread+0xf8/0x130 ? btree_invalidatepage+0x90/0x90 ? kthread_bind+0x10/0x10 ret_from_fork+0x35/0x40 This was because btrfs_drop_snapshot depends on the root not being modified while it's dropping the snapshot. It will unlock the root node (and really every node) as it walks down the tree, only to re-lock it when it needs to do something. This is a problem because if we modify the tree we could cow a block in our path, which free's our reference to that block. Then once we get back to that shared block we'll free our reference to it again, and get ENOENT when trying to lookup our extent reference to that block in __btrfs_free_extent. This is ultimately happening because we have delayed items left to be processed for our deleted snapshot _after_ all of the inodes are closed for the snapshot. We only run the delayed inode item if we're deleting the inode, and even then we do not run the delayed insertions or delayed removals. These can be run at any point after our final inode does it's last iput, which is what triggers the snapshot deletion. We can end up with the snapshot deletion happening and then have the delayed items run on that file system, resulting in the above problem. This problem has existed forever, however my patches made it much easier to hit as I wake up the cleaner much more often to deal with delayed iputs, which made us more likely to start the snapshot dropping work before the transaction commits, which is when the delayed items would generally be run. Before, generally speaking, we would run the delayed items, commit the transaction, and wakeup the cleaner thread to start deleting snapshots, which means we were less likely to hit this problem. You could still hit it if you had multiple snapshots to be deleted and ended up with lots of delayed items, but it was definitely harder. Fix for now by simply running all the delayed items before starting to drop the snapshot. We could make this smarter in the future by making the delayed items per-root, and then simply drop any delayed items for roots that we are going to delete. But for now just a quick and easy solution is the safest. Cc: sta...@vger.kernel.org Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index dcb699dd57f3..965702034b22 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -9330,6 +9330,8 @@ int btrfs_drop_snapshot(struct btrfs_root *root, goto out_free; } + btrfs_run_delayed_items(trans); + if (block_rsv) trans->block_rsv = block_rsv; -- 2.14.3
Re: btrfs development - question about crypto api integration
On 30.11.18 г. 17:22 ч., Chris Mason wrote: > On 29 Nov 2018, at 12:37, Nikolay Borisov wrote: > >> On 29.11.18 г. 18:43 ч., Jean Fobe wrote: >>> Hi all, >>> I've been studying LZ4 and other compression algorithms on the >>> kernel, and seen other projects such as zram and ubifs using the >>> crypto api. Is there a technical reason for not using the crypto api >>> for compression (and possibly for encryption) in btrfs? >>> I did not find any design/technical implementation choices in >>> btrfs development in the developer's FAQ on the wiki. If I completely >>> missed it, could someone point me in the right direction? >>> Lastly, if there is no technical reason for this, would it be >>> something interesting to have implemented? >> >> I personally think it would be better if btrfs' exploited the generic >> framework. And in fact when you look at zstd, btrfs does use the >> generic, low-level ZSTD routines but not the crypto library wrappers. >> If >> I were I'd try and convert zstd (since it's the most recently added >> algorithm) to using the crypto layer to see if there are any lurking >> problems. > > Back when I first added the zlib support, the zlib API was both easier > to use and a better fit for our async worker threads. That doesn't mean > we shouldn't switch, it's just how we got to step one ;) And what about zstd? WHy is it also using the low level api and not the crypto wrappers? > > -chris >
Re: btrfs development - question about crypto api integration
On 29 Nov 2018, at 12:37, Nikolay Borisov wrote: > On 29.11.18 г. 18:43 ч., Jean Fobe wrote: >> Hi all, >> I've been studying LZ4 and other compression algorithms on the >> kernel, and seen other projects such as zram and ubifs using the >> crypto api. Is there a technical reason for not using the crypto api >> for compression (and possibly for encryption) in btrfs? >> I did not find any design/technical implementation choices in >> btrfs development in the developer's FAQ on the wiki. If I completely >> missed it, could someone point me in the right direction? >> Lastly, if there is no technical reason for this, would it be >> something interesting to have implemented? > > I personally think it would be better if btrfs' exploited the generic > framework. And in fact when you look at zstd, btrfs does use the > generic, low-level ZSTD routines but not the crypto library wrappers. > If > I were I'd try and convert zstd (since it's the most recently added > algorithm) to using the crypto layer to see if there are any lurking > problems. Back when I first added the zlib support, the zlib API was both easier to use and a better fit for our async worker threads. That doesn't mean we shouldn't switch, it's just how we got to step one ;) -chris
Need help with potential ~45TB dataloss
Hi all, I have been a happy BTRFS user for quite some time. But now I'm facing a potential ~45TB dataloss... :-( I hope someone can help! I have Server A and Server B. Both having a 20-devices BTRFS RAID6 filesystem. Because of known RAID5/6 risks, Server B was a backup of Server A. After applying updates to server B and reboot, the FS would not mount anymore. Because it was "just" a backup. I decided to recreate the FS and perform a new backup. Later, I discovered that the FS was not broken, but I faced this issue: https://patchwork.kernel.org/patch/10694997/ Anyway, the FS was already recreated, so I needed to do a new backup. During the backup (using rsync -vah), Server A (the source) encountered an I/O error and my rsync failed. In an attempt to "quick fix" the issue, I rebooted Server A after which the FS would not mount anymore. I documented what I have tried, below. I have not yet tried anything except what is shown, because I am afraid of causing more harm to the FS. I hope somebody here can give me advice on how to (hopefully) retrieve my data... Thanks in advance! == [root@cornelis ~]# btrfs fi show Label: 'cornelis-btrfs' uuid: ac643516-670e-40f3-aa4c-f329fc3795fd Total devices 1 FS bytes used 463.92GiB devid1 size 800.00GiB used 493.02GiB path /dev/mapper/cornelis-cornelis--btrfs Label: 'data' uuid: 4c66fa8b-8fc6-4bba-9d83-02a2a1d69ad5 Total devices 20 FS bytes used 44.85TiB devid1 size 3.64TiB used 3.64TiB path /dev/sdn2 devid2 size 3.64TiB used 3.64TiB path /dev/sdp2 devid3 size 3.64TiB used 3.64TiB path /dev/sdu2 devid4 size 3.64TiB used 3.64TiB path /dev/sdx2 devid5 size 3.64TiB used 3.64TiB path /dev/sdh2 devid6 size 3.64TiB used 3.64TiB path /dev/sdg2 devid7 size 3.64TiB used 3.64TiB path /dev/sdm2 devid8 size 3.64TiB used 3.64TiB path /dev/sdw2 devid9 size 3.64TiB used 3.64TiB path /dev/sdj2 devid 10 size 3.64TiB used 3.64TiB path /dev/sdt2 devid 11 size 3.64TiB used 3.64TiB path /dev/sdk2 devid 12 size 3.64TiB used 3.64TiB path /dev/sdq2 devid 13 size 3.64TiB used 3.64TiB path /dev/sds2 devid 14 size 3.64TiB used 3.64TiB path /dev/sdf2 devid 15 size 7.28TiB used 588.80GiB path /dev/sdr2 devid 16 size 7.28TiB used 588.80GiB path /dev/sdo2 devid 17 size 7.28TiB used 588.80GiB path /dev/sdv2 devid 18 size 7.28TiB used 588.80GiB path /dev/sdi2 devid 19 size 7.28TiB used 588.80GiB path /dev/sdl2 devid 20 size 7.28TiB used 588.80GiB path /dev/sde2 [root@cornelis ~]# mount /dev/sdn2 /mnt/data mount: /mnt/data: wrong fs type, bad option, bad superblock on /dev/sdn2, missing codepage or helper program, or other error. [root@cornelis ~]# btrfs check /dev/sdn2 Opening filesystem to check... parent transid verify failed on 46451963543552 wanted 114401 found 114173 parent transid verify failed on 46451963543552 wanted 114401 found 114173 checksum verify failed on 46451963543552 found A8F2A769 wanted 4C111ADF checksum verify failed on 46451963543552 found 32153BE8 wanted 8B07ABE4 checksum verify failed on 46451963543552 found 32153BE8 wanted 8B07ABE4 bad tree block 46451963543552, bytenr mismatch, want=46451963543552, have=75208089814272 Couldn't read tree root ERROR: cannot open file system [root@cornelis ~]# btrfs restore /dev/sdn2 /mnt/data/ parent transid verify failed on 46451963543552 wanted 114401 found 114173 parent transid verify failed on 46451963543552 wanted 114401 found 114173 checksum verify failed on 46451963543552 found A8F2A769 wanted 4C111ADF checksum verify failed on 46451963543552 found 32153BE8 wanted 8B07ABE4 checksum verify failed on 46451963543552 found 32153BE8 wanted 8B07ABE4 bad tree block 46451963543552, bytenr mismatch, want=46451963543552, have=75208089814272 Couldn't read tree root Could not open root, trying backup super warning, device 14 is missing warning, device 13 is missing warning, device 12 is missing warning, device 11 is missing warning, device 10 is missing warning, device 9 is missing warning, device 8 is missing warning, device 7 is missing warning, device 6 is missing warning, device 5 is missing warning, device 4 is missing warning, device 3 is missing warning, device 2 is missing checksum verify failed on 22085632 found 5630EA32 wanted 1AA6FFF0 checksum verify failed on 22085632 found 5630EA32 wanted 1AA6FFF0 bad tree block 22085632, bytenr mismatch, want=22085632, have=1147797504 ERROR: cannot read chunk root Could not open root, trying backup super warning, device 14 is missing warning, device 13 is missing warning, device 12 is missing warning, device 11 is missing warning, device 10 is missing warning, device 9 is missing warning, device 8 is missing warning, device 7 is missing warning, device 6 is missing warning, device 5 is missing warning, device 4 i
Re: [PATCH 2/2] fs: Don't open-code lru_to_page
On Thu, Nov 29, 2018 at 3:56 PM Nikolay Borisov wrote: > > There are a bunch of filesystems which essentially open-code lru_to_page > helper. Change them to using the helper. No functional changes. > > Signed-off-by: Nikolay Borisov > --- > > Since this is a mostly mechanical change I've actually batched all of them in > a single patch. > > fs/afs/file.c| 5 +++-- > fs/btrfs/extent_io.c | 2 +- > fs/ceph/addr.c | 5 ++--- > fs/cifs/file.c | 3 ++- > fs/ext4/readpage.c | 2 +- > fs/ocfs2/aops.c | 3 ++- > fs/orangefs/inode.c | 2 +- > mm/swap.c| 2 +- > 8 files changed, 13 insertions(+), 11 deletions(-) > > diff --git a/fs/afs/file.c b/fs/afs/file.c > index d6bc3f5d784b..323ae9912203 100644 > --- a/fs/afs/file.c > +++ b/fs/afs/file.c > @@ -17,6 +17,7 @@ > #include > #include > #include > +#include > #include "internal.h" > > static int afs_file_mmap(struct file *file, struct vm_area_struct *vma); > @@ -441,7 +442,7 @@ static int afs_readpages_one(struct file *file, struct > address_space *mapping, > /* Count the number of contiguous pages at the front of the list. > Note > * that the list goes prev-wards rather than next-wards. > */ > - first = list_entry(pages->prev, struct page, lru); > + first = lru_to_page(pages); > index = first->index + 1; > n = 1; > for (p = first->lru.prev; p != pages; p = p->prev) { > @@ -473,7 +474,7 @@ static int afs_readpages_one(struct file *file, struct > address_space *mapping, > * page at the end of the file. > */ > do { > - page = list_entry(pages->prev, struct page, lru); > + page = lru_to_page(pages); > list_del(&page->lru); > index = page->index; > if (add_to_page_cache_lru(page, mapping, index, > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c > index 19f4b8fd654f..8332c5f4b1c3 100644 > --- a/fs/btrfs/extent_io.c > +++ b/fs/btrfs/extent_io.c > @@ -4104,7 +4104,7 @@ int extent_readpages(struct address_space *mapping, > struct list_head *pages, > u64 prev_em_start = (u64)-1; > > for (page_idx = 0; page_idx < nr_pages; page_idx++) { > - page = list_entry(pages->prev, struct page, lru); > + page = lru_to_page(pages); > > prefetchw(&page->flags); > list_del(&page->lru); > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c > index 8eade7a993c1..5d0c05e288cc 100644 > --- a/fs/ceph/addr.c > +++ b/fs/ceph/addr.c > @@ -306,7 +306,7 @@ static int start_read(struct inode *inode, struct > ceph_rw_context *rw_ctx, > struct ceph_osd_client *osdc = > &ceph_inode_to_client(inode)->client->osdc; > struct ceph_inode_info *ci = ceph_inode(inode); > - struct page *page = list_entry(page_list->prev, struct page, lru); > + struct page *page = lru_to_page(page_list); > struct ceph_vino vino; > struct ceph_osd_request *req; > u64 off; > @@ -333,8 +333,7 @@ static int start_read(struct inode *inode, struct > ceph_rw_context *rw_ctx, > if (got) > ceph_put_cap_refs(ci, got); > while (!list_empty(page_list)) { > - page = list_entry(page_list->prev, > - struct page, lru); > + page = lru_to_page(page_list); > list_del(&page->lru); > put_page(page); > } For the ceph patch Acked-by:: "Yan, Zheng" diff --git a/fs/cifs/file.c b/fs/cifs/file.c > index 74c33d5fafc8..b16a4d887d17 100644 > --- a/fs/cifs/file.c > +++ b/fs/cifs/file.c > @@ -33,6 +33,7 @@ > #include > #include > #include > +#include > #include > #include "cifsfs.h" > #include "cifspdu.h" > @@ -3975,7 +3976,7 @@ readpages_get_pages(struct address_space *mapping, > struct list_head *page_list, > > INIT_LIST_HEAD(tmplist); > > - page = list_entry(page_list->prev, struct page, lru); > + page = lru_to_page(page_list); > > /* > * Lock the page and put it in the cache. Since no one else > diff --git a/fs/ext4/readpage.c b/fs/ext4/readpage.c > index f461d75ac049..6aa282ee455a 100644 > --- a/fs/ext4/readpage.c > +++ b/fs/ext4/readpage.c > @@ -128,7 +128,7 @@ int ext4_mpage_readpages(struct address_space *mapping, > > prefetchw(&page->flags); > if (pages) { > - page = list_entry(pages->prev, struct page, lru); > + page = lru_to_page(pages); > list_del(&page->lru); > if (add_to_page_cache_lru(page, mapping, page->index, > readahead_gfp_mask(mapping))) > diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c > index eb1ce304
[PATCH RESEND] fstests: btrfs use forget if not reload
[I will send this the xfstest ML after kernel and progs patch has been integrated]. btrfs reload was introduced to cleanup the device list inside the btrfs kernel module. The problem with the reload approach is that you can't run btrfs test cases 124,125, 154 and 164 on the system with btrfs as root fs. Now as we are introducing the btrfs forget feature as an btrfs device scan option [1], so here are its pertaining changes in the fstests. So these changes ensures we use the forget feature where available and if its not then falls back to the reload approach. [1] btrfs-progs: add cli to forget one or all scanned devices btrfs: introduce feature to forget a btrfs device Signed-off-by: Anand Jain --- common/btrfs| 20 tests/btrfs/124 | 6 +++--- tests/btrfs/125 | 6 +++--- tests/btrfs/154 | 6 +++--- tests/btrfs/164 | 4 ++-- 5 files changed, 31 insertions(+), 11 deletions(-) diff --git a/common/btrfs b/common/btrfs index 26dc0bb9600f..c7fbec11c8c1 100644 --- a/common/btrfs +++ b/common/btrfs @@ -374,3 +374,23 @@ _scratch_btrfs_sectorsize() $BTRFS_UTIL_PROG inspect-internal dump-super $SCRATCH_DEV |\ grep sectorsize | awk '{print $2}' } + +_btrfs_supports_forget() +{ + $BTRFS_UTIL_PROG device scan --help | grep -wq forget && \ + $BTRFS_UTIL_PROG device scan --forget > /dev/null 2>&1 +} + +_require_btrfs_forget_if_not_fs_loadable() +{ + _btrfs_supports_forget && return + + _require_loadable_fs_module "btrfs" +} + +_btrfs_forget_if_not_fs_reload() +{ + _btrfs_supports_forget && return + + _reload_fs_module "btrfs" +} diff --git a/tests/btrfs/124 b/tests/btrfs/124 index ce3ad6aa3a58..edbeff1443f5 100755 --- a/tests/btrfs/124 +++ b/tests/btrfs/124 @@ -51,7 +51,7 @@ _supported_fs btrfs _supported_os Linux _require_scratch_dev_pool 2 _test_unmount -_require_loadable_fs_module "btrfs" +_require_btrfs_forget_if_not_fs_loadable _scratch_dev_pool_get 2 @@ -86,7 +86,7 @@ echo "clean btrfs ko" >> $seqres.full _scratch_unmount # un-scan the btrfs devices -_reload_fs_module "btrfs" +_btrfs_forget_if_not_fs_reload echo >> $seqres.full echo "-Write degraded mount fill upto $max_fs_sz bytes-" >> $seqres.full @@ -125,7 +125,7 @@ echo echo "Mount degraded with the other dev" _scratch_unmount # un-scan the btrfs devices -_reload_fs_module "btrfs" +_btrfs_forget_if_not_fs_reload _mount -o degraded $dev2 $SCRATCH_MNT >>$seqres.full 2>&1 _run_btrfs_util_prog filesystem show checkpoint3=`md5sum $SCRATCH_MNT/tf2` diff --git a/tests/btrfs/125 b/tests/btrfs/125 index e38de264b28e..9161a2ddeaad 100755 --- a/tests/btrfs/125 +++ b/tests/btrfs/125 @@ -50,7 +50,7 @@ _supported_fs btrfs _supported_os Linux _require_scratch_dev_pool 3 _test_unmount -_require_loadable_fs_module "btrfs" +_require_btrfs_forget_if_not_fs_loadable _scratch_dev_pool_get 3 @@ -102,7 +102,7 @@ echo "unmount" >> $seqres.full _scratch_unmount echo "clean btrfs ko" >> $seqres.full # un-scan the btrfs devices -_reload_fs_module "btrfs" +_btrfs_forget_if_not_fs_reload _mount -o degraded,device=$dev2 $dev1 $SCRATCH_MNT >>$seqres.full 2>&1 dd if=/dev/zero of="$SCRATCH_MNT"/tf2 bs=$bs count=$count \ >>$seqres.full 2>&1 @@ -138,7 +138,7 @@ echo "Mount degraded but with other dev" _scratch_unmount # un-scan the btrfs devices -_reload_fs_module "btrfs" +_btrfs_forget_if_not_fs_reload _mount -o degraded,device=${dev2} $dev3 $SCRATCH_MNT >>$seqres.full 2>&1 diff --git a/tests/btrfs/154 b/tests/btrfs/154 index 99ea232aba4c..01745d20e9fc 100755 --- a/tests/btrfs/154 +++ b/tests/btrfs/154 @@ -36,7 +36,7 @@ rm -f $seqres.full _supported_fs btrfs _supported_os Linux _require_scratch_dev_pool 2 -_require_loadable_fs_module "btrfs" +_require_btrfs_forget_if_not_fs_loadable _scratch_dev_pool_get 2 @@ -90,7 +90,7 @@ degrade_mount_write() echo "clean btrfs ko" >> $seqres.full # un-scan the btrfs devices - _reload_fs_module "btrfs" + _btrfs_forget_if_not_fs_reload _mount -o degraded $DEV1 $SCRATCH_MNT >>$seqres.full 2>&1 cnt=$(( $COUNT/10 )) dd if=/dev/urandom of="$SCRATCH_MNT"/tf1 bs=$bs count=$cnt \ @@ -142,7 +142,7 @@ verify() echo "unmount" >> $seqres.full _scratch_unmount - _reload_fs_module "btrfs" + _btrfs_forget_if_not_fs_reload _mount -o degraded $DEV2 $SCRATCH_MNT >>$seqres.full 2>&1 verify_checkpoint1=`md5sum $SCRATCH_MNT/tf1` verify_checkpoint2=`md5sum $SCRATCH_MNT/tf2` diff --git a/tests/btrfs/164 b/tests/btrfs/164 index 097191a0e493..55042c4035e0 100755 --- a/tests/btrfs/164 +++ b/tests/btrfs/164 @@ -36,7 +36,7 @@ rm -f $seqres.full # Modify as appropriate. _supported_fs btrfs _supported_os Linux -_require_loadable_fs_module "btrfs" +_require_btrfs_forget_if_not_fs_loadable _require_scratch_dev_pool 2 _scratch_dev_pool_get 2 @@ -69,7 +69,7 @@ delete_seed() {
[PATCH RESEND v12] btrfs: introduce feature to forget a btrfs device
Support for a new command 'btrfs dev forget [dev]' is proposed here to undo the effects of 'btrfs dev scan [dev]'. For this purpose this patch proposes to use ioctl #5 as it was empty. IOW(BTRFS_IOCTL_MAGIC, 5, ..) This patch adds new ioctl BTRFS_IOC_FORGET_DEV which can be sent from the /dev/btrfs-control to forget one or all devices, (devices which are not mounted) from the btrfs kernel. The argument it takes is struct btrfs_ioctl_vol_args, and ::name can be set to specify the device path. And all unmounted devices can be removed from the kernel if no device path is provided. Again, the devices are removed only if the relevant fsid aren't mounted. This new cli can provide.. . Release of unwanted btrfs_fs_devices and btrfs_devices memory if the device is not going to be mounted. . Ability to mount the device in degraded mode when one of the other device is corrupted like in split brain raid1. . Running test cases which requires btrfs.ko-reload if the rootfs is btrfs. Signed-off-by: Anand Jain Reviewed-by: Nikolay Borisov --- v11->v12: fix coding style add spacing before after ":". v1->v11: Pls ref to the cover-letter. (sorry about that). fs/btrfs/super.c | 3 +++ fs/btrfs/volumes.c | 9 + fs/btrfs/volumes.h | 1 + include/uapi/linux/btrfs.h | 2 ++ 4 files changed, 15 insertions(+) diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index d3c6bbc0aa3a..eba2966913ae 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -2247,6 +2247,9 @@ static long btrfs_control_ioctl(struct file *file, unsigned int cmd, ret = PTR_ERR_OR_ZERO(device); mutex_unlock(&uuid_mutex); break; + case BTRFS_IOC_FORGET_DEV: + ret = btrfs_forget_devices(vol->name); + break; case BTRFS_IOC_DEVICES_READY: mutex_lock(&uuid_mutex); device = btrfs_scan_one_device(vol->name, FMODE_READ, diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 8e36cbb355df..48415a9edd46 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -1397,6 +1397,15 @@ static int btrfs_read_disk_super(struct block_device *bdev, u64 bytenr, return 0; } +int btrfs_forget_devices(const char *path) +{ + mutex_lock(&uuid_mutex); + btrfs_free_stale_devices(strlen(path) ? path : NULL, NULL); + mutex_unlock(&uuid_mutex); + + return 0; +} + /* * Look for a btrfs signature on a device. This may be called out of the mount path * and we are not allowed to call set_blocksize during the scan. The superblock diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h index 1d936ce282c3..a4a89d5050f5 100644 --- a/fs/btrfs/volumes.h +++ b/fs/btrfs/volumes.h @@ -414,6 +414,7 @@ int btrfs_open_devices(struct btrfs_fs_devices *fs_devices, fmode_t flags, void *holder); struct btrfs_device *btrfs_scan_one_device(const char *path, fmode_t flags, void *holder); +int btrfs_forget_devices(const char *path); int btrfs_close_devices(struct btrfs_fs_devices *fs_devices); void btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices, int step); void btrfs_assign_next_active_device(struct btrfs_device *device, diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h index e0763bc4158e..c195896d478f 100644 --- a/include/uapi/linux/btrfs.h +++ b/include/uapi/linux/btrfs.h @@ -837,6 +837,8 @@ enum btrfs_err_code { struct btrfs_ioctl_vol_args) #define BTRFS_IOC_SCAN_DEV _IOW(BTRFS_IOCTL_MAGIC, 4, \ struct btrfs_ioctl_vol_args) +#define BTRFS_IOC_FORGET_DEV _IOW(BTRFS_IOCTL_MAGIC, 5, \ + struct btrfs_ioctl_vol_args) /* trans start and trans end are dangerous, and only for * use by applications that know how to avoid the * resulting deadlocks -- 1.8.3.1
[PATCH RESEND v12] btrfs-progs: add cli to forget one or all scanned devices
This patch adds cli btrfs device forget [dev] to remove the given device structure in the kernel if the device is unmounted. If no argument is given it shall remove all stale (device which are not mounted) from the kernel. Signed-off-by: Anand Jain Reviewed-by: Nikolay Borisov --- v11->v12: none. v1->v12: pls ref to the cover-letter. cmds-device.c | 72 --- ioctl.h | 2 ++ 2 files changed, 61 insertions(+), 13 deletions(-) diff --git a/cmds-device.c b/cmds-device.c index 2a05f70a76a9..280d6f555377 100644 --- a/cmds-device.c +++ b/cmds-device.c @@ -254,10 +254,32 @@ static int cmd_device_delete(int argc, char **argv) return _cmd_device_remove(argc, argv, cmd_device_delete_usage); } +static int btrfs_forget_devices(char *path) +{ + struct btrfs_ioctl_vol_args args; + int ret; + int fd; + + fd = open("/dev/btrfs-control", O_RDWR); + if (fd < 0) + return -errno; + + memset(&args, 0, sizeof(args)); + if (path) + strncpy_null(args.name, path); + ret = ioctl(fd, BTRFS_IOC_FORGET_DEV, &args); + if (ret) + ret = -errno; + close(fd); + return ret; +} + static const char * const cmd_device_scan_usage[] = { - "btrfs device scan [(-d|--all-devices)| [...]]", - "Scan devices for a btrfs filesystem", + "btrfs device scan [(-d|--all-devices)|(-u|--forget)| "\ + "[...]]", + "Scan or forget (deregister) devices for a btrfs filesystem", " -d|--all-devices (deprecated)", + " -u|--forget [ ..]", NULL }; @@ -267,37 +289,53 @@ static int cmd_device_scan(int argc, char **argv) int devstart; int all = 0; int ret = 0; + int forget = 0; optind = 0; while (1) { int c; static const struct option long_options[] = { { "all-devices", no_argument, NULL, 'd'}, + { "forget", no_argument, NULL, 'u'}, { NULL, 0, NULL, 0} }; - c = getopt_long(argc, argv, "d", long_options, NULL); + c = getopt_long(argc, argv, "du", long_options, NULL); if (c < 0) break; switch (c) { case 'd': all = 1; break; + case 'u': + forget = 1; + break; default: usage(cmd_device_scan_usage); } } devstart = optind; + if (all && forget) + usage(cmd_device_scan_usage); + if (all && check_argc_max(argc - optind, 1)) usage(cmd_device_scan_usage); if (all || argc - optind == 0) { - printf("Scanning for Btrfs filesystems\n"); - ret = btrfs_scan_devices(); - error_on(ret, "error %d while scanning", ret); - ret = btrfs_register_all_devices(); - error_on(ret, "there are %d errors while registering devices", ret); + if (forget) { + ret = btrfs_forget_devices(NULL); + error_on(ret, "'%s', forget failed", +strerror(-ret)); + } else { + printf("Scanning for Btrfs filesystems\n"); + ret = btrfs_scan_devices(); + error_on(ret, "error %d while scanning", ret); + ret = btrfs_register_all_devices(); + error_on(ret, + "there are %d errors while registering devices", + ret); + } goto out; } @@ -315,11 +353,19 @@ static int cmd_device_scan(int argc, char **argv) ret = 1; goto out; } - printf("Scanning for Btrfs filesystems in '%s'\n", path); - if (btrfs_register_one_device(path) != 0) { - ret = 1; - free(path); - goto out; + if (forget) { + ret = btrfs_forget_devices(path); + if (ret) + error("Can't forget '%s': %s", + path, strerror(-ret)); + } else { + printf("Scanning for Btrfs filesystems in '%s'\n", + path); + if (btrfs_register_one_device(path) != 0) { + ret = 1; + free(path); + goto out; + }
[PATCH RESEND] fstests: btrfs use forget if not reload
[I will send this the xfstest ML after kernel and progs patch has been integrated]. btrfs reload was introduced to cleanup the device list inside the btrfs kernel module. The problem with the reload approach is that you can't run btrfs test cases 124,125, 154 and 164 on the system with btrfs as root fs. Now as we are introducing the btrfs forget feature as an btrfs device scan option [1], so here are its pertaining changes in the fstests. So these changes ensures we use the forget feature where available and if its not then falls back to the reload approach. [1] btrfs-progs: add cli to forget one or all scanned devices btrfs: introduce feature to forget a btrfs device Signed-off-by: Anand Jain --- common/btrfs| 20 tests/btrfs/124 | 6 +++--- tests/btrfs/125 | 6 +++--- tests/btrfs/154 | 6 +++--- tests/btrfs/164 | 4 ++-- 5 files changed, 31 insertions(+), 11 deletions(-) diff --git a/common/btrfs b/common/btrfs index 26dc0bb9600f..c7fbec11c8c1 100644 --- a/common/btrfs +++ b/common/btrfs @@ -374,3 +374,23 @@ _scratch_btrfs_sectorsize() $BTRFS_UTIL_PROG inspect-internal dump-super $SCRATCH_DEV |\ grep sectorsize | awk '{print $2}' } + +_btrfs_supports_forget() +{ + $BTRFS_UTIL_PROG device scan --help | grep -wq forget && \ + $BTRFS_UTIL_PROG device scan --forget > /dev/null 2>&1 +} + +_require_btrfs_forget_if_not_fs_loadable() +{ + _btrfs_supports_forget && return + + _require_loadable_fs_module "btrfs" +} + +_btrfs_forget_if_not_fs_reload() +{ + _btrfs_supports_forget && return + + _reload_fs_module "btrfs" +} diff --git a/tests/btrfs/124 b/tests/btrfs/124 index ce3ad6aa3a58..edbeff1443f5 100755 --- a/tests/btrfs/124 +++ b/tests/btrfs/124 @@ -51,7 +51,7 @@ _supported_fs btrfs _supported_os Linux _require_scratch_dev_pool 2 _test_unmount -_require_loadable_fs_module "btrfs" +_require_btrfs_forget_if_not_fs_loadable _scratch_dev_pool_get 2 @@ -86,7 +86,7 @@ echo "clean btrfs ko" >> $seqres.full _scratch_unmount # un-scan the btrfs devices -_reload_fs_module "btrfs" +_btrfs_forget_if_not_fs_reload echo >> $seqres.full echo "-Write degraded mount fill upto $max_fs_sz bytes-" >> $seqres.full @@ -125,7 +125,7 @@ echo echo "Mount degraded with the other dev" _scratch_unmount # un-scan the btrfs devices -_reload_fs_module "btrfs" +_btrfs_forget_if_not_fs_reload _mount -o degraded $dev2 $SCRATCH_MNT >>$seqres.full 2>&1 _run_btrfs_util_prog filesystem show checkpoint3=`md5sum $SCRATCH_MNT/tf2` diff --git a/tests/btrfs/125 b/tests/btrfs/125 index e38de264b28e..9161a2ddeaad 100755 --- a/tests/btrfs/125 +++ b/tests/btrfs/125 @@ -50,7 +50,7 @@ _supported_fs btrfs _supported_os Linux _require_scratch_dev_pool 3 _test_unmount -_require_loadable_fs_module "btrfs" +_require_btrfs_forget_if_not_fs_loadable _scratch_dev_pool_get 3 @@ -102,7 +102,7 @@ echo "unmount" >> $seqres.full _scratch_unmount echo "clean btrfs ko" >> $seqres.full # un-scan the btrfs devices -_reload_fs_module "btrfs" +_btrfs_forget_if_not_fs_reload _mount -o degraded,device=$dev2 $dev1 $SCRATCH_MNT >>$seqres.full 2>&1 dd if=/dev/zero of="$SCRATCH_MNT"/tf2 bs=$bs count=$count \ >>$seqres.full 2>&1 @@ -138,7 +138,7 @@ echo "Mount degraded but with other dev" _scratch_unmount # un-scan the btrfs devices -_reload_fs_module "btrfs" +_btrfs_forget_if_not_fs_reload _mount -o degraded,device=${dev2} $dev3 $SCRATCH_MNT >>$seqres.full 2>&1 diff --git a/tests/btrfs/154 b/tests/btrfs/154 index 99ea232aba4c..01745d20e9fc 100755 --- a/tests/btrfs/154 +++ b/tests/btrfs/154 @@ -36,7 +36,7 @@ rm -f $seqres.full _supported_fs btrfs _supported_os Linux _require_scratch_dev_pool 2 -_require_loadable_fs_module "btrfs" +_require_btrfs_forget_if_not_fs_loadable _scratch_dev_pool_get 2 @@ -90,7 +90,7 @@ degrade_mount_write() echo "clean btrfs ko" >> $seqres.full # un-scan the btrfs devices - _reload_fs_module "btrfs" + _btrfs_forget_if_not_fs_reload _mount -o degraded $DEV1 $SCRATCH_MNT >>$seqres.full 2>&1 cnt=$(( $COUNT/10 )) dd if=/dev/urandom of="$SCRATCH_MNT"/tf1 bs=$bs count=$cnt \ @@ -142,7 +142,7 @@ verify() echo "unmount" >> $seqres.full _scratch_unmount - _reload_fs_module "btrfs" + _btrfs_forget_if_not_fs_reload _mount -o degraded $DEV2 $SCRATCH_MNT >>$seqres.full 2>&1 verify_checkpoint1=`md5sum $SCRATCH_MNT/tf1` verify_checkpoint2=`md5sum $SCRATCH_MNT/tf2` diff --git a/tests/btrfs/164 b/tests/btrfs/164 index 097191a0e493..55042c4035e0 100755 --- a/tests/btrfs/164 +++ b/tests/btrfs/164 @@ -36,7 +36,7 @@ rm -f $seqres.full # Modify as appropriate. _supported_fs btrfs _supported_os Linux -_require_loadable_fs_module "btrfs" +_require_btrfs_forget_if_not_fs_loadable _require_scratch_dev_pool 2 _scratch_dev_pool_get 2 @@ -69,7 +69,7 @@ delete_seed() {
[PATCH RESEND v12] Add cli and ioctl to forget scanned device(s)
v12: Fixed coding style - leave space between " : ". v11: btrfs-progs: Bring the code into the else part of if(forget). Use strerror to print the erorr instead of ret. v10: Make btrfs-progs changes more readable. With an effort to keep the known bug [1] as it is.. [1] The cli 'btrfs device scan --all /dev/sdb' which should have scanned only one device, ends up scanning all the devices and I am not trying to fix this bug in this patch because.. . -d|--all is marked as deprecated, I hope -d option would go away . For now some script might be using this bug as a feature, and fixing this bug might lead to mount failure. v9: Make forget as a btrfs device scan option. Use forget in the fstests, now you can run fstests with btrfs as rootfs which helps to exercise the uuid_mutex lock. v8: Change log update in the kernel patch. v7: Use struct btrfs_ioctl_vol_args (instead of struct btrfs_ioctl_vol_args_v2) as its inline with other ioctl btrfs-control The CLI usage/features remains same. However internally the ioctl flag is not required to delete all the unmounted devices. Instead leave btrfs_ioctl_vol_args::name NULL. v6: Use the changed fn name btrfs_free_stale_devices(). Change in title: Old v5: Cover-letter: [PATCH v5] Add cli and ioctl to ignore a scanned device Kernel: [PATCH v5] btrfs: introduce feature to ignore a btrfs device Progs: [PATCH v5] btrfs-progs: add 'btrfs device ignore' cli v5: Adds feature to delete all stale devices Reuses btrfs_free_stale_devices() fn and so depends on the patch-set [1] in the ML. Uses struct btrfs_ioctl_vol_args_v2 instead of struct btrfs_ioctl_vol_args as arg Does the device path matching instead of btrfs_device matching (we won't delete the mounted device as btrfs_free_stale_devices() checks for it) v4: No change. But as the ML thread may be confusing, so resend. v3: No change. Send to correct ML. v2: Accepts review from Nikolay, details are in the specific patch. Patch 1/2 is renamed from [PATCH 1/2] btrfs: refactor btrfs_free_stale_device() to get device list delete to [PATCH 1/2] btrfs: add function to device list delete Adds cli and ioctl to forget a scanned device or forget all stale devices in the kernel. Anand Jain (1): btrfs: introduce feature to forget a btrfs device fs/btrfs/super.c | 3 +++ fs/btrfs/volumes.c | 9 + fs/btrfs/volumes.h | 1 + include/uapi/linux/btrfs.h | 2 ++ 4 files changed, 15 insertions(+) Anand Jain (1): btrfs-progs: add cli to forget one or all scanned devices cmds-device.c | 63 ++- ioctl.h | 2 ++ 2 files changed, 56 insertions(+), 9 deletions(-) [1] Anand Jain (1): fstests: btrfs use forget if not reload common/btrfs| 20 tests/btrfs/124 | 6 +++--- tests/btrfs/125 | 6 +++--- tests/btrfs/154 | 6 +++--- tests/btrfs/164 | 4 ++-- 5 files changed, 31 insertions(+), 11 deletions(-) -- 1.8.3.1