Re: [PATCH 1/2] btrfs: catch cow on deleting snapshots

2018-11-30 Thread Hans van Kranenburg
On 12/1/18 1:19 AM, Qu Wenruo wrote:
> 
> 
> On 2018/12/1 上午12:52, Josef Bacik wrote:
>> From: Josef Bacik 
>>
>> When debugging some weird extent reference bug I suspected that we were
>> changing a snapshot while we were deleting it, which could explain my
>> bug.
> 
> May I ask under which case we're going to modify an unlinked snapshot?
> 
> Maybe metadata relocation?

I have some old logging from a filesystem that got corrupted when doing
a subvolume delete while doing metadata balance:

https://syrinx.knorrie.org/~knorrie/btrfs/domokun-balance-skinny-lockup.txt

This is one of the very few moments when I really lost a btrfs
filesystem and had to mkfs again to move on.

I'm wondering if this one looks related. I never found an explanation
for it.

Hans

> 
> Thanks,
> Qu
> 
>> This was indeed what was happening, and this patch helped me
>> verify my theory.  It is never correct to modify the snapshot once it's
>> being deleted, so mark the root when we are deleting it and make sure we
>> complain about it when it happens.
>>
>> Signed-off-by: Josef Bacik 
>> ---
>>  fs/btrfs/ctree.c   | 3 +++
>>  fs/btrfs/ctree.h   | 1 +
>>  fs/btrfs/extent-tree.c | 9 +
>>  3 files changed, 13 insertions(+)
>>
>> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
>> index 5912a97b07a6..5f82f86085e8 100644
>> --- a/fs/btrfs/ctree.c
>> +++ b/fs/btrfs/ctree.c
>> @@ -1440,6 +1440,9 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle 
>> *trans,
>>  u64 search_start;
>>  int ret;
>>  
>> +if (test_bit(BTRFS_ROOT_DELETING, &root->state))
>> +WARN(1, KERN_CRIT "cow'ing blocks on a fs root thats being 
>> dropped\n");
>> +
>>  if (trans->transaction != fs_info->running_transaction)
>>  WARN(1, KERN_CRIT "trans %llu running %llu\n",
>> trans->transid,
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index facde70c15ed..5a3a94ccb65c 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -1199,6 +1199,7 @@ enum {
>>  BTRFS_ROOT_FORCE_COW,
>>  BTRFS_ROOT_MULTI_LOG_TASKS,
>>  BTRFS_ROOT_DIRTY,
>> +BTRFS_ROOT_DELETING,
>>  };
>>  
>>  /*
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 581c2a0b2945..dcb699dd57f3 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -9333,6 +9333,15 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
>>  if (block_rsv)
>>  trans->block_rsv = block_rsv;
>>  
>> +/*
>> + * This will help us catch people modifying the fs tree while we're
>> + * dropping it.  It is unsafe to mess with the fs tree while it's being
>> + * dropped as we unlock the root node and parent nodes as we walk down
>> + * the tree, assuming nothing will change.  If something does change
>> + * then we'll have stale information and drop references to blocks we've
>> + * already dropped.
>> + */
>> +set_bit(BTRFS_ROOT_DELETING, &root->state);
>>  if (btrfs_disk_key_objectid(&root_item->drop_progress) == 0) {
>>  level = btrfs_header_level(root->node);
>>  path->nodes[level] = btrfs_lock_root_node(root);
>>
> 


-- 
Hans van Kranenburg



signature.asc
Description: OpenPGP digital signature


Re: [PATCH 1/2] btrfs: catch cow on deleting snapshots

2018-11-30 Thread Qu Wenruo


On 2018/12/1 上午12:52, Josef Bacik wrote:
> From: Josef Bacik 
> 
> When debugging some weird extent reference bug I suspected that we were
> changing a snapshot while we were deleting it, which could explain my
> bug.

May I ask under which case we're going to modify an unlinked snapshot?

Maybe metadata relocation?

Thanks,
Qu

> This was indeed what was happening, and this patch helped me
> verify my theory.  It is never correct to modify the snapshot once it's
> being deleted, so mark the root when we are deleting it and make sure we
> complain about it when it happens.
> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/ctree.c   | 3 +++
>  fs/btrfs/ctree.h   | 1 +
>  fs/btrfs/extent-tree.c | 9 +
>  3 files changed, 13 insertions(+)
> 
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 5912a97b07a6..5f82f86085e8 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -1440,6 +1440,9 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle 
> *trans,
>   u64 search_start;
>   int ret;
>  
> + if (test_bit(BTRFS_ROOT_DELETING, &root->state))
> + WARN(1, KERN_CRIT "cow'ing blocks on a fs root thats being 
> dropped\n");
> +
>   if (trans->transaction != fs_info->running_transaction)
>   WARN(1, KERN_CRIT "trans %llu running %llu\n",
>  trans->transid,
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index facde70c15ed..5a3a94ccb65c 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1199,6 +1199,7 @@ enum {
>   BTRFS_ROOT_FORCE_COW,
>   BTRFS_ROOT_MULTI_LOG_TASKS,
>   BTRFS_ROOT_DIRTY,
> + BTRFS_ROOT_DELETING,
>  };
>  
>  /*
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 581c2a0b2945..dcb699dd57f3 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -9333,6 +9333,15 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
>   if (block_rsv)
>   trans->block_rsv = block_rsv;
>  
> + /*
> +  * This will help us catch people modifying the fs tree while we're
> +  * dropping it.  It is unsafe to mess with the fs tree while it's being
> +  * dropped as we unlock the root node and parent nodes as we walk down
> +  * the tree, assuming nothing will change.  If something does change
> +  * then we'll have stale information and drop references to blocks we've
> +  * already dropped.
> +  */
> + set_bit(BTRFS_ROOT_DELETING, &root->state);
>   if (btrfs_disk_key_objectid(&root_item->drop_progress) == 0) {
>   level = btrfs_header_level(root->node);
>   path->nodes[level] = btrfs_lock_root_node(root);
> 



signature.asc
Description: OpenPGP digital signature


Re: Need help with potential ~45TB dataloss

2018-11-30 Thread Qu Wenruo


On 2018/11/30 下午9:53, Patrick Dijkgraaf wrote:
> Hi all,
> 
> I have been a happy BTRFS user for quite some time. But now I'm facing
> a potential ~45TB dataloss... :-(
> I hope someone can help!
> 
> I have Server A and Server B. Both having a 20-devices BTRFS RAID6
> filesystem. Because of known RAID5/6 risks, Server B was a backup of
> Server A.
> After applying updates to server B and reboot, the FS would not mount
> anymore. Because it was "just" a backup. I decided to recreate the FS
> and perform a new backup. Later, I discovered that the FS was not
> broken, but I faced this issue: 
> https://patchwork.kernel.org/patch/10694997/

Sorry for the inconvenience.

I didn't realize the max_chunk_size limit isn't reliable at that timing.

> 
> Anyway, the FS was already recreated, so I needed to do a new backup.
> During the backup (using rsync -vah), Server A (the source) encountered
> an I/O error and my rsync failed. In an attempt to "quick fix" the
> issue, I rebooted Server A after which the FS would not mount anymore.

Did you have any dmesg about that IO error?

And how is the reboot scheduled? Forced power off or normal reboot command?

> 
> I documented what I have tried, below. I have not yet tried anything
> except what is shown, because I am afraid of causing more harm to
> the FS.

Pretty clever, no btrfs check --repair is a pretty good move.

> I hope somebody here can give me advice on how to (hopefully)
> retrieve my data...
> 
> Thanks in advance!
> 
> ==
> 
> [root@cornelis ~]# btrfs fi show
> Label: 'cornelis-btrfs'  uuid: ac643516-670e-40f3-aa4c-f329fc3795fd
>   Total devices 1 FS bytes used 463.92GiB
>   devid1 size 800.00GiB used 493.02GiB path
> /dev/mapper/cornelis-cornelis--btrfs
> 
> Label: 'data'  uuid: 4c66fa8b-8fc6-4bba-9d83-02a2a1d69ad5
>   Total devices 20 FS bytes used 44.85TiB
>   devid1 size 3.64TiB used 3.64TiB path /dev/sdn2
>   devid2 size 3.64TiB used 3.64TiB path /dev/sdp2
>   devid3 size 3.64TiB used 3.64TiB path /dev/sdu2
>   devid4 size 3.64TiB used 3.64TiB path /dev/sdx2
>   devid5 size 3.64TiB used 3.64TiB path /dev/sdh2
>   devid6 size 3.64TiB used 3.64TiB path /dev/sdg2
>   devid7 size 3.64TiB used 3.64TiB path /dev/sdm2
>   devid8 size 3.64TiB used 3.64TiB path /dev/sdw2
>   devid9 size 3.64TiB used 3.64TiB path /dev/sdj2
>   devid   10 size 3.64TiB used 3.64TiB path /dev/sdt2
>   devid   11 size 3.64TiB used 3.64TiB path /dev/sdk2
>   devid   12 size 3.64TiB used 3.64TiB path /dev/sdq2
>   devid   13 size 3.64TiB used 3.64TiB path /dev/sds2
>   devid   14 size 3.64TiB used 3.64TiB path /dev/sdf2
>   devid   15 size 7.28TiB used 588.80GiB path /dev/sdr2
>   devid   16 size 7.28TiB used 588.80GiB path /dev/sdo2
>   devid   17 size 7.28TiB used 588.80GiB path /dev/sdv2
>   devid   18 size 7.28TiB used 588.80GiB path /dev/sdi2
>   devid   19 size 7.28TiB used 588.80GiB path /dev/sdl2
>   devid   20 size 7.28TiB used 588.80GiB path /dev/sde2
> 
> [root@cornelis ~]# mount /dev/sdn2 /mnt/data
> mount: /mnt/data: wrong fs type, bad option, bad superblock on
> /dev/sdn2, missing codepage or helper program, or other error.

What is the dmesg of the mount failure?

And have you tried -o ro,degraded ?

> 
> [root@cornelis ~]# btrfs check /dev/sdn2
> Opening filesystem to check...
> parent transid verify failed on 46451963543552 wanted 114401 found
> 114173
> parent transid verify failed on 46451963543552 wanted 114401 found
> 114173
> checksum verify failed on 46451963543552 found A8F2A769 wanted 4C111ADF
> checksum verify failed on 46451963543552 found 32153BE8 wanted 8B07ABE4
> checksum verify failed on 46451963543552 found 32153BE8 wanted 8B07ABE4
> bad tree block 46451963543552, bytenr mismatch, want=46451963543552,
> have=75208089814272
> Couldn't read tree root

Would you please also paste the output of "btrfs ins dump-super /dev/sdn2" ?

It looks like your tree root (or at least some tree root nodes/leaves
get corrupted)

> ERROR: cannot open file system

And since it's your tree root corrupted, you could also try
"btrfs-find-root " to try to get a good old copy of your tree root.

But I suspect the corruption happens before you noticed, thus the old
tree root may not help much.

Also, the output of "btrfs ins dump-tree -t root " will help.

Thanks,
Qu
> 
> [root@cornelis ~]# btrfs restore /dev/sdn2 /mnt/data/
> parent transid verify failed on 46451963543552 wanted 114401 found
> 114173
> parent transid verify failed on 46451963543552 wanted 114401 found
> 114173
> checksum verify failed on 46451963543552 found A8F2A769 wanted 4C111ADF
> checksum verify failed on 46451963543552 found 32153BE8 wanted 8B07ABE4
> checksum verify failed on 46451963543552 found 32153BE8 wanted 8B07ABE4
> bad tree block 46451963543552, bytenr mismatch, want=46451963543552,
> have=75208089814272
> Couldn't read tree

Re: [PATCH 2/2] btrfs: run delayed items before dropping the snapshot

2018-11-30 Thread Filipe Manana
On Fri, Nov 30, 2018 at 5:12 PM Filipe Manana  wrote:
>
> On Fri, Nov 30, 2018 at 4:53 PM Josef Bacik  wrote:
> >
> > From: Josef Bacik 
> >
> > With my delayed refs patches in place we started seeing a large amount
> > of aborts in __btrfs_free_extent
> >
> > BTRFS error (device sdb1): unable to find ref byte nr 91947008 parent 0 
> > root 35964  owner 1 offset 0
> > Call Trace:
> >  ? btrfs_merge_delayed_refs+0xaf/0x340
> >  __btrfs_run_delayed_refs+0x6ea/0xfc0
> >  ? btrfs_set_path_blocking+0x31/0x60
> >  btrfs_run_delayed_refs+0xeb/0x180
> >  btrfs_commit_transaction+0x179/0x7f0
> >  ? btrfs_check_space_for_delayed_refs+0x30/0x50
> >  ? should_end_transaction.isra.19+0xe/0x40
> >  btrfs_drop_snapshot+0x41c/0x7c0
> >  btrfs_clean_one_deleted_snapshot+0xb5/0xd0
> >  cleaner_kthread+0xf6/0x120
> >  kthread+0xf8/0x130
> >  ? btree_invalidatepage+0x90/0x90
> >  ? kthread_bind+0x10/0x10
> >  ret_from_fork+0x35/0x40
> >
> > This was because btrfs_drop_snapshot depends on the root not being modified
> > while it's dropping the snapshot.  It will unlock the root node (and really
> > every node) as it walks down the tree, only to re-lock it when it needs to 
> > do
> > something.  This is a problem because if we modify the tree we could cow a 
> > block
> > in our path, which free's our reference to that block.  Then once we get 
> > back to
> > that shared block we'll free our reference to it again, and get ENOENT when
> > trying to lookup our extent reference to that block in __btrfs_free_extent.
> >
> > This is ultimately happening because we have delayed items left to be 
> > processed
> > for our deleted snapshot _after_ all of the inodes are closed for the 
> > snapshot.
> > We only run the delayed inode item if we're deleting the inode, and even 
> > then we
> > do not run the delayed insertions or delayed removals.  These can be run at 
> > any
> > point after our final inode does it's last iput, which is what triggers the
> > snapshot deletion.  We can end up with the snapshot deletion happening and 
> > then
> > have the delayed items run on that file system, resulting in the above 
> > problem.
> >
> > This problem has existed forever, however my patches made it much easier to 
> > hit
> > as I wake up the cleaner much more often to deal with delayed iputs, which 
> > made
> > us more likely to start the snapshot dropping work before the transaction
> > commits, which is when the delayed items would generally be run.  Before,
> > generally speaking, we would run the delayed items, commit the transaction, 
> > and
> > wakeup the cleaner thread to start deleting snapshots, which means we were 
> > less
> > likely to hit this problem.  You could still hit it if you had multiple
> > snapshots to be deleted and ended up with lots of delayed items, but it was
> > definitely harder.
> >
> > Fix for now by simply running all the delayed items before starting to drop 
> > the
> > snapshot.  We could make this smarter in the future by making the delayed 
> > items
> > per-root, and then simply drop any delayed items for roots that we are 
> > going to
> > delete.  But for now just a quick and easy solution is the safest.
> >
> > Cc: sta...@vger.kernel.org
> > Signed-off-by: Josef Bacik 
>
> Reviewed-by: Filipe Manana 
>
> Great catch!
> I've hit this error from  __btrfs_free_extent() a handful of times
> over the years, but never managed
> to reproduce it on demand or figure out it was related to snapshot deletion.
>
> > ---
> >  fs/btrfs/extent-tree.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index dcb699dd57f3..965702034b22 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -9330,6 +9330,8 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
> > goto out_free;
> > }
> >
> > +   btrfs_run_delayed_items(trans);
> > +

Btw, we should check the return value of this and return it if it's an error?
We can't do nothing with it in the context of the cleaner thread,
which is why, I suppose, you chose to ignore the value (besides that
the error might have been for some other root).
But this can be used in the context of relocation, where we can return
the error back to userspace.

Thanks.

> > if (block_rsv)
> > trans->block_rsv = block_rsv;
> >
> > --
> > 2.14.3
> >
>
>
> --
> Filipe David Manana,
>
> “Whether you think you can, or you think you can't — you're right.”



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH 1/2] btrfs: catch cow on deleting snapshots

2018-11-30 Thread Josef Bacik
On Fri, Nov 30, 2018 at 05:14:54PM +, Filipe Manana wrote:
> On Fri, Nov 30, 2018 at 4:53 PM Josef Bacik  wrote:
> >
> > From: Josef Bacik 
> >
> > When debugging some weird extent reference bug I suspected that we were
> > changing a snapshot while we were deleting it, which could explain my
> > bug.  This was indeed what was happening, and this patch helped me
> > verify my theory.  It is never correct to modify the snapshot once it's
> > being deleted, so mark the root when we are deleting it and make sure we
> > complain about it when it happens.
> >
> > Signed-off-by: Josef Bacik 
> > ---
> >  fs/btrfs/ctree.c   | 3 +++
> >  fs/btrfs/ctree.h   | 1 +
> >  fs/btrfs/extent-tree.c | 9 +
> >  3 files changed, 13 insertions(+)
> >
> > diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> > index 5912a97b07a6..5f82f86085e8 100644
> > --- a/fs/btrfs/ctree.c
> > +++ b/fs/btrfs/ctree.c
> > @@ -1440,6 +1440,9 @@ noinline int btrfs_cow_block(struct 
> > btrfs_trans_handle *trans,
> > u64 search_start;
> > int ret;
> >
> > +   if (test_bit(BTRFS_ROOT_DELETING, &root->state))
> > +   WARN(1, KERN_CRIT "cow'ing blocks on a fs root thats being 
> > dropped\n");
> 
> Please use btrfs_warn(), it makes sure we use a consistent message
> style, identifies the fs, etc.
> Also, "thats" should be "that is" or "that's".
> 

Ah yeah, I was following the other convention in there but we should probably
convert all of those to btrfs_warn.  I'll fix the grammer thing as well, just a
leftover from the much less code of conduct friendly message I originally had
there.  Thanks,

Josef


Re: [PATCH 1/2] btrfs: catch cow on deleting snapshots

2018-11-30 Thread Filipe Manana
On Fri, Nov 30, 2018 at 4:53 PM Josef Bacik  wrote:
>
> From: Josef Bacik 
>
> When debugging some weird extent reference bug I suspected that we were
> changing a snapshot while we were deleting it, which could explain my
> bug.  This was indeed what was happening, and this patch helped me
> verify my theory.  It is never correct to modify the snapshot once it's
> being deleted, so mark the root when we are deleting it and make sure we
> complain about it when it happens.
>
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/ctree.c   | 3 +++
>  fs/btrfs/ctree.h   | 1 +
>  fs/btrfs/extent-tree.c | 9 +
>  3 files changed, 13 insertions(+)
>
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 5912a97b07a6..5f82f86085e8 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -1440,6 +1440,9 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle 
> *trans,
> u64 search_start;
> int ret;
>
> +   if (test_bit(BTRFS_ROOT_DELETING, &root->state))
> +   WARN(1, KERN_CRIT "cow'ing blocks on a fs root thats being 
> dropped\n");

Please use btrfs_warn(), it makes sure we use a consistent message
style, identifies the fs, etc.
Also, "thats" should be "that is" or "that's".

With that fixed,
Reviewed-by: Filipe Manana 

> +
> if (trans->transaction != fs_info->running_transaction)
> WARN(1, KERN_CRIT "trans %llu running %llu\n",
>trans->transid,
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index facde70c15ed..5a3a94ccb65c 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1199,6 +1199,7 @@ enum {
> BTRFS_ROOT_FORCE_COW,
> BTRFS_ROOT_MULTI_LOG_TASKS,
> BTRFS_ROOT_DIRTY,
> +   BTRFS_ROOT_DELETING,
>  };
>
>  /*
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 581c2a0b2945..dcb699dd57f3 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -9333,6 +9333,15 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
> if (block_rsv)
> trans->block_rsv = block_rsv;
>
> +   /*
> +* This will help us catch people modifying the fs tree while we're
> +* dropping it.  It is unsafe to mess with the fs tree while it's 
> being
> +* dropped as we unlock the root node and parent nodes as we walk down
> +* the tree, assuming nothing will change.  If something does change
> +* then we'll have stale information and drop references to blocks 
> we've
> +* already dropped.
> +*/
> +   set_bit(BTRFS_ROOT_DELETING, &root->state);
> if (btrfs_disk_key_objectid(&root_item->drop_progress) == 0) {
> level = btrfs_header_level(root->node);
> path->nodes[level] = btrfs_lock_root_node(root);
> --
> 2.14.3
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH 2/2] btrfs: run delayed items before dropping the snapshot

2018-11-30 Thread Filipe Manana
On Fri, Nov 30, 2018 at 4:53 PM Josef Bacik  wrote:
>
> From: Josef Bacik 
>
> With my delayed refs patches in place we started seeing a large amount
> of aborts in __btrfs_free_extent
>
> BTRFS error (device sdb1): unable to find ref byte nr 91947008 parent 0 root 
> 35964  owner 1 offset 0
> Call Trace:
>  ? btrfs_merge_delayed_refs+0xaf/0x340
>  __btrfs_run_delayed_refs+0x6ea/0xfc0
>  ? btrfs_set_path_blocking+0x31/0x60
>  btrfs_run_delayed_refs+0xeb/0x180
>  btrfs_commit_transaction+0x179/0x7f0
>  ? btrfs_check_space_for_delayed_refs+0x30/0x50
>  ? should_end_transaction.isra.19+0xe/0x40
>  btrfs_drop_snapshot+0x41c/0x7c0
>  btrfs_clean_one_deleted_snapshot+0xb5/0xd0
>  cleaner_kthread+0xf6/0x120
>  kthread+0xf8/0x130
>  ? btree_invalidatepage+0x90/0x90
>  ? kthread_bind+0x10/0x10
>  ret_from_fork+0x35/0x40
>
> This was because btrfs_drop_snapshot depends on the root not being modified
> while it's dropping the snapshot.  It will unlock the root node (and really
> every node) as it walks down the tree, only to re-lock it when it needs to do
> something.  This is a problem because if we modify the tree we could cow a 
> block
> in our path, which free's our reference to that block.  Then once we get back 
> to
> that shared block we'll free our reference to it again, and get ENOENT when
> trying to lookup our extent reference to that block in __btrfs_free_extent.
>
> This is ultimately happening because we have delayed items left to be 
> processed
> for our deleted snapshot _after_ all of the inodes are closed for the 
> snapshot.
> We only run the delayed inode item if we're deleting the inode, and even then 
> we
> do not run the delayed insertions or delayed removals.  These can be run at 
> any
> point after our final inode does it's last iput, which is what triggers the
> snapshot deletion.  We can end up with the snapshot deletion happening and 
> then
> have the delayed items run on that file system, resulting in the above 
> problem.
>
> This problem has existed forever, however my patches made it much easier to 
> hit
> as I wake up the cleaner much more often to deal with delayed iputs, which 
> made
> us more likely to start the snapshot dropping work before the transaction
> commits, which is when the delayed items would generally be run.  Before,
> generally speaking, we would run the delayed items, commit the transaction, 
> and
> wakeup the cleaner thread to start deleting snapshots, which means we were 
> less
> likely to hit this problem.  You could still hit it if you had multiple
> snapshots to be deleted and ended up with lots of delayed items, but it was
> definitely harder.
>
> Fix for now by simply running all the delayed items before starting to drop 
> the
> snapshot.  We could make this smarter in the future by making the delayed 
> items
> per-root, and then simply drop any delayed items for roots that we are going 
> to
> delete.  But for now just a quick and easy solution is the safest.
>
> Cc: sta...@vger.kernel.org
> Signed-off-by: Josef Bacik 

Reviewed-by: Filipe Manana 

Great catch!
I've hit this error from  __btrfs_free_extent() a handful of times
over the years, but never managed
to reproduce it on demand or figure out it was related to snapshot deletion.

> ---
>  fs/btrfs/extent-tree.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index dcb699dd57f3..965702034b22 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -9330,6 +9330,8 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
> goto out_free;
> }
>
> +   btrfs_run_delayed_items(trans);
> +
> if (block_rsv)
> trans->block_rsv = block_rsv;
>
> --
> 2.14.3
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


[PATCH 0/2] Fix aborts when dropping snapshots

2018-11-30 Thread Josef Bacik
With more machines running my recent delayed ref rsv patches we started seeing a
spike in boxes aborting in __btrfs_free_extent with being unable to find the
extent ref.  The full details are in 2/2, but the summary is there's been a bug
ever since the original delayed inode item stuff was introduced where it could
run once the snapshot was being deleted, which will result in all sorts of
extent reference shenanigans.  This was tricky to hit before, but with my iput
changes it's become much easier to hit on our build boxes that are heavy users
of snapshot creation/deletion.  Thanks,

Josef


[PATCH 1/2] btrfs: catch cow on deleting snapshots

2018-11-30 Thread Josef Bacik
From: Josef Bacik 

When debugging some weird extent reference bug I suspected that we were
changing a snapshot while we were deleting it, which could explain my
bug.  This was indeed what was happening, and this patch helped me
verify my theory.  It is never correct to modify the snapshot once it's
being deleted, so mark the root when we are deleting it and make sure we
complain about it when it happens.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/ctree.c   | 3 +++
 fs/btrfs/ctree.h   | 1 +
 fs/btrfs/extent-tree.c | 9 +
 3 files changed, 13 insertions(+)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 5912a97b07a6..5f82f86085e8 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -1440,6 +1440,9 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle 
*trans,
u64 search_start;
int ret;
 
+   if (test_bit(BTRFS_ROOT_DELETING, &root->state))
+   WARN(1, KERN_CRIT "cow'ing blocks on a fs root thats being 
dropped\n");
+
if (trans->transaction != fs_info->running_transaction)
WARN(1, KERN_CRIT "trans %llu running %llu\n",
   trans->transid,
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index facde70c15ed..5a3a94ccb65c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1199,6 +1199,7 @@ enum {
BTRFS_ROOT_FORCE_COW,
BTRFS_ROOT_MULTI_LOG_TASKS,
BTRFS_ROOT_DIRTY,
+   BTRFS_ROOT_DELETING,
 };
 
 /*
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 581c2a0b2945..dcb699dd57f3 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9333,6 +9333,15 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
if (block_rsv)
trans->block_rsv = block_rsv;
 
+   /*
+* This will help us catch people modifying the fs tree while we're
+* dropping it.  It is unsafe to mess with the fs tree while it's being
+* dropped as we unlock the root node and parent nodes as we walk down
+* the tree, assuming nothing will change.  If something does change
+* then we'll have stale information and drop references to blocks we've
+* already dropped.
+*/
+   set_bit(BTRFS_ROOT_DELETING, &root->state);
if (btrfs_disk_key_objectid(&root_item->drop_progress) == 0) {
level = btrfs_header_level(root->node);
path->nodes[level] = btrfs_lock_root_node(root);
-- 
2.14.3



[PATCH 2/2] btrfs: run delayed items before dropping the snapshot

2018-11-30 Thread Josef Bacik
From: Josef Bacik 

With my delayed refs patches in place we started seeing a large amount
of aborts in __btrfs_free_extent

BTRFS error (device sdb1): unable to find ref byte nr 91947008 parent 0 root 
35964  owner 1 offset 0
Call Trace:
 ? btrfs_merge_delayed_refs+0xaf/0x340
 __btrfs_run_delayed_refs+0x6ea/0xfc0
 ? btrfs_set_path_blocking+0x31/0x60
 btrfs_run_delayed_refs+0xeb/0x180
 btrfs_commit_transaction+0x179/0x7f0
 ? btrfs_check_space_for_delayed_refs+0x30/0x50
 ? should_end_transaction.isra.19+0xe/0x40
 btrfs_drop_snapshot+0x41c/0x7c0
 btrfs_clean_one_deleted_snapshot+0xb5/0xd0
 cleaner_kthread+0xf6/0x120
 kthread+0xf8/0x130
 ? btree_invalidatepage+0x90/0x90
 ? kthread_bind+0x10/0x10
 ret_from_fork+0x35/0x40

This was because btrfs_drop_snapshot depends on the root not being modified
while it's dropping the snapshot.  It will unlock the root node (and really
every node) as it walks down the tree, only to re-lock it when it needs to do
something.  This is a problem because if we modify the tree we could cow a block
in our path, which free's our reference to that block.  Then once we get back to
that shared block we'll free our reference to it again, and get ENOENT when
trying to lookup our extent reference to that block in __btrfs_free_extent.

This is ultimately happening because we have delayed items left to be processed
for our deleted snapshot _after_ all of the inodes are closed for the snapshot.
We only run the delayed inode item if we're deleting the inode, and even then we
do not run the delayed insertions or delayed removals.  These can be run at any
point after our final inode does it's last iput, which is what triggers the
snapshot deletion.  We can end up with the snapshot deletion happening and then
have the delayed items run on that file system, resulting in the above problem.

This problem has existed forever, however my patches made it much easier to hit
as I wake up the cleaner much more often to deal with delayed iputs, which made
us more likely to start the snapshot dropping work before the transaction
commits, which is when the delayed items would generally be run.  Before,
generally speaking, we would run the delayed items, commit the transaction, and
wakeup the cleaner thread to start deleting snapshots, which means we were less
likely to hit this problem.  You could still hit it if you had multiple
snapshots to be deleted and ended up with lots of delayed items, but it was
definitely harder.

Fix for now by simply running all the delayed items before starting to drop the
snapshot.  We could make this smarter in the future by making the delayed items
per-root, and then simply drop any delayed items for roots that we are going to
delete.  But for now just a quick and easy solution is the safest.

Cc: sta...@vger.kernel.org
Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index dcb699dd57f3..965702034b22 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9330,6 +9330,8 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
goto out_free;
}
 
+   btrfs_run_delayed_items(trans);
+
if (block_rsv)
trans->block_rsv = block_rsv;
 
-- 
2.14.3



Re: btrfs development - question about crypto api integration

2018-11-30 Thread Nikolay Borisov



On 30.11.18 г. 17:22 ч., Chris Mason wrote:
> On 29 Nov 2018, at 12:37, Nikolay Borisov wrote:
> 
>> On 29.11.18 г. 18:43 ч., Jean Fobe wrote:
>>> Hi all,
>>> I've been studying LZ4 and other compression algorithms on the
>>> kernel, and seen other projects such as zram and ubifs using the
>>> crypto api. Is there a technical reason for not using the crypto api
>>> for compression (and possibly for encryption) in btrfs?
>>> I did not find any design/technical implementation choices in
>>> btrfs development in the developer's FAQ on the wiki. If I completely
>>> missed it, could someone point me in the right direction?
>>> Lastly, if there is no technical reason for this, would it be
>>> something interesting to have implemented?
>>
>> I personally think it would be better if btrfs' exploited the generic
>> framework. And in fact when you look at zstd, btrfs does use the
>> generic, low-level ZSTD routines but not the crypto library wrappers. 
>> If
>> I were I'd try and convert zstd (since it's the most recently added
>> algorithm) to using the crypto layer to see if there are any lurking
>> problems.
> 
> Back when I first added the zlib support, the zlib API was both easier 
> to use and a better fit for our async worker threads.  That doesn't mean 
> we shouldn't switch, it's just how we got to step one ;)

And what about zstd? WHy is it also using the low level api and not the
crypto wrappers?

> 
> -chris
> 


Re: btrfs development - question about crypto api integration

2018-11-30 Thread Chris Mason
On 29 Nov 2018, at 12:37, Nikolay Borisov wrote:

> On 29.11.18 г. 18:43 ч., Jean Fobe wrote:
>> Hi all,
>> I've been studying LZ4 and other compression algorithms on the
>> kernel, and seen other projects such as zram and ubifs using the
>> crypto api. Is there a technical reason for not using the crypto api
>> for compression (and possibly for encryption) in btrfs?
>> I did not find any design/technical implementation choices in
>> btrfs development in the developer's FAQ on the wiki. If I completely
>> missed it, could someone point me in the right direction?
>> Lastly, if there is no technical reason for this, would it be
>> something interesting to have implemented?
>
> I personally think it would be better if btrfs' exploited the generic
> framework. And in fact when you look at zstd, btrfs does use the
> generic, low-level ZSTD routines but not the crypto library wrappers. 
> If
> I were I'd try and convert zstd (since it's the most recently added
> algorithm) to using the crypto layer to see if there are any lurking
> problems.

Back when I first added the zlib support, the zlib API was both easier 
to use and a better fit for our async worker threads.  That doesn't mean 
we shouldn't switch, it's just how we got to step one ;)

-chris


Need help with potential ~45TB dataloss

2018-11-30 Thread Patrick Dijkgraaf
Hi all,

I have been a happy BTRFS user for quite some time. But now I'm facing
a potential ~45TB dataloss... :-(
I hope someone can help!

I have Server A and Server B. Both having a 20-devices BTRFS RAID6
filesystem. Because of known RAID5/6 risks, Server B was a backup of
Server A.
After applying updates to server B and reboot, the FS would not mount
anymore. Because it was "just" a backup. I decided to recreate the FS
and perform a new backup. Later, I discovered that the FS was not
broken, but I faced this issue: 
https://patchwork.kernel.org/patch/10694997/

Anyway, the FS was already recreated, so I needed to do a new backup.
During the backup (using rsync -vah), Server A (the source) encountered
an I/O error and my rsync failed. In an attempt to "quick fix" the
issue, I rebooted Server A after which the FS would not mount anymore.

I documented what I have tried, below. I have not yet tried anything
except what is shown, because I am afraid of causing more harm to
the FS. I hope somebody here can give me advice on how to (hopefully)
retrieve my data...

Thanks in advance!

==

[root@cornelis ~]# btrfs fi show
Label: 'cornelis-btrfs'  uuid: ac643516-670e-40f3-aa4c-f329fc3795fd
Total devices 1 FS bytes used 463.92GiB
devid1 size 800.00GiB used 493.02GiB path
/dev/mapper/cornelis-cornelis--btrfs

Label: 'data'  uuid: 4c66fa8b-8fc6-4bba-9d83-02a2a1d69ad5
Total devices 20 FS bytes used 44.85TiB
devid1 size 3.64TiB used 3.64TiB path /dev/sdn2
devid2 size 3.64TiB used 3.64TiB path /dev/sdp2
devid3 size 3.64TiB used 3.64TiB path /dev/sdu2
devid4 size 3.64TiB used 3.64TiB path /dev/sdx2
devid5 size 3.64TiB used 3.64TiB path /dev/sdh2
devid6 size 3.64TiB used 3.64TiB path /dev/sdg2
devid7 size 3.64TiB used 3.64TiB path /dev/sdm2
devid8 size 3.64TiB used 3.64TiB path /dev/sdw2
devid9 size 3.64TiB used 3.64TiB path /dev/sdj2
devid   10 size 3.64TiB used 3.64TiB path /dev/sdt2
devid   11 size 3.64TiB used 3.64TiB path /dev/sdk2
devid   12 size 3.64TiB used 3.64TiB path /dev/sdq2
devid   13 size 3.64TiB used 3.64TiB path /dev/sds2
devid   14 size 3.64TiB used 3.64TiB path /dev/sdf2
devid   15 size 7.28TiB used 588.80GiB path /dev/sdr2
devid   16 size 7.28TiB used 588.80GiB path /dev/sdo2
devid   17 size 7.28TiB used 588.80GiB path /dev/sdv2
devid   18 size 7.28TiB used 588.80GiB path /dev/sdi2
devid   19 size 7.28TiB used 588.80GiB path /dev/sdl2
devid   20 size 7.28TiB used 588.80GiB path /dev/sde2

[root@cornelis ~]# mount /dev/sdn2 /mnt/data
mount: /mnt/data: wrong fs type, bad option, bad superblock on
/dev/sdn2, missing codepage or helper program, or other error.

[root@cornelis ~]# btrfs check /dev/sdn2
Opening filesystem to check...
parent transid verify failed on 46451963543552 wanted 114401 found
114173
parent transid verify failed on 46451963543552 wanted 114401 found
114173
checksum verify failed on 46451963543552 found A8F2A769 wanted 4C111ADF
checksum verify failed on 46451963543552 found 32153BE8 wanted 8B07ABE4
checksum verify failed on 46451963543552 found 32153BE8 wanted 8B07ABE4
bad tree block 46451963543552, bytenr mismatch, want=46451963543552,
have=75208089814272
Couldn't read tree root
ERROR: cannot open file system

[root@cornelis ~]# btrfs restore /dev/sdn2 /mnt/data/
parent transid verify failed on 46451963543552 wanted 114401 found
114173
parent transid verify failed on 46451963543552 wanted 114401 found
114173
checksum verify failed on 46451963543552 found A8F2A769 wanted 4C111ADF
checksum verify failed on 46451963543552 found 32153BE8 wanted 8B07ABE4
checksum verify failed on 46451963543552 found 32153BE8 wanted 8B07ABE4
bad tree block 46451963543552, bytenr mismatch, want=46451963543552,
have=75208089814272
Couldn't read tree root
Could not open root, trying backup super
warning, device 14 is missing
warning, device 13 is missing
warning, device 12 is missing
warning, device 11 is missing
warning, device 10 is missing
warning, device 9 is missing
warning, device 8 is missing
warning, device 7 is missing
warning, device 6 is missing
warning, device 5 is missing
warning, device 4 is missing
warning, device 3 is missing
warning, device 2 is missing
checksum verify failed on 22085632 found 5630EA32 wanted 1AA6FFF0
checksum verify failed on 22085632 found 5630EA32 wanted 1AA6FFF0
bad tree block 22085632, bytenr mismatch, want=22085632,
have=1147797504
ERROR: cannot read chunk root
Could not open root, trying backup super
warning, device 14 is missing
warning, device 13 is missing
warning, device 12 is missing
warning, device 11 is missing
warning, device 10 is missing
warning, device 9 is missing
warning, device 8 is missing
warning, device 7 is missing
warning, device 6 is missing
warning, device 5 is missing
warning, device 4 i

Re: [PATCH 2/2] fs: Don't open-code lru_to_page

2018-11-30 Thread Yan, Zheng
On Thu, Nov 29, 2018 at 3:56 PM Nikolay Borisov  wrote:
>
> There are a bunch of filesystems which essentially open-code lru_to_page
> helper. Change them to using the helper. No functional changes.
>
> Signed-off-by: Nikolay Borisov 
> ---
>
> Since this is a mostly mechanical change I've actually batched all of them in
> a single patch.
>
>  fs/afs/file.c| 5 +++--
>  fs/btrfs/extent_io.c | 2 +-
>  fs/ceph/addr.c   | 5 ++---
>  fs/cifs/file.c   | 3 ++-
>  fs/ext4/readpage.c   | 2 +-
>  fs/ocfs2/aops.c  | 3 ++-
>  fs/orangefs/inode.c  | 2 +-
>  mm/swap.c| 2 +-
>  8 files changed, 13 insertions(+), 11 deletions(-)
>
> diff --git a/fs/afs/file.c b/fs/afs/file.c
> index d6bc3f5d784b..323ae9912203 100644
> --- a/fs/afs/file.c
> +++ b/fs/afs/file.c
> @@ -17,6 +17,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "internal.h"
>
>  static int afs_file_mmap(struct file *file, struct vm_area_struct *vma);
> @@ -441,7 +442,7 @@ static int afs_readpages_one(struct file *file, struct 
> address_space *mapping,
> /* Count the number of contiguous pages at the front of the list.  
> Note
>  * that the list goes prev-wards rather than next-wards.
>  */
> -   first = list_entry(pages->prev, struct page, lru);
> +   first = lru_to_page(pages);
> index = first->index + 1;
> n = 1;
> for (p = first->lru.prev; p != pages; p = p->prev) {
> @@ -473,7 +474,7 @@ static int afs_readpages_one(struct file *file, struct 
> address_space *mapping,
>  * page at the end of the file.
>  */
> do {
> -   page = list_entry(pages->prev, struct page, lru);
> +   page = lru_to_page(pages);
> list_del(&page->lru);
> index = page->index;
> if (add_to_page_cache_lru(page, mapping, index,
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 19f4b8fd654f..8332c5f4b1c3 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -4104,7 +4104,7 @@ int extent_readpages(struct address_space *mapping, 
> struct list_head *pages,
> u64 prev_em_start = (u64)-1;
>
> for (page_idx = 0; page_idx < nr_pages; page_idx++) {
> -   page = list_entry(pages->prev, struct page, lru);
> +   page = lru_to_page(pages);
>
> prefetchw(&page->flags);
> list_del(&page->lru);
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index 8eade7a993c1..5d0c05e288cc 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -306,7 +306,7 @@ static int start_read(struct inode *inode, struct 
> ceph_rw_context *rw_ctx,
> struct ceph_osd_client *osdc =
> &ceph_inode_to_client(inode)->client->osdc;
> struct ceph_inode_info *ci = ceph_inode(inode);
> -   struct page *page = list_entry(page_list->prev, struct page, lru);
> +   struct page *page = lru_to_page(page_list);
> struct ceph_vino vino;
> struct ceph_osd_request *req;
> u64 off;
> @@ -333,8 +333,7 @@ static int start_read(struct inode *inode, struct 
> ceph_rw_context *rw_ctx,
> if (got)
> ceph_put_cap_refs(ci, got);
> while (!list_empty(page_list)) {
> -   page = list_entry(page_list->prev,
> - struct page, lru);
> +   page = lru_to_page(page_list);
> list_del(&page->lru);
> put_page(page);
> }

For the ceph patch

Acked-by:: "Yan, Zheng"  diff --git a/fs/cifs/file.c b/fs/cifs/file.c
> index 74c33d5fafc8..b16a4d887d17 100644
> --- a/fs/cifs/file.c
> +++ b/fs/cifs/file.c
> @@ -33,6 +33,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include "cifsfs.h"
>  #include "cifspdu.h"
> @@ -3975,7 +3976,7 @@ readpages_get_pages(struct address_space *mapping, 
> struct list_head *page_list,
>
> INIT_LIST_HEAD(tmplist);
>
> -   page = list_entry(page_list->prev, struct page, lru);
> +   page = lru_to_page(page_list);
>
> /*
>  * Lock the page and put it in the cache. Since no one else
> diff --git a/fs/ext4/readpage.c b/fs/ext4/readpage.c
> index f461d75ac049..6aa282ee455a 100644
> --- a/fs/ext4/readpage.c
> +++ b/fs/ext4/readpage.c
> @@ -128,7 +128,7 @@ int ext4_mpage_readpages(struct address_space *mapping,
>
> prefetchw(&page->flags);
> if (pages) {
> -   page = list_entry(pages->prev, struct page, lru);
> +   page = lru_to_page(pages);
> list_del(&page->lru);
> if (add_to_page_cache_lru(page, mapping, page->index,
>   readahead_gfp_mask(mapping)))
> diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
> index eb1ce304

[PATCH RESEND] fstests: btrfs use forget if not reload

2018-11-30 Thread Anand Jain
[I will send this the xfstest ML after kernel and progs patch
 has been integrated].

btrfs reload was introduced to cleanup the device list inside the btrfs
kernel module.

The problem with the reload approach is that you can't run btrfs test
cases 124,125, 154 and 164 on the system with btrfs as root fs.

Now as we are introducing the btrfs forget feature as an btrfs device
scan option [1], so here are its pertaining changes in the fstests.

So these changes ensures we use the forget feature where available and
if its not then falls back to the reload approach.

[1]
btrfs-progs: add cli to forget one or all scanned devices
btrfs: introduce feature to forget a btrfs device

Signed-off-by: Anand Jain 
---
 common/btrfs| 20 
 tests/btrfs/124 |  6 +++---
 tests/btrfs/125 |  6 +++---
 tests/btrfs/154 |  6 +++---
 tests/btrfs/164 |  4 ++--
 5 files changed, 31 insertions(+), 11 deletions(-)

diff --git a/common/btrfs b/common/btrfs
index 26dc0bb9600f..c7fbec11c8c1 100644
--- a/common/btrfs
+++ b/common/btrfs
@@ -374,3 +374,23 @@ _scratch_btrfs_sectorsize()
$BTRFS_UTIL_PROG inspect-internal dump-super $SCRATCH_DEV |\
grep sectorsize | awk '{print $2}'
 }
+
+_btrfs_supports_forget()
+{
+   $BTRFS_UTIL_PROG device scan --help | grep -wq forget && \
+   $BTRFS_UTIL_PROG device scan --forget > /dev/null 2>&1
+}
+
+_require_btrfs_forget_if_not_fs_loadable()
+{
+   _btrfs_supports_forget && return
+
+   _require_loadable_fs_module "btrfs"
+}
+
+_btrfs_forget_if_not_fs_reload()
+{
+   _btrfs_supports_forget && return
+
+   _reload_fs_module "btrfs"
+}
diff --git a/tests/btrfs/124 b/tests/btrfs/124
index ce3ad6aa3a58..edbeff1443f5 100755
--- a/tests/btrfs/124
+++ b/tests/btrfs/124
@@ -51,7 +51,7 @@ _supported_fs btrfs
 _supported_os Linux
 _require_scratch_dev_pool 2
 _test_unmount
-_require_loadable_fs_module "btrfs"
+_require_btrfs_forget_if_not_fs_loadable
 
 _scratch_dev_pool_get 2
 
@@ -86,7 +86,7 @@ echo "clean btrfs ko" >> $seqres.full
 _scratch_unmount
 
 # un-scan the btrfs devices
-_reload_fs_module "btrfs"
+_btrfs_forget_if_not_fs_reload
 
 echo >> $seqres.full
 echo "-Write degraded mount fill upto $max_fs_sz bytes-" >> 
$seqres.full
@@ -125,7 +125,7 @@ echo
 echo "Mount degraded with the other dev"
 _scratch_unmount
 # un-scan the btrfs devices
-_reload_fs_module "btrfs"
+_btrfs_forget_if_not_fs_reload
 _mount -o degraded $dev2 $SCRATCH_MNT >>$seqres.full 2>&1
 _run_btrfs_util_prog filesystem show
 checkpoint3=`md5sum $SCRATCH_MNT/tf2`
diff --git a/tests/btrfs/125 b/tests/btrfs/125
index e38de264b28e..9161a2ddeaad 100755
--- a/tests/btrfs/125
+++ b/tests/btrfs/125
@@ -50,7 +50,7 @@ _supported_fs btrfs
 _supported_os Linux
 _require_scratch_dev_pool 3
 _test_unmount
-_require_loadable_fs_module "btrfs"
+_require_btrfs_forget_if_not_fs_loadable
 
 _scratch_dev_pool_get 3
 
@@ -102,7 +102,7 @@ echo "unmount" >> $seqres.full
 _scratch_unmount
 echo "clean btrfs ko" >> $seqres.full
 # un-scan the btrfs devices
-_reload_fs_module "btrfs"
+_btrfs_forget_if_not_fs_reload
 _mount -o degraded,device=$dev2 $dev1 $SCRATCH_MNT >>$seqres.full 2>&1
 dd if=/dev/zero of="$SCRATCH_MNT"/tf2 bs=$bs count=$count \
>>$seqres.full 2>&1
@@ -138,7 +138,7 @@ echo "Mount degraded but with other dev"
 
 _scratch_unmount
 # un-scan the btrfs devices
-_reload_fs_module "btrfs"
+_btrfs_forget_if_not_fs_reload
 
 _mount -o degraded,device=${dev2} $dev3 $SCRATCH_MNT >>$seqres.full 2>&1
 
diff --git a/tests/btrfs/154 b/tests/btrfs/154
index 99ea232aba4c..01745d20e9fc 100755
--- a/tests/btrfs/154
+++ b/tests/btrfs/154
@@ -36,7 +36,7 @@ rm -f $seqres.full
 _supported_fs btrfs
 _supported_os Linux
 _require_scratch_dev_pool 2
-_require_loadable_fs_module "btrfs"
+_require_btrfs_forget_if_not_fs_loadable
 
 _scratch_dev_pool_get 2
 
@@ -90,7 +90,7 @@ degrade_mount_write()
 
echo "clean btrfs ko" >> $seqres.full
# un-scan the btrfs devices
-   _reload_fs_module "btrfs"
+   _btrfs_forget_if_not_fs_reload
_mount -o degraded $DEV1 $SCRATCH_MNT >>$seqres.full 2>&1
cnt=$(( $COUNT/10 ))
dd if=/dev/urandom of="$SCRATCH_MNT"/tf1 bs=$bs count=$cnt \
@@ -142,7 +142,7 @@ verify()
echo "unmount" >> $seqres.full
 
_scratch_unmount
-   _reload_fs_module "btrfs"
+   _btrfs_forget_if_not_fs_reload
_mount -o degraded $DEV2 $SCRATCH_MNT >>$seqres.full 2>&1
verify_checkpoint1=`md5sum $SCRATCH_MNT/tf1`
verify_checkpoint2=`md5sum $SCRATCH_MNT/tf2`
diff --git a/tests/btrfs/164 b/tests/btrfs/164
index 097191a0e493..55042c4035e0 100755
--- a/tests/btrfs/164
+++ b/tests/btrfs/164
@@ -36,7 +36,7 @@ rm -f $seqres.full
 # Modify as appropriate.
 _supported_fs btrfs
 _supported_os Linux
-_require_loadable_fs_module "btrfs"
+_require_btrfs_forget_if_not_fs_loadable
 _require_scratch_dev_pool 2
 
 _scratch_dev_pool_get 2
@@ -69,7 +69,7 @@ delete_seed()
 {

[PATCH RESEND v12] btrfs: introduce feature to forget a btrfs device

2018-11-30 Thread Anand Jain
Support for a new command 'btrfs dev forget [dev]' is proposed here
to undo the effects of 'btrfs dev scan [dev]'. For this purpose
this patch proposes to use ioctl #5 as it was empty.
IOW(BTRFS_IOCTL_MAGIC, 5, ..)
This patch adds new ioctl BTRFS_IOC_FORGET_DEV which can be sent from
the /dev/btrfs-control to forget one or all devices, (devices which are
not mounted) from the btrfs kernel.

The argument it takes is struct btrfs_ioctl_vol_args, and ::name can be
set to specify the device path. And all unmounted devices can be removed
from the kernel if no device path is provided.

Again, the devices are removed only if the relevant fsid aren't mounted.

This new cli can provide..
 . Release of unwanted btrfs_fs_devices and btrfs_devices memory if the
   device is not going to be mounted.
 . Ability to mount the device in degraded mode when one of the other
   device is corrupted like in split brain raid1.
 . Running test cases which requires btrfs.ko-reload if the rootfs
   is btrfs.

Signed-off-by: Anand Jain 
Reviewed-by: Nikolay Borisov 
---
v11->v12: fix coding style add spacing before after ":".
v1->v11: Pls ref to the cover-letter. (sorry about that).

 fs/btrfs/super.c   | 3 +++
 fs/btrfs/volumes.c | 9 +
 fs/btrfs/volumes.h | 1 +
 include/uapi/linux/btrfs.h | 2 ++
 4 files changed, 15 insertions(+)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index d3c6bbc0aa3a..eba2966913ae 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2247,6 +2247,9 @@ static long btrfs_control_ioctl(struct file *file, 
unsigned int cmd,
ret = PTR_ERR_OR_ZERO(device);
mutex_unlock(&uuid_mutex);
break;
+   case BTRFS_IOC_FORGET_DEV:
+   ret = btrfs_forget_devices(vol->name);
+   break;
case BTRFS_IOC_DEVICES_READY:
mutex_lock(&uuid_mutex);
device = btrfs_scan_one_device(vol->name, FMODE_READ,
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 8e36cbb355df..48415a9edd46 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1397,6 +1397,15 @@ static int btrfs_read_disk_super(struct block_device 
*bdev, u64 bytenr,
return 0;
 }
 
+int btrfs_forget_devices(const char *path)
+{
+   mutex_lock(&uuid_mutex);
+   btrfs_free_stale_devices(strlen(path) ? path : NULL, NULL);
+   mutex_unlock(&uuid_mutex);
+
+   return 0;
+}
+
 /*
  * Look for a btrfs signature on a device. This may be called out of the mount 
path
  * and we are not allowed to call set_blocksize during the scan. The superblock
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 1d936ce282c3..a4a89d5050f5 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -414,6 +414,7 @@ int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
   fmode_t flags, void *holder);
 struct btrfs_device *btrfs_scan_one_device(const char *path,
   fmode_t flags, void *holder);
+int btrfs_forget_devices(const char *path);
 int btrfs_close_devices(struct btrfs_fs_devices *fs_devices);
 void btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices, int step);
 void btrfs_assign_next_active_device(struct btrfs_device *device,
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index e0763bc4158e..c195896d478f 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -837,6 +837,8 @@ enum btrfs_err_code {
   struct btrfs_ioctl_vol_args)
 #define BTRFS_IOC_SCAN_DEV _IOW(BTRFS_IOCTL_MAGIC, 4, \
   struct btrfs_ioctl_vol_args)
+#define BTRFS_IOC_FORGET_DEV _IOW(BTRFS_IOCTL_MAGIC, 5, \
+  struct btrfs_ioctl_vol_args)
 /* trans start and trans end are dangerous, and only for
  * use by applications that know how to avoid the
  * resulting deadlocks
-- 
1.8.3.1



[PATCH RESEND v12] btrfs-progs: add cli to forget one or all scanned devices

2018-11-30 Thread Anand Jain
This patch adds cli
  btrfs device forget [dev]
to remove the given device structure in the kernel if the device
is unmounted. If no argument is given it shall remove all stale
(device which are not mounted) from the kernel.

Signed-off-by: Anand Jain 
Reviewed-by: Nikolay Borisov 
---
v11->v12: none.
v1->v12: pls ref to the cover-letter.

 cmds-device.c | 72 ---
 ioctl.h   |  2 ++
 2 files changed, 61 insertions(+), 13 deletions(-)

diff --git a/cmds-device.c b/cmds-device.c
index 2a05f70a76a9..280d6f555377 100644
--- a/cmds-device.c
+++ b/cmds-device.c
@@ -254,10 +254,32 @@ static int cmd_device_delete(int argc, char **argv)
return _cmd_device_remove(argc, argv, cmd_device_delete_usage);
 }
 
+static int btrfs_forget_devices(char *path)
+{
+   struct btrfs_ioctl_vol_args args;
+   int ret;
+   int fd;
+
+   fd = open("/dev/btrfs-control", O_RDWR);
+   if (fd < 0)
+   return -errno;
+
+   memset(&args, 0, sizeof(args));
+   if (path)
+   strncpy_null(args.name, path);
+   ret = ioctl(fd, BTRFS_IOC_FORGET_DEV, &args);
+   if (ret)
+   ret = -errno;
+   close(fd);
+   return ret;
+}
+
 static const char * const cmd_device_scan_usage[] = {
-   "btrfs device scan [(-d|--all-devices)| [...]]",
-   "Scan devices for a btrfs filesystem",
+   "btrfs device scan [(-d|--all-devices)|(-u|--forget)| "\
+   "[...]]",
+   "Scan or forget (deregister) devices for a btrfs filesystem",
" -d|--all-devices (deprecated)",
+   " -u|--forget [ ..]",
NULL
 };
 
@@ -267,37 +289,53 @@ static int cmd_device_scan(int argc, char **argv)
int devstart;
int all = 0;
int ret = 0;
+   int forget = 0;
 
optind = 0;
while (1) {
int c;
static const struct option long_options[] = {
{ "all-devices", no_argument, NULL, 'd'},
+   { "forget", no_argument, NULL, 'u'},
{ NULL, 0, NULL, 0}
};
 
-   c = getopt_long(argc, argv, "d", long_options, NULL);
+   c = getopt_long(argc, argv, "du", long_options, NULL);
if (c < 0)
break;
switch (c) {
case 'd':
all = 1;
break;
+   case 'u':
+   forget = 1;
+   break;
default:
usage(cmd_device_scan_usage);
}
}
devstart = optind;
 
+   if (all && forget)
+   usage(cmd_device_scan_usage);
+
if (all && check_argc_max(argc - optind, 1))
usage(cmd_device_scan_usage);
 
if (all || argc - optind == 0) {
-   printf("Scanning for Btrfs filesystems\n");
-   ret = btrfs_scan_devices();
-   error_on(ret, "error %d while scanning", ret);
-   ret = btrfs_register_all_devices();
-   error_on(ret, "there are %d errors while registering devices", 
ret);
+   if (forget) {
+   ret = btrfs_forget_devices(NULL);
+   error_on(ret, "'%s', forget failed",
+strerror(-ret));
+   } else {
+   printf("Scanning for Btrfs filesystems\n");
+   ret = btrfs_scan_devices();
+   error_on(ret, "error %d while scanning", ret);
+   ret = btrfs_register_all_devices();
+   error_on(ret,
+   "there are %d errors while registering devices",
+   ret);
+   }
goto out;
}
 
@@ -315,11 +353,19 @@ static int cmd_device_scan(int argc, char **argv)
ret = 1;
goto out;
}
-   printf("Scanning for Btrfs filesystems in '%s'\n", path);
-   if (btrfs_register_one_device(path) != 0) {
-   ret = 1;
-   free(path);
-   goto out;
+   if (forget) {
+   ret = btrfs_forget_devices(path);
+   if (ret)
+   error("Can't forget '%s': %s",
+   path, strerror(-ret));
+   } else {
+   printf("Scanning for Btrfs filesystems in '%s'\n",
+   path);
+   if (btrfs_register_one_device(path) != 0) {
+   ret = 1;
+   free(path);
+   goto out;
+   }
  

[PATCH RESEND] fstests: btrfs use forget if not reload

2018-11-30 Thread Anand Jain
[I will send this the xfstest ML after kernel and progs patch
 has been integrated].

btrfs reload was introduced to cleanup the device list inside the btrfs
kernel module.

The problem with the reload approach is that you can't run btrfs test
cases 124,125, 154 and 164 on the system with btrfs as root fs.

Now as we are introducing the btrfs forget feature as an btrfs device
scan option [1], so here are its pertaining changes in the fstests.

So these changes ensures we use the forget feature where available and
if its not then falls back to the reload approach.

[1]
btrfs-progs: add cli to forget one or all scanned devices
btrfs: introduce feature to forget a btrfs device

Signed-off-by: Anand Jain 
---
 common/btrfs| 20 
 tests/btrfs/124 |  6 +++---
 tests/btrfs/125 |  6 +++---
 tests/btrfs/154 |  6 +++---
 tests/btrfs/164 |  4 ++--
 5 files changed, 31 insertions(+), 11 deletions(-)

diff --git a/common/btrfs b/common/btrfs
index 26dc0bb9600f..c7fbec11c8c1 100644
--- a/common/btrfs
+++ b/common/btrfs
@@ -374,3 +374,23 @@ _scratch_btrfs_sectorsize()
$BTRFS_UTIL_PROG inspect-internal dump-super $SCRATCH_DEV |\
grep sectorsize | awk '{print $2}'
 }
+
+_btrfs_supports_forget()
+{
+   $BTRFS_UTIL_PROG device scan --help | grep -wq forget && \
+   $BTRFS_UTIL_PROG device scan --forget > /dev/null 2>&1
+}
+
+_require_btrfs_forget_if_not_fs_loadable()
+{
+   _btrfs_supports_forget && return
+
+   _require_loadable_fs_module "btrfs"
+}
+
+_btrfs_forget_if_not_fs_reload()
+{
+   _btrfs_supports_forget && return
+
+   _reload_fs_module "btrfs"
+}
diff --git a/tests/btrfs/124 b/tests/btrfs/124
index ce3ad6aa3a58..edbeff1443f5 100755
--- a/tests/btrfs/124
+++ b/tests/btrfs/124
@@ -51,7 +51,7 @@ _supported_fs btrfs
 _supported_os Linux
 _require_scratch_dev_pool 2
 _test_unmount
-_require_loadable_fs_module "btrfs"
+_require_btrfs_forget_if_not_fs_loadable
 
 _scratch_dev_pool_get 2
 
@@ -86,7 +86,7 @@ echo "clean btrfs ko" >> $seqres.full
 _scratch_unmount
 
 # un-scan the btrfs devices
-_reload_fs_module "btrfs"
+_btrfs_forget_if_not_fs_reload
 
 echo >> $seqres.full
 echo "-Write degraded mount fill upto $max_fs_sz bytes-" >> 
$seqres.full
@@ -125,7 +125,7 @@ echo
 echo "Mount degraded with the other dev"
 _scratch_unmount
 # un-scan the btrfs devices
-_reload_fs_module "btrfs"
+_btrfs_forget_if_not_fs_reload
 _mount -o degraded $dev2 $SCRATCH_MNT >>$seqres.full 2>&1
 _run_btrfs_util_prog filesystem show
 checkpoint3=`md5sum $SCRATCH_MNT/tf2`
diff --git a/tests/btrfs/125 b/tests/btrfs/125
index e38de264b28e..9161a2ddeaad 100755
--- a/tests/btrfs/125
+++ b/tests/btrfs/125
@@ -50,7 +50,7 @@ _supported_fs btrfs
 _supported_os Linux
 _require_scratch_dev_pool 3
 _test_unmount
-_require_loadable_fs_module "btrfs"
+_require_btrfs_forget_if_not_fs_loadable
 
 _scratch_dev_pool_get 3
 
@@ -102,7 +102,7 @@ echo "unmount" >> $seqres.full
 _scratch_unmount
 echo "clean btrfs ko" >> $seqres.full
 # un-scan the btrfs devices
-_reload_fs_module "btrfs"
+_btrfs_forget_if_not_fs_reload
 _mount -o degraded,device=$dev2 $dev1 $SCRATCH_MNT >>$seqres.full 2>&1
 dd if=/dev/zero of="$SCRATCH_MNT"/tf2 bs=$bs count=$count \
>>$seqres.full 2>&1
@@ -138,7 +138,7 @@ echo "Mount degraded but with other dev"
 
 _scratch_unmount
 # un-scan the btrfs devices
-_reload_fs_module "btrfs"
+_btrfs_forget_if_not_fs_reload
 
 _mount -o degraded,device=${dev2} $dev3 $SCRATCH_MNT >>$seqres.full 2>&1
 
diff --git a/tests/btrfs/154 b/tests/btrfs/154
index 99ea232aba4c..01745d20e9fc 100755
--- a/tests/btrfs/154
+++ b/tests/btrfs/154
@@ -36,7 +36,7 @@ rm -f $seqres.full
 _supported_fs btrfs
 _supported_os Linux
 _require_scratch_dev_pool 2
-_require_loadable_fs_module "btrfs"
+_require_btrfs_forget_if_not_fs_loadable
 
 _scratch_dev_pool_get 2
 
@@ -90,7 +90,7 @@ degrade_mount_write()
 
echo "clean btrfs ko" >> $seqres.full
# un-scan the btrfs devices
-   _reload_fs_module "btrfs"
+   _btrfs_forget_if_not_fs_reload
_mount -o degraded $DEV1 $SCRATCH_MNT >>$seqres.full 2>&1
cnt=$(( $COUNT/10 ))
dd if=/dev/urandom of="$SCRATCH_MNT"/tf1 bs=$bs count=$cnt \
@@ -142,7 +142,7 @@ verify()
echo "unmount" >> $seqres.full
 
_scratch_unmount
-   _reload_fs_module "btrfs"
+   _btrfs_forget_if_not_fs_reload
_mount -o degraded $DEV2 $SCRATCH_MNT >>$seqres.full 2>&1
verify_checkpoint1=`md5sum $SCRATCH_MNT/tf1`
verify_checkpoint2=`md5sum $SCRATCH_MNT/tf2`
diff --git a/tests/btrfs/164 b/tests/btrfs/164
index 097191a0e493..55042c4035e0 100755
--- a/tests/btrfs/164
+++ b/tests/btrfs/164
@@ -36,7 +36,7 @@ rm -f $seqres.full
 # Modify as appropriate.
 _supported_fs btrfs
 _supported_os Linux
-_require_loadable_fs_module "btrfs"
+_require_btrfs_forget_if_not_fs_loadable
 _require_scratch_dev_pool 2
 
 _scratch_dev_pool_get 2
@@ -69,7 +69,7 @@ delete_seed()
 {

[PATCH RESEND v12] Add cli and ioctl to forget scanned device(s)

2018-11-30 Thread Anand Jain
v12:
  Fixed coding style - leave space between " : ".

v11:
 btrfs-progs: Bring the code into the else part of if(forget).
  Use strerror to print the erorr instead of ret.

v10:
 Make btrfs-progs changes more readable.
 With an effort to keep the known bug [1] as it is..
  [1]
   The cli 'btrfs device scan --all /dev/sdb' which should have scanned
only one device, ends up scanning all the devices and I am not trying
to fix this bug in this patch because..
  . -d|--all is marked as deprecated, I hope -d option would go away
  . For now some script might be using this bug as a feature, and fixing
this bug might lead to mount failure.

v9:
 Make forget as a btrfs device scan option.
 Use forget in the fstests, now you can run fstests with btrfs as rootfs
  which helps to exercise the uuid_mutex lock.

v8:
 Change log update in the kernel patch.

v7:
 Use struct btrfs_ioctl_vol_args (instead of struct
  btrfs_ioctl_vol_args_v2) as its inline with other ioctl
  btrfs-control
 The CLI usage/features remains same. However internally the ioctl flag
  is not required to delete all the unmounted devices. Instead leave
  btrfs_ioctl_vol_args::name NULL.

v6:
 Use the changed fn name btrfs_free_stale_devices().

 Change in title:
 Old v5:
 Cover-letter:
  [PATCH v5] Add cli and ioctl to ignore a scanned device
 Kernel:
  [PATCH v5] btrfs: introduce feature to ignore a btrfs device
 Progs:
  [PATCH v5] btrfs-progs: add 'btrfs device ignore' cli

v5:
  Adds feature to delete all stale devices
  Reuses btrfs_free_stale_devices() fn and so depends on the
patch-set [1] in the ML.
  Uses struct btrfs_ioctl_vol_args_v2 instead of
struct btrfs_ioctl_vol_args as arg
  Does the device path matching instead of btrfs_device matching
(we won't delete the mounted device as btrfs_free_stale_devices()
checks for it)
v4:
  No change. But as the ML thread may be confusing, so resend.
v3:
  No change. Send to correct ML.
v2:
  Accepts review from Nikolay, details are in the specific patch.
  Patch 1/2 is renamed from
[PATCH 1/2] btrfs: refactor btrfs_free_stale_device() to get device list 
delete
  to
[PATCH 1/2] btrfs: add function to device list delete

Adds cli and ioctl to forget a scanned device or forget all stale
devices in the kernel.

Anand Jain (1):
  btrfs: introduce feature to forget a btrfs device

 fs/btrfs/super.c   | 3 +++
 fs/btrfs/volumes.c | 9 +
 fs/btrfs/volumes.h | 1 +
 include/uapi/linux/btrfs.h | 2 ++
 4 files changed, 15 insertions(+)

Anand Jain (1):
  btrfs-progs: add cli to forget one or all scanned devices

 cmds-device.c | 63 ++-
 ioctl.h   |  2 ++
 2 files changed, 56 insertions(+), 9 deletions(-)

[1]
Anand Jain (1):
  fstests: btrfs use forget if not reload

 common/btrfs| 20 
 tests/btrfs/124 |  6 +++---
 tests/btrfs/125 |  6 +++---
 tests/btrfs/154 |  6 +++---
 tests/btrfs/164 |  4 ++--
 5 files changed, 31 insertions(+), 11 deletions(-)

-- 
1.8.3.1