Re: IO errors when building RAID1.... ?

2018-08-31 Thread Duncan
Chris Murphy posted on Fri, 31 Aug 2018 13:02:16 -0600 as excerpted:

> If you want you can post the output from 'sudo smartctl -x /dev/sda'
> which will contain more information... but this is in some sense
> superfluous. The problem is very clearly a bad drive, the drive
> explicitly report to libata a write error, and included the sector LBA
> affected, and only the drive firmware would know that. It's not likely a
> cable problem or something like. And that the write error is reported at
> all means it's persistent, not transient.

Two points:

1) Does this happen to be an archive/SMR (shingled magnetic recording) 
device?  If so that might be the problem as such devices really aren't 
suited to normal usage (they really are designed for archiving), and 
btrfs' COW patterns can exacerbate the issue.  It's quite possible that 
the original install didn't load up the IO as heavily as the balance-
convert does, so the problem appears with convert but not for install.

2) Assuming it's /not/ an SMR issue, and smartctl doesn't say it's dying, 
I'd suggest running badblocks -w (make sure the device doesn't have 
anything valuable on it!) on the device -- note that this will take 
awhile, probably a couple days perhaps longer, as it writes four 
different patterns to the entire device one at a time, reading everything 
back to verify the pattern was written correctly, so it's actually going 
over the entire device 8 times, alternating write and read, but it should 
settle the issue of the reliability of the device.

Or if you'd rather spend the money than the time and it's not under 
warrantee still, just replace it, or at least buy a new one to use while 
you run the tests on that one.  I fully understand that tying up the 
thing running tests on it for days straight may not be viable.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: [PATCH 30/35] btrfs: cleanup pending bgs on transaction abort

2018-08-31 Thread Omar Sandoval
On Thu, Aug 30, 2018 at 01:42:20PM -0400, Josef Bacik wrote:
> We may abort the transaction during a commit and not have a chance to
> run the pending bgs stuff, which will leave block groups on our list and
> cause us accounting issues and leaked memory.  Fix this by running the
> pending bgs when we cleanup a transaction.

Reviewed-by: Omar Sandoval 

Again, I think it's fine to reuse the same function as long as there's a
comment here.

> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/transaction.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 89d14f135837..0f39a0d302d3 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -2273,6 +2273,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
> *trans)
>   btrfs_scrub_continue(fs_info);
>  cleanup_transaction:
>   btrfs_trans_release_metadata(trans);
> + btrfs_create_pending_block_groups(trans);
>   btrfs_trans_release_chunk_metadata(trans);
>   trans->block_rsv = NULL;
>   btrfs_warn(fs_info, "Skipping commit of aborted transaction.");
> -- 
> 2.14.3
> 


Re: [PATCH 29/35] btrfs: just delete pending bgs if we are aborted

2018-08-31 Thread Omar Sandoval
On Thu, Aug 30, 2018 at 01:42:19PM -0400, Josef Bacik wrote:
> We still need to do all of the accounting cleanup for pending block
> groups if we abort.  So set the ret to trans->aborted so if we aborted
> the cleanup happens and everybody is happy.

Reviewed-by: Omar Sandoval 

Reusing the loop is fine IMO, but a comment would be appreciated.

> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/extent-tree.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 90f267f4dd0f..132a1157982c 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -10333,7 +10333,7 @@ void btrfs_create_pending_block_groups(struct 
> btrfs_trans_handle *trans)
>   struct btrfs_root *extent_root = fs_info->extent_root;
>   struct btrfs_block_group_item item;
>   struct btrfs_key key;
> - int ret = 0;
> + int ret = trans->aborted;
>   bool can_flush_pending_bgs = trans->can_flush_pending_bgs;
>  
>   trans->can_flush_pending_bgs = false;
> -- 
> 2.14.3
> 


Re: [PATCH 21/35] btrfs: only run delayed refs if we're committing

2018-08-31 Thread Omar Sandoval
On Thu, Aug 30, 2018 at 01:42:11PM -0400, Josef Bacik wrote:
> I noticed in a giant dbench run that we spent a lot of time on lock
> contention while running transaction commit.  This is because dbench
> results in a lot of fsync()'s that do a btrfs_transaction_commit(), and
> they all run the delayed refs first thing, so they all contend with
> each other.  This leads to seconds of 0 throughput.  Change this to only
> run the delayed refs if we're the ones committing the transaction.  This
> makes the latency go away and we get no more lock contention.

This means that we're going to spend more time running delayed refs
while in TRANS_STATE_COMMIT_START, so couldn't we end up blocking new
transactions more than before?

> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/transaction.c | 24 +---
>  1 file changed, 9 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index ebb0c0405598..2bb19e2ded5e 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1918,15 +1918,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
> *trans)
>   btrfs_trans_release_metadata(trans);
>   trans->block_rsv = NULL;
>  
> - /* make a pass through all the delayed refs we have so far
> -  * any runnings procs may add more while we are here
> -  */
> - ret = btrfs_run_delayed_refs(trans, 0);
> - if (ret) {
> - btrfs_end_transaction(trans);
> - return ret;
> - }
> -
>   cur_trans = trans->transaction;
>  
>   /*
> @@ -1939,12 +1930,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
> *trans)
>   if (!list_empty(>new_bgs))
>   btrfs_create_pending_block_groups(trans);
>  
> - ret = btrfs_run_delayed_refs(trans, 0);
> - if (ret) {
> - btrfs_end_transaction(trans);
> - return ret;
> - }
> -
>   if (!test_bit(BTRFS_TRANS_DIRTY_BG_RUN, _trans->flags)) {
>   int run_it = 0;
>  
> @@ -2015,6 +2000,15 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
> *trans)
>   spin_unlock(_info->trans_lock);
>   }
>  
> + /*
> +  * We are now the only one in the commit area, we can run delayed refs
> +  * without hitting a bunch of lock contention from a lot of people
> +  * trying to commit the transaction at once.
> +  */
> + ret = btrfs_run_delayed_refs(trans, 0);
> + if (ret)
> + goto cleanup_transaction;
> +
>   extwriter_counter_dec(cur_trans, trans->type);
>  
>   ret = btrfs_start_delalloc_flush(fs_info);
> -- 
> 2.14.3
> 


Re: [PATCH 23/35] btrfs: assert on non-empty delayed iputs

2018-08-31 Thread Omar Sandoval
On Thu, Aug 30, 2018 at 01:42:13PM -0400, Josef Bacik wrote:
> I ran into an issue where there was some reference being held on an
> inode that I couldn't track.  This assert wasn't triggered, but it at
> least rules out we're doing something stupid.

Reviewed-by: Omar Sandoval 

> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/disk-io.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 0e42401756b8..11ea2ea7439e 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3979,6 +3979,7 @@ void close_ctree(struct btrfs_fs_info *fs_info)
>   kthread_stop(fs_info->transaction_kthread);
>   kthread_stop(fs_info->cleaner_kthread);
>  
> + ASSERT(list_empty(_info->delayed_iputs));
>   set_bit(BTRFS_FS_CLOSING_DONE, _info->flags);
>  
>   btrfs_free_qgroup_config(fs_info);
> -- 
> 2.14.3
> 


Re: [PATCH 09/35] btrfs: protect space cache inode alloc with nofs

2018-08-31 Thread Omar Sandoval
On Thu, Aug 30, 2018 at 01:41:59PM -0400, Josef Bacik wrote:
> If we're allocating a new space cache inode it's likely going to be
> under a transaction handle, so we need to use memalloc_nofs_save() in
> order to avoid deadlocks, and more importantly lockdep messages that
> make xfstests fail.

Could use a comment where we call memalloc_nofs_save(). Otherwise,

Reviewed-by: Omar Sandoval 

> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/free-space-cache.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index c3888c113d81..db93a5f035a0 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -10,6 +10,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "ctree.h"
>  #include "free-space-cache.h"
>  #include "transaction.h"
> @@ -47,6 +48,7 @@ static struct inode *__lookup_free_space_inode(struct 
> btrfs_root *root,
>   struct btrfs_free_space_header *header;
>   struct extent_buffer *leaf;
>   struct inode *inode = NULL;
> + unsigned nofs_flag;
>   int ret;
>  
>   key.objectid = BTRFS_FREE_SPACE_OBJECTID;
> @@ -68,7 +70,9 @@ static struct inode *__lookup_free_space_inode(struct 
> btrfs_root *root,
>   btrfs_disk_key_to_cpu(, _key);
>   btrfs_release_path(path);
>  
> + nofs_flag = memalloc_nofs_save();
>   inode = btrfs_iget(fs_info->sb, , root, NULL);
> + memalloc_nofs_restore(nofs_flag);
>   if (IS_ERR(inode))
>   return inode;
>  
> -- 
> 2.14.3
> 


Re: [PATCH 08/35] btrfs: release metadata before running delayed refs

2018-08-31 Thread Omar Sandoval
On Thu, Aug 30, 2018 at 01:41:58PM -0400, Josef Bacik wrote:
> We want to release the unused reservation we have since it refills the
> delayed refs reserve, which will make everything go smoother when
> running the delayed refs if we're short on our reservation.

Reviewed-by: Omar Sandoval 

> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/transaction.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 99741254e27e..ebb0c0405598 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1915,6 +1915,9 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
> *trans)
>   return ret;
>   }
>  
> + btrfs_trans_release_metadata(trans);
> + trans->block_rsv = NULL;
> +
>   /* make a pass through all the delayed refs we have so far
>* any runnings procs may add more while we are here
>*/
> @@ -1924,9 +1927,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
> *trans)
>   return ret;
>   }
>  
> - btrfs_trans_release_metadata(trans);
> - trans->block_rsv = NULL;
> -
>   cur_trans = trans->transaction;
>  
>   /*
> -- 
> 2.14.3
> 


Re: [PATCH 22/35] btrfs: make sure we create all new bgs

2018-08-31 Thread Omar Sandoval
On Thu, Aug 30, 2018 at 01:42:12PM -0400, Josef Bacik wrote:
> We can actually allocate new chunks while we're creating our bg's, so
> instead of doing list_for_each_safe, just do while (!list_empty()) so we
> make sure to catch any new bg's that get added to the list.

Reviewed-by: Omar Sandoval 

Since Nikolay pointed it out, might as well mention in the commit
message that this can happen because we modify the chunk and extent
trees.

> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/extent-tree.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index ca98c39308f6..fc30ff96f0d6 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -10331,7 +10331,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info 
> *info)
>  void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
>  {
>   struct btrfs_fs_info *fs_info = trans->fs_info;
> - struct btrfs_block_group_cache *block_group, *tmp;
> + struct btrfs_block_group_cache *block_group;
>   struct btrfs_root *extent_root = fs_info->extent_root;
>   struct btrfs_block_group_item item;
>   struct btrfs_key key;
> @@ -10339,7 +10339,10 @@ void btrfs_create_pending_block_groups(struct 
> btrfs_trans_handle *trans)
>   bool can_flush_pending_bgs = trans->can_flush_pending_bgs;
>  
>   trans->can_flush_pending_bgs = false;
> - list_for_each_entry_safe(block_group, tmp, >new_bgs, bg_list) {
> + while (!list_empty(>new_bgs)) {
> + block_group = list_first_entry(>new_bgs,
> +struct btrfs_block_group_cache,
> +bg_list);
>   if (ret)
>   goto next;
>  
> -- 
> 2.14.3
> 


Re: [PATCH 06/35] btrfs: check if free bgs for commit

2018-08-31 Thread Omar Sandoval
On Thu, Aug 30, 2018 at 01:41:56PM -0400, Josef Bacik wrote:
> may_commit_transaction will skip committing the transaction if we don't
> have enough pinned space or if we're trying to find space for a SYSTEM
> chunk.  However if we have pending free block groups in this transaction
> we still want to commit as we may be able to allocate a chunk to make
> our reservation.  So instead of just returning ENOSPC, check if we have
> free block groups pending, and if so commit the transaction to allow us
> to use that free space.

This makes sense.

> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/extent-tree.c | 18 --
>  1 file changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 6e7f350754d2..80615a579b18 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4804,6 +4804,7 @@ static int may_commit_transaction(struct btrfs_fs_info 
> *fs_info,
>   struct btrfs_trans_handle *trans;
>   u64 bytes;
>   u64 reclaim_bytes = 0;
> + bool do_commit = true;

I find this naming a little mind bending when I read

do_commit = false;
goto commit;

Since the end result is that we always join the transaction if we make
it past the (!bytes) check anyways, can we do the pending bgs check
first? I find the following easier to follow, fwiw.

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index de6f75f5547b..dd7aeb5fb6bf 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4779,18 +4779,25 @@ static int may_commit_transaction(struct btrfs_fs_info 
*fs_info,
if (!bytes)
return 0;
 
-   /* See if there is enough pinned space to make this reservation */
-   if (__percpu_counter_compare(_info->total_bytes_pinned,
-  bytes,
-  BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0)
-   goto commit;
+   trans = btrfs_join_transaction(fs_info->extent_root);
+   if (IS_ERR(trans))
+   return -ENOSPC;
+
+   /*
+* See if we have a pending bg or there is enough pinned space to make
+* this reservation.
+*/
+   if (test_bit(BTRFS_TRANS_HAVE_FREE_BGS, >transaction->flags) ||
+   __percpu_counter_compare(_info->total_bytes_pinned, bytes,
+BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0)
+   return btrfs_commit_transaction(trans);
 
/*
 * See if there is some space in the delayed insertion reservation for
 * this reservation.
 */
if (space_info != delayed_rsv->space_info)
-   return -ENOSPC;
+   goto enospc;
 
spin_lock(_rsv->lock);
if (delayed_rsv->size > bytes)
@@ -4801,16 +4808,14 @@ static int may_commit_transaction(struct btrfs_fs_info 
*fs_info,
 
if (__percpu_counter_compare(_info->total_bytes_pinned,
   bytes,
-  BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) {
-   return -ENOSPC;
-   }
-
-commit:
-   trans = btrfs_join_transaction(fs_info->extent_root);
-   if (IS_ERR(trans))
-   return -ENOSPC;
+  BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0)
+   goto enospc;
 
return btrfs_commit_transaction(trans);
+
+enospc:
+   btrfs_end_transaction(trans);
+   return -ENOSPC;
 }
 
 /*



Re: [PATCH 03/35] btrfs: use cleanup_extent_op in check_ref_cleanup

2018-08-31 Thread Omar Sandoval
On Thu, Aug 30, 2018 at 01:41:53PM -0400, Josef Bacik wrote:
> From: Josef Bacik 
> 
> Unify the extent_op handling as well, just add a flag so we don't
> actually run the extent op from check_ref_cleanup and instead return a
> value so that we can skip cleaning up the ref head.
> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/extent-tree.c | 17 +
>  1 file changed, 9 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 4c9fd35bca07..87c42a2c45b1 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2443,18 +2443,23 @@ static void unselect_delayed_ref_head(struct 
> btrfs_delayed_ref_root *delayed_ref
>  }
>  
>  static int cleanup_extent_op(struct btrfs_trans_handle *trans,
> -  struct btrfs_delayed_ref_head *head)
> +  struct btrfs_delayed_ref_head *head,
> +  bool run_extent_op)
>  {
>   struct btrfs_delayed_extent_op *extent_op = head->extent_op;
>   int ret;
>  
>   if (!extent_op)
>   return 0;
> +
>   head->extent_op = NULL;
>   if (head->must_insert_reserved) {
>   btrfs_free_delayed_extent_op(extent_op);
>   return 0;
> + } else if (!run_extent_op) {
> + return 1;
>   }
> +
>   spin_unlock(>lock);
>   ret = run_delayed_extent_op(trans, head, extent_op);
>   btrfs_free_delayed_extent_op(extent_op);

So if cleanup_extent_op() returns 1, then the head was unlocked, unless
run_extent_op was true. That's pretty confusing. Can we make it always
unlock in the !must_insert_reserved case?

> @@ -2506,7 +2511,7 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
> *trans,
>  
>   delayed_refs = >transaction->delayed_refs;
>  
> - ret = cleanup_extent_op(trans, head);
> + ret = cleanup_extent_op(trans, head, true);
>   if (ret < 0) {
>   unselect_delayed_ref_head(delayed_refs, head);
>   btrfs_debug(fs_info, "run_delayed_extent_op returned %d", ret);
> @@ -6977,12 +6982,8 @@ static noinline int check_ref_cleanup(struct 
> btrfs_trans_handle *trans,
>   if (!RB_EMPTY_ROOT(>ref_tree))
>   goto out;
>  
> - if (head->extent_op) {
> - if (!head->must_insert_reserved)
> - goto out;
> - btrfs_free_delayed_extent_op(head->extent_op);
> - head->extent_op = NULL;
> - }
> + if (cleanup_extent_op(trans, head, false))
> + goto out;
>  
>   /*
>* waiting for the lock here would deadlock.  If someone else has it
> -- 
> 2.14.3
> 


Re: [PATCH 02/35] btrfs: add cleanup_ref_head_accounting helper

2018-08-31 Thread Omar Sandoval
On Thu, Aug 30, 2018 at 01:41:52PM -0400, Josef Bacik wrote:
> From: Josef Bacik 
> 
> We were missing some quota cleanups in check_ref_cleanup, so break the
> ref head accounting cleanup into a helper and call that from both
> check_ref_cleanup and cleanup_ref_head.  This will hopefully ensure that
> we don't screw up accounting in the future for other things that we add.

Reviewed-by: Omar Sandoval 

> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/extent-tree.c | 67 
> +-
>  1 file changed, 39 insertions(+), 28 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 6799950fa057..4c9fd35bca07 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2461,6 +2461,41 @@ static int cleanup_extent_op(struct btrfs_trans_handle 
> *trans,
>   return ret ? ret : 1;
>  }
>  
> +static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans,
> + struct btrfs_delayed_ref_head *head)
> +{
> + struct btrfs_fs_info *fs_info = trans->fs_info;
> + struct btrfs_delayed_ref_root *delayed_refs =
> + >transaction->delayed_refs;
> +
> + if (head->total_ref_mod < 0) {
> + struct btrfs_space_info *space_info;
> + u64 flags;
> +
> + if (head->is_data)
> + flags = BTRFS_BLOCK_GROUP_DATA;
> + else if (head->is_system)
> + flags = BTRFS_BLOCK_GROUP_SYSTEM;
> + else
> + flags = BTRFS_BLOCK_GROUP_METADATA;
> + space_info = __find_space_info(fs_info, flags);
> + ASSERT(space_info);
> + percpu_counter_add_batch(_info->total_bytes_pinned,
> +-head->num_bytes,
> +BTRFS_TOTAL_BYTES_PINNED_BATCH);

While you're here, could you fix this botched whitespace?


[PATCH v5 3/6] vfs: update swap_{,de}activate documentation

2018-08-31 Thread Omar Sandoval
From: Omar Sandoval 

The documentation for these functions is wrong in several ways:

- swap_activate() is called with the inode locked
- swap_activate() takes a swap_info_struct * and a sector_t *
- swap_activate() can also return a positive number of extents it added
  itself
- swap_deactivate() does not return anything

Reviewed-by: Nikolay Borisov 
Signed-off-by: Omar Sandoval 
---
 Documentation/filesystems/Locking | 17 +++--
 Documentation/filesystems/vfs.txt | 12 
 2 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/Documentation/filesystems/Locking 
b/Documentation/filesystems/Locking
index efea228ccd8a..b970c8c2ee22 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -210,8 +210,9 @@ prototypes:
int (*launder_page)(struct page *);
int (*is_partially_uptodate)(struct page *, unsigned long, unsigned 
long);
int (*error_remove_page)(struct address_space *, struct page *);
-   int (*swap_activate)(struct file *);
-   int (*swap_deactivate)(struct file *);
+   int (*swap_activate)(struct swap_info_struct *, struct file *,
+sector_t *);
+   void (*swap_deactivate)(struct file *);
 
 locking rules:
All except set_page_dirty and freepage may block
@@ -235,8 +236,8 @@ putback_page:   yes
 launder_page:  yes
 is_partially_uptodate: yes
 error_remove_page: yes
-swap_activate: no
-swap_deactivate:   no
+swap_activate: yes
+swap_deactivate:   no
 
->write_begin(), ->write_end() and ->readpage() may be called from
 the request handler (/dev/loop).
@@ -333,14 +334,10 @@ cleaned, or an error value if not. Note that in order to 
prevent the page
 getting mapped back in and redirtied, it needs to be kept locked
 across the entire operation.
 
-   ->swap_activate will be called with a non-zero argument on
-files backing (non block device backed) swapfiles. A return value
-of zero indicates success, in which case this file can be used for
-backing swapspace. The swapspace operations will be proxied to the
-address space operations.
+   ->swap_activate is called from sys_swapon() with the inode locked.
 
->swap_deactivate() will be called in the sys_swapoff()
-path after ->swap_activate() returned success.
+path after ->swap_activate() returned success. The inode is not locked.
 
 --- file_lock_operations --
 prototypes:
diff --git a/Documentation/filesystems/vfs.txt 
b/Documentation/filesystems/vfs.txt
index 4b2084d0f1fb..40d6d6d4b76b 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -652,8 +652,9 @@ struct address_space_operations {
unsigned long);
void (*is_dirty_writeback) (struct page *, bool *, bool *);
int (*error_remove_page) (struct mapping *mapping, struct page *page);
-   int (*swap_activate)(struct file *);
-   int (*swap_deactivate)(struct file *);
+   int (*swap_activate)(struct swap_info_struct *, struct file *,
+sector_t *);
+   void (*swap_deactivate)(struct file *);
 };
 
   writepage: called by the VM to write a dirty page to backing store.
@@ -830,8 +831,11 @@ struct address_space_operations {
 
   swap_activate: Called when swapon is used on a file to allocate
space if necessary and pin the block lookup information in
-   memory. A return value of zero indicates success,
-   in which case this file can be used to back swapspace.
+   memory. If this returns zero, the swap system will call the address
+   space operations ->readpage() and ->direct_IO(). Alternatively, this
+   may call add_swap_extent() and return the number of extents added, in
+   which case the swap system will use the provided blocks directly
+   instead of going through the filesystem.
 
   swap_deactivate: Called during swapoff on files where swap_activate
was successful.
-- 
2.18.0



[PATCH v5 0/6] Btrfs: implement swap file support

2018-08-31 Thread Omar Sandoval
From: Omar Sandoval 

Hi,

This series implements swap file support for Btrfs.

Changes since v4 [1]:

- Added a kernel doc for btrfs_get_chunk_map()
- Got rid of "Btrfs: push EXCL_OP set into btrfs_rm_device()"
- Made activate error messages more clear and consistent
- Changed clear vs unlock order in activate error case
- Added "mm: export add_swap_extent()" as a separate patch
- Added a btrfs_wait_ordered_range() at the beginning of
  btrfs_swap_activate() to catch newly created files
- Added some Reviewed-bys from Nikolay

I took a stab at adding support for balance when a swap file is active,
but it's a major pain: we need to mark block groups which contain swap
file extents, check the block group counter in relocate/scrub, then
unmark the block groups when the swap file is deactivated, which gets
really messy because the file can grow while it is an active swap file.
If this is a deal breaker, I can work something out, but I don't think
it's worth the trouble.

This was tested with the swap tests in xfstests plus my new tests here
[2]. Additionally, I used my swapme test program [3] and ran a few
memory-intensive workloads (e.g., a highly parallel kernel build),
verifying that swap was being used. All of this was done with lockdep
enabled.

This series is based on v4.19-rc1. Please take a look.

Thanks!

1: https://www.spinics.net/lists/linux-btrfs/msg78731.html
2: https://github.com/osandov/xfstests/tree/btrfs-swap
3: https://github.com/osandov/osandov-linux/blob/master/scripts/swapme.c

Omar Sandoval (6):
  mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS
  mm: export add_swap_extent()
  vfs: update swap_{,de}activate documentation
  Btrfs: prevent ioctls from interfering with a swap file
  Btrfs: rename get_chunk_map() and make it non-static
  Btrfs: support swap files

 Documentation/filesystems/Locking |  17 +--
 Documentation/filesystems/vfs.txt |  12 +-
 fs/btrfs/ctree.h  |   6 +
 fs/btrfs/disk-io.c|   3 +
 fs/btrfs/inode.c  | 232 ++
 fs/btrfs/ioctl.c  |  51 ++-
 fs/btrfs/volumes.c|  28 ++--
 fs/btrfs/volumes.h|   9 ++
 include/linux/swap.h  |  13 +-
 mm/page_io.c  |   6 +-
 mm/swapfile.c |  14 +-
 11 files changed, 348 insertions(+), 43 deletions(-)

-- 
2.18.0



[PATCH v5 4/6] Btrfs: prevent ioctls from interfering with a swap file

2018-08-31 Thread Omar Sandoval
From: Omar Sandoval 

When a swap file is active, we must make sure that the extents of the
file are not moved and that they don't become shared. That means that
the following are not safe:

- chattr +c (enable compression)
- reflink
- dedupe
- snapshot
- defrag
- balance
- device remove/replace/resize

Don't allow those to happen on an active swap file. Balance and device
remove/replace/resize in particular are disallowed entirely; in the
future, we can relax this so that relocation skips/errors out only on
chunks containing an active swap file.

Note that we don't have to worry about chattr -C (disable nocow), which
we ignore for non-empty files, because an active swapfile must be
non-empty and can't be truncated. We also don't have to worry about
autodefrag because it's only done on COW files. Truncate and fallocate
are already taken care of by the generic code. Device add doesn't do
relocation so it's not an issue, either.

Signed-off-by: Omar Sandoval 
---
 fs/btrfs/ctree.h   |  6 ++
 fs/btrfs/disk-io.c |  3 +++
 fs/btrfs/ioctl.c   | 51 ++
 fs/btrfs/volumes.c |  6 ++
 4 files changed, 62 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 53af9f5253f4..1c767a6394ae 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1121,6 +1121,9 @@ struct btrfs_fs_info {
u32 sectorsize;
u32 stripesize;
 
+   /* Number of active swapfiles */
+   atomic_t nr_swapfiles;
+
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
spinlock_t ref_verify_lock;
struct rb_root block_tree;
@@ -1285,6 +1288,9 @@ struct btrfs_root {
spinlock_t qgroup_meta_rsv_lock;
u64 qgroup_meta_rsv_pertrans;
u64 qgroup_meta_rsv_prealloc;
+
+   /* Number of active swapfiles */
+   atomic_t nr_swapfiles;
 };
 
 struct btrfs_file_private {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 5124c15705ce..50ee5cd3efae 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1187,6 +1187,7 @@ static void __setup_root(struct btrfs_root *root, struct 
btrfs_fs_info *fs_info,
atomic_set(>log_batch, 0);
refcount_set(>refs, 1);
atomic_set(>will_be_snapshotted, 0);
+   atomic_set(>nr_swapfiles, 0);
root->log_transid = 0;
root->log_transid_committed = -1;
root->last_log_commit = 0;
@@ -2781,6 +2782,8 @@ int open_ctree(struct super_block *sb,
fs_info->sectorsize = 4096;
fs_info->stripesize = 4096;
 
+   atomic_set(_info->nr_swapfiles, 0);
+
ret = btrfs_alloc_stripe_hash_table(fs_info);
if (ret) {
err = ret;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 63600dc2ac4c..cc230dcd32a4 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -290,6 +290,11 @@ static int btrfs_ioctl_setflags(struct file *file, void 
__user *arg)
} else if (fsflags & FS_COMPR_FL) {
const char *comp;
 
+   if (IS_SWAPFILE(inode)) {
+   ret = -ETXTBSY;
+   goto out_unlock;
+   }
+
binode->flags |= BTRFS_INODE_COMPRESS;
binode->flags &= ~BTRFS_INODE_NOCOMPRESS;
 
@@ -751,6 +756,12 @@ static int create_snapshot(struct btrfs_root *root, struct 
inode *dir,
if (!test_bit(BTRFS_ROOT_REF_COWS, >state))
return -EINVAL;
 
+   if (atomic_read(>nr_swapfiles)) {
+   btrfs_info(fs_info,
+  "cannot snapshot subvolume with active swapfile");
+   return -ETXTBSY;
+   }
+
pending_snapshot = kzalloc(sizeof(*pending_snapshot), GFP_KERNEL);
if (!pending_snapshot)
return -ENOMEM;
@@ -1487,9 +1498,13 @@ int btrfs_defrag_file(struct inode *inode, struct file 
*file,
}
 
inode_lock(inode);
-   if (do_compress)
-   BTRFS_I(inode)->defrag_compress = compress_type;
-   ret = cluster_pages_for_defrag(inode, pages, i, cluster);
+   if (IS_SWAPFILE(inode)) {
+   ret = -ETXTBSY;
+   } else {
+   if (do_compress)
+   BTRFS_I(inode)->defrag_compress = compress_type;
+   ret = cluster_pages_for_defrag(inode, pages, i, 
cluster);
+   }
if (ret < 0) {
inode_unlock(inode);
goto out_ra;
@@ -1585,6 +1600,12 @@ static noinline int btrfs_ioctl_resize(struct file *file,
return BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS;
}
 
+   if (atomic_read(_info->nr_swapfiles)) {
+   btrfs_info(fs_info, "cannot resize with active swapfile");
+   ret = -ETXTBSY;
+   goto out;
+   }
+
vol_args = memdup_user(arg, sizeof(*vol_args));
if (IS_ERR(vol_args)) {
ret = PTR_ERR(vol_args);
@@ -3538,6 +3559,11 

[PATCH v5 6/6] Btrfs: support swap files

2018-08-31 Thread Omar Sandoval
From: Omar Sandoval 

Implement the swap file a_ops on Btrfs. Activation needs to make sure
that the file can be used as a swap file, which currently means it must
be fully allocated as nocow with no compression on one device. It also
sets up the swap extents directly with add_swap_extent(), so export it.

Signed-off-by: Omar Sandoval 
---
 fs/btrfs/inode.c | 232 +++
 1 file changed, 232 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 9357a19d2bff..c0409e632768 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "ctree.h"
 #include "disk-io.h"
@@ -10437,6 +10438,235 @@ void btrfs_set_range_writeback(struct extent_io_tree 
*tree, u64 start, u64 end)
}
 }
 
+struct btrfs_swap_info {
+   u64 start;
+   u64 block_start;
+   u64 block_len;
+   u64 lowest_ppage;
+   u64 highest_ppage;
+   unsigned long nr_pages;
+   int nr_extents;
+};
+
+static int btrfs_add_swap_extent(struct swap_info_struct *sis,
+struct btrfs_swap_info *bsi)
+{
+   unsigned long nr_pages;
+   u64 first_ppage, first_ppage_reported, next_ppage;
+   int ret;
+
+   first_ppage = ALIGN(bsi->block_start, PAGE_SIZE) >> PAGE_SHIFT;
+   next_ppage = ALIGN_DOWN(bsi->block_start + bsi->block_len,
+   PAGE_SIZE) >> PAGE_SHIFT;
+
+   if (first_ppage >= next_ppage)
+   return 0;
+   nr_pages = next_ppage - first_ppage;
+
+   first_ppage_reported = first_ppage;
+   if (bsi->start == 0)
+   first_ppage_reported++;
+   if (bsi->lowest_ppage > first_ppage_reported)
+   bsi->lowest_ppage = first_ppage_reported;
+   if (bsi->highest_ppage < (next_ppage - 1))
+   bsi->highest_ppage = next_ppage - 1;
+
+   ret = add_swap_extent(sis, bsi->nr_pages, nr_pages, first_ppage);
+   if (ret < 0)
+   return ret;
+   bsi->nr_extents += ret;
+   bsi->nr_pages += nr_pages;
+   return 0;
+}
+
+static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
+  sector_t *span)
+{
+   struct inode *inode = file_inode(file);
+   struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
+   struct extent_io_tree *io_tree = _I(inode)->io_tree;
+   struct extent_state *cached_state = NULL;
+   struct extent_map *em = NULL;
+   struct btrfs_device *device = NULL;
+   struct btrfs_swap_info bsi = {
+   .lowest_ppage = (sector_t)-1ULL,
+   };
+   int ret = 0;
+   u64 isize = inode->i_size;
+   u64 start;
+
+   /*
+* If the swap file was just created, make sure delalloc is done. If the
+* file changes again after this, the user is doing something stupid and
+* we don't really care.
+*/
+   ret = btrfs_wait_ordered_range(inode, 0, (u64)-1);
+   if (ret)
+   return ret;
+
+   /*
+* The inode is locked, so these flags won't change after we check them.
+*/
+   if (BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS) {
+   btrfs_err(fs_info, "swapfile must not be compressed");
+   return -EINVAL;
+   }
+   if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW)) {
+   btrfs_err(fs_info, "swapfile must not be copy-on-write");
+   return -EINVAL;
+   }
+
+   /*
+* Balance or device remove/replace/resize can move stuff around from
+* under us. The EXCL_OP flag makes sure they aren't running/won't run
+* concurrently while we are mapping the swap extents, and the fs_info
+* nr_swapfiles counter prevents them from running while the swap file
+* is active and moving the extents. Note that this also prevents a
+* concurrent device add which isn't actually necessary, but it's not
+* really worth the trouble to allow it.
+*/
+   if (test_and_set_bit(BTRFS_FS_EXCL_OP, _info->flags))
+   return -EBUSY;
+   atomic_inc(_info->nr_swapfiles);
+   /*
+* Snapshots can create extents which require COW even if NODATACOW is
+* set. We use this counter to prevent snapshots. We must increment it
+* before walking the extents because we don't want a concurrent
+* snapshot to run after we've already checked the extents.
+*/
+   atomic_inc(_I(inode)->root->nr_swapfiles);
+
+   lock_extent_bits(io_tree, 0, isize - 1, _state);
+   start = 0;
+   while (start < isize) {
+   u64 end, logical_block_start, physical_block_start;
+   u64 len = isize - start;
+
+   em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, start, len, 0);
+   if (IS_ERR(em)) {
+   ret = PTR_ERR(em);
+   goto out;
+ 

[PATCH v5 2/6] mm: export add_swap_extent()

2018-08-31 Thread Omar Sandoval
From: Omar Sandoval 

Btrfs will need this for swap file support.

Signed-off-by: Omar Sandoval 
---
 mm/swapfile.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index d3f95833d12e..51cb30de17bc 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2365,6 +2365,7 @@ add_swap_extent(struct swap_info_struct *sis, unsigned 
long start_page,
list_add_tail(_se->list, >first_swap_extent.list);
return 1;
 }
+EXPORT_SYMBOL_GPL(add_swap_extent);
 
 /*
  * A `swap extent' is a simple thing which maps a contiguous range of pages
-- 
2.18.0



[PATCH v5 1/6] mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS

2018-08-31 Thread Omar Sandoval
From: Omar Sandoval 

The SWP_FILE flag serves two purposes: to make swap_{read,write}page()
go through the filesystem, and to make swapoff() call
->swap_deactivate(). For Btrfs, we want the latter but not the former,
so split this flag into two. This makes us always call
->swap_deactivate() if ->swap_activate() succeeded, not just if it
didn't add any swap extents itself.

This also resolves the issue of the very misleading name of SWP_FILE,
which is only used for swap files over NFS.

Reviewed-by: Nikolay Borisov 
Signed-off-by: Omar Sandoval 
---
 include/linux/swap.h | 13 +++--
 mm/page_io.c |  6 +++---
 mm/swapfile.c| 13 -
 3 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 8e2c11e692ba..0fda0aa743f0 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -167,13 +167,14 @@ enum {
SWP_SOLIDSTATE  = (1 << 4), /* blkdev seeks are cheap */
SWP_CONTINUED   = (1 << 5), /* swap_map has count continuation */
SWP_BLKDEV  = (1 << 6), /* its a block device */
-   SWP_FILE= (1 << 7), /* set after swap_activate success */
-   SWP_AREA_DISCARD = (1 << 8),/* single-time swap area discards */
-   SWP_PAGE_DISCARD = (1 << 9),/* freed swap page-cluster discards */
-   SWP_STABLE_WRITES = (1 << 10),  /* no overwrite PG_writeback pages */
-   SWP_SYNCHRONOUS_IO = (1 << 11), /* synchronous IO is efficient */
+   SWP_ACTIVATED   = (1 << 7), /* set after swap_activate success */
+   SWP_FS  = (1 << 8), /* swap file goes through fs */
+   SWP_AREA_DISCARD = (1 << 9),/* single-time swap area discards */
+   SWP_PAGE_DISCARD = (1 << 10),   /* freed swap page-cluster discards */
+   SWP_STABLE_WRITES = (1 << 11),  /* no overwrite PG_writeback pages */
+   SWP_SYNCHRONOUS_IO = (1 << 12), /* synchronous IO is efficient */
/* add others here before... */
-   SWP_SCANNING= (1 << 12),/* refcount in scan_swap_map */
+   SWP_SCANNING= (1 << 13),/* refcount in scan_swap_map */
 };
 
 #define SWAP_CLUSTER_MAX 32UL
diff --git a/mm/page_io.c b/mm/page_io.c
index aafd19ec1db4..e8653c368069 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -283,7 +283,7 @@ int __swap_writepage(struct page *page, struct 
writeback_control *wbc,
struct swap_info_struct *sis = page_swap_info(page);
 
VM_BUG_ON_PAGE(!PageSwapCache(page), page);
-   if (sis->flags & SWP_FILE) {
+   if (sis->flags & SWP_FS) {
struct kiocb kiocb;
struct file *swap_file = sis->swap_file;
struct address_space *mapping = swap_file->f_mapping;
@@ -365,7 +365,7 @@ int swap_readpage(struct page *page, bool synchronous)
goto out;
}
 
-   if (sis->flags & SWP_FILE) {
+   if (sis->flags & SWP_FS) {
struct file *swap_file = sis->swap_file;
struct address_space *mapping = swap_file->f_mapping;
 
@@ -423,7 +423,7 @@ int swap_set_page_dirty(struct page *page)
 {
struct swap_info_struct *sis = page_swap_info(page);
 
-   if (sis->flags & SWP_FILE) {
+   if (sis->flags & SWP_FS) {
struct address_space *mapping = sis->swap_file->f_mapping;
 
VM_BUG_ON_PAGE(!PageSwapCache(page), page);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index d954b71c4f9c..d3f95833d12e 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -989,7 +989,7 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], 
int entry_size)
goto nextsi;
}
if (size == SWAPFILE_CLUSTER) {
-   if (!(si->flags & SWP_FILE))
+   if (!(si->flags & SWP_FS))
n_ret = swap_alloc_cluster(si, swp_entries);
} else
n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
@@ -2310,12 +2310,13 @@ static void destroy_swap_extents(struct 
swap_info_struct *sis)
kfree(se);
}
 
-   if (sis->flags & SWP_FILE) {
+   if (sis->flags & SWP_ACTIVATED) {
struct file *swap_file = sis->swap_file;
struct address_space *mapping = swap_file->f_mapping;
 
-   sis->flags &= ~SWP_FILE;
-   mapping->a_ops->swap_deactivate(swap_file);
+   sis->flags &= ~SWP_ACTIVATED;
+   if (mapping->a_ops->swap_deactivate)
+   mapping->a_ops->swap_deactivate(swap_file);
}
 }
 
@@ -2411,8 +2412,10 @@ static int setup_swap_extents(struct swap_info_struct 
*sis, sector_t *span)
 
if (mapping->a_ops->swap_activate) {
ret = mapping->a_ops->swap_activate(sis, swap_file, span);
+   if (ret >= 0)
+   sis->flags |= SWP_ACTIVATED;
if (!ret) {
-   

[PATCH v5 5/6] Btrfs: rename get_chunk_map() and make it non-static

2018-08-31 Thread Omar Sandoval
From: Omar Sandoval 

The Btrfs swap code is going to need it, so give it a btrfs_ prefix and
make it non-static.

Signed-off-by: Omar Sandoval 
---
 fs/btrfs/volumes.c | 22 +++---
 fs/btrfs/volumes.h |  9 +
 2 files changed, 20 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c03ef5322689..0aa8aff6774b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2712,8 +2712,8 @@ static int btrfs_del_sys_chunk(struct btrfs_fs_info 
*fs_info, u64 chunk_offset)
return ret;
 }
 
-static struct extent_map *get_chunk_map(struct btrfs_fs_info *fs_info,
-   u64 logical, u64 length)
+struct extent_map *btrfs_get_chunk_map(struct btrfs_fs_info *fs_info,
+  u64 logical, u64 length)
 {
struct extent_map_tree *em_tree;
struct extent_map *em;
@@ -2750,7 +2750,7 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans, 
u64 chunk_offset)
int i, ret = 0;
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
 
-   em = get_chunk_map(fs_info, chunk_offset, 1);
+   em = btrfs_get_chunk_map(fs_info, chunk_offset, 1);
if (IS_ERR(em)) {
/*
 * This is a logic error, but we don't want to just rely on the
@@ -4884,7 +4884,7 @@ int btrfs_finish_chunk_alloc(struct btrfs_trans_handle 
*trans,
int i = 0;
int ret = 0;
 
-   em = get_chunk_map(fs_info, chunk_offset, chunk_size);
+   em = btrfs_get_chunk_map(fs_info, chunk_offset, chunk_size);
if (IS_ERR(em))
return PTR_ERR(em);
 
@@ -5026,7 +5026,7 @@ int btrfs_chunk_readonly(struct btrfs_fs_info *fs_info, 
u64 chunk_offset)
int miss_ndevs = 0;
int i;
 
-   em = get_chunk_map(fs_info, chunk_offset, 1);
+   em = btrfs_get_chunk_map(fs_info, chunk_offset, 1);
if (IS_ERR(em))
return 1;
 
@@ -5086,7 +5086,7 @@ int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 
logical, u64 len)
struct map_lookup *map;
int ret;
 
-   em = get_chunk_map(fs_info, logical, len);
+   em = btrfs_get_chunk_map(fs_info, logical, len);
if (IS_ERR(em))
/*
 * We could return errors for these cases, but that could get
@@ -5132,7 +5132,7 @@ unsigned long btrfs_full_stripe_len(struct btrfs_fs_info 
*fs_info,
struct map_lookup *map;
unsigned long len = fs_info->sectorsize;
 
-   em = get_chunk_map(fs_info, logical, len);
+   em = btrfs_get_chunk_map(fs_info, logical, len);
 
if (!WARN_ON(IS_ERR(em))) {
map = em->map_lookup;
@@ -5149,7 +5149,7 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, 
u64 logical, u64 len)
struct map_lookup *map;
int ret = 0;
 
-   em = get_chunk_map(fs_info, logical, len);
+   em = btrfs_get_chunk_map(fs_info, logical, len);
 
if(!WARN_ON(IS_ERR(em))) {
map = em->map_lookup;
@@ -5308,7 +5308,7 @@ static int __btrfs_map_block_for_discard(struct 
btrfs_fs_info *fs_info,
/* discard always return a bbio */
ASSERT(bbio_ret);
 
-   em = get_chunk_map(fs_info, logical, length);
+   em = btrfs_get_chunk_map(fs_info, logical, length);
if (IS_ERR(em))
return PTR_ERR(em);
 
@@ -5634,7 +5634,7 @@ static int __btrfs_map_block(struct btrfs_fs_info 
*fs_info,
return __btrfs_map_block_for_discard(fs_info, logical,
 *length, bbio_ret);
 
-   em = get_chunk_map(fs_info, logical, *length);
+   em = btrfs_get_chunk_map(fs_info, logical, *length);
if (IS_ERR(em))
return PTR_ERR(em);
 
@@ -5933,7 +5933,7 @@ int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 
chunk_start,
u64 rmap_len;
int i, j, nr = 0;
 
-   em = get_chunk_map(fs_info, chunk_start, 1);
+   em = btrfs_get_chunk_map(fs_info, chunk_start, 1);
if (IS_ERR(em))
return -EIO;
 
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 23e9285d88de..d68c8a05a774 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -465,6 +465,15 @@ unsigned long btrfs_full_stripe_len(struct btrfs_fs_info 
*fs_info,
 int btrfs_finish_chunk_alloc(struct btrfs_trans_handle *trans,
 u64 chunk_offset, u64 chunk_size);
 int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset);
+/**
+ * btrfs_get_chunk_map() - Find the mapping containing the given logical 
extent.
+ * @logical: Logical block offset in bytes.
+ * @length: Length of extent in bytes.
+ *
+ * Return: Chunk mapping or ERR_PTR.
+ */
+struct extent_map *btrfs_get_chunk_map(struct btrfs_fs_info *fs_info,
+  u64 logical, u64 length);
 
 static inline void btrfs_dev_stat_inc(struct btrfs_device *dev,
  int 

Re: IO errors when building RAID1.... ?

2018-08-31 Thread Chris Murphy
If you want you can post the output from 'sudo smartctl -x /dev/sda'
which will contain more information... but this is in some sense
superfluous. The problem is very clearly a bad drive, the drive
explicitly report to libata a write error, and included the sector LBA
affected, and only the drive firmware would know that. It's not likely
a cable problem or something like. And that the write error is
reported at all means it's persistent, not transient.


Chris Murphy


Re: IO errors when building RAID1.... ?

2018-08-31 Thread Chris Murphy
On Fri, Aug 31, 2018 at 10:35 AM, Pierre Couderc  wrote:


>
> Aug 31 17:34:55 server su[559]: Successful su for root by nous
> Aug 31 17:34:55 server su[559]: + /dev/pts/1 nous:root
> Aug 31 17:34:55 server su[559]: pam_unix(su:session): session opened for
> user root by nous(uid=1000)
> Aug 31 17:34:55 server su[559]: pam_systemd(su:session): Cannot create
> session: Already running in a session
> Aug 31 17:35:03 server kernel: BTRFS info (device sda1): disk added
> /dev/sdb1
> Aug 31 17:35:40 server kernel: BTRFS info (device sda1): relocating block
> group 1103101952 flags 1
> Aug 31 17:36:12 server sshd[572]: Accepted password for nous from
> 2a01:e34:eeaf:c5f0:e54:15ff:feb1:b1c9 port 49308 ssh2
> Aug 31 17:36:12 server sshd[572]: pam_unix(sshd:session): session opened for
> user nous by (uid=0)
> Aug 31 17:36:12 server systemd-logind[415]: New session 4 of user nous.
> Aug 31 17:36:12 server systemd[1]: Started Session 4 of user nous.
> Aug 31 17:36:16 server kernel: ata1: lost interrupt (Status 0x50)
> Aug 31 17:36:16 server kernel: ata1.00: exception Emask 0x50 SAct 0x0 SErr
> 0x40d0802 action 0xe frozen
> Aug 31 17:36:16 server kernel: ata1.00: SError: { RecovComm HostInt
> PHYRdyChg CommWake 10B8B DevExch }
> Aug 31 17:36:16 server kernel: ata1.00: failed command: READ DMA
> Aug 31 17:36:16 server kernel: ata1.00: cmd
> c8/00:60:00:cd:02/00:00:00:00:00/e0 tag 0 dma 49152 in
> res
> 40/00:01:00:00:00/00:00:00:00:00/40 Emask 0x54 (ATA bus error)
> Aug 31 17:36:16 server kernel: ata1.00: status: { DRDY }
> Aug 31 17:36:16 server kernel: ata1.00: hard resetting link
> Aug 31 17:36:17 server kernel: ata1.01: hard resetting link
> Aug 31 17:36:18 server kernel: ata1.01: failed to resume link (SControl 0)
> Aug 31 17:36:18 server kernel: ata1.00: SATA link up 6.0 Gbps (SStatus 133
> SControl 300)
> Aug 31 17:36:18 server kernel: ata1.01: SATA link down (SStatus 4 SControl
> 0)
> Aug 31 17:36:18 server kernel: ata1.00: NODEV after polling detection
> Aug 31 17:36:18 server kernel: ata1.00: revalidation failed (errno=-2)
> Aug 31 17:36:20 server su[590]: Successful su for root by nous
> Aug 31 17:36:20 server su[590]: + /dev/pts/2 nous:root
> Aug 31 17:36:20 server su[590]: pam_unix(su:session): session opened for
> user root by nous(uid=1000)
> Aug 31 17:36:20 server su[590]: pam_systemd(su:session): Cannot create
> session: Already running in a session
> Aug 31 17:36:23 server kernel: ata1.00: hard resetting link
> Aug 31 17:36:23 server kernel: ata1.01: hard resetting link
> Aug 31 17:36:24 server kernel: ata1.01: failed to resume link (SControl 0)
> Aug 31 17:36:25 server kernel: ata1.00: SATA link up 6.0 Gbps (SStatus 133
> SControl 300)
> Aug 31 17:36:25 server kernel: ata1.01: SATA link down (SStatus 4 SControl
> 0)
> Aug 31 17:36:25 server kernel: ata1.00: NODEV after polling detection
> Aug 31 17:36:25 server kernel: ata1.00: revalidation failed (errno=-2)
> Aug 31 17:36:30 server kernel: ata1.00: hard resetting link
> Aug 31 17:36:30 server kernel: ata1.01: hard resetting link
> Aug 31 17:36:31 server kernel: ata1.01: failed to resume link (SControl 0)
> Aug 31 17:36:31 server kernel: ata1.00: SATA link up 6.0 Gbps (SStatus 133
> SControl 300)
> Aug 31 17:36:31 server kernel: ata1.01: SATA link down (SStatus 4 SControl
> 0)
> Aug 31 17:36:31 server kernel: ata1.00: NODEV after polling detection
> Aug 31 17:36:31 server kernel: ata1.00: revalidation failed (errno=-2)
> Aug 31 17:36:31 server kernel: ata1.00: disabled
> Aug 31 17:36:36 server kernel: ata1.00: hard resetting link
> Aug 31 17:36:37 server kernel: ata1.01: hard resetting link
> Aug 31 17:36:38 server kernel: ata1.01: failed to resume link (SControl 0)
> Aug 31 17:36:38 server kernel: ata1.00: SATA link up 6.0 Gbps (SStatus 133
> SControl 300)
> Aug 31 17:36:38 server kernel: ata1.01: SATA link down (SStatus 4 SControl
> 0)
> Aug 31 17:36:38 server kernel: ata1.00: NODEV after polling detection
> Aug 31 17:36:38 server kernel: sd 0:0:0:0: [sda] tag#0 FAILED Result:
> hostbyte=DID_OK driverbyte=DRIVER_SENSE
> Aug 31 17:36:38 server kernel: sd 0:0:0:0: [sda] tag#0 Sense Key : Illegal
> Request [current]
> Aug 31 17:36:38 server kernel: sd 0:0:0:0: [sda] tag#0 Add. Sense: Unaligned
> write command
> Aug 31 17:36:38 server kernel: sd 0:0:0:0: [sda] tag#0 CDB: Read(10) 28 00
> 00 02 cd 00 00 00 60 00
> Aug 31 17:36:38 server kernel: blk_update_request: I/O error, dev sda,
> sector 183552
> Aug 31 17:36:38 server kernel: BTRFS error (device sda1): bdev /dev/sda1
> errs: wr 0, rd 1, flush 0, corrupt 0, gen 0
> Aug 31 17:36:38 server kernel: BTRFS error (device sda1): bdev /dev/sda1
> errs: wr 0, rd 2, flush 0, corrupt 0, gen 0
> Aug 31 17:36:38 server kernel: BTRFS error (device sda1): bdev /dev/sda1
> errs: wr 0, rd 3, flush 0, corrupt 0, gen 0
> Aug 31 17:36:38 server kernel: sd 0:0:0:0: rejecting I/O to offline device
> Aug 31 17:36:38 server kernel: sd 0:0:0:0: [sda] killing request
> Aug 31 

bug: btrfs-progs scrub -R flag doesn't show per device stats

2018-08-31 Thread Chris Murphy
btrfs-progs v4.17.1

man btrfs-scrub:

   -R
   print raw statistics per-device instead of a summary


However, on a two device Btrfs volume, -R does not show per device
statistics. See screenshot:

https://drive.google.com/open?id=1xmt_NHGlNJPc8I0F4_OZxgGe9b3quCnD

Additionally, the description of -d and -R doesn't help me distinquish
between the two. -R says "instead of a summary" so that suggests -d
will summarize but isn't explicitly stated.



-- 
Chris Murphy


Re: How to erase a RAID1 (+++)?

2018-08-31 Thread Duncan
Alberto Bursi posted on Fri, 31 Aug 2018 14:54:46 + as excerpted:

> I just keep around a USB drive with a full Linux system on it, to act as
> "recovery". If the btrfs raid fails I boot into that and I can do
> maintenance with a full graphical interface and internet access so I can
> google things.

I do very similar, except my "recovery boot" is my backup (with normally 
including for root two levels of backup/recovery available, three for 
some things).

I've actually gone so far as to have /etc/fstab be a symlink to one of 
several files, depending on what version of root vs. the off-root 
filesystems I'm booting, with a set of modular files that get assembled 
by scripts to build the fstabs as appropriate.  So updating fstab is a 
process of updating the modules, then running the scripts to create the 
actual fstabs, and after I update a root backup the last step is changing 
the symlink to point to the appropriate fstab for that backup, so it's 
correct if I end up booting from it.

Meanwhile, each root, working and two backups, is its own set of two 
device partitions in btrfs raid1 mode.  (One set of backups is on 
separate physical devices, covering the device death scenario, the other 
is on different partitions on the same, newer and larger pair of physical 
devices as the working set, so it won't cover device death but still 
covers fat-fingering, filesystem fubaring, bad upgrades, etc.)

/boot is separate and there's four of those (working and three backups), 
one each on each device of the two physical pairs, with the bios able to 
point to any of the four.  I run grub2, so once the bios loads that, I 
can interactively load kernels from any of the other three /boots and 
choose to boot any of the three roots.

And I build my own kernels, with an initrd attached as an initramfs to 
each, and test that they boot.  So selecting a kernel by definition 
selects its attached initramfs as well, meaning the initr*s are backed up 
and selected with the kernels.

(As I said earlier it'd sure be nice to be able to do away with the 
initr*s again.  I was actually thinking about testing that today, which 
was supposed to be a day off, but got called in to work, so the test will 
have to wait once again...)

What's nice about all that is that just as you said, each recovery/backup 
is a snapshot of the working system at the time I took the backup, so 
it's not a limited recovery boot at all, it has the same access to tools, 
manpages, net, X/plasma, browsers, etc, that my normal system does, 
because it /is/ my normal system from whenever I took the backup.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



IO errors when building RAID1.... ?

2018-08-31 Thread Pierre Couderc
When trying to build a RAID1 on main fs. After  normal debian stretch 
install :




root@server:/home/nous# btrfs device add /dev/sdb1 /
root@server:/home/nous# btrfs fi show
Label: none  uuid: ef0b9dad-c0eb-4a3b-9b41-e5e249363abc
    Total devices 2 FS bytes used 824.60MiB
    devid    1 size 1.82TiB used 3.02GiB path /dev/sda1
    devid    2 size 1.82TiB used 0.00B path /dev/sdb1

root@server:/home/nous# btrfs balance start -v -mconvert=raid1 
-dconvert=raid1 /

Dumping filters: flags 0x7, state 0x0, force is off
  DATA (flags 0x100): converting, target=16, soft is off
  METADATA (flags 0x100): converting, target=16, soft is off
  SYSTEM (flags 0x100): converting, target=16, soft is off
Killed
root@server:/home/nous# btrfs fi show
Label: none  uuid: ef0b9dad-c0eb-4a3b-9b41-e5e249363abc
    Total devices 2 FS bytes used 1.29GiB
    devid    2 size 1.82TiB used 1.00GiB path /dev/sdb1
    *** Some devices missing


Some IO errors on /dev/sda are found in journalctl (see them below)

I cannot believe that /dev/sda has no hard disk errors when installing 
without problems, but has many ones when I "btrfs device add /dev/sdb1 /".


I can reproduce the problem : reinstall (3times...) and try "btrfs 
device add /dev/sdb1 /" with the same results...





Aug 31 17:34:55 server su[559]: Successful su for root by nous
Aug 31 17:34:55 server su[559]: + /dev/pts/1 nous:root
Aug 31 17:34:55 server su[559]: pam_unix(su:session): session opened for 
user root by nous(uid=1000)
Aug 31 17:34:55 server su[559]: pam_systemd(su:session): Cannot create 
session: Already running in a session
Aug 31 17:35:03 server kernel: BTRFS info (device sda1): disk added 
/dev/sdb1
Aug 31 17:35:40 server kernel: BTRFS info (device sda1): relocating 
block group 1103101952 flags 1
Aug 31 17:36:12 server sshd[572]: Accepted password for nous from 
2a01:e34:eeaf:c5f0:e54:15ff:feb1:b1c9 port 49308 ssh2
Aug 31 17:36:12 server sshd[572]: pam_unix(sshd:session): session opened 
for user nous by (uid=0)

Aug 31 17:36:12 server systemd-logind[415]: New session 4 of user nous.
Aug 31 17:36:12 server systemd[1]: Started Session 4 of user nous.
Aug 31 17:36:16 server kernel: ata1: lost interrupt (Status 0x50)
Aug 31 17:36:16 server kernel: ata1.00: exception Emask 0x50 SAct 0x0 
SErr 0x40d0802 action 0xe frozen
Aug 31 17:36:16 server kernel: ata1.00: SError: { RecovComm HostInt 
PHYRdyChg CommWake 10B8B DevExch }

Aug 31 17:36:16 server kernel: ata1.00: failed command: READ DMA
Aug 31 17:36:16 server kernel: ata1.00: cmd 
c8/00:60:00:cd:02/00:00:00:00:00/e0 tag 0 dma 49152 in
    res 
40/00:01:00:00:00/00:00:00:00:00/40 Emask 0x54 (ATA bus error)

Aug 31 17:36:16 server kernel: ata1.00: status: { DRDY }
Aug 31 17:36:16 server kernel: ata1.00: hard resetting link
Aug 31 17:36:17 server kernel: ata1.01: hard resetting link
Aug 31 17:36:18 server kernel: ata1.01: failed to resume link (SControl 0)
Aug 31 17:36:18 server kernel: ata1.00: SATA link up 6.0 Gbps (SStatus 
133 SControl 300)
Aug 31 17:36:18 server kernel: ata1.01: SATA link down (SStatus 4 
SControl 0)

Aug 31 17:36:18 server kernel: ata1.00: NODEV after polling detection
Aug 31 17:36:18 server kernel: ata1.00: revalidation failed (errno=-2)
Aug 31 17:36:20 server su[590]: Successful su for root by nous
Aug 31 17:36:20 server su[590]: + /dev/pts/2 nous:root
Aug 31 17:36:20 server su[590]: pam_unix(su:session): session opened for 
user root by nous(uid=1000)
Aug 31 17:36:20 server su[590]: pam_systemd(su:session): Cannot create 
session: Already running in a session

Aug 31 17:36:23 server kernel: ata1.00: hard resetting link
Aug 31 17:36:23 server kernel: ata1.01: hard resetting link
Aug 31 17:36:24 server kernel: ata1.01: failed to resume link (SControl 0)
Aug 31 17:36:25 server kernel: ata1.00: SATA link up 6.0 Gbps (SStatus 
133 SControl 300)
Aug 31 17:36:25 server kernel: ata1.01: SATA link down (SStatus 4 
SControl 0)

Aug 31 17:36:25 server kernel: ata1.00: NODEV after polling detection
Aug 31 17:36:25 server kernel: ata1.00: revalidation failed (errno=-2)
Aug 31 17:36:30 server kernel: ata1.00: hard resetting link
Aug 31 17:36:30 server kernel: ata1.01: hard resetting link
Aug 31 17:36:31 server kernel: ata1.01: failed to resume link (SControl 0)
Aug 31 17:36:31 server kernel: ata1.00: SATA link up 6.0 Gbps (SStatus 
133 SControl 300)
Aug 31 17:36:31 server kernel: ata1.01: SATA link down (SStatus 4 
SControl 0)

Aug 31 17:36:31 server kernel: ata1.00: NODEV after polling detection
Aug 31 17:36:31 server kernel: ata1.00: revalidation failed (errno=-2)
Aug 31 17:36:31 server kernel: ata1.00: disabled
Aug 31 17:36:36 server kernel: ata1.00: hard resetting link
Aug 31 17:36:37 server kernel: ata1.01: hard resetting link
Aug 31 17:36:38 server kernel: ata1.01: failed to resume link (SControl 0)
Aug 31 17:36:38 server kernel: ata1.00: SATA link up 6.0 Gbps (SStatus 
133 SControl 300)
Aug 31 17:36:38 server kernel: ata1.01: SATA 

Re: How to erase a RAID1 (+++)?

2018-08-31 Thread Pierre Couderc



On 08/31/2018 04:54 PM, Alberto Bursi wrote:

On 8/31/2018 8:53 AM, Pierre Couderc wrote:

OK, I have understood the message... I was planning that as you said
"semi-routinely", and I understand btrfs is not soon enough ready, and
I am very very far to be a specialist as you are.
So, I shall mount my RAID1 very standard, and I  shall expect the
disaster, hoping it does not occur
Now, I shall try to absorb all that...

Thank you very much !


I just keep around a USB drive with a full Linux system on it, to act as
"recovery". If the btrfs raid fails I boot into that and I can do
maintenance with a full graphical interface and internet access so I can
google things.

Of course on a home server you can't do that without some automation
that will switch boot device after some amount of boot failures of the
main OS.

While if your server has BMC (lights-out management) you can switch boot
device through that.

-Alberto
Thank you Alberto, yes,I have some other computers to react in case of 
failure.




Re: How to erase a RAID1 (+++)?

2018-08-31 Thread Alberto Bursi

On 8/31/2018 8:53 AM, Pierre Couderc wrote:
>
> OK, I have understood the message... I was planning that as you said 
> "semi-routinely", and I understand btrfs is not soon enough ready, and 
> I am very very far to be a specialist as you are.
> So, I shall mount my RAID1 very standard, and I  shall expect the 
> disaster, hoping it does not occur
> Now, I shall try to absorb all that...
>
> Thank you very much !
>

I just keep around a USB drive with a full Linux system on it, to act as 
"recovery". If the btrfs raid fails I boot into that and I can do 
maintenance with a full graphical interface and internet access so I can 
google things.

Of course on a home server you can't do that without some automation 
that will switch boot device after some amount of boot failures of the 
main OS.

While if your server has BMC (lights-out management) you can switch boot 
device through that.

-Alberto



Re: [PATCH 01/35] btrfs: add btrfs_delete_ref_head helper

2018-08-31 Thread Josef Bacik
On Fri, Aug 31, 2018 at 10:57:45AM +0300, Nikolay Borisov wrote:
> 
> 
> On 30.08.2018 20:41, Josef Bacik wrote:
> > From: Josef Bacik 
> > 
> > We do this dance in cleanup_ref_head and check_ref_cleanup, unify it
> > into a helper and cleanup the calling functions.
> > 
> > Signed-off-by: Josef Bacik 
> > ---
> >  fs/btrfs/delayed-ref.c | 14 ++
> >  fs/btrfs/delayed-ref.h |  3 ++-
> >  fs/btrfs/extent-tree.c | 24 
> >  3 files changed, 20 insertions(+), 21 deletions(-)
> > 
> > diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> > index 62ff545ba1f7..3a9e4ac21794 100644
> > --- a/fs/btrfs/delayed-ref.c
> > +++ b/fs/btrfs/delayed-ref.c
> > @@ -393,6 +393,20 @@ btrfs_select_ref_head(struct btrfs_trans_handle *trans)
> > return head;
> >  }
> >  
> > +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
> > +  struct btrfs_delayed_ref_head *head)
> > +{
> > +   lockdep_assert_held(_refs->lock);
> > +   lockdep_assert_held(>lock);
> > +
> > +   rb_erase(>href_node, _refs->href_root);
> > +   RB_CLEAR_NODE(>href_node);
> > +   atomic_dec(_refs->num_entries);
> > +   delayed_refs->num_heads--;
> > +   if (head->processing == 0)
> > +   delayed_refs->num_heads_ready--;
> > +}
> > +
> >  /*
> >   * Helper to insert the ref_node to the tail or merge with tail.
> >   *
> > diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
> > index d9f2a4ebd5db..7769177b489e 100644
> > --- a/fs/btrfs/delayed-ref.h
> > +++ b/fs/btrfs/delayed-ref.h
> > @@ -261,7 +261,8 @@ static inline void btrfs_delayed_ref_unlock(struct 
> > btrfs_delayed_ref_head *head)
> >  {
> > mutex_unlock(>mutex);
> >  }
> > -
> > +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
> > +  struct btrfs_delayed_ref_head *head);
> >  
> >  struct btrfs_delayed_ref_head *
> >  btrfs_select_ref_head(struct btrfs_trans_handle *trans);
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index f77226d8020a..6799950fa057 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -2492,12 +2492,9 @@ static int cleanup_ref_head(struct 
> > btrfs_trans_handle *trans,
> > spin_unlock(_refs->lock);
> > return 1;
> > }
> > -   delayed_refs->num_heads--;
> > -   rb_erase(>href_node, _refs->href_root);
> > -   RB_CLEAR_NODE(>href_node);
> > -   spin_unlock(>lock);
> > +   btrfs_delete_ref_head(delayed_refs, head);
> > spin_unlock(_refs->lock);
> > -   atomic_dec(_refs->num_entries);
> > +   spin_unlock(>lock);
> >  
> 
> Again, the feedback of reversed lock-order is not addressed:
> 
> https://www.spinics.net/lists/linux-btrfs/msg80482.html
> 

Oops, I'll fix this.

Josef


Re: [PATCH 15/35] btrfs: run delayed iputs before committing

2018-08-31 Thread Josef Bacik
On Fri, Aug 31, 2018 at 10:55:22AM +0300, Nikolay Borisov wrote:
> 
> 
> On 30.08.2018 20:42, Josef Bacik wrote:
> > We want to have a complete picture of any delayed inode updates before
> > we make the decision to commit or not, so make sure we run delayed iputs
> > before making the decision to commit or not.
> 
> Again, there was request for more detail which is not addressed:
> 
> https://www.spinics.net/lists/linux-btrfs/msg81237.html

I'll make the changelog more explicit, thanks,

Josef


Re: [PATCH 07/35] btrfs: dump block_rsv whe dumping space info

2018-08-31 Thread Josef Bacik
On Fri, Aug 31, 2018 at 10:53:54AM +0300, Nikolay Borisov wrote:
> 
> 
> On 30.08.2018 20:41, Josef Bacik wrote:
> > For enospc_debug having the block rsvs is super helpful to see if we've
> > done something wrong.
> > 
> > Signed-off-by: Josef Bacik 
> > ---
> >  fs/btrfs/extent-tree.c | 15 +++
> >  1 file changed, 15 insertions(+)
> > 
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index 80615a579b18..df826f713034 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -7910,6 +7910,15 @@ static noinline int find_free_extent(struct 
> > btrfs_fs_info *fs_info,
> > return ret;
> >  }
> >  
> > +static void dump_block_rsv(struct btrfs_block_rsv *rsv)
> > +{
> > +   spin_lock(>lock);
> > +   printk(KERN_ERR "%d: size %llu reserved %llu\n",
> > +  rsv->type, (unsigned long long)rsv->size,
> > +  (unsigned long long)rsv->reserved);
> > +   spin_unlock(>lock);
> > +}
> 
> There was feedback on this hunk which you seem to have ignored:
> 
> https://www.spinics.net/lists/linux-btrfs/msg80473.html
> 

Yup good point, I'll fix this up.  Thanks,

Josef


Re: [PATCH 04/35] btrfs: only track ref_heads in delayed_ref_updates

2018-08-31 Thread Josef Bacik
On Fri, Aug 31, 2018 at 10:52:55AM +0300, Nikolay Borisov wrote:
> 
> 
> On 30.08.2018 20:41, Josef Bacik wrote:
> > From: Josef Bacik 
> > 
> > We use this number to figure out how many delayed refs to run, but
> > __btrfs_run_delayed_refs really only checks every time we need a new
> > delayed ref head, so we always run at least one ref head completely no
> > matter what the number of items on it.  So instead track only the ref
> > heads added by this trans handle and adjust the counting appropriately
> > in __btrfs_run_delayed_refs.
> > 
> > Signed-off-by: Josef Bacik 
> > ---
> >  fs/btrfs/delayed-ref.c | 3 ---
> >  fs/btrfs/extent-tree.c | 5 +
> >  2 files changed, 1 insertion(+), 7 deletions(-)
> > 
> > diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> > index 3a9e4ac21794..27f7dd4e3d52 100644
> > --- a/fs/btrfs/delayed-ref.c
> > +++ b/fs/btrfs/delayed-ref.c
> > @@ -234,8 +234,6 @@ static inline void drop_delayed_ref(struct 
> > btrfs_trans_handle *trans,
> > ref->in_tree = 0;
> > btrfs_put_delayed_ref(ref);
> > atomic_dec(_refs->num_entries);
> > -   if (trans->delayed_ref_updates)
> > -   trans->delayed_ref_updates--;
> 
> There was feedback on this particular hunk and you've completely ignored
> it, that's not nice:
> 
> https://www.spinics.net/lists/linux-btrfs/msg80514.html

I just missed it in the last go around (as is the case for the other ones).  I'm
not sure what part is confusing, we only want delayed_ref_updates to be how many
delayed ref heads there are, which is what this patch is changing.  I could
probably split this between these two changes and the count changing below since
they are slightly different things, I'll do that.  Thanks,

Josef


Re: [PATCH 30/35] btrfs: cleanup pending bgs on transaction abort

2018-08-31 Thread Josef Bacik
On Fri, Aug 31, 2018 at 10:48:58AM +0300, Nikolay Borisov wrote:
> 
> 
> On 30.08.2018 20:42, Josef Bacik wrote:
> > We may abort the transaction during a commit and not have a chance to
> > run the pending bgs stuff, which will leave block groups on our list and
> > cause us accounting issues and leaked memory.  Fix this by running the
> > pending bgs when we cleanup a transaction.
> > 
> > Signed-off-by: Josef Bacik 
> > ---
> >  fs/btrfs/transaction.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> > index 89d14f135837..0f39a0d302d3 100644
> > --- a/fs/btrfs/transaction.c
> > +++ b/fs/btrfs/transaction.c
> > @@ -2273,6 +2273,7 @@ int btrfs_commit_transaction(struct 
> > btrfs_trans_handle *trans)
> > btrfs_scrub_continue(fs_info);
> >  cleanup_transaction:
> > btrfs_trans_release_metadata(trans);
> > +   btrfs_create_pending_block_groups(trans);
> 
> And now you've basically hi-jacked btrfs_create_pending_block_groups to
> just act as "delete all bg" in case transaction is aborted. Considering
> this and the previous patch I'd rather you replace them with a single
> one which introduces a new function delete_pending_bgs or whatever and
> use that. This will be more explicit and self-documenting.
> 

I haven't hi-jacked it and I'm not adding another helper when we already have
code that does the right thing.  Remember if we abort a transaction in a path
that doesn't commit we still end up calling btrfs_create_pending_block_groups()
in btrfs_end_transaction which does the cleanup there, I'm not adding a bunch of
new code for a case that's easily handled with the previous fix and works the
same way in other paths.  Thanks,

Josef


Re: [PATCH 29/35] btrfs: just delete pending bgs if we are aborted

2018-08-31 Thread Josef Bacik
On Fri, Aug 31, 2018 at 10:46:36AM +0300, Nikolay Borisov wrote:
> 
> 
> On 30.08.2018 20:42, Josef Bacik wrote:
> > We still need to do all of the accounting cleanup for pending block
> > groups if we abort.  So set the ret to trans->aborted so if we aborted
> > the cleanup happens and everybody is happy.
> > 
> > Signed-off-by: Josef Bacik 
> > ---
> >  fs/btrfs/extent-tree.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index 90f267f4dd0f..132a1157982c 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -10333,7 +10333,7 @@ void btrfs_create_pending_block_groups(struct 
> > btrfs_trans_handle *trans)
> > struct btrfs_root *extent_root = fs_info->extent_root;
> > struct btrfs_block_group_item item;
> > struct btrfs_key key;
> > -   int ret = 0;
> > +   int ret = trans->aborted;
> 
> This is really subtle and magical and not obvious from the context of
> the patch, but if the transaction is aborted this will change the loop
> to actually just delete all block groups in ->new_bgs. I'd rather have
> an explicit loop for that honestly.

We need it this way in case creating the bg's errors out anyway, there's no
sense in adding a bunch of code to do something we have to handle already
anyway.  Thanks,

Josef


Re: [PATCH 27/35] btrfs: handle delayed ref head accounting cleanup in abort

2018-08-31 Thread Josef Bacik
On Fri, Aug 31, 2018 at 10:42:13AM +0300, Nikolay Borisov wrote:
> 
> 
> On 30.08.2018 20:42, Josef Bacik wrote:
> > We weren't doing any of the accounting cleanup when we aborted
> > transactions.  Fix this by making cleanup_ref_head_accounting global and
> > calling it from the abort code, this fixes the issue where our
> > accounting was all wrong after the fs aborts.
> > 
> > Signed-off-by: Josef Bacik 
> > ---
> >  fs/btrfs/ctree.h   |  5 +
> >  fs/btrfs/disk-io.c |  1 +
> >  fs/btrfs/extent-tree.c | 13 ++---
> >  3 files changed, 12 insertions(+), 7 deletions(-)
> > 
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index 791e287c2292..67923b2030b8 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -35,6 +35,7 @@
> >  struct btrfs_trans_handle;
> >  struct btrfs_transaction;
> >  struct btrfs_pending_snapshot;
> > +struct btrfs_delayed_ref_root;
> >  extern struct kmem_cache *btrfs_trans_handle_cachep;
> >  extern struct kmem_cache *btrfs_bit_radix_cachep;
> >  extern struct kmem_cache *btrfs_path_cachep;
> > @@ -2624,6 +2625,10 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle 
> > *trans,
> >unsigned long count);
> >  int btrfs_async_run_delayed_refs(struct btrfs_fs_info *fs_info,
> >  unsigned long count, u64 transid, int wait);
> > +void
> > +btrfs_cleanup_ref_head_accounting(struct btrfs_fs_info *fs_info,
> > + struct btrfs_delayed_ref_root *delayed_refs,
> > + struct btrfs_delayed_ref_head *head);
> >  int btrfs_lookup_data_extent(struct btrfs_fs_info *fs_info, u64 start, u64 
> > len);
> >  int btrfs_lookup_extent_info(struct btrfs_trans_handle *trans,
> >  struct btrfs_fs_info *fs_info, u64 bytenr,
> > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > index 1d3f5731d616..caaca8154a1a 100644
> > --- a/fs/btrfs/disk-io.c
> > +++ b/fs/btrfs/disk-io.c
> > @@ -4240,6 +4240,7 @@ static int btrfs_destroy_delayed_refs(struct 
> > btrfs_transaction *trans,
> > if (pin_bytes)
> > btrfs_pin_extent(fs_info, head->bytenr,
> >  head->num_bytes, 1);
> > +   btrfs_cleanup_ref_head_accounting(fs_info, delayed_refs, head);
> > btrfs_put_delayed_ref_head(head);
> > cond_resched();
> > spin_lock(_refs->lock);
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index 32579221d900..031d2b11ddee 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -2466,12 +2466,11 @@ static int cleanup_extent_op(struct 
> > btrfs_trans_handle *trans,
> > return ret ? ret : 1;
> >  }
> >  
> > -static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans,
> > -   struct btrfs_delayed_ref_head *head)
> > +void
> > +btrfs_cleanup_ref_head_accounting(struct btrfs_fs_info *fs_info,
> > + struct btrfs_delayed_ref_root *delayed_refs,
> > + struct btrfs_delayed_ref_head *head)
> >  {
> I don't see any reason to change the signature of the function, the new
> call sites have valid transaction handles where you can obtain
> references to fs_info/delayed_refs. Just stick with adding btrfs_ prefix
> and exporting it.
> 

We don't have a valid transaction handle in btrfs_destroy_delayed_refs because
we can call it at umount time when we no longer have a trans handle.  Thanks,

Josef


Re: [PATCH 22/35] btrfs: make sure we create all new bgs

2018-08-31 Thread Josef Bacik
On Fri, Aug 31, 2018 at 10:31:49AM +0300, Nikolay Borisov wrote:
> 
> 
> On 30.08.2018 20:42, Josef Bacik wrote:
> > We can actually allocate new chunks while we're creating our bg's, so
> > instead of doing list_for_each_safe, just do while (!list_empty()) so we
> > make sure to catch any new bg's that get added to the list.
> 
> HOw can this occur, please elaborate and put an example callstack in the
> commit log.
> 

Eh?  We're modifying the extent tree and chunk tree, which can cause bg's to be
allocated, it's just common sense.

Josef


Re: [PATCH 01/35] btrfs: add btrfs_delete_ref_head helper

2018-08-31 Thread Nikolay Borisov



On 30.08.2018 20:41, Josef Bacik wrote:
> From: Josef Bacik 
> 
> We do this dance in cleanup_ref_head and check_ref_cleanup, unify it
> into a helper and cleanup the calling functions.
> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/delayed-ref.c | 14 ++
>  fs/btrfs/delayed-ref.h |  3 ++-
>  fs/btrfs/extent-tree.c | 24 
>  3 files changed, 20 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index 62ff545ba1f7..3a9e4ac21794 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -393,6 +393,20 @@ btrfs_select_ref_head(struct btrfs_trans_handle *trans)
>   return head;
>  }
>  
> +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
> +struct btrfs_delayed_ref_head *head)
> +{
> + lockdep_assert_held(_refs->lock);
> + lockdep_assert_held(>lock);
> +
> + rb_erase(>href_node, _refs->href_root);
> + RB_CLEAR_NODE(>href_node);
> + atomic_dec(_refs->num_entries);
> + delayed_refs->num_heads--;
> + if (head->processing == 0)
> + delayed_refs->num_heads_ready--;
> +}
> +
>  /*
>   * Helper to insert the ref_node to the tail or merge with tail.
>   *
> diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
> index d9f2a4ebd5db..7769177b489e 100644
> --- a/fs/btrfs/delayed-ref.h
> +++ b/fs/btrfs/delayed-ref.h
> @@ -261,7 +261,8 @@ static inline void btrfs_delayed_ref_unlock(struct 
> btrfs_delayed_ref_head *head)
>  {
>   mutex_unlock(>mutex);
>  }
> -
> +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
> +struct btrfs_delayed_ref_head *head);
>  
>  struct btrfs_delayed_ref_head *
>  btrfs_select_ref_head(struct btrfs_trans_handle *trans);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index f77226d8020a..6799950fa057 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2492,12 +2492,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
> *trans,
>   spin_unlock(_refs->lock);
>   return 1;
>   }
> - delayed_refs->num_heads--;
> - rb_erase(>href_node, _refs->href_root);
> - RB_CLEAR_NODE(>href_node);
> - spin_unlock(>lock);
> + btrfs_delete_ref_head(delayed_refs, head);
>   spin_unlock(_refs->lock);
> - atomic_dec(_refs->num_entries);
> + spin_unlock(>lock);
>  

Again, the feedback of reversed lock-order is not addressed:

https://www.spinics.net/lists/linux-btrfs/msg80482.html

>   trace_run_delayed_ref_head(fs_info, head, 0);
>  
> @@ -6984,22 +6981,9 @@ static noinline int check_ref_cleanup(struct 
> btrfs_trans_handle *trans,
>   if (!mutex_trylock(>mutex))
>   goto out;
>  
> - /*
> -  * at this point we have a head with no other entries.  Go
> -  * ahead and process it.
> -  */
> - rb_erase(>href_node, _refs->href_root);
> - RB_CLEAR_NODE(>href_node);
> - atomic_dec(_refs->num_entries);
> -
> - /*
> -  * we don't take a ref on the node because we're removing it from the
> -  * tree, so we just steal the ref the tree was holding.
> -  */
> - delayed_refs->num_heads--;
> - if (head->processing == 0)
> - delayed_refs->num_heads_ready--;
> + btrfs_delete_ref_head(delayed_refs, head);
>   head->processing = 0;
> +
>   spin_unlock(>lock);
>   spin_unlock(_refs->lock);
>  
> 


Re: [PATCH 15/35] btrfs: run delayed iputs before committing

2018-08-31 Thread Nikolay Borisov



On 30.08.2018 20:42, Josef Bacik wrote:
> We want to have a complete picture of any delayed inode updates before
> we make the decision to commit or not, so make sure we run delayed iputs
> before making the decision to commit or not.

Again, there was request for more detail which is not addressed:

https://www.spinics.net/lists/linux-btrfs/msg81237.html

> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/extent-tree.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 7c0e99e1f56c..064db7ebaf67 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4831,6 +4831,10 @@ static int may_commit_transaction(struct btrfs_fs_info 
> *fs_info,
>   goto commit;
>   }
>  
> + mutex_lock(_info->cleaner_delayed_iput_mutex);
> + btrfs_run_delayed_iputs(fs_info);
> + mutex_unlock(_info->cleaner_delayed_iput_mutex);
> +
>   spin_lock(_rsv->lock);
>   reclaim_bytes += delayed_rsv->reserved;
>   spin_unlock(_rsv->lock);
> 


Re: [PATCH 07/35] btrfs: dump block_rsv whe dumping space info

2018-08-31 Thread Nikolay Borisov



On 30.08.2018 20:41, Josef Bacik wrote:
> For enospc_debug having the block rsvs is super helpful to see if we've
> done something wrong.
> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/extent-tree.c | 15 +++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 80615a579b18..df826f713034 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -7910,6 +7910,15 @@ static noinline int find_free_extent(struct 
> btrfs_fs_info *fs_info,
>   return ret;
>  }
>  
> +static void dump_block_rsv(struct btrfs_block_rsv *rsv)
> +{
> + spin_lock(>lock);
> + printk(KERN_ERR "%d: size %llu reserved %llu\n",
> +rsv->type, (unsigned long long)rsv->size,
> +(unsigned long long)rsv->reserved);
> + spin_unlock(>lock);
> +}

There was feedback on this hunk which you seem to have ignored:

https://www.spinics.net/lists/linux-btrfs/msg80473.html

> +
>  static void dump_space_info(struct btrfs_fs_info *fs_info,
>   struct btrfs_space_info *info, u64 bytes,
>   int dump_block_groups)
> @@ -7929,6 +7938,12 @@ static void dump_space_info(struct btrfs_fs_info 
> *fs_info,
>   info->bytes_readonly);
>   spin_unlock(>lock);
>  
> + dump_block_rsv(_info->global_block_rsv);
> + dump_block_rsv(_info->trans_block_rsv);
> + dump_block_rsv(_info->chunk_block_rsv);
> + dump_block_rsv(_info->delayed_block_rsv);
> + dump_block_rsv(_info->delayed_refs_rsv);
> +
>   if (!dump_block_groups)
>   return;
>  
> 


Re: [PATCH 04/35] btrfs: only track ref_heads in delayed_ref_updates

2018-08-31 Thread Nikolay Borisov



On 30.08.2018 20:41, Josef Bacik wrote:
> From: Josef Bacik 
> 
> We use this number to figure out how many delayed refs to run, but
> __btrfs_run_delayed_refs really only checks every time we need a new
> delayed ref head, so we always run at least one ref head completely no
> matter what the number of items on it.  So instead track only the ref
> heads added by this trans handle and adjust the counting appropriately
> in __btrfs_run_delayed_refs.
> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/delayed-ref.c | 3 ---
>  fs/btrfs/extent-tree.c | 5 +
>  2 files changed, 1 insertion(+), 7 deletions(-)
> 
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index 3a9e4ac21794..27f7dd4e3d52 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -234,8 +234,6 @@ static inline void drop_delayed_ref(struct 
> btrfs_trans_handle *trans,
>   ref->in_tree = 0;
>   btrfs_put_delayed_ref(ref);
>   atomic_dec(_refs->num_entries);
> - if (trans->delayed_ref_updates)
> - trans->delayed_ref_updates--;

There was feedback on this particular hunk and you've completely ignored
it, that's not nice:

https://www.spinics.net/lists/linux-btrfs/msg80514.html

>  }
>  
>  static bool merge_ref(struct btrfs_trans_handle *trans,
> @@ -460,7 +458,6 @@ static int insert_delayed_ref(struct btrfs_trans_handle 
> *trans,
>   if (ref->action == BTRFS_ADD_DELAYED_REF)
>   list_add_tail(>add_list, >ref_add_list);
>   atomic_inc(>num_entries);
> - trans->delayed_ref_updates++;
>   spin_unlock(>lock);
>   return ret;
>  }
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 87c42a2c45b1..20531389a20a 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2583,6 +2583,7 @@ static noinline int __btrfs_run_delayed_refs(struct 
> btrfs_trans_handle *trans,
>   spin_unlock(_refs->lock);
>   break;
>   }
> + count++;
>  
>   /* grab the lock that says we are going to process
>* all the refs for this head */
> @@ -2596,7 +2597,6 @@ static noinline int __btrfs_run_delayed_refs(struct 
> btrfs_trans_handle *trans,
>*/
>   if (ret == -EAGAIN) {
>   locked_ref = NULL;
> - count++;
>   continue;
>   }
>   }
> @@ -2624,7 +2624,6 @@ static noinline int __btrfs_run_delayed_refs(struct 
> btrfs_trans_handle *trans,
>   unselect_delayed_ref_head(delayed_refs, locked_ref);
>   locked_ref = NULL;
>   cond_resched();
> - count++;
>   continue;
>   }
>  
> @@ -2642,7 +2641,6 @@ static noinline int __btrfs_run_delayed_refs(struct 
> btrfs_trans_handle *trans,
>   return ret;
>   }
>   locked_ref = NULL;
> - count++;
>   continue;
>   }
>  
> @@ -2693,7 +2691,6 @@ static noinline int __btrfs_run_delayed_refs(struct 
> btrfs_trans_handle *trans,
>   }
>  
>   btrfs_put_delayed_ref(ref);
> - count++;
>   cond_resched();
>   }
>  
> 


Re: [PATCH 30/35] btrfs: cleanup pending bgs on transaction abort

2018-08-31 Thread Nikolay Borisov



On 30.08.2018 20:42, Josef Bacik wrote:
> We may abort the transaction during a commit and not have a chance to
> run the pending bgs stuff, which will leave block groups on our list and
> cause us accounting issues and leaked memory.  Fix this by running the
> pending bgs when we cleanup a transaction.
> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/transaction.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 89d14f135837..0f39a0d302d3 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -2273,6 +2273,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
> *trans)
>   btrfs_scrub_continue(fs_info);
>  cleanup_transaction:
>   btrfs_trans_release_metadata(trans);
> + btrfs_create_pending_block_groups(trans);

And now you've basically hi-jacked btrfs_create_pending_block_groups to
just act as "delete all bg" in case transaction is aborted. Considering
this and the previous patch I'd rather you replace them with a single
one which introduces a new function delete_pending_bgs or whatever and
use that. This will be more explicit and self-documenting.

>   btrfs_trans_release_chunk_metadata(trans);
>   trans->block_rsv = NULL;
>   btrfs_warn(fs_info, "Skipping commit of aborted transaction.");
> 


Re: [PATCH 29/35] btrfs: just delete pending bgs if we are aborted

2018-08-31 Thread Nikolay Borisov



On 30.08.2018 20:42, Josef Bacik wrote:
> We still need to do all of the accounting cleanup for pending block
> groups if we abort.  So set the ret to trans->aborted so if we aborted
> the cleanup happens and everybody is happy.
> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/extent-tree.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 90f267f4dd0f..132a1157982c 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -10333,7 +10333,7 @@ void btrfs_create_pending_block_groups(struct 
> btrfs_trans_handle *trans)
>   struct btrfs_root *extent_root = fs_info->extent_root;
>   struct btrfs_block_group_item item;
>   struct btrfs_key key;
> - int ret = 0;
> + int ret = trans->aborted;

This is really subtle and magical and not obvious from the context of
the patch, but if the transaction is aborted this will change the loop
to actually just delete all block groups in ->new_bgs. I'd rather have
an explicit loop for that honestly.

>   bool can_flush_pending_bgs = trans->can_flush_pending_bgs;
>  
>   trans->can_flush_pending_bgs = false;
> 


Re: [PATCH 28/35] btrfs: call btrfs_create_pending_block_groups unconditionally

2018-08-31 Thread Nikolay Borisov



On 30.08.2018 20:42, Josef Bacik wrote:
> The first thing we do is loop through the list, this
> 
> if (!list_empty())
>   btrfs_create_pending_block_groups();
> 
> thing is just wasted space.
> 
> Signed-off-by: Josef Bacik 

Makes sense, although it would have been ideal if this patch followed
directly your " btrfs: make sure we create all new bgs" one.

Anyway:

Reviewed-by: Nikolay Borisov 
> ---
>  fs/btrfs/extent-tree.c | 3 +--
>  fs/btrfs/transaction.c | 6 ++
>  2 files changed, 3 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 031d2b11ddee..90f267f4dd0f 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2970,8 +2970,7 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle 
> *trans,
>   }
>  
>   if (run_all) {
> - if (!list_empty(>new_bgs))
> - btrfs_create_pending_block_groups(trans);
> + btrfs_create_pending_block_groups(trans);
>  
>   spin_lock(_refs->lock);
>   node = rb_first(_refs->href_root);
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 2bb19e2ded5e..89d14f135837 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -839,8 +839,7 @@ static int __btrfs_end_transaction(struct 
> btrfs_trans_handle *trans,
>   btrfs_trans_release_metadata(trans);
>   trans->block_rsv = NULL;
>  
> - if (!list_empty(>new_bgs))
> - btrfs_create_pending_block_groups(trans);
> + btrfs_create_pending_block_groups(trans);
>  
>   btrfs_trans_release_chunk_metadata(trans);
>  
> @@ -1927,8 +1926,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
> *trans)
>   cur_trans->delayed_refs.flushing = 1;
>   smp_wmb();
>  
> - if (!list_empty(>new_bgs))
> - btrfs_create_pending_block_groups(trans);
> + btrfs_create_pending_block_groups(trans);
>  
>   if (!test_bit(BTRFS_TRANS_DIRTY_BG_RUN, _trans->flags)) {
>   int run_it = 0;
> 


Re: [PATCH 27/35] btrfs: handle delayed ref head accounting cleanup in abort

2018-08-31 Thread Nikolay Borisov



On 30.08.2018 20:42, Josef Bacik wrote:
> We weren't doing any of the accounting cleanup when we aborted
> transactions.  Fix this by making cleanup_ref_head_accounting global and
> calling it from the abort code, this fixes the issue where our
> accounting was all wrong after the fs aborts.
> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/ctree.h   |  5 +
>  fs/btrfs/disk-io.c |  1 +
>  fs/btrfs/extent-tree.c | 13 ++---
>  3 files changed, 12 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 791e287c2292..67923b2030b8 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -35,6 +35,7 @@
>  struct btrfs_trans_handle;
>  struct btrfs_transaction;
>  struct btrfs_pending_snapshot;
> +struct btrfs_delayed_ref_root;
>  extern struct kmem_cache *btrfs_trans_handle_cachep;
>  extern struct kmem_cache *btrfs_bit_radix_cachep;
>  extern struct kmem_cache *btrfs_path_cachep;
> @@ -2624,6 +2625,10 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle 
> *trans,
>  unsigned long count);
>  int btrfs_async_run_delayed_refs(struct btrfs_fs_info *fs_info,
>unsigned long count, u64 transid, int wait);
> +void
> +btrfs_cleanup_ref_head_accounting(struct btrfs_fs_info *fs_info,
> +   struct btrfs_delayed_ref_root *delayed_refs,
> +   struct btrfs_delayed_ref_head *head);
>  int btrfs_lookup_data_extent(struct btrfs_fs_info *fs_info, u64 start, u64 
> len);
>  int btrfs_lookup_extent_info(struct btrfs_trans_handle *trans,
>struct btrfs_fs_info *fs_info, u64 bytenr,
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 1d3f5731d616..caaca8154a1a 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -4240,6 +4240,7 @@ static int btrfs_destroy_delayed_refs(struct 
> btrfs_transaction *trans,
>   if (pin_bytes)
>   btrfs_pin_extent(fs_info, head->bytenr,
>head->num_bytes, 1);
> + btrfs_cleanup_ref_head_accounting(fs_info, delayed_refs, head);
>   btrfs_put_delayed_ref_head(head);
>   cond_resched();
>   spin_lock(_refs->lock);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 32579221d900..031d2b11ddee 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2466,12 +2466,11 @@ static int cleanup_extent_op(struct 
> btrfs_trans_handle *trans,
>   return ret ? ret : 1;
>  }
>  
> -static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans,
> - struct btrfs_delayed_ref_head *head)
> +void
> +btrfs_cleanup_ref_head_accounting(struct btrfs_fs_info *fs_info,
> +   struct btrfs_delayed_ref_root *delayed_refs,
> +   struct btrfs_delayed_ref_head *head)
>  {
I don't see any reason to change the signature of the function, the new
call sites have valid transaction handles where you can obtain
references to fs_info/delayed_refs. Just stick with adding btrfs_ prefix
and exporting it.

> - struct btrfs_fs_info *fs_info = trans->fs_info;
> - struct btrfs_delayed_ref_root *delayed_refs =
> - >transaction->delayed_refs;
>   int nr_items = 1;
>  
>   if (head->total_ref_mod < 0) {
> @@ -2549,7 +2548,7 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
> *trans,
>   }
>   }
>  
> - cleanup_ref_head_accounting(trans, head);
> + btrfs_cleanup_ref_head_accounting(fs_info, delayed_refs, head);
>  
>   trace_run_delayed_ref_head(fs_info, head, 0);
>   btrfs_delayed_ref_unlock(head);
> @@ -7191,7 +7190,7 @@ static noinline int check_ref_cleanup(struct 
> btrfs_trans_handle *trans,
>   if (head->must_insert_reserved)
>   ret = 1;
>  
> - cleanup_ref_head_accounting(trans, head);
> + btrfs_cleanup_ref_head_accounting(trans->fs_info, delayed_refs, head);
>   mutex_unlock(>mutex);
>   btrfs_put_delayed_ref_head(head);
>   return ret;
> 


Re: [PATCH 26/35] btrfs: make btrfs_destroy_delayed_refs use btrfs_delete_ref_head

2018-08-31 Thread Nikolay Borisov



On 30.08.2018 20:42, Josef Bacik wrote:
> Instead of open coding this stuff use the helper instead.
> 
> Signed-off-by: Josef Bacik 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/disk-io.c | 7 +--
>  1 file changed, 1 insertion(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index c72ab2ca7627..1d3f5731d616 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -4232,12 +4232,7 @@ static int btrfs_destroy_delayed_refs(struct 
> btrfs_transaction *trans,
>   if (head->must_insert_reserved)
>   pin_bytes = true;
>   btrfs_free_delayed_extent_op(head->extent_op);
> - delayed_refs->num_heads--;
> - if (head->processing == 0)
> - delayed_refs->num_heads_ready--;
> - atomic_dec(_refs->num_entries);
> - rb_erase(>href_node, _refs->href_root);
> - RB_CLEAR_NODE(>href_node);
> + btrfs_delete_ref_head(delayed_refs, head);
>   spin_unlock(>lock);
>   spin_unlock(_refs->lock);
>   mutex_unlock(>mutex);
> 


Re: [PATCH 25/35] btrfs: make btrfs_destroy_delayed_refs use btrfs_delayed_ref_lock

2018-08-31 Thread Nikolay Borisov



On 30.08.2018 20:42, Josef Bacik wrote:
> We have this open coded in btrfs_destroy_delayed_refs, use the helper
> instead.
> 
> Signed-off-by: Josef Bacik 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/disk-io.c | 11 ++-
>  1 file changed, 2 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 11ea2ea7439e..c72ab2ca7627 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -4214,16 +4214,9 @@ static int btrfs_destroy_delayed_refs(struct 
> btrfs_transaction *trans,
>  
>   head = rb_entry(node, struct btrfs_delayed_ref_head,
>   href_node);
> - if (!mutex_trylock(>mutex)) {
> - refcount_inc(>refs);
> - spin_unlock(_refs->lock);
> -
> - mutex_lock(>mutex);
> - mutex_unlock(>mutex);
> - btrfs_put_delayed_ref_head(head);
> - spin_lock(_refs->lock);
> + if (btrfs_delayed_ref_lock(delayed_refs, head))
>   continue;
> - }
> +
>   spin_lock(>lock);
>   while ((n = rb_first(>ref_tree)) != NULL) {
>   ref = rb_entry(n, struct btrfs_delayed_ref_node,
> 


Re: [PATCH 24/35] btrfs: pass delayed_refs_root to btrfs_delayed_ref_lock

2018-08-31 Thread Nikolay Borisov



On 30.08.2018 20:42, Josef Bacik wrote:
> We don't need the trans except to get the delayed_refs_root, so just
> pass the delayed_refs_root into btrfs_delayed_ref_lock and call it a
> day.
> 
> Signed-off-by: Josef Bacik 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/delayed-ref.c | 5 +
>  fs/btrfs/delayed-ref.h | 2 +-
>  fs/btrfs/extent-tree.c | 2 +-
>  3 files changed, 3 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index 96ce087747b2..87778645bf4a 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -197,12 +197,9 @@ find_ref_head(struct rb_root *root, u64 bytenr,
>   return NULL;
>  }
>  
> -int btrfs_delayed_ref_lock(struct btrfs_trans_handle *trans,
> +int btrfs_delayed_ref_lock(struct btrfs_delayed_ref_root *delayed_refs,
>  struct btrfs_delayed_ref_head *head)
>  {
> - struct btrfs_delayed_ref_root *delayed_refs;
> -
> - delayed_refs = >transaction->delayed_refs;
>   lockdep_assert_held(_refs->lock);
>   if (mutex_trylock(>mutex))
>   return 0;
> diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
> index 7769177b489e..ee636d7a710a 100644
> --- a/fs/btrfs/delayed-ref.h
> +++ b/fs/btrfs/delayed-ref.h
> @@ -255,7 +255,7 @@ void btrfs_merge_delayed_refs(struct btrfs_trans_handle 
> *trans,
>  struct btrfs_delayed_ref_head *
>  btrfs_find_delayed_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
>   u64 bytenr);
> -int btrfs_delayed_ref_lock(struct btrfs_trans_handle *trans,
> +int btrfs_delayed_ref_lock(struct btrfs_delayed_ref_root *delayed_refs,
>  struct btrfs_delayed_ref_head *head);
>  static inline void btrfs_delayed_ref_unlock(struct btrfs_delayed_ref_head 
> *head)
>  {
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index fc30ff96f0d6..32579221d900 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2591,7 +2591,7 @@ static noinline int __btrfs_run_delayed_refs(struct 
> btrfs_trans_handle *trans,
>  
>   /* grab the lock that says we are going to process
>* all the refs for this head */
> - ret = btrfs_delayed_ref_lock(trans, locked_ref);
> + ret = btrfs_delayed_ref_lock(delayed_refs, locked_ref);
>   spin_unlock(_refs->lock);
>   /*
>* we may have dropped the spin lock to get the head
> 


Re: [PATCH 22/35] btrfs: make sure we create all new bgs

2018-08-31 Thread Nikolay Borisov



On 30.08.2018 20:42, Josef Bacik wrote:
> We can actually allocate new chunks while we're creating our bg's, so
> instead of doing list_for_each_safe, just do while (!list_empty()) so we
> make sure to catch any new bg's that get added to the list.

HOw can this occur, please elaborate and put an example callstack in the
commit log.

> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/extent-tree.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index ca98c39308f6..fc30ff96f0d6 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -10331,7 +10331,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info 
> *info)
>  void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
>  {
>   struct btrfs_fs_info *fs_info = trans->fs_info;
> - struct btrfs_block_group_cache *block_group, *tmp;
> + struct btrfs_block_group_cache *block_group;
>   struct btrfs_root *extent_root = fs_info->extent_root;
>   struct btrfs_block_group_item item;
>   struct btrfs_key key;
> @@ -10339,7 +10339,10 @@ void btrfs_create_pending_block_groups(struct 
> btrfs_trans_handle *trans)
>   bool can_flush_pending_bgs = trans->can_flush_pending_bgs;
>  
>   trans->can_flush_pending_bgs = false;
> - list_for_each_entry_safe(block_group, tmp, >new_bgs, bg_list) {
> + while (!list_empty(>new_bgs)) {
> + block_group = list_first_entry(>new_bgs,
> +struct btrfs_block_group_cache,
> +bg_list);
>   if (ret)
>   goto next;
>  
> 


Re: How to erase a RAID1 (+++)?

2018-08-31 Thread Pierre Couderc




On 08/31/2018 04:29 AM, Duncan wrote:

Chris Murphy posted on Thu, 30 Aug 2018 11:08:28 -0600 as excerpted:


My purpose is a simple RAID1 main fs, with bootable flag on the 2 disks
in prder to start in degraded mode

Good luck with this. The Btrfs archives are full of various limitations
of Btrfs raid1. There is no automatic degraded mount for Btrfs. And if
you persistently ask for degraded mount, you run the risk of other
problems if there's merely a delayed discovery of one of the devices.
Once a Btrfs volume is degraded, it does not automatically resume normal
operation just because the formerly missing device becomes available.

So... this is flat out not suitable for use cases where you need
unattended raid1 degraded boot.

Agreeing in general and adding some detail...

1) Are you intending to use an initr*?  I'm not sure the current status
(I actually need to test again for myself), but at least in the past,
booting a btrfs raid1 rootfs required an initr*, and I have and use one
here, for that purpose alone (until switching to btrfs raid1 root, I went
initr*-less, and would prefer that again, due to the complications of
maintaining an initr*).

The base problem is that with raid1 (or other forms of multi-device
btrfs, but it happens to be raid1 that's in question for both you and I)
the filesystem needs multiple devices to complete the filesystem and the
kernel's root= parameter takes only one.  When mounting after userspace
is up, a btrfs device scan is normally run (often automatically by udev)
before the mount, that lets btrfs in the kernel track what devices belong
to what filesystems, so pointing to just one of the devices is enough
because the kernel knows from that what filesystem is intended and can
match up the others that go with it from the earlier scan.

Now there's a btrfs mount option, device=/dev/*, that can be provided
more than once for additional devices, that can /normally/ be used to
tell the kernel what specific devices to use, bypassing the need for
btrfs device scan, and in /theory/, passing that like other mount options
in the kernel commandline via rootflags= /should/ "just work".

But for reasons I as a btrfs user (not dev, and definitely not kernel or
btrfs dev) don't fully understand, passing device= via rootflags= is, or
at least was, broken, so properly mounting a multi-device btrfs required
(and may still require) userspace, thus for a multi-device btrfs rootfs,
an initr*.

So direct-booting to a multi-device btrfs rootfs didn't normally work.
It would if you passed rootflags=degraded (at least with a two-device
raid1 so the one device passed in root= contained one copy of
everything), but then it was unclear if the additional device was
successfully added to the raid1 later, or not.  And with no automatic
sync and bringing back to undegraded status, it was a risk I didn't want
to take.  So unfortunately, initr* it was!

But I originally tested that when I setup my own btrfs raid1 rootfs very
long ago in kernel and btrfs terms, kernel 3.6 or so IIRC, and while I've
not /seen/ anything definitive on-list to suggest rootflags=device= is
unbroken now (I asked recently and got an affirmative reply, but I asked
for clarification and I've not seen it, tho perhaps it's there and I've
not read it yet), perhaps I missed it.  And I've not retested lately, tho
I really should as while I asked I guess the only real way to know is to
try it for myself, and it'd definitely be nice to be direct-booting
without having to bother with an initr*, again.

2) As both Chris and I alluded to, unlike say mdraid, btrfs doesn't (yet)
have an automatic mechanism to re-sync and "undegrade" after having been
mounted degraded,rw.  A btrfs scrub can be run to re-sync raid1 chunks,
but single chunks may have been added while in the degraded state as
well, and those need a balance convert to raid1 mode, before the
filesystem and data on it can be be considered reliably able to withstand
device loss once again.

In fact, while the problem has been fixed now, for quite awhile if the
filesystem was mounted degraded,rw, you often had exactly that one mount
to fix the problem, as new chunks would be written in single mode and
after that the filesystem would refuse to mount writable,degraded, and
would only let you mount degraded,ro, which would let you get data off it
but not let you fix the problem.  Word to the wise if you're planning on
running stable-debian (which tend to be older) kernels, or even just
trying to use them for recovery if you need to!  (The fix was to have a
mount check if at least one copy of all chunks were available and allow rw
mounting if so, instead of simply assuming that any single-mode chunks at
all meant some wouldn't be available on a multi-device filesystem with a
device missing, thus forcing read-only mode only mounting, as it used to
do.)

3) If a btrfs raid1 is mounted degraded,rw with one device missing, then
mounted again degraded,rw, with a different device 

Re: How to erase a RAID1 (+++)?

2018-08-31 Thread Pierre Couderc




On 08/30/2018 07:08 PM, Chris Murphy wrote:

On Thu, Aug 30, 2018 at 3:13 AM, Pierre Couderc  wrote:

Trying to install a RAID1 on a debian stretch, I made some mistake and got
this, after installing on disk1 and trying to add second disk  :


root@server:~# fdisk -l
Disk /dev/sda: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x2a799300

Device Boot StartEndSectors  Size Id Type
/dev/sda1  * 2048 3907028991 3907026944  1.8T 83 Linux


Disk /dev/sdb: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x9770f6fa

Device Boot StartEndSectors  Size Id Type
/dev/sdb1  * 2048 3907029167 3907027120  1.8T  5 Extended


Extended partition type is not a problem if you're using GRUB as the
bootloader; other bootloaders may not like this. Strictly speaking the
type code 0x05 is incorrect, GRUB ignores type code, as does the
kernel. GRUB also ignores the active bit (boot flag).




And :

root@server:~# btrfs fi show
Label: none  uuid: eed65d24-6501-4991-94bd-6c3baf2af1ed
 Total devices 2 FS bytes used 1.10GiB
 devid1 size 1.82TiB used 4.02GiB path /dev/sda1
 devid2 size 1.00KiB used 0.00B path /dev/sdb1

That's odd; and I know you've moved on from this problem but I would
have liked to see the super for /dev/sdb1 and also the installer log
for what commands were used for partitioning, including mkfs and
device add commands.

For what it's worth, 'btrfs dev add' formats the device being added,
it does not need to be formatted in advance, and also it resizes the
file system properly.
Thank you, in fact my system seems more and more broken, so I cannot go 
further..

For example, a simple
df gives me an IO error.
I prefer reinstall the whole.
But to avoid the same problems, is there somewhere an howto "install a 
basic RAID1 btrfs debian stretch system" ?
OK, I install strech on 1 disk and then how do I "format" and add the 
second disk ?




My purpose is a simple RAID1 main fs, with bootable flag on the 2 disks in
prder to start in degraded mode

Good luck with this. The Btrfs archives are full of various
limitations of Btrfs raid1. There is no automatic degraded mount for
Btrfs. And if you persistently ask for degraded mount, you run the
risk of other problems if there's merely a delayed discovery of one of
the devices. Once a Btrfs volume is degraded, it does not
automatically resume normal operation just because the formerly
missing device becomes available.

So... this is flat out not suitable for use cases where you need
unattended raid1 degraded boot.

Well, I understnad the lesson that there  is  no hope currently to boot 
a degraded system...

Thank you very much