Re: [PATCH 00/10] btrfs: Support for DAX devices

2018-12-05 Thread Jeff Mahoney

On 12/5/18 7:28 AM, Goldwyn Rodrigues wrote:

This adds support for DAX in btrfs. I understand there have been
previous attempts at it. However, I wanted to make sure copy-on-write
(COW) works on dax as well.

Before I present this to the FS folks I wanted to run this through the
btrfs list. Much as I wish I could, I cannot get it correct the first
time around :/. Here are some questions for which I need suggestions:

Questions:
1. I have been unable to do checksumming for DAX devices. While
checksumming can be done for reads and writes, it is a problem when mmap
is involved, because the btrfs kernel module does not get control back
after an mmap() write. Any ideas are appreciated, or we would have to set
nodatasum when dax is enabled.


Yep.  It has to be nodatasum, at least within the confines of datasum
today.  DAX mmap writes are essentially in the same situation as direct
i/o when another thread modifies the buffer being submitted, except
rather than being a race, it happens every time.  An alternative here
could be to add the ability to mark a crc as unreliable and then go back
and update them once the last DAX mmap reference is dropped on a range.
There's no reason to make this a requirement of the initial
implementation, though.



2. Currently, a user can continue writing to "old" extents of an mmapped file
after a snapshot has been created. How can we enforce writes to be directed
to new extents after snapshots have been created? Do we keep a list of
all mmap()s and re-mmap them after a snapshot?


It's the second question that's the hard part.  As Adam describes later, 
setting each pfn read-only will ensure page faults cause the remapping.
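
To make the problem concrete, a minimal userspace sequence that hits it
might look like the sketch below (the mount point and file path are
assumptions, not part of the patchset): keep a writable DAX mapping open
across a snapshot and store through it afterwards, and no page fault ever
gives btrfs a chance to COW the extent.

/* dax-snap-write.c: illustrative only */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/mnt/dax/foo";        /* assumed DAX-mounted btrfs */
	int fd = open(path, O_RDWR);
	if (fd < 0) { perror("open"); return 1; }

	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) { perror("mmap"); return 1; }

	memcpy(p, "before snapshot", 16);

	puts("snapshot now, e.g.: btrfs subvolume snapshot /mnt/dax /mnt/dax/snap");
	getchar();                                /* wait for the snapshot */

	/* No fault happens here, so the store lands in the old, now-shared
	 * extent instead of being redirected to a new one. */
	memcpy(p, "after  snapshot", 16);

	munmap(p, 4096);
	close(fd);
	return 0;
}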


The high level idea that Jan Kara and I came up with in our conversation 
at Labs conf is pretty expensive.  We'd need to set a flag that pauses 
new page faults, set the WP bit on affected ranges, do the snapshot, 
commit, clear the flag, and wake up the waiting threads.  Neither of us 
had any concrete idea of how well that would perform and it still 
depends on finding a good way to resolve all open mmap ranges on a 
subvolume.  Perhaps using the address_space->private_list anchored on 
each root would work.


-Jeff


Tested by creating a pmem device in RAM with the "memmap=2G!4G" kernel
command line parameter.


[PATCH 01/10] btrfs: create a mount option for dax
[PATCH 02/10] btrfs: basic dax read
[PATCH 03/10] btrfs: dax: read zeros from holes
[PATCH 04/10] Rename __endio_write_update_ordered() to
[PATCH 05/10] btrfs: Carve out btrfs_get_extent_map_write() out of
[PATCH 06/10] btrfs: dax write support
[PATCH 07/10] dax: export functions for use with btrfs
[PATCH 08/10] btrfs: dax add read mmap path
[PATCH 09/10] btrfs: dax support for cow_page/mmap_private and shared
[PATCH 10/10] btrfs: dax mmap write

  fs/btrfs/Makefile   |    1
  fs/btrfs/ctree.h    |   17 ++
  fs/btrfs/dax.c      |  303 ++--
  fs/btrfs/file.c     |   29
  fs/btrfs/inode.c    |   54 +
  fs/btrfs/ioctl.c    |    5
  fs/btrfs/super.c    |   15 ++
  fs/dax.c            |   35 --
  include/linux/dax.h |   16 ++
  9 files changed, 430 insertions(+), 45 deletions(-)




--
Jeff Mahoney
SUSE Labs



Re: [PATCH 00/10] btrfs: Support for DAX devices

2018-12-05 Thread Jeff Mahoney

On 12/5/18 8:03 AM, Qu Wenruo wrote:



On 2018/12/5 8:28 PM, Goldwyn Rodrigues wrote:

This adds support for DAX in btrfs. I understand there have been
previous attempts at it. However, I wanted to make sure copy-on-write
(COW) works on dax as well.

Before I present this to the FS folks I wanted to run this through the
btrfs list. Much as I wish I could, I cannot get it correct the first
time around :/. Here are some questions for which I need suggestions:

Questions:
1. I have been unable to do checksumming for DAX devices. While
checksumming can be done for reads and writes, it is a problem when mmap
is involved, because the btrfs kernel module does not get control back
after an mmap() write. Any ideas are appreciated, or we would have to set
nodatasum when dax is enabled.


I'm not familiar with DAX, so it's completely possible I'm talking like
an idiot.


The general idea is:

1) there is no page cache involved. read() and write() are like direct 
i/o writes in concept.  The user buffer is written directly (via what is 
essentially a specialized memcpy) to the NVDIMM.
2) for mmap, once the mapping is established and mapped, the file system 
is not involved.  The application writes directly to the memory as it 
would a normal mmap, except it's persistent.  All that's required to 
ensure persistence is a CPU cache flush.  The only way the file system 
is involved again is if some operation has occurred to reset the WP bit.
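
As a concrete illustration of that model, a minimal userspace sketch might
look like the following.  The path is an assumption, the cache flush shown
is x86-specific (msync() is the portable fallback), and MAP_SYNC is used
here to illustrate the DAX mmap model generally rather than anything this
particular patchset wires up.

/* dax-mmap-persist.c: illustrative only */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>
#include <immintrin.h>          /* _mm_clflush / _mm_sfence (x86) */

#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif

int main(void)
{
	int fd = open("/mnt/dax/log", O_RDWR);   /* assumed DAX mount */
	if (fd < 0) { perror("open"); return 1; }

	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		       MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
	if (p == MAP_FAILED) { perror("mmap"); return 1; }

	/* The store goes straight to the NVDIMM; the file system is not
	 * re-entered.  Flushing the affected cache line is all that is
	 * needed for persistence. */
	memcpy(p, "hello, persistent memory", 25);
	_mm_clflush(p);
	_mm_sfence();

	munmap(p, 4096);
	close(fd);
	return 0;
}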



If btrfs_page_mkwrite() can't provide enough control, then I have a
crazy idea.


It can't, because it is only invoked on the page fault path and we want 
to try to limit those as much as possible.



Forcing page fault for every mmap() read/write (completely disable page
cache like DIO).
So that we could get some control since we're informed to read the page
and do some hacks there.

There's no way to force a page fault for every mmap read/write.  Even if
there was, we wouldn't want that.  No user would turn that on when they 
can just make similar guarantees in their app (which are typically apps 
that do this already) and not pay any performance penalty.   The idea 
with DAX mmap is that the file system manages the namespace, space 
allocation, and permissions.  Otherwise we stay out of the way.


-Jeff
--
Jeff Mahoney
SUSE Labs



Re: [PATCH RESEND] btrfs: fix error handling in free_log_tree

2018-09-07 Thread Jeff Mahoney
On 9/7/18 8:00 AM, David Sterba wrote:
> On Thu, Sep 06, 2018 at 04:59:33PM -0400, je...@suse.com wrote:
>> From: Jeff Mahoney 
> 
> If this is a resend, I can't find the previous postings, same or similar
> subject.

I had tagged it as submitted in March, but I can't find any posting of
it either.

>> When we hit an I/O error in free_log_tree->walk_log_tree during file system
>> shutdown we can crash due to there not being a valid transaction handle.
>>
>> Use btrfs_handle_fs_error when there's no transaction handle to use.
>>
>> BUG: unable to handle kernel NULL pointer dereference at 0060
>> IP: free_log_tree+0xd2/0x140 [btrfs]
>> PGD 0 P4D 0
>> Oops:  [#1] SMP DEBUG_PAGEALLOC PTI
>> Modules linked in: 
>> CPU: 2 PID: 23544 Comm: umount Tainted: GW4.12.14-kvmsmall 
>> #9 SLE15 (unreleased)
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>> 1.0.0-prebuilt.qemu-project.org 04/01/2014
>> task: 96bfd3478880 task.stack: a7cf40d78000
>> RIP: 0010:free_log_tree+0xd2/0x140 [btrfs]
>> RSP: 0018:a7cf40d7bd10 EFLAGS: 00010282
>> RAX: fffb RBX: fffb RCX: 0002
>> RDX:  RSI: 96c02f07d4c8 RDI: 0282
>> RBP: 96c013cf1000 R08: 96c02f07d4c8 R09: 96c02f07d4d0
>> R10:  R11: 0002 R12: 
>> R13: 96c005e800c0 R14: a7cf40d7bdb8 R15: 
>> FS:  7f17856bcfc0() GS:96c03f60() knlGS:
>> CS:  0010 DS:  ES:  CR0: 80050033
>> CR2: 0060 CR3: 45ed6002 CR4: 003606e0
>> DR0:  DR1:  DR2: 
>> DR3:  DR6: fffe0ff0 DR7: 0400
>> Call Trace:
>>  ? wait_for_writer+0xb0/0xb0 [btrfs]
>>  btrfs_free_log+0x17/0x30 [btrfs]
>>  btrfs_drop_and_free_fs_root+0x9a/0xe0 [btrfs]
>>  btrfs_free_fs_roots+0xc0/0x130 [btrfs]
>>  ? wait_for_completion+0xf2/0x100
>>  close_ctree+0xea/0x2e0 [btrfs]
>>  ? kthread_stop+0x161/0x260
>>  generic_shutdown_super+0x6c/0x120
>>  kill_anon_super+0xe/0x20
>>  btrfs_kill_super+0x13/0x100 [btrfs]
>>  deactivate_locked_super+0x3f/0x70
>>  cleanup_mnt+0x3b/0x70
>>  task_work_run+0x78/0x90
>>  exit_to_usermode_loop+0x77/0xa6
>>  do_syscall_64+0x1c5/0x1e0
>>  entry_SYSCALL_64_after_hwframe+0x42/0xb7
>> RIP: 0033:0x7f1784f90827
>> RSP: 002b:7ffdeeb03118 EFLAGS: 0246 ORIG_RAX: 00a6
>> RAX:  RBX: 556a60c62970 RCX: 7f1784f90827
>> RDX: 0001 RSI:  RDI: 556a60c62b50
>> RBP:  R08: 0005 R09: 
>> R10: 556a60c63900 R11: 0246 R12: 556a60c62b50
>> R13: 7f17854a81c4 R14:  R15: 
>> Code: 65 a1 fd ff be 01 00 00 00 48 89 ef e8 58 a1 fd ff 48 8b 7d 00 e8 9f 
>> 33 fe ff 48 89 ef e8 17 6c d3 ed 48 83 c4 50 5b 5d 41 5c c3 <49> 8b 44 24 60 
>> f0 0f ba a8 80 65 01 00 02 72 23 83 fb fb 75 39
>> RIP: free_log_tree+0xd2/0x140 [btrfs] RSP: a7cf40d7bd10
>> CR2: 0060
>> ---[ end trace 3bc199fbf8fb4977 ]---
>>
>> Cc:  # v3.13
>> Fixes: 681ae50917df9 (Btrfs: cleanup reserved space when freeing tree log on 
>> error)
>> Signed-off-by: Jeff Mahoney 
> 
> Reviewed-by: David Sterba 
> 
>> ---
>>  fs/btrfs/tree-log.c | 9 ++---
>>  1 file changed, 6 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
>> index f8220ec02036..a5f6971a125f 100644
>> --- a/fs/btrfs/tree-log.c
>> +++ b/fs/btrfs/tree-log.c
>> @@ -3143,9 +3143,12 @@ static void free_log_tree(struct btrfs_trans_handle 
>> *trans,
>>  };
>>  
>>  ret = walk_log_tree(trans, log, &wc);
>> -/* I don't think this can happen but just in case */
> 
> Example of a very useful comment :)

Heh, I think it was true when the comment was added. :)

-Jeff

-- 
Jeff Mahoney
SUSE Labs






Re: [COMMAND HANGS] The command 'btrfs subvolume sync -s 2 xyz' can hangs.

2018-08-09 Thread Jeff Mahoney
On 8/9/18 11:15 AM, Giuseppe Della Bianca wrote:
> Hi.
> 
> My system: 
> - Fedora 28 x86_64
> - kernel-4.17.7-200
> - btrfs-progs-4.15.1-1
> 
> The command 'btrfs subvolume sync -s 2 xyz' hangs in this case:
> 
> - Run command 'btrfs subvolume sync -s 2 xyz' .
> - After some time the kernel reports an error on the filesystem.
>   (error that existed before the command was launched.)
> - The filesystem goes in read-only mode.
> - The command hangs.

Can you provide the output of 'dmesg' and the contents of
/proc/<pid>/stack, where <pid> is the pid of the btrfs command process?

-Jeff
-- 
Jeff Mahoney
SUSE Labs






Re: [PATCH] btrfs: Fix a C compliance issue

2018-06-20 Thread Jeff Mahoney
On 6/20/18 12:55 PM, David Sterba wrote:
> On Wed, Jun 20, 2018 at 04:44:54PM +, Bart Van Assche wrote:
>> On Mon, 2018-06-18 at 12:31 +0300, Nikolay Borisov wrote:
>>> On 18.06.2018 12:26, David Sterba wrote:
>>>> On Sat, Jun 16, 2018 at 01:28:13PM +0300, Nikolay Borisov wrote:
>>>>> I'd rather not see more printk being added. Nothing prevents from having
>>>>> the fmt string being passed to pr_info.
>>>>
>>>> So you mean to do
>>>>
>>>> +  static const char fmt[] = "Btrfs loaded, crc32c=%s"
>>>> +  pr_info(fmt);
>>>
>>> Pretty much, something along the lines of
>>>
>>> pr_info(fmt, crc32c_impl).
>>>
>>> printk requires having the KERN_INFO in the format string, which I see
>>> no point in doing, correct me if I'm wrong?
>>
>> You should know that what you proposed doesn't compile because pr_info()
>> relies on string concatenation and hence requires that its first argument is
>> a string constant instead of a const char pointer. Anyway, I will rework this
>> patch such that it uses pr_info() instead of printk().
> 
> Right, the pr_info(fmt,...) does not compile. The closest version I got to is
> below. It does not look pretty, but I can't think of a better version right
> now.
> 
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -2369,7 +2369,8 @@ static __cold void btrfs_interface_exit(void)
>  
>  static void __init btrfs_print_mod_info(void)
>  {
> -   static const char fmt[] = KERN_INFO "Btrfs loaded, crc32c=%s"
> +   static const char fmt1[] = "Btrfs loaded, crc32c=";
> +   static const char fmt2[] =
>  #ifdef CONFIG_BTRFS_DEBUG
> ", debug=on"
>  #endif
> @@ -2383,7 +2384,7 @@ static void __init btrfs_print_mod_info(void)
> ", ref-verify=on"
>  #endif
> "\n";
> -   printk(fmt, crc32c_impl());
> +   pr_info("%s%s%s", fmt1, crc32c_impl(), fmt2);
>  }
>  
>  static int __init init_btrfs_fs(void)

The shed should be yellow.

-Jeff

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 891cd2ed5dd4..57c9da0b459f 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2375,21 +2375,20 @@ static __cold void btrfs_interface_exit(void)

 static void __init btrfs_print_mod_info(void)
 {
-   pr_info("Btrfs loaded, crc32c=%s"
+   pr_info("Btrfs loaded, crc32c=%s", crc32c_impl());
 #ifdef CONFIG_BTRFS_DEBUG
-   ", debug=on"
+   pr_cont(", debug=on");
 #endif
 #ifdef CONFIG_BTRFS_ASSERT
-   ", assert=on"
+   pr_cont(", assert=on");
 #endif
 #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
-   ", integrity-checker=on"
+   pr_cont(", integrity-checker=on");
 #endif
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
-   ", ref-verify=on"
+   pr_cont(", ref-verify=on");
 #endif
-   "\n",
-   crc32c_impl());
+   pr_cont("\n");
 }

 static int null_open(struct block_device *bdev, fmode_t mode)



-- 
Jeff Mahoney
SUSE Labs


Re: [PATCH 1/3] btrfs: Remove fs_info argument from __btrfs_inc_extent_ref

2018-06-19 Thread Jeff Mahoney
On 6/18/18 7:59 AM, Nikolay Borisov wrote:
> This function already takes a transaction which holds a reference to
> the fs_info struct. Use that reference and remove the extra arg. No
> functional changes.

I like the idea here.  I wasn't sold at first, but I think if we can
standardize on taking only a trans handle when one is required and both
a trans and fs_info when it's optional, it'll make the code clearer.
This cleanup can percolate up the stack to cover pretty much all of
delayed refs.

Reviewed-by: Jeff Mahoney 

> Signed-off-by: Nikolay Borisov 
> ---
>  fs/btrfs/extent-tree.c | 15 ++-
>  1 file changed, 6 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 4850e538ab10..59645ced6fbc 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2208,12 +2208,12 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle 
> *trans,
>  }
>  
>  static int __btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
> -   struct btrfs_fs_info *fs_info,
> struct btrfs_delayed_ref_node *node,
> u64 parent, u64 root_objectid,
> u64 owner, u64 offset, int refs_to_add,
> struct btrfs_delayed_extent_op *extent_op)
>  {
> + struct btrfs_fs_info *fs_info = trans->fs_info;
>   struct btrfs_path *path;
>   struct extent_buffer *leaf;
>   struct btrfs_extent_item *item;
> @@ -2297,10 +2297,9 @@ static int run_delayed_data_ref(struct 
> btrfs_trans_handle *trans,
>ref->objectid, ref->offset,
>, node->ref_mod);
>   } else if (node->action == BTRFS_ADD_DELAYED_REF) {
> - ret = __btrfs_inc_extent_ref(trans, fs_info, node, parent,
> -  ref_root, ref->objectid,
> -  ref->offset, node->ref_mod,
> -  extent_op);
> + ret = __btrfs_inc_extent_ref(trans, node, parent, ref_root,
> +  ref->objectid, ref->offset,
> +  node->ref_mod, extent_op);
>   } else if (node->action == BTRFS_DROP_DELAYED_REF) {
>   ret = __btrfs_free_extent(trans, fs_info, node, parent,
> ref_root, ref->objectid,
> @@ -2450,10 +2449,8 @@ static int run_delayed_tree_ref(struct 
> btrfs_trans_handle *trans,
>   BUG_ON(!extent_op || !extent_op->update_flags);
>   ret = alloc_reserved_tree_block(trans, node, extent_op);
>   } else if (node->action == BTRFS_ADD_DELAYED_REF) {
> - ret = __btrfs_inc_extent_ref(trans, fs_info, node,
> -  parent, ref_root,
> -  ref->level, 0, 1,
> -  extent_op);
> + ret = __btrfs_inc_extent_ref(trans, node, parent, ref_root,
> +  ref->level, 0, 1, extent_op);
>   } else if (node->action == BTRFS_DROP_DELAYED_REF) {
>   ret = __btrfs_free_extent(trans, fs_info, node,
> parent, ref_root,
> 


-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH] btrfs: Document __btrfs_inc_extent_ref

2018-06-14 Thread Jeff Mahoney
On 5/18/18 9:12 AM, Nikolay Borisov wrote:
> Here is a doc-only patch which tries to deobfuscate the terra incognita
> that the arguments for delayed refs are.
> 
> Signed-off-by: Nikolay Borisov 
> ---
> Hello, 
> 
> This patch needs reviewing since I'm not entirely sure I managed to capture
> the semantics of the "parent" and "owner" arguments. Specifically, parent is
> passed "ref->parent" only if we have a shared block; however, looking at the
> code where parent is passed to the add delayed tree/data refs, it seems that
> parent is set to eb->start only if the tree where this extent comes from is
> the data_reloc_tree, i.e. the code in
> replace_path/replace_file_extents/do_relocation/__btrfs_cow_block.

This makes sense.  When we create a new extent or add a reference to an
existing one, we have a root to use as a reference.  Those references
are created as indirect references.  When a reference is dropped and the
remaining references are either shared or implied from higher levels in
the tree, we need to create a shared reference.

The exception is relocation which is creating new extents in a literal
sense, but is really just moving them.  We only have the references
already in place to work with, so we'll need to insert updated shared
references to point to the new parent.

> For the "owner" argument, in the case of data extents it's set to the ino; for
> metadata extents it's a bit trickier. What I think it always contains for such
> extents is the level of the parent block in the tree.

It contains the level of the buffer in the tree.

__btrfs_cow_block():

level = btrfs_header_level(buf);

if (level == 0)
	btrfs_item_key(buf, &disk_key, 0);
else
	btrfs_node_key(buf, &disk_key, 0);

Leaf node extents show up in the tree dump as (objectid METADATA_ITEM 0).

If you're referring to the following comment, it was introduced in the
commit that added skinny metadata (3173a18f70554).
/*
 * Owner is our parent level, so we can just add one to get the level
 * for the block we are interested in.
 */

I think that Josef was maybe documenting why just using the level was
safe instead of using the full key + level as the old tree block refs
did.  It could probably be made more clear if that's the case.

> If this function is documented correctly then it wil be fairly trivial to 
> document btrfs_add_delayed(tree|data)_ref ones as well. 
> 
>  fs/btrfs/extent-tree.c | 29 +
>  1 file changed, 29 insertions(+)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 2ce32f05812f..5a2f4a86dc71 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2207,6 +2207,35 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle 
> *trans,
>   return ret;
>  }
>  
> +/*
> + * __btrfs_inc_extent_ref - insert backreference for a given extent
> + *
> + * @trans:   Handle of transaction
> + *
> + * @node:The delayed ref node used to get the bytenr/length for
> + *   extent
> + *
> + * @parent:  If this is a shared extent (BTRFS_SHARED_DATA_REF_KEY/
> + *   BTRFS_SHARED_BLOCK_REF_KEY) then parent *may* hold the
> + *   logical bytenr of the parent block.

If this is a shared extent then parent holds the logical bytenr of the
parent block.  Since new extents are always created with indirect
references, this will only be the case when relocating a shared extent.
In that case, root_objectid will be BTRFS_TREE_RELOC_OBJECTID.
Otherwise, parent must be 0.

> + * @root_objectid: The id of the root where this modification has originated,
> + *   this can be either one of the well-known metadata trees or
> + *   the subvolume id which references this extent.
> + *
> + * @owner:   For data extents it is the inode number of the owning file.
> + *   For metadata extents this parameter holds the level in the
> + *   tree of the parent block.
>

As mentioned above, it holds the level of the tree that contains the
block.  The parent can be looked up indirectly by taking this level and
adding 1 until we hit the level of the root node.

> + * @offset:  For metadata extents this is always 0. For data extents it
> + *   is the fileoffset this extent belongs to.

For metadata extents, offset is ignored.  It just happens to be passed
as 0 in existing code.
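
Putting those corrections together, one possible wording for the comment
(just a sketch of how the above might read, not a submitted patch):

/*
 * __btrfs_inc_extent_ref - insert backreference for a given extent
 *
 * @trans:	    Handle of transaction
 *
 * @node:	    The delayed ref node used to get the bytenr/length for the
 *		    extent
 *
 * @parent:	    If this is a shared extent (BTRFS_SHARED_DATA_REF_KEY/
 *		    BTRFS_SHARED_BLOCK_REF_KEY) then it holds the logical
 *		    bytenr of the parent block.  Since new extents are always
 *		    created with indirect references, this only happens when
 *		    relocating a shared extent, in which case @root_objectid
 *		    will be BTRFS_TREE_RELOC_OBJECTID.  Otherwise it must be 0.
 *
 * @root_objectid:  The id of the root where this modification has originated,
 *		    either one of the well-known metadata trees or the
 *		    subvolume id which references this extent.
 *
 * @owner:	    For data extents it is the inode number of the owning file.
 *		    For metadata extents it is the level of the tree that
 *		    contains the block.
 *
 * @offset:	    For data extents it is the file offset this extent belongs
 *		    to.  For metadata extents it is ignored (it just happens
 *		    to be passed as 0 in existing code).
 */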

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH RFC] btrfs: scrub: Don't use inode pages for device replace

2018-05-31 Thread Jeff Mahoney
nocow_ctx->mirror_num = mirror_num;
>   nocow_ctx->physical_for_dev_replace = physical_for_dev_replace;
> +
> + nocow_ctx->fb_physical = physical;
> + nocow_ctx->fb_gen = gen;
> + nocow_ctx->fb_dev = dev;
> +
>   btrfs_init_work(_ctx->work, btrfs_scrubnc_helper,
>   copy_nocow_pages_worker, NULL, NULL);
>   INIT_LIST_HEAD(_ctx->inodes);
> @@ -4487,7 +4499,7 @@ static void copy_nocow_pages_worker(struct btrfs_work 
> *work)
>  }
>  
>  static int check_extent_to_block(struct btrfs_inode *inode, u64 start, u64 
> len,
> -  u64 logical)
> +  u64 logical, int *compressed)
>  {
>   struct extent_state *cached_state = NULL;
>   struct btrfs_ordered_extent *ordered;
> @@ -4523,6 +4535,7 @@ static int check_extent_to_block(struct btrfs_inode 
> *inode, u64 start, u64 len,
>   ret = 1;
>   goto out_unlock;
>   }
> + *compressed = em->compress_type;
>   free_extent_map(em);
>  
>  out_unlock:
> @@ -4543,6 +4556,7 @@ static int copy_nocow_pages_for_inode(u64 inum, u64 
> offset, u64 root,
>   u64 nocow_ctx_logical;
>   u64 len = nocow_ctx->len;
>   unsigned long index;
> + int compressed;
>   int srcu_index;
>   int ret = 0;
>   int err = 0;
> @@ -4576,12 +4590,20 @@ static int copy_nocow_pages_for_inode(u64 inum, u64 
> offset, u64 root,
>   nocow_ctx_logical = nocow_ctx->logical;
>  
>   ret = check_extent_to_block(BTRFS_I(inode), offset, len,
> - nocow_ctx_logical);
> + nocow_ctx_logical, );
>   if (ret) {
>   ret = ret > 0 ? 0 : ret;
>   goto out;
>   }
>  
> + /*
> +  * We hit the damn nodatasum compressed extent, we can't use any page
> +  * from inode as they are all *UNCOMPRESSED*.
> +  * We fall back to scrub_pages() for such case.
> +  */
> + if (compressed)
> + goto fallback;
> +
>   while (len >= PAGE_SIZE) {
>   index = offset >> PAGE_SHIFT;
>  again:
> @@ -4624,11 +4646,16 @@ static int copy_nocow_pages_for_inode(u64 inum, u64 
> offset, u64 root,
>   }
>  
>   ret = check_extent_to_block(BTRFS_I(inode), offset, len,
> - nocow_ctx_logical);
> + nocow_ctx_logical, );
>   if (ret) {
>   ret = ret > 0 ? 0 : ret;
>   goto next_page;
>   }
> + if (compressed) {
> + unlock_page(page);
> + put_page(page);
> + goto fallback;
> + }
>  
>   err = write_page_nocow(nocow_ctx->sctx,
>  physical_for_dev_replace, page);
> @@ -4651,6 +4678,19 @@ static int copy_nocow_pages_for_inode(u64 inum, u64 
> offset, u64 root,
>   inode_unlock(inode);
>   iput(inode);
>   return ret;
> +
> +fallback:
> + inode_unlock(inode);
> + iput(inode);
> +
> + ret = scrub_pages(nocow_ctx->sctx, nocow_ctx->logical,
> +   nocow_ctx->len, nocow_ctx->fb_physical,
> +   nocow_ctx->fb_dev, BTRFS_EXTENT_FLAG_DATA,
> +   nocow_ctx->fb_gen, nocow_ctx->mirror_num,
> +   NULL, 0, physical_for_dev_replace);
> + if (!ret)
> + ret = COPY_COMPLETE;
> + return ret;
>  }
>  
>  static int write_page_nocow(struct scrub_ctx *sctx,
> 


-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH v2 0/3] btrfs: add read mirror policy

2018-05-17 Thread Jeff Mahoney
On 5/17/18 8:25 AM, Austin S. Hemmelgarn wrote:
> On 2018-05-16 22:32, Anand Jain wrote:
>>
>>
>> On 05/17/2018 06:35 AM, David Sterba wrote:
>>> On Wed, May 16, 2018 at 06:03:56PM +0800, Anand Jain wrote:
>>>> Not yet ready for the integration. As I need to introduce
>>>> -o no_read_mirror_policy instead of -o read_mirror_policy=-
>>>
>>> Mount option is mostly likely not the right interface for setting such
>>> options, as usual.
>>
>>   I am ok to make it ioctl for the final. What do you think?
>>
>>
>>   But to reproduce the bug posted in
>>     Btrfs: fix the corruption by reading stale btree blocks
>>   It needs to be a mount option, as randomly the pid can
>>   still pick the disk specified in the mount option.
>>
> Personally, I'd vote for filesystem property (thus handled through the
> standard `btrfs property` command) that can be overridden by a mount
> option.  With that approach, no new tool (or change to an existing tool)
> would be needed, existing volumes could be converted to use it in a
> backwards compatible manner (old kernels would just ignore the
> property), and you could still have the behavior you want in tests (and
> in theory it could easily be adapted to be a per-subvolume setting if we
> ever get per-subvolume chunk profile support).

Properties are a combination of interfaces presented through a single
command.  Although the kernel API would allow a direct-to-property
interface via the btrfs.* extended attributes, those are currently
limited to a single inode.  The label property is set via ioctl and
stored in the superblock.  The read-only subvolume property is also set
by ioctl but stored in the root flags.
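
For reference, both of those ioctl-backed properties are reachable from a
trivial userspace sketch like the one below (the mount point and subvolume
paths are assumptions):

/* btrfs-props.c: illustrative only */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/btrfs.h>   /* BTRFS_IOC_GET_FSLABEL, BTRFS_IOC_SUBVOL_GETFLAGS */

int main(void)
{
	char label[BTRFS_LABEL_SIZE] = { 0 };
	__u64 flags = 0;
	int fd;

	/* The label property lives in the superblock and is read via ioctl. */
	fd = open("/mnt/btrfs", O_RDONLY);          /* assumed mount point */
	if (fd >= 0) {
		if (ioctl(fd, BTRFS_IOC_GET_FSLABEL, label) == 0)
			printf("label: %s\n", label);
		close(fd);
	}

	/* The read-only property of a subvolume is stored in the root flags. */
	fd = open("/mnt/btrfs/subvol", O_RDONLY);   /* assumed subvolume */
	if (fd >= 0) {
		if (ioctl(fd, BTRFS_IOC_SUBVOL_GETFLAGS, &flags) == 0)
			printf("read-only: %s\n",
			       (flags & BTRFS_SUBVOL_RDONLY) ? "yes" : "no");
		close(fd);
	}
	return 0;
}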

As it stands, every property is explicitly defined in the tools, so any
addition would require tools changes.  This is a bigger discussion,
though.  We *could* use the xattr interface to access per-root or
fs-global properties, but we'd need to define that interface.
btrfs_listxattr could get interesting, though I suppose we could
simplify it by only allowing the per-subvolume and fs-global operations
on root inodes.
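
And the existing per-inode btrfs.* xattr interface mentioned above already
works for a single file today; a minimal sketch (the path is an assumption):

/* btrfs-xattr-prop.c: illustrative only */
#include <stdio.h>
#include <sys/xattr.h>

int main(void)
{
	const char *path = "/mnt/btrfs/somefile";   /* assumed file on btrfs */
	char value[16] = { 0 };

	/* Set the per-inode compression property via the btrfs.* namespace... */
	if (setxattr(path, "btrfs.compression", "zlib", 4, 0) != 0)
		perror("setxattr");

	/* ...and read it back. */
	ssize_t len = getxattr(path, "btrfs.compression", value, sizeof(value) - 1);
	if (len >= 0)
		printf("btrfs.compression = %s\n", value);
	else
		perror("getxattr");
	return 0;
}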

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH v2 0/3] btrfs: add read mirror policy

2018-05-17 Thread Jeff Mahoney
On 5/16/18 6:35 PM, David Sterba wrote:
> On Wed, May 16, 2018 at 06:03:56PM +0800, Anand Jain wrote:
>> Not yet ready for the integration. As I need to introduce
>> -o no_read_mirror_policy instead of -o read_mirror_policy=-
> 
> Mount option is mostly likely not the right interface for setting such
> options, as usual.


I've seen a few alternate suggestions in the thread.  I suppose the real
question is: what and where is the intended persistence for this choice?

A mount option gets it via fstab.  How would a user be expected to set
it consistently via ioctl on each mount?  Properties could work, but
there's more discussion needed there.  Personally, I like the property
idea since it could conceivably be used on a per-file basis.

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH] btrfs: qgroup: Search commit root for rescan to avoid missing extent

2018-05-11 Thread Jeff Mahoney
On 5/3/18 3:20 AM, Qu Wenruo wrote:
> When doing qgroup rescan using the following script (modified from
> btrfs/017 test case), we can sometimes hit qgroup corruption.
> 
> --
> umount $dev &> /dev/null
> umount $mnt &> /dev/null
> 
> mkfs.btrfs -f -n 64k $dev
> mount $dev $mnt
> 
> extent_size=8192
> 
> xfs_io -f -d -c "pwrite 0 $extent_size" $mnt/foo > /dev/null
> btrfs subvolume snapshot $mnt $mnt/snap
> 
> xfs_io -f -c "reflink $mnt/foo" $mnt/foo-reflink > /dev/null
> xfs_io -f -c "reflink $mnt/foo" $mnt/snap/foo-reflink > /dev/null
> xfs_io -f -c "reflink $mnt/foo" $mnt/snap/foo-reflink2 > /dev/null
> btrfs quota enable $mnt
> 
>  # -W is the new option to only wait rescan while not starting new one
> btrfs quota rescan -W $mnt
> btrfs qgroup show -prce $mnt
> 
>  # Need to patch btrfs-progs to report qgroup mismatch as error
> btrfs check $dev || _fail
> --
> 
> For fast machine, we can hit some corruption which missed accounting
> tree blocks:
> --
> qgroupid         rfer         excl     max_rfer     max_excl parent  child
> --------         ----         ----     --------     -------- ------  -----
> 0/5           8.00KiB        0.00B         none         none ---     ---
> 0/257         8.00KiB        0.00B         none         none ---     ---
> --
> 
> This is due to the fact that we're always searching commit root for
> btrfs_find_all_roots() at qgroup_rescan_leaf(), but the leaf we get is
> from current transaction, not commit root.
> 
> And if our tree blocks get modified in current transaction, we won't
> find any owner in commit root, thus causing the corruption.
> 
> Fix it by searching commit root for extent tree for
> qgroup_rescan_leaf().
> 
> Reported-by: Nikolay Borisov <nbori...@suse.com>
> Signed-off-by: Qu Wenruo <w...@suse.com>
> ---
> 
> Please keep in mind that it is possible to hit another type of race
> which double accounting tree blocks:
> --
> qgroupid         rfer         excl     max_rfer     max_excl parent  child
> --------         ----         ----     --------     -------- ------  -----
> 0/5         136.00KiB    128.00KiB         none         none ---     ---
> 0/257       136.00KiB    128.00KiB         none         none ---     ---
> --
> For this type of corruption, this patch could reduce the possibility,
> but the root cause is race between transaction commit and qgroup rescan,
> which needs to be addressed in another patch.
> ---
>  fs/btrfs/qgroup.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index 4baa4ba2d630..829e8fe5c97e 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -2681,6 +2681,11 @@ static void btrfs_qgroup_rescan_worker(struct 
> btrfs_work *work)
>   path = btrfs_alloc_path();
>   if (!path)
>   goto out;
> + /*
> +  * Rescan should only search for commit root, and any later difference
> +  * should be recorded by qgroup
> +  */
> + path->search_commit_root = 1;
>  
>   err = 0;
>   while (!err && !btrfs_fs_closing(fs_info)) {
> 

If we're searching the commit root here, do we need the tree mod
sequence number dance in qgroup_rescan_leaf anymore?

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH] btrfs: qgroup: Finish rescan when hit the last leaf of extent tree

2018-05-11 Thread Jeff Mahoney
be scanning any
> +  * further blocks. We cannot unset the RESCAN flag here, because
> +  * we want to commit the transaction if everything went well.
> +  * To make the live accounting work in this phase, we set our
> +  * scan progress pointer such that every real extent objectid
> +  * will be smaller.
> +  */
> + fs_info->qgroup_rescan_progress.objectid = (u64)-1;
> + btrfs_release_path(path);
> + mutex_unlock(_info->qgroup_rescan_lock);
> + return 1;
>  }
>  
>  static void btrfs_qgroup_rescan_worker(struct btrfs_work *work)
> 


-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH] btrfs: push relocation recovery into a helper thread

2018-05-10 Thread Jeff Mahoney
On 4/17/18 2:45 PM, Jeff Mahoney wrote:
> On a file system with many snapshots and qgroups enabled, an interrupted
> balance can end up taking a long time to mount due to recovering the
> relocations during mount.  It does this in the task performing the mount,
> which can't be interrupted and may prevent mounting (and systems booting)
> for a long time as well.  The thing is that as part of balance, this
> runs in the background all the time.  This patch pushes the recovery
> into a helper thread and allows the mount to continue normally.  We hold
> off on resuming any paused balance operation until after the relocation
> has completed, disallow any new balance operations if it's running, and
> wait for it on umount and remounting read-only.
> 
> This doesn't do anything to address the relocation recovery operation
> taking a long time but does allow the file system to mount.

I'm abandoning this patch.  The right fix is to fix qgroups.  The
workload that I was targeting takes seconds to recover without qgroups
in the way.

-Jeff

> Signed-off-by: Jeff Mahoney <je...@suse.com>
> ---
>  fs/btrfs/ctree.h  |7 +++
>  fs/btrfs/disk-io.c|7 ++-
>  fs/btrfs/relocation.c |   92 
> +-
>  fs/btrfs/super.c  |5 +-
>  fs/btrfs/volumes.c|6 +++
>  5 files changed, 97 insertions(+), 20 deletions(-)
> 
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1052,6 +1052,10 @@ struct btrfs_fs_info {
>   struct btrfs_work qgroup_rescan_work;
>   bool qgroup_rescan_running; /* protected by qgroup_rescan_lock */
>  
> + /* relocation recovery items */
> + bool relocation_recovery_started;
> + struct completion relocation_recovery_completion;
> +
>   /* filesystem state */
>   unsigned long fs_state;
>  
> @@ -3671,7 +3675,8 @@ int btrfs_init_reloc_root(struct btrfs_t
> struct btrfs_root *root);
>  int btrfs_update_reloc_root(struct btrfs_trans_handle *trans,
>   struct btrfs_root *root);
> -int btrfs_recover_relocation(struct btrfs_root *root);
> +int btrfs_recover_relocation(struct btrfs_fs_info *fs_info);
> +void btrfs_wait_for_relocation_completion(struct btrfs_fs_info *fs_info);
>  int btrfs_reloc_clone_csums(struct inode *inode, u64 file_pos, u64 len);
>  int btrfs_reloc_cow_block(struct btrfs_trans_handle *trans,
> struct btrfs_root *root, struct extent_buffer *buf,
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2999,7 +2999,7 @@ retry_root_backup:
>   goto fail_qgroup;
>  
>   mutex_lock(_info->cleaner_mutex);
> - ret = btrfs_recover_relocation(tree_root);
> + ret = btrfs_recover_relocation(fs_info);
>   mutex_unlock(_info->cleaner_mutex);
>   if (ret < 0) {
>   btrfs_warn(fs_info, "failed to recover relocation: %d",
> @@ -3017,7 +3017,8 @@ retry_root_backup:
>   if (IS_ERR(fs_info->fs_root)) {
>   err = PTR_ERR(fs_info->fs_root);
>   btrfs_warn(fs_info, "failed to read fs tree: %d", err);
> - goto fail_qgroup;
> + close_ctree(fs_info);
> + return err;
>   }
>  
>   if (sb_rdonly(sb))
> @@ -3778,6 +3779,8 @@ void close_ctree(struct btrfs_fs_info *f
>   /* wait for the qgroup rescan worker to stop */
>   btrfs_qgroup_wait_for_completion(fs_info, false);
>  
> + btrfs_wait_for_relocation_completion(fs_info);
> +
>   /* wait for the uuid_scan task to finish */
>   down(_info->uuid_tree_rescan_sem);
>   /* avoid complains from lockdep et al., set sem back to initial state */
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -22,6 +22,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "ctree.h"
>  #include "disk-io.h"
>  #include "transaction.h"
> @@ -4492,14 +4493,61 @@ static noinline_for_stack int mark_garba
>  }
>  
>  /*
> - * recover relocation interrupted by system crash.
> - *
>   * this function resumes merging reloc trees with corresponding fs trees.
>   * this is important for keeping the sharing of tree blocks
>   */
> -int btrfs_recover_relocation(struct btrfs_root *root)
> +static int
> +btrfs_resume_relocation(void *data)
>  {
> - struct btrfs_fs_info *fs_info = root->fs_info;
> + struct btrfs_fs_info *fs_info = data;
> + struct btrfs_trans_handle *trans;
> + struct reloc_control *rc = fs_info->reloc_ctl;
> + int err, ret;
> +
> + btrfs_info(fs_info, "resuming relocation&q

Re: [PATCH 1/3] btrfs: qgroups, fix rescan worker running races

2018-05-10 Thread Jeff Mahoney
On 5/2/18 5:11 PM, je...@suse.com wrote:
> From: Jeff Mahoney <je...@suse.com>
> 
> Commit 8d9eddad194 (Btrfs: fix qgroup rescan worker initialization)
> fixed the issue with BTRFS_IOC_QUOTA_RESCAN_WAIT being racy, but
> ended up reintroducing the hang-on-unmount bug that the commit it
> intended to fix addressed.
> 
> The race this time is between qgroup_rescan_init setting
> ->qgroup_rescan_running = true and the worker starting.  There are
> many scenarios where we initialize the worker and never start it.  The
> completion btrfs_ioctl_quota_rescan_wait waits for will never come.
> This can happen even without involving error handling, since mounting
> the file system read-only returns between initializing the worker and
> queueing it.
> 
> The right place to do it is when we're queuing the worker.  The flag
> really just means that btrfs_ioctl_quota_rescan_wait should wait for
> a completion.
> 
> Since the BTRFS_QGROUP_STATUS_FLAG_RESCAN flag is overloaded to
> refer to both runtime behavior and on-disk state, we introduce a new
> fs_info->qgroup_rescan_ready to indicate that we're initialized and
> waiting to start.
> 
> This patch introduces a new helper, queue_rescan_worker, that handles
> most of the initialization, the two flags, and queuing the worker,
> including races with unmount.
> 
> While we're at it, ->qgroup_rescan_running is protected only by the
> ->qgroup_rescan_mutex.  btrfs_ioctl_quota_rescan_wait doesn't need
> to take the spinlock too.
> 
> Fixes: 8d9eddad194 (Btrfs: fix qgroup rescan worker initialization)
> Signed-off-by: Jeff Mahoney <je...@suse.com>
> ---
>  fs/btrfs/ctree.h  |  2 ++
>  fs/btrfs/qgroup.c | 94 
> +--
>  2 files changed, 58 insertions(+), 38 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index da308774b8a4..4003498bb714 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1045,6 +1045,8 @@ struct btrfs_fs_info {
>   struct btrfs_workqueue *qgroup_rescan_workers;
>   struct completion qgroup_rescan_completion;
>   struct btrfs_work qgroup_rescan_work;
> + /* qgroup rescan worker is running or queued to run */
> + bool qgroup_rescan_ready;
>   bool qgroup_rescan_running; /* protected by qgroup_rescan_lock */
>  
>   /* filesystem state */
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index aa259d6986e1..466744741873 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -101,6 +101,7 @@ static int
>  qgroup_rescan_init(struct btrfs_fs_info *fs_info, u64 progress_objectid,
>  int init_flags);
>  static void qgroup_rescan_zero_tracking(struct btrfs_fs_info *fs_info);
> +static void btrfs_qgroup_rescan_worker(struct btrfs_work *work);
>  
>  /* must be called with qgroup_ioctl_lock held */
>  static struct btrfs_qgroup *find_qgroup_rb(struct btrfs_fs_info *fs_info,
> @@ -2072,6 +2073,46 @@ int btrfs_qgroup_account_extents(struct 
> btrfs_trans_handle *trans,
>   return ret;
>  }
>  
> +static void queue_rescan_worker(struct btrfs_fs_info *fs_info)
> +{
> + mutex_lock(_info->qgroup_rescan_lock);
> + if (btrfs_fs_closing(fs_info)) {
> + mutex_unlock(_info->qgroup_rescan_lock);
> + return;
> + }
> +
> + if (WARN_ON(!fs_info->qgroup_rescan_ready)) {
> + btrfs_warn(fs_info, "rescan worker not ready");
> + mutex_unlock(_info->qgroup_rescan_lock);
> + return;
> + }
> + fs_info->qgroup_rescan_ready = false;
> +
> + if (WARN_ON(fs_info->qgroup_rescan_running)) {
> + btrfs_warn(fs_info, "rescan worker already queued");
> + mutex_unlock(_info->qgroup_rescan_lock);
> + return;
> + }
> +
> + /*
> +  * Being queued is enough for btrfs_qgroup_wait_for_completion
> +  * to need to wait.
> +  */
> + fs_info->qgroup_rescan_running = true;
> + init_completion(_info->qgroup_rescan_completion);
> + mutex_unlock(_info->qgroup_rescan_lock);
> +
> + memset(_info->qgroup_rescan_work, 0,
> +sizeof(fs_info->qgroup_rescan_work));
> +
> + btrfs_init_work(_info->qgroup_rescan_work,
> + btrfs_qgroup_rescan_helper,
> + btrfs_qgroup_rescan_worker, NULL, NULL);
> +
> + btrfs_queue_work(fs_info->qgroup_rescan_workers,
> +  _info->qgroup_rescan_work);
> +}
> +
>  /*
>   * called from commit_transaction. Writes all changed qgroups to disk.
>   */
> @@ -2123,8 +2164,7 @@ int btrfs


Re: [RFC][PATCH 0/76] vfs: 'views' for filesystems with more than one root

2018-05-08 Thread Jeff Mahoney
rc.info/?l=linux-btrfs=130532890824992=2
>>
>> During the discussion, one question did come up - why can't
>> filesystems like Btrfs use a superblock per subvolume? There's a
>> couple of problems with that:
>>
>> - It's common for a single Btrfs filesystem to have thousands of
>>   subvolumes. So keeping a superblock for each subvol in memory would
>>   get prohibively expensive - imagine having 8000 copies of struct
>>   super_block for a file system just because we wanted some separation
>>   of say, s_dev.
> 
> That's no different to using individual overlay mounts for the
> thousands of containers that are on the system. This doesn't seem to
> be a major problem...

Overlay mounts are independent of one another and don't need coordination
among them.  The memory usage is relatively unimportant.  The important
part is having a bunch of superblocks that all correspond to the same
resources and coordinating them at the VFS level.  Your assumptions
below follow how your XFS subvolumes work, where there's a clear
hierarchy between the subvolumes and the master filesystem with a
mapping layer between them.  Btrfs subvolumes have no such hierarchy.
Everything is shared.  So while we could use a writeback hierarchy to
merge all of the inode lists before doing writeback on the 'master'
superblock, we'd gain nothing by it.  Handling anything involving
s_umount with a superblock per subvolume would be a nightmare.
Ultimately, it would be a ton of effort that amounts to working around
the VFS instead of with it.

>> - Writeback would also have to walk all of these superblocks -
>>   again not very good for system performance.
> 
> Background writeback is backing device focussed, not superblock
> focussed. It will only iterate the superblocks that have dirty
> inodes on the bdi writeback lists, not all the superblocks on the
> bdi. IOWs, this isn't a major problem except for sync() operations
> that iterate superblocks.
> 
>> - Anyone wanting to lock down I/O on a filesystem would have to
>> freeze all the superblocks. This goes for most things related to
>> I/O really - we simply can't afford to have the kernel walking
>> thousands of superblocks to sync a single fs.
> 
> Not with XFS subvolumes. Freezing the underlying parent filesystem
> will effectively stop all IO from the mounted subvolumes by freezing
> remapping calls before IO. Sure, those subvolumes aren't in a
> consistent state, but we don't freeze userspace so none of the
> application data is ever in a consistent state when filesystems are
> frozen.
> 
> So, again, I'm not sure there's /subvolume/ problem here. There's
> definitely a "freeze heirarchy" problem, but that already exists and
> it's something we talked about at LSFMM because we need to solve it
> for reliable hibernation.

There's only a freeze hierarchy problem if we have to use multiple
superblocks.  Otherwise, we freeze the whole thing or we don't.  Trying
to freeze a single subvolume would be an illusion for the user since the
underlying file system would still be active underneath it.  Under the
hood, things like relocation don't even look at what subvolume owns a
particular extent until it must.  So it could be coordinating thousands
of superblocks to do what a single lock does now and for what benefit?

>> It's far more efficient then to pull those fields we need for a
>> subvolume namespace into their own structure.
>
> I'm not convinced yet - it still feels like it's the wrong layer to
> be solving the multiple namespace per superblock problem

It needs to be between the inode and the superblock.  If there are
multiple user-visible namespaces, each will still get the same
underlying file system namespace.

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH v3 0/3] btrfs: qgroup rescan races (part 1)

2018-05-04 Thread Jeff Mahoney
On 5/4/18 1:59 AM, Nikolay Borisov wrote:
> 
> 
> On  4.05.2018 01:27, Jeff Mahoney wrote:
>> On 5/3/18 2:23 AM, Nikolay Borisov wrote:
>>>
>>>
>>> On  3.05.2018 00:11, je...@suse.com wrote:
>>>> From: Jeff Mahoney <je...@suse.com>
>>>>
>>>> Hi Dave -
>>>>
>>>> Here's the updated patchset for the rescan races.  This fixes the issue
>>>> where we'd try to start multiple workers.  It introduces a new "ready"
>>>> bool that we set during initialization and clear while queuing the worker.
>>>> The queuer is also now responsible for most of the initialization.
>>>>
>>>> I have a separate patch set start that gets rid of the racy mess 
>>>> surrounding
>>>> the rescan worker startup.  We can handle it in btrfs_run_qgroups and
>>>> just set a flag to start it everywhere else.
>>> I'd be interested in seeing those patches. Some time ago I did send a
>>> patch which cleaned up the way qgroup rescan was initiated. It was done
>>> from "btrfs_run_qgroups" and I think this is messy. Whatever we do we
>>> ought to really have well-defined semantics when qgroups rescan are run,
>>> preferably we shouldn't be conflating rescan + run (unless there is
>>> _really_ good reason to do). In the past the rescan from scan was used
>>> only during qgroup enabling.
>>
>> I think btrfs_run_qgroups is the place to do it.  Here's why:
>>
>> 2773 int
>> 2774 btrfs_qgroup_rescan(struct btrfs_fs_info *fs_info)
>> 2775 {
>> 2776 int ret = 0;
>> 2777 struct btrfs_trans_handle *trans;
>> 2778
>> 2779 ret = qgroup_rescan_init(fs_info, 0, 1);
>> 2780 if (ret)
>> 2781 return ret;
>> 2782
>> 2783 /*
>> 2784  * We have set the rescan_progress to 0, which means no more
>> 2785  * delayed refs will be accounted by btrfs_qgroup_account_ref.
>> 2786  * However, btrfs_qgroup_account_ref may be right after its call
>> 2787  * to btrfs_find_all_roots, in which case it would still do the
>> 2788  * accounting.
>> 2789  * To solve this, we're committing the transaction, which will
>> 2790  * ensure we run all delayed refs and only after that, we are
>> 2791  * going to clear all tracking information for a clean start.
>> 2792  */
>> 2793
>> 2794 trans = btrfs_join_transaction(fs_info->fs_root);
>> 2795 if (IS_ERR(trans)) {
>> 2796 fs_info->qgroup_flags &= 
>> ~BTRFS_QGROUP_STATUS_FLAG_RESCAN;
>> 2797 return PTR_ERR(trans);
>> 2798 }
>> 2799 ret = btrfs_commit_transaction(trans);
>> 2800 if (ret) {
>> 2801 fs_info->qgroup_flags &= 
>> ~BTRFS_QGROUP_STATUS_FLAG_RESCAN;
>> 2802 return ret;
>> 2803 }
>> 2804
>> 2805 qgroup_rescan_zero_tracking(fs_info);
>> 2806
>> 2807 queue_rescan_worker(fs_info);
>> 2808 return 0;
>> 2809 }
>>
>> The delayed ref race should exist anywhere we initiate a rescan outside of
>> initially enabling qgroups.  We already zero the tracking and queue the 
>> rescan
>> worker in btrfs_run_qgroups for when we enable qgroups.  Why not just always
>> queue the worker there so the initialization and execution has a clear 
>> starting point?
> 
> This is no longer true in upstream as of commit 5d23515be669 ("btrfs:
> Move qgroup rescan on quota enable to btrfs_quota_enable"). Hence my
> asking about this. I guess if we make it unconditional it won't increase
> the complexity, but the original code which was only run during qgroup
> enable was rather iffy I Just don't want to repeat this.

Ah, ok.  My repo is still using v4.16.  How does this work with the race
that is described in btrfs_qgroup_rescan?

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH v3 0/3] btrfs: qgroup rescan races (part 1)

2018-05-03 Thread Jeff Mahoney
On 5/3/18 2:23 AM, Nikolay Borisov wrote:
> 
> 
> On  3.05.2018 00:11, je...@suse.com wrote:
>> From: Jeff Mahoney <je...@suse.com>
>>
>> Hi Dave -
>>
>> Here's the updated patchset for the rescan races.  This fixes the issue
>> where we'd try to start multiple workers.  It introduces a new "ready"
>> bool that we set during initialization and clear while queuing the worker.
>> The queuer is also now responsible for most of the initialization.
>>
>> I have a separate patch set start that gets rid of the racy mess surrounding
>> the rescan worker startup.  We can handle it in btrfs_run_qgroups and
>> just set a flag to start it everywhere else.
> I'd be interested in seeing those patches. Some time ago I did send a
> patch which cleaned up the way qgroup rescan was initiated. It was done
> from "btrfs_run_qgroups" and I think this is messy. Whatever we do we
> ought to really have well-defined semantics when qgroups rescan are run,
> preferably we shouldn't be conflating rescan + run (unless there is
> _really_ good reason to do). In the past the rescan from scan was used
> only during qgroup enabling.

I think btrfs_run_qgroups is the place to do it.  Here's why:

2773 int
2774 btrfs_qgroup_rescan(struct btrfs_fs_info *fs_info)
2775 {
2776 int ret = 0;
2777 struct btrfs_trans_handle *trans;
2778
2779 ret = qgroup_rescan_init(fs_info, 0, 1);
2780 if (ret)
2781 return ret;
2782
2783 /*
2784  * We have set the rescan_progress to 0, which means no more
2785  * delayed refs will be accounted by btrfs_qgroup_account_ref.
2786  * However, btrfs_qgroup_account_ref may be right after its call
2787  * to btrfs_find_all_roots, in which case it would still do the
2788  * accounting.
2789  * To solve this, we're committing the transaction, which will
2790  * ensure we run all delayed refs and only after that, we are
2791  * going to clear all tracking information for a clean start.
2792  */
2793
2794 trans = btrfs_join_transaction(fs_info->fs_root);
2795 if (IS_ERR(trans)) {
2796 fs_info->qgroup_flags &= ~BTRFS_QGROUP_STATUS_FLAG_RESCAN;
2797 return PTR_ERR(trans);
2798 }
2799 ret = btrfs_commit_transaction(trans);
2800 if (ret) {
2801 fs_info->qgroup_flags &= ~BTRFS_QGROUP_STATUS_FLAG_RESCAN;
2802 return ret;
2803 }
2804
2805 qgroup_rescan_zero_tracking(fs_info);
2806
2807 queue_rescan_worker(fs_info);
2808 return 0;
2809 }

The delayed ref race should exist anywhere we initiate a rescan outside of
initially enabling qgroups.  We already zero the tracking and queue the rescan
worker in btrfs_run_qgroups for when we enable qgroups.  Why not just always
queue the worker there so the initialization and execution has a clear starting 
point?

There are a few other races I'd like to fix as well.  We call btrfs_run_qgroups
directly from btrfs_ioctl_qgroup_assign, which is buggy since
btrfs_add_qgroup_relation only checks to see if the quota_root exists.  It will
exist as soon as btrfs_quota_enable runs but we won't have committed the
transaction yet.  The call will end up enabling quotas in the middle of a 
transaction.

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH 1/3] btrfs: qgroups, fix rescan worker running races

2018-05-03 Thread Jeff Mahoney
On 5/3/18 11:52 AM, Nikolay Borisov wrote:
> 
> 
> On  3.05.2018 16:39, Jeff Mahoney wrote:
>> On 5/3/18 3:24 AM, Nikolay Borisov wrote:
>>>
>>>
>>> On  3.05.2018 00:11, je...@suse.com wrote:
>>>> From: Jeff Mahoney <je...@suse.com>
>>>>
>>>> Commit 8d9eddad194 (Btrfs: fix qgroup rescan worker initialization)
>>>> fixed the issue with BTRFS_IOC_QUOTA_RESCAN_WAIT being racy, but
>>>> ended up reintroducing the hang-on-unmount bug that the commit it
>>>> intended to fix addressed.
>>>>
>>>> The race this time is between qgroup_rescan_init setting
>>>> ->qgroup_rescan_running = true and the worker starting.  There are
>>>> many scenarios where we initialize the worker and never start it.  The
>>>> completion btrfs_ioctl_quota_rescan_wait waits for will never come.
>>>> This can happen even without involving error handling, since mounting
>>>> the file system read-only returns between initializing the worker and
>>>> queueing it.
>>>>
>>>> The right place to do it is when we're queuing the worker.  The flag
>>>> really just means that btrfs_ioctl_quota_rescan_wait should wait for
>>>> a completion.
>>>>
>>>> Since the BTRFS_QGROUP_STATUS_FLAG_RESCAN flag is overloaded to
>>>> refer to both runtime behavior and on-disk state, we introduce a new
>>>> fs_info->qgroup_rescan_ready to indicate that we're initialized and
>>>> waiting to start.
>>>
>>> Am I correct in my understanding that this qgroup_rescan_ready flag is
>>> used to avoid qgroup_rescan_init being called AFTER it has already been
>>> called but BEFORE queue_rescan_worker ? Why wasn't the initial version
>>> of this patch without this flag sufficient?
>>
>> No, the race is between clearing the BTRFS_QGROUP_STATUS_FLAG_RESCAN
>> flag near the end of the worker and clearing the running flag.  The
>> rescan lock is dropped in between, so btrfs_rescan_init will let a new
>> rescan request in while we update the status item on disk.  We wouldn't
>> have queued another worker since that's what the warning catches, but if
>> there were already tasks waiting for completion, they wouldn't have been
>> woken since the wait queue list would be reinitialized.  There's no way
>> to reorder clearing the flag without changing how we handle
>> ->qgroup_flags.  I plan on doing that separately.  This was just meant
>> to be the simple fix.
> 
> Great, I think some of this information should go into the change log,
> in explaining what the symptoms of the race condition are.

You're right.  I was treating it as a race that my patch introduced, but
it didn't.  It just complained about it.

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH 1/3] btrfs: qgroups, fix rescan worker running races

2018-05-03 Thread Jeff Mahoney
On 5/3/18 3:24 AM, Nikolay Borisov wrote:
> 
> 
> On  3.05.2018 00:11, je...@suse.com wrote:
>> From: Jeff Mahoney <je...@suse.com>
>>
>> Commit 8d9eddad194 (Btrfs: fix qgroup rescan worker initialization)
>> fixed the issue with BTRFS_IOC_QUOTA_RESCAN_WAIT being racy, but
>> ended up reintroducing the hang-on-unmount bug that the commit it
>> intended to fix addressed.
>>
>> The race this time is between qgroup_rescan_init setting
>> ->qgroup_rescan_running = true and the worker starting.  There are
>> many scenarios where we initialize the worker and never start it.  The
>> completion btrfs_ioctl_quota_rescan_wait waits for will never come.
>> This can happen even without involving error handling, since mounting
>> the file system read-only returns between initializing the worker and
>> queueing it.
>>
>> The right place to do it is when we're queuing the worker.  The flag
>> really just means that btrfs_ioctl_quota_rescan_wait should wait for
>> a completion.
>>
>> Since the BTRFS_QGROUP_STATUS_FLAG_RESCAN flag is overloaded to
>> refer to both runtime behavior and on-disk state, we introduce a new
>> fs_info->qgroup_rescan_ready to indicate that we're initialized and
>> waiting to start.
> 
> Am I correct in my understanding that this qgroup_rescan_ready flag is
> used to avoid qgroup_rescan_init being called AFTER it has already been
> called but BEFORE queue_rescan_worker ? Why wasn't the initial version
> of this patch without this flag sufficient?

No, the race is between clearing the BTRFS_QGROUP_STATUS_FLAG_RESCAN
flag near the end of the worker and clearing the running flag.  The
rescan lock is dropped in between, so btrfs_rescan_init will let a new
rescan request in while we update the status item on disk.  We wouldn't
have queued another worker since that's what the warning catches, but if
there were already tasks waiting for completion, they wouldn't have been
woken since the wait queue list would be reinitialized.  There's no way
to reorder clearing the flag without changing how we handle
->qgroup_flags.  I plan on doing that separately.  This was just meant
to be the simple fix.
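
Roughly, the tail of btrfs_qgroup_rescan_worker() looks like this
(condensed, error handling elided, so treat it as a sketch):

	mutex_lock(&fs_info->qgroup_rescan_lock);
	fs_info->qgroup_flags &= ~BTRFS_QGROUP_STATUS_FLAG_RESCAN;
	mutex_unlock(&fs_info->qgroup_rescan_lock);

	/*
	 * Window: qgroup_rescan_init() can accept a new rescan request
	 * here, reinitializing the completion while earlier waiters are
	 * still sleeping on it.
	 */
	/* ... start a transaction and update the status item on disk ... */

	mutex_lock(&fs_info->qgroup_rescan_lock);
	fs_info->qgroup_rescan_running = false;
	mutex_unlock(&fs_info->qgroup_rescan_lock);
	complete_all(&fs_info->qgroup_rescan_completion);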

That we can use the ready variable to also ensure that we don't let
qgroup_rescan_init be called twice without running the rescan is a nice
bonus.

-Jeff

>>
>> This patch introduces a new helper, queue_rescan_worker, that handles
>> most of the initialization, the two flags, and queuing the worker,
>> including races with unmount.
>>
>> While we're at it, ->qgroup_rescan_running is protected only by the
>> ->qgroup_rescan_mutex.  btrfs_ioctl_quota_rescan_wait doesn't need
>> to take the spinlock too.
>>
>> Fixes: 8d9eddad194 (Btrfs: fix qgroup rescan worker initialization)
>> Signed-off-by: Jeff Mahoney <je...@suse.com>
>> ---
>>  fs/btrfs/ctree.h  |  2 ++
>>  fs/btrfs/qgroup.c | 94 
>> +--
>>  2 files changed, 58 insertions(+), 38 deletions(-)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index da308774b8a4..4003498bb714 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -1045,6 +1045,8 @@ struct btrfs_fs_info {
>>  struct btrfs_workqueue *qgroup_rescan_workers;
>>  struct completion qgroup_rescan_completion;
>>  struct btrfs_work qgroup_rescan_work;
>> +/* qgroup rescan worker is running or queued to run */
>> +bool qgroup_rescan_ready;
>>  bool qgroup_rescan_running; /* protected by qgroup_rescan_lock */
>>  
>>  /* filesystem state */
>> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
>> index aa259d6986e1..466744741873 100644
>> --- a/fs/btrfs/qgroup.c
>> +++ b/fs/btrfs/qgroup.c
>> @@ -101,6 +101,7 @@ static int
>>  qgroup_rescan_init(struct btrfs_fs_info *fs_info, u64 progress_objectid,
>> int init_flags);
>>  static void qgroup_rescan_zero_tracking(struct btrfs_fs_info *fs_info);
>> +static void btrfs_qgroup_rescan_worker(struct btrfs_work *work);
>>  
>>  /* must be called with qgroup_ioctl_lock held */
>>  static struct btrfs_qgroup *find_qgroup_rb(struct btrfs_fs_info *fs_info,
>> @@ -2072,6 +2073,46 @@ int btrfs_qgroup_account_extents(struct 
>> btrfs_trans_handle *trans,
>>  return ret;
>>  }
>>  
>> +static void queue_rescan_worker(struct btrfs_fs_info *fs_info)
>> +{
>> +mutex_lock(&fs_info->qgroup_rescan_lock);
>> +if (btrfs_fs_closing(fs_info)) {
>> +mutex_unlock(&fs_info->qgroup_rescan_lock);
>> +return;
>> +}
>> +

Re: [PATCH 1/3] btrfs: qgroups, fix rescan worker running races

2018-05-02 Thread Jeff Mahoney
On 5/2/18 9:15 AM, David Sterba wrote:
> On Wed, May 02, 2018 at 12:29:28PM +0200, David Sterba wrote:
>> On Thu, Apr 26, 2018 at 03:23:49PM -0400, je...@suse.com wrote:
>>> From: Jeff Mahoney <je...@suse.com>
>>> +static void queue_rescan_worker(struct btrfs_fs_info *fs_info)
>>> +{
>>> +   mutex_lock(&fs_info->qgroup_rescan_lock);
>>> +   if (btrfs_fs_closing(fs_info)) {
>>> +   mutex_unlock(&fs_info->qgroup_rescan_lock);
>>> +   return;
>>> +   }
>>> +   if (WARN_ON(fs_info->qgroup_rescan_running)) {
>>
>> The warning is quite noisy, I see it after tests btrfs/ 017, 022, 124,
>> 139, 153. Is it necessary for non-debugging builds?
>>
>> The tested branch was full for-next so it could be your patchset
>> interacting with other fixes, but the warning noise level question still
>> stands.
> 
> So it must be something with the rest of misc-next or for-next patches,
> current for 4.17 queue does not show the warning at all, and the patch is ok
> for merge.
>
You might have something that causes it to be more noisy but it looks
like it should be possible to hit on 4.16.  The warning is supposed to
detect and complain about multiple rescan threads starting.  What I
think it's doing here is (correctly) identifying a different race: at
the end of btrfs_qgroup_rescan_worker, we clear the rescan status flag,
drop the lock, commit the status item transaction, and then update
->qgroup_rescan_running.  If a rescan is requested before the lock is
reacquired, we'll try to start it up and then hit that warning.

So, the warning is doing its job.  Please hold off on merging this patch.

IMO the root cause is overloading fs_info->qgroup_flags to correspond to
the on-disk item and control runtime behavior.  I've been meaning to fix
that for a while, so I'll do that now.

-Jeff

-- 
Jeff Mahoney
SUSE Labs


Re: Strange behavior (possible bugs) in btrfs

2018-04-30 Thread Jeff Mahoney
On 4/30/18 12:04 PM, Vijay Chidambaram wrote:
> Hi,
> 
> We found two more cases where the btrfs behavior is a little strange.
> In one case, an fsync-ed file goes missing after a crash. In the
> other, a renamed file shows up in both directories after a crash.

Hi Vijay -

What kernel version did you observe these with?  These seem like bugs
Filipe has already fixed.

-Jeff


> Workload 1:
> 
> mkdir A
> mkdir B
> mkdir A/C
> creat B/foo
> fsync B/foo
> link B/foo A/C/foo
> fsync A
> -- crash --
> 
> Expected state after recovery:
> B B/foo A A/C exist
> 
> What we find:
> Only B B/foo exist
> 
> A is lost even after explicit fsync to A.
> 
> Workload 2:
> 
> mkdir A
> mkdir A/C
> rename A/C B
> touch B/bar
> fsync B/bar
> rename B/bar A/bar
> rename A B (replacing B with A at this point)
> fsync B/bar
> -- crash --
> 
> Expected contents after recovery:
> A/bar
> 
> What we find after recovery:
> A/bar
> B/bar
> 
> We think this breaks rename's atomicity guarantee. bar should be
> present in either A or B, but now it is present in both.
> 
> Thanks,
> Vijay
> 


-- 
Jeff Mahoney
SUSE Labs


Re: [PATCH 1/3] btrfs: qgroups, fix rescan worker running races

2018-04-30 Thread Jeff Mahoney
On 4/30/18 2:20 AM, Qu Wenruo wrote:
> 
> 
> On 2018-04-27 03:23, je...@suse.com wrote:
>> From: Jeff Mahoney <je...@suse.com>
>>
>> Commit d2c609b834d6 (Btrfs: fix qgroup rescan worker initialization)
>> fixed the issue with BTRFS_IOC_QUOTA_RESCAN_WAIT being racy, but
>> ended up reintroducing the hang-on-unmount bug that the commit it
>> intended to fix addressed.
>>
>> The race this time is between qgroup_rescan_init setting
>> ->qgroup_rescan_running = true and the worker starting.  There are
>> many scenarios where we initialize the worker and never start it.  The
>> completion btrfs_ioctl_quota_rescan_wait waits for will never come.
>> This can happen even without involving error handling, since mounting
>> the file system read-only returns between initializing the worker and
>> queueing it.
>>
>> The right place to do it is when we're queuing the worker.  The flag
>> really just means that btrfs_ioctl_quota_rescan_wait should wait for
>> a completion.
>>
>> This patch introduces a new helper, queue_rescan_worker, that handles
>> the ->qgroup_rescan_running flag, including any races with umount.
>>
>> While we're at it, ->qgroup_rescan_running is protected only by the
>> ->qgroup_rescan_mutex.  btrfs_ioctl_quota_rescan_wait doesn't need
>> to take the spinlock too.
>>
>> Fixes: d2c609b834d6 (Btrfs: fix qgroup rescan worker initialization)
>> Signed-off-by: Jeff Mahoney <je...@suse.com>
> 
> A little off-topic (thanks Nikolay for reporting this): sometimes
> btrfs/017 can report qgroup corruption, and it turns out it's related
> to a rescan race, which accounts existing tree blocks twice.
> (One by btrfs quota enable, another by btrfs quota rescan -w)
> 
> Would this patch help in such case?

It shouldn't.  This only fixes races between the rescan worker getting
initialized and running vs waiting for it to complete.

-Jeff

>>  fs/btrfs/ctree.h  |  1 +
>>  fs/btrfs/qgroup.c | 40 
>>  2 files changed, 29 insertions(+), 12 deletions(-)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index da308774b8a4..dbba615f4d0f 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -1045,6 +1045,7 @@ struct btrfs_fs_info {
>>  struct btrfs_workqueue *qgroup_rescan_workers;
>>  struct completion qgroup_rescan_completion;
>>  struct btrfs_work qgroup_rescan_work;
>> +/* qgroup rescan worker is running or queued to run */
>>  bool qgroup_rescan_running; /* protected by qgroup_rescan_lock */
>>  
>>  /* filesystem state */
>> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
>> index aa259d6986e1..be491b6c020a 100644
>> --- a/fs/btrfs/qgroup.c
>> +++ b/fs/btrfs/qgroup.c
>> @@ -2072,6 +2072,30 @@ int btrfs_qgroup_account_extents(struct 
>> btrfs_trans_handle *trans,
>>  return ret;
>>  }
>>  
>> +static void queue_rescan_worker(struct btrfs_fs_info *fs_info)
>> +{
>> +mutex_lock(&fs_info->qgroup_rescan_lock);
>> +if (btrfs_fs_closing(fs_info)) {
>> +mutex_unlock(&fs_info->qgroup_rescan_lock);
>> +return;
>> +}
>> +if (WARN_ON(fs_info->qgroup_rescan_running)) {
>> +btrfs_warn(fs_info, "rescan worker already queued");
>> +mutex_unlock(&fs_info->qgroup_rescan_lock);
>> +return;
>> +}
>> +
>> +/*
>> + * Being queued is enough for btrfs_qgroup_wait_for_completion
>> + * to need to wait.
>> + */
>> +fs_info->qgroup_rescan_running = true;
>> +mutex_unlock(&fs_info->qgroup_rescan_lock);
>> +
>> +btrfs_queue_work(fs_info->qgroup_rescan_workers,
>> + &fs_info->qgroup_rescan_work);
>> +}
>> +
>>  /*
>>   * called from commit_transaction. Writes all changed qgroups to disk.
>>   */
>> @@ -2123,8 +2147,7 @@ int btrfs_run_qgroups(struct btrfs_trans_handle *trans,
>>  ret = qgroup_rescan_init(fs_info, 0, 1);
>>  if (!ret) {
>>  qgroup_rescan_zero_tracking(fs_info);
>> -btrfs_queue_work(fs_info->qgroup_rescan_workers,
>> - &fs_info->qgroup_rescan_work);
>> +queue_rescan_worker(fs_info);
>>  }
>>  ret = 0;
>>  }
>> @@ -2713,7 +2736,6 @@ qgroup_rescan_init(struct btrfs_fs_info *fs_info, u64 
>> progress_objectid,
>>  si

Re: [PATCH 3/3] btrfs-progs: build: use m4_flatten instead of m4_chomp

2018-04-29 Thread Jeff Mahoney
On 4/29/18 6:13 AM, David Sterba wrote:
> On Fri, Apr 27, 2018 at 03:18:17PM -0400, Jeff Mahoney wrote:
>> On 4/27/18 2:56 PM, je...@suse.com wrote:
>>> From: Jeff Mahoney <je...@suse.com>
>>>
>>> Commit 2e1932e6a38 (btrfs-progs: build: simplify version tracking)
>>> started using m4_chomp to strip the newlines from the version file.  m4_chomp
>>> was introduced in autoconf 2.64 but SLE11 ships with autoconf 2.63.
>>> For purposes of just stripping the newline, m4_flatten is sufficient.
>>
>> Scratch that.  The previous patch also requires autoconf 2.64.
> 
> I wanted to avoid shell tricks, but this should work everywhere:
> 
> m4_esyscmd([echo -n $(cat VERSION)])
> 

m4_flatten should work everywhere.  It's the AX_CHECK_COMPILE_FLAGS that
depends on autoconf 2.64.

I can fix the dependency, but it ends up looking like:

m4_version_prereq([2.64], [
AX_CHECK_COMPILE_FLAG([-std=gnu90],[CSTD=-std=gnu90],[CSTD=-std=gnu89])
  ], [
AX_GCC_VERSION([4], [5], [0], [CSTD=-std=gnu90],[CSTD=-std=gnu89])
  ])
AC_SUBST([CSTD])

AX_GCC_VERSION is deprecated, but works with earlier autoconf versions.
I'm not thrilled about it, but keying off the autoconf version and using
the newer way for newer versions means it'll be easier to drop it later
when we drop support for earlier autoconf.

Alternatively, I can just put this patch in the OBS project.

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH 1/3] btrfs: qgroups, fix rescan worker running races

2018-04-27 Thread Jeff Mahoney
On 4/27/18 12:40 PM, David Sterba wrote:
> On Fri, Apr 27, 2018 at 12:02:13PM -0400, Jeff Mahoney wrote:
>>>> +static void queue_rescan_worker(struct btrfs_fs_info *fs_info)
>>>> +{
>>>
>>> And this had to be moved upwards as there was earlier use of
>>> btrfs_queue_work that matched following the hunk.
>>
>> Weird.  That must be exactly the kind of mismerge artifact that we were
>> talking about the other day.  In my tree it's in the right spot.
> 
> I've tried current master, upcoming pull request queue (misc-4.17, one
> nonc-onflicting patch) and current misc-next. None of them applies the
> patch cleanly and the function is still added after the first use, so
> this would not compile.
> 
> The result can be found in
> https://github.com/kdave/btrfs-devel/commits/ext/jeffm/qgroup-fixes
> 

Thanks.  The "Fixes" is incorrect there.  I had the right commit message
but not the right commit id.  It should be:

8d9eddad1946 (Btrfs: fix qgroup rescan worker initialization)

-Jeff

-- 
Jeff Mahoney
SUSE Labs


Re: [PATCH 3/3] btrfs-progs: build: use m4_flatten instead of m4_chomp

2018-04-27 Thread Jeff Mahoney
On 4/27/18 2:56 PM, je...@suse.com wrote:
> From: Jeff Mahoney <je...@suse.com>
> 
> Commit 2e1932e6a38 (btrfs-progs: build: simplify version tracking)
> started using m4_chomp to strip the newlines from the version file.  m4_chomp
> was introduced in autoconf 2.64 but SLE11 ships with autoconf 2.63.
> For purposes of just stripping the newline, m4_flatten is sufficient.

Scratch that.  The previous patch also requires autoconf 2.64.

-Jeff

> Signed-off-by: Jeff Mahoney <je...@suse.com>
> ---
>  configure.ac | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/configure.ac b/configure.ac
> index 17880206..a0cebf15 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -1,5 +1,5 @@
>  AC_INIT([btrfs-progs],
> - m4_chomp(m4_include([VERSION])),
> + m4_flatten(m4_include([VERSION])),
>   [linux-btrfs@vger.kernel.org],,
>   [http://btrfs.wiki.kernel.org])
>  
> 


-- 
Jeff Mahoney
SUSE Labs


[PATCH v2] btrfs: qgroup, don't try to insert status item after ENOMEM in rescan worker

2018-04-27 Thread Jeff Mahoney
If we fail to allocate memory for a path, don't bother trying to
insert the qgroup status item.  We haven't done anything yet and it'll
fail also.  Just print an error and be done with it.

Signed-off-by: Jeff Mahoney <je...@suse.com>
---
 fs/btrfs/qgroup.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 8de423a0c7e3..b795bad54705 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2648,7 +2648,6 @@ static void btrfs_qgroup_rescan_worker(struct btrfs_work 
*work)
btrfs_end_transaction(trans);
}
 
-out:
btrfs_free_path(path);
 
	mutex_lock(&fs_info->qgroup_rescan_lock);
@@ -2684,13 +2683,13 @@ static void btrfs_qgroup_rescan_worker(struct 
btrfs_work *work)
 
if (btrfs_fs_closing(fs_info)) {
btrfs_info(fs_info, "qgroup scan paused");
-   } else if (err >= 0) {
+   err = 0;
+   } else if (err >= 0)
btrfs_info(fs_info, "qgroup scan completed%s",
err > 0 ? " (inconsistency flag cleared)" : "");
-   } else {
+out:
+   if (err < 0)
btrfs_err(fs_info, "qgroup scan failed with %d", err);
-   }
-
 done:
	mutex_lock(&fs_info->qgroup_rescan_lock);
fs_info->qgroup_rescan_running = false;
-- 
2.12.3




Re: [PATCH 3/3] btrfs: qgroup, don't try to insert status item after ENOMEM in rescan worker

2018-04-27 Thread Jeff Mahoney
On 4/27/18 11:44 AM, David Sterba wrote:
> On Thu, Apr 26, 2018 at 11:39:50PM +0300, Nikolay Borisov wrote:
>> On 26.04.2018 22:23, je...@suse.com wrote:
>>> From: Jeff Mahoney <je...@suse.com>
>>>
>>> If we fail to allocate memory for a path, don't bother trying to
>>> insert the qgroup status item.  We haven't done anything yet and it'll
>>> fail also.  Just print an error and be done with it.
>>>
>>> Signed-off-by: Jeff Mahoney <je...@suse.com>
>>
>> nit: So the code is correct however, having the out label there is
>> really ugly. What about on path alloc failure just have the print in the
>> if branch do goto done?
> 
> Yeah, I don't like jumping to the inner blocks either. I saw this in the
> qgroup code so we should clean it up and not add new instances.
> 
> In this case, only the path allocation failure jumps to the out label,
> so printing the message and then jump to done makes sense to me.
> However, the message would have to be duplicated in the end, and I don't
> see a better way without further restructuring the code.
> 

It doesn't require major surgery.  The else can be disconnected.
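
For reference, the tail of the worker reads like this with the v2 change
applied (condensed from the diff above):

	if (btrfs_fs_closing(fs_info)) {
		btrfs_info(fs_info, "qgroup scan paused");
		err = 0;
	} else if (err >= 0)
		btrfs_info(fs_info, "qgroup scan completed%s",
			   err > 0 ? " (inconsistency flag cleared)" : "");
out:
	if (err < 0)
		btrfs_err(fs_info, "qgroup scan failed with %d", err);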

-Jeff

-- 
Jeff Mahoney
SUSE Labs


Re: [PATCH 1/3] btrfs: qgroups, fix rescan worker running races

2018-04-27 Thread Jeff Mahoney
On 4/27/18 11:56 AM, David Sterba wrote:
> On Thu, Apr 26, 2018 at 03:23:49PM -0400, je...@suse.com wrote:
>> From: Jeff Mahoney <je...@suse.com>
>>
>> Commit d2c609b834d6 (Btrfs: fix qgroup rescan worker initialization)
>> fixed the issue with BTRFS_IOC_QUOTA_RESCAN_WAIT being racy, but
>> ended up reintroducing the hang-on-unmount bug that the commit it
>> intended to fix addressed.
>>
>> The race this time is between qgroup_rescan_init setting
>> ->qgroup_rescan_running = true and the worker starting.  There are
>> many scenarios where we initialize the worker and never start it.  The
>> completion btrfs_ioctl_quota_rescan_wait waits for will never come.
>> This can happen even without involving error handling, since mounting
>> the file system read-only returns between initializing the worker and
>> queueing it.
>>
>> The right place to do it is when we're queuing the worker.  The flag
>> really just means that btrfs_ioctl_quota_rescan_wait should wait for
>> a completion.
>>
>> This patch introduces a new helper, queue_rescan_worker, that handles
>> the ->qgroup_rescan_running flag, including any races with umount.
>>
>> While we're at it, ->qgroup_rescan_running is protected only by the
>> ->qgroup_rescan_mutex.  btrfs_ioctl_quota_rescan_wait doesn't need
>> to take the spinlock too.
>>
>> Fixes: d2c609b834d6 (Btrfs: fix qgroup rescan worker initialization)
>> Signed-off-by: Jeff Mahoney <je...@suse.com>
> 
> I've added this to misc-next as I'd like to push it to the next rc. The
> Fixes is fixed.
> 
>> +/* qgroup rescan worker is running or queued to run */
>>  bool qgroup_rescan_running; /* protected by qgroup_rescan_lock */
> 
> Comments merged.

Thanks.

>>  /* filesystem state */
>> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
>> index aa259d6986e1..be491b6c020a 100644
>> --- a/fs/btrfs/qgroup.c
>> +++ b/fs/btrfs/qgroup.c
>> @@ -2072,6 +2072,30 @@ int btrfs_qgroup_account_extents(struct 
>> btrfs_trans_handle *trans,
>>  return ret;
>>  }
>>  
>> +static void queue_rescan_worker(struct btrfs_fs_info *fs_info)
>> +{
> 
> And this had to be moved upwards as there was earlier use of
> btrfs_queue_work that matched following the hunk.

Weird.  That must be exactly the kind of mismerge artifact that we were
talking about the other day.  In my tree it's in the right spot.

-Jeff

-- 
Jeff Mahoney
SUSE Labs


Re: [PATCH 1/3] btrfs: qgroups, fix rescan worker running races

2018-04-27 Thread Jeff Mahoney
On 4/27/18 4:48 AM, Filipe Manana wrote:
> On Thu, Apr 26, 2018 at 8:23 PM,  <je...@suse.com> wrote:
>> From: Jeff Mahoney <je...@suse.com>
>>
>> Commit d2c609b834d6 (Btrfs: fix qgroup rescan worker initialization)
>> fixed the issue with BTRFS_IOC_QUOTA_RESCAN_WAIT being racy, but
>> ended up reintroducing the hang-on-unmount bug that the commit it
>> intended to fix addressed.
>>
>> The race this time is between qgroup_rescan_init setting
>> ->qgroup_rescan_running = true and the worker starting.  There are
>> many scenarios where we initialize the worker and never start it.  The
>> completion btrfs_ioctl_quota_rescan_wait waits for will never come.
>> This can happen even without involving error handling, since mounting
>> the file system read-only returns between initializing the worker and
>> queueing it.
>>
>> The right place to do it is when we're queuing the worker.  The flag
>> really just means that btrfs_ioctl_quota_rescan_wait should wait for
>> a completion.
>>
>> This patch introduces a new helper, queue_rescan_worker, that handles
>> the ->qgroup_rescan_running flag, including any races with umount.
>>
>> While we're at it, ->qgroup_rescan_running is protected only by the
>> ->qgroup_rescan_mutex.  btrfs_ioctl_quota_rescan_wait doesn't need
>> to take the spinlock too.
>>
>> Fixes: d2c609b834d6 (Btrfs: fix qgroup rescan worker initialization)
> 
> The commit id and subjects don't match:
> 
> commit d2c609b834d62f1e91f1635a27dca29f7806d3d6
> Author: Jeff Mahoney <je...@suse.com>
> Date:   Mon Aug 15 12:10:33 2016 -0400
> 
> btrfs: properly track when rescan worker is running
> 


Thanks.  Fixed.

-Jeff

-- 
Jeff Mahoney
SUSE Labs


Re: [PATCH] btrfs: push relocation recovery into a helper thread

2018-04-24 Thread Jeff Mahoney
On 4/23/18 5:43 PM, David Sterba wrote:
> On Tue, Apr 17, 2018 at 02:45:33PM -0400, Jeff Mahoney wrote:
>> On a file system with many snapshots and qgroups enabled, an interrupted
>> balance can end up taking a long time to mount due to recovering the
>> relocations during mount.  It does this in the task performing the mount,
>> which can't be interrupted and may prevent mounting (and systems booting)
>> for a long time as well.  The thing is that as part of balance, this
>> runs in the background all the time.  This patch pushes the recovery
>> into a helper thread and allows the mount to continue normally.  We hold
>> off on resuming any paused balance operation until after the relocation
>> has completed, disallow any new balance operations if it's running, and
>> wait for it on umount and remounting read-only.
> 
> The overall logic sounds ok.

Thanks for the review.  I've updated the style issues in my patch and
removed them from the quote below.

> So, this can also stall the umount, right? Eg. if I start mount,
> relocation goes to background, then unmount will have to wait for
> completion.

Yep, I didn't try to solve that problem since the file system wouldn't
even mount before.  Makes sense to make it unmountable, though.  That's
a change that would probably speed up btrfs balance cancel as well.

> Balance pause is requested at umount time, something similar should be
> possible for relocation recovery. The fs_state bit CLOSING could be
> checked somewhere in the loop.

An earlier version had that check in the top of the loop in
merge_reloc_roots, but I think a better spot would be the top of the
merge_reloc_root loop.
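
Something along these lines, as an untested sketch (the real loop in
merge_reloc_root() does the actual tree walking and replacement):

	while (1) {
		if (btrfs_fs_closing(fs_info)) {
			/* unmount requested: stop merging, resume on next mount */
			break;
		}
		/* ... process the next block of this reloc root ... */
	}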

>> This doesn't do anything to address the relocation recovery operation
>> taking a long time but does allow the file system to mount.
>>
>> Signed-off-by: Jeff Mahoney <je...@suse.com>
>> ---
>>  fs/btrfs/ctree.h  |7 +++
>>  fs/btrfs/disk-io.c|7 ++-
>>  fs/btrfs/relocation.c |   92 
>> +-
>>  fs/btrfs/super.c  |5 +-
>>  fs/btrfs/volumes.c|6 +++
>>  5 files changed, 97 insertions(+), 20 deletions(-)
>>
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -1052,6 +1052,10 @@ struct btrfs_fs_info {
>>  struct btrfs_work qgroup_rescan_work;
>>  bool qgroup_rescan_running; /* protected by qgroup_rescan_lock */
>>  
>> +/* relocation recovery items */
>> +bool relocation_recovery_started;
>> +struct completion relocation_recovery_completion;
> 
> This seems to copy the pattern of qgroup rescan, the completion +
> workqueue. I'm planning to move this scheme to the fs_state bit instead
> of bool and the wait_for_war with global workqueue, but for now we can
> leave as it is here.

Such that we just put these jobs on a workqueue instead?

>> +if (err == 0) {
>> +struct btrfs_root *fs_root;
>> +
>> +/* cleanup orphan inode in data relocation tree */
>> +fs_root = read_fs_root(fs_info, BTRFS_DATA_RELOC_TREE_OBJECTID);
>> +if (IS_ERR(fs_root))
>> +err = PTR_ERR(fs_root);
>> +else
>> +err = btrfs_orphan_cleanup(fs_root);
>> +}
>> +mutex_unlock(&fs_info->cleaner_mutex);
>> +clear_bit(BTRFS_FS_EXCL_OP, &fs_info->flags);
> 
> The part that sets the bit is in the caller, btrfs_recover_relocation,
> but this can race if the kthread is too fast.
> 
> btrfs_recover_relocation
>   start kthread with btrfs_resume_relocation
>   clear_bit
>   set_bit
>   ...
> 
> now we're stuck with the EXCL_OP set without any operation actually running.
> 
> The bit can be set right before the kthread is started and cleared
> inside.

There's no opportunity to race since the thread can't run until
btrfs_recover_relocation returns and releases the cleaner mutex.

>> @@ -4620,16 +4670,21 @@ int btrfs_recover_relocation(struct btrf
>>  if (err)
>>  goto out_free;
>>  
>> -merge_reloc_roots(rc);
>> -
>> -unset_reloc_control(rc);
>> -
>> -trans = btrfs_join_transaction(rc->extent_root);
>> -if (IS_ERR(trans)) {
>> -err = PTR_ERR(trans);
>> +tsk = kthread_run(btrfs_resume_relocation, fs_info,
>> +  "relocation-recovery");
> 
> Would be good to name it 'btrfs-reloc-recovery', ie with btrfs in the
> name so it's easy greppable from the process list.

Right.  In an earlier version, I was using a btrfs_worker so that was
added automatically.

>> +if

[PATCH] btrfs: push relocation recovery into a helper thread

2018-04-17 Thread Jeff Mahoney
On a file system with many snapshots and qgroups enabled, an interrupted
balance can end up taking a long time to mount due to recovering the
relocations during mount.  It does this in the task performing the mount,
which can't be interrupted and may prevent mounting (and systems booting)
for a long time as well.  The thing is that as part of balance, this
runs in the background all the time.  This patch pushes the recovery
into a helper thread and allows the mount to continue normally.  We hold
off on resuming any paused balance operation until after the relocation
has completed, disallow any new balance operations if it's running, and
wait for it on umount and remounting read-only.

This doesn't do anything to address the relocation recovery operation
taking a long time but does allow the file system to mount.

Signed-off-by: Jeff Mahoney <je...@suse.com>
---
 fs/btrfs/ctree.h  |7 +++
 fs/btrfs/disk-io.c|7 ++-
 fs/btrfs/relocation.c |   92 +-
 fs/btrfs/super.c  |5 +-
 fs/btrfs/volumes.c|6 +++
 5 files changed, 97 insertions(+), 20 deletions(-)

--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1052,6 +1052,10 @@ struct btrfs_fs_info {
struct btrfs_work qgroup_rescan_work;
bool qgroup_rescan_running; /* protected by qgroup_rescan_lock */
 
+   /* relocation recovery items */
+   bool relocation_recovery_started;
+   struct completion relocation_recovery_completion;
+
/* filesystem state */
unsigned long fs_state;
 
@@ -3671,7 +3675,8 @@ int btrfs_init_reloc_root(struct btrfs_t
  struct btrfs_root *root);
 int btrfs_update_reloc_root(struct btrfs_trans_handle *trans,
struct btrfs_root *root);
-int btrfs_recover_relocation(struct btrfs_root *root);
+int btrfs_recover_relocation(struct btrfs_fs_info *fs_info);
+void btrfs_wait_for_relocation_completion(struct btrfs_fs_info *fs_info);
 int btrfs_reloc_clone_csums(struct inode *inode, u64 file_pos, u64 len);
 int btrfs_reloc_cow_block(struct btrfs_trans_handle *trans,
  struct btrfs_root *root, struct extent_buffer *buf,
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2999,7 +2999,7 @@ retry_root_backup:
goto fail_qgroup;
 
	mutex_lock(&fs_info->cleaner_mutex);
-   ret = btrfs_recover_relocation(tree_root);
+   ret = btrfs_recover_relocation(fs_info);
	mutex_unlock(&fs_info->cleaner_mutex);
if (ret < 0) {
btrfs_warn(fs_info, "failed to recover relocation: %d",
@@ -3017,7 +3017,8 @@ retry_root_backup:
if (IS_ERR(fs_info->fs_root)) {
err = PTR_ERR(fs_info->fs_root);
btrfs_warn(fs_info, "failed to read fs tree: %d", err);
-   goto fail_qgroup;
+   close_ctree(fs_info);
+   return err;
}
 
if (sb_rdonly(sb))
@@ -3778,6 +3779,8 @@ void close_ctree(struct btrfs_fs_info *f
/* wait for the qgroup rescan worker to stop */
btrfs_qgroup_wait_for_completion(fs_info, false);
 
+   btrfs_wait_for_relocation_completion(fs_info);
+
/* wait for the uuid_scan task to finish */
	down(&fs_info->uuid_tree_rescan_sem);
/* avoid complains from lockdep et al., set sem back to initial state */
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "ctree.h"
 #include "disk-io.h"
 #include "transaction.h"
@@ -4492,14 +4493,61 @@ static noinline_for_stack int mark_garba
 }
 
 /*
- * recover relocation interrupted by system crash.
- *
  * this function resumes merging reloc trees with corresponding fs trees.
  * this is important for keeping the sharing of tree blocks
  */
-int btrfs_recover_relocation(struct btrfs_root *root)
+static int
+btrfs_resume_relocation(void *data)
 {
-   struct btrfs_fs_info *fs_info = root->fs_info;
+   struct btrfs_fs_info *fs_info = data;
+   struct btrfs_trans_handle *trans;
+   struct reloc_control *rc = fs_info->reloc_ctl;
+   int err, ret;
+
+   btrfs_info(fs_info, "resuming relocation");
+
+   BUG_ON(!rc);
+
+   mutex_lock(&fs_info->cleaner_mutex);
+
+   merge_reloc_roots(rc);
+
+   unset_reloc_control(rc);
+
+   trans = btrfs_join_transaction(rc->extent_root);
+   if (IS_ERR(trans))
+   err = PTR_ERR(trans);
+   else {
+   ret = btrfs_commit_transaction(trans);
+   if (ret < 0)
+   err = ret;
+   }
+
+   kfree(rc);
+
+   if (err == 0) {
+   struct btrfs_root *fs_root;
+
+   /* cleanup orphan inode in data relocation tree */
+   fs_root = read_fs_root(fs_info, BTRFS_DATA_RELOC

Re: [PATCH v2] btrfs: Validate child tree block's level and first key

2018-03-23 Thread Jeff Mahoney
t; ret = PTR_ERR(eb);
>>> break;
>>> @@ -2036,6 +2040,8 @@ int walk_down_reloc_tree(struct btrfs_root *root, 
>>> struct btrfs_path *path,
>>> last_snapshot = btrfs_root_last_snapshot(&root->root_item);
>>>  
>>>     for (i = *level; i > 0; i--) {
>>> +   struct btrfs_key first_key;
>>> +
>>> eb = path->nodes[i];
>>> nritems = btrfs_header_nritems(eb);
>>> while (path->slots[i] < nritems) {
>>> @@ -2056,7 +2062,9 @@ int walk_down_reloc_tree(struct btrfs_root *root, 
>>> struct btrfs_path *path,
>>> }
>>>  
>>> bytenr = btrfs_node_blockptr(eb, path->slots[i]);
>>> -   eb = read_tree_block(fs_info, bytenr, ptr_gen);
>>> +   btrfs_node_key_to_cpu(eb, &first_key, path->slots[i]);
>>> +   eb = read_tree_block(fs_info, bytenr, ptr_gen, &first_key,
>>> +i - 1);
>>> if (IS_ERR(eb)) {
>>> return PTR_ERR(eb);
>>> } else if (!extent_buffer_uptodate(eb)) {
>>> @@ -2714,6 +2722,8 @@ static int do_relocation(struct btrfs_trans_handle 
>>> *trans,
>>> path->lowest_level = node->level + 1;
>>> rc->backref_cache.path[node->level] = node;
>>> list_for_each_entry(edge, &node->upper, list[LOWER]) {
>>> +   struct btrfs_key first_key;
>>> +
>>> cond_resched();
>>>  
>>> upper = edge->node[UPPER];
>>> @@ -2779,7 +2789,9 @@ static int do_relocation(struct btrfs_trans_handle 
>>> *trans,
>>>  
>>> blocksize = root->fs_info->nodesize;
>>> generation = btrfs_node_ptr_generation(upper->eb, slot);
>>> -   eb = read_tree_block(fs_info, bytenr, generation);
>>> +   btrfs_node_key_to_cpu(upper->eb, &first_key, slot);
>>> +   eb = read_tree_block(fs_info, bytenr, generation, &first_key,
>>> +upper->level - 1);
>>> if (IS_ERR(eb)) {
>>> err = PTR_ERR(eb);
>>> goto next;
>>> @@ -2944,7 +2956,8 @@ static int get_tree_block_key(struct btrfs_fs_info 
>>> *fs_info,
>>> struct extent_buffer *eb;
>>>  
>>> BUG_ON(block->key_ready);
>>> -   eb = read_tree_block(fs_info, block->bytenr, block->key.offset);
>>> +   eb = read_tree_block(fs_info, block->bytenr, block->key.offset, NULL,
>>> +0);
>>> if (IS_ERR(eb)) {
>>> return PTR_ERR(eb);
>>> } else if (!extent_buffer_uptodate(eb)) {
>>> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
>>> index 434457794c27..b98a1801b406 100644
>>> --- a/fs/btrfs/tree-log.c
>>> +++ b/fs/btrfs/tree-log.c
>>> @@ -304,7 +304,7 @@ static int process_one_buffer(struct btrfs_root *log,
>>>  * pin down any logged extents, so we have to read the block.
>>>  */
>>> if (btrfs_fs_incompat(fs_info, MIXED_GROUPS)) {
>>> -   ret = btrfs_read_buffer(eb, gen);
>>> +   ret = btrfs_read_buffer(eb, gen, NULL, 0);
>>> if (ret)
>>> return ret;
>>> }
>>> @@ -2420,7 +2420,7 @@ static int replay_one_buffer(struct btrfs_root *log, 
>>> struct extent_buffer *eb,
>>> int i;
>>> int ret;
>>>  
>>> -   ret = btrfs_read_buffer(eb, gen);
>>> +   ret = btrfs_read_buffer(eb, gen, NULL, 0);
>>> if (ret)
>>> return ret;
>>>  
>>> @@ -2537,6 +2537,8 @@ static noinline int walk_down_log_tree(struct 
>>> btrfs_trans_handle *trans,
>>> WARN_ON(*level >= BTRFS_MAX_LEVEL);
>>>  
>>> while (*level > 0) {
>>> +   struct btrfs_key first_key;
>>> +
>>> WARN_ON(*level < 0);
>>> WARN_ON(*level >= BTRFS_MAX_LEVEL);
>>> cur = path->nodes[*level];
>>> @@ -2549,6 +2551,7 @@ static noinline int walk_down_log_tree(struct 
>>> btrfs_trans_handle *trans,
>>>  
>>> bytenr = btrfs_node_blockptr(cur, path->slots[*level]);
>>> ptr_gen = btrfs_node_ptr_generation(cur, path->slots[*level]);
>>> +   btrfs_node_key_to_cpu(cur, &first_key, path->slots[*level]);
>>> blocksize = fs_info->nodesize;
>>>  
>>> parent = path->nodes[*level];
>>> @@ -2567,7 +2570,8 @@ static noinline int walk_down_log_tree(struct 
>>> btrfs_trans_handle *trans,
>>>  
>>> path->slots[*level]++;
>>> if (wc->free) {
>>> -   ret = btrfs_read_buffer(next, ptr_gen);
>>> +   ret = btrfs_read_buffer(next, ptr_gen,
>>> +   &first_key, *level - 1);
>>> if (ret) {
>>> free_extent_buffer(next);
>>> return ret;
>>> @@ -2597,7 +2601,7 @@ static noinline int walk_down_log_tree(struct 
>>> btrfs_trans_handle *trans,
>>> free_extent_buffer(next);
>>> continue;
>>> }
>>> -   ret = btrfs_read_buffer(next, ptr_gen);
>>> +   ret = btrfs_read_buffer(next, ptr_gen, _key, *level - 1);
>>> if (ret) {
>>> free_extent_buffer(next);
>>> return ret;
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
> 


-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH] btrfs: fix lockdep splat in btrfs_alloc_subvolume_writers

2018-03-19 Thread Jeff Mahoney
On 3/19/18 2:08 PM, David Sterba wrote:
> On Mon, Mar 19, 2018 at 01:52:05PM -0400, Jeff Mahoney wrote:
>> On 3/16/18 4:12 PM, David Sterba wrote:
>>> On Fri, Mar 16, 2018 at 02:36:27PM -0400, je...@suse.com wrote:
>>>> From: Jeff Mahoney <je...@suse.com>
>>>>
>>>> While running btrfs/011, I hit the following lockdep splat.
>>>>
>>>> This is the important bit:
>>>>pcpu_alloc+0x1ac/0x5e0
>>>>__percpu_counter_init+0x4e/0xb0
>>>>btrfs_init_fs_root+0x99/0x1c0 [btrfs]
>>>>btrfs_get_fs_root.part.54+0x5b/0x150 [btrfs]
>>>>resolve_indirect_refs+0x130/0x830 [btrfs]
>>>>find_parent_nodes+0x69e/0xff0 [btrfs]
>>>>btrfs_find_all_roots_safe+0xa0/0x110 [btrfs]
>>>>btrfs_find_all_roots+0x50/0x70 [btrfs]
>>>>btrfs_qgroup_prepare_account_extents+0x53/0x90 [btrfs]
>>>>btrfs_commit_transaction+0x3ce/0x9b0 [btrfs]
>>>>
>>>> The percpu_counter_init call in btrfs_alloc_subvolume_writers
>>>> uses GFP_KERNEL, which we can't do during transaction commit.
>>>>
>>>> This switches it to GFP_NOFS.
>>>
>>>> Signed-off-by: Jeff Mahoney <je...@suse.com>
>>>> ---
>>>>  fs/btrfs/disk-io.c | 2 +-
>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>>> index 21f34ad0d411..eb6bb3169a9e 100644
>>>> --- a/fs/btrfs/disk-io.c
>>>> +++ b/fs/btrfs/disk-io.c
>>>> @@ -1108,7 +1108,7 @@ static struct btrfs_subvolume_writers 
>>>> *btrfs_alloc_subvolume_writers(void)
>>>>if (!writers)
>>>>return ERR_PTR(-ENOMEM);
>>>>  
>>>> -  ret = percpu_counter_init(&writers->counter, 0, GFP_KERNEL);
>>>> +  ret = percpu_counter_init(&writers->counter, 0, GFP_NOFS);
>>>
>>> A line above the diff context is another allocation that does GFP_NOFS,
>>> so one of the gfp flags was wrong.
>>>
>>> Looks like there's another instance where percpu allocates with
>>> GFP_KERNEL: create_space_info that can be called from the path that
>>> allocates chunks, so this also looks like a NOFS candidate.
>>
>> We can get rid of this case entirely.  Those call sites should be
>> removed since the space_infos are all allocated at mount time.
> 
> That would be great and make a few things simpler. So this means that
> __find_space_info never fails once the space infos are properly
> initialized, right? That was my concern in do_chunk_alloc and
> btrfs_make_block_group (that's called from __btrfs_alloc_chunk).

That's a different case.  The raid levels are added when the first block
group of a particular raid level is loaded up.  That can happen when the
block groups are read in initially, where it should be safe to use
GFP_KERNEL or when a chunk of a new type is allocated.  The thing is
that a chunk of a new type will only be allocated when we're converting
via balance, so we may be able to do the kobject_add for the raid level
when we start the balance rather than wait for it to create the block group.

-Jeff


-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH] btrfs: fix lockdep splat in btrfs_alloc_subvolume_writers

2018-03-19 Thread Jeff Mahoney
On 3/16/18 4:12 PM, David Sterba wrote:
> On Fri, Mar 16, 2018 at 02:36:27PM -0400, je...@suse.com wrote:
>> From: Jeff Mahoney <je...@suse.com>
>>
>> While running btrfs/011, I hit the following lockdep splat.
>>
>> This is the important bit:
>>pcpu_alloc+0x1ac/0x5e0
>>__percpu_counter_init+0x4e/0xb0
>>btrfs_init_fs_root+0x99/0x1c0 [btrfs]
>>btrfs_get_fs_root.part.54+0x5b/0x150 [btrfs]
>>resolve_indirect_refs+0x130/0x830 [btrfs]
>>find_parent_nodes+0x69e/0xff0 [btrfs]
>>btrfs_find_all_roots_safe+0xa0/0x110 [btrfs]
>>btrfs_find_all_roots+0x50/0x70 [btrfs]
>>btrfs_qgroup_prepare_account_extents+0x53/0x90 [btrfs]
>>btrfs_commit_transaction+0x3ce/0x9b0 [btrfs]
>>
>> The percpu_counter_init call in btrfs_alloc_subvolume_writers
>> uses GFP_KERNEL, which we can't do during transaction commit.
>>
>> This switches it to GFP_NOFS.
> 
>> Signed-off-by: Jeff Mahoney <je...@suse.com>
>> ---
>>  fs/btrfs/disk-io.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 21f34ad0d411..eb6bb3169a9e 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -1108,7 +1108,7 @@ static struct btrfs_subvolume_writers 
>> *btrfs_alloc_subvolume_writers(void)
>>  if (!writers)
>>  return ERR_PTR(-ENOMEM);
>>  
>> -ret = percpu_counter_init(&writers->counter, 0, GFP_KERNEL);
>> +ret = percpu_counter_init(&writers->counter, 0, GFP_NOFS);
> 
> A line above the diff context is another allocation that does GFP_NOFS,
> so one of the gfp flags was wrong.
> 
> Looks like there's another instance where percpu allocates with
> GFP_KERNEL: create_space_info that can be called from the path that
> allocates chunks, so this also looks like a NOFS candidate.

We can get rid of this case entirely.  Those call sites should be
removed since the space_infos are all allocated at mount time.

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH] btrfs: fix lockdep splat in btrfs_alloc_subvolume_writers

2018-03-16 Thread Jeff Mahoney
On 3/16/18 4:12 PM, David Sterba wrote:
> On Fri, Mar 16, 2018 at 02:36:27PM -0400, je...@suse.com wrote:
>> From: Jeff Mahoney <je...@suse.com>
>>
>> While running btrfs/011, I hit the following lockdep splat.
>>
>> This is the important bit:
>>pcpu_alloc+0x1ac/0x5e0
>>__percpu_counter_init+0x4e/0xb0
>>btrfs_init_fs_root+0x99/0x1c0 [btrfs]
>>btrfs_get_fs_root.part.54+0x5b/0x150 [btrfs]
>>resolve_indirect_refs+0x130/0x830 [btrfs]
>>find_parent_nodes+0x69e/0xff0 [btrfs]
>>btrfs_find_all_roots_safe+0xa0/0x110 [btrfs]
>>btrfs_find_all_roots+0x50/0x70 [btrfs]
>>btrfs_qgroup_prepare_account_extents+0x53/0x90 [btrfs]
>>btrfs_commit_transaction+0x3ce/0x9b0 [btrfs]
>>
>> The percpu_counter_init call in btrfs_alloc_subvolume_writers
>> uses GFP_KERNEL, which we can't do during transaction commit.
>>
>> This switches it to GFP_NOFS.
> 
>> Signed-off-by: Jeff Mahoney <je...@suse.com>
>> ---
>>  fs/btrfs/disk-io.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 21f34ad0d411..eb6bb3169a9e 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -1108,7 +1108,7 @@ static struct btrfs_subvolume_writers 
>> *btrfs_alloc_subvolume_writers(void)
>>  if (!writers)
>>  return ERR_PTR(-ENOMEM);
>>  
>> -ret = percpu_counter_init(&writers->counter, 0, GFP_KERNEL);
>> +ret = percpu_counter_init(&writers->counter, 0, GFP_NOFS);
> 
> A line above the diff context is another allocation that does GFP_NOFS,
> so one of the gfp flags was wrong.

This one was wrong.  It was initially implicitly GFP_KERNEL until Tejun
added the gfp_t argument and used GFP_KERNEL for most of the sites.
Since that was effectively a no-op, it was the right thing for him to do
without asking every subsystem maintainer their preference.

> Looks like there's another instance where percpu allocates with
> GFP_KERNEL: create_space_info that can be called from the path that
> allocates chunks, so this also looks like a NOFS candidate.

That's probably for the same reason.

> And in the same function, there's another indirect and hidden GFP_KERNEL
> allocation from kobject_init_and_add. So in this case we can't fix all
> the gfp problems at the call site and will have to use the scoped
> approach eventually.

Yep.  That's not a huge barrier, though.  We can push the kobject_add
into a workqueue pretty easily.
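
A rough sketch of what I mean -- the struct and function names below are
made up for illustration, not taken from an actual patch:

	struct raid_kobj_work {
		struct work_struct work;
		struct btrfs_space_info *sinfo;
		int raid_index;
	};

	static void add_raid_kobj_fn(struct work_struct *work)
	{
		struct raid_kobj_work *rkw =
			container_of(work, struct raid_kobj_work, work);

		/*
		 * kobject_init_and_add() may allocate with GFP_KERNEL;
		 * doing it from a worker keeps that allocation out of the
		 * chunk allocation path entirely.
		 */
		/* kobject_init_and_add() for the new raid level goes here */
		kfree(rkw);
	}

	/* from the path that creates the first block group of a raid level: */
	INIT_WORK(&rkw->work, add_raid_kobj_fn);
	schedule_work(&rkw->work);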

> I haven't found any instance of such lockdep reports in my logs (over a
> long period), so it's quite unlikely to end up in the recursive
> allocation.
> 
> Patch added to next, thanks. 

When hunting to see if this had already been fixed, I did find two
reports.  One from Qu from April of last year and another from Mike
Galbraith in 2016.

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH] btrfs: fix lockdep splat in btrfs_alloc_subvolume_writers

2018-03-16 Thread Jeff Mahoney
On 3/16/18 2:48 PM, Nikolay Borisov wrote:
> 
> 
> On 16.03.2018 20:36, je...@suse.com wrote:
>> From: Jeff Mahoney <je...@suse.com>
>>
>> While running btrfs/011, I hit the following lockdep splat.
>>
>> This is the important bit:
>>pcpu_alloc+0x1ac/0x5e0
>>__percpu_counter_init+0x4e/0xb0
>>btrfs_init_fs_root+0x99/0x1c0 [btrfs]
>>btrfs_get_fs_root.part.54+0x5b/0x150 [btrfs]
>>resolve_indirect_refs+0x130/0x830 [btrfs]
>>find_parent_nodes+0x69e/0xff0 [btrfs]
>>btrfs_find_all_roots_safe+0xa0/0x110 [btrfs]
>>btrfs_find_all_roots+0x50/0x70 [btrfs]
>>btrfs_qgroup_prepare_account_extents+0x53/0x90 [btrfs]
>>btrfs_commit_transaction+0x3ce/0x9b0 [btrfs]
>>
>> The percpu_counter_init call in btrfs_alloc_subvolume_writers
>> uses GFP_KERNEL, which we can't do during transaction commit.
>>
>> This switches it to GFP_NOFS.
> 
> Given there is effort underway to actually kill GFP_NOFS and replace it
> with the context annotation routines, shouldn't instead use those
> routines directly ?

I don't think those have landed yet.  When they do, it should obsolete
the gfp flags here in any context since we can also read roots from code
that doesn't need GFP_NOFS.
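
For reference, the scoped annotation would look roughly like this at a
call site that can't tolerate fs reclaim (assuming the
memalloc_nofs_save()/memalloc_nofs_restore() pair from linux/sched/mm.h):

	unsigned int nofs_flag;

	nofs_flag = memalloc_nofs_save();
	/*
	 * Any allocation in this scope is implicitly treated as NOFS,
	 * so the nested percpu_counter_init() could stay GFP_KERNEL.
	 */
	ret = percpu_counter_init(&writers->counter, 0, GFP_KERNEL);
	memalloc_nofs_restore(nofs_flag);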

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH 13/20] btrfs-progs: use cmd_struct as command entry point

2018-03-11 Thread Jeff Mahoney
On 3/7/18 9:40 PM, je...@suse.com wrote:
> diff --git a/cmds-filesystem.c b/cmds-filesystem.c
> index 62112705..ec038f2f 100644
> --- a/cmds-filesystem.c
> +++ b/cmds-filesystem.c
> @@ -1075,6 +1078,7 @@ next:
>  
>   return !!defrag_global_errors;
>  }
> +static DEFINE_SIMPLE_COMMAND(filesystem_defrag, "defrag");

"defragment"

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH 13/20] btrfs-progs: use cmd_struct as command entry point

2018-03-11 Thread Jeff Mahoney
On 3/7/18 9:40 PM, je...@suse.com wrote:
> From: Jeff Mahoney <je...@suse.com>
> diff --git a/cmds-inspect.c b/cmds-inspect.c
> index afd7fe48..12f200b3 100644
> --- a/cmds-inspect.c
> +++ b/cmds-inspect.c
> @@ -625,33 +629,27 @@ static int cmd_inspect_min_dev_size(int argc, char 
> **argv)
>  out:
>   return !!ret;
>  }
> +static DEFINE_SIMPLE_COMMAND(inspect_min_dev_size, "min-dev-size");
>  
>  static const char inspect_cmd_group_info[] =
>  "query various internal information";
>  
> -const struct cmd_group inspect_cmd_group = {
> +static const struct cmd_group inspect_cmd_group = {
>   inspect_cmd_group_usage, inspect_cmd_group_info, {
> - { "inode-resolve", cmd_inspect_inode_resolve,
> - cmd_inspect_inode_resolve_usage, NULL, 0 },
> - { "logical-resolve", cmd_inspect_logical_resolve,
> - cmd_inspect_logical_resolve_usage, NULL, 0 },
> - { "subvolid-resolve", cmd_inspect_subvolid_resolve,
> - cmd_inspect_subvolid_resolve_usage, NULL, 0 },
> - { "rootid", cmd_inspect_rootid, cmd_inspect_rootid_usage, NULL,
> - 0 },
> - { "min-dev-size", cmd_inspect_min_dev_size,
> - cmd_inspect_min_dev_size_usage, NULL, 0 },
> - { "dump-tree", cmd_inspect_dump_tree,
> - cmd_inspect_dump_tree_usage, NULL, 0 },
> - { "dump-super", cmd_inspect_dump_super,
> - cmd_inspect_dump_super_usage, NULL, 0 },
> - { "tree-stats", cmd_inspect_tree_stats,
> - cmd_inspect_tree_stats_usage, NULL, 0 },
> - NULL_CMD_STRUCT
> + &cmd_struct_inspect_inode_resolve,
> + &cmd_struct_inspect_logical_resolve,
> + &cmd_struct_inspect_subvolid_resolve,
> + &cmd_struct_inspect_rootid,
> + &cmd_struct_inspect_min_dev_size,
> + &cmd_struct_inspect_dump_tree,
> + &cmd_struct_inspect_dump_super,
> + &cmd_struct_inspect_tree_stats,
> + NULL
>   }
>  };
>  
> -int cmd_inspect(int argc, char **argv)
> +static int cmd_inspect(int argc, char **argv)
>  {
>   return handle_command_group(_cmd_group, argc, argv);
>  }
> +DEFINE_GROUP_COMMAND(inspect, "inspect");

"inspect-internal"

-Jeff


-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH 08/20] btrfs-progs: qgroups: introduce btrfs_qgroup_query

2018-03-08 Thread Jeff Mahoney
On 3/8/18 12:54 AM, Qu Wenruo wrote:
> 
> 
> On 2018-03-08 10:40, je...@suse.com wrote:
>> From: Jeff Mahoney <je...@suse.com>
>>
>> The only mechanism we have in the progs for searching qgroups is to load
>> all of them and filter the results.  This works for qgroup show but
>> to add quota information to 'btrfs subvolume show' it's pretty wasteful.
>>
>> This patch splits out setting up the search and performing the search so
>> we can search for a single qgroupid more easily.  Since TREE_SEARCH
>> will give results that don't strictly match the search terms, we add
>> a filter to match only the results we care about.
>>
>> Signed-off-by: Jeff Mahoney <je...@suse.com>
>> ---
>>  qgroup.c | 143 
>> ---
>>  qgroup.h |   7 
>>  2 files changed, 116 insertions(+), 34 deletions(-)
>>
>> diff --git a/qgroup.c b/qgroup.c
>> index 57815718..d076b1de 100644
>> --- a/qgroup.c
>> +++ b/qgroup.c
>> @@ -1165,11 +1165,30 @@ static inline void print_status_flag_warning(u64 
>> flags)
>>  warning("qgroup data inconsistent, rescan recommended");
>>  }
>>  
>> -static int __qgroups_search(int fd, struct qgroup_lookup *qgroup_lookup)
>> +static bool key_in_range(const struct btrfs_key *key,
>> + const struct btrfs_ioctl_search_key *sk)
>> +{
>> +if (key->objectid < sk->min_objectid ||
>> +key->objectid > sk->max_objectid)
>> +return false;
>> +
>> +if (key->type < sk->min_type ||
>> +key->type > sk->max_type)
>> +return false;
>> +
>> +if (key->offset < sk->min_offset ||
>> +key->offset > sk->max_offset)
>> +return false;
>> +
>> +return true;
>> +}
> 
> Even with the key_in_range() check here, we are still following the tree
> search slice behavior:
> 
> tree search will still gives us all the items in key range from
> (min_objectid, min_type, min_offset) to
> (max_objectid, max_type, max_offset).
> 
> I don't see much difference between the tree search ioctl and this one.

It's fundamentally different.

The one in the kernel has a silly interface.  It should be min_key and
max_key since the components aren't evaluated independently.  It
effectively treats min_key and max_key as 136-bit values and returns
everything between them, inclusive.  That's the slice behavior, as you
call it.

This key_in_range treats each component separately and acts as a filter
on the slice returned from the kernel.  If we request min/max_offset =
259, the caller will not get anything without offset = 259.
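
As a hypothetical illustration of that: with min_offset = max_offset = 259
and the full objectid range, the kernel's slice can still hand back an item
like (260 BTRFS_QGROUP_RELATION_KEY 5), because only the two endpoint keys
bound the slice.  Inside the search loop, the filter then drops it:

	struct btrfs_key key = {
		.objectid = 260,
		.type = BTRFS_QGROUP_RELATION_KEY,
		.offset = 5,
	};

	if (!key_in_range(&key, &filter_key))
		continue;	/* in the slice, but not what the caller asked for */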

I suppose, ultimately, this could also be done using a filter on the
rbtree using the existing interface but that seems even more wasteful.

-Jeff

>> +
>> +static int __qgroups_search(int fd, struct btrfs_ioctl_search_args *args,
>> +struct qgroup_lookup *qgroup_lookup)
>>  {
>>  int ret;
>> -struct btrfs_ioctl_search_args args;
>> -struct btrfs_ioctl_search_key *sk = &args.key;
>> +struct btrfs_ioctl_search_key *sk = &args->key;
>> +struct btrfs_ioctl_search_key filter_key = args->key;
>>  struct btrfs_ioctl_search_header *sh;
>>  unsigned long off = 0;
>>  unsigned int i;
>> @@ -1180,30 +1199,15 @@ static int __qgroups_search(int fd, struct 
>> qgroup_lookup *qgroup_lookup)
>>  u64 qgroupid;
>>  u64 qgroupid1;
>>  
>> -memset(&args, 0, sizeof(args));
>> -
>> -sk->tree_id = BTRFS_QUOTA_TREE_OBJECTID;
>> -sk->max_type = BTRFS_QGROUP_RELATION_KEY;
>> -sk->min_type = BTRFS_QGROUP_STATUS_KEY;
>> -sk->max_objectid = (u64)-1;
>> -sk->max_offset = (u64)-1;
>> -sk->max_transid = (u64)-1;
>> -sk->nr_items = 4096;
>> -
>>  qgroup_lookup_init(qgroup_lookup);
>>  
>>  while (1) {
>> -ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args);
>> +ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, args);
>>  if (ret < 0) {
>> -if (errno == ENOENT) {
>> -error("can't list qgroups: quotas not enabled");
>> +if (errno == ENOENT)
>>  ret = -ENOTTY;
>> -} else {
>> -error("can't list qgroups: %s",
>> -   strerror(errno));
>> +else
>

Re: [PATCH 06/20] btrfs-progs: qgroups: add pathname to show output

2018-03-08 Thread Jeff Mahoney
On 3/8/18 12:33 AM, Qu Wenruo wrote:
> 
> 
> On 2018-03-08 10:40, je...@suse.com wrote:
>> From: Jeff Mahoney <je...@suse.com>
>>
>> The btrfs qgroup show command currently only exports qgroup IDs,
>> forcing the user to resolve which subvolume each corresponds to.
>>
>> This patch adds pathname resolution to qgroup show so that when
>> the -P option is used, the last column contains the pathname of
>> the root of the subvolume it describes.  In the case of nested
>> qgroups, it will show the number of member qgroups or the paths
>> of the members if the -v option is used.
>>
>> Pathname can also be used as a sort parameter.
>>
>> Signed-off-by: Jeff Mahoney <je...@suse.com>
> 
> Reviewed-by: Qu Wenruo <w...@suse.com>
> 
> Except one nitpick inlined below.
> 
> [snip]
>>  }
>> +if (bq->pathname)
>> +free((void *)bq->pathname);
> 
> What about just free(bq->pathname);?
> 
> Is this (void *) used to get around the const prefix?

Yes.

Thanks,

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH 3/8] btrfs-progs: constify pathnames passed as arguments

2018-03-07 Thread Jeff Mahoney
On 3/7/18 3:17 AM, Nikolay Borisov wrote:
> 
> 
> On  2.03.2018 20:46, je...@suse.com wrote:
>> From: Jeff Mahoney <je...@suse.com>
>>
>> It's unlikely we're going to modify a pathname argument, so codify that
>> and use const.
>>
>> Signed-off-by: Jeff Mahoney <je...@suse.com>
>> ---
>>  chunk-recover.c | 4 ++--
>>  cmds-device.c   | 2 +-
>>  cmds-fi-usage.c | 6 +++---
>>  cmds-rescue.c   | 4 ++--
>>  send-utils.c| 4 ++--
>>  5 files changed, 10 insertions(+), 10 deletions(-)
>>
>> diff --git a/chunk-recover.c b/chunk-recover.c
>> index 705bcf52..1d30db51 100644
>> --- a/chunk-recover.c
>> +++ b/chunk-recover.c
>> @@ -1492,7 +1492,7 @@ out:
>>  return ERR_PTR(ret);
>>  }
>>  
>> -static int recover_prepare(struct recover_control *rc, char *path)
>> +static int recover_prepare(struct recover_control *rc, const char *path)
>>  {
>>  int ret;
>>  int fd;
>> @@ -2296,7 +2296,7 @@ static void validate_rebuild_chunks(struct 
>> recover_control *rc)
>>  /*
>>   * Return 0 when successful, < 0 on error and > 0 if aborted by user
>>   */
>> -int btrfs_recover_chunk_tree(char *path, int verbose, int yes)
>> +int btrfs_recover_chunk_tree(const char *path, int verbose, int yes)
>>  {
>>  int ret = 0;
>>  struct btrfs_root *root = NULL;
>> diff --git a/cmds-device.c b/cmds-device.c
>> index 86459d1b..a49c9d9d 100644
>> --- a/cmds-device.c
>> +++ b/cmds-device.c
>> @@ -526,7 +526,7 @@ static const char * const cmd_device_usage_usage[] = {
>>  NULL
>>  };
>>  
>> -static int _cmd_device_usage(int fd, char *path, unsigned unit_mode)
>> +static int _cmd_device_usage(int fd, const char *path, unsigned unit_mode)
> 
> Actually the path parameter is not used in this function at all, I'd say
> just remove it.

Yep, it's unused, but that's a different project.  Add
-Wunused-parameter and see what shakes out. :)

>>  {
>>  int i;
>>  int ret = 0;> diff --git a/cmds-fi-usage.c b/cmds-fi-usage.c
>> index de7ad668..9a1c76ab 100644
>> --- a/cmds-fi-usage.c
>> +++ b/cmds-fi-usage.c
>> @@ -227,7 +227,7 @@ static int cmp_btrfs_ioctl_space_info(const void *a, 
>> const void *b)
>>  /*
>>   * This function load all the information about the space usage
>>   */
>> -static struct btrfs_ioctl_space_args *load_space_info(int fd, char *path)
>> +static struct btrfs_ioctl_space_args *load_space_info(int fd, const char 
>> *path)
>>  {
>>  struct btrfs_ioctl_space_args *sargs = NULL, *sargs_orig = NULL;
>>  int ret, count;
>> @@ -305,7 +305,7 @@ static void get_raid56_used(struct chunk_info *chunks, 
>> int chunkcount,
>>  #define MIN_UNALOCATED_THRESH   SZ_16M
>>  static int print_filesystem_usage_overall(int fd, struct chunk_info 
>> *chunkinfo,
>>  int chunkcount, struct device_info *devinfo, int devcount,
>> -char *path, unsigned unit_mode)
>> +const char *path, unsigned unit_mode)
>>  {
>>  struct btrfs_ioctl_space_args *sargs = NULL;
>>  int i;
>> @@ -931,7 +931,7 @@ static void _cmd_filesystem_usage_linear(unsigned 
>> unit_mode,
>>  static int print_filesystem_usage_by_chunk(int fd,
>>  struct chunk_info *chunkinfo, int chunkcount,
>>  struct device_info *devinfo, int devcount,
>> -char *path, unsigned unit_mode, int tabular)
>> +const char *path, unsigned unit_mode, int tabular)
>>  {
>>  struct btrfs_ioctl_space_args *sargs;
>>  int ret = 0;
>> diff --git a/cmds-rescue.c b/cmds-rescue.c
>> index c40088ad..c61145bc 100644
>> --- a/cmds-rescue.c
>> +++ b/cmds-rescue.c
>> @@ -32,8 +32,8 @@ static const char * const rescue_cmd_group_usage[] = {
>>  NULL
>>  };
>>  
>> -int btrfs_recover_chunk_tree(char *path, int verbose, int yes);
>> -int btrfs_recover_superblocks(char *path, int verbose, int yes);
>> +int btrfs_recover_chunk_tree(const char *path, int verbose, int yes);
> 
> That path argument is being passed to recover_prepare which can alo use
> a const to its path parameter

Yep, and it was in the first chunk.

>> +int btrfs_recover_superblocks(const char *path, int verbose, int yes);
>>  
>>  static const char * const cmd_rescue_chunk_recover_usage[] = {
>>  "btrfs rescue chunk-recover [options] ",
>> diff --git a/send-utils.c b/send-utils.c
>> index b5289e76..8ce94de1 100644
>> --- a/send-utils.c

Re: [PATCH 6/8] btrfs-progs: qgroups: introduce btrfs_qgroup_query

2018-03-07 Thread Jeff Mahoney
On 3/7/18 3:02 AM, Misono, Tomohiro wrote:
> On 2018/03/03 3:47, je...@suse.com wrote:
>> From: Jeff Mahoney <je...@suse.com>
>>
>> The only mechanism we have in the progs for searching qgroups is to load
>> all of them and filter the results.  This works for qgroup show but
>> to add quota information to 'btrfs subvolume show' it's pretty wasteful.
>>
>> This patch splits out setting up the search and performing the search so
>> we can search for a single qgroupid more easily.
>>
>> Signed-off-by: Jeff Mahoney <je...@suse.com>
>> ---
>>  qgroup.c | 98 
>> +---
>>  qgroup.h |  7 +
>>  2 files changed, 77 insertions(+), 28 deletions(-)
>>
>> diff --git a/qgroup.c b/qgroup.c
>> index b1be3311..2d0a6947 100644
>> --- a/qgroup.c
>> +++ b/qgroup.c
>> @@ -1146,11 +1146,11 @@ static inline void print_status_flag_warning(u64 
>> flags)
>>  warning("qgroup data inconsistent, rescan recommended");
>>  }
>>  
>> -static int __qgroups_search(int fd, struct qgroup_lookup *qgroup_lookup)
>> +static int __qgroups_search(int fd, struct btrfs_ioctl_search_args *args,
>> +struct qgroup_lookup *qgroup_lookup)
>>  {
>>  int ret;
>> -struct btrfs_ioctl_search_args args;
>> -struct btrfs_ioctl_search_key *sk = &args.key;
>> +struct btrfs_ioctl_search_key *sk = &args->key;
>>  struct btrfs_ioctl_search_header *sh;
>>  unsigned long off = 0;
>>  unsigned int i;
>> @@ -1161,30 +1161,12 @@ static int __qgroups_search(int fd, struct 
>> qgroup_lookup *qgroup_lookup)
>>  u64 qgroupid;
>>  u64 qgroupid1;
>>  
>> -memset(&args, 0, sizeof(args));
>> -
>> -sk->tree_id = BTRFS_QUOTA_TREE_OBJECTID;
>> -sk->max_type = BTRFS_QGROUP_RELATION_KEY;
>> -sk->min_type = BTRFS_QGROUP_STATUS_KEY;
>> -sk->max_objectid = (u64)-1;
>> -sk->max_offset = (u64)-1;
>> -sk->max_transid = (u64)-1;
>> -sk->nr_items = 4096;
>> -
>>  qgroup_lookup_init(qgroup_lookup);
>>  
>>  while (1) {
>> -ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args);
>> +ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, args);
>>  if (ret < 0) {
>> -if (errno == ENOENT) {
>> -error("can't list qgroups: quotas not enabled");
>> -ret = -ENOTTY;
>> -} else {
>> -error("can't list qgroups: %s",
>> -   strerror(errno));
>> -ret = -errno;
>> -}
>> -
>> +ret = -errno;
> 
> Originally, -ENOTTY would be returned when qgroup is disabled
> but this changes to return -ENOENT. so, it seems that error check
> in 7th patch would not work correctly when qgroup is disabled.
> 

Good catch.
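
One way to keep the old behavior, sketched below under the assumption
that the loop structure from the patch above stays as-is, is to translate
the kernel's ENOENT (no quota tree) back to -ENOTTY in the helper so the
existing caller checks keep working:

        ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, args);
        if (ret < 0) {
                /* no quota tree means quotas are not enabled */
                ret = (errno == ENOENT) ? -ENOTTY : -errno;
                break;
        }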

Thanks,

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH 7/8] btrfs-progs: subvolume: add quota info to btrfs sub show

2018-03-07 Thread Jeff Mahoney
On 3/7/18 1:09 AM, Qu Wenruo wrote:
> 
> 
> On 2018年03月03日 02:47, je...@suse.com wrote:
>> From: Jeff Mahoney <je...@suse.com>
>>
>> This patch reports on the first-level qgroup, if any, associated with
>> a particular subvolume.  It displays the usage and limit, subject
>> to the usual unit parameters.
>>
>> Signed-off-by: Jeff Mahoney <je...@suse.com>
>> ---
>>  cmds-subvolume.c | 46 ++
>>  1 file changed, 46 insertions(+)
>>
>> diff --git a/cmds-subvolume.c b/cmds-subvolume.c
>> index 8a473f7a..29d0e0e5 100644
>> --- a/cmds-subvolume.c
>> +++ b/cmds-subvolume.c
>> @@ -972,6 +972,7 @@ static const char * const cmd_subvol_show_usage[] = {
>>  "Show more information about the subvolume",
>>  "-r|--rootid   rootid of the subvolume",
>>  "-u|--uuid uuid of the subvolume",
>> +HELPINFO_UNITS_SHORT_LONG,
>>  "",
>>  "If no option is specified,  will be shown, otherwise",
>>  "the rootid or uuid are resolved relative to the  path.",
>> @@ -993,6 +994,13 @@ static int cmd_subvol_show(int argc, char **argv)
>>  int by_uuid = 0;
>>  u64 rootid_arg;
>>  u8 uuid_arg[BTRFS_UUID_SIZE];
>> +struct btrfs_qgroup_stats stats;
>> +unsigned int unit_mode;
>> +const char *referenced_size;
>> +const char *referenced_limit_size = "-";
>> +unsigned field_width = 0;
>> +
>> +unit_mode = get_unit_mode_from_arg(&argc, argv, 1);
>>  
>>  while (1) {
>>  int c;
>> @@ -1112,6 +1120,44 @@ static int cmd_subvol_show(int argc, char **argv)
>>  btrfs_list_subvols_print(fd, filter_set, NULL, BTRFS_LIST_LAYOUT_RAW,
>>  1, raw_prefix);
>>  
>> +ret = btrfs_qgroup_query(fd, get_ri.root_id, &stats);
>> +if (ret < 0) {
>> +if (ret == -ENODATA)
>> +printf("Quotas must be enabled for per-subvolume 
>> usage\n");
> 
> This seems a little confusing.
> If quota is disabled, we get ENOTTY not ENODATA.
> 
> For ENODATA we know quota is enabled but just no info for this qgroup.

Yep.
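
Something along these lines would keep the two cases apart (a sketch
only; the messages are placeholders, not final wording):

        ret = btrfs_qgroup_query(fd, get_ri.root_id, &stats);
        if (ret == -ENOTTY) {
                /* quotas not enabled on this filesystem at all */
                printf("Quotas are not enabled\n");
                goto out;
        } else if (ret == -ENODATA) {
                /* quotas enabled, but no qgroup item for this subvolume */
                printf("No qgroup data for this subvolume\n");
                goto out;
        } else if (ret < 0) {
                fprintf(stderr, "\nERROR: quota query failed: %s\n",
                        strerror(-ret));
                goto out;
        }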

Thanks,

-Jeff


> Thanks,
> Qu
> 
>> +else if (ret != -ENOTTY)
>> +fprintf(stderr,
>> +"\nERROR: BTRFS_IOC_QUOTA_QUERY failed: %s\n",
>> +strerror(errno));
>> +goto out;
>> +}
>> +
>> +printf("\tQuota Usage:\t\t");
>> +fflush(stdout);
>> +
>> +referenced_size = pretty_size_mode(stats.info.referenced, unit_mode);
>> +if (stats.limit.max_referenced)
>> +   referenced_limit_size = pretty_size_mode(
>> +stats.limit.max_referenced,
>> +unit_mode);
>> +field_width = max(strlen(referenced_size),
>> +  strlen(referenced_limit_size));
>> +
>> +printf("%-*s referenced, %s exclusive\n ", field_width,
>> +   referenced_size,
>> +   pretty_size_mode(stats.info.exclusive, unit_mode));
>> +
>> +printf("\tQuota Limits:\t\t");
>> +if (stats.limit.max_referenced || stats.limit.max_exclusive) {
>> +const char *excl = "-";
>> +
>> +if (stats.limit.max_exclusive)
>> +   excl = pretty_size_mode(stats.limit.max_exclusive,
>> +   unit_mode);
>> +printf("%-*s referenced, %s exclusive\n", field_width,
>> +   referenced_limit_size, excl);
>> +} else
>> +printf("None\n");
>> +
>>  out:
>>  /* clean up */
>>  free(get_ri.path);
>>
> 


-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH 6/8] btrfs-progs: qgroups: introduce btrfs_qgroup_query

2018-03-07 Thread Jeff Mahoney
On 3/7/18 12:58 AM, Qu Wenruo wrote:
> 
> 
> On 2018年03月03日 02:47, je...@suse.com wrote:
>> diff --git a/qgroup.c b/qgroup.c
>> index b1be3311..2d0a6947 100644
>> --- a/qgroup.c
>> +++ b/qgroup.c
>> @@ -1267,6 +1249,66 @@ static int __qgroups_search(int fd, struct 
>> qgroup_lookup *qgroup_lookup)
>>  return ret;
>>  }
>>  
>> +static int qgroups_search_all(int fd, struct qgroup_lookup *qgroup_lookup)
>> +{
>> +struct btrfs_ioctl_search_args args = {
>> +.key = {
>> +.tree_id = BTRFS_QUOTA_TREE_OBJECTID,
>> +.max_type = BTRFS_QGROUP_RELATION_KEY,
>> +.min_type = BTRFS_QGROUP_STATUS_KEY,
>> +.max_objectid = (u64)-1,
>> +.max_offset = (u64)-1,
>> +.max_transid = (u64)-1,
>> +.nr_items = 4096,
>> +},
>> +};
>> +int ret;
>> +
>> +ret = __qgroups_search(fd, &args, qgroup_lookup);
>> +if (ret == -ENOTTY)
>> +error("can't list qgroups: quotas not enabled");
>> +else if (ret < 0)
>> +error("can't list qgroups: %s", strerror(-ret));
>> +return ret;
>> +}
>> +
>> +int btrfs_qgroup_query(int fd, u64 qgroupid, struct btrfs_qgroup_stats 
>> *stats)
>> +{
>> +struct btrfs_ioctl_search_args args = {
>> +.key = {
>> +.tree_id = BTRFS_QUOTA_TREE_OBJECTID,
>> +.min_type = BTRFS_QGROUP_INFO_KEY,
>> +.max_type = BTRFS_QGROUP_LIMIT_KEY,
>> +.max_objectid = 0,
>> +.max_offset = qgroupid,
>> +.max_transid = (u64)-1,
>> +.nr_items = 4096, /* should be 2, i think */
> 
> 2 is not correct in fact.
> 
> As QGROUP_INFO is smaller than QGROUP_LIMIT, to get a slice of all what
> we need, we need to include all other unrelated items.
> 
> One example will be:
>   item 1 key (0 QGROUP_INFO 0/5) itemoff 16211 itemsize 40
>   item 2 key (0 QGROUP_INFO 0/257) itemoff 16171 itemsize 40
>   item 3 key (0 QGROUP_INFO 1/1) itemoff 16131 itemsize 40
>   item 4 key (0 QGROUP_LIMIT 0/5) itemoff 16091 itemsize 40
>   item 5 key (0 QGROUP_LIMIT 0/257) itemoff 16051 itemsize 40
>   item 6 key (0 QGROUP_LIMIT 1/1) itemoff 16011 itemsize 40
> 
> To query qgroup info about 0/257, above setup will get the following slice:
>   item 1 key (0 QGROUP_INFO 0/5) itemoff 16211 itemsize 40
>   item 2 key (0 QGROUP_INFO 0/257) itemoff 16171 itemsize 40
>   item 3 key (0 QGROUP_INFO 1/1) itemoff 16131 itemsize 40
>   item 4 key (0 QGROUP_LIMIT 0/5) itemoff 16091 itemsize 40
>   item 5 key (0 QGROUP_LIMIT 0/257) itemoff 16051 itemsize 40
> So we still need that large @nr_items.
> 
> Despite this comment it looks good.

Of course.  I use TREE_SEARCH so infrequently that I forget about this
every time so the pain is always fresh.

It should be .min_offset = qgroupid, .nr_items = 2, but of course that
doesn't work either, for different reasons.  __qgroups_search's loop keeps
iterating until the ioctl comes back with no more results, and it resets
nr_items to 4096 at the end of each pass.  The key comparison in the ioctl
only does a regular key comparison and offset doesn't get evaluated if
the types aren't equal.  That works fine when doing tree insertion or
searches for a single key but is wrong for searching for a range.  I
have a TREE_SEARCH_V3 lying around somewhere to address this ridiculous
behavior and should probably finish it up at some point.

This hasn't mattered for __qgroup_search until now since it hasn't used
anything other than -1 for the offset and objectid so I'll just add a
filter there.
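
For illustration, the filter could be a small predicate like the one
below.  The field names follow struct btrfs_ioctl_search_header; how it
gets wired into __qgroups_search()'s item loop is left open, so treat
this as a sketch rather than the final patch:

        static int qgroup_item_matches(struct btrfs_ioctl_search_header *sh,
                                       u64 qgroupid)
        {
                /* only per-qgroup items carry the numbers we want */
                if (sh->type != BTRFS_QGROUP_INFO_KEY &&
                    sh->type != BTRFS_QGROUP_LIMIT_KEY)
                        return 0;
                /* the qgroupid is stored in the key offset */
                return sh->offset == qgroupid;
        }

The item loop would then skip anything for which this returns 0 instead
of adding it to the lookup tree.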

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH 4/8] btrfs-progs: qgroups: add pathname to show output

2018-03-07 Thread Jeff Mahoney
On 3/7/18 12:45 AM, Qu Wenruo wrote:
> 
> 
> On 2018年03月03日 02:47, je...@suse.com wrote:
>> diff --git a/cmds-qgroup.c b/cmds-qgroup.c
>> index 48686436..94cd0fd3 100644
>> --- a/cmds-qgroup.c
>> +++ b/cmds-qgroup.c
>> @@ -280,8 +280,10 @@ static const char * const cmd_qgroup_show_usage[] = {
>>  "   (including ancestral qgroups)",
>>  "-f list all qgroups which impact the given path",
>>  "   (excluding ancestral qgroups)",
>> +"-P print first-level qgroups using pathname",
>> +"-v verbose, prints all nested subvolumes",
> 
> Did you mean the subvolume paths of all children qgroups?

Yes.  I'll make that clearer.

>>  HELPINFO_UNITS_LONG,
>> -"--sort=qgroupid,rfer,excl,max_rfer,max_excl",
>> +"--sort=qgroupid,rfer,excl,max_rfer,max_excl,pathname",
>>  "   list qgroups sorted by specified items",
>>  "   you can use '+' or '-' in front of each item.",
>>  "   (+:ascending, -:descending, ascending default)",

>> diff --git a/qgroup.c b/qgroup.c
>> index 67bc0738..83918134 100644
>> --- a/qgroup.c
>> +++ b/qgroup.c
>> @@ -210,8 +220,42 @@ static void print_qgroup_column_add_blank(enum 
>> btrfs_qgroup_column_enum column,
>>  printf(" ");
>>  }
>>  
>> +void print_pathname_column(struct btrfs_qgroup *qgroup, bool verbose)
>> +{
>> +struct btrfs_qgroup_list *list = NULL;
>> +
>> +fputs("  ", stdout);
>> +if (btrfs_qgroup_level(qgroup->qgroupid) > 0) {
>> +int count = 0;
> 
> Newline after declaration please.

Ack.

> And declaration in if() {} block doesn't pass checkpatch IIRC.

Declarations in if () {} are fine.

>> +list_for_each_entry(list, &qgroup->qgroups,
>> +next_qgroup) {
>> +if (verbose) {
>> +struct btrfs_qgroup *member = list->qgroup;
> 
> Same coding style problem here.

Ack.

>> +u64 level = 
>> btrfs_qgroup_level(member->qgroupid);
>> +u64 sid = btrfs_qgroup_subvid(member->qgroupid);
>> +if (count)
>> +fputs(" ", stdout);
>> +if (btrfs_qgroup_level(member->qgroupid) == 0)
>> +fputs(member->pathname, stdout);
> 
> What about checking member->pathname before outputting?
> As it could be missing.

Yep.

>> +static const char *qgroup_pathname(int fd, u64 qgroupid)
>> +{
>> +struct root_info root_info;
>> +int ret;
>> +char *pathname;
>> +
>> +ret = get_subvol_info_by_rootid_fd(fd, &root_info, qgroupid);

This is a leak too.  Callers are responsible for freeing the root_info
paths.  With this and your review fixed, valgrind passes with 0 leaks
for btrfs qgroups show -P.

>> +if (ret)
>> +return NULL;
>> +
>> +ret = asprintf(&pathname, "%s%s",
>> +   root_info.full_path[0] == '/' ? "" : "/",
>> +   root_info.full_path);
>> +if (ret < 0)
>> +return NULL;
>> +
>> +return pathname;
>> +}
>> +
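
Following up on the leak noted above, a minimal sketch of the cleanup.
The name/path/full_path members are assumptions based on btrfs-progs'
struct root_info, so the exact free calls may differ:

        static const char *qgroup_pathname(int fd, u64 qgroupid)
        {
                struct root_info root_info;
                char *pathname = NULL;
                int ret;

                ret = get_subvol_info_by_rootid_fd(fd, &root_info, qgroupid);
                if (ret)
                        return NULL;

                ret = asprintf(&pathname, "%s%s",
                               root_info.full_path[0] == '/' ? "" : "/",
                               root_info.full_path);

                /* the lookup allocated these; drop them before returning */
                free(root_info.name);
                free(root_info.path);
                free(root_info.full_path);

                return ret < 0 ? NULL : pathname;
        }
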
>>  /*
>>   * Lookup or insert btrfs_qgroup into qgroup_lookup.
>>   *
>> @@ -588,7 +697,7 @@ static struct btrfs_qgroup *qgroup_tree_search(struct 
>> qgroup_lookup *root_tree,
>>   * Return the pointer to the btrfs_qgroup if found or if inserted 
>> successfully.
>>   * Return ERR_PTR if any error occurred.
>>   */
>> -static struct btrfs_qgroup *get_or_add_qgroup(
>> +static struct btrfs_qgroup *get_or_add_qgroup(int fd,
>>  struct qgroup_lookup *qgroup_lookup, u64 qgroupid)
>>  {
>>  struct btrfs_qgroup *bq;
>> @@ -608,6 +717,8 @@ static struct btrfs_qgroup *get_or_add_qgroup(
>>  INIT_LIST_HEAD(>qgroups);
>>  INIT_LIST_HEAD(>members);
>>  
>> +bq->pathname = qgroup_pathname(fd, qgroupid);
>> +
> 
> Here qgroup_pathname() will allocate memory, but no one is freeing this
> memory.
> 
> The cleaner should be in __free_btrfs_qgroup() but there is no
> modification to that function.

Ack.

Thanks for the review,

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH 8/8] btrfs-progs: qgroups: export qgroups usage information as JSON

2018-03-07 Thread Jeff Mahoney
On 3/7/18 1:34 AM, Qu Wenruo wrote:
> 
> 
> On 2018年03月03日 02:47, je...@suse.com wrote:
>> diff --git a/configure.ac b/configure.ac
>> index 56d17c3a..6aec672a 100644
>> --- a/configure.ac
>> +++ b/configure.ac
>> @@ -197,6 +197,12 @@ PKG_STATIC(UUID_LIBS_STATIC, [uuid])
>>  PKG_CHECK_MODULES(ZLIB, [zlib])
>>  PKG_STATIC(ZLIB_LIBS_STATIC, [zlib])
>>  
>> +PKG_CHECK_MODULES(JSON, [json-c], [
> 
> Json-c is quite common and also used by cryptsetup, so pretty good
> library choice.

Yep, so that puts it in the base system packages of most distros.

>> diff --git a/qgroup.c b/qgroup.c
>> index 2d0a6947..f632a45c 100644
>> --- a/qgroup.c
>> +++ b/qgroup.c
>>  return ret;
>>  }
>>  
>> +#ifdef HAVE_JSON
>> +static void format_qgroupid(char *buf, size_t size, u64 qgroupid)
>> +{
>> +int ret;
>> +
>> +ret = snprintf(buf, size, "%llu/%llu",
>> +   btrfs_qgroup_level(qgroupid),
>> +   btrfs_qgroup_subvid(qgroupid));
>> +ASSERT(ret < sizeof(buf));
> 
> This is designed to catch truncated snprintf(), right?
> This can be addressed by setting up the @buf properly.
> (See below)
> 
> And in fact, due to that super magic number, we won't hit this ASSERT()
> anyway.

Yep, but ASSERTs are there to detect/prevent developer mistakes.  This
should only trigger due to a simple bug, so we ASSERT rather than handle
the error gracefully.

[...]

>> +static bool export_one_qgroup(json_object *container,
>> + const struct btrfs_qgroup *qgroup, bool compat)
>> +{
>> +json_object *obj = json_object_new_object();
>> +json_object *tmp;
>> +char buf[42];
> 
> Answer to the ultimate question of life, the universe, and everything. :)
> 
> Although according to the format level/subvolid, it should be
> count_digits(MAX_U16) + 1 + count_digits(MAX_U48) + 1. (1 for '/' and 1
> for '\n')
> 
> Could be defined as a macro as:
> #define QGROUP_FORMAT_BUF_LEN (count_digits(1ULL<<16) + 1 + \
>count_digits(1ULL<<48) + 1)

Which would mean we'd execute count_digits twice for every qgroup to
resolve a constant.  I'll replace the magic number with a define though.

> BTW, the result is just 22.

It's a worst-case.  We're using %llu, so 42 is the length of two 64-bit
numbers in base ten, plus the slash and nul terminator.  In practice we
won't hit the limit, but it doesn't hurt.
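
For concreteness, a sketch of the define-based version; the macro name
and the exact bound are illustrative rather than final:

        /* two u64s in base ten (20 digits each), a '/', and the NUL */
        #define BTRFS_QGROUP_ID_MAX_LEN (20 + 1 + 20 + 1)

        static void format_qgroupid(char *buf, size_t size, u64 qgroupid)
        {
                int ret = snprintf(buf, size, "%llu/%llu",
                                   btrfs_qgroup_level(qgroupid),
                                   btrfs_qgroup_subvid(qgroupid));

                /* catch truncation against the real size, not sizeof(buf) */
                ASSERT(ret > 0 && (size_t)ret < size);
        }

A caller declares char buf[BTRFS_QGROUP_ID_MAX_LEN] and passes
sizeof(buf) down, so the assertion checks the actual buffer size.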

Thanks for the review.

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH 1/8] btrfs-progs: quota: Add -W option to rescan to wait without starting rescan

2018-03-02 Thread Jeff Mahoney
On 3/2/18 1:59 PM, Nikolay Borisov wrote:
> 
> 
> On  2.03.2018 20:46, je...@suse.com wrote:
>> From: Jeff Mahoney <je...@suse.com>
>> @@ -135,8 +141,9 @@ static int cmd_quota_rescan(int argc, char **argv)
>>  }
>>  }
>>  
>> -if (ioctlnum != BTRFS_IOC_QUOTA_RESCAN && wait_for_completion) {
>> -error("switch -w cannot be used with -s");
>> +if (ioctlnum == BTRFS_IOC_QUOTA_RESCAN_STATUS && wait_for_completion) {
>> +error("switch -%c cannot be used with -s",
>> +  ioctlnum ? 'w' : 'W');
> 
> You can't really distinguish between w/W in this context, since ioctlnum
> will be RESCAN_STATUS. So just harcode the w/W in the text message itself?

Yep.  Derp.
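
A sketch of the hardcoded message; the exact wording and the return value
are assumptions about the surrounding option handling:

        if (ioctlnum == BTRFS_IOC_QUOTA_RESCAN_STATUS && wait_for_completion) {
                /* both -w and -W take this branch, so name them explicitly */
                error("switch -w/-W cannot be used with -s");
                return 1;
        }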

Thanks,

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH 0/8] btrfs-progs: qgroups usability

2018-03-02 Thread Jeff Mahoney
On 3/2/18 1:39 PM, je...@suse.com wrote:
> From: Jeff Mahoney <je...@suse.com>
> 
> Hi all -
> 
> The following series addresses some usability issues with the qgroups UI.
> 
> 1) Adds -W option so we can wait on a rescan completing without starting one.
> 2) Adds qgroup information to 'btrfs subvolume show'
> 3) Adds a -P option to show pathnames for first-level qgroups (or member
>of nested qgroups with -v)
> 4) Allows exporting the qgroup table in JSON format for use by external
>programs/scripts.


Grumble.  Ignore this thread.  I had reordered the patches and didn't
clean up an older git format-patch.

-Jeff

> -Jeff
> 
> Jeff Mahoney (8):
>   btrfs-progs: quota: Add -W option to rescan to wait without starting
> rescan
>   btrfs-progs: qgroups: fix misleading index check
>   btrfs-progs: constify pathnames passed as arguments
>   btrfs-progs: qgroups: add pathname to show output
>   btrfs-progs: qgroups: introduce and use info and limit structures
>   btrfs-progs: qgroups: introduce btrfs_qgroup_query
>   btrfs-progs: subvolume: add quota info to btrfs sub show
>   btrfs-progs: qgroups: export qgroups usage information as JSON
> 
>  Documentation/btrfs-qgroup.asciidoc |   8 +
>  Documentation/btrfs-quota.asciidoc  |  10 +-
>  Makefile.inc.in |   4 +-
>  chunk-recover.c |   4 +-
>  cmds-device.c   |   2 +-
>  cmds-fi-usage.c |   6 +-
>  cmds-qgroup.c   |  49 +++-
>  cmds-quota.c|  21 +-
>  cmds-rescue.c   |   4 +-
>  cmds-subvolume.c|  46 
>  configure.ac|   6 +
>  kerncompat.h|   1 +
>  qgroup.c| 526 
> ++--
>  qgroup.h|  22 +-
>  send-utils.c|   4 +-
>  utils.c     |  22 +-
>  utils.h |   2 +
>  17 files changed, 621 insertions(+), 116 deletions(-)
> 


-- 
Jeff Mahoney
SUSE Labs


Re: [PATCH v2 10/10] btrfs: qgroup: Use independent and accurate per inode qgroup rsv

2018-02-23 Thread Jeff Mahoney
On 2/22/18 6:34 PM, Qu Wenruo wrote:
> 
> 
> On 2018年02月23日 06:44, Jeff Mahoney wrote:
>> On 12/22/17 1:18 AM, Qu Wenruo wrote:
>>> Unlike reservation calculation used in inode rsv for metadata, qgroup
>>> doesn't really need to care things like csum size or extent usage for
>>> whole tree COW.
>>>
>>> Qgroup care more about net change of extent usage.
>>> That's to say, if we're going to insert one file extent, it will mostly
>>> find its place in CoWed tree block, leaving no change in extent usage.
>>> Or cause leaf split, result one new net extent, increasing qgroup number
>>> by nodesize.
>>> (Or even more rare case, increase the tree level, increasing qgroup
>>> number by 2 * nodesize)
>>>
>>> So here instead of using the way overkilled calculation for extent
>>> allocator, which cares more about accurate and no error, qgroup doesn't
>>> need that over-calculated reservation.
>>>
>>> This patch will maintain 2 new members in btrfs_block_rsv structure for
>>> qgroup, using much smaller calculation for qgroup rsv, reducing false
>>> EDQUOT.
>>
>>
>> I think this is the right idea but, rather than being the last step in a
>> qgroup rework, it should be the first.
> 
> That's right.
> 
> Although putting it as 1st patch needs extra work to co-operate with
> later type seperation.
> 
>>  Don't qgroup reservation
>> lifetimes match the block reservation lifetimes?
> 
> Also correct, but...
> 
>>  We wouldn't want a
>> global reserve and we wouldn't track usage on a per-block basis, but
>> they should otherwise match.  We already have clear APIs surrounding the
>> use of block reservations, so it seems to me that it'd make sense to
>> extend those to cover the qgroup cases as well.  Otherwise, the rest of
>> your patchset makes a parallel reservation system with a different API.
>> That keeps the current state of qgroups as a bolt-on that always needs
>> to be tracked separately (and which developers need to ensure they get
>> right).
> 
> The problem is, block reservation is designed to ensure every CoW could
> be fulfilled without error.
> 
> That's to say, for case like CoW write with csum, we need to care about
> space reservation not only for EXTENT_DATA in fs tree, but also later
> EXTENT_ITEM for extent tree, and CSUM_ITEM for csum tree.
> 
> However extent tree and csum tree doesn't contribute to quota at all.
> If we follow such over-reservation, it would be much much easier to hit
> false EDQUOTA early.

I'm not suggesting a 1:1 mapping between block reservations and qgroup
reservations.  If that were possible, we wouldn't need separate
reservations at all.  What we can do is only use bytes from the qgroup
reservation when we allocate the leaf nodes belonging to the root we're
tracking.  Everywhere else we can migrate bytes normally between
reservations the same way we do for block reservations.  As we discussed
offline yesterday, I'll work up something along what I have in mind and
see if it works out.

-Jeff

> That's the main reason why a separate (and a little simplified) block
> rsv tracking system.
> 
> And if there is better solution, I'm all ears.
> 
> Thanks,
> Qu
> 
>>
>> -Jeff
>>
>>> Signed-off-by: Qu Wenruo <w...@suse.com>
>>> ---
>>>  fs/btrfs/ctree.h   | 18 +
>>>  fs/btrfs/extent-tree.c | 55 
>>> --
>>>  2 files changed, 62 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>> index 0c58f92c2d44..97783ba91e00 100644
>>> --- a/fs/btrfs/ctree.h
>>> +++ b/fs/btrfs/ctree.h
>>> @@ -467,6 +467,24 @@ struct btrfs_block_rsv {
>>> unsigned short full;
>>> unsigned short type;
>>> unsigned short failfast;
>>> +
>>> +   /*
>>> +* Qgroup equivalent for @size @reserved
>>> +*
>>> +* Unlike normal normal @size/@reserved for inode rsv,
>>> +* qgroup doesn't care about things like csum size nor how many tree
>>> +* blocks it will need to reserve.
>>> +*
>>> +* Qgroup cares more about *NET* change of extent usage.
>>> +* So for ONE newly inserted file extent, in worst case it will cause
>>> +* leaf split and level increase, nodesize for each file extent
>>> +* is already way overkilled.
>>> +*
>>> +* In short, qgroup_size/reserved is the up limit of possible needed
>>> +* qgroup metadata re

Re: [PATCH v2 10/10] btrfs: qgroup: Use independent and accurate per inode qgroup rsv

2018-02-22 Thread Jeff Mahoney
 trans->chunk_bytes_reserved, NULL);
>   trans->chunk_bytes_reserved = 0;
>  }
>  
> @@ -6036,6 +6061,7 @@ static void btrfs_calculate_inode_block_rsv_size(struct 
> btrfs_fs_info *fs_info,
>  {
>   struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
>   u64 reserve_size = 0;
> + u64 qgroup_rsv_size = 0;
>   u64 csum_leaves;
>   unsigned outstanding_extents;
>  
> @@ -6048,9 +6074,16 @@ static void 
> btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
>inode->csum_bytes);
>   reserve_size += btrfs_calc_trans_metadata_size(fs_info,
>          csum_leaves);
> + /*
> +  * For qgroup rsv, the calculation is very simple:
> +  * nodesize for each outstanding extent.
> +  * This is already overkilled under most case.
> +  */
> + qgroup_rsv_size = outstanding_extents * fs_info->nodesize;
>  
>   spin_lock(&block_rsv->lock);
>   block_rsv->size = reserve_size;
> + block_rsv->qgroup_rsv_size = qgroup_rsv_size;
>   spin_unlock(&block_rsv->lock);
>  }
>  
> @@ -8405,7 +8438,7 @@ static void unuse_block_rsv(struct btrfs_fs_info 
> *fs_info,
>   struct btrfs_block_rsv *block_rsv, u32 blocksize)
>  {
>   block_rsv_add_bytes(block_rsv, blocksize, 0);
> - block_rsv_release_bytes(fs_info, block_rsv, NULL, 0);
> + block_rsv_release_bytes(fs_info, block_rsv, NULL, 0, NULL);
>  }
>  
>  /*
> 


-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH] btrfs: qgroups, properly handle no reservations

2018-02-21 Thread Jeff Mahoney
On 2/21/18 8:36 PM, Qu Wenruo wrote:
> 
> 
> On 2018年02月22日 04:19, je...@suse.com wrote:
>> From: Jeff Mahoney <je...@suse.com>
>>
>> There are several places where we call btrfs_qgroup_reserve_meta and
>> assume that a return value of 0 means that the reservation was successful.
>>
>> Later, we use the original bytes value passed to that call to free
>> bytes during error handling or to pass the number of bytes reserved to
>> the caller.
>>
>> This patch returns -ENODATA when we don't perform a reservation so that
>> callers can make the distinction.  This also lets call sites not
>> necessarily care whether qgroups are enabled.
> 
> IMHO if we don't need to reserve, returning 0 seems good enough.
> Caller doesn't really need to care if it has reserved some bytes.
> 
> Or is there any special case where we need to distinguish such case?

Anywhere where the reservation takes place prior to the transaction
starting, which is pretty much everywhere.  We wait until transaction
commit to flip the bit to turn on quotas, which means that if a
transaction commit that enables quotas lands in between the reservation
being taken and any error handling that involves freeing the reservation,
we'll end up with an underflow.

This is the first patch of a series I'm working on, but it can stand
alone.  The rest is the patch set I mentioned when we talked a few
months ago where the lifetimes of reservations are incorrect.  We can't
just drop all the reservations at the end of the transaction because 1)
the lifetime of some reservations can cross transactions and 2) because
especially in the start_transaction case, we do the reservation prior to
waiting to join the transaction.  So if the transaction we're waiting on
commits, our reservation goes away with it but we continue on as if we
still have it.

-Jeff

> Thanks,
> Qu
> 
>>
>> Signed-off-by: Jeff Mahoney <je...@suse.com>
>> ---
>>  fs/btrfs/extent-tree.c | 33 -
>>  fs/btrfs/qgroup.c  |  4 ++--
>>  fs/btrfs/transaction.c |  5 -
>>  3 files changed, 22 insertions(+), 20 deletions(-)
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index c1618ab9fecf..2d5e963fae88 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -5988,20 +5988,18 @@ int btrfs_subvolume_reserve_metadata(struct 
>> btrfs_root *root,
>>   u64 *qgroup_reserved,
>>   bool use_global_rsv)
>>  {
>> -u64 num_bytes;
>>  int ret;
>>  struct btrfs_fs_info *fs_info = root->fs_info;
>>  struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
>> +/* One for parent inode, two for dir entries */
>> +u64 num_bytes = 3 * fs_info->nodesize;
>>  
>> -if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags)) {
>> -/* One for parent inode, two for dir entries */
>> -num_bytes = 3 * fs_info->nodesize;
>> -ret = btrfs_qgroup_reserve_meta(root, num_bytes, true);
>> -if (ret)
>> -return ret;
>> -} else {
>> +ret = btrfs_qgroup_reserve_meta(root, num_bytes, true);
>> +if (ret == -ENODATA) {
>>  num_bytes = 0;
>> -}
>> +ret = 0;
>> +} else if (ret)
>> +return ret;
>>  
>>  *qgroup_reserved = num_bytes;
>>  
>> @@ -6057,6 +6055,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode 
>> *inode, u64 num_bytes)
>>  enum btrfs_reserve_flush_enum flush = BTRFS_RESERVE_FLUSH_ALL;
>>  int ret = 0;
>>  bool delalloc_lock = true;
>> +u64 qgroup_reserved;
>>  
>>  /* If we are a free space inode we need to not flush since we will be in
>>   * the middle of a transaction commit.  We also don't need the delalloc
>> @@ -6090,17 +6089,17 @@ int btrfs_delalloc_reserve_metadata(struct 
>> btrfs_inode *inode, u64 num_bytes)
>>  btrfs_calculate_inode_block_rsv_size(fs_info, inode);
>>  spin_unlock(>lock);
>>  
>> -if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags)) {
>> -ret = btrfs_qgroup_reserve_meta(root,
>> -nr_extents * fs_info->nodesize, true);
>> -if (ret)
>> -goto out_fail;
>> -}
>> +qgroup_reserved = nr_extents * fs_info->nodesize;
>> +ret = btrfs_qgroup_reserve_meta(root, qgroup_reserved, true);
>> +if (ret == -ENODATA) {
>> +ret = 0;
>> +qgroup_reserved

Re: [PATCH] btrfs: btrfs_evict_inode must clear all inodes

2018-01-29 Thread Jeff Mahoney
On 1/29/18 2:58 PM, Liu Bo wrote:
> On Mon, Jan 29, 2018 at 11:46:28AM -0500, Jeff Mahoney wrote:
>> btrfs_evict_inode must clear all inodes or we'll hit a BUG_ON in evict().
>>
>> Fixes: 3d48d9810de (btrfs: Handle uninitialised inode eviction)
>> Cc: Nikolay Borisov <nbori...@suse.com>
>> Cc: <sta...@vger.kernel.org> # v4.8+
>> Signed-off-by: Jeff Mahoney <je...@suse.com>
>> ---
>>  fs/btrfs/inode.c |1 +
>>  1 file changed, 1 insertion(+)
>>
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -5282,6 +5282,7 @@ void btrfs_evict_inode(struct inode *ino
>>  trace_btrfs_inode_evict(inode);
>>  
>>  if (!root) {
>> +clear_inode(inode);
>>  kmem_cache_free(btrfs_inode_cachep, BTRFS_I(inode));
> 
> I had a patch for this, and also kmem_cache_free() is not supposed to
> be called here, but in ->destroy_inode().

Yep, that too.

Thanks,

-Jeff

-- 
Jeff Mahoney
SUSE Labs


[PATCH] btrfs: btrfs_evict_inode must clear all inodes

2018-01-29 Thread Jeff Mahoney
btrfs_evict_inode must clear all inodes or we'll hit a BUG_ON in evict().

Fixes: 3d48d9810de (btrfs: Handle uninitialised inode eviction)
Cc: Nikolay Borisov <nbori...@suse.com>
Cc: <sta...@vger.kernel.org> # v4.8+
Signed-off-by: Jeff Mahoney <je...@suse.com>
---
 fs/btrfs/inode.c |1 +
 1 file changed, 1 insertion(+)

--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5282,6 +5282,7 @@ void btrfs_evict_inode(struct inode *ino
trace_btrfs_inode_evict(inode);
 
if (!root) {
+   clear_inode(inode);
kmem_cache_free(btrfs_inode_cachep, BTRFS_I(inode));
return;
    }

-- 
Jeff Mahoney
SUSE Labs


Re: bug? fstrim only trims unallocated space, not unused in bg's

2017-11-20 Thread Jeff Mahoney
On 11/20/17 11:04 PM, Chris Murphy wrote:
> On Mon, Nov 20, 2017 at 6:46 PM, Jeff Mahoney <je...@suse.com> wrote:
>> On 11/20/17 5:59 PM, Chris Murphy wrote:
>>> On Mon, Nov 20, 2017 at 1:40 PM, Jeff Mahoney <je...@suse.com> wrote:
>>>> On 11/20/17 3:01 PM, Jeff Mahoney wrote:
>>>>> On 11/20/17 3:00 PM, Jeff Mahoney wrote:
>>>>>> On 11/19/17 4:38 PM, Chris Murphy wrote:
>>>>>>> On Sat, Nov 18, 2017 at 11:27 PM, Andrei Borzenkov 
>>>>>>> <arvidj...@gmail.com> wrote:
>>>>>>>> 19.11.2017 09:17, Chris Murphy пишет:
>>>>>>>>> fstrim should trim free space, but it only trims unallocated. This is
>>>>>>>>> with kernel 4.14.0 and the entire 4.13 series. I'm pretty sure it
>>>>>>>>> behaved this way with 4.12 also.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Well, I was told it should also trim free space ...
>>>>>>>>
>>>>>>>> https://www.spinics.net/lists/linux-btrfs/msg61819.html
>>>>>>>>
>>>>>>>
>>>>>>> It definitely isn't. If I do a partial balance, then fstrim, I get a
>>>>>>> larger trimmed value, corresponding exactly to unallocated space.
>>>>>>
>>>>>>
>>>>>> I've just tested with 4.14 and it definitely trims within block groups.
>>>>>
>>>>> Derp.  This should read 4.12.
>>>>>
>>>>>> I've attached my test script and the log of the run.  I'll build and
>>>>>> test a 4.14 kernel and see if I can reproduce there.  It may well be
>>>>>> that we're just misreporting the bytes trimmed.
>>>>
>>>> I get the same results on v4.14.  I wrote up a little script to parse
>>>> the btrfs-debug-tree extent tree dump and the discards that are issued
>>>> after the final sync (when the tree is dumped) match.
>>>>
>>>> The script output is also as expected:
>>>> /mnt2: 95.1 GiB (102082281472 bytes) trimmed
>>>> # remove every other 100MB file, totalling 1.5 GB
>>>> + sync
>>>> + killall blktrace
>>>> + wait
>>>> + echo 'after sync'
>>>> + sleep 1
>>>> + btrace -a discard /dev/loop0
>>>> + fstrim -v /mnt2
>>>> /mnt2: 96.6 GiB (103659962368 bytes) trimmed
>>>>
>>>> One thing that may not be apparent is that the byte count is from the
>>>> device(s)'s perspective.  If you have a file system with duplicate
>>>> chunks or a redundant RAID mode, the numbers will reflect that.
>>>>
>>>> The total byte count should be correct as well.  It's the total number
>>>> of bytes that we submit for discard and that were accepted by the block
>>>> layer.
>>>>
>>>> Do you have a test case that shows it being wrong and can you provide
>>>> the blktrace capture of the device(s) while the fstrim is running?
>>>
>>>
>>> Further,
>>>
>>> # fstrim -v /
>>> /: 38 GiB (40767586304 bytes) trimmed
>>>
>>> And then delete 10G worth of files, do not balance, and do nothing for
>>> a minute before:
>>>
>>> # fstrim -v /
>>> /: 38 GiB (40767586304 bytes) trimmed
>>>
>>> It's the same value. Free space according to fi us is +10 larger than
>>> before, and yet nothing additional is trimmed than before. So I don't
>>> know what's going on but it's not working for me.
>>
>> What happens if you sync before doing the fstrim again?  The code is
>> there to drop extents within block groups.  It works for me.  The big
>> thing is that the space must be freed entirely before we can trim.
> 
> I've sync'd and I've also rebooted, it's the same.
> 
> [root@f27h ~]# fstrim -v /
> /: 38 GiB (40767586304 bytes) trimmed
> [root@f27h ~]# btrfs fi us /
> Overall:
> Device size:  70.00GiB
> Device allocated:  32.03GiB
> Device unallocated:  37.97GiB
> Device missing: 0.00B
> Used:  15.50GiB
> Free (estimated):  52.93GiB(min: 52.93GiB)
> Data ratio:  1.00
> Metadata ratio:  1.00
> Global reserve:  53.97MiB(used: 192.00KiB)
> 
> Data,single: Size:30.00GiB, Used:15.04GiB
>/dev/nvme0n1p8  30.00GiB
> 
> Metadata,single: Size:2.00GiB, Used:473.34MiB
>/dev/nvme0n1p8   2.00GiB
> 
> System,single: Size:32.00MiB, Used:16.00KiB
>/dev/nvme0n1p8  32.00MiB
> 
> Unallocated:
>/dev/nvme0n1p8  37.97GiB
> [root@f27h ~]#

What's the discard granularity on that device?

grep . /sys/block/nvme0n1/queue/discard_*
cat /sys/block/nvme0n1/discard*

-Jeff


-- 
Jeff Mahoney
SUSE Labs





Re: bug? fstrim only trims unallocated space, not unused in bg's

2017-11-20 Thread Jeff Mahoney
On 11/20/17 5:59 PM, Chris Murphy wrote:
> On Mon, Nov 20, 2017 at 1:40 PM, Jeff Mahoney <je...@suse.com> wrote:
>> On 11/20/17 3:01 PM, Jeff Mahoney wrote:
>>> On 11/20/17 3:00 PM, Jeff Mahoney wrote:
>>>> On 11/19/17 4:38 PM, Chris Murphy wrote:
>>>>> On Sat, Nov 18, 2017 at 11:27 PM, Andrei Borzenkov <arvidj...@gmail.com> 
>>>>> wrote:
>>>>>> 19.11.2017 09:17, Chris Murphy пишет:
>>>>>>> fstrim should trim free space, but it only trims unallocated. This is
>>>>>>> with kernel 4.14.0 and the entire 4.13 series. I'm pretty sure it
>>>>>>> behaved this way with 4.12 also.
>>>>>>>
>>>>>>
>>>>>> Well, I was told it should also trim free space ...
>>>>>>
>>>>>> https://www.spinics.net/lists/linux-btrfs/msg61819.html
>>>>>>
>>>>>
>>>>> It definitely isn't. If I do a partial balance, then fstrim, I get a
>>>>> larger trimmed value, corresponding exactly to unallocated space.
>>>>
>>>>
>>>> I've just tested with 4.14 and it definitely trims within block groups.
>>>
>>> Derp.  This should read 4.12.
>>>
>>>> I've attached my test script and the log of the run.  I'll build and
>>>> test a 4.14 kernel and see if I can reproduce there.  It may well be
>>>> that we're just misreporting the bytes trimmed.
>>
>> I get the same results on v4.14.  I wrote up a little script to parse
>> the btrfs-debug-tree extent tree dump and the discards that are issued
>> after the final sync (when the tree is dumped) match.
>>
>> The script output is also as expected:
>> /mnt2: 95.1 GiB (102082281472 bytes) trimmed
>> # remove every other 100MB file, totalling 1.5 GB
>> + sync
>> + killall blktrace
>> + wait
>> + echo 'after sync'
>> + sleep 1
>> + btrace -a discard /dev/loop0
>> + fstrim -v /mnt2
>> /mnt2: 96.6 GiB (103659962368 bytes) trimmed
>>
>> One thing that may not be apparent is that the byte count is from the
>> device(s)'s perspective.  If you have a file system with duplicate
>> chunks or a redundant RAID mode, the numbers will reflect that.
>>
>> The total byte count should be correct as well.  It's the total number
>> of bytes that we submit for discard and that were accepted by the block
>> layer.
>>
>> Do you have a test case that shows it being wrong and can you provide
>> the blktrace capture of the device(s) while the fstrim is running?
> 
> 
> Further,
> 
> # fstrim -v /
> /: 38 GiB (40767586304 bytes) trimmed
> 
> And then delete 10G worth of files, do not balance, and do nothing for
> a minute before:
> 
> # fstrim -v /
> /: 38 GiB (40767586304 bytes) trimmed
> 
> It's the same value. Free space according to fi us is +10 larger than
> before, and yet nothing additional is trimmed than before. So I don't
> know what's going on but it's not working for me.

What happens if you sync before doing the fstrim again?  The code is
there to drop extents within block groups.  It works for me.  The big
thing is that the space must be freed entirely before we can trim.

-Jeff


-- 
Jeff Mahoney
SUSE Labs





Re: bug? fstrim only trims unallocated space, not unused in bg's

2017-11-20 Thread Jeff Mahoney
On 11/20/17 3:01 PM, Jeff Mahoney wrote:
> On 11/20/17 3:00 PM, Jeff Mahoney wrote:
>> On 11/19/17 4:38 PM, Chris Murphy wrote:
>>> On Sat, Nov 18, 2017 at 11:27 PM, Andrei Borzenkov <arvidj...@gmail.com> 
>>> wrote:
>>>> 19.11.2017 09:17, Chris Murphy пишет:
>>>>> fstrim should trim free space, but it only trims unallocated. This is
>>>>> with kernel 4.14.0 and the entire 4.13 series. I'm pretty sure it
>>>>> behaved this way with 4.12 also.
>>>>>
>>>>
>>>> Well, I was told it should also trim free space ...
>>>>
>>>> https://www.spinics.net/lists/linux-btrfs/msg61819.html
>>>>
>>>
>>> It definitely isn't. If I do a partial balance, then fstrim, I get a
>>> larger trimmed value, corresponding exactly to unallocated space.
>>
>>
>> I've just tested with 4.14 and it definitely trims within block groups.
> 
> Derp.  This should read 4.12.
> 
>> I've attached my test script and the log of the run.  I'll build and
>> test a 4.14 kernel and see if I can reproduce there.  It may well be
>> that we're just misreporting the bytes trimmed.

I get the same results on v4.14.  I wrote up a little script to parse
the btrfs-debug-tree extent tree dump and the discards that are issued
after the final sync (when the tree is dumped) match.

The script output is also as expected:
/mnt2: 95.1 GiB (102082281472 bytes) trimmed
# remove every other 100MB file, totalling 1.5 GB
+ sync
+ killall blktrace
+ wait
+ echo 'after sync'
+ sleep 1
+ btrace -a discard /dev/loop0
+ fstrim -v /mnt2
/mnt2: 96.6 GiB (103659962368 bytes) trimmed

One thing that may not be apparent is that the byte count is from the
device(s)'s perspective.  If you have a file system with duplicate
chunks or a redundant RAID mode, the numbers will reflect that.

The total byte count should be correct as well.  It's the total number
of bytes that we submit for discard and that were accepted by the block
layer.

Do you have a test case that shows it being wrong and can you provide
the blktrace capture of the device(s) while the fstrim is running?

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Re: bug? fstrim only trims unallocated space, not unused in bg's

2017-11-20 Thread Jeff Mahoney
On 11/20/17 3:00 PM, Jeff Mahoney wrote:
> On 11/19/17 4:38 PM, Chris Murphy wrote:
>> On Sat, Nov 18, 2017 at 11:27 PM, Andrei Borzenkov <arvidj...@gmail.com> 
>> wrote:
>>> 19.11.2017 09:17, Chris Murphy пишет:
>>>> fstrim should trim free space, but it only trims unallocated. This is
>>>> with kernel 4.14.0 and the entire 4.13 series. I'm pretty sure it
>>>> behaved this way with 4.12 also.
>>>>
>>>
>>> Well, I was told it should also trim free space ...
>>>
>>> https://www.spinics.net/lists/linux-btrfs/msg61819.html
>>>
>>
>> It definitely isn't. If I do a partial balance, then fstrim, I get a
>> larger trimmed value, corresponding exactly to unallocated space.
> 
> 
> I've just tested with 4.14 and it definitely trims within block groups.

Derp.  This should read 4.12.

> I've attached my test script and the log of the run.  I'll build and
> test a 4.14 kernel and see if I can reproduce there.  It may well be
> that we're just misreporting the bytes trimmed.
> 
> -Jeff
> 
> 


-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH 1/6] btrfs: qgroup: Skeleton to support separate qgroup reservation type

2017-10-24 Thread Jeff Mahoney
On 10/24/17 8:29 AM, Jeff Mahoney wrote:
> On 10/24/17 7:51 AM, Qu Wenruo wrote:
>>
>>
>> On 2017年10月24日 19:00, Nikolay Borisov wrote:
>>> nit: Why not BTRFS_QGROUP_RSV_TYPES_MAX = 2;
>>
>> My original plan is just as the same as yours.
>>
>> However I still remember I did it before and David fixed it by using
>> TYPES, so I follow his naming schema here.
>>
>> Kernel is also using this naming schema else where:
>> d91876496bcf ("btrfs: compress: put variables defined per compress type
>> in struct to make cache friendly")
> 
> The COMPRESS_TYPES pattern isn't the right pattern to follow here.
> That's a special case since there's a _NONE that doesn't have anything
> associated with it, so we don't need to take a slot in the array.
> 
> We also don't care about any of the specific values, just that they
> start at 0.  The BTRFS_COMPRESS_TYPES example also has a
> BTRFS_COMPRESS_LAST item in the enum, which serves the same purpose as
> MAX.  I don't have a strong opinion on the naming, just that we don't
> play games with +1 when handling arrays since, as you say, that's just
> waiting for subtle bugs later.
> 
> enum btrfs_qgroup_rsv_type {
>   BTRFS_QGROUP_RSV_DATA = 0,
>   BTRFS_QGROUP_RSV_META,
>   BTRFS_QGROUP_RSV_LAST,
> };

It turns out that BTRFS_COMPRESS_LAST did exist until very recently:

commit dc2f29212a2648164b054016dc5b948bf0fc92d5
Author: Anand Jain <anand.j...@oracle.com>
Date:   Sun Aug 13 12:02:41 2017 +0800

btrfs: remove unused BTRFS_COMPRESS_LAST

... so it was there, but just wasn't used.

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH 1/6] btrfs: qgroup: Skeleton to support separate qgroup reservation type

2017-10-24 Thread Jeff Mahoney
> However I still remember I did it before and David fixed it by using
> TYPES, so I follow his naming schema here.
> 
> Kernel is also using this naming schema else where:
> d91876496bcf ("btrfs: compress: put variables defined per compress type
> in struct to make cache friendly")

The COMPRESS_TYPES pattern isn't the right pattern to follow here.
That's a special case since there's a _NONE that doesn't have anything
associated with it, so we don't need to take a slot in the array.

We also don't care about any of the specific values, just that they
start at 0.  The BTRFS_COMPRESS_TYPES example also has a
BTRFS_COMPRESS_LAST item in the enum, which serves the same purpose as
MAX.  I don't have a strong opinion on the naming, just that we don't
play games with +1 when handling arrays since, as you say, that's just
waiting for subtle bugs later.

enum btrfs_qgroup_rsv_type {
BTRFS_QGROUP_RSV_DATA = 0,
BTRFS_QGROUP_RSV_META,
BTRFS_QGROUP_RSV_LAST,
};

>>> +};
>>> +
>>> +/*
>>> + * Represents how many bytes we reserved for this qgroup.
>>> + *
>>> + * Each type should have different reservation behavior.
>>> + * E.g, data follows its io_tree flag modification, while
>>> + * *currently* meta is just reserve-and-clear during transaction.
>>> + *
>>> + * TODO: Add new type for delalloc, which can exist across several
>>> + * transaction.
>>> + */

Minor nit: It's not just delalloc.  Delayed items and inodes can as
well.  The general rule is that qgroup reservations aren't essentially
different from block reservations and follow the same usage patterns
when operating on leaf nodes.

>>> +struct btrfs_qgroup_rsv {
>>> +   u64 values[BTRFS_QGROUP_RSV_TYPES + 1];
>>
>> nit: And here just BTRFS_QGROUP_RSV_TYPES_MAX rather than the +1 here,
>> seems more idiomatic to me.
> 
> To follow same naming schema from David.
> (IIRC it was about tree-checker patchset, checking file extent type part)
> 
> In fact, I crashed kernel several times due to the tiny +1, without even
> a clue for hours just testing blindly, until latest gcc gives warning
> about it.

BTRFS_QGROUP_RSV_LAST would do the job here.
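
With the sentinel the array sizes itself and the "+ 1" arithmetic goes
away entirely, e.g. (sketch):

        struct btrfs_qgroup_rsv {
                u64 values[BTRFS_QGROUP_RSV_LAST];
        };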

-Jeff

>>
>>> +};
>>> +
>>>  /*
>>>   * one struct for each qgroup, organized in fs_info->qgroup_tree.
>>>   */
>>> @@ -88,6 +108,7 @@ struct btrfs_qgroup {
>>>  * reservation tracking
>>>  */
>>> u64 reserved;
>>> +   struct btrfs_qgroup_rsv rsv;
>>>  
>>> /*
>>>  * lists
>>> @@ -228,12 +249,14 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle 
>>> *trans,
>>>  struct btrfs_fs_info *fs_info, u64 srcid, u64 objectid,
>>>  struct btrfs_qgroup_inherit *inherit);
>>>  void btrfs_qgroup_free_refroot(struct btrfs_fs_info *fs_info,
>>> -  u64 ref_root, u64 num_bytes);
>>> +  u64 ref_root, u64 num_bytes,
>>> +  enum btrfs_qgroup_rsv_type type);
>>>  static inline void btrfs_qgroup_free_delayed_ref(struct btrfs_fs_info 
>>> *fs_info,
>>>  u64 ref_root, u64 num_bytes)
>>>  {
>>> trace_btrfs_qgroup_free_delayed_ref(fs_info, ref_root, num_bytes);
>>> -   btrfs_qgroup_free_refroot(fs_info, ref_root, num_bytes);
>>> +   btrfs_qgroup_free_refroot(fs_info, ref_root, num_bytes,
>>> + BTRFS_QGROUP_RSV_DATA);
>>>  }
>>>  
>>>  #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
>>>
>>
> 


-- 
Jeff Mahoney
SUSE Labs





Re: reproducible oops in btrfs/130 with latest mainline

2017-10-17 Thread Jeff Mahoney
On 10/17/17 8:46 AM, Qu Wenruo wrote:
> 
> 
> On 2017年10月17日 19:11, Po-Hsu Lin wrote:
>> Hello Chris,
>>
>> I can reproduce this on my side too, with Ubuntu 16.04 + 4.4.0-97 kernel.
> 
> Btrfs/130 is a known bug.
> 
> I submitted it to raise the concern about such situation and purposed
> one possible solution (just disable deduped file detection).
> 
> But the solution doesn't get accepted.

It also works very well as a performance test for qgroup runtime
improvements. :)

-Jeff

> Thanks,
> Qu
>>
>> PREEMPT config:
>> $ cat config-4.4.0-97-generic | grep PREEMPT
>> CONFIG_PREEMPT_NOTIFIERS=y
>> # CONFIG_PREEMPT_NONE is not set
>> CONFIG_PREEMPT_VOLUNTARY=y
>> # CONFIG_PREEMPT is not set
>>
>> Bug reports on launchpad:
>> https://bugs.launchpad.net/bugs/1718925
>> https://bugs.launchpad.net/bugs/1717443
>>
> 


-- 
Jeff Mahoney
SUSE Labs





Re: Fatal failure, btrfs raid with duplicated metadata

2017-10-11 Thread Jeff Mahoney
On 10/11/17 2:20 PM, Ian Kumlien wrote:
> 
> 
> On Wed, Oct 11, 2017 at 2:10 PM Jeff Mahoney <je...@suse.com
> <mailto:je...@suse.com>> wrote:
> 
> On 10/11/17 12:41 PM, Ian Kumlien wrote:
> 
> [--8<--] 
> 
> > Eventually the filesystem becomes read-only and everything is odd...
> 
> Are you still able to mount it?  I'd be surprised if you could if check
> can't open the file system.
> 
> 
> Nope, it's like there never was a filesystem in the first place... 
> 
> But since metadata should be duplicated all over, i'd assume that it
> would be able to mount it and survive =)

If you'd been using RAID1 or something instead, you'd be able to mount
the file system and replace the disk.
> > Trying to run btrfs check on the disks results in:
> > btrfs check -b /dev/disk/by-uuid/8d431da9-dad4-481c-a5ad-5e6844f31da0
> > bytenr mismatch, want=912228352, have=0
> > Couldn't read tree root
> > ERROR: cannot open file system
> >
> > (For backup and normal)
> >
> > So even if the data is duplicated on all disks, something in the above
> > errors seemed to cause it to abort
> > (These disks are seagate sshd disks, never ever buying them again)
> 
> If you have metadata: dup, that doesn't mean the metadata is duplicated
> on every disk.  It means that there are two copies of the metadata on a
> single disk.  If that disk is going bad and returning failures for both
> copies of the metadata, you may be out of luck.  It's really intended
> for single spinning disks to get a little bit more resiliency in the
> face of bad sectors.
> 
> 
> Oh? it looks like it would be 2 per 1 device, but ok - Then i could have
> had a issue where the drive that keeps the metadata is gone... I
> suspected that I did do DUP on multiple devices
> 
> from the man page:
>        Note 1: DUP may exist on more than 1 device if it starts on a
> single device and another one is added. Since version 4.5.1,
>        mkfs.btrfs will let you create DUP on multiple devices.

I can see how you'd reach that conclusion.  The wording is somewhat
confusing.  We allocate space in chunks that are usually about 1GB in
size.  When DUP is used, we allocate two chunks on the same device and
that is presented as a single usable chunk.  The constituent chunks will
be allocated on the same device, but which device is used can change
with each allocation.

Say you have 5 disks and 8 metadata chunks.  They can be allocated like so:

sda: A A D D
sdb: B B
sdc: C C
sde: F F G G
sdf: H H I I

There is no redundancy in the case of a disk failure, only for sector
failures.  To spread metadata across disks for redundancy you'll need to
use a raid mode instead.  If one of those disks is failing and it
contains a critical part of the metadata, the file system won't be
mountable.

> The check error above means that it wasn't able to map a logical address
> to a physical address.  Typically that means that the mapping was lost.
> 
> 
> I was more reporting that it happened and if there was any useful data
> that we could extract from this if it's a failure that shouldn't happen :)
> 
> I haven't wiped anything yet - preparing to replace the disks though

Thanks for reporting it, but in this context, it's somewhat of an
expected failure mode.

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Re: Fatal failure, btrfs raid with duplicated metadata

2017-10-11 Thread Jeff Mahoney
On 10/11/17 12:41 PM, Ian Kumlien wrote:
> Hi,
> 
> I was running a btrfs raid with 6 disks, metadata: dup and data: raid 6
> 
> Two of the disks started behaving oddly:
> [436823.570296] sd 3:1:0:4: [sdf] Unaligned partial completion
> (resid=244, sector_sz=512)
> [436823.578604] sd 3:1:0:4: [sdf] Unaligned partial completion
> (resid=52, sector_sz=512) [436823.617593]
> sd 3:1:0:4: [sdf] Unaligned partial completion (resid=56,
> sector_sz=512)
> [436823.617771] sd 3:1:0:4: [sdf] Unaligned partial completion
> (resid=222, sector_sz=512)
> [436823.618386] sd 3:1:0:4: [sdf] Unaligned partial completion
> (resid=246, sector_sz=512)
> [436823.618463] sd 3:1:0:4: [sdf] Unaligned partial completion
> (resid=56, sector_sz=512)
> [436977.701944] scsi_io_completion: 68 callbacks suppressed
> [436977.701973] sd 3:1:0:4: [sdf] tag#0 FAILED Result: hostbyte=DID_OK
> driverbyte=DRIVER_SENSE
> [436977.701982] sd 3:1:0:4: [sdf] tag#0 Sense Key : Hardware Error
> [current]
> [436977.701991] sd 3:1:0:4: [sdf] tag#0 Add. Sense: Logical unit
> failure
> [436977.702000] sd 3:1:0:4: [sdf] tag#0 CDB: Read(10) 28 00 02 fb fb
> 80 00 00 28 00
> [436977.702005] print_req_error: 68 callbacks suppressed
> [436977.702010] print_req_error: critical target error, dev sdf,
> sector 50068352
> [498132.144319] print_req_error: 450 callbacks suppressed
> [498132.144324] print_req_error: critical target error, dev sdf,
> sector 41777640
> [498132.144590] btrfs_dev_stat_print_on_error: 540 callbacks
> suppressed
> [498132.144600] BTRFS error (device sdb1): bdev /dev/sdf1 errs: wr
> 632, rd 1526, flush 0, corrupt 0, gen 0
> 
> Eventually the filesystem becomes read-only and everything is odd...

Are you still able to mount it?  I'd be surprised if you could if check
can't open the file system.

> Trying to run btrfs check on the disks results in:
> btrfs check -b /dev/disk/by-uuid/8d431da9-dad4-481c-a5ad-5e6844f31da0
> bytenr mismatch, want=912228352, have=0
> Couldn't read tree root
> ERROR: cannot open file system
> 
> (For backup and normal)
> 
> So even if the data is duplicated on all disks, something in the above
> errors seemed to cause it to abort
> (These disks are seagate sshd disks, never ever buying them again)

If you have metadata: dup, that doesn't mean the metadata is duplicated
on every disk.  It means that there are two copies of the metadata on a
single disk.  If that disk is going bad and returning failures for both
copies of the metadata, you may be out of luck.  It's really intended
for single spinning disks to get a little bit more resiliency in the
face of bad sectors.

The check error above means that it wasn't able to map a logical address
to a physical address.  Typically that means that the mapping was lost.

-Jeff


-- 
Jeff Mahoney
SUSE Labs





Re: [RFC v3 0/2] vfs / btrfs: add support for ustat()

2017-08-23 Thread Jeff Mahoney
On 8/15/14 5:29 AM, Al Viro wrote:
> On Thu, Aug 14, 2014 at 07:58:56PM -0700, Luis R. Rodriguez wrote:
> 
>> Christoph had noted that this seemed associated to the problem
>> that the btrfs uses different assignments for st_dev than s_dev,
>> but much as I'd like to see that changed based on discussions so
>> far its unclear if this is going to be possible unless strong
>> commitment is reached.

Resurrecting a dead thread since we've been carrying this patch anyway
since then.

> Explain, please.  Whose commitment and commitment to what, exactly?
> Having different ->st_dev values for different files on the same
> fs is a bloody bad idea; why does btrfs do that at all?  If nothing else,
> it breaks the usual "are those two files on the same fs?" tests...

It's because btrfs snapshots would have inode number collisions.
Changing the inode numbers for snapshots would negate a big benefit of
btrfs snapshots: the quick creation and lightweight on-disk
representation due to metadata sharing.

The thing is that ustat() used to work.  Your commit 0ee5dc676a5f8
(btrfs: kill magical embedded struct superblock) had a regression:
Since it replaced the superblock with a simple dev_t, it rendered the
device no longer discoverable by user_get_super.  We need a list_head to
attach for searching.

There's an argument that this is hacky.  It's valid.  The only other
feedback I've heard is to use a real superblock for subvolumes to do
this instead.  That doesn't work either, due to things like freeze/thaw
and inode writeback.  Ultimately, what we need is a single file system
with multiple namespaces.  Years ago we just needed different inode
namespaces, but as people have started adopting btrfs for containers, we
need more than that.  I've heard requests for per-subvolume security
contexts.  I'd imagine user namespaces are on someone's wish list.  A
working df can be done with ->d_automount, but the way btrfs handles
having a "canonical" subvolume location has always been a way to avoid
directory loops.  I'd like to just automount subvolumes everywhere
they're referenced.  One solution, for which I have no code yet, is to
have something like a superblock-light that we can hang things like a
security context, a user namespace, and an anonymous dev.  Most file
systems would have just one.  Btrfs would have one per subvolume.

That's a big project with a bunch of discussion.  So for now, I'd like
to move this patch forward while we (I) work on the bigger issue.

BTW, in this same thread, Christoph said:
> Again, NAK.  Make btrfs report the proper anon dev_t in stat and
> everything will just work.

We do.  We did then too.  But what doesn't work is a user doing stat()
and then using the dev_t to call ustat().
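
A minimal reproducer, for the record -- pass a path inside a btrfs subvolume;
the ustat() call fails (EINVAL) because the anonymous dev_t can't be mapped
back to a superblock:

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <ustat.h>

int main(int argc, char **argv)
{
	struct stat st;
	struct ustat us;

	if (argc < 2 || stat(argv[1], &st))
		return 1;

	/* st_dev here is the per-subvolume anonymous dev_t from stat() */
	if (ustat(st.st_dev, &us)) {
		perror("ustat");
		return 1;
	}

	printf("free blocks: %ld\n", (long)us.f_tfree);
	return 0;
}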

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH 3/5] btrfs: Use common error handling code in btrfs_update_root()

2017-08-21 Thread Jeff Mahoney
On 8/21/17 8:40 AM, SF Markus Elfring wrote:
> From: Markus Elfring <elfr...@users.sourceforge.net>
> Date: Mon, 21 Aug 2017 13:10:15 +0200
> 
> Add a jump target so that a bit of exception handling can be better reused
> at the end of this function.
> 
> This issue was detected by using the Coccinelle software.

btrfs_abort_transaction dumps __FILE__:__LINE__ in the log so this patch
makes the code more difficult to debug.
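
The underlying reason, shown with a stand-alone example (hypothetical names,
nothing btrfs-specific): __LINE__ expands where the macro is used, so once
every failure funnels through a single goto, every report points at the shared
label rather than the call site that actually failed.

#include <stdio.h>

#define report_abort(err) \
	fprintf(stderr, "aborted: %d at %s:%d\n", (err), __FILE__, __LINE__)

static int step(int fail) { return fail ? -5 : 0; }

int main(void)
{
	int ret;

	ret = step(0);
	if (ret) {
		report_abort(ret);	/* would report this exact line */
		goto out;
	}

	ret = step(1);
	if (ret)
		goto abort;		/* reports the label's line instead */
out:
	return ret;
abort:
	report_abort(ret);
	goto out;
}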

-Jeff

> Signed-off-by: Markus Elfring <elfr...@users.sourceforge.net>
> ---
>  fs/btrfs/root-tree.c | 27 +++
>  1 file changed, 11 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/btrfs/root-tree.c b/fs/btrfs/root-tree.c
> index 5b488af6f25e..bc497ba9d9d1 100644
> --- a/fs/btrfs/root-tree.c
> +++ b/fs/btrfs/root-tree.c
> @@ -145,10 +145,8 @@ int btrfs_update_root(struct btrfs_trans_handle *trans, 
> struct btrfs_root
>   return -ENOMEM;
>  
>   ret = btrfs_search_slot(trans, root, key, path, 0, 1);
> - if (ret < 0) {
> - btrfs_abort_transaction(trans, ret);
> - goto out;
> - }
> + if (ret < 0)
> + goto abort_transaction;
>  
>   if (ret != 0) {
>   btrfs_print_leaf(path->nodes[0]);
> @@ -171,23 +169,17 @@ int btrfs_update_root(struct btrfs_trans_handle *trans, 
> struct btrfs_root
>   btrfs_release_path(path);
>   ret = btrfs_search_slot(trans, root, key, path,
>   -1, 1);
> - if (ret < 0) {
> - btrfs_abort_transaction(trans, ret);
> - goto out;
> - }
> + if (ret < 0)
> + goto abort_transaction;
>  
>   ret = btrfs_del_item(trans, root, path);
> - if (ret < 0) {
> - btrfs_abort_transaction(trans, ret);
> - goto out;
> - }
> + if (ret < 0)
> + goto abort_transaction;
>   btrfs_release_path(path);
>   ret = btrfs_insert_empty_item(trans, root, path,
>   key, sizeof(*item));
> - if (ret < 0) {
> - btrfs_abort_transaction(trans, ret);
> - goto out;
> - }
> + if (ret < 0)
> + goto abort_transaction;
>   l = path->nodes[0];
>   slot = path->slots[0];
>   ptr = btrfs_item_ptr_offset(l, slot);
> @@ -204,6 +196,9 @@ int btrfs_update_root(struct btrfs_trans_handle *trans, 
> struct btrfs_root
>  out:
>   btrfs_free_path(path);
>   return ret;
> +abort_transaction:
> + btrfs_abort_transaction(trans, ret);
> + goto out;
>  }
>  
>  int btrfs_insert_root(struct btrfs_trans_handle *trans, struct btrfs_root 
> *root,
> 


-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH 4/5] btrfs: Use common error handling code in update_ref_path()

2017-08-21 Thread Jeff Mahoney
On 8/21/17 8:41 AM, SF Markus Elfring wrote:
> From: Markus Elfring <elfr...@users.sourceforge.net>
> Date: Mon, 21 Aug 2017 13:34:29 +0200
> 
> Add a jump target so that a bit of exception handling can be better reused
> in this function.
> 
> This issue was detected by using the Coccinelle software.

Adding a jump label in the middle of a conditional for "common" error
handling makes the code more difficult to understand.

-Jeff

> Signed-off-by: Markus Elfring <elfr...@users.sourceforge.net>
> ---
>  fs/btrfs/send.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> index 59fb1ed6ca20..a96edc91a101 100644
> --- a/fs/btrfs/send.c
> +++ b/fs/btrfs/send.c
> @@ -3697,12 +3697,12 @@ static int update_ref_path(struct send_ctx *sctx, 
> struct recorded_ref *ref)
>   return -ENOMEM;
>  
>   ret = get_cur_path(sctx, ref->dir, ref->dir_gen, new_path);
> - if (ret < 0) {
> - fs_path_free(new_path);
> - return ret;
> - }
> + if (ret < 0)
> + goto free_path;
> +
>   ret = fs_path_add(new_path, ref->name, ref->name_len);
>   if (ret < 0) {
> +free_path:
>   fs_path_free(new_path);
>   return ret;
>   }
> 


-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH 2/5] btrfs: Use common error handling code in __btrfs_free_extent()

2017-08-21 Thread Jeff Mahoney
, _ref);
> - if (ret) {
> - btrfs_abort_transaction(trans, ret);
> - goto out;
> - }
> + if (ret)
> + goto abort_transaction;
>   }
>   } else {
>   if (found_extent) {
> @@ -7078,37 +7067,33 @@ static int __btrfs_free_extent(struct 
> btrfs_trans_handle *trans,
>   last_ref = 1;
>   ret = btrfs_del_items(trans, extent_root, path, path->slots[0],
> num_to_del);
> - if (ret) {
> - btrfs_abort_transaction(trans, ret);
> - goto out;
> - }
> + if (ret)
> + goto abort_transaction;
> +
>   btrfs_release_path(path);
>  
>   if (is_data) {
>   ret = btrfs_del_csums(trans, info, bytenr, num_bytes);
> - if (ret) {
> - btrfs_abort_transaction(trans, ret);
> - goto out;
> - }
> + if (ret)
> + goto abort_transaction;
>   }
>  
>   ret = add_to_free_space_tree(trans, info, bytenr, num_bytes);
> - if (ret) {
> - btrfs_abort_transaction(trans, ret);
> - goto out;
> - }
> + if (ret)
> + goto abort_transaction;
>  
>   ret = update_block_group(trans, info, bytenr, num_bytes, 0);
> - if (ret) {
> - btrfs_abort_transaction(trans, ret);
> - goto out;
> - }
> + if (ret)
> + goto abort_transaction;
>   }
>   btrfs_release_path(path);
>  
>  out:
>   btrfs_free_path(path);
>   return ret;
> +abort_transaction:
> + btrfs_abort_transaction(trans, ret);
> + goto out;
>  }
>  
>  /*
> 


-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH 5/5] btrfs: Use common error handling code in btrfs_mark_extent_written()

2017-08-21 Thread Jeff Mahoney
bort_transaction(trans, ret);
> - goto out;
> - }
> + if (ret)
> + goto abort_transaction;
>   }
>   if (del_nr == 0) {
>   fi = btrfs_item_ptr(leaf, path->slots[0],
> @@ -1314,14 +1297,17 @@ int btrfs_mark_extent_written(struct 
> btrfs_trans_handle *trans,
>   btrfs_mark_buffer_dirty(leaf);
>  
>   ret = btrfs_del_items(trans, root, path, del_slot, del_nr);
> -     if (ret < 0) {
> - btrfs_abort_transaction(trans, ret);
> - goto out;
> - }
> + if (ret < 0)
> + goto abort_transaction;
>   }
>  out:
>   btrfs_free_path(path);
>   return 0;
> +e_inval:
> + ret = -EINVAL;
> +abort_transaction:
> + btrfs_abort_transaction(trans, ret);
> + goto out;
>  }
>  
>  /*
> 


-- 
Jeff Mahoney
SUSE Labs





[PATCH v2] btrfs: pass fs_info to btrfs_del_root instead of tree_root

2017-08-17 Thread Jeff Mahoney
btrfs_del_roots always uses the tree_root.  Let's pass fs_info instead.

Signed-off-by: Jeff Mahoney <je...@suse.com>
---
 fs/btrfs/ctree.h   | 4 ++--
 fs/btrfs/extent-tree.c | 2 +-
 fs/btrfs/free-space-tree.c | 2 +-
 fs/btrfs/qgroup.c  | 3 +--
 fs/btrfs/root-tree.c   | 7 ---
 5 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2ec2b24c0fec..044ca9b65a7b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2979,8 +2979,8 @@ int btrfs_del_root_ref(struct btrfs_trans_handle *trans,
   struct btrfs_fs_info *fs_info,
   u64 root_id, u64 ref_id, u64 dirid, u64 *sequence,
   const char *name, int name_len);
-int btrfs_del_root(struct btrfs_trans_handle *trans, struct btrfs_root *root,
-  const struct btrfs_key *key);
+int btrfs_del_root(struct btrfs_trans_handle *trans,
+  struct btrfs_fs_info *fs_info, const struct btrfs_key *key);
 int btrfs_insert_root(struct btrfs_trans_handle *trans, struct btrfs_root 
*root,
  const struct btrfs_key *key,
  struct btrfs_root_item *item);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 116c5615d6c2..69eee2667720 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9184,7 +9184,7 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
if (err)
goto out_end_trans;
 
-   ret = btrfs_del_root(trans, tree_root, &root->root_key);
+   ret = btrfs_del_root(trans, fs_info, &root->root_key);
if (ret) {
btrfs_abort_transaction(trans, ret);
goto out_end_trans;
diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c
index a5e34de06c2f..684f12247db7 100644
--- a/fs/btrfs/free-space-tree.c
+++ b/fs/btrfs/free-space-tree.c
@@ -1257,7 +1257,7 @@ int btrfs_clear_free_space_tree(struct btrfs_fs_info 
*fs_info)
if (ret)
goto abort;
 
-   ret = btrfs_del_root(trans, tree_root, &free_space_root->root_key);
+   ret = btrfs_del_root(trans, fs_info, &free_space_root->root_key);
if (ret)
goto abort;
 
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index ddc37c537058..5c8b61c86e61 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -946,7 +946,6 @@ int btrfs_quota_enable(struct btrfs_trans_handle *trans,
 int btrfs_quota_disable(struct btrfs_trans_handle *trans,
struct btrfs_fs_info *fs_info)
 {
-   struct btrfs_root *tree_root = fs_info->tree_root;
struct btrfs_root *quota_root;
int ret = 0;
 
@@ -968,7 +967,7 @@ int btrfs_quota_disable(struct btrfs_trans_handle *trans,
if (ret)
goto out;
 
-   ret = btrfs_del_root(trans, tree_root, &quota_root->root_key);
+   ret = btrfs_del_root(trans, fs_info, &quota_root->root_key);
if (ret)
goto out;
 
diff --git a/fs/btrfs/root-tree.c b/fs/btrfs/root-tree.c
index 5b488af6f25e..9fb9896610e0 100644
--- a/fs/btrfs/root-tree.c
+++ b/fs/btrfs/root-tree.c
@@ -335,10 +335,11 @@ int btrfs_find_orphan_roots(struct btrfs_fs_info *fs_info)
return err;
 }
 
-/* drop the root item for 'key' from 'root' */
-int btrfs_del_root(struct btrfs_trans_handle *trans, struct btrfs_root *root,
-  const struct btrfs_key *key)
+/* drop the root item for 'key' from the tree root */
+int btrfs_del_root(struct btrfs_trans_handle *trans,
+  struct btrfs_fs_info *fs_info, const struct btrfs_key *key)
 {
+   struct btrfs_root *root = fs_info->tree_root;
struct btrfs_path *path;
int ret;
 



Re: [PATCH] btrfs: pass fs_info to routines that always take tree_root

2017-08-17 Thread Jeff Mahoney
On 8/2/17 3:54 PM, je...@suse.com wrote:
> From: Jeff Mahoney <je...@suse.com>
> 
> btrfs_find_root and btrfs_del_root always use the tree_root.  Let's pass
> fs_info instead.

This one is broken.  btrfs_read_fs_root is called during log tree
recovery with the log_root_tree.  I'll send an updated patch.

-Jeff

> Signed-off-by: Jeff Mahoney <je...@suse.com>
> ---
>  fs/btrfs/ctree.h   |  7 ---
>  fs/btrfs/disk-io.c |  2 +-
>  fs/btrfs/extent-tree.c |  4 ++--
>  fs/btrfs/free-space-tree.c |  2 +-
>  fs/btrfs/qgroup.c  |  3 +--
>  fs/btrfs/root-tree.c   | 15 +--
>  6 files changed, 18 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 3f3eb7b17cac..eed7cc991a80 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -2973,8 +2973,8 @@ int btrfs_del_root_ref(struct btrfs_trans_handle *trans,
>  struct btrfs_fs_info *fs_info,
>  u64 root_id, u64 ref_id, u64 dirid, u64 *sequence,
>  const char *name, int name_len);
> -int btrfs_del_root(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> -const struct btrfs_key *key);
> +int btrfs_del_root(struct btrfs_trans_handle *trans,
> +struct btrfs_fs_info *fs_info, const struct btrfs_key *key);
>  int btrfs_insert_root(struct btrfs_trans_handle *trans, struct btrfs_root 
> *root,
> const struct btrfs_key *key,
> struct btrfs_root_item *item);
> @@ -2982,7 +2982,8 @@ int __must_check btrfs_update_root(struct 
> btrfs_trans_handle *trans,
>  struct btrfs_root *root,
>  struct btrfs_key *key,
>  struct btrfs_root_item *item);
> -int btrfs_find_root(struct btrfs_root *root, const struct btrfs_key 
> *search_key,
> +int btrfs_find_root(struct btrfs_fs_info *fs_info,
> + const struct btrfs_key *search_key,
>   struct btrfs_path *path, struct btrfs_root_item *root_item,
>   struct btrfs_key *root_key);
>  int btrfs_find_orphan_roots(struct btrfs_fs_info *fs_info);
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 080e2ebb8aa0..ea1959937875 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1581,7 +1581,7 @@ static struct btrfs_root *btrfs_read_tree_root(struct 
> btrfs_root *tree_root,
>  
>   __setup_root(root, fs_info, key->objectid);
>  
> - ret = btrfs_find_root(tree_root, key, path,
> + ret = btrfs_find_root(fs_info, key, path,
> &root->root_item, &root->root_key);
>   if (ret) {
>   if (ret > 0)
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 82d53a7b6652..12fa33accdcc 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -9192,14 +9192,14 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
>   if (err)
>   goto out_end_trans;
>  
> - ret = btrfs_del_root(trans, tree_root, &root->root_key);
> + ret = btrfs_del_root(trans, fs_info, &root->root_key);
>   if (ret) {
>   btrfs_abort_transaction(trans, ret);
>   goto out_end_trans;
>   }
>  
>   if (root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID) {
> - ret = btrfs_find_root(tree_root, &root->root_key, path,
> + ret = btrfs_find_root(fs_info, &root->root_key, path,
> NULL, NULL);
>   if (ret < 0) {
>   btrfs_abort_transaction(trans, ret);
> diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c
> index a5e34de06c2f..684f12247db7 100644
> --- a/fs/btrfs/free-space-tree.c
> +++ b/fs/btrfs/free-space-tree.c
> @@ -1257,7 +1257,7 @@ int btrfs_clear_free_space_tree(struct btrfs_fs_info 
> *fs_info)
>   if (ret)
>   goto abort;
>  
> - ret = btrfs_del_root(trans, tree_root, &free_space_root->root_key);
> + ret = btrfs_del_root(trans, fs_info, &free_space_root->root_key);
>   if (ret)
>   goto abort;
>  
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index 4ce351efe281..ba60523a443c 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -946,7 +946,6 @@ int btrfs_quota_enable(struct btrfs_trans_handle *trans,
>  int btrfs_quota_disable(struct btrfs_trans_handle *trans,
>   struct btrfs_fs_info *fs_info)
>  {
> - struct btrfs_root *tree_root = fs_info->tree_root;
>   struct btrfs_root *quota_root;
>   int ret = 0;
>  
> @@ -968,7 +967,7 @@ int btrfs_quota_disable(stru

Re: btrfs-progs-v4.12: cross compiling

2017-08-15 Thread Jeff Mahoney
On 8/14/17 11:10 AM, David Sterba wrote:
> On Mon, Aug 14, 2017 at 10:14:42PM +0800, Qu Wenruo wrote:
>> On 2017年08月14日 22:03, David Sterba wrote:
>>> On Mon, Aug 14, 2017 at 09:17:08PM +0800, Qu Wenruo wrote:
>>>> On 2017年08月14日 21:06, David Sterba wrote:
>>>>> On Mon, Aug 14, 2017 at 02:17:26PM +0200, Hallo32 wrote:
>>>>>> Since versions 4.12 btrfs-progs is complicated to cross compile for
>>>>>> other systems.
>>>>>> The problem is, that this version includes mktables, which needs to be
>>>>>> compiled for the host system and executed there for the creation of
>>>>>> tables.c.
>>>>>>
>>>>>> Are there any changes planed for the next version of btrfs-progs to make
>>>>>> the cross compiling as simple as in the past? A included tables.c for
>>>>>> example?
>>>>>
>>>>> Yes, keeping the generated tables.c around is fine. There's no reason it
>>>>> needs to be generated each time during build. I'll fix that in 4.12.1.
>>>>
>>>> But the number of lines and impossibility to review it makes it not
>>>> suitable to be managed by git.
>>>
>>> I don't understand your concern. The file is generated from a set of
>>> formulas, not intended to be updated directly.
>>
>> Yes, it should never be updated directly, so it's generated by a less 
>> than 400 lines program, instead of a whole 10K+ lines file managed by git.
> 
> mktables.c is synced from kernel sources, taking updates from there is
> easier than porting any changes to the proposed scripted implementation.
> 
> The workflow is simple:
> - copy kernel mktables.c changes to btrfs-progs mktables.c
> - compile mktables
> - run 'make kernel-lib/tables.c'

Can't this happen as part of a make dist (that we don't do right now)?

> - commit the changes to git

... and anyone using the git repo directly can sort out how to build it?

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Re: Btrfs umount hang

2017-08-09 Thread Jeff Mahoney
On 8/8/17 7:30 AM, Angel Shtilianov wrote:
> crash> bt -f 31625
> PID: 31625  TASK: 88046a833400  CPU: 7   COMMAND: "btrfs-transacti"
> wants to acquire struct extent_buffer 88000460aca0 lock, whose
> lock_owner is 27574.
> 
> here is pid 27574:
> PID: 27574  TASK: 88038b469a00  CPU: 4   COMMAND: "kworker/u32:9"
> which is also is trying to acquire eb lock 8802598b6200, and here
> the owner is 31696.
> 
> 31696 is
> PID: 31696  TASK: 88044b59ce00  CPU: 5   COMMAND: "umount"
> 
> So definitely here is a kind of deadlock.
> umount holds the lock needed by the workers to complete and waits them
> to complete.
> Lockdep wouldn't complain about that.
> I am still about to investigate what has previously triggered/disabled 
> lockdep.
> I have to obtain the log from the machine, but I need some time to get it.
> 
> Jeff, you were right.
> Could you help demystifying how we ended up here?

Hi Angel -

It looks like a regression introduced by 291c7d2f5, but that's a very
old commit.  As that commit says, it's a rare occurrence to hit that
wait, and that's probably why we haven't seen this issue sooner.

There's potential for this to happen whenever two threads are modifying
the tree at once and one needs to find a free extent.  I'll need to
think a bit on how to fix it.

-Jeff

> Best regards,
> Angel
> 
> On Mon, Aug 7, 2017 at 9:10 PM, Jeff Mahoney <je...@suse.com> wrote:
>> On 8/7/17 1:19 PM, Jeff Mahoney wrote:
>>> On 8/7/17 10:12 AM, Angel Shtilianov wrote:
>>>> Hi there,
>>>> I'm investigating sporadic hanging during btrfs umount. The FS is
>>>> contained in a loop mounted file.
>>>> I have no reproduction scenario and the issue may happen once a day or
>>>> once a month. It is rare, but frustrating.
>>>> I have a crashdump (the server has been manually crashed and collected
>>>> a crashdump), so I could take look through the data structures.
>>>> What happens is that umount is getting in D state and a the kernel
>>>> complains about hung tasks. We are using kernel 4.4.y The actual back
>>>> trace is from 4.4.70, but this happens with all the 4.4 kernels I've
>>>> used (4.4.30 through 4.4.70).
>>>> Tasks like:
>>>> INFO: task kworker/u32:9:27574 blocked for more than 120 seconds.
>>>> INFO: task kworker/u32:12:27575 blocked for more than 120 seconds.
>>>> INFO: task btrfs-transacti:31625 blocked for more than 120 seconds.
>>>> are getting blocked waiting for btrfs_tree_read_lock, which is owned
>>>> by task umount:31696 (which is also blocked for more than 120 seconds)
>>>> regarding the lock debug.
>>>>
>>>> umount is hung in "cache_block_group", see the '>' mark:
>>>>    while (cache->cached == BTRFS_CACHE_FAST) {
>>>>            struct btrfs_caching_control *ctl;
>>>>
>>>>            ctl = cache->caching_ctl;
>>>>            atomic_inc(&ctl->count);
>>>>            prepare_to_wait(&ctl->wait, &wait, TASK_UNINTERRUPTIBLE);
>>>>            spin_unlock(&cache->lock);
>>>>
>>>> >          schedule();
>>>>
>>>>            finish_wait(&ctl->wait, &wait);
>>>>            put_caching_control(ctl);
>>>>            spin_lock(&cache->lock);
>>>>    }
>>>>
>>>> The complete backtraces could be found in the attached log.
>>>>
>>>> Do you have any ideas ?
>>>
>>> Hi Angel -
>>>
>>> In your log, it says lockdep is disabled.  What tripped it earlier?
>>> Lockdep really should be catching locking deadlocks in situations like
>>> this, if that's really the underlying cause.
>>
>> Actually, I'm not sure if lockdep would catch this one.  Here's my
>> hypothesis:
>>
>> kworker/u32:9 is waiting on a read lock while reading the free space
>> cache, which means it owns the cache->cached value and will issue the
>> wakeup when it completes.
>>
>> umount is waiting on for the wakeup from kworker/u32:9 but is holding
>> some tree locks in write mode.
>>
>> If kworker/u32:9 is waiting on the locks that umount holds, we have a
>> deadlock.
>>
>> Can you dump the extent buffer that kworker/u32:9 is waiting on?  Part
>> of that will contain the PID of the holder, and if matches umount, we
>> found the cause.
>>
>> -Jeff
>>
>> --
>> Jeff Mahoney
>> SUSE Labs
>>
> 


-- 
Jeff Mahoney
SUSE Labs





Re: Btrfs umount hang

2017-08-07 Thread Jeff Mahoney
On 8/7/17 1:19 PM, Jeff Mahoney wrote:
> On 8/7/17 10:12 AM, Angel Shtilianov wrote:
>> Hi there,
>> I'm investigating sporadic hanging during btrfs umount. The FS is
>> contained in a loop mounted file.
>> I have no reproduction scenario and the issue may happen once a day or
>> once a month. It is rare, but frustrating.
>> I have a crashdump (the server has been manually crashed and collected
>> a crashdump), so I could take look through the data structures.
>> What happens is that umount is getting in D state and a the kernel
>> complains about hung tasks. We are using kernel 4.4.y The actual back
>> trace is from 4.4.70, but this happens with all the 4.4 kernels I've
>> used (4.4.30 through 4.4.70).
>> Tasks like:
>> INFO: task kworker/u32:9:27574 blocked for more than 120 seconds.
>> INFO: task kworker/u32:12:27575 blocked for more than 120 seconds.
>> INFO: task btrfs-transacti:31625 blocked for more than 120 seconds.
>> are getting blocked waiting for btrfs_tree_read_lock, which is owned
>> by task umount:31696 (which is also blocked for more than 120 seconds)
>> regarding the lock debug.
>>
>> umount is hung in "cache_block_group", see the '>' mark:
>>    while (cache->cached == BTRFS_CACHE_FAST) {
>>            struct btrfs_caching_control *ctl;
>>
>>            ctl = cache->caching_ctl;
>>            atomic_inc(&ctl->count);
>>            prepare_to_wait(&ctl->wait, &wait, TASK_UNINTERRUPTIBLE);
>>            spin_unlock(&cache->lock);
>>
>> >          schedule();
>>
>>            finish_wait(&ctl->wait, &wait);
>>            put_caching_control(ctl);
>>            spin_lock(&cache->lock);
>>    }
>>
>> The complete backtraces could be found in the attached log.
>>
>> Do you have any ideas ?
> 
> Hi Angel -
> 
> In your log, it says lockdep is disabled.  What tripped it earlier?
> Lockdep really should be catching locking deadlocks in situations like
> this, if that's really the underlying cause.

Actually, I'm not sure if lockdep would catch this one.  Here's my
hypothesis:

kworker/u32:9 is waiting on a read lock while reading the free space
cache, which means it owns the cache->cached value and will issue the
wakeup when it completes.

umount is waiting on for the wakeup from kworker/u32:9 but is holding
some tree locks in write mode.

If kworker/u32:9 is waiting on the locks that umount holds, we have a
deadlock.

Can you dump the extent buffer that kworker/u32:9 is waiting on?  Part
of that will contain the PID of the holder, and if matches umount, we
found the cause.
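
For reference, what makes the dump conclusive (paraphrased from
fs/btrfs/extent_io.h and locking.c of that era, not verbatim) is that the
write-lock path records the locking task's pid in the buffer itself:

struct extent_buffer {
	/* ... */
	pid_t lock_owner;	/* pid of the task holding the tree lock */
	/* ... */
};

/* and near the end of btrfs_tree_lock(), roughly: */
	write_lock(&eb->lock);
	/* ... */
	eb->lock_owner = current->pid;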

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Re: Btrfs umount hang

2017-08-07 Thread Jeff Mahoney
On 8/7/17 10:12 AM, Angel Shtilianov wrote:
> Hi there,
> I'm investigating sporadic hanging during btrfs umount. The FS is
> contained in a loop mounted file.
> I have no reproduction scenario and the issue may happen once a day or
> once a month. It is rare, but frustrating.
> I have a crashdump (the server has been manually crashed and collected
> a crashdump), so I could take look through the data structures.
> What happens is that umount is getting in D state and a the kernel
> complains about hung tasks. We are using kernel 4.4.y The actual back
> trace is from 4.4.70, but this happens with all the 4.4 kernels I've
> used (4.4.30 through 4.4.70).
> Tasks like:
> INFO: task kworker/u32:9:27574 blocked for more than 120 seconds.
> INFO: task kworker/u32:12:27575 blocked for more than 120 seconds.
> INFO: task btrfs-transacti:31625 blocked for more than 120 seconds.
> are getting blocked waiting for btrfs_tree_read_lock, which is owned
> by task umount:31696 (which is also blocked for more than 120 seconds)
> regarding the lock debug.
> 
> umount is hung in "cache_block_group", see the '>' mark:
>    while (cache->cached == BTRFS_CACHE_FAST) {
>            struct btrfs_caching_control *ctl;
> 
>            ctl = cache->caching_ctl;
>            atomic_inc(&ctl->count);
>            prepare_to_wait(&ctl->wait, &wait, TASK_UNINTERRUPTIBLE);
>            spin_unlock(&cache->lock);
> 
> >          schedule();
> 
>            finish_wait(&ctl->wait, &wait);
>            put_caching_control(ctl);
>            spin_lock(&cache->lock);
>    }
> 
> The complete backtraces could be found in the attached log.
> 
> Do you have any ideas ?

Hi Angel -

In your log, it says lockdep is disabled.  What tripped it earlier?
Lockdep really should be catching locking deadlocks in situations like
this, if that's really the underlying cause.

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH 4/5] btrfs-progs: tests: fix typo in convert-tests/008-readonly-image

2017-07-31 Thread Jeff Mahoney
On 7/27/17 9:27 PM, Qu Wenruo wrote:
> 
> 
> On 2017年07月27日 23:47, je...@suse.com wrote:
>> From: Jeff Mahoney <je...@suse.com>
>>
>> The dd in convert-tests/008-readonly-image is expected to fail, so
>> there being a typo in the file name has gone unnoticed.
> 
> Thanks for catching this.
> This is very embarrassing.
> 
>>
>> Signed-off-by: Jeff Mahoney <je...@suse.com>
>> ---
>>   tests/convert-tests/008-readonly-image/test.sh | 4 +++-
>>   1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/tests/convert-tests/008-readonly-image/test.sh
>> b/tests/convert-tests/008-readonly-image/test.sh
>> index 27c9373..cc846fa 100755
>> --- a/tests/convert-tests/008-readonly-image/test.sh
>> +++ b/tests/convert-tests/008-readonly-image/test.sh
>> @@ -14,9 +14,10 @@ convert_test_prep_fs $default_mke2fs
>>   run_check_umount_test_dev
>>   convert_test_do_convert
>>   run_check_mount_test_dev
>> +run_check e2fsck -n "$TEST_MNT/ext2_saved/image"
>> # It's expected to fail
>> -$SUDO_HELPER dd if=/dev/zero of="$TEST_MNT/ext2_save/image" bs=1M
>> count=1 \
>> +$SUDO_HELPER dd if=/dev/zero of="$TEST_MNT/ext2_saved/image" bs=1M
>> count=1 \
>>   &> /dev/null
> 
> BTW, now we have run_mustfail() function, so we don't need to manually
> check the return value now.

Ah, right.  I'll incorporate this with Lakshmipathi's feedback.

-Jeff

> Thanks,
> Qu
> 
>>   if [ $? -ne 1 ]; then
>>   echo "after convert ext2_save/image is not read-only"
>> @@ -24,3 +25,4 @@ if [ $? -ne 1 ]; then
>>   fi
>>   run_check_umount_test_dev
>>   convert_test_post_rollback
>> +
>>


-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH 1/3] btrfs-progs: convert: properly handle reserved ranges while iterating files

2017-07-27 Thread Jeff Mahoney
On 7/27/17 12:38 PM, Jeff Mahoney wrote:
> On 7/26/17 9:35 PM, Qu Wenruo wrote:
>>
>>
>> On 2017年07月26日 04:54, je...@suse.com wrote:
>>> From: Jeff Mahoney <je...@suse.com>
>>>
>>> Commit 522ef705e38 (btrfs-progs: convert: Introduce function to calculate
>>> the available space) changed how we handle migrating file data so that
>>> we never have btrfs space associated with the reserved ranges.  This
>>> works pretty well and when we iterate over the file blocks, the
>>> associations are redirected to the migrated locations.
>>>
>>> This commit missed the case in block_iterate_proc where we just check
>>> for intersection with a superblock location before looking up a block
>>> group.  intersect_with_sb checks to see if the range intersects with
>>> a stripe containing a superblock but, in fact, we've reserved the
>>> full 0-1MB range at the start of the disk.  So a file block located
>>> at e.g. 160kB will fall in the reserved region but won't be excepted
>>> in block_iterate_block.  We ultimately hit a BUG_ON when we fail
>>> to look up the block group for that location.
>>
>> The description of the problem  is indeed correct.
>>
>>>
>>> This is reproducible using convert-tests/003-ext4-basic.
>>
>> Thanks for pointing this out, I also reproduced it.
>>
>> While it would be nicer if you could upload a special crafted image as
>> indicated test case.
>> IIRC the test passed without problem several versions ago, so there may
>> be some factors preventing the bug from being exposed.
>>
>>>
>>> The fix is to have intersect_with_sb and block_iterate_proc understand
>>> the full size of the reserved ranges.  Since we use the range to
>>> determine the boundary for the block iterator, let's just return the
>>> boundary.  0 isn't a valid boundary and means that we proceed normally
>>> with block group lookup.
>>
>> I'm OK with current fix as it indeed fix the bug and has minimal impact
>> on current code.
>>
>> So feel free to add:
>> Reviewed-by: Qu Wenruo <quwenruo.bt...@gmx.com>
>>
>> While I think there is a better way to solve it more completely.
>>
>> As when we run into block_iterate_proc(), we have already created
>> ext2_save/image.
>> So we can use the the image as ext2 <-> btrfs position mapping, just as
>> we have already done in record_file_blocks().
>>
>> That's to say, we don't need too much care about the intersection with
>> reserved range, but just letting record_file_blocks() to handle it will
>> be good enough.
>>
>> What do you think about this idea?
> 
> I think you're right.  It should do the mapping already so we don't need
> to do anything special in block_iterate_proc.  I can test that in a bit.

So the idea works and, in fact, we could really get rid of most of
block_iterate_proc and still get correct results.  This code is an
optimization so that we can quickly assemble larger extents and not have
to grow on-disk extents repeatedly.  The code as it is does the right
thing most of the time but the boundary condition triggers the barrier
for every block in a migrated range and record_file_blocks must do the
growing.  The end result works fine, it's just slower than it needs to
be for that 0-1MB range.  I'm not sure I care enough to invest the time
to fix that.

-Jeff

> -Jeff
> 
>> Thanks,
>> Qu
>>
>>>
>>> Cc: Qu Wenruo <quwenruo.bt...@gmx.com>
>>> Signed-off-by: Jeff Mahoney <je...@suse.com>
>>> ---
>>>   convert/source-fs.c | 25 +++--
>>>   1 file changed, 11 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/convert/source-fs.c b/convert/source-fs.c
>>> index 80e4e41..09f6995 100644
>>> --- a/convert/source-fs.c
>>> +++ b/convert/source-fs.c
>>> @@ -28,18 +28,16 @@ const struct simple_range btrfs_reserved_ranges[3]
>>> = {
>>>   { BTRFS_SB_MIRROR_OFFSET(2), SZ_64K }
>>>   };
>>>   -static int intersect_with_sb(u64 bytenr, u64 num_bytes)
>>> +static u64 intersect_with_reserved(u64 bytenr, u64 num_bytes)
>>>   {
>>>   int i;
>>> -u64 offset;
>>>   -for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
>>> -offset = btrfs_sb_offset(i);
>>> -offset &= ~((u64)BTRFS_STRIPE_LEN - 1);
>>> +for (i = 0; i < ARRAY_SIZE(btrfs_reserved_ranges); i++) {
>>> +const struct simple_range *range = &btrfs_reserved_ranges[i];
>>>   -

Re: [PATCH 1/3] btrfs-progs: convert: properly handle reserved ranges while iterating files

2017-07-27 Thread Jeff Mahoney
On 7/26/17 9:35 PM, Qu Wenruo wrote:
> 
> 
> On 2017年07月26日 04:54, je...@suse.com wrote:
>> From: Jeff Mahoney <je...@suse.com>
>>
>> Commit 522ef705e38 (btrfs-progs: convert: Introduce function to calculate
>> the available space) changed how we handle migrating file data so that
>> we never have btrfs space associated with the reserved ranges.  This
>> works pretty well and when we iterate over the file blocks, the
>> associations are redirected to the migrated locations.
>>
>> This commit missed the case in block_iterate_proc where we just check
>> for intersection with a superblock location before looking up a block
>> group.  intersect_with_sb checks to see if the range intersects with
>> a stripe containing a superblock but, in fact, we've reserved the
>> full 0-1MB range at the start of the disk.  So a file block located
>> at e.g. 160kB will fall in the reserved region but won't be excepted
>> in block_iterate_block.  We ultimately hit a BUG_ON when we fail
>> to look up the block group for that location.
> 
> The description of the problem  is indeed correct.
> 
>>
>> This is reproducible using convert-tests/003-ext4-basic.
> 
> Thanks for pointing this out, I also reproduced it.
> 
> While it would be nicer if you could upload a special crafted image as
> indicated test case.
> IIRC the test passed without problem several versions ago, so there may
> be some factors preventing the bug from being exposed.
> 
>>
>> The fix is to have intersect_with_sb and block_iterate_proc understand
>> the full size of the reserved ranges.  Since we use the range to
>> determine the boundary for the block iterator, let's just return the
>> boundary.  0 isn't a valid boundary and means that we proceed normally
>> with block group lookup.
> 
> I'm OK with current fix as it indeed fix the bug and has minimal impact
> on current code.
> 
> So feel free to add:
> Reviewed-by: Qu Wenruo <quwenruo.bt...@gmx.com>
> 
> While I think there is a better way to solve it more completely.
> 
> As when we run into block_iterate_proc(), we have already created
> ext2_save/image.
> So we can use the the image as ext2 <-> btrfs position mapping, just as
> we have already done in record_file_blocks().
> 
> That's to say, we don't need too much care about the intersection with
> reserved range, but just letting record_file_blocks() to handle it will
> be good enough.
> 
> What do you think about this idea?

I think you're right.  It should do the mapping already so we don't need
to do anything special in block_iterate_proc.  I can test that in a bit.

-Jeff

> Thanks,
> Qu
> 
>>
>> Cc: Qu Wenruo <quwenruo.bt...@gmx.com>
>> Signed-off-by: Jeff Mahoney <je...@suse.com>
>> ---
>>   convert/source-fs.c | 25 +++--
>>   1 file changed, 11 insertions(+), 14 deletions(-)
>>
>> diff --git a/convert/source-fs.c b/convert/source-fs.c
>> index 80e4e41..09f6995 100644
>> --- a/convert/source-fs.c
>> +++ b/convert/source-fs.c
>> @@ -28,18 +28,16 @@ const struct simple_range btrfs_reserved_ranges[3]
>> = {
>>   { BTRFS_SB_MIRROR_OFFSET(2), SZ_64K }
>>   };
>>   -static int intersect_with_sb(u64 bytenr, u64 num_bytes)
>> +static u64 intersect_with_reserved(u64 bytenr, u64 num_bytes)
>>   {
>>   int i;
>> -u64 offset;
>>   -for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
>> -offset = btrfs_sb_offset(i);
>> -offset &= ~((u64)BTRFS_STRIPE_LEN - 1);
>> +for (i = 0; i < ARRAY_SIZE(btrfs_reserved_ranges); i++) {
>> +const struct simple_range *range = &btrfs_reserved_ranges[i];
>>   -if (bytenr < offset + BTRFS_STRIPE_LEN &&
>> -bytenr + num_bytes > offset)
>> -return 1;
>> +if (bytenr < range_end(range) &&
>> +bytenr + num_bytes >= range->start)
>> +return range_end(range);
>>   }
>>   return 0;
>>   }
>> @@ -64,14 +62,14 @@ int block_iterate_proc(u64 disk_block, u64
>> file_block,
>> struct blk_iterate_data *idata)
>>   {
>>   int ret = 0;
>> -int sb_region;
>> +u64 reserved_boundary;
>>   int do_barrier;
>>   struct btrfs_root *root = idata->root;
>>   struct btrfs_block_group_cache *cache;
>>   u64 bytenr = disk_block * root->sectorsize;
>>   -sb_region = intersect_with_sb(bytenr, root->sectorsize);
>> -do_barrier = sb_region || disk_block >= idat

Re: [PATCH 5/7] btrfs-progs: backref: add list_first_pref helper

2017-07-26 Thread Jeff Mahoney
On 7/26/17 9:22 AM, Jeff Mahoney wrote:
> On 7/26/17 3:08 AM, Nikolay Borisov wrote:
>>
>>
>> On 25.07.2017 23:51, je...@suse.com wrote:
>>> From: Jeff Mahoney <je...@suse.com>
>>>
>>> ---
>>>  backref.c | 11 +++
>>>  1 file changed, 7 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/backref.c b/backref.c
>>> index ac1b506..be3376a 100644
>>> --- a/backref.c
>>> +++ b/backref.c
>>> @@ -130,6 +130,11 @@ struct __prelim_ref {
>>> u64 wanted_disk_byte;
>>>  };
>>>  
>>> +static struct __prelim_ref *list_first_pref(struct list_head *head)
>>> +{
>>> +   return list_first_entry(head, struct __prelim_ref, list);
>>> +}
>>> +
>>
>> I think this just adds one more level of abstraction with no real
>> benefit whatsoever. Why not drop the patch entirely.
> 
> Ack.  I thought it might be more readable but it ends up taking the same
> number of characters.

Actually, no, it doesn't.  That's only true if using 'head' as the list head
as in the helper.

It ends up being

    ref = list_first_pref(>pending_missing_keys);
vs
ref = list_first_entry(>pending_missing_keys,
   struct __prelim_ref, list);

and I have to say I prefer reading the former.

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH 5/7] btrfs-progs: backref: add list_first_pref helper

2017-07-26 Thread Jeff Mahoney
On 7/26/17 3:08 AM, Nikolay Borisov wrote:
> 
> 
> On 25.07.2017 23:51, je...@suse.com wrote:
>> From: Jeff Mahoney <je...@suse.com>
>>
>> ---
>>  backref.c | 11 +++
>>  1 file changed, 7 insertions(+), 4 deletions(-)
>>
>> diff --git a/backref.c b/backref.c
>> index ac1b506..be3376a 100644
>> --- a/backref.c
>> +++ b/backref.c
>> @@ -130,6 +130,11 @@ struct __prelim_ref {
>>  u64 wanted_disk_byte;
>>  };
>>  
>> +static struct __prelim_ref *list_first_pref(struct list_head *head)
>> +{
>> +return list_first_entry(head, struct __prelim_ref, list);
>> +}
>> +
> 
> I think this just adds one more level of abstraction with no real
> benefit whatsoever. Why not drop the patch entirely.

Ack.  I thought it might be more readable but it ends up taking the same
number of characters.

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH 3/7] btrfs-progs: extent-cache: actually cache extent buffers

2017-07-26 Thread Jeff Mahoney
On 7/26/17 3:00 AM, Nikolay Borisov wrote:
> 
> 
> On 25.07.2017 23:51, je...@suse.com wrote:
>> From: Jeff Mahoney <je...@suse.com>
>>
>> We have the infrastructure to cache extent buffers but we don't actually
>> do the caching.  As soon as the last reference is dropped, the buffer
>> is dropped.  This patch keeps the extent buffers around until the max
>> cache size is reached (defaults to 25% of memory) and then it drops
>> the last 10% of the LRU to free up cache space for reallocation.  The
>> cache size is configurable (for use by e.g. lowmem) when the cache is
>> initialized.
>>
>> Signed-off-by: Jeff Mahoney <je...@suse.com>

>> @@ -567,7 +580,21 @@ struct extent_buffer *btrfs_clone_extent_buffer(struct 
>> extent_buffer *src)
>>  return new;
>>  }
>>  
>> -void free_extent_buffer(struct extent_buffer *eb)
>> +static void free_extent_buffer_final(struct extent_buffer *eb)
>> +{
>> +struct extent_io_tree *tree = eb->tree;
>> +
>> +BUG_ON(eb->refs);
>> +BUG_ON(tree->cache_size < eb->len);
>> +list_del_init(&eb->lru);
>> +if (!(eb->flags & EXTENT_BUFFER_DUMMY)) {
>> +remove_cache_extent(&tree->cache, &eb->cache_node);
>> +tree->cache_size -= eb->len;
>> +}
>> +free(eb);
>> +}
>> +
>> +static void free_extent_buffer_internal(struct extent_buffer *eb, int 
>> free_now)
> 
> nit: free_ow -> boolean

Ack.  There should be a bunch of int -> bool conversions elsewhere too.

>> @@ -619,6 +650,21 @@ struct extent_buffer *find_first_extent_buffer(struct 
>> extent_io_tree *tree,
>>  return eb;
>>  }
>>  
>> +static void
>> +trim_extent_buffer_cache(struct extent_io_tree *tree)
>> +{
>> +struct extent_buffer *eb, *tmp;
>> +u64 count = 0;
> 
> count seems to be a leftover from something, so you could remove it

Yep, that was during debugging.  Removed.

>> @@ -2521,3 +2522,14 @@ u8 rand_u8(void)
>>  void btrfs_config_init(void)
>>  {
>>  }
>> +
>> +unsigned long total_memory(void)
> 
> perhaps rename to total_memory_bytes and return the memory size in
> bytes. Returning them in kilobytes seems rather arbitrary. That way
> you'd save the constant *1024 to turn the kbs in bytes in the callers
> (currently only in extent_io_tree_init())
> 

Ack.

Thanks,

-Jeff


-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH 3/3] btrfs-progs: convert: add support for converting reiserfs

2017-07-25 Thread Jeff Mahoney
On 7/25/17 4:54 PM, je...@suse.com wrote:
> From: Jeff Mahoney <je...@suse.com>
> 
> This patch adds support to convert reiserfs file systems in-place to btrfs.
> 
> It will convert extended attribute files to btrfs extended attributes,
> translate ACLs, coalesce tails that consist of multiple items into one item,
> and convert tails that are too big into indirect files.
> 
> This requires that libreiserfscore 3.6.27 be available.
> 
> Many of the test cases for convert apply regardless of what the source
> file system is and using ext4 is sufficient.  I've included several
> test cases that are reiserfs-specific.
> 
> Signed-off-by: Jeff Mahoney <je...@suse.com>
> ---
>  Makefile   |3 +-
>  Makefile.inc.in|3 +-
>  configure.ac   |   10 +-
>  convert/main.c |   13 +-
>  convert/source-reiserfs.c  | 1011 
> 
>  convert/source-reiserfs.h  |  105 ++
>  tests/common.convert   |   14 +-
>  tests/convert-tests/010-reiserfs-basic/test.sh |   16 +
>  .../011-reiserfs-delete-all-rollback/test.sh   |   67 ++
>  .../012-reiserfs-large-hole-extent/test.sh |   23 +
>  .../013-reiserfs-common-inode-flags/test.sh|   35 +
>  11 files changed, 1290 insertions(+), 10 deletions(-)
>  create mode 100644 convert/source-reiserfs.c
>  create mode 100644 convert/source-reiserfs.h
>  create mode 100755 tests/convert-tests/010-reiserfs-basic/test.sh
>  create mode 100755 
> tests/convert-tests/011-reiserfs-delete-all-rollback/test.sh
>  create mode 100755 tests/convert-tests/012-reiserfs-large-hole-extent/test.sh
>  create mode 100755 
> tests/convert-tests/013-reiserfs-common-inode-flags/test.sh
> 
> diff --git a/Makefile b/Makefile
> index 81598df..f7f6dab 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -111,7 +111,7 @@ libbtrfs_headers = send-stream.h send-utils.h send.h 
> kernel-lib/rbtree.h btrfs-l
>  kernel-lib/radix-tree.h kernel-lib/sizes.h extent-cache.h \
>  extent_io.h ioctl.h ctree.h btrfsck.h version.h
>  convert_objects = convert/main.o convert/common.o convert/source-fs.o \
> -   convert/source-ext2.o
> +   convert/source-ext2.o convert/source-reiserfs.o
>  mkfs_objects = mkfs/main.o mkfs/common.o
>  
>  TESTS = fsck-tests.sh convert-tests.sh
> @@ -188,6 +188,7 @@ endif
>  # external libs required by various binaries; for btrfs-foo,
>  # specify btrfs_foo_libs = ; see $($(subst...)) rules below
>  btrfs_convert_cflags = -DBTRFSCONVERT_EXT2=$(BTRFSCONVERT_EXT2)
> +btrfs_convert_cflags += -DBTRFSCONVERT_REISERFS=$(BTRFSCONVERT_REISERFS)
>  btrfs_fragments_libs = -lgd -lpng -ljpeg -lfreetype
>  btrfs_debug_tree_objects = cmds-inspect-dump-tree.o
>  btrfs_show_super_objects = cmds-inspect-dump-super.o
> diff --git a/Makefile.inc.in b/Makefile.inc.in
> index 4e1b68c..3c7bc03 100644
> --- a/Makefile.inc.in
> +++ b/Makefile.inc.in
> @@ -12,6 +12,7 @@ INSTALL = @INSTALL@
>  DISABLE_DOCUMENTATION = @DISABLE_DOCUMENTATION@
>  DISABLE_BTRFSCONVERT = @DISABLE_BTRFSCONVERT@
>  BTRFSCONVERT_EXT2 = @BTRFSCONVERT_EXT2@
> +BTRFSCONVERT_REISERFS = @BTRFSCONVERT_REISERFS@
>  
>  SUBST_CFLAGS = @CFLAGS@
>  SUBST_LDFLAGS = @LDFLAGS@
> @@ -31,6 +32,6 @@ udevruledir = ${udevdir}/rules.d
>  
>  # external libs required by various binaries; for btrfs-foo,
>  # specify btrfs_foo_libs = ; see $($(subst...)) rules in 
> Makefile
> -btrfs_convert_libs = @EXT2FS_LIBS@ @COM_ERR_LIBS@
> +btrfs_convert_libs = @EXT2FS_LIBS@ @COM_ERR_LIBS@ @REISERFS_LIBS@
>  
>  MAKEFILE_INC_INCLUDED = yes
> diff --git a/configure.ac b/configure.ac
> index 30055f8..3a8bd3f 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -120,6 +120,7 @@ fi
>  
>  convertfs=
>  BTRFSCONVERT_EXT2=0
> +BTRFSCONVERT_REISERFS=0
>  if test "x$enable_convert" = xyes; then
>   if test "x$with_convert" = "xauto" || echo "$with_convert" | grep -q 
> "ext2"; then
>   PKG_CHECK_MODULES(EXT2FS, [ext2fs >= 1.42],,
> @@ -131,11 +132,18 @@ if test "x$enable_convert" = xyes; then
>   convertfs="${convertfs:+$convertfs,}ext2"
>   BTRFSCONVERT_EXT2=1
>   fi
> + if test "x$with_convert" = "xauto" || echo "$with_convert" | grep -q 
> "reiserfs"; then
> + PKG_CHECK_MODULES(REISERFS, [reiserfscore >= 3.6.27],
> +   [BTRFSCONVERT_REISERFS=1],[])
> +   

[PATCH] btrfs: fix lockup in find_free_extent with read-only block groups

2017-07-19 Thread Jeff Mahoney
If we have a block group that is all of the following:
1) uncached in memory
2) is read-only
3) has a disk cache state that indicates we need to recreate the cache

AND the file system has enough free space fragmentation such that the
request for an extent of a given size can't be honored;

AND have a single CPU core;

AND it's the block group with the highest starting offset such that
there are no opportunities (like reading from disk) for the loop to
yield the CPU;

We can end up with a lockup.

The root cause is simple.  Once we're in the position that we've read in
all of the other block groups directly and none of those block groups
can honor the request, there are no more opportunities to sleep.  We end
up trying to start a caching thread which never gets run if we only have
one core.  This *should* present as a hung task waiting on the caching
thread to make some progress, but it doesn't.  Instead, it degrades into
a busy loop because of the placement of the read-only check.

During the first pass through the loop, block_group->cached will be set
to BTRFS_CACHE_STARTED and have_caching_bg will be set.  Then we hit the
read-only check and short circuit the loop.  We're not yet in
LOOP_CACHING_WAIT, so we skip that loop back before going through the
loop again for other raid groups.

Then we move to LOOP_CACHING_WAIT state.

During this pass through the loop, ->cached will still be
BTRFS_CACHE_STARTED, which means it's not cached, so we'll enter
cache_block_group, do a lot of nothing, and return, and also set
have_caching_bg again.  Then we hit the read-only check and short circuit
the loop.  The same thing happens as before except now we DO trigger
the LOOP_CACHING_WAIT && have_caching_bg check and loop back up to the
top.  We do this forever.

There are two fixes in this patch since they address the same underlying
bug.

The first is to add a cond_resched to the end of the loop to ensure
that the caching thread always has an opportunity to run.  This will
fix the soft lockup issue, but find_free_extent will still loop doing
nothing until the thread has completed.

The second is to move the read-only check to the top of the loop.  We're
never going to return an allocation within a read-only block group so
we may as well skip it early.  The check for ->cached == BTRFS_CACHE_ERROR
would cause the same problem except that BTRFS_CACHE_ERROR is considered
a "done" state and we won't re-set have_caching_bg again.

Many thanks to Stephan Kulow <co...@suse.de> for his excellent help in
the testing process.

Signed-off-by: Jeff Mahoney <je...@suse.com>
---
 fs/btrfs/extent-tree.c |7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7571,6 +7571,10 @@ search:
u64 offset;
int cached;
 
+   /* If the block group is read-only, we can skip it entirely. */
+   if (unlikely(block_group->ro))
+   continue;
+
btrfs_grab_block_group(block_group, delalloc);
search_start = block_group->key.objectid;
 
@@ -7606,8 +7610,6 @@ have_block_group:
 
if (unlikely(block_group->cached == BTRFS_CACHE_ERROR))
goto loop;
-   if (unlikely(block_group->ro))
-   goto loop;
 
/*
 * Ok we want to try and use the cluster allocator, so
@@ -7819,6 +7821,7 @@ loop:
failed_alloc = false;
BUG_ON(index != get_block_group_index(block_group));
btrfs_release_block_group(block_group, delalloc);
+   cond_resched();
    }
	up_read(&space_info->groups_sem);
 

-- 
Jeff Mahoney
SUSE Labs


Re: [PATCH v2 1/2] btrfs: account for pinned bytes in should_alloc_chunk

2017-06-29 Thread Jeff Mahoney
On 6/29/17 3:21 PM, Omar Sandoval wrote:
> On Thu, Jun 22, 2017 at 09:51:47AM -0400, je...@suse.com wrote:
>> From: Jeff Mahoney <je...@suse.com>
>>
>> In a heavy write scenario, we can end up with a large number of pinned bytes.
>> This can translate into (very) premature ENOSPC because pinned bytes
>> must be accounted for when allowing a reservation but aren't accounted for
>> when deciding whether to create a new chunk.
>>
>> This patch adds the accounting to should_alloc_chunk so that we can
>> create the chunk.
> 
> Hey, Jeff,

Hi Omar -

> Does this fix your ENOSPC problem on a fresh filesystem? I just tracked

No, it didn't.  It helped somewhat, but we were still hitting it
frequently.  What did help was reverting "Btrfs: skip commit transaction
if we don't have enough pinned bytes" (not upstream yet, on the list).

> down an ENOSPC issue someone here reported when doing a btrfs send to a
> fresh filesystem and it sounds a lot like your issue: metadata
> bytes_may_use shoots up but we don't allocate any chunks for it. I'm not
> seeing how including bytes_pinned will help for this case. We won't have
> any pinned bytes when populating a new fs, right?

Our test environment is just installing the OS.  That means lots of
creates, writes, and then renames, so there's a fair amount of metadata
churn that results in elevated pinned_bytes.  Rsync can cause the same
workload pretty easily too.  Nikolay was going to look into coming up
with a configuration for fsstress that would emulate it.

> I don't have a good solution. Allocating chunks based on bytes_may_use
> is going to way over-allocate because of our worst-case estimations. I'm
> double-checking now that the flusher is doing the right thing and not
> missing anything. I'll keep digging, just wanted to know if you had any
> thoughts.

My suspicion is that it all just happens to work and that there are
several bugs working together that approximate a correct result.  My
reasoning is that the patch I referenced above is correct.  The logic in
may_commit_transaction is inverted and causing a ton of additional
transaction commits.  I think that having the additional transaction
commits is serving to free pinned bytes more quickly so things just work
for the most part and pinned bytes doesn't play as much of a role.  But
once the transaction count comes down, that pinned bytes count gets
elevated and becomes more important.  I think it should be taken into
account to determine whether committing a transaction early will result
in releasing enough space to honor the reservation without allocating a
new chunk.  If the answer is yes, flush it.  If no, there's no point in
flushing it now, so just allocate the chunk and move on.

The big question is where this 80% number comes into play.
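
If memory serves, it's the existing allocation threshold in
should_alloc_chunk(), which refuses a new chunk until usage crosses roughly
80% of what's already allocated (paraphrased, not verbatim):

	if (num_allocated + SZ_2M < div_factor(num_bytes, 8))	/* 8/10 = 80% */
		return 0;

And to make the commit-or-allocate decision described above concrete, a
stand-alone sketch -- invented names and plain integers, not the in-tree
may_commit_transaction():

#include <stdbool.h>
#include <stdint.h>

/*
 * Commit early only if the space a commit would release (the pinned
 * bytes) is enough to satisfy the reservation; otherwise a commit can't
 * help and we may as well allocate the chunk right away.
 */
static bool commit_frees_enough(uint64_t bytes_needed, uint64_t bytes_free,
				uint64_t bytes_pinned)
{
	return bytes_free + bytes_pinned >= bytes_needed;
}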

There is a caveat here: almost all of our testing has been on 4.4 with a
bunch of these patches backported.  I believe the same issue will still
be there on mainline, but we're in release crunch mode and I haven't had
a chance to test more fully.

-Jeff

>> Signed-off-by: Jeff Mahoney <je...@suse.com>
>> ---
>>  fs/btrfs/extent-tree.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 33d979e9ea2a..88b04742beea 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -4377,7 +4377,7 @@ static int should_alloc_chunk(struct btrfs_fs_info 
>> *fs_info,
>>  {
>>  struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
>>  u64 num_bytes = sinfo->total_bytes - sinfo->bytes_readonly;
>> -u64 num_allocated = sinfo->bytes_used + sinfo->bytes_reserved;
>> +u64 num_allocated = sinfo->bytes_used + sinfo->bytes_reserved + 
>> sinfo->bytes_pinned;
>>  u64 thresh;
>>  
>>  if (force == CHUNK_ALLOC_FORCE)
>> -- 
>> 2.11.0
>>
> 


-- 
Jeff Mahoney
SUSE Labs





Re: Lock between userspace and btrfs-cleaner on extent_buffer

2017-06-29 Thread Jeff Mahoney
On 6/29/17 2:46 PM, Sargun Dhillon wrote:
> On Thu, Jun 29, 2017 at 11:42 AM, Jeff Mahoney <je...@suse.com> wrote:
>> On 6/28/17 6:02 PM, Sargun Dhillon wrote:
>>> On Wed, Jun 28, 2017 at 2:55 PM, Jeff Mahoney <je...@suse.com> wrote:
>>>> On 6/27/17 5:12 PM, Jeff Mahoney wrote:
>>>>> On 6/13/17 9:05 PM, Sargun Dhillon wrote:
>>>>>> On Thu, Jun 8, 2017 at 11:34 AM, Sargun Dhillon <sar...@sargun.me> wrote:
>>>>>>> I have a deadlock caught in the wild between two processes --
>>>>>>> btrfs-cleaner, and userspace process (Docker). Here, you can see both
>>>>>>> of the backtraces. btrfs-cleaner is trying to get a lock on
>>>>>>> 9859d360caf0, which is owned by Docker's pid. Docker on the other
>>>>>>> hand is trying to get a lock on 9859dc0f0578, which is owned by
>>>>>>> btrfs-cleaner's Pid.
>>>>>>>
>>>>>>> This is on vanilla 4.11.3 without much workload. The background
>>>>>>> workload was basically starting and stopping Docker with a medium
>>>>>>> sized image like ubuntu:latest with sleep 5. So, snapshot creation,
>>>>>>> destruction. And there's some stuff that's logging to btrfs.
>>>>>
>>>>> Hi Sargun -
>>>>>
>>>>> We hit this bug in testing last week.  I have a patch that I've written
>>>>> up and have run under your reproducer for a while.  So far it hasn't
>>>>> hit.  I'll post it shortly and CC you.  It does depend lightly on the
>>>>> rbtree code, though.  Since we'll want this fix for -stable, I'll write
>>>>> up a version for that too.
>>>>
>>>> After thinking about it a bit more, I think my patch just happens to
>>>> make it less likely to hit but would ultimately degrade into a livelock
>>>> where it was a deadlock previously.  I was just trylocking and
>>>> requeuing, so both threads are allowed to do other work and maybe even
>>>> finish but ultimately if there's a true deadlock it'll hit anyway.
>>>>
>>>> -Jeff
>>>>
>>> Does it make sense to spend the time on making it so that
>>> btrfs-cleaner has abortable operations, and the ability to abort if
>>> the root deletion either takes too long, or if it receives a signal?
>>> Although, such a case may result in a livelock, to me it seems like a
>>> lot less bad than deadlocking.
>>
>>
>> For now, reverting:
>>
>> commit fb235dc06fac9eaa4408ade9c8b20d45d63c89b7
>> Author: Qu Wenruo <quwen...@cn.fujitsu.com>
>> Date:   Wed Feb 15 10:43:03 2017 +0800
>>
>> btrfs: qgroup: Move half of the qgroup accounting time out of commit
>> trans
>>
>> ... should do the trick.
>>
>> -Jeff
>>
> I thought it was this as well, but we still saw lock-ups even after
> reverting this change on 4.11. They were rarer, but we still saw
> issues with locked up btrfs-transactions. It may have been due to a
> different issue. If you want. I can try to revert this, and run a
> workload on it to see where the exact lock-up is?

Yeah, I'd be interested in those results.

-Jeff


-- 
Jeff Mahoney
SUSE Labs





Re: Lock between userspace and btrfs-cleaner on extent_buffer

2017-06-29 Thread Jeff Mahoney
On 6/28/17 6:02 PM, Sargun Dhillon wrote:
> On Wed, Jun 28, 2017 at 2:55 PM, Jeff Mahoney <je...@suse.com> wrote:
>> On 6/27/17 5:12 PM, Jeff Mahoney wrote:
>>> On 6/13/17 9:05 PM, Sargun Dhillon wrote:
>>>> On Thu, Jun 8, 2017 at 11:34 AM, Sargun Dhillon <sar...@sargun.me> wrote:
>>>>> I have a deadlock caught in the wild between two processes --
>>>>> btrfs-cleaner, and userspace process (Docker). Here, you can see both
>>>>> of the backtraces. btrfs-cleaner is trying to get a lock on
>>>>> 9859d360caf0, which is owned by Docker's pid. Docker on the other
>>>>> hand is trying to get a lock on 9859dc0f0578, which is owned by
>>>>> btrfs-cleaner's Pid.
>>>>>
>>>>> This is on vanilla 4.11.3 without much workload. The background
>>>>> workload was basically starting and stopping Docker with a medium
>>>>> sized image like ubuntu:latest with sleep 5. So, snapshot creation,
>>>>> destruction. And there's some stuff that's logging to btrfs.
>>>
>>> Hi Sargun -
>>>
>>> We hit this bug in testing last week.  I have a patch that I've written
>>> up and have run under your reproducer for a while.  So far it hasn't
>>> hit.  I'll post it shortly and CC you.  It does depend lightly on the
>>> rbtree code, though.  Since we'll want this fix for -stable, I'll write
>>> up a version for that too.
>>
>> After thinking about it a bit more, I think my patch just happens to
>> make it less likely to hit but would ultimately degrade into a livelock
>> where it was a deadlock previously.  I was just trylocking and
>> requeuing, so both threads are allowed to do other work and maybe even
>> finish but ultimately if there's a true deadlock it'll hit anyway.
>>
>> -Jeff
>>
> Does it make sense to spend the time on making it so that
> btrfs-cleaner has abortable operations, and the ability to abort if
> the root deletion either takes too long, or if it receives a signal?
> Although, such a case may result in a livelock, to me it seems like a
> lot less bad than deadlocking.


For now, reverting:

commit fb235dc06fac9eaa4408ade9c8b20d45d63c89b7
Author: Qu Wenruo <quwen...@cn.fujitsu.com>
Date:   Wed Feb 15 10:43:03 2017 +0800

btrfs: qgroup: Move half of the qgroup accounting time out of commit
trans

... should do the trick.

-Jeff

>>> -Jeff
>>>
>>>>> crash> bt -FF
>>>>> PID: 3423   TASK: 985ec7a16580  CPU: 2   COMMAND: "btrfs-cleaner"
>>>>>  #0 [afca9d9078e8] __schedule at bb235729
>>>>> afca9d9078f0:  [985eccb2e580:task_struct]
>>>>> afca9d907900: [985ec7a16580:task_struct] 985ed949b280
>>>>> afca9d907910: afca9d907978 __schedule+953
>>>>> afca9d907920: btree_get_extent 9de968f0
>>>>> afca9d907930: 985ed949b280 afca9d907958
>>>>> afca9d907940: 0004 00a90842012fd9df
>>>>> afca9d907950: [985ec7a16580:task_struct]
>>>>> [9859d360cb50:btrfs_extent_buffer]
>>>>> afca9d907960: [9859d360cb58:btrfs_extent_buffer]
>>>>> [985ec7a16580:task_struct]
>>>>> afca9d907970: [985ec7a16580:task_struct] afca9d907990
>>>>> afca9d907980: schedule+54
>>>>>  #1 [afca9d907980] schedule at bb235c96
>>>>> afca9d907988: [9859d360caf0:btrfs_extent_buffer] 
>>>>> afca9d9079f8
>>>>> afca9d907998: btrfs_tree_read_lock+204
>>>>>  #2 [afca9d907998] btrfs_tree_read_lock at c03e112c [btrfs]
>>>>> afca9d9079a0: 985e [985ec7a16580:task_struct]
>>>>> afca9d9079b0: autoremove_wake_function
>>>>> [9859d360cb60:btrfs_extent_buffer]
>>>>> afca9d9079c0: [9859d360cb60:btrfs_extent_buffer] 
>>>>> 00a90842012fd9df
>>>>> afca9d9079d0: [985a6ca3c370:Acpi-State]
>>>>> [9859d360caf0:btrfs_extent_buffer]
>>>>> afca9d9079e0: afca9d907ac0 [985e751bc000:kmalloc-8192]
>>>>> afca9d9079f0: [985e751bc000:kmalloc-8192] afca9d907a48
>>>>> afca9d907a00: __add_missing_keys+190
>>>>>  #3 [afca9d907a00] __add_missing_keys at c040abae [btrfs]
>>>>> afca9d907a08:  fff

Re: Lock between userspace and btrfs-cleaner on extent_buffer

2017-06-28 Thread Jeff Mahoney
On 6/27/17 5:12 PM, Jeff Mahoney wrote:
> On 6/13/17 9:05 PM, Sargun Dhillon wrote:
>> On Thu, Jun 8, 2017 at 11:34 AM, Sargun Dhillon <sar...@sargun.me> wrote:
>>> I have a deadlock caught in the wild between two processes --
>>> btrfs-cleaner, and userspace process (Docker). Here, you can see both
>>> of the backtraces. btrfs-cleaner is trying to get a lock on
>>> 9859d360caf0, which is owned by Docker's pid. Docker on the other
>>> hand is trying to get a lock on 9859dc0f0578, which is owned by
>>> btrfs-cleaner's Pid.
>>>
>>> This is on vanilla 4.11.3 without much workload. The background
>>> workload was basically starting and stopping Docker with a medium
>>> sized image like ubuntu:latest with sleep 5. So, snapshot creation,
>>> destruction. And there's some stuff that's logging to btrfs.
> 
> Hi Sargun -
> 
> We hit this bug in testing last week.  I have a patch that I've written
> up and have run under your reproducer for a while.  So far it hasn't
> hit.  I'll post it shortly and CC you.  It does depend lightly on the
> rbtree code, though.  Since we'll want this fix for -stable, I'll write
> up a version for that too.

After thinking about it a bit more, I think my patch just happens to
make it less likely to hit but would ultimately degrade into a livelock
where it was a deadlock previously.  I was just trylocking and
requeuing, so both threads are allowed to do other work and maybe even
finish but ultimately if there's a true deadlock it'll hit anyway.
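
To make that concrete, here is a minimal userspace sketch (plain
pthreads, not btrfs code) of the trylock-and-requeue pattern: with a
genuine lock-order inversion neither thread ever gets the second lock,
so instead of two tasks sleeping in a deadlock you get two tasks
spinning forever.

#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

/* Each thread holds one lock and trylocks the other, yielding
 * ("requeuing") on failure.  With opposite lock orders neither
 * trylock ever succeeds once both first locks are held. */
static void *worker(void *arg)
{
	int reversed = (arg != NULL);
	pthread_mutex_t *held = reversed ? &lock_b : &lock_a;
	pthread_mutex_t *wanted = reversed ? &lock_a : &lock_b;

	pthread_mutex_lock(held);
	while (pthread_mutex_trylock(wanted) != 0)
		sched_yield();	/* give up the CPU, try again later */
	puts("got both locks");	/* only reached if the inversion never forms */
	pthread_mutex_unlock(wanted);
	pthread_mutex_unlock(held);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, worker, NULL);
	pthread_create(&t2, NULL, worker, (void *)1);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return 0;
}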

-Jeff

> -Jeff
> 
>>> crash> bt -FF
>>> PID: 3423   TASK: 985ec7a16580  CPU: 2   COMMAND: "btrfs-cleaner"
>>>  #0 [afca9d9078e8] __schedule at bb235729
>>> afca9d9078f0:  [985eccb2e580:task_struct]
>>> afca9d907900: [985ec7a16580:task_struct] 985ed949b280
>>> afca9d907910: afca9d907978 __schedule+953
>>> afca9d907920: btree_get_extent 9de968f0
>>> afca9d907930: 985ed949b280 afca9d907958
>>> afca9d907940: 0004 00a90842012fd9df
>>> afca9d907950: [985ec7a16580:task_struct]
>>> [9859d360cb50:btrfs_extent_buffer]
>>> afca9d907960: [9859d360cb58:btrfs_extent_buffer]
>>> [985ec7a16580:task_struct]
>>> afca9d907970: [985ec7a16580:task_struct] afca9d907990
>>> afca9d907980: schedule+54
>>>  #1 [afca9d907980] schedule at bb235c96
>>> afca9d907988: [9859d360caf0:btrfs_extent_buffer] 
>>> afca9d9079f8
>>> afca9d907998: btrfs_tree_read_lock+204
>>>  #2 [afca9d907998] btrfs_tree_read_lock at c03e112c [btrfs]
>>> afca9d9079a0: 985e [985ec7a16580:task_struct]
>>> afca9d9079b0: autoremove_wake_function
>>> [9859d360cb60:btrfs_extent_buffer]
>>> afca9d9079c0: [9859d360cb60:btrfs_extent_buffer] 
>>> 00a90842012fd9df
>>> afca9d9079d0: [985a6ca3c370:Acpi-State]
>>> [9859d360caf0:btrfs_extent_buffer]
>>> afca9d9079e0: afca9d907ac0 [985e751bc000:kmalloc-8192]
>>> afca9d9079f0: [985e751bc000:kmalloc-8192] afca9d907a48
>>> afca9d907a00: __add_missing_keys+190
>>>  #3 [afca9d907a00] __add_missing_keys at c040abae [btrfs]
>>> afca9d907a08:  afca9d907a28
>>> afca9d907a18: free_extent_buffer+75 00a90842012fd9df
>>> afca9d907a28: afca9d907ab0 afca9d907be8
>>> afca9d907a38:  [985e78dae540:btrfs_path]
>>> afca9d907a48: afca9d907b28 find_parent_nodes+889
>>>  #4 [afca9d907a50] find_parent_nodes at c040c4d9 [btrfs]
>>> afca9d907a58: [985e751bc000:kmalloc-8192]
>>> [9859d613cf40:kmalloc-32]
>>> afca9d907a68: [9859d613c220:kmalloc-32] 
>>> afca9d907a78: 030dc000 
>>> afca9d907a88: [985e78dae540:btrfs_path] 
>>> afca9d907a98: 000178dae540 
>>> afca9d907aa8: 0002 afca9d907ab0
>>> afca9d907ab8: afca9d907ab0 [985a6ca3c370:Acpi-State]
>>> afca9d907ac8: [985a6ca3ce10:Acpi-State] c000985e751bc000
>>> afca9d907ad8: 01a9030d 
>>> afca9d907ae8: a9030dc0 0001
>>> afca9d907af8: 00a90842012fd9df [98

Re: Lock between userspace and btrfs-cleaner on extent_buffer

2017-06-27 Thread Jeff Mahoney
trfs volume with qgroups and limits setup on
> /mnt/btrfs (or whatever you set ROOT to). This was tested on a 32-CPU
> machine, without CONFIG_PREEMPT. Upon disabling cores, we were not
> able to reproduce the issue.
> 
> #!/bin/bash -x
> 
> ROOT=/mnt/btrfs
> #sudo btrfs quota enable ${ROOT}
> 
> btrfs subvolume create ${ROOT}/foo
> mkdir -p ${ROOT}/snapshots
> mkdir -p ${ROOT}/bar
> 
> set +x
> for i in $(seq 1 15000); do
> dd if=/dev/urandom of=${ROOT}/foo/${i} bs=$((1 + RANDOM %
> (4096 * 1))) count=1 status=none
> done
> set -x
> 
> SUBVOLCOUNT=100
> btrfs subvolume delete $ROOT/snapshots/foo*
> for i in $(seq 1 $SUBVOLCOUNT); do
> btrfs subvolume snapshot ${ROOT}/foo ${ROOT}/snapshots/foo${i}
> done
> 
> while true; do
> # Delete $SUBVOLCOUNT random subvolumes
> volumes_to_delete=""
> for i in $(seq 1 10); do
> vol_id=$((1 + RANDOM % $SUBVOLCOUNT))
> volumes_to_delete="${volumes_to_delete}
> ${ROOT}/snapshots/foo${vol_id}"
> set +x
> for i in $(seq 1 1500); do
> dd if=/dev/urandom
> of=${ROOT}/snapshots/foo${vol_id}/${i} bs=100 count=1 status=none
> done
> set -x
> done
> volumes_to_delete=$(echo $volumes_to_delete|cut -b1-)
> set +x
> for i in $(seq 500); do
> mkdir ${ROOT}/bar/${i}
> done
> btrfs subvolume delete $volumes_to_delete
> for i in $(seq 500); do
> rmdir ${ROOT}/bar/${i}
> done
> set -x
> for vol in $volumes_to_delete; do
> if [ ! -d $vol ]; then
> btrfs subvolume snapshot ${ROOT}/foo $vol
> fi
> done
> done
> 
> #
> The cleaner gets stuck in:
> [] btrfs_tree_read_lock+0xcc/0x120 [btrfs]
> [] __add_missing_keys+0xbe/0x130 [btrfs]
> [] find_parent_nodes+0x379/0x900 [btrfs]
> [] __btrfs_find_all_roots+0xa9/0x120 [btrfs]
> [] btrfs_find_all_roots+0x55/0x70 [btrfs]
> [] btrfs_qgroup_trace_extent+0x12e/0x170 [btrfs]
> [] btrfs_qgroup_trace_leaf_items+0x117/0x150 [btrfs]
> [] btrfs_qgroup_trace_subtree+0x1c3/0x350 [btrfs]
> [] do_walk_down+0x2f7/0x590 [btrfs]
> [] walk_down_tree+0xbd/0x100 [btrfs]
> [] btrfs_drop_snapshot+0x3e4/0x8b0 [btrfs]
> [] btrfs_clean_one_deleted_snapshot+0xb7/0x100 [btrfs]
> [] cleaner_kthread+0x13a/0x190 [btrfs]
> [] kthread+0x109/0x140
> [] ret_from_fork+0x2c/0x40
> [] 0x
> 
> And the opposing process (where the actual syscall varies):
> [] btrfs_tree_read_lock+0xcc/0x120 [btrfs]
> [] __add_missing_keys+0xbe/0x130 [btrfs]
> [] find_parent_nodes+0x379/0x900 [btrfs]
> [] __btrfs_find_all_roots+0xa9/0x120 [btrfs]
> [] btrfs_find_all_roots+0x55/0x70 [btrfs]
> [] btrfs_qgroup_trace_extent_post+0x34/0x60 [btrfs]
> [] btrfs_add_delayed_tree_ref+0x1be/0x1e0 [btrfs]
> [] btrfs_inc_extent_ref+0x4c/0x60 [btrfs]
> [] __btrfs_mod_ref+0x152/0x240 [btrfs]
> [] btrfs_inc_ref+0x14/0x20 [btrfs]
> [] update_ref_for_cow+0xdc/0x340 [btrfs]
> [] __btrfs_cow_block+0x218/0x5e0 [btrfs]
> [] btrfs_cow_block+0xff/0x1e0 [btrfs]
> [] btrfs_search_slot+0x208/0x9c0 [btrfs]
> [] btrfs_truncate_inode_items+0x1a1/0x1040 [btrfs]
> [] btrfs_truncate+0xfc/0x2c0 [btrfs]
> [] btrfs_setattr+0x22d/0x370 [btrfs]
> [] notify_change+0x2db/0x430
> [] do_truncate+0x75/0xc0
> [] path_openat+0x362/0x1450
> [] do_filp_open+0x99/0x110
> [] do_sys_open+0x124/0x210
> [] SyS_open+0x1e/0x20
> [] entry_SYSCALL_64_fastpath+0x1e/0xad
> [] 0x
> 
> We have a small emergency patch that appears to help, until an actual
> solution is found (if anyone else is running into this):
> diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
> index 7699e16..e0a261a8 100644
> --- a/fs/btrfs/backref.c
> +++ b/fs/btrfs/backref.c
> @@ -1566,11 +1566,18 @@ int btrfs_find_all_roots(struct
> btrfs_trans_handle *trans,
>  {
> int ret;
> 
> -   if (!trans)
> +   if (!trans) {
> down_read(&fs_info->commit_root_sem);
> +   down_write(&fs_info->find_all_root_sem);
> +   } else
> +   down_read(&fs_info->find_all_root_sem);
> +
> ret = __btrfs_find_all_roots(trans, fs_info, bytenr, time_seq, roots);
> -   if (!trans)
> +   if (!trans) {
> +   up_write(&fs_info->find_all_root_sem);
> up_read(&fs_info->commit_root_sem);
> +   } else
> +   up_read(&fs_info->find_all_root_sem);
> return ret;
>  }
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index c411590..9ed0735 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -840,6 +840,8 @@ struct btrfs_fs_info {
> 
> struct rw_semaphore commit_root_sem;
> 
> +   struct rw_semaphore find_all_root_sem;
> +
> struct rw_semaphore cleanup_work_sem;
> 
> struct rw_semaphore subvol_sem;
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index eb1ee7b..c227895 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2741,6 +2741,7 @@ int open_ctree(struct super_block *sb,
> mutex_init(&fs_info->volume_mutex);
> mutex_init(&fs_info->ro_block_group_mutex);
> init_rwsem(&fs_info->commit_root_sem);
> +   init_rwsem(&fs_info->find_all_root_sem);
> init_rwsem(&fs_info->cleanup_work_sem);
> init_rwsem(&fs_info->subvol_sem);
> sema_init(&fs_info->uuid_tree_rescan_sem, 1);
> --
> 


-- 
Jeff Mahoney
SUSE Labs





[PATCH] btrfs: backref, properly iterate the missing keys

2017-06-27 Thread Jeff Mahoney
[This should probably be rolled into Patch 8].


We iterate over the indirect tree looking for refs that don't have a
key associated with them, look them up, and update the ref with the
resolved key.  The problem is that when we resolve the key, we've
changed where the ref would be located in the tree, which means
searches and insertions would fail.

The good news is that it's not a visible bug since we don't actually search
for or insert new items into the tree at this point.

This patch uses a separate tree for those refs, resolves them, and inserts
them into the indirect tree.  This has the benefit of letting us skip
the refs that don't need any attention and this is used in the next patch.

Signed-off-by: Jeff Mahoney <je...@suse.com>
---
 fs/btrfs/backref.c |   37 +
 1 file changed, 25 insertions(+), 12 deletions(-)

--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -132,6 +132,7 @@ struct preftree {
 struct preftrees {
struct preftree direct;/* BTRFS_SHARED_[DATA|BLOCK]_REF_KEY */
struct preftree indirect;  /* BTRFS_[TREE_BLOCK|EXTENT_DATA]_REF_KEY */
+   struct preftree indirect_missing_keys;
 };
 
 /*
@@ -406,8 +407,11 @@ static int add_indirect_ref(const struct
u64 wanted_disk_byte, int count,
struct share_check *sc, gfp_t gfp_mask)
 {
-   return add_prelim_ref(fs_info, &preftrees->indirect, root_id, key,
- level, 0, wanted_disk_byte, count, sc, gfp_mask);
+   struct preftree *tree = &preftrees->indirect;
+   if (!key)
+   tree = &preftrees->indirect_missing_keys;
+   return add_prelim_ref(fs_info, tree, root_id, key, level, 0,
+ wanted_disk_byte, count, sc, gfp_mask);
 }
 
 static int add_all_parents(struct btrfs_root *root, struct btrfs_path *path,
@@ -707,22 +711,25 @@ static int add_missing_keys(struct btrfs
 {
struct prelim_ref *ref;
struct extent_buffer *eb;
-   struct rb_node *node = rb_first(&preftrees->indirect.root);
+   struct preftree *tree = &preftrees->indirect_missing_keys;
+   struct rb_node *node;
 
-   while (node) {
+   while ((node = rb_first(&tree->root))) {
ref = rb_entry(node, struct prelim_ref, rbnode);
-   node = rb_next(&ref->rbnode);
-   BUG_ON(ref->parent);/* should not be a direct ref */
+   rb_erase(node, &tree->root);
 
-   if (ref->key_for_search.type)
-   continue;
+   BUG_ON(ref->parent);/* should not be a direct ref */
+   BUG_ON(ref->key_for_search.type);
BUG_ON(!ref->wanted_disk_byte);
+
eb = read_tree_block(fs_info->tree_root, ref->wanted_disk_byte,
 0);
if (IS_ERR(eb)) {
+   release_pref(ref);
return PTR_ERR(eb);
} else if (!extent_buffer_uptodate(eb)) {
free_extent_buffer(eb);
+   release_pref(ref);
return -EIO;
}
btrfs_tree_read_lock(eb);
@@ -749,12 +756,15 @@ static int add_delayed_refs(const struct
struct btrfs_delayed_ref_node *node;
struct btrfs_delayed_extent_op *extent_op = head->extent_op;
struct btrfs_key key;
-   struct btrfs_key op_key = {0};
+   struct btrfs_key tmp_op_key;
+   struct btrfs_key *op_key = NULL;
int count;
int ret = 0;
 
-   if (extent_op && extent_op->update_key)
-   btrfs_disk_key_to_cpu(&op_key, &extent_op->key);
+   if (extent_op && extent_op->update_key) {
+   btrfs_disk_key_to_cpu(&tmp_op_key, &extent_op->key);
+   op_key = &tmp_op_key;
+   }
 
spin_lock(&head->lock);
list_for_each_entry(node, &head->ref_list, list) {
@@ -783,7 +793,7 @@ static int add_delayed_refs(const struct
 
ref = btrfs_delayed_node_to_tree_ref(node);
ret = add_indirect_ref(fs_info, preftrees, ref->root,
-  &op_key, ref->level + 1,
+  op_key, ref->level + 1,
   node->bytenr, count, sc,
   GFP_ATOMIC);
break;
@@ -1206,6 +1216,8 @@ again:
if (ret)
goto out;
 
+   WARN_ON(!RB_EMPTY_ROOT(&preftrees.indirect_missing_keys.root));
+
ret = resolve_indirect_refs(fs_info, path, time_seq, &preftrees,
extent_item_pos, total_refs, sc);
if (ret)
goto out;
@@ -1289,6 +1301,7 @@ out:
 
prelim_release(&preftrees.direct);
prelim_release(&preftrees.indirect);
+   prelim_release(&preftrees.indirect_missing_keys);
 
if (ret < 0)
free_inode_elem_list(eie);

--

Re: [PATCH 08/13] btrfs: convert prelimary reference tracking to use rbtrees

2017-06-27 Thread Jeff Mahoney
On 6/27/17 12:49 PM, David Sterba wrote:
> On Tue, Jun 20, 2017 at 10:06:48AM -0600, Edmund Nadolski wrote:
>> It's been known for a while that the use of multiple lists
>> that are periodically merged was an algorithmic problem within
>> btrfs.  There are several workloads that don't complete in any
>> reasonable amount of time (e.g. btrfs/130) and others that cause
>> soft lockups.
>>
>> The solution is to use a pair of rbtrees that do insertion merging
>> for both indirect and direct refs, with the former converting
>> refs into the latter.  The result is a btrfs/130 workload that
>> used to take several hours now takes about half of that. This
>> runtime still isn't acceptable and a future patch will address that
>> by moving the rbtrees higher in the stack so the lookups can be
>> shared across multiple calls to find_parent_nodes.
>>
>> Signed-off-by: Edmund Nadolski <enadol...@suse.com>
>> Signed-off-by: Jeff Mahoney <je...@suse.com>
>> ---
>>  fs/btrfs/backref.c | 415 
>> ++---
>>  1 file changed, 267 insertions(+), 148 deletions(-)
>>
>> diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
>> index 0d1e7cb..daae7b6 100644
>> --- a/fs/btrfs/backref.c
>> +++ b/fs/btrfs/backref.c
>> @@ -129,7 +129,7 @@ static int find_extent_in_eb(const struct extent_buffer 
>> *eb,
>>   * this structure records all encountered refs on the way up to the root
>>   */
>>  struct prelim_ref {
>> -struct list_head list;
>> +struct rb_node rbnode;
>>  u64 root_id;
>>  struct btrfs_key key_for_search;
>>  int level;
>> @@ -139,6 +139,17 @@ struct prelim_ref {
>>  u64 wanted_disk_byte;
>>  };
>>  
>> +struct preftree {
>> +struct rb_root root;
>> +};
>> +
>> +#define PREFTREE_INIT   { .root = RB_ROOT }
>> +
>> +struct preftrees {
>> +struct preftree direct;/* BTRFS_SHARED_[DATA|BLOCK]_REF_KEY */
>> +struct preftree indirect;  /* BTRFS_[TREE_BLOCK|EXTENT_DATA]_REF_KEY */
>> +};
>> +
>>  static struct kmem_cache *btrfs_prelim_ref_cache;
>>  
>>  int __init btrfs_prelim_ref_init(void)
>> @@ -158,6 +169,108 @@ void btrfs_prelim_ref_exit(void)
>>  kmem_cache_destroy(btrfs_prelim_ref_cache);
>>  }
>>  
>> +static void release_pref(struct prelim_ref *ref)
>> +{
>> +kmem_cache_free(btrfs_prelim_ref_cache, ref);
> 
> This is a bit confusing, 'release' in btrfs code is used to release the
> resources but not to actually free the memory. See eg.
> btrfs_release_path and btrfs_free_path. As the helper is trivial and I
> don't see any other additions to it in the followup patches, I suggest
> to drop and opencode it.

I don't mind renaming it to free_pref, but free_pref(ref) is a whole lot
easier to read and write than the alternative.

>> @@ -429,38 +560,58 @@ unode_aux_to_inode_list(struct ulist_node *node)
>>  }
>>  
>>  /*
>> - * resolve all indirect backrefs from the list
>> + * We maintain two separate rbtrees: one for indirect refs and one for
>> + * direct refs. Each tree does merge on insertion.  Once all of the
>> + * refs have been located, we iterate over the indirect ref tree, resolve
>> + * each reference and remove it from the indirect tree, and then insert
>> + * the resolved reference into the direct tree, merging there too.
>> + *
>> + * New backrefs (i.e., for parent nodes) are added to the direct/indirect
>> + * rbtrees as they are encountered.  The new, indirect backrefs are
>> + * resolved as above.
>>   */
>>  static int resolve_indirect_refs(struct btrfs_fs_info *fs_info,
>>   struct btrfs_path *path, u64 time_seq,
>> - struct list_head *head,
>> + struct preftrees *preftrees,
>>   const u64 *extent_item_pos, u64 total_refs,
>>   u64 root_objectid)
>>  {
>>  int err;
>>  int ret = 0;
>> -struct prelim_ref *ref;
>> -struct prelim_ref *ref_safe;
>> -struct prelim_ref *new_ref;
>>  struct ulist *parents;
>>  struct ulist_node *node;
>>  struct ulist_iterator uiter;
>> +struct rb_node *rnode;
>>  
>>  parents = ulist_alloc(GFP_NOFS);
>>  if (!parents)
>>  return -ENOMEM;
>>  
>>  /*
>> - * _safe allows us to insert directly after the current item without
>> - * iterating over 

Re: [PATCH 08/13] btrfs: convert prelimary reference tracking to use rbtrees

2017-06-27 Thread Jeff Mahoney
On 6/26/17 1:07 PM, Jeff Mahoney wrote:
> On 6/20/17 12:06 PM, Edmund Nadolski wrote:
>> It's been known for a while that the use of multiple lists
>> that are periodically merged was an algorithmic problem within
>> btrfs.  There are several workloads that don't complete in any
>> reasonable amount of time (e.g. btrfs/130) and others that cause
>> soft lockups.
>>
>> The solution is to use a pair of rbtrees that do insertion merging
>> for both indirect and direct refs, with the former converting
>> refs into the latter.  The result is a btrfs/130 workload that
>> used to take several hours now takes about half of that. This
>> runtime still isn't acceptable and a future patch will address that
>> by moving the rbtrees higher in the stack so the lookups can be
>> shared across multiple calls to find_parent_nodes.
>>
>> Signed-off-by: Edmund Nadolski <enadol...@suse.com>
>> Signed-off-by: Jeff Mahoney <je...@suse.com>
> [...]
> 
>> @@ -504,37 +665,22 @@ static int resolve_indirect_refs(struct btrfs_fs_info 
>> *fs_info,
>>  return ret;
>>  }
>>  
>> -static inline int ref_for_same_block(struct prelim_ref *ref1,
>> - struct prelim_ref *ref2)
>> -{
>> -if (ref1->level != ref2->level)
>> -return 0;
>> -if (ref1->root_id != ref2->root_id)
>> -return 0;
>> -if (ref1->key_for_search.type != ref2->key_for_search.type)
>> -return 0;
>> -if (ref1->key_for_search.objectid != ref2->key_for_search.objectid)
>> -return 0;
>> -if (ref1->key_for_search.offset != ref2->key_for_search.offset)
>> -return 0;
>> -if (ref1->parent != ref2->parent)
>> -return 0;
>> -
>> -return 1;
>> -}
>> -
>>  /*
>>   * read tree blocks and add keys where required.
>>   */
>>  static int add_missing_keys(struct btrfs_fs_info *fs_info,
>> -struct list_head *head)
>> +struct preftrees *preftrees)
>>  {
>>  struct prelim_ref *ref;
>>  struct extent_buffer *eb;
>> +struct rb_node *node = rb_first(&preftrees->indirect.root);
>> +
>> +while (node) {
>> +ref = rb_entry(node, struct prelim_ref, rbnode);
>> +node = rb_next(&ref->rbnode);
>> +if (WARN(ref->parent, "BUG: direct ref found in indirect tree"))
>> +return -EINVAL;
>>  
>> -list_for_each_entry(ref, head, list) {
>> -if (ref->parent)
>> -continue;
>>  if (ref->key_for_search.type)
>>  continue;
>>  BUG_ON(!ref->wanted_disk_byte);
> 
> Hi Ed -
> 
> I missed this in earlier review, but this can't work.  We're modifying
> the ref in a way that the comparator will care about -- so the node
> would move in the tree.
> 
> It's not a fatal flaw and, in fact, leaves us an opening to fix a
> separate locking issue.

Ed and I discussed this offline.  It turns out that this is a code bug
but not a functional bug.  Once we hit add_missing_keys, we don't do any
more inserts or searches.  We only iterate over every node and remove
each as we go, so the tree order doesn't matter.  I'll post a fix shortly.

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH 08/13] btrfs: convert prelimary reference tracking to use rbtrees

2017-06-26 Thread Jeff Mahoney
On 6/20/17 12:06 PM, Edmund Nadolski wrote:
> It's been known for a while that the use of multiple lists
> that are periodically merged was an algorithmic problem within
> btrfs.  There are several workloads that don't complete in any
> reasonable amount of time (e.g. btrfs/130) and others that cause
> soft lockups.
> 
> The solution is to use a pair of rbtrees that do insertion merging
> for both indirect and direct refs, with the former converting
> refs into the latter.  The result is a btrfs/130 workload that
> used to take several hours now takes about half of that. This
> runtime still isn't acceptable and a future patch will address that
> by moving the rbtrees higher in the stack so the lookups can be
> shared across multiple calls to find_parent_nodes.
> 
> Signed-off-by: Edmund Nadolski <enadol...@suse.com>
> Signed-off-by: Jeff Mahoney <je...@suse.com>
[...]

> @@ -504,37 +665,22 @@ static int resolve_indirect_refs(struct btrfs_fs_info 
> *fs_info,
>   return ret;
>  }
>  
> -static inline int ref_for_same_block(struct prelim_ref *ref1,
> -  struct prelim_ref *ref2)
> -{
> - if (ref1->level != ref2->level)
> - return 0;
> - if (ref1->root_id != ref2->root_id)
> - return 0;
> - if (ref1->key_for_search.type != ref2->key_for_search.type)
> - return 0;
> - if (ref1->key_for_search.objectid != ref2->key_for_search.objectid)
> - return 0;
> - if (ref1->key_for_search.offset != ref2->key_for_search.offset)
> - return 0;
> - if (ref1->parent != ref2->parent)
> - return 0;
> -
> - return 1;
> -}
> -
>  /*
>   * read tree blocks and add keys where required.
>   */
>  static int add_missing_keys(struct btrfs_fs_info *fs_info,
> - struct list_head *head)
> + struct preftrees *preftrees)
>  {
>   struct prelim_ref *ref;
>   struct extent_buffer *eb;
> + struct rb_node *node = rb_first(&preftrees->indirect.root);
> +
> + while (node) {
> + ref = rb_entry(node, struct prelim_ref, rbnode);
> + node = rb_next(&ref->rbnode);
> + if (WARN(ref->parent, "BUG: direct ref found in indirect tree"))
> + return -EINVAL;
>  
> - list_for_each_entry(ref, head, list) {
> - if (ref->parent)
> - continue;
>   if (ref->key_for_search.type)
>   continue;
>   BUG_ON(!ref->wanted_disk_byte);

Hi Ed -

I missed this in earlier review, but this can't work.  We're modifying
the ref in a way that the comparator will care about -- so the node
would move in the tree.

It's not a fatal flaw and, in fact, leaves us an opening to fix a
separate locking issue.

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Re: [PULL] Btrfs for 4.13, part 1

2017-06-25 Thread Jeff Mahoney
On 6/25/17 8:53 PM, Qu Wenruo wrote:
> 
> 
> At 06/26/2017 05:34 AM, Jeff Mahoney wrote:
>> On 6/24/17 6:05 AM, Wang Shilong wrote:
>>> Sorry for bikeshedding.
>>>
>>> On Fri, Jun 23, 2017 at 11:16 PM, David Sterba <dste...@suse.com> wrote:
>>>> Hi,
>>>>
>>>> this is the main batch for 4.13. There are some user visible
>>>> changes, see
>>>> below. The core updates improve error handling (mostly related to
>>>> bios), with
>>>> the usual incremental work on the GFP_NOFS (mis)use removal. All
>>>> patches have
>>>>
>>>> Fabian Frederick (1):
>>>>btrfs: kmap() can't fail
>>>>
>>> <..SNIP..>
>>>>
>>>> Sargun Dhillon (2):
>>>>btrfs: add quota override flag to enable quota override for
>>>> CAP_SYS_RESOURCE
>>>>btrfs: Add quota_override knob into sysfs
>>>>
>>>> Su Yue (9):
>>>>btrfs: Introduce btrfs_is_name_len_valid to avoid reading
>>>> beyond boundary
>>>>btrfs: Check name_len with boundary in verify dir_item
>>>>btrfs: Check name_len on add_inode_ref call path
>>>>btrfs: Verify dir_item in replay_xattr_deletes
>>>>btrfs: Check name_len in btrfs_check_ref_name_override
>>>>btrfs: Check name_len before read in iterate_dir_item
>>>>btrfs: Check name_len before reading btrfs_get_name
>>>>btrfs: Check name_len before in btrfs_del_root_ref
>>>>btrfs: Verify dir_item in iterate_object_props
>>>
>>> Hmm..add those check might be expensive for metadata operations,
>>> especially in hot path, i could see similar behavior in Ext4 for ext4
>>> dentry check.
>>>
>>> Could we run some metadata tests to confirm?it makes sense to
>>> add check but with min affects to performace.
> 
> IIRC there is a backtrace in one of the patches.
> If needed, we can also upload the fuzzed image.
> Although we don't have good enough test suite which accepts fuzzed image
> for kernel.
> 
>>
>> Agreed.  XFS does this at read/write and results in invalid entries
>> never even making it to the consumer.  That's the approach I had
>> partially written up.  This approach has the advantage of being more
>> fine-grained so we don't end up dropping e.g. an entire node's worth of
>> entries but is more expensive to maintain and at runtime.
> 
> We already have tons of runtime check at chunk/super block read time.
> 
> This new check is not as comprehensive as what we did in
> chunk/superblock read time, but only to ensure that item with variable
> length doesn't cross its boundary.
> So performance wise it should not be a problem.
> 
> Although it's expensive to maintain, as for each structure with variable
> length, we need to call verification function every time.
> 
> But we can also extract the check to leaf reading time, this should
> reduce the effort to maintain it and make it easier to expand (or even
> make it optional if it really affects performance).

Well, ideally, we'd do complete checks for any type that we have enough
information to check at read or write time.  If a check is important
enough to be done at a consumer site, it's important enough to be done
at read/write.  Then we can make the consumer code much simpler and
catch some forms of corruption before it hits the disk.
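
As a rough illustration of the idea (simplified structures, not the
real btrfs on-disk format), a read-time check only has to verify that
the embedded lengths of a variable-length item stay inside the item,
so every later consumer can trust them:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified stand-in for a variable-length directory item: a fixed
 * header followed by name_len + data_len bytes of payload. */
struct dir_item_hdr {
	uint16_t data_len;
	uint16_t name_len;
	uint8_t type;
};

/* Run once when the leaf is read instead of at every consumer site. */
static bool dir_item_fits(const struct dir_item_hdr *di, size_t item_size)
{
	return sizeof(*di) + (size_t)di->name_len + di->data_len <= item_size;
}

int main(void)
{
	struct dir_item_hdr ok = { .data_len = 0, .name_len = 3 };
	struct dir_item_hdr bad = { .data_len = 0, .name_len = 60000 };

	printf("ok fits: %d, bad fits: %d\n",
	       dir_item_fits(&ok, 32), dir_item_fits(&bad, 32));
	return 0;
}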

-Jeff

> Thanks,
> Qu
> 
>>
>> -Jeff
>>
>>>> Timofey Titovets (3):
>>>>Btrfs: lzo: fix typo in error message after failed deflate
>>>>Btrfs: lzo: compressed data size must be less then input size
>>>>Btrfs: compression must free at least one sector size
>>>>
>>>> Yonghong Song (1):
>>>>Btrfs: add statx support
>>>>
>>>>   fs/btrfs/backref.c   |  10 +-
>>>>   fs/btrfs/check-integrity.c   |  53 ++---
>>>>   fs/btrfs/compression.c   |  94 ++--
>>>>   fs/btrfs/compression.h   |  44 +++-
>>>>   fs/btrfs/ctree.c |  42 ++--
>>>>   fs/btrfs/ctree.h |  84 ---
>>>>   fs/btrfs/delayed-ref.c   |  29 ++-
>>>>   fs/btrfs/delayed-ref.h   |   6 +-
>>>>   fs/btrfs/dir-item.c  |  94 +++-
>>>>   fs/btrfs/disk-io.c   | 179 +++
>>>>   fs/btrfs/disk-io.h   |   8 +-
>>>>   fs/btrfs/export.c 

Re: [PULL] Btrfs for 4.13, part 1

2017-06-25 Thread Jeff Mahoney
On 6/24/17 6:05 AM, Wang Shilong wrote:
> Sorry for bikeshedding.
> 
> On Fri, Jun 23, 2017 at 11:16 PM, David Sterba <dste...@suse.com> wrote:
>> Hi,
>>
>> this is the main batch for 4.13. There are some user visible changes, see
>> below. The core updates improve error handling (mostly related to bios), with
>> the usual incremental work on the GFP_NOFS (mis)use removal. All patches have
>>
>> Fabian Frederick (1):
>>   btrfs: kmap() can't fail
>>
> <..SNIP..>
>>
>> Sargun Dhillon (2):
>>   btrfs: add quota override flag to enable quota override for 
>> CAP_SYS_RESOURCE
>>   btrfs: Add quota_override knob into sysfs
>>
>> Su Yue (9):
>>   btrfs: Introduce btrfs_is_name_len_valid to avoid reading beyond 
>> boundary
>>   btrfs: Check name_len with boundary in verify dir_item
>>   btrfs: Check name_len on add_inode_ref call path
>>   btrfs: Verify dir_item in replay_xattr_deletes
>>   btrfs: Check name_len in btrfs_check_ref_name_override
>>   btrfs: Check name_len before read in iterate_dir_item
>>   btrfs: Check name_len before reading btrfs_get_name
>>   btrfs: Check name_len before in btrfs_del_root_ref
>>   btrfs: Verify dir_item in iterate_object_props
> 
> Hmm..add those check might be expensive for metadata operations,
> especially in hot path, i could see similar behavior in Ext4 for ext4
> dentry check.
> 
> Could we run some metadata tests to confirm?it makes sense to
> add check but with min affects to performace.

Agreed.  XFS does this at read/write and results in invalid entries
never even making it to the consumer.  That's the approach I had
partially written up.  This approach has the advantage of being more
fine-grained so we don't end up dropping e.g. an entire node's worth of
entries but is more expensive to maintain and at runtime.

-Jeff

>> Timofey Titovets (3):
>>   Btrfs: lzo: fix typo in error message after failed deflate
>>   Btrfs: lzo: compressed data size must be less then input size
>>   Btrfs: compression must free at least one sector size
>>
>> Yonghong Song (1):
>>   Btrfs: add statx support
>>
>>  fs/btrfs/backref.c   |  10 +-
>>  fs/btrfs/check-integrity.c   |  53 ++---
>>  fs/btrfs/compression.c   |  94 ++--
>>  fs/btrfs/compression.h   |  44 +++-
>>  fs/btrfs/ctree.c |  42 ++--
>>  fs/btrfs/ctree.h |  84 ---
>>  fs/btrfs/delayed-ref.c   |  29 ++-
>>  fs/btrfs/delayed-ref.h   |   6 +-
>>  fs/btrfs/dir-item.c  |  94 +++-
>>  fs/btrfs/disk-io.c   | 179 +++
>>  fs/btrfs/disk-io.h   |   8 +-
>>  fs/btrfs/export.c|   5 +
>>  fs/btrfs/extent-tree.c   | 481 
>> +--
>>  fs/btrfs/extent_io.c | 217 --
>>  fs/btrfs/extent_io.h |  82 +--
>>  fs/btrfs/file-item.c |  31 ++-
>>  fs/btrfs/file.c  |  46 ++--
>>  fs/btrfs/free-space-tree.c   |  38 ++--
>>  fs/btrfs/inode-map.c |   4 +-
>>  fs/btrfs/inode.c | 449 
>>  fs/btrfs/ioctl.c |  16 +-
>>  fs/btrfs/lzo.c   |  33 +--
>>  fs/btrfs/print-tree.c|   7 +-
>>  fs/btrfs/props.c |   7 +
>>  fs/btrfs/qgroup.c| 223 +-
>>  fs/btrfs/qgroup.h|   9 +-
>>  fs/btrfs/raid56.c|  16 +-
>>  fs/btrfs/reada.c |   1 -
>>  fs/btrfs/relocation.c|  15 +-
>>  fs/btrfs/root-tree.c |   7 +
>>  fs/btrfs/scrub.c | 209 +++--
>>  fs/btrfs/send.c  | 112 ++---
>>  fs/btrfs/super.c |  72 +-
>>  fs/btrfs/sysfs.c |  41 
>>  fs/btrfs/tests/extent-io-tests.c |   2 +-
>>  fs/btrfs/transaction.c   |  23 +-
>>  fs/btrfs/tree-log.c  |  44 +++-
>>  fs/btrfs/volumes.c   |  74 +++---
>>  fs/btrfs/volumes.h   |   7 +
>>  fs/btrfs/xattr.c |   2 +-
>>  fs/btrfs/zlib.c  |  20 +-
>>  include/trace/events/btrfs.h |  36 ---
>>  include/uapi/linux/btrfs.h   |  63 +++--
>>  43 files changed, 1682 insertions(+), 1353 deletions(-)
> 


-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH 1/2] btrfs: account for pinned bytes and bytes_may_use in should_alloc_chunk

2017-06-21 Thread Jeff Mahoney
On 6/21/17 5:15 PM, Chris Mason wrote:
> 
> 
> On 06/21/2017 05:08 PM, Jeff Mahoney wrote:
>> On 6/21/17 4:31 PM, Chris Mason wrote:
>>> On 06/21/2017 04:14 PM, Jeff Mahoney wrote:
>>>> On 6/14/17 11:44 AM, je...@suse.com wrote:
>>>>> From: Jeff Mahoney <je...@suse.com>
>>>>>
>>>>> In a heavy write scenario, we can end up with a large number of pinned
>>>>> bytes.  This can translate into (very) premature ENOSPC because pinned
>>>>> bytes must be accounted for when allowing a reservation but aren't
>>>>> accounted for when deciding whether to create a new chunk.
>>>>>
>>>>> This patch adds the accounting to should_alloc_chunk so that we can
>>>>> create the chunk.
>>>>>
>>>>> Signed-off-by: Jeff Mahoney <je...@suse.com>
>>>>> ---
>>>>>  fs/btrfs/extent-tree.c | 2 +-
>>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>>>> index cb0b924..d027807 100644
>>>>> --- a/fs/btrfs/extent-tree.c
>>>>> +++ b/fs/btrfs/extent-tree.c
>>>>> @@ -4389,7 +4389,7 @@ static int should_alloc_chunk(struct
>>>>> btrfs_fs_info *fs_info,
>>>>>  {
>>>>>  struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
>>>>>  u64 num_bytes = sinfo->total_bytes - sinfo->bytes_readonly;
>>>>> -u64 num_allocated = sinfo->bytes_used + sinfo->bytes_reserved;
>>>>> +u64 num_allocated = sinfo->bytes_used + sinfo->bytes_reserved +
>>>>> sinfo->bytes_pinned + sinfo->bytes_may_use;
>>>>>  u64 thresh;
>>>>>
>>>>>  if (force == CHUNK_ALLOC_FORCE)
>>>>>
>>>>
>>>>
>>>> Ignore this patch.  It certainly allocates chunks more aggressively,
>>>> but
>>>> it means we end up with a ton of metadata chunks even when we don't
>>>> have
>>>> much metadata.
>>>>
>>>
>>> Josef and I pushed this needle back and forth a bunch of times in the
>>> early days.  I still think we can allocate a few more chunks than we do
>>> now...
>>
>> I agree.  This patch was to fix an issue that we are seeing during
>> installation.  It'd stop with ENOSPC with >50GB completely unallocated.
>> The patch passed the test cases that were failing before but now it's
>> failing differently.  I was worried this pattern might be the end result:
>>
>> Data,single: Size:4.00GiB, Used:3.32GiB
>>/dev/vde4.00GiB
>>
>> Metadata,DUP: Size:20.00GiB, Used:204.12MiB
>>/dev/vde   40.00GiB
>>
>> System,DUP: Size:8.00MiB, Used:16.00KiB
>>/dev/vde   16.00MiB
>>
>> This is on a fresh file system with just "cp /usr /mnt" executed.
>>
>> I'm looking into it a bit more now.
> 
> Does this failure still happen with Omar's ENOSPC fix (commit:
> 70e7af244f24c94604ef6eca32ad297632018583)

Nope.  There aren't any warnings either with or without my patch.
Adding Omar's didn't make a difference.

-Jeff


-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH 1/2] btrfs: account for pinned bytes and bytes_may_use in should_alloc_chunk

2017-06-21 Thread Jeff Mahoney
On 6/21/17 4:31 PM, Chris Mason wrote:
> On 06/21/2017 04:14 PM, Jeff Mahoney wrote:
>> On 6/14/17 11:44 AM, je...@suse.com wrote:
>>> From: Jeff Mahoney <je...@suse.com>
>>>
>>> In a heavy write scenario, we can end up with a large number of pinned
>>> bytes.  This can translate into (very) premature ENOSPC because pinned
>>> bytes must be accounted for when allowing a reservation but aren't
>>> accounted for when deciding whether to create a new chunk.
>>>
>>> This patch adds the accounting to should_alloc_chunk so that we can
>>> create the chunk.
>>>
>>> Signed-off-by: Jeff Mahoney <je...@suse.com>
>>> ---
>>>  fs/btrfs/extent-tree.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index cb0b924..d027807 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -4389,7 +4389,7 @@ static int should_alloc_chunk(struct
>>> btrfs_fs_info *fs_info,
>>>  {
>>>  struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
>>>  u64 num_bytes = sinfo->total_bytes - sinfo->bytes_readonly;
>>> -u64 num_allocated = sinfo->bytes_used + sinfo->bytes_reserved;
>>> +u64 num_allocated = sinfo->bytes_used + sinfo->bytes_reserved +
>>> sinfo->bytes_pinned + sinfo->bytes_may_use;
>>>  u64 thresh;
>>>
>>>  if (force == CHUNK_ALLOC_FORCE)
>>>
>>
>>
>> Ignore this patch.  It certainly allocates chunks more aggressively, but
>> it means we end up with a ton of metadata chunks even when we don't have
>> much metadata.
>>
> 
> Josef and I pushed this needle back and forth a bunch of times in the
> early days.  I still think we can allocate a few more chunks than we do
> now...

I agree.  This patch was to fix an issue that we are seeing during
installation.  It'd stop with ENOSPC with >50GB completely unallocated.
The patch passed the test cases that were failing before but now it's
failing differently.  I was worried this pattern might be the end result:

Data,single: Size:4.00GiB, Used:3.32GiB
   /dev/vde4.00GiB

Metadata,DUP: Size:20.00GiB, Used:204.12MiB
   /dev/vde   40.00GiB

System,DUP: Size:8.00MiB, Used:16.00KiB
   /dev/vde   16.00MiB

This is on a fresh file system with just "cp /usr /mnt" executed.

I'm looking into it a bit more now.

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH 1/2] btrfs: account for pinned bytes and bytes_may_use in should_alloc_chunk

2017-06-21 Thread Jeff Mahoney
On 6/14/17 11:44 AM, je...@suse.com wrote:
> From: Jeff Mahoney <je...@suse.com>
> 
> In a heavy write scenario, we can end up with a large number of pinned
> bytes.  This can translate into (very) premature ENOSPC because pinned
> bytes must be accounted for when allowing a reservation but aren't
> accounted for when deciding whether to create a new chunk.
> 
> This patch adds the accounting to should_alloc_chunk so that we can
> create the chunk.
> 
> Signed-off-by: Jeff Mahoney <je...@suse.com>
> ---
>  fs/btrfs/extent-tree.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index cb0b924..d027807 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4389,7 +4389,7 @@ static int should_alloc_chunk(struct btrfs_fs_info 
> *fs_info,
>  {
>   struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
>   u64 num_bytes = sinfo->total_bytes - sinfo->bytes_readonly;
> - u64 num_allocated = sinfo->bytes_used + sinfo->bytes_reserved;
> + u64 num_allocated = sinfo->bytes_used + sinfo->bytes_reserved + 
> sinfo->bytes_pinned + sinfo->bytes_may_use;
>   u64 thresh;
>  
>   if (force == CHUNK_ALLOC_FORCE)
> 


Ignore this patch.  It certainly allocates chunks more aggressively, but
it means we end up with a ton of metadata chunks even when we don't have
much metadata.

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH 7/7] Btrfs: warn if total_bytes_pinned is non-zero on unmount

2017-06-13 Thread Jeff Mahoney
On 6/6/17 7:45 PM, Omar Sandoval wrote:
> From: Omar Sandoval <osan...@fb.com>
> 
> Catch any future/remaining leaks or underflows of total_bytes_pinned.
> 
> Signed-off-by: Omar Sandoval <osan...@fb.com>
> ---
>  fs/btrfs/extent-tree.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 75ad24f8d253..5fb2fb27eda6 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -9860,6 +9860,7 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
>   space_info->bytes_reserved > 0 ||
>   space_info->bytes_may_use > 0))
>   dump_space_info(info, space_info, 0, 0);
> + WARN_ON(percpu_counter_sum(&space_info->total_bytes_pinned) != 0);
> list_del(&space_info->list);
>   for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
>   struct kobject *kobj;
> 

Can we group this in with the other WARN_ON and add printing
total_bytes_pinned to dump_space_info?  Understanding the magnitude and
whether we've underflowed or just haven't released enough is helpful.  While
testing your patchset, I did this and it found a few bugs in cleanup
after error.  I'll post those patches shortly.
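
Roughly what I have in mind (an untested sketch against the hunk above,
not a formal patch) is folding the counter into the existing check so
that any leak dumps the whole space_info:

	if (WARN_ON(space_info->bytes_pinned > 0 ||
		    space_info->bytes_reserved > 0 ||
		    space_info->bytes_may_use > 0 ||
		    percpu_counter_sum(&space_info->total_bytes_pinned) != 0))
		dump_space_info(info, space_info, 0, 0);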

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH 1/6] Btrfs: add a helper to retrive extent inline ref type

2017-05-26 Thread Jeff Mahoney
On 5/26/17 3:09 AM, Nikolay Borisov wrote:
> 
> 
> On 26.05.2017 03:26, Liu Bo wrote:
>> An invalid value of extent inline ref type may be read from a
>> malicious image which may force btrfs to crash.
>>
>> This adds a helper which does sanity check for the ref type, so we can
>> know if it's sane, return type if so, otherwise return an error.
>>
>> Signed-off-by: Liu Bo <bo.li@oracle.com>
>> ---
>>  fs/btrfs/ctree.h   |  4 
>>  fs/btrfs/extent-tree.c | 35 +++
>>  2 files changed, 39 insertions(+)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index c411590..206ae8c 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -2542,6 +2542,10 @@ static inline gfp_t btrfs_alloc_write_mask(struct 
>> address_space *mapping)
>>  
>>  /* extent-tree.c */
>>  
>> +int btrfs_get_extent_inline_ref_type(struct extent_buffer *eb,
>> + struct btrfs_extent_inline_ref *iref,
>> + int is_data);
>> +
>>  u64 btrfs_csum_bytes_to_leaves(struct btrfs_fs_info *fs_info, u64 
>> csum_bytes);
>>  
>>  static inline u64 btrfs_calc_trans_metadata_size(struct btrfs_fs_info 
>> *fs_info,
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index be54776..fba8ca0 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -1117,6 +1117,41 @@ static int convert_extent_item_v0(struct 
>> btrfs_trans_handle *trans,
>>  }
>>  #endif
>>  
>> +/*
>> + * is_data == 0, tree block type is required,
>> + * is_data == 1, data type is requried,
>> + * is_data == 2, either type is OK.
>> + */
> 
> Can you change those numbers to either #defines or better an enum type?
> Looking at one call site the last argument being a number says nothing
> and one has to context switch to the function definition. E.g. from
> patch2 :
> 
> *out_type = btrfs_get_extent_inline_ref_type(eb, *out_eiref, 2);
> 
> possible names:
> 
> BTRFS_BLOCK_REF_TYPE
> BTRFS_DATA_REF_TYPE
> BTRFS_ANY_TYPE

Agreed.
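
Something along these lines (just a sketch, using the names suggested
above) would make the call sites self-documenting:

enum btrfs_inline_ref_type {
	BTRFS_BLOCK_REF_TYPE,
	BTRFS_DATA_REF_TYPE,
	BTRFS_ANY_TYPE,
};

/* the call site quoted from patch 2 then reads: */
*out_type = btrfs_get_extent_inline_ref_type(eb, *out_eiref, BTRFS_ANY_TYPE);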

-Jeff

>> +int btrfs_get_extent_inline_ref_type(struct extent_buffer *eb,
>> + struct btrfs_extent_inline_ref *iref,
>> + int is_data)
>> +{
>> +int type = btrfs_extent_inline_ref_type(eb, iref);
>> +
>> +if (type == BTRFS_TREE_BLOCK_REF_KEY ||
>> +type == BTRFS_SHARED_BLOCK_REF_KEY ||
>> +type == BTRFS_SHARED_DATA_REF_KEY ||
>> +type == BTRFS_EXTENT_DATA_REF_KEY) {
>> +if (is_data == 2) {
>> +return type;
>> +} else if (is_data == 1) {
>> +if (type == BTRFS_EXTENT_DATA_REF_KEY ||
>> +type == BTRFS_SHARED_DATA_REF_KEY)
>> +return type;
>> +} else {
>> +if (type == BTRFS_TREE_BLOCK_REF_KEY ||
>> +type == BTRFS_SHARED_BLOCK_REF_KEY)
>> +return type;
>> +}
>> +}
>> +
>> +btrfs_print_leaf(eb->fs_info, eb);
>> +WARN(1, "eb %llu(%s block) invalid extent inline ref type %d\n",
>> + eb->start, (is_data) ? "data" : "tree", type);
>> +
>> +return -EINVAL;
>> +}
>> +
>>  static u64 hash_extent_data_ref(u64 root_objectid, u64 owner, u64 offset)
>>  {
>>  u32 high_crc = ~(u32)0;
>>
> --
> 


-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH v2 1/2] btrfs: Separate space_info create/update

2017-05-17 Thread Jeff Mahoney
On 5/17/17 4:52 PM, Noah Massey wrote:
> On Wed, May 17, 2017 at 4:34 PM, Nikolay Borisov <nbori...@suse.com> wrote:
>>
>>
>> On 17.05.2017 21:57, Noah Massey wrote:
>>> On Wed, May 17, 2017 at 11:07 AM, Nikolay Borisov <nbori...@suse.com> wrote:
>>>> Currently the struct space_info creation code is intermixed in the
>>>> udpate_space_info function. There are well-defined points at which the we
>>>
>>> ^^^ update_space_info
>>>
>>>> actually want to create brand-new space_info structs (e.g. during mount of
>>>> the filesystem as well as sometimes when adding/initialising new chunks). 
>>>> In
>>>> such cases udpate_space_info is called with 0 as the bytes parameter. All 
>>>> of
>>>> this makes for spaghetti code.
>>>>
>>>> Fix it by factoring out the creation code in a separate create_space_info
>>>> structure. This also allows to simplify the internals. Furthermore it will
>>>> make the update_space_info function not fail, allowing to remove error
>>>> handling in callers. This will come in a follow up patch.
>>>>
>>>> This bears no functional changes
>>>>
>>>> Signed-off-by: Nikolay Borisov <nbori...@suse.com>
>>>> Reviewed-by: Jeff Mahoney <je...@suse.com>
>>>> ---
>>>>  fs/btrfs/extent-tree.c | 127 
>>>> -
>>>>  1 file changed, 62 insertions(+), 65 deletions(-)
>>>>
>>>> Change since v1:
>>>>
>>>>  Incorporated Jeff Mahoney's feedback and added his reviewed-by
>>>>
>>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>>> index be5477676cc8..28848e45b018 100644
>>>> --- a/fs/btrfs/extent-tree.c
>>>> +++ b/fs/btrfs/extent-tree.c
>>>> @@ -3914,15 +3914,58 @@ static const char *alloc_name(u64 flags)
>>>> };
>>>>  }
>>>>
>>>> +static int create_space_info(struct btrfs_fs_info *info, u64 flags,
>>>> +struct btrfs_space_info **new) {
>>>> +
>>>> +   struct btrfs_space_info *space_info;
>>>> +   int i;
>>>> +   int ret;
>>>> +
>>>> +   space_info = kzalloc(sizeof(*space_info), GFP_NOFS);
>>>> +   if (!space_info)
>>>> +   return -ENOMEM;
>>>> +
>>>> +   ret = percpu_counter_init(&space_info->total_bytes_pinned, 0, 
>>>> GFP_KERNEL);
>>>> +   if (ret) {
>>>> +   kfree(space_info);
>>>> +   return ret;
>>>> +   }
>>>> +
>>>> +   for (i = 0; i < BTRFS_NR_RAID_TYPES; i++)
>>>> +   INIT_LIST_HEAD(&space_info->block_groups[i]);
>>>> +   init_rwsem(&space_info->groups_sem);
>>>> +   spin_lock_init(&space_info->lock);
>>>> +   space_info->flags = flags & BTRFS_BLOCK_GROUP_TYPE_MASK;
>>>> +   space_info->force_alloc = CHUNK_ALLOC_NO_FORCE;
>>>> +   init_waitqueue_head(&space_info->wait);
>>>> +   INIT_LIST_HEAD(&space_info->ro_bgs);
>>>> +   INIT_LIST_HEAD(&space_info->tickets);
>>>> +   INIT_LIST_HEAD(&space_info->priority_tickets);
>>>> +
>>>> +   ret = kobject_init_and_add(&space_info->kobj, &space_info_ktype,
>>>> +   info->space_info_kobj, "%s",
>>>> +   alloc_name(space_info->flags));
>>>> +   if (ret) {
>>>> +   percpu_counter_destroy(&space_info->total_bytes_pinned);
>>>> +   kfree(space_info);
>>>> +   return ret;
>>>> +   }
>>>> +
>>>> +   *new = space_info;
>>>> +   list_add_rcu(&space_info->list, &info->space_info);
>>>> +   if (flags & BTRFS_BLOCK_GROUP_DATA)
>>>> +   info->data_sinfo = space_info;
>>>> +
>>>> +   return ret;
>>>> +}
>>>> +
>>>>  static int update_space_info(struct btrfs_fs_info *info, u64 flags,
>>>>  u64 total_bytes, u64 bytes_used,
>>>>  u64 bytes_readonly,
>>>>  struct btrfs_space_info **space_info)
>>>>  {
>>>> struct btrfs_space_info *found;

Re: [PATCH 1/2] btrfs: Separate space_info create/update

2017-05-17 Thread Jeff Mahoney
found->flush = 0;
> - init_waitqueue_head(&found->wait);
> - INIT_LIST_HEAD(&found->ro_bgs);
> - INIT_LIST_HEAD(&found->tickets);
> - INIT_LIST_HEAD(&found->priority_tickets);
> -
> - ret = kobject_init_and_add(&found->kobj, &space_info_ktype,
> - info->space_info_kobj, "%s",
> - alloc_name(found->flags));
> - if (ret) {
> - kfree(found);
> - return ret;
> - }
> -
> - *space_info = found;
> - list_add_rcu(&found->list, &info->space_info);
> - if (flags & BTRFS_BLOCK_GROUP_DATA)
> - info->data_sinfo = found;
> -
> - return ret;
>  }
>  
>  static void set_avail_alloc_bits(struct btrfs_fs_info *fs_info, u64 flags)
> @@ -4495,10 +4490,9 @@ static int do_chunk_alloc(struct btrfs_trans_handle 
> *trans,
>  
>   space_info = __find_space_info(fs_info, flags);
>   if (!space_info) {
> - ret = update_space_info(fs_info, flags, 0, 0, 0, &space_info);
> + ret = create_space_info(fs_info, flags, &space_info);
>   BUG_ON(ret); /* -ENOMEM */
>   }
> - BUG_ON(!space_info); /* Logic error */
>  
>  again:
>   spin_lock(&space_info->lock);
> @@ -5763,7 +5757,7 @@ int btrfs_orphan_reserve_metadata(struct 
> btrfs_trans_handle *trans,
>*/
>   u64 num_bytes = btrfs_calc_trans_metadata_size(fs_info, 1);
>  
> - trace_btrfs_space_reservation(fs_info, "orphan", btrfs_ino(inode), 
> + trace_btrfs_space_reservation(fs_info, "orphan", btrfs_ino(inode),
>   num_bytes, 1);
>   return btrfs_block_rsv_migrate(src_rsv, dst_rsv, num_bytes, 1);
>  }
> @@ -10153,6 +10147,7 @@ int btrfs_make_block_group(struct btrfs_trans_handle 
> *trans,
>   struct btrfs_block_group_cache *cache;
>   int ret;
>  
> +
>   btrfs_set_log_full_commit(fs_info, trans);
>  
>   cache = btrfs_create_block_group_cache(fs_info, chunk_offset, size);

Drop this chunk.

> @@ -10191,16 +10186,18 @@ int btrfs_make_block_group(struct 
> btrfs_trans_handle *trans,
>   }
>  #endif
>   /*
> -  * Call to ensure the corresponding space_info object is created and
> -  * assigned to our block group, but don't update its counters just yet.
> -  * We want our bg to be added to the rbtree with its ->space_info set.
> +  * Ensure the corresponding space_info object is created and
> +  * assigned to our block group. We want our bg to be added to the rbtree
> +  * with its ->space_info set.
>*/
> - ret = update_space_info(fs_info, cache->flags, 0, 0, 0,
> - &cache->space_info);
> - if (ret) {
> - btrfs_remove_free_space_cache(cache);
> - btrfs_put_block_group(cache);
> - return ret;
> + cache->space_info = __find_space_info(fs_info, cache->flags);
> + if (!cache->space_info) {
> + ret = create_space_info(fs_info, cache->flags, 
> &cache->space_info);
> + if (ret) {
> + btrfs_remove_free_space_cache(cache);
> + btrfs_put_block_group(cache);
> + return ret;
> + }
>   }
>  
>   ret = btrfs_add_block_group_cache(fs_info, cache);
> @@ -10774,21 +10771,21 @@ int btrfs_init_space_info(struct btrfs_fs_info 
> *fs_info)
>   mixed = 1;
>  
>   flags = BTRFS_BLOCK_GROUP_SYSTEM;
> - ret = update_space_info(fs_info, flags, 0, 0, 0, &space_info);
> + ret = create_space_info(fs_info, flags, &space_info);
>   if (ret)
>   goto out;
>  
>   if (mixed) {
>   flags = BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_DATA;
> - ret = update_space_info(fs_info, flags, 0, 0, 0, &space_info);
> + ret = create_space_info(fs_info, flags, &space_info);
>   } else {
>   flags = BTRFS_BLOCK_GROUP_METADATA;
> - ret = update_space_info(fs_info, flags, 0, 0, 0, &space_info);
> + ret = create_space_info(fs_info, flags, &space_info);
>   if (ret)
>   goto out;
>  
>   flags = BTRFS_BLOCK_GROUP_DATA;
> - ret = update_space_info(fs_info, flags, 0, 0, 0, &space_info);
> + ret = create_space_info(fs_info, flags, &space_info);
>   }
>  out:
>   return ret;
> 

Reviewed-by: Jeff Mahoney <je...@suse.com>

-Jeff

-- 
Jeff Mahoney
SUSE Labs





  1   2   3   4   5   6   7   8   >