Re: Mount stalls indefinitely after enabling quota groups.

2018-08-11 Thread Qu Wenruo


On 2018/8/12 12:18 PM, Dan Merillat wrote:
> On Sat, Aug 11, 2018 at 9:36 PM Qu Wenruo  wrote:
> 
>>> I'll add a new rescue subcommand, 'btrfs rescue disable-quota' for you
>>> to disable quota offline.
>>
>> Patch set (from my work mailbox), titled "[PATCH] btrfs-progs: rescue:
>> Add ability to disable quota offline".
>> Can also be fetched from github:
>> https://github.com/adam900710/btrfs-progs/tree/quota_disable
>>
>> Usage is:
>> # btrfs rescue disable-quota <device>
>>
>> Tested locally, it would just toggle the ON/OFF flag for quota, so the
>> modification should be minimal.
> 
> Noticed one thing while testing this, but it's not related to the
> patch so I'll keep it here.
> I still had the ,ro mounts in fstab, and while it mounted ro quickly,
> *unmounting* the filesystem, even read-only,
> got hung up:
> 
> Aug 11 23:47:27 fileserver kernel: [  484.314725] INFO: task
> umount:5422 blocked for more than 120 seconds.
> Aug 11 23:47:27 fileserver kernel: [  484.314787]   Not tainted
> 4.17.14-dirty #3
> Aug 11 23:47:27 fileserver kernel: [  484.314892] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Aug 11 23:47:27 fileserver kernel: [  484.315006] umount  D
> 0  5422   4656 0x0080
> Aug 11 23:47:27 fileserver kernel: [  484.315122] Call Trace:
> Aug 11 23:47:27 fileserver kernel: [  484.315176]  ? __schedule+0x2c0/0x820
> Aug 11 23:47:27 fileserver kernel: [  484.315270]  ?
> kmem_cache_alloc+0x167/0x1b0
> Aug 11 23:47:27 fileserver kernel: [  484.315358]  schedule+0x3c/0x90
> Aug 11 23:47:27 fileserver kernel: [  484.315493]  
> schedule_timeout+0x1e4/0x430
> Aug 11 23:47:27 fileserver kernel: [  484.315542]  ?
> kmem_cache_alloc+0x167/0x1b0
> Aug 11 23:47:27 fileserver kernel: [  484.315686]  wait_for_common+0xb1/0x170
> Aug 11 23:47:27 fileserver kernel: [  484.315798]  ? wake_up_q+0x70/0x70
> Aug 11 23:47:27 fileserver kernel: [  484.315911]
> btrfs_qgroup_wait_for_completion+0x5f/0x80

This normally waits for the rescan.
It's possible that your original "btrfs quota enable" kicked off a
rescan that hadn't finished before the umount.

But for an RO mount we shouldn't have any rescan running.

Maybe I can find some spare time to look into it.

Thanks for the report,
Qu

> Aug 11 23:47:27 fileserver kernel: [  484.316031]  close_ctree+0x27/0x2d0
> Aug 11 23:47:27 fileserver kernel: [  484.316138]
> generic_shutdown_super+0x69/0x110
> Aug 11 23:47:27 fileserver kernel: [  484.316252]  kill_anon_super+0xe/0x20
> Aug 11 23:47:27 fileserver kernel: [  484.316301]  btrfs_kill_super+0x13/0x100
> Aug 11 23:47:27 fileserver kernel: [  484.316349]
> deactivate_locked_super+0x39/0x70
> Aug 11 23:47:27 fileserver kernel: [  484.316399]  cleanup_mnt+0x3b/0x70
> Aug 11 23:47:27 fileserver kernel: [  484.316459]  task_work_run+0x89/0xb0
> Aug 11 23:47:27 fileserver kernel: [  484.316519]
> exit_to_usermode_loop+0x8c/0x90
> Aug 11 23:47:27 fileserver kernel: [  484.316579]  do_syscall_64+0xf1/0x110
> Aug 11 23:47:27 fileserver kernel: [  484.316639]
> entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> Is it trying to write changes to a ro mount, or is it doing a bunch of
> work that it's just going to throw away?  I ended up using sysrq-b
> after commenting out the entries in fstab.
> 
> Everything seems fine with the filesystem now.  I appreciate all the help!
> 



signature.asc
Description: OpenPGP digital signature


Re: Mount stalls indefinitely after enabling quota groups.

2018-08-11 Thread Dan Merillat
On Sat, Aug 11, 2018 at 9:36 PM Qu Wenruo  wrote:

> > I'll add a new rescue subcommand, 'btrfs rescue disable-quota' for you
> > to disable quota offline.
>
> Patch set (from my work mailbox), titled "[PATCH] btrfs-progs: rescue:
> Add ability to disable quota offline".
> Can also be fetched from github:
> https://github.com/adam900710/btrfs-progs/tree/quota_disable
>
> Usage is:
> # btrfs rescue disable-quota <device>
>
> Tested locally, it would just toggle the ON/OFF flag for quota, so the
> modification should be minimal.

Noticed one thing while testing this, but it's not related to the
patch so I'll keep it here.
I still had the ,ro mounts in fstab, and while it mounted ro quickly,
*unmounting* the filesystem, even read-only,
got hung up:

Aug 11 23:47:27 fileserver kernel: [  484.314725] INFO: task
umount:5422 blocked for more than 120 seconds.
Aug 11 23:47:27 fileserver kernel: [  484.314787]   Not tainted
4.17.14-dirty #3
Aug 11 23:47:27 fileserver kernel: [  484.314892] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 11 23:47:27 fileserver kernel: [  484.315006] umount  D
0  5422   4656 0x0080
Aug 11 23:47:27 fileserver kernel: [  484.315122] Call Trace:
Aug 11 23:47:27 fileserver kernel: [  484.315176]  ? __schedule+0x2c0/0x820
Aug 11 23:47:27 fileserver kernel: [  484.315270]  ?
kmem_cache_alloc+0x167/0x1b0
Aug 11 23:47:27 fileserver kernel: [  484.315358]  schedule+0x3c/0x90
Aug 11 23:47:27 fileserver kernel: [  484.315493]  schedule_timeout+0x1e4/0x430
Aug 11 23:47:27 fileserver kernel: [  484.315542]  ?
kmem_cache_alloc+0x167/0x1b0
Aug 11 23:47:27 fileserver kernel: [  484.315686]  wait_for_common+0xb1/0x170
Aug 11 23:47:27 fileserver kernel: [  484.315798]  ? wake_up_q+0x70/0x70
Aug 11 23:47:27 fileserver kernel: [  484.315911]
btrfs_qgroup_wait_for_completion+0x5f/0x80
Aug 11 23:47:27 fileserver kernel: [  484.316031]  close_ctree+0x27/0x2d0
Aug 11 23:47:27 fileserver kernel: [  484.316138]
generic_shutdown_super+0x69/0x110
Aug 11 23:47:27 fileserver kernel: [  484.316252]  kill_anon_super+0xe/0x20
Aug 11 23:47:27 fileserver kernel: [  484.316301]  btrfs_kill_super+0x13/0x100
Aug 11 23:47:27 fileserver kernel: [  484.316349]
deactivate_locked_super+0x39/0x70
Aug 11 23:47:27 fileserver kernel: [  484.316399]  cleanup_mnt+0x3b/0x70
Aug 11 23:47:27 fileserver kernel: [  484.316459]  task_work_run+0x89/0xb0
Aug 11 23:47:27 fileserver kernel: [  484.316519]
exit_to_usermode_loop+0x8c/0x90
Aug 11 23:47:27 fileserver kernel: [  484.316579]  do_syscall_64+0xf1/0x110
Aug 11 23:47:27 fileserver kernel: [  484.316639]
entry_SYSCALL_64_after_hwframe+0x49/0xbe

Is it trying to write changes to a ro mount, or is it doing a bunch of
work that it's just going to throw away?  I ended up using sysrq-b
after commenting out the entries in fstab.

Everything seems fine with the filesystem now.  I appreciate all the help!


Re: [PATCH] btrfs-progs: rescue: Add ability to disable quota offline

2018-08-11 Thread Dan Merillat
On Sat, Aug 11, 2018 at 9:34 PM Qu Wenruo  wrote:
>
> Provide an offline tool to disable quota.
>
> > For kernels where skip_balance doesn't work, there is no way to disable
> > quota on a huge fs with balance, as quota will cause balance to hang for a
> > very long time at each tree block switch.
>
> So add an offline rescue tool to disable quota.
>
> Reported-by: Dan Merillat 
> Signed-off-by: Qu Wenruo 

That fixed it, thanks.

Tested-By: Dan Merillat 


Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota

2018-08-11 Thread Chris Murphy
On Fri, Aug 10, 2018 at 9:29 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> Chris Murphy posted on Fri, 10 Aug 2018 12:07:34 -0600 as excerpted:
>
>> But whether data is shared or exclusive seems potentially ephemeral, and
>> not something a sysadmin should even be able to anticipate let alone
>> individual users.
>
> Define "user(s)".

The person who is saving their document on a network share, and
they've never heard of Btrfs.


> Arguably, in the context of btrfs tool usage, "user" /is/ the admin,

I'm not talking about btrfs tools. I'm talking about rational,
predictable behavior of a shared folder.

If I try to drop a 1GiB file into my share and I'm denied for lack of
free space, and behind the scenes it's because of a quota limit, I
expect I can delete *any* file(s) amounting to 1GiB of free space
and then be able to drop that file successfully without error.

But if I'm unwittingly deleting shared files, my quota usage won't go
down, and I still can't save my file. So now I somehow need a secret
incantation to discover only my exclusive files and delete enough of
them in order to save this 1GiB file. It's weird, it's unexpected, and I
think it's a use-case failure. Maybe Btrfs quotas aren't meant to work
with Samba or NFS shares. *shrug*
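For context, btrfs qgroups track two numbers per group: "referenced" (data
reachable from the subvolume) and "exclusive" (data reachable only from it).
A toy sketch of why deleting shared files doesn't free quota (the names and
data structure here are mine for illustration, not the kernel's actual
accounting code):

```python
# Toy model of qgroup accounting: an extent counts toward a group's
# "referenced" total whenever the group can reach it, but toward
# "exclusive" only when no other group shares it.
def usage(extents, group):
    referenced = sum(size for size, owners in extents if group in owners)
    exclusive = sum(size for size, owners in extents if owners == {group})
    return referenced, exclusive

# alice's share: a 1 GiB private file plus a 2 GiB file reflinked with bob
extents = [(1, {"alice"}), (2, {"alice", "bob"})]
print(usage(extents, "alice"))   # (3, 1)

# alice deletes the 2 GiB *shared* file: her referenced total drops,
# but her exclusive usage (what a quota limit may charge) does not.
extents = [(1, {"alice"}), (2, {"bob"})]
print(usage(extents, "alice"))   # (1, 1)
```

This is why deleting shared files from the user's view frees no quota under
an exclusive-based limit.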



>
> "Regular users" as you use the term, that is the non-admins who just need
> to know how close they are to running out of their allotted storage
> resources, shouldn't really need to care about btrfs tool usage in the
> first place, and btrfs commands in general, including btrfs quota related
> commands, really aren't targeted at them, and aren't designed to report
> the type of information they are likely to find useful.  Other tools will
> be more appropriate.

I'm not talking about any btrfs commands or even the term quota for
regular users. I'm talking about saving a file, being denied, and how
does the user figure out how to free up space?

Anyway, it's a hypothetical scenario. While I have Samba running on a
Btrfs volume with various shares as subvolumes, I don't have quotas
enabled.



-- 
Chris Murphy


Re: Mount stalls indefinitely after enabling quota groups.

2018-08-11 Thread Qu Wenruo


On 2018/8/12 8:30 AM, Qu Wenruo wrote:
> 
> 
> On 2018/8/12 5:10 AM, Dan Merillat wrote:
>> 19 hours later, still going extremely slowly and taking longer and
>> longer to make progress.  Main symptom is the mount process is
>> spinning at 100% CPU, interspersed with btrfs-transaction spinning at
>> 100% CPU.
>> So far it's racked up 14h45m of CPU time on mount and an additional
>> 3h40m on btrfs-transaction.
>>
>> The current drop key changes every 10-15 minutes when I check it via
>> inspect-internal, so some progress is slowly being made.
>>
>> I built the kernel with ftrace to see what's going on internally, this
>> is the pattern I'm seeing:
>>
> [snip]
> 
> It looks pretty much like qgroup, but there is too much noise.
> The pinpoint trace event to watch would be btrfs_find_all_roots().
> 
>>
>> Repeats indefinitely.  btrace shows basically zero activity on the
>> array while it spins, with the occasional burst when mount &
>> btrfs-transaction swap off.
>>
>> To recap the chain of events leading up to this:
>> 11TB Array got completely full and started fragmenting badly.
>> Ran bedup and it found 600gb of duplicate files that it offline-shared.
>> Reboot for unrelated reasons
> 
> 11T with highly deduped usage is really the worst-case scenario for qgroup.
> Qgroup is not really good at handling highly reflinked files, nor at balance.
> When they combine, it gets worse.
> 
>> Enabled quota on all subvolumes to try to track where the new data is
>> coming from
>> Tried to balance metadata due to transaction CPU spikes
>> Force-rebooted after the array was completely lagged out.
>>
>> Now attempting to mount it RW.  Readonly works, but RW has taken well
>> over 24 hours at this point.
> 
> I'll add a new rescue subcommand, 'btrfs rescue disable-quota' for you
> to disable quota offline.

Patch set (from my work mailbox), titled "[PATCH] btrfs-progs: rescue:
Add ability to disable quota offline".
Can also be fetched from github:
https://github.com/adam900710/btrfs-progs/tree/quota_disable

Usage is:
# btrfs rescue disable-quota <device>

Tested locally, it would just toggle the ON/OFF flag for quota, so the
modification should be minimal.

Thanks,
Qu

> 
> Thanks,
> Qu
> 
> 





[PATCH] btrfs-progs: rescue: Add ability to disable quota offline

2018-08-11 Thread Qu Wenruo
Provide an offline tool to disable quota.

For kernels where skip_balance doesn't work, there is no way to disable
quota on a huge fs with balance, as quota will cause balance to hang for a
very long time at each tree block switch.

So add an offline rescue tool to disable quota.

Reported-by: Dan Merillat 
Signed-off-by: Qu Wenruo 
---
This patch can be fetched from the github repo:
https://github.com/adam900710/btrfs-progs/tree/quota_disable
---
 Documentation/btrfs-rescue.asciidoc |  6 +++
 cmds-rescue.c   | 80 +
 2 files changed, 86 insertions(+)

diff --git a/Documentation/btrfs-rescue.asciidoc b/Documentation/btrfs-rescue.asciidoc
index f94a0ff2b45e..fb088c1a768a 100644
--- a/Documentation/btrfs-rescue.asciidoc
+++ b/Documentation/btrfs-rescue.asciidoc
@@ -31,6 +31,12 @@ help.
 NOTE: Since *chunk-recover* will scan the whole device, it will be *VERY* slow
 especially executed on a large device.
 
+*disable-quota* ::
+disable quota offline
++
+Acts as a fallback method to disable quota for cases where mount hangs due to
+balance and quota.
+
 *fix-device-size* ::
 fix device size and super block total bytes values that are do not match
 +
diff --git a/cmds-rescue.c b/cmds-rescue.c
index 38c4ab9b2ef6..c7cd92427e9d 100644
--- a/cmds-rescue.c
+++ b/cmds-rescue.c
@@ -250,6 +250,84 @@ out:
return !!ret;
 }
 
+static const char * const cmd_rescue_disable_quota_usage[] = {
+   "btrfs rescue disable-quota <device>",
+   "Disable quota, especially useful for balance mount hang when quota enabled",
+   "",
+   NULL
+};
+
+static int cmd_rescue_disable_quota(int argc, char **argv)
+{
+   struct btrfs_trans_handle *trans;
+   struct btrfs_fs_info *fs_info;
+   struct btrfs_path path;
+   struct btrfs_root *root;
+   struct btrfs_qgroup_status_item *qi;
+   struct btrfs_key key;
+   char *devname;
+   int ret;
+
+   clean_args_no_options(argc, argv, cmd_rescue_disable_quota_usage);
+   if (check_argc_exact(argc, 2))
+   usage(cmd_rescue_disable_quota_usage);
+
+   devname = argv[optind];
+   ret = check_mounted(devname);
+   if (ret < 0) {
+   error("could not check mount status: %s", strerror(-ret));
+   return !!ret;
+   } else if (ret) {
+   error("%s is currently mounted", devname);
+   return !!ret;
+   }
+   fs_info = open_ctree_fs_info(devname, 0, 0, 0, OPEN_CTREE_WRITES);
+   if (!fs_info) {
+   error("could not open btrfs");
+   ret = -EIO;
+   return !!ret;
+   }
+   root = fs_info->quota_root;
+   if (!root) {
+   printf("Quota is not enabled, no need to modify the fs\n");
+   goto close;
+   }
+   btrfs_init_path(&path);
+   trans = btrfs_start_transaction(root, 1);
+   if (IS_ERR(trans)) {
+   ret = PTR_ERR(trans);
+   error("failed to start transaction: %s", strerror(-ret));
+   goto close;
+   }
+   key.objectid = 0;
+   key.type = BTRFS_QGROUP_STATUS_KEY;
+   key.offset = 0;
+   ret = btrfs_search_slot(trans, root, &key, &path, 0, 1);
+   if (ret < 0) {
+   error("failed to search tree: %s", strerror(-ret));
+   goto close;
+   }
+   if (ret > 0) {
+   printf(
+   "qgroup status item not found, no need to modify the fs");
+   ret = 0;
+   goto release;
+   }
+   qi = btrfs_item_ptr(path.nodes[0], path.slots[0],
+   struct btrfs_qgroup_status_item);
+   btrfs_set_qgroup_status_flags(path.nodes[0], qi,
+   BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT);
+   btrfs_mark_buffer_dirty(path.nodes[0]);
+   ret = btrfs_commit_transaction(trans, root);
+   if (ret < 0)
+   error("failed to commit transaction: %s", strerror(-ret));
+release:
+   btrfs_release_path(&path);
+close:
+   close_ctree(fs_info->tree_root);
+   return !!ret;
+}
+
 static const char rescue_cmd_group_info[] =
 "toolbox for specific rescue operations";
 
@@ -262,6 +340,8 @@ const struct cmd_group rescue_cmd_group = {
{ "zero-log", cmd_rescue_zero_log, cmd_rescue_zero_log_usage, NULL, 0},
{ "fix-device-size", cmd_rescue_fix_device_size,
cmd_rescue_fix_device_size_usage, NULL, 0},
+   { "disable-quota", cmd_rescue_disable_quota,
+   cmd_rescue_disable_quota_usage, NULL, 0},
NULL_CMD_STRUCT
}
 };
-- 
2.18.0



Re: Mount stalls indefinitely after enabling quota groups.

2018-08-11 Thread Qu Wenruo


On 2018/8/12 8:59 AM, Dan Merillat wrote:
> On Sat, Aug 11, 2018 at 8:30 PM Qu Wenruo  wrote:
>>
>> It looks pretty much like qgroup, but there is too much noise.
>> The pinpoint trace event to watch would be btrfs_find_all_roots().
> 
> I had this half-written when you replied.
> 
> Agreed: looks like bulk of time spent resides in qgroups.  Spent some
> time with sysrq-l and ftrace:
> 
> ? __rcu_read_unlock+0x5/0x50
> ? return_to_handler+0x15/0x36
> __rcu_read_unlock+0x5/0x50
> find_extent_buffer+0x47/0x90extent_io.c:4888
> read_block_for_search.isra.12+0xc8/0x350ctree.c:2399
> btrfs_search_slot+0x3e7/0x9c0   ctree.c:2837
> btrfs_next_old_leaf+0x1dc/0x410 ctree.c:5702
> btrfs_next_old_item ctree.h:2952
> add_all_parents backref.c:487
> resolve_indirect_refs+0x3f7/0x7e0   backref.c:575
> find_parent_nodes+0x42d/0x1290  backref.c:1236
> ? find_parent_nodes+0x5/0x1290  backref.c:1114
> btrfs_find_all_roots_safe+0x98/0x100backref.c:1414
> btrfs_find_all_roots+0x52/0x70  backref.c:1442
> btrfs_qgroup_trace_extent_post+0x27/0x60qgroup.c:1503
> btrfs_qgroup_trace_leaf_items+0x104/0x130   qgroup.c:1589
> btrfs_qgroup_trace_subtree+0x26a/0x3a0  qgroup.c:1750
> do_walk_down+0x33c/0x5a0extent-tree.c:8883
> walk_down_tree+0xa8/0xd0extent-tree.c:9041
> btrfs_drop_snapshot+0x370/0x8b0 extent-tree.c:9203
> merge_reloc_roots+0xcf/0x220
> btrfs_recover_relocation+0x26d/0x400
> ? btrfs_cleanup_fs_roots+0x16a/0x180
> btrfs_remount+0x32e/0x510
> do_remount_sb+0x67/0x1e0
> do_mount+0x712/0xc90
> 
> The mount is looping in btrfs_qgroup_trace_subtree, as evidenced by
> the following ftrace filter:
> fileserver:/sys/kernel/tracing# cat set_ftrace_filter
> btrfs_qgroup_trace_extent
> btrfs_qgroup_trace_subtree

Yep, it's quota causing the hang.

> 
[snip]
> 
> So 10-13 minutes per cycle.
> 
>> 11T with highly deduped usage is really the worst-case scenario for qgroup.
>> Qgroup is not really good at handling highly reflinked files, nor at balance.
>> When they combine, it gets worse.
> 
> I'm not really understanding the use-case of qgroup if it melts down
> on large systems with a shared base + individual changes.

The problem is that for balance, btrfs does a trick: it switches the tree
reloc tree with the real fs tree.
However, the tree reloc tree isn't accounted to quota, while the real fs
tree does contribute to quota.

And since the owner changes as above, btrfs needs to do a full subtree rescan.
For a small subvolume that's not a problem, but for a large subvolume quota
needs to rescan thousands of tree blocks, and due to the highly deduped files,
each tree block needs extra iterations for each deduped file.

Both factors contribute to the slow mount.
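As a rough illustration (a toy cost model of my own, not the kernel's actual
accounting logic), the rescan work scales with both the subtree size and how
heavily each block is shared:

```python
# Toy cost model: after balance swaps a subtree's owner, qgroup must
# re-account every tree block, and each block's backref walk scales
# with the number of snapshots/reflinks that share it.
def rescan_work(tree_blocks, refs_per_block):
    return tree_blocks * refs_per_block

small_subvol = rescan_work(1_000, 2)        # small, lightly shared subvolume
deduped_11t = rescan_work(1_000_000, 50)    # large, heavily deduped array
print(deduped_11t // small_subvol)          # 25000x more backref lookups
```

The example numbers are made up; the point is only that the two factors
multiply rather than add.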

There are several workaround patches on the mailing list; one makes the
balance run in the background during mount, so it won't hang the mount.
But it still makes transactions pretty slow (writes will still be blocked
for a long time).

There is also a plan to skip the subtree rescan completely, but it needs extra
review to ensure such a tree block switch won't change the quota numbers.

Thanks,
Qu

> 
>> I'll add a new rescue subcommand, 'btrfs rescue disable-quota' for you
>> to disable quota offline.
> 
> Ok.  I was looking at just doing this to speed things up:
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 51b5e2da708c..c5bf937b79f0 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -8877,7 +8877,7 @@ static noinline int do_walk_down(struct
> btrfs_trans_handle *trans,
> parent = 0;
> }
> 
> -   if (need_account) {
> +   if (0) {
> ret = btrfs_qgroup_trace_subtree(trans, root, next,
>  generation, level - 
> 1);
> if (ret) {
> 
> 
> btrfs_err_rl(fs_info,
>   "Error %d accounting shared subtree. Quota
> is out of sync, rescan required.",
>   ret);
>  }
> 
> 
> If I follow, this will leave me with inconsistent qgroups and a full
> rescan is required.  That seems an acceptable tradeoff, since it seems
> like the best plan going forward is to nuke the qgroups anyway.
> 
> There's still the btrfs-transaction spin, but I'm hoping that's
> related to qgroups as well.
> 
>>
>> Thanks,
>> Qu
> 
> Appreciate it.  I was going to go with my hackjob patch to avoid any
> untested rewriting - there's already an error path for "something went
> wrong updating qgroups during walk_tree" so it seemed safest to take
> advantage of it.  I'll patch either the kernel or the btrfs programs,
> whichever you think is best.
> 





Re: Mount stalls indefinitely after enabling quota groups.

2018-08-11 Thread Dan Merillat
On Sat, Aug 11, 2018 at 8:30 PM Qu Wenruo  wrote:
>
> It looks pretty much like qgroup, but there is too much noise.
> The pinpoint trace event to watch would be btrfs_find_all_roots().

I had this half-written when you replied.

Agreed: looks like bulk of time spent resides in qgroups.  Spent some
time with sysrq-l and ftrace:

? __rcu_read_unlock+0x5/0x50
? return_to_handler+0x15/0x36
__rcu_read_unlock+0x5/0x50
find_extent_buffer+0x47/0x90extent_io.c:4888
read_block_for_search.isra.12+0xc8/0x350ctree.c:2399
btrfs_search_slot+0x3e7/0x9c0   ctree.c:2837
btrfs_next_old_leaf+0x1dc/0x410 ctree.c:5702
btrfs_next_old_item ctree.h:2952
add_all_parents backref.c:487
resolve_indirect_refs+0x3f7/0x7e0   backref.c:575
find_parent_nodes+0x42d/0x1290  backref.c:1236
? find_parent_nodes+0x5/0x1290  backref.c:1114
btrfs_find_all_roots_safe+0x98/0x100backref.c:1414
btrfs_find_all_roots+0x52/0x70  backref.c:1442
btrfs_qgroup_trace_extent_post+0x27/0x60qgroup.c:1503
btrfs_qgroup_trace_leaf_items+0x104/0x130   qgroup.c:1589
btrfs_qgroup_trace_subtree+0x26a/0x3a0  qgroup.c:1750
do_walk_down+0x33c/0x5a0extent-tree.c:8883
walk_down_tree+0xa8/0xd0extent-tree.c:9041
btrfs_drop_snapshot+0x370/0x8b0 extent-tree.c:9203
merge_reloc_roots+0xcf/0x220
btrfs_recover_relocation+0x26d/0x400
? btrfs_cleanup_fs_roots+0x16a/0x180
btrfs_remount+0x32e/0x510
do_remount_sb+0x67/0x1e0
do_mount+0x712/0xc90

The mount is looping in btrfs_qgroup_trace_subtree, as evidenced by
the following ftrace filter:
fileserver:/sys/kernel/tracing# cat set_ftrace_filter
btrfs_qgroup_trace_extent
btrfs_qgroup_trace_subtree

# cat trace
...
   mount-6803  [003]  80407.649752:
btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_subtree
   mount-6803  [003]  80407.649772:
btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_leaf_items
   mount-6803  [003]  80407.649797:
btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_leaf_items
   mount-6803  [003]  80407.649821:
btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_leaf_items
   mount-6803  [003]  80407.649846:
btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_leaf_items
   mount-6803  [003]  80407.701652:
btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_leaf_items
   mount-6803  [003]  80407.754547:
btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_leaf_items
   mount-6803  [003]  80407.754574:
btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_leaf_items
   mount-6803  [003]  80407.754598:
btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_leaf_items
   mount-6803  [003]  80407.754622:
btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_leaf_items
   mount-6803  [003]  80407.754646:
btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_leaf_items

... repeats 240 times

   mount-6803  [002]  80412.568804:
btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_leaf_items
   mount-6803  [002]  80412.568825:
btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_leaf_items
   mount-6803  [002]  80412.568850:
btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_subtree
   mount-6803  [002]  80412.568872:
btrfs_qgroup_trace_extent <-btrfs_qgroup_trace_leaf_items

Looks like invocations of btrfs_qgroup_trace_subtree are taking forever:

   mount-6803  [006]  80641.627709:
btrfs_qgroup_trace_subtree <-do_walk_down
   mount-6803  [003]  81433.760945:
btrfs_qgroup_trace_subtree <-do_walk_down
(add do_walk_down to the trace here)
   mount-6803  [001]  82124.623557: do_walk_down <-walk_down_tree
   mount-6803  [001]  82124.623567:
btrfs_qgroup_trace_subtree <-do_walk_down
   mount-6803  [006]  82695.241306: do_walk_down <-walk_down_tree
   mount-6803  [006]  82695.241316:
btrfs_qgroup_trace_subtree <-do_walk_down

So 10-13 minutes per cycle.
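The cycle time can be read straight off the trace timestamps (seconds since
boot); a quick sketch using the btrfs_qgroup_trace_subtree entries above:

```python
# Successive btrfs_qgroup_trace_subtree entry timestamps from the
# ftrace log above (seconds since boot).
stamps = [80641.627709, 81433.760945, 82124.623567, 82695.241316]

# Interval between successive subtree walks, in minutes.
deltas = [(b - a) / 60 for a, b in zip(stamps, stamps[1:])]
print([round(d, 1) for d in deltas])   # [13.2, 11.5, 9.5]
```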

> 11T with highly deduped usage is really the worst-case scenario for qgroup.
> Qgroup is not really good at handling highly reflinked files, nor at balance.
> When they combine, it gets worse.

I'm not really understanding the use-case of qgroup if it melts down
on large systems with a shared base + individual changes.

> I'll add a new rescue subcommand, 'btrfs rescue disable-quota' for you
> to disable quota offline.

Ok.  I was looking at just doing this to speed things up:

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 51b5e2da708c..c5bf937b79f0 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -8877,7 +8877,7 @@ static noinline int do_walk_down(struct
btrfs_trans_handle *trans,
parent = 0;
}

-   if (need_account) {
+   if (0) {
ret 

Re: Mount stalls indefinitely after enabling quota groups.

2018-08-11 Thread Qu Wenruo


On 2018/8/12 5:10 AM, Dan Merillat wrote:
> 19 hours later, still going extremely slowly and taking longer and
> longer to make progress.  Main symptom is the mount process is
> spinning at 100% CPU, interspersed with btrfs-transaction spinning at
> 100% CPU.
> So far it's racked up 14h45m of CPU time on mount and an additional
> 3h40m on btrfs-transaction.
> 
> The current drop key changes every 10-15 minutes when I check it via
> inspect-internal, so some progress is slowly being made.
> 
> I built the kernel with ftrace to see what's going on internally, this
> is the pattern I'm seeing:
> 
[snip]

It looks pretty much like qgroup, but there is too much noise.
The pinpoint trace event to watch would be btrfs_find_all_roots().

> 
> Repeats indefinitely.  btrace shows basically zero activity on the
> array while it spins, with the occasional burst when mount &
> btrfs-transaction swap off.
> 
> To recap the chain of events leading up to this:
> 11TB Array got completely full and started fragmenting badly.
> Ran bedup and it found 600gb of duplicate files that it offline-shared.
> Reboot for unrelated reasons

11T with highly deduped usage is really the worst-case scenario for qgroup.
Qgroup is not really good at handling highly reflinked files, nor at balance.
When they combine, it gets worse.

> Enabled quota on all subvolumes to try to track where the new data is
> coming from
> Tried to balance metadata due to transaction CPU spikes
> Force-rebooted after the array was completely lagged out.
> 
> Now attempting to mount it RW.  Readonly works, but RW has taken well
> over 24 hours at this point.

I'll add a new rescue subcommand, 'btrfs rescue disable-quota' for you
to disable quota offline.

Thanks,
Qu






Re: Mount stalls indefinitely after enabling quota groups.

2018-08-11 Thread Dan Merillat
19 hours later, still going extremely slowly and taking longer and
longer to make progress.  Main symptom is the mount process is
spinning at 100% CPU, interspersed with btrfs-transaction spinning at
100% CPU.
So far it's racked up 14h45m of CPU time on mount and an additional
3h40m on btrfs-transaction.

The current drop key changes every 10-15 minutes when I check it via
inspect-internal, so some progress is slowly being made.

I built the kernel with ftrace to see what's going on internally, this
is the pattern I'm seeing:

   mount-6803  [002] ...1 69023.970964: btrfs_next_old_leaf
<-resolve_indirect_refs
   mount-6803  [002] ...1 69023.970965: btrfs_release_path
<-btrfs_next_old_leaf
   mount-6803  [002] ...1 69023.970965: btrfs_search_slot
<-btrfs_next_old_leaf
   mount-6803  [002] ...1 69023.970966:
btrfs_clear_path_blocking <-btrfs_search_slot
   mount-6803  [002] ...1 69023.970966:
btrfs_set_path_blocking <-btrfs_clear_path_blocking
   mount-6803  [002] ...1 69023.970967: btrfs_bin_search
<-btrfs_search_slot
   mount-6803  [002] ...1 69023.970967: btrfs_comp_cpu_keys
<-generic_bin_search.constprop.14
   mount-6803  [002] ...1 69023.970967: btrfs_get_token_64
<-read_block_for_search.isra.12
   mount-6803  [002] ...1 69023.970968: btrfs_get_token_64
<-read_block_for_search.isra.12
   mount-6803  [002] ...1 69023.970968: btrfs_node_key
<-read_block_for_search.isra.12
   mount-6803  [002] ...1 69023.970969: btrfs_buffer_uptodate
<-read_block_for_search.isra.12
   mount-6803  [002] ...1 69023.970969:
btrfs_clear_path_blocking <-btrfs_search_slot
   mount-6803  [002] ...1 69023.970970:
btrfs_set_path_blocking <-btrfs_clear_path_blocking
   mount-6803  [002] ...1 69023.970970: btrfs_bin_search
<-btrfs_search_slot
   mount-6803  [002] ...1 69023.970970: btrfs_comp_cpu_keys
<-generic_bin_search.constprop.14
   mount-6803  [002] ...1 69023.970971: btrfs_comp_cpu_keys
<-generic_bin_search.constprop.14
   mount-6803  [002] ...1 69023.970971: btrfs_comp_cpu_keys
<-generic_bin_search.constprop.14
   mount-6803  [002] ...1 69023.970972: btrfs_comp_cpu_keys
<-generic_bin_search.constprop.14
   mount-6803  [002] ...1 69023.970972: btrfs_comp_cpu_keys
<-generic_bin_search.constprop.14
   mount-6803  [002] ...1 69023.970973: btrfs_comp_cpu_keys
<-generic_bin_search.constprop.14
   mount-6803  [002] ...1 69023.970973: btrfs_comp_cpu_keys
<-generic_bin_search.constprop.14
   mount-6803  [002] ...1 69023.970973: btrfs_comp_cpu_keys
<-generic_bin_search.constprop.14
   mount-6803  [002] ...1 69023.970974: btrfs_comp_cpu_keys
<-generic_bin_search.constprop.14
   mount-6803  [002] ...1 69023.970974: btrfs_get_token_64
<-read_block_for_search.isra.12
   mount-6803  [002] ...1 69023.970975: btrfs_get_token_64
<-read_block_for_search.isra.12
   mount-6803  [002] ...1 69023.970975: btrfs_node_key
<-read_block_for_search.isra.12
   mount-6803  [002] ...1 69023.970976: btrfs_buffer_uptodate
<-read_block_for_search.isra.12
   mount-6803  [002] ...1 69023.970976:
btrfs_clear_path_blocking <-btrfs_search_slot
   mount-6803  [002] ...1 69023.970976:
btrfs_set_path_blocking <-btrfs_clear_path_blocking
   mount-6803  [002] ...1 69023.970977: btrfs_bin_search
<-btrfs_search_slot
   mount-6803  [002] ...1 69023.970977: btrfs_comp_cpu_keys
<-generic_bin_search.constprop.14
   mount-6803  [002] ...1 69023.970978: btrfs_comp_cpu_keys
<-generic_bin_search.constprop.14
   mount-6803  [002] ...1 69023.970978: btrfs_comp_cpu_keys
<-generic_bin_search.constprop.14
   mount-6803  [002] ...1 69023.970978: btrfs_comp_cpu_keys
<-generic_bin_search.constprop.14
   mount-6803  [002] ...1 69023.970979: btrfs_comp_cpu_keys
<-generic_bin_search.constprop.14
   mount-6803  [002] ...1 69023.970979: btrfs_comp_cpu_keys
<-generic_bin_search.constprop.14
   mount-6803  [002] ...1 69023.970980: btrfs_comp_cpu_keys
<-generic_bin_search.constprop.14
   mount-6803  [002] ...1 69023.970980: btrfs_comp_cpu_keys
<-generic_bin_search.constprop.14
   mount-6803  [002] ...1 69023.970980: btrfs_comp_cpu_keys
<-generic_bin_search.constprop.14
   mount-6803  [002] ...1 69023.970981: btrfs_get_token_64
<-read_block_for_search.isra.12
   mount-6803  [002] ...1 69023.970981: btrfs_get_token_64
<-read_block_for_search.isra.12
   mount-6803  [002] ...1 69023.970982: btrfs_node_key
<-read_block_for_search.isra.12
   mount-6803  [002] ...1 69023.970982: btrfs_buffer_uptodate
<-read_block_for_search.isra.12
   mount-6803  [002] ...1 69023.970983:
btrfs_clear_path_blocking <-btrfs_search_slot
   mount-6803  [002] ...1 69023.970983:
btrfs_set_path_blocking <-btrfs_clear_path_blocking
   mount-6803  [002] ...1 

Re: List of known BTRFS Raid 5/6 Bugs?

2018-08-11 Thread Zygo Blaxell
On Sat, Aug 11, 2018 at 08:27:04AM +0200, erentheti...@mail.de wrote:
> I guess that covers most topics, two last questions:
> 
> Will the write hole behave differently on Raid 6 compared to Raid 5 ?

Not really.  It changes the probability distribution (you get an extra
chance to recover using a parity block in some cases), but there are
still cases where data gets lost that didn't need to be.

> Is there any benefit of running Raid 5 Metadata compared to Raid 1 ? 

There may be benefits of raid5 metadata, but they are small compared to
the risks.

In some configurations it may not be possible to allocate the last
gigabyte of space.  raid1 will allocate 1GB chunks from 2 disks at a
time while raid5 will allocate 1GB chunks from N disks at a time, and if
N is an odd number there could be one chunk left over in the array that
is unusable.  Most users will find this irrelevant because a large disk
array that is filled to the last GB will become quite slow due to long
free space search and seek times--you really want to keep usage below 95%,
maybe 98% at most, and that means the last GB will never be needed.
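The odd-disk case can be sketched with a greedy allocator (a deliberate
simplification of btrfs's real chunk allocator, which likewise favors the
devices with the most free space):

```python
# Simplified chunk allocation: raid1 needs 1 GiB on each of two
# different disks per chunk; greedily pick the two emptiest disks.
def raid1_allocatable_gib(free_per_disk):
    free = list(free_per_disk)
    chunks = 0
    while True:
        free.sort(reverse=True)
        if len(free) < 2 or free[1] < 1:
            break               # no second disk with a free gigabyte left
        free[0] -= 1
        free[1] -= 1
        chunks += 1
    return chunks               # GiB of usable raid1 space (2 copies stored)

print(raid1_allocatable_gib([1, 1, 1]))     # 1 -> one disk's last GiB stranded
print(raid1_allocatable_gib([1, 1, 1, 1]))  # 2 -> even disk count, none stranded
```

With three disks holding 1 GiB free each, one gigabyte can find no partner
disk, matching the leftover-chunk scenario described above.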

Reading raid5 metadata could theoretically be faster than raid1, but that
depends on a lot of variables, so you can't assume it as a rule of thumb.

Raid6 metadata is more interesting because it's the only currently
supported way to get 2-disk failure tolerance in btrfs.  Unfortunately
that benefit is rather limited due to the write hole bug.

There are patches floating around that implement multi-disk raid1 (i.e. 3
or 4 mirror copies instead of just 2).  This would be much better for
metadata than raid6--more flexible, more robust, and my guess is that
it will be faster as well (no need for RMW updates or journal seeks).





Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota

2018-08-11 Thread Andrei Borzenkov
On 10.08.2018 12:33, Tomasz Pala wrote:
> 
>> For 4 disk with 1T free space each, if you're using RAID5 for data, then
>> you can write 3T data.
>> But if you're also using RAID10 for metadata, and you're using default
>> inline, we can use small files to fill the free space, resulting 2T
>> available space.
>>
>> So in this case how would you calculate the free space? 3T or 2T or
>> anything between them?
> 
> The answer is pretty simple: 3T. Rationale:
> - this is the space I actually can put in a single data stream,
> - people are aware that there is metadata overhead with any object;
>   after all, metadata are also data,
> - while filling the fs with small files the free space available would
>   self-adjust after every single file put, so after uploading 1T of such
>   files the df should report 1.5T free. There would be nothing weird(er
>   than now) about 1T of data having actually eaten 1.5T of storage.
> 
> No crystal-ball calculations, just KISS; since one _can_ put a 3T file
> (non-sparse, uncompressible, bulk written) on the filesystem, the free space is
> 3T.
> 

As far as I can tell, that is exactly what "df" reports now. "btrfs fi
us" will tell you both the max (reported by "df") and the worst-case min.
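The arithmetic behind the quoted 3T figure, assuming equal free space on
every disk (a sketch of the raid5 parity overhead only; real btrfs
allocation is chunk-based):

```python
# raid5 stripes data across all N disks with one disk's worth of
# parity per stripe, so usable space is (N - 1) disks' worth.
def raid5_usable_tb(free_per_disk_tb, n_disks):
    return free_per_disk_tb * (n_disks - 1)

print(raid5_usable_tb(1, 4))   # 3 -> four disks with 1T free give 3T of data
```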


Re: List of known BTRFS Raid 5/6 Bugs?

2018-08-11 Thread erenthetitan
I guess that covers most topics, two last questions:

Will the write hole behave differently on Raid 6 compared to Raid 5?
Is there any benefit of running Raid 5 metadata compared to Raid 1?
-
FreeMail powered by mail.de - MEHR SICHERHEIT, SERIOSITÄT UND KOMFORT


Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota

2018-08-11 Thread Andrei Borzenkov
On 10.08.2018 21:21, Tomasz Pala wrote:
> On Fri, Aug 10, 2018 at 07:39:30 -0400, Austin S. Hemmelgarn wrote:
> 
>>> I.e.: every shared segment should be accounted within quota (at least once).
>> I think what you mean to say here is that every shared extent should be 
>> accounted to quotas for every location it is reflinked from.  IOW, that 
>> if an extent is shared between two subvolumes each with its own quota, 
>> they should both have it accounted against their quota.
> 
> Yes.
> 

This is what "referenced" in the qgroup report is, isn't it? What is
missing here?