Re: [PATCH] btrfs/139: creation/deletion within qgroup limits

2017-03-13 Thread Eryu Guan
[please cc linux-btrfs@vger.kernel.org for btrfs specific tests]

On Mon, Mar 13, 2017 at 04:37:16PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> If we create and delete files within the qgroup limits, qg->reserved
> (allocations before commits) over-inflates and causes -EDQUOT to
> be returned pre-maturely.
> 
> Also, 32/64bit data-type exchanges can cause reserved (u64)
> to go negative (very large) and -EDQUOT is returned
> pre-maturely.
> 
> Signed-off-by: Goldwyn Rodrigues 

Thanks for the test! I tested it with 4.10 kernel and test failed. I
assume it's expected result. Some comments inline.

> ---
>  tests/btrfs/139 |  75 +
>  tests/btrfs/139.out | 156 
> 
>  tests/btrfs/group   |   1 +
>  3 files changed, 232 insertions(+)
>  create mode 100755 tests/btrfs/139
>  create mode 100644 tests/btrfs/139.out
> 
> diff --git a/tests/btrfs/139 b/tests/btrfs/139
> new file mode 100755
> index 000..df0ef3e
> --- /dev/null
> +++ b/tests/btrfs/139
> @@ -0,0 +1,75 @@
> +#! /bin/bash
> +# FS QA Test 139
> +#
> +# Check if btrfs quota limits are not reached when you constantly
> +# create and delete files within the exclusive qgroup limits.
> +#
> +# Finally we create files to exceed the quota.
> +#
> +#---
> +# Copyright (c) 2017 SUSE. All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#---
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1 # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> + cd /
> + rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# remove previous $seqres.full before test
> +rm -f $seqres.full
> +
> +_supported_fs btrfs
> +_supported_os Linux

Need _require_scratch, it check if SCRATCH_DEV is present and unmounts
it if so.

> +
> +_scratch_mkfs > /dev/null 2>&1
> +#_scratch_mount

Then need to _scratch_mount here.

> +
> +SUBVOL=$SCRATCH_MNT/subvol
> +
> +_run_btrfs_util_prog subvolume create $SUBVOL
> +_run_btrfs_util_prog quota enable $SCRATCH_MNT
> +_run_btrfs_util_prog quota rescan -w $SCRATCH_MNT

I always see this message in dmesg, not sure if it affects test?

BTRFS info (device sdc2): qgroup_rescan_init failed with -115

> +_run_btrfs_util_prog qgroup limit -e 1G $SUBVOL
> +
> +for i in $(seq 1 10); do
> + for j in $(seq 1 7); do
> + $XFS_IO_PROG -f -c "pwrite 0 128m" $SUBVOL/file_$j | 
> _filter_xfs_io

No need to filter & print xfs_io stdout, it makes the .out file not
necessarily too long. We can discard stdout, so if error happens the
stderr output could break the .out file match anyway.

$XFS_IO_PROG -f -c "pwrite 0 128m" $SUBVOL/file_$j >/dev/null


> + done
> + rm -f $SUBVOL/file*
> +done
> +
> +for j in $(seq 1 8); do
> + $XFS_IO_PROG -f -c "pwrite 0 128m" $SUBVOL/file_$j | _filter_xfs_io

pwrite error is expected here, and newer xfs_io only prints "pwrite" not
"pwrite64", so _filter_xfs_io_error should be used to filter the error
message from stderr (not stdout).

$XFS_IO_PROG -f -c "pwrite 0 128m" $SUBVOL/file_$j 2>&1 | 
_filter_xfs_io_error

> +done

Better to add comments about these two for loops, what is the purpose of
each loop etc.

Thanks,
Eryu
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 1/2] btrfs-progs: dump-tree: Fix duplicated output when using -t option

2017-03-13 Thread Qu Wenruo
When using -t option to output trees not in root tree (chunk/root/log
root), then we output the tree twice.

Fix it

Signed-off-by: Qu Wenruo 
---
v2:
  None
---
 cmds-inspect-dump-tree.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/cmds-inspect-dump-tree.c b/cmds-inspect-dump-tree.c
index 6c6c0e5c..2c6bec7f 100644
--- a/cmds-inspect-dump-tree.c
+++ b/cmds-inspect-dump-tree.c
@@ -371,21 +371,21 @@ again:
if (tree_id && tree_id == BTRFS_ROOT_TREE_OBJECTID) {
if (!info->tree_root->node) {
error("cannot print root tree, invalid pointer");
-   goto no_node;
+   goto close_root;
}
printf("root tree\n");
btrfs_print_tree(info->tree_root, info->tree_root->node, 1);
-   goto no_node;
+   goto close_root;
}
 
if (tree_id && tree_id == BTRFS_CHUNK_TREE_OBJECTID) {
if (!info->chunk_root->node) {
error("cannot print chunk tree, invalid pointer");
-   goto no_node;
+   goto close_root;
}
printf("chunk tree\n");
btrfs_print_tree(info->chunk_root, info->chunk_root->node, 1);
-   goto no_node;
+   goto close_root;
}
 
key.offset = 0;
-- 
2.12.0



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 2/2] btrfs-progs: dump-tree: Also output log root tree

2017-03-13 Thread Qu Wenruo
In btrfs-dump-tree, we output any existing log tree, however we don't
output the log root tree, which records all root items for log trees.

This makes it confusing for any one who want to know where the log tree
comes from.

Signed-off-by: Qu Wenruo 
---
v2:
  Check if log_root_tree exists before printing it in root_only mode
---
 cmds-inspect-dump-tree.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/cmds-inspect-dump-tree.c b/cmds-inspect-dump-tree.c
index 2c6bec7f..5a5ca536 100644
--- a/cmds-inspect-dump-tree.c
+++ b/cmds-inspect-dump-tree.c
@@ -344,6 +344,11 @@ int cmd_inspect_dump_tree(int argc, char **argv)
printf("chunk tree: %llu level %d\n",
 (unsigned long long)info->chunk_root->node->start,
 btrfs_header_level(info->chunk_root->node));
+   if (info->log_root_tree)
+   printf("log root tree: %llu level %d\n",
+  info->log_root_tree->node->start,
+   btrfs_header_level(
+   info->log_root_tree->node));
} else {
if (info->tree_root->node) {
printf("root tree\n");
@@ -356,6 +361,12 @@ int cmd_inspect_dump_tree(int argc, char **argv)
btrfs_print_tree(info->chunk_root,
 info->chunk_root->node, 1);
}
+
+   if (info->log_root_tree) {
+   printf("log root tree\n");
+   btrfs_print_tree(info->log_root_tree,
+   info->log_root_tree->node, 1);
+   }
}
}
tree_root_scan = info->tree_root;
@@ -388,6 +399,17 @@ again:
goto close_root;
}
 
+   if (tree_id && tree_id == BTRFS_TREE_LOG_OBJECTID) {
+   if (!info->log_root_tree) {
+   error("cannot print log root tree, invalid pointer");
+   goto close_root;
+   }
+   printf("log root tree\n");
+   btrfs_print_tree(info->log_root_tree, info->log_root_tree->node,
+1);
+   goto close_root;
+   }
+
key.offset = 0;
key.objectid = 0;
key.type = BTRFS_ROOT_ITEM_KEY;
-- 
2.12.0



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] btrfs-progs: dump-tree: Fix duplicated output when using -t option

2017-03-13 Thread Qu Wenruo
When using -t option to output trees not in root tree (chunk/root/log
root), then we output the tree twice.

Fix it

Signed-off-by: Qu Wenruo 
---
 cmds-inspect-dump-tree.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/cmds-inspect-dump-tree.c b/cmds-inspect-dump-tree.c
index 6c6c0e5c..2c6bec7f 100644
--- a/cmds-inspect-dump-tree.c
+++ b/cmds-inspect-dump-tree.c
@@ -371,21 +371,21 @@ again:
if (tree_id && tree_id == BTRFS_ROOT_TREE_OBJECTID) {
if (!info->tree_root->node) {
error("cannot print root tree, invalid pointer");
-   goto no_node;
+   goto close_root;
}
printf("root tree\n");
btrfs_print_tree(info->tree_root, info->tree_root->node, 1);
-   goto no_node;
+   goto close_root;
}
 
if (tree_id && tree_id == BTRFS_CHUNK_TREE_OBJECTID) {
if (!info->chunk_root->node) {
error("cannot print chunk tree, invalid pointer");
-   goto no_node;
+   goto close_root;
}
printf("chunk tree\n");
btrfs_print_tree(info->chunk_root, info->chunk_root->node, 1);
-   goto no_node;
+   goto close_root;
}
 
key.offset = 0;
-- 
2.12.0



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] btrfs-progs: dump-tree: Also output log root tree

2017-03-13 Thread Qu Wenruo
In btrfs-dump-tree, we output any existing log tree, however we don't
output the log root tree, which records all root items for log trees.

This makes it confusing for any one who want to know where the log tree
comes from.

Signed-off-by: Qu Wenruo 
---
 cmds-inspect-dump-tree.c | 20 
 1 file changed, 20 insertions(+)

diff --git a/cmds-inspect-dump-tree.c b/cmds-inspect-dump-tree.c
index 2c6bec7f..eca91b5e 100644
--- a/cmds-inspect-dump-tree.c
+++ b/cmds-inspect-dump-tree.c
@@ -344,6 +344,9 @@ int cmd_inspect_dump_tree(int argc, char **argv)
printf("chunk tree: %llu level %d\n",
 (unsigned long long)info->chunk_root->node->start,
 btrfs_header_level(info->chunk_root->node));
+   printf("log root tree: %llu level %d\n",
+(unsigned long 
long)info->log_root_tree->node->start,
+btrfs_header_level(info->log_root_tree->node));
} else {
if (info->tree_root->node) {
printf("root tree\n");
@@ -356,6 +359,12 @@ int cmd_inspect_dump_tree(int argc, char **argv)
btrfs_print_tree(info->chunk_root,
 info->chunk_root->node, 1);
}
+
+   if (info->log_root_tree) {
+   printf("log root tree\n");
+   btrfs_print_tree(info->log_root_tree,
+   info->log_root_tree->node, 1);
+   }
}
}
tree_root_scan = info->tree_root;
@@ -388,6 +397,17 @@ again:
goto close_root;
}
 
+   if (tree_id && tree_id == BTRFS_TREE_LOG_OBJECTID) {
+   if (!info->log_root_tree) {
+   error("cannot print log root tree, invalid pointer");
+   goto close_root;
+   }
+   printf("log root tree\n");
+   btrfs_print_tree(info->log_root_tree, info->log_root_tree->node,
+1);
+   goto close_root;
+   }
+
key.offset = 0;
key.objectid = 0;
key.type = BTRFS_ROOT_ITEM_KEY;
-- 
2.12.0



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] btrfs: cleanup barrier_all_devices() to check dev stat flush error

2017-03-13 Thread Anand Jain





+struct device_checkpoint {
+struct list_head list;
+struct btrfs_device *device;
+int stat_value_checkpoint;
+};
+
+static int add_device_checkpoint(struct list_head *checkpoint,


Could we have another structure instead of list_head to record device
checkpoints?
The list_head is never a meaningful structure under most cases.


 I didn't understand this, there is device_checkpoint and the context
 of struct list_head *checkpoint would start and end within
 barrier_all_devices().


I just mean to encapsulate the list_head into another structure, e.g,
call it device_checkpoints or something else.

Just a list_head is not quite meaningful.


 OK. Will do.


+static int check_stat_flush(struct btrfs_device *dev,
+struct list_head *checkpoint)
+{
+int val;
+struct device_checkpoint *cdev;
+
+list_for_each_entry(cdev, checkpoint, list) {
+if (cdev->device == dev) {
+val = btrfs_dev_stat_read(dev,
+BTRFS_DEV_STAT_FLUSH_ERRS);
+if (cdev->stat_value_checkpoint != val)
+return 1;


This check implies that BTRFS_DEV_STAT_FLUSH_ERRS will only be modified
by checkpoint related code.
Or any other modifier can easily break the check, causing false alert.


 I have already checked it, its not possible with the current code,
 or do you see if that is possible with the current code ? or Did I
 miss something ?


You didn't miss anything.

However I just got the same comment on better not to play around
something inside btrfs_device.

So I'm not sure if it's good to do it this time just for cleaning up
barrier_all_devices().



 Right. The reason I said that before was first of all the concept
 of err_send and err_wait wasn't needed as shown in the patch set
 1 to 3 in this patch-set. We have the dev_stat flush which keeps
 track of the flush error in the right way.
 Next to monitor the change in the flush error instead of checkpoint
 probably dev_stat_ccnt is the correct way (which in the long term
 we would need that kind on the per error-stat for rest of the
 error-stat including flush). But unless we have all the requirements
 well understood from the other pending stuff like device disappear
 and reappear I won't change the struct btrfs_devices as of now.

 Hope this clarifies better. Will send out the patch with the review
 comment incorporated. Thanks. On top of which per chunk missing
 device check patch can be added.

Thanks, Anand
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error

2017-03-13 Thread Qu Wenruo



At 03/13/2017 09:26 PM, Stefan Priebe - Profihost AG wrote:


Am 13.03.2017 um 08:39 schrieb Qu Wenruo:



At 03/13/2017 03:26 PM, Stefan Priebe - Profihost AG wrote:

Hi Qu,

Am 13.03.2017 um 02:16 schrieb Qu Wenruo:

But wasn't this part of the code identical in V5? Why does it only
happen with V7?


There are still difference, but just as you said, the related
part(checking if inode is free space cache inode) is identical across v5
and v7.


But if i boot v7 it always happens. If i boot v5 it always works. Have
done 5 repeatet tests.


I rechecked the code change between v7 and v5.

It turns out that, the code base may cause the problem.

In v7, the base is v4.11-rc1, which introduced quite a lot of 
btrfs_inode cleanup.


One of the difference is the parameter for btrfs_is_free_space_inode().

In v7, the parameter @inode changed from struct inode to struct btrfs_inode.

So in v7, we're passing BTRFS_I(inode) to btrfs_is_free_space_inode(), 
other than plain inode.


That's the most possible cause for me here.

So would you please paste the final patch applied to your tree?
Git diff or git format-patch can both handle it.

Thanks,
Qu




I'm afraid that's a rare race leading to NULL btrfs_inode->root, which
could happen in both v5 and v7.

What's the difference between SUSE and mainline kernel?


A lot ;-) But i don't think anything related.


Maybe some mainline kernel commits have already fixed it?


May be no idea. But i haven't found any reason why v5 works.

Stefan



Thanks,
Qu





--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error

2017-03-13 Thread Qu Wenruo



At 03/13/2017 09:26 PM, Stefan Priebe - Profihost AG wrote:


Am 13.03.2017 um 08:39 schrieb Qu Wenruo:



At 03/13/2017 03:26 PM, Stefan Priebe - Profihost AG wrote:

Hi Qu,

Am 13.03.2017 um 02:16 schrieb Qu Wenruo:


At 03/13/2017 04:49 AM, Stefan Priebe - Profihost AG wrote:

Hi Qu,

while V5 was running fine against the openSUSE-42.2 kernel (based on
v4.4).


Thanks for the test.


V7 results in OOPS to me:
BUG: unable to handle kernel NULL pointer dereference at
01f0


This 0x1f0 is the same as offsetof(struct brrfs_root, fs_info), quite
nice clue.


IP: [] __endio_write_update_ordered+0x33/0x140
[btrfs]


IP points to:
---
static inline bool btrfs_is_free_space_inode(struct btrfs_inode *inode)
{
struct btrfs_root *root = inode->root; << Either here

if (root == root->fs_info->tree_root && << Or here
btrfs_ino(inode) != BTRFS_BTREE_INODE_OBJECTID)

---

Taking the above offset into consideration, it's only possible for later
case.

So here, we have a btrfs_inode whose @root is NULL.


But wasn't this part of the code identical in V5? Why does it only
happen with V7?


There are still difference, but just as you said, the related
part(checking if inode is free space cache inode) is identical across v5
and v7.


But if i boot v7 it always happens. If i boot v5 it always works. Have
done 5 repeatet tests.


That's a problem now, I'll dig it further to find out the difference.

Thanks,
Qu




I'm afraid that's a rare race leading to NULL btrfs_inode->root, which
could happen in both v5 and v7.

What's the difference between SUSE and mainline kernel?


A lot ;-) But i don't think anything related.


Maybe some mainline kernel commits have already fixed it?


May be no idea. But i haven't found any reason why v5 works.

Stefan



Thanks,
Qu



This can be fixed easily by checking @root inside
btrfs_is_free_space_inode(), as the backtrace shows that it's only
happening for DirectIO, and it won't happen for free space cache inode.

But I'm more curious how this happened for a more accurate fix, or we
could have other NULL pointer access.

Did you have any reproducer for this?


Sorry no - this is a production MariaDB Server running btrfs with
compress-force=zlib. But if i could test anything i'll do.

Greets,
Stefan



Thanks,
Qu


PGD 14e18d4067 PUD 14e1868067 PMD 0
Oops:  [#1] SMP
Modules linked in: netconsole xt_multiport ipt_REJECT nf_reject_ipv4
xt_set iptable_filter ip_tables x_tables ip_set_hash_net ip_set
nfnetlink crc32_pclmul button loop btrfs xor usbhid raid6_pq
ata_generic
virtio_blk virtio_net uhci_hcd ehci_hcd i2c_piix4 usbcore virtio_pci
i2c_core usb_common ata_piix floppy
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.52+112-ph #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.7.5-20140722_172050-sagunt 04/01/2014
task: b4e0f500 ti: b4e0 task.ti: b4e0
RIP: 0010:[] []
__endio_write_update_ordered+0x33/0x140 [btrfs]
RSP: 0018:8814eae03cd8 EFLAGS: 00010086
RAX:  RBX: 8814e8fd5aa8 RCX: 0001
RDX: 0010 RSI: 0010 RDI: 8814e45885c0
RBP: 8814eae03d10 R08: 8814e8334000 R09: 00018040003a
R10: ea00507d8d00 R11: 88141f634080 R12: 8814e45885c0
R13: 8814e125d700 R14: 0010 R15: 8800376c6a80
FS: () GS:8814eae0()
knlGS:
CS: 0010 DS:  ES:  CR0: 80050033
CR2: 01f0 CR3: 0014e34c9000 CR4: 001406f0Stack:
 0010 8814e8fd5aa8 8814e953f3c0
8814e125d700 0010 8800376c6a80 8814eae03d38
c03ddf67 8814e86b6a80 8814e8fd5aa8 0001
Call Trace:
[] btrfs_endio_direct_write+0x37/0x60 [btrfs]
[] bio_endio+0x57/0x60
[] btrfs_end_bio+0xa1/0x140 [btrfs]
[] bio_endio+0x57/0x60
[] blk_update_request+0x8b/0x330
[] blk_mq_end_request+0x1a/0x70
[] virtblk_request_done+0x3f/0x70 [virtio_blk]
[] __blk_mq_complete_request+0x78/0xe0
[] blk_mq_complete_request+0x1c/0x20
[] virtblk_done+0x64/0xe0 [virtio_blk]
[] vring_interrupt+0x3a/0x90
[] __handle_irq_event_percpu+0x89/0x1b0
[] handle_irq_event_percpu+0x23/0x60
[] handle_irq_event+0x3b/0x60
[] handle_edge_irq+0x6f/0x150
[] handle_irq+0x1d/0x30
[] do_IRQ+0x4b/0xd0
[] common_interrupt+0x8c/0x8c
DWARF2 unwinder stuck at ret_from_intr+0x0/0x1b
Leftover inexact backtrace:
2017-03-12 20:33:08 
2017-03-12 20:33:08  [] ?
native_safe_halt+0x6/0x10
[] default_idle+0x1e/0xe0
[] arch_cpu_idle+0xf/0x20
[] default_idle_call+0x3b/0x40
[] cpu_startup_entry+0x29a/0x370
[] rest_init+0x7c/0x80
[] start_kernel+0x490/0x49d
[] ? early_idt_handler_array+0x120/0x120
[] x86_64_start_reservations+0x2a/0x2c
[] x86_64_start_kernel+0x13b/0x14a
Code: e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 83 ec 10 48 8b 87 70 fc
ff ff 4c 8b 87 38 fe ff ff 48 c7 45 c8 00 00 00 00 48 89 75 d0 <48> 8b
b8 f0 01 00 00 48 3b 47 28 49 8b 84 24 78 fc ff ff 0f 84
RIP 

Re: [PATCH 4/4] btrfs: cleanup barrier_all_devices() to check dev stat flush error

2017-03-13 Thread Qu Wenruo



At 03/14/2017 12:21 AM, Anand Jain wrote:



Thanks for the review..


On 03/13/2017 05:05 PM, Qu Wenruo wrote:



At 03/13/2017 03:42 PM, Anand Jain wrote:

The objective of this patch is to cleanup barrier_all_devices()
so that the error checking is in a separate loop independent of
of the loop which submits and waits on the device flush requests.


The idea itself is quite good, and we do need it.


 Thanks.



By doing this it helps to further develop patches which would tune
the error-actions as needed.

Here functions such as btrfs_dev_stats_dirty() couldn't be used
because it doesn't monitor the flush errors BTRFS_DEV_STAT_FLUSH_ERRS.

Signed-off-by: Anand Jain 
---
 fs/btrfs/disk-io.c | 96
+++---
 1 file changed, 85 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 5719e036048b..12531a5b14ff 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3566,6 +3566,76 @@ static int write_dev_flush(struct btrfs_device
*device, int wait)
 return 0;
 }

+struct device_checkpoint {
+struct list_head list;
+struct btrfs_device *device;
+int stat_value_checkpoint;
+};
+
+static int add_device_checkpoint(struct list_head *checkpoint,


Could we have another structure instead of list_head to record device
checkpoints?
The list_head is never a meaningful structure under most cases.


 I didn't understand this, there is device_checkpoint and the context
 of struct list_head *checkpoint would start and end within
 barrier_all_devices().


I just mean to encapsulate the list_head into another structure, e.g, 
call it device_checkpoints or something else.


Just a list_head is not quite meaningful.






+struct btrfs_device *device)
+{
+struct device_checkpoint *cdev =
+kzalloc(sizeof(struct device_checkpoint), GFP_KERNEL);
+if (!cdev)
+return -ENOMEM;


This means that, we must check return value of add_device_checkpoint(),
while later code doesn't check it at all.


 oh. I will correct this.




+
+list_add(>list, checkpoint);


And I prefer to do extra check, in case such device is already inserted
once.


 Hmm with the current code its not at all possible, but let me add it.


+
+cdev->device = device;
+cdev->stat_value_checkpoint =
+btrfs_dev_stat_read(device, BTRFS_DEV_STAT_FLUSH_ERRS);
+
+return 0;
+}
+
+static void fini_devices_checkpoint(struct list_head *checkpoint)


Never a fan of the "fini_" naming.
What about "release_"?


 will change it.


+{
+struct device_checkpoint *cdev;
+
+while(!list_empty(checkpoint)) {
+cdev = list_entry(checkpoint->next,
+struct device_checkpoint, list);
+list_del(>list);
+kfree(cdev);
+}
+}
+
+static int check_stat_flush(struct btrfs_device *dev,
+struct list_head *checkpoint)
+{
+int val;
+struct device_checkpoint *cdev;
+
+list_for_each_entry(cdev, checkpoint, list) {
+if (cdev->device == dev) {
+val = btrfs_dev_stat_read(dev,
+BTRFS_DEV_STAT_FLUSH_ERRS);
+if (cdev->stat_value_checkpoint != val)
+return 1;


This check implies that BTRFS_DEV_STAT_FLUSH_ERRS will only be modified
by checkpoint related code.
Or any other modifier can easily break the check, causing false alert.


 I have already checked it, its not possible with the current code,
 or do you see if that is possible with the current code ? or Did I
 miss something ?


You didn't miss anything.

However I just got the same comment on better not to play around 
something inside btrfs_device.


So I'm not sure if it's good to do it this time just for cleaning up 
barrier_all_devices().


Thanks,
Qu




IIRC that's the reason why I update my previous degraded patch.


Personally speaking, I prefer the patchset to take more usage of the
checkpoint system, or it's a little overkilled for current usage.


 Just want to make sure things are done in the right way.


Thanks, Anand



Thanks,
Qu


+}
+}
+return 0;
+}
+
+static int check_barrier_error(struct btrfs_fs_devices *fsdevs,
+struct list_head *checkpoint)
+{
+int dropouts = 0;
+struct btrfs_device *dev;
+
+list_for_each_entry_rcu(dev, >devices, dev_list) {
+if (!dev->bdev || check_stat_flush(dev, checkpoint))
+dropouts++;
+}
+
+if (dropouts >
+fsdevs->fs_info->num_tolerated_disk_barrier_failures)
+return -EIO;
+
+return 0;
+}
+
 /*
  * send an empty flush down to each device in parallel,
  * then wait for them
@@ -3574,8 +3644,10 @@ static int barrier_all_devices(struct
btrfs_fs_info *info)
 {
 struct list_head *head;
 struct btrfs_device *dev;
-int dropouts = 0;
 int ret;
+struct list_head checkpoint;
+
+INIT_LIST_HEAD();

 /* send down all the barriers */
 head = >fs_devices->devices;
@@ -3587,29 

Re: Hard crash on 4.9.5

2017-03-13 Thread Omar Sandoval
On Mon, Mar 13, 2017 at 10:58:29PM +0100, Kai Krakow wrote:
> Am Sat, 28 Jan 2017 15:50:38 -0500
> schrieb Matt McKinnon :
> 
> > This same file system (which crashed again with the same errors) is
> > also giving this output during a metadata or data balance:
> 
> This looks somewhat familiar to the err=-17 that I am experiencing when
> using VirtualBox image on btrfs in CoW mode (compress=lzo).
> 
> During IO intensive workloads, it results in "object already exists,
> err -17" (or similar, someone else also experienced it through another
> workload). The resulting btrfs check show the same errors, giving
> inodes without csum.
> 
> Trying to continue using this file system in successive boots usually
> results in boot freezes or complete unmountable filesystem, broken
> beyond repair.
> 
> I'm feeling that using the bfq elevator usually enables me to trigger
> this bug also without using VirtualBox, i.e. during normal system
> usage, and mostly during boot when IO load is very high. So I also
> stopped using bfq although it was giving me a much superior
> interactivity.
> 
> Marking vbox images nocow and using standard elevators (cfq, deadline)
> exposes no such problems so far - even during excessive IO loads.
> 
> EOM

This sounds similar to a bug I fixed here:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8e2bd3b7fac91b79a6115fd1511ca20b2a09696d

That change is in v4.10. If you're not already running a kernel version
with that fix, could you check if that solves it?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Hard crash on 4.9.5

2017-03-13 Thread Kai Krakow
Am Sat, 28 Jan 2017 15:50:38 -0500
schrieb Matt McKinnon :

> This same file system (which crashed again with the same errors) is
> also giving this output during a metadata or data balance:

This looks somewhat familiar to the err=-17 that I am experiencing when
using VirtualBox image on btrfs in CoW mode (compress=lzo).

During IO intensive workloads, it results in "object already exists,
err -17" (or similar, someone else also experienced it through another
workload). The resulting btrfs check show the same errors, giving
inodes without csum.

Trying to continue using this file system in successive boots usually
results in boot freezes or complete unmountable filesystem, broken
beyond repair.

I'm feeling that using the bfq elevator usually enables me to trigger
this bug also without using VirtualBox, i.e. during normal system
usage, and mostly during boot when IO load is very high. So I also
stopped using bfq although it was giving me a much superior
interactivity.

Marking vbox images nocow and using standard elevators (cfq, deadline)
exposes no such problems so far - even during excessive IO loads.

EOM

> Jan 27 19:42:47 my_machine kernel: [  335.018123] BTRFS info (device 
> sda1): no csum found for inode 28472371 start 2191360
> Jan 27 19:42:47 my_machine kernel: [  335.018128] BTRFS info (device 
> sda1): no csum found for inode 28472371 start 2195456
> Jan 27 19:42:47 my_machine kernel: [  335.018491] BTRFS info (device 
> sda1): no csum found for inode 28472371 start 4018176
> Jan 27 19:42:47 my_machine kernel: [  335.018496] BTRFS info (device 
> sda1): no csum found for inode 28472371 start 4022272
> Jan 27 19:42:47 my_machine kernel: [  335.018499] BTRFS info (device 
> sda1): no csum found for inode 28472371 start 4026368
> Jan 27 19:42:47 my_machine kernel: [  335.018502] BTRFS info (device 
> sda1): no csum found for inode 28472371 start 4030464
> Jan 27 19:42:47 my_machine kernel: [  335.019443] BTRFS info (device 
> sda1): no csum found for inode 28472371 start 6156288
> Jan 27 19:42:47 my_machine kernel: [  335.019688] BTRFS info (device 
> sda1): no csum found for inode 28472371 start 7933952
> Jan 27 19:42:47 my_machine kernel: [  335.019693] BTRFS info (device 
> sda1): no csum found for inode 28472371 start 7938048
> Jan 27 19:42:47 my_machine kernel: [  335.019754] BTRFS info (device 
> sda1): no csum found for inode 28472371 start 8077312
> Jan 27 19:42:47 my_machine kernel: [  335.025485] BTRFS warning
> (device sda1): csum failed ino 28472371 off 2191360 csum 4031061501
> expected csum 0 Jan 27 19:42:47 my_machine kernel: [  335.025490]
> BTRFS warning (device sda1): csum failed ino 28472371 off 2195456
> csum 2371784003 expected csum 0 Jan 27 19:42:47 my_machine kernel:
> [  335.025526] BTRFS warning (device sda1): csum failed ino 28472371
> off 4018176 csum 3812080098 expected csum 0 Jan 27 19:42:47
> my_machine kernel: [  335.025531] BTRFS warning (device sda1): csum
> failed ino 28472371 off 4022272 csum 2776681411 expected csum 0 Jan
> 27 19:42:47 my_machine kernel: [  335.025534] BTRFS warning (device
> sda1): csum failed ino 28472371 off 4026368 csum 1179241675 expected
> csum 0 Jan 27 19:42:47 my_machine kernel: [  335.025540] BTRFS
> warning (device sda1): csum failed ino 28472371 off 4030464 csum
> 1256914217 expected csum 0 Jan 27 19:42:47 my_machine kernel:
> [  335.026142] BTRFS warning (device sda1): csum failed ino 28472371
> off 7933952 csum 2695958066 expected csum 0 Jan 27 19:42:47
> my_machine kernel: [  335.026147] BTRFS warning (device sda1): csum
> failed ino 28472371 off 7938048 csum 3260800596 expected csum 0 Jan
> 27 19:42:47 my_machine kernel: [  335.026934] BTRFS warning (device
> sda1): csum failed ino 28472371 off 6156288 csum 4293116449 expected
> csum 0 Jan 27 19:42:47 my_machine kernel: [  335.033249] BTRFS
> warning (device sda1): csum failed ino 28472371 off 8077312 csum
> 4031878292 expected csum 0
> 
> Can these be ignored?
> 
> 
> On 01/25/2017 04:06 PM, Liu Bo wrote:
> > On Mon, Jan 23, 2017 at 03:03:55PM -0500, Matt McKinnon wrote:  
> >> Wondering what to do about this error which says 'reboot needed'.
> >> Has happened a three times in the past week:
> >>  
> >
> > Well, I don't think btrfs's logic here is wrong, the following stack
> > shows that a nfs client has sent a second unlink against the same
> > inode while somehow the inode was not fully deleted by the first
> > unlink.
> >
> > So it'd be good that you could add some debugging information to
> > get us further.
> >
> > Thanks,
> >
> > -liubo
> >  
> >> Jan 23 14:16:17 my_machine kernel: [ 2568.595648] BTRFS error
> >> (device sda1): err add delayed dir index item(index: 23810) into
> >> the deletion tree of the delayed node(root id: 257, inode id:
> >> 2661433, errno: -17) Jan 23 14:16:17 my_machine kernel:
> >> [ 2568.611010] [ cut here ]
> >> Jan 23 14:16:17 my_machine kernel: [ 2568.615628] kernel BUG at
> >> 

[PATCH 2/2] btrfs: replace hardcoded value with SEQ_NONE macro

2017-03-13 Thread ednadolski
From: Edmund Nadolski 

Define the SEQ_NONE macro to replace (u64)-1 in places where said
value triggers a special-case ref search behavior.

Signed-off-by: Edmund Nadolski 
Reviewed-by: Jeff Mahoney 
---
 fs/btrfs/backref.c | 16 
 fs/btrfs/backref.h |  2 ++
 fs/btrfs/qgroup.c  |  4 ++--
 3 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index e794b6e..297de5b 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -538,7 +538,7 @@ static int add_all_parents(struct btrfs_root *root, struct 
btrfs_path *path,
 * slot==nritems. In that case, go to the next leaf before we continue.
 */
if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) {
-   if (time_seq == (u64)-1)
+   if (time_seq == SEQ_NONE)
ret = btrfs_next_leaf(root, path);
else
ret = btrfs_next_old_leaf(root, path, time_seq);
@@ -582,7 +582,7 @@ static int add_all_parents(struct btrfs_root *root, struct 
btrfs_path *path,
eie = NULL;
}
 next:
-   if (time_seq == (u64)-1)
+   if (time_seq == SEQ_NONE)
ret = btrfs_next_item(root, path);
else
ret = btrfs_next_old_item(root, path, time_seq);
@@ -634,7 +634,7 @@ static int __resolve_indirect_ref(struct btrfs_fs_info 
*fs_info,
 
if (path->search_commit_root)
root_level = btrfs_header_level(root->commit_root);
-   else if (time_seq == (u64)-1)
+   else if (time_seq == SEQ_NONE)
root_level = btrfs_header_level(root->node);
else
root_level = btrfs_old_root_level(root, time_seq);
@@ -645,7 +645,7 @@ static int __resolve_indirect_ref(struct btrfs_fs_info 
*fs_info,
}
 
path->lowest_level = level;
-   if (time_seq == (u64)-1)
+   if (time_seq == SEQ_NONE)
ret = btrfs_search_slot(NULL, root, >key_for_search, path,
0, 0);
else
@@ -1199,7 +1199,7 @@ static int __add_keyed_refs(struct btrfs_fs_info *fs_info,
  *
  * NOTE: This can return values > 0
  *
- * If time_seq is set to (u64)-1, it will not search delayed_refs, and behave
+ * If time_seq is set to SEQ_NONE, it will not search delayed_refs, and behave
  * much like trans == NULL case, the difference only lies in it will not
  * commit root.
  * The special case is for qgroup to search roots in commit_transaction().
@@ -1246,7 +1246,7 @@ static int find_parent_nodes(struct btrfs_trans_handle 
*trans,
path->skip_locking = 1;
}
 
-   if (time_seq == (u64)-1)
+   if (time_seq == SEQ_NONE)
path->skip_locking = 1;
 
/*
@@ -1276,9 +1276,9 @@ static int find_parent_nodes(struct btrfs_trans_handle 
*trans,
 
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
if (trans && likely(trans->type != __TRANS_DUMMY) &&
-   time_seq != (u64)-1) {
+   time_seq != SEQ_NONE) {
 #else
-   if (trans && time_seq != (u64)-1) {
+   if (trans && time_seq != SEQ_NONE) {
 #endif
/*
 * look if there are updates for this ref queued and lock the
diff --git a/fs/btrfs/backref.h b/fs/btrfs/backref.h
index 9c41fba..20915a6 100644
--- a/fs/btrfs/backref.h
+++ b/fs/btrfs/backref.h
@@ -23,6 +23,8 @@
 #include "ulist.h"
 #include "extent_io.h"
 
+#define SEQ_NONE   ((u64)-1)
+
 struct inode_fs_paths {
struct btrfs_path   *btrfs_path;
struct btrfs_root   *fs_root;
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index bb7e42f..76d84af 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2077,12 +2077,12 @@ int btrfs_qgroup_account_extents(struct 
btrfs_trans_handle *trans,
goto cleanup;
}
/*
-* Use (u64)-1 as time_seq to do special search, which
+* Use SEQ_NONE as time_seq to do special search, which
 * doesn't lock tree or delayed_refs and search current
 * root. It's safe inside commit_transaction().
 */
ret = btrfs_find_all_roots(trans, fs_info,
-   record->bytenr, (u64)-1, _roots);
+   record->bytenr, SEQ_NONE, _roots);
if (ret < 0)
goto cleanup;
if (qgroup_to_skip) {
-- 
2.10.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] btrfs: provide enumeration for __merge_refs mode argument

2017-03-13 Thread ednadolski
From: Edmund Nadolski 

Replace hardcoded numeric values for __merge_refs 'mode' argument
with descriptive constants.

Signed-off-by: Edmund Nadolski 
Reviewed-by: Jeff Mahoney 
---
 fs/btrfs/backref.c | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 7699e16..e794b6e 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -26,6 +26,11 @@
 #include "delayed-ref.h"
 #include "locking.h"
 
+enum merge_mode {
+   MERGE_IDENTICAL_KEYS = 1,
+   MERGE_IDENTICAL_PARENTS,
+};
+
 /* Just an arbitrary number so we can be sure this happened */
 #define BACKREF_FOUND_SHARED 6
 
@@ -809,14 +814,12 @@ static int __add_missing_keys(struct btrfs_fs_info 
*fs_info,
 /*
  * merge backrefs and adjust counts accordingly
  *
- * mode = 1: merge identical keys, if key is set
  *FIXME: if we add more keys in __add_prelim_ref, we can merge more here.
  *   additionally, we could even add a key range for the blocks we
  *   looked into to merge even more (-> replace unresolved refs by 
those
  *   having a parent).
- * mode = 2: merge identical parents
  */
-static void __merge_refs(struct list_head *head, int mode)
+static void __merge_refs(struct list_head *head, enum merge_mode mode)
 {
struct __prelim_ref *pos1;
 
@@ -829,7 +832,7 @@ static void __merge_refs(struct list_head *head, int mode)
 
if (!ref_for_same_block(ref1, ref2))
continue;
-   if (mode == 1) {
+   if (mode == MERGE_IDENTICAL_KEYS) {
if (!ref1->parent && ref2->parent)
swap(ref1, ref2);
} else {
@@ -1374,7 +1377,7 @@ static int find_parent_nodes(struct btrfs_trans_handle 
*trans,
if (ret)
goto out;
 
-   __merge_refs(, 1);
+   __merge_refs(, MERGE_IDENTICAL_KEYS);
 
ret = __resolve_indirect_refs(fs_info, path, time_seq, ,
  extent_item_pos, total_refs,
@@ -1382,7 +1385,7 @@ static int find_parent_nodes(struct btrfs_trans_handle 
*trans,
if (ret)
goto out;
 
-   __merge_refs(, 2);
+   __merge_refs(, MERGE_IDENTICAL_PARENTS);
 
while (!list_empty()) {
ref = list_first_entry(, struct __prelim_ref, list);
-- 
2.10.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/2] Cleanup for some hardcoded constants

2017-03-13 Thread ednadolski
From: Edmund Nadolski 

This series replaces several hard-coded values with descriptive
symbols.

Edmund Nadolski (2):
  btrfs: provide enumeration for __merge_refs mode argument
  btrfs: replace hardcoded value with SEQ_NONE macro

 fs/btrfs/backref.c | 31 +--
 fs/btrfs/backref.h |  2 ++
 fs/btrfs/qgroup.c  |  4 ++--
 3 files changed, 21 insertions(+), 16 deletions(-)

-- 
2.10.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Home storage with btrfs

2017-03-13 Thread waxhead

Same here, Have been using BTRFS for a 'scratch' disk since about 2014.
The disk have had quite some abuse and no issues yet.
I don't use compression, snapshots or any fancy features.
I have recently moved all of the root filesystem to BTRFS with 5x SSD 
disks set up in RAID1 and everything is (still) working fine, and I have 
been shuffling large amounts of data on this volume. I bet the SSD's 
will break before BTRFS does, so the real test is yet to come I guess...
I am on Debian GNU/Linux with kernel 4.9.0-2-amd64 (Debian 4.9.13-1) - 
btrfs-progs 4.7.3


However, keep in mind that backups is winning the fight against binary 
related traumas :)


Peter Becker wrote:

I can confirm this. I have also no generell issues since the past 2
years with BTRFS in RAID1 and 6 Disks with different sizes and also no
issues with the DUP profile on a single disk.
Only some performance issues with deduplication and very large files.
But i also recommand to use a newer kernel (4.4 or higher) or better
the newest and build a newer version of btrfs progs form source.
I use Ubuntu 16.04 and kernel 4.9 + btrfs progs 4.9 currently.

2017-03-13 13:02 GMT+01:00 Austin S. Hemmelgarn :

On 2017-03-13 07:52, Juan Orti Alcaine wrote:

2017-03-13 12:29 GMT+01:00 Hérikz Nawarro :

Hello everyone,

Today is safe to use btrfs for home storage? No raid, just secure
storage for some files and create snapshots from it.


In my humble opinion, yes. I'm running a RAID1 btrfs at home for 5
years and I feel the most serious bugs have been fixed, because in the
last two years I have not experienced any issue.

In general, I'd agree.  I've not seen any issues resulting from BTRFS itself
for the past 2.5 years (although it's helped me find quite a lot of marginal
or failing hardware over that time), but I've also not used many of the less
stable features (raid56, qgroups, and a handful of other things).

One piece of advice I will give though, try to keep the total number of
snapshots to a reasonably small three digit number (ideally less than 200,
absolutely less than 300), otherwise performance is going to be horrible.


Anyway, keeping your kernel and btrfs-progs updated is a must, and of
course, having good backups. I'm using Fedora and it's fine.

Also agreed, Fedora is one of the best options for a traditional distro
(they're very good about staying up to date and back-porting bug-fixes from
the upstream kernel).  The other two I'd recommend are Arch (they actually
use an almost upstream kernel and are generally the first distro to have new
versions of any arbitrary software) and Gentoo (similar to Arch, but more
maintenance intensive (although also more efficient (usually))).

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] qgroup: Change qgroup_meta_rsv to 64bit

2017-03-13 Thread David Sterba
On Sat, Mar 11, 2017 at 05:03:26PM -0600, Goldwyn Rodrigues wrote:
> - int reserved;
> + long reserved;

> - reserved = atomic_xchg(>qgroup_meta_rsv, 0);
> + reserved = atomic64_xchg(>qgroup_meta_rsv, 0);

atomic64_xchg returns 'long long' so u64 should be used.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: cli-tests: Convert non-raid filesystem to raid

2017-03-13 Thread David Sterba
On Sun, Jan 29, 2017 at 08:14:55PM +0530, Lakshmipathi.G wrote:
> Simple script to verify non-raid filesystem conversion.

The cli (command line interface) tests are supposed to cover the common
usecases from the point of option combinations etc, not really verifying
the result. It would be good if you add more than just one simple test,
especially for the balance filters. Adding the helpers to prepare the
filesystem (number of devices, raid profiles) would be a good start.
Some code for that is in the mkfs tests, but this belongs to the common
scripts so we could easily use that.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fresh Raid-1 setup, dump-tree shows invalid owner id

2017-03-13 Thread David Sterba
On Mon, Jan 30, 2017 at 06:07:01PM +0530, Lakshmipathi.G wrote:
> >
> > Yes, the owner is the number of the tree.
> >
> > DATA_RELOC_TREE is -9, but then unsigned 64 bits.
> >
>  -9 + 2**64
> > 18446744073709551607L
> >
> > So the result is a number that's close to the max or 64 bits.
> >
> > You can find those numbers in the kernel source in
> >   include/uapi/linux/btrfs_tree.h
> >
> > e.g.:
> >
> > #define BTRFS_DATA_RELOC_TREE_OBJECTID -9ULL
> >
> 
> Thanks for the details. This owner number looked different from other
> owner ids, so wanted to check on the same, now understood.

Recently, in the commit 6f6b643e44ef7, the known tree ids are printed as
the small negative numbers. We can also do that in progs, however we'd
need some tricks so we don't duplicate all format strings one with %llu
and one with %lld. Something like

format: "%s%llu"
values:
  id > LAST_FREE ? "-" : ""
  id > LAST_FREE ? 2^64 - id : id

and  hidden in helpers etc.

I don't think we should replace all references to tree ids with the
symbolic names as they're used in keys so searching for the numeric
value in the dump output would miss something or make the regexp more
complicated. I won't object against making the tree id output format
selectable by option though.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: misc-tests: Primary Superblock corruption and recovery using backup Superblock.

2017-03-13 Thread David Sterba
On Wed, Feb 15, 2017 at 02:44:05PM +0530, Lakshmipathi.G wrote:
> On Wed, Feb 15, 2017 at 09:36:03AM +0800, Qu Wenruo wrote:
> > 
> > 
> > >+  # Corrupt superblock checksum
> > >+dd if=/dev/zero of=$TEST_DEV seek=$superblock_offset bs=1 \
> > >+count=4  conv=notrunc &> /dev/null
> > >+  run_check_stdout $SUDO_HELPER mount $TEST_DEV $TEST_MNT | \
> > >+  grep -q 'wrong fs type'
> > 
> > What about using btrfs check instead of trying to mount it?
> > 
> > This could emit the need to use $SUDO_HELPER, and could catch super error
> > more accurate.
> > 
> > >+if [ $? -ne 0 ]; then
> > >+  _fail "Failed to corrupt superblock."
> > >+fi
> > >+
> > >+  # Copy backup superblock to primary
> > >+  run_check $TOP/btrfs-select-super -s 1 $TEST_DEV
> > >+  run_check $SUDO_HELPER mount $TEST_DEV $TEST_MNT
> > Same here.
> I started with 'btrfs check' and 'btrfs check --repair' but it seems like
> --repair don't fix the corruption. So just moved away from using it.

Check --repair will not fix the superblock checksum as it's not clear
whether the rest of the superblock is ok or not.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] btrfs-progs: misc-tests: Superblock corruption and recovery using backup.

2017-03-13 Thread David Sterba
On Fri, Feb 17, 2017 at 04:03:41AM +0530, Lakshmipathi.G wrote:
> On Thu, Feb 16, 2017 at 09:05:02AM +0800, Qu Wenruo wrote:
> > 
> > 
> > At 02/16/2017 04:50 AM, Lakshmipathi.G wrote:
> > >Signed-off-by: Lakshmipathi.G 
> > 
> > Looks good to me.
> > 
> > Reviewed by: Qu Wenruo 
> > 
> > BTW, it would be better to have some commit message.
> 
> ok, sent the v3 patch with commit message.

It's not necessary if the Subject says it all, so the changelog does not
simply repeat it.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] btrfs-progs: misc-tests: Superblock corruption and recovery using backup.

2017-03-13 Thread David Sterba
On Fri, Feb 17, 2017 at 04:01:59AM +0530, Lakshmipathi.G wrote:
> Test script to recover damaged primary superblock using backup superblock.
> 
> Signed-off-by: Lakshmipathi.G 

Thanks for the test, a few comments below.

> ---
>  .../019-fix-superblock-corruption/test.sh  | 36 
> ++
>  1 file changed, 36 insertions(+)
>  create mode 100755 tests/misc-tests/019-fix-superblock-corruption/test.sh
> 
> diff --git a/tests/misc-tests/019-fix-superblock-corruption/test.sh 
> b/tests/misc-tests/019-fix-superblock-corruption/test.sh
> new file mode 100755
> index 000..95815e4
> --- /dev/null
> +++ b/tests/misc-tests/019-fix-superblock-corruption/test.sh
> @@ -0,0 +1,36 @@
> +#!/bin/bash
> +#
> +# Corrupt primary superblock and restore it using backup superblock.
> +#
> +
> +source $TOP/tests/common
> +
> +check_prereq btrfs-select-super
> +check_prereq btrfs
> +
> +setup_root_helper
> +prepare_test_dev 512M
> +
> +FIRST_SUPERBLOCK_OFFSET=65536
> +
> +test_superblock_restore()
> +{
> + run_check $SUDO_HELPER $TOP/mkfs.btrfs -f $TEST_DEV
> +
> + # Corrupt superblock checksum
> +dd if=/dev/zero of=$TEST_DEV seek=$FIRST_SUPERBLOCK_OFFSET bs=1 \
> +count=4  conv=notrunc &> /dev/null
> + # Run btrfs check to detect corruption
> + $TOP/btrfs check $TEST_DEV >& /dev/null && \
> + _fail "btrfs check should detect corruption"

No run_check? In general all test output is desired and should be in the
test log so you can use 'run_mayfail'. Also, please keep the entier
output and don't redirect it to /dev/null .

> + # Copy backup superblock to primary
> + run_check $TOP/btrfs-select-super -s 1 $TEST_DEV
> +
> + echo "Performing btrfs check" &>> $RESULTS

Use _log for that

> + $TOP/btrfs check $TEST_DEV &>> $RESULTS

Use run_check

> +if [ $? -ne 0 ]; then
> + _fail "Failed to fix superblock."

And this would be unnecessary.

I've added a section to tests/README.md to describe the common coding
practices. Feel free to read/update/fix. Older tests could violate the
recommendations so at least new tests are in line and we'll update the
rest incrementally.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] btrfs-progs: dump-tree: Also output log root tree

2017-03-13 Thread David Sterba
On Fri, Mar 03, 2017 at 12:13:45PM +0800, Qu Wenruo wrote:
> In btrfs-dump-tree, we output any existing log tree, however we don't
> output the log root tree, which records all root items for log trees.
> 
> This makes it confusing for any one who want to know where the log tree
> comes from.
> 
> Signed-off-by: Qu Wenruo 

What's the base of this patch or is there a patch that I've missed to
apply? This fails on current devel, 4.10, 4.9.1 and I don't see patch
1/2 in my mailbox or on patchwork. Please refresh and resend.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] btrfs: cleanup barrier_all_devices() to check dev stat flush error

2017-03-13 Thread Anand Jain



Thanks for the review..


On 03/13/2017 05:05 PM, Qu Wenruo wrote:



At 03/13/2017 03:42 PM, Anand Jain wrote:

The objective of this patch is to cleanup barrier_all_devices()
so that the error checking is in a separate loop independent of
of the loop which submits and waits on the device flush requests.


The idea itself is quite good, and we do need it.


 Thanks.



By doing this it helps to further develop patches which would tune
the error-actions as needed.

Here functions such as btrfs_dev_stats_dirty() couldn't be used
because it doesn't monitor the flush errors BTRFS_DEV_STAT_FLUSH_ERRS.

Signed-off-by: Anand Jain 
---
 fs/btrfs/disk-io.c | 96
+++---
 1 file changed, 85 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 5719e036048b..12531a5b14ff 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3566,6 +3566,76 @@ static int write_dev_flush(struct btrfs_device
*device, int wait)
 return 0;
 }

+struct device_checkpoint {
+struct list_head list;
+struct btrfs_device *device;
+int stat_value_checkpoint;
+};
+
+static int add_device_checkpoint(struct list_head *checkpoint,


Could we have another structure instead of list_head to record device
checkpoints?
The list_head is never a meaningful structure under most cases.


 I didn't understand this, there is device_checkpoint and the context
 of struct list_head *checkpoint would start and end within
 barrier_all_devices().




+struct btrfs_device *device)
+{
+struct device_checkpoint *cdev =
+kzalloc(sizeof(struct device_checkpoint), GFP_KERNEL);
+if (!cdev)
+return -ENOMEM;


This means that, we must check return value of add_device_checkpoint(),
while later code doesn't check it at all.


 oh. I will correct this.




+
+list_add(>list, checkpoint);


And I prefer to do extra check, in case such device is already inserted
once.


 Hmm with the current code its not at all possible, but let me add it.


+
+cdev->device = device;
+cdev->stat_value_checkpoint =
+btrfs_dev_stat_read(device, BTRFS_DEV_STAT_FLUSH_ERRS);
+
+return 0;
+}
+
+static void fini_devices_checkpoint(struct list_head *checkpoint)


Never a fan of the "fini_" naming.
What about "release_"?


 will change it.


+{
+struct device_checkpoint *cdev;
+
+while(!list_empty(checkpoint)) {
+cdev = list_entry(checkpoint->next,
+struct device_checkpoint, list);
+list_del(>list);
+kfree(cdev);
+}
+}
+
+static int check_stat_flush(struct btrfs_device *dev,
+struct list_head *checkpoint)
+{
+int val;
+struct device_checkpoint *cdev;
+
+list_for_each_entry(cdev, checkpoint, list) {
+if (cdev->device == dev) {
+val = btrfs_dev_stat_read(dev,
+BTRFS_DEV_STAT_FLUSH_ERRS);
+if (cdev->stat_value_checkpoint != val)
+return 1;


This check implies that BTRFS_DEV_STAT_FLUSH_ERRS will only be modified
by checkpoint related code.
Or any other modifier can easily break the check, causing false alert.


 I have already checked it, its not possible with the current code,
 or do you see if that is possible with the current code ? or Did I
 miss something ?


IIRC that's the reason why I update my previous degraded patch.


Personally speaking, I prefer the patchset to take more usage of the
checkpoint system, or it's a little overkilled for current usage.


 Just want to make sure things are done in the right way.


Thanks, Anand



Thanks,
Qu


+}
+}
+return 0;
+}
+
+static int check_barrier_error(struct btrfs_fs_devices *fsdevs,
+struct list_head *checkpoint)
+{
+int dropouts = 0;
+struct btrfs_device *dev;
+
+list_for_each_entry_rcu(dev, >devices, dev_list) {
+if (!dev->bdev || check_stat_flush(dev, checkpoint))
+dropouts++;
+}
+
+if (dropouts >
+fsdevs->fs_info->num_tolerated_disk_barrier_failures)
+return -EIO;
+
+return 0;
+}
+
 /*
  * send an empty flush down to each device in parallel,
  * then wait for them
@@ -3574,8 +3644,10 @@ static int barrier_all_devices(struct
btrfs_fs_info *info)
 {
 struct list_head *head;
 struct btrfs_device *dev;
-int dropouts = 0;
 int ret;
+struct list_head checkpoint;
+
+INIT_LIST_HEAD();

 /* send down all the barriers */
 head = >fs_devices->devices;
@@ -3587,29 +3659,31 @@ static int barrier_all_devices(struct
btrfs_fs_info *info)
 if (!dev->in_fs_metadata || !dev->writeable)
 continue;

+add_device_checkpoint(, dev);
 ret = write_dev_flush(dev, 0);
-if (ret)
+if (ret) {
+fini_devices_checkpoint();
 return ret;
+}
 }

 /* wait for all the barriers */
 list_for_each_entry_rcu(dev, head, dev_list) {
   

Re: Question re. required since btrfs-progs-4.10

2017-03-13 Thread David Sterba
On Sat, Mar 11, 2017 at 05:04:52PM +0100, Holger Hoffstätte wrote:
> I'm on Gentoo and wanted to update Docker to 17.03.0, which failed when it
> couldn't build the btrfs driver due to a missing .
> This worked fine on another machine the other day, so I dug in and found that
> the only difference was an intermediate update to btrfs-progs 4.10.
> Sure enough: since 4.10 ctree.h includes , which is nowhere
> to be found in my package of linux-headers-4.10 (also not in 4.9).
> 
> Simply copying sizes.h from the 4.9 kernel sources made everything work,
> but I'm curious whether this is actually correct - hence my question whether
>  _should_ actually be installed? Is this a bug with
> Gentoo's kernel header package or a new problem with btrfs-progs?

That's a bug in btrfs-progs, the build tests do not cover that. There's
a local copy of sizes.h that's supposed to be used. The patch was ported
from kernel where linus/sizes.h is present but is probably not (yet)
part of the public linux-headers.

> I don't see any good reason why sizes.h should not be installed, but wanted
> to verify since there are probably users of other distributions here as
> well. :)

The build would fail on any distro. I'll commit a fix and update the
tests so this gets caught early. Thanks for the report.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error

2017-03-13 Thread Stefan Priebe - Profihost AG

Am 13.03.2017 um 08:39 schrieb Qu Wenruo:
> 
> 
> At 03/13/2017 03:26 PM, Stefan Priebe - Profihost AG wrote:
>> Hi Qu,
>>
>> Am 13.03.2017 um 02:16 schrieb Qu Wenruo:
>>>
>>> At 03/13/2017 04:49 AM, Stefan Priebe - Profihost AG wrote:
 Hi Qu,

 while V5 was running fine against the openSUSE-42.2 kernel (based on
 v4.4).
>>>
>>> Thanks for the test.
>>>
 V7 results in OOPS to me:
 BUG: unable to handle kernel NULL pointer dereference at
 01f0
>>>
>>> This 0x1f0 is the same as offsetof(struct brrfs_root, fs_info), quite
>>> nice clue.
>>>
 IP: [] __endio_write_update_ordered+0x33/0x140
 [btrfs]
>>>
>>> IP points to:
>>> ---
>>> static inline bool btrfs_is_free_space_inode(struct btrfs_inode *inode)
>>> {
>>> struct btrfs_root *root = inode->root; << Either here
>>>
>>> if (root == root->fs_info->tree_root && << Or here
>>> btrfs_ino(inode) != BTRFS_BTREE_INODE_OBJECTID)
>>>
>>> ---
>>>
>>> Taking the above offset into consideration, it's only possible for later
>>> case.
>>>
>>> So here, we have a btrfs_inode whose @root is NULL.
>>
>> But wasn't this part of the code identical in V5? Why does it only
>> happen with V7?
> 
> There are still difference, but just as you said, the related
> part(checking if inode is free space cache inode) is identical across v5
> and v7.

But if i boot v7 it always happens. If i boot v5 it always works. Have
done 5 repeatet tests.

> I'm afraid that's a rare race leading to NULL btrfs_inode->root, which
> could happen in both v5 and v7.
> 
> What's the difference between SUSE and mainline kernel?

A lot ;-) But i don't think anything related.

> Maybe some mainline kernel commits have already fixed it?

May be no idea. But i haven't found any reason why v5 works.

Stefan

> 
> Thanks,
> Qu
>>
>>> This can be fixed easily by checking @root inside
>>> btrfs_is_free_space_inode(), as the backtrace shows that it's only
>>> happening for DirectIO, and it won't happen for free space cache inode.
>>>
>>> But I'm more curious how this happened for a more accurate fix, or we
>>> could have other NULL pointer access.
>>>
>>> Did you have any reproducer for this?
>>
>> Sorry no - this is a production MariaDB Server running btrfs with
>> compress-force=zlib. But if i could test anything i'll do.
>>
>> Greets,
>> Stefan
>>
>>>
>>> Thanks,
>>> Qu
>>>
 PGD 14e18d4067 PUD 14e1868067 PMD 0
 Oops:  [#1] SMP
 Modules linked in: netconsole xt_multiport ipt_REJECT nf_reject_ipv4
 xt_set iptable_filter ip_tables x_tables ip_set_hash_net ip_set
 nfnetlink crc32_pclmul button loop btrfs xor usbhid raid6_pq
 ata_generic
 virtio_blk virtio_net uhci_hcd ehci_hcd i2c_piix4 usbcore virtio_pci
 i2c_core usb_common ata_piix floppy
 CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.52+112-ph #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
 1.7.5-20140722_172050-sagunt 04/01/2014
 task: b4e0f500 ti: b4e0 task.ti: b4e0
 RIP: 0010:[] []
 __endio_write_update_ordered+0x33/0x140 [btrfs]
 RSP: 0018:8814eae03cd8 EFLAGS: 00010086
 RAX:  RBX: 8814e8fd5aa8 RCX: 0001
 RDX: 0010 RSI: 0010 RDI: 8814e45885c0
 RBP: 8814eae03d10 R08: 8814e8334000 R09: 00018040003a
 R10: ea00507d8d00 R11: 88141f634080 R12: 8814e45885c0
 R13: 8814e125d700 R14: 0010 R15: 8800376c6a80
 FS: () GS:8814eae0()
 knlGS:
 CS: 0010 DS:  ES:  CR0: 80050033
 CR2: 01f0 CR3: 0014e34c9000 CR4: 001406f0Stack:
  0010 8814e8fd5aa8 8814e953f3c0
 8814e125d700 0010 8800376c6a80 8814eae03d38
 c03ddf67 8814e86b6a80 8814e8fd5aa8 0001
 Call Trace:
 [] btrfs_endio_direct_write+0x37/0x60 [btrfs]
 [] bio_endio+0x57/0x60
 [] btrfs_end_bio+0xa1/0x140 [btrfs]
 [] bio_endio+0x57/0x60
 [] blk_update_request+0x8b/0x330
 [] blk_mq_end_request+0x1a/0x70
 [] virtblk_request_done+0x3f/0x70 [virtio_blk]
 [] __blk_mq_complete_request+0x78/0xe0
 [] blk_mq_complete_request+0x1c/0x20
 [] virtblk_done+0x64/0xe0 [virtio_blk]
 [] vring_interrupt+0x3a/0x90
 [] __handle_irq_event_percpu+0x89/0x1b0
 [] handle_irq_event_percpu+0x23/0x60
 [] handle_irq_event+0x3b/0x60
 [] handle_edge_irq+0x6f/0x150
 [] handle_irq+0x1d/0x30
 [] do_IRQ+0x4b/0xd0
 [] common_interrupt+0x8c/0x8c
 DWARF2 unwinder stuck at ret_from_intr+0x0/0x1b
 Leftover inexact backtrace:
 2017-03-12 20:33:08 
 2017-03-12 20:33:08  [] ?
 native_safe_halt+0x6/0x10
 [] default_idle+0x1e/0xe0
 [] arch_cpu_idle+0xf/0x20
 [] default_idle_call+0x3b/0x40
 [] cpu_startup_entry+0x29a/0x370
 [] 

Re: [PATCH v2] qgroup: Retry after commit on getting EDQUOT

2017-03-13 Thread Filipe Manana
On Mon, Mar 13, 2017 at 12:27 PM, Goldwyn Rodrigues  wrote:
>
>
> On 03/13/2017 07:14 AM, Filipe Manana wrote:
>> On Sat, Mar 11, 2017 at 11:02 PM, Goldwyn Rodrigues  wrote:
>>> From: Goldwyn Rodrigues 
>>>
>>> We are facing the same problem with EDQUOT which was experienced with
>>> ENOSPC. Not sure if we require a full ticketing system such as ENOSPC, but
>>> here is a fix. Let me know if it is too big a hammer.
>>>
>>> Quotas are reserved during the start of an operation, incrementing
>>> qg->reserved. However, it is written to disk in a commit_transaction
>>> which could take as long as commit_interval. In the meantime there
>>> could be deletions which are not accounted for because deletions are
>>> accounted for only while committed (free_refroot). So, when we get
>>> a EDQUOT flush the data to disk and try again.
>>
>> So this doesn't explain why the problem happens, taking into account
>> the reproducer below.
>> You mention that deletions are only accounted at transaction commit
>> time, but the reproducer does not do any deletions.
>
> Yes it does, after Cleanup. It is a loop. I suppose you got confused by
> the indent.

Indeed, the indentation is screwed up.

>
>> So why does the reproducer gets EDQUOT errors? The second sentence
>> "However, it is written to disk in a commit_transaction
>> which could take as long as commit_interval." also does not explain
>> why is that a problem, why it makes the reproducer get EDQUOT errors.
>>
>> I would like to see a clear and complete explanation in the changelog.
>
> I suppose this comment is irrelevant now?

Yes.

>
>>
>>>
>>> I combined the conditions of rfer and excl to reduce code lines, though
>>> the condition looks uglier.
>>
>> This sentence is outdated, it's no longer valid for v2 as it now calls
>> qgroup_check_limits().
>>
>>>
>>> Here is a sample script which shows this issue.
>>>
>>> DEVICE=/dev/vdb
>>> MOUNTPOINT=/mnt
>>> TESTVOL=$MOUNTPOINT/tmp
>>> QUOTA=5
>>> PROG=btrfs
>>> DD_BS="4k"
>>> DD_COUNT="256"
>>> RUN_TIMES=5000
>>>
>>> mkfs.btrfs -f $DEVICE
>>> mount -o commit=240 $DEVICE $MOUNTPOINT
>>> $PROG subvolume create $TESTVOL
>>> $PROG quota enable $TESTVOL
>>> $PROG qgroup limit ${QUOTA}G $TESTVOL
>>>
>>> typeset -i DD_RUN_GOOD
>>> typeset -i QUOTA
>>>
>>> function _check_cmd() {
>>> if [[ ${?} > 0 ]]; then
>>> echo -n "$(date) E: Running previous command"
>>> echo ${*}
>>> echo "Without sync"
>>> $PROG qgroup show -pcreFf ${TESTVOL}
>>> echo "With sync"
>>> $PROG qgroup show -pcreFf --sync ${TESTVOL}
>>> exit 1
>>> fi
>>> }
>>>
>>> while true; do
>>> DD_RUN_GOOD=$RUN_TIMES
>>>
>>> while (( ${DD_RUN_GOOD} != 0 )); do
>>> dd if=/dev/zero of=${TESTVOL}/quotatest${DD_RUN_GOOD} bs=${DD_BS} 
>>> count=${DD_COUNT}
>>> _check_cmd "dd if=/dev/zero of=${TESTVOL}/quotatest${DD_RUN_GOOD} 
>>> bs=${DD_BS} count=${DD_COUNT}"
>>> DD_RUN_GOOD=(${DD_RUN_GOOD}-1)
>>> done
>>>
>>> $PROG qgroup show -pcref $TESTVOL
>>> echo "--- Cleanup -- "
>>> rm $TESTVOL/quotatest*
>
> Here is the deletion.
>
>>>
>>> done
>
> And the infinite loop ending.
>
>>
>> Please write a test case for fstests.
>
> You mean xfstests. will do.

xfstests got renamed to fstests long time ago, and obviously it's
xfstests as that's the test suite we all use.

thanks

>
>>
>>>
>>> Changes since v1:
>>>  - Changed start_delalloc_roots() to start_delalloc_inode() to target
>>>the root in question only to reduce the amount of flush to be done.
>>>  - Added wait_ordered_extents().
>>
>> Listing what changed between versions is not meant to appear in the
>> changelog. This should go after the --- below. You should also prefix
>> the patch subject with "btrfs:" (or "btrfs-progs:"). See the examples
>> from others or the wiki for intsructions
>> (https://btrfs.wiki.kernel.org/index.php/Writing_patch_for_btrfs#Updating_patches).
>>
>
> Ok, understood. Thanks.
>
>> thanks
>>
>>>
>>> Signed-off-by: Goldwyn Rodrigues 
>>> Reviewed-by: Qu Wenruo 
>>> ---
>>>  fs/btrfs/qgroup.c | 15 +++
>>>  1 file changed, 15 insertions(+)
>>>
>>> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
>>> index a5da750..6215264 100644
>>> --- a/fs/btrfs/qgroup.c
>>> +++ b/fs/btrfs/qgroup.c
>>> @@ -2367,6 +2367,7 @@ static int qgroup_reserve(struct btrfs_root *root, 
>>> u64 num_bytes, bool enforce)
>>> struct btrfs_fs_info *fs_info = root->fs_info;
>>> u64 ref_root = root->root_key.objectid;
>>> int ret = 0;
>>> +   int retried = 0;
>>> struct ulist_node *unode;
>>> struct ulist_iterator uiter;
>>>
>>> @@ -2376,6 +2377,7 @@ static int qgroup_reserve(struct btrfs_root *root, 
>>> u64 num_bytes, bool enforce)
>>> if (num_bytes == 0)
>>> return 0;
>>>
>>> +retry:
>>>   

Re: [PATCH v2] qgroup: Retry after commit on getting EDQUOT

2017-03-13 Thread Goldwyn Rodrigues


On 03/13/2017 07:14 AM, Filipe Manana wrote:
> On Sat, Mar 11, 2017 at 11:02 PM, Goldwyn Rodrigues  wrote:
>> From: Goldwyn Rodrigues 
>>
>> We are facing the same problem with EDQUOT which was experienced with
>> ENOSPC. Not sure if we require a full ticketing system such as ENOSPC, but
>> here is a fix. Let me know if it is too big a hammer.
>>
>> Quotas are reserved during the start of an operation, incrementing
>> qg->reserved. However, it is written to disk in a commit_transaction
>> which could take as long as commit_interval. In the meantime there
>> could be deletions which are not accounted for because deletions are
>> accounted for only while committed (free_refroot). So, when we get
>> a EDQUOT flush the data to disk and try again.
> 
> So this doesn't explain why the problem happens, taking into account
> the reproducer below.
> You mention that deletions are only accounted at transaction commit
> time, but the reproducer does not do any deletions.

Yes it does, after Cleanup. It is a loop. I suppose you got confused by
the indent.

> So why does the reproducer gets EDQUOT errors? The second sentence
> "However, it is written to disk in a commit_transaction
> which could take as long as commit_interval." also does not explain
> why is that a problem, why it makes the reproducer get EDQUOT errors.
> 
> I would like to see a clear and complete explanation in the changelog.

I suppose this comment is irrelevant now?

> 
>>
>> I combined the conditions of rfer and excl to reduce code lines, though
>> the condition looks uglier.
> 
> This sentence is outdated, it's no longer valid for v2 as it now calls
> qgroup_check_limits().
> 
>>
>> Here is a sample script which shows this issue.
>>
>> DEVICE=/dev/vdb
>> MOUNTPOINT=/mnt
>> TESTVOL=$MOUNTPOINT/tmp
>> QUOTA=5
>> PROG=btrfs
>> DD_BS="4k"
>> DD_COUNT="256"
>> RUN_TIMES=5000
>>
>> mkfs.btrfs -f $DEVICE
>> mount -o commit=240 $DEVICE $MOUNTPOINT
>> $PROG subvolume create $TESTVOL
>> $PROG quota enable $TESTVOL
>> $PROG qgroup limit ${QUOTA}G $TESTVOL
>>
>> typeset -i DD_RUN_GOOD
>> typeset -i QUOTA
>>
>> function _check_cmd() {
>> if [[ ${?} > 0 ]]; then
>> echo -n "$(date) E: Running previous command"
>> echo ${*}
>> echo "Without sync"
>> $PROG qgroup show -pcreFf ${TESTVOL}
>> echo "With sync"
>> $PROG qgroup show -pcreFf --sync ${TESTVOL}
>> exit 1
>> fi
>> }
>>
>> while true; do
>> DD_RUN_GOOD=$RUN_TIMES
>>
>> while (( ${DD_RUN_GOOD} != 0 )); do
>> dd if=/dev/zero of=${TESTVOL}/quotatest${DD_RUN_GOOD} bs=${DD_BS} 
>> count=${DD_COUNT}
>> _check_cmd "dd if=/dev/zero of=${TESTVOL}/quotatest${DD_RUN_GOOD} 
>> bs=${DD_BS} count=${DD_COUNT}"
>> DD_RUN_GOOD=(${DD_RUN_GOOD}-1)
>> done
>>
>> $PROG qgroup show -pcref $TESTVOL
>> echo "--- Cleanup -- "
>> rm $TESTVOL/quotatest*

Here is the deletion.

>>
>> done

And the infinite loop ending.

> 
> Please write a test case for fstests.

You mean xfstests. will do.

> 
>>
>> Changes since v1:
>>  - Changed start_delalloc_roots() to start_delalloc_inode() to target
>>the root in question only to reduce the amount of flush to be done.
>>  - Added wait_ordered_extents().
> 
> Listing what changed between versions is not meant to appear in the
> changelog. This should go after the --- below. You should also prefix
> the patch subject with "btrfs:" (or "btrfs-progs:"). See the examples
> from others or the wiki for intsructions
> (https://btrfs.wiki.kernel.org/index.php/Writing_patch_for_btrfs#Updating_patches).
> 

Ok, understood. Thanks.

> thanks
> 
>>
>> Signed-off-by: Goldwyn Rodrigues 
>> Reviewed-by: Qu Wenruo 
>> ---
>>  fs/btrfs/qgroup.c | 15 +++
>>  1 file changed, 15 insertions(+)
>>
>> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
>> index a5da750..6215264 100644
>> --- a/fs/btrfs/qgroup.c
>> +++ b/fs/btrfs/qgroup.c
>> @@ -2367,6 +2367,7 @@ static int qgroup_reserve(struct btrfs_root *root, u64 
>> num_bytes, bool enforce)
>> struct btrfs_fs_info *fs_info = root->fs_info;
>> u64 ref_root = root->root_key.objectid;
>> int ret = 0;
>> +   int retried = 0;
>> struct ulist_node *unode;
>> struct ulist_iterator uiter;
>>
>> @@ -2376,6 +2377,7 @@ static int qgroup_reserve(struct btrfs_root *root, u64 
>> num_bytes, bool enforce)
>> if (num_bytes == 0)
>> return 0;
>>
>> +retry:
>> spin_lock(_info->qgroup_lock);
>> quota_root = fs_info->quota_root;
>> if (!quota_root)
>> @@ -2402,6 +2404,19 @@ static int qgroup_reserve(struct btrfs_root *root, 
>> u64 num_bytes, bool enforce)
>> qg = unode_aux_to_qgroup(unode);
>>
>> if (enforce && !qgroup_check_limits(qg, num_bytes)) {
>> +   if (!retried && 

Re: [PATCH v2] qgroup: Retry after commit on getting EDQUOT

2017-03-13 Thread Filipe Manana
On Sat, Mar 11, 2017 at 11:02 PM, Goldwyn Rodrigues  wrote:
> From: Goldwyn Rodrigues 
>
> We are facing the same problem with EDQUOT which was experienced with
> ENOSPC. Not sure if we require a full ticketing system such as ENOSPC, but
> here is a fix. Let me know if it is too big a hammer.
>
> Quotas are reserved during the start of an operation, incrementing
> qg->reserved. However, it is written to disk in a commit_transaction
> which could take as long as commit_interval. In the meantime there
> could be deletions which are not accounted for because deletions are
> accounted for only while committed (free_refroot). So, when we get
> a EDQUOT flush the data to disk and try again.

So this doesn't explain why the problem happens, taking into account
the reproducer below.
You mention that deletions are only accounted at transaction commit
time, but the reproducer does not do any deletions.
So why does the reproducer gets EDQUOT errors? The second sentence
"However, it is written to disk in a commit_transaction
which could take as long as commit_interval." also does not explain
why is that a problem, why it makes the reproducer get EDQUOT errors.

I would like to see a clear and complete explanation in the changelog.

>
> I combined the conditions of rfer and excl to reduce code lines, though
> the condition looks uglier.

This sentence is outdated, it's no longer valid for v2 as it now calls
qgroup_check_limits().

>
> Here is a sample script which shows this issue.
>
> DEVICE=/dev/vdb
> MOUNTPOINT=/mnt
> TESTVOL=$MOUNTPOINT/tmp
> QUOTA=5
> PROG=btrfs
> DD_BS="4k"
> DD_COUNT="256"
> RUN_TIMES=5000
>
> mkfs.btrfs -f $DEVICE
> mount -o commit=240 $DEVICE $MOUNTPOINT
> $PROG subvolume create $TESTVOL
> $PROG quota enable $TESTVOL
> $PROG qgroup limit ${QUOTA}G $TESTVOL
>
> typeset -i DD_RUN_GOOD
> typeset -i QUOTA
>
> function _check_cmd() {
> if [[ ${?} > 0 ]]; then
> echo -n "$(date) E: Running previous command"
> echo ${*}
> echo "Without sync"
> $PROG qgroup show -pcreFf ${TESTVOL}
> echo "With sync"
> $PROG qgroup show -pcreFf --sync ${TESTVOL}
> exit 1
> fi
> }
>
> while true; do
> DD_RUN_GOOD=$RUN_TIMES
>
> while (( ${DD_RUN_GOOD} != 0 )); do
> dd if=/dev/zero of=${TESTVOL}/quotatest${DD_RUN_GOOD} bs=${DD_BS} 
> count=${DD_COUNT}
> _check_cmd "dd if=/dev/zero of=${TESTVOL}/quotatest${DD_RUN_GOOD} 
> bs=${DD_BS} count=${DD_COUNT}"
> DD_RUN_GOOD=(${DD_RUN_GOOD}-1)
> done
>
> $PROG qgroup show -pcref $TESTVOL
> echo "--- Cleanup -- "
> rm $TESTVOL/quotatest*
>
> done

Please write a test case for fstests.

>
> Changes since v1:
>  - Changed start_delalloc_roots() to start_delalloc_inode() to target
>the root in question only to reduce the amount of flush to be done.
>  - Added wait_ordered_extents().

Listing what changed between versions is not meant to appear in the
changelog. This should go after the --- below. You should also prefix
the patch subject with "btrfs:" (or "btrfs-progs:"). See the examples
from others or the wiki for intsructions
(https://btrfs.wiki.kernel.org/index.php/Writing_patch_for_btrfs#Updating_patches).

thanks

>
> Signed-off-by: Goldwyn Rodrigues 
> Reviewed-by: Qu Wenruo 
> ---
>  fs/btrfs/qgroup.c | 15 +++
>  1 file changed, 15 insertions(+)
>
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index a5da750..6215264 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -2367,6 +2367,7 @@ static int qgroup_reserve(struct btrfs_root *root, u64 
> num_bytes, bool enforce)
> struct btrfs_fs_info *fs_info = root->fs_info;
> u64 ref_root = root->root_key.objectid;
> int ret = 0;
> +   int retried = 0;
> struct ulist_node *unode;
> struct ulist_iterator uiter;
>
> @@ -2376,6 +2377,7 @@ static int qgroup_reserve(struct btrfs_root *root, u64 
> num_bytes, bool enforce)
> if (num_bytes == 0)
> return 0;
>
> +retry:
> spin_lock(_info->qgroup_lock);
> quota_root = fs_info->quota_root;
> if (!quota_root)
> @@ -2402,6 +2404,19 @@ static int qgroup_reserve(struct btrfs_root *root, u64 
> num_bytes, bool enforce)
> qg = unode_aux_to_qgroup(unode);
>
> if (enforce && !qgroup_check_limits(qg, num_bytes)) {
> +   if (!retried && qg->reserved > 0) {
> +   struct btrfs_trans_handle *trans;
> +   spin_unlock(_info->qgroup_lock);
> +   btrfs_start_delalloc_inodes(root, 0);
> +   btrfs_wait_ordered_extents(root, -1, 0,
> +   (u64)-1);
> +   trans = btrfs_join_transaction(root);
> + 

Re: Home storage with btrfs

2017-03-13 Thread Austin S. Hemmelgarn

On 2017-03-13 07:52, Juan Orti Alcaine wrote:

2017-03-13 12:29 GMT+01:00 Hérikz Nawarro :

Hello everyone,

Today is safe to use btrfs for home storage? No raid, just secure
storage for some files and create snapshots from it.



In my humble opinion, yes. I'm running a RAID1 btrfs at home for 5
years and I feel the most serious bugs have been fixed, because in the
last two years I have not experienced any issue.
In general, I'd agree.  I've not seen any issues resulting from BTRFS 
itself for the past 2.5 years (although it's helped me find quite a lot 
of marginal or failing hardware over that time), but I've also not used 
many of the less stable features (raid56, qgroups, and a handful of 
other things).


One piece of advice I will give though, try to keep the total number of 
snapshots to a reasonably small three digit number (ideally less than 
200, absolutely less than 300), otherwise performance is going to be 
horrible.


Anyway, keeping your kernel and btrfs-progs updated is a must, and of
course, having good backups. I'm using Fedora and it's fine.
Also agreed, Fedora is one of the best options for a traditional distro 
(they're very good about staying up to date and back-porting bug-fixes 
from the upstream kernel).  The other two I'd recommend are Arch (they 
actually use an almost upstream kernel and are generally the first 
distro to have new versions of any arbitrary software) and Gentoo 
(similar to Arch, but more maintenance intensive (although also more 
efficient (usually))).

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Home storage with btrfs

2017-03-13 Thread Juan Orti Alcaine
2017-03-13 12:29 GMT+01:00 Hérikz Nawarro :
> Hello everyone,
>
> Today is safe to use btrfs for home storage? No raid, just secure
> storage for some files and create snapshots from it.
>

In my humble opinion, yes. I'm running a RAID1 btrfs at home for 5
years and I feel the most serious bugs have been fixed, because in the
last two years I have not experienced any issue.

Anyway, keeping your kernel and btrfs-progs updated is a must, and of
course, having good backups. I'm using Fedora and it's fine.

Kind regards.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Home storage with btrfs

2017-03-13 Thread Hérikz Nawarro
Hello everyone,

Today is safe to use btrfs for home storage? No raid, just secure
storage for some files and create snapshots from it.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 00/17] fs, btrfs refcount conversions

2017-03-13 Thread Reshetova, Elena

> On Fri, Mar 03, 2017 at 10:55:09AM +0200, Elena Reshetova wrote:
> > Now when new refcount_t type and API are finally merged
> > (see include/linux/refcount.h), the following
> > patches convert various refcounters in the btrfs filesystem from atomic_t
> > to refcount_t. By doing this we prevent intentional or accidental
> > underflows or overflows that can led to use-after-free vulnerabilities.
> >
> > The below patches are fully independent and can be cherry-picked separately.
> > Since we convert all kernel subsystems in the same fashion, resulting
> > in about 300 patches, we have to group them for sending at least in some
> > fashion to be manageable. Please excuse the long cc list.
> 
> Thanks, the patchset looks good to me, I'm going to add it to the 4.12 queue.

Thank you very much!

> 
> > These patches have been tested with xfstests by running btrfs-related tests.
> > btrfs debug was enabled, warns on refcount errors, too. No output related to
> > refcount errors produced. However, the following errors were during the run:
> >  * tests btrfs/078, btrfs/114, btrfs/115, no errors anywhere in dmesg, but
> >  process hangs. They all seem to be around qgroup, sometimes error visible
> >  such as qgroup scan failed -4 before it blocks, but not always.
> >  * test btrfs/104 dmesg has additional error output:
> >  BTRFS warning (device vdc): qgroup 258 reserved space underflow, have: 0,
> >  to free: 4096
> >  I tried looking at the code on what causes the failure, but could not 
> > figure
> >  it out. It doesn't seem to be related to any refcount changes at least IMO.
> >
> > The above test failures are hard for me to understand and interpreted, but
> > they don't seem to relate to refcount conversions.
> 
> I don't think they're related to the refcount updates so we'll address
> them.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Oops in btrfs_recover_relocation, kernel 4.8.1

2017-03-13 Thread Hugo Mills
On Mon, Mar 13, 2017 at 10:22:04AM +0800, Qu Wenruo wrote:
> 
> 
> At 03/10/2017 08:23 PM, Hugo Mills wrote:
> >   Does anyone recall seeing this oops before? Is it something that
> >can be fixed with a newer kernel? (I'm on a USB stick for this, so a
> >new kernel is a major undertaking, and I'd like some reasonable
> >expectation of success if I do it).
> 
> Yes, v4.10 has the fix for the bug.
> >
> >   Background: I'm rebuilding a dead server. I needed to reduce the
> >device count on this FS to 6. Stupidly, I attached one device using an
> >external USB case, and the USB connection reset during the device
> >delete (within a few seconds). I can mount the FS -o ro,recovery, but
> >using -o recovery on its own causes the oops below. If I can recover
> >in-place, that would save me a *lot* of time in restoring backups.
> >
> >   Also... qgroups, WTH? I've *never* enabled qgroups on this FS.
> 
> Did you use any auto-backup system?

   Yes -- my own, which doesn't enable qgroups.

> Quite a lot of them will enable qgroup.
> 
> Anyway, after mounting with v4.10 kernel, you can easily find out if
> qgroup is enabled.

   They aren't: 

hrm@amelia:/media $ sudo btrfs qgroup show /media/btrfs/amelia/
ERROR: can't perform the search - No such file or directory
ERROR: can't list qgroups: No such file or directory

   Anyway, it's all working now with a 4.10 kernel. Thanks for the
reply.

   Hugo.

> Thanks,
> Qu
> 
> >
> >   For what it's worth, the FS passes btrfs check --readonly with no
> >errors reported. (btrfs --version is 4.7.3).
> >
> >   Hugo.
> >
> >[  566.852589] BTRFS warning (device sdh1): 'recovery' is deprecated, use 
> >'usebackuproot' instead
> >[  566.852591] BTRFS info (device sdh1): trying to use backup root at mount 
> >time
> >[  566.852592] BTRFS info (device sdh1): disk space caching is enabled
> >[  566.922803] BTRFS info (device sdh1): bdev /dev/sdh1 errs: wr 0, rd 20, 
> >flush 0, corrupt 0, gen 0
> >[  578.715616] BUG: unable to handle kernel paging request at 
> >fe50
> >[  578.715619] IP: [] 
> >qgroup_fix_relocated_data_extents+0x2b/0x2c0 [btrfs]
> >[  578.715638] PGD 2f400f067 PUD 2f4011067 PMD 0
> >[  578.715640] Oops:  [#1]
> >[  578.715642] Modules linked in: cpufreq_userspace(E) cpufreq_powersave(E) 
> >cpufreq_conservative(E) kvm_amd(E) kvm(E) irqbypass(E) crc32_pclmul(E) 
> >efi_pstore(E) ghash_clmulni_intel(E) pcspkr(E) serio_raw(E) efivars(E) 
> >fam15h_power(E) k10temp(E) btrfs(E) acpi_cpufreq(E) tpm_tis(E) 
> >tpm_tis_core(E) tpm(E) sp5100_tco(E) sg(E) snd_hda_codec_realtek(E) 
> >snd_hda_codec_hdmi(E) snd_hda_codec_generic(E) snd_hda_intel(E) 9p(E) 
> >snd_hda_codec(E) 9pnet(E) snd_hda_core(E) fscache(E) snd_hwdep(E) snd_pcm(E) 
> >snd_timer(E) snd(E) soundcore(E) shpchp(E) ib_iser(E) rdma_cm(E) iw_cm(E) 
> >ib_cm(E) ib_core(E) configfs(E) iscsi_tcp(E) libiscsi_tcp(E) libiscsi(E) 
> >scsi_transport_iscsi(E) fuse(E) evdev(E) aoe(E) efivarfs(E) ip_tables(E) 
> >x_tables(E) autofs4(E) loop(E) overlay(E) nls_utf8(E) isofs(E) raid10(E) 
> >raid456(E) async_raid6_recov(E)
> >[  578.715663]  async_memcpy(E) async_pq(E) async_xor(E) async_tx(E) xor(E) 
> >raid6_pq(E) libcrc32c(E) crc32c_generic(E) raid0(E) multipath(E) linear(E) 
> >dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E) raid1(E) md_mod(E) 
> >sd_mod(E) hid_generic(E) usbhid(E) hid(E) uas(E) usb_storage(E) 
> >crc32c_intel(E) aesni_intel(E) aes_x86_64(E) glue_helper(E) lrw(E) 
> >gf128mul(E) ablk_helper(E) cryptd(E) ohci_pci(E) ahci(E) libahci(E) 
> >sata_sil24(E) i2c_piix4(E) r8169(E) mii(E) ehci_pci(E) ohci_hcd(E) 
> >ehci_hcd(E) libata(E) scsi_mod(E) radeon(E) i2c_algo_bit(E) 
> >drm_kms_helper(E) xhci_pci(E) xhci_hcd(E) usbcore(E) usb_common(E) ttm(E) 
> >drm(E) button(E)
> >[  578.715684] CPU: 0 PID: 3532 Comm: mount Tainted: GE   
> >4.8.0-1-grml-amd64 #1 Debian 4.8.15-1+grml.1
> >[  578.715684] Hardware name: Gigabyte Technology Co., Ltd. To be filled by 
> >O.E.M./970A-DS3P, BIOS FD 02/26/2016
> >[  578.715686] task: 8fd7ead82fc0 task.stack: 8fd7db15c000
> >[  578.715687] RIP: 0010:[]  [] 
> >qgroup_fix_relocated_data_extents+0x2b/0x2c0 [btrfs]
> >[  578.715699] RSP: 0018:8fd7db15fa08  EFLAGS: 00010246
> >[  578.715700] RAX: 8fd7db133800 RBX: 8fd7e30c79a0 RCX: 
> >
> >[  578.715701] RDX: 8fd7ddb5cd10 RSI: 8fd7dc6b5000 RDI: 
> >8fd7ddb5cc80
> >[  578.715701] RBP: 8fd7e7445000 R08:  R09: 
> >8fd7ddb5cc80
> >[  578.715702] R10:  R11:  R12: 
> >8fd7db15faa0
> >[  578.715703] R13: 8fd7dc6b5000 R14:  R15: 
> >8fd7ddb5cc80
> >[  578.715704] FS:  7f23cd67f480() GS:bca35000() 
> >knlGS:
> >[  578.715705] CS:  0010 DS:  ES:  CR0: 80050033
> >[  578.715706] CR2: fe50 CR3: 00042302f000 CR4: 
> >000406b0
> >[  578.715707] Stack:
> >[  578.715707]   0801 

Re: [PATCH 4/4] btrfs: cleanup barrier_all_devices() to check dev stat flush error

2017-03-13 Thread Qu Wenruo



At 03/13/2017 03:42 PM, Anand Jain wrote:

The objective of this patch is to cleanup barrier_all_devices()
so that the error checking is in a separate loop independent of
of the loop which submits and waits on the device flush requests.


The idea itself is quite good, and we do need it.



By doing this it helps to further develop patches which would tune
the error-actions as needed.

Here functions such as btrfs_dev_stats_dirty() couldn't be used
because it doesn't monitor the flush errors BTRFS_DEV_STAT_FLUSH_ERRS.

Signed-off-by: Anand Jain 
---
 fs/btrfs/disk-io.c | 96 +++---
 1 file changed, 85 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 5719e036048b..12531a5b14ff 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3566,6 +3566,76 @@ static int write_dev_flush(struct btrfs_device *device, 
int wait)
return 0;
 }

+struct device_checkpoint {
+   struct list_head list;
+   struct btrfs_device *device;
+   int stat_value_checkpoint;
+};
+
+static int add_device_checkpoint(struct list_head *checkpoint,


Could we have another structure instead of list_head to record device 
checkpoints?

The list_head is never a meaningful structure under most cases.



+   struct btrfs_device *device)
+{
+   struct device_checkpoint *cdev =
+   kzalloc(sizeof(struct device_checkpoint), GFP_KERNEL);
+   if (!cdev)
+   return -ENOMEM;


This means that, we must check return value of add_device_checkpoint(), 
while later code doesn't check it at all.



+
+   list_add(>list, checkpoint);


And I prefer to do extra check, in case such device is already inserted 
once.



+
+   cdev->device = device;
+   cdev->stat_value_checkpoint =
+   btrfs_dev_stat_read(device, BTRFS_DEV_STAT_FLUSH_ERRS);
+
+   return 0;
+}
+
+static void fini_devices_checkpoint(struct list_head *checkpoint)


Never a fan of the "fini_" naming.
What about "release_"?


+{
+   struct device_checkpoint *cdev;
+
+   while(!list_empty(checkpoint)) {
+   cdev = list_entry(checkpoint->next,
+   struct device_checkpoint, list);
+   list_del(>list);
+   kfree(cdev);
+   }
+}
+
+static int check_stat_flush(struct btrfs_device *dev,
+   struct list_head *checkpoint)
+{
+   int val;
+   struct device_checkpoint *cdev;
+
+   list_for_each_entry(cdev, checkpoint, list) {
+   if (cdev->device == dev) {
+   val = btrfs_dev_stat_read(dev,
+   BTRFS_DEV_STAT_FLUSH_ERRS);
+   if (cdev->stat_value_checkpoint != val)
+   return 1;


This check implies that BTRFS_DEV_STAT_FLUSH_ERRS will only be modified 
by checkpoint related code.

Or any other modifier can easily break the check, causing false alert.

IIRC that's the reason why I update my previous degraded patch.


Personally speaking, I prefer the patchset to take more usage of the 
checkpoint system, or it's a little overkilled for current usage.


Thanks,
Qu


+   }
+   }
+   return 0;
+}
+
+static int check_barrier_error(struct btrfs_fs_devices *fsdevs,
+   struct list_head *checkpoint)
+{
+   int dropouts = 0;
+   struct btrfs_device *dev;
+
+   list_for_each_entry_rcu(dev, >devices, dev_list) {
+   if (!dev->bdev || check_stat_flush(dev, checkpoint))
+   dropouts++;
+   }
+
+   if (dropouts >
+   fsdevs->fs_info->num_tolerated_disk_barrier_failures)
+   return -EIO;
+
+   return 0;
+}
+
 /*
  * send an empty flush down to each device in parallel,
  * then wait for them
@@ -3574,8 +3644,10 @@ static int barrier_all_devices(struct btrfs_fs_info 
*info)
 {
struct list_head *head;
struct btrfs_device *dev;
-   int dropouts = 0;
int ret;
+   struct list_head checkpoint;
+
+   INIT_LIST_HEAD();

/* send down all the barriers */
head = >fs_devices->devices;
@@ -3587,29 +3659,31 @@ static int barrier_all_devices(struct btrfs_fs_info 
*info)
if (!dev->in_fs_metadata || !dev->writeable)
continue;

+   add_device_checkpoint(, dev);
ret = write_dev_flush(dev, 0);
-   if (ret)
+   if (ret) {
+   fini_devices_checkpoint();
return ret;
+   }
}

/* wait for all the barriers */
list_for_each_entry_rcu(dev, head, dev_list) {
if (dev->missing)
continue;
-   if (!dev->bdev) {
-   dropouts++;
+   if (!dev->bdev)
continue;

Re: [PATCH v3.1 5/7] btrfs: Allow barrier_all_devices to do chunk level device check

2017-03-13 Thread Anand Jain


Qu,

patch 4/4 added a cleanup for barrier_all_devices() and introduced
a new function check_barrier_error() where integration with per chunk
level device check will simplify.

 [PATCH 4/4] btrfs: cleanup barrier_all_devices() to check dev stat 
flush error


Thanks, Anand



On 03/09/2017 09:34 AM, Qu Wenruo wrote:

The last user of num_tolerated_disk_barrier_failures is
barrier_all_devices().
But it's can be easily changed to new per-chunk degradable check
framework.

Now btrfs_device will have two extra members, representing send/wait
error, set at write_dev_flush() time.
With these 2 new members, btrfs_check_rw_degradable() can check if the
fs is still OK when the fs is committed to disk.

Signed-off-by: Qu Wenruo 
Tested-by: Austin S. Hemmelgarn 
Tested-by: Adam Borowski 
Tested-by: Dmitrii Tcvetkov 
---
 fs/btrfs/disk-io.c | 21 +
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 658b8fab1d39..549045a3e15f 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3570,17 +3570,20 @@ static int barrier_all_devices(struct btrfs_fs_info 
*info)
 {
struct list_head *head;
struct btrfs_device *dev;
-   int errors_send = 0;
-   int errors_wait = 0;
+   struct extra_rw_degrade_errors *errors;
int ret;

+   errors = alloc_extra_rw_degrade_errors(info->fs_devices->num_devices);
+   if (!errors)
+   return -ENOMEM;
+
/* send down all the barriers */
head = >fs_devices->devices;
list_for_each_entry_rcu(dev, head, dev_list) {
if (dev->missing)
continue;
if (!dev->bdev) {
-   errors_send++;
+   record_extra_rw_degrade_error(errors, dev->devid);
continue;
}
if (!dev->in_fs_metadata || !dev->writeable)
@@ -3588,7 +3591,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)

ret = write_dev_flush(dev, 0);
if (ret)
-   errors_send++;
+   record_extra_rw_degrade_error(errors, dev->devid);
}

/* wait for all the barriers */
@@ -3596,7 +3599,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
if (dev->missing)
continue;
if (!dev->bdev) {
-   errors_wait++;
+   record_extra_rw_degrade_error(errors, dev->devid);
continue;
}
if (!dev->in_fs_metadata || !dev->writeable)
@@ -3604,11 +3607,13 @@ static int barrier_all_devices(struct btrfs_fs_info 
*info)

ret = write_dev_flush(dev, 1);
if (ret)
-   errors_wait++;
+   record_extra_rw_degrade_error(errors, dev->devid);
}
-   if (errors_send > info->num_tolerated_disk_barrier_failures ||
-   errors_wait > info->num_tolerated_disk_barrier_failures)
+   if (!btrfs_check_rw_degradable(info, errors)) {
+   kfree(errors);
return -EIO;
+   }
+   kfree(errors);
return 0;
 }



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 0/9] Qgroup fixes

2017-03-13 Thread Qu Wenruo
Patchset can be fetched from my github:
https://github.com/adam900710/linux.git qgroup_fixes

The new base is v4.11-rc1, only minor conflicts and are all easy to handle.

This pull request includes:
1) Fix for inode_cache mount option
   Although no one really cares inode_cache mount option, it will screw
   qgroup for a long time.
   Not only it will screw up qgroup test uses golden output, but also
   new test cases use btrfsck to verify qgroup.

2) Fix for btrfs/104 kernel warning
   This is caused by quota enabling with dirty buffers not written to
   disc.

   Fixed by checking EXTENT_QGROUP_RESERVED flag other than just
   decreasing qgroup->reserved.

3) Fix qgroup underflow caused by freeing ranges not reserved by caller
   Mainly exposed by Chandan on PPC64.

   It's possible that buffered write is blocked by reserving metadata,
   and in error path we will release reserved space for both data and
   metadata.

   However the reserved data can already be reserved by another process
   writing the same page.

   In that case, data reserved space can be freed by two process, one
   for error path, one for normal write routine, causing underflow.

   Fixed by checking if that data range is reserved by ourselves and
   only free it if it's reserved by ourselves.

Update since 2016/12/09:
  Rebased to latest for-linux-4.11.

  Add missing reviewed-by and tested-by tags.

  Add more comment for btrfs_qgroup_reserve_data() for error handling.

  Add more comment for qgroup_free_reserved_data() for later
  enhancement (not function enhancement).

v2:
  Add reviewed-by tag for 2nd patch
  Update the first patch to follow the new trace point standard

Qu Wenruo (9):
  btrfs: qgroup: Add trace point for qgroup reserved space
  btrfs: qgroup: Re-arrange tracepoint timing to co-operate with
reserved space tracepoint
  btrfs: qgroup: Fix qgroup corruption caused by inode_cache mount
option
  btrfs: qgroup: Add quick exit for non-fs extents
  btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function
  btrfs: qgroup: Return actually freed bytes for qgroup release or free
data
  btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered
write and quota enable
  btrfs: qgroup: Introduce extent changeset for qgroup reserve functions
  btrfs: qgroup: Fix qgroup reserved space underflow by only freeing
reserved ranges

 fs/btrfs/ctree.h |  12 +-
 fs/btrfs/extent-tree.c   |  31 --
 fs/btrfs/extent_io.h |  14 ++-
 fs/btrfs/file.c  |  46 +---
 fs/btrfs/inode-map.c |   6 +-
 fs/btrfs/inode.c |  64 +++
 fs/btrfs/ioctl.c |  11 +-
 fs/btrfs/qgroup.c| 258 ++-
 fs/btrfs/qgroup.h|  58 --
 fs/btrfs/relocation.c|  13 ++-
 fs/btrfs/transaction.c   |  21 ++--
 include/trace/events/btrfs.h |  44 
 12 files changed, 396 insertions(+), 182 deletions(-)

-- 
2.12.0



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 9/9] btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges

2017-03-13 Thread Qu Wenruo
[BUG]
For the following case, btrfs can underflow qgroup reserved space
at error path:
(Page size 4K, function name without "btrfs_" prefix)

 Task A  | Task B
--
Buffered_write [0, 2K)   |
|- check_data_free_space()   |
|  |- qgroup_reserve_data()  |
| Range aligned to page  |
| range [0, 4K)  <<< |
| 4K bytes reserved  <<< |
|- copy pages to page cache  |
 | Buffered_write [2K, 4K)
 | |- check_data_free_space()
 | |  |- qgroup_reserved_data()
 | | Range alinged to page
 | | range [0, 4K)
 | | Already reserved by A <<<
 | | 0 bytes reserved  <<<
 | |- delalloc_reserve_metadata()
 | |  And it *FAILED* (Maybe EQUOTA)
 | |- free_reserved_data_space()
  |- qgroup_free_data()
 Range aligned to page range
 [0, 4K)
 Freeing 4K
(Special thanks to Chandan for the detailed report and analyse)

[CAUSE]
Above Task B is freeing reserved data range [0, 4K) which is actually
reserved by Task A.

And at write back time, page dirty by Task A will go through writeback
routine, which will free 4K reserved data space at file extent insert
time, causing the qgroup underflow.

[FIX]
For btrfs_qgroup_free_data(), add @reserved parameter to only free
data ranges reserved by previous btrfs_qgroup_reserve_data().
So in above case, Task B will try to free 0 byte, so no underflow.

Reported-by: Chandan Rajendra 
Signed-off-by: Qu Wenruo 
Reviewed-by: Chandan Rajendra 
Tested-by: Chandan Rajendra 
---
 fs/btrfs/ctree.h   |  6 +++--
 fs/btrfs/extent-tree.c | 12 +
 fs/btrfs/file.c| 29 +++-
 fs/btrfs/inode.c   | 29 ++--
 fs/btrfs/ioctl.c   |  4 +--
 fs/btrfs/qgroup.c  | 72 ++
 fs/btrfs/qgroup.h  |  3 ++-
 fs/btrfs/relocation.c  |  8 +++---
 8 files changed, 117 insertions(+), 46 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8349b3bd6522..f76b3ecdc715 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2690,7 +2690,10 @@ enum btrfs_flush_state {
 int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes);
 int btrfs_check_data_free_space(struct inode *inode,
struct extent_changeset *reserved, u64 start, u64 len);
-void btrfs_free_reserved_data_space(struct inode *inode, u64 start, u64 len);
+void btrfs_free_reserved_data_space(struct inode *inode,
+   struct extent_changeset *reserved, u64 start, u64 len);
+void btrfs_delalloc_release_space(struct inode *inode,
+   struct extent_changeset *reserved, u64 start, u64 len);
 void btrfs_free_reserved_data_space_noquota(struct inode *inode, u64 start,
u64 len);
 void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans,
@@ -2709,7 +2712,6 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode 
*inode, u64 num_bytes);
 void btrfs_delalloc_release_metadata(struct btrfs_inode *inode, u64 num_bytes);
 int btrfs_delalloc_reserve_space(struct inode *inode,
struct extent_changeset *reserved, u64 start, u64 len);
-void btrfs_delalloc_release_space(struct inode *inode, u64 start, u64 len);
 void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv, unsigned short type);
 struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_fs_info *fs_info,
  unsigned short type);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 67f4e97ef8c6..9de0436becd2 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4336,7 +4336,8 @@ void btrfs_free_reserved_data_space_noquota(struct inode 
*inode, u64 start,
  * This one will handle the per-inode data rsv map for accurate reserved
  * space framework.
  */
-void btrfs_free_reserved_data_space(struct inode *inode, u64 start, u64 len)
+void btrfs_free_reserved_data_space(struct inode *inode,
+   struct extent_changeset *reserved, u64 start, u64 len)
 {
struct btrfs_root *root = BTRFS_I(inode)->root;
 
@@ -4346,7 +4347,7 @@ void btrfs_free_reserved_data_space(struct inode *inode, 
u64 start, u64 len)
start = round_down(start, root->fs_info->sectorsize);
 
btrfs_free_reserved_data_space_noquota(inode, start, len);
-

[PATCH v2 8/9] btrfs: qgroup: Introduce extent changeset for qgroup reserve functions

2017-03-13 Thread Qu Wenruo
Introduce a new parameter, struct extent_changeset for
btrfs_qgroup_reserved_data() and its callers.

Such extent_changeset was used in btrfs_qgroup_reserve_data() to record
which range it reserved in current reserve, so it can free it at error
path.

The reason we need to export it to callers is, at buffered write error
path, without knowing what exactly which range we reserved in current
allocation, we can free space which is not reserved by us.

This will lead to qgroup reserved space underflow.

Reviewed-by: Chandan Rajendra 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/ctree.h   |  6 --
 fs/btrfs/extent-tree.c | 17 -
 fs/btrfs/extent_io.h   | 12 
 fs/btrfs/file.c| 17 ++---
 fs/btrfs/inode-map.c   |  6 +-
 fs/btrfs/inode.c   | 24 
 fs/btrfs/ioctl.c   |  7 ++-
 fs/btrfs/qgroup.c  | 33 +
 fs/btrfs/qgroup.h  |  3 ++-
 fs/btrfs/relocation.c  |  5 -
 10 files changed, 100 insertions(+), 30 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 29b7fc28c607..8349b3bd6522 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2687,8 +2687,9 @@ enum btrfs_flush_state {
COMMIT_TRANS=   6,
 };
 
-int btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len);
 int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes);
+int btrfs_check_data_free_space(struct inode *inode,
+   struct extent_changeset *reserved, u64 start, u64 len);
 void btrfs_free_reserved_data_space(struct inode *inode, u64 start, u64 len);
 void btrfs_free_reserved_data_space_noquota(struct inode *inode, u64 start,
u64 len);
@@ -2706,7 +2707,8 @@ void btrfs_subvolume_release_metadata(struct 
btrfs_fs_info *fs_info,
  struct btrfs_block_rsv *rsv);
 int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes);
 void btrfs_delalloc_release_metadata(struct btrfs_inode *inode, u64 num_bytes);
-int btrfs_delalloc_reserve_space(struct inode *inode, u64 start, u64 len);
+int btrfs_delalloc_reserve_space(struct inode *inode,
+   struct extent_changeset *reserved, u64 start, u64 len);
 void btrfs_delalloc_release_space(struct inode *inode, u64 start, u64 len);
 void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv, unsigned short type);
 struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_fs_info *fs_info,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 89d1bcba3f8c..67f4e97ef8c6 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3355,12 +3355,14 @@ static int cache_save_setup(struct 
btrfs_block_group_cache *block_group,
struct btrfs_fs_info *fs_info = block_group->fs_info;
struct btrfs_root *root = fs_info->tree_root;
struct inode *inode = NULL;
+   struct extent_changeset data_reserved;
u64 alloc_hint = 0;
int dcs = BTRFS_DC_ERROR;
u64 num_pages = 0;
int retries = 0;
int ret = 0;
 
+   extent_changeset_init(_reserved);
/*
 * If this block group is smaller than 100 megs don't bother caching the
 * block group.
@@ -3473,7 +3475,7 @@ static int cache_save_setup(struct 
btrfs_block_group_cache *block_group,
num_pages *= 16;
num_pages *= PAGE_SIZE;
 
-   ret = btrfs_check_data_free_space(inode, 0, num_pages);
+   ret = btrfs_check_data_free_space(inode, _reserved, 0, num_pages);
if (ret)
goto out_put;
 
@@ -3504,6 +3506,7 @@ static int cache_save_setup(struct 
btrfs_block_group_cache *block_group,
block_group->disk_cache_state = dcs;
spin_unlock(_group->lock);
 
+   extent_changeset_release(_reserved);
return ret;
 }
 
@@ -4272,7 +4275,8 @@ int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode 
*inode, u64 bytes)
  * Will replace old btrfs_check_data_free_space(), but for patch split,
  * add a new function first and then replace it.
  */
-int btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len)
+int btrfs_check_data_free_space(struct inode *inode,
+   struct extent_changeset *reserved, u64 start, u64 len)
 {
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
int ret;
@@ -4287,9 +4291,11 @@ int btrfs_check_data_free_space(struct inode *inode, u64 
start, u64 len)
return ret;
 
/* Use new btrfs_qgroup_reserve_data to reserve precious data space. */
-   ret = btrfs_qgroup_reserve_data(inode, start, len);
+   ret = btrfs_qgroup_reserve_data(inode, reserved, start, len);
if (ret < 0)
btrfs_free_reserved_data_space_noquota(inode, start, len);
+   else
+   ret = 0;
return ret;
 }
 
@@ -6130,11 +6136,12 @@ void 

[PATCH v2 5/9] btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function

2017-03-13 Thread Qu Wenruo
Quite a lot of qgroup corruption happens due to wrong timing of calling
btrfs_qgroup_prepare_account_extents().

Since the safest timing is calling it just before
btrfs_qgroup_account_extents(), there is no need to separate these 2
function.

Merging them will make code cleaner and less bug prone.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/qgroup.c  | 49 -
 fs/btrfs/qgroup.h  |  2 --
 fs/btrfs/transaction.c | 10 --
 3 files changed, 16 insertions(+), 45 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 445014de924c..9535ea56b6f3 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1403,38 +1403,6 @@ int btrfs_limit_qgroup(struct btrfs_trans_handle *trans,
return ret;
 }
 
-int btrfs_qgroup_prepare_account_extents(struct btrfs_trans_handle *trans,
-struct btrfs_fs_info *fs_info)
-{
-   struct btrfs_qgroup_extent_record *record;
-   struct btrfs_delayed_ref_root *delayed_refs;
-   struct rb_node *node;
-   u64 qgroup_to_skip;
-   int ret = 0;
-
-   delayed_refs = >transaction->delayed_refs;
-   qgroup_to_skip = delayed_refs->qgroup_to_skip;
-
-   /*
-* No need to do lock, since this function will only be called in
-* btrfs_commit_transaction().
-*/
-   node = rb_first(_refs->dirty_extent_root);
-   while (node) {
-   record = rb_entry(node, struct btrfs_qgroup_extent_record,
- node);
-   if (WARN_ON(!record->old_roots))
-   ret = btrfs_find_all_roots(NULL, fs_info,
-   record->bytenr, 0, >old_roots);
-   if (ret < 0)
-   break;
-   if (qgroup_to_skip)
-   ulist_del(record->old_roots, qgroup_to_skip, 0);
-   node = rb_next(node);
-   }
-   return ret;
-}
-
 int btrfs_qgroup_trace_extent_nolock(struct btrfs_fs_info *fs_info,
struct btrfs_delayed_ref_root *delayed_refs,
struct btrfs_qgroup_extent_record *record)
@@ -2051,6 +2019,18 @@ int btrfs_qgroup_account_extents(struct 
btrfs_trans_handle *trans,
 
if (!ret) {
/*
+* old roots should be searched when inserting qgroup
+* extent record
+*/
+   if (WARN_ON(!record->old_roots)) {
+   /* Search commit root to find old_roots */
+   ret = btrfs_find_all_roots(NULL, fs_info,
+   record->bytenr, 0,
+   >old_roots);
+   if (ret < 0)
+   goto cleanup;
+   }
+   /*
 * Use (u64)-1 as time_seq to do special search, which
 * doesn't lock tree or delayed_refs and search current
 * root. It's safe inside commit_transaction().
@@ -2059,8 +2039,11 @@ int btrfs_qgroup_account_extents(struct 
btrfs_trans_handle *trans,
record->bytenr, (u64)-1, _roots);
if (ret < 0)
goto cleanup;
-   if (qgroup_to_skip)
+   if (qgroup_to_skip) {
ulist_del(new_roots, qgroup_to_skip, 0);
+   ulist_del(record->old_roots, qgroup_to_skip,
+ 0);
+   }
ret = btrfs_qgroup_account_extent(trans, fs_info,
record->bytenr, record->num_bytes,
record->old_roots, new_roots);
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 19188e094d0e..b3c77f029dfe 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -134,8 +134,6 @@ int btrfs_limit_qgroup(struct btrfs_trans_handle *trans,
 int btrfs_read_qgroup_config(struct btrfs_fs_info *fs_info);
 void btrfs_free_qgroup_config(struct btrfs_fs_info *fs_info);
 struct btrfs_delayed_extent_op;
-int btrfs_qgroup_prepare_account_extents(struct btrfs_trans_handle *trans,
-struct btrfs_fs_info *fs_info);
 /*
  * Inform qgroup to trace one dirty extent, its info is recorded in @record.
  * So qgroup can account it at transaction committing time.
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 894c4980c15c..edee58751eb3 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1376,9 +1376,6 @@ static int qgroup_account_snapshot(struct 
btrfs_trans_handle *trans,
ret = commit_fs_roots(trans, fs_info);
if (ret)
 

[PATCH v2 1/9] btrfs: qgroup: Add trace point for qgroup reserved space

2017-03-13 Thread Qu Wenruo
Introduce the following trace points:
qgroup_update_reserve
qgroup_meta_reserve

These trace points are handy to trace qgroup reserve space related
problems.

Also export btrfs_qgroup structure, as now we directly pass btrfs_qgroup
structure to trace points, so that structure needs to be exported.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/qgroup.c| 52 +++-
 fs/btrfs/qgroup.h| 44 +
 include/trace/events/btrfs.h | 44 +
 3 files changed, 96 insertions(+), 44 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index a5da750c1087..68a7305b0a46 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -47,50 +47,6 @@
  *  - check all ioctl parameters
  */
 
-/*
- * one struct for each qgroup, organized in fs_info->qgroup_tree.
- */
-struct btrfs_qgroup {
-   u64 qgroupid;
-
-   /*
-* state
-*/
-   u64 rfer;   /* referenced */
-   u64 rfer_cmpr;  /* referenced compressed */
-   u64 excl;   /* exclusive */
-   u64 excl_cmpr;  /* exclusive compressed */
-
-   /*
-* limits
-*/
-   u64 lim_flags;  /* which limits are set */
-   u64 max_rfer;
-   u64 max_excl;
-   u64 rsv_rfer;
-   u64 rsv_excl;
-
-   /*
-* reservation tracking
-*/
-   u64 reserved;
-
-   /*
-* lists
-*/
-   struct list_head groups;  /* groups this group is member of */
-   struct list_head members; /* groups that are members of this group */
-   struct list_head dirty;   /* dirty groups */
-   struct rb_node node;  /* tree of qgroups */
-
-   /*
-* temp variables for accounting operations
-* Refer to qgroup_shared_accounting() for details.
-*/
-   u64 old_refcnt;
-   u64 new_refcnt;
-};
-
 static void btrfs_qgroup_update_old_refcnt(struct btrfs_qgroup *qg, u64 seq,
   int mod)
 {
@@ -1075,6 +1031,7 @@ static int __qgroup_excl_accounting(struct btrfs_fs_info 
*fs_info,
qgroup->excl += sign * num_bytes;
qgroup->excl_cmpr += sign * num_bytes;
if (sign > 0) {
+   trace_qgroup_update_reserve(fs_info, qgroup, -(s64)num_bytes);
if (WARN_ON(qgroup->reserved < num_bytes))
report_reserved_underflow(fs_info, qgroup, num_bytes);
else
@@ -1100,6 +1057,8 @@ static int __qgroup_excl_accounting(struct btrfs_fs_info 
*fs_info,
WARN_ON(sign < 0 && qgroup->excl < num_bytes);
qgroup->excl += sign * num_bytes;
if (sign > 0) {
+   trace_qgroup_update_reserve(fs_info, qgroup,
+   -(s64)num_bytes);
if (WARN_ON(qgroup->reserved < num_bytes))
report_reserved_underflow(fs_info, qgroup,
  num_bytes);
@@ -2424,6 +2383,7 @@ static int qgroup_reserve(struct btrfs_root *root, u64 
num_bytes, bool enforce)
 
qg = unode_aux_to_qgroup(unode);
 
+   trace_qgroup_update_reserve(fs_info, qg, num_bytes);
qg->reserved += num_bytes;
}
 
@@ -2469,6 +2429,7 @@ void btrfs_qgroup_free_refroot(struct btrfs_fs_info 
*fs_info,
 
qg = unode_aux_to_qgroup(unode);
 
+   trace_qgroup_update_reserve(fs_info, qg, -(s64)num_bytes);
if (WARN_ON(qg->reserved < num_bytes))
report_reserved_underflow(fs_info, qg, num_bytes);
else
@@ -2945,6 +2906,7 @@ int btrfs_qgroup_reserve_meta(struct btrfs_root *root, 
int num_bytes,
return 0;
 
BUG_ON(num_bytes != round_down(num_bytes, fs_info->nodesize));
+   trace_qgroup_meta_reserve(root, (s64)num_bytes);
ret = qgroup_reserve(root, num_bytes, enforce);
if (ret < 0)
return ret;
@@ -2964,6 +2926,7 @@ void btrfs_qgroup_free_meta_all(struct btrfs_root *root)
reserved = atomic_xchg(>qgroup_meta_rsv, 0);
if (reserved == 0)
return;
+   trace_qgroup_meta_reserve(root, -(s64)reserved);
btrfs_qgroup_free_refroot(fs_info, root->objectid, reserved);
 }
 
@@ -2978,6 +2941,7 @@ void btrfs_qgroup_free_meta(struct btrfs_root *root, int 
num_bytes)
BUG_ON(num_bytes != round_down(num_bytes, fs_info->nodesize));
WARN_ON(atomic_read(>qgroup_meta_rsv) < num_bytes);
atomic_sub(num_bytes, >qgroup_meta_rsv);
+   trace_qgroup_meta_reserve(root, -(s64)num_bytes);
btrfs_qgroup_free_refroot(fs_info, root->objectid, num_bytes);
 }
 
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 26932a8a1993..e5b42f1044fc 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -62,6 +62,50 @@ struct 

[PATCH v2 3/9] btrfs: qgroup: Fix qgroup corruption caused by inode_cache mount option

2017-03-13 Thread Qu Wenruo
[BUG]
The easist way to reproduce the bug is:
--
 # mkfs.btrfs -f $dev -n 16K
 # mount $dev $mnt -o inode_cache
 # btrfs quota enable $mnt
 # btrfs quota rescan -w $mnt
 # btrfs qgroup show $mnt
qgroupid rfer excl
  
0/5  32.00KiB 32.00KiB
 ^^ Twice the correct value
--

And fstests/btrfs qgroup test group can easily detect them with
inode_cache mount option.
Although some of them are false alerts since old test cases are using
fixed golden output.
While new test cases will use "btrfs check" to detect qgroup mismatch.

[CAUSE]
Inode_cache mount option will make commit_fs_roots() to call
btrfs_save_ino_cache() to update fs/subvol trees, and generate new
delayed refs.

However we call btrfs_qgroup_prepare_account_extents() too early, before
commit_fs_roots().
This makes the "old_roots" for newly generated extents are always NULL.
For freeing extent case, this makes both new_roots and old_roots to be
empty, while correct old_roots should not be empty.
This causing qgroup numbers not decreased correctly.

[FIX]
Modify the timing of calling btrfs_qgroup_prepare_account_extents() to
just before btrfs_qgroup_account_extents(), and add needed delayed_refs
handler.
So qgroup can handle inode_map mount options correctly.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/transaction.c | 25 ++---
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 61b807de3e16..894c4980c15c 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -2130,13 +2130,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans)
goto scrub_continue;
}
 
-   /* Reocrd old roots for later qgroup accounting */
-   ret = btrfs_qgroup_prepare_account_extents(trans, fs_info);
-   if (ret) {
-   mutex_unlock(_info->reloc_mutex);
-   goto scrub_continue;
-   }
-
/*
 * make sure none of the code above managed to slip in a
 * delayed item
@@ -2179,6 +2172,24 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans)
btrfs_free_log_root_tree(trans, fs_info);
 
/*
+* commit_fs_roots() can call btrfs_save_ino_cache(), which generates
+* new delayed refs. Must handle them or qgroup can be wrong.
+*/
+   ret = btrfs_run_delayed_refs(trans, fs_info, (unsigned long)-1);
+   if (ret) {
+   mutex_unlock(_info->tree_log_mutex);
+   mutex_unlock(_info->reloc_mutex);
+   goto scrub_continue;
+   }
+
+   ret = btrfs_qgroup_prepare_account_extents(trans, fs_info);
+   if (ret) {
+   mutex_unlock(_info->tree_log_mutex);
+   mutex_unlock(_info->reloc_mutex);
+   goto scrub_continue;
+   }
+
+   /*
 * Since fs roots are all committed, we can get a quite accurate
 * new_roots. So let's do quota accounting.
 */
-- 
2.12.0



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 4/9] btrfs: qgroup: Add quick exit for non-fs extents

2017-03-13 Thread Qu Wenruo
For btrfs_qgroup_account_extent(), modify make it exit quicker for
non-fs extents.

This will also reduce the noise in trace_btrfs_qgroup_account_extent
event.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/qgroup.c | 41 +++--
 1 file changed, 39 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 1fe481d71be5..445014de924c 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1915,6 +1915,33 @@ static int qgroup_update_counters(struct btrfs_fs_info 
*fs_info,
return 0;
 }
 
+/*
+ * Helper to check if the @roots is a list of fs tree roots
+ * Return 0 for definitely not a fs/subvol tree roots ulist
+ * Return 1 for possible fs/subvol tree roots ulist(including empty)
+ */
+static int maybe_fs_roots(struct ulist *roots)
+{
+   struct ulist_node *unode;
+   struct ulist_iterator uiter;
+
+   /* Empty one, still possible for fs roots */
+   if (!roots || roots->nnodes == 0)
+   return 1;
+
+   ULIST_ITER_INIT();
+   unode = ulist_next(roots, );
+   if (!unode)
+   return 1;
+
+   /*
+* If it contains fs tree roots, then it must belongs to fs/subvol
+* trees.
+* If it contains non-fs tree, it won't be shared to fs/subvol trees.
+*/
+   return is_fstree(unode->val);
+}
+
 int
 btrfs_qgroup_account_extent(struct btrfs_trans_handle *trans,
struct btrfs_fs_info *fs_info,
@@ -1931,10 +1958,20 @@ btrfs_qgroup_account_extent(struct btrfs_trans_handle 
*trans,
if (!test_bit(BTRFS_FS_QUOTA_ENABLED, _info->flags))
return 0;
 
-   if (new_roots)
+   if (new_roots) {
+   if (!maybe_fs_roots(new_roots))
+   goto out_free;
nr_new_roots = new_roots->nnodes;
-   if (old_roots)
+   }
+   if (old_roots) {
+   if (!maybe_fs_roots(old_roots))
+   goto out_free;
nr_old_roots = old_roots->nnodes;
+   }
+
+   /* Quick exit, either not fs tree roots, or won't affect any qgroup */
+   if (nr_old_roots == 0 && nr_new_roots == 0)
+   goto out_free;
 
BUG_ON(!fs_info->quota_root);
 
-- 
2.12.0



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 7/9] btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write and quota enable

2017-03-13 Thread Qu Wenruo
[BUG]
Under the following case, we can underflow qgroup reserved space.

Task A|Task B
---
 Quota disabled   |
 Buffered write   |
 |- btrfs_check_data_free_space() |
 |  *NO* qgroup space is reserved |
 |  since quota is *DISABLED* |
 |- All pages are copied to page  |
cache |
  | Enable quota
  | Quota scan finished
  |
  | Sync_fs
  | |- run_delalloc_range
  | |- Write pages
  | |- btrfs_finish_ordered_io
  ||- insert_reserved_file_extent
  |   |- btrfs_qgroup_release_data()
  |  Since no qgroup space is
 reserved in Task A, we
 underflow qgroup reserved
 space
This can be detected by fstest btrfs/104.

[CAUSE]
In insert_reserved_file_extent() we info qgroup to release the @ram_bytes
size of qgroup reserved_space under all case.
And btrfs_qgroup_release_data() will check if qgroup is enabled.

However in above case, the buffered write happens before quota is
enabled, so we don't havee reserved space for that range.

[FIX]
In insert_reserved_file_extent(), we info qgroup to release the acctual
byte number it released.
In above case, since we don't have reserved space, we info qgroup to
release 0 byte, so the problem can be fixed.

And thanks to the @reserved parameter introduced by qgroup rework, and
previous patch to return release bytes, the fix can be as small as less
than 10 lines.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/inode.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c40060cc481f..36b08abff564 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2044,6 +2044,7 @@ static int insert_reserved_file_extent(struct 
btrfs_trans_handle *trans,
struct btrfs_path *path;
struct extent_buffer *leaf;
struct btrfs_key ins;
+   u64 qg_released;
int extent_inserted = 0;
int ret;
 
@@ -2099,13 +2100,17 @@ static int insert_reserved_file_extent(struct 
btrfs_trans_handle *trans,
ins.objectid = disk_bytenr;
ins.offset = disk_num_bytes;
ins.type = BTRFS_EXTENT_ITEM_KEY;
-   ret = btrfs_alloc_reserved_file_extent(trans, root->root_key.objectid,
-   btrfs_ino(BTRFS_I(inode)), file_pos, ram_bytes, );
+
/*
 * Release the reserved range from inode dirty range map, as it is
 * already moved into delayed_ref_head
 */
-   btrfs_qgroup_release_data(inode, file_pos, ram_bytes);
+   ret = btrfs_qgroup_release_data(inode, file_pos, ram_bytes);
+   if (ret < 0)
+   goto out;
+   qg_released = ret;
+   ret = btrfs_alloc_reserved_file_extent(trans, root->root_key.objectid,
+   btrfs_ino(BTRFS_I(inode)), file_pos, qg_released, );
 out:
btrfs_free_path(path);
 
-- 
2.12.0



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 2/9] btrfs: qgroup: Re-arrange tracepoint timing to co-operate with reserved space tracepoint

2017-03-13 Thread Qu Wenruo
Newly introduced qgroup reserved space trace points are normally nested
into several common qgroup operations.

While some other trace points are not well placed to co-operate with
them, causing confusing output.

This patch re-arrange trace_btrfs_qgroup_release_data() and
trace_btrfs_qgroup_free_delayed_ref() trace points so they are triggered
before reserved space ones.

Signed-off-by: Qu Wenruo 
Reviewed-by: Jeff Mahoney 
---
 fs/btrfs/qgroup.c | 10 +-
 fs/btrfs/qgroup.h |  6 +-
 2 files changed, 6 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 68a7305b0a46..1fe481d71be5 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2847,14 +2847,14 @@ static int __btrfs_qgroup_release_data(struct inode 
*inode, u64 start, u64 len,
if (ret < 0)
goto out;
 
-   if (free) {
-   btrfs_qgroup_free_refroot(BTRFS_I(inode)->root->fs_info,
-   BTRFS_I(inode)->root->objectid,
-   changeset.bytes_changed);
+   if (free)
trace_op = QGROUP_FREE;
-   }
trace_btrfs_qgroup_release_data(inode, start, len,
changeset.bytes_changed, trace_op);
+   if (free)
+   btrfs_qgroup_free_refroot(BTRFS_I(inode)->root->fs_info,
+   BTRFS_I(inode)->root->objectid,
+   changeset.bytes_changed);
 out:
ulist_release(_changed);
return ret;
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index e5b42f1044fc..19188e094d0e 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -230,15 +230,11 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans,
 struct btrfs_qgroup_inherit *inherit);
 void btrfs_qgroup_free_refroot(struct btrfs_fs_info *fs_info,
   u64 ref_root, u64 num_bytes);
-/*
- * TODO: Add proper trace point for it, as btrfs_qgroup_free() is
- * called by everywhere, can't provide good trace for delayed ref case.
- */
 static inline void btrfs_qgroup_free_delayed_ref(struct btrfs_fs_info *fs_info,
 u64 ref_root, u64 num_bytes)
 {
-   btrfs_qgroup_free_refroot(fs_info, ref_root, num_bytes);
trace_btrfs_qgroup_free_delayed_ref(fs_info, ref_root, num_bytes);
+   btrfs_qgroup_free_refroot(fs_info, ref_root, num_bytes);
 }
 void assert_qgroups_uptodate(struct btrfs_trans_handle *trans);
 
-- 
2.12.0



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 6/9] btrfs: qgroup: Return actually freed bytes for qgroup release or free data

2017-03-13 Thread Qu Wenruo
btrfs_qgroup_release/free_data() only returns 0 or minus error
number(ENOMEM is the only possible error).

This is normally good enough, but sometimes we need the accurate byte
number it freed/released.

Change it to return actually released/freed bytenr number instead of 0
for success.
And slightly modify related extent_changeset structure, since in btrfs
one none-hole data extent won't be larger than 128M, so "unsigned int"
is large enough for the use case.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent-tree.c | 2 +-
 fs/btrfs/extent_io.h   | 2 +-
 fs/btrfs/qgroup.c  | 1 +
 3 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index be5477676cc8..89d1bcba3f8c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4288,7 +4288,7 @@ int btrfs_check_data_free_space(struct inode *inode, u64 
start, u64 len)
 
/* Use new btrfs_qgroup_reserve_data to reserve precious data space. */
ret = btrfs_qgroup_reserve_data(inode, start, len);
-   if (ret)
+   if (ret < 0)
btrfs_free_reserved_data_space_noquota(inode, start, len);
return ret;
 }
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 3e4fad4a909d..bdca004dc859 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -201,7 +201,7 @@ struct extent_buffer {
  */
 struct extent_changeset {
/* How many bytes are set/cleared in this operation */
-   u64 bytes_changed;
+   unsigned int bytes_changed;
 
/* Changed ranges */
struct ulist range_changed;
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 9535ea56b6f3..87aeefb34959 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2875,6 +2875,7 @@ static int __btrfs_qgroup_release_data(struct inode 
*inode, u64 start, u64 len,
btrfs_qgroup_free_refroot(BTRFS_I(inode)->root->fs_info,
BTRFS_I(inode)->root->objectid,
changeset.bytes_changed);
+   ret = changeset.bytes_changed;
 out:
ulist_release(_changed);
return ret;
-- 
2.12.0



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error

2017-03-13 Thread Qu Wenruo



At 03/13/2017 03:26 PM, Stefan Priebe - Profihost AG wrote:

Hi Qu,

Am 13.03.2017 um 02:16 schrieb Qu Wenruo:


At 03/13/2017 04:49 AM, Stefan Priebe - Profihost AG wrote:

Hi Qu,

while V5 was running fine against the openSUSE-42.2 kernel (based on
v4.4).


Thanks for the test.


V7 results in OOPS to me:
BUG: unable to handle kernel NULL pointer dereference at 01f0


This 0x1f0 is the same as offsetof(struct brrfs_root, fs_info), quite
nice clue.


IP: [] __endio_write_update_ordered+0x33/0x140 [btrfs]


IP points to:
---
static inline bool btrfs_is_free_space_inode(struct btrfs_inode *inode)
{
struct btrfs_root *root = inode->root; << Either here

if (root == root->fs_info->tree_root && << Or here
btrfs_ino(inode) != BTRFS_BTREE_INODE_OBJECTID)

---

Taking the above offset into consideration, it's only possible for later
case.

So here, we have a btrfs_inode whose @root is NULL.


But wasn't this part of the code identical in V5? Why does it only
happen with V7?


There are still difference, but just as you said, the related 
part(checking if inode is free space cache inode) is identical across v5 
and v7.



I'm afraid that's a rare race leading to NULL btrfs_inode->root, which 
could happen in both v5 and v7.


What's the difference between SUSE and mainline kernel?
Maybe some mainline kernel commits have already fixed it?

Thanks,
Qu



This can be fixed easily by checking @root inside
btrfs_is_free_space_inode(), as the backtrace shows that it's only
happening for DirectIO, and it won't happen for free space cache inode.

But I'm more curious how this happened for a more accurate fix, or we
could have other NULL pointer access.

Did you have any reproducer for this?


Sorry no - this is a production MariaDB Server running btrfs with
compress-force=zlib. But if i could test anything i'll do.

Greets,
Stefan



Thanks,
Qu


PGD 14e18d4067 PUD 14e1868067 PMD 0
Oops:  [#1] SMP
Modules linked in: netconsole xt_multiport ipt_REJECT nf_reject_ipv4
xt_set iptable_filter ip_tables x_tables ip_set_hash_net ip_set
nfnetlink crc32_pclmul button loop btrfs xor usbhid raid6_pq ata_generic
virtio_blk virtio_net uhci_hcd ehci_hcd i2c_piix4 usbcore virtio_pci
i2c_core usb_common ata_piix floppy
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.52+112-ph #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.7.5-20140722_172050-sagunt 04/01/2014
task: b4e0f500 ti: b4e0 task.ti: b4e0
RIP: 0010:[] []
__endio_write_update_ordered+0x33/0x140 [btrfs]
RSP: 0018:8814eae03cd8 EFLAGS: 00010086
RAX:  RBX: 8814e8fd5aa8 RCX: 0001
RDX: 0010 RSI: 0010 RDI: 8814e45885c0
RBP: 8814eae03d10 R08: 8814e8334000 R09: 00018040003a
R10: ea00507d8d00 R11: 88141f634080 R12: 8814e45885c0
R13: 8814e125d700 R14: 0010 R15: 8800376c6a80
FS: () GS:8814eae0()
knlGS:
CS: 0010 DS:  ES:  CR0: 80050033
CR2: 01f0 CR3: 0014e34c9000 CR4: 001406f0Stack:
 0010 8814e8fd5aa8 8814e953f3c0
8814e125d700 0010 8800376c6a80 8814eae03d38
c03ddf67 8814e86b6a80 8814e8fd5aa8 0001
Call Trace:
[] btrfs_endio_direct_write+0x37/0x60 [btrfs]
[] bio_endio+0x57/0x60
[] btrfs_end_bio+0xa1/0x140 [btrfs]
[] bio_endio+0x57/0x60
[] blk_update_request+0x8b/0x330
[] blk_mq_end_request+0x1a/0x70
[] virtblk_request_done+0x3f/0x70 [virtio_blk]
[] __blk_mq_complete_request+0x78/0xe0
[] blk_mq_complete_request+0x1c/0x20
[] virtblk_done+0x64/0xe0 [virtio_blk]
[] vring_interrupt+0x3a/0x90
[] __handle_irq_event_percpu+0x89/0x1b0
[] handle_irq_event_percpu+0x23/0x60
[] handle_irq_event+0x3b/0x60
[] handle_edge_irq+0x6f/0x150
[] handle_irq+0x1d/0x30
[] do_IRQ+0x4b/0xd0
[] common_interrupt+0x8c/0x8c
DWARF2 unwinder stuck at ret_from_intr+0x0/0x1b
Leftover inexact backtrace:
2017-03-12 20:33:08 
2017-03-12 20:33:08  [] ? native_safe_halt+0x6/0x10
[] default_idle+0x1e/0xe0
[] arch_cpu_idle+0xf/0x20
[] default_idle_call+0x3b/0x40
[] cpu_startup_entry+0x29a/0x370
[] rest_init+0x7c/0x80
[] start_kernel+0x490/0x49d
[] ? early_idt_handler_array+0x120/0x120
[] x86_64_start_reservations+0x2a/0x2c
[] x86_64_start_kernel+0x13b/0x14a
Code: e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 83 ec 10 48 8b 87 70 fc
ff ff 4c 8b 87 38 fe ff ff 48 c7 45 c8 00 00 00 00 48 89 75 d0 <48> 8b
b8 f0 01 00 00 48 3b 47 28 49 8b 84 24 78 fc ff ff 0f 84
RIP [] __endio_write_update_ordered+0x33/0x140 [btrfs]
RSP 
CR2: 01f0
---[ end trace 7529a0652fd7873e ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: 0x3300 from 0x8100 (relocation range:
0x8000-0xbfff)

Greets,
Stefan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to 

[PATCH 3/4] btrfs: cleanup barrier_all_devices() unify dev error count

2017-03-13 Thread Anand Jain
Now when counting number of error devices we don't need to count
them separately once during send and wait, as because device error
counted during send is more of static check.

Also kindly note that as of now there is no code which would set
dev->bdev = NULL unless device is missing. However I still kept
bdev == NULL counted towards error device in view of future
enhancements. And as the device_list_mutex is held when
barrier_all_devices() is called, I don't expect a new bdev to null
in between send and wait.

Now in this process I also rename error_wait to dropouts.

Signed-off-by: Anand Jain 
---
 fs/btrfs/disk-io.c | 14 +-
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index ee3e601da511..5719e036048b 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3574,8 +3574,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 {
struct list_head *head;
struct btrfs_device *dev;
-   int errors_send = 0;
-   int errors_wait = 0;
+   int dropouts = 0;
int ret;
 
/* send down all the barriers */
@@ -3583,10 +3582,8 @@ static int barrier_all_devices(struct btrfs_fs_info 
*info)
list_for_each_entry_rcu(dev, head, dev_list) {
if (dev->missing)
continue;
-   if (!dev->bdev) {
-   errors_send++;
+   if (!dev->bdev)
continue;
-   }
if (!dev->in_fs_metadata || !dev->writeable)
continue;
 
@@ -3600,7 +3597,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
if (dev->missing)
continue;
if (!dev->bdev) {
-   errors_wait++;
+   dropouts++;
continue;
}
if (!dev->in_fs_metadata || !dev->writeable)
@@ -3608,10 +3605,9 @@ static int barrier_all_devices(struct btrfs_fs_info 
*info)
 
ret = write_dev_flush(dev, 1);
if (ret)
-   errors_wait++;
+   dropouts++;
}
-   if (errors_send > info->num_tolerated_disk_barrier_failures ||
-   errors_wait > info->num_tolerated_disk_barrier_failures)
+   if (dropouts > info->num_tolerated_disk_barrier_failures)
return -EIO;
return 0;
 }
-- 
2.10.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/4] btrfs: Communicate back ENOMEM when it occurs

2017-03-13 Thread Anand Jain
The only error that write dev flush (send) will fail is due
to the ENOMEM then, as its not a device specific error and
rather a system wide issue, we should rather stop further
iterations and perpetuate the -ENOMEM error to the caller.

Signed-off-by: Anand Jain 
---
 fs/btrfs/disk-io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 08b74daf35d0..ee3e601da511 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3592,7 +3592,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 
ret = write_dev_flush(dev, 0);
if (ret)
-   errors_send++;
+   return ret;
}
 
/* wait for all the barriers */
-- 
2.10.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/4] btrfs: cleanup barrier_all_devices() to check dev stat flush error

2017-03-13 Thread Anand Jain
The objective of this patch is to cleanup barrier_all_devices()
so that the error checking is in a separate loop independent of
of the loop which submits and waits on the device flush requests.

By doing this it helps to further develop patches which would tune
the error-actions as needed.

Here functions such as btrfs_dev_stats_dirty() couldn't be used
because it doesn't monitor the flush errors BTRFS_DEV_STAT_FLUSH_ERRS.

Signed-off-by: Anand Jain 
---
 fs/btrfs/disk-io.c | 96 +++---
 1 file changed, 85 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 5719e036048b..12531a5b14ff 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3566,6 +3566,76 @@ static int write_dev_flush(struct btrfs_device *device, 
int wait)
return 0;
 }
 
+struct device_checkpoint {
+   struct list_head list;
+   struct btrfs_device *device;
+   int stat_value_checkpoint;
+};
+
+static int add_device_checkpoint(struct list_head *checkpoint,
+   struct btrfs_device *device)
+{
+   struct device_checkpoint *cdev =
+   kzalloc(sizeof(struct device_checkpoint), GFP_KERNEL);
+   if (!cdev)
+   return -ENOMEM;
+
+   list_add(>list, checkpoint);
+
+   cdev->device = device;
+   cdev->stat_value_checkpoint =
+   btrfs_dev_stat_read(device, BTRFS_DEV_STAT_FLUSH_ERRS);
+
+   return 0;
+}
+
+static void fini_devices_checkpoint(struct list_head *checkpoint)
+{
+   struct device_checkpoint *cdev;
+
+   while(!list_empty(checkpoint)) {
+   cdev = list_entry(checkpoint->next,
+   struct device_checkpoint, list);
+   list_del(>list);
+   kfree(cdev);
+   }
+}
+
+static int check_stat_flush(struct btrfs_device *dev,
+   struct list_head *checkpoint)
+{
+   int val;
+   struct device_checkpoint *cdev;
+
+   list_for_each_entry(cdev, checkpoint, list) {
+   if (cdev->device == dev) {
+   val = btrfs_dev_stat_read(dev,
+   BTRFS_DEV_STAT_FLUSH_ERRS);
+   if (cdev->stat_value_checkpoint != val)
+   return 1;
+   }
+   }
+   return 0;
+}
+
+static int check_barrier_error(struct btrfs_fs_devices *fsdevs,
+   struct list_head *checkpoint)
+{
+   int dropouts = 0;
+   struct btrfs_device *dev;
+
+   list_for_each_entry_rcu(dev, >devices, dev_list) {
+   if (!dev->bdev || check_stat_flush(dev, checkpoint))
+   dropouts++;
+   }
+
+   if (dropouts >
+   fsdevs->fs_info->num_tolerated_disk_barrier_failures)
+   return -EIO;
+
+   return 0;
+}
+
 /*
  * send an empty flush down to each device in parallel,
  * then wait for them
@@ -3574,8 +3644,10 @@ static int barrier_all_devices(struct btrfs_fs_info 
*info)
 {
struct list_head *head;
struct btrfs_device *dev;
-   int dropouts = 0;
int ret;
+   struct list_head checkpoint;
+
+   INIT_LIST_HEAD();
 
/* send down all the barriers */
head = >fs_devices->devices;
@@ -3587,29 +3659,31 @@ static int barrier_all_devices(struct btrfs_fs_info 
*info)
if (!dev->in_fs_metadata || !dev->writeable)
continue;
 
+   add_device_checkpoint(, dev);
ret = write_dev_flush(dev, 0);
-   if (ret)
+   if (ret) {
+   fini_devices_checkpoint();
return ret;
+   }
}
 
/* wait for all the barriers */
list_for_each_entry_rcu(dev, head, dev_list) {
if (dev->missing)
continue;
-   if (!dev->bdev) {
-   dropouts++;
+   if (!dev->bdev)
continue;
-   }
if (!dev->in_fs_metadata || !dev->writeable)
continue;
 
-   ret = write_dev_flush(dev, 1);
-   if (ret)
-   dropouts++;
+   write_dev_flush(dev, 1);
}
-   if (dropouts > info->num_tolerated_disk_barrier_failures)
-   return -EIO;
-   return 0;
+
+   ret = check_barrier_error(info->fs_devices, );
+
+   fini_devices_checkpoint();
+
+   return ret;
 }
 
 int btrfs_get_num_tolerated_disk_barrier_failures(u64 flags)
-- 
2.10.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/4] btrfs: REQ_PREFLUSH does not use btrfs_end_bio() completion callback

2017-03-13 Thread Anand Jain
REQ_PREFLUSH bio to flush dev cache uses btrfs_end_empty_barrier()
completion callback only, as of now, and there it accounts for dev
stat flush errors BTRFS_DEV_STAT_FLUSH_ERRS, so remove it from the
btrfs_end_bio().

Signed-off-by: Anand Jain 
---
 fs/btrfs/volumes.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 73d56eef5e60..1563ae03079b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6010,9 +6010,6 @@ static void btrfs_end_bio(struct bio *bio)
else
btrfs_dev_stat_inc(dev,
BTRFS_DEV_STAT_READ_ERRS);
-   if (bio->bi_opf & REQ_PREFLUSH)
-   btrfs_dev_stat_inc(dev,
-   BTRFS_DEV_STAT_FLUSH_ERRS);
btrfs_dev_stat_print_on_error(dev);
}
}
-- 
2.10.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/4] cleanup barrier_all_devices()

2017-03-13 Thread Anand Jain
These patches provide cleanup of the function barrier_all_devices().

Anand Jain (4):
  btrfs: REQ_PREFLUSH does not use btrfs_end_bio() completion callback
  btrfs: Communicate back ENOMEM when it occurs
  btrfs: cleanup barrier_all_devices() unify dev error count
  btrfs: cleanup barrier_all_devices() to check dev stat flush error

-- 
2.10.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error

2017-03-13 Thread Stefan Priebe - Profihost AG
Hi Qu,

Am 13.03.2017 um 02:16 schrieb Qu Wenruo:
> 
> At 03/13/2017 04:49 AM, Stefan Priebe - Profihost AG wrote:
>> Hi Qu,
>>
>> while V5 was running fine against the openSUSE-42.2 kernel (based on
>> v4.4).
> 
> Thanks for the test.
> 
>> V7 results in OOPS to me:
>> BUG: unable to handle kernel NULL pointer dereference at 01f0
> 
> This 0x1f0 is the same as offsetof(struct brrfs_root, fs_info), quite
> nice clue.
> 
>> IP: [] __endio_write_update_ordered+0x33/0x140 [btrfs]
> 
> IP points to:
> ---
> static inline bool btrfs_is_free_space_inode(struct btrfs_inode *inode)
> {
> struct btrfs_root *root = inode->root; << Either here
> 
> if (root == root->fs_info->tree_root && << Or here
> btrfs_ino(inode) != BTRFS_BTREE_INODE_OBJECTID)
> 
> ---
> 
> Taking the above offset into consideration, it's only possible for later
> case.
> 
> So here, we have a btrfs_inode whose @root is NULL.

But wasn't this part of the code identical in V5? Why does it only
happen with V7?

> This can be fixed easily by checking @root inside
> btrfs_is_free_space_inode(), as the backtrace shows that it's only
> happening for DirectIO, and it won't happen for free space cache inode.
> 
> But I'm more curious how this happened for a more accurate fix, or we
> could have other NULL pointer access.
> 
> Did you have any reproducer for this?

Sorry no - this is a production MariaDB Server running btrfs with
compress-force=zlib. But if i could test anything i'll do.

Greets,
Stefan

> 
> Thanks,
> Qu
> 
>> PGD 14e18d4067 PUD 14e1868067 PMD 0
>> Oops:  [#1] SMP
>> Modules linked in: netconsole xt_multiport ipt_REJECT nf_reject_ipv4
>> xt_set iptable_filter ip_tables x_tables ip_set_hash_net ip_set
>> nfnetlink crc32_pclmul button loop btrfs xor usbhid raid6_pq ata_generic
>> virtio_blk virtio_net uhci_hcd ehci_hcd i2c_piix4 usbcore virtio_pci
>> i2c_core usb_common ata_piix floppy
>> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.52+112-ph #1
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>> 1.7.5-20140722_172050-sagunt 04/01/2014
>> task: b4e0f500 ti: b4e0 task.ti: b4e0
>> RIP: 0010:[] []
>> __endio_write_update_ordered+0x33/0x140 [btrfs]
>> RSP: 0018:8814eae03cd8 EFLAGS: 00010086
>> RAX:  RBX: 8814e8fd5aa8 RCX: 0001
>> RDX: 0010 RSI: 0010 RDI: 8814e45885c0
>> RBP: 8814eae03d10 R08: 8814e8334000 R09: 00018040003a
>> R10: ea00507d8d00 R11: 88141f634080 R12: 8814e45885c0
>> R13: 8814e125d700 R14: 0010 R15: 8800376c6a80
>> FS: () GS:8814eae0()
>> knlGS:
>> CS: 0010 DS:  ES:  CR0: 80050033
>> CR2: 01f0 CR3: 0014e34c9000 CR4: 001406f0Stack:
>>  0010 8814e8fd5aa8 8814e953f3c0
>> 8814e125d700 0010 8800376c6a80 8814eae03d38
>> c03ddf67 8814e86b6a80 8814e8fd5aa8 0001
>> Call Trace:
>> [] btrfs_endio_direct_write+0x37/0x60 [btrfs]
>> [] bio_endio+0x57/0x60
>> [] btrfs_end_bio+0xa1/0x140 [btrfs]
>> [] bio_endio+0x57/0x60
>> [] blk_update_request+0x8b/0x330
>> [] blk_mq_end_request+0x1a/0x70
>> [] virtblk_request_done+0x3f/0x70 [virtio_blk]
>> [] __blk_mq_complete_request+0x78/0xe0
>> [] blk_mq_complete_request+0x1c/0x20
>> [] virtblk_done+0x64/0xe0 [virtio_blk]
>> [] vring_interrupt+0x3a/0x90
>> [] __handle_irq_event_percpu+0x89/0x1b0
>> [] handle_irq_event_percpu+0x23/0x60
>> [] handle_irq_event+0x3b/0x60
>> [] handle_edge_irq+0x6f/0x150
>> [] handle_irq+0x1d/0x30
>> [] do_IRQ+0x4b/0xd0
>> [] common_interrupt+0x8c/0x8c
>> DWARF2 unwinder stuck at ret_from_intr+0x0/0x1b
>> Leftover inexact backtrace:
>> 2017-03-12 20:33:08 
>> 2017-03-12 20:33:08  [] ? native_safe_halt+0x6/0x10
>> [] default_idle+0x1e/0xe0
>> [] arch_cpu_idle+0xf/0x20
>> [] default_idle_call+0x3b/0x40
>> [] cpu_startup_entry+0x29a/0x370
>> [] rest_init+0x7c/0x80
>> [] start_kernel+0x490/0x49d
>> [] ? early_idt_handler_array+0x120/0x120
>> [] x86_64_start_reservations+0x2a/0x2c
>> [] x86_64_start_kernel+0x13b/0x14a
>> Code: e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 83 ec 10 48 8b 87 70 fc
>> ff ff 4c 8b 87 38 fe ff ff 48 c7 45 c8 00 00 00 00 48 89 75 d0 <48> 8b
>> b8 f0 01 00 00 48 3b 47 28 49 8b 84 24 78 fc ff ff 0f 84
>> RIP [] __endio_write_update_ordered+0x33/0x140 [btrfs]
>> RSP 
>> CR2: 01f0
>> ---[ end trace 7529a0652fd7873e ]---
>> Kernel panic - not syncing: Fatal exception in interrupt
>> Kernel Offset: 0x3300 from 0x8100 (relocation range:
>> 0x8000-0xbfff)
>>
>> Greets,
>> Stefan
>> -- 
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
> 
> 
--
To unsubscribe from this list: send the 

Re: [PATCH v3.1 1/7] btrfs: Introduce a function to check if all chunks a OK for degraded rw mount

2017-03-13 Thread Qu Wenruo



At 03/13/2017 03:29 PM, Anand Jain wrote:



On 03/09/2017 09:34 AM, Qu Wenruo wrote:

Introduce a new function, btrfs_check_rw_degradable(), to check if all
chunks in btrfs is OK for degraded rw mount.

It provides the new basis for accurate btrfs mount/remount and even
runtime degraded mount check other than old one-size-fit-all method.

Signed-off-by: Qu Wenruo 
Signed-off-by: Anand Jain 
Tested-by: Austin S. Hemmelgarn 
Tested-by: Adam Borowski 
Tested-by: Dmitrii Tcvetkov 
---
 fs/btrfs/volumes.c | 55
++
 fs/btrfs/volumes.h |  1 +
 2 files changed, 56 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 73d56eef5e60..83613955e3c2 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6765,6 +6765,61 @@ int btrfs_read_sys_array(struct btrfs_fs_info
*fs_info)
 return -EIO;
 }

+/*
+ * Check if all chunks in the fs is OK for read-write degraded mount
+ *
+ * Return true if the fs is OK to be mounted degraded read-write
+ * Return false if the fs is not OK to be mounted degraded
+ */
+bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info)
+{
+struct btrfs_mapping_tree *map_tree = _info->mapping_tree;
+struct extent_map *em;
+u64 next_start = 0;
+bool ret = true;
+
+read_lock(_tree->map_tree.lock);
+em = lookup_extent_mapping(_tree->map_tree, 0, (u64)-1);
+read_unlock(_tree->map_tree.lock);
+/* No chunk at all? Return false anyway */
+if (!em) {
+ret = false;
+goto out;
+}
+while (em) {
+struct map_lookup *map;
+int missing = 0;
+int max_tolerated;
+int i;
+
+map = (struct map_lookup *) em->bdev;



   any idea why not   map = em->map_lookup;  here?



My fault, will update the patch.

Thanks,
Qu


Thanks, Anand



+max_tolerated =
+btrfs_get_num_tolerated_disk_barrier_failures(
+map->type);
+for (i = 0; i < map->num_stripes; i++) {
+if (map->stripes[i].dev->missing)
+missing++;
+}
+if (missing > max_tolerated) {
+ret = false;
+btrfs_warn(fs_info,
+"chunk %llu missing %d devices, max tolerance is %d for writeble
mount",
+   em->start, missing, max_tolerated);
+free_extent_map(em);
+goto out;
+}
+next_start = extent_map_end(em);
+free_extent_map(em);
+
+read_lock(_tree->map_tree.lock);
+em = lookup_extent_mapping(_tree->map_tree, next_start,
+   (u64)(-1) - next_start);
+read_unlock(_tree->map_tree.lock);
+}
+out:
+return ret;
+}
+
 int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info)
 {
 struct btrfs_root *root = fs_info->chunk_root;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 59be81206dd7..db1b5ef479cf 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -538,4 +538,5 @@ struct list_head *btrfs_get_fs_uuids(void);
 void btrfs_set_fs_info_ptr(struct btrfs_fs_info *fs_info);
 void btrfs_reset_fs_info_ptr(struct btrfs_fs_info *fs_info);

+bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info);
 #endif







--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3.1 1/7] btrfs: Introduce a function to check if all chunks a OK for degraded rw mount

2017-03-13 Thread Anand Jain



On 03/09/2017 09:34 AM, Qu Wenruo wrote:

Introduce a new function, btrfs_check_rw_degradable(), to check if all
chunks in btrfs is OK for degraded rw mount.

It provides the new basis for accurate btrfs mount/remount and even
runtime degraded mount check other than old one-size-fit-all method.

Signed-off-by: Qu Wenruo 
Signed-off-by: Anand Jain 
Tested-by: Austin S. Hemmelgarn 
Tested-by: Adam Borowski 
Tested-by: Dmitrii Tcvetkov 
---
 fs/btrfs/volumes.c | 55 ++
 fs/btrfs/volumes.h |  1 +
 2 files changed, 56 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 73d56eef5e60..83613955e3c2 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6765,6 +6765,61 @@ int btrfs_read_sys_array(struct btrfs_fs_info *fs_info)
return -EIO;
 }

+/*
+ * Check if all chunks in the fs is OK for read-write degraded mount
+ *
+ * Return true if the fs is OK to be mounted degraded read-write
+ * Return false if the fs is not OK to be mounted degraded
+ */
+bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info)
+{
+   struct btrfs_mapping_tree *map_tree = _info->mapping_tree;
+   struct extent_map *em;
+   u64 next_start = 0;
+   bool ret = true;
+
+   read_lock(_tree->map_tree.lock);
+   em = lookup_extent_mapping(_tree->map_tree, 0, (u64)-1);
+   read_unlock(_tree->map_tree.lock);
+   /* No chunk at all? Return false anyway */
+   if (!em) {
+   ret = false;
+   goto out;
+   }
+   while (em) {
+   struct map_lookup *map;
+   int missing = 0;
+   int max_tolerated;
+   int i;
+
+   map = (struct map_lookup *) em->bdev;



   any idea why not   map = em->map_lookup;  here?

Thanks, Anand



+   max_tolerated =
+   btrfs_get_num_tolerated_disk_barrier_failures(
+   map->type);
+   for (i = 0; i < map->num_stripes; i++) {
+   if (map->stripes[i].dev->missing)
+   missing++;
+   }
+   if (missing > max_tolerated) {
+   ret = false;
+   btrfs_warn(fs_info,
+   "chunk %llu missing %d devices, max tolerance is %d for writeble mount",
+  em->start, missing, max_tolerated);
+   free_extent_map(em);
+   goto out;
+   }
+   next_start = extent_map_end(em);
+   free_extent_map(em);
+
+   read_lock(_tree->map_tree.lock);
+   em = lookup_extent_mapping(_tree->map_tree, next_start,
+  (u64)(-1) - next_start);
+   read_unlock(_tree->map_tree.lock);
+   }
+out:
+   return ret;
+}
+
 int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info)
 {
struct btrfs_root *root = fs_info->chunk_root;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 59be81206dd7..db1b5ef479cf 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -538,4 +538,5 @@ struct list_head *btrfs_get_fs_uuids(void);
 void btrfs_set_fs_info_ptr(struct btrfs_fs_info *fs_info);
 void btrfs_reset_fs_info_ptr(struct btrfs_fs_info *fs_info);

+bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info);
 #endif


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html