Re: [PATCH] btrfs/139: creation/deletion within qgroup limits
[please cc linux-btrfs@vger.kernel.org for btrfs specific tests] On Mon, Mar 13, 2017 at 04:37:16PM -0500, Goldwyn Rodrigues wrote: > From: Goldwyn Rodrigues> > If we create and delete files within the qgroup limits, qg->reserved > (allocations before commits) over-inflates and causes -EDQUOT to > be returned pre-maturely. > > Also, 32/64bit data-type exchanges can cause reserved (u64) > to go negative (very large) and -EDQUOT is returned > pre-maturely. > > Signed-off-by: Goldwyn Rodrigues Thanks for the test! I tested it with 4.10 kernel and test failed. I assume it's expected result. Some comments inline. > --- > tests/btrfs/139 | 75 + > tests/btrfs/139.out | 156 > > tests/btrfs/group | 1 + > 3 files changed, 232 insertions(+) > create mode 100755 tests/btrfs/139 > create mode 100644 tests/btrfs/139.out > > diff --git a/tests/btrfs/139 b/tests/btrfs/139 > new file mode 100755 > index 000..df0ef3e > --- /dev/null > +++ b/tests/btrfs/139 > @@ -0,0 +1,75 @@ > +#! /bin/bash > +# FS QA Test 139 > +# > +# Check if btrfs quota limits are not reached when you constantly > +# create and delete files within the exclusive qgroup limits. > +# > +# Finally we create files to exceed the quota. > +# > +#--- > +# Copyright (c) 2017 SUSE. All Rights Reserved. > +# > +# This program is free software; you can redistribute it and/or > +# modify it under the terms of the GNU General Public License as > +# published by the Free Software Foundation. > +# > +# This program is distributed in the hope that it would be useful, > +# but WITHOUT ANY WARRANTY; without even the implied warranty of > +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > +# GNU General Public License for more details. > +# > +# You should have received a copy of the GNU General Public License > +# along with this program; if not, write the Free Software Foundation, > +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA > +#--- > +# > + > +seq=`basename $0` > +seqres=$RESULT_DIR/$seq > +echo "QA output created by $seq" > + > +here=`pwd` > +tmp=/tmp/$$ > +status=1 # failure is the default! > +trap "_cleanup; exit \$status" 0 1 2 3 15 > + > +_cleanup() > +{ > + cd / > + rm -f $tmp.* > +} > + > +# get standard environment, filters and checks > +. ./common/rc > +. ./common/filter > + > +# remove previous $seqres.full before test > +rm -f $seqres.full > + > +_supported_fs btrfs > +_supported_os Linux Need _require_scratch, it check if SCRATCH_DEV is present and unmounts it if so. > + > +_scratch_mkfs > /dev/null 2>&1 > +#_scratch_mount Then need to _scratch_mount here. > + > +SUBVOL=$SCRATCH_MNT/subvol > + > +_run_btrfs_util_prog subvolume create $SUBVOL > +_run_btrfs_util_prog quota enable $SCRATCH_MNT > +_run_btrfs_util_prog quota rescan -w $SCRATCH_MNT I always see this message in dmesg, not sure if it affects test? BTRFS info (device sdc2): qgroup_rescan_init failed with -115 > +_run_btrfs_util_prog qgroup limit -e 1G $SUBVOL > + > +for i in $(seq 1 10); do > + for j in $(seq 1 7); do > + $XFS_IO_PROG -f -c "pwrite 0 128m" $SUBVOL/file_$j | > _filter_xfs_io No need to filter & print xfs_io stdout, it makes the .out file not necessarily too long. We can discard stdout, so if error happens the stderr output could break the .out file match anyway. $XFS_IO_PROG -f -c "pwrite 0 128m" $SUBVOL/file_$j >/dev/null > + done > + rm -f $SUBVOL/file* > +done > + > +for j in $(seq 1 8); do > + $XFS_IO_PROG -f -c "pwrite 0 128m" $SUBVOL/file_$j | _filter_xfs_io pwrite error is expected here, and newer xfs_io only prints "pwrite" not "pwrite64", so _filter_xfs_io_error should be used to filter the error message from stderr (not stdout). $XFS_IO_PROG -f -c "pwrite 0 128m" $SUBVOL/file_$j 2>&1 | _filter_xfs_io_error > +done Better to add comments about these two for loops, what is the purpose of each loop etc. Thanks, Eryu -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 1/2] btrfs-progs: dump-tree: Fix duplicated output when using -t option
When using -t option to output trees not in root tree (chunk/root/log root), then we output the tree twice. Fix it Signed-off-by: Qu Wenruo--- v2: None --- cmds-inspect-dump-tree.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/cmds-inspect-dump-tree.c b/cmds-inspect-dump-tree.c index 6c6c0e5c..2c6bec7f 100644 --- a/cmds-inspect-dump-tree.c +++ b/cmds-inspect-dump-tree.c @@ -371,21 +371,21 @@ again: if (tree_id && tree_id == BTRFS_ROOT_TREE_OBJECTID) { if (!info->tree_root->node) { error("cannot print root tree, invalid pointer"); - goto no_node; + goto close_root; } printf("root tree\n"); btrfs_print_tree(info->tree_root, info->tree_root->node, 1); - goto no_node; + goto close_root; } if (tree_id && tree_id == BTRFS_CHUNK_TREE_OBJECTID) { if (!info->chunk_root->node) { error("cannot print chunk tree, invalid pointer"); - goto no_node; + goto close_root; } printf("chunk tree\n"); btrfs_print_tree(info->chunk_root, info->chunk_root->node, 1); - goto no_node; + goto close_root; } key.offset = 0; -- 2.12.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 2/2] btrfs-progs: dump-tree: Also output log root tree
In btrfs-dump-tree, we output any existing log tree, however we don't output the log root tree, which records all root items for log trees. This makes it confusing for any one who want to know where the log tree comes from. Signed-off-by: Qu Wenruo--- v2: Check if log_root_tree exists before printing it in root_only mode --- cmds-inspect-dump-tree.c | 22 ++ 1 file changed, 22 insertions(+) diff --git a/cmds-inspect-dump-tree.c b/cmds-inspect-dump-tree.c index 2c6bec7f..5a5ca536 100644 --- a/cmds-inspect-dump-tree.c +++ b/cmds-inspect-dump-tree.c @@ -344,6 +344,11 @@ int cmd_inspect_dump_tree(int argc, char **argv) printf("chunk tree: %llu level %d\n", (unsigned long long)info->chunk_root->node->start, btrfs_header_level(info->chunk_root->node)); + if (info->log_root_tree) + printf("log root tree: %llu level %d\n", + info->log_root_tree->node->start, + btrfs_header_level( + info->log_root_tree->node)); } else { if (info->tree_root->node) { printf("root tree\n"); @@ -356,6 +361,12 @@ int cmd_inspect_dump_tree(int argc, char **argv) btrfs_print_tree(info->chunk_root, info->chunk_root->node, 1); } + + if (info->log_root_tree) { + printf("log root tree\n"); + btrfs_print_tree(info->log_root_tree, + info->log_root_tree->node, 1); + } } } tree_root_scan = info->tree_root; @@ -388,6 +399,17 @@ again: goto close_root; } + if (tree_id && tree_id == BTRFS_TREE_LOG_OBJECTID) { + if (!info->log_root_tree) { + error("cannot print log root tree, invalid pointer"); + goto close_root; + } + printf("log root tree\n"); + btrfs_print_tree(info->log_root_tree, info->log_root_tree->node, +1); + goto close_root; + } + key.offset = 0; key.objectid = 0; key.type = BTRFS_ROOT_ITEM_KEY; -- 2.12.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] btrfs-progs: dump-tree: Fix duplicated output when using -t option
When using -t option to output trees not in root tree (chunk/root/log root), then we output the tree twice. Fix it Signed-off-by: Qu Wenruo--- cmds-inspect-dump-tree.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/cmds-inspect-dump-tree.c b/cmds-inspect-dump-tree.c index 6c6c0e5c..2c6bec7f 100644 --- a/cmds-inspect-dump-tree.c +++ b/cmds-inspect-dump-tree.c @@ -371,21 +371,21 @@ again: if (tree_id && tree_id == BTRFS_ROOT_TREE_OBJECTID) { if (!info->tree_root->node) { error("cannot print root tree, invalid pointer"); - goto no_node; + goto close_root; } printf("root tree\n"); btrfs_print_tree(info->tree_root, info->tree_root->node, 1); - goto no_node; + goto close_root; } if (tree_id && tree_id == BTRFS_CHUNK_TREE_OBJECTID) { if (!info->chunk_root->node) { error("cannot print chunk tree, invalid pointer"); - goto no_node; + goto close_root; } printf("chunk tree\n"); btrfs_print_tree(info->chunk_root, info->chunk_root->node, 1); - goto no_node; + goto close_root; } key.offset = 0; -- 2.12.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/2] btrfs-progs: dump-tree: Also output log root tree
In btrfs-dump-tree, we output any existing log tree, however we don't output the log root tree, which records all root items for log trees. This makes it confusing for any one who want to know where the log tree comes from. Signed-off-by: Qu Wenruo--- cmds-inspect-dump-tree.c | 20 1 file changed, 20 insertions(+) diff --git a/cmds-inspect-dump-tree.c b/cmds-inspect-dump-tree.c index 2c6bec7f..eca91b5e 100644 --- a/cmds-inspect-dump-tree.c +++ b/cmds-inspect-dump-tree.c @@ -344,6 +344,9 @@ int cmd_inspect_dump_tree(int argc, char **argv) printf("chunk tree: %llu level %d\n", (unsigned long long)info->chunk_root->node->start, btrfs_header_level(info->chunk_root->node)); + printf("log root tree: %llu level %d\n", +(unsigned long long)info->log_root_tree->node->start, +btrfs_header_level(info->log_root_tree->node)); } else { if (info->tree_root->node) { printf("root tree\n"); @@ -356,6 +359,12 @@ int cmd_inspect_dump_tree(int argc, char **argv) btrfs_print_tree(info->chunk_root, info->chunk_root->node, 1); } + + if (info->log_root_tree) { + printf("log root tree\n"); + btrfs_print_tree(info->log_root_tree, + info->log_root_tree->node, 1); + } } } tree_root_scan = info->tree_root; @@ -388,6 +397,17 @@ again: goto close_root; } + if (tree_id && tree_id == BTRFS_TREE_LOG_OBJECTID) { + if (!info->log_root_tree) { + error("cannot print log root tree, invalid pointer"); + goto close_root; + } + printf("log root tree\n"); + btrfs_print_tree(info->log_root_tree, info->log_root_tree->node, +1); + goto close_root; + } + key.offset = 0; key.objectid = 0; key.type = BTRFS_ROOT_ITEM_KEY; -- 2.12.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] btrfs: cleanup barrier_all_devices() to check dev stat flush error
+struct device_checkpoint { +struct list_head list; +struct btrfs_device *device; +int stat_value_checkpoint; +}; + +static int add_device_checkpoint(struct list_head *checkpoint, Could we have another structure instead of list_head to record device checkpoints? The list_head is never a meaningful structure under most cases. I didn't understand this, there is device_checkpoint and the context of struct list_head *checkpoint would start and end within barrier_all_devices(). I just mean to encapsulate the list_head into another structure, e.g, call it device_checkpoints or something else. Just a list_head is not quite meaningful. OK. Will do. +static int check_stat_flush(struct btrfs_device *dev, +struct list_head *checkpoint) +{ +int val; +struct device_checkpoint *cdev; + +list_for_each_entry(cdev, checkpoint, list) { +if (cdev->device == dev) { +val = btrfs_dev_stat_read(dev, +BTRFS_DEV_STAT_FLUSH_ERRS); +if (cdev->stat_value_checkpoint != val) +return 1; This check implies that BTRFS_DEV_STAT_FLUSH_ERRS will only be modified by checkpoint related code. Or any other modifier can easily break the check, causing false alert. I have already checked it, its not possible with the current code, or do you see if that is possible with the current code ? or Did I miss something ? You didn't miss anything. However I just got the same comment on better not to play around something inside btrfs_device. So I'm not sure if it's good to do it this time just for cleaning up barrier_all_devices(). Right. The reason I said that before was first of all the concept of err_send and err_wait wasn't needed as shown in the patch set 1 to 3 in this patch-set. We have the dev_stat flush which keeps track of the flush error in the right way. Next to monitor the change in the flush error instead of checkpoint probably dev_stat_ccnt is the correct way (which in the long term we would need that kind on the per error-stat for rest of the error-stat including flush). But unless we have all the requirements well understood from the other pending stuff like device disappear and reappear I won't change the struct btrfs_devices as of now. Hope this clarifies better. Will send out the patch with the review comment incorporated. Thanks. On top of which per chunk missing device check patch can be added. Thanks, Anand -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v7 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
At 03/13/2017 09:26 PM, Stefan Priebe - Profihost AG wrote: Am 13.03.2017 um 08:39 schrieb Qu Wenruo: At 03/13/2017 03:26 PM, Stefan Priebe - Profihost AG wrote: Hi Qu, Am 13.03.2017 um 02:16 schrieb Qu Wenruo: But wasn't this part of the code identical in V5? Why does it only happen with V7? There are still difference, but just as you said, the related part(checking if inode is free space cache inode) is identical across v5 and v7. But if i boot v7 it always happens. If i boot v5 it always works. Have done 5 repeatet tests. I rechecked the code change between v7 and v5. It turns out that, the code base may cause the problem. In v7, the base is v4.11-rc1, which introduced quite a lot of btrfs_inode cleanup. One of the difference is the parameter for btrfs_is_free_space_inode(). In v7, the parameter @inode changed from struct inode to struct btrfs_inode. So in v7, we're passing BTRFS_I(inode) to btrfs_is_free_space_inode(), other than plain inode. That's the most possible cause for me here. So would you please paste the final patch applied to your tree? Git diff or git format-patch can both handle it. Thanks, Qu I'm afraid that's a rare race leading to NULL btrfs_inode->root, which could happen in both v5 and v7. What's the difference between SUSE and mainline kernel? A lot ;-) But i don't think anything related. Maybe some mainline kernel commits have already fixed it? May be no idea. But i haven't found any reason why v5 works. Stefan Thanks, Qu -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v7 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
At 03/13/2017 09:26 PM, Stefan Priebe - Profihost AG wrote: Am 13.03.2017 um 08:39 schrieb Qu Wenruo: At 03/13/2017 03:26 PM, Stefan Priebe - Profihost AG wrote: Hi Qu, Am 13.03.2017 um 02:16 schrieb Qu Wenruo: At 03/13/2017 04:49 AM, Stefan Priebe - Profihost AG wrote: Hi Qu, while V5 was running fine against the openSUSE-42.2 kernel (based on v4.4). Thanks for the test. V7 results in OOPS to me: BUG: unable to handle kernel NULL pointer dereference at 01f0 This 0x1f0 is the same as offsetof(struct brrfs_root, fs_info), quite nice clue. IP: [] __endio_write_update_ordered+0x33/0x140 [btrfs] IP points to: --- static inline bool btrfs_is_free_space_inode(struct btrfs_inode *inode) { struct btrfs_root *root = inode->root; << Either here if (root == root->fs_info->tree_root && << Or here btrfs_ino(inode) != BTRFS_BTREE_INODE_OBJECTID) --- Taking the above offset into consideration, it's only possible for later case. So here, we have a btrfs_inode whose @root is NULL. But wasn't this part of the code identical in V5? Why does it only happen with V7? There are still difference, but just as you said, the related part(checking if inode is free space cache inode) is identical across v5 and v7. But if i boot v7 it always happens. If i boot v5 it always works. Have done 5 repeatet tests. That's a problem now, I'll dig it further to find out the difference. Thanks, Qu I'm afraid that's a rare race leading to NULL btrfs_inode->root, which could happen in both v5 and v7. What's the difference between SUSE and mainline kernel? A lot ;-) But i don't think anything related. Maybe some mainline kernel commits have already fixed it? May be no idea. But i haven't found any reason why v5 works. Stefan Thanks, Qu This can be fixed easily by checking @root inside btrfs_is_free_space_inode(), as the backtrace shows that it's only happening for DirectIO, and it won't happen for free space cache inode. But I'm more curious how this happened for a more accurate fix, or we could have other NULL pointer access. Did you have any reproducer for this? Sorry no - this is a production MariaDB Server running btrfs with compress-force=zlib. But if i could test anything i'll do. Greets, Stefan Thanks, Qu PGD 14e18d4067 PUD 14e1868067 PMD 0 Oops: [#1] SMP Modules linked in: netconsole xt_multiport ipt_REJECT nf_reject_ipv4 xt_set iptable_filter ip_tables x_tables ip_set_hash_net ip_set nfnetlink crc32_pclmul button loop btrfs xor usbhid raid6_pq ata_generic virtio_blk virtio_net uhci_hcd ehci_hcd i2c_piix4 usbcore virtio_pci i2c_core usb_common ata_piix floppy CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.52+112-ph #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140722_172050-sagunt 04/01/2014 task: b4e0f500 ti: b4e0 task.ti: b4e0 RIP: 0010:[] [] __endio_write_update_ordered+0x33/0x140 [btrfs] RSP: 0018:8814eae03cd8 EFLAGS: 00010086 RAX: RBX: 8814e8fd5aa8 RCX: 0001 RDX: 0010 RSI: 0010 RDI: 8814e45885c0 RBP: 8814eae03d10 R08: 8814e8334000 R09: 00018040003a R10: ea00507d8d00 R11: 88141f634080 R12: 8814e45885c0 R13: 8814e125d700 R14: 0010 R15: 8800376c6a80 FS: () GS:8814eae0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 01f0 CR3: 0014e34c9000 CR4: 001406f0Stack: 0010 8814e8fd5aa8 8814e953f3c0 8814e125d700 0010 8800376c6a80 8814eae03d38 c03ddf67 8814e86b6a80 8814e8fd5aa8 0001 Call Trace: [] btrfs_endio_direct_write+0x37/0x60 [btrfs] [] bio_endio+0x57/0x60 [] btrfs_end_bio+0xa1/0x140 [btrfs] [] bio_endio+0x57/0x60 [] blk_update_request+0x8b/0x330 [] blk_mq_end_request+0x1a/0x70 [] virtblk_request_done+0x3f/0x70 [virtio_blk] [] __blk_mq_complete_request+0x78/0xe0 [] blk_mq_complete_request+0x1c/0x20 [] virtblk_done+0x64/0xe0 [virtio_blk] [] vring_interrupt+0x3a/0x90 [] __handle_irq_event_percpu+0x89/0x1b0 [] handle_irq_event_percpu+0x23/0x60 [] handle_irq_event+0x3b/0x60 [] handle_edge_irq+0x6f/0x150 [] handle_irq+0x1d/0x30 [] do_IRQ+0x4b/0xd0 [] common_interrupt+0x8c/0x8c DWARF2 unwinder stuck at ret_from_intr+0x0/0x1b Leftover inexact backtrace: 2017-03-12 20:33:08 2017-03-12 20:33:08 [] ? native_safe_halt+0x6/0x10 [] default_idle+0x1e/0xe0 [] arch_cpu_idle+0xf/0x20 [] default_idle_call+0x3b/0x40 [] cpu_startup_entry+0x29a/0x370 [] rest_init+0x7c/0x80 [] start_kernel+0x490/0x49d [] ? early_idt_handler_array+0x120/0x120 [] x86_64_start_reservations+0x2a/0x2c [] x86_64_start_kernel+0x13b/0x14a Code: e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 83 ec 10 48 8b 87 70 fc ff ff 4c 8b 87 38 fe ff ff 48 c7 45 c8 00 00 00 00 48 89 75 d0 <48> 8b b8 f0 01 00 00 48 3b 47 28 49 8b 84 24 78 fc ff ff 0f 84 RIP
Re: [PATCH 4/4] btrfs: cleanup barrier_all_devices() to check dev stat flush error
At 03/14/2017 12:21 AM, Anand Jain wrote: Thanks for the review.. On 03/13/2017 05:05 PM, Qu Wenruo wrote: At 03/13/2017 03:42 PM, Anand Jain wrote: The objective of this patch is to cleanup barrier_all_devices() so that the error checking is in a separate loop independent of of the loop which submits and waits on the device flush requests. The idea itself is quite good, and we do need it. Thanks. By doing this it helps to further develop patches which would tune the error-actions as needed. Here functions such as btrfs_dev_stats_dirty() couldn't be used because it doesn't monitor the flush errors BTRFS_DEV_STAT_FLUSH_ERRS. Signed-off-by: Anand Jain--- fs/btrfs/disk-io.c | 96 +++--- 1 file changed, 85 insertions(+), 11 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 5719e036048b..12531a5b14ff 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -3566,6 +3566,76 @@ static int write_dev_flush(struct btrfs_device *device, int wait) return 0; } +struct device_checkpoint { +struct list_head list; +struct btrfs_device *device; +int stat_value_checkpoint; +}; + +static int add_device_checkpoint(struct list_head *checkpoint, Could we have another structure instead of list_head to record device checkpoints? The list_head is never a meaningful structure under most cases. I didn't understand this, there is device_checkpoint and the context of struct list_head *checkpoint would start and end within barrier_all_devices(). I just mean to encapsulate the list_head into another structure, e.g, call it device_checkpoints or something else. Just a list_head is not quite meaningful. +struct btrfs_device *device) +{ +struct device_checkpoint *cdev = +kzalloc(sizeof(struct device_checkpoint), GFP_KERNEL); +if (!cdev) +return -ENOMEM; This means that, we must check return value of add_device_checkpoint(), while later code doesn't check it at all. oh. I will correct this. + +list_add(>list, checkpoint); And I prefer to do extra check, in case such device is already inserted once. Hmm with the current code its not at all possible, but let me add it. + +cdev->device = device; +cdev->stat_value_checkpoint = +btrfs_dev_stat_read(device, BTRFS_DEV_STAT_FLUSH_ERRS); + +return 0; +} + +static void fini_devices_checkpoint(struct list_head *checkpoint) Never a fan of the "fini_" naming. What about "release_"? will change it. +{ +struct device_checkpoint *cdev; + +while(!list_empty(checkpoint)) { +cdev = list_entry(checkpoint->next, +struct device_checkpoint, list); +list_del(>list); +kfree(cdev); +} +} + +static int check_stat_flush(struct btrfs_device *dev, +struct list_head *checkpoint) +{ +int val; +struct device_checkpoint *cdev; + +list_for_each_entry(cdev, checkpoint, list) { +if (cdev->device == dev) { +val = btrfs_dev_stat_read(dev, +BTRFS_DEV_STAT_FLUSH_ERRS); +if (cdev->stat_value_checkpoint != val) +return 1; This check implies that BTRFS_DEV_STAT_FLUSH_ERRS will only be modified by checkpoint related code. Or any other modifier can easily break the check, causing false alert. I have already checked it, its not possible with the current code, or do you see if that is possible with the current code ? or Did I miss something ? You didn't miss anything. However I just got the same comment on better not to play around something inside btrfs_device. So I'm not sure if it's good to do it this time just for cleaning up barrier_all_devices(). Thanks, Qu IIRC that's the reason why I update my previous degraded patch. Personally speaking, I prefer the patchset to take more usage of the checkpoint system, or it's a little overkilled for current usage. Just want to make sure things are done in the right way. Thanks, Anand Thanks, Qu +} +} +return 0; +} + +static int check_barrier_error(struct btrfs_fs_devices *fsdevs, +struct list_head *checkpoint) +{ +int dropouts = 0; +struct btrfs_device *dev; + +list_for_each_entry_rcu(dev, >devices, dev_list) { +if (!dev->bdev || check_stat_flush(dev, checkpoint)) +dropouts++; +} + +if (dropouts > +fsdevs->fs_info->num_tolerated_disk_barrier_failures) +return -EIO; + +return 0; +} + /* * send an empty flush down to each device in parallel, * then wait for them @@ -3574,8 +3644,10 @@ static int barrier_all_devices(struct btrfs_fs_info *info) { struct list_head *head; struct btrfs_device *dev; -int dropouts = 0; int ret; +struct list_head checkpoint; + +INIT_LIST_HEAD(); /* send down all the barriers */ head = >fs_devices->devices; @@ -3587,29
Re: Hard crash on 4.9.5
On Mon, Mar 13, 2017 at 10:58:29PM +0100, Kai Krakow wrote: > Am Sat, 28 Jan 2017 15:50:38 -0500 > schrieb Matt McKinnon: > > > This same file system (which crashed again with the same errors) is > > also giving this output during a metadata or data balance: > > This looks somewhat familiar to the err=-17 that I am experiencing when > using VirtualBox image on btrfs in CoW mode (compress=lzo). > > During IO intensive workloads, it results in "object already exists, > err -17" (or similar, someone else also experienced it through another > workload). The resulting btrfs check show the same errors, giving > inodes without csum. > > Trying to continue using this file system in successive boots usually > results in boot freezes or complete unmountable filesystem, broken > beyond repair. > > I'm feeling that using the bfq elevator usually enables me to trigger > this bug also without using VirtualBox, i.e. during normal system > usage, and mostly during boot when IO load is very high. So I also > stopped using bfq although it was giving me a much superior > interactivity. > > Marking vbox images nocow and using standard elevators (cfq, deadline) > exposes no such problems so far - even during excessive IO loads. > > EOM This sounds similar to a bug I fixed here: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8e2bd3b7fac91b79a6115fd1511ca20b2a09696d That change is in v4.10. If you're not already running a kernel version with that fix, could you check if that solves it? -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Hard crash on 4.9.5
Am Sat, 28 Jan 2017 15:50:38 -0500 schrieb Matt McKinnon: > This same file system (which crashed again with the same errors) is > also giving this output during a metadata or data balance: This looks somewhat familiar to the err=-17 that I am experiencing when using VirtualBox image on btrfs in CoW mode (compress=lzo). During IO intensive workloads, it results in "object already exists, err -17" (or similar, someone else also experienced it through another workload). The resulting btrfs check show the same errors, giving inodes without csum. Trying to continue using this file system in successive boots usually results in boot freezes or complete unmountable filesystem, broken beyond repair. I'm feeling that using the bfq elevator usually enables me to trigger this bug also without using VirtualBox, i.e. during normal system usage, and mostly during boot when IO load is very high. So I also stopped using bfq although it was giving me a much superior interactivity. Marking vbox images nocow and using standard elevators (cfq, deadline) exposes no such problems so far - even during excessive IO loads. EOM > Jan 27 19:42:47 my_machine kernel: [ 335.018123] BTRFS info (device > sda1): no csum found for inode 28472371 start 2191360 > Jan 27 19:42:47 my_machine kernel: [ 335.018128] BTRFS info (device > sda1): no csum found for inode 28472371 start 2195456 > Jan 27 19:42:47 my_machine kernel: [ 335.018491] BTRFS info (device > sda1): no csum found for inode 28472371 start 4018176 > Jan 27 19:42:47 my_machine kernel: [ 335.018496] BTRFS info (device > sda1): no csum found for inode 28472371 start 4022272 > Jan 27 19:42:47 my_machine kernel: [ 335.018499] BTRFS info (device > sda1): no csum found for inode 28472371 start 4026368 > Jan 27 19:42:47 my_machine kernel: [ 335.018502] BTRFS info (device > sda1): no csum found for inode 28472371 start 4030464 > Jan 27 19:42:47 my_machine kernel: [ 335.019443] BTRFS info (device > sda1): no csum found for inode 28472371 start 6156288 > Jan 27 19:42:47 my_machine kernel: [ 335.019688] BTRFS info (device > sda1): no csum found for inode 28472371 start 7933952 > Jan 27 19:42:47 my_machine kernel: [ 335.019693] BTRFS info (device > sda1): no csum found for inode 28472371 start 7938048 > Jan 27 19:42:47 my_machine kernel: [ 335.019754] BTRFS info (device > sda1): no csum found for inode 28472371 start 8077312 > Jan 27 19:42:47 my_machine kernel: [ 335.025485] BTRFS warning > (device sda1): csum failed ino 28472371 off 2191360 csum 4031061501 > expected csum 0 Jan 27 19:42:47 my_machine kernel: [ 335.025490] > BTRFS warning (device sda1): csum failed ino 28472371 off 2195456 > csum 2371784003 expected csum 0 Jan 27 19:42:47 my_machine kernel: > [ 335.025526] BTRFS warning (device sda1): csum failed ino 28472371 > off 4018176 csum 3812080098 expected csum 0 Jan 27 19:42:47 > my_machine kernel: [ 335.025531] BTRFS warning (device sda1): csum > failed ino 28472371 off 4022272 csum 2776681411 expected csum 0 Jan > 27 19:42:47 my_machine kernel: [ 335.025534] BTRFS warning (device > sda1): csum failed ino 28472371 off 4026368 csum 1179241675 expected > csum 0 Jan 27 19:42:47 my_machine kernel: [ 335.025540] BTRFS > warning (device sda1): csum failed ino 28472371 off 4030464 csum > 1256914217 expected csum 0 Jan 27 19:42:47 my_machine kernel: > [ 335.026142] BTRFS warning (device sda1): csum failed ino 28472371 > off 7933952 csum 2695958066 expected csum 0 Jan 27 19:42:47 > my_machine kernel: [ 335.026147] BTRFS warning (device sda1): csum > failed ino 28472371 off 7938048 csum 3260800596 expected csum 0 Jan > 27 19:42:47 my_machine kernel: [ 335.026934] BTRFS warning (device > sda1): csum failed ino 28472371 off 6156288 csum 4293116449 expected > csum 0 Jan 27 19:42:47 my_machine kernel: [ 335.033249] BTRFS > warning (device sda1): csum failed ino 28472371 off 8077312 csum > 4031878292 expected csum 0 > > Can these be ignored? > > > On 01/25/2017 04:06 PM, Liu Bo wrote: > > On Mon, Jan 23, 2017 at 03:03:55PM -0500, Matt McKinnon wrote: > >> Wondering what to do about this error which says 'reboot needed'. > >> Has happened a three times in the past week: > >> > > > > Well, I don't think btrfs's logic here is wrong, the following stack > > shows that a nfs client has sent a second unlink against the same > > inode while somehow the inode was not fully deleted by the first > > unlink. > > > > So it'd be good that you could add some debugging information to > > get us further. > > > > Thanks, > > > > -liubo > > > >> Jan 23 14:16:17 my_machine kernel: [ 2568.595648] BTRFS error > >> (device sda1): err add delayed dir index item(index: 23810) into > >> the deletion tree of the delayed node(root id: 257, inode id: > >> 2661433, errno: -17) Jan 23 14:16:17 my_machine kernel: > >> [ 2568.611010] [ cut here ] > >> Jan 23 14:16:17 my_machine kernel: [ 2568.615628] kernel BUG at > >>
[PATCH 2/2] btrfs: replace hardcoded value with SEQ_NONE macro
From: Edmund NadolskiDefine the SEQ_NONE macro to replace (u64)-1 in places where said value triggers a special-case ref search behavior. Signed-off-by: Edmund Nadolski Reviewed-by: Jeff Mahoney --- fs/btrfs/backref.c | 16 fs/btrfs/backref.h | 2 ++ fs/btrfs/qgroup.c | 4 ++-- 3 files changed, 12 insertions(+), 10 deletions(-) diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c index e794b6e..297de5b 100644 --- a/fs/btrfs/backref.c +++ b/fs/btrfs/backref.c @@ -538,7 +538,7 @@ static int add_all_parents(struct btrfs_root *root, struct btrfs_path *path, * slot==nritems. In that case, go to the next leaf before we continue. */ if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) { - if (time_seq == (u64)-1) + if (time_seq == SEQ_NONE) ret = btrfs_next_leaf(root, path); else ret = btrfs_next_old_leaf(root, path, time_seq); @@ -582,7 +582,7 @@ static int add_all_parents(struct btrfs_root *root, struct btrfs_path *path, eie = NULL; } next: - if (time_seq == (u64)-1) + if (time_seq == SEQ_NONE) ret = btrfs_next_item(root, path); else ret = btrfs_next_old_item(root, path, time_seq); @@ -634,7 +634,7 @@ static int __resolve_indirect_ref(struct btrfs_fs_info *fs_info, if (path->search_commit_root) root_level = btrfs_header_level(root->commit_root); - else if (time_seq == (u64)-1) + else if (time_seq == SEQ_NONE) root_level = btrfs_header_level(root->node); else root_level = btrfs_old_root_level(root, time_seq); @@ -645,7 +645,7 @@ static int __resolve_indirect_ref(struct btrfs_fs_info *fs_info, } path->lowest_level = level; - if (time_seq == (u64)-1) + if (time_seq == SEQ_NONE) ret = btrfs_search_slot(NULL, root, >key_for_search, path, 0, 0); else @@ -1199,7 +1199,7 @@ static int __add_keyed_refs(struct btrfs_fs_info *fs_info, * * NOTE: This can return values > 0 * - * If time_seq is set to (u64)-1, it will not search delayed_refs, and behave + * If time_seq is set to SEQ_NONE, it will not search delayed_refs, and behave * much like trans == NULL case, the difference only lies in it will not * commit root. * The special case is for qgroup to search roots in commit_transaction(). @@ -1246,7 +1246,7 @@ static int find_parent_nodes(struct btrfs_trans_handle *trans, path->skip_locking = 1; } - if (time_seq == (u64)-1) + if (time_seq == SEQ_NONE) path->skip_locking = 1; /* @@ -1276,9 +1276,9 @@ static int find_parent_nodes(struct btrfs_trans_handle *trans, #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS if (trans && likely(trans->type != __TRANS_DUMMY) && - time_seq != (u64)-1) { + time_seq != SEQ_NONE) { #else - if (trans && time_seq != (u64)-1) { + if (trans && time_seq != SEQ_NONE) { #endif /* * look if there are updates for this ref queued and lock the diff --git a/fs/btrfs/backref.h b/fs/btrfs/backref.h index 9c41fba..20915a6 100644 --- a/fs/btrfs/backref.h +++ b/fs/btrfs/backref.h @@ -23,6 +23,8 @@ #include "ulist.h" #include "extent_io.h" +#define SEQ_NONE ((u64)-1) + struct inode_fs_paths { struct btrfs_path *btrfs_path; struct btrfs_root *fs_root; diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c index bb7e42f..76d84af 100644 --- a/fs/btrfs/qgroup.c +++ b/fs/btrfs/qgroup.c @@ -2077,12 +2077,12 @@ int btrfs_qgroup_account_extents(struct btrfs_trans_handle *trans, goto cleanup; } /* -* Use (u64)-1 as time_seq to do special search, which +* Use SEQ_NONE as time_seq to do special search, which * doesn't lock tree or delayed_refs and search current * root. It's safe inside commit_transaction(). */ ret = btrfs_find_all_roots(trans, fs_info, - record->bytenr, (u64)-1, _roots); + record->bytenr, SEQ_NONE, _roots); if (ret < 0) goto cleanup; if (qgroup_to_skip) { -- 2.10.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] btrfs: provide enumeration for __merge_refs mode argument
From: Edmund NadolskiReplace hardcoded numeric values for __merge_refs 'mode' argument with descriptive constants. Signed-off-by: Edmund Nadolski Reviewed-by: Jeff Mahoney --- fs/btrfs/backref.c | 15 +-- 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c index 7699e16..e794b6e 100644 --- a/fs/btrfs/backref.c +++ b/fs/btrfs/backref.c @@ -26,6 +26,11 @@ #include "delayed-ref.h" #include "locking.h" +enum merge_mode { + MERGE_IDENTICAL_KEYS = 1, + MERGE_IDENTICAL_PARENTS, +}; + /* Just an arbitrary number so we can be sure this happened */ #define BACKREF_FOUND_SHARED 6 @@ -809,14 +814,12 @@ static int __add_missing_keys(struct btrfs_fs_info *fs_info, /* * merge backrefs and adjust counts accordingly * - * mode = 1: merge identical keys, if key is set *FIXME: if we add more keys in __add_prelim_ref, we can merge more here. * additionally, we could even add a key range for the blocks we * looked into to merge even more (-> replace unresolved refs by those * having a parent). - * mode = 2: merge identical parents */ -static void __merge_refs(struct list_head *head, int mode) +static void __merge_refs(struct list_head *head, enum merge_mode mode) { struct __prelim_ref *pos1; @@ -829,7 +832,7 @@ static void __merge_refs(struct list_head *head, int mode) if (!ref_for_same_block(ref1, ref2)) continue; - if (mode == 1) { + if (mode == MERGE_IDENTICAL_KEYS) { if (!ref1->parent && ref2->parent) swap(ref1, ref2); } else { @@ -1374,7 +1377,7 @@ static int find_parent_nodes(struct btrfs_trans_handle *trans, if (ret) goto out; - __merge_refs(, 1); + __merge_refs(, MERGE_IDENTICAL_KEYS); ret = __resolve_indirect_refs(fs_info, path, time_seq, , extent_item_pos, total_refs, @@ -1382,7 +1385,7 @@ static int find_parent_nodes(struct btrfs_trans_handle *trans, if (ret) goto out; - __merge_refs(, 2); + __merge_refs(, MERGE_IDENTICAL_PARENTS); while (!list_empty()) { ref = list_first_entry(, struct __prelim_ref, list); -- 2.10.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/2] Cleanup for some hardcoded constants
From: Edmund NadolskiThis series replaces several hard-coded values with descriptive symbols. Edmund Nadolski (2): btrfs: provide enumeration for __merge_refs mode argument btrfs: replace hardcoded value with SEQ_NONE macro fs/btrfs/backref.c | 31 +-- fs/btrfs/backref.h | 2 ++ fs/btrfs/qgroup.c | 4 ++-- 3 files changed, 21 insertions(+), 16 deletions(-) -- 2.10.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Home storage with btrfs
Same here, Have been using BTRFS for a 'scratch' disk since about 2014. The disk have had quite some abuse and no issues yet. I don't use compression, snapshots or any fancy features. I have recently moved all of the root filesystem to BTRFS with 5x SSD disks set up in RAID1 and everything is (still) working fine, and I have been shuffling large amounts of data on this volume. I bet the SSD's will break before BTRFS does, so the real test is yet to come I guess... I am on Debian GNU/Linux with kernel 4.9.0-2-amd64 (Debian 4.9.13-1) - btrfs-progs 4.7.3 However, keep in mind that backups is winning the fight against binary related traumas :) Peter Becker wrote: I can confirm this. I have also no generell issues since the past 2 years with BTRFS in RAID1 and 6 Disks with different sizes and also no issues with the DUP profile on a single disk. Only some performance issues with deduplication and very large files. But i also recommand to use a newer kernel (4.4 or higher) or better the newest and build a newer version of btrfs progs form source. I use Ubuntu 16.04 and kernel 4.9 + btrfs progs 4.9 currently. 2017-03-13 13:02 GMT+01:00 Austin S. Hemmelgarn: On 2017-03-13 07:52, Juan Orti Alcaine wrote: 2017-03-13 12:29 GMT+01:00 Hérikz Nawarro : Hello everyone, Today is safe to use btrfs for home storage? No raid, just secure storage for some files and create snapshots from it. In my humble opinion, yes. I'm running a RAID1 btrfs at home for 5 years and I feel the most serious bugs have been fixed, because in the last two years I have not experienced any issue. In general, I'd agree. I've not seen any issues resulting from BTRFS itself for the past 2.5 years (although it's helped me find quite a lot of marginal or failing hardware over that time), but I've also not used many of the less stable features (raid56, qgroups, and a handful of other things). One piece of advice I will give though, try to keep the total number of snapshots to a reasonably small three digit number (ideally less than 200, absolutely less than 300), otherwise performance is going to be horrible. Anyway, keeping your kernel and btrfs-progs updated is a must, and of course, having good backups. I'm using Fedora and it's fine. Also agreed, Fedora is one of the best options for a traditional distro (they're very good about staying up to date and back-porting bug-fixes from the upstream kernel). The other two I'd recommend are Arch (they actually use an almost upstream kernel and are generally the first distro to have new versions of any arbitrary software) and Gentoo (similar to Arch, but more maintenance intensive (although also more efficient (usually))). -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] qgroup: Change qgroup_meta_rsv to 64bit
On Sat, Mar 11, 2017 at 05:03:26PM -0600, Goldwyn Rodrigues wrote: > - int reserved; > + long reserved; > - reserved = atomic_xchg(>qgroup_meta_rsv, 0); > + reserved = atomic64_xchg(>qgroup_meta_rsv, 0); atomic64_xchg returns 'long long' so u64 should be used. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs-progs: cli-tests: Convert non-raid filesystem to raid
On Sun, Jan 29, 2017 at 08:14:55PM +0530, Lakshmipathi.G wrote: > Simple script to verify non-raid filesystem conversion. The cli (command line interface) tests are supposed to cover the common usecases from the point of option combinations etc, not really verifying the result. It would be good if you add more than just one simple test, especially for the balance filters. Adding the helpers to prepare the filesystem (number of devices, raid profiles) would be a good start. Some code for that is in the mkfs tests, but this belongs to the common scripts so we could easily use that. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fresh Raid-1 setup, dump-tree shows invalid owner id
On Mon, Jan 30, 2017 at 06:07:01PM +0530, Lakshmipathi.G wrote: > > > > Yes, the owner is the number of the tree. > > > > DATA_RELOC_TREE is -9, but then unsigned 64 bits. > > > -9 + 2**64 > > 18446744073709551607L > > > > So the result is a number that's close to the max or 64 bits. > > > > You can find those numbers in the kernel source in > > include/uapi/linux/btrfs_tree.h > > > > e.g.: > > > > #define BTRFS_DATA_RELOC_TREE_OBJECTID -9ULL > > > > Thanks for the details. This owner number looked different from other > owner ids, so wanted to check on the same, now understood. Recently, in the commit 6f6b643e44ef7, the known tree ids are printed as the small negative numbers. We can also do that in progs, however we'd need some tricks so we don't duplicate all format strings one with %llu and one with %lld. Something like format: "%s%llu" values: id > LAST_FREE ? "-" : "" id > LAST_FREE ? 2^64 - id : id and hidden in helpers etc. I don't think we should replace all references to tree ids with the symbolic names as they're used in keys so searching for the numeric value in the dump output would miss something or make the regexp more complicated. I won't object against making the tree id output format selectable by option though. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs-progs: misc-tests: Primary Superblock corruption and recovery using backup Superblock.
On Wed, Feb 15, 2017 at 02:44:05PM +0530, Lakshmipathi.G wrote: > On Wed, Feb 15, 2017 at 09:36:03AM +0800, Qu Wenruo wrote: > > > > > > >+ # Corrupt superblock checksum > > >+dd if=/dev/zero of=$TEST_DEV seek=$superblock_offset bs=1 \ > > >+count=4 conv=notrunc &> /dev/null > > >+ run_check_stdout $SUDO_HELPER mount $TEST_DEV $TEST_MNT | \ > > >+ grep -q 'wrong fs type' > > > > What about using btrfs check instead of trying to mount it? > > > > This could emit the need to use $SUDO_HELPER, and could catch super error > > more accurate. > > > > >+if [ $? -ne 0 ]; then > > >+ _fail "Failed to corrupt superblock." > > >+fi > > >+ > > >+ # Copy backup superblock to primary > > >+ run_check $TOP/btrfs-select-super -s 1 $TEST_DEV > > >+ run_check $SUDO_HELPER mount $TEST_DEV $TEST_MNT > > Same here. > I started with 'btrfs check' and 'btrfs check --repair' but it seems like > --repair don't fix the corruption. So just moved away from using it. Check --repair will not fix the superblock checksum as it's not clear whether the rest of the superblock is ok or not. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] btrfs-progs: misc-tests: Superblock corruption and recovery using backup.
On Fri, Feb 17, 2017 at 04:03:41AM +0530, Lakshmipathi.G wrote: > On Thu, Feb 16, 2017 at 09:05:02AM +0800, Qu Wenruo wrote: > > > > > > At 02/16/2017 04:50 AM, Lakshmipathi.G wrote: > > >Signed-off-by: Lakshmipathi.G> > > > Looks good to me. > > > > Reviewed by: Qu Wenruo > > > > BTW, it would be better to have some commit message. > > ok, sent the v3 patch with commit message. It's not necessary if the Subject says it all, so the changelog does not simply repeat it. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3] btrfs-progs: misc-tests: Superblock corruption and recovery using backup.
On Fri, Feb 17, 2017 at 04:01:59AM +0530, Lakshmipathi.G wrote: > Test script to recover damaged primary superblock using backup superblock. > > Signed-off-by: Lakshmipathi.GThanks for the test, a few comments below. > --- > .../019-fix-superblock-corruption/test.sh | 36 > ++ > 1 file changed, 36 insertions(+) > create mode 100755 tests/misc-tests/019-fix-superblock-corruption/test.sh > > diff --git a/tests/misc-tests/019-fix-superblock-corruption/test.sh > b/tests/misc-tests/019-fix-superblock-corruption/test.sh > new file mode 100755 > index 000..95815e4 > --- /dev/null > +++ b/tests/misc-tests/019-fix-superblock-corruption/test.sh > @@ -0,0 +1,36 @@ > +#!/bin/bash > +# > +# Corrupt primary superblock and restore it using backup superblock. > +# > + > +source $TOP/tests/common > + > +check_prereq btrfs-select-super > +check_prereq btrfs > + > +setup_root_helper > +prepare_test_dev 512M > + > +FIRST_SUPERBLOCK_OFFSET=65536 > + > +test_superblock_restore() > +{ > + run_check $SUDO_HELPER $TOP/mkfs.btrfs -f $TEST_DEV > + > + # Corrupt superblock checksum > +dd if=/dev/zero of=$TEST_DEV seek=$FIRST_SUPERBLOCK_OFFSET bs=1 \ > +count=4 conv=notrunc &> /dev/null > + # Run btrfs check to detect corruption > + $TOP/btrfs check $TEST_DEV >& /dev/null && \ > + _fail "btrfs check should detect corruption" No run_check? In general all test output is desired and should be in the test log so you can use 'run_mayfail'. Also, please keep the entier output and don't redirect it to /dev/null . > + # Copy backup superblock to primary > + run_check $TOP/btrfs-select-super -s 1 $TEST_DEV > + > + echo "Performing btrfs check" &>> $RESULTS Use _log for that > + $TOP/btrfs check $TEST_DEV &>> $RESULTS Use run_check > +if [ $? -ne 0 ]; then > + _fail "Failed to fix superblock." And this would be unnecessary. I've added a section to tests/README.md to describe the common coding practices. Feel free to read/update/fix. Older tests could violate the recommendations so at least new tests are in line and we'll update the rest incrementally. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] btrfs-progs: dump-tree: Also output log root tree
On Fri, Mar 03, 2017 at 12:13:45PM +0800, Qu Wenruo wrote: > In btrfs-dump-tree, we output any existing log tree, however we don't > output the log root tree, which records all root items for log trees. > > This makes it confusing for any one who want to know where the log tree > comes from. > > Signed-off-by: Qu WenruoWhat's the base of this patch or is there a patch that I've missed to apply? This fails on current devel, 4.10, 4.9.1 and I don't see patch 1/2 in my mailbox or on patchwork. Please refresh and resend. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] btrfs: cleanup barrier_all_devices() to check dev stat flush error
Thanks for the review.. On 03/13/2017 05:05 PM, Qu Wenruo wrote: At 03/13/2017 03:42 PM, Anand Jain wrote: The objective of this patch is to cleanup barrier_all_devices() so that the error checking is in a separate loop independent of of the loop which submits and waits on the device flush requests. The idea itself is quite good, and we do need it. Thanks. By doing this it helps to further develop patches which would tune the error-actions as needed. Here functions such as btrfs_dev_stats_dirty() couldn't be used because it doesn't monitor the flush errors BTRFS_DEV_STAT_FLUSH_ERRS. Signed-off-by: Anand Jain--- fs/btrfs/disk-io.c | 96 +++--- 1 file changed, 85 insertions(+), 11 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 5719e036048b..12531a5b14ff 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -3566,6 +3566,76 @@ static int write_dev_flush(struct btrfs_device *device, int wait) return 0; } +struct device_checkpoint { +struct list_head list; +struct btrfs_device *device; +int stat_value_checkpoint; +}; + +static int add_device_checkpoint(struct list_head *checkpoint, Could we have another structure instead of list_head to record device checkpoints? The list_head is never a meaningful structure under most cases. I didn't understand this, there is device_checkpoint and the context of struct list_head *checkpoint would start and end within barrier_all_devices(). +struct btrfs_device *device) +{ +struct device_checkpoint *cdev = +kzalloc(sizeof(struct device_checkpoint), GFP_KERNEL); +if (!cdev) +return -ENOMEM; This means that, we must check return value of add_device_checkpoint(), while later code doesn't check it at all. oh. I will correct this. + +list_add(>list, checkpoint); And I prefer to do extra check, in case such device is already inserted once. Hmm with the current code its not at all possible, but let me add it. + +cdev->device = device; +cdev->stat_value_checkpoint = +btrfs_dev_stat_read(device, BTRFS_DEV_STAT_FLUSH_ERRS); + +return 0; +} + +static void fini_devices_checkpoint(struct list_head *checkpoint) Never a fan of the "fini_" naming. What about "release_"? will change it. +{ +struct device_checkpoint *cdev; + +while(!list_empty(checkpoint)) { +cdev = list_entry(checkpoint->next, +struct device_checkpoint, list); +list_del(>list); +kfree(cdev); +} +} + +static int check_stat_flush(struct btrfs_device *dev, +struct list_head *checkpoint) +{ +int val; +struct device_checkpoint *cdev; + +list_for_each_entry(cdev, checkpoint, list) { +if (cdev->device == dev) { +val = btrfs_dev_stat_read(dev, +BTRFS_DEV_STAT_FLUSH_ERRS); +if (cdev->stat_value_checkpoint != val) +return 1; This check implies that BTRFS_DEV_STAT_FLUSH_ERRS will only be modified by checkpoint related code. Or any other modifier can easily break the check, causing false alert. I have already checked it, its not possible with the current code, or do you see if that is possible with the current code ? or Did I miss something ? IIRC that's the reason why I update my previous degraded patch. Personally speaking, I prefer the patchset to take more usage of the checkpoint system, or it's a little overkilled for current usage. Just want to make sure things are done in the right way. Thanks, Anand Thanks, Qu +} +} +return 0; +} + +static int check_barrier_error(struct btrfs_fs_devices *fsdevs, +struct list_head *checkpoint) +{ +int dropouts = 0; +struct btrfs_device *dev; + +list_for_each_entry_rcu(dev, >devices, dev_list) { +if (!dev->bdev || check_stat_flush(dev, checkpoint)) +dropouts++; +} + +if (dropouts > +fsdevs->fs_info->num_tolerated_disk_barrier_failures) +return -EIO; + +return 0; +} + /* * send an empty flush down to each device in parallel, * then wait for them @@ -3574,8 +3644,10 @@ static int barrier_all_devices(struct btrfs_fs_info *info) { struct list_head *head; struct btrfs_device *dev; -int dropouts = 0; int ret; +struct list_head checkpoint; + +INIT_LIST_HEAD(); /* send down all the barriers */ head = >fs_devices->devices; @@ -3587,29 +3659,31 @@ static int barrier_all_devices(struct btrfs_fs_info *info) if (!dev->in_fs_metadata || !dev->writeable) continue; +add_device_checkpoint(, dev); ret = write_dev_flush(dev, 0); -if (ret) +if (ret) { +fini_devices_checkpoint(); return ret; +} } /* wait for all the barriers */ list_for_each_entry_rcu(dev, head, dev_list) {
Re: Question re. required since btrfs-progs-4.10
On Sat, Mar 11, 2017 at 05:04:52PM +0100, Holger Hoffstätte wrote: > I'm on Gentoo and wanted to update Docker to 17.03.0, which failed when it > couldn't build the btrfs driver due to a missing . > This worked fine on another machine the other day, so I dug in and found that > the only difference was an intermediate update to btrfs-progs 4.10. > Sure enough: since 4.10 ctree.h includes , which is nowhere > to be found in my package of linux-headers-4.10 (also not in 4.9). > > Simply copying sizes.h from the 4.9 kernel sources made everything work, > but I'm curious whether this is actually correct - hence my question whether > _should_ actually be installed? Is this a bug with > Gentoo's kernel header package or a new problem with btrfs-progs? That's a bug in btrfs-progs, the build tests do not cover that. There's a local copy of sizes.h that's supposed to be used. The patch was ported from kernel where linus/sizes.h is present but is probably not (yet) part of the public linux-headers. > I don't see any good reason why sizes.h should not be installed, but wanted > to verify since there are probably users of other distributions here as > well. :) The build would fail on any distro. I'll commit a fix and update the tests so this gets caught early. Thanks for the report. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v7 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
Am 13.03.2017 um 08:39 schrieb Qu Wenruo: > > > At 03/13/2017 03:26 PM, Stefan Priebe - Profihost AG wrote: >> Hi Qu, >> >> Am 13.03.2017 um 02:16 schrieb Qu Wenruo: >>> >>> At 03/13/2017 04:49 AM, Stefan Priebe - Profihost AG wrote: Hi Qu, while V5 was running fine against the openSUSE-42.2 kernel (based on v4.4). >>> >>> Thanks for the test. >>> V7 results in OOPS to me: BUG: unable to handle kernel NULL pointer dereference at 01f0 >>> >>> This 0x1f0 is the same as offsetof(struct brrfs_root, fs_info), quite >>> nice clue. >>> IP: [] __endio_write_update_ordered+0x33/0x140 [btrfs] >>> >>> IP points to: >>> --- >>> static inline bool btrfs_is_free_space_inode(struct btrfs_inode *inode) >>> { >>> struct btrfs_root *root = inode->root; << Either here >>> >>> if (root == root->fs_info->tree_root && << Or here >>> btrfs_ino(inode) != BTRFS_BTREE_INODE_OBJECTID) >>> >>> --- >>> >>> Taking the above offset into consideration, it's only possible for later >>> case. >>> >>> So here, we have a btrfs_inode whose @root is NULL. >> >> But wasn't this part of the code identical in V5? Why does it only >> happen with V7? > > There are still difference, but just as you said, the related > part(checking if inode is free space cache inode) is identical across v5 > and v7. But if i boot v7 it always happens. If i boot v5 it always works. Have done 5 repeatet tests. > I'm afraid that's a rare race leading to NULL btrfs_inode->root, which > could happen in both v5 and v7. > > What's the difference between SUSE and mainline kernel? A lot ;-) But i don't think anything related. > Maybe some mainline kernel commits have already fixed it? May be no idea. But i haven't found any reason why v5 works. Stefan > > Thanks, > Qu >> >>> This can be fixed easily by checking @root inside >>> btrfs_is_free_space_inode(), as the backtrace shows that it's only >>> happening for DirectIO, and it won't happen for free space cache inode. >>> >>> But I'm more curious how this happened for a more accurate fix, or we >>> could have other NULL pointer access. >>> >>> Did you have any reproducer for this? >> >> Sorry no - this is a production MariaDB Server running btrfs with >> compress-force=zlib. But if i could test anything i'll do. >> >> Greets, >> Stefan >> >>> >>> Thanks, >>> Qu >>> PGD 14e18d4067 PUD 14e1868067 PMD 0 Oops: [#1] SMP Modules linked in: netconsole xt_multiport ipt_REJECT nf_reject_ipv4 xt_set iptable_filter ip_tables x_tables ip_set_hash_net ip_set nfnetlink crc32_pclmul button loop btrfs xor usbhid raid6_pq ata_generic virtio_blk virtio_net uhci_hcd ehci_hcd i2c_piix4 usbcore virtio_pci i2c_core usb_common ata_piix floppy CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.52+112-ph #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140722_172050-sagunt 04/01/2014 task: b4e0f500 ti: b4e0 task.ti: b4e0 RIP: 0010:[] [] __endio_write_update_ordered+0x33/0x140 [btrfs] RSP: 0018:8814eae03cd8 EFLAGS: 00010086 RAX: RBX: 8814e8fd5aa8 RCX: 0001 RDX: 0010 RSI: 0010 RDI: 8814e45885c0 RBP: 8814eae03d10 R08: 8814e8334000 R09: 00018040003a R10: ea00507d8d00 R11: 88141f634080 R12: 8814e45885c0 R13: 8814e125d700 R14: 0010 R15: 8800376c6a80 FS: () GS:8814eae0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 01f0 CR3: 0014e34c9000 CR4: 001406f0Stack: 0010 8814e8fd5aa8 8814e953f3c0 8814e125d700 0010 8800376c6a80 8814eae03d38 c03ddf67 8814e86b6a80 8814e8fd5aa8 0001 Call Trace: [] btrfs_endio_direct_write+0x37/0x60 [btrfs] [] bio_endio+0x57/0x60 [] btrfs_end_bio+0xa1/0x140 [btrfs] [] bio_endio+0x57/0x60 [] blk_update_request+0x8b/0x330 [] blk_mq_end_request+0x1a/0x70 [] virtblk_request_done+0x3f/0x70 [virtio_blk] [] __blk_mq_complete_request+0x78/0xe0 [] blk_mq_complete_request+0x1c/0x20 [] virtblk_done+0x64/0xe0 [virtio_blk] [] vring_interrupt+0x3a/0x90 [] __handle_irq_event_percpu+0x89/0x1b0 [] handle_irq_event_percpu+0x23/0x60 [] handle_irq_event+0x3b/0x60 [] handle_edge_irq+0x6f/0x150 [] handle_irq+0x1d/0x30 [] do_IRQ+0x4b/0xd0 [] common_interrupt+0x8c/0x8c DWARF2 unwinder stuck at ret_from_intr+0x0/0x1b Leftover inexact backtrace: 2017-03-12 20:33:08 2017-03-12 20:33:08 [] ? native_safe_halt+0x6/0x10 [] default_idle+0x1e/0xe0 [] arch_cpu_idle+0xf/0x20 [] default_idle_call+0x3b/0x40 [] cpu_startup_entry+0x29a/0x370 []
Re: [PATCH v2] qgroup: Retry after commit on getting EDQUOT
On Mon, Mar 13, 2017 at 12:27 PM, Goldwyn Rodrigueswrote: > > > On 03/13/2017 07:14 AM, Filipe Manana wrote: >> On Sat, Mar 11, 2017 at 11:02 PM, Goldwyn Rodrigues wrote: >>> From: Goldwyn Rodrigues >>> >>> We are facing the same problem with EDQUOT which was experienced with >>> ENOSPC. Not sure if we require a full ticketing system such as ENOSPC, but >>> here is a fix. Let me know if it is too big a hammer. >>> >>> Quotas are reserved during the start of an operation, incrementing >>> qg->reserved. However, it is written to disk in a commit_transaction >>> which could take as long as commit_interval. In the meantime there >>> could be deletions which are not accounted for because deletions are >>> accounted for only while committed (free_refroot). So, when we get >>> a EDQUOT flush the data to disk and try again. >> >> So this doesn't explain why the problem happens, taking into account >> the reproducer below. >> You mention that deletions are only accounted at transaction commit >> time, but the reproducer does not do any deletions. > > Yes it does, after Cleanup. It is a loop. I suppose you got confused by > the indent. Indeed, the indentation is screwed up. > >> So why does the reproducer gets EDQUOT errors? The second sentence >> "However, it is written to disk in a commit_transaction >> which could take as long as commit_interval." also does not explain >> why is that a problem, why it makes the reproducer get EDQUOT errors. >> >> I would like to see a clear and complete explanation in the changelog. > > I suppose this comment is irrelevant now? Yes. > >> >>> >>> I combined the conditions of rfer and excl to reduce code lines, though >>> the condition looks uglier. >> >> This sentence is outdated, it's no longer valid for v2 as it now calls >> qgroup_check_limits(). >> >>> >>> Here is a sample script which shows this issue. >>> >>> DEVICE=/dev/vdb >>> MOUNTPOINT=/mnt >>> TESTVOL=$MOUNTPOINT/tmp >>> QUOTA=5 >>> PROG=btrfs >>> DD_BS="4k" >>> DD_COUNT="256" >>> RUN_TIMES=5000 >>> >>> mkfs.btrfs -f $DEVICE >>> mount -o commit=240 $DEVICE $MOUNTPOINT >>> $PROG subvolume create $TESTVOL >>> $PROG quota enable $TESTVOL >>> $PROG qgroup limit ${QUOTA}G $TESTVOL >>> >>> typeset -i DD_RUN_GOOD >>> typeset -i QUOTA >>> >>> function _check_cmd() { >>> if [[ ${?} > 0 ]]; then >>> echo -n "$(date) E: Running previous command" >>> echo ${*} >>> echo "Without sync" >>> $PROG qgroup show -pcreFf ${TESTVOL} >>> echo "With sync" >>> $PROG qgroup show -pcreFf --sync ${TESTVOL} >>> exit 1 >>> fi >>> } >>> >>> while true; do >>> DD_RUN_GOOD=$RUN_TIMES >>> >>> while (( ${DD_RUN_GOOD} != 0 )); do >>> dd if=/dev/zero of=${TESTVOL}/quotatest${DD_RUN_GOOD} bs=${DD_BS} >>> count=${DD_COUNT} >>> _check_cmd "dd if=/dev/zero of=${TESTVOL}/quotatest${DD_RUN_GOOD} >>> bs=${DD_BS} count=${DD_COUNT}" >>> DD_RUN_GOOD=(${DD_RUN_GOOD}-1) >>> done >>> >>> $PROG qgroup show -pcref $TESTVOL >>> echo "--- Cleanup -- " >>> rm $TESTVOL/quotatest* > > Here is the deletion. > >>> >>> done > > And the infinite loop ending. > >> >> Please write a test case for fstests. > > You mean xfstests. will do. xfstests got renamed to fstests long time ago, and obviously it's xfstests as that's the test suite we all use. thanks > >> >>> >>> Changes since v1: >>> - Changed start_delalloc_roots() to start_delalloc_inode() to target >>>the root in question only to reduce the amount of flush to be done. >>> - Added wait_ordered_extents(). >> >> Listing what changed between versions is not meant to appear in the >> changelog. This should go after the --- below. You should also prefix >> the patch subject with "btrfs:" (or "btrfs-progs:"). See the examples >> from others or the wiki for intsructions >> (https://btrfs.wiki.kernel.org/index.php/Writing_patch_for_btrfs#Updating_patches). >> > > Ok, understood. Thanks. > >> thanks >> >>> >>> Signed-off-by: Goldwyn Rodrigues >>> Reviewed-by: Qu Wenruo >>> --- >>> fs/btrfs/qgroup.c | 15 +++ >>> 1 file changed, 15 insertions(+) >>> >>> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c >>> index a5da750..6215264 100644 >>> --- a/fs/btrfs/qgroup.c >>> +++ b/fs/btrfs/qgroup.c >>> @@ -2367,6 +2367,7 @@ static int qgroup_reserve(struct btrfs_root *root, >>> u64 num_bytes, bool enforce) >>> struct btrfs_fs_info *fs_info = root->fs_info; >>> u64 ref_root = root->root_key.objectid; >>> int ret = 0; >>> + int retried = 0; >>> struct ulist_node *unode; >>> struct ulist_iterator uiter; >>> >>> @@ -2376,6 +2377,7 @@ static int qgroup_reserve(struct btrfs_root *root, >>> u64 num_bytes, bool enforce) >>> if (num_bytes == 0) >>> return 0; >>> >>> +retry: >>>
Re: [PATCH v2] qgroup: Retry after commit on getting EDQUOT
On 03/13/2017 07:14 AM, Filipe Manana wrote: > On Sat, Mar 11, 2017 at 11:02 PM, Goldwyn Rodrigueswrote: >> From: Goldwyn Rodrigues >> >> We are facing the same problem with EDQUOT which was experienced with >> ENOSPC. Not sure if we require a full ticketing system such as ENOSPC, but >> here is a fix. Let me know if it is too big a hammer. >> >> Quotas are reserved during the start of an operation, incrementing >> qg->reserved. However, it is written to disk in a commit_transaction >> which could take as long as commit_interval. In the meantime there >> could be deletions which are not accounted for because deletions are >> accounted for only while committed (free_refroot). So, when we get >> a EDQUOT flush the data to disk and try again. > > So this doesn't explain why the problem happens, taking into account > the reproducer below. > You mention that deletions are only accounted at transaction commit > time, but the reproducer does not do any deletions. Yes it does, after Cleanup. It is a loop. I suppose you got confused by the indent. > So why does the reproducer gets EDQUOT errors? The second sentence > "However, it is written to disk in a commit_transaction > which could take as long as commit_interval." also does not explain > why is that a problem, why it makes the reproducer get EDQUOT errors. > > I would like to see a clear and complete explanation in the changelog. I suppose this comment is irrelevant now? > >> >> I combined the conditions of rfer and excl to reduce code lines, though >> the condition looks uglier. > > This sentence is outdated, it's no longer valid for v2 as it now calls > qgroup_check_limits(). > >> >> Here is a sample script which shows this issue. >> >> DEVICE=/dev/vdb >> MOUNTPOINT=/mnt >> TESTVOL=$MOUNTPOINT/tmp >> QUOTA=5 >> PROG=btrfs >> DD_BS="4k" >> DD_COUNT="256" >> RUN_TIMES=5000 >> >> mkfs.btrfs -f $DEVICE >> mount -o commit=240 $DEVICE $MOUNTPOINT >> $PROG subvolume create $TESTVOL >> $PROG quota enable $TESTVOL >> $PROG qgroup limit ${QUOTA}G $TESTVOL >> >> typeset -i DD_RUN_GOOD >> typeset -i QUOTA >> >> function _check_cmd() { >> if [[ ${?} > 0 ]]; then >> echo -n "$(date) E: Running previous command" >> echo ${*} >> echo "Without sync" >> $PROG qgroup show -pcreFf ${TESTVOL} >> echo "With sync" >> $PROG qgroup show -pcreFf --sync ${TESTVOL} >> exit 1 >> fi >> } >> >> while true; do >> DD_RUN_GOOD=$RUN_TIMES >> >> while (( ${DD_RUN_GOOD} != 0 )); do >> dd if=/dev/zero of=${TESTVOL}/quotatest${DD_RUN_GOOD} bs=${DD_BS} >> count=${DD_COUNT} >> _check_cmd "dd if=/dev/zero of=${TESTVOL}/quotatest${DD_RUN_GOOD} >> bs=${DD_BS} count=${DD_COUNT}" >> DD_RUN_GOOD=(${DD_RUN_GOOD}-1) >> done >> >> $PROG qgroup show -pcref $TESTVOL >> echo "--- Cleanup -- " >> rm $TESTVOL/quotatest* Here is the deletion. >> >> done And the infinite loop ending. > > Please write a test case for fstests. You mean xfstests. will do. > >> >> Changes since v1: >> - Changed start_delalloc_roots() to start_delalloc_inode() to target >>the root in question only to reduce the amount of flush to be done. >> - Added wait_ordered_extents(). > > Listing what changed between versions is not meant to appear in the > changelog. This should go after the --- below. You should also prefix > the patch subject with "btrfs:" (or "btrfs-progs:"). See the examples > from others or the wiki for intsructions > (https://btrfs.wiki.kernel.org/index.php/Writing_patch_for_btrfs#Updating_patches). > Ok, understood. Thanks. > thanks > >> >> Signed-off-by: Goldwyn Rodrigues >> Reviewed-by: Qu Wenruo >> --- >> fs/btrfs/qgroup.c | 15 +++ >> 1 file changed, 15 insertions(+) >> >> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c >> index a5da750..6215264 100644 >> --- a/fs/btrfs/qgroup.c >> +++ b/fs/btrfs/qgroup.c >> @@ -2367,6 +2367,7 @@ static int qgroup_reserve(struct btrfs_root *root, u64 >> num_bytes, bool enforce) >> struct btrfs_fs_info *fs_info = root->fs_info; >> u64 ref_root = root->root_key.objectid; >> int ret = 0; >> + int retried = 0; >> struct ulist_node *unode; >> struct ulist_iterator uiter; >> >> @@ -2376,6 +2377,7 @@ static int qgroup_reserve(struct btrfs_root *root, u64 >> num_bytes, bool enforce) >> if (num_bytes == 0) >> return 0; >> >> +retry: >> spin_lock(_info->qgroup_lock); >> quota_root = fs_info->quota_root; >> if (!quota_root) >> @@ -2402,6 +2404,19 @@ static int qgroup_reserve(struct btrfs_root *root, >> u64 num_bytes, bool enforce) >> qg = unode_aux_to_qgroup(unode); >> >> if (enforce && !qgroup_check_limits(qg, num_bytes)) { >> + if (!retried &&
Re: [PATCH v2] qgroup: Retry after commit on getting EDQUOT
On Sat, Mar 11, 2017 at 11:02 PM, Goldwyn Rodrigueswrote: > From: Goldwyn Rodrigues > > We are facing the same problem with EDQUOT which was experienced with > ENOSPC. Not sure if we require a full ticketing system such as ENOSPC, but > here is a fix. Let me know if it is too big a hammer. > > Quotas are reserved during the start of an operation, incrementing > qg->reserved. However, it is written to disk in a commit_transaction > which could take as long as commit_interval. In the meantime there > could be deletions which are not accounted for because deletions are > accounted for only while committed (free_refroot). So, when we get > a EDQUOT flush the data to disk and try again. So this doesn't explain why the problem happens, taking into account the reproducer below. You mention that deletions are only accounted at transaction commit time, but the reproducer does not do any deletions. So why does the reproducer gets EDQUOT errors? The second sentence "However, it is written to disk in a commit_transaction which could take as long as commit_interval." also does not explain why is that a problem, why it makes the reproducer get EDQUOT errors. I would like to see a clear and complete explanation in the changelog. > > I combined the conditions of rfer and excl to reduce code lines, though > the condition looks uglier. This sentence is outdated, it's no longer valid for v2 as it now calls qgroup_check_limits(). > > Here is a sample script which shows this issue. > > DEVICE=/dev/vdb > MOUNTPOINT=/mnt > TESTVOL=$MOUNTPOINT/tmp > QUOTA=5 > PROG=btrfs > DD_BS="4k" > DD_COUNT="256" > RUN_TIMES=5000 > > mkfs.btrfs -f $DEVICE > mount -o commit=240 $DEVICE $MOUNTPOINT > $PROG subvolume create $TESTVOL > $PROG quota enable $TESTVOL > $PROG qgroup limit ${QUOTA}G $TESTVOL > > typeset -i DD_RUN_GOOD > typeset -i QUOTA > > function _check_cmd() { > if [[ ${?} > 0 ]]; then > echo -n "$(date) E: Running previous command" > echo ${*} > echo "Without sync" > $PROG qgroup show -pcreFf ${TESTVOL} > echo "With sync" > $PROG qgroup show -pcreFf --sync ${TESTVOL} > exit 1 > fi > } > > while true; do > DD_RUN_GOOD=$RUN_TIMES > > while (( ${DD_RUN_GOOD} != 0 )); do > dd if=/dev/zero of=${TESTVOL}/quotatest${DD_RUN_GOOD} bs=${DD_BS} > count=${DD_COUNT} > _check_cmd "dd if=/dev/zero of=${TESTVOL}/quotatest${DD_RUN_GOOD} > bs=${DD_BS} count=${DD_COUNT}" > DD_RUN_GOOD=(${DD_RUN_GOOD}-1) > done > > $PROG qgroup show -pcref $TESTVOL > echo "--- Cleanup -- " > rm $TESTVOL/quotatest* > > done Please write a test case for fstests. > > Changes since v1: > - Changed start_delalloc_roots() to start_delalloc_inode() to target >the root in question only to reduce the amount of flush to be done. > - Added wait_ordered_extents(). Listing what changed between versions is not meant to appear in the changelog. This should go after the --- below. You should also prefix the patch subject with "btrfs:" (or "btrfs-progs:"). See the examples from others or the wiki for intsructions (https://btrfs.wiki.kernel.org/index.php/Writing_patch_for_btrfs#Updating_patches). thanks > > Signed-off-by: Goldwyn Rodrigues > Reviewed-by: Qu Wenruo > --- > fs/btrfs/qgroup.c | 15 +++ > 1 file changed, 15 insertions(+) > > diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c > index a5da750..6215264 100644 > --- a/fs/btrfs/qgroup.c > +++ b/fs/btrfs/qgroup.c > @@ -2367,6 +2367,7 @@ static int qgroup_reserve(struct btrfs_root *root, u64 > num_bytes, bool enforce) > struct btrfs_fs_info *fs_info = root->fs_info; > u64 ref_root = root->root_key.objectid; > int ret = 0; > + int retried = 0; > struct ulist_node *unode; > struct ulist_iterator uiter; > > @@ -2376,6 +2377,7 @@ static int qgroup_reserve(struct btrfs_root *root, u64 > num_bytes, bool enforce) > if (num_bytes == 0) > return 0; > > +retry: > spin_lock(_info->qgroup_lock); > quota_root = fs_info->quota_root; > if (!quota_root) > @@ -2402,6 +2404,19 @@ static int qgroup_reserve(struct btrfs_root *root, u64 > num_bytes, bool enforce) > qg = unode_aux_to_qgroup(unode); > > if (enforce && !qgroup_check_limits(qg, num_bytes)) { > + if (!retried && qg->reserved > 0) { > + struct btrfs_trans_handle *trans; > + spin_unlock(_info->qgroup_lock); > + btrfs_start_delalloc_inodes(root, 0); > + btrfs_wait_ordered_extents(root, -1, 0, > + (u64)-1); > + trans = btrfs_join_transaction(root); > +
Re: Home storage with btrfs
On 2017-03-13 07:52, Juan Orti Alcaine wrote: 2017-03-13 12:29 GMT+01:00 Hérikz Nawarro: Hello everyone, Today is safe to use btrfs for home storage? No raid, just secure storage for some files and create snapshots from it. In my humble opinion, yes. I'm running a RAID1 btrfs at home for 5 years and I feel the most serious bugs have been fixed, because in the last two years I have not experienced any issue. In general, I'd agree. I've not seen any issues resulting from BTRFS itself for the past 2.5 years (although it's helped me find quite a lot of marginal or failing hardware over that time), but I've also not used many of the less stable features (raid56, qgroups, and a handful of other things). One piece of advice I will give though, try to keep the total number of snapshots to a reasonably small three digit number (ideally less than 200, absolutely less than 300), otherwise performance is going to be horrible. Anyway, keeping your kernel and btrfs-progs updated is a must, and of course, having good backups. I'm using Fedora and it's fine. Also agreed, Fedora is one of the best options for a traditional distro (they're very good about staying up to date and back-porting bug-fixes from the upstream kernel). The other two I'd recommend are Arch (they actually use an almost upstream kernel and are generally the first distro to have new versions of any arbitrary software) and Gentoo (similar to Arch, but more maintenance intensive (although also more efficient (usually))). -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Home storage with btrfs
2017-03-13 12:29 GMT+01:00 Hérikz Nawarro: > Hello everyone, > > Today is safe to use btrfs for home storage? No raid, just secure > storage for some files and create snapshots from it. > In my humble opinion, yes. I'm running a RAID1 btrfs at home for 5 years and I feel the most serious bugs have been fixed, because in the last two years I have not experienced any issue. Anyway, keeping your kernel and btrfs-progs updated is a must, and of course, having good backups. I'm using Fedora and it's fine. Kind regards. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Home storage with btrfs
Hello everyone, Today is safe to use btrfs for home storage? No raid, just secure storage for some files and create snapshots from it. Thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 00/17] fs, btrfs refcount conversions
> On Fri, Mar 03, 2017 at 10:55:09AM +0200, Elena Reshetova wrote: > > Now when new refcount_t type and API are finally merged > > (see include/linux/refcount.h), the following > > patches convert various refcounters in the btrfs filesystem from atomic_t > > to refcount_t. By doing this we prevent intentional or accidental > > underflows or overflows that can led to use-after-free vulnerabilities. > > > > The below patches are fully independent and can be cherry-picked separately. > > Since we convert all kernel subsystems in the same fashion, resulting > > in about 300 patches, we have to group them for sending at least in some > > fashion to be manageable. Please excuse the long cc list. > > Thanks, the patchset looks good to me, I'm going to add it to the 4.12 queue. Thank you very much! > > > These patches have been tested with xfstests by running btrfs-related tests. > > btrfs debug was enabled, warns on refcount errors, too. No output related to > > refcount errors produced. However, the following errors were during the run: > > * tests btrfs/078, btrfs/114, btrfs/115, no errors anywhere in dmesg, but > > process hangs. They all seem to be around qgroup, sometimes error visible > > such as qgroup scan failed -4 before it blocks, but not always. > > * test btrfs/104 dmesg has additional error output: > > BTRFS warning (device vdc): qgroup 258 reserved space underflow, have: 0, > > to free: 4096 > > I tried looking at the code on what causes the failure, but could not > > figure > > it out. It doesn't seem to be related to any refcount changes at least IMO. > > > > The above test failures are hard for me to understand and interpreted, but > > they don't seem to relate to refcount conversions. > > I don't think they're related to the refcount updates so we'll address > them. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Oops in btrfs_recover_relocation, kernel 4.8.1
On Mon, Mar 13, 2017 at 10:22:04AM +0800, Qu Wenruo wrote: > > > At 03/10/2017 08:23 PM, Hugo Mills wrote: > > Does anyone recall seeing this oops before? Is it something that > >can be fixed with a newer kernel? (I'm on a USB stick for this, so a > >new kernel is a major undertaking, and I'd like some reasonable > >expectation of success if I do it). > > Yes, v4.10 has the fix for the bug. > > > > Background: I'm rebuilding a dead server. I needed to reduce the > >device count on this FS to 6. Stupidly, I attached one device using an > >external USB case, and the USB connection reset during the device > >delete (within a few seconds). I can mount the FS -o ro,recovery, but > >using -o recovery on its own causes the oops below. If I can recover > >in-place, that would save me a *lot* of time in restoring backups. > > > > Also... qgroups, WTH? I've *never* enabled qgroups on this FS. > > Did you use any auto-backup system? Yes -- my own, which doesn't enable qgroups. > Quite a lot of them will enable qgroup. > > Anyway, after mounting with v4.10 kernel, you can easily find out if > qgroup is enabled. They aren't: hrm@amelia:/media $ sudo btrfs qgroup show /media/btrfs/amelia/ ERROR: can't perform the search - No such file or directory ERROR: can't list qgroups: No such file or directory Anyway, it's all working now with a 4.10 kernel. Thanks for the reply. Hugo. > Thanks, > Qu > > > > > For what it's worth, the FS passes btrfs check --readonly with no > >errors reported. (btrfs --version is 4.7.3). > > > > Hugo. > > > >[ 566.852589] BTRFS warning (device sdh1): 'recovery' is deprecated, use > >'usebackuproot' instead > >[ 566.852591] BTRFS info (device sdh1): trying to use backup root at mount > >time > >[ 566.852592] BTRFS info (device sdh1): disk space caching is enabled > >[ 566.922803] BTRFS info (device sdh1): bdev /dev/sdh1 errs: wr 0, rd 20, > >flush 0, corrupt 0, gen 0 > >[ 578.715616] BUG: unable to handle kernel paging request at > >fe50 > >[ 578.715619] IP: [] > >qgroup_fix_relocated_data_extents+0x2b/0x2c0 [btrfs] > >[ 578.715638] PGD 2f400f067 PUD 2f4011067 PMD 0 > >[ 578.715640] Oops: [#1] > >[ 578.715642] Modules linked in: cpufreq_userspace(E) cpufreq_powersave(E) > >cpufreq_conservative(E) kvm_amd(E) kvm(E) irqbypass(E) crc32_pclmul(E) > >efi_pstore(E) ghash_clmulni_intel(E) pcspkr(E) serio_raw(E) efivars(E) > >fam15h_power(E) k10temp(E) btrfs(E) acpi_cpufreq(E) tpm_tis(E) > >tpm_tis_core(E) tpm(E) sp5100_tco(E) sg(E) snd_hda_codec_realtek(E) > >snd_hda_codec_hdmi(E) snd_hda_codec_generic(E) snd_hda_intel(E) 9p(E) > >snd_hda_codec(E) 9pnet(E) snd_hda_core(E) fscache(E) snd_hwdep(E) snd_pcm(E) > >snd_timer(E) snd(E) soundcore(E) shpchp(E) ib_iser(E) rdma_cm(E) iw_cm(E) > >ib_cm(E) ib_core(E) configfs(E) iscsi_tcp(E) libiscsi_tcp(E) libiscsi(E) > >scsi_transport_iscsi(E) fuse(E) evdev(E) aoe(E) efivarfs(E) ip_tables(E) > >x_tables(E) autofs4(E) loop(E) overlay(E) nls_utf8(E) isofs(E) raid10(E) > >raid456(E) async_raid6_recov(E) > >[ 578.715663] async_memcpy(E) async_pq(E) async_xor(E) async_tx(E) xor(E) > >raid6_pq(E) libcrc32c(E) crc32c_generic(E) raid0(E) multipath(E) linear(E) > >dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E) raid1(E) md_mod(E) > >sd_mod(E) hid_generic(E) usbhid(E) hid(E) uas(E) usb_storage(E) > >crc32c_intel(E) aesni_intel(E) aes_x86_64(E) glue_helper(E) lrw(E) > >gf128mul(E) ablk_helper(E) cryptd(E) ohci_pci(E) ahci(E) libahci(E) > >sata_sil24(E) i2c_piix4(E) r8169(E) mii(E) ehci_pci(E) ohci_hcd(E) > >ehci_hcd(E) libata(E) scsi_mod(E) radeon(E) i2c_algo_bit(E) > >drm_kms_helper(E) xhci_pci(E) xhci_hcd(E) usbcore(E) usb_common(E) ttm(E) > >drm(E) button(E) > >[ 578.715684] CPU: 0 PID: 3532 Comm: mount Tainted: GE > >4.8.0-1-grml-amd64 #1 Debian 4.8.15-1+grml.1 > >[ 578.715684] Hardware name: Gigabyte Technology Co., Ltd. To be filled by > >O.E.M./970A-DS3P, BIOS FD 02/26/2016 > >[ 578.715686] task: 8fd7ead82fc0 task.stack: 8fd7db15c000 > >[ 578.715687] RIP: 0010:[] [] > >qgroup_fix_relocated_data_extents+0x2b/0x2c0 [btrfs] > >[ 578.715699] RSP: 0018:8fd7db15fa08 EFLAGS: 00010246 > >[ 578.715700] RAX: 8fd7db133800 RBX: 8fd7e30c79a0 RCX: > > > >[ 578.715701] RDX: 8fd7ddb5cd10 RSI: 8fd7dc6b5000 RDI: > >8fd7ddb5cc80 > >[ 578.715701] RBP: 8fd7e7445000 R08: R09: > >8fd7ddb5cc80 > >[ 578.715702] R10: R11: R12: > >8fd7db15faa0 > >[ 578.715703] R13: 8fd7dc6b5000 R14: R15: > >8fd7ddb5cc80 > >[ 578.715704] FS: 7f23cd67f480() GS:bca35000() > >knlGS: > >[ 578.715705] CS: 0010 DS: ES: CR0: 80050033 > >[ 578.715706] CR2: fe50 CR3: 00042302f000 CR4: > >000406b0 > >[ 578.715707] Stack: > >[ 578.715707] 0801
Re: [PATCH 4/4] btrfs: cleanup barrier_all_devices() to check dev stat flush error
At 03/13/2017 03:42 PM, Anand Jain wrote: The objective of this patch is to cleanup barrier_all_devices() so that the error checking is in a separate loop independent of of the loop which submits and waits on the device flush requests. The idea itself is quite good, and we do need it. By doing this it helps to further develop patches which would tune the error-actions as needed. Here functions such as btrfs_dev_stats_dirty() couldn't be used because it doesn't monitor the flush errors BTRFS_DEV_STAT_FLUSH_ERRS. Signed-off-by: Anand Jain--- fs/btrfs/disk-io.c | 96 +++--- 1 file changed, 85 insertions(+), 11 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 5719e036048b..12531a5b14ff 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -3566,6 +3566,76 @@ static int write_dev_flush(struct btrfs_device *device, int wait) return 0; } +struct device_checkpoint { + struct list_head list; + struct btrfs_device *device; + int stat_value_checkpoint; +}; + +static int add_device_checkpoint(struct list_head *checkpoint, Could we have another structure instead of list_head to record device checkpoints? The list_head is never a meaningful structure under most cases. + struct btrfs_device *device) +{ + struct device_checkpoint *cdev = + kzalloc(sizeof(struct device_checkpoint), GFP_KERNEL); + if (!cdev) + return -ENOMEM; This means that, we must check return value of add_device_checkpoint(), while later code doesn't check it at all. + + list_add(>list, checkpoint); And I prefer to do extra check, in case such device is already inserted once. + + cdev->device = device; + cdev->stat_value_checkpoint = + btrfs_dev_stat_read(device, BTRFS_DEV_STAT_FLUSH_ERRS); + + return 0; +} + +static void fini_devices_checkpoint(struct list_head *checkpoint) Never a fan of the "fini_" naming. What about "release_"? +{ + struct device_checkpoint *cdev; + + while(!list_empty(checkpoint)) { + cdev = list_entry(checkpoint->next, + struct device_checkpoint, list); + list_del(>list); + kfree(cdev); + } +} + +static int check_stat_flush(struct btrfs_device *dev, + struct list_head *checkpoint) +{ + int val; + struct device_checkpoint *cdev; + + list_for_each_entry(cdev, checkpoint, list) { + if (cdev->device == dev) { + val = btrfs_dev_stat_read(dev, + BTRFS_DEV_STAT_FLUSH_ERRS); + if (cdev->stat_value_checkpoint != val) + return 1; This check implies that BTRFS_DEV_STAT_FLUSH_ERRS will only be modified by checkpoint related code. Or any other modifier can easily break the check, causing false alert. IIRC that's the reason why I update my previous degraded patch. Personally speaking, I prefer the patchset to take more usage of the checkpoint system, or it's a little overkilled for current usage. Thanks, Qu + } + } + return 0; +} + +static int check_barrier_error(struct btrfs_fs_devices *fsdevs, + struct list_head *checkpoint) +{ + int dropouts = 0; + struct btrfs_device *dev; + + list_for_each_entry_rcu(dev, >devices, dev_list) { + if (!dev->bdev || check_stat_flush(dev, checkpoint)) + dropouts++; + } + + if (dropouts > + fsdevs->fs_info->num_tolerated_disk_barrier_failures) + return -EIO; + + return 0; +} + /* * send an empty flush down to each device in parallel, * then wait for them @@ -3574,8 +3644,10 @@ static int barrier_all_devices(struct btrfs_fs_info *info) { struct list_head *head; struct btrfs_device *dev; - int dropouts = 0; int ret; + struct list_head checkpoint; + + INIT_LIST_HEAD(); /* send down all the barriers */ head = >fs_devices->devices; @@ -3587,29 +3659,31 @@ static int barrier_all_devices(struct btrfs_fs_info *info) if (!dev->in_fs_metadata || !dev->writeable) continue; + add_device_checkpoint(, dev); ret = write_dev_flush(dev, 0); - if (ret) + if (ret) { + fini_devices_checkpoint(); return ret; + } } /* wait for all the barriers */ list_for_each_entry_rcu(dev, head, dev_list) { if (dev->missing) continue; - if (!dev->bdev) { - dropouts++; + if (!dev->bdev) continue;
Re: [PATCH v3.1 5/7] btrfs: Allow barrier_all_devices to do chunk level device check
Qu, patch 4/4 added a cleanup for barrier_all_devices() and introduced a new function check_barrier_error() where integration with per chunk level device check will simplify. [PATCH 4/4] btrfs: cleanup barrier_all_devices() to check dev stat flush error Thanks, Anand On 03/09/2017 09:34 AM, Qu Wenruo wrote: The last user of num_tolerated_disk_barrier_failures is barrier_all_devices(). But it's can be easily changed to new per-chunk degradable check framework. Now btrfs_device will have two extra members, representing send/wait error, set at write_dev_flush() time. With these 2 new members, btrfs_check_rw_degradable() can check if the fs is still OK when the fs is committed to disk. Signed-off-by: Qu WenruoTested-by: Austin S. Hemmelgarn Tested-by: Adam Borowski Tested-by: Dmitrii Tcvetkov --- fs/btrfs/disk-io.c | 21 + 1 file changed, 13 insertions(+), 8 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 658b8fab1d39..549045a3e15f 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -3570,17 +3570,20 @@ static int barrier_all_devices(struct btrfs_fs_info *info) { struct list_head *head; struct btrfs_device *dev; - int errors_send = 0; - int errors_wait = 0; + struct extra_rw_degrade_errors *errors; int ret; + errors = alloc_extra_rw_degrade_errors(info->fs_devices->num_devices); + if (!errors) + return -ENOMEM; + /* send down all the barriers */ head = >fs_devices->devices; list_for_each_entry_rcu(dev, head, dev_list) { if (dev->missing) continue; if (!dev->bdev) { - errors_send++; + record_extra_rw_degrade_error(errors, dev->devid); continue; } if (!dev->in_fs_metadata || !dev->writeable) @@ -3588,7 +3591,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info) ret = write_dev_flush(dev, 0); if (ret) - errors_send++; + record_extra_rw_degrade_error(errors, dev->devid); } /* wait for all the barriers */ @@ -3596,7 +3599,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info) if (dev->missing) continue; if (!dev->bdev) { - errors_wait++; + record_extra_rw_degrade_error(errors, dev->devid); continue; } if (!dev->in_fs_metadata || !dev->writeable) @@ -3604,11 +3607,13 @@ static int barrier_all_devices(struct btrfs_fs_info *info) ret = write_dev_flush(dev, 1); if (ret) - errors_wait++; + record_extra_rw_degrade_error(errors, dev->devid); } - if (errors_send > info->num_tolerated_disk_barrier_failures || - errors_wait > info->num_tolerated_disk_barrier_failures) + if (!btrfs_check_rw_degradable(info, errors)) { + kfree(errors); return -EIO; + } + kfree(errors); return 0; } -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 0/9] Qgroup fixes
Patchset can be fetched from my github: https://github.com/adam900710/linux.git qgroup_fixes The new base is v4.11-rc1, only minor conflicts and are all easy to handle. This pull request includes: 1) Fix for inode_cache mount option Although no one really cares inode_cache mount option, it will screw qgroup for a long time. Not only it will screw up qgroup test uses golden output, but also new test cases use btrfsck to verify qgroup. 2) Fix for btrfs/104 kernel warning This is caused by quota enabling with dirty buffers not written to disc. Fixed by checking EXTENT_QGROUP_RESERVED flag other than just decreasing qgroup->reserved. 3) Fix qgroup underflow caused by freeing ranges not reserved by caller Mainly exposed by Chandan on PPC64. It's possible that buffered write is blocked by reserving metadata, and in error path we will release reserved space for both data and metadata. However the reserved data can already be reserved by another process writing the same page. In that case, data reserved space can be freed by two process, one for error path, one for normal write routine, causing underflow. Fixed by checking if that data range is reserved by ourselves and only free it if it's reserved by ourselves. Update since 2016/12/09: Rebased to latest for-linux-4.11. Add missing reviewed-by and tested-by tags. Add more comment for btrfs_qgroup_reserve_data() for error handling. Add more comment for qgroup_free_reserved_data() for later enhancement (not function enhancement). v2: Add reviewed-by tag for 2nd patch Update the first patch to follow the new trace point standard Qu Wenruo (9): btrfs: qgroup: Add trace point for qgroup reserved space btrfs: qgroup: Re-arrange tracepoint timing to co-operate with reserved space tracepoint btrfs: qgroup: Fix qgroup corruption caused by inode_cache mount option btrfs: qgroup: Add quick exit for non-fs extents btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function btrfs: qgroup: Return actually freed bytes for qgroup release or free data btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write and quota enable btrfs: qgroup: Introduce extent changeset for qgroup reserve functions btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges fs/btrfs/ctree.h | 12 +- fs/btrfs/extent-tree.c | 31 -- fs/btrfs/extent_io.h | 14 ++- fs/btrfs/file.c | 46 +--- fs/btrfs/inode-map.c | 6 +- fs/btrfs/inode.c | 64 +++ fs/btrfs/ioctl.c | 11 +- fs/btrfs/qgroup.c| 258 ++- fs/btrfs/qgroup.h| 58 -- fs/btrfs/relocation.c| 13 ++- fs/btrfs/transaction.c | 21 ++-- include/trace/events/btrfs.h | 44 12 files changed, 396 insertions(+), 182 deletions(-) -- 2.12.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 9/9] btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges
[BUG] For the following case, btrfs can underflow qgroup reserved space at error path: (Page size 4K, function name without "btrfs_" prefix) Task A | Task B -- Buffered_write [0, 2K) | |- check_data_free_space() | | |- qgroup_reserve_data() | | Range aligned to page | | range [0, 4K) <<< | | 4K bytes reserved <<< | |- copy pages to page cache | | Buffered_write [2K, 4K) | |- check_data_free_space() | | |- qgroup_reserved_data() | | Range alinged to page | | range [0, 4K) | | Already reserved by A <<< | | 0 bytes reserved <<< | |- delalloc_reserve_metadata() | | And it *FAILED* (Maybe EQUOTA) | |- free_reserved_data_space() |- qgroup_free_data() Range aligned to page range [0, 4K) Freeing 4K (Special thanks to Chandan for the detailed report and analyse) [CAUSE] Above Task B is freeing reserved data range [0, 4K) which is actually reserved by Task A. And at write back time, page dirty by Task A will go through writeback routine, which will free 4K reserved data space at file extent insert time, causing the qgroup underflow. [FIX] For btrfs_qgroup_free_data(), add @reserved parameter to only free data ranges reserved by previous btrfs_qgroup_reserve_data(). So in above case, Task B will try to free 0 byte, so no underflow. Reported-by: Chandan RajendraSigned-off-by: Qu Wenruo Reviewed-by: Chandan Rajendra Tested-by: Chandan Rajendra --- fs/btrfs/ctree.h | 6 +++-- fs/btrfs/extent-tree.c | 12 + fs/btrfs/file.c| 29 +++- fs/btrfs/inode.c | 29 ++-- fs/btrfs/ioctl.c | 4 +-- fs/btrfs/qgroup.c | 72 ++ fs/btrfs/qgroup.h | 3 ++- fs/btrfs/relocation.c | 8 +++--- 8 files changed, 117 insertions(+), 46 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 8349b3bd6522..f76b3ecdc715 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -2690,7 +2690,10 @@ enum btrfs_flush_state { int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes); int btrfs_check_data_free_space(struct inode *inode, struct extent_changeset *reserved, u64 start, u64 len); -void btrfs_free_reserved_data_space(struct inode *inode, u64 start, u64 len); +void btrfs_free_reserved_data_space(struct inode *inode, + struct extent_changeset *reserved, u64 start, u64 len); +void btrfs_delalloc_release_space(struct inode *inode, + struct extent_changeset *reserved, u64 start, u64 len); void btrfs_free_reserved_data_space_noquota(struct inode *inode, u64 start, u64 len); void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans, @@ -2709,7 +2712,6 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes); void btrfs_delalloc_release_metadata(struct btrfs_inode *inode, u64 num_bytes); int btrfs_delalloc_reserve_space(struct inode *inode, struct extent_changeset *reserved, u64 start, u64 len); -void btrfs_delalloc_release_space(struct inode *inode, u64 start, u64 len); void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv, unsigned short type); struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_fs_info *fs_info, unsigned short type); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 67f4e97ef8c6..9de0436becd2 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4336,7 +4336,8 @@ void btrfs_free_reserved_data_space_noquota(struct inode *inode, u64 start, * This one will handle the per-inode data rsv map for accurate reserved * space framework. */ -void btrfs_free_reserved_data_space(struct inode *inode, u64 start, u64 len) +void btrfs_free_reserved_data_space(struct inode *inode, + struct extent_changeset *reserved, u64 start, u64 len) { struct btrfs_root *root = BTRFS_I(inode)->root; @@ -4346,7 +4347,7 @@ void btrfs_free_reserved_data_space(struct inode *inode, u64 start, u64 len) start = round_down(start, root->fs_info->sectorsize); btrfs_free_reserved_data_space_noquota(inode, start, len); -
[PATCH v2 8/9] btrfs: qgroup: Introduce extent changeset for qgroup reserve functions
Introduce a new parameter, struct extent_changeset for btrfs_qgroup_reserved_data() and its callers. Such extent_changeset was used in btrfs_qgroup_reserve_data() to record which range it reserved in current reserve, so it can free it at error path. The reason we need to export it to callers is, at buffered write error path, without knowing what exactly which range we reserved in current allocation, we can free space which is not reserved by us. This will lead to qgroup reserved space underflow. Reviewed-by: Chandan RajendraSigned-off-by: Qu Wenruo --- fs/btrfs/ctree.h | 6 -- fs/btrfs/extent-tree.c | 17 - fs/btrfs/extent_io.h | 12 fs/btrfs/file.c| 17 ++--- fs/btrfs/inode-map.c | 6 +- fs/btrfs/inode.c | 24 fs/btrfs/ioctl.c | 7 ++- fs/btrfs/qgroup.c | 33 + fs/btrfs/qgroup.h | 3 ++- fs/btrfs/relocation.c | 5 - 10 files changed, 100 insertions(+), 30 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 29b7fc28c607..8349b3bd6522 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -2687,8 +2687,9 @@ enum btrfs_flush_state { COMMIT_TRANS= 6, }; -int btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len); int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes); +int btrfs_check_data_free_space(struct inode *inode, + struct extent_changeset *reserved, u64 start, u64 len); void btrfs_free_reserved_data_space(struct inode *inode, u64 start, u64 len); void btrfs_free_reserved_data_space_noquota(struct inode *inode, u64 start, u64 len); @@ -2706,7 +2707,8 @@ void btrfs_subvolume_release_metadata(struct btrfs_fs_info *fs_info, struct btrfs_block_rsv *rsv); int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes); void btrfs_delalloc_release_metadata(struct btrfs_inode *inode, u64 num_bytes); -int btrfs_delalloc_reserve_space(struct inode *inode, u64 start, u64 len); +int btrfs_delalloc_reserve_space(struct inode *inode, + struct extent_changeset *reserved, u64 start, u64 len); void btrfs_delalloc_release_space(struct inode *inode, u64 start, u64 len); void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv, unsigned short type); struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_fs_info *fs_info, diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 89d1bcba3f8c..67f4e97ef8c6 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -3355,12 +3355,14 @@ static int cache_save_setup(struct btrfs_block_group_cache *block_group, struct btrfs_fs_info *fs_info = block_group->fs_info; struct btrfs_root *root = fs_info->tree_root; struct inode *inode = NULL; + struct extent_changeset data_reserved; u64 alloc_hint = 0; int dcs = BTRFS_DC_ERROR; u64 num_pages = 0; int retries = 0; int ret = 0; + extent_changeset_init(_reserved); /* * If this block group is smaller than 100 megs don't bother caching the * block group. @@ -3473,7 +3475,7 @@ static int cache_save_setup(struct btrfs_block_group_cache *block_group, num_pages *= 16; num_pages *= PAGE_SIZE; - ret = btrfs_check_data_free_space(inode, 0, num_pages); + ret = btrfs_check_data_free_space(inode, _reserved, 0, num_pages); if (ret) goto out_put; @@ -3504,6 +3506,7 @@ static int cache_save_setup(struct btrfs_block_group_cache *block_group, block_group->disk_cache_state = dcs; spin_unlock(_group->lock); + extent_changeset_release(_reserved); return ret; } @@ -4272,7 +4275,8 @@ int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes) * Will replace old btrfs_check_data_free_space(), but for patch split, * add a new function first and then replace it. */ -int btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len) +int btrfs_check_data_free_space(struct inode *inode, + struct extent_changeset *reserved, u64 start, u64 len) { struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); int ret; @@ -4287,9 +4291,11 @@ int btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len) return ret; /* Use new btrfs_qgroup_reserve_data to reserve precious data space. */ - ret = btrfs_qgroup_reserve_data(inode, start, len); + ret = btrfs_qgroup_reserve_data(inode, reserved, start, len); if (ret < 0) btrfs_free_reserved_data_space_noquota(inode, start, len); + else + ret = 0; return ret; } @@ -6130,11 +6136,12 @@ void
[PATCH v2 5/9] btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function
Quite a lot of qgroup corruption happens due to wrong timing of calling btrfs_qgroup_prepare_account_extents(). Since the safest timing is calling it just before btrfs_qgroup_account_extents(), there is no need to separate these 2 function. Merging them will make code cleaner and less bug prone. Signed-off-by: Qu Wenruo--- fs/btrfs/qgroup.c | 49 - fs/btrfs/qgroup.h | 2 -- fs/btrfs/transaction.c | 10 -- 3 files changed, 16 insertions(+), 45 deletions(-) diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c index 445014de924c..9535ea56b6f3 100644 --- a/fs/btrfs/qgroup.c +++ b/fs/btrfs/qgroup.c @@ -1403,38 +1403,6 @@ int btrfs_limit_qgroup(struct btrfs_trans_handle *trans, return ret; } -int btrfs_qgroup_prepare_account_extents(struct btrfs_trans_handle *trans, -struct btrfs_fs_info *fs_info) -{ - struct btrfs_qgroup_extent_record *record; - struct btrfs_delayed_ref_root *delayed_refs; - struct rb_node *node; - u64 qgroup_to_skip; - int ret = 0; - - delayed_refs = >transaction->delayed_refs; - qgroup_to_skip = delayed_refs->qgroup_to_skip; - - /* -* No need to do lock, since this function will only be called in -* btrfs_commit_transaction(). -*/ - node = rb_first(_refs->dirty_extent_root); - while (node) { - record = rb_entry(node, struct btrfs_qgroup_extent_record, - node); - if (WARN_ON(!record->old_roots)) - ret = btrfs_find_all_roots(NULL, fs_info, - record->bytenr, 0, >old_roots); - if (ret < 0) - break; - if (qgroup_to_skip) - ulist_del(record->old_roots, qgroup_to_skip, 0); - node = rb_next(node); - } - return ret; -} - int btrfs_qgroup_trace_extent_nolock(struct btrfs_fs_info *fs_info, struct btrfs_delayed_ref_root *delayed_refs, struct btrfs_qgroup_extent_record *record) @@ -2051,6 +2019,18 @@ int btrfs_qgroup_account_extents(struct btrfs_trans_handle *trans, if (!ret) { /* +* old roots should be searched when inserting qgroup +* extent record +*/ + if (WARN_ON(!record->old_roots)) { + /* Search commit root to find old_roots */ + ret = btrfs_find_all_roots(NULL, fs_info, + record->bytenr, 0, + >old_roots); + if (ret < 0) + goto cleanup; + } + /* * Use (u64)-1 as time_seq to do special search, which * doesn't lock tree or delayed_refs and search current * root. It's safe inside commit_transaction(). @@ -2059,8 +2039,11 @@ int btrfs_qgroup_account_extents(struct btrfs_trans_handle *trans, record->bytenr, (u64)-1, _roots); if (ret < 0) goto cleanup; - if (qgroup_to_skip) + if (qgroup_to_skip) { ulist_del(new_roots, qgroup_to_skip, 0); + ulist_del(record->old_roots, qgroup_to_skip, + 0); + } ret = btrfs_qgroup_account_extent(trans, fs_info, record->bytenr, record->num_bytes, record->old_roots, new_roots); diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h index 19188e094d0e..b3c77f029dfe 100644 --- a/fs/btrfs/qgroup.h +++ b/fs/btrfs/qgroup.h @@ -134,8 +134,6 @@ int btrfs_limit_qgroup(struct btrfs_trans_handle *trans, int btrfs_read_qgroup_config(struct btrfs_fs_info *fs_info); void btrfs_free_qgroup_config(struct btrfs_fs_info *fs_info); struct btrfs_delayed_extent_op; -int btrfs_qgroup_prepare_account_extents(struct btrfs_trans_handle *trans, -struct btrfs_fs_info *fs_info); /* * Inform qgroup to trace one dirty extent, its info is recorded in @record. * So qgroup can account it at transaction committing time. diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 894c4980c15c..edee58751eb3 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -1376,9 +1376,6 @@ static int qgroup_account_snapshot(struct btrfs_trans_handle *trans, ret = commit_fs_roots(trans, fs_info); if (ret)
[PATCH v2 1/9] btrfs: qgroup: Add trace point for qgroup reserved space
Introduce the following trace points: qgroup_update_reserve qgroup_meta_reserve These trace points are handy to trace qgroup reserve space related problems. Also export btrfs_qgroup structure, as now we directly pass btrfs_qgroup structure to trace points, so that structure needs to be exported. Signed-off-by: Qu Wenruo--- fs/btrfs/qgroup.c| 52 +++- fs/btrfs/qgroup.h| 44 + include/trace/events/btrfs.h | 44 + 3 files changed, 96 insertions(+), 44 deletions(-) diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c index a5da750c1087..68a7305b0a46 100644 --- a/fs/btrfs/qgroup.c +++ b/fs/btrfs/qgroup.c @@ -47,50 +47,6 @@ * - check all ioctl parameters */ -/* - * one struct for each qgroup, organized in fs_info->qgroup_tree. - */ -struct btrfs_qgroup { - u64 qgroupid; - - /* -* state -*/ - u64 rfer; /* referenced */ - u64 rfer_cmpr; /* referenced compressed */ - u64 excl; /* exclusive */ - u64 excl_cmpr; /* exclusive compressed */ - - /* -* limits -*/ - u64 lim_flags; /* which limits are set */ - u64 max_rfer; - u64 max_excl; - u64 rsv_rfer; - u64 rsv_excl; - - /* -* reservation tracking -*/ - u64 reserved; - - /* -* lists -*/ - struct list_head groups; /* groups this group is member of */ - struct list_head members; /* groups that are members of this group */ - struct list_head dirty; /* dirty groups */ - struct rb_node node; /* tree of qgroups */ - - /* -* temp variables for accounting operations -* Refer to qgroup_shared_accounting() for details. -*/ - u64 old_refcnt; - u64 new_refcnt; -}; - static void btrfs_qgroup_update_old_refcnt(struct btrfs_qgroup *qg, u64 seq, int mod) { @@ -1075,6 +1031,7 @@ static int __qgroup_excl_accounting(struct btrfs_fs_info *fs_info, qgroup->excl += sign * num_bytes; qgroup->excl_cmpr += sign * num_bytes; if (sign > 0) { + trace_qgroup_update_reserve(fs_info, qgroup, -(s64)num_bytes); if (WARN_ON(qgroup->reserved < num_bytes)) report_reserved_underflow(fs_info, qgroup, num_bytes); else @@ -1100,6 +1057,8 @@ static int __qgroup_excl_accounting(struct btrfs_fs_info *fs_info, WARN_ON(sign < 0 && qgroup->excl < num_bytes); qgroup->excl += sign * num_bytes; if (sign > 0) { + trace_qgroup_update_reserve(fs_info, qgroup, + -(s64)num_bytes); if (WARN_ON(qgroup->reserved < num_bytes)) report_reserved_underflow(fs_info, qgroup, num_bytes); @@ -2424,6 +2383,7 @@ static int qgroup_reserve(struct btrfs_root *root, u64 num_bytes, bool enforce) qg = unode_aux_to_qgroup(unode); + trace_qgroup_update_reserve(fs_info, qg, num_bytes); qg->reserved += num_bytes; } @@ -2469,6 +2429,7 @@ void btrfs_qgroup_free_refroot(struct btrfs_fs_info *fs_info, qg = unode_aux_to_qgroup(unode); + trace_qgroup_update_reserve(fs_info, qg, -(s64)num_bytes); if (WARN_ON(qg->reserved < num_bytes)) report_reserved_underflow(fs_info, qg, num_bytes); else @@ -2945,6 +2906,7 @@ int btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes, return 0; BUG_ON(num_bytes != round_down(num_bytes, fs_info->nodesize)); + trace_qgroup_meta_reserve(root, (s64)num_bytes); ret = qgroup_reserve(root, num_bytes, enforce); if (ret < 0) return ret; @@ -2964,6 +2926,7 @@ void btrfs_qgroup_free_meta_all(struct btrfs_root *root) reserved = atomic_xchg(>qgroup_meta_rsv, 0); if (reserved == 0) return; + trace_qgroup_meta_reserve(root, -(s64)reserved); btrfs_qgroup_free_refroot(fs_info, root->objectid, reserved); } @@ -2978,6 +2941,7 @@ void btrfs_qgroup_free_meta(struct btrfs_root *root, int num_bytes) BUG_ON(num_bytes != round_down(num_bytes, fs_info->nodesize)); WARN_ON(atomic_read(>qgroup_meta_rsv) < num_bytes); atomic_sub(num_bytes, >qgroup_meta_rsv); + trace_qgroup_meta_reserve(root, -(s64)num_bytes); btrfs_qgroup_free_refroot(fs_info, root->objectid, num_bytes); } diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h index 26932a8a1993..e5b42f1044fc 100644 --- a/fs/btrfs/qgroup.h +++ b/fs/btrfs/qgroup.h @@ -62,6 +62,50 @@ struct
[PATCH v2 3/9] btrfs: qgroup: Fix qgroup corruption caused by inode_cache mount option
[BUG] The easist way to reproduce the bug is: -- # mkfs.btrfs -f $dev -n 16K # mount $dev $mnt -o inode_cache # btrfs quota enable $mnt # btrfs quota rescan -w $mnt # btrfs qgroup show $mnt qgroupid rfer excl 0/5 32.00KiB 32.00KiB ^^ Twice the correct value -- And fstests/btrfs qgroup test group can easily detect them with inode_cache mount option. Although some of them are false alerts since old test cases are using fixed golden output. While new test cases will use "btrfs check" to detect qgroup mismatch. [CAUSE] Inode_cache mount option will make commit_fs_roots() to call btrfs_save_ino_cache() to update fs/subvol trees, and generate new delayed refs. However we call btrfs_qgroup_prepare_account_extents() too early, before commit_fs_roots(). This makes the "old_roots" for newly generated extents are always NULL. For freeing extent case, this makes both new_roots and old_roots to be empty, while correct old_roots should not be empty. This causing qgroup numbers not decreased correctly. [FIX] Modify the timing of calling btrfs_qgroup_prepare_account_extents() to just before btrfs_qgroup_account_extents(), and add needed delayed_refs handler. So qgroup can handle inode_map mount options correctly. Signed-off-by: Qu Wenruo--- fs/btrfs/transaction.c | 25 ++--- 1 file changed, 18 insertions(+), 7 deletions(-) diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 61b807de3e16..894c4980c15c 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -2130,13 +2130,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) goto scrub_continue; } - /* Reocrd old roots for later qgroup accounting */ - ret = btrfs_qgroup_prepare_account_extents(trans, fs_info); - if (ret) { - mutex_unlock(_info->reloc_mutex); - goto scrub_continue; - } - /* * make sure none of the code above managed to slip in a * delayed item @@ -2179,6 +2172,24 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) btrfs_free_log_root_tree(trans, fs_info); /* +* commit_fs_roots() can call btrfs_save_ino_cache(), which generates +* new delayed refs. Must handle them or qgroup can be wrong. +*/ + ret = btrfs_run_delayed_refs(trans, fs_info, (unsigned long)-1); + if (ret) { + mutex_unlock(_info->tree_log_mutex); + mutex_unlock(_info->reloc_mutex); + goto scrub_continue; + } + + ret = btrfs_qgroup_prepare_account_extents(trans, fs_info); + if (ret) { + mutex_unlock(_info->tree_log_mutex); + mutex_unlock(_info->reloc_mutex); + goto scrub_continue; + } + + /* * Since fs roots are all committed, we can get a quite accurate * new_roots. So let's do quota accounting. */ -- 2.12.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 4/9] btrfs: qgroup: Add quick exit for non-fs extents
For btrfs_qgroup_account_extent(), modify make it exit quicker for non-fs extents. This will also reduce the noise in trace_btrfs_qgroup_account_extent event. Signed-off-by: Qu Wenruo--- fs/btrfs/qgroup.c | 41 +++-- 1 file changed, 39 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c index 1fe481d71be5..445014de924c 100644 --- a/fs/btrfs/qgroup.c +++ b/fs/btrfs/qgroup.c @@ -1915,6 +1915,33 @@ static int qgroup_update_counters(struct btrfs_fs_info *fs_info, return 0; } +/* + * Helper to check if the @roots is a list of fs tree roots + * Return 0 for definitely not a fs/subvol tree roots ulist + * Return 1 for possible fs/subvol tree roots ulist(including empty) + */ +static int maybe_fs_roots(struct ulist *roots) +{ + struct ulist_node *unode; + struct ulist_iterator uiter; + + /* Empty one, still possible for fs roots */ + if (!roots || roots->nnodes == 0) + return 1; + + ULIST_ITER_INIT(); + unode = ulist_next(roots, ); + if (!unode) + return 1; + + /* +* If it contains fs tree roots, then it must belongs to fs/subvol +* trees. +* If it contains non-fs tree, it won't be shared to fs/subvol trees. +*/ + return is_fstree(unode->val); +} + int btrfs_qgroup_account_extent(struct btrfs_trans_handle *trans, struct btrfs_fs_info *fs_info, @@ -1931,10 +1958,20 @@ btrfs_qgroup_account_extent(struct btrfs_trans_handle *trans, if (!test_bit(BTRFS_FS_QUOTA_ENABLED, _info->flags)) return 0; - if (new_roots) + if (new_roots) { + if (!maybe_fs_roots(new_roots)) + goto out_free; nr_new_roots = new_roots->nnodes; - if (old_roots) + } + if (old_roots) { + if (!maybe_fs_roots(old_roots)) + goto out_free; nr_old_roots = old_roots->nnodes; + } + + /* Quick exit, either not fs tree roots, or won't affect any qgroup */ + if (nr_old_roots == 0 && nr_new_roots == 0) + goto out_free; BUG_ON(!fs_info->quota_root); -- 2.12.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 7/9] btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write and quota enable
[BUG] Under the following case, we can underflow qgroup reserved space. Task A|Task B --- Quota disabled | Buffered write | |- btrfs_check_data_free_space() | | *NO* qgroup space is reserved | | since quota is *DISABLED* | |- All pages are copied to page | cache | | Enable quota | Quota scan finished | | Sync_fs | |- run_delalloc_range | |- Write pages | |- btrfs_finish_ordered_io ||- insert_reserved_file_extent | |- btrfs_qgroup_release_data() | Since no qgroup space is reserved in Task A, we underflow qgroup reserved space This can be detected by fstest btrfs/104. [CAUSE] In insert_reserved_file_extent() we info qgroup to release the @ram_bytes size of qgroup reserved_space under all case. And btrfs_qgroup_release_data() will check if qgroup is enabled. However in above case, the buffered write happens before quota is enabled, so we don't havee reserved space for that range. [FIX] In insert_reserved_file_extent(), we info qgroup to release the acctual byte number it released. In above case, since we don't have reserved space, we info qgroup to release 0 byte, so the problem can be fixed. And thanks to the @reserved parameter introduced by qgroup rework, and previous patch to return release bytes, the fix can be as small as less than 10 lines. Signed-off-by: Qu Wenruo--- fs/btrfs/inode.c | 11 --- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index c40060cc481f..36b08abff564 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -2044,6 +2044,7 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans, struct btrfs_path *path; struct extent_buffer *leaf; struct btrfs_key ins; + u64 qg_released; int extent_inserted = 0; int ret; @@ -2099,13 +2100,17 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans, ins.objectid = disk_bytenr; ins.offset = disk_num_bytes; ins.type = BTRFS_EXTENT_ITEM_KEY; - ret = btrfs_alloc_reserved_file_extent(trans, root->root_key.objectid, - btrfs_ino(BTRFS_I(inode)), file_pos, ram_bytes, ); + /* * Release the reserved range from inode dirty range map, as it is * already moved into delayed_ref_head */ - btrfs_qgroup_release_data(inode, file_pos, ram_bytes); + ret = btrfs_qgroup_release_data(inode, file_pos, ram_bytes); + if (ret < 0) + goto out; + qg_released = ret; + ret = btrfs_alloc_reserved_file_extent(trans, root->root_key.objectid, + btrfs_ino(BTRFS_I(inode)), file_pos, qg_released, ); out: btrfs_free_path(path); -- 2.12.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 2/9] btrfs: qgroup: Re-arrange tracepoint timing to co-operate with reserved space tracepoint
Newly introduced qgroup reserved space trace points are normally nested into several common qgroup operations. While some other trace points are not well placed to co-operate with them, causing confusing output. This patch re-arrange trace_btrfs_qgroup_release_data() and trace_btrfs_qgroup_free_delayed_ref() trace points so they are triggered before reserved space ones. Signed-off-by: Qu WenruoReviewed-by: Jeff Mahoney --- fs/btrfs/qgroup.c | 10 +- fs/btrfs/qgroup.h | 6 +- 2 files changed, 6 insertions(+), 10 deletions(-) diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c index 68a7305b0a46..1fe481d71be5 100644 --- a/fs/btrfs/qgroup.c +++ b/fs/btrfs/qgroup.c @@ -2847,14 +2847,14 @@ static int __btrfs_qgroup_release_data(struct inode *inode, u64 start, u64 len, if (ret < 0) goto out; - if (free) { - btrfs_qgroup_free_refroot(BTRFS_I(inode)->root->fs_info, - BTRFS_I(inode)->root->objectid, - changeset.bytes_changed); + if (free) trace_op = QGROUP_FREE; - } trace_btrfs_qgroup_release_data(inode, start, len, changeset.bytes_changed, trace_op); + if (free) + btrfs_qgroup_free_refroot(BTRFS_I(inode)->root->fs_info, + BTRFS_I(inode)->root->objectid, + changeset.bytes_changed); out: ulist_release(_changed); return ret; diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h index e5b42f1044fc..19188e094d0e 100644 --- a/fs/btrfs/qgroup.h +++ b/fs/btrfs/qgroup.h @@ -230,15 +230,11 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, struct btrfs_qgroup_inherit *inherit); void btrfs_qgroup_free_refroot(struct btrfs_fs_info *fs_info, u64 ref_root, u64 num_bytes); -/* - * TODO: Add proper trace point for it, as btrfs_qgroup_free() is - * called by everywhere, can't provide good trace for delayed ref case. - */ static inline void btrfs_qgroup_free_delayed_ref(struct btrfs_fs_info *fs_info, u64 ref_root, u64 num_bytes) { - btrfs_qgroup_free_refroot(fs_info, ref_root, num_bytes); trace_btrfs_qgroup_free_delayed_ref(fs_info, ref_root, num_bytes); + btrfs_qgroup_free_refroot(fs_info, ref_root, num_bytes); } void assert_qgroups_uptodate(struct btrfs_trans_handle *trans); -- 2.12.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 6/9] btrfs: qgroup: Return actually freed bytes for qgroup release or free data
btrfs_qgroup_release/free_data() only returns 0 or minus error number(ENOMEM is the only possible error). This is normally good enough, but sometimes we need the accurate byte number it freed/released. Change it to return actually released/freed bytenr number instead of 0 for success. And slightly modify related extent_changeset structure, since in btrfs one none-hole data extent won't be larger than 128M, so "unsigned int" is large enough for the use case. Signed-off-by: Qu Wenruo--- fs/btrfs/extent-tree.c | 2 +- fs/btrfs/extent_io.h | 2 +- fs/btrfs/qgroup.c | 1 + 3 files changed, 3 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index be5477676cc8..89d1bcba3f8c 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4288,7 +4288,7 @@ int btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len) /* Use new btrfs_qgroup_reserve_data to reserve precious data space. */ ret = btrfs_qgroup_reserve_data(inode, start, len); - if (ret) + if (ret < 0) btrfs_free_reserved_data_space_noquota(inode, start, len); return ret; } diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h index 3e4fad4a909d..bdca004dc859 100644 --- a/fs/btrfs/extent_io.h +++ b/fs/btrfs/extent_io.h @@ -201,7 +201,7 @@ struct extent_buffer { */ struct extent_changeset { /* How many bytes are set/cleared in this operation */ - u64 bytes_changed; + unsigned int bytes_changed; /* Changed ranges */ struct ulist range_changed; diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c index 9535ea56b6f3..87aeefb34959 100644 --- a/fs/btrfs/qgroup.c +++ b/fs/btrfs/qgroup.c @@ -2875,6 +2875,7 @@ static int __btrfs_qgroup_release_data(struct inode *inode, u64 start, u64 len, btrfs_qgroup_free_refroot(BTRFS_I(inode)->root->fs_info, BTRFS_I(inode)->root->objectid, changeset.bytes_changed); + ret = changeset.bytes_changed; out: ulist_release(_changed); return ret; -- 2.12.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v7 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
At 03/13/2017 03:26 PM, Stefan Priebe - Profihost AG wrote: Hi Qu, Am 13.03.2017 um 02:16 schrieb Qu Wenruo: At 03/13/2017 04:49 AM, Stefan Priebe - Profihost AG wrote: Hi Qu, while V5 was running fine against the openSUSE-42.2 kernel (based on v4.4). Thanks for the test. V7 results in OOPS to me: BUG: unable to handle kernel NULL pointer dereference at 01f0 This 0x1f0 is the same as offsetof(struct brrfs_root, fs_info), quite nice clue. IP: [] __endio_write_update_ordered+0x33/0x140 [btrfs] IP points to: --- static inline bool btrfs_is_free_space_inode(struct btrfs_inode *inode) { struct btrfs_root *root = inode->root; << Either here if (root == root->fs_info->tree_root && << Or here btrfs_ino(inode) != BTRFS_BTREE_INODE_OBJECTID) --- Taking the above offset into consideration, it's only possible for later case. So here, we have a btrfs_inode whose @root is NULL. But wasn't this part of the code identical in V5? Why does it only happen with V7? There are still difference, but just as you said, the related part(checking if inode is free space cache inode) is identical across v5 and v7. I'm afraid that's a rare race leading to NULL btrfs_inode->root, which could happen in both v5 and v7. What's the difference between SUSE and mainline kernel? Maybe some mainline kernel commits have already fixed it? Thanks, Qu This can be fixed easily by checking @root inside btrfs_is_free_space_inode(), as the backtrace shows that it's only happening for DirectIO, and it won't happen for free space cache inode. But I'm more curious how this happened for a more accurate fix, or we could have other NULL pointer access. Did you have any reproducer for this? Sorry no - this is a production MariaDB Server running btrfs with compress-force=zlib. But if i could test anything i'll do. Greets, Stefan Thanks, Qu PGD 14e18d4067 PUD 14e1868067 PMD 0 Oops: [#1] SMP Modules linked in: netconsole xt_multiport ipt_REJECT nf_reject_ipv4 xt_set iptable_filter ip_tables x_tables ip_set_hash_net ip_set nfnetlink crc32_pclmul button loop btrfs xor usbhid raid6_pq ata_generic virtio_blk virtio_net uhci_hcd ehci_hcd i2c_piix4 usbcore virtio_pci i2c_core usb_common ata_piix floppy CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.52+112-ph #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140722_172050-sagunt 04/01/2014 task: b4e0f500 ti: b4e0 task.ti: b4e0 RIP: 0010:[] [] __endio_write_update_ordered+0x33/0x140 [btrfs] RSP: 0018:8814eae03cd8 EFLAGS: 00010086 RAX: RBX: 8814e8fd5aa8 RCX: 0001 RDX: 0010 RSI: 0010 RDI: 8814e45885c0 RBP: 8814eae03d10 R08: 8814e8334000 R09: 00018040003a R10: ea00507d8d00 R11: 88141f634080 R12: 8814e45885c0 R13: 8814e125d700 R14: 0010 R15: 8800376c6a80 FS: () GS:8814eae0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 01f0 CR3: 0014e34c9000 CR4: 001406f0Stack: 0010 8814e8fd5aa8 8814e953f3c0 8814e125d700 0010 8800376c6a80 8814eae03d38 c03ddf67 8814e86b6a80 8814e8fd5aa8 0001 Call Trace: [] btrfs_endio_direct_write+0x37/0x60 [btrfs] [] bio_endio+0x57/0x60 [] btrfs_end_bio+0xa1/0x140 [btrfs] [] bio_endio+0x57/0x60 [] blk_update_request+0x8b/0x330 [] blk_mq_end_request+0x1a/0x70 [] virtblk_request_done+0x3f/0x70 [virtio_blk] [] __blk_mq_complete_request+0x78/0xe0 [] blk_mq_complete_request+0x1c/0x20 [] virtblk_done+0x64/0xe0 [virtio_blk] [] vring_interrupt+0x3a/0x90 [] __handle_irq_event_percpu+0x89/0x1b0 [] handle_irq_event_percpu+0x23/0x60 [] handle_irq_event+0x3b/0x60 [] handle_edge_irq+0x6f/0x150 [] handle_irq+0x1d/0x30 [] do_IRQ+0x4b/0xd0 [] common_interrupt+0x8c/0x8c DWARF2 unwinder stuck at ret_from_intr+0x0/0x1b Leftover inexact backtrace: 2017-03-12 20:33:08 2017-03-12 20:33:08 [] ? native_safe_halt+0x6/0x10 [] default_idle+0x1e/0xe0 [] arch_cpu_idle+0xf/0x20 [] default_idle_call+0x3b/0x40 [] cpu_startup_entry+0x29a/0x370 [] rest_init+0x7c/0x80 [] start_kernel+0x490/0x49d [] ? early_idt_handler_array+0x120/0x120 [] x86_64_start_reservations+0x2a/0x2c [] x86_64_start_kernel+0x13b/0x14a Code: e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 83 ec 10 48 8b 87 70 fc ff ff 4c 8b 87 38 fe ff ff 48 c7 45 c8 00 00 00 00 48 89 75 d0 <48> 8b b8 f0 01 00 00 48 3b 47 28 49 8b 84 24 78 fc ff ff 0f 84 RIP [] __endio_write_update_ordered+0x33/0x140 [btrfs] RSP CR2: 01f0 ---[ end trace 7529a0652fd7873e ]--- Kernel panic - not syncing: Fatal exception in interrupt Kernel Offset: 0x3300 from 0x8100 (relocation range: 0x8000-0xbfff) Greets, Stefan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to
[PATCH 3/4] btrfs: cleanup barrier_all_devices() unify dev error count
Now when counting number of error devices we don't need to count them separately once during send and wait, as because device error counted during send is more of static check. Also kindly note that as of now there is no code which would set dev->bdev = NULL unless device is missing. However I still kept bdev == NULL counted towards error device in view of future enhancements. And as the device_list_mutex is held when barrier_all_devices() is called, I don't expect a new bdev to null in between send and wait. Now in this process I also rename error_wait to dropouts. Signed-off-by: Anand Jain--- fs/btrfs/disk-io.c | 14 +- 1 file changed, 5 insertions(+), 9 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index ee3e601da511..5719e036048b 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -3574,8 +3574,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info) { struct list_head *head; struct btrfs_device *dev; - int errors_send = 0; - int errors_wait = 0; + int dropouts = 0; int ret; /* send down all the barriers */ @@ -3583,10 +3582,8 @@ static int barrier_all_devices(struct btrfs_fs_info *info) list_for_each_entry_rcu(dev, head, dev_list) { if (dev->missing) continue; - if (!dev->bdev) { - errors_send++; + if (!dev->bdev) continue; - } if (!dev->in_fs_metadata || !dev->writeable) continue; @@ -3600,7 +3597,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info) if (dev->missing) continue; if (!dev->bdev) { - errors_wait++; + dropouts++; continue; } if (!dev->in_fs_metadata || !dev->writeable) @@ -3608,10 +3605,9 @@ static int barrier_all_devices(struct btrfs_fs_info *info) ret = write_dev_flush(dev, 1); if (ret) - errors_wait++; + dropouts++; } - if (errors_send > info->num_tolerated_disk_barrier_failures || - errors_wait > info->num_tolerated_disk_barrier_failures) + if (dropouts > info->num_tolerated_disk_barrier_failures) return -EIO; return 0; } -- 2.10.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/4] btrfs: Communicate back ENOMEM when it occurs
The only error that write dev flush (send) will fail is due to the ENOMEM then, as its not a device specific error and rather a system wide issue, we should rather stop further iterations and perpetuate the -ENOMEM error to the caller. Signed-off-by: Anand Jain--- fs/btrfs/disk-io.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 08b74daf35d0..ee3e601da511 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -3592,7 +3592,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info) ret = write_dev_flush(dev, 0); if (ret) - errors_send++; + return ret; } /* wait for all the barriers */ -- 2.10.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/4] btrfs: cleanup barrier_all_devices() to check dev stat flush error
The objective of this patch is to cleanup barrier_all_devices() so that the error checking is in a separate loop independent of of the loop which submits and waits on the device flush requests. By doing this it helps to further develop patches which would tune the error-actions as needed. Here functions such as btrfs_dev_stats_dirty() couldn't be used because it doesn't monitor the flush errors BTRFS_DEV_STAT_FLUSH_ERRS. Signed-off-by: Anand Jain--- fs/btrfs/disk-io.c | 96 +++--- 1 file changed, 85 insertions(+), 11 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 5719e036048b..12531a5b14ff 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -3566,6 +3566,76 @@ static int write_dev_flush(struct btrfs_device *device, int wait) return 0; } +struct device_checkpoint { + struct list_head list; + struct btrfs_device *device; + int stat_value_checkpoint; +}; + +static int add_device_checkpoint(struct list_head *checkpoint, + struct btrfs_device *device) +{ + struct device_checkpoint *cdev = + kzalloc(sizeof(struct device_checkpoint), GFP_KERNEL); + if (!cdev) + return -ENOMEM; + + list_add(>list, checkpoint); + + cdev->device = device; + cdev->stat_value_checkpoint = + btrfs_dev_stat_read(device, BTRFS_DEV_STAT_FLUSH_ERRS); + + return 0; +} + +static void fini_devices_checkpoint(struct list_head *checkpoint) +{ + struct device_checkpoint *cdev; + + while(!list_empty(checkpoint)) { + cdev = list_entry(checkpoint->next, + struct device_checkpoint, list); + list_del(>list); + kfree(cdev); + } +} + +static int check_stat_flush(struct btrfs_device *dev, + struct list_head *checkpoint) +{ + int val; + struct device_checkpoint *cdev; + + list_for_each_entry(cdev, checkpoint, list) { + if (cdev->device == dev) { + val = btrfs_dev_stat_read(dev, + BTRFS_DEV_STAT_FLUSH_ERRS); + if (cdev->stat_value_checkpoint != val) + return 1; + } + } + return 0; +} + +static int check_barrier_error(struct btrfs_fs_devices *fsdevs, + struct list_head *checkpoint) +{ + int dropouts = 0; + struct btrfs_device *dev; + + list_for_each_entry_rcu(dev, >devices, dev_list) { + if (!dev->bdev || check_stat_flush(dev, checkpoint)) + dropouts++; + } + + if (dropouts > + fsdevs->fs_info->num_tolerated_disk_barrier_failures) + return -EIO; + + return 0; +} + /* * send an empty flush down to each device in parallel, * then wait for them @@ -3574,8 +3644,10 @@ static int barrier_all_devices(struct btrfs_fs_info *info) { struct list_head *head; struct btrfs_device *dev; - int dropouts = 0; int ret; + struct list_head checkpoint; + + INIT_LIST_HEAD(); /* send down all the barriers */ head = >fs_devices->devices; @@ -3587,29 +3659,31 @@ static int barrier_all_devices(struct btrfs_fs_info *info) if (!dev->in_fs_metadata || !dev->writeable) continue; + add_device_checkpoint(, dev); ret = write_dev_flush(dev, 0); - if (ret) + if (ret) { + fini_devices_checkpoint(); return ret; + } } /* wait for all the barriers */ list_for_each_entry_rcu(dev, head, dev_list) { if (dev->missing) continue; - if (!dev->bdev) { - dropouts++; + if (!dev->bdev) continue; - } if (!dev->in_fs_metadata || !dev->writeable) continue; - ret = write_dev_flush(dev, 1); - if (ret) - dropouts++; + write_dev_flush(dev, 1); } - if (dropouts > info->num_tolerated_disk_barrier_failures) - return -EIO; - return 0; + + ret = check_barrier_error(info->fs_devices, ); + + fini_devices_checkpoint(); + + return ret; } int btrfs_get_num_tolerated_disk_barrier_failures(u64 flags) -- 2.10.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/4] btrfs: REQ_PREFLUSH does not use btrfs_end_bio() completion callback
REQ_PREFLUSH bio to flush dev cache uses btrfs_end_empty_barrier() completion callback only, as of now, and there it accounts for dev stat flush errors BTRFS_DEV_STAT_FLUSH_ERRS, so remove it from the btrfs_end_bio(). Signed-off-by: Anand Jain--- fs/btrfs/volumes.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 73d56eef5e60..1563ae03079b 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -6010,9 +6010,6 @@ static void btrfs_end_bio(struct bio *bio) else btrfs_dev_stat_inc(dev, BTRFS_DEV_STAT_READ_ERRS); - if (bio->bi_opf & REQ_PREFLUSH) - btrfs_dev_stat_inc(dev, - BTRFS_DEV_STAT_FLUSH_ERRS); btrfs_dev_stat_print_on_error(dev); } } -- 2.10.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/4] cleanup barrier_all_devices()
These patches provide cleanup of the function barrier_all_devices(). Anand Jain (4): btrfs: REQ_PREFLUSH does not use btrfs_end_bio() completion callback btrfs: Communicate back ENOMEM when it occurs btrfs: cleanup barrier_all_devices() unify dev error count btrfs: cleanup barrier_all_devices() to check dev stat flush error -- 2.10.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v7 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
Hi Qu, Am 13.03.2017 um 02:16 schrieb Qu Wenruo: > > At 03/13/2017 04:49 AM, Stefan Priebe - Profihost AG wrote: >> Hi Qu, >> >> while V5 was running fine against the openSUSE-42.2 kernel (based on >> v4.4). > > Thanks for the test. > >> V7 results in OOPS to me: >> BUG: unable to handle kernel NULL pointer dereference at 01f0 > > This 0x1f0 is the same as offsetof(struct brrfs_root, fs_info), quite > nice clue. > >> IP: [] __endio_write_update_ordered+0x33/0x140 [btrfs] > > IP points to: > --- > static inline bool btrfs_is_free_space_inode(struct btrfs_inode *inode) > { > struct btrfs_root *root = inode->root; << Either here > > if (root == root->fs_info->tree_root && << Or here > btrfs_ino(inode) != BTRFS_BTREE_INODE_OBJECTID) > > --- > > Taking the above offset into consideration, it's only possible for later > case. > > So here, we have a btrfs_inode whose @root is NULL. But wasn't this part of the code identical in V5? Why does it only happen with V7? > This can be fixed easily by checking @root inside > btrfs_is_free_space_inode(), as the backtrace shows that it's only > happening for DirectIO, and it won't happen for free space cache inode. > > But I'm more curious how this happened for a more accurate fix, or we > could have other NULL pointer access. > > Did you have any reproducer for this? Sorry no - this is a production MariaDB Server running btrfs with compress-force=zlib. But if i could test anything i'll do. Greets, Stefan > > Thanks, > Qu > >> PGD 14e18d4067 PUD 14e1868067 PMD 0 >> Oops: [#1] SMP >> Modules linked in: netconsole xt_multiport ipt_REJECT nf_reject_ipv4 >> xt_set iptable_filter ip_tables x_tables ip_set_hash_net ip_set >> nfnetlink crc32_pclmul button loop btrfs xor usbhid raid6_pq ata_generic >> virtio_blk virtio_net uhci_hcd ehci_hcd i2c_piix4 usbcore virtio_pci >> i2c_core usb_common ata_piix floppy >> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.52+112-ph #1 >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS >> 1.7.5-20140722_172050-sagunt 04/01/2014 >> task: b4e0f500 ti: b4e0 task.ti: b4e0 >> RIP: 0010:[] [] >> __endio_write_update_ordered+0x33/0x140 [btrfs] >> RSP: 0018:8814eae03cd8 EFLAGS: 00010086 >> RAX: RBX: 8814e8fd5aa8 RCX: 0001 >> RDX: 0010 RSI: 0010 RDI: 8814e45885c0 >> RBP: 8814eae03d10 R08: 8814e8334000 R09: 00018040003a >> R10: ea00507d8d00 R11: 88141f634080 R12: 8814e45885c0 >> R13: 8814e125d700 R14: 0010 R15: 8800376c6a80 >> FS: () GS:8814eae0() >> knlGS: >> CS: 0010 DS: ES: CR0: 80050033 >> CR2: 01f0 CR3: 0014e34c9000 CR4: 001406f0Stack: >> 0010 8814e8fd5aa8 8814e953f3c0 >> 8814e125d700 0010 8800376c6a80 8814eae03d38 >> c03ddf67 8814e86b6a80 8814e8fd5aa8 0001 >> Call Trace: >> [] btrfs_endio_direct_write+0x37/0x60 [btrfs] >> [] bio_endio+0x57/0x60 >> [] btrfs_end_bio+0xa1/0x140 [btrfs] >> [] bio_endio+0x57/0x60 >> [] blk_update_request+0x8b/0x330 >> [] blk_mq_end_request+0x1a/0x70 >> [] virtblk_request_done+0x3f/0x70 [virtio_blk] >> [] __blk_mq_complete_request+0x78/0xe0 >> [] blk_mq_complete_request+0x1c/0x20 >> [] virtblk_done+0x64/0xe0 [virtio_blk] >> [] vring_interrupt+0x3a/0x90 >> [] __handle_irq_event_percpu+0x89/0x1b0 >> [] handle_irq_event_percpu+0x23/0x60 >> [] handle_irq_event+0x3b/0x60 >> [] handle_edge_irq+0x6f/0x150 >> [] handle_irq+0x1d/0x30 >> [] do_IRQ+0x4b/0xd0 >> [] common_interrupt+0x8c/0x8c >> DWARF2 unwinder stuck at ret_from_intr+0x0/0x1b >> Leftover inexact backtrace: >> 2017-03-12 20:33:08 >> 2017-03-12 20:33:08 [] ? native_safe_halt+0x6/0x10 >> [] default_idle+0x1e/0xe0 >> [] arch_cpu_idle+0xf/0x20 >> [] default_idle_call+0x3b/0x40 >> [] cpu_startup_entry+0x29a/0x370 >> [] rest_init+0x7c/0x80 >> [] start_kernel+0x490/0x49d >> [] ? early_idt_handler_array+0x120/0x120 >> [] x86_64_start_reservations+0x2a/0x2c >> [] x86_64_start_kernel+0x13b/0x14a >> Code: e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 83 ec 10 48 8b 87 70 fc >> ff ff 4c 8b 87 38 fe ff ff 48 c7 45 c8 00 00 00 00 48 89 75 d0 <48> 8b >> b8 f0 01 00 00 48 3b 47 28 49 8b 84 24 78 fc ff ff 0f 84 >> RIP [] __endio_write_update_ordered+0x33/0x140 [btrfs] >> RSP >> CR2: 01f0 >> ---[ end trace 7529a0652fd7873e ]--- >> Kernel panic - not syncing: Fatal exception in interrupt >> Kernel Offset: 0x3300 from 0x8100 (relocation range: >> 0x8000-0xbfff) >> >> Greets, >> Stefan >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >> the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> >> > > -- To unsubscribe from this list: send the
Re: [PATCH v3.1 1/7] btrfs: Introduce a function to check if all chunks a OK for degraded rw mount
At 03/13/2017 03:29 PM, Anand Jain wrote: On 03/09/2017 09:34 AM, Qu Wenruo wrote: Introduce a new function, btrfs_check_rw_degradable(), to check if all chunks in btrfs is OK for degraded rw mount. It provides the new basis for accurate btrfs mount/remount and even runtime degraded mount check other than old one-size-fit-all method. Signed-off-by: Qu WenruoSigned-off-by: Anand Jain Tested-by: Austin S. Hemmelgarn Tested-by: Adam Borowski Tested-by: Dmitrii Tcvetkov --- fs/btrfs/volumes.c | 55 ++ fs/btrfs/volumes.h | 1 + 2 files changed, 56 insertions(+) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 73d56eef5e60..83613955e3c2 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -6765,6 +6765,61 @@ int btrfs_read_sys_array(struct btrfs_fs_info *fs_info) return -EIO; } +/* + * Check if all chunks in the fs is OK for read-write degraded mount + * + * Return true if the fs is OK to be mounted degraded read-write + * Return false if the fs is not OK to be mounted degraded + */ +bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info) +{ +struct btrfs_mapping_tree *map_tree = _info->mapping_tree; +struct extent_map *em; +u64 next_start = 0; +bool ret = true; + +read_lock(_tree->map_tree.lock); +em = lookup_extent_mapping(_tree->map_tree, 0, (u64)-1); +read_unlock(_tree->map_tree.lock); +/* No chunk at all? Return false anyway */ +if (!em) { +ret = false; +goto out; +} +while (em) { +struct map_lookup *map; +int missing = 0; +int max_tolerated; +int i; + +map = (struct map_lookup *) em->bdev; any idea why not map = em->map_lookup; here? My fault, will update the patch. Thanks, Qu Thanks, Anand +max_tolerated = +btrfs_get_num_tolerated_disk_barrier_failures( +map->type); +for (i = 0; i < map->num_stripes; i++) { +if (map->stripes[i].dev->missing) +missing++; +} +if (missing > max_tolerated) { +ret = false; +btrfs_warn(fs_info, +"chunk %llu missing %d devices, max tolerance is %d for writeble mount", + em->start, missing, max_tolerated); +free_extent_map(em); +goto out; +} +next_start = extent_map_end(em); +free_extent_map(em); + +read_lock(_tree->map_tree.lock); +em = lookup_extent_mapping(_tree->map_tree, next_start, + (u64)(-1) - next_start); +read_unlock(_tree->map_tree.lock); +} +out: +return ret; +} + int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info) { struct btrfs_root *root = fs_info->chunk_root; diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h index 59be81206dd7..db1b5ef479cf 100644 --- a/fs/btrfs/volumes.h +++ b/fs/btrfs/volumes.h @@ -538,4 +538,5 @@ struct list_head *btrfs_get_fs_uuids(void); void btrfs_set_fs_info_ptr(struct btrfs_fs_info *fs_info); void btrfs_reset_fs_info_ptr(struct btrfs_fs_info *fs_info); +bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info); #endif -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3.1 1/7] btrfs: Introduce a function to check if all chunks a OK for degraded rw mount
On 03/09/2017 09:34 AM, Qu Wenruo wrote: Introduce a new function, btrfs_check_rw_degradable(), to check if all chunks in btrfs is OK for degraded rw mount. It provides the new basis for accurate btrfs mount/remount and even runtime degraded mount check other than old one-size-fit-all method. Signed-off-by: Qu WenruoSigned-off-by: Anand Jain Tested-by: Austin S. Hemmelgarn Tested-by: Adam Borowski Tested-by: Dmitrii Tcvetkov --- fs/btrfs/volumes.c | 55 ++ fs/btrfs/volumes.h | 1 + 2 files changed, 56 insertions(+) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 73d56eef5e60..83613955e3c2 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -6765,6 +6765,61 @@ int btrfs_read_sys_array(struct btrfs_fs_info *fs_info) return -EIO; } +/* + * Check if all chunks in the fs is OK for read-write degraded mount + * + * Return true if the fs is OK to be mounted degraded read-write + * Return false if the fs is not OK to be mounted degraded + */ +bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info) +{ + struct btrfs_mapping_tree *map_tree = _info->mapping_tree; + struct extent_map *em; + u64 next_start = 0; + bool ret = true; + + read_lock(_tree->map_tree.lock); + em = lookup_extent_mapping(_tree->map_tree, 0, (u64)-1); + read_unlock(_tree->map_tree.lock); + /* No chunk at all? Return false anyway */ + if (!em) { + ret = false; + goto out; + } + while (em) { + struct map_lookup *map; + int missing = 0; + int max_tolerated; + int i; + + map = (struct map_lookup *) em->bdev; any idea why not map = em->map_lookup; here? Thanks, Anand + max_tolerated = + btrfs_get_num_tolerated_disk_barrier_failures( + map->type); + for (i = 0; i < map->num_stripes; i++) { + if (map->stripes[i].dev->missing) + missing++; + } + if (missing > max_tolerated) { + ret = false; + btrfs_warn(fs_info, + "chunk %llu missing %d devices, max tolerance is %d for writeble mount", + em->start, missing, max_tolerated); + free_extent_map(em); + goto out; + } + next_start = extent_map_end(em); + free_extent_map(em); + + read_lock(_tree->map_tree.lock); + em = lookup_extent_mapping(_tree->map_tree, next_start, + (u64)(-1) - next_start); + read_unlock(_tree->map_tree.lock); + } +out: + return ret; +} + int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info) { struct btrfs_root *root = fs_info->chunk_root; diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h index 59be81206dd7..db1b5ef479cf 100644 --- a/fs/btrfs/volumes.h +++ b/fs/btrfs/volumes.h @@ -538,4 +538,5 @@ struct list_head *btrfs_get_fs_uuids(void); void btrfs_set_fs_info_ptr(struct btrfs_fs_info *fs_info); void btrfs_reset_fs_info_ptr(struct btrfs_fs_info *fs_info); +bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info); #endif -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html