Re: btrfs-convert missing in btrfs-tools v4.15.1
Nicholas D Steeves posted on Thu, 23 Aug 2018 14:15:18 -0400 as excerpted:

>> It's in my interest to ship all tools in distros, but there's also
>> only so much that the upstream community can do. If you're going to
>> reconsider the status of btrfs-convert in Debian, please let me know.
>
> Yes, I'd be happy to advocate for its reinclusion if the answer to 4/5
> of the following questions is "yes". Does SUSE now recommend the use
> of btrfs-convert to its enterprise customers? The following is a
> frustrating criterion, but: can a random desktop user run
> btrfs-convert against their ext4 rootfs and expect the operation to
> succeed? Is btrfs-convert now sufficiently trusted that it can be
> recommended with the same degree of confidence as a backup,
> mkfs.btrfs, then restore-to-new-filesystem approach? Does the user of
> a btrfs volume created with btrfs-convert have an equal or lesser
> probability of encountering bugs compared to one who used mkfs.btrfs?

Just a user and list regular here, and gentoo not debian, but for what
it's worth... I'd personally never consider or recommend a filesystem
converter over the backup, mkfs-to-new-fs, restore-to-new-fs method,
for three reasons.

1) Regardless of how stable a filesystem converter is and which two
filesystems the conversion is between, "things" /do/ occasionally
happen, thus making it irresponsible to use or recommend use of such a
converter without a suitably current and tested backup, "just in case."
(This is of course a special case of the sysadmin's first rule of
backups: the true value of data is defined not by any arbitrary claims,
but by the number of backups of that data it's considered worth the
time/trouble/resources to make/have.
If the data value is trivial enough, sure, don't bother with the
backup, but if it's of /that/ low a value, so low it's not worth a
backup even when doing something as theoretically risky as a filesystem
conversion, why is it worth the time and trouble to bother converting
it in the first place, instead of just blowing it away and starting
clean?)

2) Once a backup is considered "strongly recommended", as we've just
established it should be in 1 regardless of the stability of the
converter, just using the existing filesystem as that backup and
starting fresh with a mkfs for the new filesystem and copying things
over is, simply put, the easiest, simplest and cleanest method of
changing filesystems.

3) (Pretty much)[1] regardless of the filesystems in question, a fresh
mkfs and clean sequential transfer of files from the old-fs/backup to
the new one is pretty well guaranteed to be better optimized than
conversion from an existing filesystem of a different type,
particularly one that has been in normal operation for a while and thus
has operational fragmentation of both data and free space. That's in
addition to being less bug-prone, even for a "stable" converter.

Restating: (1) doing a conversion without a backup is irresponsible,
(2) the easiest backup-and-conversion method is directly using the old
fs as the backup and copying over to the freshly mkfs-ed new
filesystem, and (3) a freshly mkfs-ed filesystem and a sequential copy
of files to it from backup, whether that be the old filesystem or not,
is going to be more efficient and less bug-prone than an in-place
conversion.

Given the above, why would /anyone/ /sane/ consider using a converter?
It simply doesn't make sense, even if the converter were as stable as
the most stable filesystems we have.
So as a distro btrfs package maintainer, do what you wish in terms of
the converter, but were it me, I might actually consider replacing it
with an executable that simply printed out some form of the above
argument, with a pointer to the sources should they still be interested
after having read that argument.[2] Then, if people really are
determined to unnecessarily waste their time to get a less efficient
filesystem, possibly risking their data in the process of getting it,
they can always build the converter from sources themselves.

---
[1] I debated omitting the qualifier as I know of no exceptions, but
I'm not a filesystem expert, and while I'm a bit skeptical, I suppose
it's possible that they might exist.

[2] There's actually btrfs precedent for this in the form of the
executable built as fsck.btrfs, which does nothing (successfully) but
possibly print a message referring people to btrfs check, if run in
interactive mode.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
dduper - Offline btrfs deduplication tool
Hi -

dduper is an offline dedupe tool. Instead of reading whole file blocks
and computing checksums, it works by fetching checksums from the BTRFS
csum tree. This hugely improves the performance.

dduper works like this:

- Read csums for the given two files.
- Find matching locations.
- Pass the locations to ioctl_ficlonerange directly instead of
  ioctl_fideduperange.

By default, dduper adds a safety check to the above steps by creating a
backup reflink file and comparing md5sums after dedupe. If the backup
file matches the newly deduped file, the backup file is removed. You
can skip this check by passing the --skip option.

Here is sample cli usage [1] and a quick demo [2].

Some performance numbers (with the --skip option):

Dedupe two 1GB files with same content  -  1.2 seconds
Dedupe two 5GB files with same content  -  8.2 seconds
Dedupe two 10GB files with same content - 13.8 seconds

dduper requires the `btrfs inspect-internal dump-csum` command; you can
use this branch [3] or apply the patch yourself [4].

[1] https://gitlab.collabora.com/laks/btrfs-progs/blob/dump_csum/Documentation/dduper_usage.md
[2] http://giis.co.in/btrfs_dedupe.gif
[3] git clone https://gitlab.collabora.com/laks/btrfs-progs.git -b dump_csum
[4] https://patchwork.kernel.org/patch/10540229/

Please remember it's version 0.1, so test it out if you plan to use
dduper on real data. Let me know if you have suggestions, feedback or
bugs :)

Cheers.
Lakshmipathi.G
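[Editor's sketch] The matching step described above can be condensed
into a few lines of Python. This is an illustrative toy, not dduper's
actual code: csums_a and csums_b stand in for the per-block checksums
that `btrfs inspect-internal dump-csum` would report, and the function
only reports which block offsets could then be handed to the clone
ioctl.

```python
def matching_offsets(csums_a, csums_b):
    """Toy model of csum-tree dedupe: given per-block checksums of two
    files, return (offset_a, offset_b) pairs whose blocks match,
    without ever reading the file data itself."""
    index = {}
    for off, csum in enumerate(csums_a):
        index.setdefault(csum, []).append(off)
    pairs = []
    for off_b, csum in enumerate(csums_b):
        for off_a in index.get(csum, []):
            pairs.append((off_a, off_b))
    return pairs

# Blocks 0 and 2 of file B duplicate block 0 of file A:
print(matching_offsets(["aa", "bb"], ["aa", "cc", "aa"]))  # [(0, 0), (0, 2)]
```

dduper itself groups csums into larger chunks before hashing and passes
whole matching ranges to ioctl_ficlonerange; the sketch keeps only the
lookup idea.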
[PATCH] btrfs-progs: dduper - BTRFS offline deduplication tool
dduper is an offline dedupe tool. It works by fetching checksums from
the BTRFS csum tree, instead of reading whole file blocks and computing
checksums. This tool relies on output from the 'btrfs inspect-internal
dump-csum' command.

Signed-off-by: Lakshmipathi.G
---
 dduper | 310 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 310 insertions(+)
 create mode 100644 dduper

diff --git a/dduper b/dduper
new file mode 100644
index 000..2170b11
--- /dev/null
+++ b/dduper
@@ -0,0 +1,310 @@
+#!/usr/bin/env python
+
+""" dduper - BTRFS Dedupe tool.
+
+This is an offline dedupe tool. Instead of reading whole file blocks and
+computing checksums, it works by fetching checksums from the BTRFS csum
+tree. This hugely improves the performance.
+
+Authors: Lakshmipathi.G
+"""
+import argparse
+import errno
+import hashlib
+import numpy as np
+import math
+import os
+import pdb
+import struct
+import subprocess
+import sys
+
+from collections import OrderedDict
+from fcntl import ioctl
+from itertools import combinations
+from itertools import izip_longest
+from stat import *
+
+# 4kb block size
+blk_size = 4
+# no. of csums on a single row - right now it's 8
+no_of_chunks = 0
+FICLONERANGE = 0x4020940d
+
+device_name = None
+skip = 0
+chunk_sz = 0
+run_len = 0
+ele_sz = 0
+
+# Already deduped files
+processed_files = []
+
+
+# From https://stackoverflow.com/questions/434287
+def grouper(iterable, n, fillvalue=None):
+    args = [iter(iterable)] * n
+    return izip_longest(*args, fillvalue=fillvalue)
+
+
+def get_ele_size():
+
+    global no_of_chunks, run_len
+    if chunk_sz <= 0 or chunk_sz % 32 != 0:
+        print "Ensure chunk size is a multiple of 32KB. (32, 64, 128 etc.)"
+        sys.exit(-1)
+    no_of_chunks = chunk_sz / blk_size
+    ele_sz = no_of_chunks / 8
+    run_len = no_of_chunks * blk_size * 1024
+    return ele_sz
+
+
+def get_hashes(out1):
+    """
+    For each list item compute its hash and store it with offset as its key.
+    """
+    global ele_sz
+
+    if ele_sz == 1:
+        od = OrderedDict()
+        for idx, ele in enumerate(out1):
+            v = []
+            k = hashlib.md5(str(ele)).hexdigest()
+            v.append(idx)
+            if k in od:
+                print "Collision with: " + str(k) + " at offset: " + str(v)
+            od[k] = v
+    else:
+        od = OrderedDict()
+        for idx, ele in enumerate(grouper(out1, ele_sz, 'x')):
+            v = []
+            k = hashlib.md5(str(ele)).hexdigest()
+            v.append(idx)
+            if k in od:
+                print "Collision with: " + str(k) + " at offset: " + str(v)
+            od[k] = v
+
+    return od
+
+
+def ioctl_ficlonerange(dst_fd, s):
+
+    try:
+        ioctl(dst_fd, FICLONERANGE, s)
+    except Exception as e:
+        print "error({0})".format(e.errno)
+
+
+def cmp_files(file1, file2):
+
+    md1 = subprocess.Popen(['md5sum', file1], stdout=subprocess.PIPE,
+                           close_fds=True).stdout.read().split(" ")[0]
+    md2 = subprocess.Popen(['md5sum', file2], stdout=subprocess.PIPE,
+                           close_fds=True).stdout.read().split(" ")[0]
+    if md1 == md2:
+        return 0
+    else:
+        return 1
+
+
+def do_dedupe(src_file, dst_file, dry_run):
+
+    bkup_file = dst_file + ".__superduper"
+    src_fd = os.open(src_file, os.O_RDONLY)
+    dst_fd = os.open(dst_file, os.O_WRONLY)
+    perfect_match = 0
+
+    out1 = subprocess.Popen(['btrfs', 'inspect-internal', 'dump-csum', src_file, device_name],
+                            stdout=subprocess.PIPE, close_fds=True).stdout.readlines()
+    out2 = subprocess.Popen(['btrfs', 'inspect-internal', 'dump-csum', dst_file, device_name],
+                            stdout=subprocess.PIPE, close_fds=True).stdout.readlines()
+    # todo: perfect match files. Remove dst_file from further operations
+    if out1 == out2:
+        print "Perfect match : ", src_file, dst_file
+        perfect_match = 1
+
+    src_dict = get_hashes(out1)
+    dst_dict = get_hashes(out2)
+    total_entry = len(src_dict) - 1  # Fix missing final ele
+    file_size = os.path.getsize(src_file)
+
+    np1 = np.array([v for v in src_dict.keys()])
+    np2 = np.array([v for v in dst_dict.keys()])
+    matched_keys = np.intersect1d(np1, np2)
+    unmatched_keys = np.setdiff1d(np1, np2)
+
+    if dry_run == 0:
+        # todo: Clear dict/np/list if they are not used further
+        # todo: handle same content within single file
+
+        if matched_keys is not None:
+            if skip == 0:
+                bkup2 = subprocess.Popen(['cp', '--reflink=always', dst_file, bkup_file], stdout=subprocess.PIPE)
+            print "*" * 24
+            # print "matched regions"
+            for location in matched_keys:
+                entry = src_dict[location][0]
+                src_len = no_of_chunks * blk_size * 1024
+                src_offset = src_dict[location][0] * src_len
+
+                multi_dst_offsets = dst_dict[location]  # list
+                for offset in
[PATCH] btrfs: ctree.h: Fix suspicious rcu usage warning in btrfs_debug_in_rcu()
commit 672d599041c8 ("btrfs: Use wrapper macro for rcu string to remove
duplicate code") replaced some open-coded rcu string handling with a
macro. It turns out that btrfs_debug_in_rcu() is used for the first
time, and the macro lacks lock/unlock of the rcu string for the
non-debug case (i.e. when the message is not printed), leading to a
suspicious RCU usage warning when CONFIG_PROVE_RCU is on.

Fix this by adding a wrapper to call lock/unlock for the non-debug case
too.

Fixes: 672d599041c8 ("btrfs: Use wrapper macro for rcu string to remove duplicate code")
Reported-by: David Howells
Signed-off-by: Misono Tomohiro
---
 fs/btrfs/ctree.h | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 53af9f5253f4..cc8b4ff8dcea 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3390,9 +3390,9 @@ do {					\
 #define btrfs_debug(fs_info, fmt, args...) \
 	btrfs_no_printk(fs_info, KERN_DEBUG fmt, ##args)
 #define btrfs_debug_in_rcu(fs_info, fmt, args...) \
-	btrfs_no_printk(fs_info, KERN_DEBUG fmt, ##args)
+	btrfs_no_printk_in_rcu(fs_info, KERN_DEBUG fmt, ##args)
 #define btrfs_debug_rl_in_rcu(fs_info, fmt, args...) \
-	btrfs_no_printk(fs_info, KERN_DEBUG fmt, ##args)
+	btrfs_no_printk_in_rcu(fs_info, KERN_DEBUG fmt, ##args)
 #define btrfs_debug_rl(fs_info, fmt, args...) \
 	btrfs_no_printk(fs_info, KERN_DEBUG fmt, ##args)
 #endif
@@ -3404,6 +3404,13 @@ do {					\
 	rcu_read_unlock();			\
 } while (0)
 
+#define btrfs_no_printk_in_rcu(fs_info, fmt, args...)	\
+do {							\
+	rcu_read_lock();				\
+	btrfs_no_printk(fs_info, fmt, ##args);		\
+	rcu_read_unlock();				\
+} while (0)
+
 #define btrfs_printk_ratelimited(fs_info, fmt, args...)	\
 do {							\
 	static DEFINE_RATELIMIT_STATE(_rs,		\
-- 
2.14.4
Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)
On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote:
> On Mon, Aug 20, 2018 at 08:33:49AM -0700, Darrick J. Wong wrote:
> > On Mon, Aug 20, 2018 at 11:09:32AM +1000, Dave Chinner wrote:
> > > 	- should we just round down the EOF dedupe request to the
> > > 	  block before EOF so dedupe still succeeds?
> >
> > I've often wondered if the interface should (have) be(en) that we
> > start at src_off/dst_off and share as many common blocks as possible
> > until we find a mismatch, then tell userspace where we stopped...
> > instead of like now where we compare the entire extent and fail if
> > any part of it doesn't match.
>
> The usefulness or harmfulness of that approach depends a lot on what
> the application expects the filesystem to do.

Here are some concrete examples. In the following, letters are 4K disk
blocks and also inode offsets (i.e. "A" means a block containing 4096 x
"A" located at inode offset 0, "B" contains "B" located at inode offset
1, etc). "|" indicates a physical discontinuity of the blocks on disk.
Lowercase "a" has identical content to uppercase "A", but they are
located in different physical blocks on disk.
Suppose you have two identical files with different write histories, so
they have different on-disk layouts:

Inode 1:  ABCDEFGH|IJ|KL|M|N|O|PQ|RST|UV|WX|YZ

Inode 2:  a|b|c|d|e|f|g|hijklmnopqrstuvwxyz

A naive dedupe app might pick src and dst at random, and do this:

// dedupe(length, src_ino, src_off, dst_ino, dst_off)

dedupe(length 26, Inode 1, Offset 0, Inode 2, Offset 0)

with the result having 11 fragments in each file, all from the original
Inode 1:

Inode 1:  ABCDEFGH|IJ|KL|M|N|O|PQ|RST|UV|WX|YZ

Inode 2:  ABCDEFGH|IJ|KL|M|N|O|PQ|RST|UV|WX|YZ

A smarter dedupe app might choose src and dst based on logical
proximity and/or physical seek distance, or the app might choose dst
with the smallest number of existing references in the filesystem, or
the app might simply choose the longest available src extents to
minimize fragmentation:

dedupe(length 7, Inode 1, Offset 0, Inode 2, Offset 0)

dedupe(length 19, Inode 2, Offset 7, Inode 1, Offset 7)

with the result having 2 fragments in each file, each chosen from a
different original inode:

Inode 1:  ABCDEFG|hijklmnopqrstuvwxyz

Inode 2:  ABCDEFG|hijklmnopqrstuvwxyz

If the kernel continued past the 'length 7' size specified in the first
dedupe, then the 'hijklmnopqrstuvwxyz' would be *lost*, and the second
dedupe would be an expensive no-op because both Inode 1 and Inode 2
refer to the same physical blocks:

Inode 1:  ABCDEFGH|IJ|KL|M|N|O|PQ|RST|UV|WX|YZ
          [-----]  <- app asked for this

Inode 2:  ABCDEFGH|IJ|KL|M|N|O|PQ|RST|UV|WX|YZ
                 [---------------------------] <- kernel does this too,
          and "hijklmnopqrstuvwxyz" no longer exists for the second
          dedupe

A dedupe app willing to spend more on IO can create its own better src
with only one fragment:

open(with O_TMPFILE) -> Inode 3

copy(length 7, Inode 1, Offset 0, Inode 3, Offset 0)

copy(length 19, Inode 2, Offset 7, Inode 3, Offset 7)

dedupe(length 26, Inode 3, Offset 0, Inode 1, Offset 0)

dedupe(length 26, Inode 3, Offset 0, Inode 2, Offset 0)

close(Inode 3)

Now there is just one fragment referenced from two places:

Inode 1:  αβξδεφγηιςκλμνοπθρστυвшχψζ

Inode 2:  αβξδεφγηιςκλμνοπθρστυвшχψζ

[If encoding goes horribly wrong, the above are a-z transcoded as a mix
of Greek and Cyrillic Unicode characters.]

Real filesystems sometimes present thousands of possible dedupe src/dst
permutations to choose from. The kernel shouldn't be trying to
second-guess an application that may have access to external
information to make better decisions (e.g. the full set of src extents
available, or knowledge of other calls the app will issue in the
future).

> In btrfs, the dedupe operation acts on references to data, not the
> underlying data blocks. If there are 1000 slightly overlapping
> references to a single contiguous range of data blocks in dst on disk,
> each dedupe operation acts on only one of those, leaving the other 999
> untouched. If the app then submits 999 other dedupe requests, no
> references to the dst blocks remain and the underlying data blocks can
> be deleted.
>
> In a parallel universe (or a better filesystem, or a userspace
> emulation built out of dedupe and other ioctls), dedupe could work at
> the extent data (physical) level. The app points at src and dst extent
> references (inode/offset/length tuples), and the filesystem figures
> out which physical blocks these point to, then adjusts all the
> references to the dst blocks at once, dealing with partial overlaps
> and snapshots and nodatacow and whatever other exotic features might
> be lurking in the filesystem, ending with every reference to every
> part of dst replaced by the longest possible contiguous
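[Editor's sketch] The fragment counts in the example above (11 for the
naive single call, 2 for the smarter pair of calls) can be checked with
a small toy model. This is illustrative only, not btrfs code: an inode
is just a list of (extent_id, index) block pointers, dedupe shares
blocks by reference, and fragments counts physically discontiguous
runs.

```python
def make_inode(extent_lengths, first_extent):
    """Build an inode's block list from a list of extent lengths."""
    blocks = []
    for n, length in enumerate(extent_lengths):
        blocks.extend((first_extent + n, i) for i in range(length))
    return blocks

def dedupe(length, src, src_off, dst, dst_off):
    """Make dst reference src's physical blocks over `length` blocks."""
    dst[dst_off:dst_off + length] = src[src_off:src_off + length]

def fragments(inode):
    """Count physically discontiguous runs of blocks."""
    runs = 1
    for (e1, i1), (e2, i2) in zip(inode, inode[1:]):
        if (e1, i1 + 1) != (e2, i2):
            runs += 1
    return runs

# Inode 1: ABCDEFGH|IJ|KL|M|N|O|PQ|RST|UV|WX|YZ (11 extents, 26 blocks)
# Inode 2: a|b|c|d|e|f|g|hijklmnopqrstuvwxyz     (8 extents, 26 blocks)
layout1 = [8, 2, 2, 1, 1, 1, 2, 3, 2, 2, 2]
layout2 = [1, 1, 1, 1, 1, 1, 1, 19]

# Naive: one length-26 dedupe copies Inode 1's fragmentation into Inode 2.
ino1, ino2 = make_inode(layout1, 0), make_inode(layout2, 100)
dedupe(26, ino1, 0, ino2, 0)
print(fragments(ino1), fragments(ino2))  # 11 11

# Smarter: split at the natural extent boundary -> 2 fragments each.
ino1, ino2 = make_inode(layout1, 0), make_inode(layout2, 100)
dedupe(7, ino1, 0, ino2, 0)
dedupe(19, ino2, 7, ino1, 7)
print(fragments(ino1), fragments(ino2))  # 2 2
```

Under this model, a kernel that silently extended the first length-7
request past its boundary would overwrite Inode 2's long reference to
'hijklmnopqrstuvwxyz', which is exactly the argument above for the
kernel not second-guessing the requested length.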
Re: fs/btrfs/volumes.c:6114 suspicious rcu_dereference_check() usage!
On 2018/08/24 0:49, David Howells wrote:
> I'm seeing the attached message generated from this line:
>
> 	btrfs_debug_in_rcu(fs_info,
> 		"btrfs_map_bio: rw %d 0x%x, sector=%llu, dev=%lu (%s id %llu), size=%u",
> 		bio_op(bio), bio->bi_opf, (u64)bio->bi_iter.bi_sector,
> 		(u_long)dev->bdev->bd_dev, rcu_str_deref(dev->name), dev->devid,
> 		bio->bi_iter.bi_size);
>
> in submit_stripe_bio(). I'm not sure exactly why, but I suspect
> rcu_str_deref() is the point from where it is generated.
>
> Note that the mount succeeds.
>
> This code was introduced by:
>
> 	commit 672d599041c862dd61a1576c32e946ef0d77aa34
> 	Author: Misono Tomohiro
> 	Date:   Thu Aug 2 16:19:07 2018 +0900

Thanks for the report. I didn't turn on CONFIG_PROVE_RCU and missed the
warning. I will send a fix.

Thanks,
Misono

> David
> ---
> =============================
> WARNING: suspicious RCU usage
> 4.18.0-fscache+ #540 Not tainted
> -----------------------------
> fs/btrfs/volumes.c:6114 suspicious rcu_dereference_check() usage!
>
> other info that might help us debug this:
>
> rcu_scheduler_active = 2, debug_locks = 1
> 1 lock held by mount/3194:
>  #0: 72604777 (&fs_type->s_umount_key#54/1){+.+.}, at: alloc_super+0xa4/0x313
>
> stack backtrace:
> CPU: 2 PID: 3194 Comm: mount Not tainted 4.18.0-fscache+ #540
> Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
> Call Trace:
>  dump_stack+0x67/0x8e
>  btrfs_map_bio+0x282/0x321
>  btree_submit_bio_hook+0x71/0xa6
>  submit_one_bio+0x55/0x66
>  read_extent_buffer_pages+0x1ec/0x2ab
>  btree_read_extent_buffer_pages+0x6e/0x237
>  ? alloc_extent_buffer+0x28f/0x2f2
>  read_tree_block+0x43/0x5e
>  open_ctree+0x139b/0x1ee4
>  btrfs_get_tree+0x357/0xa33
>  ? selinux_fs_context_dup+0x2d/0x104
>  vfs_get_tree+0x7a/0x162
>  btrfs_mount_root+0x52/0x8b
>  btrfs_get_tree+0x4ab/0xa33
>  ? vfs_parse_fs_string+0x5b/0x9e
>  vfs_get_tree+0x7a/0x162
>  do_mount+0x7f0/0x8b2
>  ? memdup_user+0x3e/0x5a
>  ksys_mount+0x72/0x97
>  __x64_sys_mount+0x21/0x24
>  do_syscall_64+0x7d/0x1a0
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x7f11b8365ada
> Code: 48 8b 0d c9 a3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 96 a3 2b 00 f7 d8 64 89 01 48
> RSP: 002b:7ffe0d9a5b98 EFLAGS: 0246 ORIG_RAX: 00a5
> RAX: ffda RBX: 5562bf7702a0 RCX: 7f11b8365ada
> RDX: 5562bf77a620 RSI: 5562bf7704a0 RDI: 5562bf770480
> RBP: R08: R09: 7f11b8620c40
> R10: c0ed R11: 0246 R12: 5562bf770480
> R13: 5562bf77a620 R14: R15: 7f11b90f9184
Re: Reproducer for "compressed data + hole data corruption bug, 2018 editiion"
On 2018/8/24 12:44 AM, Zygo Blaxell wrote:
> On Thu, Aug 23, 2018 at 01:10:48PM +0800, Qu Wenruo wrote:
>> On 2018/8/23 11:11 AM, Zygo Blaxell wrote:
>>> This is a repro script for a btrfs bug that causes corrupted data
>>> reads when reading a mix of compressed extents and holes. The bug is
>>> reproducible on at least kernels v4.1..v4.18.
>>
>> This bug already sounds more serious than the previous nodatasum +
>> compression bug.
>
> Maybe. "compression + holes corruption bug 2017" could be avoided with
> the max-inline=0 mount option without disabling compression. This
> time, the workaround is more intrusive: avoid all applications that
> use dedupe or hole-punching.
>
>>> Some more observations and background follow, but first here is the
>>> script and some sample output:
>>>
>>> root@rescue:/test# cat repro-hole-corruption-test
>>> #!/bin/bash
>>>
>>> # Write a 4096 byte block of something
>>> block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; }
>>>
>>> # Here is some test data with holes in it:
>>> for y in $(seq 0 100); do
>>>         for x in 0 1; do
>>>                 block 0;
>>>                 block 21;
>>>                 block 0;
>>>                 block 22;
>>>                 block 0;
>>>                 block 0;
>>>                 block 43;
>>>                 block 44;
>>>                 block 0;
>>>                 block 0;
>>>                 block 61;
>>>                 block 62;
>>>                 block 63;
>>>                 block 64;
>>>                 block 65;
>>>                 block 66;
>>>         done
>>
>> Does the content make any difference to this bug?
>> It's just 16 * 4K * 2 * 101 of data written *without* any holes so
>> far.
>
> The content of the extents doesn't seem to matter, other than it needs
> to be compressible so that the extents on disk are compressed. The bug
> is also triggered by writing non-zero data to all blocks, and then
> punching the holes later with "fallocate -p -l 4096 -o $(( insert math
> here ))".
>
> The layout of the extents matters a lot. I have to loop hundreds or
> thousands of times to hit the bug if the first block in the pattern is
> not a hole, or if the non-hole extents are different sizes or
> positions than above.
>
> I tried random patterns of holes and extent refs, and most of them
> have an order of magnitude lower hit rates than the above. This might
> be due to some relationship between the alignment of read() request
> boundaries with extent boundaries, but I haven't done any tests
> designed to detect such a relationship.
>
> In the wild, corruption happens on some files much more often than
> others. This seems to be correlated with the extent layout as well.
>
> I discovered the bug by examining files that were intermittently but
> repeatedly failing routine data integrity checks, and found that in
> every case they had similar hole + extent patterns near the point
> where data was corrupted.
>
> I did a search on some big filesystems for the
> hole-refExtentA-hole-refExtentA pattern and found several files with
> this pattern that had passed previous data integrity checks, but would
> fail randomly in the sha1sum/drop-caches loop.
>
>> This should indeed cause 101 128K compressed data extents.
>> But I'm wondering about the description of 'holes'.
>
> The holes are coming, wait for it... ;)
>
>>> done > am
>>> sync
>>>
>>> # Now replace those 101 distinct extents with 101 references to
>>> # the first extent
>>> btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x *
>>> 131072)); done) 2>&1 | tail
>>
>> Will this bug still happen by creating one extent and then reflinking
>> it 101 times?
>
> Yes. I used btrfs-extent-same because a binary is included in the
> Debian duperemove package, but I use it only for convenience.
>
> It's not necessary to have hundreds of references to the same
> extent--even two refs to a single extent plus a hole can trigger the
> bug sometimes. 100 references in a single file will trigger the bug so
> often that it can be detected within the first 20 sha1sum loops.
>
> When the corruption occurs, it affects around 90 of the original 101
> extents. The different sha1sum results are due to different extents
> giving bad data on different runs.
>
>>> # Punch holes into the extent refs
>>> fallocate -v -d am
>>
>> The hole-punch in fact happens here.
>>
>> BTW, will adding a "sync" here change the result?
>
> No. You can reboot the machine here if you like; it does not change
> anything that happens during reads later.

So it looks like my assumption of a bad file extent interpreter is
getting more and more valid. It has nothing to do with the race against
hole punching/write, but only the file layout and the extent map cache.

> Looking at the extent tree in btrfs-debug-tree, the data on disk
> looks correct, and btrfs does read it correctly most of the time (the
Re: lazytime mount option—no support in Btrfs
On Tue, Aug 21, 2018 at 4:10 PM Austin S. Hemmelgarn wrote:
> Also, once you've got the space cache set up by mounting once writable
> with the appropriate flag and then waiting for it to initialize, you
> should not ever need to specify the `space_cache` option again.

True. I am not sure why I thought I had to keep cache-v2 in the
rootflags list. I guess I forgot it was meant to be removed after a
reboot. Or maybe some old kernels misbehaved (always cleared the space
tree for some reason and re-initialized a space cache instead unless
the flag was there). But I removed it now and everything works as
intended. Thanks.
Device Delete Stalls - Conclusion
Hello everybody,

first, let me thank everybody for their advice. What I did was close
the terminal with the device delete process running in it and fire it
up again. It took about 5 minutes of intensive IO usage until the data
was redistributed and /dev/sda was removed from the list of drives. I
am currently running a scrub, which I hope will find and correct any
errors caused by the forceful abort of the device delete process.

Many people suggested it: next time, when I am going to replace one of
the older 4TB drives with a bigger one, I will use "replace" directly
instead of "add/delete".

Yours sincerely
Stefan
Re: Device Delete Stalls
And by 4.14 I actually mean 4.14.60 or 4.14.62 (based on the changelog). I don't think the single patch in 4.14.62 applies to your situation.
Re: Device Delete Stalls
On Thu, Aug 23, 2018 at 8:04 AM, Stefan Malte Schumacher wrote:
> Hallo,
>
> I originally had a RAID with six 4TB drives, which was more than 80
> percent full. So now I bought a 10TB drive, added it to the array and
> gave the command to remove the oldest drive in the array.
>
> btrfs device delete /dev/sda /mnt/btrfs-raid
>
> I kept a terminal with "watch btrfs fi show" open and it showed that
> the size of /dev/sda had been set to zero and that data was being
> redistributed to the other drives. All seemed well, but now the
> process stalls at 8GB being left on /dev/sda. It also seems that the
> size of the drive has been reset to the original value of 3.64TiB.
>
> Label: none  uuid: 1609e4e1-4037-4d31-bf12-f84a691db5d8
> 	Total devices 7  FS bytes used 8.07TiB
> 	devid 1 size 3.64TiB used 8.00GiB path /dev/sda
> 	devid 2 size 3.64TiB used 2.73TiB path /dev/sdc
> 	devid 3 size 3.64TiB used 2.73TiB path /dev/sdd
> 	devid 4 size 3.64TiB used 2.73TiB path /dev/sde
> 	devid 5 size 3.64TiB used 2.73TiB path /dev/sdf
> 	devid 6 size 3.64TiB used 2.73TiB path /dev/sdg
> 	devid 7 size 9.10TiB used 2.50TiB path /dev/sdb
>
> I see no more btrfs worker processes and no more activity in iotop.
> How do I proceed? I am using a current debian stretch which uses
> kernel 4.9.0-8 and btrfs-progs 4.7.3-1.
>
> How should I proceed? I have a backup but would prefer an easier and
> less time-consuming way out of this mess.

I'd let it keep running as long as you can tolerate it. In the
meantime, update your backups and keep using the file system normally;
it should be safe to use.

The block group migration can sometimes be slow with "btrfs dev del"
compared to the replace operation. I can't explain why, but it might be
related to some combination of file and free space fragmentation, the
number of snapshots, and just the general complexity of what is
effectively a partial balance operation going on.
Next, you could do a sysrq + t, which dumps process state into the
kernel message buffer; the buffer might not be big enough to contain
the output. If you're using systemd, "journalctl -k" will have it, and
presumably syslog's messages will have it. I can't parse this output,
but a developer might find it useful to see what's going on and whether
it's just plain wrong, or just slow.

Next, once you get sick of waiting, well, you can force a reboot with
'reboot -f' or 'sysrq + b', but then what's the plan? Sure, you could
just try again, but I don't know that this should give different
results. It's either just slow, or it's a bug. And if it's a bug, maybe
it's fixed in something newer, in which case I'd try a much newer
kernel: 4.14 at the oldest, and ideally 4.18.4, at least to finish off
this task.

For what it's worth, the bulk of the delete operation is like a
filtered balance: it's mainly relocating block groups, and that is
supposed to be COW. So it should be safe to do an abrupt reboot. If
you're not writing new information, there's no information to lose; the
worst case is that Btrfs has a slightly older superblock than the
latest generation for block group relocation, and it starts from that
point again.

I've done quite a lot of jerkface reboot -f and sysrq + b with Btrfs
and have never broken a file system so far (power failures, different
story) but maybe I'm lucky and I have a bunch of well-behaved devices.

-- 
Chris Murphy
Re: btrfs-convert missing in btrfs-tools v4.15.1
Hi Qu,

On Sun, Jul 29, 2018 at 07:44:05AM +0800, Qu Wenruo wrote:
> On July 29, 2018 at 05:34, Nicholas D Steeves wrote:
> > Resending because I forgot to CC the list.
> >
> > Hi jkexcel,
> >
> > On 28 July 2018 at 16:50, jkexcel wrote:
> >> I'm an end user trying to use btrfs-convert, but when I installed
> >> btrfs-tools and its dependency btrfs-progs on kubuntu 18.04, the
> >> installation was successful, and it shows that v4.15.1-1build1 was
> >> installed.
> >>
> >> However, when using the command # btrfs-convert /dev/sda4 (with the
> >> drive unmounted) the resulting error appears: "command not found".
> >> I also tried the command "btrfs convert" in case this was folded
> >> into the main tool, but this also failed.
> >>
> >> 1. Is btrfs-convert still available?
> >>
> >> 2. Where can I find it?
> >>
> >> 3. Has btrfs-convert been replaced? What is its new name?
> >>
> >> 4. Is it safe to use a downgraded version of btrfs-tools, i.e. 4.14
> >> or older?
> >
> > You can blame me for that. In Debian several users had reported
> > release-critical issues in btrfs-convert, so I submitted a patch to
> > disable it for the foreseeable future, e.g.:
> >
> > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=864798
> > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=854489
>
> Both reports look pretty old (4.7~4.9).
>
> In fact, just in v4.10 we had a big rework of convert.
> It should work much better after that.
>
> Furthermore, we have newer (but smaller) fixes since then to remove a
> lot of BUG_ON()s and do a graceful exit for the ENOSPC case.
>
> And the design of btrfs-convert (at least the latest btrfs-convert)
> is to ensure that until everything goes well, we won't touch any bytes
> of the ext* fs (in fact we open the ext* fs in RO mode).
> So it at least won't corrupt the ext* fs.
> > Also, please consider the official status: "As of 4.0 kernels this
> > feature is not often used or well tested anymore, and there have
> > been some reports that the conversion doesn't work reliably. Feel
> > free to try it out, but make sure you have backups"
> > ( https://btrfs.wiki.kernel.org/index.php/Conversion_from_Ext3 ).
>
> The wiki page looks like it needs to be updated.
>
> Both btrfs-convert and the base btrfs-progs are improving; especially
> after v4.10, btrfs-convert went through a big rework and has worked
> well so far, and even added support for reiserfs under the same
> framework.
>
> So IMHO it's at least worth trying.
>
> Thanks,
> Qu

Thank you for sharing the cut-off where success became more likely :-)
Debian 9 could have had 4.9.1 at the newest, so it wouldn't have had
btrfs-convert. So it sounds like btrfs-convert could have been enabled
for the experimental suite (which is almost only used by Debian
developers and not users) for 4.10. Looking at the changelog, it seems
we might have had to disable it again before 4.14.1 or 4.16.

I'm happy we're having this conversation now, because the time to give
it another try is probably sometime in the next month :-) See my long
email to David for the caveats.

Cheers,
Nicholas
Re: btrfs-convert missing in btrfs-tools v4.15.1
Hi everyone, Sorry for the delay replying; I've been busy with other work. Reply follows inline. P.S. sorry about the table in this email that is 99 columns wide. On Thu, Aug 09, 2018 at 01:50:46PM +0200, David Sterba wrote: > On Sat, Jul 28, 2018 at 05:34:49PM -0400, Nicholas D Steeves wrote: > > On 28 July 2018 at 16:50, jkexcel wrote: > > > I'm an end user trying to use btrfs-convert but when I installed > > > btrfs-tools and its dependency btrfs-progs on kubuntu 18.04, the > > > installation was successful, and it shows that v4.15.1-1build1 was > > > installed. > > > > > > However, when using the command # brtfs-convert /dev/sda4 (with the > > > drive unmounted) the resulting error appears "command not found" > > > I also tried command "btrfs convert" in case this was folded into the > > > main tool, but this also failed. > > > > > > 1. Is btrfs-convert still available? > > > > > > 2. Where can I find it? > > > > > > 3. Has btrfs-convert been replaced? What is its new name? > > > > > > 4. Is it safe to use a downgraded version of btrfs-tools ie: 4.14 or > > > older? > > > > You can blame me for that. In Debian several users had reported > > release-critical issues in btrfs-convert, so I submitted a patch to > > disable it for the foreseeable future, eg: > > > > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=864798 > > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=854489 > > The reports are against version 4.7.3, released one year before the time > of the bug reports. The fix for the reported bug happened in 4.10, > half a year before the bug. Debian stable will always have an old version, which will be in use for two to four years. Btrfs-progs 4.7.3 will be in use in Debian 9 until at least 2021, possibly longer. Btw, I strongly believe Debian 9 should have shipped with btrfs-progs 4.9.1, but alas the primary maintainer didn't upload it on time. 
For "buster" (Debian 10), which will probably be released in mid 2019, the newest possible btrfs-progs version that could be included is whatever is current at the end of January 2019. Exceptions are sometimes granted for an unblock of a newer version. For example, if an LTS kernel won't be released until March, and the release, technical committee, and kernel team decide that's the version we want, then a preapproved exception will be granted. At that time an argument can be made for preapproval of a newer btrfs-progs as well. That said, I try to keep a backported newer version of btrfs-progs for the stable Debian release reasonably up-to-date (backports are available to users on a per-package opt-in basis). That's the channel for feature enablement. Also, my apologies, at the moment this backport is very much out of date--it stalled while investigating which packages would be broken by the library reorganisation; although, from what I can tell that would only be snapper. > > Also, please consider the official status "As of 4.0 kernels this feature > > is not often used or well tested anymore, and there have been some reports > > that the conversion doesn't work reliably. Feel free to try it out, but > > make sure you have backups" ( > > https://btrfs.wiki.kernel.org/index.php/Conversion_from_Ext3 ). > > Sorry that this you take it as official status. The wiki is open to > edits and such claims appear there from time to time. I've removed it, > it's been there since 2015 when it possibly was accurate but it's not > anymore. Is there a more authoritative and up-to-date location for various statuses? It would be nice to have something in btrfs-progs as a table like this, or exportable in some kind of human-friendly format: +---+-++--+---+ |Feature|1st mainline version where feature |LTS kernel 1|LTS kernel 2 |LTS kernel 3 | | |appeared |eg: 4.4 |eg: 4.9 |eg: 4.14 | +---+-++--+---+ |convert|Assume -progs and kernel |exp?|mostly? 
|stable?| |from |vessions are strongly|| amend | | |ext3/4 |associated, for simplicity. || status with: | | | | ||4.9.z:testing | | +---+-++--+---+ |foo|3.16:danger||exp||mostly||testing||stable|exp |mostly |testing| +---+-++--+---+ Ideally something that could be trivially exported to the format the btrfs.wiki.kernel.org wiki uses. The trick is to make it convenient to maintain... Other than CSV, I'm not sure what would work well for
Re: Device Delete Stalls
On 2018-08-23 10:04, Stefan Malte Schumacher wrote: Hallo, I originally had RAID with six 4TB drives, which was more than 80 percent full. So now I bought a 10TB drive, added it to the Array and gave the command to remove the oldest drive in the array. btrfs device delete /dev/sda /mnt/btrfs-raid I kept a terminal with "watch btrfs fi show" open and It showed that the size of /dev/sda had been set to zero and that data was being redistributed to the other drives. All seemed well, but now the process stalls at 8GB being left on /dev/sda/. It also seems that the size of the drive has been reset the original value of 3,64TiB. Label: none uuid: 1609e4e1-4037-4d31-bf12-f84a691db5d8 Total devices 7 FS bytes used 8.07TiB devid1 size 3.64TiB used 8.00GiB path /dev/sda devid2 size 3.64TiB used 2.73TiB path /dev/sdc devid3 size 3.64TiB used 2.73TiB path /dev/sdd devid4 size 3.64TiB used 2.73TiB path /dev/sde devid5 size 3.64TiB used 2.73TiB path /dev/sdf devid6 size 3.64TiB used 2.73TiB path /dev/sdg devid7 size 9.10TiB used 2.50TiB path /dev/sdb I see no more btrfs worker processes and no more activity in iotop. How do I proceed? I am using a current debian stretch which uses Kernel 4.9.0-8 and btrfs-progs 4.7.3-1. How should I proceed? I have a Backup but would prefer an easier and less time-comsuming way out of this mess. Not exactly what you asked for, but I do have some advice on how to avoid this situation in the future: If at all possible, use `btrfs device replace` instead of an add/delete cycle. The replace operation requires two things. First, you have to be able to connect the new device to the system while all the old ones except the device you are removing are present. Second, the new device has to be at least as big as the old one. Assuming both conditions are met and you can use replace, it's generally much faster and is a lot more reliable than an add/delete cycle (especially when the array is near full). 
This is because replace just copies the data that's on the old device directly (or rebuilds it directly if it's not present anymore or corrupted), whereas the add/delete method implicitly re-balances the entire array (which takes a long time and may fail if the array is mostly full). Now, as far as what's actually going on here, I'm unfortunately not quite sure, and therefore I'm really not the best person to be giving advice on how to fix it. I will comment that having info on the allocations for all the devices (not just /dev/sda) would be useful in debugging, but even with that I don't know that I personally can help.
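To make the replace suggestion concrete, here is a dry-run sketch of the flow (not from the thread; device names and the mount point are placeholders, and the block only prints the commands rather than running them):

```shell
# Dry-run sketch of replacing a device instead of add+delete.
# /dev/sda (departing), /dev/sdh (new) and the mount point are
# placeholders; this block only prints the plan.
OLD=/dev/sda
NEW=/dev/sdh
MNT=/mnt/btrfs-raid

plan="btrfs replace start $OLD $NEW $MNT   # copies/rebuilds only $OLD's chunks
btrfs replace status $MNT                  # poll progress
btrfs filesystem resize <devid>:max $MNT   # if the new device is larger"

printf '%s\n' "$plan"
```

The `<devid>` in the resize step is left symbolic on purpose: it is the numeric id of the replacement device as shown by `btrfs filesystem show`.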
Re: Reproducer for "compressed data + hole data corruption bug, 2018 editiion"
On Thu, Aug 23, 2018 at 01:10:48PM +0800, Qu Wenruo wrote:
> On 2018/8/23 上午11:11, Zygo Blaxell wrote:
> > This is a repro script for a btrfs bug that causes corrupted data reads
> > when reading a mix of compressed extents and holes. The bug is
> > reproducible on at least kernels v4.1..v4.18.
>
> This bug already sounds more serious than the previous nodatasum +
> compression bug.

Maybe. The "compression + holes corruption bug 2017" could be avoided with the max-inline=0 mount option without disabling compression. This time, the workaround is more intrusive: avoid all applications that use dedupe or hole-punching.

> > Some more observations and background follow, but first here is the
> > script and some sample output:
> >
> > root@rescue:/test# cat repro-hole-corruption-test
> > #!/bin/bash
> >
> > # Write a 4096 byte block of something
> > block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; }
> >
> > # Here is some test data with holes in it:
> > for y in $(seq 0 100); do
> > 	for x in 0 1; do
> > 		block 0;
> > 		block 21;
> > 		block 0;
> > 		block 22;
> > 		block 0;
> > 		block 0;
> > 		block 43;
> > 		block 44;
> > 		block 0;
> > 		block 0;
> > 		block 61;
> > 		block 62;
> > 		block 63;
> > 		block 64;
> > 		block 65;
> > 		block 66;
> > 	done
>
> Does the content make any difference to this bug?
> It's just 16 * 4K * 2 * 101 data writes *without* any hole so far.

The content of the extents doesn't seem to matter, other than that it needs to be compressible so that the extents on disk are compressed. The bug is also triggered by writing non-zero data to all blocks, and then punching the holes later with "fallocate -p -l 4096 -o $(( insert math here ))". The layout of the extents matters a lot. I have to loop hundreds or thousands of times to hit the bug if the first block in the pattern is not a hole, or if the non-hole extents are different sizes or positions than above. I tried random patterns of holes and extent refs, and most of them have an order of magnitude lower hit rates than the above. 
This might be due to some relationship between the alignment of read() request boundaries and extent boundaries, but I haven't done any tests designed to detect such a relationship. In the wild, corruption happens on some files much more often than others. This seems to be correlated with the extent layout as well. I discovered the bug by examining files that were intermittently but repeatedly failing routine data integrity checks, and found that in every case they had similar hole + extent patterns near the point where the data was corrupted. I did a search on some big filesystems for the hole-refExtentA-hole-refExtentA pattern and found several files with this pattern that had passed previous data integrity checks, but would fail randomly in the sha1sum/drop-caches loop.

> This should indeed cause 101 128K compressed data extents.
> But I'm wondering about the description of 'holes'.

The holes are coming, wait for it... ;)

> > 	done > am
> >
> > 	sync
> >
> > 	# Now replace those 101 distinct extents with 101 references to the first extent
> > 	btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail

> Will this bug still happen by creating one extent and then reflinking it
> 101 times?

Yes. I used btrfs-extent-same because a binary is included in the Debian duperemove package, but I use it only for convenience. It's not necessary to have hundreds of references to the same extent--even two refs to a single extent plus a hole can trigger the bug sometimes. 100 references in a single file will trigger the bug so often that it can be detected within the first 20 sha1sum loops. When the corruption occurs, it affects around 90 of the original 101 extents. The different sha1sum results are due to different extents giving bad data on different runs.

> > 	# Punch holes into the extent refs
> > 	fallocate -v -d am

> Hole-punch in fact happens here.
>
> BTW, will adding a "sync" here change the result?

No. 
You can reboot the machine here if you like; it does not change anything that happens during reads later. Looking at the extent tree in btrfs-debug-tree, the data on disk looks correct, and btrfs does read it correctly most of the time (the correct sha1sum below is 6926a34e0ab3e0a023e8ea85a650f5b4217acab4). The corruption therefore comes from btrfs read() producing incorrect data in some instances.

> > 	# Do some other stuff on the machine while this runs, and watch the
> > 	# sha1sums change!
> > 	while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 1; done
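A portable aside (plain coreutils, not the repro itself): a hole is defined to read back as zeros, which is why hole-punching with fallocate -d must only change how the bytes are stored, never what read() returns. That invariant is exactly what the bug above violates. A minimal sparse-file demonstration:

```shell
# Sparse-file sketch: holes read back as zeros.  Uses only coreutils,
# so it runs anywhere; it does NOT reproduce the btrfs bug.
f=$(mktemp)
head -c 4096 /dev/zero | tr '\0' 'A'  > "$f"   # block 0: data
truncate -s $((3 * 4096)) "$f"                 # blocks 1-2: implicit hole
head -c 4096 /dev/zero | tr '\0' 'B' >> "$f"   # block 3: data after the hole

size=$(stat -c %s "$f")
# bytes 4096..12287 are the hole; count any non-zero bytes found there
hole_nonzero=$(dd if="$f" bs=4096 skip=1 count=2 2>/dev/null | tr -d '\0' | wc -c)
echo "size=$size hole_nonzero_bytes=$hole_nonzero"   # size=16384 hole_nonzero_bytes=0
rm -f "$f"
```

On a filesystem exhibiting the bug, it is the compressed-extent reads next to such holes — not the holes themselves — that return wrong data.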
Re: fs/btrfs/volumes.c:6114 suspicious rcu_dereference_check() usage!
David Sterba wrote: > The code previously had explicit rcu_lock/unlock, now it uses the > btrfs_debug_in_rcu helper which is supposed to provide that. It's > possible that the helper is missing it due to some #ifdef mess, but I > don't see it. I preprocessed the function and extracted submit_stripe_bio() (see attached). The problem is that btrfs_no_printk() is being called, but not within an RCU section. #define btrfs_debug_in_rcu(fs_info, fmt, args...) \ btrfs_printk_in_rcu(fs_info, KERN_DEBUG fmt, ##args) David --- static void submit_stripe_bio(struct btrfs_bio *bbio, struct bio *bio, u64 physical, int dev_nr, int async) { struct btrfs_device *dev = bbio->stripes[dev_nr].dev; struct btrfs_fs_info *fs_info = bbio->fs_info; bio->bi_private = bbio; btrfs_io_bio(bio)->stripe_index = dev_nr; bio->bi_end_io = btrfs_end_bio; bio->bi_iter.bi_sector = physical >> 9; btrfs_no_printk(fs_info, "\001" "7" "btrfs_map_bio: rw %d 0x%x, sector=%llu, dev=%lu (%s id %llu), size=%u", ((bio)->bi_opf & ((1 << 8) - 1)), bio->bi_opf, (u64)bio->bi_iter.bi_sector, (u_long)dev->bdev->bd_dev, ({ struct rcu_string *__str = ({ typeof(*(dev->name)) *p1 = (typeof(*(dev->name)) *)({ union { typeof((dev->name)) __val; char __c[1]; } __u; if (1) __read_once_size(&((dev->name)), __u.__c, sizeof((dev->name))); else __read_once_size_nocheck(&((dev->name)), __u.__c, sizeof((dev->name))); do { } while (0); __u.__val; }); do { static bool __attribute__ ((__section__(".data.unlikely"))) __warned; if (debug_lockdep_rcu_enabled() && !__warned && (!((0) || rcu_read_lock_held( { __warned = true; lockdep_rcu_suspicious( "fs/btrfs/volumes.c" # 6110 "../fs/btrfs/volumes.c" , 6114 # 6110 "../fs/btrfs/volumes.c" , "suspicious rcu_dereference_check() usage"); } } while (0); ; ((typeof(*(dev->name)) *)(p1)); }); __str->str; }), dev->devid, bio->bi_iter.bi_size) ; do { if ((bio)->bi_disk != (dev->bdev)->bd_disk) bio_clear_flag(bio, 9); (bio)->bi_disk = (dev->bdev)->bd_disk; (bio)->bi_partno = (dev->bdev)->bd_partno; 
} while (0); btrfs_bio_counter_inc_noblocked(fs_info); if (async) btrfs_schedule_bio(dev, bio); else btrfsic_submit_bio(bio); }
Re: fs/btrfs/volumes.c:6114 suspicious rcu_dereference_check() usage!
On Thu, Aug 23, 2018 at 04:49:11PM +0100, David Howells wrote: > I'm seeing the attached message generated from this line: > > btrfs_debug_in_rcu(fs_info, > "btrfs_map_bio: rw %d 0x%x, sector=%llu, dev=%lu (%s id %llu), size=%u", > bio_op(bio), bio->bi_opf, (u64)bio->bi_iter.bi_sector, > (u_long)dev->bdev->bd_dev, rcu_str_deref(dev->name), dev->devid, > bio->bi_iter.bi_size); > > in submit_stripe_bio(). I'm not sure exactly why, but I suspect > rcu_str_deref() is the point from where it is generated. Yes, the macro is rcu_dereference in disguise. The code previously had explicit rcu_lock/unlock, now it uses the btrfs_debug_in_rcu helper which is supposed to provide that. It's possible that the helper is missing it due to some #ifdef mess, but I don't see it.
fs/btrfs/volumes.c:6114 suspicious rcu_dereference_check() usage!
I'm seeing the attached message generated from this line:

	btrfs_debug_in_rcu(fs_info,
		"btrfs_map_bio: rw %d 0x%x, sector=%llu, dev=%lu (%s id %llu), size=%u",
		bio_op(bio), bio->bi_opf, (u64)bio->bi_iter.bi_sector,
		(u_long)dev->bdev->bd_dev, rcu_str_deref(dev->name), dev->devid,
		bio->bi_iter.bi_size);

in submit_stripe_bio(). I'm not sure exactly why, but I suspect rcu_str_deref() is the point from where it is generated. Note that the mount succeeds. This code was introduced by:

commit 672d599041c862dd61a1576c32e946ef0d77aa34
Author: Misono Tomohiro
Date:   Thu Aug 2 16:19:07 2018 +0900

David
---

=============================
WARNING: suspicious RCU usage
4.18.0-fscache+ #540 Not tainted
-----------------------------
fs/btrfs/volumes.c:6114 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
1 lock held by mount/3194:
 #0: 72604777 (&fs_type->s_umount_key#54/1){+.+.}, at: alloc_super+0xa4/0x313

stack backtrace:
CPU: 2 PID: 3194 Comm: mount Not tainted 4.18.0-fscache+ #540
Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
Call Trace:
 dump_stack+0x67/0x8e
 btrfs_map_bio+0x282/0x321
 btree_submit_bio_hook+0x71/0xa6
 submit_one_bio+0x55/0x66
 read_extent_buffer_pages+0x1ec/0x2ab
 btree_read_extent_buffer_pages+0x6e/0x237
 ? alloc_extent_buffer+0x28f/0x2f2
 read_tree_block+0x43/0x5e
 open_ctree+0x139b/0x1ee4
 btrfs_get_tree+0x357/0xa33
 ? selinux_fs_context_dup+0x2d/0x104
 vfs_get_tree+0x7a/0x162
 btrfs_mount_root+0x52/0x8b
 btrfs_get_tree+0x4ab/0xa33
 ? vfs_parse_fs_string+0x5b/0x9e
 vfs_get_tree+0x7a/0x162
 do_mount+0x7f0/0x8b2
 ? memdup_user+0x3e/0x5a
 ksys_mount+0x72/0x97
 __x64_sys_mount+0x21/0x24
 do_syscall_64+0x7d/0x1a0
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f11b8365ada
Code: 48 8b 0d c9 a3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 96 a3 2b 00 f7 d8 64 89 01 48
RSP: 002b:7ffe0d9a5b98 EFLAGS: 0246 ORIG_RAX: 00a5
RAX: ffda RBX: 5562bf7702a0 RCX: 7f11b8365ada
RDX: 5562bf77a620 RSI: 5562bf7704a0 RDI: 5562bf770480
RBP: R08: R09: 7f11b8620c40
R10: c0ed R11: 0246 R12: 5562bf770480
R13: 5562bf77a620 R14: R15: 7f11b90f9184
Device Delete Stalls - Addition
Hello everybody, I think this might be useful:

root@mars:~# btrfs dev usage /mnt/btrfs-raid/
/dev/sda, ID: 1
   Device size:     3.64TiB
   Device slack:      0.00B
   Data,RAID1:      7.00GiB
   Metadata,RAID1:  1.00GiB
   Unallocated:     3.63TiB

Yours sincerely Stefan
Device Delete Stalls
Hallo, I originally had a RAID with six 4TB drives, which was more than 80 percent full. So now I bought a 10TB drive, added it to the array, and gave the command to remove the oldest drive in the array:

btrfs device delete /dev/sda /mnt/btrfs-raid

I kept a terminal with "watch btrfs fi show" open, and it showed that the size of /dev/sda had been set to zero and that data was being redistributed to the other drives. All seemed well, but now the process stalls with 8GB left on /dev/sda. It also seems that the size of the drive has been reset to the original value of 3.64TiB.

Label: none  uuid: 1609e4e1-4037-4d31-bf12-f84a691db5d8
	Total devices 7 FS bytes used 8.07TiB
	devid 1 size 3.64TiB used 8.00GiB path /dev/sda
	devid 2 size 3.64TiB used 2.73TiB path /dev/sdc
	devid 3 size 3.64TiB used 2.73TiB path /dev/sdd
	devid 4 size 3.64TiB used 2.73TiB path /dev/sde
	devid 5 size 3.64TiB used 2.73TiB path /dev/sdf
	devid 6 size 3.64TiB used 2.73TiB path /dev/sdg
	devid 7 size 9.10TiB used 2.50TiB path /dev/sdb

I see no more btrfs worker processes and no more activity in iotop. I am using a current Debian stretch, which uses kernel 4.9.0-8 and btrfs-progs 4.7.3-1. How should I proceed? I have a backup, but would prefer an easier and less time-consuming way out of this mess. Yours Stefan
Re: [PATCH v2.1] btrfs: Handle owner mismatch gracefully when walking up tree
On 2018/8/23 下午9:24, David Sterba wrote: > On Tue, Aug 21, 2018 at 09:42:03AM +0800, Qu Wenruo wrote: >> [BUG] >> When mounting certain crafted image, btrfs will trigger kernel BUG_ON() >> when try to recover balance: >> -- >> [ cut here ] > > Please don't use lines starting with ---, I thought "--" wouldn't trigger git's "---" separator and was therefore OK to use. > an empty line is fine, or you > can also indent the trace dump by a few spaces for clarity. I'll update > the patches that have been merged so you don't need to resend. I'll use an empty line from now on. Thanks, Qu signature.asc Description: OpenPGP digital signature
Re: [PATCH v2.1] btrfs: Handle owner mismatch gracefully when walking up tree
On Tue, Aug 21, 2018 at 09:42:03AM +0800, Qu Wenruo wrote: > [BUG] > When mounting certain crafted image, btrfs will trigger kernel BUG_ON() > when try to recover balance: > -- > [ cut here ] Please don't use lines starting with ---, an empty line is fine, or you can also indent the trace dump by a few spaces for clarity. I'll update the patches that have been merged so you don't need to resend.
Re: [PATCH v2.1] btrfs: Handle owner mismatch gracefully when walking up tree
On Tue, Aug 21, 2018 at 09:42:03AM +0800, Qu Wenruo wrote:
> [BUG]
> When mounting certain crafted image, btrfs will trigger kernel BUG_ON()
> when try to recover balance:
> --
> [ cut here ]
> kernel BUG at fs/btrfs/extent-tree.c:8956!
> invalid opcode: [#1] PREEMPT SMP NOPTI
> CPU: 1 PID: 662 Comm: mount Not tainted 4.18.0-rc1-custom+ #10
> RIP: 0010:walk_up_proc+0x336/0x480 [btrfs]
> RSP: 0018:b53540c9b890 EFLAGS: 00010202
> Call Trace:
>  walk_up_tree+0x172/0x1f0 [btrfs]
>  btrfs_drop_snapshot+0x3a4/0x830 [btrfs]
>  merge_reloc_roots+0xe1/0x1d0 [btrfs]
>  btrfs_recover_relocation+0x3ea/0x420 [btrfs]
>  open_ctree+0x1af3/0x1dd0 [btrfs]
>  btrfs_mount_root+0x66b/0x740 [btrfs]
>  mount_fs+0x3b/0x16a
>  vfs_kern_mount.part.9+0x54/0x140
>  btrfs_mount+0x16d/0x890 [btrfs]
>  mount_fs+0x3b/0x16a
>  vfs_kern_mount.part.9+0x54/0x140
>  do_mount+0x1fd/0xda0
>  ksys_mount+0xba/0xd0
>  __x64_sys_mount+0x21/0x30
>  do_syscall_64+0x60/0x210
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> ---[ end trace d4344e4deee03435 ]---
> --
>
> [CAUSE]
> Another extent tree corruption.
>
> In this particular case, tree reloc root's owner is
> DATA_RELOC_TREE (should be TREE_RELOC_TREE), thus its backref is
> corrupted and we failed the owner check in walk_up_tree().
>
> [FIX]
> It's pretty hard to take care of every extent tree corruption, but at
> least we can remove such BUG_ON() and exit more gracefully.
>
> And since in this particular image, DATA_RELOC_TREE and TREE_RELOC_TREE
> shares the same root (which is obviously invalid), we needs to make
> __del_reloc_root() more robust to detect such invalid share to avoid
> possible NULL dereference as root->node can be NULL in this case.
>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=200411
> Reported-by: Xu Wen
> Signed-off-by: Qu Wenruo

Reviewed-by: David Sterba
Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)
On Mon, Aug 20, 2018 at 08:33:49AM -0700, Darrick J. Wong wrote: > On Mon, Aug 20, 2018 at 11:09:32AM +1000, Dave Chinner wrote: > > - is documenting rejection on request alignment grounds > > (i.e. EINVAL) in the man page sufficient for app > > developers to understand what is going on here? > > I think so. The manpage says: "The filesystem does not support > reflinking the ranges of the given files", which (to my mind) covers > this case of not supporting dedupe of EOF blocks. Older versions of btrfs dedupe (before v4.2 or so) used to do exactly this; however, on btrfs, not supporting dedupe of EOF blocks means small files (one extent) cannot be deduped at all, because the EOF block holds a reference to the entire dst extent. If a dedupe app doesn't go all the way to EOF on btrfs, then it should not attempt to dedupe any part of the last extent of the file as the benefit would be zero or slightly negative. The app developer would need to be aware that such a restriction could exist on some filesystems, and be able to distinguish this from other cases that could lead to EINVAL. Portable code would have to try a dedupe up to EOF, then if that failed, round down and retry, and if that failed too, the app would have to figure out which filesystem it's running on to know what to do next. Performance demands the app know what the FS will do in advance, and avoid a whole class of behavior. btrfs dedupe reports success if the src extent is inline and the same size as the dst extent (i.e. file is smaller than one page). No dedupe can occur in such cases--a clone results in a simple copy, so the best a dedupe could do would be a no-op. Returning EINVAL there would break a few popular tools like "cp --reflink". Returning OK but doing nothing seems to be the best option in that case. > > - should we just round down the EOF dedupe request to the > > block before EOF so dedupe still succeeds? 
> > I've often wondered if the interface should (have) be(en) that we start > at src_off/dst_off and share as many common blocks as possible until we > find a mismatch, then tell userspace where we stopped... instead of like > now where we compare the entire extent and fail if any part of it > doesn't match. The usefulness or harmfulness of that approach depends a lot on what the application expects the filesystem to do. In btrfs, the dedupe operation acts on references to data, not the underlying data blocks. If there are 1000 slightly overlapping references to a single contiguous range of data blocks in dst on disk, each dedupe operation acts on only one of those, leaving the other 999 untouched. If the app then submits 999 other dedupe requests, no references to the dst blocks remain and the underlying data blocks can be deleted. In a parallel universe (or a better filesystem, or a userspace emulation built out of dedupe and other ioctls), dedupe could work at the extent data (physical) level. The app points at src and dst extent references (inode/offset/length tuples), and the filesystem figures out which physical blocks these point to, then adjusts all the references to the dst blocks at once, dealing with partial overlaps and snapshots and nodatacow and whatever other exotic features might be lurking in the filesystem, ending with every reference to every part of dst replaced by the longest possible contiguous reference(s) to src. Problems arise if the length deduped is not exactly the length requested. If the search continues until a mismatch is found, where does the search for a mismatch lead? Does the search follow physically contiguous blocks on disk, or would dedupe follow logically contiguous blocks in the src and dst files? Or the intersection of those, i.e. physically contiguous blocks that are logically contiguous in _any_ two files, not limited to src and dst. 
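As a userspace illustration of the "share blocks until the first mismatch" interface discussed above: the length such a call would report back is simply the count of leading identical blocks in the two ranges. This is a hypothetical helper, not part of any kernel API, using 4 KiB blocks:

```shell
# Hypothetical sketch: count how many leading 4 KiB blocks two files
# share, i.e. what a "dedupe until mismatch" call would return.
same_blocks () {
  a=$1; b=$2; n=0
  while :; do
    b1=$(dd if="$a" bs=4096 skip=$n count=1 2>/dev/null | od -An -v -tx1)
    b2=$(dd if="$b" bs=4096 skip=$n count=1 2>/dev/null | od -An -v -tx1)
    [ -n "$b1" ] && [ "$b1" = "$b2" ] || break   # stop at EOF or mismatch
    n=$((n + 1))
  done
  echo "$n"
}

f1=$(mktemp); f2=$(mktemp)
head -c 8192 /dev/zero | tr '\0' 'X' > "$f1"; echo B >> "$f1"
head -c 8192 /dev/zero | tr '\0' 'X' > "$f2"; echo C >> "$f2"
shared=$(same_blocks "$f1" "$f2")
echo "shared blocks: $shared"   # shared blocks: 2
rm -f "$f1" "$f2"
```

Note what this sketch deliberately cannot express: it compares logical file contents only, and says nothing about which physical extents back those blocks — which is exactly the ambiguity Zygo raises about where such a search should stop.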
There is also the problem where the files could have been previously deduped and then partially overwritten with identical data. If the application cannot control where the dedupe search for identical data ends, it can end up accidentally creating new references to extents while it is trying to eliminate those extents. The kernel might do a lot of extra work from looking ahead that the application has to undo immediately (e.g. after the first few blocks of dst, the app wants to do another dedupe with a better src extent elsewhere on the filesystem, but the kernel goes ahead and dedupes with an inferior src beyond the end of what the app asked for). bees tries to determine exactly the set of dedupe requests required to remove all references to duplicate extents (and maybe someday do defrag as well). If the kernel deviates from the requested sizes (e.g. because the data changed on the filesystem between dedup requests), the final extent layout after the dedupe requests are finished won't match what bees expected it to be, so bees has to reexamine the filesystem and either retry with a fresh set of exact
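Returning to the EOF-block problem from earlier in this mail: the "round down and retry" dance that portable code has to perform is plain block arithmetic. The numbers below are illustrative only:

```shell
# Sketch of rounding a dedupe request down to block alignment.
blksz=4096
file_size=10000                            # hypothetical file
aligned=$(( file_size / blksz * blksz ))   # largest block-aligned length
tail_len=$(( file_size - aligned ))        # unaligned tail at EOF
echo "aligned=$aligned tail=$tail_len"     # aligned=8192 tail=1808
# If file_size < blksz then aligned is 0: on a filesystem that rejects
# EOF-block dedupe, such a one-block file cannot be deduped at all.
```

That zero-length corner case is the btrfs small-file situation described above, where refusing the EOF block would make single-extent files undedupable.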
Re: [PATCH] Btrfs: use next_state in find_first_extent_bit
On Thu, Aug 23, 2018 at 03:14:53AM +0800, Liu Bo wrote: > As next_state() is already defined to get the next state, update us to use > the helper. > > No functional changes. > > Signed-off-by: Liu Bo Added to misc-next, thanks.
Re: [PATCH 0/5] rb_first to rb_first_cached conversion
On 22.08.2018 22:51, Liu Bo wrote:
> Several structs in btrfs are using rb_first() in a while loop, it'd be
> more efficient to do this with rb_first_cached() which has the O(1)
> complexity.
>
> This patch set updates five structs which may have a large rb tree in
> practice
>
> Liu Bo (5):
>   Btrfs: href_root: use rb_first_cached
>   Btrfs: href->ref_tree: use rb_first_cached
>   Btrfs: delayed_items: use rb_first_cached
>   Btrfs: extent_map: use rb_first_cached
>   Btrfs: preftree: use rb_first_cached
>
>  fs/btrfs/backref.c                | 34 +-
>  fs/btrfs/delayed-inode.c          | 29 +--
>  fs/btrfs/delayed-inode.h          |  4 ++--
>  fs/btrfs/delayed-ref.c            | 50 +++
>  fs/btrfs/delayed-ref.h            |  4 ++--
>  fs/btrfs/disk-io.c                |  8 +++
>  fs/btrfs/extent-tree.c            | 19 ---
>  fs/btrfs/extent_map.c             | 27 +++--
>  fs/btrfs/extent_map.h             |  2 +-
>  fs/btrfs/inode.c                  |  4 ++--
>  fs/btrfs/tests/extent-map-tests.c |  4 ++--
>  fs/btrfs/transaction.c            |  5 ++--
>  fs/btrfs/volumes.c                |  5 ++--
>  13 files changed, 107 insertions(+), 88 deletions(-)

Do you have any performance data proving this is in fact helping?
Re: [PATCH 0/5] rb_first to rb_first_cached conversion
On 08/22/18 21:51, Liu Bo wrote: Several structs in btrfs are using rb_first() in a while loop, it'd be more efficient to do this with rb_first_cached() which has the O(1) complexity. This patch set updates five structs which may have a large rb tree in practice Great idea, works for me with no bad side effects so far. So: Tested-by: Holger Hoffstätte cheers Holger
Re: [PATCH] btrfs-progs: dump-tree: Introduce --breadth-first option
On 2018/8/23 下午3:36, Nikolay Borisov wrote: > > > On 23.08.2018 10:31, Qu Wenruo wrote: >> Introduce --breadth-first option to do breadth-first tree dump. >> This is especially handy to inspect high level trees, e.g. comparing >> tree reloc tree with its source tree. > > Will it make sense instead of exposing another option to just have a > heuristics check that will switch to the BFS if the tree is higher than, > say, 2 levels? BFS has one obvious disadvantage here, so it may not be a good idea to use it by default: more memory usage. It needs to allocate heap memory, and this can be pretty large for leaves. At level 1, it will need to allocate at least nr_leaves * sizeof(bfs_entry) bytes. DFS, by comparison, only needs to recurse at most 8 levels (the maximum btrfs tree height), and all of its memory usage is function call stack memory. BFS only makes sense for my niche use case (comparing a tree reloc tree with its source). For the real-world use case the default DFS should work fine without all the memory allocation burden. So I still prefer to keep DFS as the default and only provide BFS as a niche option for weird guys like me. Thanks, Qu
Re: [PATCH] btrfs-progs: dump-tree: Introduce --breadth-first option
Re: [PATCH] btrfs-progs: dump-tree: Introduce --breadth-first option

On 08/23/2018 03:31 PM, Qu Wenruo wrote:
> By default dump-tree does depth-first search. For 2 level trees it's
> completely OK, but for 3 level trees, it would be pretty hard to locate
> output of middle level tree nodes.
>
> Introduce --breadth-first option to do breadth-first tree dump. This is
> especially handy to inspect high level trees, e.g. comparing tree reloc
> tree with its source tree.
>
> Signed-off-by: Qu Wenruo
> [...]

LGTM.

Reviewed-by: Su Yue
Re: [PATCH] btrfs-progs: dump-tree: Introduce --breadth-first option
On 23.08.2018 10:31, Qu Wenruo wrote:
> Introduce --breadth-first option to do breadth-first tree dump.
> This is especially handy to inspect high level trees, e.g. comparing
> tree reloc tree with its source tree.

Would it make sense, instead of exposing another option, to have a
heuristic check that switches to BFS if the tree is higher than, say,
2 levels?
[PATCH] btrfs-progs: dump-tree: Introduce --breadth-first option
By default dump-tree does depth-first search. For 2 level trees it's
completely OK, but for 3 level trees, it would be pretty hard to locate
output of middle level tree nodes.

Introduce --breadth-first option to do breadth-first tree dump. This is
especially handy to inspect high level trees, e.g. comparing tree reloc
tree with its source tree.

Signed-off-by: Qu Wenruo
---
 Documentation/btrfs-inspect-internal.asciidoc |  5 ++
 cmds-inspect-dump-tree.c                      | 26 ++++++++++++-------
 print-tree.c                                  | 89 +++++++++++++++++++
 print-tree.h                                  |  1 +
 4 files changed, 112 insertions(+), 9 deletions(-)

diff --git a/Documentation/btrfs-inspect-internal.asciidoc b/Documentation/btrfs-inspect-internal.asciidoc
index e2db64660b9a..5435455fe099 100644
--- a/Documentation/btrfs-inspect-internal.asciidoc
+++ b/Documentation/btrfs-inspect-internal.asciidoc
@@ -69,6 +69,9 @@ readable equivalents where possible. This is useful for analyzing filesystem
 state or inconsistencies and has a positive educational effect on
 understanding the internal filesystem structure.
 +
+By default *dump-tree* will dump the tree using depth-first search, this behavior
+can be changed by '--breadth-first' option.
++
 NOTE: contains file names, consider that if you're asked to send the dump
 for analysis. Does not contain file data.
 +
@@ -89,6 +92,8 @@ print only the uuid tree information, empty output if the tree does not exist
 print info of the specified block only
 --follow
 use with '-b', print all children tree blocks of ''
+--breadth-first
+use breadth-first search to dump trees instead of default depth-first search
 -t
 print only the tree with the specified ID, where the ID can be numerical
 or common name in a flexible human readable form

diff --git a/cmds-inspect-dump-tree.c b/cmds-inspect-dump-tree.c
index c8acd55a0c3a..ecda07d65ab6 100644
--- a/cmds-inspect-dump-tree.c
+++ b/cmds-inspect-dump-tree.c
@@ -200,6 +200,7 @@ const char * const cmd_inspect_dump_tree_usage[] = {
 	"-b|--block print info from the specified block only",
 	"-t|--tree print only tree with the given id (string or number)",
 	"--follow use with -b, to show all children tree blocks of ",
+	"--breadth-first do breadth-first search instead of default depth-first one",
 	NULL
 };
@@ -213,6 +214,7 @@ int cmd_inspect_dump_tree(int argc, char **argv)
 	struct extent_buffer *leaf;
 	struct btrfs_disk_key disk_key;
 	struct btrfs_key found_key;
+	void (*print_tree_func) (struct extent_buffer *, int follow);
 	char uuidbuf[BTRFS_UUID_UNPARSED_SIZE];
 	int ret;
 	int slot;
@@ -227,6 +229,7 @@ int cmd_inspect_dump_tree(int argc, char **argv)
 	u64 tree_id = 0;
 	bool follow = false;

+	print_tree_func = btrfs_print_tree;
 	/*
 	 * For debug-tree, we care nothing about extent tree (it's just backref
 	 * and usage accounting, only makes sense for RW operations).
@@ -238,7 +241,7 @@ int cmd_inspect_dump_tree(int argc, char **argv)
 	optind = 0;
 	while (1) {
 		int c;
-		enum { GETOPT_VAL_FOLLOW = 256 };
+		enum { GETOPT_VAL_FOLLOW = 256, GETOPT_VAL_BREADTH_FIRST };
 		static const struct option long_options[] = {
 			{ "extents", no_argument, NULL, 'e'},
 			{ "device", no_argument, NULL, 'd'},
@@ -248,6 +251,8 @@ int cmd_inspect_dump_tree(int argc, char **argv)
 			{ "block", required_argument, NULL, 'b'},
 			{ "tree", required_argument, NULL, 't'},
 			{ "follow", no_argument, NULL, GETOPT_VAL_FOLLOW },
+			{ "breadth-first", no_argument, NULL,
+				GETOPT_VAL_BREADTH_FIRST },
 			{ NULL, 0, NULL, 0 }
 		};
@@ -303,6 +308,9 @@ int cmd_inspect_dump_tree(int argc, char **argv)
 		case GETOPT_VAL_FOLLOW:
 			follow = true;
 			break;
+		case GETOPT_VAL_BREADTH_FIRST:
+			print_tree_func = btrfs_print_tree_breadth_first;
+			break;
 		default:
 			usage(cmd_inspect_dump_tree_usage);
 		}
@@ -341,7 +349,7 @@ int cmd_inspect_dump_tree(int argc, char **argv)
 			(unsigned long long)block_only);
 		goto close_root;
 	}
-	btrfs_print_tree(leaf, follow);
+	print_tree_func(leaf, follow);
 	free_extent_buffer(leaf);
 	goto close_root;
 }
@@ -368,17 +376,17 @@ int cmd_inspect_dump_tree(int argc, char **argv)
 	} else {
 		if (info->tree_root->node) {
 			printf("root tree\n");
-
[PATCH] btrfs-progs: print-tree: Skip deprecated blockptr / nodesize output
When printing tree nodes, we output slots like:

	key (EXTENT_TREE ROOT_ITEM 0) block 73625600 (17975) gen 16

The number in the parentheses is blockptr / nodesize. However this
number doesn't really do anything useful. And in fact for unaligned
metadata block groups (block group start bytenr is not aligned to 16K),
the number doesn't even make sense, as it's rounded down.

In fact the kernel doesn't ever output such a divided result in its
print-tree.c.

Remove it so later readers won't wonder what the number means.

Signed-off-by: Qu Wenruo
---
 print-tree.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/print-tree.c b/print-tree.c
index a09ecfbb28f0..31f6fa12522f 100644
--- a/print-tree.c
+++ b/print-tree.c
@@ -1420,9 +1420,8 @@ void btrfs_print_tree(struct extent_buffer *eb, int follow)
 		btrfs_disk_key_to_cpu(&key, &disk_key);
 		printf("\t");
 		btrfs_print_key(&disk_key);
-		printf(" block %llu (%llu) gen %llu\n",
+		printf(" block %llu gen %llu\n",
 		       (unsigned long long)blocknr,
-		       (unsigned long long)blocknr / eb->len,
 		       (unsigned long long)btrfs_node_ptr_generation(eb, i));
 		fflush(stdout);
 	}
--
2.18.0