Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)
On Fri, Sep 21, 2018 at 12:59:31PM +1000, Dave Chinner wrote:
> On Wed, Sep 19, 2018 at 12:12:03AM -0400, Zygo Blaxell wrote:
[...]
> With no DMAPI in the future, people with custom HSM-like interfaces
> based on dmapi are starting to turn to fanotify and friends to
> provide them with the change notifications they require

I had a fanotify-based scanner once, before I noticed btrfs effectively had timestamps all over its metadata. fanotify won't tell me which parts of a file were modified (unless it got that feature in the last few years?). fanotify was pretty useless when the only file on the system being modified was a 13TB VM image, or even a little 16GB one: it has to scan the whole file to find the one new byte. Even on desktops the poor thing spends most of its time looping over /var/log/messages. It was sad.

If fanotify gave me (inode, offset, length) tuples of dirty pages in cache, I could look them up and use a dedupe_file_range call to replace the dirty pages with a reference to an existing disk block. If my listener can do that fast enough, it's in-band dedupe; if it can't, the data gets flushed to disk as normal, and I fall back to a scan of the filesystem to clean it up later.

> > > e.g. a soft requirement is that we need to scan the entire fs at
> > > least once a month.
> >
> > I have to scan and dedupe multiple times per hour. OK, the first-ever
> > scan of a non-empty filesystem is allowed to take much longer, but after
> > that, if you have enough spare iops for continuous autodefrag you should
> > also have spare iops for continuous dedupe.
>
> Yup, but using notifications avoids the need for even these scans -
> you'd know exactly what data has changed, when it changed, and know
> exactly what you needed to read to calculate the new hashes.

...if the scanner can keep up with the notifications; otherwise, the notification receiver has to log them somewhere for the scanner to catch up.
If there are missed or dropped notifications (or 23 hours a day when we're not listening for notifications because we only have a one-hour daily maintenance window), some kind of filesystem scan has to be done after the fact anyway.

> > > A simple piece-wise per-AG scanning algorithm (like we use in
> > > xfs_repair) could easily work within a 3GB RAM per AG constraint and
> > > would scale very well. We'd only need to scan 30-40 AGs in the hour,
> > > and a single AG at 1GB/s will only take 2 minutes to scan. We can
> > > then do the processing while the next AG gets scanned. If we've got
> > > 10-20GB RAM to use (and who doesn't when they have 1PB of storage?)
> > > then we can scan 5-10 AGs at once to keep the IO rate up, and process
> > > them in bulk as we scan more.
> >
> > How do you match dupe blocks from different AGs if you only keep RAM for
> > the duration of one AG scan? Do you not dedupe across AG boundaries?
>
> We could, but do we need to? There's a heap of runtime considerations
> at the filesystem level we need to take into consideration here, and
> there's every chance that too much consolidation creates
> unpredictable bottlenecks in overwrite workloads that need to break
> the sharing (i.e. COW operations).

I'm well aware of that. I have a bunch of hacks in bees to not be too efficient, lest it push the btrfs reflink bottlenecks too far.

> e.g. An AG contains up to 1TB of data, which is more than enough to
> get decent AG-internal dedupe rates. If we've got 1PB of data spread
> across 1000 AGs, deduping a million copies of a common data pattern
> spread across the entire filesystem down to one per AG (i.e. 10^6
> copies down to 10^3) still gives a massive space saving.

That's true for 1000+ AG filesystems, but it's a bigger problem for filesystems of 2-5 AGs, where each AG holds one copy of 20-50% of the duplicates on the filesystem. OTOH, a filesystem that small could just be done in one pass with a larger but still reasonable amount of RAM.
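For concreteness, the dedupe_file_range call mentioned earlier is exposed to userspace as the FIDEDUPERANGE ioctl, whose argument is a fixed header followed by an array of destination records. Here is a sketch of packing that argument buffer in Python; the struct layout follows ioctl_fideduperange(2), and the fd/offset values are made up for illustration:

```python
import struct

# struct file_dedupe_range header: src_offset (u64), src_length (u64),
# dest_count (u16), reserved1 (u16), reserved2 (u32)
DEDUPE_RANGE = struct.Struct("=QQHHI")
# struct file_dedupe_range_info: dest_fd (s64), dest_offset (u64),
# bytes_deduped (u64), status (s32), reserved (u32)
DEDUPE_INFO = struct.Struct("=qQQiI")

def dedupe_args(src_offset, length, dests):
    """Pack a FIDEDUPERANGE argument buffer for (dest_fd, dest_offset) pairs."""
    buf = DEDUPE_RANGE.pack(src_offset, length, len(dests), 0, 0)
    for fd, off in dests:
        buf += DEDUPE_INFO.pack(fd, off, 0, 0, 0)
    return buf

# The buffer would be passed to fcntl.ioctl(src_fd, FIDEDUPERANGE, buf, True);
# on return, each record's bytes_deduped and status fields report the outcome.
buf = dedupe_args(0, 131072, [(4, 0), (5, 131072)])
print(len(buf))  # 24-byte header + 2 x 32-byte destination records = 88
```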
> > What you've described so far means the scope isn't limited anyway. If the > > call is used to dedupe two heavily-reflinked extents together (e.g. > > both duplicate copies are each shared by thousands of snapshots that > > have been created during the month-long period between dedupe runs), > > it could always be stuck doing a lot of work updating dst owners. > > Was there an omitted detail there? > > As I said early in the discussion - if both copies of identical data > are already shared hundreds or thousands of times each, then it > makes no sense to dedupe them again. All that does is create huge > amounts of work updating metadata for very little additional gain. I've had a user complain about the existing 2560-reflink limit in bees, because they were starting with 3000
Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)
On Mon, Sep 10, 2018 at 07:06:46PM +1000, Dave Chinner wrote:
> On Thu, Sep 06, 2018 at 11:53:06PM -0400, Zygo Blaxell wrote:
> > On Thu, Sep 06, 2018 at 06:38:09PM +1000, Dave Chinner wrote:
> > > On Fri, Aug 31, 2018 at 01:10:45AM -0400, Zygo Blaxell wrote:
> > > > On Thu, Aug 30, 2018 at 04:27:43PM +1000, Dave Chinner wrote:
> > > > > On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote:
> > > > For future development I've abandoned the entire dedupe_file_range
> > > > approach. I need to be able to read and dedupe the data blocks of
> > > > the filesystem directly without having to deal with details like which
> > > > files those blocks belong to, especially on filesystems with lots of
> > > > existing deduped blocks and snapshots.
> > >
> > > IOWs, your desired OOB dedupe algorithm is:
> > >
> > > a) ask the filesystem where all its file data is
> >
> > Actually, it's "ask the filesystem where all the *new* file data is",
> > since we don't want to read any unique data twice on subsequent runs.
>
> Sorry, how do you read "unique data" twice? By definition, unique
> data only occurs once

...but once it has been read, we don't want to read it again. Ever. Even better would be to read unique data less than 1.0 times on average.

> Oh, and you still need to hash the old data so you can find
> collisions with the new data that got written. Unless, of course,
> you are keeping your hash tree in a persistent database

I do that.

> and can work out how to prune stale entries out of it efficiently

I did that first. Well, more like I found that even a bad algorithm can still find most of the duplicate data in a typical filesystem, and there's a steep diminishing-returns curve the closer you get to 100% efficiency. So I just used a bad algorithm (random drop with a bias toward keeping hashes that matched duplicate blocks). There's room to improve that, but the possible gains are small, so it's at least #5 on the performance whack-a-mole list and probably lower.
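The "bad algorithm" described above - random drop with a bias toward keeping hashes that have matched duplicate blocks - might look roughly like this. This is a toy model for illustration, not the actual bees hash table:

```python
import random

class HashTable:
    """Fixed-capacity hash -> block map with biased random eviction."""
    def __init__(self, capacity, keep_bias=0.9):
        self.capacity = capacity
        self.keep_bias = keep_bias  # probability of sparing a proven-duplicate entry
        self.entries = {}           # hash -> (block_addr, matched_before)

    def insert(self, h, addr):
        """Record a block hash; return the prior address on a duplicate hit."""
        if h in self.entries:
            # Hash seen before: a duplicate match, so mark the entry valuable.
            self.entries[h] = (self.entries[h][0], True)
            return self.entries[h][0]
        while len(self.entries) >= self.capacity:
            victim = random.choice(list(self.entries))
            matched = self.entries[victim][1]
            # Bias: entries that already matched duplicates usually survive.
            if not matched or random.random() > self.keep_bias:
                del self.entries[victim]
        self.entries[h] = (addr, False)
        return None
```

Under this policy each full-filesystem sweep retains a mostly-random subset of hashes, which is why two sweeps in different orders can recover most of the match rate lost to a smaller table, as described below.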
The randomness means each full-filesystem sweep finds a different subset of duplicates, so I can arbitrarily cut hash table size in half and get almost all of the match rate back by doing two full scans. Or I cut the filesystem up into a few large pieces and feed the pieces through in different orders on different scan runs, so different subsets of data in the hash table meet different subsets of data on disk during each scan. An early prototype of bees worked that way, but single-digit efficiency gains were not worth doubling iops, so I stopped. > [...]I thought that "details omitted for > reasons of brevity" would be understood, not require omitted details > to be explained to me. Sorry. I don't know what you already know. > > Bees also operates under a constant-RAM constraint, so it doesn't operate > > in two distinct "collect data" and "act on data collected" passes, > > and cannot spend memory to store data about more than a few extents at > > any time. > > I suspect that I'm thinking at a completely different scale to you. > I don't really care for highly constrained or optimal dedupe > algorithms because those last few dedupe percentages really don't > matter that much to me. At large scales RAM is always constrained. It's the dedupe triangle of RAM, iops, and match hit rate--any improvement in one comes at the cost of the others. Any dedupe can go faster or use less RAM by raising the block size or partitioning the input data set to make it smaller. bees RAM usage is a bit more explicitly controlled--the admin tells bees how much RAM to use, and bees scales the other parameters to fit that. Other dedupe engines make the admin do math to set parameters to avoid overflowing RAM with dynamic memory allocations, or leave the admin to discover what their RAM constraint is the hard way. One big difference I am noticing in our approaches is latency. 
ZFS (and in-kernel btrfs dedupe) provides minimal dedupe latency (duplicate data occupies disk space for zero time as it is never written to disk at all) but it requires more RAM for a given dedupe hit rate than any other dedupe implementation I've seen. What you've written tells me XFS saves RAM by partitioning the data and relying on an existing but very large source of iops (sharing scrub reads with dedupe), but then the dedupe latency is the same as the scrub interval (the worst so far). bees aims to have latency of a few minutes (ideally scanning data while it's still dirty in cache, but there's no good userspace API for that) though it's obviously not there yet. > I care much more about using all the > resources we can and running as fast as we possibly can, then > pr
Re: dduper - Offline btrfs deduplication tool
On Fri, Sep 07, 2018 at 09:27:28AM +0530, Lakshmipathi.G wrote:
> > One question:
> > Why not ioctl_fideduperange?
> > i.e. you kill most of the benefits from that ioctl - atomicity.
>
> I plan to add fideduperange as an option too. User can
> choose between fideduperange and ficlonerange call.
>
> If I'm not wrong, with fideduperange, kernel performs
> comparison check before dedupe. And it will increase
> time to dedupe files.

Creating the backup reflink file takes far more time than you will ever save by avoiding fideduperange's comparison. You don't need the md5sum either, unless you have a data set that is full of crc32 collisions (e.g. a file format that puts a CRC32 at the end of each 4K block). The few people who have such a data set can enable md5sums; everyone else can have md5sums disabled by default.

> I believe the risk involved with ficlonerange is minimized
> by having a backup file (reflinked). We can revert to the older
> original file, if we encounter some problems.

With fideduperange the risk is more than minimized - it's completely eliminated. If you don't use fideduperange, you can't use the tool on a live data set at all.
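The crc32-first, md5-only-if-asked design discussed above amounts to something like the following sketch. This is a toy model over in-memory buffers; dduper itself reportedly pulls crc32 sums from the btrfs csum tree rather than hashing file data:

```python
import hashlib
import zlib

BLOCK = 4096

def duplicate_block_pairs(data_a, data_b, verify_md5=False):
    """Find block-aligned duplicates between two buffers: cheap crc32 match
    first, optional md5 confirmation for data sets prone to crc32 collisions."""
    def blocks(data):
        return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    index = {}
    for i, blk in enumerate(blocks(data_a)):
        index.setdefault(zlib.crc32(blk), []).append((i, blk))
    pairs = []
    for j, blk in enumerate(blocks(data_b)):
        for i, candidate in index.get(zlib.crc32(blk), []):
            if not verify_md5 or hashlib.md5(candidate).digest() == hashlib.md5(blk).digest():
                pairs.append((i, j))
    return pairs

# Example: one duplicate 4K block shared between two buffers
left = b"x" * BLOCK + b"y" * BLOCK
right = b"y" * BLOCK + b"z" * BLOCK
print(duplicate_block_pairs(left, right))  # [(1, 0)]
```

Note that when the candidates are then fed to FIDEDUPERANGE, the kernel performs its own byte comparison anyway, which is why the md5 confirmation is redundant in that mode.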
Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)
On Thu, Aug 30, 2018 at 04:27:43PM +1000, Dave Chinner wrote: > On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote: > > On Mon, Aug 20, 2018 at 08:33:49AM -0700, Darrick J. Wong wrote: > > > On Mon, Aug 20, 2018 at 11:09:32AM +1000, Dave Chinner wrote: > > > > - is documenting rejection on request alignment grounds > > > > (i.e. EINVAL) in the man page sufficient for app > > > > developers to understand what is going on here? > > > > > > I think so. The manpage says: "The filesystem does not support > > > reflinking the ranges of the given files", which (to my mind) covers > > > this case of not supporting dedupe of EOF blocks. > > > > Older versions of btrfs dedupe (before v4.2 or so) used to do exactly > > this; however, on btrfs, not supporting dedupe of EOF blocks means small > > files (one extent) cannot be deduped at all, because the EOF block holds > > a reference to the entire dst extent. If a dedupe app doesn't go all the > > way to EOF on btrfs, then it should not attempt to dedupe any part of the > > last extent of the file as the benefit would be zero or slightly negative. > > That's a filesystem implementation issue, not an API or application > issue. The API and application issue remains even if btrfs is not considered. btrfs is just the worst case outcome. Other filesystems still have fragmentation issues, and applications have efficiency-vs-capability tradeoffs to make if they can't rely on dedupe-to-EOF being available. Tools like 'cp --reflink=auto' work by trying the best case, then falling back to a second choice if the first choice returns an error. If the second choice fails too, the surprising behavior can make inattentive users lose data. > > The app developer would need to be aware that such a restriction could > > exist on some filesystems, and be able to distinguish this from other > > cases that could lead to EINVAL. 
> > Portable code would have to try a dedupe up to EOF, then if that
> > failed, round down and retry, and if that failed too, the app would
> > have to figure out which filesystem it's running on to know what to
> > do next. Performance demands the app know what the FS will do in
> > advance, and avoid a whole class of behavior.
>
> Nobody writes "portable" applications like that.

As an app developer, and having studied other applications' revision histories, and having followed IRC and mailing list conversations involving other developers writing these applications, I can assure you that is _exactly_ how portable applications get written around the dedupe function.

Usually people start from experience with tools that use hardlinks to implement dedupe, so the developer's mental model starts with deduping entire files. Their first attempt does this:

	stat(fd, &st);
	dedupe( ..., src_offset = 0, dst_offset = 0, length = st.st_size);

then subsequent revisions of their code cope with limits on length, and then deal with EINVAL on odd lengths, because those are the problems encountered as the code runs for the first time on an expanding set of filesystems. After that, they deal with implementation-specific performance issues.

Other app developers start by ignoring incomplete blocks, then compare their free-space-vs-time graphs with other dedupe apps on the same filesystem, then either adapt to handle EOF properly, or just accept being uncompetitive.

> They read the man
> page first, and work out what the common subset of functionality is
> and then code from that.
>
> Man page says:
>
> "Disk filesystems generally require the offset and length arguments
> to be aligned to the fundamental block size."
>
> IOWs, compatible code starts with supporting the general case,
> i.e. a range rounded to filesystem block boundaries (it's already
> run fstat() on the files it wants to dedupe to find their size,
> yes?), hence ignoring the partial EOF block.
> Will just work on everything.

It will also cause a significant time/space performance hit. EOFs are everywhere, and they have a higher-than-average duplication rate for their size. If an application assumes EOF can't be deduped on every filesystem, then it leaves a non-trivial amount of free space unrecovered on filesystems that can dedupe EOF. It also necessarily increases fragmentation, unless the filesystem implements file tails (which keep fragmentation constant, since the tail won't be stored contiguously in any case).

> Code that then wants to optimise for btrfs/xfs/ocfs quirks runs
> fstatvfs to determine what fs it's operating on and applies the
> necessary quirks. For btrfs it can extend the range to include the
> partial EOF block, and hence will handle the implem
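The portable pattern described above - try the full length to EOF first (which btrfs needs to free the last extent), then fall back to a block-aligned length on EINVAL - might be sketched like this. BLOCK is an assumed 4K filesystem block size, and do_dedupe is a hypothetical wrapper around the dedupe ioctl, injected so the fallback logic is visible:

```python
import errno
import os

BLOCK = 4096  # assumed filesystem block size

def round_down(n, block=BLOCK):
    """Round n down to a block boundary."""
    return n - (n % block)

def dedupe_whole_file(src_fd, dst_fd, do_dedupe):
    """Try deduping the whole file including the EOF block; on EINVAL,
    retry with a block-aligned length, ignoring the partial EOF block."""
    size = os.fstat(src_fd).st_size
    try:
        return do_dedupe(src_fd, 0, dst_fd, 0, size)  # whole file, incl. EOF block
    except OSError as e:
        if e.errno != errno.EINVAL:
            raise
    aligned = round_down(size)
    if aligned == 0:
        return 0                                      # nothing block-aligned to dedupe
    return do_dedupe(src_fd, 0, dst_fd, 0, aligned)   # retry without the EOF tail
```

The third step described above (figuring out which filesystem it's running on) would hang off the second failure; it is omitted here.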
Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)
On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote: > On Mon, Aug 20, 2018 at 08:33:49AM -0700, Darrick J. Wong wrote: > > On Mon, Aug 20, 2018 at 11:09:32AM +1000, Dave Chinner wrote: > > > - should we just round down the EOF dedupe request to the > > > block before EOF so dedupe still succeeds? > > > > I've often wondered if the interface should (have) be(en) that we start > > at src_off/dst_off and share as many common blocks as possible until we > > find a mismatch, then tell userspace where we stopped... instead of like > > now where we compare the entire extent and fail if any part of it > > doesn't match. > > The usefulness or harmfulness of that approach depends a lot on what > the application expects the filesystem to do. Here are some concrete examples. In the following, letters are 4K disk blocks and also inode offsets (i.e. "A" means a block containing 4096 x "A" located at inode offset 0, "B" contains "B" located at inode offset 1, etc). "|" indicates a physical discontinuity of the blocks on disk. Lowercase "a" has identical content to uppercase "A", but they are located in different physical blocks on disk. 
Suppose you have two identical files with different write histories, so they have different on-disk layouts:

	Inode 1: ABCDEFGH|IJ|KL|M|N|O|PQ|RST|UV|WX|YZ
	Inode 2: a|b|c|d|e|f|g|hijklmnopqrstuvwxyz

A naive dedupe app might pick src and dst at random, and do this:

	// dedupe(length, src_ino, src_off, dst_ino, dst_off)
	dedupe(length 26, Inode 1, Offset 0, Inode 2, Offset 0)

with the result having 11 fragments in each file, all from the original Inode 1:

	Inode 1: ABCDEFGH|IJ|KL|M|N|O|PQ|RST|UV|WX|YZ
	Inode 2: ABCDEFGH|IJ|KL|M|N|O|PQ|RST|UV|WX|YZ

A smarter dedupe app might choose src and dst based on logical proximity and/or physical seek distance, or the app might choose the dst with the smallest number of existing references in the filesystem, or the app might simply choose the longest available src extents to minimize fragmentation:

	dedupe(length 7, Inode 1, Offset 0, Inode 2, Offset 0)
	dedupe(length 19, Inode 2, Offset 7, Inode 1, Offset 7)

with the result having 2 fragments in each file, each chosen from a different original inode:

	Inode 1: ABCDEFG|hijklmnopqrstuvwxyz
	Inode 2: ABCDEFG|hijklmnopqrstuvwxyz

If the kernel continued past the 'length 7' size specified in the first dedupe, then the 'hijklmnopqrstuvwxyz' would be *lost*, and the second dedupe would be an expensive no-op because both Inode 1 and Inode 2 refer to the same physical blocks:

	Inode 1: ABCDEFGH|IJ|KL|M|N|O|PQ|RST|UV|WX|YZ
	         [---] - app asked for this
	Inode 2: ABCDEFGH|IJ|KL|M|N|O|PQ|RST|UV|WX|YZ
	         kernel does this too - [-]

and "hijklmnopqrstuvwxyz" no longer exists for the second dedupe.

A dedupe app willing to spend more on IO can create its own better src with only one fragment:

	open(with O_TMPFILE) -> Inode 3
	copy(length 7, Inode 1, Offset 0, Inode 3, Offset 0)
	copy(length 19, Inode 2, Offset 7, Inode 3, Offset 7)
	dedupe(length 26, Inode 3, Offset 0, Inode 1, Offset 0)
	dedupe(length 26, Inode 3, Offset 0, Inode 2, Offset 0)
	close(Inode 3)

Now there is just one fragment referenced from two places:

	Inode 1: 
αβξδεφγηιςκλμνοπθρστυвшχψζ Inode 2: αβξδεφγηιςκλμνοπθρστυвшχψζ [If encoding goes horribly wrong, the above are a-z transcoded as a mix of Greek and Cyrillic Unicode characters.] Real filesystems sometimes present thousands of possible dedupe src/dst permutations to choose from. The kernel shouldn't be trying to second-guess an application that may have access to external information to make better decisions (e.g. the full set of src extents available, or knowledge of other calls the app will issue in the future). > In btrfs, the dedupe operation acts on references to data, not the > underlying data blocks. If there are 1000 slightly overlapping references > to a single contiguous range of data blocks in dst on disk, each dedupe > operation acts on only one of those, leaving the other 999 untouched. > If the app then submits 999 other dedupe requests, no references to the > dst blocks remain and the underlying data blocks can be deleted. > > In a parallel universe (or a better filesystem, or a userspace emulation > built out of dedupe and other ioctls), dedupe could work at the extent > data (physical) level. The app points at src and dst extent references > (inode/offset/length tuples), and the filesystem figures out which > physical blocks these point to, then adjusts all the references to the > dst blocks at once, dealing with partial overlaps and
Re: Reproducer for "compressed data + hole data corruption bug, 2018 edition"
On Thu, Aug 23, 2018 at 01:10:48PM +0800, Qu Wenruo wrote:
> On 2018/8/23 上午11:11, Zygo Blaxell wrote:
> > This is a repro script for a btrfs bug that causes corrupted data reads
> > when reading a mix of compressed extents and holes. The bug is
> > reproducible on at least kernels v4.1..v4.18.
>
> This bug already sounds more serious than the previous nodatasum +
> compression bug.

Maybe. "compression + holes corruption bug 2017" could be avoided with the max-inline=0 mount option without disabling compression. This time, the workaround is more intrusive: avoid all applications that use dedupe or hole-punching.

> > Some more observations and background follow, but first here is the
> > script and some sample output:
> >
> > root@rescue:/test# cat repro-hole-corruption-test
> > #!/bin/bash
> >
> > # Write a 4096 byte block of something
> > block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; }
> >
> > # Here is some test data with holes in it:
> > for y in $(seq 0 100); do
> >     for x in 0 1; do
> >         block 0;
> >         block 21;
> >         block 0;
> >         block 22;
> >         block 0;
> >         block 0;
> >         block 43;
> >         block 44;
> >         block 0;
> >         block 0;
> >         block 61;
> >         block 62;
> >         block 63;
> >         block 64;
> >         block 65;
> >         block 66;
> >     done
>
> Does the content make any difference to this bug?
> It's just 16 * 4K * 2 * 101 data writes *without* any hole so far.

The content of the extents doesn't seem to matter, other than that it needs to be compressible so that the extents on disk are compressed. The bug is also triggered by writing non-zero data to all blocks, and then punching the holes later with "fallocate -p -l 4096 -o $(( insert math here ))".

The layout of the extents matters a lot. I have to loop hundreds or thousands of times to hit the bug if the first block in the pattern is not a hole, or if the non-hole extents are different sizes or positions than above. I tried random patterns of holes and extent refs, and most of them have an order of magnitude lower hit rates than the above.
This might be due to some relationship between the alignment of read() request boundaries and extent boundaries, but I haven't done any tests designed to detect such a relationship.

In the wild, corruption happens on some files much more often than others. This seems to be correlated with the extent layout as well. I discovered the bug by examining files that were intermittently but repeatedly failing routine data integrity checks, and found that in every case they had similar hole + extent patterns near the point where data was corrupted.

I did a search on some big filesystems for the hole-refExtentA-hole-refExtentA pattern and found several files with this pattern that had passed previous data integrity checks, but would fail randomly in the sha1sum/drop-caches loop.

> This should indeed cause 101 128K compressed data extents.
> But I'm wondering about the description of 'holes'.

The holes are coming, wait for it... ;)

> >     done
> > done > am
> >
> > sync
> >
> > # Now replace those 101 distinct extents with 101 references to the first extent
> > btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail
>
> Will this bug still happen by creating one extent and then reflinking it
> 101 times?

Yes. I used btrfs-extent-same because a binary is included in the Debian duperemove package, but I use it only for convenience. It's not necessary to have hundreds of references to the same extent - even two refs to a single extent plus a hole can trigger the bug sometimes. 100 references in a single file will trigger the bug so often that it can be detected within the first 20 sha1sum loops.

When the corruption occurs, it affects around 90 of the original 101 extents. The different sha1sum results are due to different extents giving bad data on different runs.

> > # Punch holes into the extent refs
> > fallocate -v -d am
>
> Hole-punch in fact happens here.
>
> BTW, will adding a "sync" here change the result?

No.
You can reboot the machine here if you like, it does not change anything that happens during reads later. Looking at the extent tree in btrfs-debug-tree, the data on disk looks correct, and btrfs does read it correctly most of the time (the correct sha1sum below is 6926a34e0ab
Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)
On Mon, Aug 20, 2018 at 08:33:49AM -0700, Darrick J. Wong wrote: > On Mon, Aug 20, 2018 at 11:09:32AM +1000, Dave Chinner wrote: > > - is documenting rejection on request alignment grounds > > (i.e. EINVAL) in the man page sufficient for app > > developers to understand what is going on here? > > I think so. The manpage says: "The filesystem does not support > reflinking the ranges of the given files", which (to my mind) covers > this case of not supporting dedupe of EOF blocks. Older versions of btrfs dedupe (before v4.2 or so) used to do exactly this; however, on btrfs, not supporting dedupe of EOF blocks means small files (one extent) cannot be deduped at all, because the EOF block holds a reference to the entire dst extent. If a dedupe app doesn't go all the way to EOF on btrfs, then it should not attempt to dedupe any part of the last extent of the file as the benefit would be zero or slightly negative. The app developer would need to be aware that such a restriction could exist on some filesystems, and be able to distinguish this from other cases that could lead to EINVAL. Portable code would have to try a dedupe up to EOF, then if that failed, round down and retry, and if that failed too, the app would have to figure out which filesystem it's running on to know what to do next. Performance demands the app know what the FS will do in advance, and avoid a whole class of behavior. btrfs dedupe reports success if the src extent is inline and the same size as the dst extent (i.e. file is smaller than one page). No dedupe can occur in such cases--a clone results in a simple copy, so the best a dedupe could do would be a no-op. Returning EINVAL there would break a few popular tools like "cp --reflink". Returning OK but doing nothing seems to be the best option in that case. > > - should we just round down the EOF dedupe request to the > > block before EOF so dedupe still succeeds? 
> > I've often wondered if the interface should (have) be(en) that we start > at src_off/dst_off and share as many common blocks as possible until we > find a mismatch, then tell userspace where we stopped... instead of like > now where we compare the entire extent and fail if any part of it > doesn't match. The usefulness or harmfulness of that approach depends a lot on what the application expects the filesystem to do. In btrfs, the dedupe operation acts on references to data, not the underlying data blocks. If there are 1000 slightly overlapping references to a single contiguous range of data blocks in dst on disk, each dedupe operation acts on only one of those, leaving the other 999 untouched. If the app then submits 999 other dedupe requests, no references to the dst blocks remain and the underlying data blocks can be deleted. In a parallel universe (or a better filesystem, or a userspace emulation built out of dedupe and other ioctls), dedupe could work at the extent data (physical) level. The app points at src and dst extent references (inode/offset/length tuples), and the filesystem figures out which physical blocks these point to, then adjusts all the references to the dst blocks at once, dealing with partial overlaps and snapshots and nodatacow and whatever other exotic features might be lurking in the filesystem, ending with every reference to every part of dst replaced by the longest possible contiguous reference(s) to src. Problems arise if the length deduped is not exactly the length requested. If the search continues until a mismatch is found, where does the search for a mismatch lead? Does the search follow physically contiguous blocks on disk, or would dedupe follow logically contiguous blocks in the src and dst files? Or the intersection of those, i.e. physically contiguous blocks that are logically contiguous in _any_ two files, not limited to src and dst. 
There is also the problem where the files could have been previously deduped and then partially overwritten with identical data. If the application cannot control where the dedupe search for identical data ends, it can end up accidentally creating new references to extents while it is trying to eliminate those extents. The kernel might do a lot of extra work from looking ahead that the application has to undo immediately (e.g. after the first few blocks of dst, the app wants to do another dedupe with a better src extent elsewhere on the filesystem, but the kernel goes ahead and dedupes with an inferior src beyond the end of what the app asked for). bees tries to determine exactly the set of dedupe requests required to remove all references to duplicate extents (and maybe someday do defrag as well). If the kernel deviates from the requested sizes (e.g. because the data changed on the filesystem between dedup requests), the final extent layout after the dedupe requests are finished won't match what bees expected it to be, so bees has to reexamine the filesystem and either retry with a fresh set of exact
Reproducer for "compressed data + hole data corruption bug, 2018 edition"
This is a repro script for a btrfs bug that causes corrupted data reads when reading a mix of compressed extents and holes. The bug is reproducible on at least kernels v4.1..v4.18.

Some more observations and background follow, but first here is the script and some sample output:

	root@rescue:/test# cat repro-hole-corruption-test
	#!/bin/bash

	# Write a 4096 byte block of something
	block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; }

	# Here is some test data with holes in it:
	for y in $(seq 0 100); do
		for x in 0 1; do
			block 0;
			block 21;
			block 0;
			block 22;
			block 0;
			block 0;
			block 43;
			block 44;
			block 0;
			block 0;
			block 61;
			block 62;
			block 63;
			block 64;
			block 65;
			block 66;
		done
	done > am
	sync

	# Now replace those 101 distinct extents with 101 references to the first extent
	btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 131072)); done) 2>&1 | tail

	# Punch holes into the extent refs
	fallocate -v -d am

	# Do some other stuff on the machine while this runs, and watch the sha1sums change!
	while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 1; done

	root@rescue:/test# ./repro-hole-corruption-test
	i: 91, status: 0, bytes_deduped: 131072
	i: 92, status: 0, bytes_deduped: 131072
	i: 93, status: 0, bytes_deduped: 131072
	i: 94, status: 0, bytes_deduped: 131072
	i: 95, status: 0, bytes_deduped: 131072
	i: 96, status: 0, bytes_deduped: 131072
	i: 97, status: 0, bytes_deduped: 131072
	i: 98, status: 0, bytes_deduped: 131072
	i: 99, status: 0, bytes_deduped: 131072
	13107200 total bytes deduped in this operation
	am: 4.8 MiB (4964352 bytes) converted to sparse holes.
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	072a152355788c767b97e4e4c0e4567720988b84 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	bf00d862c6ad436a1be2be606a8ab88d22166b89 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	0d44cdf030fb149e103cfdc164da3da2b7474c17 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	60831f0e7ffe4b49722612c18685c09f4583b1df am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	a19662b294a3ccdf35dbb18fdd72c62018526d7d am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	6926a34e0ab3e0a023e8ea85a650f5b4217acab4 am
	^C

Corruption occurs most often when there is a sequence like this in a file:

	ref 1: hole
	ref 2: extent A, offset 0
	ref 3: hole
	ref 4: extent A, offset 8192

This scenario typically arises due to hole-punching or deduplication.
Hole-punching replaces one extent ref with two references to the same extent
with a hole between them, so:

	ref 1: extent A, offset 0, length 16384

becomes:

	ref 1: extent A, offset 0, length 4096
	ref 2: hole, length 8192
	ref 3: extent A, offset 12288, length 4096

Deduplication replaces two distinct extent refs surrounding a hole with two
references to one of the duplicate extents, turning this:

	ref 1: extent A, offset 0, length 4096
	ref 2: hole, length 8192
	ref 3: extent B, offset 0, length 4096

into this:

	ref 1: extent A, offset 0, length 4096
	ref 2: hole, length 8192
	ref 3: extent A, offset 0, length 4096

Compression is required (zlib, zstd, or lzo) for corruption to occur.
I have not been able to reproduce the issue with an uncompressed extent,
nor have I observed any such corruption in the wild. The presence or
absence of the no-holes filesystem feature has no effect. Ordinary writes
can lead to pairs of extent references to
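As an aside, the sha1sum-in-a-loop check at the end of the repro script generalizes to a small harness. This is my own sketch, not part of the original repro; it assumes only POSIX sh and coreutils sha1sum, and it simply re-reads a file several times and flags a checksum that changes between reads:

```shell
#!/bin/sh
# Sketch: re-read a file N times and report whether its sha1 stays stable.
# On an affected btrfs kernel, dropping caches between reads (root only)
# is what makes the hole corruption visible.
check_stable() {
    f=$1 n=${2:-5}
    first=$(sha1sum "$f" | cut -d' ' -f1)
    i=1
    while [ "$i" -le "$n" ]; do
        # sysctl -q vm.drop_caches=3   # uncomment when running as root on btrfs
        cur=$(sha1sum "$f" | cut -d' ' -f1)
        if [ "$cur" != "$first" ]; then
            echo "changed on pass $i"
            return 1
        fi
        i=$((i + 1))
    done
    echo "stable"
}
```

On a healthy filesystem this prints "stable"; run against the `am` file from the repro script on an affected kernel, with the drop_caches line uncommented, it should report a change within a few passes.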
Deadlock between dedup and rename
Every month or two I hit a btrfs deadlock like this: dedup and rsync are
both operating on the same file when the filesystem locks up. The deadlock
happens at the moment when rsync renames its temporary file (the dedup dst
file) to replace the old version of the file (the dedup src file).

Dedup ended up stuck with this stack trace:

	[] call_rwsem_down_write_failed+0x13/0x20
	[] down_write_nested+0x87/0xb0
	[] btrfs_dedupe_file_range+0xdc/0x5f0
	[] vfs_dedupe_file_range+0x210/0x240
	[] do_vfs_ioctl+0x236/0x6b0
	[] SyS_ioctl+0x76/0x90
	[] do_syscall_64+0x70/0x190
	[] entry_SYSCALL_64_after_hwframe+0x42/0xb7
	[] 0x

and rsync ended up stuck with this stack trace:

	[] call_rwsem_down_write_failed+0x13/0x20
	[] down_write_nested+0x87/0xb0
	[] vfs_rename+0x18e/0x8c0
	[] SyS_renameat2+0x4ce/0x520
	[] do_syscall_64+0x70/0x190
	[] entry_SYSCALL_64_after_hwframe+0x42/0xb7
	[] 0x

The file in question was somewhat large (>4GB), so there was probably some
dirty page flushing going on in the background, which may or may not matter
for reproducing the bug. This is a fairly common occurrence when rsyncing
large files while bees is running, as the rsync temporary file is often a
copy of its own previous version, and bees will start deduplicating at the
head of the temporary file before rsync finishes writing at the tail end.
Re: List of known BTRFS Raid 5/6 Bugs?
>    /dev/sda   3.19MiB
>    /dev/sdb   3.19MiB
>    /dev/sdc   3.19MiB
>    /dev/sdd   3.19MiB
>    /dev/sde   3.19MiB
>
> Unallocated:
>    /dev/sda   5.63TiB
>    /dev/sdb   5.63TiB
>    /dev/sdc   5.63TiB
>    /dev/sdd   5.63TiB
>    /dev/sde   5.63TiB
> menion@Menionubuntu:~$
> menion@Menionubuntu:~$ sf -h
> The program 'sf' is currently not installed. You can install it by typing:
> sudo apt install ruby-sprite-factory
> menion@Menionubuntu:~$ df -h
> Filesystem      Size  Used Avail Use% Mounted on
> udev            934M     0  934M   0% /dev
> tmpfs           193M   22M  171M  12% /run
> /dev/mmcblk0p3   28G   12G   15G  44% /
> tmpfs           962M     0  962M   0% /dev/shm
> tmpfs           5,0M     0  5,0M   0% /run/lock
> tmpfs           962M     0  962M   0% /sys/fs/cgroup
> /dev/mmcblk0p1  188M  3,4M  184M   2% /boot/efi
> /dev/mmcblk0p3   28G   12G   15G  44% /home
> /dev/sda         37T  6,6T   29T  19% /media/storage/das1
> tmpfs           193M     0  193M   0% /run/user/1000
> menion@Menionubuntu:~$ btrfs --version
> btrfs-progs v4.17
>
> So I don't fully understand where the scrub data size comes from
> Il giorno lun 13 ago 2018 alle ore 23:56 ha scritto:
> >
> > Running time of 55:06:35 indicates that the counter is right, it is not
> > enough time to scrub the entire array using hdd.
> >
> > 2TiB might be right if you only scrubbed one disc, "sudo btrfs scrub start
> > /dev/sdx1" only scrubs the selected partition,
> > whereas "sudo btrfs scrub start /media/storage/das1" scrubs the actual
> > array.
> >
> > Use "sudo btrfs scrub status -d" to view per disc scrubbing statistics and
> > post the output.
> > For live statistics, use "sudo watch -n 1".
> >
> > By the way:
> > 0 errors despite multiple unclean shutdowns? I assumed that the write hole
> > would corrupt parity the first time around, was I wrong?
> >
> > On 13-Aug-2018 09:20:36 +0200, men...@gmail.com wrote:
> > > Hi
> > > I have a BTRFS RAID5 array built on 5x8TB HDD filled with, well :),
> > > there are contradicting opinions by the, well, "several" ways to check
> > > the used space on a BTRFS RAID5 array, but I should be around 8TB of
> > > data.
> > > This array is running on kernel 4.17.3 and it definitely experienced
> > > power loss while data was being written.
> > > I can say that it went through at least a dozen unclean shutdowns.
> > > So following this thread I started my first scrub on the array, and
> > > this is the outcome (after having resumed it 4 times, two after a
> > > power loss...):
> > >
> > > menion@Menionubuntu:~$ sudo btrfs scrub status /media/storage/das1/
> > > scrub status for 931d40c6-7cd7-46f3-a4bf-61f3a53844bc
> > > scrub resumed at Sun Aug 12 18:43:31 2018 and finished after 55:06:35
> > > total bytes scrubbed: 2.59TiB with 0 errors
> > >
> > > So, there are 0 errors, but I don't understand why it says 2.59TiB of
> > > scrubbed data. Is it possible that this value is also crap, like the
> > > non-zero counters for the RAID5 array?
> > > Il giorno sab 11 ago 2018 alle ore 17:29 Zygo Blaxell
> > > ha scritto:
> > > >
> > > > On Sat, Aug 11, 2018 at 08:27:04AM +0200, erentheti...@mail.de wrote:
> > > > > I guess that covers most topics, two last questions:
> > > > >
> > > > > Will the write hole behave differently on Raid 6 compared to Raid 5?
> > > >
> > > > Not really. It changes the probability distribution (you get an extra
> > > > chance to recover using a parity block in some cases), but there are
> > > > still cases where data gets lost that didn't need to be.
> > > >
> > > > > Is there any benefit of running Raid 5 Metadata compared to Raid 1?
> > > >
> > > > There may be benefits of raid5 metadata, but they are small compared to
> > > > the risks.
> > > >
> > > > In some configurations it may not be possible to allocate the last
> > > > gigabyte of space. raid1 will allocate 1GB chunks from 2 disks at a
> > > > time while raid5 will allocate 1GB chunks from N disks at a time, and if
> > > > N is an odd number there could be one chunk left over in the array that
> > > > is unusable.
> > > > Most users will find this irrelevant because a large disk
> > > > array that is filled to the last GB will become quite slow due
Re: List of known BTRFS Raid 5/6 Bugs?
On Mon, Aug 13, 2018 at 11:56:05PM +0200, erentheti...@mail.de wrote:
> Running time of 55:06:35 indicates that the counter is right, it is
> not enough time to scrub the entire array using hdd.
>
> 2TiB might be right if you only scrubbed one disc, "sudo btrfs scrub
> start /dev/sdx1" only scrubs the selected partition,
> whereas "sudo btrfs scrub start /media/storage/das1" scrubs the actual array.
>
> Use "sudo btrfs scrub status -d" to view per disc scrubbing statistics
> and post the output.
> For live statistics, use "sudo watch -n 1".
>
> By the way:
> 0 errors despite multiple unclean shutdowns? I assumed that the write
> hole would corrupt parity the first time around, was I wrong?

You won't see the write hole from just a power failure. You need a power
failure *and* a disk failure, and writes need to be happening at the moment
power fails.

Write hole breaks parity. Scrub silently(!) fixes parity. Scrub reads
the parity block and compares it to the computed parity, and if it's wrong,
scrub writes the computed parity back. Normal RAID5 reads with all disks
online read only the data blocks, so they won't read the parity block and
won't detect wrong parity.

I did a couple of order-of-magnitude estimations of how likely a power
failure is to trash a btrfs RAID system and got a probability between 3%
and 30% per power failure if there were writes active at the time and a
disk failed to join the array after boot. That was based on 5 disks having
31 writes queued, with one of the disks being significantly slower than the
others (as failing disks often are), under continuous write load.

If you have a power failure on an array that isn't writing anything at the
time, nothing happens.
> > Am 13-Aug-2018 09:20:36 +0200 schrieb men...@gmail.com: > > Hi > > I have a BTRFS RAID5 array built on 5x8TB HDD filled with, well :), > > there are contradicting opinions by the, well, "several" ways to check > > the used space on a BTRFS RAID5 array, but I should be aroud 8TB of > > data. > > This array is running on kernel 4.17.3 and it definitely experienced > > power loss while data was being written. > > I can say that it wen through at least a dozen of unclear shutdown > > So following this thread I started my first scrub on the array. and > > this is the outcome (after having resumed it 4 times, two after a > > power loss...): > > > > menion@Menionubuntu:~$ sudo btrfs scrub status /media/storage/das1/ > > scrub status for 931d40c6-7cd7-46f3-a4bf-61f3a53844bc > > scrub resumed at Sun Aug 12 18:43:31 2018 and finished after 55:06:35 > > total bytes scrubbed: 2.59TiB with 0 errors > > > > So, there are 0 errors, but I don't understand why it says 2.59TiB of > > scrubbed data. Is it possible that also this values is crap, as the > > non zero counters for RAID5 array? > > Il giorno sab 11 ago 2018 alle ore 17:29 Zygo Blaxell > > ha scritto: > > > > > > On Sat, Aug 11, 2018 at 08:27:04AM +0200, erentheti...@mail.de wrote: > > > > I guess that covers most topics, two last questions: > > > > > > > > Will the write hole behave differently on Raid 6 compared to Raid 5 ? > > > > > > Not really. It changes the probability distribution (you get an extra > > > chance to recover using a parity block in some cases), but there are > > > still cases where data gets lost that didn't need to be. > > > > > > > Is there any benefit of running Raid 5 Metadata compared to Raid 1 ? > > > > > > There may be benefits of raid5 metadata, but they are small compared to > > > the risks. > > > > > > In some configurations it may not be possible to allocate the last > > > gigabyte of space. 
raid1 will allocate 1GB chunks from 2 disks at a > > > time while raid5 will allocate 1GB chunks from N disks at a time, and if > > > N is an odd number there could be one chunk left over in the array that > > > is unusable. Most users will find this irrelevant because a large disk > > > array that is filled to the last GB will become quite slow due to long > > > free space search and seek times--you really want to keep usage below 95%, > > > maybe 98% at most, and that means the last GB will never be needed. > > > > > > Reading raid5 metadata could theoretically be faster than raid1, but that > > > depends on a lot of variables, so you can't assume it as a rule of thumb. > > > > > > Raid6 metadata is more interesting because it's the only currently > > > supported way to get 2-disk failure tolerance in btrfs. Unfortunately > > > that b
Re: List of known BTRFS Raid 5/6 Bugs?
On Mon, Aug 13, 2018 at 09:20:22AM +0200, Menion wrote: > Hi > I have a BTRFS RAID5 array built on 5x8TB HDD filled with, well :), > there are contradicting opinions by the, well, "several" ways to check > the used space on a BTRFS RAID5 array, but I should be aroud 8TB of > data. > This array is running on kernel 4.17.3 and it definitely experienced > power loss while data was being written. > I can say that it wen through at least a dozen of unclear shutdown > So following this thread I started my first scrub on the array. and > this is the outcome (after having resumed it 4 times, two after a > power loss...): > > menion@Menionubuntu:~$ sudo btrfs scrub status /media/storage/das1/ > scrub status for 931d40c6-7cd7-46f3-a4bf-61f3a53844bc > scrub resumed at Sun Aug 12 18:43:31 2018 and finished after 55:06:35 > total bytes scrubbed: 2.59TiB with 0 errors > > So, there are 0 errors, but I don't understand why it says 2.59TiB of > scrubbed data. Is it possible that also this values is crap, as the > non zero counters for RAID5 array? I just tested a quick scrub with injected errors on 4.18.0 and it looks like the garbage values are finally fixed (yay!). I never saw invalid values for 'total bytes' from raid5; however, scrub has (had?) trouble resuming, especially if the system was rebooted between cancel and resume, but sometimes just if the scrub had just been suspended too long (maybe if there are changes to the chunk tree...?). 55 hours for 2600 GB is just under 50GB per hour, which doesn't sound too unreasonable for btrfs, though it is known to be a bit slow compared to other raid5 implementations. > Il giorno sab 11 ago 2018 alle ore 17:29 Zygo Blaxell > ha scritto: > > > > On Sat, Aug 11, 2018 at 08:27:04AM +0200, erentheti...@mail.de wrote: > > > I guess that covers most topics, two last questions: > > > > > > Will the write hole behave differently on Raid 6 compared to Raid 5 ? > > > > Not really. 
It changes the probability distribution (you get an extra
> > chance to recover using a parity block in some cases), but there are
> > still cases where data gets lost that didn't need to be.
> >
> > > Is there any benefit of running Raid 5 Metadata compared to Raid 1 ?
> >
> > There may be benefits of raid5 metadata, but they are small compared to
> > the risks.
> >
> > In some configurations it may not be possible to allocate the last
> > gigabyte of space. raid1 will allocate 1GB chunks from 2 disks at a
> > time while raid5 will allocate 1GB chunks from N disks at a time, and if
> > N is an odd number there could be one chunk left over in the array that
> > is unusable. Most users will find this irrelevant because a large disk
> > array that is filled to the last GB will become quite slow due to long
> > free space search and seek times--you really want to keep usage below 95%,
> > maybe 98% at most, and that means the last GB will never be needed.
> >
> > Reading raid5 metadata could theoretically be faster than raid1, but that
> > depends on a lot of variables, so you can't assume it as a rule of thumb.
> >
> > Raid6 metadata is more interesting because it's the only currently
> > supported way to get 2-disk failure tolerance in btrfs. Unfortunately
> > that benefit is rather limited due to the write hole bug.
> >
> > There are patches floating around that implement multi-disk raid1 (i.e. 3
> > or 4 mirror copies instead of just 2). This would be much better for
> > metadata than raid6--more flexible, more robust, and my guess is that
> > it will be faster as well (no need for RMW updates or journal seeks).
Re: List of known BTRFS Raid 5/6 Bugs?
On Sat, Aug 11, 2018 at 08:27:04AM +0200, erentheti...@mail.de wrote:
> I guess that covers most topics, two last questions:
>
> Will the write hole behave differently on Raid 6 compared to Raid 5 ?

Not really. It changes the probability distribution (you get an extra
chance to recover using a parity block in some cases), but there are
still cases where data gets lost that didn't need to be.

> Is there any benefit of running Raid 5 Metadata compared to Raid 1 ?

There may be benefits of raid5 metadata, but they are small compared to
the risks.

In some configurations it may not be possible to allocate the last
gigabyte of space. raid1 will allocate 1GB chunks from 2 disks at a
time while raid5 will allocate 1GB chunks from N disks at a time, and if
N is an odd number there could be one chunk left over in the array that
is unusable. Most users will find this irrelevant because a large disk
array that is filled to the last GB will become quite slow due to long
free space search and seek times--you really want to keep usage below 95%,
maybe 98% at most, and that means the last GB will never be needed.

Reading raid5 metadata could theoretically be faster than raid1, but that
depends on a lot of variables, so you can't assume it as a rule of thumb.

Raid6 metadata is more interesting because it's the only currently
supported way to get 2-disk failure tolerance in btrfs. Unfortunately
that benefit is rather limited due to the write hole bug.

There are patches floating around that implement multi-disk raid1 (i.e. 3
or 4 mirror copies instead of just 2). This would be much better for
metadata than raid6--more flexible, more robust, and my guess is that
it will be faster as well (no need for RMW updates or journal seeks).
Re: List of known BTRFS Raid 5/6 Bugs?
On Sat, Aug 11, 2018 at 04:18:35AM +0200, erentheti...@mail.de wrote: > Write hole: > > > > The data will be readable until one of the data blocks becomes > > inaccessible (bad sector or failed disk). This is because it is only the > > parity block that is corrupted (old data blocks are still not modified > > due to btrfs CoW), and the parity block is only required when recovering > > from a disk failure. > > I am unsure about your meaning. > Assuming you perform an unclean shutdown (eg. crash), and after restart > perform a scrub, with no additional error (bad sector, bit-rot) before > or after the crash: > will you loose data? No, the parity blocks will be ignored and RAID5 will act like slow RAID0 if no other errors occur. > Will you be able to mount the filesystem like normal? Yes. > Additionaly, will the crash create additional errors like bad > sectors and or bit-rot aside from the parity-block corruption? No, only parity-block corruptions should occur. > Its actually part of my first mail, where the btrfs Raid5/6 page > assumes no data damage while the spinics comment implies the opposite. The above assumes no drive failures or data corruption; however, if this were the case, you could use RAID0 instead of RAID5. The only reason to use RAID5 is to handle cases where at least one block (or an entire disk) fails, so the behavior of RAID5 when all disks are working is almost irrelevant. A drive failure could occur at any time, so even if you mount successfully, if a disk fails immediately after, any stripes affected by write hole will be unrecoverably corrupted. > The write hole does not seem as dangerous if you could simply scrub > to repair damage (On smaller discs that is, where scrub doesnt take > enough time for additional errors to occur) Scrub can repair parity damage on normal data and metadata--it recomputes parity from data if the data passes a CRC check. 
No repair is possible for data in nodatasum files--the parity can be recomputed, but there is no way to determine if the result is correct. Metadata is always checksummed and transid verified; alas, there isn't an easy way to get btrfs to perform an urgent scrub on metadata only. > > Put another way: if all disks are online then RAID5/6 behaves like a slow > > RAID0, and RAID0 does not have the partial stripe update problem because > > all of the data blocks in RAID0 are independent. It is only when a disk > > fails in RAID5/6 that the parity block is combined with data blocks, so > > it is only in this case that the write hole bug can result in lost data. > > So data will not be lost if no drive has failed? Correct, but the array will have reduced failure tolerance, and RAID5 only matters when a drive has failed. It is effectively operating in degraded mode on parts of the array affected by write hole, and no single disk failure can be tolerated there. It is possible to recover the parity by performing an immediate scrub after reboot, but this cannot be as effective as a proper RAID5 update journal which avoids making the parity bad in the first place. > > > > If the filesystem is -draid5 -mraid1 then the metadata is not vulnerable > > > > to the write hole, but data is. In this configuration you can determine > > > > with high confidence which files you need to restore from backup, and > > > > the filesystem will remain writable to replace the restored data, > > > > because > > > > raid1 does not have the write hole bug. > > In regards to my earlier questions, what would change if i do -draid5 -mraid1? Metadata would be using RAID1 which is not subject to the RAID5 write hole issue. It is much more tolerant of unclean shutdowns especially in degraded mode. Data in RAID5 may be damaged when the array is in degraded mode and a write hole occurs (in either order as long as both occur). 
Due to RAID1 metadata, the filesystem will continue to operate properly, allowing the damaged data to be overwritten or deleted. > Lost Writes: > > > Hotplugging causes an effect (lost writes) which can behave similarly > > to the write hole bug in some instances. The similarity ends there. > > Are we speaking about the same problem that is causing transid mismatch? Transid mismatch is usually caused by lost writes, by any mechanism that prevents a write from being completed after the disk reports that it was completed. Drives may report that data is "in stable storage", i.e. the drive believes it can complete the write in the future even if power is lost now because the drive or controller has capacitors or NVRAM or similar. If the drive is reset by the SATA host because of a cable disconnect event, the drive may forget that it has promised to do writes in the future. Drives may simply lie, and claim that data has been written to disk when the data is actually in volatile RAM and will disappear in a power failure. btrfs uses a transaction mechanism and CoW metadata to handle lost writes within an interrupted transaction.
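The transaction mechanism in the last paragraph can be sketched in miniature. This toy model is my own illustration (the generation numbers are invented); it shows why a lost metadata write is detectable at read time: a parent block records the transid it expects to find in its child, so a child that still carries a stale generation fails verification.

```shell
#!/bin/sh
# Toy model of btrfs transid verification (illustration only).
# A parent block stores the transid it expects its child block to carry.
verify_child() {
    expected=$1 found=$2
    if [ "$found" -ne "$expected" ]; then
        echo "parent transid verify failed (wanted $expected found $found)"
        return 1
    fi
    echo "ok"
}

verify_child 42 42           # write completed: generations match
verify_child 42 40 || true   # write lost: child still carries the old transid
```

The second call models the lost write: the disk claimed the child was written, the parent was committed expecting generation 42, but the child on disk still says 40, producing the familiar "parent transid verify failed" class of error.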
Re: List of known BTRFS Raid 5/6 Bugs?
On Fri, Aug 10, 2018 at 06:55:58PM +0200, erentheti...@mail.de wrote:
> Did I get you right?
> Please correct me if I am wrong:
>
> Scrubbing seems to have been fixed, you only have to run it once.

Yes. There is one minor bug remaining here: when scrub detects an error
on any disk in a raid5/6 array, the error counts are garbage (random
numbers on all the disks). You will need to inspect btrfs dev stats or
the kernel log messages to learn which disks are injecting errors.
This does not impair the scrubbing function, only the detailed statistics
report (scrub status -d). If there are no errors, scrub correctly reports
0 for all error counts. Only raid5/6 is affected this way--other RAID
profiles produce correct scrub statistics.

> Hotplugging (temporary connection loss) is affected by the write hole
> bug, and will create undetectable errors every 16 TB (crc32 limitation).

Hotplugging causes an effect (lost writes) which can behave similarly
to the write hole bug in some instances. The similarity ends there.
They are really two distinct categories of problem. Temporary connection
loss can do bad things to all RAID profiles on btrfs (not just RAID5/6),
and the btrfs requirements for handling connection loss and write holes
are very different.

> The write hole bug can affect both old and new data.

Normally, only old data can be affected by the write hole bug. The "new"
data is not committed before the power failure (otherwise we would call
it "old" data), so any corrupted new data will be inaccessible as a result
of the power failure. The filesystem will roll back to the last complete
committed data tree (discarding all new and modified data blocks), then
replay the fsync log (which repeats and completes some writes that occurred
since the last commit). This process eliminates new data from the
filesystem whether the new data was corrupted by the write hole or not.
Only corruptions that affect old data will remain, because old data is not overwritten by data saved in the fsync log, and old data is not part of the incomplete data tree that is rolled back after power failure. Exception: new data in nodatasum files can also be corrupted, but since nodatasum disables all data integrity or recovery features it's hard to define what "corrupted" means for a nodatasum file. > Reason: BTRFS saves data in fixed size stripes, if the write operation > fails midway, the stripe is lost. > This does not matter much for Raid 1/10, data always uses a full stripe, > and stripes are copied on write. Only new data could be lost. This is incorrect. Btrfs saves data in variable-sized extents (between 1 and 32768 4K data blocks) and btrfs has no concept of stripes outside of its raid layer. Stripes are never copied. In RAID 1/10/DUP all data blocks are fully independent of each other, i.e. writing to any block on these RAID profiles does not corrupt data in any other block. As a result these RAID profiles do not allow old data to be corrupted by partially completed writes of new data. There is striping in some profiles, but it is only used for performance in these cases, and has no effect on data recovery. > However, for some reason Raid 5/6 works with partial stripes, meaning > that data is stored in stripes not completley filled by prior data, In RAID 5/6 each data block is related to all other data blocks in the same stripe with the parity block(s). If any individual data block in the stripe is updated, the parity block(s) must also be updated atomically, or the wrong data will be reconstructed during RAID5/6 recovery. Because btrfs does nothing to prevent it, some writes will occur to RAID5/6 stripes that are already partially occupied by old data. btrfs also does nothing to ensure that parity block updates are atomic, so btrfs has the write hole bug as a result. > and stripes are removed on write. Stripes are never removed...? 
A stripe is just a group of disk blocks divided on 64K boundaries, same as mdadm and many hardware RAID5/6 implementations. > Result: If the operation fails midway, the stripe is lost as is all > data previously stored it. You can only lose as many data blocks in each stripe as there are parity disks (i.e. raid5 can lose 0 or 1 block, while raid6 can lose 0, 1, or 2 blocks); however, multiple writes can be lost affecting multiple stripes in a single power loss event. Losing even 1 block is often too much. ;) The data will be readable until one of the data blocks becomes inaccessible (bad sector or failed disk). This is because it is only the parity block that is corrupted (old data blocks are still not modified due to btrfs CoW), and the parity block is only required when recovering from a disk failure. Put another way: if all disks are online then RAID5/6 behaves like a slow RAID0, and RAID0 does not have the partial stripe update problem because all of the data blocks in RAID0 are independent. It is only when a disk fails in RAID5/6 that the parity block is
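The partial-stripe-update hazard described above can be reduced to a toy XOR model. This is my own sketch, not Zygo's: a one-byte integer stands in for each 64K stripe element, and XOR is the raid5 parity operator.

```shell
#!/bin/sh
# Toy raid5 stripe: two data elements and one parity element (p = d0 XOR d1).
d0=170 d1=85                  # 10101010, 01010101
p=$((d0 ^ d1))                # 11111111 -- parity covers both data elements

# Disk holding d1 fails: reconstruct it from the survivors.
echo "recovered d1: $((d0 ^ p))"      # prints 85 -- correct

# Write hole: d0 is rewritten in place, but power fails before the
# matching parity update reaches the disk.
d0=240                        # 11110000
echo "recovered d1: $((d0 ^ p))"      # prints 15 -- wrong
```

Note which element ends up damaged: d1, which was never written at all. That is the write hole corrupting old, committed data, and it stays invisible until the reconstruction path is actually exercised by a disk failure.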
Re: List of known BTRFS Raid 5/6 Bugs?
On Fri, Aug 10, 2018 at 03:40:23AM +0200, erentheti...@mail.de wrote: > I am searching for more information regarding possible bugs related to > BTRFS Raid 5/6. All sites i could find are incomplete and information > contradicts itself: > > The Wiki Raid 5/6 Page (https://btrfs.wiki.kernel.org/index.php/RAID56) > warns of the write hole bug, stating that your data remains safe > (except data written during power loss, obviously) upon unclean shutdown > unless your data gets corrupted by further issues like bit-rot, drive > failure etc. The raid5/6 write hole bug exists on btrfs (as of 4.14-4.16) and there are no mitigations to prevent or avoid it in mainline kernels. The write hole results from allowing a mixture of old (committed) and new (uncommitted) writes to the same RAID5/6 stripe (i.e. a group of blocks consisting of one related data or parity block from each disk in the array, such that writes to any of the data blocks affect the correctness of the parity block and vice versa). If the writes were not completed and one or more of the data blocks are not online, the data blocks reconstructed by the raid5/6 algorithm will be corrupt. If all disks are online, the write hole does not immediately damage user-visible data as the old data blocks can still be read directly; however, should a drive failure occur later, old data may not be recoverable because the parity block will not be correct for reconstructing the missing data block. A scrub can fix write hole errors if all disks are online, and a scrub should be performed after any unclean shutdown to recompute parity data. The write hole always puts both old and new data at risk of damage; however, due to btrfs's copy-on-write behavior, only the old damaged data can be observed after power loss. The damaged new data will have no references to it written to the disk due to the power failure, so there is no way to observe the new damaged data using the filesystem. 
Not every interrupted write causes damage to old data, but some will. Two possible mitigations for the write hole are: - modify the btrfs allocator to prevent writes to partially filled raid5/6 stripes (similar to what the ssd mount option does, except with the correct parameters to match RAID5/6 stripe boundaries), and advise users to run btrfs balance much more often to reclaim free space in partially occupied raid stripes - add a stripe write journal to the raid5/6 layer (either in btrfs itself, or in a lower RAID5 layer). There are assorted other ideas (e.g. copy the RAID-Z approach from zfs to btrfs or dramatically increase the btrfs block size) that also solve the write hole problem but are somewhat more invasive and less practical for btrfs. Note that the write hole also affects btrfs on top of other similar raid5/6 implementations (e.g. mdadm raid5 without stripe journal). The btrfs CoW layer does not understand how to allocate data to avoid RMW raid5 stripe updates without corrupting existing committed data, and this limitation applies to every combination of unjournalled raid5/6 and btrfs. > The Wiki Gotchas Page (https://btrfs.wiki.kernel.org/index.php/Gotchas) > warns of possible incorrigible "transid" mismatch, not stating which > versions are affected or what transid mismatch means for your data. It > does not mention the write hole at all. Neither raid5 nor write hole are required to produce a transid mismatch failure. transid mismatch usually occurs due to a lost write. Write hole is a specific case of lost write, but write hole does not usually produce transid failures (it produces header or csum failures instead). During real disk failure events, multiple distinct failure modes can occur concurrently. i.e. both transid failure and write hole can occur at different places in the same filesystem as a result of attempting to use a failing disk over a long period of time. A transid verify failure is metadata damage. 
It will make the filesystem readonly and make some data inaccessible as described below. > This Mail Archive (linux-btrfs@vger.kernel.org/msg55161.html" > target="_blank">https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg55161.html) > states that scrubbing BTRFS Raid 5/6 will always repair Data Corruption, > but may corrupt your Metadata while trying to do so - meaning you have > to scrub twice in a row to ensure data integrity. Simple corruption (without write hole errors) is fixed by scrubbing as of the last...at least six months? Kernel v4.14.xx and later can definitely do it these days. Both data and metadata. If the metadata is damaged in any way (corruption, write hole, or transid verify failure) on btrfs and btrfs cannot use the raid profile for metadata to recover the damaged data, the filesystem is usually forever readonly, and anywhere from 0 to 100% of the filesystem may be readable depending on where in the metadata tree structure the error occurs (the closer to the
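The first mitigation mentioned above (teaching the allocator to avoid partially filled raid5/6 stripes) amounts to stripe-aligned allocation. A sketch of the arithmetic, assuming a 5-disk raid5 with 64K stripe elements (the geometry is my assumption for illustration, not taken from the mail):

```shell
#!/bin/sh
# Full-stripe data width for an N-disk raid5 with 64K stripe elements:
# one element per disk, one of which is parity.
ndisks=5
elem=$((64 * 1024))
stripe_data=$(( (ndisks - 1) * elem ))   # 262144 bytes of data per full stripe

# Round an allocation up to a whole number of stripes so a new write never
# shares a stripe with committed data (no read-modify-write of parity).
round_up() { echo $(( ($1 + stripe_data - 1) / stripe_data * stripe_data )); }

round_up 300000    # prints 524288 -- two full stripes
```

Allocating only in full-stripe units is what removes the RMW parity update, at the cost of internal fragmentation that a periodic balance would have to reclaim, which is exactly why the mitigation above pairs the allocator change with more frequent balancing.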
Re: RAID-1 refuses to balance large drive
On Sat, May 26, 2018 at 06:27:57PM -0700, Brad Templeton wrote: > A few years ago, I encountered an issue (halfway between a bug and a > problem) with attempting to grow a BTRFS 3 disk Raid 1 which was > fairly full. The problem was that after replacing (by add/delete) a > small drive with a larger one, there were now 2 full drives and one > new half-full one, and balance was not able to correct this situation > to produce the desired result, which is 3 drives, each with a roughly > even amount of free space. It can't do it because the 2 smaller > drives are full, and it doesn't realize it could just move one of the > copies of a block off the smaller drive onto the larger drive to free > space on the smaller drive, it wants to move them both, and there is > nowhere to put them both. > > I'm about to do it again, taking my nearly full array which is 4TB, > 4TB, 6TB and replacing one of the 4TB with an 8TB. I don't want to > repeat the very time consuming situation, so I wanted to find out if > things were fixed now. I am running Xenial (kernel 4.4.0) and could > consider the upgrade to bionic (4.15) though that adds a lot more to > my plate before a long trip and I would prefer to avoid if I can. > > So what is the best strategy: > > a) Replace 4TB with 8TB, resize up and balance? (This is the "basic" > strategy) > b) Add 8TB, balance, remove 4TB (automatic distribution of some blocks > from 4TB but possibly not enough) > c) Replace 6TB with 8TB, resize/balance, then replace 4TB with > recently vacated 6TB -- much longer procedure but possibly better d) Run "btrfs balance start -dlimit=3 /fs" to make some unallocated space on all drives *before* adding disks. Then replace, resize up, and balance until unallocated space on all disks are equal. There is no need to continue balancing after that, so once that point is reached you can cancel the balance. 
A number of bad things can happen when unallocated space goes to zero, and being unable to expand a raid1 array is only one of them. Avoid that situation even when not resizing the array, because some cases can be very difficult to get out of. Assuming your disk is not filled to the last gigabyte, you'll be able to keep at least 1GB unallocated on every disk at all times. Monitor the amount of unallocated space and balance a few data block groups (e.g. -dlimit=3) whenever unallocated space gets low. A potential btrfs enhancement area: allow the 'devid' parameter of balance to specify two disks to balance block groups that contain chunks on both disks. We want to balance only those block groups that consist of one chunk on each smaller drive. This redistributes those block groups to have one chunk on the large disk and one chunk on one of the smaller disks, freeing space on the other small disk for the next block group. Block groups that consist of a chunk on the big disk and one of the small disks are already in the desired configuration, so rebalancing them is just a waste of time. Currently it's only possible to do this by writing a script to select individual block groups with python-btrfs or similar--much faster than plain btrfs balance for this case, but more involved to set up. > Or has this all been fixed and method A will work fine and get to the > ideal goal -- 3 drives, with available space suitably distributed to > allow full utilization over time? > > On Sat, May 26, 2018 at 6:24 PM, Brad Templeton wrote: > > A few years ago, I encountered an issue (halfway between a bug and a > > problem) with attempting to grow a BTRFS 3 disk Raid 1 which was fairly > > full. 
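The stuck state Brad describes can be sketched with a toy model (my own simplification: chunks are one arbitrary unit each, while the real allocator works in ~1GiB chunks): btrfs raid1 places each new chunk on the two devices with the most unallocated space, so two full devices plus one device with free space can allocate nothing at all.

```python
def raid1_allocatable(free):
    """Greedy toy model of btrfs raid1 chunk allocation: each chunk
    takes one unit on the two devices with the most unallocated
    space.  Returns how many data units can still be allocated."""
    free = list(free)
    allocated = 0
    while True:
        free.sort(reverse=True)
        # raid1 needs free space on two *different* devices.
        if len(free) < 2 or free[1] < 1:
            return allocated
        free[0] -= 1
        free[1] -= 1
        allocated += 1

# Brad's new array: 4TB, 6TB, 8TB of unallocated space -> 9TB of data.
print(raid1_allocatable([4, 6, 8]))
# His problem case: two full drives and one with 3TB free -> nothing
# can be allocated, despite 3TB of unallocated space.
print(raid1_allocatable([0, 0, 3]))
```

This is also why keeping at least 1GB unallocated on every disk matters: once two devices hit zero, the remaining free space on the third is unusable for raid1 until a balance moves chunks off the full devices.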
Re: Any chance to get snapshot-aware defragmentation?
On Mon, May 21, 2018 at 11:38:28AM -0400, Austin S. Hemmelgarn wrote: > On 2018-05-21 09:42, Timofey Titovets wrote: > > пн, 21 мая 2018 г. в 16:16, Austin S. Hemmelgarn : > > > On 2018-05-19 04:54, Niccolò Belli wrote: > > > > On venerdì 18 maggio 2018 20:33:53 CEST, Austin S. Hemmelgarn wrote: > > > > > With a bit of work, it's possible to handle things sanely. You can > > > > > deduplicate data from snapshots, even if they are read-only (you need > > > > > to pass the `-A` option to duperemove and run it as root), so it's > > > > > perfectly reasonable to only defrag the main subvolume, and then > > > > > deduplicate the snapshots against that (so that they end up all being > > > > > reflinks to the main subvolume). Of course, this won't work if you're > > > > > short on space, but if you're dealing with snapshots, you should have > > > > > enough space that this will work (because even without defrag, it's > > > > > fully possible for something to cause the snapshots to suddenly take > > > > > up a lot more space). > > > > > > > > Been there, tried that. Unfortunately even if I skip the defreg a simple > > > > > > > > duperemove -drhA --dedupe-options=noblock --hashfile=rootfs.hash rootfs > > > > > > > > is going to eat more space than it was previously available (probably > > > > due to autodefrag?). > > > It's not autodefrag (that doesn't trigger on use of the EXTENT_SAME > > > ioctl). There's two things involved here: > > > > > * BTRFS has somewhat odd and inefficient handling of partial extents. > > > When part of an extent becomes unused (because of a CLONE ioctl, or an > > > EXTENT_SAME ioctl, or something similar), that part stays allocated > > > until the whole extent would be unused. > > > * You're using the default deduplication block size (128k), which is > > > larger than your filesystem block size (which is at most 64k, most > > > likely 16k, but might be 4k if it's an old filesystem), so deduplicating > > > can split extents. 
> > > > That's a metadata node leaf != fs block size. > > btrfs fs block size == machine page size currently. > You're right, I keep forgetting about that (probably because BTRFS is pretty > much the only modern filesystem that doesn't let you change the block size). > > > > > Because of this, if a duplicate region happens to overlap the front of > > > an already shared extent, and the end of said shared extent isn't > > > aligned with the deduplication block size, the EXTENT_SAME call will > > > deduplicate the first part, creating a new shared extent, but not the > > > tail end of the existing shared region, and all of that original shared > > > region will stick around, taking up extra space that it wasn't before. > > > > > Additionally, if only part of an extent is duplicated, then that area of > > > the extent will stay allocated, because the rest of the extent is still > > > referenced (so you won't necessarily see any actual space savings). > > > > > You can mitigate this by telling duperemove to use the same block size > > > as your filesystem using the `-b` option. Note that using a smaller > > > block size will also slow down the deduplication process and greatly > > > increase the size of the hash file. > > > > duperemove -b control "how hash data", not more or less and only support > > 4KiB..1MiB > And you can only deduplicate the data at the granularity you hashed it at. > In particular: > > * The total size of a region being deduplicated has to be an exact multiple > of the hash block size (what you pass to `-b`). So for the default 128k > size, you can only deduplicate regions that are multiples of 128k long > (128k, 256k, 384k, 512k, etc). This is a simple limit derived from how > blocks are matched for deduplication. > * Because duperemove uses fixed hash blocks (as opposed to using a rolling > hash window like many file synchronization tools do), the regions being > deduplicated also have to be exactly aligned to the hash block size. 
So, > with the default 128k size, you can only deduplicate regions starting at 0k, > 128k, 256k, 384k, 512k, etc, but not ones starting at, for example, 64k into > the file. > > > > And size of block for dedup will change efficiency of deduplication, > > when count of hash-block pairs, will change hash file size and time > > complexity. > > > > Let's assume that: 'A' - 1KiB of data '' - 4KiB with repeated pattern. > > > > So, example, you have 2 of 2x4KiB blocks: > > 1: '' > > 2: '' > > > > With -b 8KiB hash of first block not same as second. > > But with -b 4KiB duperemove will see both '' and '' > > And then that blocks will be deduped. > This supports what I'm saying though. Your deduplication granularity is > bounded by your hash granularity. If in addition to the above you have a > file that looks like: > > AABBBAA > > It would not get deduplicated against the first two at either `-b 4k` or `-b > 8k` despite the middle 4k of the file being an exact duplicate
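The fixed-block constraint Austin describes can be sketched as follows (function names and the file contents are my own illustration, not duperemove's actual code); with bytes standing in for data, a duplicate run that is not aligned to the hash block size is invisible at that block size:

```python
import hashlib

K = 1024

def block_hashes(data, block):
    """Hash fixed-size, fixed-alignment blocks, as a fixed-block
    deduplicator does.  A trailing partial block is ignored."""
    return [hashlib.sha256(data[i:i + block]).hexdigest()
            for i in range(0, len(data) - block + 1, block)]

def duplicate_blocks(a, b, block):
    """Return (block_index_in_a, block_index_in_b) pairs of
    identical, identically-aligned blocks."""
    ha, hb = block_hashes(a, block), block_hashes(b, block)
    return [(i, j) for i, x in enumerate(ha)
                   for j, y in enumerate(hb) if x == y]

f1 = b"A" * 4 * K + b"B" * 4 * K                          # AAAABBBB
f2 = b"X" * 2 * K + b"A" * 4 * K + b"B" * 4 * K + b"X" * 2 * K

# f2 contains all of f1, but shifted by 2KiB: at -b 4K nothing
# matches, while at -b 2K the shared run lines up and is found.
print(duplicate_blocks(f1, f2, 4 * K))   # []
print(len(duplicate_blocks(f1, f2, 2 * K)) > 0)
```

The smaller block size finds the duplicate, at the cost of hashing (and storing hashes for) four times as many blocks, which is the hash-file-size and runtime trade-off mentioned above.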
Re: [PATCH 2/2] vfs: dedupe should return EPERM if permission is not granted
On Sun, May 13, 2018 at 11:26:39AM -0700, Darrick J. Wong wrote: > On Sun, May 13, 2018 at 06:21:52PM +, Mark Fasheh wrote: > > On Fri, May 11, 2018 at 05:06:34PM -0700, Darrick J. Wong wrote: > > > On Fri, May 11, 2018 at 12:26:51PM -0700, Mark Fasheh wrote: > > > > Right now we return EINVAL if a process does not have permission to > > > > dedupe a > > > > file. This was an oversight on my part. EPERM gives a true description > > > > of > > > > the nature of our error, and EINVAL is already used for the case that > > > > the > > > > filesystem does not support dedupe. > > > > > > > > Signed-off-by: Mark Fasheh> > > > --- > > > > fs/read_write.c | 2 +- > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > diff --git a/fs/read_write.c b/fs/read_write.c > > > > index 77986a2e2a3b..8edef43a182c 100644 > > > > --- a/fs/read_write.c > > > > +++ b/fs/read_write.c > > > > @@ -2038,7 +2038,7 @@ int vfs_dedupe_file_range(struct file *file, > > > > struct file_dedupe_range *same) > > > > info->status = -EINVAL; > > > > } else if (!(is_admin || (dst_file->f_mode & > > > > FMODE_WRITE) || > > > > uid_eq(current_fsuid(), dst->i_uid))) { > > > > - info->status = -EINVAL; > > > > + info->status = -EPERM; > > > > > > Hmm, are we allowed to change this aspect of the kabi after the fact? > > > > > > Granted, we're only trading one error code for another, but will the > > > existing users of this care? xfs_io won't and I assume duperemove won't > > > either, but what about bees? :) > > > > Yeah if you see my initial e-mail I check bees and also rust-btrfs. I think > > this is fine as we're simply expanding on an error code return. There's no > > magic behavior expected with respect to these error codes either. > > Ok. No objections from me, then. > > Acked-by: Darrick J. Wong For what it's worth, no objection from me either. ;) bees runs only with admin privilege and will never hit the modified line. 
If bees is started without admin privilege, the TREE_SEARCH_V2 ioctl fails. bees uses this ioctl to walk over all the data in the filesystem, so without admin privilege, bees never opens, reads, or dedupes anything. bees relies on having an accurate internal model of btrfs structure and behavior to issue dedup commands that will work and do useful things; however, unexpected kernel behavior or concurrent user data changes will make some dedups fail. When that happens bees just abandons the extent immediately: a user data change will be handled in the next pass over the filesystem, but an unexpected kernel behavior needs bees code changes to correctly predict the new kernel behavior before the dedup can be reattempted.
Re: Hard link not persisted on fsync
On Mon, Apr 16, 2018 at 09:35:24AM -0500, Jayashree Mohan wrote: > Hi, > > The following seems to be a crash consistency bug on btrfs, where in > the link count is not persisted even after a fsync on the original > file. > > Consider the following workload : > creat foo > link (foo, A/bar) > fsync(foo) > ---Crash--- > > Now, on recovery we expect the metadata of foo to be persisted i.e > have a link count of 2. However in btrfs, the link count is 1 and file > A/bar is not persisted. The expected behaviour would be to persist the > dependencies of inode foo. That is to say, shouldn't fsync of foo > persist A/bar and correctly update the link count? Those dependencies are backward. foo's inode doesn't depend on anything but the data in the file foo, and foo's inode itself. "foo" and "A/bar" are dirents that both depend on the inode of foo, which implies that "A" and "." must be updated atomically with foo's inode. If you had called fsync(A) then we'd expect A/bar to exist and the inode to have a link count of 2. If you'd called fsync(.) then...well, you didn't modify "." at all, so I guess either outcome is valid as long as the inode link count matches the number of dirents referencing the inode. But then...why does foo exist at all? I'd expect at least some tests would end without foo on disk either, since all that was fsync()ed was the foo inode, not the foo dirent in the directory '.'. Does btrfs combine creating foo and updating foo's inode into a single atomic operation? I vaguely recall that it does exactly that, in order to solve a bug some years ago. What happens if you add a rename, e.g. unlink foo2 # make sure foo2 doesn't exist creat foo rename(foo, foo2) link(foo2, A/bar) fsync(foo2) Do you get foo or foo2? I'd expect foo since you didn't fsync '.', but maybe rename implies flush and you get foo2. That's not to say that fsync is not a rich source of filesystem bugs. btrfs did once have (and maybe still has?) 
a bug where renames and fsync can create a dirent with no inode, e.g. loop continuously: creat foo write(foo, data) fsync(foo) rename(foo, bar) and crash somewhere in the middle of the loop, which will create a dirent "foo" that points to a non-existent inode. Removing the "fsync" works around the bug. rename() does a flush anyway, so the fsync() wasn't needed, but fsync() shouldn't _create_ a filesystem inconsistency, especially when Googling recommends app developers to sprinkle fsync()s indiscriminately in their code to prevent their data from being mangled. I haven't been tracking to see if that's fixed yet. I last saw it on 4.11, but I have been aggressively avoiding fsync with eatmydata for some years now. > Note that ext4, xfs and f2fs recover to the correct link count of 2 > for the above workload. Do those filesystems also work if you remove the fsync? That may be your answer: they could be flushing the other metadata earlier, before you call fsync(). > Let us know what you think about this behavior. > > Thanks, > Jayashree Mohan
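For reference, the reported workload in runnable form (a sketch only: on a live filesystem the link count is 2 on every filesystem regardless of fsync semantics; the disputed question is what survives a crash injected right after the fsync, which needs a crash-consistency test harness rather than a plain script):

```python
import os
import tempfile

# Workload from the report: creat foo; link(foo, A/bar); fsync(foo).
with tempfile.TemporaryDirectory() as d:
    foo = os.path.join(d, "foo")
    os.mkdir(os.path.join(d, "A"))
    fd = os.open(foo, os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        os.link(foo, os.path.join(d, "A", "bar"))
        os.fsync(fd)      # persists foo's inode; whether the A/bar
                          # dirent is persisted is the point of dispute
        nlink = os.stat(foo).st_nlink
    finally:
        os.close(fd)
    print(nlink)
```

The disagreement above is exactly about which on-disk state after a crash here is "correct": link count 2 with A/bar present (as ext4/xfs/f2fs recover), or any self-consistent state where the link count matches the surviving dirents.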
Re: Status of RAID5/6
On Wed, Apr 04, 2018 at 11:31:33PM +0200, Goffredo Baroncelli wrote: > On 04/04/2018 08:01 AM, Zygo Blaxell wrote: > > On Wed, Apr 04, 2018 at 07:15:54AM +0200, Goffredo Baroncelli wrote: > >> On 04/04/2018 12:57 AM, Zygo Blaxell wrote: > [...] > >> Before you pointed out that the non-contiguous block written has > >> an impact on performance. I am replaying that the switching from a > >> different BG happens at the stripe-disk boundary, so in any case the > >> block is physically interrupted and switched to another disk > > > > The difference is that the write is switched to a different local address > > on the disk. > > > > It's not "another" disk if it's a different BG. Recall in this plan > > there is a full-width BG that is on _every_ disk, which means every > > small-width BG shares a disk with the full-width BG. Every extent tail > > write requires a seek on a minimum of two disks in the array for raid5, > > three disks for raid6. A tail that is strip-width minus one will hit > > N - 1 disks twice in an N-disk array. > > Below I made a little simulation; my results telling me another thing: > > Current BTRFS (w/write hole) > > Supposing 5 disk raid 6 and stripe size=64kb*3=192kb (disk stripe=64kb) > > Case A.1): extent size = 192kb: > 5 writes of 64kb spread on 5 disks (3data + 2 parity) > > Case A.2.2): extent size = 256kb: (optimistic case: contiguous space > available) > 5 writes of 64kb spread on 5 disks (3 data + 2 parity) > 2 reads of 64 kb spread on 2 disks (two old data of the stripe) [**] > 3 writes of 64 kb spread on 3 disks (data + 2 parity) > > Note that the two reads are contiguous to the 5 writes both in term of > space and time. The three writes are contiguous only in terms of space, > but not in terms of time, because these could happen only after the 2 > reads and the consequent parities computations. 
So we should consider > that between these two events, some disk activities happen; this means > seeks between the 2 reads and the 3 writes > > > BTRFS with multiple BG (wo/write hole) > > Supposing 5 disk raid 6 and stripe size=64kb*3=192kb (disk stripe=64kb) > > Case B.1): extent size = 192kb: > 5 writes of 64kb spread on 5 disks > > Case B.2): extent size = 256kb: > 5 writes of 64kb spread on 5 disks in BG#1 > 3 writes of 64 kb spread on 3 disks in BG#2 (which requires 3 seeks) > > So if I count correctly: > - case B1 vs A1: these are equivalent > - case B2 vs A2.1/A2.2: > 8 writes vs 8 writes > 3 seeks vs 3 seeks > 0 reads vs 2 reads > > So to me it seems that the cost of doing a RMW cycle is worse than > seeking to another BG. Well, RMW cycles are dangerous, so being slow as well is just a second reason never to do them. > Anyway I am reaching the conclusion, also thanks of this discussion, > that this is not enough. Even if we had solve the problem of the > "extent smaller than stripe" write, we still face gain this issue when > part of the file is changed. > In this case the file update breaks the old extent and will create a > three extents: the first part, the new part, the last part. Until that > everything is OK. However the "old" part of the file would be marked > as free space. But using this part could require a RMW cycle You cannot use that free space within RAID stripes because it would require RMW, and RMW causes write hole. The space would have to be kept unavailable until the rest of the RAID stripe was deleted. OTOH, if you can solve that free space management problem, you don't have to do anything else to solve write hole. If you never RMW then you never have the write hole in the first place. > I am concluding that the only two reliable solution are > a) variable stripe size (like ZFS does) > or b) logging the RMW cycle of a stripe Those are the only solutions that don't require a special process for reclaiming unused space in RAID stripes. 
If you have that, you have a few more options; however, they all involve making a second copy of the data at a later time (as opposed to option b, which makes a second copy of the data during the original write). a) also doesn't support nodatacow files (AFAIK ZFS doesn't have those) and it would require defrag to get the inefficiently used space back. b) is the best of the terrible options. It minimizes the impact on the rest of the filesystem since it can fix RMW inconsistency without having to eliminate the RMW cases. It doesn't require rewriting the allocator nor does it require users to run defrag or balance periodically. > [**] Does someone know if the checksum are checked during this read ? > [...] > > BR > G.Baroncelli > > > -- > gpg @keyserver.linux.it: Goffredo Baroncelli > Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: Status of RAID5/6
On Tue, Apr 03, 2018 at 09:08:01PM -0600, Chris Murphy wrote: > On Tue, Apr 3, 2018 at 11:03 AM, Goffredo Baroncelli <kreij...@inwind.it> > wrote: > > On 04/03/2018 02:31 AM, Zygo Blaxell wrote: > >> On Mon, Apr 02, 2018 at 06:23:34PM -0400, Zygo Blaxell wrote: > >>> On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote: > >>>> On 2018-04-02 11:18, Goffredo Baroncelli wrote: > >>>>> I thought that a possible solution is to create BG with different > >>>> number of data disks. E.g. supposing to have a raid 6 system with 6 > >>>> disks, where 2 are parity disk; we should allocate 3 BG > >>>>> BG #1: 1 data disk, 2 parity disks > >>>>> BG #2: 2 data disks, 2 parity disks, > >>>>> BG #3: 4 data disks, 2 parity disks > >>>>> > >>>>> For simplicity, the disk-stripe length is assumed = 4K. > >>>>> > >>>>> So If you have a write with a length of 4 KB, this should be placed > >>>> in BG#1; if you have a write with a length of 4*3KB, the first 8KB, > >>>> should be placed in in BG#2, then in BG#1. > >>>>> This would avoid space wasting, even if the fragmentation will > >>>> increase (but shall the fragmentation matters with the modern solid > >>>> state disks ?). > >>> I don't really see why this would increase fragmentation or waste space. > > > >> Oh, wait, yes I do. If there's a write of 6 blocks, we would have > >> to split an extent between BG #3 (the first 4 blocks) and BG #2 (the > >> remaining 2 blocks). It also flips the usual order of "determine size > >> of extent, then allocate space for it" which might require major surgery > >> on the btrfs allocator to implement. > > > > I have to point out that in any case the extent is physically interrupted > > at the disk-stripe size. Assuming disk-stripe=64KB, if you want to write > > 128KB, the first half is written in the first disk, the other in the 2nd > > disk. If you want to write 96kb, the first 64 are written in the first > > disk, the last part in the 2nd, only on a different BG. 
> > So yes there is a fragmentation from a logical point of view; from a > > physical point of view the data is spread on the disks in any case. > > > > In any case, you are right, we should gather some data, because the > > performance impact are no so clear. > > They're pretty clear, and there's a lot written about small file size > and parity raid performance being shit, no matter the implementation > (md, ZFS, Btrfs, hardware maybe less so just because of all the > caching and extra processing hardware that's dedicated to the task). Pretty much everything goes fast if you put a faster non-volatile cache in front of it. > The linux-raid@ list is full of optimizations for this that are use > case specific. One of those that often comes up is how badly suited > raid56 are for e.g. mail servers, tons of small file reads and writes, > and all the disk contention that comes up, and it's even worse when > you lose a disk, and even if you're running raid 6 and lose two disk > it's really god awful. It can be unexpectedly a disqualifying setup > without prior testing in that condition: can your workload really be > usable for two or three days in a double degraded state on that raid6? > *shrug* > > Parity raid is well suited for full stripe reads and writes, lots of > sequential writes. Ergo a small file is anything less than a full > stripe write. Of course, delayed allocation can end up making for more > full stripe writes. But now you have more RMW which is the real > performance killer, again no matter the raid. RMW isn't necessary if you have properly configured COW on top. ZFS doesn't do RMW at all. OTOH for some workloads COW is a step in a different wrong direction--the btrfs raid5 problems with nodatacow files can be solved by stripe logging and nothing else. Some equivalent of autodefrag that repacks your small RAID stripes into bigger ones will burn 3x your write IOPS eventually--it just lets you defer the inevitable until a hopefully more convenient time. 
A continuously loaded server never has a more convenient time, so it needs a different solution. > > I am not worried abut having different BG; we have problem with these > > because we never developed tool to handle this issue properly (i.e. a > > daemon which starts a balance when needed). But I hope that this will be > > solved in future. > > > > In any case, the all solutions proposed have their trade off: > > > > - a) as is: write hole
Re: Status of RAID5/6
On Wed, Apr 04, 2018 at 07:15:54AM +0200, Goffredo Baroncelli wrote: > On 04/04/2018 12:57 AM, Zygo Blaxell wrote: > >> I have to point out that in any case the extent is physically > >> interrupted at the disk-stripe size. Assuming disk-stripe=64KB, if > >> you want to write 128KB, the first half is written in the first disk, > >> the other in the 2nd disk. If you want to write 96kb, the first 64 > >> are written in the first disk, the last part in the 2nd, only on a > >> different BG. > > The "only on a different BG" part implies something expensive, either > > a seek or a new erase page depending on the hardware. Without that, > > nearby logical blocks are nearby physical blocks as well. > > In any case it happens on a different disk No it doesn't. The small-BG could be on the same disk(s) as the big-BG. > >> So yes there is a fragmentation from a logical point of view; from a > >> physical point of view the data is spread on the disks in any case. > > > What matters is the extent-tree point of view. There is (currently) > > no fragmentation there, even for RAID5/6. The extent tree is unaware > > of RAID5/6 (to its peril). > > Before you pointed out that the non-contiguous block written has > an impact on performance. I am replaying that the switching from a > different BG happens at the stripe-disk boundary, so in any case the > block is physically interrupted and switched to another disk The difference is that the write is switched to a different local address on the disk. It's not "another" disk if it's a different BG. Recall in this plan there is a full-width BG that is on _every_ disk, which means every small-width BG shares a disk with the full-width BG. Every extent tail write requires a seek on a minimum of two disks in the array for raid5, three disks for raid6. A tail that is strip-width minus one will hit N - 1 disks twice in an N-disk array. 
> However yes: from an extent-tree point of view there will be an increase > of number extents, because the end of the writing is allocated to > another BG (if the size is not stripe-boundary) > > > If an application does a loop writing 68K then fsync(), the multiple-BG > > solution adds two seeks to read every 68K. That's expensive if sequential > > read bandwidth is more scarce than free space. > > Why you talk about an additional seeks? In any case (even without the > additional BG) the read happens from another disks See above: not another disk, usually a different location on two or more of the same disks. > >> * c),d),e) are applied only for the tail of the extent, in case the > > size is less than the stripe size. > > > > It's only necessary to split an extent if there are no other writes > > in the same transaction that could be combined with the extent tail > > into a single RAID stripe. As long as everything in the RAID stripe > > belongs to a single transaction, there is no write hole > > May be that a more "simpler" optimization would be close the transaction > when the data reach the stripe boundary... But I suspect that it is > not so simple to implement. Transactions exist in btrfs to batch up writes into big contiguous extents already. The trick is to _not_ do that when one transaction ends and the next begins, i.e. leave a space at the end of the partially-filled stripe so that the next transaction begins in an empty stripe. This does mean that there will only be extra seeks during transaction commit and fsync()--which were already very seeky to begin with. It's not necessary to write a partial stripe when there are other extents to combine. So there will be double the amount of seeking, but depending on the workload, it could double a very small percentage of writes. > > Not for d. 
Balance doesn't know how to get rid of unreachable blocks > > in extents (it just moves the entire extent around) so after a balance > > the writes would still be rounded up to the stripe size. Balance would > > never be able to free the rounded-up space. That space would just be > > gone until the file was overwritten, deleted, or defragged. > > If balance is capable to move the extent, why not place one near the > other during a balance ? The goal is not to limit the the writing of > the end of a extent, but avoid writing the end of an extent without > further data (e.g. the gap to the stripe has to be filled in the > same transaction) That's plan f (leave gaps in RAID stripes empty). Balance will repack short extents into RAID stripes nicely. Plan d can't do that because plan d overallocates the extent so that the extent fills the stripe (only some of the extent is used for data). Small but important difference. > BR > G.Baroncelli
Re: Status of RAID5/6
On Tue, Apr 03, 2018 at 07:03:06PM +0200, Goffredo Baroncelli wrote: > On 04/03/2018 02:31 AM, Zygo Blaxell wrote: > > On Mon, Apr 02, 2018 at 06:23:34PM -0400, Zygo Blaxell wrote: > >> On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote: > >>> On 2018-04-02 11:18, Goffredo Baroncelli wrote: > >>>> I thought that a possible solution is to create BG with different > >>> number of data disks. E.g. supposing to have a raid 6 system with 6 > >>> disks, where 2 are parity disk; we should allocate 3 BG > >>>> BG #1: 1 data disk, 2 parity disks > >>>> BG #2: 2 data disks, 2 parity disks, > >>>> BG #3: 4 data disks, 2 parity disks > >>>> > >>>> For simplicity, the disk-stripe length is assumed = 4K. > >>>> > >>>> So If you have a write with a length of 4 KB, this should be placed > >>> in BG#1; if you have a write with a length of 4*3KB, the first 8KB, > >>> should be placed in in BG#2, then in BG#1. > >>>> This would avoid space wasting, even if the fragmentation will > >>> increase (but shall the fragmentation matters with the modern solid > >>> state disks ?). > >> I don't really see why this would increase fragmentation or waste space. > > > Oh, wait, yes I do. If there's a write of 6 blocks, we would have > > to split an extent between BG #3 (the first 4 blocks) and BG #2 (the > > remaining 2 blocks). It also flips the usual order of "determine size > > of extent, then allocate space for it" which might require major surgery > > on the btrfs allocator to implement. > > I have to point out that in any case the extent is physically > interrupted at the disk-stripe size. Assuming disk-stripe=64KB, if > you want to write 128KB, the first half is written in the first disk, > the other in the 2nd disk. If you want to write 96kb, the first 64 > are written in the first disk, the last part in the 2nd, only on a > different BG. The "only on a different BG" part implies something expensive, either a seek or a new erase page depending on the hardware. 
Without that, nearby logical blocks are nearby physical blocks as well. > So yes there is a fragmentation from a logical point of view; from a > physical point of view the data is spread on the disks in any case. What matters is the extent-tree point of view. There is (currently) no fragmentation there, even for RAID5/6. The extent tree is unaware of RAID5/6 (to its peril). ZFS makes its thing-like-the-extent-tree aware of RAID5/6, and it can put a stripe of any size anywhere. If we're going to do that in btrfs, you might as well just do what ZFS does. OTOH, variable-size block groups give us read-compatibility with old kernel versions (and write-compatibility for that matter--a kernel that didn't know about the BG separation would just work but have write hole). If an application does a loop writing 68K then fsync(), the multiple-BG solution adds two seeks to read every 68K. That's expensive if sequential read bandwidth is more scarce than free space. > In any case, you are right, we should gather some data, because the > performance impact are no so clear. > > I am not worried abut having different BG; we have problem with these > because we never developed tool to handle this issue properly (i.e. a > daemon which starts a balance when needed). But I hope that this will > be solved in future. Balance daemons are easy to the point of being trivial to write in Python. The balancing itself is quite expensive and invasive: can't usefully ionice it, can only abort it on block group boundaries, can't delete snapshots while it's running. If balance could be given a vrange that was the size of one extent...then we could talk about daemons. > In any case, the all solutions proposed have their trade off: > > - a) as is: write hole bug > - b) variable stripe size (like ZFS): big impact on how btrfs handle > the extent. limited waste of space > - c) logging data before writing: we wrote the data two times in a > short time window. 
Moreover the log area is written several orders of magnitude more than the other area; there was some patch around > - d) rounding the writing to the stripe size: waste of space; simple > to implement; > - e) different BG with different stripe size: limited waste of space; > logical fragmentation. Also: - f) avoiding writes to partially filled stripes: free space fragmentation; simple to implement (ssd_spread does it accidentally) The difference between d) and f) is that d) allocates the space to the extent while f) leaves the space unallocated, but skips any free space fragments smaller than the stripe size when allocating.
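The multiple-BG splitting in option e) is easy to see in a toy sketch (illustrative Python with invented names; the widths 4/2/1 mirror the BG #3/#2/#1 example earlier in the thread):

```python
def split_write(nblocks, widths=(4, 2, 1)):
    """Split a write of nblocks (disk-stripe-sized units) into pieces
    that each exactly fill a stripe in a block group of the given data
    width, trying the widest block group first."""
    pieces = []
    for width in widths:
        while nblocks >= width:
            pieces.append(width)   # one full stripe in the width-wide BG
            nblocks -= width
    return pieces

# A 6-block write splits across BG #3 and BG #2:
assert split_write(6) == [4, 2]
# A 7-block write touches all three BGs (two extra extent boundaries):
assert split_write(7) == [4, 2, 1]
```

Every piece fills its stripe completely, so no stripe is ever shared between transactions (no write hole), but each piece becomes a separate extent: that is the "logical fragmentation" trade-off discussed above.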
Re: Status of RAID5/6
On Mon, Apr 02, 2018 at 06:23:34PM -0400, Zygo Blaxell wrote: > On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote: > > On 2018-04-02 11:18, Goffredo Baroncelli wrote: > > > I thought that a possible solution is to create BG with different > > number of data disks. E.g. supposing to have a raid 6 system with 6 > > disks, where 2 are parity disk; we should allocate 3 BG > > > > > > BG #1: 1 data disk, 2 parity disks > > > BG #2: 2 data disks, 2 parity disks, > > > BG #3: 4 data disks, 2 parity disks > > > > > > For simplicity, the disk-stripe length is assumed = 4K. > > > > > > So If you have a write with a length of 4 KB, this should be placed > > in BG#1; if you have a write with a length of 4*3KB, the first 8KB, > > should be placed in in BG#2, then in BG#1. > > > > > > This would avoid space wasting, even if the fragmentation will > > increase (but shall the fragmentation matters with the modern solid > > state disks ?). > > I don't really see why this would increase fragmentation or waste space. Oh, wait, yes I do. If there's a write of 6 blocks, we would have to split an extent between BG #3 (the first 4 blocks) and BG #2 (the remaining 2 blocks). It also flips the usual order of "determine size of extent, then allocate space for it" which might require major surgery on the btrfs allocator to implement. If we round that write up to 8 blocks (so we can put both pieces in BG #3), it degenerates into the "pretend partially filled RAID stripes are completely full" case, something like what ssd_spread already does. That trades less file fragmentation for more free space fragmentation. > The extent size is determined before allocation anyway, all that changes > in this proposal is where those small extents ultimately land on the disk. > > If anything, it might _reduce_ fragmentation since everything in BG #1 > and BG #2 will be of uniform size. > > It does solve write hole (one transaction per RAID stripe). 
> > > Also, you're still going to be wasting space, it's just that less space will > > be wasted, and it will be wasted at the chunk level instead of the block > > level, which opens up a whole new set of issues to deal with, most > > significantly that it becomes functionally impossible without brute-force > > search techniques to determine when you will hit the common-case of -ENOSPC > > due to being unable to allocate a new chunk. > > Hopefully the allocator only keeps one of each size of small block groups > around at a time. The allocator can take significant short cuts because > the size of every extent in the small block groups is known (they are > all the same size by definition). > > When a small block group fills up, the next one should occupy the > most-empty subset of disks--which is the opposite of the usual RAID5/6 > allocation policy. This will probably lead to "interesting" imbalances > since there are now two allocators on the filesystem with different goals > (though it is no worse than -draid5 -mraid1, and I had no problems with > free space when I was running that). > > There will be an increase in the amount of allocated but not usable space, > though, because now the amount of free space depends on how much data > is batched up before fsync() or sync(). Probably best to just not count > any space in the small block groups as 'free' in statvfs terms at all. > > There are a lot of variables implied there. Without running some > simulations I have no idea if this is a good idea or not. > > > > Time to time, a re-balance should be performed to empty the BG #1, > > and #2. Otherwise a new BG should be allocated. > > That shouldn't be _necessary_ (the filesystem should just allocate > whatever BGs it needs), though it will improve storage efficiency if it > is done. 
> > > > The cost should be comparable to the logging/journaling (each > > data shorter than a full-stripe, has to be written two times); the > > implementation should be quite easy, because already NOW btrfs support > > BG with different set of disks. > signature.asc Description: PGP signature
Re: Status of RAID5/6
On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote: > On 2018-04-02 11:18, Goffredo Baroncelli wrote: > > On 04/02/2018 07:45 AM, Zygo Blaxell wrote: > > [...] > > > It is possible to combine writes from a single transaction into full > > > RMW stripes, but this *does* have an impact on fragmentation in btrfs. > > > Any partially-filled stripe is effectively read-only and the space within > > > it is inaccessible until all data within the stripe is overwritten, > > > deleted, or relocated by balance. > > > > > > btrfs could do a mini-balance on one RAID stripe instead of a RMW stripe > > > update, but that has a significant write magnification effect (and before > > > kernel 4.14, non-trivial CPU load as well). > > > > > > btrfs could also just allocate the full stripe to an extent, but emit > > > only extent ref items for the blocks that are in use. No fragmentation > > > but lots of extra disk space used. Also doesn't quite work the same > > > way for metadata pages. > > > > > > If btrfs adopted the ZFS approach, the extent allocator and all higher > > > layers of the filesystem would have to know about--and skip over--the > > > parity blocks embedded inside extents. Making this change would mean > > > that some btrfs RAID profiles start interacting with stuff like balance > > > and compression which they currently do not. It would create a new > > > block group type and require an incompatible on-disk format change for > > > both reads and writes. > > > > I thought that a possible solution is to create BG with different > number of data disks. E.g. supposing to have a raid 6 system with 6 > disks, where 2 are parity disk; we should allocate 3 BG > > > > BG #1: 1 data disk, 2 parity disks > > BG #2: 2 data disks, 2 parity disks, > > BG #3: 4 data disks, 2 parity disks > > > > For simplicity, the disk-stripe length is assumed = 4K. 
> > > > So If you have a write with a length of 4 KB, this should be placed > in BG#1; if you have a write with a length of 4*3KB, the first 8KB, > should be placed in in BG#2, then in BG#1. > > > > This would avoid space wasting, even if the fragmentation will > increase (but shall the fragmentation matters with the modern solid > state disks ?). I don't really see why this would increase fragmentation or waste space. The extent size is determined before allocation anyway, all that changes in this proposal is where those small extents ultimately land on the disk. If anything, it might _reduce_ fragmentation since everything in BG #1 and BG #2 will be of uniform size. It does solve write hole (one transaction per RAID stripe). > Also, you're still going to be wasting space, it's just that less space will > be wasted, and it will be wasted at the chunk level instead of the block > level, which opens up a whole new set of issues to deal with, most > significantly that it becomes functionally impossible without brute-force > search techniques to determine when you will hit the common-case of -ENOSPC > due to being unable to allocate a new chunk. Hopefully the allocator only keeps one of each size of small block groups around at a time. The allocator can take significant short cuts because the size of every extent in the small block groups is known (they are all the same size by definition). When a small block group fills up, the next one should occupy the most-empty subset of disks--which is the opposite of the usual RAID5/6 allocation policy. This will probably lead to "interesting" imbalances since there are now two allocators on the filesystem with different goals (though it is no worse than -draid5 -mraid1, and I had no problems with free space when I was running that). There will be an increase in the amount of allocated but not usable space, though, because now the amount of free space depends on how much data is batched up before fsync() or sync(). 
Probably best to just not count any space in the small block groups as 'free' in statvfs terms at all. There are a lot of variables implied there. Without running some simulations I have no idea if this is a good idea or not. > > Time to time, a re-balance should be performed to empty the BG #1, > and #2. Otherwise a new BG should be allocated. That shouldn't be _necessary_ (the filesystem should just allocate whatever BGs it needs), though it will improve storage efficiency if it is done. > > The cost should be comparable to the logging/journaling (each > data shorter than a full-stripe, has to be written two times); the > implementation should be quite easy, because already NOW btrfs support > BG with different set of disks. signature.asc Description: PGP signature
Re: Status of RAID5/6
On Sun, Apr 01, 2018 at 03:11:04PM -0600, Chris Murphy wrote: > (I hate it when my palm rubs the trackpad and hits send prematurely...) > > > On Sun, Apr 1, 2018 at 2:51 PM, Chris Murphy wrote: > > >> Users can run scrub immediately after _every_ unclean shutdown to > >> reduce the risk of inconsistent parity and unrecoverable data should > >> a disk fail later, but this can only prevent future write hole events, > >> not recover data lost during past events. > > > > Problem is, Btrfs assumes a leaf is correct if it passes checksum. And > > such a leaf containing EXTENT_CSUM means that EXTENT_CSUM is assumed to be correct. But in fact it could > be stale. It's just as possible the metadata and superblock update is > what's missing due to the interruption, while both data and parity > stripe writes succeeded. The window for either the data or parity write > to fail is way shorter of a time interval than that of the numerous > metadata writes, followed by superblock update. csums cannot be wrong due to write interruption. The data and metadata blocks are written first, then barrier, then superblock updates pointing to the data and csums previously written in the same transaction. Unflushed data is not included in the metadata. If there is a write interruption then the superblock update doesn't occur and btrfs reverts to the previous unmodified data+csum trees. This works on non-raid5/6 because all the writes that make up a single transaction are ordered and independent, and no data from older transactions is modified during any tree update. On raid5/6 every RMW operation modifies data from old transactions by creating data/parity inconsistency. If there was no data in the stripe from an old transaction, the operation would be just a write, no read and modify. In the write hole case, the csum *is* correct; it is the data that is wrong. > In such a case, the > old metadata is what's pointed to, including EXTENT_CSUM.
Therefore > your scrub would always show csum error, even if both data and parity > are correct. You'd have to init-csum in this case, I suppose. No, the csums are correct. The data does not match the csum because the data is corrupted. Assuming barriers work on your disk, and you're not having some kind of direct IO data consistency bug, and you can read the csum tree at all, then the csums are correct, even with write hole. When write holes and other write interruption patterns affect the csum tree itself, this results in parent transid verify failures, csum tree page csum failures, or both. This forces the filesystem read-only so it's easy to spot when it happens. Note that the data blocks with wrong csum from raid5/6 reconstruction after a write hole event always belong to _old_ transactions damaged by the write hole. If the writes are interrupted, the new data blocks in a RMW stripe will not be committed and will have no csums to verify, so they can't have _wrong_ csums. The old data blocks do not have their csum changed by the write hole (the csum is stored on a separate tree in a different block group) so the csums are intact. When a write hole event corrupts the data reconstruction on a degraded array, the csum doesn't match because the csum is correct and the data is not. > Pretty much it's RMW with a (partial) stripe overwrite upending COW, > and therefore upending the atomicity, and thus consistency of Btrfs in > the raid56 case where any portion of the transaction is interrupted. Not any portion, only the RMW stripe update can produce data loss due to write interruption (well, that, and fsync() log-tree replay bugs). If any other part of the transaction is interrupted then btrfs recovers just fine with its COW tree update algorithm and write barriers. > And this is amplified if metadata is also raid56. Data and metadata are mangled the same way. The difference is the impact. 
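The ordering argument above (data and csum blocks first, then a barrier, then the superblock) can be modeled in a few lines. This is an illustrative toy, not btrfs code; all names are invented:

```python
class ToyCow:
    """Toy model of btrfs commit ordering: new data and csum blocks are
    written to unused locations, then (after a write barrier) the
    superblock is pointed at the new tree.  A crash before the
    superblock write leaves the old tree fully intact."""
    def __init__(self):
        self.blocks = {}          # location -> committed bytes
        self.super = None         # location of current tree root
        self.next_loc = 0

    def commit(self, data, crash_before_super=False):
        loc = self.next_loc
        self.next_loc += 1
        self.blocks[loc] = data   # data + csums land on disk first
        # ... write barrier here ...
        if crash_before_super:
            return                # superblock never updated
        self.super = loc          # atomic superblock update

    def read(self):
        return self.blocks.get(self.super)

fs = ToyCow()
fs.commit(b"old")
fs.commit(b"new", crash_before_super=True)
assert fs.read() == b"old"   # the old consistent state survives the crash
```

On raid1/10/single this model holds because no committed block is ever overwritten in place; raid5/6 RMW breaks the model by modifying stripes that already hold committed data.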
btrfs tolerates exactly 0 bits of damaged metadata after RAID recovery, and enforces this intolerance with metadata transids and csums, so write hole on metadata _always_ breaks the filesystem. > ZFS avoids the problem at the expense of probably a ton of > fragmentation, by taking e.g. 4KiB RMW and writing a full length > stripe of 8KiB fully COW, rather than doing stripe modification with > an overwrite. And that's because it has dynamic stripe lengths. I think that's technically correct but could be clearer. ZFS never does RMW. It doesn't need to. Parity blocks are allocated at the extent level and RAID stripes are built *inside* the extents (or "groups of contiguous blocks written in a single transaction" which seems to be the closest ZFS equivalent of the btrfs extent concept). Since every ZFS RAID stripe is bespoke sized to exactly fit a single write operation, no two ZFS transactions can ever share a RAID stripe. No transactions sharing a stripe means no write hole. There is no impact on fragmentation on ZFS--space is
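The bespoke-stripe arithmetic described above can be sketched roughly as follows (function name and the 6-disk raid6-like geometry are illustrative assumptions, not ZFS internals):

```python
import math

def zfs_style_layout(nblocks, ndisks=6, nparity=2):
    """Each write gets its own stripes, sized to fit: rows of up to
    (ndisks - nparity) data blocks, each row carrying its own parity
    blocks allocated inside the extent.  No stripe is ever shared
    between writes, so there is no RMW and no write hole."""
    width = ndisks - nparity
    rows = math.ceil(nblocks / width)
    return {"data": nblocks, "parity": rows * nparity,
            "total": nblocks + rows * nparity}

# a single-block write still pays full parity for its one-row stripe:
assert zfs_style_layout(1) == {"data": 1, "parity": 2, "total": 3}
```

The cost shows up as space overhead on small writes (here 2 parity blocks for 1 data block) rather than as a consistency hazard.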
Re: Status of RAID5/6
On Sat, Mar 31, 2018 at 04:34:58PM -0600, Chris Murphy wrote: > On Sat, Mar 31, 2018 at 12:57 AM, Goffredo Baroncelli > <kreij...@inwind.it> wrote: > > On 03/31/2018 07:03 AM, Zygo Blaxell wrote: > >>>> btrfs has no optimization like mdadm write-intent bitmaps; recovery > >>>> is always a full-device operation. In theory btrfs could track > >>>> modifications at the chunk level but this isn't even specified in the > >>>> on-disk format, much less implemented. > >>> It could go even further; it would be sufficient to track which > >>> *partial* stripes update will be performed before a commit, in one > >>> of the btrfs logs. Then in case of a mount of an unclean filesystem, > >>> a scrub on these stripes would be sufficient. > > > >> A scrub cannot fix a raid56 write hole--the data is already lost. > >> The damaged stripe updates must be replayed from the log. > > > > Your statement is correct, but you doesn't consider the COW nature of btrfs. > > > > The key is that if a data write is interrupted, all the transaction is > > interrupted and aborted. And due to the COW nature of btrfs, the "old > > state" is restored at the next reboot. > > > > What is needed in any case is rebuild of parity to avoid the "write-hole" > > bug. > > Write hole happens on disk in Btrfs, but the ensuing corruption on > rebuild is detected. Corrupt data never propagates. Data written with nodatasum or nodatacow is corrupted without detection (same as running ext3/ext4/xfs on top of mdadm raid5 without a parity journal device). Metadata always has csums, and files have checksums if they are created with default attributes and mount options. Those cases are covered, any corrupted data will give EIO on reads (except once per 4 billion blocks, where the corrupted CRC matches at random). > The problem is that Btrfs gives up when it's detected. 
Before recent kernels (4.14 or 4.15) btrfs would not attempt all possible combinations of recovery blocks for raid6, and earlier kernels than those would not recover correctly for raid5 either. I think this has all been fixed in recent kernels but I haven't tested these myself so don't quote me on that. Other than that, btrfs doesn't give up in the write hole case. It rebuilds the data according to the raid5/6 parity algorithm, but the algorithm doesn't produce correct data for interrupted RMW writes when there is no stripe update journal. There is nothing else to try at that point. By the time the error is detected the opportunity to recover the data has long passed. The data that comes out of the recovery algorithm is a mixture of old and new data from the filesystem. The "new" data is something that was written just before a failure, but the "old" data could be data of any age, even a block of free space, that previously existed on the filesystem. If you bypass the EIO from the failing csums (e.g. by using btrfs rescue) it will appear as though someone took the XOR of pairs of random blocks from the disk and wrote it over one of the data blocks at random. When this happens to btrfs metadata, it is effectively a fuzz tester for tools like 'btrfs check' which will often splat after a write hole failure happens. > If it assumes just a bit flip - not always a correct assumption but > might be reasonable most of the time, it could iterate very quickly. That is not how write hole works (or csum recovery for that matter). Write hole producing a single bit flip would occur extremely rarely outside of contrived test cases. Recall that in a write hole, one or more 4K blocks are updated on some of the disks in a stripe, but other blocks retain their original values from prior to the update. This is OK as long as all disks are online, since the parity can be ignored or recomputed from the data blocks. 
It is also OK if the writes on all disks are completed without interruption, since the data and parity eventually become consistent when all writes complete as intended. It is also OK if the entire stripe is written at once, since then there is only one transaction referring to the stripe, and if that transaction is not committed then the content of the stripe is irrelevant. The write hole error event is when all of the following occur: - a stripe containing committed data from one or more btrfs transactions is modified by raid5/6 RMW update in a new transaction. This is the usual case on a btrfs filesystem with the default, 'nossd' or 'ssd' mount options. - the write is not completed (due to crash, power failure, disk failure, bad sector, SCSI timeout, bad cable, firmware bug, etc), so the parity block is out of sync with modified data blocks (before or af
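The failure sequence above can be reproduced with plain XOR arithmetic. This is a toy 2-data+1-parity raid5 stripe with 4-byte "blocks"; all names are invented:

```python
def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

# A stripe holding committed data from an old transaction:
d0_old = b"\x11" * 4
d1_old = b"\x22" * 4
parity = xor(d0_old, d1_old)

# RMW update of d1 is interrupted: the new d1 reaches disk, the
# matching parity write does not (crash between the two).
d1_new = b"\x33" * 4
# parity is now stale: it still equals d0_old ^ d1_old

# Later the disk holding d0 fails; reconstruction uses stale parity:
d0_rebuilt = xor(d1_new, parity)

assert d0_rebuilt != d0_old   # old committed data destroyed
# btrfs detects this via d0's (still correct) csum and returns EIO,
# but the data itself is gone.
```

Note that the destroyed block (d0) was never part of the interrupted write: that is why write hole damage hits data of arbitrary age.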
Re: Status of RAID5/6
On Sat, Mar 31, 2018 at 11:36:50AM +0300, Andrei Borzenkov wrote: > 31.03.2018 11:16, Goffredo Baroncelli wrote: > > On 03/31/2018 09:43 AM, Zygo Blaxell wrote: > >>> The key is that if a data write is interrupted, all the transaction > >>> is interrupted and aborted. And due to the COW nature of btrfs, the > >>> "old state" is restored at the next reboot. > > > >> This is not presently true with raid56 and btrfs. RAID56 on btrfs uses > >> RMW operations which are not COW and don't provide any data integrity > >> guarantee. Old data (i.e. data from very old transactions that are not > >> part of the currently written transaction) can be destroyed by this. > > > > Could you elaborate a bit ? > > > > Generally speaking, updating a part of a stripe require a RMW cycle, because > > - you need to read all data stripe (with parity in case of a problem) > > - then you should write > > - the new data > > - the new parity (calculated on the basis of the first read, and the > > new data) > > > > However the "old" data should be untouched; or you are saying that the > > "old" data is rewritten with the same data ? > > > > If old data block becomes unavailable, it can no more be reconstructed > because old content of "new data" and "new parity" blocks are lost. > Fortunately if checksum is in use it does not cause silent data > corruption but it effectively means data loss. > > Writing of data belonging to unrelated transaction affects previous > transactions precisely due to RMW cycle. This fundamentally violates > btrfs claim of always having either old or new consistent state. Correct. To fix this, any RMW stripe update on raid56 has to be written to a log first. All RMW updates must be logged because a disk failure could happen at any time. Full stripe writes don't need to be logged because all the data in the stripe belongs to the same transaction, so if a disk fails the entire stripe is either committed or it is not.
One way to avoid the logging is to change the btrfs allocation parameters so that the filesystem doesn't allocate data in RAID stripes that are already occupied by data from older transactions. This is similar to what 'ssd_spread' does, although the ssd_spread option wasn't designed for this and won't be effective on large arrays. This avoids modifying stripes that contain old committed data, but it also means the free space on the filesystem will become heavily fragmented over time. Users will have to run balance *much* more often to defragment the free space. signature.asc Description: PGP signature
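That allocation policy can be sketched as follows (hypothetical helper for illustration; the real btrfs allocator is far more involved, and ssd_spread only approximates this behavior):

```python
def stripe_aligned_alloc(free_extents, nblocks, stripe=4):
    """Pick free space only from stripe-aligned, fully-empty regions,
    rounding each allocation up to a whole number of stripes.  This
    avoids RMW of stripes holding committed data, at the cost of
    leaving sub-stripe free fragments unusable until a balance."""
    need = -(-nblocks // stripe) * stripe        # round up to stripe size
    for start, length in free_extents:
        # skip fragments that are unaligned or smaller than a stripe run
        if start % stripe == 0 and length >= need:
            return start, need
    return None   # free space exists but is too fragmented to use

# (start, length) free extents in stripe-sized units:
assert stripe_aligned_alloc([(2, 8), (8, 12)], 5) == (8, 8)
```

The `None` case is exactly the free-space-fragmentation problem: the filesystem may report plenty of free blocks while no stripe-aligned run is large enough, which is why much more frequent balancing is needed.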
Re: Status of RAID5/6
On Sat, Mar 31, 2018 at 08:57:18AM +0200, Goffredo Baroncelli wrote: > On 03/31/2018 07:03 AM, Zygo Blaxell wrote: > >>> btrfs has no optimization like mdadm write-intent bitmaps; recovery > >>> is always a full-device operation. In theory btrfs could track > >>> modifications at the chunk level but this isn't even specified in the > >>> on-disk format, much less implemented. > >> It could go even further; it would be sufficient to track which > >> *partial* stripes update will be performed before a commit, in one > >> of the btrfs logs. Then in case of a mount of an unclean filesystem, > >> a scrub on these stripes would be sufficient. > > > A scrub cannot fix a raid56 write hole--the data is already lost. > > The damaged stripe updates must be replayed from the log. > > Your statement is correct, but you doesn't consider the COW nature of btrfs. > > The key is that if a data write is interrupted, all the transaction > is interrupted and aborted. And due to the COW nature of btrfs, the > "old state" is restored at the next reboot. This is not presently true with raid56 and btrfs. RAID56 on btrfs uses RMW operations which are not COW and don't provide any data integrity guarantee. Old data (i.e. data from very old transactions that are not part of the currently written transaction) can be destroyed by this. > What is needed in any case is rebuild of parity to avoid the > "write-hole" bug. And this is needed only for a partial stripe > write. For a full stripe write, due to the fact that the commit is > not flushed, it is not needed the scrub at all. > > Of course for the NODATACOW file this is not entirely true; but I > don't see the gain to switch from the cost of COW to the cost of a log. > > The above sentences are correct (IMHO) if we don't consider a power > failure+device missing case. However in this case even logging the > "new data" would be not sufficient. 
> > BR > G.Baroncelli > > -- > gpg @keyserver.linux.it: Goffredo Baroncelli > Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 signature.asc Description: PGP signature
Re: Status of RAID5/6
On Fri, Mar 30, 2018 at 06:14:52PM +0200, Goffredo Baroncelli wrote: > On 03/29/2018 11:50 PM, Zygo Blaxell wrote: > > On Wed, Mar 21, 2018 at 09:02:36PM +0100, Christoph Anton Mitterer wrote: > >> Hey. > >> > >> Some things would IMO be nice to get done/clarified (i.e. documented in > >> the Wiki and manpages) from users'/admin's POV: > [...] > > > > btrfs has no optimization like mdadm write-intent bitmaps; recovery > > is always a full-device operation. In theory btrfs could track > > modifications at the chunk level but this isn't even specified in the > > on-disk format, much less implemented. > > It could go even further; it would be sufficient to track which > *partial* stripes update will be performed before a commit, in one > of the btrfs logs. Then in case of a mount of an unclean filesystem, > a scrub on these stripes would be sufficient. A scrub cannot fix a raid56 write hole--the data is already lost. The damaged stripe updates must be replayed from the log. A scrub could fix raid1/raid10 partial updates but only if the filesystem can reliably track which blocks failed to be updated by the disconnected disks. It would be nice if scrub could be filtered the same way balance is, e.g. only certain block ranges, or only metadata blocks; however, this is not presently implemented. > BR > G.Baroncelli > > [...] > > > -- > gpg @keyserver.linux.it: Goffredo Baroncelli > Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html signature.asc Description: PGP signature
Re: Status of RAID5/6
On Fri, Mar 30, 2018 at 09:21:00AM +0200, Menion wrote: > Thanks for the detailed explanation. I think that a summary of this > should go in the btrfs raid56 wiki status page, because now it is > completely inconsistent and if a user comes there, he may get > the impression that the raid56 is just broken > Still I have the 1 billion dollar question: from your word I understand > that even in RAID56 the metadata are spread on the devices in a complex > way, but shall I assume that the array can survive the sudden death > of one (two for raid6) HDD in the array? I wouldn't assume that. There is still the write hole, and while there is a small probability of having a write hole failure, it's a probability that applies on *every* write in degraded mode, and since disks can fail at any time, the array can enter degraded mode at any time. It's similar to lottery tickets--buy one ticket, you probably won't win, but if you buy millions of tickets, you'll claim the prize eventually. The "prize" in this case is a severely damaged, possibly unrecoverable filesystem. If the data is raid5 and the metadata is raid1, the filesystem can survive a single disk failure easily; however, some of the data may be lost if writes to the remaining disks are interrupted by a system crash or power failure and the write hole issue occurs. Note that the damage is not necessarily limited to recently written data--it's any random data that is merely located adjacent to written data on the filesystem. I wouldn't use raid6 until the write hole issue is resolved. There is no configuration where two disks can fail and metadata can still be updated reliably. Some users use the 'ssd_spread' mount option to reduce the probability of write hole failure, which happens to be helpful by accident on some array configurations, but it has a fairly high cost when the array is not degraded due to all the extra balancing required. > Bye signature.asc Description: PGP signature
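The lottery arithmetic is just the complement rule: for an assumed per-write failure probability p, the chance of at least one write-hole event over n degraded writes is 1-(1-p)^n. The probability value below is purely illustrative:

```python
def p_write_hole(p_per_write, n_writes):
    """Chance of at least one write-hole event over n_writes degraded
    writes, given an assumed (illustrative) per-write probability."""
    return 1 - (1 - p_per_write) ** n_writes

# one-in-a-million risk per write, a million writes while degraded:
risk = p_write_hole(1e-6, 10**6)   # ~0.63
```

Even a one-in-a-million per-write risk becomes a roughly 63% chance of hitting the "prize" over a million writes in degraded mode.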
Re: Status of RAID5/6
On Wed, Mar 21, 2018 at 09:02:36PM +0100, Christoph Anton Mitterer wrote: > Hey. > > Some things would IMO be nice to get done/clarified (i.e. documented in > the Wiki and manpages) from users'/admin's POV: > > Some basic questions: I can answer some easy ones: > - compression+raid? There is no interaction between compression and raid. They happen on different data trees at different levels of the stack. So if the raid works, compression does too. > - rebuild / replace of devices? "replace" needs raid-level-specific support. If the raid level doesn't support replace, then users have to do device add followed by device delete, which is considerably (orders of magnitude) slower. > - changing raid lvls? btrfs uses a brute-force RAID conversion algorithm which always works, but takes zero short cuts. e.g. there is no speed optimization implemented for cases like "convert 2-disk raid1 to 1-disk single" which can be very fast in theory. The worst-case running time is the only running time available in btrfs. Also, users have to understand how the different raid allocators work to understand their behavior in specific situations. Without this understanding, the set of restrictions that pop up in practice can seem capricious and arbitrary. e.g. after adding 1 disk to a nearly-full raid1, full balance is required to make the new space available, but adding 2 disks makes all the free space available immediately. Generally it always works if you repeatedly run full-balances in a loop until you stop running out of space, but again, this is the worst case. > - anything to consider with raid when doing snapshots, send/receive > or defrag? Snapshot deletes cannot run at the same time as RAID convert/device delete/device shrink/resize. If one is started while the other is running, it will be blocked until the other finishes. Internally these operations block each other on a mutex. I don't know if snapshot deletes interact with device replace (the case has never come up for me). 
I wouldn't expect it to as device replace is more similar to scrub than balance, and scrub has no such interaction. Also note you can only run one balance, device shrink, or device delete at a time. If you start one of these three operations while another is already running, the new request is rejected immediately. As far as I know there are no other restrictions. > => and for each of these: for which raid levels? Most of those features don't interact with anything specific to a raid layer, so they work on all raid levels. Device replace is the exception: all RAID levels in use on the filesystem must support it, or the user must use device add and device delete instead. [Aside: I don't know if any RAID levels that do not support device replace still exist, which makes my answer longer than it otherwise would be] > Perhaps also confirmation for previous issues: > - I vaguely remember there were issues with either device delete or > replace and that one of them was possibly super-slow? Device replace is faster than device delete. Replace does not modify any metadata, while delete rewrites all the metadata referring to the removed device. Delete can be orders of magnitude slower than expected because of the metadata modifications required. > - I also remember there were cases in which a fs could end up in > permanent read-only state? Any unrecovered metadata error 1 bit or larger will do that. RAID level is relevant only in terms of how well it can recover corrupted or unreadable metadata blocks. > - Clarifying questions on what is expected to work and how things are > expected to behave, e.g.: > - Can one plug a device (without deleting/removing it first) just > under operation and will btrfs survive it? On raid1 and raid10, yes. On raid5/6 you will be at risk of write hole problems if the filesystem is modified while the device is unplugged. If the device is later reconnected, you should immediately scrub to bring the metadata on the devices back in sync. 
Data written to the filesystem while the device was offline will be corrected if the csum is different on the removed device. If there is no csum, data will be silently corrupted. If the csum is correct, but the data is not (this occurs with 2^-32 probability on random data where the CRC happens to be identical) then the data will be silently corrupted. A full replace of the removed device would be better than a scrub, as that will get a known good copy of the data. If the device is offline for a long time, it should be wiped before being reintroduced to the rest of the array to avoid data integrity issues. It may be necessary to specify a different device name when mounting a filesystem that has had a disk removed and later reinserted until the scrub or replace action above is completed. btrfs has no optimization like mdadm write-intent bitmaps; recovery is always a full-device operation. In theory btrfs could track modifications at the chunk level but this isn't even specified in the on-disk format, much less implemented.
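The 2^-32 figure above is just CRC32's collision rate, and a birthday search finds a concrete collision quickly. This is illustrative code, not btrfs's csum implementation; pseudorandom digests stand in for data blocks:

```python
import hashlib
import zlib

def find_crc32_collision():
    """Birthday-search over pseudorandom 'blocks' for two different
    payloads sharing a CRC32 -- the reason roughly 1 in 2^32 corrupted
    blocks slips past a CRC check.  Expected ~80,000 attempts."""
    seen = {}
    i = 0
    while True:
        block = hashlib.sha256(str(i).encode()).digest()
        c = zlib.crc32(block)
        if c in seen:
            return seen[c], block   # two distinct blocks, same CRC
        seen[c] = block
        i += 1

a, b = find_crc32_collision()
assert a != b and zlib.crc32(a) == zlib.crc32(b)
```

A stronger checksum (btrfs later gained xxhash/blake2/sha256 options) shrinks this window, but with crc32c the once-per-4-billion-blocks caveat is inherent.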
Re: [RFC PATCH v3 0/7] btrfs-progs: Allow normal user to call "subvolume list/show"
On Mon, Mar 19, 2018 at 04:30:17PM +0900, Misono, Tomohiro wrote: > This is a part of RFC I sent last December[1] whose aim is to improve normal > users' usability. > The remaining works of RFC are: > - Allow "sub delete" for empty subvolume I don't mean to scope creep on you, but I have a couple of wishes related to this topic: - allow "rmdir" to remove an empty subvolume, i.e. when a subvolume is detected in rmdir, try switching to subvol delete before returning an error. This lets admin tools that are not btrfs-aware do 'rm -fr' on a user directory when it contains a subvolume. Legacy admin tools (or legacy tools in general) can't remove a subvol, and there is no solution for environments where we can't just fire users who create them. - mount option to restrict "sub create" and "sub snapshot" to root only. If we get "rmdir" working then this is significantly less important. > - Allow "qgroup show" to check quota limit > > [1] https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg70991.html
[PATCH v2] btrfs: remove spurious WARN_ON(ref->count < 0) in find_parent_nodes
Until v4.14, this warning was very infrequent:

 WARNING: CPU: 3 PID: 18172 at fs/btrfs/backref.c:1391 find_parent_nodes+0xc41/0x14e0
 Modules linked in: [...]
 CPU: 3 PID: 18172 Comm: bees Tainted: G D WL 4.11.9-zb64+ #1
 Hardware name: System manufacturer System Product Name/M5A78L-M/USB3, BIOS 210112/02/2014
 Call Trace:
  dump_stack+0x85/0xc2
  __warn+0xd1/0xf0
  warn_slowpath_null+0x1d/0x20
  find_parent_nodes+0xc41/0x14e0
  __btrfs_find_all_roots+0xad/0x120
  ? extent_same_check_offsets+0x70/0x70
  iterate_extent_inodes+0x168/0x300
  iterate_inodes_from_logical+0x87/0xb0
  ? iterate_inodes_from_logical+0x87/0xb0
  ? extent_same_check_offsets+0x70/0x70
  btrfs_ioctl+0x8ac/0x2820
  ? lock_acquire+0xc2/0x200
  do_vfs_ioctl+0x91/0x700
  ? __fget+0x112/0x200
  SyS_ioctl+0x79/0x90
  entry_SYSCALL_64_fastpath+0x23/0xc6
  ? trace_hardirqs_off_caller+0x1f/0x140

Starting with v4.14 (specifically 86d5f9944252 ("btrfs: convert prelimary reference tracking to use rbtrees")) the WARN_ON occurs three orders of magnitude more frequently--almost once per second while running workloads like bees.

Replace the WARN_ON() with a comment giving the rationale for its removal. The rationale is paraphrased from an explanation by Edmund Nadolski <enadol...@suse.de> on the linux-btrfs mailing list.

Fixes: 8da6d5815c59 ("Btrfs: added btrfs_find_all_roots()")
Signed-off-by: Zygo Blaxell <ce3g8...@umail.furryterror.org>
---
v2: Replace WARN_ON with rationale instead of merely deleting it. Trim irrelevant detail from the backtrace. Add Fixes reference. Fix subject line (missing "< 0").
 fs/btrfs/backref.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 7d0dc100a09a..06597c5f9f4b 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -1263,7 +1263,16 @@ static int find_parent_nodes(struct btrfs_trans_handle *trans,
 	while (node) {
 		ref = rb_entry(node, struct prelim_ref, rbnode);
 		node = rb_next(&ref->rbnode);
-		WARN_ON(ref->count < 0);
+		/*
+		 * ref->count < 0 can happen here if there are delayed
+		 * refs with a node->action of BTRFS_DROP_DELAYED_REF.
+		 * prelim_ref_insert() relies on this when merging
+		 * identical refs to keep the overall count correct.
+		 * prelim_ref_insert() will merge only those refs
+		 * which compare identically.  Any refs having
+		 * e.g. different offsets would not be merged,
+		 * and would retain their original ref->count < 0.
+		 */
 		if (roots && ref->count && ref->root_id && ref->parent == 0) {
 			if (sc && sc->root_objectid &&
 			    ref->root_id != sc->root_objectid) {
-- 
2.11.0
Re: [PATCH] btrfs: remove spurious WARN_ON(ref->count) in find_parent_nodes
On Mon, Jan 22, 2018 at 11:34:52AM +0800, Lu Fengqi wrote: > On Sun, Jan 21, 2018 at 02:08:58PM -0500, Zygo Blaxell wrote: > >This warning appears during execution of the LOGICAL_INO ioctl and > >appears to be spurious: > > > > [ cut here ] > > WARNING: CPU: 3 PID: 18172 at fs/btrfs/backref.c:1391 > > find_parent_nodes+0xc41/0x14e0 > > Modules linked in: ib_iser rdma_cm iw_cm ib_cm ib_core configfs > > iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi overlay r8169 ufs qnx4 > > hfsplus hfs minix ntfs vfat msdos fat jfs xfs cpuid rpcsec_gss_krb5 nfsv4 > > nfsv3 nfs fscache algif_skcipher af_alg softdog nfsd auth_rpcgss nfs_acl > > lockd grace sunrpc bnep cpufreq_userspace cpufreq_powersave > > cpufreq_conservative nfnetlink_queue nfnetlink_log nfnetlink bluetooth > > rfkill snd_seq_dummy snd_hrtimer snd_seq_midi snd_seq_oss > > snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device binfmt_misc fuse nbd > > xt_REDIRECT nf_nat_redirect ipt_REJECT nf_reject_ipv4 xt_nat xt_conntrack > > xt_tcpudp nf_log_ipv4 nf_log_common xt_LOG ip6table_nat nf_conntrack_ipv6 > > nf_defrag_ipv6 nf_nat_ipv6 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 > > nf_nat_ipv4 nf_nat nf_conntrack > > ip6table_mangle iptable_mangle ip6table_filter ip6_tables > > iptable_filter ip_tables x_tables tcp_cubic dummy lp dm_crypt edac_mce_amd > > edac_core snd_hda_codec_hdmi ppdev kvm_amd kvm irqbypass crct10dif_pclmul > > crc32_pclmul ghash_clmulni_intel snd_hda_codec_via pcbc amdkfd > > snd_hda_codec_generic amd_iommu_v2 aesni_intel snd_hda_intel radeon > > snd_hda_codec aes_x86_64 snd_hda_core snd_hwdep crypto_simd glue_helper sg > > snd_pcm_oss cryptd input_leds joydev pcspkr serio_raw snd_mixer_oss > > rtc_cmos snd_pcm parport_pc parport shpchp wmi acpi_cpufreq evdev snd_timer > > asus_atk0110 k10temp fam15h_power snd soundcore sp5100_tco hid_generic ipv6 > > af_packet crc_ccitt raid10 raid456 async_raid6_recov async_memcpy async_pq > > async_xor async_tx libcrc32c raid0 multipath linear dm_mod 
raid1 md_mod > > ohci_pci ide_pci_generic > > sr_mod cdrom pdc202xx_new ohci_hcd crc32c_intel atiixp ehci_pci > > psmouse ide_core i2c_piix4 ehci_hcd xhci_pci mii xhci_hcd [last unloaded: > > r8169] > > CPU: 3 PID: 18172 Comm: bees Tainted: G D WL 4.11.9-zb64+ #1 > > Hardware name: System manufacturer System Product Name/M5A78L-M/USB3, > > BIOS 210112/02/2014 > > Call Trace: > > dump_stack+0x85/0xc2 > > __warn+0xd1/0xf0 > > warn_slowpath_null+0x1d/0x20 > > find_parent_nodes+0xc41/0x14e0 > > __btrfs_find_all_roots+0xad/0x120 > > ? extent_same_check_offsets+0x70/0x70 > > iterate_extent_inodes+0x168/0x300 > > iterate_inodes_from_logical+0x87/0xb0 > > ? iterate_inodes_from_logical+0x87/0xb0 > > ? extent_same_check_offsets+0x70/0x70 > > btrfs_ioctl+0x8ac/0x2820 > > ? lock_acquire+0xc2/0x200 > > do_vfs_ioctl+0x91/0x700 > > ? __fget+0x112/0x200 > > SyS_ioctl+0x79/0x90 > > entry_SYSCALL_64_fastpath+0x23/0xc6 > > RIP: 0033:0x7f727b20be07 > > RSP: 002b:7f7279f1e018 EFLAGS: 0246 ORIG_RAX: 0010 > > RAX: ffda RBX: 9c0f4d7f RCX: 7f727b20be07 > > RDX: 7f7279f1e118 RSI: c0389424 RDI: 0003 > > RBP: 0035 R08: 7f72581bf340 R09: > > R10: 0020 R11: 0246 R12: 0040 > > R13: 7f725818d230 R14: 7f7279f1b640 R15: 7f725820 > > ? trace_hardirqs_off_caller+0x1f/0x140 > > ---[ end trace 5de243350f6762c6 ]--- > > [ cut here ] > > > >ref->count can be below zero under normal conditions (for delayed refs), > >so there is no need to spam dmesg when it happens. > > > > Added Edmund. > > Hi, > > I've also encountered the same problem when running the test case > xfstests/btrfs/004. However, I'm not sure whether the negative ref->count > is reasonable. > > IMO, these functions (such as add_delayed_refs, add_delayed_refs, > add_delayed_refs, add_missing_keys and resolve_indirect_refs) have been > executed at this point in time. Hence, these references not only include > these refs in the memory (delayed) but also include those refs in the disk > (inline/keyed). 
I don't have the complete picture, but while looking at other code, comments, and git log messages surrounding ref->count in btrfs, I found: * ref->count starts off at -1 (for a
Re: [PATCH] btrfs: remove spurious WARN_ON(ref->count) in find_parent_nodes
On Mon, Jan 22, 2018 at 09:06:23PM +0800, Lu Fengqi wrote: > On Mon, Jan 22, 2018 at 02:38:42PM +0200, Nikolay Borisov wrote: > > > > > >On 22.01.2018 14:19, Lu Fengqi wrote: > >> On 01/22/2018 04:46 PM, Nikolay Borisov wrote: > >>> > >>> > >>> On 22.01.2018 05:34, Lu Fengqi wrote: > According to my bisect result, The frequency of the warning occurrence > increased to the detectable degree after this patch > >>> > >>> That sentence implies that even before Ed's patch it was possible to > >>> trigger those warnings, is that true? Personally I've never seen such > >>> warnings while executing btrfs/004. How do you configure the filesystem > >>> for the test runs? > >>> > >> > >> Just only default mount option. > >> > >> ➜ xfstests-dev git:(master) for i in $(seq 1 100); do echo $i; if ! > >> sudo ./check btrfs/004; then break; fi; done > >> 1 > >> > >> FSTYP -- btrfs > >> > >> PLATFORM -- Linux/x86_64 sarch 4.15.0-rc9 > >> > >> MKFS_OPTIONS -- /dev/vdd1 > >> > >> MOUNT_OPTIONS -- /dev/vdd1 /mnt/scratch > >> > >> > >> > >> > >> btrfs/004 47s ... 49s > >> > >> Ran: btrfs/004 > >> > >> Passed all 1 tests > >> > >> > >> > >> > >> 2 > >> > >> FSTYP -- btrfs > >> > >> PLATFORM -- Linux/x86_64 sarch 4.15.0-rc9 > >> > >> MKFS_OPTIONS -- /dev/vdd1 > >> > >> MOUNT_OPTIONS -- /dev/vdd1 /mnt/scratch > >> > >> > >> > >> > >> btrfs/004 49s ... 52s > >> > >> _check_dmesg: something found in dmesg (see > >> /home/luke/workspace/xfstests-dev/results//btrfs/004.dmesg) > >> > >> Ran: btrfs/004 > >> > >> Failures: btrfs/004 > >> > >> Failed 1 of 1 tests > >> > >> The probability of this warning appearing is rather low, and I only > >> encountered 52 warnings when I looped 1008 times btrfs/004 for 20 hours > >> in 4.15-rc6 (IOW, the probability is nearly 5%). So you want to trigger > >> warning also need more luck or patience. > > > >Thanks but is this before or after the mentioned commit below? > > > > After this commit. 
> The bisect condition I use to locate this commit is > to repeat btrfs/004 20 times without warning (This may not be accurate enough, > can only be used as a reference). I have been seeing this warning since at least 2015 (v3.18?), possibly earlier. In the past it has never been correlated with any event I've needed to take action to correct (i.e. no data corruption, no crashes, no hangs, no filesystem damage, and no obvious functional failures in userspace). In v4.14 nothing seems to have changed, except the warning now appears three orders of magnitude more often. This spams console terminals and kernel logs with gigabytes of stacktrace and bumps this phenomenon up to the top of my priority list. It looks like the warning has been there with only minor editorial changes since Jan Schmidt's 2011 commit "Btrfs: added btrfs_find_all_roots()" in v3.3-rc1. > Maybe Zygo has found a finer way to reproduce > it, so he reproduce this warning more frequently than me. It's not really a finer way, but bees hits this warning most often, sometimes many times per second in bursts lasting minutes at a time. btrfs balance also hits the warning occasionally (it was the most common trigger of that warning in 2015 before I was running bees everywhere). The net effect of the bees worker loop looks fairly similar to btrfs/004, basically calling LOGICAL_INO many times per second on a busy filesystem. bees focuses its activity on active parts of the filesystem, which means it's more likely to do backref walks against extents that are also being affected by user activity and therefore more likely to encounter delayed refs. Contrast with 'btrfs balance', which spreads its effect across the entire filesystem and is much less likely to collide with user activity. Every duplicate extent hit in bees uses LOGICAL_INO at least once to map a stored duplicate block bytenr back to something that can be passed to open() and FILE_EXTENT_SAME. 
The warnings do arrive in bursts at the same time as bees hitting clusters of duplicate extents. > > > >> > 86d5f9944252 ("btrfs: convert prelimary reference tracking to use > rbtrees") > is committed. I understand that this does not mean that this patch > caused > the problem, but maybe Edmund can give us some help, so I added him > to the > recipient. > -- > Thanks, > Lu
[PATCH] btrfs: remove spurious WARN_ON(ref->count) in find_parent_nodes
This warning appears during execution of the LOGICAL_INO ioctl and appears to be spurious: [ cut here ] WARNING: CPU: 3 PID: 18172 at fs/btrfs/backref.c:1391 find_parent_nodes+0xc41/0x14e0 Modules linked in: ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi overlay r8169 ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs cpuid rpcsec_gss_krb5 nfsv4 nfsv3 nfs fscache algif_skcipher af_alg softdog nfsd auth_rpcgss nfs_acl lockd grace sunrpc bnep cpufreq_userspace cpufreq_powersave cpufreq_conservative nfnetlink_queue nfnetlink_log nfnetlink bluetooth rfkill snd_seq_dummy snd_hrtimer snd_seq_midi snd_seq_oss snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device binfmt_misc fuse nbd xt_REDIRECT nf_nat_redirect ipt_REJECT nf_reject_ipv4 xt_nat xt_conntrack xt_tcpudp nf_log_ipv4 nf_log_common xt_LOG ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip6table_mangle iptable_mangle ip6table_filter ip6_tables iptable_filter ip_tables x_tables tcp_cubic dummy lp dm_crypt edac_mce_amd edac_core snd_hda_codec_hdmi ppdev kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_via pcbc amdkfd snd_hda_codec_generic amd_iommu_v2 aesni_intel snd_hda_intel radeon snd_hda_codec aes_x86_64 snd_hda_core snd_hwdep crypto_simd glue_helper sg snd_pcm_oss cryptd input_leds joydev pcspkr serio_raw snd_mixer_oss rtc_cmos snd_pcm parport_pc parport shpchp wmi acpi_cpufreq evdev snd_timer asus_atk0110 k10temp fam15h_power snd soundcore sp5100_tco hid_generic ipv6 af_packet crc_ccitt raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx libcrc32c raid0 multipath linear dm_mod raid1 md_mod ohci_pci ide_pci_generic sr_mod cdrom pdc202xx_new ohci_hcd crc32c_intel atiixp ehci_pci psmouse ide_core i2c_piix4 ehci_hcd xhci_pci mii xhci_hcd [last unloaded: r8169] CPU: 3 PID: 18172 Comm: bees Tainted: G D WL 4.11.9-zb64+ 
#1 Hardware name: System manufacturer System Product Name/M5A78L-M/USB3, BIOS 210112/02/2014 Call Trace: dump_stack+0x85/0xc2 __warn+0xd1/0xf0 warn_slowpath_null+0x1d/0x20 find_parent_nodes+0xc41/0x14e0 __btrfs_find_all_roots+0xad/0x120 ? extent_same_check_offsets+0x70/0x70 iterate_extent_inodes+0x168/0x300 iterate_inodes_from_logical+0x87/0xb0 ? iterate_inodes_from_logical+0x87/0xb0 ? extent_same_check_offsets+0x70/0x70 btrfs_ioctl+0x8ac/0x2820 ? lock_acquire+0xc2/0x200 do_vfs_ioctl+0x91/0x700 ? __fget+0x112/0x200 SyS_ioctl+0x79/0x90 entry_SYSCALL_64_fastpath+0x23/0xc6 RIP: 0033:0x7f727b20be07 RSP: 002b:7f7279f1e018 EFLAGS: 0246 ORIG_RAX: 0010 RAX: ffda RBX: 9c0f4d7f RCX: 7f727b20be07 RDX: 7f7279f1e118 RSI: c0389424 RDI: 0003 RBP: 0035 R08: 7f72581bf340 R09: R10: 0020 R11: 0246 R12: 0040 R13: 7f725818d230 R14: 7f7279f1b640 R15: 7f725820 ? trace_hardirqs_off_caller+0x1f/0x140 ---[ end trace 5de243350f6762c6 ]--- [ cut here ] ref->count can be below zero under normal conditions (for delayed refs), so there is no need to spam dmesg when it happens. On kernel v4.14 this warning occurs 100-1000 times more frequently than on kernels v4.2..v4.12. In the worst case, one test machine had 59020 warnings in 24 hours on v4.14.14 compared to 55 on v4.12.14. 
Signed-off-by: Zygo Blaxell <ce3g8...@umail.furryterror.org>
---
 fs/btrfs/backref.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 7d0dc100a09a..57e8d2562ed5 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -1263,7 +1263,6 @@ static int find_parent_nodes(struct btrfs_trans_handle *trans,
 	while (node) {
 		ref = rb_entry(node, struct prelim_ref, rbnode);
 		node = rb_next(&ref->rbnode);
-		WARN_ON(ref->count < 0);
 		if (roots && ref->count && ref->root_id && ref->parent == 0) {
 			if (sc && sc->root_objectid &&
 			    ref->root_id != sc->root_objectid) {
-- 
2.11.0
[PATCH 1/3] btrfs: add a flag to iterate_inodes_from_logical to find all extent refs for uncompressed extents
The LOGICAL_INO ioctl provides a backward mapping from extent bytenr and offset (encoded as a single logical address) to a list of extent refs. LOGICAL_INO complements TREE_SEARCH, which provides the forward mapping (extent ref -> extent bytenr and offset, or logical address). These are useful capabilities for programs that manipulate extents and extent references from userspace (e.g. dedup and defrag utilities). When the extents are uncompressed (and not encrypted and not other), check_extent_in_eb performs filtering of the extent refs to remove any extent refs which do not contain the same extent offset as the 'logical' parameter's extent offset. This prevents LOGICAL_INO from returning references to more than a single block. To find the set of extent references to an uncompressed extent from [a, b), userspace has to run a loop like this pseudocode: for (i = a; i < b; ++i) extent_ref_set += LOGICAL_INO(i); At each iteration of the loop (up to 32768 iterations for a 128M extent), data we are interested in is collected in the kernel, then deleted by the filter in check_extent_in_eb. When the extents are compressed (or encrypted or other), the 'logical' parameter must be an extent bytenr (the 'a' parameter in the loop). No filtering by extent offset is done (or possible?) so the result is the complete set of extent refs for the entire extent. This removes the need for the loop, since we get all the extent refs in one call. Add an 'ignore_offset' argument to iterate_inodes_from_logical, [...several levels of function call graph...], and check_extent_in_eb, so that we can disable the extent offset filtering for uncompressed extents. This flag can be set by an improved version of the LOGICAL_INO ioctl to get either behavior as desired. There is no functional change in this patch. The new flag is always false. 
Signed-off-by: Zygo Blaxell <ce3g8...@umail.furryterror.org>
---
 fs/btrfs/backref.c            | 63 ++-
 fs/btrfs/backref.h            |  8 +++---
 fs/btrfs/inode.c              |  2 +-
 fs/btrfs/ioctl.c              |  2 +-
 fs/btrfs/qgroup.c             |  8 +++---
 fs/btrfs/scrub.c              |  6 ++---
 fs/btrfs/send.c               |  2 +-
 fs/btrfs/tests/qgroup-tests.c | 20 +++---
 8 files changed, 63 insertions(+), 48 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index b517ef1477ea..a2609786cd86 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -40,12 +40,14 @@ static int check_extent_in_eb(const struct btrfs_key *key,
 			      const struct extent_buffer *eb,
 			      const struct btrfs_file_extent_item *fi,
 			      u64 extent_item_pos,
-			      struct extent_inode_elem **eie)
+			      struct extent_inode_elem **eie,
+			      bool ignore_offset)
 {
 	u64 offset = 0;
 	struct extent_inode_elem *e;
 
-	if (!btrfs_file_extent_compression(eb, fi) &&
+	if (!ignore_offset &&
+	    !btrfs_file_extent_compression(eb, fi) &&
 	    !btrfs_file_extent_encryption(eb, fi) &&
 	    !btrfs_file_extent_other_encoding(eb, fi)) {
 		u64 data_offset;
@@ -84,7 +86,8 @@ static void free_inode_elem_list(struct extent_inode_elem *eie)
 static int find_extent_in_eb(const struct extent_buffer *eb,
 			     u64 wanted_disk_byte, u64 extent_item_pos,
-			     struct extent_inode_elem **eie)
+			     struct extent_inode_elem **eie,
+			     bool ignore_offset)
 {
 	u64 disk_byte;
 	struct btrfs_key key;
@@ -113,7 +116,7 @@ static int find_extent_in_eb(const struct extent_buffer *eb,
 		if (disk_byte != wanted_disk_byte)
 			continue;
 
-		ret = check_extent_in_eb(&key, eb, fi, extent_item_pos, eie);
+		ret = check_extent_in_eb(&key, eb, fi, extent_item_pos, eie, ignore_offset);
 		if (ret < 0)
 			return ret;
 	}
@@ -419,7 +422,7 @@ static int add_indirect_ref(const struct btrfs_fs_info *fs_info,
 static int add_all_parents(struct btrfs_root *root, struct btrfs_path *path,
 			   struct ulist *parents, struct prelim_ref *ref, int level,
 			   u64 time_seq, const u64 *extent_item_pos,
-			   u64 total_refs)
+			   u64 total_refs, bool ignore_offset)
 {
 	int ret = 0;
 	int slot;
@@ -472,7 +475,7 @@ static int
add_all_parents(struct btrfs_root *root, struct btrfs_path *path,
 		if (extent_item_pos) {
 			ret = check_extent_in_eb(&key, eb, fi,
[PATCH 3/3] btrfs: increase output size for LOGICAL_INO_V2 ioctl
Build-server workloads have hundreds of references per file after dedup. Multiply by a few snapshots and we quickly exhaust the limit of 2730 references per extent that can fit into a 64K buffer. Raise the limit to 16M to be consistent with other btrfs ioctls (e.g. TREE_SEARCH_V2, FILE_EXTENT_SAME). To minimize surprising userspace behavior, apply this change only to the LOGICAL_INO_V2 ioctl.

Signed-off-by: Zygo Blaxell <ce3g8...@umail.furryterror.org>
---
 fs/btrfs/ioctl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index f4281ffd1833..1940678fc440 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4554,6 +4554,7 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
 	if (version == 1) {
 		ignore_offset = false;
+		size = min_t(u32, loi->size, SZ_64K);
 	} else {
 		/* All reserved bits must be 0 for now */
 		if (memchr_inv(loi->reserved, 0, sizeof(loi->reserved))) {
@@ -4566,6 +4567,7 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
 			goto out_loi;
 		}
 		ignore_offset = loi->flags & BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET;
+		size = min_t(u32, loi->size, SZ_16M);
 	}
 
 	path = btrfs_alloc_path();
@@ -4574,7 +4576,6 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
 		goto out;
 	}
 
-	size = min_t(u32, loi->size, SZ_64K);
 	inodes = init_data_container(size);
 	if (IS_ERR(inodes)) {
 		ret = PTR_ERR(inodes);
-- 
2.11.0
[PATCH 2/3] btrfs: add a flags argument to LOGICAL_INO and call it LOGICAL_INO_V2
Now that check_extent_in_eb()'s extent offset filter can be turned off, we need a way to do it from userspace. Add a 'flags' field to the btrfs_logical_ino_args structure to disable extent offset filtering, taking the place of one of the existing reserved[] fields. Previous versions of LOGICAL_INO neglected to check whether any of the reserved fields have non-zero values. Assigning meaning to those fields now may change the behavior of existing programs that left these fields uninitialized. The lack of a zero check also means that new programs have no way to know whether the kernel is honoring the flags field. To avoid these problems, define a new ioctl LOGICAL_INO_V2. We can use the same argument layout as LOGICAL_INO, but shorten the reserved[] array by one element and turn it into the 'flags' field. The V2 ioctl explicitly checks that reserved fields and unsupported flag bits are zero so that userspace can negotiate future feature bits as they are defined. Since the memory layouts of the two ioctls' arguments are compatible, there is no need for a separate function for logical_to_ino_v2 (contrast with tree_search_v2 vs tree_search where the layout and code are quite different). A version parameter and an 'if' statement will suffice. Now that we have a flags field in logical_ino_args, add a flag BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET to get the behavior we want, and pass it down the stack to iterate_inodes_from_logical. 
Signed-off-by: Zygo Blaxell <ce3g8...@umail.furryterror.org>
---
 fs/btrfs/ioctl.c           | 26 +++---
 include/uapi/linux/btrfs.h |  8 +++-
 2 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index b7de32568082..f4281ffd1833 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4536,13 +4536,14 @@ static int build_ino_list(u64 inum, u64 offset, u64 root, void *ctx)
 }
 
 static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
-					void __user *arg)
+					void __user *arg, int version)
 {
 	int ret = 0;
 	int size;
 	struct btrfs_ioctl_logical_ino_args *loi;
 	struct btrfs_data_container *inodes = NULL;
 	struct btrfs_path *path = NULL;
+	bool ignore_offset;
 
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
@@ -4551,6 +4552,22 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
 	if (IS_ERR(loi))
 		return PTR_ERR(loi);
 
+	if (version == 1) {
+		ignore_offset = false;
+	} else {
+		/* All reserved bits must be 0 for now */
+		if (memchr_inv(loi->reserved, 0, sizeof(loi->reserved))) {
+			ret = -EINVAL;
+			goto out_loi;
+		}
+		/* Only accept flags we have defined so far */
+		if (loi->flags & ~(BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET)) {
+			ret = -EINVAL;
+			goto out_loi;
+		}
+		ignore_offset = loi->flags & BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET;
+	}
+
 	path = btrfs_alloc_path();
 	if (!path) {
 		ret = -ENOMEM;
@@ -4566,7 +4583,7 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
 	}
 
 	ret = iterate_inodes_from_logical(loi->logical, fs_info, path,
-					  build_ino_list, inodes, false);
+					  build_ino_list, inodes, ignore_offset);
 	if (ret == -EINVAL)
 		ret = -ENOENT;
 	if (ret < 0)
@@ -4580,6 +4597,7 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
 out:
 	btrfs_free_path(path);
 	kvfree(inodes);
+out_loi:
 	kfree(loi);
 
 	return ret;
@@ -5550,7 +5568,9 @@ long btrfs_ioctl(struct file *file, unsigned int
 	case BTRFS_IOC_INO_PATHS:
 		return btrfs_ioctl_ino_to_path(root, argp);
 	case BTRFS_IOC_LOGICAL_INO:
-		return
btrfs_ioctl_logical_to_ino(fs_info, argp);
+		return btrfs_ioctl_logical_to_ino(fs_info, argp, 1);
+	case BTRFS_IOC_LOGICAL_INO_V2:
+		return btrfs_ioctl_logical_to_ino(fs_info, argp, 2);
 	case BTRFS_IOC_SPACE_INFO:
 		return btrfs_ioctl_space_info(fs_info, argp);
 	case BTRFS_IOC_SYNC: {

diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 378230c163d5..99bb7988e6fe 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -608,10 +608,14 @@ struct btrfs_ioctl_ino_path_args {
 struct btrfs_ioctl_logical_ino_args {
 	__u64 logical;	/* in */
 	__u64 size;	/* in */
-	__u64 reserved[4];
+	__u64 rese
[PATCH v3] btrfs: LOGICAL_INO enhancements
Changelog:

v3-v2:
- Stricter check on reserved[] field - now must be all zero, or userspace gets EINVAL. This prevents userspace from setting any of the reserved bits without the kernel providing an unambiguous interpretation of them, and doesn't require us to burn a flag bit for each one.
- Moved 'flags' to the end of the reserved[] array. This allows existing source code using version 1 of the ioctl to behave the same way when using version 2 of the btrfs_ioctl_logical_ino_args struct definition (i.e. reserved[3] becomes an alias for 'flags', and the addresses of reserved[0-2] don't change).
- Clarified the reasoning in the commit message for patch 2, "btrfs: add a flags argument to LOGICAL_INO and call it LOGICAL_INO_V2".

v2:
- added patch series intro text
- rebased on 4.14-rc1.

v1: This patch series fixes some weaknesses in the btrfs LOGICAL_INO ioctl.

Background: Suppose we have a file with one extent:

root@tester:~# zcat /usr/share/doc/cpio/changelog.gz > /test/a
root@tester:~# sync

Split the extent by overwriting it in the middle:

root@tester:~# cat /dev/urandom | dd bs=4k seek=2 skip=2 count=1 conv=notrunc of=/test/a

We should now have 3 extent refs to 2 extents, with one block unreachable. The extent tree looks like:

root@tester:~# btrfs-debug-tree /dev/vdc -t 2
[...]
	item 9 key (1103101952 EXTENT_ITEM 73728) itemoff 15942 itemsize 53
		extent refs 2 gen 29 flags DATA
		extent data backref root 5 objectid 261 offset 0 count 2
[...]
	item 11 key (1103175680 EXTENT_ITEM 4096) itemoff 15865 itemsize 53
		extent refs 1 gen 30 flags DATA
		extent data backref root 5 objectid 261 offset 8192 count 1
[...]

and the ref tree looks like:

root@tester:~# btrfs-debug-tree /dev/vdc -t 5
[...]
	item 6 key (261 EXTENT_DATA 0) itemoff 15825 itemsize 53
		extent data disk byte 1103101952 nr 73728
		extent data offset 0 nr 8192 ram 73728
		extent compression(none)
	item 7 key (261 EXTENT_DATA 8192) itemoff 15772 itemsize 53
		extent data disk byte 1103175680 nr 4096
		extent data offset 0 nr 4096 ram 4096
		extent compression(none)
	item 8 key (261 EXTENT_DATA 12288) itemoff 15719 itemsize 53
		extent data disk byte 1103101952 nr 73728
		extent data offset 12288 nr 61440 ram 73728
		extent compression(none)
[...]

There are two references to the same extent with different, non-overlapping byte offsets:

    [-------72K extent at 1103101952-------]
    [--8K--|--4K unreachable--|-----60K----]
       ^                            ^
       |                            |
[--8K ref offset 0--][--4K ref offset 0--][--60K ref offset 12K--]
                              |
                              v
                     [-4K extent-] at 1103175680

We want to find all of the references to extent bytenr 1103101952.

Without the patch (and without running btrfs-debug-tree), we have to do it with 18 LOGICAL_INO calls:

root@tester:~# btrfs ins log 1103101952 -P /test/
Using LOGICAL_INO
inode 261 offset 0 root 5

root@tester:~# for x in $(seq 0 17); do btrfs ins log $((1103101952 + x * 4096)) -P /test/; done 2>&1 | grep inode
inode 261 offset 0 root 5
inode 261 offset 4096 root 5   <- same extent ref as offset 0
                               (offset 8192 returns empty set, not reachable)
inode 261 offset 12288 root 5
inode 261 offset 16384 root 5  \
inode 261 offset 20480 root 5  |
inode 261 offset 24576 root 5  |
inode 261 offset 28672 root 5  |
inode 261 offset 32768 root 5  |
inode 261 offset 36864 root 5  \
inode 261 offset 40960 root 5   > all the same extent ref as offset 12288.
inode 261 offset 45056 root 5  /  More processing required in userspace
inode 261 offset 49152 root 5  |  to figure out these are all duplicates.
inode 261 offset 53248 root 5  |
inode 261 offset 57344 root 5  |
inode 261 offset 61440 root 5  |
inode 261 offset 65536 root 5  |
inode 261 offset 69632 root 5  /

In the worst case the extents are 128MB long, and we have to do 32768 iterations of the loop to find one 4K extent ref. 
With the patch, we just use one call to map all refs to the extent at once:

root@tester:~# btrfs ins log 1103101952 -P /test/
Using LOGICAL_INO_V2
inode 261 offset 0 root 5
inode 261 offset 12288 root 5

The TREE_SEARCH ioctl allows userspace to retrieve the offset and extent bytenr fields easily once the root, inode and offset are known. This is sufficient information to
Re: [PATCH 2/3] btrfs: add a flags argument to LOGICAL_INO and call it LOGICAL_INO_V2
On Thu, Sep 21, 2017 at 12:59:42PM -0700, Darrick J. Wong wrote: > On Thu, Sep 21, 2017 at 12:10:15AM -0400, Zygo Blaxell wrote: > > Now that check_extent_in_eb()'s extent offset filter can be turned off, > > we need a way to do it from userspace. > > > > Add a 'flags' field to the btrfs_logical_ino_args structure to disable > > extent > > offset filtering, taking the place of one of the reserved[] fields. > > > > Previous versions of LOGICAL_INO neglected to check whether any of the > > reserved fields have non-zero values. Assigning meaning to those fields > > now may change the behavior of existing programs that left these fields > > uninitialized. > > > > To avoid any surprises, define a new ioctl LOGICAL_INO_V2 which uses > > the same argument layout as LOGICAL_INO, but uses one of the reserved > > fields for flags. The V2 ioctl explicitly checks that unsupported flag > > bits are zero so that userspace can probe for future feature bits as > > they are defined. If the other reserved fields are used in the future, > > one of the remaining flag bits could specify that the other reserved > > fields are valid, so we don't need to check those for now. > > > > Since the memory layouts and behavior of the two ioctls' arguments > > are almost identical, there is no need for a separate function for > > logical_to_ino_v2 (contrast with tree_search_v2 vs tree_search). > > A version parameter and an 'if' statement will suffice. > > > > Now that we have a flags field in logical_ino_args, add a flag > > BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET to get the behavior we want, > > and pass it down the stack to iterate_inodes_from_logical. 
> > > > Signed-off-by: Zygo Blaxell <ce3g8...@umail.furryterror.org> > > --- > > fs/btrfs/ioctl.c | 21 ++--- > > include/uapi/linux/btrfs.h | 8 +++- > > 2 files changed, 25 insertions(+), 4 deletions(-) > > > > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c > > index b7de32568082..2bc3a9588d1d 100644 > > --- a/fs/btrfs/ioctl.c > > +++ b/fs/btrfs/ioctl.c > > @@ -4536,13 +4536,14 @@ static int build_ino_list(u64 inum, u64 offset, u64 > > root, void *ctx) > > } > > > > static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info, > > - void __user *arg) > > + void __user *arg, int version) > > { > > int ret = 0; > > int size; > > struct btrfs_ioctl_logical_ino_args *loi; > > struct btrfs_data_container *inodes = NULL; > > struct btrfs_path *path = NULL; > > + bool ignore_offset; > > > > if (!capable(CAP_SYS_ADMIN)) > > return -EPERM; > > @@ -4551,6 +4552,17 @@ static long btrfs_ioctl_logical_to_ino(struct > > btrfs_fs_info *fs_info, > > if (IS_ERR(loi)) > > return PTR_ERR(loi); > > > > + if (version == 1) { > > + ignore_offset = false; > > + } else { > > + /* Only accept flags we have defined so far */ > > + if (loi->flags & ~(BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET)) { > > + ret = -EINVAL; > > + goto out_loi; > > + } > > + ignore_offset = loi->flags & > > BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET; > > Please check loi->reserved[3] for zeroness so that the next person who > wants to add a field to btrfs_ioctl_logical_ino_args doesn't have to > create LOGICAL_INO_V3 for the same reason you're creating V2. OK now I'm confused, in several distinct ways. I wonder if you meant reserved[1] and reserved[2] there, since I'm not checking them (for reasons stated in the commit log--we can use flags to indicate whether and what values are present there). But that's not the bigger problem. Maybe you did mean reserved[3], but there's no "reserved[3]" any more. I shortened the reserved array from 4 elements to 3, so "reserved[3]" is no longer a valid memory reference. 
Also "reserved[0]" no longer refers to the same thing it once did. > --D > > > + } > > + > > path = btrfs_alloc_path(); > > if (!path) { > > ret = -ENOMEM; > > @@ -4566,7 +4578,7 @@ static long btrfs_ioctl_logical_to_ino(struct > > btrfs_fs_info *fs_info, > > } > > > > ret = iterate_inodes_from_logical(loi->logical, fs_info, path, > > - build_ino_list, inodes, false); > > + bui
[PATCH v2] btrfs: LOGICAL_INO enhancements (this time based on 4.14-rc1)
The previous patch series was based on v4.12.14, and this introductory text was missing.

This patch series fixes some weaknesses in the btrfs LOGICAL_INO ioctl.

Background:

Suppose we have a file with one extent:

root@tester:~# zcat /usr/share/doc/cpio/changelog.gz > /test/a
root@tester:~# sync

Split the extent by overwriting it in the middle:

root@tester:~# cat /dev/urandom | dd bs=4k seek=2 skip=2 count=1 conv=notrunc of=/test/a

We should now have 3 extent refs to 2 extents, with one block unreachable.
The extent tree looks like:

root@tester:~# btrfs-debug-tree /dev/vdc -t 2
[...]
	item 9 key (1103101952 EXTENT_ITEM 73728) itemoff 15942 itemsize 53
		extent refs 2 gen 29 flags DATA
		extent data backref root 5 objectid 261 offset 0 count 2
[...]
	item 11 key (1103175680 EXTENT_ITEM 4096) itemoff 15865 itemsize 53
		extent refs 1 gen 30 flags DATA
		extent data backref root 5 objectid 261 offset 8192 count 1
[...]

and the ref tree looks like:

root@tester:~# btrfs-debug-tree /dev/vdc -t 5
[...]
	item 6 key (261 EXTENT_DATA 0) itemoff 15825 itemsize 53
		extent data disk byte 1103101952 nr 73728
		extent data offset 0 nr 8192 ram 73728
		extent compression(none)
	item 7 key (261 EXTENT_DATA 8192) itemoff 15772 itemsize 53
		extent data disk byte 1103175680 nr 4096
		extent data offset 0 nr 4096 ram 4096
		extent compression(none)
	item 8 key (261 EXTENT_DATA 12288) itemoff 15719 itemsize 53
		extent data disk byte 1103101952 nr 73728
		extent data offset 12288 nr 61440 ram 73728
		extent compression(none)
[...]

There are two references to the same extent with different, non-overlapping byte offsets:

	[--72K extent at 1103101952--]
	[--8K|--4K unreachable|--60K-]
	   ^                     ^
	   |                     |
	[--8K ref offset 0--][--4K ref offset 0--][--60K ref offset 12K--]
	                              |
	                              v
	                       [-4K extent-] at 1103175680

We now want to find all of the references to extent bytenr 1103101952.
Without the patch (and without running btrfs-debug-tree), we have to do it with 18 LOGICAL_INO calls:

root@tester:~# btrfs ins log 1103101952 -P /test/
Using LOGICAL_INO
inode 261 offset 0 root 5

root@tester:~# for x in $(seq 0 17); do btrfs ins log $((1103101952 + x * 4096)) -P /test/; done 2>&1 | grep inode
inode 261 offset 0 root 5
inode 261 offset 4096 root 5   <- same extent ref as offset 0
                                  (offset 8192 returns empty set, not reachable)
inode 261 offset 12288 root 5
inode 261 offset 16384 root 5  \
inode 261 offset 20480 root 5  |
inode 261 offset 24576 root 5  |
inode 261 offset 28672 root 5  |
inode 261 offset 32768 root 5  |
inode 261 offset 36864 root 5  \
inode 261 offset 40960 root 5   > all the same extent ref as offset 12288.
inode 261 offset 45056 root 5  /  More processing required in userspace
inode 261 offset 49152 root 5  |  to figure out these are all duplicates.
inode 261 offset 53248 root 5  |
inode 261 offset 57344 root 5  |
inode 261 offset 61440 root 5  |
inode 261 offset 65536 root 5  |
inode 261 offset 69632 root 5  /

In the worst case the extents are 128MB long, and we have to do 32768 iterations of the loop to find one 4K extent ref.

With the patch, we just use one call to map all refs to the extent at once:

root@tester:~# btrfs ins log 1103101952 -P /test/
Using LOGICAL_INO_V2
inode 261 offset 0 root 5
inode 261 offset 12288 root 5

The TREE_SEARCH ioctl allows userspace to retrieve the offset and extent bytenr fields easily once the root, inode and offset are known.  This is sufficient information to build a complete map of the extent and all of its references.  Userspace can use this information to make better choices to dedup or defrag.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH 3/3] btrfs: increase output size for LOGICAL_INO_V2 ioctl
Build-server workloads have hundreds of references per file after dedup.  Multiply by a few snapshots and we quickly exhaust the limit of 2730 references per extent that can fit into a 64K buffer.

Raise the limit to 16M to be consistent with other btrfs ioctls (e.g. TREE_SEARCH_V2, FILE_EXTENT_SAME).

To minimize surprising userspace behavior, apply this change only to the LOGICAL_INO_V2 ioctl.

Signed-off-by: Zygo Blaxell <ce3g8...@umail.furryterror.org>
---
 fs/btrfs/ioctl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 2bc3a9588d1d..4be9b1791f58 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4554,6 +4554,7 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
 
 	if (version == 1) {
 		ignore_offset = false;
+		size = min_t(u32, loi->size, SZ_64K);
 	} else {
 		/* Only accept flags we have defined so far */
 		if (loi->flags & ~(BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET)) {
@@ -4561,6 +4562,7 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
 			goto out_loi;
 		}
 		ignore_offset = loi->flags & BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET;
+		size = min_t(u32, loi->size, SZ_16M);
 	}
 
 	path = btrfs_alloc_path();
@@ -4569,7 +4571,6 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
 		goto out;
 	}
 
-	size = min_t(u32, loi->size, SZ_64K);
 	inodes = init_data_container(size);
 	if (IS_ERR(inodes)) {
 		ret = PTR_ERR(inodes);
-- 
2.11.0
[PATCH 2/3] btrfs: add a flags argument to LOGICAL_INO and call it LOGICAL_INO_V2
Now that check_extent_in_eb()'s extent offset filter can be turned off, we need a way to do it from userspace.

Add a 'flags' field to the btrfs_logical_ino_args structure to disable extent offset filtering, taking the place of one of the reserved[] fields.

Previous versions of LOGICAL_INO neglected to check whether any of the reserved fields have non-zero values.  Assigning meaning to those fields now may change the behavior of existing programs that left these fields uninitialized.

To avoid any surprises, define a new ioctl LOGICAL_INO_V2 which uses the same argument layout as LOGICAL_INO, but uses one of the reserved fields for flags.  The V2 ioctl explicitly checks that unsupported flag bits are zero so that userspace can probe for future feature bits as they are defined.  If the other reserved fields are used in the future, one of the remaining flag bits could specify that the other reserved fields are valid, so we don't need to check those for now.

Since the memory layouts and behavior of the two ioctls' arguments are almost identical, there is no need for a separate function for logical_to_ino_v2 (contrast with tree_search_v2 vs tree_search).  A version parameter and an 'if' statement will suffice.

Now that we have a flags field in logical_ino_args, add a flag BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET to get the behavior we want, and pass it down the stack to iterate_inodes_from_logical.
Signed-off-by: Zygo Blaxell <ce3g8...@umail.furryterror.org> --- fs/btrfs/ioctl.c | 21 ++--- include/uapi/linux/btrfs.h | 8 +++- 2 files changed, 25 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index b7de32568082..2bc3a9588d1d 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -4536,13 +4536,14 @@ static int build_ino_list(u64 inum, u64 offset, u64 root, void *ctx) } static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info, - void __user *arg) + void __user *arg, int version) { int ret = 0; int size; struct btrfs_ioctl_logical_ino_args *loi; struct btrfs_data_container *inodes = NULL; struct btrfs_path *path = NULL; + bool ignore_offset; if (!capable(CAP_SYS_ADMIN)) return -EPERM; @@ -4551,6 +4552,17 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info, if (IS_ERR(loi)) return PTR_ERR(loi); + if (version == 1) { + ignore_offset = false; + } else { + /* Only accept flags we have defined so far */ + if (loi->flags & ~(BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET)) { + ret = -EINVAL; + goto out_loi; + } + ignore_offset = loi->flags & BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET; + } + path = btrfs_alloc_path(); if (!path) { ret = -ENOMEM; @@ -4566,7 +4578,7 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info, } ret = iterate_inodes_from_logical(loi->logical, fs_info, path, - build_ino_list, inodes, false); + build_ino_list, inodes, ignore_offset); if (ret == -EINVAL) ret = -ENOENT; if (ret < 0) @@ -4580,6 +4592,7 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info, out: btrfs_free_path(path); kvfree(inodes); +out_loi: kfree(loi); return ret; @@ -5550,7 +5563,9 @@ long btrfs_ioctl(struct file *file, unsigned int case BTRFS_IOC_INO_PATHS: return btrfs_ioctl_ino_to_path(root, argp); case BTRFS_IOC_LOGICAL_INO: - return btrfs_ioctl_logical_to_ino(fs_info, argp); + return btrfs_ioctl_logical_to_ino(fs_info, argp, 1); + case BTRFS_IOC_LOGICAL_INO_V2: + return 
btrfs_ioctl_logical_to_ino(fs_info, argp, 2); case BTRFS_IOC_SPACE_INFO: return btrfs_ioctl_space_info(fs_info, argp); case BTRFS_IOC_SYNC: { diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h index 378230c163d5..0b3de597e04f 100644 --- a/include/uapi/linux/btrfs.h +++ b/include/uapi/linux/btrfs.h @@ -608,10 +608,14 @@ struct btrfs_ioctl_ino_path_args { struct btrfs_ioctl_logical_ino_args { __u64 logical;/* in */ __u64 size; /* in */ - __u64 reserved[4]; + __u64 flags; /* in, v2 only */ + __u64 reserved[3]; /* struct btrfs_data_container *inodes;out */ __u64 inodes; }; +/* Return every ref to the extent, not just those containing logical b
[PATCH 1/3] btrfs: add a flag to iterate_inodes_from_logical to find all extent refs for uncompressed extents
The LOGICAL_INO ioctl provides a backward mapping from extent bytenr and offset (encoded as a single logical address) to a list of extent refs.  LOGICAL_INO complements TREE_SEARCH, which provides the forward mapping (extent ref -> extent bytenr and offset, or logical address).  These are useful capabilities for programs that manipulate extents and extent references from userspace (e.g. dedup and defrag utilities).

When the extents are uncompressed (and not encrypted and not other), check_extent_in_eb performs filtering of the extent refs to remove any extent refs which do not contain the same extent offset as the 'logical' parameter's extent offset.  This prevents LOGICAL_INO from returning references to more than a single block.

To find the set of extent references to an uncompressed extent from [a, b), userspace has to run a loop like this pseudocode:

	for (i = a; i < b; ++i)
		extent_ref_set += LOGICAL_INO(i);

At each iteration of the loop (up to 32768 iterations for a 128M extent), data we are interested in is collected in the kernel, then deleted by the filter in check_extent_in_eb.

When the extents are compressed (or encrypted or other), the 'logical' parameter must be an extent bytenr (the 'a' parameter in the loop).  No filtering by extent offset is done (or possible?) so the result is the complete set of extent refs for the entire extent.  This removes the need for the loop, since we get all the extent refs in one call.

Add an 'ignore_offset' argument to iterate_inodes_from_logical, [...several levels of function call graph...], and check_extent_in_eb, so that we can disable the extent offset filtering for uncompressed extents.  This flag can be set by an improved version of the LOGICAL_INO ioctl to get either behavior as desired.

There is no functional change in this patch.  The new flag is always false.
Signed-off-by: Zygo Blaxell <ce3g8...@umail.furryterror.org> --- fs/btrfs/backref.c| 63 ++- fs/btrfs/backref.h| 8 +++--- fs/btrfs/inode.c | 2 +- fs/btrfs/ioctl.c | 2 +- fs/btrfs/qgroup.c | 8 +++--- fs/btrfs/scrub.c | 6 ++--- fs/btrfs/send.c | 2 +- fs/btrfs/tests/qgroup-tests.c | 20 +++--- 8 files changed, 63 insertions(+), 48 deletions(-) diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c index b517ef1477ea..a2609786cd86 100644 --- a/fs/btrfs/backref.c +++ b/fs/btrfs/backref.c @@ -40,12 +40,14 @@ static int check_extent_in_eb(const struct btrfs_key *key, const struct extent_buffer *eb, const struct btrfs_file_extent_item *fi, u64 extent_item_pos, - struct extent_inode_elem **eie) + struct extent_inode_elem **eie, + bool ignore_offset) { u64 offset = 0; struct extent_inode_elem *e; - if (!btrfs_file_extent_compression(eb, fi) && + if (!ignore_offset && + !btrfs_file_extent_compression(eb, fi) && !btrfs_file_extent_encryption(eb, fi) && !btrfs_file_extent_other_encoding(eb, fi)) { u64 data_offset; @@ -84,7 +86,8 @@ static void free_inode_elem_list(struct extent_inode_elem *eie) static int find_extent_in_eb(const struct extent_buffer *eb, u64 wanted_disk_byte, u64 extent_item_pos, -struct extent_inode_elem **eie) +struct extent_inode_elem **eie, +bool ignore_offset) { u64 disk_byte; struct btrfs_key key; @@ -113,7 +116,7 @@ static int find_extent_in_eb(const struct extent_buffer *eb, if (disk_byte != wanted_disk_byte) continue; - ret = check_extent_in_eb(, eb, fi, extent_item_pos, eie); + ret = check_extent_in_eb(, eb, fi, extent_item_pos, eie, ignore_offset); if (ret < 0) return ret; } @@ -419,7 +422,7 @@ static int add_indirect_ref(const struct btrfs_fs_info *fs_info, static int add_all_parents(struct btrfs_root *root, struct btrfs_path *path, struct ulist *parents, struct prelim_ref *ref, int level, u64 time_seq, const u64 *extent_item_pos, - u64 total_refs) + u64 total_refs, bool ignore_offset) { int ret = 0; int slot; @@ -472,7 +475,7 @@ static int 
add_all_parents(struct btrfs_root *root, struct btrfs_path *path, if (extent_item_pos) { ret = check_extent_in_eb(, eb, fi,
[PATCH 1/3] btrfs: add a flag to iterate_inodes_from_logical to find all extent refs for uncompressed extents
The LOGICAL_INO ioctl provides a backward mapping from extent bytenr and offset (encoded as a single logical address) to a list of extent refs.  LOGICAL_INO complements TREE_SEARCH, which provides the forward mapping (extent ref -> extent bytenr and offset, or logical address).  These are useful capabilities for programs that manipulate extents and extent references from userspace (e.g. dedup and defrag utilities).

When the extents are uncompressed (and not encrypted and not other), check_extent_in_eb performs filtering of the extent refs to remove any extent refs which do not contain the same extent offset as the 'logical' parameter's extent offset.  This prevents LOGICAL_INO from returning references to more than a single block.

To find the set of extent references to an uncompressed extent from [a, b), userspace has to run a loop like this pseudocode:

	for (i = a; i < b; ++i)
		extent_ref_set += LOGICAL_INO(i);

At each iteration of the loop (up to 32768 iterations for a 128M extent), data we are interested in is collected in the kernel, then deleted by the filter in check_extent_in_eb.

When the extents are compressed (or encrypted or other), the 'logical' parameter must be an extent bytenr (the 'a' parameter in the loop).  No filtering by extent offset is done (or possible?) so the result is the complete set of extent refs for the entire extent.  This removes the need for the loop, since we get all the extent refs in one call.

Add an 'ignore_offset' argument to iterate_inodes_from_logical, [...several levels of function call graph...], and check_extent_in_eb, so that we can disable the extent offset filtering for uncompressed extents.  This flag can be set by an improved version of the LOGICAL_INO ioctl to get either behavior as desired.

There is no functional change in this patch.  The new flag is always false.
Signed-off-by: Zygo Blaxell <ce3g8...@umail.furryterror.org> --- fs/btrfs/backref.c | 62 -- fs/btrfs/backref.h | 8 --- fs/btrfs/inode.c | 2 +- fs/btrfs/ioctl.c | 2 +- fs/btrfs/qgroup.c | 8 +++ fs/btrfs/scrub.c | 6 +++--- fs/btrfs/send.c| 2 +- 7 files changed, 52 insertions(+), 38 deletions(-) diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c index 1d71a5a4b1b9..3bffd36c6897 100644 --- a/fs/btrfs/backref.c +++ b/fs/btrfs/backref.c @@ -302,12 +302,14 @@ static int ref_tree_add(struct ref_root *ref_tree, u64 root_id, u64 object_id, static int check_extent_in_eb(struct btrfs_key *key, struct extent_buffer *eb, struct btrfs_file_extent_item *fi, u64 extent_item_pos, - struct extent_inode_elem **eie) + struct extent_inode_elem **eie, + bool ignore_offset) { u64 offset = 0; struct extent_inode_elem *e; - if (!btrfs_file_extent_compression(eb, fi) && + if (!ignore_offset && + !btrfs_file_extent_compression(eb, fi) && !btrfs_file_extent_encryption(eb, fi) && !btrfs_file_extent_other_encoding(eb, fi)) { u64 data_offset; @@ -346,7 +348,8 @@ static void free_inode_elem_list(struct extent_inode_elem *eie) static int find_extent_in_eb(struct extent_buffer *eb, u64 wanted_disk_byte, u64 extent_item_pos, - struct extent_inode_elem **eie) + struct extent_inode_elem **eie, + bool ignore_offset) { u64 disk_byte; struct btrfs_key key; @@ -375,7 +378,7 @@ static int find_extent_in_eb(struct extent_buffer *eb, u64 wanted_disk_byte, if (disk_byte != wanted_disk_byte) continue; - ret = check_extent_in_eb(, eb, fi, extent_item_pos, eie); + ret = check_extent_in_eb(, eb, fi, extent_item_pos, eie, ignore_offset); if (ret < 0) return ret; } @@ -511,7 +514,7 @@ static int __add_prelim_ref(struct list_head *head, u64 root_id, static int add_all_parents(struct btrfs_root *root, struct btrfs_path *path, struct ulist *parents, struct __prelim_ref *ref, int level, u64 time_seq, const u64 *extent_item_pos, - u64 total_refs) + u64 total_refs, bool ignore_offset) { int ret = 0; int slot; @@ -564,7 
+567,7 @@ static int add_all_parents(struct btrfs_root *root, struct btrfs_path *path, if (extent_item_pos) { ret = check_extent_in_eb(, eb, fi, *extent_item_pos, - ); +
[PATCH 2/3] btrfs: add a flags argument to LOGICAL_INO and call it LOGICAL_INO_V2
Now that check_extent_in_eb()'s extent offset filter can be turned off, we need a way to do it from userspace.

Add a 'flags' field to the btrfs_logical_ino_args structure to disable extent offset filtering, taking the place of one of the reserved[] fields.

Previous versions of LOGICAL_INO neglected to check whether any of the reserved fields have non-zero values.  Assigning meaning to those fields now may change the behavior of existing programs that left these fields uninitialized.

To avoid any surprises, define a new ioctl LOGICAL_INO_V2 which uses the same argument layout as LOGICAL_INO, but uses one of the reserved fields for flags.  The V2 ioctl explicitly checks that unsupported flag bits are zero so that userspace can probe for future feature bits as they are defined.  If the other reserved fields are used in the future, one of the remaining flag bits could specify that the other reserved fields are valid, so we don't need to check those for now.

Since the memory layouts and behavior of the two ioctls' arguments are almost identical, there is no need for a separate function for logical_to_ino_v2 (contrast with tree_search_v2 vs tree_search).  A version parameter and an 'if' statement will suffice.

Now that we have a flags field in logical_ino_args, add a flag BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET to get the behavior we want, and pass it down the stack to iterate_inodes_from_logical.
Signed-off-by: Zygo Blaxell <ce3g8...@umail.furryterror.org> --- fs/btrfs/ioctl.c | 21 ++--- include/uapi/linux/btrfs.h | 8 +++- 2 files changed, 25 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index c6787660d91f..def0ab85134a 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -4542,13 +4542,14 @@ static int build_ino_list(u64 inum, u64 offset, u64 root, void *ctx) } static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info, - void __user *arg) + void __user *arg, int version) { int ret = 0; int size; struct btrfs_ioctl_logical_ino_args *loi; struct btrfs_data_container *inodes = NULL; struct btrfs_path *path = NULL; + bool ignore_offset; if (!capable(CAP_SYS_ADMIN)) return -EPERM; @@ -4557,6 +4558,17 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info, if (IS_ERR(loi)) return PTR_ERR(loi); + if (version == 1) { + ignore_offset = false; + } else { + /* Only accept flags we have defined so far */ + if (loi->flags & ~(BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET)) { + ret = -EINVAL; + goto out_loi; + } + ignore_offset = loi->flags & BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET; + } + path = btrfs_alloc_path(); if (!path) { ret = -ENOMEM; @@ -4572,7 +4584,7 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info, } ret = iterate_inodes_from_logical(loi->logical, fs_info, path, - build_ino_list, inodes, false); + build_ino_list, inodes, ignore_offset); if (ret == -EINVAL) ret = -ENOENT; if (ret < 0) @@ -4586,6 +4598,7 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info, out: btrfs_free_path(path); vfree(inodes); +out_loi: kfree(loi); return ret; @@ -5559,7 +5572,9 @@ long btrfs_ioctl(struct file *file, unsigned int case BTRFS_IOC_INO_PATHS: return btrfs_ioctl_ino_to_path(root, argp); case BTRFS_IOC_LOGICAL_INO: - return btrfs_ioctl_logical_to_ino(fs_info, argp); + return btrfs_ioctl_logical_to_ino(fs_info, argp, 1); + case BTRFS_IOC_LOGICAL_INO_V2: + return 
btrfs_ioctl_logical_to_ino(fs_info, argp, 2); case BTRFS_IOC_SPACE_INFO: return btrfs_ioctl_space_info(fs_info, argp); case BTRFS_IOC_SYNC: { diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h index a456e5309238..a23555026994 100644 --- a/include/uapi/linux/btrfs.h +++ b/include/uapi/linux/btrfs.h @@ -591,10 +591,14 @@ struct btrfs_ioctl_ino_path_args { struct btrfs_ioctl_logical_ino_args { __u64 logical;/* in */ __u64 size; /* in */ - __u64 reserved[4]; + __u64 flags; /* in, v2 only */ + __u64 reserved[3]; /* struct btrfs_data_container *inodes;out */ __u64 inodes; }; +/* Return every ref to the extent, not just those containing logical b
[PATCH 3/3] btrfs: increase output size for LOGICAL_INO_V2 ioctl
Build-server workloads have hundreds of references per file after dedup.  Multiply by a few snapshots and we quickly exhaust the limit of 2730 references per extent that can fit into a 64K buffer.

Raise the limit to 16M to be consistent with other btrfs ioctls (e.g. TREE_SEARCH_V2, FILE_EXTENT_SAME).

To minimize surprising userspace behavior, apply this change only to the LOGICAL_INO_V2 ioctl.

Signed-off-by: Zygo Blaxell <ce3g8...@umail.furryterror.org>
---
 fs/btrfs/ioctl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index def0ab85134a..e13fea25ecb8 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4560,6 +4560,7 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
 
 	if (version == 1) {
 		ignore_offset = false;
+		size = min_t(u32, loi->size, SZ_64K);
 	} else {
 		/* Only accept flags we have defined so far */
 		if (loi->flags & ~(BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET)) {
@@ -4567,6 +4568,7 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
 			goto out_loi;
 		}
 		ignore_offset = loi->flags & BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET;
+		size = min_t(u32, loi->size, SZ_16M);
 	}
 
 	path = btrfs_alloc_path();
@@ -4575,7 +4577,6 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
 		goto out;
 	}
 
-	size = min_t(u32, loi->size, SZ_64K);
 	inodes = init_data_container(size);
 	if (IS_ERR(inodes)) {
 		ret = PTR_ERR(inodes);
-- 
2.11.0
[PATCH v4] btrfs: add missing memset while reading compressed inline extents
t each run:

	0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
	*
	0001740 cdcd cdcd cdcd cdcd 6c63 7400 635f 006d
	0001760 5f74 6f43 7400 435f 0053 5f74 7363 7400
	0002000 435f 0056 5f74 6164 7400 645f 0062 5f74
	(...)

Signed-off-by: Zygo Blaxell <ce3g8...@umail.furryterror.org>
Reviewed-by: Liu Bo <bo.li@oracle.com>
---
v4: remove WARN_ON.  Put in the comment about decompression code filling
in zeros up to the end of max_size, and why we need a memset here.

 fs/btrfs/inode.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 25ac2cf..f41ef5d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6805,6 +6805,20 @@ static noinline int uncompress_inline(struct btrfs_path *path,
 	max_size = min_t(unsigned long, PAGE_SIZE, max_size);
 	ret = btrfs_decompress(compress_type, tmp, page,
 			       extent_offset, inline_size, max_size);
+
+	/*
+	 * decompression code contains a memset to fill in any space between the end
+	 * of the uncompressed data and the end of max_size in case the decompressed
+	 * data ends up shorter than ram_bytes.  That doesn't cover the hole between
+	 * the end of an inline extent and the beginning of the next block, so we
+	 * cover that region here.
+	 */
+
+	if (max_size + pg_offset < PAGE_SIZE) {
+		char *map = kmap(page);
+		memset(map + pg_offset + max_size, 0, PAGE_SIZE - max_size - pg_offset);
+		kunmap(page);
+	}
 	kfree(tmp);
 	return ret;
 }
-- 
2.1.4
Re: [PATCH v3] btrfs: add missing memset while reading compressed inline extents
On Fri, Mar 10, 2017 at 02:12:54PM -0500, Chris Mason wrote: > > > On 03/10/2017 01:56 PM, Zygo Blaxell wrote: > >On Fri, Mar 10, 2017 at 11:19:24AM -0500, Chris Mason wrote: > >>On 03/09/2017 11:41 PM, Zygo Blaxell wrote: > >>>On Thu, Mar 09, 2017 at 10:39:49AM -0500, Chris Mason wrote: > >>>> > >>>> > >>>>On 03/08/2017 09:12 PM, Zygo Blaxell wrote: > >>>>>This is a story about 4 distinct (and very old) btrfs bugs. > >>>>> > >>>> > >>>>Really great write up. > >>>> > >>>>[ ... ] > >>>> > >>>>> > >>>>>diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > >>>>>index 25ac2cf..4d41a31 100644 > >>>>>--- a/fs/btrfs/inode.c > >>>>>+++ b/fs/btrfs/inode.c > >>>>>@@ -6805,6 +6805,12 @@ static noinline int uncompress_inline(struct > >>>>>btrfs_path *path, > >>>>> max_size = min_t(unsigned long, PAGE_SIZE, max_size); > >>>>> ret = btrfs_decompress(compress_type, tmp, page, > >>>>>extent_offset, inline_size, max_size); > >>>>>+WARN_ON(max_size + pg_offset > PAGE_SIZE); > >>>> > >>>>Can you please drop this WARN_ON and make the math reflect any possible > >>>>pg_offset? I do agree it shouldn't be happening, but its easy to correct > >>>>for and the WARN is likely to get lost. > >>> > >>>I'm not sure how to do that. It looks like I'd have to pass pg_offset > >>>through btrfs_decompress to the decompress functions? > >>> > >>> ret = btrfs_decompress(compress_type, tmp, page, > >>> extent_offset, inline_size, max_size, pg_offset); > >>> > >>>and in the compression functions get pg_offset from the argument list > >>>instead of hardcoding zero. > >> > >>Yeah, it's a good point. Both zlib and lzo are assuming a zero pg_offset > >>right now, but just like there are wacky corners allowing inline extents > >>followed by more data, there are a few wacky corners allowing inline extents > >>at the end of the file. > >> > >>Lets not mix that change in with this one though. For now, just get the > >>memset right and we can pass pg_offset down in a later patch. 
> >Are you saying "fix the memset in the patch" (and if so, what's wrong
> >with it?), or are you saying "let's take the patch with its memset as is,
> >and fix the pg_offset > 0 issues later"?
> 
> Your WARN_ON() would fire when this math is bad:
> 
> memset(map + pg_offset + max_size, 0, PAGE_SIZE - max_size - pg_offset);
> 
> Instead of warning, just don't memset if pg_offset + max_size >= PAGE_SIZE

OK.  While I was looking at this function I noticed that there doesn't
seem to be a sanity check on the data in the extent ref.  e.g. ram_bytes
could be 2GB and nothing would notice.  I'm pretty sure that's only
possible by fuzzing, but it seemed worthwhile to log it if it ever
happened.

I'll take the WARN_ON out, and also put in the comment you asked for in
the other branch of this thread.

> -chris
Re: [PATCH v3] btrfs: add missing memset while reading compressed inline extents
On Fri, Mar 10, 2017 at 11:19:24AM -0500, Chris Mason wrote: > On 03/09/2017 11:41 PM, Zygo Blaxell wrote: > >On Thu, Mar 09, 2017 at 10:39:49AM -0500, Chris Mason wrote: > >> > >> > >>On 03/08/2017 09:12 PM, Zygo Blaxell wrote: > >>>This is a story about 4 distinct (and very old) btrfs bugs. > >>> > >> > >>Really great write up. > >> > >>[ ... ] > >> > >>> > >>>diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > >>>index 25ac2cf..4d41a31 100644 > >>>--- a/fs/btrfs/inode.c > >>>+++ b/fs/btrfs/inode.c > >>>@@ -6805,6 +6805,12 @@ static noinline int uncompress_inline(struct > >>>btrfs_path *path, > >>> max_size = min_t(unsigned long, PAGE_SIZE, max_size); > >>> ret = btrfs_decompress(compress_type, tmp, page, > >>> extent_offset, inline_size, max_size); > >>>+ WARN_ON(max_size + pg_offset > PAGE_SIZE); > >> > >>Can you please drop this WARN_ON and make the math reflect any possible > >>pg_offset? I do agree it shouldn't be happening, but its easy to correct > >>for and the WARN is likely to get lost. > > > >I'm not sure how to do that. It looks like I'd have to pass pg_offset > >through btrfs_decompress to the decompress functions? > > > > ret = btrfs_decompress(compress_type, tmp, page, > > extent_offset, inline_size, max_size, pg_offset); > > > >and in the compression functions get pg_offset from the argument list > >instead of hardcoding zero. > > Yeah, it's a good point. Both zlib and lzo are assuming a zero pg_offset > right now, but just like there are wacky corners allowing inline extents > followed by more data, there are a few wacky corners allowing inline extents > at the end of the file. > > Lets not mix that change in with this one though. For now, just get the > memset right and we can pass pg_offset down in a later patch. Are you saying "fix the memset in the patch" (and if so, what's wrong with it?), or are you saying "let's take the patch with its memset as is, and fix the pg_offset > 0 issues later"? 
> -chris
Re: [PATCH v3] btrfs: add missing memset while reading compressed inline extents
On Thu, Mar 09, 2017 at 10:39:49AM -0500, Chris Mason wrote: > > > On 03/08/2017 09:12 PM, Zygo Blaxell wrote: > >This is a story about 4 distinct (and very old) btrfs bugs. > > > > Really great write up. > > [ ... ] > > > > >diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > >index 25ac2cf..4d41a31 100644 > >--- a/fs/btrfs/inode.c > >+++ b/fs/btrfs/inode.c > >@@ -6805,6 +6805,12 @@ static noinline int uncompress_inline(struct > >btrfs_path *path, > > max_size = min_t(unsigned long, PAGE_SIZE, max_size); > > ret = btrfs_decompress(compress_type, tmp, page, > >extent_offset, inline_size, max_size); > >+WARN_ON(max_size + pg_offset > PAGE_SIZE); > > Can you please drop this WARN_ON and make the math reflect any possible > pg_offset? I do agree it shouldn't be happening, but its easy to correct > for and the WARN is likely to get lost. I'm not sure how to do that. It looks like I'd have to pass pg_offset through btrfs_decompress to the decompress functions? ret = btrfs_decompress(compress_type, tmp, page, extent_offset, inline_size, max_size, pg_offset); and in the compression functions get pg_offset from the argument list instead of hardcoding zero. But how does pg_offset become non-zero for an inline extent? A micro-hole before the first byte? If the offset was >= 4096, the data wouldn't be in the first block so there would never be an inline extent in the first place. > >+if (max_size + pg_offset < PAGE_SIZE) { > >+char *map = kmap(page); > >+memset(map + pg_offset + max_size, 0, PAGE_SIZE - max_size - > >pg_offset); > >+kunmap(page); > >+} > > Both lzo and zlib have a memset to cover the gap between what they actually > decompress and the max_size that we pass here. That's important because > ram_bytes may not be 100% accurate. > > Can you also please toss in a comment about how the decompression code is > responsible for the memset up to max_bytes? 
> > -chris > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: add missing memset while reading compressed inline extents
On Wed, Mar 08, 2017 at 10:27:33AM +, Filipe Manana wrote: > On Wed, Mar 8, 2017 at 3:18 AM, Zygo Blaxell > <zblax...@waya.furryterror.org> wrote: > > From: Zygo Blaxell <ce3g8...@umail.furryterror.org> > > > > This is a story about 4 distinct (and very old) btrfs bugs. > > > > Commit c8b978188c ("Btrfs: Add zlib compression support") added > > three data corruption bugs for inline extents (bugs #1-3). > > > > Commit 93c82d5750 ("Btrfs: zero page past end of inline file items") > > fixed bug #1: uncompressed inline extents followed by a hole and more > > extents could get non-zero data in the hole as they were read. The fix > > was to add a memset in btrfs_get_extent to zero out the hole. > > > > Commit 166ae5a418 ("btrfs: fix inline compressed read err corruption") > > fixed bug #2: compressed inline extents which contained non-zero bytes > > might be replaced with zero bytes in some cases. This patch removed an > > unhelpful memset from uncompress_inline, but the case where memset is > > required was missed. > > > > There is also a memset in the decompression code, but this only covers > > decompressed data that is shorter than the ram_bytes from the extent > > ref record. This memset doesn't cover the region between the end of the > > decompressed data and the end of the page. It has also moved around a > > few times over the years, so there's no single patch to refer to. > > > > This patch fixes bug #3: compressed inline extents followed by a hole > > and more extents could get non-zero data in the hole as they were read > > (i.e. bug #3 is the same as bug #1, but s/uncompressed/compressed/). > > The fix is the same: zero out the hole in the compressed case too, > > by putting a memset back in uncompress_inline, but this time with > > correct parameters. > > > > The last and oldest bug, bug #0, is the cause of the offending inline > > extent/hole/extent pattern. Bug #0 is a subtle and mostly-harmless quirk > > of behavior somewhere in the btrfs write code. 
In a few special cases, > > an inline extent and hole are allowed to persist where they normally > > would be combined with later extents in the file. > > > > A fast reproducer for bug #0 is presented below. A few offending extents > > are also created in the wild during large rsync transfers with the -S > > flag. A Linux kernel build (git checkout; make allyesconfig; make -j8) > > will produce a handful of offending files as well. Once an offending > > file is created, it can present different content to userspace each > > time it is read. > > > > Bug #0 is at least 4 and possibly 8 years old. I verified every vX.Y > > kernel back to v3.5 has this behavior. There are fossil records of this > > bug's effects in commits all the way back to v2.6.32. I have no reason > > to believe bug #0 wasn't present at the beginning of btrfs compression > > support in v2.6.29, but I can't easily test kernels that old to be sure. > > > > It is not clear whether bug #0 is worth fixing. A fix would likely > > require injecting extra reads into currently write-only paths, and most > > of the exceptional cases caused by bug #0 are already handled now. > > > > Whether we like them or not, bug #0's inline extents followed by holes > > are part of the btrfs de-facto disk format now, and we need to be able > > to read them without data corruption or an infoleak. So enough about > > bug #0, let's get back to bug #3 (this patch). 
> > > > An example of on-disk structure leading to data corruption: > > > > item 61 key (606890 INODE_ITEM 0) itemoff 9662 itemsize 160 > > inode generation 50 transid 50 size 47424 nbytes 49141 > > block group 0 mode 100644 links 1 uid 0 gid 0 > > rdev 0 flags 0x0(none) > > item 62 key (606890 INODE_REF 603050) itemoff 9642 itemsize 20 > > inode ref index 3 namelen 10 name: DB_File.so > > item 63 key (606890 EXTENT_DATA 0) itemoff 8280 itemsize 1362 > > inline extent data size 1341 ram 4085 compress(zlib) > > item 64 key (606890 EXTENT_DATA 4096) itemoff 8227 itemsize 53 > > extent data disk byte 5367308288 nr 20480 > > extent data offset 0 nr 45056 ram 45056 > > extent compression(zlib) > > So this case is actually different from the reproducer below, because > once a file has prealloc extents, future writes will never be > compressed. That is, the extent at offset 4096 can not ha
[PATCH v3] btrfs: add missing memset while reading compressed inline extents
0

Actual output: the data from byte 1000 to the end of the first 4096 byte page will be corrupt/infoleak:

	000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
	*
	0001740 cdcd cdcd cdcd cdcd 6c63 7400 635f 006d
	0001760 5f74 6f43 7400 435f 0053 5f74 7363 7400
	0002000 435f 0056 5f74 6164 7400 645f 0062 5f74
	(...)

Signed-off-by: Zygo Blaxell <ce3g8...@umail.furryterror.org>
Reviewed-by: Liu Bo <bo.li@oracle.com>
---
v3: Clarify that there are two distinct methods to create the hole, but
    both lead to the same corruption/infoleak when the hole is read.
    No code change.

v2: I'm not able to contrive a test case where pg_offset != 0, but we
    might as well handle it anyway.

 fs/btrfs/inode.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 25ac2cf..4d41a31 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6805,6 +6805,12 @@ static noinline int uncompress_inline(struct btrfs_path *path,
 	max_size = min_t(unsigned long, PAGE_SIZE, max_size);
 	ret = btrfs_decompress(compress_type, tmp, page,
 			       extent_offset, inline_size, max_size);
+	WARN_ON(max_size + pg_offset > PAGE_SIZE);
+	if (max_size + pg_offset < PAGE_SIZE) {
+		char *map = kmap(page);
+		memset(map + pg_offset + max_size, 0, PAGE_SIZE - max_size - pg_offset);
+		kunmap(page);
+	}
 	kfree(tmp);
 	return ret;
 }
-- 
2.1.4
[PATCH] btrfs: add missing memset while reading compressed inline extents
From: Zygo Blaxell <ce3g8...@umail.furryterror.org> This is a story about 4 distinct (and very old) btrfs bugs. Commit c8b978188c ("Btrfs: Add zlib compression support") added three data corruption bugs for inline extents (bugs #1-3). Commit 93c82d5750 ("Btrfs: zero page past end of inline file items") fixed bug #1: uncompressed inline extents followed by a hole and more extents could get non-zero data in the hole as they were read. The fix was to add a memset in btrfs_get_extent to zero out the hole. Commit 166ae5a418 ("btrfs: fix inline compressed read err corruption") fixed bug #2: compressed inline extents which contained non-zero bytes might be replaced with zero bytes in some cases. This patch removed an unhelpful memset from uncompress_inline, but the case where memset is required was missed. There is also a memset in the decompression code, but this only covers decompressed data that is shorter than the ram_bytes from the extent ref record. This memset doesn't cover the region between the end of the decompressed data and the end of the page. It has also moved around a few times over the years, so there's no single patch to refer to. This patch fixes bug #3: compressed inline extents followed by a hole and more extents could get non-zero data in the hole as they were read (i.e. bug #3 is the same as bug #1, but s/uncompressed/compressed/). The fix is the same: zero out the hole in the compressed case too, by putting a memset back in uncompress_inline, but this time with correct parameters. The last and oldest bug, bug #0, is the cause of the offending inline extent/hole/extent pattern. Bug #0 is a subtle and mostly-harmless quirk of behavior somewhere in the btrfs write code. In a few special cases, an inline extent and hole are allowed to persist where they normally would be combined with later extents in the file. A fast reproducer for bug #0 is presented below. 
A few offending extents are also created in the wild during large rsync transfers with the -S flag. A Linux kernel build (git checkout; make allyesconfig; make -j8) will produce a handful of offending files as well. Once an offending file is created, it can present different content to userspace each time it is read. Bug #0 is at least 4 and possibly 8 years old. I verified every vX.Y kernel back to v3.5 has this behavior. There are fossil records of this bug's effects in commits all the way back to v2.6.32. I have no reason to believe bug #0 wasn't present at the beginning of btrfs compression support in v2.6.29, but I can't easily test kernels that old to be sure. It is not clear whether bug #0 is worth fixing. A fix would likely require injecting extra reads into currently write-only paths, and most of the exceptional cases caused by bug #0 are already handled now. Whether we like them or not, bug #0's inline extents followed by holes are part of the btrfs de-facto disk format now, and we need to be able to read them without data corruption or an infoleak. So enough about bug #0, let's get back to bug #3 (this patch). An example of on-disk structure leading to data corruption: item 61 key (606890 INODE_ITEM 0) itemoff 9662 itemsize 160 inode generation 50 transid 50 size 47424 nbytes 49141 block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0 flags 0x0(none) item 62 key (606890 INODE_REF 603050) itemoff 9642 itemsize 20 inode ref index 3 namelen 10 name: DB_File.so item 63 key (606890 EXTENT_DATA 0) itemoff 8280 itemsize 1362 inline extent data size 1341 ram 4085 compress(zlib) item 64 key (606890 EXTENT_DATA 4096) itemoff 8227 itemsize 53 extent data disk byte 5367308288 nr 20480 extent data offset 0 nr 45056 ram 45056 extent compression(zlib) Different data appears in userspace during each read of the 11 bytes between 4085 and 4096. 
The extent in item 63 is not long enough to fill the first page of the file, so a memset is required to fill the space between item 63 (ending at 4085) and item 64 (beginning at 4096) with zero.

Here is a reproducer from Liu Bo:

Using 'page_poison=on' kernel command line (or enable
CONFIG_PAGE_POISONING) run the following:

	# touch foo
	# chattr +c foo
	# xfs_io -f -c "pwrite -W 0 1000" foo
	# xfs_io -f -c "falloc 4 8188" foo
	# od -x foo
	# echo 3 >/proc/sys/vm/drop_caches
	# od -x foo

This produce the following on my box:

	000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
	*
	0001740 cdcd cdcd cdcd cdcd
	0001760
	*
	002

	000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
	*
	0001740 cdcd cdcd cdcd cdcd 6c63 7400 635f 006d
	0001760 5f74 6f43 7400 435f 0053 5f74 7363 7400
	0002000 435f 0056 5f74 6164 7400 645f 0062 5f74
	(...)

v2: I'm not able to contrive a
Re: [PATCH] btrfs: fix hole read corruption for compressed inline extents
Ping? This is still reproducible on 4.9.8. On Mon, Nov 28, 2016 at 12:03:12AM -0500, Zygo Blaxell wrote: > Commit c8b978188c ("Btrfs: Add zlib compression support") produces > data corruption when reading a file with a hole positioned after an > inline extent. btrfs_get_extent will return uninitialized kernel memory > instead of zero bytes in the hole. > > Commit 93c82d5750 ("Btrfs: zero page past end of inline file items") > fills the hole by memset to zero after *uncompressed* inline extents. > > This patch provides the missing memset for holes after *compressed* > inline extents. > > The offending holes appear in the wild and will appear during routine > data integrity audits (e.g. comparing backups against their originals). > They can also be created intentionally by fuzzing or crafting a filesystem > image. > > Holes like these are not intended to occur in btrfs; however, I tested > tagged kernels between v3.5 and the present, and found that all of them > can create such holes. Whether we like them or not, this kind of hole > is now part of the btrfs de-facto on-disk format, and we need to be able > to read such holes without an infoleak or wrong data. > > An example of a hole leading to data corruption: > > item 61 key (606890 INODE_ITEM 0) itemoff 9662 itemsize 160 > inode generation 50 transid 50 size 47424 nbytes 49141 > block group 0 mode 100644 links 1 uid 0 gid 0 > rdev 0 flags 0x0(none) > item 62 key (606890 INODE_REF 603050) itemoff 9642 itemsize 20 > inode ref index 3 namelen 10 name: DB_File.so > item 63 key (606890 EXTENT_DATA 0) itemoff 8280 itemsize 1362 > inline extent data size 1341 ram 4085 compress(zlib) > item 64 key (606890 EXTENT_DATA 4096) itemoff 8227 itemsize 53 > extent data disk byte 5367308288 nr 20480 > extent data offset 0 nr 45056 ram 45056 > extent compression(zlib) > > Different data appears in userspace during each uncached read of the 10 > bytes between offset 4085 and 4095. 
The extent in item 63 is not long > enough to fill the first page of the file, so a memset is required to > fill the space between item 63 (ending at 4085) and item 64 (beginning > at 4096) with zero. > > Signed-off-by: Zygo Blaxell <ce3g8...@umail.furryterror.org> > > --- > fs/btrfs/inode.c | 6 ++ > 1 file changed, 6 insertions(+) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 8e3a5a2..b1314d6 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -6803,6 +6803,12 @@ static noinline int uncompress_inline(struct > btrfs_path *path, > max_size = min_t(unsigned long, PAGE_SIZE, max_size); > ret = btrfs_decompress(compress_type, tmp, page, > extent_offset, inline_size, max_size); > + WARN_ON(max_size > PAGE_SIZE); > + if (max_size < PAGE_SIZE) { > + char *map = kmap(page); > + memset(map + max_size, 0, PAGE_SIZE - max_size); > + kunmap(page); > + } > kfree(tmp); > return ret; > } > -- > 2.1.4 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [markfasheh/duperemove] Why blocksize is limit to 1MB?
On Wed, Jan 04, 2017 at 07:58:55AM -0500, Austin S. Hemmelgarn wrote: > On 2017-01-03 16:35, Peter Becker wrote: > >As i understand the duperemove source-code right (i work on/ try to > >improve this code since 5 or 6 weeks on multiple parts), duperemove > >does hashing and calculation before they call extend_same. > >Duperemove stores all in a hashfile and read this. after all files > >hashed, and duplicates detected, the progress all in order without > >reading new data form disk / hashfile. so the byte-by-byte comparison > >of extend_same ioctl should consume the full possible bandwidth of the > >disks. > Not necessarily. You've actually got a significant amount of processing > between each disk operation. General ordering inside the ioctl is: > 1. Do generic ioctl setup. > 2. Lock the extents. > 3. Read the ranges into memory. > 4. Compare the ranges. > 5. If the ranges are identical, write out the changes needed to reflink > them. > 6. Unlock all the extents. > 7. Do generic ioctl cleanup. > 1 and 7 in particular are pretty heavy. Ioctls were not intended to be > called with this kind of frequency, and that fact really shows in the setup > and teardown (overhead is way higher than a syscall). Steps 1 and 7 are not heavy at all. ioctl setup is an order of magnitude higher than other system calls, but still up to 11 orders of magnitude faster than the other steps. The other steps are *slow*, and step 5 is orders of magnitude slower than all the others combined. Most of the time in step 5 is spent deleting the dst extent refs (or waiting for transaction commits, but everything waits for those). It gets worse when you have big files (1G and larger), more extents, and more extent references in the same inode. On a 100G file the overhead of manipulating shared extent refs is so large that the rest of the extent-same ioctl is just noise by comparison (microseconds vs minutes). 
The commit 1d57ee9 "btrfs: improve delayed refs iterations" (merged in v4.10-rc1) helps a bit with this, but deleting shared refs is still one of the most expensive things you can do in btrfs. > The operation ended > up being an ioctl instead of a syscall (or extension to another syscall) > because: > 1. Manipulating low-level filesystem state is part of what they're intended > to be used for. > 2. Introducing a new FS specific ioctl is a whole lot less controversial > than introducing a new FS specific syscall. > > > >1. dbfile_load_hashes > >2. find_all_dupes > >3. dedupe_results > >-> call the following in N threads: > >>dedupe_extent_list > >>>list_for_each_entry > add_extent_to_dedupe #produce a simple list/queue > dedupe_extents > >btrfs_extent_same > >>BTRFS_IOC_FILE_EXTENT_SAME > > > >So if this right, one of this thinks is realy slow: > > > >1. byte-per-byte comparison > There's no way that this part can't be slow. You need to load the data into > the registers to do the comparison, you can't just point something at RAM > and get an answer. On x86, this in turn means that the comparison amounts > to a loop of 2 loads followed by a compare and a branch for , repeated once > for each range beyond the first, and that's assuming that the compiler > optimizes it to the greatest degree possible. On some other systems the > compare and branch are one instruction, on others the second load might be > eliminated, but overall it's not something that can be sped up all that > much. On cheap amd64 machines this can be done at gigabytes per second. Not much gain from optimizing this. > >2. sets up the reflinks > This actually is not as efficient as it sounds like it should be, adding > reflinks means updating metadata, which means that there is some unavoidable > overhead here. I doubt that it's where the issue is, but I may be wrong. Most of the time spent here is spent waiting for IO. extent-same seems to imply fsync() with all the performance cost thereof. > >3. 
unlocks the new extent > There's one other aspect not listed here, locking the original extents, > which can actually add quite a lot of overhead if the files are actually > being used. > > > >If i'm not wrong with my understanding of the duperemove source code, > >this behaivor should also affected the online dedupe feature on with > >Qu Wenruo works. > AFAIK, that uses a different code path from the batch deduplication ioctl. > It also doesn't have the context switches and other overhead from an ioctl > involved, because it's done in kernel code. No difference there--the extent-same ioctl is all kernel code too. > >2017-01-03 21:40 GMT+01:00 Austin S. Hemmelgarn: > >>On 2017-01-03 15:20, Peter Becker wrote: > >>> > >>>I think i understand. The resulting keyquestion is, how i can improve > >>>the performance of extend_same ioctl. > >>>I tested it with following results: > >>> > >>>enviorment: > >>>2 files, called "file", size each 100GB, duperemove nofiemap-options > >>>set, 1MB extend size. > >>> > >>>duperemove
Re: [PATCH] btrfs: fix hole read corruption for compressed inline extents
05969: 1: 35913575: last,eof ./drivers/ata/.pata_sis.o.cmd: 9 extents found Note that corruption can only occur if the first extent (the inline extent at offset 0) is compressed (encoded). Uncompressed inline extents (like the one above) will not be corrupted due to the fix in commit 93c82d5750. If commit 93c82d5750 is reverted, you can get corruption on uncompressed files too. >Thanks, >Xin > > Sent: Saturday, December 10, 2016 at 9:16 PM >From: "Zygo Blaxell" <ce3g8...@umail.furryterror.org> >To: "Roman Mamedov" <r...@romanrm.net>, "Filipe Manana" > <fdman...@gmail.com> >Cc: linux-btrfs@vger.kernel.org >Subject: Re: [PATCH] btrfs: fix hole read corruption for compressed inline >extents >Ping? > >I know at least two people have read this patch, but it hasn't appeared in >the usual integration branches yet, and I've seen no actionable suggestion >to improve it. I've provided two non-overlapping rationales for it. >Is there something else you are looking for? > >This patch is a fix for a simple data corruption bug. It (or some >equivalent fix for the same bug) should be on its way to all stable > kernels starting from 2.6.32. 
> >Thanks > >On Mon, Nov 28, 2016 at 05:27:10PM +0500, Roman Mamedov wrote: >> On Mon, 28 Nov 2016 00:03:12 -0500 >> Zygo Blaxell <ce3g8...@umail.furryterror.org> wrote: >> >> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c >> > index 8e3a5a2..b1314d6 100644 >> > --- a/fs/btrfs/inode.c >> > +++ b/fs/btrfs/inode.c >> > @@ -6803,6 +6803,12 @@ static noinline int uncompress_inline(struct >btrfs_path *path, >> > max_size = min_t(unsigned long, PAGE_SIZE, max_size); >> > ret = btrfs_decompress(compress_type, tmp, page, >> > extent_offset, inline_size, max_size); >> > + WARN_ON(max_size > PAGE_SIZE); >> > + if (max_size < PAGE_SIZE) { >> > + char *map = kmap(page); >> > + memset(map + max_size, 0, PAGE_SIZE - max_size); >> > + kunmap(page); >> > + } >> > kfree(tmp); >> > return ret; >> > } >> >> Wasn't this already posted as: >> >> btrfs: fix silent data corruption while reading compressed inline >extents >> [1]https://patchwork.kernel.org/patch/9371971/ >> >> but you don't indicate that's a V2 or something, and in fact the patch >seems >> exactly the same, just the subject and commit message are entirely >different. >> Quite confusing. >> >> -- >> With respect, >> Roman >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" >in >> the body of a message to majord...@vger.kernel.org >> More majordomo info at [2]http://vger.kernel.org/majordomo-info.html > > References > >Visible links >1. https://patchwork.kernel.org/patch/9371971/ >2. http://vger.kernel.org/majordomo-info.html signature.asc Description: Digital signature
Re: [PATCH] btrfs: fix hole read corruption for compressed inline extents
Ping? I know at least two people have read this patch, but it hasn't appeared in the usual integration branches yet, and I've seen no actionable suggestion to improve it. I've provided two non-overlapping rationales for it. Is there something else you are looking for? This patch is a fix for a simple data corruption bug. It (or some equivalent fix for the same bug) should be on its way to all stable kernels starting from 2.6.32. Thanks On Mon, Nov 28, 2016 at 05:27:10PM +0500, Roman Mamedov wrote: > On Mon, 28 Nov 2016 00:03:12 -0500 > Zygo Blaxell <ce3g8...@umail.furryterror.org> wrote: > > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > > index 8e3a5a2..b1314d6 100644 > > --- a/fs/btrfs/inode.c > > +++ b/fs/btrfs/inode.c > > @@ -6803,6 +6803,12 @@ static noinline int uncompress_inline(struct > > btrfs_path *path, > > max_size = min_t(unsigned long, PAGE_SIZE, max_size); > > ret = btrfs_decompress(compress_type, tmp, page, > >extent_offset, inline_size, max_size); > > + WARN_ON(max_size > PAGE_SIZE); > > + if (max_size < PAGE_SIZE) { > > + char *map = kmap(page); > > + memset(map + max_size, 0, PAGE_SIZE - max_size); > > + kunmap(page); > > + } > > kfree(tmp); > > return ret; > > } > > Wasn't this already posted as: > > btrfs: fix silent data corruption while reading compressed inline extents > https://patchwork.kernel.org/patch/9371971/ > > but you don't indicate that's a V2 or something, and in fact the patch seems > exactly the same, just the subject and commit message are entirely different. > Quite confusing. > > -- > With respect, > Roman > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html signature.asc Description: Digital signature
[PATCH v2] btrfs-progs: utils: negative numbers are more plausible than sizes over 8 EiB
I got tired of seeing "16.00EiB" whenever btrfs-progs encounters a negative size value, e.g. during resize:

	Unallocated:
	   /dev/mapper/datamd18 16.00EiB

This version is much more useful:

	Unallocated:
	   /dev/mapper/datamd18 -26.29GiB

Signed-off-by: Zygo Blaxell <ce3g8...@umail.furryterror.org>
---
v2: change the function prototype so it's easier to see that the
    mangling implied by the name "pretty" includes "reinterpretation
    of the u64 value as a signed quantity."
---
 utils.c | 12 ++++++------
 utils.h |  4 ++--
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/utils.c b/utils.c
index 69b580a..07e8443 100644
--- a/utils.c
+++ b/utils.c
@@ -2575,7 +2575,7 @@ out:
  * Note: this function uses a static per-thread buffer. Do not call this
  * function more than 10 times within one argument list!
  */
-const char *pretty_size_mode(u64 size, unsigned mode)
+const char *pretty_size_mode(s64 size, unsigned mode)
 {
 	static __thread int ps_index = 0;
 	static __thread char ps_array[10][32];
@@ -2594,20 +2594,20 @@ static const char* unit_suffix_binary[] =
 static const char* unit_suffix_decimal[] =
 	{ "B", "kB", "MB", "GB", "TB", "PB", "EB"};
 
-int pretty_size_snprintf(u64 size, char *str, size_t str_size, unsigned unit_mode)
+int pretty_size_snprintf(s64 size, char *str, size_t str_size, unsigned unit_mode)
 {
 	int num_divs;
 	float fraction;
-	u64 base = 0;
+	s64 base = 0;
 	int mult = 0;
 	const char** suffix = NULL;
-	u64 last_size;
+	s64 last_size;
 
 	if (str_size == 0)
 		return 0;
 
 	if ((unit_mode & ~UNITS_MODE_MASK) == UNITS_RAW) {
-		snprintf(str, str_size, "%llu", size);
+		snprintf(str, str_size, "%lld", size);
 		return 0;
 	}
 
@@ -2642,7 +2642,7 @@ int pretty_size_snprintf(u64 size, char *str, size_t str_size, unsigned unit_mod
 		num_divs = 0;
 		break;
 	default:
-		while (size >= mult) {
+		while ((size < 0 ? -size : size) >= mult) {
 			last_size = size;
 			size /= mult;
 			num_divs++;
diff --git a/utils.h b/utils.h
index 366ca29..525bde9 100644
--- a/utils.h
+++ b/utils.h
@@ -174,9 +174,9 @@ int check_mounted_where(int fd, const char *file, char *where, int size,
 int btrfs_device_already_in_root(struct btrfs_root *root, int fd,
 				 int super_offset);
 
-int pretty_size_snprintf(u64 size, char *str, size_t str_bytes, unsigned unit_mode);
+int pretty_size_snprintf(s64 size, char *str, size_t str_bytes, unsigned unit_mode);
 #define pretty_size(size) 	pretty_size_mode(size, UNITS_DEFAULT)
-const char *pretty_size_mode(u64 size, unsigned mode);
+const char *pretty_size_mode(s64 size, unsigned mode);
 u64 parse_size(char *s);
 u64 parse_qgroupid(const char *p);
-- 
2.1.4
Re: [PATCH] btrfs-progs: utils: negative numbers are more plausible than sizes over 8 EiB
On Sat, Dec 03, 2016 at 10:25:17AM -0800, Omar Sandoval wrote: > On Sat, Dec 03, 2016 at 01:19:38AM -0500, Zygo Blaxell wrote: > > I got tired of seeing "16.00EiB" whenever btrfs-progs encounters a > > negative size value. > > > > e.g. during filesystem shrink we see: > > > > Unallocated: > >/dev/mapper/testvol0 16.00EiB > > > > Interpreting this as a signed quantity is much more useful: > > > > Unallocated: > >/dev/mapper/testvol0 -26.29GiB > > > > Signed-off-by: Zygo Blaxell <ce3g8...@umail.furryterror.org> > > --- > > utils.c | 13 - > > 1 file changed, 8 insertions(+), 5 deletions(-) > > > > diff --git a/utils.c b/utils.c > > index 69b580a..bd2b66e 100644 > > --- a/utils.c > > +++ b/utils.c > > @@ -2594,20 +2594,23 @@ static const char* unit_suffix_binary[] = > > static const char* unit_suffix_decimal[] = > > { "B", "kB", "MB", "GB", "TB", "PB", "EB"}; > > > > -int pretty_size_snprintf(u64 size, char *str, size_t str_size, unsigned > > unit_mode) > > +int pretty_size_snprintf(u64 usize, char *str, size_t str_size, unsigned > > unit_mode) > > { > > int num_divs; > > float fraction; > > - u64 base = 0; > > + s64 base = 0; > > int mult = 0; > > const char** suffix = NULL; > > - u64 last_size; > > + s64 last_size; > > > > if (str_size == 0) > > return 0; > > > > + /* Negative numbers are more plausible than sizes over 8 EiB. */ > > + s64 size = (s64)usize; > > Just make pretty_size_snprintf() take an s64 size so it's clear from the > function signature that it's signed instead of hidden in the definition. I intentionally buried the unsigned -> signed conversion in the lowest level function so I wouldn't trigger signed/unsigned conversion warnings at all 46 call sites for pretty_size_mode. The btrfs code uses u64 endemically for all size data, and I wasn't about to try to change that. The word "pretty" in the function name should imply that what comes out is a possibly lossy transformation of what goes in. 
Since "16.00EiB" is much more lossy than "-29.96GiB", I believe I am merely reducing the lossiness quantitatively rather than qualitatively. On the other hand, the signed/unsigned warning isn't enabled by default in this project. I can certainly do it that way if you prefer. > > + > > if ((unit_mode & ~UNITS_MODE_MASK) == UNITS_RAW) { > > - snprintf(str, str_size, "%llu", size); > > + snprintf(str, str_size, "%lld", size); > > return 0; > > } > > > > @@ -2642,7 +2645,7 @@ int pretty_size_snprintf(u64 size, char *str, size_t > > str_size, unsigned unit_mod > >num_divs = 0; > >break; > > default: > > - while (size >= mult) { > > + while ((size < 0 ? -size : size) >= mult) { > > last_size = size; > > size /= mult; > > num_divs++; > > -- > > 2.1.4 > > > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > > the body of a message to majord...@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > signature.asc Description: Digital signature
[PATCH] btrfs-progs: utils: negative numbers are more plausible than sizes over 8 EiB
I got tired of seeing "16.00EiB" whenever btrfs-progs encounters a negative size value. e.g. during filesystem shrink we see: Unallocated: /dev/mapper/testvol0 16.00EiB Interpreting this as a signed quantity is much more useful: Unallocated: /dev/mapper/testvol0 -26.29GiB Signed-off-by: Zygo Blaxell <ce3g8...@umail.furryterror.org> --- utils.c | 13 - 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/utils.c b/utils.c index 69b580a..bd2b66e 100644 --- a/utils.c +++ b/utils.c @@ -2594,20 +2594,23 @@ static const char* unit_suffix_binary[] = static const char* unit_suffix_decimal[] = { "B", "kB", "MB", "GB", "TB", "PB", "EB"}; -int pretty_size_snprintf(u64 size, char *str, size_t str_size, unsigned unit_mode) +int pretty_size_snprintf(u64 usize, char *str, size_t str_size, unsigned unit_mode) { int num_divs; float fraction; - u64 base = 0; + s64 base = 0; int mult = 0; const char** suffix = NULL; - u64 last_size; + s64 last_size; if (str_size == 0) return 0; + /* Negative numbers are more plausible than sizes over 8 EiB. */ + s64 size = (s64)usize; + if ((unit_mode & ~UNITS_MODE_MASK) == UNITS_RAW) { - snprintf(str, str_size, "%llu", size); + snprintf(str, str_size, "%lld", size); return 0; } @@ -2642,7 +2645,7 @@ int pretty_size_snprintf(u64 size, char *str, size_t str_size, unsigned unit_mod num_divs = 0; break; default: - while (size >= mult) { + while ((size < 0 ? -size : size) >= mult) { last_size = size; size /= mult; num_divs++; -- 2.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RFC: raid with a variable stripe size
On Tue, Nov 29, 2016 at 02:03:58PM +0800, Qu Wenruo wrote: > At 11/29/2016 01:51 PM, Chris Murphy wrote: > >On Mon, Nov 28, 2016 at 5:48 PM, Qu Wenruowrote: > >> > >> > >>At 11/19/2016 02:15 AM, Goffredo Baroncelli wrote: > >>> > >>>Hello, > >>> > >>>these are only my thoughts; no code here, but I would like to share it > >>>hoping that it could be useful. > >>> > >>>As reported several times by Zygo (and others), one of the problem of > >>>raid5/6 is the write hole. Today BTRFS is not capable to address it. > >> > >> > >>I'd say, no need to address yet, since current soft RAID5/6 can't handle it > >>yet. > >> > >>Personally speaking, Btrfs should implementing RAID56 support just like > >>Btrfs on mdadm. > >>See how badly the current RAID56 works? > >> > >>The marginally benefit of btrfs RAID56 to scrub data better than tradition > >>RAID56 is just a joke in current code base. > > > >Btrfs is subject to the write hole problem on disk, but any read or > >scrub that needs to reconstruct from parity that is corrupt results in > >a checksum error and EIO. So corruption is not passed up to user > >space. Recent versions of md/mdadm support a write journal to avoid > >the write hole problem on disk in case of a crash. > > That's interesting. > > So I think it's less worthy to support RAID56 in btrfs, especially > considering the stability. > > My widest dream is, btrfs calls device mapper to build a micro RAID1/5/6/10 > device for each chunk. > Which should save us tons of codes and bugs. > > And for better recovery, enhance device mapper to provide interface to judge > which block is correct. > > Although that's just dream anyway. It would be nice to do that for balancing. In many balance cases (especially device delete and full balance after device add) it's not necessary to rewrite the data in a block group, only copy it verbatim to a different physical location (like pvmove does) and update the chunk tree with the new address when it's done. 
No need to rewrite the whole extent tree. > Thanks, > Qu > > > >>>The problem is that the stripe size is bigger than the "sector size" (ok > >>>sector is not the correct word, but I am referring to the basic unit of > >>>writing on the disk, which is 4k or 16K in btrfs). > >>>So when btrfs writes less data than the stripe, the stripe is not filled; > >>>when it is filled by a subsequent write, a RMW of the parity is required. > >>> > >>>On the best of my understanding (which could be very wrong) ZFS try to > >>>solve this issue using a variable length stripe. > >> > >> > >>Did you mean ZFS record size? > >>IIRC that's file extent minimum size, and I didn't see how that can handle > >>the write hole problem. > >> > >>Or did ZFS handle the problem? > > > >ZFS isn't subject to the write hole. My understanding is they get > >around this because all writes are COW, there is no RMW. > >But the > >variable stripe size means they don't have to do the usual (fixed) > >full stripe write for just, for example a 4KiB change in data for a > >single file. Conversely Btrfs does do RMW in such a case. > > > > > >>Anyway, it should be a low priority thing, and personally speaking, > >>any large behavior modification involving both extent allocator and bg > >>allocator will be bug prone. > > > >I tend to agree. I think the non-scalability of Btrfs raid10, which > >makes it behave more like raid 0+1, is a higher priority because right > >now it's misleading to say the least; and then the longer term goal > >for scaleable huge file systems is how Btrfs can shed irreparably > >damaged parts of the file system (tree pruning) rather than > >reconstruction. > > > > > > > > signature.asc Description: Digital signature
Re: RFC: raid with a variable stripe size
On Tue, Nov 29, 2016 at 01:49:09PM +0800, Qu Wenruo wrote: > >>>My proposal requires only a modification to the extent allocator. > >>>The behavior at the block group layer and scrub remains exactly the same. > >>>We just need to adjust the allocator slightly to take the RAID5 CoW > >>>constraints into account. > >> > >>Then, you'd need to allow btrfs to split large buffered/direct write into > >>small extents(not 128M anymore). > >>Not sure if we need to do extra work for DirectIO. > > > >Nope, that's not my proposal. My proposal is to simply ignore free > >space whenever it's inside a partially filled raid stripe (optimization: > >...which was empty at the start of the current transaction). > > Still have problems. > > Allocator must handle fs under device remove or profile converting (from 4 > disks raid5 to 5 disk raid5/6) correctly. > Which already seems complex for me. Those would be allocations in separate block groups with different stripe widths. Already handled in btrfs. > And further more, for fs with more devices, for example, 9 devices RAID5. > It will be a disaster to just write a 4K data and take up the whole 8 * 64K > space. > It will definitely cause huge ENOSPC problem. If you called fsync() after every 4K, yes; otherwise you can just batch up small writes into full-size stripes. The worst case isn't common enough to be a serious problem for a lot of the common RAID5 use cases (i.e. non-database workloads). I wouldn't try running a database on it--I'd use a RAID1 or RAID10 array for that instead, because the other RAID5 performance issues would be deal-breakers. On ZFS the same case degenerates into something like btrfs RAID1 over the 9 disks, which burns over 50% of the space. More efficient than wasting 99% of the space, but still wasteful. > If you really think it's easy, make a RFC patch, which should be easy if it > is, then run fstest auto group on it. 
I plan to when I get time; however, that could be some months in the future and I don't want to "claim" the task and stop anyone else from taking a crack at it in the meantime. > Easy words won't turn emails into real patch. > > >That avoids modifying a stripe with committed data and therefore plugs the > >write hole. > > > >For nodatacow, prealloc (and maybe directio?) extents the behavior > >wouldn't change (you'd have write hole, but only on data blocks not > >metadata, and only on files that were already marked as explicitly not > >requiring data integrity). > > > >>And in fact, you're going to support variant max file extent size. > > > >The existing extent sizing behavior is not changed *at all* in my proposal, > >only the allocator's notion of what space is 'free'. > > > >We can write an extent across multiple RAID5 stripes so long as we > >finish writing the entire extent before pointing committed metadata to > >it. btrfs does that already otherwise checksums wouldn't work. > > > >>This makes delalloc more complex (Wang enhanced dealloc support for variant > >>file extent size, to fix ENOSPC problem for dedupe and compression). > >> > >>This is already much more complex than you expected. > > > >The complexity I anticipate is having to deal with two implementations > >of the free space search, one for free space cache and one for free > >space tree. > > > >It could be as simple as calling the existing allocation functions and > >just filtering out anything that isn't suitably aligned inside a raid56 > >block group (at least for a proof of concept). > > > >>And this is the *BIGGEST* problem of current btrfs: > >>No good enough(if there is any) *ISOLATION* for such a complex fs. > >> > >>So even "small" modification can lead to unexpected bugs. > >> > >>That's why I want to isolate the fix in RAID56 layer, not any layer upwards. 
> > > >I don't think the write hole is fixable in the current raid56 layer, at > >least not without a nasty brute force solution like stripe update journal. > > > >Any of the fixes I'd want to use fix the problem from outside. > > > >>If not possible, I prefer not to do anything yet, until we are sure the very > >>basic part of RAID56 is stable. > >> > >>Thanks, > >>Qu > >> > >>> > >>>It's not as efficient as the ZFS approach, but it doesn't require an > >>>incompatible disk format change either. > >>> > >On BTRFS this could be achieved using several BGs (== block group or > >chunk), one for each stripe size. > > > >For example, if a filesystem - RAID5 is composed by 4 DISK, the > >filesystem should have three BGs: > >BG #1,composed by two disks (1 data+ 1 parity) > >BG #2 composed by three disks (2 data + 1 parity) > >BG #3 composed by four disks (3 data + 1 parity). > > Too complicated bg layout and further extent allocator modification. > > More code means more bugs, and I'm pretty sure it will be bug prone. > > > Although the idea of variable stripe size can somewhat reduce the problem > under certain situation.
Re: RFC: raid with a variable stripe size
On Tue, Nov 29, 2016 at 12:12:03PM +0800, Qu Wenruo wrote: > > > At 11/29/2016 11:53 AM, Zygo Blaxell wrote: > >On Tue, Nov 29, 2016 at 08:48:19AM +0800, Qu Wenruo wrote: > >>At 11/19/2016 02:15 AM, Goffredo Baroncelli wrote: > >>>Hello, > >>> > >>>these are only my thoughts; no code here, but I would like to share it > >>>hoping that it could be useful. > >>> > >>>As reported several times by Zygo (and others), one of the problem > >>of raid5/6 is the write hole. Today BTRFS is not capable to address it. > >> > >>I'd say, no need to address yet, since current soft RAID5/6 can't handle it > >>yet. > >> > >>Personally speaking, Btrfs should implementing RAID56 support just like > >>Btrfs on mdadm. > > > >Even mdadm doesn't implement it the way btrfs does (assuming all bugs > >are fixed) any more. > > > >>See how badly the current RAID56 works? > > > >>The marginally benefit of btrfs RAID56 to scrub data better than tradition > >>RAID56 is just a joke in current code base. > > > >>>The problem is that the stripe size is bigger than the "sector size" > >>(ok sector is not the correct word, but I am referring to the basic > >>unit of writing on the disk, which is 4k or 16K in btrfs). >So when > >>btrfs writes less data than the stripe, the stripe is not filled; when > >>it is filled by a subsequent write, a RMW of the parity is required. > >>> > >>>On the best of my understanding (which could be very wrong) ZFS try > >>to solve this issue using a variable length stripe. > >> > >>Did you mean ZFS record size? > >>IIRC that's file extent minimum size, and I didn't see how that can handle > >>the write hole problem. > >> > >>Or did ZFS handle the problem? > > > >ZFS's strategy does solve the write hole. In btrfs terms, ZFS embeds the > >parity blocks within extents, so it behaves more like btrfs compression > >in the sense that the data in a RAID-Z extent is encoded differently > >from the data in the file, and the kernel has to transform it on reads > >and writes. 
> > > >No ZFS stripe can contain blocks from multiple different > >transactions because the RAID-Z stripes begin and end on extent > >(single-transaction-write) boundaries, so there is no write hole on ZFS. > > > >There is some space waste in ZFS because the minimum allocation unit > >is two blocks (one data one parity) so any free space that is less > >than two blocks long is unusable. Also the maximum usable stripe width > >(number of disks) is the size of the data in the extent plus one parity > >block. It means if you write a lot of discontiguous 4K blocks, you > >effectively get 2-disk RAID1 and that may result in disappointing > >storage efficiency. > > > >(the above is for RAID-Z1. For Z2 and Z3 add an extra block or two > >for additional parity blocks). > > > >One could implement RAID-Z on btrfs, but it's by far the most invasive > >proposal for fixing btrfs's write hole so far (and doesn't actually fix > >anything, since the existing raid56 format would still be required to > >read old data, and it would still be broken). > > > >>Anyway, it should be a low priority thing, and personally speaking, > >>any large behavior modification involving both extent allocator and bg > >>allocator will be bug prone. > > > >My proposal requires only a modification to the extent allocator. > >The behavior at the block group layer and scrub remains exactly the same. > >We just need to adjust the allocator slightly to take the RAID5 CoW > >constraints into account. > > Then, you'd need to allow btrfs to split large buffered/direct write into > small extents(not 128M anymore). > Not sure if we need to do extra work for DirectIO. Nope, that's not my proposal. My proposal is to simply ignore free space whenever it's inside a partially filled raid stripe (optimization: ...which was empty at the start of the current transaction). That avoids modifying a stripe with committed data and therefore plugs the write hole. For nodatacow, prealloc (and maybe directio?) 
extents the behavior wouldn't change (you'd have write hole, but only on data blocks not metadata, and only on files that were already marked as explicitly not requiring data integrity). > And in fact, you're going to support variant max file extent size. The existing extent sizing behavior is not changed *at all* in my proposal, only the allocator's notion of what space is 'free'. We can write an
Re: RFC: raid with a variable stripe size
On Tue, Nov 29, 2016 at 08:48:19AM +0800, Qu Wenruo wrote: > At 11/19/2016 02:15 AM, Goffredo Baroncelli wrote: > >Hello, > > > >these are only my thoughts; no code here, but I would like to share it > >hoping that it could be useful. > > > >As reported several times by Zygo (and others), one of the problem > of raid5/6 is the write hole. Today BTRFS is not capable to address it. > > I'd say, no need to address yet, since current soft RAID5/6 can't handle it > yet. > > Personally speaking, Btrfs should implementing RAID56 support just like > Btrfs on mdadm. Even mdadm doesn't implement it the way btrfs does (assuming all bugs are fixed) any more. > See how badly the current RAID56 works? > The marginally benefit of btrfs RAID56 to scrub data better than tradition > RAID56 is just a joke in current code base. > >The problem is that the stripe size is bigger than the "sector size" > (ok sector is not the correct word, but I am referring to the basic > unit of writing on the disk, which is 4k or 16K in btrfs). >So when > btrfs writes less data than the stripe, the stripe is not filled; when > it is filled by a subsequent write, a RMW of the parity is required. > > > >On the best of my understanding (which could be very wrong) ZFS try > to solve this issue using a variable length stripe. > > Did you mean ZFS record size? > IIRC that's file extent minimum size, and I didn't see how that can handle > the write hole problem. > > Or did ZFS handle the problem? ZFS's strategy does solve the write hole. In btrfs terms, ZFS embeds the parity blocks within extents, so it behaves more like btrfs compression in the sense that the data in a RAID-Z extent is encoded differently from the data in the file, and the kernel has to transform it on reads and writes. No ZFS stripe can contain blocks from multiple different transactions because the RAID-Z stripes begin and end on extent (single-transaction-write) boundaries, so there is no write hole on ZFS. 
There is some space waste in ZFS because the minimum allocation unit is two blocks (one data one parity) so any free space that is less than two blocks long is unusable. Also the maximum usable stripe width (number of disks) is the size of the data in the extent plus one parity block. It means if you write a lot of discontiguous 4K blocks, you effectively get 2-disk RAID1 and that may result in disappointing storage efficiency. (the above is for RAID-Z1. For Z2 and Z3 add an extra block or two for additional parity blocks). One could implement RAID-Z on btrfs, but it's by far the most invasive proposal for fixing btrfs's write hole so far (and doesn't actually fix anything, since the existing raid56 format would still be required to read old data, and it would still be broken). > Anyway, it should be a low priority thing, and personally speaking, > any large behavior modification involving both extent allocator and bg > allocator will be bug prone. My proposal requires only a modification to the extent allocator. The behavior at the block group layer and scrub remains exactly the same. We just need to adjust the allocator slightly to take the RAID5 CoW constraints into account. It's not as efficient as the ZFS approach, but it doesn't require an incompatible disk format change either. > >On BTRFS this could be achieved using several BGs (== block group or chunk), > >one for each stripe size. > > > >For example, if a filesystem - RAID5 is composed by 4 DISK, the filesystem > >should have three BGs: > >BG #1,composed by two disks (1 data+ 1 parity) > >BG #2 composed by three disks (2 data + 1 parity) > >BG #3 composed by four disks (3 data + 1 parity). > > Too complicated bg layout and further extent allocator modification. > > More code means more bugs, and I'm pretty sure it will be bug prone. > > > Although the idea of variable stripe size can somewhat reduce the problem > under certain situation. 
> For example, if sectorsize is 64K, and we make stripe len to 32K, and use 3
> disc RAID5, we can avoid such write hole problem.
> Without modification to extent/chunk allocator.
>
> And I'd prefer to make stripe len mkfs time parameter, not possible to
> modify after mkfs. To make things easy.
>
> Thanks,
> Qu
>
> >If the data to be written has a size of 4k, it will be allocated to the BG #1.
> >If the data to be written has a size of 8k, it will be allocated to the BG #2.
> >If the data to be written has a size of 12k, it will be allocated to the BG #3.
> >If the data to be written has a size greater than 12k, it will be allocated
> >to the BG #3, until the data fills a full stripe; then the remainder will be
> >stored in BG #1 or BG #2.
> >
> >To avoid unbalancing of the disk usage, each BG could use all the disks,
> >even if a stripe uses less disks: i.e.
> >
> >	DISK1	DISK2	DISK3	DISK4
> >	S1	S1	S1	S2
> >	S2	S2	S3	S3
> >	S3	S4	S4	S4
> >	[]
> >
> >Above is shown a BG which uses all the four disks, but
Re: [PATCH] btrfs: raid56: Use correct stolen pages to calculate P/Q
On Tue, Nov 29, 2016 at 02:52:47AM +0100, Christoph Anton Mitterer wrote: > On Mon, 2016-11-28 at 16:48 -0500, Zygo Blaxell wrote: > > If a drive's > > embedded controller RAM fails, you get corruption on the majority of > > reads from a single disk, and most writes will be corrupted (even if > > they > > were not before). > > Administrating a multi-PiB Tier-2 for the LHC Computing Grid with quite > a number of disks for nearly 10 years now, I'd have never stumbled on > such a case of breakage so far... In data centers you won't see breakages that are common on desktop and laptop drives. Laptops in particular sometimes (often?) go to places that are much less friendly to hardware. All my NAS and enterprise drives in server racks and data centers just wake up one morning stone dead or with a few well-behaved bad sectors, with none of this drama. Boring! > Actually most cases are as simple as HDD fails to work and this is > properly signalled to the controller. > > > > > If there's a transient failure due to environmental > > issues (e.g. short-term high-amplitude vibration or overheating) then > > writes may pause for mechanical retry loops. If there is bitrot in > > SSDs > > (particularly in the address translation tables) it looks like a wall > > of random noise that only ends when the disk goes offline. You can > > get > > combinations of these (e.g. RAM failures caused by transient > > overheating) > > where the drive's behavior changes over time. > > > > When in doubt, don't write. > > Sorry, but these cases as any cases of memory issues (be it main memory > or HDD controller) would also kick in at any normal writes. Yes, but in a RAID1 context there will be another disk with a good copy (or if main RAM is failing, the entire filesystem will be toast no matter what you do). > So there's no point in protecting against this on the storage side... > > Either never write at all... or have good backups for these rare cases. > > > > Cheers, > Chris. 
Re: [PATCH] btrfs: raid56: Use correct stolen pages to calculate P/Q
On Mon, Nov 28, 2016 at 07:32:38PM +0100, Goffredo Baroncelli wrote: > On 2016-11-28 04:37, Christoph Anton Mitterer wrote: > > I think for safety it's best to repair as early as possible (and thus > > on read when a damage is detected), as further blocks/devices may fail > > till eventually a scrub(with repair) would be run manually. > > > > However, there may some workloads under which such auto-repair is > > undesirable as it may cost performance and safety may be less important > > than that. > > I am assuming that a corruption is a quite rare event. So occasionally > it could happens that a page is corrupted and the system corrects > it. This shouldn't have an impact on the workloads. Depends heavily on the specifics of the failure case. If a drive's embedded controller RAM fails, you get corruption on the majority of reads from a single disk, and most writes will be corrupted (even if they were not before). If there's a transient failure due to environmental issues (e.g. short-term high-amplitude vibration or overheating) then writes may pause for mechanical retry loops. If there is bitrot in SSDs (particularly in the address translation tables) it looks like a wall of random noise that only ends when the disk goes offline. You can get combinations of these (e.g. RAM failures caused by transient overheating) where the drive's behavior changes over time. When in doubt, don't write. > BR > G.Baroncelli > -- > gpg @keyserver.linux.it: Goffredo Baroncelli > Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 > signature.asc Description: Digital signature
Re: [PATCH] btrfs: fix hole read corruption for compressed inline extents
On Mon, Nov 28, 2016 at 05:27:10PM +0500, Roman Mamedov wrote:
> On Mon, 28 Nov 2016 00:03:12 -0500
> Zygo Blaxell <ce3g8...@umail.furryterror.org> wrote:
>
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 8e3a5a2..b1314d6 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -6803,6 +6803,12 @@ static noinline int uncompress_inline(struct btrfs_path *path,
> >  	max_size = min_t(unsigned long, PAGE_SIZE, max_size);
> >  	ret = btrfs_decompress(compress_type, tmp, page,
> >  			       extent_offset, inline_size, max_size);
> > +	WARN_ON(max_size > PAGE_SIZE);
> > +	if (max_size < PAGE_SIZE) {
> > +		char *map = kmap(page);
> > +		memset(map + max_size, 0, PAGE_SIZE - max_size);
> > +		kunmap(page);
> > +	}
> >  	kfree(tmp);
> >  	return ret;
> >  }
>
> Wasn't this already posted as:
>
> btrfs: fix silent data corruption while reading compressed inline extents
> https://patchwork.kernel.org/patch/9371971/
>
> but you don't indicate that's a V2 or something, and in fact the patch seems
> exactly the same, just the subject and commit message are entirely different.
> Quite confusing.

The previous commit message discussed the related hole-creation bug, including a reproducer; however, this patch does not fix the hole-creation bug and was never intended to. Despite my follow-up clarification, reviewers got distracted by the hole-creation bug discussion and didn't recover, so the patch didn't go anywhere.

This patch only fixes _reading_ the holes after they are created, and the new commit message and subject line state that much more clearly. The patch didn't change, so I didn't add 'v2'. There's no 'v1' with the same title, so I thought a 'v2' tag would be more confusing than just starting over.

The hole-creation bug is a very old, low-urgency issue. btrfs filesystems in the field have the buggy holes already, and have been creating new ones from 2009(*) to the present.
I had to ask a few people before I found one who knew whether it was even a bug, or intentional behavior from the beginning.

(*) 2009 is the oldest commit date I can find that introduces a change which would only be necessary in the presence of the hole-creation bug. I have not been able to test kernels before 2012 because they crash while running my reproducer.

> --
> With respect,
> Roman
[PATCH] btrfs: fix hole read corruption for compressed inline extents
Commit c8b978188c ("Btrfs: Add zlib compression support") produces data corruption when reading a file with a hole positioned after an inline extent. btrfs_get_extent will return uninitialized kernel memory instead of zero bytes in the hole.

Commit 93c82d5750 ("Btrfs: zero page past end of inline file items") fills the hole by memset to zero after *uncompressed* inline extents. This patch provides the missing memset for holes after *compressed* inline extents.

The offending holes appear in the wild and will appear during routine data integrity audits (e.g. comparing backups against their originals). They can also be created intentionally by fuzzing or crafting a filesystem image.

Holes like these are not intended to occur in btrfs; however, I tested tagged kernels between v3.5 and the present, and found that all of them can create such holes. Whether we like them or not, this kind of hole is now part of the btrfs de-facto on-disk format, and we need to be able to read such holes without an infoleak or wrong data.

An example of a hole leading to data corruption:

	item 61 key (606890 INODE_ITEM 0) itemoff 9662 itemsize 160
		inode generation 50 transid 50 size 47424 nbytes 49141
		block group 0 mode 100644 links 1 uid 0 gid 0
		rdev 0 flags 0x0(none)
	item 62 key (606890 INODE_REF 603050) itemoff 9642 itemsize 20
		inode ref index 3 namelen 10 name: DB_File.so
	item 63 key (606890 EXTENT_DATA 0) itemoff 8280 itemsize 1362
		inline extent data size 1341 ram 4085 compress(zlib)
	item 64 key (606890 EXTENT_DATA 4096) itemoff 8227 itemsize 53
		extent data disk byte 5367308288 nr 20480
		extent data offset 0 nr 45056 ram 45056
		extent compression(zlib)

Different data appears in userspace during each uncached read of the 10 bytes between offset 4085 and 4095. The extent in item 63 is not long enough to fill the first page of the file, so a memset is required to fill the space between item 63 (ending at 4085) and item 64 (beginning at 4096) with zero.
Signed-off-by: Zygo Blaxell <ce3g8...@umail.furryterror.org>
---
 fs/btrfs/inode.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8e3a5a2..b1314d6 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6803,6 +6803,12 @@ static noinline int uncompress_inline(struct btrfs_path *path,
 	max_size = min_t(unsigned long, PAGE_SIZE, max_size);
 	ret = btrfs_decompress(compress_type, tmp, page,
 			       extent_offset, inline_size, max_size);
+	WARN_ON(max_size > PAGE_SIZE);
+	if (max_size < PAGE_SIZE) {
+		char *map = kmap(page);
+		memset(map + max_size, 0, PAGE_SIZE - max_size);
+		kunmap(page);
+	}
 	kfree(tmp);
 	return ret;
 }
--
2.1.4
Re: [PATCH] btrfs: raid56: Use correct stolen pages to calculate P/Q
On Sun, Nov 27, 2016 at 12:16:34AM +0100, Goffredo Baroncelli wrote:
> On 2016-11-26 19:54, Zygo Blaxell wrote:
> > On Sat, Nov 26, 2016 at 02:12:56PM +0100, Goffredo Baroncelli wrote:
> >> On 2016-11-25 05:31, Zygo Blaxell wrote:
> [...]
> >>
> >> BTW Btrfs in RAID1 mode corrects the data even in the read case. So
> >
> > Have you tested this? I think you'll find that it doesn't.
>
> Yes I tested it; and it does the rebuild automatically.
> I corrupted a disk of mirror, then I read the related file. The log says:
>
> [   59.287748] BTRFS warning (device vdb): csum failed ino 257 off 0 csum 12813760 expected csum 3114703128
> [   59.291542] BTRFS warning (device vdb): csum failed ino 257 off 0 csum 12813760 expected csum 3114703128
> [   59.294950] BTRFS info (device vdb): read error corrected: ino 257 off 0 (dev /dev/vdb sector 2154496)
>               ^
> IIRC In case of RAID5/6 the last line is missing. However in both the
> case the data returned is good; but in RAID1 the data is corrected
> also on the disk.
>
> Where you read that the data is not rebuild automatically ?

Experience? I have real disk failures all the time. Errors on RAID1 arrays persist until scrubbed.

No, wait... _transid_ errors always persist until scrubbed. csum failures are rewritten in repair_io_failure. There is a comment earlier in repair_io_failure that rewrite in RAID56 is not supported yet.

> In fact I was surprised that RAID5/6 behaves differently

The difference is surprising, no matter which strategy you believe is correct. ;)

> >> I am still convinced that is the RAID5/6 behavior "strange".
> >>
> >> BR
> >> G.Baroncelli
>
> --
> gpg @keyserver.linux.it: Goffredo Baroncelli
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: [PATCH] btrfs: raid56: Use correct stolen pages to calculate P/Q
On Sat, Nov 26, 2016 at 02:12:56PM +0100, Goffredo Baroncelli wrote: > On 2016-11-25 05:31, Zygo Blaxell wrote: > >>> Do you mean, read the corrupted data won't repair it? > >>> > >>> IIRC that's the designed behavior. > >> :O > >> > >> You are right... I was unaware of that > > This is correct. > > > > Ordinary reads shouldn't touch corrupt data, they should only read > > around it. Scrubs in read-write mode should write corrected data over > > the corrupt data. Read-only scrubs can only report errors without > > correcting them. > > > > Rewriting corrupt data outside of scrub (i.e. on every read) is a > > bad idea. Consider what happens if a RAM controller gets too hot: > > checksums start failing randomly, but the data on disk is still OK. > > If we tried to fix the bad data on every read, we'd probably just trash > > the filesystem in some cases. > > > > I cant agree. If the filesystem is mounted read-only this behavior may > be correct; bur in others cases I don't see any reason to not correct > wrong data even in the read case. If your ram is unreliable you have > big problem anyway. If you don't like RAM corruption, pick any other failure mode. Laptops have to deal with things like vibration and temperature extremes which produce the same results (spurious csum failures and IO errors under conditions where writing will only destroy data that would otherwise be recoverable). > The likelihood that the data contained in a disk is "corrupted" is > higher than the likelihood that the RAM is bad. > > BTW Btrfs in RAID1 mode corrects the data even in the read case. So Have you tested this? I think you'll find that it doesn't. > I am still convinced that is the RAID5/6 behavior "strange". > > BR > G.Baroncelli > -- > gpg @keyserver.linux.it: Goffredo Baroncelli > Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 > signature.asc Description: Digital signature
Re: [PATCH] btrfs: raid56: Use correct stolen pages to calculate P/Q
On Fri, Nov 25, 2016 at 03:40:36PM +1100, Gareth Pye wrote:
> On Fri, Nov 25, 2016 at 3:31 PM, Zygo Blaxell
> <ce3g8...@umail.furryterror.org> wrote:
> >
> > This risk mitigation measure does rely on admins taking a machine in this
> > state down immediately, and also somehow knowing not to start a scrub
> > while their RAM is failing...which is kind of an annoying requirement
> > for the admin.
>
> Attempting to detect if RAM is bad when scrub starts is both time
> consuming and not very reliable right.

RAM, like all hardware, could fail at any time, and a scrub could already be running when it happens. This is annoying but also a fact of life that admins have to deal with. Testing RAM before scrub starts is no more beneficial than testing RAM at random intervals--but if you are testing RAM at random intervals, why not do it at the same intervals as scrub?

If I see corruption errors showing up in stats, I will do a basic sanity test to make sure they're coming from the storage layer and not somewhere closer to the CPU. If all errors come from one device, and there are clear log messages showing SCSI device errors, and the SMART log matches the other data, RAM is probably not the root cause of the failures, so scrub away.

If normally reliable programs like /bin/sh start randomly segfaulting, there's smoke pouring out of the back of the machine, all the disks are full of csum failures, and the BIOS welcome message has spelling errors that weren't there before, I would *not* start a scrub. More like turn the machine off, take it apart, test all the pieces separately, and only do a scrub after everything above the storage layer had been replaced or recertified. I certainly wouldn't want the filesystem to try to fix the csum failures it finds in such situations.
Re: [PATCH] btrfs: raid56: Use correct stolen pages to calculate P/Q
On Tue, Nov 22, 2016 at 07:02:13PM +0100, Goffredo Baroncelli wrote: > On 2016-11-22 01:28, Qu Wenruo wrote: > > > > > > At 11/22/2016 02:48 AM, Goffredo Baroncelli wrote: > >> Hi Qu, > >> > >> I tested this succefully for RAID5 when doing a scrub (i.e.: I mount a > >> corrupted disks, then I ran "btrfs scrub start ...", then I check the > >> disks). > >> > >> However if I do a "cat mnt/out.txt" (out.txt is the corrupted file): > >> 1) the system detect that the file is corrupted (good :) ) > >> 2) the system return the correct file content (good :) ) > >> 3) the data on the platter are still wrong(no good :( ) > > > > Do you mean, read the corrupted data won't repair it? > > > > IIRC that's the designed behavior. > > :O > > You are right... I was unaware of that This is correct. Ordinary reads shouldn't touch corrupt data, they should only read around it. Scrubs in read-write mode should write corrected data over the corrupt data. Read-only scrubs can only report errors without correcting them. Rewriting corrupt data outside of scrub (i.e. on every read) is a bad idea. Consider what happens if a RAM controller gets too hot: checksums start failing randomly, but the data on disk is still OK. If we tried to fix the bad data on every read, we'd probably just trash the filesystem in some cases. This risk mitigation measure does rely on admins taking a machine in this state down immediately, and also somehow knowing not to start a scrub while their RAM is failing...which is kind of an annoying requirement for the admin. > So you can add a "tested-by: Goffredo Baroncelli" > > BR > G.Baroncelli > > > > > For RAID5/6 read, there are several different mode, like READ_REBUILD or > > SCRUB_PARITY. > > > > I'm not sure for write, but for read it won't write correct data. > > > > So it's a designed behavior if I don't miss something. > > > > Thanks, > > Qu > > > >> > >> > >> Enclosed the script which reproduces the problem. 
Note that: > >> If I corrupt the data, in the dmesg two time appears a line which says: > >> > >> [ 3963.763384] BTRFS warning (device loop2): csum failed ino 257 off 0 > >> csum 2280586218 expected csum 3192393815 > >> [ 3963.766927] BTRFS warning (device loop2): csum failed ino 257 off 0 > >> csum 2280586218 expected csum 3192393815 > >> > >> If I corrupt the parity, of course the system doesn't detect the > >> corruption nor try to correct it. But this is the expected behavior. > >> > >> BR > >> G.Baroncelli > >> > >> On 2016-11-21 09:50, Qu Wenruo wrote:
> >>> In the following situation, scrub will calculate wrong parity to
> >>> overwrite correct one:
> >>>
> >>> RAID5 full stripe:
> >>>
> >>> Before
> >>> |     Dev 1      |     Dev 2     |     Dev 3     |
> >>> | Data stripe 1  | Data stripe 2 | Parity Stripe |
> >>> ------------------------------------------------- 0
> >>> | 0x0000 (Bad)   |     0xcdcd    |     0x0000    |
> >>> ------------------------------------------------- 4K
> >>> |     0xcdcd     |     0xcdcd    |     0x0000    |
> >>> ...
> >>> |     0xcdcd     |     0xcdcd    |     0x0000    |
> >>> ------------------------------------------------- 64K
> >>>
> >>> After scrubbing dev3 only:
> >>>
> >>> |     Dev 1      |     Dev 2     |     Dev 3     |
> >>> | Data stripe 1  | Data stripe 2 | Parity Stripe |
> >>> ------------------------------------------------- 0
> >>> | 0xcdcd (Good)  |     0xcdcd    |  0xcdcd (Bad) |
> >>> ------------------------------------------------- 4K
> >>> |     0xcdcd     |     0xcdcd    |     0x0000    |
> >>> ...
> >>> |     0xcdcd     |     0xcdcd    |     0x0000    |
> >>> ------------------------------------------------- 64K
> >>>
> >>> The calltrace of such corruption is as following:
> >>>
> >>> scrub_bio_end_io_worker() get called for each extent read out
> >>> |- scrub_block_complete()
> >>>    |- Data extent csum mismatch
> >>>    |- scrub_handle_errored_block()
> >>>       |- scrub_recheck_block()
> >>>          |- scrub_submit_raid56_bio_wait()
> >>>             |- raid56_parity_recover()
> >>>
> >>> Now we have a rbio with correct data stripe 1 recovered.
> >>> Let's call it "good_rbio".
> >>>
> >>> scrub_parity_check_and_repair()
> >>> |- raid56_parity_submit_scrub_rbio()
> >>>    |- lock_stripe_add()
> >>>    |  |- steal_rbio()
> >>>    |     |- Recovered data are stolen from "good_rbio", stored into
> >>>    |        rbio->stripe_pages[]
> >>>    |        Now rbio->bio_pages[] are bad data read from disk.
> >>>    |- async_scrub_parity()
> >>>       |- scrub_parity_work() (delayed_call to scrub_parity_work)
> >>>
> >>> scrub_parity_work()
> >>> |- raid56_parity_scrub_stripe()
> >>>    |- validate_rbio_for_parity_scrub()
> >>>       |- finish_parity_scrub()
> >>>          |- Recalculate parity using *BAD* pages in rbio->bio_pages[]
> >>>             So good parity is overwritten with *BAD* one
> >>>
> >>> The fix is to introduce 2 new members,
Re: [RFC PATCH 0/2] Btrfs: make a source length of 0 imply EOF for dedupe
On Wed, Nov 23, 2016 at 05:26:18PM -0800, Darrick J. Wong wrote: [...] > Keep in mind that the number of bytes deduped is returned to userspace > via file_dedupe_range.info[x].bytes_deduped, so a properly functioning > userspace program actually /can/ detect that its 128MB request got cut > down to only 16MB and re-issue the request with the offsets moved up by > 16MB. The dedupe client in xfs_io (see dedupe_ioctl() in io/reflink.c) > implements this strategy. duperemove (the only other user I know of) > also does this. > > So it's really no big deal to increase the limit beyond 16MB, eliminate > it entirely, or even change it to cap the total request size while > dropping the per-item IO limit. > > As I mentioned in my other reply, the only hesitation I have for not > killing XFS_MAX_DEDUPE_LEN is that I feel that 2GB is enough IO for a > single ioctl call. Everything's relative. btrfs has ioctls that will do hundreds of terabytes of IO and take months to run. 2GB of data is nothing. Deduping entire 100TB files with a single ioctl call makes as much sense to me as reflink copying them with a single ioctl call. The only reason I see to keep the limit is to work around something wrong with the implementation. signature.asc Description: Digital signature
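The re-issue strategy described above (detect a short count via `info[x].bytes_deduped` and advance both offsets) can be sketched as a small loop. `do_dedupe_range` is a hypothetical stand-in for one FIDEDUPERANGE call, and the 16MB cap is the per-request limit under discussion:

```python
# Userspace retry loop: if the kernel caps a dedupe request (e.g. at 16MB),
# re-issue it with both offsets advanced by bytes_deduped until the whole
# range is covered.  do_dedupe_range(src_off, dst_off, length) is a
# hypothetical wrapper returning bytes_deduped, mirroring
# file_dedupe_range.info[x].bytes_deduped.

KERNEL_CAP = 16 * 1024 * 1024  # per-request limit being discussed (16MB)

def dedupe_whole_range(do_dedupe_range, src_off, dst_off, length):
    """Dedupe `length` bytes, retrying past any per-call cap."""
    total = 0
    while total < length:
        done = do_dedupe_range(src_off + total, dst_off + total, length - total)
        if done == 0:          # no progress: ranges differ, give up
            break
        total += done
    return total
```

This is the same strategy xfs_io's dedupe_ioctl() and duperemove use: a 128MB request against a 16MB cap simply becomes eight calls.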
Re: Identifying reflink / CoW files
On Fri, Nov 04, 2016 at 03:41:49PM +0100, Saint Germain wrote: > On Thu, 3 Nov 2016 01:17:07 -0400, Zygo Blaxell > <ce3g8...@umail.furryterror.org> wrote : > > [...] > > The quality of the result therefore depends on the amount of effort > > put into measuring it. If you look for the first non-hole extent in > > each file and use its physical address as a physical file identifier, > > then you get a fast reflink detector function that has a high risk of > > false positives. If you map out two files and compare physical > > addresses block by block, you get a slow function with a low risk of > > false positives (but maybe a small risk of false negatives too). > > > > If your dedup program only does full-file reflink copies then the > > first extent physical address method is sufficient. If your program > > does block- or extent-level dedup then it shouldn't be using files in > > its data model at all, except where necessary to provide a mechanism > > to access the physical blocks through the POSIX filesystem API. > > > > FIEMAP will tell you about all the extents (physical address for > > extents that have them, zero for other extent types). It's also slow > > and has assorted accuracy problems especially with compressed files. > > Any user can run FIEMAP, and it uses only standard structure arrays. > > > > SEARCH_V2 is root-only and requires parsing variable-length binary > > btrfs data encoding, but it's faster than FIEMAP and gives more > > accurate results on compressed files. > > As the dedup program only does full-file reflink, the first extent > physical address method can be used as a fast first check to identify > potential files. > > But how to implement the second check in order to have 0% risk of false > positive ? > Because you said that mapping out two files and comparing the physical > addresses block by block also has a low risk of false positives. In theory, what you do is call FIEMAP on each file and compare the physical blocks that come back. 
If they are large files you will have to call FIEMAP multiple times on both files, each time setting the start position to the end position of the previous run. Translate each result record into a range of physical addresses, then compare them. If there were no differences, the files are already deduped. In practice, FIEMAP doesn't provide full accuracy for compressed extents, and in some cases the physical address data will compare equal when the files are in fact different. This is the small risk of false positives, and the only way to get 100% accuracy is to not use FIEMAP. Instead you can use the SEARCH ioctl, which dumps out the binary extent items from btrfs. If you look up the items corresponding to one inode, you can get the real physical block address plus the offset from the beginning of the extent for compressed extents. In Bees I encode the compressed extent start offset into the same uint64_t as the physical extent start address using the bottom 6 bits of the physical (bytenr) address: https://github.com/Zygo/bees/blob/master/src/bees-types.cc#L744 This fills in an object which uniquely (and reversibly) identifies the block on the filesystem. The raw btrfs extent data is extracted here: https://github.com/Zygo/bees/blob/master/lib/extentwalker.cc#L533 BeesAddress gives no false positives, but it's built on top of hundreds of lines of userspace support code. :-/ > Thank you very much for the detailed explanation ! > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html signature.asc Description: Digital signature
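The address encoding described above (packing the compressed-extent offset into the low bits of the bytenr) can be shown with a toy model. This is an illustrative sketch, not the real BeesAddress layout (see the linked source for that); it assumes 4K-aligned physical addresses, whose low bits are therefore free to reuse:

```python
# Toy model of packing a block offset into a physical address: btrfs
# bytenr values are block-aligned, so the low bits are always zero and
# can carry the block offset within a compressed extent, giving one
# uint64 that uniquely (and reversibly) identifies a block.

BLOCK_SHIFT = 12           # assume 4K filesystem blocks
OFFSET_BITS = 6            # low bits reused for the intra-extent offset

def encode(bytenr, offset_blocks):
    assert bytenr % (1 << BLOCK_SHIFT) == 0       # physical address is aligned
    assert 0 <= offset_blocks < (1 << OFFSET_BITS)
    return bytenr | offset_blocks

def decode(addr):
    mask = (1 << OFFSET_BITS) - 1
    return addr & ~mask, addr & mask              # (bytenr, offset_blocks)
```

Because the mapping is reversible, a dedup agent can store one integer per block in its hash table and still recover both the extent's physical address and the offset inside it.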
Re: Increased disk usage after deduplication and system running out of memory
On Thu, Nov 24, 2016 at 03:00:26PM +0100, Niccolò Belli wrote: > Hi, > I use snapper, so I have plenty of snapshots in my btrfs partition and most > of my data is already deduplicated because of that. > Since long time ago I run offline defragmentation once (because I didn't > know extents get unshared) I wanted to run offline deduplication to free a > couple of GBs. > > This is the script I use to stop snapper, set snapshots to rw, balance, > deduplicate, etc: https://paste.pound-python.org/show/vPUGVNjPQbDvr4HbtMgs/ > > $ cat after_balance Overall: >Device size: 152.36GiB > Device allocated:136.00GiB > Device unallocated: 16.35GiB > Device missing: 0.00B > Used:133.97GiB > Free (estimated): 17.17GiB (min: 17.17GiB) > Data ratio: 1.00 > Metadata ratio: 1.00 > Global reserve: 239.94MiB (used: 0.00B) > Data,single: Size:133.00GiB, Used:132.18GiB > /dev/mapper/cryptroot 133.00GiB > Metadata,single: Size:3.00GiB, Used:1.79GiB > /dev/mapper/cryptroot 3.00GiB > System,single: Size:3.00MiB, Used:16.00KiB > /dev/mapper/cryptroot 3.00MiB > Unallocated: > /dev/mapper/cryptroot 16.35GiB > > > $ cat after_duperemove_and_balance > Overall: > Device size: 152.36GiB > Device allocated:136.03GiB > Device unallocated: 16.33GiB > Device missing: 0.00B > Used:133.81GiB > Free (estimated): 16.55GiB (min: 16.55GiB) > Data ratio: 1.00 > Metadata ratio: 1.00 > Global reserve: 512.00MiB (used: 0.00B) > > Data,single: Size:127.00GiB, Used:126.77GiB > /dev/mapper/cryptroot 127.00GiB > > Metadata,single: Size:9.00GiB, Used:7.03GiB > /dev/mapper/cryptroot 9.00GiB > > System,single: Size:32.00MiB, Used:16.00KiB > /dev/mapper/cryptroot 32.00MiB > > Unallocated: > /dev/mapper/cryptroot 16.33GiB > > > As you can see it freed 5.41 GB of data, but it also added 5.24 GB of > metadata. The estimated free space is now 16.55 GB, while before the > deduplication it was higher: 17.17 GB. 
> > This is when running duperemove git with noblock, but almost nothing changes > if I omitt it (it defaults to block). > Why did my metadata increase by a 4x factor? 99% of my data already had > shared extents because of snapshots, so why such a huge increase?

Sharing by snapshot is different from sharing by dedup. For snapshots, a new tree node is introduced which shares the entire rest of the tree. So you get:

    Root 123 -\              /--- Node 85 --- data 84
               >- Node 87 ---<
    Root 124 -/              \--- Node 43 --- data 42

This means there's 16K of metadata (actually probably more, but small nonetheless) that is sharing the entire subvol.

For dedup, each shared data extent is shared individually, and metadata is not shared at all:

                              /--- Node 85 --- data 84 (shared)
    Root 123 -\
               \- Node 87 ---<
                              \--- Node 43 --- data 42 (shared)

                               /--- Node 129 --- data 84 (shared)
    Root 124 --- Node 131 ----<
                               \--- Node 126 --- data 42 (shared)

If you dedup over a set of snapshots, it eventually unshares the metadata. The data is still shared, but _only_ the data, so it multiplies the metadata size by the number of snapshots. It's even worse if you have dup metadata since the cost of each new metadata page is doubled.

> Deduplication didn't finish up to 100%, because duperemove got killed by OOM > killer at 99%: https://paste.pound-python.org/show/yUcIOSzXcrfNPkF9rV2L/ > > As you can see from dmesg > (https://paste.pound-python.org/show/eZIkpxUU6QR9ij6Rn1Oq/) there is no > process stealing so much memory (my system has 8GB): the biggest one takes > as much as 700MB of vm. > > Another strange thing that you can see from the previous log is that it > tries to deduplicate /home/niko/nosnap/rootfs/@images/fedora25.qcow2 which > is a UNIQUE file. Such image is stored in a separate subvolume because I > don't want it to be snapshotted, so I'm pretty sure there are no other > copies of this image, but still it tries to deduplicate it.
> > Niccolò Belli > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html signature.asc Description: Digital signature
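The metadata multiplication described above can be put into a back-of-envelope cost model. This is only an illustrative sketch under stated assumptions (the function and its page counts are hypothetical, not btrfs internals): metadata pages are shared once across all snapshots until dedup rewrites a private copy per snapshot, and the "dup" metadata profile doubles the cost of each new page.

```python
# Toy model: estimate metadata pages before and after deduping across
# snapshots.  Assumptions (not btrfs internals): snapshots share all
# metadata until dedup unshares it, giving one copy per snapshot; the
# 'dup' metadata profile stores every metadata page twice.

def metadata_pages(leaf_pages, snapshots, deduped=False, dup_profile=False):
    """Rough metadata page count for `snapshots` snapshots of a subvol
    whose metadata tree has `leaf_pages` pages."""
    copies = snapshots if deduped else 1   # dedup unshares metadata per snapshot
    pages = leaf_pages * copies
    return pages * 2 if dup_profile else pages
```

With 20 snapshots, deduping multiplies the model's metadata by 20, and dup metadata doubles that again, which is the shape of the 4x metadata growth reported in this thread.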
Re: [RFC PATCH 0/2] Btrfs: make a source length of 0 imply EOF for dedupe
On Tue, Nov 22, 2016 at 06:44:19PM -0800, Darrick J. Wong wrote: > On Tue, Nov 22, 2016 at 09:02:10PM -0500, Zygo Blaxell wrote: > > On Thu, Nov 17, 2016 at 04:07:48PM -0800, Omar Sandoval wrote: > > > 3. Both XFS and Btrfs cap each dedupe operation to 16MB, but the > > >implicit EOF gets around this in the existing XFS implementation. I > > >copied this for the Btrfs implementation. > > > > Somewhat tangential to this patch, but on the dedup topic: Can we raise > > or drop that 16MB limit? > > > > The maximum btrfs extent length is 128MB. Currently the btrfs dedup > > behavior for a 128MB extent is to generate 8x16MB shared extent references > > with different extent offsets to a single 128MB physical extent. > > These references no longer look like the original 128MB extent to a > > userspace dedup tool. That raises the difficulty level substantially > > for a userspace dedup tool when it tries to figure out which extents to > > keep and which to discard or rewrite. > > > > XFS may not have this problem--I haven't checked. On btrfs it's > > definitely not as simple as "bang two inode/offset/length pairs together > > with dedup and disk space will be freed automagically." If dedup is > > done incorrectly on btrfs, it can end up just making the filesystem slow > > without freeing any space. > > I copied the 16M limit into xfs/ocfs2 because btrfs had it. :) Finally, a clearly stated rationale. ;) > The VFS now limits the size of the incoming struct file_dedupe_range to > whatever a page size is. On x86 that only allows us 126 dedupe > candidates, which means that a single call can perform up to ~2GB of IO. > Storage is getting faster, but 2GB is still a fair amount for a single > call. Of course in XFS we do the dedupe one file and one page at a time > to keep the memory footprint sane. > > On ppc64 with its huge 64k pages that comes out to 32GB of IO. 
> > One thing we (speaking for XFS, anyway) /could/ do is limit based on the > theoretical IO count instead of clamping the length, e.g. > > if ((u64)dest_count * len >= (1ULL << 31)) > return -E2BIG; > > That way you /could/ specify a larger extent size if you pass in fewer > file descriptors. OTOH XFS will merge all the records together, so even > if you deduped the whole 128M in 4k chunks you'll still end up with a > single block mapping record and a single backref. This is why I'm mystified that XFS has this limitation. On btrfs there were at least _reasons_ for it, even if they were just "we have a v0.3 implementation and nobody's even started optimizing it yet." The btrfs code calls kzalloc (with size limited by MAX_DEDUPE_LEN) in the context of the thread executing the ioctl. It then loads up all the pages, compares them, then decides whether to continue with clone_range for the whole extent, or not. btrfs doesn't seem to ever merge these. > Were I starting from scratch I'd probably just dump the existing dedupe > interface in favor of a non-vectorized dedupe_range call taking the same > parameters as clone_range: > > int dedupe_range(src_fd, src_off, dest_fd, dest_off); > > I'd also change the semantics to "Find and share all identical blocks in > this subrange. Differing blocks are left alone." because it seems silly > that duperemove can issue large requests but a single byte difference in > the middle causes info->status to be set to FILE_DEDUPE_RANGE_DIFFERS > and info->bytes_deduped only changes if the entire range was deduped. It'd also be nice if it replaced all existing shared refs to the dst blocks at the same time. On btrfs, dedup agents have to find all the shared refs (either through brute force or by using LOGICAL_INO to look them all up through backrefs) and feed each one into extent_same until the last reference to dst is removed. But maybe this is only needed to work around a btrfs thing that never happens on XFS... 
:-P > > The 16MB limit doesn't seem to be useful in practice. The two useful > > effects of the limit seem to be DoS mitigation. There is no checking of > > the RAM usage that I can find (i.e. if you fire off 16 dedup threads, > > they want 256MB of RAM; put another way, if you want to tie up 16GB of > > kernel RAM, all you have to do is create 1024 dedup threads), so it's > > not an effective DoS mitigation feature. Internally dedup could verify > > blocks in batches of 16MB and check for signals/release and reacquire > > locks in between, so it wouldn't tie up the kernel or the two inodes > > for excessively long periods. > > (Does btrfs actually do the extent_same stuff in parallel??) A btrfs dedup agent can invoke multiple extent_sames
bees v0.1 - Best-Effort Extent-Same, a btrfs deduplication daemon
I made a thing! Bees ("Best-Effort Extent-Same") is a dedup daemon for btrfs.

Bees is a block-oriented userspace dedup designed to avoid scalability problems on large filesystems. Bees is designed to degrade gracefully when underprovisioned with RAM. Bees does not use more RAM or storage as filesystem data size increases. The dedup hash table size is fixed at creation time and does not change. The effective dedup block size is dynamic and adjusts automatically to fit the hash table into the configured RAM limit. Hash table overflow is not implemented, which eliminates the IO overhead of handling overflow. Hash table entries are only 16 bytes per dedup block to keep the average dedup block size small.

Bees does not require alignment between dedup blocks or extent boundaries (i.e. it can handle any multiple-of-4K offset between dup block pairs). Bees rearranges blocks into shared and unique extents if required to work within current btrfs kernel dedup limitations. Bees can dedup any combination of compressed and uncompressed extents.

Bees operates in a single pass which removes duplicate extents immediately during scan. There are no separate scanning and dedup phases.

Bees uses only data-safe btrfs kernel operations, so it can dedup live data (e.g. build servers, sqlite databases, VM disk images). It does not modify file attributes or timestamps.

Bees does not store any information about filesystem structure, so it is not affected by the number or size of files (except to the extent that these cause performance problems for btrfs in general). It retrieves such information on demand through btrfs SEARCH_V2 and LOGICAL_INO ioctls. This eliminates the storage required to maintain the equivalents of these functions in userspace. It's also why bees has no XFS support.

Bees is a daemon designed to run continuously and maintain its state across crashes and reboots. Bees uses checkpoints for persistence to eliminate the IO overhead of a transactional data store.
On restart, bees will dedup any data that was added to the filesystem since the last checkpoint. I use bees to dedup filesystems ranging in size from 16GB to 35TB, with hash tables ranging in size from 128MB to 11GB. It's well past time for a v0.1 release, so here it is! Bees is available on Github: https://github.com/Zygo/bees Please enjoy this code. signature.asc Description: Digital signature
Re: [RFC PATCH 0/2] Btrfs: make a source length of 0 imply EOF for dedupe
On Thu, Nov 24, 2016 at 09:13:28AM +1100, Dave Chinner wrote: > On Wed, Nov 23, 2016 at 08:55:59AM -0500, Zygo Blaxell wrote: > > On Wed, Nov 23, 2016 at 03:26:32PM +1100, Dave Chinner wrote: > > > On Tue, Nov 22, 2016 at 09:02:10PM -0500, Zygo Blaxell wrote: > > > > On Thu, Nov 17, 2016 at 04:07:48PM -0800, Omar Sandoval wrote: > > > > > 3. Both XFS and Btrfs cap each dedupe operation to 16MB, but the > > > > >implicit EOF gets around this in the existing XFS implementation. I > > > > >copied this for the Btrfs implementation. > > > > > > > > Somewhat tangential to this patch, but on the dedup topic: Can we raise > > > > or drop that 16MB limit? > > > > > > > > The maximum btrfs extent length is 128MB. Currently the btrfs dedup > > > > behavior for a 128MB extent is to generate 8x16MB shared extent > > > > references > > > > with different extent offsets to a single 128MB physical extent. > > > > These references no longer look like the original 128MB extent to a > > > > userspace dedup tool. That raises the difficulty level substantially > > > > for a userspace dedup tool when it tries to figure out which extents to > > > > keep and which to discard or rewrite. > > > > > > That, IMO, is a btrfs design/implementation problem, not a problem > > > with the API. Applications are always going to end up doing things > > > that aren't perfectly aligned to extent boundaries or sizes > > > regardless of the size limit that is placed on the dedupe ranges. > > > > Given that XFS doesn't have all the problems btrfs does, why does XFS > > have the same aribitrary size limit? Especially since XFS demonstrably > > doesn't need it? > > Creating a new-but-slightly-incompatible jsut for XFS makes no > sense - we have multiple filesystems that support this functionality > and so they all should use the same APIs and present (as far as is > possible) the same behaviour to userspace. OK. Let's just remove the limit on all the filesystems then. 
XFS doesn't need it, and btrfs can be fixed. > IOWs it's more important to use existing APIs than to invent a new > one that does almost the same thing. This way userspace applications > don't need to be changed to support new XFS functionality and we > make life easier for everyone. Except removing the limit doesn't work that way. An application that didn't impose an undocumented limit on itself wouldn't break when moved to a filesystem that imposed no such limit, i.e. if XFS had no limit, an application that moved from btrfs to XFS would just work. > A shiny new API without warts would > be nice, but we've already got to support the existing one forever, > it does the job we need and so it's less burden on everyone if we > just use it as is. > > Cheers, > > Dave. > -- > Dave Chinner > da...@fromorbit.com > signature.asc Description: Digital signature
Re: [RFC PATCH 0/2] Btrfs: make a source length of 0 imply EOF for dedupe
On Wed, Nov 23, 2016 at 03:26:32PM +1100, Dave Chinner wrote: > On Tue, Nov 22, 2016 at 09:02:10PM -0500, Zygo Blaxell wrote: > > On Thu, Nov 17, 2016 at 04:07:48PM -0800, Omar Sandoval wrote: > > > 3. Both XFS and Btrfs cap each dedupe operation to 16MB, but the > > >implicit EOF gets around this in the existing XFS implementation. I > > >copied this for the Btrfs implementation. > > > > Somewhat tangential to this patch, but on the dedup topic: Can we raise > > or drop that 16MB limit? > > > > The maximum btrfs extent length is 128MB. Currently the btrfs dedup > > behavior for a 128MB extent is to generate 8x16MB shared extent references > > with different extent offsets to a single 128MB physical extent. > > These references no longer look like the original 128MB extent to a > > userspace dedup tool. That raises the difficulty level substantially > > for a userspace dedup tool when it tries to figure out which extents to > > keep and which to discard or rewrite. > > That, IMO, is a btrfs design/implementation problem, not a problem > with the API. Applications are always going to end up doing things > that aren't perfectly aligned to extent boundaries or sizes > regardless of the size limit that is placed on the dedupe ranges.

Given that XFS doesn't have all the problems btrfs does, why does XFS have the same arbitrary size limit? Especially since XFS demonstrably doesn't need it?

> > XFS may not have this problem--I haven't checked. > > It doesn't - it tracks shared blocks exactly and merges adjacent > extent records whenever possible.

> > Even if we want to keep the 16MB limit, there's also no way to query the > > kernel from userspace to find out what the limit is, other than by trial > > and error. It's not even in a header file, userspace just has to *know*. > > So add a define to the API to make it visible to applications and > document it in the man page.
To answer some of my own questions on the btrfs side: It looks like the btrfs implementation does have a reason for it (fixed-size arrays). > Cheers, > > Dave. > -- > Dave Chinner > da...@fromorbit.com > signature.asc Description: Digital signature
Re: [RFC PATCH 0/2] Btrfs: make a source length of 0 imply EOF for dedupe
On Thu, Nov 17, 2016 at 04:07:48PM -0800, Omar Sandoval wrote: > 3. Both XFS and Btrfs cap each dedupe operation to 16MB, but the >implicit EOF gets around this in the existing XFS implementation. I >copied this for the Btrfs implementation.

Somewhat tangential to this patch, but on the dedup topic: Can we raise or drop that 16MB limit?

The maximum btrfs extent length is 128MB. Currently the btrfs dedup behavior for a 128MB extent is to generate 8x16MB shared extent references with different extent offsets to a single 128MB physical extent. These references no longer look like the original 128MB extent to a userspace dedup tool. That raises the difficulty level substantially for a userspace dedup tool when it tries to figure out which extents to keep and which to discard or rewrite.

XFS may not have this problem--I haven't checked. On btrfs it's definitely not as simple as "bang two inode/offset/length pairs together with dedup and disk space will be freed automagically." If dedup is done incorrectly on btrfs, it can end up just making the filesystem slow without freeing any space.

The 16MB limit doesn't seem to be useful in practice. The only useful effect of the limit seems to be DoS mitigation, but there is no checking of the RAM usage that I can find (i.e. if you fire off 16 dedup threads, they want 256MB of RAM; put another way, if you want to tie up 16GB of kernel RAM, all you have to do is create 1024 dedup threads), so it's not an effective DoS mitigation feature. Internally dedup could verify blocks in batches of 16MB and check for signals and release/reacquire locks in between, so it wouldn't tie up the kernel or the two inodes for excessively long periods.

Even if we want to keep the 16MB limit, there's also no way to query the kernel from userspace to find out what the limit is, other than by trial and error. It's not even in a header file, userspace just has to *know*.

signature.asc Description: Digital signature
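The batching idea above can be sketched in userspace terms. This is an illustrative model, not kernel code: `read_src`/`read_dst` are hypothetical range readers, and the comment marks where a kernel implementation would check signals and drop/reacquire locks between batches.

```python
# Sketch of batched verification: instead of pinning both inodes for the
# whole request, compare candidate ranges in fixed-size batches.  A kernel
# implementation would check for signals and release/reacquire locks
# between batches; here that point is just marked with a comment.

BATCH = 16 * 1024 * 1024  # batch size, matching the 16MB figure above

def ranges_identical(read_src, read_dst, length, batch=BATCH):
    """read_src/read_dst(offset, size) -> bytes; compare batch by batch."""
    off = 0
    while off < length:
        n = min(batch, length - off)
        if read_src(off, n) != read_dst(off, n):
            return False       # DIFFERS: stop without touching the rest
        off += n               # kernel would cond_resched()/relock here
    return True
```

With this structure the per-call length cap stops being a resource-control mechanism at all; the batch size bounds memory, not the request size.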
Re: [RFC] btrfs: make max inline data can be equal to sectorsize
On Fri, Nov 18, 2016 at 03:58:06PM -0500, Chris Mason wrote: > > > On 11/16/2016 11:10 AM, David Sterba wrote: > >On Mon, Nov 14, 2016 at 09:55:34AM +0800, Qu Wenruo wrote: > >>At 11/12/2016 04:22 AM, Liu Bo wrote: > >>>On Tue, Oct 11, 2016 at 02:47:42PM +0800, Wang Xiaoguang wrote: > If we use mount option "-o max_inline=sectorsize", say 4096, indeed > even for a fresh fs, say nodesize is 16k, we can not make the first > 4k data completely inline, I found this conditon causing this issue: > !compressed_size && (actual_end & (root->sectorsize - 1)) == 0 > > If it retuns true, we'll not make data inline. For 4k sectorsize, > 0~4094 dara range, we can make it inline, but 0~4095, it can not. > I don't think this limition is useful, so here remove it which will > make max inline data can be equal to sectorsize. > >>> > >>>It's difficult to tell whether we need this, I'm not a big fan of using > >>>max_inline size more than the default size 2048, given that most reports > >>>about ENOSPC is due to metadata and inline may make it worse. > >> > >>IMHO if we can use inline data extents to trigger ENOSPC more easily, > >>then we should allow it to dig the problem further. > >> > >>Just ignoring it because it may cause more bug will not solve the real > >>problem anyway. > > > >Not allowing the full 4k value as max_inline looks artificial to me. > >We've removed other similar limitation in the past so I'd tend to agree > >to do the same here. There's no significant use for it as far as I can > >tell, if you want to exhaust metadata, the difference to max_inline=4095 > >would be really tiny in the end. So, I'm okay with merging it. If > >anybody feels like adding his reviewed-by, please do so. > > The check is there because in practice it doesn't make sense to inline an > extent if it fits perfectly in a data block. You could argue its saving > seeks, but we're also adding seeks by spreading out the metadata in general. > So, I'd want to see benchmarks before deciding. 
Does that limit kick in before or after compression? A compressed extent could easily have 4096 bytes of data in 200 bytes. If a filesystem contained a whole lot of exactly-4096-byte compressible files that extra byte might be worth something. > If we're using it for debugging, I'd rather stick with max_inline=4095. > > -chris > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html signature.asc Description: Digital signature
Re: RFC: raid with a variable stripe size
On Fri, Nov 18, 2016 at 07:15:12PM +0100, Goffredo Baroncelli wrote: > Hello, > > these are only my thoughts; no code here, but I would like to share > it hoping that it could be useful. > > As reported several times by Zygo (and others), one of the problem of > raid5/6 is the write hole. Today BTRFS is not capable to address it. > > The problem is that the stripe size is bigger than the "sector size" > (ok sector is not the correct word, but I am referring to the basic > unit of writing on the disk, which is 4k or 16K in btrfs). So when > btrfs writes less data than the stripe, the stripe is not filled; when > it is filled by a subsequent write, a RMW of the parity is required. The key point in the problem statement is that subsequent writes are allowed to modify stripes while they contain data. Proper CoW would never do that. Stripes should never contain data from two separate transactions--that would imply that CoW rules have been violated. Currently there is no problem for big writes on empty disks because the data block allocator happens to do the right thing accidentally in such cases. It's only when the allocator allocates new data to partially filled stripes that the problems occur. For metadata the allocator currently stumbles into RMW writes so badly that the difference between the current allocator and the worst possible allocator is only a few percent. > On the best of my understanding (which could be very wrong) ZFS try > to solve this issue using a variable length stripe. ZFS ties the parity blocks to what btrfs would call extents. It prevents multiple writes to the same RAID stripe in different transactions by dynamically defining the RAID stripe boundaries *around* the write boundaries. This is very different from btrfs's current on-disk structure. e.g. 
if we were to write:

    extent D, 7 blocks
    extent E, 3 blocks
    extent F, 9 blocks

the disk in btrfs looks something like:

    D1 D2 D3 D4 P1
    D5 D6 D7 P2 E1
    E2 E3 P3 F1 F2
    F3 P4 F4 F5 F6
    P5 F7 F8 F9 xx

    P1 = parity(D1..D4)
    P2 = parity(D5..D7, E1)
    P3 = parity(E2, E3, F1, F2)
    P4 = parity(F3..F6)
    P5 = parity(F7..F9)

If D, E, and F were written in different transactions, it could make P2 and P3 invalid. The disk in ZFS looks something like:

    D1 D2 D3 D4 P1
    D5 D6 D7 P2 E1
    E2 E3 P3 F1 F2
    F3 F4 P4 F5 F6
    F7 F8 P5 F9 P6

where:

    P1 is parity(D1..D4)
    P2 is parity(D5..D7)
    P3 is parity(E1..E3)
    P4 is parity(F1..F4)
    P5 is parity(F5..F8)
    P6 is parity(F9)

Each parity value contains only data from one extent, which makes it impossible for any P block to contain data from different transactions. Every extent is striped across a potentially different number of disks, so it's less efficient than "pure" raid5 would be with the same quantity of data. This would require pushing the parity allocation all the way up into the extent layer in btrfs, which would be a massive change that could introduce regressions into all the other RAID levels; on the other hand, if it was pushed up to that level, it would be possible to checksum the parity blocks... > On BTRFS this could be achieved using several BGs (== block group or > chunk), one for each stripe size. Actually it's one per *possibly* failed disk (N^2 - N disks for RAID6). Block groups are composed of *specific* disks... > For example, if a filesystem - RAID5 is composed by 4 DISK, the > filesystem should have three BGs: BG #1,composed by two disks (1 > data+ 1 parity) BG #2 composed by three disks (2 data + 1 parity) > BG #3 composed by four disks (3 data + 1 parity). ...i.e. you'd need block groups for disks ABCD, ABC, ABD, ACD, and BCD. Btrfs doesn't allocate block groups that way anyway. A much simpler version of this is to make two changes: 1. Identify when disks go offline and mark block groups touching these disks as 'degraded'.
Currently this only happens at mount time, so the btrfs change would be to add the detection of state transition at the instant when a disk fails. 2. When a block group is degraded (i.e. some of its disks are missing), mark it strictly read-only and disable nodatacow. Btrfs can already do #2 when balancing. I've used this capability to repair broken raid5 arrays. Currently btrfs does *not* do this for ordinary data writes, and that's the required change. The trade-off for this approach is that if you didn't have any unallocated space when a disk failed, you'll get ENOSPC for everything, because there's no disk you could be allocating new metadata pages on. That makes it hard to add or replace disks. > If the data to be written has a size of 4k, it will be allocated to > the BG #1. If the data to be written has a size of 8k, it will be > allocated to the BG #2 If the data to be written has a size of 12k, > it will be
Re: [PATCH 0/2] RAID5/6 scrub race fix
On Fri, Nov 18, 2016 at 07:09:34PM +0100, Goffredo Baroncelli wrote: > Hi Zygo > On 2016-11-18 00:13, Zygo Blaxell wrote: > > On Tue, Nov 15, 2016 at 10:50:22AM +0800, Qu Wenruo wrote: > >> Fix the so-called famous RAID5/6 scrub error. > >> > >> Thanks Goffredo Baroncelli for reporting the bug, and make it into our > >> sight. > >> (Yes, without the Phoronix report on this, > >> https://www.phoronix.com/scan.php?page=news_item=Btrfs-RAID-56-Is-Bad, > >> I won't ever be aware of it) > > > > If you're hearing about btrfs RAID5 bugs for the first time through > > Phoronix, then your testing coverage is *clearly* inadequate. > > > > Fill up a RAID5 array, start a FS stress test, pull a drive out while > > that's running, let the FS stress test run for another hour, then try > > to replace or delete the missing device. If there are any crashes, > > corruptions, or EIO during any part of this process (assuming all the > > remaining disks are healthy), then btrfs RAID5 is still broken, and > > you've found another bug to fix. > > > > The fact that so many problems in btrfs can still be found this way > > indicates to me that nobody is doing this basic level of testing > > (or if they are, they're not doing anything about the results). > > [...] > > Sorry but I don't find useful this kind of discussion. Yes BTRFS > RAID5/6 needs a lot of care. Yes, *our* test coverage is far to be > complete; but this is not a fault of a single person; and Qu tried to > solve one issue and for this we should say only tanks.. > > Even if you don't find valuable the work of Qu (and my little one :-) > ), this required some time and need to be respected. I do find this work valuable, and I do thank you and Qu for it. I've been following it with great interest because I haven't had time to dive into it myself. It's a use case I used before and would like to use again. 
Most of my recent frustration, if directed at anyone, is really directed at Phoronix for conflating "one bug was fixed" with "ready for production use today," and I wanted to ensure that the latter rumor was promptly quashed. This is why I'm excited about Qu's work: on my list of 7 btrfs-raid5 recovery bugs (6 I found plus yours), Qu has fixed at least 2 of them, maybe as many as 4, with the patches so far. I can fix 2 of the others, for a total of 6 fixed out of 7. Specifically, the 7 bugs I know of are: 1-2. BUG_ONs in functions that should return errors (I had fixed both already when trying to recover my broken arrays) 3. scrub can't identify which drives or files are corrupted (Qu might have fixed this--I won't know until I do testing) 4-6. symptom groups related to wrong data or EIO in scrub recovery, including Goffredo's (Qu might have fixed all of these, but from a quick read of the patch I think at least two are done). 7. the write hole. I'll know more after I've had a chance to run Qu's patches through testing, which I intend to do at some point. Optimistically, this means there could be only *one* bug remaining in the critical path for btrfs RAID56 single disk failure recovery. That last bug is the write hole, which is why I keep going on about it. It's the only bug I know exists in btrfs RAID56 that has neither an existing fix nor any evidence of someone actively working on it, even at the design proposal stage. Please, I'd love to be wrong about this. When I described the situation recently as "a thin layer of bugs on top of a design defect", I was not trying to be mean. I was trying to describe the situation *precisely*. The thin layer of bugs is much thinner thanks to Qu's work, and thanks in part to his work, I now have confidence that further investment in this area won't be wasted. > Finally, I don't think that we should compare the RAID-hole with this > kind of bug(fix). 
The former is a design issue, the latter is a bug > related of one of the basic feature of the raid system (recover from > the lost of a disk/corruption). > > Even the MD subsystem (which is far behind btrfs) had tolerated > the raid-hole until last year. My frustration against this point is the attitude that mdadm was ever good enough, much less a model to emulate in the future. It's 2016--there have been some advancements in the state of the art since the IBM patent describing RAID5 30 years ago, yet in the btrfs world, we seem to insist on repeating all the same mistakes in the same order. "We're as good as some existing broken-by-design thing" isn't a really useful attitude. We should aspire to do *better* than the existing broken-by-design things. If we didn't, we wouldn't be here, we'd all be lurking on some other list, running ext4
Re: [PATCH 0/2] RAID5/6 scrub race fix
On Fri, Nov 18, 2016 at 10:42:23AM +0800, Qu Wenruo wrote: > > > At 11/18/2016 09:56 AM, Hugo Mills wrote: > >On Fri, Nov 18, 2016 at 09:19:11AM +0800, Qu Wenruo wrote: > >> > >> > >>At 11/18/2016 07:13 AM, Zygo Blaxell wrote: > >>>On Tue, Nov 15, 2016 at 10:50:22AM +0800, Qu Wenruo wrote: > >>>>Fix the so-called famous RAID5/6 scrub error. > >>>> > >>>>Thanks Goffredo Baroncelli for reporting the bug, and make it into our > >>>>sight. > >>>>(Yes, without the Phoronix report on this, > >>>>https://www.phoronix.com/scan.php?page=news_item=Btrfs-RAID-56-Is-Bad, > >>>>I won't ever be aware of it) > >>> > >>>If you're hearing about btrfs RAID5 bugs for the first time through > >>>Phoronix, then your testing coverage is *clearly* inadequate. > >> > >>I'm not fixing everything, I'm just focusing on the exact one bug > >>reported by Goffredo Baroncelli. > >> > >>Although it seems that, the bug reported by him is in fact two bugs. > >>One is race condition I'm fixing, another one is that recovery is > >>recovering data correctly, but screwing up parity. > >> > >>I just don't understand why you always want to fix everything in one step. > > > > Fix the important, fundamental things first, and the others > >later. This, from my understanding of Zygo's comments, appears to be > >one of the others. > > > > It's papering over the missing bricks in the wall instead of > >chipping out the mortar and putting new bricks in. It may need to be > >fixed, but it's not the fundamental "OMG, everything's totally broken" > >problem. If anything, it's only a serious problem *because* the other > >thing (write hole) is still there. > > > > It just seems like a piece of mis-prioritised effort. > > It seems that, we have different standards on the priority. My concern isn't priority. Easier bugs often get fixed first. That's just the way Linux development works. 
I am very concerned by articles like this: http://phoronix.com/scan.php?page=news_item=Btrfs-RAID5-RAID6-Fixed with headlines like "btrfs RAID5/RAID6 support is finally fixed" when that's very much not the case. Only one bug has been removed for the key use case that makes RAID5 interesting, and it's just the first of many that still remain in the path of a user trying to recover from a normal disk failure. Admittedly this is Michael's (Phoronix's) problem more than Qu's, but it's important to always be clear and _complete_ when stating bug status because people quote statements out of context. When the article quoted the text "it's not a timed bomb buried deeply into the RAID5/6 code, but a race condition in scrub recovery code" the commenters on Phoronix are clearly interpreting this to mean "famous RAID5/6 scrub error" had been fixed *and* the issue reported by Goffredo was the time bomb issue. It's more accurate to say something like "Goffredo's issue is not the time bomb buried deeply in the RAID5/6 code, but a separate issue caused by a race condition in scrub recovery code" Reading the Phoronix article, one might imagine RAID5 is now working as well as RAID1 on btrfs. To be clear, it's not--although the gap is now significantly narrower. > For me, if some function on the very basic/minimal environment can't work > reliably, then it's a high priority bug. > > In this case, in a very minimal setup, with only 128K data spreading on 3 > devices RAID5. With a data stripe fully corrupted, without any other thing > interfering. > Scrub can't return correct csum error number and even cause false > unrecoverable error, then it's a high priority thing. > If the problem involves too many steps like removing devices, degraded mode, > fsstress and some time. Then it's not that priority unless one pin-downs the > root case to, for example, degraded mode itself with special sequenced > operations. There are multiple bugs in the stress + remove device case. 
Some are quite easy to isolate. They range in difficulty from simple BUG_ON instead of error returns to finally solving the RMW update problem. Run the test, choose any of the bugs that occur to work on, repeat until the test stops finding new bugs for a while. There are currently several bugs to choose from with various levels of difficulty to fix them, and you should hit the first level of bugs in a matter of hours if not minutes. Using this method, you would have discovered Goffredo's bug years ago. Instead, you only discovered it after Phoronix quoted the conclusion of an investigation that started because of pro
Re: [PATCH 0/2] RAID5/6 scrub race fix
On Tue, Nov 15, 2016 at 10:50:22AM +0800, Qu Wenruo wrote: > Fix the so-called famous RAID5/6 scrub error. > > Thanks Goffredo Baroncelli for reporting the bug, and make it into our > sight. > (Yes, without the Phoronix report on this, > https://www.phoronix.com/scan.php?page=news_item=Btrfs-RAID-56-Is-Bad, > I won't ever be aware of it) If you're hearing about btrfs RAID5 bugs for the first time through Phoronix, then your testing coverage is *clearly* inadequate. Fill up a RAID5 array, start a FS stress test, pull a drive out while that's running, let the FS stress test run for another hour, then try to replace or delete the missing device. If there are any crashes, corruptions, or EIO during any part of this process (assuming all the remaining disks are healthy), then btrfs RAID5 is still broken, and you've found another bug to fix. The fact that so many problems in btrfs can still be found this way indicates to me that nobody is doing this basic level of testing (or if they are, they're not doing anything about the results). > Unlike many of us(including myself) assumed, it's not a timed bomb buried > deeply into the RAID5/6 code, but a race condition in scrub recovery > code. I don't see how this patch fixes the write hole issue at the core of btrfs RAID56. It just makes the thin layer of bugs over that issue a little thinner. There's still the metadata RMW update timebomb at the bottom of the bug pile that can't be fixed by scrub (the filesystem is unrecoverably damaged when the bomb goes off, so scrub isn't possible). > The problem is not found because normal mirror based profiles aren't > affected by the race, since they are independent with each other. True. > Although this time the fix doesn't affect the scrub code much, it should > warn us that current scrub code is really hard to maintain. This last sentence is true. 
I found and fixed three BUG_ONs in RAID5 code on the first day I started testing in degraded mode, then hit the scrub code and had to give up. It was like a brick wall made out of mismatched assumptions and layering inversions, using uninitialized kernel data as mortar (though I suppose the "uninitialized" data symptom might just have been an unprotected memory access). > Abuse of workquque to delay works and the full fs scrub is race prone. > > Xfstest will follow a little later, as we don't have good enough tools > to corrupt data stripes pinpointly. > > Qu Wenruo (2): > btrfs: scrub: Introduce full stripe lock for RAID56 > btrfs: scrub: Fix RAID56 recovery race condition > > fs/btrfs/ctree.h | 4 ++ > fs/btrfs/extent-tree.c | 3 + > fs/btrfs/scrub.c | 192 + > 3 files changed, 199 insertions(+) > > -- > 2.10.2
Re: Announcing btrfs-dedupe
On Wed, Nov 16, 2016 at 11:24:33PM +0100, Niccolò Belli wrote: > On Tuesday, 15 November 2016 18:52:01 CET, Zygo Blaxell wrote: > >Like I said, millions of extents per week... > > > >64K is an enormous dedup block size, especially if it comes with a 64K > >alignment constraint as well. > > > >These are the top ten duplicate block sizes from a sample of 95251 > >dedup ops on a medium-sized production server with 4TB of filesystem > >(about one machine-day of data): > > Which software do you use to dedupe your data? I tried duperemove but it > gets killed by the OOM killer because it triggers some kind of memory leak: > https://github.com/markfasheh/duperemove/issues/163 Duperemove does use a lot of memory, but the logs at that URL only show 2G of RAM in duperemove--not nearly enough to trigger OOM under normal conditions on an 8G machine. There's another process with 6G of virtual address space (although much less than that resident) that looks more interesting (i.e. duperemove might just be the victim of some interaction between baloo_file and the OOM killer). On the other hand, the logs also show kernel 4.8. 100% of my test machines failed to finish booting before they were cut down by OOM on 4.7.x kernels. The same problem occurs on early kernels in the 4.8.x series. I am having good results with 4.8.6 and later, but you should be aware that significant changes have been made to the way OOM works in these kernel versions, and maybe you're hitting a regression for your use case. > Niccolò Belli
Re: Announcing btrfs-dedupe
On Tue, Nov 15, 2016 at 07:26:53AM -0500, Austin S. Hemmelgarn wrote: > On 2016-11-14 16:10, Zygo Blaxell wrote: > >Why is deduplicating thousands of blocks of data crazy? I already > >deduplicate four orders of magnitude more than that per week. > You missed the 'tiny' quantifier. I'm talking really small blocks, on the > order of less than 64k (so, IOW, stuff that's not much bigger than a few > filesystem blocks), and that is somewhat crazy because it ends up not only > taking _really_ long to do compared to larger chunks (because you're running > more independent hashes than with bigger blocks), but also because it will > often split extents unnecessarily and contribute to fragmentation, which > will lead to all kinds of other performance problems on the FS. Like I said, millions of extents per week... 64K is an enormous dedup block size, especially if it comes with a 64K alignment constraint as well. These are the top ten duplicate block sizes from a sample of 95251 dedup ops on a medium-sized production server with 4TB of filesystem (about one machine-day of data):

    total bytes   extent count   dup size
     2750808064      20987        131072
      803733504       1533        524288
      123801600        975        126976
      103575552       8429         12288
       97443840        793        122880
       82051072      10016          8192
       77492224      18919          4096
       71331840        645        110592
       64143360        540        118784
       63897600        650         98304

    all bytes     all extents    average dup size
     6129995776      95251         64356

128K and 512K are the most common sizes due to btrfs compression (it limits the block size to 128K for compressed extents and seems to limit uncompressed extents to 512K for some reason). 12K is #4, and 3 of the top ten sizes are below 16K. The average size is just a little below 64K.
These are the duplicates with block sizes smaller than 64K:

    total bytes   extent count   extent size
      41615360        635         65536
      46264320        753         61440
      45817856        799         57344
      41267200        775         53248
      45760512        931         49152
      46948352       1042         45056
      43417600       1060         40960
      47296512       1283         36864
      59277312       1809         32768
      49029120       1710         28672
      43745280       1780         24576
      53616640       2618         20480
      43466752       2653         16384
     103575552       8429         12288
      82051072      10016          8192
      77492224      18919          4096

    all bytes <=64K   extents <=64K   average dup size <=64K
     870641664           55212          15769

14% of my duplicate bytes are in blocks smaller than 64K or blocks not aligned to a 64K boundary within a file. It's too large a space saving to ignore on machines that have constrained storage. It may be worthwhile skipping 4K and 8K dedups--at 250 ms per dedup, they're 30% of the total run time and only 2.6% of the total dedup bytes. On the other hand, this machine is already deduping everything fast enough to keep up with new data, so there's no performance problem to solve here.
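The per-size rollup shown in these tables is straightforward to produce from a log of dedupe operations. A minimal sketch (the sample lengths below are made up for illustration, not taken from the tables above):

```python
from collections import defaultdict

def dup_size_histogram(dup_lengths):
    """Roll up dedupe op lengths into (total bytes, extent count) per dup size."""
    totals = defaultdict(lambda: [0, 0])  # dup size -> [total bytes, extent count]
    for length in dup_lengths:
        totals[length][0] += length
        totals[length][1] += 1
    return totals

# Hypothetical sample: three 128K dups, two 4K dups, one 12K dup.
ops = [131072] * 3 + [4096] * 2 + [12288]
hist = dup_size_histogram(ops)
# Print rows sorted by total bytes, like the tables above.
for size, (nbytes, count) in sorted(hist.items(), key=lambda kv: -kv[1][0]):
    print(f"{nbytes:>12} {count:>6} {size:>8}")
```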
Re: Announcing btrfs-dedupe
On Mon, Nov 14, 2016 at 09:07:51PM +0100, James Pharaoh wrote: > On 14/11/16 20:51, Zygo Blaxell wrote: > >On Mon, Nov 14, 2016 at 01:39:02PM -0500, Austin S. Hemmelgarn wrote: > >>On 2016-11-14 13:22, James Pharaoh wrote: > >>>One thing I am keen to understand is if BTRFS will automatically ignore > >>>a request to deduplicate a file if it is already deduplicated? Given the > >>>performance I see when doing a repeat deduplication, it seems to me that > >>>it can't be doing so, although this could be caused by the CPU usage you > >>>mention above. > >> > >>What's happening is that the dedupe ioctl does a byte-wise comparison of the > >>ranges to make sure they're the same before linking them. This is actually > >>what takes most of the time when calling the ioctl, and is part of why it > >>takes longer the larger the range to deduplicate is. In essence, it's > >>behaving like an OS should and not trusting userspace to make reasonable > >>requests (which is also why there's a separate ioctl to clone a range from > >>another file instead of deduplicating existing data). > > > > - the extent-same ioctl could check to see which extents > > are referenced by the src and dst ranges, and return success > > immediately without reading data if they are the same (but > > userspace should already know this, or it's wasting a huge amount > > of time before it even calls the kernel). > > Yes, this is what I am talking about. I believe I should be able to read > data about the BTRFS data structures and determine if this is the case. I > don't care if there are false matches, due to concurrent updates, but > there'll be a /lot/ of repeat deduplications unless I do this, because even > if the file is identical, the mtime etc hasn't changed, and I have a record > of previously doing a dedupe, there's no guarantee that the file hasn't been > rewritten in place (eg by rsync), and no way that I know of to reliably > detect if a file has been changed. 
> > I am sure there are libraries out there which can look into the data > structures of a BTRFS file system, I haven't researched this in detail > though. I imagine that with some kind of lock on a BTRFS root, this could be > achieved by simply reading the data from the disk, since I believe that > everything is copy-on-write, so no existing data should be overwritten until > all roots referring to it are updated. Perhaps I'm missing something > though... FIEMAP (VFS) and SEARCH_V2 (btrfs-specific) will both give you access to the underlying physical block numbers. SEARCH_V2 is non-trivial to use without reverse-engineering significant parts of btrfs-progs. SEARCH_V2 is a generic tree-searching tool which will give you all kinds of information about btrfs structures...it's essential for a sophisticated deduplicator and overkill for a simple one. For full-file dedup using FIEMAP you only need to look at the "physical" field of the first extent (if it's zero or the same as the other file, the files cannot be deduplicated or are already deduplicated, respectively). The source for 'filefrag' (from e2fsprogs) is good for learning how FIEMAP works. For block-level dedup you need to look at each extent individually. That's much slower and full of additional caveats. If you're going down that road it's probably better to just improve duperemove instead. > James
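The first-extent check described above can be done with one small FIEMAP call. A sketch (ioctl number and struct layouts as in linux/fs.h and linux/fiemap.h; error handling omitted, and note that some filesystems, e.g. tmpfs, don't implement FIEMAP):

```python
import fcntl
import struct

FS_IOC_FIEMAP = 0xC020660B  # _IOWR('f', 11, struct fiemap)

def first_extent_physical(fd):
    """Return the physical byte address of fd's first extent, or 0 if none mapped."""
    # struct fiemap header (32 bytes): fm_start, fm_length (u64); fm_flags,
    # fm_mapped_extents, fm_extent_count, fm_reserved (u32); extents follow.
    hdr = struct.pack('=QQLLLL', 0, 0xFFFFFFFFFFFFFFFF, 0, 0, 1, 0)
    # Leave room for one 56-byte struct fiemap_extent after the header.
    buf = bytearray(hdr + bytes(56))
    fcntl.ioctl(fd, FS_IOC_FIEMAP, buf)
    mapped = struct.unpack_from('=L', buf, 20)[0]  # fm_mapped_extents
    if mapped == 0:
        return 0  # empty file, or entirely holes
    # First fiemap_extent starts at offset 32: fe_logical, fe_physical (u64), ...
    _, fe_physical = struct.unpack_from('=QQ', buf, 32)
    return fe_physical
```

Per the message above: two files whose first extents report the same physical address are already sharing storage, and a zero physical address means there is nothing to deduplicate.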
Re: Announcing btrfs-dedupe
On Mon, Nov 14, 2016 at 02:56:51PM -0500, Austin S. Hemmelgarn wrote: > On 2016-11-14 14:51, Zygo Blaxell wrote: > >Deduplicating an extent that may might be concurrently modified during the > >dedup is a reasonable userspace request. In the general case there's > >no way for userspace to ensure that it's not happening. > I'm not even talking about the locking, I'm talking about the data > comparison that the ioctl does to ensure they are the same before > deduplicating them, and specifically that protecting against userspace just > passing in two random extents that happen to be the same size but not > contain the same data (because deduplication _should_ reject such a > situation, that's what the clone ioctl is for). If I'm deduping a VM image, and the virtual host is writing to said image (which is likely since an incremental dedup will be intentionally doing dedup over recently active data sets), the extent I just compared in userspace might be different by the time the kernel sees it. This is an important reason why the whole lock/read/compare/replace step is an atomic operation from userspace's PoV. The read also saves having to confirm a short/weak hash isn't a collision. The RAM savings from using weak hashes (~48 bits) are a huge performance win. The locking overhead is very small compared to the reading overhead, and (in the absence of bugs) it will only block concurrent writes to the same offset range in the src/dst inodes (based on a read of the code...I don't know if there's also an inode-level or backref-level barrier that expands the locking scope). I'm not sure the ioctl is well designed for simply throwing random data at it, especially not entire files (it can't handle files over 16MB anyway). It will read more data than it has to compared to a block-by-block comparison from userspace with prefetches or a pair of IO threads. 
If userspace reads both copies of the data just before issuing the extent-same call, the kernel will read the data from cache reasonably quickly. > The locking is perfectly reasonable and shouldn't contribute that much to > the overhead (unless you're being crazy and deduplicating thousands of tiny > blocks of data). Why is deduplicating thousands of blocks of data crazy? I already deduplicate four orders of magnitude more than that per week. > >That said, some optimization is possible (although there are good reasons > >not to bother with optimization in the kernel): > > > > - VFS could recognize when it has two separate references to > > the same physical extent and not re-read the same data twice > > (but that requires teaching VFS how to do CoW in general, and is > > hard for political reasons on top of the obvious technical ones). > > > > - the extent-same ioctl could check to see which extents > > are referenced by the src and dst ranges, and return success > > immediately without reading data if they are the same (but > > userspace should already know this, or it's wasting a huge amount > > of time before it even calls the kernel). > > > >>TBH, even though it's kind of annoying from a performance perspective, it's > >>a rather nice safety net to have. For example, one of the cases where I do > >>deduplication is a couple of directories where each directory is an > >>overlapping partial subset of one large tree which I keep elsewhere. In > >>this case, I can tell just by filename exactly what files might be > >>duplicates, so the ioctl's check lets me just call the ioctl on all > >>potential duplicates (after checking size, no point in wasting time if the > >>files obviously aren't duplicates), and have it figure out whether or not > >>they can be deduplicated. > >>> > >>>In any case, I'm considering some digging into the filesystem structures > >>>to see if I can work this out myself before i do any deduplication. 
I'm > >>>fairly sure this should be relatively simple to work out, at least well > >>>enough for my purposes. > >>Sadly, there's no way to avoid doing so right now. > >> > >>-- > >>To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > >>the body of a message to majord...@vger.kernel.org > >>More majordomo info at http://vger.kernel.org/majordomo-info.html > signature.asc Description: Digital signature
Re: Announcing btrfs-dedupe
On Mon, Nov 14, 2016 at 01:39:02PM -0500, Austin S. Hemmelgarn wrote: > On 2016-11-14 13:22, James Pharaoh wrote: > >One thing I am keen to understand is if BTRFS will automatically ignore > >a request to deduplicate a file if it is already deduplicated? Given the > >performance I see when doing a repeat deduplication, it seems to me that > >it can't be doing so, although this could be caused by the CPU usage you > >mention above. > What's happening is that the dedupe ioctl does a byte-wise comparison of the > ranges to make sure they're the same before linking them. This is actually > what takes most of the time when calling the ioctl, and is part of why it > takes longer the larger the range to deduplicate is. In essence, it's > behaving like an OS should and not trusting userspace to make reasonable > requests (which is also why there's a separate ioctl to clone a range from > another file instead of deduplicating existing data). Deduplicating an extent that may be concurrently modified during the dedup is a reasonable userspace request. In the general case there's no way for userspace to ensure that it's not happening. That said, some optimization is possible (although there are good reasons not to bother with optimization in the kernel):

- VFS could recognize when it has two separate references to the same physical extent and not re-read the same data twice (but that requires teaching VFS how to do CoW in general, and is hard for political reasons on top of the obvious technical ones).

- the extent-same ioctl could check to see which extents are referenced by the src and dst ranges, and return success immediately without reading data if they are the same (but userspace should already know this, or it's wasting a huge amount of time before it even calls the kernel).

> TBH, even though it's kind of annoying from a performance perspective, it's > a rather nice safety net to have.
For example, one of the cases where I do > deduplication is a couple of directories where each directory is an > overlapping partial subset of one large tree which I keep elsewhere. In > this case, I can tell just by filename exactly what files might be > duplicates, so the ioctl's check lets me just call the ioctl on all > potential duplicates (after checking size, no point in wasting time if the > files obviously aren't duplicates), and have it figure out whether or not > they can be deduplicated. > > > >In any case, I'm considering some digging into the filesystem structures > >to see if I can work this out myself before i do any deduplication. I'm > >fairly sure this should be relatively simple to work out, at least well > >enough for my purposes. > Sadly, there's no way to avoid doing so right now.
Re: Announcing btrfs-dedupe
On Mon, Nov 14, 2016 at 07:22:59PM +0100, James Pharaoh wrote: > On 14/11/16 19:07, Zygo Blaxell wrote: > >There is also a still-unresolved problem where the filesystem CPU usage > >rises exponentially for some operations depending on the number of shared > >references to an extent. Files which contain blocks with more than a few > >thousand shared references can trigger this problem. A file over 1TB can > >keep the kernel busy at 100% CPU for over 40 minutes at a time. > > Yes, I see this all the time. For my use cases, I don't really care about > "shared references" as blocks of files, but am happy to simply deduplicate > at the whole-file level. I wonder if this still will have the same effect, > however. I guess that this could be mitigated in a tool, but this is going > to be both annoying and not the most elegant solution. If you have huge files (1TB+) this can be a problem even with whole-file deduplications (which are really just extent-level deduplications applied to the entire file). The CPU time is a product of file size and extent reference count with some other multipliers on top. I've hacked around it by timing how long it takes to manipulate the data, and blacklisting any hash value or block address that takes more than 10 seconds to process (if such a block is found after blacklisting, just skip processing the block/extent/file entirely). It turns out there are very few of these in practice (only a few hundred per TB) but these few hundred block hash values occur millions of times in a large data corpus. > One thing I am keen to understand is if BTRFS will automatically ignore a > request to deduplicate a file if it is already deduplicated? Given the > performance I see when doing a repeat deduplication, it seems to me that it > can't be doing so, although this could be caused by the CPU usage you > mention above. 
As far as I can tell, btrfs doesn't do anything different in this case: it will happily repeat the entire lock/read/compare/delete/insert sequence even if the outcome cannot be different from the initial conditions. Due to limitations of VFS caching, it will read the same blocks from the storage hardware twice, too.

> In any case, I'm considering some digging into the filesystem
> structures to see if I can work this out myself before I do any
> deduplication. I'm fairly sure this should be relatively simple to
> work out, at least well enough for my purposes.

I used FIEMAP (then later replaced it with SEARCH_V2 for speed) to map the extents to physical addresses before deduping them. If you're only going to do whole-file dedup, then you only need to care about the physical address of the first non-hole extent.
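The FIEMAP part of that mapping can be sketched as follows. This is a rough Python sketch, not production code: struct layouts follow `linux/fiemap.h` (`FS_IOC_FIEMAP` is the generic VFS ioctl, so it works on most filesystems, not just btrfs), and the helper names are my own.

```python
import fcntl
import struct

FS_IOC_FIEMAP = 0xC020660B        # _IOWR('f', 11, struct fiemap)
FIEMAP_FLAG_SYNC = 0x1            # flush delalloc data before mapping
FIEMAP_EXTENT_DATA_INLINE = 0x200

# struct fiemap: fm_start, fm_length, fm_flags, fm_mapped_extents,
#                fm_extent_count, fm_reserved (32 bytes)
FIEMAP_HDR = struct.Struct("=QQIIII")
# struct fiemap_extent: fe_logical, fe_physical, fe_length,
#                       fe_reserved64[2], fe_flags, fe_reserved[3] (56 bytes)
FIEMAP_EXTENT = struct.Struct("=QQQQQIIII")

def parse_extents(buf, count):
    """Yield (logical, physical, length, flags) from an extent array."""
    for i in range(count):
        rec = FIEMAP_EXTENT.unpack_from(buf, i * FIEMAP_EXTENT.size)
        yield rec[0], rec[1], rec[2], rec[5]

def first_nonhole_physical(fd, max_extents=64):
    """Physical address of fd's first data extent, or None.

    Inline extents report physical address 0 and can't be deduped,
    so they are skipped here.
    """
    hdr = FIEMAP_HDR.pack(0, 2**64 - 1, FIEMAP_FLAG_SYNC, 0, max_extents, 0)
    buf = bytearray(hdr + b"\0" * (max_extents * FIEMAP_EXTENT.size))
    fcntl.ioctl(fd, FS_IOC_FIEMAP, buf)
    mapped = FIEMAP_HDR.unpack_from(buf)[3]
    for _, physical, _, flags in parse_extents(
            memoryview(buf)[FIEMAP_HDR.size:], mapped):
        if physical != 0 and not flags & FIEMAP_EXTENT_DATA_INLINE:
            return physical
    return None
```

For whole-file dedup, comparing the return value of `first_nonhole_physical()` across candidate files is the fast (but false-positive-prone) identity check discussed elsewhere in these threads.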
Re: Announcing btrfs-dedupe
On Tue, Nov 08, 2016 at 12:06:01PM +0100, Niccolò Belli wrote:
> Nice, you should probably update the btrfs wiki as well, because there
> is no mention of btrfs-dedupe.
>
> First question, why this name? Don't you plan to support xfs as well?

Does XFS plan to support LOGICAL_INO, INO_PATHS, and something analogous to SEARCH_V2? POSIX API + FILE_EXTENT_SAME is OK as the lowest common denominator across arbitrary filesystems, but a btrfs-specific tool can do a lot better, especially for incremental dedup and low-RAM algorithms.
Re: Announcing btrfs-dedupe
On Mon, Nov 07, 2016 at 07:49:51PM +0100, James Pharaoh wrote:
> Annoyingly I can't find this now, but I definitely remember reading
> someone, apparently someone knowledgeable, claim that the latest
> version of the kernel which I was using at the time still suffered
> from issues regarding the dedupe code.
>
> This was a while ago, and I would be very pleased to hear that there
> is high confidence in the current implementation! I'll post a link if
> I manage to find the comments.

I've been running the btrfs dedup ioctl 7 times per second on average over 42TB of test data for most of a year (and at a lower rate for two years). I have not found any data corruption due to _dedup_. I did find three distinct data corruption kernel bugs unrelated to dedup, and two test machines with bad RAM, so I'm pretty sure my corruption detection is working.

That said, I wouldn't run dedup on a kernel older than 4.4. LTS kernels might be OK too, but only if they're up to date with backported btrfs fixes. Kernels older than 3.13 lack the FILE_EXTENT_SAME ioctl and can only deduplicate static data (i.e. data you are certain is not being concurrently modified). Before 3.12 there are so many bugs you might as well not bother.

Older kernels are bad for dedup for non-corruption reasons as well. Between 3.13 and 4.4, the following bugs were fixed:

 - false-negative capability checks (e.g. same-inode, EOF extent),
   which reduce dedup efficiency;

 - ctime updates (older versions would update ctime when a file was
   deduped), which mess with incremental backup tools, build systems,
   etc.;

 - kernel memory leaks (self-explanatory);

 - multiple kernel hang/panic bugs (e.g. a deadlock if two threads try
   to read the same extent at the same time, and at least one of those
   threads is dedup; and a race condition leading to invalid memory
   access on dedup's comparison reads), which won't eat your data, but
   might ruin your day anyway.
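For reference, a single extent-same call like the ones described above can be sketched like this. This is a hedged sketch, not the code of any tool mentioned in the thread: it uses `FIDEDUPERANGE` (the VFS name, on 4.5+, for the same ioctl number as `BTRFS_IOC_FILE_EXTENT_SAME`), with struct layouts from `linux/fs.h`; helper names are mine.

```python
import fcntl
import os
import struct

FIDEDUPERANGE = 0xC0189436  # _IOWR(0x94, 54, struct file_dedupe_range)
FILE_DEDUPE_RANGE_SAME = 0  # info.status when the ranges matched

# struct file_dedupe_range: src_offset, src_length, dest_count,
#                           reserved1, reserved2 (24 bytes)
DEDUPE_HDR = struct.Struct("=QQHHI")
# struct file_dedupe_range_info: dest_fd (s64), dest_offset,
#                                bytes_deduped, status (s32), reserved
DEDUPE_INFO = struct.Struct("=qQQiI")

def pack_dedupe_request(src_offset, length, dest_fd, dest_offset):
    """Build the ioctl argument for a single destination range."""
    return bytearray(DEDUPE_HDR.pack(src_offset, length, 1, 0, 0) +
                     DEDUPE_INFO.pack(dest_fd, dest_offset, 0, 0, 0))

def dedupe_range(src_fd, src_offset, length, dest_fd, dest_offset):
    """Ask the kernel to share one range between two files.

    The fsync on the source FD is the hang-rate workaround reported in
    this thread (possibly delalloc-related).  Returns (bytes_deduped,
    status); status != FILE_DEDUPE_RANGE_SAME means the data differed.
    """
    os.fsync(src_fd)
    buf = pack_dedupe_request(src_offset, length, dest_fd, dest_offset)
    fcntl.ioctl(src_fd, FIDEDUPERANGE, buf)
    _fd, _off, bytes_deduped, status, _ = DEDUPE_INFO.unpack_from(
        buf, DEDUPE_HDR.size)
    return bytes_deduped, status
```

The kernel itself performs the locked read-and-compare, so unlike a plain reflink clone, this call is safe to issue against files that might be concurrently modified.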
There is also a still-unresolved problem where the filesystem CPU usage rises exponentially for some operations depending on the number of shared references to an extent. Files which contain blocks with more than a few thousand shared references can trigger this problem. A file over 1TB can keep the kernel busy at 100% CPU for over 40 minutes at a time.

There might also be a correlation between delalloc data and hangs in extent-same, but I have NOT been able to confirm this. All I know at this point is that doing an fsync() on the source FD just before doing the extent-same ioctl dramatically reduces filesystem hang rates: several weeks between hangs (or no hangs at all) with fsync, vs. 18 hours or less without.

> James
>
> On 07/11/16 18:59, Mark Fasheh wrote:
> >Hi James,
> >
> >Re the following text on your project page:
> >
> >"IMPORTANT CAVEAT — I have read that there are race and/or error
> >conditions which can cause filesystem corruption in the kernel
> >implementation of the deduplication ioctl."
> >
> >Can you expound on that? I'm not aware of any bugs right now but if
> >there are any it'd absolutely be worth having that info on the btrfs
> >list.
> >
> >Thanks,
> >--Mark
> >
> >On Sun, Nov 6, 2016 at 7:30 AM, James Pharaoh wrote:
> >>Hi all,
> >>
> >>I'm pleased to announce my btrfs deduplication utility, written in
> >>Rust. This operates on whole files, is fast, and I believe
> >>complements the existing utilities (duperemove, bedup) which exist
> >>currently.
> >>
> >>Please visit the homepage for more information:
> >>
> >>http://btrfs-dedupe.com
> >>
> >>James Pharaoh
Re: Identifying reflink / CoW files
On Thu, Oct 27, 2016 at 01:30:11PM +0200, Saint Germain wrote:
> Hello,
>
> Following the previous discussion:
> https://www.spinics.net/lists/linux-btrfs/msg19075.html
>
> I would be interested in finding a way to reliably identify reflink /
> CoW files in order to use deduplication programs (like fdupes, jdupes,
> rmlint) efficiently.
>
> Using FIEMAP doesn't seem to be reliable according to this discussion
> on rmlint:
> https://github.com/sahib/rmlint/issues/132#issuecomment-157665154

Inline extents have no physical address (FIEMAP returns 0 in that field). You can't dedup them, and each file can have only one, so if you see the FIEMAP_EXTENT_DATA_INLINE bit set, you can just skip processing the entire file immediately. You can create a separate non-inline extent in a temporary file, then use dedup to replace _both_ copies of the original inline extent. Or don't bother, as the savings are negligible.

> Is there another way that deduplication programs can easily use?

The problem is that it's not files that are reflinked; individual extents are. "Reflink file copy" really just means "a file whose extents are 100% shared with another file." It's possible for files on btrfs to have any percentage of shared extents from 0 to 100% in increments of the host page size. It's also possible for the blocks to be shared with different extent boundaries.

The quality of the result therefore depends on the amount of effort put into measuring it. If you look for the first non-hole extent in each file and use its physical address as a physical file identifier, then you get a fast reflink detector function that has a high risk of false positives. If you map out two files and compare physical addresses block by block, you get a slow function with a low risk of false positives (but maybe a small risk of false negatives too). If your dedup program only does full-file reflink copies, then the first-extent physical address method is sufficient.
If your program does block- or extent-level dedup, then it shouldn't be using files in its data model at all, except where necessary to provide a mechanism to access the physical blocks through the POSIX filesystem API.

FIEMAP will tell you about all the extents (physical address for extents that have them, zero for other extent types). It's also slow and has assorted accuracy problems, especially with compressed files. Any user can run FIEMAP, and it uses only standard structure arrays. SEARCH_V2 is root-only and requires parsing variable-length binary btrfs data encoding, but it's faster than FIEMAP and gives more accurate results on compressed files.

> Thanks
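The block-by-block comparison described above reduces to pure logic over two extent maps, independent of how the maps were obtained (FIEMAP or SEARCH_V2). A minimal sketch, assuming each map is a list of `(logical, physical, length)` byte ranges with `physical == 0` meaning hole/inline/unknown; the function names and the simplified model are mine:

```python
def physical_blocks(extents, block_size=4096):
    """Map logical block index -> physical block address.

    `extents` is a list of (logical, physical, length) byte ranges;
    entries with physical == 0 (holes, inline data) are skipped.
    """
    blocks = {}
    for logical, physical, length in extents:
        if physical == 0:
            continue
        for off in range(0, length, block_size):
            blocks[(logical + off) // block_size] = physical + off
    return blocks

def shared_fraction(extents_a, extents_b, block_size=4096):
    """Fraction of A's mapped blocks whose physical address matches B's.

    1.0 means every mapped block of A is shared with B (a full reflink
    copy); anything in between is partial sharing.  Extent boundaries
    don't matter, only per-block physical addresses.
    """
    a = physical_blocks(extents_a, block_size)
    b = physical_blocks(extents_b, block_size)
    if not a:
        return 0.0
    shared = sum(1 for idx, phys in a.items() if b.get(idx) == phys)
    return shared / len(a)
```

Comparing only the first non-hole extent's physical address is the O(1) shortcut; this per-block version is the slow, low-false-positive end of the trade-off.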
Re: Monitoring Btrfs
On Mon, Oct 17, 2016 at 06:44:14PM +0200, Stefan Malte Schumacher wrote:
> Hello,
>
> I would like to monitor my btrfs filesystem for missing drives. On
> Debian, mdadm uses a script in /etc/cron.daily which calls mdadm and
> sends an email if anything is wrong with the array. I would like to do
> the same with btrfs. In my first attempt I grepped and cut the
> information from "btrfs fi show" and let the script send an email if
> the number of devices was not equal to the preselected number.
>
> Then I saw this:
>
> ubuntu@ubuntu:~$ sudo btrfs filesystem show
> Label: none  uuid: 67b4821f-16e0-436d-b521-e4ab2c7d3ab7
>         Total devices 6 FS bytes used 5.47TiB
>         devid 1 size 1.81TiB used 1.71TiB path /dev/sda3
>         devid 2 size 1.81TiB used 1.71TiB path /dev/sdb3
>         devid 3 size 1.82TiB used 1.72TiB path /dev/sdc1
>         devid 4 size 1.82TiB used 1.72TiB path /dev/sdd1
>         devid 5 size 2.73TiB used 2.62TiB path /dev/sde1
>         *** Some devices missing
>
> on this page:
> https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices
>
> The number of devices is still at 6, despite the fact that one of the
> drives is missing, which means that my first idea doesn't work.

Using fi show for this isn't a good idea. By the time btrfs fi show tells you something is different from the norm, you've probably already crashed at least once and are now mounting with the 'degraded' option.

> I have two questions:
> 1) Has anybody already written a script like this? After all, there is
> no need to reinvent the wheel a second time.
> 2) What should I best grep for? In this case I would just go for the
> "missing". Does this cover all possible outputs of btrfs fi show in
> case of a damaged array? What other outputs do I need to consider for
> my script?

I monitor the device error counters, i.e. the output of

	for fs in /fs1 /fs2 /fs3... ; do
		btrfs dev stat "$fs" | grep -v " 0$"
	done

and send an email when it isn't empty. When there are errors I investigate in more detail (is it a failing disk?
failed disk? bad cables? bad RAM? a one-off UNC sector that can be ignored?), fix any problems (i.e. replace hardware, run scrub), and reset the counters to zero with 'btrfs dev stat -z'.

> Yours sincerely,
> Stefan
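The shell loop above translates directly into a small cron-style checker. This is a sketch under my own naming, assuming the `btrfs dev stat` output format of one "counter value" pair per line (e.g. `[/dev/sda3].write_io_errs   0`); the parsing is the testable part, the `subprocess` wrapper is the equivalent of the loop:

```python
import subprocess

def nonzero_counters(stat_output):
    """Return the lines of `btrfs dev stat` output with nonzero counters.

    Equivalent to the `grep -v " 0$"` in the shell loop above, but
    tolerant of varying whitespace between counter name and value.
    """
    bad = []
    for line in stat_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit() and int(parts[1]) != 0:
            bad.append(line)
    return bad

def check_filesystems(mountpoints):
    """Run `btrfs dev stat` on each mountpoint; collect nonzero lines."""
    errors = []
    for mnt in mountpoints:
        out = subprocess.run(["btrfs", "dev", "stat", mnt],
                             capture_output=True, text=True,
                             check=True).stdout
        errors.extend(nonzero_counters(out))
    return errors  # non-empty -> send an email
```

Run it from cron and mail the returned list when it is non-empty; after investigating and fixing the cause, reset with `btrfs dev stat -z` as described above.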
Re: [RFC] btrfs: make max inline data can be equal to sectorsize
On Wed, Oct 12, 2016 at 11:35:46AM +0800, Wang Xiaoguang wrote:
> hi,
>
> On 10/11/2016 11:49 PM, Chris Murphy wrote:
> >On Tue, Oct 11, 2016 at 12:47 AM, Wang Xiaoguang wrote:
> >>If we use mount option "-o max_inline=sectorsize", say 4096, indeed
> >>even for a fresh fs, say nodesize is 16k, we can not make the first
> >>4k data completely inline. I found this condition causing the issue:
> >>    !compressed_size && (actual_end & (root->sectorsize - 1)) == 0
> >>
> >>If it returns true, we'll not make data inline. For 4k sectorsize,
> >>a 0~4094 data range can be made inline, but 0~4095 can not.
> >>I don't think this limitation is useful, so here remove it, which
> >>will make the max inline data size equal to sectorsize.
> >>
> >>Signed-off-by: Wang Xiaoguang
> >>---
> >> fs/btrfs/inode.c | 2 --
> >> 1 file changed, 2 deletions(-)
> >>
> >>diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> >>index ea15520..c0db393 100644
> >>--- a/fs/btrfs/inode.c
> >>+++ b/fs/btrfs/inode.c
> >>@@ -267,8 +267,6 @@ static noinline int cow_file_range_inline(struct btrfs_root *root,
> >>	if (start > 0 ||
> >>	    actual_end > root->sectorsize ||
> >>	    data_len > BTRFS_MAX_INLINE_DATA_SIZE(root) ||
> >>-	    (!compressed_size &&
> >>-	     (actual_end & (root->sectorsize - 1)) == 0) ||
> >>	    end + 1 < isize ||
> >>	    data_len > root->fs_info->max_inline) {
> >>		return 1;
> >>--
> >>2.9.0
> >
> >Before making any further changes to inline data, does it make sense
> >to find the source of corruption Zygo has been experiencing? That's in
> >the "btrfs rare silent data corruption with kernel data leak" thread.
>
> Yes, agree.
> Also Zygo has sent a patch to fix that bug this morning :)

FWIW I don't see any connection between this and the problem I found. A page-sized inline extent wouldn't have any room for uninitialized bytes. If anything, it's the one rare case that already worked.
;)

> Regards,
> Xiaoguang Wang