a new kind of "No space left on device" error
This is one I have not seen before. When running a simple, well-tested and well-used script that makes backups using btrfs send | receive, I got these two errors:

At subvol snapshot
ERROR: rename o131621-1091-0 -> usr/lib/node_modules/node-gyp/gyp/pylib/gyp/MSVSVersion.py failed: No space left on device

At subvol snapshot
ERROR: rename o259-1095-0 -> myser/.bash_profile failed: No space left on device

I have run this script many, many times and never seen errors like this. There is plenty of room on the device:

# btrfs fi df /mnt/
Data, single: total=18.01GiB, used=16.53GiB
System, DUP: total=8.00MiB, used=16.00KiB
Metadata, DUP: total=1.00GiB, used=145.12MiB
GlobalReserve, single: total=24.53MiB, used=0.00B

# df -h /mnt/
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdc2        54G   17G   36G  33% /mnt

The send | receive appears to have mostly succeeded, because the final expected size is about 17G, as shown above. When completed, that will use only about 1/3 of the available disk space. I don't see any reason for "No space left on device" errors, but maybe somebody here can spot a problem I am missing.
btrfs-qgroup-rescan using 100% CPU
I'm using btrfs and snapper on a system with an SSD. On this system, when I run `snapper -c root ls` (where `root` is the snapper config for /), the process takes a very long time, and top shows the following process using 100% of the CPU:

kworker/u8:6+btrfs-qgroup-rescan

I have multiple computers (also with SSDs) set up the same way with snapper and btrfs. On the other computers, `snapper -c root ls` completes almost instantly, even on systems with many more snapshots. This system has 20 total snapshots on `/`.

System info:

4.18.16-arch1-1-ARCH (Arch Linux)
btrfs-progs v4.17.1

scrub started at Sat Oct 27 18:37:21 2018 and finished after 00:04:02
total bytes scrubbed: 75.97GiB with 0 errors

Filesystem           Size  Used Avail Use% Mounted on
/dev/mapper/cryptdv  116G   77G   38G  67% /

Data, single: total=72.01GiB, used=71.38GiB
System, DUP: total=32.00MiB, used=16.00KiB
Metadata, DUP: total=3.50GiB, used=2.22GiB

What other info would be helpful? What troubleshooting steps should I try?
Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)
On Wed, Sep 19, 2018 at 12:12:03AM -0400, Zygo Blaxell wrote:
> On Mon, Sep 10, 2018 at 07:06:46PM +1000, Dave Chinner wrote:
> > On Thu, Sep 06, 2018 at 11:53:06PM -0400, Zygo Blaxell wrote:
> > > On Thu, Sep 06, 2018 at 06:38:09PM +1000, Dave Chinner wrote:
> > > > On Fri, Aug 31, 2018 at 01:10:45AM -0400, Zygo Blaxell wrote:
> > > > > On Thu, Aug 30, 2018 at 04:27:43PM +1000, Dave Chinner wrote:
> > > > > > On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote:
> > > > > For future development I've abandoned the entire dedupe_file_range approach. I need to be able to read and dedupe the data blocks of the filesystem directly without having to deal with details like which files those blocks belong to, especially on filesystems with lots of existing deduped blocks and snapshots.
> > > >
> > > > IOWs, your desired OOB dedupe algorithm is:
> > > >
> > > > a) ask the filesystem where all its file data is
> > >
> > > Actually, it's "ask the filesystem where all the *new* file data is", since we don't want to read any unique data twice on subsequent runs.
> >
> > Sorry, how do you read "unique data" twice? By definition, unique data only occurs once.
>
> ...but once it has been read, we don't want to read it again. Ever. Even better would be to read unique data less than 1.0 times on average.
>
> > Oh, and you still need to hash the old data so you can find collisions with the new data that got written. Unless, of course, you are keeping your hash tree in a persistent database...
>
> I do that.

OK, so you're slowly re-inventing the typical HSM implementation, which keeps a userspace database to keep track of what is happening in the filesystem. They do this to make better decisions when moving less frequently accessed data out to slower tiers of storage, to keep space in the primary storage available as it fills up. You're approaching dedupe in essentially the same way, for very similar reasons.
> One big difference I am noticing in our approaches is latency. ZFS (and in-kernel btrfs dedupe) provides minimal dedupe latency (duplicate data occupies disk space for zero time as it is never written to disk at all) but it requires more RAM for a given dedupe hit rate than any other dedupe implementation I've seen. What you've written tells me XFS saves RAM by partitioning the data and relying on an existing but very large source of iops (sharing scrub reads with dedupe), but then the dedupe latency is the same as the scrub interval (the worst so far).

That's just the background maintenance implementation. xfs_fsr can run this way, but it can also be run as a complete immediate scan, or it can be pointed at a defined set of files to run on. We can (and probably will) do exactly the same thing with dedupe.

> bees aims to have latency of a few minutes (ideally scanning data while it's still dirty in cache, but there's no good userspace API for that) though it's obviously not there yet.

Yup, this is pretty much the "I need immediate notification that a file changed" problem that HSMs face. That's one of the things DMAPI was used for - XFS has (now unused) dmapi event notification fields in its inode structure that were used by DMAPI to configure the notifications the filesystem was supposed to send userspace.

With no DMAPI in the future, people with custom HSM-like interfaces based on dmapi are starting to turn to fanotify and friends to provide them with the change notifications they require.

> > e.g. a soft requirement is that we need to scan the entire fs at least once a month.
>
> I have to scan and dedupe multiple times per hour. OK, the first-ever scan of a non-empty filesystem is allowed to take much longer, but after that, if you have enough spare iops for continuous autodefrag you should also have spare iops for continuous dedupe.
Yup, but using notifications avoids the need for even these scans - you'd know exactly what data has changed, when it changed, and know exactly what you needed to read to calculate the new hashes.

> > A simple piece-wise per-AG scanning algorithm (like we use in xfs_repair) could easily work within a 3GB RAM per AG constraint and would scale very well. We'd only need to scan 30-40 AGs in the hour, and a single AG at 1GB/s will only take 2 minutes to scan. We can then do the processing while the next AG gets scanned. If we've got 10-20GB RAM to use (and who doesn't when they have 1PB of storage?) then we can scan 5-10AGs at once to keep th
Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)
On Thu, Sep 06, 2018 at 11:53:06PM -0400, Zygo Blaxell wrote:
> On Thu, Sep 06, 2018 at 06:38:09PM +1000, Dave Chinner wrote:
> > On Fri, Aug 31, 2018 at 01:10:45AM -0400, Zygo Blaxell wrote:
> > > On Thu, Aug 30, 2018 at 04:27:43PM +1000, Dave Chinner wrote:
> > > > On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote:
> > > For future development I've abandoned the entire dedupe_file_range approach. I need to be able to read and dedupe the data blocks of the filesystem directly without having to deal with details like which files those blocks belong to, especially on filesystems with lots of existing deduped blocks and snapshots.
> >
> > IOWs, your desired OOB dedupe algorithm is:
> >
> > a) ask the filesystem where all its file data is
>
> Actually, it's "ask the filesystem where all the *new* file data is", since we don't want to read any unique data twice on subsequent runs.

Sorry, how do you read "unique data" twice? By definition, unique data only occurs once.

Oh, and you still need to hash the old data so you can find collisions with the new data that got written. Unless, of course, you are keeping your hash tree in a persistent database and can work out how to prune stale entries out of it efficiently.

And, well, see my comments about media scrubbing below.

> > b) read that used space to build a data hash index
> > c) on all the data hash collisions find the owners of the colliding blocks
> > d) if the block data is the same dedupe it
> >
> > I agree - that's a simple and effective algorithm. It's also the obvious solution to an expert in the field.
>
> It's obvious, but if you literally implement that, you end up with a filesystem shattered into billions of fragments.

Well, yes, that's obvious. I thought that "details omitted for reasons of brevity" would be understood, not require omitted details to be explained to me.

> At step c) I need to figure out _which_ colliding block I want to keep.
> That's where the algorithm needs to inspect the higher-level file structure to determine the physical and logical situation for each block. That leads to one of:
>
> a) store the hashes and block locations in the hash table
> b) relocate the blocks (defrag)
> c) replace some blocks with a reference to other blocks (dedupe)
>
> This sounds a bit like what xfs_fsr does (or will do?).

xfs_fsr currently does the defrag bit w/ the swapext ioctl(*). It'll get extended to also do dedupe scans eventually, as defrag and dedupe obviously need to be integrated to prevent them from duelling.

(*) swapext - in case you might not know about it - atomically swaps a range of extents between two files without copying data.

So you remember those dedupe selection problems you were talking about (details omitted for brevity):

inode 1: ABCDEFG|H|I|JK|L|M|N|O|P
inode 2: a|b|c|d|e|f|g|hijklmop

The XFS solution will be to first defrag the master file (inode 1) with the larger extents from inode 2:

swapext(H-P, h-p)

inode 1: ABCDEFG|hijklmop
inode 2: a|b|c|d|e|f|g|H|I|JK|L|M|N|O|P

And then we dedupe to the defragged file:

dedupe(inode 1, inode 2)

inode 1: ABCDEFG|hijklmop
inode 2: ABCDEFG|hijklmop

> Bees also operates under a constant-RAM constraint, so it doesn't operate in two distinct "collect data" and "act on data collected" passes, and cannot spend memory to store data about more than a few extents at any time.

I suspect that I'm thinking at a completely different scale to you. I don't really care for highly constrained or optimal dedupe algorithms because those last few dedupe percentages really don't matter that much to me. I care much more about using all the resources we can and running as fast as we possibly can, then providing the admin with means to throttle performance back to what they need.

i.e. I'm concerned about how to effectively scan and dedupe PBs of data, where scan rates may need to be measured in tens of GB/s.
In these cases the limiting factor is either suboptimal IO patterns (low IO throughput), or not having enough CPU and memory bandwidth for hashing the data the IO engine is streaming through memory. Once we get past a certain point, we'll gain far more by parallelising the dedupe scan over lots of CPU than we ever will by optimising an algorithm that cannot be parallelised. Hence we need to start with a parallelisable algorithm, not an artificially constrained one. e.g. a soft requirement is that we need to scan the entire fs at least once a month. That will need to be broken up into nightly work, with a full 1PB filesystem needing 33+TB a night to be scanned. We typically get 1
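The swapext-then-dedupe sequence described above can be modelled with a toy extent list. This is an illustrative sketch only: the extent maps and the `swapext`/`dedupe` helpers below are simplified stand-ins for the real XFS mechanisms, not their actual interfaces.

```python
# Toy model of the swapext + dedupe sequence from the example above.
# An inode's layout is a list of extent strings; two files sharing an
# extent is modelled as the same string appearing in both lists.

def swapext(a, b, sl_a, sl_b):
    """Atomically swap a range of extent records between two files."""
    a[sl_a], b[sl_b] = b[sl_b], a[sl_a]

def dedupe(src, dst):
    """Make dst reference src's extents (assumes identical data)."""
    dst[:] = src[:]

# inode 1: one big extent A-G, then many small extents H..P
inode1 = ["ABCDEFG", "H", "I", "JK", "L", "M", "N", "O", "P"]
# inode 2: small extents a..g, then one big extent h-p
inode2 = ["a", "b", "c", "d", "e", "f", "g", "hijklmop"]

# Step 1: defrag the master file with inode 2's large tail extent,
# i.e. swapext(H-P, h-p).
swapext(inode1, inode2, slice(1, 9), slice(7, 8))

# Step 2: dedupe the second file against the defragged master; both
# now reference the same two extents.
dedupe(inode1, inode2)
```

After step 1, inode 1 holds the two large extents and inode 2 holds the fragments; after step 2 both files share inode 1's layout, matching the diagrams in the mail.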
Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)
On Fri, Aug 31, 2018 at 01:10:45AM -0400, Zygo Blaxell wrote: > On Thu, Aug 30, 2018 at 04:27:43PM +1000, Dave Chinner wrote: > > On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote: > > > On Mon, Aug 20, 2018 at 08:33:49AM -0700, Darrick J. Wong wrote: > > > > On Mon, Aug 20, 2018 at 11:09:32AM +1000, Dave Chinner wrote: > > > > > - is documenting rejection on request alignment grounds > > > > > (i.e. EINVAL) in the man page sufficient for app > > > > > developers to understand what is going on here? > > > > > > > > I think so. The manpage says: "The filesystem does not support > > > > reflinking the ranges of the given files", which (to my mind) covers > > > > this case of not supporting dedupe of EOF blocks. > > > > > > Older versions of btrfs dedupe (before v4.2 or so) used to do exactly > > > this; however, on btrfs, not supporting dedupe of EOF blocks means small > > > files (one extent) cannot be deduped at all, because the EOF block holds > > > a reference to the entire dst extent. If a dedupe app doesn't go all the > > > way to EOF on btrfs, then it should not attempt to dedupe any part of the > > > last extent of the file as the benefit would be zero or slightly negative. > > > > That's a filesystem implementation issue, not an API or application > > issue. > > The API and application issue remains even if btrfs is not considered. > btrfs is just the worst case outcome. Other filesystems still have > fragmentation issues, and applications have efficiency-vs-capability > tradeoffs to make if they can't rely on dedupe-to-EOF being available. > > Tools like 'cp --reflink=auto' work by trying the best case, then falling > back to a second choice if the first choice returns an error. Well, yes. That's necessary for the "cp" tool to behave according to user expectations. That's not a kernel API issue - that's just an implementation of an *application requirement*. 
Indeed, this is identical to the behaviour of rename() in "mv" - if rename fails with -EXDEV, mv needs to fall back to a manual copy because the user expects the file to be moved. IOWs, these application level requirements you talk about are just not relevant to the kernel API for dedupe/clone operations.

[snip]

> It is something that naive dedupe apps will do. "naive" here meaning "does not dive deeply into the filesystem's physical structure (or survey the entire filesystem with FIEMAP) to determine that the thousand-refs situation does not exist at dst prior to invoking the dedupe() call."

/me sighs and points at FS_IOC_GETFSMAP

$ man ioctl_getfsmap

DESCRIPTION
This ioctl(2) operation retrieves physical extent mappings for a filesystem. This information can be used to discover which files are mapped to a physical block, examine free space, or find known bad blocks, among other things.

I don't really care about "enabling" naive, inefficient applications. I care about applications that scale to huge filesystems and can get the job done quickly and efficiently.

> > XFS doesn't have partial overlaps, we don't have nodatacow hacks, and the subvol snapshot stuff I'm working on just uses shared data extents so it's 100% compatible with dedupe.
>
> If you allow this sequence of operations, you get partial overlaps:
>
> dedupe(fd1, 0, fd2, 0, 1048576);
> dedupe(fd2, 16384, fd3, 0, 65536);

Oh, I misunderstood - I thought you were referring to sharing partial filesystem blocks (like at EOF) because that's what this discussion was originally about. XFS supports the above just fine.

[snip]

tl;dr we don't need a new clone or dedupe API

> For future development I've abandoned the entire dedupe_file_range approach. I need to be able to read and dedupe the data blocks of the filesystem directly without having to deal with details like which files those blocks belong to, especially on filesystems with lots of existing deduped blocks and snapshots.
IOWs, your desired OOB dedupe algorithm is:

a) ask the filesystem where all its file data is
b) read that used space to build a data hash index
c) on all the data hash collisions find the owners of the colliding blocks
d) if the block data is the same dedupe it

I agree - that's a simple and effective algorithm. It's also the obvious solution to an expert in the field.

> The file structure is frankly just noise for dedupe on large filesystems.

We learnt that lesson back in the late 1990s. xfsdump, xfs_fsr, all the SGI^WHPE HSM scanning tools, etc all avoid the directory structure
Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)
On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote:
> On Mon, Aug 20, 2018 at 08:33:49AM -0700, Darrick J. Wong wrote:
> > On Mon, Aug 20, 2018 at 11:09:32AM +1000, Dave Chinner wrote:
> > > - is documenting rejection on request alignment grounds (i.e. EINVAL) in the man page sufficient for app developers to understand what is going on here?
> >
> > I think so. The manpage says: "The filesystem does not support reflinking the ranges of the given files", which (to my mind) covers this case of not supporting dedupe of EOF blocks.
>
> Older versions of btrfs dedupe (before v4.2 or so) used to do exactly this; however, on btrfs, not supporting dedupe of EOF blocks means small files (one extent) cannot be deduped at all, because the EOF block holds a reference to the entire dst extent. If a dedupe app doesn't go all the way to EOF on btrfs, then it should not attempt to dedupe any part of the last extent of the file as the benefit would be zero or slightly negative.

That's a filesystem implementation issue, not an API or application issue.

> The app developer would need to be aware that such a restriction could exist on some filesystems, and be able to distinguish this from other cases that could lead to EINVAL. Portable code would have to try a dedupe up to EOF, then if that failed, round down and retry, and if that failed too, the app would have to figure out which filesystem it's running on to know what to do next. Performance demands the app know what the FS will do in advance, and avoid a whole class of behavior.

Nobody writes "portable" applications like that. They read the man page first, and work out what the common subset of functionality is and then code from that. Man page says:

"Disk filesystems generally require the offset and length arguments to be aligned to the fundamental block size."

IOWs, code compatible with everything starts with supporting the general case: a range rounded to filesystem block boundaries (it's already run fstat() on the files it wants to dedupe to find their size, yes?), hence ignoring the partial EOF block. That will just work on everything.

Code that then wants to optimise for btrfs/xfs/ocfs quirks runs fstatvfs to determine what fs it's operating on and applies the necessary quirks. For btrfs it can extend the range to include the partial EOF block, and hence will handle the implementation quirks btrfs has with single extent dedupe.

Simple, reliable, and doesn't require any sort of flailing about with offsets and lengths to avoid unexpected EINVAL errors.

> btrfs dedupe reports success if the src extent is inline and the same size as the dst extent (i.e. file is smaller than one page). No dedupe can occur in such cases--a clone results in a simple copy, so the best a dedupe could do would be a no-op. Returning EINVAL there would break a few popular tools like "cp --reflink". Returning OK but doing nothing seems to be the best option in that case.

Again, those are filesystem implementation issues, not problems with the API itself.

> > > - should we just round down the EOF dedupe request to the block before EOF so dedupe still succeeds?
> >
> > I've often wondered if the interface should (have) be(en) that we start at src_off/dst_off and share as many common blocks as possible until we find a mismatch, then tell userspace where we stopped... instead of like now where we compare the entire extent and fail if any part of it doesn't match.
>
> The usefulness or harmfulness of that approach depends a lot on what the application expects the filesystem to do.
>
> In btrfs, the dedupe operation acts on references to data, not the underlying data blocks. If there are 1000 slightly overlapping references to a single contiguous range of data blocks in dst on disk, each dedupe operation acts on only one of those, leaving the other 999 untouched.
> If the app then submits 999 other dedupe requests, no references to the dst blocks remain and the underlying data blocks can be deleted.

Assuming your strawman is valid, if you have a thousand separate references across the same set of data blocks on disk, then that data is already heavily deduplicated. Trying to optimise that further seems misguided, way down the curve of diminishing returns.

> In a parallel universe (or a better filesystem, or a userspace emulation
                               ^^^
> built out of dedupe and other ioctls), dedupe could work at the extent data (physical) level. The app points at src and dst extent references (inode/offset/length tuples), and the filesystem figures out which physical blocks t
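The "general case first, per-fs quirks second" recipe Dave describes can be condensed into a small helper. This is a sketch of the policy only; the function name and the btrfs special case are illustrative assumptions, not a statement of what any real tool does.

```python
def dedupe_request_length(file_size, bs=4096, extend_to_eof=False):
    """Length a dedupe app should request when deduping a whole file.

    The portable, works-everywhere case rounds the file size *down* to
    a block boundary, ignoring the partial EOF block. A btrfs-aware app
    (having checked the fs type via fstatvfs or similar) may instead
    extend the request to EOF so single-extent files can still be
    deduped.
    """
    if extend_to_eof:
        return file_size           # btrfs quirk: dedupe right up to EOF
    return (file_size // bs) * bs  # portable: drop the partial EOF block
```

For the 557771-byte file used later in this thread's regression test, the portable request would cover 557056 bytes (136 whole 4KiB blocks), while the btrfs-aware request would cover all 557771.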
Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)
On Mon, Aug 20, 2018 at 08:17:18PM -0500, Eric Sandeen wrote:
>
> On 8/20/18 7:49 PM, Dave Chinner wrote:
> > Upon successful completion of this ioctl, the number of bytes successfully deduplicated is returned in bytes_deduped and a status code for the deduplication operation is returned in status. If even a single byte in the range does not match, the deduplication request will be ignored and status set to FILE_DEDUPE_RANGE_DIFFERS.
> >
> > This implies we can dedupe less than the entire range as long as the entire range matches. If the entire range does not match, we have to return FILE_DEDUPE_RANGE_DIFFERS, but in this case it does match so we can pick and choose how much we deduplicate. How much we dedupe is then returned as a byte count. In this case, it will be a few bytes short of the entire length requested because we aligned the dedupe inwards.
> >
> > Does that sound reasonable?
>
> I had hoped that dedupe was advisory as Darrick wished for, but TBH my reading of that is no, if you ask for a range to be deduped and any of it differs, "even a single byte," you fail it all.

Yes, if the data differs, then it fails. But that's not what I'm questioning nor what I care about in this case. I'm asking whether the filesystem gets to choose how much of the range of same data is deduped when the file data is apparently the same.

> Why else would that last part be present, if the interface is free to ignore later parts that don't match and truncate the range to the matching portion?

I think there is a clear differentiation in the man page text between "all the bytes are the same" and "how much of that range the filesystem deduped". i.e. the man page does not say that the filesystem *must* dedupe the entire range if all the data is the same - it says the filesystem will return /how many bytes it successfully deduplicated/.

IOWs, "do nothing and return zero" is a valid ->dedupe_file_range implementation.
AFAIC, it's the fact that we do data comparison from the page cache before calling into the filesystem to check on-disk formats that is the problem here. Having identical data in the page cache is not the same thing as "the blocks on disk containing the user data are identical". i.e. filesystems deduplicate disk blocks, but they often perform transformations on the data as it passes between the page cache and disk blocks, or have constraints on the format of the data on disk. For example:

- encrypted files. unencrypted data in the page cache is the same, therefore we can't return FILE_DEDUPE_RANGE_DIFFERS because the user will be able to see that they are the same. But they will be different on disk after encryption (e.g. because of a different nonce/seed or completely different keys) and so the filesystem should not dedupe them.
- the two files differ in on-disk format e.g. compressed vs encrypted.
- data might be inline with metadata
- tail packing
- dedupe request might be block aligned at file offsets, but file offsets are not aligned to disk blocks due to, say, multi-block data compression.
- on disk extent tracking granularity might be larger than filesystem block size (e.g. ext4's bigalloc mode) so can't be deduped on filesystem block size alignment.
- the source or destination inode is marked "NODATACOW" so can't contain shared extents

I'm sure there's others, but I think this is enough to demonstrate that "user visible file data is the same" does not mean the filesystem can dedupe it. The wording in the man page appears to understand these limitations and hence it allows us to silently ignore the partial EOF block when it comes to dedupe.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
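The ioctl being discussed takes a `struct file_dedupe_range` header followed by one `file_dedupe_range_info` per destination, with `bytes_deduped` and `status` filled in by the kernel. A minimal sketch of the request layout follows; the struct formats reflect my reading of the `linux/fs.h` definitions, and the `FIDEDUPERANGE` value is the common x86-64 encoding, so treat both as assumptions rather than authoritative.

```python
import struct

# struct file_dedupe_range: src_offset (u64), src_length (u64),
# dest_count (u16), reserved1 (u16), reserved2 (u32); the info
# array follows the header.
DEDUPE_RANGE = struct.Struct("=QQHHI")
# struct file_dedupe_range_info: dest_fd (s64), dest_offset (u64),
# bytes_deduped (u64, out), status (s32, out), reserved (u32).
DEDUPE_RANGE_INFO = struct.Struct("=qQQiI")

FILE_DEDUPE_RANGE_SAME = 0
FILE_DEDUPE_RANGE_DIFFERS = 1
FIDEDUPERANGE = 0xC0189436  # _IOWR(0x94, 54, struct file_dedupe_range)

def pack_request(src_offset, src_length, dest_fd, dest_offset):
    """Build a one-destination FIDEDUPERANGE request buffer."""
    hdr = DEDUPE_RANGE.pack(src_offset, src_length, 1, 0, 0)
    info = DEDUPE_RANGE_INFO.pack(dest_fd, dest_offset, 0, 0, 0)
    return bytearray(hdr + info)

def unpack_result(buf):
    """Return (bytes_deduped, status) for the single destination."""
    _, _, bytes_deduped, status, _ = DEDUPE_RANGE_INFO.unpack_from(
        buf, DEDUPE_RANGE.size)
    return bytes_deduped, status
```

A caller would pass the buffer to `fcntl.ioctl(src_fd, FIDEDUPERANGE, buf)` and then read back `bytes_deduped` and `status`; per the man page text discussed above, `bytes_deduped` may legitimately be less than `src_length`.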
Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)
On Mon, Aug 20, 2018 at 08:33:49AM -0700, Darrick J. Wong wrote:
> On Mon, Aug 20, 2018 at 11:09:32AM +1000, Dave Chinner wrote:
> > So why was this dedupe request even accepted by the kernel? Well, I think there's a bug in the check just above this:
> >
> > /* If we're linking to EOF, continue to the block boundary. */
> > if (pos_in + *len == isize)
> > 	blen = ALIGN(isize, bs) - pos_in;
> > else
> > 	blen = *len;
> >
> > basically, when the "in" file dedupe/reflink range is to EOF, it expands the range to include /all/ of the block that contains the EOF byte. IOWs, it now requests us to dedupe /undefined data beyond EOF/. But when we go to compare the data in these ranges, it truncates the comparison to the length that the user asked for (i.e. *len) and not the extended block length.
> >
> > IOWs, it doesn't compare the bytes beyond EOF in the source block to the data in the destination block it would replace, and so doesn't fail the compare like it should.
..
> > i.e. the dedupe request should fail as we cannot dedupe the EOF block.
> >
> > The patch below does this for the VFS dedupe code. it's not a final patch, it's just a demonstration of how this should never have got past the range checks. Open questions:
>
> (Here's my five minutes of XFS that I'm allowed per day... :/)
>
> > - is documenting rejection on request alignment grounds (i.e. EINVAL) in the man page sufficient for app developers to understand what is going on here?
>
> I think so. The manpage says: "The filesystem does not support reflinking the ranges of the given files", which (to my mind) covers this case of not supporting dedupe of EOF blocks.

Ok.

> > - should we just round down the EOF dedupe request to the block before EOF so dedupe still succeeds?
> I've often wondered if the interface should (have) be(en) that we start at src_off/dst_off and share as many common blocks as possible until we find a mismatch, then tell userspace where we stopped... instead of like now where we compare the entire extent and fail if any part of it doesn't match.

I think that the ioctl_fideduperange() man page description gives us the flexibility to do this. That is, this text:

Upon successful completion of this ioctl, the number of bytes successfully deduplicated is returned in bytes_deduped and a status code for the deduplication operation is returned in status. If even a single byte in the range does not match, the deduplication request will be ignored and status set to FILE_DEDUPE_RANGE_DIFFERS.

This implies we can dedupe less than the entire range as long as the entire range matches. If the entire range does not match, we have to return FILE_DEDUPE_RANGE_DIFFERS, but in this case it does match so we can pick and choose how much we deduplicate. How much we dedupe is then returned as a byte count. In this case, it will be a few bytes short of the entire length requested because we aligned the dedupe inwards.

Does that sound reasonable?

> > - should file cloning (i.e. reflink) be allowed to share the EOF block somewhere inside EOF in the destination file?
>
> I don't think we should.
>
> > That's stale data exposure, yes?
>
> Haven't tested that, but seems likely.

Yeah, I haven't tested it either, but I can't see how the current behaviour ends well.

> > - should clone only allow sharing of the EOF block if it's either a full file clone or a matching range clone and isize is the same for both src and dest files?
>
> The above sound reasonable for clone, though I would also allow clone to share the EOF block if we extend isize of the dest file.

Hmm. That covers the only valid rule for sharing the EOF block, right? i.e.
That the source EOF lands exactly at or beyond the destination EOF, so it remains the EOF block in both files? FWIW, I just noticed that the ioctl_ficlonerange man page doesn't document the return value on success of a clone.

> The above also nearly sound reasonable for dedupe too. If we encounter EOF at the same places in the src range and the dest range, should we allow sharing there? The post-eof bytes are undefined by definition; does undefined == undefined?

It's undefined, especially as different fs implementations may be doing different things with partial blocks (e.g. tail packing).

From an XFS perspective, I don't think we should physically share partial EOF blocks on dedupe because it means ex
[patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)
[cc linux-fsdevel now, too]

On Mon, Aug 20, 2018 at 09:11:26AM +1000, Dave Chinner wrote:
> [cc linux-...@vger.kernel.org]
>
> On Fri, Aug 17, 2018 at 09:39:24AM +0100, fdman...@kernel.org wrote:
> > From: Filipe Manana
> >
> > Test that deduplication of an entire file that has a size that is not aligned to the filesystem's block size into a different file does not corrupt the destination's file data.

Ok, I've looked at this now. My first question is where did all the magic offsets in this test come from? i.e. how was this bug found and who is it affecting?

> > This test is motivated by a bug found in Btrfs which is fixed by the following patch for the linux kernel:
> >
> > "Btrfs: fix data corruption when deduplicating between different files"
> >
> > XFS also fails this test, at least as of linux kernel 4.18-rc7, exactly with the same corruption as in Btrfs - some bytes of a block get replaced with zeroes after the deduplication.
>
> Filipe, in future can you please report XFS bugs you find to the XFS list the moment you find them. We shouldn't ever find out about a data corruption bug we need to fix via a "oh, by the way" comment in a commit message for a regression test.

This becomes much more relevant because of what I've just found.

> > +# The first byte with a value of 0xae starts at an offset (2518890) which is not
> > +# a multiple of the block size.
> > +$XFS_IO_PROG -f \
> > +	-c "pwrite -S 0x6b 0 2518890" \
> > +	-c "pwrite -S 0xae 2518890 102398" \
> > +	$SCRATCH_MNT/foo | _filter_xfs_io
> > +
> > +# Create a second file with a length not aligned to the block size, whose bytes
> > +# all have the value 0x6b, so that its extent(s) can be deduplicated with the
> > +# first file.
> > +$XFS_IO_PROG -f -c "pwrite -S 0x6b 0 557771" $SCRATCH_MNT/bar | _filter_xfs_io
> > +
> > +# The file is filled with bytes having the value 0x6b from offset 0 to offset
> > +# 2518889 and with the value 0xae from offset 2518890 to offset 2621287.
> > +echo "File content before deduplication:"
> > +od -t x1 $SCRATCH_MNT/foo

Please use "od -Ad -t x1 " so the file offsets reported by od match the offsets used in the test (i.e. in decimal, not octal).

> > +
> > +# Now deduplicate the entire second file into a range of the first file that
> > +# also has all bytes with the value 0x6b. The destination range's end offset
> > +# must not be aligned to the block size and must be less than the offset of
> > +# the first byte with the value 0xae (byte at offset 2518890).
> > +$XFS_IO_PROG -c "dedupe $SCRATCH_MNT/bar 0 1957888 557771" $SCRATCH_MNT/foo \
> > +	| _filter_xfs_io

Ok, now it gets fun. dedupe to non-block-aligned ranges is supposed to be rejected by the kernel in vfs_clone_file_prep_inodes(). i.e. this check:

	/* Only reflink if we're aligned to block boundaries */
	if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_in + blen, bs) ||
	    !IS_ALIGNED(pos_out, bs) || !IS_ALIGNED(pos_out + blen, bs))
		return -EINVAL;

And it's pretty clear that a length of 557771 is not block aligned (being an odd number). So why was this dedupe request even accepted by the kernel? Well, I think there's a bug in the check just above this:

	/* If we're linking to EOF, continue to the block boundary. */
	if (pos_in + *len == isize)
		blen = ALIGN(isize, bs) - pos_in;
	else
		blen = *len;

basically, when the "in" file dedupe/reflink range is to EOF, it expands the range to include /all/ of the block that contains the EOF byte. IOWs, it now requests us to dedupe /undefined data beyond EOF/. But when we go to compare the data in these ranges, it truncates the comparison to the length that the user asked for (i.e. *len) and not the extended block length.
IOWs, it doesn't compare the bytes beyond EOF in the source block to the data in the destination block it would replace, and so doesn't fail the compare like it should. And, well, btrfs has the same bug. extent_same_check_offsets() extends the range for alignment so it passes alignment checks, but then /explicitly/ uses the original length for the data compare and dedupe. i.e: /* pass original length for comparison so we stay within i_size */ ret = btrfs_cmp_data(olen, cmp); if (ret == 0) ret = btrfs_clone(src, dst, loff, olen, len, dst_loff, 1); This is what we should see if someone tried to dedupe the EOF block of a file: generic/505 - output mismatch (see /home/dave/src/xfstests-dev/resul
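[Editor's note] The arithmetic behind Dave's analysis is easy to check by hand. A minimal sketch, assuming a 4096-byte block size and the offsets from the test (pos_in is 0, so the request runs exactly to EOF):

```shell
bs=4096        # assumed filesystem block size
len=557771     # user-requested dedupe length (= size of the source file)
isize=557771   # source file size: the request runs to EOF

# "If we're linking to EOF, continue to the block boundary."
if [ "$len" -eq "$isize" ]; then
	blen=$(( (isize + bs - 1) / bs * bs ))
else
	blen=$len
fi

echo "requested len: $len"    # 557771, not block aligned
echo "extended blen: $blen"   # 561152, block aligned
echo "bytes beyond EOF that are never compared: $(( blen - len ))"  # 3381
```

So the extended length sails through the alignment check, while the later data compare still uses the original 557771 bytes and never looks at the 3381 bytes of undefined data beyond EOF in the source block.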
Re: [PATCH] generic: test for deduplication between different files
[cc linux-...@vger.kernel.org] On Fri, Aug 17, 2018 at 09:39:24AM +0100, fdman...@kernel.org wrote: > From: Filipe Manana > > Test that deduplication of an entire file that has a size that is not > aligned to the filesystem's block size into a different file does not > corrupt the destination's file data. > > This test is motivated by a bug found in Btrfs which is fixed by the > following patch for the linux kernel: > > "Btrfs: fix data corruption when deduplicating between different files" > > XFS also fails this test, at least as of linux kernel 4.18-rc7, exactly > with the same corruption as in Btrfs - some bytes of a block get replaced > with zeroes after the deduplication. Filipe, in future can you please report XFS bugs you find to the XFS list the moment you find them. We shouldn't ever find out about a data corruption bug we need to fix via an "oh, by the way" comment in a commit message for a regression test. Cheers, Dave. > Signed-off-by: Filipe Manana > --- > tests/generic/505 | 84 > +++ > tests/generic/505.out | 33 > tests/generic/group | 1 + > 3 files changed, 118 insertions(+) > create mode 100755 tests/generic/505 > create mode 100644 tests/generic/505.out > > diff --git a/tests/generic/505 b/tests/generic/505 > new file mode 100755 > index ..5ee232a2 > --- /dev/null > +++ b/tests/generic/505 > @@ -0,0 +1,84 @@ > +#! /bin/bash > +# SPDX-License-Identifier: GPL-2.0 > +# Copyright (C) 2018 SUSE Linux Products GmbH. All Rights Reserved. > +# > +# FS QA Test No. 505 > +# > +# Test that deduplication of an entire file that has a size that is not > aligned > +# to the filesystem's block size into a different file does not corrupt the > +# destination's file data. > +# > +seq=`basename $0` > +seqres=$RESULT_DIR/$seq > +echo "QA output created by $seq" > +tmp=/tmp/$$ > +status=1 # failure is the default! > +trap "_cleanup; exit \$status" 0 1 2 3 15 > + > +_cleanup() > +{ > + cd / > + rm -f $tmp.* > +} > + > +# get standard environment, filters and checks > . 
./common/rc > +. ./common/filter > +. ./common/reflink > + > +# real QA test starts here > +_supported_fs generic > +_supported_os Linux > +_require_scratch_dedupe > + > +rm -f $seqres.full > + > +_scratch_mkfs >>$seqres.full 2>&1 > +_scratch_mount > + > +# The first byte with a value of 0xae starts at an offset (2518890) which is > not > +# a multiple of the block size. > +$XFS_IO_PROG -f \ > + -c "pwrite -S 0x6b 0 2518890" \ > + -c "pwrite -S 0xae 2518890 102398" \ > + $SCRATCH_MNT/foo | _filter_xfs_io > + > +# Create a second file with a length not aligned to the block size, whose > bytes > +# all have the value 0x6b, so that its extent(s) can be deduplicated with the > +# first file. > +$XFS_IO_PROG -f -c "pwrite -S 0x6b 0 557771" $SCRATCH_MNT/bar | > _filter_xfs_io > + > +# The file is filled with bytes having the value 0x6b from offset 0 to offset > +# 2518889 and with the value 0xae from offset 2518890 to offset 2621287. > +echo "File content before deduplication:" > +od -t x1 $SCRATCH_MNT/foo > + > +# Now deduplicate the entire second file into a range of the first file that > +# also has all bytes with the value 0x6b. The destination range's end offset > +# must not be aligned to the block size and must be less then the offset of > +# the first byte with the value 0xae (byte at offset 2518890). > +$XFS_IO_PROG -c "dedupe $SCRATCH_MNT/bar 0 1957888 557771" $SCRATCH_MNT/foo \ > + | _filter_xfs_io > + > +# The bytes in the range starting at offset 2515659 (end of the deduplication > +# range) and ending at offset 2519040 (start offset rounded up to the block > +# size) must all have the value 0xae (and not replaced with 0x00 values). > +# In other words, we should have exactly the same data we had before we asked > +# for deduplication. > +echo "File content after deduplication and before unmounting:" > +od -t x1 $SCRATCH_MNT/foo > + > +# Unmount the filesystem and mount it again. This guarantees any file data in > +# the page cache is dropped. 
> +_scratch_cycle_mount > + > +# The bytes in the range starting at offset 2515659 (end of the deduplication > +# range) and ending at offset 2519040 (start offset rounded up to the block > +# size) must all have the value 0xae (and not replaced with 0x00 values). > +# In other words, we should have exactly the same data we had before we asked > +# for deduplication. > +echo "File content after unmounting:"
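[Editor's note] The relationships between the magic offsets quoted in the test above can be verified with a little shell arithmetic; the 4096-byte block size is an assumption:

```shell
bs=4096           # assumed block size
dst_off=1957888   # destination offset passed to the dedupe command
src_len=557771    # length of bar, and of the dedupe request
first_ae=2518890  # offset of the first 0xae byte in foo

end=$(( dst_off + src_len ))
echo "destination range ends at:    $end"                             # 2515659
echo "end is block aligned:         $(( end % bs == 0 ))"             # 0
echo "end is below first 0xae byte: $(( end < first_ae ))"            # 1
echo "end rounded up to block size: $(( (end + bs - 1) / bs * bs ))"  # 2519040
```

This matches the 2515659 and 2519040 offsets named in the test's comments: the destination range ends unaligned, below the first 0xae byte, so the 0xae data between the range end and the next block boundary must survive the dedupe untouched.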
Re: [PATCH V4] test online label ioctl
On Thu, May 17, 2018 at 10:28:26AM -0500, Eric Sandeen wrote: > This tests the online label ioctl that btrfs has, which has been > recently proposed for XFS. > > To run, it requires an updated xfs_io with the label command and a > filesystem that supports it > > A slight change here to _require_xfs_io_command as well, so that tests > which simply fail with "Inappropriate ioctl" can be caught in the > common case. > > Signed-off-by: Eric Sandeen <sand...@redhat.com> > --- > > Now with new and improved sequential V4 versioning! Looks good now. Reviewed-by: Dave Chinner <dchin...@redhat.com> -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V3] test online label ioctl
On Tue, May 15, 2018 at 10:22:37AM -0500, Eric Sandeen wrote: > This tests the online label ioctl that btrfs has, which has been > recently proposed for XFS. > > To run, it requires an updated xfs_io with the label command and a > filesystem that supports it > > A slight change here to _require_xfs_io_command as well, so that tests > which simply fail with "Inappropriate ioctl" can be caught in the > common case. > > Signed-off-by: Eric Sandeen <sand...@redhat.com> > --- > > (urgh send as proper new thread, sorry) > > This passes on btrfs, _notruns on xfs/ext4 of yore, and passes > on xfs w/ my online label patchset (as long as xfs_io has the new > capability) > > V2: Add a max label length helper > Set the proper btrfs max label length o_O oops > Filter trailing whitespace from blkid output > > V3: lowercase local vars, simplify max label len function Looks good now, but I wondered about one thing the test doesn't cover: can you clear the label by setting it to a null string? i.e you check max length bounds, but don't check empty string behaviour... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH V2] test online label ioctl
On Mon, May 14, 2018 at 06:26:07PM -0500, Eric Sandeen wrote: > On 5/14/18 6:11 PM, Dave Chinner wrote: > > On Mon, May 14, 2018 at 12:09:16PM -0500, Eric Sandeen wrote: > >> This tests the online label ioctl that btrfs has, which has been > >> recently proposed for XFS. > >> > >> To run, it requires an updated xfs_io with the label command and a > >> filesystem that supports it > >> > >> A slight change here to _require_xfs_io_command as well, so that tests > >> which simply fail with "Inappropriate ioctl" can be caught in the > >> common case. > >> > >> Signed-off-by: Eric Sandeen <sand...@redhat.com> > >> --- > > > > >> +# The maximum filesystem label length, not including terminating NULL > >> +_label_get_max() > >> +{ > >> + case $FSTYP in > >> + xfs) > >> + MAXLEN=12 > >> + ;; > >> + btrfs) > >> + MAXLEN=255 > >> + ;; > >> + *) > >> + MAXLEN=0 > > > > Why not just _notrun here? > > do we want to go through the other steps only to get here and notrun > halfway through? > > >> + ;; > >> + esac > >> + > >> + echo $MAXLEN > >> +} > >> + > >> +_require_label_get_max() > >> +{ > >> + if [ $(_label_get_max) -eq 0 ]; then > >> + _notrun "$FSTYP does not define maximum label length" > >> + fi > > > > And this check can go away? > > We'd like to know if all the validations in this can complete before we > get started, right? That's why I did it this way. Sure, just trying to be a bit defensive as people often forget _requires directives when writing new tests > > Also, shouldn't it be _require_online_label_change() ? And then > > maybe you can move the xfs_io label command check inside it? 
> > Well, we want to know a lot of things: > > 1) can the kernel code for this filesystem support online label > 2) does xfs_io support the label command > 3) does this test know the maximum label length to test for this fs > > the above stuff is for 3) What I was suggesting was doing all of these in one function similar to _require_xfs_sparse_inodes, _require_meta_uuid, etc: _require_online_label_change() { # need xfs_io support _require_xfs_io_command "label" # need fstests knowledge of label size [ $(_label_get_max) -eq 0 ] && _notrun "$FSTYP does not define maximum label length" # need kernel support $XFS_IO_PROG -c label $TEST_DIR > /dev/null 2>&1 [ $? -ne 0 ] && _notrun "Kernel does not support FS_IOC_GETLABEL" } Which also means that the label_f command in xfs_io needs to set the exitcode to non-zero when the ioctl fails so that xfs_io returns non-zero on failure... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH V2] test online label ioctl
On Mon, May 14, 2018 at 12:09:16PM -0500, Eric Sandeen wrote: > This tests the online label ioctl that btrfs has, which has been > recently proposed for XFS. > > To run, it requires an updated xfs_io with the label command and a > filesystem that supports it > > A slight change here to _require_xfs_io_command as well, so that tests > which simply fail with "Inappropriate ioctl" can be caught in the > common case. > > Signed-off-by: Eric Sandeen <sand...@redhat.com> > --- > > (urgh send as proper new thread, sorry) > > This passes on btrfs, _notruns on xfs/ext4 of yore, and passes > on xfs w/ my online label patchset (as long as xfs_io has the new > capability) > > V2: Add a max label length helper > Set the proper btrfs max label length o_O oops > Filter trailing whitespace from blkid output > > > diff --git a/common/rc b/common/rc > index 9ffab7fd..88a99cff 100644 > --- a/common/rc > +++ b/common/rc > @@ -2158,6 +2158,9 @@ _require_xfs_io_command() > echo $testio | grep -q "Inappropriate ioctl" && \ > _notrun "xfs_io $command support is missing" > ;; > + "label") > + testio=`$XFS_IO_PROG -c "label" $TEST_DIR 2>&1` > + ;; > "open") > # -c "open $f" is broken in xfs_io <= 4.8. Along with the fix, > # a new -C flag was introduced to execute one shot commands. 
> @@ -2196,7 +2199,7 @@ _require_xfs_io_command() > rm -f $testfile 2>&1 > /dev/null > echo $testio | grep -q "not found" && \ > _notrun "xfs_io $command support is missing" > - echo $testio | grep -q "Operation not supported" && \ > + echo $testio | grep -q "Operation not supported\|Inappropriate ioctl" > && \ > _notrun "xfs_io $command failed (old kernel/wrong fs?)" > echo $testio | grep -q "Invalid" && \ > _notrun "xfs_io $command failed (old kernel/wrong fs/bad args?)" > @@ -3802,6 +3805,31 @@ _require_scratch_feature() > esac > } > > +# The maximum filesystem label length, not including terminating NULL > +_label_get_max() > +{ > + case $FSTYP in > + xfs) > + MAXLEN=12 > + ;; > + btrfs) > + MAXLEN=255 > + ;; > + *) > + MAXLEN=0 Why not just _notrun here? > + ;; > + esac > + > + echo $MAXLEN > +} > + > +_require_label_get_max() > +{ > + if [ $(_label_get_max) -eq 0 ]; then > + _notrun "$FSTYP does not define maximum label length" > + fi And this check can go away? Also, shouldn't it be _require_online_label_change() ? And then maybe you can move the xfs_io label command check inside it? > +# remove previous $seqres.full before test > +rm -f $seqres.full > + > +# real QA test starts here > + > +_supported_fs generic > +_supported_os Linux > +_require_scratch > +_require_xfs_io_command "label" > +_require_label_get_max > + > +_scratch_mkfs > $seqres.full 2>&1 > +_scratch_mount > + > +# Make sure we can set & clear the label > +$XFS_IO_PROG -c "label label.$seq" $SCRATCH_MNT > +$XFS_IO_PROG -c "label" $SCRATCH_MNT > + > +# And that userspace can see it now, while mounted > +# NB: some blkid has trailing whitespace, filter it out here > +blkid -s LABEL $SCRATCH_DEV | _filter_scratch | sed -e "s/ $//g" > + > +# And that the it is still there when it's unmounted > +_scratch_unmount > +blkid -s LABEL $SCRATCH_DEV | _filter_scratch | sed -e "s/ $//g" Ok, so "LABEL" here is a special blkid match token.... 
> +# And that it persists after a remount > +_scratch_mount > +$XFS_IO_PROG -c "label" $SCRATCH_MNT > + > +# And that a too-long label is rejected, beyond the interface max: > +LABEL=$(perl -e "print 'l' x 257;") And now you use it as a variable. Nasty and confusing. Using lower case for local variables is the norm, right? I thought we were only supposed to use upper case for global test harness variables... But even making it "label" is problematic: > +$XFS_IO_PROG -c "label $LABEL" $SCRATCH_MNT because "label" is an xfs_io command. Perhaps call it "fs_label"? Cheers, Dave. -- Dave Chinner da...@fromorbit.com
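[Editor's note] The too-long-label check under review can be sketched outside of fstests. `max_len` and `fs_label` are illustrative names (the latter per Dave's suggested rename), and the printf/tr pipeline stands in for the test's perl one-liner:

```shell
max_len=255                                  # btrfs maximum from _label_get_max
fs_label=$(printf '%257s' '' | tr ' ' 'l')   # 257 'l's, like perl's 'l' x 257

echo "label length: ${#fs_label}"
if [ "${#fs_label}" -gt "$max_len" ]; then
	echo "label exceeds the $max_len-byte maximum; the set-label ioctl should fail"
fi
```

The length 257 exceeds even btrfs's 255-byte limit (the largest in `_label_get_max`), so the same oversized label exercises the rejection path on every supported filesystem.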
Re: [RFC][PATCH 0/76] vfs: 'views' for filesystems with more than one root
On Tue, May 08, 2018 at 10:06:44PM -0400, Jeff Mahoney wrote: > On 5/8/18 7:38 PM, Dave Chinner wrote: > > On Tue, May 08, 2018 at 11:03:20AM -0700, Mark Fasheh wrote: > >> Hi, > >> > >> The VFS's super_block covers a variety of filesystem functionality. In > >> particular we have a single structure representing both I/O and > >> namespace domains. > >> > >> There are requirements to de-couple this functionality. For example, > >> filesystems with more than one root (such as btrfs subvolumes) can > >> have multiple inode namespaces. This starts to confuse userspace when > >> it notices multiple inodes with the same inode/device tuple on a > >> filesystem. > > > > Devil's Advocate - I'm not looking at the code, I'm commenting on > > architectural issues I see here. > > > > The XFS subvolume work I've been doing explicitly uses a superblock > > per subvolume. That's because subvolumes are designed to be > > completely independent of the backing storage - they know nothing > > about the underlying storage except to share a BDI for writeback > > purposes and write to whatever block device the remapping layer > > gives them at IO time. Hence XFS subvolumes have (at this point) > > their own unique s_dev, on-disk format configuration, journal, space > > accounting, etc. i.e. They are fully independent filesystems in > > their own right, and as such we do not have multiple inode > > namespaces per superblock. > > That's a fundamental difference between how your XFS subvolumes work and > how btrfs subvolumes do. Yup, you've just proved my point: this is not a "subvolume problem"; but rather a "multiple namespace per root" problem. > There is no independence among btrfs > subvolumes. When a snapshot is created, it has a few new blocks but > otherwise shares the metadata of the source subvolume. The metadata > trees are shared across all of the subvolumes and there are several > internal trees used to manage all of it. I don't need btrfs 101 stuff explained to me. 
:/ > a single transaction engine. There are housekeeping and maintenance > tasks that operate across the entire file system internally. I > understand that there are several problems you need to solve at the VFS > layer to get your version of subvolumes up and running, but trying to > shoehorn one into the other is bound to fail. Actually, the VFS has provided everything I need for XFS subvolumes so far without requiring any sort of modifications. That's the perspective I'm approaching this from - if the VFS can do what we need for XFS subvolumes, as well as overlay (which are effectively VFS-based COW subvolumes), then lets see if we can make that work for btrfs too. > > So this doesn't sound like a "subvolume problem" - it's a "how do we > > sanely support multiple independent namespaces per superblock" > > problem. AFAICT, this same problem exists with bind mounts and mount > > namespaces - they are effectively multiple roots on a single > > superblock, but it's done at the vfsmount level and so the > > superblock knows nothing about them. > > In this case, you're talking about the user-visible file system > hierarchy namespace that has no bearing on the underlying file system > outside of per-mount flags. Except that it tracks and provides infrastructure that allows user visible "multiple namespace per root" constructs. Subvolumes - as a user visible namespace construct - are little different to bind mounts in behaviour and functionality. How the underlying filesystem implements subvolumes is really up to the filesystem, but we should be trying to implement a clean model for "multiple namespaces on a single root" at the VFS so we have consistent behaviour across all filesystems that implement similar functionality. FWIW, bind mounts and overlay also have similar inode number namespace problems to what Mark describes for btrfs subvolumes. e.g. 
overlay recently introduced the "xino" mount option to separate the user presented inode number namespace for overlay inodes from the underlying parent filesystem inodes. How is that different to btrfs subvolumes needing to present different inode number namespaces from the underlying parent? This sort of "namespace shifting" is needed for several different pieces of information the kernel reports to userspace. The VFS replacement for shiftfs is an example of this. So is inode number remapping. I'm sure there's more. My point is that if we are talking about infrastructure to remap what userspace sees from different mountpoint views into a filesystem, then it should be done above the filesystem layers in the VFS so all filesystems behave th
Re: [RFC][PATCH 0/76] vfs: 'views' for filesystems with more than one root
On Tue, May 08, 2018 at 11:03:20AM -0700, Mark Fasheh wrote: > Hi, > > The VFS's super_block covers a variety of filesystem functionality. In > particular we have a single structure representing both I/O and > namespace domains. > > There are requirements to de-couple this functionality. For example, > filesystems with more than one root (such as btrfs subvolumes) can > have multiple inode namespaces. This starts to confuse userspace when > it notices multiple inodes with the same inode/device tuple on a > filesystem. Devil's Advocate - I'm not looking at the code, I'm commenting on architectural issues I see here. The XFS subvolume work I've been doing explicitly uses a superblock per subvolume. That's because subvolumes are designed to be completely independent of the backing storage - they know nothing about the underlying storage except to share a BDI for writeback purposes and write to whatever block device the remapping layer gives them at IO time. Hence XFS subvolumes have (at this point) their own unique s_dev, on-disk format configuration, journal, space accounting, etc. i.e. They are fully independent filesystems in their own right, and as such we do not have multiple inode namespaces per superblock. So this doesn't sound like a "subvolume problem" - it's a "how do we sanely support multiple independent namespaces per superblock" problem. AFAICT, this same problem exists with bind mounts and mount namespaces - they are effectively multiple roots on a single superblock, but it's done at the vfsmount level and so the superblock knows nothing about them. So this kinda feels like there's still an impedance mismatch between btrfs subvolumes being mounted as subtrees on the underlying root vfsmount rather than being created as truly independent vfs namespaces that share a superblock. To put that as a question: why aren't btrfs subvolumes vfsmounts in their own right, with the unique subvolume information stored in (or obtained from) the vfsmount? 
> In addition, it's currently impossible for a filesystem subvolume to > have a different security context from its parent. If we could allow > for subvolumes to optionally specify their own security context, we > could use them as containers directly instead of having to go through > an overlay. Again, XFS subvolumes don't have this problem. So really we need to frame this discussion in terms of supporting multiple namespaces within a superblock sanely, not subvolumes. > I ran into this particular problem with respect to Btrfs some years > ago and sent out a very naive set of patches which were (rightfully) > not incorporated: > > https://marc.info/?l=linux-btrfs=130074451403261=2 > https://marc.info/?l=linux-btrfs=130532890824992=2 > > During the discussion, one question did come up - why can't > filesystems like Btrfs use a superblock per subvolume? There's a > couple of problems with that: > > - It's common for a single Btrfs filesystem to have thousands of > subvolumes. So keeping a superblock for each subvol in memory would > get prohibitively expensive - imagine having 8000 copies of struct > super_block for a file system just because we wanted some separation > of say, s_dev. That's no different to using individual overlay mounts for the thousands of containers that are on the system. This doesn't seem to be a major problem... > - Writeback would also have to walk all of these superblocks - > again not very good for system performance. Background writeback is backing device focussed, not superblock focussed. It will only iterate the superblocks that have dirty inodes on the bdi writeback lists, not all the superblocks on the bdi. IOWs, this isn't a major problem except for sync() operations that iterate superblocks. > - Anyone wanting to lock down I/O on a filesystem would have to > freeze all the superblocks. This goes for most things related to > I/O really - we simply can't afford to have the kernel walking > thousands of superblocks to sync a single fs. 
Not with XFS subvolumes. Freezing the underlying parent filesystem will effectively stop all IO from the mounted subvolumes by freezing remapping calls before IO. Sure, those subvolumes aren't in a consistent state, but we don't freeze userspace so none of the application data is ever in a consistent state when filesystems are frozen. So, again, I'm not sure there's a /subvolume/ problem here. There's definitely a "freeze hierarchy" problem, but that already exists and it's something we talked about at LSFMM because we need to solve it for reliable hibernation. > It's far more efficient then to pull those fields we need for a > subvolume namespace into their own structure. I'm not convinced yet - it still feels like it's the wrong layer to be solving the multiple namespace per superblock problem
Re: Symlink not persisted even after fsync
On Sun, Apr 15, 2018 at 07:10:52PM -0500, Vijay Chidambaram wrote: > Thanks! As I mentioned before, this is useful. I have a follow-up > question. Consider the following workload: > > creat foo > link (foo, A/bar) > fsync(foo) > crash > > In this case, after the file system recovers, do we expect foo's link > count to be 2 or 1? So, strictly ordered behaviour: create foo: - creates dirent in inode B and new inode A in an atomic transaction sequence #1 link foo -> A/bar - creates dirent in inode C and bumps inode A link count in an atomic transaction sequence #2. fsync foo - looks at inode A, sees its "last modification" sequence counter as #2 - flushes all transactions up to and including #2 to the journal. See the dependency chain? Both the inodes and dirents in the create operation and the link operation are chained to the inode foo via the atomic transactions. Hence when we flush foo, we also flush the dependent changes because of the change atomicity requirements. > I would say 2, Correct, for strict ordering. But > but POSIX is silent on this, Well, it's not silent, POSIX explicitly allows for fsync() to do nothing and report success. Hence we can't really look to POSIX to define how fsync() should behave. > so > thought I would confirm. The tricky part here is we are not calling > fsync() on directory A. Right. But directory A has a dependent change linked to foo. If we fsync() foo, we are persisting the link count change in that file, and hence all the other changes related to that link count change must also be flushed. Similarly, all the changes related to the creation of foo must be flushed, too. > In this case, its not a symlink; its a hard link, so I would say the > link count for foo should be 2. Right - that's the "reference counted object dependency" I referred to. i.e. it's a bi-directional atomic dependency - either we show both the new dirent and the link count change, or we show neither of them. 
Hence fsync on one object implies that we are also persisting the related changes in the other object, too. > But btrfs and F2FS show link count of > 1 after a crash. That may be valid if the dirent A/bar does not exist after recovery, but it also means fsync() hasn't actually guaranteed inode changes made prior to the fsync to be persistent on disk. i.e. that's a violation of ordered metadata semantics and probably a bug. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
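[Editor's note] The link-count bookkeeping in the dependency chain described above can be observed on any live filesystem, no crash required; the scratch directory here is just for illustration:

```shell
cd "$(mktemp -d)"    # throwaway scratch directory
mkdir A
touch foo            # transaction #1: new inode plus dirent, link count 1
ln foo A/bar         # transaction #2: new dirent plus link count bump
stat -c '%h' foo     # prints 2: both dirents reference the same inode
```

A strictly ordered filesystem that persisted transaction #2 via fsync(foo) must show this count of 2 after recovery; a count of 1 alongside a surviving A/bar dirent would be the inconsistency Dave calls out.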
Re: Symlink not persisted even after fsync
On Fri, Apr 13, 2018 at 10:27:56PM -0500, Vijay Chidambaram wrote: > Hi Dave, > > Thanks for the reply. > > I feel like we are not talking about the same thing here. > > What we are asking is: if you perform > > fsync(symlink) > crash > > can we expect it to see the symlink file in the parent directory after > a crash given we didn't fsync the parent directory? Amir argues we > can't expect it. Your first email seemed to argue we should expect it. My first email comments on Amir's quoting of behaviours for files vs directories on fsync, and then applying those caveats to symlinks. It probably wasn't that clear I was mainly trying to point out that symlinks are not files, so they have different ordering requirements. i.e. that you have to look at ordering requirements of the filesystems, not the fsync() specification, to determine what the fsync behaviour is supposed to be. My second email clarifies the ordering behaviour that is expected with symlinks and the reason why you'll see different behaviour to files w.r.t. fsync and parent directories. > ext4 and xfs have this behavior, which Amir argues is an > implementation side-effect, and not intended. > > >> >>> 1. symlink (foo, bar.tmp) > >> >>> 2. open bar.tmp > >> >>> 3. fsync bar.tmp > >> >>> 4. rename(bar.tmp, bar) > >> >>> 5. fsync bar > >> >>> crash here > > The second workload that Amir constructed just moves the symlink > creation into a different transaction. In both workloads, we are > creating or renaming new symlinks and calling fsync on them. In both > cases we are not explicitly calling fsync on the parent directory. Yes, I decided not to write all this "symlink behaviour is dependent on initial conditions" stuff because, AFAIC, it is a pretty obvious conclusion to draw from the ordering dependencies I described between the symlink and the object it points at. Script that demonstrates this is simple: $ cat t.sh #!/bin/bash dev=/dev/vdb mnt=/mnt/scratch test_file=$mnt/foo # 1. symlink (foo, bar.tmp) # 2. 
open bar.tmp # 3. fsync bar.tmp # 4. rename(bar.tmp, bar) # 5. fsync bar umount $mnt mount $dev $mnt cd $mnt rm -f foo bar.tmp bar sync # Don't fsync creation of foo, will see foo and bar.tmp after shutdown touch foo ln -s foo bar.tmp xfs_io -c fsync bar.tmp mv bar.tmp bar xfs_io -c fsync bar xfs_io -xc "shutdown" $mnt cd ~ umount $mnt mount $dev $mnt cd $mnt ls -l $mnt rm -f foo bar.tmp bar sync # don't fsync foo or bar.tmp, will see foo and bar after shutdown touch foo xfs_io -c fsync foo touch foo ln -s foo bar.tmp mv bar.tmp bar xfs_io -c fsync bar xfs_io -xc "shutdown" $mnt cd ~ umount $mnt mount $dev $mnt cd $mnt ls -l $mnt rm -f foo bar.tmp bar sync # fsync creation of foo, will see only foo after shutdown touch foo xfs_io -c fsync foo ln -s foo bar.tmp xfs_io -c fsync bar.tmp mv bar.tmp bar xfs_io -c fsync bar xfs_io -xc "shutdown" $mnt cd ~ umount $mnt mount $dev $mnt cd $mnt ls -l $mnt $ And the output is: $ sudo umount /mnt/scratch ; sudo mount /dev/vdb /mnt/scratch ; sudo ./t.sh ; total 0 lrwxrwxrwx. 1 root root 3 Apr 14 09:52 bar.tmp -> foo -rw-r--r--. 1 root root 0 Apr 14 09:52 foo total 0 lrwxrwxrwx. 1 root root 3 Apr 14 09:52 bar -> foo -rw-r--r--. 1 root root 0 Apr 14 09:52 foo total 0 -rw-r--r--. 1 root root 0 Apr 14 09:52 foo $ i.e. it depends on the state of the original file as to what is captured by the fsync of that file through the symlink. i.e. symlinks has no ordering dependency with the object resolved from the path in the symlink. > Note that we are not saying if we call fsync on symlink file, it > should call fsync on the original file. We agree that should not be > done as the symlink file and the original link are two distinct > entities. "symlink file" - there's no such thing. It's either a symlink or a regular file and it cant be both. And, well, you can't fsync a symlink *inode*, anyway, because you can't open it directly for IO operations. 
> I believe in most journaling/copy-on-write file systems today, if you > call fsync on a new file, the fsync will persist the directory entry > of the new file in the parent directory (even though POSIX doesn't > really require this). Yes, that's the strict ordering dependency thing I talked about, and it was something that btrfs got wrong for an awful long time. > It seems reasonable to extend this persistence > courtesy to symlinks (considering them just as normal files). And no, that's not reasonable, because symlinks only contain a path instead of a direct reference to any filesystem object. i.e. it's an indirect reference, and that can be clearly seen by the fact that symlinks are created and removed without referencing the object they point to or caring wheth
Re: Symlink not persisted even after fsync
On Fri, Apr 13, 2018 at 09:39:27AM -0500, Jayashree Mohan wrote: > Hey Dave, > > Thanks for clarifying the crash recovery semantics of strictly > metadata ordered filesystems. We had a follow-up question in this > case. > > On Fri, Apr 13, 2018 at 8:16 AM, Amir Goldstein <amir7...@gmail.com> wrote: > > On Fri, Apr 13, 2018 at 3:54 PM, Vijay Chidambaram <vi...@cs.utexas.edu> > > wrote: > >> Hi Amir, > >> > >> Thanks for the reply! > >> > >> On Fri, Apr 13, 2018 at 12:52 AM, Amir Goldstein <amir7...@gmail.com> > >> wrote: > >>> > >>> Not a bug. > >>> > >>> From man 2 fsync: > >>> > >>> "Calling fsync() does not necessarily ensure that the entry in the > >>> directory containing the file has also reached disk. For that an > >>> explicit fsync() on a file descriptor for the directory is also needed." > >> > >> > >> Are we understanding this right: > >> > >> ext4 and xfs fsync the parent directory if a sym link file is fsync-ed. But > >> btrfs does not. Is this what we are seeing? > > > > Nope. > > > > You are seeing an unintentional fsync of parent, because both > > parent update and symlink update are metadata updates that are > > tracked by the same transaction. > > > > fsync of symlink forces the current transaction to the journal, > > pulling in the parent update with it. > > > > > >> > >> I agree that fsync of a file does not mean fsync of its directory entry, > >> but > >> it seems odd to do it for regular files and not for sym links. We do not > >> see > >> this behavior if we use a regular file instead of a sym link file. > >> > > > > fsync of regular file behaves differently than fsync of non regular file. > > I suggest this read: > > https://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/ > > > >>> > >>> There is a reason why this behavior is not being reproduces in > >>> ext4/xfs, but you should be able to reproduce a similar issue > >>> like this: > >>> > >>> > >>> 1. symlink (foo, bar.tmp) > >>> 2. open bar.tmp > >>> 3. 
fsync bar.tmp > >>> 4. rename(bar.tmp, bar) > >>> 5. fsync bar > >>> crash here > >> > > Going by your argument that all previous transactions that referenced > the file being fsync-ed needs to be committed, should we expect xfs > (and ext4) to persist file bar in this case? No, that's not what I'm implying. I'm implying that there are specific ordering dependencies that govern this behaviour, and assuming that what the fsync man page says about files applies to symlinks is not a valid assumption because files and symlinks are not equivalent objects. In these cases, you first have to ask "what are we actually running fsync on?" The fsync is being performed on the inode the symlink points to, not the symlink. You can't directly open a symlink to fsync the symlink. Then you have to ask "what is the dependency chain between the parent directory, the symlink and the file it points to?" The short answer is that symlinks have no direct relationship to the object they point to. i.e. symlinks contain a path, not a reference to a specific filesystem object. IOWs, symlinks are really a directory construct, not a file. However, there is no ordering dependency between a symlink and what it points to. Symlinks contain a path which needs to be resolved to find out what it points to, and that may not even exist. Files have no reference to symlinks that point at them, so there's no way we can create an ordering dependency between file updates and any symlink that points to them. Directories, OTOH, contain a pointer to a reference counted object (an inode) in their dirents. Hence if you add/remove directory dirents that point to an inode, you also have to modify the inode link counts as it records how many directory entries point at it. That's a bi-directional atomic modification ordering dependency between directories and inodes they point at. 
So when we look at symlinks, the parent directory has an ordering dependency with the symlink inode, not whatever is found by resolving the path in the symlink data. IOWs, there is no ordering relationship between the symlink's parent directory and whatever the symlink points at. i.e. it's a one-way relationship, and so there is no reverse ordering dependency that requires fsync() on the file to force synchronisation of a symlink it knows nothing about. i.e. the ordering dependency that exists with symlinks is between the symlink and its parent directory, not whatever the symlink points to. Hence fsyncing whatever the symlink points to does not guarantee that the symlink is made stable, because the symlink is not part of the dependency chain of the object being fsync()d. Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
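The distinction Dave draws above — fsync through a symlink path acts on the target's inode, never on the symlink itself — is visible at the syscall level. Below is a minimal userspace sketch of the reported workload (function name and paths are illustrative, not from the thread); a real crash-consistency test would cut power, or replay a dm-log-writes log, right after the fsync:

```c
#include <fcntl.h>
#include <unistd.h>

/* The syscall sequence from the report.  The fd opened via the symlink
 * refers to the *target* inode ("foo"), never the symlink itself, so
 * the fsync cannot order the symlink's own directory entry to disk.
 * This sketch only demonstrates the sequence; crash injection has to
 * happen externally. */
int fsync_through_symlink(const char *target, const char *linkpath)
{
    int fd = open(target, O_WRONLY | O_CREAT, 0644); /* ensure "foo" exists */
    if (fd < 0)
        return -1;
    close(fd);

    unlink(linkpath);                 /* make the sketch re-runnable */
    if (symlink(target, linkpath))    /* 1. symlink(foo, bar) */
        return -1;

    fd = open(linkpath, O_WRONLY);    /* 2. open bar - resolves to foo */
    if (fd < 0)
        return -1;

    int ret = fsync(fd);              /* 3. fsync bar (fsyncs foo's inode) */
    close(fd);
    return ret;
}
```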
Re: Symlink not persisted even after fsync
On Fri, Apr 13, 2018 at 08:52:19AM +0300, Amir Goldstein wrote: > On Thu, Apr 12, 2018 at 8:51 PM, Jayashree Mohan > <jayashree2...@gmail.com> wrote: > > Hi, > > > > We came across what seems to be a crash consistency bug on btrfs and > > f2fs. When we create a symlink for a file and fsync the symlink, on > > recovery from crash, the fsync-ed file is missing. > > > > You can reproduce this behaviour using this minimal workload : > > > > 1. symlink (foo, bar) > > 2. open bar > > 3. fsync bar > > crash here > > > > When we recover, we find that file bar is missing. This behaviour > > seems unexpected as the fsynced file is lost on a crash. ext4 and xfs > > correctly recovers file bar. This seems like a bug. If not, could you > > explain why? > > > > Not a bug. Actually, for a filesystem with strictly ordered metadata recovery semantics, it is a bug. > From man 2 fsync: > > "Calling fsync() does not necessarily ensure that the entry in the > directory containing the file has also reached disk. For that an > explicit fsync() on a file descriptor for the directory is also needed." We've been through this before, many times. This caveat does not apply to strictly ordered metadata filesystems. If you fsync a file on an ordered metadata filesystem, then all previous transactions that are needed to reference the file are also committed. The behaviour from ext4 and XFS is correct for strictly ordered filesystems. This is not a "fsync requirement", nor is it a general Linux filesystem requirement. It is a requirement of the desired filesystem crash recovery mechanisms. BTRFS is advertised as having strictly ordered metadata recovery semantics, so it should behave the same way as ext4 and XFS in tests like these. If it doesn't, then there's filesystem bugs that need fixing... Cheers, Dave. 
-- Dave Chinner da...@fromorbit.com
Re: [PATCH v2 2/2] common/rc: raise mixed mode threshold to 1GB
On Wed, Apr 11, 2018 at 10:07:29PM -0700, Omar Sandoval wrote: > From: Omar Sandoval <osan...@fb.com> > > generic/427 creates a 256 MB filesystem and then writes a 200 MB file, > which fails on Btrfs if mixed mode is not enabled. Raise the threshold > to 1GB, which is where we typically recommend mixed mode. > > Signed-off-by: Omar Sandoval <osan...@fb.com> > --- > common/rc | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/common/rc b/common/rc > index 5dbb8fe5..dc0062de 100644 > --- a/common/rc > +++ b/common/rc > @@ -999,7 +999,7 @@ _scratch_mkfs_sized() > ;; > btrfs) > local mixed_opt= > - (( fssize <= 100 * 1024 * 1024 )) && mixed_opt='--mixed' > + (( fssize <= 1024 * 1024 * 1024 )) && mixed_opt='--mixed' > $MKFS_BTRFS_PROG $MKFS_OPTIONS $mixed_opt -b $fssize $SCRATCH_DEV > ;; > jfs) Makes sense. Reviewed-by: Dave Chinner <dchin...@redhat.com> Cheers, Dave. -- Dave Chinner da...@fromorbit.com
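For clarity, the logic of the updated hunk can be rendered as a tiny helper (a C sketch of the shell fragment above; the function name is made up — the real logic lives in `_scratch_mkfs_sized()` in common/rc):

```c
#include <stdint.h>

/* btrfs gets --mixed when the requested filesystem size is at or below
 * the threshold the patch raises from 100MB to 1GB. */
const char *btrfs_mixed_opt(uint64_t fssize)
{
    const uint64_t threshold = 1024ULL * 1024 * 1024; /* 1 GiB */
    return fssize <= threshold ? "--mixed" : "";
}
```

With this threshold, the 256 MB filesystem generic/427 makes falls under the limit and is created in mixed mode automatically.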
Re: [PATCH 2/2] generic/427: used mixed mode for Btrfs
On Wed, Apr 11, 2018 at 04:47:30PM -0700, Omar Sandoval wrote: > From: Omar Sandoval <osan...@fb.com> > > This test creates a 256 MB filesystem and then writes a 200 MB file. > With separate data and metadata, Btrfs will run out of data space since > it needs to allocate some metadata space. Use mixed mode, which is the > recommendation for smaller filesystems. > > Signed-off-by: Omar Sandoval <osan...@fb.com> > --- > tests/generic/427 | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/tests/generic/427 b/tests/generic/427 > index 9cde5f50..b2cb4526 100755 > --- a/tests/generic/427 > +++ b/tests/generic/427 > @@ -55,6 +55,8 @@ _require_test_program "feature" > _require_aiodio aio-dio-eof-race > > # limit the filesystem size, to save the time of filling filesystem > +# Btrfs needs to use mixed mode for such a small filesystem > +[ "$FSTYP" = "btrfs" ] && MKFS_OPTIONS="$MKFS_OPTIONS -M" _scratch_mkfs_sized() should already be adding this for small btrfs filesystems. Yup, it does: btrfs) local mixed_opt= (( fssize <= 100 * 1024 * 1024 )) && mixed_opt='--mixed' $MKFS_BTRFS_PROG $MKFS_OPTIONS $mixed_opt -b $fssize $SCRATCH_DEV ;; > _scratch_mkfs_sized $((256 * 1024 * 1024)) >>$seqres.full 2>&1 But this uses a filesystem larger than the mixed mode threshold in _scratch_mkfs_sized(). Please update the generic threshold rather than special-casing this test. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH] fstests: generic test for fsync after fallocate
On Mon, Apr 09, 2018 at 11:00:52AM +0100, Filipe Manana wrote: > On Mon, Apr 9, 2018 at 10:51 AM, Dave Chinner <da...@fromorbit.com> wrote: > > On Sun, Apr 08, 2018 at 10:07:54AM +0800, Eryu Guan wrote: > >> On Thu, Apr 05, 2018 at 10:56:14PM +0100, fdman...@kernel.org wrote: > > You also cannot assume that two separate preallocations beyond EOF > > are going to be contiguous (i.e. it could be two separate extents). > > I really don't care about that. > Just want to check that allocated space is not lost, don't care if > that space is covered by 1, 2 or more extents, or whether they are > contiguous or not. Sure, but that's not what the test does :/ > > What you should just be checking is that there are extents allocated > > covering EOF to 3MB, not the exact size, shape and type of extents > > allocated. > > How to do such a check, for a generic test, without using fiemap? We already do such checks using fiemap, yes? i.e. there's a bunch of _filter*fiemap() functions in common/punch which should show you exactly how to do this, i.e. filter all the unwritten/data extents to be the same name, then they coalesce into single ranges. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
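The coalescing Dave describes — rename data and unwritten extents to one generic type, then merge adjacent ranges — can be sketched as follows (an illustrative model, not the actual `_filter*fiemap()` shell code; the struct is not the FIEMAP ABI):

```c
#include <stddef.h>

/* Ranges are [start, end) in blocks and assumed sorted by start; once
 * extent types are unified, adjacent ranges merge and the test can
 * check "EOF to 3MB is allocated" without depending on the exact
 * extent layout. */
struct ext_range { unsigned long long start, end; };

/* Coalesce touching or overlapping ranges in place; returns new count. */
size_t coalesce_extents(struct ext_range *e, size_t n)
{
    if (n == 0)
        return 0;
    size_t out = 0;
    for (size_t i = 1; i < n; i++) {
        if (e[i].start <= e[out].end) {     /* touches/overlaps previous */
            if (e[i].end > e[out].end)
                e[out].end = e[i].end;
        } else {
            e[++out] = e[i];                /* gap: start a new range */
        }
    }
    return out + 1;
}
```

Applied to the fiemap output quoted in the thread, the data extent [0..511] and the unwritten extent [512..2559] collapse into one allocated range, which is exactly what makes the golden output layout-independent.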
Re: [PATCH] fstests: generic test for fsync after fallocate
On Sun, Apr 08, 2018 at 10:07:54AM +0800, Eryu Guan wrote: > On Thu, Apr 05, 2018 at 10:56:14PM +0100, fdman...@kernel.org wrote: > > From: Filipe Manana <fdman...@suse.com> > > > > Test that fsync operations preserve extents allocated with fallocate(2) > > that are placed beyond a file's size. > > > > This test is motivated by a bug found in btrfs where unwritten extents > > beyond the inode's i_size were not preserved after a fsync and power > > failure. The btrfs bug is fixed by the following patch for the linux > > kernel: > > > > "Btrfs: fix loss of prealloc extents past i_size after fsync log replay" > > > > Signed-off-by: Filipe Manana <fdman...@suse.com> > > Hmm, xfs fails this test, while ext4 passes. > > # diff -u tests/generic/483.out > /root/workspace/xfstests/results//xfs_4k_crc/generic/483.out.bad > --- tests/generic/483.out 2018-04-07 23:35:00.55511 +0800 > +++ /root/workspace/xfstests/results//xfs_4k_crc/generic/483.out.bad > 2018-04-07 23:39:48.780659707 +0800 > @@ -6,5 +6,5 @@ > 0: [0..511]: data > 1: [512..2559]: unwritten > File baz fiemap: > -0: [0..511]: data > -1: [512..6143]: unwritten > +0: [0..895]: data > +1: [896..6143]: unwritten Perfectly valid result from the test. > > +# A file which already has a prealloc extent beyond its size. > > +# The fsync done on it is motivated by differences in the btrfs > > implementation > > +# of fsync (first fsync has different logic from subsequent fsyncs). > > +$XFS_IO_PROG -f -c "pwrite -S 0xf1 0 256K" \ > > +-c "falloc -k 256K 768K" \ > > +-c "fsync" \ > > +$SCRATCH_MNT/baz >/dev/null Delayed allocation extending the file to 256k means we'll have speculative prealloc of data beyond 256k - it should be somewhere around 448k IIRC. Now, falloc doesn't guarantee unwritten extents are allocated - it just guarantees space is allocated. unwritten extents are an optimisation to avoid needing to zero the extents pre-allocated within EOF. i.e. 
we can quite safely allocate data extents beyond EOF rather than unwritten extents, and this is not a bug. In this case, the delalloc extent is completely allocated as written data, but nothing is written beyond EOF @ 256k. Hence it reports as a data extent, not an unwritten extent, and this is a perfectly valid thing to do. > > +# Allocate another extent beyond the size of file baz. > > +$XFS_IO_PROG -c "falloc -k 1M 2M"\ > > +-c "fsync" \ > > +$SCRATCH_MNT/baz You also cannot assume that two separate preallocations beyond EOF are going to be contiguous (i.e. it could be two separate extents). What you should just be checking is that there are extents allocated covering EOF to 3MB, not the exact size, shape and type of extents allocated. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH v2 3/3] fstests: generic: Check the fs after each FUA writes
On Wed, Mar 28, 2018 at 12:40:23PM +0800, Qu Wenruo wrote: > Basic test case which triggers fsstress with dm-log-writes, and then > check the fs after each FUA writes. > With needed infrastructure and special handlers for journal based fs. > > Signed-off-by: Qu Wenruo <w...@suse.com> . > For xfs: > _check_xfs_filesystem: filesystem on /dev/mapper/test-scratch1 is > inconsistent (r) > *** xfs_repair -n output *** > Phase 1 - find and verify superblock... > Phase 2 - using internal log > - zero log... > - scan filesystem freespace and inode maps... > - found root inode chunk > Phase 3 - for each AG... > - scan (but don't clear) agi unlinked lists... > - process known inodes and perform inode discovery... > - agno = 0 > - agno = 1 > - agno = 2 > bad symlink header ino 8409190, file block 0, disk block 1051147 > problem with symbolic link in inode 8409190 > would have cleared inode 8409190 > - agno = 3 > - process newly discovered inodes... > Phase 4 - check for duplicate blocks... > - setting up duplicate extent list... > - check for inodes claiming duplicate blocks... > - agno = 0 > - agno = 1 > - agno = 3 > - agno = 2 > entry "lb" in shortform directory 8409188 references free inode 8409190 > would have junked entry "lb" in directory inode 8409188 > bad symlink header ino 8409190, file block 0, disk block 1051147 > problem with symbolic link in inode 8409190 > would have cleared inode 8409190 > No modify flag set, skipping phase 5 > Phase 6 - check inode connectivity... > - traversing filesystem ... > entry "lb" in shortform directory inode 8409188 points to free inode 8409190 > would junk entry > - traversal finished ... > - moving disconnected inodes to lost+found ... > Phase 7 - verify link counts... > Maximum metadata LSN (1:396) is ahead of log (1:63). That warning implies a write ordering problem - there's a write with an LSN on disk that does not yet exist in the log. i.e. 
the last FUA write to the log had 1:63 in it, yet there's metadata on disk that could only be *issued* after a REQ_FLUSH|REQ_FUA log write with 1:396 in it was *completed*. If we've only replayed up to the FUA write with 1:63 in it, then no metadata writes should have been *issued* with 1:396 in it, as the LSN that is stamped into metadata is only updated on log IO completion. On first glance, this implies a bug in the underlying device write replay code. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
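A sketch of the comparison behind the "Maximum metadata LSN (1:396) is ahead of log (1:63)" warning: XFS log sequence numbers are cycle:block pairs, and xfs_repair is complaining that metadata carries an LSN that compares greater than the head of the replayed log. The packing below (cycle in the high 32 bits) follows how `xfs_lsn_t` is laid out, but the helper names are illustrative:

```c
#include <stdint.h>

/* Pack cycle:block so that a plain integer comparison orders LSNs. */
typedef uint64_t lsn_t;

lsn_t lsn_make(uint32_t cycle, uint32_t block)
{
    return ((lsn_t)cycle << 32) | block;
}

/* returns <0, 0, >0 like memcmp: cycle compares first, then block */
int lsn_cmp(lsn_t a, lsn_t b)
{
    return (a > b) - (a < b);
}
```

In these terms, the repair output says `lsn_cmp(metadata_lsn, log_head_lsn) > 0` for some metadata block, which should be impossible if the FUA-ordered replay was faithful.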
btrfs send | receive ERROR rename file exists
Can anyone give me any ideas why this error would happen? The receive directory started empty. Snapshot 3 exists at both source and target. # mkdir /.snapshots/bw538/ # btrfs send -p /mnt/backup/root/laptop/3/snapshot/ /mnt/backup/root/laptop/4/snapshot/ | btrfs receive /.snapshots/bw538/ At subvol /mnt/backup/root/laptop/4/snapshot/ At snapshot snapshot ERROR: rename o439638-244-0 -> var/cache/man/pl/index.db failed: File exists
how to make a cache directory nodatacow while also excluded from snapshots?
I want to exclude my ~/.cache directory from snapshots. The obvious way to do this is to mount a btrfs subvolume at that location. However, I also want the ~/.cache directory to be nodatacow. Since the parent volume is COW, I believe it isn't possible to mount the subvolume with different mount options. What's the solution for achieving both of these goals? I tried this without success: chattr +C ~/.cache Since ~/.cache is a btrfs subvolume, apparently that doesn't work. lsattr ~/.cache returns nothing.
Re: [PATCH v5 19/19] fs: handle inode->i_version more efficiently
On Tue, Jan 09, 2018 at 09:10:59AM -0500, Jeff Layton wrote: > From: Jeff Layton <jlay...@redhat.com> > > Since i_version is mostly treated as an opaque value, we can exploit that > fact to avoid incrementing it when no one is watching. With that change, > we can avoid incrementing the counter on writes, unless someone has > queried for it since it was last incremented. If the a/c/mtime don't > change, and the i_version hasn't changed, then there's no need to dirty > the inode metadata on a write. > > Convert the i_version counter to an atomic64_t, and use the lowest order > bit to hold a flag that will tell whether anyone has queried the value > since it was last incremented. > > When we go to maybe increment it, we fetch the value and check the flag > bit. If it's clear then we don't need to do anything if the update > isn't being forced. > > If we do need to update, then we increment the counter by 2, and clear > the flag bit, and then use a CAS op to swap it into place. If that > works, we return true. If it doesn't then do it again with the value > that we fetch from the CAS operation. > > On the query side, if the flag is already set, then we just shift the > value down by 1 bit and return it. Otherwise, we set the flag in our > on-stack value and again use cmpxchg to swap it into place if it hasn't > changed. If it has, then we use the value from the cmpxchg as the new > "old" value and try again. > > This method allows us to avoid incrementing the counter on writes (and > dirtying the metadata) under typical workloads. We only need to increment > if it has been queried since it was last changed. > > Signed-off-by: Jeff Layton <jlay...@redhat.com> > Reviewed-by: Jan Kara <j...@suse.cz> Documentation helps a lot in understanding all this. Thanks for adding it into the patch! 
Acked-by: Dave Chinner <dchin...@redhat.com> -- Dave Chinner da...@fromorbit.com
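The scheme described in the patch — the low bit of the atomic64 flags "queried since last bump", and updates increment by 2 and clear the flag via compare-and-swap — can be modelled in userspace like this (a sketch of the algorithm only, not the kernel implementation):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define I_VERSION_QUERIED 1ULL   /* low bit: "queried since last bump" */

/* The counter lives in the upper 63 bits; the low bit records whether
 * anyone has queried the value since it was last incremented. */
bool maybe_inc_iversion(_Atomic uint64_t *v, bool force)
{
    uint64_t cur = atomic_load(v);
    for (;;) {
        /* flag clear and not forced: skip the increment (and the
         * inode-dirtying it would cause on a write) */
        if (!force && !(cur & I_VERSION_QUERIED))
            return false;
        /* add 2 (counter is in the upper bits) and clear the flag */
        uint64_t newv = (cur + 2) & ~I_VERSION_QUERIED;
        if (atomic_compare_exchange_weak(v, &cur, newv))
            return true;
        /* raced: cur now holds the freshly fetched value; retry */
    }
}

uint64_t query_iversion(_Atomic uint64_t *v)
{
    uint64_t cur = atomic_load(v);
    for (;;) {
        if (cur & I_VERSION_QUERIED)   /* already flagged: just report */
            return cur >> 1;
        uint64_t newv = cur | I_VERSION_QUERIED;   /* mark as queried */
        if (atomic_compare_exchange_weak(v, &cur, newv))
            return newv >> 1;
    }
}
```

The payoff described in the changelog falls out directly: a stream of writes with no intervening queries leaves the flag clear, so `maybe_inc_iversion(..., false)` keeps returning false and the inode metadata never needs dirtying for i_version alone.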
Re: [PATCH v5 17/19] xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need incrementing
On Tue, Jan 09, 2018 at 09:10:57AM -0500, Jeff Layton wrote: > From: Jeff Layton <jlay...@redhat.com> > > If XFS_ILOG_CORE is already set then go ahead and increment it. > > Signed-off-by: Jeff Layton <jlay...@redhat.com> > Acked-by: Darrick J. Wong <darrick.w...@oracle.com> Acked-by: Dave Chinner <dchin...@redhat.com> -- Dave Chinner da...@fromorbit.com
Re: [PATCH v5 14/19] xfs: convert to new i_version API
On Tue, Jan 09, 2018 at 09:10:54AM -0500, Jeff Layton wrote: > From: Jeff Layton <jlay...@redhat.com> > > Signed-off-by: Jeff Layton <jlay...@redhat.com> > Acked-by: Darrick J. Wong <darrick.w...@oracle.com> Looks ok, but I haven't tested it at all. Acked-by: Dave Chinner <dchin...@redhat.com> -- Dave Chinner da...@fromorbit.com
Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata
On Wed, Jan 03, 2018 at 02:59:21PM +0100, Jan Kara wrote: > On Wed 03-01-18 13:32:19, Dave Chinner wrote: > > I think we could probably block ->write_metadata if necessary via a > > completion/wakeup style notification when a specific LSN is reached > > by the log tail, but realistically if there's any amount of data > > needing to be written it'll throttle data writes because the IO > > pipeline is being kept full by background metadata writes > > So the problem I'm concerned about is a corner case. Consider a situation > when you have no dirty data, only dirty metadata but enough of them to > trigger background writeback. How should metadata writeback behave for XFS > in this case? Who should be responsible that wb_writeback() just does not > loop invoking ->write_metadata() as fast as CPU allows until xfsaild makes > enough progress? > > Thinking about this today, I think this looping prevention belongs to > wb_writeback(). Well, background data writeback can block in two ways. One is during IO submission when the request queue is full, the other is when all dirty inodes have had some work done on them and have all been moved to b_more_io - wb_writeback waits for the __I_SYNC bit to be cleared on the last(?) inode on that list, hence backing off before submitting more IO. IOWs, there's a "during writeback" blocking mechanism as well as a "between cycles" block mechanism. > Sadly we don't have much info to decide how long to sleep > before trying more writeback so we'd have to just sleep for > if we found no writeback happened in the last writeback > round before going through the whole writeback loop again. Right - I don't think we can provide a generic "between cycles" blocking mechanism for XFS, but I'm pretty sure we can emulate a "during writeback" blocking mechanism to avoid busy looping inside the XFS code. e.g. 
if we get a writeback call that asks for 5% to be written, and we already have a metadata writeback target of 5% in place, that means we should block for a while. That would emulate request queue blocking and prevent busy looping in this case. > And > ->write_metadata() for XFS would need to always return 0 (as in "no progress > made") to make sure this busyloop avoidance logic in wb_writeback() > triggers. ext4 and btrfs would return number of bytes written from > ->write_metadata (or just 1 would be enough to indicate some progress in > metadata writeback was made and busyloop avoidance is not needed). Well, if we block for a little while, we can indicate that progress has been made and this whole mess would go away, right? Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata
On Tue, Jan 02, 2018 at 11:13:06AM -0500, Josef Bacik wrote: > On Wed, Dec 20, 2017 at 03:30:55PM +0100, Jan Kara wrote: > > On Wed 20-12-17 08:35:05, Dave Chinner wrote: > > > On Tue, Dec 19, 2017 at 01:07:09PM +0100, Jan Kara wrote: > > > > On Wed 13-12-17 09:20:04, Dave Chinner wrote: > > > > > IOWs, treating metadata like it's one great big data inode doesn't > > > > > seem to me to be the right abstraction to use for this - in most > > > > > fileystems it's a bunch of objects with a complex dependency tree > > > > > and unknown write ordering, not an inode full of data that can be > > > > > sequentially written. > > > > > > > > > > Maybe we need multiple ops with well defined behaviours. e.g. > > > > > ->writeback_metadata() for background writeback, ->sync_metadata() for > > > > > sync based operations. That way different filesystems can ignore the > > > > > parts they don't need simply by not implementing those operations, > > > > > and the writeback code doesn't need to try to cater for all > > > > > operations through the one op. The writeback code should be cleaner, > > > > > the filesystem code should be cleaner, and we can tailor the work > > > > > guidelines for each operation separately so there's less mismatch > > > > > between what writeback is asking and how filesystems track dirty > > > > > metadata... > > > > > > > > I agree that writeback for memory cleaning and writeback for data > > > > integrity > > > > are two very different things especially for metadata. In fact for data > > > > integrity writeback we already have ->sync_fs operation so there the > > > > functionality gets duplicated. What we could do is that in > > > > writeback_sb_inodes() we'd call ->write_metadata only when > > > > work->for_kupdate or work->for_background is set. That way > > > > ->write_metadata > > > > would be called only for memory cleaning purposes. 
> > > > > > That makes sense, but I still think we need a better indication of > > > how much writeback we need to do than just "writeback this chunk of > > > pages". That "writeback a chunk" interface is necessary to share > > > writeback bandwidth across numerous data inodes so that we don't > > > starve any one inode of writeback bandwidth. That's unnecessary for > > > metadata writeback on a superblock - we don't need to share that > > > bandwidth around hundreds or thousands of inodes. What we actually > > > need to know is how much writeback we need to do as a total of all > > > the dirty metadata on the superblock. > > > > > > Sure, that's not ideal for btrfs and maybe ext4, but we can write a > > > simple generic helper that converts "flush X percent of dirty > > > metadata" to a page/byte chunk as the current code does. Doing it > > > this way allows filesystems to completely internalise the accounting > > > that needs to be done, rather than trying to hack around a > > > writeback accounting interface with large impedance mismatches to > > > how the filesystem accounts for dirty metadata and/or tracks > > > writeback progress. > > > > Let me think loud on how we could tie this into how memory cleaning > > writeback currently works - the one with for_background == 1 which is > > generally used to get amount of dirty pages in the system under control. > > We have a queue of inodes to write, we iterate over this queue and ask each > > inode to write some amount (e.g. 64 M - exact amount depends on measured It's a maximum of 1024 pages per inode. > > writeback bandwidth etc.). Some amount from that inode gets written and we > > continue with the next inode in the queue (put this one at the end of the > > queue if it still has dirty pages). We do this until: > > > > a) the number of dirty pages in the system is below the background dirty limit > > and the number of dirty pages for this device is below the background dirty > > limit for this device. 
> > b) run out of dirty inodes on this device > > c) someone queues different type of writeback > > > > And we need to somehow incorporate metadata writeback into this loop. I see > > two questions here: > > > > 1) When / how often should we ask for metadata writeback? > > 2) How much to ask to write in one go? > > > > The second question is especially tricky in t
Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata
On Tue, Dec 19, 2017 at 01:07:09PM +0100, Jan Kara wrote: > On Wed 13-12-17 09:20:04, Dave Chinner wrote: > > On Tue, Dec 12, 2017 at 01:05:35PM -0500, Josef Bacik wrote: > > > On Tue, Dec 12, 2017 at 10:36:19AM +1100, Dave Chinner wrote: > > > > On Mon, Dec 11, 2017 at 04:55:31PM -0500, Josef Bacik wrote: > > > This is just one of those things that's going to be slightly shitty. > > > It's the > > > same for memory reclaim, all of those places use pages so we just take > > > METADATA_*_BYTES >> PAGE_SHIFT to get pages and figure it's close enough. > > > > Ok, so that isn't exactly easy to deal with, because all our > > metadata writeback is based on log sequence number targets (i.e. how > > far to push the tail of the log towards the current head). We've > > actually got no idea how pages/bytes actually map to a LSN target > > because while we might account a full buffer as dirty for memory > > reclaim purposes (up to 64k in size), we might have only logged 128 > > bytes of it. > > > > i.e. if we are asked to push 2MB of metadata and we treat that as > > 2MB of log space (i.e. push target of tail LSN + 2MB) we could have > > logged several tens of megabytes of dirty metadata in that LSN > > range and have to flush it all. OTOH, if the buffers are fully > > logged, then that same target might only flush 1.5MB of metadata > > once all the log overhead is taken into account. > > > > So there's a fairly large disconnect between the "flush N bytes of > > metadata" API and the "push to a target LSN" that XFS uses for > > flushing metadata in aged order. I'm betting that extN and other > > filesystems might have similar mismatches with their journal > > flushing... > > Well, for ext4 it isn't as bad since we do full block logging only. So if > we are asked to flush N pages, we can easily translate that to number of fs > blocks and flush that many from the oldest transaction. 
> > Couldn't XFS just track how much it has cleaned (from reclaim perspective) > when pushing items from AIL (which is what I suppose XFS would do in > response to metadata writeback request) and just stop pushing when it has > cleaned as much as it was asked to? If only it were that simple :/ To start with, flushing the dirty objects (such as inodes) to their backing buffers does not mean the object is clean once the writeback completes. XFS has decoupled in-memory objects with logical object logging rather than logging physical buffers, and so can be modified and dirtied while the inode buffer is being written back. Hence if we just count things like "buffer size written" it's not actually a correct account of the amount of dirty metadata we've cleaned. If we don't get that right, it'll result in accounting errors and incorrect behaviour. The bigger problem, however, is that we have no channel to return flush information from the AIL pushing to whatever caller asked for the push. Pushing metadata is completely decoupled from every other subsystem. i.e. the caller asked the xfsaild to push to a specific LSN (e.g. to free up a certain amount of log space for new transactions), and *nothing* has any idea of how much metadata we'll need to write to push the tail of the log to that LSN. It's also completely asynchronous - there's no mechanism for waiting on a push to a specific LSN. Anything that needs a specific amount of log space to be available waits in ordered ticket queues on the log tail moving forwards. The only interface that has access to the log tail ticket waiting is the transaction reservation subsystem, which cannot be used during metadata writeback because that's a guaranteed deadlock vector. Saying "just account for bytes written" assumes directly connected, synchronous dispatch metadata writeback infrastructure which we simply don't have in XFS. 
"just clean this many bytes" doesn't really fit at all because we have no way of referencing that to the distance we need to push the tail of the log. An interface that tells us "clean this percentage of dirty metadata" is much more useful because we can map that easily to a log sequence number based push target. > > IOWs, treating metadata like it's one great big data inode doesn't > > seem to me to be the right abstraction to use for this - in most > > filesystems it's a bunch of objects with a complex dependency tree > > and unknown write ordering, not an inode full of data that can be > > sequentially written. > > > > Maybe we need multiple ops with well defined behaviours. e.g. > > ->writeback_metadata() for background writeback, ->sync_metadata() for > > sync based operations. That way different filesystems can igno
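The "simple generic helper" suggested earlier in this thread — converting "flush X percent of dirty metadata" into the page chunk the current writeback interface wants — could look roughly like this (name, signature, and page size are assumptions for illustration):

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL

/* A filesystem that tracks dirty metadata in bytes (btrfs, ext4) can
 * translate a percentage-based request into a page count; a filesystem
 * that tracks an LSN target (XFS) would instead map the percentage to
 * a log push target internally. */
uint64_t metadata_pct_to_pages(uint64_t dirty_bytes, unsigned int pct)
{
    uint64_t bytes = dirty_bytes * pct / 100;
    /* round up so any non-zero request flushes at least one page */
    return (bytes + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}
```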
Re: [PATCH v3 19/19] fs: handle inode->i_version more efficiently
On Mon, Dec 18, 2017 at 02:35:20PM -0500, Jeff Layton wrote: > [PATCH] SQUASH: add memory barriers around i_version accesses Why explicit memory barriers rather than annotating the operations with the required semantics and getting the barriers the arch requires automatically? I suspect this should be using atomic_read_acquire() and atomic_cmpxchg_release(), because AFAICT the atomic_cmpxchg needs to have release semantics to match the acquire semantics needed for the load of the current value. From include/linux/atomics.h: * For compound atomics performing both a load and a store, ACQUIRE * semantics apply only to the load and RELEASE semantics only to the * store portion of the operation. Note that a failed cmpxchg_acquire * does -not- imply any memory ordering constraints. Memory barriers hurt my brain. :/ At minimum, shouldn't the atomic op specific barriers be used rather than full memory barriers? i.e: > diff --git a/include/linux/iversion.h b/include/linux/iversion.h > index a9fbf99709df..39ec9aa9e08e 100644 > --- a/include/linux/iversion.h > +++ b/include/linux/iversion.h > @@ -87,6 +87,25 @@ static inline void > inode_set_iversion_raw(struct inode *inode, const u64 val) > { > atomic64_set(&inode->i_version, val); > + smp_wmb(); smp_mb__after_atomic(); . > +static inline u64 > +inode_peek_iversion_raw(const struct inode *inode) > +{ > + smp_rmb(); smp_mb__before_atomic(); > + return atomic64_read(&inode->i_version); > } And, of course, these will require comments explaining what they match with and what they are ordering. > @@ -152,7 +171,7 @@ inode_maybe_inc_iversion(struct inode *inode, bool force) > { > u64 cur, old, new; > > - cur = (u64)atomic64_read(&inode->i_version); > + cur = inode_peek_iversion_raw(inode); > for (;;) { > /* If flag is clear then we needn't do anything */ > if (!force && !(cur & I_VERSION_QUERIED)) cmpxchg in this loop needs a release barrier so everyone will see the change? 
> @@ -183,23 +202,6 @@ inode_inc_iversion(struct inode *inode) > inode_maybe_inc_iversion(inode, true); > } > > -/** > - * inode_peek_iversion_raw - grab a "raw" iversion value > - * @inode: inode from which i_version should be read > - * > - * Grab a "raw" inode->i_version value and return it. The i_version is not > - * flagged or converted in any way. This is mostly used to access a > self-managed > - * i_version. > - * > - * With those filesystems, we want to treat the i_version as an entirely > - * opaque value. > - */ > -static inline u64 > -inode_peek_iversion_raw(const struct inode *inode) > -{ > - return atomic64_read(&inode->i_version); > -} > - > /** > * inode_iversion_need_inc - is the i_version in need of being incremented? > * @inode: inode to check > @@ -248,7 +250,7 @@ inode_query_iversion(struct inode *inode) > { > u64 cur, old, new; > > - cur = atomic64_read(&inode->i_version); > + cur = inode_peek_iversion_raw(inode); > for (;;) { > /* If flag is already set, then no need to swap */ > if (cur & I_VERSION_QUERIED) cmpxchg in this loop needs a release barrier so everyone will see the change on load? Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 01/19] fs: new API for handling inode->i_version
On Sat, Dec 16, 2017 at 08:46:38AM -0500, Jeff Layton wrote: > From: Jeff Layton <jlay...@redhat.com> > > Add a documentation blob that explains what the i_version field is, how > it is expected to work, and how it is currently implemented by various > filesystems. > > We already have inode_inc_iversion. Add several other functions for > manipulating and accessing the i_version counter. For now, the > implementation is trivial and basically works the way that all of the > open-coded i_version accesses work today. > > Future patches will convert existing users of i_version to use the new > API, and then convert the backend implementation to do things more > efficiently. > > Signed-off-by: Jeff Layton <jlay...@redhat.com> > --- > include/linux/fs.h | 200 > ++--- > 1 file changed, 192 insertions(+), 8 deletions(-) Just a random Sunday morning coffee musing: I was just wondering if it would be better to split this stuff out into its own header file now? include/linux/fs.h is already a massive header file (~3500 lines) and changes cause tree-wide rebuilds, so maybe it would be better to split relatively isolated functionality like this out while it's being reworked and you're already touching every file that uses it? Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata
On Tue, Dec 12, 2017 at 01:05:35PM -0500, Josef Bacik wrote: > On Tue, Dec 12, 2017 at 10:36:19AM +1100, Dave Chinner wrote: > > On Mon, Dec 11, 2017 at 04:55:31PM -0500, Josef Bacik wrote: > > > From: Josef Bacik <jba...@fb.com> > > > > > > Now that we have metadata counters in the VM, we need to provide a way to > > > kick > > > writeback on dirty metadata. Introduce super_operations->write_metadata. > > > This > > > allows file systems to deal with writing back any dirty metadata we need > > > based > > > on the writeback needs of the system. Since there is no inode to key off > > > of we > > > need a list in the bdi for dirty super blocks to be added. From there we > > > can > > > find any dirty sb's on the bdi we are currently doing writeback on and > > > call into > > > their ->write_metadata callback. > > > > > > Signed-off-by: Josef Bacik <jba...@fb.com> > > > Reviewed-by: Jan Kara <j...@suse.cz> > > > Reviewed-by: Tejun Heo <t...@kernel.org> > > > --- > > > fs/fs-writeback.c| 72 > > > > > > fs/super.c | 6 > > > include/linux/backing-dev-defs.h | 2 ++ > > > include/linux/fs.h | 4 +++ > > > mm/backing-dev.c | 2 ++ > > > 5 files changed, 80 insertions(+), 6 deletions(-) > > > > > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c > > > index 987448ed7698..fba703dff678 100644 > > > --- a/fs/fs-writeback.c > > > +++ b/fs/fs-writeback.c > > > @@ -1479,6 +1479,31 @@ static long writeback_chunk_size(struct > > > bdi_writeback *wb, > > > return pages; > > > } > > > > > > +static long writeback_sb_metadata(struct super_block *sb, > > > + struct bdi_writeback *wb, > > > + struct wb_writeback_work *work) > > > +{ > > > + struct writeback_control wbc = { > > > + .sync_mode = work->sync_mode, > > > + .tagged_writepages = work->tagged_writepages, > > > + .for_kupdate= work->for_kupdate, > > > + .for_background = work->for_background, > > > + .for_sync = work->for_sync, > > > + .range_cyclic = work->range_cyclic, > > > + .range_start= 0, > > > + .range_end = LLONG_MAX, 
> > > + }; > > > + long write_chunk; > > > + > > > + write_chunk = writeback_chunk_size(wb, work); > > > + wbc.nr_to_write = write_chunk; > > > + sb->s_op->write_metadata(sb, &wbc); > > > + work->nr_pages -= write_chunk - wbc.nr_to_write; > > > + > > > + return write_chunk - wbc.nr_to_write; > > > > Ok, writeback_chunk_size() returns a page count. We've already gone > > through the "metadata is not page sized" dance on the dirty > > accounting side, so how are we supposed to use pages to account for > > metadata writeback? > > > > This is just one of those things that's going to be slightly shitty. It's the > same for memory reclaim, all of those places use pages so we just take > METADATA_*_BYTES >> PAGE_SHIFT to get pages and figure it's close enough. Ok, so that isn't exactly easy to deal with, because all our metadata writeback is based on log sequence number targets (i.e. how far to push the tail of the log towards the current head). We've actually got no idea how pages/bytes actually map to a LSN target because while we might account a full buffer as dirty for memory reclaim purposes (up to 64k in size), we might have only logged 128 bytes of it. i.e. if we are asked to push 2MB of metadata and we treat that as 2MB of log space (i.e. push target of tail LSN + 2MB) we could have logged several tens of megabytes of dirty metadata in that LSN range and have to flush it all. OTOH, if the buffers are fully logged, then that same target might only flush 1.5MB of metadata once all the log overhead is taken into account. So there's a fairly large disconnect between the "flush N bytes of metadata" API and the "push to a target LSN" that XFS uses for flushing metadata in aged order. I'm betting that extN and other filesystems might have similar mismatches with their journal flushing... > > And, from what I can tell, if work->sync_mode = WB_SYNC_ALL or > > work->tagged_writepages is set, this will basically tell us to flush the entire dirty metadata cache because write_chunk will get set to LONG_MAX.
Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray
On Mon, Dec 11, 2017 at 02:12:28PM -0800, Joe Perches wrote: > On Tue, 2017-12-12 at 08:43 +1100, Dave Chinner wrote: > > On Sat, Dec 09, 2017 at 09:00:18AM -0800, Joe Perches wrote: > > > On Sat, 2017-12-09 at 09:36 +1100, Dave Chinner wrote: > > > > 1. Using lockdep_set_novalidate_class() for anything other > > > > than device->mutex will throw checkpatch warnings. Nice. (*) > > > [] > > > > (*) checkpatch.pl is considered mostly harmful round here, too, > > > > but that's another rant > > > > > > How so? > > > > Short story is that it barfs all over the slightly non-standard > > coding style used in XFS. > [] > > This sort of stuff is just lowest-common-denominator noise - great > > for new code and/or inexperienced developers, but not for working > > with large bodies of existing code with slightly non-standard > > conventions. > > Completely reasonable. Thanks. > > Do you get many checkpatch submitters for fs/xfs? We used to. Not recently, though. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata
On Mon, Dec 11, 2017 at 04:55:31PM -0500, Josef Bacik wrote: > From: Josef Bacik <jba...@fb.com> > > Now that we have metadata counters in the VM, we need to provide a way to kick > writeback on dirty metadata. Introduce super_operations->write_metadata. > This > allows file systems to deal with writing back any dirty metadata we need based > on the writeback needs of the system. Since there is no inode to key off of > we > need a list in the bdi for dirty super blocks to be added. From there we can > find any dirty sb's on the bdi we are currently doing writeback on and call > into > their ->write_metadata callback. > > Signed-off-by: Josef Bacik <jba...@fb.com> > Reviewed-by: Jan Kara <j...@suse.cz> > Reviewed-by: Tejun Heo <t...@kernel.org> > --- > fs/fs-writeback.c| 72 > > fs/super.c | 6 > include/linux/backing-dev-defs.h | 2 ++ > include/linux/fs.h | 4 +++ > mm/backing-dev.c | 2 ++ > 5 files changed, 80 insertions(+), 6 deletions(-) > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c > index 987448ed7698..fba703dff678 100644 > --- a/fs/fs-writeback.c > +++ b/fs/fs-writeback.c > @@ -1479,6 +1479,31 @@ static long writeback_chunk_size(struct bdi_writeback > *wb, > return pages; > } > > +static long writeback_sb_metadata(struct super_block *sb, > + struct bdi_writeback *wb, > + struct wb_writeback_work *work) > +{ > + struct writeback_control wbc = { > + .sync_mode = work->sync_mode, > + .tagged_writepages = work->tagged_writepages, > + .for_kupdate= work->for_kupdate, > + .for_background = work->for_background, > + .for_sync = work->for_sync, > + .range_cyclic = work->range_cyclic, > + .range_start= 0, > + .range_end = LLONG_MAX, > + }; > + long write_chunk; > + > + write_chunk = writeback_chunk_size(wb, work); > + wbc.nr_to_write = write_chunk; > + sb->s_op->write_metadata(sb, &wbc); > + work->nr_pages -= write_chunk - wbc.nr_to_write; > + > + return write_chunk - wbc.nr_to_write; Ok, writeback_chunk_size() returns a page count. 
We've already gone through the "metadata is not page sized" dance on the dirty accounting side, so how are we supposed to use pages to account for metadata writeback? And, from what I can tell, if work->sync_mode = WB_SYNC_ALL or work->tagged_writepages is set, this will basically tell us to flush the entire dirty metadata cache because write_chunk will get set to LONG_MAX. IOWs, this would appear to me to change sync() behaviour quite dramatically on filesystems where ->write_metadata is implemented. That is, instead of leaving all the metadata dirty in memory and just forcing the journal to stable storage, filesystems will be told to also write back all their dirty metadata before sync() returns, even though it is not necessary to provide correct sync() semantics. Mind you, writeback invocation is so convoluted now I could easily be mis-interpreting this code, but it does seem to me like this code is going to have some unintended behaviours. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray
On Sun, Dec 10, 2017 at 08:23:15PM -0800, Matthew Wilcox wrote: > On Mon, Dec 11, 2017 at 10:57:45AM +1100, Dave Chinner wrote: > > i.e. the fact the cmpxchg failed may not have anything to do with a > > race condition - it failed because the slot wasn't empty like we > > expected it to be. There can be any number of reasons the slot isn't > > empty - the API should not "document" that the reason the insert > > failed was a race condition. It should document the case that we > > "couldn't insert because there was an existing entry in the slot". > > Let the surrounding code document the reason why that might have > > happened - it's not for the API to assume reasons for failure. > > > > i.e. this API and potential internal implementation makes much > > more sense: > > > > int > > xa_store_iff_empty(...) > > { > > curr = xa_cmpxchg(&pag->pag_ici_xa, agino, NULL, ip, GFP_NOFS); > > if (!curr) > > return 0; /* success! */ > > if (!IS_ERR(curr)) > > return -EEXIST; /* failed - slot not empty */ > > return PTR_ERR(curr); /* failed - XA internal issue */ > > } > > > > as it replaces the existing preload and insert code in the XFS code > > paths whilst letting us handle and document the "insert failed > > because slot not empty" case however we want. It implies nothing > > about *why* the slot wasn't empty, just that we couldn't do the > > insert because it wasn't empty. > > Yeah, OK. So, over the top of the recent changes I'm looking at this: > > I'm not in love with xa_store_empty() as a name. I almost want > xa_store_weak(), but after my MAP_FIXED_WEAK proposed name got shot > down, I'm leery of it. "empty" is at least a concept we already have > in the API with the comment for xa_init() talking about an empty array > and xa_empty(). I also considered xa_store_null and xa_overwrite_null > and xa_replace_null(). Naming remains hard. 
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > index 941f38bb94a4..586b43836905 100644 > --- a/fs/xfs/xfs_icache.c > +++ b/fs/xfs/xfs_icache.c > @@ -451,7 +451,7 @@ xfs_iget_cache_miss( > int flags, > int lock_flags) > { > - struct xfs_inode *ip, *curr; > + struct xfs_inode *ip; > int error; > xfs_agino_t agino = XFS_INO_TO_AGINO(mp, ino); > int iflags; > @@ -498,8 +498,7 @@ xfs_iget_cache_miss( > xfs_iflags_set(ip, iflags); > > /* insert the new inode */ > - curr = xa_cmpxchg(&pag->pag_ici_xa, agino, NULL, ip, GFP_NOFS); > - error = __xa_race(curr, -EAGAIN); > + error = xa_store_empty(&pag->pag_ici_xa, agino, ip, GFP_NOFS, -EAGAIN); > if (error) > goto out_unlock; Can we avoid encoding the error to be returned into the function call? No other generic/library API does this, so this seems like a highly unusual special snowflake approach. I just don't see how creating a whole new error specification convention is justified for the case where it *might* save a line or two of error checking code in a caller? I'd much prefer that the API defines the error on failure, and let the caller handle that specific error however they need to like every other library interface we have... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray
On Sat, Dec 09, 2017 at 09:00:18AM -0800, Joe Perches wrote: > On Sat, 2017-12-09 at 09:36 +1100, Dave Chinner wrote: > > 1. Using lockdep_set_novalidate_class() for anything other > > than device->mutex will throw checkpatch warnings. Nice. (*) > [] > > (*) checkpatch.pl is considered mostly harmful round here, too, > > but that's another rant > > How so? Short story is that it barfs all over the slightly non-standard coding style used in XFS. It generates enough noise on incidental things that aren't important that it complicates simple things. e.g. I just moved a block of defines from one header to another, and checkpatch throws about 10 warnings on that because of whitespace. I'm just moving code - I don't want to change it and it doesn't need to be modified because it is neat and easy to read and is obviously correct. A bunch of prototypes I added another parameter to get warnings because they use "unsigned" for an existing parameter that I'm not changing. And so on. This sort of stuff is just lowest-common-denominator noise - great for new code and/or inexperienced developers, but not for working with large bodies of existing code with slightly non-standard conventions. Churning *lots* of code we otherwise wouldn't touch just to shut up checkpatch warnings takes time, risks regressions and makes it harder to trace the history of the code when we are trying to track down bugs. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: defragmenting best practice?
On Tue, Oct 31, 2017 someone wrote: > > > > 2. Put $HOME/.cache on a separate BTRFS subvolume that is mounted > > nocow -- it will NOT be snapshotted I did exactly this. It serves the purpose of avoiding snapshots. However, today I saw the following at https://wiki.archlinux.org/index.php/Btrfs Note: From Btrfs Wiki Mount options: within a single file system, it is not possible to mount some subvolumes with nodatacow and others with datacow. The mount option of the first mounted subvolume applies to any other subvolumes. That makes me think my nodatacow mount option on $HOME/.cache is not effective. True? (My subjective performance results have not been as good as hoped for with the tweaks I have tried so far.)
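The wiki note is accurate: nodatacow/datacow cannot be mixed per subvolume via mount options, so a mount option on $HOME/.cache alone has no effect. The usual per-directory alternative is the No_COW file attribute, which newly created files inherit from their parent directory. A config fragment (the path is the one from the question; the directory should be empty when the attribute is set, since existing files keep their CoW status):

```shell
# Set No_COW on the cache directory; files created in it afterwards
# inherit the attribute and are written without copy-on-write.
chattr +C ~/.cache

# Verify: 'C' should appear in the attribute column.
lsattr -d ~/.cache
```

Note that nodatacow also disables data checksumming and compression for those files, which is generally an acceptable trade-off for disposable cache data.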
Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray
On Fri, Dec 08, 2017 at 03:01:31PM -0800, Matthew Wilcox wrote: > On Thu, Dec 07, 2017 at 11:38:43AM +1100, Dave Chinner wrote: > > > > cmpxchg is for replacing a known object in a store - it's not really > > > > intended for doing initial inserts after a lookup tells us there is > > > > nothing in the store. The radix tree "insert only if empty" makes > > > > sense here, because it naturally takes care of lookup/insert races > > > > via the -EEXIST mechanism. > > > > > > > > I think that providing xa_store_excl() (which would return -EEXIST > > > > if the entry is not empty) would be a better interface here, because > > > > it matches the semantics of lookup cache population used all over > > > > the kernel > > > > > > I'm not thrilled with xa_store_excl(), but I need to think about that > > > a bit more. > > > > Not fussed about the name - I just think we need a function that > > matches the insert semantics of the code > > I think I have something that works better for you than returning -EEXIST > (because you don't actually want -EEXIST, you want -EAGAIN): > > /* insert the new inode */ > - spin_lock(&pag->pag_ici_lock); > - error = radix_tree_insert(&pag->pag_ici_root, agino, ip); > - if (unlikely(error)) { > - WARN_ON(error != -EEXIST); > - XFS_STATS_INC(mp, xs_ig_dup); > - error = -EAGAIN; > - goto out_preload_end; > - } > - spin_unlock(&pag->pag_ici_lock); > - radix_tree_preload_end(); > + curr = xa_cmpxchg(&pag->pag_ici_xa, agino, NULL, ip, GFP_NOFS); > + error = __xa_race(curr, -EAGAIN); > + if (error) > + goto out_unlock; > > ... > > -out_preload_end: > - spin_unlock(&pag->pag_ici_lock); > - radix_tree_preload_end(); > +out_unlock: > + if (error == -EAGAIN) > + XFS_STATS_INC(mp, xs_ig_dup); > > I've changed the behaviour slightly in that returning an -ENOMEM used to > hit a WARN_ON, and I don't think that's the right response -- GFP_NOFS > returning -ENOMEM probably gets you a nice warning already from the > mm code. 
It's been a couple of days since I first looked at this, and my initial reaction of "that's horrible!" hasn't changed. What you are proposing here might be a perfectly reasonable *internal implementation* of xa_store_excl(), but it makes for a terrible external API because the semantics and behaviour are so vague. e.g. what does "race" mean here with respect to an insert failure? i.e. the fact the cmpxchg failed may not have anything to do with a race condition - it failed because the slot wasn't empty like we expected it to be. There can be any number of reasons the slot isn't empty - the API should not "document" that the reason the insert failed was a race condition. It should document the case that we "couldn't insert because there was an existing entry in the slot". Let the surrounding code document the reason why that might have happened - it's not for the API to assume reasons for failure. i.e. this API and potential internal implementation makes much more sense: int xa_store_iff_empty(...) { curr = xa_cmpxchg(&pag->pag_ici_xa, agino, NULL, ip, GFP_NOFS); if (!curr) return 0; /* success! */ if (!IS_ERR(curr)) return -EEXIST; /* failed - slot not empty */ return PTR_ERR(curr); /* failed - XA internal issue */ } as it replaces the existing preload and insert code in the XFS code paths whilst letting us handle and document the "insert failed because slot not empty" case however we want. It implies nothing about *why* the slot wasn't empty, just that we couldn't do the insert because it wasn't empty. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: Lockdep is less useful than it was
On Fri, Dec 08, 2017 at 10:14:38AM -0800, Matthew Wilcox wrote: > At the moment, the radix tree actively disables the RCU checking that > enabling lockdep would give us. It has to, because it has no idea what > lock protects any individual access to the radix tree. The XArray can > use the RCU checking because it knows that every reference is protected > by either the spinlock or the RCU lock. > > Dave was saying that he has a tree which has to be protected by a mutex > because of where it is in the locking hierarchy, and I was vigorously > declining his proposal of allowing him to skip taking the spinlock. Oh, I wasn't suggesting that you remove the internal tree locking because we need external locking. I was trying to point out that the internal locking doesn't remove the need for external locking, and that there are cases where smearing the internal lock outside the XA tree doesn't work, either. i.e. internal locking doesn't replace all the cases where external locking is required, and so it's less efficient than the existing radix tree code. What I was questioning was the value of replacing the radix tree code with a less efficient structure just to add lockdep validation to a tree that doesn't actually need any extra locking validation... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray
On Fri, Dec 08, 2017 at 12:35:07PM -0500, Alan Stern wrote: > On Fri, 8 Dec 2017, Byungchul Park wrote: > > > I'm sorry to hear that.. If I were you, I would also get > > annoyed. And.. thanks for explanation. > > > > But, I think assigning lock classes properly and checking > > relationship of the classes to detect deadlocks is reasonable. > > > > In my opinion about the common lockdep stuff, there are 2 > > problems on it. > > > > 1) Firstly, it's hard to assign lock classes *properly*. By > > default, it relies on the caller site of lockdep_init_map(), > > but we need to assign another class manually, where ordering > > rules are complicated so cannot rely on the caller site. That > > *only* can be done by experts of the subsystem. Sure, but that's not the issue here. The issue here is the lack of communication with subsystem experts and that the annotation complexity warnings given immediately by the subsystem experts were completely ignored... > > I think if they want to get benifit from lockdep, they have no > > choice but to assign classes manually with the domain knowledge, > > or use *lockdep_set_novalidate_class()* to invalidate locks > > making the developers annoyed and not want to use the checking > > for them. > > Lockdep's no_validate class is used when the locking patterns are too > complicated for lockdep to understand. Basically, it tells lockdep to > ignore those locks. Let me just point out two things here: 1. Using lockdep_set_novalidate_class() for anything other than device->mutex will throw checkpatch warnings. Nice. (*) 2. lockdep_set_novalidate_class() is completely undocumented - it's the first I've ever heard of this functionality. i.e. nobody has ever told us there is a mechanism to turn off validation of an object; we've *always* been told to "change your code and/or fix your annotations" when discussing lockdep deficiencies. (**) > The device core uses that class. 
The tree of struct devices, each with > its own lock, gets used in many different and complicated ways. > Lockdep can't understand this -- it doesn't have the ability to > represent an arbitrarily deep hierarchical tree of locks -- so we tell ^^ That largely describes the in-memory structure of XFS, except we have a forest of lock trees, not just one > it to ignore the device locks. > > It sounds like XFS may need to do the same thing with its semaphores. Whoever adds semaphore checking to lockdep can add those annotations. The externalisation of the development cost of new lockdep functionality is one of the problems here. -Dave. (*) checkpatch.pl is considered mostly harmful round here, too, but that's another rant (**) the frequent occurrence of "core code/devs aren't held to the same rules/standard as everyone else" is another rant I have stored up for a rainy day. -- Dave Chinner da...@fromorbit.com
Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray
On Fri, Dec 08, 2017 at 01:45:52PM +0900, Byungchul Park wrote: > On Fri, Dec 08, 2017 at 09:22:16AM +1100, Dave Chinner wrote: > > On Thu, Dec 07, 2017 at 11:06:34AM -0500, Theodore Ts'o wrote: > > > On Wed, Dec 06, 2017 at 06:06:48AM -0800, Matthew Wilcox wrote: > > > > > Unfortunately for you, I don't find arguments along the lines of > > > > > "lockdep will save us" at all convincing. lockdep already throws > > > > > too many false positives to be useful as a tool that reliably and > > > > > accurately points out rare, exciting, complex, intricate locking > > > > > problems. > > > > > > > > But it does reliably and accurately point out "dude, you forgot to take > > > > the lock". It's caught a number of real problems in my own testing that > > > > you never got to see. > > > > > > The problem is that if it has too many false positives --- and it's > > > gotten *way* worse with the completion callback "feature", people will > > > just stop using Lockdep as being too annyoing and a waste of developer > > > time when trying to figure what is a legitimate locking bug versus > > > lockdep getting confused. > > > > > > I can't even disable the new Lockdep feature which is throwing > > > lots of new false positives --- it's just all or nothing. > > > > > > Dave has just said he's already stopped using Lockdep, as a result. > > > > This is compeltely OT, but FYI I stopped using lockdep a long time > > ago. We've spend orders of magnitude more time and effort to shut > > up lockdep false positives in the XFS code than we ever have on > > locking problems that lockdep has uncovered. And still lockdep > > throws too many false positives on XFS workloads to be useful to me. > > > > But it's more than that: I understand just how much lockdep *doesn't > > check* and that means *I know I can't rely on lockdep* for potential > > deadlock detection. e.g. 
it doesn't cover semaphores, which means > > Hello, > > I'm careful in saying the following since you seem to feel not good at > crossrelease and even lockdep. Now that cross-release has been > introduced, semaphores can be covered as you might know. Actually, all > general waiters can. And all it will do is create a whole bunch more work for us XFS guys to shut up all the the false positive crap that falls out from it because the locking model we have is far more complex than any of the lockdep developers thought was necessary to support, just like happened with the XFS inode annotations all those years ago. e.g. nobody has ever bothered to ask us what is needed to describe XFS's semaphore locking model. If you did that, you'd know that we nest *thousands* of locked semaphores in compeltely random lock order during metadata buffer writeback. And that this lock order does not reflect the actual locking order rules we have for locking buffers during transactions. Oh, and you'd also know that a semaphore's lock order and context can change multiple times during the life time of the buffer. Say we free a block and the reallocate it as something else before it is reclaimed - that buffer now might have a different lock order. Or maybe we promote a buffer to be a root btree block as a result of a join - it's now the first buffer in a lock run, rather than a child. Or we split a tree, and the root is now a node and so no longer is the first buffer in a lock run. Or that we walk sideways along the leaf nodes siblings during searches. IOWs, there is no well defined static lock ordering at all for buffers - and therefore semaphores - in XFS at all. And knowing that, you wouldn't simply mention that lockdep can support semaphores now as though that is necessary to "make it work" for XFS. 
It's going to be much simpler for us to just turn off lockdep and ignore whatever crap it sends our way than it is to spend unplanned weeks of our time to try to make lockdep sorta work again. Sure, we might get there in the end, but it's likely to take months, if not years like it did with the XFS inode annotations. > > it has zero coverage of the entire XFS metadata buffer subsystem and > > the complex locking orders we have for metadata updates. > > > > Put simply: lockdep doesn't provide me with any benefit, so I don't > > use it... > > Sad.. I don't think you understand. I'll try to explain. The lockdep infrastructure by itself doesn't make lockdep a useful tool - it mostly generates false positives because it has no concept of locking models that don't match it's internal tracking assumptions and/or limitations. That means if we can't suppress the false positives, then l
Re: Lockdep is less useful than it was
On Thu, Dec 07, 2017 at 02:38:03PM -0800, Matthew Wilcox wrote: > On Thu, Dec 07, 2017 at 11:06:34AM -0500, Theodore Ts'o wrote: > > The problem is that if it has too many false positives --- and it's > > gotten *way* worse with the completion callback "feature", people will > > just stop using Lockdep as being too annoying and a waste of developer > > time when trying to figure what is a legitimate locking bug versus > > lockdep getting confused. > > > > I can't even disable the new Lockdep feature which is throwing > > lots of new false positives --- it's just all or nothing. > > You *can* ... but it's way more hacking Kconfig than you ought to have > to do (which is a separate rant ...) > > You need to get LOCKDEP_CROSSRELEASE off. I'd revert patches > e26f34a407aec9c65bce2bc0c838fabe4f051fc6 and > b483cf3bc249d7af706390efa63d6671e80d1c09 > > I think it was a mistake to force these on for everybody; they have a > much higher false-positive rate than the rest of lockdep, so as you say > forcing them on leads to fewer people using *any* of lockdep. > > The bug you're hitting isn't Byungchul's fault; it's an annotation > problem. The same kind of annotation problem that we used to have with > dozens of other places in the kernel which are now fixed. That's one of the fundamental problems with lockdep - it throws the difficulty of solving all these new false positives onto the developers who know nothing about lockdep and don't follow its development. And until they do solve them - especially in critical subsystems that everyone uses like the storage stack - lockdep is essentially worthless. > If you didn't > have to hack Kconfig to get rid of this problem, you'd be happier, right? I'd be much happier if it wasn't turned on by default in the first place. We gave plenty of warnings that there were still unsolved false positive problems with the new checks in the storage stack. Cheers, Dave. 
-- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray
On Thu, Dec 07, 2017 at 11:06:34AM -0500, Theodore Ts'o wrote: > On Wed, Dec 06, 2017 at 06:06:48AM -0800, Matthew Wilcox wrote: > > > Unfortunately for you, I don't find arguments along the lines of > > > "lockdep will save us" at all convincing. lockdep already throws > > > too many false positives to be useful as a tool that reliably and > > > accurately points out rare, exciting, complex, intricate locking > > > problems. > > > > But it does reliably and accurately point out "dude, you forgot to take > > the lock". It's caught a number of real problems in my own testing that > > you never got to see. > > The problem is that if it has too many false positives --- and it's > gotten *way* worse with the completion callback "feature", people will > just stop using Lockdep as being too annoying and a waste of developer > time when trying to figure what is a legitimate locking bug versus > lockdep getting confused. > > I can't even disable the new Lockdep feature which is throwing > lots of new false positives --- it's just all or nothing. > > Dave has just said he's already stopped using Lockdep, as a result. This is completely OT, but FYI I stopped using lockdep a long time ago. We've spent orders of magnitude more time and effort to shut up lockdep false positives in the XFS code than we ever have on locking problems that lockdep has uncovered. And still lockdep throws too many false positives on XFS workloads to be useful to me. But it's more than that: I understand just how much lockdep *doesn't check* and that means *I know I can't rely on lockdep* for potential deadlock detection. e.g. it doesn't cover semaphores, which means it has zero coverage of the entire XFS metadata buffer subsystem and the complex locking orders we have for metadata updates. Put simply: lockdep doesn't provide me with any benefit, so I don't use it... Cheers, Dave.
-- Dave Chinner da...@fromorbit.com
Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray
On Wed, Dec 06, 2017 at 06:06:48AM -0800, Matthew Wilcox wrote: > On Wed, Dec 06, 2017 at 07:44:04PM +1100, Dave Chinner wrote: > > On Tue, Dec 05, 2017 at 08:45:49PM -0800, Matthew Wilcox wrote: > > > That said, using xa_cmpxchg() in the dquot code looked like the right > > > thing to do? Since we'd dropped the qi mutex and the ILOCK, it looks > > > entirely reasonable for another thread to come in and set up the dquot. > > > But I'm obviously quite ignorant of the XFS internals, so maybe there's > > > something else going on that makes this essentially a "can't happen". > > > > It's no different to the inode cache code, which drops the RCU > > lock on lookup miss, instantiates the new inode (maybe reading it > > off disk), then locks the tree and attempts to insert it. Both cases > > use "insert if empty, otherwise retry lookup from start" semantics. > > Ah. I had my focus set a little narrow on the inode cache code and didn't > recognise the pattern. > > Why do you sleep for one jiffy after encountering a miss, then seeing > someone else insert the inode for you? The sleep is a backoff that allows whatever we raced with to complete, be it a hit that raced with an inode being reclaimed and removed, or a miss that raced with another insert. Ideally we'd sleep on the XFS_INEW bit, similar to the vfs I_NEW flag, but it's not quite that simple with the reclaim side of things... > > cmpxchg is for replacing a known object in a store - it's not really > > intended for doing initial inserts after a lookup tells us there is > > nothing in the store. The radix tree "insert only if empty" makes > > sense here, because it naturally takes care of lookup/insert races > > via the -EEXIST mechanism. 
> > > > I think that providing xa_store_excl() (which would return -EEXIST > > if the entry is not empty) would be a better interface here, because > > it matches the semantics of lookup cache population used all over > > the kernel. > > I'm not thrilled with xa_store_excl(), but I need to think about that > a bit more. Not fussed about the name - I just think we need a function that matches the insert semantics of the code. > > > I'm quite happy to have normal API variants that don't save/restore > > > interrupts. Just need to come up with good names ... I don't think > > > xa_store_noirq() is a good name, but maybe you do? > > > > I'd prefer not to have to deal with such things at all. :P > > > > How many subsystems actually require irq safety in the XA locking > > code? Make them use irqsafe versions, not make everyone else use > > "noirq" versions, as is the convention for the rest of the kernel > > code. > > Hard to say how many existing radix tree users require the irq safety. The mapping tree requires it because it gets called from IO completion contexts to clear page writeback state, but I don't know about any of the others. > Also hard to say how many potential users (people currently using > linked lists, people using resizable arrays, etc) need irq safety. > My thinking was "make it safe by default and let people who know better > have a way to opt out", but there's definitely something to be said for > "make it fast by default and let people who need the unusual behaviour > type those extra few letters". > > So, you're arguing for providing xa_store(), xa_store_irq(), xa_store_bh() > and xa_store_irqsafe()? (at least on demand, as users come to light?) > At least the read side doesn't require any variants; everybody can use > RCU for read side protection. 
That would follow the pattern of the rest of the kernel APIs, though I think it might be cleaner to simply state the locking requirement to xa_init() and keep all those details completely internal rather than encoding them into API calls. After all, the "irqsafe-ness" of the locking needs to be consistent across the entire XA instance. > ("safe", not "save" because I wouldn't make the caller provide the > "flags" argument). > > > > At least, not today. One of the future plans is to allow xa_nodes to > > > be allocated from ZONE_MOVABLE. In order to do that, we have to be > > > able to tell which lock protects any given node. With the XArray, > > > we can find that out (xa_node->root->xa_lock); with the radix tree, > > > we don't even know what kind of lock protects the tree. > > > > Yup, this is a prime example of why we shouldn't be creating > > external dependencies by smearing the locking context outside the XA > > structure itself. It's not a stretch to se
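The xa_store_excl() semantics proposed in this exchange - store only if the slot is currently empty, otherwise fail with -EEXIST - can be modelled in a few lines of userspace C. This is an illustrative sketch only: the fixed-size slots table stands in for the XArray, and store_excl() is the hypothetical name from the discussion, not a real kernel interface.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Userspace model of the proposed xa_store_excl() semantics: insert
 * only if the slot is empty, otherwise return -EEXIST so the caller
 * can retry its lookup.  "slots" is a stand-in for the XArray.
 */
#define NSLOTS 16
static void *slots[NSLOTS];

static int store_excl(unsigned long index, void *entry)
{
	if (index >= NSLOTS)
		return -EINVAL;
	if (slots[index] != NULL)
		return -EEXIST;	/* occupied: someone else got there first */
	slots[index] = entry;
	return 0;		/* 0 on success, negative errno on failure */
}
```

The return convention is the point of the proposal: success is 0 and failure is a negative errno, rather than the cmpxchg()-style "NULL means success" that the thread objects to.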
Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray
On Tue, Dec 05, 2017 at 08:45:49PM -0800, Matthew Wilcox wrote: > On Wed, Dec 06, 2017 at 02:14:56PM +1100, Dave Chinner wrote: > > > The other conversions use the normal API instead of the advanced API, so > > > all of this gets hidden away. For example, the inode cache does this: > > > > Ah, OK, that's not obvious from the code changes. :/ > > Yeah, it's a lot easier to understand (I think!) if you build the > docs in that tree and look at > file:///home/willy/kernel/xarray-3/Documentation/output/core-api/xarray.html > (mutatis mutandi). I've tried to tell a nice story about how to put > all the pieces together from the normal to the advanced API. > > > However, it's probably overkill for XFS. In all the cases, when we > > insert there should be no entry in the tree - the > > radix tree insert error handling code there was simply catching > > "should never happen" cases and handling it without crashing. > > I thought it was probably overkill to be using xa_cmpxchg() in the > pag_ici patch. I didn't want to take away your error handling as part > of the conversion, but I think a rational person implementing it today > would just call xa_store() and not even worry about the return value > except to check it for IS_ERR(). *nod* > That said, using xa_cmpxchg() in the dquot code looked like the right > thing to do? Since we'd dropped the qi mutex and the ILOCK, it looks > entirely reasonable for another thread to come in and set up the dquot. > But I'm obviously quite ignorant of the XFS internals, so maybe there's > something else going on that makes this essentially a "can't happen". It's no different to the inode cache code, which drops the RCU lock on lookup miss, instantiates the new inode (maybe reading it off disk), then locks the tree and attempts to insert it. Both cases use "insert if empty, otherwise retry lookup from start" semantics. 
cmpxchg is for replacing a known object in a store - it's not really intended for doing initial inserts after a lookup tells us there is nothing in the store. The radix tree "insert only if empty" makes sense here, because it naturally takes care of lookup/insert races via the -EEXIST mechanism. I think that providing xa_store_excl() (which would return -EEXIST if the entry is not empty) would be a better interface here, because it matches the semantics of lookup cache population used all over the kernel. > > Now that I've looked at this, I have to say that having a return > > value of NULL meaning "success" is quite counter-intuitive. That's > > going to fire my "that looks so wrong" detector every time I look at > > the code and notice it's erroring out on a non-null return value > > that isn't a PTR_ERR case. > > It's the same convention as cmpxchg(). I think it's triggering your > "looks so wrong" detector because it's fundamentally not the natural > thing to write. Most definitely the case, and this is why it's a really bad interface for the semantics we have. This is how we end up with code that makes it easy for programmers to screw up pointer checks in error handling... :/ > I'm quite happy to have normal API variants that don't save/restore > interrupts. Just need to come up with good names ... I don't think > xa_store_noirq() is a good name, but maybe you do? I'd prefer not to have to deal with such things at all. :P How many subsystems actually require irq safety in the XA locking code? Make them use irqsafe versions, not make everyone else use "noirq" versions, as is the convention for the rest of the kernel code. > > > It's the design pattern I've always intended to use. Naturally, the > > > xfs radix trees weren't my initial target; it was the page cache, and > > > the page cache does the same thing; uses the tree_lock to protect both > > > the radix tree and several other fields in that same data structure. 
> > > > > > I'm open to argument on this though ... particularly if you have a better > > > design pattern in mind! > > > > I don't mind structures having internal locking - I have a problem > > with leaking them into contexts outside the structure they protect. > > That way lies madness - you can't change the internal locking in > > future because of external dependencies, and the moment you need > > something different externally we've got to go back to an external > > lock anyway. > > > > This is demonstrated by the way you converted the XFS dquot tree - > > you didn't replace the dquot tree lock with the internal xa_lock > > because it's a mutex and we have to sleep holding it. IOWs, we've > > added another layer of locking here, not simplif
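The "insert if empty, otherwise retry lookup from start" cache-population pattern described in this message can be sketched in userspace C. Everything here is an invented stand-in for illustration - the kernel inode cache does this against a real radix tree under RCU, with the object instantiation being a disk read rather than a malloc().

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Illustrative model of lookup cache population: on a miss we
 * instantiate a new object, then attempt an exclusive insert.  If
 * another path inserted first (-EEXIST), discard our copy and retry
 * the lookup from the start.
 */
#define CACHE_SLOTS 8
static void *cache[CACHE_SLOTS];

static void *cache_lookup(unsigned long key)
{
	return key < CACHE_SLOTS ? cache[key] : NULL;
}

static int cache_insert(unsigned long key, void *obj)
{
	if (key >= CACHE_SLOTS)
		return -EINVAL;
	if (cache[key])
		return -EEXIST;		/* raced with another insert */
	cache[key] = obj;
	return 0;
}

static void *cache_get(unsigned long key)
{
	for (;;) {
		void *obj = cache_lookup(key);
		int err;

		if (obj)
			return obj;	/* hit */

		obj = malloc(16);	/* instantiate (e.g. read from disk) */
		if (!obj)
			return NULL;

		err = cache_insert(key, obj);
		if (!err)
			return obj;	/* our insert won */

		free(obj);
		if (err != -EEXIST)
			return NULL;	/* hard error */
		/* -EEXIST: someone beat us; retry lookup from start */
	}
}
```

The -EEXIST branch is what makes the pattern race-free without holding a lock across the (potentially sleeping) instantiation step.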
Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray
On Tue, Dec 05, 2017 at 06:02:08PM -0800, Matthew Wilcox wrote: > On Wed, Dec 06, 2017 at 12:36:48PM +1100, Dave Chinner wrote: > > > - if (radix_tree_preload(GFP_NOFS)) > > > - return -ENOMEM; > > > - > > > INIT_LIST_HEAD(&elem->list_node); > > > elem->key = key; > > > > > > - spin_lock(&mru->lock); > > > - error = radix_tree_insert(&mru->store, key, elem); > > > - radix_tree_preload_end(); > > > - if (!error) > > > - _xfs_mru_cache_list_insert(mru, elem); > > > - spin_unlock(&mru->lock); > > > + do { > > > + xas_lock(&xas); > > > + xas_store(&xas, elem); > > > + error = xas_error(&xas); > > > + if (!error) > > > + _xfs_mru_cache_list_insert(mru, elem); > > > + xas_unlock(&xas); > > > + } while (xas_nomem(&xas, GFP_NOFS)); > > > > Ok, so why does this have a retry loop on ENOMEM despite the > > existing code handling that error? And why put such a loop in this > > code and not any of the other XFS code that used > > radix_tree_preload() and is arguably much more important to avoid > > ENOMEM on insert (e.g. the inode cache)? > > If we need more nodes in the tree, xas_store() will try to allocate them > with GFP_NOWAIT | __GFP_NOWARN. If that fails, it signals it in xas_error(). > xas_nomem() will notice that we're in an ENOMEM situation, and allocate > a node using your preferred GFP flags (NOIO in your case). Then we retry, > guaranteeing forward progress. [1] > > The other conversions use the normal API instead of the advanced API, so > all of this gets hidden away. For example, the inode cache does this: > > + curr = xa_cmpxchg(&pag->pag_ici_xa, agino, NULL, ip, GFP_NOFS); > > and xa_cmpxchg internally does: > > do { > xa_lock_irqsave(xa, flags); > curr = xas_create(&xas); > if (curr == old) > xas_store(&xas, entry); > xa_unlock_irqrestore(xa, flags); > } while (xas_nomem(&xas, gfp)); Ah, OK, that's not obvious from the code changes. :/ However, it's probably overkill for XFS. 
In all the cases, when we insert there should be no entry in the tree - the radix tree insert error handling code there was simply catching "should never happen" cases and handling it without crashing. Now that I've looked at this, I have to say that having a return value of NULL meaning "success" is quite counter-intuitive. That's going to fire my "that looks so wrong" detector every time I look at the code and notice it's erroring out on a non-null return value that isn't a PTR_ERR case. Also, there's no need for irqsave/restore() locking contexts here as we never access these caches from interrupt contexts. That's just going to be extra overhead, especially on workloads that run 10^6 inodes through the cache every second. That's a problem caused by driving the locks into the XA structure and then needing to support callers that require irq safety. > > Also, I really don't like the pattern of using xa_lock()/xa_unlock() > > to protect access to an external structure. i.e. the mru->lock > > context is protecting multiple fields and operations in the MRU > > structure, not just the radix tree operations. Turning that around > > so that a larger XFS structure and algorithm is now protected by an > > opaque internal lock from generic storage structure that forms part > > of the larger structure seems like a bad design pattern to me... > > It's the design pattern I've always intended to use. Naturally, the > xfs radix trees weren't my initial target; it was the page cache, and > the page cache does the same thing; uses the tree_lock to protect both > the radix tree and several other fields in that same data structure. > > I'm open to argument on this though ... particularly if you have a better > design pattern in mind! I don't mind structures having internal locking - I have a problem with leaking them into contexts outside the structure they protect. 
That way lies madness - you can't change the internal locking in future because of external dependencies, and the moment you need something different externally we've got to go back to an external lock anyway. This is demonstrated by the way you converted the XFS dquot tree - you didn't replace the dquot tree lock with the internal xa_lock because it's a mutex and we have to sleep holding it. IOWs, we've added another layer of locking here, not simplified the code. What I really see here is that we have inconsistent locking patterns w.r.t. XA stores inside XFS - some have an external mutex to cover a wider scope, some use xa_
Re: [PATCH v4 00/73] XArray version 4
On Tue, Dec 05, 2017 at 06:05:15PM -0800, Matthew Wilcox wrote: > On Wed, Dec 06, 2017 at 12:45:49PM +1100, Dave Chinner wrote: > > On Tue, Dec 05, 2017 at 04:40:46PM -0800, Matthew Wilcox wrote: > > > From: Matthew Wilcox <mawil...@microsoft.com> > > > > > > I looked through some notes and decided this was version 4 of the XArray. > > > Last posted two weeks ago, this version includes a *lot* of changes. > > > I'd like to thank Dave Chinner for his feedback, encouragement and > > > distracting ideas for improvement, which I'll get to once this is merged. > > > > BTW, you need to fix the "To:" line on your patchbombs: > > > > > To: unlisted-recipients: ;, no To-header on input > > > <@gmail-pop.l.google.com> > > > > This bad email address getting quoted to the cc line makes some MTAs > > very unhappy. > > I know :-( I was unhappy when I realised what I'd done. > > https://marc.info/?l=git=151252237912266=2 > > > I'll give this a quick burn this afternoon and see what catches fire... > > All of the things ... 0day gave me a 90% chance of hanging in one > configuration. Need to drill down on it more and find out what stupid > thing I've done wrong this time. 
Yup, Bad Stuff happened on boot: [ 24.548039] INFO: rcu_preempt detected stalls on CPUs/tasks: [ 24.548978] 1-...!: (0 ticks this GP) idle=688/0/0 softirq=143/143 fqs=0 [ 24.549926] 5-...!: (0 ticks this GP) idle=db8/0/0 softirq=120/120 fqs=0 [ 24.550864] 6-...!: (0 ticks this GP) idle=d58/0/0 softirq=111/111 fqs=0 [ 24.551802] 8-...!: (5 GPs behind) idle=514/0/0 softirq=189/189 fqs=0 [ 24.552722] 10-...!: (84 GPs behind) idle=ac0/0/0 softirq=80/80 fqs=0 [ 24.553617] 11-...!: (8 GPs behind) idle=cfc/0/0 softirq=95/95 fqs=0 [ 24.554496] 13-...!: (8 GPs behind) idle=b0c/0/0 softirq=82/82 fqs=0 [ 24.555382] 14-...!: (38 GPs behind) idle=a7c/0/0 softirq=93/93 fqs=0 [ 24.556305] 15-...!: (4 GPs behind) idle=b18/0/0 softirq=88/88 fqs=0 [ 24.557190] (detected by 9, t=5252 jiffies, g=-178, c=-179, q=994) [ 24.558051] Sending NMI from CPU 9 to CPUs 1: [ 24.558703] NMI backtrace for cpu 1 skipped: idling at native_safe_halt+0x2/0x10 [ 24.559654] Sending NMI from CPU 9 to CPUs 5: [ 24.559675] NMI backtrace for cpu 5 skipped: idling at native_safe_halt+0x2/0x10 [ 24.560654] Sending NMI from CPU 9 to CPUs 6: [ 24.560689] NMI backtrace for cpu 6 skipped: idling at native_safe_halt+0x2/0x10 [ 24.561655] Sending NMI from CPU 9 to CPUs 8: [ 24.561701] NMI backtrace for cpu 8 skipped: idling at native_safe_halt+0x2/0x10 [ 24.562654] Sending NMI from CPU 9 to CPUs 10: [ 24.562675] NMI backtrace for cpu 10 skipped: idling at native_safe_halt+0x2/0x10 [ 24.563653] Sending NMI from CPU 9 to CPUs 11: [ 24.563669] NMI backtrace for cpu 11 skipped: idling at native_safe_halt+0x2/0x10 [ 24.564653] Sending NMI from CPU 9 to CPUs 13: [ 24.564670] NMI backtrace for cpu 13 skipped: idling at native_safe_halt+0x2/0x10 [ 24.565652] Sending NMI from CPU 9 to CPUs 14: [ 24.565674] NMI backtrace for cpu 14 skipped: idling at native_safe_halt+0x2/0x10 [ 24.566652] Sending NMI from CPU 9 to CPUs 15: [ 24.59] NMI backtrace for cpu 15 skipped: idling at native_safe_halt+0x2/0x10 [ 24.567653] rcu_preempt 
kthread starved for 5256 jiffies! g18446744073709551438 c18446744073709551437 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->7 [ 24.567654] rcu_preempt I15128 9 2 0x8000 [ 24.567660] Call Trace: [ 24.567679] ? __schedule+0x289/0x880 [ 24.567681] schedule+0x2f/0x90 [ 24.567682] schedule_timeout+0x152/0x370 [ 24.567686] ? __next_timer_interrupt+0xc0/0xc0 [ 24.567689] rcu_gp_kthread+0x561/0x880 [ 24.567691] ? force_qs_rnp+0x1a0/0x1a0 [ 24.567693] kthread+0x111/0x130 [ 24.567695] ? __kthread_create_worker+0x120/0x120 [ 24.567697] ret_from_fork+0x24/0x30 [ 44.064092] watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kswapd0:854] [ 44.065920] CPU: 0 PID: 854 Comm: kswapd0 Not tainted 4.15.0-rc2-dgc #228 [ 44.067769] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 [ 44.070030] RIP: 0010:smp_call_function_single+0xce/0x100 [ 44.071521] RSP: :c90001d2fb20 EFLAGS: 0202 ORIG_RAX: ff11 [ 44.073592] RAX: RBX: 88013ab515c8 RCX: c9000350bb20 [ 44.075560] RDX: 0001 RSI: c90001d2fb20 RDI: c90001d2fb20 [ 44.077531] RBP: c90001d2fb50 R08: 0007 R09: 0080 [ 44.079483] R10: c90001d2fb78 R11: c90001d2fb30 R12: c90001d2fc10 [ 44.081465] R13: ea000449fc78 R14: ea000449fc58 R15: 88013ba36c40 [ 44.083434] FS: () GS:88013fc0() knlGS: [ 44.085683] CS: 0010 DS: ES: CR0
Re: [PATCH v4 00/73] XArray version 4
On Wed, Dec 06, 2017 at 01:53:41AM +0000, Matthew Wilcox wrote: > Huh, you've caught a couple of problems that 0day hasn't sent me yet. Try > turning on DAX or TRANSPARENT_HUGEPAGE. Thanks! Dax is turned on, CONFIG_TRANSPARENT_HUGEPAGE is not. Looks like nothing is setting CONFIG_RADIX_TREE_MULTIORDER, which is what xas_set_order() is hidden under. Ah, CONFIG_ZONE_DEVICE turns it on, not CONFIG_DAX/CONFIG_FS_DAX. Hmmm. That seems wrong if it's used in fs/dax.c... $ grep DAX .config CONFIG_DAX=y CONFIG_FS_DAX=y $ grep ZONE_DEVICE .config CONFIG_ARCH_HAS_ZONE_DEVICE=y $ So I have DAX enabled, but not ZONE_DEVICE? Shouldn't DAX be selecting ZONE_DEVICE, not relying on a user to select both of them so that stuff works properly? Hmmm - there's no menu option to turn on zone device, so it's selected by something else? Oh, HMM turns on ZONE_DEVICE. But that is "default y", so should be turned on. But it's not? And there's no obvious HMM menu config option, either. What a godawful mess Kconfig has turned into. I'm just going to enable TRANSPARENT_HUGEPAGE - madness awaits me if I follow the other path down the rat hole. Ok, it built this time. -Dave. 
Re: [PATCH v4 00/73] XArray version 4
On Wed, Dec 06, 2017 at 12:45:49PM +1100, Dave Chinner wrote: > On Tue, Dec 05, 2017 at 04:40:46PM -0800, Matthew Wilcox wrote: > > From: Matthew Wilcox <mawil...@microsoft.com> > > > > I looked through some notes and decided this was version 4 of the XArray. > > Last posted two weeks ago, this version includes a *lot* of changes. > > I'd like to thank Dave Chinner for his feedback, encouragement and > > distracting ideas for improvement, which I'll get to once this is merged. > > BTW, you need to fix the "To:" line on your patchbombs: > > > To: unlisted-recipients: ;, no To-header on input <@gmail-pop.l.google.com> > > This bad email address getting quoted to the cc line makes some MTAs > very unhappy. > > > > > Highlights: > > - Over 2000 words of documentation in patch 8! And lots more kernel-doc. > > - The page cache is now fully converted to the XArray. > > - Many more tests in the test-suite. > > > > This patch set is not for applying. 0day is still reporting problems, > > and I'd feel bad for eating someone's data. These patches apply on top > > of a set of preparatory patches which just aren't interesting. If you > > want to see the patches applied to a tree, I suggest pulling my git tree: > > http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/xarray-2017-12-04 > > I also left out the idr_preload removals. They're still in the git tree, > > but I'm not looking for feedback on them. > > I'll give this a quick burn this afternoon and see what catches fire... Build warnings/errors: . lib/radix-tree.c:700:13: warning: 'radix_tree_free_nodes' defined but not used [-Wunused-function] static void radix_tree_free_nodes(struct radix_tree_node *node) . lib/xarray.c: In function 'xas_max': lib/xarray.c:291:16: warning: unused variable 'mask' [-Wunused-variable] unsigned long mask, max = xas->xa_index; ^~~~ .. 
fs/dax.c: In function 'grab_mapping_entry': fs/dax.c:305:2: error: implicit declaration of function 'xas_set_order'; did you mean 'xas_set_err'? [-Werror=implicit-function-declaration] xas_set_order(&xas, index, size_flag ? PMD_ORDER : 0); ^ scripts/Makefile.build:310: recipe for target 'fs/dax.o' failed make[1]: *** [fs/dax.o] Error 1 -Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH v4 00/73] XArray version 4
On Tue, Dec 05, 2017 at 04:40:46PM -0800, Matthew Wilcox wrote: > From: Matthew Wilcox <mawil...@microsoft.com> > > I looked through some notes and decided this was version 4 of the XArray. > Last posted two weeks ago, this version includes a *lot* of changes. > I'd like to thank Dave Chinner for his feedback, encouragement and > distracting ideas for improvement, which I'll get to once this is merged. BTW, you need to fix the "To:" line on your patchbombs: > To: unlisted-recipients: ;, no To-header on input <@gmail-pop.l.google.com> This bad email address getting quoted to the cc line makes some MTAs very unhappy. > > Highlights: > - Over 2000 words of documentation in patch 8! And lots more kernel-doc. > - The page cache is now fully converted to the XArray. > - Many more tests in the test-suite. > > This patch set is not for applying. 0day is still reporting problems, > and I'd feel bad for eating someone's data. These patches apply on top > of a set of preparatory patches which just aren't interesting. If you > want to see the patches applied to a tree, I suggest pulling my git tree: > http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/xarray-2017-12-04 > I also left out the idr_preload removals. They're still in the git tree, > but I'm not looking for feedback on them. I'll give this a quick burn this afternoon and see what catches fire... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray
On Tue, Dec 05, 2017 at 04:41:58PM -0800, Matthew Wilcox wrote: > From: Matthew Wilcox <mawil...@microsoft.com> > > This eliminates a call to radix_tree_preload(). . > void > @@ -431,24 +424,24 @@ xfs_mru_cache_insert( > unsigned long key, > struct xfs_mru_cache_elem *elem) > { > + XA_STATE(xas, &mru->store, key); > int error; > > ASSERT(mru && mru->lists); > if (!mru || !mru->lists) > return -EINVAL; > > - if (radix_tree_preload(GFP_NOFS)) > - return -ENOMEM; > - > INIT_LIST_HEAD(&elem->list_node); > elem->key = key; > > - spin_lock(&mru->lock); > - error = radix_tree_insert(&mru->store, key, elem); > - radix_tree_preload_end(); > - if (!error) > - _xfs_mru_cache_list_insert(mru, elem); > - spin_unlock(&mru->lock); > + do { > + xas_lock(&xas); > + xas_store(&xas, elem); > + error = xas_error(&xas); > + if (!error) > + _xfs_mru_cache_list_insert(mru, elem); > + xas_unlock(&xas); > + } while (xas_nomem(&xas, GFP_NOFS)); Ok, so why does this have a retry loop on ENOMEM despite the existing code handling that error? And why put such a loop in this code and not any of the other XFS code that used radix_tree_preload() and is arguably much more important to avoid ENOMEM on insert (e.g. the inode cache)? Also, I really don't like the pattern of using xa_lock()/xa_unlock() to protect access to an external structure. i.e. the mru->lock context is protecting multiple fields and operations in the MRU structure, not just the radix tree operations. Turning that around so that a larger XFS structure and algorithm is now protected by an opaque internal lock from generic storage structure that forms part of the larger structure seems like a bad design pattern to me... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
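The xas_store()/xas_nomem() retry loop in the patch above can be modelled in userspace to show why it guarantees forward progress: the store fails with -ENOMEM when no node is available, and the nomem helper then allocates one outside the lock and requests a retry. All names here (st_store, st_nomem, struct xstate) are invented stand-ins for illustration, not the kernel xarray API.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Userspace sketch of the advanced-API retry pattern: st_store()
 * consumes a preallocated node or fails with -ENOMEM; st_nomem()
 * allocates a node (standing in for the GFP_NOFS allocation done
 * with the tree lock dropped) and asks the caller to retry.
 */
struct xstate {
	void *prealloc;		/* node allocated by st_nomem(), if any */
	int error;
};

static void st_store(struct xstate *st)
{
	if (!st->prealloc) {
		st->error = -ENOMEM;	/* GFP_NOWAIT allocation "failed" */
		return;
	}
	/* consume the node and perform the insert */
	free(st->prealloc);
	st->prealloc = NULL;
	st->error = 0;
}

static bool st_nomem(struct xstate *st)
{
	if (st->error != -ENOMEM)
		return false;		/* done: no retry needed */
	st->prealloc = malloc(64);	/* blocking alloc, lock dropped */
	st->error = 0;
	return st->prealloc != NULL;	/* true: retry the store */
}

static int st_insert(void)
{
	struct xstate st = { NULL, 0 };

	do {
		/* lock would be taken here */
		st_store(&st);
		/* and dropped here */
	} while (st_nomem(&st));
	return st.error;
}
```

The first pass through the loop fails for want of a node, the second succeeds with the node allocated in between, which is the "retry, guaranteeing forward progress" behaviour described in the reply.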
Re: [PATCH v2] iomap: report collisions between directio and buffered writes to userspace
On Tue, Nov 21, 2017 at 04:52:53AM -0800, Matthew Wilcox wrote: > On Tue, Nov 21, 2017 at 05:48:15PM +1100, Dave Chinner wrote: > > On Mon, Nov 20, 2017 at 08:32:40PM -0800, Matthew Wilcox wrote: > > > On Mon, Nov 20, 2017 at 05:37:53PM -0800, Darrick J. Wong wrote: > > > > On Tue, Nov 21, 2017 at 09:27:49AM +1100, Dave Chinner wrote: > > > > > First thing I noticed was that "xa" as a prefix is already quite > > > > > widely used in XFS - it's shorthand for "XFS AIL". Indeed, xa_lock > > > > The X stands for 'eXpandable' or 'eXtending'. I really don't want to > > > use more than a two-letter acronym for whatever the XArray ends up being > > > called. One of the problems with the radix tree is that everything has > > > that 11-character 'radix_tree_' prefix; just replacing that with 'xa_' > > > makes a huge improvement to readability. > > > > Yeah, understood. That's why > > we have very little clear > > prefix namespace left :/ > > > > [ somedays I write something that looks sorta like a haiku, and from > > that point on everything just starts falling out of my brain that > > way. I blame Eric for this today! :P ] > > When the namespace is > tight we must consider the > competing users. > > The earliest us'r > has a claim to a prefix > we are used to it. > > Also a more wide- > spread user has a claim to > a shorter prefix. > > Would you mind changing > your prefix to one only > one letter longer? We can do something like that, though Darrick has the final say in it. > ... ok, I give up ;-) Top marks for effort :) > All your current usage of the xa_ prefix looks somewhat like this: > > fs/xfs/xfs_trans_ail.c: spin_lock(&ailp->xa_lock); > > with honourable mentions to: > fs/xfs/xfs_log.c: spin_lock(&mp->m_ail->xa_lock); > > Would you mind if I bolt a patch on to the front of the series called > something like "free up xa_ namespace" that renamed your xa_* to ail_*? > There are no uses of the 'ail_' prefix in the kernel today. 
>
> I don't think that
>	spin_lock(&ailp->ail_lock);
> loses any readability.

Not sure that's going to work - there's an "xa_ail" member for the AIL
list head. That would now read in places:

	if (list_empty(&ailp->ail_ail))

I'd be inclined to just drop the "xa_" prefix from the XFS structure.
There is no information loss by removing the prefix in the XFS code
because the pointer name tells us what structure it is pointing at.

> By the way, what does AIL stand for? It'd be nice if it were spelled out
> in at least one of the header files, maybe fs/xfs/xfs_trans_priv.h?

Active Item List. See the section "Tracking Changes" in
Documentation/filesystems/xfs-delayed-logging-design.txt for the full
rundown, but in short:

"The log item is already used to track the log items that have been
written to the log but not yet written to disk. Such log items are
considered "active" and as such are stored in the Active Item List
(AIL) which is a LSN-ordered double linked list. Items are inserted
into this list during log buffer IO completion, after which they are
unpinned and can be written to disk."

The lack of comments describing the AIL is historic - it's never been
documented in the code, nor has the whole relogging concept it
implements. I wrote the document above when introducing the CIL
(Committed Item List) because almost no-one actively working on XFS
understood how the whole journalling subsystem worked in any detail.

> > Zoetrope Array.
> > Labyrinth of illusion.
> > Structure never ends.
>
> Thank you for making me look up zoetrope ;-)

My pleasure :)

Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] iomap: report collisions between directio and buffered writes to userspace
On Mon, Nov 20, 2017 at 08:32:40PM -0800, Matthew Wilcox wrote:
> On Mon, Nov 20, 2017 at 05:37:53PM -0800, Darrick J. Wong wrote:
> > On Tue, Nov 21, 2017 at 09:27:49AM +1100, Dave Chinner wrote:
> > > First thing I noticed was that "xa" as a prefix is already quite
> > > widely used in XFS - it's shorthand for "XFS AIL". Indeed, xa_lock
> > > already exists and is quite widely used, so having a generic
> > > interface using the same prefixes and lock names is going to be
> > > quite confusing in the XFS code. Especially considering there's
> > > fair bit of radix tree use in XFS (e.g. the internal inode and
> > > dquot caches).
> > >
> > > FYI, from fs/xfs/xfs_trans_priv.h:
> > >
> > > /*
> > >  * Private AIL structures.
> > >  *
> > >  * Eventually we need to drive the locking in here as well.
> > >  */
> > > struct xfs_ail {
> > > 	struct xfs_mount	*xa_mount;
> > > 	struct task_struct	*xa_task;
> > > 	struct list_head	xa_ail;
> > > 	xfs_lsn_t		xa_target;
> > > 	xfs_lsn_t		xa_target_prev;
> > > 	struct list_head	xa_cursors;
> > > 	spinlock_t		xa_lock;
> > > 	xfs_lsn_t		xa_last_pushed_lsn;
> > > 	int			xa_log_flush;
> > > 	struct list_head	xa_buf_list;
> > > 	wait_queue_head_t	xa_empty;
> > > };
> > >
> > > FWIW, why is it named "XArray"? "X" stands for what? It still
> > > looks like a tree structure to me, but without a design doc I'm a
> > > bit lost to how it differs to the radix tree (apart from the API)
> > > and why it's considered an "array".
> >
> > /me nominates 'xarr' for the prefix because pirates. :P
>
> The X stands for 'eXpandable' or 'eXtending'. I really don't want to
> use more than a two-letter acronym for whatever the XArray ends up being
> called. One of the problems with the radix tree is that everything has
> that 11-character 'radix_tree_' prefix; just replacing that with 'xa_'
> makes a huge improvement to readability.

Yeah, understood.
That's why we have very little clear prefix namespace left :/

[ somedays I write something that looks sorta like a haiku, and from
that point on everything just starts falling out of my brain that
way. I blame Eric for this today! :P ]

> I'm open to renaming it, but I want a good name. I think it needs to
> be *something* array, so we're looking for a prefix used nowhere in the
> kernel. Running 'git grep' in turn for '\<[a-z]a_' really only allows
> for ja, ya and za. 'ja_' and 'ya_' only have one user, while 'za_'
> has none. So ... anyone got a good retcon for JArray, YArray or ZArray?

A Yellow Array.
Colour is irrelevant.
The bikeshed looks nice.

> It's considered an array because it behaves "as if" it's an infinite
> array.

Infinite Array.
Namespace prefix collision
this haiku can't solve.

> The fact that we chop it up into chunks of 64 entries, and then
> arrange those chunks into a tree is not something the average user needs
> to concern themselves with.

Jumbo Array. Many
pieces hidden behind walls.
Will anyone notice?

> We have a lot of places in the kernel that
> fuss around with "allocate N entries, whoops, we exceeded N, kmalloc some
> more, oh no kmalloc failed, vmalloc it instead". So the idea is to give
> these places an interface that acts like an automatically-resizing array.

Zoetrope Array.
Labyrinth of illusion.
Structure never ends.

Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH v2] iomap: report collisions between directio and buffered writes to userspace
On Mon, Nov 20, 2017 at 01:51:00PM -0800, Matthew Wilcox wrote:
> On Tue, Nov 21, 2017 at 07:26:06AM +1100, Dave Chinner wrote:
> > On Mon, Nov 20, 2017 at 08:18:29AM -0800, Matthew Wilcox wrote:
> > > On Fri, Nov 17, 2017 at 11:39:25AM -0800, Darrick J. Wong wrote:
> > > > If two programs simultaneously try to write to the same part of a file
> > > > via direct IO and buffered IO, there's a chance that the post-diowrite
> > > > pagecache invalidation will fail on the dirty page. When this happens,
> > > > the dio write succeeded, which means that the page cache is no longer
> > > > coherent with the disk!
> > >
> > > This seems like a good opportunity to talk about what I've been working
> > > on for solving this problem. The XArray is going to introduce a set
> > > of entries which can be stored to locations in the page cache that I'm
> > > calling 'wait entries'.
> >
> > What's this XArray thing you speak of?
>
> Ah, right, you were on sabbatical at LSFMM this year where I talked
> about it. Briefly, it's a new API for the radix tree. The data structure
> is essentially unchanged (minor enhancements), but I'm rationalising
> existing functionality and adding new abilities. And getting rid of
> misfeatures like the preload API and implicit GFP flags.
>
> My current working tree is here:
>
> http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/xarray-2017-11-20

First thing I noticed was that "xa" as a prefix is already quite
widely used in XFS - it's shorthand for "XFS AIL". Indeed, xa_lock
already exists and is quite widely used, so having a generic
interface using the same prefixes and lock names is going to be
quite confusing in the XFS code. Especially considering there's
fair bit of radix tree use in XFS (e.g. the internal inode and
dquot caches).

FYI, from fs/xfs/xfs_trans_priv.h:

/*
 * Private AIL structures.
 *
 * Eventually we need to drive the locking in here as well.
 */
struct xfs_ail {
	struct xfs_mount	*xa_mount;
	struct task_struct	*xa_task;
	struct list_head	xa_ail;
	xfs_lsn_t		xa_target;
	xfs_lsn_t		xa_target_prev;
	struct list_head	xa_cursors;
	spinlock_t		xa_lock;
	xfs_lsn_t		xa_last_pushed_lsn;
	int			xa_log_flush;
	struct list_head	xa_buf_list;
	wait_queue_head_t	xa_empty;
};

> Ignoring the prep patches, the excitement is all to be found with the
> commits which start 'xarray:'

FWIW, why is it named "XArray"? "X" stands for what? It still
looks like a tree structure to me, but without a design doc I'm a
bit lost to how it differs to the radix tree (apart from the API)
and why it's considered an "array".

> If you want an example of it in use, I'm pretty happy with this patch
> that switches the brd driver entirely from the radix tree API to the
> xarray API:
>
> http://git.infradead.org/users/willy/linux-dax.git/commitdiff/dbf96ae943e43563cbbaa26e21b656b6fe8f4b0f

Looks pretty neat, but I'll reserve judgement for when I see the
conversion of the XFS radix tree code.

> I've been pretty liberal with the kernel-doc, but I haven't written out
> a good .rst file to give an overview of how to use it.

Let me know when you've written it :)

Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH v2] iomap: report collisions between directio and buffered writes to userspace
On Mon, Nov 20, 2017 at 08:18:29AM -0800, Matthew Wilcox wrote:
> On Fri, Nov 17, 2017 at 11:39:25AM -0800, Darrick J. Wong wrote:
> > If two programs simultaneously try to write to the same part of a file
> > via direct IO and buffered IO, there's a chance that the post-diowrite
> > pagecache invalidation will fail on the dirty page. When this happens,
> > the dio write succeeded, which means that the page cache is no longer
> > coherent with the disk!
>
> This seems like a good opportunity to talk about what I've been working
> on for solving this problem. The XArray is going to introduce a set
> of entries which can be stored to locations in the page cache that I'm
> calling 'wait entries'.

What's this XArray thing you speak of?

Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
Re: Need help with incremental backup strategy (snapshots, defragmenting & performance)
On Tue, Nov 14, 2017 at 3:50 AM, Roman Mamedov <r...@romanrm.net> wrote:
> On Mon, 13 Nov 2017 22:39:44 -0500
> Dave <davestechs...@gmail.com> wrote:
>
> > I have my live system on one block device and a backup snapshot of it
> > on another block device. I am keeping them in sync with hourly rsync
> > transfers.
> >
> > Here's how this system works in a little more detail:
> >
> > 1. I establish the baseline by sending a full snapshot to the backup
> > block device using btrfs send-receive.
> > 2. Next, on the backup device I immediately create a rw copy of that
> > baseline snapshot.
> > 3. I delete the source snapshot to keep the live filesystem free of
> > all snapshots (so it can be optimally defragmented, etc.)
> > 4. Hourly, I take a snapshot of the live system, rsync all changes to
> > the backup block device, and then delete the source snapshot. This
> > hourly process takes less than a minute currently. (My test system has
> > only moderate usage.)
> > 5. Hourly, following the above step, I use snapper to take a snapshot
> > of the backup subvolume to create/preserve a history of changes. For
> > example, I can find the version of a file 30 hours prior.
>
> Sounds a bit complex, I still don't get why you need all these snapshot
> creations and deletions, and even still using btrfs send-receive.

Hopefully, my comments below will explain my reasons.

> Here is my scheme:
>
> /mnt/dst                  <- mounted backup storage volume
> /mnt/dst/backup           <- a subvolume
> /mnt/dst/backup/host1/    <- rsync destination for host1, regular directory
> /mnt/dst/backup/host2/    <- rsync destination for host2, regular directory
> /mnt/dst/backup/host3/    <- rsync destination for host3, regular directory
> etc.
>
> /mnt/dst/backup/host1/bin/
> /mnt/dst/backup/host1/etc/
> /mnt/dst/backup/host1/home/
> ...
> Self explanatory. All regular directories, not subvolumes.
>
> Snapshots:
> /mnt/dst/snaps/backup                   <- a regular directory
> /mnt/dst/snaps/backup/2017-11-14T12:00/ <- snapshot 1 of /mnt/dst/backup
> /mnt/dst/snaps/backup/2017-11-14T13:00/ <- snapshot 2 of /mnt/dst/backup
> /mnt/dst/snaps/backup/2017-11-14T14:00/ <- snapshot 3 of /mnt/dst/backup
>
> Accessing historic data:
> /mnt/dst/snaps/backup/2017-11-14T12:00/host1/bin/bash
> ...
> /bin/bash for host1 as of 2017-11-14 12:00 (time on the backup system).
>
> No need for btrfs send-receive, only plain rsync is used, directly from
> hostX:/ to /mnt/dst/backup/host1/;

I prefer to start with a BTRFS snapshot at the backup destination. I
think that's the most "accurate" starting point.

> No need to create or delete snapshots during the actual backup process;

Then you can't guarantee consistency of the backed-up information.

> A single common timeline is kept for all hosts to be backed up, snapshot
> count not multiplied by the number of hosts (in my case the backup
> location is multi-purpose, so I somewhat care about total number of
> snapshots there as well);
>
> Also, all of this works even with source hosts which do not use Btrfs.

That's not a concern for me because I prefer to use BTRFS everywhere.
Re: Need help with incremental backup strategy (snapshots, defragmenting & performance)
On Wed, Nov 1, 2017 at 1:15 AM, Roman Mamedov <r...@romanrm.net> wrote:
> On Wed, 1 Nov 2017 01:00:08 -0400
> Dave <davestechs...@gmail.com> wrote:
>
>> To reconcile those conflicting goals, the only idea I have come up
>> with so far is to use btrfs send-receive to perform incremental
>> backups as described here:
>> https://btrfs.wiki.kernel.org/index.php/Incremental_Backup .
>
> Another option is to just use the regular rsync to a designated
> destination subvolume on the backup host, AND snapshot that subvolume
> on that host from time to time (or on backup completions, if you can
> synchronize that).
>
> rsync --inplace will keep space usage low as it will not reupload
> entire files in case of changes/additions to them.
>
> Yes rsync has to traverse both directory trees to find changes, but
> that's pretty fast (couple of minutes at most, for a typical root
> filesystem), especially if you use SSD or SSD caching.

Hello. I am implementing this suggestion. So far, so good. However, I
need some further recommendations on rsync options to use for this
purpose.

My rsync command currently looks like this:

rsync -axAHv --inplace --delete-delay --exclude-from="/some/file"
"$source_snapshot/" "$backup_location"

In particular, I want to know if I should or should not be using these
options:

-H, --hard-links        preserve hard links
-A, --acls              preserve ACLs (implies -p)
-X, --xattrs            preserve extended attributes
-x, --one-file-system   don't cross filesystem boundaries

I had to use the "x" option to prevent rsync from deleting files in
snapshots in the backup location (as the source location does not
retain any snapshots). Is there a better way?

I have my live system on one block device and a backup snapshot of it
on another block device. I am keeping them in sync with hourly rsync
transfers.

Here's how this system works in a little more detail:

1. I establish the baseline by sending a full snapshot to the backup
block device using btrfs send-receive.
2.
Next, on the backup device I immediately create a rw copy of that
baseline snapshot.
3. I delete the source snapshot to keep the live filesystem free of
all snapshots (so it can be optimally defragmented, etc.)
4. Hourly, I take a snapshot of the live system, rsync all changes to
the backup block device, and then delete the source snapshot. This
hourly process takes less than a minute currently. (My test system has
only moderate usage.)
5. Hourly, following the above step, I use snapper to take a snapshot
of the backup subvolume to create/preserve a history of changes. For
example, I can find the version of a file 30 hours prior.

The backup volume contains up to 100 snapshots while the live volume
has no snapshots. Best of both worlds? I guess I'll find out over time.
Re: [PATCH 2/6] writeback: allow for dirty metadata accounting
On Thu, Nov 09, 2017 at 02:30:57PM -0500, Josef Bacik wrote:
> From: Josef Bacik <jba...@fb.com>
>
> Provide a mechanism for file systems to indicate how much dirty metadata
> they are holding. This introduces a few things
>
> 1) Zone stats for dirty metadata, which is the same as the NR_FILE_DIRTY.
> 2) WB stat for dirty metadata. This way we know if we need to try and
> call into the file system to write out metadata. This could potentially
> be used in the future to make balancing of dirty pages smarter.

Ok, so when you have 64k page size and 4k metadata block size and
you're using kmalloc() to allocate the storage for the metadata, how
do we make use of all this page-based metadata accounting stuff?

We could use dirty metadata accounting infrastructure in XFS so the
mm/ subsystem can push on dirty metadata before we get into reclaim
situations, but I just can't see how this code works when raw pages
are not used to back the metadata cache.

That is, XFS can use various different sizes of metadata buffers in
the same filesystem. For example, we use sector sized buffers for
static AG metadata, filesystem blocks for per-AG metadata, some
multiple (1-16) of filesystem blocks for inode buffers, and some
multiple (1-128) of filesystem blocks for directory blocks.

This means we have a combination of buffers we need to account for
that are made up of: heap memory when buffer size < page size, single
pages when buffer size == page size, and multiple pages when buffer
size > page size.

The default filesystem config on a 4k page machine with 512 byte
sectors will create buffers in all three categories above, which tends
to indicate we can't use this new generic infrastructure as proposed.

Any thoughts about how we could efficiently support accounting for
variable sized, non-page based metadata with this generic
infrastructure?

Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
Re: Problem with file system
On Sat, Nov 4, 2017 at 1:25 PM, Chris Murphy <li...@colorremedies.com> wrote:
> On Sat, Nov 4, 2017 at 1:26 AM, Dave <davestechs...@gmail.com> wrote:
> > On Mon, Oct 30, 2017 at 5:37 PM, Chris Murphy <li...@colorremedies.com>
> > wrote:
> >>
> >> That is not a general purpose file system. It's a file system for
> >> admins who understand where the bodies are buried.
> >
> > I'm not sure I understand your comment...
> >
> > Are you saying BTRFS is not a general purpose file system?
>
> I'm suggesting that any file system that burdens the user with more
> knowledge to stay out of trouble than the widely considered general
> purpose file systems of the day, is not a general purpose file system.
>
> And yes, I'm suggesting that Btrfs is at risk of being neither general
> purpose, and not meeting its design goals as stated in Btrfs
> documentation. It is not easy to admin *when things go wrong*. It's
> great before then. It's a butt ton easier to resize, replace devices,
> take snapshots, and so on. But when it comes to fixing it when it goes
> wrong? It is a goddamn Choose Your Own Adventure book. It's way, way
> more complicated than any other file system I'm aware of.

It sounds like a large part of that could be addressed with better
documentation. I know that documentation such as what you are
suggesting would be really valuable to me!

> > If btrfs isn't able to serve as a general purpose file system for
> > Linux going forward, which file system(s) would you suggest can fill
> > that role? (I can't think of any that are clearly all-around better
> > than btrfs now, or that will be in the next few years.)
>
> ext4 and XFS are clearly the file systems to beat. They almost always
> recover from crashes with just a normal journal replay at mount time,
> file system repair is not often needed. When it is needed, it usually
> works, and there is just the one option to repair and go with it.
> Btrfs has piles of repair options, mount time options, btrfs check has
> options, btrfs rescue has options, it's a bit nutty honestly. And
> there's zero guidance in the available docs what order to try things
> in, not least of which some of these repair tools are still considered
> dangerous at least in the man page text, and the order depends on the
> failure. The user is burdened with way too much.

Neither one of those file systems offers snapshots. (And when I
compared LVM snapshots vs BTRFS snapshots, I got the impression BTRFS
is the clear winner.) Snapshots and volumes have a lot of value to me
and I would not enjoy going back to a file system without those
features.

> Even as much as I know about Btrfs having used it since 2008 and my
> list activity, I routinely have WTF moments when people post problems,
> what order to try to get things going again. Easy to admin? Yeah for
> the most part. But stability is still a problem, and it's coming up on
> a 10 year anniversary soon.
>
> If I were equally familiar with ZFS on Linux as I am with Btrfs, I'd
> use ZoL hands down.

Might it be the case that if you were equally familiar with ZFS, you
would become aware of more of its pitfalls? And that greater knowledge
could always lead to a different decision (such as favoring BTRFS)?
In my experience the grass is always greener when I am less familiar
with the field.
Re: Problem with file system
On Mon, Oct 30, 2017 at 5:37 PM, Chris Murphy wrote:
>
> That is not a general purpose file system. It's a file system for
> admins who understand where the bodies are buried.

I'm not sure I understand your comment...

Are you saying BTRFS is not a general purpose file system?

If btrfs isn't able to serve as a general purpose file system for
Linux going forward, which file system(s) would you suggest can fill
that role? (I can't think of any that are clearly all-around better
than btrfs now, or that will be in the next few years.)

Or maybe you meant something else?
Re: Need help with incremental backup strategy (snapshots, defragmenting & performance)
On Thu, Nov 2, 2017 at 4:46 PM, Kai Krakow <hurikha...@gmail.com> wrote:
> Am Wed, 1 Nov 2017 02:51:58 -0400
> schrieb Dave <davestechs...@gmail.com>:
>
>> >> To reconcile those conflicting goals, the only idea I have come up
>> >> with so far is to use btrfs send-receive to perform incremental
>> >> backups
>> >
>> > As already said by Roman Mamedov, rsync is a viable alternative to
>> > send-receive with much less hassle. According to some reports it
>> > can even be faster.
>>
>> Thanks for confirming. I must have missed those reports. I had never
>> considered this idea until now -- but I like it.
>>
>> Are there any blogs or wikis where people have done something similar
>> to what we are discussing here?
>
> I used rsync before, backup source and destination both were btrfs. I
> was experiencing the same btrfs bug from time to time on both devices,
> luckily not at the same time.
>
> I instead switched to using borgbackup, and xfs as the destination (to
> not fall into the same-bug-on-two-devices pitfall).

I'm going to stick with btrfs everywhere. My reasoning is that my
biggest pitfalls will be related to lack of knowledge. So focusing on
learning one filesystem better (vs poorly learning two) is the better
strategy for me, given my limited time. (I'm not an IT professional of
any sort.)

Is there any problem with the Borgbackup repository being on btrfs?

> Borgbackup achieves a much higher deduplication density and
> compression, and as such is also able to store much more backup
> history in the same storage space. The first run is much slower than
> rsync (due to enabled compression) but successive runs are much faster
> (like 20 minutes per backup run instead of 4-5 hours).
>
> I'm currently storing 107 TB of backup history in just 2.2 TB backup
> space, which counts a little more than one year of history now,
> containing 56 snapshots.
> This is my retention policy:
>
> * 5 yearly snapshots
> * 12 monthly snapshots
> * 14 weekly snapshots (worth around 3 months)
> * 30 daily snapshots
>
> Restore is fast enough, and a snapshot can even be fuse-mounted (tho,
> in that case mounted access can be very slow navigating directories).
>
> With the latest borgbackup version, the backup time increased to
> around 1 hour from 15-20 minutes in the previous version. That is due
> to switching the file cache strategy from mtime to ctime. This can be
> tuned to get back to old performance, but it may miss some files
> during backup if you're doing awkward things to file timestamps.
>
> I'm also backing up some servers with it now, then use rsync to sync
> the borg repository to an offsite location.
>
> Combined with same-fs local btrfs snapshots with short retention times,
> this could be a viable solution for you.

Yes, I appreciate the idea. I'm going to evaluate both rsync and
Borgbackup.

The advantage of rsync, I think, is that it will likely run in just a
couple minutes. That will allow me to run it hourly and to keep my
live volume almost entirely free of snapshots and fully defragmented.
It's also very simple as I already have rsync. And since I'm going to
run btrfs on the backup volume, I can perform hourly snapshots there
and use Snapper to manage retention. It's all very simple and relies
on tools I already have and know.

However, the advantages of Borgbackup you mentioned (much higher
deduplication density and compression) make it worth considering.
Maybe Borgbackup won't take long to complete successive (incremental)
backups on my system. I'll have to try it to see. It's a very nice
looking project. I'm surprised I never heard of it before.
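For reference, the retention policy quoted above maps directly onto borg prune's --keep-* flags (the repository path here is a placeholder, not one from the thread):

```shell
#!/bin/sh
# The retention policy quoted above, expressed as a borg prune command.
# The repository path is a placeholder.
REPO=/mnt/backup/borg-repo

borg prune --list \
    --keep-daily 30 --keep-weekly 14 \
    --keep-monthly 12 --keep-yearly 5 \
    "$REPO"
```

Run after each backup, this keeps the newest archive in each daily/weekly/monthly/yearly bucket and deletes the rest, which is how a year-plus of history stays bounded at a few dozen archives.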
Re: defragmenting best practice?
On Thu, Nov 2, 2017 at 7:07 AM, Austin S. Hemmelgarn <ahferro...@gmail.com> wrote:
> On 2017-11-01 21:39, Dave wrote:
>> I'm going to make this change now. What would be a good way to
>> implement this so that the change applies to the $HOME/.cache of each
>> user?
>>
>> The simple way would be to create a new subvolume for each existing
>> user and mount it at $HOME/.cache in /etc/fstab, hard coding that
>> mount location for each user. I don't mind doing that as there are
>> only 4 users to consider. One minor concern is that it adds an
>> unexpected step to the process of creating a new user. Is there a
>> better way?
>
> The easiest option is to just make sure nobody is logged in and run the
> following shell script fragment:
>
> for dir in /home/* ; do
>     rm -rf $dir/.cache
>     btrfs subvolume create $dir/.cache
> done
>
> And then add something to the user creation scripts to create that
> subvolume. This approach won't pollute /etc/fstab, will still exclude
> the directory from snapshots, and doesn't require any hugely creative
> work to integrate with user creation and deletion.
>
> In general, the contents of the .cache directory are just that, cached
> data. Provided nobody is actively accessing it, it's perfectly safe to
> just nuke the entire directory...

I like this suggestion. Thank you.

I had intended to mount the .cache subvolumes with the NODATACOW
option. However, with this approach, I won't be explicitly mounting
the .cache subvolumes. Is it possible to use "chattr +C $dir/.cache"
in that loop even though it is a subvolume?

And, is setting the .cache directory to NODATACOW the right choice
given this scenario? From earlier comments, I believe it is, but I
want to be sure I understood correctly.
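As far as I know (worth verifying on a scratch subvolume first), a subvolume's top level behaves like an ordinary directory for chattr, so the flag can be set inside the same loop; the sketch below extends the quoted fragment on that assumption:

```shell
#!/bin/sh
# Sketch: create each per-user .cache subvolume and mark it NODATACOW.
# Assumption: a subvolume's top level accepts chattr like a directory.
# Note: +C only affects files created *after* the flag is set, so it
# must be applied before the cache is repopulated.
for dir in /home/* ; do
    rm -rf "$dir/.cache"
    btrfs subvolume create "$dir/.cache"
    chattr +C "$dir/.cache"     # new files inherit NODATACOW
    lsattr -d "$dir/.cache"     # the 'C' flag should be listed
done
```

Quoting "$dir" also guards against home directories with spaces in their names, which the quoted fragment did not handle.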
Re: defragmenting best practice?
On Thu, Nov 2, 2017 at 5:16 PM, Kai Krakow wrote:
>
> You may want to try the btrfs autodefrag mount option and see if it
> improves things (tho, the effect may take days or weeks to apply if you
> didn't enable it right from the creation of the filesystem).
>
> Also, autodefrag will probably unshare reflinks on your snapshots. You
> may be able to use bees[1] to work against this effect. Its interaction
> with autodefrag is not well tested but it works fine for me. Also, bees
> is able to reduce some of the fragmentation during deduplication
> because it will rewrite extents back into bigger chunks (but only for
> duplicated data).
>
> [1]: https://github.com/Zygo/bees

I will look into bees. And yes, I plan to try autodefrag. (I already
have it enabled now.)

However, I need to understand something about how btrfs send-receive
works in regard to reflinks and fragmentation.

Say I have 2 snapshots on my live volume. The earlier one of them has
already been sent to another block device by btrfs send-receive (full
backup). Now defrag runs on the live volume and breaks some percentage
of the reflinks. At this point I do an incremental btrfs send-receive
using "-p" (or "-c") with the diff going to the same other block
device where the prior snapshot was already sent.

Will reflinks be "made whole" (restored) on the receiving block
device? Or is the state of the source volume replicated so closely
that reflink status is the same on the target?

Also, is fragmentation reduced on the receiving block device? My
expectation is that fragmentation would be reduced and duplication
would be reduced too. In other words, does send-receive result in
defragmentation and deduplication too?
Re: defragmenting best practice?
On Thu, Nov 2, 2017 at 7:17 AM, Austin S. Hemmelgarn wrote:
>> And the worst performing machine was the one with the most RAM and a
>> fast NVMe drive and top of the line hardware.
>
> Somewhat nonsensically, I'll bet that NVMe is a contributing factor in
> this particular case. NVMe has particularly bad performance with the
> old block IO schedulers (though it is NVMe, so it should still be
> better than a SATA or SAS SSD), and the new blk-mq framework just got
> scheduling support in 4.12, and only got reasonably good scheduling
> options in 4.13. I doubt it's the entirety of the issue, but it's
> probably part of it.

Thanks for that news. Based on that, I assume the advice here (to use
noop for NVMe) is now outdated?
https://stackoverflow.com/a/27664577/463994

Is the solution as simple as running a kernel >= 4.13? Or do I need to
specify which scheduler to use?

I just checked one computer:

uname -a
Linux morpheus 4.13.5-1-ARCH #1 SMP PREEMPT Fri Oct 6 09:58:47 CEST 2017 x86_64 GNU/Linux

$ sudo find /sys -name scheduler -exec grep . {} +
/sys/devices/pci0000:00/0000:00:1d.0/0000:08:00.0/nvme/nvme0/nvme0n1/queue/scheduler:[none] mq-deadline kyber bfq

From this article, it sounds like (maybe) I should use kyber. I see
kyber listed in the output above, so I assume that means it is
available. I also think [none] is the current scheduler being used, as
it is in brackets.

I checked this:
https://www.kernel.org/doc/Documentation/block/switching-sched.txt

Based on that, I assume I would do this at runtime:

echo kyber > /sys/devices/pci0000:00/0000:00:1d.0/0000:08:00.0/nvme/nvme0/nvme0n1/queue/scheduler

I assume this is equivalent:

echo kyber > /sys/block/nvme0n1/queue/scheduler

How would I set it permanently at boot time?

>> While Firefox and Linux in general have their performance "issues",
>> that's not relevant here. I'm comparing the same distros, same Firefox
>> versions, same Firefox add-ons, etc.
>> I eventually tested many hardware configurations: different CPU's, motherboards, GPU's, SSD's, RAM, etc. The only remaining difference I can find is that the computer with acceptable performance uses LVM + EXT4 while all the others use BTRFS.
>>
>> With all the great feedback I have gotten here, I'm now ready to retest this after implementing all the BTRFS-related suggestions I have received. Maybe that will solve the problem or maybe this mystery will continue...
>
> Hmm, if you're only using SSD's, that may partially explain things. I don't remember if it was mentioned earlier in this thread, but you might try adding 'nossd' to the mount options. The 'ssd' mount option (which gets set automatically if the device reports as non-rotational) impacts how the block allocator works, and that can have a pretty insane impact on performance.

I will test the "nossd" mount option.

> Additionally, independently from that, try toggling the 'discard' mount option. If you have it enabled, disable it, if you have it disabled, enable it. Inline discards can be very expensive on some hardware, especially older SSD's, and discards happen pretty frequently in a COW filesystem.

I have been following this advice, so I have never enabled discard for an NVMe drive. Do you think it is worth testing?

Solid State Drives/NVMe - ArchWiki
https://wiki.archlinux.org/index.php/Solid_State_Drives/NVMe
Discards:
Note: Although continuous TRIM is an option (albeit not recommended) for SSDs, NVMe devices should not be issued discards.
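[Editor's note: on the boot-time scheduler question earlier in this message, one common approach — assuming a udev-based distro such as Arch, and an arbitrary rules-file name — is a udev rule that sets the scheduler whenever the device appears:]

```shell
#!/bin/sh
# Sketch: persist the kyber scheduler for NVMe devices across boots via
# a udev rule. Written to /tmp here for illustration only; the real file
# would go in /etc/udev/rules.d/60-ioscheduler.rules (name is arbitrary).
RULE='ACTION=="add|change", KERNEL=="nvme[0-9]n[0-9]", ATTR{queue/scheduler}="kyber"'
printf '%s\n' "$RULE" > /tmp/60-ioscheduler.rules
cat /tmp/60-ioscheduler.rules
```

After installing the rule under /etc/udev/rules.d/, it takes effect on the next boot (or after `udevadm control --reload && udevadm trigger`).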
Parity-based redundancy (RAID5/6/triple parity and beyond) on BTRFS and MDADM (Dec 2014) – Ronny Egners Blog
Has this been discussed here? Has anything changed since it was written?

Parity-based redundancy (RAID5/6/triple parity and beyond) on BTRFS and MDADM (Dec 2014) – Ronny Egners Blog
http://blog.ronnyegner-consulting.de/2014/12/10/parity-based-redundancy-raid56triple-parity-and-beyond-on-btrfs-and-mdadm-dec-2014/comment-page-1/

TL;DR: There are patches to extend the linux kernel to support up to 6 parity disks but BTRFS does not want them because it does not fit their “business case” and MDADM would want them but somebody needs to develop patches for the MDADM component. The kernel raid implementation is ready and usable. If someone volunteers to do this kind of work I would support with equipment and myself as a test resource.
Re: defragmenting best practice?
On Wed, Nov 1, 2017 at 8:21 AM, Austin S. Hemmelgarn wrote:

>> The cache is in a separate location from the profiles, as I'm sure you know. The reason I suggested a separate BTRFS subvolume for $HOME/.cache is that this will prevent the cache files for all applications (for that user) from being included in the snapshots. We take frequent snapshots and (afaik) it makes no sense to include cache in backups or snapshots. The easiest way I know to exclude cache from BTRFS snapshots is to put it on a separate subvolume. I assumed this would make several things related to snapshots more efficient too.
>
> Yes, it will, and it will save space long-term as well since $HOME/.cache is usually the most frequently modified location in $HOME. In addition to not including this in the snapshots, it may also improve performance. Each subvolume is its own tree, with its own locking, which means that you can generally improve parallel access performance by splitting the workload across multiple subvolumes. Whether it will actually provide any real benefit in that respect is heavily dependent on the exact workload however, but it won't hurt performance.

I'm going to make this change now. What would be a good way to implement this so that the change applies to the $HOME/.cache of each user?

The simple way would be to create a new subvolume for each existing user and mount it at $HOME/.cache in /etc/fstab, hard coding that mount location for each user. I don't mind doing that as there are only 4 users to consider. One minor concern is that it adds an unexpected step to the process of creating a new user. Is there a better way?
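[Editor's note: a loop over existing home directories is one way to script the per-user subvolume setup. This is a sketch, not a turnkey script — the BASE default and paths are assumptions, and run() only prints the commands; drop the echo to execute as root.]

```shell
#!/bin/sh
# Create one .cache subvolume per existing user (sketch).
run() { echo "+ $*"; }
BASE="${BASE:-/home}"

for home in "$BASE"/*; do
    [ -d "$home" ] || continue
    user=${home##*/}
    run mv "$home/.cache" "$home/.cache.old"   # keep old contents aside
    run btrfs subvolume create "$home/.cache"
    run chown "$user:$user" "$home/.cache"
done
```

Note that because btrfs snapshots do not descend into child subvolumes, creating the subvolume in place is already enough to exclude it from snapshots of $HOME; an fstab entry is only needed if you want different mount options for it.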
Re: defragmenting best practice?
On Wed, Nov 1, 2017 at 1:48 PM, Peter Grandi wrote:

>> When defragmenting individual files on a BTRFS filesystem with COW, I assume reflinks between that file and all snapshots are broken. So if there are 30 snapshots on that volume, that one file will suddenly take up 30 times more space... [ ... ]
>
> Defragmentation works by effectively making a copy of the file contents (simplistic view), so the end result is one copy with 29 reflinked contents, and one copy with defragmented contents.

The clarification is much appreciated.

>> Can you also give an example of using find, as you suggested above? [ ... ]
>
> Well, one way is to use 'find' as a filtering replacement for 'defrag' option '-r', as in for example:
>
> find "$HOME" -xdev '(' -name '*.sqlite' -o -name '*.mk4' ')' \
>   -type f -print0 | xargs -0 btrfs fi defrag
>
> Another one is to find the most fragmented files first, or all files of at least 1M with at least say 100 fragments, as in:
>
> find "$HOME" -xdev -type f -size +1M -print0 | xargs -0 filefrag \
>   | perl -n -e 'print "$1\0" if (m/(.*): ([0-9]+) extents/ && $2 >= 100)' \
>   | xargs -0 btrfs fi defrag
>
> But there are many 'find' web pages and that is not quite a Btrfs related topic.

Your examples were perfect. I have experience using find in similar ways. I can take it from there. :-)

>> Background: I'm not sure why our Firefox performance is so terrible
>
> As I always say, "performance" is not the same as "speed", and probably your Firefox "performance" is sort of OKish even if the "speed" is terrible, and neither is likely related to the profile or the cache being on Btrfs.

Here's what happened. Two years ago I installed Kubuntu (with Firefox) on two desktop computers. One machine performed fine. Like you said, "sort of OKish" and that's what we expect with the current state of Linux. The other machine was substantially worse. We ran side-by-side real-world tests on these two machines for months.
Initially I did a lot of testing, troubleshooting and reconfiguration trying to get the second machine to perform as well as the first. I never had success. At first I thought it was related to the GPU (or driver). Then I thought it was because the first machine used the z170 chipset and the second was X99 based. But that wasn't it.

I have never solved the problem and I have been coming back to it periodically these last two years. In that time I have tried different distros from opensuse to Arch, and a lot of different hardware. Furthermore, my new machines have the same performance problem. The most interesting example is a high end machine with 256 GB of RAM. It showed substantially worse desktop application performance than any other computer here. All are running the exact same version of Firefox with the exact same add-ons. (The installations are carbon copies of each other.)

What originally caught my attention was earlier information in this thread:

On Wed, 20 Sep 2017 07:46:52 -0400, "Austin S. Hemmelgarn" wrote:

> > > Fragmentation: Files with a lot of random writes can become heavily fragmented (1+ extents) causing excessive multi-second spikes of CPU load on systems with an SSD or large amounts of RAM. On desktops this primarily affects application databases (including Firefox). Workarounds include manually defragmenting your home directory using btrfs fi defragment. Auto-defragment (mount option autodefrag) should solve this problem.
> >
> > Upon reading that I am wondering if fragmentation in the Firefox profile is part of my issue. That's one thing I never tested previously. (BTW, this system has 256 GB of RAM and 20 cores.)
>
> Almost certainly. Most modern web browsers are brain-dead and insist on using SQLite databases (or traditional DB files) for everything, including the cache, and the usage for the cache in particular kills performance when fragmentation is an issue.
It turns out that the first machine (which performed well enough) was the last one which was installed using LVM + EXT4. The second machine (the one with the original performance problem) and all subsequent machines have used BTRFS. And the worst performing machine was the one with the most RAM and a fast NVMe drive and top of the line hardware.

While Firefox and Linux in general have their performance "issues", that's not relevant here. I'm comparing the same distros, same Firefox versions, same Firefox add-ons, etc. I eventually tested many hardware configurations: different CPU's, motherboards, GPU's, SSD's, RAM, etc. The only remaining difference I can find is that the computer with acceptable performance uses LVM + EXT4 while all the others use BTRFS.

With all the great feedback I have gotten here, I'm now ready to retest this after implementing all the BTRFS-related suggestions I have received. Maybe that will solve the problem or maybe this mystery will continue...
Re: defragmenting best practice?
On Wed, Nov 1, 2017 at 9:31 AM, Duncan <1i5t5.dun...@cox.net> wrote:

> Dave posted on Tue, 31 Oct 2017 17:47:54 -0400 as excerpted:
>
>> 6. Make sure Firefox is running in multi-process mode. (Duncan's instructions, while greatly appreciated and very useful, left me slightly confused about pulseaudio's compatibility with multi-process mode.)
>
> Just to clarify:
>
> There's no problem with native pulseaudio and firefox multi-process mode.

Thank you for clarifying. And I appreciate your detailed explanation.

> Back when I posted that, a not e10s-enabled extension was actually quite likely, as e10s was still rather new. It's probably somewhat less so now, and firefox is of course on to the next big change, dropping the old "legacy chrome" extension support, in favor of the newer and generally Chromium-compatible WebExtensions/WE API, with firefox 57, to be released mid-month (Nov 14).

I am now running Firefox 57 beta and I'll be doing my testing with that using only WebExtensions.
Re: Need help with incremental backup strategy (snapshots, defragmenting & performance)
On Wed, Nov 1, 2017 at 4:34 AM, Marat Khalili wrote:

>> We do experience severe performance problems now, especially with Firefox. Part of my experiment is to reduce the number of snapshots on the live volumes, hence this question.
>
> Just for statistics, how many snapshots do you have and how often do you take them? It's on SSD, right?

I don't think the severe performance problems stem solely from the number of snapshots. I think it is also related to Firefox stuff (cache fragmentation, lack of multi-process mode maybe, etc.) I still have to investigate the Firefox issues, but I'm starting at the foundation by trying to get a basic BTRFS setup that will support better desktop application performance first.

The poor performance has existed from the beginning of using BTRFS + KDE + Firefox (almost 2 years ago), at a point when very few snapshots had yet been created. A comparison system running similar hardware as well as KDE + Firefox (and LVM + EXT4) did not have the performance problems. The difference has been consistent and significant. For a while I thought the difference was due to the hardware, as one system used the z170 chipset and the other used the X99 chipset (but were otherwise equivalent). So I repeated the testing on identical hardware and the stark performance difference remained. When I realized that, I began focusing on BTRFS, as it is the only consistent difference I can recognize.

Sometimes I have used Snapper settings like this:

TIMELINE_MIN_AGE="1800"
TIMELINE_LIMIT_HOURLY="36"
TIMELINE_LIMIT_DAILY="30"
TIMELINE_LIMIT_MONTHLY="12"
TIMELINE_LIMIT_YEARLY="10"

However, I also have some computers set like this:

TIMELINE_MIN_AGE="1800"
TIMELINE_LIMIT_HOURLY="10"
TIMELINE_LIMIT_DAILY="10"
TIMELINE_LIMIT_WEEKLY="0"
TIMELINE_LIMIT_MONTHLY="0"
TIMELINE_LIMIT_YEARLY="0"

> BTW beware of deleting too many snapshots at once with any tool. Delete few and let filesystem stabilize before proceeding.

OK, thanks for the tip.
> For deduplication tool to be useful you ought to have some duplicate data on your live volume. Do you have any (e.g. many LXC containers with the same distribution)?

No, no containers and no duplication to that large extent.

> P.S. I still think you need some off-system backup solution too, either rsync+snapshot-based over ssh or e.g. Burp (shameless advertising: http://burp.grke.org/ ).

I agree, but that's beyond the scope of the current problem I'm trying to solve. However, I'll check out Burp once I have a base configuration that is working satisfactorily.
Re: Need help with incremental backup strategy (snapshots, defragmenting & performance)
On Wed, Nov 1, 2017 at 2:19 AM, Marat Khalili wrote:

> You seem to have two tasks: (1) same-volume snapshots (I would not call them backups) and (2) updating some backup volume (preferably on a different box). By solving them separately you can avoid some complexity...

Yes, it appears that is a very good strategy -- solve the concerns separately. Make the live volume performant and the backup volume historical.

>> To reconcile those conflicting goals, the only idea I have come up with so far is to use btrfs send-receive to perform incremental backups
>
> As already said by Roman Mamedov, rsync is a viable alternative to send-receive with much less hassle. According to some reports it can even be faster.

Thanks for confirming. I must have missed those reports. I had never considered this idea until now -- but I like it. Are there any blogs or wikis where people have done something similar to what we are discussing here?

>> Given the hourly snapshots, incremental backups are the only practical option. They take mere moments. Full backups could take an hour or more, which won't work with hourly backups.
>
> I don't see much sense in re-doing full backups to the same physical device. If you care about backup integrity, it is probably more important to invest in backups verification. (OTOH, while you didn't reveal data size, if full backup takes just an hour on your system then why not?)

I was saying that a full backup could take an hour or more. That means full backups are not compatible with an hourly backup schedule. And it is certainly not a potential solution to making the system perform better because the system will be spending all its time running backups -- it would be never ending. With hourly backups, they should complete in just a few moments, which is the case with incremental backups. (It sounds like this will be the case with rsync as well.)
>> We will delete most snapshots on the live volume, but retain many (or all) snapshots on the backup block device. Is that a good strategy, given my goals?
>
> Depending on the way you use it, retaining even a dozen snapshots on a live volume might hurt performance (for high-performance databases) or be completely transparent (for user folders). You may want to experiment with this number.

We do experience severe performance problems now, especially with Firefox. Part of my experiment is to reduce the number of snapshots on the live volumes, hence this question.

> In any case I'd not recommend retaining ALL snapshots on backup device, even if you have infinite space. Such filesystem would be as dangerous as the demon core, only good for adding more snapshots (not even deleting them), and any little mistake will blow everything up. Keep a few dozen, hundred at most.

The intention -- if we were to keep all snapshots on a backup device -- would be to never ever try to delete them. However, with the suggestion to separate the concerns and use rsync, we could also easily run the Snapper timeline cleanup on the backup volume, thereby limiting the retained snapshots to some reasonable number.

> Unlike other backup systems, you can fairly easily remove snapshots in the middle of sequence, use this opportunity. My thin-out rule is: remove snapshot if resulting gap will be less than some fraction (e.g. 1/4) of its age. One day I'll publish portable solution on github.

Thanks. I hope you do find time to publish it. (And what do you mean by portable?) For now, Snapper has a cleanup algorithm that we can use. At least one of the tools listed here has a thin-out algorithm too: https://btrfs.wiki.kernel.org/index.php/Incremental_Backup

>> Given this minimal retention of snapshots on the live volume, should I defrag it (assuming there is at least 50% free space available on the device)? (BTW, is defrag OK on an NVMe drive? or an SSD?)
>> In the above procedure, would I perform that defrag before or after taking the snapshot? Or should I use autodefrag?
>
> I ended up using autodefrag, didn't try manual defragmentation. I don't use SSDs as backup volumes.

I don't use SSD's as backup volumes either. I was asking about the live volume.

>> Should I consider a dedup tool like one of these?
>
> Certainly NOT for snapshot-based backups: it is already deduplicated almost as much as possible, dedup tools can only make it *less* deduplicated.

The question is whether to use a dedup tool on the live volume which has a few snapshots. Even with the new strategy (based on rsync), the live volume may sometimes have two snapshots (pre- and post- pacman upgrades). I still wish to know, in that case, about using both a dedup tool and defragmenting the btrfs filesystem.

Also still wondering about these options: no-holes, skinny metadata, or extended inode refs?

This is a very helpful discussion. Thank you.
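[Editor's note: the thin-out rule quoted above ("remove snapshot if resulting gap will be less than some fraction, e.g. 1/4, of its age") could be sketched as below. This is one reading of the rule, so treat it as an assumption; ages are plain integers in hours rather than real snapshot timestamps.]

```shell
#!/bin/sh
# Thin-out sketch: ages in hours, oldest first. A snapshot is dropped
# when the gap its removal would leave between its neighbours is less
# than 1/4 of its age; the oldest and newest snapshots are always kept.
thin() {
    older=''; cur=''
    for newer in "$@" END; do
        if [ -n "$cur" ]; then
            if [ -n "$older" ] && [ "$newer" != END ] \
               && [ $(( (older - newer) * 4 )) -lt "$cur" ]; then
                echo "delete ${cur}h"
                cur=$newer     # removed: its neighbours close the gap
                continue
            fi
            echo "keep ${cur}h"
            older=$cur
        fi
        cur=$newer
    done
}

# With dense hourly snapshots, only the 9h one falls below the threshold:
thin 10 9 8 7 6 5 4 3 2 1
```

A real script would derive the ages from snapshot names or creation times and call `btrfs subvolume delete` instead of echoing.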
Re: Need help with incremental backup strategy (snapshots, defragmenting & performance)
On Wed, Nov 1, 2017 at 1:15 AM, Roman Mamedov <r...@romanrm.net> wrote:

> On Wed, 1 Nov 2017 01:00:08 -0400
> Dave <davestechs...@gmail.com> wrote:
>
>> To reconcile those conflicting goals, the only idea I have come up with so far is to use btrfs send-receive to perform incremental backups as described here: https://btrfs.wiki.kernel.org/index.php/Incremental_Backup .
>
> Another option is to just use the regular rsync to a designated destination subvolume on the backup host, AND snapshot that subvolume on that host from time to time (or on backup completions, if you can synchronize that).
>
> rsync --inplace will keep space usage low as it will not reupload entire files in case of changes/additions to them.

This seems like a brilliant idea, something that has a lot of potential...

On a system where the root filesystem is on an SSD and the backup volume on an HDD, I could rsync hourly, and then run Snapper on the backup volume hourly, as well as using Snapper's timeline cleanup on the backup volume. The live filesystem would have zero snapshots and could be optimized for performance. The backup volume could retain a large number of snapshots (even more than several hundred) because performance would not be very important (as far as I can guess). This seems to resolve our conflict.

How about on a system (such as a laptop) with only a single SSD? Would this same idea work where the backup volume is on the same block device? I know that is not technically a backup, but what it does accomplish is separation of the live filesystem from the snapshotted backup volume for performance reasons -- yet the hourly snapshot history is still available. That would seem to meet our use case too. (An external backup disk would be connected to the laptop periodically, of course, too.)
Currently, for most btrfs volumes, I have three volumes: the main volume, a snapshot subvolume which contains all the individual snapshots, and a backup volume* (on a different block device but on the same machine). With this new idea, I would have a main volume without any snapshots and a backup volume which contains all the snapshots. It simplifies things on that level and it also simplifies performance tuning on the main volume. In fact it simplifies backup snapshot management too.

My initial impression is that this simplifies everything as well as optimizing everything. So surely it must have some disadvantages compared to btrfs send-receive incremental backups (https://btrfs.wiki.kernel.org/index.php/Incremental_Backup). What would those disadvantages be?

The first one that comes to mind is that I would lose the functionality of pre- and post- upgrade snapshots on the root filesystem. But I think that's minor. I could either keep those two snapshots for a few hours or days after major upgrades or maybe I could find a pacman hook that uses rsync to make pre- and post- upgrade copies...

* Footnote: on some workstation computers, we have 2 or 3 separate backup block devices (e.g., external USB hard drives, etc.). Laptops, however, generally only have a single block device and are not always connected to an external USB hard drive for backup as often as would be ideal. But we also don't keep any critical data on laptops.
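[Editor's note: the rsync-then-snapshot cycle discussed in this message might look like the sketch below in cron-job form. Paths are hypothetical, and run() only prints each command.]

```shell
#!/bin/sh
# rsync-then-snapshot backup cycle (sketch). run() only prints commands;
# drop the echo to execute for real as root.
run() { echo "+ $*"; }

SRC=/home/
DEST=/mnt/backup/home           # a plain subvolume on the backup fs
SNAPDIR=/mnt/backup/.snapshots

# --inplace updates changed files in place, so blocks that did not
# change stay reflink-shared with earlier read-only snapshots of DEST;
# the snapshot is taken only if the sync succeeded (hence the &&).
run rsync -aHAX --delete --inplace "$SRC" "$DEST" \
  && run btrfs subvolume snapshot -r "$DEST" "$SNAPDIR/home-$(date +%Y%m%d%H%M)"
```

Snapper's timeline cleanup (or a thin-out script) would then prune $SNAPDIR on its own schedule, independently of the live volume.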
Need help with incremental backup strategy (snapshots, defragmenting & performance)
Our use case requires snapshots. btrfs snapshots are the best solution we have found for our requirements, and over the last year snapshots have proven their value to us. (For this discussion I am considering both the "root" volume and the "home" volume on a typical desktop workstation. Also, all btrfs volumes are mounted with the noatime and nodiratime flags.)

For performance reasons, I now wish to minimize the number of snapshots retained on the live btrfs volume. However, for backup purposes, I wish to maximize the number of snapshots retained over time. We'll keep yearly, monthly, weekly, daily and hourly snapshots for as long as possible.

To reconcile those conflicting goals, the only idea I have come up with so far is to use btrfs send-receive to perform incremental backups as described here: https://btrfs.wiki.kernel.org/index.php/Incremental_Backup . Given the hourly snapshots, incremental backups are the only practical option. They take mere moments. Full backups could take an hour or more, which won't work with hourly backups.

We will delete most snapshots on the live volume, but retain many (or all) snapshots on the backup block device. Is that a good strategy, given my goals?

The steps: I know step one is to do the "bootstrapping" where a full initial copy of the live volume is sent to the backup volume. I also know the steps for doing incremental backups. However, the first problem I see is that performing incremental backups requires both the live volume and the backup volume to have an identical "parent" snapshot before each new incremental can be sent. I have found it easy to accidentally delete that specific required parent snapshot when hourly snapshots are being taken and many snapshots exist.

Given that I want to retain the minimum number of snapshots on the live volume, how do I ensure that a valid "parent" subvolume exists there in order to perform the incremental backup?
(Again, I have often run into the error "no valid parent exists" when doing incremental backups.)

I think the rule is like this: Do not delete a snapshot from the live volume until the next snapshot based on it has been sent to the backup volume. In other words, always retain the *exact* snapshot that was the last one sent to the backup volume. Deleting that one then taking another one does not seem sufficient. BTRFS does not seem to recognize parent-child-grandchild relationships of snapshots when doing send-receive incremental backups.

However, maybe I'm wrong. Would it be sufficient to first take another snapshot, then delete the prior snapshot? Will the send-receive algorithm be able to infer a parent exists on the backup volume when it receives an incremental based on a child snapshot? (My experience says "no", but I'd like a more authoritative answer.)

The next step in my proposed procedure is to take a new snapshot, send it to the backup volume, and only then delete the prior snapshot (and only from the live volume*). Using this strategy, the live volume will always have the current snapshot (which I guess should not be called a snapshot -- it's the live volume) plus at least one more snapshot. Briefly, during the incremental backup, it will have an additional snapshot until the older one gets deleted.

Given this minimal retention of snapshots on the live volume, should I defrag it (assuming there is at least 50% free space available on the device)? (BTW, is defrag OK on an NVMe drive? or an SSD?)

In the above procedure, would I perform that defrag before or after taking the snapshot? Or should I use autodefrag?

Should I consider a dedup tool like one of these?
g2p/bedup: Btrfs deduplication
https://github.com/g2p/bedup

markfasheh/duperemove: Tools for deduping file systems
https://github.com/markfasheh/duperemove

Zygo/bees: Best-Effort Extent-Same, a btrfs dedup agent
https://github.com/Zygo/bees

Does anyone care to elaborate on the relationship between a dedup tool like Bees and defragmenting a btrfs filesystem with snapshots? I understand they do opposing things, but I think it was suggested in another thread on defragmenting that they can be combined to good effect. Should I consider this as a possible solution for my situation?

Should I consider any of these options: no-holes, skinny metadata, or extended inode refs?

Finally, are there any good BTRFS performance wiki articles or blogs I should refer to for my situation?

* Footnote: On the backup device, maybe we will never delete snapshots. In any event, that's not a concern now. We'll retain many, many snapshots on the backup device.
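[Editor's note: the parent-retention rule worked out in this message, in command form. A sketch with hypothetical paths; run() only prints each command. The key point is the ordering: the old parent is deleted only after the incremental that used it has been received.]

```shell
#!/bin/sh
# Bootstrap plus one incremental, always keeping the last-sent snapshot
# on the live side so it can serve as the next parent.
run() { echo "+ $*"; }

SNAPS=/live/.snapshots
DEST=/mnt/backup

# One-time bootstrap (full send):
run btrfs subvolume snapshot -r /live "$SNAPS/base"
run "btrfs send $SNAPS/base | btrfs receive $DEST"

# Each later cycle: snapshot, send with -p, THEN drop the old parent.
run btrfs subvolume snapshot -r /live "$SNAPS/next"
run "btrfs send -p $SNAPS/base $SNAPS/next | btrfs receive $DEST"
run btrfs subvolume delete "$SNAPS/base"   # safe only after the receive
```

'next' then becomes the new parent and must itself survive on both sides until the following incremental completes.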
Re: defragmenting best practice?
On Tue, Oct 31, 2017 at 7:06 PM, Peter Grandi wrote:

> Also nothing forces you to defragment a whole filesystem, you can just defragment individual files or directories by using 'find' with it.

Thanks for that info. When defragmenting individual files on a BTRFS filesystem with COW, I assume reflinks between that file and all snapshots are broken. So if there are 30 snapshots on that volume, that one file will suddenly take up 30 times more space... Is that correct? Or are the reflinks only broken between the live file and the latest snapshot? Or is it something between, based on how many times the file has changed?

> My top "$HOME" fragmented files are the aKregator RSS feed databases, usually a few hundred fragments each, and the '.sqlite' files for Firefox. Occasionally like just now I do this:
>
> tree$ sudo filefrag .firefox/default/*.sqlite | sort -t: -k 2n | tail -4
> .firefox/default/cleanup.sqlite: 43 extents found
> .firefox/default/content-prefs.sqlite: 67 extents found
> .firefox/default/formhistory.sqlite: 87 extents found
> .firefox/default/places.sqlite: 3879 extents found
>
> tree$ sudo btrfs fi defrag .firefox/default/*.sqlite
>
> tree$ sudo filefrag .firefox/default/*.sqlite | sort -t: -k 2n | tail -4
> .firefox/default/webappsstore.sqlite: 1 extent found
> .firefox/default/favicons.sqlite: 2 extents found
> .firefox/default/kinto.sqlite: 2 extents found
> .firefox/default/places.sqlite: 44 extents found

That's a very helpful example. Can you also give an example of using find, as you suggested above? I'm generally familiar with using find to execute specific commands, but an example is appreciated in this case.

>> 2. Put $HOME/.cache on a separate BTRFS subvolume that is mounted nocow -- it will NOT be snapshotted
>
> Also, you can declare the '.firefox/default/' directory to be NOCOW, and that "just works".

The cache is in a separate location from the profiles, as I'm sure you know.
The reason I suggested a separate BTRFS subvolume for $HOME/.cache is that this will prevent the cache files for all applications (for that user) from being included in the snapshots. We take frequent snapshots and (afaik) it makes no sense to include cache in backups or snapshots. The easiest way I know to exclude cache from BTRFS snapshots is to put it on a separate subvolume. I assumed this would make several things related to snapshots more efficient too.

As far as the Firefox profile being declared NOCOW, as soon as we take the first snapshot, I understand that it will become COW again. So I don't see any point in making it NOCOW.

Thanks for your reply. I appreciate any other feedback or suggestions.

Background: I'm not sure why our Firefox performance is so terrible but here's my original post from Sept 20. (I could repost the earlier replies too if needed.) I've been waiting to have a window of opportunity to try to fix our Firefox performance again, and now I have that chance.

>On Thu 2017-08-31 (09:05), Ulli Horlacher wrote:
>> When I do a
>> btrfs filesystem defragment -r /directory
>> does it defragment really all files in this directory tree, even if it contains subvolumes?
>> The man page does not mention subvolumes on this topic.
>
>No answer so far :-(
>
>But I found another problem in the man-page:
>
> Defragmenting with Linux kernel versions < 3.9 or >= 3.14-rc2 as well as with Linux stable kernel versions >= 3.10.31, >= 3.12.12 or >= 3.13.4 will break up the ref-links of COW data (for example files copied with cp --reflink, snapshots or de-duplicated data). This may cause considerable increase of space usage depending on the broken up ref-links.
>
>I am running Ubuntu 16.04 with Linux kernel 4.10 and I have several snapshots. Therefore, I better should avoid calling "btrfs filesystem defragment -r"?
>
>What is the defragmenting best practice? Avoid it completely?
My question is the same as the OP in this thread, so I came here to read the answers before asking. Based on the answers here, it sounds like I should not run defrag at all. However, I have a performance problem I need to solve, so if I don't defrag, I need to do something else.

Here's my scenario. Some months ago I built an over-the-top powerful desktop computer / workstation and I was looking forward to really fantastic performance improvements over my 6 year old Ubuntu machine. I installed Arch Linux on BTRFS on the new computer (on an SSD). To my shock, it was no faster than my old machine.

I focused a lot on Firefox performance because I use Firefox a lot and that was one of the applications in which I was most looking forward to better performance. I tried everything I could think of and everything recommended to me in various forums (except switching to Windows) and the performance remained very disappointing. Then today I read the following:

Gotchas - btrfs Wiki
Re: defragmenting best practice?
I'm following up on all the suggestions regarding Firefox performance on BTRFS. I have time to make these changes now, but I am having trouble figuring out what to do. The constraints are:

1. BTRFS snapshots have proven to be too useful (and too important to our overall IT approach) to forego.
2. We do not see any practical alternative (for us) to the incremental backup strategy (https://btrfs.wiki.kernel.org/index.php/Incremental_Backup).
3. We have large amounts of storage space (and can add more), but not enough to break all reflinks on all snapshots.
4. We can transfer snapshots to backup storage (and thereby retain minimal snapshots on the live volume).
5. Our team is standardized on Firefox. (Switching to Chromium is not an option for us.)
6. Firefox profile sync has not worked well for us in the past, so we don't use it.
7. Our machines generally have plenty of RAM so we could put the Firefox cache (and maybe profile) into RAM using a technique such as https://wiki.archlinux.org/index.php/Firefox/Profile_on_RAM. However, profile persistence is important.

The most common recommendations were to switch to Chromium, defragment and don't use snapshots. As the constraints above illustrate, we cannot do those things. The tentative solution I have come up with is:

1. Continue using snapshots, but retain the minimal number possible on the live volume. Move historical snapshots to a backup device using btrfs send-receive. (https://btrfs.wiki.kernel.org/index.php/Incremental_Backup)
2. Put $HOME/.cache on a separate BTRFS subvolume that is mounted nocow -- it will NOT be snapshotted.
3. Put most of $HOME on a "home" volume but separate all user documents to another volume (i.e., "documents").
3.a. The "home" volume will retain only the one most recent snapshot on that live volume. (More backup history will be retained on a backup volume.) This home volume can be defragmented. With one snapshot, that will double our space usage, which is acceptable.
3.b.
The documents volume will be snapshotted hourly and 36 hourly snapshots plus daily, weekly and monthly snapshots retained. Therefore it will NOT be defragmented, as that would not be practical or space-wise possible.
3.c. The root volume (operating system, etc.) will follow a strategy similar to home, but will also retain pre- and post-update snapshots.
4. Put the Firefox cache in RAM.
5. If needed, consider putting the Firefox profile in RAM.
6. Make sure Firefox is running in multi-process mode. (Duncan's instructions, while greatly appreciated and very useful, left me slightly confused about pulseaudio's compatibility with multi-process mode.)
7. Check various Firefox performance tweaks such as these: https://wiki.archlinux.org/index.php/Firefox/Tweaks

Can anyone guess whether this will be sufficient to solve our severe performance problems? Do these steps make sense? Will any of these steps lead to new problems? Should I proceed to give them a try? Or can anyone suggest a better set of steps to test?

Notes: In regard to snapshots, we must retain about 36 hourly snapshots of user documents, for example. We have to have pre- and post- package upgrade snapshots from at least the most recent operating system & application package update. And we have to retain several daily, weekly and monthly snapshots of system directories and some other locations. Most of these snapshots can be retained on backup storage devices. Regarding Firefox profile sync, it does not have an intelligent method for resolving conflicts, for example. We found too many unexpected changes when using sync, so we do not use it.

On Thu, Sep 21, 2017 at 7:09 AM, Duncan <1i5t5.dun...@cox.net> wrote:
> Dave posted on Wed, 20 Sep 2017 02:38:13 -0400 as excerpted:
>
>> Here's my scenario. Some months ago I built an over-the-top powerful
>> desktop computer / workstation and I was looking forward to really
>> fantastic performance improvements over my 6 year old Ubuntu machine.
>> I installed Arch Linux on BTRFS on the new computer (on an SSD). To my
>> shock, it was no faster than my old machine. I focused a lot on Firefox
>> performance because I use Firefox a lot and that was one of the
>> applications in which I was most looking forward to better performance.
>>
>> I tried everything I could think of and everything recommended to me in
>> various forums (except switching to Windows) and the performance
>> remained very disappointing.
>>
>> Then today I read the following:
>>
>> Gotchas - btrfs Wiki https://btrfs.wiki.kernel.org/index.php/Gotchas
>>
>> Fragmentation: Files with a lot of random writes can become
>> heavily fragmented (10000+ extents) causing excessive multi-second
>> spikes of CPU load on systems with an SSD or large amount of RAM. On
>> desktops this primarily affects application databases (including Firefox).
Re: Problem with file system
This is a very helpful thread. I want to share an interesting related story. We have a machine with 4 btrfs volumes and 4 Snapper configs. I recently discovered that Snapper timeline cleanup had been turned off for 3 of those volumes. In the Snapper configs I found this setting: TIMELINE_CLEANUP="no" Normally that would be set to "yes". So I corrected the issue and set it to "yes" for the 3 volumes where it had not been set correctly. I suppose it was turned off temporarily and then somebody forgot to turn it back on. What I did not know, and what I did not realize was a critical piece of information, was how long timeline cleanup had been turned off and how many snapshots had accumulated on each volume in that time. I naively re-enabled Snapper timeline cleanup. The instant I started the snapper-cleanup.service the system was hosed. The ssh session became unresponsive, no other ssh sessions could be established and it was impossible to log into the system at the console. My subsequent investigation showed that the root filesystem volume had accumulated more than 3000 btrfs snapshots. The two other affected volumes also had very large numbers of snapshots. Deleting a single snapshot in that situation would likely require hours. (I set up a test, but I ran out of patience before I was able to delete even a single snapshot.) My guess is that if we had been patient enough to wait for all the snapshots to be deleted, the process would have finished in some number of months (or maybe a year). We did not know most of this at the time, so we did what we usually do when a system becomes totally unresponsive -- we did a hard reset. Of course, we could never get the system to boot up again. Since we had backups, the easiest option became to replace that system -- not unlike what the OP decided to do. In our case, the hardware was not old, so we simply reformatted the drives and reinstalled Linux. 
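One lesson from this story: before re-enabling timeline cleanup, it may be worth counting the backlog and, if it is huge, deleting snapshots in small batches rather than all at once. A rough sketch (config names and the header/column handling are assumptions, and it presumes a snapper version that supports `list --columns`):

```shell
# Count how many snapshots snapper is tracking for a given config.
count_snapshots() {
    snapper -c "$1" list --columns number | awk 'NR>2' | wc -l   # skip header rows
}

# Delete a small batch of snapshots, waiting for the btrfs background
# cleaner between deletions so the box stays responsive.
drain_snapshots() {
    local config="$1" batch="${2:-10}"
    snapper -c "$config" list --columns number |
        awk 'NR>2 && $1+0 > 0 {print $1}' |      # skip headers and snapshot 0 (current)
        head -n "$batch" |
        while read -r num; do
            snapper -c "$config" delete "$num"
            btrfs subvolume sync /               # block until the cleaner catches up
        done
}
```

Real use would also honor snapper's NUMBER_LIMIT and any important tagged snapshots; this only illustrates the batching idea.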
That's a drastic consequence of changing TIMELINE_CLEANUP="no" to TIMELINE_CLEANUP="yes" in the snapper config. It's all part of the process of gaining critical experience with BTRFS. Whether or not BTRFS is ready for production use is (it seems to me) mostly a question of how knowledgeable and experienced are the people administering it. In the various online discussions on this topic, all the focus is on whether or not BTRFS itself is production-ready. At the current maturity level of BTRFS, I think that's the wrong focus. The right focus is on how production-ready is the admin person or team (with respect to their BTRFS knowledge and experience). When a filesystem has been around for decades, most of the critical admin issues become fairly common knowledge, fairly widely known and easy to find. When a filesystem is newer, far fewer people understand the gotchas. Also, in older or widely used filesystems, when someone hits a gotcha, the response isn't "that filesystem is not ready for production". Instead the response is, "you should have known not to do that." On Wed, Apr 26, 2017 at 12:43 PM, Fred Van Andel wrote: > Yes I was running qgroups. > Yes the filesystem is highly fragmented. > Yes I have way too many snapshots. > > I think it's clear that the problem is on my end. I simply placed too > many demands on the filesystem without fully understanding the > implications. Now I have to deal with the consequences. > > It was decided today to replace this computer due to its age. I will > use the recover command to pull the needed data off this system and > onto the new one. > > > Thank you everyone for your assistance and the education. 
> > Fred > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ANNOUNCE] fsperf: a simple fs/block performance testing framework
On Mon, Oct 09, 2017 at 09:00:51AM -0400, Josef Bacik wrote: > On Mon, Oct 09, 2017 at 04:17:31PM +1100, Dave Chinner wrote: > > On Sun, Oct 08, 2017 at 10:25:10PM -0400, Josef Bacik wrote: > > > > Integrating into fstests means it will be immediately available to > > > > all fs developers, it'll run on everything that everyone already has > > > > setup for filesystem testing, and it will have familiar mkfs/mount > > > > option setup behaviour so there's no new hoops for everyone to jump > > > > through to run it... > > > > > > > > > > TBF I specifically made it as easy as possible because I know we all hate > > > trying > > > to learn new shit. > > > > Yeah, it's also hard to get people to change their workflows to add > > a whole new test harness into them. It's easy if it's just a new > > command to an existing workflow :P > > > > Agreed, so if you probably won't run this outside of fstests then I'll add it > to > xfstests. I envision this tool as being run by maintainers to verify their > pull > requests haven't regressed since the last set of patches, as well as by > anybody > trying to fix performance problems. So it's way more important to me that > you, > Ted, and all the various btrfs maintainers will run it than anybody else. > > > > I figured this was different enough to warrant a separate > > > project, especially since I'm going to add block device jobs so Jens can > > > test > > > block layer things. If we all agree we'd rather see this in fstests then > > > I'm > > > happy to do that too. Thanks, > > > > I'm not fussed either way - it's a good discussion to have, though. > > > > If I want to add tests (e.g. my time-honoured fsmark tests), where > > should I send patches? > > > > I beat you to that! 
I wanted to avoid adding fs_mark to the suite because it > means parsing another different set of outputs, so I added a new ioengine to > fio > for this > > http://www.spinics.net/lists/fio/msg06367.html > > and added a fio job to do 500k files > > https://github.com/josefbacik/fsperf/blob/master/tests/500kemptyfiles.fio > > The test is disabled by default for now because obviously the fio support > hasn't > landed yet. That seems misguided. fio is good, but it's not a universal solution. > I'd _like_ to expand fio for cases we come up with that aren't possible, as > there's already a ton of measurements that are taken, especially around > latencies. To be properly useful it needs to support more than just fio to run tests. Indeed, it's largely useless to me if that's all it can do or it's a major pain to add support for different tools like fsmark. e.g. my typical perf regression test (the concurrent fsmark create workload you've seen) is actually a lot more. It does:

fsmark to create 50m zero length files
umount, run parallel xfs_repair (excellent mmap_sem/page fault punisher)
mount, run parallel find -ctime (readdir + lookup traversal)
unmount, mount, run parallel ls -R (readdir + dtype traversal)
unmount, mount, parallel rm -rf of 50m files

I have variants that use small 4k files or large files rather than empty files, that use different fsync patterns to stress the log, use grep -R to traverse the data as well as the directory/inode structure instead of find, etc. > That said I'm not opposed to throwing new stuff in there, it just > means we have to add stuff to parse the output and store it in the database > in a > consistent way, which seems like more of a pain than just making fio do what > we > need it to. Thanks, fio is not going to be able to replace the sort of perf tests I run from week to week. If that's all it's going to do then it's not directly useful to me... Cheers, Dave. 
-- Dave Chinner da...@fromorbit.com
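For readers unfamiliar with this kind of harness, the multi-phase workload Dave describes could be sketched roughly as below. This is a serialized illustration only: device, mount point, and fs_mark parameters are assumptions, scaled well down from the 50m-file original, and the real runs use parallel invocations:

```shell
# Sketch of a create / repair / traverse / remove perf-regression cycle.
run_fsmark_cycle() {
    local dev="$1" mnt="$2"                     # e.g. /dev/vdb /mnt/test (assumptions)
    fs_mark -S 0 -n 100000 -s 0 -t 16 -d "$mnt/0" -d "$mnt/1"   # create empty files
    umount "$mnt"
    xfs_repair "$dev"                           # mmap_sem/page fault punisher
    mount "$dev" "$mnt"
    find "$mnt" -ctime 1 >/dev/null             # readdir + lookup traversal
    umount "$mnt" && mount "$dev" "$mnt"
    ls -R "$mnt" >/dev/null                     # readdir + dtype traversal
    umount "$mnt" && mount "$dev" "$mnt"
    rm -rf "$mnt"/0 "$mnt"/1                    # bulk remove
}
```

Each phase would normally be timed separately; the unmount/mount pairs between phases drop caches so each traversal hits the filesystem cold.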
Re: [ANNOUNCE] fsperf: a simple fs/block performance testing framework
On Sun, Oct 08, 2017 at 10:25:10PM -0400, Josef Bacik wrote: > On Mon, Oct 09, 2017 at 11:51:37AM +1100, Dave Chinner wrote: > > On Fri, Oct 06, 2017 at 05:09:57PM -0400, Josef Bacik wrote: > > > Hello, > > > > > > One thing that comes up a lot every LSF is the fact that we have no > > > general way > > > that we do performance testing. Every fs developer has a set of scripts > > > or > > > things that they run with varying degrees of consistency, but nothing > > > central > > > that we all use. I for one am getting tired of finding regressions when > > > we are > > > deploying new kernels internally, so I wired this thing up to try and > > > address > > > this need. > > > > > > We all hate convoluted setups, the more brain power we have to put in to > > > setting > > > something up the less likely we are to use it, so I took the xfstests > > > approach > > > of making it relatively simple to get running and relatively easy to add > > > new > > > tests. For right now the only thing this framework does is run fio > > > scripts. I > > > chose fio because it already gathers loads of performance data about it's > > > runs. > > > We have everything we need there, latency, bandwidth, cpu time, and all > > > broken > > > down by reads, writes, and trims. I figure most of us are familiar > > > enough with > > > fio and how it works to make it relatively easy to add new tests to the > > > framework. > > > > > > I've posted my code up on github, you can get it here > > > > > > https://github.com/josefbacik/fsperf > > > > > > All (well most) of the results from fio are stored in a local sqlite > > > database. > > > Right now the comparison stuff is very crude, it simply checks against the > > > previous run and it only checks a few of the keys by default. You can > > > check > > > latency if you want, but while writing this stuff up it seemed that > > > latency was > > > too variable from run to run to be useful in a "did my thing regress or > > > improve" > > > sort of way. 
> > > > > > The configuration is brain dead simple, the README has examples. All you > > > need > > > to do is make your local.cfg, run ./setup and then run ./fsperf and you > > > are good > > > to go. > > > > Why re-invent the test infrastructure? Why not just make it a > > tests/perf subdir in fstests? > > > > Probably should have led with that shouldn't I have? There's nothing keeping > me > from doing it, but I didn't want to try and shoehorn in a python thing into > fstests. I need python to do the sqlite and the json parsing to dump into the > sqlite database. > > Now if you (and others) are not opposed to this being dropped into tests/perf > then I'll work that up. But it's definitely going to need to be done in > python. > I know you yourself have said you aren't opposed to using python in the past, > so > if that's still the case then I can definitely wire it all up. I have no problems with people using python for stuff like this but, OTOH, I'm not the fstests maintainer anymore :P > > > The plan is to add lots of workloads as we discover regressions and such. > > > We > > > don't want anything that takes too long to run otherwise people won't run > > > this, > > > so the existing tests don't take much longer than a few minutes each. I > > > will be > > > adding some more comparison options so you can compare against averages > > > of all > > > previous runs and such. > > > > Yup, that fits exactly into what fstests is for... :P > > > > Integrating into fstests means it will be immediately available to > > all fs developers, it'll run on everything that everyone already has > > setup for filesystem testing, and it will have familiar mkfs/mount > > option setup behaviour so there's no new hoops for everyone to jump > > through to run it... > > > > TBF I specifically made it as easy as possible because I know we all hate > trying > to learn new shit. Yeah, it's also hard to get people to change their workflows to add a whole new test harness into them. 
It's easy if it's just a new command to an existing workflow :P > I figured this was different enough to warrant a separate > project, especially since I'm going to add block device jobs so Jens can test > block layer things. If we all agree we'd rather see this in fstests then I'm > happy to do that too. Thanks, I'm not fussed either way - it's a good discussion to have, though. If I want to add tests (e.g. my time-honoured fsmark tests), where should I send patches? Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ANNOUNCE] fsperf: a simple fs/block performance testing framework
On Fri, Oct 06, 2017 at 05:09:57PM -0400, Josef Bacik wrote: > Hello, > > One thing that comes up a lot every LSF is the fact that we have no general > way > that we do performance testing. Every fs developer has a set of scripts or > things that they run with varying degrees of consistency, but nothing central > that we all use. I for one am getting tired of finding regressions when we > are > deploying new kernels internally, so I wired this thing up to try and address > this need. > > We all hate convoluted setups, the more brain power we have to put in to > setting > something up the less likely we are to use it, so I took the xfstests approach > of making it relatively simple to get running and relatively easy to add new > tests. For right now the only thing this framework does is run fio scripts. > I > chose fio because it already gathers loads of performance data about it's > runs. > We have everything we need there, latency, bandwidth, cpu time, and all broken > down by reads, writes, and trims. I figure most of us are familiar enough > with > fio and how it works to make it relatively easy to add new tests to the > framework. > > I've posted my code up on github, you can get it here > > https://github.com/josefbacik/fsperf > > All (well most) of the results from fio are stored in a local sqlite database. > Right now the comparison stuff is very crude, it simply checks against the > previous run and it only checks a few of the keys by default. You can check > latency if you want, but while writing this stuff up it seemed that latency > was > too variable from run to run to be useful in a "did my thing regress or > improve" > sort of way. > > The configuration is brain dead simple, the README has examples. All you need > to do is make your local.cfg, run ./setup and then run ./fsperf and you are > good > to go. Why re-invent the test infrastructure? Why not just make it a tests/perf subdir in fstests? 
> The plan is to add lots of workloads as we discover regressions and such. We > don't want anything that takes too long to run otherwise people won't run > this, > so the existing tests don't take much longer than a few minutes each. I will > be > adding some more comparison options so you can compare against averages of all > previous runs and such. Yup, that fits exactly into what fstests is for... :P Integrating into fstests means it will be immediately available to all fs developers, it'll run on everything that everyone already has setup for filesystem testing, and it will have familiar mkfs/mount option setup behaviour so there's no new hoops for everyone to jump through to run it... Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
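To make the "compare against the previous run" idea concrete, here is a toy sketch of the kind of check fsperf is described as doing, using the sqlite3 CLI. The schema, key names, and 5% threshold are assumptions, not fsperf's actual ones (fsperf itself is Python):

```shell
# Record one summary row per run in sqlite.
record_run() {   # record_run <db-file> <test-name> <write_bw>
    sqlite3 "$1" "CREATE TABLE IF NOT EXISTS runs (test TEXT, write_bw REAL);
                  INSERT INTO runs VALUES ('$2', $3);"
}

# Pure comparison: reads newline-separated values, newest first, and flags
# a drop of more than 5% versus the previous value.
compare_last_two() {
    awk 'NR==1 {new=$1} NR==2 {old=$1}
         END { if (NR>=2 && new < old*0.95) print "regression: " old " -> " new }'
}

# Compare the last two recorded runs of a test.
check_regression() {   # check_regression <db-file> <test-name>
    sqlite3 "$1" "SELECT write_bw FROM runs WHERE test='$2'
                  ORDER BY rowid DESC LIMIT 2;" | compare_last_two
}
```

For example, after recording runs with write_bw 100 and then 90, check_regression would report the 10% drop, while 100 then 99 would stay silent.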
Re: Something like ZFS Channel Programs for BTRFS & probably XFS or even VFS?
On Tue, Oct 03, 2017 at 01:40:51PM -0700, Matthew Wilcox wrote: > On Wed, Oct 04, 2017 at 07:10:35AM +1100, Dave Chinner wrote: > > On Tue, Oct 03, 2017 at 03:19:18PM +0200, Martin Steigerwald wrote: > > > [repost. I didn´t notice autocompletion gave me wrong address for > > > fsdevel, > > > blacklisted now] > > > > > > Hello. > > > > > > What do you think of > > > > > > http://open-zfs.org/wiki/Projects/ZFS_Channel_Programs > > > > Domain not found. > > Must be an Australian problem ... Probably, I forgot to stand on my head so everything must have been sent to the server upside down Though it is a curious failure - it failed until I went to "openzfs.org" and that redirected to "open-zfs.org" and now it all works. Somewhat bizarre. > A ZFS channel program (ZCP) is a small script written in a domain specific > language that manipulate ZFS internals in a single, atomically-visible > operation. For instance, to delete all snapshots of a filesystem a ZCP > could be written which 1) generates the list of snapshots, 2) traverses > that list, and 3) destroys each snapshot unconditionally. Because > each of these statements would be evaluated from within the kernel, > ZCPs can guarantee safety from interference with other concurrent ZFS > modifications. Executing from inside the kernel allows us to guarantee > atomic visibility of these operations (correctness) and allows them to > be performed in a single transaction group (performance). > > A successful implementation of ZCP will: > > 1. Support equivalent functionality for all of the current ZFS commands > with improved performance and correctness from the point of view of the > user of ZFS. > > 2. Facilitate the quick addition of new and useful commands as > ZCP enables the implementation of more powerful operations which > previously would have been unsafe to implement in user programs, or > would require modifications to the kernel for correctness. 
Since the > ZCP layer guarantees the atomicity of each ZCP, we only need to write > new sync_tasks for individual simple operations, then can use ZCPs to > chain those simple operations together into more complicated operations. > > 3. Allow ZFS users to safely implement their own ZFS operations without > performing operations they don’t have the privileges for. > > 4. Improve the performance and correctness of existing applications > built on ZFS operations. /me goes and looks at the slides Seems like they are trying to solve a problem of their own making, in that admin operations are run by the kernel from a separate task that is really, really slow. So this scripting is a method of aggregating multiple "sync tasks" into a single operation so there isn't delays between tasks. /me chokes on slide 8/8 "Add a Lua interpreter to the kernel, implement ZFS intrinsics (...) as extensions to the Lua language" Somehow, I don't see that happening in Linux. Yes, I can see us potentially adding some custom functionality in filesystems with eBPF (e.g. custom allocation policies), but I think admin operations need to be done from userspace through a clear, stable interface that supports all the necessary primitives to customise admin operations for different needs. Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Something like ZFS Channel Programs for BTRFS & probably XFS or even VFS?
On Tue, Oct 03, 2017 at 03:19:18PM +0200, Martin Steigerwald wrote: > [repost. I didn´t notice autocompletion gave me wrong address for fsdevel, > blacklisted now] > > Hello. > > What do you think of > > http://open-zfs.org/wiki/Projects/ZFS_Channel_Programs Domain not found. -Dave. -- Dave Chinner da...@fromorbit.com
Re: defragmenting best practice?
These are great suggestions. I will test several of them (or all of them) and report back with my results once I have done the testing. Thank you! This is a fantastic mailing list. P.S. I'm inclined to stay with Firefox, but I will definitely test Chromium vs Firefox after making a series of changes based on the suggestions here. I would hate to see the market lose the option of Firefox because everyone goes to Chrome/Chromium.
Re: [PATCH] fstests: btrfs/150 regression test for reading compressed data
On Wed, Sep 20, 2017 at 05:52:43PM -0600, Liu Bo wrote: > We had a bug in btrfs compression code which could end up with a > kernel panic. > > This is adding a regression test for the bug and I've also sent a > kernel patch to fix the bug. > > The patch is "Btrfs: fix kernel oops while reading compressed data". > > Signed-off-by: Liu Bo <bo.li@oracle.com> > --- > tests/btrfs/150 | 102 > > tests/btrfs/150.out | 3 ++ > tests/btrfs/group | 1 + > 3 files changed, 106 insertions(+) > create mode 100755 tests/btrfs/150 > create mode 100644 tests/btrfs/150.out > > diff --git a/tests/btrfs/150 b/tests/btrfs/150 > new file mode 100755 > index 000..834be51 > --- /dev/null > +++ b/tests/btrfs/150 > @@ -0,0 +1,102 @@ > +#! /bin/bash > +# FS QA Test btrfs/150 > +# > +# This is a regression test which ends up with a kernel oops in btrfs. group += dangerous > +# It occurs when btrfs's read repair happens while reading a compressed > +# extent. > +# The patch for this is > +# x Incomplete? > +# > +#--- > +# Copyright (c) 2017 Liu Bo. All Rights Reserved. You're signing off this patch an Oracle employee, but claiming personal copyright. Please clarify who owns the copyright - if it's your personal copyright then please sign off with a personal email address, not your employer's... Also, I note that these recently added tests from you: tests/btrfs/140:# Copyright (c) 2017 Liu Bo. All Rights Reserved. tests/btrfs/141:# Copyright (c) 2017 Liu Bo. All Rights Reserved. tests/btrfs/142:# Copyright (c) 2017 Liu Bo. All Rights Reserved. tests/btrfs/143:# Copyright (c) 2017 Liu Bo. All Rights Reserved. tests/generic/406:# Copyright (c) 2017 Liu Bo. All Rights Reserved. all have this same ambiguity - personal copyright with employer signoff in the commit. 
This definitely needs clarification and fixing if it is wrong > +disable_io_failure() > +{ > +echo 0 > $SYSFS_BDEV/make-it-fail > +echo 0 > $DEBUGFS_MNT/fail_make_request/probability > +echo 0 > $DEBUGFS_MNT/fail_make_request/times > +} > + > +_scratch_pool_mkfs "-d raid1 -b 1G" >> $seqres.full 2>&1 > + > +# It doesn't matter which compression algorithm we use. > +_scratch_mount -ocompress > + > +# Create a file with all data being compressed > +$XFS_IO_PROG -f -c "pwrite -W 0 8K" $SCRATCH_MNT/foobar | _filter_xfs_io needs an fsync to reach disk. > +# Raid1 consists of two copies and btrfs decides which copy to read by > reader's > +# %pid. Now we inject errors to copy #1 and copy #0 is good. We want to > read > +# the bad copy to trigger read-repair. > +while true; do > + disable_io_failure > + # invalidate the page cache > + $XFS_IO_PROG -f -c "fadvise -d 0 128K" $SCRATCH_MNT/foobar | > _filter_xfs_io > + > + enable_io_failure > + od -x $SCRATCH_MNT/foobar > /dev/null & why are you using od to read the data when the output is piped to dev/null? why not just xfs_io -c "pread 0 8k" ? Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
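For context, the enable/disable helpers being reviewed toggle the kernel's standard fault-injection knobs. A sketch of what they amount to, assuming the usual CONFIG_FAIL_MAKE_REQUEST debugfs/sysfs layout and a hypothetical device sdb, plus the xfs_io read Dave suggests in place of od:

```shell
# Make all bios to the chosen device fail (probability is a percentage,
# times = -1 means no limit on how many failures are injected).
enable_io_failure() {
    echo 100 > /sys/kernel/debug/fail_make_request/probability
    echo -1  > /sys/kernel/debug/fail_make_request/times
    echo 1   > /sys/block/sdb/make-it-fail
}

disable_io_failure() {
    echo 0 > /sys/block/sdb/make-it-fail
    echo 0 > /sys/kernel/debug/fail_make_request/probability
    echo 0 > /sys/kernel/debug/fail_make_request/times
}

# Read the file to trigger read-repair, without shelling out to od.
read_bad_copy() {
    xfs_io -c "pread 0 8k" "$1" >/dev/null      # instead of: od -x "$1" >/dev/null
}
```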
Re: ERROR: parent determination failed (btrfs send-receive)
On Tue, Sep 19, 2017 at 3:37 PM, Andrei Borzenkov <arvidj...@gmail.com> wrote: > 18.09.2017 09:10, Dave wrote: >> I use snap-sync to create and send snapshots. >> >> GitHub - wesbarnett/snap-sync: Use snapper snapshots to backup to external >> drive >> https://github.com/wesbarnett/snap-sync >> > > Are you trying to backup top-level subvolume? No. I never mount the top-level subvolume. > I just reproduced this > behavior with this tool. The problem is, snapshots > of top-level > subvolume do not have parent UUID (I am not even sure if UUID exists at > all TBH). If you mount any other subvolume, it will work. Well, that's not exactly my issue but it still helps me gain additional understanding. It may help me reproduce my original problem too. Thank you. > > > As I told you in the first reply, showing output of "btrfs su li -qu > /path/to/src" would explain your problem much earlier. > > Actually if snap-sync used "btrfs send -p" instead of "btrfs send -c" it > would work as well, as then no parent search would be needed (and as I > mentioned in another mail both commands are functionally equivalent). > But this becomes really off-topic on this list. As already suggested, > open issue for snap-sync. Thank you for this also. I will simply edit my current copy of the snap-sync script to use "-p" now. That's a simple change I can implement immediately.
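The two diagnostics discussed above can be sketched as below: inspect parent UUIDs before relying on parent auto-detection, and send with an explicit -p parent so no parent search is needed. All paths are assumptions:

```shell
# Show each subvolume's own UUID (-u) and its parent UUID (-q); snapshots
# of the top-level subvolume will show no parent UUID, which is the
# failure mode discussed in this thread.
list_parent_uuids() {
    btrfs subvolume list -qu "$1"
}

# Incremental transfer with an explicit parent, falling back to a full
# send for the first transfer. Both snapshots must be read-only.
send_incremental() {
    local parent="$1" snapshot="$2" dest="$3"
    if [ -n "$parent" ]; then
        btrfs send -p "$parent" "$snapshot" | btrfs receive "$dest"
    else
        btrfs send "$snapshot" | btrfs receive "$dest"
    fi
}
```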
Re: defragmenting best practice?
>On Thu 2017-08-31 (09:05), Ulli Horlacher wrote: >> When I do a >> btrfs filesystem defragment -r /directory >> does it defragment really all files in this directory tree, even if it >> contains subvolumes? >> The man page does not mention subvolumes on this topic. > >No answer so far :-( > >But I found another problem in the man-page: > > Defragmenting with Linux kernel versions < 3.9 or >= 3.14-rc2 as well as > with Linux stable kernel versions >= 3.10.31, >= 3.12.12 or >= 3.13.4 > will break up the ref-links of COW data (for example files copied with > cp --reflink, snapshots or de-duplicated data). This may cause > considerable increase of space usage depending on the broken up > ref-links. > >I am running Ubuntu 16.04 with Linux kernel 4.10 and I have several >snapshots. >Therefore, I better should avoid calling "btrfs filesystem defragment -r"? > >What is the defragmenting best practice? >Avoid it completly? My question is the same as the OP in this thread, so I came here to read the answers before asking. However, it turns out that I still need to ask something. Should I ask here or start a new thread? (I'll assume here, since the topic is the same.) Based on the answers here, it sounds like I should not run defrag at all. However, I have a performance problem I need to solve, so if I don't defrag, I need to do something else. Here's my scenario. Some months ago I built an over-the-top powerful desktop computer / workstation and I was looking forward to really fantastic performance improvements over my 6 year old Ubuntu machine. I installed Arch Linux on BTRFS on the new computer (on an SSD). To my shock, it was no faster than my old machine. I focused a lot on Firefox performance because I use Firefox a lot and that was one of the applications in which I was most looking forward to better performance. 
I tried everything I could think of and everything recommended to me in various forums (except switching to Windows) and the performance remained very disappointing. Then today I read the following: Gotchas - btrfs Wiki https://btrfs.wiki.kernel.org/index.php/Gotchas Fragmentation: Files with a lot of random writes can become heavily fragmented (10000+ extents) causing excessive multi-second spikes of CPU load on systems with an SSD or large amount of RAM. On desktops this primarily affects application databases (including Firefox). Workarounds include manually defragmenting your home directory using btrfs fi defragment. Auto-defragment (mount option autodefrag) should solve this problem. Upon reading that I am wondering if fragmentation in the Firefox profile is part of my issue. That's one thing I never tested previously. (BTW, this system has 256 GB of RAM and 20 cores.) Furthermore, on the same BTRFS Wiki page, it mentions the performance penalties of many snapshots. I am keeping 30 to 50 snapshots of the volume that contains the Firefox profile. Would these two things be enough to turn top-of-the-line hardware into a mediocre-performing desktop system? (The system performs fine on benchmarks -- it's real-life usage, particularly with Firefox, that is disappointing.) After reading the info here, I am wondering if I should make a new subvolume just for my Firefox profile(s) and not use COW and/or not keep snapshots on it and mount it with the autodefrag option. As part of this strategy, I could send snapshots to another disk using btrfs send-receive. That way I would have the benefits of snapshots (which are important to me), but by not keeping any snapshots on the live subvolume I could avoid the performance problems. What would you guys do in this situation?
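The strategy floated above can be sketched as follows. This is a hedged, untested sketch, not a recipe from the thread: all paths and the fstab line are illustrative, it assumes the fragmentation comes from the profile's database files (e.g. places.sqlite), and it requires root on a btrfs filesystem, so it is written as an annotated fragment:

```shell
# Hedged sketch of the "dedicated subvolume" strategy (illustrative paths, not tested here):
#
# 1. Dedicated subvolume for the Firefox profile, so snapper snapshots of /home skip it:
#      btrfs subvolume create /home/user/firefox-profile
# 2. Disable COW on it; note chattr +C only affects files created afterwards,
#    and nodatacow files lose btrfs checksumming:
#      chattr +C /home/user/firefox-profile
# 3. Mount with autodefrag -- illustrative /etc/fstab line:
#      UUID=<fs-uuid>  /home  btrfs  subvol=home,autodefrag,noatime  0 0
```

Backups of the live profile would then come from send|receive of ad-hoc snapshots to another disk, as described above, rather than from snapshots retained on the live subvolume.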
Re: snapshots of encrypted directories?
On Fri, Sep 15, 2017 at 6:01 AM, Ulli Horlacher wrote: > > On Fri 2017-09-15 (06:45), Andrei Borzenkov wrote: > > > The actual question is - do you need to mount each individual btrfs > > subvolume when using encfs? > > And even worse it goes with ecryptfs: I do not know at all how to mount a > snapshot, so that the user has access to it. > > It seems snapshots are incompatible with encrypted filesystems :-( My experience is the opposite. I use dm-crypt as well as encfs with BTRFS and everything, including snapshots, works as I would expect it to work. I have been able to successfully restore snapshots that contained encrypted data. I think the other answers have already provided more details than I could provide, so I just wanted to add the fact that my experience has been positive with BTRFS snapshots and encryption.
Re: ERROR: parent determination failed (btrfs send-receive)
On Mon, Sep 18, 2017 at 12:23 AM, Andrei Borzenkov <arvidj...@gmail.com> wrote: > > 18.09.2017 05:31, Dave wrote: > > Sometimes when using btrfs send-receive, I get errors like this: > > > > ERROR: parent determination failed for > > > > When this happens, btrfs send-receive backups fail. And all subsequent > > backups fail too. > > > > The issue seems to stem from the fact that an automated cleanup > > process removes certain earlier subvolumes. (I'm using Snapper.) > > > > I'd like to understand exactly what is happening so that my backups do > > not unexpectedly fail. > > > > You try to send incremental changes but you deleted subvolume to compute > changes against. It is hard to tell more without seeing subvolume list > with uuid/parent uuid. > Thanks to Duncan <1i5t5.dun...@cox.net> I have a bit better understanding of -c and -p now. My backup tool is using only the -c option. (The tool is GitHub - wesbarnett/snap-sync https://github.com/wesbarnett/snap-sync .) No subvolume at the destination had ever been deleted. At the source, a number of preceding snapshots (about 30 in this case) exist as subvolumes, but some others have been cleaned up with Snapper's timeline algorithm. I think I understand that with the -c option, it is **not** strictly necessary that the specified snapshots exist on both the source and destination. It seems I had sufficient subvolumes available for the incremental send to work with the -c option. Therefore, it still isn't clear why I got the error: ERROR: parent determination failed. Further general comments will be helpful. I understand the limits in getting too specific in replies because I cannot give subvolume lists with uuid's. As mentioned, I cannot give that info because I tried to fix the above error by sending a subvolume from the destination back to the source. This resulted in my source volume having a "Received UUID" which wrecked all subsequent backups. 
I now understand that (for my use cases) a source subvolume for a send-receive should **never** have a Received UUID. (If that is indeed a general rule, it seems btrfs tools should check it. Or possibly something about the pitfalls of a "Received UUID" in a source could be listed on the BTRFS incremental backup wiki page?) I was previously referred to the FAQ here: https://github.com/digint/btrbk/blob/master/doc/FAQ.md#im-getting-an-error-aborted-received-uuid-is-set But in the end, I decided the safest strategy was to delete all prior backup subvolumes so I could be sure my backups were valid going forward. I am asking these questions now to avoid getting into a situation like this again (hopefully). Any general comments are appreciated.
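The rule derived above (never use a subvolume with a set Received UUID as a send source) is easy to check mechanically. A sketch, with a sample `btrfs subvolume show` line embedded so it is self-contained; in practice you would feed the real command's output instead (btrfs prints a lone `-` when the field is unset):

```shell
# Hedged sketch: warn when "Received UUID" is set on a would-be send source.
# Sample line from this thread; replace with: btrfs subvolume show /path/to/snapshot
show_line='Received UUID:  e00d5318-6efd-824e-ac91-f25efa5c2a74'
if echo "$show_line" | grep -Eq 'Received UUID:[[:space:]]*[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}'; then
  echo "WARNING: Received UUID is set; do not use this subvolume as a send source"
fi
```

A backup script could run this guard before every `btrfs send` and abort instead of silently producing the broken state described in this thread.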
difference between -c and -p for send-receive?
new subject for new question On Mon, Sep 18, 2017 at 1:37 PM, Andrei Borzenkov wrote: > >> What scenarios can lead to "ERROR: parent determination failed"? > > > > The man page for btrfs-send is reasonably clear on the requirements > > btrfs imposes. If you want to use incremental sends (i.e. the -c or -p > > options) then the specified snapshots must exist on both the source and > > destination. If you don't have a suitable existing snapshot then don't > > use -c or -p and just do a full send. > > > > Well, I do not immediately see why -c must imply incremental send. We > want to reduce amount of data that is transferred, so reuse data from > existing snapshots, but it is really orthogonal to whether we send full > subvolume or just changes since another snapshot. > Starting months ago when I began using btrfs seriously, I have been reading, rereading and trying to understand this: FAQ - btrfs Wiki https://btrfs.wiki.kernel.org/index.php/FAQ#What_is_the_difference_between_-c_and_-p_in_send.3F The comment above suddenly gives me another clue... However, I still don't understand terms like "clone range ioctl", although I can guess it is something like a hard link. Would it be correct to say the following? 1. "-c" causes (appropriate) files in the newly transferred snapshot to be "hard linked" to existing files in another snapshot on the destination. Doesn't "-p" do something equivalent though? 2. The -c and -p options can be used together or individually. Questions: If "-c" "will send all of the metadata of @B.1, but will leave out the data for @B.1/bigfile, because it's already in the backups filesystem, and can be reflinked from there" what will -p do in contrast? Will "-p" not send all the metadata? Will "-p" also leave out the data for @B.1/bigfile, when it's also already in the backups? What would make me choose one of these options over the other? I still struggle to see the difference. 
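A toy model of the FAQ's @B.1/bigfile example may help. As I read the FAQ quoted above, `-p` builds the stream as a delta against one parent snapshot, while `-c` only names extra subvolumes whose data the receiver may reflink; either way the received snapshot ends up with complete metadata. The set arithmetic for which files' data `-c` still has to transfer (assumed semantics, not the real send-stream format):

```shell
# Toy model only -- illustrates data reuse, not real btrfs internals.
printf '%s\n' bigfile notes.txt newfile.txt | sort > b1_files.txt   # contents of @B.1
printf '%s\n' bigfile notes.txt | sort > clone_source.txt           # already on the backup fs (@A.1)
# Data that -c must actually transfer: files in @B.1 that appear in no clone source.
comm -23 b1_files.txt clone_source.txt    # prints: newfile.txt
```

Under this model the practical difference is that `-p` needs the one named parent to exist on both sides, whereas `-c` can draw on any of several named clone sources.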
Re: ERROR: parent determination failed (btrfs send-receive)
On Mon, Sep 18, 2017 at 12:23 AM, Andrei Borzenkov <arvidj...@gmail.com> wrote: > 18.09.2017 05:31, Dave wrote: >> Sometimes when using btrfs send-receive, I get errors like this: >> >> ERROR: parent determination failed for >> >> When this happens, btrfs send-receive backups fail. And all subsequent >> backups fail too. >> >> The issue seems to stem from the fact that an automated cleanup >> process removes certain earlier subvolumes. (I'm using Snapper.) >> >> I'd like to understand exactly what is happening so that my backups do >> not unexpectedly fail. >> > > You try to send incremental changes but you deleted subvolume to compute > changes against. It is hard to tell more without seeing subvolume list > with uuid/parent uuid. I do not have a current subvolume list to provide UUID's. To ensure integrity of my backups, I was forced to delete all backup snapshots and start over. (After this initial parent determination error, my attempted solution created a new problem related to Received UUID's, so removing all backups was the best solution in the end.) For my understanding, what are the restrictions on deleting snapshots? What scenarios can lead to "ERROR: parent determination failed"? After all, it should not be necessary to retain every snapshot ever made. We have to delete snapshots periodically. In my case, I still retained every snapshot on the target volume. None had ever been deleted (yet). And I also retained around 30 recent snapshots on the source, according to the snapper timeline cleanup config. Yet I still ran into "ERROR: parent determination failed". > >> In my scenario, no parent subvolumes have been deleted from the >> target. Some subvolumes have been deleted from the source, but why >> does that matter? I am able to take a valid snapshot at this time and >> every snapshot ever taken continues to reside at the target backup >> destination (seemingly meaning that a parent subvolume can be found at >> the target). 
>> >> This issue seems to make btrfs send-receive a very fragile backup >> solution. > > btrfs send/receive is not backup solution - it is low level tool that > does exactly what it is told to do. You may create backup solution that > is using btrfs send/receive to transfer data stream, but then do not > blame tool for incorrect usage. > > To give better advice how to fix your situation you need to describe > your backup solution - how exactly you select/create snapshots. I use snap-sync to create and send snapshots. GitHub - wesbarnett/snap-sync: Use snapper snapshots to backup to external drive https://github.com/wesbarnett/snap-sync > >> I hope, instead, there is some knowledge I'm missing that, >> when learned, will make this a robust backup solution. >> >> Thanks >> --
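One defensive pattern that follows from the advice above: before each incremental send, derive the parent from what actually exists on both ends, rather than assuming the previous snapshot survived snapper's timeline cleanup. A sketch with illustrative snapper-style snapshot numbers (not taken from this thread):

```shell
# Hedged sketch: pick the newest snapshot number present on BOTH sides as the -p parent.
printf '%s\n' 1690 1704 1716 | sort -n > src_list.txt   # retained on the source
printf '%s\n' 1680 1704 | sort -n > dst_list.txt        # previously received on the backup
parent=$(comm -12 src_list.txt dst_list.txt | tail -n 1)
echo "common parent: $parent"
# then e.g.: btrfs send -p ".../$parent/snapshot" ".../1716/snapshot" | btrfs receive ...
```

If the intersection is empty, the script should fall back to a full (non-incremental) send instead of failing with "parent determination failed".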
ERROR: parent determination failed (btrfs send-receive)
Sometimes when using btrfs send-receive, I get errors like this: ERROR: parent determination failed for When this happens, btrfs send-receive backups fail. And all subsequent backups fail too. The issue seems to stem from the fact that an automated cleanup process removes certain earlier subvolumes. (I'm using Snapper.) I'd like to understand exactly what is happening so that my backups do not unexpectedly fail. In my scenario, no parent subvolumes have been deleted from the target. Some subvolumes have been deleted from the source, but why does that matter? I am able to take a valid snapshot at this time and every snapshot ever taken continues to reside at the target backup destination (seemingly meaning that a parent subvolume can be found at the target). This issue seems to make btrfs send-receive a very fragile backup solution. I hope, instead, there is some knowledge I'm missing that, when learned, will make this a robust backup solution. Thanks
Re: send | receive: received snapshot is missing recent files
On Mon, Sep 11, 2017 at 11:19 PM, Andrei Borzenkov <arvidj...@gmail.com> wrote: > 11.09.2017 20:53, Axel Burri wrote: >> On 2017-09-08 06:44, Dave wrote: >>> I'm referring to the link below. Using "btrfs subvolume snapshot -r" >>> copies the Received UUID from the source into the new snapshot. The >>> btrbk FAQ entry suggests otherwise. Has something changed? >> >> I don't think something has changed, the description for the read-only >> subvolumes on the btrbk FAQ was just wrong (fixed now). >> >>> The only way I see to remove a Received UUID is to create a rw >>> snapshot (above command without the "-r"), which is not ideal in this >>> situation when cleaning up readonly source snapshots. >>> >>> Any suggestions? Thanks >> >> No suggestions from my part, as far as I know there is no way to easily >> remove/change a received_uuid from a subvolume. >> > > There is BTRFS_IOC_SET_RECEIVED_SUBVOL IOCTL which is used by "btrfs > receive". My understanding is that it can also be set to empty (thus > clearing it). You could write a small program to do it. > > In general it sounds like a bug - removing read-only flag from subvolume > by any means should also clear Received UUID as we cannot anymore > guarantee that subvolume content is the same. Yes! That makes a great deal of sense.
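Until such an automatic cleanup exists, the workaround implied earlier in this thread is to avoid flipping the ro property at all and instead take a fresh writable snapshot, which (per the observations above) gets its own UUID and no received_uuid. A comment-only sketch with illustrative paths, requiring root on a btrfs filesystem, so it is left as a fragment rather than runnable commands:

```shell
# Hedged sketch (illustrative paths, untested here):
#   # /mnt/backup/home_received arrived via send|receive, so it carries a received_uuid;
#   # a plain (writable, no -r) snapshot of it does not:
#   btrfs subvolume snapshot /mnt/backup/home_received /home_rw
#   # use /home_rw going forward -- it is safe as a future send source.
```

This sidesteps both the ioctl-level fix and the `btrfs property set ... ro false` trap discussed above.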
Re: send | receive: received snapshot is missing recent files
I'm referring to the link below. Using "btrfs subvolume snapshot -r" copies the Received UUID from the source into the new snapshot. The btrbk FAQ entry suggests otherwise. Has something changed? The only way I see to remove a Received UUID is to create a rw snapshot (above command without the "-r"), which is not ideal in this situation when cleaning up readonly source snapshots. Any suggestions? Thanks On Thu, Sep 7, 2017 at 10:33 AM, Axel Burri <a...@tty0.ch> wrote: > > Having a received_uuid set on the source volume ("/home" in your case) > is indeed a bad thing when it comes to send/receive. You probably > restored a backup with send/receive, and made it read/write using "btrfs > property set -ts /home ro false". This is an evil thing, as it leaves > received_uuid intact. In order to make a subvolume read-write, I > recommend to use "btrfs subvolume snapshot ". > > There is a FAQ entry on btrbk on how to fix this: > > https://github.com/digint/btrbk/blob/master/doc/FAQ.md#im-getting-an-error-aborted-received-uuid-is-set > > > On 2017-09-07 15:34, Dave wrote: > > I just ran a test. The btrfs send - receive problem I described is > > indeed fully resolved by removing the "problematic" snapshot on the > > target device. I did not make any changes to the source volume. I did > > not make any other changes in my steps (see earlier message for my > > exact steps). > > > > Therefore, the problem I described in my earlier message is not due > > exclusively to having a Received UUID on the source volume (or to any > > other feature of the source volume). It is not related to any feature > > of the directly specified parent volume either. More details are > > included in my earlier email. > > > > Thanks for any further feedback, including answers to my questions and > > comments about whether this is a known issue. > > > > > > On Thu, Sep 7, 2017 at 8:39 AM, Dave <davestechs...@gmail.com> wrote: > >> > >> Hello. 
Can anyone further explain this issue ("you have a Received UUID on > >> the source volume")? > >> > >> How does it happen? > >> How does one remove a Received UUID from the source volume? > >> > >> And how does that explain my results where I showed that the problem > >> is not dependent upon the source volume but is instead dependent upon > >> some existing snapshot on the target volume? > >> > >> My results do not appear to be fully explained by a Received UUID on the > >> source volume, as my prior message hopefully shows clearly. > >> > >> Thank you. > >> > >> On Thu, Sep 7, 2017 at 2:24 AM, A L <crimsoncott...@gmail.com> wrote: > >>> The problem can be that you have a Received UUID on the source volume. > >>> This breaks send-receive. > >>> > >>> From: Dave <davestechs...@gmail.com> -- Sent: 2017-09-07 - 06:43 > >>> > >>>> Here is more info and a possible (shocking) explanation. This > >>>> aggregates my prior messages and it provides an almost complete set of > >>>> steps to reproduce this problem. 
> >>>> > >>>> Linux srv 4.9.41-1-lts #1 SMP Mon Aug 7 17:32:35 CEST 2017 x86_64 > >>>> GNU/Linux > >>>> btrfs-progs v4.12 > >>>> > >>>> My steps: > >>>> > >>>> [root@srv]# sync > >>>> [root@srv]# mkdir /home/.snapshots/test1 > >>>> [root@srv]# btrfs su sn -r /home/ /home/.snapshots/test1/ > >>>> Create a readonly snapshot of '/home/' in '/home/.snapshots/test1//home' > >>>> [root@srv]# sync > >>>> [root@srv]# mkdir /mnt/x5a/home/test1 > >>>> [root@srv]# btrfs send /home/.snapshots/test1/home/ | btrfs receive > >>>> /mnt/x5a/home/test1/ > >>>> At subvol /home/.snapshots/test1/home/ > >>>> At subvol home > >>>> [root@srv]# ls -la /mnt/x5a/home/test1/home/user1/ > >>>> NOTE: all recent files are present > >>>> [root@srv]# ls -la /mnt/x5a/home/test1/home/user2/Documents/ > >>>> NOTE: all recent files are present > >>>> [root@srv]# mkdir /home/.snapshots/test2 > >>>> [root@srv]# mkdir /mnt/x5a/home/test2 > >>>> [root@srv]# btrfs su sn -r /home/ /home/.snapshots/test2/ > >>>> Create a readonly snapshot of '/home/' in '/home/.snapshots/test2//home' > >>>> [root@srv]# sync > >>>> [root@srv]# btrfs send -p
Re: send | receive: received snapshot is missing recent files
I just ran a test. The btrfs send - receive problem I described is indeed fully resolved by removing the "problematic" snapshot on the target device. I did not make any changes to the source volume. I did not make any other changes in my steps (see earlier message for my exact steps). Therefore, the problem I described in my earlier message is not due exclusively to having a Received UUID on the source volume (or to any other feature of the source volume). It is not related to any feature of the directly specified parent volume either. More details are included in my earlier email. Thanks for any further feedback, including answers to my questions and comments about whether this is a known issue. On Thu, Sep 7, 2017 at 8:39 AM, Dave <davestechs...@gmail.com> wrote: > > Hello. Can anyone further explain this issue ("you have a Received UUID on > the source volume")? > > How does it happen? > How does one remove a Received UUID from the source volume? > > And how does that explain my results where I showed that the problem > is not dependent upon the source volume but is instead dependent upon > some existing snapshot on the target volume? > > My results do not appear to be fully explained by a Received UUID on the > source volume, as my prior message hopefully shows clearly. > > Thank you. > > On Thu, Sep 7, 2017 at 2:24 AM, A L <crimsoncott...@gmail.com> wrote: > > The problem can be that you have a Received UUID on the source volume. This > > breaks send-receive. > > > > From: Dave <davestechs...@gmail.com> -- Sent: 2017-09-07 - 06:43 > > > >> Here is more info and a possible (shocking) explanation. This > >> aggregates my prior messages and it provides an almost complete set of > >> steps to reproduce this problem. 
> >> > >> Linux srv 4.9.41-1-lts #1 SMP Mon Aug 7 17:32:35 CEST 2017 x86_64 GNU/Linux > >> btrfs-progs v4.12 > >> > >> My steps: > >> > >> [root@srv]# sync > >> [root@srv]# mkdir /home/.snapshots/test1 > >> [root@srv]# btrfs su sn -r /home/ /home/.snapshots/test1/ > >> Create a readonly snapshot of '/home/' in '/home/.snapshots/test1//home' > >> [root@srv]# sync > >> [root@srv]# mkdir /mnt/x5a/home/test1 > >> [root@srv]# btrfs send /home/.snapshots/test1/home/ | btrfs receive > >> /mnt/x5a/home/test1/ > >> At subvol /home/.snapshots/test1/home/ > >> At subvol home > >> [root@srv]# ls -la /mnt/x5a/home/test1/home/user1/ > >> NOTE: all recent files are present > >> [root@srv]# ls -la /mnt/x5a/home/test1/home/user2/Documents/ > >> NOTE: all recent files are present > >> [root@srv]# mkdir /home/.snapshots/test2 > >> [root@srv]# mkdir /mnt/x5a/home/test2 > >> [root@srv]# btrfs su sn -r /home/ /home/.snapshots/test2/ > >> Create a readonly snapshot of '/home/' in '/home/.snapshots/test2//home' > >> [root@srv]# sync > >> [root@srv]# btrfs send -p /home/.snapshots/test1/home/ > >> /home/.snapshots/test2/home/ | btrfs receive /mnt/x5a/home/test2/ > >> At subvol /home/.snapshots/test2/home/ > >> At snapshot home > >> [root@srv]# ls -la /mnt/x5a/home/test2/home/user1/ > >> NOTE: all recent files are MISSING > >> [root@srv]# ls -la /mnt/x5a/home/test2/home/user2/Documents/ > >> NOTE: all recent files are MISSING > >> > >> Below I am including some rsync output to illustrate when a snapshot > >> is missing files (or not): > >> > >> [root@srv]# rsync -aniv /home/.snapshots/test1/home/ > >> /home/.snapshots/test2/home/ > >> sending incremental file list > >> > >> sent 1,143,286 bytes received 1,123 bytes 762,939.33 bytes/sec > >> total size is 3,642,972,271 speedup is 3,183.28 (DRY RUN) > >> > >> This indicates that these two subvolumes contain the same files, which > >> they should because test2 is a snapshot of test1 without any changes > >> to files, and it was not sent to 
another physical device. > >> > >> The problem is when test2 is sent to another device as shown by the > >> rsync results below. > >> > >> [root@srv]# rsync -aniv /home/.snapshots/test2/home/ > >> /mnt/x5a/home/test2/home/ > >> sending incremental file list > >> .d..t.. ./ > >> .d..t.. user1/ > >>>f.st.. user1/.bash_history > >>>f.st.. user1/.bashrc > >>>f+ user1/test2017-09-06.txt > >> ... > >> and a long list of other missing files > >> > >> The incrementally sent
Re: send | receive: received snapshot is missing recent files
Hello. Can anyone further explain this issue ("you have a Received UUID on the source volume")? How does it happen? How does one remove a Received UUID from the source volume? And how does that explain my results where I showed that the problem is not dependent upon the source volume but is instead dependent upon some existing snapshot on the target volume? My results do not appear to be fully explained by a Received UUID on the source volume, as my prior message hopefully shows clearly. Thank you. On Thu, Sep 7, 2017 at 2:24 AM, A L <crimsoncott...@gmail.com> wrote: > The problem can be that you have a Received UUID on the source volume. This > breaks send-receive. > > From: Dave <davestechs...@gmail.com> -- Sent: 2017-09-07 - 06:43 > >> Here is more info and a possible (shocking) explanation. This >> aggregates my prior messages and it provides an almost complete set of >> steps to reproduce this problem. >> >> Linux srv 4.9.41-1-lts #1 SMP Mon Aug 7 17:32:35 CEST 2017 x86_64 GNU/Linux >> btrfs-progs v4.12 >> >> My steps: >> >> [root@srv]# sync >> [root@srv]# mkdir /home/.snapshots/test1 >> [root@srv]# btrfs su sn -r /home/ /home/.snapshots/test1/ >> Create a readonly snapshot of '/home/' in '/home/.snapshots/test1//home' >> [root@srv]# sync >> [root@srv]# mkdir /mnt/x5a/home/test1 >> [root@srv]# btrfs send /home/.snapshots/test1/home/ | btrfs receive >> /mnt/x5a/home/test1/ >> At subvol /home/.snapshots/test1/home/ >> At subvol home >> [root@srv]# ls -la /mnt/x5a/home/test1/home/user1/ >> NOTE: all recent files are present >> [root@srv]# ls -la /mnt/x5a/home/test1/home/user2/Documents/ >> NOTE: all recent files are present >> [root@srv]# mkdir /home/.snapshots/test2 >> [root@srv]# mkdir /mnt/x5a/home/test2 >> [root@srv]# btrfs su sn -r /home/ /home/.snapshots/test2/ >> Create a readonly snapshot of '/home/' in '/home/.snapshots/test2//home' >> [root@srv]# sync >> [root@srv]# btrfs send -p /home/.snapshots/test1/home/ >> /home/.snapshots/test2/home/ | btrfs 
receive /mnt/x5a/home/test2/ >> At subvol /home/.snapshots/test2/home/ >> At snapshot home >> [root@srv]# ls -la /mnt/x5a/home/test2/home/user1/ >> NOTE: all recent files are MISSING >> [root@srv]# ls -la /mnt/x5a/home/test2/home/user2/Documents/ >> NOTE: all recent files are MISSING >> >> Below I am including some rsync output to illustrate when a snapshot >> is missing files (or not): >> >> [root@srv]# rsync -aniv /home/.snapshots/test1/home/ >> /home/.snapshots/test2/home/ >> sending incremental file list >> >> sent 1,143,286 bytes received 1,123 bytes 762,939.33 bytes/sec >> total size is 3,642,972,271 speedup is 3,183.28 (DRY RUN) >> >> This indicates that these two subvolumes contain the same files, which >> they should because test2 is a snapshot of test1 without any changes >> to files, and it was not sent to another physical device. >> >> The problem is when test2 is sent to another device as shown by the >> rsync results below. >> >> [root@srv]# rsync -aniv /home/.snapshots/test2/home/ >> /mnt/x5a/home/test2/home/ >> sending incremental file list >> .d..t.. ./ >> .d..t.. user1/ >>>f.st.. user1/.bash_history >>>f.st.. user1/.bashrc >>>f+ user1/test2017-09-06.txt >> ... >> and a long list of other missing files >> >> The incrementally sent snapshot at /mnt/x5a/home/test2/home/ is >> missing all recent files (any files from the month of August or >> September), as my prior visual inspections had indicated. The same >> files are missing every time. There is no randomness to the missing >> data. >> >> The problem does not happen for me if the receive command target is >> located on the same physical device as shown next. (However, I suspect >> there's more to it than that, as explained further below.) 
>> >> [root@srv]# mkdir /home/.snapshots/test2rec >> [root@srv]# btrfs send -p /home/.snapshots/test1/home/ >> /home/.snapshots/test2/home/ | btrfs receive >> /home/.snapshots/test2rec/ >> At subvol /home/.snapshots/test2/home/ >> >> # rsync -aniv /home/.snapshots/test2/home/ /home/.snapshots/test2rec/home/ >> sending incremental file list >> >> sent 1,143,286 bytes received 1,123 bytes 2,288,818.00 bytes/sec >> total size is 3,642,972,271 speedup is 3,183.28 (DRY RUN) >> >> The above (as well as visual inspection of files) indicates that these >> two subvolumes contain the same files, which was not the case when the >>
Re: send | receive: received snapshot is missing recent files
Here is more info and a possible (shocking) explanation. This aggregates my prior messages and it provides an almost complete set of steps to reproduce this problem. Linux srv 4.9.41-1-lts #1 SMP Mon Aug 7 17:32:35 CEST 2017 x86_64 GNU/Linux btrfs-progs v4.12 My steps: [root@srv]# sync [root@srv]# mkdir /home/.snapshots/test1 [root@srv]# btrfs su sn -r /home/ /home/.snapshots/test1/ Create a readonly snapshot of '/home/' in '/home/.snapshots/test1//home' [root@srv]# sync [root@srv]# mkdir /mnt/x5a/home/test1 [root@srv]# btrfs send /home/.snapshots/test1/home/ | btrfs receive /mnt/x5a/home/test1/ At subvol /home/.snapshots/test1/home/ At subvol home [root@srv]# ls -la /mnt/x5a/home/test1/home/user1/ NOTE: all recent files are present [root@srv]# ls -la /mnt/x5a/home/test1/home/user2/Documents/ NOTE: all recent files are present [root@srv]# mkdir /home/.snapshots/test2 [root@srv]# mkdir /mnt/x5a/home/test2 [root@srv]# btrfs su sn -r /home/ /home/.snapshots/test2/ Create a readonly snapshot of '/home/' in '/home/.snapshots/test2//home' [root@srv]# sync [root@srv]# btrfs send -p /home/.snapshots/test1/home/ /home/.snapshots/test2/home/ | btrfs receive /mnt/x5a/home/test2/ At subvol /home/.snapshots/test2/home/ At snapshot home [root@srv]# ls -la /mnt/x5a/home/test2/home/user1/ NOTE: all recent files are MISSING [root@srv]# ls -la /mnt/x5a/home/test2/home/user2/Documents/ NOTE: all recent files are MISSING Below I am including some rsync output to illustrate when a snapshot is missing files (or not): [root@srv]# rsync -aniv /home/.snapshots/test1/home/ /home/.snapshots/test2/home/ sending incremental file list sent 1,143,286 bytes received 1,123 bytes 762,939.33 bytes/sec total size is 3,642,972,271 speedup is 3,183.28 (DRY RUN) This indicates that these two subvolumes contain the same files, which they should because test2 is a snapshot of test1 without any changes to files, and it was not sent to another physical device. 
The problem is when test2 is sent to another device as shown by the rsync results below. [root@srv]# rsync -aniv /home/.snapshots/test2/home/ /mnt/x5a/home/test2/home/ sending incremental file list .d..t.. ./ .d..t.. user1/ >f.st.. user1/.bash_history >f.st.. user1/.bashrc >f+ user1/test2017-09-06.txt ... and a long list of other missing files The incrementally sent snapshot at /mnt/x5a/home/test2/home/ is missing all recent files (any files from the month of August or September), as my prior visual inspections had indicated. The same files are missing every time. There is no randomness to the missing data. The problem does not happen for me if the receive command target is located on the same physical device as shown next. (However, I suspect there's more to it than that, as explained further below.) [root@srv]# mkdir /home/.snapshots/test2rec [root@srv]# btrfs send -p /home/.snapshots/test1/home/ /home/.snapshots/test2/home/ | btrfs receive /home/.snapshots/test2rec/ At subvol /home/.snapshots/test2/home/ # rsync -aniv /home/.snapshots/test2/home/ /home/.snapshots/test2rec/home/ sending incremental file list sent 1,143,286 bytes received 1,123 bytes 2,288,818.00 bytes/sec total size is 3,642,972,271 speedup is 3,183.28 (DRY RUN) The above (as well as visual inspection of files) indicates that these two subvolumes contain the same files, which was not the case when the same command had a target located on another physical device. Of course, a snapshot which resides on the same physical device is not a very good backup. So I do need to send it to another device, but that results in missing files when the -p or -c options are used with btrfs send. (Non-incremental sending to another physical device does work.) I can think of a couple possible explanations. One is that there is a problem when using the -p or -c options with btrfs send when the target is another physical device. I suspect this is not the whole explanation, however. 
A second possibility is that the presence of prior existing snapshots at the target location (even if old and not referenced in any current btrfs command) can determine the outcome and final contents of an incremental send operation. I believe the info below suggests this to be the case. [root@srv]# btrfs su show /home/.snapshots/test2/home/ test2/home Name: home UUID: 292e8bbf-a95f-2a4e-8280-129202d389dc Parent UUID: 62418df6-a1f8-d74a-a152-11f519593053 Received UUID: e00d5318-6efd-824e-ac91-f25efa5c2a74 Creation time: 2017-09-06 15:38:16 -0400 Subvolume ID: 2000 Generation: 5020 Gen at creation: 5020 Parent ID: 257 Top level ID: 257 Flags: readonly Snapshot(s): [root@srv]# btrfs su show /mnt/x5a/home/test1/home home/test1/home Name: home UUID:
Re: send | receive: received snapshot is missing recent files
This is an even better set of steps for reproducing the problem. [root@srv]# sync [root@srv]# mkdir /home/.snapshots/test1 [root@srv]# btrfs su sn -r /home/ /home/.snapshots/test1/ Create a readonly snapshot of '/home/' in '/home/.snapshots/test1//home' [root@srv]# sync [root@srv]# mkdir /mnt/x5a/home/test1 [root@srv]# btrfs send /home/.snapshots/test1/home/ | btrfs receive /mnt/x5a/home/test1/ At subvol /home/.snapshots/test1/home/ At subvol home [root@srv]# ls -la /mnt/x5a/home/test1/home/user1/ NOTE: all recent files are present [root@srv]# ls -la /mnt/x5a/home/test1/home/user2/Documents/ NOTE: all recent files are present [root@srv]# mkdir /home/.snapshots/test2 [root@srv]# mkdir /mnt/x5a/home/test2 [root@srv]# btrfs su sn -r /home/ /home/.snapshots/test2/ Create a readonly snapshot of '/home/' in '/home/.snapshots/test2//home' [root@srv]# sync [root@srv]# btrfs send -p /home/.snapshots/test1/home/ /home/.snapshots/test2/home/ | btrfs receive /mnt/x5a/home/test2/ At subvol /home/.snapshots/test2/home/ At snapshot home [root@srv]# ls -la /mnt/x5a/home/test2/home/user1/ NOTE: all recent files are MISSING [root@srv]# ls -la /mnt/x5a/home/test2/home/user2/Documents/ NOTE: all recent files are MISSING Any ideas what could be causing this problem with incremental backups? On Wed, Sep 6, 2017 at 3:23 PM, Dave <davestechs...@gmail.com> wrote: > > Here is more info on this problem. I can reproduce this without using my > script. Simple btrfs commands will reproduce the problem every time. The same > files are missing every time. There is no randomness to the missing data. > > Here are my steps: > > 1. snapper -c home create > result is a valid snapshot at /home/.snapshots/1704/snapshot > 2. btrfs send /home/.snapshots/1704/snapshot | btrfs receive > /mnt/x5a/home/1704 > 3. snapper -c home create > result is a valid snapshot at /home/.snapshots/1716/snapshot > 4. 
> btrfs send -c /home/.snapshots/1704/snapshot/ /home/.snapshots/1716/snapshot/ | btrfs receive /mnt/x5a/home/1716/
>
> I expect /mnt/x5a/home/1716/snapshot to be identical to
> /home/.snapshots/1716/snapshot. However, it is not.
> The result is that /mnt/x5a/home/1716/snapshot is missing all recent files.
>
> The next step was to delete snapshot 1716 (in both locations) and repeat
> the send | receive using -p:
>
> btrfs su del /mnt/x5a/home/1716/snapshot
> snapper -c home delete 1716
> snapper -c home create
> btrfs send -p /home/.snapshots/1704/snapshot/ /home/.snapshots/1716/snapshot/ | btrfs receive /mnt/x5a/home/1716/
>
> The result is once again that /mnt/x5a/home/1716/snapshot is missing all
> recent files. However, the other snapshots are all valid:
>
> /home/.snapshots/1704/snapshot is valid & complete
> /mnt/x5a/home/1704/snapshot -- non-incremental send: snapshot is valid & complete
> /home/.snapshots/1716/snapshot is valid & complete
> /mnt/x5a/home/1716/snapshot -- incrementally sent snapshot is missing all recent files, whether sent with -c or -p
>
> The incrementally sent snapshot is even missing files that are present in
> the reference snapshot /mnt/x5a/home/1704/snapshot.
>
> On Wed, Sep 6, 2017 at 1:37 AM, Dave <davestechs...@gmail.com> wrote:
>>
>> I'm running Arch Linux on BTRFS. I use Snapper to take hourly
>> snapshots and it works without any issues.
>>
>> I have a bash script that uses send | receive to transfer snapshots to
>> a couple of external HDDs. The script runs daily on a systemd timer. I
>> set all this up recently, and at first I noticed only that it runs every
>> day and that the expected snapshots are received.
>>
>> At a glance, everything looked correct. However, today was my day to
>> drill down and really make sure everything was working.
>>
>> To my surprise, the newest received incremental snapshots are missing
>> all recent files.
>> These new snapshots reflect the system state from weeks ago, and no
>> files more recent than a certain date are in the snapshots.
>>
>> However, the snapshots are newly created and newly received. The work
>> is being done fresh each day when my script runs, but the results are
>> anchored back in time at this earlier date. Weird.
>>
>> I'm not really sure where to start troubleshooting, so I'll start by
>> sharing part of my script. I'm sure the problem is in my script and is
>> not related to BTRFS or snapper functionality. (As I said, the Snapper
>> snapshots are totally OK before being sent | received.)
>>
>> These are the key lines of the script I'm using to send | receive a snapshot:
>>
>> old_num=$(snapper -c "$config" list -t single | awk '/'"$selected_uuid"'/ {print $1}')
>> old_snap=$SUBVOLUME/.s
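One way to make the breakage in the repro steps above visible without eyeballing `ls` output is to diff the full file lists of the source snapshot and the received snapshot. This is a minimal sketch, not from the thread; `compare_trees` is a hypothetical helper name, and it needs bash for process substitution.

```shell
#!/bin/bash
# Hypothetical helper: list both trees by relative path and diff the
# sorted lists. Any output means the received snapshot's file set
# diverged from the source snapshot's.
compare_trees() {
    diff <(cd "$1" && find . | LC_ALL=C sort) \
         <(cd "$2" && find . | LC_ALL=C sort)
}

# Usage against the snapshots discussed above (requires read access):
#   compare_trees /home/.snapshots/1716/snapshot /mnt/x5a/home/1716/snapshot
```

Note this only compares file names; adding checksums (e.g. via `md5sum`) would also catch differing contents, at the cost of reading every file.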
send | receive: received snapshot is missing recent files
I'm running Arch Linux on BTRFS. I use Snapper to take hourly snapshots and it works without any issues.

I have a bash script that uses send | receive to transfer snapshots to a couple of external HDDs. The script runs daily on a systemd timer. I set all this up recently, and at first I noticed only that it runs every day and that the expected snapshots are received.

At a glance, everything looked correct. However, today was my day to drill down and really make sure everything was working.

To my surprise, the newest received incremental snapshots are missing all recent files. These new snapshots reflect the system state from weeks ago, and no files more recent than a certain date are in the snapshots.

However, the snapshots are newly created and newly received. The work is being done fresh each day when my script runs, but the results are anchored back in time at this earlier date. Weird.

I'm not really sure where to start troubleshooting, so I'll start by sharing part of my script. I'm sure the problem is in my script and is not related to BTRFS or snapper functionality. (As I said, the Snapper snapshots are totally OK before being sent | received.)

These are the key lines of the script I'm using to send | receive a snapshot:

old_num=$(snapper -c "$config" list -t single | awk '/'"$selected_uuid"'/ {print $1}')
old_snap=$SUBVOLUME/.snapshots/$old_num/snapshot
new_num=$(snapper -c "$config" create --print-number)
new_snap=$SUBVOLUME/.snapshots/$new_num/snapshot
btrfs send -c "$old_snap" "$new_snap" | $ssh btrfs receive "$backup_location"

I have to admit that even after reading the following page half a dozen times, I barely understand the difference between -c and -p:

https://btrfs.wiki.kernel.org/index.php/FAQ#What_is_the_difference_between_-c_and_-p_in_send.3F

After reading that page again today, I feel like I should switch to -p (maybe). However, the -c vs. -p choice probably isn't my problem. Any ideas what my problem could be?
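For what it's worth, the usual reading of the FAQ: `-p <parent>` makes the stream a true incremental, and the receive side recreates the new subvolume as a snapshot of its already-received copy of the parent (matched via Received UUID); `-c <snapshot>` only declares extra clone sources for sharing data, though send may infer a parent from the -c snapshots when it can. A hedged sketch of switching the script to -p with a guard follows; `plan_send` is a hypothetical helper, the variable names follow the script quoted above, and passing `echo` as the first argument gives a dry run.

```shell
# Hedged sketch: choose between an incremental (-p) and a full send
# depending on whether the parent snapshot path exists locally.
# First argument is a runner ("echo" for a dry run, "" to execute).
plan_send() {
    local run=$1 old_snap=$2 new_snap=$3
    if [ -d "$old_snap" ]; then
        # -p: stream only the changes since old_snap; receive applies
        # them on top of its own received copy of the parent.
        $run btrfs send -p "$old_snap" "$new_snap"
    else
        # No usable parent: fall back to a full (non-incremental) send.
        $run btrfs send "$new_snap"
    fi
}

# Dry-run example:
#   plan_send echo /home/.snapshots/1704/snapshot /home/.snapshots/1716/snapshot
```

The output of the real send would still be piped into `$ssh btrfs receive "$backup_location"` as in the script; the point of the guard is that a bad `$old_num` lookup (empty awk match) silently yields a nonexistent `$old_snap`, which is worth ruling out.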