Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)
On Wed, Sep 19, 2018 at 12:12:03AM -0400, Zygo Blaxell wrote:
> On Mon, Sep 10, 2018 at 07:06:46PM +1000, Dave Chinner wrote:
> > On Thu, Sep 06, 2018 at 11:53:06PM -0400, Zygo Blaxell wrote:
> > > On Thu, Sep 06, 2018 at 06:38:09PM +1000, Dave Chinner wrote:
> > > > On Fri, Aug 31, 2018 at 01:10:45AM -0400, Zygo Blaxell wrote:
> > > > > On Thu, Aug 30, 2018 at 04:27:43PM +1000, Dave Chinner wrote:
> > > > > > On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote:
> > > > > For future development I've abandoned the entire dedupe_file_range
> > > > > approach. I need to be able to read and dedupe the data blocks of
> > > > > the filesystem directly without having to deal with details like which
> > > > > files those blocks belong to, especially on filesystems with lots of
> > > > > existing deduped blocks and snapshots.
> > > >
> > > > IOWs, your desired OOB dedupe algorithm is:
> > > >
> > > > a) ask the filesystem where all its file data is
> > >
> > > Actually, it's "ask the filesystem where all the *new* file data is"
> > > since we don't want to read any unique data twice on subsequent runs.
> >
> > Sorry, how do you read "unique data" twice? By definition, unique
> > data only occurs once.
>
> ...but once it has been read, we don't want to read it again. Ever.
> Even better would be to read unique data less than 1.0 times on average.
>
> > Oh, and you still need to hash the old data so you can find
> > collisions with the new data that got written. Unless, of course,
> > you are keeping your hash tree in a persistent database...
>
> I do that.

OK, so you're slowly re-inventing the typical HSM implementation, which keeps a userspace database to track what is happening in the filesystem. They do this to make better decisions when moving less frequently accessed data out to slower tiers of storage, to keep space in the primary storage available as it fills up. You're approaching dedupe in essentially the same way, for very similar reasons.
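The persistent hash database being alluded to can be sketched minimally in userspace. This is an illustrative sketch only - the sqlite table layout, block size, and hash choice are my assumptions, not bees' or any HSM's actual schema - but it shows the property under discussion: once a block's hash is recorded, unique data never has to be read again.

```python
import hashlib
import sqlite3

BLOCK_SIZE = 4096  # assumed block granularity for hashing

def open_index(path=":memory:"):
    # Create the (hypothetical) persistent hash index.
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS blocks (
                      hash   BLOB PRIMARY KEY,
                      ino    INTEGER,
                      offset INTEGER)""")
    return db

def record_block(db, ino, offset, data):
    """Record a block's hash; return the existing owner (ino, offset)
    if this hash was already present, i.e. a dedupe candidate."""
    h = hashlib.sha256(data).digest()
    row = db.execute("SELECT ino, offset FROM blocks WHERE hash = ?",
                     (h,)).fetchone()
    if row is not None:
        return row          # duplicate: caller would issue a dedupe request
    db.execute("INSERT INTO blocks VALUES (?, ?, ?)", (h, ino, offset))
    return None             # unique data: recorded, never re-read
```

Because the index persists across runs, a subsequent scan only needs to hash newly written blocks, which is the "read unique data less than 1.0 times" goal described above.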
> One big difference I am noticing in our approaches is latency. ZFS (and
> in-kernel btrfs dedupe) provides minimal dedupe latency (duplicate
> data occupies disk space for zero time as it is never written to disk
> at all) but it requires more RAM for a given dedupe hit rate than any
> other dedupe implementation I've seen. What you've written tells me
> XFS saves RAM by partitioning the data and relying on an existing but
> very large source of iops (sharing scrub reads with dedupe), but then
> the dedupe latency is the same as the scrub interval (the worst so far).

That's just the background maintenance implementation. xfs_fsr can run this way, but it can also be run as a complete immediate scan, or it can be pointed at a defined set of files to run on. We can (and probably will) do exactly the same thing with dedupe.

> bees aims to have latency of a few minutes (ideally scanning data while
> it's still dirty in cache, but there's no good userspace API for that)
> though it's obviously not there yet.

Yup, this is pretty much the "I need immediate notification that a file changed" problem that HSMs face. That's one of the things DMAPI was used for - XFS has (now unused) dmapi event notification fields in its inode structure that were used by DMAPI to configure the notifications the filesystem was supposed to send to userspace.

With no DMAPI in the future, people with custom HSM-like interfaces based on dmapi are starting to turn to fanotify and friends to provide them with the change notifications they require.

> > e.g. a soft requirement is that we need to scan the entire fs at
> > least once a month.
>
> I have to scan and dedupe multiple times per hour. OK, the first-ever
> scan of a non-empty filesystem is allowed to take much longer, but after
> that, if you have enough spare iops for continuous autodefrag you should
> also have spare iops for continuous dedupe.
Yup, but using notifications avoids the need for even these scans - you'd know exactly what data has changed, when it changed, and know exactly what you needed to read to calculate the new hashes.

> > A simple piece-wise per-AG scanning algorithm (like we use in
> > xfs_repair) could easily work within a 3GB RAM per AG constraint and
> > would scale very well. We'd only need to scan 30-40 AGs in the hour,
> > and a single AG at 1GB/s will only take 2 minutes to scan. We can
> > then do the processing while the next AG gets scanned. If we've got
> > 10-20GB RAM to use (and who doesn't when they have 1PB of storage?)
> > then we can scan 5-10AGs at once to keep th
Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)
On Thu, Sep 06, 2018 at 11:53:06PM -0400, Zygo Blaxell wrote:
> On Thu, Sep 06, 2018 at 06:38:09PM +1000, Dave Chinner wrote:
> > On Fri, Aug 31, 2018 at 01:10:45AM -0400, Zygo Blaxell wrote:
> > > On Thu, Aug 30, 2018 at 04:27:43PM +1000, Dave Chinner wrote:
> > > > On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote:
> > > For future development I've abandoned the entire dedupe_file_range
> > > approach. I need to be able to read and dedupe the data blocks of
> > > the filesystem directly without having to deal with details like which
> > > files those blocks belong to, especially on filesystems with lots of
> > > existing deduped blocks and snapshots.
> >
> > IOWs, your desired OOB dedupe algorithm is:
> >
> > a) ask the filesystem where all its file data is
>
> Actually, it's "ask the filesystem where all the *new* file data is"
> since we don't want to read any unique data twice on subsequent runs.

Sorry, how do you read "unique data" twice? By definition, unique data only occurs once.

Oh, and you still need to hash the old data so you can find collisions with the new data that got written. Unless, of course, you are keeping your hash tree in a persistent database and can work out how to prune stale entries out of it efficiently.

And, well, see my comments about media scrubbing below.

> > b) read that used space to build a data hash index
> > c) on all the data hash collisions find the owners of the
> >    colliding blocks
> > d) if the block data is the same dedupe it
> >
> > I agree - that's a simple and effective algorithm. It's also the
> > obvious solution to an expert in the field.
>
> It's obvious, but if you literally implement that, you end up with a
> filesystem shattered into billions of fragments.

Well, yes, that's obvious. I thought that "details omitted for reasons of brevity" would be understood, not require omitted details to be explained to me.

> At step c) I need to figure out _which_ colliding block I want to keep.
> That's where the algorithm needs to inspect the higher-level file
> structure to determine the physical and logical situation for each block.
> That leads to one of:
>
> a) store the hashes and block locations in the hash table
> b) relocate the blocks (defrag)
> c) replace some blocks with a reference to other blocks (dedupe)
>
> This sounds a bit like what xfs_fsr does (or will do?).

xfs_fsr currently does the defrag bit w/ the swapext ioctl(*). It'll get extended to also do dedupe scans eventually, as defrag and dedupe obviously need to be integrated to prevent them from duelling.

(*) swapext - in case you might not know about it - atomically swaps a range of extents between two files without copying data.

So you remember those dedupe selection problems you were talking about (details omitted for brevity):

inode 1: ABCDEFG|H|I|JK|L|M|N|O|P
inode 2: a|b|c|d|e|f|g|hijklmop

The XFS solution will be to first defrag the master file (inode 1) with the larger extents from inode 2:

swapext(H-P, h-p)

inode 1: ABCDEFG|hijklmop
inode 2: a|b|c|d|e|f|g|H|I|JK|L|M|N|O|P

And then we dedupe to the defragged file:

dedupe(inode 1, inode 2)

inode 1: ABCDEFG|hijklmop
inode 2: ABCDEFG|hijklmop

> Bees also operates under a constant-RAM constraint, so it doesn't operate
> in two distinct "collect data" and "act on data collected" passes,
> and cannot spend memory to store data about more than a few extents at
> any time.

I suspect that I'm thinking at a completely different scale to you. I don't really care for highly constrained or optimal dedupe algorithms because those last few dedupe percentages really don't matter that much to me. I care much more about using all the resources we can and running as fast as we possibly can, then providing the admin with means to throttle performance back to what they need.

i.e. I'm concerned about how to effectively scan and dedupe PBs of data, where scan rates may need to be measured in tens of GB/s.
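The swapext-then-dedupe sequence above can be modelled with plain lists standing in for extent maps. This is purely illustrative - the real swapext ioctl operates on byte ranges of real files, not Python lists - but it makes the two-step transformation easy to check:

```python
# Extent maps as lists of extent names; extents that compare equal are shared.
inode1 = ["ABCDEFG", "H", "I", "JK", "L", "M", "N", "O", "P"]
inode2 = ["a", "b", "c", "d", "e", "f", "g", "hijklmop"]

def swapext(map_a, slice_a, map_b, slice_b):
    # Atomically exchange two extent ranges without copying any data.
    map_a[slice_a], map_b[slice_b] = map_b[slice_b], map_a[slice_a]

def dedupe(src, dst):
    # Replace dst's extents with references to src's (data assumed identical).
    dst[:] = src[:]

# Step 1: defrag the master file with the larger extents from inode 2.
swapext(inode1, slice(1, 9), inode2, slice(7, 8))
assert inode1 == ["ABCDEFG", "hijklmop"]
assert inode2 == ["a", "b", "c", "d", "e", "f", "g",
                  "H", "I", "JK", "L", "M", "N", "O", "P"]

# Step 2: dedupe against the defragged file.
dedupe(inode1, inode2)
assert inode1 == inode2
```

The point of the ordering is visible in the asserts: after the swap, inode 1 holds the large contiguous extents, so the subsequent dedupe shares those rather than the fragmented ones.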
In these cases the limiting factor is either suboptimal IO patterns (low IO throughput), or not having enough CPU and memory bandwidth for hashing the data the IO engine is streaming through memory. Once we get past a certain point, we'll gain far more by parallelising the dedupe scan over lots of CPU than we ever will by optimising an algorithm that cannot be parallelised. Hence we need to start with a parallelisable algorithm, not an artificially constrained one.

e.g. a soft requirement is that we need to scan the entire fs at least once a month. That will need to be broken up into nightly work, with a full 1PB filesystem needing 33+TB a night to be scanned. We typically get 1
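The scan-rate requirement above is straightforward arithmetic. The 1 PB capacity and once-a-month full scan are from the text; the 8-hour nightly maintenance window is my assumption for illustration:

```python
# Decimal units: 1 PB = 1000 TB, 1 TB = 1000 GB.
PB = 1000 ** 5
TB = 1000 ** 4
GB = 1000 ** 3

fs_size = 1 * PB
nights = 30                      # full scan at least once a month

per_night = fs_size / nights
assert round(per_night / TB, 1) == 33.3   # the "33+TB a night" figure

# Assume an 8-hour nightly maintenance window:
window_secs = 8 * 3600
rate = per_night / window_secs
print(f"sustained scan rate: {rate / GB:.2f} GB/s")
```

That sustained per-night rate is for a single serial scan; hitting "tens of GB/s" on the first-ever full scan is what pushes the design toward a parallelisable per-AG algorithm rather than a serial one.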
Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)
On Fri, Aug 31, 2018 at 01:10:45AM -0400, Zygo Blaxell wrote:
> On Thu, Aug 30, 2018 at 04:27:43PM +1000, Dave Chinner wrote:
> > On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote:
> > > On Mon, Aug 20, 2018 at 08:33:49AM -0700, Darrick J. Wong wrote:
> > > > On Mon, Aug 20, 2018 at 11:09:32AM +1000, Dave Chinner wrote:
> > > > > - is documenting rejection on request alignment grounds
> > > > >   (i.e. EINVAL) in the man page sufficient for app
> > > > >   developers to understand what is going on here?
> > > >
> > > > I think so. The manpage says: "The filesystem does not support
> > > > reflinking the ranges of the given files", which (to my mind) covers
> > > > this case of not supporting dedupe of EOF blocks.
> > >
> > > Older versions of btrfs dedupe (before v4.2 or so) used to do exactly
> > > this; however, on btrfs, not supporting dedupe of EOF blocks means small
> > > files (one extent) cannot be deduped at all, because the EOF block holds
> > > a reference to the entire dst extent. If a dedupe app doesn't go all the
> > > way to EOF on btrfs, then it should not attempt to dedupe any part of the
> > > last extent of the file as the benefit would be zero or slightly negative.
> >
> > That's a filesystem implementation issue, not an API or application
> > issue.
>
> The API and application issue remains even if btrfs is not considered.
> btrfs is just the worst case outcome. Other filesystems still have
> fragmentation issues, and applications have efficiency-vs-capability
> tradeoffs to make if they can't rely on dedupe-to-EOF being available.
>
> Tools like 'cp --reflink=auto' work by trying the best case, then falling
> back to a second choice if the first choice returns an error.

Well, yes. That's necessary for the "cp" tool to behave according to user expectations. That's not a kernel API issue - that's just an implementation of an *application requirement*.
Indeed, this is identical to the behaviour of rename() in "mv" - if rename fails with -EXDEV, mv needs to fall back to a manual copy because the user expects the file to be moved.

IOWs, these application level requirements you talk about are just not relevant to the kernel API for dedupe/clone operations.

[snip]

> It is something that naive dedupe apps will do. "naive" here meaning
> "does not dive deeply into the filesystem's physical structure (or survey
> the entire filesystem with FIEMAP) to determine that the thousand-refs
> situation does not exist at dst prior to invoking the dedupe() call."

/me sighs and points at FS_IOC_GETFSMAP

$ man ioctl_getfsmap

DESCRIPTION
       This ioctl(2) operation retrieves physical extent mappings for a
       filesystem. This information can be used to discover which files
       are mapped to a physical block, examine free space, or find known
       bad blocks, among other things.

I don't really care about "enabling" naive, inefficient applications. I care about applications that scale to huge filesystems and can get the job done quickly and efficiently.

> > XFS doesn't have partial overlaps, we don't have nodatacow hacks,
> > and the subvol snapshot stuff I'm working on just uses shared data
> > extents so it's 100% compatible with dedupe.
>
> If you allow this sequence of operations, you get partial overlaps:
>
> dedupe(fd1, 0, fd2, 0, 1048576);
>
> dedupe(fd2, 16384, fd3, 0, 65536);

Oh, I misunderstood - I thought you were referring to sharing partial filesystem blocks (like at EOF) because that's what this discussion was originally about. XFS supports the above just fine.

[snip]

tl;dr we don't need a new clone or dedupe API

> For future development I've abandoned the entire dedupe_file_range
> approach. I need to be able to read and dedupe the data blocks of
> the filesystem directly without having to deal with details like which
> files those blocks belong to, especially on filesystems with lots of
> existing deduped blocks and snapshots.
IOWs, your desired OOB dedupe algorithm is:

a) ask the filesystem where all its file data is
b) read that used space to build a data hash index
c) on all the data hash collisions find the owners of the colliding blocks
d) if the block data is the same dedupe it

I agree - that's a simple and effective algorithm. It's also the obvious solution to an expert in the field.

> The file structure is frankly
> just noise for dedupe on large filesystems.

We learnt that lesson back in the late 1990s. xfsdump, xfs_fsr, all the SGI^WHPE HSM scanning tools, etc all avoid the directory structure
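Steps a)-d) can be sketched end-to-end in a few lines. This is a toy scanner, not any real tool: the block size and SHA-256 are my choices, and the real step d) would byte-compare the candidates and hand them to the dedupe ioctl rather than just report them.

```python
import hashlib

BS = 4096  # assumed scan granularity

def find_dupes(blocks):
    """b) build a data hash index over the used space, c) find the owners
    of colliding blocks, d) emit them as dedupe candidates.

    `blocks` is an iterable of ((file, offset), data) pairs; step a),
    "where the file data is", is whatever enumerates them (fiemap,
    getfsmap, or a plain directory walk).
    """
    index = {}        # block hash -> first (file, offset) seen with that data
    candidates = []   # (existing_owner, duplicate_owner) pairs
    for where, data in blocks:
        h = hashlib.sha256(data).digest()
        if h in index:
            # A real tool byte-compares before deduping, in case of
            # hash collisions.
            candidates.append((index[h], where))
        else:
            index[h] = where
    return candidates

def scan_files(paths):
    """Trivial stand-in for step a): enumerate whole blocks from files,
    skipping the partial EOF block."""
    for path in paths:
        with open(path, "rb") as f:
            offset = 0
            while True:
                data = f.read(BS)
                if len(data) < BS:
                    break
                yield ((path, offset), data)
                offset += BS
```

Note that nothing in the pipeline cares which files own the blocks until a collision is found - which is exactly the "file structure is just noise" observation above.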
Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)
On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote:
> On Mon, Aug 20, 2018 at 08:33:49AM -0700, Darrick J. Wong wrote:
> > On Mon, Aug 20, 2018 at 11:09:32AM +1000, Dave Chinner wrote:
> > > - is documenting rejection on request alignment grounds
> > >   (i.e. EINVAL) in the man page sufficient for app
> > >   developers to understand what is going on here?
> >
> > I think so. The manpage says: "The filesystem does not support
> > reflinking the ranges of the given files", which (to my mind) covers
> > this case of not supporting dedupe of EOF blocks.
>
> Older versions of btrfs dedupe (before v4.2 or so) used to do exactly
> this; however, on btrfs, not supporting dedupe of EOF blocks means small
> files (one extent) cannot be deduped at all, because the EOF block holds
> a reference to the entire dst extent. If a dedupe app doesn't go all the
> way to EOF on btrfs, then it should not attempt to dedupe any part of the
> last extent of the file as the benefit would be zero or slightly negative.

That's a filesystem implementation issue, not an API or application issue.

> The app developer would need to be aware that such a restriction could
> exist on some filesystems, and be able to distinguish this from other
> cases that could lead to EINVAL. Portable code would have to try a dedupe
> up to EOF, then if that failed, round down and retry, and if that failed
> too, the app would have to figure out which filesystem it's running on
> to know what to do next. Performance demands the app know what the FS
> will do in advance, and avoid a whole class of behavior.

Nobody writes "portable" applications like that. They read the man page first, and work out what the common subset of functionality is and then code from that. Man page says:

"Disk filesystems generally require the offset and length arguments to be aligned to the fundamental block size."

IOWs, compatible code starts with supporting the general case. i.e.
a range rounded to filesystem block boundaries (it's already run fstat() on the files it wants to dedupe to find their size, yes?), hence ignoring the partial EOF block. Will just work on everything.

Code that then wants to optimise for btrfs/xfs/ocfs quirks runs fstatvfs to determine what fs it's operating on and applies the necessary quirks. For btrfs it can extend the range to include the partial EOF block, and hence will handle the implementation quirks btrfs has with single extent dedupe.

Simple, reliable, and doesn't require any sort of flailing about with offsets and lengths to avoid unexpected EINVAL errors.

> btrfs dedupe reports success if the src extent is inline and the same
> size as the dst extent (i.e. file is smaller than one page). No dedupe
> can occur in such cases--a clone results in a simple copy, so the best
> a dedupe could do would be a no-op. Returning EINVAL there would break
> a few popular tools like "cp --reflink". Returning OK but doing nothing
> seems to be the best option in that case.

Again, those are filesystem implementation issues, not problems with the API itself.

> > > - should we just round down the EOF dedupe request to the
> > >   block before EOF so dedupe still succeeds?
> >
> > I've often wondered if the interface should (have) be(en) that we start
> > at src_off/dst_off and share as many common blocks as possible until we
> > find a mismatch, then tell userspace where we stopped... instead of like
> > now where we compare the entire extent and fail if any part of it
> > doesn't match.
>
> The usefulness or harmfulness of that approach depends a lot on what
> the application expects the filesystem to do.
>
> In btrfs, the dedupe operation acts on references to data, not the
> underlying data blocks. If there are 1000 slightly overlapping references
> to a single contiguous range of data blocks in dst on disk, each dedupe
> operation acts on only one of those, leaving the other 999 untouched.
> If the app then submits 999 other dedupe requests, no references to the
> dst blocks remain and the underlying data blocks can be deleted.

Assuming your strawman is valid, if you have a thousand separate references across the same set of data blocks on disk, then that data is already heavily deduplicated. Trying to optimise that further seems misguided, way down the curve of diminishing returns.

> In a parallel universe (or a better filesystem, or a userspace emulation
                               ^^^
> built out of dedupe and other ioctls), dedupe could work at the extent
> data (physical) level. The app points at src and dst extent references
> (inode/offset/length tuples), and the filesystem figures out which
> physical blocks t
Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)
On Mon, Aug 20, 2018 at 08:17:18PM -0500, Eric Sandeen wrote:
>
> On 8/20/18 7:49 PM, Dave Chinner wrote:
> > Upon successful completion of this ioctl, the number of
> > bytes successfully deduplicated is returned in bytes_deduped
> > and a status code for the deduplication operation is
> > returned in status. If even a single byte in the range does
> > not match, the deduplication request will be ignored and
> > status set to FILE_DEDUPE_RANGE_DIFFERS.
> >
> > This implies we can dedupe less than the entire range as long as the
> > entire range matches. If the entire range does not match, we have
> > to return FILE_DEDUPE_RANGE_DIFFERS, but in this case it does match
> > so we can pick and choose how much we deduplicate. How much we
> > dedupe is then returned as a byte count. In this case, it will be a
> > few bytes short of the entire length requested because we aligned
> > the dedupe inwards.
> >
> > Does that sound reasonable?
>
> I had hoped that dedupe was advisory as Darrick wished for, but TBH my
> reading of that is no, if you ask for a range to be deduped and any of
> it differs, "even a single byte," you fail it all.

Yes, if the data differs, then it fails. But that's not what I'm questioning nor what I care about in this case. I'm asking whether the filesystem gets to choose how much of the range of same data is deduped when the file data is apparently the same.

> Why else would that last part be present, if the interface is free to
> ignore later parts that don't match and truncate the range to the
> matching portion?

I think there is a clear differentiation in the man page text between "all the bytes are the same" and "how much of that range the filesystem deduped". i.e. the man page does not say that the filesystem *must* dedupe the entire range if all the data is the same - it says the filesystem will return /how many bytes it successfully deduplicated/.

IOWs, "do nothing and return zero" is a valid ->dedupe_file_range implementation.
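Under that reading, a caller cannot assume bytes_deduped equals the requested length; it has to treat any short-but-successful result as progress and carry on. A sketch of the calling convention follows - `dedupe_range` here is a hypothetical stand-in for issuing FIDEDUPERANGE, not a real API, and the loop is my illustration of caller-side logic, not code from any tool:

```python
# Status values mirroring the names in the man page text quoted above.
FILE_DEDUPE_RANGE_SAME = 0
FILE_DEDUPE_RANGE_DIFFERS = 1

def dedupe_all(dedupe_range, src, src_off, dst, dst_off, length):
    """Keep issuing dedupe requests until the range is consumed, the
    filesystem reports a mismatch, or it stops making progress."""
    total = 0
    while total < length:
        status, done = dedupe_range(src, src_off + total,
                                    dst, dst_off + total, length - total)
        if status == FILE_DEDUPE_RANGE_DIFFERS or done == 0:
            break           # mismatch, or the fs chose to dedupe nothing
        total += done       # partial success is still success
    return total
```

The `done == 0` exit matters precisely because "do nothing and return zero" is a valid implementation: without it, a caller that insisted on deduping the full range would spin forever.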
AFAIC, it's the fact that we do data comparison from the page cache before calling into the filesystem to check on-disk formats that is the problem here. Having identical data in the page cache is not the same thing as "the blocks on disk containing the user data are identical".

i.e. filesystems deduplicate disk blocks, but they often perform transformations on the data as it passes between the page cache and disk blocks, or have constraints on the format of the data on disk. For example:

- encrypted files. unencrypted data in the page cache is the same, therefore we can't return FILE_DEDUPE_RANGE_DIFFERS because the user will be able to see that they are the same. But they will be different on disk after encryption (e.g. because of a different nonce/seed or completely different keys) and so the filesystem should not dedupe them.

- the two files differ in on-disk format, e.g. compressed vs encrypted.

- data might be inline with metadata - tail packing.

- dedupe request might be block aligned at file offsets, but file offsets are not aligned to disk blocks due to, say, multi-block data compression.

- on disk extent tracking granularity might be larger than filesystem block size (e.g. ext4's bigalloc mode) so can't be deduped on filesystem block size alignment.

- the source or destination inode is marked "NODATACOW" so can't contain shared extents.

I'm sure there's others, but I think this is enough to demonstrate that "user visible file data is the same" does not mean the filesystem can dedupe it. The wording in the man page appears to understand these limitations and hence it allows us to silently ignore the partial EOF block when it comes to dedupe.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)
On Mon, Aug 20, 2018 at 08:33:49AM -0700, Darrick J. Wong wrote:
> On Mon, Aug 20, 2018 at 11:09:32AM +1000, Dave Chinner wrote:
> > So why was this dedupe request even accepted by the kernel? Well,
> > I think there's a bug in the check just above this:
> >
> >	/* If we're linking to EOF, continue to the block boundary. */
> >	if (pos_in + *len == isize)
> >		blen = ALIGN(isize, bs) - pos_in;
> >	else
> >		blen = *len;
> >
> > basically, when the "in" file dedupe/reflink range is to EOF, it
> > expands the range to include /all/ of the block that contains the
> > EOF byte. IOWs, it now requests us to dedupe /undefined data beyond
> > EOF/. But when we go to compare the data in these ranges, it
> > truncates the comparison to the length that the user asked for
> > (i.e. *len) and not the extended block length.
> >
> > IOWs, it doesn't compare the bytes beyond EOF in the source block to
> > the data in the destination block it would replace, and so doesn't
> > fail the compare like it should.
..
> > i.e. the dedupe request should fail as we cannot dedupe the EOF
> > block.
> >
> > The patch below does this for the VFS dedupe code. it's not a final
> > patch, it's just a demonstration of how this should never have got
> > past the range checks. Open questions:
>
> (Here's my five minutes of XFS that I'm allowed per day... :/)
>
> > - is documenting rejection on request alignment grounds
> >   (i.e. EINVAL) in the man page sufficient for app
> >   developers to understand what is going on here?
>
> I think so. The manpage says: "The filesystem does not support
> reflinking the ranges of the given files", which (to my mind) covers
> this case of not supporting dedupe of EOF blocks.

Ok.

> > - should we just round down the EOF dedupe request to the
> >   block before EOF so dedupe still succeeds?
>
> I've often wondered if the interface should (have) be(en) that we start
> at src_off/dst_off and share as many common blocks as possible until we
> find a mismatch, then tell userspace where we stopped... instead of like
> now where we compare the entire extent and fail if any part of it
> doesn't match.

I think that the ioctl_fideduperange() man page description gives us the flexibility to do this. That is, this text:

	Upon successful completion of this ioctl, the number of
	bytes successfully deduplicated is returned in bytes_deduped
	and a status code for the deduplication operation is
	returned in status. If even a single byte in the range does
	not match, the deduplication request will be ignored and
	status set to FILE_DEDUPE_RANGE_DIFFERS.

This implies we can dedupe less than the entire range as long as the entire range matches. If the entire range does not match, we have to return FILE_DEDUPE_RANGE_DIFFERS, but in this case it does match so we can pick and choose how much we deduplicate. How much we dedupe is then returned as a byte count. In this case, it will be a few bytes short of the entire length requested because we aligned the dedupe inwards.

Does that sound reasonable?

> > - should file cloning (i.e. reflink) be allowed to share the
> >   EOF block somewhere inside EOF in the destination file?
>
> I don't think we should.
>
> > That's stale data exposure, yes?
>
> Haven't tested that, but seems likely.

Yeah, I haven't tested it either, but I can't see how the current behaviour ends well.

> > - should clone only allow sharing of the EOF block if it's
> >   either a full file clone or a matching range clone and
> >   isize is the same for both src and dest files?
>
> The above sound reasonable for clone, though I would also allow clone to
> share the EOF block if we extend isize of the dest file.

Hmmm. That covers the only valid rule for sharing the EOF block, right? i.e.
That the source EOF lands exactly at or beyond the destination EOF, so it remains the EOF block in both files? FWIW, I just noticed that the ioctl_ficlonerange man page doesn't document the return value on success of a clone.

> The above also nearly sound reasonable for dedupe too. If we encounter
> EOF at the same places in the src range and the dest range, should we
> allow sharing there? The post-eof bytes are undefined by definition;
> does undefined == undefined?

It's undefined, especially as different fs implementations may be doing different things with partial blocks (e.g. tail packing).

From an XFS perspective, I don't think we should physically share partial EOF blocks on dedupe because it means ex
[patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)
[cc linux-fsdevel now, too]

On Mon, Aug 20, 2018 at 09:11:26AM +1000, Dave Chinner wrote:
> [cc linux-...@vger.kernel.org]
>
> On Fri, Aug 17, 2018 at 09:39:24AM +0100, fdman...@kernel.org wrote:
> > From: Filipe Manana
> >
> > Test that deduplication of an entire file that has a size that is not
> > aligned to the filesystem's block size into a different file does not
> > corrupt the destination's file data.

Ok, I've looked at this now. My first question is where did all the magic offsets in this test come from? i.e. how was this bug found and who is it affecting?

> > This test is motivated by a bug found in Btrfs which is fixed by the
> > following patch for the linux kernel:
> >
> > "Btrfs: fix data corruption when deduplicating between different files"
> >
> > XFS also fails this test, at least as of linux kernel 4.18-rc7, exactly
> > with the same corruption as in Btrfs - some bytes of a block get replaced
> > with zeroes after the deduplication.

Filipe, in future can you please report XFS bugs you find to the XFS list the moment you find them. We shouldn't ever find out about a data corruption bug we need to fix via a "oh, by the way" comment in a commit message for a regression test.

This becomes much more relevant because of what I've just found.

> > +# The first byte with a value of 0xae starts at an offset (2518890)
> > +# which is not a multiple of the block size.
> > +$XFS_IO_PROG -f \
> > +	-c "pwrite -S 0x6b 0 2518890" \
> > +	-c "pwrite -S 0xae 2518890 102398" \
> > +	$SCRATCH_MNT/foo | _filter_xfs_io
> > +
> > +# Create a second file with a length not aligned to the block size,
> > +# whose bytes all have the value 0x6b, so that its extent(s) can be
> > +# deduplicated with the first file.
> > +$XFS_IO_PROG -f -c "pwrite -S 0x6b 0 557771" $SCRATCH_MNT/bar | _filter_xfs_io
> > +
> > +# The file is filled with bytes having the value 0x6b from offset 0 to
> > +# offset 2518889 and with the value 0xae from offset 2518890 to offset
> > +# 2621287.
> > +echo "File content before deduplication:"
> > +od -t x1 $SCRATCH_MNT/foo

Please use "od -Ad -t x1" so the file offsets reported by od match the offsets used in the test (i.e. in decimal, not octal).

> > +
> > +# Now deduplicate the entire second file into a range of the first file
> > +# that also has all bytes with the value 0x6b. The destination range's
> > +# end offset must not be aligned to the block size and must be less than
> > +# the offset of the first byte with the value 0xae (byte at offset 2518890).
> > +$XFS_IO_PROG -c "dedupe $SCRATCH_MNT/bar 0 1957888 557771" $SCRATCH_MNT/foo \
> > +	| _filter_xfs_io

Ok, now it gets fun. dedupe to non-block aligned ranges is supposed to be rejected by the kernel in vfs_clone_file_prep_inodes(). i.e. this check:

	/* Only reflink if we're aligned to block boundaries */
	if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_in + blen, bs) ||
	    !IS_ALIGNED(pos_out, bs) || !IS_ALIGNED(pos_out + blen, bs))
		return -EINVAL;

And it's pretty clear that a length of 557771 is not block aligned (being an odd number). So why was this dedupe request even accepted by the kernel? Well, I think there's a bug in the check just above this:

	/* If we're linking to EOF, continue to the block boundary. */
	if (pos_in + *len == isize)
		blen = ALIGN(isize, bs) - pos_in;
	else
		blen = *len;

basically, when the "in" file dedupe/reflink range is to EOF, it expands the range to include /all/ of the block that contains the EOF byte. IOWs, it now requests us to dedupe /undefined data beyond EOF/. But when we go to compare the data in these ranges, it truncates the comparison to the length that the user asked for (i.e. *len) and not the extended block length.
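Plugging the test's numbers into those two checks shows exactly how the request slipped through. The length and EOF values are from the fstest above; the 4096-byte block size is my assumption of a typical configuration:

```python
def ALIGN(x, a):
    # Round x up to a multiple of a, as the kernel's ALIGN() macro does.
    return (x + a - 1) // a * a

def IS_ALIGNED(x, a):
    return x % a == 0

bs = 4096
pos_in, length, isize = 0, 557771, 557771   # whole-file dedupe of "bar"

# The EOF special case silently extends the range past EOF...
blen = ALIGN(isize, bs) - pos_in if pos_in + length == isize else length
assert blen == 561152

# ...so the alignment check passes using blen, even though the length
# the user actually asked for (an odd number) never could have:
assert IS_ALIGNED(pos_in + blen, bs)
assert not IS_ALIGNED(pos_in + length, bs)

# The data compare is then done over `length`, not `blen`, so the bytes
# beyond EOF are never compared - the bug described above.
```

So the range check and the data compare disagree about how long the range is, and the undefined post-EOF bytes sail through unexamined.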
IOWs, it doesn't compare the bytes beyond EOF in the source block to the data in the destination block it would replace, and so doesn't fail the compare like it should.

And, well, btrfs has the same bug. extent_same_check_offsets() extends the range for alignment so it passes alignment checks, but then /explicitly/ uses the original length for the data compare and dedupe. i.e:

	/* pass original length for comparison so we stay within i_size */
	ret = btrfs_cmp_data(olen, cmp);
	if (ret == 0)
		ret = btrfs_clone(src, dst, loff, olen, len, dst_loff, 1);

This is what we should see if someone tried to dedupe the EOF block of a file:

generic/505 - output mismatch (see /home/dave/src/xfstests-dev/resul
Re: [PATCH] generic: test for deduplication between different files
> +od -t x1 $SCRATCH_MNT/foo
> +
> +status=0
> +exit
> diff --git a/tests/generic/505.out b/tests/generic/505.out
> new file mode 100644
> index ..7556b9fb
> --- /dev/null
> +++ b/tests/generic/505.out
> @@ -0,0 +1,33 @@
> +QA output created by 505
> +wrote 2518890/2518890 bytes at offset 0
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 102398/102398 bytes at offset 2518890
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 557771/557771 bytes at offset 0
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +File content before deduplication:
> +000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> +*
> +11467540 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ae ae ae ae ae ae
> +11467560 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae
> +*
> +11777540 ae ae ae ae ae ae ae ae
> +11777550
> +deduped 557771/557771 bytes at offset 1957888
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +File content after deduplication and before unmounting:
> +000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> +*
> +11467540 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ae ae ae ae ae ae
> +11467560 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae
> +*
> +11777540 ae ae ae ae ae ae ae ae
> +11777550
> +File content after unmounting:
> +000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> +*
> +11467540 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ae ae ae ae ae ae
> +11467560 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae
> +*
> +11777540 ae ae ae ae ae ae ae ae
> +11777550
> diff --git a/tests/generic/group b/tests/generic/group
> index 55155de8..2ff1bd7e 100644
> --- a/tests/generic/group
> +++ b/tests/generic/group
> @@ -507,3 +507,4 @@
>  502 auto quick log
>  503 auto quick dax punch collapse zero
>  504 auto quick locks
> +505 auto quick clone dedupe
> --
> 2.11.0
>
>

-- 
Dave Chinner
da...@fromorbit.com
Re: [PATCH V4] test online label ioctl
On Thu, May 17, 2018 at 10:28:26AM -0500, Eric Sandeen wrote: > This tests the online label ioctl that btrfs has, which has been > recently proposed for XFS. > > To run, it requires an updated xfs_io with the label command and a > filesystem that supports it > > A slight change here to _require_xfs_io_command as well, so that tests > which simply fail with "Inappropriate ioctl" can be caught in the > common case. > > Signed-off-by: Eric Sandeen <sand...@redhat.com> > --- > > Now with new and improved sequential V4 versioning! Looks good now. Reviewed-by: Dave Chinner <dchin...@redhat.com> -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V3] test online label ioctl
On Tue, May 15, 2018 at 10:22:37AM -0500, Eric Sandeen wrote: > This tests the online label ioctl that btrfs has, which has been > recently proposed for XFS. > > To run, it requires an updated xfs_io with the label command and a > filesystem that supports it > > A slight change here to _require_xfs_io_command as well, so that tests > which simply fail with "Inappropriate ioctl" can be caught in the > common case. > > Signed-off-by: Eric Sandeen <sand...@redhat.com> > --- > > (urgh send as proper new thread, sorry) > > This passes on btrfs, _notruns on xfs/ext4 of yore, and passes > on xfs w/ my online label patchset (as long as xfs_io has the new > capability) > > V2: Add a max label length helper > Set the proper btrfs max label length o_O oops > Filter trailing whitespace from blkid output > > V3: lowercase local vars, simplify max label len function Looks good now, but I wondered about one thing the test doesn't cover: can you clear the label by setting it to a null string? i.e you check max length bounds, but don't check empty string behaviour... Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V2] test online label ioctl
On Mon, May 14, 2018 at 06:26:07PM -0500, Eric Sandeen wrote: > On 5/14/18 6:11 PM, Dave Chinner wrote: > > On Mon, May 14, 2018 at 12:09:16PM -0500, Eric Sandeen wrote: > >> This tests the online label ioctl that btrfs has, which has been > >> recently proposed for XFS. > >> > >> To run, it requires an updated xfs_io with the label command and a > >> filesystem that supports it > >> > >> A slight change here to _require_xfs_io_command as well, so that tests > >> which simply fail with "Inappropriate ioctl" can be caught in the > >> common case. > >> > >> Signed-off-by: Eric Sandeen <sand...@redhat.com> > >> --- > > > > >> +# The maximum filesystem label length, not including terminating NULL > >> +_label_get_max() > >> +{ > >> + case $FSTYP in > >> + xfs) > >> + MAXLEN=12 > >> + ;; > >> + btrfs) > >> + MAXLEN=255 > >> + ;; > >> + *) > >> + MAXLEN=0 > > > > Why not just _notrun here? > > do we want to go through the other steps only to get here and notrun > halfway through? > > >> + ;; > >> + esac > >> + > >> + echo $MAXLEN > >> +} > >> + > >> +_require_label_get_max() > >> +{ > >> + if [ $(_label_get_max) -eq 0 ]; then > >> + _notrun "$FSTYP does not define maximum label length" > >> + fi > > > > And this check can go away? > > We'd like to know if all the validations in this can complete before we > get started, right? That's why I did it this way. Sure, just trying to be a bit defensive as people often forget _requires directives when writing new tests > > Also, shouldn't it be _require_online_label_change() ? And then > > maybe you can move the xfs_io label command check inside it? 
> Well, we want to know a lot of things:
>
> 1) can the kernel code for this filesystem support online label
> 2) does xfs_io support the label command
> 3) does this test know the maximum label length to test for this fs
>
> the above stuff is for 3)

What I was suggesting was doing all of these in one function similar to _require_xfs_sparse_inodes, _require_meta_uuid, etc:

_require_online_label_change()
{
	# need xfs_io support
	_require_xfs_io_command "label"

	# need fstests knowledge of label size
	[ $(_label_get_max) -eq 0 ] && \
		_notrun "$FSTYP does not define maximum label length"

	# need kernel support
	$XFS_IO_PROG -c label $TEST_DIR > /dev/null 2>&1
	[ $? -ne 0 ] && _notrun "Kernel does not support FS_IOC_GETLABEL"
}

Which also means that the label_f command in xfs_io needs to set the exitcode to non-zero when the ioctl fails so that xfs_io returns non-zero on failure...

Cheers, Dave.
-- Dave Chinner da...@fromorbit.com
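As a concrete sketch, the fstests side of that helper boils down to the _label_get_max table from the patch; the xfs_io and kernel-support probes are omitted here because they need a live scratch filesystem:

```shell
# Max-label-length table as proposed in the patch (xfs: 12, btrfs: 255,
# not including the terminating NUL); 0 means "unknown", which
# _require_label_get_max turns into a _notrun.
_label_get_max()
{
	case $FSTYP in
	xfs)	echo 12 ;;
	btrfs)	echo 255 ;;
	*)	echo 0 ;;
	esac
}

FSTYP=xfs;   max_xfs=$(_label_get_max)
FSTYP=btrfs; max_btrfs=$(_label_get_max)
FSTYP=ext4;  max_other=$(_label_get_max)
echo "xfs=$max_xfs btrfs=$max_btrfs other=$max_other"
```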
Re: [PATCH V2] test online label ioctl
On Mon, May 14, 2018 at 12:09:16PM -0500, Eric Sandeen wrote: > This tests the online label ioctl that btrfs has, which has been > recently proposed for XFS. > > To run, it requires an updated xfs_io with the label command and a > filesystem that supports it > > A slight change here to _require_xfs_io_command as well, so that tests > which simply fail with "Inappropriate ioctl" can be caught in the > common case. > > Signed-off-by: Eric Sandeen <sand...@redhat.com> > --- > > (urgh send as proper new thread, sorry) > > This passes on btrfs, _notruns on xfs/ext4 of yore, and passes > on xfs w/ my online label patchset (as long as xfs_io has the new > capability) > > V2: Add a max label length helper > Set the proper btrfs max label length o_O oops > Filter trailing whitespace from blkid output > > > diff --git a/common/rc b/common/rc > index 9ffab7fd..88a99cff 100644 > --- a/common/rc > +++ b/common/rc > @@ -2158,6 +2158,9 @@ _require_xfs_io_command() > echo $testio | grep -q "Inappropriate ioctl" && \ > _notrun "xfs_io $command support is missing" > ;; > + "label") > + testio=`$XFS_IO_PROG -c "label" $TEST_DIR 2>&1` > + ;; > "open") > # -c "open $f" is broken in xfs_io <= 4.8. Along with the fix, > # a new -C flag was introduced to execute one shot commands. 
> @@ -2196,7 +2199,7 @@ _require_xfs_io_command() > rm -f $testfile 2>&1 > /dev/null > echo $testio | grep -q "not found" && \ > _notrun "xfs_io $command support is missing" > - echo $testio | grep -q "Operation not supported" && \ > + echo $testio | grep -q "Operation not supported\|Inappropriate ioctl" > && \ > _notrun "xfs_io $command failed (old kernel/wrong fs?)" > echo $testio | grep -q "Invalid" && \ > _notrun "xfs_io $command failed (old kernel/wrong fs/bad args?)" > @@ -3802,6 +3805,31 @@ _require_scratch_feature() > esac > } > > +# The maximum filesystem label length, not including terminating NULL > +_label_get_max() > +{ > + case $FSTYP in > + xfs) > + MAXLEN=12 > + ;; > + btrfs) > + MAXLEN=255 > + ;; > + *) > + MAXLEN=0 Why not just _notrun here? > + ;; > + esac > + > + echo $MAXLEN > +} > + > +_require_label_get_max() > +{ > + if [ $(_label_get_max) -eq 0 ]; then > + _notrun "$FSTYP does not define maximum label length" > + fi And this check can go away? Also, shouldn't it be _require_online_label_change() ? And then maybe you can move the xfs_io label command check inside it? 
> +# remove previous $seqres.full before test > +rm -f $seqres.full > + > +# real QA test starts here > + > +_supported_fs generic > +_supported_os Linux > +_require_scratch > +_require_xfs_io_command "label" > +_require_label_get_max > + > +_scratch_mkfs > $seqres.full 2>&1 > +_scratch_mount > + > +# Make sure we can set & clear the label > +$XFS_IO_PROG -c "label label.$seq" $SCRATCH_MNT > +$XFS_IO_PROG -c "label" $SCRATCH_MNT > + > +# And that userspace can see it now, while mounted > +# NB: some blkid has trailing whitespace, filter it out here > +blkid -s LABEL $SCRATCH_DEV | _filter_scratch | sed -e "s/ $//g" > + > +# And that the it is still there when it's unmounted > +_scratch_unmount > +blkid -s LABEL $SCRATCH_DEV | _filter_scratch | sed -e "s/ $//g" Ok, so "LABEL" here is a special blkid match token > +# And that it persists after a remount > +_scratch_mount > +$XFS_IO_PROG -c "label" $SCRATCH_MNT > + > +# And that a too-long label is rejected, beyond the interface max: > +LABEL=$(perl -e "print 'l' x 257;") And now you use it as a variable. Nasty and confusing. Using lower case for local variables is the norm, right? I thought we were only supposed to use upper case for global test harness variables... But even making it "label" is problematic: > +$XFS_IO_PROG -c "label $LABEL" $SCRATCH_MNT because "label" is an xfs_io command. Perhaps call it "fs_label"? Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
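The trailing-whitespace filter applied to the blkid output above is a plain sed substitution; here is a standalone sketch with a made-up blkid line (the real test also pipes `blkid -s LABEL $SCRATCH_DEV` through _filter_scratch first):

```shell
# Some blkid versions emit a trailing space after the LABEL field;
# strip it so the golden output matches on every distro.
sample='/dev/sdb1: LABEL="label.123" '     # note the trailing space
filtered=$(printf '%s\n' "$sample" | sed -e 's/ $//g')
printf '[%s]\n' "$filtered"
```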
Re: [RFC][PATCH 0/76] vfs: 'views' for filesystems with more than one root
On Tue, May 08, 2018 at 10:06:44PM -0400, Jeff Mahoney wrote: > On 5/8/18 7:38 PM, Dave Chinner wrote: > > On Tue, May 08, 2018 at 11:03:20AM -0700, Mark Fasheh wrote: > >> Hi, > >> > >> The VFS's super_block covers a variety of filesystem functionality. In > >> particular we have a single structure representing both I/O and > >> namespace domains. > >> > >> There are requirements to de-couple this functionality. For example, > >> filesystems with more than one root (such as btrfs subvolumes) can > >> have multiple inode namespaces. This starts to confuse userspace when > >> it notices multiple inodes with the same inode/device tuple on a > >> filesystem. > > > > Devil's Advocate - I'm not looking at the code, I'm commenting on > > architectural issues I see here. > > > > The XFS subvolume work I've been doing explicitly uses a superblock > > per subvolume. That's because subvolumes are designed to be > > completely independent of the backing storage - they know nothing > > about the underlying storage except to share a BDI for writeback > > purposes and write to whatever block device the remapping layer > > gives them at IO time. Hence XFS subvolumes have (at this point) > > their own unique s_dev, on-disk format configuration, journal, space > > accounting, etc. i.e. They are fully independent filesystems in > > their own right, and as such we do not have multiple inode > > namespaces per superblock. > > That's a fundamental difference between how your XFS subvolumes work and > how btrfs subvolumes do. Yup, you've just proved my point: this is not a "subvolume problem"; but rather a "multiple namespace per root" problem. > There is no independence among btrfs > subvolumes. When a snapshot is created, it has a few new blocks but > otherwise shares the metadata of the source subvolume. The metadata > trees are shared across all of the subvolumes and there are several > internal trees used to manage all of it. I don't need btrfs 101 stuff explained to me. 
:/ > a single transaction engine. There are housekeeping and maintenance > tasks that operate across the entire file system internally. I > understand that there are several problems you need to solve at the VFS > layer to get your version of subvolumes up and running, but trying to > shoehorn one into the other is bound to fail. Actually, the VFS has provided everything I need for XFS subvolumes so far without requiring any sort of modifications. That's the perspective I'm approaching this from - if the VFS can do what we need for XFS subvolumes, as well as overlay (which are effectively VFS-based COW subvolumes), then lets see if we can make that work for btrfs too. > > So this doesn't sound like a "subvolume problem" - it's a "how do we > > sanely support multiple independent namespaces per superblock" > > problem. AFAICT, this same problem exists with bind mounts and mount > > namespaces - they are effectively multiple roots on a single > > superblock, but it's done at the vfsmount level and so the > > superblock knows nothing about them. > > In this case, you're talking about the user-visible file system > hierarchy namespace that has no bearing on the underlying file system > outside of per-mount flags. Except that it tracks and provides infrastructure that allows user visible "multiple namespace per root" constructs. Subvolumes - as a user visible namespace construct - are little different to bind mounts in behaviour and functionality. How the underlying filesystem implements subvolumes is really up to the filesystem, but we should be trying to implement a clean model for "multiple namespaces on a single root" at the VFS so we have consistent behaviour across all filesystems that implement similar functionality. FWIW, bind mounts and overlay also have similar inode number namespace problems to what Mark describes for btrfs subvolumes. e.g. 
overlay recently introduced the "xino" mount option to separate the user presented inode number namespace for overlay inodes from the underlying parent filesystem inodes. How is that different to btrfs subvolumes needing to present different inode number namespaces from the underlying parent? This sort of "namespace shifting" is needed for several different pieces of information the kernel reports to userspace. The VFS replacement for shiftfs is an example of this. So is inode number remapping. I'm sure there's more. My point is that if we are talking about infrastructure to remap what userspace sees from different mountpoint views into a filesystem, then it should be done above the filesystem layers in the VFS so all filesystems behave th
Re: [RFC][PATCH 0/76] vfs: 'views' for filesystems with more than one root
On Tue, May 08, 2018 at 11:03:20AM -0700, Mark Fasheh wrote: > Hi, > > The VFS's super_block covers a variety of filesystem functionality. In > particular we have a single structure representing both I/O and > namespace domains. > > There are requirements to de-couple this functionality. For example, > filesystems with more than one root (such as btrfs subvolumes) can > have multiple inode namespaces. This starts to confuse userspace when > it notices multiple inodes with the same inode/device tuple on a > filesystem. Devil's Advocate - I'm not looking at the code, I'm commenting on architectural issues I see here. The XFS subvolume work I've been doing explicitly uses a superblock per subvolume. That's because subvolumes are designed to be completely independent of the backing storage - they know nothing about the underlying storage except to share a BDI for writeback purposes and write to whatever block device the remapping layer gives them at IO time. Hence XFS subvolumes have (at this point) their own unique s_dev, on-disk format configuration, journal, space accounting, etc. i.e. They are fully independent filesystems in their own right, and as such we do not have multiple inode namespaces per superblock. So this doesn't sound like a "subvolume problem" - it's a "how do we sanely support multiple independent namespaces per superblock" problem. AFAICT, this same problem exists with bind mounts and mount namespaces - they are effectively multiple roots on a single superblock, but it's done at the vfsmount level and so the superblock knows nothing about them. So this kinda feels like there's still an impedance mismatch between btrfs subvolumes being mounted as subtrees on the underlying root vfsmount rather than being created as truly independent vfs namespaces that share a superblock. To put that as a question: why aren't btrfs subvolumes vfsmounts in their own right, and the unique subvolume information gets stored in (or obtained from) the vfsmount?
> In addition, it's currently impossible for a filesystem subvolume to > have a different security context from its parent. If we could allow > for subvolumes to optionally specify their own security context, we > could use them as containers directly instead of having to go through > an overlay. Again, XFS subvolumes don't have this problem. So really we need to frame this discussion in terms of supporting multiple namespaces within a superblock sanely, not subvolumes. > I ran into this particular problem with respect to Btrfs some years > ago and sent out a very naive set of patches which were (rightfully) > not incorporated: > > https://marc.info/?l=linux-btrfs=130074451403261=2 > https://marc.info/?l=linux-btrfs=130532890824992=2 > > During the discussion, one question did come up - why can't > filesystems like Btrfs use a superblock per subvolume? There's a > couple of problems with that: > > - It's common for a single Btrfs filesystem to have thousands of > subvolumes. So keeping a superblock for each subvol in memory would > get prohibitively expensive - imagine having 8000 copies of struct > super_block for a file system just because we wanted some separation > of say, s_dev. That's no different to using individual overlay mounts for the thousands of containers that are on the system. This doesn't seem to be a major problem... > - Writeback would also have to walk all of these superblocks - > again not very good for system performance. Background writeback is backing device focussed, not superblock focussed. It will only iterate the superblocks that have dirty inodes on the bdi writeback lists, not all the superblocks on the bdi. IOWs, this isn't a major problem except for sync() operations that iterate superblocks. > - Anyone wanting to lock down I/O on a filesystem would have to > freeze all the superblocks. This goes for most things related to > I/O really - we simply can't afford to have the kernel walking > thousands of superblocks to sync a single fs. 
Not with XFS subvolumes. Freezing the underlying parent filesystem will effectively stop all IO from the mounted subvolumes by freezing remapping calls before IO. Sure, those subvolumes aren't in a consistent state, but we don't freeze userspace so none of the application data is ever in a consistent state when filesystems are frozen. So, again, I'm not sure there's a /subvolume/ problem here. There's definitely a "freeze hierarchy" problem, but that already exists and it's something we talked about at LSFMM because we need to solve it for reliable hibernation. > It's far more efficient than to pull those fields we need for a > subvolume namespace into their own structure. I'm not convinced yet - it still feels like it's the wrong layer to be solving the multiple namespace per superblock problem
Re: Symlink not persisted even after fsync
On Sun, Apr 15, 2018 at 07:10:52PM -0500, Vijay Chidambaram wrote: > Thanks! As I mentioned before, this is useful. I have a follow-up > question. Consider the following workload: > > creat foo > link (foo, A/bar) > fsync(foo) > crash > > In this case, after the file system recovers, do we expect foo's link > count to be 2 or 1? So, strictly ordered behaviour: create foo: - creates dirent in inode B and new inode A in an atomic transaction sequence #1 link foo -> A/bar - creates dirent in inode C and bumps inode A link count in an atomic transaction sequence #2. fsync foo - looks at inode A, sees its "last modification" sequence counter as #2 - flushes all transactions up to and including #2 to the journal. See the dependency chain? Both the inodes and dirents in the create operation and the link operation are chained to the inode foo via the atomic transactions. Hence when we flush foo, we also flush the dependent changes because of the change atomicity requirements. > I would say 2, Correct, for strict ordering. But > but POSIX is silent on this, Well, it's not silent, POSIX explicitly allows for fsync() to do nothing and report success. Hence we can't really look to POSIX to define how fsync() should behave. > so > thought I would confirm. The tricky part here is we are not calling > fsync() on directory A. Right. But directory A has a dependent change linked to foo. If we fsync() foo, we are persisting the link count change in that file, and hence all the other changes related to that link count change must also be flushed. Similarly, all the changes related to the creation of foo must be flushed, too. > In this case, it's not a symlink; it's a hard link, so I would say the > link count for foo should be 2. Right - that's the "reference counted object dependency" I referred to. i.e. it's a bi-directional atomic dependency - either we show both the new dirent and the link count change, or we show neither of them. 
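The workload above is trivial to restate as shell; the crash obviously can't be scripted here, but the state that fsync(foo) must make durable on a strictly ordered filesystem is exactly the state visible before the crash, namely link count 2 on foo and the dirent A/bar present:

```shell
# Recreate the creat/link/fsync workload in a throwaway directory.
dir=$(mktemp -d)
mkdir "$dir/A"
touch "$dir/foo"                  # creat foo         (transaction #1)
ln "$dir/foo" "$dir/A/bar"        # link foo -> A/bar (transaction #2)
# fsync(foo) would go here; with xfstests tooling that is:
#   $XFS_IO_PROG -c fsync $dir/foo
links=$(stat -c %h "$dir/foo")    # hard link count (GNU stat)
echo "foo link count: $links"
rm -rf "$dir"
```

On a strictly ordered filesystem, fsync(foo) must persist both sides of that bi-directional dependency, so recovery should also report a link count of 2.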
Hence fsync on one object implies that we are also persisting the related changes in the other object, too. > But btrfs and F2FS show link count of > 1 after a crash. That may be valid if the dirent A/bar does not exist after recovery, but it also means fsync() hasn't actually guaranteed inode changes made prior to the fsync to be persistent on disk. i.e. that's a violation of ordered metadata semantics and probably a bug. Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Symlink not persisted even after fsync
er it is even valid. There is no way reliable ordering dependencies can be created for indirect references, especially as symlinks can point to any type of object (e.g. dir, blkdev, etc), they can point to something outside the filesystem, and they can even point to something that doesn't exist. This also means that "fsync on a symlink" may, in fact, run a fsync method of a completely different filesystem or subsystem. There is no way this could possibly trigger a directory fsync of the symlink parent, because the object being fsync()d may not even know what a filesystem is... If you want a symlink to have ordering behaviour like a dirent pointing to a regular file, then use hard links. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: Symlink not persisted even after fsync
On Fri, Apr 13, 2018 at 09:39:27AM -0500, Jayashree Mohan wrote: > Hey Dave, > > Thanks for clarifying the crash recovery semantics of strictly > metadata ordered filesystems. We had a follow-up question in this > case. > > On Fri, Apr 13, 2018 at 8:16 AM, Amir Goldstein <amir7...@gmail.com> wrote: > > On Fri, Apr 13, 2018 at 3:54 PM, Vijay Chidambaram <vi...@cs.utexas.edu> > > wrote: > >> Hi Amir, > >> > >> Thanks for the reply! > >> > >> On Fri, Apr 13, 2018 at 12:52 AM, Amir Goldstein <amir7...@gmail.com> > >> wrote: > >>> > >>> Not a bug. > >>> > >>> From man 2 fsync: > >>> > >>> "Calling fsync() does not necessarily ensure that the entry in the > >>> directory containing the file has also reached disk. For that an > >>> explicit fsync() on a file descriptor for the directory is also needed." > >> > >> > >> Are we understanding this right: > >> > >> ext4 and xfs fsync the parent directory if a sym link file is fsync-ed. But > >> btrfs does not. Is this what we are seeing? > > > > Nope. > > > > You are seeing an unintentional fsync of parent, because both > > parent update and symlink update are metadata updates that are > > tracked by the same transaction. > > > > fsync of symlink forces the current transaction to the journal, > > pulling in the parent update with it. > > > > > >> > >> I agree that fsync of a file does not mean fsync of its directory entry, > >> but > >> it seems odd to do it for regular files and not for sym links. We do not > >> see > >> this behavior if we use a regular file instead of a sym link file. > >> > > > > fsync of regular file behaves differently than fsync of non regular file. > > I suggest this read: > > https://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/ > > > >>> > >>> There is a reason why this behavior is not being reproduces in > >>> ext4/xfs, but you should be able to reproduce a similar issue > >>> like this: > >>> > >>> > >>> 1. symlink (foo, bar.tmp) > >>> 2. open bar.tmp > >>> 3. 
fsync bar.tmp > >>> 4. rename(bar.tmp, bar) > >>> 5. fsync bar > >>> crash here > >> > > Going by your argument that all previous transactions that referenced > the file being fsync-ed needs to be committed, should we expect xfs > (and ext4) to persist file bar in this case? No, that's not what I'm implying. I'm implying that there is specific ordering dependencies that govern this behaviour, and assuming that what the fsync man page says about files applies to symlinks is not a valid assumption because files and symlinks are not equivalent objects. In these cases, you first have to ask "what are we actually running fsync on?" The fsync is being performed on the inode the symlink points to, not the symlink. You can't directly open a symlink to fsync the symlink. Then you have to ask "what is the dependency chain between the parent directory, the symlink and the file it points to?" the short answer is that symlinks have no direct relationship to the object they point to. i.e. symlinks contain a path, not a reference to a specific filesystem object. IOWs, symlinks are really a directory construct, not a file. However, there is no ordering dependency between a symlink and what it points to. symlinks contain a path which needs to be resolved to find out what it points to, and that may not even exist. Files have no reference to symlinks that point at them, so there's no way we can create an ordering dependency between file updates and any symlink that points to them. Directories, OTOH, contain a pointer to a reference counted object (an inode) in their dirents. hence if you add/remove directory dirents that point to an inode, you also have to modify the inode link counts as it records how many directory entries point at it. That's a bi-directional atomic modification ordering dependency between directories and inodes they point at. 
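The "what are we actually running fsync on?" point is observable from userspace: open(2) resolves the symlink, so the file descriptor (and therefore fsync) refers to the target's inode, never the symlink's own inode. A small demonstration:

```shell
dir=$(mktemp -d)
echo data > "$dir/foo"
ln -s foo "$dir/bar"                 # symlink stores a path, not an inode ref
target_ino=$(stat -c %i "$dir/foo")       # the file's inode
resolved_ino=$(stat -L -c %i "$dir/bar")  # what open()/fsync() act on
link_ino=$(stat -c %i "$dir/bar")         # the symlink's own inode
echo "target=$target_ino resolved=$resolved_ino symlink=$link_ino"
rm -rf "$dir"
```

The resolved inode matches the target, and the symlink's own inode is a different object entirely, which is why fsync of the target can't be expected to persist the symlink.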
So when we look at symlinks, the parent directory has an ordering dependency with the symlink inode, not whatever is found by resolving the path in the symlink data. IOWs, there is no ordering relationship between the symlink's parent directory and whatever the symlink points at. i.e. it's a one-way relationship, and so there is no reverse ordering dependency that requires fsync() on the file to force synchronisation of a symlink it knows nothing about. i.e. the ordering dependency that exists with symlinks is between the symlink and its parent directory, not whatever the symlink points to. Hence fsyncing whatever the symlink points to does not guarantee that the symlink is made stable, because the symlink is not part of the dependency chain of the object being fsync()d. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: Symlink not persisted even after fsync
On Fri, Apr 13, 2018 at 08:52:19AM +0300, Amir Goldstein wrote: > On Thu, Apr 12, 2018 at 8:51 PM, Jayashree Mohan > <jayashree2...@gmail.com> wrote: > > Hi, > > > > We came across what seems to be a crash consistency bug on btrfs and > > f2fs. When we create a symlink for a file and fsync the symlink, on > > recovery from crash, the fsync-ed file is missing. > > > > You can reproduce this behaviour using this minimal workload : > > > > 1. symlink (foo, bar) > > 2. open bar > > 3. fsync bar > > crash here > > > > When we recover, we find that file bar is missing. This behaviour > > seems unexpected as the fsynced file is lost on a crash. ext4 and xfs > > correctly recover file bar. This seems like a bug. If not, could you > > explain why? > > > > Not a bug. Actually, for a filesystem with strictly ordered metadata recovery semantics, it is a bug. > From man 2 fsync: > > "Calling fsync() does not necessarily ensure that the entry in the > directory containing the file has also reached disk. For that an > explicit fsync() on a file descriptor for the directory is also needed." We've been through this before, many times. This caveat does not apply to strictly ordered metadata filesystems. If you fsync a file on an ordered metadata filesystem, then all previous transactions that are needed to reference the file are also committed. The behaviour from ext4 and XFS is correct for strictly ordered filesystems. This is not a "fsync requirement", nor is it a general linux filesystem requirement. It is a requirement of the desired filesystem crash recovery mechanisms. BTRFS is advertised as having strictly ordered metadata recovery semantics, so it should behave the same way as ext4 and XFS in tests like these. If it doesn't, then there's filesystem bugs that need fixing... Cheers, Dave. 
-- Dave Chinner da...@fromorbit.com
Re: [PATCH v2 2/2] common/rc: raise mixed mode threshold to 1GB
On Wed, Apr 11, 2018 at 10:07:29PM -0700, Omar Sandoval wrote: > From: Omar Sandoval <osan...@fb.com> > > generic/427 creates a 256 MB filesystem and then writes a 200 MB file, > which fails on Btrfs if mixed mode is not enabled. Raise the threshold > to 1GB, which is where we typically recommend mixed mode. > > Signed-off-by: Omar Sandoval <osan...@fb.com> > --- > common/rc | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/common/rc b/common/rc > index 5dbb8fe5..dc0062de 100644 > --- a/common/rc > +++ b/common/rc > @@ -999,7 +999,7 @@ _scratch_mkfs_sized() > ;; > btrfs) > local mixed_opt= > - (( fssize <= 100 * 1024 * 1024 )) && mixed_opt='--mixed' > + (( fssize <= 1024 * 1024 * 1024 )) && mixed_opt='--mixed' > $MKFS_BTRFS_PROG $MKFS_OPTIONS $mixed_opt -b $fssize $SCRATCH_DEV > ;; > jfs) Makes sense. Reviewed-by: Dave Chinner <dchin...@redhat.com> Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] generic/427: used mixed mode for Btrfs
On Wed, Apr 11, 2018 at 04:47:30PM -0700, Omar Sandoval wrote: > From: Omar Sandoval <osan...@fb.com> > > This test creates a 256 MB filesystem and then writes a 200 MB file. > With separate data and metadata, Btrfs will run out of data space since > it needs to allocate some metadata space. Use mixed mode, which is the > recommendation for smaller filesystems. > > Signed-off-by: Omar Sandoval <osan...@fb.com> > --- > tests/generic/427 | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/tests/generic/427 b/tests/generic/427 > index 9cde5f50..b2cb4526 100755 > --- a/tests/generic/427 > +++ b/tests/generic/427 > @@ -55,6 +55,8 @@ _require_test_program "feature" > _require_aiodio aio-dio-eof-race > > # limit the filesystem size, to save the time of filling filesystem > +# Btrfs needs to use mixed mode for such a small filesystem > +[ "$FSTYP" = "btrfs" ] && MKFS_OPTIONS="$MKFS_OPTIONS -M" _scratch_mkfs_sized() should already be adding this for small btrfs filesystems. Yup, it does: btrfs) local mixed_opt= (( fssize <= 100 * 1024 * 1024 )) && mixed_opt='--mixed' $MKFS_BTRFS_PROG $MKFS_OPTIONS $mixed_opt -b $fssize $SCRATCH_DEV ;; > _scratch_mkfs_sized $((256 * 1024 * 1024)) >>$seqres.full 2>&1 But this uses a filesystem larger than the mixed mode threshold in _scratch_mkfs_sized(). Please update the generic threshold rather than special case this test. Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
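The mismatch is just the threshold arithmetic in _scratch_mkfs_sized(): generic/427 asks for a 256MB filesystem, which is above the old 100MB mixed-mode cutoff but below the 1GB cutoff proposed in the v2 patch. A sketch of the two checks side by side:

```shell
fssize=$((256 * 1024 * 1024))     # generic/427's filesystem size

# Old threshold (100 MB): 256 MB gets no --mixed, so the test ENOSPCs
# on btrfs when metadata space is allocated.
old_opt=
if (( fssize <= 100 * 1024 * 1024 )); then
	old_opt='--mixed'
fi

# New threshold (1 GB): 256 MB gets --mixed automatically, with no
# per-test special casing needed.
new_opt=
if (( fssize <= 1024 * 1024 * 1024 )); then
	new_opt='--mixed'
fi

echo "old='$old_opt' new='$new_opt'"
```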
Re: [PATCH] fstests: generic test for fsync after fallocate
On Mon, Apr 09, 2018 at 11:00:52AM +0100, Filipe Manana wrote:
> On Mon, Apr 9, 2018 at 10:51 AM, Dave Chinner <da...@fromorbit.com> wrote:
> > On Sun, Apr 08, 2018 at 10:07:54AM +0800, Eryu Guan wrote:
> >> On Thu, Apr 05, 2018 at 10:56:14PM +0100, fdman...@kernel.org wrote:
> > You also cannot assume that two separate preallocations beyond EOF
> > are going to be contiguous (i.e. it could be two separate extents).
>
> I really don't care about that.
> Just want to check that allocated space is not lost, don't care if
> that space is covered by 1, 2 or more extents, or whether they are
> contiguous or not.

Sure, but that's not what the test does :/

> > What you should just be checking is that there are extents allocated
> > covering EOF to 3MB, not the exact size, shape and type of the
> > extents that are allocated.
>
> How to do such check, for a generic test, without using fiemap?

We already do such checks using fiemap, yes? i.e. there's a bunch of
_filter*fiemap() functions in common/punch which should show you
exactly how to do this, i.e. filter all the unwritten/data extents to
be the same name, then they coalesce into single ranges.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH] fstests: generic test for fsync after fallocate
On Sun, Apr 08, 2018 at 10:07:54AM +0800, Eryu Guan wrote:
> On Thu, Apr 05, 2018 at 10:56:14PM +0100, fdman...@kernel.org wrote:
> > From: Filipe Manana <fdman...@suse.com>
> >
> > Test that fsync operations preserve extents allocated with fallocate(2)
> > that are placed beyond a file's size.
> >
> > This test is motivated by a bug found in btrfs where unwritten extents
> > beyond the inode's i_size were not preserved after a fsync and power
> > failure. The btrfs bug is fixed by the following patch for the linux
> > kernel:
> >
> > "Btrfs: fix loss of prealloc extents past i_size after fsync log replay"
> >
> > Signed-off-by: Filipe Manana <fdman...@suse.com>
>
> Hmm, xfs fails this test, while ext4 passes.
>
> # diff -u tests/generic/483.out /root/workspace/xfstests/results//xfs_4k_crc/generic/483.out.bad
> --- tests/generic/483.out	2018-04-07 23:35:00.55511 +0800
> +++ /root/workspace/xfstests/results//xfs_4k_crc/generic/483.out.bad	2018-04-07 23:39:48.780659707 +0800
> @@ -6,5 +6,5 @@
>  0: [0..511]: data
>  1: [512..2559]: unwritten
>  File baz fiemap:
> -0: [0..511]: data
> -1: [512..6143]: unwritten
> +0: [0..895]: data
> +1: [896..6143]: unwritten

Perfectly valid result from the test.

> > +# A file which already has a prealloc extent beyond its size.
> > +# The fsync done on it is motivated by differences in the btrfs implementation
> > +# of fsync (first fsync has different logic from subsequent fsyncs).
> > +$XFS_IO_PROG -f -c "pwrite -S 0xf1 0 256K" \
> > +	-c "falloc -k 256K 768K" \
> > +	-c "fsync" \
> > +	$SCRATCH_MNT/baz >/dev/null

Delayed allocation extending the file to 256k means we'll have
speculative prealloc of data beyond 256k - it should be somewhere
around 448k IIRC.

Now, falloc doesn't guarantee unwritten extents are allocated - it just
guarantees space is allocated. Unwritten extents are an optimisation to
avoid needing to zero the extents pre-allocated within EOF. i.e. we can
quite safely allocate data extents beyond EOF rather than unwritten
extents, and this is not a bug.

In this case, the delalloc extent is completely allocated as written
data, but nothing is written beyond EOF @ 256k. Hence it reports as a
data extent, not an unwritten extent, and this is a perfectly valid
thing to do.

> > +# Allocate another extent beyond the size of file baz.
> > +$XFS_IO_PROG -c "falloc -k 1M 2M" \
> > +	-c "fsync" \
> > +	$SCRATCH_MNT/baz

You also cannot assume that two separate preallocations beyond EOF are
going to be contiguous (i.e. it could be two separate extents).

What you should just be checking is that there are extents allocated
covering EOF to 3MB, not the exact size, shape and type of the extents
that are allocated.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH v2 3/3] fstests: generic: Check the fs after each FUA writes
On Wed, Mar 28, 2018 at 12:40:23PM +0800, Qu Wenruo wrote:
> Basic test case which triggers fsstress with dm-log-writes, and then
> check the fs after each FUA writes.
> With needed infrastructure and special handlers for journal based fs.
>
> Signed-off-by: Qu Wenruo <w...@suse.com>
....
> For xfs:
> _check_xfs_filesystem: filesystem on /dev/mapper/test-scratch1 is inconsistent (r)
> *** xfs_repair -n output ***
> Phase 1 - find and verify superblock...
> Phase 2 - using internal log
>         - zero log...
>         - scan filesystem freespace and inode maps...
>         - found root inode chunk
> Phase 3 - for each AG...
>         - scan (but don't clear) agi unlinked lists...
>         - process known inodes and perform inode discovery...
>         - agno = 0
>         - agno = 1
>         - agno = 2
> bad symlink header ino 8409190, file block 0, disk block 1051147
> problem with symbolic link in inode 8409190
> would have cleared inode 8409190
>         - agno = 3
>         - process newly discovered inodes...
> Phase 4 - check for duplicate blocks...
>         - setting up duplicate extent list...
>         - check for inodes claiming duplicate blocks...
>         - agno = 0
>         - agno = 1
>         - agno = 3
>         - agno = 2
> entry "lb" in shortform directory 8409188 references free inode 8409190
> would have junked entry "lb" in directory inode 8409188
> bad symlink header ino 8409190, file block 0, disk block 1051147
> problem with symbolic link in inode 8409190
> would have cleared inode 8409190
> No modify flag set, skipping phase 5
> Phase 6 - check inode connectivity...
>         - traversing filesystem ...
> entry "lb" in shortform directory inode 8409188 points to free inode 8409190
> would junk entry
>         - traversal finished ...
>         - moving disconnected inodes to lost+found ...
> Phase 7 - verify link counts...
> Maximum metadata LSN (1:396) is ahead of log (1:63).

That warning implies a write ordering problem - there's a write with an
LSN on disk that does not yet exist in the log. i.e. the last FUA write
to the log had 1:63 in it, yet there's metadata on disk that could only
be *issued* after a REQ_FLUSH|REQ_FUA log write with 1:396 in it was
*completed*.

If we've only replayed up to the FUA write with 1:63 in it, then no
metadata writes should have been *issued* with 1:396 in it, as the LSN
that is stamped into metadata is only updated on log IO completion.

On first glance, this implies a bug in the underlying device write
replay code.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH v5 19/19] fs: handle inode->i_version more efficiently
On Tue, Jan 09, 2018 at 09:10:59AM -0500, Jeff Layton wrote: > From: Jeff Layton <jlay...@redhat.com> > > Since i_version is mostly treated as an opaque value, we can exploit that > fact to avoid incrementing it when no one is watching. With that change, > we can avoid incrementing the counter on writes, unless someone has > queried for it since it was last incremented. If the a/c/mtime don't > change, and the i_version hasn't changed, then there's no need to dirty > the inode metadata on a write. > > Convert the i_version counter to an atomic64_t, and use the lowest order > bit to hold a flag that will tell whether anyone has queried the value > since it was last incremented. > > When we go to maybe increment it, we fetch the value and check the flag > bit. If it's clear then we don't need to do anything if the update > isn't being forced. > > If we do need to update, then we increment the counter by 2, and clear > the flag bit, and then use a CAS op to swap it into place. If that > works, we return true. If it doesn't then do it again with the value > that we fetch from the CAS operation. > > On the query side, if the flag is already set, then we just shift the > value down by 1 bit and return it. Otherwise, we set the flag in our > on-stack value and again use cmpxchg to swap it into place if it hasn't > changed. If it has, then we use the value from the cmpxchg as the new > "old" value and try again. > > This method allows us to avoid incrementing the counter on writes (and > dirtying the metadata) under typical workloads. We only need to increment > if it has been queried since it was last changed. > > Signed-off-by: Jeff Layton <jlay...@redhat.com> > Reviewed-by: Jan Kara <j...@suse.cz> Documentation helps a lot in understanding all this. Thanks for adding it into the patch! 
Acked-by: Dave Chinner <dchin...@redhat.com>

--
Dave Chinner
da...@fromorbit.com
Re: [PATCH v5 17/19] xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need incrementing
On Tue, Jan 09, 2018 at 09:10:57AM -0500, Jeff Layton wrote:
> From: Jeff Layton <jlay...@redhat.com>
>
> If XFS_ILOG_CORE is already set then go ahead and increment it.
>
> Signed-off-by: Jeff Layton <jlay...@redhat.com>
> Acked-by: Darrick J. Wong <darrick.w...@oracle.com>

Acked-by: Dave Chinner <dchin...@redhat.com>

--
Dave Chinner
da...@fromorbit.com
Re: [PATCH v5 14/19] xfs: convert to new i_version API
On Tue, Jan 09, 2018 at 09:10:54AM -0500, Jeff Layton wrote:
> From: Jeff Layton <jlay...@redhat.com>
>
> Signed-off-by: Jeff Layton <jlay...@redhat.com>
> Acked-by: Darrick J. Wong <darrick.w...@oracle.com>

Looks ok, but I haven't tested it at all.

Acked-by: Dave Chinner <dchin...@redhat.com>

--
Dave Chinner
da...@fromorbit.com
Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata
On Wed, Jan 03, 2018 at 02:59:21PM +0100, Jan Kara wrote:
> On Wed 03-01-18 13:32:19, Dave Chinner wrote:
> > I think we could probably block ->write_metadata if necessary via a
> > completion/wakeup style notification when a specific LSN is reached
> > by the log tail, but realistically if there's any amount of data
> > needing to be written it'll throttle data writes because the IO
> > pipeline is being kept full by background metadata writes
>
> So the problem I'm concerned about is a corner case. Consider a situation
> when you have no dirty data, only dirty metadata but enough of them to
> trigger background writeback. How should metadata writeback behave for XFS
> in this case? Who should be responsible that wb_writeback() just does not
> loop invoking ->write_metadata() as fast as CPU allows until xfsaild makes
> enough progress?
>
> Thinking about this today, I think this looping prevention belongs to
> wb_writeback().

Well, background data writeback can block in two ways. One is during
IO submission when the request queue is full, the other is when all
dirty inodes have had some work done on them and have all been moved to
b_more_io - wb_writeback waits for the __I_SYNC bit to be cleared on
the last(?) inode on that list, hence backing off before submitting
more IO.

IOWs, there's a "during writeback" blocking mechanism as well as a
"between cycles" blocking mechanism.

> Sadly we don't have much info to decide how long to sleep
> before trying more writeback so we'd have to just sleep for a while
> if we found no writeback happened in the last writeback
> round before going through the whole writeback loop again.

Right - I don't think we can provide a generic "between cycles"
blocking mechanism for XFS, but I'm pretty sure we can emulate a
"during writeback" blocking mechanism to avoid busy looping inside the
XFS code. e.g. if we get a writeback call that asks for 5% to be
written, and we already have a metadata writeback target of 5% in
place, that means we should block for a while. That would emulate
request queue blocking and prevent busy looping in this case.

> And
> ->write_metadata() for XFS would need to always return 0 (as in "no progress
> made") to make sure this busyloop avoidance logic in wb_writeback()
> triggers. ext4 and btrfs would return number of bytes written from
> ->write_metadata (or just 1 would be enough to indicate some progress in
> metadata writeback was made and busyloop avoidance is not needed).

Well, if we block for a little while, we can indicate that progress has
been made and this whole mess would go away, right?

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata
On Tue, Jan 02, 2018 at 11:13:06AM -0500, Josef Bacik wrote: > On Wed, Dec 20, 2017 at 03:30:55PM +0100, Jan Kara wrote: > > On Wed 20-12-17 08:35:05, Dave Chinner wrote: > > > On Tue, Dec 19, 2017 at 01:07:09PM +0100, Jan Kara wrote: > > > > On Wed 13-12-17 09:20:04, Dave Chinner wrote: > > > > > IOWs, treating metadata like it's one great big data inode doesn't > > > > > seem to me to be the right abstraction to use for this - in most > > > > > fileystems it's a bunch of objects with a complex dependency tree > > > > > and unknown write ordering, not an inode full of data that can be > > > > > sequentially written. > > > > > > > > > > Maybe we need multiple ops with well defined behaviours. e.g. > > > > > ->writeback_metadata() for background writeback, ->sync_metadata() for > > > > > sync based operations. That way different filesystems can ignore the > > > > > parts they don't need simply by not implementing those operations, > > > > > and the writeback code doesn't need to try to cater for all > > > > > operations through the one op. The writeback code should be cleaner, > > > > > the filesystem code should be cleaner, and we can tailor the work > > > > > guidelines for each operation separately so there's less mismatch > > > > > between what writeback is asking and how filesystems track dirty > > > > > metadata... > > > > > > > > I agree that writeback for memory cleaning and writeback for data > > > > integrity > > > > are two very different things especially for metadata. In fact for data > > > > integrity writeback we already have ->sync_fs operation so there the > > > > functionality gets duplicated. What we could do is that in > > > > writeback_sb_inodes() we'd call ->write_metadata only when > > > > work->for_kupdate or work->for_background is set. That way > > > > ->write_metadata > > > > would be called only for memory cleaning purposes. 
> > >
> > > That makes sense, but I still think we need a better indication of
> > > how much writeback we need to do than just "writeback this chunk of
> > > pages". That "writeback a chunk" interface is necessary to share
> > > writeback bandwidth across numerous data inodes so that we don't
> > > starve any one inode of writeback bandwidth. That's unnecessary for
> > > metadata writeback on a superblock - we don't need to share that
> > > bandwidth around hundreds or thousands of inodes. What we actually
> > > need to know is how much writeback we need to do as a total of all
> > > the dirty metadata on the superblock.
> > >
> > > Sure, that's not ideal for btrfs and maybe ext4, but we can write a
> > > simple generic helper that converts "flush X percent of dirty
> > > metadata" to a page/byte chunk as the current code does. Doing it
> > > this way allows filesystems to completely internalise the accounting
> > > that needs to be done, rather than trying to hack around a
> > > writeback accounting interface with large impedance mismatches to
> > > how the filesystem accounts for dirty metadata and/or tracks
> > > writeback progress.
> >
> > Let me think loud on how we could tie this into how memory cleaning
> > writeback currently works - the one with for_background == 1 which is
> > generally used to get amount of dirty pages in the system under control.
> > We have a queue of inodes to write, we iterate over this queue and ask each
> > inode to write some amount (e.g. 64 M - exact amount depends on measured

It's a maximum of 1024 pages per inode.

> > writeback bandwidth etc.). Some amount from that inode gets written and we
> > continue with the next inode in the queue (put this one at the end of the
> > queue if it still has dirty pages). We do this until:
> >
> > a) the number of dirty pages in the system is below background dirty limit
> >    and the number dirty pages for this device is below background dirty
> >    limit for this device.
> > b) run out of dirty inodes on this device > > c) someone queues different type of writeback > > > > And we need to somehow incorporate metadata writeback into this loop. I see > > two questions here: > > > > 1) When / how often should we ask for metadata writeback? > > 2) How much to ask to write in one go? > > > > The second question is especially tricky in t
Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata
On Tue, Dec 19, 2017 at 01:07:09PM +0100, Jan Kara wrote: > On Wed 13-12-17 09:20:04, Dave Chinner wrote: > > On Tue, Dec 12, 2017 at 01:05:35PM -0500, Josef Bacik wrote: > > > On Tue, Dec 12, 2017 at 10:36:19AM +1100, Dave Chinner wrote: > > > > On Mon, Dec 11, 2017 at 04:55:31PM -0500, Josef Bacik wrote: > > > This is just one of those things that's going to be slightly shitty. > > > It's the > > > same for memory reclaim, all of those places use pages so we just take > > > METADATA_*_BYTES >> PAGE_SHIFT to get pages and figure it's close enough. > > > > Ok, so that isn't exactly easy to deal with, because all our > > metadata writeback is based on log sequence number targets (i.e. how > > far to push the tail of the log towards the current head). We've > > actually got no idea how pages/bytes actually map to a LSN target > > because while we might account a full buffer as dirty for memory > > reclaim purposes (up to 64k in size), we might have only logged 128 > > bytes of it. > > > > i.e. if we are asked to push 2MB of metadata and we treat that as > > 2MB of log space (i.e. push target of tail LSN + 2MB) we could have > > logged several tens of megabytes of dirty metadata in that LSN > > range and have to flush it all. OTOH, if the buffers are fully > > logged, then that same target might only flush 1.5MB of metadata > > once all the log overhead is taken into account. > > > > So there's a fairly large disconnect between the "flush N bytes of > > metadata" API and the "push to a target LSN" that XFS uses for > > flushing metadata in aged order. I'm betting that extN and otehr > > filesystems might have similar mismatches with their journal > > flushing... > > Well, for ext4 it isn't as bad since we do full block logging only. So if > we are asked to flush N pages, we can easily translate that to number of fs > blocks and flush that many from the oldest transaction. 
>
> Couldn't XFS just track how much it has cleaned (from reclaim perspective)
> when pushing items from AIL (which is what I suppose XFS would do in
> response to metadata writeback request) and just stop pushing when it has
> cleaned as much as it was asked to?

If only it were that simple :/

To start with, flushing the dirty objects (such as inodes) to their
backing buffers does not mean that the object is clean once the
writeback completes. XFS has decoupled in-memory objects with logical
object logging rather than logging physical buffers, and so objects can
be modified and dirtied while the inode buffer is being written back.
Hence if we just count things like "buffer size written" it's not
actually a correct accounting of the amount of dirty metadata we've
cleaned. If we don't get that right, it'll result in accounting errors
and incorrect behaviour.

The bigger problem, however, is that we have no channel to return flush
information from the AIL pushing to whatever caller asked for the push.
Pushing metadata is completely decoupled from every other subsystem.
i.e. the caller asks the xfsaild to push to a specific LSN (e.g. to
free up a certain amount of log space for new transactions), and
*nothing* has any idea of how much metadata we'll need to write to push
the tail of the log to that LSN.

It's also completely asynchronous - there's no mechanism for waiting on
a push to a specific LSN. Anything that needs a specific amount of log
space to be available waits in ordered ticket queues on the log tail
moving forwards. The only interface that has access to the log tail
ticket waiting is the transaction reservation subsystem, which cannot
be used during metadata writeback because that's a guaranteed deadlock
vector.

Saying "just account for bytes written" assumes a directly connected,
synchronous dispatch metadata writeback infrastructure, which we simply
don't have in XFS.
"just clean this many bytes" doesn't really fit at all, because we have
no way of referencing that to the distance we need to push the tail of
the log. An interface that tells us "clean this percentage of dirty
metadata" is much more useful, because we can map that easily to a log
sequence number based push target.

> > IOWs, treating metadata like it's one great big data inode doesn't
> > seem to me to be the right abstraction to use for this - in most
> > fileystems it's a bunch of objects with a complex dependency tree
> > and unknown write ordering, not an inode full of data that can be
> > sequentially written.
> >
> > Maybe we need multiple ops with well defined behaviours. e.g.
> > ->writeback_metadata() for background writeback, ->sync_metadata() for
> > sync based operations. That way different filesystems can ignore the
> > parts they don't need simply by not implementing those operations,
> > and the writeback code doesn't need to try to cater for all
> > operations through the one op.
Re: [PATCH v3 19/19] fs: handle inode->i_version more efficiently
On Mon, Dec 18, 2017 at 02:35:20PM -0500, Jeff Layton wrote:
> [PATCH] SQUASH: add memory barriers around i_version accesses

Why explicit memory barriers rather than annotating the operations with
the required semantics and getting the barriers the arch requires
automatically?

I suspect this should be using atomic_read_acquire() and
atomic_cmpxchg_release(), because AFAICT the atomic_cmpxchg needs to
have release semantics to match the acquire semantics needed for the
load of the current value.

From include/linux/atomics.h:

 * For compound atomics performing both a load and a store, ACQUIRE
 * semantics apply only to the load and RELEASE semantics only to the
 * store portion of the operation. Note that a failed cmpxchg_acquire
 * does -not- imply any memory ordering constraints.

Memory barriers hurt my brain. :/

At minimum, shouldn't the atomic op specific barriers be used rather
than full memory barriers? i.e:

> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index a9fbf99709df..39ec9aa9e08e 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -87,6 +87,25 @@ static inline void
>  inode_set_iversion_raw(struct inode *inode, const u64 val)
>  {
>  	atomic64_set(&inode->i_version, val);
> +	smp_wmb();

	smp_mb__after_atomic();

....

> +static inline u64
> +inode_peek_iversion_raw(const struct inode *inode)
> +{
> +	smp_rmb();

	smp_mb__before_atomic();

> +	return atomic64_read(&inode->i_version);
> +}

And, of course, these will require comments explaining what they match
with and what they are ordering.

> @@ -152,7 +171,7 @@ inode_maybe_inc_iversion(struct inode *inode, bool force)
>  {
>  	u64 cur, old, new;
>  
> -	cur = (u64)atomic64_read(&inode->i_version);
> +	cur = inode_peek_iversion_raw(inode);
>  	for (;;) {
>  		/* If flag is clear then we needn't do anything */
>  		if (!force && !(cur & I_VERSION_QUERIED))

cmpxchg in this loop needs a release barrier so everyone will see the
change?
> @@ -183,23 +202,6 @@ inode_inc_iversion(struct inode *inode)
>  	inode_maybe_inc_iversion(inode, true);
>  }
>  
> -/**
> - * inode_peek_iversion_raw - grab a "raw" iversion value
> - * @inode: inode from which i_version should be read
> - *
> - * Grab a "raw" inode->i_version value and return it. The i_version is not
> - * flagged or converted in any way. This is mostly used to access a self-managed
> - * i_version.
> - *
> - * With those filesystems, we want to treat the i_version as an entirely
> - * opaque value.
> - */
> -static inline u64
> -inode_peek_iversion_raw(const struct inode *inode)
> -{
> -	return atomic64_read(&inode->i_version);
> -}
> -
>  /**
>   * inode_iversion_need_inc - is the i_version in need of being incremented?
>   * @inode: inode to check
> @@ -248,7 +250,7 @@ inode_query_iversion(struct inode *inode)
>  {
>  	u64 cur, old, new;
>  
> -	cur = atomic64_read(&inode->i_version);
> +	cur = inode_peek_iversion_raw(inode);
>  	for (;;) {
>  		/* If flag is already set, then no need to swap */
>  		if (cur & I_VERSION_QUERIED)

cmpxchg in this loop needs a release barrier so everyone will see the
change on load?

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH v2 01/19] fs: new API for handling inode->i_version
On Sat, Dec 16, 2017 at 08:46:38AM -0500, Jeff Layton wrote:
> From: Jeff Layton <jlay...@redhat.com>
>
> Add a documentation blob that explains what the i_version field is, how
> it is expected to work, and how it is currently implemented by various
> filesystems.
>
> We already have inode_inc_iversion. Add several other functions for
> manipulating and accessing the i_version counter. For now, the
> implementation is trivial and basically works the way that all of the
> open-coded i_version accesses work today.
>
> Future patches will convert existing users of i_version to use the new
> API, and then convert the backend implementation to do things more
> efficiently.
>
> Signed-off-by: Jeff Layton <jlay...@redhat.com>
> ---
>  include/linux/fs.h | 200 ++---
>  1 file changed, 192 insertions(+), 8 deletions(-)

Just a random Sunday morning coffee musing: I was just wondering if it
would be better to split this stuff out into its own header file now?

include/linux/fs.h is already a massive header file (~3500 lines) and
changes cause tree-wide rebuilds, so maybe it would be better to split
relatively isolated functionality like this out while it's being
reworked and you're already touching every file that uses it?

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata
On Tue, Dec 12, 2017 at 01:05:35PM -0500, Josef Bacik wrote:
> On Tue, Dec 12, 2017 at 10:36:19AM +1100, Dave Chinner wrote:
> > On Mon, Dec 11, 2017 at 04:55:31PM -0500, Josef Bacik wrote:
> > > From: Josef Bacik <jba...@fb.com>
> > >
> > > Now that we have metadata counters in the VM, we need to provide a way
> > > to kick writeback on dirty metadata. Introduce
> > > super_operations->write_metadata. This allows file systems to deal with
> > > writing back any dirty metadata we need based on the writeback needs of
> > > the system. Since there is no inode to key off of we need a list in the
> > > bdi for dirty super blocks to be added. From there we can find any dirty
> > > sb's on the bdi we are currently doing writeback on and call into their
> > > ->write_metadata callback.
> > >
> > > Signed-off-by: Josef Bacik <jba...@fb.com>
> > > Reviewed-by: Jan Kara <j...@suse.cz>
> > > Reviewed-by: Tejun Heo <t...@kernel.org>
> > > ---
> > >  fs/fs-writeback.c                | 72
> > >  fs/super.c                       |  6
> > >  include/linux/backing-dev-defs.h |  2 ++
> > >  include/linux/fs.h               |  4 +++
> > >  mm/backing-dev.c                 |  2 ++
> > >  5 files changed, 80 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > index 987448ed7698..fba703dff678 100644
> > > --- a/fs/fs-writeback.c
> > > +++ b/fs/fs-writeback.c
> > > @@ -1479,6 +1479,31 @@ static long writeback_chunk_size(struct bdi_writeback *wb,
> > >  	return pages;
> > >  }
> > >
> > > +static long writeback_sb_metadata(struct super_block *sb,
> > > +				  struct bdi_writeback *wb,
> > > +				  struct wb_writeback_work *work)
> > > +{
> > > +	struct writeback_control wbc = {
> > > +		.sync_mode		= work->sync_mode,
> > > +		.tagged_writepages	= work->tagged_writepages,
> > > +		.for_kupdate		= work->for_kupdate,
> > > +		.for_background		= work->for_background,
> > > +		.for_sync		= work->for_sync,
> > > +		.range_cyclic		= work->range_cyclic,
> > > +		.range_start		= 0,
> > > +		.range_end		= LLONG_MAX,
> > > +	};
> > > +	long write_chunk;
> > > +
> > > +	write_chunk = writeback_chunk_size(wb, work);
> > > +	wbc.nr_to_write = write_chunk;
> > > +	sb->s_op->write_metadata(sb, &wbc);
> > > +	work->nr_pages -= write_chunk - wbc.nr_to_write;
> > > +
> > > +	return write_chunk - wbc.nr_to_write;
> >
> > Ok, writeback_chunk_size() returns a page count. We've already gone
> > through the "metadata is not page sized" dance on the dirty
> > accounting side, so how are we supposed to use pages to account for
> > metadata writeback?
>
> This is just one of those things that's going to be slightly shitty. It's the
> same for memory reclaim, all of those places use pages so we just take
> METADATA_*_BYTES >> PAGE_SHIFT to get pages and figure it's close enough.

Ok, so that isn't exactly easy to deal with, because all our metadata
writeback is based on log sequence number targets (i.e. how far to push
the tail of the log towards the current head). We've actually got no
idea how pages/bytes actually map to a LSN target because while we
might account a full buffer as dirty for memory reclaim purposes (up to
64k in size), we might have only logged 128 bytes of it.

i.e. if we are asked to push 2MB of metadata and we treat that as 2MB
of log space (i.e. push target of tail LSN + 2MB) we could have logged
several tens of megabytes of dirty metadata in that LSN range and have
to flush it all. OTOH, if the buffers are fully logged, then that same
target might only flush 1.5MB of metadata once all the log overhead is
taken into account.

So there's a fairly large disconnect between the "flush N bytes of
metadata" API and the "push to a target LSN" that XFS uses for flushing
metadata in aged order. I'm betting that extN and other filesystems
might have similar mismatches with their journal flushing...

> > And, from what I can tell, if work->sync_mode = WB_SYNC_ALL or
> > work->tagged_writepages is set, this will basically tell us to f
Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray
On Mon, Dec 11, 2017 at 02:12:28PM -0800, Joe Perches wrote:
> On Tue, 2017-12-12 at 08:43 +1100, Dave Chinner wrote:
> > On Sat, Dec 09, 2017 at 09:00:18AM -0800, Joe Perches wrote:
> > > On Sat, 2017-12-09 at 09:36 +1100, Dave Chinner wrote:
> > > > 1. Using lockdep_set_novalidate_class() for anything other
> > > > than device->mutex will throw checkpatch warnings. Nice. (*)
> > > []
> > > > (*) checkpatch.pl is considered mostly harmful round here, too,
> > > > but that's another rant
> > >
> > > How so?
> >
> > Short story is that it barfs all over the slightly non-standard
> > coding style used in XFS.
> []
> > This sort of stuff is just lowest-common-denominator noise - great
> > for new code and/or inexperienced developers, but not for working
> > with large bodies of existing code with slightly non-standard
> > conventions.
>
> Completely reasonable. Thanks.
>
> Do you get many checkpatch submitters for fs/xfs?

We used to. Not recently, though.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata
On Mon, Dec 11, 2017 at 04:55:31PM -0500, Josef Bacik wrote: > From: Josef Bacik <jba...@fb.com> > > Now that we have metadata counters in the VM, we need to provide a way to kick > writeback on dirty metadata. Introduce super_operations->write_metadata. > This > allows file systems to deal with writing back any dirty metadata we need based > on the writeback needs of the system. Since there is no inode to key off of > we > need a list in the bdi for dirty super blocks to be added. From there we can > find any dirty sb's on the bdi we are currently doing writeback on and call > into > their ->write_metadata callback. > > Signed-off-by: Josef Bacik <jba...@fb.com> > Reviewed-by: Jan Kara <j...@suse.cz> > Reviewed-by: Tejun Heo <t...@kernel.org> > --- > fs/fs-writeback.c | 72 > > fs/super.c | 6 > include/linux/backing-dev-defs.h | 2 ++ > include/linux/fs.h | 4 +++ > mm/backing-dev.c | 2 ++ > 5 files changed, 80 insertions(+), 6 deletions(-) > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c > index 987448ed7698..fba703dff678 100644 > --- a/fs/fs-writeback.c > +++ b/fs/fs-writeback.c > @@ -1479,6 +1479,31 @@ static long writeback_chunk_size(struct bdi_writeback > *wb, > return pages; > } > > +static long writeback_sb_metadata(struct super_block *sb, > + struct bdi_writeback *wb, > + struct wb_writeback_work *work) > +{ > + struct writeback_control wbc = { > + .sync_mode = work->sync_mode, > + .tagged_writepages = work->tagged_writepages, > + .for_kupdate = work->for_kupdate, > + .for_background = work->for_background, > + .for_sync = work->for_sync, > + .range_cyclic = work->range_cyclic, > + .range_start = 0, > + .range_end = LLONG_MAX, > + }; > + long write_chunk; > + > + write_chunk = writeback_chunk_size(wb, work); > + wbc.nr_to_write = write_chunk; > + sb->s_op->write_metadata(sb, &wbc); > + work->nr_pages -= write_chunk - wbc.nr_to_write; > + > + return write_chunk - wbc.nr_to_write; Ok, writeback_chunk_size() returns a page count. 
We've already gone through the "metadata is not page sized" dance on the dirty accounting side, so how are we supposed to use pages to account for metadata writeback? And, from what I can tell, if work->sync_mode = WB_SYNC_ALL or work->tagged_writepages is set, this will basically tell us to flush the entire dirty metadata cache because write_chunk will get set to LONG_MAX. IOWs, this would appear to me to change sync() behaviour quite dramatically on filesystems where ->write_metadata is implemented. That is, instead of leaving all the metadata dirty in memory and just forcing the journal to stable storage, filesystems will be told to also write back all their dirty metadata before sync() returns, even though it is not necessary to provide correct sync() semantics. Mind you, writeback invocation is so convoluted now I could easily be mis-interpreting this code, but it does seem to me like this code is going to have some unintended behaviours. Cheers, Dave.
Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray
On Sun, Dec 10, 2017 at 08:23:15PM -0800, Matthew Wilcox wrote: > On Mon, Dec 11, 2017 at 10:57:45AM +1100, Dave Chinner wrote: > > i.e. the fact the cmpxchg failed may not have anything to do with a > > race condition - it failed because the slot wasn't empty like we > > expected it to be. There can be any number of reasons the slot isn't > > empty - the API should not "document" that the reason the insert > > failed was a race condition. It should document the case that we > > "couldn't insert because there was an existing entry in the slot". > > Let the surrounding code document the reason why that might have > > happened - it's not for the API to assume reasons for failure. > > > > i.e. this API and potential internal implementation makes much > > more sense: > > > > int > > xa_store_iff_empty(...) > > { > > curr = xa_cmpxchg(&pag->pag_ici_xa, agino, NULL, ip, GFP_NOFS); > > if (!curr) > > return 0; /* success! */ > > if (!IS_ERR(curr)) > > return -EEXIST; /* failed - slot not empty */ > > return PTR_ERR(curr); /* failed - XA internal issue */ > > } > > > > as it replaces the existing preload and insert code in the XFS code > > paths whilst letting us handle and document the "insert failed > > because slot not empty" case however we want. It implies nothing > > about *why* the slot wasn't empty, just that we couldn't do the > > insert because it wasn't empty. > > Yeah, OK. So, over the top of the recent changes I'm looking at this: > > I'm not in love with xa_store_empty() as a name. I almost want > xa_store_weak(), but after my MAP_FIXED_WEAK proposed name got shot > down, I'm leery of it. "empty" is at least a concept we already have > in the API with the comment for xa_init() talking about an empty array > and xa_empty(). I also considered xa_store_null() and xa_overwrite_null() > and xa_replace_null(). Naming remains hard. 
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > index 941f38bb94a4..586b43836905 100644 > --- a/fs/xfs/xfs_icache.c > +++ b/fs/xfs/xfs_icache.c > @@ -451,7 +451,7 @@ xfs_iget_cache_miss( > int flags, > int lock_flags) > { > - struct xfs_inode *ip, *curr; > + struct xfs_inode *ip; > int error; > xfs_agino_t agino = XFS_INO_TO_AGINO(mp, ino); > int iflags; > @@ -498,8 +498,7 @@ xfs_iget_cache_miss( > xfs_iflags_set(ip, iflags); > > /* insert the new inode */ > - curr = xa_cmpxchg(&pag->pag_ici_xa, agino, NULL, ip, GFP_NOFS); > - error = __xa_race(curr, -EAGAIN); > + error = xa_store_empty(&pag->pag_ici_xa, agino, ip, GFP_NOFS, -EAGAIN); > if (error) > goto out_unlock; Can we avoid encoding the error to be returned into the function call? No other generic/library API does this, so this seems like a highly unusual special snowflake approach. I just don't see how creating a whole new error specification convention is justified for the case where it *might* save a line or two of error checking code in a caller? I'd much prefer that the API defines the error on failure, and let the caller handle that specific error however they need to like every other library interface we have... Cheers, Dave.
Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray
On Sat, Dec 09, 2017 at 09:00:18AM -0800, Joe Perches wrote: > On Sat, 2017-12-09 at 09:36 +1100, Dave Chinner wrote: > > 1. Using lockdep_set_novalidate_class() for anything other > > than device->mutex will throw checkpatch warnings. Nice. (*) > [] > > (*) checkpatch.pl is considered mostly harmful round here, too, > > but that's another rant > > How so? Short story is that it barfs all over the slightly non-standard coding style used in XFS. It generates enough noise on incidental things that aren't important that it complicates simple things. e.g. I just moved a block of defines from one header to another, and checkpatch throws about 10 warnings on that because of whitespace. I'm just moving code - I don't want to change it and it doesn't need to be modified because it is neat and easy to read and is obviously correct. A bunch of prototypes I added another parameter to get warnings because they use "unsigned" for an existing parameter that I'm not changing. And so on. This sort of stuff is just lowest-common-denominator noise - great for new code and/or inexperienced developers, but not for working with large bodies of existing code with slightly non-standard conventions. Churning *lots* of code we otherwise wouldn't touch just to shut up checkpatch warnings takes time, risks regressions and makes it harder to trace the history of the code when we are trying to track down bugs. Cheers, Dave.
Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray
On Fri, Dec 08, 2017 at 03:01:31PM -0800, Matthew Wilcox wrote: > On Thu, Dec 07, 2017 at 11:38:43AM +1100, Dave Chinner wrote: > > > > cmpxchg is for replacing a known object in a store - it's not really > > > > intended for doing initial inserts after a lookup tells us there is > > > > nothing in the store. The radix tree "insert only if empty" makes > > > > sense here, because it naturally takes care of lookup/insert races > > > > via the -EEXIST mechanism. > > > > > > > > I think that providing xa_store_excl() (which would return -EEXIST > > > > if the entry is not empty) would be a better interface here, because > > > > it matches the semantics of lookup cache population used all over > > > > the kernel > > > > > > I'm not thrilled with xa_store_excl(), but I need to think about that > > > a bit more. > > > > Not fussed about the name - I just think we need a function that > > matches the insert semantics of the code > > I think I have something that works better for you than returning -EEXIST > (because you don't actually want -EEXIST, you want -EAGAIN): > > /* insert the new inode */ > - spin_lock(&pag->pag_ici_lock); > - error = radix_tree_insert(&pag->pag_ici_root, agino, ip); > - if (unlikely(error)) { > - WARN_ON(error != -EEXIST); > - XFS_STATS_INC(mp, xs_ig_dup); > - error = -EAGAIN; > - goto out_preload_end; > - } > - spin_unlock(&pag->pag_ici_lock); > - radix_tree_preload_end(); > + curr = xa_cmpxchg(&pag->pag_ici_xa, agino, NULL, ip, GFP_NOFS); > + error = __xa_race(curr, -EAGAIN); > + if (error) > + goto out_unlock; > > ... > > -out_preload_end: > - spin_unlock(&pag->pag_ici_lock); > - radix_tree_preload_end(); > +out_unlock: > + if (error == -EAGAIN) > + XFS_STATS_INC(mp, xs_ig_dup); > > I've changed the behaviour slightly in that returning an -ENOMEM used to > hit a WARN_ON, and I don't think that's the right response -- GFP_NOFS > returning -ENOMEM probably gets you a nice warning already from the > mm code. 
It's been a couple of days since I first looked at this, and my initial reaction of "that's horrible!" hasn't changed. What you are proposing here might be a perfectly reasonable *internal implementation* of xa_store_excl(), but it makes for a terrible external API because the semantics and behaviour are so vague. e.g. what does "race" mean here with respect to an insert failure? i.e. the fact the cmpxchg failed may not have anything to do with a race condition - it failed because the slot wasn't empty like we expected it to be. There can be any number of reasons the slot isn't empty - the API should not "document" that the reason the insert failed was a race condition. It should document the case that we "couldn't insert because there was an existing entry in the slot". Let the surrounding code document the reason why that might have happened - it's not for the API to assume reasons for failure. i.e. this API and potential internal implementation makes much more sense: int xa_store_iff_empty(...) { curr = xa_cmpxchg(&pag->pag_ici_xa, agino, NULL, ip, GFP_NOFS); if (!curr) return 0; /* success! */ if (!IS_ERR(curr)) return -EEXIST; /* failed - slot not empty */ return PTR_ERR(curr); /* failed - XA internal issue */ } as it replaces the existing preload and insert code in the XFS code paths whilst letting us handle and document the "insert failed because slot not empty" case however we want. It implies nothing about *why* the slot wasn't empty, just that we couldn't do the insert because it wasn't empty. Cheers, Dave.
Re: Lockdep is less useful than it was
On Fri, Dec 08, 2017 at 10:14:38AM -0800, Matthew Wilcox wrote: > At the moment, the radix tree actively disables the RCU checking that > enabling lockdep would give us. It has to, because it has no idea what > lock protects any individual access to the radix tree. The XArray can > use the RCU checking because it knows that every reference is protected > by either the spinlock or the RCU lock. > > Dave was saying that he has a tree which has to be protected by a mutex > because of where it is in the locking hierarchy, and I was vigorously > declining his proposal of allowing him to skip taking the spinlock. Oh, I wasn't suggesting that you remove the internal tree locking because we need external locking. I was trying to point out that the internal locking doesn't remove the need for external locking, and that there are cases where smearing the internal lock outside the XA tree doesn't work, either. i.e. internal locking doesn't replace all the cases where external locking is required, and so it's less efficient than the existing radix tree code. What I was questioning was the value of replacing the radix tree code with a less efficient structure just to add lockdep validation to a tree that doesn't actually need any extra locking validation... Cheers, Dave.
Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray
On Fri, Dec 08, 2017 at 12:35:07PM -0500, Alan Stern wrote: > On Fri, 8 Dec 2017, Byungchul Park wrote: > > > I'm sorry to hear that.. If I were you, I would also get > > annoyed. And.. thanks for explanation. > > > > But, I think assigning lock classes properly and checking > > relationship of the classes to detect deadlocks is reasonable. > > > > In my opinion about the common lockdep stuff, there are 2 > > problems on it. > > > > 1) Firstly, it's hard to assign lock classes *properly*. By > > default, it relies on the caller site of lockdep_init_map(), > > but we need to assign another class manually, where ordering > > rules are complicated so cannot rely on the caller site. That > > *only* can be done by experts of the subsystem. Sure, but that's not the issue here. The issue here is the lack of communication with subsystem experts and that the annotation complexity warnings given immediately by the subsystem experts were completely ignored... > > I think if they want to get benefit from lockdep, they have no > > choice but to assign classes manually with the domain knowledge, > > or use *lockdep_set_novalidate_class()* to invalidate locks > > making the developers annoyed and not want to use the checking > > for them. > > Lockdep's no_validate class is used when the locking patterns are too > complicated for lockdep to understand. Basically, it tells lockdep to > ignore those locks. Let me just point out two things here: 1. Using lockdep_set_novalidate_class() for anything other than device->mutex will throw checkpatch warnings. Nice. (*) 2. lockdep_set_novalidate_class() is completely undocumented - it's the first I've ever heard of this functionality. i.e. nobody has ever told us there is a mechanism to turn off validation of an object; we've *always* been told to "change your code and/or fix your annotations" when discussing lockdep deficiencies. (**) > The device core uses that class. 
The tree of struct devices, each with > its own lock, gets used in many different and complicated ways. > Lockdep can't understand this -- it doesn't have the ability to > represent an arbitrarily deep hierarchical tree of locks -- so we tell ^^ That largely describes the in-memory structure of XFS, except we have a forest of lock trees, not just one > it to ignore the device locks. > > It sounds like XFS may need to do the same thing with its semaphores. Whoever adds semaphore checking to lockdep can add those annotations. The externalisation of the development cost of new lockdep functionality is one of the problems here. -Dave. (*) checkpatch.pl is considered mostly harmful round here, too, but that's another rant (**) the frequent occurrence of "core code/devs aren't held to the same rules/standard as everyone else" is another rant I have stored up for a rainy day.
Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray
On Fri, Dec 08, 2017 at 01:45:52PM +0900, Byungchul Park wrote: > On Fri, Dec 08, 2017 at 09:22:16AM +1100, Dave Chinner wrote: > > On Thu, Dec 07, 2017 at 11:06:34AM -0500, Theodore Ts'o wrote: > > > On Wed, Dec 06, 2017 at 06:06:48AM -0800, Matthew Wilcox wrote: > > > > > Unfortunately for you, I don't find arguments along the lines of > > > > > "lockdep will save us" at all convincing. lockdep already throws > > > > > too many false positives to be useful as a tool that reliably and > > > > > accurately points out rare, exciting, complex, intricate locking > > > > > problems. > > > > > > > > But it does reliably and accurately point out "dude, you forgot to take > > > > the lock". It's caught a number of real problems in my own testing that > > > > you never got to see. > > > > > > The problem is that if it has too many false positives --- and it's > > > gotten *way* worse with the completion callback "feature", people will > > > just stop using Lockdep as being too annoying and a waste of developer > > > time when trying to figure what is a legitimate locking bug versus > > > lockdep getting confused. > > > > > > I can't even disable the new Lockdep feature which is throwing > > > lots of new false positives --- it's just all or nothing. > > > > > > Dave has just said he's already stopped using Lockdep, as a result. > > > > This is completely OT, but FYI I stopped using lockdep a long time > > ago. We've spent orders of magnitude more time and effort to shut > > up lockdep false positives in the XFS code than we ever have on > > locking problems that lockdep has uncovered. And still lockdep > > throws too many false positives on XFS workloads to be useful to me. > > > > But it's more than that: I understand just how much lockdep *doesn't > > check* and that means *I know I can't rely on lockdep* for potential > > deadlock detection. e.g. 
it doesn't cover semaphores, which means > > Hello, > > I'm careful in saying the following since you seem to feel not good at > crossrelease and even lockdep. Now that cross-release has been > introduced, semaphores can be covered as you might know. Actually, all > general waiters can. And all it will do is create a whole bunch more work for us XFS guys to shut up all the false positive crap that falls out from it because the locking model we have is far more complex than any of the lockdep developers thought was necessary to support, just like happened with the XFS inode annotations all those years ago. e.g. nobody has ever bothered to ask us what is needed to describe XFS's semaphore locking model. If you did that, you'd know that we nest *thousands* of locked semaphores in completely random lock order during metadata buffer writeback. And that this lock order does not reflect the actual locking order rules we have for locking buffers during transactions. Oh, and you'd also know that a semaphore's lock order and context can change multiple times during the lifetime of the buffer. Say we free a block and then reallocate it as something else before it is reclaimed - that buffer now might have a different lock order. Or maybe we promote a buffer to be a root btree block as a result of a join - it's now the first buffer in a lock run, rather than a child. Or we split a tree, and the root is now a node and so no longer is the first buffer in a lock run. Or that we walk sideways along the leaf node's siblings during searches. IOWs, there is no well defined static lock ordering at all for buffers - and therefore semaphores - in XFS at all. And knowing that, you wouldn't simply mention that lockdep can support semaphores now as though that is necessary to "make it work" for XFS. 
It's going to be much simpler for us to just turn off lockdep and ignore whatever crap it sends our way than it is to spend unplanned weeks of our time to try to make lockdep sorta work again. Sure, we might get there in the end, but it's likely to take months, if not years like it did with the XFS inode annotations. > > it has zero coverage of the entire XFS metadata buffer subsystem and > > the complex locking orders we have for metadata updates. > > > > Put simply: lockdep doesn't provide me with any benefit, so I don't > > use it... > > Sad.. I don't think you understand. I'll try to explain. The lockdep infrastructure by itself doesn't make lockdep a useful tool - it mostly generates false positives because it has no concept of locking models that don't match its internal tracking assumptions and/or limitations. That means if we can't suppress the false positives, then l
Re: Lockdep is less useful than it was
On Thu, Dec 07, 2017 at 02:38:03PM -0800, Matthew Wilcox wrote: > On Thu, Dec 07, 2017 at 11:06:34AM -0500, Theodore Ts'o wrote: > > The problem is that if it has too many false positives --- and it's > > gotten *way* worse with the completion callback "feature", people will > > just stop using Lockdep as being too annoying and a waste of developer > > time when trying to figure what is a legitimate locking bug versus > > lockdep getting confused. > > > > I can't even disable the new Lockdep feature which is throwing > > lots of new false positives --- it's just all or nothing. > > You *can* ... but it's way more hacking Kconfig than you ought to have > to do (which is a separate rant ...) > > You need to get LOCKDEP_CROSSRELEASE off. I'd revert patches > e26f34a407aec9c65bce2bc0c838fabe4f051fc6 and > b483cf3bc249d7af706390efa63d6671e80d1c09 > > I think it was a mistake to force these on for everybody; they have a > much higher false-positive rate than the rest of lockdep, so as you say > forcing them on leads to fewer people using *any* of lockdep. > > The bug you're hitting isn't Byungchul's fault; it's an annotation > problem. The same kind of annotation problem that we used to have with > dozens of other places in the kernel which are now fixed. That's one of the fundamental problems with lockdep - it throws the difficulty of solving all these new false positives onto the developers who know nothing about lockdep and don't follow its development. And until they do solve them - especially in critical subsystems that everyone uses like the storage stack - lockdep is essentially worthless. > If you didn't > have to hack Kconfig to get rid of this problem, you'd be happier, right? I'd be much happier if it wasn't turned on by default in the first place. We gave plenty of warnings that there were still unsolved false positive problems with the new checks in the storage stack. Cheers, Dave. 
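For reference, what "getting LOCKDEP_CROSSRELEASE off" would look like once the reverts make the options selectable again (a hypothetical .config fragment; the option names match those used in kernels of that era, but verify against your tree's lib/Kconfig.debug):

```
CONFIG_LOCKDEP=y
CONFIG_PROVE_LOCKING=y
# CONFIG_LOCKDEP_CROSSRELEASE is not set
# CONFIG_LOCKDEP_COMPLETIONS is not set
```

With the two cited commits applied, these were forced on whenever PROVE_LOCKING was enabled, which is why there was no way to disable just the new checks.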
Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray
On Thu, Dec 07, 2017 at 11:06:34AM -0500, Theodore Ts'o wrote: > On Wed, Dec 06, 2017 at 06:06:48AM -0800, Matthew Wilcox wrote: > > > Unfortunately for you, I don't find arguments along the lines of > > > "lockdep will save us" at all convincing. lockdep already throws > > > too many false positives to be useful as a tool that reliably and > > > accurately points out rare, exciting, complex, intricate locking > > > problems. > > > > But it does reliably and accurately point out "dude, you forgot to take > > the lock". It's caught a number of real problems in my own testing that > > you never got to see. > > The problem is that if it has too many false positives --- and it's > gotten *way* worse with the completion callback "feature", people will > just stop using Lockdep as being too annoying and a waste of developer > time when trying to figure what is a legitimate locking bug versus > lockdep getting confused. > > I can't even disable the new Lockdep feature which is throwing > lots of new false positives --- it's just all or nothing. > > Dave has just said he's already stopped using Lockdep, as a result. This is completely OT, but FYI I stopped using lockdep a long time ago. We've spent orders of magnitude more time and effort to shut up lockdep false positives in the XFS code than we ever have on locking problems that lockdep has uncovered. And still lockdep throws too many false positives on XFS workloads to be useful to me. But it's more than that: I understand just how much lockdep *doesn't check* and that means *I know I can't rely on lockdep* for potential deadlock detection. e.g. it doesn't cover semaphores, which means it has zero coverage of the entire XFS metadata buffer subsystem and the complex locking orders we have for metadata updates. Put simply: lockdep doesn't provide me with any benefit, so I don't use it... Cheers, Dave. 
Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray
On Wed, Dec 06, 2017 at 06:06:48AM -0800, Matthew Wilcox wrote: > On Wed, Dec 06, 2017 at 07:44:04PM +1100, Dave Chinner wrote: > > On Tue, Dec 05, 2017 at 08:45:49PM -0800, Matthew Wilcox wrote: > > > That said, using xa_cmpxchg() in the dquot code looked like the right > > > thing to do? Since we'd dropped the qi mutex and the ILOCK, it looks > > > entirely reasonable for another thread to come in and set up the dquot. > > > But I'm obviously quite ignorant of the XFS internals, so maybe there's > > > something else going on that makes this essentially a "can't happen". > > > > It's no different to the inode cache code, which drops the RCU > > lock on lookup miss, instantiates the new inode (maybe reading it > > off disk), then locks the tree and attempts to insert it. Both cases > > use "insert if empty, otherwise retry lookup from start" semantics. > > Ah. I had my focus set a little narrow on the inode cache code and didn't > recognise the pattern. > > Why do you sleep for one jiffy after encountering a miss, then seeing > someone else insert the inode for you? The sleep is a backoff that allows whatever we raced with to complete, be it a hit that raced with an inode being reclaimed and removed, or a miss that raced with another insert. Ideally we'd sleep on the XFS_INEW bit, similar to the vfs I_NEW flag, but it's not quite that simple with the reclaim side of things... > > cmpxchg is for replacing a known object in a store - it's not really > > intended for doing initial inserts after a lookup tells us there is > > nothing in the store. The radix tree "insert only if empty" makes > > sense here, because it naturally takes care of lookup/insert races > > via the -EEXIST mechanism. 
> > > > I think that providing xa_store_excl() (which would return -EEXIST > > if the entry is not empty) would be a better interface here, because > > it matches the semantics of lookup cache population used all over > > the kernel > > I'm not thrilled with xa_store_excl(), but I need to think about that > a bit more. Not fussed about the name - I just think we need a function that matches the insert semantics of the code > > > I'm quite happy to have normal API variants that don't save/restore > > > interrupts. Just need to come up with good names ... I don't think > > > xa_store_noirq() is a good name, but maybe you do? > > > > I'd prefer not to have to deal with such things at all. :P > > > > How many subsystems actually require irq safety in the XA locking > > code? Make them use irqsafe versions, not make everyone else use > > "noirq" versions, as is the convention for the rest of the kernel > > code > > Hard to say how many existing radix tree users require the irq safety. The mapping tree requires it because it gets called from IO completion contexts to clear page writeback state, but I don't know about any of the others. > Also hard to say how many potential users (people currently using > linked lists, people using resizable arrays, etc) need irq safety. > My thinking was "make it safe by default and let people who know better > have a way to opt out", but there's definitely something to be said for > "make it fast by default and let people who need the unusual behaviour > type those extra few letters". > > So, you're arguing for providing xa_store(), xa_store_irq(), xa_store_bh() > and xa_store_irqsafe()? (at least on demand, as users come to light?) > At least the read side doesn't require any variants; everybody can use > RCU for read side protection. 
That would follow the pattern of the rest of the kernel APIs, though I think it might be cleaner to simply state the locking requirement to xa_init() and keep all those details completely internal rather than encoding them into API calls. After all, the "irqsafe-ness" of the locking needs to be consistent across the entire XA instance > ("safe", not "save" because I wouldn't make the caller provide the > "flags" argument). > > > > At least, not today. One of the future plans is to allow xa_nodes to > > > be allocated from ZONE_MOVABLE. In order to do that, we have to be > > > able to tell which lock protects any given node. With the XArray, > > > we can find that out (xa_node->root->xa_lock); with the radix tree, > > > we don't even know what kind of lock protects the tree. > > > > Yup, this is a prime example of why we shouldn't be creating > > external dependencies by smearing the locking context outside the XA > > structure itself. It's not a stretch to se
Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray
On Tue, Dec 05, 2017 at 08:45:49PM -0800, Matthew Wilcox wrote: > On Wed, Dec 06, 2017 at 02:14:56PM +1100, Dave Chinner wrote: > > > The other conversions use the normal API instead of the advanced API, so > > > all of this gets hidden away. For example, the inode cache does this: > > > > Ah, OK, that's not obvious from the code changes. :/ > > Yeah, it's a lot easier to understand (I think!) if you build the > docs in that tree and look at > file:///home/willy/kernel/xarray-3/Documentation/output/core-api/xarray.html > (mutatis mutandi). I've tried to tell a nice story about how to put > all the pieces together from the normal to the advanced API. > > > However, it's probably overkill for XFS. In all the cases, when we > > insert there should be no entry in the tree - the > > radix tree insert error handling code there was simply catching > > "should never happen" cases and handling it without crashing. > > I thought it was probably overkill to be using xa_cmpxchg() in the > pag_ici patch. I didn't want to take away your error handling as part > of the conversion, but I think a rational person implementing it today > would just call xa_store() and not even worry about the return value > except to check it for IS_ERR(). *nod* > That said, using xa_cmpxchg() in the dquot code looked like the right > thing to do? Since we'd dropped the qi mutex and the ILOCK, it looks > entirely reasonable for another thread to come in and set up the dquot. > But I'm obviously quite ignorant of the XFS internals, so maybe there's > something else going on that makes this essentially a "can't happen". It's no different to the inode cache code, which drops the RCU lock on lookup miss, instantiates the new inode (maybe reading it off disk), then locks the tree and attempts to insert it. Both cases use "insert if empty, otherwise retry lookup from start" semantics. 
cmpxchg is for replacing a known object in a store - it's not really intended for doing initial inserts after a lookup tells us there is nothing in the store. The radix tree "insert only if empty" makes sense here, because it naturally takes care of lookup/insert races via the -EEXIST mechanism. I think that providing xa_store_excl() (which would return -EEXIST if the entry is not empty) would be a better interface here, because it matches the semantics of lookup cache population used all over the kernel. > > Now that I've looked at this, I have to say that having a return > > value of NULL meaning "success" is quite counter-intuitive. That's > > going to fire my "that looks so wrong" detector every time I look at > > the code and notice it's erroring out on a non-null return value > > that isn't a PTR_ERR case. > > It's the same convention as cmpxchg(). I think it's triggering your > "looks so wrong" detector because it's fundamentally not the natural > thing to write. Most definitely the case, and this is why it's a really bad interface for the semantics we have. This is how we end up with code that makes it easy for programmers to screw up pointer checks in error handling... :/ > I'm quite happy to have normal API variants that don't save/restore > interrupts. Just need to come up with good names ... I don't think > xa_store_noirq() is a good name, but maybe you do? I'd prefer not to have to deal with such things at all. :P How many subsystems actually require irq safety in the XA locking code? Make them use irqsafe versions, not make everyone else use "noirq" versions, as is the convention for the rest of the kernel code. > > > It's the design pattern I've always intended to use. Naturally, the > > > xfs radix trees weren't my initial target; it was the page cache, and > > > the page cache does the same thing; uses the tree_lock to protect both > > > the radix tree and several other fields in that same data structure.
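[Editor's note: the xa_store_excl() interface proposed above is hypothetical -- it did not exist at the time of this thread. A minimal sketch of its contract, again with a toy store rather than the real XArray:]

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define NSLOTS 16
static void *xslots[NSLOTS];

/*
 * Hypothetical "store only if empty" primitive: succeeds (returns 0)
 * only when the slot is empty, and reports -EEXIST otherwise -- the
 * same contract radix_tree_insert() offered, so lookup/insert races
 * resolve through an explicit error code rather than a returned
 * pointer that must be compared against NULL.
 */
static int store_excl(unsigned int index, void *entry)
{
    if (xslots[index] != NULL)
        return -EEXIST;
    xslots[index] = entry;
    return 0;
}
```

With this shape, the caller's error handling reads naturally: a non-zero return is always an error, matching the kernel-wide convention.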
> > > > > > I'm open to argument on this though ... particularly if you have a better > > > design pattern in mind! > > > > I don't mind structures having internal locking - I have a problem > > with leaking them into contexts outside the structure they protect. > > That way lies madness - you can't change the internal locking in > > future because of external dependencies, and the moment you need > > something different externally we've got to go back to an external > > lock anyway. > > > > This is demonstrated by the way you converted the XFS dquot tree - > > you didn't replace the dquot tree lock with the internal xa_lock > > because it's a mutex and we have to sleep holding it. IOWs, we've > > added another layer of locking here, not simplif
Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray
On Tue, Dec 05, 2017 at 06:02:08PM -0800, Matthew Wilcox wrote: > On Wed, Dec 06, 2017 at 12:36:48PM +1100, Dave Chinner wrote: > > > - if (radix_tree_preload(GFP_NOFS)) > > > - return -ENOMEM; > > > - > > > INIT_LIST_HEAD(&elem->list_node); > > > elem->key = key; > > > > > > - spin_lock(&mru->lock); > > > - error = radix_tree_insert(&mru->store, key, elem); > > > - radix_tree_preload_end(); > > > - if (!error) > > > - _xfs_mru_cache_list_insert(mru, elem); > > > - spin_unlock(&mru->lock); > > > + do { > > > + xas_lock(&xas); > > > + xas_store(&xas, elem); > > > + error = xas_error(&xas); > > > + if (!error) > > > + _xfs_mru_cache_list_insert(mru, elem); > > > + xas_unlock(&xas); > > > + } while (xas_nomem(&xas, GFP_NOFS)); > > Ok, so why does this have a retry loop on ENOMEM despite the > > existing code handling that error? And why put such a loop in this > > code and not any of the other XFS code that used > > radix_tree_preload() and is arguably much more important to avoid > > ENOMEM on insert (e.g. the inode cache)? > > If we need more nodes in the tree, xas_store() will try to allocate them > with GFP_NOWAIT | __GFP_NOWARN. If that fails, it signals it in xas_error(). > xas_nomem() will notice that we're in an ENOMEM situation, and allocate > a node using your preferred GFP flags (NOIO in your case). Then we retry, > guaranteeing forward progress. [1] > > The other conversions use the normal API instead of the advanced API, so > all of this gets hidden away. For example, the inode cache does this: > > + curr = xa_cmpxchg(&pag->pag_ici_xa, agino, NULL, ip, GFP_NOFS); > > and xa_cmpxchg internally does: > > do { > xa_lock_irqsave(xa, flags); > curr = xas_create(&xas); > if (curr == old) > xas_store(&xas, entry); > xa_unlock_irqrestore(xa, flags); > } while (xas_nomem(&xas, gfp)); Ah, OK, that's not obvious from the code changes. :/ However, it's probably overkill for XFS.
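[Editor's note: the forward-progress guarantee Matthew describes -- try a non-blocking allocation under the lock, and on failure preallocate outside the lock with the caller's GFP flags before retrying -- can be sketched as a toy loop. The allocator stand-ins below are illustrative; they are not the real xas_* functions.]

```c
#include <assert.h>
#include <stdbool.h>

static int  nowait_failures = 1;   /* force one GFP_NOWAIT-style failure */
static bool have_prealloc;

/* Stand-in for a GFP_NOWAIT node allocation: may fail under pressure. */
static bool alloc_nowait(void)
{
    if (nowait_failures > 0) {
        nowait_failures--;
        return false;
    }
    return true;
}

/* Stand-in for xas_nomem(): allocate with the caller's GFP flags,
 * outside the lock, where blocking is allowed. */
static bool prealloc_blocking(void)
{
    have_prealloc = true;
    return true;   /* "memory allocated, a retry makes sense" */
}

/* Shape of the advanced-API insert loop quoted above. */
static int insert_with_retry(int *attempts)
{
    do {
        /* "xas_lock()" -- critical section begins */
        (*attempts)++;
        if (alloc_nowait() || have_prealloc)
            return 0;               /* "xas_store()" succeeded */
        /* "xas_unlock()" -- critical section ends */
    } while (prealloc_blocking()); /* "xas_nomem(&xas, GFP_NOFS)" */
    return -1;                      /* allocation with caller flags failed */
}
```

The key property: the blocking allocation happens outside the critical section, so the lock is never held across a sleep, yet the loop cannot spin forever because each retry is preceded by a successful preallocation.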
In all the cases, when we insert there should be no entry in the tree - the radix tree insert error handling code there was simply catching "should never happen" cases and handling it without crashing. Now that I've looked at this, I have to say that having a return value of NULL meaning "success" is quite counter-intuitive. That's going to fire my "that looks so wrong" detector every time I look at the code and notice it's erroring out on a non-null return value that isn't a PTR_ERR case. Also, there's no need for irqsave/restore() locking contexts here as we never access these caches from interrupt contexts. That's just going to be extra overhead, especially on workloads that run 10^6 inodes through the cache every second. That's a problem caused by driving the locks into the XA structure and then needing to support callers that require irq safety. > > Also, I really don't like the pattern of using xa_lock()/xa_unlock() > > to protect access to an external structure. i.e. the mru->lock > > context is protecting multiple fields and operations in the MRU > > structure, not just the radix tree operations. Turning that around > > so that a larger XFS structure and algorithm is now protected by an > > opaque internal lock from a generic storage structure that forms part > > of the larger structure seems like a bad design pattern to me... > > It's the design pattern I've always intended to use. Naturally, the > xfs radix trees weren't my initial target; it was the page cache, and > the page cache does the same thing; uses the tree_lock to protect both > the radix tree and several other fields in that same data structure. > > I'm open to argument on this though ... particularly if you have a better > design pattern in mind! I don't mind structures having internal locking - I have a problem with leaking them into contexts outside the structure they protect.
That way lies madness - you can't change the internal locking in future because of external dependencies, and the moment you need something different externally we've got to go back to an external lock anyway. This is demonstrated by the way you converted the XFS dquot tree - you didn't replace the dquot tree lock with the internal xa_lock because it's a mutex and we have to sleep holding it. IOWs, we've added another layer of locking here, not simplified the code. What I really see here is that we have inconsistent locking patterns w.r.t. XA stores inside XFS - some have an external mutex to cover a wider scope, some use xa_
Re: [PATCH v4 00/73] XArray version 4
On Tue, Dec 05, 2017 at 06:05:15PM -0800, Matthew Wilcox wrote: > On Wed, Dec 06, 2017 at 12:45:49PM +1100, Dave Chinner wrote: > > On Tue, Dec 05, 2017 at 04:40:46PM -0800, Matthew Wilcox wrote: > > > From: Matthew Wilcox <mawil...@microsoft.com> > > > > > > I looked through some notes and decided this was version 4 of the XArray. > > > Last posted two weeks ago, this version includes a *lot* of changes. > > > I'd like to thank Dave Chinner for his feedback, encouragement and > > > distracting ideas for improvement, which I'll get to once this is merged. > > > > BTW, you need to fix the "To:" line on your patchbombs: > > > > > To: unlisted-recipients: ;, no To-header on input > > > <@gmail-pop.l.google.com> > > > > This bad email address getting quoted to the cc line makes some MTAs > > very unhappy. > > I know :-( I was unhappy when I realised what I'd done. > > https://marc.info/?l=git&m=151252237912266&w=2 > > > I'll give this a quick burn this afternoon and see what catches fire... > > All of the things ... 0day gave me a 90% chance of hanging in one > configuration. Need to drill down on it more and find out what stupid > thing I've done wrong this time.
Yup, Bad Stuff happened on boot:

[ 24.548039] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 24.548978] 1-...!: (0 ticks this GP) idle=688/0/0 softirq=143/143 fqs=0
[ 24.549926] 5-...!: (0 ticks this GP) idle=db8/0/0 softirq=120/120 fqs=0
[ 24.550864] 6-...!: (0 ticks this GP) idle=d58/0/0 softirq=111/111 fqs=0
[ 24.551802] 8-...!: (5 GPs behind) idle=514/0/0 softirq=189/189 fqs=0
[ 24.552722] 10-...!: (84 GPs behind) idle=ac0/0/0 softirq=80/80 fqs=0
[ 24.553617] 11-...!: (8 GPs behind) idle=cfc/0/0 softirq=95/95 fqs=0
[ 24.554496] 13-...!: (8 GPs behind) idle=b0c/0/0 softirq=82/82 fqs=0
[ 24.555382] 14-...!: (38 GPs behind) idle=a7c/0/0 softirq=93/93 fqs=0
[ 24.556305] 15-...!: (4 GPs behind) idle=b18/0/0 softirq=88/88 fqs=0
[ 24.557190] (detected by 9, t=5252 jiffies, g=-178, c=-179, q=994)
[ 24.558051] Sending NMI from CPU 9 to CPUs 1:
[ 24.558703] NMI backtrace for cpu 1 skipped: idling at native_safe_halt+0x2/0x10
[ 24.559654] Sending NMI from CPU 9 to CPUs 5:
[ 24.559675] NMI backtrace for cpu 5 skipped: idling at native_safe_halt+0x2/0x10
[ 24.560654] Sending NMI from CPU 9 to CPUs 6:
[ 24.560689] NMI backtrace for cpu 6 skipped: idling at native_safe_halt+0x2/0x10
[ 24.561655] Sending NMI from CPU 9 to CPUs 8:
[ 24.561701] NMI backtrace for cpu 8 skipped: idling at native_safe_halt+0x2/0x10
[ 24.562654] Sending NMI from CPU 9 to CPUs 10:
[ 24.562675] NMI backtrace for cpu 10 skipped: idling at native_safe_halt+0x2/0x10
[ 24.563653] Sending NMI from CPU 9 to CPUs 11:
[ 24.563669] NMI backtrace for cpu 11 skipped: idling at native_safe_halt+0x2/0x10
[ 24.564653] Sending NMI from CPU 9 to CPUs 13:
[ 24.564670] NMI backtrace for cpu 13 skipped: idling at native_safe_halt+0x2/0x10
[ 24.565652] Sending NMI from CPU 9 to CPUs 14:
[ 24.565674] NMI backtrace for cpu 14 skipped: idling at native_safe_halt+0x2/0x10
[ 24.566652] Sending NMI from CPU 9 to CPUs 15:
[ 24.59] NMI backtrace for cpu 15 skipped: idling at native_safe_halt+0x2/0x10
[ 24.567653] rcu_preempt kthread starved for 5256 jiffies! g18446744073709551438 c18446744073709551437 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->7
[ 24.567654] rcu_preempt I15128 9 2 0x8000
[ 24.567660] Call Trace:
[ 24.567679]  ? __schedule+0x289/0x880
[ 24.567681]  schedule+0x2f/0x90
[ 24.567682]  schedule_timeout+0x152/0x370
[ 24.567686]  ? __next_timer_interrupt+0xc0/0xc0
[ 24.567689]  rcu_gp_kthread+0x561/0x880
[ 24.567691]  ? force_qs_rnp+0x1a0/0x1a0
[ 24.567693]  kthread+0x111/0x130
[ 24.567695]  ? __kthread_create_worker+0x120/0x120
[ 24.567697]  ret_from_fork+0x24/0x30
[ 44.064092] watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kswapd0:854]
[ 44.065920] CPU: 0 PID: 854 Comm: kswapd0 Not tainted 4.15.0-rc2-dgc #228
[ 44.067769] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[ 44.070030] RIP: 0010:smp_call_function_single+0xce/0x100
[ 44.071521] RSP: :c90001d2fb20 EFLAGS: 0202 ORIG_RAX: ff11
[ 44.073592] RAX: RBX: 88013ab515c8 RCX: c9000350bb20
[ 44.075560] RDX: 0001 RSI: c90001d2fb20 RDI: c90001d2fb20
[ 44.077531] RBP: c90001d2fb50 R08: 0007 R09: 0080
[ 44.079483] R10: c90001d2fb78 R11: c90001d2fb30 R12: c90001d2fc10
[ 44.081465] R13: ea000449fc78 R14: ea000449fc58 R15: 88013ba36c40
[ 44.083434] FS: () GS:88013fc0() knlGS:
[ 44.085683] CS: 0010 DS: ES: CR0
Re: [PATCH v4 00/73] XArray version 4
On Wed, Dec 06, 2017 at 01:53:41AM +, Matthew Wilcox wrote: > Huh, you've caught a couple of problems that 0day hasn't sent me yet. Try > turning on DAX or TRANSPARENT_HUGEPAGE. Thanks! Dax is turned on, CONFIG_TRANSPARENT_HUGEPAGE is not. Looks like nothing is setting CONFIG_RADIX_TREE_MULTIORDER, which is what xas_set_order() is hidden under. Ah, CONFIG_ZONE_DEVICE turns it on, not CONFIG_DAX/CONFIG_FS_DAX. Hmmm. That seems wrong if it's used in fs/dax.c... $ grep DAX .config CONFIG_DAX=y CONFIG_FS_DAX=y $ grep ZONE_DEVICE .config CONFIG_ARCH_HAS_ZONE_DEVICE=y $ So I have DAX enabled, but not ZONE_DEVICE? Shouldn't DAX be selecting ZONE_DEVICE, not relying on a user to select both of them so that stuff works properly? Hmmm - there's no menu option to turn on zone device, so it's selected by something else? Oh, HMM turns on ZONE_DEVICE. But that is "default y", so should be turned on. But it's not? And there's no obvious HMM menu config option, either. What a godawful mess Kconfig has turned into. I'm just going to enable TRANSPARENT_HUGEPAGE - madness awaits me if I follow the other path down the rat hole. Ok, it built this time. -Dave.
> > > -Original Message----- > > From: Dave Chinner [mailto:da...@fromorbit.com] > > Sent: Tuesday, December 5, 2017 8:51 PM > > To: Matthew Wilcox <wi...@infradead.org> > > Cc: Matthew Wilcox <mawil...@microsoft.com>; Ross Zwisler > > <ross.zwis...@linux.intel.com>; Jens Axboe <ax...@kernel.dk>; Rehas > > Sachdeva <aquan...@gmail.com>; linux...@kvack.org; linux- > > fsde...@vger.kernel.org; linux-f2fs-de...@lists.sourceforge.net; linux- > > ni...@vger.kernel.org; linux-btrfs@vger.kernel.org; > > linux-...@vger.kernel.org; > > linux-...@vger.kernel.org; linux-ker...@vger.kernel.org > > Subject: Re: [PATCH v4 00/73] XArray version 4 > > > > On Wed, Dec 06, 2017 at 12:45:49PM +1100, Dave Chinner wrote: > > > On Tue, Dec 05, 2017 at 04:40:46PM -0800, Matthew Wilcox wrote: > > > > From: Matthew Wilcox <mawil...@microsoft.com> > > > > > > > > I looked through some notes and decided this was version 4 of the > > > > XArray. > > > > Last posted two weeks ago, this version includes a *lot* of changes. > > > > I'd like to thank Dave Chinner for his feedback, encouragement and > > > > distracting ideas for improvement, which I'll get to once this is > > > > merged. > > > > > > BTW, you need to fix the "To:" line on your patchbombs: > > > > > > > To: unlisted-recipients: ;, no To-header on input <@gmail- > > pop.l.google.com> > > > > > > This bad email address getting quoted to the cc line makes some MTAs > > > very unhappy. > > > > > > > > > > > Highlights: > > > > - Over 2000 words of documentation in patch 8! And lots more > > > > kernel-doc. > > > > - The page cache is now fully converted to the XArray. > > > > - Many more tests in the test-suite. > > > > > > > > This patch set is not for applying. 0day is still reporting problems, > > > > and I'd feel bad for eating someone's data. These patches apply on top > > > > of a set of prepatory patches which just aren't interesting. 
If you > > > > want to see the patches applied to a tree, I suggest pulling my git > > > > tree: > > > http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/xarray-2017-12-04 > > > > I also left out the idr_preload removals. They're still in the git > > > > tree, > > > > but I'm not looking for feedback on them. > > > > > > I'll give this a quick burn this afternoon and see what catches fire... > > > > Build warnings/errors: > > > > . > > lib/radix-tree.c:700:13: warning: 'radix_tree_free_nodes' defined but not > > used > > [-Wunused-function] > > static void radix_tree_free_nodes(struct radix_tree_node *node) > > . > > lib/xarray.c: In function 'xas_max': > > lib/xarray.c:291:16: warning: unused variable 'mask' > > [-Wunused-variable] > > unsigned long mask, max = xas->xa_index; > > ^~~~ > > .. > > fs/dax.c: In function 'grab_mapping_entry': > > fs/dax.c:305:2: error: implicit declaration of function 'xas_set_order'; > > did you > > mean 'xas_set_err'? [-Werror=implicit-function-declaration] > > xas_set_order(&xas, index, size_flag ? PMD_ORDER : 0); > > ^ > > scripts/Makefile.build:310: recipe for target 'fs/dax.o' failed > > make[1]: *** [fs/dax.o] Error 1 > > > > -Dave. > > -- > > Dave Chinner > > da...@fromorbit.com -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 00/73] XArray version 4
On Wed, Dec 06, 2017 at 12:45:49PM +1100, Dave Chinner wrote: > On Tue, Dec 05, 2017 at 04:40:46PM -0800, Matthew Wilcox wrote: > > From: Matthew Wilcox <mawil...@microsoft.com> > > > > I looked through some notes and decided this was version 4 of the XArray. > > Last posted two weeks ago, this version includes a *lot* of changes. > > I'd like to thank Dave Chinner for his feedback, encouragement and > > distracting ideas for improvement, which I'll get to once this is merged. > > BTW, you need to fix the "To:" line on your patchbombs: > > > To: unlisted-recipients: ;, no To-header on input <@gmail-pop.l.google.com> > > This bad email address getting quoted to the cc line makes some MTAs > very unhappy. > > > > > Highlights: > > - Over 2000 words of documentation in patch 8! And lots more kernel-doc. > > - The page cache is now fully converted to the XArray. > > - Many more tests in the test-suite. > > > > This patch set is not for applying. 0day is still reporting problems, > > and I'd feel bad for eating someone's data. These patches apply on top > > of a set of preparatory patches which just aren't interesting. If you > > want to see the patches applied to a tree, I suggest pulling my git tree: > > http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/xarray-2017-12-04 > > I also left out the idr_preload removals. They're still in the git tree, > > but I'm not looking for feedback on them. > > I'll give this a quick burn this afternoon and see what catches fire... Build warnings/errors: . lib/radix-tree.c:700:13: warning: 'radix_tree_free_nodes' defined but not used [-Wunused-function] static void radix_tree_free_nodes(struct radix_tree_node *node) . lib/xarray.c: In function 'xas_max': lib/xarray.c:291:16: warning: unused variable 'mask' [-Wunused-variable] unsigned long mask, max = xas->xa_index; ^~~~ ..
fs/dax.c: In function 'grab_mapping_entry': fs/dax.c:305:2: error: implicit declaration of function 'xas_set_order'; did you mean 'xas_set_err'? [-Werror=implicit-function-declaration] xas_set_order(&xas, index, size_flag ? PMD_ORDER : 0); ^ scripts/Makefile.build:310: recipe for target 'fs/dax.o' failed make[1]: *** [fs/dax.o] Error 1 -Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH v4 00/73] XArray version 4
On Tue, Dec 05, 2017 at 04:40:46PM -0800, Matthew Wilcox wrote: > From: Matthew Wilcox <mawil...@microsoft.com> > > I looked through some notes and decided this was version 4 of the XArray. > Last posted two weeks ago, this version includes a *lot* of changes. > I'd like to thank Dave Chinner for his feedback, encouragement and > distracting ideas for improvement, which I'll get to once this is merged. BTW, you need to fix the "To:" line on your patchbombs: > To: unlisted-recipients: ;, no To-header on input <@gmail-pop.l.google.com> This bad email address getting quoted to the cc line makes some MTAs very unhappy. > > Highlights: > - Over 2000 words of documentation in patch 8! And lots more kernel-doc. > - The page cache is now fully converted to the XArray. > - Many more tests in the test-suite. > > This patch set is not for applying. 0day is still reporting problems, > and I'd feel bad for eating someone's data. These patches apply on top > of a set of preparatory patches which just aren't interesting. If you > want to see the patches applied to a tree, I suggest pulling my git tree: > http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/xarray-2017-12-04 > I also left out the idr_preload removals. They're still in the git tree, > but I'm not looking for feedback on them. I'll give this a quick burn this afternoon and see what catches fire... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray
On Tue, Dec 05, 2017 at 04:41:58PM -0800, Matthew Wilcox wrote: > From: Matthew Wilcox <mawil...@microsoft.com> > > This eliminates a call to radix_tree_preload(). . > void > @@ -431,24 +424,24 @@ xfs_mru_cache_insert( > unsigned long key, > struct xfs_mru_cache_elem *elem) > { > + XA_STATE(xas, &mru->store, key); > int error; > > ASSERT(mru && mru->lists); > if (!mru || !mru->lists) > return -EINVAL; > > - if (radix_tree_preload(GFP_NOFS)) > - return -ENOMEM; > - > INIT_LIST_HEAD(&elem->list_node); > elem->key = key; > > - spin_lock(&mru->lock); > - error = radix_tree_insert(&mru->store, key, elem); > - radix_tree_preload_end(); > - if (!error) > - _xfs_mru_cache_list_insert(mru, elem); > - spin_unlock(&mru->lock); > + do { > + xas_lock(&xas); > + xas_store(&xas, elem); > + error = xas_error(&xas); > + if (!error) > + _xfs_mru_cache_list_insert(mru, elem); > + xas_unlock(&xas); > + } while (xas_nomem(&xas, GFP_NOFS)); Ok, so why does this have a retry loop on ENOMEM despite the existing code handling that error? And why put such a loop in this code and not any of the other XFS code that used radix_tree_preload() and is arguably much more important to avoid ENOMEM on insert (e.g. the inode cache)? Also, I really don't like the pattern of using xa_lock()/xa_unlock() to protect access to an external structure. i.e. the mru->lock context is protecting multiple fields and operations in the MRU structure, not just the radix tree operations. Turning that around so that a larger XFS structure and algorithm is now protected by an opaque internal lock from a generic storage structure that forms part of the larger structure seems like a bad design pattern to me... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
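[Editor's note: the locking-scope objection above -- one lock must atomically cover both the index store and the adjacent MRU list state -- can be illustrated with a toy container. Names are illustrative; this is not the XFS MRU cache.]

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy MRU cache: the store *and* the list length must be updated
 * atomically, so whichever lock is chosen (an external mutex, or the
 * store's internal xa_lock-style lock) has to span both updates.
 */
#define MRU_SLOTS 8

struct mru {
    void *store[MRU_SLOTS];   /* stands in for the radix tree / XArray */
    int   list_len;           /* stands in for the MRU list state */
};

static int mru_insert(struct mru *mru, unsigned int key, void *elem)
{
    /* lock(mru) -- one critical section covers both fields */
    if (mru->store[key] != NULL)
        return -1;            /* already present: nothing changes */
    mru->store[key] = elem;
    mru->list_len++;          /* second field updated under same lock */
    /* unlock(mru) */
    return 0;
}
```

The debate in this thread is only about *which* lock plays that role: an external lock owned by the larger structure, or the generic container's internal lock leaked outward via xa_lock()/xa_unlock().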
Re: [PATCH v2] iomap: report collisions between directio and buffered writes to userspace
On Tue, Nov 21, 2017 at 04:52:53AM -0800, Matthew Wilcox wrote: > On Tue, Nov 21, 2017 at 05:48:15PM +1100, Dave Chinner wrote: > > On Mon, Nov 20, 2017 at 08:32:40PM -0800, Matthew Wilcox wrote: > > > On Mon, Nov 20, 2017 at 05:37:53PM -0800, Darrick J. Wong wrote: > > > > On Tue, Nov 21, 2017 at 09:27:49AM +1100, Dave Chinner wrote: > > > > > First thing I noticed was that "xa" as a prefix is already quite > > > > > widely used in XFS - it's shorthand for "XFS AIL". Indeed, xa_lock > > > > > > The X stands for 'eXpandable' or 'eXtending'. I really don't want to > > > use more than a two-letter acronym for whatever the XArray ends up being > > > called. One of the problems with the radix tree is that everything has > > > that 11-character 'radix_tree_' prefix; just replacing that with 'xa_' > > > makes a huge improvement to readability. > > > > Yeah, understood. That's why > > we have very little clear > > prefix namespace left :/ > > > > [ somedays I write something that looks sorta like a haiku, and from > > that point on everything just starts falling out of my brain that > > way. I blame Eric for this today! :P ] > > When the namespace is > tight we must consider the > competing users. > > The earliest us'r > has a claim to a prefix > we are used to it. > > Also a more wide- > spread user has a claim to > a shorter prefix. > > Would you mind changing > your prefix to one only > one letter longer? We can do something like that, though Darrick has the final say in it. > ... ok, I give up ;-) Top marks for effort :) > All your current usage of the xa_ prefix looks somewhat like this: > > fs/xfs/xfs_trans_ail.c: spin_lock(>xa_lock); > > with honourable mentions to: > fs/xfs/xfs_log.c: spin_lock(>m_ail->xa_lock); > > Would you mind if I bolt a patch on to the front of the series called > something like "free up xa_ namespace" that renamed your xa_* to ail_*? > There are no uses of the 'ail_' prefix in the kernel today. 
> > I don't think that > spin_lock(&ailp->ail_lock); > loses any readability. Not sure that's going to work - there's an "xa_ail" member for the AIL list head. That would now read in places: if (list_empty(&ailp->ail_ail)) I'd be inclined to just drop the "xa_" prefix from the XFS structure. There is no information loss by removing the prefix in the XFS code because the pointer name tells us what structure it is pointing at. > > By the way, what does AIL stand for? It'd be nice if it were spelled out > in at least one of the header files, maybe fs/xfs/xfs_trans_priv.h? Active Item List. See the section "Tracking Changes" in Documentation/filesystems/xfs-delayed-logging-design.txt for the full rundown, but in short: "The log item is already used to track the log items that have been written to the log but not yet written to disk. Such log items are considered "active" and as such are stored in the Active Item List (AIL) which is a LSN-ordered double linked list. Items are inserted into this list during log buffer IO completion, after which they are unpinned and can be written to disk." The lack of comments describing the AIL is historic - it's never been documented in the code, nor has the whole relogging concept it implements. I wrote the document above when introducing the CIL (Committed Item List) because almost no-one actively working on XFS understood how the whole journalling subsystem worked in any detail. > > Zoetrope Array. > > Labyrinth of illusion. > > Structure never ends. > > Thank you for making me look up zoetrope ;-) My pleasure :) Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH v2] iomap: report collisions between directio and buffered writes to userspace
On Mon, Nov 20, 2017 at 08:32:40PM -0800, Matthew Wilcox wrote: > On Mon, Nov 20, 2017 at 05:37:53PM -0800, Darrick J. Wong wrote: > > On Tue, Nov 21, 2017 at 09:27:49AM +1100, Dave Chinner wrote: > > > First thing I noticed was that "xa" as a prefix is already quite > > > widely used in XFS - it's shorthand for "XFS AIL". Indeed, xa_lock > > > already exists and is quite widely used, so having a generic > > > interface using the same prefixes and lock names is going to be > > > quite confusing in the XFS code. Especially considering there's a > > > fair bit of radix tree use in XFS (e.g. the internal inode and > > > dquot caches). > > > > > > FYI, from fs/xfs/xfs_trans_priv.h: > > > > > > /* > > > * Private AIL structures. > > > * > > > * Eventually we need to drive the locking in here as well. > > > */ > > > struct xfs_ail { > > > struct xfs_mount *xa_mount; > > > struct task_struct *xa_task; > > > struct list_head xa_ail; > > > xfs_lsn_t xa_target; > > > xfs_lsn_t xa_target_prev; > > > struct list_head xa_cursors; > > > spinlock_t xa_lock; > > > xfs_lsn_t xa_last_pushed_lsn; > > > int xa_log_flush; > > > struct list_head xa_buf_list; > > > wait_queue_head_t xa_empty; > > > }; > > > > > > FWIW, why is it named "XArray"? "X" stands for what? It still > > > looks like a tree structure to me, but without a design doc I'm a > > > bit lost to how it differs to the radix tree (apart from the API) > > > and why it's considered an "array". > > > > /me nominates 'xarr' for the prefix because pirates. :P > > The X stands for 'eXpandable' or 'eXtending'. I really don't want to > use more than a two-letter acronym for whatever the XArray ends up being > called. One of the problems with the radix tree is that everything has > that 11-character 'radix_tree_' prefix; just replacing that with 'xa_' > makes a huge improvement to readability. Yeah, understood.
That's why we have very little clear prefix namespace left :/ [ somedays I write something that looks sorta like a haiku, and from that point on everything just starts falling out of my brain that way. I blame Eric for this today! :P ] > I'm open to renaming it, but I want a good name. I think it needs to > be *something* array, so we're looking for a prefix used nowhere in the > kernel. Running 'git grep' in turn for '\<[a-z]a_' really only allows > for ja, ya and za. 'ja_' and 'ya_' only have one user, while 'za_' > has none. So ... anyone got a good retcon for JArray, YArray or ZArray? A Yellow Array. Colour is irrelevant. The bikeshed looks nice. > It's considered an array because it behaves "as if" it's an infinite > array. Infinite Array. Namespace prefix collision this haiku can't solve. > The fact that we chop it up into chunks of 64 entries, and then > arrange those chunks into a tree is not something the average user needs > to concern themselves with. Jumbo Array. Many pieces hidden behind walls. Will anyone notice? > We have a lot of places in the kernel that > fuss around with "allocate N entries, whoops, we exceeded N, kmalloc some > more, oh no kmalloc failed, vmalloc it instead". So the idea is to give > these places an interface that acts like an automatically-resizing array. Zoetrope Array. Labyrinth of illusion. Structure never ends. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH v2] iomap: report collisions between directio and buffered writes to userspace
On Mon, Nov 20, 2017 at 01:51:00PM -0800, Matthew Wilcox wrote: > On Tue, Nov 21, 2017 at 07:26:06AM +1100, Dave Chinner wrote: > > On Mon, Nov 20, 2017 at 08:18:29AM -0800, Matthew Wilcox wrote: > > > On Fri, Nov 17, 2017 at 11:39:25AM -0800, Darrick J. Wong wrote: > > > > If two programs simultaneously try to write to the same part of a file > > > > via direct IO and buffered IO, there's a chance that the post-diowrite > > > > pagecache invalidation will fail on the dirty page. When this happens, > > > > the dio write succeeded, which means that the page cache is no longer > > > > coherent with the disk! > > > > > > This seems like a good opportunity to talk about what I've been working > > > on for solving this problem. The XArray is going to introduce a set > > > of entries which can be stored to locations in the page cache that I'm > > > calling 'wait entries'. > > > > What's this XArray thing you speak of? > > Ah, right, you were on sabbatical at LSFMM this year where I talked > about it. Briefly, it's a new API for the radix tree. The data structure > is essentially unchanged (minor enhancements), but I'm rationalising > existing functionality and adding new abilities. And getting rid of > misfeatures like the preload API and implicit GFP flags. > > My current working tree is here: > > http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/xarray-2017-11-20 First thing I noticed was that "xa" as a prefix is already quite widely used in XFS - it's shorthand for "XFS AIL". Indeed, xa_lock already exists and is quite widely used, so having a generic interface using the same prefixes and lock names is going to be quite confusing in the XFS code. Especially considering there's a fair bit of radix tree use in XFS (e.g. the internal inode and dquot caches). FYI, from fs/xfs/xfs_trans_priv.h: /* * Private AIL structures. * * Eventually we need to drive the locking in here as well.
*/ struct xfs_ail { struct xfs_mount*xa_mount; struct task_struct *xa_task; struct list_headxa_ail; xfs_lsn_t xa_target; xfs_lsn_t xa_target_prev; struct list_headxa_cursors; spinlock_t xa_lock; xfs_lsn_t xa_last_pushed_lsn; int xa_log_flush; struct list_headxa_buf_list; wait_queue_head_t xa_empty; }; > Ignoring the prep patches, the excitement is all to be found with the > commits which start 'xarray:' FWIW, why is it named "XArray"? "X" stands for what? It still looks like a tree structure to me, but without a design doc I'm a bit lost to how it differs to the radix tree (apart from the API) and why it's considered an "array". > If you want an example of it in use, I'm pretty happy with this patch > that switches the brd driver entirely from the radix tree API to the > xarray API: > > http://git.infradead.org/users/willy/linux-dax.git/commitdiff/dbf96ae943e43563cbbaa26e21b656b6fe8f4b0f Looks pretty neat, but I'll reserve judgement for when I see the conversion of the XFS radix tree code > I've been pretty liberal with the kernel-doc, but I haven't written out > a good .rst file to give an overview of how to use it. Let me know when you've written it :) Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] iomap: report collisions between directio and buffered writes to userspace
On Mon, Nov 20, 2017 at 08:18:29AM -0800, Matthew Wilcox wrote: > On Fri, Nov 17, 2017 at 11:39:25AM -0800, Darrick J. Wong wrote: > > If two programs simultaneously try to write to the same part of a file > > via direct IO and buffered IO, there's a chance that the post-diowrite > > pagecache invalidation will fail on the dirty page. When this happens, > > the dio write succeeded, which means that the page cache is no longer > > coherent with the disk! > > This seems like a good opportunity to talk about what I've been working > on for solving this problem. The XArray is going to introduce a set > of entries which can be stored to locations in the page cache that I'm > calling 'wait entries'. What's this XArray thing you speak of? Cheers, Dave.
Re: [PATCH 2/6] writeback: allow for dirty metadata accounting
On Thu, Nov 09, 2017 at 02:30:57PM -0500, Josef Bacik wrote: > From: Josef Bacik <jba...@fb.com> > > Provide a mechanism for file systems to indicate how much dirty metadata they > are holding. This introduces a few things > > 1) Zone stats for dirty metadata, which is the same as the NR_FILE_DIRTY. > 2) WB stat for dirty metadata. This way we know if we need to try and call > into > the file system to write out metadata. This could potentially be used in the > future to make balancing of dirty pages smarter. Ok, so when you have 64k page size and 4k metadata block size and you're using kmalloc() to allocate the storage for the metadata, how do we make use of all this page-based metadata accounting stuff? We could use dirty metadata accounting infrastructure in XFS so the mm/ subsystem can push on dirty metadata before we get into reclaim situations, but I just can't see how this code works when raw pages are not used to back the metadata cache. That is, XFS can use various different sizes of metadata buffers in the same filesystem. For example, we use sector sized buffers for static AG metadata, filesystem blocks for per-AG metadata, some multiple (1-16) of filesystem blocks for inode buffers, and some multiple (1-128) of filesystem blocks for directory blocks. This means we have a combination of buffers we need to account for that are made up of: heap memory when buffer size < page size, single pages when buffer size == page size, and multiple pages when buffer size > page size. The default filesystem config on a 4k page machine with 512 byte sectors will create buffers in all three categories above, which tends to indicate we can't use this new generic infrastructure as proposed. Any thoughts about how we could efficiently support accounting for variable sized, non-page based metadata with this generic infrastructure? Cheers, Dave.
Re: [ANNOUNCE] fsperf: a simple fs/block performance testing framework
On Mon, Oct 09, 2017 at 09:00:51AM -0400, Josef Bacik wrote: > On Mon, Oct 09, 2017 at 04:17:31PM +1100, Dave Chinner wrote: > > On Sun, Oct 08, 2017 at 10:25:10PM -0400, Josef Bacik wrote: > > > > Integrating into fstests means it will be immediately available to > > > > all fs developers, it'll run on everything that everyone already has > > > > setup for filesystem testing, and it will have familiar mkfs/mount > > > > option setup behaviour so there's no new hoops for everyone to jump > > > > through to run it... > > > > > > > > > > TBF I specifically made it as easy as possible because I know we all hate > > > trying > > > to learn new shit. > > > > Yeah, it's also hard to get people to change their workflows to add > > a whole new test harness into them. It's easy if it's just a new > > command to an existing workflow :P > > > > Agreed, so if you probably won't run this outside of fstests then I'll add it > to > xfstests. I envision this tool as being run by maintainers to verify their > pull > requests haven't regressed since the last set of patches, as well as by > anybody > trying to fix performance problems. So it's way more important to me that > you, > Ted, and all the various btrfs maintainers will run it than anybody else. > > > > I figured this was different enough to warrant a separate > > > project, especially since I'm going to add block device jobs so Jens can > > > test > > > block layer things. If we all agree we'd rather see this in fstests then > > > I'm > > > happy to do that too. Thanks, > > > > I'm not fussed either way - it's a good discussion to have, though. > > > > If I want to add tests (e.g. my time-honoured fsmark tests), where > > should I send patches? > > > > I beat you to that! 
> I wanted to avoid adding fs_mark to the suite because it > means parsing another different set of outputs, so I added a new ioengine to > fio > for this > > http://www.spinics.net/lists/fio/msg06367.html > > and added a fio job to do 500k files > > https://github.com/josefbacik/fsperf/blob/master/tests/500kemptyfiles.fio > > The test is disabled by default for now because obviously the fio support > hasn't > landed yet. That seems misguided. fio is good, but it's not a universal solution. > I'd _like_ to expand fio for cases we come up with that aren't possible, as > there's already a ton of measurements that are taken, especially around > latencies. To be properly useful it needs to support more than just fio to run tests. Indeed, it's largely useless to me if that's all it can do or it's a major pain to add support for different tools like fsmark. e.g. my typical perf regression test that you see as the concurrent fsmark create workload is actually a lot more. It does:

  fsmark to create 50m zero length files
  umount, run parallel xfs_repair (excellent mmap_sem/page fault punisher)
  mount, run parallel find -ctime (readdir + lookup traversal)
  unmount, mount, run parallel ls -R (readdir + dtype traversal)
  unmount, mount, parallel rm -rf of 50m files

I have variants that use small 4k files or large files rather than empty files, that use different fsync patterns to stress the log, use grep -R to traverse the data as well as the directory/inode structure instead of find, etc. > That said I'm not opposed to throwing new stuff in there, it just > means we have to add stuff to parse the output and store it in the database > in a > consistent way, which seems like more of a pain than just making fio do what > we > need it to. Thanks, fio is not going to be able to replace the sort of perf tests I run from week to week. If that's all it's going to do then it's not directly useful to me... Cheers, Dave.
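The create/repair/traverse/remove sequence above can be sketched as a shell script. The device, mount point, and fs_mark arguments are placeholders, and the script defaults to a dry run (DRYRUN=1) that only prints each step, so it is safe to invoke as-is.

```shell
#!/bin/bash
# Sketch of the fsmark-based perf regression workflow described above.
# DEV/MNT and the fs_mark invocation are illustrative placeholders.
# Leave DRYRUN set (the default) to print the steps instead of running them.
DEV=${DEV:-/dev/vdb}
MNT=${MNT:-/mnt/scratch}
DRYRUN=${DRYRUN-1}

run() {
    if [ -n "$DRYRUN" ]; then echo "$*"; else "$@"; fi
}

run fs_mark -S0 -s 0 -n 100000 -L 500 -d $MNT/0 -d $MNT/1   # 50m empty files
run umount $MNT
run xfs_repair -o ag_stride=8 $DEV      # parallel repair: mmap_sem punisher
run mount $DEV $MNT
run find $MNT -ctime +1                 # readdir + lookup traversal
run umount $MNT
run mount $DEV $MNT
run ls -R $MNT                          # readdir + dtype traversal
run umount $MNT
run mount $DEV $MNT
run rm -rf $MNT/0 $MNT/1                # parallel unlink phase
```

The small-file, fsync-pattern, and grep -R variants mentioned above would just swap the fs_mark arguments and traversal commands.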
Re: [ANNOUNCE] fsperf: a simple fs/block performance testing framework
On Sun, Oct 08, 2017 at 10:25:10PM -0400, Josef Bacik wrote: > On Mon, Oct 09, 2017 at 11:51:37AM +1100, Dave Chinner wrote: > > On Fri, Oct 06, 2017 at 05:09:57PM -0400, Josef Bacik wrote: > > > Hello, > > > > > > One thing that comes up a lot every LSF is the fact that we have no > > > general way > > > that we do performance testing. Every fs developer has a set of scripts > > > or > > > things that they run with varying degrees of consistency, but nothing > > > central > > > that we all use. I for one am getting tired of finding regressions when > > > we are > > > deploying new kernels internally, so I wired this thing up to try and > > > address > > > this need. > > > > > > We all hate convoluted setups, the more brain power we have to put in to > > > setting > > > something up the less likely we are to use it, so I took the xfstests > > > approach > > > of making it relatively simple to get running and relatively easy to add > > > new > > > tests. For right now the only thing this framework does is run fio > > > scripts. I > > > chose fio because it already gathers loads of performance data about it's > > > runs. > > > We have everything we need there, latency, bandwidth, cpu time, and all > > > broken > > > down by reads, writes, and trims. I figure most of us are familiar > > > enough with > > > fio and how it works to make it relatively easy to add new tests to the > > > framework. > > > > > > I've posted my code up on github, you can get it here > > > > > > https://github.com/josefbacik/fsperf > > > > > > All (well most) of the results from fio are stored in a local sqlite > > > database. > > > Right now the comparison stuff is very crude, it simply checks against the > > > previous run and it only checks a few of the keys by default. You can > > > check > > > latency if you want, but while writing this stuff up it seemed that > > > latency was > > > too variable from run to run to be useful in a "did my thing regress or > > > improve" > > > sort of way. 
> > > > > > The configuration is brain dead simple, the README has examples. All you > > > need > > > to do is make your local.cfg, run ./setup and then run ./fsperf and you > > > are good > > > to go. > > > > Why re-invent the test infrastructure? Why not just make it a > > tests/perf subdir in fstests? > > > > Probably should have led with that shouldn't I have? There's nothing keeping > me > from doing it, but I didn't want to try and shoehorn in a python thing into > fstests. I need python to do the sqlite and the json parsing to dump into the > sqlite database. > > Now if you (and others) are not opposed to this being dropped into tests/perf > then I'll work that up. But it's definitely going to need to be done in > python. > I know you yourself have said you aren't opposed to using python in the past, > so > if that's still the case then I can definitely wire it all up. I have no problems with people using python for stuff like this but, OTOH, I'm not the fstests maintainer anymore :P > > > The plan is to add lots of workloads as we discover regressions and such. > > > We > > > don't want anything that takes too long to run otherwise people won't run > > > this, > > > so the existing tests don't take much longer than a few minutes each. I > > > will be > > > adding some more comparison options so you can compare against averages > > > of all > > > previous runs and such. > > > > Yup, that fits exactly into what fstests is for... :P > > > > Integrating into fstests means it will be immediately available to > > all fs developers, it'll run on everything that everyone already has > > setup for filesystem testing, and it will have familiar mkfs/mount > > option setup behaviour so there's no new hoops for everyone to jump > > through to run it... > > > > TBF I specifically made it as easy as possible because I know we all hate > trying > to learn new shit. Yeah, it's also hard to get people to change their workflows to add a whole new test harness into them. 
It's easy if it's just a new command to an existing workflow :P > I figured this was different enough to warrant a separate > project, especially since I'm going to add block device jobs so Jens can test > block layer things. If we all agree we'd rather see this in fstests then I'm > happy to do that too. Thanks, I'm not fussed either way - it's a good discussion to have, though. If I want to add tests (e.g. my time-honoured fsmark tests), where should I send patches? Cheers, Dave.
Re: [ANNOUNCE] fsperf: a simple fs/block performance testing framework
On Fri, Oct 06, 2017 at 05:09:57PM -0400, Josef Bacik wrote: > Hello, > > One thing that comes up a lot every LSF is the fact that we have no general > way > that we do performance testing. Every fs developer has a set of scripts or > things that they run with varying degrees of consistency, but nothing central > that we all use. I for one am getting tired of finding regressions when we > are > deploying new kernels internally, so I wired this thing up to try and address > this need. > > We all hate convoluted setups, the more brain power we have to put in to > setting > something up the less likely we are to use it, so I took the xfstests approach > of making it relatively simple to get running and relatively easy to add new > tests. For right now the only thing this framework does is run fio scripts. > I > chose fio because it already gathers loads of performance data about it's > runs. > We have everything we need there, latency, bandwidth, cpu time, and all broken > down by reads, writes, and trims. I figure most of us are familiar enough > with > fio and how it works to make it relatively easy to add new tests to the > framework. > > I've posted my code up on github, you can get it here > > https://github.com/josefbacik/fsperf > > All (well most) of the results from fio are stored in a local sqlite database. > Right now the comparison stuff is very crude, it simply checks against the > previous run and it only checks a few of the keys by default. You can check > latency if you want, but while writing this stuff up it seemed that latency > was > too variable from run to run to be useful in a "did my thing regress or > improve" > sort of way. > > The configuration is brain dead simple, the README has examples. All you need > to do is make your local.cfg, run ./setup and then run ./fsperf and you are > good > to go. Why re-invent the test infrastructure? Why not just make it a tests/perf subdir in fstests? 
> The plan is to add lots of workloads as we discover regressions and such. We > don't want anything that takes too long to run otherwise people won't run > this, > so the existing tests don't take much longer than a few minutes each. I will > be > adding some more comparison options so you can compare against averages of all > previous runs and such. Yup, that fits exactly into what fstests is for... :P Integrating into fstests means it will be immediately available to all fs developers, it'll run on everything that everyone already has setup for filesystem testing, and it will have familiar mkfs/mount option setup behaviour so there's no new hoops for everyone to jump through to run it... Cheers, Dave.
Re: Something like ZFS Channel Programs for BTRFS & probably XFS or even VFS?
On Tue, Oct 03, 2017 at 01:40:51PM -0700, Matthew Wilcox wrote: > On Wed, Oct 04, 2017 at 07:10:35AM +1100, Dave Chinner wrote: > > On Tue, Oct 03, 2017 at 03:19:18PM +0200, Martin Steigerwald wrote: > > > [repost. I didn´t notice autocompletion gave me wrong address for > > > fsdevel, > > > blacklisted now] > > > > > > Hello. > > > > > > What do you think of > > > > > > http://open-zfs.org/wiki/Projects/ZFS_Channel_Programs > > > > Domain not found. > > Must be an Australian problem ... Probably, I forgot to stand on my head so everything must have been sent to the server upside down. Though it is a curious failure - it failed until I went to "openzfs.org" and that redirected to "open-zfs.org" and now it all works. Somewhat bizarre. > A ZFS channel program (ZCP) is a small script written in a domain specific > language that manipulate ZFS internals in a single, atomically-visible > operation. For instance, to delete all snapshots of a filesystem a ZCP > could be written which 1) generates the list of snapshots, 2) traverses > that list, and 3) destroys each snapshot unconditionally. Because > each of these statements would be evaluated from within the kernel, > ZCPs can guarantee safety from interference with other concurrent ZFS > modifications. Executing from inside the kernel allows us to guarantee > atomic visibility of these operations (correctness) and allows them to > be performed in a single transaction group (performance). > > A successful implementation of ZCP will: > > 1. Support equivalent functionality for all of the current ZFS commands > with improved performance and correctness from the point of view of the > user of ZFS. > > 2. Facilitate the quick addition of new and useful commands as > ZCP enables the implementation of more powerful operations which > previously would have been unsafe to implement in user programs, or > would require modifications to the kernel for correctness.
Since the > ZCP layer guarantees the atomicity of each ZCP, we only need to write > new sync_tasks for individual simple operations, then can use ZCPs to > chain those simple operations together into more complicated operations. > > 3. Allow ZFS users to safely implement their own ZFS operations without > performing operations they don’t have the privileges for. > > 4. Improve the performance and correctness of existing applications > built on ZFS operations. /me goes and looks at the slides Seems like they are trying to solve a problem of their own making, in that admin operations are run by the kernel from a separate task that is really, really slow. So this scripting is a method of aggregating multiple "sync tasks" into a single operation so there isn't delays between tasks. /me chokes on slide 8/8 "Add a Lua interpreter to the kernel, implement ZFS intrinsics (...) as extensions to the Lua language" Somehow, I don't see that happening in Linux. Yes, I can see us potentially adding some custom functionality in filesystems with eBPF (e.g. custom allocation policies), but I think admin operations need to be done from userspace through a clear, stable interface that supports all the necessary primitives to customise admin operations for different needs. Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Something like ZFS Channel Programs for BTRFS & probably XFS or even VFS?
On Tue, Oct 03, 2017 at 03:19:18PM +0200, Martin Steigerwald wrote: > [repost. I didn´t notice autocompletion gave me wrong address for fsdevel, > blacklisted now] > > Hello. > > What do you think of > > http://open-zfs.org/wiki/Projects/ZFS_Channel_Programs Domain not found. -Dave.
Re: [PATCH] fstests: btrfs/150 regression test for reading compressed data
On Wed, Sep 20, 2017 at 05:52:43PM -0600, Liu Bo wrote: > We had a bug in btrfs compression code which could end up with a > kernel panic. > > This is adding a regression test for the bug and I've also sent a > kernel patch to fix the bug. > > The patch is "Btrfs: fix kernel oops while reading compressed data". > > Signed-off-by: Liu Bo <bo.li@oracle.com> > --- > tests/btrfs/150 | 102 > > tests/btrfs/150.out | 3 ++ > tests/btrfs/group | 1 + > 3 files changed, 106 insertions(+) > create mode 100755 tests/btrfs/150 > create mode 100644 tests/btrfs/150.out > > diff --git a/tests/btrfs/150 b/tests/btrfs/150 > new file mode 100755 > index 000..834be51 > --- /dev/null > +++ b/tests/btrfs/150 > @@ -0,0 +1,102 @@ > +#! /bin/bash > +# FS QA Test btrfs/150 > +# > +# This is a regression test which ends up with a kernel oops in btrfs. group += dangerous > +# It occurs when btrfs's read repair happens while reading a compressed > +# extent. > +# The patch for this is > +# x Incomplete? > +# > +#--- > +# Copyright (c) 2017 Liu Bo. All Rights Reserved. You're signing off this patch as an Oracle employee, but claiming personal copyright. Please clarify who owns the copyright - if it's your personal copyright then please sign off with a personal email address, not your employer's... Also, I note that these recently added tests from you: tests/btrfs/140:# Copyright (c) 2017 Liu Bo. All Rights Reserved. tests/btrfs/141:# Copyright (c) 2017 Liu Bo. All Rights Reserved. tests/btrfs/142:# Copyright (c) 2017 Liu Bo. All Rights Reserved. tests/btrfs/143:# Copyright (c) 2017 Liu Bo. All Rights Reserved. tests/generic/406:# Copyright (c) 2017 Liu Bo. All Rights Reserved. all have this same ambiguity - personal copyright with employer signoff in the commit.
This definitely needs clarification and fixing if it is wrong. > +disable_io_failure() > +{ > +echo 0 > $SYSFS_BDEV/make-it-fail > +echo 0 > $DEBUGFS_MNT/fail_make_request/probability > +echo 0 > $DEBUGFS_MNT/fail_make_request/times > +} > + > +_scratch_pool_mkfs "-d raid1 -b 1G" >> $seqres.full 2>&1 > + > +# It doesn't matter which compression algorithm we use. > +_scratch_mount -ocompress > + > +# Create a file with all data being compressed > +$XFS_IO_PROG -f -c "pwrite -W 0 8K" $SCRATCH_MNT/foobar | _filter_xfs_io needs an fsync to reach disk. > +# Raid1 consists of two copies and btrfs decides which copy to read by > reader's > +# %pid. Now we inject errors to copy #1 and copy #0 is good. We want to > read > +# the bad copy to trigger read-repair. > +while true; do > + disable_io_failure > + # invalidate the page cache > + $XFS_IO_PROG -f -c "fadvise -d 0 128K" $SCRATCH_MNT/foobar | > _filter_xfs_io > + > + enable_io_failure > + od -x $SCRATCH_MNT/foobar > /dev/null & why are you using od to read the data when the output is piped to /dev/null? why not just xfs_io -c "pread 0 8k" ? Cheers, Dave.
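For reference, the enable_io_failure() helper that the quoted test calls (but the excerpt does not show) presumably mirrors disable_io_failure() using the standard fail_make_request fault-injection knobs. This is a hedged sketch, not the patch's actual code: the probability/times values are the usual "fail everything, forever" settings, and the paths are overridable so the function can be exercised without real sysfs/debugfs.

```shell
#!/bin/bash
# Sketch of the enable_io_failure() counterpart to the quoted
# disable_io_failure().  SYSFS_BDEV/DEBUGFS_MNT default to the usual
# locations but can be pointed at a scratch directory for testing.
SYSFS_BDEV=${SYSFS_BDEV:-/sys/block/vdb}
DEBUGFS_MNT=${DEBUGFS_MNT:-/sys/kernel/debug}

enable_io_failure()
{
    echo 1 > $SYSFS_BDEV/make-it-fail
    echo 100 > $DEBUGFS_MNT/fail_make_request/probability  # fail 100% of bios
    echo -1 > $DEBUGFS_MNT/fail_make_request/times         # no limit on failures
}
```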
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Mon, Apr 03, 2017 at 04:00:55PM +0200, Jan Kara wrote: > On Sun 02-04-17 09:05:26, Dave Chinner wrote: > > On Thu, Mar 30, 2017 at 12:12:31PM -0400, J. Bruce Fields wrote: > > > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote: > > > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote: > > > > > Because if above is acceptable we could make reported i_version to be > > > > > a sum > > > > > of "superblock crash counter" and "inode i_version". We increment > > > > > "superblock crash counter" whenever we detect unclean filesystem > > > > > shutdown. > > > > > That way after a crash we are guaranteed each inode will report new > > > > > i_version (the sum would probably have to look like "superblock crash > > > > > counter" * 65536 + "inode i_version" so that we avoid reusing possible > > > > > i_version numbers we gave away but did not write to disk but > > > > > still...). > > > > > Thoughts? > > > > > > How hard is this for filesystems to support? Do they need an on-disk > > > format change to keep track of the crash counter? > > > > Yes. We'll need version counter in the superblock, and we'll need to > > know what the increment semantics are. > > > > The big question is how do we know there was a crash? The only thing > > a journalling filesystem knows at mount time is whether it is clean > > or requires recovery. Filesystems can require recovery for many > > reasons that don't involve a crash (e.g. root fs is never unmounted > > cleanly, so always requires recovery). Further, some filesystems may > > not even know there was a crash at mount time because their > > architecture always leaves a consistent filesystem on disk (e.g. COW > > filesystems) > > What filesystems can or cannot easily do obviously differs. Ext4 has a > recovery flag set in superblock on RW mount/remount and cleared on > umount/RO remount. Even this doesn't help. 
A recent bug that was reported to the XFS list turns out to be that systemd can't remount-ro the root filesystem successfully on shutdown because there are open write fds on the root filesystem when it attempts the remount. So it just reboots without a remount-ro. This uncovered a bug in grub in that it (still!) thinks sync(1) is sufficient to get all the metadata that points to a kernel image onto disk in places it can read. XFS, like ext4, leaves it in the journal, and so the system then fails to boot because systemd didn't remount-ro the root fs, hence the journal was never flushed before reboot, grub can't find the kernel, and so everything fails. > This flag being set on mount would imply incrementing the crash > counter. It should be pretty easy for each filesystem to implement > such flag and the counter but I agree it requires an on-disk > format change. Yup, anything we want that is persistent and consistent across filesystems will need on-disk format changes. Hence we need a solid specification first, not to mention tests to validate correct behaviour across all filesystems in xfstests... Cheers, Dave.
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Thu, Mar 30, 2017 at 12:12:31PM -0400, J. Bruce Fields wrote: > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote: > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote: > > > Because if above is acceptable we could make reported i_version to be a > > > sum > > > of "superblock crash counter" and "inode i_version". We increment > > > "superblock crash counter" whenever we detect unclean filesystem shutdown. > > > That way after a crash we are guaranteed each inode will report new > > > i_version (the sum would probably have to look like "superblock crash > > > counter" * 65536 + "inode i_version" so that we avoid reusing possible > > > i_version numbers we gave away but did not write to disk but still...). > > > Thoughts? > > How hard is this for filesystems to support? Do they need an on-disk > format change to keep track of the crash counter? Yes. We'll need a version counter in the superblock, and we'll need to know what the increment semantics are. The big question is how do we know there was a crash? The only thing a journalling filesystem knows at mount time is whether it is clean or requires recovery. Filesystems can require recovery for many reasons that don't involve a crash (e.g. root fs is never unmounted cleanly, so always requires recovery). Further, some filesystems may not even know there was a crash at mount time because their architecture always leaves a consistent filesystem on disk (e.g. COW filesystems). > I wonder if repeated crashes can lead to any odd corner cases. Without defined, locked-down behaviour of the superblock counter, corner cases almost certainly will exist... Cheers, Dave.
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Wed, Mar 29, 2017 at 01:54:31PM -0400, Jeff Layton wrote: > On Wed, 2017-03-29 at 13:15 +0200, Jan Kara wrote: > > On Tue 21-03-17 14:46:53, Jeff Layton wrote: > > > On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote: > > > > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote: > > > > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote: > > > > > > - It's durable; the above comparison still works if there were > > > > > > reboots > > > > > > between the two i_version checks. > > > > > > - I don't know how realistic this is--we may need to figure out > > > > > > if there's a weaker guarantee that's still useful. Do > > > > > > filesystems actually make ctime/mtime/i_version changes > > > > > > atomically with the changes that caused them? What if a > > > > > > change attribute is exposed to an NFS client but doesn't make > > > > > > it to disk, and then that value is reused after reboot? > > > > > > > > > > > > > > > > Yeah, there could be atomicity there. If we bump i_version, we'll mark > > > > > the inode dirty and I think that will end up with the new i_version at > > > > > least being journalled before __mark_inode_dirty returns. > > > > > > > > So you think the filesystem can provide the atomicity? In more detail: > > > > > > > > > > Sorry, I hit send too quickly. That should have read: > > > > > > "Yeah, there could be atomicity issues there." > > > > > > I think providing that level of atomicity may be difficult, though > > > maybe there's some way to make the querying of i_version block until > > > the inode update has been journalled? > > > > Just to complement what Dave said from ext4 side - similarly as with XFS > > ext4 doesn't guarantee atomicity unless fsync() has completed on the file. > > Until that you can see arbitrary combination of data & i_version after the > > crash. 
> We do take care to keep data and metadata in sync only when there > are security implications to that (like exposing uninitialized disk blocks) > and if not, we are as lazy as we can to improve performance... > > > > > > Yeah, I think what we'll have to do here is ensure that those > filesystems do an fsync prior to reporting the i_version getattr > codepath. It's not pretty, but I don't see a real alternative. I think that's even more problematic. ->getattr currently runs completely unlocked for performance reasons - it's racy w.r.t. ongoing modifications to begin with, so /nothing/ that is returned to userspace via stat/statx can be guaranteed to be "coherent". Linus will be very unhappy if you make his git workload (which is /very/ stat heavy) run slower by adding any sort of locking in this hot path. Even if we did put an fsync() into ->getattr() (and dealt with all the locking issues that entails), by the time the statx syscall returns to userspace the i_version value may not match the data/metadata in the inode(*). IOWs, by the time i_version gets to userspace, it is out of date and any use of it for data versioning from userspace is going to be prone to race conditions. Cheers, Dave. (*) fiemap has exactly the same "stale the moment internal fs locks are released" race conditions, which is why it cannot safely be used for mapping holes when copying file data.
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote: > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote: > > - It's durable; the above comparison still works if there were reboots > > between the two i_version checks. > > - I don't know how realistic this is--we may need to figure out > > if there's a weaker guarantee that's still useful. Do > > filesystems actually make ctime/mtime/i_version changes > > atomically with the changes that caused them? What if a > > change attribute is exposed to an NFS client but doesn't make > > it to disk, and then that value is reused after reboot? > > > > Yeah, there could be atomicity there. If we bump i_version, we'll mark > the inode dirty and I think that will end up with the new i_version at > least being journalled before __mark_inode_dirty returns. The change may be journalled, but it isn't guaranteed stable until fsync is run on the inode. NFS server operations commit the metadata changed by a modification through ->commit_metadata or sync_inode_metadata() before the response is sent back to the client, hence guaranteeing that i_version changes through the NFS server are stable and durable. This is not the case for normal operations done through the POSIX API - the journalling is asynchronous and the only durability guarantees are provided by fsync(). > That said, I suppose it is possible for us to bump the counter, hand > that new counter value out to a NFS client and then the box crashes > before it makes it to the journal. Yup, this has always been a problem when you mix posix applications running on the NFS server modifying the same files as the NFS clients are accessing and requiring synchronisation. > Not sure how big a problem that really is. This coherency problem has always existed on the server side... Cheers, Dave. 
-- Dave Chinner da...@fromorbit.com
Re: [PATCH 5/8] nowait aio: return on congested block device
On Wed, Mar 15, 2017 at 04:51:04PM -0500, Goldwyn Rodrigues wrote: > From: Goldwyn Rodrigues <rgold...@suse.com> > > A new flag BIO_NOWAIT is introduced to identify bio's > originating from iocb with IOCB_NOWAIT. This flag indicates > to return immediately if a request cannot be made instead > of retrying. So this makes a congested block device run the bio IO completion callback with an -EAGAIN error present? Are all the filesystem direct IO submission and completion routines OK with that? i.e. does such a congestion case cause filesystems to temporarily expose stale data to unprivileged users when the IO is requeued in this way? e.g. filesystem does allocation without blocking, submits bio, device is congested, runs IO completion with error, so nothing written to allocated blocks, write gets queued, some other read comes in while the write is queued, reads data from uninitialised blocks that were allocated during the write. Seems kinda problematic to me to have an undocumented design constraint (i.e. a landmine) where we submit the AIO only to have it error out and then expect the filesystem to do something special and different /without blocking/ on EAGAIN. Why isn't the congestion check at a higher layer like we do for page cache readahead? i.e. using the bdi*congested() API at the time we are doing all the other filesystem blocking checks. -Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
On Mon, Feb 06, 2017 at 07:47:43PM +0100, Michal Hocko wrote: > On Mon 06-02-17 10:32:37, Darrick J. Wong wrote: > > On Mon, Feb 06, 2017 at 06:44:15PM +0100, Michal Hocko wrote: > > > On Mon 06-02-17 07:39:23, Matthew Wilcox wrote: > > > > On Mon, Feb 06, 2017 at 03:07:16PM +0100, Michal Hocko wrote: > > > > > +++ b/fs/xfs/xfs_buf.c > > > > > @@ -442,17 +442,17 @@ _xfs_buf_map_pages( > > > > > bp->b_addr = NULL; > > > > > } else { > > > > > int retried = 0; > > > > > - unsigned noio_flag; > > > > > + unsigned nofs_flag; > > > > > > > > > > /* > > > > >* vm_map_ram() will allocate auxillary structures (e.g. > > > > >* pagetables) with GFP_KERNEL, yet we are likely to be > > > > > under > > > > >* GFP_NOFS context here. Hence we need to tell memory > > > > > reclaim > > > > > - * that we are in such a context via PF_MEMALLOC_NOIO > > > > > to prevent > > > > > + * that we are in such a context via PF_MEMALLOC_NOFS > > > > > to prevent > > > > >* memory reclaim re-entering the filesystem here and > > > > >* potentially deadlocking. > > > > >*/ > > > > > > > > This comment feels out of date ... how about: > > > > > > which part is out of date? > > > > > > > > > > > /* > > > > * vm_map_ram will allocate auxiliary structures (eg > > > > page > > > > * tables) with GFP_KERNEL. If that tries to reclaim > > > > memory > > > > * by calling back into this filesystem, we may > > > > deadlock. > > > > * Prevent that by setting the NOFS flag. > > > > */ > > > > > > dunno, the previous wording seems clear enough to me. Maybe little bit > > > more chatty than yours but I am not sure this is worth changing. > > > > I prefer to keep the "...yet we are likely to be under GFP_NOFS..." > > wording of the old comment because it captures the uncertainty of > > whether or not we actually are already under NOFS. If someone actually > > has audited this code well enough to know for sure then yes let's change > > the comment, but I haven't gone that far. 
> > I believe we can drop the memalloc_nofs_save then as well because either > we are called from a potentially dangerous context and thus we are in > the nofs scope or we do not need the protection at all. No, absolutely not. "Belief" is not a sufficient justification for removing low level deadlock avoidance infrastructure. This code needs to remain in _xfs_buf_map_pages() until a full audit of the caller paths is done and we're 100% certain that there are no lurking deadlocks. For example, I'm pretty sure we can call into _xfs_buf_map_pages() outside of a transaction context but with an inode ILOCK held exclusively. If we then recurse into memory reclaim and try to run a transaction during reclaim, we have an inverted ILOCK vs transaction locking order. i.e. we are not allowed to call xfs_trans_reserve() with an ILOCK held as that can deadlock the log: log full, locked inode pins tail of log, inode cannot be flushed because ILOCK is held by caller waiting for log space to become available. i.e. there are certain situations where holding an ILOCK is a deadlock vector. See xfs_lock_inodes() for an example of the lengths we go to avoid ILOCK based log deadlocks like this... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH 2/9] xfs: introduce and use KM_NOLOCKDEP to silence reclaim lockdep false positives
On Mon, Dec 19, 2016 at 02:06:19PM -0800, Darrick J. Wong wrote: > On Tue, Dec 20, 2016 at 08:24:13AM +1100, Dave Chinner wrote: > > On Thu, Dec 15, 2016 at 03:07:08PM +0100, Michal Hocko wrote: > > > From: Michal Hocko <mho...@suse.com> > > > > > > Now that the page allocator offers __GFP_NOLOCKDEP let's introduce > > > KM_NOLOCKDEP alias for the xfs allocation APIs. While we are at it > > > also change KM_NOFS users introduced by b17cb364dbbb ("xfs: fix missing > > > KM_NOFS tags to keep lockdep happy") and use the new flag for them > > > instead. There is really no reason to make these allocations contexts > > > weaker just because of the lockdep which even might not be enabled > > > in most cases. > > > > > > Signed-off-by: Michal Hocko <mho...@suse.com> > > > > I'd suggest that it might be better to drop this patch for now - > > it's not necessary for the context flag changeover but does > > introduce a risk of regressions if the conversion is wrong. > > I was just about to write in that while I didn't see anything obviously > wrong with the NOFS removals, I also don't know for sure that we can't > end up recursively in those code paths (specifically the directory > traversal thing). The issue is with code paths that can be called from both inside and outside transaction context - lockdep complains when it sees an allocation path that is used with both GFP_NOFS and GFP_KERNEL context, as it doesn't know that the GFP_KERNEL usage is safe or not. So things like the directory buffer path, which can be called from readdir without a transaction context, have various KM_NOFS flags scattered through it so that lockdep doesn't get all upset every time readdir is called... There are other cases like this - btree manipulation via bunmapi() can be called without transaction context to remove delayed alloc extents, and that puts all of the btree cursor and incore extent list handling in the same boat (all those allocations are KM_NOFS), etc. 
So it's not really recursion that is the problem here - it's different allocation contexts that lockdep can't know about unless it's told about them. We've done that with KM_NOFS in the past; in future we should use this KM_NOLOCKDEP flag, though I'd prefer a better name for it. e.g. KM_NOTRANS to indicate that the allocation can occur both inside and outside of transaction context. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH 2/9] xfs: introduce and use KM_NOLOCKDEP to silence reclaim lockdep false positives
On Thu, Dec 15, 2016 at 03:07:08PM +0100, Michal Hocko wrote: > From: Michal Hocko <mho...@suse.com> > > Now that the page allocator offers __GFP_NOLOCKDEP let's introduce > KM_NOLOCKDEP alias for the xfs allocation APIs. While we are at it > also change KM_NOFS users introduced by b17cb364dbbb ("xfs: fix missing > KM_NOFS tags to keep lockdep happy") and use the new flag for them > instead. There is really no reason to make these allocation contexts > weaker just because of the lockdep which even might not be enabled > in most cases. > > Signed-off-by: Michal Hocko <mho...@suse.com> I'd suggest that it might be better to drop this patch for now - it's not necessary for the context flag changeover but does introduce a risk of regressions if the conversion is wrong. Hence I think this is better as a completely separate series which audits and changes all the unnecessary KM_NOFS allocations in one go. I've never liked whack-a-mole style changes like this - do it once, do it properly. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH 1/6] Btrfs: add mount option for dax
On Fri, Dec 09, 2016 at 10:41:34AM -0800, Liu Bo wrote: > On Fri, Dec 09, 2016 at 03:47:20PM +1100, Dave Chinner wrote: > > On Wed, Dec 07, 2016 at 01:45:05PM -0800, Liu Bo wrote: > > > Signed-off-by: Liu Bo <bo.li@oracle.com> > > > --- > > > fs/btrfs/ctree.h | 1 + > > > fs/btrfs/super.c | 40 +++- > > > 2 files changed, 40 insertions(+), 1 deletion(-) > > > > > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > > > index 0b8ce2b..e54c6e6 100644 > > > --- a/fs/btrfs/ctree.h > > > +++ b/fs/btrfs/ctree.h > > > @@ -1317,6 +1317,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct > > > btrfs_root *root) > > > #define BTRFS_MOUNT_FRAGMENT_METADATA(1 << 25) > > > #define BTRFS_MOUNT_FREE_SPACE_TREE (1 << 26) > > > #define BTRFS_MOUNT_NOLOGREPLAY (1 << 27) > > > +#define BTRFS_MOUNT_DAX (1 << 28) > > > > > > #define BTRFS_DEFAULT_COMMIT_INTERVAL(30) > > > #define BTRFS_DEFAULT_MAX_INLINE (2048) > > > diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c > > > index 74ed5aa..9b18f3d 100644 > > > --- a/fs/btrfs/super.c > > > +++ b/fs/btrfs/super.c > > > @@ -323,7 +323,7 @@ enum { > > > Opt_commit_interval, Opt_barrier, Opt_nodefrag, Opt_nodiscard, > > > Opt_noenospc_debug, Opt_noflushoncommit, Opt_acl, Opt_datacow, > > > Opt_datasum, Opt_treelog, Opt_noinode_cache, Opt_usebackuproot, > > > - Opt_nologreplay, Opt_norecovery, > > > + Opt_nologreplay, Opt_norecovery, Opt_dax, > > > > Can we please not create more filesystems with a DAX mount option? > > This was only even an enabler, and not meant to be a permanent > > thing. The permanent functionality for DAX is supposed to be > > per-inode inheritable DAX flags - not mount options - so that > > applications can choose on a per file basis to enable/disable DAX > > access as they see fit. > > > > This also enables the filesystem to reject the attempt to turn on > > DAX if the set of contexts for the file are not DAX compatible > > Sounds good, I'll try to update it to use inode DAX flag directly. 
xfs_io already has chattr/lsattr support (+/-x) for the FS_XFLAG_DAX flag in FS_IOC_FS{GS}ETXATTR, and you can have a look at the XFS code in xfs_ioctl.c for the operations that are needed to dynamically change the S_DAX flag on an inode (e.g. xfs_ioctl_setattr_dax_invalidate()). Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH 4/6] Btrfs: add DAX support for nocow btrfs
On Wed, Dec 07, 2016 at 01:45:08PM -0800, Liu Bo wrote: > Since I haven't figured out how to map multiple devices to userspace without > pagecache, this DAX support is only for single-device, and I don't think > DAX(Direct Access) can work with cow, this is limited to nocow case. I made > this by setting nodatacow in dax mount option. DAX can be made to work with COW quite easily - it's already been done, in fact. Go look up Nova for how it works with DAX: https://github.com/Andiry/nova Essentially, it has a set of "temporary pages" it links to the inode where writes are done directly, and when a synchronisation event occurs it pulls them from the per-inode list, does whatever transformations are needed (e.g. CRC calculation, mirroring, etc) and marks them as current in the inode extent list. When a new overwrite comes along, it allocates a new block in the temporary page list, copies the existing data into it, and then uses that block for DAX until the next synchronisation event occurs. For XFS, CoW for DAX through read/write isn't really any different to the direct IO path we currently already have. And for page write faults on shared extents, instead of zeroing the newly allocated block we simply copy the original data into the new block before the allocation returns. It does mean, however, that XFS does not have the capability for data transformations in the IO path. This limits us to atomic write devices (software raid 0 or hardware redundancy such as DIMM mirroring), but we can still do out-of-band online data transformations and movement (e.g. dedupe, defrag) with DAX. Yes, I know these methods are very different to how btrfs uses COW. However, my point is that DAX and CoW and/or multiple devices are not incompatible if the architecture is correctly structured. i.e. DAX should be able to work even with most of btrfs's special magic still enabled. Cheers, Dave. 
-- Dave Chinner da...@fromorbit.com
Re: [PATCH 1/6] Btrfs: add mount option for dax
On Wed, Dec 07, 2016 at 01:45:05PM -0800, Liu Bo wrote: > Signed-off-by: Liu Bo <bo.li@oracle.com> > --- > fs/btrfs/ctree.h | 1 + > fs/btrfs/super.c | 40 +++- > 2 files changed, 40 insertions(+), 1 deletion(-) > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > index 0b8ce2b..e54c6e6 100644 > --- a/fs/btrfs/ctree.h > +++ b/fs/btrfs/ctree.h > @@ -1317,6 +1317,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct > btrfs_root *root) > #define BTRFS_MOUNT_FRAGMENT_METADATA(1 << 25) > #define BTRFS_MOUNT_FREE_SPACE_TREE (1 << 26) > #define BTRFS_MOUNT_NOLOGREPLAY (1 << 27) > +#define BTRFS_MOUNT_DAX (1 << 28) > > #define BTRFS_DEFAULT_COMMIT_INTERVAL(30) > #define BTRFS_DEFAULT_MAX_INLINE (2048) > diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c > index 74ed5aa..9b18f3d 100644 > --- a/fs/btrfs/super.c > +++ b/fs/btrfs/super.c > @@ -323,7 +323,7 @@ enum { > Opt_commit_interval, Opt_barrier, Opt_nodefrag, Opt_nodiscard, > Opt_noenospc_debug, Opt_noflushoncommit, Opt_acl, Opt_datacow, > Opt_datasum, Opt_treelog, Opt_noinode_cache, Opt_usebackuproot, > - Opt_nologreplay, Opt_norecovery, > + Opt_nologreplay, Opt_norecovery, Opt_dax, Can we please not create more filesystems with a DAX mount option? This was only ever an enabler, and not meant to be a permanent thing. The permanent functionality for DAX is supposed to be per-inode inheritable DAX flags - not mount options - so that applications can choose on a per file basis to enable/disable DAX access as they see fit. This also enables the filesystem to reject the attempt to turn on DAX if the set of contexts for the file are not DAX compatible. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH 01/10] fstests: common: Introduce function to check qgroup correctness
On Wed, Nov 30, 2016 at 08:56:03AM +0800, Qu Wenruo wrote: > > > At 11/30/2016 05:01 AM, Dave Chinner wrote: > >On Tue, Nov 29, 2016 at 03:32:54PM +0800, Qu Wenruo wrote: > >>Old btrfs qgroup test cases uses fix golden output numbers, which limits > >>the coverage since they can't handle mount options like compress or > >>inode_map, and cause false alert. > >> > >>Introduce _btrfs_check_scratch_qgroup() function to check qgroup > >>correctness using "btrfs check --qgroup-report" function, which will > >>follow the way kernel handle qgroup and are proved very reliable. > >> > >>Signed-off-by: Qu Wenruo <quwen...@cn.fujitsu.com> > >>--- > >> common/rc | 19 +++ > >> 1 file changed, 19 insertions(+) > >> > >>diff --git a/common/rc b/common/rc > >>index 8c99306..35d2d56 100644 > >>--- a/common/rc > >>+++ b/common/rc > >>@@ -3018,6 +3018,25 @@ _require_deletable_scratch_dev_pool() > >>done > >> } > >> > >>+# We check if "btrfs check" support to check qgroup correctness > >>+# Old fixed golden output can cover case like compress and inode_map > >>+# mount options, which limits the coverage > >>+_require_btrfs_check_qgroup() > >>+{ > >>+ _require_command "$BTRFS_UTIL_PROG" btrfs > >>+ output=$($BTRFS_UTIL_PROG check --help | grep "qgroup-report") > >>+ if [ -z "$output" ]; then > >>+ _notrun "$BTRFS_UTIL_PROG too old (must support 'check > >>--qgroup-report')" > >>+ fi > >>+} > > > >Why wouldn't this just set a global variable that you then > >check in _check_scratch_fs and run the _btrfs_check_scratch_qgroup() > >call then? > > The problem is, "btrfs check --qgroup-report" will do force report, > even for case like qgroup rescan still running. > > Some test, like btrfs/114 which tests rescan, false report will > cause problem. So for those specific tests, you aren't going to be running "btrfs check --qgroup-report", right? In which case, those tests should not call _require_btrfs_check_qgroup(), and then _check_scratch_fs() will not run the quota check. i.e. 
there will be no difference to the current behaviour. > So here I choose the manually checking other than always do it at > _check_scratch_fs(). I don't see what the problem you are avoiding is. Either it is safe to run the quota check or it isn't, and triggering it to run in _check_scratch_fs() via a _requires rule makes no difference to that. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH 01/10] fstests: common: Introduce function to check qgroup correctness
On Tue, Nov 29, 2016 at 03:32:54PM +0800, Qu Wenruo wrote: > Old btrfs qgroup test cases uses fix golden output numbers, which limits > the coverage since they can't handle mount options like compress or > inode_map, and cause false alert. > > Introduce _btrfs_check_scratch_qgroup() function to check qgroup > correctness using "btrfs check --qgroup-report" function, which will > follow the way kernel handle qgroup and are proved very reliable. > > Signed-off-by: Qu Wenruo <quwen...@cn.fujitsu.com> > --- > common/rc | 19 +++ > 1 file changed, 19 insertions(+) > > diff --git a/common/rc b/common/rc > index 8c99306..35d2d56 100644 > --- a/common/rc > +++ b/common/rc > @@ -3018,6 +3018,25 @@ _require_deletable_scratch_dev_pool() > done > } > > +# We check if "btrfs check" support to check qgroup correctness > +# Old fixed golden output can cover case like compress and inode_map > +# mount options, which limits the coverage > +_require_btrfs_check_qgroup() > +{ > + _require_command "$BTRFS_UTIL_PROG" btrfs > + output=$($BTRFS_UTIL_PROG check --help | grep "qgroup-report") > + if [ -z "$output" ]; then > + _notrun "$BTRFS_UTIL_PROG too old (must support 'check > --qgroup-report')" > + fi > +} Why wouldn't this just set a global variable that you then check in _check_scratch_fs and run the _btrfs_check_scratch_qgroup() call then? What about all the tests that currently run without this functionality being present? They will now notrun rather than use the golden output match - this seems like a regression to me, especially for distro QE testing older kernel/progs combinations... > + > +_btrfs_check_scratch_qgroup() > +{ > + _require_btrfs_check_qgroup This needs to go in the test itself before the test is run, not get hidden in a function call at the end of the test. Cheers, Dave. 
-- Dave Chinner da...@fromorbit.com
Re: [RFC PATCH 0/2] Btrfs: make a source length of 0 imply EOF for dedupe
On Wed, Nov 23, 2016 at 06:14:47PM -0500, Zygo Blaxell wrote: > On Thu, Nov 24, 2016 at 09:13:28AM +1100, Dave Chinner wrote: > > On Wed, Nov 23, 2016 at 08:55:59AM -0500, Zygo Blaxell wrote: > > > On Wed, Nov 23, 2016 at 03:26:32PM +1100, Dave Chinner wrote: > > > > On Tue, Nov 22, 2016 at 09:02:10PM -0500, Zygo Blaxell wrote: > > > > > On Thu, Nov 17, 2016 at 04:07:48PM -0800, Omar Sandoval wrote: > > > > > > 3. Both XFS and Btrfs cap each dedupe operation to 16MB, but the > > > > > >implicit EOF gets around this in the existing XFS > > > > > > implementation. I > > > > > >copied this for the Btrfs implementation. > > > > > > > > > > Somewhat tangential to this patch, but on the dedup topic: Can we > > > > > raise > > > > > or drop that 16MB limit? > > > > > > > > > > The maximum btrfs extent length is 128MB. Currently the btrfs dedup > > > > > behavior for a 128MB extent is to generate 8x16MB shared extent > > > > > references > > > > > with different extent offsets to a single 128MB physical extent. > > > > > These references no longer look like the original 128MB extent to a > > > > > userspace dedup tool. That raises the difficulty level substantially > > > > > for a userspace dedup tool when it tries to figure out which extents > > > > > to > > > > > keep and which to discard or rewrite. > > > > > > > > That, IMO, is a btrfs design/implementation problem, not a problem > > > > with the API. Applications are always going to end up doing things > > > > that aren't perfectly aligned to extent boundaries or sizes > > > > regardless of the size limit that is placed on the dedupe ranges. > > > > > > Given that XFS doesn't have all the problems btrfs does, why does XFS > > > have the same aribitrary size limit? Especially since XFS demonstrably > > > doesn't need it? 
> > Creating a new-but-slightly-incompatible API just for XFS makes no > > sense - we have multiple filesystems that support this functionality > > and so they all should use the same APIs and present (as far as is > > possible) the same behaviour to userspace. > > OK. Let's just remove the limit on all the filesystems then. > XFS doesn't need it, and btrfs can be fixed. Yet applications still have to support kernel versions where btrfs has a limit. IOWs, we can remove the limit for future improvement, but that doesn't mean userspace is free from having to know about the existing limit constraints. That is, once a behaviour has been exposed to userspace through an API, we can't just change it and act like it was always that way - apps still have to support kernels that expose the old behaviour. i.e. the old behaviour is there forever, and this is why designing userspace APIs is /hard/. It's also why it's better to use an existing, slightly less than ideal API than invent a new one that will simply have different problems exposed in future... > > IOWs it's more important to use existing APIs than to invent a new > > one that does almost the same thing. This way userspace applications > > don't need to be changed to support new XFS functionality and we > > make life easier for everyone. > > Except removing the limit doesn't work that way. An application that > didn't impose an undocumented limit on itself wouldn't break when moved > to a filesystem that imposed no such limit, i.e. if XFS had no limit, > an application that moved from btrfs to XFS would just work. It goes /both ways/ though. Write an app on XFS that does not care about limits and it won't work on btrfs because it gets unexpected errors. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [RFC PATCH 0/2] Btrfs: make a source length of 0 imply EOF for dedupe
On Wed, Nov 23, 2016 at 08:55:59AM -0500, Zygo Blaxell wrote: > On Wed, Nov 23, 2016 at 03:26:32PM +1100, Dave Chinner wrote: > > On Tue, Nov 22, 2016 at 09:02:10PM -0500, Zygo Blaxell wrote: > > > On Thu, Nov 17, 2016 at 04:07:48PM -0800, Omar Sandoval wrote: > > > > 3. Both XFS and Btrfs cap each dedupe operation to 16MB, but the > > > >implicit EOF gets around this in the existing XFS implementation. I > > > >copied this for the Btrfs implementation. > > > > > > Somewhat tangential to this patch, but on the dedup topic: Can we raise > > > or drop that 16MB limit? > > > > > > The maximum btrfs extent length is 128MB. Currently the btrfs dedup > > > behavior for a 128MB extent is to generate 8x16MB shared extent references > > > with different extent offsets to a single 128MB physical extent. > > > These references no longer look like the original 128MB extent to a > > > userspace dedup tool. That raises the difficulty level substantially > > > for a userspace dedup tool when it tries to figure out which extents to > > > keep and which to discard or rewrite. > > > > That, IMO, is a btrfs design/implementation problem, not a problem > > with the API. Applications are always going to end up doing things > > that aren't perfectly aligned to extent boundaries or sizes > > regardless of the size limit that is placed on the dedupe ranges. > > Given that XFS doesn't have all the problems btrfs does, why does XFS > have the same arbitrary size limit? Especially since XFS demonstrably > doesn't need it? Creating a new-but-slightly-incompatible API just for XFS makes no sense - we have multiple filesystems that support this functionality and so they all should use the same APIs and present (as far as is possible) the same behaviour to userspace. IOWs it's more important to use existing APIs than to invent a new one that does almost the same thing. This way userspace applications don't need to be changed to support new XFS functionality and we make life easier for everyone. 
A shiny new API without warts would be nice, but we've already got to support the existing one forever, it does the job we need and so it's less burden on everyone if we just use it as is. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [RFC PATCH 0/2] Btrfs: make a source length of 0 imply EOF for dedupe
On Tue, Nov 22, 2016 at 09:02:10PM -0500, Zygo Blaxell wrote: > On Thu, Nov 17, 2016 at 04:07:48PM -0800, Omar Sandoval wrote: > > 3. Both XFS and Btrfs cap each dedupe operation to 16MB, but the > >implicit EOF gets around this in the existing XFS implementation. I > >copied this for the Btrfs implementation. > > Somewhat tangential to this patch, but on the dedup topic: Can we raise > or drop that 16MB limit? > > The maximum btrfs extent length is 128MB. Currently the btrfs dedup > behavior for a 128MB extent is to generate 8x16MB shared extent references > with different extent offsets to a single 128MB physical extent. > These references no longer look like the original 128MB extent to a > userspace dedup tool. That raises the difficulty level substantially > for a userspace dedup tool when it tries to figure out which extents to > keep and which to discard or rewrite. That, IMO, is a btrfs design/implementation problem, not a problem with the API. Applications are always going to end up doing things that aren't perfectly aligned to extent boundaries or sizes regardless of the size limit that is placed on the dedupe ranges. > XFS may not have this problem--I haven't checked. It doesn't - it tracks shared blocks exactly and merges adjacent extent records whenever possible. > Even if we want to keep the 16MB limit, there's also no way to query the > kernel from userspace to find out what the limit is, other than by trial > and error. It's not even in a header file, userspace just has to *know*. So add a define to the API to make it visible to applications and document it in the man page. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH] fstests: Block btrfs from test case generic/372
(Did you forget to cc fste...@vger.kernel.org?) On Tue, Nov 15, 2016 at 04:13:32PM +0800, Qu Wenruo wrote: > Since btrfs always return the whole extent even part of it is shared > with other files, so the hole/extent counts differs for "file1" in this > test case. > > For example: > > /-- File 1 Extent 0-\ > / \ > |<--Extent A-->| > \ / \ / > \ File 2/\ File 2/ >Ext 0~4KExt 64k~68K > > In that case, fiemap on File 1 will only return 1 large extent A with > SHARED flag. > While XFS will split it into 3 extents, first and last 4K with SHARED > flag while the rest without SHARED flag. fiemap should behave the same across all filesystems if at all possible. This test failure indicates btrfs doesn't report an accurate representation of shared extents which, IMO, is a btrfs issue that needs fixing, not a test problem. Regardless of this > This makes the test case meaningless as btrfs doesn't follow such > assumption. > So black list btrfs for this test case to avoid false alert. ... we are not going to add ad-hoc filesystem blacklists for random tests. Adding "blacklists" without any explanation of why something has been blacklisted is simply a bad practice. We use _require rules to specifically document what functionality is required for the test and check that it is provided. i.e. this: _require_explicit_shared_extents() { if [ $FSTYP == "btrfs" ]; then _notrun "btrfs can't report accurate shared extent ranges in fiemap" fi } documents /exactly/ why this test is not run on btrfs. And, quite frankly, while this is /better/ it still ignores the fact we have functions like _within_tolerance for allowing a range of result values to be considered valid rather than just a fixed value. IOWs, changing the check of the extent count of file 1 post reflink to use a _within_tolerance range would mean the test would validate file1 on all reflink supporting filesystems and we don't need to exclude btrfs at all... Cheers, Dave. 
-- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fstests: btrfs: Check false ENOSPC bug caused by incorrect metadata reserve
On Thu, Nov 10, 2016 at 11:24:34AM +0800, Eryu Guan wrote: > On Thu, Nov 10, 2016 at 10:42:36AM +0800, Qu Wenruo wrote: > > > > > > At 11/10/2016 10:19 AM, Dave Chinner wrote: > > > On Thu, Nov 10, 2016 at 08:34:20AM +0800, Qu Wenruo wrote: > > > > At 11/09/2016 05:43 PM, Eryu Guan wrote: > > > > > On Wed, Nov 09, 2016 at 08:24:38AM +0800, Qu Wenruo wrote: > > > > > > At 11/08/2016 06:58 PM, Eryu Guan wrote: > > > > > > > On Tue, Nov 08, 2016 at 05:15:15PM +0800, Qu Wenruo wrote: > > > > > We should use helpers and not open-code when helpers are available. So > > > > > we should really use _scratch_mkfs_sized here. > > > > > > > > > > And "-n 64k" is only to make bug easier to reproduce, but I don't > > > > > think > > > > > it's necessary. In my testings, 50%-75% runs hit the ENOSPC failure on > > > > > my x86_64 test vm, even I removed the "-n 64k" mkfs option from the > > > > > test. > > > > > > > > > > So I'd say remove "-n 64k" and test whatever mkfs options user > > > > > specified. > > > > > > > > I really don't like the idea to allow user to override any mount > > > > option to reduce the possibility. > > > > > > That's not the point. Overriding mount options reduces test coverage > > > because it limits the test to only the exact configuration that > > > reproduced the bug that was seen. If a user is testing a specific > > > configuration, then we really want to run the test on that config. > > > > > > > And further more, this testcase is not a generic test, but a > > > > regression/pinpoint test to expose one specific bug. > > > > > > If you want to make sure that the bug doesn't return, then you need > > > to run the /entire test suite/ with the configuration that exposes > > > the problem. You wouldn't be suggesting this specific set of mount > > > options if users weren't using that configuration. 
Hence you really > > > need to run the entire test suite with that configuration to make > > > sure you haven't broken those users' systems. > > > > > > And, yes, I test lots of different XFS configurations in their > > > entirety every day on every change I make or review, so I'm not > > > suggesting that you should do anything I don't already do. > > > > OK, most of your points make sense. > > I'll update the case. > > > > And I want to make it clear, does it mean newly submitted test cases > > shouldn't specify any mkfs/mount options? > > My understanding is that if a test needs an extremely rare configuration to > hit a bug, then it should specify mkfs/mount options, e.g. xfs/297 tests > a bug in XFS only in a very specific configuration: > > _scratch_mkfs_xfs -d agcount=16,su=256k,sw=12 -l su=256k,size=5120b > >/dev/null 2>&1

Yes, but there's a key reason we can do this for /XFS/. I point this out every time this comes up: we can do this with /XFS/ because the *implementation of _scratch_mkfs_xfs()* has a fallback strategy on failure. That is, _scratch_mkfs_xfs will first try to make the filesystem with $MKFS_OPTIONS + $TEST_SUPPLIED_OPTIONS. If that fails, it then runs mkfs with just $TEST_SUPPLIED_OPTIONS. IOWs, XFS always tries to use the global options, and only removes them when there is a problem combining them with $TEST_SUPPLIED_OPTIONS.

This is how custom test options should be applied for all filesystems, not just XFS. It gets rid of the need for any test to care about MKFS_OPTIONS. What we are also missing is that we need to apply this process to scratch mount options, as we don't do it for any filesystem. That will get rid of the need for tests to care about what is in MOUNT_OPTIONS, too...

Cheers, Dave.
-- Dave Chinner da...@fromorbit.com
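[Editor's sketch] The fallback strategy described above can be sketched like this. mkfs_cmd here is a stand-in for the real mkfs invocation (the actual helper lives in fstests' common/rc), and the --conflict option exists purely to simulate an option combination that mkfs rejects; everything about this mock is an assumption for illustration:

```shell
#!/bin/sh
# Stand-in for mkfs: rejects any invocation containing the marker
# option --conflict, otherwise reports what it was asked to do.
mkfs_cmd()
{
	for opt in "$@"; do
		[ "$opt" = "--conflict" ] && return 1
	done
	echo "mkfs ran with: $*"
	return 0
}

MKFS_OPTIONS="--conflict"	# admin's global options (bad combo here)
SCRATCH_DEV=/dev/null		# placeholder device

scratch_mkfs_with_fallback()
{
	# first attempt: global options + test-supplied options
	if mkfs_cmd $MKFS_OPTIONS "$@" "$SCRATCH_DEV"; then
		return 0
	fi
	# fallback: drop the globals, keep only what the test asked for
	mkfs_cmd "$@" "$SCRATCH_DEV"
}

scratch_mkfs_with_fallback -d agcount=16
```

The first attempt fails because of the conflicting global option, so the fallback runs with only the test's own options, which is the behaviour Dave describes for _scratch_mkfs_xfs.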
Re: [PATCH] fstests: btrfs: Check false ENOSPC bug caused by incorrect metadata reserve
On Thu, Nov 10, 2016 at 08:34:20AM +0800, Qu Wenruo wrote: > At 11/09/2016 05:43 PM, Eryu Guan wrote: > >On Wed, Nov 09, 2016 at 08:24:38AM +0800, Qu Wenruo wrote: > >>At 11/08/2016 06:58 PM, Eryu Guan wrote: > >>>On Tue, Nov 08, 2016 at 05:15:15PM +0800, Qu Wenruo wrote: > >We should use helpers and not open-code when helpers are available. So > >we should really use _scratch_mkfs_sized here. > > > >And "-n 64k" is only to make bug easier to reproduce, but I don't think > >it's necessary. In my testings, 50%-75% runs hit the ENOSPC failure on > >my x86_64 test vm, even I removed the "-n 64k" mkfs option from the > >test. > > > >So I'd say remove "-n 64k" and test whatever mkfs options user > >specified. > > I really don't like the idea to allow user to override any mount > option to reduce the possibility.

That's not the point. Overriding mount options reduces test coverage because it limits the test to only the exact configuration that reproduced the bug that was seen. If a user is testing a specific configuration, then we really want to run the test on that config.

> And further more, this testcase is not a generic test, but a > regression/pinpoint test to expose one specific bug.

If you want to make sure that the bug doesn't return, then you need to run the /entire test suite/ with the configuration that exposes the problem. You wouldn't be suggesting this specific set of mount options if users weren't using that configuration. Hence you really need to run the entire test suite with that configuration to make sure you haven't broken those users' systems.

And, yes, I test lots of different XFS configurations in their entirety every day on every change I make or review, so I'm not suggesting that you should do anything I don't already do.

Cheers, Dave.
-- Dave Chinner da...@fromorbit.com
Re: [PATCH] generic: create and delete files repeatedly to exercise ENOSPC behaviour
On Tue, Nov 01, 2016 at 07:19:30PM +0800, Wang Xiaoguang wrote:
> In btrfs, sometimes even though the created files' consumed disk space
> is not larger than the fs's free space, we can still get ENOSPC errors.
> It may be that btrfs does not try hard enough to reclaim disk space (I
> have sent a kernel patch to resolve this kind of enospc error. Note,
> this false enospc error will not always happen, even on a kernel
> without my fixing patch).
>
> Currently I only get this ENOSPC error on btrfs; xfs and ext4 work well.
...
> +RUN_TIME=$((600 * $TIME_FACTOR))
> +fs_size=$((15 * 1024 * 1024 * 1024))
> +_scratch_mkfs_sized $fs_size > $seqres.full 2>&1
> +_scratch_mount > $seqres.full 2>&1
> +
> +testfile1=$SCRATCH_MNT/testfile1
> +testfile2=$SCRATCH_MNT/testfile2
> +filesize1=$(((fs_size * 80) / 100))
> +filesize2=$(((fs_size * 5) / 100))
> +
> +do_test()
> +{
> +	while [ -f $SCRATCH_MNT/run ]; do
> +		$XFS_IO_PROG -fc "pwrite 0 $filesize1" $testfile1 > /dev/null
> +		$XFS_IO_PROG -fc "pwrite 0 $filesize2" $testfile2 > /dev/null
> +		rm -f $testfile1 $testfile2
> +	done
> +}

What are you trying to test here that other ENOSPC tests don't exercise? Why can't you just use preallocation to trigger ENOSPC repeatedly instead of writing data? That would allow multiple iterations per second, not one every few minutes...

Cheers, Dave.
-- Dave Chinner da...@fromorbit.com
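[Editor's sketch] The preallocation approach Dave suggests could look something like the following. This is a hedged illustration, not the actual test: fallocate(1) stands in for `$XFS_IO_PROG -c "falloc ..."`, the demo runs against a temp directory rather than $SCRATCH_MNT, and all names and sizes are assumptions:

```shell
#!/bin/sh
# Drive the fs towards ENOSPC via preallocation instead of buffered
# writes: fallocate reserves blocks without issuing any data I/O, so
# an allocate/free cycle completes in milliseconds rather than the
# minutes a ~12GB pwrite takes.  Loop shape mirrors do_test() above.
eat_space()
{
	mnt=$1
	size=$2
	# reserve $size bytes with no data written; a failure here is
	# the ENOSPC condition the test wants to provoke
	fallocate -l "$size" "$mnt/eat_my_space" || echo "ENOSPC at $size bytes"
	rm -f "$mnt/eat_my_space"
}

# demo on a temp dir: reserve 1MiB and show the resulting file size
d=$(mktemp -d)
fallocate -l $((1024 * 1024)) "$d/f"
stat -c %s "$d/f"	# prints 1048576
rm -rf "$d"
```

Because no data is written, compression and delayed allocation heuristics don't slow the loop down either, which also makes the iteration rate predictable across filesystems.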
Re: [PATCH v4] generic: make 17[1-4] work well when btrfs compression is enabled
On Tue, Nov 01, 2016 at 04:49:34PM +0800, Wang Xiaoguang wrote:
> hi Darrick,
>
> common/populate needs xfs_io to support falloc and fpunch,
> so I didn't put _fill_fs() in common/populate.

Tests will include common/rc first, and so pick up the functionality _fill_fs requires before it's included from common/populate. So there is no reason to put it in common/rc.

Cheers, Dave.
-- Dave Chinner da...@fromorbit.com
Re: [PATCH 5/5] fs: don't set *REFERENCED unless we are on the lru list
On Thu, Oct 27, 2016 at 09:13:44AM -0400, Josef Bacik wrote: > On 10/26/2016 08:30 PM, Dave Chinner wrote: > >On Wed, Oct 26, 2016 at 11:11:35AM -0400, Josef Bacik wrote: > >>On 10/25/2016 06:01 PM, Dave Chinner wrote: > >>>On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote: > >>>>With anything that populates the inode/dentry cache with a lot of one > >>>>time use > >>>>inodes we can really put a lot of pressure on the system for things we > >>>>don't > >>>>need to keep in cache. It takes two runs through the LRU to evict these > >>>>one use > >>>>entries, and if you have a lot of memory you can end up with 10's of > >>>>millions of > >>>>entries in the dcache or icache that have never actually been touched > >>>>since they > >>>>were first instantiated, and it will take a lot of CPU and a lot of > >>>>pressure to > >>>>evict all of them. > >>>> > >>>>So instead do what we do with pagecache, only set the *REFERENCED flags > >>>>if we > >>>>are being used after we've been put onto the LRU. This makes a > >>>>significant > >>>>difference in the system's ability to evict these useless cache entries. > >>>>With a > >>>>fs_mark workload that creates 40 million files we get the following > >>>>results (all > >>>>in files/sec) > >>> > >>>What's the workload, storage, etc? > >> > >>Oops sorry I thought I said it. It's fs_mark creating 20 million > >>empty files on a single NVME drive. > > > >How big is the drive/filesystem (e.g. has impact on XFS allocation > >concurrency)? And multiple btrfs subvolumes, too, by the sound of > >it. How did you set those up? What about concurrency, directory > >sizes, etc? Can you post the fsmark command line as these details > >do actually matter... > > > >Getting the benchmark configuration to reproduce posted results > >should not require playing 20 questions! 
> > This is the disk > > Disk /dev/nvme0n1: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors > > This is the script > > https://paste.fedoraproject.org/461874/73910147/1

Link no workee. This does, though: https://paste.fedoraproject.org/461874/73910147

> > It's on a 1 socket 8 core cpu with 16gib of ram.

Thanks!

> >Is there an order difference in reclaim as a result of earlier > >reclaim? Is btrfs_evict_inode() blocking on a sleeping lock in btrfs > >rather than contending on spinning locks? Is it having to wait for > >transaction commits or some other metadata IO because reclaim is > >happening earlier? i.e. The question I'm asking is what, exactly, > >leads to such a marked performance improvement in steady state > >behaviour? > > I would have seen this in my traces. There's tons of places to > improve btrfs's performance and behavior here no doubt. But simply > moving from pagecache to a slab shrinker shouldn't have drastically > changed how we perform in this test.

Not sure what you mean by this. Do you mean that because of the preceding patches that change the page cache accounting for btrfs metadata there is now not enough pressure being placed on the btrfs inode cache shrinker? FWIW, this zero length file fsmark workload produces zero page cache pressure on XFS. If the case is that btrfs is no longer using the pa

> I feel like the shrinkers need > to be used differently, but I completely destroyed vmscan.c trying > different approaches and none of them made a difference like this > patch made. From what I was seeing in my trace we were simply > reclaiming less per kswapd scan iteration with the old approach vs. > the new approach.

Yup, that's what default_seeks is supposed to be used to tune - the relative pressure that should be placed on the slab compared to everything else.
> >I want to know because if there's behavioural changes in LRU reclaim > >order having a significant effect on affecting btrfs, then there is > >going to be some effects on other filesystems, too. Maybe not in > >this benchmark, but we can't anticipate potential problems if we > >don't understand exactly what is going on here. > > So I'll just describe to you what I was seeing and maybe we can work > out where we think the problem is. > > 1) We go at X speed until we fill up all of the memory with the various > caches. > 2) We lost about 15% when kswapd kicks in and it never recovered. That's expected behaviour (i.e. that's exactly what I see when XFS fill memory a
Re: bio linked list corruption.
On Tue, Oct 25, 2016 at 08:27:52PM -0400, Dave Jones wrote: > DaveC: Do these look like real problems, or is this more "looks like > random memory corruption" ? It's been a while since I did some stress > testing on XFS, so these might not be new.. > > XFS: Assertion failed: oldlen > newlen, file: fs/xfs/libxfs/xfs_bmap.c, line: > 2938 > [ cut here ] > kernel BUG at fs/xfs/xfs_message.c:113! > invalid opcode: [#1] PREEMPT SMP > CPU: 1 PID: 6227 Comm: trinity-c9 Not tainted 4.9.0-rc1-think+ #6 > task: 8804f4658040 task.stack: 88050568c000 > RIP: 0010:[] > [] assfail+0x1b/0x20 [xfs] > RSP: :88050568f9e8 EFLAGS: 00010282 > RAX: ffea RBX: 0046 RCX: 0001 > RDX: ffc0 RSI: 000a RDI: a02fe34d > RBP: 88050568f9e8 R08: R09: > R10: 000a R11: f000 R12: 88050568fb44 > R13: 00f3 R14: 8804f292bf88 R15: 000e0046 > FS: 7fe2ddfdfb40() GS:88050a00() knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: 7fe2dbabd000 CR3: 0004f461f000 CR4: 001406e0 > Stack: > 88050568fa88 a027ccee fff9 8804f16fd8b0 > 3ffa 0032 8804f292bf40 4976 > 000e0008 04fd 8804 5107 > Call Trace: > [] xfs_bmap_add_extent_hole_delay+0x54e/0x620 [xfs] > [] xfs_bmapi_reserve_delalloc+0x2b4/0x400 [xfs] > [] xfs_file_iomap_begin_delay.isra.12+0x247/0x3c0 [xfs] > [] xfs_file_iomap_begin+0x181/0x270 [xfs] > [] iomap_apply+0x53/0x100 > [] iomap_file_buffered_write+0x6b/0x90 All this iomap code is new, so it's entirely possible that this is a new bug. The assert failure is indicative that the delalloc extent's metadata reservation grew when we expected it to stay the same or shrink. > XFS: Assertion failed: tp->t_blk_res_used <= tp->t_blk_res, file: > fs/xfs/xfs_trans.c, line: 309 ... > [] xfs_trans_mod_sb+0x241/0x280 [xfs] > [] xfs_ag_resv_alloc_extent+0x4f/0xc0 [xfs] > [] xfs_alloc_ag_vextent+0x23d/0x300 [xfs] > [] xfs_alloc_vextent+0x5fb/0x6d0 [xfs] > [] xfs_bmap_btalloc+0x304/0x8e0 [xfs] > [] ? 
xfs_iext_bno_to_ext+0xee/0x170 [xfs]
> [] xfs_bmap_alloc+0x2b/0x40 [xfs]
> [] xfs_bmapi_write+0x640/0x1210 [xfs]
> [] xfs_iomap_write_allocate+0x166/0x350 [xfs]
> [] xfs_map_blocks+0x1b0/0x260 [xfs]
> [] xfs_do_writepage+0x23b/0x730 [xfs]

And that's indicative of a delalloc metadata reservation being too small, and so we're allocating unreserved blocks. Different symptoms, same underlying cause, I think.

I see the latter assert from time to time in my testing, but it's not common (maybe once a month) and I've never been able to track it down. However, it doesn't affect production systems unless they hit ENOSPC hard enough that it causes the critical reserve pool to be exhausted and so the allocation fails. That's extremely rare - it usually takes several hundred processes all trying to write as hard as they can concurrently, all slipping through the ENOSPC detection without the correct metadata reservations, and all requiring multiple metadata blocks to be allocated during writeback...

If you've got a way to trigger it quickly and reliably, that would be helpful...

Cheers, Dave.
-- Dave Chinner da...@fromorbit.com
Re: [PATCH 5/5] fs: don't set *REFERENCED unless we are on the lru list
On Wed, Oct 26, 2016 at 11:11:35AM -0400, Josef Bacik wrote: > On 10/25/2016 06:01 PM, Dave Chinner wrote: > >On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote: > >>With anything that populates the inode/dentry cache with a lot of one time > >>use > >>inodes we can really put a lot of pressure on the system for things we don't > >>need to keep in cache. It takes two runs through the LRU to evict these > >>one use > >>entries, and if you have a lot of memory you can end up with 10's of > >>millions of > >>entries in the dcache or icache that have never actually been touched since > >>they > >>were first instantiated, and it will take a lot of CPU and a lot of > >>pressure to > >>evict all of them. > >> > >>So instead do what we do with pagecache, only set the *REFERENCED flags if > >>we > >>are being used after we've been put onto the LRU. This makes a significant > >>difference in the system's ability to evict these useless cache entries. > >>With a > >>fs_mark workload that creates 40 million files we get the following results > >>(all > >>in files/sec) > > > >What's the workload, storage, etc? > > Oops sorry I thought I said it. It's fs_mark creating 20 million > empty files on a single NVME drive. How big is the drive/filesystem (e.g. has impact on XFS allocation concurrency)? And multiple btrfs subvolumes, too, by the sound of it. How did you set those up? What about concurrency, directory sizes, etc? Can you post the fsmark command line as these details do actually matter... Getting the benchmark configuration to reproduce posted results should not require playing 20 questions! > >>The reason Btrfs has a much larger improvement is because it holds a lot > >>more > >>things in memory so benefits more from faster slab reclaim, but across the > >>board > >>is an improvement for each of the file systems. > > > >Less than 1% for XFS and ~1.5% for ext4 is well within the > >run-to-run variation of fsmark. 
It looks like it might be slightly > >faster, but it's not a cut-and-dried win for anything other than > >btrfs. > > > > Sure the win in this benchmark is clearly benefiting btrfs the most, > but I think the overall approach is sound and likely to help > everybody in theory. Yup, but without an explanation of why it makes such a big change to btrfs, we can't really say what effect it's really going to have. Why does cycling the inode a second time through the LRU make any difference? Once we're in steady state on this workload, one or two cycles through the LRU should make no difference at all to performance because all the inodes are instantiated in identical states (including the referenced bit) and so scanning treats every inode identically. i.e. the reclaim rate (i.e. how often evict_inode() is called) should be exactly the same and the only difference is the length of time between the inode being put on the LRU and when it is evicted. Is there an order difference in reclaim as a result of earlier reclaim? Is btrfs_evict_inode() blocking on a sleeping lock in btrfs rather than contending on spinning locks? Is it having to wait for transaction commits or some other metadata IO because reclaim is happening earlier? i.e. The question I'm asking is what, exactly, leads to such a marked performance improvement in steady state behaviour? I want to know because if there's behavioural changes in LRU reclaim order having a significant effect on affecting btrfs, then there is going to be some effects on other filesystems, too. Maybe not in this benchmark, but we can't anticipate potential problems if we don't understand exactly what is going on here. > Inside FB we definitely have had problems > where the memory pressure induced by some idi^H^H^Hprocess goes > along and runs find / which causes us to evict real things that are > being used rather than these one use inodes. That's one of the problems the IO throttling in the XFS shrinker tends to avoid. i.e. 
This is one of the specific cases we expect to see on all production systems - backup applications are the common cause of regular full filesystem traversals. FWIW, there's an element of deja vu in this thread: that XFS inode cache shrinker IO throttling is exactly what Chris proposed we gut last week to solve some other FB memory reclaim problem that had no explanation of the root cause. (http://www.spinics.net/lists/linux-xfs/msg01541.html) > This sort of behavior > could possibly be mitigated by this patch, but I haven't sat down to > figure out a reliable way to mirror this workload to test that > theory. Thanks I use fsmark to create filesystems with tens of millions of small fi
Re: [PATCH 5/5] fs: don't set *REFERENCED unless we are on the lru list
On Wed, Oct 26, 2016 at 04:03:54PM -0400, Josef Bacik wrote: > On 10/25/2016 07:36 PM, Dave Chinner wrote: > >So, 2-way has not improved. If changing referenced behaviour was an > >obvious win for btrfs, we'd expect to see that here as well. > >however, because 4-way improved by 20%, I think all we're seeing is > >a slight change in lock contention levels inside btrfs. Indeed, > >looking at the profiles I see that lock contention time was reduced > >to around 32% of the total CPU used (down by about 20%): > > > > 20.79% [kernel] [k] __pv_queued_spin_lock_slowpath > > 3.85% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock > > 3.68% [kernel] [k] _raw_spin_lock > > 3.40% [kernel] [k] queued_write_lock_slowpath > > . > > > >IOWs, the performance increase comes from the fact that changing > >inode cache eviction patterns causes slightly less lock contention > >in btrfs inode reclaim. IOWs, the problem that needs fixing is the > >btrfs lock contention, not the VFS cache LRU algorithms. > > > >Root cause analysis needs to be done properly before behavioural > >changes like this are proposed, people! > > > > We are doing different things. Well, no, we're doing the same thing. Except... > Yes when you do it all into the same > subvol the lock contention is obviously much worse and the > bottleneck, but that's not what I'm doing, I'm creating a subvol for > each thread to reduce the lock contention on the root nodes. .. you have a non-default filesystem config that you didn't mention even in response to a /direct question/. Seriously, this is a major configuration detail that is necessary for anyone who wants to replicate your results! The difference is that I'm using is the /default/ btrfs filesystem configuration and you're using a carefully contrived structure that is designed to work around fundamental scalability issues the benchmark normally exposes. 
This raises another question: Does this subvol setup reflect production configurations, or is it simply a means to get the benchmark to put enough load on the inode cache to demonstrate the effect of changing the reference bits?

> If you > retest by doing that then you will see the differences I was seeing. > Are you suggesting that I just made these numbers up?

No, I'm suggesting that you haven't explained the problem or how to reproduce it in sufficient detail. You haven't explained why reducing scanning (which in and of itself has minimal overhead and is not visible to btrfs) has such a marked effect on the behaviour of btrfs. You haven't given us details on the workload or storage config so we can reproduce the numbers (and still haven't!). You're arguing that numbers far below variability and accuracy measurement limits are significant (i.e. the patched numbers are better for XFS and ext4). There's lots of information you've chosen to omit that we need to know.

To fill in the gaps, I've done some analysis, provided perf numbers and CPU profiles, and put them into an explanation that explains why there is a difference in performance for a default btrfs config. I note that you've not actually addressed that analysis and the problems it uncovers, but instead just said "I'm using a different configuration". What about all the btrfs users that aren't using your configuration that will see these problems? How does the scanning change actually benefit them more than fixing the locking issues that exist? Memory reclaim and LRU algorithms affect everyone, not just a benchmark configuration. We need to know a lot more about the issue at hand to be able to evaluate whether this is the /right solution for the problem./

> Instead of > assuming I'm an idiot and don't know how to root cause issues please > instead ask what is different from your run versus my run.
Thanks, I'm assuming you're a competent engineer, Josef, who knows that details about benchmarks and performance measurement are extremely important, and that someone reviewing code needs to understand why the behaviour change had the impact it did to be able to evaluate it properly. I'm assuming that you also know that if you understand the root cause of a problem, you put it in the commit message so that everyone else knows it, too. So if you know the root cause of the btrfs problem that reducing reclaim scanning avoids, then please explain it.

Cheers, Dave.
-- Dave Chinner da...@fromorbit.com
Re: [PATCH 5/5] fs: don't set *REFERENCED unless we are on the lru list
On Wed, Oct 26, 2016 at 09:01:13AM +1100, Dave Chinner wrote: > On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote: > > With anything that populates the inode/dentry cache with a lot of one time > > use > > inodes we can really put a lot of pressure on the system for things we don't > > need to keep in cache. It takes two runs through the LRU to evict these > > one use > > entries, and if you have a lot of memory you can end up with 10's of > > millions of > > entries in the dcache or icache that have never actually been touched since > > they > > were first instantiated, and it will take a lot of CPU and a lot of > > pressure to > > evict all of them. > > > > So instead do what we do with pagecache, only set the *REFERENCED flags if > > we > > are being used after we've been put onto the LRU. This makes a significant > > difference in the system's ability to evict these useless cache entries. > > With a > > fs_mark workload that creates 40 million files we get the following results > > (all > > in files/sec) > > What's the workload, storage, etc? > > > Btrfs Patched Unpatched > > Average Files/sec: 72209.3 63254.2 > > p50 Files/sec: 70850 57560 > > p90 Files/sec: 68757 53085 > > p99 Files/sec: 68757 53085 > > So how much of this is from changing the dentry referenced > behaviour, and how much from the inode? Can you separate out the two > changes so we know which one is actually affecting reclaim > performance? FWIW, I just went to run my usual zero-length file creation fsmark test (16-way create, large sparse FS image on SSDs). XFS (with debug enabled) takes 4m10s to run at an average of ~230k files/s, with a std deviation of +/-1.7k files/s. Unfortunately, btrfs turns that into more heat than it ever has done before. 
It's only managing 35k files/s and the profile looks like this:

  58.79%  [kernel]  [k] __pv_queued_spin_lock_slowpath
   5.61%  [kernel]  [k] queued_write_lock_slowpath
   1.65%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
   1.38%  [kernel]  [k] reschedule_interrupt
   1.08%  [kernel]  [k] _raw_spin_lock
   0.92%  [kernel]  [k] __radix_tree_lookup
   0.86%  [kernel]  [k] _raw_spin_lock_irqsave
   0.83%  [kernel]  [k] btrfs_set_lock_blocking_rw

I killed it because this would take too long to run. I reduced the concurrency down to 4-way, and spinlock contention went down to about 40% of the CPU time. I reduced the concurrency down to 2 and saw about 16% of CPU time being spent in lock contention. Throughput results:

btrfs throughput    2-way               4-way
unpatched           46938.1+/-2.8e+03   40273.4+/-3.1e+03
patched             45697.2+/-2.4e+03   49287.1+/-3e+03

So, 2-way has not improved. If changing referenced behaviour was an obvious win for btrfs, we'd expect to see that here as well. However, because 4-way improved by 20%, I think all we're seeing is a slight change in lock contention levels inside btrfs. Indeed, looking at the profiles I see that lock contention time was reduced to around 32% of the total CPU used (down by about 20%):

  20.79%  [kernel]  [k] __pv_queued_spin_lock_slowpath
   3.85%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
   3.68%  [kernel]  [k] _raw_spin_lock
   3.40%  [kernel]  [k] queued_write_lock_slowpath
   ....

IOWs, the performance increase comes from the fact that changing inode cache eviction patterns causes slightly less lock contention in btrfs inode reclaim. IOWs, the problem that needs fixing is the btrfs lock contention, not the VFS cache LRU algorithms.

Root cause analysis needs to be done properly before behavioural changes like this are proposed, people!

-Dave.

PS: I caught this profile on unmount when the 8 million cached inodes were being reclaimed.
evict_inodes() ignores the referenced bit, so this gives a lot of insight into the work being done by inode reclaim in a filesystem:

  18.54%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
   9.43%  [kernel]  [k] rb_erase
   8.03%  [kernel]  [k] __btrfs_release_delayed_node
   7.23%  [kernel]  [k] _raw_spin_lock
   6.93%  [kernel]  [k] __list_del_entry
   4.35%  [kernel]  [k] __slab_free
   3.93%  [kernel]  [k] __mutex_lock_slowpath
   2.77%  [kernel]  [k] bit_waitqueue
   2.58%  [kernel]  [k] kmem_cache_alloc
   2.50%  [kernel]  [k] __radix_tree_lookup
   2.44%  [kernel]  [k] _raw_spin_lock_irq
   2.18%  [kernel]  [k] kmem_cache_free
   2.17%  [kernel]  [k] evict            <<<<<<<<<<<
   2.13%  [kernel]  [k] fsnotify_destroy_marks
   1.68%  [kernel]  [k] btrfs_remove_delayed_node
   1.61%  [kernel]  [k] __call_rcu.constprop.70
   1.50%  [kernel
Re: [PATCH 5/5] fs: don't set *REFERENCED unless we are on the lru list
On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote: > With anything that populates the inode/dentry cache with a lot of one time use > inodes we can really put a lot of pressure on the system for things we don't > need to keep in cache. It takes two runs through the LRU to evict these one use > entries, and if you have a lot of memory you can end up with 10's of millions of > entries in the dcache or icache that have never actually been touched since they > were first instantiated, and it will take a lot of CPU and a lot of pressure to > evict all of them. > > So instead do what we do with pagecache, only set the *REFERENCED flags if we > are being used after we've been put onto the LRU. This makes a significant > difference in the system's ability to evict these useless cache entries. With a > fs_mark workload that creates 40 million files we get the following results (all > in files/sec)

What's the workload, storage, etc?

> Btrfs               Patched   Unpatched
> Average Files/sec:  72209.3   63254.2
> p50 Files/sec:      70850     57560
> p90 Files/sec:      68757     53085
> p99 Files/sec:      68757     53085

So how much of this is from changing the dentry referenced behaviour, and how much from the inode? Can you separate out the two changes so we know which one is actually affecting reclaim performance?

Indeed, I wonder if just changing the superblock shrinker default_seeks for btrfs would have exactly the same impact, because that can be used to exactly double the reclaim scan rate for the same memory pressure. If that doesn't change performance by a similar amount (changing default_seeks is the normal way of changing shrinker balance), then more digging is required here to explain why the referenced bits make such an impact to steady state performance...
> XFS                 Patched   Unpatched
> Average Files/sec:  61025.5   60719.5
> p50 Files/sec:      60107     59465
> p90 Files/sec:      59300     57966
> p99 Files/sec:      59227     57528

You made XFS never use I_REFERENCED at all (hint: not all filesystems use find_inode/find_inode_fast()), so it's not clear that the extra scanning (less than 1% difference in average throughput) is actually the cause of things being slower in btrfs.

> The reason Btrfs has a much larger improvement is because it holds a lot more > things in memory so benefits more from faster slab reclaim, but across the board > is an improvement for each of the file systems.

Less than 1% for XFS and ~1.5% for ext4 is well within the run-to-run variation of fsmark. It looks like it might be slightly faster, but it's not a cut-and-dried win for anything other than btrfs.

Cheers, Dave.
-- Dave Chinner da...@fromorbit.com
Re: [PATCH v2] generic: make 17[1-4] work well when btrfs compression is enabled
On Mon, Oct 10, 2016 at 04:06:17PM +0800, Wang Xiaoguang wrote:
> When enabling btrfs compression, original codes can not fill fs
> correctly, fix this.
>
> Signed-off-by: Wang Xiaoguang <wangxg.f...@cn.fujitsu.com>
> ---
> V2: In common/, I didn't find an existing function suitable for
> these 4 test cases to fill fs, so I still use _pwrite_byte() with
> a big enough file length to fill fs. Note, for btrfs, metadata space
> still is not full, only data space is full, but it's OK for these
> 4 test cases.
>
> All these 4 cases pass in xfs and btrfs(without compression), if
> btrfs has compression enabled, these 4 cases will fail for false
> enospc error, I have sent kernel patches to fix this bug.
> ---
> tests/generic/171 | 7 ---
> tests/generic/172 | 6 --
> tests/generic/173 | 7 ---
> tests/generic/174 | 7 ---
> 4 files changed, 16 insertions(+), 11 deletions(-)
>
> diff --git a/tests/generic/171 b/tests/generic/171
> index f391685..2d7f683 100755
> --- a/tests/generic/171
> +++ b/tests/generic/171
> @@ -75,9 +75,10 @@ _cp_reflink $testdir/bigfile $testdir/clonefile
> sync
>
> echo "Allocate the rest of the space"
> -nr_free=$(stat -f -c '%f' $testdir)
> -touch $testdir/file0 $testdir/file1
> -_pwrite_byte 0x61 0 $((blksz * nr_free)) $testdir/eat_my_space >> $seqres.full 2>&1
> +# here we try to write big enough data in case some fs support
> +# compression, such as btrfs.
> +len=$((2 ** 36))
> +_pwrite_byte 0x61 0 $len $testdir/eat_my_space >> $seqres.full 2>&1

NACK. This is no better than the last proposal. 2^36 = 64GB, which will not fill filesystems of any significant size, especially as the data will compress down to nearly nothing.

Trying to hack around compression artifacts by inflating the size of the file just doesn't work reliably. The way to fix this is to either use one of the "fill filesystem" functions that isn't affected by compression, or to add the ability to xfs_io to write incompressible data patterns and otherwise leave the test alone.
Cheers, Dave. -- Dave Chinner da...@fromorbit.com
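The "compress down to nearly nothing" point is easy to demonstrate from the shell, with gzip standing in for btrfs's zlib compression (the absolute sizes depend on the compressor; the ratio is the point):

```shell
# Why a constant-byte fill cannot reliably consume space on a compressing
# filesystem: 1MiB of the 0x61 ('a') pattern collapses to roughly 1KiB,
# while 1MiB of random data stays around 1MiB.  gzip stands in for btrfs
# zlib here.
tmpdir=$(mktemp -d)

head -c 1048576 /dev/zero | tr '\0' 'a' > $tmpdir/pattern
head -c 1048576 /dev/urandom > $tmpdir/random

pattern_sz=$(gzip -c $tmpdir/pattern | wc -c)
random_sz=$(gzip -c $tmpdir/random | wc -c)
echo "pattern: $pattern_sz bytes compressed, random: $random_sz bytes compressed"

rm -rf $tmpdir
```

This is why writing an incompressible pattern (or using a fill helper that is compression-proof) is the only reliable way to hit ENOSPC under compression.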
Re: [PATCH] generic/175: disable inline data feature for btrfs
On Mon, Oct 10, 2016 at 01:06:47PM +0800, Wang Xiaoguang wrote:
> For btrfs, if compression is enabled, it may generate inline data for a
> blocksize data range, this inline data is stored in fs tree, will not have
> a individual extent, try to reflink this data range at a not-zero offset
> will return EOPNOTSUPP, so here we disable inline data feature for btrfs.
>
> Signed-off-by: Wang Xiaoguang <wangxg.f...@cn.fujitsu.com>
> ---
> tests/generic/175 | 7 +++
> 1 file changed, 7 insertions(+)
>
> diff --git a/tests/generic/175 b/tests/generic/175
> index 964580c..b3f90dc 100755
> --- a/tests/generic/175
> +++ b/tests/generic/175
> @@ -50,6 +50,13 @@ rm -f "$seqres.full"
>
> echo "Format and mount"
> _scratch_mkfs > "$seqres.full" 2>&1
> +# For btrfs, if compression is enabled, it may generate inline data for a
> +# blocksize data range, this inline data is stored in fs tree, will not have
> +# a individual extent, try to reflink this data range at a not-zero offset
> +# will return EOPNOTSUPP, so here we disable inline data feature for btrfs.
> +if [ "$FSTYP" = "btrfs" ]; then
> +	export MOUNT_OPTIONS="-o max_inline=0 $MOUNT_OPTIONS"
> +fi

Can we /please stop/ putting special case code like this in tests? This is an unsustainable and unmaintainable practice - it's making a mess of the test code.

If there are specific mount options that need to be avoided, then add an option to filter them out. e.g. something like this:

_scratch_options_filter btrfs compress

so that it removes any compression option from the btrfs mount/mkfs that is run for that test.

Cheers, Dave. -- Dave Chinner da...@fromorbit.com
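One possible shape for that helper (hypothetical: _scratch_options_filter is not an existing common/rc function, and the sed patterns below only cover the common "-o opt[,opt...]" layouts):

```shell
# Hypothetical sketch of the proposed helper -- not an existing common/rc
# function.  Strips "opt", "opt=value" and "opt-force[=value]" from the
# scratch mount options when running against the named filesystem.
_scratch_options_filter()
{
	local fstyp=$1 opt=$2

	[ "$FSTYP" = "$fstyp" ] || return 0
	MOUNT_OPTIONS=$(echo "$MOUNT_OPTIONS" | sed -E \
		-e "s/${opt}(-force)?(=[^, ]*)?,//g" \
		-e "s/,${opt}(-force)?(=[^, ]*)?//g" \
		-e "s/(^| )-o +${opt}(-force)?(=[^, ]*)?( |$)/\1/g")
	export MOUNT_OPTIONS
}

# example: strip compression from a btrfs mount option string
FSTYP=btrfs
MOUNT_OPTIONS="-o compress=zlib,noatime"
_scratch_options_filter btrfs compress
echo "filtered: $MOUNT_OPTIONS"
```

A real version would presumably also filter MKFS_OPTIONS; this sketch only shows the mount-option side.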
Re: Is it possible to submit binary image as fstest test case?
On Fri, Oct 07, 2016 at 06:05:51PM +0200, David Sterba wrote:
> On Fri, Oct 07, 2016 at 08:18:38PM +1100, Dave Chinner wrote:
> > On Thu, Oct 06, 2016 at 04:12:56PM +0800, Qu Wenruo wrote:
> > > So I'm wondering if I can just upload a zipped raw image as part of
> > > the test case?
> >
> > Preferably not. We've managed to avoid pre-built images in xfstests
> > for 15 years, so there'd have to be a really good reason to start
> > doing this, especially as once we open that floodgate we'll end up
> > with everyone wanting to do this and it will blow out the size of
> > the repository in no time.
> >
> > If the issue is just timing or being unable to trigger an error
> > at the right time, this is what error injection frameworks or
> > debug-only sysfs hooks are for. The XFS kernel code has both,
> > xfstests use both, and they pretty much do away with the need for
> > custom binary filesystem images for such testing...
>
> Error injection framework may not be always available, eg. in kernels
> built for production. Yet I'd like to make the test possible.

Why would you ever enable error injection on a production kernel? We simply don't run the error injection tests when the infrastructure is not there, just like we do with all other tests that are dependent on optional kernel/fs features.

> Also, I find it useful to keep the exact images that are attached to a
> report and not necessarily try to recreate it to trigger a bug. If the
> images are small, hosting them with the sources makes testing easy.
> Big images would probably go to own repository and be linked.
>
> I don't really expect fstests to store crafted images and would advise
> Qu to not even try to propose that, because the answer was quite
> predictable.

You say that like it's a bad thing?

What's the guarantee that a specific corrupt image will always be sufficient to trigger the problem the test is supposed to check? i.e.
we could change a different part of the FS code and that could mask the bug the image used to trigger. The test then passes, but we haven't actually fixed the bug that the test used to exercise. i.e. we've got a false "we fixed the problem" report, when all we did is prevent a specific vector from tripping over it.

Error injection and sysfs hooks into debug code are much more reliable ways of testing that the root cause of a problem is fixed and stays fixed. The reproducing trigger cannot be masked by changes in other code layers, so we know that if the error injection/sysfs hook test handles the problem correctly, we have actually fixed the underlying bug and not just masked it...

Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: Making statfs return qgroup info (for container use case etc.)
On Fri, Oct 07, 2016 at 06:58:47PM +0200, David Sterba wrote:
> On Fri, Oct 07, 2016 at 09:40:11AM +1100, Dave Chinner wrote:
> > XFS does this with directory tree quotas. It was implemented 10 years
> > ago or so, IIRC...
>
> Sometimes, the line between a historical remark and trolling is thin

All I've done is quickly point out that we've already got an implementation of the exact functionality being asked for and given a rough indication of when the commits hit the tree. i.e. so anyone wanting to implement this in btrfs can search for it easily and work out how to use the existing VFS infrastructure to report this information.

You can take that as trolling if you want, but if you think I have time for playing stupid, petty, idiotic games like that then you need to ask yourself why you have had such a defensive and paranoid reaction...

Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH] generic: make 17[1-4] work well when btrfs compression is enabled
On Fri, Oct 07, 2016 at 03:00:42PM +0800, Wang Xiaoguang wrote:
> When enabling btrfs compression, original codes can not fill fs
> correctly, fix this.
>
> Signed-off-by: Wang Xiaoguang <wangxg.f...@cn.fujitsu.com>
> ---
> tests/generic/171 | 4 +---
> tests/generic/172 | 2 +-
> tests/generic/173 | 4 +---
> tests/generic/174 | 4 +---
> 4 files changed, 4 insertions(+), 10 deletions(-)
>
> diff --git a/tests/generic/171 b/tests/generic/171
> index f391685..d2ae8e4 100755
> --- a/tests/generic/171
> +++ b/tests/generic/171
> @@ -75,9 +75,7 @@ _cp_reflink $testdir/bigfile $testdir/clonefile
> sync
>
> echo "Allocate the rest of the space"
> -nr_free=$(stat -f -c '%f' $testdir)
> -touch $testdir/file0 $testdir/file1
> -_pwrite_byte 0x61 0 $((blksz * nr_free)) $testdir/eat_my_space >> $seqres.full 2>&1
> +dd if=/dev/zero of=$testdir/eat_my_space >> $seqres.full 2>&1

Please don't replace xfs_io writes using a specific data pattern with dd calls that write zeros. Indeed, we don't use dd for new tests anymore - xfs_io should be used.

Write a function that fills all the remaining free space (one may already exist) and call it in all these places.

Cheers, Dave. -- Dave Chinner da...@fromorbit.com
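One possible shape for such a fill function (a sketch, not an existing common/ helper; the function name and the 1MiB chunk size are both made up for illustration, and xfs_io's pwrite -S writes a repeating seed byte):

```shell
# Sketch of a "fill remaining free space" helper built on xfs_io pattern
# writes rather than dd (hypothetical; a variant may already exist in
# common/).  Writes 0x61-patterned chunks until the fs returns ENOSPC.
_fill_fs_space()
{
	local file=$1
	local chunk=$((1024 * 1024))
	local i=0

	while $XFS_IO_PROG -f -c \
		"pwrite -S 0x61 $((i * chunk)) $chunk" $file >/dev/null 2>&1; do
		i=$((i + 1))
	done
}
```

Because it keeps writing the same pattern until the write actually fails, it fills the data space regardless of how much the filesystem shrinks each chunk, which is the property the per-test hacks were missing.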
Re: Is it possible to submit binary image as fstest test case?
On Fri, Oct 07, 2016 at 05:26:27PM +0800, Qu Wenruo wrote:
>
> At 10/07/2016 05:18 PM, Dave Chinner wrote:
> >On Thu, Oct 06, 2016 at 04:12:56PM +0800, Qu Wenruo wrote:
> >>Hi,
> >>
> >>Just as the title says, for some case(OK, btrfs again) we need to
> >>catch a file system in special timing.
> >>
> >>In this specific case, we need to grab a btrfs image undergoing
> >>balancing, just before the balance finished.
> >>
> >>Although we can use flakey to drop all write, we still don't have
> >>method to catch the timing of that transaction.
> >>
> >>On the other hand, we can tweak our local kernel, adding
> >>msleep()/message and dump the disk during the sleep.
> >>And the image I dumped can easily trigger btrfs kernel and user-space bug.
> >>
> >>So I'm wondering if I can just upload a zipped raw image as part of
> >>the test case?
> >
> >Preferably not. We've managed to avoid pre-built images in xfstests
> >for 15 years, so there'd have to be a really good reason to start
> >doing this, especially as once we open that floodgate we'll end up
> >with everyone wanting to do this and it will blow out the size of
> >the repository in no time.
>
> Makes sense.
> For btrfs-progs, which includes test images, it already takes about
> 77M, even we have tried our best to reduce image size.
>
> >If the issue is just timing or being unable to trigger an error
> >at the right time, this is what error injection frameworks or
> >debug-only sysfs hooks are for. The XFS kernel code has both,
> >xfstests use both, and they pretty much do away with the need for
> >custom binary filesystem images for such testing...
>
> So again, btrfs is lacking infrastructure for debug.
> It seems that we can only rely on images out of xfstest tree,

That's the /wrong answer/. Go and implement debug infrastructure that btrfs needs - if you wait for someone else to do it, it will never get done and btrfs will never stabilise.

Cheers, Dave.
-- Dave Chinner da...@fromorbit.com
Re: Is it possible to submit binary image as fstest test case?
On Thu, Oct 06, 2016 at 04:12:56PM +0800, Qu Wenruo wrote:
> Hi,
>
> Just as the title says, for some case(OK, btrfs again) we need to
> catch a file system in special timing.
>
> In this specific case, we need to grab a btrfs image undergoing
> balancing, just before the balance finished.
>
> Although we can use flakey to drop all write, we still don't have
> method to catch the timing of that transaction.
>
> On the other hand, we can tweak our local kernel, adding
> msleep()/message and dump the disk during the sleep.
> And the image I dumped can easily trigger btrfs kernel and user-space bug.
>
> So I'm wondering if I can just upload a zipped raw image as part of
> the test case?

Preferably not. We've managed to avoid pre-built images in xfstests for 15 years, so there'd have to be a really good reason to start doing this, especially as once we open that floodgate we'll end up with everyone wanting to do this and it will blow out the size of the repository in no time.

If the issue is just timing or being unable to trigger an error at the right time, this is what error injection frameworks or debug-only sysfs hooks are for. The XFS kernel code has both, xfstests use both, and they pretty much do away with the need for custom binary filesystem images for such testing...

Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: Making statfs return qgroup info (for container use case etc.)
On Thu, Oct 06, 2016 at 02:51:58PM +0100, Tim Small wrote:
> Hello,
>
> I use btrfs for containers (e.g. full OS style containers using lxc/lxd
> etc. rather than application containers). btrfs de-duplication and
> send/receive are very useful features for this use-case.
>
> Each container runs in its own subvolume, and I use quotas to stop one
> container exhausting the disk space of the others.
>
> df shows the total space + free space for the entire filesystem - which
> is confusing for users within the containers. Worse - they don't have
> any way of knowing how much quota they actually have left (other than du
> -xs / vs. whatever I've told them).
>
> It'd be really nice if running e.g. statfs("/home", ...) within a
> container could be made to return the qgroup usage + limit for the path
> (e.g. by passing in an option at mount time - such as qgroup level
> maybe?) , instead of the global filesystem data in f_bfree f_blocks etc.

XFS does this with directory tree quotas. It was implemented 10 years ago or so, IIRC...

Cheers, Dave. -- Dave Chinner da...@fromorbit.com
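For context, the statfs() fields the report is talking about are the ones df consumes, easily inspected with stat -f (the path below is just an example; on btrfs today every subvolume path returns the same filesystem-global numbers):

```shell
# The statfs() fields behind df: total blocks (f_blocks), free blocks
# (f_bfree) and the fundamental block size (f_bsize).  On btrfs these
# are global to the whole filesystem, so a container querying its own
# subvolume sees the same numbers as every other container.
path=/
f_blocks=$(stat -f -c '%b' $path)
f_bfree=$(stat -f -c '%f' $path)
f_bsize=$(stat -f -c '%S' $path)

echo "f_blocks=$f_blocks f_bfree=$f_bfree f_bsize=$f_bsize"
```

The request in the thread amounts to having these three fields reflect the qgroup (or, in the XFS case, directory tree quota) limits for the queried path instead of the global totals.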
Re: [RFC] Preliminary BTRFS Encryption
On Tue, Sep 13, 2016 at 09:39:46PM +0800, Anand Jain wrote:
>
> This patchset adds btrfs encryption support.
>
> The main objective of this series is to have bugs fixed and stability.
> I have verified with fstests to confirm that there is no regression.
>
> A design write-up is coming next, however here below is the quick example
> on the cli usage. Please try out, let me know if I have missed something.

Yup, best practice says "do not roll your own encryption infrastructure". This is just my 2c worth - take it or leave it, don't bother flaming. Keep in mind that I'm not picking on btrfs here - I asked similar hard questions about the proposed f2fs encryption implementation. That was a "copy and snowflake" version of the ext4 encryption code - they made changes and now we have generic code and common functionality between ext4 and f2fs.

> Also would like to mention that a review from the security experts is due,
> which is important and I believe those review comments can be accommodated
> without major changes from here.

That's a fairly significant red flag to me - security reviews need to be done at the design phase against specific threat models - security review is not a code/implementation review... The ext4 developers got this right by publishing threat models and design docs, which got quite a lot of review and feedback before code was published for review.

https://docs.google.com/document/d/1ft26lUQyuSpiu6VleP70_npaWdRfXFoNnB8JYnykNTg/edit#heading=h.qmnirp22ipew

[small reorder of comments]

> As of now these patch set supports encryption on per subvolume, as
> managing properties on per subvolume is a kind of core to btrfs, which is
> easier for data center solution-ing, seamlessly persistent and easy to
> manage.

We've got dmcrypt for this sort of transparent "device level" encryption. Do we really need another btrfs layer that re-implements generic, robust, widely deployed, stable functionality?
What concerns me the most here is that it seems that nothing has been learnt from the btrfs RAID5/6 debacle. i.e. the btrfs reimplementation of existing robust, stable, widely deployed infrastructure was fatally flawed and despite regular corruption reports they were ignored for, what, 2 years? And then a /user/ spent the time to isolate the problem, and now several months later it still hasn't been fixed. I haven't seen any developer interest in fixing it, either.

This meets the definition of unmaintained software, and it sets a poor example for how complex new btrfs features might be maintained in the long term. Encryption simply cannot be treated like this - it has to be right, and it has to be well maintained.

So what is being done differently to the RAID5/6 review process this time that will make the new btrfs-specific encryption implementation solid and have minimal risk of zero-day fatal flaws? And how are you going to guarantee that it will be adequately maintained several years down the track?

> Also yes, thanks for the emails, I hear, per file encryption and inline
> with vfs layer is also important, which is wip among other things in the
> list.

The generic file encryption code is solid, reviewed, tested and already widely deployed via two separate filesystems. There is a much wider pool of developers who will maintain it, review changes and know all the traps that a new implementation might fall into. There's a much bigger safety net here, which significantly lowers the risk of zero-day fatal flaws in a new implementation and of flaws in future modifications and enhancements.

Hence, IMO, the first thing to do is implement and make the generic file encryption support solid and robust, not tack it on as an afterthought for the magic btrfs encryption pixies to take care of. Indeed, with the generic file encryption, btrfs may not even need the special subvolume encryption pixies. i.e.
you can effectively implement subvolume encryption via configuration of a multi-user encryption key for each subvolume and apply it to the subvolume tree root at creation time. Then only users with permission to unlock the subvolume key can access it.

Once the generic file encryption is solid and fulfils the needs of most users, then you can look to solving the less common threat models that neither dmcrypt nor per-file encryption address. Only if the generic code cannot be expanded to address specific threat models should you then implement something that is unique to btrfs.

Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH v6 1/6] fstests: common: Introduce _post_mount_hook for btrfs
On Wed, Sep 14, 2016 at 09:55:22AM +0800, Qu Wenruo wrote:
> Introduce _post_mount_hook(), which will be executed after mounting
> scratch/test.
>
> It's quite useful for fs(OK, only btrfs yet, again) which needs to
> use ioctl other than mount option to enable some of its feature.

Just implement a _btrfs_mount() function (similar to _overlay_mount()) to do btrfs specific things at mount time.

> Signed-off-by: Qu Wenruo <quwen...@cn.fujitsu.com>
> ---
> common/rc | 23 +++
> 1 file changed, 23 insertions(+)
>
> diff --git a/common/rc b/common/rc
> index 23c007a..631397f 100644
> --- a/common/rc
> +++ b/common/rc
> @@ -321,6 +321,27 @@ _overlay_scratch_unmount()
> $UMOUNT_PROG $SCRATCH_MNT
> }
>
> +_run_btrfs_post_mount_hook()
> +{
> +	mnt_point=$1
> +	for n in $ALWAYS_ENABLE_BTRFS_FEATURE; do

What's this magic, undefined, undocumented variable?

Cheers, Dave. -- Dave Chinner da...@fromorbit.com
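A rough sketch of the _btrfs_mount() shape being suggested (hypothetical: no such helper exists in common/rc, and the post-mount step is a placeholder for whatever feature ioctls the patch wanted to run):

```shell
# Hypothetical _btrfs_mount(), modelled on _overlay_mount(): do the
# mount, then run btrfs-specific post-mount setup for features that can
# only be enabled by ioctl on an already-mounted filesystem.
_btrfs_mount()
{
	local dev=$1
	local mnt=$2

	_mount -t btrfs $MOUNT_OPTIONS $dev $mnt || return $?

	# post-mount feature setup would go here, e.g. invoking
	# $BTRFS_UTIL_PROG against $mnt for each configured feature
}
```

This keeps the btrfs-specific behaviour inside a btrfs-named mount wrapper instead of a generic hook run for every filesystem type.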
Re: [PATCH 2/3] writeback: allow for dirty metadata accounting
On Mon, Sep 12, 2016 at 10:56:04AM -0400, Josef Bacik wrote:
> I think that looping through all the sb's in the system would be
> kinda shitty for this tho, we want the "get number of dirty pages"
> part to be relatively fast. What if I do something like the
> shrinker_control only for dirty objects. So the fs registers some
> dirty_objects_control, we call into each of those and get the counts
> from that. Does that sound less crappy? Thanks,

Hmmm - just an off-the-wall thought on this.

If you're going to do that, then why wouldn't you simply use a "shrinker" to do the metadata writeback rather than having a hook to count dirty objects to pass to some other writeback code that calls a hook to write the metadata? That way filesystems can also implement dirty accounting and "writers" for each cache of objects they currently implement shrinkers for. i.e. just expanding shrinkers to be able to "count dirty objects" and "write dirty objects" so that we can tell filesystems to write back all their different metadata caches proportionally to the size of the page cache and its dirty state.

The existing file data and inode writeback could then just be new generic "superblock shrinker" operations, and the fs could have its own private metadata writeback similar to the private sb shrinker callout we currently have...

And, in doing so, we might be able to completely hide memcg from the writeback implementations similar to the way memcg is completely hidden from the shrinker reclaim implementations...

Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH 2/3] writeback: allow for dirty metadata accounting
On Mon, Sep 12, 2016 at 10:24:12AM -0400, Josef Bacik wrote:
> Dave your reply got eaten somewhere along the way for me, so all I
> have is this email. I'm going to respond to your stuff here.

No worries, I'll do a 2-in-1 reply :P

> On 09/12/2016 03:34 AM, Jan Kara wrote:
> >On Mon 12-09-16 10:46:56, Dave Chinner wrote:
> >>On Fri, Sep 09, 2016 at 10:17:43AM +0200, Jan Kara wrote:
> >>>On Mon 22-08-16 13:35:01, Josef Bacik wrote:
> >>>>Provide a mechanism for file systems to indicate how much dirty metadata they
> >>>>are holding. This introduces a few things
> >>>>
> >>>>1) Zone stats for dirty metadata, which is the same as the NR_FILE_DIRTY.
> >>>>2) WB stat for dirty metadata. This way we know if we need to try and call into
> >>>>the file system to write out metadata. This could potentially be used in the
> >>>>future to make balancing of dirty pages smarter.
> >>>
> >>>So I'm curious about one thing: In the previous posting you have mentioned
> >>>that the main motivation for this work is to have a simple support for
> >>>sub-pagesize dirty metadata blocks that need tracking in btrfs. However you
> >>>do the dirty accounting at page granularity. What are your plans to handle
> >>>this mismatch?
> >>>
> >>>The thing is you actually shouldn't miscount by too much as that could
> >>>upset some checks in mm checking how much dirty pages a node has directing
> >>>how reclaim should be done... But it's a question whether NR_METADATA_DIRTY
> >>>should be actually used in the checks in node_limits_ok() or in
> >>>node_pagecache_reclaimable() at all because once you start accounting dirty
> >>>slab objects, you are really on a thin ice...
> >>
> >>The other thing I'm concerned about is that it's a btrfs-only thing,
> >>which means having dirty btrfs metadata on a system with different
> >>filesystems (e.g. btrfs root/home, XFS data) is going to affect how
> >>memory balance and throttling is run on other filesystems. i.e.
it's > >>going ot make a filesystem specific issue into a problem that > >>affects global behaviour. > > > >Umm, I don't think it will be different than currently. Currently btrfs > >dirty metadata is accounted as dirty page cache because they have this > >virtual fs inode which owns all metadata pages. Yes, so effectively they are treated the same as file data pages w.r.t. throttling, writeback and reclaim > >It is pretty similar to > >e.g. ext2 where you have bdev inode which effectively owns all metadata > >pages and these dirty pages account towards the dirty limits. For ext4 > >things are more complicated due to journaling and thus ext4 hides the fact > >that a metadata page is dirty until the corresponding transaction is > >committed. But from that moment on dirty metadata is again just a dirty > >pagecache page in the bdev inode. Yeah, though those filesystems don't suffer from the uncontrolled explosion of metadata that btrfs is suffering from, so simply treating them as another dirty inode that needs flushing works just fine. > >So current Josef's patch just splits the counter in which btrfs metadata > >pages would be accounted but effectively there should be no change in the > >behavior. Yup, I missed the addition to the node_pagecache_reclaimable() that ensures reclaim sees the same number or dirty pages... > >It is just a question whether this approach is workable in the > >future when they'd like to track different objects than just pages in the > >counter. I don't think it can. Because the counters directly influences the page lru reclaim scanning algorithms, it can only be used to account for pages that are in the LRUs. Other objects like slab objects need to be accounted for and reclaimed by the shrinker infrastructure. Accounting for metadata writeback is a different issue - it could track slab objects if we wanted to, but the issue is that these are often difficult to determine the amount of IO needed to clean them so generic balancing is hard. 
(e.g. effect of inode write clustering).

> +1 to what Jan said. Btrfs's dirty metadata is always going to
> affect any other file systems in the system, no matter how we deal
> with it. In fact it's worse with our btree_inode approach as the
> dirtied_when thing will likely screw somebody and make us skip
> writing out dirty metadata when we want to.

XFS takes care of metadata flushing with periodic background work controlled by /proc/sys/fs/xfs/xfssyncd_centisecs. We t